Predicting Rents in Helsinki Using Machine Learning Models

Khaled Alwithinani
Aug 25, 2020
11 min read

Updated: Oct 16, 2021

Introduction:

This project aims to find suitable predictive models for rental data from Helsinki. The data includes each flat's monthly rent, size, number of rooms, house type, whether it has a balcony or sauna, city, address and coordinates. I collected the data by scraping a website with Python and then geocoded each address to get its coordinates. The variable to predict is the Monthly Rental (MR) of new flats, so I train the predictive models on the data I scraped. I use ArcGIS to run the analysis and to find the desired areas of the city, which I then use to geoenrich the data. The models used are Generalized Linear Regression (GLR), Geographically Weighted Regression (GWR) and Forest-Based Classification and Regression (FBCR). I found that the FBCR model predicted the MR with an accuracy of 66%, while the GWR model had an accuracy of 62%. However, the GWR gave more reasonable predictions for flats with monthly rents over 2000 euros. On this page I explain the process behind these findings, starting with the data source.

Data Source:

I obtained the data by scraping Vuokraovi, an established Finnish website for finding rental properties. Each listing includes the address, size, monthly rent, number of rooms and other facilities. It is a well-recognised website that had recorded 4.5M visits by July 2020, according to www.semrush.com (see https://www.semrush.com/analytics/traffic/overview/www.vuokraovi.com?from_lp=1&sm2workspace=seo). Below you will see an image of the data at its first stage, after running the scraping code on the website. The code translates the information for each flat into a readable format that can be used as a dataframe.

After scraping the data, I used Python to geocode each address in my data. Geocoding is the process of transforming an address into spatial data, that is, the geographical coordinates of that address. Below you can see an example of this process.
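As a rough illustration of the geocoding step, the sketch below attaches coordinates to a list of addresses. The original geocoding code is not shown in this post, so `geocode_fn` is a placeholder for whichever geocoding service is used, and the fake lookup table with its approximate coordinates is purely hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in WGS 84


def geocode_addresses(
    addresses: Iterable[str],
    geocode_fn: Callable[[str], Optional[Coordinate]],
) -> Dict[str, Optional[Coordinate]]:
    """Look up each distinct address once and return address -> coordinates.

    geocode_fn is an assumed callable wrapping a real geocoding service;
    it should return None when an address cannot be resolved.
    """
    results: Dict[str, Optional[Coordinate]] = {}
    for address in addresses:
        key = " ".join(address.split())  # normalise repeated whitespace
        if key not in results:  # skip duplicates to avoid repeated lookups
            results[key] = geocode_fn(key)
    return results


# Hypothetical stand-in for a real geocoder; the coordinates are illustrative.
_FAKE_LOOKUP = {"Mannerheimintie 1, Helsinki": (60.1690, 24.9400)}


def fake_geocoder(address: str) -> Optional[Coordinate]:
    return _FAKE_LOOKUP.get(address)


coords = geocode_addresses(
    [
        "Mannerheimintie 1, Helsinki",
        "Mannerheimintie  1, Helsinki",  # same address with extra whitespace
        "Unknown street 99",
    ],
    fake_geocoder,
)
```

Addresses that cannot be resolved stay in the result with a value of None, so they can be reviewed by hand rather than silently dropped.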

I then formatted the data to suit my needs using Python and Google Sheets. Below you will see a screenshot of the table used for this project.

I uploaded the table to ArcGIS. To display the location of the flats on the map, I applied the World Geodetic System 1984 (WGS 84) as the coordinate system, because this is the coordinate system that was used when geocoding each address. Below you can see a map of each flat's location.

Methods and results:

To find possible correlations between the variables, I created a scatter plot matrix (see below). The dependent variable for my project is the MR, so the plot shows the relationship between this variable and the independent variables. There are positive relationships between the MR and size, number of rooms, balcony and sauna. The strongest positive relationship is between size and MR, with an R-squared (R2) of 0.41. The number of rooms comes second with an R2 of 0.11, which is weak. The other relationships are very weak, with values of 0.03 or less. The closer the R2 is to 1, the stronger the relationship. I will introduce these other variables to the Forest-Based Classification and Regression (FBCR) model later on to test whether they have any impact on its predictive power.
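To make the R2 values concrete, here is a minimal sketch of how R2 can be computed for a single explanatory variable with an ordinary least-squares line. The sizes and rents are made-up toy numbers, not values from the scraped data.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / ss_tot


# Toy data (hypothetical): flat sizes in square metres and monthly rents in euros.
sizes = [30, 45, 60, 75, 90]
rents = [800, 1000, 1150, 1500, 1600]
r2_size = r_squared(sizes, rents)
```

A value near 1 means the line explains most of the variation in rent, while a value such as 0.11 means it explains very little.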

Generalized Linear Regression (GLR)

I started by running a GLR model to see if the R2 would show any improvement. It is important to note that the value to look at is not the Multiple R2 (MR2) but the Adjusted R2 (AR2), which accounts for the number of explanatory variables in the model. See the image below.
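The difference between the two values can be sketched with the standard adjusted R2 formula. The sample size of 1000 used below is a hypothetical figure chosen for illustration, not the size of the scraped dataset.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R2: penalises R2 for the number of explanatory variables."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Hypothetical example: R2 of 0.41 with one predictor and 1000 flats.
ar2_one = adjusted_r_squared(0.41, n_samples=1000, n_predictors=1)
# The same R2 with ten predictors is penalised more heavily.
ar2_ten = adjusted_r_squared(0.41, n_samples=1000, n_predictors=10)
```

With many samples and few predictors the adjustment is small, which is why the AR2 stays close to the plain R2 in a one-variable model.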

After running the GLR model, there was no improvement in the AR2 compared to the base R2 between monthly rentals and size displayed in the scatter plot matrix. See the image above. However, on the map, we can now see the areas where the monthly rent was underestimated or overestimated. Underestimation means that the actual monthly rent of a flat is higher than the model predicted, and overestimation means the opposite. This could mean that there is another factor influencing rent that hasn't been considered when running the model. Looking at the map below, the red-coloured flats represent an underestimation of the monthly rents and the purple-coloured flats represent an overestimation.

The map above clearly shows clusters of underestimated predictions in the centre of Helsinki and in a small area in the eastern part of the city called Kalasatama. Both places are known to be expensive, and both are circled on the map so that they can be seen.

Finding spatial relationships between variables

In this section, I use an ArcGIS tool known as Local Bivariate Relationships (LBR). This tool discovers spatial relationships using an entropy-based approach. The approach tests whether there is a relationship to destroy by randomising the data: if there is a significant relationship, randomising the data will increase the entropy considerably. I used a neighbourhood size of 100 flats; the neighbourhood should be large enough to capture a significant relationship between variables. The main reason for using this tool is to find out whether this data can be spatially modelled or not.

In the image above, the Positive Linear category shows 74%, which means that for 74% of the data there is a positive linear relationship between Size and MR. This indicates that the data can be spatially modelled, and in this case we can use Geographically Weighted Regression (GWR) to model the spatial relationship between Size and MR at a neighbourhood size of 100 flats.

Geographically Weighted Regression (GWR)

GWR lets you evaluate a local model of a variable or process you want to understand or predict. It fits a regression equation to every feature in the dataset, constructing these separate equations from the dependent and explanatory variables of the features within the neighbourhood of each target feature. The shape and extent of each neighbourhood are determined by the Neighborhood Type and Neighborhood Selection Method parameters.
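The idea of fitting one regression per feature can be sketched in a few lines of plain Python. This is not the ArcGIS implementation: it assumes planar coordinates, a fixed number of neighbours `k`, a Gaussian distance weighting and a single explanatory variable, and the toy flats are invented so that two distant clusters have different rent-per-size slopes.

```python
import math


def local_slopes(points, sizes, rents, k):
    """For each flat, fit rent = a + b * size on its k nearest flats.

    points: (x, y) coordinates, assumed planar (for example metres).
    Returns the local slope b for every flat, in input order.
    """
    slopes = []
    for px, py in points:
        # Distance from the target flat to every flat, including itself.
        dists = [math.hypot(px - qx, py - qy) for qx, qy in points]
        nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
        # Bandwidth: distance to the k-th neighbour (1.0 if all coincide).
        bandwidth = dists[nearest[-1]] or 1.0
        # Gaussian kernel: closer flats get more weight.
        weights = [math.exp(-0.5 * (dists[i] / bandwidth) ** 2) for i in nearest]
        total = sum(weights)
        mean_x = sum(w * sizes[i] for w, i in zip(weights, nearest)) / total
        mean_y = sum(w * rents[i] for w, i in zip(weights, nearest)) / total
        sxx = sum(w * (sizes[i] - mean_x) ** 2 for w, i in zip(weights, nearest))
        sxy = sum(
            w * (sizes[i] - mean_x) * (rents[i] - mean_y)
            for w, i in zip(weights, nearest)
        )
        # Weighted least-squares slope; undefined if all local sizes are equal.
        slopes.append(sxy / sxx if sxx > 0 else float("nan"))
    return slopes


# Toy data (hypothetical): two clusters with different rent-per-size slopes.
points = [(0, 0), (10, 0), (0, 10), (1000, 1000), (1010, 1000), (1000, 1010)]
sizes = [30, 50, 70, 30, 50, 70]
rents = [800, 1000, 1200, 1100, 1500, 1900]
slopes = local_slopes(points, sizes, rents, k=3)
```

In this toy example the first cluster has a local slope of about 10 euros per square metre and the second about 20, which is the kind of spatial variation a single global regression would average away.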

After I ran the GWR model, the image above shows a significant improvement in the AR2, from 0.41 to 0.62, meaning that this model has a predictive power of 0.62. Through this model, we can also identify desired areas in the city.

The map above shows a raster surface created by the GWR model. The circled values represent the slope, that is, the local regression coefficient of size. All slope values are positive, which implies that GWR modelled a positive relationship between size and MR everywhere. To identify desired areas, we need to look again at the circled values: the higher the slope, the more desired the area. The slope here is not a physical slope but a measure of how strongly MR responds to size in an area. In other words, in areas with a higher slope, shown in red, a small change in the size of the flat corresponds to a greater increase in its MR. As anticipated from the GLR model, the centre of the city has a high slope on the map, which means that the centre of Helsinki is a desired area, alongside the other areas shown in dark red.

I created polygons around the areas that showed the highest slope values, and these polygons are used to geoenrich my data. They are used to calculate the distance between each flat and the desired areas. This distance is then used as an input to the Forest-Based Classification and Regression (FBCR) model, so that this time the desired areas are considered when predicting the MR of the flats.
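A straight-line distance feature of this kind can be approximated outside ArcGIS with the haversine formula. The sketch below is a simplification: it measures to one representative point per desired area rather than to the polygon itself, and the area coordinates are rough illustrative values, not the polygons used in this project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS 84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_area(flat, area_points):
    """Distance from a flat (lat, lon) to the closest desired-area point."""
    return min(haversine_km(flat[0], flat[1], lat, lon) for lat, lon in area_points)


# Hypothetical representative points for two desired areas (approximate).
desired_areas = [(60.170, 24.940), (60.187, 24.977)]

# A hypothetical flat located exactly at the first area point.
d_centre = distance_to_nearest_area((60.170, 24.940), desired_areas)
# A hypothetical flat further north.
d_north = distance_to_nearest_area((60.250, 24.940), desired_areas)
```

Because this is a straight-line measure, it ignores the street network, so real travel distances will generally be longer.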

Forest-Based Classification and Regression (FBCR)

Unlike GLR and GWR, this model is not impacted by multicollinearity, because it is not a linear model. It is capable of modelling the relationships between a large number of predictor variables, whether spatial or non-spatial. FBCR builds many decision trees, each on a random subset of the data, and every tree makes a prediction, referred to as a vote. For regression, it then averages the votes to produce the final prediction. The randomness of the subsetting means that FBCR results will vary slightly in accuracy from run to run. The model also lets us calculate the uncertainty of the MR predictions and detect the most important variables in our data.
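The voting idea can be illustrated with a deliberately small sketch. It is not the ArcGIS algorithm: each "tree" here is a single-split stump on one variable (size), trained on a bootstrap sample, whereas a real random forest grows deeper trees and also samples the variables. The data is invented.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    # The largest x is skipped as a threshold because it would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to the overall mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def fit_forest(xs, ys, n_trees, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def forest_votes(forest, x):
    """One prediction (vote) per tree."""
    return [left if x <= t else right for t, left, right in forest]


def forest_predict(forest, x):
    """Final prediction: the average of all votes."""
    votes = forest_votes(forest, x)
    return sum(votes) / len(votes)


# Toy data (hypothetical): sizes in square metres and monthly rents in euros.
sizes = [25, 30, 35, 40, 80, 90, 100, 110]
rents = [700, 750, 800, 850, 1600, 1700, 1800, 1900]
forest = fit_forest(sizes, rents, n_trees=50)
small_flat = forest_predict(forest, 30)
large_flat = forest_predict(forest, 100)
```

Changing the `seed` changes the bootstrap samples and therefore the votes, which is the source of the run-to-run variation described above.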

The FBCR model allows you to check the importance of the variables in terms of their predictive effect on the dependent variable. I included variables that could have an impact on the MR: house type, the number of rooms, whether the flat has a sauna or balcony, and the three most desired areas identified using the GWR (see the image below under "Top Variable Importance" for more details). The box plot above shows the importance of each variable in terms of its influence on the predicted MR. The house type, sauna and balcony variables have a low impact, the desired areas variables have a decent impact and the size variable has the highest impact.

I decided to remove the variables with the lowest importance values from the model because they can reduce the accuracy of the prediction. I removed the balcony and sauna variables but kept house type even though its importance was low, because when I tested the model, the most accurate prediction was obtained with the desired areas, number of rooms, house type and size included. The house type improved the prediction from an AR2 of 0.64 to 0.67; see the image above under Validation Data.

Although the house type variable is of low importance, including it resulted in a slight improvement in the prediction of the flats’ MR. There might be a stronger relationship between house types and price that the model couldn’t identify as important. Looking at the image above, the two Mean Squared Error (MSE) values remain almost the same. This indicates that the model has enough trees and has converged to its maximum accuracy.

The chart above presents the uncertainty bounds of the prediction around the actual prediction, depicted by the blue line. The uncertainty bounds widen gradually as the MR increases until the MR reaches 2000 euros. After this point, they widen rapidly. This is due to the small sample size for expensive flats. This chart is a good representation of the uncertainties related to the predictions made by the trained model.

Spatial distribution of uncertainty

In this section I identify the spatial distribution of the uncertainty. The model returns the 5th percentile (P05) and the 95th percentile (P95), which represent a lower and an upper estimate of the MR. The 50th percentile (P50), on the other hand, represents the predicted value. To calculate the uncertainty, I use the formula Uncertainty = (P95 - P05) / P50. This shows how wide the model's uncertainty window is relative to the magnitude of the prediction.
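The formula is simple enough to sketch directly. The percentile helper below assumes the percentiles are taken over a set of individual predictions (for example one per tree) using linear interpolation, which may differ from how ArcGIS computes them, and the rent values are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation, q between 0 and 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def relative_uncertainty(p05, p50, p95):
    """Width of the 90% prediction window relative to the predicted value."""
    return (p95 - p05) / p50


# Hypothetical flat with a narrow window around a 1000 euro prediction.
cheap_flat = relative_uncertainty(p05=900, p50=1000, p95=1150)
# Hypothetical expensive flat with a much wider window.
expensive_flat = relative_uncertainty(p05=1800, p50=2500, p95=3600)

# Median of a small set of hypothetical per-tree predictions.
median_vote = percentile([950, 1000, 1020, 1100, 1200], 50)
```

Dividing by P50 makes flats at different price levels comparable: the expensive flat above has a wider window not only in euros but also relative to its predicted rent.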

To find out which areas on the map respond most to random changes in the training data, I used the Optimized Hot Spot Analysis tool. I passed the calculated uncertainty of the FBCR into this tool so that it generates a map of hot spots and cold spots, showing the location of flats that have high and low uncertainties after prediction. See the map below.

The map above shows a hot spot cluster in the centre of Helsinki, represented in red, which means that the predicted MR in this area is more sensitive to random changes in the training data. For example, a small change in the size of a flat will correspond to a large change in its MR. The flats shown in blue form a cold spot, which is less sensitive to random changes because it lies in a less desired area of the city. This also means that the uncertainty for these flats is lower than for those located in the centre of Helsinki. A small change in the size of flats in the cold spot will not correspond to a large change in the flat price.

Even though the accuracy of the model improved after introducing the 3 most desired areas in Helsinki, the model still shows some uncertainty in those areas. This could be because the distance calculated in the model did not consider the street layout but was instead measured in a straight line. This limits the model: it most likely shortens the distance between the flats and the desired areas and makes the flats look closer than they actually are. A possible solution is to use service area analysis, which considers roads and streets when measuring distance; however, this goes beyond the scope of this project. This analysis and the buffer analysis were discussed in another project with the title "Network Analysis (BUFFER AND SERVICE AREAS)".

Models comparison between GWR and FBCR

So far, the 2 models have performed quite similarly in terms of their predictive power, and both are considered generalisable, which means that they can also predict data points they have not seen with comparable accuracy. However, the FBCR had a slightly higher predictive power than the GWR: the GWR had an AR2 of 0.62 and the FBCR had an AR2 of 0.67. Let's test how they both perform when predicting the MR of new flats in Helsinki.

Looking at the map above we can see the location of the new flats. These are the flats for which the trained models will try to predict MR. The prediction will help us make a better decision, for example, of where to invest or rent, because we will have some insights into the cost estimation before the MR is announced.

Histogram 1 shows the GWR distribution of predicted MR

Histogram 2 shows the FBCR distribution of predicted MR

The MR ranges and average values of the FBCR and GWR histograms are very similar (see Histograms 1 and 2). The average predicted MR of the new flats is around 1500 euros for both models. The upper limits of the MR at the right end of the histograms are 3350 euros for GWR and 4118 euros for FBCR. The GWR estimate of the upper limit is more reasonable because GWR assigns values taking the neighbourhood into account. This suggests that flats with an MR over 2000 euros are predicted more plausibly by the GWR than by the FBCR, even though the FBCR had a slightly higher predictive power overall.

The flats in the Helsinki dataset are pre-existing but don't have a condition rating, which means that the condition of the flat is not considered in the model. The condition of the flats will most likely affect the MR, so the new flats might have a higher MR than predicted because of their new condition. This fact contributes to the uncertainties of the FBCR model.

Conclusions

Neither method is inherently superior when comparing FBCR with GWR. They serve different needs of evaluation or valuation. For the current dataset, GWR might be better suited to capturing spatial variation in the monthly rents (MR), and it can also work well when developing a local model for the MR. The upper limit of the GWR predictions is lower than that of the FBCR predictions, as displayed in Histograms 1 and 2, which indicates that the GWR is better at predicting the MR for flats with an MR of over 2000 euros. However, because of multicollinearity, we cannot use as many different variables as predictors for GWR as we did with the FBCR.

In the FBCR, by comparison, the other variables of the new flats were considered when predicting the MR, which resulted in slightly higher monthly rents (MR) than in the GWR model. This may be because the model considered other factors such as house type or the number of rooms. The uncertainty analysis of the FBCR model shows that flats with rents of 2000 euros or more have higher uncertainty than flats below that price. This is caused by the smaller number of data points for flats with a high MR.

Perhaps I also needed to add more variables to the model to substantially increase the accuracy of the prediction, which could offset the high uncertainty for flats with an MR over 2000 euros. This would require more time to scrape more data, which goes beyond the scope of this project and what it is intended to convey.

In the map presented below, you can see the predicted MR of the new flats. I decided to go with the GWR to predict the MR. Even though its AR2 value was slightly lower than that of the FBCR, the GWR gave a better estimate of the upper limit of the rent. Both models managed to estimate the monthly rent with an accuracy between 62% and 67% based on their AR2. The best practice would be to use both to get more insights into the data, but for the sake of choosing the most suitable model for my data, I decided to present the map produced by the GWR model.