
Predicting House Prices

Ashley O’Mahony | ashleyomahony.com | May 2019



King County House Sales Dataset

The King County House Sales dataset contains the house sales around Seattle, in King County (USA), between May 2014 and May 2015. The original dataset can be found on Kaggle.

All the files of this project are saved in a GitHub repository.

The dataset consists of:

  • a Train Set with 17,277 observations, each with 19 house features, the ID, and the price.
  • a Test Set with 4,320 observations with 19 house features and the ID. The price column will be added to the Test Set, filled with NAs, to ease the pre-processing stage.

This map shows where the properties listed in the dataset are located:

This project aims to predict house prices based on their features.



Packages

This analysis requires these R packages:

  • Data Manipulation: data.table, dplyr, tibble, tidyr

  • Plotting: corrplot, GGally, ggmap, ggplot2, grid, gridExtra

  • Machine Learning: caret, dbscan, glmnet, leaderCluster, MLmetrics, ranger, xgboost

  • Multithreading: doMC, doParallel, factoextra, foreach, parallel

  • Reporting: kableExtra, knitr, RColorBrewer, shiny, and… beepr.

These packages are installed if necessary and loaded by the main script.
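
A minimal sketch of how this conditional install-and-load step could look (the package vector is abbreviated here for illustration):

```r
# Install any missing packages, then load everything (abbreviated package list)
required <- c("data.table", "dplyr", "ggplot2", "caret", "ranger", "xgboost", "dbscan", "glmnet")
missing  <- setdiff(required, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(required, library, character.only = TRUE))
```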



Data Loading

The dataset is pretty clean, with no missing values in either the Train or the Test set. A quick look at the id values also shows that only a few houses were sold more than once over the period, so it doesn't seem relevant to consider the id and the date for this analysis. We will thus focus on the house features to predict the price.

## [1] "0 columns of the Train Set have NAs."
## [1] "0 columns of the Test Set have NAs."
## [1] "103 houses sold more than once on the period, for a total of 17173 houses in the Train Set (0.006%)."
## [1] "8 houses in the Test Set sold more than once on the period, for a total of 4312 houses in the Test Set (0.002%)."

Some of the features can be converted into factors, as they are not continuous and help categorize the houses:

  • waterfront is a boolean, indicating if the house has a view to a waterfront.
  • view indicates the number of times the house has been viewed. As it only takes a handful of values (0 to 4), this feature can be factorized.
  • condition is a score from 1 to 5, indicating the state of the house.
  • grade is another score, with 11 levels.
  • bedrooms is typically used to categorize houses.

The package caret automatically one-hot encodes the categorical features when fitting a model. Due to their number of unique values, these features won’t be factorized at first:

  • bathrooms could also be used to categorize the houses, although it is not clear why the data shows decimal numbers.
  • zipcode is a cluster-like category of the house location.

The impact of processing them as factors will be tested with a Linear Regression.
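
As an illustration, the factor conversion could be written as below (the exact code used in the report is not shown, so this is only a sketch):

```r
# Convert the low-cardinality categorical features to factors
factor_cols <- c("waterfront", "view", "condition", "grade", "bedrooms")
train[factor_cols] <- lapply(train[factor_cols], as.factor)
test[factor_cols]  <- lapply(test[factor_cols], as.factor)

# Alternative versions also treat bathrooms and zipcode as factors ("All Fact" models)
# train[c("bathrooms", "zipcode")] <- lapply(train[c("bathrooms", "zipcode")], as.factor)
```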



Exploratory Data Analysis

The target of this analysis is the price of the houses. As is typical of the real estate market, the price variable is highly right-skewed, with its maximum value being more than 10 times larger than the 3rd quartile:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78000  320000  450000  539865  645500 7700000


These high-value properties cannot be removed from the dataset, as they are not anomalies but examples of the luxury market in the region. By transforming the price with a base-10 logarithm, it is possible to rebalance the distribution. This transformation could be used later to try to improve our models' performance.
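
The transformation itself is a one-liner; predictions made on the log scale are simply mapped back to dollars with 10^x:

```r
# Rebalance the right-skewed target with a base-10 logarithm
train$price_log <- log10(train$price)

# To report metrics in dollars, predictions on the log scale are back-transformed with 10^x
```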


The Correlation Matrix between the numerical features of the Train Set indicates some correlation between the price and:

  • sqft_living: more living area logically implies a higher price.
  • sqft_living15: the living area in 2015 is also linked to the current price. Recent renovations have modified the living area, but it is likely that the house stays in a similar price range.
  • sqft_above: by definition, the area above ground is related to the living area, so it is no surprise to see a correlation with the price as well.
  • bathrooms: interestingly, the number of bathrooms seems to slightly correlate with the price.

Some other relationships can be identified:

  • the sqft_ features are linked to each other, as sqft_living = sqft_above + sqft_basement, and lot areas usually do not change much from one year to the next.
  • the zipcode and the longitude are quite correlated, maybe due to the way zipcodes have been assigned in the region.


This second Correlation Matrix provides another view of the relationships between features, as well as their individual distributions.


The following charts help in understanding the structure of each factor feature:

  • A majority of houses in the region have 3 or 4 bedrooms. An overall trend shows that more bedrooms mean higher prices.
  • Most of the houses in the region are in an average condition 3, with 26% of the houses being in a great condition 4, and 8% in an exceptional condition 5.
  • The real estate market in the region seems to offer almost as many houses with two floors as with one floor.
  • The grade distribution is similar to that of condition (with more levels), but its relation to price looks clearer.
  • Houses with higher view values seem to be in a higher price range.
  • Houses with a view of the waterfront are clearly more expensive than the ones without.


The following charts describe the numerical features and provide interesting insights:

  • The number of bathrooms is difficult to interpret without knowing the meaning of the decimals, but it seems that the majority of houses have values of 2.5 and 1.
  • The latitude and longitude distributions suggest a higher number of properties in the North-West of the region.
  • The floor space is also right-skewed, with a large portion of the houses having no basement.
  • Looking at the distribution of yr_built, we can observe a regular increase in real estate projects since the early years of the 20th century. This trend can certainly be related to the population growth and the urban development of the region over the years.
  • With only 739 properties with renovation records, the yr_renovated feature is extremely right-skewed. Considering only the properties with records, renovations span from 1934 to 2015, with higher numbers from the 1980s onwards and a big wave of renovations in 2013 and 2014.
  • Houses in the dataset seem quite well distributed across the zipcode range, although it is difficult to comment on this feature without a good knowledge of the region and the structure of the zipcode system.



Data Preparation

As the Test Set doesn't contain the feature price, it is necessary to randomly split the Train Set into two parts, with an 80/20 ratio:

  • Train Set A, which will be used to train our models.
  • Train Set B, which will be used to test the models and validate their performance.

These datasets are centered and scaled, using the preProcess function of the package caret. These transformations should improve the performance of linear models.

Copies of these datasets are prepared with the features bathrooms and zipcode set as factors, and other copies with the price transformed with log10. These versions of the datasets will allow us to test the response of our models to different inputs.
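
A sketch of this preparation step with caret, assuming the cleaned Train Set is in train; the seed, the object names, and the decision to leave price on its original scale are assumptions:

```r
library(caret)

set.seed(42)  # assumed seed, for reproducibility
idx     <- createDataPartition(train$price, p = 0.8, list = FALSE)
train_a <- train[idx, ]
train_b <- train[-idx, ]

# Learn centering/scaling on the predictors of Train Set A, then apply to both sets
pred_cols <- setdiff(names(train_a), c("id", "price", "price_log"))
pp <- preProcess(train_a[pred_cols], method = c("center", "scale"))

train_a_pp <- train_a
train_a_pp[pred_cols] <- predict(pp, train_a[pred_cols])
train_b_pp <- train_b
train_b_pp[pred_cols] <- predict(pp, train_b[pred_cols])
```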



Cross-Validation Strategy

To validate the stability of our models, we will apply a 10-fold cross-validation, repeated 3 times.
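
With caret, this resampling scheme is captured once in a trainControl object and reused for every model:

```r
# 10-fold cross-validation, repeated 3 times
cv_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```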



Baseline

At first, we use the linear regression algorithm to fit models on different dataset configurations (a sketch of one such fit follows the list):

  • The prepared datasets, without additional transformation,
  • The prepared datasets with price being transformed with log10,
  • The prepared datasets with bathrooms and zipcode as factors,
  • The prepared datasets with both transformations.
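
A sketch of one of these baseline fits (the log10 variant); the other configurations only swap the input data. Dropping id and date follows the earlier discussion, and the object names are assumptions:

```r
# Baseline linear regression on the prepared Train Set A
feats <- subset(train_a_pp, select = -c(id, date))
feats$price_log <- log10(feats$price)

lm_log <- train(price_log ~ . - price, data = feats,
                method = "lm", trControl = cv_ctrl)

# Evaluate on Train Set B, back-transforming the predictions to dollars
pred_b <- 10^predict(lm_log, newdata = train_b_pp)
```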

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Baseline Lin. Reg. | 173375.7 | 0.7371931 | 115993.74 | 0.2340575 | 48 | 0.1 |
| Baseline Lin. Reg. Log | 167660.5 | 0.7542339 | 106380.01 | 0.1987122 | 48 | 0.1 |
| Baseline Lin. Reg. All Fact | 132157.4 | 0.8472985 | 85342.24 | 0.1702119 | 143 | 0.3 |
| Baseline Lin. Reg. Log All Fact | 126177.3 | 0.8608053 | 74064.56 | 0.1394403 | 143 | 0.3 |

The two transformations appear to improve the performance of the model, both individually and jointly. The log10 helps by centering the price and reducing the influence of high values. Converting bathrooms and zipcode to factors prevents the model from treating them as ordinal features, which is particularly useful for zipcode, which characterizes house clusters.


We continue by training models with a random forest using the package ranger, then with eXtreme Gradient Boosting (XGBoost), each on the four possible combinations of transformations.
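
The same caret interface can be reused by switching the method argument; a sketch with default tuning (whether the report tuned these baselines further is not shown):

```r
# Random forest via ranger, same data and resampling as the linear baseline
rf_log <- train(price_log ~ . - price, data = feats,
                method = "ranger", trControl = cv_ctrl)

# Gradient boosting via xgboost
xgb_log <- train(price_log ~ . - price, data = feats,
                 method = "xgbTree", trControl = cv_ctrl)
```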

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Baseline Lin. Reg. | 173375.7 | 0.7371931 | 115993.74 | 0.2340575 | 48 | 0.1 |
| Baseline Lin. Reg. Log | 167660.5 | 0.7542339 | 106380.01 | 0.1987122 | 48 | 0.1 |
| Baseline Lin. Reg. All Fact | 132157.4 | 0.8472985 | 85342.24 | 0.1702119 | 143 | 0.3 |
| Baseline Lin. Reg. Log All Fact | 126177.3 | 0.8608053 | 74064.56 | 0.1394403 | 143 | 0.3 |
| Baseline Ranger | 117953.9 | 0.8783576 | 68898.26 | 0.1357399 | 47 | 26.3 |
| Baseline Ranger Log | 122706.6 | 0.8683575 | 69140.95 | 0.1306257 | 47 | 26.0 |
| Baseline Ranger All Fact | 114785.7 | 0.8848043 | 67552.36 | 0.1330712 | 142 | 123.5 |
| Baseline Ranger Log All Fact | 118612.8 | 0.8769947 | 67751.38 | 0.1281304 | 142 | 122.1 |
| Baseline XGBoost | 116039.2 | 0.8822746 | 71617.33 | 0.1427542 | 47 | 11.8 |
| Baseline XGBoost Log | 110893.0 | 0.8924851 | 67756.69 | 0.1307163 | 47 | 11.1 |
| Baseline XGBoost All Fact | 110899.7 | 0.8924722 | 69604.63 | 0.1394589 | 142 | 33.5 |
| Baseline XGBoost Log All Fact | 109891.8 | 0.8944177 | 66483.14 | 0.1287528 | 142 | 29.3 |

These results confirm our previous conclusions:

  • the log10 transformation has a great impact on the model performance. It will be kept for the next models.
  • the class change of bathrooms and zipcode has a lower impact, but still improves the results.

Looking at the algorithm performances, Ranger obtains slightly better MAPE scores than XGBoost, but its training time is significantly longer, so we will continue with XGBoost only.



Feature Engineering

House Age and Renovation

A few features can be developed from the yr_built and yr_renovated of the house (a sketch follows the list):

  • house_age: the difference between yr_built and 2015 gives the age of the house.
  • house_age_since_renovation: the difference between yr_renovated and 2015 indicates how many years have passed since the last renovation.
  • renovated: a boolean flag indicating whether the house has been renovated at least once.
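
A sketch of how these features could be derived, using 2015 (the last year in the data) as the reference; how houses without renovation records are encoded is an assumption:

```r
# Age-related features (shown for Train Set A; the same applies to the other sets)
train_a$house_age <- 2015 - train_a$yr_built
train_a$renovated <- as.factor(ifelse(train_a$yr_renovated > 0, 1, 0))
train_a$house_age_since_renovation <- ifelse(train_a$yr_renovated > 0,
                                             2015 - train_a$yr_renovated,
                                             train_a$house_age)  # fallback for unrenovated houses (assumed)
```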

Clusters

Another element which might influence the price of a house is the dynamism of the market in the area. Some neighborhoods are more in demand than others, either because of price or reputation. It would therefore be interesting to add a feature that captures the density of house sales in the area.

This hexagonal binning map shows the areas with high density of house sales, which are mainly located in the North-West of the region. Some other hotspots can be found in the eastern area too.


To identify areas with a high density of house sales, we will use the clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) from the package dbscan. This algorithm groups points that are closely packed together, and marks as outliers the points that lie alone in low-density regions.

This algorithm requires two parameters:

  • eps (epsilon): defines the radius of the neighborhood around a point x (maximum reachability distance).
  • minPts: defines the minimum number of neighbors within the epsilon radius. If a point x is not surrounded by at least minPts neighbors, it is considered an outlier.

After multiple trials, it appeared that the parameters providing a maximum number of relevant clusters were eps = 0.0114 and minPts = 50. This plot shows which areas have been defined as high-density clusters, using the Train Set A.
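
A sketch of the clustering call with the dbscan package, run on the latitude/longitude coordinates of Train Set A (whether the coordinates were scaled first is an assumption):

```r
library(dbscan)

# Density-based clustering of the sale locations in Train Set A
coords_a <- as.matrix(train_a[, c("lat", "long")])
db <- dbscan(coords_a, eps = 0.0114, minPts = 50)

# Cluster 0 is the noise label; positive integers identify high-density areas
train_a$cluster <- as.factor(db$cluster)
table(db$cluster)
```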


With the clusters now identified on Train Set A, we can associate them with the points of Train Set B. The maps below allow us to compare the results and confirm that the points of Train Set B are assigned to the right clusters.
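
The report does not show how this assignment is done; one simple possibility is a nearest-neighbour lookup against the clustered Train Set A points, sketched below with class::knn:

```r
library(class)

# Give each Train Set B point the cluster label of its nearest Train Set A point
coords_b <- as.matrix(train_b[, c("lat", "long")])
train_b$cluster <- knn(train = coords_a, test = coords_b,
                       cl = train_a$cluster, k = 1)
```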


With these new features, we can train a new model and compare the results with the previous ones.

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Baseline Ranger Log | 122706.6 | 0.8683575 | 69140.95 | 0.1306257 | 47 | 26.0 |
| Baseline Ranger Log All Fact | 118612.8 | 0.8769947 | 67751.38 | 0.1281304 | 142 | 122.1 |
| Baseline XGBoost Log All Fact | 109891.8 | 0.8944177 | 66483.14 | 0.1287528 | 142 | 29.3 |
| FE XGBoost | 119055.1 | 0.8760756 | 67999.89 | 0.1299648 | 190 | 40.9 |


The new model obtains a reasonable result, although it didn’t beat the two best models.



Feature Selection with Lasso and RFE

We can compare the Top 20 feature importances for the 4 best models, and observe some similarities:

  • lat and long appear in all four Top 20s, with lat having a really high importance. This might indicate that house prices depend heavily on the position along the North-South axis.
  • sqft_living and its related features sqft_above and sqft_living15 are also at the top of the list, which is not really surprising.
  • grade values are important, especially the higher ones. This might indicate that higher grades have a significant impact on the house price.
  • yr_built usually appears in the first tier of the Top 20s.
  • As a great indicator of the house location, the zipcode seems to carry some importance. zipcode98004 and zipcode98112 in particular appear in the Top 20. These neighborhoods might have specific characteristics that influence prices.
  • waterfront appears in all four Top 20s; a waterfront view might add considerable value to a property.
  • condition3, which indicates that the house is in average condition, seems to impact the price as well.
  • view4 also has an influence on the price. It could indicate lower prices, due to the difficulty of finding a buyer, or, on the contrary, prices higher than the market. A closer look at the model structures would give us better insights.


Lasso Regression

With 190 features, the model resulting from the Feature Engineering stage is quite complex and would benefit from a reduction in its number of variables. Moreover, only a few parameters seem to have a real importance for the model result. We therefore start the trimming process with a Lasso Regression, which forces coefficients with minor contributions to exactly zero. This gives us a list of features that we can ignore in our model.
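
A sketch of such a selection with glmnet, assuming the feature-engineered Train Set A (including the cluster and age features) is held in a hypothetical data frame feats_fe; picking lambda with cv.glmnet is also an assumption:

```r
library(glmnet)

# One-hot encode the factors, as caret does internally when fitting models
x <- model.matrix(price_log ~ . - price, data = feats_fe)[, -1]
y <- feats_fe$price_log

# Cross-validated Lasso (alpha = 1); coefficients shrunk exactly to zero flag rejected features
lasso_cv <- cv.glmnet(x, y, alpha = 1)
coefs    <- coef(lasso_cv, s = "lambda.1se")
rejected <- rownames(coefs)[as.vector(coefs) == 0]
```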

## [1] "The Lasso Regression selected 188 variables, and rejected 12 variables."

Rejected Features: bedrooms.3, bathrooms.5.5, view.2, renovated.0, bathrooms.3, bathrooms.7.5, condition.4, renovated.1, bathrooms.4.5, waterfront.1, grade.8, cluster.10


We can note that the new feature renovated has been rejected altogether. Large values of bathrooms also don't seem to have any impact on the price.


Recursive Feature Elimination

With the dataset lightened by a few features, we continue the feature selection using RFE (Recursive Feature Elimination), which tests variable importance for models trained on subsets with predefined numbers of variables. This should help reduce the number of variables with a low impact on the model performance.

For this RFE, we use Random Forest algorithms and 5-fold Cross-Validation. This chart shows the RMSE obtained for each iteration of the RFE.

Note: the RMSE hasn’t been rescaled, which explains the difference with the table of model results.
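
A sketch of how this RFE could be set up with caret, assuming the Lasso-filtered Train Set A is in a hypothetical data frame feats_lasso; the subset sizes below are illustrative:

```r
library(caret)

# Recursive Feature Elimination with random forests and 5-fold cross-validation
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

rfe_fit <- rfe(x = subset(feats_lasso, select = -c(price, price_log)),
               y = feats_lasso$price_log,
               sizes = c(10, 20, 30, 40, 50, 75, 100),   # illustrative subset sizes
               rfeControl = rfe_ctrl)

predictors(rfe_fit)  # variables retained by the best-performing subset
```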


Although we can't observe any clear 'knee', which would indicate that many features could be eliminated with little impact on the model performance, we can expect a fair result for subsets of 40 to 60 features. It is difficult to estimate the impact on the final metrics here, as the RMSE shown is computed on the transformed data.

Looking at the variable importances, we can see that four features have a great impact on the model performance, while the importance then decreases slowly, becoming almost insignificant after the 25th feature.


By defining an arbitrary threshold, we can filter out the features with low importance. We decide that features with an overall importance below 0.1% should be eliminated. We end up with a list of 37 features which together account for 96.8% of the variable importance. We can expect a small impact on the model performance, which could be recovered with model fine-tuning.

Selected Features: zipcode.98040, grade.12, grade.5, zipcode.98106, bedrooms.2, bedrooms.4, condition.2, cluster.2, yr_renovated, cluster.0, zipcode.98112, floors.2, cluster.31, condition.3, floors.1, view.4, waterfront.0, grade.11, zipcode.98004, grade.9, grade.6, grade.10, age_renovation, sqft_basement, cluster.1, house_age, yr_built, bathrooms.1, view.0, sqft_lot, sqft_lot15, grade.7, long, sqft_above, sqft_living15, sqft_living, lat


Using this list of selected features, we can re-train an XGBoost model and compare it with previous results.

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Baseline Ranger Log | 122706.6 | 0.8683575 | 69140.95 | 0.1306257 | 47 | 26.0 |
| Baseline Ranger Log All Fact | 118612.8 | 0.8769947 | 67751.38 | 0.1281304 | 142 | 122.1 |
| Baseline XGBoost Log All Fact | 109891.8 | 0.8944177 | 66483.14 | 0.1287528 | 142 | 29.3 |
| FE XGBoost | 119055.1 | 0.8760756 | 67999.89 | 0.1299648 | 190 | 40.9 |
| XGBoost Post RFE | 118315.5 | 0.8776106 | 69053.40 | 0.1323090 | 37 | 9.4 |


The post-RFE model does not perform as well as the previous models, but with only 37 features it still achieves a very good score. We should be able to catch up with some model tuning.



Tuning

After subsetting the features selected by the Lasso and RFE steps, we proceed to model tuning on an XGBoost and a Ranger, to see which one provides the best results.

Starting with the XGBoost, we use a tune grid to find the best combination of parameters.
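
A sketch of such a grid search with caret, assuming the post-RFE training data is in a hypothetical data frame feats_selected; the candidate values are illustrative, and only the combination retained by the search is reported below:

```r
# Candidate hyper-parameter grid for caret's xgbTree method (illustrative values)
xgb_grid <- expand.grid(
  nrounds          = c(300, 600, 900),
  max_depth        = c(3, 4, 6),
  eta              = c(0.05, 0.1),
  gamma            = 0,
  colsample_bytree = 1,
  min_child_weight = c(1, 3),
  subsample        = 1
)

xgb_tuned <- train(price_log ~ ., data = feats_selected,
                   method = "xgbTree", trControl = cv_ctrl,
                   tuneGrid = xgb_grid)
```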


These charts don't reveal any opportunity to easily improve on the parameter combinations we already tested. We therefore keep the best model from this search, which used these parameters:

| nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample |
|---|---|---|---|---|---|---|
| 900 | 4 | 0.1 | 0 | 1 | 3 | 1 |


We now try the Ranger algorithm, which provided really good results before Feature Engineering.


These charts don't reveal any opportunity to easily improve the result. We might actually have pushed the mtry parameter a bit too far. In any case, we keep the best model from this search, which used these parameters:

| mtry | splitrule | min.node.size |
|---|---|---|
| 10 | variance | 1 |


Let’s compare these two new models with the previous best ones.

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Baseline Ranger Log | 122706.6 | 0.8683575 | 69140.95 | 0.1306257 | 47 | 26.0 |
| Baseline Ranger Log All Fact | 118612.8 | 0.8769947 | 67751.38 | 0.1281304 | 142 | 122.1 |
| Baseline XGBoost Log All Fact | 109891.8 | 0.8944177 | 66483.14 | 0.1287528 | 142 | 29.3 |
| FE XGBoost | 119055.1 | 0.8760756 | 67999.89 | 0.1299648 | 190 | 40.9 |
| XGBoost Post RFE | 118315.5 | 0.8776106 | 69053.40 | 0.1323090 | 37 | 9.4 |
| Tuning XGBoost | 112054.8 | 0.8902204 | 64687.33 | 0.1253911 | 37 | 73.3 |
| Tuning Ranger | 121960.2 | 0.8699540 | 69505.40 | 0.1316731 | 37 | 34.2 |


The Tuned XGBoost performs very well, achieving the best MAE and MAPE recorded so far. The Tuned Ranger, on the other hand, is a bit disappointing, as it is actually worse than the Baseline Ranger with log10(price).



Final Model

The best model of this analysis is the Tuned XGBoost, which reaches a MAPE of 0.1253911 with only 37 features.

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Tuning XGBoost | 112054.8 | 0.8902204 | 64687.33 | 0.1253911 | 37 | 73.3 |

The model uses these parameters:

| nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample |
|---|---|---|---|---|---|---|
| 900 | 4 | 0.1 | 0 | 1 | 3 | 1 |


The model needs these features as input:

zipcode.98040, grade.12, grade.5, zipcode.98106, bedrooms.2, bedrooms.4, condition.2, cluster.2, yr_renovated, cluster.0, zipcode.98112, floors.2, cluster.31, condition.3, floors.1, view.4, waterfront.0, grade.11, zipcode.98004, grade.9, grade.6, grade.10, age_renovation, sqft_basement, cluster.1, house_age, yr_built, bathrooms.1, view.0, sqft_lot, sqft_lot15, grade.7, long, sqft_above, sqft_living15, sqft_living, lat


Comparing the predictions of our model with the actual values of Train Set B gives an idea of its limitations. We can notice that the model has difficulty predicting prices between 0.75 and 1 million dollars. Further work on Feature Engineering might improve the results in this range.
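
A sketch of how this comparison and the MAPE could be computed on Train Set B with MLmetrics and ggplot2 (object names assumed from the earlier sketches):

```r
library(MLmetrics)
library(ggplot2)

# Back-transform the predictions to dollars and compare with the actual prices
pred_b <- 10^predict(xgb_tuned, newdata = train_b_pp)
MAPE(y_pred = pred_b, y_true = train_b$price)

ggplot(data.frame(actual = train_b$price, predicted = pred_b),
       aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, colour = "red") +
  labs(title = "Predicted vs actual prices on Train Set B")
```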


These charts provide different views of the model residuals, which help identify characteristics of the data that might not be well modeled. Overall, the residuals look normally distributed, but there is room for improvement: some outliers can be seen on the scatterplot and the boxplot, and the QQ plot displays a 'wavy' line instead of a straight one.


The model now needs to be applied to the Test Set, for which we do not have the actual prices in this analysis. All the metrics mentioned in this report only apply to Train Set B. Applying the model to the Test Set will allow us to validate whether the model generalizes to unseen data.





Annexe - All Models and Results

| model | RMSE | Rsquared | MAE | MAPE | Coefficients | Train Time (min) |
|---|---|---|---|---|---|---|
| Baseline Lin. Reg. | 173375.7 | 0.7371931 | 115993.74 | 0.2340575 | 48 | 0.1 |
| Baseline Lin. Reg. Log | 167660.5 | 0.7542339 | 106380.01 | 0.1987122 | 48 | 0.1 |
| Baseline Lin. Reg. All Fact | 132157.4 | 0.8472985 | 85342.24 | 0.1702119 | 143 | 0.3 |
| Baseline Lin. Reg. Log All Fact | 126177.3 | 0.8608053 | 74064.56 | 0.1394403 | 143 | 0.3 |
| Baseline Ranger | 117953.9 | 0.8783576 | 68898.26 | 0.1357399 | 47 | 26.3 |
| Baseline Ranger Log | 122706.6 | 0.8683575 | 69140.95 | 0.1306257 | 47 | 26.0 |
| Baseline Ranger All Fact | 114785.7 | 0.8848043 | 67552.36 | 0.1330712 | 142 | 123.5 |
| Baseline Ranger Log All Fact | 118612.8 | 0.8769947 | 67751.38 | 0.1281304 | 142 | 122.1 |
| Baseline XGBoost | 116039.2 | 0.8822746 | 71617.33 | 0.1427542 | 47 | 11.8 |
| Baseline XGBoost Log | 110893.0 | 0.8924851 | 67756.69 | 0.1307163 | 47 | 11.1 |
| Baseline XGBoost All Fact | 110899.7 | 0.8924722 | 69604.63 | 0.1394589 | 142 | 33.5 |
| Baseline XGBoost Log All Fact | 109891.8 | 0.8944177 | 66483.14 | 0.1287528 | 142 | 29.3 |
| FE XGBoost | 119055.1 | 0.8760756 | 67999.89 | 0.1299648 | 190 | 40.9 |
| FE XGBoost Clusters | 113018.6 | 0.8883238 | 70184.98 | 0.1397804 | 48 | 13.2 |
| FE XGBoost Clusters Fact | 115534.0 | 0.8832976 | 71665.63 | 0.1415128 | 92 | 22.3 |
| FE XGBoost Clusters All Fact | 114216.9 | 0.8859432 | 70396.77 | 0.1395339 | 187 | 43.9 |
| XGBoost Post RFE | 118315.5 | 0.8776106 | 69053.40 | 0.1323090 | 37 | 9.4 |
| Tuning XGBoost | 112054.8 | 0.8902204 | 64687.33 | 0.1253911 | 37 | 73.3 |
| Tuning Ranger | 121960.2 | 0.8699540 | 69505.40 | 0.1316731 | 37 | 34.2 |




Ashley O’Mahony | ashleyomahony.com | May 2019