House Price Regression Analysis
1 Introduction
What is the price of a home in Ames, Iowa? Our inaugural project for the Statistical Foundations for Data
Science course in the Southern Methodist University Master of Science in Data Science (MSDS) program
was to compete in an online Kaggle competition using the linear regression techniques we have learned in this
course to date. Our team elected to use R as the analysis platform, by consensus that it has broad
applicability in industry, spanning data gathering and wrangling as well as advanced visualization and
analytic tools. The project objective was to apply various predictive models in order to assess the suitability
of our parameter selections for determining the sale prices of homes in Ames. Accuracy was measured in
terms of the Root Mean Square Error (RMSE), supplemented by other comparison measures such as
cross-validation error and adjusted R². The approach outlined in this document is limited in that we were
not permitted to use more advanced algorithms we will be exposed to later in the MSDS program; rather, in
conjunction with the aforementioned linear regression techniques, we were directed to apply the exploratory
data analysis and data cleaning methods we have learned, which will surely be of use in our future personal
and academic endeavors.
The Ames, Iowa data set describes the sale of individual residential properties in Ames, Iowa from 2006 to
2010 [1]. The data was retrieved from the dataset hosting site Kaggle, where it is listed under a machine
learning competition named House Prices: Advanced Regression Techniques [2]. The data comprises
37 numeric features, 43 non-numeric features, and an observation index, split between a training set and a
testing set containing 1460 and 1459 observations, respectively. The response variable (SalePrice) is
provided only for the training set. The output of a model on the test set can be submitted to the Kaggle
competition, which scores the performance of the model in terms of RMSE. The first analysis models property
sale price (SalePrice) as the response of the living room area (GrLivArea) of the property and the neighborhood
(Neighborhood) where it is located. In the second analysis, variable selection techniques are used to determine
which explanatory variables are associated with SalePrice in order to find a predictive model.
3 Analysis Question I
Century 21 has commissioned an analysis of this data to determine how the sale price of a property is related
to its living room area in the Edwards, Northwest Ames, and Brookside neighborhoods of Ames, IA.
3.2 Modeling
Linear regression will be used to model sale price as a response of living room area. From the initial
exploratory data analysis, it was determined that sale prices should be log-transformed to meet the model
assumption of linearity (see section 5.1), thus improving the model's fit and reducing standard error.
Additionally, two observations were removed because they appeared to be from a different population than the
other observations in the dataset (see section 5.2); therefore, the analysis only considers properties with living
rooms less than 3500 sq. ft. in area.
We will use extra sums of squares (ESS) tests to determine whether neighborhood should be added to the model.
We start with the logarithm of sale price as the response of living room area and build up a model that
includes neighborhood. Equations (1)-(3) below show the models considered. The Edwards neighborhood is used
as the reference.
Base Model

$$\mu\{\log(SalePrice)\} = \hat{\beta}_0 + \hat{\beta}_1(LivingRoomArea) \tag{1}$$

Additive Model

$$\mu\{\log(SalePrice)\} = \hat{\beta}_0 + \hat{\beta}_1(LivingRoomArea) + \hat{\beta}_2(Brookside) + \hat{\beta}_3(NorthwestAmes) \tag{2}$$

Interaction Model

$$\mu\{\log(SalePrice)\} = \hat{\beta}_0 + \hat{\beta}_1(LivingRoomArea) + \hat{\beta}_2(Brookside) + \hat{\beta}_3(NorthwestAmes) + \hat{\beta}_4(LivingRoomArea \times Brookside) + \hat{\beta}_5(LivingRoomArea \times NorthwestAmes) \tag{3}$$
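A minimal sketch of fitting these three models in R follows, assuming `ames` holds the training data filtered to the three neighborhoods of interest and to living rooms under 3500 sq. ft. (the data frame name and the factor level "Edwards" are assumptions for illustration):

```r
# Set Edwards as the reference level so the other neighborhoods are
# estimated relative to it, matching equations (1)-(3)
ames$Neighborhood <- relevel(factor(ames$Neighborhood), ref = "Edwards")

base_fit        <- lm(log(SalePrice) ~ GrLivArea, data = ames)
additive_fit    <- lm(log(SalePrice) ~ GrLivArea + Neighborhood, data = ames)
interaction_fit <- lm(log(SalePrice) ~ GrLivArea * Neighborhood, data = ames)
```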
3.2.1 ESS Tests Between Models
The following ESS test provides convincing evidence that the addition of the additive neighborhood terms is
an improvement over the base model (p-value < 0.0001).
The following ESS test provides convincing evidence that the addition of the interaction neighborhood terms
is an improvement over the additive model (p-value < 0.0001).
Based on the two ESS tests, the interaction terms appear to be significant; thus, we will continue with the
interaction model.
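Given the nested models sketched above, the ESS (partial F) tests can be reproduced with `anova()` on each pair of models; each comparison tests whether the added terms improve the fit:

```r
anova(base_fit, additive_fit)         # base vs. additive neighborhood terms
anova(additive_fit, interaction_fit)  # additive vs. interaction terms
```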
3.3 Model Assumptions Assessment
The following assessments for model assumptions are made based on Figure 1 and Figure 4:
• The residuals of the model appear to be approximately normally distributed based on the QQ plot and
histogram of the residuals, suggesting the assumption of normality is met.
• No patterns are evident in the scatter plots of residuals and studentized residuals vs predicted value,
suggesting the assumption of constant variance is met.
• While some observations appear to be influential and have high leverage, removing these observations
does not have a significant impact on the result of the model fit.
• Based on the scatter plot of the log transform of SalePrice vs GrLivArea, it appears that a linear
model is reasonable (see section 5.1).
The sampling procedure is not known. We will assume the independence assumption is met.
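The diagnostics underlying these assessments can be generated directly from the fitted model; a base-R sketch (using the `interaction_fit` object from the earlier code) is:

```r
# plot.lm() produces residuals vs. fitted, a QQ plot, scale-location,
# and residuals vs. leverage panels analogous to Figure 1
par(mfrow = c(2, 2))
plot(interaction_fit)
```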
[Figure 1: Fit assessment plots for the interaction model: QQ plot of residuals, histogram of residuals, residuals vs. predicted value, RStudent vs. predicted value, and outlier and leverage diagnostics.]
The three models were trained and validated on the training dataset using 10-fold cross-validation. The table
below summarizes the performance of the models in terms of RMSE, adjusted R², and PRESS. These results show
that the interaction model produces the best performance, which is consistent with the result of the ESS tests.
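A sketch of the cross-validation step, assuming the caret package (caret reports fold-averaged RMSE and R²; PRESS can be computed from the leave-one-out residuals of the full-data fit):

```r
library(caret)

# 10-fold cross-validation of the interaction model; repeat with the
# base and additive formulas for the other two rows of the table
ctrl <- trainControl(method = "cv", number = 10)
cv_interaction <- train(log(SalePrice) ~ GrLivArea * Neighborhood,
                        data = ames, method = "lm", trControl = ctrl)
cv_interaction$results  # RMSE and R-squared averaged over the 10 folds

# PRESS from the full-data fit: sum of squared leave-one-out residuals
press <- sum((residuals(interaction_fit) /
              (1 - hatvalues(interaction_fit)))^2)
```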
3.5 Parameters
The following table summarizes the parameter estimates for the interaction model.
We estimate that for each 100 sq. ft. increase in living room area, there is an associated multiplicative increase in median sale price of
• 1.055 for the Edwards neighborhood, with a 95% confidence interval of [1.044, 1.066]
• 1.033 for the Northwest Ames neighborhood, with a 95% confidence interval of [1.026, 1.040]
• 1.077 for the Brookside neighborhood, with a 95% confidence interval of [1.063, 1.090]
Since the sampling procedure is not known and this is an observational study, the results only apply to this
data.
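These multiplicative estimates come from back-transforming the slopes off the log scale; a sketch for the Edwards reference slope (the other neighborhoods add their interaction coefficients before exponentiating):

```r
# On the log scale, a 100 sq. ft. increase in living room area multiplies
# the median sale price by exp(100 * slope)
est <- coef(interaction_fit)["GrLivArea"]       # reference (Edwards) slope
ci  <- confint(interaction_fit)["GrLivArea", ]  # 95% CI on the log scale
exp(100 * est)  # point estimate of the multiplicative effect (~1.055)
exp(100 * ci)   # 95% confidence interval (~[1.044, 1.066])
```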
3.7 Conclusion
In response to the analysis commissioned by Century 21, the log transform of property sale price was modeled
as a linear response to the property living room area for residential properties in Ames, IA. It was determined
that it was necessary to include interaction terms to allow for the influence of neighborhood on sale price.
Based on the model, there is strong evidence of an associated multiplicative increase in median sale price for
an increase in living room area (p-value < 0.0001, overall F-test).
4 Analysis Question II
Century 21 has commissioned a second analysis using the same dataset to create a highly predictive
model of SalePrice. The analysis is expanded to include as many of the 80 total features as required
to determine the sale price of residential properties across all neighborhoods of Ames, Iowa, beyond only the
three - Edwards, Northwest Ames, and Brookside - previously commissioned for analysis.
4.2 Modeling
Through analyzing our variable selection and cross-validation processes - along with our nascent domain
knowledge of residential real estate - we ultimately arrived at a multiple linear regression model featuring 11
linear predictor variables and two interaction terms. Specifically, our variable selection process included direct
analysis of a correlation plot and a correlation matrix as well as performing forward selection, backward
elimination, and stepwise regression.
Regarding missing data, we imputed NA values for 19 variables using a combination of the data dictionary
provided by Century 21 and our domain knowledge. After building models with and without transformations
applied to the variables, we noted no significant difference in variable selection from our selection
process, so we elected to use non-transformed predictor variables. We did, however, use the log-transformed
SalePrice applied in the first analysis.
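A hedged sketch of the imputation step follows. For many features the Kaggle data dictionary defines NA as "feature absent" (e.g., no pool), not "value missing"; the exact list of 19 imputed variables is not reproduced here, so the names below are illustrative, and `train_clean` is an assumed copy of the raw training data:

```r
# Categorical features where NA means the feature is absent
none_vars <- c("Alley", "BsmtQual", "FireplaceQu", "GarageType", "PoolQC")
for (v in none_vars) {
  x <- as.character(train_clean[[v]])
  x[is.na(x)] <- "None"          # absent feature, not an unknown value
  train_clean[[v]] <- factor(x)
}

# Numeric counterparts are filled with 0 when the feature is absent
train_clean$GarageYrBlt[is.na(train_clean$GarageYrBlt)] <- 0
```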
Forward Selection
Forward selection is a variable selection methodology that begins with a constant mean and adds explanatory
variables one by one until no additional predictor variable significantly improves the model's fit. This
employs the "F-to-enter" method based on the extra-sum-of-squares F-statistic. This was the first method
we employed. For this process, we provided the test a starting model with no predictor variables and a
model from which terms could be selected, which included all available predictor variables. The process worked
forward, selecting one variable at a time. The suggested model is shown in section 5.3.1.
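A sketch of the forward pass using base R's `step()`; note that `step()` ranks candidate additions by AIC rather than the F-to-enter statistic described above, so this is an illustrative stand-in (p-value based variants exist in packages such as olsrr), and `train_clean` is the assumed imputed data:

```r
# Precompute the log response and drop the raw one so "." excludes it
train_clean$log_price <- log(train_clean$SalePrice)
train_clean$SalePrice <- NULL

null_model <- lm(log_price ~ 1, data = train_clean)  # intercept-only start
full_model <- lm(log_price ~ ., data = train_clean)  # all candidate terms

forward_fit <- step(null_model,
                    scope = list(lower = formula(null_model),
                                 upper = formula(full_model)),
                    direction = "forward")
```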
Backward Elimination
Backward elimination is a variable selection methodology that begins with all possible predictor variables
and works backward, removing the least significant variable at each step until only variables that contribute
to the fit remain. This employs the "F-to-remove" method based on the extra-sum-of-squares F-statistic. For this
process, we provided the test a model with all available predictor variables, from which insignificant variables
were eliminated. The suggested model is shown in section 5.3.2.
Stepwise Regression
Stepwise regression is a variable selection methodology that alternates steps of forward selection and
backward elimination, repeating until no further predictor variables can be added or removed. This was the
third model approach we used. The suggested model is shown in section 5.3.3.
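Under the same `step()` stand-in sketched above, backward elimination and stepwise regression reuse the `null_model` and `full_model` scaffold, changing only the starting model and the search direction:

```r
backward_fit <- step(full_model, direction = "backward")
stepwise_fit <- step(null_model,
                     scope = list(lower = formula(null_model),
                                  upper = formula(full_model)),
                     direction = "both")
```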
Custom Variable Selection
To develop the custom model, we employed a combination of a correlation matrix for the quantitative data,
analysis of the summary of the suggested model from stepwise selection, and direct analysis of the pairs
plots. As previously mentioned, our final model included 11 linear terms and two interaction terms.
We removed all variables suggested for removal by the stepwise regression and backward elimination tests,
then reprocessed the updated models until forward selection, backward elimination, and stepwise regression
were in agreement with respect to the linear terms. Once this trial-and-error process was completed, we added
interaction terms based on domain knowledge and re-applied the forward selection, backward elimination,
and stepwise regression methods until only significant terms - both linear and interactive - remained. We
then used graphical analysis to visually confirm interaction between the remaining interaction terms. The
custom model is shown in section 5.3.4.
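A sketch of the correlation screen used here, assuming the corrplot package is available:

```r
library(corrplot)

# Pairwise correlations among the numeric features (including the log
# response) flag strong associations with price for manual inspection
num_cols <- sapply(train_clean, is.numeric)
cmat <- cor(train_clean[, num_cols], use = "pairwise.complete.obs")
corrplot(cmat, method = "circle", type = "upper", tl.cex = 0.6)
```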
The assumption assessment plots were similar for all four models. The assumption assessment plots and
discussion for the custom model are provided here with Figure 2. The assumption assessment plots for the
other three models are provided for reference in section 5.5.
Based on the diagnostic plots below, the custom model appears to reasonably meet the assumptions of linear
regression. The standardized residuals do not appear to exhibit a discernible pattern, indicating constant
variance along the regression, or homoscedasticity. While there are some outliers, this does not appear to be
an egregious violation. Based on the QQ plot, there is a small level of deviation on the ends of the distribution
of the errors, but for the most part, the errors adhere to normality. The sample size should be sufficient to
protect against this non-normality. Based on the standardized residuals vs. leverage plot, only a few values
have high leverage and are outlying. However, these violations do not appear to be egregious.
[Figure 2: Fit assessment plots for the custom model: QQ plot of residuals, histogram of residuals, residuals vs. predicted value, RStudent vs. predicted value, and outlier and leverage diagnostics.]
4.4 Comparing Competing Models
While the models from forward, backward, and stepwise selection produce higher adjusted R² values on the
training data, they yield much higher errors when applied to the Kaggle test set. These selection methods
appear to overfit the training data and thus fail to generalize to the Kaggle test set. The custom model
clearly outperformed the models built strictly on the output of the forward selection, backward
elimination, and stepwise regression variable selection procedures when applied to a new dataset.
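A sketch of scoring a model against the test set via a Kaggle submission (file names are assumptions, and `custom_fit` stands in for the final custom model; Kaggle computes RMSE between the logs of the predicted and observed sale prices):

```r
test_df <- read.csv("test.csv")
pred_log <- predict(custom_fit, newdata = test_df)   # predictions on log scale
submission <- data.frame(Id = test_df$Id,
                         SalePrice = exp(pred_log))  # back-transform to dollars
write.csv(submission, "submission.csv", row.names = FALSE)
```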
4.5 Conclusion
In an effort to produce a highly accurate and repeatable predictive model using linear regression, all explanatory
variables were considered with three types of variable selection techniques: forward selection, backward
elimination, and stepwise regression. Additionally, a custom model was initially produced by eliminating
variables suggested by the automatic selection processes, visually exploring the data with pairwise scatter
plots, and adding interaction terms based on graphical analysis and domain knowledge. Automatic selection
was reapplied to suggest terms from the initial custom model, which was then again adjusted for final
inspection by the automatic techniques. The final models suggested strictly by the automatic techniques
produced high R² values, but performed poorly on the Kaggle test set. This suggests the automatic techniques
alone were overfitting to the training data. The final custom model, however, produced a high R² value and
performed remarkably well on the Kaggle test set (see section 5.4). This suggests that the custom model is
not overfitting to the training data and generalizes well to an unseen dataset. Ultimately, we determined the
best approach is a combination of automatic selection, visual and analytic inspection, and the application of
domain knowledge.
5 Appendix
The scatter plot in Figure 3 shows the relationship of SalePrice vs. GrLivArea for all three neighborhoods of
interest to Century 21. Based on this plot, it does not appear that this relationship meets the assumptions of
linear regression, specifically the constant variance assumption. The response will be transformed to attempt
to handle the changing variance.
[Figure 3: Scatter plot of SalePrice vs. GrLivArea for the three neighborhoods of interest.]
The images below show the scatter plots of log sale price vs. living room area (Figure 4). In the image on the
right, the scatter plot is shown for each neighborhood; in the image on the left, the observations for all three
neighborhoods are included. In all cases, a linear model appears to be reasonable for this data.
The two outlying observations with living room areas greater than 4000 sq. ft. appear to be from a different
distribution than the main dataset. Since these are partial sales, it is possible that their sale prices do not
reflect market value. For this reason, we limit the analysis to properties with living rooms of less than
3500 sq. ft. (Figure 5).
[Figures 4 and 5: Log of sale price vs. living room area (sq. ft.), pooled and by neighborhood (Northwest Ames, Edwards, Brookside), with per-neighborhood regression lines and abnormal sales (Abnorml) highlighted.]
5.3 Models Suggested by Automated Selection
5.3.3 Stepwise Selection
5.4 Kaggle Result
The following image shows the result on Kaggle for the custom model.
5.5 Assumption Assessment Plots for Automatic Selection Models
The following discussion applies to the assumption assessment for the three models produced by automatic
selection.
Generally, based on the diagnostic plots, these models appear to reasonably meet the assumptions for linear
regression. The standardized residuals do not appear to exhibit a discernible pattern, indicating constant
variance along the regression, or homoscedasticity. However, there are a small number of observations -
relative to the overall sample size - with unusually high residuals. Nonetheless, this is not enough to have a
detrimental impact on the model. Based on the QQ plot, there is a small level of deviation at the ends of
the distribution of the errors, but for the most part, the errors adhere to normality. The sample size should
be sufficient to protect against this non-normality. Based on the standardized residuals vs. leverage plot,
only a few values have high leverage and are outlying. Compared to the custom model (Figure 2), the
diagnostic plots for these models show a few more influential observations with high leverage. However, these
observations cannot justifiably be excluded from the model.
[Figures: Fit assessment plots and outlier and leverage diagnostics for log(SalePrice) for the three automatic selection models (forward, backward, and stepwise): QQ plot of residuals, histogram of residuals, residuals vs. predicted value, RStudent vs. predicted value, and leverage plots (threshold: 0.195).]
References
[1] De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression
project. Journal of Statistics Education, 19(3).
[2] Kaggle (2016). Ames housing dataset. Retrieved from https://www.kaggle.com/
c/house-prices-advanced-regression-techniques/data.