House Price Regression Analysis
1 Introduction
What is the price of a home in Ames, Iowa? Our inaugural project for the Statistical Foundations for Data
Science course in the Southern Methodist University Master of Science in Data Science (MSDS) program
was to compete in an online Kaggle competition using the linear regression techniques we have learned in this
course to date. Our team elected to use R as the analysis platform, by consensus that it has broad
applicability in industry, spanning data gathering and wrangling as well as advanced visualization and
analytic tools. The project objective was to apply various predictive models in order to assess the suitability
of our parameter selections for determining the sale prices of homes in Ames. Accuracy was measured in
terms of the Root Mean Square Error (RMSE), supplemented by other comparison measures such as
cross-validation error and adjusted R². The approach outlined in this document is limited in that we were
not permitted to use more advanced algorithms we will be exposed to later in the MSDS program; rather, in
conjunction with the aforementioned linear regression techniques, we were directed to apply the exploratory
data analysis and data cleaning methods we have learned, which will surely be of use in our future personal
and academic endeavors.
The Ames, Iowa data set describes the sale of individual residential properties in Ames, Iowa from 2006 to
2010 [1]. The data was retrieved from the dataset hosting site Kaggle, where it is listed under a machine
learning competition named House Prices: Advanced Regression Techniques [2]. The data comprises
37 numeric features, 43 non-numeric features, and an observation index, split between a training set and a
testing set containing 1460 and 1459 observations, respectively. The response variable (SalePrice) is
provided only for the training set. The output of a model on the test set can be submitted to the Kaggle
competition, which scores the performance of the model in terms of RMSE. The first analysis models property
sale price (SalePrice) as the response of the living room area (GrLivArea) of the property and the neighborhood
(Neighborhood) where it is located. In the second analysis, variable selection techniques are used to determine
which explanatory variables are associated with SalePrice in order to find a predictive model.
3 Analysis Question I
Century 21 has commissioned an analysis of this data to determine how the sale price of a property is related
to its living room area in the Edwards, Northwest Ames, and Brookside neighborhoods of Ames, IA.
3.2 Modeling
Linear regression will be used to model sale price as a response of living room area. From the initial
exploratory data analysis, it was determined that sale prices should be log-transformed to meet the model
assumption of linearity (see section 5.1), thus improving the model's fit and reducing standard error.
Additionally, two observations were removed because they appeared to be from a different population than the
other observations in the dataset (see section 5.2); therefore, the analysis only considers properties with living
rooms less than 3500 sq. ft. in area.
We will use extra sums of squares (ESS) tests to determine whether neighborhood should be added to the model.
We start with the logarithm of sale price as the response of living room area and build up a model that
includes neighborhood. Equations (1)-(3) below show the models considered. The Edwards neighborhood is used
as the reference.
Base Model

$$\mu\{\log(SalePrice)\} = \hat{\beta}_0 + \hat{\beta}_1(LivingRoomArea) \tag{1}$$

Additive Model

$$\mu\{\log(SalePrice)\} = \hat{\beta}_0 + \hat{\beta}_1(LivingRoomArea) + \hat{\beta}_2(Brookside) + \hat{\beta}_3(NorthwestAmes) \tag{2}$$

Interaction Model

$$\mu\{\log(SalePrice)\} = \hat{\beta}_0 + \hat{\beta}_1(LivingRoomArea) + \hat{\beta}_2(Brookside) + \hat{\beta}_3(NorthwestAmes) + \hat{\beta}_4(LivingRoomArea \times Brookside) + \hat{\beta}_5(LivingRoomArea \times NorthwestAmes) \tag{3}$$
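A minimal sketch of fitting these three models in R follows, assuming `ames` holds the training data filtered to the three neighborhoods of interest and to living rooms under 3500 sq. ft. (the data frame name and the factor level "Edwards" are assumptions for illustration):

```r
# Set Edwards as the reference level so the other neighborhoods are
# estimated relative to it, matching equations (1)-(3)
ames$Neighborhood <- relevel(factor(ames$Neighborhood), ref = "Edwards")

base_fit        <- lm(log(SalePrice) ~ GrLivArea, data = ames)
additive_fit    <- lm(log(SalePrice) ~ GrLivArea + Neighborhood, data = ames)
interaction_fit <- lm(log(SalePrice) ~ GrLivArea * Neighborhood, data = ames)
```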
3.2.1 ESS Tests Between Models
The following ESS test provides convincing evidence that the addition of the additive neighborhood terms is
an improvement over the base model (p-value < 0.0001).
The following ESS test provides convincing evidence that the addition of the interaction neighborhood terms
is an improvement over the additive model (p-value < 0.0001).
Based on the two ESS tests, the interaction terms appear to be significant; thus, we will continue with the
interaction model.
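Given the nested models sketched above, the ESS (partial F) tests can be reproduced with `anova()` on each pair of models; each comparison tests whether the added terms improve the fit:

```r
anova(base_fit, additive_fit)         # base vs. additive neighborhood terms
anova(additive_fit, interaction_fit)  # additive vs. interaction terms
```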
3.3 Model Assumptions Assessment
The following assessments for model assumptions are made based on Figure 1 and Figure 4:
• The residuals of the model appear to be approximately normally distributed based on the QQ plot and
histogram of the residuals, suggesting the assumption of normality is met.
• No patterns are evident in the scatter plots of residuals and studentized residuals vs predicted value,
suggesting the assumption of constant variance is met.
• While some observations appear to be influential and have high leverage, removing these observations
does not have a significant impact on the result of the model fit.
• Based on the scatter plot of the log transform of SalePrice vs GrLivArea, it appears that a linear
model is reasonable (see section 5.1).
The sampling procedure is not known. We will assume the independence assumption is met.
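The diagnostics underlying these assessments can be generated directly from the fitted model; a base-R sketch (using the `interaction_fit` object from the earlier code) is:

```r
# plot.lm() produces residuals vs. fitted, a QQ plot, scale-location,
# and residuals vs. leverage panels analogous to Figure 1
par(mfrow = c(2, 2))
plot(interaction_fit)
```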
[Figure 1: Fit assessment plots for the interaction model: QQ plot of residuals, histogram of residuals, residuals vs. predicted value, RStudent vs. predicted value, and outlier and leverage diagnostics.]
The three models were trained and validated on the training dataset using 10-fold cross-validation. The table
below summarizes the performance of the models in terms of RMSE, adjusted R², and PRESS. These results show
that the interaction model produces the best performance, which is consistent with the result of the ESS tests.
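A sketch of the cross-validation step, assuming the caret package (caret reports fold-averaged RMSE and R²; PRESS can be computed from the leave-one-out residuals of the full-data fit):

```r
library(caret)

# 10-fold cross-validation of the interaction model; repeat with the
# base and additive formulas for the other two rows of the table
ctrl <- trainControl(method = "cv", number = 10)
cv_interaction <- train(log(SalePrice) ~ GrLivArea * Neighborhood,
                        data = ames, method = "lm", trControl = ctrl)
cv_interaction$results  # RMSE and R-squared averaged over the 10 folds

# PRESS from the full-data fit: sum of squared leave-one-out residuals
press <- sum((residuals(interaction_fit) /
              (1 - hatvalues(interaction_fit)))^2)
```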
3.5 Parameters
The following table summarizes the parameter estimates for the interaction model.
We estimate that for each 100 sq. ft. increase in living room area, there is an associated multiplicative increase in median sale price of
• 1.055 for the Edwards neighborhood, with a 95% confidence interval of [1.044, 1.066]
• 1.033 for the Northwest Ames neighborhood, with a 95% confidence interval of [1.026, 1.040]
• 1.077 for the Brookside neighborhood, with a 95% confidence interval of [1.063, 1.090]
Since the sampling procedure is not known and this is an observational study, the results only apply to this
data.
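These multiplicative estimates come from back-transforming the slopes off the log scale; a sketch for the Edwards reference slope (the other neighborhoods add their interaction coefficients before exponentiating):

```r
# On the log scale, a 100 sq. ft. increase in living room area multiplies
# the median sale price by exp(100 * slope)
est <- coef(interaction_fit)["GrLivArea"]       # reference (Edwards) slope
ci  <- confint(interaction_fit)["GrLivArea", ]  # 95% CI on the log scale
exp(100 * est)  # point estimate of the multiplicative effect (~1.055)
exp(100 * ci)   # 95% confidence interval (~[1.044, 1.066])
```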
3.7 Conclusion
In response to the analysis commissioned by Century 21, the log transform of property sale price was modeled
as a linear response to the property living room area for residential properties in Ames, IA. It was determined
that it was necessary to include interaction terms to allow for the influence of neighborhood on sale price.
Based on the model, there is strong evidence of an associated multiplicative increase in median sale price for
an increase in living room area (p-value < 0.0001, overall F-test).
4 Analysis Question II
Century 21 has commissioned a second analysis using the same dataset to create a highly predictive
model of SalePrice. The analysis is expanded to include as many of the 80 total features as required
to determine the sale price of residential properties across all neighborhoods of Ames, Iowa, beyond only the
three - Edwards, Northwest Ames, and Brookside - previously commissioned for analysis.
4.2 Modeling
Through analyzing our variable selection and cross-validation processes - along with our nascent domain
knowledge of residential real estate - we ultimately arrived at a multiple linear regression model featuring 11
linear predictor variables and two interaction terms. Specifically, our variable selection process included direct
analysis of a correlation plot and a correlation matrix as well as performing forward selection, backward
elimination, and stepwise regression.
Regarding missing data, we imputed NA values for 19 variables using a combination of the data dictionary
provided by Century 21 and our domain knowledge. After building models with and without transformations
applied to the variables, we noted no significant difference in variable selection from our selection
process, so we elected to use non-transformed predictor variables. We did, however, use the log-transformed
SalePrice applied in the first analysis.
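A hedged sketch of the imputation step follows. For many features the Kaggle data dictionary defines NA as "feature absent" (e.g., no pool), not "value missing"; the exact list of 19 imputed variables is not reproduced here, so the names below are illustrative, and `train_clean` is an assumed copy of the raw training data:

```r
# Categorical features where NA means the feature is absent
none_vars <- c("Alley", "BsmtQual", "FireplaceQu", "GarageType", "PoolQC")
for (v in none_vars) {
  x <- as.character(train_clean[[v]])
  x[is.na(x)] <- "None"          # absent feature, not an unknown value
  train_clean[[v]] <- factor(x)
}

# Numeric counterparts are filled with 0 when the feature is absent
train_clean$GarageYrBlt[is.na(train_clean$GarageYrBlt)] <- 0
```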
Forward Selection
Forward selection is a variable selection methodology that begins with a constant mean and adds explanatory
variables one by one until no additional predictor variable significantly improves the model's fit. This
employs the "F-to-enter" method based on the extra-sum-of-squares F-statistic. This was the first method
we employed. For this process, we provided the test a starting model with no predictor variables and a
model from which terms could be selected, which included all available predictor variables. The process worked
forward, selecting one variable at a time. The suggested model is shown in section 5.3.1.
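A sketch of the forward pass using base R's `step()`; note that `step()` ranks candidate additions by AIC rather than the F-to-enter statistic described above, so this is an illustrative stand-in (p-value based variants exist in packages such as olsrr), and `train_clean` is the assumed imputed data:

```r
# Precompute the log response and drop the raw one so "." excludes it
train_clean$log_price <- log(train_clean$SalePrice)
train_clean$SalePrice <- NULL

null_model <- lm(log_price ~ 1, data = train_clean)  # intercept-only start
full_model <- lm(log_price ~ ., data = train_clean)  # all candidate terms

forward_fit <- step(null_model,
                    scope = list(lower = formula(null_model),
                                 upper = formula(full_model)),
                    direction = "forward")
```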
Backward Elimination
Backward elimination is a variable selection methodology that begins with all possible predictor variables
and works backward, removing the least significant variable at each step until only variables that contribute
to the fit remain. This employs the "F-to-remove" method based on the extra-sum-of-squares F-statistic. For this
process, we provided the test a model with all available predictor variables, from which insignificant variables
were eliminated. The suggested model is shown in section 5.3.2.
Stepwise Regression
Stepwise regression is a variable selection methodology that alternates steps of forward selection and
backward elimination, repeating until no further predictor variables can be added or removed. This was the
third model approach we used. The suggested model is shown in section 5.3.3.
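Under the same `step()` stand-in sketched above, backward elimination and stepwise regression reuse the `null_model` and `full_model` scaffold, changing only the starting model and the search direction:

```r
backward_fit <- step(full_model, direction = "backward")
stepwise_fit <- step(null_model,
                     scope = list(lower = formula(null_model),
                                  upper = formula(full_model)),
                     direction = "both")
```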
Custom Variable Selection
To develop the custom model, we employed a combination of a correlation matrix for the quantitative data,
analysis of the summary of the suggested model from stepwise selection, and direct analysis of the pairs
plots. As previously mentioned, our final model included 11 linear terms and two interaction terms.
We removed all variables suggested for removal by the stepwise regression and backward elimination tests,
then reprocessed the updated models until forward selection, backward elimination, and stepwise regression
were in agreement with respect to the linear terms. Once this trial-and-error process was completed, we added
interaction terms based on domain knowledge and re-applied the forward selection, backward elimination,
and stepwise regression methods until only significant terms - both linear and interactive - remained. We
then used graphical analysis to visually confirm interaction between the remaining interaction terms. The
custom model is shown in section 5.3.4.
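A sketch of the correlation screen used here, assuming the corrplot package is available:

```r
library(corrplot)

# Pairwise correlations among the numeric features (including the log
# response) flag strong associations with price for manual inspection
num_cols <- sapply(train_clean, is.numeric)
cmat <- cor(train_clean[, num_cols], use = "pairwise.complete.obs")
corrplot(cmat, method = "circle", type = "upper", tl.cex = 0.6)
```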
The assumption assessment plots were similar for all four models. The assumption assessment plots and
discussion for the custom model are provided here with Figure 2. The assumption assessment plots for the
other three models are provided for reference in section 5.5.
Based on the diagnostic plots below, the custom model appears to reasonably meet the assumptions of linear
regression. The standardized residuals do not appear to exhibit a discernible pattern, indicating constant
variance along the regression, or homoscedasticity. While there are some outliers, this does not appear to be
an egregious violation. Based on the QQ plot, there is a small level of deviation on the ends of the distribution
of the errors, but for the most part, the errors adhere to normality. The sample size should be sufficient to
protect against this non-normality. Based on the standardized residuals vs. leverage plot, only a few values
have high leverage and are outlying. However, these violations do not appear to be egregious.
[Figure 2: Fit assessment plots for the custom model: QQ plot of residuals, histogram of residuals, residuals vs. predicted value, RStudent vs. predicted value, and outlier and leverage diagnostics.]
4.4 Comparing Competing Models
While the models from forward, backward, and stepwise selection produce higher adjusted R² values on the
training data, they yield much higher errors when applied to the Kaggle test set. These selection methods
appear to overfit the training data and thus fail to generalize to the Kaggle test set. The custom model
clearly outperformed the models built strictly on the output of the forward selection, backward
elimination, and stepwise regression variable selection procedures when applied to a new dataset.
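A sketch of scoring a model against the test set via a Kaggle submission (file names are assumptions, and `custom_fit` stands in for the final custom model; Kaggle computes RMSE between the logs of the predicted and observed sale prices):

```r
test_df <- read.csv("test.csv")
pred_log <- predict(custom_fit, newdata = test_df)   # predictions on log scale
submission <- data.frame(Id = test_df$Id,
                         SalePrice = exp(pred_log))  # back-transform to dollars
write.csv(submission, "submission.csv", row.names = FALSE)
```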
4.5 Conclusion
In an effort to produce a highly accurate and repeatable predictive model using linear regression, all explanatory
variables were considered with three types of variable selection techniques: forward selection, backward
elimination, and stepwise regression. Additionally, a custom model was initially produced by eliminating
variables suggested by the automatic selection processes, visually exploring the data with pairwise scatter
plots, and adding interaction terms based on graphical analysis and domain knowledge. Automatic selection
was reapplied to suggest terms from the initial custom model, which was then again adjusted for final
inspection by the automatic techniques. The final models suggested strictly by the automatic techniques
produced high R² values, but performed poorly on the Kaggle test set. This suggests the automatic techniques
alone were overfitting to the training data. The final custom model, however, produced a high R² value and
performed remarkably well on the Kaggle test set (see section 5.4). This suggests that the custom model is
not overfitting to the training data and generalizes well to an unseen dataset. Ultimately, we determined the
best approach is a combination of automatic selection, visual and analytic inspection, and the application of
domain knowledge.
5 Appendix
The scatter plot in Figure 3 shows the relationship of SalePrice vs. GrLivArea for all three neighborhoods of
interest to Century 21. Based on this plot, it does not appear that this relationship meets the assumptions of
linear regression, specifically the constant variance assumption. The response will be transformed to attempt
to handle the changing variance.
[Figure 3: Scatter plot of SalePrice vs. GrLivArea for the three neighborhoods of interest.]
The images below show the scatter plots of log sale price vs. living room area (Figure 4). In the image on the
right, the scatter plot is shown for each neighborhood; in the image on the left, the observations for all three
neighborhoods are included. In all cases, a linear model appears to be reasonable for this data.
The two outlying observations with living room areas greater than 4000 sq. ft. appear to be from a different
distribution than the main dataset. Since these are partial sales, it is possible that their sale prices do not
reflect market value. For this reason, we limit the analysis to properties with living rooms of less than
3500 sq. ft. (Figure 5).
[Figures 4 and 5: Log of sale price vs. living room area (sq. ft.), pooled and by neighborhood (Northwest Ames, Edwards, Brookside), with per-neighborhood regression lines and abnormal sales (Abnorml) highlighted.]
5.3 Models Suggested by Automated Selection
5.3.3 Stepwise Selection
5.4 Kaggle Result
The following image shows the result on Kaggle for the custom model.
5.5 Assumption Assessment Plots for Automatic Selection Models
The following discussion applies to the assumption assessment for the three models produced by automatic
selection.
Generally, based on the diagnostic plots, these models appear to reasonably meet the assumptions for linear
regression. The standardized residuals do not appear to exhibit a discernible pattern, indicating constant
variance along the regression, or homoscedasticity. However, there are a small number of observations -
relative to the overall sample size - with unusually high residuals. Nonetheless, this is not enough to have a
detrimental impact on the model. Based on the QQ plot, there is a small level of deviation at the ends of
the distribution of the errors, but for the most part, the errors adhere to normality. The sample size should
be sufficient to protect against this non-normality. Based on the standardized residuals vs. leverage plot,
only a few values have high leverage and are outlying. Compared to the custom model (Figure 2), the
diagnostic plots for these models show a few more influential observations with high leverage. However, these
observations cannot justifiably be excluded from the model.
[Figures: Fit assessment plots and outlier and leverage diagnostics for log(SalePrice) for the three automatic selection models (forward, backward, and stepwise): QQ plot of residuals, histogram of residuals, residuals vs. predicted value, RStudent vs. predicted value, and leverage plots (threshold: 0.195).]
References
[1] De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression
project. Journal of Statistics Education, 19(3).
[2] Kaggle (2016). Ames housing dataset. Retrieved from https://www.kaggle.com/
c/house-prices-advanced-regression-techniques/data.