Final Project - Regression Models

Final submission for the statistical analysis with R course from Washington University and Coursera on regression models. Built a multiple linear regression model over the Ames dataset to predict the expected price.

05/04/2017 FinalProjectRegressionModels

Final Project Regression Models
1 Background
As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict
the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess
whether the asking price of a house is higher or lower than the true value of the house. If the home is
undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages
In order to better assess the quality of the model you will produce, the data have been randomly divided into
three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load
the training data set; the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(BAS)
library(gridExtra)  # package to plot 2 ggplots at once
library(ggplot2)
library(corrplot)
library(caret)

2.1 Part 1 - Exploratory Data Analysis (EDA)
When you first get your data, it's very tempting to immediately begin fitting models and assessing how they
perform. However, before you begin modeling, it's absolutely essential to explore the structure of the data
and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships
between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder
about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering
the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during
your EDA process, and briefly explain what you learned from each (why you found each informative).

2.1.1 a. Price or log(price)?
First, I evaluated the distribution of price in order to decide whether it needed some kind of transformation. The
two graphs below show that log(price) is much closer to a normal distribution than price on its original
scale. After analyzing this plot, I decided to model log(price).

summary(ames_train$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12790  129800  159500  181200  213000  615000

summary(log(ames_train$price)+1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.46   12.77   12.98   13.02   13.27   14.33

plot1 <- ggplot(ames_train, aes(price)) + geom_histogram() + ggtitle('Distribution of price')
# note: the plotted variable is log(price) + 1, i.e. log(price) shifted by one unit
plot2 <- ggplot(ames_train, aes(log(price) + 1)) + geom_histogram() + ggtitle('Distribution of log(price)')
grid.arrange(plot1, plot2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
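
To put a number on what the two histograms show, one option is to compare the sample skewness before and after the log transform. This sketch is not part of the original analysis: the skewness helper and the simulated price vector are assumptions that stand in for ames_train$price.

```r
# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Simulated right-skewed "prices"; a stand-in for ames_train$price.
price <- exp(rnorm(1000, mean = 12, sd = 0.4))

skewness(price)       # clearly positive: long right tail
skewness(log(price))  # close to zero: roughly symmetric
```

A skewness near zero after the transform is consistent with choosing log(price) as the response, although it is only one of several ways to check the shape of a distribution.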

2.1.2 b. Numerical Correlations
Next, we will check the correlations among the numeric variables and their correlation with the
dependent variable log(price). First, we check which columns have NAs and the fraction of NAs in each.

integer.cols.index <- which(sapply(ames_train, class) == 'integer')
ames_train.numeric <- ames_train[, integer.cols.index]

# fraction of missing values in each numeric column
NApercentage <- sapply(ames_train.numeric, function(col) {
  NAcount <- sum(is.na(col))
  if (NAcount == 0) {
    result <- 0
  } else {
    result <- NAcount / length(col)
  }
  invisible(result)
})
perc.missing <- NApercentage[NApercentage != 0]
perc.missing

##    Lot.Frontage   Mas.Vnr.Area   BsmtFin.SF.1   BsmtFin.SF.2    Bsmt.Unf.SF
##           0.167          0.007          0.001          0.001          0.001
##   Total.Bsmt.SF Bsmt.Full.Bath Bsmt.Half.Bath  Garage.Yr.Blt    Garage.Cars
##           0.001          0.001          0.001          0.048          0.001
##     Garage.Area
##           0.001

I see there is no need to remove a predictor variable, as the column with the most NAs (Lot.Frontage) has only
about 17% of its values missing. So, in order to create a numerical correlation matrix, I decided to remove the
rows that had any NAs in them. The procedure removed 219 rows. There are multiple approaches in the literature
for dealing with missing values, and dropping incomplete rows is one of the most basic and least recommended
techniques, but as a first basic analysis of the dataset, we will do it anyway.

# indices of the rows that contain at least one NA
na.rows <- which(apply(ames_train.numeric, 1, function(row) {
  any(is.na(row))
}))
ames_train.numeric <- ames_train.numeric[-na.rows, ]
ames_train.numeric$log.price <- log(ames_train.numeric$price)

corplot <- corrplot(cor(ames_train.numeric), title = 'Correlation between numerical variables')
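
The same NA bookkeeping can be sketched with base R's colMeans(is.na(.)) and complete.cases(). The toy data frame is made up for illustration. One reason to consider complete.cases() is that negative indexing with an empty index vector (a data set with no NA rows at all) selects zero rows instead of keeping everything.

```r
# Toy data frame (hypothetical); stands in for ames_train.numeric.
toy <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40), c = 1:4)

# Share of missing values per column, keeping only columns with any NA.
na_share <- colMeans(is.na(toy))
na_share[na_share > 0]

# Keep only the rows without any NA.
toy_complete <- toy[complete.cases(toy), ]
nrow(toy) - nrow(toy_complete)  # number of rows removed

# Pitfall of negative indexing: with no NA rows, which() returns integer(0)
# and x[-integer(0), ] selects zero rows rather than all of them.
no_na <- toy_complete
empty_idx <- which(!complete.cases(no_na))
nrow(no_na[-empty_idx, ])
```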

Finally, I selected only the correlations with log.price and kept the variables whose absolute correlation
was > 0.4.

# drop the raw price row and column, then keep the log.price row
corplot <- corplot[-which(rownames(corplot) == 'price'), -which(colnames(corplot) == 'price')]
price.corr <- corplot['log.price', ]
corplot.df <- data.frame(price.corr = price.corr, var = names(price.corr))

reorderedLevels <- corplot.df$var[order(corplot.df$price.corr, decreasing = TRUE)]
corplot.df$var <- factor(corplot.df$var, levels = reorderedLevels)
corplot.df.final <- corplot.df %>% filter(abs(price.corr) > 0.4)

ggplot(corplot.df.final, aes(var, price.corr)) +
  geom_bar(stat = 'identity') +
  ggtitle('Numeric variable correlation with log(price)') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))

2.1.3 c. Categorical Correlations
First, I selected only the categorical variables from the original dataset, then I removed the columns that had
more than 80% of their values missing. Before calculating each variable's ability to describe log(price), I
checked whether there were any near-zero-variance variables, i.e., variables whose observed values are almost
all concentrated in a single category. If you check the Utilities variable, you
will see that all of its values are concentrated in one category, AllPub. The next step was to predict
log(price) using each categorical predictor on its own, in order to see which variable could best explain the
variation in log(price). The fit of each regression was measured by the adjusted \(R^2\), and the variables with
the highest fit were plotted.

# Get categorical (factor) variables
categ.cols.index <- which(sapply(ames_train, class) == 'factor')
ames_train.categ <- ames_train[, categ.cols.index]

# remove columns that have more than 80% of NA's
perc80.na.cols <- which(apply(ames_train.categ, 2, function(col) {
  sum(is.na(col)) / nrow(ames_train.categ) > 0.8
}))
ames_train.categ <- ames_train.categ[, -perc80.na.cols]

# e.g. Utilities was near-zero variance
nzvVars <- nzv(ames_train.categ)
ames_train.categ <- ames_train.categ[, -nzvVars]  # remove nzv columns

# association with the price variable
rScores <- sapply(ames_train.categ,
                  function(col) {
                    df <- data.frame(log.price = log(ames_train$price), col = col)
                    model <- lm(log.price ~ col, df)
                    summary(model)$adj.r.squared
                  })

catcorr.df <- data.frame(price.corr = rScores, var = names(rScores))
reorderedLevels <- catcorr.df$var[order(catcorr.df$price.corr, decreasing = TRUE)]
catcorr.df$var <- factor(catcorr.df$var, levels = reorderedLevels)
catcorr.df <- catcorr.df %>% filter(abs(price.corr) > 0.4)

ggplot(catcorr.df, aes(var, price.corr)) +
  geom_bar(stat = 'identity') +
  ggtitle('Adjusted R^2 of each categorical variable when explaining log(price)') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))
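
The one-predictor-at-a-time screening can be tried on a dataset that ships with R. This sketch uses mtcars with cyl, gear and am recoded as factors and log(mpg) as the response; these are my choice of stand-in data, not the Ames variables.

```r
# Stand-in data: three categorical predictors and a log-transformed response.
cars <- transform(mtcars,
                  cyl  = factor(cyl),
                  gear = factor(gear),
                  am   = factor(am))

# Adjusted R^2 of a regression of log(mpg) on each factor alone.
adj_r2 <- sapply(c("cyl", "gear", "am"), function(v) {
  fit <- lm(as.formula(paste("log(mpg) ~", v)), data = cars)
  summary(fit)$adj.r.squared
})

sort(adj_r2, decreasing = TRUE)
```

Because the adjusted \(R^2\) penalizes the number of levels, it is a fairer score than the plain \(R^2\) when factors with many categories (such as Neighborhood) compete with factors that have only a few.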

2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis
2.2.1 Section 2.1 An Initial Model
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results
of the exploratory data analysis. (Note: The goal at this stage is not to identify the best possible model but
rather to choose a reasonable and understandable starting point. Later you will expand and revise this
model to create your final model.)

Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for
price (or a transformed version of price) using those variables. Provide the R code and the summary
output table for your model, a brief justification for the variables you have chosen, and a brief discussion of
the model results in context (focused on the variables that appear to be important predictors and how they
relate to sales price).

Because I think numerical variables are easier to use than categorical ones, I decided to fit the initial model
using the 10 numerical variables most correlated with log(price). To have a better interpretation of the
coefficients, including the intercept term, I first centered the predictor variables so we can interpret the
betas as variations from the mean predicted value of log(price).

# work with centered variables, except for price
ames_train.numeric.centered <- as.data.frame(scale(ames_train.numeric, center = TRUE, scale = FALSE))
ames_train.numeric.centered$price <- ames_train.numeric$price

modelq2Log.centered <- lm(log(price) ~ Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
                            Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF +
                            Total.Bsmt.SF + area + Overall.Qual,
                          data = ames_train.numeric.centered)
summary(modelq2Log.centered)

##
## Call:
## lm(formula = log(price) ~ Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
##     Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF + Total.Bsmt.SF +
##     area + Overall.Qual, data = ames_train.numeric.centered)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.80632 -0.07930  0.01032  0.09198  0.46932
##
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)
## (Intercept)    1.204e+01  6.217e-03 1935.833  < 2e-16 ***
## Mas.Vnr.Area   8.457e-07  3.944e-05    0.021  0.98290
## Garage.Yr.Blt  1.722e-04  5.444e-04    0.316  0.75181
## Year.Remod.Add 2.495e-03  4.244e-04    5.879 6.16e-09 ***
## Year.Built     1.861e-03  4.588e-04    4.056 5.50e-05 ***
## Garage.Area    8.081e-05  6.747e-05    1.198  0.23140
## Garage.Cars    1.527e-02  2.018e-02    0.757  0.44944
## X1st.Flr.SF    8.500e-05  3.233e-05    2.629  0.00873 **
## Total.Bsmt.SF  1.368e-04  2.862e-05    4.780 2.10e-06 ***
## area           2.260e-04  1.859e-05   12.161  < 2e-16 ***
## Overall.Qual   1.075e-01  7.405e-03   14.524  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1737 on 770 degrees of freedom
## Multiple R-squared:  0.8363, Adjusted R-squared:  0.8342
## F-statistic: 393.5 on 10 and 770 DF,  p-value: < 2.2e-16

From the summary of the model, we can see that some variables carry more weight than others. As we
are working with log(price) and centered predictors
(http://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/),
we can make the following interpretations:

* Intercept: The value exp(\(\beta_{0}\)) = exp(1.204e+01) = 169396.9 can be considered the unconditional
geometric mean of the price Y (i.e. a measure of center that does not depend on any predictor variable in X).
* Year.Remod.Add and Year.Built: The year the house was remodeled and the year it was built both have a clear
effect on the price. Keeping all other predictors constant, each one-year deviation from the mean of these
predictors changes the price by (exp(\(\beta_{i}\)) - 1) * 100, which is about 0.25% and 0.19% respectively.
* X1st.Flr.SF: A small but important predictor; the size of the first floor in square feet turned out to be an
important variable for modelling the price of a house. Holding all other variables constant, a unit change in
X1st.Flr.SF from the mean changes the price of the house by about 0.0085%. This might not seem like much but,
considering that this centered variable ranges over [-757.90, 1973.00], it can move the predicted price by
roughly -6% to +18%.
* Total.Bsmt.SF and area follow the same principle: keeping everything else constant, each unit deviation from
the mean produces a change in price of about 0.014% and 0.023% respectively.
* Lastly, I reserved a separate item for the Overall.Qual coefficient. In this context of centered (but not
standardized, so we cannot infer coefficient importance) log regression, this was the biggest coefficient. Its
interpretation is (exp(\(\beta_{Overall.Qual}\)) - 1) * 100 = 11.35% (each unit change alters the predicted price
by about 11.35%). Considering the range of the centered variable, [-4.1680, 3.8320], and compounding that
effect, this variable alone can move the final predicted price by roughly -36% to +51%.
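
The percentages in this discussion can be recomputed from the coefficient estimates in the summary table. In this sketch the values in beta are copied from that table (printed without a sign there, so taken as positive), and pct_change is a hypothetical helper name. It reports the exact compounded effect exp(beta * delta) - 1, which for wide ranges differs noticeably from multiplying the per-unit percentage by the size of the change.

```r
# Estimates copied from the summary table (log(price) scale).
beta <- c(Year.Remod.Add = 2.495e-03, Year.Built    = 1.861e-03,
          X1st.Flr.SF    = 8.500e-05, Total.Bsmt.SF = 1.368e-04,
          area           = 2.260e-04, Overall.Qual  = 1.075e-01)

# Percent change in predicted price when one predictor moves by `delta`
# units and all other predictors are held constant.
pct_change <- function(b, delta = 1) (exp(b * delta) - 1) * 100

# Effect of a one-unit change in each predictor.
round(pct_change(beta), 4)

# Effect across the observed range of two centered predictors.
qual_range  <- pct_change(beta[["Overall.Qual"]], delta = c(-4.168, 3.832))
floor_range <- pct_change(beta[["X1st.Flr.SF"]], delta = c(-757.90, 1973.00))
round(qual_range, 1)
round(floor_range, 1)

# Intercept back on the dollar scale (geometric mean of price).
exp(12.04)
```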

In summary, it seems that real estate professionals focus mostly on the size of the house, its age, and a general
quality measure which, in my opinion, gives an average score for all the items and luxuries a house could
have.

2.2.2 Section 2.2 Model Selection
Now, using either BAS or another stepwise selection procedure, choose the best model you can, using your
initial model as your starting point. Try at least two different model selection methods and compare their
results. Do they both arrive at the same model or do they disagree? What do you think this means?

For model selection, I tried stepwise variable selection with both the AIC and the BIC criteria, and both arrived at
the same model, which can be seen below. As both gave the same result, and the selected model also fit the
training data well, I chose to continue the modelling procedure with the AIC model (a personal choice; it could
be either the AIC or the BIC model).
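
The only difference between the two runs is the penalty k passed to step(): the default k = 2 gives AIC, and k = log(n) gives BIC. A minimal sketch on mtcars follows; the data set and the candidate predictors are my choice of stand-in, not the Ames model.

```r
# Null model and the scope of candidate predictors (stand-in data).
null_fit    <- lm(mpg ~ 1, data = mtcars)
upper_scope <- ~ wt + hp + disp + qsec + drat

# Stepwise search in both directions with the AIC penalty (k = 2).
aic_fit <- step(null_fit, scope = list(lower = ~ 1, upper = upper_scope),
                direction = "both", trace = 0)

# Same search with the BIC penalty, k = log(number of observations).
bic_fit <- step(null_fit, scope = list(lower = ~ 1, upper = upper_scope),
                direction = "both", k = log(nrow(mtcars)), trace = 0)

formula(aic_fit)
formula(bic_fit)
```

Since log(n) exceeds 2 once n is larger than 7, BIC penalizes each extra term more heavily, so when both criteria agree, as they do in this report, the weaker predictors add little under either penalty. Note that step() labels the criterion column "AIC" in its trace even when k = log(n) is used.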

# stepwise
null <- lm(log(price) ~ 1, data = ames_train.numeric.centered)
upperModel <- lm(log(price) ~ Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
                   Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF +
                   Total.Bsmt.SF + area + Overall.Qual,
                 data = ames_train.numeric.centered)

## Mas.Vnr.Area, Garage.Yr.Blt and Garage.Cars were left out of both models
aic <- step(null, scope = list(upper = upperModel),
            data = ames_train.numeric.centered, direction = 'both')

## Start:  AIC=-1329.34
## log(price) ~ 1
##
##                  Df Sum of Sq     RSS     AIC
## + Overall.Qual    1    99.421  42.593 -2267.8
## + area            1    73.808  68.205 -1900.1
## + Total.Bsmt.SF   1    66.370  75.643 -1819.3
## + Garage.Cars     1    65.607  76.407 -1811.4
## + X1st.Flr.SF     1    64.510  77.504 -1800.3
## + Garage.Area     1    58.203  83.810 -1739.2
## + Year.Built      1    57.412  84.601 -1731.9
## + Year.Remod.Add  1    55.435  86.579 -1713.8
## + Garage.Yr.Blt   1    53.613  88.400 -1697.6
## + Mas.Vnr.Area    1    32.849 109.164 -1532.8
## <none>                        142.013 -1329.3
##
## Step:  AIC=-2267.85
## log(price) ~ Overall.Qual
##
##                  Df Sum of Sq     RSS     AIC
## + X1st.Flr.SF     1     9.291  33.301 -2458.0
## + area            1     9.282  33.311 -2457.8
## + Total.Bsmt.SF   1     7.991  34.601 -2428.1
## + Garage.Cars     1     6.096  36.497 -2386.5
## + Garage.Area     1     6.030  36.563 -2385.1
## + Year.Remod.Add  1     3.484  39.109 -2332.5
## + Garage.Yr.Blt   1     3.063  39.530 -2324.1
## + Year.Built      1     2.957  39.636 -2322.0
## + Mas.Vnr.Area    1     1.473  41.119 -2293.3
## <none>                         42.593 -2267.8
## - Overall.Qual    1    99.421 142.013 -1329.3
##
## Step:  AIC=-2458.04
## log(price) ~ Overall.Qual + X1st.Flr.SF
##
##                  Df Sum of Sq     RSS     AIC
## + area            1     4.240  29.061 -2562.4
## + Year.Remod.Add  1     3.297  30.004 -2537.5
## + Garage.Cars     1     2.517  30.784 -2517.4
## + Garage.Yr.Blt   1     2.513  30.788 -2517.3
## + Year.Built      1     2.480  30.822 -2516.5
## + Garage.Area     1     1.918  31.383 -2502.4
## + Total.Bsmt.SF   1     0.543  32.758 -2468.9
## + Mas.Vnr.Area    1     0.213  33.088 -2461.1
## <none>                         33.301 -2458.0
## - X1st.Flr.SF     1     9.291  42.593 -2267.8
## - Overall.Qual    1    44.202  77.504 -1800.3
##
## Step:  AIC=-2562.41
## log(price) ~ Overall.Qual + X1st.Flr.SF + area
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Built      1    3.8583  25.203 -2671.7
## + Garage.Yr.Blt   1    3.1420  25.919 -2649.8
## + Year.Remod.Add  1    3.0715  25.990 -2647.7
## + Garage.Cars     1    1.4694  27.592 -2600.9
## + Garage.Area     1    1.2605  27.801 -2595.0
## + Total.Bsmt.SF   1    1.1796  27.882 -2592.8
## <none>                         29.061 -2562.4
## + Mas.Vnr.Area    1    0.0309  29.030 -2561.2
## - area            1    4.2400  33.301 -2458.0
## - X1st.Flr.SF     1    4.2494  33.311 -2457.8
## - Overall.Qual    1   25.4625  54.524 -2073.0
##
## Step:  AIC=-2671.66
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Remod.Add  1    0.9897  24.213 -2700.9
## + Total.Bsmt.SF   1    0.5883  24.615 -2688.1
## + Garage.Area     1    0.3165  24.886 -2679.5
## + Garage.Cars     1    0.2967  24.906 -2678.9
## + Garage.Yr.Blt   1    0.1158  25.087 -2673.2
## <none>                         25.203 -2671.7
## + Mas.Vnr.Area    1    0.0005  25.202 -2669.7
## - X1st.Flr.SF     1    3.3530  28.556 -2576.1
## - Year.Built      1    3.8583  29.061 -2562.4
## - area            1    5.6188  30.822 -2516.5
## - Overall.Qual    1    9.7113  34.914 -2419.1
##
## Step:  AIC=-2700.94
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add
##
##                  Df Sum of Sq     RSS     AIC
## + Total.Bsmt.SF   1    0.7465  23.467 -2723.4
## + Garage.Area     1    0.2621  23.951 -2707.4
## + Garage.Cars     1    0.2114  24.002 -2705.8
## <none>                         24.213 -2700.9
## + Garage.Yr.Blt   1    0.0235  24.190 -2699.7
## + Mas.Vnr.Area    1    0.0044  24.209 -2699.1
## - Year.Remod.Add  1    0.9897  25.203 -2671.7
## - Year.Built      1    1.7765  25.990 -2647.7
## - X1st.Flr.SF     1    3.5271  27.740 -2596.7
## - area            1    5.0018  29.215 -2556.3
## - Overall.Qual    1    7.9512  32.164 -2481.2
##
## Step:  AIC=-2723.4
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF
##
##                  Df Sum of Sq     RSS     AIC
## + Garage.Area     1    0.2032  23.264 -2728.2
## + Garage.Cars     1    0.1794  23.287 -2727.4
## <none>                         23.467 -2723.4
## + Garage.Yr.Blt   1    0.0180  23.449 -2722.0
## + Mas.Vnr.Area    1    0.0008  23.466 -2721.4
## - X1st.Flr.SF     1    0.2878  23.755 -2715.9
## - Total.Bsmt.SF   1    0.7465  24.213 -2700.9
## - Year.Remod.Add  1    1.1479  24.614 -2688.1
## - Year.Built      1    1.3059  24.773 -2683.1
## - area            1    5.4072  28.874 -2563.4
## - Overall.Qual    1    6.7150  30.182 -2528.9
##
## Step:  AIC=-2728.19
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF + Garage.Area
##
##                  Df Sum of Sq     RSS     AIC
## <none>                         23.264 -2728.2
## + Garage.Cars     1    0.0168  23.247 -2726.8
## + Garage.Yr.Blt   1    0.0025  23.261 -2726.3
## + Mas.Vnr.Area    1    0.0000  23.264 -2726.2
## - Garage.Area     1    0.2032  23.467 -2723.4
## - X1st.Flr.SF     1    0.2138  23.477 -2723.1
## - Total.Bsmt.SF   1    0.6875  23.951 -2707.4
## - Year.Built      1    1.0109  24.274 -2697.0
## - Year.Remod.Add  1    1.0885  24.352 -2694.5
## - area            1    4.8307  28.094 -2582.8
## - Overall.Qual    1    6.5842  29.848 -2535.6

# BIC: same search, with the penalty k = log(n)
bic <- step(null, scope = list(upper = upperModel), data = ames_train.numeric.centered,
            direction = 'both',
            k = log(nrow(ames_train.numeric)))

## Start:  AIC=-1324.67
## log(price) ~ 1
##
##                  Df Sum of Sq     RSS     AIC
## + Overall.Qual    1    99.421  42.593 -2258.5
## + area            1    73.808  68.205 -1890.8
## + Total.Bsmt.SF   1    66.370  75.643 -1810.0
## + Garage.Cars     1    65.607  76.407 -1802.1
## + X1st.Flr.SF     1    64.510  77.504 -1791.0
## + Garage.Area     1    58.203  83.810 -1729.9
## + Year.Built      1    57.412  84.601 -1722.5
## + Year.Remod.Add  1    55.435  86.579 -1704.5
## + Garage.Yr.Blt   1    53.613  88.400 -1688.2
## + Mas.Vnr.Area    1    32.849 109.164 -1523.5
## <none>                        142.013 -1324.7
##
## Step:  AIC=-2258.53
## log(price) ~ Overall.Qual
##
##                  Df Sum of Sq     RSS     AIC
## + X1st.Flr.SF     1     9.291  33.301 -2444.1
## + area            1     9.282  33.311 -2443.8
## + Total.Bsmt.SF   1     7.991  34.601 -2414.2
## + Garage.Cars     1     6.096  36.497 -2372.5
## + Garage.Area     1     6.030  36.563 -2371.1
## + Year.Remod.Add  1     3.484  39.109 -2318.5
## + Garage.Yr.Blt   1     3.063  39.530 -2310.2
## + Year.Built      1     2.957  39.636 -2308.1
## + Mas.Vnr.Area    1     1.473  41.119 -2279.4
## <none>                         42.593 -2258.5
## - Overall.Qual    1    99.421 142.013 -1324.7
##
## Step:  AIC=-2444.06
## log(price) ~ Overall.Qual + X1st.Flr.SF
##
##                  Df Sum of Sq     RSS     AIC
## + area            1     4.240  29.061 -2543.8
## + Year.Remod.Add  1     3.297  30.004 -2518.8
## + Garage.Cars     1     2.517  30.784 -2498.8
## + Garage.Yr.Blt   1     2.513  30.788 -2498.7
## + Year.Built      1     2.480  30.822 -2497.8
## + Garage.Area     1     1.918  31.383 -2483.7
## + Total.Bsmt.SF   1     0.543  32.758 -2450.2
## <none>                         33.301 -2444.1
## + Mas.Vnr.Area    1     0.213  33.088 -2442.4
## - X1st.Flr.SF     1     9.291  42.593 -2258.5
## - Overall.Qual    1    44.202  77.504 -1791.0
##
## Step:  AIC=-2543.76
## log(price) ~ Overall.Qual + X1st.Flr.SF + area
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Built      1    3.8583  25.203 -2648.3
## + Garage.Yr.Blt   1    3.1420  25.919 -2626.5
## + Year.Remod.Add  1    3.0715  25.990 -2624.3
## + Garage.Cars     1    1.4694  27.592 -2577.6
## + Garage.Area     1    1.2605  27.801 -2571.7
## + Total.Bsmt.SF   1    1.1796  27.882 -2569.5
## <none>                         29.061 -2543.8
## + Mas.Vnr.Area    1    0.0309  29.030 -2537.9
## - area            1    4.2400  33.301 -2444.1
## - X1st.Flr.SF     1    4.2494  33.311 -2443.8
## - Overall.Qual    1   25.4625  54.524 -2059.0
##
## Step:  AIC=-2648.35
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built
##
##                  Df Sum of Sq     RSS     AIC
## + Year.Remod.Add  1    0.9897  24.213 -2673.0
## + Total.Bsmt.SF   1    0.5883  24.615 -2660.1
## + Garage.Area     1    0.3165  24.886 -2651.6
## + Garage.Cars     1    0.2967  24.906 -2650.9
## <none>                         25.203 -2648.3
## + Garage.Yr.Blt   1    0.1158  25.087 -2645.3
## + Mas.Vnr.Area    1    0.0005  25.202 -2641.7
## - X1st.Flr.SF     1    3.3530  28.556 -2557.5
## - Year.Built      1    3.8583  29.061 -2543.8
## - area            1    5.6188  30.822 -2497.8
## - Overall.Qual    1    9.7113  34.914 -2400.5
##
## Step:  AIC=-2672.98
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add
##
##                  Df Sum of Sq     RSS     AIC
## + Total.Bsmt.SF   1    0.7465  23.467 -2690.8
## + Garage.Area     1    0.2621  23.951 -2674.8
## + Garage.Cars     1    0.2114  24.002 -2673.2
## <none>                         24.213 -2673.0
## + Garage.Yr.Blt   1    0.0235  24.190 -2667.1
## + Mas.Vnr.Area    1    0.0044  24.209 -2666.5
## - Year.Remod.Add  1    0.9897  25.203 -2648.3
## - Year.Built      1    1.7765  25.990 -2624.3
## - X1st.Flr.SF     1    3.5271  27.740 -2573.4
## - area            1    5.0018  29.215 -2533.0
## - Overall.Qual    1    7.9512  32.164 -2457.9
##
## Step:  AIC=-2690.78
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF
##
##                  Df Sum of Sq     RSS     AIC
## + Garage.Area     1    0.2032  23.264 -2690.9
## <none>                         23.467 -2690.8
## + Garage.Cars     1    0.1794  23.287 -2690.1
## - X1st.Flr.SF     1    0.2878  23.755 -2687.9
## + Garage.Yr.Blt   1    0.0180  23.449 -2684.7
## + Mas.Vnr.Area    1    0.0008  23.466 -2684.1
## - Total.Bsmt.SF   1    0.7465  24.213 -2673.0
## - Year.Remod.Add  1    1.1479  24.614 -2660.1
## - Year.Built      1    1.3059  24.773 -2655.1
## - area            1    5.4072  28.874 -2535.5
## - Overall.Qual    1    6.7150  30.182 -2500.9
##
## Step:  AIC=-2690.91
## log(price) ~ Overall.Qual + X1st.Flr.SF + area + Year.Built +
##     Year.Remod.Add + Total.Bsmt.SF + Garage.Area
##
##                  Df Sum of Sq     RSS     AIC
## <none>                         23.264 -2690.9
## - Garage.Area     1    0.2032  23.467 -2690.8
## - X1st.Flr.SF     1    0.2138  23.477 -2690.4
## + Garage.Cars     1    0.0168  23.247 -2684.8
## + Garage.Yr.Blt   1    0.0025  23.261 -2684.3
## + Mas.Vnr.Area    1    0.0000  23.264 -2684.2
## - Total.Bsmt.SF   1    0.6875  23.951 -2674.8
## - Year.Built      1    1.0109  24.274 -2664.3
## - Year.Remod.Add  1    1.0885  24.352 -2661.8
## - area            1    4.8307  28.094 -2550.2
## - Overall.Qual    1    6.5842  29.848 -2502.9

# print results
aic

##
## Call:
## lm(formula = log(price) ~ Overall.Qual + X1st.Flr.SF + area +
##     Year.Built + Year.Remod.Add + Total.Bsmt.SF + Garage.Area,
##     data = ames_train.numeric.centered)
##
## Coefficients:
##    (Intercept)    Overall.Qual     X1st.Flr.SF            area
##      1.204e+01       1.080e-01       8.559e-05       2.285e-04
##     Year.Built  Year.Remod.Add   Total.Bsmt.SF     Garage.Area
##      1.794e-03       2.493e-03       1.365e-04       1.117e-04

bic

##
## Call:
## lm(formula = log(price) ~ Overall.Qual + X1st.Flr.SF + area +
##     Year.Built + Year.Remod.Add + Total.Bsmt.SF + Garage.Area,
##     data = ames_train.numeric.centered)
##
## Coefficients:
##    (Intercept)    Overall.Qual     X1st.Flr.SF            area
##      1.204e+01       1.080e-01       8.559e-05       2.285e-04
##     Year.Built  Year.Remod.Add   Total.Bsmt.SF     Garage.Area
##      1.794e-03       2.493e-03       1.365e-04       1.117e-04

2.2.3 Section 2.3 Initial Model Residuals
One way to assess the performance of a model is to examine the model's residuals. In the space below,
create a residual plot for your preferred model from above and use it to assess whether your model appears
to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly
discuss potential implications it may have for your model and inference/prediction you might produce.

From the residual plots, we can see that the residuals roughly follow a normal distribution and do not seem to
have any non-linear trend. Already in the first plot, we can see there might be three outliers (rows 251, 343 and
582) which can negatively influence our least squares estimates. For now, I keep them as they are, but I will
evaluate them afterwards.

The normal Q-Q plot confirms that the residuals do not depart much from the ideal Gaussian
distribution.

Lastly, the Cook's distance plot shows that two of the three possible outliers have high leverage. I will evaluate
them shortly.

par(mfrow=c(2,2))
plot(aic)

par(mfrow=c(1,1))
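
The Cook's distances behind the fourth diagnostic plot can also be extracted directly, which makes it easier to list the flagged rows. This is a sketch on mtcars (stand-in data); the 4/n cut-off is a common rule of thumb and an assumption here, not something the analysis above relies on.

```r
# Stand-in model; in the report the fitted object would be `aic`.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One Cook's distance per observation.
cd <- cooks.distance(fit)

# Rule-of-thumb threshold (assumption): flag points with distance > 4 / n.
threshold <- 4 / nrow(mtcars)
flagged <- names(cd)[cd > threshold]
flagged

# The three most influential observations, largest first.
head(sort(cd, decreasing = TRUE), 3)
```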

2.2.4 Section 2.4 Initial Model RMSE
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending
on whether you transformed your response variable). The value you report will be more meaningful if it is in
the original units (dollars).

The RMSE for the training data is 40297.18 dollars. The RMSE can be interpreted as the standard deviation of the
prediction errors around the observed values. In this sense, on average, the model predicts the house price with
a deviation of ±40297.18 dollars. Keep in mind that this measure is sensitive to outliers such as the ones we
have seen; overall, though, it is a reasonably small RMSE, given that price ranges over [12790, 615000].

# back-transform the fitted log(price) values to dollars before computing the error
RMSE <- sqrt(mean((ames_train.numeric.centered$price - exp(aic$fitted.values))^2))
RMSE

## [1] 40297.18
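
Because the model is fit on log(price), predictions have to be exponentiated before they are compared with prices in dollars. A small helper makes the units explicit; the name rmse_dollars and the toy numbers are hypothetical.

```r
# RMSE in dollars for a model that predicts log(price).
rmse_dollars <- function(actual_price, predicted_log_price) {
  sqrt(mean((actual_price - exp(predicted_log_price))^2))
}

# Toy check: both predictions are off by exactly 10000 dollars.
actual   <- c(100000, 200000)
pred_log <- log(c(110000, 190000))
toy_rmse <- rmse_dollars(actual, pred_log)
toy_rmse
```

One caveat: exp() of a predicted log(price) is closer to a median than to a mean prediction when the errors are roughly normal on the log scale, so an RMSE computed this way can be slightly pessimistic about the model.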

2.2.5 Section 2.5 Overfitting
The process of building a model generally involves starting with an initial model (as you have done above),
identifying its shortcomings, and adapting the model accordingly. This process may be repeated several
times until the model fits the data reasonably well. However, the model may do well on training data but
perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the
model is overly tuned to specifically fit the training data. This is called overfitting. To determine whether
overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample
data sets. To look at performance of your initial model on out-of-sample data, you will use the data
set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the
predictions significantly more accurate (compared to the actual sales prices) for the training data than the
test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

To predict the test data, I first selected again only the numerical variables, in order to center them, as I did
with the training data.

The RMSE for the test data was 58072.19 dollars, a larger error than the one we saw on the training
dataset. This difference makes sense: ordinary least squares chooses the coefficients that minimize the squared
distance of the training points to the line (hyperplane), so the training error is an optimistic estimate.

The residual plot shows a roughly normal distribution around zero, which the Q-Q plot supports. We can see
there is a possible outlier among the predictions but, as we are dealing with the test dataset, there is nothing
we can do.

# preprocess data
ames_test.numeric.centered <- ames_test[, integer.cols.index]
ames_test.numeric.centered <- as.data.frame(scale(ames_test.numeric.centered, center = TRUE,
                                                  scale = FALSE))

predictions <- predict(aic, ames_test.numeric.centered)
# log rmse = 0.16
RMSEq6 <- sqrt(mean((ames_test$price - exp(predictions))^2))
RMSEq6

## [1] 58072.19

# residuals on the log scale (prediction minus observed)
residualsq6 <- predictions - log(ames_test$price)

par(mfrow = c(1, 2))
plot(residualsq6, main = 'Residuals for test data')
qqnorm(residualsq6)
qqline(residualsq6)

par(mfrow = c(1, 1))
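
Note that the preprocessing above centers the test set on its own column means. A common alternative, sketched here under the assumption that the training means should define the centering, is to reuse the means that scale() stored for the training data, so the intercept keeps the same meaning for new houses. The two toy data frames are made up.

```r
# Toy training and test sets (hypothetical values).
train <- data.frame(area = c(1000, 1500, 2000), Overall.Qual = c(5, 6, 7))
test  <- data.frame(area = c(1200, 2600),       Overall.Qual = c(4, 8))

# Center the training data and keep the column means that were subtracted.
train_centered <- scale(train, center = TRUE, scale = FALSE)
train_means    <- attr(train_centered, "scaled:center")

# Center the test data with the TRAINING means, not its own.
test_centered <- as.data.frame(scale(test, center = train_means, scale = FALSE))
test_centered
```

If the training and test means are close, as one would expect from a random split, the two approaches give similar predictions; the difference grows when the test set is small or drawn from a different period.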

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data
fits your model a lot better than the test data, it is probably a good idea to go back and revise the model
(usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the
assignment, however.

2.3 Part 3 - Development of a Final Model
Now that you have developed an initial model to use as a baseline, create a final model with at most 20
variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and
using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the
questions below.

2.3.1 Section 3.1 Final Model
2.3.1.1 Outliers Analysis
First, I analysed the three outliers of the dataset in order to decide whether to remove them. After a first
inspection, I found some values that might not be representative of the data (extreme values).

# rows 251, 343 and 582
rowsOfInterest <- c(251, 343, 582)
ames_train.numeric[rowsOfInterest, c('area', 'Year.Built', 'price')]

## # A tibble: 3 x 3
##    area Year.Built  price
##   <int>      <int>  <int>
## 1  4676       2007 184750
## 2   832       1923  12789
## 3  1317       1920  40000

First, for observation 251, we can see that it is the biggest house in the dataset, and really far from
the second biggest, as we can see in the plot below. So, as the point is not very representative, I decided to
remove it, as it can have a high influence on the least squares fit.

max(ames_train.numeric$area)

## [1] 4676

plot(ames_train.numeric$area, ames_train.numeric$price)

# ^ decided to remove obs 251 ^

Second, points 343 and 582 came to my attention because they were really old houses, and really cheap
ones. In fact, other houses built in the same period have a mean price a lot higher than these two, with
observation 343 being the more extreme case. Observation 582 also had a low price but, as it is
closer to the mean price of houses of the same age, I decided not to remove it.

sum(ames_train.numeric$Year.Built <= 1923)

## [1] 65

summary(ames_train.numeric$price[ames_train.numeric$Year.Built < 1923])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   40000  105500  124800  124700  148400  266000

plot(ames_train.numeric$Year.Built[ames_train.numeric$Year.Built <= 1923],
     ames_train.numeric$price[ames_train.numeric$Year.Built <= 1923],
     main = 'Price Distribution by Year.Built',
     xlab = 'Year.Built',
     ylab = 'price')

# ^ decided to remove obs 343, but NOT remove 582 ^

2.3.1.2 Final model
Provide the summary table for your model.

Finally, I had to do two more transformations before running the regression. As I included factor (character)
variables, some insights (and problems) appeared.

* I saw that the NA value in the basement quality variable meant "no basement" so, in order to use this
information in the regression, I transformed the NA values into a factor level.

* When I included Neighborhood, I started to get a "not full rank matrix" error, which meant I had
linearly dependent variables among my predictors. Inspecting my data, I found that, when converting
the Neighborhood column to dummy variables, two Neighborhoods had a count of 0 (GrnHill and
LandMrk), and that would give me two columns of only 0 (and problems!). So I converted these two
factor levels into a single one, GrnHill_LandMrk.
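
The all-zero dummy columns can be reproduced on a toy factor. In this sketch the factor nb and its levels are invented: model.matrix() keeps a column for every declared level, and droplevels() removes the empty ones.

```r
# Factor with four declared levels, two of which never occur.
nb <- factor(c("A", "A", "B", "B", "B"), levels = c("A", "B", "C", "D"))
table(nb)

# The design matrix gets a dummy column for every non-reference level,
# including the empty ones, so its rank is lower than its column count.
X <- model.matrix(~ nb)
ncol(X)
qr(X)$rank

# After dropping the unused levels the design matrix is full rank.
nb2 <- droplevels(nb)
X2 <- model.matrix(~ nb2)
ncol(X2) == qr(X2)$rank
```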

Now, time for the regression! I included the four categorical variables from the earlier analysis that, on their
own, reached an adjusted \(R^2\) of more than 0.5.

### Remove early rows with NA and also outliers
# note: 251 and 343 were identified as row positions in the NA-filtered ames_train.numeric;
# here they index the unfiltered ames_train, so they may not select the same observations
ames_train.2 <- ames_train[-c(na.rows, 251, 343), ]

# rescale numerical values to zero mean, except for price
ames_train.2[, integer.cols.index] <- scale(ames_train.2[, integer.cols.index], center = TRUE, scale = FALSE)
ames_train.2$price <- ames_train[-c(na.rows, 251, 343), ]$price

# convert NA to a factor level
ames_train.2.Qual_sub <- as.character(ames_train.2$Bsmt.Qual)
ames_train.2.Qual_sub[is.na(ames_train.2.Qual_sub)] <- 'NA'
ames_train.2$Bsmt.Qual <- relevel(factor(ames_train.2.Qual_sub), ref = 'TA')

# create new factor level GrnHill_LandMrk and drop unused ones.
ames_train.2$Neighborhood <- droplevels(ames_train.2$Neighborhood)
ames_train.2$Neighborhood <- factor(ames_train.2$Neighborhood, levels =
                                      c(levels(ames_train.2$Neighborhood), 'GrnHill_LandMrk'))

testModRel2 <- lm(log(price) ~ Fireplaces + BsmtFin.SF.1 + TotRms.AbvGrd +
                    Full.Bath + Mas.Vnr.Area + Garage.Yr.Blt + Year.Remod.Add +
                    Year.Built + Garage.Area + Garage.Cars + X1st.Flr.SF +
                    Total.Bsmt.SF + area + Overall.Qual + Neighborhood +
                    Bsmt.Qual + Exter.Qual + Kitchen.Qual,
                  data = ames_train.2)

# prepare stepwise AIC model selection
null <- lm(log(price) ~ 1, data = ames_train.2)

# Exter.Qual, Garage.Yr.Blt, Garage.Cars, Mas.Vnr.Area, X1st.Flr.SF and Bsmt.Qual were left out
aic2 <- step(null, scope = list(upper = testModRel2), data = ames_train.2, direction = 'both')

## Start:  AIC=-1327.74
## log(price) ~ 1
## 
##                  Df Sum of Sq     RSS     AIC
## + Overall.Qual    1    98.739  42.581 -2260.2
## + Neighborhood   25    86.916  54.403 -2021.4
## + Bsmt.Qual       5    73.884  67.436 -1894.1
## + area            1    73.135  68.185 -1893.5
## + Exter.Qual      3    72.001  69.319 -1876.6
## + Kitchen.Qual    4    69.538  71.782 -1847.4
## + Total.Bsmt.SF   1    65.792  75.527 -1813.8
## + Garage.Cars     1    65.027  76.292 -1806.0
## + X1st.Flr.SF     1    64.008  77.312 -1795.6
## + Garage.Area     1    57.871  83.449 -1736.1
## + Year.Built      1    56.995  84.324 -1728.0
## + Year.Remod.Add  1    55.043  86.276 -1710.2
## + Garage.Yr.Blt   1    53.208  88.111 -1693.8
## + Full.Bath       1    49.295  92.025 -1659.9
## + TotRms.AbvGrd   1    39.422 101.898 -1580.5
## + Fireplaces      1    35.555 105.764 -1551.5
## + Mas.Vnr.Area    1    32.249 109.070 -1527.5
## + BsmtFin.SF.1    1    30.810 110.509 -1517.3
## <none>                        141.319 -1327.7
## 
## Step:  AIC=-2260.25
## log(price) ~ Overall.Qual
## 
##                  Df Sum of Sq     RSS     AIC
## + X1st.Flr.SF     1     9.295  33.285 -2450.1
## + area            1     9.273  33.308 -2449.6
## + Neighborhood   25    10.470  32.110 -2430.1
## + Total.Bsmt.SF   1     7.988  34.592 -2420.1
## + BsmtFin.SF.1    1     6.404  36.177 -2385.2
## + Garage.Cars     1     6.087  36.494 -2378.4
## + Garage.Area     1     6.033  36.547 -2377.3
## + TotRms.AbvGrd   1     5.096  37.485 -2357.6
## + Bsmt.Qual       5     4.088  38.493 -2328.9
## + Fireplaces      1     3.524  39.056 -2325.6
## + Year.Remod.Add  1     3.492  39.088 -2324.9
## + Kitchen.Qual    4     3.772  38.808 -2324.5
## + Garage.Yr.Blt   1     3.070  39.511 -2316.5
## + Year.Built      1     2.965  39.615 -2314.5
## + Full.Bath       1     2.803  39.777 -2311.3
## + Exter.Qual      3     2.689  39.892 -2305.1
## + Mas.Vnr.Area    1     1.462  41.119 -2285.5
## <none>                         42.581 -2260.2
## - Overall.Qual    1    98.739 141.319 -1327.7
## 
## Step:  AIC=-2450.11
## log(price) ~ Overall.Qual + X1st.Flr.SF
## 
##                  Df Sum of Sq     RSS     AIC
## + Neighborhood   25     6.374  26.911 -2565.7
## + area            1     4.246  29.039 -2554.4
## + Year.Remod.Add  1     3.287  29.998 -2529.1
## + Garage.Cars     1     2.515  30.771 -2509.3
## + Garage.Yr.Blt   1     2.503  30.782 -2509.0
## + Year.Built      1     2.469  30.816 -2508.2
## + Bsmt.Qual       5     2.745  30.540 -2507.2
## + TotRms.AbvGrd   1     2.351  30.934 -2505.2
## + Kitchen.Qual    4     2.416  30.870 -2500.8
## + Garage.Area     1     1.919  31.366 -2494.4
## + BsmtFin.SF.1    1     1.800  31.486 -2491.4
## + Full.Bath       1     1.630  31.655 -2487.2
## + Exter.Qual      3     1.531  31.755 -2480.8
## + Fireplaces      1     1.183  32.102 -2476.3
## + Total.Bsmt.SF   1     0.546  32.739 -2461.0
## + Mas.Vnr.Area    1     0.222  33.063 -2453.3
## <none>                         33.285 -2450.1
## - X1st.Flr.SF     1     9.295  42.581 -2260.2
## - Overall.Qual    1    44.027  77.312 -1795.6
## 
## Step:  AIC=-2565.7
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood
## 
##                  Df Sum of Sq     RSS     AIC
## + area            1    4.8159  22.095 -2717.3
## + TotRms.AbvGrd   1    2.3905  24.521 -2636.2
## + Year.Remod.Add  1    1.7361  25.175 -2615.7
## + Kitchen.Qual    4    1.8209  25.090 -2612.3
## + BsmtFin.SF.1    1    1.2826  25.629 -2601.7
## + Garage.Cars     1    1.1249  25.786 -2597.0
## + Fireplaces      1    1.0996  25.812 -2596.2
## + Garage.Area     1    0.9719  25.939 -2592.3
## + Full.Bath       1    0.7726  26.139 -2586.4
## + Bsmt.Qual       5    0.9392  25.972 -2583.4
## + Garage.Yr.Blt   1    0.5640  26.347 -2580.2
## + Year.Built      1    0.3266  26.585 -2573.2
## + Exter.Qual      3    0.4113  26.500 -2571.7
## + Total.Bsmt.SF   1    0.2176  26.694 -2570.0
## + Mas.Vnr.Area    1    0.1143  26.797 -2567.0
## <none>                         26.911 -2565.7
## - Neighborhood   25    6.3739  33.285 -2450.1
## - X1st.Flr.SF     1    5.1989  32.110 -2430.1
## - Overall.Qual    1   12.8667  39.778 -2263.3
## 
## Step:  AIC=-2717.3
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area
## 
##                  Df Sum of Sq     RSS     AIC
## + BsmtFin.SF.1    1    1.9491  20.146 -2787.2
## + Year.Remod.Add  1    1.4024  20.693 -2766.4
## + Kitchen.Qual    4    1.2681  20.827 -2755.3
## + Bsmt.Qual       5    1.1264  20.969 -2748.1
## + Year.Built      1    0.8413  21.254 -2745.5
## + Total.Bsmt.SF   1    0.6682  21.427 -2739.2
## + Garage.Yr.Blt   1    0.6454  21.450 -2738.4
## + Garage.Area     1    0.4223  21.673 -2730.3
## + Exter.Qual      3    0.4843  21.611 -2728.6
## + Garage.Cars     1    0.3552  21.740 -2727.9
## + Fireplaces      1    0.2425  21.853 -2723.9
## <none>                         22.095 -2717.3
## + TotRms.AbvGrd   1    0.0121  22.083 -2715.7
## + Mas.Vnr.Area    1    0.0080  22.087 -2715.6
## + Full.Bath       1    0.0067  22.089 -2715.5
## - X1st.Flr.SF     1    1.5177  23.613 -2667.6
## - area            1    4.8159  26.911 -2565.7
## - Neighborhood   25    6.9434  29.039 -2554.4
## - Overall.Qual    1    6.6202  28.715 -2515.2
## 
## Step:  AIC=-2787.24
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area + 
##     BsmtFin.SF.1
## 
##                  Df Sum of Sq     RSS     AIC
## + Year.Remod.Add  1    1.3907  18.756 -2841.0
## + Kitchen.Qual    4    1.1209  19.025 -2823.8
## + Garage.Yr.Blt   1    0.5611  19.585 -2807.2
## + Year.Built      1    0.5426  19.604 -2806.5
## + Bsmt.Qual       5    0.5450  19.601 -2798.6
## + Garage.Area     1    0.3355  19.811 -2798.3
## + Exter.Qual      3    0.4072  19.739 -2797.1
## + Garage.Cars     1    0.2846  19.862 -2796.3
## + Total.Bsmt.SF   1    0.1952  19.951 -2792.8
## + Fireplaces      1    0.0832  20.063 -2788.5
## <none>                         20.146 -2787.2
## + TotRms.AbvGrd   1    0.0061  20.140 -2785.5
## + Mas.Vnr.Area    1    0.0011  20.145 -2785.3
## + Full.Bath       1    0.0001  20.146 -2785.2
## - X1st.Flr.SF     1    0.3611  20.507 -2775.4
## - BsmtFin.SF.1    1    1.9491  22.095 -2717.3
## - Neighborhood   25    6.2932  26.439 -2625.5
## - area            1    5.4823  25.629 -2601.7
## - Overall.Qual    1    6.2572  26.403 -2578.5
## 
## Step:  AIC=-2840.96
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area + 
##     BsmtFin.SF.1 + Year.Remod.Add
## 
##                  Df Sum of Sq     RSS     AIC
## + Garage.Area     1    0.3127  18.443 -2852.1
## + Kitchen.Qual    4    0.4336  18.322 -2851.2
## + Total.Bsmt.SF   1    0.2775  18.478 -2850.6
## + Year.Built      1    0.2556  18.500 -2849.7
## + Garage.Yr.Blt   1    0.2459  18.510 -2849.2
## + Bsmt.Qual       5    0.4272  18.328 -2848.9
## + Garage.Cars     1    0.2274  18.528 -2848.5
## + Fireplaces      1    0.1736  18.582 -2846.2
## + Exter.Qual      3    0.1537  18.602 -2841.4
## <none>                         18.756 -2841.0
## + TotRms.AbvGrd   1    0.0267  18.729 -2840.1
## + Full.Bath       1    0.0153  18.740 -2839.6
## + Mas.Vnr.Area    1    0.0000  18.756 -2839.0
## - X1st.Flr.SF     1    0.3627  19.118 -2828.0
## - Year.Remod.Add  1    1.3907  20.146 -2787.2
## - BsmtFin.SF.1    1    1.9373  20.693 -2766.4
## - Neighborhood   25    4.5251  23.281 -2722.6
## - Overall.Qual    1    4.8659  23.622 -2663.3
## - area            1    5.1261  23.882 -2654.7
## 
## Step:  AIC=-2852.06
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area + 
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area
## 
##                  Df Sum of Sq     RSS     AIC
## + Kitchen.Qual    4    0.4108  18.032 -2861.6
## + Total.Bsmt.SF   1    0.2459  18.197 -2860.5
## + Bsmt.Qual       5    0.3952  18.048 -2858.9
## + Year.Built      1    0.1918  18.251 -2858.2
## + Fireplaces      1    0.1888  18.254 -2858.1
## + Garage.Yr.Blt   1    0.0898  18.353 -2853.9
## <none>                         18.443 -2852.1
## + Exter.Qual      3    0.1399  18.303 -2852.0
## + TotRms.AbvGrd   1    0.0291  18.414 -2851.3
## + Full.Bath       1    0.0132  18.430 -2850.6
## + Garage.Cars     1    0.0069  18.436 -2850.3
## + Mas.Vnr.Area    1    0.0009  18.442 -2850.1
## - X1st.Flr.SF     1    0.1907  18.634 -2846.0
## - Garage.Area     1    0.3127  18.756 -2841.0
## - Year.Remod.Add  1    1.3678  19.811 -2798.3
## - BsmtFin.SF.1    1    1.8537  20.297 -2779.4
## - Neighborhood   25    4.3054  22.748 -2738.6
## - area            1    4.6015  23.044 -2680.5
## - Overall.Qual    1    4.7691  23.212 -2674.9
## 
## Step:  AIC=-2861.61
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area + 
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual
## 
##                  Df Sum of Sq     RSS     AIC
## + Total.Bsmt.SF   1    0.1867  17.845 -2867.7
## + Fireplaces      1    0.1767  17.855 -2867.3
## + Bsmt.Qual       5    0.3513  17.681 -2866.9
## + Year.Built      1    0.1008  17.931 -2864.0
## + Garage.Yr.Blt   1    0.0530  17.979 -2861.9
## + TotRms.AbvGrd   1    0.0463  17.986 -2861.6
## <none>                         18.032 -2861.6
## + Full.Bath       1    0.0256  18.006 -2860.7
## + Exter.Qual      3    0.1051  17.927 -2860.2
## + Garage.Cars     1    0.0050  18.027 -2859.8
## + Mas.Vnr.Area    1    0.0002  18.032 -2859.6
## - X1st.Flr.SF     1    0.1858  18.218 -2855.6
## - Kitchen.Qual    4    0.4108  18.443 -2852.1
## - Garage.Area     1    0.2899  18.322 -2851.2
## - Year.Remod.Add  1    0.6912  18.723 -2834.3
## - BsmtFin.SF.1    1    1.7405  19.773 -2791.8
## - Neighborhood   25    4.3771  22.409 -2742.3
## - Overall.Qual    1    3.9468  21.979 -2709.4
## - area            1    4.3518  22.384 -2695.2
## 
## Step:  AIC=-2867.71
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area + 
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual + 
##     Total.Bsmt.SF
## 
##                  Df Sum of Sq     RSS     AIC
## + Fireplaces      1    0.2250  17.620 -2875.6
## - X1st.Flr.SF     1    0.0011  17.846 -2869.7
## + Year.Built      1    0.0682  17.777 -2868.7
## + TotRms.AbvGrd   1    0.0534  17.792 -2868.0
## <none>                         17.845 -2867.7
## + Garage.Yr.Blt   1    0.0390  17.806 -2867.4
## + Full.Bath       1    0.0307  17.815 -2867.1
## + Bsmt.Qual       5    0.2002  17.645 -2866.5
## + Exter.Qual      3    0.1092  17.736 -2866.5
## + Garage.Cars     1    0.0058  17.840 -2866.0
## + Mas.Vnr.Area    1    0.0002  17.845 -2865.7
## - Total.Bsmt.SF   1    0.1867  18.032 -2861.6
## - Kitchen.Qual    4    0.3517  18.197 -2860.5
## - Garage.Area     1    0.2632  18.109 -2858.3
## - Year.Remod.Add  1    0.7630  18.608 -2837.1
## - BsmtFin.SF.1    1    1.3154  19.161 -2814.3
## - Neighborhood   25    4.2890  22.134 -2749.9
## - Overall.Qual    1    3.6090  21.454 -2726.2
## - area            1    4.5189  22.364 -2693.9
## 
## Step:  AIC=-2875.6
## log(price) ~ Overall.Qual + X1st.Flr.SF + Neighborhood + area + 
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual + 
##     Total.Bsmt.SF + Fireplaces
## 
##                  Df Sum of Sq     RSS     AIC
## - X1st.Flr.SF     1    0.0014  17.622 -2877.5
## + TotRms.AbvGrd   1    0.0624  17.558 -2876.4
## + Year.Built      1    0.0616  17.559 -2876.3
## + Garage.Yr.Blt   1    0.0561  17.564 -2876.1
## <none>                         17.620 -2875.6
## + Exter.Qual      3    0.1118  17.509 -2874.6
## + Full.Bath       1    0.0183  17.602 -2874.4
## + Garage.Cars     1    0.0123  17.608 -2874.1
## + Bsmt.Qual       5    0.1897  17.431 -2874.0
## + Mas.Vnr.Area    1    0.0011  17.619 -2873.6
## - Kitchen.Qual    4    0.3317  17.952 -2869.1
## - Fireplaces      1    0.2250  17.845 -2867.7
## - Total.Bsmt.SF   1    0.2349  17.855 -2867.3
## - Garage.Area     1    0.2743  17.895 -2865.6
## - Year.Remod.Add  1    0.8565  18.477 -2840.6
## - BsmtFin.SF.1    1    1.0806  18.701 -2831.2
## - Neighborhood   25    4.0474  21.668 -2764.5
## - Overall.Qual    1    3.3501  20.971 -2742.0
## - area            1    3.6629  21.283 -2730.5
## 
## Step:  AIC=-2877.53
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 + 
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF + 
##     Fireplaces
## 
##                  Df Sum of Sq     RSS     AIC
## + Year.Built      1    0.0630  17.559 -2878.3
## + TotRms.AbvGrd   1    0.0621  17.560 -2878.3
## + Garage.Yr.Blt   1    0.0573  17.564 -2878.1
## <none>                         17.622 -2877.5
## + Exter.Qual      3    0.1102  17.512 -2876.4
## + Full.Bath       1    0.0185  17.603 -2876.3
## + Garage.Cars     1    0.0122  17.610 -2876.1
## + Bsmt.Qual       5    0.1821  17.440 -2875.6
## + X1st.Flr.SF     1    0.0014  17.620 -2875.6
## + Mas.Vnr.Area    1    0.0012  17.621 -2875.6
## - Kitchen.Qual    4    0.3359  17.958 -2870.8
## - Fireplaces      1    0.2247  17.846 -2869.7
## - Garage.Area     1    0.2746  17.896 -2867.5
## - Total.Bsmt.SF   1    0.3847  18.006 -2862.7
## - Year.Remod.Add  1    0.8569  18.479 -2842.5
## - BsmtFin.SF.1    1    1.0792  18.701 -2833.2
## - Neighborhood   25    4.0737  21.695 -2765.5
## - Overall.Qual    1    3.3611  20.983 -2743.5
## - area            1    3.9371  21.559 -2722.4
## 
## Step:  AIC=-2878.32
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 + 
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF + 
##     Fireplaces + Year.Built
## 
##                  Df Sum of Sq     RSS     AIC
## + TotRms.AbvGrd   1    0.0669  17.492 -2879.3
## <none>                         17.559 -2878.3
## + Full.Bath       1    0.0341  17.525 -2877.8
## + Exter.Qual      3    0.1221  17.437 -2877.8
## - Year.Built      1    0.0630  17.622 -2877.5
## + Garage.Yr.Blt   1    0.0152  17.544 -2877.0
## + Garage.Cars     1    0.0071  17.552 -2876.6
## + Mas.Vnr.Area    1    0.0028  17.556 -2876.4
## + X1st.Flr.SF     1    0.0001  17.559 -2876.3
## + Bsmt.Qual       5    0.1670  17.392 -2875.8
## - Kitchen.Qual    4    0.2736  17.832 -2874.3
## - Fireplaces      1    0.2225  17.781 -2870.5
## - Garage.Area     1    0.2446  17.803 -2869.6
## - Total.Bsmt.SF   1    0.3585  17.917 -2864.6
## - Year.Remod.Add  1    0.7915  18.350 -2846.0
## - BsmtFin.SF.1    1    1.0276  18.586 -2836.0
## - Neighborhood   25    3.5256  21.084 -2785.8
## - Overall.Qual    1    3.1498  20.709 -2751.8
## - area            1    3.9755  21.534 -2721.3
## 
## Step:  AIC=-2879.3
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 + 
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF + 
##     Fireplaces + Year.Built + TotRms.AbvGrd
## 
##                  Df Sum of Sq     RSS     AIC
## + Full.Bath       1    0.0461  17.446 -2879.3
## <none>                         17.492 -2879.3
## + Exter.Qual      3    0.1133  17.378 -2878.4
## - TotRms.AbvGrd   1    0.0669  17.559 -2878.3
## - Year.Built      1    0.0678  17.560 -2878.3
## + Garage.Yr.Blt   1    0.0156  17.476 -2878.0
## + Garage.Cars     1    0.0031  17.489 -2877.4
## + Mas.Vnr.Area    1    0.0031  17.489 -2877.4
## + X1st.Flr.SF     1    0.0001  17.492 -2877.3
## + Bsmt.Qual       5    0.1772  17.315 -2877.2
## - Kitchen.Qual    4    0.2861  17.778 -2874.7
## - Fireplaces      1    0.2314  17.723 -2871.1
## - Garage.Area     1    0.2474  17.739 -2870.4
## - Total.Bsmt.SF   1    0.3709  17.863 -2864.9
## - Year.Remod.Add  1    0.8070  18.299 -2846.2
## - BsmtFin.SF.1    1    1.0744  18.566 -2834.9
## - area            1    1.4154  18.907 -2820.7
## - Neighborhood   25    3.3948  20.887 -2791.1
## - Overall.Qual    1    3.1836  20.675 -2751.0
## 
## Step:  AIC=-2879.35
## log(price) ~ Overall.Qual + Neighborhood + area + BsmtFin.SF.1 + 
##     Year.Remod.Add + Garage.Area + Kitchen.Qual + Total.Bsmt.SF + 
##     Fireplaces + Year.Built + TotRms.AbvGrd + Full.Bath
## 
##                  Df Sum of Sq     RSS     AIC
## <none>                         17.446 -2879.3
## - Full.Bath       1    0.0461  17.492 -2879.3
## + Exter.Qual      3    0.1212  17.325 -2878.8
## + Garage.Yr.Blt   1    0.0170  17.429 -2878.1
## - TotRms.AbvGrd   1    0.0789  17.525 -2877.8
## + Garage.Cars     1    0.0044  17.441 -2877.6
## + Mas.Vnr.Area    1    0.0029  17.443 -2877.5
## - Year.Built      1    0.0876  17.533 -2877.4
## + X1st.Flr.SF     1    0.0000  17.446 -2877.4
## + Bsmt.Qual       5    0.1535  17.292 -2876.2
## - Kitchen.Qual    4    0.2914  17.737 -2874.4
## - Fireplaces      1    0.2130  17.659 -2871.9
## - Garage.Area     1    0.2417  17.688 -2870.6
## - Total.Bsmt.SF   1    0.3798  17.826 -2864.6
## - Year.Remod.Add  1    0.8226  18.268 -2845.5
## - BsmtFin.SF.1    1    1.0578  18.503 -2835.5
## - area            1    1.4426  18.888 -2819.5
## - Neighborhood   25    3.3382  20.784 -2793.0
## - Overall.Qual    1    3.1656  20.611 -2751.5

summary(aic2)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + area + 
##     BsmtFin.SF.1 + Year.Remod.Add + Garage.Area + Kitchen.Qual + 
##     Total.Bsmt.SF + Fireplaces + Year.Built + TotRms.AbvGrd + 
##     Full.Bath, data = ames_train.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81097 -0.05623  0.00748  0.07791  0.43378 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         1.204e+01  6.674e-02 180.363   <2e-16 ***
## Overall.Qual        8.772e-02  7.575e-03  11.580   <2e-16 ***
## NeighborhoodBlueste 1.141e-01  1.085e-01   1.052  0.29326
## NeighborhoodBrDale  2.156e-01  8.012e-02   2.691  0.00728 **
## NeighborhoodBrkSide 2.960e-02  7.253e-02   0.408  0.68338
## NeighborhoodClearCr 1.974e-01  9.952e-02   1.983  0.04768 *
## NeighborhoodCollgCr 4.707e-02  6.218e-02   0.757  0.44929
## NeighborhoodCrawfor 1.814e-01  7.303e-02   2.484  0.01323 *
## NeighborhoodEdwards 9.341e-03  6.725e-02   0.139  0.88957
## NeighborhoodGilbert 6.910e-02  6.451e-02   1.071  0.28449
## NeighborhoodGreens  1.393e-02  1.097e-01   0.127  0.89893
## NeighborhoodIDOTRR  1.461e-01  7.514e-02   1.944  0.05230 .
## NeighborhoodMeadowV 1.814e-01  8.195e-02   2.213  0.02718 *
## NeighborhoodMitchel 7.073e-02  6.750e-02   1.048  0.29507
## NeighborhoodNAmes   5.172e-02  6.539e-02   0.791  0.42918
## NeighborhoodNoRidge 8.770e-02  7.161e-02   1.225  0.22111
## NeighborhoodNPkVill 5.850e-02  9.858e-02   0.593  0.55309
## NeighborhoodNridgHt 1.173e-01  6.383e-02   1.838  0.06650 .
## NeighborhoodNWAmes  3.707e-03  6.812e-02   0.054  0.95662
## NeighborhoodOldTown 8.488e-02  7.243e-02   1.172  0.24162
## NeighborhoodSawyer  6.458e-02  6.785e-02   0.952  0.34150
## NeighborhoodSawyerW 6.020e-03  6.496e-02   0.093  0.92619
## NeighborhoodSomerst 8.649e-02  6.248e-02   1.384  0.16667
## NeighborhoodStoneBr 1.895e-01  7.065e-02   2.683  0.00746 **
## NeighborhoodSWISU   1.760e-02  8.613e-02   0.204  0.83816
## NeighborhoodTimber  1.193e-01  7.088e-02   1.683  0.09286 .
## NeighborhoodVeenker 1.201e-01  8.220e-02   1.461  0.14448
## area                2.076e-04  2.656e-05   7.817 1.86e-14 ***
## BsmtFin.SF.1        1.062e-04  1.586e-05   6.694 4.30e-11 ***
## Year.Remod.Add      2.470e-03  4.184e-04   5.903 5.44e-09 ***
## Garage.Area         1.278e-04  3.995e-05   3.200  0.00143 **
## Kitchen.QualFa      7.685e-02  5.472e-02   1.404  0.16061
## Kitchen.QualGd      1.105e-02  2.742e-02   0.403  0.68718
## Kitchen.QualPo      2.264e-01  1.774e-01   1.276  0.20229
## Kitchen.QualTA      7.025e-02  3.185e-02   2.206  0.02772 *
## Total.Bsmt.SF       8.056e-05  2.008e-05   4.011 6.65e-05 ***
## Fireplaces          3.404e-02  1.133e-02   3.004  0.00276 **
## Year.Built          9.136e-04  4.743e-04   1.926  0.05445 .
## TotRms.AbvGrd       1.198e-02  6.549e-03   1.828  0.06788 .
## Full.Bath           2.416e-02  1.728e-02   1.397  0.16268
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1536 on 739 degrees of freedom
## Multiple R-squared:  0.8766, Adjusted R-squared:  0.87
## F-statistic: 134.5 on 39 and 739 DF,  p-value: < 2.2e-16

2.3.2 Section 3.2 Transformation

I only conducted a study to visualize the predictor variables and determine whether they should be log or exponentially transformed. Although I could obtain some approximately normal distributions after careful analysis, I could not improve the final fit of the model, so I decided not to include variable transformations, given that a lot of work was necessary to evaluate each predictor individually.

2.3.3 Section 3.3 Variable Interaction
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

I could not think of any meaningful variable interaction in this dataset and did not try to fit all possible interactions, given that this could lead me to false positives when testing whether each coefficient was significant.

2.3.4 Section 3.4 Variable Selection
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.

Among the variable selection techniques we have seen in the course, I decided to use the AIC (Akaike Information Criterion) instead of the BIC (Bayesian Information Criterion) or stepwise selection using p-values. AIC was an easier metric to understand, it penalizes model size less strictly than BIC, and it does not involve hypothesis tests, like stepwise selection with p-values, which could lead me to false results.

The procedure of the final model selection can be seen in the first block of the Final model section.
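The difference between the two criteria comes down to the penalty per coefficient, which `step()` exposes through its `k` argument. The sketch below is only an illustration under stated assumptions: it uses the built-in `mtcars` data and an arbitrary set of predictors rather than the Ames variables, and the object names are hypothetical.

```r
# Built-in mtcars data used purely as a stand-in for the Ames training set
null_fit <- lm(mpg ~ 1, data = mtcars)
search_scope <- list(lower = ~ 1, upper = ~ wt + hp + disp + qsec + drat)
n <- nrow(mtcars)

# AIC: penalty of 2 per estimated coefficient (the default, k = 2)
fit_aic <- step(null_fit, scope = search_scope,
                direction = "both", k = 2, trace = 0)

# BIC: penalty of log(n) per coefficient, so extra terms cost more when n > 7
fit_bic <- step(null_fit, scope = search_scope,
                direction = "both", k = log(n), trace = 0)

formula(fit_aic)
formula(fit_bic)
```

Because `log(n)` exceeds 2 for any sample larger than 7 rows, the BIC search tends to stop at a model that is no larger than the AIC one, which is the sense in which AIC is the less strict criterion.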

2.3.5 Section 3.5 Model Testing

Testing on out-of-sample data confirmed that I was not overfitting my model. I did not present all the models I fitted in this assignment, but I checked the test data each time I added a few numerical or character attributes, to see whether the error on the test data improved (and did not increase).
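The check described above, comparing training and test error as predictors are added, can be sketched as follows. This is an illustrative sketch only: `rmse_from_log` is a hypothetical helper, and `mtcars` with a random split stands in for the Ames training and test sets.

```r
# RMSE on the original scale for a model fitted to a log-transformed response
# (hypothetical helper, not part of the original analysis)
rmse_from_log <- function(model, newdata, response) {
  pred <- exp(predict(model, newdata = newdata))
  sqrt(mean((newdata[[response]] - pred)^2))
}

# Random split of a built-in dataset, standing in for ames_train / ames_test
set.seed(1)
idx   <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

small <- lm(log(mpg) ~ wt, data = train)
large <- lm(log(mpg) ~ wt + hp + disp + qsec + drat + gear + carb, data = train)

# One column per model, one row per dataset
rmse_table <- sapply(list(small = small, large = large), function(m)
  c(train = rmse_from_log(m, train, "mpg"),
    test  = rmse_from_log(m, test,  "mpg")))
rmse_table
```

A model whose training RMSE keeps falling while its test RMSE rises is a candidate for overfitting, which is the pattern the check is meant to catch.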

2.4 Part 4 Final Model Assessment
2.4.1 Section 4.1 Final Model Residual
For your final model, create and briefly interpret an informative plot of the residuals.

To get the predictions for the test set, we must first perform the same operations we did with the training set: centering the numerical attributes and processing the factor variables.

The residuals from the test set also appear to be normally distributed around 0, and the Q-Q plot helps to show that the residuals do not deviate too much from normal.

# training data RMSE
testModRel2_RMSE <- sqrt(mean((ames_train.2$price - exp(aic2$fitted.values))^2))

### Test Data Preprocessing

load("ames_test.Rdata")
ames_test.2 <- ames_test

# center numerical values to zero mean, except for price
ames_test.2[, integer.cols.index] <- scale(ames_test[, integer.cols.index], center = TRUE, scale = FALSE)
ames_test.2$price <- ames_test$price

# check if any of these columns have NA in the test set
Test.cols.interest <- c('Fireplaces', 'BsmtFin.SF.1',
                        'Full.Bath', 'TotRms.AbvGrd',
                        'Year.Remod.Add', 'Year.Built',
                        'Total.Bsmt.SF', 'Garage.Area',
                        'area', 'Overall.Qual', 'Neighborhood',
                        'Kitchen.Qual')

Test.na.rows <- which(apply(ames_test.2[, Test.cols.interest], 1, function(row) {
  any(is.na(row))
}))
# ames_test.2 <- ames_test.2[-Test.na.rows, ]

# recode NA basement quality as its own level and align the Neighborhood levels
ames_test.2.Qual_sub <- as.character(ames_test.2$Bsmt.Qual)
ames_test.2.Qual_sub[is.na(ames_test.2.Qual_sub)] <- 'NA'
ames_test.2$Bsmt.Qual <- relevel(factor(ames_test.2.Qual_sub), ref = 'TA')
ames_test.2$Neighborhood <- droplevels(ames_test.2$Neighborhood)
ames_test.2$Neighborhood <- factor(ames_test.2$Neighborhood,
                                   levels = c(levels(ames_test.2$Neighborhood), 'GrnHill_LandMrk'))

# predictions and residuals on the log scale
testModRel2_predictions <- predict(aic2, ames_test.2)
residualsq10 <- testModRel2_predictions - log(ames_test.2$price)

par(mfrow = c(1, 2))
plot(residualsq10, main = 'Final Model Residuals for test data')
qqnorm(residualsq10)
qqline(residualsq10)

par(mfrow=c(1,1))

2.4.2 Section 4.2 Final Model RMSE
For your final model, calculate and briefly comment on the RMSE.

RMSEq11 <- sqrt(mean((ames_test.2$price - exp(testModRel2_predictions))^2))
RMSEq11

## [1] 46660.19

The RMSE for the test set was, as expected, higher than for the training set but, compared with the RMSE from the previous test set predictions, it showed a great improvement: the RMSE decreased from 58072.19 to 46660.19. As we are using the test set to evaluate out-of-sample performance while still choosing between models, this is also an optimistic version of the RMSE. For a better (more conservative) RMSE, we must make a final prediction on a set we have never seen before.

2.4.3 Section 4.3 Final Model Evaluation
What are some strengths and weaknesses of your model?

The model developed so far is a simple one, without further investigation of the remaining features, their interactions or polynomial terms. Despite this lack of new features, the model seemed to generalize well, as could be seen in the out-of-sample testing.

With few variables, the model seems to give a good and understandable fit, in the sense that the parameters can help the real estate employees decide whether their houses are being sold under or over price.

Some weaknesses:

Some categorical variables (Neighborhood, for example) have few examples in some of their levels, with some neighborhoods having a total count of fewer than 10 examples. This could give us unrealistic parameter estimates. A solution would be to get more examples, or to reduce the number of categories and see whether the model still gets a good fit.

I did not consider multicollinearity among the predictors. If there is some among the regression variables I used, we can have unstable coefficient estimates and, therefore, unstable predictions.

In future steps, I would address these two issues, as they can give unrealistic predictions for the house prices.
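One way the multicollinearity check could be approached is with variance inflation factors (VIF), computed here by hand in base R. This is a sketch under assumptions, not something the report ran: `vif_manual` is a hypothetical helper, it handles numeric predictors only, and `mtcars` stands in for the numeric Ames columns.

```r
# Variance inflation factor for each numeric predictor: 1 / (1 - R^2),
# where R^2 comes from regressing that predictor on all the others.
# (hypothetical helper; numeric predictors only)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# mtcars stands in for the numeric Ames predictors
vif_values <- vif_manual(mtcars, c("wt", "hp", "disp", "qsec"))
vif_values
```

A VIF of 1 means a predictor is uncorrelated with the others; values above roughly 5 to 10 are a commonly used rule of thumb for problematic collinearity, which in the Ames model would be worth checking for related pairs such as area and TotRms.AbvGrd.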

2.4.4 Section 4.4 Final Model Validation
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the "ames_validation" dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

Before applying the model to the final dataset, we must perform the same preprocessing steps we did with the previous datasets, that is, centering and factor variable processing.
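A note on the centering step: calling `scale(..., center = TRUE)` separately on each dataset subtracts that dataset's own mean, so the same raw value can map to different centered values in training and validation. A possible refinement, which is an assumption on my part and not what this report does, is to learn the centering constants on the training data and reuse them. The sketch below uses made-up numbers and hypothetical object names.

```r
# Toy stand-ins for the training and validation sets (made-up numbers)
train <- data.frame(area = c(1000, 1500, 2000), price = c(100, 150, 200))
valid <- data.frame(area = c(1800, 2200),       price = c(185, 215))

num_cols <- "area"

# Learn the centering constants on the training data only
centers <- colMeans(train[, num_cols, drop = FALSE])

# Apply the same constants to both sets
train_c <- train
valid_c <- valid
for (col in num_cols) {
  train_c[[col]] <- train[[col]] - centers[[col]]
  valid_c[[col]] <- valid[[col]] - centers[[col]]
}

valid_c$area
```

With this approach a 1800 square foot house is always encoded as 300 above the training mean, whichever dataset it appears in, so the fitted coefficients are applied to inputs on the same footing as those seen during training.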

# preprocess data
ames_validation.2 <- ames_validation
ames_validation.2[, integer.cols.index] <- scale(ames_validation[, integer.cols.index], center = TRUE, scale = FALSE)
ames_validation.2$price <- ames_validation$price

Test.cols.interest <- c('Fireplaces', 'BsmtFin.SF.1',
                        'Full.Bath', 'TotRms.AbvGrd',
                        'Year.Remod.Add', 'Year.Built',
                        'Total.Bsmt.SF', 'Garage.Area',
                        'area', 'Overall.Qual', 'Neighborhood',
                        'Kitchen.Qual')

Test.na.rows <- which(apply(ames_validation.2[, Test.cols.interest], 1, function(row) {
  any(is.na(row))
}))
# ames_test.2 <- ames_test.2[-Test.na.rows, ]

# recode NA basement quality as its own level
ames_validation.2.Qual_sub <- as.character(ames_validation.2$Bsmt.Qual)
ames_validation.2.Qual_sub[is.na(ames_validation.2.Qual_sub)] <- 'NA'
ames_validation.2$Bsmt.Qual <- relevel(factor(ames_validation.2.Qual_sub), ref = 'TA')

# merge GrnHill and Landmrk into one level, then drop those rows
# (these neighborhoods had no examples in the training data)
ames_validation.2.Neighborhood.TEMP <- as.character(ames_validation.2$Neighborhood)
ames_validation.2.Neighborhood.TEMP[ames_validation.2.Neighborhood.TEMP == 'GrnHill' |
                                    ames_validation.2.Neighborhood.TEMP == 'Landmrk'] <- 'GrnHill_Landmrk'
ames_validation.2$Neighborhood <- factor(ames_validation.2.Neighborhood.TEMP,
                                         levels = unique(ames_validation.2.Neighborhood.TEMP))
ames_validation.2 <- as.data.frame(ames_validation.2 %>% filter(!Neighborhood %in% c('GrnHill_Landmrk', 'Landmrk')))

Finally, the RMSE for the validation set is:

valModRel2_predictions <- predict(aic, ames_validation.2)

valModRel2_RMSEq6 <- sqrt(mean((ames_validation.2$price - exp(valModRel2_predictions))^2))
valModRel2_RMSEq6

## [1] 70301.83

As expected, the RMSE for the validation set (70301.83) is higher than for the test set (46660.19), as we were using the test set as part of the model-building process, thus reducing its RMSE artificially. Next, in order to test whether the assumptions about uncertainty are met, we calculate the coverage probability. If our model meets this criterion, roughly 95% of the observed prices should fall inside the 95% prediction interval.

# Predict prices
predict.full <- exp(predict(aic, ames_validation.2, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.prob.full <- mean(ames_validation.2$price > predict.full[, "lwr"] &
                           ames_validation.2$price < predict.full[, "upr"])
coverage.prob.full

## [1] 0.95996

As we can see above, ~96% of the validation prices fall inside our 95% prediction interval, showing that the final model properly reflects uncertainty.

2.5 Part 5 Conclusion
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.

Although this study managed to fit a good model to the data, much more work can be done as next steps. New variables could be engineered, and polynomial regression could add some nonlinearity to the model. Still on nonlinear regression, we could impose some regularization to improve out-of-sample accuracy, which would also provide further variable selection procedures.

The built model showed that, with a few well-selected variables, we can already provide a good explanation of the house prices in Ames. In this study, variables such as overall quality and year built explained a good percentage of the variance around the mean, and could already help real estate investors and employees to see whether they are charging (or being charged) above the expected price.

Also, it is important to note how to split your data so as not to give yourself an optimistic error rate from the training set, and how to make better use of the available data in order to understand how your model will perform on never-before-seen data. Lastly, the model selection procedures helped to eliminate bad predictors in an automatic way, and also helped to control model complexity. These procedures combined helped to build a final model with few variables but with a high predictive capacity.
