04. Session Notes on Principal Component Regression (1)
In this session we will learn Principal Component Regression (PCR). It is a combination of two
methods, regression and principal component analysis (PCA), and it is one of the solutions to the
problem of multicollinearity. Regression analysis has six assumptions associated with it.
Apart from all the other assumptions, the last one is very important. In most practical situations,
decision makers include as many variables as possible and try to build a model that helps them
better understand the given situation. The main objectives are to make predictions and to identify
the significant variables that cause a change in the response variable. With these two objectives,
one builds a model and eventually reaches the stage where a decision on the model has to be taken.
At this stage, one has to test all the assumptions associated with the model. Among them,
independence of the regressors is very important and must be tested before taking any decision on
the model. It can be tested using the VIF (variance inflation factor), with a cut-off value of 5. The
VIF is calculated for every regressor variable, and if the value for any variable is more than 5, we
conclude that there is a problem of multicollinearity. In such cases, one can use PCR as a solution:
the variables that are interrelated are combined into components, and these components are used
to build the model.
To explain PCR in R, we consider two data sets. The first is related to the personality traits of
a brand ambassador, and the second to the attitude of customers toward purchasing a sports
utility vehicle.
In the first case, the brand wants to identify an ambassador who can take their brand to the
customers effectively. In order to select a brand ambassador in line with the likes of the customers,
they have considered several personality traits on which they wish to collect responses from the
customers and then make the selection. The traits are Attractiveness, Trustworthy, Classy,
Beautiful, Qualified, Elegant, Knowledgeable, Dependable, Honest, Experienced, Sexy, Sincere, Expert,
and Reliable. After collecting the responses, the brand wishes to build a model that will help
identify the most important traits. One of the team members felt that linking the customers'
responses on the traits to the brand score the customers would give if the individual were selected
as brand ambassador would make the model more effective. That is, they want to treat the brand
score as the response and all the traits as the variables that change the customers' opinion of the
brand score. For this, they adopted regression analysis. Note that regression analysis is an
appropriate method here, as it helps identify the significant variables that cause a change in the
response variable and also partitions the variation in the response. The responses were collected
from a sample of 112 customers and stored in the data set named "Personality". Use the same and
build the model in R. The following gives the details.
setwd("F:/07.PGDM 2020/03.DAR/09.R-Codes")
getwd()
install.packages("readxl")
library(readxl)
install.packages("psych")
library(psych)
install.packages("car")
library(car)
install.packages("lm.beta")
library(lm.beta)
install.packages("lmtest")
library(lmtest)
personality=read_excel(file.choose())
attach(personality)
names(personality)
# This indicates that there are 15 variables (excluding the "Respondents" column).
BR_lm=lm(Brandscore~Attractive+Trustworthy+Classy+Beautiful+Qualified
+Elegant+Knowledgable+Dependable+Honest+Experienced
+Sexy+Sincere+Expert+Reliable, data=personality)
summary(BR_lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.65343 0.87753 -0.745 0.4583
Attractive 0.12496 0.22205 0.563 0.5749
Trustworthy 0.11069 0.21686 0.510 0.6109
Classy -0.13025 0.11980 -1.087 0.2797
Beautiful 0.01311 0.24532 0.053 0.9575
Qualified -0.21676 0.12271 -1.767 0.0805 .
Elegant -0.06209 0.12918 -0.481 0.6319
Knowledgable 0.23789 0.11771 2.021 0.0460 *
Dependable 0.11399 0.16291 0.700 0.4858
Honest -0.03990 0.18601 -0.215 0.8306
Experienced 0.15052 0.12256 1.228 0.2224
Sexy 0.10664 0.11094 0.961 0.3388
Sincere 0.12599 0.17740 0.710 0.4793
Expert 1.91213 0.10832 17.653 <2e-16 ***
Reliable -0.30915 0.19094 -1.619 0.1087
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above results one can infer that only three of the 14 traits show significance: Knowledgable
and Expert at the 5% level, and Qualified only at the 10% level. That is, the other traits would have
to be dropped from the analysis. But before doing so, one has to check all the assumptions
associated with the model. This has to be followed whenever we have several variables.
# Testing the Assumptions of the Model
# 1. Errors have zero mean
> mean(BR_lm$residuals)
[1] -2.701333e-17
# 2. Errors have constant variance
> bptest(BR_lm)
data:  BR_lm
BP = 5.7033, df = 14, p-value = 0.9734
# The p-value is more than 0.05, so the assumption of constant variance is satisfied.
# 3. Errors are independent
> durbinWatsonTest(BR_lm)
# 4. Errors are normally distributed
> shapiro.test(BR_lm$residuals)
data:  BR_lm$residuals
W = 0.64349, p-value = 4.202e-15
# The assumption of normality is not satisfied (p-value < 0.05).
# 5. Independence of the regressors
> vif(BR_lm)
# VIF stands for variance inflation factor and is calculated as VIF = 1/(1 - R-square), where
R-square comes from regressing one regressor on all the other regressors. For example, we build a
regression model for Attractive on all the other variables and compute the VIF for Attractive, then
build a regression model for Trustworthy on all the other variables and compute its VIF, and so on.
If any value is more than 5, we conclude that there is a problem of multicollinearity. For the given
situation one can observe that the VIF values for Trustworthy, Dependable, Honest, and Reliable are
more than 5. Hence, we conclude that there is a problem of multicollinearity.
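The VIF calculation described above can be sketched numerically. The following is a minimal illustration in Python with numpy (outside the R workflow of these notes, and using made-up data): each column is regressed on the remaining columns and VIF = 1/(1 - R-square) is computed.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    For column j, regress X[:, j] on the remaining columns and
    compute VIF_j = 1 / (1 - R^2_j)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Two nearly collinear predictors plus an independent one:
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # almost a copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(vif(X))   # VIFs for x1 and x2 far above the cut-off of 5; x3 near 1
```

Only the collinear pair crosses the cut-off, which mirrors how Trustworthy, Dependable, Honest, and Reliable flag each other in the Personality data.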
# Hence, we use principal component regression (PCR). To use PCR, one has to first run principal
component analysis (PCA) and extract the components. The first step is to calculate the correlation
matrix of the variables considered; for this we use the function cor().
BR_cor=cor(personality[,c(2:15)])
BR_cor
> BR_cor
             Attractive Trustworthy     Classy    Beautiful  Qualified     Elegant
Attractive   1.00000000 -0.01264030 0.36098866  0.657982887 0.02110671  0.25068566
Trustworthy -0.01264030  1.00000000 0.31611581  0.059987507 0.38485920  0.19671116
Classy       0.36098866  0.31611581 1.00000000  0.428367098 0.16827761  0.47332849
Beautiful    0.65798289  0.05998751 0.42836710  1.000000000 0.03385594  0.38902214
Qualified    0.02110671  0.38485920 0.16827761  0.033855941 1.00000000  0.03384210
Elegant      0.25068566  0.19671116 0.47332849  0.389022136 0.03384210  1.00000000
Knowledgable 0.01543064  0.22602311 0.22737858  0.067351758 0.71949047  0.08143712
Dependable  -0.03899312  0.75561272 0.34332341  0.013735232 0.41802019  0.31469459
Honest       0.02650611  0.83943601 0.40019076  0.109494188 0.40049883  0.30481144
Experienced -0.02128842  0.47278105 0.22885676  0.121429350 0.70188185 -0.06294609
Sexy         0.24442236 -0.05355256 0.05513737  0.521954628 -0.05758225 0.21977259
Sincere     -0.02403279  0.87813526 0.33450759 -0.008773328 0.38522650  0.16456164
Expert       0.07212673  0.44327010 0.27206439  0.049369387 0.49446655  0.21326599
Reliable    -0.04952453  0.89343060 0.30528384  0.095153983 0.36586001  0.18010322
             Knowledgable Dependable      Honest Experienced        Sexy      Sincere
Attractive     0.01543064 -0.03899312  0.02650611 -0.02128842  0.24442236 -0.024032788
Trustworthy    0.22602311  0.75561272  0.83943601  0.47278105 -0.05355256  0.878135261
Classy         0.22737858  0.34332341  0.40019076  0.22885676  0.05513737  0.334507594
Beautiful      0.06735176  0.01373523  0.10949419  0.12142935  0.52195463 -0.008773328
Qualified      0.71949047  0.41802019  0.40049883  0.70188185 -0.05758225  0.385226500
Elegant        0.08143712  0.31469459  0.30481144 -0.06294609  0.21977259  0.164561639
Knowledgable   1.00000000  0.26016493  0.32701867  0.58945727  0.04050363  0.160348525
Dependable     0.26016493  1.00000000  0.84975750  0.38380194 -0.10940900  0.756467444
Honest         0.32701867  0.84975750  1.00000000  0.37657439 -0.04285025  0.760214553
Experienced    0.58945727  0.38380194  0.37657439  1.00000000  0.10710261  0.457925552
Sexy           0.04050363 -0.10940900 -0.04285025  0.10710261  1.00000000 -0.218910017
Sincere        0.16034853  0.75646744  0.76021455  0.45792555 -0.21891002  1.000000000
Expert         0.43578487  0.58883736  0.55315137  0.45333378 -0.09549250  0.459699384
Reliable       0.31419957  0.79562501  0.84862403  0.47136527 -0.03588997  0.772498031
Expert Reliable
Attractive 0.07212673 -0.04952453
Trustworthy 0.44327010 0.89343060
Classy 0.27206439 0.30528384
Beautiful 0.04936939 0.09515398
Qualified 0.49446655 0.36586001
Elegant 0.21326599 0.18010322
Knowledgable 0.43578487 0.31419957
Dependable 0.58883736 0.79562501
Honest 0.55315137 0.84862403
Experienced 0.45333378 0.47136527
Sexy -0.09549250 -0.03588997
Sincere 0.45969938 0.77249803
Expert 1.00000000 0.51156581
Reliable 0.51156581 1.00000000
# Using the correlation matrix, we obtain the eigen values. These decide the number of components
to be selected: according to the Kaiser criterion, we choose those components whose eigen values
are more than 1. To find the eigen values, we use eigen(). Note that the contribution of each
component to the total variance is calculated from the eigen values; a component whose eigen value
is less than 1 explains less variance than a single standardized variable, and hence we prefer those
components whose eigen values are more than 1.
> eigen(BR_cor)
eigen() decomposition
$values
 [1] 5.71038318 2.40474742 1.71856097 0.96586170 0.79767880 0.59916395 0.43789412 0.38794934
 [9] 0.26583991 0.23859264 0.17836079 0.13125240 0.11160716 0.05210761
$vectors
             [,1]         [,2]        [,3]        [,4]        [,5]         [,6]        [,7]
 [1,] -0.02853653  0.490580500  0.02129562  0.12826934  0.55604826  0.330015602 -0.26075973
 [2,] -0.36620299 -0.073869107 -0.20703972 -0.23949926  0.10891469 -0.099563387 -0.13323164
 [3,] -0.20245465  0.336341529 -0.08329167  0.39339285  0.08563461 -0.577801997  0.44919359
 [4,] -0.06736258  0.565546333  0.04143647 -0.17424525  0.17853629  0.034284546 -0.01953286
 [5,] -0.26355679 -0.069282513  0.48585192  0.10506029  0.02621432 -0.006238234 -0.34553272
 [6,] -0.13168411  0.364281141 -0.21160898  0.32675651 -0.59310521 -0.041203867 -0.27470137
 [7,] -0.21391863  0.002511485  0.54095678  0.17036638 -0.18905075 -0.103439223 -0.26875761
 [8,] -0.36213104 -0.068993724 -0.19375416  0.02212275 -0.12645152  0.163189998 -0.03648150
 [9,] -0.37403164 -0.015645346 -0.19080185 -0.05496307 -0.06695975  0.073116453 -0.12777587
[10,] -0.27030410 -0.040915587  0.43251500 -0.22034602  0.13118612 -0.238456618  0.29517937
[11,]  0.01872343  0.396127495  0.13674991 -0.64064244 -0.39043250  0.058304006  0.16991167
[12,] -0.35020440 -0.120214200 -0.22279188 -0.07126553  0.22578071 -0.121823568 -0.03982987
[13,] -0.29046298 -0.021431614  0.13473403  0.27136724 -0.10129398  0.653120185  0.55565783
[14,] -0.36813868 -0.069942616 -0.16448899 -0.23028492  0.01341381  0.008065038 -0.03181948
              [,8]        [,9]       [,10]       [,11]       [,12]       [,13]        [,14]
 [1,]  0.019408983  0.28456101  0.32153721 -0.24816882 -0.07583447  0.01150061 -0.057493985
 [2,] -0.037181172 -0.07118750  0.31551641  0.23635675 -0.22625012  0.03577835  0.711584863
 [3,]  0.204483525  0.24862262 -0.03981079  0.12193157 -0.10685611  0.09860329  0.002824212
 [4,] -0.008181143 -0.54734616 -0.49040924  0.16081224  0.17401360 -0.01695435  0.116294197
 [5,] -0.285373304  0.27613581 -0.33904271  0.38885278 -0.32425984  0.13081441 -0.111475699
 [6,] -0.408395711 -0.17087842  0.20578196 -0.13037413 -0.09136809 -0.04545792 -0.061100798
 [7,]  0.498500137 -0.15816509  0.26180520 -0.05757966  0.39050079  0.07401206  0.102754343
 [8,]  0.016391130  0.33007841 -0.47271515 -0.48594426  0.25278909  0.28233690  0.263971031
 [9,]  0.323005252  0.12629818 -0.14803427  0.05561073 -0.07579958 -0.78163161 -0.179691689
[10,] -0.398657814 -0.13513724  0.09399845 -0.51145702 -0.07794223 -0.27331154 -0.007035123
[11,]  0.050995337  0.40947367  0.14215225  0.14296107  0.07075068  0.08656165 -0.074090345
[12,] -0.333152912  0.03634842  0.18934028  0.28297179  0.62756434  0.07412687 -0.336480983
[13,] -0.039697041 -0.11101521  0.11812574  0.21752177 -0.01429880  0.01582291  0.046604448
[14,]  0.284493736 -0.31242663  0.07013411 -0.14714359 -0.40075879  0.43048994 -0.476644948
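The Kaiser rule and the per-component variance contribution can be checked directly from the eigen values of a correlation matrix. A small sketch in Python with numpy, using a made-up 3-variable correlation matrix rather than the Personality data:

```python
import numpy as np

# Hypothetical correlation matrix: variables 1 and 2 are strongly related.
R = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

eigvals = np.linalg.eigvalsh(R)[::-1]   # eigen values, largest first
prop = eigvals / eigvals.sum()          # share of total variance per component
n_keep = int((eigvals > 1).sum())       # Kaiser: keep eigen values above 1

print(np.round(eigvals, 3))             # first eigen value absorbs the 1-2 overlap
print(np.round(prop.cumsum(), 3))       # cumulative variance explained
print(n_keep)
```

Because the eigen values of a correlation matrix sum to the number of variables, an eigen value above 1 means that component carries more variance than one original standardized variable, which is the rationale behind the cut-off.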
# From the above results one can observe that the eigen values of three components are more than
1, and hence we consider 3 components. To build the components, we use the given variables and the
corresponding data. The function principal() from the psych package is used for this. Note that the
component scores are later used to build the regression model.
BR_PCR=principal(personality[,c(2:15)],nfactors = 3,rotate = "varimax", scores=TRUE)
# The above function has the following major inputs: the columns to which we wish to apply the PCA,
the number of factors (that is, the number of components), and the rotation method. The rotation
method ensures that a variable does not get linked with all the components but only with those
components with which it should be associated. We use "varimax" rotation; that is, the components
are rotated toward the direction of the data such that maximum variation is extracted. The last
input, scores=TRUE, gives us the scores for each component.
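The varimax idea can itself be sketched in code. Below is a minimal Python/numpy version of Kaiser's iterative varimax algorithm, an illustration of the rotation concept rather than the psych package's implementation; the toy loading matrix is made up:

```python
import numpy as np

def varimax(L, gamma=1.0, n_iter=100, tol=1e-8):
    """Rotate loading matrix L (variables x components) by Kaiser's varimax."""
    p, k = L.shape
    R = np.eye(k)          # accumulated rotation matrix, kept orthogonal
    d = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        # SVD step that pushes each variable's loadings toward 0 or +/-1
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag(np.sum(Lr**2, axis=0))))
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return L @ R

# Toy unrotated loadings for 4 variables on 2 components:
L = np.array([[0.7, 0.5], [0.6, 0.4], [0.5, -0.6], [0.4, -0.7]])
Lr = varimax(L)
# The rotation is orthogonal, so each variable's communality (row sum of
# squared loadings) is unchanged:
print(np.allclose((L**2).sum(axis=1), (Lr**2).sum(axis=1)))
```

Rotation redistributes loadings among components without changing how much of each variable's variance the components jointly explain, which is why it is safe to rotate before interpreting.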
# RC indicates rotated component. RC1, RC2, and RC3 are the three components built by the PCA
algorithm. The variables under each component are listed based on the component loadings, which
are the correlations between the variables and the components. The cut-off value for the loadings
is 0.5, and a variable with a loading of at least 0.5 is listed under that component. For example,
in the given situation Trustworthy, Dependable, Honest, Sincere, and Reliable have loadings of more
than 0.5 on one component and hence are listed under it. Similarly, the other variables are listed
under their respective components. For the three components to be acceptable, the minimum total
variance explained is 65%; from the output one can observe that the cumulative variance is 0.70, or
70%. Hence, we can use the three components for building the model. The scores generated for the
three components are added to the original data with cbind() and then used in building the model.
personality=cbind(personality,BR_PCR$scores)
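The claim that loadings are correlations between variables and components can be verified from the eigen decomposition: an unrotated loading is the eigenvector entry scaled by the square root of its eigen value. A sketch in Python with numpy, on a made-up correlation matrix:

```python
import numpy as np

R = np.array([[1.0, 0.8, 0.1],   # hypothetical 3-variable correlation matrix
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]           # sort components, largest first
vals, vecs = vals[order], vecs[:, order]

# Unrotated loadings: eigenvector * sqrt(eigenvalue); each entry is the
# correlation of a variable with a component.
loadings = vecs * np.sqrt(vals)
print(np.round(np.abs(loadings[:, 0]), 2))
# Variables 1 and 2 pass the 0.5 cut-off on the first component; variable 3
# does not, so only the first two would be listed under it.
```

Summing squared loadings across all components recovers each variable's unit variance, which is the sense in which the components fully account for the standardized data.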
View(BR_PCR$scores)
fix(personality)
names(personality)
> names(personality)
[1] "Respondents" "Attractive" "Trustworthy" "Classy" "Beautiful" "Qualified"
[7] "Elegant" "Knowledgable" "Dependable" "Honest" "Experienced" "Sexy"
[13] "Sincere" "Expert" "Reliable" "Brandscore" "RC1" "RC3"
[19] "RC2"
# One can observe that the three component scores are added to the original data.
# We now rebuild the model by taking the three components in the model.
BR_lm1=lm(Brandscore~RC1+RC2+RC3, data=personality)
summary(BR_lm1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0714 0.1361 51.964 < 2e-16 ***
RC1 0.8528 0.1367 6.239 8.80e-09 ***
RC2 0.1520 0.1367 1.112 0.269
RC3 0.9870 0.1367 7.220 7.63e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
One can observe that RC2 is not significant. Hence, we exclude it from the model and rebuild the
same.
BR_lm2=lm(Brandscore~RC1+RC3, data=personality)
summary(BR_lm2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0714 0.1362 51.907 < 2e-16 ***
RC1 0.8528 0.1368 6.232 8.89e-09 ***
RC3 0.9870 0.1368 7.213 7.66e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We now again test the assumptions associated with the rebuilt model.
mean(BR_lm2$residuals)
bptest(BR_lm2)
durbinWatsonTest(BR_lm2)
shapiro.test(BR_lm2$residuals)
vif(BR_lm2)
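The reason the VIF problem disappears after PCR can be seen directly: principal component scores are mutually uncorrelated by construction, so a regression on them cannot suffer from multicollinearity. A small Python/numpy sketch with simulated data (not the Personality data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=120)   # collinear pair

# Standardize, then project onto the eigenvectors of the correlation matrix:
Z = (X - X.mean(axis=0)) / X.std(axis=0)
vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
scores = Z @ vecs                                      # component scores

# The scores are uncorrelated, so each component's VIF would be 1:
C = np.corrcoef(scores, rowvar=False)
print(np.abs(C - np.eye(4)).max() < 1e-8)              # off-diagonals ~ 0
```

This is exactly what vif(BR_lm2) reflects for the component model: the scores carry the original information, but in an orthogonal coordinate system.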
detach(personality)
setwd("F:/07.PGDM 2020/03.DAR/09.R-Codes")
getwd()
install.packages("readxl")
library(readxl)
life_sty=read_excel(file.choose())
attach(life_sty)
View(life_sty)
fix(life_sty)
install.packages("car")
library(car)
install.packages("Hmisc")
library(Hmisc)
install.packages("lmtest")
library(lmtest)
Life_reg=lm(Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15+
            X16+X17+X18+X19+X20+X21, data=life_sty)
summary(Life_reg)
# Testing the Assumptions of the Model
mean(Life_reg$residuals)
durbinWatsonTest(Life_reg)
shapiro.test(Life_reg$residuals)
bptest(Life_reg)
vif(Life_reg)
library(REdaS)
library(psych)
eigen(cor(life_sty[,2:22]))
LS_pcr=principal(life_sty[,2:22],nfactors = 6,rotate = "varimax", scores=TRUE)
LS_pcr
View(LS_pcr$scores)
cor(LS_pcr$scores)
life_sty=cbind(life_sty,LS_pcr$scores)
View(life_sty)
LS_fit=lm(Y~RC1+RC2+RC3+RC4+RC5+RC6,data=life_sty)
summary(LS_fit)
# RC1 and RC6 turn out not to be significant in LS_fit, so we drop them and rebuild the model.
LS_fit1=lm(Y~RC2+RC3+RC4+RC5,data=life_sty)
summary(LS_fit1)
mean(LS_fit1$residuals)
durbinWatsonTest(LS_fit1)
bptest(LS_fit1)
shapiro.test(LS_fit1$residuals)
vif(LS_fit1)
detach(life_sty)