
Session Notes on

Principal Component Regression Analysis

In this session we will learn Principal Component Regression (PCR), a combination of two methods: regression analysis and Principal Component Analysis (PCA). It is one of the solutions to the problem of multicollinearity. Regression analysis has six assumptions associated with it:

1. The response and the regressors are linearly related.
2. On average, the residuals are zero.
3. The residuals have constant variance.
4. The residuals are uncorrelated.
5. The residuals are normally distributed.
6. The regressors are statistically independent.

Of these, the last assumption deserves special attention. In most practical situations, decision makers include as many variables as possible and try to build a model that helps them better understand the situation at hand. The main objectives are to make predictions and to identify the significant variables that cause a change in the response variable. With these two objectives, one builds a model and eventually reaches the stage where a decision on the model has to be taken. At this stage, all the assumptions associated with the model must be tested. Among them, independence of the regressors is critical and must be verified before any decision on the model is taken. It is tested using the VIF (Variance Inflation Factor), with a cut-off value of 5: the VIF is calculated for every regressor, and if the value for any variable exceeds 5, we conclude that there is a problem of multicollinearity. In such cases, PCR offers a solution: the interrelated variables are combined into components, and those components are used to build the model.
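Before turning to the two case studies, the core idea can be seen in a minimal, self-contained sketch using base R's prcomp() on simulated data (the variables x1, x2, x3 and y below are invented for illustration; the sessions themselves use the psych package's principal() instead):

# PCR in miniature: extract components, then regress on their scores
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # x2 is nearly collinear with x1
x3 <- rnorm(100)
y  <- 2 * x1 + x3 + rnorm(100)

pc      <- prcomp(cbind(x1, x2, x3), scale. = TRUE)  # principal components
pcr_dat <- data.frame(y, pc$x)                       # scores PC1, PC2, PC3
pcr_fit <- lm(y ~ PC1 + PC2, data = pcr_dat)         # regress y on the scores
summary(pcr_fit)

Because the components are uncorrelated by construction, the collinearity between x1 and x2 does not carry over into the regression on the scores.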

We demonstrate this using two situations.

Data Sets Considered and the R-codes

To explain PCR in R, we consider two data sets. The first relates to the personality traits of a brand ambassador; the second relates to customers' attitudes towards purchasing a sports utility vehicle.

Data set 1: Personality of the brand ambassador

A brand wants to identify an ambassador who can take the brand to its customers effectively. To select an ambassador in line with customers' preferences, the brand has drawn up several personality traits on which it will collect customer responses: Attractive, Trustworthy, Classy, Beautiful, Qualified, Elegant, Knowledgable, Dependable, Honest, Experienced, Sexy, Sincere, Expert, and Reliable. After collecting the responses, the brand wishes to build a model that identifies the most important traits. One of the team members felt that linking the customers' trait ratings to the brand score they would give if the individual were selected as ambassador would make the model more effective. That is, the brand score is treated as the response, and the traits as the variables that change the customers' opinion of the brand score. For this, regression analysis is adopted. Note that regression analysis is an appropriate method here: it identifies the significant variables that cause change in the response variable and also decomposes the response into parts. The responses were collected from a sample of 112 customers and stored in the data set named "Personality". We use this data to build the model in R, as follows.

setwd("F:/07.PGDM 2020/03.DAR/09.R-Codes")

getwd()

install.packages("readxl")

library(readxl)

install.packages("psych")

library(psych)

install.packages("car")

library(car)

install.packages("lm.beta")

library(lm.beta)

install.packages("lmtest")

library(lmtest)

personality=read_excel(file.choose()) # Import the data to R

attach(personality) # Attach the data to R

fix(personality) # To open the data in the R editor

View(personality) # To View the data

names(personality) # To know the variable names considered in the data set


> names(personality) # To know the variable names considered in the data set
[1] "Respondents" "Attractive" "Trustworthy" "Classy"
[5] "Beautiful" "Qualified" "Elegant" "Knowledgable"
[9] "Dependable" "Honest" "Experienced" "Sexy"
[13] "Sincere" "Expert" "Reliable" "Brandscore"

This indicates that there are 15 variables (excluding the "Respondents" column): the 14 personality traits and the Brandscore.

str(personality) # To know the structure of the data set


> str(personality) # To know the structure of the data set
'data.frame': 112 obs. of 16 variables:
$ Respondents : num 1 2 3 4 5 6 7 8 9 10 ...
$ Attractive : num 4 5 5 4 5 5 5 5 5 5 ...
$ Trustworthy : num 3 3 3 4 2 2 1 4 3 2 ...
$ Classy : num 4 5 3 5 5 5 2 5 4 5 ...
$ Beautiful : num 5 5 5 4 5 5 4 5 5 5 ...
$ Qualified : num 2 3 3 3 5 2 4 3 2 4 ...
$ Elegant : num 5 5 4 5 5 5 4 5 4 5 ...
$ Knowledgable: num 2 5 3 4 4 4 4 3 2 5 ...
$ Dependable : num 5 3 3 4 3 2 1 4 3 5 ...
$ Honest : num 4 4 3 4 2 2 2 4 3 5 ...
$ Experienced : num 3 3 3 2 4 2 1 3 2 3 ...
$ Sexy : num 5 5 5 4 4 5 5 5 5 5 ...
$ Sincere : num 4 3 3 4 2 2 1 4 3 2 ...
$ Expert : num 4 3 3 4 4 2 4 4 2 4 ...
$ Reliable : num 4 3 3 4 2 2 1 4 3 4 ...
$ Brandscore : num 8 6 6 8 8 4 8 8 4 8 ...

# To build the model, we use the lm() function.

BR_lm=lm(Brandscore~Attractive+Trustworthy+Classy+Beautiful+Qualified
         +Elegant+Knowledgable+Dependable+Honest+Experienced
         +Sexy+Sincere+Expert+Reliable, data=personality)

summary(BR_lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.65343 0.87753 -0.745 0.4583
Attractive 0.12496 0.22205 0.563 0.5749
Trustworthy 0.11069 0.21686 0.510 0.6109
Classy -0.13025 0.11980 -1.087 0.2797
Beautiful 0.01311 0.24532 0.053 0.9575
Qualified -0.21676 0.12271 -1.767 0.0805 .
Elegant -0.06209 0.12918 -0.481 0.6319
Knowledgable 0.23789 0.11771 2.021 0.0460 *
Dependable 0.11399 0.16291 0.700 0.4858
Honest -0.03990 0.18601 -0.215 0.8306
Experienced 0.15052 0.12256 1.228 0.2224
Sexy 0.10664 0.11094 0.961 0.3388
Sincere 0.12599 0.17740 0.710 0.4793
Expert 1.91213 0.10832 17.653 <2e-16 ***
Reliable -0.30915 0.19094 -1.619 0.1087
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7575 on 97 degrees of freedom


Multiple R-squared: 0.866, Adjusted R-squared: 0.8467
F-statistic: 44.79 on 14 and 97 DF, p-value: < 2.2e-16

From the above results one can infer that only two of the 14 traits (Knowledgable and Expert) are significant at the 5% level, with Qualified marginal at the 10% level. That is, the other traits would have to be dropped from the analysis. But before doing so, one has to check all the assumptions associated with the model; this is essential whenever the model involves several regressors.
# Testing the Assumptions of the Model

#1. Average error is zero

mean(BR_lm$residuals)
> mean(BR_lm$residuals)

[1] -2.701333e-17
# A value of the order of 1e-17 is numerical round-off: the mean residual is effectively zero, so the assumption holds.
#2. Errors have constant variance

bptest(BR_lm)
> bptest(BR_lm)

studentized Breusch-Pagan test

data: BR_lm
BP = 5.7033, df = 14, p-value = 0.9734
# The p-value (0.9734) is well above 0.05: constant variance is not rejected, so the assumption holds.

#3. Errors are uncorrelated

durbinWatsonTest(BR_lm)
> durbinWatsonTest(BR_lm)

 lag Autocorrelation D-W Statistic p-value
   1     -0.02732889      2.053081   0.796
 Alternative hypothesis: rho != 0
# The p-value (0.796) is well above 0.05: the residuals show no significant autocorrelation, so the assumption holds.

#4. Errors are normally distributed

shapiro.test(BR_lm$residuals)
> shapiro.test(BR_lm$residuals)

Shapiro-Wilk normality test

data: BR_lm$residuals
W = 0.64349, p-value = 4.202e-15
# The assumption of normality is not satisfied
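When normality fails, it helps to inspect the residuals visually before deciding how to proceed; a quick check using base R's standard plotting functions:

qqnorm(BR_lm$residuals)  # quantile-quantile plot against the normal distribution
qqline(BR_lm$residuals)  # reference line; strong curvature indicates heavy tails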

#5. All the regressors are independent

vif(BR_lm)
> vif(BR_lm)

  Attractive  Trustworthy       Classy    Beautiful    Qualified
    2.034807    11.054036     1.850226     2.981964     3.455707
     Elegant Knowledgable   Dependable       Honest  Experienced
    1.777280     2.890002     5.324240     6.676393     3.186619
        Sexy      Sincere       Expert     Reliable
    1.811147     6.266828     1.951332     8.061033

#VIF stands for Variance Inflation Factor and is calculated as VIF = 1/(1 - R-square), where R-square is obtained by regressing one regressor on all the other regressors. For example, we regress Attractive on all the other variables and compute the VIF for Attractive, regress Trustworthy on all the other variables and compute its VIF, and so on. If any value is more than 5, we conclude that there is a problem of multicollinearity. For the given situation, one can observe that the VIF values of Trustworthy, Dependable, Honest, Sincere, and Reliable exceed 5. Hence, we conclude that there is a problem of multicollinearity.
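To see where these numbers come from, the VIF of a single regressor can be reproduced by hand and compared with the car::vif() output above. A short sketch, using Trustworthy as the example:

# VIF for Trustworthy from first principles: regress it on all other regressors
aux <- lm(Trustworthy ~ Attractive + Classy + Beautiful + Qualified + Elegant
          + Knowledgable + Dependable + Honest + Experienced + Sexy
          + Sincere + Expert + Reliable, data = personality)
r2 <- summary(aux)$r.squared
1 / (1 - r2)   # should match vif(BR_lm)["Trustworthy"], about 11.05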

# Hence, we use Principal Component Regression (PCR). To use PCR, one first applies Principal Component Analysis (PCA) and extracts the components. The first step is to compute the correlation matrix of the variables considered, using the function cor().

BR_cor=cor(personality[,c(2:15)]) # It gives the correlation matrix of the variables

BR_cor
> BR_cor
              Attractive Trustworthy     Classy    Beautiful   Qualified     Elegant
Attractive    1.00000000 -0.01264030 0.36098866  0.657982887  0.02110671  0.25068566
Trustworthy  -0.01264030  1.00000000 0.31611581  0.059987507  0.38485920  0.19671116
Classy        0.36098866  0.31611581 1.00000000  0.428367098  0.16827761  0.47332849
Beautiful     0.65798289  0.05998751 0.42836710  1.000000000  0.03385594  0.38902214
Qualified     0.02110671  0.38485920 0.16827761  0.033855941  1.00000000  0.03384210
Elegant       0.25068566  0.19671116 0.47332849  0.389022136  0.03384210  1.00000000
Knowledgable  0.01543064  0.22602311 0.22737858  0.067351758  0.71949047  0.08143712
Dependable   -0.03899312  0.75561272 0.34332341  0.013735232  0.41802019  0.31469459
Honest        0.02650611  0.83943601 0.40019076  0.109494188  0.40049883  0.30481144
Experienced  -0.02128842  0.47278105 0.22885676  0.121429350  0.70188185 -0.06294609
Sexy          0.24442236 -0.05355256 0.05513737  0.521954628 -0.05758225  0.21977259
Sincere      -0.02403279  0.87813526 0.33450759 -0.008773328  0.38522650  0.16456164
Expert        0.07212673  0.44327010 0.27206439  0.049369387  0.49446655  0.21326599
Reliable     -0.04952453  0.89343060 0.30528384  0.095153983  0.36586001  0.18010322
             Knowledgable  Dependable      Honest Experienced        Sexy      Sincere
Attractive     0.01543064 -0.03899312  0.02650611 -0.02128842  0.24442236 -0.024032788
Trustworthy    0.22602311  0.75561272  0.83943601  0.47278105 -0.05355256  0.878135261
Classy         0.22737858  0.34332341  0.40019076  0.22885676  0.05513737  0.334507594
Beautiful      0.06735176  0.01373523  0.10949419  0.12142935  0.52195463 -0.008773328
Qualified      0.71949047  0.41802019  0.40049883  0.70188185 -0.05758225  0.385226500
Elegant        0.08143712  0.31469459  0.30481144 -0.06294609  0.21977259  0.164561639
Knowledgable   1.00000000  0.26016493  0.32701867  0.58945727  0.04050363  0.160348525
Dependable     0.26016493  1.00000000  0.84975750  0.38380194 -0.10940900  0.756467444
Honest         0.32701867  0.84975750  1.00000000  0.37657439 -0.04285025  0.760214553
Experienced    0.58945727  0.38380194  0.37657439  1.00000000  0.10710261  0.457925552
Sexy           0.04050363 -0.10940900 -0.04285025  0.10710261  1.00000000 -0.218910017
Sincere        0.16034853  0.75646744  0.76021455  0.45792555 -0.21891002  1.000000000
Expert         0.43578487  0.58883736  0.55315137  0.45333378 -0.09549250  0.459699384
Reliable       0.31419957  0.79562501  0.84862403  0.47136527 -0.03588997  0.772498031
Expert Reliable
Attractive 0.07212673 -0.04952453
Trustworthy 0.44327010 0.89343060
Classy 0.27206439 0.30528384
Beautiful 0.04936939 0.09515398
Qualified 0.49446655 0.36586001
Elegant 0.21326599 0.18010322
Knowledgable 0.43578487 0.31419957
Dependable 0.58883736 0.79562501
Honest 0.55315137 0.84862403
Experienced 0.45333378 0.47136527
Sexy -0.09549250 -0.03588997
Sincere 0.45969938 0.77249803
Expert 1.00000000 0.51156581
Reliable 0.51156581 1.00000000

# Using the correlation matrix, we compute the eigen values, which decide the number of components to be retained. According to the Kaiser criterion (Kaiser, 1960), we choose the components whose eigen values are more than 1. To find the eigen values, we use eigen(). Note that each component's contribution to the total variance is computed from its eigen value: once an eigen value falls below 1, the component explains less variance than a single standardized variable, so we prefer the components whose eigen values exceed 1.

eigen(BR_cor) # Gives the eigen values


> eigen(BR_cor) # Gives the eigen values
eigen() decomposition
$values
 [1] 5.71038318 2.40474742 1.71856097 0.96586170 0.79767880 0.59916395 0.43789412
 [8] 0.38794934 0.26583991 0.23859264 0.17836079 0.13125240 0.11160716 0.05210761

$vectors
              [,1]         [,2]        [,3]        [,4]        [,5]         [,6]        [,7]
 [1,] -0.02853653  0.490580500  0.02129562  0.12826934  0.55604826  0.330015602 -0.26075973
 [2,] -0.36620299 -0.073869107 -0.20703972 -0.23949926  0.10891469 -0.099563387 -0.13323164
 [3,] -0.20245465  0.336341529 -0.08329167  0.39339285  0.08563461 -0.577801997  0.44919359
 [4,] -0.06736258  0.565546333  0.04143647 -0.17424525  0.17853629  0.034284546 -0.01953286
 [5,] -0.26355679 -0.069282513  0.48585192  0.10506029  0.02621432 -0.006238234 -0.34553272
 [6,] -0.13168411  0.364281141 -0.21160898  0.32675651 -0.59310521 -0.041203867 -0.27470137
 [7,] -0.21391863  0.002511485  0.54095678  0.17036638 -0.18905075 -0.103439223 -0.26875761
 [8,] -0.36213104 -0.068993724 -0.19375416  0.02212275 -0.12645152  0.163189998 -0.03648150
 [9,] -0.37403164 -0.015645346 -0.19080185 -0.05496307 -0.06695975  0.073116453 -0.12777587
[10,] -0.27030410 -0.040915587  0.43251500 -0.22034602  0.13118612 -0.238456618  0.29517937
[11,]  0.01872343  0.396127495  0.13674991 -0.64064244 -0.39043250  0.058304006  0.16991167
[12,] -0.35020440 -0.120214200 -0.22279188 -0.07126553  0.22578071 -0.121823568 -0.03982987
[13,] -0.29046298 -0.021431614  0.13473403  0.27136724 -0.10129398  0.653120185  0.55565783
[14,] -0.36813868 -0.069942616 -0.16448899 -0.23028492  0.01341381  0.008065038 -0.03181948
              [,8]        [,9]       [,10]       [,11]       [,12]       [,13]        [,14]
 [1,]  0.019408983  0.28456101  0.32153721 -0.24816882 -0.07583447  0.01150061 -0.057493985
 [2,] -0.037181172 -0.07118750  0.31551641  0.23635675 -0.22625012  0.03577835  0.711584863
 [3,]  0.204483525  0.24862262 -0.03981079  0.12193157 -0.10685611  0.09860329  0.002824212
 [4,] -0.008181143 -0.54734616 -0.49040924  0.16081224  0.17401360 -0.01695435  0.116294197
 [5,] -0.285373304  0.27613581 -0.33904271  0.38885278 -0.32425984  0.13081441 -0.111475699
 [6,] -0.408395711 -0.17087842  0.20578196 -0.13037413 -0.09136809 -0.04545792 -0.061100798
 [7,]  0.498500137 -0.15816509  0.26180520 -0.05757966  0.39050079  0.07401206  0.102754343
 [8,]  0.016391130  0.33007841 -0.47271515 -0.48594426  0.25278909  0.28233690  0.263971031
 [9,]  0.323005252  0.12629818 -0.14803427  0.05561073 -0.07579958 -0.78163161 -0.179691689
[10,] -0.398657814 -0.13513724  0.09399845 -0.51145702 -0.07794223 -0.27331154 -0.007035123
[11,]  0.050995337  0.40947367  0.14215225  0.14296107  0.07075068  0.08656165 -0.074090345
[12,] -0.333152912  0.03634842  0.18934028  0.28297179  0.62756434  0.07412687 -0.336480983
[13,] -0.039697041 -0.11101521  0.11812574  0.21752177 -0.01429880  0.01582291  0.046604448
[14,]  0.284493736 -0.31242663  0.07013411 -0.14714359 -0.40075879  0.43048994 -0.476644948
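As a quick check on the Kaiser criterion, the proportion of variance explained by each component can be computed directly from these eigen values (with 14 standardized variables, the eigen values sum to 14):

ev <- eigen(BR_cor)$values
ev / sum(ev)          # proportion of total variance per component
cumsum(ev) / sum(ev)  # cumulative proportion
sum(ev > 1)           # Kaiser criterion: returns 3 for this data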

# From the above results one can observe that the eigen values of three components exceed 1; hence, we retain 3 components. To build the components, we use the given variables and the corresponding data; the function principal() from the psych package is used for this. Note that the component scores are what we later use to build the regression model.
BR_PCR=principal(personality[,c(2:15)],nfactors = 3,rotate = "varimax", scores=TRUE)

# The above function has the following major inputs: the columns on which to run the PCA, the number of factors (i.e., the number of components), and the rotation method. Rotation ensures that each variable does not get linked with all the components, but only with those it should be associated with. We use "varimax" rotation: the components are rotated in the direction of the data so that the maximum variation is extracted. The third input, scores = TRUE, gives us the scores for each component; a comparison with the unrotated solution is sketched below.
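To see the effect of the rotation, one can fit the same solution without rotation and compare the loadings; in the unrotated solution, many variables typically load noticeably on more than one component. A sketch:

# Unrotated solution for comparison (same columns, same number of components)
BR_unrot <- principal(personality[,c(2:15)], nfactors = 3, rotate = "none", scores = TRUE)
BR_unrot$loadings   # compare with the varimax loadings of BR_PCR below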

# Building the principal components

BR_PCR # To view the components


Principal Components Analysis
Call: principal(r = cor(personality[, c(2:15)]), nfactors = 3, rotate = "varimax",
scores = TRUE)
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC3 RC2 h2 u2 com
Attractive -0.06 0.01 0.76 0.58 0.42 1.0
Trustworthy 0.90 0.19 0.02 0.85 0.15 1.1
Classy 0.40 0.10 0.59 0.52 0.48 1.8
Beautiful 0.00 0.06 0.89 0.80 0.20 1.0
Qualified 0.25 0.87 -0.02 0.81 0.19 1.2
Elegant 0.33 -0.13 0.61 0.50 0.50 1.7
Knowledgable 0.10 0.87 0.08 0.76 0.24 1.0
Dependable 0.89 0.20 0.02 0.82 0.18 1.1
Honest 0.90 0.21 0.11 0.86 0.14 1.1
Experienced 0.29 0.81 0.03 0.74 0.26 1.3
Sexy -0.21 0.09 0.60 0.41 0.59 1.3
Sincere 0.89 0.16 -0.06 0.82 0.18 1.1
Expert 0.52 0.49 0.07 0.51 0.49 2.0
Reliable 0.88 0.24 0.03 0.83 0.17 1.1

RC1 RC3 RC2


SS loadings 4.72 2.63 2.48
Proportion Var 0.34 0.19 0.18
Cumulative Var 0.34 0.53 0.70
Proportion Explained 0.48 0.27 0.25
Cumulative Proportion 0.48 0.75 1.00

Mean item complexity = 1.3


Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is 0.07

Fit based upon off diagonal values = 0.97

# RC indicates a rotated component; RC1, RC2, and RC3 are the three components built by the PCA algorithm. The variables under each component are listed based on their component loadings, which are the correlations between the variables and the components. The cut-off value for the loadings is 0.5: a variable is listed under a component when its loading on that component is at least 0.5. For example, Trustworthy, Dependable, Honest, Sincere, and Reliable all have loadings above 0.5 on RC1, so we list them under that component; the other variables are listed under their respective components in the same way. For the three components to be acceptable, the minimum total variance explained is 65%; from the above one can observe that the cumulative variance is 0.70, or 70%, so we can use the three components to build the model. The scores generated for the three components are added to the original data and then used in building the model; we use cbind() to append the component scores to the original data.
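Applying the 0.5 cut-off by eye is tedious; the loadings table can instead be printed with the small values suppressed and the variables sorted by component, which reproduces the grouping described above:

print(BR_PCR$loadings, cutoff = 0.5, sort = TRUE)  # show only loadings of at least 0.5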

View(BR_PCR$scores)

personality=cbind(personality, BR_PCR$scores) # Combining the component scores with the original data set

fix(personality)

names(personality)
> names(personality)
[1] "Respondents" "Attractive" "Trustworthy" "Classy" "Beautiful" "Qualified"
[7] "Elegant" "Knowledgable" "Dependable" "Honest" "Experienced" "Sexy"
[13] "Sincere" "Expert" "Reliable" "Brandscore" "RC1" "RC3"
[19] "RC2"

# One can observe that the three component scores are added to the original data.
# We now rebuild the model by taking the three components in the model.

BR_lm1=lm(Brandscore~RC1+RC2+RC3, data=personality) # Rebuilding the model

summary(BR_lm1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0714 0.1361 51.964 < 2e-16 ***
RC1 0.8528 0.1367 6.239 8.80e-09 ***
RC2 0.1520 0.1367 1.112 0.269
RC3 0.9870 0.1367 7.220 7.63e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.44 on 108 degrees of freedom


Multiple R-squared: 0.4608, Adjusted R-squared: 0.4458
F-statistic: 30.76 on 3 and 108 DF, p-value: 1.875e-14

One can observe that RC2 is not significant (p = 0.269). Hence, we exclude it from the model and rebuild with the remaining two components.

BR_lm2=lm(Brandscore~RC1+RC3, data=personality) # Rebuilding the model

summary(BR_lm2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0714 0.1362 51.907 < 2e-16 ***
RC1 0.8528 0.1368 6.232 8.89e-09 ***
RC3 0.9870 0.1368 7.213 7.66e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.442 on 109 degrees of freedom


Multiple R-squared: 0.4546, Adjusted R-squared: 0.4446
F-statistic: 45.43 on 2 and 109 DF, p-value: 4.471e-15

We now test the assumptions associated with the rebuilt model once again.

# Testing the assumptions

mean(BR_lm2$residuals)         # 1. average error is zero

bptest(BR_lm2)                 # 2. errors have constant variance

durbinWatsonTest(BR_lm2)       # 3. errors are uncorrelated

shapiro.test(BR_lm2$residuals) # 4. errors are normally distributed

vif(BR_lm2)                    # 5. the regressors are independent

detach(personality)
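If the fitted PCR model is to be used for a new respondent, the new trait ratings must first be converted to component scores and only then passed to the regression. A sketch, under the assumption that psych's predict() method for principal() objects is available (the "new" respondent below simply reuses row 1 of the data as a stand-in):

new_resp   <- personality[1, c(2:15)]  # hypothetical new ratings on the 14 traits
new_scores <- predict(BR_PCR, data = new_resp, old.data = personality[,c(2:15)])
predict(BR_lm2, newdata = as.data.frame(new_scores))  # predicted Brandscore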

Data set 2: Life style

This data set records customers' attitudes towards purchasing a sports utility vehicle, with a response Y and 21 regressors X1 to X21. The same PCR workflow as above is repeated, this time retaining six components.

setwd("F:/07.PGDM 2020/03.DAR/09.R-Codes")

getwd()

install.packages("readxl")

library(readxl)

life_sty=read_excel(file.choose())

attach(life_sty)

View(life_sty)

fix(life_sty)

install.packages("car")
library(car)

install.packages("Hmisc")

library(Hmisc)

install.packages("lmtest")

library(lmtest)

Life_reg=lm(Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12+X13+X14+X15+
            X16+X17+X18+X19+X20+X21, data = life_sty)

summary(Life_reg)

Assumptions

#1. Average error is zero

mean(Life_reg$residuals)

#2. Uncorrelated errors

durbinWatsonTest(Life_reg)

#3. Errors follow normal distribution

shapiro.test(Life_reg$residuals)

#4. Constant variance

bptest(Life_reg)

#5. All the regressors are independent

vif(Life_reg)

library(psych) # Needed for principal(); installed and loaded earlier in the session

library(REdaS) # Provides KMO and Bartlett's sphericity tests, if sampling adequacy is to be checked

eigen(cor(life_sty[,2:22])) # Eigen values decide the number of components (Kaiser criterion)

LS_pcr=principal(life_sty[,2:22], nfactors = 6, rotate = "varimax", scores = TRUE)

LS_pcr

View(LS_pcr$scores)

cor(LS_pcr$scores) # The component scores are uncorrelated (varimax is an orthogonal rotation)

life_sty=cbind(life_sty,LS_pcr$scores)

View(life_sty)

LS_fit=lm(Y~RC1+RC2+RC3+RC4+RC5+RC6,data=life_sty) # Model on all six component scores

summary(LS_fit)

LS_fit1=lm(Y~RC2+RC3+RC4+RC5,data=life_sty) # Rebuilding after dropping the components found insignificant
summary(LS_fit1)

mean(LS_fit1$residuals)         # average error is zero

durbinWatsonTest(LS_fit1)       # errors are uncorrelated

bptest(LS_fit1)                 # errors have constant variance

shapiro.test(LS_fit1$residuals) # errors are normally distributed

vif(LS_fit1)                    # the regressors are independent

detach(life_sty)
