Advanced Statistics Module Mini-Project Rohan Kanungo
MINI PROJECT
ADVANCED
STATISTICS MODULE
Submitted by Rohan Kanungo
5th June 2019
TABLE OF CONTENTS
Project Objective
Problem Analysis
Evidence of Multicollinearity
Factor Analysis
Naming of Factors
Multiple Regression Analysis
R-Code
Project Objective
The project is focussed on market segmentation in the context of product service
management. The data file Factor-Hair is to be used for performing the analysis.
Problem Analysis
The data set consists of 13 variables and 100 observations. ID is a row identifier,
Satisfaction is the dependent variable, and the remaining 11 variables are the factors
that determine the satisfaction (independent variables).
For the purposes of market segmentation, Principal Component/Factor analysis can
be used to identify the structure of a set of variables as well as provide a process for
data reduction.
We therefore examine and analyze the data set to:
i. Understand whether these variables can be “grouped.” By grouping the
variables, we will be able to see the big picture in terms of understanding the
customer.
ii. Reduce the 11 independent variables to a smaller number of composite variables.
str(Hairdata_original)
'data.frame': 100 obs. of 13 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
Evidence of Multicollinearity
The sample size is 100, which provides an adequate basis to calculate the correlation
between variables.
To check for collinearity, we compute and plot the correlation matrix of the
independent variables.
## Find the correlation
cor(Hairdata)
cor.plot(Hairdata,numbers=TRUE,xlas = 2,upper=FALSE)
The correlation plot shows evidence of multicollinearity. The cells marked in
blue indicate pairs of variables with a high correlation, and hence a high possibility
of multicollinearity.
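As a complement to reading the plot by eye, highly correlated pairs can also be listed programmatically. The following is a minimal base-R sketch, not the report's own method: it uses synthetic stand-in data rather than the Factor-Hair file, and the 0.7 cutoff and the names `demo`, `cutoff` and `high_pairs` are assumptions chosen for illustration.

```r
# Synthetic stand-in data (NOT the Factor-Hair file): x2 is built to track x1.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1 by construction
x3 <- rnorm(n)                  # generated independently of x1 and x2
demo <- data.frame(x1, x2, x3)

r <- cor(demo)

# Keep each pair once (upper triangle) and flag |r| above a hypothetical cutoff.
cutoff <- 0.7
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(high_pairs)
```

Applied to the report's data, the same steps would be run on cor(Hairdata) instead of cor(demo).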
To determine the significance of collinearity, we run Bartlett’s test.
## Significance of correlation
## Bartlett's Test
cortest.bartlett(Hairdata,n=100)
$chisq
[1] 619.2726
$p.value
[1] 1.79337e-96
$df
[1] 55
Conclusion:
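Since the p-value (1.79e-96) is far below 0.05, we reject the null hypothesis that the
correlation matrix is an identity matrix. The variables are significantly correlated,
which indicates that multicollinearity exists in the data set.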
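To make the test statistic less of a black box, here is a minimal base-R sketch of Bartlett's test of sphericity computed directly from a correlation matrix. The function name `bartlett_sphericity` and the synthetic data are assumptions for illustration; the report itself relies on the psych package.

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n.
# H0: R is an identity matrix (the variables are uncorrelated).
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Synthetic stand-in data (NOT the Factor-Hair file): three variables
# that share a common component z and are therefore correlated.
set.seed(42)
z    <- rnorm(100)
demo <- cbind(a = z + rnorm(100), b = z + rnorm(100), c = z + rnorm(100))

res <- bartlett_sphericity(cor(demo), n = 100)
print(res)

# With 11 variables, as in the report, the degrees of freedom are 11 * 10 / 2.
df_report <- 11 * 10 / 2
df_report   # 55
```

The degrees of freedom depend only on the number of variables, which is why the output above reports df = 55 for the 11 independent variables.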
Factor Analysis
1. Eigen Value Computation
eigen() decomposition
$values
3.426971 2.550897 1.690976 1.086556 0.609424 0.551884 0.401518 0.246952
0.203553 0.132842 0.098427
2. Scree Plot
## Scree Plot
HairScree<-data.frame(Hairfactor,HairEigenValue)
plot(HairScree,col="RED",pch=18,main="Scree Plot")
lines(HairScree,col="Blue")
abline(h=1,col="PURPLE")
The Kaiser rule retains only the factors whose eigenvalue is greater than 1. Four
eigenvalues (3.43, 2.55, 1.69 and 1.09) exceed 1, so we retain four factors, which are
the principal factors.
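The Kaiser rule can be checked directly against the eigenvalues printed above. This is a small sketch using only those reported values; the variable names `ev`, `n_factors` and `cum_var` are illustrative.

```r
# Eigenvalues as reported in the eigen() output above.
ev <- c(3.426971, 2.550897, 1.690976, 1.086556, 0.609424, 0.551884,
        0.401518, 0.246952, 0.203553, 0.132842, 0.098427)

# Kaiser rule: keep the components whose eigenvalue exceeds 1.
n_factors <- sum(ev > 1)
n_factors   # 4

# Share of total variance captured by the retained components.
cum_var <- cumsum(ev) / sum(ev)
round(cum_var[n_factors], 5)
```

The eigenvalues of a correlation matrix sum to the number of variables (11 here), so each eigenvalue divided by 11 is the proportion of variance carried by that component.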
3. Rotation of Loadings
## Loadings
## Unrotate Principal Loadings
Hair_unrotate <- principal(Hairdata,nfactors = 4,rotate = "none")
print(Hair_unrotate,digits=5)
UnRotatedprofile <-plot(Hair_unrotate,row.names(Hair_unrotate$loadings))
UnRotatedprofile
                          PC1     PC2     PC3     PC4
SS loadings           3.42697 2.55090 1.69098 1.08656
Proportion Var        0.31154 0.23190 0.15373 0.09878
Cumulative Var        0.31154 0.54344 0.69717 0.79595
Proportion Explained  0.39141 0.29135 0.19314 0.12410
Cumulative Proportion 0.39141 0.68276 0.87590 1.00000
To make the boundaries sharper, we perform an orthogonal rotation to clearly identify the factors.
## Rotate Principal Loadings
Hair_rotate <- principal(Hairdata,nfactors=4,rotate="varimax")
print(Hair_rotate,digits=5)
Rotatedprofile <-plot(Hair_rotate,row.names(Hair_rotate$loadings),cex=1.0)
Rotatedprofile
                          RC1     RC2     RC3     RC4
SS loadings           2.89268 2.23362 1.85551 1.77359
Proportion Var        0.26297 0.20306 0.16868 0.16124
Cumulative Var        0.26297 0.46603 0.63471 0.79595
Proportion Explained  0.33039 0.25511 0.21193 0.20257
Cumulative Proportion 0.33039 0.58550 0.79743 1.00000
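In both tables the cumulative variance after four factors is 0.79595: an orthogonal rotation redistributes variance among the factors without changing the total. The sketch below illustrates this with base R's stats::varimax on a built-in data set. It is a stand-in, not the report's computation: mtcars replaces the Factor-Hair file, two components are kept for brevity, and no claim is made that psych's principal() takes exactly these steps internally.

```r
# Stand-in example on a built-in data set (NOT the Factor-Hair file).
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
e <- eigen(cor(X))
k <- 2   # number of components kept in this illustration

# Unrotated principal-component loadings:
# eigenvectors scaled by the square root of their eigenvalues.
L <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(L) <- names(X)

# Orthogonal varimax rotation of the loadings.
rot <- varimax(L)
Lr  <- unclass(rot$loadings)

# Communalities (row sums of squared loadings) before and after rotation.
comm_before <- rowSums(L^2)
comm_after  <- rowSums(Lr^2)
print(round(cbind(comm_before, comm_after), 5))

# Variance per component changes, but the total does not.
print(rbind(unrotated = colSums(L^2), rotated = colSums(Lr^2)))
```

Each variable's communality is identical before and after rotation, which is why only the per-factor rows of the tables above differ.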
4. Plot and Determine the Factors
par(mfrow=c(1,2))
fa.diagram(Hair_unrotate,main="Unrotated factors")
fa.diagram(Hair_rotate,main="Rotated factors")
Naming of Factors
                 RC1      RC2      RC3      RC4
ProdQual     0.00152 -0.01274 -0.03282  0.87566
Ecom         0.05680  0.87056  0.04735 -0.11746
TechSup      0.01833 -0.02446  0.93919  0.10051
CompRes      0.92582  0.11593  0.04860  0.09123
Advertising  0.13876  0.74152 -0.08160  0.01467
ProdLine     0.59122 -0.06397  0.14598  0.64200
SalesFImage  0.13252  0.90045  0.07559 -0.15924
ComPricing  -0.08515  0.22563 -0.24551 -0.72258
WartyClaim   0.10982  0.05483  0.93099  0.10218
OrdBilling   0.86376  0.10683  0.08390  0.03931
DelSpeed     0.93820  0.17734 -0.00463  0.05227
Factor 1: Customer Service
i. CompRes
ii. DelSpeed
iii. OrdBilling
Factor 2: Marketing
i. SalesFImage
ii. Ecom
iii. Advertising
Factor 3: Technical Support
i. TechSup
ii. WartyClaim
Factor 4: Product Value
i. ProdQual
ii. ComPricing
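The grouping above can be cross-checked mechanically by assigning each variable to the factor on which it has the largest absolute loading. The sketch below re-enters the rotated loadings from the table in this section; the 0.5 cross-loading cutoff and the names `rc`, `dominant` and `cross` are assumptions for illustration, not part of the original analysis.

```r
# Rotated loadings, re-entered from the table in this section.
rc <- matrix(c(
   0.00152, -0.01274, -0.03282,  0.87566,
   0.05680,  0.87056,  0.04735, -0.11746,
   0.01833, -0.02446,  0.93919,  0.10051,
   0.92582,  0.11593,  0.04860,  0.09123,
   0.13876,  0.74152, -0.08160,  0.01467,
   0.59122, -0.06397,  0.14598,  0.64200,
   0.13252,  0.90045,  0.07559, -0.15924,
  -0.08515,  0.22563, -0.24551, -0.72258,
   0.10982,  0.05483,  0.93099,  0.10218,
   0.86376,  0.10683,  0.08390,  0.03931,
   0.93820,  0.17734, -0.00463,  0.05227),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("ProdQual", "Ecom", "TechSup", "CompRes", "Advertising", "ProdLine",
      "SalesFImage", "ComPricing", "WartyClaim", "OrdBilling", "DelSpeed"),
    c("RC1", "RC2", "RC3", "RC4")))

# Factor with the largest absolute loading for each variable.
dominant <- colnames(rc)[apply(abs(rc), 1, which.max)]
names(dominant) <- rownames(rc)
print(split(names(dominant), dominant))

# Variables loading above a hypothetical 0.5 cutoff on more than one factor.
cutoff <- 0.5
cross  <- rownames(rc)[rowSums(abs(rc) > cutoff) > 1]
print(cross)
```

Under this cutoff, ProdLine is the only variable that loads on two factors (RC1 and RC4), which may be why it does not appear under a single factor in the list above.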
Multiple Regression Analysis
Using the factor scores of the four factors identified, we build a data set and
formulate a multiple linear regression model.
## Multiple Regression Analysis
## Create the data set using the factor scores from PCA/FA process
mydata=data.frame(Hair_rotate$scores)
mydataforregression=cbind(mydata,Hairdata_original$Satisfaction)
names(mydataforregression) <-
c("customerservice","marketing","techsupport","productvalue","customersatisfaction")
str(mydataforregression)
## Build Multiple Linear Model (data= is supplied, so attach() is not needed)
SLM=lm(customersatisfaction~customerservice+marketing+techsupport+productvalue,data=mydataforregression)
summary(SLM)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      6.91800    0.07089  97.589  < 2e-16 ***
customerservice  0.61805    0.07125   8.675 1.12e-13 ***
marketing        0.50973    0.07125   7.155 1.74e-10 ***
techsupport      0.06714    0.07125   0.942    0.348
productvalue     0.54032    0.07125   7.584 2.24e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7089 on 95 degrees of freedom
Multiple R-squared: 0.6605, Adjusted R-squared: 0.6462
F-statistic: 46.21 on 4 and 95 DF, p-value: < 2.2e-16
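All four slopes in the output share the same standard error (0.07125). This is consistent with the predictors being uncorrelated, unit-variance scores: in that case each slope's standard error equals the residual standard error divided by sqrt(n - 1), and 0.7089 / sqrt(99) is about 0.0712. The sketch below illustrates the effect with base R's prcomp on a built-in data set; mtcars and the choice of three components are stand-ins, not the report's data or model.

```r
# Stand-in example on a built-in data set (NOT the Factor-Hair file).
X  <- scale(mtcars[, c("disp", "hp", "wt", "qsec", "drat")])
pc <- prcomp(X)

# Standardised component scores: uncorrelated, each with unit variance.
scores <- scale(pc$x[, 1:3])
print(round(cor(scores), 10))   # off-diagonal entries are zero up to rounding

# Regress an outcome on the scores.
dat <- data.frame(scores, y = mtcars$mpg)
fit <- lm(y ~ PC1 + PC2 + PC3, data = dat)

# Slope standard errors versus sigma / sqrt(n - 1).
se          <- summary(fit)$coefficients[-1, "Std. Error"]
expected_se <- summary(fit)$sigma / sqrt(nrow(dat) - 1)
print(se)
print(expected_se)
```

Because the scores are uncorrelated, the collinearity found among the original variables no longer inflates the standard errors of the regression coefficients.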
R-squared Interpretation
Multiple R-squared: 0.6605 means that 66.05% of the variance in the dependent variable
is explained by the independent variables; i.e. the four factors identified account for
66.05% of the variation in customer satisfaction.
Probability (F-statistic > 46.21) = p-value: < 2.2e-16, which is much smaller than 5%.
Hence, REJECT NULL HYPOTHESIS that all betas are zero.
Conclude that at least one beta is non-zero, ACCEPT ALT. HYPOTHESIS.
The individual coefficients for customer service, marketing and product value are
highly significant, as evidenced by their large t-statistics (8.675, 7.155 and 7.584)
and p-values that are much less than 5%. Hence, these individual betas are non-zero.
The coefficient for technical support is not significant (t = 0.942, p-value = 0.348),
so there is no evidence that it contributes to customer satisfaction in this model.
Overall, the regression model holds in the population, meaning that the linear model of
customer satisfaction depending on customer service, marketing and product value is
statistically valid, with technical support the only factor that is not statistically
significant.
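The summary figures in the regression output are linked by standard formulas, which can be checked from the printed values alone. This is a small sketch; the variable names are illustrative and the inputs are the rounded numbers from the output, so results agree with the report only up to rounding.

```r
r2 <- 0.6605   # Multiple R-squared from the output
n  <- 100      # observations
p  <- 4        # predictors (the four factor scores)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.6462

# Overall F-statistic on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
round(f_stat, 2)   # 46.21

# t-statistic and two-sided p-value for the techsupport row
# of the coefficient table (estimate 0.06714, std. error 0.07125).
t_tech <- 0.06714 / 0.07125
p_tech <- 2 * pt(abs(t_tech), df = n - p - 1, lower.tail = FALSE)
p_tech
```

The p-value for techsupport is well above 0.05, in line with the absence of significance stars on that row of the coefficient table.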
R-Code
## =======================================================================
## MINI-PROJECT 2
## MODULE - ADVANCED STATISTICS
## =======================================================================
## Environment Set up
## Read Input File "Factor-Hair-Revised"
## Install libraries nFactors and Psych for Factor Analysis
library(nFactors)
library(psych)
getwd()
## File name taken from the comment above; the .csv extension is an assumption
Hairdata_original <- read.csv("Factor-Hair-Revised.csv",header=TRUE)
View(Hairdata_original)
attach(Hairdata_original)
str(Hairdata_original)
## Since the first col is an ID number, we need to ignore it before analysis
## The last column Satisfaction is the dependent variable, we need to remove it before analysis
Hairdata <-Hairdata_original[,2:12]
str(Hairdata)
## Find the correlation
cor(Hairdata)
cor.plot(Hairdata,numbers=TRUE,xlas = 2,upper=FALSE)
## Significance of correlation
## Bartlett's Test
cortest.bartlett(Hairdata,n=100)
## How many factors are applicable
## Eigen Values is the basis for selecting the number of factors
## Eigen Value Computation
Hairev<-eigen(cor(Hairdata))
print(Hairev,digits=5)
HairEigenValue=Hairev$values
HairEigenValue
## There are 11 variables, hence 11 factors
Hairfactor <-seq(1,11,by=1)
Hairfactor
## Scree Plot
HairScree<-data.frame(Hairfactor,HairEigenValue)
plot(HairScree,col="RED",pch=18,main="Scree Plot")
lines(HairScree,col="Blue")
abline(h=1,col="PURPLE")
## Loadings
## Unrotate Principal Loadings
Hair_unrotate <- principal(Hairdata,nfactors = 4,rotate = "none")
print(Hair_unrotate,digits=5)
UnRotatedprofile <-plot(Hair_unrotate,row.names(Hair_unrotate$loadings))
UnRotatedprofile
## Rotate Principal Loadings
Hair_rotate <- principal(Hairdata,nfactors=4,rotate="varimax")
print(Hair_rotate,digits=5)
Rotatedprofile <-plot(Hair_rotate,row.names(Hair_rotate$loadings),cex=1.0)
Rotatedprofile
Hair_rotate$scores
Hair_rotate$loadings
par(mfrow=c(1,2))
fa.diagram(Hair_unrotate,main="Unrotated factors")
fa.diagram(Hair_rotate,main="Rotated factors")
## Multiple Regression Analysis
## Create the data set using the factor scores from PCA/FA process
mydata=data.frame(Hair_rotate$scores)
mydataforregression=cbind(mydata,Hairdata_original$Satisfaction)
names(mydataforregression) <-
c("customerservice","marketing","techsupport","productvalue","customersatisfaction")
str(mydataforregression)
## Build Multiple Linear Model (data= is supplied, so attach() is not needed)
SLM=lm(customersatisfaction~customerservice+marketing+techsupport+productvalue,data=mydataforregression)
summary(SLM)
## =======================================================================
## END MINI-PROJECT 2
## =======================================================================