Project 2 - Advanced Statistics
Table of Contents
1. Project Objective
2. Assumptions
4.1. Histogram
4.2. Boxplot
4.3. Scatterplot
7. KMO Test
8. Factor Interpretation
1. Project Objective
The objective of this report is to explore the data set (“Factor-Hair-Revised”) in R and generate
insights about it. The exploration consists of the following:
Importing the dataset in R
Understanding the structure of dataset
Graphical exploration
Exploratory Data Analysis
Multicollinearity
Simple Linear Regression
PCA/Factor Analysis
Multiple Linear Regression
Model output and validity
2. Assumptions
Step-by-step approach. A typical data exploration activity consists of the following
steps:
The necessary packages must be installed and then loaded with the library() command. Refer to
the R code for the complete set of packages.
Setting a working directory at the start of the R session makes importing and exporting data files
and code files easier. The working directory is the folder on the PC that holds the data, code,
and other files related to the project.
The given dataset is in .csv format, so the command ‘read.csv’ is used to import the file.
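The import steps above can be sketched as follows. This is only a minimal sketch: the file name `Factor-Hair-Revised.csv` and the column names are assumptions (the report names the data set but not the file), and the three rows written to a temporary folder are invented so that the example runs on its own.

```r
# Hypothetical stand-in data: file name, column names and values are invented.
work_dir <- tempdir()                         # stands in for the project folder
csv_path <- file.path(work_dir, "Factor-Hair-Revised.csv")
writeLines(c("ID,Product.Quality,Satisfaction",
             "1,8.5,8.2",
             "2,8.2,5.7",
             "3,9.2,8.9"), csv_path)

old_dir <- setwd(work_dir)                    # set the working directory
hair <- read.csv("Factor-Hair-Revised.csv", header = TRUE)
setwd(old_dir)                                # restore the previous directory

str(hair)      # structure: column names and types
summary(hair)  # minimum, quartiles, mean and maximum per column
```

With the real file, only the `setwd()` and `read.csv()` calls are needed; `str()` and `summary()` give the structure and summary used in the next section.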
4. Data Summary, Univariate, Bivariate analysis, graphs
Observations from the summary: 12 of the variables are numeric and 1 variable is an
integer.
Satisfaction is the dependent variable, so the graph covers all the
variables. The data (ratings on a scale of 1-10) were given by 100 customers for all the
variables depicted in the graph.
Outliers can be seen in Ecommerce, Sales Force Image and Order Billing.
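As an illustration of how a boxplot flags outliers, the sketch below applies R's default rule (points more than 1.5 times the interquartile range beyond the hinges) to a small vector. The ratings are invented for the example and are not taken from the data set.

```r
# Hypothetical ratings on a 1-10 scale (invented values, not the report's data)
ratings <- c(3.1, 3.4, 3.6, 3.7, 3.9, 4.0, 4.1, 4.3, 4.5, 7.8)

stats <- boxplot.stats(ratings)   # same 1.5 * IQR rule that boxplot() uses
stats$out                         # values flagged as outliers

boxplot(ratings, horizontal = TRUE, main = "Hypothetical rating")
```

Running the same call on each column of the real data would list the individual outlying ratings for Ecommerce, Sales Force Image and Order Billing.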
4.3: Bivariate Analysis
The scatter plot below suggests that certain variables are correlated:
5. Simple Linear Regression:
The models below analyse each independent variable in turn against the dependent variable.
5.1 – Model 1 Customer Satisfaction and Product Quality.
Residuals:
Min 1Q Median 3Q Max
-1.88746 -0.72711 -0.01577 0.85641 2.25220
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.67593 0.59765 6.151 1.68e-08 ***
Product.Quality 0.41512 0.07534 5.510 2.90e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
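The columns of the coefficient table are linked by simple formulas, which can be checked against the Model 1 output above. The sketch below assumes 100 observations (as stated in the data summary) and therefore 98 residual degrees of freedom. The last line uses the identity R² = t² / (t² + df), which holds only for a regression with a single predictor.

```r
# Values copied from the Model 1 coefficient table above
estimate  <- 0.41512
std_error <- 0.07534
n         <- 100           # customers in the data set
df_resid  <- n - 2         # intercept and one slope are estimated

t_value   <- estimate / std_error                  # t = estimate / standard error
p_value   <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-value
r_squared <- t_value^2 / (t_value^2 + df_resid)    # single-predictor models only

round(t_value, 3)   # compare with the reported t value, 5.510
p_value             # compare with the reported Pr(>|t|), 2.90e-07
r_squared           # share of variance explained by this one predictor
```

The same three lines, fed with the estimate and standard error of any other simple model in this section, reproduce its t value, p-value and multiple R-squared.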
5.2 – Model 2 Customer Satisfaction and Ecommerce.
Residuals:
Min 1Q Median 3Q Max
-2.37200 -0.78971 0.04959 0.68085 2.34580
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1516 0.6161 8.361 4.28e-13 ***
Ecommerce 0.4811 0.1649 2.918 0.00437 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared is 7.9% and adjusted R-squared is around 7%, which signifies that the two
variables are only weakly correlated. The model is nevertheless valid since the p-value is well below 0.05.
5.3 – Model 3 Customer Satisfaction and Technical Support.
Multiple R-squared is 1.3% and adjusted R-squared is around 0.26%, which signifies that the two
variables are not correlated. The model is not valid since the p-value is more than 0.05.
Residuals:
Min 1Q Median 3Q Max
-2.26136 -0.93297 0.04302 0.82501 2.85617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.44757 0.43592 14.791 <2e-16 ***
Technical.Support 0.08768 0.07817 1.122 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
5.4 – Model 4 Customer Satisfaction and Complaint Resolution.
Multiple R-squared is 36.4% and adjusted R-squared is around 35.8%, which signifies that the two
variables are moderately correlated. The model is valid since the p-value is much less than 0.05.
Residuals:
Min 1Q Median 3Q Max
-2.40450 -0.66164 0.04499 0.63037 2.70949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.68005 0.44285 8.310 5.51e-13 ***
`Complain Resolution` 0.59499 0.07946 7.488 3.09e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
5.5 – Model 5 Customer Satisfaction and Advertising.
Multiple R-squared is 9.3% and adjusted R-squared is around 8.4%, which signifies that the two
variables are only weakly correlated. The model is valid since the p-value is less than 0.05.
Residuals:
Min 1Q Median 3Q Max
-2.34033 -0.92755 0.05577 0.79773 2.53412
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6259 0.4237 13.279 < 2e-16 ***
Advertising 0.3222 0.1018 3.167 0.00206 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
5.6 – Model 6 Customer Satisfaction and Product Line.
Multiple R-squared is 30.3% and adjusted R-squared is around 29.6%, which signifies that the two
variables are moderately correlated. The model is valid since the p-value is less than 0.05.
Residuals:
Min 1Q Median 3Q Max
-2.3634 -0.7795 0.1097 0.7604 1.7373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.02203 0.45471 8.845 3.87e-14 ***
Product.Line 0.49887 0.07641 6.529 2.95e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
5.7 – Model 7 Customer Satisfaction and Sales Force Image.
Multiple R-squared is 25% and adjusted R-squared is around 24.3%, which signifies that the two
variables are moderately correlated. The model is valid since the p-value is less than 0.05.
Residuals:
Min 1Q Median 3Q Max
-2.2164 -0.5884 0.1838 0.6922 2.0728
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.06983 0.50874 8.000 2.54e-12 ***
Sales.Force.Image 0.55596 0.09722 5.719 1.16e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
5.8 – Model 8 Customer Satisfaction and Competitive pricing.
Residuals:
Min 1Q Median 3Q Max
-1.9728 -0.9915 -0.1156 0.9111 2.5845
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.03856 0.54427 14.769 <2e-16 ***
Competitive.Pricing -0.16068 0.07621 -2.108 0.0376 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared is 4.4% and adjusted R-squared is around 3.4%, which signifies that the two
variables are only weakly (and negatively) correlated. The model is valid since the p-value (0.0376) is less than 0.05.
5.9 – Model 9 Customer Satisfaction and Warranty and Claims.
Multiple R-squared is 3.2% and adjusted R-squared is around 2.1%, which signifies that the two
variables are not correlated. The model is not valid since the p-value is more than 0.05.
Residuals:
Min 1Q Median 3Q Max
-2.36504 -0.90202 0.03019 0.90763 2.88985
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3581 0.8813 6.079 2.32e-08 ***
`Warranty&Claims` 0.2581 0.1445 1.786 0.0772 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
5.10 – Model 10 Customer Satisfaction and Order Billing.
Residuals:
Min 1Q Median 3Q Max
-2.4005 -0.7071 -0.0344 0.7340 2.9673
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0541 0.4840 8.377 3.96e-13 ***
Order.Billing 0.6695 0.1106 6.054 2.60e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared is 27.2% and adjusted R-squared is around 26.5%, which signifies that the two
variables are moderately correlated. The model is valid since the p-value is less than 0.05.
5.11 – Model 11 Customer Satisfaction and Delivery Speed.
Residuals:
Min 1Q Median 3Q Max
-2.22475 -0.54846 0.08796 0.54462 2.59432
Coefficients:
Multiple R-squared is 33.3% and adjusted R-squared is around 32.6%, which signifies that the two
variables are moderately correlated. The model is valid since the p-value is less than 0.05.
6. Analysis based on Multicollinearity:
As an early indicator we compute the correlation matrix in R. We observe that some of the
independent variables are significantly correlated with one another.
$chisq
[1] 619.2726
$p.value
[1] 1.79337e-96
$df
[1] 55
As the p-value is less than 0.05, the Bartlett test indicates that the data are suitable for dimension reduction.
This test checks whether the correlation matrix differs from the identity matrix.
7. KMO Test
The overall MSA reported by the KMO test is 0.65. As this is more than 0.5, we consider the given
sample adequate and see no need for additional samples.
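The Bartlett figures reported earlier in this section can be reproduced from the chi-square statistic alone. The sketch below assumes 11 variables in the correlation matrix (matching the 11 rows of the loadings tables). The degrees of freedom of the test are p(p - 1)/2 for p variables.

```r
p_vars <- 11                           # assumed number of variables in the matrix
df     <- p_vars * (p_vars - 1) / 2    # degrees of freedom of Bartlett's test
chisq  <- 619.2726                     # $chisq from the output above

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

df        # compare with the reported $df of 55
p_value   # compare with the reported $p.value
```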
8. Factor interpretation
EV = Value$values
> EV
[1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378 0.40151815 0.24695154 0.20355327 0.13284158
[11] 0.09842702
As per the elbow rule in the graph, 5 factors can be taken. But we will follow the Kaiser rule
(retain only factors with an eigenvalue greater than 1) and consider only 4 factors.
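The Kaiser rule can be applied directly to the eigenvalues printed above. The sketch below copies those values, counts how many exceed 1, and computes the share of total variance the retained components carry. The plot at the end is an ordinary scree plot for the elbow rule.

```r
# Eigenvalues copied from the EV output above
EV <- c(3.42697133, 2.55089671, 1.69097648, 1.08655606, 0.60942409,
        0.55188378, 0.40151815, 0.24695154, 0.20355327, 0.13284158,
        0.09842702)

n_kaiser <- sum(EV > 1)        # Kaiser rule: keep eigenvalues above 1
prop_var <- EV / sum(EV)       # share of total variance per component
cum_var  <- cumsum(prop_var)   # cumulative share

n_kaiser                            # number of factors retained
round(100 * cum_var[n_kaiser], 1)   # % of variance kept by the retained factors

# Scree plot for the elbow rule, with the Kaiser cut-off as a dashed line
plot(EV, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)
```

The eigenvalues of a correlation matrix sum to the number of variables, so `sum(EV)` being 11 (up to rounding) is a quick sanity check on the input.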
Below are the results of principal component analysis without rotation. We observe that the 11
independent variables are reduced to 4 factors.
principal(r = depvardata, nfactors = 4, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
                      PC1   PC2   PC3   PC4   h2    u2 com
Product.Quality      0.25 -0.50 -0.08  0.67 0.77 0.232 2.2
Ecommerce            0.31  0.71  0.31  0.28 0.78 0.223 2.1
Technical.Support    0.29 -0.37  0.79 -0.20 0.89 0.107 1.9
Complain Resolution  0.87  0.03 -0.27 -0.22 0.88 0.119 1.3
Advertising          0.34  0.58  0.11  0.33 0.58 0.424 2.4
Product.Line         0.72 -0.45 -0.15  0.21 0.79 0.213 2.0
Sales.Force.Image    0.38  0.75  0.31  0.23 0.86 0.141 2.1
Competitive.Pricing -0.28  0.66 -0.07 -0.35 0.64 0.359 1.9
Warranty&Claims      0.39 -0.31  0.78 -0.19 0.89 0.108 2.0
Order.Billing        0.81  0.04 -0.22 -0.25 0.77 0.234 1.3
Delevery.Speed       0.88  0.12 -0.30 -0.21 0.91 0.086 1.4
To make the factors easier to interpret and move the loadings closer to 0 or 1 in absolute value, we
apply a varimax rotation, which does not affect the communalities.
Principal Components Analysis
Call: principal(r = depvardata, nfactors = 4, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
                      RC1   RC2   RC3   RC4   h2    u2 com
Product.Quality      0.00 -0.01 -0.03  0.88 0.77 0.232 1.0
Ecommerce            0.06  0.87  0.05 -0.12 0.78 0.223 1.1
Technical.Support    0.02 -0.02  0.94  0.10 0.89 0.107 1.0
Complain Resolution  0.93  0.12  0.05  0.09 0.88 0.119 1.1
Advertising          0.14  0.74 -0.08  0.01 0.58 0.424 1.1
Product.Line         0.59 -0.06  0.15  0.64 0.79 0.213 2.1
Sales.Force.Image    0.13  0.90  0.08 -0.16 0.86 0.141 1.1
Competitive.Pricing -0.09  0.23 -0.25 -0.72 0.64 0.359 1.5
Warranty&Claims      0.11  0.05  0.93  0.10 0.89 0.108 1.1
Order.Billing        0.86  0.11  0.08  0.04 0.77 0.234 1.1
Delevery.Speed       0.94  0.18  0.00  0.05 0.91 0.086 1.1
1. RC1 = This factor relates to Delivery Speed, Complaint Resolution and Order Billing.
Name: Post.sale
2. RC2 = This factor relates to Sales Force Image, Ecommerce and Advertising.
Name: Marketing
3. RC3 = This factor relates to Technical Support and Warranty & Claims.
Name: Support
4. RC4 = This factor relates to Product Quality, Competitive Pricing (negatively) and Product
Line.
Name: Product.attributes
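The grouping above follows from assigning each variable to the factor on which it has the largest absolute loading. The sketch below re-enters the rotated loadings from the varimax table (values copied by hand, so subject to the table's two-decimal rounding) and performs that assignment. It also recomputes the communality of each variable as the row sum of squared loadings.

```r
# Rotated (varimax) loadings copied from the table above
vars <- c("Product.Quality", "Ecommerce", "Technical.Support",
          "Complain Resolution", "Advertising", "Product.Line",
          "Sales.Force.Image", "Competitive.Pricing", "Warranty&Claims",
          "Order.Billing", "Delevery.Speed")
loadings <- matrix(c(
   0.00, -0.01, -0.03,  0.88,
   0.06,  0.87,  0.05, -0.12,
   0.02, -0.02,  0.94,  0.10,
   0.93,  0.12,  0.05,  0.09,
   0.14,  0.74, -0.08,  0.01,
   0.59, -0.06,  0.15,  0.64,
   0.13,  0.90,  0.08, -0.16,
  -0.09,  0.23, -0.25, -0.72,
   0.11,  0.05,  0.93,  0.10,
   0.86,  0.11,  0.08,  0.04,
   0.94,  0.18,  0.00,  0.05),
  ncol = 4, byrow = TRUE,
  dimnames = list(vars, c("RC1", "RC2", "RC3", "RC4")))

# Assign each variable to the factor with the largest absolute loading
assigned <- colnames(loadings)[apply(abs(loadings), 1, which.max)]
names(assigned) <- vars
split(names(assigned), assigned)   # variables grouped by factor

# Communality = row sum of squared loadings; compare with the h2 column
round(rowSums(loadings^2), 2)
```

Product.Line loads almost equally on RC1 (0.59) and RC4 (0.64), so its placement under Product.attributes is the least clear-cut of the eleven.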
11. Multiple Linear Regression Analysis
lm(formula = `Customer Satisfaction` ~ Post.sale + Marketing +
Support + Product.attributes)
Residuals:
Min 1Q Median 3Q Max
-1.6346 -0.5021 0.1368 0.4617 1.5235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91813 0.07087 97.617 < 2e-16 ***
Post.sale 0.61799 0.07122 8.677 1.11e-13 ***
Marketing 0.50994 0.07123 7.159 1.71e-10 ***
Support 0.06686 0.07120 0.939 0.35
Product.attributes 0.54014 0.07124 7.582 2.27e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
12. Output Interpretation in business terms
We have taken the four new factors as independent variables and Customer Satisfaction as the
dependent variable:
R-squared: In the above model the R-squared is 66.07%, which is fairly high;
66.07% of the variation in Customer Satisfaction is explained by the four factors together.
The p-value of the F-statistic (F = 46.25), reported as 2.2e-16, is much less than the 5% significance
level. Hence we reject the null hypothesis that all betas are zero and accept the alternative
hypothesis that at least one beta is non-zero.
There is evidence that a regression relationship exists in the population.
The regression has 4 degrees of freedom and the total is 99 (100 observations - 1) degrees of
freedom. Hence the error (residual) has 99 - 4 = 95 degrees of freedom.
Adjusted R-squared, which adjusts R-squared for the degrees of freedom used by each
added predictor, is 64.64%.
The individual coefficients of three factors (Post.sale, Marketing and Product.attributes) are highly
significant, as the p-values of their t-statistics are below the 5% alpha level. The Support factor is the exception (p = 0.35).
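The adjusted R-squared and the F-statistic quoted above both follow from R-squared, the number of observations and the number of predictors. The sketch below recomputes them from the reported figures; the variable names are chosen for this example only.

```r
r2 <- 0.6607   # multiple R-squared reported above
n  <- 100      # observations
k  <- 4        # predictors (the four factors)

df_resid <- n - k - 1                            # residual degrees of freedom
adj_r2   <- 1 - (1 - r2) * (n - 1) / df_resid    # adjusted R-squared
f_stat   <- (r2 / k) / ((1 - r2) / df_resid)     # overall F-statistic
p_value  <- pf(f_stat, k, df_resid, lower.tail = FALSE)

df_resid                 # compare with the 95 residual degrees of freedom
round(100 * adj_r2, 2)   # compare with the reported 64.64%
round(f_stat, 2)         # compare with the reported F value of 46.25
p_value                  # upper-tail probability of the F distribution
```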
1. We split the data into Train and Test sets in the ratio of 70:30.
2. Train data: 70% of the data is used for model development, and the remaining 30% is used
for validation purposes.
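A 70:30 split of this kind can be sketched with base R as shown below. The data frame, its values and the seed are invented for the example, and the report does not say which function was actually used for the split.

```r
set.seed(123)   # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the factor-score data (100 rows, invented values)
dat <- data.frame(Satisfaction = rnorm(100, mean = 7),
                  Post.sale    = rnorm(100),
                  Marketing    = rnorm(100))

# Draw 70% of the row numbers at random, without replacement
train_idx <- sample(seq_len(nrow(dat)), size = round(0.7 * nrow(dat)))
Train <- dat[train_idx, ]    # 70% for model development
Test  <- dat[-train_idx, ]   # remaining 30% held out for validation

c(train = nrow(Train), test = nrow(Test))
```

The training model reported in this section has 66 residual degrees of freedom with 5 estimated coefficients, which implies 71 training rows, so the split used in the report was evidently not an exact 70-row draw like this one.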
Model Development:
The R-squared is 72.4% and the p-value is far below the alpha level, so the model is valid.
lm(formula = lm(`Customer Satisfaction` ~ Post.sale + Marketing +
Support + Product.attributes), data = Train)
Residuals:
Min 1Q Median 3Q Max
-1.41367 -0.51245 0.02767 0.46999 1.55876
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.89801 0.07918 87.121 < 2e-16 ***
Post.sale 0.53704 0.08348 6.433 1.63e-08 ***
Marketing 0.60431 0.07481 8.078 1.92e-11 ***
Support 0.05046 0.07916 0.637 0.526
Product.attributes 0.51975 0.08093 6.422 1.70e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.66 on 66 degrees of freedom
Multiple R-squared: 0.7245, Adjusted R-squared: 0.7078
F-statistic: 43.39 on 4 and 66 DF, p-value: < 2.2e-16
14. Prediction of the Model:
We use the remaining 30% of the data as the Test set and predict with the model.
At a confidence level of 95%, three values are given for each observation: the fitted value,
the upper limit and the lower limit. The analyst can choose among these values as per the organization's rules.
Predtest
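Predictions with a fitted value plus lower and upper limits are what `predict()` returns for an `lm` model when an interval is requested. The sketch below uses invented data with a single predictor. The names `x` and `y`, the seed and the choice of `interval = "confidence"` are assumptions, as the report does not show the call that produced `Predtest`.

```r
set.seed(1)   # arbitrary seed

# Hypothetical data (invented): one predictor and a noisy linear response
Train   <- data.frame(x = rnorm(70))
Train$y <- 7 + 0.6 * Train$x + rnorm(70, sd = 0.7)
Test    <- data.frame(x = rnorm(30))

model <- lm(y ~ x, data = Train)

# Matrix with columns fit, lwr and upr for each test row, at the 95% level
Predtest <- predict(model, newdata = Test,
                    interval = "confidence", level = 0.95)
head(Predtest)
```

Using `interval = "prediction"` instead gives wider limits, because it covers individual new observations and not just the mean response.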