Advanced Statistics Project

This document describes analyzing salary data using one-way and two-way ANOVA. One-way ANOVA is performed on salary with respect to education and occupation individually. For education, the null hypothesis of equal means is rejected, but for occupation it is not rejected. A Tukey test shows salary means differ significantly across educational levels. A two-way ANOVA and interaction plot are then used to analyze the effects of education and occupation on salary together, showing their interaction. Principal component analysis is also proposed to reduce dimensionality and further analyze the data.

Uploaded by

WildShyam

ADVANCED STATISTICS

ANOVA_EDA_PCA

PROJECT ANALYSIS REPORT

Contents

Problem 1 A
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different. Interpret the result. (Non-Graded)

Problem 1 B
1. What is the interaction between two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’ function from the ‘seaborn’ function]
2. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?
3. Explain the business implications of performing ANOVA for this particular case study.

Problem 2
1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
3. Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
4. Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not treat Outliers unless specifically asked to do so]
5. Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
6. Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
8. Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
9. Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis?

List of Tables
1A.1 - ANOVA for Education
1A.2 - ANOVA for Occupation
1A.3 – Tukey Test
1B.1 – Two way ANOVA - Education and Occupation
2.1 – Statistical Summary of Continuous Variables
2.2 – Eigen Vectors
2.3 - Eigen Values
2.4 - Variance ratio
2.5 - Cumulative Sum of Explained variance ratio
2.6 – Selected PCs
2.7 – Selected fit transformed Scaled Data Head

List of Figures
1B.1 - Interaction Plot between Education and Occupation
2.1 – Distribution of Variables
2.2 – Boxplot of variables
2.3 - Pairplot of Variables
2.4 – Heatmap of Variables
2.5 – Distribution of Scaled Data
2.6 – Heatmap of Scaled Data
2.7 – Boxplot of Scaled Data
2.8 – Screeplot of Components Variance
2.9 - Loading of Selected PC Absolute Values
2.10 - Heatmap of Selected Scaled Data
2.11 - Heatmap of Selected – Fit Transform Data
2.12 – PC Index to Explained Variance ratio

Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate. Occupation
is at four levels, Administrative and clerical, Sales, Professional or specialty, and Executive
or managerial. A different number of observations are in each level of education –
occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption
may not always hold if the sample size is small.]

1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.

Null and Alternate Hypothesis for One way ANOVA for Education:

H0: The mean of the 'Salary' variable is equal across all Education categories.

HA1: The mean of the 'Salary' variable differs for at least one Education category.

Null and Alternate Hypothesis for One way ANOVA for Occupation:

H0: The mean of the 'Salary' variable is equal across all Occupation categories.

HA2: The mean of the 'Salary' variable differs for at least one Occupation category.

Where Alpha = 0.05

If p value < 0.05, reject the Null Hypothesis

If p value >= 0.05, fail to reject the Null Hypothesis

2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.

Table 1A.1 - ANOVA for Education

                df        sum_sq       mean_sq         F        PR(>F)
C(Education)   2.0  1.026955e+11  5.134773e+10  30.95628  1.257709e-08
Residual      37.0  6.137256e+10  1.658718e+09       NaN           NaN

Since the p value in this scenario is less than Alpha (0.05), we reject the Null Hypothesis (H0).

3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.

Table 1A.2 - ANOVA for Occupation

                 df        sum_sq       mean_sq         F    PR(>F)
C(Occupation)   3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual       36.0  1.528092e+11  4.244701e+09       NaN       NaN

Since the p value in this scenario is greater than Alpha (0.05), we fail to reject the Null
Hypothesis (H0).
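
The F statistics in Tables 1A.1 and 1A.2 can be checked by hand from the reported sums of squares and degrees of freedom. The sketch below does that arithmetic. The helper name `anova_f` is illustrative (it is not part of the original analysis), and `scipy` is assumed to be available for the p-value.

```python
from scipy import stats

def anova_f(ss_between, df_between, ss_resid, df_resid):
    """Return (F, p) for a one-way ANOVA from its sums of squares."""
    ms_between = ss_between / df_between   # mean square between groups
    ms_resid = ss_resid / df_resid         # mean square within groups (residual)
    f_stat = ms_between / ms_resid
    p_value = stats.f.sf(f_stat, df_between, df_resid)  # upper tail of the F distribution
    return f_stat, p_value

# Values copied from Table 1A.1 (Education) and Table 1A.2 (Occupation)
f_edu, p_edu = anova_f(1.026955e11, 2, 6.137256e10, 37)
f_occ, p_occ = anova_f(1.125878e10, 3, 1.528092e11, 36)

print(f"Education:  F = {f_edu:.3f}, p = {p_edu:.3g}")
print(f"Occupation: F = {f_occ:.3f}, p = {p_occ:.3g}")
```

The recomputed F values should agree with the tables to the printed precision, which confirms that each F is simply the ratio of the two mean squares.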

4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)

Tukey Test to interpret the statistical significance of the ANOVA test:

Table 1A.3 – Tukey Test

Multiple Comparison of Means - Tukey HSD, FWER=0.05

=========================================================================
 group1     group2      meandiff   p-adj        lower        upper  reject
-------------------------------------------------------------------------
Bachelors  Doctorate   43274.0667  0.0146    7541.1439   79006.9894   True
Bachelors  HS-grad    -90114.1556  0.001  -132035.1958  -48193.1153   True
Doctorate  HS-grad   -133388.2222  0.001  -174815.0876  -91961.3569   True

Interpretation from table:

All three adjusted p-values are below 0.05, so mean Salary differs significantly between every pair of Educational Levels.

Reading meandiff as the group2 mean minus the group1 mean, the ordering of mean Salary is Doctorate > Bachelors > HS-grad. The largest gap is between Doctorate and HS-grad, and the smallest is between Bachelors and Doctorate.
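
As a small worked check of Table 1A.3, the sketch below re-derives the `reject` column from the confidence intervals and reads off which group has the higher mean in each pair. It assumes that `meandiff` is the group2 mean minus the group1 mean; that reading is an assumption about the table's convention, not something stated in the report.

```python
# Rows copied from Table 1A.3: (group1, group2, meandiff, lower, upper)
rows = [
    ("Bachelors", "Doctorate", 43274.0667, 7541.1439, 79006.9894),
    ("Bachelors", "HS-grad", -90114.1556, -132035.1958, -48193.1153),
    ("Doctorate", "HS-grad", -133388.2222, -174815.0876, -91961.3569),
]

def excludes_zero(lower, upper):
    """A difference is significant when its interval does not contain 0."""
    return lower > 0 or upper < 0

for g1, g2, diff, lo, hi in rows:
    # Assumption: meandiff = mean(group2) - mean(group1)
    higher = g2 if diff > 0 else g1
    print(f"{g1} vs {g2}: reject={excludes_zero(lo, hi)}, higher mean: {higher}")

significant = [excludes_zero(lo, hi) for _, _, _, lo, hi in rows]
```

Under that assumption the three differences are also mutually consistent: the Bachelors to Doctorate gap plus the HS-grad to Bachelors gap equals the HS-grad to Doctorate gap.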

Problem 1B:
1 .What is the interaction between two treatments? Analyse the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot. [Hint: use the
‘pointplot’ function from the ‘seaborn’ function]

Fig 1B.1 - Interaction Plot between Education and Occupation

From the above plot we can make out that the interaction between people with:

Adm-clerical job with Bachelors and Doctorates is fairly good.
Sales job with Bachelors and Doctorates is good.
Prof-specialty job with HS-grad and Bachelors is moderate.
Exec-managerial job role shows no interaction with any other educational background.

From the above plot we can also see the following for each educational level:

Doctorates: are in the higher salary brackets, mostly in Prof-specialty, Exec-managerial or Sales roles; very few are doing Adm-clerical jobs.
Bachelors: fall in the mid income range and are mostly found working in Exec-managerial, Adm-clerical or Sales roles; very few are found in Prof-specialty roles.
HS-grads: are in the low income brackets, mostly doing Prof-specialty or Adm-clerical work; a few are in Sales and hardly any are in Exec-managerial roles.
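
An interaction plot draws the mean Salary of each Education and Occupation cell, so the same information can be tabulated before plotting. The sketch below uses a small hypothetical data frame (the real analysis would load SalaryData.csv, and the column names `Education`, `Occupation` and `Salary` are assumed). The plotting helper `plot_interaction` is an illustrative name and imports seaborn lazily.

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis would load SalaryData.csv instead.
df = pd.DataFrame({
    "Education":  ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                   "Doctorate", "Doctorate", "Bachelors", "Doctorate"],
    "Occupation": ["Sales", "Adm-clerical", "Sales", "Adm-clerical",
                   "Sales", "Adm-clerical", "Sales", "Sales"],
    "Salary":     [50000, 60000, 150000, 170000, 200000, 180000, 190000, 220000],
})

# Mean salary per (Education, Occupation) cell: the values an interaction plot draws.
cell_means = df.pivot_table(index="Education", columns="Occupation",
                            values="Salary", aggfunc="mean")
print(cell_means)

def plot_interaction(data):
    """Draw the interaction plot; seaborn is imported lazily so the
    table above can be computed without a plotting backend."""
    import seaborn as sns
    return sns.pointplot(data=data, x="Education", y="Salary", hue="Occupation")
```

Calling `plot_interaction(df)` draws one line per Occupation. Lines that are roughly parallel suggest little interaction, while lines that cross or diverge suggest that the effect of Education depends on Occupation.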

Table 1B.1 – Two way ANOVA - Education and Occupation

Null and Alternate Hypothesis for Two way ANOVA:

H0: The mean of the Salary variable is equal across Education levels and across Occupation types, and there is no interaction between Education and Occupation.

HA2: At least one of the means of 'Salary' differs across Education levels or Occupation types, or the two factors interact.

Where Alpha = 0.05

If p value < 0.05, reject the Null Hypothesis
If p value >= 0.05, fail to reject the Null Hypothesis

The p-value of Education is 5.44e-12, which is less than α (0.05), so we reject the Null Hypothesis (H0). This shows that the means of the 'Salary' variable with respect to each Education category are not all equal.

The p-value of Occupation is 0.072, which is greater than α (0.05), so we fail to reject the Null Hypothesis (H0). There is not enough evidence that the means of the 'Salary' variable differ across Occupation categories.

The p-value of the Education and Occupation interaction is 0.00002232, which is < 0.05, so the interaction is statistically significant: the effect of Education on Salary depends on Occupation.

3. Explain the business implications of performing ANOVA for this particular case study.

ANOVA stands for “analysis of variance”. It is a statistical test that compares the means of a dependent variable across groups defined by one or more independent variables, in order to determine if there is a difference between them. It is used when more than two group means are compared. For two group means, we can do a t-test.

In a business context, ANOVA is used here to help manage salary by showing how salary varies with education and occupation.

The patterns it reveals in the data can also help to form expectations about salary trends and future salary hikes.

It is also a widely used statistical technique for comparing the factors associated with a rise in Salary, assuming this report is for an HR department or an HR consulting firm.
Some of the key takeaways are as below:
• As the Education level rises, Salary increases. On average a Doctorate earns a higher salary than Bachelors and HS-grads. However, being a Doctorate does not necessarily mean a significantly higher salary than HS-grad or Bachelors employees in every role. Doctorates are not always preferred above other education levels and may sometimes be considered overqualified for certain job roles.
• Occupation on its own is less significant for Salary than Education, but at certain education levels it does affect Salary.
• We must also note that for a few occupations higher salaries are offered to Bachelor's degree holders than to Doctorates. This points to shortcomings of the dataset provided, which reduce the accuracy of the test and analysis, as other important variables such as years of experience, specialisation and industry/domain can also affect salary.
• The HR department plays a more comprehensive role when setting up salary bands. Similar job titles in different industries demand different salary packages according to the job profile, and years of experience also matter when deciding a person's pay scale.
• The ANOVA test indicates that Education level coupled with Occupation has a more significant influence on salary than Occupation type alone.

…………………………………………………….**********************……………………………………………

Problem 2:

The dataset Education - Post 12th Standard.csv contains information on various colleges.

You are expected to do a Principal Component Analysis for this case study according to
the instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can
be found in the following file: Data Dictionary.xlsx.

1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be

performed]. What insight do you draw from the EDA?

Table 2.1 – Statistical Summary of Continuous Variables

Observations:
• The data loaded correctly.
• There are 777 rows and 18 variables.
• There are no missing values in the given file.
• 'Name' is of object type, S.F.Ratio is of float type and the other variables are of integer type.
• The data describes current college records, which can be used to compare college performance when choosing a post-12th (+2) admission.
• There are no duplicate records.

Fig. 2.1 – Distribution of Variables

Observations from plot:

Left skewed: PhD, Terminal

Right skewed: Apps, Accept, Enroll, Top10perc, F.Undergrad, P.Undergrad, Room.Board, Books, Personal, S.F.Ratio, perc.alumni, Expend

Approximately normal (bell curve): Top25perc, Outstate, Grad.Rate

Fig 2.2 – Boxplot of variables

Boxplot of variables to check the presence of outliers

Fig 2.3 - Pairplot of Variables

Pairplot to find the Correlation between Variables:

The pairs below have high Correlation:

• Apps and Accept
• Accept and Enroll
• Top10perc and Top25perc
• Enroll and F.Undergrad
• PhD and Terminal

Fig 2.4 – Heatmap of Variables

Heat Map to check collinearity of Original Data

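2. Is scaling necessary for PCA in this case? Give justification and perform scaling.

• The data has 18 variables, of which the 17 numeric variables go into PCA. They are measured on very different scales (counts such as Apps, percentages such as Top10perc, ratios such as S.F.Ratio). PCA is driven by variance, so without scaling the variables with the largest ranges would dominate the components. Hence scaling is required.
• Once the data is scaled, the amount of variance explained by each component can be used to decide how many components to retain for analysis.
• After scaling, the standard deviation is 1.0 for all variables. The difference between the Q1 (25%) value and the minimum value is smaller than in the original dataset for most of the variables.
• The z-score method scales the data in such a way that the mean value of each feature becomes 0 and the standard deviation becomes 1.
• The Min-Max method, by contrast, ensures that the data is scaled to values in the range 0 to 1.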

Fig 2.5 – Distribution of Scaled Data
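
The z-score and Min-Max scaling described above can be illustrated on a tiny example. This is a minimal sketch on hypothetical values (two made-up columns on very different scales), not the report's data. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical columns on very different scales (e.g. an application count vs. a ratio)
X = np.array([[1660.0, 18.1],
              [2186.0, 12.2],
              [1428.0, 12.9],
              [ 417.0,  7.7],
              [ 193.0, 11.9]])

# Z-score: subtract the column mean, divide by the column standard deviation
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max: map each column onto the range [0, 1]
mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(z.mean(axis=0).round(6), z.std(axis=0))
print(mm.min(axis=0), mm.max(axis=0))
```

After z-scoring both columns have mean 0 and standard deviation 1, so neither dominates a variance-based method such as PCA.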

3. Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].

• Covariance and correlation matrices both measure the relationship and the dependency between pairs of variables.
• “Covariance” indicates the direction of the linear relationship between variables. It is affected by a change in scale.
• “Correlation”, on the other hand, measures both the strength and the direction of the linear relationship between two variables. It is the scaled form of covariance.
• On scaled data (mean 0, standard deviation 1) the covariance matrix is therefore essentially the same as the correlation matrix.
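
The relationship between the two matrices can be checked numerically. The sketch below uses synthetic data with deliberately mismatched scales and standardizes with the sample standard deviation (`ddof=1`) so that it matches NumPy's default covariance. It is an illustration, not the report's computation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: three columns with deliberately different scales
a = rng.normal(size=200)
X = np.column_stack([1000 * a + rng.normal(size=200),
                     a + rng.normal(size=200),
                     0.01 * rng.normal(size=200)])

# Standardize with the sample standard deviation (ddof=1), matching np.cov
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_scaled = np.cov(Z, rowvar=False)       # covariance of the scaled data
corr_raw = np.corrcoef(X, rowvar=False)    # correlation of the raw data

print(np.allclose(cov_scaled, corr_raw))
```

If the data is instead standardized with the population standard deviation, the covariance matrix equals the correlation matrix multiplied by n/(n-1), which is very close to 1 for 777 rows.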

Fig 2.6 – Heatmap of Scaled Data

From the Heatmap of scaled data, below are evident:

Strong Correlation between:

• Apps & Accept
• Accept & Enroll
• Enroll & F.Undergrad
• Top10perc & Top25perc
• PhD & Terminal

Weak Correlation between:
• Top10perc & S.F.Ratio
• Top25perc & S.F.Ratio
• Outstate & S.F.Ratio
• Room.Board & S.F.Ratio
• perc.alumni & S.F.Ratio
• Expend & S.F.Ratio

4. Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]

Fig 2.7 – Boxplot of Scaled Data

Scaling shrinks the range of the feature values, as shown in Fig 2.7. However, the outliers have an influence when computing the empirical mean and standard deviation. Standard Scaler therefore cannot guarantee balanced feature scales in the presence of outliers.

Comparing the boxplots in Fig 2.2 and Fig 2.7, there is not much difference in terms of outliers: scaling changes the units of each variable but does not remove its outliers.
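
The observation that scaling leaves the outliers in place follows from the boxplot rule itself: z-scoring shifts and rescales every value and both whiskers in the same way. The sketch below illustrates this on a synthetic skewed variable. The helper name `iqr_outlier_count` and the data are illustrative assumptions.

```python
import numpy as np

def iqr_outlier_count(x):
    """Count points outside the 1.5 * IQR whiskers used by a boxplot."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

rng = np.random.default_rng(2)
# Synthetic right-skewed variable with a few extreme values appended
x = np.concatenate([rng.exponential(1000, size=300), [20000.0, 25000.0, 40000.0]])
z = (x - x.mean()) / x.std()   # z-score scaling

before, after = iqr_outlier_count(x), iqr_outlier_count(z)
print(before, after)
```

The same points are flagged before and after scaling, which is why the two boxplot figures look alike apart from their axes.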

5. Extract the eigenvalues and eigenvectors.[Using Sklearn PCA Print Both]

Table 2.2 – Eigen Vectors

array([[ 2.48765602e-01, 2.07601502e-01, 1.76303592e-01,
3.54273947e-01, 3.44001279e-01, 1.54640962e-01,
2.64425045e-02, 2.94736419e-01, 2.49030449e-01,
6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
3.17056016e-01, -1.76957895e-01, 2.05082369e-01,
3.18908750e-01, 2.52315654e-01],
[ 3.31598227e-01, 3.72116750e-01, 4.03724252e-01,
-8.24118211e-02, -4.47786551e-02, 4.17673774e-01,
3.15087830e-01, -2.49643522e-01, -1.37808883e-01,
5.63418434e-02, 2.19929218e-01, 5.83113174e-02,
4.64294477e-02, 2.46665277e-01, -2.46595274e-01,
-1.31689865e-01, -1.69240532e-01],
[-6.30921033e-02, -1.01249056e-01, -8.29855709e-02,
3.50555339e-02, -2.41479376e-02, -6.13929764e-02,
1.39681716e-01, 4.65988731e-02, 1.48967389e-01,
6.77411649e-01, 4.99721120e-01, -1.27028371e-01,
-6.60375454e-02, -2.89848401e-01, -1.46989274e-01,
2.26743985e-01, -2.08064649e-01],
[ 2.81310530e-01, 2.67817346e-01, 1.61826771e-01,
-5.15472524e-02, -1.09766541e-01, 1.00412335e-01,
-1.58558487e-01, 1.31291364e-01, 1.84995991e-01,
8.70892205e-02, -2.30710568e-01, -5.34724832e-01,
-5.19443019e-01, -1.61189487e-01, 1.73142230e-02,
7.92734946e-02, 2.69129066e-01],
[ 5.74140964e-03, 5.57860920e-02, -5.56936353e-02,
-3.95434345e-01, -4.26533594e-01, -4.34543659e-02,
3.02385408e-01, 2.22532003e-01, 5.60919470e-01,
-1.27288825e-01, -2.22311021e-01, 1.40166326e-01,
2.04719730e-01, -7.93882496e-02, -2.16297411e-01,
7.59581203e-02, -1.09267913e-01],
[-1.62374420e-02, 7.53468452e-03, -4.25579803e-02,
-5.26927980e-02, 3.30915896e-02, -4.34542349e-02,
-1.91198583e-01, -3.00003910e-02, 1.62755446e-01,
6.41054950e-01, -3.31398003e-01, 9.12555212e-02,
1.54927646e-01, 4.87045875e-01, -4.73400144e-02,
-2.98118619e-01, 2.16163313e-01],
[-4.24863486e-02, -1.29497196e-02, -2.76928937e-02,
-1.61332069e-01, -1.18485556e-01, -2.50763629e-02,
6.10423460e-02, 1.08528966e-01, 2.09744235e-01,
-1.49692034e-01, 6.33790064e-01, -1.09641298e-03,
-2.84770105e-02, 2.19259358e-01, 2.43321156e-01,
-2.26584481e-01, 5.59943937e-01],
[-1.03090398e-01, -5.62709623e-02, 5.86623552e-02,
-1.22678028e-01, -1.02491967e-01, 7.88896442e-02,
5.70783816e-01, 9.84599754e-03, -2.21453442e-01,
2.13293009e-01, -2.32660840e-01, -7.70400002e-02,
-1.21613297e-02, -8.36048735e-02, 6.78523654e-01,
-5.41593771e-02, -5.33553891e-03],
[-9.02270802e-02, -1.77864814e-01, -1.28560713e-01,
3.41099863e-01, 4.03711989e-01, -5.94419181e-02,
5.60672902e-01, -4.57332880e-03, 2.75022548e-01,
-1.33663353e-01, -9.44688900e-02, -1.85181525e-01,
-2.54938198e-01, 2.74544380e-01, -2.55334907e-01,
-4.91388809e-02, 4.19043052e-02],
[ 5.25098025e-02, 4.11400844e-02, 3.44879147e-02,
6.40257785e-02, 1.45492289e-02, 2.08471834e-02,
-2.23105808e-01, 1.86675363e-01, 2.98324237e-01,
-8.20292186e-02, 1.36027616e-01, -1.23452200e-01,
-8.85784627e-02, 4.72045249e-01, 4.22999706e-01,
1.32286331e-01, -5.90271067e-01],
[ 4.30462074e-02, -5.84055850e-02, -6.93988831e-02,
-8.10481404e-03, -2.73128469e-01, -8.11578181e-02,
1.00693324e-01, 1.43220673e-01, -3.59321731e-01,
3.19400370e-02, -1.85784733e-02, 4.03723253e-02,
-5.89734026e-02, 4.45000727e-01, -1.30727978e-01,
6.92088870e-01, 2.19839000e-01],
[ 2.40709086e-02, -1.45102446e-01, 1.11431545e-02,
3.85543001e-02, -8.93515563e-02, 5.61767721e-02,
-6.35360730e-02, -8.23443779e-01, 3.54559731e-01,
-2.81593679e-02, -3.92640266e-02, 2.32224316e-02,
1.64850420e-02, -1.10262122e-02, 1.82660654e-01,
3.25982295e-01, 1.22106697e-01],
[ 5.95830975e-01, 2.92642398e-01, -4.44638207e-01,
1.02303616e-03, 2.18838802e-02, -5.23622267e-01,
1.25997650e-01, -1.41856014e-01, -6.97485854e-02,
1.14379958e-02, 3.94547417e-02, 1.27696382e-01,
-5.83134662e-02, -1.77152700e-02, 1.04088088e-01,
-9.37464497e-02, -6.91969778e-02],
[ 8.06328039e-02, 3.34674281e-02, -8.56967180e-02,
-1.07828189e-01, 1.51742110e-01, -5.63728817e-02,
1.92857500e-02, -3.40115407e-02, -5.84289756e-02,
-6.68494643e-02, 2.75286207e-02, -6.91126145e-01,
6.71008607e-01, 4.13740967e-02, -2.71542091e-02,
7.31225166e-02, 3.64767385e-02],
[ 1.33405806e-01, -1.45497511e-01, 2.95896092e-02,
6.97722522e-01, -6.17274818e-01, 9.91640992e-03,
2.09515982e-02, 3.83544794e-02, 3.40197083e-03,
-9.43887925e-03, -3.09001353e-03, -1.12055599e-01,
1.58909651e-01, -2.08991284e-02, -8.41789410e-03,
-2.27742017e-01, -3.39433604e-03],
[ 4.59139498e-01, -5.18568789e-01, -4.04318439e-01,
-1.48738723e-01, 5.18683400e-02, 5.60363054e-01,
-5.27313042e-02, 1.01594830e-01, -2.59293381e-02,
2.88282896e-03, -1.28904022e-02, 2.98075465e-02,
-2.70759809e-02, -2.12476294e-02, 3.33406243e-03,
-4.38803230e-02, -5.00844705e-03],
[ 3.58970400e-01, -5.43427250e-01, 6.09651110e-01,
-1.44986329e-01, 8.03478445e-02, -4.14705279e-01,
9.01788964e-03, 5.08995918e-02, 1.14639620e-03,
7.72631963e-04, -1.11433396e-03, 1.38133366e-02,
6.20932749e-03, -2.22215182e-03, -1.91869743e-02,
-3.53098218e-02, -1.30710024e-02]])

Table 2.3 - Eigen Values

array([5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,
0.84849117, 0.6057878 , 0.58787222, 0.53061262, 0.4043029 ,
0.31344588, 0.22061096, 0.16779415, 0.1439785 , 0.08802464,
0.03672545, 0.02302787])
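
Eigenvectors and eigenvalues like those in Tables 2.2 and 2.3 are available from a fitted scikit-learn `PCA` object as `components_` and `explained_variance_`. The sketch below shows this on a small synthetic matrix (four columns rather than the 17 college variables) and cross-checks the eigenvalues against a direct eigen-decomposition of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic stand-in for the numeric college variables (here: 4 columns)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], 2 * base[:, 0] + rng.normal(size=100),
                     base[:, 1], base[:, 1] - rng.normal(size=100)])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)              # keep all components

eigenvectors = pca.components_         # one row per component
eigenvalues = pca.explained_variance_  # sorted from largest to smallest

# Cross-check against a direct eigen-decomposition of the covariance matrix
direct = np.linalg.eigvalsh(np.cov(X_scaled, rowvar=False))[::-1]
print(eigenvalues, direct)
```

Each row of `components_` is a unit-length eigenvector, and the rows are mutually orthogonal.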

6. Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features

Statistical tests to be done before PCA:

Bartlett's Test of Sphericity:

p-value found: 0.0
Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the population.
H0: All variables in the data are uncorrelated
Ha: At least one pair of variables in the data is correlated
If the null hypothesis cannot be rejected, then PCA is not advisable.
If the p-value is small, we can reject the null hypothesis and conclude that at least one pair of variables in the data is correlated, hence PCA is recommended.

KMO Test:
MSA found: 0.8131
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how appropriate PCA is.
Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected.
On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the dimension and extraction of meaningful components.

Therefore, from the above tests, it is appropriate to proceed with PCA.
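
For reference, Bartlett's test of sphericity can be computed directly from the correlation matrix R of n observations on p variables: the statistic is -(n - 1 - (2p + 5)/6) * ln|R|, compared against a chi-square distribution with p(p - 1)/2 degrees of freedom. The sketch below implements that formula on synthetic data. The function name `bartlett_sphericity` is illustrative, and the report does not say which implementation it used.

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity on an (n_samples, n_features) array."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)            # log|R|, computed stably
    chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)

rng = np.random.default_rng(4)
shared = rng.normal(size=(300, 1))
correlated = shared + 0.5 * rng.normal(size=(300, 4))   # columns share a factor
independent = rng.normal(size=(300, 4))                 # no shared factor

chi2_c, p_c = bartlett_sphericity(correlated)
chi2_i, p_i = bartlett_sphericity(independent)
print(f"correlated:  chi2 = {chi2_c:.1f}, p = {p_c:.3g}")
print(f"independent: chi2 = {chi2_i:.1f}, p = {p_i:.3g}")
```

Strongly correlated columns push the determinant of R towards 0, which makes the statistic large and the p-value very small, as in the result reported above.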

Table 2.4 - Variance ratio

array([0.32020628, 0.26340214, 0.06900917, 0.05922989, 0.05488405,
0.04984701, 0.03558871, 0.03453621, 0.03117234, 0.02375192,
0.01841426, 0.01296041, 0.00985754, 0.00845842, 0.00517126,
0.00215754, 0.00135284])

Fig 2.8 – Screeplot of Components Variance

Renaming the Variables/Components to PC:

Table 2.5 - Cumulative Sum of Explained variance ratio

array([0.32020628, 0.58360843, 0.65261759, 0.71184748, 0.76673154,
0.81657854, 0.85216726, 0.88670347, 0.91787581, 0.94162773,
0.96004199, 0.9730024 , 0.98285994, 0.99131837, 0.99648962,
0.99864716, 1. ])

From the Screeplot and the Cumulative Sum of Explained variance ratio, it is evident that the first 9 components together explain more than 90% of the variance (about 91.8%), so we proceed with 9 principal components.
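
The variance ratios and their cumulative sum follow directly from the eigenvalues in Table 2.3, so the choice of 9 components can be reproduced with a few lines. The 90% threshold below is the one used in this report; the variable names are illustrative.

```python
import numpy as np

# Eigenvalues copied from Table 2.3
eigenvalues = np.array([
    5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,
    0.84849117, 0.6057878, 0.58787222, 0.53061262, 0.4043029,
    0.31344588, 0.22061096, 0.16779415, 0.1439785, 0.08802464,
    0.03672545, 0.02302787,
])

ratio = eigenvalues / eigenvalues.sum()   # share of variance per component
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 90%
k = int(np.argmax(cumulative >= 0.90)) + 1
print(ratio[:2].round(4), cumulative[:2].round(4), k)
```

The first ratio and the running totals should match Tables 2.4 and 2.5, and `k` is the number of components retained.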

Table 2.6 – Selected PCs

Extract the required (as per the cumulative explained variance) number of PCs
Create a data frame out of fit transformed scaled data above

Fig 2.9 - Loading of Selected PC Absolute Values:

Table 2.7 – Selected fit transformed Scaled Data Head

Fig 2.10 - Heatmap of Selected Scaled Data

Fig 2.11 - Heatmap of Selected – Fit Transform Data

7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [Hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]

The linear equation of the 1st component:

PC1 = 0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc
+ 0.34 * Top25perc + 0.15 * F.Undergrad + 0.03 * P.Undergrad
+ 0.29 * Outstate + 0.25 * Room.Board + 0.06 * Books - 0.04 * Personal
+ 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.21 * perc.alumni
+ 0.32 * Expend + 0.25 * Grad.Rate
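
The coefficients of the first PC are the entries of the first eigenvector in Table 2.2, rounded to two decimals. The sketch below rebuilds the equation from those values and shows how a PC1 score would be computed for one scaled observation. The helper name `pc1_score` is illustrative, and the feature order is taken from the equation in this section.

```python
# First eigenvector copied from Table 2.2, paired with the feature order
# used in the PC1 equation of this section.
features = ["Apps", "Accept", "Enroll", "Top10perc", "Top25perc",
            "F.Undergrad", "P.Undergrad", "Outstate", "Room.Board", "Books",
            "Personal", "PhD", "Terminal", "S.F.Ratio", "perc.alumni",
            "Expend", "Grad.Rate"]
pc1 = [2.48765602e-01, 2.07601502e-01, 1.76303592e-01, 3.54273947e-01,
       3.44001279e-01, 1.54640962e-01, 2.64425045e-02, 2.94736419e-01,
       2.49030449e-01, 6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
       3.17056016e-01, -1.76957895e-01, 2.05082369e-01, 3.18908750e-01,
       2.52315654e-01]

# Build the equation with loadings rounded to two decimals
terms = [f"{w:+.2f} * {name}" for w, name in zip(pc1, features)]
equation = "PC1 = " + " ".join(terms)
print(equation)

def pc1_score(scaled_row):
    """PC1 score of one observation given its 17 scaled feature values."""
    return sum(w * x for w, x in zip(pc1, scaled_row))

norm_sq = sum(w * w for w in pc1)   # eigenvectors have unit length
```

The squared loadings sum to approximately 1, as expected for a unit-length eigenvector, and only Personal and S.F.Ratio enter with a negative sign.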

8. Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?

Fig 2.12 – PC Index to Explained Variance ratio

The plot shows how much of the variance is explained by a given number of principal components.
From Tables 2.4 and 2.5, the 1st PC explains 32.02% of the variance, the first two PCs together explain 58.36%, and so on.
Effectively we can explain a material share of the variance (i.e. 90%) by analysing 9 principal components instead of all 17 variables (attributes) in the dataset.

PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the data. Because rotation is a kind of linear transformation, the new dimensions are weighted sums of the old ones. The eigenvectors (principal components) determine the directions, or axes, along which the linear transformation acts, stretching or compressing input vectors. They are the lines of change that represent the action of the larger matrix, the very “line” in linear transformation.

9. Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]

Principal component analysis is used here to reduce the multicollinearity between the variables.
The number of components to keep depends on how much of the dataset's variance we want to retain.
For this business case 9 principal components are retained, which together capture about 92% of the variance in the dataset.
Because the components are uncorrelated with one another, using them in place of the original variables removes the multicollinearity from further analysis.

…………………………………………………….**********************……………………………………………

3 pages
Pre Excavation Checklist
No ratings yet
Pre Excavation Checklist
1 page
Heat Stress Program - J38
No ratings yet
Heat Stress Program - J38
24 pages
Spreadsheet Modeling & Decision Analysis: A Practical Introduction To Management Science
100% (1)
Spreadsheet Modeling & Decision Analysis: A Practical Introduction To Management Science
56 pages
Soas Dissertation Late Submission
100% (2)
Soas Dissertation Late Submission
6 pages
Nylosolv A EN
No ratings yet
Nylosolv A EN
1 page

Advanced Statistics Project

Uploaded by

Advanced Statistics Project

Uploaded by

ADVANCED STATISTICS

PROJECT ANALYSIS REPORT

2. Perform a one-way ANOVA on Salary with respect to Education. State whether

3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether

Where Alpha = 0.05

If p value < 0.05, reject Null Hypothesis
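The decision rule above can be sketched in a few lines. This is a minimal illustration on synthetic salaries, not the report's data: the group means and sizes are made up, and `scipy.stats.f_oneway` is used here for brevity, whereas the table layout in the report (df, sum_sq, mean_sq, F, PR(>F)) suggests the original analysis used statsmodels.

```python
import numpy as np
from scipy import stats

# Synthetic salaries for three hypothetical education groups (not the report's data).
rng = np.random.default_rng(0)
groups = {
    "HS-grad": rng.normal(75_000, 20_000, 40),
    "Bachelors": rng.normal(165_000, 20_000, 40),
    "Doctorate": rng.normal(210_000, 20_000, 40),
}

alpha = 0.05
f_stat, p_value = stats.f_oneway(*groups.values())

# Decision rule from the text: reject H0 (equal group means) when p < alpha.
reject_h0 = p_value < alpha
print(f"F = {f_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

The same call with salaries grouped by Occupation would give the second one-way test.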

Table 1A.1 - ANOVA for Education

df sum_sq mean_sq F PR(>F)

df sum_sq mean_sq F PR(>F)

Tukey Test to Interpret the Statistical Significance of the ANOVA Test:

Table 1A.3 – Tukey Test

Multiple Comparison of Means - Tukey HSD, FWER=0.05

Interpretation from table:

Mean Salary differs significantly across Educational Levels.

Bachelors/Doctorate > Bachelors/HS-grad > Doctorate/HS-grad
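A pairwise comparison of this kind can be sketched as follows. The table title in the report appears to match the summary printed by statsmodels' `pairwise_tukeyhsd`, so that function is used here; the salaries are synthetic and the group means are assumptions chosen only to make the groups clearly different.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data: 30 salaries per education level (values are made up).
rng = np.random.default_rng(1)
levels = ["HS-grad", "Bachelors", "Doctorate"]
means = [75_000, 165_000, 210_000]
salary = np.concatenate([rng.normal(m, 20_000, 30) for m in means])
education = np.repeat(levels, 30)

# One row per pair of levels; `reject` is True where the pair's means differ
# at a family-wise error rate of 0.05.
result = pairwise_tukeyhsd(endog=salary, groups=education, alpha=0.05)
print(result.summary())
```

Each row of the summary gives the mean difference, its confidence interval, and whether the null hypothesis of equal means is rejected for that pair.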

Fig 1B.1 - Interaction Plot between Education and Occupation
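An interaction plot of this kind draws the mean Salary for each Education and Occupation combination. The sketch below computes those cell means with pandas on a tiny made-up sample; the column names follow the report, but the values and the two occupation labels are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up sample; the column names follow the report, the values do not.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "Doctorate", "Doctorate"] * 2,
    "Occupation": ["Adm-clerical"] * 6 + ["Sales"] * 6,
    "Salary": [60, 70, 150, 170, 200, 220, 80, 90, 160, 180, 190, 230],
})

# Mean salary for every Education x Occupation cell; an interaction plot draws
# one line per Occupation through these means.
cell_means = df.groupby(["Education", "Occupation"])["Salary"].mean().unstack()
print(cell_means)

# Plotting equivalent (requires seaborn, imported as sns):
# sns.pointplot(data=df, x="Education", y="Salary", hue="Occupation")
```

Lines that are roughly parallel across Education levels suggest little interaction; lines that cross or diverge suggest that the effect of Education depends on Occupation.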

Adm-Clerical job with

Table 1B.1 – Two-way ANOVA - Education and Occupation

Null and Alternate Hypothesis for Two-way ANOVA:

Where Alpha = 0.05
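A two-way ANOVA with an interaction term can be sketched with the statsmodels formula interface. The data below is synthetic (the occupation labels, group means, and balanced design are assumptions); only the column names and the model structure follow the report.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic, balanced design: 3 education levels x 4 occupations, 10 rows per cell.
rng = np.random.default_rng(2)
edu = np.repeat(["HS-grad", "Bachelors", "Doctorate"], 40)
occ = np.tile(["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"], 30)
base = pd.Series(edu).map({"HS-grad": 75_000, "Bachelors": 165_000, "Doctorate": 210_000})
df = pd.DataFrame({
    "Education": edu,
    "Occupation": occ,
    "Salary": base.to_numpy() + rng.normal(0, 20_000, 120),
})

# Main effects plus the Education:Occupation interaction.
formula = "Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)"
model = ols(formula, data=df).fit()

# Default (type I) table has the columns df, sum_sq, mean_sq, F, PR(>F).
table = anova_lm(model)
print(table)
```

Each row's PR(>F) is compared with alpha = 0.05: one test for each main effect and one for the interaction.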

The dataset Education - Post 12th Standard.csv contains information on various colleges.

1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be

Table 2.1 – Statistical Summary of Continuous Variables

Fig 2.1 – Distribution of Variables

Observations from plot:

Left Skewed: PhD, Terminal

Right Skewed: Apps, Accept, Enroll, Top10perc, F.Undergrad, P.Undergrad, Room.Board,

Normal Bell Curves: Top25perc, Outstate, Grad.Rate

Boxplot of variables to check the presence of outliers

Fig 2.3 - Pairplot of Variables

Pairplot to find the Correlation between Variables:

 Apps and Accept

Fig 2.4 – Heatmap of Variables

Heat Map to check collinearity of Original Data

Fig 2.5 – Distribution of Scaled Data

 Covariance and Correlation matrices measure the relationship and the

Fig 2.6 – Heatmap of Scaled Data

The heatmap of the scaled data shows the following:

Strong Correlation between:

Fig 2.7 – Boxplot of Scaled Data
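The scaling step behind these figures can be sketched with NumPy alone. The example assumes z-score scaling and uses hypothetical data; it also checks a property that is useful before PCA, namely that the covariance matrix of the scaled data equals the correlation matrix of the original data. The choice of `ddof=1` is an assumption (some scalers divide by n instead of n - 1).

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical raw data: 200 rows, 3 columns on very different scales.
X = rng.normal(loc=[3000, 60, 10_000], scale=[3500, 15, 4000], size=(200, 3))

# z-score scaling: subtract each column's mean and divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# After scaling, the covariance matrix of Z equals the correlation matrix of X.
cov_scaled = np.cov(Z, rowvar=False)
corr_raw = np.corrcoef(X, rowvar=False)
same = np.allclose(cov_scaled, corr_raw)
print(same)
```

Scaling changes the units of each variable but not the correlations between them, which is why the heatmaps of original and scaled data show the same pattern.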

5. Extract the eigenvalues and eigenvectors. [Using sklearn PCA, print both]

array([[ 2.48765602e-01, 2.07601502e-01, 1.76303592e-01,

Table 2.3 - Eigenvalues

array([5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,
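The report extracts these values with sklearn's PCA. As a library-independent sketch of the same idea, the eigenvalues and eigenvectors can be computed from the covariance matrix of the scaled data with NumPy. The data here is hypothetical, and eigenvector signs may differ from what another library prints.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data matrix (rows = observations, columns = variables).
X = rng.normal(size=(300, 5))
X[:, 1] += 0.8 * X[:, 0]                      # induce some correlation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

# Sort from largest to smallest eigenvalue, as PCA output is usually reported.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # column k holds component k's loadings

# Share of total variance carried by each component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(eigenvalues)
print(explained_variance_ratio)
```

Dividing each eigenvalue by the sum of all eigenvalues gives the variance ratio reported later in the analysis.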

Statistical tests to be done before PCA:

Bartlett's Test of Sphericity:

Therefore, based on the above test, it is appropriate to proceed with PCA
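Bartlett's test of sphericity checks whether the correlation matrix differs from the identity matrix; a small p-value indicates enough correlation for PCA to be worthwhile. A minimal sketch using the standard chi-square approximation follows. The helper name `bartlett_sphericity` and the data are hypothetical, and the report may have used a library routine instead.

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Return (chi_square, p_value) for H0: the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)              # log of det(R), numerically stable
    chi_square = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi_square, dof)
    return chi_square, p_value

# Hypothetical data: four columns that share a common factor, so they correlate.
rng = np.random.default_rng(5)
shared = rng.normal(size=(300, 1))
X = shared + 0.5 * rng.normal(size=(300, 4))

chi_square, p_value = bartlett_sphericity(X)
print(f"chi-square = {chi_square:.1f}, p = {p_value:.3g}")
```

If the p-value is below 0.05, the null hypothesis of uncorrelated variables is rejected.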

Table 2.4 - Variance ratio

array([0.32020628, 0.26340214, 0.06900917, 0.05922989, 0.05488405,

Fig 2.8 – Screeplot of Components Variance

Renaming the Variables/Components to PC:

array([0.32020628, 0.58360843, 0.65261759, 0.71184748, 0.76673154,
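The cumulative values are the running sum of the variance ratios in Table 2.4. The sketch below reproduces the first five entries from the five ratios that are visible in the report; the remaining ratios are not shown there and are therefore left out.

```python
import numpy as np

# First five explained-variance ratios as printed in Table 2.4.
variance_ratio = np.array([0.32020628, 0.26340214, 0.06900917, 0.05922989, 0.05488405])

# Running total: share of variance explained by the first k components.
cumulative = np.cumsum(variance_ratio)
print(cumulative)
```

The number of components to keep is typically the smallest k at which this running total passes a chosen threshold.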

Table 2.6 – Selected PCs

Table 2.7 – Head of the Fit-Transformed Scaled Data for the Selected PCs

Fig 2.10 - Heatmap of Selected Scaled Data

The linear equation of the 1st component:

0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 *
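A component score is the dot product of the loadings with a row of scaled values. The sketch below uses only the first four loadings visible in the equation; the scaled values for the example college are hypothetical, and the result is a partial sum because the remaining terms of the equation are not shown.

```python
import numpy as np

# First four PC1 loadings as printed in the equation (rounded to 2 decimals).
variables = ["Apps", "Accept", "Enroll", "Top10perc"]
loadings = np.array([0.25, 0.21, 0.18, 0.35])

# Hypothetical scaled (z-score) values for one college; not taken from the report.
z = np.array([1.2, 0.9, 0.5, -0.3])

# Contribution of these four variables to the college's PC1 score.
partial_score = float(loadings @ z)
print(partial_score)
```

Variables with larger absolute loadings move the component score more for the same change in scaled value.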

Fig 2.12 – PC Index to Explained Variance ratio
