Advanced Statistics Project
ANOVA_EDA_PCA
Contents
Problem 1 A
1. State the null and the alternate hypothesis for conducting one-way ANOVA for
both Education and Occupation individually.
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means
are significantly different. Interpret the result. (Non-Graded)
Problem 1 B
1. What is the interaction between two treatments? Analyze the effects of one
variable on the other (Education and Occupation) with the help of an interaction plot.
[hint: use the ‘pointplot’ function from the ‘seaborn’ function]
2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?
3. Explain the business implications of performing ANOVA for this particular case
study.
Problem 2
1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
3. Comment on the comparison between the covariance and the correlation matrices
from this data [on scaled data].
4. Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
5. Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
6. Perform PCA and export the data of the Principal Component (eigenvectors) into a
data frame with the original features
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use
values with two places of decimals only). [hint: write the linear equation of PC in terms
of eigenvectors and corresponding features]
8. Consider the cumulative values of the eigenvalues. How does it help you to decide
on the optimum number of principal components? What do the eigenvectors indicate?
9. Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis?
List of Tables
1A.1 - ANOVA for Education
1A.2 - ANOVA for Occupation
1A.3 – Tukey Test
1B.1 – Two way ANOVA - Education and Occupation
2.1 – Statistical Summary of Continuous Variables
2.2 – Eigen Vectors
2.3 - Eigen Values
2.4 - Variance ratio
2.5 - Cumulative Sum of Explained variance ratio
2.6 – Selected PCs
2.7 – Selected fit transformed Scaled Data Head
List of Figures
1B.1 - Interaction Plot between Education and Occupation
2.1 – Distribution of Variables
2.2 – Boxplot of variables
2.3 - Pairplot of Variables
2.4 – Heatmap of Variables
2.5 – Distribution of Scaled Data
2.6 – Heatmap of Scaled Data
2.7 – Boxplot of Scaled Data
2.8 – Screeplot of Components Variance
2.9 - Loading of Selected PC Absolute Values
2.10 - Heatmap of Selected Scaled Data
2.11 - Heatmap of Selected – Fit Transform Data
2.12 – PC Index to Explained Variance ratio
Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To
understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected
and each person’s educational qualification and occupation are noted. Educational
qualification is at three levels, High school graduate, Bachelor, and Doctorate. Occupation
is at four levels, Administrative and clerical, Sales, Professional or specialty, and Executive
or managerial. A different number of observations are in each level of education –
occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption
may not always hold if the sample size is small.]
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
Null and Alternate Hypothesis for One way ANOVA for Education:
H0: The mean of the 'Salary' variable is equal across all Education categories.
HA1: The mean of the 'Salary' variable differs for at least one Education category.
Null and Alternate Hypothesis for One way ANOVA for Occupation:
H0: The mean of the 'Salary' variable is equal across all Occupation categories.
HA2: The mean of the 'Salary' variable differs for at least one Occupation category.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Since the p value in this scenario is less than Alpha (0.05), we reject the Null Hypothesis (H0).
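A one-way ANOVA of this kind can be sketched as follows. This is only an illustration: the salary values are made up rather than read from SalaryData.csv, the column names `Education` and `Salary` are assumptions about that file, and `one_way_anova` is a hypothetical helper.

```python
import pandas as pd
from scipy import stats

# Made-up illustrative data; the real analysis would read SalaryData.csv instead.
df = pd.DataFrame({
    "Education": ["HS-grad"] * 4 + ["Bachelors"] * 4 + ["Doctorate"] * 4,
    "Salary": [50000, 52000, 48000, 51000,
               150000, 160000, 155000, 158000,
               200000, 210000, 205000, 215000],
})

def one_way_anova(data, factor, response="Salary", alpha=0.05):
    """Run a one-way ANOVA of `response` across the levels of `factor`."""
    groups = [g[response].to_numpy() for _, g in data.groupby(factor)]
    f_stat, p_value = stats.f_oneway(*groups)
    return f_stat, p_value, bool(p_value < alpha)

f_stat, p_value, reject_h0 = one_way_anova(df, "Education")
print(f"F = {f_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

The same helper can be called with `"Occupation"` as the factor once that column is present in the data frame.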
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Table 1A.2 - ANOVA for Occupation
Since the p value in this scenario is greater than Alpha (0.05), we fail to reject the Null
Hypothesis (H0).
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
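Since H0 is rejected for Education in (2), a post-hoc comparison such as Tukey's HSD can show which pairs of class means differ. The sketch below uses made-up salaries, not the project data, and assumes SciPy 1.8 or later for `scipy.stats.tukey_hsd`.

```python
import numpy as np
from scipy import stats

# Made-up salaries per education level (illustrative only).
hs_grad = np.array([50000, 52000, 48000, 51000])
bachelors = np.array([150000, 160000, 155000, 158000])
doctorate = np.array([200000, 210000, 205000, 215000])

labels = ["HS-grad", "Bachelors", "Doctorate"]
result = stats.tukey_hsd(hs_grad, bachelors, doctorate)

alpha = 0.05
significant_pairs = []
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        # result.pvalue[i, j] is the adjusted p-value for the pair (i, j).
        if result.pvalue[i, j] < alpha:
            significant_pairs.append((labels[i], labels[j]))

print(significant_pairs)
```

Each pair listed in `significant_pairs` has class means that differ at the 5% level after adjusting for multiple comparisons.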
Problem 1B:
1. What is the interaction between two treatments? Analyse the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot. [Hint: use the
‘pointplot’ function from the ‘seaborn’ function]
From the above plot we can see that people with educational level:
Doctorate: are in the higher salary brackets and mostly hold Prof-specialty, Exec-managerial
or Sales roles; very few do Adm-clerical jobs.
Bachelors: fall in the mid-income range and mostly work in Exec-managerial, Adm-clerical
or Sales roles; very few are in Prof-specialty roles.
HS-grad: are in the low income brackets, mostly doing Prof-specialty or Adm-clerical work;
a few are in Sales and hardly any are in Exec-managerial roles.
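An interaction plot draws the mean Salary of every Education and Occupation combination. The sketch below computes those cell means on made-up data and defines a hypothetical helper, `draw_interaction_plot`, that passes the same data to seaborn's `pointplot`. The column names are assumptions.

```python
import pandas as pd

# Made-up data (illustrative only); column names are assumed.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "Doctorate", "Doctorate", "HS-grad", "Bachelors"],
    "Occupation": ["Adm-clerical", "Sales", "Adm-clerical", "Sales",
                   "Adm-clerical", "Sales", "Sales", "Sales"],
    "Salary": [50000, 60000, 150000, 190000, 170000, 220000, 64000, 186000],
})

# Mean salary per Education x Occupation cell: these are the points that an
# interaction plot draws. Lines that are far from parallel hint at interaction.
cell_means = df.pivot_table(index="Education", columns="Occupation",
                            values="Salary", aggfunc="mean")
print(cell_means)

def draw_interaction_plot(data):
    """Draw the interaction plot with seaborn (imported only when called)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    ax = sns.pointplot(data=data, x="Education", y="Salary", hue="Occupation")
    ax.set_title("Interaction plot: Education x Occupation")
    plt.show()
```

Calling `draw_interaction_plot(df)` renders the figure; the pivot table gives the same information in numeric form.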
2. Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and
alternative hypotheses and state your results. How will you interpret this result?
H0: The means of the 'Salary' variable are equal across all Occupation types and Education levels.
HA: At least one of the means of the 'Salary' variable across Occupation types and Education
levels is not equal.
If p-value < 0.05, reject the Null Hypothesis
If p-value >= 0.05, fail to reject the Null Hypothesis
The p-value of Education is 5.44e-12, which is less than α = 0.05, so we reject the Null
Hypothesis (H0). This shows that the mean of the 'Salary' variable is not equal across all
Education categories.
The p-value of Occupation is 0.072, which is greater than α = 0.05, so we fail to reject the Null
Hypothesis (H0). There is no significant evidence that the mean of the 'Salary' variable differs
across Occupation categories.
The p-value of the Education and Occupation interaction is 0.00002232, which is < 0.05, so the
interaction is statistically significant: the effect of Education on Salary depends on Occupation.
3. Explain the business implications of performing ANOVA for this particular case study.
ANOVA stands for “analysis of variance”. It is a statistical test that compares the means of
groups in order to determine whether there is a difference between them, and so relates a
dependent variable to one or more categorical independent variables. It is used when more
than two group means are compared. For two group means, a t-test is sufficient.
In a business context, ANOVA is used here to help manage income (salary) by comparing
how Salary varies with Education and Occupation.
ANOVA can also inform expectations about Salary trends, since it shows which factors are
associated with differences in Salary, and this helps when planning future salary hikes.
It is also a widely used statistical technique for examining the factors associated with a
rise in Salary, assuming this report is for an HR department or HR consulting firm.
Some of the key takeaways are as below:
- As the Education level rises, Salary increases. On average, Doctorates earn a higher
salary than Bachelors and HS-grads. However, holding a Doctorate may not necessarily
mean a significantly higher salary than HS-grad or Bachelors employees. This suggests
that Doctorates are not suitable for all job roles, or are not always preferred over other
education levels; they may sometimes be considered overqualified for certain job roles.
- Although Occupation is less significant than Education for Salary, it does affect
Salary at certain levels.
- Note also that for a few occupations, higher salaries are offered to Bachelor’s degree
holders than to Doctorates. This points to shortcomings in the dataset provided that
reduce the accuracy of the test and analysis, as other important variables can affect
salary, such as years of experience, specialisation, industry/domain etc.
- The HR department plays a more comprehensive role when setting up salary bands:
similar job titles in different industries demand different salary packages depending on
the job profile, and years of experience also matter when deciding a person's scale.
- The ANOVA test indicates that Education level coupled with Occupation has a
significant influence on Salary, whereas Occupation type alone has less influence than
Educational background.
…………………………………………………….**********************……………………………………………
Problem 2:
Observations:
- The data loaded correctly.
- There are 777 rows and 18 variables.
- There are no missing values in the given file.
- 'Name' is of object type, S.F Ratio is of float type and the other variables are of
integer type.
- The data contains current college records, which can be used to compare the
performance of colleges when choosing one for admission after +2.
- There are no duplicate records.
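The checks behind these observations can be sketched with pandas as follows. The tiny data frame and its column names are placeholders standing in for the college file, which is not reproduced here.

```python
import pandas as pd

# Tiny made-up frame standing in for the college data (illustrative only);
# the column names are placeholders, not the actual file's headers.
df = pd.DataFrame({
    "Name": ["College A", "College B", "College C"],
    "Apps": [1660, 2186, 1428],
    "SF_Ratio": [18.1, 12.2, 12.9],
})

n_rows, n_cols = df.shape                      # number of rows and variables
missing_per_column = df.isnull().sum()         # missing values per column
n_duplicates = int(df.duplicated().sum())      # fully duplicated rows

print(df.dtypes)
print(f"{n_rows} rows, {n_cols} columns, "
      f"{int(missing_per_column.sum())} missing values, "
      f"{n_duplicates} duplicate rows")
```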
Fig 2.2 – Boxplot of variables
Below have high Correlation:
2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
The dataset has 18 columns, and the numeric variables are measured on very different
scales (percentages, ratios and monetary amounts). PCA decides how many components to
retain from the amount of variance explained by each component, so variables with large
numeric ranges would dominate the components if left unscaled. Hence scaling is required.
After scaling, the standard deviation is 1.0 for all variables. The difference between the
Q1 (25%) value and the minimum value is smaller than in the original dataset for most of
the variables.
Standard (z-score) scaling transforms the data so that the mean of each feature tends to 0
and the standard deviation tends to 1.
The Min-Max method instead ensures that the data is scaled to values in the range 0 to 1.
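The two scaling methods mentioned above can be compared with scikit-learn as in this minimal sketch. The numbers are made up, and the report does not state which scaler implementation was actually used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values on very different scales (illustrative only).
X = np.array([[1660.0, 18.1],
              [2186.0, 12.2],
              [1428.0, 12.9],
              [417.0, 7.7]])

X_std = StandardScaler().fit_transform(X)    # z-score: (x - mean) / std
X_minmax = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min)

print("z-score means:", X_std.mean(axis=0).round(6))
print("z-score stds: ", X_std.std(axis=0).round(6))
print("min-max range:", X_minmax.min(axis=0), X_minmax.max(axis=0))
```

`StandardScaler` divides by the population standard deviation, so a summary that reports the sample standard deviation can show values marginally above 1.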
3. Comment on the comparison between the covariance and the correlation matrices from
this data [on scaled data].
Correlation is the scaled form of covariance. Covariance is affected by a change in
scale, whereas correlation is not. On standardized data the covariance matrix is
therefore practically the same as the correlation matrix.
Covariance indicates the direction of the linear relationship between variables.
Correlation, on the other hand, measures both the strength and the direction of the
linear relationship between two variables.
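The relationship between the two matrices can be checked numerically. The sketch below uses simulated data and standardizes with the sample standard deviation (`ddof=1`), an assumption chosen so that the match is exact rather than approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: 3 features on very different scales (illustrative only).
base = rng.normal(size=(200, 1))
X = np.hstack([
    1000 * base + rng.normal(scale=300, size=(200, 1)),
    base + rng.normal(scale=0.5, size=(200, 1)),
    rng.normal(scale=10, size=(200, 1)),
])

# Standardize each column with the sample standard deviation (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_scaled = np.cov(Z, rowvar=False)       # covariance of the scaled data
corr_raw = np.corrcoef(X, rowvar=False)    # correlation of the raw data

print(np.allclose(cov_scaled, corr_raw))
```

If the population standard deviation is used for scaling instead, the two matrices differ only by the factor n / (n - 1).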
Weak correlation between:
Top10perc & SF Ratio
Top25perc & SF Ratio
Outstate & SF Ratio
Room Board & SF Ratio
PercAlumni & SF Ratio
Expend & SF Ratio
4. Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
Scaling shrinks the range of the feature values as shown in the left figure below. However,
outliers influence the computed empirical mean and standard deviation, so the Standard
Scaler cannot guarantee balanced feature scales in the presence of outliers.
Comparing the boxplots in Fig 2.2 and Fig 2.7, there is not much difference in terms of
outlier reduction.
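The reason the boxplots look alike can be illustrated with a small sketch: z-score scaling is a linear rescaling, so the points falling outside the 1.5 * IQR whiskers stay the same. The values below are made up, and `iqr_outlier_count` is a hypothetical helper.

```python
import numpy as np

def iqr_outlier_count(x):
    """Count points outside the 1.5 * IQR whiskers used by a boxplot."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))

# Made-up skewed feature with two extreme values (illustrative only).
x = np.array([10, 12, 11, 13, 12, 14, 11, 13, 12, 95, 120], dtype=float)
z = (x - x.mean()) / x.std()   # z-score scaling

before = iqr_outlier_count(x)
after = iqr_outlier_count(z)
print(before, after)
```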
Table 2.2 – Eigen Vectors
[ 5.95830975e-01, 2.92642398e-01, -4.44638207e-01,
1.02303616e-03, 2.18838802e-02, -5.23622267e-01,
1.25997650e-01, -1.41856014e-01, -6.97485854e-02,
1.14379958e-02, 3.94547417e-02, 1.27696382e-01,
-5.83134662e-02, -1.77152700e-02, 1.04088088e-01,
-9.37464497e-02, -6.91969778e-02],
[ 8.06328039e-02, 3.34674281e-02, -8.56967180e-02,
-1.07828189e-01, 1.51742110e-01, -5.63728817e-02,
1.92857500e-02, -3.40115407e-02, -5.84289756e-02,
-6.68494643e-02, 2.75286207e-02, -6.91126145e-01,
6.71008607e-01, 4.13740967e-02, -2.71542091e-02,
7.31225166e-02, 3.64767385e-02],
[ 1.33405806e-01, -1.45497511e-01, 2.95896092e-02,
6.97722522e-01, -6.17274818e-01, 9.91640992e-03,
2.09515982e-02, 3.83544794e-02, 3.40197083e-03,
-9.43887925e-03, -3.09001353e-03, -1.12055599e-01,
1.58909651e-01, -2.08991284e-02, -8.41789410e-03,
-2.27742017e-01, -3.39433604e-03],
[ 4.59139498e-01, -5.18568789e-01, -4.04318439e-01,
-1.48738723e-01, 5.18683400e-02, 5.60363054e-01,
-5.27313042e-02, 1.01594830e-01, -2.59293381e-02,
2.88282896e-03, -1.28904022e-02, 2.98075465e-02,
-2.70759809e-02, -2.12476294e-02, 3.33406243e-03,
-4.38803230e-02, -5.00844705e-03],
[ 3.58970400e-01, -5.43427250e-01, 6.09651110e-01,
-1.44986329e-01, 8.03478445e-02, -4.14705279e-01,
9.01788964e-03, 5.08995918e-02, 1.14639620e-03,
7.72631963e-04, -1.11433396e-03, 1.38133366e-02,
6.20932749e-03, -2.22215182e-03, -1.91869743e-02,
-3.53098218e-02, -1.30710024e-02]])
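Eigenvectors and eigenvalues of the kind printed above can be obtained from scikit-learn's `PCA` through its `components_` and `explained_variance_` attributes. The sketch below runs on simulated data, not the college data, and cross-checks the eigenvalues against a direct eigendecomposition of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Simulated data with 4 correlated features (illustrative only).
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()          # keep all components
pca.fit(X_scaled)

eigenvectors = pca.components_           # one row per principal component
eigenvalues = pca.explained_variance_    # variance along each component

# Cross-check against the eigenvalues of the sample covariance matrix.
cov = np.cov(X_scaled, rowvar=False)
eigvals_direct = np.sort(np.linalg.eigvalsh(cov))[::-1]

print(eigenvalues)
print(eigenvectors)
```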
6. Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features
KMO Test:
KMO value found: 0.8131
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to
examine how appropriate PCA is.
Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected.
On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the
dimension and extraction of meaningful components.
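The overall KMO value compares the squared correlations with the squared partial correlations between variables. A minimal NumPy sketch of that calculation is given below; it runs on simulated data, `kmo_overall` is a hypothetical helper, and it is not necessarily the routine used to obtain the value reported above.

```python
import numpy as np

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations derived from the inverse correlation matrix.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)       # squared correlations
    p2 = np.sum(partial[off_diag] ** 2)    # squared partial correlations
    return r2 / (r2 + p2)

rng = np.random.default_rng(0)
# Simulated data: 5 features driven by one common factor (illustrative only).
factor = rng.normal(size=(300, 1))
X = factor + 0.5 * rng.normal(size=(300, 5))

msa = kmo_overall(X)
print(round(msa, 3))
```

Values close to 1 mean that the partial correlations are small relative to the correlations, which is the situation in which PCA is expected to reduce the dimension well.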
Table 2.5 - Cumulative Sum of Explained variance ratio
From the scree plot and the cumulative sum of the explained variance ratio, it is evident that
the first 9 components explain about 90% of the variance in the data, so we proceed with 9
principal components.
Extract the required number of PCs (as per the cumulative explained variance).
Create a data frame out of the fit-transformed scaled data above.
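The selection and extraction steps above can be sketched as follows. The data is simulated, the 90% threshold mirrors the one used in this report, and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = [f"f{i + 1}" for i in range(6)]   # hypothetical names
# Simulated data with 6 correlated features (illustrative only).
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X_scaled = StandardScaler().fit_transform(X)

# Cumulative explained variance ratio over all components.
cum_var = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
threshold = 0.90
# Smallest number of components whose cumulative ratio reaches the threshold.
n_components = int(np.argmax(cum_var >= threshold) + 1)

pca = PCA(n_components=n_components)
scores = pca.fit_transform(X_scaled)
pc_names = [f"PC{i + 1}" for i in range(n_components)]

pc_df = pd.DataFrame(scores, columns=pc_names)                  # transformed data
loadings_df = pd.DataFrame(pca.components_, index=pc_names,
                           columns=feature_names)               # eigenvectors
print(n_components, pc_df.shape, loadings_df.shape)
```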
Fig 2.9 - Loading of Selected PC Absolute Values:
Fig 2.11 - Heatmap of Selected – Fit Transform Data
7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [Hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]
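The explicit form of the first PC is a weighted sum of the scaled features, with the first eigenvector supplying the weights. The sketch below builds that equation as a string with coefficients rounded to two decimals; the data is simulated and the feature names are hypothetical, so the coefficients are not those of the college data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["f1", "f2", "f3"]     # hypothetical feature names
# Simulated data with 3 correlated features (illustrative only).
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
w = pca.components_[0]                 # eigenvector (loadings) of the first PC

# PC1 = w1 * f1 + w2 * f2 + ... with coefficients rounded to two decimals.
terms = [f"({coef:.2f} * {name})" for coef, name in zip(w, feature_names)]
equation = "PC1 = " + " + ".join(terms)
print(equation)

# The same weights reproduce the PC1 scores returned by scikit-learn.
pc1_manual = X_scaled @ w
pc1_sklearn = pca.transform(X_scaled)[:, 0]
```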
8. Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?
The plot shows how much of the variance is explained by how many principal components.
In the plot we see that the 1st PC explains 33.13% of the variance, the first two PCs together
explain 57.19%, and so on.
Effectively, we can explain a material share of the variance (i.e. about 90%) by analysing 9
principal components instead of all 17 variables (attributes) in the dataset.
PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the
data. Because rotation is a kind of linear transformation, the new dimensions are weighted
sums of the old ones. The eigenvectors (principal components) determine the directions, or
axes, along which the linear transformation acts, stretching or compressing input vectors.
They are the lines of change that represent the action of the larger matrix, the very “line”
in linear transformation.
9. Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]
Principal component analysis is used to reduce the multicollinearity between the
variables.
Depending on how much of the variance of the dataset we want to retain, we can reduce the
number of principal components.
For this business case, 9 principal components are retained, which capture most (about 90%)
of the variance of the dataset.
Because the principal components are uncorrelated with one another, using them in place of
the original variables reduces multicollinearity in further analysis.
…………………………………………………….**********************……………………………………………