
Exploratory Data Analysis:: Salarydata - CSV

The document discusses a case study analyzing salary data of 40 individuals. It includes: - Exploratory data analysis of the dataset containing education, occupation, and salary variables. - One-way ANOVA tests to analyze the effect of education and occupation on salary. For education, the null hypothesis that mean salaries are equal is rejected, while for occupation it is not. - A two-way ANOVA finds an interaction between education and occupation. - Implications for businesses include salaries varying more by education than occupation, and some occupations having similar salaries across education levels. The document then shifts to analyzing a college dataset, including: - Exploratory data analysis finding most variables are normally distributed and some

Uploaded by

Tanaya Saha

Problem 1A:

Salary is hypothesized to depend on educational qualification and
occupation. To understand the dependency, the salaries of 40 individuals
[SalaryData.csv] are collected and each person's educational qualification
and occupation are noted. Educational qualification is at three levels: High
school graduate, Bachelor, and Doctorate. Occupation is at four levels:
Administrative and clerical, Sales, Professional or specialty, and Executive
or managerial. A different number of observations are in each
education-occupation combination.

Exploratory Data Analysis:

The dataset has 3 variables: Education, Occupation and Salary. Education and
Occupation are object type and Salary is an integer type.

Descriptive Statistics:
Check for null values:
Therefore, we can conclude that there are no null values.
Descriptive Statistics for the dataset:
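The checks above can be reproduced with pandas roughly as follows. This is a minimal sketch: the rows are hypothetical stand-ins (the real analysis would load SalaryData.csv with pd.read_csv), and only the column names and level labels are taken from the text.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data would come from
# pd.read_csv("SalaryData.csv"). Salary values here are made up.
df = pd.DataFrame({
    "Education": ["HS-grad", "Bachelors", "Doctorate", "Bachelors"],
    "Occupation": ["Sales", "Adm-clerical", "Prof-specialty", "Exec-managerial"],
    "Salary": [50000, 120000, 200000, 150000],
})

dtypes = df.dtypes                     # data type of each column
null_counts = df.isnull().sum()        # number of missing values per column
summary = df.describe(include="all")   # descriptive statistics for all columns

print(dtypes)
print(null_counts)
print(summary)
```

A total of zero in null_counts corresponds to the "no null values" conclusion.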

Question 1.1 State the null and the alternate hypothesis for
conducting one-way ANOVA for both Education and Occupation
individually.
Null and alternative hypotheses for conducting one-way ANOVA
for Education
μA = Mean salary of Doctorate
μB = Mean salary of Bachelors
μC = Mean salary of HS-grad
H0: μA = μB = μC
Ha: At least one of the mean salaries differs from the others
Level of significance (Alpha) = 0.05
N = 40
Null and alternative hypotheses for conducting one-way ANOVA
for Occupation
μA = Mean salary of Prof-specialty
μB = Mean salary of Sales
μC = Mean salary of Adm-clerical
μD = Mean salary of Exec-managerial
H0: μA = μB = μC = μD
Ha: At least one of the mean salaries differs from the others
Level of significance (Alpha) = 0.05
N = 40

Question 1.2 Perform a one-way ANOVA on Salary with respect
to Education. State whether the null hypothesis is accepted or
rejected based on the ANOVA results.
As per my output from Python:

F value: 30.95628
PR value (p-value): 1.257709e-08
The PR value (1.257709e-08) is less than alpha (0.05), so we reject
the null hypothesis at the 5% level of significance.
So at 95% confidence level, there is sufficient evidence that the
mean salary is not the same across all three education levels
(Doctorate, Bachelors and HS-grad). The ANOVA on its own does not
show which of the levels differ from each other.

Question 1.3 Perform a one-way ANOVA on Salary with respect
to Occupation. State whether the null hypothesis is accepted or
rejected based on the ANOVA results.
As per my output from Python:
F statistic: 0.884144
PR value (p-value): 0.458508
The PR value (0.458508) is greater than alpha (0.05), so we fail to
reject the null hypothesis at the 5% level of significance.
So at 95% confidence level, there is not sufficient evidence to
conclude that the mean salaries of Prof-specialty, Sales,
Adm-clerical and Exec-managerial differ. Failing to reject the null
hypothesis does not prove that the four means are equal.

Question 1.4 If the null hypothesis is rejected in either (2) or in
(3), find out which class means are significantly different.
Interpret the result.
Yes, the null hypothesis is rejected in Question 1.2, where PR value <
alpha, so there is evidence that at least one education level has a
different mean salary.
To find out which class means are significantly different, we can run
the Tukey HSD (Honestly Significant Difference) test using Python.
As per my output from Python:
Therefore, we conclude that
Mean salary of Bachelors is not equal to Mean salary of Doctorate
Mean salary of Bachelors is not equal to Mean salary of HS-grad
Mean salary of Doctorate is not equal to Mean salary of HS-grad
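The pairwise comparison can be sketched with the pairwise_tukeyhsd function from statsmodels. The data below are hypothetical, so the printed table will not match the output referred to above; only the group labels come from the text.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical salaries for illustration only.
df = pd.DataFrame({
    "Education": ["HS-grad"] * 4 + ["Bachelors"] * 4 + ["Doctorate"] * 4,
    "Salary": [50000, 55000, 60000, 52000,
               100000, 110000, 105000, 120000,
               200000, 210000, 190000, 205000],
})

# Compare every pair of education levels at the 5% significance level.
result = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(result.summary())

# One boolean per pair: True means the two group means differ significantly.
reject_flags = result.reject
```

With three groups there are three pairs, and a True flag for a pair corresponds to a "not equal" statement in the list above.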

Problem 1B:
Question 1.5 What is the interaction between two treatments?
Analyze the effects of one variable on the other (Education and
Occupation) with the help of an interaction plot
To examine the interaction between Education and Occupation, I
plotted a point plot of Salary against both variables.
From the plot, there appears to be little interaction between the two
categorical variables, while salary differs considerably across
Education levels and Occupations.
One exception is that individuals with a Bachelors education working
in Sales and in Exec-managerial roles earn almost the same salary.
To test the interaction formally, I ran a two-way ANOVA on Salary
with respect to both Education and Occupation along with their
interaction Education*Occupation.
As per my output from Python:
The PR value for the Education and Occupation interaction is
2.232500e-05, which is less than 0.05, so the interaction is
statistically significant even though it looks weak in the plot.

Question 1.6 Perform a two-way ANOVA based on Salary with
respect to both Education and Occupation (along with their
interaction Education*Occupation). State the null and alternative
hypotheses and state your results. How will you interpret this
result?
H0: There is no interaction between Education and Occupation
Ha: There is interaction between Education and Occupation
Level of significance (Alpha) = 0.05
N = 40
As per my output from Python:

F value: 8.519815
PR value (p-value): 2.232500e-05
The PR value (2.232500e-05) is less than alpha (0.05), so we reject
the null hypothesis at the 5% level of significance.
So at 95% confidence level, there is sufficient evidence that
there is interaction between Education and Occupation, meaning the
effect of Education on Salary depends on the Occupation.

Question 1.7 Explain the business implications of performing
ANOVA for this particular case study.
1) Mean salary differs significantly between every pair of education
levels:
Mean salary of Bachelors is not equal to Mean salary of Doctorate
Mean salary of Bachelors is not equal to Mean salary of HS-grad
Mean salary of Doctorate is not equal to Mean salary of HS-grad
2) No statistically significant difference in mean salary was found
between the occupations (Prof-specialty, Sales, Adm-clerical and
Exec-managerial), so occupation on its own does not explain the
salary differences.
3) There is sufficient evidence that there is interaction
between Education and Occupation
4) Individuals with Bachelors education working in Sales and in
Exec-managerial roles earn almost the same salary
5) Individuals with HS-grad education working in Sales earn the
lowest salary
6) Individuals with Doctorate education working as Prof-specialty
earn the highest salary
Problem 2:
The dataset Education - Post 12th Standard.csv contains
information on various colleges. You are expected to do a
Principal Component Analysis for this case study according to the
instructions given.
Descriptive Statistics:
Check for null values:
Therefore, we can conclude that there are no null values.
Descriptive Statistics for the dataset:
Check for outliers:
After treatment of outliers:

Question 2.1 Perform Exploratory Data Analysis [both univariate
and multivariate analysis to be performed]. What insight do you
draw from the EDA?
For univariate analysis, I have plotted distribution plots and bar plots.
From the distribution plots, most variables look roughly normally
distributed, but several are skewed: Apps, Accept, Enroll, Top10perc,
F.Undergrad, P.Undergrad, Personal, perc.alumni and Expend appear
left skewed, whereas PhD and Terminal appear right skewed.
For multivariate analysis, I have plotted a scatter plot and a heat map.
From the heat map and scatter plot we can see that
1) F.Undergrad has very high correlation with Enroll
2) Apps has high correlation with Accept
3) Enroll has high correlation with Accept
4) Expend has very low correlation with S.F.Ratio

Question 2.2 Is scaling necessary for PCA in this case? Give
justification and perform scaling.
Yes, scaling is necessary because the dataset has features measured
on very different scales. PCA is driven by variance, so without
scaling the features with the largest numeric ranges would dominate
the principal components. As with distance-based algorithms, it is
recommended to transform the features so that they are all on the
same scale.
Here I have performed scaling using StandardScaler from sklearn.
Before scaling:

After scaling:
After scaling, the mean of every variable is approximately 0 and the
standard deviation is approximately 1.
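The effect of standard scaling can be illustrated with plain NumPy. This sketch computes z-scores by hand instead of calling sklearn, using the population standard deviation (ddof=0), which is the convention StandardScaler follows; the feature values are hypothetical.

```python
import numpy as np

# Hypothetical values for two features on very different scales.
X = np.array([
    [1200.0, 12.5],
    [3400.0, 15.0],
    [560.0, 9.8],
    [8900.0, 20.1],
])

# z-score: subtract the column mean, divide by the column standard deviation.
# np.std defaults to ddof=0 (population standard deviation).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```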

Question 2.3 Comment on the comparison between the covariance
and the correlation matrices from this data.
Correlation matrix of the scaled data:
Covariance matrix of the scaled data:

As seen in the above figures from my Python output, the correlation
matrix of the scaled data is essentially equal to its covariance
matrix, because every scaled variable has a standard deviation of 1
(see Question 2.2). Any small difference comes from the n/(n-1)
factor that appears when the sample covariance is used.
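The relationship between the two matrices can be checked numerically. The sketch below uses randomly generated stand-in data (not the college dataset) and compares the correlation matrix with the population covariance (ddof=0) and the sample covariance (NumPy's default, ddof=1) of the scaled data.

```python
import numpy as np

# Random stand-in data with three features on different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# Standard scaling with the population standard deviation (ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
n = X_scaled.shape[0]

corr = np.corrcoef(X_scaled, rowvar=False)
cov_population = np.cov(X_scaled, rowvar=False, ddof=0)
cov_sample = np.cov(X_scaled, rowvar=False)  # default uses n - 1

# Population covariance of scaled data equals the correlation matrix exactly;
# the sample covariance is larger by the factor n / (n - 1).
print(np.allclose(cov_population, corr))
print(np.allclose(cov_sample, corr * n / (n - 1)))
```

For a large number of rows the factor n/(n-1) is very close to 1, which is why the two matrices look identical in practice.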

Question 2.4 Check the dataset for outliers before and after
scaling. What insight do you derive here?
Before scaling:
After scaling:

Hence, after scaling we are able to see two things
1) Fewer outliers are present in the scaled data than in the data
before scaling
2) The range of the data after scaling is from -0.6 to 0.6, whereas
before scaling the range was from 0 to 20000. This is because after
scaling the mean of every variable is approximately 0 and the
standard deviation is approximately 1

Question 2.5 Perform PCA and export the data of the Principal
Component scores into a data frame.
As per my output from python:
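One way to obtain the principal component scores and store them in a data frame is sketched below. It uses an eigen-decomposition of the covariance matrix in NumPy on random stand-in data; this is an assumption about the approach, and the column names PC1, PC2, ... are illustrative. A library routine such as sklearn's PCA should give the same scores up to the sign of each component.

```python
import numpy as np
import pandas as pd

# Random stand-in data; the real input would be the scaled college data.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the (symmetric) covariance matrix.
cov = np.cov(X_scaled, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

# Reorder so that the component with the largest variance comes first.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Project the scaled data onto the eigenvectors to get the PC scores.
scores = X_scaled @ eig_vecs
pc_df = pd.DataFrame(scores, columns=[f"PC{i + 1}" for i in range(scores.shape[1])])

print(pc_df.head())
# pc_df.to_csv("pc_scores.csv", index=False) would export it (hypothetical file name).
```

The variance of each score column equals the corresponding eigenvalue, which is a quick sanity check on the projection.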

Question 2.6 Extract the eigenvalues, and eigenvectors.
Eigen Values

[5.64307841 4.82973672 1.10030644 0.9966849 0.8977433 0.76549205
 0.58709565 0.55450358 0.44319291 0.38222641 0.24563729 0.03891348
 0.05597992 0.07466871 0.12376406 0.13603844 0.14684496]


Eigen Vectors

[[ 2.42671239e-01 3.24930495e-01 9.77100175e-02 -1.02559773e-01
 2.28743180e-01 -4.76414519e-02 1.23782113e-02 -3.41030317e-02
 -1.84655032e-01 -1.34049566e-01 -6.79408155e-02 -1.51051724e-01
 5.73869368e-01 2.54721171e-02 -3.50002377e-01 4.76265776e-01
 -2.73993248e-02]
[ 2.08095876e-01 3.57755851e-01 1.25144023e-01 -1.21914245e-01
 2.02792107e-01 -3.31338141e-02 -1.41529768e-03 -1.02521665e-01
 -1.89697047e-01 -1.23207526e-01 -2.86891699e-02 4.52766958e-01
 -6.43625404e-01 -4.08143058e-02 -1.12837998e-01 2.08677137e-01
 -1.27528369e-01]
[ 1.64564266e-01 3.95824297e-01 9.44419384e-02 -1.42497171e-02
 1.72168365e-01 3.89761143e-02 7.92830517e-03 -1.34762063e-01
 -5.20184210e-02 -4.79563882e-02 -2.29745788e-02 -7.50067816e-01
 -2.58381892e-01 3.37484396e-02 2.25003975e-01 -2.65981931e-01
 -1.80558174e-02]
[ 3.44633526e-01 -7.53900839e-02 -7.23866450e-02 3.75563233e-01
 1.45905144e-01 8.37673857e-02 2.58267694e-01 2.89094711e-01
 1.10851590e-01 -7.14429611e-02 -6.57319491e-03 5.89947774e-02
 -5.31897461e-02 7.23553559e-01 3.22924466e-02 1.62488072e-02
 4.57358763e-02]
[ 3.37858398e-01 -3.67211412e-02 -4.63368319e-02 4.27876370e-01
 1.20536687e-01 2.14918233e-02 2.34717438e-01 3.36249057e-01
 1.89924670e-01 -4.53255044e-02 -1.32078205e-01 -1.47356588e-02
 -3.70257583e-03 -6.58266244e-01 -2.64582929e-02 -3.47432747e-02
 -1.58456329e-01]
[ 1.34287678e-01 4.06243667e-01 8.72397333e-02 -1.46165800e-02
 1.15073146e-01 5.49956869e-02 2.79162755e-02 -1.22385171e-01
 1.41252801e-04 1.12660606e-02 -3.63762678e-02 4.51829780e-01
 4.13880625e-01 -1.05340454e-02 3.50240396e-01 -5.20754661e-01
 7.88826178e-02]
[ 1.45128920e-02 3.54916637e-01 3.86964803e-02 -2.07265372e-01
 -1.32038801e-01 5.16448338e-02 9.36586774e-02 5.41905661e-02
 7.36103386e-01 4.23776360e-01 1.89391557e-01 4.97285352e-03
 -3.24986907e-02 3.82640339e-02 -1.01785907e-01 1.61437628e-01
 -3.58599650e-02]
[ 2.97304568e-01 -2.37362415e-01 2.05908405e-02 -2.53851713e-01
 4.29684243e-02 1.39668813e-02 -1.04399025e-01 2.38893511e-02
 1.46112057e-02 -1.87206448e-01 6.09931131e-01 -4.74030188e-03
 9.51907262e-02 2.55089170e-03 2.23348292e-01 -7.53206365e-03
 -5.57302446e-01]
[ 2.51192093e-01 -1.23789047e-01 -2.60693995e-02 -5.66793784e-01
 -9.02072957e-02 -2.57756666e-01 -1.25975104e-01 3.55685905e-01
 2.17093330e-01 -3.04566995e-01 -4.62002002e-01 -1.79855036e-02
 -2.24494212e-02 3.34522167e-02 9.10703410e-02 -8.88393882e-02
 1.05909326e-01]
[ 9.35681745e-02 1.06015391e-01 -7.13557985e-01 4.72789590e-02
 -1.66299240e-02 -6.08723983e-01 1.39285992e-01 -2.56097327e-01
 -1.62931047e-02 7.45947123e-02 5.14384110e-02 2.61824511e-03
 2.76914586e-03 8.27989701e-03 4.23639027e-02 9.11163317e-03
 -4.85902891e-02]
[-4.84668755e-02 2.35469217e-01 -5.21834336e-01 1.07878431e-01
 6.29349774e-02 3.84138124e-01 -6.56948779e-01 2.51641980e-01
 3.28224085e-02 -9.21496211e-02 1.75508664e-02 1.80670889e-02
 -1.08413791e-02 1.34656529e-03 -2.94677640e-02 2.34506560e-02
 -9.33480418e-03]
[ 3.24667558e-01 7.06517594e-02 5.72580979e-02 1.23470983e-01
 -5.47356939e-01 6.21741995e-02 -9.61136496e-02 -4.70574563e-02
 -1.67560289e-01 1.25222037e-01 -3.61407075e-02 3.31773567e-04
 5.14472664e-03 -5.66570968e-02 5.33328919e-01 4.34191485e-01
 1.86806043e-01]
[ 3.20509921e-01 5.96664001e-02 3.74577785e-02 7.31469402e-02
 -5.85124026e-01 4.79218101e-02 -9.84467080e-02 -1.16057935e-01
 -1.29240181e-01 7.52852907e-02 -1.04786025e-01 -1.52860888e-02
 -1.06241335e-03 8.90053732e-02 -5.22450951e-01 -3.67220609e-01
 -2.64231329e-01]
[-1.78476677e-01 2.47834896e-01 2.58375559e-01 2.83024041e-01
 -2.26758818e-01 -4.42093906e-01 -1.74587187e-01 2.15537935e-01
 1.21020302e-01 -4.58497282e-01 3.83414110e-01 1.26724133e-03
 -1.49408711e-02 -8.57331081e-03 -8.61575853e-02 -4.40006223e-02
 2.33516509e-01]
[ 1.98617542e-01 -2.43261851e-01 1.09906654e-01 2.29944433e-01
 1.38310455e-01 5.59562277e-03 -3.21857779e-01 -6.35277046e-01
 4.57694691e-01 -2.50492792e-01 -1.69270316e-01 2.63795357e-02
 3.68623907e-03 8.85282560e-03 -1.17084626e-02 6.97525565e-02
 5.84510444e-02]
[ 3.40157000e-01 -1.35747859e-01 -1.72929690e-01 -2.20176259e-01
 2.83072831e-02 2.38206017e-01 1.50523749e-01 -8.64143063e-02
 -5.05354371e-02 -6.52884786e-02 3.96898901e-01 -9.78083104e-03
 -6.46344811e-02 -1.59986586e-01 -2.25352711e-01 -1.17933520e-01
 6.64009736e-01]
[ 2.48644778e-01 -1.60607758e-01 2.31028150e-01 7.41465533e-02
 2.87880353e-01 -3.72585008e-01 -4.63707778e-01 1.62585522e-01
 -1.29825028e-01 5.80722250e-01 7.39141063e-02 -2.27570853e-03
 -1.81770778e-02 7.18415214e-03 -5.27965354e-02 -8.65713354e-02
 1.41880695e-01]]
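When reading a matrix like the one above, the layout matters. The sketch below assumes the decomposition was done with numpy.linalg.eig, which is consistent with the unsorted eigenvalues shown but is not confirmed by the text. With that function, each eigenvector is a column of the returned matrix, and the eigenvalues are not guaranteed to be sorted. The small covariance matrix is hypothetical.

```python
import numpy as np

# Small hypothetical symmetric covariance matrix.
cov = np.array([
    [2.0, 0.8, 0.3],
    [0.8, 1.0, 0.2],
    [0.3, 0.2, 0.5],
])

eig_vals, eig_vecs = np.linalg.eig(cov)

# Column i of eig_vecs pairs with eig_vals[i]: cov @ v = lambda * v.
columns_are_eigenvectors = all(
    np.allclose(cov @ eig_vecs[:, i], eig_vals[i] * eig_vecs[:, i])
    for i in range(len(eig_vals))
)
print(columns_are_eigenvectors)

# Sort the eigenpairs in descending order of eigenvalue.
order = np.argsort(eig_vals)[::-1]
sorted_vals = eig_vals[order]
sorted_vecs = eig_vecs[:, order]

# The first principal component is the first column after sorting.
first_pc = sorted_vecs[:, 0]
print(sorted_vals)
print(first_pc)
```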

Question 2.7 Write down the explicit form of the first PC (in
terms of the eigenvectors. Use values with two places of decimals
only).
To find the first PC, sort the eigenpairs in descending order of
eigenvalues and select the one with the largest eigenvalue. This is the
first principal component, which captures the maximum variance in
the original data.
As per my output from Python:
Descending order of eigenvalues:
5.643078412365666
4.829736719876616
1.1003064401877618
0.9966849033178099
0.8977433005683855
0.7654920479750724
0.5870956473731336
0.5545035789961089
0.443192911646843
0.3822264051083322
0.24563728544474883
0.1468449578810244
0.1360384406428
0.12376405766765952
0.07466870608065557
0.0559799215601786
0.03891347980205039

First PC Eigen vectors:

[0.24 0.32 0.10 -0.10 0.23 -0.05 0.01 -0.03 -0.18 -0.13 -0.07 -0.15 0.57 0.03 -0.35 0.48 -0.03]
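The explicit form of a PC is a weighted sum of the scaled variables, with the eigenvector entries as weights. One caution: the vector listed above matches the first row of the eigenvector matrix from Question 2.6. If that matrix came from numpy.linalg.eig (an assumption), eigenvectors are stored as columns, so the first PC would use the first column instead. The sketch below builds the expression from that first column; the names x1 to x17 are hypothetical placeholders, because the column order of the dataset is not given in the text.

```python
# First column of the eigenvector matrix printed in Question 2.6,
# i.e. the first entry of each of the 17 rows.
loadings = [
    0.242671239, 0.208095876, 0.164564266, 0.344633526, 0.337858398,
    0.134287678, 0.0145128920, 0.297304568, 0.251192093, 0.0935681745,
    -0.0484668755, 0.324667558, 0.320509921, -0.178476677, 0.198617542,
    0.340157000, 0.248644778,
]

# Hypothetical placeholder names for the 17 scaled variables.
names = [f"x{i}" for i in range(1, len(loadings) + 1)]

# Format each weight with an explicit sign and two decimal places.
terms = [f"{weight:+.2f}*{name}" for weight, name in zip(loadings, names)]
pc1_expression = "PC1 = " + " ".join(terms)

print(pc1_expression)
```

Replacing x1 to x17 with the actual variable names, in the column order of the scaled data, gives the explicit form asked for in the question.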

Entire eigen vectors:


Question 2.8 Consider the cumulative values of the eigenvalues.
How does it help you to decide on the optimum number of
principal components? What do the eigenvectors indicate?
As per the output from my Python:
Visually, we can observe a steep drop in the variance explained
as the number of PCs increases.
We will proceed with 8 components, since we target 90% of the
variation: the cumulative variance explained reaches 90.32% at
the 8th component, so we reduce the dimensionality from 17 to 8.
The eigenvectors define the principal components (the weight of each
original variable in each component). Keeping all 17 components
retains 100% of the variation and gives no dimensionality reduction.
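The cumulative variance explained can be recomputed directly from the sorted eigenvalues listed in Question 2.7. The sketch below uses only the standard library; the 90% threshold is the one stated above.

```python
from itertools import accumulate

# Eigenvalues in descending order, as listed in Question 2.7.
eigenvalues = [
    5.643078412365666, 4.829736719876616, 1.1003064401877618,
    0.9966849033178099, 0.8977433005683855, 0.7654920479750724,
    0.5870956473731336, 0.5545035789961089, 0.443192911646843,
    0.3822264051083322, 0.24563728544474883, 0.1468449578810244,
    0.1360384406428, 0.12376405766765952, 0.07466870608065557,
    0.0559799215601786, 0.03891347980205039,
]

total = sum(eigenvalues)

# Share of the total variance explained by each component.
explained = [value / total for value in eigenvalues]

# Running total of the explained variance.
cumulative = list(accumulate(explained))

# Smallest number of components whose cumulative share reaches 90%.
threshold = 0.90
n_components = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)

for i, c in enumerate(cumulative, start=1):
    print(f"PC{i}: {c * 100:.2f}%")
print("Components needed:", n_components)
```

With these eigenvalues the first seven components fall short of 90%, and the eighth brings the cumulative share to 90.32%.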

Question 2.9 Explain the business implication of using the
Principal Component Analysis for this case study. How may PCs
help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]
With the help of PCA, we have been able to reduce 17 numeric features
to 8 components, which together explain about 90% of the variance in
the data.
