Introduction (1)
Submitted to
By
Pirangi Charan Teja Goud
In partial Fulfillment of
PGP-DSBA
List of Equations
Data Dictionary
Introduction
Step 2: Select the Significance Level
Step 3: Conduct the Hypothesis Test
Results for Shingle A:
Results for Shingle B:
Step 4: Decision Based on p-value
Step 1: Define the Hypotheses
Step 2: Choose Significance Level
Step 3: Identify the Test Statistic
Step 4: Compute the Test Statistic and p-value
Step 5: Decision Rule
Step 6: Conclusion
Problem 3
Salary Distribution by Education Level (Boxplot)
Summary of Salary Distribution by Education Level (Boxplot Analysis)
Salary Variation and Spread
Presence of Outliers
Skewness and Distribution Shape
Conclusion
Summary of Salary Distribution by Occupation (Boxplot Analysis)
Median Salary Comparison
Salary Variability and Dispersion
Outliers and Distribution Characteristics
Conclusion
Step 1: State the Hypotheses
Step 2: Check the Assumptions of ANOVA
Step 3: Conduct the Hypothesis Test (One-Way ANOVA)
Step 4: Conclusion from the Results
Conclusion
Step 1:
Null Hypothesis (H₀)
Alternative Hypothesis (H₁)
Step 2: Assumption Checks
Step 3: Normality Test (Shapiro-Wilk Test)
Step 4: Homogeneity of Variance (Levene’s Test)
Step 4: Conduct the Hypothesis Test (One-Way ANOVA)
Step 2: Check the Assumptions
Step 3: Conduct the Hypothesis Test (Two-Way ANOVA)
ANOVA Table
Step 4: Conclusion
List of Tables
1. Table 1: Summary Statistics of Student Demographics
2. Table 2: Probability Analysis of Gender Distribution
3. Table 3: Conditional Probabilities of Majors by Gender
4. Table 4: Summary Statistics of Moisture Content in Shingles
5. Table 5: Hypothesis Test Results for Moisture Content
6. Table 6: ANOVA Results for Salary Differences
List of Figures
List of Equations
1. Probability Formula:
P(Event) = Favorable Outcomes / Total Outcomes
2. Conditional Probability Formula:
P(A | B) = P(A ∩ B) / P(B)
3. Union of Two Events:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Data Dictionary
Introduction
This report provides statistical insights into three problems: Student Demographics and Behavioral
Analysis, Moisture Content in ABC Asphalt Shingles, and the Relationship Between Salary, Education, and
Occupation. It applies statistical techniques such as probability analysis, hypothesis testing, and
ANOVA to address key business problems. Each section follows a structured approach, detailing the problem,
methodology, results, and conclusions. The insights derived from this analysis are intended to support
organizations in improving decision-making and operational efficiency.
Problem 1:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the
undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives
responses from 62 undergraduates (stored in the Survey data set).
1.1 What is the probability that a randomly selected CMSU student will be male?
Solution: The probability that a randomly selected CMSU student will be male is
Probability = (Number of male students) / (Total number of students)
• Number of male students = 29
• Total number of students = 62
P(Male) = 29/62 ≈ 0.468
Thus, the probability that a randomly selected CMSU student will be male is 0.468 (or 46.8%).
1.2 What is the probability that a randomly selected CMSU student will be female?
Solution: The probability of selecting a female student is calculated using the formula
Probability = (Number of female students) / (Total number of students)
P(Female) = 33/62 ≈ 0.532
Thus, the probability that a randomly selected CMSU student will be female is 0.532 (or 53.2%).
1.3 What is the conditional probability of different majors among male students in CMSU?
Probability = (Number of male students in a specific major) / (Total number of male students)
• Management: P(Management | Male) = 6/29 ≈ 0.207
• Retailing/Marketing: P(Retailing or Marketing | Male) = 5/29 ≈ 0.172
• Other: P(Other | Male) = 4/29 ≈ 0.138
• Economics/Finance: P(Economics or Finance | Male) = 4/29 ≈ 0.138
• Accounting: P(Accounting | Male) = 4/29 ≈ 0.138
• Undecided: P(Undecided | Male) = 3/29 ≈ 0.103
• International Business: P(International Business | Male) = 2/29 ≈ 0.069
• CIS: P(CIS | Male) = 1/29 ≈ 0.034
1.4 What is the conditional probability of different majors among the female students of CMSU?
Solution: The conditional probability of a female student being in a particular major is given by the formula:
Probability = (Number of female students in a specific major) / (Total number of female students)
From the dataset, the total number of female students is 33. Below are the probabilities for different majors
among female students:
• Retailing/Marketing: P(Retailing or Marketing | Female) = 9/33 ≈ 0.273
• Economics/Finance: P(Economics or Finance | Female) = 7/33 ≈ 0.212
• Management: P(Management | Female) = 4/33 ≈ 0.121
• International Business: P(International Business | Female) = 4/33 ≈ 0.121
• Other: P(Other | Female) = 3/33 ≈ 0.091
• CIS: P(CIS | Female) = 3/33 ≈ 0.091
• Accounting: P(Accounting | Female) = 3/33 ≈ 0.091
These probabilities represent the likelihood that a randomly selected female student is in a specific major.
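The conditional probabilities in 1.3 and 1.4 can be reproduced from the major counts alone. The sketch below is only an illustration: the counts are copied from the figures reported above, while the dictionary and function names are assumptions introduced for this example, not part of the original analysis.

```python
from fractions import Fraction

# Counts of students per major, copied from the figures reported above.
male_majors = {
    "Management": 6, "Retailing/Marketing": 5, "Other": 4,
    "Economics/Finance": 4, "Accounting": 4, "Undecided": 3,
    "International Business": 2, "CIS": 1,
}
female_majors = {
    "Retailing/Marketing": 9, "Economics/Finance": 7, "Management": 4,
    "International Business": 4, "Other": 3, "CIS": 3, "Accounting": 3,
}


def conditional_probabilities(counts):
    """Return P(major | gender) for each major as an exact fraction."""
    total = sum(counts.values())
    return {major: Fraction(n, total) for major, n in counts.items()}


p_major_given_male = conditional_probabilities(male_majors)
p_major_given_female = conditional_probabilities(female_majors)

for major, p in p_major_given_female.items():
    print(f"P({major} | Female) = {float(p):.3f}")
```

Using exact fractions keeps the probabilities within each gender summing to exactly 1, which is a quick check that no major was missed.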
1.5 What is the probability that a randomly chosen student is a male and intends to graduate?
Solution: The probability that a randomly chosen student is male and intends to graduate is calculated using the
formula:
Probability = (Number of male students who intend to graduate) / (Total number of students)
1.6 What is the probability that a randomly selected student is a female and does NOT have a laptop?
Solution: The probability that a randomly selected student is female and does NOT have a laptop is calculated using the formula:
Probability = (Number of female students without a laptop) / (Total number of students)
The probability that a randomly selected student is female and does NOT have a laptop is 0.065 (or 6.5%).
1.7 What is the probability that a randomly chosen student is a male or has full-time employment?
Solution: The probability that a randomly chosen student is male or has full-time employment is calculated
using the formula:
P(Male ∪ Full-Time Employment) = P(Male) + P(Full-Time Employment) − P(Male ∩ Full-Time Employment)
• Number of male students = 29
• Number of students with full-time employment = 11
• Number of male students with full-time employment = 6
• Total number of students = 62
P(Male) = 29/62 ≈ 0.468
P(Full-Time Employment) = 11/62 ≈ 0.177
P(Male ∩ Full-Time Employment) = 6/62 ≈ 0.097
P(Male ∪ Full-Time Employment) = (29 + 11 − 6)/62 = 34/62 ≈ 0.548
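The counts listed for 1.7 (29 males, 11 students with full-time employment, 6 in both groups, 62 respondents) can be combined with the inclusion-exclusion rule for the union of two events. The following is a minimal sketch under those counts; the variable and helper names are illustrative assumptions.

```python
from fractions import Fraction

total = 62               # survey respondents
male = 29                # male students
full_time = 11           # students with full-time employment
male_and_full_time = 6   # male students with full-time employment


def prob(count, n=total):
    """Probability of an event as favorable outcomes over total outcomes."""
    return Fraction(count, n)


# Subtract the overlap once so males with full-time jobs are not counted twice.
p_union = prob(male) + prob(full_time) - prob(male_and_full_time)
print(f"P(Male or Full-Time Employment) = {float(p_union):.3f}")
```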
1.8 What is the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management?
Solution: The conditional probability that a randomly chosen female student is majoring in
International Business or Management is:
Probability = (Number of female students majoring in International Business or Management) / (Total number
of female students)
= 8/33 ≈ 0.242
Thus, the probability that a randomly chosen female student is majoring in International Business or
Management is 0.242 (or 24.2%).
1.9 If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
Solution: The probability that a randomly chosen student has a GPA less than 3 is given by:
Probability = (Number of students with GPA < 3) / (Total number of students)
• Number of students with GPA < 3 = 17
• Total number of students = 62
P(GPA < 3) = 17/62 ≈ 0.274
Thus, the probability that a randomly chosen student has a GPA less than 3 is 0.274 (or 27.4%).
1.10 What is the conditional probability that a randomly selected male earns 50 or more?
Solution: The conditional probability that a randomly selected male earns 50 or more is:
P(Earning ≥ 50 | Male) = (Number of males earning ≥ 50) / (Total number of males)
= 14/29 ≈ 0.4828
Thus, the conditional probability that a randomly selected male earns 50 or more is 0.4828 (or 48.28%).
1.11 What is the conditional probability that a randomly selected female earns 50 or more?
Solution: The conditional probability that a randomly selected female earns 50 or more is:
P(Earning ≥ 50 | Female) = (Number of females earning ≥ 50) / (Total number of females)
= 18/33 = 6/11 ≈ 0.5455
Thus, the conditional probability that a randomly selected female earns 50 or more is 0.5455 (or 54.55%).
1.12 Are the continuous variables in the data normally distributed? Write a note summarizing your conclusions.
Solution:
Fig 1: Distribution of GPA
The distribution of GPA appears to be approximately normal, with a slight left skew (skew = -0.31). This
suggests that most students have GPAs clustered around the mean, with slightly more values leaning towards
the higher end. Since the distribution is close to normal, standard statistical methods can be applied without
significant concerns.
The Salary distribution is moderately right-skewed (skew = 0.53). This indicates that most individuals earn
within a lower to mid-range salary, but a few earn significantly higher amounts, pulling the distribution to the
right. While this skewness is not extreme, it suggests that median-based statistics might provide a better
representation of central tendency than the mean.
Spending is highly right-skewed (skew=1.59), meaning that most individuals spend relatively low amounts,
while a few outliers have significantly higher spending levels. This skewness suggests that a small percentage
of people drive up the average spending. A log transformation may help normalize this variable for better
statistical analysis.
Fig 4: Distribution of Text messages
Similarly, Text Messages show a highly right-skewed distribution (skew=1.30), indicating that most
individuals send relatively few messages, while a smaller group sends an exceptionally high number. This
skewed nature suggests that using median-based measures or transformation techniques could be beneficial in
further analysis.
Conclusion: While GPA is nearly normal, Salary, Spending, and Text Messages exhibit right-skewed
distributions due to high-value outliers. If normality is required for analysis, a transformation such as
logarithmic scaling should be applied; alternatively, non-parametric tests should be considered.
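To illustrate how a skewness coefficient is obtained and how a log transformation reduces right skew, the sketch below uses the moment-based estimator g1 = m3 / m2^1.5. The spending values are hypothetical and chosen only to mimic a right-skewed variable; they are not the survey data, and the exact estimator behind the skew values quoted above is not stated in the report.

```python
import math


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (the biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical spending amounts (NOT the survey data): mostly small, a few large.
spending = [100, 120, 150, 180, 200, 220, 250, 300, 350, 400, 900, 1400]

raw_skew = sample_skewness(spending)
log_skew = sample_skewness([math.log(x) for x in spending])
print(f"skew(raw) = {raw_skew:.2f}, skew(log) = {log_skew:.2f}")
```

A positive value indicates a longer right tail; the log-transformed values should show a noticeably smaller coefficient than the raw ones.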
• If the points follow the diagonal line closely, the GPA data is normally distributed.
• If the points deviate significantly, especially at the higher or lower ends, it suggests that the GPA data
may have a skewed distribution or outliers.
Summary:
• The GPA data appears to be somewhat normally distributed, but there may be slight deviations in
the tails, indicating potential skewness or outliers.
Fig 6: Salary Data: Theoretical vs. Observed Quantiles
• If the points follow the diagonal line closely, the Salary data is normally distributed.
• If the points deviate significantly, especially at the higher or lower ends, it suggests that the Salary data
may have a skewed distribution or outliers.
• Summary:
• The Salary data likely deviates from normality, particularly in the tails. This could indicate a right-
skewed distribution, where a few individuals have significantly higher salaries compared to the
majority.
• If the points align well with the diagonal line, the Spending data is normally distributed.
• Deviations, especially at the higher end (e.g., points curving upward), suggest that the Spending
data may have a right-skewed distribution, with some individuals spending significantly more than
others.
• Summary:
o The Spending data appears to be right-skewed, with a few outliers or individuals spending much
more than the majority. This is common in spending data, where most people spend within a
certain range, but a few spend much more.
• If the points follow the diagonal line, the Text Messages data is normally distributed.
• Deviations, especially at the higher end, suggest that the data may have a right-skewed
distribution, with some individuals sending significantly more text messages than others.
• Summary:
• The Text Messages data is likely right-skewed, with a few individuals sending a much higher
number of text messages compared to the majority. This is common in communication data,
where most people send a moderate number of messages, but a few are highly active.
• Conclusion:
• GPA: Approximately normally distributed, suitable for parametric methods with minor adjustments.
• Salary: Right-skewed, non-parametric methods or transformations recommended.
• Spending: Right-skewed, non-parametric methods or transformations recommended.
• Text Messages: Right-skewed, non-parametric methods or transformations recommended.
• For the skewed datasets (Salary, Spending, and Text Messages), consider using log transformations
or non-parametric statistical tests to handle the skewness and outliers. For GPA, parametric
methods can be used, but it’s important to check for outliers or slight deviations from normality.
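The reading of the quantile plots described above (points close to the diagonal versus points curving away at the upper end) can be sketched numerically by pairing sorted observations with theoretical normal quantiles. This is an illustrative sketch only: both samples are hypothetical, the function names are assumptions, and the correlation between the two sets of quantiles is used here as a rough summary of how straight the plot is.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each sorted observation with its theoretical normal quantile."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, data))


def qq_correlation(points):
    """Pearson correlation between theoretical and observed quantiles."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical samples (NOT the survey data).
symmetric = [2.4, 2.7, 2.9, 3.0, 3.1, 3.1, 3.2, 3.3, 3.5, 3.8]
right_skewed = [100, 120, 150, 180, 200, 220, 250, 300, 900, 1400]

r_sym = qq_correlation(qq_points(symmetric))
r_skew = qq_correlation(qq_points(right_skewed))
print(f"Q-Q correlation: symmetric = {r_sym:.3f}, right-skewed = {r_skew:.3f}")
```

A correlation near 1 corresponds to points hugging the diagonal; the right-skewed sample should score visibly lower because its largest values sit far above the line.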
Problem 2:
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount of
moisture the shingles contain when they are packaged. Customers may feel that they have purchased a product
lacking in quality if they find moisture and wet shingles inside the packaging. In some cases, excessive moisture
can cause the granules attached to the shingles for texture and coloring purposes to fall off the shingles resulting
in appearance problems. To monitor the amount of moisture present, the company conducts moisture tests. A
shingle is weighed and then dried. The shingle is then reweighed and based on the amount of moisture taken out
of the product, the pounds of moisture per 100 square feet is calculated. The company would like to show that
the mean moisture content is less than 0.35 pounds per 100 square feet.
Exploratory Data Analysis (EDA) is essential for understanding the distribution, patterns, and potential
anomalies within the dataset before conducting statistical tests. In this analysis, we will examine Shingle A and
Shingle B individually using summary statistics and visualizations.
Fig 9: Histogram of Moisture Content – Shingle A
• Summary:
• The histogram shows the distribution of moisture content in Shingle A.
• If the shape is approximately normal, it indicates a well-distributed moisture content.
• If skewed, it suggests uneven moisture retention.
Boxplot of Shingle – A:
Fig 10: Boxplot of Moisture Content – Shingle A
• Summary:
• The boxplot helps identify outliers (values far from the whiskers).
• If outliers exist, further investigation is needed—whether they are natural variations or errors.
Fig 11: Histogram of Moisture Content – Shingle B
• Summary
• Similar insights as above but specific to Shingle B.
• A wider spread would indicate higher variability in moisture content.
Fig 12: Boxplot of Moisture Content – Shingle – B
• Summary
• Compares the moisture variability with A.
• If the box (IQR) is larger than A, Shingle B has higher variation in moisture content.
2.1 Is there any evidence that the mean moisture content in both types of shingles is within the permissible
limits?
Solution:
For both Shingle A and Shingle B, we are testing if the mean moisture content is less than the permissible limit
(0.35 pounds per 100 square feet).
• Null Hypothesis (H₀): The mean moisture content is greater than or equal to 0.35.
H₀: μ ≥ 0.35
• Alternative Hypothesis (H₁): The mean moisture content is less than 0.35.
H₁: μ < 0.35
Step 2: Select the Significance Level
• We will use α = 0.05 (5%), which means we will reject the null hypothesis if the p-value is less than
0.05.
Step 3: Conduct the Hypothesis Test
We use a one-sample t-test to check if the mean moisture content is significantly less than 0.35.
Results for Shingle A:
• t-statistic: -1.4735
• p-value: 0.0748
Results for Shingle B:
• t-statistic: -3.6087
• p-value: 0.00048
Step 4: Decision Based on p-value
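• For Shingle A: Since p-value (0.0748) > 0.05, we fail to reject H₀.
• Conclusion: There is not enough statistical evidence to conclude that the mean moisture content of
Shingle A is less than 0.35.
• This means Shingle A cannot be shown to meet the permissible moisture limit at the 5% significance level.
• For Shingle B: Since p-value (0.00048) < 0.05, we reject H₀.
• Conclusion: There is sufficient statistical evidence that the mean moisture content of Shingle B is
less than 0.35.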
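The left-tailed one-sample t-test used for 2.1 can be sketched with SciPy as shown below. The moisture readings are hypothetical placeholders, not the shingle data analysed in this report, so the printed statistic will not match the values reported above; the `alternative="less"` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings in lb per 100 sq ft (NOT the report's data).
sample = np.array([0.31, 0.29, 0.36, 0.33, 0.27, 0.34, 0.30, 0.32, 0.35, 0.28])
limit = 0.35   # permissible mean moisture content
alpha = 0.05   # significance level

# H0: mu >= 0.35 versus H1: mu < 0.35, so the test is left-tailed.
result = stats.ttest_1samp(sample, popmean=limit, alternative="less")
t_stat, p_value = result.statistic, result.pvalue

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.4f}, p = {p_value:.4f} -> {decision}")
```

The same call would be run once per shingle type, each time comparing the resulting p-value with α = 0.05.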
Step 1: Define the Hypotheses
• Null Hypothesis (H₀): The mean moisture contents of Shingle A and Shingle B are equal.
H₀: μA = μB
• Alternative Hypothesis (H₁): The mean moisture contents of Shingle A and Shingle B are not equal.
H₁: μA ≠ μB
Step 2: Choose Significance Level
• We use α = 0.05 (5%).
Step 3: Identify the Test Statistic
• Since the population standard deviations are unknown and the sample sizes are different, we use the
independent two-sample t-test (Welch’s t-test).
• This test does not assume equal variances and follows a t-distribution.
Step 4: Compute the Test Statistic and p-value
• t-statistic: 1.3912
• p-value: 0.1686
Step 6: Conclusion
Since the p-value (0.1686) is greater than the significance level (0.05), we fail to reject the null hypothesis
(H₀).
Conclusion: There is not enough statistical evidence to say that the mean moisture contents of Shingle A
and Shingle B are different. The data do not show a statistically significant difference between the two
means at the 5% significance level.
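Welch's two-sample t-test, as used for the comparison above, can be sketched with SciPy as follows. The two samples are hypothetical and deliberately of different sizes; they are not the report's shingle data, so the output will differ from the statistic and p-value reported above.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings (NOT the report's data); sizes differ on purpose.
shingle_a = np.array([0.31, 0.29, 0.36, 0.33, 0.27, 0.34, 0.30, 0.32])
shingle_b = np.array([0.28, 0.26, 0.31, 0.25, 0.30, 0.27, 0.29, 0.24, 0.33, 0.26])
alpha = 0.05

# equal_var=False selects Welch's t-test; the default alternative is two-sided,
# matching H0: mu_A = mu_B versus H1: mu_A != mu_B.
t_stat, p_value = stats.ttest_ind(shingle_a, shingle_b, equal_var=False)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.4f}, p = {p_value:.4f} -> {decision}")
```

Failing to reject H₀ here would only mean the samples do not provide evidence of a difference; it would not prove the two means are equal.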
Problem 3
Salary is hypothesized to depend on educational qualification and occupation. To understand the dependency,
the salaries of 40 individuals are collected and each person’s educational qualification and occupation are noted.
Educational qualification is at three levels, High school graduate, Bachelor's, and Doctorate. Occupation is at
four levels, Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A
different number of observations are in each level of education–occupation combination.
Salary Distribution by Education Level (Boxplot)
The boxplot displays the salary distribution for individuals with different education levels (High School
Graduate, Bachelor's, and Doctorate). The key insights are:
• Doctorate holders have the highest median salary, reflecting their advanced qualifications.
• Bachelor's degree holders earn more than High School graduates but less than Doctorate holders.
• High School graduates (HS-grad) have the lowest median salary.
Salary Variation and Spread
• Doctorate: The widest interquartile range (IQR), indicating substantial variation in salaries.
• Bachelor’s: Moderate salary variation, with a slightly smaller IQR than Doctorates.
• High School Graduates: The smallest salary spread, indicating relatively consistent earnings.
Presence of Outliers
Skewness and Distribution Shape
• The distributions for Doctorate and bachelor's degree holders are more spread out, suggesting a broader
salary range.
• The salary distribution for HS Graduates is more compact, implying lower variability in earnings.
Conclusion
The analysis suggests that salary levels tend to increase with higher education levels. Doctorate holders not only
earn the highest median salary but also show the most variation in salaries. Bachelor's degree holders follow a
similar trend but with lower earnings. High School graduates earn the least, with more stable salaries. These
patterns indicate potentially significant differences in salaries across education levels, warranting further
statistical analysis (such as ANOVA) to verify the significance of these differences.
The boxplot illustrates salary distributions across various occupational categories: Administrative & Clerical,
Sales, Professional Specialty, and Executive & Managerial. Key observations include:
Median Salary Comparison
• Executive & Managerial roles have the highest median salary, indicating that individuals in these
positions typically earn more than those in other occupations.
• Sales roles exhibit a moderately high median salary, though with a more dispersed distribution.
• Administrative & Clerical roles have a relatively lower median salary.
• Professional Specialty roles display highly variable salaries around the median, suggesting significant
disparities in earnings within this category.
Salary Variability and Dispersion
• Professional Specialty roles demonstrate the largest salary spread, reflecting substantial variability in
earnings.
• Executive & Managerial positions show the least variation, suggesting more consistency in salaries
within this category.
• Sales roles exhibit moderate salary dispersion.
Conclusion
The salary distribution varies considerably across occupations. Executive & Managerial roles tend to offer the
highest and most stable salaries, whereas Professional Specialty roles demonstrate the greatest variation,
potentially due to differences in specialization and expertise. These observations underscore the significant
influence of occupation on salary levels. To validate these differences statistically, an ANOVA test should be
conducted to determine whether they are significant.
3.1 Is there any significant difference in salaries among different levels of education?
Solution:
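Step 1: State the Hypotheses
• Null Hypothesis (H₀): The mean salary is the same across all education levels.
• Alternative Hypothesis (H₁): At least one education level has a different mean salary.
Step 2: Check the Assumptions of ANOVA
1. Normality: Salaries within each education level should be approximately normally distributed. We test
this using the Shapiro-Wilk test.
2. Homogeneity of Variance: The variance in salaries across education levels should be similar. We test
this using Levene’s test.
Step 3: Conduct the Hypothesis Test (One-Way ANOVA)
We perform a One-Way ANOVA to determine whether at least one education level has a significantly different
salary compared to others.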
c. HS-grad: p-value = 0.1783 (normal)
d. Since all p-values > 0.05, the normality assumption is not rejected.
2. Homogeneity of Variance (Levene’s Test):
a. Test Statistic: 1.8801, p-value = 0.1669
b. Since p-value > 0.05, the assumption of equal variances is not rejected, so we can proceed with ANOVA.
3. One-Way ANOVA Results:
a. F-statistic: 30.9563, p-value < 0.0001
b. Since p-value < 0.05, we reject H₀ and conclude that salaries differ significantly among
education levels.
4. Post-hoc Analysis (Tukey’s HSD Test) (if applicable):
a. If ANOVA is significant, Tukey’s HSD test helps identify which specific education levels have
significantly different salaries.
Conclusion
• Based on One-Way ANOVA, there is a statistically significant difference in salaries among different
education levels.
• Since ANOVA only tells us that at least one group is different, a post-hoc test (Tukey’s HSD) can be
conducted to determine which specific education levels differ from each other.
• This finding suggests that higher education levels are associated with significantly different salary
distributions.
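The sequence used for 3.1 (Shapiro-Wilk per group, Levene's test across groups, then One-Way ANOVA) can be sketched with SciPy as shown below. The salary figures are hypothetical and are not the 40 observations analysed in this report, so the statistics will not match those reported above; the group labels and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical salaries by education level (NOT the report's 40 observations).
groups = {
    "HS-grad": np.array([50_000, 62_000, 75_000, 81_000, 90_000, 68_000]),
    "Bachelors": np.array([120_000, 150_000, 165_000, 172_000, 190_000, 140_000, 158_000]),
    "Doctorate": np.array([180_000, 205_000, 215_000, 230_000, 250_000, 198_000]),
}
alpha = 0.05

# Normality within each group (H0: the group is normally distributed).
for name, salaries in groups.items():
    w_stat, w_p = stats.shapiro(salaries)
    print(f"Shapiro-Wilk {name}: W = {w_stat:.4f}, p = {w_p:.4f}")

# Homogeneity of variance across groups (H0: variances are equal).
lev_stat, lev_p = stats.levene(*groups.values())
print(f"Levene: statistic = {lev_stat:.4f}, p = {lev_p:.4f}")

# One-Way ANOVA (H0: all group means are equal).
f_stat, f_p = stats.f_oneway(*groups.values())
decision = "reject H0" if f_p < alpha else "fail to reject H0"
print(f"ANOVA: F = {f_stat:.4f}, p = {f_p:.6f} -> {decision}")
```

If the ANOVA is significant, a post-hoc comparison such as Tukey's HSD would be the next step to identify which pairs of education levels differ.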
3.2 Is there any significant difference in salaries among different occupations?
Solution:
Step 1:
At least one education level has a significantly different mean salary compared to others.
Step 2: Assumption Checks
Normality (Shapiro-Wilk Test): Salaries within each education level should be approximately normally
distributed.
Homogeneity of Variance (Levene’s Test): The variance in salaries across education levels should be similar.
• F-statistic: 30.9563
• p-value: < 0.0001
Since p-value < 0.05, we reject the null hypothesis (H₀) and conclude that salaries differ significantly among
education levels.
Conclusion
• Based on One-Way ANOVA, there is a statistically significant difference in salaries among different
education levels.
• Since ANOVA only tells us that at least one group is different, a post-hoc test (Tukey’s HSD) can be
conducted to determine which specific education levels differ from each other.
• This finding suggests that higher education levels are associated with significantly different salary
distributions.
ANOVA Table
Step 4: Conclusion
• Since the interaction effect (Education × Occupation) is significant (p < 0.05), we reject the null
hypothesis.
• This means that the effect of Education on Salary depends on Occupation, and vice versa.
• However, due to the unbalanced group sizes and empty cells (Exec-managerial for HS-grad = 0), the
results should be interpreted with caution.