SMDM Project
DSBA
NATASHA CHAUHAN
9/10/2021
Problem 1: Wholesale Customers Analysis
Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on annual
spending of several items in their stores across different regions and channels. The data
consists of 440 large retailers’ annual spending on 6 different varieties of products in 3
different regions (Lisbon, Oporto, Other) and across different sales channel (Hotel, Retail).
Solution:
Import all the required libraries and load the data into the Jupyter notebook.
Basic EDA:
The data has 440 entries and 9 columns (variables): Buyer/Spender, Channel,
Region, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen.
Two columns are of object (string) type and seven are integers, so there are
2 categorical and 7 numerical columns.
1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the
least?
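The describe function summarizes each column. For numerical columns it reports
the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and
maximum, which cover both central tendency and dispersion. For categorical
columns it reports the count, the number of unique values, the most frequent
value and its frequency.
There are 2 categorical variables (Channel, Region) and 7 numerical variables
(Buyer/Spender, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen).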
o Channel - 2 unique values; Hotel has the highest frequency, 298 of 440.
o Region - 3 unique values; Other has the highest frequency, 316 of 440.
o Fresh:
Count 440, mean 12000, sd 12647.33, min 3, max 112151
Quartile 1 (25%) 3127.75, Quartile 2 (50%, median) 8504.0, Quartile 3 (75%) 16933.75
o Milk:
Count 440, mean 5796.27, sd 7380.38, min 55, max 73498
Quartile 1 (25%) 1533, Quartile 2 (50%, median) 3627, Quartile 3 (75%) 7190.25
o Grocery:
Count 440, mean 7951.28, sd 9503.16, min 3, max 92780.0
Quartile 1 (25%) 2153.0, Quartile 2 (50%, median) 4755.5, Quartile 3 (75%) 10655.75
o Frozen:
Count 440, mean 3071.93, sd 4854.67, min 25, max 60869.0
Quartile 1 (25%) 742.25, Quartile 2 (50%, median) 1526.0, Quartile 3 (75%) 3554.25
o Detergents_Paper:
Count 440, mean 2881.49, sd 4767.85, min 3, max 40827.0
Quartile 1 (25%) 256.75, Quartile 2 (50%, median) 816.5, Quartile 3 (75%) 3922.0
o Delicatessen:
Count 440, mean 1524.87, sd 2820.10, min 3, max 47943
Quartile 1 (25%) 408.25, Quartile 2 (50%, median) 965.5, Quartile 3 (75%) 1820.25
To find which Region and which Channel spent the most and the least, we create a
new column 'Spending', the total of all 6 item variables for each customer, and
sum it by Region and by Channel.
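The step above can be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame with the column names listed in the EDA; the four rows are made-up values, not records from the actual data set.

```python
import pandas as pd

# Hypothetical rows in the shape of the wholesale data (not the real records).
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Other", "Other", "Lisbon", "Oporto"],
    "Fresh": [12000, 9000, 7000, 3000],
    "Milk": [3000, 9500, 1200, 8000],
    "Grocery": [4000, 14000, 2500, 11000],
    "Frozen": [3500, 1500, 2800, 900],
    "Detergents_Paper": [800, 6500, 400, 5000],
    "Delicatessen": [1500, 1800, 700, 1200],
})

items = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Total annual spending per customer across the six item categories.
df["Spending"] = df[items].sum(axis=1)

# Total spending per region and per channel, largest first.
by_region = df.groupby("Region")["Spending"].sum().sort_values(ascending=False)
by_channel = df.groupby("Channel")["Spending"].sum().sort_values(ascending=False)

print(by_region)
print(by_channel)
```

The first and last entries of each sorted series give the highest and lowest spenders respectively.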
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
Solution:
After plotting the graphs using the crosstab function for all the items, we can
conclude that the items do not all behave alike across regions and channels.
The crosstab function has been used here for a clear split between the Channels
within each region.
Fresh and Frozen have higher spending in the Hotel channel than in the Retail
channel across all regions, whereas Milk, Grocery and Detergents_Paper have
higher spending in Retail than in Hotel across all regions. Delicatessen also
has slightly more spending from Retail than from Hotel.
It is safe to say that Fresh and Frozen items are mostly consumed by the Hotel
channel irrespective of the region. On the other hand, items like Milk, Grocery
and Detergents_Paper are mostly bought by the Retail channel irrespective of
region.
Delicatessen is the one item whose split between the channels depends on the
region: spending is almost equal between the two channels in Oporto, but there
is a large difference between the channels in Lisbon.
From these observations we can say that for the other 5 variables the
distribution between channels is not affected by the region, while the
distribution of Delicatessen is affected slightly.
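A comparison of this kind can be sketched with a grouped mean per item. The snippet is illustrative only: the DataFrame holds made-up values for two items, and the column names are assumed to match the data set.

```python
import pandas as pd

# Hypothetical data: one row per Region/Channel pair, two items only.
df = pd.DataFrame({
    "Region":  ["Lisbon", "Lisbon", "Oporto", "Oporto", "Other", "Other"],
    "Channel": ["Hotel", "Retail", "Hotel", "Retail", "Hotel", "Retail"],
    "Fresh":   [13000, 5000, 11000, 7000, 14000, 9000],
    "Grocery": [4000, 18000, 4500, 16000, 3800, 15500],
})
items = ["Fresh", "Grocery"]

# Mean spending per item, with regions as rows and channels side by side.
by_group = df.groupby(["Region", "Channel"])[items].mean().unstack("Channel")

# For each item, does Hotel outspend Retail in every region?
hotel_leads = {
    item: bool((by_group[item]["Hotel"] > by_group[item]["Retail"]).all())
    for item in items
}

print(by_group)
print(hotel_leads)
```

An item whose flag changes from region to region is one whose channel split depends on the region.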
1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?
The table below gives the summary of descriptive measures for all six items in
the data:
Consistency of the items can be measured using the Coefficient of Variation (CV). The
coefficient of variation is a relatively simple and quick tool to compare different data
series, because it expresses the spread relative to the mean. A higher CV reflects
higher inconsistency.
Formula to derive the Coefficient of Variation (CV):
CV = σ/μ
Where:
σ = Standard deviation
μ = Mean
From the last row (CV) of the table, it is evident that the Coefficient of Variation is
highest for the item “Delicatessen” and lowest for the item “Fresh”. Hence it can be
concluded that the item that shows the most inconsistent behaviour is Delicatessen
and the item that shows the least inconsistent behaviour is Fresh.
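As a quick check, the CV can be recomputed from the means and standard deviations reported in section 1.1. This is a small sketch in plain Python; the figures are copied from the summary above (the Fresh mean is used as reported, 12000).

```python
# (mean, standard deviation) per item, as reported in section 1.1.
summary = {
    "Fresh": (12000, 12647.33),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "Detergents_Paper": (2881.49, 4767.85),
    "Delicatessen": (1524.87, 2820.10),
}

# Coefficient of variation = standard deviation / mean.
cv = {item: sd / mean for item, (mean, sd) in summary.items()}

most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in sorted(cv.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {value:.3f}")
print("Most inconsistent:", most_inconsistent)
print("Least inconsistent:", least_inconsistent)
```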
1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.
Yes. From the box plot we can conclude that all six item variables have outliers:
Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen.
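The box plot reading can be cross-checked numerically with the usual 1.5 × IQR whisker rule, using only the minimum, quartiles and maximum reported in section 1.1. The sketch below assumes that rule (the default for most box plot functions) is the one behind the plot.

```python
# (min, Q1, Q3, max) per item, as reported in section 1.1.
summary = {
    "Fresh": (3, 3127.75, 16933.75, 112151),
    "Milk": (55, 1533, 7190.25, 73498),
    "Grocery": (3, 2153.0, 10655.75, 92780),
    "Frozen": (25, 742.25, 3554.25, 60869),
    "Detergents_Paper": (3, 256.75, 3922.0, 40827),
    "Delicatessen": (3, 408.25, 1820.25, 47943),
}

flags = {}
for item, (minimum, q1, q3, maximum) in summary.items():
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    flags[item] = {
        "low_outliers": minimum < lower_fence,   # any point below the lower whisker limit
        "high_outliers": maximum > upper_fence,  # any point above the upper whisker limit
    }
    print(f"{item}: upper fence {upper_fence:.2f}, max {maximum}, {flags[item]}")
```

For every item the maximum lies above the upper fence while the minimum stays above the (negative) lower fence, so under this rule the outliers are all on the high-spending side.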
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective
On the basis of the analysis above, we can recommend the following to the
business:
o ‘Other’ is the region that spends the most compared to Lisbon and Oporto,
and ‘Retail’ is the channel that spends the most compared to Hotel in this
region. If the business is to be extended, it should be done in the region
‘Other’ through the Retail channel, focusing on items like ‘Milk, Grocery,
Detergents_Paper, and Delicatessen’ rather than on all of them. Extending
the business with these points in mind can boost sales and increase revenue
more than focusing on Lisbon or Oporto.
o If the business is interested in growing the revenue from the ‘Hotel’ channel,
it should focus on food products like “Fresh and Frozen”. Irrespective of the
region, Fresh and Frozen items are doing very well in the Hotel channel.
o Fresh has the highest spending in both the channels irrespective of region,
followed by Grocery and Milk. So, it is recommended that these food products
be available across all channels and regions.
o Delicatessen is the product that has shown the most inconsistency
irrespective of region and channel. As it sells across all channels and
regions, it should be made available at all times.
Problem 2 –
Problem Statement:
The Student News Service at Clear Mountain State University (CMSU) has decided to
gather data about the undergraduate students that attend CMSU. CMSU creates and
distributes a survey of 14 questions and receives responses from 62 undergraduates
(stored in the Survey data set).
Solution:
After importing all the libraries into the Jupyter notebook, load the Survey file
and complete the basic EDA:
2.1 For this data, construct the following contingency tables (Keep Gender as
row variable)
2.1.1 Gender and Major
Sol: Below output from Jupyter notebook
2.2 Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.2.1 What is the probability that a randomly selected CMSU student will be
male?
P (randomly selected student will be male) = (Total no. of males / Total no. of
students)
From the calculation done in python we can conclude that:
Probability that a randomly selected CMSU student will be male:
0.46774193548387094 which is 46.77%.
2.2.2 What is the probability that a randomly selected CMSU student will be
female?
P (randomly selected student will be female) = (Total no. of females / Total no. of
students)
From the calculation done in python we can conclude that:
Probability that a randomly selected CMSU student will be female:
0.532258064516129 which is 53.23%.
2.3 Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.3.1 Find the conditional probability of different majors among the male
students in CMSU.
Using the contingency table of Gender and Major, we get the number of males
opting for each major.
P (major | male) = P (major ∩ male)/ P (male)
Sol: From the calculation done in python we conclude that:
Probability of Accounting among male student is 13.79%
Probability of CIS among male student is 3.45%
Probability of Economics/Finance among male student is 13.79%
Probability of International Business among male student is 6.9%
Probability of Management among male student is 20.69%
Probability of Other among male student is 13.79%
Probability of Retailing/Marketing among male student is 17.24%
Probability of Undecided among male student is 10.34%.
2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Using the contingency table of Gender and Major, we get the number of females
opting for each major.
P (major | female) = P (major ∩ female)/ P (female)
Sol: From the calculation done in python we conclude that:
Probability of Accounting among female student is 9.09%
Probability of CIS among female student is 9.09%
Probability of Economics/Finance among female student is 21.21%
Probability of International Business among female student is 12.12%
Probability of Management among female student is 12.12%
Probability of Other among female student is 9.09%
Probability of Retailing/Marketing among female student is 27.27%
Probability of Undecided among female student is 0%
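The conditional probabilities in 2.3.1 and 2.3.2 can be reproduced with a few lines of Python. The counts below are not taken from the notebook output; they are back-computed from the reported percentages (29 males and 33 females out of 62 students) and should be treated as an assumption.

```python
majors = [
    "Accounting", "CIS", "Economics/Finance", "International Business",
    "Management", "Other", "Retailing/Marketing", "Undecided",
]

# Assumed counts per major, inferred from the percentages reported above.
male_counts = [4, 1, 4, 2, 6, 4, 5, 3]
female_counts = [3, 3, 7, 4, 4, 3, 9, 0]


def conditional(counts):
    """P(major | gender) = count(major and gender) / count(gender)."""
    total = sum(counts)
    return {major: count / total for major, count in zip(majors, counts)}


p_major_given_male = conditional(male_counts)
p_major_given_female = conditional(female_counts)

# Marginal probability of being male, as in 2.2.1.
p_male = sum(male_counts) / (sum(male_counts) + sum(female_counts))

for major in majors:
    print(f"{major}: male {p_major_given_male[major]:.2%}, "
          f"female {p_major_given_female[major]:.2%}")
print(f"P(male) = {p_male:.4f}")
```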
2.4 Assume that the sample is a representative of the population of CMSU. Based
on the data, answer the following question:
2.4.1 Find the probability that a randomly chosen student is a male and
intends to graduate.
2.5 Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.5.1 Find the probability that a randomly chosen student is a male or has
full-time employment?
2.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a
2x2 table. Do you think the graduate intention and being female are
independent events?
For two events A and B to be independent, the following condition must hold:
P (A∩B) = P (A) * P (B)
To check whether graduate intention and being female are independent
events, we need to check whether:
P (Female ∩ Grad intention Yes) = P (Female) * P (Grad intention Yes)
P (Grad Intention Yes) = 28/40 = 0.7
P (Grad Intention Yes | female) = 11 / 20 = 0.55
P (Female) = 20/40 = 0.5, so P (Female) * P (Grad intention Yes) = 0.5 * 0.7 = 0.35
P (Female ∩ Grad intention Yes) = 11/40 = 0.275
Sol: From the calculation done in python we conclude that:
P (Female∩ Grad intention Yes) ≠ P (Female) * P (Grad intention Yes)
Hence, in this sample, graduate intention and being female are not independent events.
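The check can be written out directly from the counts quoted above (40 students in the 2x2 table, 20 of them female, 28 intending to graduate, 11 both). This is a plain arithmetic sketch; the variable names are illustrative.

```python
import math

n_total = 40        # students in the 2x2 table (Undecided excluded)
n_female = 20       # females in the table
n_yes = 28          # students with graduate intention "Yes"
n_female_yes = 11   # females with graduate intention "Yes"

p_female = n_female / n_total
p_yes = n_yes / n_total
p_joint = n_female_yes / n_total   # P(Female ∩ Yes)
p_product = p_female * p_yes       # value expected under independence

independent = math.isclose(p_joint, p_product)

print(f"P(Female ∩ Yes) = {p_joint:.3f}")
print(f"P(Female) * P(Yes) = {p_product:.3f}")
print("Independent:", independent)
```

Exact equality rarely holds in sample data, so this comparison describes the sample only.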
2.7 Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages. Answer the following questions based
on the data:
2.7.1 If a student is chosen randomly, what is the probability that his/her
GPA is less than 3?
GPA is a continuous variable, whereas the Poisson distribution is a discrete
distribution for counts, so the Poisson calculation from the Python notebook
given here is only a rough approximation; the share of students in the sample
with GPA below 3 is the more direct estimate.
Method 1:
To calculate the probability that GPA is less than 3 under this approximation,
we add the Poisson probabilities of 0, 1 and 2.
Mean of GPA (m) = 3.13
P (GPA is less than 3) = P (GPA is 0) + P (GPA is 1) + P (GPA is 2)
Sol: From the calculation done in python we conclude that:
stats.poisson.pmf(0,m)+stats.poisson.pmf(1,m)+stats.poisson.pmf(2,m)
Under the Poisson approximation, the probability that GPA is less than 3 is
0.394703 or 39.47%.
Method 2:
Instead of adding the probabilities of 0, 1 and 2, we can use the cdf
(Cumulative Distribution Function) of the Poisson distribution:
stats.poisson.cdf(2, 3.13)
This gives the same result: 0.394703 or 39.47%.
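The two ways of looking at this probability can be compared in a short sketch. The Poisson part reproduces the 0.3947 figure with the standard library only; the empirical part uses a made-up list of GPA values purely for illustration, since the survey records are not reproduced here.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable, summed term by term."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


# Reproduces the notebook figure: P(X <= 2) with mean 3.13.
p_poisson = poisson_cdf(2, 3.13)

# Hypothetical GPA values (NOT the survey data), to show the empirical approach:
# the share of observations strictly below 3.
gpas = [2.6, 2.9, 3.0, 3.1, 3.2, 3.4, 3.5, 3.7, 2.8, 3.3]
p_empirical = sum(g < 3 for g in gpas) / len(gpas)

print(f"Poisson approximation: {p_poisson:.6f}")
print(f"Empirical proportion (toy data): {p_empirical:.2f}")
```

With the real data, the empirical line would be the count of students with GPA below 3 divided by 62.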
2.7.2 Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected
female earns 50 or more.
Method 1: Using contingency table (calculation in python)
2.8 Note that there are four numerical (continuous) variables in the data set, GPA
Salary, Spending, and Text Messages. For each of them comment whether
they follow a normal distribution. Write a note summarizing your conclusions.
Using distplot, we can judge from the plots whether the 4 numerical/continuous
variables (GPA, Salary, Spending, and Text Messages) are normally distributed.
From the distplots for GPA, Salary, Spending and Text Messages, we can see that:
‘GPA’ is approximately Normally Distributed with a slight left skew.
‘Salary’ is also approximately Normally Distributed with a slight right skew.
‘Spending’ is not Normally Distributed and is highly Right Skewed.
‘Text Messages’ is not Normally Distributed and is highly Right Skewed.
Skewness of the variables, calculated in python:
GPA -0.314600
Salary 0.534701
Spending 1.585915
Text Messages 1.295808
GPA has very little skewness; it is left skewed, hence the value is negative.
Salary also has little skewness; it is right skewed, hence the value is positive.
Spending is highly Right Skewed.
Text Messages is highly Right Skewed.
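How the sign of the skewness relates to the shape can be seen in a small sketch. It computes the bias-adjusted sample skewness in plain Python on made-up lists; whether this is exactly the formula the notebook used is an assumption.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness (adjusted Fisher-Pearson coefficient)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical samples, chosen only to show the sign of the statistic.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]
left_tailed = [-10, -3, -2, -2, -1, -1]

print(sample_skewness(symmetric))     # zero for a symmetric sample
print(sample_skewness(right_tailed))  # positive: long tail to the right
print(sample_skewness(left_tailed))   # negative: long tail to the left
```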
Problem 3 –
Problem Statement:
An important quality characteristic used by the manufacturers of ABC asphalt shingles is
the amount of moisture the shingles contain when they are packaged. Customers may feel
that they have purchased a product lacking in quality if they find moisture and wet shingles
inside the packaging. In some cases, excessive moisture can cause the granules attached
to the shingles for texture and coloring purposes to fall off the shingles resulting in
appearance problems. To monitor the amount of moisture present, the company conducts
moisture tests. A shingle is weighed and then dried. The shingle is then reweighed, and
based on the amount of moisture taken out of the product, the pounds of moisture per 100
square feet are calculated. The company would like to show that the mean moisture
content is less than 0.35 pounds per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet)
for ‘A’ shingles and 31 for ‘B’ shingles.
Solution:
After importing all the libraries into the Jupyter notebook, load the
A & B shingles.csv file and complete the basic EDA:
3.1 Do you think there is evidence that means moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly
showing all steps.
Shingles A:
For Shingles A, formulate the null and alternate hypotheses for the mean moisture
content in pounds per 100 square feet. The company wants to show that the mean is
less than 0.35, so that claim is the alternate hypothesis:
H0: mean moisture content >= 0.35
HA: mean moisture content < 0.35
Level of Significance = 0.05
The sample size is known, but the population standard deviation is not.
Because the population standard deviation is unknown and we are testing a single
sample against a fixed value, we run a one-sample t-test.
From the python calculation:
t_stat, p_value = ttest_1samp(df.A, 0.35)
print('The T statistic is: {0}\n''The corresponding pvalue is :{1}'.format(t_stat, p_value/2))
ttest_1samp returns a two-sided p-value by default. Since we are running a
one-sided (left-tailed) test and the t statistic is negative, the p-value is divided by 2.
One sampled t-test, p-value= 0.07477
The T statistic is: -1.4735046253382782
The corresponding pvalue is: 0.07477633144907513
T test, p-value (0.075) > Level of significance (0.05)
We fail to reject the null hypothesis, which means there is not enough evidence to
conclude that the mean moisture content for A Shingles is less than 0.35 pounds
per 100 sq feet.
Shingles B:
For Shingles B, formulate the null and alternate hypotheses for the mean moisture
content in pounds per 100 square feet, again with the claim to be shown as the
alternate hypothesis:
H0: mean moisture content >= 0.35
HA: mean moisture content < 0.35
Level of Significance = 0.05
Again the sample size is known, but the population standard deviation is not.
Because the population standard deviation is unknown and we are testing a single
sample against a fixed value, we run a one-sample t-test.
From the python calculation:
t_stat, p_value = ttest_1samp(df.B, 0.35, nan_policy= "omit")
print('The T statistic is: {0}\n''The corresponding pvalue is :{1}'.format(t_stat, p_value/2))
ttest_1samp returns a two-sided p-value by default. Since we are running a
one-sided (left-tailed) test and the t statistic is negative, the p-value is divided by 2.
One sampled t-test, p-value= 0.0020904774003191826
The T statistic is: -3.1003313069986995
The corresponding pvalue is: 0.0020904774003191826
T test, p-value (0.002) < Level of significance (0.05)
We reject the null hypothesis: there is evidence to conclude that the mean
moisture content of Shingles B is less than 0.35 pounds per 100 sq feet.
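Dividing a two-sided p-value by 2 is only valid when the t statistic points in the direction of the alternate hypothesis; otherwise the one-sided p-value is 1 minus that half. The helper below is a hypothetical sketch of that rule (its name and signature are not from the notebook), applied to the figures reported above.

```python
def one_sided_p(t_stat, p_two_sided, alternative):
    """Convert a two-sided t-test p-value into a one-sided one.

    alternative: "less" (HA: mean < target) or "greater" (HA: mean > target).
    """
    half = p_two_sided / 2
    if alternative == "less":
        return half if t_stat < 0 else 1 - half
    if alternative == "greater":
        return half if t_stat > 0 else 1 - half
    raise ValueError("alternative must be 'less' or 'greater'")


# t statistics and halved p-values as reported above; doubling recovers
# the two-sided p-values that ttest_1samp returned.
t_a, p_a_two_sided = -1.4735046253382782, 2 * 0.07477633144907513
t_b, p_b_two_sided = -3.1003313069986995, 2 * 0.0020904774003191826

for name, t, p in [("A", t_a, p_a_two_sided), ("B", t_b, p_b_two_sided)]:
    print(name,
          "HA: mean < 0.35 ->", round(one_sided_p(t, p, "less"), 5),
          "| HA: mean > 0.35 ->", round(one_sided_p(t, p, "greater"), 5))
```

Both t statistics are negative, so the halved p-values (0.0748 and 0.0021) belong to the left-tailed alternative, mean < 0.35; for the right-tailed alternative they would be about 0.925 and 0.998. Recent SciPy versions also accept an `alternative` argument in `ttest_1samp`, which avoids the manual conversion.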
3.2 Do you think that the population mean for shingles A and B are equal? Form
the hypothesis and conduct the test of the hypothesis. What assumption do
you need to check before the test for equality of means is performed?
In testing whether the mean for shingles A is same as the shingles B, we need to
formulate the hypothesis:
H0: mean moisture content of Shingles A = mean moisture content of
Shingles B
HA: mean moisture content of Shingles A ≠ mean moisture content of
Shingles B
Mathematically it can also be written as:
H0: μA - μB = 0 or μA = μB
HA: μA - μB ≠ 0 or μA ≠ μB
Level of Significance = 0.05
We now have 2 samples, shingles A and shingles B, and the population standard
deviations are still not known.
The samples are not large and the two samples are independent of each other.
So we will use the t distribution and the t test statistic for a two-sample
unpaired test.
We use scipy.stats.ttest_ind to calculate the t-test for the means of TWO
INDEPENDENT samples. This function returns the t statistic and the two-tailed p value.
From the python calculation:
t_stat,p_value = ttest_ind(df['A'],df['B'],nan_policy='omit')
print('The T statistic is',t_stat)
print('The corresponding pvalue is',p_value)
ttest_ind returns a two-sided p-value, which matches the two-sided alternate
hypothesis, so it is used as is.
Two sampled t-test, p-value= 0.2017496571835306
The T statistic is 1.2896282719661123
The corresponding pvalue is 0.2017496571835306
T test, p-value (0.2017) > Level of significance (0.05)
We fail to reject the null hypothesis, which means there is not enough evidence to
conclude that the mean moisture content for Shingles A differs from the mean
moisture content for Shingles B.
Therefore, the data are consistent with the population means for Shingles A and B
being equal.
Assumptions: when running a two-sample t-test, the basic assumptions are that the
distributions of the two populations are normal, and that the variances of the two
distributions are the same. If those assumptions are not likely to be met, another
testing procedure could be used.
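One possible way to check these assumptions before the test is sketched below. It uses the Shapiro-Wilk test for normality and Levene's test for equal variances from scipy.stats; these particular tests are a suggestion, not what the notebook used, and the two samples are synthetic stand-ins generated for illustration (sizes 36 and 31 as in the problem statement).

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the A and B moisture measurements (NOT the real data).
rng = np.random.default_rng(0)
a = rng.normal(loc=0.32, scale=0.13, size=36)
b = rng.normal(loc=0.27, scale=0.14, size=31)

# Normality of each sample: a small p-value suggests a departure from normality.
shapiro_a = stats.shapiro(a)
shapiro_b = stats.shapiro(b)

# Equality of variances: a small p-value suggests unequal variances.
levene = stats.levene(a, b)
equal_var = bool(levene.pvalue > 0.05)

# Pooled t-test if variances look equal, Welch's t-test otherwise.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)

print("Shapiro p-values:", shapiro_a.pvalue, shapiro_b.pvalue)
print("Levene p-value:", levene.pvalue, "-> equal_var =", equal_var)
print("t statistic:", t_stat, "p-value:", p_value)
```

Setting `equal_var=False` switches `ttest_ind` to Welch's t-test, which is one of the alternative procedures available when the equal-variance assumption is doubtful.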
THANK YOU!!