SMDM Project
DSBA
Contents
Problem 1
1.1. Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the most?
Which Region and which Channel spent the least?
1.2. There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.
1.3. On the basis of the descriptive measure of variability, which item shows the most inconsistent behaviour?
Which items shows the least inconsistent behaviour?
1.4. Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of
detailed comments.
1.5. On the basis of your analysis, what are your recommendations for the business? How can your analysis
help the business to solve its problem? Answer from the business perspective.
Problem 2
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer
2.2. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the
following questions:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.2.2. What is the probability that a randomly selected CMSU student will be female?
2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU
2.3.2. Find the conditional probability of different majors among the female students of CMSU
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the
following question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate
2.4.2. Find the probability that a randomly selected student is a female and does NOT have a laptop
2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring
in international business or management
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided
students are not considered now and the table is a 2x2 table. Do you think the graduate intention and being
female are independent events?
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the
conditional probability that a randomly selected female earns 50 or more
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing
your conclusions
Problem 3
3.1 Do you think there is evidence that means moisture contents in both types of shingles are within the
permissible limits? State your conclusions clearly showing all steps.
3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct
the test of the hypothesis. What assumption do you need to check before the test for equality of means is
performed?
Problem 1 Wholesale Customers Analysis
A wholesale distributor operating in different regions of Portugal has information on the annual spending on
several items in its stores across different regions and channels. The data consist of 440 large retailers’
annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across
different sales channels (Hotel, Retail).
This data set has 440 rows and 9 columns. It refers to customers of a wholesale distributor. It involves the
annual spending in monetary units (m.u.) on different product categories. The following data dictionary gives
more details on this data set:
Description of variables is as follows:
Region     Frequency
Lisbon     77 rows
Oporto     47 rows
Other      316 rows
Total      440 rows
Channel    Frequency
Hotel      298 rows
Retail     142 rows
Total      440 rows
Our project goal is to analyse the data and answer the questions asked. Thus, there is no outcome to be
predicted, and the EDA just tries to find patterns in the data.
1.1 Use methods of descriptive statistics to summarize data.
(a) Which Region and which Channel spent the most?
(b) Which Region and which Channel spent the least?
Using the describe function in Python, we first looked at the basic descriptive statistics of the dataset.
Sample of the Data:
Tab 1.1.1
Exploratory Analysis of the Data:
Tab 1.1.2
Descriptive Statistics of the Data:
Tab 1.1.3
Tab 1.1.4
Table 1.1.4 shows the total spend across all categories by Region, where
1. Other has spent 10,677,599
2. Lisbon has spent 2,386,813
3. Oporto has spent 1,555,088
Fig 1.1
Fig 1.1 is a bar graph of the Region totals in Table 1.1.4.
Tab 1.1.5
Table 1.1.5 shows the total spend across all categories by Channel, where
1. Hotel has spent 7,999,569
2. Retail has spent 6,619,931
Fig 1.2
Fig 1.2 is a bar graph of the Channel totals in Table 1.1.5.
Fig 1.3
Fig 1.3 shows the spend on each category in the different regions through the two channels.
From the above data we can conclude that, among Regions, Other has spent the most and Oporto has spent the
least, whereas among Channels, Hotel has spent the most and Retail has spent the least.
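Because every customer belongs to exactly one Region and one Channel, the Region totals and the Channel totals must add up to the same grand total. The sketch below is a minimal cross-check of that, using only the Channel totals and the Lisbon and Oporto totals quoted above. The variable names are illustrative, and this is not the notebook code used for the report.

```python
# Totals quoted in the report (monetary units).
channel_totals = {"Hotel": 7_999_569, "Retail": 6_619_931}
region_totals = {"Lisbon": 2_386_813, "Oporto": 1_555_088}

# Every customer has one Region and one Channel, so both breakdowns
# share the same grand total.
grand_total = sum(channel_totals.values())

# The remaining Region ("Other") is whatever is left of the grand total.
region_totals["Other"] = grand_total - sum(region_totals.values())

highest_region = max(region_totals, key=region_totals.get)
lowest_region = min(region_totals, key=region_totals.get)
highest_channel = max(channel_totals, key=channel_totals.get)
lowest_channel = min(channel_totals, key=channel_totals.get)

print(f"Grand total: {grand_total:,}")
print(f"Implied total for Other: {region_totals['Other']:,}")
print(f"Region: highest = {highest_region}, lowest = {lowest_region}")
print(f"Channel: highest = {highest_channel}, lowest = {lowest_channel}")
```

The same ranking would normally come straight from a grouped sum over the raw data; the check here only confirms that the quoted totals are mutually consistent.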
1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.
Tab1.1.6
Measure of Central Tendency - Mean, Median, Mode
Measure of Dispersion - Range, IQR, Standard Deviation
From Tab 1.1.3 & Tab 1.1.6, we can infer the following:
Channel has two unique values, with "Hotel" the most frequent at 298 out of 440 customers, i.e.
67.7 percent of the customers belong to the "Hotel" channel.
Region has three unique values, with "Other" the most frequent at 316 out of 440 customers,
i.e. 71.8 percent of the customers belong to the "Other" region.
Fresh item (440 count),
has a mean of 12000.3, standard deviation of 12647.3, with min value of 3 and max value of 112151.
Q1 (25%) is 3127.75, Q2 (50%) is 8504 and Q3 (75%) is 16933.8.
Range = max - min = 112151 - 3 = 112,148 and IQR = Q3 - Q1 = 16933.8 - 3127.75 = 13,806.05 (the IQR is
helpful in identifying outliers through the 1.5 IQR lower/upper limits).
Milk item,
Q1 (25%) is 1533, Q2 (50%) is 3627 and Q3 (75%) is 7190.25.
Range = max - min = 73498 - 55 = 73,443 and IQR = Q3 - Q1 = 7190.25 - 1533 = 5,657.25.
Grocery item,
Q1 (25%) is 2153, Q2 (50%) is 4755.5 and Q3 (75%) is 10655.8.
Frozen item,
has a mean of 3071.93, standard deviation of 4854.67, with min value of 25 and max value of 60869.
Q1 (25%) is 742.25, Q2 (50%) is 1526 and Q3 (75%) is 3554.25.
Detergents_Paper item,
Q1 (25%) is 256.75, Q2 (50%) is 816.5 and Q3 (75%) is 3922.
Delicatessen item,
Q1 (25%) is 408.25, Q2 (50%) is 965.5 and Q3 (75%) is 1820.25.
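The 1.5 IQR lower/upper limits mentioned above can be worked out directly from the quartiles. The sketch below does this for Fresh, using the Q1 and Q3 values quoted in the text. The helper names `iqr_fences` and `is_outlier` are illustrative and not taken from the report's notebook.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (iqr, lower_limit, upper_limit) for the k * IQR outlier rule."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


def is_outlier(value, lower, upper):
    """A value is flagged when it falls outside the two limits."""
    return value < lower or value > upper


# Quartiles of Fresh as quoted in the report.
fresh_q1, fresh_q3 = 3127.75, 16933.8
fresh_iqr, fresh_lower, fresh_upper = iqr_fences(fresh_q1, fresh_q3)

print(f"IQR = {fresh_iqr:.2f}")
print(f"Lower limit = {fresh_lower:.3f}, upper limit = {fresh_upper:.3f}")

# The reported maximum (112151) and median (8504) of Fresh as test values.
print("Max is an outlier:", is_outlier(112151, fresh_lower, fresh_upper))
print("Median is an outlier:", is_outlier(8504, fresh_lower, fresh_upper))
```

Since spending cannot be negative, a negative lower limit means that only the upper limit can flag outliers for this item.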
Visual representation of all varieties in the form of histogram across Region and Channel.
Fig 1.4
From Fig 1.4 we can conclude that the data are right skewed and that all varieties show similar behaviour
across Region and Channel.
1.3 On the basis of the descriptive measure of variability, which item shows the most inconsistent behaviour?
Which items shows the least inconsistent behaviour?
We have calculated the Variance and the Coefficient of Variation of all the varieties.
Tab 1.1.7 shows the Variance and the Coefficient of Variation of all the varieties.
The items are compared on the basis of the Coefficient of Variation, where a higher value indicates more
inconsistent spending.
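The coefficient of variation is the standard deviation divided by the mean, which makes the spread of items with very different spending levels comparable. The sketch below applies it to the two items for which the text quotes both a mean and a standard deviation. Labelling the second pair of values as Frozen is an assumption based on the order of the items; the figures for the other items are in Tab 1.1.7 and are not reproduced here.

```python
def coefficient_of_variation(mean, std):
    """Standard deviation expressed as a fraction of the mean."""
    return std / mean


# (mean, standard deviation) pairs quoted in the descriptive statistics.
# Assumption: the second pair belongs to the Frozen item.
summary = {
    "Fresh": (12000.3, 12647.3),
    "Frozen": (3071.93, 4854.67),
}

cv = {item: coefficient_of_variation(m, s) for item, (m, s) in summary.items()}

# Print from least to most variable relative to the mean.
for item, value in sorted(cv.items(), key=lambda pair: pair[1]):
    print(f"{item}: CV = {value:.3f}")
```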
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of
detailed comments.
A boxplot is used to look for outliers:
In Fig 1.5 the black points are the outliers in the boxplot.
Fig 1.5
Fig 1.6
Yes, there are outliers in all the items across the product range (Fresh, Milk, Grocery, Frozen,
Detergents_Paper & Delicatessen).
1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis
help the business to solve its problem? Answer from the business perspective.
As per the analysis, there are inconsistencies in the spending on different items (measured by the coefficient
of variation), which should be reduced. Spending differs widely between the Hotel and Retail channels and
between the regions; the business should aim to make it more even. More focus should be given to items
other than Fresh & Grocery.
Tab2.1.1
2.1.1. Gender and Major
Tab2.1.2
2.1.2. Gender and Grad Intention
Tab2.1.3
Tab2.1.4
Tab2.1.5
2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
From the contingency tables created, it can be seen that:
Total No of Students = 62
Total No of Male = 29
Probability a randomly selected student will be male = Total No of Male / Total No of Students = 29/62
Hence from the calculations done in Python we conclude that:
The probability that a randomly selected CMSU student will be male is 46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be female?
From the contingency tables created, it can be seen that:
Total No of Students = 62
Total No of Female = 33
Probability a randomly selected student will be female = Total No of Female / Total No of Students = 33/62
The probability that a randomly selected CMSU student will be female is 53.23 %
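The two marginal probabilities above are simple ratios of counts. A minimal sketch using the counts quoted in the text follows; the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the report.
n_students = 62
n_male = 29
n_female = 33

# Marginal probability = count in the category / total count.
p_male = Fraction(n_male, n_students)
p_female = Fraction(n_female, n_students)

print(f"P(Male)   = {p_male} = {float(p_male):.2%}")
print(f"P(Female) = {p_female} = {float(p_female):.2%}")

# Male and Female are the only two categories, so the two must sum to 1.
print("Sum of the two probabilities:", p_male + p_female)
```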
2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
Contingency table For Gender and Major :
Tab2.1.2
2.3.2 Find the conditional probability of different majors among the female students of CMSU.
Tab2.1.2
These conditional probabilities are read from the contingency tables created above.
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the following
question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.
Tab2.1.3
Probability that a randomly chosen student is a Male = 29/62
Probability that a Male intends to Graduate = 17/29
Probability a randomly chosen student is a male and intends to graduate
= Probability that a randomly chosen student is a Male * Probability that a Male intends to Graduate
= (29/62) * (17/29) = 17/62
The probability that a randomly chosen student is a male and intends to graduate is 27.42 %
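The joint probability above follows from the multiplication rule, P(A and B) = P(A) * P(B given A). A minimal sketch with the counts quoted in the text follows; the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the report.
p_male = Fraction(29, 62)             # P(Male)
p_grad_given_male = Fraction(17, 29)  # P(intends to graduate | Male)

# Multiplication rule: P(Male and Grad) = P(Male) * P(Grad | Male).
p_male_and_grad = p_male * p_grad_given_male

print(f"P(Male and intends to graduate) = {p_male_and_grad} "
      f"= {float(p_male_and_grad):.2%}")
```

The result reduces to 17/62, which is the count of males intending to graduate divided by the total number of students, so it can also be read directly from the contingency table.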
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.
Tab2.1.5
Probability that a randomly chosen student is a Female = 33/62
Probability of Female with No Laptop = 1 - (29/33) = 4/33
Probability that a randomly selected student is a female and does NOT have a laptop
= Probability that a randomly chosen student is a Female * Probability of Female with No Laptop
= (33/62) * (4/33) = 4/62
The probability that a randomly selected student is a female and does NOT have a laptop is 6.45 %
2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.5.1. Find the probability that a randomly chosen student is either a male or has full-time employment?
Tab2.1.4
The probability that a randomly chosen student is either a male or has full-time employment is 79.87 %
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.
Tab2.1.2
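Probability of international business given Female = 4/33
Probability of management given Female = 4/33
Since a student has only one major, the two events are mutually exclusive and their probabilities add:
4/33 + 4/33 = 8/33
The conditional probability that, given a female student is randomly chosen, she is majoring in international
business or management is 24.242 %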
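Because a student has a single major, the two conditional probabilities can simply be added. A minimal sketch with the counts quoted in the text follows; the dictionary and variable names are illustrative.

```python
from fractions import Fraction

# Counts among female students, as quoted in the report.
n_female = 33
major_counts_female = {"International Business": 4, "Management": 4}

# Mutually exclusive events: P(A or B | Female) = P(A | Female) + P(B | Female).
p_ib_or_mgmt_given_female = sum(
    Fraction(count, n_female) for count in major_counts_female.values()
)

print(f"P(IB or Management | Female) = {p_ib_or_mgmt_given_female} "
      f"= {float(p_ib_or_mgmt_given_female):.3%}")
```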
2X2 Contingency table of Gender and Intent to Graduate without considering the Undecided students
Tab2.1.3
Two events A and B are independent when they satisfy the condition:
P(A∩B) = P(A) * P(B)
In this case, whether being female and graduate intention are independent can be checked with the condition:
P(F∩Yes) = P(F) * P(Yes)
Hence from the calculations done in Python we conclude that: P(F∩Yes) ≠ P(F) * P(Yes)
Hence, graduate intention and being female are not independent events.
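The independence condition can be checked mechanically from the four counts of a 2x2 table. The sketch below shows the shape of that check. The counts passed in are placeholders for illustration only and are not taken from Tab 2.1.3; the function name `independence_check` is also hypothetical.

```python
from fractions import Fraction


def independence_check(n_a_and_b, n_a, n_b, n_total):
    """Compare P(A and B) with P(A) * P(B) using exact fractions.

    Returns (p_joint, p_product, is_independent).
    """
    p_a = Fraction(n_a, n_total)
    p_b = Fraction(n_b, n_total)
    p_joint = Fraction(n_a_and_b, n_total)
    p_product = p_a * p_b
    return p_joint, p_product, p_joint == p_product


# Placeholder counts (assumed, not from the report): 40 decided students,
# 20 of them female, 28 answering Yes, 11 female and answering Yes.
p_joint, p_product, independent = independence_check(
    n_a_and_b=11, n_a=20, n_b=28, n_total=40
)

print(f"P(F and Yes)   = {float(p_joint):.3f}")
print(f"P(F) * P(Yes)  = {float(p_product):.3f}")
print("Independent:", independent)
```

With sample data the two sides are rarely exactly equal, so a formal test of association such as a chi-square test is a common complement to this exact comparison.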
2.7. Note that there are four numerical (continuous) variables in the dataset, GPA, Salary, Spending, and
Text Messages. Answer the following questions based on the data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
Although GPA is a continuous variable, the probability that a student's GPA is less than 3 is calculated here
using the Poisson Distribution.
To calculate the probability of a GPA less than 3, we add the probabilities of GPA values 0, 1 and 2 obtained
from the Poisson Distribution.
If a student is chosen randomly, the probability that his/her GPA is less than 3 is 39.49%
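The Poisson calculation described above can be sketched as follows. For "less than 3" the sum runs over the values 0, 1 and 2. The rate (the sample mean GPA) is not stated in the text; the value 3.129 used here is an assumption, chosen because it reproduces the reported 39.49%. The function names are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events for rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_less_than(limit, lam):
    """P(X < limit), i.e. the sum of the pmf over 0 .. limit - 1."""
    return sum(poisson_pmf(k, lam) for k in range(limit))


# Assumption: sample mean GPA used as the Poisson rate (not given in the text).
ASSUMED_MEAN_GPA = 3.129

p_gpa_below_3 = poisson_less_than(3, ASSUMED_MEAN_GPA)
print(f"P(GPA < 3) under the Poisson model = {p_gpa_below_3:.2%}")
```

A Poisson model only assigns probability to whole numbers, so this treats GPA as a count. Counting the share of students in the sample with a GPA below 3 is a direct alternative for a continuous variable.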
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional
probability that a randomly selected female earns 50 or more.
(a) Conditional probability that a randomly selected male earns 50 or more:
Fig 2.1
The above distplot (Fig 2.1) shows the salary distribution of the male students.
As we can see, it is approximately normally distributed, hence the conditional probability that a randomly
selected male earns 50 or more can be calculated using the Normal distribution.
To calculate this, we calculate the cumulative probability for less than 50 using the Normal Distribution
and then subtract it from 1.
Fig 2.2
The above distplot (Fig 2.2) shows the salary distribution of the female students.
As we can see, it is approximately normally distributed, hence the conditional probability that a randomly
selected female earns 50 or more can be calculated using the Normal distribution.
To calculate this, we calculate the cumulative probability for less than 50 using the Normal Distribution
and then subtract it from 1.
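The "cumulative probability below 50, subtracted from 1" step can be sketched with the standard library's `statistics.NormalDist`. The mean and standard deviation below are hypothetical placeholders, since the salary means and standard deviations by gender are not stated in the text. The function name `prob_at_least` is illustrative.

```python
from statistics import NormalDist


def prob_at_least(threshold, mean, std):
    """P(X >= threshold) for a normal variable: 1 minus the CDF at threshold."""
    return 1 - NormalDist(mu=mean, sigma=std).cdf(threshold)


# Hypothetical salary parameters for one gender (NOT from the report).
hypothetical_mean = 48.0
hypothetical_std = 12.0

p_earns_50_or_more = prob_at_least(50, hypothetical_mean, hypothetical_std)
print(f"P(Salary >= 50) = {p_earns_50_or_more:.2%}")
```

In use, the function would be called once with the male sample mean and standard deviation and once with the female ones.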
2.8. Note that there are four numerical (continuous) variables in the dataset, GPA, Salary, Spending, and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing
your conclusions.
Fig 2.3
From the above histograms Fig 2.3 for the continuous variables GPA, Salary, Spending and Text Messages we
can see that :
GPA is approximately normally distributed, with a slight skew to the left.
Salary is also approximately normally distributed, with a slight skew to the right.
Spending is not normally distributed and is highly right skewed.
Text Messages is not normally distributed and is highly right skewed.
Tab2.1.6
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31
for B shingles. This business report provides a detailed explanation of the approach to each problem given in
the assignment and provides relevant information with regard to solving the problem.
Tab3.1.1
3.1 Do you think there is evidence that means moisture contents in both types of shingles are within the
permissible limits? State your conclusions clearly showing all steps.
For the A shingles, the null and alternative hypotheses to test whether the population mean moisture content
is less than 0.35 pound per 100 square feet are given as:
H0 : mean moisture content of A ≥ 0.35
HA : mean moisture content of A < 0.35
We have one sample and we do not know the population standard deviation. The sample is not a large sample,
so we use the t distribution and the tSTAT test statistic. Since we are testing only sample A, we use the
one-sample t-test.
Also, since ttest_1samp in Python reports a two-sided p-value by default, the p-value is divided by 2 because
ours is a one-sided test.
We do not have enough evidence to reject the null hypothesis since p value > Level of significance
For the B shingles, the null and alternative hypotheses to test whether the population mean moisture content
is less than 0.35 pound per 100 square feet are given as:
H0 : mean moisture content of B ≥ 0.35
HA : mean moisture content of B < 0.35
We have one sample and we do not know the population standard deviation. The sample is not a large sample,
so we use the t distribution and the tSTAT test statistic. Since we are testing only sample B, we use the
one-sample t-test. Also, since ttest_1samp in Python reports a two-sided p-value by default, the p-value is
divided by 2 because ours is a one-sided test.
Hence from the calculations done in Python we conclude that:
Our one-sample t-test p-value = 0.0020904774003191826
We have evidence to reject the null hypothesis since p value < Level of significance
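The one-sample t-test steps described above can be sketched with the standard library alone. The sketch computes the tSTAT statistic and its degrees of freedom, and shows how a two-sided p-value (such as the one `ttest_1samp` reports by default) maps to a one-sided one. The measurements are hypothetical placeholders, not the shingles data, and the function names are illustrative.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t_stat, degrees_of_freedom) for H0: population mean = mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)  # sample standard deviation (n - 1)
    t_stat = (mean - mu0) / (std / math.sqrt(n))
    return t_stat, n - 1


def one_sided_p(p_two_sided, t_stat, alternative="less"):
    """Convert a two-sided p-value to a one-sided one.

    Halving is only correct when t_stat points in the direction of the
    alternative; otherwise the one-sided p-value is 1 - p/2.
    """
    in_direction = t_stat < 0 if alternative == "less" else t_stat > 0
    return p_two_sided / 2 if in_direction else 1 - p_two_sided / 2


# Hypothetical moisture measurements (NOT the report's data).
hypothetical_sample = [0.30, 0.32, 0.34]
t_stat, dof = one_sample_t(hypothetical_sample, mu0=0.35)
print(f"tSTAT = {t_stat:.3f} with {dof} degrees of freedom")
```

The check on the sign of tSTAT matters: dividing the two-sided p-value by 2 is valid here only because the sample means lie below 0.35, in the direction of the alternative hypothesis.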
3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct
the test of the hypothesis. What assumption do you need to check before the test for equality of means
is performed?
To perform a test of equality of the population means of the A shingles and B shingles, the null and
alternative hypotheses to test whether the population mean moisture content is equal are given as:
H0 : mean moisture content of A = mean moisture content of B
HA : mean moisture content of A ≠ mean moisture content of B
Level of significance: 0.05
We have two samples A and B and we do not know the population standard deviation.
The samples are not large samples, so we use the t distribution and the tSTAT test statistic. Since we are
testing for equality between sample A and B, we use the two-sample t-test.
We do not have enough evidence to reject the null hypothesis in favour of the alternative hypothesis since
p value > Level of significance
Therefore, there is no evidence that the population means for shingles A and B differ.
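The tSTAT statistic for the two-sample test can be sketched as below. The text does not say whether the variances were pooled, so the pooled (equal-variance) form is an assumption here. The measurements are hypothetical placeholders, not the shingles data, and the function name is illustrative.

```python
import math
import statistics


def pooled_two_sample_t(sample_a, sample_b):
    """Return (t_stat, degrees_of_freedom) for H0: mean_a = mean_b.

    Uses the pooled variance, which assumes both populations share
    the same variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_error
    return t_stat, dof


# Hypothetical measurements (NOT the report's data).
hypothetical_a = [1.0, 2.0, 3.0]
hypothetical_b = [2.0, 3.0, 4.0]
t_stat_ab, dof_ab = pooled_two_sample_t(hypothetical_a, hypothetical_b)
print(f"tSTAT = {t_stat_ab:.4f} with {dof_ab} degrees of freedom")
```

The two-sided p-value is then read from the t distribution with the returned degrees of freedom and compared with the 0.05 level of significance.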