Descriptive Analytics Assignment

The document provides a descriptive analytics assignment for a business analyst intern at an online retail furniture and office supplies company. The intern is asked to analyze a dataset of 51,290 transactions from 2013-2016 which includes variables like sales, customer information, products, and shipments. Specifically, the intern must [1] use visualization techniques to summarize variables and develop hypotheses, [2] test hypotheses using appropriate analytics, and [3] provide marketing strategy recommendations based on the results.
Descriptive Analytics Assignment

You have been assigned as a business analyst for a leading online retail company specializing in furniture and office supplies. A dataset of
51,290 transactions from 2013 to 2016 has been given to you. The dataset contains the columns Row ID, Order ID, Order Date, Ship Date,
Ship Mode, Customer ID, Customer Name, Segment, Postal Code, City, State, Country, Region, Market, Product ID, Category, Sub-Category,
Product Name, Sales, Quantity, Discount, Profit, Shipping Cost, Order Priority. Use the dataset to derive marketing strategy insights, and make a
report by answering the following questions:
1. Use appropriate visualization techniques to summarize each of the variables and to construct hypothesis statements. Develop at least five
hypothesis statements.
2. Use appropriate analytic techniques to test the hypothesis statements. Explain and interpret your results in detail.
3. Provide recommendations to the company relating to marketing strategies that it should adopt based on the results of the analysis.

Descriptive Statistics Analysis

Variable Data Type

Nominal:  Row ID, Order ID, Customer ID, Customer Name, Segment, Postal Code, City, State, Country, Region, Market, Product ID, Category, Sub-Category, Product Name
Ordinal:  Ship Mode, Order Priority
Interval: Order Date, Ship Date
Ratio:    Sales, Discount, Profit, Shipping Cost, Quantity
10% of the whole data is taken as a sample (random rows selected): 5129 rows.

Sales (with outliers)

Statistic                  Value
Mean                       251.86
Standard Error             7.94
Median                     83.4
Mode                       12.96
Standard Deviation         568.6949458
Sample Variance            323413.94
Kurtosis                   482.42
Skewness                   15
Range                      22637.58
Minimum                    0.90
Maximum                    22638.48
Sum                        1291803.72
Count                      5129
Confidence Level (95.0%)   15.57
Q1                         30.96
Q3                         246.42
IQR                        215.46
Lower Bound                -292.23
Upper Bound                569.61
Mean of z-score            0
Q2-Q1                      52.44
Q3-Q2                      163.02
Upper Whisker              22638.48
Lower Whisker              0.90
Wupper-Q3                  22392.06
Q1-Wlower                  30.06
Upper Outliers Count       564
Lower Outliers Count       0
Max for Outlier            22638.48
Min for Outlier            #N/A

[Figure: box plot of Sales, with outliers]
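
The Lower Bound and Upper Bound rows in the table can be reproduced from Q1 and Q3. The sketch below assumes the usual box-plot rule of 1.5 times the IQR, which the report does not state explicitly; the function name `iqr_fences` is illustrative.

```python
# Values copied from the summary table above.
Q1 = 30.96
Q3 = 246.42
SAMPLE_SIZE = 5129
UPPER_OUTLIERS = 564


def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences; k=1.5 is an assumed convention."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_fences(Q1, Q3)
outlier_share = UPPER_OUTLIERS / SAMPLE_SIZE

print(f"IQR           = {Q3 - Q1:.2f}")
print(f"Lower bound   = {lower:.2f}")
print(f"Upper bound   = {upper:.2f}")
print(f"Outlier share = {outlier_share:.1%}")
```

Because the lower fence is negative and the minimum sale is 0.90, no lower outliers are possible, which agrees with the Lower Outliers Count of 0.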
Because of the large number of outliers (11% of the sample), clearly visible in the box plot, the median is a better measure of central tendency for this data.
1. The mean, median and mode differ, so the distribution is skewed.
2. The standard deviation, 568.7, is high for a sample of this size.
3. After the removal of outliers, the data set is closer to a normal distribution.
4. Kurtosis (normal range: -3 to 3) before the removal of outliers is far greater than 3, which means the distribution is sharply peaked with heavy tails.
Hence statistics that assume a normal distribution are not applicable.
5. Skewness (normal range: -1 to 1) of the overall Sales data and of each of the readings is more than 14, so the data is highly positively skewed,
with a tail to the right.
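
Skewness and kurtosis of the kind discussed above can be computed from central moments. This is a minimal sketch using population moments; Excel's SKEW and KURT functions apply small-sample corrections, so their values will differ slightly. The data in the example is hypothetical.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical sales values: one large order drags the tail to the right.
skew, kurt = skewness_and_excess_kurtosis([1.0, 2.0, 3.0, 4.0, 10.0])
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```

A positive skewness indicates a right tail, as reported for Sales.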

Sales vs Count
Conclusion
A sample of 5129 records was taken for analysis.
With the outliers included, the Sales mean, standard error, median, standard deviation, sample variance, kurtosis and skewness are all inflated.
These inflated values were distorting the overall statistics of the sample as well.
Box plots showed the outliers for each key variable. The outliers were then removed
using z-score computation.
This resulted in a better-quality data set.
The statistics were recomputed after the removal of outliers and are compared below:

Statistic   Before Removal of Outliers   After Removal of Outliers
Mean        251.86                       119.71
SD          568.69                       128.74
Mode        12.96                        12.96
Min         0.90                         0.90
Max         22638.48                     568.47
Range       0.898 - 22638.48             0.898 - 568.47
Sum         1291803.72                   546491.90
Variance    323413.94                    16574.92
Q1          30.96                        27.60
Q3          246.42                       166.24
IQR         215.46                       138.64
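
A z-score filter of the kind mentioned above can be sketched as follows. The cutoff of 3 standard deviations is an assumption, since the report does not state the threshold it used, and `remove_zscore_outliers` is a hypothetical helper name.

```python
from statistics import fmean, stdev


def remove_zscore_outliers(values, threshold=3.0):
    """Keep values whose absolute z-score is at most `threshold` (assumed 3)."""
    mean = fmean(values)
    sd = stdev(values)  # sample standard deviation
    if sd == 0:
        return list(values)
    return [x for x in values if abs((x - mean) / sd) <= threshold]


# Hypothetical data: twenty ordinary orders and one extreme order.
sales = [10.0] * 20 + [1000.0]
cleaned = remove_zscore_outliers(sales)
print(len(sales), "->", len(cleaned))
```

Recomputing the summary statistics on the cleaned list gives a before and after comparison like the one in the table.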
Sales vs Profit
Scatter plot of Sales and Profit: Profit and Sales are positively correlated. As sales increase, profit increases for each
sub-category.

[Figure: Sales vs Profit scatter plots, one panel per sub-category: Accessories, Appliances, Art, Binders, Bookcases, Chairs, Copiers, Envelopes, Fasteners, Furnishings, Labels, Machines, Paper, Phones, Storage, Supplies, Tables]
Sales vs Discount
Scatter plot of Sales and Discount: Sales appears to be independent of Discount. Sales do not increase as the
percentage of discount given increases.

[Figure: Sales vs Discount scatter plots, one panel per sub-category: Accessories, Appliances, Art, Binders, Bookcases, Chairs, Copiers, Envelopes, Fasteners, Furnishings, Labels, Machines, Paper, Phones, Storage, Supplies, Tables]
Histogram of Sales (before removing outliers): the data is heavily skewed to the right.

Histogram of Sales (after removing outliers): the data is still skewed to the right, but it is more evenly distributed.
Histogram of Profit (before removing outliers): the data appears to be normally distributed.

Histogram of Profit (after removing outliers): there are fewer extreme values once the outliers are removed.
Bar graph of Sales vs Sub-Category: Chairs, Bookcases and Phones appear to be in greater demand than other items.

Stacked bar chart of Sub-Category vs Shipping Mode: First Class shipment is most in demand for Art, Binders and Storage items.
Shipping Mode vs Items: First Class or Same Day shipping is used mainly for orders below 1000 because of the cost.

Order Priority vs Segment: the Consumer segment is given more priority for Critical order shipments.
Box plot of Category vs Sales (before removing outliers): a few outliers are visible in the Technology and Office Supplies categories.
Box plot of Sales vs Category (after removing outliers): there is a significant difference in the data, and most of the extreme values have been removed.
Sub-Category vs Profit: Phones, Chairs and Copiers show the most profit. Some extreme profit values also appear in sub-categories such as
Binders, Phones and Copiers because of the outliers present.
Sub-Category vs Profit (after removal of outliers): there is a significant decrease in the extreme or outlier values. For Tables, we see almost no
outliers or extreme values.
Correlogram of the sample data: there is a strong positive correlation between Sales and Shipping Cost, and a strong negative correlation between
Discount and Profit.
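
Each cell of a correlogram is a Pearson correlation coefficient between two columns. The sketch below shows the calculation for one pair; the function name and the four-row data set are hypothetical and are not taken from the report's sample.

```python
from math import sqrt
from statistics import fmean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / sqrt(sxx * syy)


# Hypothetical rows: shipping cost rises with sales, profit falls with discount.
sales = [100.0, 200.0, 300.0, 400.0]
shipping_cost = [12.0, 25.0, 31.0, 48.0]
discount = [0.0, 0.1, 0.2, 0.4]
profit = [40.0, 30.0, 10.0, -20.0]

print("Sales vs Shipping Cost:", round(pearson_r(sales, shipping_cost), 3))
print("Discount vs Profit:    ", round(pearson_r(discount, profit), 3))
```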
Model
The model is a multiple linear regression built in Excel, assuming a significance level of alpha = 0.05. Based on the coefficient of determination (R
square), 92% of the variation in Sales is explained by the independent variables used in the model. The model is a good fit.

Hypothesis statements
1.) Mean Sales across all regions is equal
2.) Mean Shipping Cost is equal across all Regions
3.) Mean Profit is equal across all Regions
4.) Profit is independent of Discount, Sales and Quantity
5.) Shipping Cost is independent of Country and SubCategory

Hypothesis 1: Mean Sales across all regions is equal

#H0 - Mean sales of all regions are equal (retained if P > 0.05)
#H1 - Mean sales across all regions are not equal (accepted if P <= 0.05)

ANOVA
Source of Variation   SS            df     MS            F             P-value       F crit
Between Groups        27795668.7    22     1263439.486   3.817474282   4.34292E-09   1.544168019
Within Groups         1654148813    4998   330962.1475
Total                 1681944482    5020

As the P value is 4.34E-09, which is less than 0.05, we reject the null hypothesis and accept the alternative hypothesis: mean sales differ across regions.
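
The F statistic in the ANOVA table is the ratio of the between-group to the within-group mean square. The sketch below computes it from raw groups and then rechecks the Hypothesis 1 table from its printed SS and df values. The three toy regions are hypothetical, and `one_way_anova` is an illustrative helper, not the Excel routine the report used.

```python
from statistics import fmean


def one_way_anova(groups):
    """Return (ss_between, df_between, ss_within, df_within, f_stat)."""
    all_values = [x for group in groups for x in group]
    grand_mean = fmean(all_values)
    group_means = [fmean(group) for group in groups]

    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f_stat


# Hypothetical sales for three regions (illustration only).
toy_regions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
toy_result = one_way_anova(toy_regions)

# Recompute F for Hypothesis 1 from the SS and df printed in the table.
table_f = (27795668.7 / 22) / (1654148813 / 4998)
F_CRIT = 1.544168019  # F crit from the same table
reject_h0 = table_f > F_CRIT

print("toy F   =", round(toy_result[4], 3))
print("table F =", round(table_f, 4), "reject H0:", reject_h0)
```

Comparing F with F crit gives the same decision as comparing the P value with 0.05.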

Hypothesis 2: Mean Shipping Cost is equal across all Regions

#H0 - Mean Shipping Cost is equal across all Regions (retained if P > 0.05)
#H1 - Mean Shipping Cost is not equal across all Regions (accepted if P <= 0.05)

ANOVA
Source of Variation   SS            df     MS            F             P-value       F crit
Between Groups        1331672.471   22     60530.56686   0.231377076   0.999926966   1.544264581
Within Groups         1251280836    4783   261610.043
Total                 1252612508    4805

As the P value is 0.99, which is greater than 0.05, we fail to reject the null hypothesis: there is no evidence that mean shipping cost differs across regions.
Hypothesis 3: Mean Profit is equal across all Regions

#H0 - Mean Profit is equal across all Regions (retained if P > 0.05)
#H1 - Mean Profit is not equal across all Regions (accepted if P <= 0.05)

ANOVA
Source of Variation   SS            df     MS            F             P-value       F crit
Between Groups        3709487.433   22     168613.0652   6.756952375   1.98104E-20   1.544230836
Within Groups         121176678.3   4856   24954.01119
Total                 124886165.8   4878

As the P value is 1.98E-20, which is less than 0.05, we reject the null hypothesis and accept the alternative hypothesis: mean profit differs across regions.

Hypothesis 4: Profit is independent of Discount, Sales and Quantity

#H0 - Profit is independent of Discount, Sales and Quantity (all regression coefficients are zero)
#H1 - Profit depends on at least one of Discount, Sales and Quantity

Correlation matrix
           Sales          Quantity       Discount       Profit
Sales      1              0.282497955    -0.073802107   0.375220889
Quantity   0.282497955    1              -0.015008324   0.117562264
Discount   -0.073802107   -0.015008324   1              -0.332023545
Profit     0.375220889    0.117562264    -0.332023545   1

All three factors are correlated with Profit: Sales and Quantity positively, Discount negatively.

Regression Statistics
Multiple R          0.483848399
R Square            0.234109273
Adjusted R Square   0.233660947
Standard Error      137.6369678
Observations        5129

ANOVA
             df     SS            MS            F            Significance F
Regression   3      29676717.88   9892239.294   522.185034   4.1619E-296
Residual     5125   97087666.4    18943.93491
Total        5128   126764384.3

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   35.49332164    3.755206698      9.451762443    4.93786E-21   28.13151313    42.85513014
Sales       0.096358198    0.003532539      27.27732036    4.2666E-153   0.089432913    0.103283482
Quantity    1.003829029    0.881729614      1.138477162    0.254974505   -0.72473749    2.732395548
Discount    -227.8011358   9.123190008      -24.96946085   5.8321E-130   -245.6864836   -209.915788

Sales and Discount have a significant effect on Profit (P < 0.05), while Quantity does not (P = 0.255). Together the three variables explain only 23.41% of the
variation in Profit. As Significance F is less than 0.05, we reject the null hypothesis: Profit is not independent of these variables.
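
The summary figures in the regression output follow from the ANOVA sums of squares. As a rough check, the sketch below recomputes R Square, Adjusted R Square, F and two t statistics from the values printed above. The variable names are illustrative, and 1.96 is the usual large-sample two-sided critical value at alpha = 0.05.

```python
# Values copied from the Hypothesis 4 regression output.
SS_REGRESSION = 29676717.88
SS_RESIDUAL = 97087666.4
N_OBS = 5129
N_PREDICTORS = 3  # Sales, Quantity, Discount

df_residual = N_OBS - N_PREDICTORS - 1
ss_total = SS_REGRESSION + SS_RESIDUAL

r_squared = SS_REGRESSION / ss_total
adj_r_squared = 1 - (1 - r_squared) * (N_OBS - 1) / df_residual
f_stat = (SS_REGRESSION / N_PREDICTORS) / (SS_RESIDUAL / df_residual)

# t statistic = coefficient / standard error
t_sales = 0.096358198 / 0.003532539
t_quantity = 1.003829029 / 0.881729614

print(f"R Square          = {r_squared:.4f}")
print(f"Adjusted R Square = {adj_r_squared:.4f}")
print(f"F                 = {f_stat:.2f}")
print(f"t (Sales)         = {t_sales:.2f}")
print(f"t (Quantity)      = {t_quantity:.2f}  significant: {abs(t_quantity) > 1.96}")
```

The Quantity t statistic falls below 1.96, which matches its P value of 0.255 in the coefficient table.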
Hypothesis 5: Shipping Cost is independent of Country and SubCategory

#H0 - Shipping Cost is independent of Sub-Category and Country
#H1 - Shipping Cost depends on Sub-Category or Country

Pr(>|t|)
(Intercept) 0.332639
Sub-Category : Appliances 0.255826
Sub-Category : Art 0.003758 **
Sub-Category : Binders 0.000448 ***
Sub-Category : Bookcases 0.140265
Sub-Category : Chairs 0.100037
Sub-Category : Copiers 4.18e-05 ***
Sub-Category : Envelopes 0.018414 *
Sub-Category : Fasteners 0.002851 **
Sub-Category : Furnishings 0.003676 **
Sub-Category : Labels 0.000587 ***
Sub-Category : Machines 0.76511
Sub-Category : Paper 0.015242 *
Sub-Category : Phones 0.236621
Sub-Category : Storage 0.106763
Sub-Category : Supplies 0.012435 *
Sub-Category : Tables < 2e-16 ***
Country : Albania 0.91281
Country : Algeria 0.738475
Country : Angola 0.951432
Country : Argentina 0.234835
Country : Australia 0.899099
Country : Austria 0.594137
Country : Azerbaijan 0.765025
Country : Bahrain 0.705502
Country : Bangladesh 0.981222
Country : Barbados 0.826301
Country : Belarus 0.250224
Country : Belgium 0.799339
Country : Benin 0.769308
Country : Bolivia 0.931652
Country : Bosnia and Herzegovina 0.780198
Country : Brazil 0.37474
Country : Bulgaria 0.818955
Country : Burkina Faso 0.007239 **
Country : Cambodia 0.316617
Country : Cameroon 0.746526
Country : Canada 0.909344
Country : Chile 0.706947
Country : China 0.543633
Country : Colombia 0.734385
Country : Costa Rica 0.746996
Country : Cote d'Ivoire 0.817516
Country : Croatia 0.974605
Country : Cuba 0.660458
Country : Czech Republic 0.905662
Country : Democratic Republic of the Congo 0.831813
Country : Denmark 0.110653
Country : Dominican Republic 0.410668
Country : Ecuador 0.798754
Country : Egypt 0.754726
Country : El Salvador 0.955801
Country : Equatorial Guinea 0.664443
Country : Estonia 0.881587
Country : Finland 0.984997
Country : France 0.991896
Country : Gabon 0.885618
Country : Georgia 0.485149
Country : Germany 0.930062
Country : Ghana 0.967607
Country : Guatemala 0.405258
Country : Guinea 0.961298
Country : Guinea-Bissau 0.880244
Country : Guyana 0.817636
Country : Haiti 0.096429 .
Country : Honduras 0.199614
Country : Hong Kong 0.865287
Country : Hungary 0.76052
Country : India 0.774885
Country : Indonesia 0.672151
Country : Iran 0.696326
Country : Iraq 0.986082
Country : Ireland 0.190443
Country : Israel 0.85794
Country : Italy 0.408283
Country : Jamaica 0.720221
Country : Japan 0.657808
Country : Jordan 0.908563
Country : Kazakhstan 0.071053 .
Country : Kenya 0.952589
Country : Kyrgyzstan 0.611104
Country : Lesotho 0.890601
Country : Liberia 0.121887
Country : Libya 0.851323
Country : Lithuania 0.272273
Country : Luxembourg 0.556419
Country : Madagascar 0.811705
Country : Malaysia 0.433257
Country : Mali 0.826235
Country : Martinique 0.402828
Country : Mauritania 0.395629
Country : Mexico 0.931423
Country : Moldova 0.789967
Country : Mongolia 0.802962
Country : Montenegro 0.881106
Country : Morocco 0.53037
Country : Mozambique 0.534544
Country : Myanmar (Burma) 0.506389
Country : Namibia 0.74116
Country : Nepal 0.989618
Country : Netherlands 0.016850 *
Country : New Zealand 0.8569
Country : Nicaragua 0.823594
Country : Niger 0.942059
Country : Nigeria 0.011995 *
Country : Norway 0.485229
Country : Pakistan 0.038315 *
Country : Panama 0.108455
Country : Papua New Guinea 0.961708
Country : Paraguay 0.944253
Country : Peru 0.244631
Country : Philippines 0.10966
Country : Poland 0.940954
Country : Portugal 0.032413 *
Country : Qatar 0.888379
Country : Romania 0.46974
Country : Russia 0.921857
Country : Rwanda 0.705098
Country : Saudi Arabia 0.914051
Country : Senegal 0.93805
Country : Sierra Leone 0.916336
Country : Singapore 0.845836
Country : Slovakia 0.884829
Country : Slovenia 0.000184 ***
Country : Somalia 0.936314
Country : South Africa 0.828634
Country : South Korea 0.088749
Country : Spain 0.925237
Country : Sri Lanka 0.961813
Country : Sudan 0.619647
Country : Sweden 0.120765
Country : Switzerland 0.215018
Country : Syria 0.988224
Country : Taiwan 0.289607
Country : Tanzania 0.964852
Country : Thailand 0.213359
Country : Togo 0.769891
Country : Trinidad and Tobago 0.764979
Country : Tunisia 0.106301
Country : Turkey 0.044371 *
Country : Turkmenistan 0.709195
Country : Uganda 0.008655 **
Country : Ukraine 0.999113
Country : United Arab Emirates 0.131181
Country : United Kingdom 0.722631
Country : United States 0.888192
Country : Uruguay 0.910498
Country : Uzbekistan 0.81616
Country : Venezuela 0.354139
Country : Vietnam 0.483576
Country : Yemen 1.57e-05 ***
Country : Zambia 0.996456
Country : Zimbabwe 0.166744

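Residual standard error: 117.6 on 4866 degrees of freedom
Multiple R-squared: 0.1675, Adjusted R-squared: 0.1415
F-statistic: 6.443 on 152 and 4866 DF, p-value: < 2.2e-16

As the overall P value is less than 0.05, we reject the null hypothesis and accept the alternative hypothesis: Shipping Cost is not independent of Sub-Category and Country. The model explains only 16.75% of the variation in Shipping Cost, and most individual country coefficients are not significant.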
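
The output above lists one coefficient per level of each categorical variable: 16 sub-categories and 136 countries, which accounts for the 152 model degrees of freedom. Accessories has no row, which is consistent with dummy coding that treats the first level as the baseline. That reading is an assumption about how the model was fitted. The sketch below shows such an encoding with a hypothetical helper, `dummy_encode`.

```python
def dummy_encode(values, reference=None):
    """One 0/1 column per level except the reference (baseline) level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumed: first level alphabetically
    if reference not in levels:
        raise ValueError("reference level not present in the data")
    columns = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in columns] for value in values]
    return columns, rows


# Hypothetical rows; Accessories becomes the baseline and gets no column.
sub_categories = ["Accessories", "Tables", "Art", "Tables"]
columns, rows = dummy_encode(sub_categories)
print(columns)
for row in rows:
    print(row)
```

Each remaining coefficient is then read as a difference in mean Shipping Cost relative to the baseline level.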

Conclusion and Recommendations

The dataset contains a sufficient number of records for the study. Of the 5019 records available, the outliers were removed to obtain better data quality.
The sale of goods depends on various factors such as price, shipping cost, quantity, region and sub-category.
The model has an R square of 92% (86% adjusted), so only a few factors outside the model remain that will affect Sales.
