Project - Cold Storage Case Study
Case Study
Santanu Mondal
PGP-BABI Aug-2019 Group 4
Table of Contents
5.2 State the Hypothesis, perform hypothesis test and determine p-value
6 Conclusion/Inference
7 Appendix A
1 Project Objective
1.1 Problem 1
Cold Storage started its operations in Jan 2016. They are in the business of
storing Pasteurized Fresh Whole or Skimmed Milk, Sweet Cream, and
Flavored Milk Drinks. To ensure that there is no change of texture, body
appearance, or separation of fats, the optimal temperature to be maintained is
between 2 and 4 C.
In the first year of business they outsourced the plant maintenance work to a
professional company with stiff penalty clauses. It was agreed that if
it was statistically proven that the probability of the temperature going outside
the 2 - 4 C range during the one-year contract was above 2.5% and less than 5%,
then the penalty would be 10% of the AMC (annual maintenance contract) fee. In
case it exceeded 5%, the penalty would be 25% of the AMC fee. The average
temperature data at date level is given in the file
“Cold_Storage_Temp_Data.csv”
The objective of the report is to explore the Cold Storage Case Study dataset
using concepts of Statistical Methods of Decision Making and generate insights
about the data. This exploration report will consist of the following:
Importing the dataset in R
Understanding the structure of dataset
Graphical exploration
Descriptive statistics
Insights from the dataset
Find solutions to some problems based on the key insights drawn from
the data, as elaborated in Section 4.
1) Find mean cold storage temperature for Summer, Winter and
Rainy Season
2) Find overall mean for the full year
3) Find Standard Deviation for the full year
4) Assume Normal distribution, what is the probability of
temperature having fallen below 2 C?
5) Assume Normal distribution, what is the probability of
temperature having gone above 4 C?
6) What will be the penalty for the AMC Company?
1.2 Problem 2
In Mar 2018, Cold Storage started getting complaints from their Clients that
end consumers were finding the dairy products
going sour and often smelling. On getting these complaints, the supervisor pulled
out the temperature data of the last 35 days. As a safety measure, the Supervisor
decided to be vigilant and maintain the temperature at 3.9 C or below.
Assuming 3.9 C as the upper acceptable value for mean temperature and at alpha
= 0.1, the objective is to find out whether there is a need for some corrective
action in the Cold Storage Plant or whether the problem is on the procurement
side, from where Cold Storage is getting the Dairy Products. The data of the
last 35 days is in “Cold_Storage_Mar2018.csv”
Dataset 1
Given the nature of the data provided in the dataset, it can be seen that this
refers to the temperatures in the cold storage over the entire year of 2016.
The 365 rows of the dataset correspond to 365 unique days of the year and
the temperatures recorded on each day.
To provide more insight into season-wise trends, the dataset has further been
broken down into 3 seasons: Summer, Rainy, and Winter.
Summer corresponds to the months of Feb to May.
Rainy corresponds to June to September.
Winter corresponds to Jan & Oct to Dec.
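The month-to-season grouping above can be sketched as a small lookup in R. This is only an illustration: the helper name season_of is hypothetical and does not appear in the project code, where Season is already a column of the dataset.

```r
# Hypothetical helper: map a month abbreviation to the season grouping
# described above (Summer: Feb-May, Rainy: Jun-Sep, Winter: Jan, Oct-Dec).
season_of <- function(month) {
  lookup <- c(Jan = "Winter", Feb = "Summer", Mar = "Summer", Apr = "Summer",
              May = "Summer", Jun = "Rainy",  Jul = "Rainy",  Aug = "Rainy",
              Sep = "Rainy",  Oct = "Winter", Nov = "Winter", Dec = "Winter")
  # Index the named vector by month name and drop the names from the result
  unname(lookup[as.character(month)])
}

season_of(c("Jan", "Mar", "Jul", "Nov"))
```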
Also, the following data dictionary is considered for the 4 features in the
dataset:
Sl. No.  Feature Name  Feature Code  Feature Description
1        Season        Season        Seasons across the year: Summer, Rainy, Winter
2        Month         Month         All 12 months in a year, Jan to Dec
3        Date          Date          Day of each month on which the temperature was recorded
4        Temperature   Temperature   Temperature recorded on each day of the year
Dataset 2
Upon receiving complaints from customers in 2018, the Cold Storage requested
data from the maintenance company. The Supervisor pulled temperature data for
the last 35 days (Feb 11 to Mar 17).
The overall characteristics of this dataset are exactly the same as those of
Dataset 1, except that all the data pulled corresponds to the Summer season.
However, Dataset 1 will be put through all applicable methods of Univariate
and Bi-variate analysis.
3.2 Variable Identification
The dataset is analyzed for a basic understanding of the features and data it
contains. In this activity the data is explored and organized so that the
information it contains is made clear.
Please refer to Appendix A for Source Code.
3.2.1 Variable classes/characteristics
No of rows vs. No. of columns:
Dataset 1
No. of Rows No. of Columns
365 4
Dataset 2
No. of Rows No. of Columns
35 4
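The summary() function in R reports most of the key values. It does not report
the Mode or the IQR, however: base R has no function for the statistical mode
(mode() returns an object's storage type), and although IQR() is available,
customized functions have been written for both; refer to Appendix A for code.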
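As a minimal sketch of what such helpers might look like (this is not the code from Appendix A, and the name stat_mode is hypothetical), the statistical mode can be derived from a frequency table, while the IQR is available directly from IQR(). The sample values are the first seven temperature readings shown in the Appendix A output.

```r
# Hypothetical helper: statistical mode (most frequent value).
# Returns every value tied for the highest frequency.
stat_mode <- function(x) {
  x <- x[!is.na(x)]                 # ignore missing values
  counts <- table(x)                # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])
}

# First seven temperature readings shown in Appendix A
temps <- c(2.4, 2.3, 2.4, 2.8, 2.5, 2.4, 2.8)

stat_mode(temps)   # most frequent reading
IQR(temps)         # interquartile range (3rd quartile minus 1st quartile)
```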
Dataset 1
Dataset 2
Boxplots:
Dataset 2
Dataset 1:
Histograms (Dataset 1):
3.3.1.1 Continuous Variable Analysis: key observations
Dataset 1:
Overall temperatures across the year have a single outlier, as can be
observed from the Boxplot: Temperatures across the Year.
However, when the temperatures are plotted separately for the 3
seasons, the Winter season shows 3 outliers. The inference that can be
made is:
o Temperature fluctuation is observably higher in the Winter season,
compared to Summer, which has no outliers, and Rainy, which has one
o When plotted against the whole year, the number of outliers is 1.
Dataset 2
From the Temperature boxplot, we can observe a very heavy positive
skewness in the data distribution.
A single outlier is seen, lying well above the rest of the values.
3.4 Bi-Variate Analysis
Bi-variate analysis examines the relationship between two variables. For this
dataset, we will try to figure out the overall relationships/correlations among
the different variables on hand, both categorical and continuous.
3.4.3 Categorical & Continuous:
3.4.3.1 Temperature vs Month
Observations:
Outliers are present in the Months of Sep, Oct, Jan, where Jan itself has 3.
Skewness:
o Positive: observed in months of Sep, Nov, Jun, Jul, Jan, Dec, Aug
o Negative: observed in months of Mar, May, Feb, Apr, Oct
o Normal Distribution: not observed in any month
In almost 10% of the days in January, there seems to be some anomaly,
which can be further looked into by the AMC Company maintaining the Cold
Storage facility.
Using the rpivotTable function, some observations (Table and Bar Chart
functions used):
o September has clocked the highest Mean temperature, Variance, and
Standard Deviation amongst all 12 months
o Nov has clocked the lowest Mean temperature, Variance, and Standard
Deviation amongst all 12 months
3.4.3.2 Temperature vs Season
Boxplot: Temperature vs. Season
Observations:
3 Outliers in the Winter season, 1 in Rainy, whereas no Outliers in Summer
Skewness:
o Winter: Positive Skewness can be observed
o Summer: Heavy Negative Skewness can be observed
o Rainy: No/Negligible skewness can be observed, and it seems to
be normally distributed.
Using the rpivotTable function, some observations (Table and Bar Chart
functions used):
o Mean temperature: Summer has clocked the highest, although there is
not much difference compared to Rainy; Winter is lower than both
o The variance in Temperatures is higher in the Rainy season by quite some
margin compared to both Winter and Summer
o The Standard Deviation in Temperature is likewise highest in the Rainy
season, ahead of both Summer and Winter
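The per-season Mean, Variance, and Standard Deviation read off the pivot table can also be computed directly in base R. The sketch below is an alternative to rpivotTable, not the method used in this report, and the small data frame is illustrative only (it is not the project data).

```r
# Illustrative data frame: a few made-up rows, NOT the project dataset
df <- data.frame(
  Season      = c("Winter", "Winter", "Summer", "Summer", "Rainy", "Rainy"),
  Temperature = c(2.4, 2.8, 3.1, 3.3, 3.2, 3.9)
)

# Split the temperatures by season, then summarise each group
groups <- split(df$Temperature, df$Season)
season_stats <- data.frame(
  mean = sapply(groups, mean),
  var  = sapply(groups, var),
  sd   = sapply(groups, sd)
)

print(season_stats)   # one row per season, named by season
```

With the real data, replacing df with tempdata would give the figures discussed above in a single table.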
3.5 Missing Value Treatment
Missing value treatment is an important step in Exploratory Data Analysis, as
missing data in the training data set can reduce the power/fit of a model or can
lead to a biased model because we have not analyzed the behavior and relationship
with other variables correctly. It can lead to wrong prediction or classification.
The datasets under scrutiny do not have any missing values, as already
observed in the data summary, so missing value treatment is not elaborated
in this project.
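Besides reading the summary output, the absence of missing values can be confirmed explicitly. The following is a minimal sketch on a made-up data frame (one NA is planted deliberately to show what a positive result looks like); on the project data the same two calls would be applied to tempdata and tempmarch.

```r
# Made-up example with one missing temperature, for illustration only
df <- data.frame(Date = c(1, 2, 3), Temperature = c(2.4, NA, 2.8))

na_per_column <- colSums(is.na(df))  # count of missing values in each column
has_missing   <- anyNA(df)           # TRUE if any value at all is missing

print(na_per_column)
print(has_missing)
```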
3.6 Outlier Treatment
An outlier is an observation that lies far away from, and diverges from, the
overall pattern in a sample. Outliers need close attention; otherwise, they can
result in wildly wrong estimations.
Outliers can drastically change the results of data analysis and statistical
modeling. They have numerous unfavorable impacts on a data set:
They increase the error variance and reduce the power of statistical tests
If the outliers are non-randomly distributed, they can decrease normality
They can bias or influence estimates that may be of substantive interest
They can also violate the basic assumptions of Regression, ANOVA and other
statistical models.
The most commonly used method to detect outliers is visualization. We use various
visualization methods, like Box-plot, Histogram, etc., as applied earlier to
both datasets in multiple sections. However, since dealing with outliers is out
of scope for this Project, no specific action has been taken on the data.
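The points a boxplot draws as outliers can also be listed numerically. The sketch below uses boxplot.stats(), which applies the same 1.5 x IQR whisker rule as boxplot(); the vector is a made-up sample chosen to resemble the Dataset 2 summary (min 3.8, max 4.6), not the actual readings.

```r
# Made-up sample resembling the Dataset 2 summary, for illustration only
x <- c(3.8, 3.9, 3.9, 3.9, 4.0, 4.1, 4.1, 4.6)

stats <- boxplot.stats(x)   # default coef = 1.5, same rule as boxplot()
stats$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                   # values lying beyond the whiskers
```

Here the hinges are 3.9 and 4.1, so the upper fence is 4.1 + 1.5 * 0.2 = 4.4 and only the value 4.6 is flagged.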
Bar charts of different attributes of Central Tendency and Dispersion
Observations:
1) Start of Month (first 10 days):
a. Comes second to Month End in terms of total Range and also has 1
outlier
b. Slight Positive skewed characteristic can be observed here
2) Mid Month (middle 10 days):
a. In terms of range, the most consistent one, however, has a single
outlier
b. No/negligible skewness is observed, and seems to be normally
distributed
3) Month End (last 10 days):
a. Has no outliers.
b. The range is bigger than the other two, suggesting a need for
increased supervision/attention towards the end of each month.
c. Heavy Positive skewness can be observed
4) From the bar charts, the Mean, Variance, and Standard Deviation do not
seem to be telling much in terms of Statistical relationship between the
Temperatures and time of the month
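One way the three date spans could be derived from the day of the month is with cut(). This is a sketch under an assumption: the cut points at day 10 and day 20 are inferred from the group sizes in the Appendix A summary (120, 120 and 125 rows), so "Month End" here covers day 21 onward; the actual derivation used for the Datespan column is not shown in the report.

```r
# Day-of-month values chosen to sit on each side of the assumed boundaries
day <- c(1, 10, 11, 20, 21, 31)

# Intervals are (0,10], (10,20], (20,31]; the break points are an assumption
datespan <- cut(day,
                breaks = c(0, 10, 20, 31),
                labels = c("Start of Month", "Mid Month", "Month End"))

as.character(datespan)
```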
Inference: From the dataset under analysis, it can be observed that the
highest average temperature is clocked in Summer and the lowest in Winter.
Although one can assume this matches the natural ambient temperatures of the
different seasons, statistically we cannot draw that conclusion due to the
lack of weather data across the year.
Overall Mean of Full Year
2.96
4.3 Find Standard Deviation for the full year
Standard Deviation for the full year
0.5086
Inference: As we have already assumed the data to be normally
distributed, we can infer the ranges of the data from the Standard
Deviation and Mean calculated earlier:
- About 68% of the values lie in the range of 2.454 to 3.471 (+- Sigma)
- About 95% of the values lie in the range of 1.946 to 3.980 (+- 2Sigma)
- About 99.7% of the values lie in the range of 1.437 to 4.488 (+- 3Sigma)
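The three ranges above follow from mean +- k * sigma for k = 1, 2, 3. As a rough sketch, they can be reproduced in R as below. The mean is taken as 2.963, the value printed by summary() in Appendix A, so the bounds may differ from the figures above in the last decimal place, since the report uses the unrounded mean.

```r
mu    <- 2.963    # full-year mean as printed by summary() (rounded)
sigma <- 0.5086   # full-year standard deviation from Section 4.3

k <- 1:3
ranges <- data.frame(
  k        = k,
  lower    = mu - k * sigma,
  upper    = mu + k * sigma,
  coverage = pnorm(k) - pnorm(-k)   # share of a Normal distribution within k sigma
)

print(round(ranges, 4))
```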
4.5 Assume Normal distribution, what is the probability of
temperature having gone above 4 C
To find the probability of temperatures above 4, we set
lower.tail = FALSE because the area of interest is the right (upper) tail
of the distribution.
Using the pnorm function in R, the result is
= 2.070296% probability
A particular Temperature can never be “lower than 2” and “higher
than 4” at the same time; these 2 are mutually exclusive events,
thus P (A U B) = P(A) + P(B)
Therefore, P = P(Temp<2) + P(Temp>4) = 2.918417% + 2.070296% = 4.988713%
Since this probability is above 2.5% and below 5%, the penalty under the
contract is 10% of the AMC fee.
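The two tail probabilities and the resulting penalty band can be sketched as follows. The mean is taken as 2.963 (the rounded value printed by summary() in Appendix A), so the percentages differ slightly from the report's figures, which use the unrounded mean. The helper amc_penalty is hypothetical; it only encodes the clauses quoted in Section 1.1, and how a probability of exactly 5% should be treated is not stated in the contract text.

```r
mu    <- 2.963    # full-year mean as printed by summary() (rounded)
sigma <- 0.5086   # full-year standard deviation

p_below <- pnorm(2, mean = mu, sd = sigma, lower.tail = TRUE)   # P(Temp < 2)
p_above <- pnorm(4, mean = mu, sd = sigma, lower.tail = FALSE)  # P(Temp > 4)
p_out   <- p_below + p_above   # mutually exclusive events, so the probabilities add

# Hypothetical helper encoding the penalty clauses from Section 1.1.
# Assumption: exactly 5% falls in the 10% band (the contract does not say).
amc_penalty <- function(p) {
  if (p > 0.05) {
    "25% of AMC fee"
  } else if (p > 0.025) {
    "10% of AMC fee"
  } else {
    "no penalty"
  }
}

round(100 * c(below = p_below, above = p_above, total = p_out), 3)
amc_penalty(p_out)
```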
5 Problem solving: Problem 2
5.1 Which Hypothesis test shall be performed to check if
corrective action is needed at the cold storage plant?
Observations:
Assumptions:
Approach:
1) Since the population standard deviation is unknown, the most appropriate
test is the Student’s t-test
2) However, we will also perform the Z test and compare its
results with those of the T Test before drawing the conclusion.
3) Since we are talking about potential corrective actions, we intend to be
more exhaustive and detail oriented.
Hypothesis
The supervisor has been tasked with maintaining the temperature at the cold
storage at 3.9 C or below - this will be the Null Hypothesis
Hypothesis Tests
Step 1: State the Hypotheses: Ho: Mu <= 3.9 & Ha: Mu > 3.9
Step 11: Pvalue calculation: Since Ha is Mu > 3.9 (and Xbar is greater than
Mu), this is a right-tailed test; therefore, we use the pt function in R with
lower.tail = FALSE to calculate the Pvalue.
Step 12: Result: Since Pvalue < alpha, the Null Hypothesis is rejected in favor
of the Alternative Hypothesis, thus statistically concluding (via T Test) that
the mean Temperature in the Cold Storage is greater than 3.9 C at 90% confidence
(1 - 0.1), which would explain the products going sour or smelling.
Step 13: We will find the actual confidence by subtracting the Pvalue from 1.
Actual Confidence = (1 - Pvalue) * 100 = 99.52888%
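The T Test steps can be reproduced from the summary statistics alone. The sketch below uses the sample mean, standard deviation and size reported in this section; the variable names n and std_err are chosen here for clarity and are not necessarily those used in Appendix A.

```r
Mu    <- 3.9        # upper acceptable value for the mean temperature
alpha <- 0.1        # significance level
Xbar  <- 3.974286   # sample mean of the 35 readings
S     <- 0.159674   # sample standard deviation
n     <- 35         # sample size

std_err <- S / sqrt(n)                 # standard error of the mean
Tstat   <- (Xbar - Mu) / std_err       # t statistic
df      <- n - 1                       # degrees of freedom
Pvalue  <- pt(Tstat, df, lower.tail = FALSE)   # right-tailed p-value

reject_null <- Pvalue < alpha
c(Tstat = Tstat, Pvalue = Pvalue)
reject_null
```

With the raw readings available in a vector (called temp in Appendix A), t.test(temp, mu = 3.9, alternative = "greater") should give the same statistic and p-value in one call.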
Z-statistic Test:
Mu = 3.9
alpha = 0.1
Xbar = 3.974286
S = 0.159674
m = 35
se = 0.07428571
sde = 0.02698984
With significance alpha = 0.1, the critical value of Z for a right-tailed
test is +1.28, as calculated using MS Excel.
We are using a probability value of 0.9 instead of 0.1 because MS Excel
works with the cumulative probability and this is a right-tailed Test.
The nonrejection region is Z <= 1.28.
Result: Since the Z statistic (se/sde = 2.75) is greater than the critical
value (1.28), we can reject the null hypothesis and accept the
alternate hypothesis, thus reinforcing the results from T Test.
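The critical value obtained from MS Excel can be cross-checked in R with qnorm(), and the decision rule written out explicitly. This is a sketch using the values listed above; treating the sample standard deviation S as the population value is the assumption that makes this a Z test rather than a T Test.

```r
Mu    <- 3.9
alpha <- 0.1
Xbar  <- 3.974286
S     <- 0.159674   # sample SD, used here in place of the unknown population SD
n     <- 35

z_crit <- qnorm(1 - alpha)                     # right-tailed critical value, about 1.28
Zstat  <- (Xbar - Mu) / (S / sqrt(n))          # same ratio as se/sde above
p_z    <- pnorm(Zstat, lower.tail = FALSE)     # right-tailed p-value

reject_null_z <- Zstat > z_crit
c(z_crit = z_crit, Zstat = Zstat, p_z = p_z)
reject_null_z
```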
6 Conclusion/Inference
We have seen from Dataset 1, which holds the values for the year 2016, that the
average temperature throughout the year is 2.96. However, as months went by,
the working quality of the Cold Storage seems to have degraded. In the
sample taken in 2018, even before any Statistical analysis, we
can see a mean temperature of 3.97, which is about 1 degree higher. Going by the
working principle of Cold Storages, that does not look good, and it is
consistent with the complaints of products going sour and smelling. However, we
reserved our judgement until the Statistical analysis was completed, which
supports the conclusion that the mean temperature exceeds 3.9 C.
7 Appendix A
#============================================================#
# #
# Exploratory Data Analysis - Cold Storage Case Study #
# #
#=========================================================== #
# Environment Set up and Data Import
# Setup Working Directory
setwd("G:/My_R/Project 1")
# Variable Identification
#------------------------#
# Dataset 1 #
#------------------------#
## use the attach command to make the column names of the dataset available in the same session ##
attach(tempdata)
[1] 0.2586628
# Dataset 2 #
#-----------#
# Load the dataset into a temporary data frame
tempmarch = read.csv("Cold_Storage_Mar2018.csv", header = TRUE)
## to avoid a name conflict with the Temperature header name from the previous dataset,
## rename the column as a precautionary measure ##
colnames(tempmarch)[colnames(tempmarch)=="Temperature"] <- "temp"
#Check the variable
class(Date)
[1] "integer"
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.800 3.900 3.900 3.974 4.100 4.600
# Range
Range = max(temp) - min(temp)
print(Range)
[1] 0.8
+ ylab("") +
+ ggtitle("Histogram: Temperatures vs. Seasons")+
+ geom_histogram() +
+ scale_fill_manual(values =
+ c("Winter" = "Green",
+ "Summer" = "Red",
+ "Rainy" = "Cyan"))
##categorical variables
plot(Season,main='Seasons',xlab = "Seasons", ylab = "Frequency",col = c("Cyan", "Red", "Green"))
plot(Month,main='Months',xlab = "Months", ylab = "Frequency")
#Bivariate Analyses
#===================
## Temperature vs Month
## Install randomcoloR to help us generate 12 random colors, one for each month, to be used in the boxplot
install.packages("randomcoloR")
library(randomcoloR)
##Boxplot of Temperature vs Month
plot(Month,Temperature,
     horizontal = TRUE,
     main='Temperature Vs Month',
     xlab = "Temperature",
     ylab = "Month",
     col = randomColor(12, luminosity="light"))
## Using the rpivotTable to chart out the Mean temperatures for each month through a single function
## (the package must be loaded before rpivotTable() is called)
library(rpivotTable)
rpivotTable(tempdata)
## Temperature vs Season
## Boxplot of Temperature vs Season
plot(Season,Temperature,
     horizontal = TRUE,
     main='Temperature Vs Season',
     xlab = "Temperature",
     ylab = "Season")
## Temperature vs Date
cor(Temperature, tempdata$Date)
[1] -0.02814857
dim(tempdata)
[1] 365 5
summary(tempdata)
    Season       Month          Date        Temperature            Datespan
 Rainy :122   Aug    : 31   Min.   : 1.00   Min.   :1.700   Start of Month:120
 Summer:120   Dec    : 31   1st Qu.: 8.00   1st Qu.:2.500   Mid Month     :120
 Winter:123   Jan    : 31   Median :16.00   Median :2.900   Month End     :125
              Jul    : 31   Mean   :15.72   Mean   :2.963
              Mar    : 31   3rd Qu.:23.00   3rd Qu.:3.300
              May    : 31   Max.   :31.00   Max.   :5.000
              (Other):179
head(tempdata)
Season Month Date Temperature Datespan
1 Winter Jan 1 2.4 Start of Month
2 Winter Jan 2 2.3 Start of Month
3 Winter Jan 3 2.4 Start of Month
4 Winter Jan 4 2.8 Start of Month
5 Winter Jan 5 2.5 Start of Month
6 Winter Jan 6 2.4 Start of Month
tail(tempdata)
Season Month Date Temperature Datespan
360 Winter Dec 26 2.7 Month End
361 Winter Dec 27 2.7 Month End
362 Winter Dec 28 2.3 Month End
363 Winter Dec 29 2.6 Month End
364 Winter Dec 30 2.3 Month End
365 Winter Dec 31 2.9 Month End
str(tempdata)
'data.frame': 365 obs. of 5 variables:
 $ Season     : Factor w/ 3 levels "Rainy","Summer",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Month      : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Date       : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Temperature: num 2.4 2.3 2.4 2.8 2.5 2.4 2.8 2.3 2.4 2.8 ...
 $ Datespan   : Factor w/ 3 levels "Start of Month",..: 1 1 1 1 1 1 1 1 1 1 ...
#Boxplot of temperatures across date spans
boxplot(Temperature~Datespan,
        horizontal = TRUE,
        col = c("Cyan", "Red", "Green"),
        main = "Temperatures vs. Datespans in Months across the Year",
        xlab = "Temperatures",
        ylab = "Date Spans")
##Problem 1 Question 1
#--------------------#
## Temporary data frame with data from Winter Season##
temp_winter = tempdata[Season %in% "Winter",]
360 Winter Dec 26 2.7
361 Winter Dec 27 2.7
362 Winter Dec 28 2.3
363 Winter Dec 29 2.6
364 Winter Dec 30 2.3
365 Winter Dec 31 2.9
##Create a variable to calculate and populate the mean temperature of Winter season
mean_winter = round(mean(temp_winter$Temperature), digits = 2)
#-----------------------------------------------------#
## Temporary data frame with data from Summer Season##
temp_summer = tempdata[Season %in% "Summer",]
#-----------------------------------------------------#
## Temporary data frame with data from Rainy Season##
temp_rainy = tempdata[Season %in% "Rainy",]
##View the Rainy season temp data table
head(temp_rainy)
Season Month Date Temperature
152 Rainy Jun 1 3.2
153 Rainy Jun 2 3.9
154 Rainy Jun 3 2.9
155 Rainy Jun 4 3.1
156 Rainy Jun 5 2.8
157 Rainy Jun 6 2.8
tail(temp_rainy)
Season Month Date Temperature
268 Rainy Sep 25 2.6
269 Rainy Sep 26 3.9
270 Rainy Sep 27 3.3
271 Rainy Sep 28 2.9
272 Rainy Sep 29 1.7
273 Rainy Sep 30 2.6
##Problem 1 Question 2
#--------------------#
mean_fullyear_noround = mean(Temperature)
mean_fullyear = round(mean(Temperature), digits = 2)
mean_fullyear
[1] 2.96
##Problem 1 Question 3
#--------------------#
sd_fy = round(sd(Temperature, na.rm = TRUE), digits = 4)
sd_fy
[1] 0.5086
##Problem 1 Question 4
#--------------------#
y <- pnorm(2, mean = mean_fullyear_noround , sd = sd_fy, lower.tail = TRUE)
y*100
[1] 2.918417
##Problem 1 Question 5
#--------------------#
z <- pnorm(4, mean = mean_fullyear_noround , sd = sd_fy, lower.tail = FALSE)
z*100
[1] 2.070296
##Problem 1 Question 6
#--------------------#
m = (y + z)*100
m
[1] 4.988713
##-------------------------##
## Problem 2 ##
##-------------------------##
## to avoid a name conflict with the Temperature header name from the previous dataset,
## rename the column as a precautionary measure ##
colnames(tempmarch)[colnames(tempmarch)=="Temperature"] <- "temp"
## use the attach command to make the column names of the dataset available in the same session ##
attach(tempmarch)
## Hypothesis Testing ##
## T Test being used since the population standard deviation is unknown ##
## Null Hypothesis: - Ho: Mu <= 3.9 ##
## Alternate Hypothesis: - Ha: Mu > 3.9 ##
## each assignment is wrapped in parentheses so that R prints the result
## immediately, saving an extra step ##
##------------------------------------------##
## calculate the standard deviation of the sample and store it in the variable sd ##
(sd = sd(temp))
[1] 0.159674
## tstat calculations ##
(Tstat = se/sde)
[1] 2.752359
## pvalue calculations ##
## The xbar being greater than mu, we can infer this is a right tailed test
##
## therefore, we need to insert the command, lower.tail = FALSE ##
(Pvalue = pt(Tstat, df, lower.tail = FALSE))
[1] 0.004711198
## alpha(significance) = 0.1 ##
(alpha = 0.1)
[1] 0.1
#---------------------------------------------------------------------------#
# THE END #
#---------------------------------------------------------------------------#