
Udacity Business Analyst Project 8

The document outlines the steps taken to complete a predictive analytics capstone project. It involves determining optimal store formats for existing stores using k-means clustering, predicting formats for new stores using a boosted classification model, and forecasting produce sales for each store format using either ETS or ARIMA time series models. For existing stores, 3 formats were identified as optimal. The new stores were predicted to fall mostly in formats 2 and 1. ETS and ARIMA models were compared to select the best model for each cluster, with ETS(M,N,M) chosen for clusters 1 and 2 and ETS(M,N,A) for cluster 3.

Uploaded by

yogafire

Project: Predictive Analytics Capstone

Complete each section. When you are ready, save your file as a PDF document and submit it
here:
https://coco.udacity.com/nanodegrees/nd008/locale/en-us/versions/1.0.0/parts/7271/project

Task 1: Determine Store Formats for Existing Stores


1. What is the optimal number of store formats? How did you arrive at that number?

The optimal number of store formats is 3. The following steps were taken to arrive at this
number:

a) Extract the store sales data for 2015 and sum the values for each store by category.
b) Use K-means clustering to determine the optimal number of store clusters,
based on the percentage of sales per category for 2015.
c) From the K-means analysis report, the following output was produced:

d) Based on the output, the optimal number of clusters is 3 because it has:
i) The highest Adjusted Rand Index, indicating the best cluster stability
ii) The highest CH (Calinski-Harabasz) Index, indicating the most distinct and compact clusters
iii) Minimal spread and a high median, with both index calculations agreeing
on the number of clusters
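
Steps (a) and (b) and the CH criterion in (d) can be sketched in plain Python. This is only an illustration and not the workflow used for this report: the store names, sales figures, and cluster labels are made up, and `category_shares` and `calinski_harabasz` are hypothetical helper names.

```python
from statistics import mean


def category_shares(sales_by_category):
    """Convert one store's category totals into fractions of its total sales."""
    total = sum(sales_by_category.values())
    return {cat: value / total for cat, value in sales_by_category.items()}


def calinski_harabasz(points, labels):
    """CH index: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    dims = range(len(points[0]))
    overall = [mean(p[d] for p in points) for d in dims]
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [mean(p[d] for p in members) for d in dims]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in dims)
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in dims)
    return (between / (k - 1)) / (within / (n - k))


# Made-up 2015 totals per store and category (illustrative only).
stores = {
    "A": {"Bakery": 10.0, "Dairy": 30.0, "Produce": 60.0},
    "B": {"Bakery": 12.0, "Dairy": 28.0, "Produce": 60.0},
    "C": {"Bakery": 40.0, "Dairy": 40.0, "Produce": 20.0},
    "D": {"Bakery": 42.0, "Dairy": 38.0, "Produce": 20.0},
}
categories = ["Bakery", "Dairy", "Produce"]
shares = {s: category_shares(v) for s, v in stores.items()}
points = [[shares[s][c] for c in categories] for s in sorted(shares)]

good_labels = [1, 1, 2, 2]  # groups stores with similar sales mixes
bad_labels = [1, 2, 1, 2]   # mixes dissimilar stores in each cluster
ch_good = calinski_harabasz(points, good_labels)
ch_bad = calinski_harabasz(points, bad_labels)
```

A labeling that groups stores with similar sales mixes yields a much higher CH index than one that mixes them, which is why a higher CH index is read as more distinct and compact clusters.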
2. How many stores fall into each store format?

The summary of the number of stores that fall into each format (cluster):

Cluster    No. of Stores
1          23
2          29
3          33

3. Based on the results of the clustering model, what is one way that the clusters differ from
one another?
The image below shows a section from the output of K-Means Clustering:

For Percent of Sales for Bakery (PCSales_Bakery) and Percent of Sales for
Dairy (PCSales_Dairy), cluster 1 has strongly negative values while cluster
2 has strongly positive values, indicating that the two clusters are opposites
on these categories. Cluster 3 has values in between those of clusters 1 and 2.

4. Please provide a Tableau visualization (saved as a Tableau Public file) that shows the
location of the stores, uses color to show cluster, and size to show total sales.

https://public.tableau.com/views/Clustervisualizationv2/StoreLocationbyClusterandTotalSales?:embed=y&:display_count=yes&publish=yes
Task 2: Formats for New Stores
1. What methodology did you use to predict the best store format for the new stores? Why
did you choose that methodology? (Remember to Use a 20% validation sample with
Random Seed = 3 to test differences in models.)

Store demographic data was used to predict the best store format for the new stores.
The data contains 44 variables related to customer demographics, including:
Age, Education, Household Size, Household Income Range, Population Percentage,
Home Value, Population Density
The variables were used to develop a multinomial classification model for the store
format (cluster). Three models were compared:
i) Decision Tree Model
ii) Random Forest Model
iii) Boosted Model

The models were validated using a 20% validation sample with random seed = 3. The output of the
validation is as follows:

Model            Accuracy    F1        Accuracy_1    Accuracy_2    Accuracy_3
Random Forest    0.8235      0.8251    0.7500        0.8000        0.8750
Decision Tree    0.8235      0.8251    0.7500        0.8000        0.8750
Boosted          0.8235      0.8543    0.8000        0.6667        1.0000
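
The 20% validation sample with a fixed seed can be illustrated as follows. This is a sketch only: Python's `random` module will not reproduce the split drawn by the tool used for this report, and `split_validation` and the store IDs are hypothetical.

```python
import random


def split_validation(records, validation_fraction=0.2, seed=3):
    """Shuffle with a fixed seed and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# 85 existing stores (23 + 29 + 33), identified here by hypothetical IDs.
store_ids = [f"S{i:04d}" for i in range(1, 86)]
train, validation = split_validation(store_ids)
```

With 85 stores the validation sample holds 17 stores, which would be consistent with the reported accuracy of 0.8235 (14 of 17 correct).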


The confusion matrix is shown below:

The 3 models show the same overall accuracy (0.8235), but the boosted model has the best F1 score at 0.8543.
Since the F1 score is the harmonic mean of precision and recall and so balances both, the boosted
model was selected and used to predict the format of the new stores.
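
To show how accuracy and F1 can diverge, the sketch below computes per-class precision, recall, and F1 from a confusion matrix. The matrix is hypothetical (it is not the one from this report), `per_class_f1` is an illustrative helper, and the averaging used by the modelling tool to report a single F1 value may differ from the simple macro average shown here.

```python
def per_class_f1(confusion):
    """Return (precision, recall, f1) for each class.

    confusion[i][j] counts records whose actual class is i and predicted class is j.
    """
    k = len(confusion)
    results = []
    for c in range(k):
        true_pos = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(k))
        actual = sum(confusion[c])
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Hypothetical 3-class confusion matrix for 17 validation stores.
confusion = [
    [4, 1, 0],
    [0, 4, 1],
    [0, 1, 6],
]
scores = per_class_f1(confusion)
accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
```

Two models can misclassify the same number of stores, and so share one accuracy value, while distributing those errors differently across classes, which changes precision, recall, and therefore F1.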
2. What format do each of the 10 new stores fall into? Please fill in the table below.

Store Number    Segment
S0086           3
S0087           2
S0088           1
S0089           2
S0090           2
S0091           1
S0092           2
S0093           1
S0094           2
S0095           2

Task 3: Predicting Produce Sales


1. What type of ETS or ARIMA model did you use for each forecast? Use ETS(a,m,n) or
ARIMA(ar, i, ma) notation. How did you come to that decision?

The model parameters were similar across the three clusters because their time
series decomposition plots share similar characteristics. The plots are shown below.

Both ETS and ARIMA models were compared based on in-sample validation, AIC value,
and holdout sample validation to decide which model to use for the actual forecast.
Summary of Model Selection (explanation below):

Cluster (Store Format)    Model
1                         ETS(M,N,M)
2                         ETS(M,N,M)
3                         ETS(M,N,A)

ETS Model Building:

For the ETS model, 2 options were applied and compared:


The error component is multiplicative, as its magnitude varies over time.
The trend component appears to cancel out, indicating that there is no trend.
There is a seasonal component whose magnitude varies slightly, so both options were
investigated (multiplicative and additive).
ETS(M,N,M) and ETS(M,N,A) models were applied and compared.
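
The difference between an additive and a multiplicative seasonal component can be sketched with simple seasonal averages. This is not how an ETS model estimates its components (exponential smoothing is not implemented here); the series is made up and `seasonal_effects` is a hypothetical helper.

```python
from statistics import mean


def seasonal_effects(series, period, multiplicative):
    """Estimate one seasonal effect per position in the cycle for a trendless series.

    Additive effects are differences from the overall mean; multiplicative
    effects are ratios to it.
    """
    level = mean(series)
    effects = []
    for pos in range(period):
        season_mean = mean(series[pos::period])
        effects.append(season_mean / level if multiplicative else season_mean - level)
    return level, effects


# Made-up quarterly series with no trend: two identical years.
series = [80.0, 100.0, 120.0, 100.0, 80.0, 100.0, 120.0, 100.0]
level, additive = seasonal_effects(series, period=4, multiplicative=False)
_, ratios = seasonal_effects(series, period=4, multiplicative=True)

# Additive reconstruction: level + effect; multiplicative: level * effect.
additive_fit = [level + additive[i % 4] for i in range(len(series))]
multiplicative_fit = [level * ratios[i % 4] for i in range(len(series))]
```

When the seasonal swing stays roughly constant in absolute terms the additive form fits better; when it scales with the level of the series the multiplicative form does, which is why both were tried here.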
Cluster 1 Time Series and Time Decomposition Plots:
Cluster 2 Time Series and Time Decomposition Plots:
Cluster 3 Time Series and Time Decomposition Plots:
ARIMA Model Building:

The time series ACF and PACF showed that there is significant autocorrelation and that the data
is not stationary.

The data was differenced once for the seasonal component and once for the
non-seasonal component, resulting in a stationary series (d = 1 and D = 1).

For the non-seasonal component, the ACF and PACF plots show a significant value at lag 1.
As it is negative, an MA(1) term is applied.

For the seasonal component, the ACF and PACF plots show a significant value at lag 12. As it
is negative, a seasonal MA(1) term is applied.

The ARIMA model selected is ARIMA(0,1,1)(0,1,1)[12]. It was compared with a fully automatic
ARIMA model: ARIMA(0,1,0)(0,0,0)[12] for cluster 1, ARIMA(1,0,0)(0,0,0)[12] for cluster 2,
and ARIMA(0,1,1)(0,1,0)[12] for cluster 3.
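
The effect of one seasonal difference (D = 1, lag 12) followed by one non-seasonal difference (d = 1) can be sketched on a made-up monthly series. The data and the `difference` helper are illustrative only and are not taken from the project data.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up monthly series: linear trend plus a repeating 12-month pattern.
pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]

seasonal_diff = difference(series, lag=12)     # D = 1: removes the yearly pattern
stationary = difference(seasonal_diff, lag=1)  # d = 1: removes the remaining constant drift
```

Each difference shortens the series by its lag, so 36 observations leave 23 values after both steps. On real data the result is not exactly constant, and the ACF and PACF of the differenced series are used to choose the AR and MA terms.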

Cluster 1 Time Series ACF/PACF:


Cluster 1 after d(1) and D(1) - Time series, ACF and PACF:
Cluster 2 Time Series ACF/PACF:

Cluster 2 after d(1) and D(1) - Time series, ACF and PACF:
Cluster 3 Time Series ACF/PACF:
Cluster 3 after d(1) and D(1) - Time series, ACF and PACF:
Cluster 1 Model Summary:
In-Sample Validation

Cluster 1                         AIC      RMSE        MASE    MAPE
ETS(M,N,M)                        807.7    16431.12    0.37    4.4
ETS(M,N,A)                        831.8    22234.61    0.45    5.33
ARIMA(0,1,1)(0,1,1)[12]           481.6    11807.02    0.2     2.54
ARIMA(0,1,0)(0,0,0)[12] (auto)    766.3    25487.98    0.57    6.8

Holdout Sample Validation Tables (12 months holdout sample):


Holdout sample visualization - comparing all models

Cluster 1: Final Model Selection: ETS(M,N,M). This is based on it having the lowest
holdout sample RMSE, MAE, MPE and MAPE of the models compared. In addition,
the visualization of the models shows that it closely follows the actual values
of the holdout sample over its full duration.
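
The holdout metrics named above can be computed as in the sketch below. The numbers are made up, `forecast_errors` is a hypothetical helper, and the MASE scaling shown (in-sample MAE of a seasonal naive forecast) is one common convention that may differ from the one used by the forecasting tool.

```python
from math import sqrt


def forecast_errors(actual, forecast, training, period=12):
    """Return RMSE, MAE, MPE, MAPE and MASE for a holdout forecast.

    MASE scales the MAE by the in-sample MAE of a seasonal naive forecast.
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    naive = [abs(training[t] - training[t - period]) for t in range(period, len(training))]
    return {
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "MAE": mae,
        "MPE": 100 * sum(e / a for e, a in zip(errors, actual)) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "MASE": mae / (sum(naive) / len(naive)),
    }


# Made-up example with a 4-period season to keep the data short.
training = [100.0, 120.0, 140.0, 120.0, 110.0, 130.0, 150.0, 130.0]
actual = [100.0, 200.0]
forecast = [110.0, 190.0]
metrics = forecast_errors(actual, forecast, training, period=4)
```

MPE keeps the sign of each error and so indicates bias, while MAPE and MAE measure the size of the errors regardless of direction.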

Cluster 2 Model Summary:


In-Sample Validation

Cluster 2                         AIC      RMSE        MASE    MAPE
ETS(M,N,M)                        787.2    11997.14    0.41    3.31
ETS(M,N,A)                        822.1    19691.19    0.56    4.48
ARIMA(0,1,1)(0,1,1)[12]           478      10849.46    0.29    2.33
ARIMA(1,0,0)(0,0,0)[12] (auto)    779.7    20917.39    0.71    5.75

Holdout Sample Validation (12 months holdout sample)


Holdout sample visualization - comparing all models

Cluster 2: Final Model Selection: ETS(M,N,M). This is based on it having the lowest
holdout sample RMSE, MAE, MPE and MAPE of the models compared. In addition,
the visualization of the models shows that it closely follows the actual values
of the holdout sample over its full duration.
Cluster 3 Model Summary:
In-Sample Validation

Cluster 3                         AIC      RMSE        MASE    MAPE
ETS(M,N,M)                        787.1    11935.17    0.38    3.97
ETS(M,N,A)                        808.7    15394.64    0.46    4.91
ARIMA(0,1,1)(0,1,1)[12]           475.6    10210.29    0.23    2.48
ARIMA(0,1,1)(0,1,0)[12] (auto)    480.1    15837.35    0.36    3.92

Holdout Sample Validation (12 months holdout sample)


Holdout sample visualization - comparing all models

Cluster 3: Final Model Selection: ETS(M,N,A). This is based on it having the lowest
holdout sample RMSE, MAE, MPE and MAPE of the models compared. In addition,
the visualization of the models shows that it closely follows the actual values
of the holdout sample over its full duration.

2. Please provide a Tableau Dashboard (saved as a Tableau Public file) that includes a
table and a plot of the three monthly forecasts; one for existing, one for new, and one for
all stores. Please name the tab in the Tableau file "Task 3".

https://public.tableau.com/views/ProduceSalesDashboard/ExistingandForecastProduceSales?:embed=y&:display_count=yes&publish=yes
Before you submit

Please check your answers against the requirements of the project dictated by the rubric.
Reviewers will use this rubric to grade your project.
