Report: Time Series
BUSINESS REPORT
AKSHAYA J K
Sparkling.csv Rose.csv
Solution:
Loaded the required packages and read the monthly Sparkling wine sales dataset without
using pandas' date-time parsing.
Added a time stamp to the original data frame, set the time stamp as the index, and
dropped the YearMonth column from the dataset.
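A minimal sketch of this loading step. It assumes the CSV has a text column named
YearMonth in YYYY-MM form and a sales column named Sparkling; the sales column name,
the Time_Stamp name and the three sample rows are illustrative and are not taken from
the dataset.

```python
import io
import pandas as pd

# Stand-in for Sparkling.csv: made-up rows with assumed column names.
csv_text = "YearMonth,Sparkling\n1980-01,100\n1980-02,120\n1980-03,90\n"

# Read without date parsing, so YearMonth arrives as plain text.
df = pd.read_csv(io.StringIO(csv_text))

# Build the time stamp, use it as the index, and drop the original text column.
df["Time_Stamp"] = pd.to_datetime(df["YearMonth"], format="%Y-%m")
df = df.set_index("Time_Stamp").drop(columns="YearMonth")
```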
Method-2:
An alternative way to read the original data frame as time-series data is to use
pandas' read options [parse_dates=True, squeeze=True, index_col=0].
All values are loaded properly, with the index in pandas' date-time format.
The Sparkling time series does not contain any missing values.
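A hedged sketch of Method-2 and the missing-value check. The squeeze=True argument of
read_csv was removed in pandas 2.0, so this version calls .squeeze("columns") on the
result instead. The column names and sample rows are assumptions for illustration.

```python
import io
import pandas as pd

# Stand-in for Sparkling.csv: made-up rows with assumed column names.
csv_text = "YearMonth,Sparkling\n1980-01,100\n1980-02,120\n1980-03,90\n"

# Parse the first column as dates and use it as the index, then reduce the
# single-column frame to a Series.
series = pd.read_csv(
    io.StringIO(csv_text), parse_dates=True, index_col=0
).squeeze("columns")

# Count missing observations in the series.
missing = int(series.isna().sum())
```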
Plot the Sparkling Time Series to understand the behaviour of the data:
The Sparkling wine dataset shows significant seasonality but no consistent trend; it
has upward and downward slopes over the period.
Sparkling wine has been consistently favoured by customers over the years.
Solution:
The basic descriptive statistics tell us how sales have varied across the years.
These measures, however, are computed over the whole dataset without taking the time
component into account.
The descriptive summary shows that on average 2402 units of Sparkling wine were sold
each month over the period. The middle 50% of monthly sales ranged from 1605 to 2549
units. The maximum sale reported in a month is 7242 units.
Yearly Boxplot:
Monthly Boxplot:
The yearly boxplot shows that the average (median) sale of Sparkling has been more or
less consistent across the period, at or a little below 2000 units.
The outliers in the yearly boxplot most probably represent the higher sales in the
seasonal months.
The monthly boxplot shows clear seasonality in the festive months of October, November
and December, peaking in December. Sales are lowest in June.
The monthly plot for Sparkling shows the mean and variation of units sold in each month
over the years. Sales in the seasonal months show higher variation than in the lean
months.
December sales, with a mean a little below 6000 units, vary from 4500 to 7400 units
over the years, whereas November sales vary from 3500 to 5000 units and October sales
from 2500 to 4000 units.
The lean months from January to September show more or less consistent sales of around
2000 units.
The plot of monthly sales over the years also shows the seasonal component of the time
series, with October, November and December selling markedly higher volumes.
The highest volume of Sparkling wine was sold in December 1987 and the lowest December
sale was in 1981. After 1987, December sales average around 6500 units, up from around
5000 in the early 1980s.
Since 1990 the seasonal sales have been more or less consistent, at around 6000 units
in December, 4000 units in November and 3000 units in October.
Sales for the months from January to July are consistent across the years compared
with the rest of the months.
As the amplitude of the seasonal peaks in the observed plot changes with the trend, the
time series is assumed to be ‘multiplicative’.
The trend component does not show a consistent direction: an intermediate period shows
an upward trend, which levels off in the latter half of the time series.
The additive model shows a seasonal variation of about 3000 units and the
multiplicative model shows a variation of about 30%.
The residuals show high variability across the period of the time series, which is more
or less consistent in both the additive and the multiplicative decompositions.
The additive residuals are centred around 0 and the multiplicative residuals vary by
around 10%.
If the seasonal and residual components are independent of the trend, the series is
additive. If they depend on the trend, meaning their size scales with the level of the
trend, the series is multiplicative.
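The two decomposition models can be written compactly as follows. The symbols are
assumptions introduced here for illustration: y_t is the observed sale, T_t the trend,
S_t the seasonal component and R_t the residual. The third line shows why a log
transformation turns a multiplicative series into an additive one.

```latex
\begin{align}
\text{Additive:} \quad & y_t = T_t + S_t + R_t \\
\text{Multiplicative:} \quad & y_t = T_t \times S_t \times R_t \\
\text{Log of multiplicative:} \quad & \log y_t = \log T_t + \log S_t + \log R_t
\end{align}
```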
Solution:
The train and test datasets are created with 1991 as the starting year of the test data.
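One way such a split could be done, sketched on a made-up six-month series that
straddles the 1991 boundary. The series name and values are illustrative only.

```python
import pandas as pd

# Made-up monthly series from October 1990 to March 1991.
index = pd.date_range("1990-10-01", periods=6, freq="MS")
series = pd.Series([10, 11, 12, 13, 14, 15], index=index, name="Sparkling")

# Everything before 1991 is training data; 1991 onwards is test data.
train = series[series.index.year < 1991]
test = series[series.index.year >= 1991]
```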
Note: Please do try to build as many models as possible and as many iterations of models as
possible with different parameters.
Solution:
To regress Sparkling wine sales on time, a numerical time-instance order was generated
for both the training and test sets and added to the respective datasets.
The linear regression plot shows a gradual upward trend in the forecast of Sparkling
wine, consistent with a trend in the observed data that was not visually apparent.
For the Regression on Time forecast on the test data, the RMSE is 1389.135.
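A small sketch of regression on a time-instance order, with RMSE computed on the test
set. It uses a hand-written least-squares fit rather than any particular library, and
the numbers are made up, so it does not reproduce the RMSE reported above.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up sales values; the time order continues from train into test.
train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 21.0]
train_time = list(range(1, len(train) + 1))
test_time = list(range(len(train) + 1, len(train) + len(test) + 1))

slope, intercept = fit_line(train_time, train)
forecast = [intercept + slope * t for t in test_time]
error = rmse(test, forecast)
```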
In the naive model, the prediction for tomorrow is the same as today's value. The
prediction for the day after tomorrow equals tomorrow's prediction, which is itself
today's value, so every future forecast equals the last observed value.
The model takes the last value of the training set and uses that same value as the
forecast for every point in the test set.
For the Naive forecast on the test data, the RMSE is 3864.279.
The model does not capture the trend or seasonality of the dataset.
In the Simple Average model, the forecast is the mean of the time-series variable over
the training set.
The model cannot capture the trend or seasonality present in the dataset.
For Simple Average on the test data, the RMSE is 1275.
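The naive and simple average forecasts are both flat lines and differ only in the value
they repeat. A minimal sketch on made-up numbers (not the wine data):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

train = [3.0, 5.0, 4.0, 8.0]   # made-up training values
test = [6.0, 10.0]             # made-up test values

# Naive: repeat the last training observation over the test horizon.
naive_forecast = [train[-1]] * len(test)

# Simple average: repeat the training mean over the test horizon.
mean_forecast = [sum(train) / len(train)] * len(test)

naive_rmse = rmse(test, naive_forecast)
mean_rmse = rmse(test, mean_forecast)
```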
For the moving average model, rolling means (trailing moving averages) are calculated
for different window lengths. The best window is the one with the highest accuracy,
that is, the lowest RMSE.
The moving average models are built for trailing windows of 2, 4, 6 and 9 points.
For the Sparkling dataset, accuracy is found to be higher with the shorter rolling
windows.
In moving average forecasts, the fitted values lag the data by n points.
The best moving-average window for this data is 2 points.
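A trailing moving average can be sketched in a few lines. The helper below is
hypothetical and mirrors what a rolling mean over a fixed window computes: positions
without a full window of history have no value.

```python
def trailing_moving_average(values, window):
    """Mean of the current point and the (window - 1) points before it."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)  # not enough history yet
        else:
            averages.append(sum(values[i + 1 - window : i + 1]) / window)
    return averages

values = [2.0, 4.0, 6.0, 8.0, 10.0]   # made-up series
ma_2 = trailing_moving_average(values, 2)
ma_4 = trailing_moving_average(values, 4)
```

The longer window starts later and reacts more slowly, which is consistent with the
shorter windows tracking the data more closely.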
RMSE Values:
The model was run without passing a value for alpha, using the parameters
‘optimized=True, use_brute=True’.
The auto-fit model picked alpha = 0.0496 as the smoothing parameter.
Simple Exponential Smoothing is suited to a time series with neither trend nor
seasonality, which is not the case with the given data.
The forecasts for manually passed smoothing levels (alpha between 0 and 1) are shown
below.
For alpha closer to 1, the forecast follows the actual observations closely; for alpha
closer to 0, the forecast moves farther from the actuals and the line is smoother.
For Sparkling, the test RMSE is found to be higher for values closer to zero, which is
the same as in the Simple Average forecast.
With manually passed alpha values, alpha = 0.025 gives a better RMSE than the optimized
value.
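The effect of alpha can be seen in a hand-rolled version of the smoothing recursion.
This is a sketch that initialises the level with the first observation, which is an
assumption; a library implementation may initialise differently and give slightly
different numbers.

```python
def ses_level(values, alpha):
    """Final smoothed level; the SES forecast is this value at every horizon."""
    level = values[0]  # assumed initialisation: first observation
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

train = [10.0, 20.0, 30.0]   # made-up series
levels = {alpha: ses_level(train, alpha) for alpha in (0.1, 0.5, 0.9)}
```

With alpha = 0.9 the level ends close to the last observation, while alpha = 0.1 keeps
it near the starting value, matching the behaviour described above.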
The Double Exponential Smoothing model is applicable when the data has a trend but no
seasonality. The Sparkling data contains a slight trend component and very significant
seasonality.
In the first iteration, the smoothing level (alpha) and trend (beta) were searched
iteratively over values from 0.1 to 1, and the best combination, alpha = 0.1 and
beta = 0.1, was chosen based on the RMSE values.
In the second iteration the model was allowed to choose the optimized values using the
parameters ‘optimized=True, use_brute=True’.
The auto-fit model returned a higher RMSE than the iterative model with alpha = 0.1 and
beta = 0.1.
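The iterative search over alpha and beta can be sketched as a plain grid search around
a hand-written Holt (double exponential smoothing) recursion. The initialisation, the
function names and the data are assumptions for illustration; the report's own search
presumably wrapped a library model instead.

```python
import math
from itertools import product

def holt_forecast(values, alpha, beta, horizon):
    """Holt's linear trend method; level and trend start from the first two points."""
    level, trend = values[0], values[1] - values[0]
    for y in values[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def grid_search(train, test, grid):
    """Return (rmse, alpha, beta) for the combination with the lowest test RMSE."""
    best = None
    for alpha, beta in product(grid, repeat=2):
        error = rmse(test, holt_forecast(train, alpha, beta, len(test)))
        if best is None or error < best[0]:
            best = (error, alpha, beta)
    return best

grid = [round(0.1 * k, 1) for k in range(1, 11)]   # 0.1, 0.2, ..., 1.0
train = [10.0, 12.0, 14.0, 16.0, 18.0]             # made-up linear series
test = [20.0, 22.0]
best_rmse, best_alpha, best_beta = grid_search(train, test, grid)
```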
The Triple Exponential Smoothing model (Holt-Winters model) is applicable when the data
has both trend and seasonality. The Sparkling data contains a slight trend and
significant seasonality.
In the first iteration, the smoothing level (alpha), trend (beta) and seasonality
(gamma) were searched iteratively over values from 0.1 to 1, and the best combination,
alpha = 0.4, beta = 0.1 and gamma = 0.3, was chosen based on the RMSE values.
In the second iteration the model was allowed to choose the optimized values using the
parameters ‘optimized=True, use_brute=True’.
The auto-fit model returned a higher RMSE than the iterative model with alpha = 0.4,
beta = 0.1 and gamma = 0.3.
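For intuition, here is a compact Holt-Winters recursion with additive trend and
multiplicative seasonality. It is a sketch with a simple assumed initialisation (first
season for the level and seasonal indices, first two seasons for the trend) and is not
the library model used in the report; it needs at least two full seasons of data.

```python
def holt_winters_multiplicative(values, period, alpha, beta, gamma, horizon):
    """Additive-trend, multiplicative-seasonality Holt-Winters forecast."""
    first, second = values[:period], values[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) / period - level) / period
    season = [y / level for y in first]          # one index per position in the cycle
    for t in range(period, len(values)):
        s = season[t % period]
        previous_level = level
        level = alpha * (values[t] / s) + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        season[t % period] = gamma * (values[t] / level) + (1 - gamma) * s
    n = len(values)
    return [(level + (h + 1) * trend) * season[(n + h) % period] for h in range(horizon)]

# Made-up series: a 4-point seasonal pattern repeated three times, no trend.
pattern = [50.0, 100.0, 150.0, 100.0]
train = pattern * 3
forecast = holt_winters_multiplicative(train, 4, alpha=0.4, beta=0.1, gamma=0.3, horizon=4)
```

On a perfectly repeating series the forecast reproduces the seasonal pattern, which is
the behaviour that lets this model follow year-end peaks.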
Model Comparison:
From the comparison of accuracy values and the plot, it can be inferred that Triple
Exponential Smoothing is the best model, with trend and seasonality components that fit
the test data well.
The 2-point trailing moving average model is also found to fit well, with a slight lag
on the test dataset.
Solution:
The Augmented Dickey-Fuller (ADF) test is a statistical test for the stationarity of a
time series. It tests for the presence of a unit root in the series to determine
whether the series is stationary.
Null Hypothesis: The series has a unit root, that is, the series is non-stationary.
Alternate Hypothesis: The series has no unit root, that is, the series is stationary.
If we fail to reject the null hypothesis, we treat the series as non-stationary; if we
reject the null hypothesis, we conclude that the series is stationary.
The ADF test on the original Sparkling series returned the values below; the p-value is
greater than alpha = 0.05, so we fail to reject the null hypothesis.
Differencing of order one is applied to the Sparkling series and tested for
stationarity. At differencing order 1, the series is found to be stationary, as below.
The rolling mean and standard deviation are also plotted to understand the seasonal
component and to ascertain whether it is multiplicative or additive in character.
The level of the rolling mean and standard deviation changes with the slope, which
indicates multiplicative behaviour.
The ADF test is also run in this exercise on the log-transformed train data with
seasonal differencing of order 12, to understand whether removing the multiplicative
character of the seasonal component affects the accuracy of the model.
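The transformations named above can be sketched without any library. The example uses
a made-up series with a seasonal period of 2 (instead of 12) that grows by 10% per
cycle, so the effect of log plus seasonal differencing is easy to see.

```python
import math

def difference(values, lag=1):
    """Lagged difference: values[t] - values[t - lag]."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

# Made-up multiplicative series: period 2, each cycle 10% larger than the last.
series = [100.0, 200.0, 110.0, 220.0, 121.0, 242.0]

first_diff = difference(series)                       # order-one differencing
log_series = [math.log(v) for v in series]            # log removes the multiplicative scale
seasonal_log_diff = difference(log_series, lag=2)     # seasonal differencing on the log scale
```

After the log transform, the seasonal difference is the same constant at every point,
whereas the plain first difference still grows with the level of the series.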
Solution:
An ARIMA model was built with automated parameter selection; the lowest AIC, 2210.62,
was found at order (2, 1, 2).
As the Sparkling series contains a seasonal component, the ARIMA model does not perform
well. The RMSE for this auto-ARIMA model is 1375.
The model was built on the train data with a seasonal period of 12 and different
(p, d, q)x(P, D, Q) parameter combinations; the lowest AIC, 1382.35, was obtained at
(1, 1, 2)x(0, 1, 2, 12).
The model was built with the above parameters.
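The automated search amounts to scoring every parameter combination and keeping the one
with the lowest AIC. The sketch below shows only that loop: candidate_orders,
select_lowest_aic and the toy scoring function are hypothetical, and the toy score
stands in for a real model fit so the example runs on its own.

```python
from itertools import product

def candidate_orders(p_values, d_values, q_values, seasonal_period):
    """All (order, seasonal_order) pairs from the given parameter ranges."""
    orders = list(product(p_values, d_values, q_values))
    seasonal_orders = [
        (P, D, Q, seasonal_period) for P, D, Q in product(p_values, d_values, q_values)
    ]
    return list(product(orders, seasonal_orders))

def select_lowest_aic(candidates, aic_of):
    """Return (aic, order, seasonal_order) for the lowest-scoring candidate."""
    best = None
    for order, seasonal_order in candidates:
        aic = aic_of(order, seasonal_order)
        if best is None or aic < best[0]:
            best = (aic, order, seasonal_order)
    return best

def toy_aic(order, seasonal_order):
    """Placeholder score (NOT a real AIC): penalises the number of AR and MA terms."""
    return order[0] + order[2] + seasonal_order[0] + seasonal_order[2]

candidates = candidate_orders(range(3), [1], range(3), seasonal_period=12)
best = select_lowest_aic(candidates, toy_aic)
```

In practice aic_of would fit a seasonal ARIMA model on the train data and return its
AIC, for example through statsmodels' SARIMAX class; which fitting call the report used
is not stated.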
The diagnostic plots of the model were produced. The standardized residuals fluctuate
around a mean of zero, and the histogram shows that the residuals follow a normal
distribution.
The Normal Q-Q plot also shows that the quantiles come from a normal distribution, as
the points form roughly a straight line.
The correlogram shows the autocorrelation of the residuals; no lags are significant
beyond the confidence band.
The RMSE of the automated SARIMA model is 382.58.
The model was built on the log-transformed train data with a seasonal period of 12 and
different (p, d, q)x(P, D, Q) parameter combinations; the lowest AIC, 284.48, was
obtained at (0, 1, 1)x(1, 0, 1, 12).
The model was built with the above parameters.
The diagnostic plots of the model were produced. The standardized residuals fluctuate
around a mean of zero, and the histogram shows that the residuals follow a normal
distribution.
The Normal Q-Q plot also shows that the quantiles come from a normal distribution, as
the points form roughly a straight line.
The correlogram shows the autocorrelation of the residuals; no lags are significant
beyond the confidence band.
From the above model summary it can be inferred that the MA.L1, AR.L.S12 and MA.L.S12
terms have the largest absolute coefficients.
From the p-values it can be inferred that the MA.L1, AR.L.S12 and MA.L.S12 terms are
significant, as their p-values are below 0.05.
The RMSE of the automated SARIMA model on the log series is 336.58.
The model built on the log series has a lower RMSE than the model built on the original
train data.
Solution:
The RMSE of the manual ARIMA model is 4780. Since the ARIMA model does not capture the
seasonality, it does not perform well.
From the ACF plot of the observed train data, it can be inferred that the
autocorrelation does not taper off quickly at seasonal intervals of 12, so seasonal
differencing of order 12 is required.
From the plots above, a slight trend is still apparent after seasonal differencing of
order 12. With a further differencing of order one, no trend remains.
An ADF test is needed to check stationarity after the above differencing. With a
p-value below alpha = 0.05 and a test statistic below the critical values, it can be
confirmed that the data is stationary.
ACF and PACF plots of the seasonally differenced and first-order differenced data are
created to find the values for (p, d, q)x(P, D, Q).
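The quantity plotted in an ACF chart is the sample autocorrelation at each lag. A
minimal sketch of that calculation, using a made-up alternating series whose "season"
has length 2, so the seasonal lags stand out clearly:

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum(
        (values[i] - mean) * (values[i - lag] - mean) for i in range(lag, n)
    )
    return numerator / denominator

series = [1.0, -1.0] * 6   # made-up series that repeats every 2 points
acf = [autocorrelation(series, lag) for lag in range(4)]
```

The autocorrelation stays high at multiples of the seasonal lag and decays only slowly,
which is the pattern that signals the need for seasonal differencing.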
Solution:
Solution:
Based on the overall model evaluation and comparison, the manual SARIMA model is
selected for the final prediction 12 months into the future.
The manual SARIMA model with parameters (3, 1, 1)x(1, 1, 2, 12) is found to be the best
model in terms of accuracy scored against the full data.
The model predicts an upward trend and a continuation of the seasonal surge in sales
over the upcoming 12 months. According to the model, the seasonal sales will be higher
than those of the previous year.
Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.
Solution:
The model forecasts sales of 29535 units of Sparkling wine over the next 12 months, an
average of about 2461 units per month.
The seasonal sales are forecast to peak at 6136 units in December 1995, before dropping
to the lowest level, 1246 units, in January 1996.
The wine company is recommended to ramp up procurement and production in line with the
above forecasts for the fourth quarter of 1995 (October, November and December), when a
total of 13,370 units of Sparkling wine is expected to be sold.
The forecast also indicates that year-on-year sales of Sparkling wine are not showing
an upward trend. The winery should adopt innovative marketing strategies to improve
sales compared with previous years.
Solution:
Loaded the required packages and read the monthly Rose wine sales dataset without using
pandas' date-time parsing.
Added a time stamp to the original data frame, set the time stamp as the index, and
dropped the YearMonth column from the dataset.
Method-2:
An alternative way to read the original data frame as time-series data is to use
pandas' read options [parse_dates=True, squeeze=True, index_col=0].
All values are loaded properly, with the index in pandas' date-time format. The Rose
time series holds its values in float64 format.
The Rose time series contains 2 missing values, for the time stamps '1994-07-01' and
'1994-08-01'.
The null values are imputed by interpolation [polynomial of order 2].
Plot the Rose Time Series to understand the behaviour of the data:
The Rose wine dataset shows significant seasonality, and a decreasing trend can be
observed with multiplicative seasonality present.
Rose has fallen out of favour with customers over the years.
Solution:
The mean of the time series is nearly the same as the median. For time-series data this
may signify the presence of a decreasing trend and multiplicative seasonality.
The descriptive summary shows that on average 90 units of Rose wine were sold each
month over the period. The middle 50% of monthly sales ranged from 63 to 112 units. The
maximum sale reported in a month is 267 units and the minimum is 28 units.
The basic descriptive statistics tell us how sales have varied across the years. These
measures, however, are computed over the whole dataset without taking the time
component into account.
Yearly Boxplot:
The yearly boxplot shows that the average sale of Rose wine moves with the downward
trend in sales over the years. The outliers above the upper bound in the yearly boxplot
most probably represent the higher sales in the seasonal months.
The monthly boxplot shows clear seasonality in the months of November and December.
Although sales drop in January, they pick up over the course of the year.
The average sale is around 140 units in December, around 110 units in November and
around 90 units in October.
The monthly plot for Rose shows the mean and variation of units sold in each month over
the years. Sales in months such as July, August, September and December show higher
variation than the rest.
December sales, with a mean a little below 100 units, vary from 75 to 270 units over
the years, whereas the average sale is below or close to 100 units (above 50) for the
rest of the year.
The plot of monthly sales over the years also shows the seasonal component of the time
series, with November and December selling markedly higher volumes than other months.
The highest volume of Rose wine was sold in December 1980 and the lowest December sale
was in 1993. Although December sales picked up after 1983, they dipped consistently
after 1987.
The observed plot of the decomposition shows visible annual seasonality and a downward
trend. The early part of the plot shows higher variation than the later periods.
The trend plot shows an overall downward trend. Sharp dips can be seen between 1981 and
1983 and later from 1991 to 1993.
The seasonal component is clearly visible and consistent in both the observed and the
seasonal charts. The multiplicative model shows a seasonal variation of 16%.
The residuals show high variability across the period of the time series, which is more
or less consistent.
The residual variance is higher in the early period of the series, which explains the
higher variance in the observed plot over the same period.
As the seasonal peaks consistently shrink in amplitude in line with the trend, the
series can be treated as multiplicative in model building.
Solution:
The train and test datasets are created with 1991 as the starting year of the test data.
Note: Please do try to build as many models as possible and as many iterations of models as
possible with different parameters.
Solution:
To regress Rose wine sales on time, a numerical time-instance order was generated for
both the training and test sets and added to the respective datasets.
The linear regression on the Rose dataset shows a clear downward trend, consistent with
the observed time series.
For the Regression on Time forecast on the test data, the RMSE is 15.278.
The model has captured the trend of the series but does not reflect the seasonality.
In the naive model, the prediction for tomorrow is the same as today's value. The
prediction for the day after tomorrow equals tomorrow's prediction, which is itself
today's value, so every future forecast equals the last observed value.
The model takes the last value of the training set and uses that same value as the
forecast for every point in the test set.
For the Naive forecast on the test data, the RMSE is 79.75.
The model does not capture the trend or seasonality of the dataset.
In the Simple Average model, the forecast is the mean of the time-series variable over
the training set.
The model cannot capture the trend or seasonality present in the dataset.
For Simple Average on the test data, the RMSE is 53.48.
For the moving average model, rolling means (trailing moving averages) are calculated
for different window lengths. The best window is the one with the highest accuracy,
that is, the lowest RMSE.
The moving average models are built for trailing windows of 2, 4, 6 and 9 points.
For the Rose dataset, accuracy is found to be higher with the shorter rolling windows.
In moving average forecasts, the fitted values lag the data by n points.
The best moving-average window for this data is 2 points.
The model was run without passing a value for alpha, using the parameters
‘optimized=True, use_brute=True’.
The auto-fit model picked alpha = 0.0987 as the smoothing parameter.
Simple Exponential Smoothing is suited to a time series with neither trend nor
seasonality, which is not the case with the given data.
The forecasts for manually passed smoothing levels (alpha between 0 and 1) are shown
below.
For alpha closer to 1, the forecast follows the actual observations closely; for alpha
closer to 0, the forecast moves farther from the actuals and the line is smoother.
For Rose, the test RMSE is found to be higher for values closer to zero, which is the
same as in the Simple Average forecast.
The manual alpha = 0.10 and the optimized alpha give similar RMSE values.
The Double Exponential Smoothing model is applicable when the data has a trend but no
seasonality. The Rose data contains a significant trend component and seasonality.
In the first iteration, the smoothing level (alpha) and trend (beta) were searched
iteratively over values from 0.1 to 1, and the best combination, alpha = 0.1 and
beta = 0.1, was chosen based on the RMSE values.
In the second iteration the model was allowed to choose the optimized values using the
parameters ‘optimized=True, use_brute=True’.
The auto-fit model has a lower RMSE than the iterative model with alpha = 0.1 and
beta = 0.1.
The Triple Exponential Smoothing model (Holt-Winters model) is applicable when the data
has both trend and seasonality. The Rose data contains significant trend and
seasonality.
In the first iteration, the smoothing level (alpha), trend (beta) and seasonality
(gamma) were searched iteratively over values from 0.1 to 1, and the best combination,
alpha = 0.1, beta = 0.2 and gamma = 0.3, was chosen based on the RMSE values.
In the second iteration the model was allowed to choose the optimized values using the
parameters ‘optimized=True, use_brute=True’.
The auto-fit model returned a higher RMSE than the iterative model with alpha = 0.1,
beta = 0.2 and gamma = 0.3.
Model Comparison:
From the comparison of accuracy values and the plot, it can be inferred that Triple
Exponential Smoothing is the best model, with trend and seasonality components that fit
the test data well.
The 2-point trailing moving average model is also found to fit well, with a slight lag
on the test dataset.
Solution:
The Augmented Dickey-Fuller (ADF) test is a statistical test for the stationarity of a
time series. It tests for the presence of a unit root in the series to determine
whether the series is stationary.
Null Hypothesis: The series has a unit root, that is, the series is non-stationary.
Alternate Hypothesis: The series has no unit root, that is, the series is stationary.
If we fail to reject the null hypothesis, we treat the series as non-stationary; if we
reject the null hypothesis, we conclude that the series is stationary.
The ADF test on the original Rose series returned the values below; the p-value is
greater than alpha = 0.05, so we fail to reject the null hypothesis.
Differencing of order one is applied to the Rose series and tested for stationarity. At
differencing order 1, the series is found to be stationary, as below.
The rolling mean and standard deviation are also plotted to understand the seasonal
component and to ascertain whether it is multiplicative or additive in character.
The level of the rolling mean and standard deviation changes with the slope, which
indicates multiplicative behaviour.
The ADF test is also run in this exercise on the log-transformed train data with
seasonal differencing of order 12, to understand whether removing the multiplicative
character of the seasonal component affects the accuracy of the model.
Solution:
An ARIMA model was built with automated parameter selection; the lowest AIC, 1276, was
found at order (0, 1, 2).
As the Rose series contains a seasonal component, the ARIMA model does not perform
well. The RMSE for this auto-ARIMA model is 15.63.
The model was built on the train data with a seasonal period of 12 and different
(p, d, q)x(P, D, Q) parameter combinations; the lowest AIC, 774.97, was obtained at
(0, 1, 2)x(2, 1, 2, 12).
The model was built with the above parameters.
The diagnostic plots of the model were produced. The standardized residuals fluctuate
around a mean of zero, and the histogram shows that the residuals follow a normal
distribution.
The Normal Q-Q plot also shows that the quantiles come from a normal distribution, as
the points form roughly a straight line.
The correlogram shows the autocorrelation of the residuals; no lags are significant
beyond the confidence band.
The RMSE of the automated SARIMA model is 16.53.
The model was built on the log-transformed train data with a seasonal period of 12 and
different (p, d, q)x(P, D, Q) parameter combinations; the lowest AIC, -247.08, was
obtained at (0, 1, 1)x(1, 0, 1, 12).
The model was built with the above parameters.
The diagnostic plots of the model were produced. The standardized residuals fluctuate
around a mean of zero, and the histogram shows that the residuals follow a normal
distribution.
The Normal Q-Q plot also shows that the quantiles come from a normal distribution, as
the points form roughly a straight line.
The correlogram shows the autocorrelation of the residuals; no lags are significant
beyond the confidence band.
From the above model summary it can be inferred that the MA.L1, AR.L.S12 and MA.L.S12
terms have the largest absolute coefficients.
From the p-values it can be inferred that the MA.L1, AR.L.S12 and MA.L.S12 terms are
significant, as their p-values are below 0.05.
The RMSE of the automated SARIMA model on the log series is 17.93.
The model built on the log series has a higher RMSE than the model built on the
original train data.
Solution:
The RMSE of the manual ARIMA model is 84.16. Since the ARIMA model does not capture the
seasonality, it does not perform well.
From the ACF plot of the observed train data, it can be inferred that the
autocorrelation does not taper off quickly at seasonal intervals of 12, so seasonal
differencing of order 12 is required.
An ADF test is needed to check stationarity after the above differencing. With a
p-value below alpha = 0.05 and a test statistic below the critical values, it can be
confirmed that the data is stationary.
ACF and PACF plots of the seasonally differenced and first-order differenced data are
created to find the values for (p, d, q)x(P, D, Q).
Solution:
Triple Exponential Smoothing (Holt-Winters) with alpha = 0.1, beta = 0.2 and
gamma = 0.3 is found to be the best model, followed by the 2-point trailing moving
average model.
Solution:
Based on the overall model evaluation and comparison, Triple Exponential Smoothing
(Holt-Winters) is selected for the final prediction 12 months into the future.
The TES model with alpha = 0.1, beta = 0.2, gamma = 0.3, trend ‘additive’ and seasonal
‘multiplicative’ is found to be the best model in terms of accuracy scored against the
full data.
The model predicts a continuation of the trend in sales and of the seasonality in
year-end sales.
The prediction shows the downward trend stabilizing, as sales will be almost the same
as in the previous observed year.
The RMSE of the TES model on the entire dataset is 17.88.
Please explain and summarise the various steps performed in this project. There should be
proper business interpretation and actionable insights present.
Solution:
The model forecasts sales of 585 units of Rose wine over the next 12 months, an average
of about 49 units per month.
The seasonal sales are forecast to peak at 82 units in December 1995, before dropping
to the lowest level, 30 units, in January 1996.
Unlike Sparkling wine, Rose wine sells in very low volumes and the standard deviation
is only 12.75, which means that the higher seasonal demand does not materially affect
procurement and production.
The ABC wine estate should investigate the low demand for Rose wine in the market and
take corrective action in marketing and promotions.