Tutorial 9 - Solutions
1. The us_employment file contains the total employment in different industries in the
United States. Using the industry: "Leisure and Hospitality":
a. Plot the data. Do the data need transforming? If so, find a suitable
transformation.
This clearly needs transforming – the variation at the beginning of the series is small
and increases significantly over time. We can use a log transformation, or a Box-Cox
transformation with lambda chosen by the Guerrero method.
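The code that loads the series is not shown in the handout; a minimal sketch is given below, assuming the fpp3 package is loaded and that leisure is obtained by filtering us_employment on the "Leisure and Hospitality" title (the object name leisure is the one used in the code that follows).
library(fpp3)

leisure <- us_employment |>
  filter(Title == "Leisure and Hospitality")

# Plot the raw series: the seasonal variation grows with the level of the series
leisure |> autoplot(Employed)

# A log transformation stabilises the variance; alternatively, estimate a
# Box-Cox lambda with the Guerrero method
leisure |> autoplot(log(Employed))
leisure |>
  features(Employed, guerrero) |>
  pull(lambda_guerrero)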
leisure |>
autoplot(log(Employed) |> difference(lag=12) |> difference())
The double differenced logged data is close to stationary, although the variance has
decreased over time.
leisure |>
gg_tsdisplay(log(Employed) |> difference(lag=12) |> difference(), plot_type = "partial")
Using our guide for identifying terms – examining the ACF plot (for MA terms) we
can see that the last significant spike in the early lags is at lag 4, and that we have a
single seasonal spike at lag 12. Recall that we differenced the data twice – first and
seasonal differences. So a suggested model could be ARIMA(0,1,4)(0,1,1). Note that
the order in each set of brackets is (AR, I, MA), which we denote by (p,d,q) for the
non-seasonal part and (P,D,Q) for the seasonal part. The non-seasonal part is the first
set of brackets, (0,1,4) in this case: p=0 because we are using a pure MA model for the
non-seasonal part (so AR(0)); d=1 because the data have been differenced once; and q=4
because the last significant spike among the early lags is at lag 4 (there are also
significant spikes beyond lag 4, but we only count the early lags here).
The seasonal part is the second set of brackets, (0,1,1) in this case: again, P=0
because we are using a pure MA model for the seasonal part (so AR(0)); D=1 because the
data have been seasonally differenced once; and Q=1 because there is only one
significant seasonal spike, at lag 12 – it would have been Q=2 if there were spikes at
both lags 12 and 24.
Alternatively, and in a similar way, examining the PACF plot (for AR terms), we can
see that the last significant spike in the early lags is at lag 3, suggesting a
non-seasonal AR(3), and that there are seasonal spikes at lags 12 and 24, suggesting a
seasonal AR(2). Together this gives ARIMA(3,1,0)(2,1,0). You can experiment with
different models as this is just a guide.
There is significant autocorrelation at a few lags (for white noise we would expect
only about 5% of spikes – roughly one of the lags shown – to lie outside the bounds).
We will now use the ARIMA() function to automatically select a model:
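The model-fitting code is not shown in the handout; a sketch along the following lines (the model names are our own labels) refits the two candidate models suggested above alongside an automatically selected one:
fit_leisure <- leisure |>
  model(
    arima014011 = ARIMA(log(Employed) ~ pdq(0, 1, 4) + PDQ(0, 1, 1)),
    arima310210 = ARIMA(log(Employed) ~ pdq(3, 1, 0) + PDQ(2, 1, 0)),
    auto = ARIMA(log(Employed))
  )
fit_leisure |> select(auto) |> report()  # which model did ARIMA() choose?
glance(fit_leisure)                      # compare AICc across the candidates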
a. Plot the data. Do the data need transforming? If so, find a suitable
transformation.
The trend and seasonality show that the data are not stationary.
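The plotting and transformation code is not shown; a sketch follows, assuming the series is Electricity from aus_production and that lambda (used in the code below) is again chosen with the Guerrero method:
aus_production |> autoplot(Electricity)

lambda <- aus_production |>
  features(Electricity, guerrero) |>
  pull(lambda_guerrero)

aus_production |> autoplot(box_cox(Electricity, lambda))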
We will try seasonal differencing – with lag 4 in this case, as the data are quarterly.
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4), plot_type = "partial")
It seems we could have continued with only the seasonal difference, but we will also
take a first-order difference to make the series more clearly stationary.
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1), plot_type = "partial")
Taking the first difference as well has only a slight effect on stationarity.
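As an optional check that is not part of the original solution, the unit-root features in the feasts package can be used to see how many differences are suggested; the calls below are a sketch using the lambda defined above:
# How many seasonal differences are suggested for the transformed series?
aus_production |>
  features(box_cox(Electricity, lambda), unitroot_nsdiffs)

# After the seasonal difference, how many first differences are suggested?
aus_production |>
  features(box_cox(Electricity, lambda) |> difference(4), unitroot_ndiffs)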
Using our guide for identifying terms – examining the ACF plot (for MA terms) we
can see that there is a significant spike in the early lags at lag 1, and that we
have seasonal spikes at lags 4 and 8 (remember the data are quarterly). Recall
that we differenced the data twice – first and seasonal differences. So a
suggested model could be ARIMA(0,1,1)(0,1,2). Note that the spike at lag 4 could
also be treated as non-seasonal, so a non-seasonal MA(4) is another option.
Examining the PACF plot (for AR terms), we can see that there is a spike in the
early lags at lag 1, so we use a non-seasonal AR(1), and that there are seasonal
spikes at lags 4, 8, 12 and 16, so we can use a seasonal AR(4). Together this gives
ARIMA(1,1,0)(4,1,0). You can experiment with different models as this is just a
guide, for example ARIMA(1,1,0)(0,1,2). We will also use the automatic selector to
compare models (a sketch of the code is given below).
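The comparison code is not shown in the handout; a sketch in the same style as the Gas question below (the model names are our own labels):
fit_elec <- aus_production |>
  model(
    arima011012 = ARIMA(box_cox(Electricity, lambda) ~ pdq(0, 1, 1) + PDQ(0, 1, 2)),
    arima110410 = ARIMA(box_cox(Electricity, lambda) ~ pdq(1, 1, 0) + PDQ(4, 1, 0)),
    auto = ARIMA(box_cox(Electricity, lambda))
  )
fit_elec |> select(auto) |> report()
glance(fit_elec)  # compare AICc across the candidates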
Automatic model selection has also taken a first-order difference, so we can
compare the AICc values (AICc cannot be used to compare models with different
orders of differencing). ARIMA(1,1,4)(0,1,1) was selected as it has the lowest AICc.
d. Examine the residuals, do they resemble white noise? If not, try to find
another ARIMA model which fits better.
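The diagnostic code that produced the output below is not included in the handout; it would have been something along these lines (the lag and dof values here are illustrative assumptions):
fit_elec |> select(auto) |> gg_tsresiduals()

augment(fit_elec) |>
  filter(.model == "auto") |>
  features(.innov, ljung_box, lag = 24, dof = 6)  # lag/dof are illustrative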
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 auto 8.55 0.201
The Ljung-Box test has a large p-value of 0.201 (larger than 5%), so we cannot reject
the null hypothesis (H0) that the residuals are white noise. This is consistent with
the residual plots above.
#Q2f
aus_production |> autoplot(Gas)
#Clearly a transformation is needed
lambda <- aus_production |>
features(Gas, guerrero) |>
pull(lambda_guerrero)
aus_production |>
autoplot(box_cox(Gas, lambda))
view(lambda)
# Variance seems more stable after the Box-Cox transformation, with lambda of 0.12
# (this value is also stored as lambda, visible in the Environment pane).
#b
#data will need differencing
aus_production |>
gg_tsdisplay(box_cox(Gas, lambda) |> difference(4) |> difference(1), plot_type = "partial")
#first and seasonal differencing make it stationary
#c
# Try different models: looking at the ACF and PACF we can try arima013011 or arima310210
fit <- aus_production |> model(
arima013011 = ARIMA(box_cox(Gas, lambda) ~ 0 + pdq(0, 1, 3) + PDQ(0, 1, 1)),
arima310210 = ARIMA(box_cox(Gas, lambda) ~ 0 + pdq(3, 1, 0) + PDQ(2, 1, 0)),
auto = ARIMA(box_cox(Gas, lambda))
)
fit |> select(auto) |>
report()
#Auto selection is ARIMA(2,1,2)(1,1,1)[4]
glance(fit)
# We can compare the AIC of the different models as they all have the same order of differencing.
# Our arima013011 is actually a bit better than the auto-selected one (it has a lower AIC).
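# Optional extra (not in the original handout): the residuals of the preferred
# model can be checked in the same way as for the Electricity series.
fit |> select(arima013011) |> gg_tsresiduals()
augment(fit) |>
  filter(.model == "arima013011") |>
  features(.innov, ljung_box, lag = 24, dof = 4)  # illustrative lag/dof values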