Additional Notes 3 - Forecasting Model Performance
There are many statistical measures that describe how well a model fits a given sample of data.
However, this goodness-of-fit approach often uses residuals and does not really reflect the
capability of the forecasting technique to successfully predict future observations. The user of the
forecasts is very concerned about the accuracy of future forecasts, not model goodness of fit, so it
is important to evaluate this aspect of any recommended technique.
Sometimes forecast accuracy is called out-of-sample forecast error, to distinguish it from the
residuals that arise from a model-fitting process.
NOTE: MAD = √((2/π) · MSE) if the forecast errors are normally distributed.
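As a quick numerical check of this relationship, the sketch below compares MAD with √((2/π) · MSE) using simulated normally distributed errors (no particular data set is assumed):

```python
# Check the MAD-MSE relationship for normally distributed forecast errors.
import numpy as np

rng = np.random.default_rng(42)
e = rng.normal(loc=0.0, scale=2.0, size=100_000)  # simulated forecast errors

mse = np.mean(e**2)                 # mean squared error
mad = np.mean(np.abs(e))            # mean absolute deviation
print(mad, np.sqrt(2.0 / np.pi * mse))   # the two values should be very close
```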
If a time series consists of uncorrelated observations and has constant variance, we say that it is white noise. If, in addition, the observations in this time series are normally distributed, the time series is Gaussian white noise. Ideally, forecast errors are Gaussian white noise.
If a time series is white noise, the distribution of the sample autocorrelation coefficient at lag k in large samples is approximately normal with mean zero and variance 1/T, i.e., r_k ~ N(0, 1/T).
Therefore we could test the hypothesis H0: ρ_k = 0 using the test statistic

$$Z_0 = \frac{r_k}{\sqrt{1/T}} = r_k \sqrt{T}$$
This procedure is a one-at-a-time test; that is, the significance level applies to the autocorrelations
considered individually.
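A minimal sketch of this one-at-a-time test is given below. The series y is a simulated placeholder (replace it with your own data), and the estimator shown is the usual sample autocorrelation coefficient:

```python
# One-at-a-time test of H0: rho_k = 0 using Z0 = r_k * sqrt(T).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=200)            # placeholder series (white noise here)
T = len(y)

def sample_acf(y, k):
    """Sample autocorrelation coefficient r_k at lag k."""
    ybar = y.mean()
    num = np.sum((y[:-k] - ybar) * (y[k:] - ybar))
    den = np.sum((y - ybar) ** 2)
    return num / den

for k in range(1, 6):
    r_k = sample_acf(y, k)
    z0 = r_k * np.sqrt(T)           # approximately N(0, 1) under H0
    print(f"lag {k}: r_k = {r_k:+.3f}, Z0 = {z0:+.2f}, |Z0| > 1.96? {abs(z0) > 1.96}")
```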
We are often interested in evaluating a set of autocorrelations jointly to determine if they indicate
that the time series is white noise. Box and Pierce (1970) have suggested such a procedure.
Consider $Z_0^2 = T r_k^2$; it is approximately $\chi^2_{(1)}$. The Box-Pierce statistic

$$Q_{BP} = T \sum_{k=1}^{K} r_k^2$$

is distributed approximately as $\chi^2_{(K)}$ under the null hypothesis that the time series is white noise.
When this test statistic is applied to a set of residual autocorrelations, the statistic is $Q_{BP} \sim \chi^2_{(K-p)}$, where p is the number of parameters in the model. Box and Pierce call this procedure a portmanteau or general goodness-of-fit statistic – it is testing the goodness of fit of the autocorrelation function to the autocorrelation function of white noise.
A modification of this test that works better for small samples was devised by Ljung and Box (1978).
The Ljung-Box goodness-of-fit statistic is

$$Q_{LB} = T(T+2) \sum_{k=1}^{K} \left(\frac{1}{T-k}\right) r_k^2$$
The Ljung-Box statistic is very similar to the original Box-Pierce statistic, the difference being that the squared sample autocorrelation at lag k is weighted by (T + 2)/(T − k). For large T, these weights will be approximately unity, and so the Q_LB and Q_BP statistics will be very similar.
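Both portmanteau statistics are available in statsmodels via acorr_ljungbox. The sketch below uses simulated residuals as a placeholder, with model_df set to the number of fitted parameters p (zero here, since nothing was actually fitted):

```python
# Joint (portmanteau) tests of residual autocorrelations.
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(7)
resid = rng.normal(size=144)        # placeholder residuals (white noise here)

# Q_LB and Q_BP up to K = 12 and K = 24 lags; chi-square df reduced by model_df = p.
result = acorr_ljungbox(resid, lags=[12, 24], boxpierce=True, model_df=0)
print(result)                       # columns: lb_stat, lb_pvalue, bp_stat, bp_pvalue
```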
CHOOSING BETWEEN COMPETING MODELS
Selecting the model that provides the best fit to historical data generally does not result in a
forecasting method that produces the best forecasts of new data. Concentrating too much on
the model that produces the best historical fit often results in overfitting, or including too many
parameters or terms in the model just because these additional terms improve the model fit.
In general, the best approach is to select the model that results in the smallest standard deviation
(or mean squared error) of the one-step-ahead forecast errors when the model is applied to data
that was not used in the fitting process. Some refer to this as an out-of-sample forecast error
standard deviation (or mean squared error). A standard way to measure this out-of-sample
performance is by utilizing some form of data splitting; that is, divide the time series data into
two segments – one for model fitting and the other for performance testing. Sometimes data
splitting is called cross-validation.
It is somewhat arbitrary as to how the data splitting is accomplished. However, a good rule of thumb is to have at least 20 or 25 observations in the performance testing data set.
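A minimal data-splitting sketch follows. The series and the one-step-ahead forecasts are placeholders (any forecasting method can supply them), and the accuracy measures are the usual ones:

```python
# Hold out the last 24 observations and summarize the one-step-ahead forecast errors.
import numpy as np

def forecast_error_summary(actual, forecast):
    """Common out-of-sample accuracy measures for one-step-ahead forecasts."""
    e = actual - forecast
    return {
        "ME":   np.mean(e),                          # mean error (bias)
        "MAD":  np.mean(np.abs(e)),                  # mean absolute deviation
        "MSE":  np.mean(e**2),                       # mean squared error
        "RMSE": np.sqrt(np.mean(e**2)),              # root mean squared error
        "MAPE": 100 * np.mean(np.abs(e / actual)),   # mean absolute percent error
    }

y = np.arange(144, dtype=float)          # placeholder series
train, test = y[:-24], y[-24:]           # fit on `train`, evaluate on `test`
one_step_forecasts = test + 1.0          # placeholder forecasts for illustration
print(forecast_error_summary(test, one_step_forecasts))
```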
The residual mean square is

$$s^2 = \frac{\sum_{t=1}^{T} e_t^2}{T - p}$$

where p is the number of parameters in the fitted model.
R-Squared Statistic
$$R^2 = 1 - \frac{\sum_{t=1}^{T} e_t^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2}$$
Large values of R² suggest a good fit to the historical data. Because the residual sum of squares always decreases when parameters are added to a model, relying on R² to select a forecasting model encourages overfitting, or putting in more parameters than are really necessary to obtain good forecasts. A large value of R² does not ensure that the out-of-sample one-step-ahead forecast errors will be small.
$$R^2_{\text{Adj}} = 1 - \frac{\sum_{t=1}^{T} e_t^2 / (T - p)}{\sum_{t=1}^{T} (y_t - \bar{y})^2 / (T - 1)} = 1 - \frac{s^2}{\sum_{t=1}^{T} (y_t - \bar{y})^2 / (T - 1)}$$
The adjustment is a size adjustment – that is, it adjusts for the number of parameters in the model. Note that a model that maximizes the adjusted R² statistic is also the model that minimizes the residual mean square.
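The sketch below computes s², R², and the adjusted R² from the residuals of a fitted model; y, resid, and p are placeholders for the data, the residuals, and the parameter count:

```python
# Fit statistics from residuals of a model with p parameters.
import numpy as np

def fit_statistics(y, resid, p):
    """Return s^2, R^2, and adjusted R^2."""
    T = len(y)
    sse = np.sum(resid**2)                   # residual sum of squares
    sst = np.sum((y - np.mean(y))**2)        # total sum of squares about the mean
    s2 = sse / (T - p)                       # residual mean square
    r2 = 1.0 - sse / sst                     # R-squared
    r2_adj = 1.0 - s2 / (sst / (T - 1))      # adjusted R-squared
    return s2, r2, r2_adj

# Placeholder example: a "model" that simply uses the series mean (p = 1).
rng = np.random.default_rng(3)
y = rng.normal(loc=100.0, scale=5.0, size=60)
resid = y - y.mean()
print(fit_statistics(y, resid, p=1))
```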
Akaike Information Criterion (AIC)
These two criteria – the AIC and the Schwarz information criterion (SIC) – penalize the sum of squared residuals for including additional parameters in the model. Models that have small values of the AIC or SIC are considered good models.
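The formulas themselves are not reproduced in these notes; one common sum-of-squares formulation (an assumption here, not necessarily the exact form intended) is:

```latex
% One common formulation (assumed; T = sample size, p = number of model parameters):
\mathrm{AIC} = \ln\!\left(\frac{\sum_{t=1}^{T} e_t^{2}}{T}\right) + \frac{2p}{T}
\qquad
\mathrm{SIC} = \ln\!\left(\frac{\sum_{t=1}^{T} e_t^{2}}{T}\right) + \frac{p \ln T}{T}
```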
One way to evaluate model selection criteria is in terms of consistency. A model selection
criterion is consistent if it selects the true model when the true model is among those considered
with probability approaching unity as the sample size becomes large, and if the true model is not
among those considered, it selects the best approximation with probability approaching unity as
the sample size becomes large.
• All of s², the adjusted R² statistic, and the AIC are inconsistent, because they do not penalize for adding parameters heavily enough. Relying on these criteria tends to result in overfitting.
• The SIC, which carries a heavier size adjustment penalty, is consistent.
Consistency, however, does not tell the complete story. It may turn out that the true model and
any reasonable approximation to it are very complex. An asymptotically efficient model selection
criterion chooses a sequence of models as T (the amount of data available) gets large for which
the one-step-ahead forecast error variances approach the one-step-ahead forecast error
variance for the true model at least as fast as any other criterion. The AIC is asymptotically
efficient but the SIC is not.
Remarks:
• Sometimes we see the first term in the AIC, AICC, or SIC written as −2 ln L(β, σ²), where L(β, σ²) is the likelihood function for the fitted model evaluated at the maximum likelihood estimates of the unknown parameters β and σ². In this context, AIC, AICC, and SIC are called penalized likelihood criteria.
• When both AIC and SIC are available, we prefer using SIC. It generally results in smaller, and hence simpler, models, and so its use is consistent with the time-honored model-building principle of parsimony.
• Nevertheless, the best way to evaluate a candidate model's potential predictive
performance is to use data splitting. This will provide a direct estimate of the one-step-ahead
forecast error variance.
ADDITIONAL e-VIDEO RESOURCES:
Forecasting (7): Forecast accuracy measures (MSE, RMSE, MAD & MAPE) (youtube.com)
https://www.youtube.com/watch?v=0vtRKLVNhQ8
HANDS-ON EXERCISE
Using the airline.csv data, assess the forecast model performance of the best-fitting Holt-Winters
(multiplicative seasonals) model. *See Chapter 3 slide deck, last example.
RECALL: The airline.csv data contains the number of international passenger bookings (in thousands) per month on an airline (Pan Am) in the United States, obtained from the Federal Aviation Administration for the period 1949–1960 (Brown, 1963).
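A possible starting point for this exercise is sketched below, assuming airline.csv contains a single monthly series in a column named "passengers" (the file layout and column name are assumptions). It fits a Holt-Winters model with multiplicative seasonality to a training split and summarizes the hold-out forecast errors; a rolling one-step-ahead evaluation could be substituted for the multi-step forecast shown:

```python
# Out-of-sample evaluation of a Holt-Winters (multiplicative seasonality) model.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

data = pd.read_csv("airline.csv")
y = data["passengers"].astype(float).values       # monthly bookings, 1949-1960 (column name assumed)

# Data splitting: fit on all but the last 24 months, test on the final 24.
train, test = y[:-24], y[-24:]

# Holt-Winters with multiplicative seasonality (additive trend assumed here).
model = ExponentialSmoothing(train, trend="add", seasonal="mul",
                             seasonal_periods=12).fit()
forecasts = model.forecast(24)                     # forecasts for the hold-out period

errors = test - forecasts
print("RMSE:", np.sqrt(np.mean(errors**2)))
print("MAD: ", np.mean(np.abs(errors)))
print("MAPE:", 100 * np.mean(np.abs(errors / test)))
```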