MATH545-Time Series
MATH545-Time Series
Course notes by
Léo Raymond-Belzile
[email protected]
1 Introduction 5
1.1 Simple stationary models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Trends and Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2 Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.3 Non-parametric trend removal . . . . . . . . . . . . . . . . . . . . . . 15
1.2.4 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.5 Assessing the white noise assumption . . . . . . . . . . . . . . . . . . 18
1.2.6 Some general results and representations for stationary processes . . . 19
1.3 Autoregressive Time Series Processes . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.1 Autoregressive model of order p (AR(p)) . . . . . . . . . . . . . . . . . 26
1.4 Moving Average Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.4.1 MA(q) Process as an AR process . . . . . . . . . . . . . . . . . . . . . 29
1.5 Forecasting Stationary Processes . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.6 Partial autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6.1 Wold Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2 ARMA models 42
2.1 Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.1 ARMA(p, q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.2 Autocovariance function . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.3 Autocovariance Generating function (ACVGF) . . . . . . . . . . . . . 48
2.1.4 Forecasting for ARMA(p, q) . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2 Estimation and model selection for ARMA(p,q) . . . . . . . . . . . . . . . . . 52
2.2.1 Moment-based estimation . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . 53
2.2.3 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2
3 Non-Stationary and Seasonal Models 58
3.1 ARIMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Unit roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Seasonal ARIMA models (SARIMA) . . . . . . . . . . . . . . . . . . . . . . . 61
4 State-space models 63
4.1 State-Space Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Basic Structural Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Filtering and Smoothing: the Kalman Filter . . . . . . . . . . . . . . . . . . . 72
3
Foreword
The objectives of time series analysis is the modeling of sequences of random variables
by time, as X1 , X2 , . . . , Xt based on data x1 , . . . , xn . Typically, (X1 , . . . , Xn ) will not be
mutually independent and n is large1 . There is much greater interest in forecasting and
prediction: given data x1 , . . . , xn , we wish to make statements about Xn+1 , Xn+2 , . . .
Definition (Time series)
A time series model is a probabilistic representation describing {Xt } via their joint distri-
bution or the moments, in particular expectation, variance and covariance.
It will often not be possible to specify a joint distribution for the observations. Also, one
will want to impose restrictions on moments, which we refer to as simplifying assumptions.
Note
1. Xt can be a discrete or continuous random variable
2. Xt could be vector-valued.
3. The time index will most typically be discrete and represent constant time spacing -
it could also be necessary to consider continuous-time indexing.
1 That is usually greater than 50, contrary to longitudinal statistics where the number of time observations
is small
4
Chapter 1
Introduction
Section 1.1: Simple stationary models
The joint CDF of (X1 , . . . , Xn ) gives one description of the probabilistic relationship between
the random variables (RVs). Using the standard notation FX1 ,...,Xn (x1 , . . . , xn ) = P(X1 , ≤
x1 , X2 ≤ x2 , . . . , Xn ≤ xn ), this function fully specifies the joint probability model for the
RVs. If Xt is vector-valued, denoted X t = (Xt1 , . . . , Xtk )> (namely a k × 1 vector), then
“Xt ≤ xt ” is to be interpreted component wise: Xt1 ≤ xt1 , Xt2 ≤ xt2 , . . . , Xtk ≤ xtk . A
special case of this is mutual independence.
n
Y
X1 , . . . , Xn are mutually independent ⇔ P(X1 ≤ x1 , . . . , Xn ≤ xn ) = P(Xt ≤ xt ),
t=1
that is the observed values of X1 , X2 , . . . , Xt−1 contain no information for modeling (or
predicting) Xt . Beyond independence, at the other extreme, specifying a completely gen-
eral joint model requires the specification of one marginal distribution and n − 1 differ-
ent conditional models2 . Often, modeling is restricted to specification of moments of the
process. Most typically, attention focuses on the first two moments (expectation and vari-
ance/covariance).
The emphasis of the lecture will be on stationarity models, which we will deal through in
the beginning of the course. The concept is one to do with stability. Some process we dealt
with last class were stable in mean or in variance or exhibiting periodic structure.
Definition 1.1 (Stationarity)
A time series process is stationary if {Xt , t ∈ Z} and {Xt+h , t ∈ Z} for h = ±1, ±2, . . . ,
have the precisely the same statistical properties (either in terms of joint distribution or
moment properties).
In practical terms, t is arbitrary and it doesn’t matter where we start collecting, we will
always be able to dip in the stationary model. For most of the course, we will be dealing
with weak stationarity, or moment stationarity, as joint distribution is intractable.
Definition 1.2 (Mean and covariance function)
Let {Xt } be a time series process with E Xt2 < ∞ ∀ t. Then µX (t) = E (Xt ) is the mean
function and γ̃X (t, s) = Cov(Xt , Xs ) = E ((Xt − µX (t))(Xs − µX (s))) for integers t, s is the
covariance function.
2 This procedure can be intractable (if one is interested in a specific period and the implications of the
model for that given interval) and may not be feasible analytically.
5
Definition 1.3 (Weak stationarity)
We say {Xt } is weakly stationary if
A stronger form of stationarity imposes conditions on the form of the joint distribution of
the process.
Definition 1.4 (Strong stationarity)
{Xt } is strongly stationarity if (Xt+1 , . . . , Xt+n ) and (Xt+h+1 , . . . , Xt+h+n ). have the
same joint distribution for all t, h, n.
Clearly, the second inequality implies the second, although it does not guarantee that the
first two moments exist. Gaussian process is an example of the latter stronger conditions,
with multivariate and a specific structure for the covariance matrix and the mean vector.
For most part, we will be dealing with weak stationarity.
Definition 1.5 (Autocovariance function)
For a weakly stationary process, define the autocovariance function (ACVF) by
3
and the autocorrelation function (or ACF) by ρX (h) = γX (h)/γX (0) for h ∈ Z.
So far, we have defined only properties of the process, not the data. For the moment,
these are nonparametrically specified functions. We have by definition that γX (h) =
Cov(Xt+h , Xt ) and γX (0) = Cov(Xt , Xt ) = Var(Xt ) and ρX (h) = Cor(Xt+h , Xt ) for all
t, h.
Some examples to illustrate the construction of these functions.
Example 1.1 (IID process)
Suppose we have an IID process where we suppose {Xt } is a binary process, i.e. Xt ∈ {−1, 1}
such that P(Xt = 1) = p and P(Xt = −1) = 1−p such that the Xt are mutually independent.
We can easily compute the moments, as
E (Xt ) = p + (1 − p)(−1) = 2p − 1
and
Var(Xt ) = [p(1)2 + (1 − p)(−1)2 ] − (2p − 1)2 = 1 − (1 − 2p)2 .
By independence, Cov(Xt , Xs ) = 0 if t 6= s.
3 This function of h will be bounded on [−1, 1].
6
Example 1.2 (Random walk)
A more interesting process is the random walk, where {Xt } is the same as in the previous
example with the IID process and {St } is a process constructed from {Xt }, defined by
Pt
St = i=1 Xi = St−1 + Xt assuming that t ≥ 1 and the process start at 0 (that is S0 = 0).
Although the process is easy to construct, it has some interesting properties. If we take
the simple case where p = 1/2, then E (Xt ) = 0 and Var(Xt ) = 1 and therefore E (St ) = 0,
but Var(St ) = t. Since the variance of sum of IID random variable is the sum of the
variance of each elements, so that the variance increase linearly with t and our process is
not stationarity.
If p 6= 1/2, then E (Xt ) = 2p − 1 and E (St ) = (2p − 1)t and St also grows linearly with t
and Var (Xt ) = 1 − (1 − 2p)2 and Var (St ) = tVar (Xt ). In this case, if p < 1/2 we have a
downward “drift”, while if p > 1/2, we have an upward drift. In all cases, the variance of
St grows linearly with t. As such, {St } is nonstationary (clearly).
Note
We could also define Xt as the difference of the random variable St , that is Xt = St − St−1 .
This gives a key insight on how to turn a nonstationary process into a stationary one, in
this case by differencing.
Example 1.3 (Stationary Gaussian process)
Suppose, for all finite collections X1 , . . . , Xn , the joint distribution is multivariate Gaussian
(Normal) with E (Xt ) = µX (not depending on t) and covariance defined for Xt , Xs as
Cov(Xt , Xs ) = γX (|t − s|) for t, s ∈ {1, . . . , n}.
This stationary version imposes extra conditions, which corresponds to a structured covari-
ance matrix which has fewer than 21 n(n + 1) different elements.
Let ΓX (n) denote the (n × n) matrix with [ΓX (n)]t,s = γX (|t − s|). Then ΓX (n) is a
symmetric positive definite 4 matrix ( ∀ x ∈ Rn , x> ΓX (n)x > 0) with Toeplitz structure,
constant among diagonals,
γX (0) γX (1) ··· ··· γX (n − 1)
..
γ (1)
X γX (0) γX (1) ··· .
.. .. .. ..
ΓX (n) =
γX (2) . . . .
.. .. .. ..
. . .
. γX (1)
γX (n − 1) ··· ··· γX (1) γX (0)
unlike the usual settings, with n free parameters in ΓX (n). We write the vector X =
(X1 , . . . , Xn )> ∼ Nn (µX 1n , ΓX (n)).5 Note that all marginal distributions and all condi-
4 In fact, is non-negative definite, but we will restrict our examples to positive definite matrix
5 Again, these are parameters of the process, not any estimates derived from the data.
7
tional distributions are also multivariate Gaussian. Specifically, the one that we might be
most interested in is p(Xt |X1 , . . . , Xt−1 ), which is univariate Gaussian.
Exercise 1.1
Review the properties of the multivariate Normal and make sure you understand the last
statement and are able to derive it.
The next few examples use similar construction methods, but are more general.
Example 1.4 (General IID noise)
Let {Xt } ∼ IID(0, σ 2 ), so that E (Xt ) = 0 and E Xt2 = σ 2 < ∞ with all Xt mutually
independent. From that, we get that the autocovariance function
σ 2 if h = 0
γX (h) = γ̃X (t + h, t) =
0 if h 6= 0.
and
1 if h = 0
ρX (h) =
0 if h 6= 0.
For most of what we will be doing in the course, we will restrict to white noise. However,
if we wanted to estimate the model using maximum likelihood, we would need additional
requirements (IID).
Example 1.6 (General random walk)
Pt
Let {Xt } ∼ IID(0, σ 2 ) and again {St } is defined by St = i=1 Xi . Using linear properties
of the expectation, E (Xt ) = 0 implies that E (St ) = 0 and E Xt2 = σ 2 and E St2 = tσ 2 as
P 2
t Pt Pt Pi−1
St2 = i=1 Xi = i=1 Xi2 + 2 i=2 j=1 Xi Xj . The first term of the summation has
expectation tσ 2 and the second term has expectation zero from independence. Again, {St }
is nonstationary.
6 Independence would imply uncorrelatedness, but not conversely. This allows us to relax our assump-
tions, as the only calculations we are required to perform are moments so uncorrelatedness is really a moment
requirement.
8
We also have
γ̃X (t + h, t) = Cov(St+h , St )
= Cov(St + Xt+1 + · · · + Xt+h , St )
= Cov(St , St ) + Cov(Xt+1 + · · · Xt+h , St )
h
X
= Cov(St , St ) + Cov(Xt+i , St )
i=1
= Var(St ) + 0
= tσ 2 .
and the function depends on t, but not on h. This calculation uses result for covariances
which says that Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z).
We can easily verify stationarity by calculating the first two moments. By linearity
Cov(Xt+h , Xt ) = E (Xt+h Xt )
Now
Since Zt is a white-noise process, E (Zr Zs ) = σ 2 if r = s and zero otherwise (if r 6= s). Then
σ 2 if h = 0 σ 2 if h = −1
E (Zt+h Zt ) = E (Zt+h Zt−1 ) =
0 if h 6= 0 0 if h 6= −1
9
and
σ 2 if h = 1 σ 2 if h = 0
E (Zt+h−1 , Zt ) = E (Zt+h−1 , Zt−1 ) =
0 if h 6= 1 0 if h 6= 0.
Since γ̃X does not depend on t, this imply {Xt } is stationary, Xt ∼ MA(1) with Var (Xt ) =
γ̃X (t, t) = σ 2 (1 + θ2 ) with autocorrelation function
1
if h = 0
ρX (h) = 2
θ/(1 + θ ) if h = ±1
0 otherwise .
We will see that any process defined as a linear combination of white noise process for
arbitrary lag will be a stationary process.
Example 1.8 (Autoregression)
Let Xt ∼ AR(1). Again, suppose {Zt } ∼ WN(0, σ 2 ), with σ 2 < ∞. Let |φ| < 1 and7
assume {Xt } is a stationary process such that Xt and Xs are uncorrelated if s < t, with
the following recursive definition
8
This can be rewritten Xt − φXt−1 = Zt . Then
Under the assumption of stationarity, we must have E (Xt ) = 0 as this is the only solution
for an arbitrary φ. Under stationarity, we can go directly for the autocorrelation function
and
10
so that γX is an even function. We can take this form here and substitute in for Xt , yielding
where E (Zt Xt−h ) as t > t − h. Hence, using the above recursion, we can get γX (h) =
φγX (h − 1) = φh γX (0) for h > 0 and using symmetry, γX (h) = γX (−h) = φ|h| γX (0).
We now want to find γX (0), equal to
Cov(Xt , Xt ) = E Xt2
= E ((φXt−1 + Zt )(φXt−1 + Zt ))
= φ2 E Xt−1
2
+ 2φE (Xt−1 Zt ) + E Zt2 .
But, by assumption E (Xt−1 Zt ) = 0 (as current X are uncorrelated with future values of
2
Z) and E Xt−1 = γX (0) as {Xt } is stationary. Therefore, γX (0) = φ2 γX (0) + σ 2 . This
implicit formula can be rearranged to get and explicit form for γX (0) = σ 2 /(1 − φ2 ).
It is possible to extend the two processes described above, thus MA(1) can be generalized
to MA(q), of the form
11
Section 1.2: Trends and Seasonality
1.2.1. Trends
For many datasets, it is evident that the observed series does not arise from a process that
is stationary or constant in mean. We might observe that
We might want to decompose the process into a part that has a deterministic mean structure
and on top of that have a stable or stationary component.
Thus, a model of the form Xt = mt + Yt may be appropriate, where mt is a deterministic
function of t and {Yt } is a zero-mean (and often stationary) process. If {Yt } is assumed to
have finite variance, and mt is assumed to have some parametric form, say mt (β), then β
may be estimated using least-squares. One could also use nonparametric or semiparametric
regression methods to estimate the model.
If, say, {Yt } ∼ WN(0, σ 2 ) is assumed then, we can determine the parameter estimate for β
by minimizing the sum of squared residuals errors, that is
n
X
βb = arg min (xt − mt (β))2 ;
β t=1
this is linear (or nonlinear) regression with homoskedastic errors. Optimization is tractable
analytically, (it is just convex optimization). In the easy case, we could have e.g. mt (β) =
β0 + β1 t + β2 t2 and β = (β0 β1 β2 )> can be estimated using ordinary least squares (OLS).
If there was a correlation structure (such as AR) and the assumption of white noise did
not hold, then OLS would not be statistically efficient and weighted least squares would be
preferable.
Example 1.9 (Lake Huron data, example 135 B&D)
In this example Xt is the level (in feet) of Lake Huron from 1875-1972, with n = 198. In
the illustration, a fitted linear trend is presented. One could also try to test for structural
break around observation 20 and fit a model with two plateaus. Assuming mt = β0 + β1 t,
fitting this model is straightforward; in R, the lm function can be used. What we get is
βb0 = 10.204(0.230) and βb1 = −0.024(0.004) with standard errors indicated in parenthesis.
The slope is statistically significant.
We can verify some of the assumptions that we made about {Yt }. Let ybt = xt − mt (β) b =
xt − βb0 − βb1 t. Detrended data, that is {b
yt } looks like a zero mean process - this is a standard
12
residuals plots. We might suspect an increase in variance, but the assumptions are not too
far off. We also need to verify that Yt are uncorrelated, but it turns out that they are not;
yt−1 , ybt ), for t = 2, . . . , n shows that the series exhibit positive correlation
a scatter plot of (b
(0.775). In fact, Yt is not likely a white-noise process.
1.2.2. Seasonality
Many (real) time series are influenced by seasonal factors, which may be due to climate,
calendar, economic cycles or physical factors. In this section, we will consider only season-
ality and not consider trend. This that we are studying is a deterministic seasonality. One
model for “seasonal” behaviour is
Xt = st + Yt , t ∈ Z,
k
X
st = β0 + [β1j cos(λj t) + β2j sin(λj t)]
j=1
where β0 , β11 , β21 , β12 , β22 , . . . are unknown constants, but λ1 < λ2 < · · · < λk are fixed
constants (known).10
Example 1.10 (Accidental Deaths in the US: Jan 1973-Dec 1978)
We have n = 72 monthly observations. Choose k = 2 and set λ1 = 2π/12, which gives a 12
monthly cycle and λ2 = 2π/6 which is a 6 monthly cycle. 11
The next test is to assess whether our assumptions where correct. We will
function. One can make some model selection and perform goodness of fit test, nested structure and F tests
to evaluate the models.
13
◦ try to use AR, MA models for {Yt }.
Before looking at the data, one doesn’t know what the periodicities are and the λ are
unknown. If they are estimated, one has to do NLS, which can be easily implemented as
well. Another example is Montreal gas station prices, which tend to peak on Tuesday, and
exhibit a periodic structure if you demean the data. Generally, to carry out assessment
of model assumptions concerning a stationary process, we may examine standard moment-
based estimators. We are looking for estimators of µ
b = x̄ and for the ACVF, we use
n−|h|
1 X
γ
b(h) = (xt+|h| − x̄)(xt − x̄), −n < h < n.
n t=1
γ
b(h) is a consistent estimator of γ(h), but it is biased. The advantage of using this
estimator is that it does, however, guarantee that Γ γ (|i − j|)]ij is non-singular.
b n = [b
Is zbt an uncorrelated sequence? Apparently not, there is still positive correlation between zbt
and zbt−1 . This means that the AR(1) model is not adequate. At this stage, we could work
with this residual or go back to the model and propose a more complicated model. We try
an AR(2) model, i.e.
Yt = φ1 Yt−1 + φ2 Yt−2 + Zt .
The fit via OLS yields φb1 = 1.002 and φb2 = −0.283. In this case, it is hard to see whether
12 This function would be completely specified as a function of the parameters of the process, while here
14
these values imply stationarity. The resulting residual series {b
zt } formed by taking
The general strategy when faced with a real time-series model: for observed series {Xt }
with realization {xt },
◦ check/model properties of {b
yt } and check whether they are IID/WN, or AR, or MA,
etc.
a) Set
q
1 X
m
bt = xt+j , q+1≤t≤n−q
2q − 1 j=−q
taking a local unweighted average of the points in the vicinity of t. This is a form of “low
pass filtering”. An extension of this is to enlarge the window centered on t, equivalent to
the above by setting aj = 0 if |j| > q, given by
∞
X
m
bt = aj xt−j
=−∞
15
14
and it is evident that the closer to 0, the smoother the curve. An explicit form for m
bt
is
t−2
X
mbt = α(1 − α)j xt−j + (1 − α)t−1 x1
j=0
In the presence of both trend and seasonality, the procedures for pre-processing must be
applied as follows. 15
Pd
If Xt = mt + st + Yt with E (Yt ) = 0, st+d = st , and j=1 sj = 0. The steps are outline
next16
d
1X
sbk = wk − w̄k = wk − wk
d i=1
Step 3 Remove the remaining trend: compute m b ∗t from the de-seasonalized data x∗t = xt −
b t − sbt using a parametric or a non-parametric approach. This yields the decompo-
m
sition Xt = m b ∗t + sbt + Yt
14 This is a preprocessing set, but be careful as you can remove the structure of the data through the
smoothing method.
15 The above averaging does not respect the seasonality, especially if you take terms from the previous
that step.
16
1.2.4. Differencing
Definition 1.6 (Differencing at lag 1)
The lag difference operator,
∇d Xt = Xt − Xt−d = Xt − B d Xt = (1 − B d )Xt
∇2 Xt = (1 − B)(1 − B)Xt
= (1 − 2B + B 2 )Xt
= Xt − 2Xt−1 + Xt−2
= (Xt − Xt−1 ) − (Xt−1 − Xt−2 )
and st − st−d = 0.
18 Y ∗= (Yt − Yt−1 ) is what is obtained. This is different from our first detrending approach, which used
t
a linear parametric formulation for mt . If Yt is white-noise, then Yt∗ is no longer white noise, but MA(1).
The nature of the Yt process will change.
17
1.2.5. Assessing the white noise assumption
Some tests that are commonly used. If {Zt } ∼ WN(0, σ 2 ), then as n → ∞,
· 1
ρbn (h) ∼ N 0, , ∀h≥1
n
where ρbn (h) is the estimator of ρX (h) derived from data Z1 , . . . , Zn . This result is derived
from Central limit theorem.
19
This allows us to carry out pointwise (in h) hypothesis tests of the form
H0 : ρX (h) = 0, h≥1
ρn (1), . . . , ρbn (l)) are asymptotically independent under the white noise assumption.
Also, (b
We can test ρbn (h), h = 1, . . . , l individually or jointly (i.e. simultaneously testing: should
make correction in the critical values in order to account for multiplicity). 20 An approxi-
mate 95% CI for ρX (h) is
1.96
ρbn (h) ± √
n
There are other tests, which also rely on asymptotics. They are presented next.
Proposition 1.8 (Portmanteau test)
Ph
The test statistic is Q = n j=1 {b ρn (j)}2 . Under H0 : ρX (j) = 0 for j = 1, . . . , h and
Q ∼ χ2h .21
h
X ρn (j)}2
{b
QBL = n(n + 2)
j=1
n−j
·
and under H0 , QBL ∼ χ2 (h)
19 In R, we have seen line for pointwise critical values for the test or simultaneous tests for more than one
h.
20 It would also be possible (maybe not in practice) to perform a permutation test for the white-noise
case: if the data is uncorrelated, it will not be impacted if we permute it. One can then recompute the test
statistic for these n! permutations. This yield a discrete distribution on n! elements. One can compare the
true observed value from the sample with the distribution from the permutation.
21 For most of the cases we are dealing with, this is a powerful test for simultaneous testing.
18
Proposition 1.10 (McLeod-Li test)
Let Wt = Zt2 , then compute ρbW
n (h) for {wt }. The test statistic is given by
h
X ρW (j)}2
{b n
QML = n(n + 2)
j=1
n−j
·
and again under H0 , QML ∼ χ2h .22
Theorem 1.12
A function g defined on Z is the autocovariance function (acvf) of a stationary time
series process if and only if g is even and non-negative definite.
Proof Suppose g is the acvf of a stationary process {Xt }. Then g is even by properties
of covariance. Let a = (a1 , . . . , an )> and X 1:n = (X1 , . . . , Xn )> , and let ΓX (n) be
the n × n matrix with (i, j)th element γX (i − j) for i, j = 1, . . . , n and n ≥ 1. By
definition g(i − j) ≡ γX (i − j). Then
n X
n
X
Var a> X 1:n = ai γX (i − j)aj .
i=1 j=1
Conversely, if g is even and non-negative definite, then ΓX (n) formed by setting the
(i, j)th element to g(i − j) is symmetric and non-negative definite23 . So consider {Xt }
the Gaussian process such that X i:n ∼ Nn (0, ΓX (n)).
19
2. Construction of stationary processes
If {Zt } is stationary, then the “filtered” process {Xt } defined y
for q ≥ 0 for some function g is also stationary with Xt and Xs for |t − s| > q
uncorrelated if {Zt } is white-noise. {Xt } is termed “q-dependent” in this case. 24
3. Linear processes
{Xt } is a linear process if, for all t,
∞
X
Xt = ψj Zt−j
j=−∞
P∞
where {Zt } ∼ WN(0, σ 2 ), {ψj } is a sequence of real constants such that j=−∞ |ψj | <
P∞ j
∞. That is Xt = Ψ(B)Zt with Ψ(z) = j=−∞ ψj z , an infinite order polynomial
(generating function).
A linear process is “non-anticipating” or “causal” if ψj = 0 ∀ j < 0.
Note
P∞ P∞
The condition j=−∞ |ψj | < ∞ ensures that j=−∞ ψj Zt−j is well-defined, that is,
M.S. P∞
that the sum converges in mean-square, i.e. in fact Xt = j=−∞ ψj Zt−j that is
∀ t,
Xn 2
lim E Xt − ψj Zt−j = 0.
n→∞
j=−n
In this calculation, ψ(B) is a linear filter acting on {Zt }. From this, we can compute
the moment properties of {Xt } directly.
P∞
Recall Xt = j=−∞ ψj Zt−j . Note that
p q
E (|Zt |) = E |Zt |2 = E (Zt2 ) = σ
24 Filtering generally means all observations up to current time in such case.
20
and
∞
X
E (|Xt |) = E ψj Zt−j
j=−∞
∞
X
≤ E |ψj ||Zt−j |
j=−∞
∞
X
= |ψj |E|Zt−j |
j=−∞
X∞
≤σ |ψj | < ∞.
j=−∞
We thus have a finite variance process and we have an expression for the variance of Xt in
terms of the ψj . We can also, perhaps more importantly, compute the autocovariance, in
exactly the same fashion as we did the previous calculation.
21
!
∞
X ∞
X
E (Xt+h Xt ) = E ψj Zt+h−j ψk Zt−k
j=−∞ k=−∞
∞
X ∞
X
= ψj ψk E (Zt+h−j Zk )
j=−∞ k=−∞
X∞
2
= ψj ψj−h E Zt+h−j
j=−∞
X∞
2
=σ ψj ψj+h ≡ γX (h).
j=−∞
P∞
Note that if {Yt } is stationary with acvf γY (h), then if Xt = ψ(B)Yt = j=−∞ ψj Yt−j ,
we can replace {Zt } in the previous calculations to obtain results for E (Xt ) , Var (Xt ) and
γX (h). In particular, if E (Yt ) = 0, then E (Xt ) = 0 and γX (h) can be written as
∞
X ∞
X
γX (h) = ψj ψk γY (h − j + k).
j=−∞ k=−∞
22
stationary, E (Xt ) = 0 and again by our earlier results,
∞
X ∞
X
ψj ψj+h E Zt2 = σ 2 φj φj+h
γX (h) =
j=0 j=0
P∞
and so γX (h) = σ 2 φh j=0 φ2j = σ 2 φh /(1 − φ2 ).
Note
We may formally write that Ψ(B) = (1 − φB)−1 as
X∞
Ψ(B)(1 − φB) = ψj B j (1 − φB)
j=0
∞
X ∞
X
= ψj B j − φψj B j+1
j=0 j=0
X∞ X∞
= φj B j − φj B j = 1
j=0 j=1
We have thus constructed a solution for the process, but still have to demonstrate its unique-
ness. Now suppose that {Yt } is another stationary solution to the equation Yt = φYt−1 + Zt .
We show an explicit relationship between {Xt } and {Yt }. By the previous recursive ap-
P∞
proach, Yt = j=0 φj Zt−j . Then
k
X ∞
X
Yt − φj Zt−j = φj Zt−j
j=0 j=k+1
∞
X
= φk+1 φj Zt−j−k−1
j=0
k+1
=φ Yt−k−1
and
!2
k
X
φj Zt−j = lim φ2k+2 E Yt−k−1
2
lim E Yt − =0
k→∞ k→∞
j=0
2
MS P∞ j
as E Yt−k−1 < ∞ by assumption. Therefore, we have Yt = j=0 φ Zt−j = Xt . Therefore
{Xt } and {Yt } are equal in mean-square, thus {Xt } is the unique stationary solution to the
23
AR(1) equation (up to mean-square equivalence).
If we choose to define {Xt } ∼ AR(1) with |φ| < 1, then {Xt } has a “unique” representation
as an MA(∞) (or linear) process.
Note
If |φ| = 1, i.e. Xt = Xt−1 + Zt or Xt = −Xt−1 + Zt . Suppose (for the sake of contradiction)
that {Xt } is a stationary solution to the equation Xt = φXt−1 + Zt with |φ| = 1. Then by
the previous recursion,
Xn
Xt = φn+1 Xt−n−1 + φj Zt−j
j=0
1 1
Xt = Xt+1 − Zt+1
φ φ
1 1 1
= 2 Xt+2 − Zt+1 − 2 Zt+2
φ φ φ
∞
X
... = − φ−j Zt+j
j=1
P∞ P∞
with j=1 |φ−j | ≤ |φ−1 |j < ∞ as |φ−1 | < 1.
j=0
P∞
We do have a stationary solution Xt = j=−∞ ψj Zt−j where
−φ−j if j < 0
ψj =
0 if j ≥ 0
Since the white noise are residuals from the future observations, this is not of much use in
practical terms.
24
◦ |φ| < 1, stationary and causal
◦ |φ| = 1, non-stationary
P∞ P∞
where j=−∞ |αj | < ∞, j=−∞ |βj | < ∞. Then
∞
X ∞
X
α(B)β(B)Zt = αj B j βj B j Z t
j=−∞ j=−∞
∞
X ∞
X
= αj βk B j+k Zt
j=−∞ k=−∞
∞ ∞
( ) !
X X
l
= αk βl−k B Zt
l=−∞ k=−∞
∞
!
X
l
= ψl B Zt
l=−∞
P∞
where ψl = k=−∞ αk βl−k . The last line is a linear process representation, as it is just
∞
X
ψl Zt−l .
l=−∞
25
The two linear filters combine to yield another linear filter
{Xt } is the autoregressive process of order p (denoted {Xt } ∼ AR(p)) with real coefficients
φ1 , . . . , φ p .
Example 1.12 (AR(2) process)
Take Xt − φ1 Xt−1 − φ2 Xt−2 = Zt for {Zt } ∼ WN(0, σ 2 ). By identical arguments to the
AR(1) case. We have E (Zt ) = 0, Var (Zt ) = σ 2 < ∞, which leads to E (Xt ) = 0. Note that
{Xt } can be found as the solution to the equation
(1 − φ1 B − φ2 B 2 )Xt = Zt
for some collection of coefficients {ψj } such that {Xt } converges in mean square? If so, then
the ψj must be the coefficients in the series expansion of (1 − φ1 B − φ2 B 2 )−1 . First, note
that
Φ(B) = (1 − φ1 B − φ2 B 2 ) = (1 − ξ1 B)(1 − ξ2 B)
26
are the reciprocal roots of Φ(z) = 0). Note finally that we could write
1 φ1
Φ(z) = φ2 − − z2
φ2 φ2
z 2 − φ1 /φ2 z − 1/φ2 = 0
(z − η1 )(z − η2 ) = 0
as
= ψ1 (B)ψ2 (B)
with each sum convergent provided |ξ1 |, |ξ2 | < 1. Indeed, this is a necessary and sufficient
condition. Thus
where
∞ X
X ∞
Ψ(B) = ξ1j ξ2k B l+k
j=0 k=0
Pr k r−k
with ψr = k=0 ξ1 ξ2 for r ≥ 0.
This is a valid series expansion, we need to verify some conditions on ψr . We have |ψr | ≤
27
Pr P∞
k=0 |ξ1 |k |ξ2 |r−k . We require that r=0 |ψr | < ∞. Explicitly, if ξ1 , ξ2 are a complex pair,
Pr
so that |ξ1 | = |ξ2 |, then |ψr | ≤ k=0 |ξ1 |r = (r + 1)M r say, where M = |ξ1 | < 1. Therefore,
∞
X ∞
X
|ψr | ≤ (r + 1)M r < ∞
r=0 r=0
For the AR(p), in the more general case, we can apply the same results and construction.
Write
Φ(B) = (1 − φ1 B − φ2 B 2 − φ3 B 3 − · · · − φp B p )
p
Y
= (1 − ξr B)
r=1
The AR(p) process is stationary and causal if |ξr | < 1 for r = 1, . . . p. The values ξ1 , . . . , ξp
solve Φ(z) = 0; as before, if ηr = ξr−1 , then η1 , . . . , ηp solve Φ(z) = 0 also with |ηr | > 1 ∀ r.25
What if |ξr | = 1 for some r? Or if |ξr | > 1? The first case will turn to be nonstationary,
while for the second, one can construct a stationary process in a certain way, but which
turns out to be non-causal.
Xt = Zt + θ1 Zt−1 + · · · + θq Zt−q
where {Zt } ∼ WN(0, σ 2 ). We know that E (Xt ) = 0 ∀ t and also that {Xt } is stationary.
Also, we have
∞
X
Xt = ψj Zt−j
j=−∞
25 Sometimes seen as the condition that all the roots lie outside the unit circle, contrary to ξ lying inside
r
the unit circle).
28
where
θ j , if j = 1, 2, . . . , q
ψj = 1, if j = 0
0, if j < 0
so the MA(q) process for q ≥ 1 has a straightforward linear process representation. By the
earlier results,
∞
X
γX (h) = σ 2 ψj ψj+|h|
j=−∞
q−|h|
X
= θj θj+|h|
j=0
where θ0 ≡ 1 with the convention that the sum is zero if |h| > q.
Xt = (1 + θB)Zt .
(as (1 + θB)−1 (1 + θB) = 1 using this definition) provided |θ| < 1, otherwise the sum
P∞
Π(B)Xt = j=0 (−θ)j Xt−j would not be mean-square convergent.
If |θ| < 1, {Xt } ∼ MA(1) admits the representation
Π(B)Xt = Zt
P∞
where Π(B) = j=−∞ πj B j with πj = (−θ)j if j ≥ 0 and zero otherwise, that is {Xt } ∼
AR(∞).
If {Xt } ∼ MA(q), say Xt = Θ(B)Zt , where Θ(B) = 1 + θ1 B + θ2 B 2 + · · · + θq B q , then we
29
may factorize
q
Y
Θ(B) = (1 − ωl B).
l=1
Π(B)Xt = Zt
P∞
with Π(B) = j=−∞ πj B j provided that |ωl | < 1, for l = 1, . . . , q where πj is the coefficient
of B j in the expansion of
q
Y
(1 + θ1 B + θ2 B 2 + · · · + θq B q )−1 = (1 − ωl B)−1
l=1
q
Y ∞
X
= ωlj B j
l=1 j=−∞
If {Xt } admits an AR(∞) representation, that is |ωl | < 1 for l = 1, . . . , q, we say that {Xt }
is invertible (note that the AR(∞) representation is causal.)
If {Xt } ∼ MA(1) with |θ| > 1, Xt is still stationary, but it does not admit a causal AR(∞)
representation.
P∞
However, we may write Zt = − j=1 (−θ)−j Xt+j which is a mean-square convergent AR(∞)
process that is non-causal.
Criterion : In the context of weakly stationary processes, with finite first and second
moments, a natural criterion to optimize with respect to is minimum mean squared
error. That is, for linear predictors, we aim to choose coefficients a1 , . . . , an to minimize
the expected value of the squared difference between the true value of the variable to be
predicted and the predictor, that is
bn+h )2
E (Xn+h − X
26 For practicality reasons and due to the analogy with the Gaussian case
30
where
n
X
X
bn+h = a0 + ai Xn−i+1
i=1
for variable a. Note that the expectation is over all the random variables, which are
X1 , . . . , Xn , Xn+h . We may solve this optimization (minimization) problem analytically.
n
!!
∂MSE(a) X
= E 2 Xn+h − a0 − ai Xn−i+1 =0 (1.5)
∂a0 i=1
n
! !
∂MSE(a) X
= E 2 Xn+h − a0 − ai Xn−i+1 Xn−j+1 =0 (1.6)
∂aj i=1
n
!
X
a0 = µ 1 − ai
i=1
n
! n
X X
E (Xn+h Xn−j+1 ) − µ 1 − ai E (Xn−j+1 ) − ai E (Xn−i+1 Xn−j+1 ) = 0
i=1 i=1
31
imply that
n
X
γX (h + j − 1) − ai γX (i − j) = 0
i=1
for j = 1, . . . , n. Let
γ X (n, h) ≡ (γX (h), . . . , γX (h + n − 1))>
and similarly, a1:n = (a1 , . . . , an )> . Then the system of equations to be solved becomes
ΓX (n)a1:n = γ X (n, h)
an n×1 system of equations. The minimum MSE is achieved when a1:n solves this equation;
the solution depends on both n and h, so write a1:n (h) = (a1 (h), . . . , an (h))> as the optimal
value. The optimal forecast (prediction) is then Xbn+h = µ + Pn ai (h)(Xn−i+1 − µ). 27
i=1
P n
The optimal forecast is X i=1 ai (h)(Xn−i+1 − µ) and the minimum MSE is
bn+h = µ +
bn+h )2 = γX (0) − a1:n (h)> γ X (n, h)
E (Xn+h − X
n
X n X
X n
= γX (0) − 2 ai γX (h + i − 1) + ai γ(i − j)aj
i=1 i=1 j=1
Note
Finding a1:n (h)from
ΓX (n)a1:n = γ X (n, h)
27 The optimal construction depends upon the covariance of the process. Clearly, a
1:n (h) is a function of
the covariance sequence.
28 In R, using the QR decomposition (through solve). For n > 2000, the computation becomes prohibitive.
In general, ΓX is fully parametrized in most cases, and the matrix can be sparse. Choleski, eigenvalue
decomposition may work effectively.
29 If we are dealing with a MA process, we can use the AR(∞) approximation and truncate the coefficients
for forecasting.
32
Example 1.13
In the AR(1) case, Xt = φXt−1 + Zt where Zt ∼ WN(0, σ 2 ) and |φ| < 1. Then
φ2 φn−1
1 φ ···
φ 1 φ ··· ···
σ2
φ2 φ 1 ··· ···
ΓX (n) =
1 − φ2
.. .. .. .. ..
. . . . .
φn−1 ··· ··· φ 1
and
For h = 1, we have ΓX (n)a1:n (h) = γX (n, 1) yields a1:n (1) = (φ, 0, · · · , 0)> . Thus,
n
X
X
bn+1 = µ + ai (i)(Xn−i+1 − µ)
i=1
= µ + φ(Xn − µ)
bn+h = φh Xn
X
Example 1.14
Consider the AR(p) case, with a general model of the form Φ(B)Xt = Zt
Xt − φ1 Xt−1 − · · · − φp Xt−p = Zt
one can go through the same arguments and conclude that the optimal forecast for X
bn+1 is
n
X
X
bn+1 = φi Xn−i
i=1
33
Proposition 1.13 (Solving for a1:n (h))
−1
If ΓX (n) is non-singular, we may write a1:n (h) = {ΓX (n)} γX (n, h), but computing the
matrix inverse may be prohibitive.
1. γX (0) > 0
2. γX (h) → 0 as h → ∞.
Note
For real time series data, γX (h) is not known, so sample-based estimates γ
bX (h) are used,
n−h
1 X
γ
bX (h) = (Xt+h − X̄n )(Xt − X̄n )
n t=1
Pn
where X̄n = n−1 t=1 Xt .30
Two algorithms can be used to compute the optimal coefficients recursively; this is useful
for large n, or when data are being observed on an ongoing basis.
Sum of correlated observations are harder to deal with, although the form is nice. The
Levinson-Durbin provides a recursive algorithm that uses the residuals.
Algorithm 1.1 (Levinson-Durbin)
Without loss of generality, take E (Xt ) = 0. Write
n
X
X
bn+1 = ϕn,i Xn−i+1
i=1
= ϕ>
n Xn:1
where ϕn = (ϕn,1 , · · · , ϕn:n )> , which corresponds in our prediction forecast to a1:n The
optimal choice solves
Γn ϕn = γ(n, 1),
vn = γ(0) − ϕ>
n γ(n, 1)
30 One assumes that the order of the process is finite, so that some terms be zero in the sum. Note that
the last term, since we are dealing with an autocorrelation, then the variance of the sum will be larger than
the variance for the IID case. Each sample observation brings less information about the sample then an
IID counterpart.
34
Recursion: Define ϕn in terms of ϕn−1 and vn−1 as follows:
1
γ(n) − ϕ> R
ϕn,n = n−1 γ (n − 1, 1)
vn−1
where γ R (k, 1) = (γ(k), γ(k − 1), . . . , γ(1))> and assuming γ is known, and
> 31
for ϕn−1 , ϕR R
n−1 is a (n−1)×1 vector, ϕn,n a scalar and ϕn−1 = (ϕn−1,n−1 , ϕn−1,n−2 , · · · , ϕn−1,1 ) .
Then set
vn = vn−1 (1 − ϕ2n,n ).
γ(1)
Initialization: For n = 1, ϕ1,1 = γ(0) = ρ(1) with v0 = γ(0) and v1 = γ(0)(1 − ρ2 (1)).
To see why this works, let
1
Pn = Γn
γ(0)
Pk ϕR
k = ρk:1
31 The calculations depend on an inner product, of O(n), rather than n log(n). This is thus more compu-
tationally efficient.
35
using the symmetry of Pk . By the proposed recursion, we have that
" # ! " #
Pk ρk:1 ϕk − ϕk+1,k+1 ϕR
k a
Pk+1 ϕk+1 = > =
ρk:1 1 ϕk+1,k+1 b
a = Pk ϕk − ϕk+1,k+1 Pk ϕR
k + ϕk+1,k+1 ρk:1
= Pk ϕk = ρ1:k
b = ρ> > R
k:1 ϕk − ϕk+1,k+1 ρk:1 ϕk + ϕk+1,k+1
= ρ> > R
k:1 ϕk − ϕk+1,k+1 (1 − ρk:1 ϕk )
We have now
1
γ(k + 1) − ϕ> R
ϕk+1,k+1 = k γ (k, 1)
vk
where
vk = γ(0) − ϕ>
k γ(k, 1)
= γ(0) 1 − ϕ>
k ρ1:k
= γ(0)(1 − ρ> R
k:1 ϕk )
1
γ(k + 1) − ϕ> R
ϕk+1,k+1 = > R k γ (k, 1) (1.7)
γ(0)(1 − ρk:1 ϕk )
γ(k + 1) − ϕ>
R
k γ (k, 1)
ρ>
k:1 ϕk + (1 − ρ> R
k:1 ϕk )
γ(0)(1 − ρ> R
k:1 ϕk )
= ρ> >
k:1 ϕk + ρ(k + 1) − ϕk ρk:1
= ρ(k + 1)
36
We have effectively performed a matrix inversion by considering a block decomposition
" #
Pk ρk:1
Pk+1 = >
ρk:1 1
vn = E (Xn+1 − ϕ> 2
n Xn:1 )
= γ(0) − ϕ>
n γ(n − 1)
plugging in (1.7). That is, vn = vn−1 (1 − ϕ2n,n ) and as vn ≥ 0, imply ϕ2n,n < 1, |ϕn,n | < 1
thus vn ≤ vn−1 , i.e. the {vn } form a non-increasing sequence.
Note
The Levinson-Durbin recursion solves the equation Γn ϕn = γ(n, 1) without matrix inver-
sion, in order n2 operations. Most matrix inversion procedures are order n3 ; L-D exploits
the Toeplitz structure.33
- this is the partial correlation for X and Y given (accounting for) Z; it is the correlation
between residuals obtained by regressing in turn X and on Z and Y on Z. The PACF
32 This works because of the nature of the nature of the matrix (Toeplitz structure). When we get to
estimation by maximum likelihood estimates, we will get the likelihood in terms of matrix inverses.
33 One can also use singular value, eigenvalue decomposition or Choleski decomposition; these method
37
computes the correlation between
Xt − E Xt |X(t+1):(t+h−1) (1.8a)
Xt+h − E Xt+h |X(t+1):(t+h−1) (1.8b)
Example 1.15
Consider an causal stationary AR(p) process; for h > p,
E Xt+h |X(t+1):(t+h−1) , X1:(t−1) = φXt+h−1 + · · · + φp Xt+h−p
E Xt |X(t+1):(t+h−1) , X1:(t−1) = φ1 Xt+h−1 + · · · + φp Xt+h−p
where now (1.8a) equals Zt and (1.8b) is Zt+h , therefore the correlation Cor(Zt , Zt+h ) = 0
and αX (h) = 0 for h > p, while for h ≤ p, αX (h) 6= 0. That is, we can diagnose an AR(p)
structure by inspecting whether the partial autocorrelation drops to zero at some finite lag.
Algorithm 1.2 (Innovations algorithm)
Let {Xt } be zero-mean process with E Xt2 < ∞ and let γ̃X (t, s) = E (Xt Xs ) (not neces-
sarily stationary). Let
0 if n = 1
X
bn =
E Xn |X 1:(n−1) if n ≥ 2
Let Un = Xn − X bn be the error in prediction and let U 1:n be the vector of prediction errors
(computed in a one-step ahead fashion). As E Xn |X 1:(n−1) is linear in X1 , . . . , Xn−1 , we
may write U 1:n = An X 1:n where An is given by
1 0 0 0 0 0
a1,1 1 0 0 0 0
.. ..
An =
a2,2 a2,1 1 0 . .
.. .. .. .. ..
. . .
. .
an−1,n−1 ··· ··· ··· an−1,1 1
as
U1 = X1
U2 = X2 − X
b2 = X2 + a1,1 X1
U3 = X3 − X
b3 = X3 + a2,2 X2 + a2,1 X1
···
38
Let C n = A−1
n , again a lower triangular matrix, where
1 0 0 0 0 0
ϑ1,1 1 0 0 0 0
.. ..
Cn =
ϑ2,2 ϑ2,1 1 0 . .
.. .. .. .. ..
. . .
. .
ϑn−1,n−1 ··· ··· ··· ϑn−1,1 1
where
0 0 0 0 0 0
ϑ1,1 0 0 0 0 0
.. ..
Θn = C n − In =
ϑ2,2 ϑ2,1 0 0 . .
.. .. .. .. ..
. . .
. .
ϑn−1,n−1 ··· ··· ··· ϑn−1,1 0
and so
0 if n = 1
X
bn+1 = Pn
i=1 ϑn,i (Xn+1−i − Xn+1−i ) if n ≥ 2
b
◦ Initialize
v0 = γ̃X (1, 1) = E X12
◦ Recursion at step n
k−1
1 X
ϑn,n−k = γ̃X (n + 1, k + 1) − ϑk,k−j ϑn,n−j vj
vk j=0
39
for 0 ≤ k < n
n−1
X
vn = γ̃X (n + 1, n + 1) − ϑ2n,n−j vj
j=0
bn M.S.
A process {Xt } is deterministic (or predictable) if, for all n, Xn − X = 0 so that the
prediction variance
b n )2 = 0
E (Xn − X
Example 1.16
Suppose we have random variables A, B such that E (A) = E (B) = 0 and Var (A) =
Var (B) = 1 and further E (AB) = 0, that is A, B are uncorrelated.
Let {Xt } be defined by
Xt = A cos(ωt) + B sin(ωt)
Xn = A cos(ωn) + B sin(ωn)
But
Xn = 2 cos(ω)Xn−1 − Xn−2
Thus X bn M.S.
bn = 2 cos(ω)Xn−1 − Xn−2 and Xn − X = 0. Thus {Xt } is a deterministic
34
process.
34 Brockwell and Davis use this terminology, although it is still stochastic in nature.
40
The Wold decomposition states that if {Xt } is stationary and non-deterministic, then {Xt }
can be written as
∞
M.S.
X
Xt = ψj Zt−j + Vt
j=0
where {Zt } ∼ WN(0, σ 2 ), {Vt } is predictable and {Zt }, {Vt } are uncorrelated.35 Further-
P∞
more, ψ0 = 1, j=1 ψj2 < ∞ with
E (Xt Zt−j )
ψj = , j≥1
E (Zt2 )
35 Notice that the linear process is causal, so there is no contribution of the negative terms. The result
41
Chapter 2
ARMA models
Section 2.1: Basic properties
All indicates that we can combine the class of filters to get a richer class of models.
2.1.1. ARMA(p, q)
The Autoregressive-Moving average model of order (p, q) (or ARMA(p, q)) for time-
series Xt specifies that {Xt } is a solution of
Φ(B)Xt = Θ(B)Zt
Φ(B) = 1 − φ1 B − φ2 B 2 − · · · − φp B p
Θ(B) = 1 + θ1 B + θ2 B 2 + · · · + θq B q
Assume that Φ(B) and Θ(B) have no common factors i.e. all ξj are different from all ωj
(so that no cancellation is possible). Recall that |ξj | =
6 1 ∀ j imply stationarity, |ξj | < 1 ∀ j
implies causal and |ωi | < 1 ∀ i implies invertible.
For a linear process representation, we can write
∞
X
Xt = Ψ(B)Zt = ψj Zt−j ;
j=0
Φ(B)Ψ(B) = Θ(B)
Θ(Z)
that is Ψ(Z) = Φ(Z) for arbitrary complex value Z. We can compute the ψj0 s by equating
42
coefficients, that is
(1 − φ1 B − φ2 B 2 − · · · − φp B p )(ψ0 + ψ1 B + ψ2 B 2 + · · · ) = 1 + θ1 B + θ2B 2 + · · · + θq B q
B 0 : ψ0 = 1
B 1 : ψ1 − φ1 ψ0 = θ1 ⇒ ψ1 = φ1 + θ1
ψ − Pj φ ψ 0 ≤ j ≤ max(p, q + 1)
j k=1 k j−k = θj ,
Bj :
ψj − p φk ψj−k = 0,
P
j ≥ max(p, q + 1)
k=1
Therefore
θ + Pj φ ψ if 0 ≤ j ≤ max(p, q + 1), θ0 = 1
j k=1 k j−k
ψj = Pp
k=1 φk ψj−k = 0 if j ≥ max(p, q + 1)
(1 − φB)Xt = (1 + θB)Zt
or equivalently
Xt − φXt−1 = Zt + θZt−1 .
43
By the previous formulation,
1 if j = 0
ψj =
(θ + φ)φj−1 if j ≥ 1
(θ + φ)2
= σ2 1 +
1 − φ2
and
∞
X
γX (1) = σ 2 ψj ψj+1
j=0
∞
X
= σ 2 (θ + φ) + (θ + φ)2 φ (φj−1 )2
j=1
(θ + φ)φ
= σ 2 (θ + φ) 1 +
1 − φ2
and similarly,
∞
X
γX (h) = σ 2 ψj ψj+h
j=0
= φ γX (h − 1)
= φh−1 γX (1), h≥2
44
P∞
We wish to compute γX (h) for h ∈ Z+ . First write Xt = Ψ(B)Zt = j=0 ψj Zt−j
E (Xt−1 Zt ) = 0
E (Xt−1 Zt−1 ) = σ 2 ψ0
and the right hand side is σ 2 θ1 ψ0 . We do not have yet enough information.
3. Multiply by Xt−2 , take expectations. On the right hand side, E (Xt−2 Zt ) = E (Xt−2 , Zt−1 ) =
0. Therefore, the equation becomes
We now have four equations and four unknowns γX (0), γX (1), γX (2), γX (3) which we
can solve simultaneously. For Xt−k , where k ≥ 4 and
γX (k) − φ1 γX (k − 1) − φ2 γX (k − 2) − φ3 γX (k − 3) = 0
for k ≥ 4.
45
Example 2.3 (Autocovariances of ARMA(1,3))
Consider an ARMA(1,3) model of the form
Premultiply by Xt−k and take expectations. Using the same procedure as before, for k = 0,
we have
and for k = 1
and iterating this procedure, we will not this time get a homogeneous equation (the right
hand side doesn’t vanish). For k = 2, we have
and subsequently
The case of an ARMA(1,2) is left as an exercise. The calculation and the values of the
autocovariances are necessary to compute the likelihood.
46
For the general ARMA case, say ARMA(p, q) where the form of the equation is
p
X q
X
Xt − φj Xt−j = Zt + θj Zt−j
j=1 j=1
Multiply by Xt−k and take expectations. From the previous calculations, the left hand side
is
p
X
γX (k) − φj γX (|k − j|)
j=1
γX (k) − φ1 γX (k − 1) − · · · − φp γX (k − p)
σ 2 Pq θ ψ if k ≤ q
j=k j j−k
=
0 otherwise
γX (k) = φ1 γX (k − 1) + · · · + φp γX (k − p)
and we solve the linear system of the first max(p, q + 1) + 1 equations, then use the final
recursion.
Note
Analytical solution is possible, and is complicated, relies on the mathematics of difference
equations and rely on the roots of the inverse polynomial (with boundary values).
47
2.1.3. Autocovariance Generating function (ACVGF)
The ACVGF is another tool for computing the ACVF. Suppose {Xt } is defined by
∞
X
Xt = ψj Zt−j
j=−∞
P∞
with {Zt } ∼ WN(0, σ 2 ). Suppose that there exists r > 1 such that j=−∞ ψj z j is conver-
gent for z ∈ C, 1r < |z| < r.
P∞ h
Define the ACVGF, GX , by GX (z) = h=−∞ γX (h)z . Now, we know from previous
results that
∞
X
γX (h) = Cov(Xt , Xt+h ) = σ 2 ψj ψj+|h|
j=−∞
therefore
∞
X ∞
X
GX (z) = σ 2 ψj ψj+|h| z h
h=−∞ j=−∞
∞
X ∞
X
= σ2 ψj2 + ψj ψj+h (z h + z −h )
j=−∞ h=1
∞ ∞
!
X X
= σ2 ψj z j ψh z −h
j=−∞ h=−∞
2 −1
= σ Ψ(z)Ψ(z )
P∞
where Ψ(z) = j=−∞ ψj z j . For ARMA(p, q), Φ(B)Xt = Θ(B)Zt . For stationary processes,
Qp Pp
we have Ψ(B) = j=1 (1 − ξj B) with |ξj | =
6 1 for all j. Therefore, Ψ(z) = j=1 (1 − ξj z) 6= 0
when |z| = 1. We now have
Xt = Ψ(B)Zt
P∞ j
with Ψ(B) = j=−∞ ψj B where Φ(z)Ψ(z) = Θ(z), ∀ z. This means that Ψ(z) =
Θ(Z)/Φ(z) is well-defined since there is a neighborhood around 1 when z is in the annulus
Ar for some r and where Ar = {z : 1r < |z| < r}. For the ARMA(p, q), we have
σ 2 Θ(z)Θ(z −1 )
GX (z) = σ 2 Ψ(z)Ψ(z −1 ) =
Φ(z)Φ(z −1 )
48
Example 2.5
P∞
Let Zt ∼ WN(0, σ 2 ), then GZ (z) = h=−∞ γX (h)z h = σ 2 and thus GZ (z) is constant, does
not depend on z and is the only generating process for which the ACVGF is constant36 .
where Φ(z) 6= 0 and Θ(z) 6= 0 when |z| = 1. and where we relax the assumption that all ξ
are less than one in modulus; we allow
p
1 − ξj−1 z
!
Y
∗
Φ (z) = Φ(z)
j=r+1
1 − ξj z
q
1 − ωj−1 z
!
Y
Θ∗ (z) = Θ(z)
j=s+1
1 − ωj z
−1
Zt∗ = Φ∗ (B) {Θ∗ (B)} Xt
∗ ∗ −1 −1
= Φ (B) {Θ (B)} Θ∗ (B) {Φ∗ (B)} Zt
p
! q !−1
Y 1 − ξj−1 B Y 1 − ωj−1 B
= Zt
j=r+1
1 − ξj B j=s+1
1 − ωj B
= Ψ∗ (B)Zt
36 Identifies Z up to mean-square
49
Therefore GZ ∗ (z) = σ 2 Ψ∗ (z)Ψ∗ (z −1 ), therefore we are left with
p −1 Y q p −1 −1 Y q −1
Y 1 − ξj z 1 − ωj z Y 1 − ξ j z 1 − ωj z
σ2 .
j=r+1
1 − ξj z j=s+1 1 − ωj−1 z j=r+1 1 − ξj z −1 j=s+1 1 − ωj−1 z −1
But
1 − ξj−1 z 1 − ξj−1 z −1
! !
= |ξj |−2
1 − ξj z 1 − ξj z −1
and so
p
Y q
Y
GZ ∗ (z) = σ 2 |ξj |−2 |ωj |2 = σ ∗2
j=r+1 j=s+1
and we therefore conclude that {Zt∗ } ∼ WN(0, σ ∗2 ).37 Therefore Φ∗ (B)Xt = Θ∗ (B)Zt∗
where {Zt∗ } ∼ WN(0, σ ∗2 ) therefore {Xt } ∼ ARMA(p, q) defined with respect to {Zt∗ }. But
note
r
Y p
Y
Φ∗ (B) = (1 − ξj B) (1 − ξj−1 B)
j=1 j=r+1
and all roots are less than one in modulus, therefore this representation is causal. Similarly,
s
Y q
Y
Θ∗ (B) = (1 − ωj B) (1 − ωj−1 B)
j=1 j=s+1
therefore this process is invertible, since all the roots lie inside the unit circle.
Example 2.6
Let {Xt } ∼ MA(1) of the form
Xt = Zt − 2Zt−1 = (1 − 2B)Zt
37 For this argument to work, we need stationarity to hold, for the expansion to be valid. In the case where
the original process was both non-causal and non-invertible, the resulting variance may be smaller or lower,
depending on the roots, how far they are from the unit circle
50
−1
Let Zt∗ = 1 − 12 B (1 − 2B)Zt , that is
1
1 − B Zt∗ = (1 − 2B)Zt = Xt
2
Φ̃(B)Zt∗ = Θ̃(B)Zt
Θ̃(z) 1 − 2z
Ψ̃(z) = =
Φ̃(z) 1 − 12 z
and therefore
1 − 2z −1
2 −1 2 1 − 2z
GZ ∗ (z) = σ Ψ̃(z)Ψ̃(z )=σ = 4σ 2
1 − 12 z 1 − 12 z −1
The morale of this story is that for any non-causal and/or non-invertible, a legitimate
formulation can be found such that Xt is causal and invertible. Therefore, all the restrictions
imposed earlier that restricted the roots of the Θ(B) and Φ(B) to be less than one in modulus
are really for ease and that these are the only cases we need to consider.
Partial autocorrelation
Calculations for ARMA(p, q) follow the Levinson-Durbin general algorithm: no more trans-
parent calculations are available.
51
Section 2.2: Estimation and model selection for ARMA(p,q)
2.2.1. Moment-based estimation
For general stationary processes, moment-based estimation of µx , γX (h), ρX (h), σ 2 is used.
For ARMA(p, q), we seek to estimate (φ1 , . . . , φp , θ1 , . . . , θq ) and σ 2 which then determine
γX (h). An elementary form of estimation involves finding (φ, θ, σ 2 ) such that the matrix
ΓX (n, φ, θ) (the n × n covariance matrix implied by (φ, θ)) most closely matches Γ b n (the
sample covariance matrix) that is
(φ,
b θ)
b = arg min d(ΓX (n, φ, θ), Γ
b n)
φ,θ
Using our previous strategy, multiply through by Xt−k and take expectations.
p
X
E (Xt Xt−k ) = φj E (Xt−j Xt−k )
j=1
◦ substituting γ
bX (h) for γX (h) yields the estimates φb1 , φb2 , . . . , φbp .
Note
1. For the AR(p), we could use OLS and form the design matrix with the lagged val-
ues and perform the regression on this matrix. However, the Yule-Walker approach
guarantees a stationary and causal solution.
52
2. We can choose to look at k = 1, . . . , p to derive the Yule-Walker equations: in this
case, we must solve
ΓX (p)φ1:p = γ X (p, 1)
σ 2 = γX (0) − φ>
1:p γ X (p, 1)
c2 is obtained by plugging in γ 38
so σ bX and φ
b .
1:p
for 0 ≤ k ≤ p+q as we need p+q +1 equations to solve for the p+q +1 unknowns – but
this is non-linear in (φ1 , . . . φp , θ1 , θq , σ 2 ) as {ψj } are the coefficients in the MA(∞)
representation of {Xt }.
◦ Burg’s Algorithm for AR(p) This is an algorithm that uses the sample PACFs to
estimate φ1 , . . . , φp .
◦ For the MA(q), moment based estimation can be achieved using the Innovations algo-
rithm.
53
need to calculate the inverse of κn and the determinant of κn . We can use the Innovations
algorithm in order to get a decomposition of κ which avoids direct calculations of the
determinant and the inverse of κ.
If X
bt is the one-step ahead forecast,
bt = E Xt |X1:(t−1)
X
2
then as Xt |X1:(t−1) ∼ N (µt−1 , σt−1 ) where
where
" #
κt−1 κ1:(t−1),t
κt =
κt,1:(t−1) κt,t
we have X
bt = µt−1 . The likelihood is given by
n 1 1
L(κn (φ, θ)) = (2π)− 2 | det κn (φ, θ)|− 2 exp − x> κ−1
(φ, θ)x1:n
2 1:n n
Direct optimization of the log likelihood log L(κn (φ, θ)), denoted `, is possible but compu-
tationally expensive when n is large as we need to compute
54
Qn
det |κn | = det |D n | = t=1 vt−1 = v0 v1 v2 · · · vn−1 and
n b t )2
X (Xt − X
X> −1 > −1
1:n κn X 1:n = (X 1:n − X 1:n ) D n (X 1:n − X 1:n ) =
c c
t=1
vt−1
Note
bt+1 )2 = σ 2 = σ 2 rt and the likelihood can be rewritten as
We have vt = E (Xt+1 − X t
n
1 nY
2 −2 − 12 1
n σ rt exp − 2 S(X 1:n ) .
(2π) 2 t=1
2σ
where
n bt (φ, θ))2
X (Xt − X
S(X 1:n ) ≡ S(X 1:n , φ, θ) =
t=1
rt−1 (φ, θ)
with σ̃ 2 = S(φ, θ)/(n − p − q) may also be used. 41 In R, arima allows you to select full
ML, or a conditional ML, or least-squares or conditional LS (where the conditional analysis
41 Thisis an inefficient procedure if we make the normality assumption. In finite samples, these two
methods may be indistinguishable in the finite sample case.
55
conditions on the original p + q data).42
◦ model complexity
◦ residuals (structure)
43
◦ forecasting performance (“out-of-sample”)
AIC(M, β M ) = −2`M (β
b ) + 2 dim(β )
M M
for example44 for ARMA(p, q), we have p + q + 1 parameters. The model with smallest AIC
is selected as most appropriate. This should be used only to compare nested models; we
fit ARMA(p, q) for p, q moderately large, then examine all submodels.
Note
◦ In small samples, the 2 dim(β M ) penalty is not sufficiently stringent(i.e. it selects
overly complex models).
42 It is rather awkward to use full ML, since we need to condition x on past values, while the conditional
1
ML allows to start at the p + q + 1 term.
43 Forecast errors and the innovations algorithm gives a record of how the model is performing at this
stage
44 Note that increasing utility of adding parameters in the model always decreases as we add covariates,
while the second term increases linearly. One can plot this as a function of the models, and the one that
yields the minimum value is preferred in practice. For the BIC, the intercept will be higher, as log(n) will
be linear in the dimension of the parameter β, for fixed n (the length of the time-series).
45 Meaning that with probability one, the AIC does not select the model as the sample size becomes
infinite.
56
where n is the length of the series. This has a more stringent penalty, replacing 2 by log(n)
in the penalty term. This adjustment alleviates both the small sample underpenalization
and the large sample inconsistency of the AIC – the BIC is a consistent model selection
criterion.46 Once the “best model” has been selected, the residuals from the fit should
resemble a white noise series (constant variance, uncorrelated).47
46 In this framework, note that consistency is really in terms of correctly choosing the best model from the
nested models. This is also interesting in the context of model misspecification in the context of inference
(when fitting a ML in an incorrect model, for example selecting from a set of models, which does not include
the DGM.
47 There are other suggestions in the Brockwell and Davis book, but they do not address large sample size
57
Chapter 3
Non-Stationary and Seasonal Models
Section 3.1: ARIMA models
If d is a non-negative integer, we call {Xt } an ARIMA model of order (p, d, q) if the process
{Yt } defined by
Yt = (1 − B)d Xt
{Xt } is stationary if and only if d = 0. In all other cases (d ≥ 1), we can add polynomial
trend of order d − 1 and the resulting process still satisfies the ARIMA equation (3.10).
However, as Φ(Z) = 0 has all reciprocal roots inside the unit circle, we merely need to
difference {Xt } d times and perform inference on the resulting differenced series yt = (1 −
B)d xt .
In practice, d is not known, so we may try d = 1, 2, . . . in turn and assess the stationarity of
the resulting series. However, in practice, it can be difficult to distinguish a factor (1 − B)
from (1 − ξB) with |ξ| → 1 in the ARMA polynomial.
Example 3.1
Consider the following processes, the AR(2) process defined by (1−0.6B)(1−0.99B)Xt = Zt
versus the ARIMA(1, 1, 0) (1 − 0.6B)(1 − B)Xt = Zt . These processes look very similar in
a simulation. The solution to the second equation is non-stationary. This is clear by the
drift for larger periods looking at the graphs (see handout). Differencing when we shouldn’t
yields a more complicated process. We thus want a statistical way to distinguish these two
models.
Dickey-Fuller test
Suppose {Xt } ∼ AR(1) where
Xt − µ = φ(Xt−1 − µ) + Zt
58
with Zt ∼ WN(0, σ 2 ), if |φ| < 1, then by standard theory for method of moments estimators,
we have that
√
n(φb − φ) N (0, 1 − φ2 ).
Therefore,
where φ∗0 = µ(1 − φ) and φ∗1 = φ − 1. As {Zt } ∼ WN(0, σ 2 ), we can estimate φ∗0 , φ∗1 via OLS:
we regress the series (1 − B)xt = xt − xt−1 on xt−1 – the forms of φb∗0 and φb∗1 are identical
to the forms of the intercept and slope estimators in the OLS. The Wald-type test statistic
derived by Dickey and Fuller is then
φb∗1
tDF =
b φb∗ )
se( 1
used to test the null hypothesis H0 : φb∗1 = 0 i.e. that H0 : φ = 1 and we are in the presence
of a unit root.48 Here
Pn
− xt−1 − φb∗0 − φb2 xt−1 )
t=2 (xt
b φb∗1 ) =
se( pPn
(n − 3) 2
t=2 (xt−1 − x̄)
The null distribution of tDF is non-standard (even asymptotically) – but for finite n, can
be approximated using Monte-Carlo. The test of H0 is carried out against the one-sided
alternative H1 : φ∗1 < 0 (equivalently H1 :, φ < 1). See handout: the first three panels (with
parameters close to 1, of 0.9,0.99 and 0.999 show asymptotic normality, where
√ b
n(φ − φ)
p ∼ N (0, 2)
1 − φ2
while the final panel shows the DF statistic distribution. Note that this test is available in
R through the adf.test function.
48 One still needs to write down the properties of the distribution of the test statistic, in terms of the
Brownian motion.
59
For the AR(p), we have
p
X
Xt − µ = φj (Xt−j − µ) + Zt
j=1
which imply
p
X
(1 − B)Xt = φ∗0 + φ∗1 Xt−1 + φ∗j (1 − B)Xt−j+1 + Zt
j=2
φ∗0 = µ(1 − φ1 − · · · − φp )
p
X
φ∗1 = φj − 1
j=1
p
X
φ∗j = − φk
k=j
for k = 2, . . . , p.
If {Xt } ∼ AR(p) with one unit root, then (1 − B)Xt ∼ AR(p − 1) is stationary. Therefore,
we can test for a unit root by testing
H0 : φ∗1 = 0
In the case of multiple unit roots, sequential testing, applying the Dickey-Fuller procedure
in each case, doing first differencing multiple times.
Note
Consider the model with φ(B) = (1 − ξB)(1 − ξB),¯ roots that appear in conjugate pair,
1 −iω
i.e. the AR(2) model with reciprocal roots ξ = r e = 1r cos(ω) − i 1r sin(ω) and ξ¯ = 1r eiω
60
where r > 1. Then
!
2
2 2 2 1 2
φ(B) = (1 − 2<(ξB) + |ξ| B ) = 1 − cos(ω)B + B
r r
1 sin(hω + λ)
ρ(h) =
rh sin(λ)
∇s Xt = Xt − Xt−s
= Xt − B s Xt
= (1 − B s )Xt .
This operator allows us to construct seasonally non-stationary models; whereas the ARIMA
model considers (1 − B)d Φ(B)Xt = Θ(B)Zt , the seasonal ARIMA (SARIMA) allows (1 −
B s )d Φ(B)Xt = Θ(B)Zt . This is a form of non-stationary model, where Φ(B)Xt = Θ(B)Zt
defines a stationary (causal) ARMA(p, q) process.
The general form of the SARIMA is as follows: For d, D non-negative integers and season-
ality s, {Xt } is a seasonal ARIMA process
if {Yt } defined by
Yt = (1 − B)d (1 − B s )D Xt
61
where {Zt } ∼ WN(0, σ 2 )
Φ(z) = 1 − φ1 z − · · · − φp z p
Φ̆(z) = 1 − φ1 − · · · − φP z P
Θ(z) = 1 − θ1 z − · · · − θq z q
Θ̆(z) = 1 − θ1 − · · · − θQ z Q
◦ forecasting: forecast.Arima
62
Chapter 4
State-space models
By coupling two (stationary) processes, we can construct even more complicated models;
for example, we can take the first process to be latent, thus describing hidden time-varying
structure, whereas the second process is constructed conditional on process 1 and is used to
represent the observed data.
We consider the marginal structure implied by this joint model for the observed data. In general, we will use vector-valued time series processes: $\{\boldsymbol{X}_t\}$, with $\boldsymbol{X}_t = (X_{t1}, \ldots, X_{tK})^\top$ a $K \times 1$ vector, is a vector-valued process that potentially exhibits dependence both over time and across components.
We have the observation equation
\[
\boldsymbol{Y}_t = G_t \boldsymbol{X}_t + \boldsymbol{W}_t, \qquad t = 0, \pm 1, \pm 2, \ldots,
\]
where $\{G_t\}$ is a sequence of $K \times L$ deterministic matrices and $\boldsymbol{W}_t \sim \mathrm{WN}(0, \Sigma_t)$, together with the state equation
\[
\boldsymbol{X}_{t+1} = F_t \boldsymbol{X}_t + \boldsymbol{V}_t, \qquad t = 0, \pm 1, \pm 2, \ldots,
\]
where $\{F_t\}$ is a sequence of $L \times L$ deterministic matrices and where $\boldsymbol{V}_t \sim \mathrm{WN}(0, \Omega_t)$. To complete the model, it is usual to consider $t \geq 1$ and to specify $\boldsymbol{X}_1$ as a random vector, uncorrelated with the noise sequences $\{\boldsymbol{W}_t\}$ and $\{\boldsymbol{V}_t\}$.
Note
1. Typically, the two white-noise series are uncorrelated with each other: $\mathrm{E}(\boldsymbol{W}_t \boldsymbol{V}_s^\top) = 0$ for all $s, t$.
2. A more general formulation allows deterministic inputs and a common noise process $\{\boldsymbol{Z}_t\}$:
\[
\boldsymbol{Y}_t = G_t \boldsymbol{X}_t + \boldsymbol{d}_t + H_{1t} \boldsymbol{Z}_t, \qquad
\boldsymbol{X}_{t+1} = F_t \boldsymbol{X}_t + \boldsymbol{e}_t + H_{2t} \boldsymbol{Z}_t.
\]
3. With careful definition of $\{\boldsymbol{X}_t\}$, one can construct processes with a more general correlation structure than is implied by the AR(1)-type form of the state equation.
4. The extension of this formulation to non-linear state-space models is possible, but more challenging in terms of inference.
By iterating the state equation, we see that
\begin{align*}
\boldsymbol{X}_t &= F_t \boldsymbol{X}_{t-1} + \boldsymbol{V}_{t-1} \\
&= F_t (F_{t-1} \boldsymbol{X}_{t-2} + \boldsymbol{V}_{t-2}) + \boldsymbol{V}_{t-1} \\
&= F_t F_{t-1} \boldsymbol{X}_{t-2} + F_t \boldsymbol{V}_{t-2} + \boldsymbol{V}_{t-1} \\
&= F_t F_{t-1} (F_{t-2} \boldsymbol{X}_{t-3} + \boldsymbol{V}_{t-3}) + F_t \boldsymbol{V}_{t-2} + \boldsymbol{V}_{t-1} \\
&\;\;\vdots \\
&= f_t(\boldsymbol{X}_1, \boldsymbol{V}_1, \ldots, \boldsymbol{V}_{t-1}),
\end{align*}
so that $\boldsymbol{X}_t$ is a function of $\boldsymbol{X}_1$ and the state noise up to time $t-1$, and similarly
\[
\mathrm{E}(\boldsymbol{W}_t \boldsymbol{X}_s^\top) = \mathrm{E}(\boldsymbol{W}_t \boldsymbol{Y}_s^\top) = 0 \qquad \text{for } 1 \le s < t.
\]
Example 4.1 (AR(1) process)
Let $\{Y_t\}$ be the causal AR(1) process $Y_t = \phi Y_{t-1} + Z_t$ with $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Take the state to be $X_t = Y_t$, with state equation
\[
X_{t+1} = \phi X_t + V_t, \qquad V_t = Z_{t+1},
\]
initialized with the causal representation
\[
X_1 = Y_1 = \sum_{j=0}^{\infty} \phi^j Z_{1-j},
\]
and observation equation
\[
Y_t = G_t X_t + W_t, \qquad G_t = [1] \;\forall t, \quad W_t \equiv 0 \;\forall t.
\]
Example 4.2 (AR(p) process)
Let $\{Y_t\}$ be the causal AR(p) process
\[
Y_t - \phi_1 Y_{t-1} - \cdots - \phi_p Y_{t-p} = Z_t,
\]
and let $\boldsymbol{X}_t = (Y_{t-p+1}, Y_{t-p+2}, \ldots, Y_t)^\top$, a $(p \times 1)$ vector. The observation equation is given by
\[
Y_t = G_t \boldsymbol{X}_t + W_t, \qquad G_t = (0, 0, \ldots, 0, 1) \;\forall t, \quad W_t \equiv 0 \;\forall t,
\]
and the state equation
\[
\boldsymbol{X}_{t+1} = F_t \boldsymbol{X}_t + \boldsymbol{V}_t
\]
can be written as
\[
\begin{pmatrix} Y_{t-p+2} \\ Y_{t-p+3} \\ \vdots \\ Y_t \\ Y_{t+1} \end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
\phi_p & \phi_{p-1} & \phi_{p-2} & \cdots & \phi_1
\end{pmatrix}
\begin{pmatrix} Y_{t-p+1} \\ Y_{t-p+2} \\ \vdots \\ Y_{t-1} \\ Y_t \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix} Z_{t+1}.
\]
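As a sanity check on the companion-form representation (a minimal R sketch, not from the notes; the AR(3) coefficients, seed and sample size are arbitrary), one can build F, iterate the state equation and verify that the last component of the state reproduces the AR recursion:

set.seed(1)
phi <- c(0.5, -0.3, 0.2); p <- length(phi)
Fmat <- rbind(cbind(0, diag(p - 1)), rev(phi))  # last row is (phi_p, ..., phi_1)
n <- 200
Z <- rnorm(n)
X <- matrix(0, nrow = p, ncol = n)              # column t holds the state vector X_t
for (t in 1:(n - 1)) {
  X[, t + 1] <- Fmat %*% X[, t] + c(rep(0, p - 1), 1) * Z[t + 1]
}
y_state <- X[p, ]                               # observation: last component of the state
## direct AR(p) recursion with the same innovations and the same (zero) initial values
y_pad <- c(rep(0, p), numeric(n - 1))           # positions 1..p hold the zero initial values
for (t in 2:n) {
  pos <- t + p - 1
  y_pad[pos] <- sum(phi * y_pad[(pos - 1):(pos - p)]) + Z[t]
}
y_direct <- y_pad[p:(n + p - 1)]
max(abs(y_state - y_direct))                    # numerically zero: the two constructions agree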
Example 4.3
Let $\{Y_t\}$ be an ARMA(1, 1) process with
\[
(1 - \phi B) Y_t = (1 + \theta B) Z_t,
\]
where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi|, |\theta| < 1$; we can rewrite this as $Y_t = \phi Y_{t-1} + Z_t + \theta Z_{t-1}$. Let now $X_{t+1} = \phi X_t + Z_{t+1}$ (an AR(1) process) and take as state vector $\boldsymbol{X}_t = (X_{t-1}, X_t)^\top$, so that the state equation is
\[
\begin{pmatrix} X_t \\ X_{t+1} \end{pmatrix}
=
\begin{pmatrix} 0 & 1 \\ 0 & \phi \end{pmatrix}
\begin{pmatrix} X_{t-1} \\ X_t \end{pmatrix}
+
\begin{pmatrix} 0 \\ Z_{t+1} \end{pmatrix}.
\]
With $G_t = [\theta \;\; 1]$ and $W_t \equiv 0$, the observation equation gives
\[
G_t \boldsymbol{X}_t = \theta X_{t-1} + X_t = \theta X_{t-1} + \phi X_{t-1} + Z_t.
\]
To check that this recovers the ARMA(1, 1) recursion, note that
\begin{align*}
Y_t &= \theta X_{t-1} + X_t \\
&= \theta(\phi X_{t-2} + Z_{t-1}) + (\phi X_{t-1} + Z_t) \\
&= \phi(X_{t-1} + \theta X_{t-2}) + \theta Z_{t-1} + Z_t \\
&= \phi Y_{t-1} + \theta Z_{t-1} + Z_t.
\end{align*}
Example 4.4
Let $\{Y_t\} \sim$ MA(1) with $Y_t = Z_t + \theta Z_{t-1}$, $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, and let
\[
\boldsymbol{X}_t = \begin{pmatrix} Y_t \\ \theta Z_t \end{pmatrix}.
\]
Then the observation and state equations are
\[
Y_t = G_t \boldsymbol{X}_t + W_t, \qquad G_t = [1 \;\; 0], \quad W_t \equiv 0 \;\forall t,
\]
\[
\boldsymbol{X}_{t+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \boldsymbol{X}_t + \begin{pmatrix} 1 \\ \theta \end{pmatrix} Z_{t+1}.
\]
The representation is not unique; alternatively, we could have set $X_t = Z_{t-1}$, with $G_t = [\theta]$ and $W_t = Z_t$ in the observation equation and $F_t = [0]$, $V_t = Z_t$ in the state equation.50 Either a low-dimensional or a high-dimensional representation can be used; in the next example the higher-dimensional representation is the more transparent one.
Example 4.5 (ARMA(p, q) process)
Suppose $\{Y_t\} \sim$ ARMA(p, q) with respect to $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ is causal, $\Phi(B) Y_t = \Theta(B) Z_t$. Let $m = \max\{p, q+1\}$ and set $\phi_j = 0$ for all $j > p$, $\theta_j = 0$ for all $j > q$ and $\theta_0 = 1$.
Let $\{X_t\} \sim$ AR(p) with $\Phi(B) X_t = Z_t$ and $\boldsymbol{X}_t = [X_{t-m+1}, X_{t-m+2}, \ldots, X_{t-1}, X_t]^\top$, an $(m \times 1)$ vector. Then the observation equation is
\[
Y_t = G_t \boldsymbol{X}_t + W_t, \qquad G_t = [\theta_{m-1}, \theta_{m-2}, \ldots, \theta_1, \theta_0], \quad W_t \equiv 0.
\]
Alternatively, let m = max{p, q}, φj = 0 ∀ j > p. We use the linear process representation
50 Independence is not in the formulation. We can also reduce the dimension of the state. It is rather
to get another state-space form. Now $\Phi(B) Y_t = \Theta(B) Z_t$, which in turn implies that
\begin{align*}
Y_t &= \{\Phi(B)\}^{-1} \Theta(B) Z_t \\
&= \Psi(B) Z_t \\
&= \sum_{j=0}^{\infty} \psi_j Z_{t-j},
\end{align*}
where $\{\psi_j\}$ are the coefficients in the expansion of $\Psi(z) = \Theta(z)/\Phi(z)$ as a power series in $z$. Let $\{\boldsymbol{X}_t\}$ be defined by the VAR(1) equation
\[
\boldsymbol{X}_{t+1} = F_t \boldsymbol{X}_t + \boldsymbol{V}_t,
\]
an $(m \times 1)$ system, where $F_t$ is as before and $\boldsymbol{V}_t = [\psi_1, \psi_2, \ldots, \psi_m]^\top Z_{t+1} = H Z_{t+1}$, say. The observation equation is
\[
\boldsymbol{Y}_t = G_t \boldsymbol{X}_t + \boldsymbol{W}_t,
\]51
and to justify this representation, we defer the details until we discuss multivariate processes.
$F_t$, $G_t$ can be allowed to vary over time in a deterministic fashion, so that the process is not stationary but retains some stability. Consider the state equation
\[
\boldsymbol{X}_{t+1} = F_t \boldsymbol{X}_t + \boldsymbol{V}_t, \qquad t = 0, \pm 1, \pm 2, \ldots,
\]
51 The lowest dimension one can achieve is max{p, q}. For an arbitrary state-space representation there is no unique formulation: one can add a term in the observation equation and remove it from the state equation.
and suppose $F_t = F$ for all $t$. Then
\begin{align*}
\boldsymbol{X}_{t+1} &= F \boldsymbol{X}_t + \boldsymbol{V}_t \\
&= F(F \boldsymbol{X}_{t-1} + \boldsymbol{V}_{t-1}) + \boldsymbol{V}_t \\
&= F^2 \boldsymbol{X}_{t-1} + F \boldsymbol{V}_{t-1} + \boldsymbol{V}_t \\
&\;\;\vdots \\
&= F^{n+1} \boldsymbol{X}_{t-n} + \sum_{j=0}^{n} F^j \boldsymbol{V}_{t-j};
\end{align*}
that is, for $\boldsymbol{X}_{t+1}$ to be bounded in probability (for all $t$), we must have that $F^n$ stays bounded for all $n$. Write the eigendecomposition $F = E D E^{-1}$ with $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_L)$, so that $F^n = E D^n E^{-1}$ and
\[
D^n = \{\mathrm{diag}(\lambda_1, \ldots, \lambda_L)\}^n = \mathrm{diag}(\lambda_1^n, \ldots, \lambda_L^n),
\]
and therefore we need $|\lambda_j| \le 1$ for all $j$ (in fact $|\lambda_1| < 1$, where $\lambda_1$ is the eigenvalue of largest modulus, for a stable solution); thus, in terms of the solutions of $\det(F - \lambda I) = 0$, we need
\[
\det(I - F z) \neq 0 \qquad \text{for all } z \in \mathbb{C}, \; |z| \le 1.
\]
The simplest structural model is the local level (random walk plus noise) model, with observation equation
\[
Y_t = M_t + W_t, \qquad \{W_t\} \sim \mathrm{WN}(0, \sigma_W^2),
\]
and the state equation
\[
M_{t+1} = M_t + V_t, \qquad \{V_t\} \sim \mathrm{WN}(0, \sigma_V^2),
\]
where $\{V_t\}$ and $\{W_t\}$ are uncorrelated processes. Thus the mean level is evolving as a random walk and $Y_t$ varies around it according to observation noise. Here $K = L = 1$, so this is a one-dimensional state model, and the state equation is clearly a random walk. Let $Y_t^* = (1 - B)Y_t = V_{t-1} + W_t - W_{t-1}$; then
\[
\mathrm{E}(Y_t^* Y_{t+h}^*) = \mathrm{E}\{(V_{t-1} + W_t - W_{t-1})(V_{t+h-1} + W_{t+h} - W_{t+h-1})\},
\]
which vanishes for $|h| > 1$, so the differenced series has the autocovariance structure of an MA(1) process.
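A quick numerical check (a minimal R sketch with arbitrary variances, not from the notes): the differenced observations from a simulated local level model have a sample autocorrelation that is negligible beyond lag 1, as for an MA(1).

set.seed(42)
n <- 1000; sigma_V <- 0.5; sigma_W <- 1
M <- cumsum(rnorm(n, sd = sigma_V))   # random walk level M_t
Y <- M + rnorm(n, sd = sigma_W)       # observations Y_t = M_t + W_t
acf(diff(Y), lag.max = 10)            # a (negative) spike at lag 1 only, as for an MA(1)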
A local linear trend model is obtained by adding a stochastic slope $\{B_t\}$ to the level:
\[
M_{t+1} = M_t + B_t + V_t, \qquad B_{t+1} = B_t + U_t,
\]
with $\{U_t\} \sim \mathrm{WN}(0, \sigma_U^2)$ uncorrelated with $\{V_t\}$ and $\{W_t\}$. Then, writing $\boldsymbol{X}_t = (M_t, B_t)^\top$,
\[
\boldsymbol{X}_{t+1} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \boldsymbol{X}_t + \boldsymbol{V}_t,
\qquad Y_t = [1 \;\; 0]\, \boldsymbol{X}_t + W_t,
\]
where $\boldsymbol{V}_t \sim \mathrm{WN}(0, Q)$ with
\[
Q = \begin{pmatrix} \sigma_V^2 & 0 \\ 0 & \sigma_U^2 \end{pmatrix},
\]
and the dimension parameters of the state-space model are $L = 2$, $K = 1$.
Seasonal model
Suppose we wish to construct a stochastic seasonal component with period $d$, say. Recall that a deterministic seasonal component satisfies $S_{t+d} = S_t$ for all $t > d$ and $\sum_{t=1}^{d} S_t = 0$. To make the component stochastic, write
\[
Y_{t+1} = -Y_t - Y_{t-1} - \cdots - Y_{t-d+2} + S_t,
\]
where $\{S_t\}$ is a zero-mean random variable (shock). We can write this in a state-space formulation
as follows. Let
\[
\boldsymbol{X}_t = \begin{pmatrix} Y_t \\ Y_{t-1} \\ \vdots \\ Y_{t-d+2} \end{pmatrix} \quad ((d-1) \times 1),
\]
so that the observation equation is
\[
Y_t = [1 \;\; 0 \;\; \cdots \;\; 0]\, \boldsymbol{X}_t + W_t,
\]
and the state equation is
\[
\boldsymbol{X}_{t+1} = F \boldsymbol{X}_t + \boldsymbol{V}_t
\]
with
\[
F = \begin{pmatrix}
-1 & -1 & -1 & \cdots & -1 & -1 \\
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & & \ddots & & & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix},
\qquad
\boldsymbol{V}_t = \begin{pmatrix} S_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
\]
Combining the trend and seasonal components into a single state vector gives the basic structural model, with
\[
\text{Obs:} \quad Y_t = G_t \boldsymbol{X}_t + W_t, \qquad G = [1, 0, 1, 0, \ldots, 0], \quad W_t \sim \mathrm{WN}(0, \sigma_W^2),
\]
and
\[
\text{State:} \quad \boldsymbol{X}_{t+1} = F_t \boldsymbol{X}_t + \boldsymbol{V}_t,
\]
where $F$ is block diagonal, combining the transition matrices of the previous models:
\[
F_t \equiv F = \begin{pmatrix}
1 & 1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & -1 & -1 & \cdots & -1 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \ddots & & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}.
\]
[Handout: reconstructed state components for the air data.] Here $\sigma_1^2$ is the variance of the noise driving $\{M_t\}$, $\sigma_2^2$ that driving $\{B_t\}$ and similarly $\sigma_3^2$ for the seasonal component, with
\[
\boldsymbol{V}_t \sim \mathrm{WN}(0, Q), \qquad Q = \begin{pmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{pmatrix}.
\]
In the accompanying code, the slope $B_{t+1}$ is taken to be constant, and the code has been adjusted accordingly.
In full, the model combines the equations (the last recursion is the seasonal sub-state, written with the same symbol $Y$ as in the seasonal model above):
\begin{align*}
Y_t &= M_t + (-Y_{t-1} - \cdots - Y_{t-d+1}) + W_t, \\
M_{t+1} &= M_t + B_t + V_t^{(0)}, \\
B_{t+1} &= B_t + U_t, \\
Y_{t+1} &= (-Y_t - \cdots - Y_{t-d+2}) + S_t.
\end{align*}
This gives a 17-parameter maximum likelihood problem; on the original (unlogged) scale the fitted trend has a slope of roughly 2, and the model ends up capturing the heteroskedasticity of the data through the seasonal part. Amending the code to maximize the likelihood for the logged data (log-scale transformation) yields a much more linear trend with less leakage into the seasonal component, and both the estimated seasonal component and the forecasts are better behaved.
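The structural model can also be fitted with the built-in StructTS function rather than the course's own code; the following minimal sketch (the AirPassengers data set and the forecast horizon are illustrative choices) fits the basic structural model on the raw and on the log scale:

fit_raw <- StructTS(AirPassengers, type = "BSM")
fit_log <- StructTS(log(AirPassengers), type = "BSM")
fit_raw$coef                                   # estimated variances: level, slope, seasonal, observation
fit_log$coef
plot(cbind(log(AirPassengers), fitted(fit_log)),
     main = "BSM fitted on the log scale")     # data and filtered level/slope/seasonal components
pred <- predict(fit_log, n.ahead = 24)         # forecasts on the log scale
ts.plot(log(AirPassengers), pred$pred, col = c(1, 2))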
Under Gaussian noise assumptions, the state-space model can be written as
\[
\text{Obs:} \quad \boldsymbol{Y}_t \mid \boldsymbol{X}_t \sim \mathcal{N}_K(G_t \boldsymbol{X}_t, \Sigma_t),
\qquad
\text{State:} \quad \boldsymbol{X}_{t+1} \mid \boldsymbol{X}_t \sim \mathcal{N}_L(F_t \boldsymbol{X}_t, \Omega_t).
\]
The one-step prediction density is
\[
p(\boldsymbol{x}_{t+1} \mid \boldsymbol{y}_{1:t}) = \int p(\boldsymbol{x}_{t+1} \mid \boldsymbol{x}_t)\, p(\boldsymbol{x}_t \mid \boldsymbol{y}_{1:t})\, \mathrm{d}\boldsymbol{x}_t,
\]
and $p(\boldsymbol{x}_t \mid \boldsymbol{y}_{1:t})$ is not yet available. This comes back to filtering, where
\[
p(\boldsymbol{x}_t \mid \boldsymbol{y}_{1:t}) \propto p(\boldsymbol{y}_t \mid \boldsymbol{x}_t)\, p(\boldsymbol{x}_t \mid \boldsymbol{y}_{1:t-1}),
\]
and the latter term is again of the form found in the prediction step. This suggests computing recursively $p(x_1), p(x_1 \mid y_1), p(x_2 \mid y_1), p(x_2 \mid y_{1:2}), \ldots$ For time point $t$ in the Gaussian case, if $p(x_{1:n}, y_{1:n})$ is jointly Gaussian, then $p(x_t \mid y_{1:t})$ is also Gaussian, as is $p(x_{t+1} \mid y_{1:t})$ (any conditional or marginal distribution of a Gaussian vector is again normal). Hence all we need to do is track the moments of the prediction and filtering distributions. For example, denoting $\boldsymbol{a}_{t|t} = \mathrm{E}(\boldsymbol{X}_t \mid \boldsymbol{y}_{1:t})$ and $P_{t|t} = \mathrm{Var}(\boldsymbol{X}_t \mid \boldsymbol{y}_{1:t})$, we have $\boldsymbol{X}_t \mid \boldsymbol{y}_{1:t} \sim \mathcal{N}_L(\boldsymbol{a}_{t|t}, P_{t|t})$, where the moments are obtained from the recursions below, and we can compute the likelihood using terms computed during the Kalman filter, since $\boldsymbol{Y}_t \mid \boldsymbol{y}_{1:t-1}$ is Gaussian as well. This, however, depends heavily on the Gaussianity assumption; conjugate prior distributions and discrete (finite support) state distributions are other examples where the recursive computations can be carried out exactly. The recursive calculations and the algorithm are outlined in the R code provided.
The Kalman recursions are given by the following equations, where $\boldsymbol{a}_{t|t}$ is the best linear predictor of the state given $\boldsymbol{y}_{1:t}$ and $P_{t|t}$ the corresponding mean-square error matrix; the quantities $\boldsymbol{v}_t$ and $M_t$ denote the one-step-ahead error in forecasting $\boldsymbol{y}_t$ conditional on the information set at time $t-1$ and its MSE, respectively:
\begin{align}
\boldsymbol{v}_t &= \boldsymbol{y}_t - G_t \boldsymbol{a}_{t|t-1} \tag{4.11a} \\
M_t &= G_t P_{t|t-1} G_t^\top + \Sigma_t \tag{4.11b} \\
\boldsymbol{a}_{t|t} &= \boldsymbol{a}_{t|t-1} + P_{t|t-1} G_t^\top M_t^{-1} \boldsymbol{v}_t \tag{4.11c} \\
P_{t|t} &= P_{t|t-1} - P_{t|t-1} G_t^\top M_t^{-1} G_t P_{t|t-1} \tag{4.11d} \\
\boldsymbol{a}_{t+1|t} &= F_t \boldsymbol{a}_{t|t} \tag{4.11e} \\
P_{t+1|t} &= F_t P_{t|t} F_t^\top + \Omega_t \tag{4.11f}
\end{align}
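The following is a minimal R implementation of the recursions (4.11a)-(4.11f) for time-invariant system matrices; it is a sketch, not the course's provided code, and the function name kalman_filter, the diffuse-ish initialisation and the local-level usage example are illustrative choices.

kalman_filter <- function(y, Fmat, G, Sigma, Omega, a1, P1) {
  y <- as.matrix(y)                              # n x K matrix of observations
  n <- nrow(y); L <- length(a1)
  a_pred <- matrix(a1, L, 1); P_pred <- P1       # a_{1|0} and P_{1|0}
  a_filt <- matrix(NA_real_, n, L); loglik <- 0
  for (t in seq_len(n)) {
    v <- matrix(y[t, ], ncol = 1) - G %*% a_pred         # (4.11a) innovation
    M <- G %*% P_pred %*% t(G) + Sigma                   # (4.11b) innovation variance
    gain <- P_pred %*% t(G) %*% solve(M)                 # gain used in (4.11c)-(4.11d)
    a_f <- a_pred + gain %*% v                           # (4.11c) filtered mean
    P_f <- P_pred - gain %*% G %*% P_pred                # (4.11d) filtered MSE
    loglik <- loglik - 0.5 * (length(v) * log(2 * pi) +
      as.numeric(determinant(M)$modulus) + sum(v * solve(M, v)))
    a_filt[t, ] <- a_f
    a_pred <- Fmat %*% a_f                               # (4.11e) one-step prediction
    P_pred <- Fmat %*% P_f %*% t(Fmat) + Omega           # (4.11f) its MSE
  }
  list(filtered = a_filt, loglik = loglik)
}
## usage on the local level model of the previous section (unit variances assumed)
set.seed(1)
y <- cumsum(rnorm(200)) + rnorm(200)
out <- kalman_filter(y, Fmat = matrix(1), G = matrix(1),
                     Sigma = matrix(1), Omega = matrix(1),
                     a1 = 0, P1 = matrix(10))
plot(y, type = "l"); lines(out$filtered[, 1], col = 2)   # filtered estimate of the level

The accumulated loglik term is the Gaussian likelihood built from the innovations, which is the likelihood computation mentioned above.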
These equations can be combined into further recursions, which give another form for the prediction equations and simplify the expressions above. A smoothing algorithm can also be applied to a state-space model given a fixed set of data: estimates of the state vector are computed at each $t$ using all available information. Denote by $\boldsymbol{a}_{t|n}$ the smoothed linear estimate for $t \in \{0, \ldots, n-1\}$ given all the data up to time $n$, that is $\boldsymbol{a}_{t|n} = \mathrm{E}(\boldsymbol{X}_t \mid \boldsymbol{y}_{1:n})$, computed along with $P_{t|n}$ via the backward recursions
\begin{align}
P_t^* &= P_{t|t} F_t^\top P_{t+1|t}^{-1} \tag{4.13a} \\
\boldsymbol{a}_{t|n} &= \boldsymbol{a}_{t|t} + P_t^* (\boldsymbol{a}_{t+1|n} - \boldsymbol{a}_{t+1|t}) \tag{4.13b} \\
P_{t|n} &= P_{t|t} + P_t^* (P_{t+1|n} - P_{t+1|t}) (P_t^*)^\top \tag{4.13c}
\end{align}
To illustrate the calculations, consider again the scalar (local level) model
\[
Y_t = X_t + W_t, \qquad X_{t+1} = X_t + V_t,
\]
where we assume that $W_t \sim \mathcal{N}(0, \sigma_W^2)$ and $V_t \sim \mathcal{N}(0, \sigma_V^2)$ are independent white noise series.
The joint likelihood for observations and states up to time $n$ is of the form
\[
p(x_{1:n}, y_{1:n}) = p(x_1)\, p(y_1 \mid x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1})\, p(y_i \mid x_i),
\]
which is a two-parameter model in the Gaussian case; alternatively, we could also treat $x_1$ as unknown: it could be regarded as having some density, or even as a state with a degenerate distribution. The joint distribution written above is multivariate normal, $\mathcal{N}_{2n}(\cdot, \cdot)$, and by standard results all of its marginal distributions, but also its conditional distributions, are Gaussian. We can thus compute the first two moments to obtain a complete characterization of the model.
(1) Filtering.
As mentioned earlier, this is a Bayes theorem calculation, where for now we ignore the terms involving $y_t$ in the denominator and work up to proportionality. We have
\[
Y_t \mid X_t = x_t \sim \mathcal{N}(x_t, \sigma_W^2), \qquad X_{t+1} \mid X_t = x_t \sim \mathcal{N}(x_t, \sigma_V^2),
\]
and the recursion assumption is that $X_t \mid Y_{1:t-1} \sim \mathcal{N}(m_{t|t-1}, S_{t|t-1}^2)$, which depends only on low-dimensional summary statistics. Then
\[
p(x_t \mid y_{1:t}) \propto \exp\left[-\frac{1}{2}\left\{\frac{1}{\sigma_W^2}(x_t - y_t)^2 + \frac{1}{S_{t|t-1}^2}(x_t - m_{t|t-1})^2\right\}\right],
\]
such that $X_t \mid Y_{1:t} \sim \mathcal{N}(m_{t|t}, S_{t|t}^2)$. This is a simple "complete the square" calculation, using the fact that
\[
A(x - a)^2 + B(x - b)^2 = M(x - m)^2 + \text{constant} = (A + B)\left(x - \frac{Aa + Bb}{A + B}\right)^2 + \frac{AB}{A + B}(a - b)^2,
\]
which implies that
\[
m_{t|t} = \frac{\dfrac{y_t}{\sigma_W^2} + \dfrac{m_{t|t-1}}{S_{t|t-1}^2}}{\dfrac{1}{\sigma_W^2} + \dfrac{1}{S_{t|t-1}^2}}
= \frac{y_t S_{t|t-1}^2 + \sigma_W^2\, m_{t|t-1}}{S_{t|t-1}^2 + \sigma_W^2}
\qquad \text{and} \qquad
S_{t|t}^2 = \left(\frac{1}{\sigma_W^2} + \frac{1}{S_{t|t-1}^2}\right)^{-1},
\]
and so the parameters of the Normal distributions are known in terms of the previous observations.
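As a quick numerical check (a minimal sketch; all the numbers below are arbitrary illustrative values), the closed-form update can be compared with a brute-force grid computation of the filtering density:

sigma_W <- 1.5; m_pred <- 0.3; S2_pred <- 2.0; y_t <- 1.1   # assumed sigma_W, m_{t|t-1}, S^2_{t|t-1}, y_t
## closed-form update
S2_filt <- 1 / (1 / sigma_W^2 + 1 / S2_pred)
m_filt  <- (y_t / sigma_W^2 + m_pred / S2_pred) * S2_filt
## grid approximation of p(x_t | y_{1:t})
x <- seq(-10, 10, by = 0.001)
w <- dnorm(y_t, mean = x, sd = sigma_W) * dnorm(x, mean = m_pred, sd = sqrt(S2_pred))
w <- w / sum(w)
c(m_filt, sum(w * x))                        # the two means agree
c(S2_filt, sum(w * (x - sum(w * x))^2))      # the two variances agree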
(2) Prediction. This is a posterior predictive-type calculation. We have
\begin{align*}
p(x_{t+1} \mid y_{1:t}) &= \int p(x_{t+1} \mid x_t, y_{1:t})\, p(x_t \mid y_{1:t})\, \mathrm{d}x_t \\
&= \int p(x_{t+1} \mid x_t)\, p(x_t \mid y_{1:t})\, \mathrm{d}x_t \\
&\propto \int \exp\left\{-\frac{1}{2\sigma_V^2}(x_{t+1} - x_t)^2\right\} \exp\left\{-\frac{1}{2 S_{t|t}^2}(x_t - m_{t|t})^2\right\} \mathrm{d}x_t \\
&= \exp\left\{-\frac{1}{2}\, \frac{(x_{t+1} - m_{t|t})^2}{S_{t|t}^2 + \sigma_V^2}\right\} \int \exp\left\{-\frac{1}{2}\, \frac{S_{t|t}^2 + \sigma_V^2}{\sigma_V^2 S_{t|t}^2}\, (x_t - m^*)^2\right\} \mathrm{d}x_t \\
&\propto \exp\left\{-\frac{1}{2(\sigma_V^2 + S_{t|t}^2)}\, (x_{t+1} - m_{t|t})^2\right\},
\end{align*}
using the fact that $x_{t+1} \mid x_t \perp y_{1:t}$ by assumption. We can therefore conclude that $X_{t+1} \mid y_{1:t} \sim \mathcal{N}(m_{t+1|t}, S_{t+1|t}^2)$. For the prediction, we could also use a shortcut based on conditional independence and carry out iterated expectation and iterated variance calculations. We again derive (in this case verify) that $m_{t+1|t} = m_{t|t}$ and $S_{t+1|t}^2 = \sigma_V^2 + S_{t|t}^2$:
\[
m_{t+1|t} = \mathrm{E}(X_{t+1} \mid y_{1:t}) = \mathrm{E}_{X_t \mid Y_{1:t}}\!\left\{\mathrm{E}_{X_{t+1} \mid X_t, Y_{1:t}}(X_{t+1} \mid X_t, y_{1:t})\right\} = \mathrm{E}_{X_t \mid Y_{1:t}}(X_t) = m_{t|t},
\]
and similarly
\[
S_{t+1|t}^2 = \mathrm{Var}(X_{t+1} \mid y_{1:t}) = \mathrm{E}\{\mathrm{Var}(X_{t+1} \mid X_t) \mid y_{1:t}\} + \mathrm{Var}\{\mathrm{E}(X_{t+1} \mid X_t) \mid y_{1:t}\} = \sigma_V^2 + S_{t|t}^2.
\]
It remains to calculate $p(y_{1:n}) = p(y_1)\, p(y_2 \mid y_1)\, p(y_3 \mid y_{1:2}) \cdots p(y_n \mid y_{1:n-1})$. This is again a posterior-type calculation, which can be made using
\[
p(y_t \mid y_{1:t-1}) = \int p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})\, \mathrm{d}x_t,
\]
where both densities in the integrand are Gaussian in $x_t$, or again via iterated expectation and variance calculations.
Chapter 5
Financial time series models
These models are used to describe, for example:
◦ asset/bond/option prices
◦ interest rates
◦ exchange rates
License
Attribution - You must attribute the work in the manner specified by the author or licensor (but not in any
way that suggests that they endorse you or your use of the work).
Noncommercial - You may not use this work for commercial purposes.
Share Alike - If you alter, transform, or build upon this work, you may distribute the resulting work only
under the same or similar license to this one.
Waiver - Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain - Where the work or any of its elements is in the public domain under applicable law, that
status is in no way affected by the license.
Other Rights - In no way are any of the following rights affected by the license:
Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
The author’s moral rights;
Rights other persons may have either in the work itself or in how the work is used, such as publicity or
privacy rights.