
MATH 545 - Intro to Time Series Analysis

Pr. David A. Stephens

Course notes by
Léo Raymond-Belzile
[email protected]

The current version is that of August 21, 2015


Fall 2012, McGill University
Please notify the author by email if you find any typos.
These notes have not been revised and should be read carefully.
Licensed under Creative Commons Attribution-Non Commercial-ShareAlike 3.0 Unported
Contents

1 Introduction 5
1.1 Simple stationary models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Trends and Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2 Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.3 Non-parametric trend removal . . . . . . . . . . . . . . . . . . . . . . 15
1.2.4 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.5 Assessing the white noise assumption . . . . . . . . . . . . . . . . . . 18
1.2.6 Some general results and representations for stationary processes . . . 19
1.3 Autoregressive Time Series Processes . . . . . . . . . . . . . . . . . . . . . . . 22
1.3.1 Autoregressive model of order p (AR(p)) . . . . . . . . . . . . . . . . . 26
1.4 Moving Average Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.4.1 MA(q) Process as an AR process . . . . . . . . . . . . . . . . . . . . . 29
1.5 Forecasting Stationary Processes . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.6 Partial autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6.1 Wold Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2 ARMA models 42
2.1 Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.1 ARMA(p, q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.2 Autocovariance function . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.3 Autocovariance Generating function (ACVGF) . . . . . . . . . . . . . 48
2.1.4 Forecasting for ARMA(p, q) . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2 Estimation and model selection for ARMA(p,q) . . . . . . . . . . . . . . . . . 52
2.2.1 Moment-based estimation . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . 53
2.2.3 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Non-Stationary and Seasonal Models 58
3.1 ARIMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Unit roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 Seasonal ARIMA models (SARIMA) . . . . . . . . . . . . . . . . . . . . . . . 61

4 State-space models 63
4.1 State-Space Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Basic Structural Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Filtering and Smoothing: the Kalman Filter . . . . . . . . . . . . . . . . . . . 72

5 Financial time series models 78

Foreword
The objective of time series analysis is the modeling of sequences of random variables indexed by time, X1, X2, ..., Xt, based on data x1, ..., xn. Typically, (X1, ..., Xn) will not be mutually independent and n is large1. Much of the interest lies in forecasting and prediction: given data x1, ..., xn, we wish to make statements about Xn+1, Xn+2, ...
Definition (Time series)
A time series model is a probabilistic representation describing {Xt} via its joint distribution or its moments, in particular the expectation, variance and covariance.

It will often not be possible to specify a joint distribution for the observations. Also, one
will want to impose restrictions on moments, which we refer to as simplifying assumptions.
Note
1. Xt can be a discrete or continuous random variable

2. Xt could be vector-valued.

3. The time index will most typically be discrete and represent constant time spacing -
it could also be necessary to consider continuous-time indexing.

1 That is, usually greater than 50, contrary to longitudinal statistics, where the number of time observations is small.

Chapter 1
Introduction
Section 1.1: Simple stationary models
The joint CDF of (X1 , . . . , Xn ) gives one description of the probabilistic relationship between
the random variables (RVs). Using the standard notation FX1,...,Xn(x1, ..., xn) = P(X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn), this function fully specifies the joint probability model for the RVs. If Xt is vector-valued, denoted Xt = (Xt1, ..., Xtk)⊤ (namely a k × 1 vector), then "Xt ≤ xt" is to be interpreted componentwise: Xt1 ≤ xt1, Xt2 ≤ xt2, ..., Xtk ≤ xtk. A
special case of this is mutual independence.
X1, ..., Xn are mutually independent ⇔ P(X1 ≤ x1, ..., Xn ≤ xn) = ∏_{t=1}^{n} P(Xt ≤ xt),

that is the observed values of X1 , X2 , . . . , Xt−1 contain no information for modeling (or
predicting) Xt . Beyond independence, at the other extreme, specifying a completely gen-
eral joint model requires the specification of one marginal distribution and n − 1 differ-
ent conditional models2 . Often, modeling is restricted to specification of moments of the
process. Most typically, attention focuses on the first two moments (expectation and vari-
ance/covariance).
The emphasis of the lectures will be on stationary models, which we deal with at the beginning of the course. The concept has to do with stability. Some processes we dealt with in the last class were stable in mean or in variance, or exhibited periodic structure.
Definition 1.1 (Stationarity)
A time series process is stationary if {Xt, t ∈ Z} and {Xt+h, t ∈ Z}, for h = ±1, ±2, ..., have precisely the same statistical properties (either in terms of joint distribution or moment properties).

In practical terms, t is arbitrary: it does not matter where we start collecting data, we will always be able to dip into the stationary model. For most of the course we will be dealing with weak stationarity, or moment stationarity, as the joint distribution is intractable.
Definition 1.2 (Mean and covariance function)

Let {Xt } be a time series process with E Xt2 < ∞ ∀ t. Then µX (t) = E (Xt ) is the mean
function and γ̃X (t, s) = Cov(Xt , Xs ) = E ((Xt − µX (t))(Xs − µX (s))) for integers t, s is the
covariance function.
2 This procedure can be intractable (if one is interested in a specific period and the implications of the model for that given interval) and may not be feasible analytically.

Definition 1.3 (Weak stationarity)
We say {Xt } is weakly stationary if

1. µX (t) does not depend on t;

2. γ̃X (t + h, t) does not depend on t for all integers h.

A stronger form of stationarity imposes conditions on the form of the joint distribution of
the process.
Definition 1.4 (Strong stationarity)
{Xt } is strongly stationary if (Xt+1, ..., Xt+n) and (Xt+h+1, ..., Xt+h+n) have the same joint distribution for all t, h, n.

Clearly, the second definition implies the first, although it does not guarantee that the first two moments exist. The Gaussian process is an example satisfying the latter, stronger condition, with a multivariate normal joint distribution and a specific structure for the covariance matrix and mean vector. For the most part, we will be dealing with weak stationarity.
Definition 1.5 (Autocovariance function)
For a weakly stationary process, define the autocovariance function (ACVF) by

γX(h) ≡ γ̃X(h, 0) ≡ γ̃X(t + h, t)

and the autocorrelation function (or ACF) by ρX(h) = γX(h)/γX(0) for h ∈ Z.3

So far, we have defined only properties of the process, not the data. For the moment,
these are nonparametrically specified functions. We have by definition that γX (h) =
Cov(Xt+h , Xt ) and γX (0) = Cov(Xt , Xt ) = Var(Xt ) and ρX (h) = Cor(Xt+h , Xt ) for all
t, h.
Some examples to illustrate the construction of these functions.
Example 1.1 (IID process)
Suppose {Xt} is an IID binary process, i.e. Xt ∈ {−1, 1} with P(Xt = 1) = p and P(Xt = −1) = 1 − p, and the Xt mutually independent.
We can easily compute the moments, as

E (Xt ) = p + (1 − p)(−1) = 2p − 1

and
Var(Xt ) = [p(1)2 + (1 − p)(−1)2 ] − (2p − 1)2 = 1 − (1 − 2p)2 .

By independence, Cov(Xt , Xs ) = 0 if t 6= s.
3 This function of h will be bounded on [−1, 1].

Example 1.2 (Random walk)
A more interesting process is the random walk, where {Xt } is the same as in the previous
example with the IID process and {St } is a process constructed from {Xt }, defined by
St = ∑_{i=1}^{t} Xi = St−1 + Xt, assuming that t ≥ 1 and that the process starts at 0 (that is, S0 = 0).
Although the process is easy to construct, it has some interesting properties. If we take
the simple case where p = 1/2, then E (Xt ) = 0 and Var(Xt ) = 1 and therefore E (St ) = 0,
but Var(St) = t: since the variance of a sum of IID random variables is the sum of the individual variances, the variance increases linearly with t and the process is not stationary.
If p ≠ 1/2, then E(Xt) = 2p − 1, so E(St) = (2p − 1)t also grows linearly with t, and Var(Xt) = 1 − (1 − 2p)² with Var(St) = t Var(Xt). In this case, if p < 1/2 we have a downward "drift", while if p > 1/2 we have an upward drift. In all cases, the variance of St grows linearly with t. As such, {St} is (clearly) nonstationary.
Note
We could also define Xt as the difference of the random variable St , that is Xt = St − St−1 .
This gives a key insight on how to turn a nonstationary process into a stationary one, in
this case by differencing.
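A minimal R sketch (not part of the original notes) illustrating this point: simulate the ±1 random walk and recover a stationary series by differencing; the sample size and seed are arbitrary choices.

```r
set.seed(1)
p <- 0.5
x <- sample(c(-1, 1), size = 500, replace = TRUE, prob = c(1 - p, p))
s <- cumsum(x)                      # random walk S_t, nonstationary
d <- diff(s)                        # first difference recovers the IID steps X_t
c(var(s[1:100]), var(s[401:500]))   # variance grows with t
acf(d, lag.max = 3, plot = FALSE)   # differenced series behaves like white noise
```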
Example 1.3 (Stationary Gaussian process)
Suppose, for all finite collections X1 , . . . , Xn , the joint distribution is multivariate Gaussian
(Normal) with E (Xt ) = µX (not depending on t) and covariance defined for Xt , Xs as
Cov(Xt , Xs ) = γX (|t − s|) for t, s ∈ {1, . . . , n}.
This stationary version imposes extra conditions, which correspond to a structured covariance matrix with fewer than ½ n(n + 1) distinct elements.
Let ΓX(n) denote the (n × n) matrix with [ΓX(n)]_{t,s} = γX(|t − s|). Then ΓX(n) is a symmetric positive definite4 matrix (∀ x ∈ Rⁿ, x⊤ ΓX(n) x > 0) with Toeplitz structure, constant along diagonals:

ΓX(n) = [ γX(0)      γX(1)      γX(2)    ⋯   γX(n−1)
          γX(1)      γX(0)      γX(1)    ⋯   γX(n−2)
          γX(2)      γX(1)      γX(0)    ⋯   γX(n−3)
            ⋮          ⋮          ⋮      ⋱      ⋮
          γX(n−1)    γX(n−2)      ⋯     γX(1)  γX(0) ],

unlike the usual settings, with only n free parameters in ΓX(n). We write the vector X = (X1, ..., Xn)⊤ ∼ Nn(µX 1n, ΓX(n)).5 Note that all marginal distributions and all conditional distributions are also multivariate Gaussian. Specifically, the one that we might be most interested in is p(Xt | X1, ..., Xt−1), which is univariate Gaussian.

4 In fact, it is non-negative definite, but we will restrict our examples to positive definite matrices.
5 Again, these are parameters of the process, not estimates derived from the data.
Exercise 1.1
Review the properties of the multivariate Normal and make sure you understand the last
statement and are able to derive it.

The next few examples use similar construction methods, but are more general.
Example 1.4 (General IID noise)

Let {Xt } ∼ IID(0, σ 2 ), so that E (Xt ) = 0 and E Xt2 = σ 2 < ∞ with all Xt mutually
independent. From that, we get that the autocovariance function is

γX(h) = γ̃X(t + h, t) = σ² if h = 0, and 0 if h ≠ 0,

and

ρX(h) = 1 if h = 0, and 0 if h ≠ 0.

We now have a more arbitrary and general stationary process.


Example 1.5 (White noise)

Let {Xt } ∼ WN(0, σ 2 ). The white noise process has E (Xt ) = 0 and E Xt2 = σ 2 < ∞ with
Xt ’s uncorrelated6 . The autocovariance function is still

γX(h) = γ̃X(t + h, t) = σ² if h = 0, and 0 if h ≠ 0.

For most of what we will be doing in the course, we will restrict to white noise. However,
if we wanted to estimate the model using maximum likelihood, we would need additional
requirements (IID).
Example 1.6 (General random walk)
Let {Xt} ∼ IID(0, σ²) and again define {St} by St = ∑_{i=1}^{t} Xi. Using linearity of the expectation, E(Xt) = 0 implies that E(St) = 0, and E(Xt²) = σ² gives E(St²) = tσ², since

St² = (∑_{i=1}^{t} Xi)² = ∑_{i=1}^{t} Xi² + 2 ∑_{i=2}^{t} ∑_{j=1}^{i−1} Xi Xj.

The first term of the summation has expectation tσ² and the second term has expectation zero from independence. Again, {St} is nonstationary.
6 Independence would imply uncorrelatedness, but not conversely. This allows us to relax our assumptions, since the only calculations we are required to perform involve moments, so uncorrelatedness is really a moment requirement.

We also have

γ̃S(t + h, t) = Cov(St+h, St)
             = Cov(St + Xt+1 + ⋯ + Xt+h, St)
             = Cov(St, St) + ∑_{i=1}^{h} Cov(Xt+i, St)
             = Var(St) + 0
             = tσ²   (for h ≥ 0),

and the function depends on t, but not on h. This calculation uses the result for covariances which says that Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z).

Example 1.7 (Moving average process)


Let {Xt } ∼ MA(1) and suppose {Zt } ∼ WN(0, σ 2 ) where σ 2 < ∞. Let θ ∈ R and define

Xt = Zt + θZt−1 , t∈Z (1.1)

We can easily verify stationarity by calculating the first two moments. By linearity

E (Xt ) = E (Zt + θZt−1 ) = E (Zt ) + θE (Zt−1 ) = 0

We want to verify that γ̃X (t + h, t) doesn’t depend on t

Cov(Xt+h , Xt ) = E (Xt+h Xt )

Now

Xt+h Xt = (Zt+h + θZt+h−1 )(Zt + θZt−1 )


= Zt+h Zt + θ(Zt+h Zt−1 + Zt+h−1 Zt ) + θ2 Zt+h−1 Zt−1 .

Since Zt is a white-noise process, E(Zr Zs) = σ² if r = s and zero otherwise. Then E(Zt+h Zt) = σ² if h = 0, E(Zt+h Zt−1) = σ² if h = −1, E(Zt+h−1 Zt) = σ² if h = 1, and E(Zt+h−1 Zt−1) = σ² if h = 0 (each equal to 0 otherwise).

Combining the above results, we obtain

γ̃X(t + h, t) = σ²(1 + θ²) if h = 0,  σ²θ if h = ±1,  and 0 otherwise.

Since γ̃X does not depend on t, this implies that {Xt} is stationary: Xt ∼ MA(1) has Var(Xt) = γ̃X(t, t) = σ²(1 + θ²), with autocorrelation function

ρX(h) = 1 if h = 0,  θ/(1 + θ²) if h = ±1,  and 0 otherwise.

We will see that any process defined as a linear combination of white-noise terms at arbitrary lags will be a stationary process.
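As a quick check of the MA(1) calculations above, here is a minimal R sketch (θ and the sample size are illustrative choices, not values from the notes):

```r
set.seed(42)
theta <- 0.6
z <- rnorm(10000)                      # white noise Z_t
x <- z + theta * c(0, z[-length(z)])   # X_t = Z_t + theta * Z_{t-1}
acf(x, lag.max = 2, plot = FALSE)      # sample ACF: near zero beyond lag 1
theta / (1 + theta^2)                  # theoretical rho(1)
```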
Example 1.8 (Autoregression)
Let {Xt} ∼ AR(1). Again, suppose {Zt} ∼ WN(0, σ²), with σ² < ∞. Let |φ| < 1 and7 assume {Xt} is a stationary process such that Xt and Zs are uncorrelated for s > t, with the following recursive definition

Xt = φXt−1 + Zt , t = 0, ±1, ±2, . . . .

8
This can be rewritten Xt − φXt−1 = Zt . Then

E (Xt ) = E (φXt−1 ) + E (Zt ) = φE (Xt−1 ) .

Under the assumption of stationarity, we must have E (Xt ) = 0 as this is the only solution
for an arbitrary φ. Under stationarity, we can go directly for the autocorrelation function
and

γX (h) = E (Xt+h Xt ) = E (Xt Xt−h ) = γX (−h)


7 Otherwise, if φ = 1, we would have the random walk model.
8 It
is not straightforward to derive stationarity with this definition, we will prove it later. We can for
now do some calculations.

so that γX is an even function. We can take this form here and substitute in for Xt , yielding

γX (h) = E ((φXt−1 + Zt )Xt−h )


= φE (Xt−1 Xt−h ) + E (Zt Xt−h )
= φγX (h − 1) + 0

where E(Zt Xt−h) = 0 since t > t − h. Hence, iterating the above recursion, we get γX(h) = φγX(h − 1) = φ^h γX(0) for h > 0 and, using symmetry, γX(h) = γX(−h) = φ^{|h|} γX(0).
We now want to find γX (0), equal to

Cov(Xt, Xt) = E(Xt²)
            = E((φXt−1 + Zt)(φXt−1 + Zt))
            = φ² E(Xt−1²) + 2φ E(Xt−1 Zt) + E(Zt²).

But, by assumption, E(Xt−1 Zt) = 0 (as current X are uncorrelated with future values of Z) and E(Xt−1²) = γX(0) as {Xt} is stationary. Therefore, γX(0) = φ²γX(0) + σ². This implicit formula can be rearranged to get the explicit form γX(0) = σ²/(1 − φ²).
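A short R sketch (illustrative, not from the notes) comparing the sample ACF of a simulated AR(1) with the theoretical φ^h decay derived above:

```r
set.seed(1)
phi <- 0.7
x <- arima.sim(model = list(ar = phi), n = 5000)      # stationary AR(1)
round(acf(x, lag.max = 5, plot = FALSE)$acf[2:6], 3)  # sample rho(1), ..., rho(5)
round(phi^(1:5), 3)                                   # theoretical phi^h
```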

It is possible to extend the two processes described above, thus MA(1) can be generalized
to MA(q), of the form

Xt = Zt + θ1 Zt−1 + θ2 Zt−2 + · · · + θq Zt−q (1.2)

and AR(1) to AR(p) as a regression on the previous p values of Xt as

Xt = φ1 Xt−1 + φ2 Xt−2 + · · · + φp Xt−p + Zt (1.3)

and also combine the two models. We would end up modeling

Xt − φ1 Xt−1 − · · · − φp Xt−p = Zt + θ1 Zt−1 + · · · θq Zt−q , (1.4)

which is the so-called ARMA(p, q), or autoregressive moving-average process. Most of the models we will play with will be ARMA-type models.
Generalizing the MA is far simpler than generalizing the AR, since MA processes are very stable (linear combinations of MA processes with finite coefficients are stable as well), whereas for the AR(p) the extension is not straightforward: verification is necessary. Since the AR is close to the random walk, we need to think carefully about which values of φi, i = 1, ..., p, to choose to obtain stationarity. We will study this later.

Section 1.2: Trends and Seasonality
1.2.1. Trends
For many datasets, it is evident that the observed series does not arise from a process that
is stationary or constant in mean. We might observe that

◦ perhaps a monotonic trend is present

◦ some other deterministic variation in mean is observed

We might want to decompose the process into a part that has a deterministic mean structure
and on top of that have a stable or stationary component.
Thus, a model of the form Xt = mt + Yt may be appropriate, where mt is a deterministic
function of t and {Yt } is a zero-mean (and often stationary) process. If {Yt } is assumed to
have finite variance, and mt is assumed to have some parametric form, say mt (β), then β
may be estimated using least-squares. One could also use nonparametric or semiparametric
regression methods to estimate the model.
If, say, {Yt} ∼ WN(0, σ²) is assumed, we can determine the parameter estimate for β by minimizing the sum of squared residuals, that is,

β̂ = argmin_β ∑_{t=1}^{n} (xt − mt(β))²;

this is linear (or nonlinear) regression with homoskedastic errors. The optimization is tractable analytically (it is just convex optimization in the linear case). In the easy case we could have, e.g., mt(β) = β0 + β1 t + β2 t², and β = (β0, β1, β2)⊤ can be estimated using ordinary least squares (OLS).
If there was a correlation structure (such as AR) and the assumption of white noise did
not hold, then OLS would not be statistically efficient and weighted least squares would be
preferable.
Example 1.9 (Lake Huron data, example 135 B&D)
In this example, Xt is the level (in feet) of Lake Huron from 1875-1972, with n = 98. In the illustration, a fitted linear trend is presented. One could also try to test for a structural break around observation 20 and fit a model with two plateaus. Assuming mt = β0 + β1 t, fitting this model is straightforward; in R, the lm function can be used. We get β̂0 = 10.204 (0.230) and β̂1 = −0.024 (0.004), with standard errors indicated in parentheses. The slope is statistically significant.
We can verify some of the assumptions that we made about {Yt}. Let ŷt = xt − mt(β̂) = xt − β̂0 − β̂1 t. The detrended data {ŷt} looks like a zero-mean process; this is a standard residual plot. We might suspect an increase in variance, but the assumptions are not too far off. We also need to verify that the Yt are uncorrelated, but it turns out that they are not: a scatter plot of (ŷt−1, ŷt) for t = 2, ..., n shows that the series exhibits positive correlation (0.775). In fact, {Yt} is not likely a white-noise process.
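A minimal R sketch of the fit described above, using the built-in LakeHuron series (assumed to be the same data set; the intercept parametrization may differ from the values quoted in the notes):

```r
y <- as.numeric(LakeHuron)
t <- seq_along(y)
fit <- lm(y ~ t)                  # m_t(beta) = beta0 + beta1 * t
summary(fit)$coefficients         # slope estimate and its standard error
res <- residuals(fit)             # detrended series
cor(res[-1], res[-length(res)])   # strong positive lag-1 correlation
```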

1.2.2. Seasonality
Many (real) time series are influenced by seasonal factors, which may be due to climate, calendar, economic cycles or physical factors. In this section, we will consider only seasonality, not trend. What we are studying here is deterministic seasonality. One
model for “seasonal” behaviour is

Xt = st + Yt , t ∈ Z,

but where {st } is periodic with period d, that is st−d = st ∀ t.9


An example is the “harmonic model”, where st = st (β) is modeled in a parametric fashion,
that is st is of the form

st = β0 + ∑_{j=1}^{k} [β1j cos(λj t) + β2j sin(λj t)],

where β0 , β11 , β21 , β12 , β22 , . . . are unknown constants, but λ1 < λ2 < · · · < λk are fixed
constants (known).10
Example 1.10 (Accidental Deaths in the US: Jan 1973-Dec 1978)
We have n = 72 monthly observations. Choose k = 2 and set λ1 = 2π/12, which gives a 12
monthly cycle and λ2 = 2π/6 which is a 6 monthly cycle. 11
The next step is to assess whether our assumptions were correct. We will

◦ estimate β0, β11, β12, β21, β22 using OLS;

◦ form the {Ŷt} series as ŷt = xt − st(β̂);

◦ assess the WN(0, σ²) assumption about {Yt}; in this analysis, the residual series {ŷt} is positively correlated;

◦ try to use AR or MA models for {Yt} (a sketch of the regression is given below).

9 Again we can think of this as a decomposition into a deterministic part and a zero-mean process Yt, which is the random component.
10 We treat the λi as known constants so as not to have to estimate them, and so be able to fit a linear model. The λ are chosen so that the periodicity matches the observed data.
11 This is easy to implement in R: build an n × 5 design matrix and perform OLS, directly using the lm function. One can do some model selection and perform goodness-of-fit tests, using the nested structure and F tests to evaluate the models.
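A minimal R sketch of the harmonic regression referred to in the list above, using the built-in USAccDeaths series (assumed comparable to the accidental-deaths data of Example 1.10):

```r
x <- as.numeric(USAccDeaths)
t <- seq_along(x)
lam1 <- 2 * pi / 12    # 12-month cycle
lam2 <- 2 * pi / 6     # 6-month cycle
fit <- lm(x ~ cos(lam1 * t) + sin(lam1 * t) + cos(lam2 * t) + sin(lam2 * t))
y_hat <- residuals(fit)                 # residual series to check against WN(0, sigma^2)
acf(y_hat, lag.max = 12, plot = FALSE)
```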

Before looking at the data, one doesn’t know what the periodicities are and the λ are
unknown. If they are estimated, one has to do NLS, which can be easily implemented as
well. Another example is Montreal gas station prices, which tend to peak on Tuesday, and
exhibit a periodic structure if you demean the data. Generally, to carry out an assessment of model assumptions concerning a stationary process, we may examine standard moment-based estimators. For the mean we use µ̂ = x̄, and for the ACVF we use

γ̂(h) = (1/n) ∑_{t=1}^{n−|h|} (x_{t+|h|} − x̄)(x_t − x̄),   −n < h < n.

γ̂(h) is a consistent estimator of γ(h), but it is biased. The advantage of using this estimator is that it does, however, guarantee that Γ̂n = [γ̂(|i − j|)]_{ij} is non-singular.
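In R this estimator (with divisor n) is what acf() returns with type = "covariance"; a brief sketch on an arbitrary simulated series:

```r
set.seed(2)
x <- arima.sim(model = list(ma = 0.5), n = 200)
acf(x, lag.max = 5, type = "covariance", plot = FALSE)    # gamma_hat(0), ..., gamma_hat(5)
acf(x, lag.max = 5, type = "correlation", plot = FALSE)   # rho_hat(0), ..., rho_hat(5)
```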

To assess the validity of the WN(0, σ²) assumption, we compute γ̂(h) and examine whether γ̂(h) is "significantly different" from zero when h ≠ 0.
Note
Note that this is a “nonparametric” approach. Parametric approaches are also possible in
some cases. 12
Example 1.11 (Lake Huron data)
Recall the model we imposed on this data, which was of the form Xt = mt(β) + Yt = β0 + β1 t + Yt. The fit yielded residuals {ŷt}, where ŷt = xt − β̂0 − β̂1 t. We know that Cor(ŷt−1, ŷt) ≈ 0.77. We can try an AR(1) model, Yt = φYt−1 + Zt, where {Zt} ∼ WN(0, σ²) (recall that {Yt} is a zero-mean process). φ and σ² can be estimated from the {ŷt} series using OLS, regressing ŷt on ŷt−1. This does not impose the stationarity assumption on {Yt}.13 Using the lm function in R returns φ̂ = 0.791. We can now go back to the proposed model and compute {ẑt} as

ẑt = ŷt − φ̂ ŷt−1.

Is ẑt an uncorrelated sequence? Apparently not: there is still positive correlation between ẑt and ẑt−1. This means that the AR(1) model is not adequate. At this stage, we could work with this residual or go back and propose a more complicated model. We try an AR(2) model, i.e.

Yt = φ1 Yt−1 + φ2 Yt−2 + Zt.

The fit via OLS yields φ̂1 = 1.002 and φ̂2 = −0.283. In this case, it is hard to see whether these values imply stationarity. The resulting residual series {ẑt}, formed by taking

ẑt = ŷt − φ̂1 ŷt−1 − φ̂2 ŷt−2,

appears to be possibly WN(0, σ²).

12 This function would be completely specified as a function of the parameters of the process, while here we impose no relation for γ(h).
13 We have assumed until now that our AR(1) model is stationary. Since this is a one-period lagged model, this corresponds to |φ| < 1.
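A minimal R sketch of the AR(1) residual fit just described, again using the built-in LakeHuron series (assumed to match the course data):

```r
y <- as.numeric(LakeHuron)
t <- seq_along(y)
yhat <- residuals(lm(y ~ t))                    # detrended series
ar1 <- lm(yhat[-1] ~ 0 + yhat[-length(yhat)])   # OLS of y_hat_t on y_hat_{t-1}, no intercept
coef(ar1)                                       # phi_hat, roughly the 0.79 quoted above
z <- residuals(ar1)
cor(z[-1], z[-length(z)])                       # some positive lag-1 correlation remains
```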

The general strategy when faced with a real time series: for an observed series {Xt} with realization {xt},

◦ write Xt = mt + st + Yt, a combination of deterministic trend and seasonality components plus noise;

◦ estimate mt, st using parametric (or possibly semi-parametric or nonparametric) models;

◦ form the residual series {ŷt} as ŷt = xt − m̂t − ŝt;

◦ check/model the properties of {ŷt}: are they IID/WN, or AR, or MA, etc.

1.2.3. Non-parametric trend removal


A non-parametric smoothing approach is constructed as follows:

a) Set

m̂t = (1/(2q + 1)) ∑_{j=−q}^{q} x_{t+j},   q + 1 ≤ t ≤ n − q,

taking a local unweighted average of the points in the vicinity of t. This is a form of "low-pass filtering". An extension of this is to enlarge the window centered on t, equivalent to the above by setting aj = 0 if |j| > q, given by

m̂t = ∑_{j=−∞}^{∞} aj x_{t−j}.

b) "Exponential" smoothing, using α ∈ (0, 1) as a smoothing parameter:

m̂t = α xt + (1 − α) m̂t−1 for t ≥ 2, with m̂1 = x1,14

and it is evident that the closer α is to 0, the smoother the curve. An explicit form for m̂t is

m̂t = ∑_{j=0}^{t−2} α(1 − α)^j x_{t−j} + (1 − α)^{t−1} x1.
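A minimal R sketch of the exponential smoother defined above (α is an illustrative choice):

```r
exp_smooth <- function(x, alpha) {
  m <- numeric(length(x))
  m[1] <- x[1]                                    # m_hat_1 = x_1
  for (t in 2:length(x)) {
    m[t] <- alpha * x[t] + (1 - alpha) * m[t - 1]
  }
  m
}
m_hat <- exp_smooth(as.numeric(LakeHuron), alpha = 0.2)
```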

In the presence of both trend and seasonality, the procedures for pre-processing must be
applied as follows. 15
If Xt = mt + st + Yt with E(Yt) = 0, st+d = st, and ∑_{j=1}^{d} sj = 0,16 the steps are outlined next.

Step 1 ◦ d is even: set q = d/2 and take

m̂t = (x_{t−q} + x_{t+q})/(2d) + (1/d) ∑_{j=−(q−1)}^{q−1} x_{t+j}.

◦ d is odd: set q = (d − 1)/2 and take

m̂t = (1/d) ∑_{j=−q}^{q} x_{t+j}.

This removes the trend whilst respecting seasonality.

Step 2 Set wk = (1/nk) ∑_j (x_{k+jd} − m̂_{k+jd}) for k = 1, 2, ..., d, where the sum extends over j such that q < k + jd < n − q and nk is the number of terms in the sum. Then

ŝk = wk − w̄ = wk − (1/d) ∑_{i=1}^{d} wi

for k = 1, 2, ..., d, and ŝt = ŝt−d for t > d.17

Step 3 Remove the remaining trend: re-estimate the trend, m̂*t, from the de-seasonalized data x*t = xt − ŝt, using a parametric or a non-parametric approach. This yields the decomposition Xt = m̂*t + ŝt + Yt.
14 This is a preprocessing step, but be careful, as you can remove structure in the data through the smoothing method.
15 The above averaging does not respect the seasonality, especially if you take terms from the previous cycle, and you will distort the estimate of m̂t.
16 If this is non-zero, without loss of generality we can include it in the mt component.
17 Everything we have done in this analysis is non-parametric. We have removed the seasonality with that step.
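Steps 1-3 correspond closely to the classical decomposition implemented by decompose() in R (moving-average trend, then seasonal averages); a brief sketch on a monthly series, offered as an illustration rather than as the notes' own procedure:

```r
dec <- decompose(USAccDeaths)    # trend by moving average, then seasonal means
dec$seasonal[1:12]               # s_hat_1, ..., s_hat_12 (sum approximately zero)
head(dec$trend, 12)              # m_hat_t (NA at the ends, t <= q)
resid <- dec$random              # y_hat_t = x_t - m_hat_t - s_hat_t
```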

1.2.4. Differencing
Definition 1.6 (Differencing at lag 1)
The lag difference operator,

∇Xt = Xt − Xt−1 = Xt − BXt = (1 − B)Xt

where B is the backshift operator.


Definition 1.7 (Seasonal differencing)
The lag d difference operator, ∇d , is defined by

∇d Xt = Xt − Xt−d = Xt − B d Xt = (1 − B d )Xt

Note (Elementary algebraic properties)


1. B j Xt = Xt−j = BXt−j+1 = B 2 Xt−j+2 , etc.

2. ∇j Xt = ∇(∇j−1 Xt ). So for example,

∇2 Xt = (1 − B)(1 − B)Xt
= (1 − 2B + B 2 )Xt
= Xt − 2Xt−1 + Xt−2
= (Xt − Xt−1 ) − (Xt−1 − Xt−2 )

3. If Xt = mt + Yt , applying the first-difference operator to Xt , then

∇Xt = Xt − Xt−1 = (mt − mt−1 ) + (Yt − Yt−1 )

so if mt = mt (β) = β0 + β1 t, then mt − mt−1 = β1 , i.e. ∇ removes a linear trend. 18


To remove a polynomial trend of order k, one may apply k th order differencing, that
is we look at ∇k Xt . Note that if {Yt } ∼ WN(0, σ 2 ), then ∇k Yt is not white-noise, but
still stationary.

4. ∇∇d Xt = (1 − B)(1 − B d )Xt = (1 − B d )(1 − B)Xt = ∇d ∇Xt .

5. ∇d removes a seasonality with period d. If Xt = mt + st + Yt, then

∇d Xt = (mt − mt−d) + (st − st−d) + (Yt − Yt−d),

and st − st−d = 0.
18 Yt* = (Yt − Yt−1) is what is obtained. This is different from our first detrending approach, which used a linear parametric formulation for mt. If Yt is white noise, then Yt* is no longer white noise, but MA(1). The nature of the Yt process will change.
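A short R sketch of lag-1 and lag-d differencing with diff(), on a toy series with a linear trend and a period-12 seasonal component (all values illustrative):

```r
set.seed(5)
t <- 1:120
x <- 2 + 0.5 * t + 10 * sin(2 * pi * t / 12) + rnorm(120)
dx  <- diff(x, lag = 1)    # nabla: removes the linear trend
ddx <- diff(dx, lag = 12)  # nabla_12: removes the period-12 seasonality
```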

1.2.5. Assessing the white noise assumption
Some tests are commonly used. If {Zt} ∼ WN(0, σ²), then as n → ∞,

ρ̂n(h) ∼ N(0, 1/n) approximately, ∀ h ≥ 1,

where ρ̂n(h) is the estimator of ρX(h) derived from data Z1, ..., Zn. This result is derived from the central limit theorem.19
This allows us to carry out pointwise (in h) hypothesis tests of the form

H0 : ρX(h) = 0, h ≥ 1.

Also, (ρ̂n(1), ..., ρ̂n(l)) are asymptotically independent under the white noise assumption. We can test ρ̂n(h), h = 1, ..., l, individually or jointly (if testing simultaneously, one should correct the critical values to account for multiplicity).20 An approximate 95% CI for ρX(h) is

ρ̂n(h) ± 1.96/√n.

There are other tests, which also rely on asymptotics. They are presented next.
Proposition 1.8 (Portmanteau test)
The test statistic is Q = n ∑_{j=1}^{h} {ρ̂n(j)}². Under H0 : ρX(j) = 0 for j = 1, ..., h, Q ∼ χ²_h approximately.21

Proposition 1.9 (Box-Ljung test)
This test attempts a finite-sample bias correction:

QBL = n(n + 2) ∑_{j=1}^{h} {ρ̂n(j)}²/(n − j),

and under H0, QBL ∼ χ²_h approximately.

19 In R, we have seen the line drawn for pointwise critical values for the test, or for simultaneous tests over more than one h.
20 It would also be possible (maybe not in practice) to perform a permutation test for the white-noise case: if the data are uncorrelated, they will not be affected if we permute them. One can then recompute the test statistic for the n! permutations, which yields a discrete distribution on n! elements, and compare the observed value from the sample with the permutation distribution.
21 For most of the cases we are dealing with, this is a powerful test for simultaneous testing.

Proposition 1.10 (McLeod-Li test)
Let Wt = Zt², then compute ρ̂nW(h) for {wt}. The test statistic is given by

QML = n(n + 2) ∑_{j=1}^{h} {ρ̂nW(j)}²/(n − j),

and again under H0, QML ∼ χ²_h approximately.22
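In R, the portmanteau and Box-Ljung statistics are available through Box.test(); a McLeod-Li-style test can be obtained by applying the Ljung-Box form to the squared series. A brief sketch on simulated white noise (illustrative):

```r
set.seed(3)
z <- rnorm(200)
Box.test(z, lag = 10, type = "Box-Pierce")    # Q
Box.test(z, lag = 10, type = "Ljung-Box")     # Q_BL
Box.test(z^2, lag = 10, type = "Ljung-Box")   # McLeod-Li style test on Z_t^2
```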

1.2.6. Some general results and representations for stationary processes


1. Stationarity and autocovariance

Definition 1.11 (Non-negative definite)
A real-valued function g : Z → R is non-negative definite if, for all n ≥ 1 and all a = (a1, ..., an)⊤ ∈ Rⁿ,

∑_{i=1}^{n} ∑_{j=1}^{n} ai g(i − j) aj ≥ 0.

Theorem 1.12
A function g defined on Z is the autocovariance function (acvf) of a stationary time
series process if and only if g is even and non-negative definite.

Proof Suppose g is the acvf of a stationary process {Xt}. Then g is even by properties of covariance. Let a = (a1, ..., an)⊤ and X1:n = (X1, ..., Xn)⊤, and let ΓX(n) be the n × n matrix with (i, j)th element γX(i − j) for i, j = 1, ..., n and n ≥ 1. By definition g(i − j) ≡ γX(i − j). Then

Var(a⊤ X1:n) = ∑_{i=1}^{n} ∑_{j=1}^{n} ai γX(i − j) aj.

But Var(a⊤ X1:n) ≥ 0 by properties of variance. Thus g is non-negative definite.

Conversely, if g is even and non-negative definite, then ΓX(n), formed by setting the (i, j)th element to g(i − j), is symmetric and non-negative definite23. So consider {Xt} the Gaussian process such that X1:n ∼ Nn(0, ΓX(n)). 

22 Notice that Ĉor(Zt²) has the same asymptotic distribution as Ĉor(Wt), and the squaring ensures that we sum non-negative correlation coefficients.
23 By defining Γ in this way, the matrix has a Toeplitz structure, i.e. it is constant along diagonals.

2. Construction of stationary processes
If {Zt} is stationary, then the "filtered" process {Xt} defined by

Xt = g(Zt, Zt−1, ..., Zt−q)

for q ≥ 0, for some function g, is also stationary, with Xt and Xs uncorrelated for |t − s| > q if {Zt} is white noise. {Xt} is termed "q-dependent" in this case.24

3. Linear processes
{Xt} is a linear process if, for all t,

Xt = ∑_{j=−∞}^{∞} ψj Zt−j,

where {Zt} ∼ WN(0, σ²) and {ψj} is a sequence of real constants such that ∑_{j=−∞}^{∞} |ψj| < ∞. That is, Xt = Ψ(B)Zt with Ψ(z) = ∑_{j=−∞}^{∞} ψj z^j, an infinite-order polynomial (generating function).
A linear process is "non-anticipating" or "causal" if ψj = 0 ∀ j < 0.
Note
The condition ∑_{j=−∞}^{∞} |ψj| < ∞ ensures that ∑_{j=−∞}^{∞} ψj Zt−j is well-defined, that is, that the sum converges in mean square: Xt = ∑_{j=−∞}^{∞} ψj Zt−j in the sense that, ∀ t,

lim_{n→∞} E[(Xt − ∑_{j=−n}^{n} ψj Zt−j)²] = 0.

In this calculation, Ψ(B) is a linear filter acting on {Zt}. From this, we can compute the moment properties of {Xt} directly.

Recall Xt = ∑_{j=−∞}^{∞} ψj Zt−j. Note that

E(|Zt|) ≤ √(E(|Zt|²)) = √(E(Zt²)) = σ,

24 Filtering generally means using all observations up to the current time in such a case.

and

E(|Xt|) = E(|∑_{j=−∞}^{∞} ψj Zt−j|)
        ≤ E(∑_{j=−∞}^{∞} |ψj||Zt−j|)
        = ∑_{j=−∞}^{∞} |ψj| E(|Zt−j|)
        ≤ σ ∑_{j=−∞}^{∞} |ψj| < ∞.

Now E(Xt) = 0 and

Var(Xt) = E(Xt²) = E[(∑_{j=−∞}^{∞} ψj Zt−j)²]
        = E[(∑_{j=−∞}^{∞} ψj Zt−j)(∑_{k=−∞}^{∞} ψk Zt−k)]
        = ∑_{j=−∞}^{∞} ∑_{k=−∞}^{∞} ψj ψk E(Zt−j Zt−k)
        = ∑_{j=−∞}^{∞} ψj² E(Zt−j²)
        = σ² ∑_{j=−∞}^{∞} ψj²
        ≤ σ² (∑_{j=−∞}^{∞} |ψj|)² < ∞.

We thus have a finite variance process and we have an expression for the variance of Xt in
terms of the ψj . We can also, perhaps more importantly, compute the autocovariance, in
exactly the same fashion as we did the previous calculation.

E(Xt+h Xt) = E[(∑_{j=−∞}^{∞} ψj Zt+h−j)(∑_{k=−∞}^{∞} ψk Zt−k)]
           = ∑_{j=−∞}^{∞} ∑_{k=−∞}^{∞} ψj ψk E(Zt+h−j Zt−k)
           = ∑_{j=−∞}^{∞} ψj ψj−h E(Z²_{t+h−j})
           = σ² ∑_{j=−∞}^{∞} ψj ψj+h ≡ γX(h).

Note that if {Yt} is stationary with acvf γY(h), and Xt = Ψ(B)Yt = ∑_{j=−∞}^{∞} ψj Yt−j, then we can replace {Zt} in the previous calculations to obtain results for E(Xt), Var(Xt) and γX(h). In particular, if E(Yt) = 0, then E(Xt) = 0 and γX(h) can be written as

γX(h) = ∑_{j=−∞}^{∞} ∑_{k=−∞}^{∞} ψj ψk γY(h − j + k).

Section 1.3: Autoregressive Time Series Processes


Recall the AR(1) process: let {Zt } ∼ WN(0, σ 2 ) with |φ| < 1 and {Xt } is defined as the
solution to Xt = φXt−1 + Zt with E (Xt Zs ) = 0, s > t. That is Xt − φXt−1 = Zt or
(1 − φB)Xt = Zt
By recursion,

Xt = φXt−1 + Zt = φ(φXt−2 + Zt−1) + Zt
   = φ²Xt−2 + φZt−1 + Zt
   = ⋯ = φⁿXt−n + ∑_{j=0}^{n−1} φ^j Zt−j
   = ⋯ = ∑_{j=0}^{∞} φ^j Zt−j,

since the leading term vanishes: if |φ| < 1, φⁿ → 0 as n → ∞.


So we may write Xt = ∑_{j=0}^{∞} ψj Zt−j, where ψj = φ^j. But ∑_{j=0}^{∞} |φ^j| = ∑_{j=0}^{∞} |φ|^j < ∞. We are thus using a linear filter with an absolutely convergent coefficient series, and {Xt} is

stationary, with E(Xt) = 0 and, again by our earlier results,

γX(h) = ∑_{j=0}^{∞} ψj ψj+h E(Zt²) = σ² ∑_{j=0}^{∞} φ^j φ^{j+h},

and so γX(h) = σ² φ^h ∑_{j=0}^{∞} φ^{2j} = σ² φ^h/(1 − φ²) for h ≥ 0.
Note
We may formally write Ψ(B) = (1 − φB)⁻¹, since

Ψ(B)(1 − φB) = (∑_{j=0}^{∞} ψj B^j)(1 − φB)
             = ∑_{j=0}^{∞} ψj B^j − ∑_{j=0}^{∞} φψj B^{j+1}
             = ∑_{j=0}^{∞} φ^j B^j − ∑_{j=1}^{∞} φ^j B^j = 1,

so that (1 − φB)Xt = Zt ⇒ Xt = (1 − φB)⁻¹Zt, and if we allow a series expansion in B of (1 − φB)⁻¹, then by definition (1 − φB)⁻¹ ≡ ∑_{j=0}^{∞} φ^j B^j.

We have thus constructed a solution for the process, but still have to demonstrate its uniqueness. Now suppose that {Yt} is another stationary solution to the equation Yt = φYt−1 + Zt. We show an explicit relationship between {Xt} and {Yt}. By the previous recursive approach applied to {Yt}, Yt = φ^{k+1} Yt−k−1 + ∑_{j=0}^{k} φ^j Zt−j, so that

Yt − ∑_{j=0}^{k} φ^j Zt−j = φ^{k+1} Yt−k−1

and

lim_{k→∞} E[(Yt − ∑_{j=0}^{k} φ^j Zt−j)²] = lim_{k→∞} φ^{2k+2} E(Y²_{t−k−1}) = 0,

as E(Y²_{t−k−1}) < ∞ by assumption. Therefore, Yt = ∑_{j=0}^{∞} φ^j Zt−j = Xt in mean square: {Xt} and {Yt} are equal in mean square, thus {Xt} is the unique stationary solution to the

AR(1) equation (up to mean-square equivalence).
If we choose to define {Xt } ∼ AR(1) with |φ| < 1, then {Xt } has a “unique” representation
as an MA(∞) (or linear) process.
Note
If |φ| = 1, i.e. Xt = Xt−1 + Zt or Xt = −Xt−1 + Zt: suppose (for the sake of contradiction) that {Xt} is a stationary solution to the equation Xt = φXt−1 + Zt with |φ| = 1. Then, by the previous recursion,

Xt = φ^{n+1} Xt−n−1 + ∑_{j=0}^{n} φ^j Zt−j,

and let Rt,n = Xt − φ^{n+1} Xt−n−1. Then

Var(Rt,n) = ∑_{j=0}^{n} φ^{2j} Var(Zt−j) = ∑_{j=0}^{n} Var(Zt−j) = (n + 1)σ².

The variance of the remainder goes to infinity as n → ∞. But if {Xt} were stationary with Var(Xt) finite, then Var(Rt,n) = Var(Xt − φ^{n+1} Xt−n−1) would remain bounded in n, a contradiction. Therefore, no such stationary solution exists. What happens if |φ| > 1? Clearly in this case ∑_{j=0}^{∞} φ^j Zt−j diverges, because the coefficient of Zt−j grows without bound. But note that we can also rewrite the AR(1) equation from its original form Xt = φXt−1 + Zt ⇒ Xt+1 = φXt + Zt+1, which implies that

Xt = (1/φ)Xt+1 − (1/φ)Zt+1
   = (1/φ²)Xt+2 − (1/φ)Zt+1 − (1/φ²)Zt+2
   = ⋯ = −∑_{j=1}^{∞} φ^{−j} Zt+j,

with ∑_{j=1}^{∞} |φ^{−j}| ≤ ∑_{j=0}^{∞} |φ^{−1}|^j < ∞ as |φ^{−1}| < 1.
We do have a stationary solution Xt = ∑_{j=−∞}^{∞} ψj Zt−j, where ψj = −φ^{j} for j < 0 and ψj = 0 for j ≥ 0. Since the white-noise terms are residuals from future observations, this is not of much use in practical terms.

To summarize, there are three situations for the AR(1): if

◦ |φ| < 1, stationary and causal

◦ |φ| = 1, non-stationary

◦ |φ| > 1, stationary, non-causal.


Note
We were able in the AR(1) case to do the expansion (1 − φB)⁻¹ ≡ ∑_{j=0}^{∞} φ^j B^j for a single factor (1 − φB), as an operator acting on {Xt}, to produce the MA(∞) or linear process representation. But suppose

α(B) = ∑_{j=−∞}^{∞} αj B^j,   β(B) = ∑_{j=−∞}^{∞} βj B^j,

where ∑_{j=−∞}^{∞} |αj| < ∞ and ∑_{j=−∞}^{∞} |βj| < ∞. Then

α(B)β(B)Zt = (∑_{j=−∞}^{∞} αj B^j)(∑_{k=−∞}^{∞} βk B^k) Zt
           = (∑_{j=−∞}^{∞} ∑_{k=−∞}^{∞} αj βk B^{j+k}) Zt
           = (∑_{l=−∞}^{∞} {∑_{k=−∞}^{∞} αk βl−k} B^l) Zt
           = (∑_{l=−∞}^{∞} ψl B^l) Zt,

where ψl = ∑_{k=−∞}^{∞} αk βl−k. The last line is a linear process representation, as it is just ∑_{l=−∞}^{∞} ψl Zt−l.

Note that we need

∑_{l=−∞}^{∞} |ψl| ≤ ∑_{l=−∞}^{∞} ∑_{k=−∞}^{∞} |αk βl−k| ≤ (∑_{k=−∞}^{∞} |αk|)(∑_{l=−∞}^{∞} |βl|) < ∞.

The two linear filters combine to yield another linear filter

α(B)β(B) = β(B)α(B) = ψ(B)

and we can consider the filters applied sequentially. i.e.

α(B)β(B)Zt = α(B){β(B)Zt } = α(B)Yt

where {Yt } defined by Yt = β(B)Zt is a stationary process.

1.3.1. Autoregressive model of order p (AR(p))


Suppose {Xt } is the solution to the stochastic equation

Xt − φ1 Xt−1 − φ2 Xt−2 − · · · − φp Xt−p = Zt

{Xt } is the autoregressive process of order p (denoted {Xt } ∼ AR(p)) with real coefficients
φ1 , . . . , φ p .
Example 1.12 (AR(2) process)
Take Xt − φ1 Xt−1 − φ2 Xt−2 = Zt for {Zt } ∼ WN(0, σ 2 ). By identical arguments to the
AR(1) case. We have E (Zt ) = 0, Var (Zt ) = σ 2 < ∞, which leads to E (Xt ) = 0. Note that
{Xt } can be found as the solution to the equation

(1 − φ1 B − φ2 B 2 )Xt = Zt

– but can we write

Xt = ∑_{j=0}^{∞} ψj Zt−j

for some collection of coefficients {ψj } such that {Xt } converges in mean square? If so, then
the ψj must be the coefficients in the series expansion of (1 − φ1 B − φ2 B 2 )−1 . First, note
that
Φ(B) = (1 − φ1 B − φ2 B 2 ) = (1 − ξ1 B)(1 − ξ2 B)

where we identify the parameters ξ1, ξ2 by equating coefficients, that is,

φ1 = ξ1 + ξ2 and φ2 = −ξ1 ξ2.

In general, ξ1, ξ2 will be complex-valued, and in this case they will form a complex conjugate pair (and have the same modulus). However, in some cases, ξ1 and ξ2 will be entirely real. Note also that if Φ(z) = 1 − φ1 z − φ2 z², then ξ1 and ξ2 are the reciprocals of the roots of Φ(z) = 0. Note finally that we could write

Φ(z) = φ2 (1/φ2 − (φ1/φ2) z − z²).

Therefore η1 = ξ1⁻¹ and η2 = ξ2⁻¹ are the roots of

z² + (φ1/φ2) z − 1/φ2 = 0,   i.e.   (z − η1)(z − η2) = 0.

So Φ(B) = (1 − ξ1 B)(1 − ξ2 B) and thus

{Φ(B)}−1 = {(1 − ξ1 B)(1 − ξ2 B)}−1


= (1 − ξ1 B)−1 (1 − ξ2 B)−1

as

Φ(B){Φ(B)}−1 = (1 − ξ1 B)(1 − ξ2 B)(1 − ξ1 B)−1 (1 − ξ2 B)−1


= (1 − ξ1 B)(1 − ξ1 B)−1 (1 − ξ2 B)(1 − ξ2 B)−1
=1

as the operators commute.


What we have is that we can write (1 − φ1 B − φ2 B²)⁻¹ as

(1 − ξ1 B)⁻¹(1 − ξ2 B)⁻¹ = (∑_{j=0}^{∞} ξ1^j B^j)(∑_{k=0}^{∞} ξ2^k B^k) = ψ1(B)ψ2(B),

with each sum convergent provided |ξ1|, |ξ2| < 1. Indeed, this is a necessary and sufficient condition. Thus

Xt = Ψ(B)Zt = ψ1(B)ψ2(B)Zt,

where

Ψ(B) = ∑_{j=0}^{∞} ∑_{k=0}^{∞} ξ1^j ξ2^k B^{j+k},

with ψr = ∑_{k=0}^{r} ξ1^k ξ2^{r−k} for r ≥ 0.
For this to be a valid series expansion, we need to verify some conditions on the ψr. We have |ψr| ≤ ∑_{k=0}^{r} |ξ1|^k |ξ2|^{r−k}, and we require that ∑_{r=0}^{∞} |ψr| < ∞. Explicitly, if ξ1, ξ2 are a complex pair, so that |ξ1| = |ξ2|, then |ψr| ≤ ∑_{k=0}^{r} |ξ1|^r = (r + 1)M^r, say, where M = |ξ1| < 1. Therefore,

∑_{r=0}^{∞} |ψr| ≤ ∑_{r=0}^{∞} (r + 1)M^r < ∞

as M < 1, by standard results for convergent series.

If ξ1, ξ2 are real-valued, take M = max(|ξ1|, |ξ2|) < 1. The same result follows. The AR(2) process is stationary and causal provided that |ξ1|, |ξ2| < 1, that is, the reciprocal roots lie inside the unit circle (equivalently, the roots of Φ(z) lie outside the unit circle).

For the AR(p), in the more general case, we can apply the same results and construction. Write

Φ(B) = 1 − φ1 B − φ2 B² − φ3 B³ − ⋯ − φp B^p = ∏_{r=1}^{p} (1 − ξr B).

The AR(p) process is stationary and causal if |ξr| < 1 for r = 1, ..., p. As before, the reciprocals ηr = ξr⁻¹ are the values that solve Φ(z) = 0, with |ηr| > 1 ∀ r.25 What if |ξr| = 1 for some r? Or if |ξr| > 1? The first case will turn out to be nonstationary, while for the second, one can construct a stationary process in a certain way, but it turns out to be non-causal.
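A short R sketch (illustrative values) of the causality check: compute the roots of Φ(z) with polyroot() and verify that they lie outside the unit circle, i.e. that the ξr lie inside it.

```r
phi <- c(1.002, -0.283)          # e.g. the AR(2) coefficients from the Lake Huron example
roots <- polyroot(c(1, -phi))    # roots eta_r of 1 - phi1 z - phi2 z^2
Mod(roots)                       # all moduli > 1  <=>  stationary and causal
Mod(1 / roots)                   # the xi_r, which must lie inside the unit circle
```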

Section 1.4: Moving Average Processes


Recall the MA(1) process {Xt } defined by Xt = Zt + θZt−1 for t ∈ Z and the more general
MA(q) process, defined by

Xt = Zt + θ1 Zt−1 + · · · + θq Zt−q

where {Zt } ∼ WN(0, σ 2 ). We know that E (Xt ) = 0 ∀ t and also that {Xt } is stationary.
Also, we have

Xt = ∑_{j=−∞}^{∞} ψj Zt−j,

25 Sometimes stated as the condition that all the roots lie outside the unit circle, as opposed to the ξr lying inside the unit circle.

where ψ0 = 1, ψj = θj for j = 1, 2, ..., q, and ψj = 0 otherwise (j < 0 or j > q),

so the MA(q) process for q ≥ 1 has a straightforward linear process representation. By the earlier results,

γX(h) = σ² ∑_{j=−∞}^{∞} ψj ψj+|h| = σ² ∑_{j=0}^{q−|h|} θj θj+|h|,

where θ0 ≡ 1, with the convention that the sum is zero if |h| > q.

1.4.1. MA(q) Process as an AR process


Consider {Xt } ∼ MA(1) with

Xt = (1 + θB)Zt .

To get an AR representation Π(B)Xt = Zt, we must find Π(B). We may formally write

Π(B) = (1 + θB)⁻¹ = ∑_{j=0}^{∞} (−θ)^j B^j

(as (1 + θB)⁻¹(1 + θB) = 1 using this definition) provided |θ| < 1; otherwise the sum Π(B)Xt = ∑_{j=0}^{∞} (−θ)^j Xt−j would not be mean-square convergent.
If |θ| < 1, {Xt } ∼ MA(1) admits the representation

Π(B)Xt = Zt
where Π(B) = ∑_{j=−∞}^{∞} πj B^j with πj = (−θ)^j if j ≥ 0 and zero otherwise, that is, {Xt} ∼ AR(∞).
If {Xt } ∼ MA(q), say Xt = Θ(B)Zt , where Θ(B) = 1 + θ1 B + θ2 B 2 + · · · + θq B q , then we

may factorize

Θ(B) = ∏_{l=1}^{q} (1 − ωl B).

Hence {Xt} admits an AR(∞) representation, i.e.

Π(B)Xt = Zt

with Π(B) = ∑_{j=−∞}^{∞} πj B^j, provided that |ωl| < 1 for l = 1, ..., q, where πj is the coefficient of B^j in the expansion of

(1 + θ1 B + θ2 B² + ⋯ + θq B^q)⁻¹ = ∏_{l=1}^{q} (1 − ωl B)⁻¹ = ∏_{l=1}^{q} (∑_{j=0}^{∞} ωl^j B^j).
If {Xt } admits an AR(∞) representation, that is |ωl | < 1 for l = 1, . . . , q, we say that {Xt }
is invertible (note that the AR(∞) representation is causal.)
If {Xt } ∼ MA(1) with |θ| > 1, Xt is still stationary, but it does not admit a causal AR(∞)
representation.
However, we may write Zt = −∑_{j=1}^{∞} (−θ)^{−j} Xt+j, which is a mean-square convergent AR(∞) representation that is non-causal.

Section 1.5: Forecasting Stationary Processes


We aim to predict Xn+h on the basis of X1 , . . . , Xn . Optimal prediction in general is largely
intractable, so we focus on linear predictors26 .

Criterion : In the context of weakly stationary processes, with finite first and second
moments, a natural criterion to optimize with respect to is minimum mean squared
error. That is, for linear predictors, we aim to choose coefficients a1 , . . . , an to minimize
the expected value of the squared difference between the true value of the variable to be
predicted and the predictor, that is
 
E[(Xn+h − X̂n+h)²],

26 For practicality reasons, and due to the analogy with the Gaussian case.

where

X̂n+h = a0 + ∑_{i=1}^{n} ai Xn−i+1,

i.e. solve the minimization problem over a,

min_a E[(Xn+h − a0 − ∑_{i=1}^{n} ai Xn−i+1)²] = min_a MSE(a)

for variable a. Note that the expectation is over all the random variables, which are
X1 , . . . , Xn , Xn+h . We may solve this optimization (minimization) problem analytically.

∂MSE(a)/∂a0 = E[2(Xn+h − a0 − ∑_{i=1}^{n} ai Xn−i+1)] = 0    (1.5)

∂MSE(a)/∂aj = E[2(Xn+h − a0 − ∑_{i=1}^{n} ai Xn−i+1) Xn−j+1] = 0    (1.6)

for j = 1, ..., n. From (1.5), we have

E(Xn+h) − ∑_{i=1}^{n} ai E(Xn−i+1) = a0.

But under stationarity, E(Xt) = µ (say) for all t. Therefore,

a0 = µ (1 − ∑_{i=1}^{n} ai).

From (1.6), substituting this expression for a0, we get

E(Xn+h Xn−j+1) − µ(1 − ∑_{i=1}^{n} ai) E(Xn−j+1) − ∑_{i=1}^{n} ai E(Xn−i+1 Xn−j+1) = 0.

We can rewrite this as

[E(Xn+h Xn−j+1) − µ²] − ∑_{i=1}^{n} ai [E(Xn−i+1 Xn−j+1) − µ²] = 0,

which implies that

γX(h + j − 1) − ∑_{i=1}^{n} ai γX(i − j) = 0

for j = 1, ..., n. Let

γX(n, h) ≡ (γX(h), ..., γX(h + n − 1))⊤,

and again denote the covariance matrix

ΓX(n) = [γX(i − j)]_{i,j=1,...,n},

and similarly a1:n = (a1, ..., an)⊤. Then the system of equations to be solved becomes

ΓX(n) a1:n = γX(n, h),

an n × 1 system of equations. The minimum MSE is achieved when a1:n solves this equation; the solution depends on both n and h, so write a1:n(h) = (a1(h), ..., an(h))⊤ as the optimal value. The optimal forecast (prediction) is then X̂n+h = µ + ∑_{i=1}^{n} ai(h)(Xn−i+1 − µ),27 and the minimum MSE is

E[(Xn+h − X̂n+h)²] = γX(0) − a1:n(h)⊤ γX(n, h)
                   = γX(0) − 2 ∑_{i=1}^{n} ai γX(h + i − 1) + ∑_{i=1}^{n} ∑_{j=1}^{n} ai γX(i − j) aj.

Note
Finding a1:n(h) from

ΓX(n) a1:n = γX(n, h)

is a straightforward numerical exercise, as ΓX(n) is non-negative definite, symmetric and Toeplitz.28 29

27 The optimal construction depends upon the covariance of the process. Clearly, a1:n(h) is a function of the covariance sequence.
28 In R, using the QR decomposition (through solve). For n > 2000, the computation becomes prohibitive. In general, ΓX is fully parametrized in most cases, and the matrix can be sparse. Cholesky or eigenvalue decompositions may work effectively.
29 If we are dealing with an MA process, we can use the AR(∞) approximation and truncate the coefficients for forecasting.
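A minimal R sketch of solving ΓX(n) a1:n = γX(n, h) numerically with toeplitz() and solve(), here for the AR(1) covariance of Example 1.13 below (φ and σ² are illustrative):

```r
phi <- 0.7; sigma2 <- 1; n <- 10
gamma <- function(h) sigma2 * phi^abs(h) / (1 - phi^2)  # AR(1) ACVF
Gamma_n <- toeplitz(gamma(0:(n - 1)))                   # Toeplitz covariance matrix
a <- solve(Gamma_n, gamma(1:n))                         # a_{1:n}(1)
round(a, 6)                                             # (phi, 0, ..., 0)
```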

Example 1.13
In the AR(1) case, Xt = φXt−1 + Zt, where {Zt} ∼ WN(0, σ²) and |φ| < 1. Then

ΓX(n) = σ²/(1 − φ²) · [ 1        φ        φ²     ⋯   φ^{n−1}
                        φ        1        φ      ⋯   φ^{n−2}
                        φ²       φ        1      ⋯   φ^{n−3}
                        ⋮        ⋮        ⋮      ⋱     ⋮
                        φ^{n−1}  ⋯        ⋯      φ      1    ]

and

γX(n, h) = (γX(h), ⋯, γX(n + h − 1))⊤ = σ²/(1 − φ²) · (φ^h, φ^{h+1}, ⋯, φ^{n+h−1})⊤.

For h = 1, solving ΓX(n) a1:n(1) = γX(n, 1) yields a1:n(1) = (φ, 0, ⋯, 0)⊤. Thus,

X̂n+1 = µ + ∑_{i=1}^{n} ai(1)(Xn−i+1 − µ) = µ + φ(Xn − µ),

which equals φXn for a zero-mean process.

By similar methods, we have that

X̂n+h = φ^h Xn

for h ≥ 1 (for the zero-mean process).


Note
As h → ∞, X̂n+h → 0 (almost surely), which corresponds to the mean of the process.

Example 1.14
Consider the AR(p) case, with a general model of the form Φ(B)Xt = Zt, i.e.

Xt − φ1 Xt−1 − ⋯ − φp Xt−p = Zt.

One can go through the same arguments and conclude that the optimal one-step forecast is

X̂n+1 = ∑_{i=1}^{n} φi Xn+1−i,

where φi = 0 for i > p.

Proposition 1.13 (Solving for a1:n(h))
If ΓX(n) is non-singular, we may write a1:n(h) = {ΓX(n)}⁻¹ γX(n, h), but computing the matrix inverse may be prohibitive.

Sufficient conditions for non-singularity of ΓX (n)

1. γX (0) > 0

2. γX (h) → 0 as h → ∞.
Note
For real time series data, γX(h) is not known, so sample-based estimates γ̂X(h) are used:

γ̂X(h) = (1/n) ∑_{t=1}^{n−h} (Xt+h − X̄n)(Xt − X̄n),

where X̄n = n⁻¹ ∑_{t=1}^{n} Xt.30

Two algorithms can be used to compute the optimal coefficients recursively; this is useful
for large n, or when data are being observed on an ongoing basis.
Sums of correlated observations are harder to deal with, although the form is nice. The Levinson-Durbin algorithm provides a recursive solution that uses the residuals.
Algorithm 1.1 (Levinson-Durbin)
Without loss of generality, take E(Xt) = 0. Write

X̂n+1 = ∑_{i=1}^{n} ϕn,i Xn−i+1 = ϕn⊤ Xn:1,

where ϕn = (ϕn,1, ⋯, ϕn,n)⊤, which corresponds in our prediction notation to a1:n. The optimal choice solves

Γn ϕn = γ(n, 1),

an n × 1 system of equations, yielding minimum MSE

vn = γ(0) − ϕn⊤ γ(n, 1).

30 One assumes that the order of the process is finite, so that some terms are zero in the sum. Note that, since we are dealing with autocorrelation, the variance of the sum will be larger than in the IID case: each sample observation brings less information than an IID counterpart would.

Recursion: Define ϕn in terms of ϕn−1 and vn−1 as follows:

ϕn,n = (1/vn−1) [γ(n) − ϕn−1⊤ γR(n − 1, 1)],

where γR(k, 1) = (γ(k), γ(k − 1), ..., γ(1))⊤ and γ is assumed known, and

ϕn,1:(n−1) = ϕn−1 − ϕn,n ϕRn−1,

where ϕn−1 and ϕRn−1 are (n − 1) × 1 vectors, ϕn,n is a scalar and ϕRn−1 = (ϕn−1,n−1, ϕn−1,n−2, ⋯, ϕn−1,1)⊤.31 Then set

vn = vn−1 (1 − ϕ²n,n).

Initialization: For n = 1, ϕ1,1 = γ(1)/γ(0) = ρ(1), with v0 = γ(0) and v1 = γ(0)(1 − ρ²(1)).
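A minimal R implementation sketch of the recursion above, taking γ(0), ..., γ(n) as input; the AR(1) check at the end uses an illustrative φ.

```r
levinson_durbin <- function(gamma, n) {
  phi <- gamma[2] / gamma[1]                      # phi_{1,1} = rho(1)
  v <- gamma[1] * (1 - phi^2)                     # v_1
  if (n >= 2) for (k in 2:n) {
    phi_kk <- (gamma[k + 1] - sum(phi * gamma[k:2])) / v
    phi <- c(phi - phi_kk * rev(phi), phi_kk)     # phi_{k,1:(k-1)}, then append phi_{k,k}
    v <- v * (1 - phi_kk^2)                       # v_k = v_{k-1}(1 - phi_{k,k}^2)
  }
  list(phi = phi, v = v)
}
g <- 0.7^(0:10) / (1 - 0.7^2)                     # AR(1) ACVF with phi = 0.7, sigma^2 = 1
levinson_durbin(g, n = 5)$phi                     # (0.7, 0, 0, 0, 0)
```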
To see why this works, let

Pn = (1/γ(0)) Γn

be the autocorrelation matrix and let

ρ1:n = (1/γ(0)) (γ(1), ..., γ(n))⊤,   ρn:1 = ρR1:n = (ρ(n), ⋯, ρ(1))⊤.

Proof Note that Γn ϕn = γ(n, 1) ⇒ Pn ϕn = ρ1:n . We need to verify that Pn ϕn = ρ1:n ⇒


Pn+1 ϕn+1 = ρ1:(n+1) .
For n = 1, P1 = 1, ϕ1 = ϕ1,1 = ρ(1) = ρ1,1 . By induction on n. Assume that Pk ϕk = ρ1:k .
Now

Pk+1 = [ Pk      ρk:1
         ρk:1⊤    1  ],

a block matrix, and also Pk ϕk = ρ1:k; therefore

Pk ϕRk = ρk:1,

31 The calculations depend on an inner product, of O(n), rather than n log(n). This is thus more computationally efficient.

using the symmetry of Pk. By the proposed recursion, we have that

Pk+1 ϕk+1 = [ Pk      ρk:1    [ ϕk − ϕk+1,k+1 ϕRk      [ a
              ρk:1⊤    1  ]      ϕk+1,k+1           ] =   b ],

where a is a k × 1 vector of the form

a = Pk ϕk − ϕk+1,k+1 Pk ϕRk + ϕk+1,k+1 ρk:1 = Pk ϕk = ρ1:k

by the induction hypothesis. For b, we have

b = ρk:1⊤ ϕk − ϕk+1,k+1 ρk:1⊤ ϕRk + ϕk+1,k+1
  = ρk:1⊤ ϕk + ϕk+1,k+1 (1 − ρk:1⊤ ϕRk).

We now have

ϕk+1,k+1 = (1/vk) [γ(k + 1) − ϕk⊤ γR(k, 1)],

where

vk = γ(0) − ϕk⊤ γ(k, 1) = γ(0)(1 − ϕk⊤ ρ1:k) = γ(0)(1 − ρk:1⊤ ϕRk).

From the above formula, we have

ϕk+1,k+1 = [γ(k + 1) − ϕk⊤ γR(k, 1)] / [γ(0)(1 − ρk:1⊤ ϕRk)],    (1.7)

changing the order of the inner product; and from b, we have

b = ρk:1⊤ ϕk + {[γ(k + 1) − ϕk⊤ γR(k, 1)] / [γ(0)(1 − ρk:1⊤ ϕRk)]} (1 − ρk:1⊤ ϕRk)
  = ρk:1⊤ ϕk + ρ(k + 1) − ϕk⊤ ρk:1
  = ρ(k + 1).

We have effectively performed a matrix inversion by considering the block decomposition

Pk+1 = [ Pk      ρk:1
         ρk:1⊤    1  ]

and then computing P⁻¹k+1 using P⁻¹k.32
For vn, we have

vn = E[(Xn+1 − ϕn⊤ Xn:1)²]
   = γ(0) − ϕn⊤ γ(n, 1)
   = γ(0) − ϕn−1⊤ γ(n − 1, 1) + ϕn,n (ϕRn−1)⊤ γ(n − 1, 1) − ϕn,n γ(n)
   = vn−1 + ϕn,n [(ϕRn−1)⊤ γ(n − 1, 1) − γ(n)]
   = vn−1 − ϕ²n,n [γ(0) − (ϕRn−1)⊤ γ(n − 1, 1)]
   = vn−1 − ϕ²n,n vn−1
   = vn−1 (1 − ϕ²n,n),

plugging in (1.7). That is, vn = vn−1(1 − ϕ²n,n), and as vn ≥ 0 this implies ϕ²n,n ≤ 1, |ϕn,n| ≤ 1, thus vn ≤ vn−1, i.e. the {vn} form a non-increasing sequence.
Note
The Levinson-Durbin recursion solves the equation Γn ϕn = γ(n, 1) without matrix inver-
sion, in order n2 operations. Most matrix inversion procedures are order n3 ; L-D exploits
the Toeplitz structure.33

Section 1.6: Partial autocorrelation


Define the function αX(h) by

αX(h) = 1 if h = 0, and αX(h) = ϕh,h if h ≥ 1,

for h = 0, 1, 2, .... This αX(h) is the partial autocorrelation function (PACF). For
random variables X, Y, Z, the partial correlation is defined by

Cor(X − E(X|Z), Y − E(Y|Z));

this is the partial correlation of X and Y given (accounting for) Z; it is the correlation between the residuals obtained by regressing in turn X on Z and Y on Z. The PACF
32 This works because of the nature of the matrix (Toeplitz structure). When we get to estimation by maximum likelihood, we will express the likelihood in terms of matrix inverses.
33 One can also use singular value, eigenvalue or Cholesky decompositions; these methods have special implementations for positive-definite Toeplitz matrices.

computes the correlation between

Xt − E(Xt | X(t+1):(t+h−1))    (1.8a)

Xt+h − E(Xt+h | X(t+1):(t+h−1))    (1.8b)

Example 1.15
Consider a causal stationary AR(p) process; for h > p,

E(Xt+h | X(t+1):(t+h−1), X1:(t−1)) = φ1 Xt+h−1 + ⋯ + φp Xt+h−p,
E(Xt | X(t+1):(t+h−1), X1:(t−1)) = φ1 Xt−1 + ⋯ + φp Xt−p,

where now (1.8a) equals Zt and (1.8b) is Zt+h; therefore the correlation is Cor(Zt, Zt+h) = 0 and αX(h) = 0 for h > p, while for h ≤ p, αX(h) ≠ 0. That is, we can diagnose an AR(p) structure by inspecting whether the partial autocorrelation drops to zero at some finite lag.
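A short R sketch of this diagnostic (illustrative AR(2) coefficients): the sample PACF from pacf() should be close to zero beyond lag p.

```r
set.seed(4)
x <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 2000)   # an AR(2)
pacf(x, lag.max = 10, plot = FALSE)                        # alpha_hat(h) ~ 0 for h > 2
```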
Algorithm 1.2 (Innovations algorithm)
Let {Xt} be a zero-mean process with E(Xt²) < ∞ and let γ̃X(t, s) = E(Xt Xs) (not necessarily stationary). Let

X̂n = 0 if n = 1, and X̂n = E(Xn | X1:(n−1)) if n ≥ 2.

Let Un = Xn − X̂n be the error in prediction and let U1:n be the vector of prediction errors (computed in a one-step-ahead fashion). As E(Xn | X1:(n−1)) is linear in X1, ..., Xn−1, we may write U1:n = An X1:n, where An is given by

An = [ 1           0        0     ⋯   0
       a1,1        1        0     ⋯   0
       a2,2        a2,1     1     ⋯   0
        ⋮                   ⋱          ⋮
       an−1,n−1    ⋯       ⋯   an−1,1  1 ],

since

U1 = X1,
U2 = X2 − X̂2 = X2 + a1,1 X1,
U3 = X3 − X̂3 = X3 + a2,2 X2 + a2,1 X1,
⋯

Let Cn = An⁻¹, again a lower triangular matrix, with

Cn = [ 1           0        0     ⋯   0
       ϑ1,1        1        0     ⋯   0
       ϑ2,2        ϑ2,1     1     ⋯   0
        ⋮                   ⋱          ⋮
       ϑn−1,n−1    ⋯       ⋯   ϑn−1,1  1 ].

Then X1:n = Cn U1:n = Cn (X1:n − X̂1:n). Also, X̂1:n = X1:n − U1:n = Cn U1:n − U1:n = (Cn − In)U1:n, that is,

X̂1:n = (Cn − In)(X1:n − X̂1:n) = Θn (X1:n − X̂1:n),

where

Θn = Cn − In = [ 0           0        0     ⋯   0
                 ϑ1,1        0        0     ⋯   0
                 ϑ2,2        ϑ2,1     0     ⋯   0
                  ⋮                   ⋱          ⋮
                 ϑn−1,n−1    ⋯       ⋯   ϑn−1,1  0 ],

and so

X̂n+1 = 0 if n = 0, and X̂n+1 = ∑_{i=1}^{n} ϑn,i (Xn+1−i − X̂n+1−i) if n ≥ 1,

i.e. X̂n+1 is a linear combination of X1 − X̂1, X2 − X̂2, ..., Xn − X̂n, the one-step-ahead prediction errors, which are uncorrelated.
Recursion

◦ Initialize: v0 = γ̃X(1, 1) = E(X1²).

◦ Recursion at step n:

ϑn,n−k = (1/vk) [γ̃X(n + 1, k + 1) − ∑_{j=0}^{k−1} ϑk,k−j ϑn,n−j vj]   for 0 ≤ k < n,

vn = γ̃X(n + 1, n + 1) − ∑_{j=0}^{n−1} ϑ²n,n−j vj.

Compute in the order v0, ϑ1,1, v1, ϑ2,2, ϑ2,1, v2, ϑ3,3, ϑ3,2, ϑ3,1, v3, ...

1.6.1. Wold Decomposition

A process {Xt} is deterministic (or predictable) if, for all n, Xn − X̂n = 0 in mean square, so that the prediction variance

E[(Xn − X̂n)²] = 0.

Example 1.16
Suppose we have random variables A, B such that E (A) = E (B) = 0 and Var (A) =
Var (B) = 1 and further E (AB) = 0, that is A, B are uncorrelated.
Let {Xt } be defined by

Xt = A cos(ωt) + B sin(ωt)

for some frequency ω ∈ (0, 2π). For each integer n,

Xn = A cos(ωn) + B sin(ωn)

But

cos(ωn) = cos(ω(n − 1)) cos(ω) − sin(ω(n − 1)) sin(ω)


sin(ωn) = sin(ω(n − 1)) cos(ω) + cos(ω(n − 1)) sin(ω),

the double angle formula and therefore

Xn = 2 cos(ω)Xn−1 − Xn−2

Thus X bn M.S.
bn = 2 cos(ω)Xn−1 − Xn−2 and Xn − X = 0. Thus {Xt } is a deterministic
34
process.
34 Brockwell and Davis use this terminology, although it is still stochastic in nature.

The Wold decomposition states that if {Xt} is stationary and non-deterministic, then {Xt} can be written as

Xt = ∑_{j=0}^{∞} ψj Zt−j + Vt   (in mean square),

where {Zt} ∼ WN(0, σ²), {Vt} is predictable and {Zt}, {Vt} are uncorrelated.35 Furthermore, ψ0 = 1 and ∑_{j=1}^{∞} ψj² < ∞, with

ψj = E(Xt Zt−j)/E(Zt²),   j ≥ 1.

35 Notice that the linear process is causal, so there is no contribution from negative terms. The result goes beyond the ARMA case.

Chapter 2
ARMA models
Section 2.1: Basic properties
All of this indicates that we can combine the two classes of filters to get a richer class of models.

2.1.1. ARMA(p, q)
The Autoregressive-Moving average model of order (p, q) (or ARMA(p, q)) for time-
series Xt specifies that {Xt } is a solution of

Φ(B)Xt = Θ(B)Zt

where {Zt } ∼ WN(0, σ 2 ) and

Φ(B) = 1 − φ1 B − φ2 B 2 − · · · − φp B p
Θ(B) = 1 + θ1 B + θ2 B 2 + · · · + θq B q

Suppose we study the causal, invertible case, i.e.

Φ(B) = ∏_{j=1}^{p} (1 − ξj B), |ξj| < 1 ∀ j,   Θ(B) = ∏_{i=1}^{q} (1 − ωi B), |ωi| < 1 ∀ i.

Assume that Φ(B) and Θ(B) have no common factors, i.e. all ξj are different from all ωi (so that no cancellation is possible). Recall that |ξj| ≠ 1 ∀ j implies stationarity, |ξj| < 1 ∀ j implies causality and |ωi| < 1 ∀ i implies invertibility.
For a linear process representation, we can write

Xt = Ψ(B)Zt = ∑_{j=0}^{∞} ψj Zt−j;

for this representation to be valid, we must have

Φ(B)Ψ(B) = Θ(B),

Θ(Z)
that is Ψ(Z) = Φ(Z) for arbitrary complex value Z. We can compute the ψj0 s by equating

42
coefficients, that is

(1 − φ1 B − φ2 B 2 − · · · − φp B p )(ψ0 + ψ1 B + ψ2 B 2 + · · · ) = 1 + θ1 B + θ2B 2 + · · · + θq B q

Equating the coefficients, we get for

B 0 : ψ0 = 1
B 1 : ψ1 − φ1 ψ0 = θ1 ⇒ ψ1 = φ1 + θ1


ψ − Pj φ ψ 0 ≤ j ≤ max(p, q + 1)
j k=1 k j−k = θj ,
Bj :
ψj − p φk ψj−k = 0,
P
j ≥ max(p, q + 1)
k=1

Therefore

θ + Pj φ ψ if 0 ≤ j ≤ max(p, q + 1), θ0 = 1
j k=1 k j−k
ψj = Pp

k=1 φk ψj−k = 0 if j ≥ max(p, q + 1)

2.1.2. Autocovariance function


For the ACVF,

X
2
γX (h) = E (Xt Xt+h ) = σ ψj ψj+|h|
j=0

from previous result.


This may be easy theoretically, but it may be difficult to do in practice. We will try to find
another method to compute the autocovariance.
Example 2.1 (Autocovariances of ARMA(1,1))
Consider an ARMA(1,1), |φ| < 1 and the model

(1 − φB)Xt = (1 + θB)Zt

or equivalently

Xt − φXt−1 = Zt + θZt−1 .

43
By the previous formulation,

1 if j = 0
ψj =
(θ + φ)φj−1 if j ≥ 1

We can compute using the previous formula



X
γX (0) = σ 2 ψj2
j=0
 

X
= σ 2 1 + ψj2 
j=1
 

X
= σ 2 1 + (θ + φ)2 (φj−1 )2 
j=1

(θ + φ)2
 
= σ2 1 +
1 − φ2

and

X
γX (1) = σ 2 ψj ψj+1
j=0
 

X
= σ 2 (θ + φ) + (θ + φ)2 φ (φj−1 )2 
j=1
 
(θ + φ)φ
= σ 2 (θ + φ) 1 +
1 − φ2

and similarly,

X
γX (h) = σ 2 ψj ψj+h
j=0

= φ γX (h − 1)
= φh−1 γX (1), h≥2

For ARMA(p, q), direct computation may be more tractable.


Example 2.2
Consider an ARMA(3,1) case, which is of the form

Xt − φ1 Xt−1 − φ2 Xt−2 − φ3 Xt−3 = Zt − θ1 Zt−1 . (2.9)

44
P∞
We wish to compute γX (h) for h ∈ Z+ . First write Xt = Ψ(B)Zt = j=0 ψj Zt−j

1. Multiply through by Xt , take expectations in (2.9).


On the left hand side, we have

E Xt2 − φ1 E (Xt Xt−1 ) − φ2 E (Xt Xt−2 ) − φ3 E (Xt Xt−3 )




= γX (0) − φ1 γX (1) − φ2 γX (2) − φ3 γX (3)

while on the right hand side,


∞ ∞
! !
X X
E (Xt Zt ) = E ψj Zt−j Zt = ψj E (Zt−j Zt ) = σ 2 ψ0
j=0 j=0

X
E (Xt Zt−1 ) = E (Zt−j Zt−1 ) = σ 2 ψ1
j=0

2. Multiply through by Xt−1 , take expectations. The left hand side is

γX (1) − φ1 γX (0) − φ2 γX (1) − φ3 γX (2)

and the right hand side

E (Xt−1 Zt ) = 0
E (Xt−1 Zt−1 ) = σ 2 ψ0

and the right hand side is σ 2 θ1 ψ0 . We do not have yet enough information.

3. Multiply by Xt−2 , take expectations. On the right hand side, E (Xt−2 Zt ) = E (Xt−2 , Zt−1 ) =
0. Therefore, the equation becomes

γX (2) − φ1 γX (1) − φ2 γX (0) − φ3 γX (1)

4. Multiply by Xt−3 , take expectations, which leaves us with

γX (3) − φ1 γX (2) − φ2 γX (1) − φ3 γX (0) = 0

We now have four equations and four unknowns γX (0), γX (1), γX (2), γX (3) which we
can solve simultaneously. For Xt−k , where k ≥ 4 and

γX (k) − φ1 γX (k − 1) − φ2 γX (k − 2) − φ3 γX (k − 3) = 0

for k ≥ 4.

45
Example 2.3 (Autocovariances of ARMA(1,3))
Consider an ARMA(1,3) model of the form

Xt − φ1 Xt−1 = Zt + θ1 Zt−1 + θ2 Zt−2 + θ3 Zt−3

Premultiply by Xt−k and take expectations. Using the same procedure as before, for k = 0,
we have

γX (0) − φ1 γX (1) = σ 2 (ψ0 + θ1 ψ1 + θ2 ψ2 + θ3 ψ3 )

and for k = 1

γX (1) − φ1 γX (0) = σ 2 (θ1 ψ0 + θ2 ψ1 + θ3 ψ2 )

and iterating this procedure, we will not this time get a homogeneous equation (the right
hand side doesn’t vanish). For k = 2, we have

γX (2) − φ1 γX (1) = σ 2 (θ2 ψ0 + θ3 ψ1 ),

and subsequently

k=3: γX (3) − φ1 γX (2) = σ 2 θ3 ψ0


k=4: γX (4) − φ1 γX (3) = 0
k≥5: γX (j) − φ1 γX (k − 1) = 0

We can compute the autocovariances numerically by substitution.

This can be used in a variety of examples.


Example 2.4 (Autocovariances of ARMA(2,1))

k=0: γX (0) − φ1 γX (1) − φ2 γX (2) = σ 2 (ψ0 + θ1 ψ1 )


k=1: γX (1) − φ1 γX (0) − φ2 γX (1) = σ 2 θ1 ψ0
k=2: γX (2) − φ1 γX (1) − φ2 γX (0) = 0
k=3: γX (3) − φ1 γX (2) − φ2 γX (1) = 0

using the fact that γ(k) = γ(−k) in the previous calculations.

The case of an ARMA(1,2) is left as an exercise. The calculation and the values of the
autocovariances are necessary to compute the likelihood.

46
For the general ARMA case, say ARMA(p, q) where the form of the equation is
p
X q
X
Xt − φj Xt−j = Zt + θj Zt−j
j=1 j=1

Multiply by Xt−k and take expectations. From the previous calculations, the left hand side
is
p
X
γX (k) − φj γX (|k − j|)
j=1

and on the right hand side,



0 if k > j
E (Xt−k Zt−j ) =
σ 2 ψj−k if k ≤ j.

Write θ0 = 1; the the RHS becomes for k ≤ q


q
X
σ2 θj ψj−k
j=k

that is for k = 0, 1, 2, . . . , max(p, q + 1), we have that the LHS

γX (k) − φ1 γX (k − 1) − · · · − φp γX (k − p)

σ 2 Pq θ ψ if k ≤ q
j=k j j−k
=
0 otherwise

and for k < max(p, q + 1)

γX (k) = φ1 γX (k − 1) + · · · + φp γX (k − p)

and we solve the linear system of the first max(p, q + 1) + 1 equations, then use the final
recursion.
Note
Analytical solution is possible, and is complicated, relies on the mathematics of difference
equations and rely on the roots of the inverse polynomial (with boundary values).

In R, there is a function that allows you to do this fairly routine calculation.

47
2.1.3. Autocovariance Generating function (ACVGF)
The ACVGF is another tool for computing the ACVF. Suppose {Xt } is defined by

X
Xt = ψj Zt−j
j=−∞

P∞
with {Zt } ∼ WN(0, σ 2 ). Suppose that there exists r > 1 such that j=−∞ ψj z j is conver-
gent for z ∈ C, 1r < |z| < r.
P∞ h
Define the ACVGF, GX , by GX (z) = h=−∞ γX (h)z . Now, we know from previous
results that

X
γX (h) = Cov(Xt , Xt+h ) = σ 2 ψj ψj+|h|
j=−∞

therefore

X ∞
X
GX (z) = σ 2 ψj ψj+|h| z h
h=−∞ j=−∞
 

X ∞
X
= σ2  ψj2 + ψj ψj+h (z h + z −h )
j=−∞ h=1
 
∞ ∞
!
X X
= σ2  ψj z j  ψh z −h
j=−∞ h=−∞
2 −1
= σ Ψ(z)Ψ(z )
P∞
where Ψ(z) = j=−∞ ψj z j . For ARMA(p, q), Φ(B)Xt = Θ(B)Zt . For stationary processes,
Qp Pp
we have Ψ(B) = j=1 (1 − ξj B) with |ξj | =
6 1 for all j. Therefore, Ψ(z) = j=1 (1 − ξj z) 6= 0
when |z| = 1. We now have

Xt = Ψ(B)Zt
P∞ j
with Ψ(B) = j=−∞ ψj B where Φ(z)Ψ(z) = Θ(z), ∀ z. This means that Ψ(z) =
Θ(Z)/Φ(z) is well-defined since there is a neighborhood around 1 when z is in the annulus
Ar for some r and where Ar = {z : 1r < |z| < r}. For the ARMA(p, q), we have

σ 2 Θ(z)Θ(z −1 )
GX (z) = σ 2 Ψ(z)Ψ(z −1 ) =
Φ(z)Φ(z −1 )

A series expansion of GX (z) in z has coefficients γX (h) for h = 0, ±1, ±2, . . ..

48
Example 2.5
P∞
Let Zt ∼ WN(0, σ 2 ), then GZ (z) = h=−∞ γX (h)z h = σ 2 and thus GZ (z) is constant, does
not depend on z and is the only generating process for which the ACVGF is constant36 .

Converting non-causal/non-invertible processes


Let Xt ∼ ARMA(p, q), that is Φ(B)Xt = Θ(B)Zt where {Zt } ∼ WN(0, σ 2 ) and
p
Y
Φ(z) = (1 − ξj z)
j=1
Yq
Θ(z) = (1 − ωj z)
j=1

where Φ(z) 6= 0 and Θ(z) 6= 0 when |z| = 1. and where we relax the assumption that all ξ
are less than one in modulus; we allow

0 < ξ1 ≤ ξ2 ≤ · · · ≤ ξr < 1 < ξr+1 ≤ · · · ≤ ξp


0 < ω1 ≤ ω2 ≤ · · · ≤ ωs < 1 < ωs+1 ≤ · · · ≤ ωq

that is {Xt } is non-causal if r < p and non-invertible is s < q. However, define

p
1 − ξj−1 z
!
Y

Φ (z) = Φ(z)
j=r+1
1 − ξj z
q
1 − ωj−1 z
!
Y
Θ∗ (z) = Θ(z)
j=s+1
1 − ωj z

Let {Zt∗ } be defined by

−1
Zt∗ = Φ∗ (B) {Θ∗ (B)} Xt
∗ ∗ −1 −1
= Φ (B) {Θ (B)} Θ∗ (B) {Φ∗ (B)} Zt
p
! q !−1
Y 1 − ξj−1 B Y 1 − ωj−1 B
=   Zt
j=r+1
1 − ξj B j=s+1
1 − ωj B

= Ψ∗ (B)Zt
36 Identifies Z up to mean-square

49
Therefore GZ ∗ (z) = σ 2 Ψ∗ (z)Ψ∗ (z −1 ), therefore we are left with
    
p −1   Y q p −1 −1   Y q −1 
 Y 1 − ξj z 1 − ωj z  Y 1 − ξ j z 1 − ωj z
σ2 .

j=r+1
1 − ξj z  j=s+1 1 − ωj−1 z  j=r+1 1 − ξj z −1  j=s+1 1 − ωj−1 z −1 

But

1 − ξj−1 z 1 − ξj−1 z −1
! !
= |ξj |−2
1 − ξj z 1 − ξj z −1

for each j and similarly


! !
1 − ωj z 1 − ωj z −1
= |ωj |2
1 − ωj−1 z 1 − ωj−1 z −1

and so
  
p
Y q
Y
GZ ∗ (z) = σ 2  |ξj |−2   |ωj |2  = σ ∗2
j=r+1 j=s+1

and we therefore conclude that {Zt∗ } ∼ WN(0, σ ∗2 ).37 Therefore Φ∗ (B)Xt = Θ∗ (B)Zt∗
where {Zt∗ } ∼ WN(0, σ ∗2 ) therefore {Xt } ∼ ARMA(p, q) defined with respect to {Zt∗ }. But
note
r
Y p
Y
Φ∗ (B) = (1 − ξj B) (1 − ξj−1 B)
j=1 j=r+1

and all roots are less than one in modulus, therefore this representation is causal. Similarly,
s
Y q
Y
Θ∗ (B) = (1 − ωj B) (1 − ωj−1 B)
j=1 j=s+1

therefore this process is invertible, since all the roots lie inside the unit circle.
Example 2.6
Let {Xt } ∼ MA(1) of the form

Xt = Zt − 2Zt−1 = (1 − 2B)Zt
37 For this argument to work, we need stationarity to hold, for the expansion to be valid. In the case where

the original process was both non-causal and non-invertible, the resulting variance may be smaller or lower,
depending on the roots, how far they are from the unit circle

50
−1
Let Zt∗ = 1 − 12 B (1 − 2B)Zt , that is
 
1
1 − B Zt∗ = (1 − 2B)Zt = Xt
2

where Zt∗ is an ARMA(1,1) process in terms of Zt . We could rewrite this as

Φ̃(B)Zt∗ = Θ̃(B)Zt

where Φ̃(z) = 1 − 21 z and Θ̃(z) = 1 − 2z and therefore Zt∗ = Ψ∗ (B)Zt where

Θ̃(z) 1 − 2z
Ψ̃(z) = =
Φ̃(z) 1 − 12 z

and therefore

1 − 2z −1
  
2 −1 2 1 − 2z
GZ ∗ (z) = σ Ψ̃(z)Ψ̃(z )=σ = 4σ 2
1 − 12 z 1 − 12 z −1

In the general definition, this correspond to the case q = 1, s = 0, ω1 = 2 and Zt∗ ∼


WN(0, 4σ 2 ) and  
1 1 ∗
Xt = 1 − B Zt∗ = Zt∗ − Zt−1
2 2
so {Xt } is invertible with respect to {Zt∗ }.

The morale of this story is that for any non-causal and/or non-invertible, a legitimate
formulation can be found such that Xt is causal and invertible. Therefore, all the restrictions
imposed earlier that restricted the roots of the Θ(B) and Φ(B) to be less than one in modulus
are really for ease and that these are the only cases we need to consider.

Partial autocorrelation
Calculations for ARMA(p, q) follow the Levinson-Durbin general algorithm: no more trans-
parent calculations are available.

2.1.4. Forecasting for ARMA(p, q)


Forecasting is possible using the innovations algorithm based on the ACV sequence.

51
Section 2.2: Estimation and model selection for ARMA(p,q)
2.2.1. Moment-based estimation
For general stationary processes, moment-based estimation of µx , γX (h), ρX (h), σ 2 is used.
For ARMA(p, q), we seek to estimate (φ1 , . . . , φp , θ1 , . . . , θq ) and σ 2 which then determine
γX (h). An elementary form of estimation involves finding (φ, θ, σ 2 ) such that the matrix
ΓX (n, φ, θ) (the n × n covariance matrix implied by (φ, θ)) most closely matches Γ b n (the
sample covariance matrix) that is

(φ,
b θ)
b = arg min d(ΓX (n, φ, θ), Γ
b n)
φ,θ

this may not be straightforward.


We now target the estimation of parameters from the data, which we believe is a realization
of the series.

Yule-Walker Method for AR(p)


Suppose {Xt } ∼ AR(p), which means Φ(B)Xt = Zt or

Xt = φ1 Xt−1 + φ2 Xt−2 + · · · + φp Xt−p + Zt

Using our previous strategy, multiply through by Xt−k and take expectations.
p
X
E (Xt Xt−k ) = φj E (Xt−j Xt−k )
j=1

which gives us the equation


p
X
γX (k) = φj γX (|j − k|)
j=1

When k = 0, 1, . . . , p − 1, are considered, we have p (linear) equations in φ1 , . . . , φp and


γX (h), h = 0, 1, 2, . . ., which we may solve analytically.

◦ substituting γ
bX (h) for γX (h) yields the estimates φb1 , φb2 , . . . , φbp .

Note
1. For the AR(p), we could use OLS and form the design matrix with the lagged val-
ues and perform the regression on this matrix. However, the Yule-Walker approach
guarantees a stationary and causal solution.

52
2. We can choose to look at k = 1, . . . , p to derive the Yule-Walker equations: in this
case, we must solve

ΓX (p)φ1:p = γ X (p, 1)

as in the case of optimal forecasting. We also have that

σ 2 = γX (0) − φ>
1:p γ X (p, 1)

c2 is obtained by plugging in γ 38
so σ bX and φ
b .
1:p

3. This strategy also works for the ARMA(p, q). We solve


q
X
γX (k) − φ1 γX (k − 1) − · · · − φp γX (k − p) = σ 2 θj ψj−k
j=k

for 0 ≤ k ≤ p+q as we need p+q +1 equations to solve for the p+q +1 unknowns – but
this is non-linear in (φ1 , . . . φp , θ1 , θq , σ 2 ) as {ψj } are the coefficients in the MA(∞)
representation of {Xt }.

◦ Burg’s Algorithm for AR(p) This is an algorithm that uses the sample PACFs to
estimate φ1 , . . . , φp .

◦ For the MA(q), moment based estimation can be achieved using the Innovations algo-
rithm.

2.2.2. Maximum Likelihood Estimation


In order to do this, we need parametric assumptions in order to do exact MLE, we need to
make assumptions about the series. ML estimation is the most efficient method of inference
in the “regular” case, 39 but it requires parametric assumptions. 40
Gaussian case: Suppose {Xt } is a Gaussian time-series with zero mean and autocovariance
given by

κ(i, j) = γ̃X (i, j) = E (Xi Xj )

that is X1:n ∼ Nn (0n , κn ), where κn is the (n × n) covariance matrix. If we make a second


parametric assumption that {Xt } follows an ARMA(p, q) process, then κn = κn (φ, θ) we
38 Depending on the starting p equations, we get different estimates which get closer and closer to each

other as the sample size increase.


39 Meaning that they produce estimates that have the lowest variance, at least asymptotically
40 Econometricians are not happy with the Gaussianity assumption, and they rather rely on different

techniques, using asymptotically justified likelihood from CLT and so on.

53
need to calculate the inverse of κn and the determinant of κn . We can use the Innovations
algorithm in order to get a decomposition of κ which avoids direct calculations of the
determinant and the inverse of κ.
If X
bt is the one-step ahead forecast,


bt = E Xt |X1:(t−1)
X

2
then as Xt |X1:(t−1) ∼ N (µt−1 , σt−1 ) where

µt−1 = κt,1:(t−1) κ−1


t−1 X 1:(t−1)
2
σt−1 = κt,t − κt,1:(t−1) , κt−1 κ1:(t−1),t

where
" #
κt−1 κ1:(t−1),t
κt =
κt,1:(t−1) κt,t

we have X
bt = µt−1 . The likelihood is given by
 
n 1 1
L(κn (φ, θ)) = (2π)− 2 | det κn (φ, θ)|− 2 exp − x> κ−1
(φ, θ)x1:n
2 1:n n

Direct optimization of the log likelihood log L(κn (φ, θ)), denoted `, is possible but compu-
tationally expensive when n is large as we need to compute

det |κn (φ, θ)| and κ−1


n (φ, θ)

However, the Innovations algorithm allows to achieve a decomposition. Recall that


 
0 0 0 0 0 0

 ϑ1,1 0 0 0 0 0 
 .. .. 
Cn = 

ϑ2,2 ϑ2,1 0 0 . .

.. .. .. .. ..
 
. . .
 
 . . 
ϑn−1,n−1 ··· ··· ··· ϑn−1,1 0

and X 1:n = C n (X 1:n − X


c1:n ) where the elements of X 1:n − X
c1:n are uncorrelated (by con-
struction) and are normally distributed with covariance matrix D n = diag(v0 , v1 , . . . , vn−1 ),
which is the diagonal matrix of one step forecast variances. Thus, κn = C n D n C > n and

54
Qn
det |κn | = det |D n | = t=1 vt−1 = v0 v1 v2 · · · vn−1 and

n b t )2
X (Xt − X
X> −1 > −1
1:n κn X 1:n = (X 1:n − X 1:n ) D n (X 1:n − X 1:n ) =
c c
t=1
vt−1

Thus the likelihood L is


n
!
1 bt )2
1 X (Xt − X
L(κn ) = n p exp − .
(2π) 2 (v0 · v1 · · · vn−1 ) 2 t=1 vt−1

Note  
bt+1 )2 = σ 2 = σ 2 rt and the likelihood can be rewritten as
We have vt = E (Xt+1 − X t

n  
1  nY
2 −2 − 12 1
n σ rt exp − 2 S(X 1:n ) .
(2π) 2 t=1

where
n bt (φ, θ))2
X (Xt − X
S(X 1:n ) ≡ S(X 1:n , φ, θ) =
t=1
rt−1 (φ, θ)

the ML estimates of (φ, θ, σ 2 ) are obtained by maximizing this likelihood.

For the non-Gaussian case, we may still use the function


  n2 Y
n  
2 1 − 21 1
L(φ, θ, σ ) = {rt−1 (φ, θ)} exp − 2 S(φ, θ)
σ2 t=1

as a mean of estimating the parameters where


n
X bt (φ, θ))2
(xt − x
S(φ, θ) =
t=1
rt−1 (φ, θ)

Alternatively, least squares procedure based on

(φ̃, θ̃) = arg min S(φ, θ)

with σ̃ 2 = S(φ, θ)/(n − p − q) may also be used. 41 In R, arima allows you to select full
ML, or a conditional ML, or least-squares or conditional LS (where the conditional analysis
41 Thisis an inefficient procedure if we make the normality assumption. In finite samples, these two
methods may be indistinguishable in the finite sample case.

55
conditions on the original p + q data).42

2.2.3. Model selection


In considering model selection, we may prioritize

◦ fidelity to the observed data (“within-sample” criterion)

◦ model complexity

◦ residuals (structure)
43
◦ forecasting performance (“out-of-sample”)

Akaike Information Criterion (AIC)


The AIC is widely used in statistics; it trades off goodness-of-fit with model complexity. For
model M with parameter β and likelihood LM (β), the AIC is defined as

AIC(M, β M ) = −2`M (β
b ) + 2 dim(β )
M M

for example44 for ARMA(p, q), we have p + q + 1 parameters. The model with smallest AIC
is selected as most appropriate. This should be used only to compare nested models; we
fit ARMA(p, q) for p, q moderately large, then examine all submodels.
Note
◦ In small samples, the 2 dim(β M ) penalty is not sufficiently stringent(i.e. it selects
overly complex models).

◦ In large samples, the procedure is inconsistent.45

Bayesian Information Criterion (BIC)


This criterion adjusts the penalty to relieve both these issues. The BIC takes the form

BIC(M, β M ) = −2`M (β)


b + log(n) dim(β )
M

42 It is rather awkward to use full ML, since we need to condition x on past values, while the conditional
1
ML allows to start at the p + q + 1 term.
43 Forecast errors and the innovations algorithm gives a record of how the model is performing at this

stage
44 Note that increasing utility of adding parameters in the model always decreases as we add covariates,

while the second term increases linearly. One can plot this as a function of the models, and the one that
yields the minimum value is preferred in practice. For the BIC, the intercept will be higher, as log(n) will
be linear in the dimension of the parameter β, for fixed n (the length of the time-series).
45 Meaning that with probability one, the AIC does not select the model as the sample size becomes

infinite.

56
where n is the length of the series. This has a more stringent penalty, replacing 2 by log(n)
in the penalty term. This adjustment alleviates both the small sample underpenalization
and the large sample inconsistency of the AIC – the BIC is a consistent model selection
criterion.46 Once the “best model” has been selected, the residuals from the fit should
resemble a white noise series (constant variance, uncorrelated).47

46 In this framework, note that consistency is really in terms of correctly choosing the best model from the

nested models. This is also interesting in the context of model misspecification in the context of inference
(when fitting a ML in an incorrect model, for example selecting from a set of models, which does not include
the DGM.
47 There are other suggestions in the Brockwell and Davis book, but they do not address large sample size

problematics. The arima function in R has a slot with the residuals.

57
Chapter 3
Non-Stationary and Seasonal Models
Section 3.1: ARIMA models
If d is a non-negative integer, we call {Xt } an ARIMA model of order (p, d, q) if the process
{Yt } defined by

Yt = (1 − B)d Xt

is a causal ARMA(p, q) process, i.e. {Xt } satisfies

Φ(B)(1 − B)d Xt = Θ(B)Zt Zt ∼ WN(0, σ 2 ). (3.10)

{Xt } is stationary if and only if d = 0. In all other cases (d ≥ 1), we can add polynomial
trend of order d − 1 and the resulting process still satisfies the ARIMA equation (3.10).
However, as Φ(Z) = 0 has all reciprocal roots inside the unit circle, we merely need to
difference {Xt } d times and perform inference on the resulting differenced series yt = (1 −
B)d xt .
In practice, d is not known, so we may try d = 1, 2, . . . in turn and assess the stationarity of
the resulting series. However, in practice, it can be difficult to distinguish a factor (1 − B)
from (1 − ξB) with |ξ| → 1 in the ARMA polynomial.
Example 3.1
Consider the following processes, the AR(2) process defined by (1−0.6B)(1−0.99B)Xt = Zt
versus the ARIMA(1, 1, 0) (1 − 0.6B)(1 − B)Xt = Zt . These processes look very similar in
a simulation. The solution to the second equation is non-stationary. This is clear by the
drift for larger periods looking at the graphs (see handout). Differencing when we shouldn’t
yields a more complicated process. We thus want a statistical way to distinguish these two
models.

Section 3.2: Unit roots


The presence of a unit root (|ξ| = 1) in the AR polynomial fundamentally changes properties
of the process, estimators, tests, etc.

Dickey-Fuller test
Suppose {Xt } ∼ AR(1) where

Xt − µ = φ(Xt−1 − µ) + Zt

58
with Zt ∼ WN(0, σ 2 ), if |φ| < 1, then by standard theory for method of moments estimators,
we have that

n(φb − φ) N (0, 1 − φ2 ).

However, if |φ| = 1, this asymptotic result doesn’t hold. Suppose φ = 1; then

(1 − B)Xt = µ(1 − φ) + (φ − 1)Xt + Zt


⇔ Xt = µ + φ(Xt−1 − µ) + Zt .

Therefore,

(1 − B)Xt = φ∗0 + φ∗1 Xt−1 + Zt

where φ∗0 = µ(1 − φ) and φ∗1 = φ − 1. As {Zt } ∼ WN(0, σ 2 ), we can estimate φ∗0 , φ∗1 via OLS:
we regress the series (1 − B)xt = xt − xt−1 on xt−1 – the forms of φb∗0 and φb∗1 are identical
to the forms of the intercept and slope estimators in the OLS. The Wald-type test statistic
derived by Dickey and Fuller is then

φb∗1
tDF =
b φb∗ )
se( 1

used to test the null hypothesis H0 : φb∗1 = 0 i.e. that H0 : φ = 1 and we are in the presence
of a unit root.48 Here
Pn
− xt−1 − φb∗0 − φb2 xt−1 )
t=2 (xt
b φb∗1 ) =
se( pPn
(n − 3) 2
t=2 (xt−1 − x̄)

The null distribution of tDF is non-standard (even asymptotically) – but for finite n, can
be approximated using Monte-Carlo. The test of H0 is carried out against the one-sided
alternative H1 : φ∗1 < 0 (equivalently H1 :, φ < 1). See handout: the first three panels (with
parameters close to 1, of 0.9,0.99 and 0.999 show asymptotic normality, where
√ b
n(φ − φ)
p ∼ N (0, 2)
1 − φ2

while the final panel shows the DF statistic distribution. Note that this test is available in
R through the adf.test function.
48 One still needs to write down the properties of the distribution of the test statistic, in terms of the

Brownian motion.

59
For the AR(p), we have
p
X
Xt − µ = φj (Xt−j − µ) + Zt
j=1

which imply
p
X
(1 − B)Xt = φ∗0 + φ∗1 Xt−1 + φ∗j (1 − B)Xt−j+1 + Zt
j=2

In this formulation we have

φ∗0 = µ(1 − φ1 − · · · − φp )
p
X
φ∗1 = φj − 1
j=1
p
X
φ∗j = − φk
k=j

for k = 2, . . . , p.
If {Xt } ∼ AR(p) with one unit root, then (1 − B)Xt ∼ AR(p − 1) is stationary. Therefore,
we can test for a unit root by testing

H0 : φ∗1 = 0

using tADF = φ∗1 /se(φ


b ∗1 ) where the numerator and denominator are obtained from regressing
(1 − B)xt on xt−1 and (xt−1 − xt−2 ), (xt−2 − xt−3 ), etc. However, the null distribution is
different from the AR(1) case. The ADF test function uses a look-up table to calibrate the
test statistic. This is thus the Augmented Dickey-Fuller test.
Note
Further extensions of this test exist for models with trends.

In the case of multiple unit roots, sequential testing, applying the Dickey-Fuller procedure
in each case, doing first differencing multiple times.
Note
Consider the model with φ(B) = (1 − ξB)(1 − ξB),¯ roots that appear in conjugate pair,
1 −iω
i.e. the AR(2) model with reciprocal roots ξ = r e = 1r cos(ω) − i 1r sin(ω) and ξ¯ = 1r eiω

60
where r > 1. Then
!
2
2 2 2 1 2
φ(B) = (1 − 2<(ξB) + |ξ| B ) = 1 − cos(ω)B + B
r r

This is an AR(2) model with ACF

1 sin(hω + λ)
ρ(h) =
rh sin(λ)

for h = 0, ±1, ±2 where


r2 + 1
 
λ = arctan 2 tan(ω)
r −1
that is ρ(h) is a decaying sinusoidal function. If r → 1 from above, then ρ(h) → cos(ωh).
For r > 1, we see the following picture

Section 3.3: Seasonal ARIMA models (SARIMA)


Recall the lag-difference operator ∇s

∇s Xt = Xt − Xt−s
= Xt − B s Xt
= (1 − B s )Xt .

This operator allows us to construct seasonally non-stationary models; whereas the ARIMA
model considers (1 − B)d Φ(B)Xt = Θ(B)Zt , the seasonal ARIMA (SARIMA) allows (1 −
B s )d Φ(B)Xt = Θ(B)Zt . This is a form of non-stationary model, where Φ(B)Xt = Θ(B)Zt
defines a stationary (causal) ARMA(p, q) process.
The general form of the SARIMA is as follows: For d, D non-negative integers and season-
ality s, {Xt } is a seasonal ARIMA process

{Xt } ∼ SARIMA(p, d, q)(P, D, Q)s

if {Yt } defined by

Yt = (1 − B)d (1 − B s )D Xt

upon commuting the operators, is a causal ARMA process determined by

Φ(B)Φ̆(B s )Yt = Θ(B)Θ̆(B s )Zt

61
where {Zt } ∼ WN(0, σ 2 )

Φ(z) = 1 − φ1 z − · · · − φp z p
Φ̆(z) = 1 − φ1 − · · · − φP z P
Θ(z) = 1 − θ1 z − · · · − θq z q
Θ̆(z) = 1 − θ1 − · · · − θQ z Q

of order respectively p, P, q, Q that is {Yt } ∼ ARMA(p + sP, q + sQ). This is introduced to


allow stochastic seasonality to be introduced. We can apply the inverse of the differencing
(cumulate) to get the forecast for Yt . Thus for inference and forecasting for seasonal ARIMA
models proceeds by preprocessing {Xt } to {Yt }, then performing inference forecasting for
{Yt }, and then undoing the differencing and seasonal differencing. The introduction of
differencing of different order changes radically the dataset, therefore comparing AIC and
BIC, with different length dataset, should not be done in practice.
To implement this in R

◦ inference: use the arima function

◦ forecasting: forecast.Arima

from the “forecast” library in R.

62
Chapter 4
State-space models
By coupling two (stationary) processes, we can construct even more complicated models;
for example, we can take the first process to be latent, thus describing hidden time-varying
structure, whereas the second process is constructed conditional on process 1 and is used to
represent the observed data.
We consider the marginal structure implied by this joint model for the observed data. In
general, we will use vector-valued time series processes: {X t } = (Xt1 , . . . , XtK )> is a K × 1
vector-valued process that potentially exhibits

◦ autocorrelation (between elements)

◦ cross-correlation (across elements).

We have

γij (t, s) = Cor(Xti , Xsj )

describing the auto and cross-correlation. For stationary processes, µ = E (X t ) is a K × 1


vector and Γ(h) = [γij (h)]K
i=1,j=1 a K ×K matrix, where γij (h) is the autocovariance for the
th
i component and γij (h) = γji (−h). The extension to vector-valued time series is not very
complicated. The extension of the white-noise process is the following:
 {W t } ∼ WN(0, Σ t )

>
where Σ t is a K × K matrix such that E (W t ) = 0K and E W t W s = Σ t if s = t and
zero otherwise.

Section 4.1: State-Space Formulation


The linear state space model for {Y t } of dimension K × 1 is specified by two relations that
together form a dynamic linear model.
We have the observation equation

Y t = Gt X t + W t , t = 0, ±1, ±2, . . .

where {Gt } is a sequence of K × L deterministic matrices, and where {W t } ∼ WN(0, Σ t )


that is Y t is a linear combination of elements of X t and the state equation

X t+1 = F t X t + V t , t = 0, ±1, ±2, . . .

63
49
where {F t } is a sequence of L × L deterministic matrices and where V t ∼ WN(0, Ω t ).
To complete the model, it is normal to consider t ≥ 1 and setting X 1 as a random vector,
uncorrelated with the residuals {W t }, {V t }.
Note  
1. Typically, the white-noise series are uncorrelated: E W t V >
s = 0 ∀ s, t.

2. If we require correlation between {W t } and {V s }, a amended formulation can be


proposed, for example with cross-correlation and drift terms, we might write the ob-
servation and state equation respectively as

Y t = Gt X t + dt + H 1t Z t
X t+1 = F t X t + et + H 2t Z t

where {Z t } is WN(0 of dimension (K + L) × 1, Σ t ) process, H 1t is of dimension


(K × (K + L)), H 2t is of dimension (L × (K + L)) and dt , et are deterministic “drift”
terms.

3. With careful definition of {X t }, one can construct correlated process with more general
correlation structure that is implied by the AR(1)-type form.

4. The extension of this formulation to non-linear state space models is possible, but
more challenging in terms of inference.

For the simple model, we have that

X t = F t X t−1 + V t−1
= F t (F t−1 X t−2 + V t−2 ) + V t−1
= F t F t−1 X t−2 + F t V t−2 + V t−1
= F t F t−1 (F t−2 X t−3 + V t−3 ) + F t V t−2 + V t−1
···
= ft (X 1 , V 1 , . . . , V t−1 )

and similarly, Y t = gt (X 1 , V 1 , . . . , V t−1 , W t ).


Note
For t > s,
   
E V tX >
s = E V t Y >
s =0
49 We will be able to marginalize over the parameters for X . We usually take G , F as constant over
t t t
time, otherwise inference may be impossible due to the limited amount of data points.

64
and similarly
   
E W tX >
s = E W tY >
s =0

as for the ARMA models.


Example 4.1
Let {Yt } ∼ AR(1) with Yt = φYt−1 + Zt and Zt ∼ WN(0, σ 2 ), |φ| < 1. Let Ft ≡ [φ] ∀ t, the
1 × 1 matrix. and let

Xt+1 = φXt + Vt

X
X1 = Y1 = φj Z1−j
j=0

In this example, the observation equation is

Yt = Gt Xt + Wt , Gt = [1] ∀ t, Wt ≡ 0 ∀ t

and the state equation is given by

Xt+1 = Ft Xt + Vt , Ft ≡ [φ] ∀ t, Vt = Zt+1

Example 4.2 (State-space formulation of {Yt } ∼ AR(p))


Let {Yt } be a causal AR(p) time series process with respect to {Zt } ∼ WN(0, σ 2 ), that is

Yt − φ1 Yt−1 − · · · − φp Yt−p = Zt

and let X t = (Yt−p+1 , Yt−p+2 , · · · , Yt )> a (p × 1) vector. The observation equation is given
by

Yt = Gt X t + Wt , Gt = (0, 0, . . . , 0, 1) ∀ t, Wt ≡ 0 ∀ t

and the state equation by

Xt+1 = F t Xt + Vt

65
which can be written as
      
Yt−p+2 0 1 0 ··· 0 Yt−p+1 0
Y  0
 t−p+3   0 1 ··· 0  Yt−p+2  0
   

 .  . .. 
.. .. ..   ..  +  ..  Zt+1
   
 . = . . . .
 .  . .   .  .
      
 Yt+2   0 0 0 ··· 1   Yt+1  0
Yt+1 φp φp−1 φp−2 ··· φ1 Yt 1

Example 4.3
Let {Yt } ∼ ARMA(1, 1) process, with

(1 − φB)Yt = (1 + θB)Zt

where {Zt } ∼ WN(0, σ 2 ) and |φ|, |θ| < 1. We could rewrite this as Yt = φYt−1 + Zt + θZt−1 .
Let now Xt+1 = φXt + Zt+1 (an AR(1) process), so that
" # " #" # " #
Xt 0 1 Xt−1 0
= +
Xt+1 0 φ Xt Zt+1

i.e. X t+1 = F t X t + V t . Then the observation equation is


" #
h i X
t−1
Yt = Gt X t + W t , Yt = θ 1
Xt

namely Gt = [θ 1], W t = 0, ∀ t and the state equation is given by

X t+1 = F t X t + V t , V t = [0, 1]> Zt+1

We can verify this

Gt X t = θXt−1 + Xt
= θXt−1 + φXt−1 + Zt

Yt = θXt−1 + Xt
= θ(φXt−2 + Zt−1 ) + (φXt−1 + Zt )
= φ(Xt−1 + θXt−2 ) + Zt−1 + Zt
= φYt−1 + θZt−1 + Zt

66
Example 4.4
Let {Yt } ∼ MA(1) and Yt = Zt + θZt−1 , {Zt } ∼ WN(0, σ 2 ) and let
" #
Yt
Xt =
θZt

with the observation equation

Yt = Gt X t + W t , Gt = [1 0] , Wt ≡ 0 ∀ t

and the state equation


" # " #
0 1 1
X t+1 = F t X t + V t , Ft = , Vt= Zt+1
0 0 θ

The representation is not unique; alternatively, we could have set Xt = Zt−1 and then in the
observation equation set Gt = [θ], W t = Zt and for the state equations Ft = [0], Vt = Zt . 50

Low dimension or high dimensional representation can be used, in the next case the higher
dimension is more transparent.
Example 4.5 (ARMA(p, q) process)
Suppose {Yt } ∼ ARMA(p, q) with respect to {Zt } ∼ WN(0, σ 2 ) is causal, Φ(B)Yt = Θ(B)Zt .
Let m = max{p, q + 1}, φj = 0 ∀ j > p, θj = 0 ∀ j > q and θ0 = 1
Let {Xt } ∼ AR(p) with Φ(B)Xt = Zt and X t = [Xt−M +1 , Xt−m+2 , . . . , Xt−1 , Xt ]> , an
(m × 1) vector. Then the observation equation

Y t = Gt X t + W t , Gt = [θn−1 , θn−2 , . . . , θ1 , θ0 ], Wt ≡ 0

and the state equation


   
0 1 0 ··· 0 0
 0
 0 1 ··· 0 
 0 
 
 . .. .. .. .. 
  . 
X t+1 = F t X t + V t ,  ..
Ft =  . . . . V t = .
 . 
 
   
 0 0 0 ··· 1  0 
φm φm−1 φm−2 ··· φ1 (m×m), Zt+1 (m×1)

Alternatively, let m = max{p, q}, φj = 0 ∀ j > p. We use the linear process representation
50 Independence is not in the formulation. We can also reduce the dimension of the state. It is rather

simple in the MA(1) case.

67
to get another state-space form. Now Φ(B)Yt = Θ(B)Zt , which in terms imply that

Yt = {Φ(B)}−1 Θ(B)Zt
= Ψ(B)Zt

X
= Ψj Zt−j
j=0

Θ(Z)
where {ψj } are the coefficients in the expansion of Ψ(Z) = Φ(Z) as a power series in Z. Let
{Xt } be defined by the AR(1)equation,

X t+1 = F t X t + V t ,

an (m×1) system (VAR(1)) where F t is as before, and Vt = [ψ1 , ψ2 , . . . , ψm ]> Zt+1 = HZt+1
say. The observation equation is

Y t = Gy X t + W t

with Gt = [1, 0, . . . , 0], W t = Zt

51
and to justify this, we will differ this until we discuss multivariate processes.
F t , Gt can be allowed to vary over time periods in a deterministic fashion, so that the
process is not stationary, but there is some stability.

Stationarity and Stability


Recall the state equation

X t+1 = F t X t + V t , t = 0, ±1, ±2
51 The lowest dimension one can get is max{p, q}. For arbitrary state space representation, there is no

unique formulation, can add a term in the observation equation and remove it in the state equation.

68
and suppose F t = F ∀ t. Then

X t+1 = F X t + V t
= F (F X t−1 + V t−1 ) + V t
= F 2 X t−1 + F V t−1 + V t
..
.
n
X
= F n+1 X t−n+1 + F j V t−j
j=0

that is for X t+1 to be bounded in probability (for all t), we must have that F n stays
bounded for all n. Write

F = EDE −1

as the eigendecomposition for F , then D is the matrix of eigenvalues of F , and F n =


ED n E −1 , i.e. for F n to stay bounded (elementwise), we require that

D n = [diag(λ1 , . . . , λL )]n

stays bounded. But

D n = diag(λn1 , . . . , λnL )

and therefore, we need |λj | ≤ 1 (i.e. |λ1 | < 1 under the usual convention.); thus we need
solutions of

det(F − I Z ) = 0

to be within the unit circle, or equivalently in Brockwell and Davis notation,

det(I − F Z ) 6= 0 ∀ z ∈ C, |z| ≤ 1.

Section 4.2: Basic Structural Models


Local level model
The observation equation can be written simply as

2
Yt = Mt + Wt , {Wt } ∼ WN(0, σW )

69
and the state equation
Mt+1 = Mt + Vt , {Vt } ∼ WN(0, σV2 )

where Vt , Wt are uncorrelated processes. Thus the mean is evolving as a random walk and
Yt varies according to some noise. Here K = L = 1, so a 1 dimensional state model. The
state equation here is clearly a random walk. Let

Yt∗ = ∇Yt = Yt − Yt−1 = Vt−1 + (Wt − Wt−1 )

and clearly, E (Yt ) = 0 and

E Yt∗ Yt+h


= E ((Vt−1 + (Wt − Wt−1 )(Vt+h−1 + (Wt+h − Wt+h−1 ))

and setting h, we get



E Yt∗2 = σV2 + 2σW
 2


 if h = 0
E Yt∗ Yt+h



= E Yt∗ Yt+1
 2
= −σW if h = 1


E Yt∗ Yt+h
2

=0 if h ≥ 2

which corresponds to an ARIMA(0,1,1) model.

Local slope model


(0)
The observation equation is Yt = Mt + Wt , the state equation Mt = Mt−1 + Bt−1 + Vt−1
2 (0)
where Bt = Bt−1 + Ut−1 where {Wt } ∼ WN(0, σW ), where {Vt } ∼ WN(0, σV2 ) and
2
{Ut } ∼ WN(0, σU ). let
" # " #
(0)
Mt V
Xt = , Vt = t
Bt Ut

Then
" #
1 1
Xt+1 = Xt + Vt ;
0 1

2
 
σV 0
with Yt = [1 0]Xt + Wt . Here Vt ∼ WN(0, Q) where Q = 2
0 σU
and the dimensions
parameter of the state space model are L = 2, K = 1.

70
Seasonal model
Suppose we wish to construct a stochastic seasonal component with period d say. Recall
Pd
that {St } is satisfied St+d = St all t > d and t=1 St = 0.
Write

Yt+1 = −Yt − Yt−1 − · · · − Yt−d+2 + St

where {St } is a zero mean random variable. We can write this in a state-space formulation
as follows: Let
 
Yt
 Yt−1 
 
Xt =  . 

 .. 

Yt−d+2 ((d−1)×1)
so that
h i
Yt = 1 0 ··· 0 X t + Wt .

Now the state space equation is

X t+1 = F X t + V t

with
 
−1 −1 −1 ··· −1  

1 0 0 ··· 0
 1
..
  0
 
0 1 0··· .   
0
 
F = ,V t = 
  St
 
. ..
1 ..

0 0 .   .. 

 .
 .
 .. .. .. .. .. 
 . . . . 
 0
0 0 ··· 1 0

and where {St } ∼ WN(0, σS2 )


Combining the three components altogether: Let X t = [Mt , Bt , Yt , Yt−1 , . . . , Yt−d+2 ]> .
Then

2
Obs Yt = Gt X t + W t , G = [1, 0, 1, 0, . . . 0], W t ∼ WN(0, σW )
and
State X t+1 = F t X t + V t

71
where F comprises the block diagonal from the previous models.
 
1 1 0 0 ··· 0
0 1 0 0 ··· 0
 
 
0
 0 −1 −1 ··· −1 
Ft ≡ F = 
 ... .. 
 . 1 0 ··· 0 
. .. .. .. 
. . .
. . 0 ··· 
0 0 0 ··· 1 0

[Reconstructed states from the plots in air data.] σ12 is the variance from {Mt }, σ22 the
variance from {Bt } and similarly σ32 for {St } and
 
σ12 0 0
V t ∼ WN(0, Q) where Q =  0 σ22 0
 

0 0 σ32

In this code, we have Bt+1 is constant, so this has been adjusted in the code accordingly.

Yt = Mt + (−Yt−1 − · · · − Yt−d+1 ) + Wt
(0)
Mt+1 = Mt + Bt + Vt
Bt+1 = Vt + Ut
Yt+1 = (−Yt − · · · − Yt−d+2 ) + St

This is a 17 parameters maximum likelihood, we get a plot with a slope around 2. Amending
the code to maximize over the logged data (log-scale transformation), we have a much more
linear slope, with less leak from the seasonal part. In this form (not-logged), it is capturing
the heteroskedasticity of the model. The seasonal component and the forecast are better.

Section 4.3: Filtering and Smoothing: the Kalman Filter


See handouts from now on– The key ideas are the following: imagine for simplicity a Normal
dataset; in the Gaussian case,

Obs Yt |X t ∼ NK (Gt X t , Σ t )
State X t+1 |X t ∼ NL (F t X t , Ω t )

Then the prediction is


Z
p(xt+1 |y 1:t ) = p(xt+1 |xt )p(xt |y 1:t )dxt

72
and p(xt |y 1:t ) is not yet available. This comes back to the filtering, where

p(yt |xt )p(xt |y 1:(t−1) )


p(xt |y 1:t ) =
p(yt |y 1:(t−1) )

and the latter term is again of the form found in the prediction. This suggests that we
compute (recursively) p(x1 ), p(x1 |y1 ), p(x2 |y1 ), p(x2 |y2 ), . . .. For time point t, in the Gaus-
sian case, if p(x1:n , y 1:n ) is jointly Gaussian, so therefore p(xt |y 1:t ) is also Gaussian, as is
p(xt+1 |y 1:t ) (any conditional distribution or marginal is also normal). Hence all we need to
do is track moments of the prediction and filtering distribution. For example,

X t |X 1 , Y 1:t ∼ NL (at|t , P t|t )


X t+1 |X 1 , Y 1:t ∼ NK (at+1|t , P t+1 |t).

By the law of iterated expectation,

E (X t+1 |x1 , y 1:t ) = F t at|t = at+1|t


Var (X t+1 |x1 , y 1:t ) = F t P t|t F >
t + Ω t = P t+1|t

E (Yt+1 |x1 , y 1:t ) = Gt+1 at+1|t


Var (Yt+1 |x1 , y 1:t ) = Gt+1 P t+1|t G>
t+1 + Σ t+1

Thus for likelihood-based estimation,


n
Y
L(θ, y 1:n ) = f (y1 |θ) f (yt |y 1:(t−1) , θ)
t=2

where

f (yt |y 1:(t−1) , θ) ∼ Nk (Gt+1 at+1|t , Gt+1 P t+1|t G>


t+1 + Σ t )

and we can compute the likelihood using terms computed during the Kalman filter. This
however heavily depends on the Gaussianity assumption. Conjugate prior distribution and
discrete distribution are examples where we can compute numerically using the Kalman-
Filter. The recursive calculations and the algorithm are outlined in the R code provided.
The Kalman recursions are given by the following equations, where at|t are the best linear
predictors and P t|t the corresponding mean-square error matrix. The quantities vt and M t
denote the one-step-ahead error in forecasting yt conditional on the information set at time

73
t − 1 and its MSE, respectively.

v t = yt − Gt at|t−1 (4.11a)
M t = Gt P t|t−1 G>
t + Σt (4.11b)
at|t = at|t−1 + P t|t−1 G> −1
t M t vt (4.11c)
P t|t = P t|t−1 − P t|t−1 G> −1
t M t Gt P t|t−1 (4.11d)
at+1|t = F t at|t (4.11e)
P t+1|t = F t P t|t F >
t + Ωt (4.11f)

which can be combined into further recursions, used to get another form for the prediction
and simplify the former forms

at+1|t = F t at|t−1 + K t v t (4.12a)


Kt = F t P t|t−1 G> −1
t Mt (4.12b)
P t+1|t = F t P t|t−1 L>
t + Ωt (4.12c)
Lt = F t + K t Gt (4.12d)

and a smoothing algorithm can be applied to a state-space model given a fixed set of data;
estimates of the state vector are computed at each t using all available information. Denote
by at|n the smoothed linear estimates for t ∈ {0, . . . , n − 1} given all data until point n,
that is at|n = E (X t |y 1:n ) along with P t|n via backward recursions.

P ∗t = P t F >
t P t+1|t (4.13a)
at|n = at|t + P ∗t (at+1|n − at+1|t ) (4.13b)
P t|n = P t|t + P ∗t (P t+1|n − P t+1|t )P ∗t (4.13c)

The Gaussian linear state space model is of the form

Yt = Xt + Wt
Xt+1 = Xt + Vt

2
where we assume that Wt ∼ N (0, σW ) and Vt ∼ N (0, σV2 ) are independent white noise
series.
The joint likelihood for observations and states up to n is of the form
n
Y
p(x1:n , y1:n ) = p(x1 )p(y1 |x1 ) p(xi |xi−1 )p(yi |xi )
i=2

74
which is a two parameter model in the Gaussian case; alternatively, we could also set
x1 to be unknown; it could be regarded as having some density, or even a state which
corresponds to a degenerate distribution. The joint likelihood explicited above is follows
a multinormal distribution N2n (·, ·), and from standard results all marginal distributions –
but also conditional distributions, are also Gaussian. We could thus compute the first two
moments to get a complete characterization of the model.
(1) Filtering.
As earlier mentioned, this calculations is a Bayes theorem calculation, where we abstract
for now of the terms involving yt in the denominator. We could regard the proportionality,
having

p(xt |y 1:t ) ∝ p(yt |xt )p(xt |y 1:t−1 )


  !
1 2 1 2
∝ exp − 2 (yt − xt ) × exp − 2 (xt − mt|t−1 )
2σW 2St|t−1

as from the observation and state equations, we have

2
Yt |Xt = xt ∼ N (xt , σW ) Xt+1 |Xt = xt ∼ N (xt , σV2 )

2
and the recursion assumption is that Xt |Y1:t−1 ∼ N (mt|t−1 , St|t−1 ), which depends only on
low dimensional summary statistics.
" #!
1 1 2 1 2
p(xt |y 1:t ) ∝ exp − 2 (xt − yt ) + S 2 (xt − mt|t−1 )
2 σW t|t−1

as this is an univariate distribution in xt ; we want to get this in terms of


!
1
exp − 2 (xt − mt|t )2
2St|t

2
such that Xt |Y1:t ∼ N (mt|t , St|t ). But the following is a simple “complete the square”
calculation, using the fact that
 2
2 2 2 Aa + Bb AB
A(x − a) + B(x − b) = M (x − m) + constant = (A + B) x − + (a − b)2
A+B A+B

where m = (Aa + Bb)/(A + B) and M = A + B. We thus get


  ! !2 
2 2
1 1 1 yt /σW + mt|t−1 /St|t−1
exp −  2 + S2 xt − 2 + 1/S 2 + constant
2 σW t|t−1 1/σW t|t−1

75
which imply that
yt mt|t−1
2 + S2 2 2
!−1
σW t|t−1 yt St|t−1 + σW mt|t−1 2 1 1
mt|t = = and St|t = 2 + S2
1 1 2 2
St|t−1 + σW σW
2 + 2 t|t−1
σW St|t−1

and so the parameters of the Normal distributions are known in terms of the previous
observations.
(2) Predictions This is a posterior predictive-type calculation. We have
Z
p(xt+1 |y 1:t ) = p(xt+1 |xt , y 1:t )p(xt |y 1:t )dxt
Z
≡ p(xt+1 |xt )p(xt |y 1:t )dxt
Z   !
1 1 2 1
∝ exp − 2 (xt+1 − xt ) exp − 2 (xt − mt|t )2 dxt
2 σV 2St|t
  !
2 2
1 St|t + σV
Z
1 2 ∗ 2
= exp − 
  xt+1 − mt|t  exp − (x − m )
2 + σ2
2 St|t 2 σV2 St|t
2
V
!
1 1 2
∝ exp − 2 2 (xt+1 − mt|t )
2 σV + St|t

using the fact that xt+1 |xt ⊥ y 1:t by assumption. We can therefore conclude that p(xt+1 |y 1:t ) ∼
2
N (mt+1|t , St+1|t ). For the prediction, we could also use a shortcut using conditional inde-
pendence and do an iterated expectation and iterated variance calculation. We will again
2
derive (in this case verify) that mt+1|t = mt|t and St+1|t = σV2 +2 St|t
2
.

mt+1|t = E (Xt+1 |y 1:t ) = EXt |Y 1:t EXt+1 |Xt ,Y 1:t (Xt |Xt , y 1:t ) = EXt |Y 1:t (Xt ) = mt|t

and similarly for the variance calculation,

Var (Xt+1 |y 1:t )


 
= VarXt |Y 1:t EXt+1 |Xt ,Y 1:t (Xt+1 |Xt , y 1:t ) + EXt |Y 1:t VarXt+1 |Xt ,Y 1:t (Xt+1 |Xt , y 1:t )
2
= St|t + σV2 .

It remains to calculate p(y 1:n ) = p(y1 )p(y2 |y1 p(y3 |y1 , y2 ) · · · p(yn |y 1:n−1 ). This is again a

76
posterior-type calculation, which can be made using
Z
p(yt |y 1:t−1 ) = p(yt |xt )p(xt |y 1:t−1 )dxt

both Gaussian in Xt or

p(yt |xt )p(xt |y 1:t−1 )


p(xt |y 1:t ) =
p(yt |y 1:t−1 )

but the simplest calculation is via iterated moment calculation. Indeed,



E (Yt |Y 1:t−1 ) = EXt |Y1:t−1 EYt |Xt (Yt |Xt
= EXt |Y1:t−1 (Xt ) = mt|t−1

and for the variance,


 
Var (Yt |Y 1:t−1 ) = VarXt |Y1:t−1 EYt |Xt (Yt |Xt ) + EXt |Y1:t−1 VarYt |Xt (Yt |Xt )
2
= VarXt |Y1:t−1 (Xt |Y1:t−1 ) + EXt |Y1:t−1 (σW )
2 2
= St|t−1 + σW

77
Chapter 5
Financial time series models
To model

◦ asset/bond/option prices

◦ interest rates

◦ exchange rates

simple stationary/non-stationary models are not sufficiently sophisticated (complex) to cap-


ture observed dynamics. Such series often involve non-stationary components (time-varying
mean, unit roots) and also temporal heteroskedasticity, (i.e. time-varying variance).

78
License

Creative Commons Attribution-Non Commercial-ShareAlike 3.0 Unported

You are free:

to Share - to copy, distribute and transmit the work


to Remix - to adapt the work

Under the following conditions:

Attribution - You must attribute the work in the manner specified by the author or licensor (but not in any
way that suggests that they endorse you or your use of the work).

Noncommercial - You may not use this work for commercial purposes.

Share Alike - If you alter, transform, or build upon this work, you may distribute the resulting work only
under the same or similar license to this one.

With the understanding that:

Waiver - Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain - Where the work or any of its elements is in the public domain under applicable law, that
status is in no way affected by the license.
Other Rights - In no way are any of the following rights affected by the license:
Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
The author’s moral rights;
Rights other persons may have either in the work itself or in how the work is used, such as publicity or
privacy rights.

CC Course notes for MATH 545: Intro to Time Series


BY: Léo Raymond-Belzile
Full text of the Legal code of the license is available at the following URL.

79

You might also like