
Theory and Algorithms for

Forecasting Non-Stationary
Time Series

VITALY KUZNETSOV (KUZNETSOV@)
GOOGLE RESEARCH

MEHRYAR MOHRI (MOHRI@)
COURANT INSTITUTE & GOOGLE RESEARCH
Motivation
Time series prediction:

• stock values.

• economic variables.

• weather: e.g., local and global temperature.

• sensors: Internet-of-Things.

• earthquakes.

• energy demand.

• signal processing.

• sales forecasting.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 2


Google Trends

[Figure slides, pages 3-5: Google Trends examples.]


Challenges
Standard Supervised Learning:

• IID assumption.

• Same distribution for training and test data.

• Distributions fixed over time (stationarity).

none of these assumptions holds


for time series!

Theory and Algorithms for Forecasting Non-Stationary Time Series page 6


Outline
Introduction to time series analysis.

Learning theory for forecasting non-stationary time series.

Algorithms for forecasting non-stationary time series.

Time series prediction and on-line learning.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 7


Introduction to Time
Series Analysis
Classical Framework
Postulate a particular form of a parametric model that is
assumed to generate the data.

Use the given sample to estimate the unknown parameters of the model.

Use the estimated model to make predictions.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 9


Autoregressive (AR) Models
Definition: the AR(p) model is a linear generative model based
on the pth order Markov assumption:

$$\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$$

where

• the $\epsilon_t$ are zero-mean uncorrelated random variables with variance $\sigma^2$.

• $a_1, \ldots, a_p$ are the autoregressive coefficients.

• $Y_t$ is the observed stochastic process.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 10
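
As a concrete illustration (not part of the tutorial), the following sketch simulates an AR(2) process and recovers its coefficients by least squares; the coefficient values, noise variance, and the use of NumPy are illustrative assumptions.

```python
# Illustrative sketch: simulate an AR(2) process
# Y_t = a_1 Y_{t-1} + a_2 Y_{t-2} + eps_t and recover (a_1, a_2) by least squares.
import numpy as np

rng = np.random.default_rng(0)
T, a = 2000, np.array([0.6, -0.3])          # illustrative autoregressive coefficients
p = len(a)
eps = rng.normal(size=T)                    # zero-mean noise, variance sigma^2 = 1
Y = np.zeros(T)
for t in range(p, T):
    Y[t] = a @ Y[t - p:t][::-1] + eps[t]    # a_1 Y_{t-1} + ... + a_p Y_{t-p} + eps_t

# Least-squares estimate: regress Y_t on its p previous values.
X = np.column_stack([Y[p - 1 - i: T - 1 - i] for i in range(p)])
a_hat, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)
print("estimated coefficients:", a_hat)     # close to [0.6, -0.3] for large T
```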


Moving Averages (MA)
Definition: the MA(q) model is a linear generative model for
the noise term based on the qth order Markov assumption:

$$\forall t, \quad Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$$

where

• $b_1, \ldots, b_q$ are the moving average coefficients.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 11


ARMA model
(Whittle, 1951; Box & Jenkins, 1971)

Definition: the ARMA(p, q) model is a generative linear model
that combines the AR(p) and MA(q) models:

$$\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 12


ARMA

Theory and Algorithms for Forecasting Non-Stationary Time Series page 13


Stationarity
Definition: a sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{+\infty}$ is
stationary if its distribution is invariant to shifting in time:

$(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 14


Weak Stationarity
Definition: a sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{+\infty}$ is
weakly stationary if its first and second moments are
invariant to shifting in time, that is,

• $\mathbb{E}[Z_t]$ is independent of $t$.

• $\mathbb{E}[Z_t Z_{t-j}] = f(j)$ for some function $f$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 15


Lag Operator
The lag operator $L$ is defined by $L Y_t = Y_{t-1}$.

The ARMA model in terms of the lag operator:

$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t$$

The characteristic polynomial

$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i$$

can be used to study properties of this stochastic process.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 16


Weak Stationarity of ARMA
Theorem: an ARMA( p, q) process is weakly stationary if the
roots of the characteristic polynomial P (z) are outside the
unit circle.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 17
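
The condition in the theorem is easy to check numerically. The sketch below (an illustration, not the tutorial's code) builds the characteristic polynomial from assumed AR coefficients and tests whether all of its roots lie outside the unit circle.

```python
# Check the weak-stationarity condition: roots of P(z) = 1 - a_1 z - ... - a_p z^p
# must satisfy |root| > 1.
import numpy as np

def is_weakly_stationary(a):
    """a = (a_1, ..., a_p): AR coefficients of an ARMA(p, q) model."""
    # numpy.roots expects coefficients from the highest degree down:
    # P(z) = -a_p z^p - ... - a_1 z + 1.
    coeffs = np.concatenate([-np.asarray(a, dtype=float)[::-1], [1.0]])
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_weakly_stationary([0.6, -0.3]))   # True: both roots are outside the unit circle
print(is_weakly_stationary([1.1]))         # False: Y_t = 1.1 Y_{t-1} + eps_t is explosive
```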


Proof
If the roots of the characteristic polynomial are outside the unit
circle, then:

$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i = c\,(\lambda_1 - z)\cdots(\lambda_p - z) = c'\,(1 - \lambda_1^{-1} z)\cdots(1 - \lambda_p^{-1} z)$$

where $|\lambda_i| > 1$ for all $i = 1, \ldots, p$ and $c, c'$ are constants.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 18


Proof
Therefore, the ARMA(p, q) process

$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t$$

admits an MA($\infty$) representation:

$$Y_t = \big(1 - \lambda_1^{-1} L\big)^{-1} \cdots \big(1 - \lambda_p^{-1} L\big)^{-1} \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t$$

where

$$\big(1 - \lambda_i^{-1} L\big)^{-1} = \sum_{k=0}^{\infty} \big(\lambda_i^{-1} L\big)^k$$

is well-defined since $|\lambda_i^{-1}| < 1$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 19
Proof
Therefore, it suffices to show that

$$Y_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}$$

is weakly stationary.

The mean is constant:

$$\mathbb{E}[Y_t] = \sum_{j=0}^{\infty} \psi_j \,\mathbb{E}[\epsilon_{t-j}] = 0.$$

The covariance function $\mathbb{E}[Y_t Y_{t-l}]$ only depends on the lag $l$:

$$\mathbb{E}[Y_t Y_{t-l}] = \sum_{k=0}^{\infty} \sum_{j=0}^{\infty} \psi_k \psi_j \,\mathbb{E}[\epsilon_{t-j} \epsilon_{t-l-k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+l}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 20


ARIMA
Non-stationary processes can be modeled using processes
whose characteristic polynomial has unit roots.

A characteristic polynomial with unit roots can be factored as

$$P(z) = R(z)\,(1 - z)^D$$

where $R(z)$ has no unit roots.

Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model
for $(1 - L)^D Y_t$:

$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) (1 - L)^D Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 21
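
To make the differencing step concrete, here is a small sketch (illustrative data and parameters assumed) showing that applying $(1 - L)$ to a unit-root process yields a series that can then be modeled as ARMA.

```python
# Sketch: differencing removes a unit root, so an ARIMA(p, 1, q) fit reduces to
# an ARMA(p, q) fit on (1 - L) Y_t.
import numpy as np

rng = np.random.default_rng(1)
T = 1000
Y = np.cumsum(0.1 + rng.normal(size=T))   # unit-root process: Y_t = Y_{t-1} + 0.1 + eps_t

dY = np.diff(Y)                           # (1 - L) Y_t, now (weakly) stationary
print("first-half vs. second-half mean of dY:", dY[:T // 2].mean(), dY[T // 2:].mean())
print("first-half vs. second-half var of dY:", dY[:T // 2].var(), dY[T // 2:].var())
# dY can now be modeled with an ARMA(p, q) model, e.g. by the least-squares fit above.
```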




Other Extensions
Further variants:

• models with seasonal components (SARIMA).

• models with side information (ARIMAX).

• models with long-memory (ARFIMA).

• multi-variate time series models (VAR).

• models with time-varying coefficients.

• other non-linear models.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 23


Modeling Variance
(Engle, 1982; Bollerslev, 1986)

Definition: the generalized autoregressive conditional
heteroscedasticity GARCH(p, q) model is an ARMA(p, q)
model for the variance $\sigma_t^2$ of the noise term $\epsilon_t$:

$$\forall t, \quad \sigma_{t+1}^2 = \omega + \sum_{i=0}^{p-1} \alpha_i \sigma_{t-i}^2 + \sum_{j=0}^{q-1} \beta_j \epsilon_{t-j}^2$$

where

• the $\epsilon_t$ are zero-mean Gaussian random variables with
variance $\sigma_t^2$ conditioned on $\{Y_{t-1}, Y_{t-2}, \ldots\}$.

• $\omega > 0$ is the mean parameter.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 24
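
A minimal simulation of a GARCH(1, 1) recursion following the reconstruction above; the parameter values and the coefficient names alpha/beta are illustrative assumptions.

```python
# Sketch: simulate sigma^2_{t+1} = omega + alpha * sigma^2_t + beta * eps_t^2.
import numpy as np

rng = np.random.default_rng(2)
T, omega, alpha, beta = 5000, 0.1, 0.85, 0.1     # alpha + beta < 1 for stability
sigma2 = np.empty(T)
eps = np.empty(T)
sigma2[0] = omega / (1.0 - alpha - beta)          # start at the stationary variance
for t in range(T - 1):
    eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))
    sigma2[t + 1] = omega + alpha * sigma2[t] + beta * eps[t] ** 2
eps[-1] = rng.normal(scale=np.sqrt(sigma2[-1]))
print("sample variance:", eps.var(), "vs. stationary variance:", omega / (1 - alpha - beta))
```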




State-Space Models
Continuous state space version of Hidden Markov Models:


$$X_{t+1} = B X_t + U_t, \qquad Y_t = A X_t + \epsilon_t$$

where

• $X_t$ is an $n$-dimensional state vector.

• $Y_t$ is an observed stochastic process.

• $A$ and $B$ are model parameters.

• $U_t$ and $\epsilon_t$ are noise terms.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 26
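
The following sketch simulates the linear-Gaussian state-space model above with an illustrative 2-dimensional state and scalar observations; the matrices A and B and the noise scales are assumptions chosen for illustration.

```python
# Sketch of X_{t+1} = B X_t + U_t, Y_t = A X_t + eps_t (values are illustrative).
import numpy as np

rng = np.random.default_rng(3)
T, n = 300, 2
B = np.array([[0.9, 0.1],
              [0.0, 0.8]])            # state transition matrix
A = np.array([[1.0, 0.5]])            # observation matrix (1 x n)
X = np.zeros((T, n))
Y = np.zeros(T)
for t in range(T - 1):
    Y[t] = (A @ X[t])[0] + 0.1 * rng.normal()        # observation noise eps_t
    X[t + 1] = B @ X[t] + 0.1 * rng.normal(size=n)   # state noise U_t
Y[-1] = (A @ X[-1])[0] + 0.1 * rng.normal()
# The state X_t is latent; only Y_t is observed. With Gaussian noise, filtering and
# prediction are typically done with the Kalman filter.
```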


State-Space Models

Theory and Algorithms for Forecasting Non-Stationary Time Series page 27


Estimation
Different methods for estimating model parameters:

• Maximum likelihood estimation:

• requires further parametric assumptions on the noise
distribution (e.g., Gaussian).

• Method of moments (Yule-Walker estimator).

• Conditional and unconditional least squares estimation:

• restricted to certain models.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 28
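
As an example of the method of moments, here is a hedged sketch of a Yule-Walker estimator for AR(p) coefficients built from sample autocovariances (not the tutorial's code; all names are illustrative).

```python
# Yule-Walker: solve R a = r with R[j, k] = gamma(|j - k|) and r_j = gamma(j).
import numpy as np

def yule_walker(Y, p):
    Y = np.asarray(Y, dtype=float) - np.mean(Y)
    T = len(Y)
    # sample autocovariances gamma(0), ..., gamma(p)
    gamma = np.array([Y[: T - j] @ Y[j:] / T for j in range(p + 1)])
    R = np.array([[gamma[abs(j - k)] for k in range(p)] for j in range(p)])
    return np.linalg.solve(R, gamma[1:])   # estimates of (a_1, ..., a_p)

# Example: recover the coefficients of a simulated AR(2) process.
rng = np.random.default_rng(4)
a_true, T = np.array([0.6, -0.3]), 5000
Y = np.zeros(T)
for t in range(2, T):
    Y[t] = a_true @ Y[t - 2:t][::-1] + rng.normal()
print(yule_walker(Y, 2))                   # approximately [0.6, -0.3]
```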


Invertibility of ARMA
Definition: an ARMA(p, q) process is invertible if the roots of
the polynomial

$$Q(z) = 1 + \sum_{j=1}^{q} b_j z^j$$

are outside the unit circle.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 29


Learning guarantee
Theorem: assume $Y_t \sim$ ARMA(p, q) is weakly stationary and
invertible. Let $\hat{a}_T$ denote the least squares estimate of
$a = (a_1, \ldots, a_p)$ and assume that $p$ is known. Then, $\|\hat{a}_T - a\|$
converges in probability to zero.

Similar results hold for other estimators and other models.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 30


Notes
Many other generative models exist.

Learning guarantees are asymptotic.

Model needs to be correctly specified.

Non-stationarity needs to be modeled explicitly.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 31


Theory
Time Series Forecasting
Training data: a finite sample realization of some stochastic
process,

$$(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}.$$

Loss function: $L \colon H \times \mathcal{Z} \to [0, \infty)$, where $H$ is a hypothesis
set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$.

Problem: find $h \in H$ with small path-dependent expected
loss,

$$\mathcal{L}(h, \mathbf{Z}_1^T) = \operatorname*{\mathbb{E}}_{Z_{T+1}} \big[ L(h, Z_{T+1}) \mid \mathbf{Z}_1^T \big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 33


Standard Assumptions
Stationarity:

$(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

Mixing:

dependence between events decaying with $k$.

[Diagram: event $B$ at time $n$, event $A$ at time $n + k$.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 34
Learning Theory
Stationary and $\beta$-mixing process: generalization bounds.

• PAC-learning preserved in that setting (Vidyasagar, 1997).

• VC-dimension bounds for binary classification (Yu, 1994).

• covering number bounds for regression (Meir, 2000).

• Rademacher complexity bounds for general loss
functions (MM and Rostamizadeh, 2009).

• PAC-Bayesian bounds (Alquier et al., 2014).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 35


Learning Theory
Stationarity and mixing: algorithm-dependent bounds.

• AdaBoost (Lozano et al., 2006).

• general stability bounds (MM and Rostamizadeh, 2010).

• regularized ERM (Steinwart and Christmann, 2009).

• stable on-line algorithms (Agarwal and Duchi, 2013).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 36


Problem
Stationarity and mixing assumptions:

• often do not hold (think trend or periodic signals).

• not testable.

• estimating mixing parameters can be hard, even if the
general functional form is known.

• hypothesis set and loss function ignored.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 37


Questions
Is learning with general (non-stationary, non-mixing)
stochastic processes possible?

Can we design algorithms with theoretical guarantees?

need a new tool for the analysis.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 38


Key Quantity - Fixed h
Key difference:

$$\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - \mathcal{L}(h, \mathbf{Z}_1^T)$$

Key average quantity:

$$\frac{1}{T} \sum_{t=1}^{T} \Big[ \mathcal{L}(h, \mathbf{Z}_1^T) - \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 39


Discrepancy
Definition:

$$\Delta = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$

• captures hypothesis set and loss function.

• can be estimated from data, under mild assumptions.

• $\Delta = 0$ in the IID case or for weakly stationary processes with
linear hypotheses and squared loss (K and MM, 2014).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 40


Weighted Discrepancy
Definition: extension to weights $q = (q_1, \ldots, q_T)$:

$$\Delta(q) = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t \, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$

• strictly extends the discrepancy definition used for drifting (MM and
Muñoz Medina, 2012) or domain adaptation (Mansour, MM, and
Rostamizadeh, 2009; Cortes and MM, 2011, 2014), or for the binary loss
(Devroye et al., 1996; Ben-David et al., 2007).

• admits upper bounds in terms of relative entropy, or in
terms of $\beta$-mixing coefficients and asymptotic stationarity
for an asymptotically stationary process.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 41


Estimation
Decomposition: (q)  0 (q) + s .
✓ T
X T
X ◆
1
(q)  sup L(h, Zt1 1 ) qt L(h, Zt1 1 )
h2H s t=1
t=T s+1
✓ T
X ◆
1
+ sup L(h, ZT1 ) L(h, Zt1 1 ) .
h2H s
t=T s+1

1 T s T T +1

Theory and Algorithms for Forecasting Non-Stationary Time Series page 42


Learning Guarantee
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for
all $h \in H$ and all $\alpha > 0$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \Delta(q) + 2\alpha + \|q\|_2 \sqrt{2 \log \frac{\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$

where $G = \{ z \mapsto L(h, z) : h \in H \}$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 43


Bound with Emp. Discrepancy
Corollary: for any $\delta > 0$, with probability at least $1 - \delta$, for
all $h \in H$ and all $\alpha > 0$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(q) + \Delta_s + 4\alpha + \big[ \|q\|_2 + \|q - u_s\|_2 \big] \sqrt{2 \log \frac{2\, \mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$

where

$$\widehat{\Delta}(q) = \sup_{h \in H} \Big( \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big),$$

$u_s$ is the uniform distribution over the last $s$ points, and
$G = \{ z \mapsto L(h, z) : h \in H \}$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 44
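
When the supremum over H is approximated by a finite set of candidate hypotheses, the empirical discrepancy in the corollary can be computed directly. The sketch below assumes a precomputed loss matrix; all names are illustrative.

```python
# Sketch: empirical weighted discrepancy over a finite grid of hypotheses.
import numpy as np

def empirical_discrepancy(losses, q, s):
    """losses: array of shape (num_hypotheses, T) with losses[h, t] = L(h, Z_t).
    q: weight vector of length T.  s: size of the most recent block."""
    recent_avg = losses[:, -s:].mean(axis=1)        # (1/s) * sum over the last s points
    weighted = losses @ q                           # sum_t q_t L(h, Z_t)
    return float(np.max(recent_avg - weighted))     # sup over the finite hypothesis set

# Example with random losses for 50 candidate hypotheses over T = 200 points:
rng = np.random.default_rng(5)
losses = rng.uniform(size=(50, 200))
T = losses.shape[1]
u = np.full(T, 1.0 / T)                             # uniform weights
print(empirical_discrepancy(losses, u, s=20))
```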


Weighted Sequential α-Cover
(Rakhlin et al., 2010; K and MM, 2015)
Definition: let $\mathbf{z}$ be a $\mathcal{Z}$-valued full binary tree of depth $T$.
Then, a set of trees $V$ is an $\ell_1$-norm $q$-weighted $\alpha$-cover of a
function class $G$ on $\mathbf{z}$ if

$$\forall g \in G,\ \forall \sigma \in \{\pm 1\}^T,\ \exists \mathbf{v} \in V : \quad \sum_{t=1}^{T} \big| v_t(\sigma) - g(z_t(\sigma)) \big| \le \frac{\alpha}{\|q\|_\infty}.$$

[Figure: a full binary tree of depth $T$ with node values $g(z_1), v_1$; $g(z_2), v_2$; $g(z_3), v_3$; \ldots; along a path $\sigma$, the $\ell_1$-distance between $(v_1, v_3, v_6, v_{12}, \ldots)$ and $(g(z_1), g(z_3), g(z_6), g(z_{12}), \ldots)$ is at most $\alpha / \|q\|_\infty$.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 45
Sequential Covering Numbers
Definitions:

• sequential covering number:

$$\mathcal{N}_1(\alpha, G, \mathbf{z}) = \min \big\{ |V| : V \text{ is an } \ell_1\text{-norm } q\text{-weighted } \alpha\text{-cover of } G \text{ on } \mathbf{z} \big\}.$$

• expected sequential covering number: $\mathbb{E}_{\mathbf{z} \sim \mathbf{Z}_T}[\mathcal{N}_1(\alpha, G, \mathbf{z})]$.

[Figure: a tree of conditional distributions, $Z_1, Z_1' \sim D_1$; $Z_2, Z_2' \sim D_2(\cdot \mid Z_1)$; $Z_3, Z_3' \sim D_2(\cdot \mid Z_1')$; $Z_4, Z_4' \sim D_3(\cdot \mid Z_1, Z_2)$; $Z_5, Z_5' \sim D_3(\cdot \mid Z_1, Z_2')$; \ldots. $\mathbf{Z}_T$: distribution over trees based on the $Z_t$'s.]

Theory and Algorithms for Forecasting Non-Stationary Time Series page 46


Proof
Key quantities:

$$\Phi(\mathbf{Z}_1^T) = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big), \qquad \Delta(q) = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t \, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$

Chernoff technique: for any $t > 0$,

$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le \mathbb{P}\Big[ \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] > \epsilon \Big] && \text{(sub-add. of sup)} \\
&= \mathbb{P}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] \Big) > e^{t\epsilon} \Big] && (t > 0) \\
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] \Big) \Big]. && \text{(Markov's ineq.)}
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 47


Symmetrization
Key tool: a decoupled tangent sequence $\mathbf{Z'}_1^T$ associated to $\mathbf{Z}_1^T$:

• $Z_t$ and $Z_t'$ are i.i.d. given $\mathbf{Z}_1^{t-1}$.

$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] \Big) \Big] \\
&= e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathbb{E}[L(h, Z_t') \mid \mathbf{Z}_1^{t-1}] - L(h, Z_t) \big] \Big) \Big] && \text{(tangent seq.)} \\
&= e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \mathbb{E}\Big[ \sum_{t=1}^{T} q_t \big[ L(h, Z_t') - L(h, Z_t) \big] \,\Big|\, \mathbf{Z}_1^T \Big] \Big) \Big] && \text{(lin. of expectation)} \\
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_t') - L(h, Z_t) \big] \Big) \Big]. && \text{(Jensen's ineq.)}
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 48


Symmetrization
$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_t') - L(h, Z_t) \big] \Big) \Big] \\
&= e^{-t\epsilon} \, \mathbb{E}_{(\mathbf{z}, \mathbf{z}')} \mathbb{E}_{\sigma} \Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t \big[ L(h, z_t'(\sigma)) - L(h, z_t(\sigma)) \big] \Big) \Big] && \text{(tangent seq. prop.)} \\
&\le e^{-t\epsilon} \, \mathbb{E}_{(\mathbf{z}, \mathbf{z}')} \mathbb{E}_{\sigma} \Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t'(\sigma)) + t \sup_{h \in H} \sum_{t=1}^{T} (-q_t \sigma_t) L(h, z_t(\sigma)) \Big) \Big] && \text{(sub-add. of sup)} \\
&\le e^{-t\epsilon} \, \mathbb{E}_{(\mathbf{z}, \mathbf{z}')} \mathbb{E}_{\sigma} \Big[ \frac{1}{2} \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t'(\sigma)) \Big) + \frac{1}{2} \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} (-q_t \sigma_t) L(h, z_t(\sigma)) \Big) \Big] && \text{(convexity of exp)} \\
&= e^{-t\epsilon} \, \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big].
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 49


Covering Number
$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le e^{-t\epsilon} \, \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big] \\
&\le e^{-t\epsilon} \, \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \Big[ \max_{\mathbf{v} \in V} \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) + \alpha \Big] \Big) \Big] && (\alpha\text{-covering}) \\
&\le e^{-t(\epsilon - 2\alpha)} \, \mathbb{E}_{\mathbf{z}} \Big[ \sum_{\mathbf{v} \in V} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) \Big) \Big] \Big] && \text{(monotonicity of exp)} \\
&\le e^{-t(\epsilon - 2\alpha)} \, \mathbb{E}_{\mathbf{z}} \Big[ \sum_{\mathbf{v} \in V} \exp\Big( \frac{t^2 \|q\|_2^2}{2} \Big) \Big] && \text{(Hoeffding's ineq.)} \\
&\le \mathbb{E}_{\mathbf{z}} \big[ \mathcal{N}_1(\alpha, G, \mathbf{z}) \big] \exp\Big( -t(\epsilon - 2\alpha) + \frac{t^2 \|q\|_2^2}{2} \Big).
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 50


Algorithms
Review
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for
all $h \in H$ and all $\alpha > 0$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(q) + \Delta_s + 4\alpha + \big[ \|q\|_2 + \|q - u_s\|_2 \big] \sqrt{2 \log \frac{2\, \mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}}.$$

This bound can be extended to hold uniformly over $q$ at
the price of the additional term:

$$\tilde{O}\Big( \|q - u\|_1 \sqrt{\log_2 \log_2 \|q - u\|_1^{-1}} \Big).$$

Data-dependent learning guarantee.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 52


Discrepancy-Risk Minimization
Key Idea: directly optimize the upper bound on
generalization over q and h .

This problem can be solved efficiently for some L and H.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 53


Kernel-Based Regression
Squared loss function: $L(y, y') = (y - y')^2$.

Hypothesis set: for a PDS kernel $K$ with feature map $\Phi_K$,

$$H = \big\{ x \mapsto w \cdot \Phi_K(x) : \|w\|_H \le \Lambda \big\}.$$

The complexity term can be bounded by

$$O\Big( (\log^{3/2} T)\, \Lambda \sup_x K(x, x)\, \|q\|_2 \Big).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 54


Instantaneous Discrepancy
The empirical discrepancy can be further upper bounded in
terms of instantaneous discrepancies:

$$\widehat{\Delta}(q) \le \sum_{t=1}^{T} q_t d_t + M \|q - u\|_1$$

where $M = \sup_{y, y'} L(y, y')$ and

$$d_t = \sup_{h \in H} \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 55


Proof
By sub-additivity of the supremum,

$$\begin{aligned}
\widehat{\Delta}(q) &= \sup_{h \in H} \Big\{ \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big\} \\
&= \sup_{h \in H} \Big\{ \sum_{t=1}^{T} q_t \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big) + \sum_{t=1}^{T} \Big( \frac{1}{T} - q_t \Big) \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) \Big\} \\
&\le \sum_{t=1}^{T} q_t \sup_{h \in H} \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big) + M \|u - q\|_1.
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 56


Computing Discrepancies
Instantaneous discrepancy for kernel-based hypotheses
with squared loss:

$$d_t = \sup_{\|w'\| \le \Lambda} \Big( \sum_{s=1}^{T} u_s \big( w' \cdot \Phi_K(x_s) - y_s \big)^2 - \big( w' \cdot \Phi_K(x_t) - y_t \big)^2 \Big).$$

This is a difference of convex (DC) functions.

Global optimum via DC programming (Tao and An, 1998).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 57
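
The DC-programming solver referenced above is the principled approach; for intuition only, the following crude random-search sketch approximates d_t for linear hypotheses under stated assumptions (dot-product kernel, candidates sampled in the ball of radius Lambda).

```python
# Not the DC-programming procedure cited above; a rough Monte Carlo approximation.
import numpy as np

def instantaneous_discrepancy(X, y, u, t, Lam=1.0, n_samples=2000, seed=0):
    """X: (T, d) inputs; y: (T,) targets; u: (T,) weights (uniform on the last s
    points, zero elsewhere); t: index of the point whose discrepancy d_t is estimated."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_samples, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    W *= Lam * rng.uniform(size=(n_samples, 1))          # random candidates in the ball ||w'|| <= Lam
    preds = W @ X.T                                      # (n_samples, T) predictions w'.x_s
    weighted_err = ((preds - y) ** 2) @ u                # sum_s u_s (w'.x_s - y_s)^2
    point_err = (preds[:, t] - y[t]) ** 2                # (w'.x_t - y_t)^2
    return float(np.max(weighted_err - point_err))       # approximate supremum

# Usage sketch:
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3)); y = X[:, 0] + 0.1 * rng.normal(size=100)
u = np.zeros(100); u[-20:] = 1 / 20                      # uniform over the last s = 20 points
print(instantaneous_discrepancy(X, y, u, t=0))
```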


Discrepancy-Based Forecasting
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for
all kernel-based hypotheses $h \in H$ and all $q$ with $0 < \|q - u\|_1 \le 1$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(q) + \Delta_s + \tilde{O}\Big( \log^{3/2} T \sup_x K(x, x)\, \Lambda + \|q - u\|_1 \Big).$$

Corresponding optimization problem:

$$\min_{q \in [0,1]^T,\, w} \Big\{ \sum_{t=1}^{T} q_t \big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda_1 \sum_{t=1}^{T} q_t d_t + \lambda_2 \|w\|_H + \lambda_3 \|q - u\|_1 \Big\}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 58




Convex Problem
Change of variable: $r_t = 1/q_t$.

Upper bound: $|r_t^{-1} - 1/T| \le T^{-1} |r_t - T|$ for $r_t \ge 1$.

$$\min_{r \in D,\, w} \Big\{ \sum_{t=1}^{T} \frac{\big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda_1 d_t}{r_t} + \lambda_2 \|w\|_H + \lambda_3 \sum_{t=1}^{T} |r_t - T| \Big\}$$

• where $D = \{ r : r_t \ge 1 \}$.

• convex optimization problem.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 63


Two-Stage Algorithm
Minimize the empirical discrepancy $\widehat{\Delta}(q)$ over $q$ (convex
optimization).

Solve the (weighted) kernel ridge regression problem:

$$\min_{w} \Big\{ \sum_{t=1}^{T} q_t^* \big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda \|w\|_H \Big\}$$

where $q^*$ is the solution of the discrepancy minimization
problem.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 64
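
A minimal sketch of the second stage, assuming a Gaussian kernel and weights q* already produced by the discrepancy-minimization step; setting the gradient of the dual objective to zero gives the linear system solved below. All names and parameter values are illustrative.

```python
# Weighted kernel ridge regression: minimize sum_t q_t (f(x_t) - y_t)^2 + lam ||f||_H^2.
# In the dual, (Q K + lam I) alpha = Q y with Q = diag(q), f(x) = sum_t alpha_t K(x_t, x).
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=1e-2, gamma=1.0):
    K = gaussian_kernel(X, X, gamma)
    Q = np.diag(q)
    alpha = np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)
    return alpha

def weighted_krr_predict(alpha, X_train, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Usage sketch: q_star would come from minimizing the empirical discrepancy over q.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)); y = X[:, 0] + 0.1 * rng.normal(size=100)
q_star = np.full(100, 1.0 / 100)                     # placeholder for the learned weights
alpha = weighted_krr_fit(X, y, q_star)
print(weighted_krr_predict(alpha, X, X[:5]))
```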


Preliminary Experiments
Artificial data sets:

Theory and Algorithms for Forecasting Non-Stationary Time Series page 65


True vs. Empirical Discrepancies

[Figure panels: Discrepancies; Weights.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 66
Running MSE

Theory and Algorithms for Forecasting Non-Stationary Time Series page 67


Real-world Data
Commodity prices, exchange rates, temperatures &
climate.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 68


Time Series Prediction &
On-line Learning
Two Learning Scenarios
Stochastic scenario:

• distributional assumption.

• performance measure: expected loss.

• guarantees: generalization bounds.

On-line scenario:

• no distributional assumption.

• performance measure: regret.

• guarantees: regret bounds.

• active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al.


2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth,
1998, 2001; Koolen et al., 2015).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 70
On-Line Learning Setup
Adversarial setting with hypothesis/action set H.

For t = 1 to T do

• player receives $x_t \in \mathcal{X}$.

• player selects $h_t \in H$.

• adversary selects $y_t \in \mathcal{Y}$.

• player incurs loss $L(h_t(x_t), y_t)$.

Objective: minimize the (external) regret

$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in H^*} \sum_{t=1}^{T} L(h(x_t), y_t).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 71


Example: Exp. Weights (EW)
Expert set $H^* = \{E_1, \ldots, E_N\}$, $H = \mathrm{conv}(H^*)$.

EW($\{E_1, \ldots, E_N\}$)
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      h_t ← (Σ_{i=1}^N w_{t,i} E_i) / (Σ_{i=1}^N w_{t,i})
 6      Receive(y_t)
 7      Incur-Loss(L(h_t(x_t), y_t))
 8      for i ← 1 to N do
 9          w_{t+1,i} ← w_{t,i} e^{-η L(E_i(x_t), y_t)}     (parameter η > 0)
10  return h_T

Theory and Algorithms for Forecasting Non-Stationary Time Series page 72
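
A direct Python transcription of the EW pseudocode above (a sketch: the experts are arbitrary callables, and the loss is assumed convex in its first argument and bounded in [0, 1]).

```python
import numpy as np

def exponential_weights(experts, xs, ys, loss, eta):
    """Run EW for T rounds; returns the per-round losses of the aggregated predictor."""
    N, T = len(experts), len(xs)
    w = np.ones(N)                                   # w_{1,i} = 1
    player_losses = []
    for t in range(T):
        p = w / w.sum()
        h_t = lambda x: sum(p_i * E(x) for p_i, E in zip(p, experts))   # weighted average
        player_losses.append(loss(h_t(xs[t]), ys[t]))
        expert_losses = np.array([loss(E(xs[t]), ys[t]) for E in experts])
        w = w * np.exp(-eta * expert_losses)         # w_{t+1,i} = w_{t,i} e^{-eta L(E_i(x_t), y_t)}
    return np.array(player_losses)

# Example: two constant experts, squared loss clipped to [0, 1], eta = sqrt(8 log N / T).
T = 500
xs = np.zeros(T); ys = 0.7 * np.ones(T)
experts = [lambda x: 0.0, lambda x: 1.0]
loss = lambda p, y: min((p - y) ** 2, 1.0)
eta = np.sqrt(8 * np.log(len(experts)) / T)
print(exponential_weights(experts, xs, ys, loss, eta).sum())
```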


EW Guarantee
Theorem: assume that $L$ is convex in its first argument and
takes values in $[0, 1]$. Then, for any $\eta > 0$ and any
sequence $y_1, \ldots, y_T \in \mathcal{Y}$, the regret of EW at time $T$
satisfies

$$\mathrm{Reg}_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$

For $\eta = \sqrt{8 \log N / T}$,

$$\mathrm{Reg}_T \le \sqrt{(T/2) \log N}, \qquad \frac{\mathrm{Reg}_T}{T} = O\Big( \sqrt{\frac{\log N}{T}} \Big).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 73


EW - Proof
Potential: $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$.

Upper bound:

$$\begin{aligned}
\Phi_t - \Phi_{t-1} &= \log \frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^{N} w_{t-1,i}} \\
&= \log \operatorname*{\mathbb{E}}_{w_{t-1}} \big[ e^{-\eta L(E_i(x_t), y_t)} \big] \\
&= \log \Big( \operatorname*{\mathbb{E}}_{w_{t-1}} \Big[ \exp\Big( -\eta \big( L(E_i(x_t), y_t) - \operatorname*{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] \big) - \eta \operatorname*{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] \Big) \Big] \Big) \\
&\le -\eta \operatorname*{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] + \frac{\eta^2}{8} && \text{(Hoeffding's ineq.)} \\
&\le -\eta L\Big( \operatorname*{\mathbb{E}}_{w_{t-1}}[E_i(x_t)], y_t \Big) + \frac{\eta^2}{8} && \text{(convexity of first arg. of } L) \\
&= -\eta L(h_t(x_t), y_t) + \frac{\eta^2}{8}.
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 74


EW - Proof
Upper bound: summing up these inequalities yields

$$\Phi_T - \Phi_0 \le -\eta \sum_{t=1}^{T} L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.$$

Lower bound:

$$\begin{aligned}
\Phi_T - \Phi_0 &= \log \sum_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N \\
&\ge \log \max_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N \\
&= -\eta \min_{i=1}^{N} \sum_{t=1}^{T} L(E_i(x_t), y_t) - \log N.
\end{aligned}$$

Comparison:

$$\sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{i=1}^{N} \sum_{t=1}^{T} L(E_i(x_t), y_t) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 75
Questions
Can we exploit both batch and on-line to

• design flexible algorithms for time series prediction with


stochastic guarantees?

• tackle notoriously difficult time series problems, e.g.,
model selection and learning ensembles?

Theory and Algorithms for Forecasting Non-Stationary Time Series page 76


Model Selection
Problem: given $N$ time series models, how should we use
the sample $\mathbf{Z}_1^T$ to select a single best model?

• in the i.i.d. case, cross-validation can be shown to be close to
the structural risk minimization solution.

• but, how do we select a validation set for general


stochastic processes?

• use most recent data?

• use the most distant data?

• use various splits?

• models may have been pre-trained on $\mathbf{Z}_1^T$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 77


Learning Ensembles
Problem: given a hypothesis set $H$ and a sample $\mathbf{Z}_1^T$, find
an accurate convex combination $h = \sum_{t=1}^{T} q_t h_t$ with $h_t \in H$
and $q \in \Delta$.

• in the most general case, hypotheses may have been pre-
trained on $\mathbf{Z}_1^T$.

→ on-line-to-batch conversion for general non-stationary
non-mixing processes.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 78


On-Line-to-Batch (OTB)
Input: a sequence of hypotheses $\mathbf{h} = (h_1, \ldots, h_T)$ returned
after $T$ rounds by an on-line algorithm $\mathcal{A}$ minimizing the
general regret

$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t, Z_t) - \inf_{h^* \in H^*} \sum_{t=1}^{T} L(h^*, Z_t).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 79


On-Line-to-Batch (OTB)
Problem: use $\mathbf{h} = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in H$
with small path-dependent expected loss,

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \operatorname*{\mathbb{E}}_{Z_{T+1}} \big[ L(h, Z_{T+1}) \mid \mathbf{Z}_1^T \big].$$

• the i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al.,
2004).

• but how do we design solutions for the general time-
series scenario?

Theory and Algorithms for Forecasting Non-Stationary Time Series page 80
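
A minimal on-line-to-batch sketch under the stated assumptions: a toy on-line learner (on-line gradient descent on the squared loss, an assumption, not the tutorial's algorithm) produces hypotheses h_1, ..., h_T, and the output is the q-weighted convex combination, which for linear hypotheses amounts to averaging weight vectors.

```python
import numpy as np

def online_gradient_descent(X, y, lr=0.1):
    """Toy on-line learner for the squared loss; returns the iterates w_1, ..., w_T."""
    w = np.zeros(X.shape[1])
    iterates = []
    for x_t, y_t in zip(X, y):
        iterates.append(w.copy())
        w = w - lr * 2 * (w @ x_t - y_t) * x_t       # gradient step on (w.x_t - y_t)^2
    return np.array(iterates)

def online_to_batch(iterates, q):
    """Convex combination h = sum_t q_t h_t; for linear h_t this is a weighted average."""
    return q @ iterates

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3)); y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=200)
iterates = online_gradient_descent(X, y)
q = np.full(len(y), 1.0 / len(y))                    # uniform q; other weightings are possible
w_batch = online_to_batch(iterates, q)
print(w_batch)
```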


Questions
Is OTB with general (non-stationary, non-mixing) stochastic
processes possible?

Can we design algorithms with theoretical guarantees?

need a new tool for the analysis.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 81


Relevant Quantity
Key difference:

$$\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T)$$

Average difference:

$$\frac{1}{T} \sum_{t=1}^{T} \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 82


On-line Discrepancy
Definition:

$$\mathrm{disc}(q) = \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].$$

• $H_{\mathcal{A}}$: set of hypothesis sequences that $\mathcal{A}$ can return.

• $q = (q_1, \ldots, q_T)$: arbitrary weight vector.

• natural measure of non-stationarity or dependency.

• captures hypothesis set and loss function.

• can be efficiently estimated under mild assumptions.

• generalization of the definition of (Kuznetsov and MM, 2015).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 83


Discrepancy Estimation
Batch discrepancy estimation method.

Alternative method:

• assume that the loss is $\mu$-Lipschitz.

• assume that there exists an accurate hypothesis $h^*$:

$$\eta = \inf_{h^*} \operatorname*{\mathbb{E}} \big[ L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T \big] \ll 1.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 84


Discrepancy Estimation
Lemma: fix a sequence $\mathbf{Z}_1^T$ in $\mathcal{Z}$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, the following holds for all $\alpha > 0$:

$$\mathrm{disc}(q) \le \widehat{\mathrm{disc}}_H(q) + \mu \eta + 2\alpha + M \|q\|_2 \sqrt{2 \log \frac{\mathbb{E}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$

where

$$\widehat{\mathrm{disc}}_H(q) = \sup_{h \in H,\, \mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ L\big( h_t(X_{T+1}), h(X_{T+1}) \big) - L\big( h_t, Z_t \big) \Big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 85


Proof Sketch
$$\begin{aligned}
\mathrm{disc}(q) &= \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \\
&\le \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] \Big] \\
&\quad + \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \\
&\le \mu \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \,\mathbb{E}\big[ L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T \big] \\
&\quad + \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \\
&= \mu\, \mathbb{E}\big[ L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T \big] \\
&\quad + \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 86


Learning Guarantee
Lemma: let $L$ be a convex loss bounded by $M$ and $\mathbf{h}_1^T$ a
hypothesis sequence adapted to $\mathbf{Z}_1^T$. Fix $q \in \Delta$. Then, for
any $\delta > 0$, the following holds with probability at least $1 - \delta$
for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + \mathrm{disc}(q) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 87


Proof
By definition of the on-line discrepancy,

$$\sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \le \mathrm{disc}(q).$$

$A_t = q_t \big[ \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - L(h_t, Z_t) \big]$ is a martingale difference,
thus by Azuma's inequality, with high probability,

$$\sum_{t=1}^{T} q_t \,\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.$$

By convexity of the loss:

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t \,\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 88


Learning Guarantee
Theorem: let $L$ be a convex loss bounded by $M$ and $H^*$ a set
of hypothesis sequences adapted to $\mathbf{Z}_1^T$. Fix $q \in \Delta$. Then, for
any $\delta > 0$, the following holds with probability at least $1 - \delta$
for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \inf_{\mathbf{h}^* \in H^*} \sum_{t=1}^{T} q_t \,\mathcal{L}_{T+1}(h_t^*, \mathbf{Z}_1^T) + 2\,\mathrm{disc}(q) + \frac{\mathrm{Reg}_T}{T} + M \|q - u\|_1 + 2 M \|q\|_2 \sqrt{2 \log \frac{2}{\delta}}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 89


Conclusion
Time series forecasting:

• key learning problem in many important tasks.

• very challenging: theory, algorithms, applications.

• new and general data-dependent learning guarantees


for non-mixing non-stationary processes.

• algorithms with guarantees.

Time series prediction and on-line learning:

• proof for flexible solutions derived via OTB.

• application to model selection.

• application to learning ensembles.


Theory and Algorithms for Forecasting Non-Stationary Time Series page 90
Time Series Workshop

Join us in Room 117


Friday, December 9th.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 91


References
T. M. Adams and A. B. Nobel. Uniform convergence of
Vapnik-Chervonenkis classes under ergodic sampling. The
Annals of Probability, 38(4):1345–1367, 2010.

A. Agarwal, J. Duchi. The generalization ability of online


algorithms for dependent data. Information Theory, IEEE
Transactions on, 59(1):573–587, 2013.

P. Alquier, X. Li, O. Wintenberger. Prediction of time series


by statistical learning: general losses and fast rates.
Dependence Modelling, 1:65–93, 2014.

Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir.


Online learning for time series prediction. COLT, 2013.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 92
References
O. Anava, E. Hazan, and A. Zeevi. Online time series
prediction with missing data. ICML, 2015.

O. Anava and S. Mannor. Online Learning for


heteroscedastic sequences. ICML, 2016.

P. L. Bartlett. Learning with a slowly changing distribution.


COLT, 1992.

R. D. Barve and P. M. Long. On the complexity of learning


from drifting distributions. Information and Computation,
138(2):101–123, 1997.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 93


References
P. Berti and P. Rigo. A Glivenko-Cantelli theorem for
exchangeable random variables. Statistics & Probability
Letters, 32(4):385 – 391, 1997.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis


of representations for domain adaptation. In NIPS. 2007.

T. Bollerslev. Generalized autoregressive conditional


heteroskedasticity. J Econometrics, 1986.

G. E. P. Box, G. Jenkins. (1990) . Time Series Analysis,


Forecasting and Control.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 94


References
O. Bousquet and M. K. Warmuth. Tracking a small set of
experts by mixing past posteriors. COLT, 2001.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the


generalization ability of on-line learning algorithms. IEEE
Trans. on Inf. Theory , 50(9), 2004.

N. Cesa-Bianchi and C. Gentile. Tracking the best


hyperplane with a simple budget perceptron. COLT, 2006.

C. Cortes and M. Mohri. Domain adaptation and sample


bias correction theory and algorithm for regression.
Theoretical Computer Science, 519, 2014.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 95


References
V. H. De la Pena and E. Gine. (1999) Decoupling: from
dependence to independence: randomly stopped processes, U-
statistics and processes, martingales and beyond. Probability
and its applications. Springer, NY.

P. Doukhan. (1994) Mixing: properties and examples. Lecture


notes in statistics. Springer-Verlag, New York.

R. Engle. Autoregressive conditional heteroscedasticity with


estimates of the variance of United Kingdom inflation.
Econometrica, 50(4):987–1007, 1982.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 96


References
D. P. Helmbold and P. M. Long. Tracking drifting concepts by
minimizing disagreements. Machine Learning, 14(1): 27-46,
1994.

M. Herbster and M. K. Warmuth. Tracking the best expert.


Machine Learning, 32(2), 1998.

M. Herbster and M. K. Warmuth. Tracking the best linear


predictor. JMLR, 2001.

D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time


estimation in reversible Markov chains from a single
sample path. NIPS, 2015.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 97


References
W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi.
Minimax time series prediction. NIPS, 2015.

V. Kuznetsov, M. Mohri. Generalization bounds for time


series prediction with non-stationary processes. In ALT,
2014.

V. Kuznetsov, M. Mohri. Learning theory and algorithms for


forecasting non-stationary time series. In NIPS, 2015.

V. Kuznetsov, M. Mohri. Time series prediction and on-line


learning. In COLT, 2016.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 98


References
A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence
and consistency of regularized boosting algorithms with
stationary β-mixing observations. In NIPS, pages 819–826,
2006.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain


adaptation: Learning bounds and algorithms. In COLT.
2009.

R. Meir. Nonparametric time series prediction through


adaptive model selection. Machine Learning, pages 5–34,
2000.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 99


References
D. Modha, E. Masry. Memory-universal prediction of
stationary random processes. Information Theory, IEEE
Transactions on, 44(1):117–133, Jan 1998.

M. Mohri, A. Munoz Medina. New analysis and algorithm


for learning with drifting distributions. In ALT, 2012.

M. Mohri, A. Rostamizadeh. Rademacher complexity


bounds for non-i.i.d. processes. In NIPS, 2009.

M. Mohri, A. Rostamizadeh. Stability bounds for stationary


φ-mixing and β-mixing processes. Journal of Machine
Learning Research, 11:789–814, 2010.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 100
References
V. Pestov. Predictive PAC learnability: A paradigm for
learning from exchangeable input data. GRC, 2010.

A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random


averages, combinatorial parameters, and learnability. In
NIPS, 2010.

L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC-


Bayes bounds for non-iid data: Applications to ranking and
stationary beta-mixing processes. JMLR 11:1927–1956,
2010.

C. Shalizi and A. Kontorovitch. Predictive PAC-learning and


process decompositions. NIPS, 2013.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 101
References

I. Steinwart, A. Christmann. Fast learning from non-i.i.d.


observations. NIPS, 2009.

M. Vidyasagar. (1997). A Theory of Learning and


Generalization: With Applications to Neural Networks and
Control Systems. Springer-Verlag New York, Inc.

V. Vovk. Competing with stationary prediction strategies.


COLT 2007.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 102
References
B. Yu. Rates of convergence for empirical processes of
stationary mixing sequences. The Annals of Probability,
22(1):94–116, 1994.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 103
𝜷-Mixing
Definition: a sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{+\infty}$ is
$\beta$-mixing if

$$\beta(k) = \sup_n \operatorname*{\mathbb{E}}_{B \in \sigma_{-\infty}^{n}} \Big[ \sup_{A \in \sigma_{n+k}^{+\infty}} \big| \mathbb{P}[A \mid B] - \mathbb{P}[A] \big| \Big] \to 0,$$

where $\sigma_a^b$ denotes the $\sigma$-algebra generated by $Z_a, \ldots, Z_b$:
dependence between events decaying with $k$.

[Diagram: event $B$ at time $n$, event $A$ at time $n + k$.]

Theory and Algorithms for Forecasting Non-Stationary Time Series page 104
