
Theory and Algorithms for

Forecasting Non-Stationary
Time Series

VITALY KUZNETSOV (KUZNETSOV@)
GOOGLE RESEARCH

MEHRYAR MOHRI (MOHRI@)
COURANT INSTITUTE & GOOGLE RESEARCH
Motivation
Time series prediction:

• stock values.

• economic variables.

• weather: e.g., local and global temperature.

• sensors: Internet-of-Things.

• earthquakes.

• energy demand.

• signal processing.

• sales forecasting.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 2


Google Trends

[Figure slides, pages 3-5: Google Trends examples.]


Challenges
Standard Supervised Learning:

• IID assumption.

• Same distribution for training and test data.

• Distributions fixed over time (stationarity).

none of these assumptions holds


for time series!

Theory and Algorithms for Forecasting Non-Stationary Time Series page 6


Outline
Introduction to time series analysis.

Learning theory for forecasting non-stationary time series.

Algorithms for forecasting non-stationary time series.

Time series prediction and on-line learning.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 7


Introduction to Time
Series Analysis
Classical Framework
Postulate a particular form of a parametric model that is
assumed to generate the data.

Use the given sample to estimate the unknown parameters of the model.

Use the estimated model to make predictions.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 9


Autoregressive (AR) Models
Definition: the AR(p) model is a linear generative model based
on the pth order Markov assumption:

$$\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$$

where

• the $\epsilon_t$ are zero-mean uncorrelated random variables with variance $\sigma^2$.

• $a_1, \ldots, a_p$ are the autoregressive coefficients.

• $Y_t$ is the observed stochastic process.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 10
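
As a concrete illustration (not part of the tutorial), the following sketch simulates an AR(2) process and recovers its coefficients by least squares; the coefficient values, noise variance, and the use of NumPy are illustrative assumptions.

```python
# Illustrative sketch: simulate an AR(2) process
# Y_t = a_1 Y_{t-1} + a_2 Y_{t-2} + eps_t and recover (a_1, a_2) by least squares.
import numpy as np

rng = np.random.default_rng(0)
T, a = 2000, np.array([0.6, -0.3])          # illustrative autoregressive coefficients
p = len(a)
eps = rng.normal(size=T)                    # zero-mean noise, variance sigma^2 = 1
Y = np.zeros(T)
for t in range(p, T):
    Y[t] = a @ Y[t - p:t][::-1] + eps[t]    # a_1 Y_{t-1} + ... + a_p Y_{t-p} + eps_t

# Least-squares estimate: regress Y_t on its p previous values.
X = np.column_stack([Y[p - 1 - i: T - 1 - i] for i in range(p)])
a_hat, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)
print("estimated coefficients:", a_hat)     # close to [0.6, -0.3] for large T
```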


Moving Averages (MA)
Definition: the MA(q) model is a linear generative model for
the noise term based on the qth order Markov assumption:

$$\forall t, \quad Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$$

where

• $b_1, \ldots, b_q$ are the moving average coefficients.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 11


ARMA model
(Whittle, 1951; Box & Jenkins, 1971)

Definition: the ARMA(p, q) model is a generative linear model
that combines the AR(p) and MA(q) models:

$$\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 12


ARMA

Theory and Algorithms for Forecasting Non-Stationary Time Series page 13


Stationarity
Definition: a sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{+\infty}$ is
stationary if its distribution is invariant to shifting in time:

$(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 14


Weak Stationarity
Definition: a sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{+\infty}$ is
weakly stationary if its first and second moments are
invariant to shifting in time, that is,

• $\mathbb{E}[Z_t]$ is independent of $t$.

• $\mathbb{E}[Z_t Z_{t-j}] = f(j)$ for some function $f$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 15


Lag Operator
The lag operator $L$ is defined by $L Y_t = Y_{t-1}$.

The ARMA model in terms of the lag operator:

$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t$$

The characteristic polynomial

$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i$$

can be used to study properties of this stochastic process.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 16


Weak Stationarity of ARMA
Theorem: an ARMA( p, q) process is weakly stationary if the
roots of the characteristic polynomial P (z) are outside the
unit circle.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 17
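
The condition in the theorem is easy to check numerically. The sketch below (an illustration, not the tutorial's code) builds the characteristic polynomial from assumed AR coefficients and tests whether all of its roots lie outside the unit circle.

```python
# Check the weak-stationarity condition: roots of P(z) = 1 - a_1 z - ... - a_p z^p
# must satisfy |root| > 1.
import numpy as np

def is_weakly_stationary(a):
    """a = (a_1, ..., a_p): AR coefficients of an ARMA(p, q) model."""
    # numpy.roots expects coefficients from the highest degree down:
    # P(z) = -a_p z^p - ... - a_1 z + 1.
    coeffs = np.concatenate([-np.asarray(a, dtype=float)[::-1], [1.0]])
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_weakly_stationary([0.6, -0.3]))   # True: both roots are outside the unit circle
print(is_weakly_stationary([1.1]))         # False: Y_t = 1.1 Y_{t-1} + eps_t is explosive
```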


Proof
If the roots of the characteristic polynomial are outside the unit
circle, then:

$$P(z) = 1 - \sum_{i=1}^{p} a_i z^i = c\,(\lambda_1 - z)\cdots(\lambda_p - z) = c'\,(1 - \lambda_1^{-1} z)\cdots(1 - \lambda_p^{-1} z)$$

where $|\lambda_i| > 1$ for all $i = 1, \ldots, p$ and $c, c'$ are constants.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 18


Proof
Therefore, the ARMA(p, q) process

$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t$$

admits an MA($\infty$) representation:

$$Y_t = \big(1 - \lambda_1^{-1} L\big)^{-1} \cdots \big(1 - \lambda_p^{-1} L\big)^{-1} \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t$$

where

$$\big(1 - \lambda_i^{-1} L\big)^{-1} = \sum_{k=0}^{\infty} \big(\lambda_i^{-1} L\big)^k$$

is well-defined since $|\lambda_i^{-1}| < 1$.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 19
Proof
Therefore, it suffices to show that

$$Y_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}$$

is weakly stationary.

The mean is constant:

$$\mathbb{E}[Y_t] = \sum_{j=0}^{\infty} \psi_j \,\mathbb{E}[\epsilon_{t-j}] = 0.$$

The covariance function $\mathbb{E}[Y_t Y_{t-l}]$ only depends on the lag $l$:

$$\mathbb{E}[Y_t Y_{t-l}] = \sum_{k=0}^{\infty} \sum_{j=0}^{\infty} \psi_k \psi_j \,\mathbb{E}[\epsilon_{t-j} \epsilon_{t-l-k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+l}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 20


ARIMA
Non-stationary processes can be modeled using processes
whose characteristic polynomial has unit roots.

A characteristic polynomial with unit roots can be factored as

$$P(z) = R(z)\,(1 - z)^D$$

where $R(z)$ has no unit roots.

Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model
for $(1 - L)^D Y_t$:

$$\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) (1 - L)^D Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 21
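
To make the differencing step concrete, here is a small sketch (illustrative data and parameters assumed) showing that applying $(1 - L)$ to a unit-root process yields a series that can then be modeled as ARMA.

```python
# Sketch: differencing removes a unit root, so an ARIMA(p, 1, q) fit reduces to
# an ARMA(p, q) fit on (1 - L) Y_t.
import numpy as np

rng = np.random.default_rng(1)
T = 1000
Y = np.cumsum(0.1 + rng.normal(size=T))   # unit-root process: Y_t = Y_{t-1} + 0.1 + eps_t

dY = np.diff(Y)                           # (1 - L) Y_t, now (weakly) stationary
print("first-half vs. second-half mean of dY:", dY[:T // 2].mean(), dY[T // 2:].mean())
print("first-half vs. second-half var of dY:", dY[:T // 2].var(), dY[T // 2:].var())
# dY can now be modeled with an ARMA(p, q) model, e.g. by the least-squares fit above.
```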




Other Extensions
Further variants:

• models with seasonal components (SARIMA).

• models with side information (ARIMAX).

• models with long-memory (ARFIMA).

• multi-variate time series models (VAR).

• models with time-varying coefficients.

• other non-linear models.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 23


Modeling Variance
(Engle, 1982; Bollerslev, 1986)

Definition: the generalized autoregressive conditional
heteroscedasticity GARCH(p, q) model is an ARMA(p, q)
model for the variance $\sigma_t^2$ of the noise term $\epsilon_t$:

$$\forall t, \quad \sigma_{t+1}^2 = \omega + \sum_{i=0}^{p-1} \alpha_i \sigma_{t-i}^2 + \sum_{j=0}^{q-1} \beta_j \epsilon_{t-j}^2$$

where

• the $\epsilon_t$ are zero-mean Gaussian random variables with
variance $\sigma_t^2$ conditioned on $\{Y_{t-1}, Y_{t-2}, \ldots\}$.

• $\omega > 0$ is the mean parameter.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 24
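
A minimal simulation of a GARCH(1, 1) recursion following the reconstruction above; the parameter values and the coefficient names alpha/beta are illustrative assumptions.

```python
# Sketch: simulate sigma^2_{t+1} = omega + alpha * sigma^2_t + beta * eps_t^2.
import numpy as np

rng = np.random.default_rng(2)
T, omega, alpha, beta = 5000, 0.1, 0.85, 0.1     # alpha + beta < 1 for stability
sigma2 = np.empty(T)
eps = np.empty(T)
sigma2[0] = omega / (1.0 - alpha - beta)          # start at the stationary variance
for t in range(T - 1):
    eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))
    sigma2[t + 1] = omega + alpha * sigma2[t] + beta * eps[t] ** 2
eps[-1] = rng.normal(scale=np.sqrt(sigma2[-1]))
print("sample variance:", eps.var(), "vs. stationary variance:", omega / (1 - alpha - beta))
```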




State-Space Models
Continuous state space version of Hidden Markov Models:


$$X_{t+1} = B X_t + U_t, \qquad Y_t = A X_t + \epsilon_t$$

where

• $X_t$ is an $n$-dimensional state vector.

• $Y_t$ is an observed stochastic process.

• $A$ and $B$ are model parameters.

• $U_t$ and $\epsilon_t$ are noise terms.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 26
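
The following sketch simulates the linear-Gaussian state-space model above with an illustrative 2-dimensional state and scalar observations; the matrices A and B and the noise scales are assumptions chosen for illustration.

```python
# Sketch of X_{t+1} = B X_t + U_t, Y_t = A X_t + eps_t (values are illustrative).
import numpy as np

rng = np.random.default_rng(3)
T, n = 300, 2
B = np.array([[0.9, 0.1],
              [0.0, 0.8]])            # state transition matrix
A = np.array([[1.0, 0.5]])            # observation matrix (1 x n)
X = np.zeros((T, n))
Y = np.zeros(T)
for t in range(T - 1):
    Y[t] = (A @ X[t])[0] + 0.1 * rng.normal()        # observation noise eps_t
    X[t + 1] = B @ X[t] + 0.1 * rng.normal(size=n)   # state noise U_t
Y[-1] = (A @ X[-1])[0] + 0.1 * rng.normal()
# The state X_t is latent; only Y_t is observed. With Gaussian noise, filtering and
# prediction are typically done with the Kalman filter.
```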


State-Space Models

Theory and Algorithms for Forecasting Non-Stationary Time Series page 27


Estimation
Different methods for estimating model parameters:

• Maximum likelihood estimation:

• requires further parametric assumptions on the noise
distribution (e.g., Gaussian).

• Method of moments (Yule-Walker estimator).

• Conditional and unconditional least squares estimation:

• restricted to certain models.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 28
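
As an example of the method of moments, here is a hedged sketch of a Yule-Walker estimator for AR(p) coefficients built from sample autocovariances (not the tutorial's code; all names are illustrative).

```python
# Yule-Walker: solve R a = r with R[j, k] = gamma(|j - k|) and r_j = gamma(j).
import numpy as np

def yule_walker(Y, p):
    Y = np.asarray(Y, dtype=float) - np.mean(Y)
    T = len(Y)
    # sample autocovariances gamma(0), ..., gamma(p)
    gamma = np.array([Y[: T - j] @ Y[j:] / T for j in range(p + 1)])
    R = np.array([[gamma[abs(j - k)] for k in range(p)] for j in range(p)])
    return np.linalg.solve(R, gamma[1:])   # estimates of (a_1, ..., a_p)

# Example: recover the coefficients of a simulated AR(2) process.
rng = np.random.default_rng(4)
a_true, T = np.array([0.6, -0.3]), 5000
Y = np.zeros(T)
for t in range(2, T):
    Y[t] = a_true @ Y[t - 2:t][::-1] + rng.normal()
print(yule_walker(Y, 2))                   # approximately [0.6, -0.3]
```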


Invertibility of ARMA
Definition: an ARMA(p, q) process is invertible if the roots of
the polynomial

$$Q(z) = 1 + \sum_{j=1}^{q} b_j z^j$$

are outside the unit circle.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 29


Learning guarantee
Theorem: assume $Y_t \sim$ ARMA(p, q) is weakly stationary and
invertible. Let $\hat{a}_T$ denote the least squares estimate of
$a = (a_1, \ldots, a_p)$ and assume that $p$ is known. Then, $\|\hat{a}_T - a\|$
converges in probability to zero.

Similar results hold for other estimators and other models.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 30


Notes
Many other generative models exist.

Learning guarantees are asymptotic.

Model needs to be correctly specified.

Non-stationarity needs to be modeled explicitly.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 31


Theory
Time Series Forecasting
Training data: a finite sample realization of some stochastic
process,

$$(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}.$$

Loss function: $L \colon H \times \mathcal{Z} \to [0, \infty)$, where $H$ is a hypothesis
set of functions mapping from $\mathcal{X}$ to $\mathcal{Y}$.

Problem: find $h \in H$ with small path-dependent expected
loss,

$$\mathcal{L}(h, \mathbf{Z}_1^T) = \operatorname*{\mathbb{E}}_{Z_{T+1}} \big[ L(h, Z_{T+1}) \mid \mathbf{Z}_1^T \big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 33


Standard Assumptions
Stationarity:

$(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

Mixing:

dependence between events decaying with $k$.

[Diagram: event $B$ at time $n$, event $A$ at time $n + k$.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 34
Learning Theory
Stationary and $\beta$-mixing process: generalization bounds.

• PAC-learning preserved in that setting (Vidyasagar, 1997).

• VC-dimension bounds for binary classification (Yu, 1994).

• covering number bounds for regression (Meir, 2000).

• Rademacher complexity bounds for general loss
functions (MM and Rostamizadeh, 2009).

• PAC-Bayesian bounds (Alquier et al., 2014).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 35


Learning Theory
Stationarity and mixing: algorithm-dependent bounds.

• AdaBoost (Lozano et al., 2006).

• general stability bounds (MM and Rostamizadeh, 2010).

• regularized ERM (Steinwart and Christmann, 2009).

• stable on-line algorithms (Agarwal and Duchi, 2013).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 36


Problem
Stationarity and mixing assumptions:

• often do not hold (think trend or periodic signals).

• not testable.

• estimating mixing parameters can be hard, even if the
general functional form is known.

• hypothesis set and loss function ignored.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 37


Questions
Is learning with general (non-stationary, non-mixing)
stochastic processes possible?

Can we design algorithms with theoretical guarantees?

need a new tool for the analysis.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 38


Key Quantity - Fixed h
Key difference:

$$\mathcal{L}(h, \mathbf{Z}_1^{t-1}) - \mathcal{L}(h, \mathbf{Z}_1^T)$$

Key average quantity:

$$\frac{1}{T} \sum_{t=1}^{T} \Big[ \mathcal{L}(h, \mathbf{Z}_1^T) - \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 39


Discrepancy
Definition:

$$\Delta = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$

• captures hypothesis set and loss function.

• can be estimated from data, under mild assumptions.

• $\Delta = 0$ in the IID case or for weakly stationary processes with
linear hypotheses and squared loss (K and MM, 2014).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 40


Weighted Discrepancy
Definition: extension to weights $q = (q_1, \ldots, q_T)$:

$$\Delta(q) = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t \, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$

• strictly extends the discrepancy definition used for drifting (MM and
Muñoz Medina, 2012) or domain adaptation (Mansour, MM, and
Rostamizadeh, 2009; Cortes and MM, 2011, 2014), or for the binary loss
(Devroye et al., 1996; Ben-David et al., 2007).

• admits upper bounds in terms of relative entropy, or in
terms of $\beta$-mixing coefficients and asymptotic stationarity
for an asymptotically stationary process.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 41


Estimation
Decomposition: (q)  0 (q) + s .
✓ T
X T
X ◆
1
(q)  sup L(h, Zt1 1 ) qt L(h, Zt1 1 )
h2H s t=1
t=T s+1
✓ T
X ◆
1
+ sup L(h, ZT1 ) L(h, Zt1 1 ) .
h2H s
t=T s+1

1 T s T T +1

Theory and Algorithms for Forecasting Non-Stationary Time Series page 42


Learning Guarantee
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for
all $h \in H$ and all $\alpha > 0$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \Delta(q) + 2\alpha + \|q\|_2 \sqrt{2 \log \frac{\mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$

where $G = \{ z \mapsto L(h, z) : h \in H \}$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 43


Bound with Emp. Discrepancy
Corollary: for any $\delta > 0$, with probability at least $1 - \delta$, for
all $h \in H$ and all $\alpha > 0$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(q) + \Delta_s + 4\alpha + \big[ \|q\|_2 + \|q - u_s\|_2 \big] \sqrt{2 \log \frac{2\, \mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$

where

$$\widehat{\Delta}(q) = \sup_{h \in H} \Big( \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big),$$

$u_s$ is the uniform distribution over the last $s$ points, and
$G = \{ z \mapsto L(h, z) : h \in H \}$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 44
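
When the supremum over H is approximated by a finite set of candidate hypotheses, the empirical discrepancy in the corollary can be computed directly. The sketch below assumes a precomputed loss matrix; all names are illustrative.

```python
# Sketch: empirical weighted discrepancy over a finite grid of hypotheses.
import numpy as np

def empirical_discrepancy(losses, q, s):
    """losses: array of shape (num_hypotheses, T) with losses[h, t] = L(h, Z_t).
    q: weight vector of length T.  s: size of the most recent block."""
    recent_avg = losses[:, -s:].mean(axis=1)        # (1/s) * sum over the last s points
    weighted = losses @ q                           # sum_t q_t L(h, Z_t)
    return float(np.max(recent_avg - weighted))     # sup over the finite hypothesis set

# Example with random losses for 50 candidate hypotheses over T = 200 points:
rng = np.random.default_rng(5)
losses = rng.uniform(size=(50, 200))
T = losses.shape[1]
u = np.full(T, 1.0 / T)                             # uniform weights
print(empirical_discrepancy(losses, u, s=20))
```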


Weighted Sequential α-Cover
(Rakhlin et al., 2010; K and MM, 2015)
Definition: let $\mathbf{z}$ be a $\mathcal{Z}$-valued full binary tree of depth $T$.
Then, a set of trees $V$ is an $\ell_1$-norm $q$-weighted $\alpha$-cover of a
function class $G$ on $\mathbf{z}$ if

$$\forall g \in G,\ \forall \sigma \in \{\pm 1\}^T,\ \exists \mathbf{v} \in V : \quad \sum_{t=1}^{T} \big| v_t(\sigma) - g(z_t(\sigma)) \big| \le \frac{\alpha}{\|q\|_\infty}.$$

[Figure: a full binary tree of depth $T$ with node values $g(z_1), v_1$; $g(z_2), v_2$; $g(z_3), v_3$; \ldots; along a path $\sigma$, the $\ell_1$-distance between $(v_1, v_3, v_6, v_{12}, \ldots)$ and $(g(z_1), g(z_3), g(z_6), g(z_{12}), \ldots)$ is at most $\alpha / \|q\|_\infty$.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 45
Sequential Covering Numbers
Definitions:

• sequential covering number:

$$\mathcal{N}_1(\alpha, G, \mathbf{z}) = \min \big\{ |V| : V \text{ is an } \ell_1\text{-norm } q\text{-weighted } \alpha\text{-cover of } G \text{ on } \mathbf{z} \big\}.$$

• expected sequential covering number: $\mathbb{E}_{\mathbf{z} \sim \mathbf{Z}_T}[\mathcal{N}_1(\alpha, G, \mathbf{z})]$.

[Figure: a tree of conditional distributions, $Z_1, Z_1' \sim D_1$; $Z_2, Z_2' \sim D_2(\cdot \mid Z_1)$; $Z_3, Z_3' \sim D_2(\cdot \mid Z_1')$; $Z_4, Z_4' \sim D_3(\cdot \mid Z_1, Z_2)$; $Z_5, Z_5' \sim D_3(\cdot \mid Z_1, Z_2')$; \ldots. $\mathbf{Z}_T$: distribution over trees based on the $Z_t$'s.]

Theory and Algorithms for Forecasting Non-Stationary Time Series page 46


Proof
Key quantities:

$$\Phi(\mathbf{Z}_1^T) = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big), \qquad \Delta(q) = \sup_{h \in H} \Big( \mathcal{L}(h, \mathbf{Z}_1^T) - \sum_{t=1}^{T} q_t \, \mathcal{L}(h, \mathbf{Z}_1^{t-1}) \Big).$$

Chernoff technique: for any $t > 0$,

$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le \mathbb{P}\Big[ \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] > \epsilon \Big] && \text{(sub-add. of sup)} \\
&= \mathbb{P}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] \Big) > e^{t\epsilon} \Big] && (t > 0) \\
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] \Big) \Big]. && \text{(Markov's ineq.)}
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 47


Symmetrization
Key tool: a decoupled tangent sequence $\mathbf{Z'}_1^T$ associated to $\mathbf{Z}_1^T$:

• $Z_t$ and $Z_t'$ are i.i.d. given $\mathbf{Z}_1^{t-1}$.

$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathcal{L}(h, \mathbf{Z}_1^{t-1}) - L(h, Z_t) \big] \Big) \Big] \\
&= e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ \mathbb{E}[L(h, Z_t') \mid \mathbf{Z}_1^{t-1}] - L(h, Z_t) \big] \Big) \Big] && \text{(tangent seq.)} \\
&= e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \mathbb{E}\Big[ \sum_{t=1}^{T} q_t \big[ L(h, Z_t') - L(h, Z_t) \big] \,\Big|\, \mathbf{Z}_1^T \Big] \Big) \Big] && \text{(lin. of expectation)} \\
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_t') - L(h, Z_t) \big] \Big) \Big]. && \text{(Jensen's ineq.)}
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 48


Symmetrization
$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le e^{-t\epsilon} \, \mathbb{E}\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_t') - L(h, Z_t) \big] \Big) \Big] \\
&= e^{-t\epsilon} \, \mathbb{E}_{(\mathbf{z}, \mathbf{z}')} \mathbb{E}_{\sigma} \Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t \big[ L(h, z_t'(\sigma)) - L(h, z_t(\sigma)) \big] \Big) \Big] && \text{(tangent seq. prop.)} \\
&\le e^{-t\epsilon} \, \mathbb{E}_{(\mathbf{z}, \mathbf{z}')} \mathbb{E}_{\sigma} \Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t'(\sigma)) + t \sup_{h \in H} \sum_{t=1}^{T} (-q_t \sigma_t) L(h, z_t(\sigma)) \Big) \Big] && \text{(sub-add. of sup)} \\
&\le e^{-t\epsilon} \, \mathbb{E}_{(\mathbf{z}, \mathbf{z}')} \mathbb{E}_{\sigma} \Big[ \frac{1}{2} \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t'(\sigma)) \Big) + \frac{1}{2} \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} (-q_t \sigma_t) L(h, z_t(\sigma)) \Big) \Big] && \text{(convexity of exp)} \\
&= e^{-t\epsilon} \, \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big].
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 49


Covering Number
$$\begin{aligned}
\mathbb{P}\big[ \Phi(\mathbf{Z}_1^T) - \Delta(q) > \epsilon \big]
&\le e^{-t\epsilon} \, \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big] \\
&\le e^{-t\epsilon} \, \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \Big[ \max_{\mathbf{v} \in V} \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) + \alpha \Big] \Big) \Big] && (\alpha\text{-covering}) \\
&\le e^{-t(\epsilon - 2\alpha)} \, \mathbb{E}_{\mathbf{z}} \Big[ \sum_{\mathbf{v} \in V} \mathbb{E}_{\sigma} \Big[ \exp\Big( 2t \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) \Big) \Big] \Big] && \text{(monotonicity of exp)} \\
&\le e^{-t(\epsilon - 2\alpha)} \, \mathbb{E}_{\mathbf{z}} \Big[ \sum_{\mathbf{v} \in V} \exp\Big( \frac{t^2 \|q\|_2^2}{2} \Big) \Big] && \text{(Hoeffding's ineq.)} \\
&\le \mathbb{E}_{\mathbf{z}} \big[ \mathcal{N}_1(\alpha, G, \mathbf{z}) \big] \exp\Big( -t(\epsilon - 2\alpha) + \frac{t^2 \|q\|_2^2}{2} \Big).
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 50


Algorithms
Review
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for
all $h \in H$ and all $\alpha > 0$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(q) + \Delta_s + 4\alpha + \big[ \|q\|_2 + \|q - u_s\|_2 \big] \sqrt{2 \log \frac{2\, \mathbb{E}_{\mathbf{z}}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}}.$$

This bound can be extended to hold uniformly over $q$ at
the price of the additional term:

$$\tilde{O}\Big( \|q - u\|_1 \sqrt{\log_2 \log_2 \|q - u\|_1^{-1}} \Big).$$

Data-dependent learning guarantee.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 52


Discrepancy-Risk Minimization
Key Idea: directly optimize the upper bound on
generalization over q and h .

This problem can be solved efficiently for some L and H.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 53


Kernel-Based Regression
Squared loss function: $L(y, y') = (y - y')^2$.

Hypothesis set: for a PDS kernel $K$ with feature map $\Phi_K$,

$$H = \big\{ x \mapsto w \cdot \Phi_K(x) : \|w\|_H \le \Lambda \big\}.$$

The complexity term can be bounded by

$$O\Big( (\log^{3/2} T)\, \Lambda \sup_x K(x, x)\, \|q\|_2 \Big).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 54


Instantaneous Discrepancy
The empirical discrepancy can be further upper bounded in
terms of instantaneous discrepancies:

$$\widehat{\Delta}(q) \le \sum_{t=1}^{T} q_t d_t + M \|q - u\|_1$$

where $M = \sup_{y, y'} L(y, y')$ and

$$d_t = \sup_{h \in H} \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 55


Proof
By sub-additivity of the supremum,

$$\begin{aligned}
\widehat{\Delta}(q) &= \sup_{h \in H} \Big\{ \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big\} \\
&= \sup_{h \in H} \Big\{ \sum_{t=1}^{T} q_t \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big) + \sum_{t=1}^{T} \Big( \frac{1}{T} - q_t \Big) \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) \Big\} \\
&\le \sum_{t=1}^{T} q_t \sup_{h \in H} \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big) + M \|u - q\|_1.
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 56


Computing Discrepancies
Instantaneous discrepancy for kernel-based hypotheses
with squared loss:

$$d_t = \sup_{\|w'\| \le \Lambda} \Big( \sum_{s=1}^{T} u_s \big( w' \cdot \Phi_K(x_s) - y_s \big)^2 - \big( w' \cdot \Phi_K(x_t) - y_t \big)^2 \Big).$$

This is a difference of convex (DC) functions.

Global optimum via DC programming (Tao and An, 1998).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 57
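
The DC-programming solver referenced above is the principled approach; for intuition only, the following crude random-search sketch approximates d_t for linear hypotheses under stated assumptions (dot-product kernel, candidates sampled in the ball of radius Lambda).

```python
# Not the DC-programming procedure cited above; a rough Monte Carlo approximation.
import numpy as np

def instantaneous_discrepancy(X, y, u, t, Lam=1.0, n_samples=2000, seed=0):
    """X: (T, d) inputs; y: (T,) targets; u: (T,) weights (uniform on the last s
    points, zero elsewhere); t: index of the point whose discrepancy d_t is estimated."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_samples, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    W *= Lam * rng.uniform(size=(n_samples, 1))          # random candidates in the ball ||w'|| <= Lam
    preds = W @ X.T                                      # (n_samples, T) predictions w'.x_s
    weighted_err = ((preds - y) ** 2) @ u                # sum_s u_s (w'.x_s - y_s)^2
    point_err = (preds[:, t] - y[t]) ** 2                # (w'.x_t - y_t)^2
    return float(np.max(weighted_err - point_err))       # approximate supremum

# Usage sketch:
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3)); y = X[:, 0] + 0.1 * rng.normal(size=100)
u = np.zeros(100); u[-20:] = 1 / 20                      # uniform over the last s = 20 points
print(instantaneous_discrepancy(X, y, u, t=0))
```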


Discrepancy-Based Forecasting
Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, for
all kernel-based hypotheses $h \in H$ and all $q$ with $0 < \|q - u\|_1 \le 1$,

$$\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \widehat{\Delta}(q) + \Delta_s + \tilde{O}\Big( \log^{3/2} T \sup_x K(x, x)\, \Lambda + \|q - u\|_1 \Big).$$

Corresponding optimization problem:

$$\min_{q \in [0,1]^T,\, w} \Big\{ \sum_{t=1}^{T} q_t \big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda_1 \sum_{t=1}^{T} q_t d_t + \lambda_2 \|w\|_H + \lambda_3 \|q - u\|_1 \Big\}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 58




Convex Problem
Change of variable: $r_t = 1/q_t$.

Upper bound: $|r_t^{-1} - 1/T| \le T^{-1} |r_t - T|$ for $r_t \ge 1$.

$$\min_{r \in D,\, w} \Big\{ \sum_{t=1}^{T} \frac{\big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda_1 d_t}{r_t} + \lambda_2 \|w\|_H + \lambda_3 \sum_{t=1}^{T} |r_t - T| \Big\}$$

• where $D = \{ r : r_t \ge 1 \}$.

• convex optimization problem.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 63


Two-Stage Algorithm
Minimize the empirical discrepancy $\widehat{\Delta}(q)$ over $q$ (convex
optimization).

Solve the (weighted) kernel ridge regression problem:

$$\min_{w} \Big\{ \sum_{t=1}^{T} q_t^* \big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda \|w\|_H \Big\}$$

where $q^*$ is the solution of the discrepancy minimization
problem.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 64
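
A minimal sketch of the second stage, assuming a Gaussian kernel and weights q* already produced by the discrepancy-minimization step; setting the gradient of the dual objective to zero gives the linear system solved below. All names and parameter values are illustrative.

```python
# Weighted kernel ridge regression: minimize sum_t q_t (f(x_t) - y_t)^2 + lam ||f||_H^2.
# In the dual, (Q K + lam I) alpha = Q y with Q = diag(q), f(x) = sum_t alpha_t K(x_t, x).
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=1e-2, gamma=1.0):
    K = gaussian_kernel(X, X, gamma)
    Q = np.diag(q)
    alpha = np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)
    return alpha

def weighted_krr_predict(alpha, X_train, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Usage sketch: q_star would come from minimizing the empirical discrepancy over q.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)); y = X[:, 0] + 0.1 * rng.normal(size=100)
q_star = np.full(100, 1.0 / 100)                     # placeholder for the learned weights
alpha = weighted_krr_fit(X, y, q_star)
print(weighted_krr_predict(alpha, X, X[:5]))
```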


Preliminary Experiments
Artificial data sets:

Theory and Algorithms for Forecasting Non-Stationary Time Series page 65


True vs. Empirical Discrepancies

[Figure panels: Discrepancies; Weights.]
Theory and Algorithms for Forecasting Non-Stationary Time Series page 66
Running MSE

Theory and Algorithms for Forecasting Non-Stationary Time Series page 67


Real-world Data
Commodity prices, exchange rates, temperatures &
climate.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 68


Time Series Prediction &
On-line Learning
Two Learning Scenarios
Stochastic scenario:

• distributional assumption.

• performance measure: expected loss.

• guarantees: generalization bounds.

On-line scenario:

• no distributional assumption.

• performance measure: regret.

• guarantees: regret bounds.

• active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al.


2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth,
1998, 2001; Koolen et al., 2015).
Theory and Algorithms for Forecasting Non-Stationary Time Series page 70
On-Line Learning Setup
Adversarial setting with hypothesis/action set H.

For t = 1 to T do

• player receives $x_t \in \mathcal{X}$.

• player selects $h_t \in H$.

• adversary selects $y_t \in \mathcal{Y}$.

• player incurs loss $L(h_t(x_t), y_t)$.

Objective: minimize the (external) regret

$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in H^*} \sum_{t=1}^{T} L(h(x_t), y_t).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 71


Example: Exp. Weights (EW)
Expert set $H^* = \{E_1, \ldots, E_N\}$, $H = \mathrm{conv}(H^*)$.

EW($\{E_1, \ldots, E_N\}$)
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      h_t ← (Σ_{i=1}^N w_{t,i} E_i) / (Σ_{i=1}^N w_{t,i})
 6      Receive(y_t)
 7      Incur-Loss(L(h_t(x_t), y_t))
 8      for i ← 1 to N do
 9          w_{t+1,i} ← w_{t,i} e^{-η L(E_i(x_t), y_t)}     (parameter η > 0)
10  return h_T

Theory and Algorithms for Forecasting Non-Stationary Time Series page 72
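
A direct Python transcription of the EW pseudocode above (a sketch: the experts are arbitrary callables, and the loss is assumed convex in its first argument and bounded in [0, 1]).

```python
import numpy as np

def exponential_weights(experts, xs, ys, loss, eta):
    """Run EW for T rounds; returns the per-round losses of the aggregated predictor."""
    N, T = len(experts), len(xs)
    w = np.ones(N)                                   # w_{1,i} = 1
    player_losses = []
    for t in range(T):
        p = w / w.sum()
        h_t = lambda x: sum(p_i * E(x) for p_i, E in zip(p, experts))   # weighted average
        player_losses.append(loss(h_t(xs[t]), ys[t]))
        expert_losses = np.array([loss(E(xs[t]), ys[t]) for E in experts])
        w = w * np.exp(-eta * expert_losses)         # w_{t+1,i} = w_{t,i} e^{-eta L(E_i(x_t), y_t)}
    return np.array(player_losses)

# Example: two constant experts, squared loss clipped to [0, 1], eta = sqrt(8 log N / T).
T = 500
xs = np.zeros(T); ys = 0.7 * np.ones(T)
experts = [lambda x: 0.0, lambda x: 1.0]
loss = lambda p, y: min((p - y) ** 2, 1.0)
eta = np.sqrt(8 * np.log(len(experts)) / T)
print(exponential_weights(experts, xs, ys, loss, eta).sum())
```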


EW Guarantee
Theorem: assume that $L$ is convex in its first argument and
takes values in $[0, 1]$. Then, for any $\eta > 0$ and any
sequence $y_1, \ldots, y_T \in \mathcal{Y}$, the regret of EW at time $T$
satisfies

$$\mathrm{Reg}_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$

For $\eta = \sqrt{8 \log N / T}$,

$$\mathrm{Reg}_T \le \sqrt{(T/2) \log N}, \qquad \frac{\mathrm{Reg}_T}{T} = O\Big( \sqrt{\frac{\log N}{T}} \Big).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 73


EW - Proof
Potential: $\Phi_t = \log \sum_{i=1}^{N} w_{t,i}$.

Upper bound:

$$\begin{aligned}
\Phi_t - \Phi_{t-1} &= \log \frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^{N} w_{t-1,i}} \\
&= \log \operatorname*{\mathbb{E}}_{w_{t-1}} \big[ e^{-\eta L(E_i(x_t), y_t)} \big] \\
&= \log \Big( \operatorname*{\mathbb{E}}_{w_{t-1}} \Big[ \exp\Big( -\eta \big( L(E_i(x_t), y_t) - \operatorname*{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] \big) - \eta \operatorname*{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] \Big) \Big] \Big) \\
&\le -\eta \operatorname*{\mathbb{E}}_{w_{t-1}}[L(E_i(x_t), y_t)] + \frac{\eta^2}{8} && \text{(Hoeffding's ineq.)} \\
&\le -\eta L\Big( \operatorname*{\mathbb{E}}_{w_{t-1}}[E_i(x_t)], y_t \Big) + \frac{\eta^2}{8} && \text{(convexity of first arg. of } L) \\
&= -\eta L(h_t(x_t), y_t) + \frac{\eta^2}{8}.
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 74


EW - Proof
Upper bound: summing up these inequalities yields

$$\Phi_T - \Phi_0 \le -\eta \sum_{t=1}^{T} L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.$$

Lower bound:

$$\begin{aligned}
\Phi_T - \Phi_0 &= \log \sum_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N \\
&\ge \log \max_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N \\
&= -\eta \min_{i=1}^{N} \sum_{t=1}^{T} L(E_i(x_t), y_t) - \log N.
\end{aligned}$$

Comparison:

$$\sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{i=1}^{N} \sum_{t=1}^{T} L(E_i(x_t), y_t) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 75
Questions
Can we exploit both batch and on-line to

• design flexible algorithms for time series prediction with


stochastic guarantees?

• tackle notoriously difficult time series problems, e.g.,
model selection and learning ensembles?

Theory and Algorithms for Forecasting Non-Stationary Time Series page 76


Model Selection
Problem: given $N$ time series models, how should we use
the sample $\mathbf{Z}_1^T$ to select a single best model?

• in the i.i.d. case, cross-validation can be shown to be close to
the structural risk minimization solution.

• but, how do we select a validation set for general


stochastic processes?

• use most recent data?

• use the most distant data?

• use various splits?

• models may have been pre-trained on $\mathbf{Z}_1^T$.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 77


Learning Ensembles
Problem: given a hypothesis set $H$ and a sample $\mathbf{Z}_1^T$, find
an accurate convex combination $h = \sum_{t=1}^{T} q_t h_t$ with $h_t \in H$
and $q \in \Delta$.

• in the most general case, hypotheses may have been pre-
trained on $\mathbf{Z}_1^T$.

→ on-line-to-batch conversion for general non-stationary
non-mixing processes.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 78


On-Line-to-Batch (OTB)
Input: a sequence of hypotheses $\mathbf{h} = (h_1, \ldots, h_T)$ returned
after $T$ rounds by an on-line algorithm $\mathcal{A}$ minimizing the
general regret

$$\mathrm{Reg}_T = \sum_{t=1}^{T} L(h_t, Z_t) - \inf_{h^* \in H^*} \sum_{t=1}^{T} L(h^*, Z_t).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 79


On-Line-to-Batch (OTB)
Problem: use $\mathbf{h} = (h_1, \ldots, h_T)$ to derive a hypothesis $h \in H$
with small path-dependent expected loss,

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) = \operatorname*{\mathbb{E}}_{Z_{T+1}} \big[ L(h, Z_{T+1}) \mid \mathbf{Z}_1^T \big].$$

• the i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al.,
2004).

• but how do we design solutions for the general time-
series scenario?

Theory and Algorithms for Forecasting Non-Stationary Time Series page 80
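
A minimal on-line-to-batch sketch under the stated assumptions: a toy on-line learner (on-line gradient descent on the squared loss, an assumption, not the tutorial's algorithm) produces hypotheses h_1, ..., h_T, and the output is the q-weighted convex combination, which for linear hypotheses amounts to averaging weight vectors.

```python
import numpy as np

def online_gradient_descent(X, y, lr=0.1):
    """Toy on-line learner for the squared loss; returns the iterates w_1, ..., w_T."""
    w = np.zeros(X.shape[1])
    iterates = []
    for x_t, y_t in zip(X, y):
        iterates.append(w.copy())
        w = w - lr * 2 * (w @ x_t - y_t) * x_t       # gradient step on (w.x_t - y_t)^2
    return np.array(iterates)

def online_to_batch(iterates, q):
    """Convex combination h = sum_t q_t h_t; for linear h_t this is a weighted average."""
    return q @ iterates

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3)); y = X @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=200)
iterates = online_gradient_descent(X, y)
q = np.full(len(y), 1.0 / len(y))                    # uniform q; other weightings are possible
w_batch = online_to_batch(iterates, q)
print(w_batch)
```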


Questions
Is OTB with general (non-stationary, non-mixing) stochastic
processes possible?

Can we design algorithms with theoretical guarantees?

need a new tool for the analysis.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 81


Relevant Quantity
Key difference:

$$\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T)$$

Average difference:

$$\frac{1}{T} \sum_{t=1}^{T} \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 82


On-line Discrepancy
Definition:

$$\mathrm{disc}(q) = \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].$$

• $H_{\mathcal{A}}$: set of hypothesis sequences that $\mathcal{A}$ can return.

• $q = (q_1, \ldots, q_T)$: arbitrary weight vector.

• natural measure of non-stationarity or dependency.

• captures hypothesis set and loss function.

• can be efficiently estimated under mild assumptions.

• generalization of the definition of (Kuznetsov and MM, 2015).

Theory and Algorithms for Forecasting Non-Stationary Time Series page 83


Discrepancy Estimation
Batch discrepancy estimation method.

Alternative method:

• assume that the loss is $\mu$-Lipschitz.

• assume that there exists an accurate hypothesis $h^*$:

$$\eta = \inf_{h^*} \operatorname*{\mathbb{E}} \big[ L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T \big] \ll 1.$$
Theory and Algorithms for Forecasting Non-Stationary Time Series page 84


Discrepancy Estimation
Lemma: fix a sequence $\mathbf{Z}_1^T$ in $\mathcal{Z}$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$, the following holds for all $\alpha > 0$:

$$\mathrm{disc}(q) \le \widehat{\mathrm{disc}}_H(q) + \mu \eta + 2\alpha + M \|q\|_2 \sqrt{2 \log \frac{\mathbb{E}[\mathcal{N}_1(\alpha, G, \mathbf{z})]}{\delta}},$$

where

$$\widehat{\mathrm{disc}}_H(q) = \sup_{h \in H,\, \mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ L\big( h_t(X_{T+1}), h(X_{T+1}) \big) - L\big( h_t, Z_t \big) \Big].$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 85


Proof Sketch
$$\begin{aligned}
\mathrm{disc}(q) &= \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \\
&\le \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] \Big] \\
&\quad + \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \\
&\le \mu \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \,\mathbb{E}\big[ L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T \big] \\
&\quad + \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \\
&= \mu\, \mathbb{E}\big[ L(h^*(X_{T+1}), Y_{T+1}) \mid \mathbf{Z}_1^T \big] \\
&\quad + \sup_{\mathbf{h} \in H_{\mathcal{A}}} \sum_{t=1}^{T} q_t \Big[ \mathbb{E}\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid \mathbf{Z}_1^T \big] - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big].
\end{aligned}$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 86


Learning Guarantee
Lemma: let $L$ be a convex loss bounded by $M$ and $\mathbf{h}_1^T$ a
hypothesis sequence adapted to $\mathbf{Z}_1^T$. Fix $q \in \Delta$. Then, for
any $\delta > 0$, the following holds with probability at least $1 - \delta$
for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + \mathrm{disc}(q) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 87


Proof
By definition of the on-line discrepancy,

$$\sum_{t=1}^{T} q_t \Big[ \mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T) - \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \Big] \le \mathrm{disc}(q).$$

$A_t = q_t \big[ \mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) - L(h_t, Z_t) \big]$ is a martingale difference,
thus by Azuma's inequality, with high probability,

$$\sum_{t=1}^{T} q_t \,\mathcal{L}_t(h_t, \mathbf{Z}_1^{t-1}) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.$$

By convexity of the loss:

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t \,\mathcal{L}_{T+1}(h_t, \mathbf{Z}_1^T).$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 88


Learning Guarantee
Theorem: let $L$ be a convex loss bounded by $M$ and $H^*$ a set
of hypothesis sequences adapted to $\mathbf{Z}_1^T$. Fix $q \in \Delta$. Then, for
any $\delta > 0$, the following holds with probability at least $1 - \delta$
for the hypothesis $h = \sum_{t=1}^{T} q_t h_t$:

$$\mathcal{L}_{T+1}(h, \mathbf{Z}_1^T) \le \inf_{\mathbf{h}^* \in H^*} \sum_{t=1}^{T} q_t \,\mathcal{L}_{T+1}(h_t^*, \mathbf{Z}_1^T) + 2\,\mathrm{disc}(q) + \frac{\mathrm{Reg}_T}{T} + M \|q - u\|_1 + 2 M \|q\|_2 \sqrt{2 \log \frac{2}{\delta}}.$$

Theory and Algorithms for Forecasting Non-Stationary Time Series page 89


Conclusion
Time series forecasting:

• key learning problem in many important tasks.

• very challenging: theory, algorithms, applications.

• new and general data-dependent learning guarantees


for non-mixing non-stationary processes.

• algorithms with guarantees.

Time series prediction and on-line learning:

• proof for flexible solutions derived via OTB.

• application to model selection.

• application to learning ensembles.


Theory and Algorithms for Forecasting Non-Stationary Time Series page 90
Time Series Workshop

Join us in Room 117


Friday, December 9th.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 91


References
T. M. Adams and A. B. Nobel. Uniform convergence of
Vapnik-Chervonenkis classes under ergodic sampling. The
Annals of Probability, 38(4):1345–1367, 2010.

A. Agarwal, J. Duchi. The generalization ability of online


algorithms for dependent data. Information Theory, IEEE
Transactions on, 59(1):573–587, 2013.

P. Alquier, X. Li, O. Wintenberger. Prediction of time series


by statistical learning: general losses and fast rates.
Dependence Modelling, 1:65–93, 2014.

Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir.


Online learning for time series prediction. COLT, 2013.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 92
References
O. Anava, E. Hazan, and A. Zeevi. Online time series
prediction with missing data. ICML, 2015.

O. Anava and S. Mannor. Online Learning for


heteroscedastic sequences. ICML, 2016.

P. L. Bartlett. Learning with a slowly changing distribution.


COLT, 1992.

R. D. Barve and P. M. Long. On the complexity of learning


from drifting distributions. Information and Computation,
138(2):101–123, 1997.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 93


References
P. Berti and P. Rigo. A Glivenko-Cantelli theorem for
exchangeable random variables. Statistics & Probability
Letters, 32(4):385 – 391, 1997.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis


of representations for domain adaptation. In NIPS. 2007.

T. Bollerslev. Generalized autoregressive conditional


heteroskedasticity. J Econometrics, 1986.

G. E. P. Box, G. Jenkins. (1990) . Time Series Analysis,


Forecasting and Control.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 94


References
O. Bousquet and M. K. Warmuth. Tracking a small set of
experts by mixing past posteriors. COLT, 2001.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the


generalization ability of on-line learning algorithms. IEEE
Trans. on Inf. Theory , 50(9), 2004.

N. Cesa-Bianchi and C. Gentile. Tracking the best


hyperplane with a simple budget perceptron. COLT, 2006.

C. Cortes and M. Mohri. Domain adaptation and sample


bias correction theory and algorithm for regression.
Theoretical Computer Science, 519, 2014.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 95


References
V. H. De la Pena and E. Gine. (1999) Decoupling: from
dependence to independence: randomly stopped processes, U-
statistics and processes, martingales and beyond. Probability
and its applications. Springer, NY.

P. Doukhan. (1994) Mixing: properties and examples. Lecture


notes in statistics. Springer-Verlag, New York.

R. Engle. Autoregressive conditional heteroscedasticity with


estimates of the variance of United Kingdom inflation.
Econometrica, 50(4):987–1007, 1982.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 96


References
D. P. Helmbold and P. M. Long. Tracking drifting concepts by
minimizing disagreements. Machine Learning, 14(1): 27-46,
1994.

M. Herbster and M. K. Warmuth. Tracking the best expert.


Machine Learning, 32(2), 1998.

M. Herbster and M. K. Warmuth. Tracking the best linear


predictor. JMLR, 2001.

D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time


estimation in reversible Markov chains from a single
sample path. NIPS, 2015.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 97


References
W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi.
Minimax time series prediction. NIPS, 2015.

V. Kuznetsov, M. Mohri. Generalization bounds for time


series prediction with non-stationary processes. In ALT,
2014.

V. Kuznetsov, M. Mohri. Learning theory and algorithms for


forecasting non-stationary time series. In NIPS, 2015.

V. Kuznetsov, M. Mohri. Time series prediction and on-line


learning. In COLT, 2016.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 98


References
A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence
and consistency of regularized boosting algorithms with
stationary β-mixing observations. In NIPS, pages 819–826,
2006.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain


adaptation: Learning bounds and algorithms. In COLT.
2009.

R. Meir. Nonparametric time series prediction through


adaptive model selection. Machine Learning, pages 5–34,
2000.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 99


References
D. Modha, E. Masry. Memory-universal prediction of
stationary random processes. Information Theory, IEEE
Transactions on, 44(1):117–133, Jan 1998.

M. Mohri, A. Munoz Medina. New analysis and algorithm


for learning with drifting distributions. In ALT, 2012.

M. Mohri, A. Rostamizadeh. Rademacher complexity


bounds for non-i.i.d. processes. In NIPS, 2009.

M. Mohri, A. Rostamizadeh. Stability bounds for stationary


φ-mixing and β-mixing processes. Journal of Machine
Learning Research, 11:789–814, 2010.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 100
References
V. Pestov. Predictive PAC learnability: A paradigm for
learning from exchangeable input data. GRC, 2010.

A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random


averages, combinatorial parameters, and learnability. In
NIPS, 2010.

L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC-


Bayes bounds for non-iid data: Applications to ranking and
stationary beta-mixing processes. JMLR 11:1927–1956,
2010.

C. Shalizi and A. Kontorovitch. Predictive PAC-learning and


process decompositions. NIPS, 2013.
Theory and Algorithms for Forecasting Non-Stationary Time Series page 101
References

I. Steinwart, A. Christmann. Fast learning from non-i.i.d.


observations. NIPS, 2009.

M. Vidyasagar. (1997). A Theory of Learning and


Generalization: With Applications to Neural Networks and
Control Systems. Springer-Verlag New York, Inc.

V. Vovk. Competing with stationary prediction strategies.


COLT 2007.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 102
References
B. Yu. Rates of convergence for empirical processes of
stationary mixing sequences. The Annals of Probability,
22(1):94–116, 1994.

Theory and Algorithms for Forecasting Non-Stationary Time Series page 103
𝜷-Mixing
Definition: a sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{+\infty}$ is
$\beta$-mixing if

$$\beta(k) = \sup_n \operatorname*{\mathbb{E}}_{B \in \sigma_{-\infty}^{n}} \Big[ \sup_{A \in \sigma_{n+k}^{+\infty}} \big| \mathbb{P}[A \mid B] - \mathbb{P}[A] \big| \Big] \to 0,$$

where $\sigma_a^b$ denotes the $\sigma$-algebra generated by $Z_a, \ldots, Z_b$:
dependence between events decaying with $k$.

[Diagram: event $B$ at time $n$, event $A$ at time $n + k$.]

Theory and Algorithms for Forecasting Non-Stationary Time Series page 104
