
Performance Bounds for Nonlinear Time Series Prediction

Ron Meir*
Department of Electrical Engineering
Technion
Haifa 32000, Israel
[email protected]

Abstract

We consider the problem of time series prediction within the uniform convergence framework pioneered by Vapnik and Chervonenkis. In order to incorporate the dependence inherent in the temporal structure, recent results from the theory of empirical processes are utilized whereby, for certain classes of mixing processes, dependent sequences are mapped into independent ones by an appropriate blocking scheme. Finite sample bounds are calculated in terms of covering numbers of the approximating class, and the trade-off between approximation and estimation is discussed. Finally, we sketch how Vapnik's theory of structural risk minimization (aka complexity regularization) may be applied in the context of mixing stochastic processes. A comparison of the method with other recent approaches to nonparametric time series prediction is also discussed.

1 INTRODUCTION

A great deal of effort has been expended in recent years on the problem of deriving robust distribution-free error bounds for learning. Following the seminal work of Vapnik and Chervonenkis [13], particular emphasis has been attached to problems of pattern recognition and regression analysis, where the data are assumed to be independent and identically distributed. Combining abstract results from the theory of empirical processes with flexible classes of function approximators, such as neural networks, radial basis functions and wavelet networks, has given rise to practically applicable learning algorithms with guaranteed (although usually overly pessimistic) performance bounds.

Concomitantly with the above developments, a great deal of work has been devoted to (mainly linear) parametric models, where powerful algorithms and exact statistical results have been derived (for a general survey of this field see, for example, the book by Brockwell and Davis [3] for linear models and White [15] in a more general setting). These results, however, have usually been obtained only by making rather stringent assumptions about the nature of the process being modeled. To circumvent some of these problems, many recent workers have turned their attention to nonparametric models of time series, where very weak assumptions about the underlying process are needed [9]. One of the appealing features of this type of approach is its generality and robustness. Moreover, under certain conditions it can be shown that the approach is asymptotically minimax [2], implying that it cannot be improved upon in a worst-case setting. While these results are very encouraging, they do not address the issue of finite sample behavior, which is the main focus of this work. Moreover, one believes that for cases where some prior knowledge about the process is available, an approach which is able to take this knowledge into account in a flexible manner will be more efficient in terms of finite sample performance. An approach similar to the one proposed in this work has been recently considered by Modha and Masry [11]. However, the approach taken by the above authors is somewhat different, being motivated by the complexity regularization framework pioneered by Barron [1] and by extensions of the uniform laws of large numbers to dependent processes. In this work we take a different approach, whereby the process is first mapped onto an independent one, after which standard results from the theory of empirical processes are used.

*This work was supported in part by a grant from the Israel Science Foundation.

2 THE PROBLEM OF TIME SERIES PREDICTION
Consider a stationary stochastic process $\bar Y = \{\dots, Y_{-1}, Y_0, Y_1, \dots\}$, where $Y_i$ is a random variable defined over a compact domain in $\mathbb{R}$ and such that $|Y_i| \le M$ with probability 1, for some positive constant $M$. The problem of one-step prediction, in the mean square sense, can then be phrased as that of computing a function $f(\cdot)$ of the infinite past such that $E\{(Y_i - f(Y_{-\infty}^{i-1}))^2\}$ is minimal, where we use the notation $Y_i^j = (Y_i, Y_{i+1}, \dots, Y_j)$, $j \ge i$. It is well known that the optimal predictor in this case is given by the conditional mean, $E[Y_i | Y_{-\infty}^{i-1}]$. While this solution, in principle, settles the issue of optimal prediction, it does not settle the issue of actually computing the optimal predictor. First of all, note that to compute the conditional mean, the probabilistic law generating the stochastic process $\bar Y$ must be known. Furthermore, the requirement of knowing the full past, $Y_{-\infty}^{i-1}$, is of course rather stringent. In this work we consider the more practical situation, where a finite sub-sequence $Y_1^N = (Y_1, Y_2, \dots, Y_N)$ is observed, and an optimal prediction is needed, conditioned on this data. Moreover, we allow the predictors to be based only on a finite lag vector of size $d$. Ultimately, in order to achieve full generality one may let $d \to \infty$ in order to obtain the optimal predictor, although we do not pursue this direction here.

We consider the problem of selecting an empirical estimator from a class of functions $\mathcal{H}_n : \mathbb{R}^d \to \mathbb{R}$, where $n$ is a complexity index of the class (for example, the number of computational nodes in a feedforward neural network with a single hidden layer), and $|h| \le M$ for $h \in \mathcal{H}_n$. Consider then an empirical predictor $h_{n,N}(Y_{i-d}^{i-1})$, $i > N$, for $Y_i$, based on the finite data vector $Y_1^N$ and depending on the $d$-dimensional lag vector $Y_{i-d}^{i-1}$, where $h_{n,N} \in \mathcal{H}_n$. It is possible to split the error incurred by this predictor into four terms, each possessing a rather intuitive meaning. It is the competition between these terms which determines the optimal solution, for a fixed amount of data. First, define the loss of a hypothesis $h : \mathbb{R}^d \to \mathbb{R}$ as

$$L(h) = E\left\{\left(Y_i - h(Y_{i-d}^{i-1})\right)^2\right\}. \qquad (1)$$

Let $h_n^*$ be the optimal hypothesis minimizing the loss (1). We are then able to split the loss of the empirical predictor $h_{n,N}$ into four basic components,

$$L(h_{n,N}) = L^{(a)} + L_d^{(b)} + L_{n,d}^{(c)} + L_{n,d,N}^{(d)}, \qquad (2)$$

where the four terms are defined as follows:

$$L^{(a)} = E\left\{\left(Y_i - E[Y_i|Y_{-\infty}^{i-1}]\right)^2\right\},$$
$$L_d^{(b)} = E\left\{\left(E[Y_i|Y_{-\infty}^{i-1}] - E[Y_i|Y_{i-d}^{i-1}]\right)^2\right\},$$
$$L_{n,d}^{(c)} = E\left\{\left(E[Y_i|Y_{i-d}^{i-1}] - h_n^*(Y_{i-d}^{i-1})\right)^2\right\},$$
$$L_{n,d,N}^{(d)} = E\left\{\left(E[Y_i|Y_{i-d}^{i-1}] - h_{n,N}(Y_{i-d}^{i-1})\right)^2\right\} - E\left\{\left(E[Y_i|Y_{i-d}^{i-1}] - h_n^*(Y_{i-d}^{i-1})\right)^2\right\}. \qquad (3)$$

The first term in (3) represents the inherent unpredictability of the process (often referred to as the innovation in the time-series literature) and is thus unrelated to the modeling and prediction process; it will be disregarded in the sequel. The second term, $L_d^{(b)}$, is referred to as the (dynamic) misspecification error, and is related to the error incurred in using a finite memory model (of memory size $d$) to predict a process with potentially infinite memory. This is a purely deterministic quantity, for which we do not have at this point any useful upper bounds for the type of processes considered in this paper. The final two terms in (3) will be our main concern here. The third term, namely $L_{n,d}^{(c)}$, represents the so-called approximation error, namely the minimal error incurred by selecting the optimal hypothesis in $\mathcal{H}_n$. The fact that this error is non-zero results from the restricted complexity of the class $\mathcal{H}_n$, and the fact that the conditional expectation $E[Y_i|Y_{i-d}^{i-1}]$ may in principle be an arbitrarily complicated function. Of course, in order to bound this term we will have to make some regularity assumptions about this function. Finally, the last term in (3) represents the so-called estimation error, and is the only term which depends on the data $Y_1^N$. Similarly to the problem of regression for i.i.d data, we expect that the approximation and estimation terms lead to conflicting demands on the choice of the complexity, $n$, of the functional class $\mathcal{H}_n$. Clearly, to minimize the approximation error the complexity should be made as large as possible. However, doing this will cause the estimation error to increase, because of the larger freedom in choosing a specific function in $\mathcal{H}_n$ to fit the data.

Up to this point we have not specified how to select the empirical estimator $h_{n,N}$. In this work we follow the ideas of Vapnik and Chervonenkis [13], which have been studied extensively in the context of i.i.d observations, and restrict our selection to that hypothesis which minimizes the empirical error, given by

$$L_N(h) = \frac{1}{N-d} \sum_{i=d+1}^{N} \left(Y_i - h(Y_{i-d}^{i-1})\right)^2.$$

Thus, $h_{n,N} = \arg\min_{h \in \mathcal{H}_n} L_N(h)$. For this function it is easy to establish the following result (see for example Lemma 8.2 in [6]), the proof of which does not depend on the independence property.

Lemma 1 Let $h_{n,N}$ be the minimizer of the empirical error. Then $L_{n,d,N}^{(d)} \le 2 \sup_{h \in \mathcal{H}_n} |L(h) - L_N(h)|$.

Lemma 1 is the basic result which permits the transformation of the problem of upper bounding the estimation error to one of dealing with uniform laws of large numbers. The main distinction from the i.i.d case, of course, is that the random variables appearing in the empirical error, $L_N(h)$, are no longer independent. It is clear at this point that some assumptions are needed regarding the stochastic process $\bar Y$, in order that a law of large numbers may be established. In any event, it is obvious that the standard approach of using randomization and symmetrization as in the i.i.d case [12][7] will not work here. To circumvent this problem, two approaches have been proposed. The first makes use of the so-called method of sieves together with extensions of the Bernstein inequality to dependent data [16][11]. The second approach, to be pursued here, is based on mapping the problem onto one characterized by an i.i.d process [17], and the utilization of the standard results for the latter case.
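To make the empirical error minimization above concrete, the following is a minimal sketch (an illustration, not code from the paper) of selecting $h_{n,N}$ over a small finite hypothesis class. The toy series, the lag size $d$, and the enumerated candidate predictors are assumptions introduced here for illustration; in practice $\mathcal{H}_n$ would be a parametric family (e.g. a single-hidden-layer network) fitted by numerical optimization rather than enumerated.

```python
import numpy as np

def lag_matrix(y, d):
    """Return (X, t) where X[k] = (y[k], ..., y[k+d-1]) and t[k] = y[k+d]."""
    X = np.array([y[k:k + d] for k in range(len(y) - d)])
    t = y[d:]
    return X, t

def empirical_error(h, y, d):
    """L_N(h) = (N-d)^{-1} sum_{i=d+1}^N (Y_i - h(Y_{i-d}^{i-1}))^2."""
    X, t = lag_matrix(y, d)
    preds = np.array([h(x) for x in X])
    return np.mean((t - preds) ** 2)

def erm_predictor(hypotheses, y, d):
    """Pick the hypothesis minimizing the empirical error (the ERM rule)."""
    errors = [empirical_error(h, y, d) for h in hypotheses]
    return hypotheses[int(np.argmin(errors))], min(errors)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A toy bounded nonlinear AR(2) process standing in for the mixing process Y.
    N, d = 500, 2
    y = np.zeros(N)
    for i in range(2, N):
        y[i] = np.tanh(0.6 * y[i - 1] - 0.3 * y[i - 2]) + 0.1 * rng.standard_normal()
    # A small illustrative hypothesis class of bounded lag-d predictors.
    hypotheses = [
        lambda x: 0.0,                                  # constant predictor
        lambda x: np.tanh(0.6 * x[-1] - 0.3 * x[-2]),   # "well-specified" model
        lambda x: 0.5 * x[-1],                          # linear AR(1) model
    ]
    h_best, err = erm_predictor(hypotheses, y, d)
    print("minimal empirical error L_N:", err)
```

The point of Lemma 1 is precisely that the quality of the hypothesis returned by such a procedure is controlled by the uniform deviation between $L_N(h)$ and $L(h)$ over the class, which is what the remainder of the paper bounds for dependent data.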
3 MIXING PROCESSES
Moving from i.i.d observations to general stochastic processes greatly enriches the class of problems one may address. In order to have some control of the estimation error discussed in the previous section, we will restrict ourselves in this work to the class of so-called mixing processes. These are processes for which the 'future' depends only weakly on the 'past', in a sense that will now be made precise. Following the definitions and notation of [17], which will be utilized in the sequel, let $\sigma_l = \sigma(Y_1^l)$ and $\sigma'_{l+m} = \sigma(Y_{l+m}^{\infty})$ be the sigma-algebras of events generated by the random variables $Y_1^l = (Y_1, Y_2, \dots, Y_l)$ and $Y_{l+m}^{\infty} = (Y_{l+m}, Y_{l+m+1}, \dots)$, respectively. We then define $\beta_m$, the coefficient of absolute regularity [8], as

$$\beta_m = \sup_l E \sup\left\{ |P(B|\sigma_l) - P(B)| : B \in \sigma'_{l+m} \right\}, \qquad (4)$$

where the expectation is taken with respect to $\sigma_l = \sigma(Y_1^l)$. A stochastic process is said to be $\beta$-mixing if $\beta_m \to 0$ as $m \to \infty$. We note that there exist many other definitions of mixing (see [8] for an extensive listing). The motivation for using the $\beta$-mixing coefficient is that it is the weakest form of mixing for which it is possible to establish uniform laws of large numbers. In this work we consider two types of processes for which this coefficient decays to zero, namely algebraically decaying processes for which $\beta_m = O(m^{-r})$, $r > 0$, and exponentially mixing processes for which $\beta_m = O(\exp\{-b m^{\kappa}\})$, $b, \kappa > 0$. Since we are concerned with finite sample results, we will assume that the conditions can be phrased as

$$\beta_m \le \bar\beta m^{-r}$$

for the case of algebraic mixing, and

$$\beta_m \le \bar\beta \exp\{-b m^{\kappa}\}$$

in the case of exponential mixing, for some finite constants $\bar\beta$ and $b$. Finally, we comment that for Markov processes mixing implies exponential mixing [4], so that at least for this type of process, there is no loss of generality in assuming that the process is exponentially mixing. Note also that the usual i.i.d process may be obtained from either the exponentially or algebraically mixing process, by taking the limit $\kappa \to \infty$ or $r \to \infty$, respectively. We summarize the above notions in the following assumption:

Assumption 1 The compactly supported stochastic process $\bar Y$ is stationary and $\beta$-mixing.

In this section we follow the approach taken by Yu [17] in deriving uniform laws of large numbers for mixing processes. While Yu's work was mainly geared towards asymptotic results, we will be concerned here with finite sample theory, and will need to modify her results accordingly. Moreover, our results differ from hers when discussing specific assumptions about functional classes and their metric entropies. Finally, Yu's paper was concerned with algebraically mixing processes for which $r < 1$, as a central limit theorem holds in the case $r > 1$. Here we study both cases as well as the exponentially mixing one.

The basic idea in [17], as in many related approaches, involves the construction of an independent-block sequence, which is shown to be 'close' to the original process in a well-defined probabilistic sense. We briefly recapitulate the construction, slightly modifying the notation in [17] to fit in with the present paper. Divide the sequence $Y_1^N$ into $2\mu_N$ blocks, each of size $a_N$, such that the remainder block is of size $r_N = N - 2\mu_N a_N \le \min(2a_N, 2\mu_N)$. The blocks are then numbered according to their order in the block-sequence. For $1 \le j \le \mu_N$ define

$$H_j = \{ i : 2(j-1)a_N + 1 \le i \le (2j-1)a_N \},$$
$$T_j = \{ i : (2j-1)a_N + 1 \le i \le 2j a_N \}.$$

Denote the random variables corresponding to the $H_j$ and $T_j$ indices as $Y(j) = \{Y_i : i \in H_j\}$ and $Y'(j) = \{Y_i : i \in T_j\}$. The sequence of $H$-blocks is then denoted by $Y_{a_N} = \{Y(j)\}_{j=1}^{\mu_N}$. Now, construct a sequence of independently distributed blocks $\{Z(j)\}_{j=1}^{\mu_N}$, where $Z(j) = \{\zeta_i : i \in H_j\}$, such that the sequence is independent of $Y_1^N$ and each block has the same distribution as the block $Y(j)$ from the original sequence. Because the process is stationary, the blocks $Z(j)$ are not only independent but also identically distributed. The basic idea in the construction of the independent block sequence is that one can show that it is 'close', in a well-defined manner, to the original blocked sequence $Y_{a_N}$. Moreover, by appropriately selecting the number of blocks, $\mu_N$, depending on the mixing nature of the sequence, one may relate properties of the original sequence $Y_1^N$ to those of the independent block sequence $Z_{a_N}$. To do so, use is made of the following lemma:

Lemma 2 ([17], Lemma 4.1) Let the distributions of $Y_{a_N}$ and $Z_{a_N}$ be $Q$ and $\tilde Q$, respectively. For any measurable function $h$ on $\mathbb{R}^{a_N \mu_N}$ with bound $M$, $|E_Q h(Y_{a_N}) - E_{\tilde Q} h(Z_{a_N})| \le M \mu_N \beta_{a_N}$.

Let $\mathcal{F}$ be a permissible class of bounded functions, such that $0 \le f \le M$ for any $f \in \mathcal{F}$. The term permissible is used here in the sense of [7], and the related condition is imposed to prevent certain measurability problems (see [7] for details). In order to relate the uniform deviations (with respect to $\mathcal{F}$) of the original sequence $Y_1^N$ to those of the independent-block sequence $Z_{a_N}$, use is made of Lemma 2. We also utilize Lemma 4.2 from [17] and modify it so that it holds for finite sample size. Consider the block-independent sequence $Z_{a_N}$ and define

$$\tilde E_{\mu_N} f = \frac{1}{\mu_N a_N} \sum_{j=1}^{\mu_N} f(Z^{(j)}), \quad \text{where} \quad f(Z^{(j)}) = \sum_{i \in H_j} \left( f(\zeta_i) - Ef \right), \quad j = 1, \dots, \mu_N,$$

is a sequence of independent random variables such that $|f(Z^{(j)})| \le a_N M$. In the remainder of the paper we use variables with a tilde above them to denote quantities related to the transformed block sequence. Finally, we use the symbol $E_N$ to denote the empirical average with respect to the original sequence, namely $E_N f = (N-d)^{-1} \sum_{i=d+1}^{N} f(X_i)$.

Lemma 3 Suppose $\mathcal{F}$ is a permissible bounded class and let $N > 4 M a_N / \varepsilon$. Then

$$P\left\{ \sup_{f \in \mathcal{F}} |E_N f - Ef| > \varepsilon \right\} \le 2 P\left\{ \sup_{f \in \mathcal{F}} |\tilde E_{\mu_N} f| > \frac{\varepsilon}{2} \right\} + 2 \mu_N \beta_{a_N}. \qquad (5)$$
PROOF The claim is established by a simple modification of Lemma 4.2 in [17], which consists of re-writing the term $|E_N f - Ef|$ in terms of the blocked sequences $H_j$ and $T_j$. The only difference from [17] is the constraint on $N$, which arises from the remainder term mentioned above.

The main merit of Lemma 3 is in the transformation of the problem from the domain of dependent processes, implicit in the quantity $|E_N f - Ef|$, to one characterized by independent processes, implicit in the term $\tilde E_{\mu_N} f$, corresponding to the independent blocks. The price paid for this transformation is the extra term $2\mu_N \beta_{a_N}$ which appears on the r.h.s. of (5).
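The blocking construction itself is purely combinatorial, and the index bookkeeping is easy to state in code. The following sketch (my own illustration, not from the paper) splits a series of length $N$ into alternating $H$- and $T$-blocks of size $a_N$; the block size and the example series are assumptions supplied by the caller.

```python
import numpy as np

def block_indices(N, a_N):
    """Split {1,...,N} into mu_N pairs of H- and T-blocks of size a_N.

    H_j = {i : 2(j-1)a_N + 1 <= i <= (2j-1)a_N}
    T_j = {i : (2j-1)a_N + 1 <= i <= 2j a_N}
    The remainder block of size N - 2*mu_N*a_N is dropped, as in the text.
    Indices are returned 0-based for use with numpy arrays.
    """
    mu_N = N // (2 * a_N)
    H = [np.arange(2 * (j - 1) * a_N, (2 * j - 1) * a_N) for j in range(1, mu_N + 1)]
    T = [np.arange((2 * j - 1) * a_N, 2 * j * a_N) for j in range(1, mu_N + 1)]
    return H, T

def h_blocks(y, a_N):
    """Return the H-block sequence Y_{a_N} = {Y(j)}_{j=1}^{mu_N}."""
    H, _ = block_indices(len(y), a_N)
    return [y[idx] for idx in H]

if __name__ == "__main__":
    y = np.arange(1, 21)          # a toy "series" 1,...,20 to expose the bookkeeping
    H, T = block_indices(len(y), a_N=3)
    print("H-blocks:", [list(y[i]) for i in H])   # [[1,2,3],[7,8,9],[13,14,15]]
    print("T-blocks:", [list(y[i]) for i in T])   # [[4,5,6],[10,11,12],[16,17,18]]
```

The independent block sequence $\{Z(j)\}$ against which the proof compares is an analytical device (blocks drawn i.i.d. from the marginal distribution of $Y(j)$); it is never constructed from data, so only the index bookkeeping above is ever needed when applying Lemmas 2 and 3.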

4 RISK BOUNDS

The development in Section 3 was concerned with a scalar stochastic process $\bar Y$. In order to use the results in the context of time series, we first define a new vector-valued process $\bar X = \{\dots, X_{-1}, X_0, X_1, \dots\}$, where

$$X_i = (Y_i, Y_{i-1}, \dots, Y_{i-d}) \in \mathbb{R}^{d+1}.$$

For this sequence the $\beta$-mixing coefficients obey the inequality $\beta_m(\bar X) \le \beta_{m-d}(\bar Y)$. In the remainder of this work we will assume that the mixing coefficients are given directly in terms of the process $\bar X$, in order not to drag the additional factor of $d$ in their definition. Furthermore, in this section we follow [7] and identify the functional space $\mathcal{F}_n$ of Section 3 with the loss space $\mathcal{L}_{\mathcal{H}_n}$, composed of the functions

$$l_h(X_i) = \left(Y_i - h(Y_{i-d}^{i-1})\right)^2.$$

We retain the index $n$, as it plays a vital role in the context of complexity regularization.

It is well known in the theory of empirical processes (see [14] for example) that, in order to obtain upper bounds on uniform deviations of i.i.d sequences, use must be made of the so-called covering number of the function class $\mathcal{F}_n$ with respect to the empirical $l_1$ norm. Following common practice we denote the $\varepsilon$-covering number of the functional space $\mathcal{F}_n$, using the metric $\rho$, by $\mathcal{N}(\varepsilon, \mathcal{F}_n, \rho)$.

In the present case it is more convenient to transform the problem somewhat. For this purpose we first define a new class of functions related to $\mathcal{F}_n$.

Definition 4 Let $\mathcal{F}_n$ be a class of real-valued functions from $\mathbb{R}^D \to \mathbb{R}$, $D = d+1$. For each $f \in \mathcal{F}_n$ and $x = (x_1, x_2, \dots, x_{a_N})$, $x_i \in \mathbb{R}^D$, let $\bar f(x) = \sum_{i=1}^{a_N} (f(x_i) - c_f)$, for some constant $c_f$ depending on $f$. Then we define $\bar{\mathcal{F}}_n = \{\bar f : f \in \mathcal{F}_n\}$, where $\bar f : \mathbb{R}^{a_N D} \to \mathbb{R}$.

Let $z^{(\mu_N)} = (z_1, \dots, z_{\mu_N})$ be a sequence of $(a_N D)$-dimensional vectors, where $z_i$ corresponds to the $i$'th block. Denote the empirical $L_1$ norm, with respect to the data vector $z^{(\mu_N)}$, by $l_1$, namely

$$\|\bar f - \bar g\|_{l_1} = \frac{1}{\mu_N} \sum_{j=1}^{\mu_N} |\bar f(z_j) - \bar g(z_j)|.$$

Furthermore, let $L_\infty$ represent the supremum norm, $\|f - g\|_{L_\infty} = \sup_x |f(x) - g(x)|$. We then have:

Lemma 5 For any $\varepsilon > 0$,

$$\mathcal{N}(\varepsilon, \bar{\mathcal{F}}_n, l_1) \le \mathcal{N}\left(\frac{\varepsilon}{a_N}, \mathcal{F}_n, L_\infty\right).$$

PROOF Let $G = \{g_1, \dots, g_m\}$ be an $\frac{\varepsilon}{a_N}$-cover of $\mathcal{F}_n$ in the sup-norm, namely for every $f \in \mathcal{F}_n$ there exists a $g_j \in G$ such that

$$\sup_x |f(x) - g_j(x)| \le \frac{\varepsilon}{a_N}.$$

Then it is easy to show that $\left\{\sum_{i=1}^{a_N} g_j(x_i)\right\}_{j=1}^{m}$ is an $\varepsilon$-cover of $\bar{\mathcal{F}}_n$ in the sup-norm, since

$$\sup_x \left| \sum_{i=1}^{a_N} f(x_i) - \sum_{i=1}^{a_N} g_j(x_i) \right| \le \sum_{i=1}^{a_N} \sup_{x_i} |f(x_i) - g_j(x_i)| \le a_N \frac{\varepsilon}{a_N} = \varepsilon,$$

from which the result follows.

Lemma 6 Suppose $\mathcal{F}_n$ is a permissible bounded class. Then for any $N > 0$ and $\varepsilon > 0$,

$$P\left\{ \sup_{f \in \mathcal{F}_n} |\tilde E_{\mu_N} f| > \frac{\varepsilon}{2} \right\} \le 2\, \mathcal{N}\left(\frac{\varepsilon}{8 a_N}, \mathcal{F}_n, L_\infty\right) \exp\left\{ - \frac{\mu_N \varepsilon^2}{64 M^2} \right\}.$$

PROOF SKETCH Proceed by using standard bounds (as in [7]) combined with Lemma 5 in order to obtain the result in terms of $\mathcal{F}_n$ rather than $\bar{\mathcal{F}}_n$.

Remark: The covering number appearing in Lemma 6 is taken with respect to the supremum norm, and is thus expected to be larger than the usual $L_1$-based covering number. However, for many classes of natural functions (such as neural networks and projection pursuit networks) one may utilize methods from [5] to compute these numbers. For example, for the class $\mathcal{H}_n$ of feedforward neural networks of the form $h(x) = \sum_i c_i \phi(a_i^T x + b_i)$, where $\sum_{i=1}^{n} |c_i| \le M$, $x$ is defined over a compact domain and $\phi(\cdot)$ is a continuous function such that $\phi(u)$ approaches a finite constant for $u \to \pm\infty$, one may show that $\mathcal{N}(\varepsilon, \mathcal{H}_n, L_\infty) \le C\left(\frac{1}{\varepsilon}\right)^{\gamma_n}$, where $\gamma_n = d(n+2)$ (see [1] for a related example). We make the assumption here that this type of behavior is typical of the functional classes considered. Other types of behavior will be discussed in the full paper. The parameter $\gamma_n = \gamma(\mathcal{H}_n)$ characterizes the complexity of the class $\mathcal{H}_n$, similarly to the pseudo-dimension in [7]. Note that the $L_\infty$-covering numbers of $\mathcal{H}_n$ and $\mathcal{L}_{\mathcal{H}_n}$ are related by $\mathcal{N}(\varepsilon, \mathcal{L}_{\mathcal{H}_n}, L_\infty) \le \mathcal{N}(\varepsilon/4M, \mathcal{H}_n, L_\infty)$.

Assumption 2 The sup-norm covering number of $\mathcal{F}_n$ is upper bounded as follows:

$$\mathcal{N}(\varepsilon, \mathcal{F}_n, L_\infty) \le C\left(\frac{1}{\varepsilon}\right)^{\gamma_n},$$

for some finite positive constant $\gamma_n$.
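For orientation (an illustrative computation, not taken from the paper): for the feedforward network class of the Remark with lag size $d = 2$ and $n = 10$ hidden units, the growth parameter is $\gamma_n = d(n+2) = 24$, so Assumption 2 asserts that the number of sup-norm balls of radius $\varepsilon$ needed to cover the class grows no faster than $C(1/\varepsilon)^{24}$; it is this polynomial-in-$1/\varepsilon$ behaviour, with exponent linear in $n$, that drives the sample complexity and penalty terms below.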
Combining Lemma 3, Lemma 6 and Assumption 2, and keeping in mind that $\mathcal{F}_n = \mathcal{L}_{\mathcal{H}_n}$, we obtain the basic distribution-free bound. Results for the estimation error then follow from Lemma 1.

Theorem 7 Suppose $\mathcal{F}_n$ is a permissible bounded class and let the sample size be such that $N > 4 M a_N / \varepsilon$. Then for any $\varepsilon > 0$,

$$P\left\{ \sup_{f \in \mathcal{F}_n} |E_N f - Ef| > \varepsilon \right\} \le 4 C \left( \frac{8 a_N}{\varepsilon} \right)^{\gamma_n} \exp\left\{ - \frac{\mu_N \varepsilon^2}{64 M^2} \right\} + 2 \mu_N \beta_{a_N}. \qquad (6)$$

PROOF Immediate using Lemmas 3 and 6 and Assumption 2.

Corollary 8 (Sample complexity bounds - algebraic mixing) Let $\varepsilon > 0$ and $0 < \delta < 1$ be given, and assume that the process $\bar X$ is algebraically $\beta$-mixing with coefficient $r$. Assume further that

$$N \ge N_1(\varepsilon, \delta) = \left[ \frac{c M^2}{\varepsilon^2} \left( \gamma_n \log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \right]^{\frac{1+s}{s}}, \qquad (7)$$

where $0 < s < r$ and $c$ is a finite constant. Then with probability larger than $1 - \delta$ the inequality $\sup_{f \in \mathcal{F}_n} |E_N f - Ef| < \varepsilon$ holds.

PROOF Demanding that the r.h.s. of (6) be smaller than $\delta$, we may take each one of the terms to be smaller than $\delta/2$. Since the process is assumed to be algebraically mixing we have $\beta_m \le \bar\beta m^{-r}$. Now, simple arguments show that a condition sufficient to guarantee that the estimation error converges to zero in probability is that $\mu_N \sim N^{s/(1+s)}$ for large $N$. Setting $\mu_N = N^{s/(1+s)}/2$ and $a_N = N^{1/(1+s)}$ in (6), we obtain two bounds which must hold simultaneously, namely

$$4 C \left( \frac{8 a_N}{\varepsilon} \right)^{\gamma_n} \exp\left\{ - \frac{N^{s/(1+s)} \varepsilon^2}{128 M^2} \right\} \le \frac{\delta}{2}, \qquad (8)$$

and $2 \mu_N \beta_{a_N} \le \delta/2$. Combining these two equations and transforming them in order to get a bound on $N$ completes the proof.

Corollary 9 (Sample complexity bounds - exponential mixing) Let $\varepsilon > 0$ and $0 < \delta < 1$ be given, and assume that the process $\bar X$ is exponentially $\beta$-mixing with coefficient $\kappa$. Assume further that

$$N \ge N_2(\varepsilon, \delta) = \left[ \frac{c M^2}{\varepsilon^2} \left( \gamma_n \log\frac{1}{\varepsilon} + \log\frac{1}{\delta} \right) \right]^{\frac{1+\kappa}{\kappa}}. \qquad (9)$$

Then with probability larger than $1 - \delta$ the inequality $\sup_{f \in \mathcal{F}_n} |E_N f - Ef| < \varepsilon$ holds.

PROOF The proof is similar to that of Corollary 8, except that the choice $\mu_N \sim N^{\kappa/(1+\kappa)}$ is now dictated by the desire to balance the two terms on the r.h.s. of (6), so that they are asymptotically of a similar order of magnitude. This is possible in the case of exponential mixing, but not in the case of algebraic mixing, where the term $\mu_N \beta_{a_N}$ decays only algebraically to zero.

Note that the first terms in (7) and (9) degenerate to the result for independent observations [7] when $r \to \infty$ (implying $s \to \infty$) and $\kappa \to \infty$, respectively.

One may similarly phrase the problem as one of determining a 'confidence interval'. Namely, given $N$ and $\delta$ one wishes to determine an upper bound for the estimation error, by making use of Lemma 1. For brevity, we state the result only for algebraically mixing processes.

Corollary 10 (Confidence interval) Assume the stochastic process $\bar X$ is algebraically mixing with coefficient $r$, and let $0 < \delta < 1$ be a given confidence parameter. Assume further that

$$N \ge \max\left\{ N_0,\; \left( \frac{4\bar\beta}{\delta} \right)^{\frac{1+s}{r-s}} \right\},$$

where $0 < s < r$ and $N_0$ is a finite constant depending on $M$, $\gamma_n$, $C$ and $\delta$. Then with probability larger than $1 - \delta$,

$$\sup_{f \in \mathcal{F}_n} |E_N f - Ef| < \sqrt{ \frac{128 M^2 (1 + \gamma_n) \log N}{\tilde N} },$$

where $\tilde N = N^{s/(1+s)}$.

PROOF In order to guarantee a confidence of $\delta$ we again require that both terms in (6) be smaller than $\delta/2$. The condition on the first term yields (8), which must now be solved for $\varepsilon$ rather than $N$. Again, with a view towards consistency, we choose $\mu_N = N^{s/(1+s)}/2$ and $a_N = N^{1/(1+s)}$. Using the notation $\tilde N = N^{s/(1+s)}$ and taking logarithms we obtain the inequality

$$\frac{\tilde N \varepsilon^2}{128 M^2} \ge \gamma_n \log\frac{8 a_N}{\varepsilon} + \log\frac{8 C}{\delta}.$$

In order to solve this inequality, we posit $\varepsilon^2 \ge A \log N / \tilde N$ and proceed to evaluate a value of $A$ for which the relationship holds. Substituting in the above inequality and rearranging, we find that for large enough $N$ the inequality will hold if the pre-factor of the $\log N$ term is large enough, which yields the condition

$$A \ge 128 M^2 (1 + \gamma_n). \qquad (10)$$
Additionally, we need to require that the remaining terms be dominated as well, for which it is sufficient to neglect the $\log N$ term appearing in the last term on the r.h.s. of the above inequality; substituting (10) for $A$ shows that this holds once $\tilde N$ exceeds a finite constant depending on $M$, $\gamma_n$, $C$ and $\delta$. Finally, we still need to find the condition for which the second term in (6) is smaller than $\delta/2$, which, using $\beta_{a_N} \le \bar\beta a_N^{-r}$, yields the condition

$$N \ge \left( \frac{4\bar\beta}{\delta} \right)^{\frac{1+s}{r-s}}.$$

Combining the two lower bounds for $N$, keeping in mind the definition of $\tilde N$, yields the desired result.

Comparing the result of Corollary 10 to the standard results for i.i.d data, as in [12], we note that the structure is similar, the main difference being the appearance of an 'effective sample size' $\tilde N < N$ in the bound. As can be expected, the effective sample size is decreased due to the dependence, and this is manifested in a slower convergence rate as a function of the sample size.
of the sample size. An immediate consequence of Corollary
10 is the consistency of the estimation process, in the sense
that the estimation error converges to zero w.p. I. In fact,
using complexity regularization as in Section 5 one may es- Now, based on ideas in [ 101 (see also [ 12]), we define a com-
tablish the universal consistency of the method for a large plexity term
class of nonparametric functions.
Remark In view of (3) it is evident that in order to obtain a
N)
1% ‘%,N + 72

d
~(71, =
performance bound on the mean-square error prediction, use N(64M2)-’ ’
must he made of approximation error bounds. It is important
IO note, however, that the requirement that the covering num- where Cn,N = 8c2 ‘“. Let h n,N be the hypotheses
ber of the class ‘I& be ‘small’ usually constrains the class of ( >
functions used. thus jeopardizing their approximation abil- minimizing the empirical error within their respective classes
ity. We have recently been ahle to show (work in progress) an. The principle of structural risk minimization [ 121 then
that under certain conditions both terms can be controlled si- instructs one to select the hypothesis h> E { h7,,~ };P=1 which
multaneously. for regression functions belonging to certain minimizes the sum of the empirical error and the complexity
classes of smooth functions. We note that for approaches term r(n, N) given by
hased on sieve methods, as in [I 11, the constraints imposed
LN(h,N) + r(n, N).
on the approximating class ‘& are usually more stringent
that those in the present approach, resulting is poorer ap- The problem at this point has been transformed, under the
proximation ability in general. conditions stated above, to one of the standard i.i.d type, the
main difference being that the sample size N has been re-
5 COMPLEXITY REGULARIZATION placed by &!. It is thus clear that we may directly take results
from this setting, and apply them to our case making the ap-
Consider a sequence of hypothesis classes ti,,, and let ‘H = propriate modifications. By using Theorem 1 in [ IO] we can
U’z;I%,,. In this section WC discuss the problem of select- then show that for all N and n, and all E > 4r(7~,N), one
ing a hypothesis class ‘Jf,, E ?i with optimal performance. has
We use the term optimal here in a loose, but by now stan-
dard, way to refer to the hypothesis class for which the upper 167, log n; + c,
hound on the expected error is minimal. Of course this does E tL(h$,)J 5 >>5
N
- il5,
not guarantee that the actual loss of this class is minimal. We
limit ourselves in this section to exponentially mixing pro- where c, = log 8c - Y,, log M + 16n. As pointed out in
cesses, for which results analogous to the the i.i.d case may [lo] this type of result indicates that the performance of hk
he derived. In fact, since for stationary Markov processes is essentially as good as the hest bound one could get by
.J-mixing is equivalent lo exponential ,R-mixing [9], there is knowing the optimal value of n in advance.
no loss of generality for this class in considering only expo-
nential mixing. In particular we assume in this section that
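As a concrete illustration of the selection rule, the following sketch (an assumption-laden illustration, not code from the paper) evaluates the penalized criterion $L_N(h_{n,N}) + r(n,N)$ over a small range of model orders $n$ and returns the minimizer. The constants $C$, $M$ and the exponent $\kappa$, as well as the choice $\gamma_n = d(n+2)$ from the neural-network example of Section 4, are placeholders that would have to be supplied for a particular hypothesis class.

```python
import numpy as np

def complexity_penalty(n, N, d, kappa=1.0, C=1.0, M=1.0):
    """r(n, N): penalty derived from the covering-number bound of Theorem 7.

    Uses gamma_n = d * (n + 2) (the neural-network example of Section 4),
    N_tilde = N^{kappa/(1+kappa)} (the effective sample size), and
    C_{n,N} = 8 C (2N/M)^{gamma_n}. All constants are illustrative.
    """
    gamma_n = d * (n + 2)
    N_tilde = N ** (kappa / (1.0 + kappa))
    log_C_nN = np.log(8 * C) + gamma_n * np.log(2 * N / M)
    return np.sqrt(64 * M**2 * (log_C_nN + n) / N_tilde)

def structural_risk_minimization(empirical_errors, N, d, **kwargs):
    """Select n minimizing L_N(h_{n,N}) + r(n, N).

    `empirical_errors[n]` is the empirical error of the ERM hypothesis h_{n,N}
    in class H_n (computed elsewhere, e.g. by fitting networks with n hidden units).
    """
    scores = {n: err + complexity_penalty(n, N, d, **kwargs)
              for n, err in empirical_errors.items()}
    n_star = min(scores, key=scores.get)
    return n_star, scores

if __name__ == "__main__":
    # Hypothetical empirical errors that decrease with model order n.
    emp_err = {1: 0.40, 2: 0.25, 4: 0.18, 8: 0.16, 16: 0.155}
    n_star, scores = structural_risk_minimization(emp_err, N=10_000, d=2, M=1.0)
    print("selected model order n* =", n_star)
```

At realistic sample sizes the penalty term dominates the empirical errors, reflecting the conservativeness of distribution-free bounds noted in the Introduction; in practice the constants would be treated as tunable rather than taken literally.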
6 COMPARISON
A lot of effort has been invested in recent years in the problem of nonparametric prediction of time series (see [2] and [9] for a summary, mostly within the kernel estimator framework). Similarly to our results, universal consistency can be established, as well as asymptotic convergence rates. As will be seen below, similar asymptotic rates of convergence are attained, although the two methods differ substantially in their detailed implementation. It should be stressed, however, that the nonparametric results have been derived to date under much more general conditions than has been possible in this work. In particular, moment conditions replace the very stringent compactness assumptions made in this paper. Additionally, results can be obtained for many different types of mixing processes, both exponential and algebraic. Moreover, consistency can be established even in the case of stationary ergodic sources (see Section 3.5 in [2]), an assumption which imposes very weak constraints on the process. Clearly, a great deal of work still needs to be done in the complexity regularization framework to match the nonparametric results. On the positive side, we should stress that, under parametric conditions, the performance delivered by the complexity regularization framework is superior to that of the nonparametric methods (see below).

In order to compare the results of this paper with those in the extensive nonparametric literature, we consider the case of exponential mixing. In [9] (Section 3.4) it is established that for compactly supported exponentially mixing processes (with $\kappa = 1$) the deviation between the optimal predictor

$$R(y) = E[Y_i \,|\, Y_{i-d}^{i-1} = y]$$

and the kernel estimator $R_N(y)$ obeys

$$E|R(Y) - R_N(Y)|^2 \le O\left( \left( \frac{\log N}{N} \right)^{\frac{2k}{2k+d}} \right), \qquad (16)$$

where $k$ is the number of smooth derivatives for which the function $R(y)$ obeys a Lipschitz condition. We note that a similar type of result is proved also for the sup-norm type of error. For non-compactly supported processes the error rate is somewhat reduced (see [9]).
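For reference, the kernel (Nadaraya-Watson) predictor that the nonparametric results above refer to can be sketched as follows; this is a generic illustration with an assumed Gaussian kernel and bandwidth, not an implementation taken from [9].

```python
import numpy as np

def nadaraya_watson_predictor(y, d, bandwidth):
    """Return R_N(.), a kernel estimate of R(y) = E[Y_i | Y_{i-d}^{i-1} = y].

    Uses a Gaussian kernel on lag-d vectors; the bandwidth is assumed given
    (in practice it is chosen by cross-validation or a plug-in rule).
    """
    X = np.array([y[k:k + d] for k in range(len(y) - d)])   # lag vectors
    t = y[d:]                                                # targets Y_i

    def R_N(query):
        sq_dist = np.sum((X - query) ** 2, axis=1)
        w = np.exp(-0.5 * sq_dist / bandwidth ** 2)
        s = w.sum()
        return t.mean() if s == 0.0 else float(w @ t / s)

    return R_N

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, d = 1000, 2
    y = np.zeros(N)
    for i in range(2, N):
        y[i] = np.tanh(0.6 * y[i - 1] - 0.3 * y[i - 2]) + 0.1 * rng.standard_normal()
    R_N = nadaraya_watson_predictor(y, d, bandwidth=0.2)
    print("one-step prediction:", R_N(y[-d:]))
```

The comparison in this section is between the guarantees attached to such an estimator and those obtained above for penalized empirical risk minimization; computationally the kernel predictor is far cheaper, which is part of the trade-off returned to in Section 7.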
In order to compare this type of bound to results within the complexity regularization framework, we consider a specific structure, namely the class of neural networks indexed by the number of hidden units, $n$, in the single hidden layer. Alternatively, the recent results in [18] allow us to consider also the popular mixture of experts architecture. One of the main results in [18] was the demonstration that, under the condition that the optimal predictor $R(y)$ belongs to a Sobolev space consisting of functions with integrable $k$'th order derivatives, the approximation error for neural networks and mixtures of experts is upper bounded by

$$L_{n,d}^{(c)} \le \frac{c}{n^{2k/d}},$$

where $c$ is a constant. Moreover, it is also shown in [18] that the approximation rate may be achieved by using parameters defined over a compact (although increasing with $n$) domain. This last result is important in establishing the finiteness of the growth parameter $\gamma_n$ appearing in Assumption 2. To make things specific we take $\gamma_n$ to be polynomial in $n$, namely $\gamma_n = \gamma n^q$ for some positive $q$. In most cases of interest, such as neural and mixture networks, one has $q = 1$ (see the remark before Assumption 2). Plugging the approximation error and the assumption on $\gamma_n$ into (15), using $\kappa = 1$, and optimizing over $n$, we obtain after some algebra that the expected loss can be upper bounded as follows:

$$E\left\{ L(h_N^*) \right\} \le O\left( \left( \frac{\log^2 N}{N} \right)^{\frac{2k}{2(2k+2+d)}} \right). \qquad (17)$$

Comparing the bounds (16) and (17) we note that they are of the same structure, although differing in detail, which may be due to the different assumptions made concerning the optimal predictor $R(y)$. As stressed in Section 1, the results of the nonparametric approach are known to be asymptotically minimax and can thus not be improved upon in general. No such claim is made here for the results of the complexity regularization approach. Observe that both results display the typical slowing down as the dimension $d$ increases - a manifestation of the so-called 'curse of dimensionality'.

As a final note, we observe that in parametric situations the rate of convergence achieved by the complexity regularization approach is much faster, and indeed does not suffer from a curse of dimensionality. Consider the situation where the approximation error $L_{n,d}^{(c)}$ vanishes for some value of $n = n_0$. In this case we obtain that the convergence rate is given solely by the estimation error term, which can be upper bounded by

$$O\left( \sqrt{ \frac{\gamma_{n_0} \log N}{\tilde N} } \right),$$

which is much faster than the decay rate in (16) when the memory size $d$ becomes large.

Recently, Modha and Masry [11] have also presented a complexity regularization framework for the analysis of time series. While their approach, based on complexity regularization applied in the framework of the method of sieves, shares many features with that discussed in this work, their methodology is different from that presented here, which is based on Vapnik's framework. We further note that their results have only been given for the exponentially mixing case. It should be noted that the above approach requires that more stringent conditions be imposed on the approximating class, which usually have a rather deleterious effect on the approximation error. Moreover, while a result analogous to (15) is derived, the approximation error $L_{n,d}^{(c)}$ is multiplied by a constant which is larger than unity, implying that if $\mathcal{H}$ does not contain a model for which the approximation error vanishes, then convergence to the truly optimal value is not guaranteed. An appealing feature of the approach developed in [11] is that the memory size may be determined adaptively, and can be shown to yield optimal asymptotic performance.

7 DISCUSSION

In this work we have considered the problem of one-step prediction for stationary and mixing non-linear time series, focusing mainly on finite sample results for mixing processes. Extending the results pioneered by Vapnik and Chervonenkis to the dependent case, we have mainly relied on the results of [17]. As we have shown, Vapnik's framework of structural risk minimization (aka complexity regularization) can
be extended to this domain, delivering universal consistency and convergence rates.

One of the remaining questions regarding this work, and the related work by Modha and Masry [11], is related to the advantages and drawbacks of the method with respect to the nonparametric approach, which is known to be asymptotically minimax. Keeping in mind the asymptotic optimality of the latter method, we note that in situations where some prior parametric knowledge is available, the complexity regularized approach has the benefit of delivering faster rates of convergence, while still retaining the universal consistency property. It should be stressed, however, that the computational burden required for this approach is often prohibitive. Finally, one of the appealing features of the complexity regularization approach is that it delivers a 'model', which is sometimes useful in interpreting the data. This is much harder to achieve within the nonparametric framework.

In summary, then, we conclude that while encouraging results have been established, there clearly remains much work to be done in extending the framework and weakening some of the assumptions, as well as in comparing it in practical situations to the results from the theory of nonparametric statistics.

References

[1] A.R. Barron, "Universal Approximation Bounds for Superpositions of a Sigmoidal Function," IEEE Trans. Inf. Theory, vol. 39, pp. 930-945, 1993.
[2] D. Bosq, Nonparametric Statistics for Stochastic Processes, Lecture Notes in Statistics 110, Springer Verlag, 1996.
[3] P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, Second Edition, Springer Verlag, New York, 1991.
[4] Y.A. Davydov, "Mixing conditions for Markov chains", Theory of Prob. and its Applications, vol. 18, pp. 312-328, 1973.
[5] A. Kolmogorov and V. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces", Translations of the American Mathematical Society, vol. 17, pp. 277-364, 1961.
[6] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer Verlag, 1996.
[7] D. Haussler, "Decision theoretic generalizations of the PAC model for neural net and other learning applications", Inform. Comput., vol. 100, no. 1, pp. 78-150, 1992.
[8] P. Doukhan, Mixing - Properties and Examples, Springer Verlag, 1994.
[9] L. Györfi, W. Härdle, P. Sarda and P. Vieu, Nonparametric Curve Estimation from Time Series, Lecture Notes in Statistics 60, Springer Verlag, 1989.
[10] G. Lugosi and K. Zeger, "Concept learning using complexity regularization", IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 48-54, 1996.
[11] D. Modha and E. Masry, "Universal Prediction of Stationary Random Processes", submitted to IEEE Transactions on Information Theory, 1996.
[12] V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer Verlag, 1982.
[13] V.N. Vapnik and A.Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities", Theory of Prob. and Applic., vol. 16, no. 2, pp. 264-280, 1971.
[14] A.W. van der Vaart and J.A. Wellner, Weak Convergence and Empirical Processes, Springer Verlag, 1996.
[15] H. White, Estimation, Inference and Specification Analysis, Cambridge University Press, 1994.
[16] H. White and J.M. Wooldridge, "Some results on sieve estimation with dependent observations", in Nonparametric and Semi-parametric Methods in Econometrics and Statistics, W.A. Barnett, J. Powell and G. Tauchen, Eds., Cambridge U. Press, 1991.
[17] B. Yu, "Rates of convergence for empirical processes of stationary mixing sequences", Ann. Prob., vol. 22, no. 1, pp. 94-116, 1994.
[18] A. Zeevi, R. Meir and V. Maiorov, "Error Bounds for Functional Approximation and Estimation Using Mixtures of Experts", submitted to IEEE Transactions on Information Theory, 1996.
