Performance Bounds For Nonlinear Time Series Prediction
Ron Meir*
Department of Electrical Engineering
Technion
Haifa 32000, Israel
[email protected]
by the conditional mean, $E[Y_i \mid Y_{-\infty}^{i-1}]$. While this solution, in principle, settles the issue of optimal prediction, it does not settle the issue of actually computing the optimal predictor. First of all, note that to compute the conditional mean, the probabilistic law generating the stochastic process $Y$ must be known. Furthermore, the requirement of knowing the full past, $Y_{-\infty}^{i-1}$, is of course rather stringent. In this work we consider the more practical situation, where a finite sub-sequence $Y_1^N = (Y_1, Y_2, \dots, Y_N)$ is observed, and an optimal prediction is needed, conditioned on this data. Moreover, we allow the predictors to be based only on a finite lag vector of size $d$. Ultimately, in order to achieve full generality one may let $d \to \infty$ in order to obtain the optimal predictor, although we do not pursue this direction here.

We consider the problem of selecting an empirical estimator from a class of functions $\mathcal{H}_n : \mathbb{R}^d \to \mathbb{R}$, where $n$ is a complexity index of the class (for example, the number of computational nodes in a feedforward neural network with a single hidden layer), and $|h| \le M$ for $h \in \mathcal{H}_n$. Consider then an empirical predictor $h_{n,N}(Y_{i-d}^{i-1})$, $i > N$, for $Y_i$, based on the finite data vector $Y_1^N$ and depending on the $d$-dimensional lag vector $Y_{i-d}^{i-1}$, where $h_{n,N} \in \mathcal{H}_n$. It is possible to split the error incurred by this predictor into four terms. [...] The approximation error in (3) is determined by the richness of the class $\mathcal{H}_n$, and by the fact that the conditional mean $E[Y_i \mid Y_{i-d}^{i-1}]$ may in principle be an arbitrarily complicated function. Of course, in order to bound this term we will have to make some regularity assumptions about this function. Finally, the last term in (3) represents the so-called estimation error, and is the only term which depends on the data $Y_1^N$. Similarly to the problem of regression for i.i.d data, we expect that the approximation and estimation terms lead to conflicting demands on the choice of the complexity, $n$, of the functional class $\mathcal{H}_n$. Clearly, to minimize the approximation error the complexity should be made as large as possible. However, doing this will cause the estimation error to increase, because of the larger freedom in choosing a specific function in $\mathcal{H}_n$ to fit the data.

Up to this point we have not specified how to select the empirical estimator $h_{n,N}$. In this work we follow the ideas of Vapnik and Chervonenkis [13], which have been studied extensively in the context of i.i.d observations, and restrict our selection to that hypothesis which minimizes the empirical error, given by

$$\hat{L}_N(h) = \frac{1}{N-d} \sum_{i=d+1}^{N} \left( Y_i - h(Y_{i-d}^{i-1}) \right)^2 .$$
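To make the empirical-error criterion concrete, here is a minimal sketch (Python; the data, the lag order and the finite candidate set of predictors are illustrative placeholders, not taken from the text) that evaluates $\hat{L}_N(h)$ over the lag-$d$ embedding of an observed series and returns the empirical minimizer.

```python
import numpy as np

def empirical_error(h, y, d):
    """Empirical squared prediction error: (1/(N-d)) * sum over i of (Y_i - h(lag vector))^2."""
    N = len(y)
    errors = []
    for i in range(d, N):           # position i holds the target, y[i-d:i] the lag vector
        lag_vector = y[i - d:i]
        errors.append((y[i] - h(lag_vector)) ** 2)
    return np.mean(errors)

def erm_predictor(candidates, y, d):
    """Return the candidate predictor minimizing the empirical error (ERM over a finite set)."""
    return min(candidates, key=lambda h: empirical_error(h, y, d))

# Toy usage: an AR(1)-like series and three linear candidate predictors.
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.6 * y[t - 1] + 0.1 * rng.standard_normal()
candidates = [lambda v, a=a: a * v[-1] for a in (0.3, 0.6, 0.9)]
best = erm_predictor(candidates, y, d=1)
print(empirical_error(best, y, d=1))
```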
In the sequel, let $\sigma_i = \sigma(Y_1^i)$ and $\sigma_{i+k}^{\infty} = \sigma(Y_{i+k}^{\infty})$ be the sigma-algebras of events generated by the random variables $Y_1^i = (Y_1, Y_2, \dots, Y_i)$ and $Y_{i+k}^{\infty} = (Y_{i+k}, Y_{i+k+1}, \dots)$, respectively. We then define $\beta_m$, the coefficient of absolute regularity [8], as

$$\beta_m = \sup_i E \sup\left\{ |P(B \mid \sigma_i) - P(B)| : B \in \sigma_{i+m}^{\infty} \right\}, \qquad (4)$$

where the expectation is taken with respect to $\sigma_i = \sigma(Y_1^i)$. A stochastic process is said to be $\beta$-mixing if $\beta_m \to 0$ as $m \to \infty$. We note that there exist many other definitions of mixing (see [8] for an extensive listing). The motivation for using the $\beta$-mixing coefficient is that it is the weakest form of mixing for which it is possible to establish uniform laws of large numbers. In this work we consider two types of processes for which this coefficient decays to zero, namely algebraically decaying processes for which $\beta_m = O(m^{-r})$, $r > 0$, and exponentially mixing processes for which $\beta_m = O(\exp\{-b m^{\kappa}\})$, $b, \kappa > 0$. Since we are concerned with finite sample results, we will assume that the conditions can be phrased as

$$\beta_m \le \bar\beta\, m^{-r}$$

for the case of algebraic mixing, and

$$\beta_m \le \bar\beta \exp\{-b m^{\kappa}\}$$

in the case of exponential mixing, for some finite constants $\bar\beta$ and $b$. Finally, we comment that for Markov processes mixing implies exponential mixing [4], so that at least for this type of process there is no loss of generality in assuming that the process is exponentially mixing. Note also that the usual i.i.d process may be obtained from either the exponentially or algebraically mixing process, by taking the limit $\kappa \to \infty$ or $r \to \infty$, respectively. We summarize the above notions in the following assumption:

Assumption 1 The compactly supported stochastic process $Y$ is stationary and $\beta$-mixing.

The sequence is divided into blocks, numbered according to their order in the block-sequence. For $1 \le j \le \mu_N$ define

$$H_j = \{ i : 2(j-1)a_N + 1 \le i \le (2j-1)a_N \}, \qquad T_j = \{ i : (2j-1)a_N + 1 \le i \le 2j a_N \}.$$

Denote the random variables corresponding to the $H_j$ and $T_j$ indices by $Y(j) = \{Y_i : i \in H_j\}$ and $Y'(j) = \{Y_i : i \in T_j\}$. The sequence of $H$-blocks is then denoted by $Y_{a_N} = \{Y(j)\}_{j=1}^{\mu_N}$. Now, construct a sequence of independently distributed blocks $\{Z(j)\}_{j=1}^{\mu_N}$, where $Z(j) = \{\xi_i : i \in H_j\}$, such that the sequence is independent of $Y_1^N$ and each block has the same distribution as the block $Y(j)$ from the original sequence. Because the process is stationary, the blocks $Z(j)$ are not only independent but also identically distributed. The basic idea in the construction of the independent block sequence is that one can show that it is 'close', in a well-defined manner, to the original blocked sequence $Y_{a_N}$. Moreover, by appropriately selecting the number of blocks, $\mu_N$, depending on the mixing nature of the sequence, one may relate properties of the original sequence $Y_1^N$ to those of the independent block sequence $Z_{a_N}$. To do so, use is made of the following lemma:

Lemma 2 ([17], Lemma 4.1) Let the distributions of $Y_{a_N}$ and $Z_{a_N}$ be $Q$ and $\tilde{Q}$, respectively. For any measurable function $h$ on $\mathbb{R}^{a_N \mu_N}$ with bound $M$, $|E_Q h(Y_{a_N}) - E_{\tilde{Q}} h(Z_{a_N})| \le M \mu_N \beta_{a_N}$.

Let $\mathcal{F}$ be a permissible class of bounded functions, such that $0 \le f \le M$ for any $f \in \mathcal{F}$. The term permissible is used here in the sense of [7], and the related condition is imposed to prevent certain measurability problems (see [7] for details). In order to relate the uniform deviations (with respect to $\mathcal{F}$) of the original sequence $Y_1^N$ to those of the independent-block sequence $Z_{a_N}$, use is made of Lemma 2.
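The blocking construction used above (following [17]) is easy to state algorithmically; the following sketch simply enumerates the index sets $H_j$ and $T_j$ for given $\mu_N$ and $a_N$ (the containers are 0-based, but they hold the 1-based indices of the text).

```python
def make_blocks(mu_N, a_N):
    """Return the H-blocks and T-blocks of indices, following
    H_j = {i : 2(j-1)a_N + 1 <= i <= (2j-1)a_N},
    T_j = {i : (2j-1)a_N + 1 <= i <= 2j a_N},  for j = 1, ..., mu_N."""
    H, T = [], []
    for j in range(1, mu_N + 1):
        H.append(list(range(2 * (j - 1) * a_N + 1, (2 * j - 1) * a_N + 1)))
        T.append(list(range((2 * j - 1) * a_N + 1, 2 * j * a_N + 1)))
    return H, T

H, T = make_blocks(mu_N=3, a_N=4)
# H == [[1..4], [9..12], [17..20]] and T == [[5..8], [13..16], [21..24]]
```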
We also utilize Lemma 4.2 from [17] and modify it so that it holds for finite sample size. Consider the block-independent sequence $Z_{a_N}$ and define [...]

PROOF The claim is established by a simple modification of Lemma 4.2 in [17], which consists of re-writing the term $|E_N f - Ef|$ in terms of the blocked sequences $H_j$ and $T_j$. The only difference from [17] is the constraint on $N$, which arises from the remainder term mentioned above. ∎

The main merit of Lemma 3 is in the transformation of the problem from the domain of dependent processes, implicit in the quantity $|E_N f - Ef|$, to one characterized by independent processes, implicit in the term $\tilde{E}_{\mu_N} \tilde{f}$, corresponding to the independent blocks. The price paid for this transformation is the extra term $2\mu_N \beta_{a_N}$ which appears on the r.h.s of (5).

In the sequel we work with the loss class induced by $\mathcal{H}_n$, namely the functions

$$\ell_h(X_i) = \left( Y_i - h(Y_{i-d}^{i-1}) \right)^2, \qquad h \in \mathcal{H}_n.$$

We retain the index $n$, as it plays a vital role in the context of complexity regularization.

It is well known in the theory of empirical processes (see [14] for example) that, in order to obtain upper bounds on uniform deviations of i.i.d sequences, use must be made of the so-called covering number of the function class $\mathcal{F}_n$, with respect to the empirical $\ell_1$ norm. Following common practice, we denote the $\varepsilon$-covering number of the functional space $\mathcal{F}_n$, using the metric $\rho$, by $\mathcal{N}(\varepsilon, \mathcal{F}_n, \rho)$.

In the present case it is more convenient to transform the problem somewhat. For this purpose we first define a new class of functions related to $\mathcal{F}_n$.

Definition 4 Let $\mathcal{F}_n$ be a class of real-valued functions from $\mathbb{R}^D \to \mathbb{R}$, $D = d + 1$. For each $f \in \mathcal{F}_n$ and $x = (x_1, x_2, \dots, x_{a_N})$, $x_i \in \mathbb{R}^D$, let $\tilde{f}(x) = \sum_{i=1}^{a_N} (f(x_i) - c_f)$, for some constant $c_f$ depending on $f$. Then we define $\tilde{\mathcal{F}}_n = \{\tilde{f} : f \in \mathcal{F}_n\}$, where $\tilde{f} : \mathbb{R}^{a_N D} \to \mathbb{R}$.

Let $x^{(\mu_N)} = (x_1, \dots, x_{\mu_N})$ be a sequence of $(a_N D)$-dimensional vectors, where $x_j$ corresponds to the $j$'th block. Denote the empirical $L_1$ norm, with respect to the data vector $x^{(\mu_N)}$, by $\bar{\ell}_1$, namely

$$\|\tilde{f} - \tilde{g}\|_{\bar{\ell}_1} = \frac{1}{\mu_N} \sum_{j=1}^{\mu_N} |\tilde{f}(x_j) - \tilde{g}(x_j)|.$$

Furthermore, let $L_\infty$ represent the supremum norm $\|f - g\|_{L_\infty} = \sup_x |f(x) - g(x)|$. We then have:

Lemma 5 For any $\varepsilon > 0$,

$$\mathcal{N}(\varepsilon, \tilde{\mathcal{F}}_n, \bar{\ell}_1) \le \mathcal{N}\!\left(\frac{\varepsilon}{2 a_N}, \mathcal{F}_n, L_\infty\right).$$

PROOF Let $G = \{g_1, \dots, g_m\}$ be an $\frac{\varepsilon}{2 a_N}$-cover of $\mathcal{F}_n$ in the sup-norm, namely for every $f \in \mathcal{F}_n$ there exists a $g_j \in G$ such that

$$\sup_x |f(x) - g_j(x)| \le \frac{\varepsilon}{2 a_N}.$$

Then it is easy to show that $\{\tilde{g}_j\}_{j=1}^m$, with $\tilde{g}_j(x) = \sum_{i=1}^{a_N} (g_j(x_i) - c_{g_j})$, is an $\varepsilon$-cover of $\tilde{\mathcal{F}}_n$ in the $\bar{\ell}_1$ norm. ∎

[...]

PROOF SKETCH Proceed by using standard bounds (as in [7]) combined with Lemma 5 in order to obtain the result in terms of $\mathcal{F}_n$ rather than $\tilde{\mathcal{F}}_n$. ∎

Remark: The covering number appearing in Lemma 6 is taken with respect to the supremum norm, and is thus expected to be larger than the usual $L_1$-based covering number. However, for many classes of natural functions (such as neural networks and projection pursuit networks) one may utilize methods from [5] to compute these numbers. For example, for the class $\mathcal{H}_n$ of feedforward neural networks of the form $h(x) = \sum_i c_i \phi(a_i^T x + b_i)$, where $\sum_{i=1}^n |c_i| \le M$, $x$ is defined over a compact domain and $\phi(\cdot)$ is a continuous function such that $\phi(u)$ approaches a finite constant for $u \to \pm\infty$, one may show that $\mathcal{N}(\varepsilon, \mathcal{H}_n, L_\infty) \le c \left(\frac{1}{\varepsilon}\right)^{\gamma_n}$, where $\gamma_n = d(n+2)$ (see [1] for a related example). We make the assumption here that this type of behavior is typical of the functional classes considered. Other types of behavior will be discussed in the full paper. The parameter $\gamma_n = \gamma(\mathcal{H}_n)$ characterizes the complexity of the class $\mathcal{H}_n$, similarly to the pseudo-dimension in [7]. Note that the $L_\infty$-covering numbers of $\mathcal{F}_n$ and $\mathcal{H}_n$ are related by $\mathcal{N}(\varepsilon, \mathcal{F}_n, L_\infty) \le \mathcal{N}(\varepsilon/(4M), \mathcal{H}_n, L_\infty)$.

Assumption 2 The sup-norm covering number of $\mathcal{F}_n$ is upper bounded as follows:

$$\mathcal{N}(\varepsilon, \mathcal{F}_n, L_\infty) \le c \left(\frac{1}{\varepsilon}\right)^{\gamma_n},$$

for some finite positive constant $\gamma_n$.
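Definition 4 and the block-empirical norm $\bar{\ell}_1$ above can be illustrated directly; the sketch below (Python; the functions, the constants $c_f$, $c_g$ and the sample blocks are arbitrary placeholders chosen for the example) forms $\tilde{f}$ block by block and evaluates the $\bar{\ell}_1$ distance between two blocked functions.

```python
import numpy as np

def f_tilde(f, c_f, block):
    """Blocked version of f per Definition 4: f~(x) = sum_i (f(x_i) - c_f), block = (x_1, ..., x_{a_N})."""
    return sum(f(x) - c_f for x in block)

def l1_bar(f, c_f, g, c_g, blocks):
    """Empirical l1-bar distance between f~ and g~ over mu_N blocks of a_N vectors in R^D."""
    return np.mean([abs(f_tilde(f, c_f, b) - f_tilde(g, c_g, b)) for b in blocks])

# Toy usage: two bounded functions on R^2 (D = d + 1 with d = 1), mu_N = 5 blocks of length a_N = 3.
rng = np.random.default_rng(1)
blocks = [[rng.uniform(-1, 1, size=2) for _ in range(3)] for _ in range(5)]
f = lambda x: min(x[0] ** 2, 1.0)
g = lambda x: min(abs(x[0]), 1.0)
print(l1_bar(f, 0.3, g, 0.4, blocks))
```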
Combining Lemma 3, Lemma 6 and Assumption 2, and keeping in mind that $\mathcal{F}_n = \mathcal{L}_{\mathcal{H}_n}$, we obtain the basic distribution-free bound. Results for the estimation error then follow from Lemma 1.

Theorem 7 Suppose $\mathcal{F}_n$ is a permissible bounded class and let the sample size be such that $N \ge 4 M a_N / \varepsilon$. Then for any $\varepsilon > 0$,

$$P\left( \sup_{f \in \mathcal{F}_n} |E_N f - Ef| > \varepsilon \right) \le 4c \left(\frac{c_0}{\varepsilon}\right)^{\gamma_n} \exp\left\{-\frac{\mu_N \varepsilon^2}{128 M^2}\right\} + 2 \mu_N \beta_{a_N}, \qquad (6)$$

for some absolute constant $c_0$.

PROOF Immediate using Lemmas 3, 6 and Assumption 2. ∎

Corollary 8 (Sample complexity bounds - algebraic mixing) Let $\varepsilon > 0$ and $0 < \delta < 1$ be given, and assume that the process $Y$ is algebraically $\beta$-mixing with coefficient $r$. Assume further that

$$N \ge N_1(\varepsilon, \delta) \equiv \max\left\{ \left[\frac{C M^2}{\varepsilon^2}\left(\gamma_n \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right]^{\frac{1+s}{s}},\; \left(\frac{2\bar\beta}{\delta}\right)^{\frac{1+s}{r-s}} \right\}, \qquad (7)$$

for some finite constant $C$, where $0 < s < r$. Then with probability larger than $1 - \delta$ the inequality $\sup_{f \in \mathcal{F}_n} |E_N f - Ef| < \varepsilon$ holds.

PROOF Demanding that the r.h.s. of (6) be smaller than $\delta$, we may take each one of the terms to be smaller than $\delta/2$. Since the process is assumed to be algebraically mixing, we have $\beta_{a_N} \le \bar\beta a_N^{-r}$. Now, simple arguments show that a condition sufficient to guarantee that the estimation error converges to zero in probability is that $\mu_N \sim N^{s/(1+s)}$ for large $N$. Setting $\mu_N = N^{s/(1+s)}/2$ and $a_N = N^{1/(1+s)}$ in (6), we obtain two bounds which must hold simultaneously, namely

$$4c \left(\frac{c_0}{\varepsilon}\right)^{\gamma_n} \exp\left\{-\frac{N^{s/(1+s)} \varepsilon^2}{256 M^2}\right\} \le \frac{\delta}{2}, \qquad (8)$$

and

$$\bar\beta N^{-(r-s)/(1+s)} \le \frac{\delta}{2}.$$

Combining these two equations and transforming them in order to get a bound on $N$ completes the proof. ∎

Corollary 9 (Sample complexity bounds - exponential mixing) Let $\varepsilon > 0$ and $0 < \delta < 1$ be given, and assume that the process $Y$ is exponentially $\beta$-mixing with coefficient $\kappa$. Assume further that

$$N \ge N_2(\varepsilon, \delta) \equiv \max\left\{ \left[\frac{C M^2}{\varepsilon^2}\left(\gamma_n \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right]^{\frac{1+\kappa}{\kappa}},\; \left[\frac{1}{b}\log\frac{C'\bar\beta}{\delta}\right]^{\frac{1+\kappa}{\kappa}} \right\}, \qquad (9)$$

for finite constants $C$ and $C'$. Then with probability larger than $1 - \delta$ the inequality $\sup_{f \in \mathcal{F}_n} |E_N f - Ef| < \varepsilon$ holds.

PROOF The proof is similar to that of Corollary 8, except that the choice $\mu_N = N^{\kappa/(1+\kappa)}/2$ is now dictated by the desire to balance the two terms on the r.h.s. of (6), so that they are asymptotically of a similar order of magnitude. This is possible in the case of exponential mixing, but not in the case of algebraic mixing, where the term $\mu_N \beta_{a_N}$ decays only algebraically to zero. ∎

Note that the first terms in (7) and (9) degenerate to the result for independent observations [7] when $r \to \infty$ (implying $s \to \infty$) and $\kappa \to \infty$, respectively.
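To see how Corollary 8 trades the two terms of (6) off against one another, the following sketch evaluates both terms under the choices $\mu_N = N^{s/(1+s)}/2$ and $a_N = N^{1/(1+s)}$ and scans for a sample size at which both fall below $\delta/2$. The constants `c`, `c0` and `beta_bar` below are illustrative placeholders standing in for the unspecified constants of (6), not values fixed by the text.

```python
import numpy as np

def bound_terms(N, eps, s, r, gamma_n, M=1.0, c=1.0, c0=1.0, beta_bar=1.0):
    """Evaluate the two terms of (6): the covering/exponential term and the beta-mixing term,
    under mu_N = N^{s/(1+s)}/2, a_N = N^{1/(1+s)}, beta_m <= beta_bar * m^{-r}."""
    mu_N = 0.5 * N ** (s / (1 + s))
    a_N = N ** (1 / (1 + s))
    cover_term = 4 * c * (c0 / eps) ** gamma_n * np.exp(-mu_N * eps ** 2 / (128 * M ** 2))
    beta_term = 2 * mu_N * beta_bar * a_N ** (-r)
    return cover_term, beta_term

# Find the smallest N (on a coarse grid) for which both terms drop below delta/2.
eps, delta, s, r, gamma_n = 0.1, 0.05, 2.0, 3.0, 6.0
for N in np.logspace(3, 9, 25):
    t1, t2 = bound_terms(int(N), eps, s, r, gamma_n)
    if t1 < delta / 2 and t2 < delta / 2:
        print(int(N)); break
```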
One may similarly phrase the problem as one of determining a 'confidence interval'. Namely, given $N$ and $\delta$ one wishes to determine an upper bound for the estimation error, by making use of Lemma 1. For brevity, we state the result only for algebraically mixing processes.

Corollary 10 (Confidence interval) Assume the stochastic process $Y$ is algebraically mixing with coefficient $r$, and let $0 < \delta < 1$ be a given confidence parameter. Assume further that $N$ exceeds a finite threshold, depending on $\delta$, $\bar\beta$ and $M^2(2+\gamma_n)$, whose explicit form follows from the proof below, and let $0 < s < r$. Then with probability larger than $1 - \delta$,

$$\sup_{f \in \mathcal{F}_n} |E_N f - Ef| < \sqrt{\frac{128 M^2 (1 + \gamma_n) \log N}{\tilde{N}}},$$

where $\tilde{N} = N^{s/(1+s)}$.

PROOF In order to guarantee a confidence of $\delta$ we again require that both terms in (6) be smaller than $\delta/2$. The condition on the first term yields (8), which must now be solved for $\varepsilon$ rather than $N$. Again, with a view towards consistency we choose $\mu_N = N^{s/(1+s)}$ and $a_N = N^{1/(1+s)}$. Using the notation $\tilde{N} = N^{s/(1+s)}$ and taking logarithms we obtain the inequality [...] In order to solve this inequality, we assume a lower bound on $\varepsilon^2$ and proceed to evaluate a value of $\tilde{N}$ for which this relationship holds. Substituting in the above inequality and rearranging, we obtain [...] Clearly, for large enough $N$ the inequality will hold if the pre-factor of the $\log N$ term on the r.h.s of the inequality is larger than 1, which yields the condition [...] for which it is sufficient to neglect the $\log N$ term appearing in the last term on the r.h.s. of the above equation. Substituting (10) for $\tilde{N}$ we find the condition [...] from which we conclude that [...]. Finally, we still need to find the condition for which the second term in (6) is smaller than $\delta/2$, which yields the condition [...] ∎

[...] In order to obtain a performance bound on the mean-squared prediction error, use must be made of approximation error bounds. It is important to note, however, that the requirement that the covering number of the class $\mathcal{H}_n$ be 'small' usually constrains the class of functions used, thus jeopardizing their approximation ability. We have recently been able to show (work in progress) that under certain conditions both terms can be controlled simultaneously, for regression functions belonging to certain classes of smooth functions. We note that for approaches based on sieve methods, as in [11], the constraints imposed on the approximating class $\mathcal{H}_n$ are usually more stringent than those in the present approach, resulting in poorer approximation ability in general.

5 COMPLEXITY REGULARIZATION

Consider a sequence of hypothesis classes $\mathcal{H}_n$, and let $\mathcal{H} = \cup_{n \ge 1} \mathcal{H}_n$. In this section we discuss the problem of selecting a hypothesis class $\mathcal{H}_n \in \mathcal{H}$ with optimal performance. We use the term optimal here in a loose, but by now standard, way to refer to the hypothesis class for which the upper bound on the expected error is minimal. Of course this does not guarantee that the actual loss of this class is minimal. We limit ourselves in this section to exponentially mixing processes, for which results analogous to the i.i.d case may be derived. In fact, since for stationary Markov processes $\beta$-mixing is equivalent to exponential $\beta$-mixing [9], there is no loss of generality for this class in considering only exponential mixing. In particular, we assume in this section that $\beta_m \le \bar\beta \exp\{-b m^{\kappa}\}$ for some finite positive constants $\bar\beta$, $\kappa$ and $b$. Keeping the results of Theorem 7 in mind, we see that the optimal asymptotic balance between the two terms on the r.h.s of (6) is obtained by setting $\mu_N = N^{\kappa/(1+\kappa)}/2$ and $a_N = N^{1/(1+\kappa)}$. With this choice we then obtain (for $N \ge 4 M a_N / \varepsilon$)

$$P\left( \sup_{f \in \mathcal{F}_n} |E_N f - Ef| > \varepsilon \right) \le 4c \left(\frac{c_0}{\varepsilon}\right)^{\gamma_n} \exp\left\{-\frac{\tilde{N} \varepsilon^2}{128 M^2}\right\} + \bar\beta\, \tilde{N} e^{-b \tilde{N}}, \qquad (11)$$

where $\tilde{N} = N^{\kappa/(1+\kappa)}$. It is easily established that under the conditions $\varepsilon \le M$, $b \ge 1/128$, $\delta \le 4c/M$ and $\min_n\{\cdot\} \ge 1$, the second term on the r.h.s of (11) is smaller than the first, leading to a bound of the form [...] Based on this bound, define the complexity term

$$r(n, N) = \sqrt{\frac{64 M^2 \log C_{n,N}}{\tilde{N}}}, \qquad C_{n,N} = 8c\,(\cdots)^{\gamma_n}.$$

Let $h_{n,N}$ be the hypotheses minimizing the empirical error within their respective classes $\mathcal{H}_n$. The principle of structural risk minimization [12] then instructs one to select the hypothesis $h^*_N \in \{h_{n,N}\}_{n=1}^{\infty}$ which minimizes the sum of the empirical error and the complexity term $r(n, N)$, given by

$$\hat{L}_N(h_{n,N}) + r(n, N).$$

The problem at this point has been transformed, under the conditions stated above, to one of the standard i.i.d type, the main difference being that the sample size $N$ has been replaced by $\tilde{N}$. It is thus clear that we may directly take results from this setting, and apply them to our case making the appropriate modifications. By using Theorem 1 in [10] we can then show that for all $N$ and $n$, and all $\varepsilon > 4 r(n, N)$, one has

$$E\,L(h^*_N) \le L^{(a)}_n + \sqrt{\frac{16 \gamma_n \log \tilde{N} + c_n}{\tilde{N}}}, \qquad (15)$$

where $c_n = \log 8c - \gamma_n \log M + 16n$. As pointed out in [10], this type of result indicates that the performance of $h^*_N$ is essentially as good as the best bound one could get by knowing the optimal value of $n$ in advance.
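The structural risk minimization step itself is a one-line selection rule; a minimal sketch follows, in which the penalty $r(n, N)$ is passed in as a callable because its exact constants depend on the bound above (the toy penalty used in the example is schematic, not the paper's $r(n, N)$).

```python
def srm_select(empirical_errors, penalty, N):
    """Given empirical errors {n: L_N(h_{n,N})} of the per-class minimizers and a complexity
    penalty r(n, N), return the index n* minimizing L_N(h_{n,N}) + r(n, N)."""
    return min(empirical_errors, key=lambda n: empirical_errors[n] + penalty(n, N))

# Toy usage: empirical error decreasing in n, schematic penalty growing like sqrt(n log N / N).
import math
emp = {n: 1.0 / (n + 1) for n in range(1, 20)}
r = lambda n, N: math.sqrt(16 * n * math.log(N) / N)
print(srm_select(emp, r, N=10_000))
```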
6 COMPARISON

A lot of effort has been invested in recent years in the problem of nonparametric prediction of time series (see [2] and [9] for a summary - mostly within the kernel estimator framework). Similarly to our results, universal consistency can be established, as well as asymptotic convergence rates. As will be seen below, similar asymptotic rates of convergence are attained, although the two methods differ substantially in their detailed implementation. It should be stressed, however, that the nonparametric results have been derived to date under much more general conditions than has been possible in this work. In particular, moment conditions replace the very stringent compactness assumptions made in this paper. Additionally, results can be obtained for many different types of mixing processes, both exponential and algebraic. Moreover, consistency can be established even in the case of stationary ergodic sources (see Section 3.5 in [9]), an assumption which imposes very weak constraints on the process. Clearly, a great deal of work still needs to be done in the complexity regularization framework to match the nonparametric results. On the positive side we should stress that, under parametric conditions, the performance delivered by the complexity regularization framework is superior to that of the nonparametric methods (see below).

In order to compare the results of this paper with those in the extensive nonparametric literature, we consider the case of exponential mixing. In [9] (Section 3.4) it is established that for compactly supported exponentially mixing processes (with $\kappa = 1$) the deviation between the optimal predictor $R(y) = E[Y_i \mid Y_{i-d}^{i-1} = y]$ and the kernel estimator can be bounded, yielding the rate referred to below as (16). [...] Plugging in the approximation error and the assumption on $\gamma_n$ into (15), using $\kappa = 1$, and optimizing over $n$, we obtain after some algebra that the expected loss can be upper bounded as follows:

$$E\,L(h^*_N) \le O\!\left( \left(\frac{\log^2 N}{N}\right)^{\frac{k}{2(2k+2+d)}} \right). \qquad (17)$$

Comparing the bounds (16) and (17), we note that they are of the same structure, although differing in detail, which may be due to the different assumptions made concerning the optimal predictor $R(y)$. As stressed in Section 1, the results of the nonparametric approach are known to be asymptotically minimax and can thus not be improved upon in general. No such claim is made here for the results of the complexity regularization approach. Observe that both results display the typical slowing down as the dimension $d$ increases - a manifestation of the so-called 'curse of dimensionality'.
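The slowing down with the lag dimension $d$ can be tabulated directly from the exponent in (17); the small sketch below uses the exponent in the form given in (17) above, and since the constants there are stated only up to order, the numbers should be read as indicative of the trend rather than exact.

```python
# Rate exponent alpha in E L(h_N^*) = O((log^2 N / N)^alpha), as a function of smoothness k and lag dimension d.
def rate_exponent(k, d):
    return k / (2 * (2 * k + 2 + d))

for d in (1, 2, 5, 10, 20):
    print(d, round(rate_exponent(k=2, d=d), 4))
# The exponent shrinks as d grows: the 'curse of dimensionality' in the convergence rate.
```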
As a final note, we observe that in parametric situations the rate of convergence achieved by the complexity regularization approach is much faster, and indeed does not suffer from a curse of dimensionality. Consider the situation where the approximation error $L^{(a)}_n$ vanishes for some value of $n = n_0$. In this case we obtain that the convergence rate is given solely by the estimation error term, which can be upper bounded by [...]

[...] be extended to this domain, delivering universal consistency and convergence rates.

One of the remaining questions regarding this work, and the related work by Modha and Masry [11], is related to the advantages and drawbacks of the method with respect to the nonparametric approach, which is known to be asymptotically minimax. Keeping in mind the asymptotic optimality of the latter method, we note that in situations where some prior parametric knowledge is available, the complexity regularized approach has the benefit of delivering faster rates of convergence, while still retaining the universal consistency property. It should be stressed, however, that the computational burden required for this approach is often prohibitive. Finally, one of the appealing features of the complexity regularization approach is that it delivers a 'model', which is sometimes useful in interpreting the data. This is much harder to achieve within the nonparametric framework.

In summary, then, we conclude that while encouraging results have been established, there clearly remains much work to be done in extending the framework and weakening some of the assumptions, as well as in comparing it in practical situations to the results from the theory of nonparametric statistics.
References

[1] A.R. Barron, "Universal Approximation Bounds for Superpositions of a Sigmoidal Function," IEEE Trans. Inf. Theory, vol. 39, pp. 930-945, 1993.

[2] D. Bosq, Nonparametric Statistics for Stochastic Processes, Lecture Notes in Statistics 110, Springer Verlag, 1996.

[3] P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, Second Edition, Springer Verlag, New York, 1991.

[4] Y.A. Davydov, "Mixing conditions for Markov chains", Theory of Prob. and its Applications, vol. 18, pp. 312-328, 1973.

[5] A. Kolmogorov and V. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces", Translations of the American Mathematical Society, vol. 17, pp. 277-364, 1961.

[6] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer Verlag, 1996.

[7] D. Haussler, "Decision theoretic generalizations of the PAC model for neural net and other learning applications", Inform. Comput., vol. 100, no. 1, pp. 78-150, 1992.

[8] P. Doukhan, Mixing: Properties and Examples, Springer Verlag, 1994.

[9] L. Györfi, W. Härdle, P. Sarda and P. Vieu, Nonparametric Curve Estimation from Time Series, Lecture Notes in Statistics 60, Springer Verlag, 1989.

[10] G. Lugosi and K. Zeger, "Concept learning using complexity regularization", IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 48-54, 1996.

[11] D. Modha and E. Masry, "Universal Prediction of Stationary Random Processes", submitted to IEEE Transactions on Information Theory, 1996.

[12] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer Verlag, 1982.

[13] V.N. Vapnik and A.Y. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities", Theory of Prob. and Applic., vol. 16, no. 2, pp. 264-280, 1971.

[14] A.W. van der Vaart and J.A. Wellner, Weak Convergence and Empirical Processes, Springer Verlag, 1996.

[15] H. White, Estimation, Inference and Specification Analysis, Cambridge University Press, 1994.

[16] H. White and J.M. Wooldridge, "Some results on sieve estimation with dependent observations", in Nonparametric and Semiparametric Methods in Econometrics and Statistics, W.A. Barnett, J. Powell and G. Tauchen, Eds., Cambridge University Press, 1991.

[17] B. Yu, "Rates of convergence for empirical processes of stationary mixing sequences", Ann. Prob., vol. 22, no. 1, pp. 94-116, 1994.

[18] A. Zeevi, R. Meir and V. Maiorov, "Error Bounds for Functional Approximation and Estimation Using Mixtures of Experts", submitted to IEEE Transactions on Information Theory, 1996.