Exploring the Limits of Bootstrap
Exploring the Limits
of
Bootstrap
Edited by
RAOUL LEPAGE
Michigan State University, East Lansing, Michigan
LYNNE BILLARD
University of Georgia, Athens, Georgia
A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.
New York · Chichester · Brisbane · Toronto · Singapore
In recognition of the importance of preserving what has been
written, it is a policy of John Wiley & Sons, Inc., to have books
of enduring value published in the United States printed on
acid-free paper, and we exert our best efforts to that end.
CONTRIBUTORS
PART 1. INTRODUCTION
Introduction to Bootstrap
Bradley Efron and Raoul LePage
Introduction, 3
Jackknife, 4
Bootstrap, 5
Consistency of Bootstrap, 7
Pivoting and Edgeworth Expansions, 8
New Directions, 9
References, 10
P. J. Bickel
Summary, 65
1. Introduction, 66
2. Second Order Correctness and Equivalence, 67
3. Second Order Optimality and Robustness, 74
References, 75
Introduction, 77
Asymptotics for the Conditional Bootstrap, 79
Some New Bootstrap Methods, 81
Proofs, 87
References, 97
B. Efron
Abstract, 99
Introduction, 99
1. Why Do Maximum Likelihood Estimated Distributions Tend to Be
Short-Tailed?, 99
2. Why Does the Delta Method Tend to Underestimate Standard Errors?, 103
3. Why Are Cross-Validation Estimators So Variable?, 108
4. What Is a Correct Confidence Interval?, 112
5. What Is a Good Nonparametric Pivotal Quantity?, 116
6. What Are Computationally Efficient Ways To Bootstrap?, 120
References, 124
Efficient Bootstrap Simulation 127
Peter Hall
Abstract, 127
1. Introduction, 127
2. Uniform Resampling, 128
3. Linear Approximation, 129
4. Centering Method, 131
5. Balanced Resampling, 133
6. Antithetic Resampling, 135
7. Importance Resampling, 137
8. Quantile Estimation, 141
References, 142
Abstract, 145
Introduction, 145
Consistency of the Bootstrap for U-Quantiles, 146
Accuracy of the Bootstrap for U-Quantiles, 150
Applications, 153
References, 154
John G. Kinateder
Introduction, 157
The Stochastic Integral Representation, 160
The Invariance Principle, 161
The Limit Laws, 172
Simulation Results, 173
Remarks, 178
Appendix, 179
References, 180
Edgeworth Correction by ‘Moving Block’ Bootstrap for Stationary
and Nonstationary Data 183
S. N. Lahiri
Abstract, 183
Introduction, 183
Results on $\bar X_n$, 187
Smooth Functions of Mean, 192
Nonstationary Data, 195
Proofs, 197
References, 212
Abstract, 215
Introduction, 215
References, 224
Abstract, 279
1. Introduction, 279
2. Some Preliminary Notations and Basic Ideas, 282
3. The Second Order Accuracy of the Random Weighting Approximation, 287
4. Two Examples, 297
Acknowledgements, 303
References, 343
PART 3. APPLICATIONS OF THE BOOTSTRAP
Bootstrapping for Order Statistics Sans Random Numbers
(Operational Bootstrapping) 309
William A. Bailey
Abstract, 309
1. Meshing and Von Mises Theory, 311
2. Bivariate Generalized Numerical Convolutions, 313
References and Acknowledgments, 318
1. Introduction, 319
Abstract, 345
Introduction, 345
Deterministic Hazards, 346
The Hazard Process, 347
Estimation of Deterministic Functions
Asymptotic Distributions of Estimates and Hypothesis Testing, 351
Censoring, Competing Risks and Time-Varying Covariates, 354
References, 360
Index 419
PREFACE
Special thanks are also due to the National Science Foundation and the
Office of Naval Research who through their support of INTERFACE ’90
indirectly encouraged participation in the bootstrap conference as well.
Introduction to Bootstrap
Bradley Efron and Raoul LePage
Stanford University and Michigan State University
Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
Substituting (3) into (2) gives an estimated standard error for $\bar X$.
Let $x_{(i)}$ be the data set with the $i$-th datum removed,
and let $\hat\theta_{(i)}$ equal $t(x_{(i)})$, the statistic $\hat\theta$ reevaluated for the deleted-point
data set $x_{(i)}$. The jackknife estimate of standard error is
$$\widehat{se}_{jack} = \Big[\frac{n-1}{n}\sum_{i=1}^{n}\big(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\big)^{2}\Big]^{1/2} \qquad (6)$$
where $\hat\theta_{(\cdot)} = n^{-1}\sum_{i=1}^{n}\hat\theta_{(i)}$. All that is required is the ability to recompute $\hat\theta$ $n$ times, once for each deleted-point
data set $x_{(i)}$. The jackknife marked a decisive switch toward computation,
and away from the sort of routine theorizing that statisticians
traditionally did in extending formula (2) to the complications of real-life
statistical practice.
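As a concrete illustration of (6), here is a minimal Python sketch of the jackknife standard-error computation for a generic statistic; the function name `jackknife_se` and the choice of test statistics are illustrative assumptions, not part of the text.

```python
import numpy as np

def jackknife_se(x, stat):
    """Jackknife estimate of standard error, formula (6):
    [ (n-1)/n * sum_i (theta_(i) - theta_(.))^2 ]^(1/2)."""
    x = np.asarray(x)
    n = len(x)
    # theta_(i): the statistic recomputed with the i-th datum deleted
    theta_del = np.array([stat(np.delete(x, i)) for i in range(n)])
    theta_dot = theta_del.mean()                      # theta_(.)
    return np.sqrt((n - 1) / n * np.sum((theta_del - theta_dot) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=30)
print(jackknife_se(x, np.mean))    # close to the usual s/sqrt(n)
print(jackknife_se(x, np.median))  # less trustworthy: unsmooth statistic
```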
Tukey’s formula didn’t eliminate statistical theory of course. Rather it
focused attention on the theory justifying (6) as an accurate assessment of
standard error. The jackknife turned out to work poorly on very un-smooth
estimators like the sample median, but otherwise to give generally
trustworthy results. See Miller (1964).
A major disappointment in the development of jackknife theory
concerned “‘studentization.”’ Standard errors are often used to set
approximate (1-a)-level confidence intervals for 0, of the form
$$\hat\theta \pm z_{\alpha/2}\cdot \widehat{se} \qquad (7)$$
where $z_\alpha$ is the standard normal percentile, $z_{.025} = -1.960$, etc. In the case
of estimating the true mean $\mu(F)$ with the sample mean $\bar X$, where $\widehat{se}$ is given
by (4), and where $F$ is assumed to be a normal distribution, Student's
famous result says that $z_{\alpha/2}$ in (7) should be replaced by $t_{\alpha/2,\,n-1}$, the
Student's $t$ percentile with degrees of freedom $\nu = n-1$. Considerable efforts
were made to find the correct degrees of freedom for $\widehat{se}_{jack}$, in order to get
approximate confidence intervals better than $\hat\theta \pm z_{\alpha/2}\,\widehat{se}_{jack}$, but to no
conclusive end.
$$se(\bar X, \hat F) = \big[\hat\sigma^2(x)/n\big]^{1/2},$$
almost the same as the traditional estimate (4).
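In practice the plug-in quantity above is approximated by Monte Carlo. The sketch below is ours (the number of resamples B, the seed, and the example data are arbitrary assumptions): it draws B resamples from $\hat F$, recomputes the statistic, and reports the spread of the replications as the bootstrap estimate of standard error.

```python
import numpy as np

def bootstrap_se(x, stat, B=2000, seed=0):
    """Monte Carlo approximation to the bootstrap standard error:
    resample x with replacement (i.e. sample from F-hat), recompute the
    statistic, and take the standard deviation over the B replications."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    reps = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return reps.std(ddof=1)

rng = np.random.default_rng(1)
x = rng.exponential(size=50)
print(bootstrap_se(x, np.mean))          # Monte Carlo bootstrap estimate
print(x.std(ddof=0) / np.sqrt(len(x)))   # the plug-in formula above
```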
Efron’s (1979) bootstrap paper makes these main points:
Pan) (10)
Given a sample $x = (x_1, \ldots, x_n)$ i.i.d. $F$, we would like the bootstrap to be able to
estimate the distribution $L_F$ by resampling $x$ and applying $\hat\theta$ with $\hat F$ in place
of $F$, i.e. we want
Bickel and Freedman (1981) isolate a set of three general conditions which
together imply the bootstrap consistency result (12) in the circumstances
(some of them quite general) studied thus far:
Their paper should be consulted for the precise statement of these results.
In the paper, a number of examples are developed which exhibit the failure
of (12) due to violation of each of the three conditions individually.
Based on this work, bootstrap was rapidly shown to apply in a broad
range of standard applications, including t-statistics, empirical, and quantile
processes (Bickel and Freedman, 1981); multiple regression (Freedman,
1981); and stratified sampling (Bickel and Freedman, 1984).
P(| E5¥|<
a V(x) | x) & La, (16)
where $\hat\sigma^*$ is the sample standard deviation of $(x_1^*, \ldots, x_m^*)$, we also have
New directions. Some effort has gone into studying the precise nature
of bootstrap's failure for long-tailed errors. Athreya (1987) studied
bootstrapping $\bar X$ when the limits $L_\alpha$ are stable laws of index $\alpha < 2$, and established that
the limit distribution obtained by bootstrap is random. Recently, Giné and
Zinn (1989) have shown that normal limits $L_\alpha$ are in some sense necessary
when $\hat\theta(n; x)$ are normalized sums. Do bootstraps exist that can successfully
cope with long-tailed errors in normalized sums, or how is one to know
when the normal limits apply?
A potentially important development in bootstrap is the "double-dip"
bootstrap, i.e. choosing among several competing estimators the one whose
bootstrap-estimated sampling error is least, then again using the bootstrap to
assess the sampling error of this adaptively chosen estimator.
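A minimal sketch of this "double-dip" idea, under assumptions of our own choosing (mean versus 10% trimmed mean as the competing location estimators, squared-error bootstrap risk, and arbitrary resample counts): the first bootstrap level selects an estimator, the second assesses the sampling error of the adaptively selected estimator, redoing the selection inside every resample.

```python
import numpy as np
from scipy import stats

def boot_mse(x, est, B, rng):
    # bootstrap estimate of squared sampling error of est around est(x)
    center = est(x)
    reps = np.array([est(rng.choice(x, size=len(x), replace=True)) for _ in range(B)])
    return np.mean((reps - center) ** 2)

def double_dip(x, estimators, B1=500, B2=500, seed=0):
    rng = np.random.default_rng(seed)
    # first dip: pick the estimator with the smallest bootstrap-estimated error
    errors = [boot_mse(x, est, B1, rng) for est in estimators]
    chosen = estimators[int(np.argmin(errors))]
    # second dip: bootstrap the error of the adaptively chosen estimator,
    # repeating the selection step inside every resample
    def adaptive(sample):
        errs = [boot_mse(sample, est, 50, rng) for est in estimators]
        return estimators[int(np.argmin(errs))](sample)
    return chosen(x), np.sqrt(boot_mse(x, adaptive, B2, rng))

x = np.random.default_rng(2).standard_t(df=3, size=40)
estimators = [np.mean, lambda s: stats.trim_mean(s, 0.1)]
print(double_dip(x, estimators))
```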
Other promising new lines of current research are directed toward
References
On the bootstrap of M-estimators
and other statistical functionals
Miguel A. Arcones*†
CUNY, Graduate Center
The University of Connecticut
Evarist Giné*‡
CUNY, The College of Staten Island
The University of Connecticut
Abstract
*Work partially supported by NSF Grant No. DMS 9000132 and by PSC-CUNY Grants
No. 669336 and No. 661376
†Current address: Mathematical Sciences Research Institute, 1000 Centennial Drive,
Berkeley, CA 94720
‡Current address: The University of Connecticut, Department of Mathematics, Storrs,
CT 06269-3009
1 Introduction
(see Section 2(c)) and the “square root trick” exponential bound (LeCam
(1986, page 546), Giné and Zinn (1984, 1986)). Concrete examples include
the bootstrap of the spatial median and the bootstrap of k-means (this, for
reasons of expediency, only in R). (Can these two examples be handled with
more "classical" methods, i.e. by proving some kind of differentiability with
respect to the multivariate cdf? Perhaps, but we were not diligent enough
to check this, since our approach, which follows Pollard's, is so well adapted to
the problem.)
The results on differentiable functionals are in Section 4. The emphasis
is on bootstrapping limit theorems for functionals which are differentiable
in a very weak sense (only along a family of subsequences and at a fixed
rate). Both the framework and the ideas of the proof of Theorem 4.6 be-
long to Sheehy and Wellner (1988) and Gill (1989). Our contribution to this
subject consists only of providing some accurate proofs (particularly for the
existence of a representation for which simultaneously the empirical process
converges uniformly a.s. to the Gaussian limit and the bootstrap CLT holds
a.s., which is given in Section 2) and observing that if the bootstrap sample
size is of the order $o(n/\log\log n)$ then the bootstrap CLT actually holds
a.s. (a phenomenon already pointed out before in similar situations - Yang
(1985), Arcones and Giné (1990)).
2 Empirical processes.
In the first part of this section we give some notation and definitions about
empirical processes and describe the results on the bootstrap of empirical
processes of Giné and Zinn (1990, 1991). In a second part we give some
technical results that will be useful in the study of the bootstrap for differ-
entiable functionals. The third contains an application of Alexander’s (1984)
exponential inequalities to the almost sure behavior of empirical processes.
and
$P : \mathcal{F} \to \mathbb{R}$ given by $Pf = \int f\, dP$
are bounded. So $P_n, P \in \ell^\infty(\mathcal{F})$, the Banach space of real bounded functions
on $\mathcal{F}$, equipped with the supremum norm.
Both $\nu_n$ and $P_n$ are random elements with values in $\ell^\infty(\mathcal{F})$. Let $\{G_P(f) : f \in
\mathcal{F}\}$, the $P$-Brownian bridge indexed by $\mathcal{F}$, be the centered Gaussian process
with covariance
$$(\nu_n(f) : f \in J) \to_{\mathcal{L}} (G_P(f) : f \in J)$$
by the finite-dimensional central limit theorem. An important question is
whether this convergence can be made “uniform” over all of F. To make this
precise, we define (following Hoffmann-Jørgensen): if $\{X_n\}_{n=0}^{\infty}$ are $\ell^\infty(\mathcal{F})$-
valued random elements and $X_0$ is measurable and has a separable support,
then
$$X_n \to_{\mathcal{L}} X_0 \qquad (2.5)$$
in $\ell^\infty(\mathcal{F})$ iff
$$E^* H(X_n) \to E H(X_0) \qquad (2.6)$$
for all $H : \ell^\infty(\mathcal{F}) \to \mathbb{R}$ bounded and continuous. $E^*$ stands for outer
integral. Note that if the process $G_P$ has a version with bounded $\rho_P$-
uniformly continuous trajectories, then it is measurable and has support
$C_u(\mathcal{F}, \rho_P)$ ($C_u$ = bounded uniformly continuous functions), which is separable
in $(\ell^\infty(\mathcal{F}), \|\cdot\|_{\mathcal{F}})$. If $G_P$ has this property we say, for short, that $G_P$ is
sample continuous. Dudley's definition for "CLT for the empirical process
uniform over $\mathcal{F}$" is: $\mathcal{F}$ is a $P$-Donsker class iff:
(i) $G_P$ is sample continuous, and
(ii) $\nu_n \to_{\mathcal{L}} G_P$ in $\ell^\infty(\mathcal{F})$.
To relate this definition to classical theory, just note that Donsker’s theorem
on weak convergence in D(—oo, 00) of the empirical cdf can just be restated
as: The class $\mathcal{F} = \{I_{(-\infty,t]} : t \in \mathbb{R}\}$ is $P$-Donsker for all $P \in \mathcal{P}(\mathbb{R})$. The use in
statistics of this theorem rests in part on the continuous mapping theorem:
if $\mathcal{F}$ is $P$-Donsker and $T : \ell^\infty(\mathcal{F}) \to \mathbb{R}$ is continuous then
where
We then say that the bootstrap CLT for the empirical process holds in prob-
ability, or that $\mathcal{F}$ is pr-bootstrap $P$-Donsker, iff for all $\varepsilon > 0$
$$\Pr{}^*\{d_{BL^*}(\hat\nu_n, G_P) > \varepsilon\} \to 0. \qquad (2.9)$$
Here $d_{BL^*}$ is $d_{BL}$ with $E^*$ replaced by conditional expectation given the sample.
The bootstrap CLT for the empirical process holds a.s., or $\mathcal{F}$ is a.s.-bootstrap
$P$-Donsker, iff
$$d_{BL^*}(\hat\nu_n, G_P) \to 0 \quad \text{almost uniformly.} \qquad (2.12)$$
Note that the bootstrap CLT for the empirical process gives at once, via
the continuous mapping theorem, the bootstrap of a large variety of limit
theorems.
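For instance, taking $\mathcal{F} = \{I_{(-\infty,t]} : t \in \mathbb{R}\}$ and the continuous functional $T(z) = \sup_t |z(t)|$ gives the bootstrap of the Kolmogorov–Smirnov statistic. The following Python sketch (sample size, number of resamples, and function names are our illustrative choices) approximates the bootstrap quantile of $\sup_t|\hat\nu_n(t)|$.

```python
import numpy as np

def ks_process_sup(sample, reference):
    """sup_t |nu(t)| where nu is the sqrt(n)-scaled difference between the
    empirical cdf of `sample` and the empirical cdf of `reference`."""
    n = len(sample)
    t = np.sort(np.concatenate([sample, reference]))   # enough evaluation points
    F_samp = np.searchsorted(np.sort(sample), t, side="right") / n
    F_ref = np.searchsorted(np.sort(reference), t, side="right") / len(reference)
    return np.sqrt(n) * np.max(np.abs(F_samp - F_ref))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# bootstrap distribution of sup_t |nu_n^*(t)| = sqrt(n) sup |F_n^* - F_n|
boot = np.array([ks_process_sup(rng.choice(x, size=len(x), replace=True), x)
                 for _ in range(1000)])
print(np.quantile(boot, 0.95))   # approximates the 95% point of sup_t |G_P(t)|
```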
The following result states that the bootstrap always works in this general
context, even without local uniformity of the CLT in P.
Theorem 2.1 (Giné and Zinn (1990)). Let F be a measurable class of
functions on S. Then
(i) $\mathcal{F}$ is $P$-Donsker $\iff$ $\mathcal{F}$ is pr-bootstrap $P$-Donsker.
(ii) $\mathcal{F}$ is $P$-Donsker and $PF^2 < \infty$ $\iff$ $\mathcal{F}$ is a.s.-bootstrap $P$-Donsker.
We refer to Giné and Zinn (1990) for the proof. There has recently been
some interest in considering different bootstrap sample sizes. In the present
situation this amounts to taking v/" with m not necessarily equal to n. The
proofs in Giné and Zinn (1990) can be modified to yield the following:
Proposition 2.2 (Giné and Zinn, unpublished). Under the same measura-
bility as in 2.1,
(i) F is P-Donsker and m, — co => dpy.(vir,Gp) > 0 in Pr*.
(ii)F is P-Donsker, PF? < 00, mn 7 00, Ma/Mon > c > 0
=> vin +. Gp in I°(F) as.
We will not prove this proposition, but only mention the additional facts
that must be combined with the proof of Theorem 2.1 in Giné and Zinn
(1990) to yield Proposition 2.2. Their proofs are not difficult to implement
(given the published background). These are:
1) If N(m/n) is the difference of two independent Poisson variables with
mean m/n, then E maxi<n |N.(m/n)|/m? — 0 where Ny are iid. with the
law of N, and
Note that 2) means precisely that the CLT for empirical processes indexed
€ P(S). ktis also proved there that if F is UPG
in Pmly
by F holds unifor
thea,letting¢ = FUF-F=FUS-h: fh EF},
\|2,— Bolle — 9 = dat-(Gz,, Gx) > ©
Be <P(S),n=1,. _. Then a simple triangle inequality gives:
Theorem 2.3 (Giné and Zinn (1991)). IfF measurable and uniformly
boundedis UPG then
: \R. — Rolle + 0 v= +c Gu, 2 *(F). (2.13)
In Corollary 2.4 the bootstrap sample size can be taken arbitrary as long
as it tends to infinity: the same proof applies.
Therefore, $\mathcal{F}$ being $P$-Donsker, the $\ell^\infty(\mathcal{F}) \times \mathbb{R}$-valued random elements $(\nu_n, Y_n)$
converge in law to $(G_P, 0)$ (this is immediate from the definition of conver-
gence in law). Then Dudley (1985), Theorem 4.1, implies the existence of a
probability space (2,2, Pr) and perfect functions g, with Prog, = Pr such
that
similar statement (which is proved in van der Vaart and Wellner (1989)) and
describes how this result is relevant for convergence of differentiable func-
tionals. Our proof is different, closer to the proof of the continuous mapping
theorem for a single functional in Pollard (1989).
Theorem 2.6 Let $\{Z_n\}_{n=0}^{\infty}$ be random elements with values in a metric space
$(V,d)$. Suppose $Z_0$ has separable range and is measurable, and that $Z_n \to_{\mathcal{L}} Z_0$
(in the sense of (2.5), (2.6) with $\ell^\infty(\mathcal{F})$ replaced by $V$). Let $V_0$ be a separable
Borel subset of $V$ such that $\Pr\{Z_0 \in V_0\} = 1$ and let $\bar V_0$ be its completion.
For each $n \in \mathbb{N}$ let $V_n \subset V$ be a set containing the range of $Z_n$. Let $H_0 :$
$V_0 \to \mathbb{R}$ be measurable and let $H_n : V_n \to \mathbb{R}$ be functions satisfying
Then $H_n(Z_n) \to_{\mathcal{L}} H_0(Z_0)$.
Proof By Lusin's theorem (e.g. Dudley (1989, page 190)) and tightness
of $\mathcal{L}(Z_0)$ in $V_0$, for every $\tau > 0$ there exists $K \subset V_0$, $K$ compact in $V_0$, such
that $\Pr\{Z_0 \in K^c\} < \tau$ and $H_0$ is (uniformly) continuous on $K$. This and
the convergence hypothesis on $\{H_n\}$ implies that for all $\varepsilon > 0$ there exist
$\delta > 0$ and $n_0 < \infty$ such that
Let gn, go be the perfect maps prescribed by Dudley’s (1985) theorem for
Zn —¢ Zo. By the properties of these maps and the definition of convergence
in law it suffices to prove
(since then, by Dudley (1985, Theorem 3.5), $H_n(Z_n) \to_{\mathcal{L}} H_0(Z_0)$). Since for
$m \ge n_0$
$$\Big\{\sup_{n\ge m} \big|H_n(Z_n \circ g_n) - H_0(Z_0 \circ g_0)\big| > \varepsilon\Big\}$$
we have
$$\Pr\Big\{\sup_{n \ge m} \big|H_n(Z_n \circ g_n) - H_0(Z_0 \circ g_0)\big| > \varepsilon\Big\}$$
sup #{{71,..-
tar CU sC ecr—a2,
res" =
These classes have very interesting properties regarding the CLT because
their metric entropies with respect to the £2(P) distances are small uni-
formly in P (Dudley (1984)). These properties are inherited by VC-subgraph
classes of functions: a class $\mathcal{F}$ of functions is VC-subgraph if the class of sets
$\{\{(x,t) : 0 < t < f(x) \text{ or } f(x) < t < 0\} : f \in \mathcal{F}\}$ is VC. The same
is true for VC-subgraph difference classes $\mathcal{G} = \{f - g : f, g \in \mathcal{F}\}$ with $\mathcal{F}$
VC-subgraph. A typical example of a VC-subgraph class of functions is
$\{h(x,\theta) = h(x - \theta) : \theta \in \mathbb{R}\}$ if $h : \mathbb{R} \to \mathbb{R}$ is monotone. Uniformly bounded
VC-subgraph classes are UPG and, more importantly, they satisfy the follow-
ing exponential inequality due to Alexander (1984, Theorem 2.8 and 1985,
Theorem 2.2):
We can replace n~ log log n by 2-* log k and F/, by Fi, so that the above
series is dominated by
since
M?/a > e*(log k)!*°/4c and Mn? > €2-/? /(log k)?. (2.20)
We can apply Ottaviani’s inequality (e.g. Araujo and Giné (1980), page 111)
and get the series above dominated by
gk+1
In this section we prove an a.s. bootstrap CLT for M-estimators under con-
ditions close to the “non-standard” conditions of Huber (1967) and those of
Pollard (1985). The proofs are based on methods from these two papers,
where $P_n$ is the empirical measure based on $\{X_i\}_{i=1}^{n}$ i.i.d. $(P)$. We make the
following assumptions:
Theorem 3.2 Let g,P,9n,9n satisfy (A.1) - (A.6), (3.2) and (3.3). Then
Jim £(n¥/7(6% — 6,(w))) = a. lim £(n¥/76,) (3.4)
which is N(0, Ag'(CovA)Az').
Proof We proceed in three steps.
Claim 1 There exists $c < \infty$ such that, letting $a_n = (n^{-1}\log\log n)^{1/2}$,
for some K < oo. This inequality implies (3.5) for c = 4K.
Let A, = (P, —P)(A) and A, = (P, — P,)(A). As mentioned above, for
some ¢ < 00,
lim sup |A,,|/an = c a.s. (3.7)
Now, the bootstrap CLT for A(X) (Theorem 2.1, or Bickel and Freedman
(1981)) gives that for any c > 0,
and
Proof (3.10) is imediate from (A.3), (A.4) and Theorem 2.8 applied to F;,i <
m. In order to prove (3.11) using Alexander’s bound, we must estimate the
size of
onsup{Py(ry(-,8))?
= :[8]<can}.
For this we use the “square root trick” inequality in Giné and Zinn (1984,
Lemma 5.2), which gives
Pr{ay, > 4(log logn)~@*®)} < (log log n)” exp(—n/ (log log n)'+5')
for some T > 0, 6’ > 0 and all n large enough. Therefore, eventually
Now we proceed to the proof of Theorem 3.2 using the above claims and
an argument of Pollard (1985). By (3.1) and the definitions of A,, A,, 9,
and r we have: ,
And the “o” terms also tend to zero in Pr, Pr —a.s. by (3.6), (3.7) and (3.8).
Thus, replacing these limits in (3.12) gives
All the terms at the right side of this inequality tend to zero a.s. by (3.5),
(3.7) and (3.10). Hence,
and therefore, {n1/2(6,, — 0,,)} converges in conditional law a.s. to the same
limit. a
Remark 3.3 If 0(|4|?/ log |log |6||) is replaced by o(|6|*) in (3.1), if (A.4) is
replaced by Pr?(-,@) — 0, if the condition on {r(-,6) : || < A} in (A.3) is
replaced just by stochastic equicontinuity at 0, i.e. by
(Romo (1990)). These modified conditions are quite close to Huber’s (1967)
and Pollard’s (1985).
Next we give sufficient conditions for the consistency hypothesis (A.5) and
(A.6) to hold. They are slightly stronger than those in Huber (1967). Let,
as above, $P \in \mathcal{P}(S)$, $\Theta \subset \mathbb{R}^d$ be a $G_\delta$ set such that $0 \in \Theta^{\circ}$, $g : S \times \Theta \to \mathbb{R}$
jointly measurable, and assume, without loss of generality, that G(0) = 0
(where $G(\theta) = Pg(\cdot,\theta)$). The conditions that will imply consistency are as
follows:
eh (3.14)
and ’
Be'6,, ii -Onacas (3.15)
Proof — (3.14) is proved in Huber (1967). To prove (3.15) we first observe
that by the bootstrap weak law of large numbers in R and by (B.2),
P, sup
gC
9(-,0)/0(0) >, Psup 9(-,0)/0(8)
gC
as. (3.16)
and ?
Prg(-,9) +p, Pg(-,0) a.s. (3.17)
So, by (3.16), Pr —a.s., with Pr-probability tending to 1 as n > 00 we have
that for all 6 ZC,
|0€C,|6|>e
sup Pyg(-,6)— sup
GEC, |9|>e
G(8)|
(C.3) The classes of functions $\mathcal{F}_i = \{h_i(\cdot,\theta) - h_i(\cdot,0) : |\theta| \le M\}$ for some
$M > 0$, $i = 1,\ldots,d$, where $h_i$ denotes the $i$-th coordinate
of $h$, satisfy the following property: there is $m < \infty$ and uniformly
bounded measurable VC-subgraph classes $\mathcal{G}_{ij} = \{g_{ij}(\cdot,\theta) : |\theta| \le M\}$
such that $h_i(\cdot,\theta) - h_i(\cdot,0) = \sum_{j=1}^{m} g_{ij}(\cdot,\theta)$.
(C.4) For $i \le d$, $j \le m$, $\operatorname{Var}_P g_{ij}(\cdot,\theta) \le 1/(\log|\log|\theta||)^{1+\delta}$ for some
$\delta > 0$ and $\theta$ in a neighborhood of 0.
(C.5) There exist symmetric (P”-completion) measurable functions
6(21,...,2n) defined on the support of P”,n € N, such that
if. 0, = O(Xtyeg Aw) shen
nil? Peh(., 6”) >, 04.s. and 6” —0,(w) +p, 0 a.s. (3.22)
As before, consistency ((C.5)), ((C.6)) will be handled separately.
This shows that for almost every w the sequence {n}/?(H(6”) —H(8,(w)))} is
Pr- stochastically bounded: it converges weakly by the bootstrap CLT in R?.
Since 6, — 9, —p, 0 a.s. ((3.22)), hypothesis (C.2) implies Pr{|On — On |>
2|H(8n) — H(8n)|} as. 0. Hence the sequence {n/2|§% — 0,(w)|} is Pr-
stochastically bounded, w —a.s. Then, n/?0(|6, — 0,|) +, 0 a.s. and (3.29)
and (3.20) give
The consistency conditions of Huber (1967, Case B) not only give con-
sistency of 6,,, but also consistency of 6,, i.e. (C.5) and (C.6). The proof is
similar to that of Theorem 3.2.
$$g(x,\theta) = |x| - |x - \theta|.$$
Since the case d = 1 is already studied in Bickel and Freedman (1981) (their
proof contains some inaccuracies that can be fixed using e.g. Theorem 2. 8)
we will assume $d \ge 2$. The set of medians of $P$, which is convex, consists
of a single point unless $P$ is concentrated on a line (and has more than one
|| — |x — 6] - (SE + th — ee
— eae Ce, a 3-—
—_—_| < 2—_ el
jz] 2\x| |x jz? a|
it follows that
We let $\Theta = \{(\theta_1,\ldots,\theta_k) \in \mathbb{R}^k : \theta_1 \le \cdots \le \theta_k\}$, $g(x,\theta) = \min_{j\le k}(x - \theta_j)^2$
(by hypothesis (1) the $-x^2$ term is unnecessary) and $G(\theta) = Pg(\cdot,\theta)$, $\theta \in \Theta$.
By a compactness argument there always exists, for each $(x_1,\ldots,x_n) \in \mathbb{R}^n$,
a point $\theta$ in $\Theta$ that minimizes $(n^{-1}\sum_{i=1}^{n}\delta_{x_i})\,g(\cdot,\theta)$. Then, by the section
theorem in Cohn (1980, Cor. 8.5.4) there exists a universally measurable
selection $\theta(x_1,\ldots,x_n)$. We let $\hat\theta_n^{\omega} := \theta(X_1(\omega),\ldots,X_n(\omega))$ be our estimator
of $\mu$, and call it $\hat\theta_n$. For each $\omega \in \Omega$ and $n \in \mathbb{N}$, $\hat\theta_n^* = \hat\theta_n^{*\omega} := \theta(X_1^{\omega*},\ldots,X_n^{\omega*})$
is the bootstrap of $\hat\theta_n$.
Pollard (1982) proved a CLT for 6, — yu. We will show here that Theorem
3.2 implies that Pollard’s CLT can be bootstrapped a.s. under conditions
(1)-(3). (See Romo (1990) for the bootstrap in probability in R? and under
somewhat weaker conditions.)
For consistency we follow Cuesta and Matrán (1988). They show that
if $Z_n, Z_0$ are $B$-valued random variables, $B$ a uniformly convex Banach
space, such that $Z_n \to_{a.s.} Z_0$, $Z_0$ has a unique $k$-mean $\mu := \theta(Z_0)$ and
$E\min_{i\le k}\|Z_n - \mu_i\|^2 \to E\min_{i\le k}\|Z_0 - \mu_i\|^2$, then $\theta(Z_n) \to \theta(Z_0)$. They
apply this result and a Skorohod representation to show consistency of 0,
under hypotheses (1) and (2) that is, condition (A.5). This argument boot-
straps as follows: Let C = {(—oo,t] : t¢ € R}. Then by the bootstrap law of
large numbers,
$$\|P_n^* - P\|_{\mathcal{C}} \to_{\Pr^*} 0 \ \text{a.s.} \quad \text{and} \quad P_n^* \min_{i\le k}(x - \mu_i)^2 \to_{\Pr^*} P \min_{i\le k}(x - \mu_i)^2 \ \text{a.s.}$$
Therefore for each w fixed and for every subsequence there is a further sub-
sequence {n’} for which these two limits hold a.s. Let w’ be in the set where
convergence occurs (a set depending on w). By Skorohod’s representation
there are random variables Y“*’(w”),n = 0,1,..., on (Q”, &”, P”) such that
P"0(Yv")-} = Pe(w!) and P"”0(¥""")-! = P, and Y“" Yo" Plas.
Also P" minsg(¥"”
—pi)?= Pe(w!) minsga(2—pi)? 44, Pminsee(2—p:)?,
Therefore their result gives that 0(P” o(¥4")-1) — yp that is, 6( Py (w’)) > p.
Hence condition (A.6) is satisfied.
It is easily checked that if $P$ has a differentiable density $f$ at $(\mu_j +
\mu_{j+1})/2$, $j = 1,\ldots,k-1$, then $G(\theta)$ is three times differentiable at $\theta = \mu$,
hence condition (A.2) holds. If we let $M_1 = (-\infty, (\mu_1 + \mu_2)/2]$, $M_j =
((\mu_j + \mu_{j-1})/2, (\mu_j + \mu_{j+1})/2]$, $j = 2,\ldots,k-1$, and $M_k = ((\mu_{k-1} + \mu_k)/2, \infty)$,
and if we define $A_j$, $j = 1,\ldots,k$, in the same way but replacing $\mu$ by $\theta$, then
we have:
This inequality implies that $\sup_{|\theta|\le K}\|r(\cdot,\theta)\|_\infty < \infty$ for all $K < \infty$ and that
$P r^2(\cdot,\theta) \le C|\theta - \mu|$ for all $\theta$ in a neighborhood of $\mu$ and for some $C < \infty$.
Hence (A.4) holds. For each 9, r(x, 9) is the sum of k? or less functions which
are linear on an interval and zero outside it. Hence condition (A.3) is also
satisfied.
Theorem 3.2 applies and it follows that, under conditions (1)-(3) on P,
the CLT for k-means of Pollard (1982) bootstraps a.s. (for d = 1).
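A numerical sketch of this a.s. bootstrap for one-dimensional k-means, under assumptions of our own choosing (k = 2, a two-component normal mixture with a unique pair of k-means, 500 resamples); the helper `two_means` and the simulation are illustrative, not part of the chapter.

```python
import numpy as np

def two_means(x):
    """Exact 1-d 2-means: minimize sum_i min_j (x_i - theta_j)^2 by scanning
    all split points of the sorted sample (optimal clusters are intervals)."""
    xs = np.sort(x)
    n = len(xs)
    best, best_theta = np.inf, None
    cs, css = np.cumsum(xs), np.cumsum(xs ** 2)
    for m in range(1, n):                      # left cluster = xs[:m]
        mu1, mu2 = cs[m - 1] / m, (cs[-1] - cs[m - 1]) / (n - m)
        sse = (css[m - 1] - m * mu1 ** 2) + (css[-1] - css[m - 1] - (n - m) * mu2 ** 2)
        if sse < best:
            best, best_theta = sse, (mu1, mu2)
    return np.array(best_theta)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])
theta_hat = two_means(x)
boot = np.array([two_means(rng.choice(x, size=len(x), replace=True))
                 for _ in range(500)])
# conditional spread of sqrt(n)(theta*_n - theta_hat), coordinate-wise
print(np.sqrt(len(x)) * (boot - theta_hat).std(axis=0))
```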
Then
where
Let $|\theta| < \delta/2$. The sample points $X_i$ satisfying $X_i - \theta \in C_{\delta/2}$ are all a.s.
different by continuity of $P$ on $C_\delta$. Moreover, $h(x)$ is continuous at $x = X_i - \theta$
if $X_i - \theta \notin C_{\delta/2}$. Hence the function $P_n h(\cdot,\theta)$ has a jump at $\theta$ of size at most
$2\|h\|_\infty/n$ a.s. This proves, by (3.32) and (3.33), that
$$n^{1/2} P_n h(\cdot, \hat\theta_n) \to_{a.s.} 0 \quad \text{and} \quad n^{1/2} P_n^* h(\cdot, \hat\theta_n^*) \to_{\Pr^*} 0 \ \text{a.s.}$$
Theorem 3.6 and its corollary Theorem 3.10 contain the bootstrap CLT
for the most usual M-estimators in particular for the median, Huber’s esti-
mators, etc. For instance, Theorem 3.10 applies to
$$h(x) = -k\, I_{(-\infty,-k)}(x) + x\, I_{[-k,k]}(x) + k\, I_{(k,\infty)}(x)$$
under minimal conditions on P, namely that P{k} = P{—k} = 0 and
P(—k,k) # 0 (assuming Ph(-,0) = 0).
[fr — fallev
a sup{| [(fr— fo)d(@: — Q2)|/l1Q1 — Qalle : Q1,Q2 € U, |1Q1 — Qallz > 0}
tends to zero as ||R — Q||- — 0 for R and Q in U. Here is Dudley’s boot-
strap limit theorem for a single functional T (he also considers families of
functionals).
(In the second limit, since T(P,,) may not be measurable, weak conver-
gence is in the sense of Hoffmann-Jgrgensen -(2.6) with [*(F) replaced by
R.)
The conditions of Theorem 4.1 can be relaxed if we allow the bootstrap
CLT to hold only in (outer) probability. Moreover, the parametric or semi-
parametric bootstrap also holds if F is UPG.
Given n > 0, let M be such that Pr{||Gp||- > M —1} < (6/6) A (n/2)
and let no < oo be so that for n > no, Pr*A(n,M,6)° < 7, which exists
because F is P-Donsker and pr-bootstrap P-Donsker (Theorem 2.1). For
n > no, w € A(n,M,6) and w’ such that ||i%(w’)||- < M, we can apply
(4.1) to T(P,) — T(P) and T(P,) — T(P), to get (assuming, without loss of
generality, that |o(¢)| is monotone in t)
Remark 4.3 The above proof works also for T taking values in a separable
Banach space B if we further assume that fp7(X), with £(X) = P, satisfies
the CLT in B.
{n1/?||
P,, — Pa||%-} stochastically bounded
=> dpz.[n?(T(Px) — 8m), Gr4(f)] + 0 in Pr’,
where P® is the empirical measure constructed from X$. 3:3.d.( Pe).
Gill (1989) and Sheehy and Wellner (1988) approach (Hadamard) differ-
entiability via Theorem 2.5. Using versions P, of Pay P,, of P, and Gp of Gp
so that simultaneously ||7,,— Gpllr — Oand dpi (Un, Gp) — 0, a.s. we have
in (4.3) n¥/?0(||P,—P\lz) 3 0 @ —a.s. and, further using Dudley’s theorem
to get that for each&fixed (in a set of Pr-measure one) ||v,— Gate > 0a.s.,
~ we also have n1/?o0 (Pn — P,||- + ||Pa — P\lz) a.s, 0. The bootstrap CLT in
pr for n'/?(T(P,) — T(Pa)) is then obtained by passing back to the original
variables via Theorem 3.5 of Dudley (1985). Making this argument precise,
particularly if T(21,...,2n) = T(n~! O%, 6;,) is not measurable, requires
some extra care since one must prove almost uniform convergence of the
functional applied to the versions, something not always handled with rigor.
A question of some interest is how weak a differentiability requirement can
we impose on T and still obtain a pr-bootstrap limit theorem. The following
definition (which is a slight modification of one in Gill (1989) and Sheehy
and Wellner (1988)) will help to provide an answer.
Definition 4.5 Let $\mathcal{F}$ be a set of measurable functions on $S$, $P \in \mathcal{P}$, $\mathcal{C}$ a
family of convergent sequences $x = \{x_n\} \subset \ell^\infty(\mathcal{F})$ and $H$ = linear span of
$\bigcup_{x\in\mathcal{C}}(\{x_n\} \cup \{\lim x_n\})$. Let $T = \{T_n\}_{n=0}^{\infty}$ be a family of $\mathbb{R}^d$-valued functionals
defined on subsets of $\ell^\infty(\mathcal{F})$ such that the domain of $T_0$ contains $P$ and the do-
main of $T_n$ contains the set $\{P + n^{-1/2}x_n : x_n$ is the $n$-th term of $x \in \mathcal{C}\}$. Then
$T$ is $n^{-1/2}$-differentiable along $\mathcal{C}$ at $P$ if there exists a linear continuous map
$T_P' : H \to \mathbb{R}^d$, the derivative of $T$ at $P$, such that
$$n^{1/2}\big[T_n(P + n^{-1/2}x_n) - T_0(P)\big] - T_P'(x_n) \to 0 \qquad (4.4)$$
for all $\{x_n\} \in \mathcal{C}$. By continuity, $T_P'(x_n)$ can be replaced by $T_P'(x_0)$ in (4.4) if
$x_0 = \lim x_n$.
Denote by $M_{\mathcal{F}}$ the set of measures of finite total variation on $(S, \mathcal{S})$
which are in $\ell^\infty(\mathcal{F})$, and by $\mathcal{P}_{\mathcal{F}}$ the set of probability measures in $M_{\mathcal{F}}$ (i.e.
$Q \in \mathcal{P}_{\mathcal{F}}$ iff $\sup_{f\in\mathcal{F}}|Qf| < \infty$). For $x_n \in M_{\mathcal{F}}$, $x_n \to x$ will mean $\|x_n - x\|_{\mathcal{F}} :=
\sup_{f\in\mathcal{F}} |x_n(f) - x(f)| \to 0$. We let $C_u(\mathcal{F}, e_P)$ be the space of uniformly con-
tinuous functions on $(\mathcal{F}, e_P)$, where $e_P^2(f,g) = P(f-g)^2$.
The first and second parts of the following theorem are taken from Sheehy
and Wellner (1988). It strictly contains Theorem 4.1 up to some measura-
bility (which can possibly be removed), and also the main result in Yang
(1985).
Theorem 4.6 Let F be P-Donsker, let
$$\mathcal{C} = \{\{x_n\} : x_n \in M_{\mathcal{F}},\ P + n^{-1/2}x_n \in \mathcal{P}_{\mathcal{F}},\ \lim x_n \in C_u(\mathcal{F}, e_P)\} \qquad (4.5)$$
Proof of Theorem 4.6 ‘To prove (i) we set in Theorem 2.6 (V,d)
xz € V,,y € Vo. Then (i) follows directly from that theorem. (This nice proof
is Wellner’s (1989).)
The proof of (ii) uses the representation Theorem 2.5 twice in the way
outlined after Theorem 4.4. Since (ii) is just Theorem 3.3A in Sheehy and
Wellner (1988) and their proof is accurate under the present measurability
conditions, we will only prove (iii). (Note that $d_{BL^*}$ is in fact a sup over
a countable number of functions since $\mathbb{R}^d$ is $\sigma$-compact and the Lipschitz
functions on a compact set are a separable set for the sup norm.)
Under the hypothesis of (iii) the empirical process indexed by F satisfies
the compact law of the iterated logarithm (Dudley and Philipp (1983), The-
orem 1.2) with limit set K = {u,(f) = f fhdP : f h?dP <1,h in the £L,(P)
- closed linear span of { f — Pf}} (Kuelbs (1976)). Note that K C C,(F, ep).
So we have that Pr-a.s.
Let w be such that (4.8) and (4.9) hold, and let g,,n = 0,1,..., be the
perfect functions of Dudley’s theorem for {vx.,} and Gp. Then
Ny /(n'/ log logn’) — c € [0, 00) and vp(w)/(2 log log n’)'/? ao
uses Theorem 2.5 for (vn? y,) —c¢ (Gp,0) (m P(F) x (G))
which yields (v,,n7'/?%_) 0 gx — (G@p,0) © g almost uniformly and
dar+v|(da,. nV?) (Gp,0)] — 0 almost uniformly by the bootstrap
CLT and LLN, and then Dudley’s representation theorem for each @ fixed, on
the sequence {(de, n¥29, >}, Similarly in (iii), using both the bootstrap
CLT and the bootstrap LLN, we have not only (4.9) but
Remark 4.9 By Proposition 2.2 (i), statement (ii) in 4.6 also holds for any
bootstrap sample size $N = N_n \to \infty$.
_ the CLT):
Step 1. {z,}€C 3% := (P+ 2,/n'l*) 0:
This follows immediately from
(P+n'?z,)(-sup
iru
g(-,6)/b(6)) > a
pst P+n-/7z,)(g(-,8))
n~"""2_)(g(-,9)) +0
and
| n~!7z,(g(-,0)) +0
as in the proofof Theorem 3.5.
nly, —Z,(A) — 0,
References
Araujo, A. and Giné, E. (1980). The central limit theorem for real and Ba-
nach valued random variables. Wiley, New York.
Arcones, M. and Giné, E. (1990). The bootstrap of the mean with arbitrary
bootstrap sample size. Ann. Inst. H. Poincaré 25, 457-481.
Cuesta, J. A. and Matran, C. (1988). The strong law of large numbers for
k-means and best possible nets of Banach valued variables. Probability The-
ory and Rel. Fields 78, 523-534.
Giné E. and Zinn, J. (1984). Some limit theorems for empirical processes.
Ann. Probability 12, 929-989.
Giné E. and Zinn, J. (1986). Lectures on the central limit theorem for em-
pirical processes. Lect. Notes in Math. 1221, 50-113. Springer, Berlin.
van der Vaart, A. W. and Wellner, J. (1989). Prohorov and continuous map-
ping theorems in the Hoffmann-Jørgensen weak convergence theory, with
applications to convolution and asymptotic minimax theorems. Preprint.
Bootstrapping Markov Chains
K. B. Athreya and C. D. Fuh
Abstract
Introduction
Let $\{X_n;\ n \ge 0\}$ be a homogeneous ergodic (positive recurrent, irreducible
and aperiodic) Markov chain with countable state space $S$ and transition
probability matrix $P = (p_{ij})$. The problem of estimating $P$ and the distri-
bution of the hitting time T, of a state A arises in several areas of applied
probability. The application of the bootstrap method of Efron (1979) to
the finite state Markov chain case was considered by Kulperger and Prakasa
Rao (1987) and Basawa et al (1989). Athreya and Fuh (1989) discussed the
countable state space case. The general state space case is an important
open problem.
The goal of the present paper is to give a brief survey of the results of
the above-mentioned papers and also some related work of Datta and
McCormick (1990) on second order correction of a method of bootstrap pro-
posed by Basawa et al (1989) for the finite state space case. The latter paper
also considers parametric bootstrap for finite state Markov chains. No proofs
are given in the present paper.
1980 Mathematics subject classifications: Primary 62G05; Secondary 60F05, 60J10
Keywords: bootstrap estimation, central limit theorem, hitting times, Markov chains, stationary distributions, transition probabilities
' Research partially supported by NSF Grant 8803639.
2 Research partially supported by NSC of ROC Grant 79-0208-M001-63.
Since the state space S is finite, we can consider the non-parametric case
as a special case of the parametric case. So, the consistency and asymptotic
normality of the maximum likelihood estimators can be deduced using the
analogy with the multinomial distribution. This idea also can be used to
prove the consistency of the bootstrap estimators of P, given x
The consistency of $\hat\Pi_n$ for $\Pi$ follows from the strong law applied to the
renewal sequence of return times to state $i$.
THEOREM 1. For all i,
The following theorem is a central limit theorem for the maximum likeli-
hood estimator P, of P and is in Billingsley’s book (1961).
THEOREM 2.
$$\sqrt{n}\,(\hat P_n - P) \to N(0, \Sigma_P) \quad \text{in distribution},$$
where $\Sigma_P$, the variance-covariance matrix, is given by
$$(\Sigma_P)_{(i,j),(i,j')} = \frac{p_{ij}}{\pi_i}\,(\delta_{jj'} - p_{ij'}).$$
Let $x^* = \{x_0^*, x_1^*, \cdots, x_{N_n}^*\}$ be a realisation of a Markov chain with transition
probability matrix $\hat P_n$, and let $\hat P_{N_n}^*$ be the $\hat P_n$ function evaluated at this $x^*$.
For this bootstrap method Kulperger and Prakasa Rao (1989) established
the following central limit theorem.
THEOREM 3. Under the notations given above, we have for almost all real-
izations of the Markov chain $\{X_n;\ n \ge 0\}$,
$$\sqrt{N_n}\,(\hat P_{N_n}^* - \hat P_n) \to N(0, \Sigma_P) \quad \text{in distribution}$$
as $n \to \infty$ and $N_n \to \infty$, where $\Sigma_P$ is the same variance-covariance matrix as
in Theorem 2.
Let $T_k$ be the first hitting time of state $k$. That is, we let
$$T_k = \begin{cases} \inf\{n \ge 0 : X_n = k\}, & \text{if such an } n \text{ exists};\\ \infty, & \text{if no such } n \text{ exists.}\end{cases}$$
Let $\Pr(t; P) = P(T_k \le t \mid X_0 = 1; P)$ denote the probability that $T_k \le t$,
for $t \in \{1,2,3,\cdots\}$, where $P$ is the transition probability matrix of a Markov
chain $X = \{X_n;\ n \ge 0\}$ with initial state $X_0 = 1$.
For any $k \times k$ stochastic matrix $P$, let $A = A(P)$ be the stochastic matrix
which is the same as $P$ except that the $k$-th row is replaced by $(0,\cdots,0,1)$,
with 1 in the $k$-th position. Note that
$$\Pr(t; P) = (A^t)_{1k}. \qquad (*)$$
The bootstrap estimate of the distribution $\Pr(\cdot\,; P)$ of the hitting time $T_k$ is
$\Pr(\cdot\,; \hat P_n)$. From (*) and the fact that $\hat P_n \to P$ with probability 1, we have
for each $t$,
$$\Pr(t; \hat P_n) - \Pr(t; P) \to 0 \quad \text{w.p.1 as } n \to \infty.$$
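A small sketch of the identity (*) and its bootstrap plug-in; the 3-state matrix, path length, and function name below are illustrative assumptions. Replace the k-th row of P by a unit mass at k, take matrix powers, and read off the (1, k) entry; the bootstrap estimate simply substitutes the MLE of P.

```python
import numpy as np

def hitting_time_cdf(P, k, t_max):
    """Pr(T_k <= t | X_0 = 1) for t = 1,...,t_max via (*): Pr(t;P) = (A^t)_{1k},
    where A equals P with the k-th row replaced by the unit mass at k."""
    A = P.copy()
    A[k, :] = 0.0
    A[k, k] = 1.0
    out, At = [], np.eye(len(P))
    for _ in range(t_max):
        At = At @ A
        out.append(At[0, k])          # state "1" is index 0 here
    return np.array(out)

# illustrative 3-state chain and its MLE from a simulated path
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
rng = np.random.default_rng(0)
path = [0]
for _ in range(2000):
    path.append(rng.choice(3, p=P[path[-1]]))
counts = np.zeros((3, 3))
for a, b in zip(path[:-1], path[1:]):
    counts[a, b] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(hitting_time_cdf(P, k=2, t_max=5))      # Pr(t; P)
print(hitting_time_cdf(P_hat, k=2, t_max=5))  # bootstrap plug-in Pr(t; P_hat)
```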
Here the problem is to estimate the distribution of
$$G_n(t; P) = \sqrt{n}\,\big(\Pr(t; \hat P_n) - \Pr(t; P)\big).$$
The bootstrap approximation to the above distribution is the conditional
distribution of
$$G_n(t; \hat P_n) = \sqrt{n}\,\big(\Pr(t; \hat P_n^*) - \Pr(t; \hat P_n)\big)$$
The problem here is to verify that these two distributions are asymptotically
close. Kulperger and Prakasa Rao (1987) obtained the next two theorems.
THEOREM 4. Let A, =A el: Then for all t = 1,2,3,---, we have
Z} = Var(AU + UA).
where U ~ N(0 and D4), for any k x k stochastic P, Up is as in Theorem 2.
Then
co
in the finite state space case except that the resample size is changed from
$n$ to $N_n$:
1) With $\hat P_n$ as its transition probability, generate a Markov chain real-
ization of $N_n$ steps $x^* = (x_0^*, x_1^*, \cdots, x_{N_n}^*)$. Call this the bootstrap
sample, and let $\hat P_n^* = \hat P(N_n, x^*)$. Note that $\hat P_n^*$ bears the same relation
to $x^*$ as $\hat P_n$ does to $x$.
2) Approximate the sampling distribution $H_n$ of $R(x, P)$ by the condi-
tional distribution $H_n^*$ of $R(x^*, \hat P_n) = (\hat P_n^* - \hat P_n)$ given $x$ (a sketch of this
resampling step in code is given below).
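A hedged code sketch of Method I (our own simulation; for simplicity the resample length N_n is set equal to the observed length n, and the two-state chain is an arbitrary example): estimate P̂_n from the observed path, simulate a fresh path of N_n steps from P̂_n, and re-estimate to obtain P̂*_n.

```python
import numpy as np

def estimate_P(path, n_states):
    """Maximum likelihood transition matrix: transition counts over visit counts."""
    C = np.zeros((n_states, n_states))
    for a, b in zip(path[:-1], path[1:]):
        C[a, b] += 1
    rows = C.sum(axis=1, keepdims=True)
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)

def method_I_resample(path, n_states, N_n, rng):
    """Method I: simulate a chain of N_n steps from P_hat and re-estimate."""
    P_hat = estimate_P(path, n_states)
    x_star = [path[0]]
    for _ in range(N_n):
        x_star.append(rng.choice(n_states, p=P_hat[x_star[-1]]))
    return estimate_P(x_star, n_states) - P_hat   # bootstrap pivot P*_n - P_hat

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.3], [0.4, 0.6]])
path = [0]
for _ in range(1000):
    path.append(rng.choice(2, p=P[path[-1]]))
reps = np.array([method_I_resample(path, 2, len(path) - 1, rng) for _ in range(300)])
print(reps.std(axis=0))   # conditional spread of the bootstrap approximation H_n^*
```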
Method II.
The existence of a recurrent state A which is visited infinitely often (i.0.)
for a recurrent Markov chain is well-known. A well-known approach to its
limit theory is via the embedded renewal process of returns to A. This is the
so-called regeneration method. For a fixed state $A$, by the strong Markov
property, the cycles $\{X_j;\ j = T_{n-1}, \cdots, T_n - 1\}$ are i.i.d. for $n = 1,2,\cdots$,
where $T_n$ is the time of the $n$-th return to $A$.
Fix an integer $k$ and observe the chain up to the random time $n = T_k$.
Let
{Xo,X1,°°- a
Deeat Ato Li ee ag i ea
(That is, for each m, there is an im such that T;,i,,-1 < Tim < Tk,in-)
Define
Y, = Thi, — Th
Y2 = Ty,i. — Th2
ae M
ieee
Fr (t) = 7 Perey
jot
The naive bootstrap will consist of fixing Y;,Y2,--- , Ym, and drawing i.i.d.
samples Y,*, Y;,--- , Yn, distributed as Fix (-) and defining
: L N, ire
Fy) =D1}
Nn 44
<0)
where
Nn
m\”) = S~ I(Xnt = i);
t=0
Then
it,(t) — 7%; —> 0 in probability.
The weak law of Theorem 1 suggests the possibility of a central limit the-
orem for 7,(-). This turns out to be somewhat intractable and Athreya
and Fuh (1989) address instead a related question motivated by the boot-
strapping of $P$. Let $\hat p_n(i,j)$ denote the proportion of $(i,j)$ transitions to the
number of visits to $i$ in $\{X_t;\ 0 \le t \le N_n\}$. That is, let
$$m_i^{(n)} = \sum_{t=1}^{N_n} I(X_t = i),$$
$$n_{ij}^{(n)} = \sum_{t=0}^{N_n - 1} I(X_t = i)\, I(X_{t+1} = j),$$
and
$$\hat p_n(i,j) = \begin{cases} n_{ij}^{(n)}/m_i^{(n)}, & \text{if } m_i^{(n)} > 0;\\ 0, & \text{otherwise.}\end{cases}$$
Then
$$\sqrt{N_n}\,(\hat p_n - P) \to N(0, \Sigma_P) \quad \text{in distribution},$$
as $n \to \infty$ and $N_n \to \infty$ independently of each other, where $\Sigma_P$ is the same
variance-covariance matrix as in Theorem 2.
Here the convergence in distribution means that for any finite set A of
pairs (2,7),
{VNn Balt,5)— Pris); (5) € A} — NO,(Ep)a),
where (ip), is a block diagonal matrix involving the states in A. In partic-
ular, if A = {1,2,--- ,k}, then
et?) 0 0
Ov LTP) 0
(Up)a= an
0 0 tL (P)
This suggests using $\hat P_n$ and $\hat\pi_n$ as estimators of $P$ and $\pi$ respectively, and
in order to obtain confidence intervals we need to look at the distributions
of $(\hat P_n - P)$ and $(\hat\pi_n - \pi)$.
PROPOSITION 2. (Derman (1956))
Let X = {X,; n > 0} be a homogeneous irreducible, positive recurrent
Markov chain. Then
See Derman (1956) for an explicit form of 5%. Here again the convergence
has the same meaning as Theorem 8.
In order to obtain confidence intervals for P and 7, one can use Proposi-
tions 3 and 4, but use L, and uP in place of Up and X3 respectively.
An alternate approach to finding confidence intervals is to use the method
of bootstrap. Here for the pivotal quantity
Vn( Pp om Ply:
Vk S| 1 (e(é) — Fe(2))
i=1
can be written as
k * k
Jk (2s!f(nz) See ies
Sever T; py T;
where for any cycle n, f(n) = >o;-, ligi(n), and gi(n) is the number of visits
to 2. Since
Lem)<(So DT)
the hypothesis E,T2 < oo implies E,a(f(n))? < oo yielding the following
extension of the above theorem.
THEOREM 12. With the above notations, if ExT? < oo, then for any finite
subset A of the state space S, and for almost all realizations of {X,; n > 0},
THEOREM 13. With the notations given above, if ExT < oo, then, for
almost all realizations of the process {Xn; n > 0}, we have for each i,j,
With the same argument as in Theorem 12, we have the following extension
of Theorem 13.
THEOREM 14. With the above notations, if E,T2 < 00, then for any finite
subset A of S x S, and for almost all realizations of {X,; n > 0}
where &* is the variance covariance matrix with the (2,7), (2',7')** element
is
O(i,3),(,3")
= Cov (Ba his(n) —_Ehi(n)
SR 9 ee vpn) Bhoye(n)
AEN ell FV )
Bon) Bain)?
1
Ege(n) —Ege(ny 2
= Pair. < (his (7), hije(7))
Ehi;(n)Ei (0) Cov (9i(7),
9(7))
(Egi(n)Egi(n))?
_ 9 Ehij(n) Cov (hi;(7), 9i(7))
Eg:(n)(Egi(n))?
25 Ehi;(n) Cov (hir5(7), gi(n)) 5
(E9:(n))?
Egi (7)
Define
$$n_{ij}^* = \sum_{t=1}^{n_i} I(W_{it} = j),$$
and
$$\hat p_{ij}^* = \frac{n_{ij}^*}{n_i},$$
and let $H_n^*$ be the conditional distribution, given the data, of
$$\{\sqrt{n}\,(\hat p_{ij}^* - \hat p_{ij});\ i, j \le k\}.$$
Basawa et al show (using the fact that multinomial goes to multivariate
normal) that H* converges w.p.1 to the same multivariate normal as the
limit of H, (as claimed in Theorem 2), thus showing that this bootstrap
method is consistent.
In a recent paper Datta and McCormick (1990) have investigated the ac-
curacy of the above method and have established the following theorems.
=0 ralase | ?
-0(c)
P (eh Nl (i ij (1— pi; )) 17? ==
They also obtain an Edgeworth type expansion for the case when $p_{ij}$ is
irrational.
The asymptotic accuracy of the method proposed by Basawa et al. is at most
of order $O(n^{-1/2})$. Datta and McCormick (1990) propose three modified
bootstrap schemes and obtain Edgeworth expansions for each of them. See
Datta and McCormick (1990) for details.
References
By
P.J. Bickel
Univ. of California, Berkeley
$$\int \nabla \log p(x,\theta)\, dF(x) = 0$$
where $\nabla$ is the gradient, corresponds to $\hat\theta = $ MLE. We also assume that $\hat\theta$ is
asymptotically linear with influence function $\psi(\cdot,F)$; see Hampel et al. (1986). That
is,
$$\int \psi(x,F)\, dF(x) = 0$$
and
$$\sigma^2(F) = \int \psi^2(x,F)\, dF(x) < \infty,$$
for all $F \in \mathcal{F}$. Thus if $\mathcal{L}(\cdot \mid F)$ denotes the distribution of a function of
$X_1, \ldots, X_n$ under $F$, then
$$T_n(\hat F_n, F) = \sqrt{n}\,(\hat\theta - \theta)/\hat\sigma$$
and if we know the exact distribution $\mathcal{L}(T_n(\hat F_n, F) \mid F)$ we are led to the exact
$1 - \alpha$ LCB
a 6
Gin Se)
where
Le {n ®goori
— 9B00T2)} > L(A) # 0
6) = fw? F,)dF,(),
the nonparametric estimate of o? (Fy) while
6; = 0)
where ¥ is the MLE of y. Then r, # rp unless oy is also efficient for Fo. Thus
the bounds are not equivalent to second order. As a consequence we note the
following phenomenon. Hall (1988) shows that the Efron parametric and non-
parametric $BC_a$ bounds are second order equivalent to bootstrap $t$ bounds
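For reference, here is a minimal sketch of a bootstrap-t lower confidence bound of the kind being compared (the sample, the nominal level, and the use of the mean with its sample standard deviation are our illustrative assumptions, not Hall's or the author's construction).

```python
import numpy as np

def boot_t_lcb(x, alpha=0.05, B=2000, seed=0):
    """Bootstrap-t lower confidence bound for the mean:
    estimate the quantiles of T = sqrt(n)(theta_hat - theta)/sigma_hat by
    its bootstrap analogue and invert."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n, m, s = len(x), x.mean(), x.std(ddof=1)
    t_star = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)
        t_star[b] = np.sqrt(n) * (xb.mean() - m) / xb.std(ddof=1)
    # LCB: theta_hat - (upper quantile of T*) * sigma_hat / sqrt(n)
    return m - np.quantile(t_star, 1 - alpha) * s / np.sqrt(n)

x = np.random.default_rng(3).exponential(size=40)
print(boot_t_lcb(x))
```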
2) If Tyg= Ta +
n =
ceO@ = Om + PO 4 o@.
Vn
Since
= of) + O,@')
we obtain that (2.6) holds with
since then,
OBact
6% ~BWacr ee=eeae
2 etAn ed maA
= Op (n"')
But if,
¥ (6) — @) a (6 — 9) 3 An
V =0 and (2.11) follows. For D we also need a result from the theory of
efficient estimation (see Pfanzagl (1981), Bickel, Klaassen, Ritov, and Wellner
(1991)) which we again state without explicit regularity conditions. These may
however be found in the works cited above.
(6, — 64)
Va On Pax (2.16)
82
A,
ye Ue ge
x Vn
where
A, = U,(n"!?
E28) (ry (KF) — 12 (XK,F))o! ®)
+ O, (1).
To make these results rigorous we need to justify (2.7), (2.8), (2.9) and sub-
stitution of the random c, (F,) into (2.7). When the estimates are smooth func-
tions of vector means n! D2, M(X,), M)x1, the argument is due to Cibisov
(1973) and Pfanzagl (1981), see also Hall (1986). In general the idea is:
a) To expand T(F,,F) in a von Mises or Hoeffding expansion and show that
the remainder after three terms can be neglected in the Edgeworth expansion.
That is, write for suitable aj
and Bickel and Freedman (1980) have to be employed to get by the failure of
Cramer’s condition due to the discreteness of F,. Substitution of c,(F,) in
(2.7) can be justified once we express T (F,, F) — c, (F,) in the form (2.17) by
using the inversion of (2.8) for c, (F,).
Pp[O*
< O(F)-5,] < Pp[O< O()-8] + om). (3.1)
In fact to avoid superefficiency phenomena we essentially have to require
second order correctness to hold uniformly on shrinking neighbourhoods of
every fixed F and then require (3.1) to similarly hold uniformly. If o(n7) is
replaced in (3.1) by o(1) then e* is first order efficient. It is shown in Pfan-
zagl (1981) and Bickel, Cibisov, and van Zwet (1981) that first order efficiency
On the other hand if we use $\hat\sigma^2 = \int \psi^2(x, \hat F_n)\, d\hat F_n(x)$ but use the parametric
bootstrap, when F € Fo, we are, in general, first order correct. The reason is
that,
References
Bai, C., Bickel, P.J. and Olshen, R. (1989). The bootstrap for prediction.
Proceedings of an Oberwolfach Conference, Springer-Verlag.
Bickel, P.J. and Freedman, D.A. (1981). Some asymptotics on the bootstrap.
Ann. Statist. 9, 1196-1217.
Bickel, P.J., Chibisov, D.M. and van Zwet W.R. (1981). On efficiency of first
and second order. International Statistical Review 49, 169-175.
Bickel, P.J., Gotze, F., and van Zwet, W.R. (1983). A simple analysis of
third-order efficiency of estimates. Proceedings of the Berkeley Conference in
Honor of Jerzy Neyman and Jack Kiefer. Wadsworth. Belmont.
Bickel, P.J., Gotze, F. and van Zwet, W.R. (1989). The Edgeworth expansion
for U statistics of degree two. Ann. Statist. 14, 1463-1484.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
SIAM. Philadelphia.
Hall, P.J. (1986). On the bootstrap and confidence intervals. Ann. Statist. 14,
1431-1452.
Hampel, F., Ronchetti, E., Rousseeuw, P., Stahel, W. (1986). Robust statistics:
the approach based on influence functions. J. Wiley. New York.
Parr, W. (1985). The bootstrap: Some large sample theory and connections
with robustness. Stat. Prob. Letters 3, 97-100.
ABSTRACT
This paper investigates the scope of bootstrap schemes based on i.i.d.
resampling for estimating the sampling distribution of the m.l.e. $\hat p_{ij}$ of a transi-
tion probability $p_{ij}$ of a finite state Markov chain.
The asymptotic accuracy of a bootstrap method proposed by Basawa et al.
is studied. It is shown that the best rate possible with this method is $O(n^{-1/2})$,
where $n$ is the sample size. Three modified bootstrap schemes are proposed for
the above problem. It is shown that an Edgeworth correction is possible with
each of these new methods when estimating the sampling distribution of stand-
ardized $\hat p_{ij}$, if $p_{ij}$ is irrational.
1. Introduction
A number of bootstrap methods for estimating the sampling distribution of
hitting time, transition counts and proportions of a Markov chain have recently
been proposed by various authors. Basawa et al. (1989) and Kulperger and
Prakasa Rao (1989) considered Markov chains with finite state spaces; the
countable state space case has been investigated by Athreya and Fuh (1989).
Whereas the asymptotic validity of these methods has been established by
the respective authors, nothing is known about their rates of approximation. In
this paper, we study the asymptotic rates of a method proposed by Basawa et
al. (1989). This method is easy to implement in practice because, given the ori-
ginal data, the bootstrap distribution is constructed using i.i.d. resampling.
We show that the best rate possible with this method is $O(n^{-1/2})$, where $n$
is the sample size, which is the same as that of the classical normal approxima-
tion. This method is referred to as "conditional bootstrap" by the previous
authors. The main reason it fails to be any better is that this approximation is
based on a rather naive i.i.d. sampling which cannot account for a part of the
skewness term arising from the dependent structure in the original sample.
$$\hat p_{ij} = \begin{cases} \dfrac{n_{ij}}{n_i}, & \text{if } n_i > 0,\\[4pt] 0, & \text{otherwise}, \end{cases} \qquad (1.1)$$
with
$$n_{ij} = \sum_{t=1}^{n} [X_t = i,\ X_{t+1} = j], \qquad n_i = \sum_{t=1}^{n} [X_t = i],$$
where, for a set $A$, $[A]$ stands for its indicator function. The problem is to
bootstrap the sampling distribution of $\hat p_{ij}$ on the basis of the original data
$X_1, \cdots, X_{n+1}$. The following is a method proposed by Basawa et al. with
motivation deriving from a multinomial type representation of the Markov
chain {X,}. See Basawa et al. for the details.
Given the original data $X_1, \ldots, X_{n+1}$, obtain $\hat p_{ij}$ by (1.1). Then generate
independent random variables $W_{it}$ with probability mass function
$$P^*(W_{it} = j) = \hat p_{ij}, \qquad 1 \le i, j \le N,\ t \ge 1. \qquad (1.2)$$
Define
$$n_{ij}^* = \sum_{t=1}^{n_i} [W_{it} = j], \qquad (1.3)$$
and $\hat p_{ij}^* = n_{ij}^*/n_i$.
Then the distribution of $n_i^{1/2}(\hat p_{ij}^* - \hat p_{ij})$, given $X_1, \ldots, X_{n+1}$, is a bootstrap approxi-
mation to the distribution of $n_i^{1/2}(\hat p_{ij} - p_{ij})$.
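A sketch of this resampling scheme in code (the two-state chain, path length, and function name are illustrative assumptions): for the pair (i, j), draw n_i i.i.d. variables W_it from the estimated row p̂_{i·}, count how many equal j, and form p̂*_{ij} = n*_{ij}/n_i.

```python
import numpy as np

def conditional_bootstrap_pij(path, i, j, n_states, B, rng):
    """I.i.d.-resampling bootstrap for a single transition probability p_ij:
    returns B replicates of n_i^(1/2) (p*_ij - p_hat_ij)."""
    path = np.asarray(path)
    from_i = path[:-1] == i
    n_i = from_i.sum()
    p_hat_row = np.bincount(path[1:][from_i], minlength=n_states) / n_i   # (1.1)
    reps = np.empty(B)
    for b in range(B):
        W = rng.choice(n_states, size=n_i, p=p_hat_row)   # (1.2)
        p_star = np.mean(W == j)                          # n*_ij / n_i, cf. (1.3)
        reps[b] = np.sqrt(n_i) * (p_star - p_hat_row[j])
    return reps

rng = np.random.default_rng(0)
P = np.array([[0.6, 0.4], [0.3, 0.7]])
path = [0]
for _ in range(1500):
    path.append(rng.choice(2, p=P[path[-1]]))
reps = conditional_bootstrap_pij(path, i=0, j=1, n_states=2, B=2000, rng=rng)
print(reps.std())   # compare with sqrt(p_01 (1 - p_01)) = 0.49
```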
Basawa et al. showed that this approximation is asymptotically valid, in
sup norm on R, along almost all sample paths X, ..., X,,;. In this paper, we
obtain a one-term Edgeworth expansion for the bootstrap distribution. Com-
parisons of this with the corresponding expansion for the sampling distribution
yield the rate results mentioned earlier. These results are presented in the next
section.
Section 3 describes some modified bootstrap methods and the correspond-
ing rate results. As mentioned earlier, the main result of that section is that a
better rate of approximation is possible with these methods if $p_{ij}$ is irrational.
All the proofs are presented in Section 4.
Theorem 2.1
Let $0 < p_{ij} < 1$. Then
The next theorem shows that a better rate of convergence is possible for
the bootstrap approximation to the sampling distribution of the standardized p;;.
Theorem 2.2 :
Let $0 < p_{ij} < 1$. Then
A comparison of the one term Edgeworth expansions would reveal that the
above rate cannot be improved. The Edgeworth expansion for the distribution
of standardized $\hat p_{ij}$ has been obtained by Datta and McCormick (1990). We only
present the case when $p_{ij}$ is irrational. In the other case, the expansion will
have an additional discontinuous $n^{-1/2}$ term. The Edgeworth expansion for the
bootstrap approximation follows from Theorem 2.2 of Datta (1989).
Theorem 2.3
Suppose that for some 1 <k <N, py > 0. If pj is irrational then
Remark 2.1 The above theorem can be obtained under a more general condi-
tion. See Datta and McCormick (1990) for details.
Theorem 2.4
Let $0 < p_{ij} < 1$. Then
aK n-“n;(p;j—B;))
< x} = O(x)+
n(x)(1—-x?)(1—2p;)
(6;6;(1-B;))* 6(p;p;(1—p;))”
n*9(x)Q(n*x(6;h;;(1-f;))*)
tt -4
(pp,(1-p,))” sea
uniformly in $x$, where $Q(x) = [x] - x + \tfrac{1}{2}$.
It can be seen from the above two theorems that the $n^{-1/2}$ terms in the two
expansions do not match. Hence the closeness of the bootstrap approximation
to the true sampling distribution cannot be any smaller than $O(n^{-1/2})$.
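As a numerical illustration, one can compare the plain normal approximation with the classical one-term Edgeworth expansion for a standardized binomial proportion (a textbook formula we are supplying for illustration, with the lattice term Q of Theorem 2.4 omitted); the $n^{-1/2}$ skewness term is the term whose dependence-driven part the naive i.i.d. scheme cannot reproduce, as noted in the Introduction.

```python
import numpy as np
from scipy.stats import norm, binom

def edgeworth_cdf(x, n, p):
    """One-term Edgeworth expansion for P{ sqrt(n)(p_hat - p)/sqrt(p(1-p)) <= x }
    (continuous part only; the lattice term Q(.) of Theorem 2.4 is omitted)."""
    skew_term = (1 - 2 * p) * (1 - x ** 2) * norm.pdf(x) / (6 * np.sqrt(n * p * (1 - p)))
    return norm.cdf(x) + skew_term

n, p, x = 50, 0.3, 1.0
exact = binom.cdf(np.floor(p * n + x * np.sqrt(n * p * (1 - p))), n, p)
print(exact, norm.cdf(x), edgeworth_cdf(x, n, p))
```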
K,
> 6? - ph), if ij
k=1
ee (3.1)
f. = (3.2)
0, otherwise.
and
Let $[X_t = i,\ X_{t+1} = j]$ be a transition count. In place of the naive bootstrap esti-
mate $\sum_t [X_t = i,\ X_{t+1} = j] \,/\, \sum_t [X_t = i]$ we will consider a bootstrap estimate of
the form
$$\sum_{t=1}^{n} I_{it}^*\,[X_t = i,\ X_{t+1} = j] \Big/ \sum_{t=1}^{n} [X_t = i],$$
where the $I_{it}^*$ are i.i.d. weights, with $E^* I_{it}^* = 1$ and $E^* I_{it}^{*2} = \hat r_{ij}$.
More precisely, given $i$ and $j$, generate i.i.d. $I_{it}^*$, $1 \le t \le n$, such that
and
fii, if fj Sale
B; (3.7)
1, if %j <1,
Piij
a Ao:2 Rita ijibe ’ (3.8)
Aj(Bi; — Aj)
P2ij
Shope cethie
2
te° (3.9)
By(By — Ajj)
Let
nig = D VilXt = i
nN %* .
(3.10)
t=1
and
n % . * .
Then
Pij =
0, otherwise,
Theorem 3.1
Along almost all sample paths X,, X,, --- ,
=s A log Nv
(a) sup IP(nj*(6;-pi) < x} — P*(nj“*ni@y(pij—B;)) < x} |= OC yee
and
nj;/nig ; if Nici) # 0,
0, otherwise.
Then $\hat p_{ij}^*$ is our bootstrap estimate of $p_{ij}$ under this method. In order to remove
the effect of latticeness and obtain an $o(n^{-1/2})$ rate, we need to add a correction
term to the standardized $\hat p_{ij}^*$:
Theorem 3.2
Along almost all sample paths X;, X2,
(a’) sup IP(n-™n,(6;; — pi) $x} — P*(n“*nXi op," - By) <x} 1 = 0 log fs Dy
if0< Py < 1;
moreover if the conditions of Theorem 2.3 hold then
(b)
-Y% 7a -Y4 ¥ 7, # a
n nN; — ee n Yn. . oo ee
where 1) is a zero mean random variable (independent of W’’s and Ii;’s) having
finite absolute third moment and compactly supported characteristic function.
and
[F+1]}+€,, if tj >1
— (3.13)
1+ &,> if ij <1
where 0 < €, — €, and € is irrational. Define Piijand Poi as pj; and pj; with
n;
* * * .
ny = D Vie (Wir = J)
=
* *” 7 */”
Nij / Nj) 5 if Ni) ye 0,
7?
Pye
0 , otherwise,
defines a bootstrap estimate of fj; For this third method we can conclude the
following:
Theorem 3.3
Along almost all sample paths Xj, Xo, ...,
if 0 < py < 1;
moreover, if the conditions of Theorem 2.3 hold then
bate A
Pi)<x}x}!! =o=o(
n-“n-*..
< x) — Py Pi
p::
™.)
(D:. —
(B;H,(1 — pi)”
5 (PipPy(l — pi)”y
To the best of our knowledge, the present paper is the only work so far
studying the rates of bootstrap approximations for Markov chains. It is hoped
that this paper will initiate further studies in this direction. In particular, the
following questions deserve investigations:
(1) How does the natural parametric bootstrap (see, e.g., Basawa et al.,
Athreya and Fuh (method I)) perform in terms of rates of approximations?
To answer this question, one essentially needs to establish a continuous
version of the Edgeworth expansion for $\hat p_{ij}$.
(2) How to bootstrap the joint distribution of $(\hat p_{ij})$ with a rate better (if possi-
ble!) than that of normal approximation? The parametric bootstrap may be
a candidate for this. Clearly, the method in this section can handle only
one $\hat p_{ij}$ at a time.
4. Proofs
We are going to use the following terms and notations throughout this sec-
tion. For any sequence $V_1, \ldots, V_n$ of i.i.d., non-degenerate random variables
with finite second moment, $n^{1/2}(\bar V - \mu)/\sigma$ will be called the standardized sum of
the $V$'s, where $\bar V = n^{-1}\sum_t V_t$, $\mu = EV_1$, $\sigma^2 = E(V_1 - \mu)^2$. A sequence of distri-
butions $\{F_n\}$ on $\mathbb{R}$ is said to converge in $d_3$ to a distribution $F$ on $\mathbb{R}$, written as
$F_n \xrightarrow{d_3} F$, if $\{F_n\}$ converges to $F$ weakly, and $\int |x|^3\, dF_n \to \int |x|^3\, dF$. $\Phi$ and $\phi$
will denote the standard normal c.d.f. and p.d.f. respectively.
Let $Y_t = (X_t, X_{t+1})$, $t \ge 1$, where $\{X_t\}$ is the original Markov chain. Then
$\{Y_t\}$ is also a Markov chain with state space $S_2 = \{(u,v) : 1 \le u, v \le N,\ p_{uv} > 0\}$,
which has at most $N^2$ states. Moreover the stationary distribution for
$Y_t$ is given by $\rho_{(u,v)} = p_u\, p_{uv}$.
Since $\{X_t\}$ is irreducible, aperiodic, and has a finite state space, there exist
$M < \infty$ and $\rho < 1$ such that
$$\max_{1\le i,j\le N} |p_{ij}^{(k)} - p_j| \le M\rho^k, \quad \text{for all } k \ge 1. \qquad (4.1)$$
Consequently,
$$\max_{(u,v),(x,y)\in S_2} \big|p_{(u,v),(x,y)}^{(k+1)} - p_x\, p_{xy}\big| \le M\rho^{k}, \quad \text{for all } k \ge 1,$$
where $p_{(u,v),(x,y)}^{(k)}$ denote the $k$-step transition probabilities for $\{Y_t\}$. Therefore
$\{Y_t\}$ satisfies the basic condition (0.1) of Nagaev (1961), for some $k_0$ large
enough.
Fix $1 \le i, j \le N$. Define
$$f(Y_t) = [X_t = i,\ X_{t+1} = j] - p_{ij}\,[X_t = i].$$
n-“n,(B;-Pi,) 27 2K :
SEC
(4.4)
(ppy(1-p,))* no -
Theorem 1 of Nagaev (1961) will be used to prove Theorems 2.1 and 2.2.
$$\sup_x \Big|P\Big\{\frac{n^{-1/2}\,n_i(\hat p_{ij} - p_{ij})}{(p_i\, p_{ij}(1-p_{ij}))^{1/2}} \le x\Big\} - \Phi(x)\Big| = O(n^{-1/2}), \qquad (4.5)$$
On the other hand, since $n_{ij}^* \sim$ Binomial$(n_i, \hat p_{ij})$ under $P^*$ and
$\hat p_{ij}^* = n_{ij}^*/n_i$, by the ordinary Berry–Esseen Theorem for i.i.d. summands it fol-
lows that
$$\sup_x \Big|P^*\Big\{\frac{n_i^{1/2}(\hat p_{ij}^* - \hat p_{ij})}{(\hat p_{ij}(1-\hat p_{ij}))^{1/2}} \le x\Big\} - \Phi(x)\Big| \le \frac{\text{constant}}{(n_i\, \hat p_{ij}(1-\hat p_{ij}))^{1/2}} = O(n^{-1/2}), \qquad (4.6)$$
a.s. $(P)$, because $\liminf_n \hat p_{ij}(1-\hat p_{ij}) = p_{ij}(1-p_{ij}) > 0$ and $n_i/n \to p_i > 0$, a.s. $(P)$.
n-— eo
Theorem 2.2 follows, by the triangle inequality, from (4.5) and (4.6). O
Bj A 1 * a
subir og n;*(B;; — py) < x} — P*{n;4(pj — py) < x}!
1
ae ee i (495
(p;(1-p;))* — (6(1-B,))”
+ P{ 1p;
— pj! > ep;},
where
Z, = (Pin n;*(P;-P;,)
Pi (p(1-p;))”
By (4.5) and the mean value theorem, the first two terms in (4.9) are
$O(\varepsilon) + O(n^{-1/2})$. The third term is no more than
y%
pin € 1
A ee On),
=oe *)+0m~”,,
= (0),: if e= 407log n
1
ye
Lemma 4.1
For any $1 \le i, j \le N$,
8; > Sj as. (P), asn > ©,
Proof
For i = j , Sy=Sj=1- limpg® = 1—p;. Hence the conclusion is
immediate from the fact that p; — p; a.s. (P).
The conclusion in the $i \ne j$ case will follow if we prove that
Lies
a (k-1) _ ps? = pi) os pi! —> 0, as. (P). (4.10)
Since $\hat P \to P$ a.s., it follows that each summand in the above sum con-
verges a.s. to zero, as $n \to \infty$. It can be seen from (4.1) that for $k_0$ large
enough,
Fix any r, € (1,1) where r denotes the LHS of (4.11). Since P > P,
for almosts all sample paths, for large enough n (depending upon the path).
Therefore (see Nagaev), a stationary distribution f = (6;) for P exists and
ROME ee gs PR Rare
(1-2f;+38;,8;)b:5,(1-f;;), if Pij*>>
* *3
U3, =E Y,~ =
. A on 1
0, if Pij = 3 a
Since Pi=> pj. > 0, Bj =) Pij E (0,1) and Si —- Si, we get that
, b3
iensup - <oo, a.s. (P). Hence by the Berry-Esseen Theorem for i.i.d.
n
*
n “nici(Pi7-Py)
x (f;P;(1-P ;:))” <x} - O&)!| = OM), (4.13)
iPij ij
a.s. (P). The rest of the proof is essentially the same as that of Theorem 2.1.
(b) Let Y,. be a random variable generated in the same way as Yi with p;
replaced by Pi» Bj by Pij» Bi by Tj =1+ 3Pij S;/ (1—2p,). Then = has a
non-lattice distribution F;; (say) since Pij is irrational, with
uniformly in x, a.s. (P). The proof now ends by Theorem 2.3 and the triangle
inequality. O
Next we prove a general result for i.i.d. summands which will be used to prove Theorem 3.2 (b). This result is a continuous version of a special case of the Corollary in Babu and Singh (1989).
Theorem 4.1
Let Y₁, ..., Yₙ be i.i.d. with distribution G = Gₙ such that ∫ y dG = 0. If G →(d₃) F such that ∫ y² dF > 0 and ∫ |y|³ dF < ∞, then
    P{ n^{-1/2} Σᵢ Yᵢ / σ_G ≤ x } = Φ(x) + n^{-1/2} μ₃,G (1 − x²) φ(x) / (6 σ_G³) + o(n^{-1/2}),   (4.14)
uniformly in x, where σ_G² = ∫ y² dG and μ₃,G = ∫ y³ dG.
Proof
Let og and y be the characteristic functions of G and 1 respectively. Let
+ f ly(ttldt,
Itl>n*8o,,
for large enough n such that {|Itl>n” 8 o,} (-\ support (wy) = 9.
Since, [3,2 U3<0, 6,-6>0, and et? < @ TOR o4 for
Itl > n“8o,, it follows that the second term in (4.15) goes to zero exponen-
tially fast, as n > 9,
The first term in (4.15) can be further bounded by
The first term above is handled in exactly the same way as in the proofs of
Lemma 4.2 and Theorem 2.1 of Datta (1989). To bound the second term we
use the Taylor expansion for wy:
yi)=1+twO+—
w), 2 ”
0<t’<t. Since n has mean zero, w’(0) =0. Also, vw is continuous, because
Y has finite third absolute moment, and hence is bounded on the compact sup-
port (wy). Let suply” | = 2K. Then the second term in (4.16) is no more than
lua litt
K n 34 f et + Malice 1A Idt,
Itl<n*8o,, oO;
= O(n->4).
This completes the proof. □
Proof
of Theorem 3.2
Since 5. as PV Oi<1-Hy)” is the standardized sum of the iid
summands Y = LIWs = jl — B). Saamy poste (a) aed TRY Sotlows "in“the
Lae ea
G)
By similar arguments as inthe proof of Theorem 3.1(b), Fj —» F,, as.
(P), where F5 isthedistribution ofYj, F 1s the distribution of Y_ which is
generated inthesame wayasY; withf,;replaced byp,,andf, by 1. Clearly,
ee - a dF;=pyll-py), fr’dFy=pf 1-py)(1-2p,+3pySy). Hence, by
toes almost all sample paths,
Se 11-294305S
»he) ets yeu) 4 SO
2 nz”), uniformly in x,
Proof
of Theorem 3.3
It can easily be checked that 1” ni (pz — f,)/(PH,(1-f,))* is the
g
to the summands
standardized sum correspondin
¥; = 1,0W; =i) - 6), 1x, (4.17)
Thus, parts (a) and (2')ofthe theorem follow asbefore (Le., parts (a) and (a’)
cx-Tiiecaces 3.1) dace nfe py
For past (b), define Y_. in the same way asY{ with p,; replaced by pz f,5
by 1% and €, by €. Also, define Y_ similarly with p,; replaced by py and Ay,
B,, by Az, By given by
and
l+e, if Ty S 1.
Then Y,, and vo are both non-lattice random variables because € is irrational.
Furthermore,
E Ls = 0, E Zz = py(1—pj), ;
n;*6(x)(1-x?)
= &(x) + ———~(1-2p;, + 3p,S;) + o(n;),
6(p(1—-p;;))” ane ee ;
oO onPijt3pijaaa, a
ahs«) é 6(p;p;(1-p;;))” ( Sij) a o(n 5 (4.19)
a.s.
uniformly in x, a.s. (P) since n/n > p; > 0.
For the case when r;; is an integer, consider two subsequences of n;, viz.,
3
{n;} = {nj: 52 rj) and {n;*} = {nj:f; <1}. Then, as. (P), along nj, Fi > Fi,
and along n;, F;; — F;; where F, is the distribution of Y_. Since Y,, and Y.,
have the same first three moments, we get the same expansions (4.18) and
(4.19) along both the subsequences.
Finally, we get the conclusion of the theorem by comparing (4.19) with
Theorem 2.3. O
REFERENCES
SIX QUESTIONS RAISED BY THE BOOTSTRAP

B. Efron*
Abstract
Investigations of bootstrap methods often raise more general ques-
tions in statistical inference. This talk discusses six such questions: (1)
Why do distributions estimated by maximum likelihood tend to have
too short tails? (2) Why does the delta method tend to underestimate
standard errors? (3) Why are cross-validation estimates so variable? (4)
What is a “correct” confidence interval? (5) What is a good nonpara-
metric pivotal quantity? (6) Can we get bootstrap-like answers without
Monte Carlo?
Suppose we observe an i.i.d. sample y₁, y₂, ···, yₙ from an unknown distribution F. The nonparametric maximum likelihood estimate of F is the empirical distribution F̂, which assigns probability
    Prob_F̂{A} = #{yᵢ ∈ A} / n   (1.3)
to any set A in the sample space of the y’s. This is an unbiased estimate
of the true probability Probr{A},
However, the same unbiasedness does not apply to the variance functional:
    Var_F̂{Y} = Σᵢ (yᵢ − ȳ)² / n
has expectation
    E Var_F̂{Y} = ((n − 1)/n) Var_F{Y}.   (1.5)
We see that the variance functional is underestimated by maximum likelihood, albeit mildly so. Elementary statistics courses recommend estimating the variance by
O(&/n) (1.8)
having expectation
    E{σ̂²} = ⋯   (1.11)
~ Lid: * * * pee *
6 + hse, (1.14)
jackknife estimate of standard error. In this case all three methods were
nearly unbiased, giving average standard error estimates of .35,.37, and
.37 respectively. Details of the four sampling experiments appear in Efron
(1982), Section 3.
In the last three sampling experiments, the delta method has a no-
ticeable downward bias. This is particularly evident in the last case,
where the statistic of interest was the tanh⁻¹ transformation of a simple
correlation coefficient from 15 independent bivariate normal points.
A puzzling aspect of the differences seen in Table 2 is that the delta
method is intimately related to both the jackknife and the bootstrap. In
fact the delta method is identical to Jaeckel’s (1972) infinitesimal
jackknife. Suppose θ̂ is a functional statistic θ̂ = S(F̂), such as the ordinary mean S(F̂) = ∫ x dF̂ = x̄. Let F̂_{i,ε} be a distorted version of the empirical distribution F̂ that puts extra probability on the i-th data point,
    F̂_{i,ε}:  (1 − ε)/n + ε   probability on xᵢ,
              (1 − ε)/n       probability on xⱼ, j ≠ i.   (2.1)
Efron (1981) showed that the usual nonparametric delta method estimate of standard error, using a linear Taylor series expansion of θ̂, is identical
to the infinitesimal jackknife estimate (2.3). Tukey's jackknife instead perturbs F̂ by deleting one point at a time,
    F̂_(i):  0 probability on xᵢ,   1/(n − 1) probability on xⱼ, j ≠ i.   (2.4)
Let θ̂_(i) = S(F̂_(i)) and θ̂_(·) = Σᵢ θ̂_(i)/n. Tukey's jackknife estimate of standard error is
    se_jack{θ̂} = { ((n − 1)/n) Σᵢ (θ̂_(i) − θ̂_(·))² }^{1/2}.   (2.5)
This is almost the same as (2.3), except that ε in (2.2) has been set equal to −1/(n − 1) instead of going to zero. (Tukey's formula also incorporates an extra factor of n/(n − 1), in order that se_jack{x̄} exactly equal the usual formula for the standard error of the mean, [Σ(xᵢ − x̄)²/(n(n − 1))]^{1/2}.)
Efron and Stein (1981) showed that the jackknife estimate of variance, se_jack{θ̂}², tends to be biased upward as an estimate of the true variance. A moderate upward bias is discernible in the jackknife column of Table 2. The close relationship of definitions (2.5) and (2.3) suggests that the jackknife and delta method should behave similarly, but such is not always the case.
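To make definition (2.5) concrete, here is a minimal Python sketch (not from the original text) of Tukey's jackknife standard error for a generic functional statistic; the data values are arbitrary, and the check against the usual standard error of the mean illustrates the factor n/(n−1) mentioned above.

import numpy as np

def jackknife_se(x, stat):
    # Tukey's jackknife standard error, formula (2.5).
    x = np.asarray(x)
    n = len(x)
    theta_i = np.array([stat(np.delete(x, i, axis=0)) for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean()) ** 2))

# For the sample mean, (2.5) reproduces [sum (x_i - xbar)^2 / (n(n-1))]^{1/2}.
x = np.array([1.2, 0.7, 3.1, 2.2, 0.9, 1.8])
print(jackknife_se(x, np.mean))
print(np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) * (len(x) - 1))))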
The bootstrap method gives the nonparametric maximum likelihood estimate (MLE) of standard error: let se{t; F} indicate the true standard error of a statistic θ̂ = t(x), where x = (x₁, x₂, ···, xₙ) is an i.i.d. sample from F; then the bootstrap estimate of standard error is se{t; F̂}, the same functional applied to F̂ rather than F. This connects the bootstrap, the delta method, and the jackknife, and deepens the puzzle of the delta method's poor performance.
A bootstrap sample is a random sample of size n from F̂, and in practice the bootstrap estimate is approximated by Monte Carlo,
    se_boot{θ̂} = [ Σ_{b=1}^{B} (θ̂*ᵇ − θ̂*·)² / (B − 1) ]^{1/2}.   (2.8)
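A minimal Python sketch of the Monte Carlo approximation (2.8) (not part of the original text); the statistic and data are illustrative choices.

import numpy as np

def bootstrap_se(x, stat, B=1000, rng=None):
    # Resample n points with replacement B times; take the sample
    # standard deviation of the replicated statistic, per (2.8).
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    reps = np.array([stat(x[rng.integers(0, n, n)]) for _ in range(B)])
    return np.sqrt(np.sum((reps - reps.mean()) ** 2) / (B - 1))

x = np.array([1.2, 0.7, 3.1, 2.2, 0.9, 1.8])
print(bootstrap_se(x, np.mean, B=2000, rng=0))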
In terms of the vector of resampling proportions,
    P* ~ Mult(n, P⁰)/n,   P⁰ = (1/n, 1/n, ···, 1/n).   (2.10)
With the original sample x fixed, we can think of 6* = t(x*) as a function
of P*, say 6(P*). Another way to state (2.6) is
    se_boot{θ̂} = [ Var*{θ̂(P*)} ]^{1/2},   (2.11)
where Var* indicates variance under the multinomial distribution (2.10).
Figure 1 schematically represents (2.11). The prone triangle is the
simplex S in which P* takes its values. The function 6(P*) is represented
by a curved surface over S. At the center of S is P⁰, with θ̂(P⁰) = θ̂, the actual sample value of the statistic. We usually need to approximate
(2.11) by Monte Carlo because there is no closed formula for the variance
of a non-linear function of a multinomial vector.
Both the jackknife and the delta method avoid Monte Carlo by approximating θ̂(P*) with a linear function of P*, say θ̂_LIN(P*), and then theoretically evaluating [Var*{θ̂_LIN(P*)}]^{1/2} instead of (2.11), from the usual formula for the variance of a linear function of a multinomial.
insight into the relationship between se_delta and se_boot. Consider the data x as fixed, and suppose that we observe a multinomial vector P̃ ~ Mult(n, π)/n, where now π can be any vector in the simplex S, not just P⁰ as in (2.10). The artificial problem is to estimate θ̂(π) having observed P̃.
The MLE of θ̂(π) in the artificial problem is θ̂(P̃). It then turns out that [se_boot{θ̂}]² is the variance of the MLE when π = P⁰, while [se_delta{θ̂}]² is the Cramér–Rao lower bound for the unbiased estimation of θ̂(π), at π = P⁰. This argument makes it plausible that se_delta would usually be less than se_boot. "Plausible" isn't a proof though, and it isn't true that se_delta ≤ se_boot in every case.
The delta method is much too useful a tool to throw away. However, its numerical results shouldn't be accepted uncritically, since they seem liable to underestimation. The jackknife and bootstrap standard errors are both more dependable.
The two means, (5,0) and (—$,0), are indicated by stars in Figure 2.
In real practice, of course, we wouldn’t know the probability mechanism
generating the data.
Table 3 also shows that Diff_CV is highly variable: its standard deviation over the 100 simulations was .073, about 80% as big as its mean .091. Unbiased or not, this makes Diff_CV an undependable estimator of Diff.
Efron (1983) shows that cross-validation is closely related to the
bootstrap, much as the delta method, jackknife, and bootstrap are related
in Figure 1. This leads to several new estimators for Diff, based on
variants of the underlying bootstrap argument. The most successful of
these, “Diff 632”, also appears in Table 3. We see that it is moderately
biased downward, but has much smaller standard deviation than Diffgy.
An objective way to compare the two procedures is in terms of their root
    D_y = ⋯   (4.2)
is approximately normal. This approximation is very good, the cdf of D_y differing from the standard normal cdf by only O(n^{-3/2}) if y is actually a sufficient statistic obtained from n observations y₁, y₂, ···, yₙ ~ N(μ, I). This is third-order accuracy, in the language of Section 1.
Inverting the approximate pivotal D_y gives a third-order accurate approximate confidence interval for θ. Table 4 shows the results for the case y ~ N(μ, I), θ(μ) = ||μ||, when the observed vector y has length ||y|| = 5. (A version of (4.2) holds in higher dimensions.) In this case there is an exact confidence interval for θ based on inverting the non-central chi-square distribution of ||y||² ~ χ²(θ²).
             .05     .95
Exact:       2.68    6.19
D_y:         2.71    6.19    3rd order
BC_a:        2.94    6.06    2nd order
Standard:    3.36    6.64    1st order
DiCiccio and Efron (1990) show that all of these methods give confidence intervals agreeing at the second order, and in a certain sense they all are second order correct, as well as accurate. The best situation would be
if all the methods continued to agree at the third order. This happy
result might very well be false. If so, the question of correctness will
be a pressing one. Highly accurate confidence statements are not worth
pursuing if they lead to inferential errors.
    T = (θ̂ − θ) / σ̂(x),   (5.4)
where σ̂(x) is some estimate of standard error for θ̂(x), perhaps the jack-
knife or delta method estimates. If we believe in the pivotality of T,
then we can use the bootstrap to construct a “bootstrap t” approximate
confidence interval for 6; we generate some large number B of bootstrap
replications of T,
    T*ᵇ = (θ̂(x*ᵇ) − θ̂) / σ̂(x*ᵇ),   (5.5)
compute the 5th and 95th percentiles of the values T*ᵇ, b = 1, 2, ···, B, say T*^(.05) and T*^(.95); and assign θ the approximate confidence interval
16th, 50th, 84th, 90th, and 95th percentiles of the 1000 T* bootstrap
replications as dashed lines. So, for example, the 5% dashed line is at
height —.939, and the 95% line at height 2.93.
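A minimal Python sketch of the bootstrap-t construction (5.4)-(5.5) (not part of the original text). The interval endpoints use the standard bootstrap-t form [θ̂ − σ̂·T*^(.95), θ̂ − σ̂·T*^(.05)], which the surviving text does not display explicitly; the mean statistic and the data are illustrative assumptions.

import numpy as np

def boot_t_interval(x, B=1000, alpha=0.10, rng=None):
    # Bootstrap-t confidence interval for the mean:
    # T* = (theta(x*) - theta_hat) / sigma_hat(x*), percentiles of T* set the endpoints.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta, sigma = x.mean(), x.std(ddof=1) / np.sqrt(n)
    t_star = np.empty(B)
    for b in range(B):
        xb = x[rng.integers(0, n, n)]
        t_star[b] = (xb.mean() - theta) / (xb.std(ddof=1) / np.sqrt(n))
    lo, hi = np.percentile(t_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return theta - sigma * hi, theta - sigma * lo

x = np.random.default_rng(1).exponential(size=25)
print(boot_t_interval(x, B=2000, rng=2))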
                     Nonparametric            Parametric
        θ̂*−θ̂   Standard  Boot-t  BC_a    Boot-t  BC_a    Exact
.05:     .61      .61      .40     .48      .53    .50     .49
.95:    1.03      .95      .91     .94      .93    .90     .90
[Figure: histogram of the 1000 bootstrap replications T*ᵇ, with dashed lines at the percentiles listed above.]
The simplest bootstrap estimate of bias is
    bias_B = B⁻¹ Σ_{b=1}^{B} θ̂*ᵇ − θ̂(x).   (6.1)
A refined estimate is
    bias̄_B = B⁻¹ Σ_{b=1}^{B} θ̂(P*ᵇ) − θ̂(P̄*).   (6.2)
NOTE: Entries are 1,000 × bias estimate. In this case bias̄_B is about 50 times as good an estimator as bias_B.
The bias estimate bias̄_B corrects bias_B by taking into account the discrepancy between P̄*, the average of the observed resampling vectors, and P⁰, their theoretical expectation. Davison, Hinkley, and Schechtman (1987) make the correction another way: they draw the resampling vectors P*ᵇ, b = 1, 2, ···, B, in a manner which forces P̄* to equal P⁰. Then (6.1) performs very much like bias̄_B. Various methods of improved bootstrap sampling for reducing the number of bootstrap replications appear elsewhere.
REFERENCES
Abramovitch, L. and Singh, K. (1985). Edgeworth corrected pivotal
statistics and the bootstrap. Ann. Stat. 13, 116-132.
EFFICIENT BOOTSTRAP SIMULATION

Peter Hall
Australian National University
1. Introduction
In many problems of practical interest, the nonparametric bootstrap is employed to estimate an expected value. For example, if θ̂ is an estimate of an unknown quantity θ then we might wish to estimate its bias, E(θ̂ − θ), or the distribution function of θ̂, E{I(θ̂ ≤ x)}. Generally, suppose we wish to estimate u = E(U), where U is a random variable which will often be a functional of both the population distribution function F₀ and the empirical distribution function F₁ of the sample X: U = f(F₀, F₁). Let F₂ denote the empirical distribution function of a resample X* drawn randomly, with replacement, from X, and put U* = f(F₁, F₂). Then û = E{f(F₁, F₂) | F₁} = E(U* | X) is "the bootstrap estimate" of u = E(U).
In the bias example considered above, we would have U = θ̂ − θ and U* = θ̂* − θ̂, where θ̂* is the version of θ̂ computed from X* rather than X. In the case of a distribution function, U = I(θ̂ ≤ x) and U* = I(θ̂* ≤ x).
Our aim in this paper is to describe some of the available methods for approximating û by Monte Carlo simulation, and to provide a little theory for each. The methods which we treat are uniform resampling, linear approximation, the centring method, balanced resampling, antithetic resampling and importance resampling, and are discussed in Sections 2 to 7 respectively. This account is not exhaustive; for example, we do not treat Richardson extrapolation (Bickel and Yahav 1988), computation by saddlepoint methods (Davison and Hinkley 1988, Reid 1988) or balanced importance resampling (Hall 1990b). Section 8 will briefly describe the problem of quantile estimation, which does not quite fit into the format of approximating û = E(U* | X).
2. Uniform resampling
Since & is defined in terms of uniform resampling — that is, random
resampling with replacement, in which each sample value is drawn with the
same probability n~1 — then uniform resampling is the most obvious ap-
proach to simulation. Conditional on 4%, draw B independent resamples
Aj ,...,%% by resampling uniformly, and let 6x denote the version of § com-
piles fe, Xf rather than Y. Then
B
ino ete,
b=1
=n 1674 O(n-*) :
where
    σ̂² = n⁻¹ Σᵢ { (Xᵢ − X̄)′ g′(X̄) }²  →  σ² = E{ (X − μ)′ g′(μ) }²
(as n → ∞) and μ = E(X) is the population mean.
These two examples describe situations which are typical of a great many that arise in statistical problems. When the target û is a distribution function or a quantile, the conditional variance of the uniform bootstrap approximant is roughly equal to CB⁻¹ for large n and large B; and when the target is the expected value of a smooth function of a mean, the variance is approximately CB⁻¹n⁻¹. In both cases, C is a constant not depending on B or n. Efficient approaches to Monte Carlo approximation can reduce the value of C in the case of estimating a distribution function, or increase the power of n⁻¹ (say, from n⁻¹ to n⁻²) in the case of estimating the expected value of a smooth function of a mean. Most importantly, they usually do not increase the power of B⁻¹. Therefore, generally speaking, even the most efficient of Monte Carlo methods has mean squared error which decreases like B⁻¹ as B increases, for a given sample.
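A minimal Python sketch of uniform resampling for the bias example (not part of the original text): û_B = B⁻¹ Σ_b U*_b with U* = g(X̄*) − g(X̄). The scalar case (d = 1) and the choice g = exp are illustrative assumptions.

import numpy as np

def uniform_resampling(x, g, B=1000, rng=None):
    # Uniform resampling approximation u_B of u = E{ g(Xbar*) - g(Xbar) | X }.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    gbar = g(x.mean())
    u_star = np.array([g(x[rng.integers(0, n, n)].mean()) - gbar for _ in range(B)])
    return u_star.mean()

x = np.random.default_rng(0).gamma(2.0, size=40)
print(uniform_resampling(x, np.exp, B=5000, rng=1))   # bootstrap bias estimate for exp(Xbar)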
3. Linear approximation
We motivate linear approximation by considering the bias estimation
example of the previous section. Suppose our aim is to approximate & =
E(U* |), where U* = 6* — 6 = g(X*) — 9(X) and g is a smooth function
of d variables. Let us extend the Taylor expansion (2.1) to another term:
d d d
95(K) +5 DY DI (KK)
= SO(KK) (K-K) gye(K) +.
j=1 j=1 k=1
(3.1)
where g;,.. j= (8° /d2) . ..024r)) g(x). As we noted in Section 2, the
conditional variance of U* is dominated asymptotically by the variance of
the first term on the right-hand side of (3.1), which is the linear component in the Taylor expansion. Now, our aim is to approximate E(U* | X), and the linear component does not contribute anything to that expectation:
    E{ Σ_{j=1}^{d} (X̄* − X̄)⁽ʲ⁾ g_j(X̄) | X } = 0.
Therefore it makes sense to remove the linear component. That is, base the uniform resampling approximation on
    V* = U* − Σ_{j=1}^{d} (X̄* − X̄)⁽ʲ⁾ g_j(X̄)
instead of on U*. Conditional on X, draw B independent resamples X*₁, ..., X*_B by resampling uniformly (exactly as in Section 2), and put
    v̂_B = B⁻¹ Σ_{b=1}^{B} V*_b.
Its conditional variance satisfies
    Var(v̂_B | X) = B⁻¹ { n⁻² β̂ + O(n⁻³) },   (3.3)
with probability one as n → ∞, where
2 In Hall (1989a), the factor 4 was inadvertently omitted from the right-hand
side of (3.4).
Therefore, the order of magnitude of the variance has been reduced from B⁻¹n⁻¹ (in the case of û_B) to B⁻¹n⁻² (in the case of v̂_B). The numerical value of the reduction, for a given problem and sample, will depend on values of the first and second derivatives of g, as well as on the higher-order terms which our asymptotic argument has ignored.
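A minimal Python sketch of the linear-approximation method just described (not part of the original text), again for a scalar mean (d = 1) with g = exp as an illustrative assumption; the derivative g′ is supplied explicitly.

import numpy as np

def linear_approx_bias(x, g, dg, B=1000, rng=None):
    # Average V*_b = U*_b - (Xbar*_b - Xbar) g'(Xbar); the removed linear term
    # has conditional mean zero, reducing the variance from order B^{-1} n^{-1}
    # to order B^{-1} n^{-2}.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar, gbar, slope = x.mean(), g(x.mean()), dg(x.mean())
    v = np.empty(B)
    for b in range(B):
        xb = x[rng.integers(0, n, n)].mean()
        v[b] = (g(xb) - gbar) - (xb - xbar) * slope
    return v.mean()

x = np.random.default_rng(0).gamma(2.0, size=40)
print(linear_approx_bias(x, np.exp, np.exp, B=5000, rng=1))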
More generally, we may approximate U* by an arbitrary number of terms in the Taylor expansion (3.1), and thereby compute a general "polynomial approximation" to û. For example, if m ≥ 1 is an integer then we may define
    V*_(m) = U* − Σ_{r=1}^{m} (r!)⁻¹ Σ_{j₁=1}^{d} ··· Σ_{j_r=1}^{d} { (X̄* − X̄)⁽ʲ¹⁾ ··· (X̄* − X̄)⁽ʲʳ⁾ − E[ (X̄* − X̄)⁽ʲ¹⁾ ··· (X̄* − X̄)⁽ʲʳ⁾ | X ] } g_{j₁...j_r}(X̄)
(a generalization of V*),
4. Centring method
To motivate the centring method approximation, recall that the linear approximation method produces the approximant v̂_B; the centring method instead evaluates g at the grand mean of the resample means. Its bias and variance expansions (4.1) and (4.2) hold with probability one, where β̂, β̂* are given by (3.4), (3.5) respectively. See Hall (1989a). Note particularly that by (3.3) and (4.2), the conditional asymptotic variances of v̂_B and ẑ_B are identical. Since v̂_B is an unbiased approximant, and the bias
of ẑ_B is negligible relative to the error about the mean (order B⁻¹n⁻¹ relative to B^{-1/2}n⁻¹), the approximations v̂_B and ẑ_B have asymptotically equivalent mean squared error.
5. Balanced resampling
If we could ensure that the grand mean of the bootstrap resamples was
identical to the sample mean, i.e.
    B⁻¹ Σ_{b=1}^{B} X̄*_b = X̄,   (5.1)
then the uniform approximation û_B, the linear approximation v̂_B and the centring approximation ẑ_B would all be identical. The only practical way of guaranteeing (5.1) is to resample in such a way that each data point Xᵢ occurs the same number of times in the union of the resamples. To achieve this end, write down each of the sample values X₁, ..., Xₙ B times, in a string of length Bn; then randomly permute the elements of this string; and finally, divide the permuted string into B chunks of length n, putting all the sample values lying between positions (b − 1)n + 1 and bn of the permuted string into the b-th resample X*_b, for 1 ≤ b ≤ B. This is balanced resampling, and amounts to random resampling subject to the constraint that each Xᵢ appears just B times in the union of the X*_b. Balanced resampling
was introduced by Davison, Hinkley and Schechtman (1986), and high-order
balance has been discussed by Graham, Hinkley, John and Shi (1990). See
also Ogbonmwan and Wynn (1986, 1988). An algorithm for performing
balanced resampling has been described by Gleason (1988). The method of
Latin hypercube sampling (McKay, Beckman and Conover 1979; Stein 1987),
used for Monte Carlo simulation in a non-bootstrap setting, is closely related
to balanced resampling.
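A minimal Python sketch of the balanced resampling recipe described above (not part of the original text): write each value B times, permute the string of length Bn, and cut it into B resamples of length n. The bias example is an illustrative use.

import numpy as np

def balanced_resamples(x, B, rng=None):
    # Each X_i appears exactly B times in the union of the B resamples.
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    string = np.tile(x, B)          # each sample value written down B times
    rng.shuffle(string)             # random permutation of the long string
    return string.reshape(B, n)     # chunk b is the b-th resample

def balanced_bias(x, g, B=1000, rng=None):
    resamples = balanced_resamples(x, B, rng)
    return np.mean([g(r.mean()) for r in resamples]) - g(np.asarray(x).mean())

x = np.random.default_rng(0).gamma(2.0, size=40)
print(balanced_bias(x, np.exp, B=2000, rng=1))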
The balanced resampling approximation of û = E{g(X̄*) | X} is
    ũ_B = B⁻¹ Σ_{b=1}^{B} g(X̄*_b),
and its bias and variance satisfy (5.2) and (5.3) with probability one as n → ∞, where α̂, β̂ are given by (4.3), (3.4) respectively; compare (4.1) and (4.2). In particular, in the context of bias estimation for a smooth function of a mean, balanced resampling reduces the orders of magnitude of variance and mean squared error by a factor of n⁻¹. Formulae (5.2) and (5.3) are not entirely trivial to prove, and the asymptotic equivalence of bias and variance formulae for the centring method and balanced resampling is not quite obvious; see Hall (1989a).
In Sections 3 and 4, and so far in the present section, we have treated
only the case of approximating a smooth function of a mean. The methods
of linear approximation and centring do not admit a wide range of other
applications. For example, linear approximation relies on Taylor expansion,
and that demands a certain level of smoothness of the statistic U. However,
balanced resampling is not constrained in this way, and in principle applies to
a much wider range of problems, including distribution function and quantile
estimation. In those cases the extent of improvement of variance and mean
squared error is generally by a constant factor, not by a factor n⁻¹.
Suppose that U is an indicator function of the form U = I(S ≤ x) or U = I(T ≤ x), where S = n^{1/2}(θ̂ − θ)/σ and T = n^{1/2}(θ̂ − θ)/σ̂ are statistics which are asymptotically Normal N(0,1). The bootstrap versions are U* = I(S* ≤ x) and U* = I(T* ≤ x) respectively, where S* = n^{1/2}(θ̂* − θ̂)/σ̂ and T* = n^{1/2}(θ̂* − θ̂)/σ̂*. (The pros and cons of pivoting are not relevant to the present discussion.) To construct a balanced resampling approximation to û = E(U* | X), first draw B balanced resamples X*_b, 1 ≤ b ≤ B, as described two paragraphs earlier. Let θ̂*_b, σ̂*_b denote the versions of θ̂, σ̂ respectively computed from X*_b instead of X, and put
    E(ũ_B | X) − û = O(B⁻¹)
with probability one. The asymptotic variance of ũ_B is less than that of û_B by a constant factor ρ(x)⁻¹ < 1, since
6. Antithetic resampling
The method of antithetic resampling dates back at least to Hammers-
ley and Morton (1956) and Hammersley and Mauldon (1956). See Snijders
(1984) for a recent account in connection with Monte Carlo estimation of
probabilities. Antithetic resampling may be described as follows. Suppose
we have two estimates θ̂₁ and θ̂₂ of the same parameter θ, with identical means and variances but negative covariance. Assume that the costs of computing θ̂₁ and θ̂₂ are identical. Define
    θ̂₃ = ½(θ̂₁ + θ̂₂).
Then θ̂₃ has the same mean as either θ̂₁ or θ̂₂, but less than half the variance, since
    Var(θ̂₃) = ¼{ Var(θ̂₁) + Var(θ̂₂) + 2 Cov(θ̂₁, θ̂₂) } < ½ Var(θ̂₁).
Since the cost of computing θ̂₃ is scarcely more than twice the cost of computing either θ̂₁ or θ̂₂, but the variance is more than halved, there is an advantage from the viewpoint of cost-effectiveness in using θ̂₃, rather than either θ̂₁ or θ̂₂, to estimate θ. Obviously, the advantage increases with increasing negativity of the covariance, all other things being equal.
To appreciate how this idea may be applied to the case of resampling, let U* denote the version of a statistic U computed from a (uniform) resample X* = {X*₁, ..., X*ₙ} rather than the original sample X = {X₁, ..., Xₙ}. Let π be an arbitrary but fixed permutation of the integers 1, ..., n, and let j₁, ..., jₙ be the random integers such that X*ᵢ = X_{jᵢ} for 1 ≤ i ≤ n. Define X**ᵢ = X_{π(jᵢ)}, 1 ≤ i ≤ n, and put X** = {X**₁, ..., X**ₙ}. That is, X** is the (uniform) resample obtained by replacing each appearance of Xᵢ in X* by X_{π(i)}. If U** denotes the version of U computed from X** instead of X, then U* and U** have the same distributions, conditional on X. In particular, they have the same conditional mean and variance. If we choose the permutation π in such a way that the conditional covariance of U* and U** is negative, we may apply the antithetic argument to the pair (U*, U**).
That is, the approximant
    ½(U* + U**)
will have the same conditional mean as U* but less than half the conditional variance of U*.
If the Xᵢ's are scalars then in many cases of practical interest, the "asymptotically optimal" permutation π (which asymptotically minimizes the covariance of U* and U**) is that which takes the largest Xᵢ into the smallest Xᵢ, the second largest Xᵢ into the second smallest Xᵢ, and so on. That is, if we index the Xᵢ's such that X₁ ≤ ... ≤ Xₙ, then π(i) = n − i + 1 for 1 ≤ i ≤ n. For example, this is true when U = g(X̄) − g(μ), where g is a smooth function, and also when U is an indicator of the form (6.1) or (6.2). In the latter cases, if we index the Xᵢ's such that Y₁ ≤ ... ≤ Yₙ, then π(i) = n − i + 1.
We shall call π the antithetic permutation. These results remain true if we
Studentize the arguments of the indicator functions at (6.1) and (6.2); the
pros and cons of pivoting do not have a role to play here. The reader is
referred to Hall (1989b) for details.
The antithetic resampling approximant of û is
    û^A_B = (2B)⁻¹ Σ_{b=1}^{B} (U*_b + U**_b).
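A minimal Python sketch of antithetic resampling for scalar data (not part of the original text): each uniform resample is paired with the resample obtained by the antithetic permutation (largest value with smallest value), and (U* + U**)/2 is averaged. The bias example and g = exp are illustrative assumptions.

import numpy as np

def antithetic_bias(x, g, B=500, rng=None):
    rng = np.random.default_rng(rng)
    x = np.sort(np.asarray(x, dtype=float))   # x[0] <= ... <= x[n-1]
    n = len(x)
    anti = x[::-1]                             # antithetic permutation: i -> n - i + 1
    gbar = g(x.mean())
    vals = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)            # uniform resample indices
        u1 = g(x[idx].mean()) - gbar           # U*
        u2 = g(anti[idx].mean()) - gbar        # U**, same draws mapped antithetically
        vals[b] = 0.5 * (u1 + u2)
    return vals.mean()

x = np.random.default_rng(0).gamma(2.0, size=40)
print(antithetic_bias(x, np.exp, B=2000, rng=1))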
7. Importance resampling
The method of importance resampling is a standard technique for im-
proving the efficiency of Monte Carlo approximations. See Hammersley and
Handscomb (1964, p.60ff). It was first suggested in the context of bootstrap
resampling by Johns (1988) and Davison (1988).
To introduce importance resampling, let X = {X₁, ..., Xₙ} denote the sample from which a resample will be drawn. (This notation is only for the sake of convenience, and in no way precludes a multivariate sample.) Under importance resampling, each Xᵢ is assigned a probability pᵢ of being selected
    n! / (m_{j1}! ··· m_{jn}!) · n⁻ⁿ
or
    n! / (m_{j1}! ··· m_{jn}!) · Π_{i=1}^{n} pᵢ^{m_{ji}},
respectively. Let U be the statistic of interest, a function of the original sample. We wish to construct a Monte Carlo approximation to the bootstrap estimate û of the mean of U, u = E(U).
Let X* denote a resample drawn by uniform resampling, and write U* for the value of U computed for X*. Of course, X* will be one of the X_j's. Write u_j for the value of U* when X* = X_j. In this notation,
    û = Σ_{j=1}^{N} u_j ( n! / (m_{j1}! ··· m_{jn}!) ) n⁻ⁿ,
and the importance resampling approximant of û, based on B independent resamples X*₁, ..., X*_B drawn with the probabilities p₁, ..., pₙ, is
    û_B = B⁻¹ Σ_{b=1}^{B} U*_b Π_{i=1}^{n} (n pᵢ)^{−M*_{bi}},
where M*_{bi} denotes the number of times Xᵢ appears in X*_b.
This approximation is unbiased, in the sense that E(û_B | X) = û. Note too that conditional on X, û_B → û with probability one as B → ∞.
If we take each pᵢ = n⁻¹ then û_B is just the usual uniform resampling approximant. We wish to choose p₁, ..., pₙ to optimize the performance of û_B. Since û_B is unbiased, its performance may be described in terms of variance:
    Var(û_B | X) = B⁻¹ [ Σ_{j=1}^{N} u_j² ( n! / (m_{j1}! ··· m_{jn}!) ) n⁻ⁿ Π_{i=1}^{n} (n pᵢ)^{−m_{ji}} − û² ]
                 = B⁻¹ [ E{ U*² Π_{i=1}^{n} (n pᵢ)^{−M*_i} | X } − û² ]  ≡  B⁻¹ σ²(p₁, ..., pₙ).
On the last line, M*_i denotes the number of times Xᵢ appears in the uniform resample X*. Ideally we would like to choose p₁, ..., pₙ so as to minimize σ²(p₁, ..., pₙ), subject to Σ pᵢ = 1.
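A minimal Python sketch of importance resampling (not part of the original text): resamples are drawn with non-uniform probabilities pᵢ and U*_b is reweighted by Π_i (n pᵢ)^{−M*_{bi}}. The Studentized-mean indicator, the threshold t₀ and the particular weighting scheme are illustrative assumptions only.

import numpy as np

def importance_resampling(x, u_func, p, B=1000, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    vals = np.empty(B)
    for b in range(B):
        idx = rng.choice(n, size=n, replace=True, p=p)   # resample with probabilities p
        counts = np.bincount(idx, minlength=n)           # M*_bi
        weight = np.prod((n * p) ** (-counts.astype(float)))
        vals[b] = u_func(x[idx]) * weight
    return vals.mean()

# Illustration: approximating P*(T* <= t0) with more weight on small observations.
rng = np.random.default_rng(0)
x = rng.exponential(size=30)
t0 = -1.5
def u_func(xb, xbar=x.mean(), n=len(x)):
    tstar = np.sqrt(n) * (xb.mean() - xbar) / xb.std(ddof=1)
    return float(tstar <= t0)
ranks = np.argsort(np.argsort(x))                 # 0 = smallest observation
p = 2.0 - ranks / (len(x) - 1)                    # decreasing weights, renormalised inside
print(importance_resampling(x, u_func, p, B=4000, rng=1))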
In the case of estimating a distribution function there can be a significant advantage in choosing non-identical pᵢ's, the amount of improvement depending on the argument of the distribution function. To appreciate the extent of improvement, consider the case of estimating the distribution function of a Studentized statistic, T = n^{1/2}(θ̂ − θ)/σ̂, assuming the "smooth function model" introduced in Section 2. Other cases, such as that where S = n^{1/2}(θ̂ − θ)/σ is the subject of interest, are similar; the issue of Studentizing does not play a role here.
Take U* = I(T* ≤ x), where T* = n^{1/2}(θ̂* − θ̂)/σ̂*,
    σ̂² = n⁻¹ Σ_{t=1}^{n} { Σ_{j=1}^{d} (X_t − X̄)⁽ʲ⁾ g_j(X̄) }².
BP NT
ns aS)
variance of al,(under importance resampling)
ye 2) LS
(x — A) e4’ — O(z)?2 ©
8. Quantile estimation
Here we consider the problem of estimating the ath quantile, &., of
ae distribution of a random variable R such as S = n?(6— 0)/o or T =
n3(6—@)/&. We define €, to be the solution of the equation P(R < €.)= a.
Now, the bootstrap estimate of €, is the solution & of
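A minimal Python sketch of the Monte Carlo version of this quantile estimate (not part of the original text): ξ̂_α is approximated by an order statistic of B simulated values of the bootstrap statistic, here the Studentized mean T* as an illustrative choice.

import numpy as np

def bootstrap_quantile(x, alpha=0.95, B=1000, rng=None):
    # Approximate the solution of P(T* <= xi | X) = alpha by the
    # ceil(alpha*B)-th order statistic of B simulated T* values.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = x.mean()
    t_star = np.empty(B)
    for b in range(B):
        xb = x[rng.integers(0, n, n)]
        t_star[b] = np.sqrt(n) * (xb.mean() - theta) / xb.std(ddof=1)
    t_star.sort()
    return t_star[int(np.ceil(alpha * B)) - 1]

x = np.random.default_rng(0).exponential(size=40)
print(bootstrap_quantile(x, alpha=0.95, B=1999, rng=1))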
References
BICKEL, P.J. and YAHAV, J.A. (1988). Richardson extrapolation and the
bootstrap. J. Amer. Statist. Assoc. 83, 387-393.
DAVISON, A.C. (1988). Discussion of papers by D.V. Hinkley and by T.J. DiCiccio and J.P. Romano. J. Roy. Statist. Soc. Ser. B 50, 356-357.
DAVISON, A.C. and HINKLEY, D.V. (1988). Saddlepoint approximations
in resampling methods. Biometrika 75, 417-431.
DAVISON, A.C., HINKLEY, D.V. and SCHECHTMAN, E. (1986). Efficient
bootstrap simulation. Biometrika 73, 555-566.
DO, K.-A. and HALL, P. (1990). On importance resampling for the boot-
strap.
EFRON, B. (1990). More efficient bootstrap computations. J. Amer. Statist.
Assoc..
GLEASON, J.R. (1988). Algorithms for balanced bootstrap simulations.
Amer. Statistician 42, 263-266.
GRAHAM, R.L., HINKLEY, D.V., JOHN, P.W.M. and SHI, S. (1990). Bal-
anced design of bootstrap simulations. J. Roy. Statist. Soc. Ser. B
52, 185-202.
BOOTSTRAPPING U-QUANTILES
R. HELMERS¹
Centre for Mathematics and Computer Science
Amsterdam, The Netherlands

P. JANSSEN and N. VERAVERBEKE²
Limburgs Universitair Centrum
Diepenbeek, Belgium
ABSTRACT
1. INTRODUCTION
Let X₁, X₂, ... be independent random variables defined on a common probability space (Ω, A, P), having common unknown distribution function (df) F. Let h(x₁, ..., x_m) be a kernel of degree m (i.e. a real-valued measurable function symmetric in its m arguments) and let H_F(y) = P(h(X₁, ..., X_m) ≤ y), with H_n its empirical counterpart based on all m-subsets of the sample. Let, for 0 < p < 1, ξ_p = H_F⁻¹(p) denote the p-th quantile corresponding to H_F, and let ξ̂_{p,n} = H_n⁻¹(p) denote its empirical counterpart. Generalized
¹R. Helmers, Centre for Math. and Comp. Science, Kruislaan 413, 1098 SJ Amsterdam (The Netherlands)
²P. Janssen, N. Veraverbeke, Limburgs Universitair Centrum, Universitaire Campus, B-3590 Diepenbeek (Belgium)
quantiles of the form ξ̂_{p,n} = H_n⁻¹(p) are called U-quantiles. Choudhury and Serfling (1988) note that ξ̂_{p,n} → ξ_p a.s. [P], as n → ∞, and, in addition, that, as n → ∞,
The aim of this paper is to employ bootstrap methods for the construction of a confidence interval for ξ_p = H_F⁻¹(p). In Section 2 we establish a bootstrap analog of (3), under a slightly more stringent smoothness condition on H_F, and in Section 3 we establish the asymptotic accuracy of this
bootstrap approximation. Applications to certain estimators of location and
spread, such as the classical sample quantile, the Hodges-Lehmann estimator
of location and a spread estimator proposed in Bickel and Lehmann (1979)
are discussed in Section 4.
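As an illustration (not part of the original paper), the following minimal Python sketch bootstraps a U-quantile; the Hodges-Lehmann estimator (kernel (x₁+x₂)/2, m = 2, p = 1/2) and the Cauchy data are illustrative choices, and the kernel is evaluated here over pairs with distinct indices.

import numpy as np
from itertools import combinations

def u_quantile(x, kernel, m, p):
    # p-th quantile of the kernel values over all m-subsets of the sample.
    vals = np.array([kernel(*c) for c in combinations(x, m)])
    return np.quantile(vals, p)

def bootstrap_u_quantile(x, kernel, m, p, B=500, rng=None):
    # Recompute the U-quantile on B resamples drawn with replacement.
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([u_quantile(x[rng.integers(0, n, n)], kernel, m, p) for _ in range(B)])

hl_kernel = lambda x1, x2: 0.5 * (x1 + x2)        # Hodges-Lehmann kernel
x = np.random.default_rng(0).standard_cauchy(size=30)
est = u_quantile(x, hl_kernel, 2, 0.5)
reps = bootstrap_u_quantile(x, hl_kernel, 2, 0.5, B=200, rng=1)
lo, hi = np.percentile(reps, [2.5, 97.5])
print(est, (lo, hi))                               # estimate and a percentile interval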
P,(n (e3
cs én)<< is
n(nt(E x (p) —Hy"(p))S2)
I Hoe (pn M(p)+ 2n- 2) (8)
A H3(Hz1(p) + 2n-?) > p)
wsSoan( Ws> —D,)
bo»
bop
where
- 3 _
Di we De (12)
t=1
where
and
where, for any given permutation α of {1, 2, ..., n}, U_{[α,j]}(t) denotes the empirical process based on the [n/m] independent random variables h(X_{α(mj+1)}, ..., X_{α(m(j+1))}), j = 0, ..., [n/m] − 1, all with common df H_F.
With impunity we may replace at stage n + 1 any of the n.n! permutations
a of {1,...,n +1} which do not extend those of {1,...,n} by one of the n!
permutations which do extend those of {1,...,n}. Application of relation
(2.13) of Stute (1982) to each of the resulting n! terms appearing on the r.h.s.
of (18) directly yields that Dy, = O(n7?(Inn)?) as. [P], as n — oo, hence,
(26)
y
Note that, in contrast to Corollary 4.1 of van Zwet (1984), the asymptotic
variance instead of the exact variance of W,* is employed. It is easy to see
that this does not affect the bound (26). The different standardization will give rise to an additional term of type
    E_{F_n} h_n(X₁, ···, X_m) / E_{F_n} g_n(X₁),   (27)
which is already present in van Zwet's bound. Because h_n is bounded by 1, for all n, and combining (25) with the fact that c_n > 0, we easily see that the moments appearing on the r.h.s. of (26) are O(1) a.s. [P], as n → ∞. Hence the r.h.s. of (26) is O(n^{-1/2}) a.s. [P], as n → ∞.
For the special case m = 1, h(x) = x, p = ½, the classical sample median, our result reduces to Proposition 5.1 of Bickel and Freedman (1981). An insightful proof of their proposition is given in an unpublished note by Sheehy and Wellner (1988). Our proof is in part inspired by their argument.
Theorem 3.1. Suppose that the assumptions of Theorem 2.1 are satisfied. Suppose, in addition, that h_F satisfies a Lipschitz condition of order ≥ ½ on a neighborhood of ξ_p. Then
    sup_x |P_n(n^{1/2}(ξ*_{p,n} − ξ̂_{p,n}) ≤ x) − P(n^{1/2}(ξ̂_{p,n} − ξ_p) ≤ x)| = O(n^{-1/4}(ln n)^{1/2})   (29)
a.s. [P], as n → ∞.
For the special case m = 1, h(x) = x, the classical p-th sample quantile, Singh (1981) obtained a slightly better a.s. rate: the factor (ln n)^{1/2} in (29) is replaced by (ln ln n)^{1/2} in this case. Whether the same improvement holds true for U-quantiles appears to be an interesting open problem.
- 3
sup |Pa(n?(€, — fon) S 2) — P(n? (Em — &) S2)1S Yolin (30)
t=1
For this we used Lemma 3.1 of Choudhury and Serfling (1988) and the Lipschitz condition on h_F. Combining (33) with (28) directly yields
p—-H n(lp ++ oe
pa ) (38)
p— He(é,
Fe + n-¥(Inn)}) (39)
= —jhr(E)n-¥(In n)#(14+0(1)) as. [P], as n> 00.
Together (37), (38) and (39) yield that p— Hy(&n + Kn-2(In n)*) < 0, for
all n sufficiently large, a.s. [P], provided we take K large enough.
Combining (34), (41) and (42) with (30), we find that the theorem is proved. □
4. APPLICATIONS
In this section we indicate briefly applications of our results to the prob-
lem of obtaining confidence intervals for , = Hp’ (p). Let ug = ®-1(1— $).
The normal approximation (3) yields an approximate two-sided confidence
interval
where cf,¢ and c, ,_¢ denote the $-th and (1— $)-th percentile of the (simu-
lated) bootstrap Enpvotiamtion Itiis easily verifiedthat the upper and lower
confidence limits in (44) have error rates equal to 5 + 0(n7 + (In n)?).
A further investigation into the relative merits of the normal and boot-
strap based confidence intervals (43) and (44) for U-quantiles appears to be
worthwhile. The authors hope to report on these matters elsewhere.
REFERENCES
Bickel, P.J. and Freedman, D. (1981). Some asymptotic theory for the boot-
strap. Ann. Statist. 9, 1196-1217.
Bickel, P.J. and Lehmann, E.L. (1979). Descriptive statistics for nonpara-
metric models. IV. Spread. Contributions to Statistics (J. Hajek
Memorial Volume), 33-40. (ed. J. Jurečková). Academia, Prague.
AN INVARIANCE PRINCIPLE
APPLICABLE TO THE BOOTSTRAP
John G. Kinateder!
Michigan State University
1 Introduction
Suppose X₁, X₂, ... are independent random variables distributed according to a distribution function F with location parameter θ. In order to make inferences about θ, we may consider the distribution of the sample mean about θ,
    X̄_n − θ.
For example, if EX₁² < ∞, the well-known Lindeberg-Lévy Central Limit Theorem [Bil86] tells us that
    n^{1/2}(X̄_n − EX₁) ⇒ N(0, σ²),
where σ² represents the variance of X₁. In the finite variance case, we can use this to make inferences about θ = EX₁.
Of course EX₁² < ∞ is not necessary for convergence in distribution of the sample mean.
Necessarily a_n ~ c n^{1/α} for some α ∈ (0, 2] and c > 0; Y and μ are said to be α-stable. (See [Fel71].)
For each observation of the data X₁, ..., Xₙ, we consider the distribution of
    S*_n = a_n⁻¹ Σ_{i=1}^{n} (X*_i − X̄_n)   (1)
        = a_n⁻¹ Σ_{k=1}^{n} (M*_{n,k} − 1) X_k.   (2)
The M*_{n,k} can be thought of as counts; X_k is chosen M*_{n,k} times in the bootstrap sample.
Following Athreya’s work, Knight [Kni89] gave a different representation
for this limit law. Using the distributional relationship (2) above, and the
sample sequence representation provided by LePage, Woodroofe, and Zinn
[LWZ81], he gave the following explicit representation of the limit law:
As in [LWZ81], define p by
    p = lim_{y→∞} (1 − F(y)) / (1 − F(y) + F(−y)).
Let ε₁, ε₂, ... be i.i.d., P(ε₁ = 1) = p = 1 − P(ε₁ = −1). Let Γ₁, Γ₂, ... represent the arrival times of a Poisson process with unit rate; Γ_i = Σ_{j≤i} ξ_j, where P(ξ_j > x) = e^{−x} for all j (ξ₁, ξ₂, ... are independent). Let M*₁, M*₂, ... be independent Poisson mean 1 random variables. Finally, assume that {ε_i}, {Γ_i}, {M*_i} are mutually independent. Then
such that m_n/n → 0, then the bootstrap distribution of S*_{m_n} converges weakly to μ in probability.
Arcones and Giné [AG89] added to this answer by showing that if
    m_n log log m_n / n → 0,
then the bootstrap central limit theorem holds almost surely. That is, the conditional distribution of S*_{m_n} converges with respect to the weak topology to μ almost surely. But if
    lim inf m_n log log m_n / n > 0,
then there is no almost sure convergence — not even to a random measure!
[nt]
W°(t) =a," >> Xe; t € [0,1].
k=1
LI iErWo H™")
(dr)|X) # L( /rWo H™")(dr)|Xq,...,Xn)-
Definition 3. For each pair of functions x and y in D[0,1], define the distance d_s(x, y) as the infimum of all those values of δ for which there exists a strictly increasing and onto transformation λ: [0,1] → [0,1] such that
in the product space (D(R), U₁) × (D[0,1], U₂), where U₁ denotes the uniform topology on D(R) and U₂ denotes the uniform topology on D[0,1].
Let T represent the arrival times of a unit rate Poisson process as described in
the introduction.
Let T;,T2,... be t.t.d., uniformly distributed on (0,1).
Define W by
WO) = Seki Tweet eS aan)
k=1
The right side converges to zero almost surely by the previous analysis. Oo
We will proceed to show that the tail sums W"-Vz and W —Ty converge to
zero in the uniform metric as N — oo rapidly enough so that d,(W”, W) —, 0.
To this end, we will use a weak version of a theorem from Pollard [Pol84].
But first we need to define conditional variance as Pollard does.
Specifically, Pollard states the theorem for processes on [0, 00) and the con-
vergence is with respect to the uniform on compacta topology. Thus he proves
a generalized version of this statement for processes restricted to intervals of
the form [0,7].
Let hs ry
Ty (ue Sere lar edd): 32 = Sa Bp
dane =n
Notation. In what follows, || - || will denote the supremum norm for any
space on which it makes sense; it will be subscripted to remove any ambiguity.
For σ-fields F, we will let E^F denote conditional expectation given the σ-field F; E^Γ will denote conditional expectation given the σ-field generated by the process {Γᵢ}.
Lemma 1. {(T_n(t)/s_n, F_t) : t ∈ [0,1]} is an L²-martingale.
Sketch of Proof. Use the method provided by Aalen in [Aal78], the monotone convergence theorem, and the Minkowski inequality for conditional expectation to verify the conditions given in Definition 5. □
E(T,./82), B(BM(—3; =n
Tn -TA2)))
B(s;? > PET (ial — Te A ‘))).
k=n
But the processes {Γᵢ} and {Tᵢ} are independent and E(−ln(1 − T₁ ∧ t)) = −∫₀¹ ln(1 − u ∧ t) du = t. Hence
ery
g=1
apy Sw. (8)
Now we will work on finding an appropriate bound for the tail sums of the
processes W” described earlier. Define
n
(=e) a S Bn (s52,(us 4)
n - ux)’
n
[nt]-1
ay y- sx22 Baim (J Ypepltl,
(=,et
a y):
k=0 yo=iNi
[nt]-1 n k+1
= Do 8nn Dy YayP(LR =[Fen
k=0 j=N
[nt]-1
= me = pee Saget. eat eh eer
- k=0
Sn DyYj
j=N
n—k
n [nt]—1 1s By
= 85 in Daeg a
j=N k=0
Proof. Since for each j, Y,; > Tek a.s., we have for each N,
ya) Pa a.s
SH ye eek Ee
Hence we can choose no(N) > 2N large enough that for n > no(N),
Saja
es (ae YZ, ue, =a ie! = >) Ze
Fix a sequence (n'(N)) such that n’(N) > no(N) for each N. By a simple
application of a Borel-Cantelli lemma, as N — oo
2 aie
ae ee ore (11)
Dna, ni(N)j ONY iy
V |LF =| ar 0,
j=l
N
j =1
Peli 0,
T; # H(t) for alli >1 and teR,
|H"-H|| -— 0.
j=1
||H"—H|| <e/3.
Now let n > M. We show that |V,? o H"(t) — Sn o H(t)| < 6, for all real
t. Recall that Sy is constant on [T(, Ti41)) for 7 = 0,1,...,N (with To = 0,
and Ty41 = 1); and Vx is constant on [L{j), Liat): We proceed to show that
\Vn(H"(t)) — Sv(H(4))|
N-1
2 Wee eel
t=1
50:
With existence guaranteed by Propositions 1 and 6, choose ni(N) > no( NV)
(no is defined in Proposition 5) such that for n > ni(N),
P(|\Vi
0H" —Sy 0 H||>N7)<N™, (16)
and
P(d,(VEV SS) > Na cv (17)
Define
N(n) = max{N >1:n(N) <n}V1.
Lemma 3 VN (n) —, W in the Skorohod topology.
Proof. Let (n;,) be a subsequence. Since ni(N) > no(N) > 2N — ov, we can
choose a subsequence (n,,;) such that N(n;;) < N(nx,,,) for each j.
Let e > 0. Now ng, > mi(N(nx,)) by definition of N(-), and for 7 large
enough, N(n;,;) > 1/varepsilon, in which case
P(ds(Vyr,,y W) 2 28)
< P(ds(Vin(a,,)) Swim,)) > €) + P(ds(Swvim,)s W) > €)
    W^n → W,
    ||W^n ∘ H^n − W ∘ H||_R ≤ ||W^n ∘ H^n − W ∘ H^n||_R + ||W ∘ H^n − W ∘ H||_R
                            ≤ ||W^n − W||_{[0,1]} + ||W ∘ H^n − W ∘ H||_R.
By construction, the first term on the right converges to 0 almost surely. With probability one W is uniformly continuous on [0,1]. Since H^n → H uniformly almost surely, the last term converges to 0 almost surely.
With this representation, we have (W^n ∘ H^n, W^n) → (W ∘ H, W) uniformly almost surely. Therefore,
Theorem 5
Proof. To see this, look at the characteristic functions again. The stochastic
integral can be written as >°°2_, r[W(H(r)) — W(H(r—))]. By independence
of the increments of Brownian motion, its characteristic function is
anes. ee
nr — Co Sane 4
= exp(—5 |r dH(r))
= exp(—t?/2).
This is the characteristic function of the standard normal distribution, so the
assertion is proved. Oo
By applying Bickel and Freedman’s results [BF81], using the proper scal-
ing, a_n = n^{1/2}σ, we see that in the symmetric case the result concerning a
distribution in the domain of attraction of an infinite variance stable random
variable can be viewed as an extension of what was already known in the finite
variance case.
5 Simulation Results
For random variables which have infinite variance, we found that in the sym-
metric case, the bootstrap of the sample mean does not perform so badly. In
fact, in some ways, the method gives better results than it does in the finite
variance case.
    P( |X̄*_n − X̄_n| ≤ r(X_n) | X_n ) = γ.
A surprising observation was that the empirical coverage of this method was consistently higher than γ for α < 2. Figure 1 shows the observed coverage of the bootstrap method applied 1000 times each for γ = .95 confidence with sample size n = 50, bootstrap resample size m = 50, and 500 bootstrap observations for various values of α. In our simulations, X₁ ~ εU^{−1/α}, where P(ε = 1) = 1/2 = P(ε = −1), U uniform on (0,1). Notice that for α > 2, F has finite variance, and hence it is expected that the coverage should be approximately .95.
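A minimal Python sketch of a simulation in the same spirit as the study just described (not the authors' original code): X = εU^{−1/α} with random signs, the confidence radius r(X_n) is the γ-quantile of |X̄* − X̄_n|, and coverage of the true location 0 is recorded. The design values mirror the text (n = 50, m = 50, 500 bootstrap observations, γ = .95); the number of simulation runs here is reduced for speed.

import numpy as np

def coverage(alpha, n=50, m=50, n_boot=500, n_sim=1000, gamma=0.95, rng=None):
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(n_sim):
        u = rng.uniform(size=n)
        x = np.where(rng.uniform(size=n) < 0.5, 1.0, -1.0) * u ** (-1.0 / alpha)
        xbar = x.mean()
        dev = np.abs([x[rng.integers(0, n, m)].mean() - xbar for _ in range(n_boot)])
        radius = np.quantile(dev, gamma)        # bootstrap confidence radius
        hits += abs(xbar - 0.0) <= radius        # the true location is 0 here
    return hits / n_sim

print(coverage(alpha=1.0, n_sim=200, rng=0))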
Consider the confidence radii obtained by the above method, scaled by
n-1/@ because
St = naz (X* SX),
and a, ~ cn'/* (see Feller [Fel71]).
In the finite variance case, the bootstrap distribution of the scaled and cen-
tered bootstrapped sample mean converges weakly to a fixed (normal) distri-
bution almost surely. Since the limit distribution is continuous, the confidence
radii given by the above method and then scaled as indicated converge almost
surely to a fixed number as the sample size n tends to infinity.
But by Athreya’s early results [Ath84] one might not expect this phe-
nomenon to occur in the infinite variance case. Our simulation results exem-
plify this. Figure 2 shows a frequency histogram of the observed confidence
radii (logarithmically scaled) with n = 50, a = 1.0. The vertical line rep-
resents the logarithm of the radius necessary for an unconditional confidence
interval with confidence level equal to the coverage observed by applying this
method.
Figure 3 shows more of the same phenomena for various values of a.
[Figures 2 and 3: frequency histograms of the logarithmically scaled bootstrap confidence radii, for n = 50 and various values of α.]
; As expected, even as n gets large, the distribution of the scaled radii ob-
tained by this method is dispersed apparently continuously over the positive
real axis. Figure 4 shows what happens when n = 200.
less than the radius necessary to give confidence equivalent to the empirical
coverage obtained by the method.
Maybe more substantial is how much smaller the observed radii were. Half
of the time the bootstrap confidence interval radius was less than about a
tenth of the radius necessary for unconditional confidence.
More needs to be studied in this direction. The invariance principle proved
in this paper will help to explain the phenomenon.
6 Remarks
aly "(Mg - 1)
k=1
A Appendix
Below are stated some of the technical results referenced during the proof of
the invariance principle. These are all proved in the author’s Ph.D. thesis.
and
E**el(T € B) =0.
Lemma 6 With probability one, '=*/~/s? = 0.
Lemma 7 For each t, o7(V,(t)) — 0.
Lemma 8. For 0 ≤ s < t ≤ 1,
    E^{F_s} ln(1 − T₁ ∧ t) = ln(1 − T₁ ∧ s) − ((t − s)/(1 − s)) I(T₁ > s).
Proposition 8
Co
—2/a
Naa ie py a ry Ea
g=1 k=1
References
[Aal78] Odd Aalen. Nonparametric inference for a family of counting pro-
cesses. Annals of Statistics, 6:701-726, 1978.
[AG89] Miquel A. Arcones and Evarist Giné. The bootstrap of the mean
with arbitrary sample size. Annals of Inst. of H. Poincare, 1989.
[Ath87] K.B. Athreya. Bootstrap of the mean in the infinite variance case.
Annals of Statistics, 15:724-731, 1987.
[BF81] Peter J. Bickel and David A. Freedman. Some asymptotic theory for
the bootstrap. Annals of Statistics, 9:1196-1217, 1981.
[Bil86] Patrick Billingsley. Probability and Measure. John Wiley, New York,
1986.
[GZ89] Evarist Giné and Joel Zinn. Necessary conditions for the bootstrap
of the mean. Annals of Statistics, 17:684-691, 1989.
[Kni89] Keith Knight. On the bootstrap of the sample mean in the infinite
variance case. Annals of Statistics, 17:1168-1175, 1989.
Edgeworth Correction by ‘Moving Block’ Bootstrap
for Stationary and Nonstationary Data
S.N. Lahiri
Iowa State University
ABSTRACT
This paper considers second order properties of the ‘moving
stationary.
the situation may be totally different when the observations are not
case of Markov chains with finite state space, Basawa et al. (1990)
variables. Also, see Kulperger and Prakasa Rao (1989), Datta and
should be pointed out that the same modification is also put forward
Shi and Shao (1988) (for m-dependent data) and by Moore and Rais
Set k = [n/ℓ]. Here, [x] denotes the integer part of a real number
size k from the observed blocks Ey kor-- 20, and defines the boot-
not very serious for the first order results but becomes predominant
to zero almost surely but not at the desired rate o(n71/2), Theorem
when the data are m-dependent. In this case as well, one has to use
sample mean is second order correct for observations which may not
2. Results on xX, As in the iid case, the key step in proving the
paper.
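As an illustration of the resampling scheme referred to above (not part of the original paper, whose text is only partially preserved here), the following minimal Python sketch implements the moving-block bootstrap of the sample mean: the n − ℓ + 1 overlapping blocks of length ℓ are formed, k = [n/ℓ] of them are drawn with replacement, and the resampled series is their concatenation. The AR(1) data-generating model is an illustrative assumption.

import numpy as np

def moving_block_bootstrap_mean(x, block_len, B=1000, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    l = block_len
    k = n // l                                      # k = [n / l] resampled blocks
    blocks = np.array([x[i:i + l] for i in range(n - l + 1)])
    means = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n - l + 1, k)         # choose k blocks at random
        means[b] = blocks[idx].mean()               # mean of the concatenated series
    return means

rng = np.random.default_rng(0)
x = np.empty(200); x[0] = 0.0
for t in range(1, 200):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()   # AR(1) series for illustration
reps = moving_block_bootstrap_mean(x, block_len=10, B=2000, rng=1)
print(x.mean(), reps.std())                         # bootstrap s.e. of the sample mean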
(Can) Ex, |" < © and M= lim M, is nonsingular.
Nw
Yam for which (i) EIX,-Y,ml $ d-lexp(-dm) and (ii) with m-a,,
EW ml L(Y, al < nl/4) <a}. Here oP is the o-field generated
by (D;, a<j<b}, {a,} is a sequence of real numbers satisfying logn
= o(a,) and a= 0((log n)1+1/d) as n+, and for any set A, I(A)
(C.5) There exists d>0 such that for all m,n,q=1,2,... and AD
moments of 1
requires Ex,8 < © for almost sure convergence of the second moment.
B14G45» J>1 for some sequences of iid random variables {¥,} and
constants {C,)}, one may define D;’s in terms of Yj'S rather than
otherwise specified, all the integrals will range over the entire
Euclidean space in question and the limits in the order symbols are
Our first result gives the exact rate of approximation for the
distribution on RP.
Under the assumed conditions, the bound (8,|| tends to zero
almost surely (See Lemma 5.2 (ii)-(iv)), but not even at the rate
Theorem 2.2. Suppose there exist constants C)>0, C,>0, a>0, a<B<1/4
(a) Then, for any Borel measurable f: RP + R with M,(f)<, and any
Remark 2.2 Theorem 2.2 shows that the stationary bootstrap procedure
is second order correct if the block size 2 is not too small or not
too big compared to the sample size n. The lower bound on the order
of the order o(e-k) for any positive number k. The upper bound on
Remark 2.3 In the iid set up, works of Singh (1981) and Babu and
Theorem 2.2.
3. s
of mean:
Smooth function In this section we consider the
Theorem 3.1. Suppose that conditions (C.1) - (C.5) hold with X,"s
statistics based on sample means. However, unlike the iid case, one
Here, the main problem arise from the form of the asymptotic
statistic, first note that Sy and hence T3y are not simple averages
-1 k, 2
sy = Dreg@plky) LyaqUng) - x,
(2me1)
and Ton = (ee) an)/s,
where a, = 1 or 2 according as r=0 or r > 1. Unfortunately, the
problem is not with the bias, rather with the normalizing variable
* A * *
and Lee ‘k,e Ty(Xn - by)/TySp- (3.2)
Note that r2/k,e is the actual variance of the bootstrap sample mean
of :
Xne On the other hand, writing fo(y) = Yr=08rY y42 - (2m+1)y?, y =
below, one can show that T. ate o(1) a.s.(P). However, the rate
order correct.
constants C,>0, C,>0, 0<a<B<1/4 such that Cn%e<C nF, then, for the
depending on the choice of S_N. See Götze and Künsch (1990) for a
series data observed over a long period often tend to show lack of
bootstrap works reasonably well even when the data are independent
Lahiri (1989), Liu and Singh (1989)). In fact, the works of Liu
that one can obtain ‘good’ approximations even when the observations
there exist constants C) 520.5 7) >0, 0<a< B < 1/4 such that Cyn®
Section 3.
|W; |<B a.s. (Q) for some B>O. Then, for any positive integer r and
l<m<C(r)
en,
n k k j t
(Dyar44g)” = Ljuy LyC(ay---@5) Yo MEL] ae (5.1)
I<ij}<...<ijsn.
r Ly C(ay,---.05) Lp E Up}
[Zja1 siesta: Teyk
fe < C(r)n"EWS (5.2)
For proving Lemma 3.1, it is now enough to show that for each fixed
JE >raoaa
De_, kak rokpk
#3 | Wsi, | < C(r)[n*B*a(m) + n'mSB*]
and litay - ix > m for some teA}. Next, rearrange the sum Yo as
A Qa
IZ3 E OY, we 1 < c(r)nkBKo(m).
t
Hence, it remains to find an upper bound on the sum Yq: Since Bp
To that end note that the indices t, tte,, tel’ are not necessar-
r if and only if stq< vy. If q<-s, then the bounds on vy imply that
stq < 2s < v. On the other hand, if q > s, then st+q < 2q <
function T(n;@,+) by T(n;@,x) = nox g(n-P |x|) / xi, for xe RP, 6>0,
Lemma 5.2. Suppose that 2+ and e=0(n(1-8)/4) for some 6 > 0. Then
a.s.(P)
yi,
ny = bo} SP
t=] (Um)
Up-m +E t=1
2) (M,-H#,)/d-
(m-,)/b 5
(5.3)
Using the condition sup(E|X,|}*:321) < » and the definition of ms»
one can show that there exists a constant C > 0 such that
a.s.(P)
273
I(oe -lo Dod; (X5+XKy_54g) =o(n F-2)/40)
and (5.4)
[Jb-]°E, af
&j-T0X,))] e=O(n")
-]
Next, note that for all x,ye RP, there exists a constant Cy
follows that
-dm/2)
Jo"= 15,(T0K)-TY;,I = (2-8/2 as. (P)
and (5.5)
oS= E(TOX:)-T0Y4,
qhIIL= (2).
-d
Finally use Lemma 3.1 and condition (C.2) to conclude that for any
ae[0,1/4)
(££, (U;-H,) (U}-#,)/-M,) goes to the zero matrix almost surely (P).
tE, (U}-Hi,)
(UpFig)’
= eb TSP sC(Ug—m,) (Uy-m,)/ + (mH) (Uy-my)’
+ (Ug-m,) (my)! + (mB) (mH) T-
Clearly, by assumption (C.6), the last term above goes to zero.
w.l.g. assume that H 5-0 forzall sla Rix I<ip; JysP- With m=a,, and
The term k=0 can be shown to converge to its expectation a.s. (P)
on
For j < 2, one uses the trivial bound n@e2 on the number of
estimates. First we consider the case when j=4. Note that in this
Write ie = Y3th, where 23 runs over all distinct pairs of
dence, each term under Y3 is at most a(m)n2 and the number of terms
differences to write
indices differ by at most 3m. Then, under Ye? there can be at most
lar arguments, one can show that R, < cn222m*. Hence, the bound on
Ag, is 0(n226m*).
The case j=3 also admits the same bound. Hence, it follows that
(P) for all 1<i»JgSP- The proof of Lemma 5.2(ii) can now be
taking m=0(f+a_), one can show that uniformly over all t in RP,
one given in the proof of Lemma 4.2 of Babu and Singh (1984) (here-
constant C > 0 such that EA, (U;-a,) 11"Oe <vort ass. (Ph).
Proof of Lemma 5.4: By Lemma 5.2(ii) and condition (C.1)’, @-1/2A,
converges almost surely (P) to M “1/2 and hence, is bounded in norm.
A
Therefore it is enough to estimate Eg /Z(U-1,)II*. By Lemma 5.2(i)
= T(G30,Y5 a)s for j>l, where m=a, 6=(1/12) + (1-4B)/48 and 0<B<1/4
0b FP (Vet
-ellv, 4) = 0(1) a.s. (P).
Let Z denote a p-dimensional standard normal random vector. Then,
using the arguments in the proof of Theorem 2.8 of [GH], one can
show that
For ve(Z*), |v] < 3, let Sins E,(A,(U}-it,))”. Let p, denote the
finite signed measure corresponding to the Fourier transform
: X
~
be a real number. Then, there exist a set rT; C2 with P(T,)=1 such
that for every sample point wel’, » there exists a constant C = C(t,w)
> 0 satisfying :
My(f) < @.
Proof of Lemma 5.5. The main steps in the proof of Lemma 5.5 are
JExZ>yh= o(k9/4),
=3/2 e,242,' ' - 1] = 0(k?)
-] and
(5.8)
of Lemma 5.2 and Lemma 5.3 (for a suitable depending only on @ and
for this sample point. Fix a function f with M4 (f)<o. Using the
arguments in the proof of Theorem 20.1 of [BR], one can show that
Fourier transform
-t’D t
A(t) = (Egexp(itZ,))* - pet KS/*p (its (x ae
2- ° A n
>
2D. (E424 24) and xp = vth cumulant of Z, under E,, ve(Z*)P and
Pales*) are polynomials (see page 51, [BR] for definitions of P).
modified for estimating the last term above. Instead of Lemma 11.2,
one has to use Lemma 24.1 of [BR]. Rest of the proof can be
[BR] and noting that, by condition (C.7), sup{|v,(t) |: & < |tll < 0)
< 1 for every 6 > 0.
Proof of Theorem 2.1: Note that P,.(T,€B) = Esf tty) where F(x) =
a constant C, > 0 such that Sup(wW(1(_,, xy (+-¥) 38) :xe RP) = Con for
all n>0, and ye RP.
As for part (c), note that by condition (C.1) (w.1.g., assuming
H5=H=0 for all j>l),
Proof of Theorem 2.2: Note that the signed measure Yn in Lemma 5.5
has density
equation (2.1).
one can show that the distribution of /n(H(Y) - H(a)) has the
following expansion:
sup |Px(/k 3/2 (H(¥h) - H(@,))€B) - sas n/2 (x5 {Xy, ))6(x)dx|
= o(n71/2) a.s. (P)
over BeB, 5
Then it is obvious that the conclusions of Lemma 5.2 hold with X's
replaced by Y,’s and U;’s replaced by Uscs. Next, define the func-
HA(y) = (¥y-q)/Lf(y)
11/2
for y=(Y)> roc »Ym42) ’E Rae Then, for every n> Mm, H, is infinitely
of all orders ata. Let D, denote the vector of first order par-
Note that Tyy=/N(H(¥y)-H(a)) and Tae 13,77? (H, (TN) Hy (Ex TN)
References
Basawa, I.V., Green, T.A., McCormick, W.P. and Taylor, R.L. (1990)
Malabar, Florida.
842.
Götze, F. and Hipp, C. (1983). Asymptotic Expansions for Sums of
Künsch, H.R. (1989). The Jackknife and the Bootstrap for General
Liu, R.Y. and Singh, K. (1989). Using iid Bootstrap Inference for
Rutgers University.
Montreal.
Raoul LePage
Michigan State University
1. Introduction.
Bootstrap has trouble recovering the distribution of a sum of i.i.d. long
tailed errors εᵢ based on observations μ + εᵢ (Athreya 1987, Giné and Zinn
1989). There is an idea which seems missing from previous attempts to
deal with this difficulty. I have not found it, for example, in Athreya (1987)
or Knight (1989). Nor does it appear in Wu (1986) among the many topics
relating to resampling methods in regression. This idea is that the
presumed goal of bootstrap, recovery of the unconditional limit law of a
statistic of interest, may be too modest. Perhaps we should seek to recover
the conditional limit law given some characteristic of the errors.
Bootstrap confidence intervals, in cases where bootstrap appears to fail,
are actually narrower than their unconditional counterparts for the same
unconditional confidence level. Kinateder (1990) established an invariance
principle linking bootstrap to the order statistics of the data, and presented
simulations which show that for long tailed errors bootstrap confidence
intervals based on the sample average tend to be smaller than those
obtained from the unconditional limit law. The following histogram, taken
from Figure 6.2 of Kinateder, is illustrative. It depicts the sampling
distribution of log_e(Δ), where 2Δ is the bootstrap confidence interval width
(calibrated to have 0.95 unconditional confidence level) for samples of n =
50 from a symmetric distribution attracted to the Cauchy distribution. The
vertical line in the figure points to the log-width of a .95-confidence interval
based on the Cauchy limit law. The horizontal scale is about [-0.7, 7.0]
(logarithmic). It can be seen that the sampling distribution of Δ places
most of its mass far below the confidence interval width obtainable from the
unconditional limit law (in this case Cauchy).
2. Bootstrapping signs.
Our problem is to use i.i.d. observations Y_i = μ + ε_i, i ≤ n, to estimate
the sampling distribution of μ̂ − μ, where μ̂ = Ȳ. We consider the case of
errors whose signs are randomly assigned by coin flips.
Often (see Lemma 1 below) a scaling n^{-2/α} with α < 2 will be required
for distributional convergence of each of (nε̄)² and Σ_{i≤n} ε_i². In such a case
ε̄²/(n^{-1} Σ_{i≤n} ε_i²) is of order 1/n, which guarantees the relative smallness of the
difference between bootstrapping signs of residuals versus bootstrapping signs on the
actual errors present in the data.
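The following is a minimal sketch of the sign-bootstrapping idea just described, assuming symmetric errors: the absolute residuals are held fixed and their signs are re-drawn by independent coin flips, and the resampled means approximate the conditional sampling distribution of the sample mean given the absolute residuals. The function and variable names are illustrative and not taken from this chapter.

```python
# Sketch of "bootstrapping signs" for the mean of symmetric, possibly long-tailed errors.
import numpy as np

def sign_bootstrap_ci(y, level=0.95, n_boot=2000, rng=None):
    rng = np.random.default_rng(rng)
    n = len(y)
    mu_hat = y.mean()
    abs_res = np.abs(y - mu_hat)                    # |residuals| held fixed
    signs = rng.choice([-1.0, 1.0], size=(n_boot, n))
    boot_means = (signs * abs_res).mean(axis=1)     # bootstrap versions of (mu_hat - mu)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return mu_hat - hi, mu_hat - lo                 # basic (pivotal) interval

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = 1.0 + rng.standard_cauchy(50)               # long-tailed errors around mu = 1
    print(sign_bootstrap_ci(y, rng=1))
```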
Notice that one need not assume i.i.d. errors € in Proposition 1. With
the additional assumption of i.i.d. errors a far simpler comparison can be
obtained.
Proof. These relations are a simple consequence of the fact that, conditional
on F, the vectors 6 and e-are independent; é; are conditionally i.i.d. signs; é;
are conditionally uniform over all 2" xn! signed permutations of le;|. oO
Proof. Use the conditioning argument of Proposition l.a. Let {u,} be any
complete orthonormal basis for the column space of X. We make use of the
fact that for every a, 8,
Lemma 1. Suppose {U;} are i.i.d. uniform on [0, 1]; = 6 le;| with €; = UP
for a real number p, and 6: i.i.d as 6; . All sequences of r.v. are independent
of one another. Let sss time of j-th arrival in a homogeneous Poisson
process on [0,00] having a mean of one arrival per unit time. Then for every
n (but not for different n simultaneously) there is a standard construction of
the increasing order statistics Uy = 7; PIS ss i<n, in terms of which,
“P , 6! PP = vc weP , (4.1)
Only the case p < -1/2 corresponds to a symmetric stable law, yet the
other powers are of some interest for finite samples, before the central limit
theorem takes hold. They are best examined from the conditional viewpoint
of Proposition 1.a. since the limit law is uninteresting.
Proposition 3. The result (4.2) (with a scale constant on the right side)
holds for the corresponding Poisson construction of symmetric errors {e_i,
i ≤ n} belonging to the domain of attraction of a stable law of index 0 < α <
2 (i.e. p = −1/α in the above). Basic bootstrap (Knight, 1989) on the other
hand gives the limit law (5.1),
where, as is usual, Ȳ* denotes the bootstrap average, and {π_i} are i.i.d.
Poisson r.v.'s, having expectation one, and independent of δ, etc.
In the case p >-1/2 the conditioning vanishes in the limit.
arguments along the lines of Lemma 1. The form of the limit law in (5.1) is
due to Knight (1989), who does not address the problem of developing a
consistent bootstrap. Athreya (1987) obtained a characterization of the
unconditional limit law (5.1) as a random distribution. □
The following figure is a plot of confidence interval half-widths for
different samples, each of size n, from a standard normal distribution. The CLT
half-width is 0.258. The pairs being plotted are y = confidence interval half-
width obtained from bootstrapping signs; x = confidence interval half-width
obtained by the same Monte Carlo simulation applied instead to the signs of
the actual errors in the data. Sampling variation is biased upward due to
the use of only a modest number of passes in each bootstrap, but the point being made is
only that the approximation of Lemma 1 is operating very well. That is,
bootstrapping signs is approximately recovering the conditional sampling
distribution given the absolute values of the residuals.
References
Abstract
1. Introduction.
A sequence of time dependent observations tends to be
dependent, unless the time indices are sufficiently far apart. The notion of
m-dependence is probably the most basic model which takes into account
such dependence. Let {X₁, X₂, ...} be a sequence of r.v.s (random variables).
Let A and B be two events such that A depends on {X₁, ..., X_t} and B
on {X_{t+m+1}, X_{t+m+2}, ...}. The sequence {X_i} is called m-dependent if
any such pair of events A and B are independent. The sequence {X_i} is
called stationary if the distribution of any subcollection of the X_i's does not
change if each of the indices is shifted by the same amount. For instance,
(X₁, X₂, X₆) has the same distribution as (X₁₀, X₁₁, X₁₅) under
stationarity. Consider the variance of the sum Σ_{i=1}^n X_i. If {X_i} is
stationary and m-dependent, m < n, then
(1/n) Var(Σ_{i=1}^n X_i) = Var(X₁) + 2 Σ_{i=1}^m (1 − i/n) cov(X₁, X_{1+i}).
ae
7 dg ee eeepe =(%,) 2
i=!
and using the law of large numbers for stationary m-dependent r.v.s. This
inconsistency of the classical bootstrap was also mentioned in a remark of
Singh (1981). Since Y₁, ..., Y_n are i.i.d. observations from F_n, it is obvious
that this bootstrap completely ignores the dependence structure of the X_i's.
Consequently under m-dependent models, the achieved asymptotic variance
is incorrect. As for the classical jackknife procedure, the estimate of the
variance of √n(X̄_n − μ) is (n−1)^{-1} Σ_{j=1}^n (J_j − J̄)², where the J_j's are
pseudo-values; i.e., J_j = nX̄_n − (n−1)X̄_{n,j} = X_j. Here X̄_{n,j} =
(n−1)^{-1} Σ_{i=1, i≠j}^n X_i and J̄ is the mean of J₁, ..., J_n. This estimate equals s_n²,
the sample variance of the X_i's, and hence also converges to Var_F(X₁). Thus the classical delete-1 jackknife
is also inconsistent.
V̂_{n,b}(T_n) = b(n−b+1)^{-1} Σ_{i=1}^{n−b+1} (J_i − J̄)²,    (1.2)
where J_i denotes the pseudo-value based on deleting the i-th moving block of b consecutive observations.
At first glance the MBJ procedure proposed here may seem
somewhat to resemble the so—called delete—d jackknife. (See Miller (1974)
for a general review on the jackknife.) However, they are intrinsically
different. In fact, a delete—-d version of MBJ would be to delete d of the
moving blocks at a time, for all combinations of d such blocks.
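A minimal sketch of the moving-blocks jackknife (MBJ) variance estimate in the spirit of (1.2), specialized to the sample mean: every block of b consecutive observations contributes one block mean, and b times the sample variance of these block means estimates Var(√n X̄_n). The constants and centering follow one plausible reading of the text and are not copied verbatim from it; the names are illustrative.

```python
import numpy as np

def mbj_variance(x, b):
    x = np.asarray(x, dtype=float)
    n = len(x)
    # means of the overlapping blocks (X_i, ..., X_{i+b-1}), i = 1, ..., n-b+1
    block_means = np.array([x[i:i + b].mean() for i in range(n - b + 1)])
    return b * block_means.var()          # estimate of Var(sqrt(n) * X_bar)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e = rng.standard_normal(2005)
    x = e[:2000] + e[1:2001] + e[2:2002]  # a stationary 2-dependent series
    print(mbj_variance(x, b=20))          # should be near Var(X_1) + 2*(cov lag1 + cov lag2) = 3 + 2*(2+1) = 9
```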
With b = 2 and T_n = X̄_n, the pseudo-values are U_j = (X_j + X_{j+1})/2, j = 1, ..., n−1,
and, up to edge terms involving (X₁ + X_n), (1.2) becomes
V̂_{n,2}(X̄_n) = 2(n−1)^{-1} Σ_{j=1}^{n−1} (U_j − Ū)².
This shows that V̂_{n,2}(X̄_n) is essentially the sample variance of
U₁, ..., U_{n−1}. Consequently, the jackknife variance converges to
Var(X₁) + cov(X₁, X₂).
Suppose now that the statistic admits the expansion
T_n = T(F) + n^{-1} Σ_{i=1}^n g_F(X_i) + o_p(n^{-1/2}),
where F is the c.d.f. of X₁ and E g_F(X₁) = 0. This comment applies
under i.i.d. models as well as weak dependence models. When the X_i's are i.i.d.,
the pseudo-values of the standard delete-1 jackknife are g_F(X_i), i =
1, 2, ..., n, with a remainder of order o(1) uniformly in i (see Parr (1985) and
Singh and Liu (1990)). It turns out that the ith pseudo-value of the MBJ
procedure is
2 b
V̂_{n,b}(X̄_n) → σ²_{b,F} = b^{-1} Var(Σ_{i=1}^b X_i)  a.s. as n → ∞.
2
n—1 (X. + X,,,)
erat te pres
a aT8 a
]}=
with E X₁⁴ < ∞. Let b = b_n, b → ∞ and b/n → 0 as n → ∞. Then
V̂_{n,b}(X̄_n) → σ²_∞ = Var(X₁) + 2 Σ_{i=1}^m cov(X₁, X_{1+i})  in prob. as n → ∞.
Note that σ²_∞ is the asymptotic variance of √n X̄_n.
(n−b+1)^{-1} Σ_{i=1}^{n−b+1} (B_i − b^{1/2} X̄_n)²,    (2.3)
where B_i = b^{-1/2} Σ_{j=i}^{i+b−1} X_j. Since E B_i = b^{1/2} μ, expression (2.3) can
be rewritten as
(n−b+1)^{-1} Σ_{i=1}^{n−b+1} [(B_i − b^{1/2}μ) − b^{1/2}(X̄_n − μ)]².    (2.4)
Clearly, (B_i − E B_i), i = 1, 2, ..., n−b+1, is a stationary (m+b)-dependent
sequence of r.v.s. We observe that the splitting technique mentioned in the
Appendix for m-dependent r.v.s is not applicable here for getting the law of
large numbers when (m+b) → ∞ as n → ∞. Next we express (2.4) further as
A_n + D_n − 2C_n, with
A_n = (n−b+1)^{-1} Σ_{i=1}^{n−b+1} (B_i − E B_i)²,   D_n = b(X̄_n − μ)²,
and C_n the corresponding cross-product term.
bound X̄_n − μ = O_p(n^{-1/2}). (In fact, this bound also comes into use in
showing that C_n → 0 in probability.)
Note that
Var((B_i − E B_i)²) ≤ E((B_i − E B_i)⁴) = b^{-2} E(Σ_{j=1}^b (X_j − μ))⁴,
which is of order O(1) as n → ∞, provided E X₁⁴ < ∞. (See the Appendix.)
Applying this particular observation and the condition b/n → 0 in (2.5), we
obtain that Var(A_n) → 0 as n → ∞. Using the Chebyshev inequality it is easy to
show that |A_n − Var(B₁)| → 0 in probability. The result now clearly
follows since Var(B₁) → σ²_∞.
se
igi 8p(X))|
— ee +»
max
oe
=
eee
tends to zero a.s. as n+. Here b is assumed to be a fixed positive integer.
V̂_{n,b}(T_n) → b^{-1} Var(g_F(X₁) + ... + g_F(X_b))  a.s. as n → ∞.
j We rewrite bQ, 5 2
4. 4 1 =
, fa ~—(n—b)
“] 28a, +(n-b) [2Lu,- LE u
4 jx * jk © 5x4, 540-4 ix
al
r
where a_{jk} = h_F(X_j, X_k) − h_F(X_j, ·) − h_F(·, X_k) + h_F(·, ·), with h_F(x, ·) =
E h_F(x, X₁) and h_F(·, ·) = E h_F(X₁, X₂). Next, we express
.
pomn aan:
nf{n-5) D« edPig
and
| 2 1 ts s sina
a where
= £ Ua, and = Zz ua -
: eae Siig ia *
_ ~‘Wkis clear that A_ +B, cannot exceed
b
m4 |max
Cm) Ig|jq..|+2 1max jq.j],
jén la]
where (*) stands for i < j< i+b—1 and 1¢i¢ n-b+1.
Thus, it only remains to show that
max |q,|
= o(n//b),
1¢ jin
and
Recall that ξ₁, ..., ξ_k are i.i.d. sampled blocks from the moving
blocks B₁, ..., B_{n−b+1}; for each i, ξ_i = (ξ_{i1}, ..., ξ_{ib}); Y₁, ..., Y_ℓ = (ξ₁, ..., ξ_k)
stands for the bootstrap sample of size ℓ, and F*_ℓ is the empirical d.f. based
on Y₁, ..., Y_ℓ. Throughout this section we assume ℓ and n are of the same
order, namely ℓ/n is bounded below and above.
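A minimal sketch of the moving-blocks bootstrap (MBB) just recalled, for the sample mean: overlapping blocks of length b are resampled with replacement and concatenated, and the distribution of √ℓ(Ȳ* − X̄_n) across resamples approximates that of √n(X̄_n − μ). The centering and helper names are ours, chosen to match the description above rather than copied from it.

```python
import numpy as np

def mbb_distribution(x, b, n_boot=1000, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    blocks = np.array([x[i:i + b] for i in range(n - b + 1)])   # all moving blocks
    k = n // b                                                  # blocks per resample
    l = k * b                                                   # bootstrap sample size (~ n)
    stats = np.empty(n_boot)
    for r in range(n_boot):
        idx = rng.integers(0, n - b + 1, size=k)
        y = blocks[idx].ravel()                                 # concatenated resampled blocks
        stats[r] = np.sqrt(l) * (y.mean() - x.mean())
    return stats

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    e = rng.standard_normal(1205)
    x = e[:1200] + 0.5 * e[1:1201]                              # stationary 1-dependent series
    print(np.var(mbb_distribution(x, b=15, rng=2)))             # compare with 1.25 + 2*0.5 = 2.25
```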
Case (i): T_n = X̄_n.
For the bootstrap approximation of the sampling distribution
of X̄_n based on the MBB procedure, with b fixed, we obtain Theorem 5
below.
Proof: (with b= 2) The proof hinges on the observation that under the
MBB procedure the effective sample is sen where g = (& + &o)/v2.
Let 2 = Sade: then ¥2Y,= yk re Note that Comey are iid. r.v.s
under the MBB scheme. Hence,
n—l1 X.+X
Washoe oe pee ttle iJ
prt ys
n—l X.+X,
wreey aug ate a.s.
j=
40% as.
Now we shall use the Berry—Esseen theorem to bound the normed difference
—1
ait a 3
1 n-b+1
= RebET 2 sel Ai eae bd
iii)
Pe
E*Y,=X, + of(n—1/2[2
Here o* stands for o_ under the bootstrap probability. We proceed to
prove statements i), ii) and iii) in that order to complete the proof of the
theorem.
: i:
Proof of i) Var* (|) = abelze= it1 (B, — Et) , where B, =
(X; + Xj, 4+--+%)
14, 1)/vb. Let us rewrite Var* (2,) as
12 wb § : 9
nobel cae [((B; — you)
—(E*E, — yby)]°.
To prove this we need the result that Ree (x pate = o(b2t 4/a
This is a well—known fact for iid. r.v.s and it can be extended easily to
m—dependent r.v.s by the splitting technique mentioned in the Appendix.
= [b(n —b + 1)]2
- {b z X; — [(b-1)X, + (b-2)X, +... + 2X9 +Xp_4]
S3[(b-1)X, + (D-2)Xp y+ 2X, 9) + Xap yl}
=X, + [n/(n-b+1) -1]X, + 0,(b/n)
=X, +0 p(b/n) in probability.
Lemma 1. For any δ > 0, there exists a universal constant c such that
F*_ℓ(x) = b^{-1}[F*_{ℓ,1}(x) + ... + F*_{ℓ,b}(x)]
and
||F*_ℓ − F_n|| ≤ b^{-1} Σ_{i=1}^b ||F*_{ℓ,i} − F_n||.
Consequently, by the Bonferroni inequality we have
P*(||F*_ℓ − F_n|| > ε) ≤ Σ_{i=1}^b P*(||F*_{ℓ,i} − F_n|| > ε).    (3.3)
Next we shall show that, for any i, 1<i<b,
o*(n2/2) if b is fixed,
IFE-F,l=4? n
o8(b!/ nog pa Rr eete
Next, we extend the Dvoretzky—Kiefer—Wolfowitz inequality
to the m—dependence case.
T(F+)
(FD) -—T(Fet -15
a eeeee (vy)
Pent -2 5 g(x,iad + 0#(a
iota tip 2/2);
ails g(v)-2 5 e(X)] +0(NF}—Fi)
j) |Te y +o(IF,—FI).
n (3.5)
Now the two exponential bounds of Lemmas 1 and 2 and the arguments of
Theorem 1 of Liu, Singh and Lo (1989) can be used to show that the
remainder term of (3.5) is of the order ot(n ay,
Theorem 9. Assume T(-) admits the following expansion at F for some Sp:
Proof: To establish the claim, we essentially need to prove that for any ¢ >
0, PA(IF¥— FI? + IF, = FI? > 1%) 40s. Clearly |IF¥— FI? +
{k= FI? ¢ 2||F5- Fl? so) | P|”. It follows from Lemma 2 that
Theorem 10. Under the condition of Theorem 9 and the condition b log n/√n → 0
as n → ∞, we have
4. Concluding Remarks.
Var(X₁) + 2 Σ_{i=1}^{min(b,m)} (1 − i/b) cov(X₁, X_{1+i})
decreases to σ²_∞ and the above comments on MBB and MBJ still hold.
E(X − μ_F)² in the univariate case, the correlation and regression coefficients
in the bivariate case. In the case of the variance: p = 1, q = 2; h₁(X) = X²,
h₂(X) = X. In the case of the correlation coefficient: p = 2, q = 5. Note that
∫(x − μ_G)² dG − ∫(x − μ_F)² dF = ∫(x − μ_F)² d(G − F) − (μ_G − μ_F)².
The term (μ_G − μ_F)² = O(||G − F||²) if the distributions considered have
compact support.
then n
D,_,X; = (KX, + Xp t+. + Xy) + (Xq + Xy te + X,)-
Applying the triangle inequality and this splitting technique, many
asymptotic bounds for independent r.v.s can be extended immediately.
Several such extensions are used repeatedly in the text. Some of them are
listed here for reference. If {X_i} is stationary and m-dependent, then
1) n^{-1} Σ_{i=1}^n X_i → E X₁ a.s. (if E|X₁| < ∞);
2) Σ_{i=1}^n (X_i − E X₁) = O_p(n^{1/2}) (if E X₁² < ∞);
3) E|Σ_{i=1}^n (X_i − E X₁)|^p = O(n^{p/2}) (if E|X₁|^p < ∞);
4) Σ_{i=1}^n (X_i − E X₁) = O(n^{1/2}(log log n)^{1/2}) a.s. (if E X₁² < ∞).
References
Künsch, H.R. (1989). "The jackknife and the bootstrap for general stationary
observations." Ann. Statist. 17, 1217-1241.
Bootstrap Bandwidth Selection

J. S. Marron
University of North Carolina
1. Introduction
This is a review of results concerning application of bootstrap ideas to
bandwidth or smoothing parameter selection. The main ideas are useful in
all types of nonparametric curve estimation settings, including regression,
density and hazard estimation, and also apply to a wide variety of
estimators, including those based on kernels, splines, orthogonal series, etc.
However as much of the work so far has focused on perhaps the simplest of
these, kernel density estimation, the discussion here will be given in this
context.
The density estimation problem is often mathematically formulated
by assuming that observations Ky oX n are a random sample from a
probability density f(x), and it is desired to estimate f(x). The kernel
estimator of f(x) is defined by
A es! n
error criteria. Here the focus will be on the expected L₂-norm, or Mean
Integrated Squared Error,
MISE(h) = E ∫ (f̂_h(x) − f(x))² dx.
Taking h too small gives too much variance (intuitively clear since few observations enter
the effective local average when the window width is too small), while h too big
introduces too much bias (again intuitively clear since a large window width
introduces information from too far away).
Section 2 discusses bandwidth selection by minimization of bootstrap
estimates of MISE(h). In particular it is seen why the smoothed bootstrap
is very important here. An interesting and unusual feature of this case is
that the bootstrap expected value can be directly and simply calculated, so
the usual simulation step is unnecessary in this case.
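A minimal sketch of smoothed-bootstrap bandwidth selection of the kind described in Section 2, using Gaussian kernels: a pilot estimate f̂_g is built with bandwidth g, bootstrap samples are drawn from f̂_g (resample the data and add g·N(0,1) noise), and the bootstrap MISE of f̂*_h about f̂_g is estimated and minimized over h. The section notes that this bootstrap expectation can be computed in closed form; the Monte Carlo version below is only for illustration, and all function names are ours.

```python
import numpy as np

def kde(x_grid, data, h):
    # Gaussian kernel density estimate evaluated on x_grid
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def boot_mise(data, h, g, x_grid, n_boot=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(data)
    f_g = kde(x_grid, data, g)                       # pilot estimate playing the role of f
    dx = x_grid[1] - x_grid[0]
    total = 0.0
    for _ in range(n_boot):
        star = rng.choice(data, size=n) + g * rng.standard_normal(n)   # draw from f_g
        total += ((kde(x_grid, star, h) - f_g) ** 2).sum() * dx        # integrated squared error
    return total / n_boot

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal(100)
    grid = np.linspace(-4, 4, 201)
    hs = np.linspace(0.1, 1.0, 10)
    scores = [boot_mise(data, h, g=0.7, x_grid=grid, rng=1) for h in hs]
    print("selected h:", hs[int(np.argmin(scores))])
```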
Asymptotic analysis and comparison of these methods is described in
Section 3. Connection to other methods, including Least Squares Cross
Validation, is made in Section 4. Simulation experience and a real data
example are given in Section 5.
studied, denoted in this case by L(f̂_h(x) − f(x)). The usual simple means of
doing this involves thinking about "resampling with replacement". One way
of thinking about this probability structure is through random variables
I₁, ..., I_n, which are independent of each other, and of X₁, ..., X_n, and are
uniformly distributed on the integers {1, ..., n}. These new random variables
contain the probability structure (conditioned on X₁, ..., X_n) of the
"resample" X*₁, ..., X*_n, defined for i = 1, ..., n by
X*_i = X_{I_i}.
As a first attempt at using this new conditional probability structure
to model the bandwidth trade—off in the density estimation problem, one
might define the "bootstrap density estimator"
f̂*_h(x) = n^{-1} Σ_{i=1}^n K_h(x − X*_i)
is useful. Faraway and Jhun (1987) have pointed out that this
where E is expected value "in the bootstrap world", in other words with
ig(*)
(x)=n=ot}oP Eee
Hence it seems natural to study when the approximation
3. Asymptotics
In this section, choice of g and L is considered. A sensible first
attempt, see Taylor (1989), would be to try g = h and K=L. This can be
easily analyzed using the assumptions and asymptotics at the end of section
1, with the important part being
range for h (see section 3.3.2 of Silverman (1986) for example), f̂''_g(x) does
not even converge to f''(x) (because the variance does not tend to 0). For
this reason, Faraway and Jhun (1987) propose using g > h. However,
observe that f''(x) is not what is needed here; instead we need the
functional ∫(f'')², which is a different problem. Indeed, for such g, Hall
and Marron (1987) show that ∫(f̂''_g)² is a consistent estimate of ∫(f'')²,
although this choice of the bandwidth g is quite inefficient in the sense that
it gives a slower than necessary rate of convergence.
A means of quantifying this inefficiency, which is relevant to
bandwidth selection, is to study its effect on the relative rate of convergence.
In remark 3.6 of Hall, Marron and Park (1990), it is shown that
(ĥ − h₀)/h₀ = O_p(n^{-1/10})
when g = h, where h₀ denotes the minimizer of MISE(h). This very slow
rate of convergence is the same as that well known to obtain for least squares
cross—validation, and for the biased cross—validation method of Scott and
Terrell (1987) (which uses g =h in a slightly different way). For this
reason, as well as the fact that the appropriate bandwidth for estimating
f(f aye is different from that for estimation of f(x), the choice g=h does
not seem appropriate.
Good insight for the problem of how to choose g_has been provided
by the main results of. Hall, Marron and Park (1990). The minor
modification of these results presented here is explicitly given in Jones,
Marron and Park (1990), where it is seen that if f has slightly more than
four derivatives, and L is a probability density, for C,, C, and Cy
Taking g of the same order as h gives the slow n^{-1/10} rate in the above paragraph. This
expansion is important because it quantifies the trade-off involved in the
choice of g. In particular there is too much "variance" present if g → 0 too
rapidly, and a "bias" term that penalizes g → 0 too slowly. This variance
and bias can be combined into an "asymptotic mean squared error"
which can then be optimized over g to see that the best g has the form
g = C₄(f, K, L) n^{-1/7},
which gives a correspondingly faster relative rate of convergence for the selected bandwidth.
Data based methods for estimating C₄ are given in Jones, Marron and Park
(1990). Note that this rate of convergence is much faster than n^{-1/10}.
A natural question at this point is: can this rate be
improved? As noted in remark 3.3 of Hall, Marron and Jones (1990), by
taking L to be a higher order kernel, this rate can be improved all the way
up to the parametric n^{-1/2} (L needs to be of "order 6" for this fast rate).
This rate has been shown to be best possible by Hall and Marron (1990).
However there is a distinct intuitive drawback to this in that when L is a
higher order kernel, it can no longer be thought of as a probability density,
of h, of the form
g=C a
the pairwise differences of the data. Note that, using f̂_{h,-i} to denote the
kernel estimator based on the sample with X_i excluded, the least squares
cross-validation criterion can be written in the form
Note that the first term provides the same sensible estimate of V(h)
discussed in the paragraph above, while the second term has features
make this connection more precise, note that when there are no replications
among X₁, ..., X_n, the second term is the limit as g → 0 of
over B̂₂ has been demonstrated by Jones, Marron and Park (1990) in terms
of the relative rate of convergence. Simulation work has also indicated a
usually small superiority of the diagonals-in approach, although the
improvement is sometimes much larger because the diagonals-out version is
less stable. One possible explanation as to why this happens is that the
diagonals-in estimate of B₂ is the smoothed bootstrap estimate of B₂, while the
diagonals-out version does not seem to have any such representation.
Faraway and Jhun (1987) have pointed out that the bootstrap
approximation can be used to understand the bandwidth trade-off entailed
by other means of assessing the error in f,. For example one could replace
calculation of E* can be done; however this has not been explored carefully
yet, mostly because it seems sensible to postpone investigation of this more
challenging pointwise case until more is understood about the global MISE
problem.
[Figure 1: the ratio MISE(h)/MISE(h_MISE) plotted for the bandwidth selectors under comparison.]
more rapid improvements in BSF than the others (as expected from its
faster rate of convergence). For the N(0,1) BSF gave really superlative
performance, in fact even beating out the Normal reference distribution
bandwidth given at (3.28) of Silverman (1986). For densities which are still
unimodal, but depart strongly from the normal in directions of strong
skewness or kurtosis, the performance was not so good (in fact CV is
typically the best in terms of MISE), but can be improved a lot by using a
scale estimate which is more reasonable than the sample standard deviation
in such situations, such as the interquartile range. On the other hand when
f is far from normal in the direction of heavy multimodality, again most of
these newer bandwidth selectors were inferior to CV in the MISE sense,
but the sample standard deviation was a more reasonable scale estimate than
the IQR. A way to view both of the above situations, is that they are cases
where it takes very large sample sizes before the effects described by the
asymptotics take over. There is still work to be done in finding a bandwidth
selector which works acceptably well in all situations.
To see how well these methods work on a real data set, they were
tried on the income data shown in Figure 2 of Park and Marron (1990). The
data and importance of that type of display are discussed there. Several of
the bootstrap bandwidth selectors considered in this paper were tried on this
data set. The best result was for SBF with the N(0, σ̂²) reference
distribution used immediately. Figure 2 here, which compares nicely to
Figure 2 in Park and Marron shows the result. The other variants, involving
estimation steps in the choice of g, tended to give smaller bandwidths,
which are probably closer to the MISE value, but gave estimates that are too
rough for effective presentation of this type.
REFERENCES
Efron, B. (1982) The jackknife, the bootstrap and other resampling plans,
CBMS Regional Conference series in Applied Mathematics, Society
for Industrial and Applied Mathematics, Philadelphia.
Hall, P. (1990) Using the bootstrap to estimate mean squared error and
select smoothing parameter in nonparametric problems, to appear in
Journal of Multivariate Analysis.
ABSTRACT
A block-resampling bootstrap for the sample mean of weakly dependent stationary
observations has been recently introduced by Künsch (1989) and independently by Liu and
Singh (1988). In Lahiri (1990) it was shown that the bootstrap estimate of sampling dis-
tribution is more accurate than the normal approximation, provided it is centered around
the bootstrap mean, and not around the sample mean as customary. In this report, we
introduce a variant of this block-resampling bootstrap that amounts to 'wrapping' the data
around in a circle before blocking them. This 'circular' block-resampling bootstrap has
the additional advantage of being automatically centered around the sample mean. The con-
sistency and asymptotic accuracy of the proposed method are demonstrated in comparison
with the corresponding results for the block-resampling bootstrap.
AMS 1980 subject classifications: Primary 62G05; secondary 62M10.
Key words and phrases: Resampling schemes, bootstrap, time series, mixing sequences,
weak dependence, nonparametric estimation.
1. Introduction
Suppose X₁, ..., X_N are observations from the (strictly) stationary multivariate sequence
{X_n, n ∈ Z}, and the statistic of interest is the sample mean X̄_N = N^{-1} Σ_{i=1}^N X_i. The sequence
{X_n, n ∈ Z} is assumed to have a weak dependence structure. Specifically, the α-mixing (also
called strong mixing) condition will be assumed, i.e. that α_X(k) → 0, as k → ∞, where
P*. If k is an integer such that kb ~ N, then letting ξ₁, ..., ξ_k be drawn i.i.d. from
P*, it is seen that each ξ_j is a block of b observations (ξ_{j,1}, ..., ξ_{j,b}). If all ℓ = kb of
the ξ_{j,i}'s are concatenated in one long vector denoted by Y₁, ..., Y_ℓ, then the block-
resampling bootstrap estimate of the variance of √N X̄_N is the variance of √ℓ Ȳ_ℓ under
P*, and the bootstrap estimate of P{√N(X̄_N − μ) ≤ x} is P*{√ℓ(Ȳ_ℓ − X̄_N) ≤ x}, where
Ȳ_ℓ = ℓ^{-1} Σ_{i=1}^ℓ Y_i.
It is obvious that taking b = 1 makes the block-resampling bootstrap coincide with the
classical (i.i.d.) bootstrap of Efron (1979), which has well-known optimality properties (cf.
Singh (1981)). It can be shown (cf. Lahiri (1990)) that a slightly modified bootstrap estimate
of sampling distribution turns out to be more accurate than the normal approximation, under
some regularity conditions, resulting in more accurate confidence intervals for μ. The modifi-
cation amounts to approximating the quantiles of P{√N(X̄_N − μ) ≤ x} by the corresponding
quantiles of P*{√ℓ(Ȳ_ℓ − E*Ȳ_ℓ) ≤ x}, where E*Ȳ_ℓ denotes the expected value of Ȳ_ℓ under the
P* probability (conditional on the original data). The need for re-centering the bootstrap dis-
tribution so as to have mean zero can also be traced back to Künsch's (1989) short calculation
of the skewness of his block-resampling bootstrap.
The reason that the re-centered bootstrap provides a more accurate approximation is that
E*Ȳ_ℓ = (N−b+1)^{-1} Σ_{j=1}^{N−b+1} b^{-1} Σ_{i=j}^{j+b−1} X_i = X̄_N + O_p(b/N), where, for consistency of the bootstrap in
the dependent data setting, b → ∞ as N → ∞. In other words, the distribution of Ȳ_ℓ under P*
possesses a random bias of significant order. This bias is associated with the block-resampling
bootstrap that assigns reduced weight to X_i's with i < b or i > N − b + 1. In other words, if we
let p*_i be the limit (almost sure in P*) of the proportion ℓ^{-1}(number of the Y_j's that equal X_i)
as ℓ → ∞ (and assuming no ties among the X_i's), although p*_i = b/R, with R = b(N − b + 1),
for any i such that b ≤ i ≤ N − b + 1, this proportion drops to p*_i = i/R for any i < b, and
p*_i = (N − i + 1)/R for any i > N − b + 1.
A simple and 'automatic' way to have an unbiased bootstrap distribution is to 'wrap' the
X_i's around in a 'circle', that is, to define (for i > N) X_i = X_{i_N}, where i_N = i (mod N), and
X₀ ≡ X_N. This idea is closely associated with the definition of the circular autocovariance se-
quence of time series models. The 'circular' block-resampling bootstrap amounts to resampling
whole 'arcs' of the circularly defined observations, and goes as follows.
• Define the blocks B_j as previously, that is, B_j = (X_j, ..., X_{j+b−1}), but note that now, for
any integer b, there are N such B_j, j = 1, ..., N. Sampling with replacement from the set
{B₁, ..., B_N} defines a (conditional on the original data) probability measure P̃*. If k is
an integer such that kb ~ N, then letting ξ₁, ..., ξ_k be drawn i.i.d. from P̃*, it is seen that
each ξ_j is a block of b observations (ξ_{j,1}, ..., ξ_{j,b}). If all ℓ = kb of the ξ_{j,i}'s are concatenated
in one long vector denoted by Y₁, ..., Y_ℓ, then the 'circular' block-resampling bootstrap
estimate of the variance of √N X̄_N is the variance of √ℓ Ȳ_ℓ under P̃*, and the 'circular'
block-resampling bootstrap estimate of P{√N(X̄_N − μ) ≤ x} is P̃*{√ℓ(Ȳ_ℓ − X̄_N) ≤ x},
where Ȳ_ℓ = ℓ^{-1} Σ_{i=1}^ℓ Y_i.
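A minimal sketch of the 'circular' block-resampling bootstrap described in the bullet above: the series is wrapped around a circle, so all N blocks of length b are available and every X_i receives equal resampling weight, which centers the bootstrap distribution at the sample mean automatically. Variable names are ours and the statistic is the sample mean.

```python
import numpy as np

def circular_block_bootstrap(x, b, n_boot=1000, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xx = np.concatenate([x, x[:b - 1]])                 # wrap the data around the circle
    blocks = np.array([xx[i:i + b] for i in range(n)])  # all N circular blocks B_1, ..., B_N
    k = n // b
    l = k * b
    out = np.empty(n_boot)
    for r in range(n_boot):
        y = blocks[rng.integers(0, n, size=k)].ravel()
        out[r] = np.sqrt(l) * (y.mean() - x.mean())     # centered at X_bar by construction
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    e = rng.standard_normal(1210)
    x = e[:1200] + 0.5 * e[1:1201]                      # stationary 1-dependent series
    print(np.var(circular_block_bootstrap(x, b=15, rng=4)))   # compare with sigma_inf^2 = 2.25
```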
The ‘circular’ construction is an integral part of a related resampling method in which blocks
of random size are resampled (cf. Politis and Romano (1991)). It can also be immediately
applied to bootstrapping general linear and nonlinear statistics, as in Künsch (1989), Liu and
Singh (1988), and Politis and Romano (1989). Künsch's proposal of 'tapering' the observations
in the B; blocks can also be incorporated in the ‘circular’ construction without changing the
first-order asymptotic results.
The following two theorems concern the consistency and asymptotic accuracy of the ‘circu-
lar’ block-resampling bootstrap. The theorems are stated for univariate sequences {X,,n € Z},
Theorem 1. Assume that E|X₁|^{6+δ} < ∞ for some δ > 0, and Σ_{k=1}^∞ k²(α_X(k))^{δ/(6+δ)} < ∞. As
N → ∞, let ℓ/N → 1, and let b → ∞, but with bN^{-1} → 0. Then σ²_N = Var(√N X̄_N) has a
finite limit σ²_∞, and Ṽar*(√ℓ Ȳ_ℓ) → σ²_∞, where Ṽar*(√ℓ Ȳ_ℓ) is the variance of √ℓ Ȳ_ℓ under P̃*.
Proof. The proof of Theorem 1 is immediate in view of the proof of the consistency
of the block-resampling bootstrap of Künsch (1989). If we let E*, Ẽ*, Var*, Ṽar* represent
expectation and variance under the P* and P̃* probabilities, then it is easily calculated that
Ẽ*Ȳ_ℓ = X̄_N, and that
Ṽar*(√ℓ Ȳ_ℓ) = N^{-1} Σ_{j=1}^N ( b^{-1/2} Σ_{i=j}^{j+b−1} (X_i − X̄_N) )²
= (N−b+1)^{-1} Σ_{j=1}^{N−b+1} ( b^{-1/2} Σ_{i=j}^{j+b−1} (X_i − X̄_N) )² + O_p(b/N).
Since Var*(√ℓ Ȳ_ℓ) → σ²_∞, it is seen that the first two moments of √ℓ Ȳ_ℓ under the P̃* probability
are asymptotically correct.
Finally, the moment and mixing conditions assumed are sufficient (cf. Hall and Heyde (1980))
to imply that √N(X̄_N − μ) is asymptotically normal N(0, σ²_∞). Noting that √ℓ(Ȳ_ℓ − X̄_N) is
also asymptotically normal (conditionally on the data X₁, ..., X_N), the proof is concluded. □
It is noteworthy that to make the bias of the block-resampling bootstrap distribution
of smaller order than its standard deviation, Künsch (1989) imposed the additional condition
bN^{-1/2} → 0, which is unnecessarily strong in our setting.
Theorem 2. Assuming that the sequence {X_n, n ∈ Z} is defined on the probability space
(Ω, A, P), denote by D_n, n ∈ Z, a sequence of sub σ-fields of A, and by D_{n₁}^{n₂} the σ-field generated
by D_{n₁}, ..., D_{n₂}. Also assume that E|X_n|^{6+δ} < ∞ for some δ > 0, and, as N → ∞, let
ℓ/N → 1, and b → ∞, but with bN^{-1/8} → 0. Under the following regularity conditions:
(a₀) α_X(k) decreases geometrically fast;
a) 3d > 0 such that for all k,n € N, with n > 1/d, there exists a Der measurable random
ariable Zin, for which E|Xk—Zkn| < d-te78", and E|Zrnyl?2(|Zisny| < &/4) < d=}, where
& i8 a sequence of real numbers satisfying logk= o(n,) and nz = O(log k)i+8 |as k + co;
a) 3d > 0 such that for allk,n€ N, with k >n>1/d, and allt > d,
(N−b+1)^{-1} Σ_{i=1}^{N−b+1} ( b^{-1/2} Σ_{j=i}^{i+b−1} X_j − b^{1/2} X̄_N )² + O_p(b/N).
Hence, Ẽ*(Ȳ_ℓ − X̄_N)³ − E*(Ȳ_ℓ − E*Ȳ_ℓ)³ = O_p(b/N), and the proof is concluded. □
References
[1] Efron, B. (1979), Bootstrap Methods: Another Look at the Jackknife, Ann. Statist., 7,
1-26.
[2] Hall, P. and Heyde, C. (1980), Martingale Limit Theory and its Applications, Academic
Press, New York.
[3] Künsch, H.R. (1989), The jackknife and the bootstrap for general stationary observations,
Ann. Statist., 17, 1217-1241.
[4] Lahiri, S.N.(1990), Second order optimality of stationary bootstrap, Technical Report
No.90-1, Department of Statistics, Iowa State University, (to appear in Statist. Prob.
Letters).
[5] Liu, R.Y. and Singh, K. (1988), Moving Blocks Bootstrap and Jackknife Capture Weak
Dependence, unpublished manuscript, Department of Statistics, Rutgers University.
[6] Politis D.N. and Romano, J.P. (1989), A General Resampling Scheme for Triangular
Arrays of a-mixing Random Variables with application to the problem of Spectral Density
Estimation, Technical Report No.338, Department of Statistics, Stanford University.
[7] Politis D.N. and Romano, J.P. (1991), The Stationary Bootstrap, Technical Report No.
365, Department of Statistics, Stanford University.
[8] Singh, K.(1981), On the asymptotic accuracy of Efron’s bootstrap, Ann.Statist., 9, 1187-
1195.
Some applications of the bootstrap
in complex problems
Robert Tibshirani
Department of Preventive Medicine and Biostatistics
and
Department of Statistics
University of Toronto
1 Introduction
In this paper I give two examples of the application of the bootstrap in some
complex problems. These problems arose as part of statistical consultations.
heart rates. The smoothing parameter was chosen so that the degrees of
freedom of the fitted curve was about 4 (see Hastie and Tibshirani 1990
chapter 3). This fixed smoothing parameter was used throughout. Despite
the correlation across workloads for a given individual, the mean heart rate
value is the generalized least squares estimate of the within group mean; see
for example Rice and Silverman, 1988. The results are the solid curves in
Figure 1.
The construction of prediction bands for these curves seems very difficult
analytically. Denote the estimated curve at workload w by f(w). Since the
smoothing operation is linear, an estimate of the pointwise standard error of
f(w) is easily obtained, say s(w). Then if our nominal level is 80%,
[Figure 1 appears here: heart rate versus workload (watts), one panel for each sex-height group (males and females; heights 137-151, 152-161, 162-165, 166-171 and 172-186).]
Fig. 1. Solid curves are estimates of heart rates at each workload, for children in the
indicated sex-height group. Broken lines are 80% global prediction bands. Tick
marks on the horizontal axis indicate workloads where measurements were taken.
(x_{i2}, ..., x_{i6}) ~ (1 − θ) M(n_{i+}, {p₁, p₁, p₁, q₁, q₁}) + θ M(n_{i+}, {p₂, p₂, p₂, q₂, q₂})    (3)
clusterings (2,3,4 vs 5,6) or (2,3 vs. 4,5,6) was better. The two models are not
nested, so this is a difficult question to answer. To address it, I bootstrapped
by sampling with replacement from the rows of the data matrix, and for each
bootstrap sample, I computed the difference in maximized log-likelihood val-
ues for the two models. The standard deviation of this difference was 10.4,
and so the observed difference of 24.5 is “significantly” large.
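A minimal sketch of the row-resampling bootstrap used above to compare two non-nested models: rows of the data matrix are resampled with replacement, both models are refit on each resample, and the bootstrap standard deviation of the difference in maximized log-likelihoods is used to judge whether the observed difference is large. The two fitting routines below are placeholder toy models, not the mixture models of the chapter.

```python
import numpy as np

def bootstrap_loglik_difference(data, fit_model_a, fit_model_b, n_boot=200, rng=None):
    """fit_model_* : callable taking a data matrix and returning its maximized log-likelihood."""
    rng = np.random.default_rng(rng)
    n = data.shape[0]
    diffs = np.empty(n_boot)
    for r in range(n_boot):
        rows = rng.integers(0, n, size=n)            # sample rows with replacement
        sample = data[rows]
        diffs[r] = fit_model_a(sample) - fit_model_b(sample)
    return diffs.std()                               # bootstrap SD of the log-likelihood difference

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(50, 3))
    # toy "models": Gaussian log-likelihoods with variance either pooled or per-column
    loglik_pooled = lambda d: -0.5 * d.size * np.log(2 * np.pi * d.var()) - d.size / 2
    loglik_bycol = lambda d: sum(-0.5 * len(c) * np.log(2 * np.pi * c.var()) - len(c) / 2
                                 for c in d.T)
    print(bootstrap_loglik_difference(data, loglik_bycol, loglik_pooled, rng=1))
```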
Acknowledgment:
I would like to thank William DuMouchel for helpful suggestions. This
work was supported by the Natural Sciences and Engineering Research Coun-
cil of Canada.
References:
Efron, B. (1983). Estimating the error-rate of a prediction rule: some
improvements on cross-validation. J. Amer. Statist. Assoc., vol 78, pages
316-331.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. Chap-
man and Hall. London.
Rice, J.A. and Silverman, B.W. (1988) Estimating the mean and covari-
ance structure nonparametrically when the data are curves. Unpublished
report.
D. S. Tu
ABSTRACT
1. INTRODUCTION
times.
differentiability of T at F.
only if there exist ¥: R-R and oy: R ate: depending only on T and F
Let now Xi, Xo, eee xX, be n i.i.d. random variables with
ie Ei)
F(x) = nd C253)
: n
BOn-1l(xjotnetien 4.y WX,ex)
j
“ ist, 2, 26... 0, (2.5)
jl
j=1
eel)
iJon (Fa-1 - ee) Pt
Th-12t and Ty gonty (7 Ri eee elas. boo. eta) (2.6)
and
T 1 n
ee _ and
* Sd
U =(n-1)
n
See *
2th, 5 arse
T,) (2.7)
2. ee )ees
sy=var*(y¥nW_ =< (ap:
2 nit*.9 j (2.10)
*
where var stands for the conditional variance given X,, Xo, seeds
quantities given X)> redouaity x)» can be used as an estimate for the
respectively.°
In comparing
.
asymptotic
.
expansions
°
of Ha and
*
Ha we
find, however, that this approximation cannot possess the desired
second order accuracy (see the discussion in the next section).
Following the idea of Abramovitch and Singh (1985), Beran (1984) and
B p B pe B
h(y=(y - OB) - —2my - Om)? , Amy _ —On)3 (2.11)
n yn yn 3n yn
where
A nea hoes
Pon="on- 6 2 Taj’ Bon=on* ~§ 2 Taj?
oe J= (2.12)
A ft A
n °
2 ‘i ;
+ SED
3(n-1) pera
(j) UT(k) 8 int, -(n-1)
« (T+)
CU ay)
jzk
(GN del
Diy aeie Cp
lO oe eh (2.43 )
*
Then we can use the conditional distribution of h,(L,) given X>
Lj=va(T-9)/s_,
G(x)=8(x)-n
1/?[ks 1+(1/6)kg 9(x7-1) 1¢(x) +0(1/V¥n) (3.2)
where (x) and $(x) are the distribution function and density
function of standard normal, respectively,
THEOREM 3.2 Suppose that $y» Sqr see > Sn are iid with
arrays satisfying
K(y)=0(y)-(1/6)
0(y) Vee(y"-1)eee¥ 170
ia =
(1 /vn) (3.6)
j=l
where
EY re
;=0, ry 2
EY;>0 and BlY; |3 ’<eo.
i=
Let
then we have
[G,,(y)-O(y)+(1/6)0(y) (y?-1)
tg/(Hy)9? |
(| + gb}
1)n
(3.8)
where C is an absolute constant,
ψ_j(t) = E exp(itY_j),
;
and 6, = 26 DENY; 19 -1
Pnj’ Wj =* “nice
var
Fe a i d
|“B($;
i (¢; -9)1[ 2yt[ |=
ice2 ) |>CHy 1/2 Ih
a.
s
vi» 3 3 25 PAR
IM,(| 1
Yas. 3 E|¢,-2| a Hla
ml yylS [ag 3/2 2)|
n n
-2|31(+}a_.||¢.-2
=
v¥n(p,)
1/2
ee 3 2
= ae la; | E|¢,-2| [Is 2|: max |a +] a
= 1<jsn ™
Ho =
rig
=p: |
ei) an
240,
jJ=1
we have
et 3
yn y E|W at —0,
ic ieee
that is J,=0(1/yn). Consider Jo. Since
——0
we have Jo=0(1/yn). Finally we treat Jz. Observe that
n n Ome
sup ty |v; (t) |= sup i a |(1-(1/2) it-BL) 4 |
|t|26 -j=1 |t|26° j=1 vn
1 Oi (l+on
< n.d ecle Ht 6°;
e gee
S 1%
= (1+n--3/2,2.2
6a nj \-2
j=l -
2 -3.4 4
-2n--3/2,2 6a ajo”
< inj e ote 61a ih
=1
=1 on™ja)2(2
- -2(6 (2 8nj) ny4/1(S2a
a2 * -sca
<1 - 2,°n,2f1
2-)?[2 y8 a2,2
n yn Mj=1 nj
6 n
= €2)7(= Tete
max |a eem2 ele
: | ai |
vn
ee
vn
so that
i
VnJg=vnn ( ue Le [Yj (+) | he1,\n 0;
nb,
2 j=
3.2, the weights Cie Sarl el, need only to statisfy E(¢,-E¢, )?=1
0<07=
{7 (x)dF(x)<o and S|¥Cx) [RdF(x) <0.
Then
n
P*{wr /LVAR’ (W,)17/?<y = &(y)-(1/6)0(y)(y2-1) ¥ 13 -+0(1/Vin) (3.9)
j=l
holds uniformly in y for almost all sample sequence X,; Xo, a
X_, where
n
*
PROOF. Take a_ .=T .-T. in Theorem 3.2. Then given X,,
n,j on J n 1
in Theorem 3.2 are true for almost all sample sequence X,, -
j=l
y a_ n,J
.=0
Note that sup IF, 4-Fyll < 1/n and by the definition of
<1sn :
differentiability we have
TG )=T(F,)-S¥(x) dF iF) (x) +0(([F, oral
n
=1(F)-¥(x)/[n-1]+{n(n-1)} > J; W(X, )+0(1/n)
j=l
and hence -
* -(X.) n
eg oe
teeing ayeni,© VOX; +001). (3712)
Therefore we have
1
— max |T. .-T * |< — 1 max |¥(X,)
ynacjen tS ya oe! |
ae era ia
+(1/vn) [2y W(X; )]+0(1/¥i)— 40 a.s.
j=l?
for the reason EY(X, )?<eo and also
|e
=) » fayoalSSA
j=1
Sis 1S)
ni), e 240%)
+ al
re +0(1)
order accuracy cannot be obtained. The reason is that the second
terms of the corresponding expansions are not the same. The
transformation given in (2.11) can help us to eliminate these
differences. This is the conclusion of Theorem 2.1, which we will
prove in the following.
+limit
n>
sup0(y)
y
|(K34- kgy )/24(k 32m k39 )(1-y? )/6
we can prove that they are strongly consistent estimates of Lost and
Note that
+n sup 1/1219,
6%)|
By ia baa
+31 Dyfea
5+d lesSec (3.18)
i
For Jia since
first prove that limit yn sup 1/12™n=2: ‘In fact, when [x|<nt/12
n>o |x|<n
-5/12
we have |x/y¥n|<n , hence limit sup 1/12
|o(x/¥n)|=0. Observe
n> «|x|<n
that
RES
limit
pagel
n
1" sup u_(x)-x|=limit [n
5 -1/12;4a “yn yor
2
nh 6 Ix|<n i/ial n Aecstes on 6521 fl
+n 1/12)”la,+ La ¥ ro.|(1+
6 nyja 713,| ixlen
sup _ 1/42 |o(x/va)
o(x/¥n |)]-0 a.s.
Therefore
- K 2
yn sup r Sighiait yn sup (u_ (x)-x)
yet / 12! se ae I<n 1/12°°n
Reaves
< Slimit
hae [n’’ 1/4 “su n
2
ix}ent/?? Ju_(x)-x|]°—-0 a.s. (3.19)
However
1 2
- eee Oy )|
n
=limit
noe vayn Be lé
1/12'6
g(x) (x2-1)-6(u, ur (x))
1s 0(x)(x"- (u2(x)-1))
i §
earn
ten
+0(x)(La,, at yin3 )x70(x/yn) |
ri a.s.
=limit yn as aliale
Al 0(x) (x 2 -1)-0(u, (2) (9-191.
2 y 23
ta |
n>o |x|<
and as in the proof of (3.13), we have
J, =limit
nia
sup
[x|<nt/1?
Wreyerm
6
ames
n
y oo
ral nj
n
<Klimit sup |u,(x)-x|vn J r3 =0 a.s. (3.20)
n>o |x|< ni/12 ja 1
For J,, note that limit -0(x)|x|*=0(k=0, 1, 2), and hence
xX 7-0
Jie < limit ynO(u_ (en!/12) ys imit o( -n)/12)
n> 0 n>
However
limit |u,(-n |
n> @
n
etait [at /2(@ ap Seca, ae
n> + coe 4 ny In
+n =1/3;4(a5,+ av
21,-8 3 |=0
tay) ,)_ a.s.
limit yn sup|J,(x)|
en. wale? < J, Lt ,+3,9tJ,,.
t2 5013 <0 as.
REMARK 3.4. From the proof of Theorem 3.2 we know that the
conditions of Theorem 3.1 were only used to obtain the asymptotic
expansion of P{T_n ≤ x}. If for some special statistics T_n we can obtain
such an expansion directly, then Theorem 2.1 will be valid under those conditions and the
conditions ensuring the consistency of the jackknife estimates
defined in (2.13), such as
4. TWO EXAMPLES
(1) U-statistics
U_n = (n(n−1)/2)^{-1} Σ_{i<j} h(X_i, X_j),    (4.1)
where h is a symmetric kernel function. In fact U_n is not a
and
θ = E h(X₁, X₂).
zy os -1/2
sup|P(o x)-F(x
Up <x)-F_
= | _” a (U_-@) )|
(x)| << en (4.3)
4.3
where
F,(x)=®(x)-(+/?9(x)
1/6 )eg(x?-1)
n,
oP=n(n-1)"Bg(x, +2? (x, ,x,)
and
of U0 can be definied
* %D ‘
THEOREM 4.2 Suppose that E|h(X, ,X_) |< and the other
conditions of Theorems 4.1 are satisfied. Then for
ee | Bat y ie r
w(y)=(y-B,/va)® + —2y- /Wn)? + —2y-8, via)? (4.5)
yn vn
where
Bivens 2,3/2
and
i, j+k i, j+k,1
we have
(2) M-estimates
with
3
J Mp (x, 0)dF(x)<o.
(vi) For each 0, there exists a Ug, such that for any 6>0
Pptvn(8
0) /WWCY<y)=8(y)- SAC 8) 40(1/Nin)
n
where
3
A( ,0) = {vy (x,9)dF(x) y 214)
: Pte aac? }
cal S¥G(x, W(x, 8) dF(x)
© 2 LR Cx, 0) d(x) )1/2( fog x, 0) F(x) )
Vp!(x,0)dF(x)
: feels: Fania750)}
(SV? (x, 0)4F(x)dF(x))
(i)
r) j=nd—(n-1)8) 1
n,
Dyas apt)
W(x, 899) hit(x)=0
with
Ci) ga
Fi-1 ~ n-1 aes
Let
n
(BR
J,n n j21
os 991
ee
1 2
E,* nh fe (8, 5-8);
i=1
where the definition of €, (isl, 2, ... , mn) is given above, and
3
ego tg{a 3 00-9.)
n-l on
Ben)? 5831-8
FO 99639 I(900 -9,1End Cn) ON(k) 905)
sev 95,P3}
2
re
n Mj=1
a) 978g,
Sucx, 93)
arG 3 (x)=0
(j,k) (x)=
Foo 1 ¥ 1{X.<x}.
n(8 -8)
et VV(8)
under the same conditions.
h(X, ,X), g(X,) and ~(X, ,X_). We need no conditions on h(X, ,Xq).
ACKNOWLEDGEMENTS
REFERENCES
BOOTSTRAPPING FOR ORDER STATISTICS
SANS RANDOM NUMBERS (OPERATIONAL BOOTSTRAPPING)
William A. Bailey
4Univariate or bivariate.
⁵Each of these distributions is conditioned on a
particular value of Y; however, they are partial
distributions, since the probabilities have not been
normalized to unity.
problems.
[low,+(nax/2 - 3)+A,high,)
[high,, high,+A)
Subinterval I(1) is the degenerate interval consisting
of 0 alone. If for some r₀ > 1, 0 ∈ I(r₀), then 0 is deleted
from I(r₀); that is, that particular subinterval has a
hole at 0.
[0,0]
(Llowy(r)-Ay(r),
Lowy (r) J
(low,(r),
Lowy (r) +A, (r))
[low (r) +A, ir), Lowy (r)+2-Ay(r))
p₁₀(r,s) + p₂₀(r,s) = p₀₁(r,s) + p₀₂(r,s) = m₀₀[I(r)×J(s)]
X₁(r,s)·p₁₀(r,s) + X₂(r,s)·p₂₀(r,s) = m₁₀[I(r)×J(s)]
X₁²(r,s)·p₁₀(r,s) + X₂²(r,s)·p₂₀(r,s) = m₂₀[I(r)×J(s)]
X₁³(r,s)·p₁₀(r,s) + X₂³(r,s)·p₂₀(r,s) = m₃₀[I(r)×J(s)]
Y₁(r,s)·p₀₁(r,s) + Y₂(r,s)·p₀₂(r,s) = m₀₁[I(r)×J(s)]
Y₁²(r,s)·p₀₁(r,s) + Y₂²(r,s)·p₀₂(r,s) = m₀₂[I(r)×J(s)]
Y₁³(r,s)·p₀₁(r,s) + Y₂³(r,s)·p₀₂(r,s) = m₀₃[I(r)×J(s)]
1=1j=1
We can represent the (partial) distribution of
(X₁+X₂, Y₁+Y₂) restricted to I(r)×J(s) by the four
points and associated probabilities
12The Appendices (not shown herein) can be obtained from
the author.
(X₁(r,s), Y₁(r,s))    p*₁₁(r,s)
(X₁(r,s), Y₂(r,s))    p*₁₂(r,s)
(X₂(r,s), Y₁(r,s))    p*₂₁(r,s)
(X₂(r,s), Y₂(r,s))    p*₂₂(r,s)
where p*₁₁(r,s), p*₁₂(r,s), p*₂₁(r,s) and p*₂₂(r,s) are
determined as follows:
2 2
Fi at I ae ee eR ee ee OE
¹³The absolute value of the error in the joint moment is
linear in two pieces, because of the absolute value
being taken.
14The Appendices (not shown herein) can be obtained from
the author.
n
—— = el
’ s=nay/2+1
(Xq(r,s),Yy(r,s)) Pay rss) |ponaxy2et
(X1(r,8),Yolr,s)) —Pyo(rys)
(X2(r,s),Y;(r,s)) Poy (r,s)
(X2(r,8),Yolr,s)) Poalr,s) |=?
s=1 .
A GENERALIZED BOOTSTRAP
By EDWARD J. BEDRICK! AND JOE R. HILL?
University of New Mexico and EDS Research
Abstract
This article defines a generalized bootstrap that is based on the general frame-
work for model-based statistics described in Hill (1990). This bootstrap in-
cludes Efron’s (1979) frequentist, Rubin’s (1981) Bayesian, and Hinkley and
Schechtman’s (1987) conditional bootstraps as special cases. It also includes
new bootstraps for empirical Bayesian (EB) models. In particular, we present
a conditional EB bootstrap used for EB inference that is conditional on EB
ancillary statistics.
Key words: Bayesian bootstrap; Conditional bootstrap; Empirical Bayesian
bootstrap; Frequentist bootstrap; General framework.
1 Introduction
The generalized bootstrap has four inputs:
The bootstrap takes advantage of the fact that λ̂_obs will be plugged in for λ
by directly calculating the needed summary.
That is, the summary p_{λ̂_obs}(θ, y) is the only member of P used by the bootstrap.
Hence the bootstrap can avoid calculations involving all λ in Λ, which contrasts
with most analytical methods.
REMARK E. The inputs P, λ̂, and z make this bootstrap very general.
Section 2 shows that for appropriate choices of these inputs, the generalized
bootstrap reduces to specific well-known bootstraps including Efron's (1979)
frequentist bootstrap, Rubin's (1981, 1984) Bayesian bootstraps, and Hinkley
and Schechtman's (1987) conditional bootstrap. Section 3 defines a conditional
EB bootstrap appropriate for inference (Hill, 1987, 1990).
2 Well-Known Examples
2.1 Efron’s frequentist bootstrap
The frequentist bootstrap (Efron, 1979, 1982a, b) has inputs: (i) P is a fre-
quentist model as described in Remark A of Section 1; (ii) usually, λ̂ is taken
to be the MLE, but this is not required; and (iii) z is null. A bootstrap
replicate sets θ* = λ̂_obs and generates y* from the MLS p_{λ̂_obs}(y). Because the
estimated prior and the posterior distributions are both the same point dis-
tribution θ = λ̂_obs with probability one, this bootstrap also arises if the fixed
features z = θ.
EXAMPLE 1. Let y₁, ..., y_n be independently and identically distributed
with cumulative distribution function (CDF) θ = λ ∈ Λ, where Λ is the
collection of all continuous distributions on R. The vector of order statistics
t = (y_{(1)}, ..., y_{(n)})' is complete sufficient for λ and the empirical CDF, λ̂, is the
nonparametric MLE of λ. The bootstrap sets θ* = λ̂_obs and generates y* by
y ~ Mult_k(n, λ).
The MLE of λ is λ̂ = y/n. The bootstrap sets θ* = λ̂_obs and generates y* from
the MLS
y* ~ Mult_k(n, λ̂_obs).
EXAMPLE 3. Let
y_i ~ N(θ, σ²)
independently for i = 1, ..., n, where λ = (θ, σ²) indexes the family. The
MLE of λ is (ȳ, σ̂²) where ȳ = Σ y_i/n and σ̂² = Σ(y_i − ȳ)²/n. The bootstrap
sets θ* = ȳ_obs and generates y* from the MLS
y*_i ~ N(ȳ_obs, σ̂²_obs)
independently for i = 1, ..., n. For example, the bootstrap estimates of the bias
and variance of ȳ are 0 and σ̂²_obs/n, respectively.
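A minimal sketch of the parametric frequentist bootstrap of Example 3: resample y* from N(ȳ_obs, σ̂²_obs), recompute the sample mean, and read off bootstrap estimates of its bias and variance, which should agree with the analytic values 0 and σ̂²_obs/n quoted above. Names and the toy data are ours.

```python
import numpy as np

def normal_mean_bootstrap(y, n_boot=5000, rng=None):
    rng = np.random.default_rng(rng)
    n = len(y)
    ybar, sigma2 = y.mean(), y.var()                 # MLEs (ybar_obs, sigma2_obs)
    ystar_means = rng.normal(ybar, np.sqrt(sigma2), size=(n_boot, n)).mean(axis=1)
    bias = ystar_means.mean() - ybar
    var = ystar_means.var()
    return bias, var, sigma2 / n                     # bootstrap bias, bootstrap variance, analytic variance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(2.0, 3.0, size=40)
    print(normal_mean_bootstrap(y, rng=1))
```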
y | θ ~ Mult_k(n, θ),
θ ~ Dir_k(m₁, ..., m_k),
where the m_i are known. Then the posterior distribution of θ is also Dirichlet;
in particular,
θ | y_obs ~ Dir_k(y₁ + m₁, ..., y_k + m_k).
A sample from p(θ | y = y_obs) can be generated using the gaps between ordered
uniform random variables (Rubin, 1981).
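A minimal sketch of the Bayesian-bootstrap sampling step just mentioned: for k cells (or k distinct observations with a flat prior), one posterior draw of the probability vector is obtained from the gaps between ordered Uniform(0,1) random variables, which are distributed Dirichlet(1, ..., 1). The function names are ours.

```python
import numpy as np

def bayesian_bootstrap_weights(k, rng=None):
    rng = np.random.default_rng(rng)
    u = np.sort(rng.uniform(size=k - 1))
    return np.diff(np.concatenate([[0.0], u, [1.0]]))    # k gaps ~ Dirichlet(1, ..., 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.normal(size=30)                               # observed data (all distinct)
    draws = np.array([bayesian_bootstrap_weights(len(y), rng=r) @ y for r in range(2000)])
    print(draws.mean(), draws.var())                      # Bayesian-bootstrap posterior for the mean
```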
If m_i = 0, i = 1, ..., k, and θ̂_obs = y_obs/n, then the posterior mean and
variance-covariance matrix of θ − θ̂_obs are 0
and
Var(θ − θ̂_obs | y = y_obs) = [Diag(θ̂_obs) − θ̂_obs θ̂_obs']/(n + 1),
respectively.
These results are close to those for the frequentist bootstrap given in Ex-
ample 2. For that situation, the frequentist bootstrap distribution of θ − θ̂_obs,
which assumes that θ equals the observed MLE θ̂_obs = y_obs/n, has mean 0 and
variance-covariance matrix
[Diag(θ̂_obs) − θ̂_obs θ̂_obs']/n.
This similarity was noted by Rubin (1981) and by Efron (1982a).
Rubin (1984) defined two other Bayesian bootstraps. When the fixed fea-
tures z = θ, the bootstrap generates θ* from the posterior p(θ | y = y_obs) as be-
fore. But now it also generates y* from the sampling distribution p(y | θ = θ*).
Marginally, y* is generated from the predictive distribution of a future obser-
vation given the original observed data. Rubin (1984) used this bootstrap to
evaluate tests of model adequacy when the experimental units were consid-
ered fixed features of the experiment.
When z is null the bootstrap generates (θ*, y*) from the joint distribu-
tion p(θ, y). Rubin (1984) used this bootstrap to evaluate the unconditional
calibration of procedures.
3 A Conditional EB Bootstrap
We illustrate the conditional EB bootstrap using the following example.
EXAMPLE 5. Let 6 = (61,..., 9%)’ and y = (y1,..., yx)’ have joint distribu-
tions described by
1. ‘sets a” =<4.55,
Note that this bootstrap differs from the one defined by Laird and Louis
(1987).
To find the conditional mean squared error of an estimator η̂ of η, in the
notation of Remark D of Section 1, let Q(θ, y) = (η̂ − η)² and z = a. Hill
(1990) proved that if η̂(ξ) = (1 − B(S))y, then
This estimate equals the posterior MSE of η when τ² ~ Unif[−1, ∞), the
hyperprior that gives Stein's rule.
Efron and Morris (1973) derived the relative savings loss for a class of
truncated Bayes rules. We can use their results to provide analytic conditional
EB bootstrap estimates of the conditional MSE for those rules.
Our results on EB conditional confidence bounds and more complicated
EB models will be reported elsewhere.
References
EFRON, B. (1979). Bootstrap methods: Another look at the jackknife. Ann.
Statist. 7, 1-26.
EFRON, B. (1982a). The Jackknife, Bootstrap, and Other Resampling Plans.
Society for Industrial and Applied Mathematics, Philadelphia.
EFRON, B. (1982b). Maximum likelihood and decision theory. Ann. Statist.
10, 340-356.
EFRON, B., AND HINKLEY, D. V. (1978). Assessing the accuracy of the MLE:
Observed versus expected Fisher information (with discussion). Biometrika
65, 457-487.
EFRON, B., AND Morris, C. (1973). Stein’s estimation rule and its competi-
tors — An empirical Bayes approach. J. Amer. Statist. Assoc. 68, 117-130.
FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proc.
Roy. Soc. London Sect. A 144, 285-307.
FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Oliver
and Boyd, Edinburgh.
HILL, J. R. (1987). Comment on “Empirical Bayes confidence intervals based
on bootstrap samples,” by N. M. Laird and T. A. Louis, J. Amer. Statist.
Assoc. 82, 752-754.
HILL, J. R. (1990). A general framework for model-based statistics. Biometrika 77, 115-126.
HINKLEY, D. V. (1980). Likelihood. Can. J. Statist. 8, 151-163.
Abstract
This paper investigates the use of the bootstrap for estimating
sampling distributions of standard and limited-translation Stein-rule
estimation procedures, based on Li (1985 and 1987) and Stein (1981).
These estimators incorporate uncertainty in the model selection process and
dominate both Ordinary Least Squares on the largest model and common
pretest estimators. Since there are no asymptotic approximations available
for these estimators, the bootstrap is necessary to obtain standard errors
needed for their practical application. Monte Carlo studies are performed
to verify the performance of the new procedures and the accuracy of the
bootstrap approximations to the second moments of their sampling
distributions. The bootstrap generates much more accurate standard errors
than the delete—one jackknife for the designs considered here. Additional
Monte Carlo studies are used to study the accuracy of various confidence
bands calculated from the bootstrap distributions. Although reasonably
accurate, there is room for improvement using double bootstrapping or
prepivoting.
1. Introduction
This paper extends the work in Brownstone (1990 a and b) analyzing
the bootstrap estimator for the sampling distributions of standard and
limited—-translation Stein-rule procedures for the linear model. These
procedures, based on Li (1985 and 1987) and Stein (1981), are admissible
alternatives to pretest estimators commonly used in the presence of model
uncertainty. Since there are no asymptotic approximations of the sampling
distribution of these procedures available, the bootstrap is necessary to
obtain standard errors needed for their practical application.
Brownstone (1990b) describes a Monte Carlo study, summarized in
Section 4 of this paper, which verifies the improved performance (relative to
Least Squares) of the Stein-rule procedures and the accuracy of their
bootstrapped variances. Section 5 of this paper describes similar Monte
Carlo studies for models with more exogenous variables, and it also
examines two alternative versions of the delete-one jackknife (see Wu,
y = Xβ + ε,    (1)
where X and β are fixed (T×K) and (K×1) matrices with full rank and T > K.
The components of ε are independent identically distributed random
variables with zero mean and common variance σ².
The nonparametric bootstrap estimate of the sampling distribution of
an estimator β* of β is generated by repeatedly drawing with replacement
from the residual vector
e* = y − Xβ*.    (2)
If e_b is a (T×1) vector of T independent draws from e*, then the
corresponding bootstrap dependent variable is given by
y_b = Xβ* + e_b.    (3)
The delete-one jackknife variance estimator, computed from the T leave-one-out
estimates β*_{(i)}, is
V_J = ((T−1)/T) Σ_{i=1}^T (β*_{(i)} − β̄*)(β*_{(i)} − β̄*)'.    (4)
Hinkley (1977) proposed an improved version, called the balanced jackknife,
which uses the weighted pseudovalues Q_i = β* + T(1−w_i)(β* − β*_{(i)}), where
w_i = x_i'(X'X)^{-1}x_i and x_i is the ith row of X. The balanced jackknife
variance estimator is given by:
V_B = (T(T−1))^{-1} Σ_{i=1}^T (Q_i − β*)(Q_i − β*)'.    (5)
Both of these jackknife variance estimators are consistent but biased under
assumptions on the X matrix and the smoothness of the mapping from the
least squares estimator to β*. These jackknife estimators are much faster to
compute than the bootstrap since the β*_{(i)} can be computed without
recomputing regressions for each reduced data set.
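A minimal sketch of the residual bootstrap of equations (2)-(3): residuals from the fitted linear model are resampled with replacement, added back to the fitted values, and the estimator of interest is recomputed on each pseudo-dataset. Here ordinary least squares stands in for the estimator; in this chapter a Stein-rule procedure would be plugged in instead. All names are ours.

```python
import numpy as np

def residual_bootstrap(X, y, estimator, n_boot=1000, rng=None):
    rng = np.random.default_rng(rng)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted                               # residual vector, as in equation (2)
    T = X.shape[0]
    draws = np.empty((n_boot, X.shape[1]))
    for r in range(n_boot):
        y_star = fitted + resid[rng.integers(0, T, size=T)]   # bootstrap dependent variable, equation (3)
        draws[r] = estimator(X, y_star)
    return draws.std(axis=0)                         # bootstrap standard errors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K = 60, 4
    X = rng.normal(size=(T, K))
    y = X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(size=T)
    ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
    print(residual_bootstrap(X, y, ols, rng=1))
```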
Of course, good asymptotic properties are only useful if the
asymptotic approximation holds for realistic models and sample sizes.
Monte Carlo studies by Brownstone (1990 a and b) and Freedman, Navidi,
and Peters (1988) indicate that the bootstrap can generate accurate standard
error estimates for estimated parameters in linear models with reasonable
degrees of freedom. Section 5
of this paper shows that the jackknife estimators may have poor small
sample properties in similar situations.
investigated by James and Stein (1961), who showed that these estimators
dominate LS under a squared error loss function. There has been a large
amount of work extending these estimators to different models and
situations. Judge and Bock (1978) and Judge and Yancey (1986) provide
good reviews of the application of these developments to multiple regression
and other econometric models, and Stigler (1990) gives an excellent
intuitive proof of Stein’s (1956) result. These authors have also proposed
Stein—-rule estimators as an alternative to commonly used pretest procedures
which are known to have poor sampling properties (see Brownstone, 1990a,
and Hill and Judge, 1987). Li and Hwang (1984) further point out that
appropriately defined Stein-rule estimators are "robust" with respect to
misspecified models. The development in this section, following Li (1985),
assumes that there is a true base model, and all approximate models are
derived by imposing linear restrictions on this base model. Although it is
relatively straightforward to consider nonlinear restrictions, the assumption
that the base model is correct is critical.
I will only consider the orthonormal linear model where X’X = I.
The general linear model in equation (1) can be transformed to this case
using the singular value decomposition. The LS estimator of β, β*, is just
X'y. If we temporarily assume that the ε are Normally distributed, then
β*_i = β_i + δ_i,  i = 1, ..., K,    (6)
where the δ_i are independent normally distributed with mean zero and variance
σ². The estimation of β from β* in eq. 6 is the classic Multivariate Normal
mean problem studied by Stein (1956), where he showed that LS is
inadmissible under squared error loss (||β̂ − β||²).
Consider a class of linear estimators, β*(h), of the unknown mean
vector β. Associated with each h in a finite index set H there is a K×K
matrix, M(h), such that β*(h) = M(h)β*. The model selection problem is
how to choose h. For example, if the columns of X are derived from a
singular value decomposition of a collinear design matrix, then some of the
characteristic values and the associated β_i will be close to zero. This
suggests considering a class of restricted models with the last (K−h) β*_i set
to zero (i.e. M(h) is a diagonal matrix with the first h elements equal to one
and the rest zero).¹
The next step is to choose h to minimize the risk function, but for
typical linear estimators (including LS) this risk function is unbounded. Li
(1985) suggests replacing the class of linear estimators, β*(h), with their
Stein-rule counterparts:
β̂(h) = [I − σ²/(β*'B(h)β*) B(h)] β*.  (7)
¹Li (1987) shows that this general framework can also be applied to ridge
regression and nearest-neighbor nonparametric regression.
Li (1985) shows that
S(h) = σ²K − (β*'B(h)β*)²/||A(h)β*||²  (9)
is an unbiased estimator for the risk of β̂(h). Li (1985) further shows that
even when S(h) is not a consistent estimator of the risk, it is still a
consistent estimator of the true loss ||β̂(h) − β||² divided by K. Note that
choosing h to minimize S(h) is equivalent to minimizing
||A(h)β*||²/(β*'B(h)β*)²,  (10)
which is independent of σ². Li (1987) shows that minimizing eq. (10) is
identical to the Generalized Cross Validation model selection criterion (see
Craven and Wahba, 1979) and is asymptotically optimal in the sense that
L(β̂(h*)) / min_{h∈H} L(β̂(h)) → 1 in probability (as T → ∞),  (11)
where L(·) is the loss function and h* denotes the model chosen to minimize
eq. (10).
Li's methodology outlined above can be applied to general linear
model selection problems, but the remainder of this paper will only consider
the problem of choosing how many of the β*_i to restrict to zero. In this case
2Li (1987) gives an alternative formula for general M(h). All the cases
considered in this paper have symmetric M(h).
the trace condition on A(h) implies that there are at least three restrictions.
An alternative approach for improving LS in this situation is Stein’s (1981)
truncated limited-translation estimator, which uses different shrinkage
values for each component and therefore limits the bias. Stein’s estimator
is defined for each component by:
β̂_i = β*_i − [(q−2)σ² / Σ_{k=1}^{K} min(β*_k², Z_q²)] min(|β*_i|, Z_q) sign(β*_i),  i = 1,...,K,  (12)
where q is a large fraction of K, and Z_1 ≤ ··· ≤ Z_K are the order statistics of
the |β*_i|. Note that if q = K, eq. 12 simplifies to the James-Stein (1961)
estimator. If q < K, then the estimator defined by eq. 12 shrinks the β*_i's
more if they are closer to zero, but some method for determining q is
needed. Dey and Berger (1983) propose choosing q>3 to maximize
(q−2)² / [Σ_{k=1}^{K} min(β*_k², Z_q²)],  (13)
3Efron and Morris (1973) show that most Stein-rule estimators can also be
derived as Empirical Bayes procedures, which justifies considering the
Bayes risk here.
4. Estimator Performance
This section describes a Monte Carlo study of the positive part
versions of Li's estimator (eq. 7), hereafter called LISTEIN, and Stein's
limited translation estimator (eq. 12), hereafter called NEWSTEIN. The
columns of X are generated as orthonormalized independent draws from a
Uniform distribution on the unit interval, and they are held fixed
throughout the experiments. There are T = 100 observations (rows of X),
and the first set of experiments have K=10 regressors. The dependent
variables (y) are generated from equation 1 with ¢ generated as independent
draws from a standard Normal distribution. This design has the risk of
LS = K for all of the experiments. The GAUSS (1989) programming
system was used for all calculations on a 386 personal computer, and the
pseudo-random numbers were all generated using GAUSS’ normal and
uniform random number generators.
The parameter vector β was chosen according to β = LΔ, where L is a
scalar. As L increases, the signal to noise ratio, as measured by the
population R² = β'β/(β'β + T), varies between 0 and .988. This
definition of R² is appropriate since the regression plane passes through the
origin, and σ² is always equal to 1. Table 1, abridged from Brownstone
(1990b), shows the results of a Monte Carlo study for two choices of the
direction vector, Δ. "Case 1" has all components of Δ = 1, and "Case 2"
has the first 5 components = 1.9999 and the remaining 5 = .0001.
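A sketch of this design in Python (the original computations used GAUSS) is given below; it assumes the orthonormalization is done by a QR factorization, which the text does not specify, and obtains L from β'β = R²T/(1 − R²). The seed and the target R² are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)          # illustrative seed
    T, K = 100, 10
    U = rng.uniform(size=(T, K))            # independent Uniform(0,1) draws
    X, _ = np.linalg.qr(U)                  # orthonormalized columns: X'X = I_K

    Delta = np.ones(K)                      # "Case 1": all components of Delta equal 1

    def beta_for_R2(R2, Delta):
        # population R^2 = b'b / (b'b + T) with sigma^2 = 1; solve for the scalar L
        bb = R2 * T / (1.0 - R2)
        return np.sqrt(bb / (Delta @ Delta)) * Delta

    beta = beta_for_R2(0.5, Delta)          # illustrative target R^2
    y = X @ beta + rng.standard_normal(T)   # eq. (1) with standard Normal errors
    b_star = X.T @ y                        # LS estimator in the orthonormal model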
Each entry in Table 1 is based on 400 Monte Carlo replications. In
addition to population R², these tables give
(1/400) Σ_{j=1}^{400} (β̂_j − β)'(β̂_j − β),  (14)
which is the empirical (mean) risk or expected prediction error for each of
the two Stein-rule estimators. By construction, LS has theoretical risk 10
for all cases, and its empirical risk was very close to 10 for all cases, so these
empirical risks are omitted from Tables 1 and 2. The tables also provide
the average value of h* (the number of β*_i not restricted to zero), denoted "hopt," for
LISTEIN and the average value of q for NEWSTEIN. The standard errors
of the empirical risk measures are not reported because they are always less
than .30.
Unlike the pretest and Stein—rule pretest estimators considered in
Brownstone (1990a), both LISTEIN and NEWSTEIN dominate LS in these
experiments. For "Case 1", the risk of LISTEIN and NEWSTEIN rises
smoothly toward 10 (the risk of LS in all of these experiments). This is to
be expected since as R? increases all parameters are estimated more
precisely and there is less gain from imposing approximate restrictions. As
LISTEIN and NEWSTEIN approach LS, both hopt and q (which are
restricted to be no less than three) approach their maximum values (10).
When hopt and q are at their maxima, both LISTEIN and NEWSTEIN are
just the standard positive—-part James-Stein estimator. Since the shrinkage
334 Brownstone
anna
The generally good results from this study are in contrast to one case
in Freedman, Navidi, and Peters (1988) which showed the poor performance
of the bootstrap for LS in an orthonormal design with 100 observations and
75 regressors. Two possible causes for the poor performance in that case are
the lack of degrees of freedom and the presence of 50 extraneous regressors
(i.e., those with true parameter values set to zero). When there are
extraneous regressors there is excess variation in the LS residuals, so it is
not surprising that the bootstrap based on these residuals will be biased.
The two Stein-rule procedures considered here should be less subject to this
"extraneous regressor" bias since they shrink the coefficients of these
extraneous regressors towards zero before bootstrap residuals are calculated.
Table 4 gives the results of additional Monte Carlo studies of the
bootstrap and the two jackknife estimators described in Section 2 for the
same designs as Table 3. There were only 100 bootstrap repetitions
performed for each Monte Carlo repetition in Table 4 so that the bootstrap
would be more computationally comparable to the jackknife estimators.
The jackknife estimators are still much faster to compute since they are
based on the delete-one LS estimates which only require one regression to
compute. In addition to the three estimators investigated in Table 3, Table
4 includes results for the LS estimator on the full model, which has
risk = 20 for all the designs. The "Std. Error" rows in Table 4 give the
standard errors of the estimates of the first component of β over the 300
Monte Carlo repetitions. These entries differ from Table 3 and across
estimators because, for computational reasons, different random numbers
were drawn for each estimator. The "Est." and "MSE" rows give the mean
and mean squared error of the jackknife and bootstrap estimators over the
Monte Carlo repetitions.
Except for LS, the jackknife estimators perform much worse than the
bootstrap, and the balanced jackknife performs slightly better than the
regular jackknife. Since the mapping from LS to LSR and the Stein—rule
procedures is not continuous, the jackknife estimators are only known to be
consistent for LS. The balanced jackknife was designed to improve the
jackknife for LS, so its better performance here is not surprising. Even
though the balanced jackknife is an inferior variance estimator (relative to
the bootstrap) for these designs, it may still be good enough to be used to
prepivot the bootstrap sampling distributions.
Table 5. Definitions of the confidence intervals (Name, Definition).
Notes: σ_B is the bootstrap standard error for β* and t(α) is the level
α cutoff value from a Student t distribution with 80 degrees of
freedom. G* is the bootstrap distribution of the bootstrap replicates β*_B, Φ is the standard
normal cumulative distribution function, z(α) is the level α cutoff
value from this distribution, and z_0 = Φ^(−1){G*(β*)}. s, which is only
calculated for LS, is the square root of the usual LS variance estimate.
H* is the bootstrap distribution of (β*_Bj − β*)/s_j, where β*_Bj and s_j are
the values of the estimators at the jth bootstrap repetition.
Table 5 gives the definitions of the intervals, but the "Std." and "T
Percentile" are only calculated for the LS estimator since they require a
reliable estimate of σ. Efron and Tibshirani (1986) show that the "Boot
Std.," "Percentile," and "BC Percentile" intervals are correct under
increasingly general conditions. Note that if the bootstrap bias is zero
(G*(β*) = .5), then the Percentile and BC Percentile bands are identical.
Since the "Boot Std." band only requires bootstrap standard errors, it can
be computed with fewer bootstrap repetitions.
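A sketch of the "Percentile" and "BC Percentile" constructions, following the definitions in the notes to Table 5, is given below; it is an illustration (not the study's code) applied to bootstrap replicates of a single coefficient.

    import numpy as np
    from scipy.stats import norm

    def percentile_intervals(b_hat, reps, alpha=0.10):
        """90 percent Percentile and BC Percentile intervals from replicates reps."""
        lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
        z0 = norm.ppf(np.mean(reps < b_hat))            # z0 = Phi^(-1){G*(b*)}
        a_lo = norm.cdf(2 * z0 + norm.ppf(alpha / 2))
        a_hi = norm.cdf(2 * z0 + norm.ppf(1 - alpha / 2))
        bc_lo, bc_hi = np.quantile(reps, [a_lo, a_hi])
        return (lo, hi), (bc_lo, bc_hi)

When the bootstrap bias is zero, G*(β*) = .5 and z0 = 0, so the two intervals coincide, as noted above.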
Table 6 shows the performance of 3 different 90 percent confidence
bands for the first coefficient of β. Except for the Stein-rule estimators at
true coefficient = 0, the bootstrap standard errors ("Boot SE" columns in
Table 6) underestimate the true values ("Std. Error" columns in Table 6).
This negative bias is related to the negative bias in the nonparametric
bootstrap standard error estimates for LS.⁴ The coverage probability errors
generally follow the biases in the bootstrap standard errors. The "Boot
Std." and "Percentile" bands behaved similarly in these runs. The
bias-corrected intervals ("BC Percentile") behaved similarly for the LSR
estimator, but they displayed much more variability and had poor coverage
for the Stein-rule estimators. The culprit in the bad performance of the
bias-corrected intervals is high variation in the estimates of z_0.
One difficulty with interpreting Table 6 is that there are no
well-defined "correct" confidence bands for these estimators to compare
against. Table 7 considers the confidence bands for the LS estimator where
the usual t interval ("Std." column in Table 7) provides a natural basis for
comparison. The bootstrap intervals studied in Table 6 all have some
undercoverage due to the negative bias in the bootstrap standard errors
("Boot SE" column in Table 7). The similarity of these three intervals is
expected since the LS estimator is normally distributed in this example.
Even though it is based on the biased bootstrap residuals ("Boot s" column
in Table 7), the "T Percentile" intervals are almost identical to the correct
"Std." intervals. The exact correctness of the "T Percentile" interval in
this example is not surprising since (β*_Bj − β*)/s_j is an exact pivot for the LS
estimator with normally distributed residuals. In more general settings,
Hall (1988) shows that the "T Percentile" intervals are asymptotically more
accurate than the other bootstrap intervals in Tables 6 and 7 since they
capture additional terms in the Cornish-Fisher expansions of the
estimators’ sampling distributions. Direct application of the "T Percentile"
intervals to the LISTEIN or NEWSTEIN procedure requires a bootstrap or
jackknife variance estimate for each bootstrap repetition. This "double
bootstrapping" is computationally beyond the scope of the current study.
Tables 6 and 7 also give values of the "Bootstrap Risk," which is
defined as
⁴ The negative bias for LS is expected, since the sampling variability of the
LS residuals underestimates the variability of the true residuals. In
particular, E(e_i²) = σ²(1 − h_i), where e_i is the LS residual for the ith
observation and h_i is the ith element of the diagonal of X(X'X)^(−1)X'. One
solution is to divide each LS residual by the square root of (1 − h_i), but this
is only valid for LS. This bias disappears in large samples.
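A minimal sketch of the leverage correction described in the footnote (the function name is illustrative):

    import numpy as np

    def rescaled_residuals(y, X):
        H = X @ np.linalg.inv(X.T @ X) @ X.T    # projection ("hat") matrix
        e = y - H @ y                           # LS residuals
        h = np.diag(H)                          # leverages h_i
        return e / np.sqrt(1.0 - h)             # rescaled so E(e_i^2) = sigma^2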
7. Further Refinements
The Monte Carlo studies presented here and in Brownstone (1990a
and b) show that bootstrapping estimation strategies for linear regression
models yields reasonably accurate estimates of their sampling distributions.
Although the bootstrap standard errors and confidence intervals are
adequate for most applications with sufficient degrees of freedom, the biases
in Tables 6 and 7 suggest the need for more accurate estimates. One
approach is to use exact finite sample theory and/or higher-order
asymptotic expansions. These theoretical results are available for simple
pretest estimators, such as LSR (Judge and Yancey, 1986), and for simple
Stein-rule estimators (Phillips, 1984), but not for the more complex
Stein-rule procedures (LISTEIN and NEWSTEIN) considered here. Even
when exact finite sample theory is available, the results are typically quite
sensitive to the functional form of the residual distribution.
Another approach which preserves the distributional robustness of the
bootstrap is to use improved bootstrap techniques (see Beran, 1986, Efron,
1987, Hall, 1988, and Loh, 1987). These work by improving the rate of
convergence of the bootstrap sampling distribution, but they generally
require an order of magnitude more computations than the simple
bootstrap. The most general approach (Beran, 1986 and Hall, 1988) is to
use the bootstrap distribution as an approximate pivot, which is a
generalization of the "T Percentile" interval in Table 7. Efron’s (1987)
accelerated bias-corrected bootstrap intervals have the same second-order
asymptotic properties as the T Percentile, but they are easier to compute in
many cases. Nevertheless these accelerated intervals require the equivalent
of a jackknife variance computation for each bootstrap repetition for the
LISTEIN and NEWSTEIN procedures. Monte Carlo studies of these
improved techniques require supercomputer resources, but their application
to even large applied problems is feasible on current high-end workstations.
Even though the techniques described above will improve the
accuracy of the methods studied in this paper, the Stein-rule procedures
with the simple bootstrap confidence intervals are a substantial
improvement relative to current popular techniques. For the setting in
Tables 6 and 7, most applied econometricians would use extensive
pretesting to find the "best" set of regressors, and then perform inference
conditional on the final model chosen being correct. Brownstone (1990a)
and Hill and Judge (1987) show that these procedures may have much
higher risk than LS applied to the full model, and the conditional inferences
(Table 6 continued)
True Coef. Mean Std. Error Risk Boot SE Boot s Boot Risk
[-2.28, 2.41] [-2.03, 2.16] [-2.02, 2.13] [+-2.02, 2.13] [-2.29, 2.39]
0 0.896 0.860 0.864 0.860 0.888
[-1.73, 2.95] | [-1.48, 2.70] [-1.46, 2.67] [-1.47, 2.66] _[-1.73, 2.94]
0.636 0.922 0.896 0.887 0.878 0.926
[ 1.03, 5.71] [ 1.28, 5.46] [ 1.30, 5.43] [ 1.30, 5.43] _[ 1.03, 5.70]
3.18 0.872 0.830 0.819 0.823 0.872
8. References
1. Introduction
Let T be a non-negative random variable representing the length of the inter-
val from some initial zero point to the occurrence of an absorbing or non-
recurrent event of interest such as waiting time, failure time, recurrence time.
For a given sample of n individuals, T(ω_i) represents the duration time for the
member ω_i, i = 1,2,...,n. Since the probability measure induced by T is equivalent
to the original one defined on the σ-algebra F generated by subsets of the sample
space S, we will henceforth deal directly with the former.
The (cumulative) distribution function F(t) = P[T ≤ t] of T is right continu-
ous, monotone non-decreasing on [0, ∞) with F(0) = 0 and lim_{t→∞} F(t) = 1. The cor-
responding density function exists for t ≥ 0, almost everywhere, and is given by
f(t) = F'(t) = −S'(t), where S(t) = 1 − F(t) is the survival function. The hazard
rate function is defined for S(t) > 0 by
h(t) = lim_{Δ↓0} P[t ≤ T < t+Δ | T ≥ t]/Δ = f(t)/S(t).  (1)
Note that if the distribution function of T is not absolutely continuous, then the
hazard rate function cannot be defined. If T has a finite mean, then the life expec-
tancy or mean residual life function is defined by
e(t) = E[T − t | T ≥ t] = ∫_t^∞ (u − t) dF(u)/S(t⁻) = ∫_t^∞ S(u) du/S(t⁻),  (3)
for S(t⁻) > 0 and e(t) = 0 whenever S(t) = 0, where S(t⁻) = lim_{u↑t} S(u) = P[T ≥ t]. Note
that S(t⁻) = S(t) if T is continuous. This fact was used in (1). The five functions
defined above are equivalent.
Finally, we define an important rate function -- the duration-specific
occurrence/exposure rate. The [t₁, t₂]-specific O/E rate is defined for t₂ > t₁ ≥ 0
by
[S(t₁) − S(t₂)] / ∫_{(t₁,t₂]} S(t) dt.  (4)
Expression (4) represents the ratio of the number of events in (t₁,t₂] to the
number of individual-time units of exposure in (t₁,t₂]. If T is (absolutely) con-
tinuous, then S(t₁⁻) = S(t₁) and an application of the mean value theorem of
integral calculus yields
lim_{t₂↓t₁} [S(t₁) − S(t₂)] / ∫_{(t₁,t₂]} S(t) dt = f(t₁)/S(t₁) = h(t₁),  (4a)
and an instantaneous O/E rate reduces to a hazard rate function. Thus, an O/E rate
is an average rate and as such is a set function while a hazard rate is an instan-
taneous rate and as such is a point function. Death rates and incidence rates are
examples of the former while force of mortality and force of morbidity are exam-
ples of the latter. For more functions useful in describing the distribution of the
duration time, see Hsieh ( 1990).
2. Deterministic Hazards
To cover both discrete and continuous distributions, we begin by defining the
differential of F as follows:
dF(t) = F(t) − F(t⁻) = P[T = t]  if F is discrete,  and  dF(t) = f(t) dt  if F is continuous,  (5)
where F(t⁻) = P[T < t]. Note that in the discrete case dF(t) becomes the
probability mass function representing the size of the jump of F at t, which is zero for
all t except at a countable number of jump points T(ω_i). The differential of H is
then defined for S(t⁻) > 0 as
dH(t) = dF(t)/S(t⁻) = P[T = t | T ≥ t]  if F is discrete,  and  dH(t) = f(t) dt/S(t) = h(t) dt  if F is continuous,  (6)
the second equality following directly from (5).
The mean residual life function e(t) and the [t₁,t₂]-specific O/E rate for
a discrete distribution are as given in (3) and (4), respectively, with S(t) and S(t⁻)
replaced by (8) and (10), respectively.
N(t) = Σ_{i=1}^{n} I(T(ω_i) ≤ t),  t ≥ 0,  (22)
for the whole sample space Ω, which counts the total number of occurrences of
the event of interest in (0,t]. The stochastic hazard dA(t) and the hazard process
A(t) of (22) are respectively obtained by summing both sides of (15) and (16)
over i and using (13) and (21):
where
The third step is to find the asymptotic distribution of M_n(t). We will use the
so-called asymptotic stability condition (the Glivenko-Cantelli theorem), which
states that Y(t)/n converges uniformly (see Pollard 1984) with probability one to
the deterministic mean function p(t) = E[Y(t)/n], as n → ∞. (If there is no cen-
soring, as is the case here, p(t) = S(t⁻). This, by the way, establishes the asymp-
totic negligibility condition of the martingale estimate (28).) Now we are ready to
show that the two conditions of the martingale central limit theorem are satisfied
by the martingale M_n(t), so that it has a (continuous) Gaussian distribution with
mean zero and variance estimated by (36) for large n. We have, as n → ∞,
⟨M_n⟩(t) = n ∫_0^t [I(Y(s) > 0)/Y(s)] dH(s) → ∫_0^t dH(s)/p(s)  in probability,
which is deterministic and hence satisfies the first condition (a). To show that
condition (b) holds, we note that jumps in the martingale √n[Ĥ(t) − H(t)] arise
only from Ĥ(t), whose jumps are of size 1/Y(t) (see (28)), which is of order 1/n,
and hence the Lindeberg condition holds, so that the jumps of √n[Ĥ(t) − H(t)]
tend to zero as n → ∞. Note that conditions weaker than (b), such as those of
Lindeberg, Lyapounov and Rebolledo, suffice to yield the same results (see Ander-
sen and Gill 1982 and Karr 1986). We therefore conclude that the estimator Ĥ(t)
given by (29) is consistent and asymptotically normal.
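For the uncensored case discussed here, the estimator Ĥ(t) is just the cumulative sum of the increments dN(s)/Y(s) over the observed failure times; the sketch below is an illustration of that sum (formula (29) itself is not reproduced in this excerpt).

    import numpy as np

    def cumulative_hazard(times):
        """Jump points and H_hat just after each jump, for uncensored durations."""
        t = np.sort(np.asarray(times, dtype=float))
        uniq, counts = np.unique(t, return_counts=True)    # dN at each failure time
        at_risk = len(t) - np.concatenate(([0], np.cumsum(counts)[:-1]))   # Y at each time
        return uniq, np.cumsum(counts / at_risk)

    # Example: cumulative_hazard([2.0, 3.5, 3.5, 7.1]) has increments 1/4, 2/3 and 1.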
When, as above, h⁰(t) = h⁰_j(t) for t ∈ I_j, then the terms involving h⁰ in (42) become
The test statistic Z given in (42) could also be derived by defining the
transformed martingale (see Andersen et al. 1982, and Andersen and Borgan 1985
Corresponding to (42), we obtain the test statistic for the hypothesis (46) as
X²_{k−1} = Σ_{j=1}^{k} (O_j − E_j)²/E_j,  (49)
where O_j = N_j(T) and
E_j = ∫_0^T Y_j(s) dĤ(s) = ∫_0^T [Y_j(s)/Y(s)] dN(s).  (49a)
The test statistic (49) is known as the Chi-square test of homogeneity. It has an
asymptotic Chi-square distribution with k-1 degrees of freedom. The loss of one
degree of freedom from the total k degrees of freedom is due to the requirement
of estimating the unknown H(t).
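A sketch of the statistic (49) for k groups of uncensored durations, with E_j accumulated as in (49a) from the pooled hazard increments dN(s)/Y(s); the function is an illustration, not code from the paper.

    import numpy as np

    def homogeneity_chi2(samples):
        """samples: list of k arrays of uncensored duration times."""
        samples = [np.asarray(s, dtype=float) for s in samples]
        times = np.sort(np.concatenate(samples))
        k = len(samples)
        O = np.array([len(s) for s in samples], dtype=float)   # observed failures O_j
        E = np.zeros(k)
        for t in times:                                        # one event at a time
            Y = (times >= t).sum()                             # pooled number at risk Y(t)
            Y_j = np.array([(s >= t).sum() for s in samples])  # group-specific numbers at risk
            E += Y_j / Y                                       # Y_j(t) dH_hat(t) with dH_hat = dN/Y
        return float(np.sum((O - E) ** 2 / E))                 # compare to chi-square, k-1 df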
Then T*(ω_i) is the failure time of ω_i in the presence of all causes. Note that right
censoring is considered and included in (50) as a mode of failure. To be specific,
let T_q(ω_i) be the censoring time for ω_i. Just as in the last paragraph, with
F*(t) = P[T* ≤ t], the functions describing the distribution of T*, namely
h*, f*, S*, and e*, are similarly defined. These functions are
identifiable because T*(ω_i) is observable.
Let X(ω_i) index the mode (cause) of failure and Z_t(ω_i) be a vector of inter-
nal covariates ω_i assumes at time t. Clearly, X(ω_i) takes values from the finite
index set Ā = {1,2,...,q} which lists the causes of failure. Let A = {1,2,...,q−1},
which is the subset of Ā with censoring excluded. Suppose for simplicity that
Z_t(ω_i) takes values from a set of discrete vector values Γ = {z₁, ..., z_p}. Note that
when Z_t(ω_i) is continuous we may categorize it and use z_γ, γ = 1,2,...,p, to
represent the mean value for the γth category. Clearly, Z_t(ω_i) is a bounded
predictable vector covariate process and can change the values it takes, from z₁ to
z₂, say, as t progresses from 0 to T*(ω_i).
As there are several possible ways for the covariate to vary with time, dif-
ferent classifications of the time-varying covariates prevail in the literature. The
covariates Z_t(ω_i) or Z(t) introduced here correspond to the class of internal
covariates in Kalbfleisch and Prentice's (1980) classification. Most of the work on
statistical inference involving time-varying covariates done so far has employed
either parametric approaches or semiparametric regression models and used likeli-
hood or least-squares methods which require (1) a given functional form for the
whole or part of the hazard rate function with covariates written as explicit func-
tions of time and (2) integration of the hazard rate function to obtain the cumula-
tive hazard and survival functions (see Petersen 1986, Blossfeld et al 1989 and
references therein). These procedures are mathematically unwieldy and compu-
tationally lengthy. Besides, it is not always possible to write the time path of a
covariate as an explicit function of time. Our approach to the treatment of the
time-varying covariate is nonparametric and is akin to those of Leung and Wong
(1990) and McKeague and Utikal (1990). It requires a Markovian assumption on
the covariates and provides a straightforward approach to the estimation and
hypothesis testing problem.
In this section we formulate the competing-risks problem as a marked point
process {T*(ω_i), X(ω_i)} with the internal history F_t^{T*,X} defined by
F_t^{T*,X} = σ{I(T*(ω_i) ≤ s, X(ω_i) = δ); 0 ≤ s ≤ t, δ ∈ Ā, i = 1,...,n}  (51)
and employ the sub σ-algebra
F_t^{T*} = σ{I(T*(ω_i) ≤ s); 0 ≤ s ≤ t, i = 1,...,n}  (52)
and the covariate history
F_t^{Z} = σ{Z_s(ω_i) = z; 0 ≤ s ≤ min(T*(ω_i), t), z ∈ Γ, i = 1,...,n}  (53)
as natural filtration to accommodate time-varying covariates. Note that the two
random variables T*(ω_i) and X(ω_i) are realized simultaneously. We cannot
observe one without the other. Thus, when failure occurs to ω_i, in addition to the
observed failure time T*(ω_i), we also observe the mode of failure X(ω_i) and the
covariate history Z_t(ω_i) up to the time of failure T*(ω_i). This is a considerable
enlargement from the previous sections where we only dealt with the point pro-
cess {T(ω_i)} and employed the internal history F_t defined in (11) as natural
filtration.
and the two crude hazard rate functions of the marked point process {T*(ω_i), X(ω_i)} given by
h*(t, δ) = lim_{Δ↓0} P[T* < t+Δ, X = δ | T* ≥ t]/Δ  (55a)
and
h*(t, δ; z) = lim_{Δ↓0} P[T* < t+Δ, X = δ | T* ≥ t, Z(t) = z]/Δ.  (55b)
Similarly, extensions of the deterministic hazard probability (6) yield the two
net hazard probabilities:
dH_δ(t) = P[T_δ = t | T_δ ≥ t]  if F_δ is discrete,  and  dH_δ(t) = h_δ(t) dt  if F_δ is continuous,  (56a)
and
dH_δ(t; z) = P[T_δ = t | T_δ ≥ t, Z(t) = z]  if F_δ is discrete,  and  dH_δ(t; z) = h_δ(t; z) dt  if F_δ is continuous;  (56b)
and the two crude hazard probabilities:
dH*(t, δ) = P[T* = t, X = δ | T* ≥ t]  if F* is discrete,  and  dH*(t, δ) = h*(t, δ) dt  if F* is continuous,  (57a)
and
dH*(t, δ; z) = P[T* = t, X = δ | T* ≥ t, Z(t) = z]  if F* is discrete,  and  dH*(t, δ; z) = h*(t, δ; z) dt  if F* is continuous.  (57b)
In (54b), (55b), (56b) and (57b) above we have replaced the conditioning
events {T_δ ≥ t, Z(s), 0 ≤ s < t} and {T* ≥ t, Z(s), 0 ≤ s < t} in the general definitions of
conditional hazard functions by {T_δ ≥ t, Z(t) = z} and {T* ≥ t, Z(t) = z} because we
assume the Markovian property for Z(t) (see Heckman and Singer (1984) for
illustrations), which implies that h_δ(t; z), h*(t, δ; z), dH_δ(t; z) and dH*(t, δ; z) will
remain unchanged with the condition Z(t) = z in these formulas replaced by
Z(s) = z, 0 ≤ s ≤ t. That is, the hazard at duration time t depends on the covariates
only through their values at the duration time t, and the above four hazard formu-
las are identical to those with the covariate values remaining unchanged at z over
the entire course of the spell and hence are interpretable as such.
Just as in Sections 4 and 5, our interest here lies in making inference on
dH_δ(t) or h_δ(t) and dH_δ(t; z) or h_δ(t; z). However, these net hazard functions are
not identifiable, i.e., they cannot be estimated from the observed information (51)-(53)
without further assumptions. If we make the common assumption that the latent
lives T_δ, δ ∈ Ā, are mutually independent or conditionally independent given F_t,
it follows that (54a) equals (55a), (54b) equals (55b), (56a) equals (57a) and
(56b) equals (57b) (see, e.g., Hsieh 1989 for the proof of these and related
results). Since the crude hazards are identifiable, the net hazards can thus be
estimated through their relationships to the crude hazards under the assumption of
independent or conditionally independent latent lives.
As in Section 3 we shall employ the homogeneity assumption that the ω_i's in
Ω are all copies of the same ω (homogeneous population) and the "no simultane-
ous occurrences" assumption, which in the present context means that with proba-
bility 1, T_δ₁(ω_i) ≠ T_δ₂(ω_i) for δ₁ ≠ δ₂, i = 1,...,n, and T_δ(ω_i₁) ≠ T_δ(ω_i₂) for i₁ ≠ i₂,
δ = 1,...,q. The counting process N(t) of Section 3 (see (22)) is directly extended as
follows: For the problem of competing risks accounting for right censoring, we
have
N(t, δ) = Σ_{i=1}^{n} I(T*(ω_i) ≤ t, X(ω_i) = δ),  t ≥ 0,  δ ∈ Ā,  (58a)
which counts the number of failures in (0,t] due to cause δ. For the problem
of competing risks accounting for right censoring and time-varying covariates, we
have
N(t, δ; z) = Σ_{i=1}^{n} I(T*(ω_i) ≤ t, X(ω_i) = δ, Z_{T*}(ω_i) = z),  t ≥ 0,  δ ∈ Ā,  z ∈ Γ,  (58b)
which counts the number of failures in (0,t] due to cause δ with covariates z
at the time of failure.
The F_t⁻-predictable process for the counting process N(t, δ) defined in
(58a) is
Y*(t) = Σ_{i=1}^{n} I(T*(ω_i) ≥ t),  (59a)
which represents the number at risk at time t of failure from any cause (including
censoring), and the F_t⁻-predictable process for N(t, δ; z) defined in (58b) is
Y*(t; z) = Σ_{i=1}^{n} I(T*(ω_i) ≥ t, Z_t(ω_i) = z).  (59b)
The basic martingale of Section 3 is likewise
extended to correspond to the counting processes (58a) and (58b) and their respec-
tive compensators (60a) and (60b) as follows:
M(t, δ) = N(t, δ) − ∫_0^t Y*(s) dH_δ(s),  δ ∈ A,  (61a)
with E[M(t, δ) | F_t⁻] = 0, where F_t = F_t^{T*,X}, and
M(t, δ; z) = N(t, δ; z) − ∫_0^t Y*(s; z) dH_δ(s; z),  δ ∈ A,  z ∈ Γ,  (61b)
with E[M(t, δ; z) | F_t⁻] = 0, where F_t = F_t^{T*,X} ∨ F_t^{Z}.
It follows immediately from (61a) and (61b) that
E[N(t, δ) | F_t⁻] = ∫_0^t Y*(s) dH_δ(s)
and
E[N(t, δ; z) | F_t⁻] = ∫_0^t Y*(s; z) dH_δ(s; z),
which are the corresponding extensions of (24) of Section 3.
Thus, martingale estimation formulas (28) and (29) of Section 4 for the
deterministic hazard dH(t) are extended to those for dH_δ(t) as follows: For each
δ ∈ A, we have
dĤ_δ(t) = I(Y*(t) > 0) dN(t, δ)/Y*(t),  t ≥ 0,  (62a)
and
Ĥ_δ(t) = ∫_0^t I(Y*(s) > 0) dN(s, δ)/Y*(s),  t ≥ 0,  (62b)
where s_i, i = 1,...,n, is the sequence of ordered failure times irrespective of cause
and dN(t, δ) = N(t, δ) − N(t⁻, δ) is the number of failures due to cause δ ∈ A at
time t, which is zero except at the mark points (T*(ω_i) = t, X(ω_i) = δ). Similar
extensions for estimation of dH_δ(t; z) and H_δ(t; z) yield, for each (δ, z) ∈ A×Γ:
dĤ_δ(t; z) = I(Y*(t; z) > 0) dN(t, δ; z)/Y*(t; z),  t ≥ 0,  (63a)
and
Ĥ_δ(t; z) = ∫_0^t I(Y*(s; z) > 0) dN(s, δ; z)/Y*(s; z),  t ≥ 0,  (63b)
where dN(t, δ; z) = N(t, δ; z) − N(t⁻, δ; z) is the number of failures due to cause
δ ∈ A at time t with covariate z at the time of failure, which is zero except at the
points (T*(ω_i) = t, X(ω_i) = δ, Z_{T*}(ω_i) = z). Note that just as in Section 4, we have
used the assumption of asymptotic negligibility in (62) and (63). The validity of
this assumption will be shown in the next paragraph.
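A sketch of the cause-specific estimator (62b) is given below; it reuses the all-cause risk set Y*(t) and cumulates only the failures of cause δ. The function and variable names are illustrative, and the sketch assumes the data are the observed times and their failure modes.

    import numpy as np

    def cause_specific_cumhaz(times, causes, delta):
        """H_hat_delta(t) at the ordered event times, following eq. (62b)."""
        times = np.asarray(times, dtype=float)
        causes = np.asarray(causes)
        order = np.argsort(times)
        times, causes = times[order], causes[order]
        H = 0.0
        jumps = []
        for t, c in zip(times, causes):
            Y_star = (times >= t).sum()      # number at risk from any cause, Y*(t)
            if c == delta:                   # dN(t, delta) = 1 for this event
                H += 1.0 / Y_star
            jumps.append((t, H))
        return jumps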
The hypotheses and test statistics of Section 5 on the h(t)'s are to be
extended to the corresponding hypotheses and test statistics on the h_δ(t)'s and
h_δ(t; z)'s. In order for the corresponding test statistics for these hypotheses to
hold, one has to show that the two martingale extensions of M_n(t) defined by
(33), namely,
M_δ(t) = √n [Ĥ_δ(t) − H_δ(t)]  (64a)
and
covariates, respectively.
Finally, for the second type of hypothesis, we have in correspondence to the
hypothesis (46) the following two extensions:
H₀': h_{δ1}(t) = h_{δ2}(t) = ... = h_{δk}(t),  δ ∈ A,  (46')
and
REFERENCES
Dekker.
Kopp, P. E. (1984): Martingales and Stochastic Integrals. Cambridge: Cam-
bridge University Press.
Leung, S. F., and W. H. Wong (1990): "Nonparametric Hazard Estimation with
Time-Varying Discrete Covariates," Journal of Econometrics, 45, 309-330.
Lipster, R. S., and A. N. Shiryayev (1978): Statistics of Random Processes II:
Applications. New York: Springer-Verlag.
McKeague, I. W., and K. J. Utikal (1990): "Inference for a Nonlinear Counting
Process Regression Model," Annals of Statistics, 18, 1172-1187.
Nelson, W. (1969): "Hazard Plotting for Incomplete Failure Data," J. Qual.
Technol., 1, 27-52.
Petersen, T. (1986): "Fitting Parametric Survival Models with Time-Dependent
Covariates," Journal of the Royal Statistical Society, Series C, 35, 281-288.
Pollard, D. (1984): Convergence of Stochastic Processes. New York: Springer-
Verlag.
Rebolledo, R. (1980): "Central Limit Theorems for Local Martingales," Z.
Wahrsch. verw. Gebiete, 51, 269-286.
Victor Kipnis
University of Southern California*
Abstract
that selects a p-subset, p < k, of the predictor variables and yields the
n_W-vector Ŷ_W of the predicted values of Y_W based on the OLS fitting
of the selected subset. To be more specific, let S_p = {i₁, i₂, ..., i_p} be a
set of indices of the predictor variables to be included in the selected p-
subset. Let D_p be a corresponding k×p 'indicator' matrix consisting of
zeros and ones such that X_{Vp} = X_V D_p contains only those p columns of
X_V with indices from S_p. Then the procedure g consists of (i) choosing
subset S_p; (ii) estimating the vector β of the regression coefficients by
β̂ = D_p β̂_p, where β̂_p = (D_p' X_V' X_V D_p)^{-1} D_p' X_V' Y_V; (iii) predicting Y_W by
Ŷ_W = X_W β̂. Note that although β̂_p is the OLS estimator of β_p = D_p' β,
the resulting vector β̂ is not the OLS estimator of β. Its distribution,
same data have been used for both construction and evaluation. This
is a familiar fact that could be easily demonstrated for a very sim-
ple procedure consisting of OLS-fitting of an a priori specified subset
X_{Vp} = X_V D_p, with fixed D_p, and X_W = X_V, i.e., when a new response
vector Y_W is predicted for the same set X_V of observations on the ex-
planatory variables. For future reference we will denote this procedure
by g_p. We have
Ŷ_W = g_p(V) = P_p Y_V,  (5)
where P_p = X_{Vp}(X_{Vp}' X_{Vp})^{-1} X_{Vp}' is the projection matrix onto the fixed
linear space spanned by the column-vectors of the matrix X_{Vp}. In this
case, X_V β̂ = P_p Y_V, and it follows from (3)-(4) that
E[AL(g_p; V)] = σ²(1 − p/n) + (1/n) β' X_V' (I_n − P_p) X_V β  (6)
and
MSEP(g_p) = σ²(1 + p/n) + (1/n) β' X_V' (I_n − P_p) X_V β.  (7)
'best' one, with implied false inference about the thereby obtained predic-
tors. To be able to get more adequate estimators one has to study the
distribution of Ŷ_W under that very exploratory procedure which has
yielded this predictor.
As exact distributions are extremely difficult to study analyti-
cally, even for relatively simple selection procedures, prediction as-
sessment requires data sets independent from the construction data
V. To expedite this process, the independent sets could be replaced
by pseudosamples which in some sense are close to the original set.
There are different methods for construction of pseudosamples. One
leads to different forms of splitting the data or cross—validation (e.g.,
Stone, 1974; Geisser, 1975). Cross-validation methods could be very
helpful in understanding the behaviour of MSEP for some selection proce-
dures, as was demonstrated by Hjorth (1982), and Picard and Cook
(1984) among others. Complications arise, though, for more compli-
cated procedures, which examine different classes of models and sets of
predictor variables that have not been planned in advance, but evolve
in the process of selection itself. Proper cross—validation in this case
requires excluding more and more observations for assessment of the
ensuing results at each new iteration, which could become impractical.
Another method of generating pseudosamples, which does not
have these limitations, is bootstrap. Recently bootstrap methods have
been successfully used for solving various statistical problems includ-
ing estimating the predictive ability for an a priori specified regression
model (e.g., Bunke and Droge, 1984). In the discussion on applying
bootstrap to estimating MSEP under subset selection (e.g., Miller,
1984), it is usually pointed out that this approach seems to suffer
from the complication that the selected set of predictor variables will
vary for different pseudosamples and could be quite different from
that fitted for the given construction sample. The key to our present
approach is just a small step from the usual bootstrap regression anal-
ysis but an important one nevertheless. We bring in the procedural
approach and suggest that assessment of the efficiency of the predictor
should rest on the bootstrap assessment of the exploratory procedure
by which this predictor has been chosen, rather than the evaluation
of any particular subset of variables. It is interesting to note that this
approach has been implicitly applied in Efron (1983, 1986) and Gong
(1986), where it was found better than cross-validation, and in Freed-
man et al. (1987), where it was declared a failure. We will briefly
comment on this failure in Section 3 (see Remark 1).
3. Bootstrap Estimators.
The idea is to analyze performance characteristics of the proce-
dure g on the data generated by a known random mechanism. The
main requirement is that this mechanism, or as we will call it, pseu-
domodel, should simulate the unknown regression model (1). In other
words, it should generate pseudosamples that are ‘close’ to the ob-
served one with regard to their statistical structure. There are several
ways to construct such a pseudomodel. One possible approach is to
368 Kipnis
use
Ỹ = Ỹ⁰ + ε̃ = X_V β̃ + ε̃,  (8)
where β̃ and the distribution F̃ of ε̃ are estimated from the data V.
In a parametric bootstrap, when the form of the distribution F of ε is
assumed known, F̃ is obtained by estimating the unknown parameters
of F. In a nonparametric bootstrap, ε̃ is usually a random sample
(perhaps times a weight factor) from the empirical distribution F̃ =
F_n(r₁, ..., r_n) of the residuals r = Y_V − X_V β̃. When the true model (1)
is unknown, choosing B seems to be one of the most sensitive elements
of constructing a pseudomodel. If procedure g is rather complicated,
e.g. involves subset selection, misspecification of the mean component
Y° in (8) could lead to quite a different performance on pseudodata
as compared to the real data.
In subset selection, it is usually assumed that X in (1) includes all
relevant variables plus, perhaps, some extraneous variables. In this sit-
uation it seems reasonable to use the OLS estimate β̃ = (X_V' X_V)^{-1} X_V' Y_V
based on the 'full' set of explanatory variables. By using the best un-
biased estimator of β, one hopes to get a pseudomodel that is as close
to the real one as possible (see also Remark 3 below). A different
approach, resampling rows from the matrix V, does not seem to
be appropriate here. First, it would lead to pseudodata with a design
matrix different from its counterpart X_V, which is assumed fixed. Second,
if k is close to n, there is a high probability of getting fewer than k
distinct rows and, thus, not a full rank design matrix.
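A sketch of this pseudomodel, with the full-model OLS estimate and nonparametric resampling of the residuals, is given below; the function and variable names are illustrative.

    import numpy as np

    def make_pseudomodel(Y_V, X_V, rng=None):
        rng = np.random.default_rng(rng)
        beta_tilde = np.linalg.lstsq(X_V, Y_V, rcond=None)[0]   # 'full'-model OLS estimate
        resid = Y_V - X_V @ beta_tilde                          # residual vector r
        def draw(X):
            """Generate one pseudo-response vector for design matrix X."""
            eps = rng.choice(resid, size=X.shape[0], replace=True)
            return X @ beta_tilde + eps
        return beta_tilde, draw

    # beta_tilde, draw = make_pseudomodel(Y_V, X_V)
    # Y_V_pseudo, Y_W_pseudo = draw(X_V), draw(X_W)   # pseudoconstruction / pseudotarget draws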
Consider now two pseudosamples, a pseudoconstruction set Ṽ =
(X_V, Ỹ_V) and a pseudotarget set W̃ = (X_W, Ỹ_W), where Ỹ_V and Ỹ_W
are n×1 and n_W×1 random vectors, respectively, independently
generated from model (8). Applying the same selection procedure
g that is used for the original construction set V to the pseudoset
Ṽ, we get a pseudopredictor Ŷ_W(Ṽ) = g(Ṽ; X_W) and the pseudolosses
L̃_W(g; Ṽ) = (1/n_W)(Ŷ_W(Ṽ) − Ỹ_W)'(Ŷ_W(Ṽ) − Ỹ_W). As the pseudomodel (8) is com-
pletely known, Monte Carlo replications could be used to analyze the
distribution of L̃_W(g; Ṽ). Characteristics of this distribution serve as
estimates of their counterparts for the distribution of the real losses
L_W(g; V). This approach leads to the so-called direct bootstrap esti-
mator
R^D(g) = E_*[MSEP(g; Ṽ)],  (9)
where E_* denotes expectation with respect to the random mechanism
(8).
The idea of direct bootstrap estimation has recently been applied
in Freedman et al. (1987). Disappointingly, the bootstrap did not per-
form very well. That R^D is generally biased may be easily illustrated
with the procedure g_p.
R^D(g_p) = σ̃²(1 + p/n) + (1/n) β̃' X_V' (I_n − P_p) X_V β̃,
so that
REMARK 1: The above result explains why the direct bootstrap estimator
performs well in the traditional regression framework, when the procedure
g_p is applied to the 'true' model (1). Then p = k, and R^D is unbiased as has
been demonstrated in the literature. On the other hand, the fact that the
direct bootstrap estimator becomes biased even for such a simple procedure
as g_p may explain its failure in subset selection as was observed in Freedman
et al. (1987).
Another possible approach to deriving MSEP estimators is based on
using the pseudomodel (8) for evaluating ‘overoptimism’ of the autolosses
AL(g,V) in order to make an appropriate adjustment. At least two choices
present themselves for representing average overoptimism: the difference
R̂_C = (1/n)(RSS_p + 2pσ̂²)  (13)
and
J_p = [(n + p)/(n(n − p))] RSS_p,  (14)
suggested by Rothman (1968) and Amemiya (1980).
4. Experimental Comparison of the Conventional and the Boot-
strap Estimators.
To illustrate the effect of subset selection on conventional MSEP esti-
mators and to compare these estimators with the bootstrap estimators, the
following simulation study was conducted. In all the experiments the simu-
lated data satisfied model (1), with ε ~ N(0, σ²I_n), where matrices X_V and
X_W were orthonormal with the same number n = n_W of observations. As
was pointed out in Miller (1984), the orthogonal case is far from being the
simplest one with respect to MSEP estimation and actually gives an example
of intermediate deterioration of the conventional estimators under subset se-
lection. An advantage of considering the orthogonal case (besides obvious
computational simplifications) lies in the fact that all major subset selection
procedures, such as the best subset regression, forward selection, backward
elimination, and stepwise regression lead here to the same 'best' p-subset, for
each p = 0,1,...,k, a property that does not hold for nonorthogonal predic-
tor variables. Consider the best subset regression procedure g, which consists
of screening all 2^k subsets and selecting the best one with regard to some
criterion (e.g., Hocking, 1976). Usually such a criterion is based on one of
the conventional adjusted estimators mentioned above. Then for any fixed
p, the 'best' p-subset is the one with minimum RSS_p, and g could be con-
ceived of as a two-step procedure. At the first step, for each p, p = 0,1,...,k,
a p-subset corresponding to the minimum RSS_p is found. The second step
consists in comparing these subsets and choosing the overall best according
to the adopted criterion. In the experimental study the three bootstrap es-
timators (9)-(11) were compared with the two conventional estimators (13)
and (14) with regard to the evaluation of the MSEP for each of the best
p-subsets at the first step of the procedure. Then each estimator served as
the criterion for the overall choice at the second step. This made it possible
to compare these statistics not only as MSEP estimators under the search
process, but also as stopping rules in subset selection.
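In the orthonormal case the minimum-RSS_p subset for each p simply keeps the p components with the largest |β*_i|. The sketch below traces this path and evaluates the conventional criteria using the formulas shown in eqs. (13)-(14); it is an illustration, not the code used in the study.

    import numpy as np

    def best_subset_path(Y_V, X_V):
        """RSS_p and the conventional criteria along the best-subset path."""
        n, k = X_V.shape
        b_star = X_V.T @ Y_V                       # OLS, since X_V'X_V = I_k
        order = np.argsort(-np.abs(b_star))        # variables ranked by |b*_i|
        rss_full = float(np.sum((Y_V - X_V @ b_star) ** 2))
        sigma2_hat = rss_full / (n - k)
        path = []
        for p in range(k + 1):
            rss_p = rss_full + float(np.sum(b_star[order[p:]] ** 2))
            J_p = (n + p) / (n * (n - p)) * rss_p              # eq. (14)
            R_C = (rss_p + 2 * p * sigma2_hat) / n             # eq. (13)
            path.append((p, rss_p, J_p, R_C))
        return path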
Note that S in (15) is a random set of indices. If the true model (1) were
known, the best predictor with minimum MSEP should be based on the
subset S* of indices i such that β_i²/σ² > 1. This fact easily follows from
(3) since for any two fixed subsets S_{p₁} and S_{p₂}, p₁ < p₂,
MSEP(g_{p₂}) − MSEP(g_{p₁}) = (1/n)[σ²(p₂ − p₁) − Σ_{i ∈ S_{p₂}\S_{p₁}} (β'e_i)²],
where the i-th component of vector e_i equals one and all other compo-
nents equal zero. Three values of the true vector β were considered: (i)
β_I = (0,0,...,0)', which represents the model with no relevant variables;
(ii) β_II = (β₁,β₂,...,β_q,0,0,...,0)', where β_i = E[Z(i;q)], Z(i;q) is the
i-th order statistic from q N(1,.25) random variables, and q is the closest
integer to k/3. Here the first q elements of β_II are near the resolv-
ing power of the system with 'signal-to-noise' ratio (β_i)²/σ² ≈ 1; (iii)
β_III = (7.0,5.0,0,0,...,0)', which represents the case with two very sig-
nificant predictor variables with signal-to-noise ratios of 49 and 25,
respectively.
For each model specification, as defined by σ² = 1, n = 50, k =
15, 25, 35 and β = β_I, β_II, β_III, 1000 basic data sets V_m = (X_V, Y_V(m))
and W_m = (X_W, Y_W(m)), m = 1,2,...,1000, were generated following
(1). To each data set V_m the best subset procedure g was applied, and for
each p = 0,1,...,k, the predictor Ŷ_W(m,p) based on the p-subset X_{Vp}(m) =
X_V D_p(m) with the minimum RSS_p was found. The 'true' conditional
MSEP was calculated by (2). The two conventional estimators were calcu-
lated by formulas (13) and (14) as based on subset X_{Vp}(m). The adopted
nonparametric pseudomodel was based on (8) with β̃ = (X_V' X_V)^{-1} X_V' Y_V
and ε̃_i drawn from F̃ = F_{n−k}(r_{k+1}, ..., r_n), where r = Y_V − X_V β̃. For each simulated set m,
the direct, the additive, and the multiplicative estimators were calculated
by generating 200 pseudosamples Ṽ_i(m) = (X_V, Ỹ_V(m,i)) and W̃_i(m) =
(X_W, Ỹ_W(m,i)) from the chosen pseudomodel.
REMARK 6: The decision to generate ε̃_i from the empirical distribution of
the last n − k residuals, as opposed to the traditionally used empirical distri-
bution of all the residuals, reflects a very important feature of an orthogonal
regression. Due to the choice of matrix X_V, r₁ = ... = r_k = 0, so that
when k is comparable with n, there is a high probability of getting too many
zero pseudo-disturbances ε̃_i. This would make pseudodata quite different
from the real observations. Another advantage of the chosen F̃ lies in the
fact that it is based on independent residuals as opposed to the full set of
components of r. Besides, σ̃² becomes an unbiased estimator for σ².
To judge the performance of each estimator R in the simulations, two
criteria were used: the mean squared error
TABLE 1.1. The means and standard deviations (as based on 1,000 experiments) of different MSEP estimators, β = β_I.
TABLE 1.2. The means and standard deviations (as based on 1,000 experiments) of different MSEP estimators, β = β_II.
TABLE 1.3. The means and standard deviations (as based on 1,000 experiments) of different MSEP estimators, β = β_III.
unconditional MSEP (e.g., Hocking, 1976; Seber, 1977), but later most of
them were looked upon as estimators of the conditional MSEP (e.g., Hjorth,
1982; Efron, 1983, 1986; Picard and Cook, 1984). Recently, though, the
fashion seems to have been changing again. As is mentioned in Gong (1986)
and Boudreau (1988), it is, perhaps, unfair to consider MSEP estimates as
evaluating the conditional risk, so that they must be viewed as estimators of
the unconditional MSEP. Without trying to resolve this dispute, it is worth
mentioning that from the procedural point of view it is very important to
estimate the unconditional risk. As was mentioned in Section 2, the uncon-
ditional MSEP is the only measure that evaluates the exploratory procedure,
as opposed to any particular chosen model.
Since the results for the three considered k values were not qualitatively
different, in Tables 1 - 4 we report the three model specifications with β =
β_I, β_II, and β_III, for k = 25.
(i) Bias. From Tables 1.1 - 1.3 one can see that both conventional es-
timators J_p and R̂_C are considerably biased downward when the number of
regressors in a subset exceeds the number of non-zero coefficients of β. For
p = 9 the bias of J_p is about 40% and the bias of R̂_C is more than 30% of the
actual MSEP values in all three considered examples. When p approaches
the full model size, the bias gets smaller and becomes negligible when p = 25.
As was expected, the direct bootstrap estimator is biased upward. When
p is small this bias almost follows the same pattern as for the procedure g_p,
i.e. (k − p)σ²/n. When p exceeds 3 the bias gets smaller as compared to
J_p, and it becomes almost negligible when p > 10. The 'indirect' bootstrap
estimators have considerably smaller bias than the conventional estimators.
The additive estimator R^A is a clear winner here, having the smallest bias
throughout. For all three considered cases, its bias is less than 2% of the
actual MSEP for p between 0 and 9, then it increases slightly and reaches
about 3% of the MSEP when p = 25. The bias of the multiplicative estima-
tor, R^M, is higher, approaching 10% of the MSEP value when p = 9, but
it still remains relatively small as compared to the bias of the conventional
estimators.
(ii) MSE. As follows from Tables 1.1 - 1.3, both conventional esti-
mators have smaller variances than the bootstrap estimators since they do
not account for the prediction error due to selection. Still, in terms of the
conditional and/or unconditional MSE (Tables 2.1 - 2.3), the indirect bootstrap estimators
considerably outperform J_p and R̂_C: their larger variances are more than
compensated for by substantially lower biases. As compared between them-
selves, although R^A is less biased than R^M, its lower bias trades off against
its higher variability, so that the MSE values for both estimators are prac-
tically the same in the three considered examples. The direct bootstrap
estimator occupies some middle ground except for small values of p (p < 5),
where R^D has the highest MSE due to a considerable bias.
(iii) MSEP as a function of p. What is even worse than biasedness,
both J_p and R̂_C do not follow the actual MSEP and lead to wrong conclusions
as to which subset is the overall best for prediction. From Tables 1.1 - 1.3
one can see that on average the best predictor (with regard to the actual
minimum MSEP) includes no variables if β = β_I and β = β_II, and has 2
TABLE 2.1. The conditional and unconditional mean squared errors (as based on 1,000 experiments) of different MSEP estimators, β = β_I.
TABLE 2.2. The conditional and unconditional mean squared errors (as based on 1,000 experiments) of different MSEP estimators, β = β_II.
TABLE 2.3. The conditional and unconditional mean squared errors (as based on 1,000 experiments) of different MSEP estimators, β = β_III.
variables for β = β_III. At the same time R̂_C, the least corrupted of the
two conventional estimators, has its minimum when p = 4 for β_I and p = 6
for β_II and β_III.
REMARK 8: The results for β = β_II (Table 1.2) are especially interesting,
because they illustrate the fact that subset selection is not the best way of
deciding which variables should be included in the predictor equation, at least
when some of the components of β are near the level of predictor significance.
Due to selection, even when p equals the number of relevant regressors, the
'best' p-subset does not often include all the significant variables but contains
some 'noisy' regressors with small or zero coefficients. As a result, on average
we are still better off with the 'naive' prediction Ŷ_W = 0. Copas (1983)
and Miller (1984), among others, make similar conclusions. Unfortunately,
both conventional estimators, J_p and R̂_C, do not reflect this very important
feature of subset selection and mislead a researcher with an overoptimistic
assessment.
The same remains true with regard to the direct bootstrap estimator.
On the contrary, the indirect bootstrap estimators behave similarly to the
true MSEP. R^A has the smallest average values for p = 0 when β = β_I and
β = β_II, and p = 2 for β = β_III. R^M differs only in the most difficult case of
β = β_III, where it has the smallest average value when p = 1. The fact that
the indirect bootstrap estimators closely follow the actual MSEP indicates
that they could be used as criteria for choosing the overall best predictor.
Tables 3.1 - 3.3 and 4 display characteristics of the final predictor selected
from among (k + 1) best p-subsets, p = 0,1,...,k, according to a criterion
based on each of the six statistics: actual MSEP(g,V), J_p, R̂_C, R^D, R^A, and
R^M. Tables 3.1 - 3.3 contain the empirical distributions of the number
of variables in these final predictors. It follows that the direct bootstrap
estimator provides the worst criterion with respect to the distribution of the
optimal p. The conventional estimators obviously lead to overfitting. The
indirect bootstrap estimators provide distributions very similar to the one based
on using the actual MSEP, except for somewhat thicker right-hand tails.
These long tails could be explained by the fact that all k components of β̃
in the pseudomodel (8) are always different from zero, as opposed to the
three considered true vectors β in model (1). Table 4 contains the average
MSEP values for the final predictors. It follows that although R^D still leads
to some poor results, the two indirect bootstrap estimators, used as selection
criteria, provide substantially better final predictors than those based on the
conventional criteria.
5. Conclusion.
The theory behind conventional estimators for predictive efficiency is
not valid when predictor selection and estimation are from the same data.
The very selection process affects the distribution of those estimators and,
in particular, leads to their substantial bias when the selection effect is not
allowed for. It is suggested that each estimator should be developed for
the selection procedure it is used with. As exact distributional results are
extremely difficult to study analytically, even for relatively simple subset se-
lection procedures, the indirect bootstrap assessment described above seems
to be helpful in solving the problem. The bootstrap method appears general
TABLE 3.1. Empirical distribution of the number of variables p in the final predictor, β = β_I.

p    MSEP(g,V)   J_p    R̂_C    R^D    R^A    R^M
0    1000        6      25     372    868    710
1    0           14     94     62     49     99
2    0           38     119    50     18     55
3    0           54     179    27     11     36
4    0           115    161    26     5      19
5    0           127    164    25     13     21
6    0           156    86     16     4      15
7    0           153    74     17     2      7
8    0           101    47     12     3      4
9    0           87     27     8      1      7
10   0           56     12     10     2      2
11   0           43     5      13     1      4
12   0           25     5      8      1      3
13   0           13     1      4      0      4
14   0           9      0      12     2      2
15   0           2      1      15     0      1
16   0           0      0      6      1      1
17   0           0      0      10     0      1
18   0           1      0      14     2      0
19   0           0      0      17     0      2
20   0           0      0      17     2      0
21   0           0      0      24     2      1
22   0           0      0      29     3      3
23   0           0      0      34     1      0
24   0           0      0      42     7      2
25   0           0      0      130    2      1
[Table 3.2. As Table 3.1, for the second true coefficient vector.]
[Table 3.3. As Table 3.1, for the third true coefficient vector (p = 0, 1, ..., 25).]
[Table 4. Average MSEP of the final predictor under each of the six selection criteria.]
and flexible enough, and in principle could be used for any exploratory procedure. Indeed, after generating the necessary number of pseudosamples, the same exploratory process that was used for the original data is applied to each one of them, and the corresponding empirical distribution of predictor errors provides all characteristics of interest.
One of the major problems in applying bootstrap to model building pro-
cedures consists in choosing an appropriate pseudomodel. Since the ‘true’
model is unknown, the usual bootstrap regression idea of using estimated pa-
rameters of the true model as the pseudoparameters in the pseudomodel does
not work. This difficulty is aggravated by the fact that rather complicated
exploratory procedures prove to be quite sensitive to the choice of the pseu-
domodel. In the framework of subset selection this problem seems to have
been resolved by using the ‘full’ model estimates to construct a pseudomodel.
But this approach is by no means mandatory, and other pseudomodels could
and, perhaps, should be used in different situations.
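As an illustration of this general recipe, the following minimal Python sketch (our own construction, not the estimators studied above) builds a pseudomodel from the full least-squares fit, reruns an exhaustive best-subset selection on each pseudosample, and records squared prediction errors; the error definition used here (against the pseudomodel mean) and all function names are assumptions for the sake of the example.

import numpy as np
from itertools import combinations

def best_subset(X, y, p):
    # Exhaustive least-squares fit over all p-variable subsets; returns the
    # chosen columns and their coefficients (p = 0 gives the 'naive' predictor 0).
    if p == 0:
        return (), np.zeros(0)
    best_cols, best_beta, best_rss = None, None, np.inf
    for cols in combinations(range(X.shape[1]), p):
        Xs = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = ((y - Xs @ beta) ** 2).sum()
        if rss < best_rss:
            best_cols, best_beta, best_rss = cols, beta, rss
    return best_cols, best_beta

def bootstrap_assessment(X, y, p, n_pseudo=200, rng=None):
    # Pseudomodel: the 'full' k-variable least-squares fit, with resampled
    # residuals.  The same selection-plus-estimation process is rerun on
    # every pseudosample and the squared prediction errors are collected.
    rng = np.random.default_rng(rng)
    n = len(y)
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_full
    errors = []
    for _ in range(n_pseudo):
        y_star = X @ beta_full + rng.choice(resid, size=n, replace=True)
        cols, beta = best_subset(X, y_star, p)
        y_hat = X[:, cols] @ beta if p > 0 else np.zeros(n)
        errors.append(np.mean((X @ beta_full - y_hat) ** 2))
    return np.mean(errors), np.array(errors)

A direct-style assessment like this one does not by itself reproduce the indirect adjustments discussed above; it only shows where the empirical distribution of predictor errors comes from.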
It is also important to note that the direct bootstrap method does not work well enough in exploratory analysis, as was demonstrated above by the performance of the direct estimator. The indirect approach, which tries to improve, or adjust, the existing estimators, seems to be crucial in this case. As a result, the
suggested method can be used to assess the efficiency of different regression
procedures, to compare those procedures with each other, and to choose the
most efficient one. It also may be helpful in correcting the existing proce-
dures by providing, for example, a criterion for a stopping rule in multistep
model building.
References
Bootstrapping Likelihood Ratios

Nicholas Schork
University of Michigan
I. Introduction
f(t | p_1, p_2, p_3, μ_1, μ_2, μ_3, σ_1², σ_2², σ_3²) = Σ_{i=1}^{3} p_i · φ(t | μ_i, σ_i²),   (1)
where φ is the normal density function and i indexes the genotype; i.e., AA (i = 1), Aa or aA (i = 2), and aa (i = 3). Modeling the effects of dominance and recessivity at the assumed locus is straightforward; one simply assumes either μ_Aa = μ_aa or μ_AA = μ_Aa. Often the assumption of equal variances can be made (i.e., σ²_AA = σ²_Aa = σ²_aA = σ²_aa). A graphical representation of the modeling of quantitative traits with normal mixture distributions is given in Figure 1 of the appendix, where dominance (of the a allele over the A allele) is assumed.
An oft-used statistical practice in quantitative genetics is the fitting of
mixture distribution models (e.g., based on equation 1) to data with unknown
likelihood ratio statistic under g(x; θ̂), where θ̂ and η̂ are consistent estimators of θ and η, respectively. Draw data sets X*_1, ..., X*_r from g(x*; θ̂) and compute T*_i (i = 1, ..., r). From the T*'s estimate a relevant critical quantile C. Compare C to T to make inferences about H_0. This test construction has been referred to as the "parametric bootstrap" test in the literature, and several authors have investigated its usefulness in nested hypotheses situations (see Hall and Titterington [1989]; Jöckel [1986]; and Hope [1968]).
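To make the construction concrete, here is a minimal sketch (assuming numpy, scipy, and scikit-learn for the EM fit of the mixture; the function names and the use of the log-likelihood difference as T are our own choices, not code from the chapter). With r = 19 and a nominal level of 0.05, rejecting when the observed statistic exceeds every T* is the usual Monte Carlo test rule.

import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def lognormal_loglik(x):
    # Maximum likelihood fit of the lognormal null; returns its log-likelihood
    # and the fitted (shape, scale) parameters.
    shape, loc, scale = stats.lognorm.fit(x, floc=0)
    return stats.lognorm.logpdf(x, shape, loc, scale).sum(), (shape, scale)

def mixture_loglik(x, components=2, seed=0):
    # EM fit of the normal-mixture alternative; returns its log-likelihood.
    gm = GaussianMixture(n_components=components, random_state=seed)
    gm.fit(x.reshape(-1, 1))
    return gm.score(x.reshape(-1, 1)) * len(x)

def parametric_bootstrap_test(x, r=19, seed=0):
    # Parametric bootstrap test of the lognormal null against a two-component
    # normal mixture alternative, following the recipe sketched in the text.
    rng = np.random.default_rng(seed)
    ll0, (shape, scale) = lognormal_loglik(x)
    T = mixture_loglik(x) - ll0                     # observed likelihood ratio statistic
    T_star = []
    for _ in range(r):                              # r pseudo data sets drawn under H0
        x_star = stats.lognorm.rvs(shape, scale=scale, size=len(x), random_state=rng)
        ll0_star, _ = lognormal_loglik(x_star)
        T_star.append(mixture_loglik(x_star) - ll0_star)
    C = max(T_star)                                 # with r = 19, the 0.05-level critical value
    return T, C, T > C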
We examined the usefulness of the parametric bootstrap test for settings involving normal mixture and lognormal distributional hypotheses. To investigate the level of the test assuming the lognormal model is the correct model, we generated 100 data sets following lognormal distributions with mean 0 and differing variances for various sample sizes and computed a parametric bootstrap test with r = 19 and a nominal level of 0.05 for each of the 100 replications. The number of rejections resulting from these simulations was tallied and taken as an estimate of the level of the test. These results are outlined in Table 1 of the appendix. This table suggests the parametric bootstrap test is at or near the nominal level of the test. The small number of replications (i.e., 100) used to estimate the significance levels was chosen to keep the computational burden of estimating the mixture distribution parameters for each replication to a minimum.
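Assuming the sketch above, this level simulation amounts to a short loop; the particular log-scale variance and sample size below are placeholders, not the values used in Table 1.

rejections = 0
for b in range(100):
    # lognormal (null) data with mean 0 on the log scale
    x = stats.lognorm.rvs(s=1.0, scale=1.0, size=50, random_state=b)
    rejections += parametric_bootstrap_test(x, r=19, seed=b)[2]
print("estimated level:", rejections / 100)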
To investigate the power of the test in this situation, we again kept the
lognormal distribution as the null model, but generated data following a 2
component normal mixture with mixing weights of 0.75 and 0.25, standard
deviation 1, and means separated at a distance varying from 1 to 4 “within”
component standard deviation units. The rationale for this was that as the
separation between the component means gets larger, “bimodality” (or the
conspicuous presence of a normal mixture distribution) would become more
pronounced (this is not the case in Figure 1) and therefore the lognormal hypothesis should be more easily rejected. Power estimates based on this strategy were computed from 100 replications assuming four different sample sizes. A plot of the results is given in Figure 2 of the appendix. It strongly suggests that, as expected, the power to reject lognormality increases as the
separation between the means of the normal mixture increases. It should be
emphasized that it has been our experience that a homoscedastic 2 component
normal mixture with mixing weights 0.75 and 0.25 setting provides one of the
392 Schork
a _.| |e
poorest environments for rejecting the lognormal hypothesis at low mean com-
ponent separations. That is, other settings (e.g., mixing weights of 0.5 and
0.5) produce higher power to reject the lognormal hypothesis at small mean
component separations (which is intuitive, given that certain normal mixture
distribution settings don’t look or behave like lognormal distributions). Our
investigations of the level and power properties of the test and test setting as
described are by no means exhaustive; further studies involving other lognor-
mal and mixture parameter settings are called for, as are studies investigating
the level and power properties of the test when the roles of the hypotheses
are reversed (i.e., the normal mixture is the null and the lognormal is the
alternative hypothesis).
IV. Conclusion
References
Appendix

[Table 1. Estimated significance levels for Monte Carlo tests of the lognormal null model against a homoscedastic 2-component normal mixture, based on 100 replicates of lognormal (i.e., null) data with mean 0 and differing variances, for sample sizes n = 25, 50, 100, and 250.]

[Figure 2. Estimated power of the test (null model: lognormal; alternative model: normal mixture) against the separation of the mixture component means, for n = 25, 50, 100, and 250.]

[Additional appendix figure. Densities under the null (polygenes) and alternative (major locus) models, with the observed statistic (14.04) marked.]
A Nonparametric Density Estimation Based Resampling Algorithm

f_S(x) = (1/n) Σ_{j=1}^{n} δ(x − x_j).   (2)
to be terrible. It says that all that can happen in any other experiment is a repeat of the data points already observed, each occurring with probability 1/n. Yet, for many purposes f_S(x) is quite satisfactory. For example, the mean of f_S is simply

∫ x f_S(x) dx = (1/n) Σ_{j=1}^{n} x_j = x̄.   (3)

Similarly,

∫_{−∞}^{∞} (x − x̄)² f_S(x) dx = (1/n) Σ_{j=1}^{n} (x_j − x̄)² = s².   (4)
ρ = Cov(x, y) / (σ_x σ_y).   (7)

Suppose that we have a sample of size n: {(x_j, y_j)}_{j=1}^{n}. Then we construct

f̂(x, y) = (1/n) Σ_{j=1}^{n} φ_h(x − x_j, y − y_j),   (8)

where φ_h denotes a bivariate normal kernel.
tempted to try a scheme which goes directly from the actual sample to the
pseudo-sample. Of course, this is precisely what the bootstrap estimator does,
with the very bad properties associated with a Dirac-comb. It is possible,
however, to go from the sample directly to the pseudo-sample in such a way
that the resulting estimator behaves very much like that of the normal kernel
approach above. This is the SIMDAT algorithm of Taylor and Thompson.
We assume that we have a data set of size n from a p-dimensional variable X, {X_j}_{j=1}^{n}. First of all, we shall assume that we have already rescaled our
data set so that the marginal sample variances in each vector component are
the same. For a given integer m, we find, for each of the n data points, the
m — 1 nearest neighbors. These will be stored in an array of size n x (m —1).
Let us suppose we wish to generate a pseudo-sample of size N. Of course,
there is no reason to suppose that n and N will, in general, be the same (as
is the case generally with the bootstrap). To start the algorithm, we sample
one of the n data points with probability 1/n (just as with the bootstrap).
Starting with this point, we recall it and its m — 1 nearest neighbors from
memory, and compute the mean of the resulting set of points:
X̄ = (1/m) Σ_{l=1}^{m} X_l.   (14)

We now generate our centered pseudo-data point X′ via

X′ = Σ_{l=1}^{m} u_l (X_l − X̄),   (15)

where u_1, ..., u_m are the random weights whose moments are given in (18) below, and then set

X = X′ + X̄.   (16)
The procedure above, as m and n get large, becomes very like that of the
normal kernel approach mentioned earlier. To see why this is so, we consider
the sampled vector X_j and its m − 1 nearest neighbors,

{X_l}, l = 1, ..., m, with X_l = (x_{1l}, x_{2l}, ..., x_{pl})ᵀ,   (17)

together with independent random weights u_1, ..., u_m for which

E(u_i) = 1/m,  Var(u_i) = (m − 1)/m²,  Cov(u_i, u_j) = 0 for i ≠ j.   (18)
Next, we form the linear combination

Z = Σ_{l=1}^{m} u_l X_l.   (19)

For the r-th component of the vector Z, z_r = u_1 x_{r1} + u_2 x_{r2} + ... + u_m x_{rm}, we observe the following relationships:

E(z_r) = μ_r,   (20)

Var(z_r) = σ_r² + ((m − 1)/m) μ_r²,   (21)

Cov(z_r, z_s) = σ_rs + ((m − 1)/m) μ_r μ_s.   (22)
Note that if the mean vector of X were (0, 0, ..., 0), then the mean vector and covariance matrix of Z would be the same as those of X, i.e., E(z_r) = 0, Var(z_r) = σ_r², and Cov(z_r, z_s) = σ_rs. Naturally, by translation to the local
sample mean of the nearest neighbor cloud, we will not quite have achieved
this result. But we will come very close to the generation of an observation
from the truncated distribution which generated the points in the nearest
neighbor cloud.
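The scheme in (14)-(16) is easy to state as code. The sketch below is our own, written in Python with numpy, and is not the gendat or RNDAT implementations mentioned later; the particular uniform law for the weights is simply one convenient choice satisfying the moment conditions in (18).

import numpy as np

def simdat(data, n_pseudo, m, rng=None):
    # A sketch of the SIMDAT-style resampling step described in (14)-(16):
    # each pseudo-observation is a weighted recombination of a sampled point
    # and its m - 1 nearest neighbors about their local mean.
    rng = np.random.default_rng(rng)
    X = np.asarray(data, dtype=float)
    n, p = X.shape

    # Rescale so the marginal sample variances agree (undone at the end).
    scale = X.std(axis=0, ddof=1)
    Z = X / scale

    # Indices of each point's m nearest points (itself plus m - 1 neighbors).
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    nbrs = np.argsort(d2, axis=1)[:, :m]

    # Uniform weights with E(u) = 1/m and Var(u) = (m - 1)/m**2, one choice
    # consistent with the moment conditions in (18).
    half = np.sqrt(3.0 * (m - 1)) / m
    lo, hi = 1.0 / m - half, 1.0 / m + half

    out = np.empty((n_pseudo, p))
    for i in range(n_pseudo):
        j = rng.integers(n)                      # sample a data point, probability 1/n
        cloud = Z[nbrs[j]]                       # the point and its m - 1 neighbors
        centre = cloud.mean(axis=0)              # local mean, equation (14)
        u = rng.uniform(lo, hi, size=m)          # random weights
        out[i] = u @ (cloud - centre) + centre   # equations (15) and (16)
    return out * scale                           # undo the rescaling

Calling simdat(sample, N, m=5) on an (n, p) array produces N pseudo-observations; taking m = 1 reduces the scheme to the ordinary bootstrap, while very large m behaves like sampling from a fitted multivariate normal, as discussed below.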
m = C n^{4/(p+4)}.   (23)
Of course, people who carry out nonparametric density estimation realize
that such formulae have little practical relevance, since C is usually not avail-
able. Beyond this, we ought to remember that our goal is not to obtain a
nonparametric density estimator, but rather to generate a data set which ap-
pears like that of the data set before us. Let us suppose we err on the side of
making m far too small, namely, m = 1. That would yield simply the boot-
strap. Suppose we err on the side of making m far too large, namely, m = n.
That would yield an estimator which roughly sampled from a multivariate nor-
mal distribution with the mean vector and covariance matrix computed from
the data.
Below, in Figure 1, we show a sample of size 85 from a mixture of three
normal distributions with the weights indicated, and a pseudo data set of size
85 generated by SIMDAT with m = 5. We note that the emulation of the data
is reasonably good. In Figure 2, we go through the same exercise, but with
m = 15. The effects of a modest oversmoothing are noted. In general, if the data set is very large, say of size 1,000 or greater, good results are generally obtained with m ≈ 0.02n. For smaller values of n, m values in the 0.05n range appear to work well. A version of SIMDAT in the S language, written by E. Neely Atkinson, is available under the name “gendat” from the S Library available through email from Bell Labs. The new edition of the IMSL Library will contain a SIMDAT subroutine entitled “RNDAT” (beta versions may be obtained free by writing IMSL).
In this paper, we have observed the usefulness of noting what we really seek
rather than using graphically displayed nonparametric density estimators as
our end product. The cases where we really must worry about obtaining the
graphical representation of a density are, happily, rare. But, as with the case
of SIMDAT, the context of nonparametric density estimation is very useful in
many problems where the explicit graphical representation of a nonparametric
density estimator is not required.
References
NONPARAMETRIC RANK
ESTIMATION USING BOOTSTRAP
RESAMPLING AND CANONICAL
CORRELATION ANALYSIS
Xin M. Tu, Harvard School of Public Health;
D.S. Burdick and B.C. Mitchell, Duke University *
Abstract
Canonical correlation analysis has proven to be remarkably successful as an
alternative to the eigenvalue approach in rank estimation, a problem that has challenged analytical chemists for more than a decade. A methodological
advance of this new approach is that it focuses on the difference in structure
rather than in magnitude in characterizing the difference between the signal
and the noise. This structural difference is quantified through the analysis of
canonical correlation, which is a well established data reduction technique in
multivariate statistics. Unfortunately, there is a price to be paid for having this
structural difference: at least two replicate data matrices are needed to carry
out the analysis.
In this paper, we propose a bootstrap resampling method to extend the
canonical correlation analysis to a single data matrix. Such a procedure not
only removes the requirement for replicate data matrices but also leads to a robust
estimator for the rank. With the percentile method, statistical inference about
the rank can proceed without any distributional assumption about the noise.
This “distribution-free” feature is especially desirable and useful in practice
since it frees us from hinging results on some crucial assumptions about the
random noise which are often difficult to justify and may even be erroneous.
The procedure is illustrated with real as well as simulated mixture samples.
*X.M. Tu, Dept. of Biostat., Harvard School of Public Health, 677 Huntington Ave.,
Boston, MA 02115; D.S. Burdick and B.C. Mitchell, Institute of Stat. and Decision Sciences,
Duke University, Durham, NC 27706
INTRODUCTION
One of the problems which has challenged chemists in analytical chemistry for more than a decade is to determine the number of components in a multicomponent mixture sample. Often the sample data are expressed in the form of a matrix whose rank, in the absence of noise, is equal to the number of components. However, the presence of noise in the data generally causes the rank to exceed the number of components in the mixture.
Most methods which have been proposed to estimate the rank in the presence of noise rely, in essence, on the information summarized by the eigenvalues from the singular value decomposition of the underlying matrix (Rossi and Warner 1986; Malinowski 1990; Tway, Cline Love and Woodruff 1980; Wold 1978). Even though it is difficult to evaluate the amount of information which is lost due to this type of data summary (or reduction), it is not hard to convince oneself, at least heuristically, that some is lost: this summary ignores the information contained in the eigenvectors, which may well be more informative than the eigenvalues. A different approach which also incorporates the information in the eigenvectors was therefore introduced recently as an alternative to the eigenvalue approach (Tu
et al. 1989). A methodological turning point of this new approach is that it fo-
cuses on the difference in structure rather than in magnitude in characterizing
the difference between the signal and the noise. This structural difference is
then quantified through the analysis of canonical correlation, a well established
data reduction methodology in multivariate statistics. Unfortunately, this new
approach can only be applied to situations where replicate data matrices are
available.
The objective of this paper is to continue to explore the potential and to
extend the scope of this new approach. In particular, we propose a bootstrap
resampling procedure to extend this new methodology to a single data matrix.
An important feature of the data matrix arising from such chemical experiments is the structure imposed on it by the signal. Such a signal structure plays a crucial role in rank estimation. As a consequence, a standard bootstrap resampling procedure, which relies on a random sample drawn with replacement from the elements of the data matrix, is not readily applicable, since it would destroy this signal structure, which depends on both the ordered rows and the ordered columns. We therefore discuss a variant of bootstrap resampling that circumvents this problem when resampling the observed data matrix.
In the following discussion, we use the Excitation-Emission Matrix (Warner 1982; Warner, Neal and Rossi 1985) as a vehicle for exemplifying the proposed procedure.
RANK ESTIMATION BY
CANONICAL CORRELATION
In this section, we briefly review the procedure RECCAMP (Rank Estimation
by Canonical Correlation Analysis of Matrix Pairs) for analyzing replicate data
matrices. A detailed discussion of RECCAMP can be found in Tu et al.
(1989).
Let the I by J matrix S be the EEM for an R-component mixture in the absence of noise. Then S can be expressed as

S = XYᵀ,   (1)

where X = (x_1, ..., x_R) is an I by R matrix whose r-th column is the excitation vector for the r-th component and Y = (y_1, ..., y_R) is a J by R matrix whose r-th column is the emission vector for the r-th component.
The columns of X as well as those of Y will be assumed to be linearly inde-
pendent. Under this assumption, the rank of S will be equal to R, which is
the number of components in the mixture. In the absence of noise, the rank
can be determined by the number of non-zero singular values from the singular
value decomposition (SVD) of S.
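As a small numerical illustration of (1), and of the way measurement noise (taken up next) inflates the rank, the following Python fragment uses arbitrary sizes, component profiles, and noise level chosen purely for the example:

import numpy as np

rng = np.random.default_rng(0)
I, J, R = 30, 40, 3                           # hypothetical EEM dimensions and rank
X = rng.random((I, R))                        # excitation profiles as columns
Y = rng.random((J, R))                        # emission profiles as columns
S = X @ Y.T                                   # noise-free EEM, equation (1)
M = S + 0.01 * rng.standard_normal((I, J))    # a measured EEM with additive noise

print(np.linalg.matrix_rank(S))               # 3: the number of non-zero singular values
print(np.linalg.svd(M, compute_uv=False)[:6]) # all positive once noise is added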
In practice, however, since the measured EEM also contains a noise term
N, i.e.,
M=S+N (2)
the number of non-zero singular values will exceed the number of components
because the random noise term N generally increases the rank. The eigenvalue
approach for correcting this is to watch for a “large drop” between the singular
values, which seems very reasonable since one would expect a relatively small
singular value contributed from the low-magnitude noise. However, viewed from a statistical point of view, the singular values corresponding to the noise serve as a summary statistic for the noise (indeed, they are what is used in making inference about the rank under the eigenvalue approach), and it is not difficult to see that the random nature of the noise may not be well summarized by this statistic, since it clearly ignores any information contained in the eigenvectors. An alternative to this eigenvalue
approach was therefore introduced, hoping to provide a better statistic to sum-
marize the information about the random noise. A methodological advance in
this new approach is that it focuses on the random nature of the noise, which
is manifested in a Euclidean subspace, and utilizes the analysis of canonical
correlation, a well established data reduction method in multivariate statis-
tics, to provide a statistic which fully reflects this random nature. Below, we
briefly describe this approach.
Let M_1 and M_2 be two replicate EEM's. Then these matrices can be expressed as

M_1 = S + N_1,
M_2 = S + N_2.

In the absence of noise, M_1 and M_2 will have a common column space, which we denote by Col(S), and a common row space, which we denote by Row(S), both of dimension R. In the presence of noise, let

M_1 = U_1 D_1 V_1ᵀ,
M_2 = U_2 D_2 V_2ᵀ

be the singular value decompositions of the two replicates.
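Here is a minimal sketch, in Python, of how canonical correlations between the leading column spaces of two replicate EEMs can be obtained from the SVDs just written down; the retained dimension q, the function name, and everything else about the fragment are assumptions for illustration rather than the chapter's exact prescription.

import numpy as np

def replicate_canonical_correlations(M1, M2, q):
    # Canonical correlations (cosines of the principal angles) between the
    # subspaces spanned by the leading q left singular vectors of M1 and M2.
    U1 = np.linalg.svd(M1)[0][:, :q]
    U2 = np.linalg.svd(M2)[0][:, :q]
    return np.linalg.svd(U1.T @ U2, compute_uv=False)

Coefficients near 1 indicate directions shared by the two replicates (signal), while coefficients well below 1 indicate noise-dominated directions; this is the structural contrast the analysis exploits.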
the bootstrap sample, we can calculate the confidence interval for the median using the percentile method (Efron 1982) without any assumption for the underlying distribution and without recourse to Monte Carlo simulation. Using the percentile method, the 1 − 2α confidence interval is expressed in the form

[ρ̂*_(t_1), ρ̂*_(t_2)],

where ρ̂*_(1) ≤ ... ≤ ρ̂*_(B) denote the ordered values in the bootstrap sample. Here the integers t_1 and t_2 are the smallest integers which satisfy

F(t_1) ≥ α

and

F(t_2) ≥ 1 − α.
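One standard way to read the display above, sketched here in Python, is to take the t_1-th and t_2-th order statistics of the B bootstrap values as the interval endpoints; treating F as the Binomial(B, 1/2) distribution function, which underlies the usual distribution-free interval for a median, is our assumption for the sketch, and the chapter's own definition of F governs.

import numpy as np
from scipy import stats

def median_percentile_ci(boot_values, alpha=0.05):
    # Distribution-free interval for the median of a bootstrap sample:
    # endpoints are the t1-th and t2-th order statistics, where t1 and t2 are
    # the smallest integers with F(t1) >= alpha and F(t2) >= 1 - alpha, and F
    # is assumed here to be the Binomial(B, 1/2) distribution function.
    x = np.sort(np.asarray(boot_values, dtype=float))
    B = len(x)
    F = stats.binom.cdf(np.arange(1, B + 1), B, 0.5)
    t1 = int(np.searchsorted(F, alpha)) + 1       # smallest t with F(t) >= alpha
    t2 = int(np.searchsorted(F, 1 - alpha)) + 1   # smallest t with F(t) >= 1 - alpha
    return x[t1 - 1], x[t2 - 1]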
Column Resampling
Row Resampling
Table I: Percentiles and confidence intervals (C. I.) from the distributions of the
bootstrap samples of canonical correlation coefficients for the mixture ANT and
DPA.
Table II: Percentiles and confidence intervals (C. I.) from the distributions of
the bootstrap samples of canonical correlation coefficients corresponding to the
process of resampling the columns for the simulated mixture.
As for the simulated mixture sample, our major reason to study this sample
is to see whether reduction in sample data would have any significant impact
on the canonical correlation coefficients. In statistical inference, reduction in
data reduces efficiency and may lead to bias in estimation. In our context,
both could have influence on the canonical correlation coefficients on which
our inference about the rank is based. It may be difficult, if not impossible, to
theoretically characterize this impact. However, with this simulated mixture
sample, we can at least make some empirical observation about this effect.
The percentiles and confidence intervals for the two critical intensity levels
are shown in Table 2. Note that reported in Table 2 are the results from
the process of resampling the columns since the results from resampling the
rows are similar and therefore are not reported. At c = 4, the drop for the
percentiles occurs between the fourth and fifth coefficients. The lower limits
of both confidence intervals for the median of the fourth canonical correlation
coefficient are way above 0.5. On the other hand, even the upper limits of the
confidence intervals for the median of the fifth canonical correlation coefficient
fall below 0.5. Hence, had we not known the composition of this mixture,
we would have correctly estimated the rank at this low signal level without
difficulty. The percentiles and confidence intervals for c = 3, however, show
no sign of the four-component nature of the mixture sample. As a matter
of fact, the rank would be estimated to be three if based on the confidence
interval, which coincides with the estimate from the analysis of the RECCAMP
using two replicate EEM’s in this case (Tu et al. 1989). So at least with this
simulated data, it seems that there is no loss of efficiency in inference, which
may not be surprising since the information of the noise distribution (normal
distribution in this case) may very well be summarized by the 2500 elements
in one EEM. In this example, it is quite likely that one of the components is
overwhelmed by the noise when the signal level drops down to c = 3.
Plotted in Figure 3 are the distributions of some of the bootstrap samples.
The drop between the fourth and fifth canonical correlation coefficients for c = 4 can also be seen from the striking difference in skewness between the distributions shown in (a) and (b). However, for c = 3, it is difficult to tell from the shape of the histogram plotted in (c) whether the coefficients correspond to the signal, even though the C.I. in this case does cover the point 0.5 as
discussed earlier.
In summary, we have proposed a bootstrap resampling procedure to extend
the canonical correlation technique to the analysis of a single data matrix.
Also, since a robust estimator is introduced for inference, the procedure may
be applied to a wide range of data without any restriction on the noise distri-
bution. On the other hand, the procedure may be computationally expensive
compared to the eigenvalue approach, especially in case of low signal-to-noise
ratio. Our experience shows that if the signal-to-noise ratio is high as in the
[Figure 3. Histograms of bootstrap samples of canonical correlation coefficients; panels (a)-(d).]
real mixture sample in our example, a sample size as small as 50 will suffice to
make the correct inference. As a guide for general practice, we may start the
procedure with a small bootstrap sample and then increase the sample size if necessary to improve efficiency for the bootstrap estimate of the confidence interval. However, as discussed earlier, there is no point in increasing the sample size beyond 1,000.
The procedure may also be used in conjunction with other procedures, such as those from the eigenvalue approach, not only to reduce the amount of computation but also to make valid statistical inference. For example, an initial analysis using the eigenvalue approach may help narrow the search for the rank to a small range of possible values. The bootstrap resampling may then be
employed to quantify our uncertainty in terms of a valid statistical test.
References
[1] B. Efron, The Jackknife, the Bootstrap, and Other Resampling Plans,
SIAM, monograph # 38, CBMS-NSF (1982).
[5] X.M. Tu, D.S. Burdick, D.W. Millican and L.B. McGown, Anal. Chem.
61, 2219 (1989).
[6] P.C. Tway, L.J. Cline Love and H.B. Woodruff, Anal. Chim. Acta 117,
45 (1980).
[7] I.M. Warner, in Contemporary Topics in Analytical and Clinical Chem-
istry, Vol. 4, D.M. Hercules, G.M. Hieftje, L.R. Snyder and M.A. Evenson
Eds., Plenum Press, New York (1982).
[8] I.M. Warner, S.L. Neal and T.M. Rossi, J. of Research of the National
Bureau of Standards, 90, 487 (1985).
‘Bootstrap methods (Continued) nonstationary data, 195-197
consistency, asymptotic, 7 second order correctness, 185, 187,
convergence, 7, 329 197
and differentiable functionals, 36-44 naive, 8
as Dirac-comb density estimators, in nonparametric rank estimation,
397-402 408-418
problems with, 399 for non-tested hypotheses, 390-393, 395
double-dip-, 9 “operational” bootstrapping, 309-318
Dudley’s limit theorem, 36-37 bivariate convolution, 313-318
for Edgeworth correction, 8-9. See pivoting, 224
also Edgeworth expansions for prediction band construction,
and empirical processes, 17-22 272-274
central limit theorem, 17-18 quantile estimation, 141-142
and estimators, see Estimators U-quantile estimation, 146-154
and M-estimators, 25-36. See also versus random weighting
M-Estimators approximation, 281
central limit theorem, 30-32, 35-36 for ratio statistic distribution, 276-277
pr-bootstrap, 42-44 resampling, 106-107, 251
in exploratory regression (model ~ antithetic, 135-137
building): balanced, 133-135
compared to conventional methods, circular block, 266-269
371-386 i.i.d., 77, 81, 183-184
MSEP estimators, 367-371 importance, 137-141
pseudomodel choice, 386 improved approaches, 399-402
frequentist, 321-322 nonrandom, 309-318
generalized, 319-321 uniform, 128-129, 130
inconsistency of, 227 vector, 106, 121
versus jackknife and delta methods, and sample mean, see Sample mean,
103-108 bootstrapping
kernel density estimator bandwidth in segregation analysis, 393, 396
selection, 249-250 sign bootstrapping, 217-224
by asymptotic analysis, 254-256 conditional distributions, 218-221
comparison to other methods, SIMDAT algorithm, 400-402
258-261 smoothed, 252-253
by MISE estimation, 251-253 and spatial medians, 32-33
in likelihood ratio simulation, 390-393, and standard error estimate, 5-6
395 comparison to delta and jackknife
limit distribution, 7 methods, 103-108
and long-tailed errors, 9, 215-217 sign bootstrapping, 217-224
and Markov chains, see Markov chains for Stein-rule estimators, 325
meshing, 311-318 comparison with jackknife method,
Monte Carlo simulation in, see Monte 335-340
Carlo simulations Bootstrap samples, 106
moving block, 185-187, 228-231, Bootstrap t methods, 8-9
238-245, 264 confidence intervals, 67-68
applied to sample mean, 187-191 first order efficiency, 74
applied to studentized mean, parametric versus nonparametric,
192-195 74-75
choice of block size, 246-247 second order correctness/
Edgeworth expansions, 187-188 equivalence, 68-74
second order optimality/robustness, D
74-75
undershoot, 74 Decomposition, singular value, 407
as pivotal quantity, 118-120 Decrement-life-table functions, 358
P-Brownian bridge, 16 Delta method, 103-108
Brownian motion: Deterministic hazards, 346-348
convergence to, 162, 164-165, 170, 172, Dirac-comb density estimator, 397-399
173 Distributions, maximum likelihood
estimated, 99-103
P-Donsker function classes, 17, 18, 20
C Donsker’s convergence theorem, 16, 17
Dudley’s limit theorem, 36-37
Canonical correlation analysis, 407-408 Dudley’s representation theorem, 20-21
Centring method, 131-133 Dvoretzky-Kiefer-Wolfowitz inequality,
Compensators, of counting processes, 242, 243
349
Competing risks theory, 355-361
Conditional bootstrap models, 323-325 E
Conditional distributions, sign
bootstrapping, 218-221 Edgeworth expansion, 8-9
Condition vector, 220 in bootstrapping Markov chains, 63,
Confidence intervals, 8-9, 101-103, 216, 80-81
223 of functional statistics, 287-291
Barndorff-Nielsen, 115 from moving block bootstrap, 187-188
choice of pivotal quantities, 116-120 Empirical influence function, 104
correctness versus accuracy, 112-116 Empirical measure, 16
defined, 338-339 Empirical processes, 15-17
Fieller’s construction, 113-114 almost sure behavior, 22-23
non-central chi-square, 114-115 bootstrap method applied for, 17-22
percentile, 102 central limit theorem, 17-18
standard, 102-103 Errors, standard, see Standard errors
Consistency, of bootstrap method, 7 Estimators, 3-4
Convergence, of bootstrap method, 7 cross-validation, 108-112
Convolutions, of distributions, 309-318 Dirac-comb density, 397-399
Correlation analysis, canonical, 407-408 generalized bootstrap for, 319-321
Correlation coefficient, Pearson, Hodges-Lehmann, bootstrapping, 154
116-117 jackknife methods for, 329
Cortical cells, clustering of, 274-277 kernel density, bandwidth, see Kernel
Countable state Markov chains, see density estimator bandwidth
Markov chains, countable state linear, 330-331
Counting processes, 348 martingale method for determining,
compensators of, 349 351-361
Covariates, in hazard processes, 355-361 maximum likelihood, 99-103
Cramer’s condition, 189 underestimation, 100-101
Cross-validation, 331 in multiple regression, bootstrapping,
estimators, 108-112 328-329
least squares, bandwidth selection, nonparametric density, 399-402
253, 256-257 orthogonal series, 249
Crude hazard probabilities, 357 spline, 249
Crude hazard rate functions, 357 spread, bootstrapping, 154
Estimators (Continued) G
standard error of, see Standard error
Stein-rule, 329-335 Gaussian processes, 16-17
bootstrapping, 325 sample continuous, 17
jackknife versus bootstrap methods, Genetics, quantitative, bootstrapping in,
335-338 389-393, 395-396
M-Estimators: Glivenko-Cantelli theorem, 352
approximation by pseudo jackknife
method, 300-303
bootstrap of, 25-36 H
central limit theorem, 30-32, 35-36
definition, 24-25, 29-30 Hadamard differentiability, 282
pr-bootstrap of, 42-44 Hajek-Le Cam convolution theorem, 72
Excitation-emission matrix, 406, 407, Hardy-Weinberg equilibrium, 389
408-409 Hazard probabilities, 357
Exercise output prediction, 272-274 Hazard processes, 346-350
decrement-life-table functions, 358
homogeneity, chi-square test, 355
F increment-decrement-life-table
functions, 358
Fast Fourier transforms, 217 latent life, 355
Fieller’s construction, 113-114 Martingale method for, 352-361
Filtration, 347 asymptotic estimate distribution,
Finite state Markov chains, see Markov 352-353
chains, finite state censoring, competing risks,
First order properties, 66, 74 covariates, 355-361
Fisher's linear discriminant, 109, 110
Fréchet differentiability, 36, 234, 282-283
second order, 283 probabilities, 357
Frequentist models, 320 rate functions, 357
bootstrap, 321-322 stochastic hazards, 348
Functionals: History, 347
differentiable, 20-22 internal, 347
bootstrap method and, 36-44 Hitting time distribution, 49, 51, 54-55,
Fréchet differentiability, 36, 234,
282-283 Hodges-Lehmann location estimator,
variance, 100 bootstrapping, 154
Functional statistics: Hoeffding expansion, 73
distribution approximation, 279-281, Homogeneity, chi-square test, 355
283-287
Edgeworth expansions, 287-291
Functions, classes of: I
measurable, 16
P-Donsker, 17, 18, 20 Importance resampling, 137-141
a.s.-bootstrap, 18 Increment-decrement-life-table
pr-bootstrap, 18 functions, 358
P-Glivenko-Cantelli, 29 Infinitesimal jackknife method, 104
uniformly pregaussian, 19 Internal history, 347
VC (Vapnik-Chervonenkis)-subgraph, 22
Functions, perfect, 20 bootstrapping, 161-172
J Markov chains, countable state:
bootstrapping, 52-61
Jackknife methods, 4-5 hitting time distribution estimation,
versus bootstrap and delta methods, 54-55
103-108 t sampling distribution estimation,
delete-d, 228 52-53
delete-one, 329 transition probability matrix
inconsistency of, 227 estimation, 57-58
infinitesimal, 104, 105 Markov chains, finite state:
moving block, 227-228, 231-238 bootstrapping, 50-52, 77-79, 81-87
pseudo, 279-281, 283-287, 297-303, 329 accuracy of, 62-63
for Stein-rule estimators, 335-338 asymptotic accuracy, 79-81
Jaeckel’s infinitesimal jackknife method, Edgeworth expansion, 80-81
hitting time distribution estimation,
49, 51, 54-55, 77
transition probability matrix
K estimation, 49, 50-51, 57-58, 77
Martingale method, for deterministic
Kernel density estimator bandwidth, functions, 351-361
249-250 Martingales, 164, 165, 166-167, 349, 350
comparison of selection methods, sequences of, central limit theorem,
258-261 351
selection by bootstrapping, 251-256 Martingale transform theorem, 351
selection by cross-validation, 256-257 Maximum likelihood estimated
Künsch's moving block bootstrap
method, see Moving block bootstrap Mean, standard error of, 3-4
method Mean integrated square error (MISE),
250
bootstrap estimation, 251-253
L Mean square error of prediction (MSEP):
conditional, 365
Latent life, 355 estimators of:
Least squares cross-validation, bandwidth bootstrap, 367-371
selection, 253, 256-257 comparison of methods, 371-386
Likelihood ratios, bootstrapping, conventional, 365-367
390-393, 395 unconditional, 365
Limit distribution, of bootstrap method, 7 Medians, spatial:
Lindeberg central limit theorem, 284 bootstrapping, 32-33
Lindeberg-Levy central limit theorem, Meshing, 311-318
157 Minkowski inequality, 241
Linear approximation, 129-131 MISE, see Mean integrated square error
Linear discriminant, Fisher's, 109, 110
Linear estimators, 330-331 Model building, see Regression
Linear programming, 311 analysis, exploratory
Linear regression analysis, 221, 224, 245 Monte Carlo simulations, 106, 107, 123,
127-140, 328
antithetic resampling, 135-137
M balanced resampling, 133-135
centring method, 131-133
Marked point processes, 356 importance resampling, 137-141
Monte Carlo simulations (Continued) Polynomial approximation, 131
linear approximation, 129-131 Predictable variation processes, 350
polynomial approximation, 131 Prediction bands, estimation by
uniform resampling, 128-129, 130 bootstrapping, 272-274
Mortality ratio, standardized, 354 Prediction errors, 109-112
Moving block bootstrap method, Propagation of errors formula, 103-108
185-187, 228-231, 238-245
applied to sample mean, 187-191
applied to studentized mean, 192-195 Q
choice of block size, 246-247
Edgeworth expansions, 187-188 U-Quantiles:
nonstationary data, 195-197 estimation, 141-142
second order correctness, 185, 187, 197 by bootstrapping, 146-154
Moving block jackknife method, 227-228, by pseudo jackknife method,
231-238 297-300
MSEP, see Mean square error of Quantitative genetics, bootstrapping in,
prediction 389-393, 395-396
Multinomial models, bootstrapping, Quantitative segregation analysis,
276-277 392-393
Multiple regression, bootstrapping in,
328-329
R
N Random weighting approximation,
280-281, 284
Net hazard probabilities, 357 second order accuracy, 287-297
Net hazard rate functions, 357 Rank estimation, nonparametric, 407-408
Nonparametric estimation, 397-402 bootstrap resampling, 408-418
choice of pivotal quantity, 116-120 Ratio statistic distribution, bootstrapping,
Nonparametric rank estimation, 407-408 276-277
bootstrap resampling, 408-418 ~ RECCAMP (Rank estimation), 407-408
Regression analysis, exploratory, 363-367
MSEP estimator determination:
O bootstrap approach, 367-371
comparison of methods, 371-386
Order statistics, “operational” conventional approach, 365-367
bootstrapping, 309-318 Regression analysis, linear, 221, 224,
Orthogonal series estimators, 249 245
Ottaviani’s inequality, 23 Regression analysis, multiple, 328-329
Resampling, 106-107, 251
antithetic, 135-137
P balanced, 133-135
circular block, 266-269
Paasche index, 354 i.i.d., 77, 81
Pearson correlation coefficient, 116-117 problems with, 183-184
Percentile confidence intervals, 102 importance, 137-141
Pivotal quantities, choice of, 116-120 improved approaches, 399-402
Pivot method, see Bootstrap t methods nonrandom, 309-318
Point processes, 348 uniform, 128-129, 130
Poisson processes, 221-222 vector, 106, 121
Risk function, minimization, 330-331, distribution approximation, 279-281,
333-335 283-287
Edgeworth expansions, 287-291
S
Statistics, order, “operational”
bootstrapping, 309-318
Stein-rule estimators, 329-335
Sample mean, bootstrapping: bootstrapping, 325
convergence to Brownian motion, 162, jackknife versus bootstrap methods,
164-165, 170, 172, 173 335-340
invariance principle, 161-172 Stieltjes integral, 347
limiting distribution: Stochastic differential, 348
finite variance, 173, 174-178 Stochastic hazards, 348
infinite variance, 172, 173-178 Stochastic integral representation,
“operational” bootstrapping, 309-318 sample mean bootstrapping,
partial sum process, 161, 163 160-161
stochastic integral representation, Studentization, 5
Student's t, 5
Second order properties, 66, 68-77 bootstrap in estimating, 8-9
of “moving block” bootstrap, 185, 195, Survival analysis, 345-346
197 hazard processes for, see Hazard
Segregation analysis, 392-393 processes
Shrinkage estimators, see Stein-rule
estimators
Signs, of errors, bootstrapping, 217-224 ii
SIMDAT algorithm, 400-402
Singular value decomposition, 407 Taylor series method, 103-108
Skorohod topology, 161-162, 163, 170 Time dependence, models of, 226,
Spatial medians, bootstrapping of, 245-246
32-33 Training sets, 108-109
Spline estimators, 249 Transition probability matrix, 49, 50-51,
Spread estimators, bootstrapping, 154 57-58, 77
Square root trick inequality, 26 True error, 110
Standard confidence intervals, Tukey’s jackknife methods, see Jackknife
102-103 methods
Standard errors, 3-4
bootstrap estimate, 5-6
by signs, 217-224 U
comparison of delta/jackknife/
bootstrap methods, 103-108 Undershoot, of t confidence intervals,
jackknife estimate, 4-5 74
Standardized mortality ratio, 354 Uniform resampling, 128-129, 130
Standard normal distribution, 173
Statistical differentials, method of,
103-108 Vv
k-Statistics, 101
t-Statistics: Vapnik-Cérvonenkis property, 22
bootstrapping, see Bootstrap t methods Variance, 3
as pivotal quantities, 117-120 Variance functionals, 100
U-Statistics, see U-Quantiles von Mises expansion, 73
Statistics, functional: von Mises theorem, 311, 315-316
W . moving block bootstrap for, 228-231,
238-245, 264
Weak dependence: moving block jackknife for, 227-228,
circular block-resampling bootstrap 231-238
for, 266-269 Wiener processes, 351
Applied Probability and Statistics (Continued)
PANKRATZ · Forecasting with Univariate Box-Jenkins Models: Concepts and Cases
RACHEV · Probability Metrics and the Stability of Stochastic Models
RENYI · A Diary on Information Theory
RIPLEY · Spatial Statistics
RIPLEY · Stochastic Simulation
ROSS · Introduction to Probability and Statistics for Engineers and Scientists
ROUSSEEUW and LEROY · Robust Regression and Outlier Detection
RUBIN · Multiple Imputation for Nonresponse in Surveys
RUBINSTEIN · Monte Carlo Optimization, Simulations, and Sensitivity of Queueing Networks
RYAN · Statistical Methods for Quality Improvement
SCHUSS · Theory and Applications of Stochastic Differential Equations
SEARLE · Linear Models
SEARLE · Linear Models for Unbalanced Data
SEARLE · Matrix Algebra Useful for Statistics
SEARLE, CASELLA, and McCULLOCH · Variance Components
SKINNER, HOLT, and SMITH · Analysis of Complex Surveys
STOYAN · Comparison Methods for Queues and Other Stochastic Models
STOYAN, KENDALL, and MECKE · Stochastic Geometry and Its Applications
THOMPSON · Empirical Model Building
TIERNEY · LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics
TIJMS · Stochastic Modeling and Analysis: A Computational Approach
TITTERINGTON, SMITH, and MAKOV · Statistical Analysis of Finite Mixture Distributions
UPTON and FINGLETON · Spatial Data Analysis by Example, Volume I: Point Pattern and Quantitative Data
UPTON and FINGLETON · Spatial Data Analysis by Example, Volume II: Categorical and Directional Data
VAN RIJCKEVORSEL and DE LEEUW · Component and Correspondence Analysis
WEISBERG · Applied Linear Regression, Second Edition
WHITTLE · Optimization Over Time: Dynamic Programming and Stochastic Control, Volume I and Volume II
WHITTLE · Systems in Stochastic Equilibrium
WONNACOTT and WONNACOTT · Econometrics, Second Edition
WONNACOTT and WONNACOTT · Introductory Statistics, Fifth Edition
WONNACOTT and WONNACOTT · Introductory Statistics for Business and Economics, Fourth Edition
WOOLSON · Statistical Methods for the Analysis of Biomedical Data
About the editors
RAOUL LEPAGE is Professor of Statistics and Probability at Michigan State University and Chairman of the Board and Chief Executive Officer of the Interface Foundation of North America. Dr. LePage earned a BS in mathematics and an MS in statistics and probability at Michigan State University, and a PhD in statistics from the University of Minnesota. He has served as a consultant and given courses on simulation methods, bootstrap, and time series analysis for government agencies and private industry. He is active in the study of stable random processes where his series representation and Gaussian-slicing methods are widely referenced. His current work is on conditional resampling methods designed to cope with long-tailed errors in observations, and also problems having applications in finance and control which involve optimal methods for driving a random process to achieve its most rapid growth.
LYNNE BILLARD is Professor of Statistics at the University of Georgia. She was formerly head of the statistics and computer science department at that university and has held faculty positions and visiting positions at a number of U.S. universities as well as in England and Canada. Her current research interests include time series, sequential analysis, stochastic processes, and AIDS. She is a fellow of the American Statistical Association and the Institute of Mathematical Statistics and an elected member of the International Statistical Institute. She has held many professional offices, including President of the Biometric Society, Eastern North American Region, Program Secretary of the Institute of Mathematical Statistics, and Associate Editor with the Journal of the American Statistical Association. She is currently a member of the International Council of the Biometric Society and the Council of the International Statistical Institute. She received a BS in mathematics and statistics and a PhD in statistics from the University of New South Wales, Australia.
Of related interest...
ROBUST ESTIMATION AND TESTING
Robert G. Staudte and Simon J. Sheather
Designed to provide students with practical methods for carrying out robust procedures in a variety of statistical contexts, this practical guide: reviews some traditional finite sample methods of analyzing estimators of a scale parameter; introduces the main tools of robust statistics; examines the important joint location-scale estimation problem; and comprehensively treats the linear regression model. The text also includes an appendix containing a large number of Minitab macros which calculate robust estimators and their standard errors.
1990 (0 471-85547-2) 376 pp.
CONFIGURAL POLYSAMPLING
A Route to Practical Robustness
Edited by Stephan Morgenthaler and John W. Tukey
Beginning with the historical background of robustness, this unique text introduces the essential ideas for a small sample approach to robust statistics. Using models for distributional shape which contain at least two alternative distributions, the book treats in great detail inference based on such models for the location and scale case. Point estimation and interval estimation are examined in depth, as is classical material concerning Pitman's optimal invariant estimators. Configural Polysampling's blend of theoretical and empirical facts makes it especially useful to both novices and experienced users' understanding of traditional robust methods.
1991 (0 471-52372-0) 248 pp.
LISP-STAT
An Object-Oriented Environment for Statistical Computing and Dynamic Graphics
Luke Tierney
This hands-on guide shows users how to carry out basic and extensive programming projects within the context of the Lisp-Stat system. It also clearly demonstrates how to use functional and object-oriented programming styles in statistical computing to develop new numerical and graphical methods. Using the Lisp system for statistical calculations and graphs, the book addresses computations ranging from summary statistics through fitting linear and nonlinear regression models to general maximum likelihood estimation and approximate Bayesian computations.
1990 (0 471-50916-7) 416 pp.