Lecture Notes
Philippe Rigollet
Spring 2015
Preface
These lecture notes were written for the course 18.S997: High Dimensional
Statistics at MIT. They build on a set of notes that was prepared at Princeton
University in 2013-14.
Over the past decade, statistics has undergone drastic changes with the development of high-dimensional statistical inference. Indeed, more and more features are measured on each individual, to the point that their number usually far exceeds the number of observations. This is the case in biology, and specifically in genetics, where millions of genes (or combinations of genes) are measured for a single individual. High-resolution imaging, finance, online advertising, climate studies... the list of data-intensive fields is too long to be established exhaustively. Clearly not all measured features are relevant for a given task and most of them are simply noise. But which ones? What can be done with so little data and so much noise? Surprisingly, the situation is not that bad, and on some simple models we can assess the extent to which meaningful statistical methods can be applied. Regression is one such simple model.
Regression analysis can be traced back to 1632 when Galileo Galilei used a procedure to infer a linear relationship from noisy data. It was not until the early 19th century that Gauss and Legendre developed a systematic procedure: the least-squares method. Since then, regression has been studied in so many forms that much insight has been gained, and recent advances in high-dimensional statistics would not have been possible without standing on the shoulders of giants. In these notes, we will explore one, obviously subjective, giant on whose shoulders high-dimensional statistics stands: nonparametric statistics.
The works of Ibragimov and Has'minskii in the seventies, followed by many researchers from the Russian school, have contributed to developing a large toolkit for understanding regression with an infinite number of parameters. Much insight can be gained from this work to understand high-dimensional or sparse regression, and it comes as no surprise that Donoho and Johnstone made the first contributions on this topic in the early nineties.
Therefore, while not obviously connected to high dimensional statistics, we
will talk about nonparametric estimation. I borrowed this disclaimer (and the
template) from my colleague Ramon van Handel. It does apply here.
I have no illusions about the state of these notes—they were written
rather quickly, sometimes at the rate of a chapter a week. I have
no doubt that many errors remain in the text; at the very least
many of the proofs are extremely compact, and should be made a
little clearer as is befitting of a pedagogical (?) treatment. If I have
another opportunity to teach such a course, I will go over the notes
again in detail and attempt the necessary modifications. For the
time being, however, the notes are available as-is.
Like any good set of notes, they should be perpetually improved and updated, but a two- or three-year horizon is more realistic. Therefore, if you have any comments, questions, or suggestions, or if you find omissions or, of course, mistakes, please let me know. I can be contacted by e-mail at [email protected].
Acknowledgements. These notes were improved thanks to the careful read-
ing and comments of Mark Cerenzia, Youssef El Moujahid, Georgina Hall,
Jan-Christian Hütter, Gautam Kamath, Kevin Lin, Ali Makhdoumi, Yaroslav
Mukhin, Ludwig Schmidt, Vira Semenova, Yuyan Wang, Jonathan Weed and
Chiyuan Zhang.
These notes were written under the partial support of the National Science
Foundation, CAREER award DMS-1053987.
Required background. I assume that the reader has had basic courses in
probability and mathematical statistics. Some elementary background in anal-
ysis and measure theory is helpful but not required. Some basic notions of linear algebra, especially the spectral decomposition of matrices, are required for the later chapters.
Introduction
variable $Z$ small if $\mathbb{E}[Z^2] = (\mathbb{E} Z)^2 + \mathrm{var}[Z]$ is small. Indeed, in this case, the expectation of $Z$ is small and the fluctuations of $Z$ around this value are also small. The function $R(g) = \mathbb{E}[Y - g(X)]^2$ is called the $L_2$ risk of $g$; it is defined whenever $\mathbb{E} Y^2 < \infty$.

For any measurable function $g : \mathcal{X} \to \mathbb{R}$, the $L_2$ risk of $g$ can be decomposed as
$$
\mathbb{E}[Y - g(X)]^2 = \mathbb{E}[Y - f(X)]^2 + \|g - f\|_2^2 .
$$
Note that $\|h\|_2^2$ is the Hilbert norm associated to the inner product
$$
\langle h, h' \rangle_2 = \int_{\mathcal{X}} h h' \, \mathrm{d}P_X .
$$
When the reference measure is clear from the context, we will simply write $\|h\|_2 = \|h\|_{L_2(P_X)}$ and $\langle h, h' \rangle_2 := \langle h, h' \rangle_{L_2(P_X)}$.
It follows from the proof of the best prediction property above that
$$
R(\hat f_n) = \mathbb{E}[Y - f(X)]^2 + \|\hat f_n - f\|_2^2 .
$$
In particular, the prediction risk will always be at least equal to the positive constant $\mathbb{E}[Y - f(X)]^2$. Since we tend to prefer a measure of accuracy that can go to zero (as the sample size increases), it is equivalent to study the estimation error $\|\hat f_n - f\|_2^2$. Note that if $\hat f_n$ is random, then $\|\hat f_n - f\|_2^2$ and $R(\hat f_n)$ are random quantities and we need deterministic summaries to quantify their size. It is customary to use one of the two following options. Let $\{\phi_n\}_n$ be a sequence of positive numbers that tends to zero as $n$ goes to infinity.

1. Bounds in expectation. They are of the form
$$
\mathbb{E}\|\hat f_n - f\|_2^2 \le \phi_n ,
$$
This equality allowed us to consider only the part $\|\hat f_n - f\|_2^2$ as a measure of error. While this decomposition may not hold for other risk measures, it may be desirable to explore other distances (or pseudo-distances). This leads to two distinct ways to measure error: either by bounding a pseudo-distance $d(\hat f_n, f)$ (estimation error) or by bounding the risk $R(\hat f_n)$ for choices other than the $L_2$ risk. These two measures coincide up to the additive constant $\mathbb{E}[Y - f(X)]^2$ in the case described above. However, we show below that these two quantities may live independent lives. Bounding the estimation error is more customary in statistics, whereas risk bounds are preferred in learning theory.
Here is a list of choices for the pseudo-distance employed in the estimation
error.
• Pointwise error. Given a point x0 , the pointwise error measures only
the error at this point. It uses the pseudo-distance:
It turns out that this principle can be extended even if an optimization follows the substitution. Recall that the $L_2$ risk is defined by $R(g) = \mathbb{E}[Y - g(X)]^2$. See the expectation? Well, it can be replaced by an average to form the empirical risk of $g$, defined by
$$
R_n(g) = \frac{1}{n} \sum_{i=1}^n \big( Y_i - g(X_i) \big)^2 .
$$
We can now proceed to minimizing this risk. However, we have to be careful.
Indeed, Rn (g) ≥ 0 for all g. Therefore any function g such that Yi = g(Xi ) for
all i = 1, . . . , n is a minimizer of the empirical risk. Yet, it may not be the best
choice (Cf. Figure 1). To overcome this limitation, we need to leverage some
prior knowledge on f : either it may belong to a certain class G of functions (e.g.,
linear functions) or it is smooth (e.g., the L2 -norm of its second derivative is
2 ERM may also mean Empirical Risk Minimizer
Figure 1. It may not be the best idea to have $\hat f_n(X_i) = Y_i$ for all $i = 1, \ldots, n$.
small). In both cases, this extra knowledge can be incorporated into ERM using either a constraint:
$$
\min_{g \in \mathcal{G}} R_n(g) ,
$$
or a penalty:
$$
\min_{g} \Big\{ R_n(g) + \mathrm{pen}(g) \Big\} ,
$$
or both:
$$
\min_{g \in \mathcal{G}} \Big\{ R_n(g) + \mathrm{pen}(g) \Big\} .
$$
These schemes belong to the general idea of regularization. We will see many
variants of regularization throughout the course.
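To make the constrained and penalized ERM schemes above concrete, here is a small numerical sketch (my own illustration, not part of the notes; all variable names and constants are arbitrary). It fits a polynomial dictionary by plain ERM, which nearly interpolates the data, and by penalized ERM with a ridge-type penalty standing in for the generic $\mathrm{pen}(g)$.

```python
import numpy as np

# Minimal sketch (not from the notes): empirical risk minimization over
# polynomial functions g(x) = sum_j theta_j x^j, with and without a penalty.
rng = np.random.default_rng(0)
n, degree = 20, 15
x = np.sort(rng.uniform(0.0, 1.0, size=n))
f = lambda t: np.sin(2 * np.pi * t)          # unknown regression function
y = f(x) + 0.3 * rng.standard_normal(n)      # noisy observations

# Design matrix of monomial features (a simple dictionary of functions).
Phi = np.vander(x, degree + 1, increasing=True)

def empirical_risk(theta):
    """R_n(g) = (1/n) * sum_i (Y_i - g(X_i))^2."""
    return np.mean((y - Phi @ theta) ** 2)

# Unconstrained ERM: with this many parameters it nearly interpolates the data.
theta_erm, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Penalized ERM: min_theta { R_n(g) + pen(g) } with pen(g) = lam * |theta|_2^2
# (a ridge-type penalty standing in for the generic pen(g) of the text).
lam = 1e-3
theta_pen = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(degree + 1),
                            Phi.T @ y / n)

print("empirical risk (interpolating fit):", empirical_risk(theta_erm))
print("empirical risk (penalized fit):    ", empirical_risk(theta_pen))

# The interpolating fit has a smaller empirical risk, but its error on the
# true regression function is typically larger, as Figure 1 illustrates.
grid = np.linspace(0, 1, 200)
G = np.vander(grid, degree + 1, increasing=True)
print("error to f (interpolating):", np.mean((G @ theta_erm - f(grid)) ** 2))
print("error to f (penalized):    ", np.mean((G @ theta_pen - f(grid)) ** 2))
```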
Unlike traditional (low dimensional) statistics, computation plays a key role
in high-dimensional statistics. Indeed, what is the point of describing an esti-
mator with good prediction properties if it takes years to compute it on large
datasets? As a result of this observation, many modern estimators, such as the Lasso estimator for sparse linear regression, can be computed efficiently using simple tools from convex optimization. We will not describe such algorithms for this problem but will comment on the computability of estimators when relevant.

In particular, computational considerations have driven the field of compressed sensing, which is closely connected to the problem of sparse linear regression studied in these notes. We will only briefly mention some of the results and refer the interested reader to the book [FR13] for a comprehensive treatment.
Linear models
When $\mathcal{X} = \mathbb{R}^d$, an all-time favorite constraint $\mathcal{G}$ is the class of linear functions, that is, functions of the form $g(x) = x^\top \theta$, parametrized by $\theta \in \mathbb{R}^d$. Under this constraint, the estimator obtained by ERM is usually called the least squares estimator and is defined by $\hat f_n(x) = x^\top \hat\theta$, where
$$
\hat\theta \in \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (Y_i - X_i^\top \theta)^2 .
$$
Note that θ̂ may not be unique. In the case of a linear model, where we assume
that the regression function is of the form f (x) = x⊤ θ∗ for some unknown
θ∗ ∈ IRd , we will need assumptions to ensure identifiability if we want to prove
bounds on d(θ̂, θ∗ ) for some specific pseudo-distance d(· , ·). Nevertheless, in
other instances such as regression with fixed design, we can prove bounds on
the prediction error that are valid for any θ̂ in the argmin. In the latter case,
we will not even require that f satisfies the linear model but our bound will
be meaningful only if $f$ can be well approximated by a linear function. In this case, we talk about a misspecified model, i.e., we try to fit a linear model to data that may not come from a linear model. Since linear models can have good approximation properties, especially when the dimension $d$ is large, our hope is that the linear model is never too far from the truth.
In the case of a misspecified model, there is no hope to drive the estimation
error d(fˆn , f ) down to zero even with a sample size that tends to infinity.
Rather, we will pay a systematic approximation error. When $\mathcal{G}$ is a linear subspace as above, and the pseudo-distance is given by the squared $L_2$ norm $d(\hat f_n, f) = \|\hat f_n - f\|_2^2$, it follows from the Pythagorean theorem that
then it’s easy for the statistician to mimic it but it may be very far from the
true regression function; on the other hand, if the oracle is strong, then it is
harder to mimic but it is much closer to the truth.
Oracle inequalities were originally developed as analytic tools to prove adap-
tation of some nonparametric estimators. With the development of aggregation
[Nem00, Tsy03, Rig06] and high dimensional statistics [CT07, BRT09, RT11],
they have become important finite sample results that characterize the inter-
play between the important parameters of the problem.
In some favorable instances, that is when the Xi s enjoy specific properties,
it is even possible to estimate the vector θ accurately, as is done in parametric
statistics. The techniques employed for this goal will essentially be the same
as the ones employed to minimize the prediction risk. The extra assumptions on the $X_i$'s will then translate into interesting properties of $\hat\theta$ itself, including uniqueness, on top of the prediction properties of the function $\hat f_n(x) = x^\top \hat\theta$.
$$
\mathbb{E}\|\hat f_n - f\|_2^2 \le C\, \frac{d}{n} , \qquad (1)
$$
where $C > 0$ is a constant; in Chapter 5, we will show that this cannot be improved, apart perhaps from a smaller multiplicative constant. Clearly such a bound is uninformative if $d \gg n$ and actually, in view of its optimality, we can even conclude that the problem is too difficult statistically. However, the situation is not hopeless if we assume that the problem actually has fewer degrees of freedom than it seems. In particular, it is now standard to resort to the sparsity assumption to overcome this limitation.
A vector $\theta \in \mathbb{R}^d$ is said to be $k$-sparse for some $k \in \{0, \ldots, d\}$ if it has at most $k$ non-zero coordinates. The number of nonzero coordinates of $\theta$, denoted by $|\theta|_0$, is also known as the sparsity or "$\ell_0$-norm" of $\theta$, though it is clearly not a norm (see footnote 3). Formally, it is defined as
$$
|\theta|_0 = \sum_{j=1}^d \mathbb{1}(\theta_j \ne 0) .
$$
Sparsity is just one of many ways to limit the size of the set of potential
θ vectors to consider. One could consider vectors θ that have the following
structure for example (see Figure 2):
• Monotonic: $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_d$;

• Smooth: $|\theta_i - \theta_j| \le C|i - j|^\alpha$ for some $\alpha > 0$;

• Piecewise constant: $\sum_{j=1}^{d-1} \mathbb{1}(\theta_{j+1} \ne \theta_j) \le k$;

• Structured in another basis: $\theta = \Psi \mu$ for some orthogonal matrix $\Psi$, where $\mu$ belongs to one of the structured classes described above.
Sparsity plays a significant role in statistics because, often, structure translates into sparsity in a certain basis. For example, a smooth function is sparse in the trigonometric basis and a piecewise constant function has sparse increments. Moreover, as we will see, real images, for example, are approximately sparse in certain bases such as wavelet or Fourier bases. This is precisely the feature exploited in compression schemes such as JPEG or JPEG-2000: only a few coefficients in these images are necessary to retain the main features of the image.
We say that $\theta$ is approximately sparse if $|\theta|_0$ may be as large as $d$ but many coefficients $|\theta_j|$ are small rather than exactly equal to zero. There are several mathematical ways to capture this phenomenon, including $\ell_q$-"balls" for $q \le 1$. For $q > 0$, the unit $\ell_q$-ball of $\mathbb{R}^d$ is defined as
$$
B_q(1) = \Big\{ \theta \in \mathbb{R}^d : |\theta|_q^q = \sum_{j=1}^d |\theta_j|^q \le 1 \Big\} ,
$$
where $|\theta|_q$ is often called the $\ell_q$-norm (see footnote 3). As we will see, the smaller $q$ is, the better vectors in the unit $\ell_q$-ball can be approximated by sparse vectors.
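As a quick numerical illustration of this claim (my own sketch, not part of the notes; the random construction and constants are arbitrary), one can draw vectors rescaled to the unit $\ell_q$ ball and measure how well they are approximated by keeping only their $k$ largest coordinates.

```python
import numpy as np

# Sketch (not from the notes): the smaller q, the better a vector of unit
# l_q "norm" is approximated by keeping only its k largest coordinates.
rng = np.random.default_rng(1)
d, k = 1000, 20

def unit_lq_vector(q, d, rng):
    """Random vector rescaled so that sum_j |theta_j|^q = 1."""
    theta = rng.standard_normal(d)
    return theta / np.sum(np.abs(theta) ** q) ** (1.0 / q)

def best_k_term_error(theta, k):
    """Squared l_2 distance to the best k-sparse approximation of theta."""
    idx = np.argsort(np.abs(theta))[::-1]   # coordinates by decreasing magnitude
    tail = theta[idx[k:]]                   # everything outside the top k
    return np.sum(tail ** 2)

for q in (0.25, 0.5, 1.0, 2.0):
    theta = unit_lq_vector(q, d, rng)
    print(f"q = {q:4}: ||theta - theta_k||_2^2 = {best_k_term_error(theta, k):.2e}")
```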
Note that the set of $k$-sparse vectors of $\mathbb{R}^d$ is a union of $\sum_{j=0}^k \binom{d}{j}$ linear subspaces of dimension at most $k$, each spanned by at most $k$ vectors of the canonical basis of $\mathbb{R}^d$. If we knew that $\theta^*$ belongs to one of these subspaces, we could simply drop irrelevant coordinates and obtain an oracle inequality such as (1), with $d$ replaced by $k$. Since we do not know in which subspace $\theta^*$ lives exactly, we will have to pay an extra term to find it. It turns out that this term is exactly of the order of
$$
\frac{\log\big( \sum_{j=0}^{k} \binom{d}{j} \big)}{n} \simeq C\, \frac{k \log\big( \frac{ed}{k} \big)}{n} .
$$
Therefore, the price to pay for not knowing which subspace to look at is only a logarithmic factor.
3 Strictly speaking, |θ|q is a norm and the ℓq ball is a ball only for q ≥ 1.
Figure 2. Examples of structured vectors: the coordinates $\theta_j$ are plotted against the index $j$ for monotone and smooth profiles, among others.
Nonparametric regression
Nonparametric does not mean that there is no parameter to estimate (the regression function is a parameter) but rather that the parameter to estimate is infinite dimensional (as is the case for a function). In some instances, this parameter can be identified with an infinite sequence of real numbers, so that we are still in the realm of countable infinity. Indeed, observe that since $L_2(P_X)$ equipped with the inner product $\langle \cdot, \cdot \rangle_2$ is a separable Hilbert space, it admits an orthonormal basis $\{\varphi_k\}_{k \in \mathbb{Z}}$ and any function $f \in L_2(P_X)$ can be decomposed as
$$
f = \sum_{k \in \mathbb{Z}} \alpha_k \varphi_k ,
$$
where $\alpha_k = \langle f, \varphi_k \rangle_2$.
Therefore estimating a regression function f amounts to estimating the
infinite sequence {αk }k∈Z ∈ ℓ2 . You may argue (correctly) that the basis
{ϕk }k∈Z is also unknown as it depends on the unknown PX . This is absolutely
where α̂k are some data-driven coefficients (obtained by least-squares for ex-
ample). Then by the Pythagorean theorem and Parseval’s identity, we have
We can even work further on this oracle inequality using the fact that $|\alpha_k| \le C|k|^{-\gamma}$. Indeed, we have (see footnote 4)
$$
\sum_{|k| > k_0} \alpha_k^2 \le C^2 \sum_{|k| > k_0} k^{-2\gamma} \le C k_0^{1 - 2\gamma} .
$$
The so-called stochastic term $\mathbb{E} \sum_{|k| \le k_0} (\hat\alpha_k - \alpha_k)^2$ clearly increases with $k_0$ (more parameters to estimate), whereas the approximation term $C k_0^{1-2\gamma}$ decreases with $k_0$ (fewer terms discarded). We will see that we can strike a compromise called the bias-variance tradeoff.
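The compromise can be visualized with a few lines of code (my own sketch, not part of the notes; the stochastic term is modeled as $k_0 \sigma^2/n$ and the constants are arbitrary): for each smoothness level $\gamma$, the bound $k_0 \sigma^2/n + C k_0^{1-2\gamma}$ is minimized at a $k_0$ that balances the two terms.

```python
import numpy as np

# Sketch (not from the notes): trade off the stochastic term, taken here to be
# of order k0 * sigma^2 / n, against the approximation term C * k0^(1 - 2*gamma).
sigma2, n, C = 1.0, 1000, 1.0
k0_grid = np.arange(1, 201)

def bound(k0, gamma):
    stochastic = k0 * sigma2 / n          # roughly one variance term per estimated coefficient
    approximation = C * k0 ** (1.0 - 2.0 * gamma)
    return stochastic + approximation

for gamma in (1.0, 2.0, 3.0):
    values = bound(k0_grid, gamma)
    best = k0_grid[np.argmin(values)]
    # Equating the two terms suggests k0 of order n^(1/(2*gamma)).
    print(f"gamma = {gamma}: best k0 = {best}, "
          f"balance heuristic ~ {n ** (1.0 / (2.0 * gamma)):.1f}, "
          f"bound = {values.min():.4f}")
```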
The main difference here with oracle inequalities is that we make assump-
tions on the regression function (here in terms of smoothness) in order to
4 Here we illustrate a convenient notational convention that we will be using through-
out these notes: a constant C may be different from line to line. This will not affect the
interpretation of our results since we are interested in the order of magnitude of the error
bounds. Nevertheless we will, as much as possible, try to make such constants explicit. As
an exercise, try to find an expression of the second C as a function of the first one and of γ.
control the approximation error. Therefore, oracle inequalities are more general but can be seen, on the one hand, as less quantitative. On the other hand, if one is willing to accept the fact that the approximation error is inevitable, then there is no reason to focus on it. This is not the final answer to this rather philosophical question. Indeed, choosing the right $k_0$ can only be done with a control of the approximation error: the best $k_0$ will depend on $\gamma$. We will see that even if the smoothness index $\gamma$ is unknown, we can select $k_0$ in a data-driven way that achieves almost the same performance as if $\gamma$ were known. This phenomenon is called adaptation (to $\gamma$).
It is important to notice the main difference between the approach taken
in nonparametric regression and the one in sparse linear regression. It is not
so much about linear vs. nonlinear model as we can always first take nonlinear
transformations of the xj ’s in linear regression. Instead, sparsity or approx-
imate sparsity is a much weaker notion than the decay of the coefficients $\{\alpha_k\}_k$ presented above. In a way, sparsity only imposes that, after ordering, the coefficients present a certain decay, whereas in nonparametric statistics, the order is set ahead of time: we assume that we have found a basis that is ordered in such a way that the coefficients decay at a certain rate.
Matrix models
In the previous examples, the response variable is always assumed to be a scalar.
What if it is a higher dimensional signal? In Chapter 4, we consider various
problems of this form: matrix completion a.k.a. the Netflix problem, structured
graph estimation and covariance matrix estimation. All these problems can be
described as follows.
Let M, S and N be three matrices, respectively called observation, signal
and noise, and that satisfy
M =S+N.
Here N is a random matrix such that IE[N ] = 0, the all-zero matrix. The goal
is to estimate the signal matrix S from the observation of M .
The structure of S can also be chosen in various ways. We will consider the
case where S is sparse in the sense that it has many zero coefficients. In a way,
this assumption does not leverage much of the matrix structure and essentially
treats matrices as vectors arranged in the form of an array. This is not the case
of low rank structures where one assumes that the matrix S has either low rank
or can be well approximated by a low rank matrix. This assumption makes
sense in the case where S represents user preferences as in the Netflix example.
In this example, the (i, j)th coefficient Sij of S corresponds to the rating (on a
scale from 1 to 5) that user $i$ gave to movie $j$. The low rank assumption simply materializes the idea that there are a few canonical profiles of users and that each user can be represented as a linear combination of these canonical profiles.
At first glance, this problem seems much more difficult than sparse linear
regression. Indeed, one needs to learn not only the sparse coefficients in a given
Introduction 13
basis, but also the basis of eigenvectors. Fortunately, it turns out that the latter
task is much easier and is dominated by the former in terms of statistical price.
Another important example of matrix estimation is high-dimensional co-
variance estimation, where the goal is to estimate the covariance matrix of a
random vector X ∈ IRd , or its leading eigenvectors, based on n observations.
Such a problem has many applications including principal component analysis,
linear discriminant analysis and portfolio optimization. The main difficulty is
that n may be much smaller than the number of degrees of freedom in the
covariance matrix, which can be of order d2 . To overcome this limitation,
assumptions on the rank or the sparsity of the matrix can be leveraged.
$$
\mathbb{E}\|\hat f_n - f\|_2^2 > c\, \frac{d}{n}
$$
for some positive constant $c$. Here we used a different notation for the constant to emphasize the fact that lower bounds guarantee optimality only up to a constant factor. Such a lower bound on the risk is called a minimax lower bound, for reasons that will become clearer in Chapter 5.
How is this possible? How can we make a statement for all estimators?
We will see that these statements borrow from the theory of tests where we
know that it is impossible to drive both the type I and the type II error to
zero simultaneously (with a fixed sample size). Intuitively this phenomenon
is related to the following observation: Given n observations X1 , . . . , Xn , it is
hard to tell if they are distributed according to N (θ, 1) or to N (θ′ , 1) for a
Euclidean distance |θ − pθ′ |2 is small enough. We will see that it is the case for
′
example if |θ − θ |2 ≤ C d/n, which will yield our lower bound.
Chapter 1

Sub-Gaussian Random Variables
A random variable $X \in \mathbb{R}$ is said to be Gaussian if it admits a density $p$ with respect to the Lebesgue measure on $\mathbb{R}$ given by
$$
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big) , \qquad x \in \mathbb{R} ,
$$
where µ = IE(X) ∈ IR and σ 2 = var(X) > 0 are the mean and variance of
X. We write X ∼ N (µ, σ 2 ). Note that X = σZ + µ for Z ∼ N (0, 1) (called
standard Gaussian) and where the equality holds in distribution. Clearly, this
distribution has unbounded support but it is well known that it has almost
bounded support in the following sense: IP(|X − µ| ≤ 3σ) ≃ 0.997. This is due
to the fast decay of the tails of p as |x| → ∞ (see Figure 1.1). This decay can
be quantified using the following proposition (Mills inequality).
Proposition 1.1. Let $X$ be a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. Then for any $t > 0$, it holds
$$
\mathbb{P}(X - \mu > t) \le \frac{1}{\sqrt{2\pi}}\, \frac{e^{-\frac{t^2}{2\sigma^2}}}{t} .
$$
By symmetry we also have
$$
\mathbb{P}(X - \mu < -t) \le \frac{1}{\sqrt{2\pi}}\, \frac{e^{-\frac{t^2}{2\sigma^2}}}{t} ,
$$
and
$$
\mathbb{P}(|X - \mu| > t) \le \sqrt{\frac{2}{\pi}}\, \frac{e^{-\frac{t^2}{2\sigma^2}}}{t} .
$$

Figure 1.1. Probabilities of falling within 1, 2, and 3 standard deviations of the mean in a Gaussian distribution. Source: http://www.openintro.org/
Proof. Note that it is sufficient to prove the theorem for $\mu = 0$ and $\sigma^2 = 1$ by simple translation and rescaling. We get for $Z \sim \mathcal{N}(0, 1)$,
$$
\begin{aligned}
\mathbb{P}(Z > t) &= \frac{1}{\sqrt{2\pi}} \int_t^\infty \exp\Big( -\frac{x^2}{2} \Big)\, \mathrm{d}x \\
&\le \frac{1}{\sqrt{2\pi}} \int_t^\infty \frac{x}{t} \exp\Big( -\frac{x^2}{2} \Big)\, \mathrm{d}x \\
&= \frac{1}{t\sqrt{2\pi}} \int_t^\infty -\frac{\partial}{\partial x} \exp\Big( -\frac{x^2}{2} \Big)\, \mathrm{d}x \\
&= \frac{1}{t\sqrt{2\pi}} \exp(-t^2/2) .
\end{aligned}
$$
The second inequality follows from symmetry and the last one using the union
bound:
IP(|Z| > t) = IP({Z > t} ∪ {Z < −t}) ≤ IP(Z > t) + IP(Z < −t) = 2IP(Z > t) .
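As a sanity check (my own, not part of the notes), one can compare the exact Gaussian tail, computed from the complementary error function, with Mills' bound and with the cruder bound $e^{-t^2/2}$ that will come out of the Chernoff argument below.

```python
import math

# Sketch: compare the exact standard Gaussian tail P(Z > t) with
# Mills' bound exp(-t^2/2)/(t*sqrt(2*pi)) and the bound exp(-t^2/2).
def gaussian_tail(t):
    # P(Z > t) = 0.5 * erfc(t / sqrt(2)) for Z ~ N(0, 1)
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def mills_bound(t):
    return math.exp(-t * t / 2.0) / (t * math.sqrt(2.0 * math.pi))

def chernoff_bound(t):
    return math.exp(-t * t / 2.0)

for t in (0.5, 1.0, 2.0, 3.0, 5.0):
    print(f"t = {t:3}: exact = {gaussian_tail(t):.3e}, "
          f"Mills = {mills_bound(t):.3e}, Chernoff = {chernoff_bound(t):.3e}")
```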
The fact that a Gaussian random variable Z has tails that decay to zero
exponentially fast can also be seen in the moment generating function (MGF)
$$
M : s \mapsto M(s) = \mathbb{E}[\exp(sZ)] .
$$
It follows that if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $\mathbb{E}[\exp(sX)] = \exp\big( s\mu + \tfrac{\sigma^2 s^2}{2} \big)$.
Figure 1.2. Width of confidence intervals (width against $\delta$ in %) from exact computation in R (red dashed) and from (1.1) (solid black).
The above inequality holds for any $s > 0$, so to make it the tightest possible, we minimize with respect to $s > 0$. Solving $\phi'(s) = 0$, where $\phi(s) = \frac{\sigma^2 s^2}{2} - st$, we find that $\inf_{s > 0} \phi(s) = -\frac{t^2}{2\sigma^2}$. This proves the first part of (1.3). The second inequality in this equation follows in the same manner (recall that (1.2) holds for any $s \in \mathbb{R}$).
Moments

Recall that the absolute moments of $Z \sim \mathcal{N}(0, \sigma^2)$ are given by
$$
\mathbb{E}[|Z|^k] = \frac{1}{\sqrt{\pi}} (2\sigma^2)^{k/2}\, \Gamma\Big( \frac{k+1}{2} \Big) .
$$
The next lemma shows that the tail bounds of Lemma 1.3 are sufficient to
show that the absolute moments of X ∼ subG(σ 2 ) can be bounded by those of
Z ∼ N (0, σ 2 ) up to multiplicative constants.
Lemma 1.4. Let $X$ be a random variable such that
$$
\mathbb{P}[|X| > t] \le 2 \exp\Big( -\frac{t^2}{2\sigma^2} \Big) ,
$$
then for any positive integer $k \ge 1$,
$$
\mathbb{E}[|X|^k] \le (2\sigma^2)^{k/2}\, k\, \Gamma(k/2) .
$$
In particular,
$$
\big( \mathbb{E}[|X|^k] \big)^{1/k} \le \sigma e^{1/e} \sqrt{k} , \qquad k \ge 2 ,
$$
and $\mathbb{E}[|X|] \le \sigma\sqrt{2\pi}$.
Proof.
$$
\begin{aligned}
\mathbb{E}[|X|^k] &= \int_0^\infty \mathbb{P}(|X|^k > t)\, \mathrm{d}t \\
&= \int_0^\infty \mathbb{P}(|X| > t^{1/k})\, \mathrm{d}t \\
&\le 2 \int_0^\infty e^{-\frac{t^{2/k}}{2\sigma^2}}\, \mathrm{d}t \\
&= (2\sigma^2)^{k/2}\, k \int_0^\infty e^{-u} u^{k/2 - 1}\, \mathrm{d}u , \qquad u = \frac{t^{2/k}}{2\sigma^2} \\
&= (2\sigma^2)^{k/2}\, k\, \Gamma(k/2) .
\end{aligned}
$$
The second statement follows from $\Gamma(k/2) \le (k/2)^{k/2}$ and $k^{1/k} \le e^{1/e}$ for any $k \ge 2$. It yields
$$
\big( (2\sigma^2)^{k/2}\, k\, \Gamma(k/2) \big)^{1/k} \le k^{1/k} \sqrt{\frac{2\sigma^2 k}{2}} \le e^{1/e} \sigma \sqrt{k} .
$$
Moreover, for $k = 1$, we have $\sqrt{2}\,\Gamma(1/2) = \sqrt{2\pi}$.
Using moments, we can prove the following reciprocal to Lemma 1.3.
Lemma 1.5. If (1.3) holds, then for any $s > 0$, it holds
$$
\mathbb{E}[\exp(sX)] \le e^{4\sigma^2 s^2} .
$$
As a result, we will sometimes write $X \sim \mathrm{subG}(\sigma^2)$ when it satisfies (1.3).
Proof. We use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem,
$$
\begin{aligned}
\mathbb{E}\big[ e^{sX} \big] &\le 1 + \sum_{k=2}^\infty \frac{s^k\, \mathbb{E}[|X|^k]}{k!} \\
&\le 1 + \sum_{k=2}^\infty \frac{(2\sigma^2 s^2)^{k/2}\, k\, \Gamma(k/2)}{k!} \\
&= 1 + \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k}\, 2k\, \Gamma(k)}{(2k)!} + \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k + 1/2}\, (2k+1)\, \Gamma(k + 1/2)}{(2k+1)!} \\
&\le 1 + \Big( 2 + \sqrt{2\sigma^2 s^2} \Big) \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k}\, k!}{(2k)!} \\
&\le 1 + \Big( 1 + \sqrt{\frac{\sigma^2 s^2}{2}} \Big) \sum_{k=1}^\infty \frac{(2\sigma^2 s^2)^{k}}{k!} \qquad \big( 2(k!)^2 \le (2k)! \big) \\
&= e^{2\sigma^2 s^2} + \sqrt{\frac{\sigma^2 s^2}{2}} \big( e^{2\sigma^2 s^2} - 1 \big) \\
&\le e^{4\sigma^2 s^2} .
\end{aligned}
$$
From the above Lemma, we see that sub-Gaussian random variables can
be equivalently defined from their tail bounds and their moment generating
functions, up to constants.
If we only care about the tails, this property is preserved for sub-Gaussian
random variables.
Theorem 1.6. Let X = (X1 , . . . , Xn ) be a vector of independent sub-Gaussian
random variables that have variance proxy σ 2 . Then, the random vector X is
sub-Gaussian with variance proxy σ 2 .
and
$$
\mathbb{P}\Big[ \sum_{i=1}^n a_i X_i < -t \Big] \le \exp\Big( -\frac{t^2}{2\sigma^2 |a|_2^2} \Big) .
$$
Of special interest is the case where $a_i = 1/n$ for all $i$. Then we get that the average $\bar X = \frac{1}{n} \sum_{i=1}^n X_i$ satisfies
$$
\mathbb{P}(\bar X > t) \le e^{-\frac{nt^2}{2\sigma^2}} \qquad \text{and} \qquad \mathbb{P}(\bar X < -t) \le e^{-\frac{nt^2}{2\sigma^2}} .
$$
Hoeffding’s inequality
The class of sub-Gaussian random variables is actually quite large. Indeed, Hoeffding's lemma below implies that all random variables that are bounded uniformly are actually sub-Gaussian with a variance proxy that depends on the size of their support.

Lemma 1.8 (Hoeffding's lemma (1963)). Let $X$ be a random variable such that $\mathbb{E}(X) = 0$ and $X \in [a, b]$ almost surely. Then, for any $s \in \mathbb{R}$, it holds
$$
\mathbb{E}[e^{sX}] \le e^{\frac{s^2 (b - a)^2}{8}} .
$$
In particular, $X \sim \mathrm{subG}\big( \frac{(b-a)^2}{4} \big)$.
Proof. Define $\psi(s) = \log \mathbb{E}[e^{sX}]$, and observe that we can readily compute
$$
\psi'(s) = \frac{\mathbb{E}[X e^{sX}]}{\mathbb{E}[e^{sX}]} , \qquad \psi''(s) = \frac{\mathbb{E}[X^2 e^{sX}]}{\mathbb{E}[e^{sX}]} - \Big( \frac{\mathbb{E}[X e^{sX}]}{\mathbb{E}[e^{sX}]} \Big)^2 .
$$
Thus $\psi''(s)$ can be interpreted as the variance of the random variable $X$ under the probability measure $\mathrm{d}\mathbb{Q} = \frac{e^{sX}}{\mathbb{E}[e^{sX}]}\, \mathrm{d}\mathbb{P}$. But since $X \in [a, b]$ almost surely,
$X_i \in [a_i, b_i]$, $\forall\, i$. Let $\bar X = \frac{1}{n} \sum_{i=1}^n X_i$. Then for any $t > 0$,
$$
\mathbb{P}(\bar X - \mathbb{E}(\bar X) > t) \le \exp\Big( -\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big) ,
$$
and
$$
\mathbb{P}(\bar X - \mathbb{E}(\bar X) < -t) \le \exp\Big( -\frac{2 n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big) .
$$
Note that Hoeffding's lemma holds for any bounded random variable. For example, suppose that $X$ is a Rademacher random variable. Then
$$
\mathbb{E}(e^{sX}) = \frac{e^s + e^{-s}}{2} = \cosh(s) \le e^{\frac{s^2}{2}} .
$$
Note that 2 is the best possible constant in the above approximation. For such variables, $a = -1$, $b = 1$, and $\mathbb{E}(X) = 0$, so Hoeffding's lemma yields
$$
\mathbb{E}(e^{sX}) \le e^{\frac{s^2}{2}} .
$$
Hoeffding’s inequality is very general but there is a price to pay for this gen-
erality. Indeed, if the random variables have small variance, we would like to
see it reflected in the exponential tail bound (like for the Gaussian case) but
the variance does not appear in Hoeffding’s inequality. We need a more refined
inequality.
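Before moving on, here is a small simulation (my own, not from the notes; the sample sizes and thresholds are arbitrary) that compares the empirical tail of the average of bounded variables with Hoeffding's bound $\exp(-2nt^2/(b-a)^2)$.

```python
import numpy as np

# Sketch: empirical check of Hoeffding's inequality for averages of
# independent uniform[0, 1] variables (a = 0, b = 1).
rng = np.random.default_rng(2)
n, n_trials = 100, 50_000
a, b = 0.0, 1.0

samples = rng.uniform(a, b, size=(n_trials, n))
deviations = samples.mean(axis=1) - 0.5       # X_bar - E[X_bar]

for t in (0.05, 0.10, 0.15):
    empirical = np.mean(deviations > t)
    hoeffding = np.exp(-2.0 * n * t ** 2 / (b - a) ** 2)
    print(f"t = {t:.2f}: P(Xbar - E[Xbar] > t) ~ {empirical:.4f} <= {hoeffding:.4f}")
```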
The second statement follows from $\Gamma(k) \le k^k$ and $k^{1/k} \le e^{1/e} \le 2$ for any $k \ge 1$. It yields
$$
\big( \lambda^k\, k\, \Gamma(k) \big)^{1/k} \le 2\lambda k .
$$
To control the MGF of $X$, we use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem, for any $s$ such that $|s| \le 1/(2\lambda)$,
$$
\begin{aligned}
\mathbb{E}\big[ e^{sX} \big] &\le 1 + \sum_{k=2}^\infty \frac{|s|^k\, \mathbb{E}[|X|^k]}{k!} \\
&\le 1 + \sum_{k=2}^\infty (|s|\lambda)^k \\
&= 1 + s^2 \lambda^2 \sum_{k=0}^\infty (|s|\lambda)^k \\
&\le 1 + 2 s^2 \lambda^2 \qquad \Big( |s| \le \frac{1}{2\lambda} \Big) \\
&\le e^{2 s^2 \lambda^2} .
\end{aligned}
$$
Bernstein's inequality

Theorem 1.13 (Bernstein's inequality). Let $X_1, \ldots, X_n$ be independent random variables such that $\mathbb{E}(X_i) = 0$ and $X_i \sim \mathrm{subE}(\lambda)$. Define
$$
\bar X = \frac{1}{n} \sum_{i=1}^n X_i ,
$$
Proof. Without loss of generality, assume that $\lambda = 1$ (we can always replace $X_i$ by $X_i/\lambda$ and $t$ by $t/\lambda$). Next, using a Chernoff bound, we get for any $s > 0$,
$$
\mathbb{P}(\bar X > t) \le \prod_{i=1}^n \mathbb{E}\big[ e^{s X_i} \big]\, e^{-snt} .
$$
Next, if $|s| \le 1$, then $\mathbb{E}\big[ e^{s X_i} \big] \le e^{s^2/2}$ by definition of sub-exponential distributions. It yields
$$
\mathbb{P}(\bar X > t) \le e^{\frac{n s^2}{2} - snt} .
$$
Choosing $s = 1 \wedge t$ yields
$$
\mathbb{P}(\bar X > t) \le e^{-\frac{n}{2}(t^2 \wedge t)} .
$$
We obtain the same bound for $\mathbb{P}(\bar X < -t)$, which concludes the proof.
The exponential inequalities of the previous section are valid for linear com-
binations of independent random variables, and in particular, for the average
X̄. In many instances, we will be interested in controlling the maximum over
the parameters of such linear combinations (this is because of empirical risk
minimization). The purpose of this section is to present such results.
Note that the random variables in this theorem need not be independent.
$$
= \frac{\log N}{s} + \frac{\sigma^2 s}{2} .
$$
Taking $s = \sqrt{2(\log N)/\sigma^2}$ yields the first inequality in expectation.

The first inequality in probability is obtained by a simple union bound:
$$
\begin{aligned}
\mathbb{P}\Big( \max_{1 \le i \le N} X_i > t \Big) &= \mathbb{P}\Big( \bigcup_{1 \le i \le N} \{X_i > t\} \Big) \\
&\le \sum_{1 \le i \le N} \mathbb{P}(X_i > t) \\
&\le N e^{-\frac{t^2}{2\sigma^2}} ,
\end{aligned}
$$
On the opposite side of the picture, if all the $X_i$'s are equal to the same random variable $X$, we have for any $t > 0$,
$$
\mathbb{P}\Big( \max_{1 \le i \le N} X_i > t \Big) = \mathbb{P}(X > t) .
$$
In the Gaussian case, lower bounds are also available. They illustrate the effect of the correlation between the $X_i$'s.
It yields
$$
\max_{x \in P} c^\top x \le \max_{x \in \mathcal{V}(P)} c^\top x \le \max_{x \in P} c^\top x .
$$

$$
B_1 = \Big\{ x \in \mathbb{R}^d : \sum_{i=1}^d |x_i| \le 1 \Big\} .
$$

$$
B_2 = \Big\{ x \in \mathbb{R}^d : \sum_{i=1}^d x_i^2 \le 1 \Big\} .
$$
Clearly, this ball is not a polytope and yet, we can control the maximum of
random variables indexed by B2 . This is due to the fact that there exists a
finite subset of B2 such that the maximum over this finite set is of the same
order as the maximum over the entire ball.
Definition 1.17. Fix K ⊂ IRd and ε > 0. A set N is called an ε-net of K
with respect to a distance d(·, ·) on IRd , if N ⊂ K and for any z ∈ K, there
exists x ∈ N such that d(x, z) ≤ ε.
Therefore, if N is an ε-net of K with respect to norm k · k, then every point
of K is at distance at most ε from a point in N . Clearly, every compact set
admits a finite ε-net. The following lemma gives an upper bound on the size
of the smallest ε-net of B2 .
Lemma 1.18. Fix ε ∈ (0, 1). Then the unit Euclidean ball B2 has an ε-net N
with respect to the Euclidean distance of cardinality |N | ≤ (3/ε)d
This is equivalent to
$$
\Big( 1 + \frac{\varepsilon}{2} \Big)^d \ge |\mathcal{N}| \Big( \frac{\varepsilon}{2} \Big)^d .
$$
Therefore, we get the following bound:
$$
|\mathcal{N}| \le \Big( 1 + \frac{2}{\varepsilon} \Big)^d \le \Big( \frac{3}{\varepsilon} \Big)^d .
$$
But
$$
\max_{x \in \frac{1}{2} B_2} x^\top X = \frac{1}{2} \max_{x \in B_2} x^\top X .
$$
Therefore, using Theorem 1.14, we get
$$
\mathbb{E}\big[ \max_{\theta \in B_2} \theta^\top X \big] \le 2\, \mathbb{E}\big[ \max_{z \in \mathcal{N}} z^\top X \big] \le 2\sigma \sqrt{2 \log(|\mathcal{N}|)} \le 2\sigma \sqrt{2 (\log 6)\, d} \le 4\sigma \sqrt{d} .
$$
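The conclusion of this covering argument can be checked by simulation (my own sketch, not part of the notes; sample sizes are arbitrary): for $X \sim \mathcal{N}(0, \sigma^2 I_d)$, the maximum of $\theta^\top X$ over the unit ball is $|X|_2$, whose expectation is at most $\sigma\sqrt{d}$, to be compared with the bound $4\sigma\sqrt{d}$ obtained above.

```python
import numpy as np

# Sketch: for X ~ N(0, sigma^2 I_d), max over the unit ball of theta^T X equals
# |X|_2, so E[max_{theta in B_2} theta^T X] = E|X|_2 <= sigma * sqrt(d).
rng = np.random.default_rng(3)
sigma = 1.0

for d in (10, 100, 1000):
    X = sigma * rng.standard_normal(size=(5_000, d))
    max_over_ball = np.linalg.norm(X, axis=1)   # sup_{|theta|_2 <= 1} theta^T X = |X|_2
    print(f"d = {d:5}: E[max] ~ {max_over_ball.mean():7.2f}, "
          f"sigma*sqrt(d) = {sigma * np.sqrt(d):7.2f}, "
          f"bound 4*sigma*sqrt(d) = {4 * sigma * np.sqrt(d):7.2f}")
```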
Problem 1.4. Let $A = \{A_{i,j}\}_{1 \le i \le n,\, 1 \le j \le m}$ be a random matrix such that its entries are iid sub-Gaussian random variables with variance proxy $\sigma^2$.

(a) Show that the matrix $A$ is sub-Gaussian. What is its variance proxy?

(b) Let $\|A\|$ denote the operator norm of $A$ defined by
$$
\max_{x \in \mathbb{R}^m} \frac{|Ax|_2}{|x|_2} .
$$
Show that there exists a constant $C > 0$ such that
$$
\mathbb{E}\|A\| \le C(\sqrt{m} + \sqrt{n}) .
$$
Problem 1.5. Recall that for any $q \ge 1$, the $\ell_q$ norm of a vector $x \in \mathbb{R}^n$ is defined by
$$
|x|_q = \Big( \sum_{i=1}^n |x_i|^q \Big)^{\frac{1}{q}} .
$$
Problem 1.6. Let K be a compact subset of the unit sphere of IRp that
admits an ε-net Nε with respect to the Euclidean distance of IRp that satisfies
|Nε | ≤ (C/ε)d for all ε ∈ (0, 1). Here C ≥ 1 and d ≤ p are positive constants.
Let X ∼ subGp (σ 2 ) be a centered random vector.
Show that there exist positive constants $c_1$ and $c_2$, to be made explicit, such that for any $\delta \in (0, 1)$, it holds
$$
\max_{\theta \in K} \theta^\top X \le c_1 \sigma \sqrt{d \log(2p/d)} + c_2 \sigma \sqrt{\log(1/\delta)}
$$
with probability at least $1 - \delta$. Comment on the result in light of Theorem 1.19.
Problem 1.7. For any $K \subset \mathbb{R}^d$, distance $d$ on $\mathbb{R}^d$ and $\varepsilon > 0$, the $\varepsilon$-covering number $C(\varepsilon)$ of $K$ is the cardinality of the smallest $\varepsilon$-net of $K$. The $\varepsilon$-packing number $P(\varepsilon)$ of $K$ is the cardinality of the largest set $\mathcal{P} \subset K$ such that $d(z, z') > \varepsilon$ for all $z, z' \in \mathcal{P}$, $z \ne z'$. Show that
$$
Y_i = f(X_i) + \varepsilon_i , \qquad i = 1, \ldots, n , \qquad (2.1)
$$
Random design
The case of random design corresponds to the statistical learning setup. Let
$(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$ be $n+1$ i.i.d. random couples. Given $(X_1, Y_1), \ldots, (X_n, Y_n)$, the goal is to construct a function $\hat f_n$ such that $\hat f_n(X_{n+1})$ is a good predictor of $Y_{n+1}$. Note that when $\hat f_n$ is constructed, $X_{n+1}$ is still unknown and we have to account for what value it is likely to take.
Consider the following example from [HTF01, Section 3.2]. The response
variable Y is the log-volume of a cancerous tumor, and the goal is to predict
it based on X ∈ IR6 , a collection of variables that are easier to measure (age
of patient, log-weight of prostate, . . . ). Here the goal is clearly to construct f
for prediction purposes. Indeed, we want to find an automatic mechanism that
outputs a good prediction of the log-weight of the tumor given certain inputs
for a new (unseen) patient.
A natural measure of performance here is the $L_2$-risk employed in the introduction:
$$
R(\hat f_n) = \mathbb{E}[Y_{n+1} - \hat f_n(X_{n+1})]^2 = \mathbb{E}[Y_{n+1} - f(X_{n+1})]^2 + \|\hat f_n - f\|_{L_2(P_X)}^2 ,
$$
where $P_X$ denotes the marginal distribution of $X_{n+1}$. It measures how good the prediction of $Y_{n+1}$ is on average over realizations of $X_{n+1}$. In particular, it does not put much emphasis on values of $X_{n+1}$ that are not very likely to occur.

Note that if the $\varepsilon_i$ are random variables with variance $\sigma^2$, then one simply has $R(\hat f_n) = \sigma^2 + \|\hat f_n - f\|_{L_2(P_X)}^2$. Therefore, for random design, we will focus on the squared $L_2$ norm $\|\hat f_n - f\|_{L_2(P_X)}^2$ as a measure of accuracy. It measures how close $\hat f_n$ is to the unknown $f$ on average over realizations of $X_{n+1}$.
Fixed design
In fixed design, the points (or vectors) X1 , . . . , Xn are deterministic. To em-
phasize this fact, we use lowercase letters x1 , . . . , xn to denote fixed design. Of
course, we can always think of them as realizations of a random variable but
the distinction between fixed and random design is deeper and significantly
affects our measure of performance. Indeed, recall that for random design, we
look at the performance in average over realizations of Xn+1 . Here, there is no
such thing as a marginal distribution of Xn+1 . Rather, since the design points
$x_1, \ldots, x_n$ are considered deterministic, our goal is to estimate $f$ only at these points. This problem is sometimes called denoising, since our goal is to recover $f(x_1), \ldots, f(x_n)$ given noisy observations of these values.

In many instances, fixed design can be recognized from its structured form. A typical example is the regular design on $[0, 1]$, given by $x_i = i/n$, $i = 1, \ldots, n$. Interpolation between these points is possible under smoothness assumptions.
Note that in fixed design, we observe $\mu^* + \varepsilon$, where $\mu^* = \big( f(x_1), \ldots, f(x_n) \big)^\top \in \mathbb{R}^n$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top \in \mathbb{R}^n$ is sub-Gaussian with variance proxy $\sigma^2$. Instead of a functional estimation problem, it is often simpler to view this problem as a vector problem in $\mathbb{R}^n$. This point of view will allow us to leverage the Euclidean geometry of $\mathbb{R}^n$.
In the case of fixed design, we will focus on the Mean Squared Error (MSE)
as a measure of performance. It is defined by
$$
\mathrm{MSE}(\hat f_n) = \frac{1}{n} \sum_{i=1}^n \big( \hat f_n(x_i) - f(x_i) \big)^2 .
$$

$$
\mathrm{MSE}(X\hat\theta) = \frac{1}{n} |X(\hat\theta - \theta^*)|_2^2 = (\hat\theta - \theta^*)^\top \frac{X^\top X}{n} (\hat\theta - \theta^*) . \qquad (2.3)
$$
A natural example of fixed design regression is image denoising. Assume
that µ∗i , i ∈ 1, . . . , n is the grayscale value of pixel i of an image. We do not
get to observe the image µ∗ but rather a noisy version of it Y = µ∗ + ε. Given
a library of d images {x1 , . . . , xd }, xj ∈ IRn , our goal is to recover the original
image µ∗ using linear combinations of the images x1 , . . . , xd . This can be done
fairly accurately (see Figure 2.1).
Figure 2.1. Reconstruction of the digit “6”: Original (left), Noisy (middle) and Recon-
struction (right). Here n = 16 × 16 = 256 pixels. Source [RT11].
As we will see in Remark 2.3, choosing fixed design properly also ensures
that if MSE(fˆ) is small for some linear estimator fˆ(x) = x⊤ θ̂, then |θ̂ − θ∗ |22 is
also small.
Throughout this section, we consider the regression model (2.2) with fixed
design.
Note that we are interested in estimating $X\theta^*$ and not $\theta^*$ itself, so by extension, we also call $\hat\mu^{\mathrm{ls}} = X\hat\theta^{\mathrm{ls}}$ the least squares estimator. Observe that $\hat\mu^{\mathrm{ls}}$ is the projection of $Y$ onto the column span of $X$. It is not hard to see that the least squares estimators of $\theta^*$ and $\mu^* = X\theta^*$ are maximum likelihood estimators when $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.
Proposition 2.1. The least squares estimator $\hat\mu^{\mathrm{ls}} = X\hat\theta^{\mathrm{ls}} \in \mathbb{R}^n$ satisfies
$$
X^\top \hat\mu^{\mathrm{ls}} = X^\top Y .
$$
Setting the gradient of the least squares objective to zero,
$$
\nabla_\theta |Y - X\theta|_2^2 = 0 ,
$$
is equivalent to
$$
X^\top X \theta = X^\top Y .
$$
It concludes the proof of the first statement. The second statement follows
from the definition of the Moore-Penrose pseudoinverse.
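As a computational aside (my own sketch, not part of the notes; the simulated dimensions are arbitrary), the least squares estimator and the fitted values can be obtained from the Moore-Penrose pseudoinverse, which also covers the rank-deficient case where $\hat\theta^{\mathrm{ls}}$ is not unique. The code below checks the normal equations of Proposition 2.1.

```python
import numpy as np

# Sketch: least squares estimator theta_hat = pinv(X) @ Y and fitted values
# mu_hat = X @ theta_hat, i.e. the projection of Y onto the column span of X.
rng = np.random.default_rng(4)
n, d, sigma = 50, 10, 0.5

X = rng.standard_normal(size=(n, d))
theta_star = rng.standard_normal(d)
Y = X @ theta_star + sigma * rng.standard_normal(n)

theta_hat = np.linalg.pinv(X) @ Y            # a least squares solution
mu_hat = X @ theta_hat                       # projection of Y onto col span of X

# Check the normal equations X^T mu_hat = X^T Y of Proposition 2.1.
print("normal equations residual:", np.max(np.abs(X.T @ mu_hat - X.T @ Y)))
print("MSE(X theta_hat) =", np.mean((mu_hat - X @ theta_star) ** 2),
      " vs  sigma^2 * d / n =", sigma ** 2 * d / n)
```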
We are now going to prove our first result on the finite sample performance
of the least squares estimator for fixed design.
Theorem 2.2. Assume that the linear model (2.2) holds, where $\varepsilon \sim \mathrm{subG}_n(\sigma^2)$. Then the least squares estimator $\hat\theta^{\mathrm{ls}}$ satisfies
$$
\mathbb{E}\, \mathrm{MSE}(X\hat\theta^{\mathrm{ls}}) = \frac{1}{n} \mathbb{E}|X\hat\theta^{\mathrm{ls}} - X\theta^*|_2^2 \lesssim \sigma^2\, \frac{r}{n} ,
$$
where $r = \mathrm{rank}(X^\top X)$. Moreover, for any $\delta > 0$, with probability $1 - \delta$, it holds
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{ls}}) \lesssim \sigma^2\, \frac{r + \log(1/\delta)}{n} .
$$
Proof. Note that by definition
$$
|Y - X\hat\theta^{\mathrm{ls}}|_2^2 \le |Y - X\theta^*|_2^2 = |\varepsilon|_2^2 .
$$
Moreover,
$$
|Y - X\hat\theta^{\mathrm{ls}}|_2^2 = |X\theta^* + \varepsilon - X\hat\theta^{\mathrm{ls}}|_2^2 = |X\hat\theta^{\mathrm{ls}} - X\theta^*|_2^2 - 2\varepsilon^\top X(\hat\theta^{\mathrm{ls}} - \theta^*) + |\varepsilon|_2^2 .
$$
Therefore, we get
$$
|X\hat\theta^{\mathrm{ls}} - X\theta^*|_2^2 \le 2\varepsilon^\top X(\hat\theta^{\mathrm{ls}} - \theta^*) = 2 |X\hat\theta^{\mathrm{ls}} - X\theta^*|_2\, \frac{\varepsilon^\top X(\hat\theta^{\mathrm{ls}} - \theta^*)}{|X(\hat\theta^{\mathrm{ls}} - \theta^*)|_2} . \qquad (2.5)
$$
Note that it is difficult to control
$$
\frac{\varepsilon^\top X(\hat\theta^{\mathrm{ls}} - \theta^*)}{|X(\hat\theta^{\mathrm{ls}} - \theta^*)|_2}
$$
as $\hat\theta^{\mathrm{ls}}$ depends on $\varepsilon$ and the dependence structure of this term may be complicated. To remove this dependency, a traditional technique is to "sup out" $\hat\theta^{\mathrm{ls}}$. This is typically where maximal inequalities are needed. Here we have to be a bit careful.

Let $\Phi = [\phi_1, \ldots, \phi_r] \in \mathbb{R}^{n \times r}$ be an orthonormal basis of the column span of $X$. In particular, there exists $\nu \in \mathbb{R}^r$ such that $X(\hat\theta^{\mathrm{ls}} - \theta^*) = \Phi\nu$. It yields
$$
\frac{\varepsilon^\top X(\hat\theta^{\mathrm{ls}} - \theta^*)}{|X(\hat\theta^{\mathrm{ls}} - \theta^*)|_2} = \frac{\varepsilon^\top \Phi\nu}{|\Phi\nu|_2} = \frac{\varepsilon^\top \Phi\nu}{|\nu|_2} = \tilde\varepsilon^\top \frac{\nu}{|\nu|_2} \le \sup_{u \in B_2} \tilde\varepsilon^\top u ,
$$
Moreover, with probability $1 - \delta$, it follows from the last step in the proof of Theorem 1.19 that
$$
\sup_{u \in B_2} (\tilde\varepsilon^\top u)^2 \le 8 \log(6)\, \sigma^2 r + 8 \sigma^2 \log(1/\delta) .
$$
Remark 2.3. If $d \le n$ and $B := \frac{X^\top X}{n}$ has rank $d$, then we have
$$
|\hat\theta^{\mathrm{ls}} - \theta^*|_2^2 \le \frac{\mathrm{MSE}(X\hat\theta^{\mathrm{ls}})}{\lambda_{\min}(B)} ,
$$
Indeed, the fundamental inequality (2.4) would still hold and the bounds on the MSE may be smaller. Indeed, (2.5) can be replaced by
$$
|X\hat\theta^{\mathrm{ls}}_K - X\theta^*|_2^2 \le 2\varepsilon^\top X(\hat\theta^{\mathrm{ls}}_K - \theta^*) \le 2 \sup_{\theta \in K - K} (\varepsilon^\top X\theta) ,
$$
and it has exactly $2d$ vertices $\mathcal{V} = \{e_1, -e_1, \ldots, e_d, -e_d\}$, where $e_j$ is the $j$-th vector of the canonical basis of $\mathbb{R}^d$, defined by
$$
e_j = (0, \ldots, 0, \underbrace{1}_{j\text{th position}}, 0, \ldots, 0)^\top .
$$
Observe now that since $\varepsilon \sim \mathrm{subG}_n(\sigma^2)$, then for any column $X_j$ such that $|X_j|_2 \le \sqrt{n}$, the random variable $\varepsilon^\top X_j \sim \mathrm{subG}(n\sigma^2)$. Therefore, applying Theorem 1.16, we get the bound on $\mathbb{E}\, \mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_K)$ and, for any $t \ge 0$,
$$
\mathbb{P}\big( \mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_K) > t \big) \le \mathbb{P}\Big( \sup_{v \in XK} (\varepsilon^\top v) > nt/4 \Big) \le 2d\, e^{-\frac{nt^2}{32\sigma^2}} .
$$
Note that the proof of Theorem 2.2 also applies to $\hat\theta^{\mathrm{ls}}_{B_1}$ (exercise!) so that $\hat\theta^{\mathrm{ls}}_{B_1}$ benefits from the best of both rates:
$$
\mathbb{E}\, \mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_1}) \lesssim \min\Big( \frac{r}{n},\ \sqrt{\frac{\log d}{n}} \Big) .
$$
This is called an elbow effect. The elbow takes place around $r \simeq \sqrt{n}$ (up to logarithmic terms).
Therefore it is really $\lim_{q \to 0^+} |\theta|_q^q$, but the notation $|\theta|_0^0$ suggests too much that it is always equal to 1.
By extension, denote by $B_0(k)$ the $\ell_0$ ball of $\mathbb{R}^d$, i.e., the set of $k$-sparse vectors, defined by
$$
B_0(k) = \{ \theta \in \mathbb{R}^d : |\theta|_0 \le k \} .
$$
In this section, our goal is to control the MSE of $\hat\theta^{\mathrm{ls}}_K$ when $K = B_0(k)$. Note that computing $\hat\theta^{\mathrm{ls}}_{B_0(k)}$ essentially requires computing $\binom{d}{k}$ least squares estimators, which is an exponential number in $k$. In practice this will be hard (or even impossible), but it is interesting to understand the statistical properties of this estimator and to use them as a benchmark.
Theorem 2.6. Fix a positive integer $k \le d/2$. Let $K = B_0(k)$ be the set of $k$-sparse vectors of $\mathbb{R}^d$ and assume that $\theta^* \in B_0(k)$. Moreover, assume the conditions of Theorem 2.2. Then, for any $\delta > 0$, with probability $1 - \delta$, it holds
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_0(k)}) \lesssim \frac{\sigma^2}{n} \log\binom{d}{2k} + \frac{\sigma^2 k}{n} + \frac{\sigma^2}{n} \log(1/\delta) .
$$
Proof. We begin as in the proof of Theorem 2.2 to get (2.5):
$$
|X\hat\theta^{\mathrm{ls}}_K - X\theta^*|_2^2 \le 2\varepsilon^\top X(\hat\theta^{\mathrm{ls}}_K - \theta^*) = 2 |X\hat\theta^{\mathrm{ls}}_K - X\theta^*|_2\, \frac{\varepsilon^\top X(\hat\theta^{\mathrm{ls}}_K - \theta^*)}{|X(\hat\theta^{\mathrm{ls}}_K - \theta^*)|_2} .
$$

$$
\tilde\varepsilon_S = \Phi_S^\top \varepsilon \sim \mathrm{subG}_{r_S}(\sigma^2) .
$$
Using a union bound, we get for any $t > 0$,
$$
\mathbb{P}\Big( \max_{|S| = 2k} \sup_{u \in B_2^{r_S}} (\tilde\varepsilon_S^\top u)^2 > t \Big) \le \sum_{|S| = 2k} \mathbb{P}\Big( \sup_{u \in B_2^{r_S}} (\tilde\varepsilon_S^\top u)^2 > t \Big) .
$$
It follows from the proof of Theorem 1.19 that for any $|S| \le 2k$,
$$
\mathbb{P}\Big( \sup_{u \in B_2^{r_S}} (\tilde\varepsilon_S^\top u)^2 > t \Big) \le 6^{|S|} e^{-\frac{t}{8\sigma^2}} \le 6^{2k} e^{-\frac{t}{8\sigma^2}} .
$$
How large is $\log\binom{d}{2k}$? It turns out that it is not much larger than $k$.

Lemma 2.7. For any integers $1 \le k \le n$, it holds
$$
\binom{n}{k} \le \Big( \frac{en}{k} \Big)^k .
$$
Observe that
$$
\binom{n}{k+1} = \binom{n}{k}\, \frac{n-k}{k+1} \le \Big( \frac{en}{k} \Big)^k \frac{n-k}{k+1} \le \frac{e^k n^{k+1}}{(k+1)^{k+1}} \Big( 1 + \frac{1}{k} \Big)^k ,
$$

$$
\mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_0(k)}) \lesssim \frac{\sigma^2 k}{n} \log\Big( \frac{ed}{2k} \Big) + \frac{\sigma^2 k}{n} \log(6) + \frac{\sigma^2}{n} \log(1/\delta) .
$$
Note that for any fixed $\delta$, there exists a constant $C_\delta > 0$ such that for any $n \ge 2k$,
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_0(k)}) \le C_\delta\, \frac{\sigma^2 k}{n} \log\Big( \frac{ed}{2k} \Big) .
$$
Comparing this result with Theorem 2.2 with $r = k$, we see that the price to pay for not knowing the support of $\theta^*$, but only its size, is a logarithmic factor in the dimension $d$.
This result immediately leads to the following bound in expectation.

Corollary 2.9. Under the assumptions of Theorem 2.6,
$$
\mathbb{E}\, \mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_0(k)}) \lesssim \frac{\sigma^2 k}{n} \log\Big( \frac{ed}{k} \Big) .
$$
Proof. It follows from (2.6) that for any $H \ge 0$,
$$
\begin{aligned}
\mathbb{E}\, \mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_0(k)}) &= \int_0^\infty \mathbb{P}\big( |X\hat\theta^{\mathrm{ls}}_K - X\theta^*|_2^2 > nu \big)\, \mathrm{d}u \\
&\le H + \int_0^\infty \mathbb{P}\big( |X\hat\theta^{\mathrm{ls}}_K - X\theta^*|_2^2 > n(u + H) \big)\, \mathrm{d}u \\
&\le H + \sum_{j=1}^{2k} \binom{d}{j} 6^{2k} \int_0^\infty e^{-\frac{n(u + H)}{32\sigma^2}}\, \mathrm{d}u \\
&= H + \sum_{j=1}^{2k} \binom{d}{j} 6^{2k} e^{-\frac{nH}{32\sigma^2}}\, \frac{32\sigma^2}{n} .
\end{aligned}
$$
Next, we choose $H$ such that
$$
\sum_{j=1}^{2k} \binom{d}{j} 6^{2k} e^{-\frac{nH}{32\sigma^2}} = 1 .
$$
In particular, it yields
$$
H \lesssim \frac{\sigma^2 k}{n} \log\Big( \frac{ed}{k} \Big) ,
$$
which completes the proof.
The Gaussian Sequence Model is a toy model that has received a lot of
attention, mostly in the eighties. The main reason for its popularity is that
it already carries most of the insight of nonparametric estimation. While the model looks very simple, it allows one to carry deep ideas that extend beyond its framework, and in particular to the linear regression model that we are interested in. Unfortunately, we will only cover a small part of these ideas and
the interested reader should definitely look at the excellent books by A. Tsy-
bakov [Tsy09, Chapter 3] and I. Johnstone [Joh11]. The model is as follows:
$$
Y_i = \theta_i^* + \varepsilon_i , \qquad i = 1, \ldots, d , \qquad (2.7)
$$
where $\varepsilon_1, \ldots, \varepsilon_d$ are i.i.d. $\mathcal{N}(0, \sigma^2)$ random variables. Note that often, $d$ is taken
equal to ∞ in this sequence model and we will also discuss this case. Its links
to nonparametric estimation will become clearer in Chapter 3. The goal here
is to estimate the unknown vector θ∗ .
$$
\frac{X^\top X}{n} = I_d ,
$$
where $I_d$ denotes the identity matrix of $\mathbb{R}^d$.

Assumption ORT allows for cases where $d \le n$ but not $d > n$ (the high-dimensional case) because of obvious rank constraints. In particular, it means that the $d$ columns of $X$ are orthogonal in $\mathbb{R}^n$ and all have norm $\sqrt{n}$.
Under this assumption, it follows from the linear regression model (2.2) that
$$
y := \frac{1}{n} X^\top Y = \frac{X^\top X}{n}\, \theta^* + \frac{1}{n} X^\top \varepsilon = \theta^* + \xi ,
$$

$$
\mathrm{MSE}(X\hat\theta) = (\hat\theta - \theta^*)^\top \frac{X^\top X}{n} (\hat\theta - \theta^*) = |\hat\theta - \theta^*|_2^2 .
$$
$$
y = \theta^* + \xi \in \mathbb{R}^d , \qquad (2.9)
$$
where $\xi \sim \mathrm{subG}_d(\sigma^2/n)$.
In this section, we can actually completely “forget” about our original
model (2.2). In particular we can define this model independently of Assump-
tion ORT and thus for any values of n and d.
The sub-Gaussian sequence model and the Gaussian sequence model are called direct (observation) problems, as opposed to inverse problems where the goal is to estimate the parameter $\theta^*$ only from noisy observations of its image through an operator. The linear regression model is one such inverse problem, where the matrix $X$ plays the role of a linear operator. However, in these notes, we never try to invert the operator. See [Cav11] for an interesting survey on the statistical theory of inverse problems.
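A short simulation of the sequence model (my own sketch, not part of the notes; the dimensions are arbitrary) makes the role of the noise level $\sigma^2/n$ and of the naive rate $\sigma^2 d/n$ transparent.

```python
import numpy as np

# Sketch: Gaussian sequence model y = theta_star + xi with xi ~ N(0, (sigma^2/n) I_d).
rng = np.random.default_rng(5)
d, n, sigma, k = 500, 200, 1.0, 10

theta_star = np.zeros(d)
theta_star[:k] = rng.standard_normal(k)           # a k-sparse signal
xi = (sigma / np.sqrt(n)) * rng.standard_normal(d)
y = theta_star + xi

# Without further structure, the naive estimator theta_hat = y pays sigma^2 * d / n.
print("|y - theta_star|_2^2 =", np.sum((y - theta_star) ** 2))
print("sigma^2 * d / n      =", sigma ** 2 * d / n)
```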
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{ls}}_{B_0(k)}) \le C_\delta\, \frac{\sigma^2 k}{n} \log\Big( \frac{ed}{2k} \Big) .
$$
As we will see, the assumption ORT gives us the luxury of not knowing $k$ and yet adapting to its value. Adaptation means that we can construct an estimator that does not require the knowledge of $k$ (the smallest $k$ such that $|\theta^*|_0 \le k$) and yet performs as well as $\hat\theta^{\mathrm{ls}}_{B_0(k)}$, up to a multiplicative constant.
Let us begin with some heuristic considerations to gain some intuition.
Assume the sub-Gaussian sequence model (2.9). If nothing is known about θ∗
(i) If $|\theta^*|_0 = k$,
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{hrd}}) = |\hat\theta^{\mathrm{hrd}} - \theta^*|_2^2 \lesssim \sigma^2\, \frac{k \log(2d/\delta)}{n} .
$$

$$
\mathrm{supp}(\hat\theta^{\mathrm{hrd}}) = \mathrm{supp}(\theta^*) .
$$
and recall that Theorem 1.14 yields $\mathbb{P}(\mathcal{A}) \ge 1 - \delta$. On the event $\mathcal{A}$, the following holds for any $j = 1, \ldots, d$. First, observe that
$$
|y_j| > 2\tau \;\Rightarrow\; |\theta_j^*| \ge |y_j| - |\xi_j| \ge \tau \qquad (2.12)
$$
and
$$
|y_j| \le 2\tau \;\Rightarrow\; |\theta_j^*| \le |y_j| + |\xi_j| \le 3\tau . \qquad (2.13)
$$
It yields
$$
|\hat\theta_j^{\mathrm{hrd}} - \theta_j^*| \le 4 \min(|\theta_j^*|, \tau) .
$$
It yields
$$
|\hat\theta^{\mathrm{hrd}} - \theta^*|_2^2 = \sum_{j=1}^d |\hat\theta_j^{\mathrm{hrd}} - \theta_j^*|^2 \le 16 \sum_{j=1}^d \min(|\theta_j^*|^2, \tau^2) \le 16\, |\theta^*|_0\, \tau^2 .
$$
Figure 2.2. Transformation applied to $y_j$ with $2\tau = 1$ to obtain the hard (left) and soft (right) thresholding estimators.
Similar results can be obtained for the soft thresholding estimator $\hat\theta^{\mathrm{sft}}$ defined by
$$
\hat\theta_j^{\mathrm{sft}} =
\begin{cases}
y_j - 2\tau & \text{if } y_j > 2\tau , \\
y_j + 2\tau & \text{if } y_j < -2\tau , \\
0 & \text{if } |y_j| \le 2\tau .
\end{cases}
$$
In short, we can write
$$
\hat\theta_j^{\mathrm{sft}} = \Big( 1 - \frac{2\tau}{|y_j|} \Big)_+ y_j .
$$
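Both thresholding maps are one-liners. The sketch below (my own, not from the notes; the choice $2\tau = 2\sigma\sqrt{2\log(2d/\delta)/n}$ is an assumption matching the order of the threshold used in the text) applies them to observations from the sequence model.

```python
import numpy as np

def hard_threshold(y, two_tau):
    """Keep y_j if |y_j| > 2*tau, set it to zero otherwise."""
    return y * (np.abs(y) > two_tau)

def soft_threshold(y, two_tau):
    """Shrink y_j towards zero by 2*tau: (1 - 2*tau/|y_j|)_+ * y_j."""
    return np.sign(y) * np.maximum(np.abs(y) - two_tau, 0.0)

# Example in the sequence model y = theta_star + xi (dimensions are arbitrary).
rng = np.random.default_rng(6)
d, n, sigma, delta = 500, 200, 1.0, 0.05
theta_star = np.zeros(d)
theta_star[:10] = 5.0
y = theta_star + (sigma / np.sqrt(n)) * rng.standard_normal(d)

# Threshold level: an assumed choice of the same order as in the text.
two_tau = 2.0 * sigma * np.sqrt(2.0 * np.log(2.0 * d / delta) / n)
for name, est in (("hard", hard_threshold(y, two_tau)),
                  ("soft", soft_threshold(y, two_tau))):
    print(f"{name}: |theta_hat - theta_star|_2^2 = {np.sum((est - theta_star) ** 2):.4f}, "
          f"support size = {np.sum(est != 0)}")
```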
In view of (2.8), under the assumption ORT, the above variational definitions can be written as
$$
\hat\theta^{\mathrm{hrd}} = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} |Y - X\theta|_2^2 + 4\tau^2 |\theta|_0 \Big\} ,
$$
$$
\hat\theta^{\mathrm{sft}} = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} |Y - X\theta|_2^2 + 4\tau |\theta|_1 \Big\} .
$$
When the assumption ORT is not satisfied, they no longer correspond to thresholding estimators but can still be defined as above. We change the constant in the threshold parameters for future convenience.
Definition 2.12. Fix $\tau > 0$ and assume the linear regression model (2.2). The BIC estimator (see footnote 2) of $\theta^*$ is defined as any $\hat\theta^{\mathrm{bic}}$ such that
$$
\hat\theta^{\mathrm{bic}} \in \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} |Y - X\theta|_2^2 + \tau^2 |\theta|_0 \Big\} .
$$
Instead, the Lasso estimator is defined by a convex problem and there exist many efficient algorithms to compute it. We will not describe this optimization problem in detail but only highlight a few of the best known algorithms:

1. Probably the most popular method among statisticians relies on coordinate gradient descent. It is implemented in the glmnet package in R [FHT10].
2 Note that it minimizes the Bayes Information Criterion (BIC) employed in the traditional literature of asymptotic statistics if $\tau = \sqrt{\log(d)/n}$. We will use the same value below, up to multiplicative constants (it's the price to pay to get non-asymptotic results).
$$
\tau^2 = 16 \log(6)\, \frac{\sigma^2}{n} + 32\, \frac{\sigma^2 \log(ed)}{n} \qquad (2.14)
$$
satisfies
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{bic}}) = \frac{1}{n} |X\hat\theta^{\mathrm{bic}} - X\theta^*|_2^2 \lesssim |\theta^*|_0\, \sigma^2\, \frac{\log(ed/\delta)}{n}
$$
with probability at least $1 - \delta$.
where we use the inequality $2ab \le 2a^2 + \frac{1}{2} b^2$. Together with the previous display, it yields
$$
|X\hat\theta^{\mathrm{bic}} - X\theta^*|_2^2 \le 2n\tau^2 |\theta^*|_0 + 4 \big( \varepsilon^\top \mathcal{U}(\hat\theta^{\mathrm{bic}} - \theta^*) \big)^2 - 2n\tau^2 |\hat\theta^{\mathrm{bic}}|_0 \qquad (2.15)
$$
where
$$
\mathcal{U}(\hat\theta^{\mathrm{bic}} - \theta^*) = \frac{X\hat\theta^{\mathrm{bic}} - X\theta^*}{|X\hat\theta^{\mathrm{bic}} - X\theta^*|_2} .
$$
Next, we need to "sup out" $\hat\theta^{\mathrm{bic}}$. To that end, we decompose the sup into a max over cardinalities as follows:
$$
\sup_{\theta \in \mathbb{R}^d} = \max_{1 \le k \le d}\ \max_{|S| = k}\ \sup_{\mathrm{supp}(\theta) = S} .
$$
Moreover, using the $\varepsilon$-net argument from Theorem 1.19, we get for $|S| = k$,
$$
\begin{aligned}
\mathbb{P}\Big( \sup_{u \in B_2^{r_{S,*}}} \big( \varepsilon^\top \Phi_{S,*} u \big)^2 \ge \frac{t}{4} + \frac{1}{2} n\tau^2 k \Big) &\le 2 \cdot 6^{r_{S,*}} \exp\Big( -\frac{\frac{t}{4} + \frac{1}{2} n\tau^2 k}{8\sigma^2} \Big) \\
&\le 2 \exp\Big( -\frac{t}{32\sigma^2} - \frac{n\tau^2 k}{16\sigma^2} + (k + |\theta^*|_0) \log(6) \Big) \\
&\le \exp\Big( -\frac{t}{32\sigma^2} - 2k \log(ed) + |\theta^*|_0 \log(12) \Big) .
\end{aligned}
$$
It follows from Theorem 2.14 that $\hat\theta^{\mathrm{bic}}$ adapts to the unknown sparsity of $\theta^*$, just like $\hat\theta^{\mathrm{hrd}}$. Moreover, this holds under no assumption on the design matrix $X$.
Theorem 2.15. Assume that the linear model (2.2) holds, where $\varepsilon \sim \mathrm{subG}_n(\sigma^2)$. Moreover, assume that the columns of $X$ are normalized in such a way that $\max_j |X_j|_2 \le \sqrt{n}$. Then the Lasso estimator $\hat\theta^{\mathrm{L}}$ with regularization parameter
$$
2\tau = 2\sigma \sqrt{\frac{2\log(2d)}{n}} + 2\sigma \sqrt{\frac{2\log(1/\delta)}{n}} \qquad (2.16)
$$
satisfies
$$
\mathrm{MSE}(X\hat\theta^{\mathrm{L}}) = \frac{1}{n} |X\hat\theta^{\mathrm{L}} - X\theta^*|_2^2 \le 4 |\theta^*|_1 \sigma \sqrt{\frac{2\log(2d)}{n}} + 4 |\theta^*|_1 \sigma \sqrt{\frac{2\log(1/\delta)}{n}}
$$
with probability at least $1 - \delta$. Moreover, there exists a numerical constant $C > 0$ such that
$$
\mathbb{E}\, \mathrm{MSE}(X\hat\theta^{\mathrm{L}}) \le C |\theta^*|_1 \sigma \sqrt{\frac{\log(2d)}{n}} .
$$
Proof. From the definition of $\hat\theta^{\mathrm{L}}$, it holds
$$
\frac{1}{n} |Y - X\hat\theta^{\mathrm{L}}|_2^2 + 2\tau |\hat\theta^{\mathrm{L}}|_1 \le \frac{1}{n} |Y - X\theta^*|_2^2 + 2\tau |\theta^*|_1 .
$$
Using Hölder's inequality, it implies
$$
\begin{aligned}
|X\hat\theta^{\mathrm{L}} - X\theta^*|_2^2 &\le 2\varepsilon^\top X(\hat\theta^{\mathrm{L}} - \theta^*) + 2n\tau \big( |\theta^*|_1 - |\hat\theta^{\mathrm{L}}|_1 \big) \\
&\le 2|X^\top \varepsilon|_\infty |\hat\theta^{\mathrm{L}}|_1 - 2n\tau |\hat\theta^{\mathrm{L}}|_1 + 2|X^\top \varepsilon|_\infty |\theta^*|_1 + 2n\tau |\theta^*|_1 \\
&= 2\big( |X^\top \varepsilon|_\infty - n\tau \big) |\hat\theta^{\mathrm{L}}|_1 + 2\big( |X^\top \varepsilon|_\infty + n\tau \big) |\theta^*|_1 .
\end{aligned}
$$
Therefore, taking $t = \sigma\sqrt{2n\log(2d)} + \sigma\sqrt{2n\log(1/\delta)} = n\tau$, we get that with probability $1 - \delta$,
$$
|X\hat\theta^{\mathrm{L}} - X\theta^*|_2^2 \le 4n\tau |\theta^*|_1 .
$$
The bound in expectation follows using the same argument as in the proof of Corollary 2.9.
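For computations, the Lasso objective $\frac{1}{n}|Y - X\theta|_2^2 + 2\tau|\theta|_1$ can be minimized, for instance, by proximal gradient descent (ISTA), whose proximal step is exactly the soft thresholding map seen earlier. The sketch below is a minimal implementation of my own, not the coordinate descent algorithm of glmnet mentioned earlier, and the simulated dimensions are arbitrary.

```python
import numpy as np

def lasso_ista(X, Y, tau, n_iter=2000):
    """Minimize (1/n)|Y - X theta|_2^2 + 2*tau*|theta|_1 by proximal gradient."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 * 2.0 / n          # Lipschitz constant of the gradient
    step = 1.0 / L
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ theta - Y)       # gradient of the smooth part
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * 2.0 * tau, 0.0)  # soft thresholding
    return theta

# Example on a sparse linear model (dimensions and constants are arbitrary).
rng = np.random.default_rng(7)
n, d, k, sigma, delta = 100, 300, 5, 0.5, 0.05
X = rng.standard_normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[:k] = 3.0
Y = X @ theta_star + sigma * rng.standard_normal(n)

# Regularization parameter of the same form as (2.16).
tau = sigma * (np.sqrt(2 * np.log(2 * d) / n) + np.sqrt(2 * np.log(1 / delta) / n))
theta_hat = lasso_ista(X, Y, tau)
print("MSE(X theta_hat) =", np.mean((X @ (theta_hat - theta_star)) ** 2))
print("nonzeros in theta_hat:", np.sum(np.abs(theta_hat) > 1e-8))
```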
Incoherence

Assumption INC(k). We say that the design matrix $X$ has incoherence $k$ for some integer $k > 0$ if
$$
\Big| \frac{X^\top X}{n} - I_d \Big|_\infty \le \frac{1}{14k} ,
$$
where $|A|_\infty$ denotes the largest element of $A$ in absolute value. Equivalently,

1. For all $j = 1, \ldots, d$,
$$
\Big| \frac{|X_j|_2^2}{n} - 1 \Big| \le \frac{1}{14k} .
$$
2. For all $1 \le i, j \le d$, $i \ne j$, we have
$$
\Big| \frac{X_i^\top X_j}{n} \Big| \le \frac{1}{14k} .
$$
It implies that there exist matrices that satisfy Assumption INC($k$) for
$$
n \gtrsim k^2 \log(d) ,
$$
Proof. Let $\varepsilon_{ij} \in \{-1, 1\}$ denote the Rademacher random variable in the $i$th row and $j$th column of $X$. Note first that the $j$th diagonal entry of $X^\top X/n$ is given by
$$
\frac{1}{n} \sum_{i=1}^n \varepsilon_{i,j}^2 = 1 .
$$
Moreover, for $j \ne k$, the $(j,k)$th entry of the $d \times d$ matrix $\frac{X^\top X}{n}$ is given by
$$
\frac{1}{n} \sum_{i=1}^n \varepsilon_{i,j} \varepsilon_{i,k} = \frac{1}{n} \sum_{i=1}^n \xi_i^{(j,k)} ,
$$

$$
\begin{aligned}
\mathbb{P}\Big( \Big| \frac{X^\top X}{n} - I_d \Big|_\infty > t \Big) &= \mathbb{P}\Big( \max_{j \ne k} \Big| \frac{1}{n} \sum_{i=1}^n \xi_i^{(j,k)} \Big| > t \Big) \\
&\le \sum_{j \ne k} \mathbb{P}\Big( \Big| \frac{1}{n} \sum_{i=1}^n \xi_i^{(j,k)} \Big| > t \Big) && \text{(union bound)} \\
&\le \sum_{j \ne k} 2 e^{-\frac{nt^2}{2}} && \text{(Hoeffding: Theorem 1.9)} \\
&\le d^2 e^{-\frac{nt^2}{2}} .
\end{aligned}
$$

$$
\mathbb{P}\Big( \Big| \frac{X^\top X}{n} - I_d \Big|_\infty > \frac{1}{14k} \Big) \le d^2 e^{-\frac{n}{392 k^2}} \le \delta
$$
for
$$
n \ge 392 k^2 \log(1/\delta) + 784 k^2 \log(d) .
$$
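The scaling $n \gtrsim k^2 \log d$ can be explored numerically (my own sketch, not from the notes; dimensions are arbitrary): generate a Rademacher design and record $|X^\top X/n - I_d|_\infty$ as $n$ grows, comparing it with the level $1/(14k)$ required by INC($k$).

```python
import numpy as np

# Sketch: maximal deviation of X^T X / n from the identity for a Rademacher design.
rng = np.random.default_rng(8)
d, k = 50, 5

for n in (1_000, 10_000, 100_000):
    X = rng.choice([-1.0, 1.0], size=(n, d))      # i.i.d. Rademacher entries
    G = X.T @ X / n
    dev = np.max(np.abs(G - np.eye(d)))
    print(f"n = {n:6}: |X^T X / n - I|_inf = {dev:.4f}, "
          f"level required for INC(k={k}): {1.0 / (14 * k):.4f}")
```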
it holds
$$
|\theta_S|_2^2 \le 2\, \frac{|X\theta|_2^2}{n} .
$$
Proof. We have
$$
|X\hat\theta^{\mathrm{L}} - X\theta^*|_2^2 + n\tau |\hat\theta^{\mathrm{L}} - \theta^*|_1 \le 2\varepsilon^\top X(\hat\theta^{\mathrm{L}} - \theta^*) + n\tau |\hat\theta^{\mathrm{L}} - \theta^*|_1 + 2n\tau |\theta^*|_1 - 2n\tau |\hat\theta^{\mathrm{L}}|_1 .
$$
Applying Hölder's inequality and using the same steps as in the proof of Theorem 2.15, we get that with probability $1 - \delta$,
$$
\begin{aligned}
|X\hat\theta^{\mathrm{L}} - X\theta^*|_2^2 + n\tau |\hat\theta^{\mathrm{L}} - \theta^*|_1 &\le 2n\tau |\hat\theta^{\mathrm{L}} - \theta^*|_1 + 2n\tau |\theta^*|_1 - 2n\tau |\hat\theta^{\mathrm{L}}|_1 \\
&= 2n\tau |\hat\theta^{\mathrm{L}}_S - \theta^*|_1 + 2n\tau |\theta^*|_1 - 2n\tau |\hat\theta^{\mathrm{L}}_S|_1 \\
&\le 4n\tau |\hat\theta^{\mathrm{L}}_S - \theta^*|_1 , \qquad (2.18)
\end{aligned}
$$
so that $\theta = \hat\theta^{\mathrm{L}} - \theta^*$ satisfies the cone condition (2.17). Using now the Cauchy-Schwarz inequality and Lemma 2.17 respectively, we get, since $|S| \le k$,
$$
|\hat\theta^{\mathrm{L}}_S - \theta^*|_1 \le \sqrt{|S|}\, |\hat\theta^{\mathrm{L}}_S - \theta^*|_2 \le \sqrt{\frac{2k}{n}}\, |X\hat\theta^{\mathrm{L}} - X\theta^*|_2 .
$$
Combining this result with (2.18), we find
$$
|X\hat\theta^{\mathrm{L}} - X\theta^*|_2^2 \le 32\, n k \tau^2 .
$$
Moreover, it yields
$$
|\hat\theta^{\mathrm{L}} - \theta^*|_1 \le 4 \sqrt{\frac{2k}{n}}\, |X\hat\theta^{\mathrm{L}} - X\theta^*|_2 \le 4 \sqrt{\frac{2k}{n}} \sqrt{32\, n k \tau^2} \le 32\, k\tau .
$$
The bound in expectation follows using the same argument as in the proof of Corollary 2.9.
Note that all we required for the proof was not really incoherence but the conclusion of Lemma 2.17:
$$
\inf_{|S| \le k}\ \inf_{\theta \in C_S} \frac{|X\theta|_2^2}{n |\theta_S|_2^2} \ge \kappa . \qquad (2.19)
$$
Problem 2.1. Consider the linear regression model with fixed design with $d \le n$. The ridge regression estimator is employed when $\mathrm{rank}(X^\top X) < d$ but we are interested in estimating $\theta^*$. It is defined for a given parameter $\tau > 0$ by
$$
\hat\theta_\tau^{\mathrm{ridge}} = \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} |Y - X\theta|_2^2 + \tau |\theta|_2^2 \Big\} .
$$
(a) Show that for any $\tau$, $\hat\theta_\tau^{\mathrm{ridge}}$ is uniquely defined and give its closed-form expression.

(b) Compute the bias of $\hat\theta_\tau^{\mathrm{ridge}}$ and show that it is bounded in absolute value by $|\theta^*|_2$.
Problem 2.2. Let $X = (1, Z, \ldots, Z^{d-1})^\top \in \mathbb{R}^d$ be a random vector, where $Z$ is a random variable. Show that the matrix $\mathbb{E}(X X^\top)$ is positive definite if $Z$ admits a probability density with respect to the Lebesgue measure on $\mathbb{R}$.
Problem 2.3. In the proof of Theorem 2.11, show that 4 min(|θj∗ |, τ ) can be
replaced by 3 min(|θj∗ |, τ ), i.e., that on the event A, it holds
Problem 2.4. For any $q > 0$, a vector $\theta \in \mathbb{R}^d$ is said to be in a weak $\ell_q$ ball of radius $R$ if the decreasing rearrangement $|\theta_{[1]}| \ge |\theta_{[2]}| \ge \cdots$ satisfies
$$
|\theta_{[j]}| \le R\, j^{-1/q} .
$$
Problem 2.6. Assume that the linear model (2.2) holds with $\varepsilon \sim \mathrm{subG}_n(\sigma^2)$ and $\theta^* \ne 0$. Show that the modified BIC estimator $\hat\theta$ defined by
$$
\hat\theta \in \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}^d} \Big\{ \frac{1}{n} |Y - X\theta|_2^2 + \lambda |\theta|_0 \log\Big( \frac{ed}{|\theta|_0} \Big) \Big\}
$$
satisfies
$$
\mathrm{MSE}(X\hat\theta) \lesssim |\theta^*|_0\, \sigma^2\, \frac{\log\big( \frac{ed}{|\theta^*|_0} \big)}{n}
$$
with probability .99, for an appropriately chosen $\lambda$. What do you conclude?
Problem 2.7. Assume that the linear model (2.2) holds, where $\varepsilon \sim \mathrm{subG}_n(\sigma^2)$. Moreover, assume the conditions of Theorem 2.2 and that the columns of $X$ are normalized in such a way that $\max_j |X_j|_2 \le \sqrt{n}$. Then the Lasso estimator $\hat\theta^{\mathrm{L}}$ with regularization parameter
$$
2\tau = 8\sigma \sqrt{\frac{2\log(2d)}{n}}
$$
satisfies
$$
|\hat\theta^{\mathrm{L}}|_1 \le C |\theta^*|_1
$$
with probability $1 - (2d)^{-1}$ for some constant $C$ to be specified.
Chapter 3

Misspecified Linear Models

$$
Y_i = f(X_i) + \varepsilon_i , \qquad i = 1, \ldots, n , \qquad (3.1)
$$
Oracle inequalities

As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models.

When employing the least squares estimator $\hat\theta^{\mathrm{ls}}$, we constrain ourselves to estimating functions that are of the form $x \mapsto x^\top \theta$, even though $f$ itself may not be of this form. Therefore, the oracle $\hat f$ is the linear function that is the closest to $f$.

Rather than trying to approximate $f$ by a linear function $f(x) \approx \theta^\top x$, we make the model a bit more general and consider a dictionary $\mathcal{H} = \{\varphi_1, \ldots, \varphi_M\}$ of functions, where $\varphi_j : \mathbb{R}^d \to \mathbb{R}$. In this case, we can actually remove the assumption that $X \in \mathbb{R}^d$. Indeed, the goal is now to estimate $f$ using a linear combination of the functions in the dictionary:
$$
f \approx \varphi_\theta := \sum_{j=1}^M \theta_j \varphi_j .
$$
$$
R(\varphi_{\bar\theta}) \le R(\varphi_\theta) , \qquad \forall\, \theta \in K ,
$$
or
$$
\mathbb{P}\Big( R(\hat f) \le C \inf_{\theta \in K} R(\varphi_\theta) + \phi_{n,M,\delta}(K) \Big) \ge 1 - \delta , \qquad \forall\, \delta > 0 ,
$$
where $\varphi_{\bar\theta}$ denotes the orthogonal projection of $f$ onto the linear span of $\varphi_1, \ldots, \varphi_M$. Since $Y = f + \varepsilon$, we get
It yields
$$
|\varphi_{\hat\theta^{\mathrm{ls}}} - \varphi_{\bar\theta}|_2^2 \le 2\varepsilon^\top (\varphi_{\hat\theta^{\mathrm{ls}}} - \varphi_{\bar\theta}) .
$$
Using the same steps as the ones following equation (2.5) for the well-specified case, we get
$$
|\varphi_{\hat\theta^{\mathrm{ls}}} - \varphi_{\bar\theta}|_2^2 \lesssim \sigma^2\, \frac{M + \log(1/\delta)}{n}
$$
with probability $1 - \delta$. The result of the lemma follows.
$$
\tau^2 = \frac{16\sigma^2}{\alpha n} \log(6eM) , \qquad \alpha \in (0, 1) , \qquad (3.5)
$$
satisfies, for some numerical constant $C > 0$,
$$
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{bic}}}) \le \inf_{\theta \in \mathbb{R}^M} \Big\{ \frac{1+\alpha}{1-\alpha}\, \mathrm{MSE}(\varphi_\theta) + \frac{C\sigma^2}{\alpha(1-\alpha)n} |\theta|_0 \log(eM) \Big\} + \frac{C\sigma^2}{\alpha(1-\alpha)n} \log(1/\delta) .
$$
Proof. Recall that the proof of Theorem 2.14 for the BIC estimator begins as follows:
$$
\frac{1}{n} |Y - \varphi_{\hat\theta^{\mathrm{bic}}}|_2^2 + \tau^2 |\hat\theta^{\mathrm{bic}}|_0 \le \frac{1}{n} |Y - \varphi_\theta|_2^2 + \tau^2 |\theta|_0 .
$$
This is true for any $\theta \in \mathbb{R}^M$. It implies
$$
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{bic}}}) \le \frac{1+\alpha}{1-\alpha}\, \mathrm{MSE}(\varphi_{\bar\theta}) + \frac{C\sigma^2}{\alpha(1-\alpha)n}\, |\bar\theta|_0 \log(eM) + \frac{C\sigma^2}{\alpha(1-\alpha)n} \log(1/\delta)
$$
with probability at least $1 - \delta$.
$$
|\varphi_{\hat\theta^{\mathrm{L}}} - f|_2^2 - |\varphi_\theta - f|_2^2 + n\tau |\hat\theta^{\mathrm{L}} - \theta|_1 \le 2\varepsilon^\top (\varphi_{\hat\theta^{\mathrm{L}}} - \varphi_\theta) + n\tau |\hat\theta^{\mathrm{L}} - \theta|_1 + 2n\tau |\theta|_1 - 2n\tau |\hat\theta^{\mathrm{L}}|_1 . \qquad (3.7)
$$
Next, note that INC($k$) for any $k \ge 1$ implies that $|\varphi_j|_2 \le 2\sqrt{n}$ for all $j = 1, \ldots, M$. Applying Hölder's inequality, using the same steps as in the proof of Theorem 2.15, we get that with probability $1 - \delta$, it holds
$$
2\varepsilon^\top (\varphi_{\hat\theta^{\mathrm{L}}} - \varphi_\theta) \le \frac{n\tau}{2} |\hat\theta^{\mathrm{L}} - \theta|_1 .
$$
Therefore, taking $S = \mathrm{supp}(\theta)$ to be the support of $\theta$, we get that the right-hand side of (3.7) is bounded by
with probability $1 - \delta$.
It implies that either $\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{L}}}) \le \mathrm{MSE}(\varphi_\theta)$ or that
$$
|\hat\theta^{\mathrm{L}}_{S^c} - \theta_{S^c}|_1 \le 3 |\hat\theta^{\mathrm{L}}_S - \theta_S|_1 ,
$$
so that $\theta = \hat\theta^{\mathrm{L}} - \theta$ satisfies the cone condition (2.17). Using now the Cauchy-Schwarz inequality and Lemma 2.17 respectively, and assuming that $|\theta|_0 \le k$, we get
$$
4n\tau |\hat\theta^{\mathrm{L}}_S - \theta|_1 \le 4n\tau \sqrt{|S|}\, |\hat\theta^{\mathrm{L}}_S - \theta|_2 \le 4\tau \sqrt{2n|\theta|_0}\, |\varphi_{\hat\theta^{\mathrm{L}}} - \varphi_\theta|_2 .
$$
Using now the inequality $2ab \le \frac{2}{\alpha} a^2 + \frac{\alpha}{2} b^2$, we get
$$
\begin{aligned}
4n\tau |\hat\theta^{\mathrm{L}}_S - \theta|_1 &\le \frac{16 \tau^2 n |\theta|_0}{\alpha} + \frac{\alpha}{2} |\varphi_{\hat\theta^{\mathrm{L}}} - \varphi_\theta|_2^2 \\
&\le \frac{16 \tau^2 n |\theta|_0}{\alpha} + \alpha |\varphi_{\hat\theta^{\mathrm{L}}} - f|_2^2 + \alpha |\varphi_\theta - f|_2^2 .
\end{aligned}
$$
$$
(1 - \alpha)\, \mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{L}}}) \le (1 + \alpha)\, \mathrm{MSE}(\varphi_\theta) + \frac{16 \tau^2 |\theta|_0}{\alpha} .
$$
To conclude the proof of the bound with high probability, it only remains to divide by $1 - \alpha$ on both sides of the above inequality. The bound in expectation follows using the same argument as in the proof of Corollary 2.9.
Maurey’s argument
From the above section, it seems that the Lasso estimator is strictly better
than the BIC estimator as long as incoherence holds. Indeed, if there is no
sparse θ such that MSE(ϕθ ) is small, Theorem 3.4 is useless. In reality, no
one really believes in the existence of sparse vectors but rather of approximately
sparse vectors. Zipf’s law would instead favor the existence of vectors θ with
absolute coefficients that decay polynomially when ordered from largest to
smallest in absolute value. This is the case for example if θ has a small ℓ1
norm but is not sparse. For such θ, the Lasso estimator still enjoys slow rates
as in Theorem 2.15, which can be easily extended to the misspecified case (see
Problem 3.2). Fortunately, such vectors can be well approximated by sparse
vectors in the following sense: for any vector θ ∈ IRM such that |θ|1 ≤ 1, there
exists a vector θ′ that is sparse and for which MSE(ϕθ′ ) is not much larger
than MSE(ϕθ ). The following theorem quantifies exactly the tradeoff between
sparsity and MSE. It is often attributed to B. Maurey and was published by
Pisier [Pis81]. This is why it is referred to as Maurey’s argument.
Theorem 3.6. Let {ϕ1 , . . . , ϕM } be a dictionary normalized in such a way
that
    max_{1≤j≤M} |ϕ_j|_2 ≤ D\sqrt{n} .
Then for any integer k such that 1 ≤ k ≤ M and any positive R, we have
    min_{θ∈IR^M, |θ|_0 ≤ 2k} MSE(ϕ_θ) ≤ min_{θ∈IR^M, |θ|_1 ≤ R} MSE(ϕ_θ) + \frac{D^2 R^2}{k} .
Proof. Define
    θ̄ ∈ argmin_{θ∈IR^M, |θ|_1 ≤ R} |ϕ_θ − f|_2^2
and decompose θ̄ = θ^{(1)} + θ^{(2)}, where θ^{(1)} collects (say) the first k coordinates
of θ̄ and θ^{(2)} the remaining ones, so that |θ^{(1)}|_0 ≤ k and |θ^{(2)}|_1 ≤ |θ̄|_1 ≤ R.
Let now U ∈ IR^n be a random vector with values in {0, ±Rϕ_1, . . . , ±Rϕ_M}
defined by
    IP( U = R sign(θ^{(2)}_j) ϕ_j ) = \frac{|θ^{(2)}_j|}{R} ,   j = k+1, . . . , M ,
    IP( U = 0 ) = 1 − \frac{|θ^{(2)}|_1}{R} .
Note that IE[U] = ϕ_{θ^{(2)}} and |U|_2 ≤ RD\sqrt{n}. Let now U_1, . . . , U_k be k indepen-
dent copies of U and define
    Ū = \frac{1}{k} \sum_{i=1}^{k} U_i .
Note that Ū = ϕ_{θ̃} for some θ̃ ∈ IR^M such that |θ̃|_0 ≤ k. Therefore, |θ^{(1)} + θ̃|_0 ≤
2k and
    IE|f − ϕ_{θ^{(1)}} − Ū|_2^2 = IE|f − ϕ_{θ^{(1)}} − ϕ_{θ^{(2)}} + ϕ_{θ^{(2)}} − Ū|_2^2
                               = |f − ϕ_{θ^{(1)}} − ϕ_{θ^{(2)}}|_2^2 + IE|ϕ_{θ^{(2)}} − Ū|_2^2
                               = |f − ϕ_{θ̄}|_2^2 + \frac{IE|U − IE[U]|_2^2}{k}
                               ≤ |f − ϕ_{θ̄}|_2^2 + \frac{(RD\sqrt{n})^2}{k} .
To conclude the proof, note that
    IE|f − ϕ_{θ^{(1)}} − Ū|_2^2 = IE|f − ϕ_{θ^{(1)}+θ̃}|_2^2 ≥ min_{θ∈IR^M, |θ|_0 ≤ 2k} |f − ϕ_θ|_2^2
and divide by n.
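The sampling idea behind this proof is easy to reproduce numerically. The following is a small Python sketch, with an arbitrary Gaussian dictionary and an arbitrary target vector (both hypothetical choices); for simplicity it takes θ^{(1)} = 0, i.e., it samples over all coordinates of θ̄.

    import numpy as np

    # Illustration of Maurey's sampling argument (a sketch; the dictionary and the
    # vector theta below are arbitrary choices, not taken from the notes).
    rng = np.random.default_rng(0)
    n, M, k = 500, 200, 10
    Phi = rng.standard_normal((n, M))            # columns have norm ~ sqrt(n), so D ~ 1
    theta = rng.laplace(size=M)
    R = np.abs(theta).sum()

    # Sample k atoms U_1,...,U_k with P(U = R*sign(theta_j)*phi_j) = |theta_j| / R
    probs = np.abs(theta) / R
    idx = rng.choice(M, size=k, p=probs)
    U = R * np.sign(theta[idx]) * Phi[:, idx]    # n x k matrix of sampled atoms
    U_bar = U.mean(axis=1)                       # this is phi_{theta_tilde}, k-sparse

    phi_theta = Phi @ theta
    D = np.max(np.linalg.norm(Phi, axis=0)) / np.sqrt(n)
    mse_gap = np.mean((U_bar - phi_theta) ** 2)  # (1/n)|phi_theta_tilde - phi_theta|_2^2
    # on average over the random draw, mse_gap is at most D^2 R^2 / k
    print(f"empirical gap: {mse_gap:.3f},  Maurey bound D^2 R^2 / k: {D**2 * R**2 / k:.3f}")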
Maurey’s argument implies the following corollary.
Corollary 3.7. Assume that the assumptions of Theorem 3.4 hold and that
the dictionary {ϕ_1, . . . , ϕ_M} is normalized in such a way that
    max_{1≤j≤M} |ϕ_j|_2 ≤ \sqrt{n} .
Then there exists a constant C > 0 such that the BIC estimator satisfies
    MSE(ϕ_{θ̂^{bic}}) ≤ inf_{θ∈IR^M} \Big\{ 2 MSE(ϕ_θ) + C\Big[ \frac{σ^2 |θ|_0 log(eM)}{n} ∧ σ|θ|_1 \sqrt{\frac{log(eM)}{n}} \Big] \Big\} + C \frac{σ^2 log(1/δ)}{n}
with probability at least 1 − δ.
For any θ′ ∈ IR^M, it follows from Maurey's argument that there exists θ ∈ IR^M
such that |θ|_0 ≤ 2|θ′|_0 and
    MSE(ϕ_θ) ≤ MSE(ϕ_{θ′}) + \frac{2|θ′|_1^2}{|θ|_0} .
It implies that
2. If k̄ ≤ 1, then
    |θ′|_1^2 ≤ C \frac{σ^2 log(eM)}{n} ,
which yields
    min_k \Big\{ \frac{|θ′|_1^2}{k} + C \frac{σ^2 k log(eM)}{n} \Big\} ≤ C \frac{σ^2 log(eM)}{n} .
3. If k̄ ≥ M, then
    \frac{σ^2 M log(eM)}{n} ≤ C \frac{|θ′|_1^2}{M} .
Therefore, on the one hand, if M ≥ \frac{|θ′|_1}{σ\sqrt{log(eM)/n}}, we get
    min_k \Big\{ \frac{|θ′|_1^2}{k} + C \frac{σ^2 k log(eM)}{n} \Big\} ≤ C \frac{|θ′|_1^2}{M} ≤ Cσ|θ′|_1 \sqrt{\frac{log(eM)}{n}} .
On the other hand, if M ≤ \frac{|θ′|_1}{σ\sqrt{log(eM)/n}}, then for any θ ∈ IR^M, we have
    \frac{σ^2 |θ|_0 log(eM)}{n} ≤ \frac{σ^2 M log(eM)}{n} ≤ Cσ|θ′|_1 \sqrt{\frac{log(eM)}{n}} .
Note that this last result holds for any estimator that satisfies an oracle
inequality with respect to the ℓ0 norm such as the result of Theorem 3.4. In
particular, this estimator need not be the BIC estimator. An example is the
Exponential Screening estimator of [RT11].
Maurey's argument allows us to enjoy the best of both the ℓ_0 and the
ℓ_1 worlds. The rate adapts to the sparsity of the problem and can even be
generalized to ℓ_q-sparsity (see Problem 3.3). However, it is clear from the proof
that this argument is limited to squared ℓ_2 norms such as the one appearing
in the MSE, and extension to other risk measures is nontrivial. Some work has
been done for non-Hilbert spaces [Pis81, DDGS97] using more sophisticated
arguments.
So far, the oracle inequalities that we have derived do not deal with the
approximation error MSE(ϕθ ). We kept it arbitrary and simply hoped that
it was small. Note also that in the case of linear models, we simply assumed
that the approximation error was zero. As we will see in this section, this
error can be quantified under natural smoothness conditions if the dictionary
of functions H = {ϕ1 , . . . , ϕM } is chosen appropriately. In what follows, we
assume for simplicity that d = 1 so that f : IR → IR and ϕj : IR → IR.
Fourier decomposition
Historically, nonparametric estimation was developed before high-dimensional
statistics and most results hold for the case where the dictionary H = {ϕ1 , . . . , ϕM }
forms an orthonormal system of L_2([0,1]):
    \int_0^1 ϕ_j^2(x)\,dx = 1 ,   \int_0^1 ϕ_j(x)ϕ_k(x)\,dx = 0 ,   ∀ j ≠ k .
Assume now that the regression function f admits the following decompo-
sition
    f = \sum_{j=1}^{∞} θ_j^* ϕ_j .
There exist many choices for the orthonormal system and we give only two
as examples.
Example 3.8. Trigonometric basis. This is an orthonormal basis of L_2([0,1]).
It is defined by
    ϕ_1 ≡ 1 ,
    ϕ_{2k}(x) = \sqrt{2} cos(2πkx) ,
    ϕ_{2k+1}(x) = \sqrt{2} sin(2πkx) ,
for k = 1, 2, . . . and x ∈ [0,1]. The fact that it is indeed an orthonormal system
can be easily checked using trigonometric identities.
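This orthonormality can also be checked numerically. The following short Python sketch approximates the inner products \int_0^1 ϕ_j ϕ_l on a fine grid; the grid size and the number of basis functions are arbitrary choices.

    import numpy as np

    # Numerical sanity check (a sketch) that the trigonometric basis of Example 3.8
    # is orthonormal in L2([0,1]), using a fine Riemann grid on [0,1].
    def phi(j, x):
        # j = 1, 2, 3, ... following the indexing of Example 3.8
        if j == 1:
            return np.ones_like(x)
        k = j // 2
        return np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0
                             else np.sin(2 * np.pi * k * x))

    x = np.linspace(0, 1, 100_000, endpoint=False)
    G = np.array([[np.mean(phi(j, x) * phi(l, x)) for l in range(1, 8)]
                  for j in range(1, 8)])
    print(np.round(G, 3))   # should be close to the 7 x 7 identity matrix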
The next example has received a lot of attention in the signal (sound, image,
. . . ) processing community.
Example 3.9. Wavelets. Let ψ : IR → IR be a sufficiently smooth and
compactly supported function, called “mother wavelet ”. Define the system of
functions
    ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k) ,   j, k ∈ Z .
It can be shown that for a suitable ψ, the dictionary {ψ_{j,k}, j, k ∈ Z} forms an
orthonormal system of L_2([0,1]) and sometimes a basis. In the latter case, for
any function g ∈ L_2([0,1]), it holds
    g = \sum_{j=−∞}^{∞} \sum_{k=−∞}^{∞} θ_{jk} ψ_{jk} ,   θ_{jk} = \int_0^1 g(x) ψ_{jk}(x)\,dx .
[Figure: graph of a mother wavelet ψ(x), taking values between −1 and 1.]
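A concrete, though non-smooth, choice of mother wavelet is the Haar wavelet (cf. Problem 3.4). The following Python sketch checks numerically that a few of the functions ψ_{jk} built from it are orthonormal on [0,1]; the grid and the particular pairs (j, k) are arbitrary choices.

    import numpy as np

    # The Haar mother wavelet and a numerical check (a sketch) that a few functions
    # psi_{jk}(x) = 2^{j/2} psi(2^j x - k) are orthonormal on [0,1].
    def psi(x):
        return np.where((x >= 0) & (x < 0.5), 1.0,
                        np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

    def psi_jk(j, k, x):
        return 2 ** (j / 2) * psi(2 ** j * x - k)

    x = np.linspace(0, 1, 200_000, endpoint=False)
    pairs = [(0, 0), (1, 0), (1, 1), (2, 3)]
    G = np.array([[np.mean(psi_jk(j1, k1, x) * psi_jk(j2, k2, x)) for (j2, k2) in pairs]
                  for (j1, k1) in pairs])
    print(np.round(G, 3))   # approximately the identity matrix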
Definition 3.10. Fix parameters β ∈ {1, 2, . . . } and L > 0. The Sobolev class
of functions W(β, L) is defined by
    W(β, L) = \Big\{ f : [0,1] → IR : f ∈ L_2([0,1]) , f^{(β−1)} is absolutely continuous and
                     \int_0^1 [f^{(β)}]^2 ≤ L^2 ,  f^{(j)}(0) = f^{(j)}(1), j = 0, . . . , β − 1 \Big\}
where θ^* = {θ_j^*}_{j≥1} is in the space of square summable sequences ℓ_2(IN) defined
by
    ℓ_2(IN) = \Big\{ θ : \sum_{j=1}^{∞} θ_j^2 < ∞ \Big\} .
Theorem 3.11. Fix β ≥ 1 and L > 0 and let {ϕ_j}_{j≥1} denote the trigonometric
basis of L_2([0,1]). Moreover, let {a_j}_{j≥1} be defined as in (3.9). A function
f ∈ W(β, L) can be represented as
    f = \sum_{j=1}^{∞} θ_j^* ϕ_j ,
where the sequence {θ_j^*}_{j≥1} belongs to the Sobolev ellipsoid of ℓ_2(IN) defined by
    Θ(β, Q) = \Big\{ θ ∈ ℓ_2(IN) : \sum_{j=1}^{∞} a_j^2 θ_j^2 ≤ Q \Big\}
for Q = L^2/π^{2β}.
Proof. Let us first recall the definition of the Fourier coefficients {s_k(j)}_{k≥1} of
the jth derivative f^{(j)} of f for j = 1, . . . , β:
    s_1(j) = \int_0^1 f^{(j)}(t)\,dt = f^{(j−1)}(1) − f^{(j−1)}(0) = 0 ,
    s_{2k}(j) = \sqrt{2} \int_0^1 f^{(j)}(t) cos(2πkt)\,dt ,
    s_{2k+1}(j) = \sqrt{2} \int_0^1 f^{(j)}(t) sin(2πkt)\,dt .
Moreover,
    s_{2k+1}(β) = \Big[ \sqrt{2} f^{(β−1)}(t) sin(2πkt) \Big]_0^1 − (2πk)\sqrt{2} \int_0^1 f^{(β−1)}(t) cos(2πkt)\,dt
                = −(2πk) s_{2k}(β − 1) .
In particular, it yields
    s_{2k}(β)^2 + s_{2k+1}(β)^2 = (2πk)^2 \big[ s_{2k}(β − 1)^2 + s_{2k+1}(β − 1)^2 \big]
so that θ ∈ Θ(β, L^2/π^{2β}) .
It can actually be shown that the converse is true, that is, any function
with Fourier coefficients in Θ(β, Q) belongs to W(β, L), but we will not be
needing this.
In what follows, we will define smooth functions as functions with Fourier
coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By
extension, we write f ∈ Θ(β, Q) in this case and consider any real value for β.
Proposition 3.12. The Sobolev ellipsoids enjoy the following properties
(i) For any Q > 0,
Proof. Note first that for any j, j′ ∈ {1, . . . , n − 1}, j ≠ j′, the inner product
ϕ_j^⊤ ϕ_{j′} is of the form
    ϕ_j^⊤ ϕ_{j′} = 2 \sum_{s=0}^{n−1} u_j(2πk_j s/n) v_{j′}(2πk_{j′} s/n)
where k_j = ⌊j/2⌋ is the integer part of j/2 and, for any x ∈ IR, u_j(x), v_{j′}(x) ∈
{Re(e^{ix}), Im(e^{ix})}.
Next, observe that if k_j ≠ k_{j′}, we have
    \sum_{s=0}^{n−1} e^{i2πk_j s/n}\, e^{−i2πk_{j′} s/n} = \sum_{s=0}^{n−1} e^{i2π(k_j − k_{j′})s/n} = 0 ,
    a^⊤ a′ = −b^⊤ b′ = 0 ,   b^⊤ a′ = a^⊤ b′ = 0 ,
which implies ϕ_j^⊤ ϕ_{j′} = 0. To conclude the proof, it remains to deal with the
case where k_j = k_{j′}. This can happen in two cases: |j′ − j| = 1 or j′ = j. In
the first case, we have that {u_j(x), v_{j′}(x)} = {Re(e^{ix}), Im(e^{ix})}, i.e., one is a
sin(·) and the other is a cos(·). Therefore,
    \frac{1}{2} ϕ_j^⊤ ϕ_{j′} = a^⊤ a′ + b^⊤ b′ + i\big( b^⊤ a′ − a^⊤ b′ \big) = 0 .
The final case is j = j′ for which, on the one hand,
    \sum_{s=0}^{n−1} e^{i2πk_j s/n}\, e^{i2πk_j s/n} = \sum_{s=0}^{n−1} e^{i4πk_j s/n} = 0
for any fixed M. This truncation leads to a systematic error that vanishes as
M → ∞. We are interested in understanding the rate at which this happens.
The Sobolev assumption allows us to control precisely this error as a function of the
tunable parameter M and the smoothness β.
Lemma 3.14. For any integer M ≥ 1, and f ∈ Θ(β, Q), β > 1/2, it holds
    ‖ϕ^M_{θ^*} − f‖_{L_2}^2 = \sum_{j>M} |θ_j^*|^2 ≤ Q M^{−2β} .                      (3.10)
Proof. Note that for any θ ∈ Θ(β, Q), if β > 1/2, then
    \sum_{j=2}^{∞} |θ_j| = \sum_{j=2}^{∞} a_j |θ_j| \frac{1}{a_j}
                        ≤ \sqrt{ \sum_{j=2}^{∞} a_j^2 θ_j^2 \; \sum_{j=2}^{∞} \frac{1}{a_j^2} }        (by Cauchy-Schwarz)
                        ≤ \sqrt{ Q \sum_{j=1}^{∞} \frac{1}{j^{2β}} } < ∞
where in the last inequality, we used the fact that for the trigonometric basis
|ϕ_j|_2 ≤ \sqrt{2n}, j ≥ 1, regardless of the choice of the design X_1, . . . , X_n. When
θ^* ∈ Θ(β, Q), we have
    \sum_{j≥n} |θ_j^*| = \sum_{j≥n} a_j |θ_j^*| \frac{1}{a_j} ≤ \sqrt{ \sum_{j≥n} a_j^2 |θ_j^*|^2 } \sqrt{ \sum_{j≥n} \frac{1}{a_j^2} } ≲ \sqrt{Q}\, n^{\frac{1}{2} − β} .
Note that the truncated Fourier series ϕ_{θ^*} is an oracle: this is what we see when
we view f through the lens of functions with only low frequency harmonics.
To estimate ϕ_{θ^*}, consider the estimator ϕ_{θ̂^{ls}} where
    θ̂^{ls} ∈ argmin_{θ∈IR^M} \sum_{i=1}^{n} \big( Y_i − ϕ_θ(X_i) \big)^2 ,
which should be such that ϕ_{θ̂^{ls}} is close to ϕ_{θ^*}. For this estimator, we have
proved (Theorem 3.3) an oracle inequality for the MSE that is of the form
    |ϕ^M_{θ̂^{ls}} − f|_2^2 ≤ inf_{θ∈IR^M} |ϕ^M_θ − f|_2^2 + Cσ^2 M log(1/δ) ,   C > 0 .
It yields
    |ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*}|_2^2 ≤ 2(ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*})^⊤ (f − ϕ^M_{θ^*}) + Cσ^2 M log(1/δ)
                                  = 2(ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*})^⊤ \Big( \sum_{j>M} θ_j^* ϕ_j \Big) + Cσ^2 M log(1/δ)
                                  = 2(ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*})^⊤ \Big( \sum_{j≥n} θ_j^* ϕ_j \Big) + Cσ^2 M log(1/δ) ,
where we used Lemma 3.13 in the last equality. Together with (3.11) and
Young's inequality 2ab ≤ αa^2 + b^2/α, a, b ≥ 0, for any α > 0, we get
    2(ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*})^⊤ \Big( \sum_{j≥n} θ_j^* ϕ_j \Big) ≤ α|ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*}|_2^2 + \frac{C}{α} Q n^{2−2β} ,
    |ϕ^M_{θ̂^{ls}} − ϕ^M_{θ^*}|_2^2 ≲ \frac{1}{α(1−α)} Q n^{2−2β} + \frac{σ^2 M}{1−α} log(1/δ)                (3.12)
    ‖ϕ_{θ̂^{ls}} − ϕ_{θ^*}‖_{L_2([0,1])}^2 ≲ n^{1−2β} + σ^2 \frac{M + log(1/δ)}{n} .
Using now Lemma 3.14 and σ^2 ≤ 1, we get
    ‖ϕ_{θ̂^{ls}} − f‖_{L_2([0,1])}^2 ≲ M^{−2β} + n^{1−2β} + \frac{M + σ^2 log(1/δ)}{n} .
Taking M = ⌈n^{\frac{1}{2β+1}}⌉ ≤ n − 1 for n large enough yields
    ‖ϕ_{θ̂^{ls}} − f‖_{L_2([0,1])}^2 ≲ n^{−\frac{2β}{2β+1}} + n^{1−2β} + σ^2 \frac{log(1/δ)}{n} .
To conclude the proof, simply note that for the prescribed β, we have n^{1−2β} ≤
n^{−\frac{2β}{2β+1}}. The bound in expectation can be obtained by integrating the tail
bound.
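The projection estimator analyzed above is straightforward to implement. The following Python sketch fits the first M trigonometric basis functions by least squares with M = ⌈n^{1/(2β+1)}⌉; the target function f, the noise level and the value of β are arbitrary choices made for illustration.

    import numpy as np

    # A sketch of the projection estimator on the trigonometric basis with
    # M = ceil(n^(1/(2*beta+1))). The smooth target f is an arbitrary choice.
    rng = np.random.default_rng(1)

    def trig_basis(x, M):
        cols = [np.ones_like(x)]
        for j in range(2, M + 1):
            k = j // 2
            cols.append(np.sqrt(2) * (np.cos(2 * np.pi * k * x) if j % 2 == 0
                                      else np.sin(2 * np.pi * k * x)))
        return np.column_stack(cols)

    n, sigma, beta = 400, 0.5, 1
    f = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x)
    X = np.arange(n) / n                      # design points on [0,1)
    Y = f(X) + sigma * rng.standard_normal(n)

    M = int(np.ceil(n ** (1 / (2 * beta + 1))))
    Phi = trig_basis(X, M)
    theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    grid = np.linspace(0, 1, 1000, endpoint=False)
    err = np.mean((trig_basis(grid, M) @ theta_ls - f(grid)) ** 2)
    print(f"M = {M},  approximate L2 error = {err:.4f}")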
Adaptive estimation
1
The rate attained by the projection estimator ϕθ̂ls with M = ⌈n 2β+1 ⌉ is actually
optimal so, in this sense, it is a good estimator. Unfortunately, its implementa-
tion requires the knowledge of the smoothness parameter β which is typically
unknown, to determine the level M of truncation. The purpose of adaptive es-
timation is precisely to adapt to the unknown β, that is to build an estimator
3.2. Nonparametric regression 78
2β
that does not depend on β and yet, attains a rate of the order of Cn− 2β+1 (up
to a logarithmic lowdown). To that end, we will use the oracle inequalities for
the BIC and Lasso estimator defined in (3.3) and (3.4) respectively. In view of
Lemma 3.13, the design matrix Φ actually satisfies the assumption ORT when
we work with the trigonometric basis. This has two useful implications:
1. Both estimators are actually thresholding estimators and can therefore
be implemented efficiently
2. The condition INC(k) is automatically satisfied for any k ≥ 1.
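The thresholding implementation alluded to in item 1 can be sketched as follows in Python. Under ORT, both estimators act coordinatewise on the empirical coefficients Φ^⊤Y/n; the thresholds below are written only up to the constants appearing in (3.5) and (3.6), so this is an illustration rather than a faithful implementation.

    import numpy as np

    # Under ORT (Phi^T Phi = n I), the BIC and Lasso estimators act coordinatewise
    # on the empirical coefficients Phi^T Y / n: hard and soft thresholding.
    # The threshold tau is a placeholder, correct only up to constants.
    def bic_threshold(coef, tau):
        # hard thresholding: keep a coefficient only if it exceeds tau in magnitude
        return np.where(np.abs(coef) > tau, coef, 0.0)

    def lasso_threshold(coef, tau):
        # soft thresholding: shrink every coefficient toward zero by tau
        return np.sign(coef) * np.maximum(np.abs(coef) - tau, 0.0)

    # small demo on arbitrary coefficients
    coef = np.array([2.3, -0.1, 0.4, -1.7, 0.05])
    print(bic_threshold(coef, 0.5), lasso_threshold(coef, 0.5))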
These observations lead to the following corollary.
Corollary 3.16. Fix β ≥ (1 + \sqrt{5})/4 ≃ 0.81, Q > 0, δ > 0 and n large enough
to ensure n − 1 ≥ ⌈n^{\frac{1}{2β+1}}⌉. Assume the general regression model (3.1) with
f ∈ Θ(β, Q) and ε ∼ subG_n(σ^2), σ^2 ≤ 1. Let {ϕ_j}_{j=1}^{n−1} be the trigonometric
basis. Denote by ϕ^{n−1}_{θ̂^{bic}} (resp. ϕ^{n−1}_{θ̂^L}) the BIC (resp. Lasso) estimator defined
in (3.3) (resp. (3.4)) over IR^{n−1} with regularization parameter given by (3.5)
(resp. (3.6)). Then ϕ^{n−1}_{θ̂}, where θ̂ ∈ {θ̂^{bic}, θ̂^L}, satisfies with probability 1 − δ,
    ‖ϕ^{n−1}_{θ̂} − f‖_{L_2([0,1])}^2 ≲ n^{−\frac{2β}{2β+1}} + σ^2 \frac{log(1/δ)}{n} .
Moreover,
    IE ‖ϕ^{n−1}_{θ̂} − f‖_{L_2([0,1])}^2 ≲ \Big( \frac{log n}{n} \Big)^{\frac{2β}{2β+1}} .
where we used Young's inequality once again. Choose now α = 1/2 and θ = θ^*_M,
where θ^*_M is equal to θ^* on its first M coordinates and 0 otherwise, so that
ϕ^{n−1}_{θ^*_M} = ϕ^M_{θ^*}. It yields
    |ϕ^{n−1}_{θ̂} − ϕ^{n−1}_{θ^*_M}|_2^2 ≲ |ϕ^{n−1}_{θ^*_M} − f|_2^2 + R(M) ≲ |ϕ^{n−1}_{θ^*_M} − ϕ^{n−1}_{θ^*}|_2^2 + |ϕ^{n−1}_{θ^*} − f|_2^2 + R(M)
    ‖ϕ^{n−1}_{θ̂} − ϕ^{n−1}_{θ^*_M}‖_{L_2([0,1])}^2 ≲ ‖ϕ^{n−1}_{θ^*_M} − ϕ^{n−1}_{θ^*}‖_{L_2([0,1])}^2 + Q n^{1−2β} + \frac{R(M)}{n} .
Moreover, using (3.10), we find that
    ‖ϕ^{n−1}_{θ̂} − f‖_{L_2([0,1])}^2 ≲ M^{−2β} + Q n^{1−2β} + \frac{M}{n} log(en) + \frac{σ^2}{n} log(1/δ) .
To conclude the proof, choose M = ⌈(n/log n)^{\frac{1}{2β+1}}⌉ and observe that the choice
of β ensures that n^{1−2β} ≲ M^{−2β}. This yields the high probability bound. The
bound in expectation is obtained by integrating the tail.
Problem 3.1. Show that the least-squares estimator θ̂^{ls} defined in (3.2) sat-
isfies the following exact oracle inequality:
    IE MSE(ϕ_{θ̂^{ls}}) ≤ inf_{θ∈IR^M} MSE(ϕ_θ) + Cσ^2 \frac{M}{n}
for some constant C to be specified.
Problem 3.2. Assume that ε ∼ subG_n(σ^2) and the vectors ϕ_j are normalized
in such a way that max_j |ϕ_j|_2 ≤ \sqrt{n}. Show that there exists a choice of τ
such that the Lasso estimator θ̂^L with regularization parameter 2τ satisfies the
following exact oracle inequality:
    MSE(ϕ_{θ̂^L}) ≤ inf_{θ∈IR^M} \Big\{ MSE(ϕ_θ) + Cσ|θ|_1 \sqrt{\frac{log M}{n}} \Big\}
with probability at least 1 − M^{−c} for some positive constants C, c.
Problem 3.3. Let {ϕ_1, . . . , ϕ_M} be a dictionary normalized in such a way
that max_j |ϕ_j|_2 ≤ \sqrt{n}. Show that for any integer k such that 1 ≤ k ≤ M, we
have
    min_{θ∈IR^M, |θ|_0 ≤ 2k} MSE(ϕ_θ) ≤ min_{θ∈IR^M, |θ|_{wℓ_q} ≤ 1} MSE(ϕ_θ) + C_q D^2 \frac{\big( k^{\frac{1}{q̄}} − M^{\frac{1}{q̄}} \big)^2}{k} ,
where |θ|_{wℓ_q} denotes the weak ℓ_q norm and q̄ is such that \frac{1}{q} + \frac{1}{q̄} = 1.
Problem 3.4. Show that the trigonometric basis and the Haar system indeed
form an orthonormal system of L2 ([0, 1]).
Problem 3.5. Show that if f ∈ Θ(β, Q) for β > 1/2 and Q > 0, then f is continuous.
Chapter 4
Matrix estimation
Over the past decade or so, matrices have entered the picture of high-dimensional
statistics for several reasons. Perhaps the simplest explanation is that they are
the most natural extension of vectors. While this is true, and we will see exam-
ples where the extension from vectors to matrices is straightforward, matrices
have a much richer structure than vectors allowing “interaction” between their
rows and columns. In particular, while we have been describing simple vectors
in terms of their sparsity, here we can measure the complexity of a matrix by
its rank. This feature was successfully employed in a variety of applications
ranging from multi-task learning to collaborative filtering. This last application
was made popular by the Netflix prize in particular.
In this chapter, we study several statistical problems where the parameter of
interest θ is a matrix rather than a vector. These problems include: multivari-
ate regression, covariance matrix estimation and principal component analysis.
Before getting to these topics, we begin with a quick reminder on matrices and
linear algebra.
Vector norms
The simplest way to treat a matrix is to deal with it as if it were a vector. In
particular, we can extend ℓq norms to matrices:
    |A|_q = \Big( \sum_{ij} |a_{ij}|^q \Big)^{1/q} ,   q > 0 .
The case q = 2 plays a particular role for matrices and |A|_2 is called the
Frobenius norm of A and is often denoted by ‖A‖_F. It is also the Hilbert-
Schmidt norm associated to the inner product ⟨A, B⟩ = Tr(A^⊤ B).
Spectral norms
Let λ = (λ1 , . . . , λr , 0, . . . , 0) be the singular values of a matrix A. We can
define spectral norms on A as vector norms on the vector λ. In particular, for
any q ∈ [1, ∞],
kAkq = |λ|q ,
is called the Schatten q-norm of A. Here again, special cases have special names:
• q = 2: kAk2 = kAkF is the Frobenius norm defined above.
• q = 1: kAk1 = kAk∗ is called the Nuclear norm (or trace norm) of A.
• q = ∞: kAk∞ = λmax (A) = kAkop is called the operator norm (or
spectral norm) of A.
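These norms are easy to compute from the singular values. The following Python sketch does so for an arbitrary small matrix and compares with the corresponding built-in NumPy matrix norms.

    import numpy as np

    # Schatten norms of a matrix via its singular values (a small sketch).
    A = np.array([[3.0, 0.0, 2.0],
                  [1.0, 4.0, 0.0]])
    sing = np.linalg.svd(A, compute_uv=False)

    frobenius = np.sqrt((sing ** 2).sum())   # ||A||_2 = ||A||_F
    nuclear   = sing.sum()                   # ||A||_1 = ||A||_*
    operator  = sing.max()                   # ||A||_inf = ||A||_op

    assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
    assert np.isclose(nuclear,  np.linalg.norm(A, 'nuc'))
    assert np.isclose(operator, np.linalg.norm(A, 2))
    print(frobenius, nuclear, operator)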
We are going to employ these norms to assess the proximity to our matrix
of interest. While the interpretation of vector norms is clear by extension from
the vector case, the meaning of “‖A − B‖_op is small” is not as transparent. The
following subsection provides some inequalities (without proofs) that allow a
better interpretation.
    max_k |λ_k(A) − λ_k(B)| ≤ ‖A − B‖_op ,                                 Weyl (1912)
    \sum_k \big( λ_k(A) − λ_k(B) \big)^2 ≤ ‖A − B‖_F^2 ,                   Hoffman-Wielandt (1953)
    ⟨A, B⟩ ≤ ‖A‖_p ‖B‖_q ,   \frac{1}{p} + \frac{1}{q} = 1 ,  p, q ∈ [1, ∞] ,   Hölder
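These inequalities are easy to illustrate numerically. The following Python sketch checks the Weyl and Hoffman-Wielandt inequalities on two arbitrary random matrices.

    import numpy as np

    # Quick numerical illustration (a sketch) of the Weyl and Hoffman-Wielandt
    # inequalities for the singular values of two random matrices.
    rng = np.random.default_rng(2)
    A = rng.standard_normal((6, 4))
    B = rng.standard_normal((6, 4))
    sa = np.linalg.svd(A, compute_uv=False)   # singular values, sorted decreasingly
    sb = np.linalg.svd(B, compute_uv=False)

    weyl_lhs = np.max(np.abs(sa - sb))
    weyl_rhs = np.linalg.norm(A - B, 2)               # operator norm
    hw_lhs = np.sum((sa - sb) ** 2)
    hw_rhs = np.linalg.norm(A - B, 'fro') ** 2
    print(weyl_lhs <= weyl_rhs, hw_lhs <= hw_rhs)     # both should print True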
the expression levels for d genes. As a result, the regression function in this case,
f(x) = IE[Y |X = x], is a function from IR^d to IR^T. Clearly, f can be estimated
independently for each coordinate, using the tools that we have developed in
the previous chapter. However, we will see that in several interesting scenar-
ios, some structure is shared across coordinates and this information can be
leveraged to yield better prediction bounds.
The model
Throughout this section, we consider the following multivariate linear regres-
sion model:
Y = XΘ∗ + E , (4.1)
where Y ∈ IRn×T is the matrix of observed responses, X is the n × d observed
design matrix (as before), Θ^* ∈ IR^{d×T} is the matrix of unknown parameters and
E ∼ subGn×T (σ 2 ) is the noise matrix. In this chapter, we will focus on the
prediction task, which consists in estimating XΘ∗ .
As mentioned in the foreword of this chapter, we can view this problem as T
(univariate) linear regression problems Y (j) = Xθ∗,(j) +ε(j) , j = 1, . . . , T , where
Y (j) , θ∗,(j) and ε(j) are the jth column of Y, Θ∗ and E respectively. In particu-
lar, an estimator for XΘ∗ can be obtained by concatenating the estimators for
each of the T problems. This approach is the subject of Problem 4.1.
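The following Python sketch illustrates this column-by-column approach on simulated data (all dimensions and distributions are arbitrary choices); it also checks that solving the T problems separately agrees with solving them jointly.

    import numpy as np

    # Sketch of the naive approach described above: estimate X Theta^* by solving
    # the T univariate least squares problems column by column (cf. Problem 4.1).
    rng = np.random.default_rng(3)
    n, d, T = 100, 10, 5
    X = rng.standard_normal((n, d))
    Theta_star = rng.standard_normal((d, T))
    Y = X @ Theta_star + rng.standard_normal((n, T))

    Theta_hat = np.column_stack([
        np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(T)
    ])
    # Solving the T problems jointly gives the same answer here:
    assert np.allclose(Theta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
    print(np.linalg.norm(X @ Theta_hat - X @ Theta_star, 'fro') ** 2 / n)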
The columns of Θ∗ correspond to T different regression tasks. Consider the
following example as a motivation. Assume that the Subway headquarters
want to evaluate the effect of d variables (promotions, day of the week, TV
ads,. . . ) on their sales. To that end, they ask each of their T = 40, 000
restaurants to report their sales numbers for the past n = 200 days. As a
result, franchise j returns to headquarters a vector Y(j) ∈ IRn . The d variables
for each of the n days are already known to headquarters and are stored in
a matrix X ∈ IR^{n×d}. In this case, it may be reasonable to assume that the
same subset of variables has an impact on the sales of each franchise,
though the magnitude of this impact may differ from franchise to franchise. As
a result, one may assume that each of the T columns of Θ^* is sparse and that
they share the same sparsity pattern; in other words, Θ^* is row sparse and of the
form:
          ⎛ 0 0 0 0 ⎞
          ⎜ • • • • ⎟
          ⎜ • • • • ⎟
    Θ^* = ⎜ 0 0 0 0 ⎟ ,
          ⎜ ⋮ ⋮ ⋮ ⋮ ⎟
          ⎜ 0 0 0 0 ⎟
          ⎝ • • • • ⎠
where • indicates a potentially nonzero entry.
It follows from the result of Problem 4.1 that if each task is performed
individually, one may find an estimator Θ̂ such that
    \frac{1}{n} IE‖XΘ̂ − XΘ^*‖_F^2 ≲ σ^2 \frac{kT log(ed)}{n} ,
where k is the number of nonzero coordinates in each column of Θ^*. Recall
that the term log(ed) corresponds to the additional price to pay for not knowing
where the nonzero components are. However, in this case, this should become
easier when the number of tasks grows. This fact was proved in [LPTVDG11].
We will see that we can recover a similar phenomenon when the number of tasks
becomes large, though for a larger number of tasks than in [LPTVDG11].
Indeed, rather than exploiting sparsity, observe that such a matrix Θ∗ has rank
k. This is the kind of structure that we will be predominantly using in this
chapter.
Rather than assuming that the columns of Θ∗ share the same sparsity
pattern, it may be more appropriate to assume that the matrix Θ∗ is low rank
or approximately so. As a result, while the matrix may not be sparse at all,
the fact that it is low rank still materializes the idea that some structure is
shared across different tasks. In this more general setup, it is assumed that the
columns of Θ∗ live in a lower dimensional space. Going back to the Subway
example this amounts to assuming that while there are 40,000 franchises, there
are only a few canonical profiles for these franchises and that all franchises are
linear combinations of these profiles.
Recall that the threshold for the hard thresholding estimator was chosen to
be the level of the noise with high probability. The singular value thresholding
estimator obeys the same rule, except that the norm in which the magnitude of
the noise is measured is adapted to the matrix case. Specifically, the following
lemma will allow us to control the operator norm of the matrix F .
Lemma 4.2. Let A be a d × T random matrix such that A ∼ subG_{d×T}(σ^2).
Then
    ‖A‖_op ≤ 4σ \sqrt{log(12)(d ∨ T)} + 2σ \sqrt{2 log(1/δ)}
with probability 1 − δ.
Proof. This proof follows the same steps as Problem 1.4. Let N_1 be a 1/4-
net for S^{d−1} and N_2 be a 1/4-net for S^{T−1}. It follows from Lemma 1.18
that we can always choose |N_1| ≤ 12^d and |N_2| ≤ 12^T. Moreover, for any
u ∈ S^{d−1}, v ∈ S^{T−1}, it holds
    u^⊤Av ≤ max_{x∈N_1} x^⊤Av + \frac{1}{4} max_{u∈S^{d−1}} u^⊤Av
          ≤ max_{x∈N_1} max_{y∈N_2} x^⊤Ay + \frac{1}{4} max_{x∈N_1} max_{v∈S^{T−1}} x^⊤Av + \frac{1}{4} max_{u∈S^{d−1}} u^⊤Av
          ≤ max_{x∈N_1} max_{y∈N_2} x^⊤Ay + \frac{1}{2} max_{u∈S^{d−1}} max_{v∈S^{T−1}} u^⊤Av .
It yields
    ‖A‖_op ≤ 2 max_{x∈N_1} max_{y∈N_2} x^⊤Ay
Proof. Assume without loss of generality that the singular values of Θ^* and Y
are arranged in a nonincreasing order: λ_1 ≥ λ_2 ≥ . . . and λ̂_1 ≥ λ̂_2 ≥ . . . .
Define the set S = {j : |λ̂_j| > 2τ}.
Observe first that it follows from Lemma 4.2 that ‖F‖_op ≤ τ for τ chosen
as in (4.3) on an event A such that IP(A) ≥ 1 − δ. The rest of the proof is on
A.
Note that it follows from Weyl's inequality that |λ̂_j − λ_j| ≤ ‖F‖_op ≤ τ. It
implies that S ⊂ {j : |λ_j| > τ} and S^c ⊂ {j : |λ_j| ≤ 3τ}.
Next define the oracle Θ̄ = \sum_{j∈S} λ_j u_j v_j^⊤ and note that
Moreover,
Therefore,
    ‖Θ̂^{svt} − Θ̄‖_F^2 ≤ 72|S|τ^2 = 72 \sum_{j∈S} τ^2 .
    = 432 rank(Θ^*) τ^2 .
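The singular value thresholding estimator itself is a one-line computation once an SVD is available. The following Python sketch applies it to a simulated low-rank matrix observed in noise; the threshold is an ad hoc choice of the order of the operator norm of the noise, not the exact constant of (4.3).

    import numpy as np

    # Sketch of singular value thresholding: observe y = Theta^* + F (low-rank
    # signal plus noise) and keep only the singular values above 2*tau.
    rng = np.random.default_rng(4)
    d, T, r, sigma = 50, 40, 3, 1.0
    U = rng.standard_normal((d, r))
    V = rng.standard_normal((T, r))
    Theta_star = 3 * U @ V.T                          # rank-r signal, well above the noise
    y = Theta_star + sigma * rng.standard_normal((d, T))

    tau = sigma * (np.sqrt(d) + np.sqrt(T))           # ad hoc, of the order of ||F||_op
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    keep = s > 2 * tau
    Theta_svt = (u[:, keep] * s[keep]) @ vt[keep, :]
    print("estimated rank:", keep.sum(),
          " relative error:", np.linalg.norm(Theta_svt - Theta_star, 'fro')
          / np.linalg.norm(Theta_star, 'fro'))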
In the next subsection, we extend our analysis to the case where X does not
necessarily satisfy the assumption ORT.
Penalization by rank
The estimator from this section is the counterpart of the BIC estimator in the
spectral domain. However, we will see that unlike BIC, it can be computed
efficiently.
Let Θ̂^{rk} be any solution to the following minimization problem:
    min_{Θ∈IR^{d×T}} \Big\{ \frac{1}{n}‖Y − XΘ‖_F^2 + 2τ^2 rank(Θ) \Big\} .
This estimator is called estimator by rank penalization with regularization pa-
rameter τ 2 . It enjoys the following property.
Theorem 4.4. Consider the multivariate linear regression model (4.1). Then,
the estimator by rank penalization Θ̂^{rk} with regularization parameter τ^2, where
τ is defined in (4.3), satisfies
    \frac{1}{n}‖XΘ̂^{rk} − XΘ^*‖_F^2 ≤ 8 rank(Θ^*) τ^2 ≲ \frac{σ^2 rank(Θ^*)}{n} \big[ d ∨ T + log(1/δ) \big]
with probability 1 − δ.
Proof. We begin as usual by noting that
kY − XΘ̂rk k2F + 2nτ 2 rank(Θ̂rk ) ≤ kY − XΘ∗ k2F + 2nτ 2 rank(Θ∗ ) ,
which is equivalent to
kXΘ̂rk − XΘ∗ k2F ≤ 2hE, XΘ̂rk − XΘ∗ i − 2nτ 2 rank(Θ̂rk ) + 2nτ 2 rank(Θ∗ ) .
Next, by Young's inequality, we have
    2⟨E, XΘ̂^{rk} − XΘ^*⟩ = 2⟨E, U⟩ ‖XΘ̂^{rk} − XΘ^*‖_F ≤ 2⟨E, U⟩^2 + \frac{1}{2}‖XΘ̂^{rk} − XΘ^*‖_F^2 ,
where
    U = \frac{XΘ̂^{rk} − XΘ^*}{‖XΘ̂^{rk} − XΘ^*‖_F} .
Write
    XΘ̂^{rk} − XΘ^* = ΦN ,
where Φ is an n × r, r ≤ d, matrix whose columns form an orthonormal basis of the
column span of X. The matrix Φ can come from the SVD of X, for example:
X = ΦΛΨ^⊤. It yields
    U = \frac{ΦN}{‖N‖_F}
and
    ‖XΘ̂^{rk} − XΘ^*‖_F^2 ≤ 4⟨Φ^⊤E, N/‖N‖_F⟩^2 − 4nτ^2 rank(Θ̂^{rk}) + 4nτ^2 rank(Θ^*) .       (4.5)
Note that rank(N) ≤ rank(Θ̂^{rk}) + rank(Θ^*). Therefore, by Hölder's in-
equality, we get
    ⟨E, U⟩^2 = ⟨Φ^⊤E, N/‖N‖_F⟩^2
             ≤ ‖Φ^⊤E‖_op^2 \frac{‖N‖_1^2}{‖N‖_F^2}
             ≤ rank(N) ‖Φ^⊤E‖_op^2
             ≤ ‖Φ^⊤E‖_op^2 \big[ rank(Θ̂^{rk}) + rank(Θ^*) \big] .
It follows from Theorem 4.4 that the estimator by rank penalization enjoys
the same properties as the singular value thresholding estimator even when X
does not satisfy the ORT condition. This is reminiscent of the BIC estimator
which enjoys the same properties as the hard thresholding estimator. However
this analogy does not extend to computational questions. Indeed, while the
rank penalty, just like the sparsity penalty, is not convex, it turns out that
XΘ̂rk can be computed efficiently.
Note first that
    min_{Θ∈IR^{d×T}} \Big\{ \frac{1}{n}‖Y − XΘ‖_F^2 + 2τ^2 rank(Θ) \Big\} = min_k min_{Θ∈IR^{d×T}, rank(Θ) ≤ k} \Big\{ \frac{1}{n}‖Y − XΘ‖_F^2 + 2τ^2 k \Big\} .
can be solved efficiently. To that end, let Ȳ = X(X⊤ X)† X⊤ Y denote the orthog-
onal projection of Y onto the image space of X: this is a linear operator from
IRd×T into IRn×T . By the Pythagorean theorem, we get for any Θ ∈ IRd×T ,
kY − XΘk2F = kY − Ȳk2F + kȲ − XΘk2F .
Next consider the SVD of Ȳ:
    Ȳ = \sum_j λ_j u_j v_j^⊤ .
Indeed,
    ‖Ȳ − Ỹ‖_F^2 = \sum_{j>k} λ_j^2 ,
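The resulting procedure can be sketched as follows in Python: project Y onto the column span of X, then truncate its SVD at the level k that minimizes the penalized criterion. The simulated data and the choice of τ (of the order of the noise level, not the exact constant of (4.3)) are arbitrary.

    import numpy as np

    # Sketch of the efficient computation of X Theta_hat^rk described above.
    def rank_penalized_fit(Y, X, tau):
        n = X.shape[0]
        Y_bar = X @ np.linalg.pinv(X) @ Y                    # projection onto im(X)
        u, s, vt = np.linalg.svd(Y_bar, full_matrices=False)
        const = np.linalg.norm(Y - Y_bar, 'fro') ** 2        # does not depend on Theta
        tails = np.array([np.sum(s[k:] ** 2) for k in range(len(s) + 1)])
        crit = (const + tails) / n + 2 * tau ** 2 * np.arange(len(s) + 1)
        k = int(np.argmin(crit))
        return (u[:, :k] * s[:k]) @ vt[:k, :]                # = X Theta_hat^rk

    # Example with an arbitrary low-rank Theta^*:
    rng = np.random.default_rng(5)
    n, d, T, r, sigma = 100, 20, 30, 2, 1.0
    X = rng.standard_normal((n, d))
    Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
    Y = X @ Theta_star + sigma * rng.standard_normal((n, T))
    tau = 2 * sigma * np.sqrt((d + T) / n)                   # ad hoc noise-level choice
    fit = rank_penalized_fit(Y, X, tau)
    print(np.linalg.norm(fit - X @ Theta_star, 'fro') ** 2 / n)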
with probability 1 − δ.
Proof. Observe first that without loss of generality we can assume that Σ = I_d.
Indeed, note that since IE[XX^⊤] = Σ ≻ 0, then X ∼ subG_d(‖Σ‖_op). Moreover,
Y = Σ^{−1/2}X ∼ subG_d(1) and IE[Y Y^⊤] = Σ^{−1/2}ΣΣ^{−1/2} = I_d. Therefore,
    \frac{‖Σ̂ − Σ‖_op}{‖Σ‖_op} = \frac{‖\frac{1}{n}\sum_{i=1}^n X_iX_i^⊤ − Σ‖_op}{‖Σ‖_op}
                              ≤ \frac{‖Σ^{1/2}‖_op ‖\frac{1}{n}\sum_{i=1}^n Y_iY_i^⊤ − I_d‖_op ‖Σ^{1/2}‖_op}{‖Σ‖_op}
                              = ‖\frac{1}{n}\sum_{i=1}^n Y_iY_i^⊤ − I_d‖_op .
Let N be a 1/4-net for S^{d−1} such that |N| ≤ 12^d. It follows from the proof of
Lemma 4.2 that
    ‖Σ̂ − I_d‖_op ≤ 2 max_{x,y∈N} x^⊤(Σ̂ − I_d)y .
It holds
    x^⊤(Σ̂ − I_d)y = \frac{1}{n} \sum_{i=1}^n \big\{ (X_i^⊤x)(X_i^⊤y) − IE[(X_i^⊤x)(X_i^⊤y)] \big\} .
It yields that
    (X_i^⊤x)(X_i^⊤y) − IE[(X_i^⊤x)(X_i^⊤y)] ∼ subE(16) .
    \frac{t}{32} ≥ \frac{2d}{n} log(144) + \frac{2}{n} log(1/δ)  ∨  \Big( \frac{2d}{n} log(144) + \frac{2}{n} log(1/δ) \Big)^{1/2} .
This concludes our proof.
Theorem 4.6 indicates that for fixed d, the empirical covariance matrix is a
consistent estimator of Σ (in any norm as they are all equivalent in finite dimen-
sion). However, the bound that we got is not satisfactory in high-dimensions
when d ≫ n. To overcome this limitation, we can introduce sparsity as we have
done in the case of regression. The most obvious way to do so is to assume
that few of the entries of Σ are non zero and it turns out that in this case
thresholding is optimal. There is a long line of work on this subject (see for
example [CZZ10] and [CZ12]).
Once we have a good estimator of Σ, what can we do with it? The key
insight is that Σ contains information about the projection of the vector X
onto any direction u ∈ S^{d−1}. Indeed, we have that Var(X^⊤u) = u^⊤Σu, which
can be readily estimated by \widehat{Var}(X^⊤u) = u^⊤Σ̂u. Observe that it follows from
Theorem 4.6 that
    |\widehat{Var}(X^⊤u) − Var(X^⊤u)| = |u^⊤(Σ̂ − Σ)u|
                                      ≤ ‖Σ̂ − Σ‖_op
                                      ≲ ‖Σ‖_op \Big( \sqrt{\frac{d + log(1/δ)}{n}} ∨ \frac{d + log(1/δ)}{n} \Big)
with probability 1 − δ.
The above fact is useful in the Markowitz theory of portfolio selection for
example [Mar52], where a portfolio of assets is a vector u ∈ IR^d such that
|u|_1 = 1 and the risk of a portfolio is given by the variance Var(X^⊤u). The
goal is then to maximize reward subject to risk constraints. In most instances,
the empirical covariance matrix is plugged into the formula in place of Σ.
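The following Python sketch illustrates this plug-in use of the empirical covariance matrix for an equally weighted portfolio; the covariance matrix and the sample size are arbitrary choices.

    import numpy as np

    # Sketch: estimate the risk (variance) of a portfolio u from n observations of X
    # by plugging the empirical covariance matrix into u^T Sigma u.
    rng = np.random.default_rng(6)
    n, d = 500, 10
    Sigma = np.eye(d) + 0.3                      # arbitrary covariance for illustration
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

    Sigma_hat = X.T @ X / n                      # empirical covariance (mean-zero case)
    u = np.full(d, 1.0 / d)                      # equally weighted portfolio, |u|_1 = 1
    risk_hat = u @ Sigma_hat @ u
    print(f"estimated risk {risk_hat:.3f}  vs  true risk {u @ Sigma @ u:.3f}")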
This model is often called the spiked covariance model. By a simple rescaling,
it is equivalent to the following definition.
Definition 4.7. A covariance matrix Σ ∈ IRd×d is said to satisfy the spiked
covariance model if it is of the form
Σ = θvv ⊤ + Id ,
Figure 4.1. Projection onto two dimensions of 1, 387 points from gene expression data.
Source: Gene expression blog.
This model can be extended to more than one spike but this extension is
beyond the scope of these notes.
Clearly, under the spiked covariance model, v is the eigenvector of the
matrix Σ that is associated to its largest eigenvalue 1 + θ. We will refer to
this vector simply as the largest eigenvector. To estimate it, a natural candidate
is the largest eigenvector v̂ of Σ̃, where Σ̃ is any estimator of Σ. There is a
caveat: by symmetry, if u is an eigenvector of a symmetric matrix, then −u is
also an eigenvector associated to the same eigenvalue. Therefore, we may only
estimate v up to a sign flip. To overcome this limitation, it is often useful to
describe proximity between two vectors u and v in terms of the principal angle
between their linear spans. Let us recall that for two unit vectors the principal
angle between their linear spans is denoted by ∠(u, v) and defined as
    ∠(u, v) = arccos(|u^⊤v|) .
The following result from perturbation theory is known as the Davis-Kahan
sin(θ) theorem as it bounds the sine of the principal angle between eigenspaces.
This theorem exists in much more general versions that extend beyond one-
dimensional eigenspaces.
Theorem 4.8 (Davis-Kahan sin(θ) theorem). Let Σ satisfy the spiked covari-
ance model and let Σ̃ be any PSD estimator of Σ. Let ṽ denote the largest
eigenvector of Σ̃. Then we have
    min_{ε∈{±1}} |εṽ − v|_2^2 ≤ 2 sin^2\big( ∠(ṽ, v) \big) ≤ \frac{8}{θ^2} ‖Σ̃ − Σ‖_op^2 .
Proof. Note that for any u ∈ S d−1 , it holds under the spiked covariance model
that
u⊤ Σu = 1 + θ(v ⊤ u)2 = 1 + θ cos2 (∠(u, v)) .
Therefore,
v ⊤ Σv − ṽ ⊤ Σṽ = θ[1 − cos2 (∠(ṽ, v))] = θ sin2 (∠(ṽ, v)) .
Next, observe that
    v^⊤Σv − ṽ^⊤Σṽ = v^⊤Σ̃v − ṽ^⊤Σṽ − v^⊤\big(Σ̃ − Σ\big)v
                   ≤ ṽ^⊤Σ̃ṽ − ṽ^⊤Σṽ − v^⊤\big(Σ̃ − Σ\big)v
                   = ⟨Σ̃ − Σ, ṽṽ^⊤ − vv^⊤⟩                                  (4.8)
                   ≤ ‖Σ̃ − Σ‖_op ‖ṽṽ^⊤ − vv^⊤‖_1                            (Hölder)
                   ≤ \sqrt{2} ‖Σ̃ − Σ‖_op ‖ṽṽ^⊤ − vv^⊤‖_F                   (Cauchy-Schwarz) ,
where in the first inequality, we used the fact that ṽ is the largest eigenvector
of Σ̃ and in the last one, we used the fact that the matrix ṽṽ ⊤ − vv ⊤ has rank
at most 2.
Next, we have that
kṽṽ ⊤ − vv ⊤ k2F = 2(1 − (v ⊤ ṽ)2 ) = 2 sin2 (∠(ṽ, v)) .
Therefore, we have proved that
θ sin2 (∠(ṽ, v)) ≤ 2kΣ̃ − Σkop sin(∠(ṽ, v)) ,
so that
    sin\big( ∠(ṽ, v) \big) ≤ \frac{2}{θ} ‖Σ̃ − Σ‖_op .
To conclude the proof, it remains to check that
min |εṽ − v|22 = 2 − 2|ṽ ⊤ v| ≤ 2 − 2(ṽ ⊤ v)2 = 2 sin2 (∠(ṽ, v)) .
ε∈{±1}
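The following Python sketch simulates the spiked covariance model, estimates v by the top eigenvector of the empirical covariance matrix, and checks the bound of Theorem 4.8; all parameter values are arbitrary choices.

    import numpy as np

    # Sketch: spiked covariance model Sigma = theta v v^T + I_d, top eigenvector of
    # the empirical covariance, and the Davis-Kahan bound of Theorem 4.8.
    rng = np.random.default_rng(7)
    d, n, theta = 20, 2000, 2.0
    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    Sigma = theta * np.outer(v, v) + np.eye(d)

    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Sigma_hat = X.T @ X / n
    eigval, eigvec = np.linalg.eigh(Sigma_hat)
    v_tilde = eigvec[:, -1]                                  # largest eigenvector

    err = min(np.linalg.norm(v_tilde - v) ** 2, np.linalg.norm(-v_tilde - v) ** 2)
    bound = 8 / theta ** 2 * np.linalg.norm(Sigma_hat - Sigma, 2) ** 2
    print(f"min_eps |eps*v_tilde - v|^2 = {err:.4f}  <=  bound {bound:.4f}")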
Sparse PCA
In the example of Figure 4.1, it may be desirable to interpret the meaning of
the two directions denoted by PC1 and PC2. We know that they are linear
combinations of the original 500,000 gene expression levels. A natural question
to ask is whether only a subset of these genes could suffice to obtain similar
results. Such a discovery could have potentially interesting scientific applications
as it would point to a few genes responsible for disparities between European
populations.
In the case of the spiked covariance model, this amounts to having a sparse v.
Beyond interpretability, as we just discussed, sparsity should also lead
to statistical stability as in the case of sparse linear regression for example.
To enforce sparsity, we will assume that v in the spiked covariance model is
k-sparse: |v|0 = k. Therefore, a natural candidate to estimate v is given by v̂
defined by
    v̂^⊤ Σ̂ v̂ = max_{u∈S^{d−1}, |u|_0 = k} u^⊤ Σ̂ u .
It is easy to check that λ^k_{max}(Σ̂) = v̂^⊤Σ̂v̂ is the largest of all leading eigenvalues
among all k × k sub-matrices of Σ̂ so that the maximum is indeed attained,
though there may be several maximizers. We call λ^k_{max}(Σ̂) the k-sparse leading
eigenvalue of Σ̂ and v̂ a k-sparse leading eigenvector.
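For small d, the k-sparse leading eigenvector can be computed by brute force over all supports of size k, as in the following Python sketch (the simulated spike is an arbitrary choice); this only illustrates the definition and is not an efficient algorithm.

    import numpy as np
    from itertools import combinations

    # Sketch of the k-sparse leading eigenvector: brute-force search over all
    # size-k supports (feasible only for small d).
    def sparse_pca(Sigma_hat, k):
        d = Sigma_hat.shape[0]
        best_val, best_vec = -np.inf, None
        for S in combinations(range(d), k):
            S = list(S)
            vals, vecs = np.linalg.eigh(Sigma_hat[np.ix_(S, S)])
            if vals[-1] > best_val:
                best_val = vals[-1]
                best_vec = np.zeros(d)
                best_vec[S] = vecs[:, -1]
        return best_val, best_vec      # k-sparse leading eigenvalue and eigenvector

    # Small example with a 3-sparse spike (arbitrary choices):
    rng = np.random.default_rng(8)
    d, n, k, theta = 12, 300, 3, 2.0
    v = np.zeros(d); v[:k] = 1 / np.sqrt(k)
    X = rng.multivariate_normal(np.zeros(d), theta * np.outer(v, v) + np.eye(d), size=n)
    val, v_hat = sparse_pca(X.T @ X / n, k)
    print("recovered support:", np.nonzero(v_hat)[0])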
Theorem 4.10. Let X_1, . . . , X_n be n i.i.d. copies of a sub-Gaussian random
vector X ∈ IR^d such that IE[XX^⊤] = Σ and X ∼ subG_d(‖Σ‖_op). Assume
further that Σ = θvv^⊤ + I_d satisfies the spiked covariance model for v such
that |v|_0 = k ≤ d/2. Then, the k-sparse largest eigenvector v̂ of the empirical
covariance matrix satisfies
    min_{ε∈{±1}} |εv̂ − v|_2 ≲ \frac{1+θ}{θ} \Big( \sqrt{\frac{k log(ed/k) + log(1/δ)}{n}} ∨ \frac{k log(ed/k) + log(1/δ)}{n} \Big)
with probability 1 − δ.
Since both v̂ and v are k-sparse, there exists a (random) set S ⊂ {1, . . . , d}
such that |S| ≤ 2k and {v̂v̂^⊤ − vv^⊤}_{ij} = 0 if (i, j) ∉ S^2. It yields
where for any d × d matrix M, we defined the matrix M(S) to be the |S| × |S|
sub-matrix of M with rows and columns indexed by S and, for any vector
x ∈ IR^d, x(S) ∈ IR^{|S|} denotes the sub-vector of x with coordinates indexed by
S. It yields, by Hölder's inequality, that
Following the same steps as in the proof of Theorem 4.8, we get now that
    min_{ε∈{±1}} |εv̂ − v|_2^2 ≤ 2 sin^2\big( ∠(v̂, v) \big) ≤ \frac{8}{θ^2} \sup_{S : |S|=2k} ‖Σ̂(S) − Σ(S)‖_op^2 ,
where we used (4.7) in the second inequality. Using Lemma 2.7, we get that
the right-hand side above is further bounded by
    exp\Big[ −\frac{n}{2} \Big( \big(\frac{t}{32}\big)^2 ∧ \frac{t}{32} \Big) + 2k log(144) + k log\frac{ed}{2k} \Big] .
Choosing now t such that
    t ≥ C \Big( \sqrt{\frac{k log(ed/k) + log(1/δ)}{n}} ∨ \frac{k log(ed/k) + log(1/δ)}{n} \Big) ,
for large enough C ensures that the desired bound holds with probability at
least 1 − δ.
Problem 4.1. Using the results of Chapter 2, show that the following holds
for the multivariate regression model (4.1).
1. There exists an estimator Θ̂ ∈ IR^{d×T} such that
    \frac{1}{n}‖XΘ̂ − XΘ^*‖_F^2 ≲ σ^2 \frac{rT}{n}
with probability .99, where r denotes the rank of X.
2. There exists an estimator Θ̂ ∈ IR^{d×T} such that
    \frac{1}{n}‖XΘ̂ − XΘ^*‖_F^2 ≲ σ^2 \frac{|Θ^*|_0 log(ed)}{n}
with probability .99.
Problem 4.2. Consider the multivariate regression model (4.1) where Y has
SVD:
    Y = \sum_j λ̂_j û_j v̂_j^⊤ .
Let M̂ be defined by
    M̂ = \sum_j λ̂_j 1I(|λ̂_j| > 2τ) û_j v̂_j^⊤ ,   τ > 0 .
Problem 4.3. Consider the multivariate regression model (4.1) and define Θ̂
to be any solution to the minimization problem
    min_{Θ∈IR^{d×T}} \Big\{ \frac{1}{n}‖Y − XΘ‖_F^2 + τ‖XΘ‖_1 \Big\}
where λ^*_1 ≥ λ^*_2 ≥ . . . and λ̂_1 ≥ λ̂_2 ≥ . . . are the singular values
of XΘ^* and Y respectively and the SVD of Y is given by
    Y = \sum_j λ̂_j û_j v̂_j^⊤ .
In the previous chapters, we have proved several upper bounds and the goal of
this chapter is to assess their optimality. Specifically, our goal is to answer the
following questions:
1. Can our analysis be improved? In other words: do the estimators that
we have studied actually satisfy better bounds?
2. Can any estimator improve upon these bounds?
Both questions ask about some form of optimality. The first one is about
optimality of an estimator, whereas the second one is about optimality of a
bound.
The difficulty of these questions varies depending on whether we are looking
for a positive or a negative answer. Indeed, a positive answer to these questions
simply consists in finding a better proof for the estimator we have studied
(question 1.) or simply finding a better estimator, together with a proof that
it performs better (question 2.). A negative answer is much more arduous.
For example, in question 2., it is a statement about all estimators. How can
this be done? The answer lies in information theory (see [CT06] for a nice
introduction).
In this chapter, we will see how to give a negative answer to question 2. It
will imply a negative answer to question 1.
Table 5.1. Rates φ(Θ) for different choices of Θ, with the estimator and the result from Chapter 2 used to obtain them.

    Θ          φ(Θ)                          Estimator                 Result
    IR^d       σ^2 d / n                     θ̂^{ls}                   Theorem 2.2
    B_1        σ \sqrt{log d / n}            θ̂^{ls}_{B_1}             Theorem 2.4
    B_0(k)     σ^2 k log(ed/k) / n           θ̂^{ls}_{B_0(k)}          Corollaries 2.8-2.9
    Y_i = θ_i^* + ε_i ,   i = 1, . . . , d ,                                           (5.1)
where ε = (ε_1, . . . , ε_d)^⊤ ∼ N_d(0, \frac{σ^2}{n} I_d), θ^* = (θ_1^*, . . . , θ_d^*)^⊤ ∈ Θ is the parameter
of interest and Θ ⊂ IR^d is a given set of parameters. We will need a more precise
notation for probabilities and expectations throughout this chapter. Denote by
IP_{θ^*} and IE_{θ^*} the probability measure and corresponding expectation that are
associated to the distribution of Y from the GSM (5.1).
Recall that GSM is a special case of the linear regression model when the
design matrix satisfies the ORT condition. In this case, we have proved several
performance guarantees (upper bounds) for various choices of Θ that can be
expressed either in the form
    IE|θ̂_n − θ^*|_2^2 ≤ Cφ(Θ)                                                         (5.2)
or the form
    |θ̂_n − θ^*|_2^2 ≤ Cφ(Θ) ,   with prob. 1 − d^{−2} ,                               (5.3)
for some constant C. The rates φ(Θ) for different choices of Θ that we have
obtained are gathered in Table 5.1 together with the estimator (and the corre-
sponding result from Chapter 2) that was employed to obtain this rate. Can
any of these results be improved? In other words, does there exist another
estimator θ̃ such that sup_{θ^*∈Θ} IE|θ̃ − θ^*|_2^2 ≪ φ(Θ)?
A first step in this direction is the Cramér-Rao lower bound [Sha03] that
allows us to prove lower bounds in terms of the Fisher information. Neverthe-
less, this notion of optimality is too stringent and often leads to nonexistence
of optimal estimators. Rather, we prefer here the notion of minimax optimality
that characterizes how fast θ∗ can be estimated uniformly over Θ.
where the infimum is taken over all estimators (i.e., measurable functions of
Y). Moreover, φ(Θ) is called minimax rate of estimation over Θ.
Note that minimax rates of convergence φ are defined up to multiplicative
constants. We may then choose this constant such that the minimax rate has
a simple form such as σ 2 d/n as opposed to 7σ 2 d/n for example.
This definition can be adapted to rates that hold with high probability. As
we saw in Chapter 2 (Cf. Table 5.1), the upper bounds in expectation and those
with high probability are of the same order of magnitude. It is also the case
for lower bounds. Indeed, observe that it follows from the Markov inequality
that for any A > 0,
    IE_θ\big[ φ^{−1}(Θ)|θ̂ − θ|_2^2 \big] ≥ A\, IP_θ\big( φ^{−1}(Θ)|θ̂ − θ|_2^2 > A \big)                 (5.5)
for some positive constants A and C. The above inequality also implies a lower
bound with high probability. We can therefore employ the following alternate
definition for minimax optimality.
where the infimum is taken over all estimators (i.e., measurable functions of
Y). Moreover, φ(Θ) is called minimax rate of estimation over Θ.
Minimax lower bounds rely on information theory and follow from a simple
principle: if the number of observations is too small, it may be hard to distin-
guish between two probability distributions that are close to each other. For
example, given n i.i.d. observations, it is impossible to reliably decide whether
they are drawn from N(0, 1) or N(1/n, 1). This simple argument can be made
precise using the formalism of statistical hypothesis testing. To do so, we reduce
our estimation problem to a testing problem. The reduction consists of two
steps.
where the infimum is taken over all tests based on Y and that take values
in {1, . . . , M }.
Conclusion: it is sufficient for proving lower bounds to find θ1 , . . . , θM ∈ Θ
such that |θ_j − θ_k|_2^2 ≥ 4φ(Θ) and
    inf_ψ max_{1≤j≤M} IP_{θ_j}\big( ψ ≠ j \big) ≥ C′ .
The above quantity is called minimax probability of error. In the next sections,
we show how it can be bounded from below using arguments from information
theory. For the purpose of illustration, we begin with the simple case where
M = 2 in the next section.
H0 : Z ∼ IP0
H1 : Z ∼ IP1
Lemma 5.3 (Neyman-Pearson). Let IP0 and IP1 be two probability measures.
Then for any test ψ, it holds
    IP_0(ψ = 1) + IP_1(ψ = 0) ≥ \int min(p_0, p_1) .
Next, for any test ψ, define its rejection region R = {ψ = 1}. Let R^⋆ = {p_1 ≥
p_0} denote the rejection region of the likelihood ratio test ψ^⋆. It holds
    IP_0(ψ = 1) + IP_1(ψ = 0) = 1 + IP_0(R) − IP_1(R)
                              = 1 + \int_R (p_0 − p_1)
                              = 1 + \int_{R∩R^⋆} (p_0 − p_1) + \int_{R∩(R^⋆)^c} (p_0 − p_1)
                              = 1 − \int_{R∩R^⋆} |p_0 − p_1| + \int_{R∩(R^⋆)^c} |p_0 − p_1|
                              = 1 + \int |p_0 − p_1| \big[ 1I(R ∩ (R^⋆)^c) − 1I(R ∩ R^⋆) \big]
It can be shown [Tsy09] that the integral is always well defined when IP1 ≪
IP0 (though it can be equal to ∞ even in this case). Unlike the total variation
distance, the Kullback-Leibler divergence is not a distance. Actually, it is not
even symmetric. Nevertheless, it enjoys properties that are very useful for our
purposes.
Proposition 5.6. Let IP and Q be two probability measures. Then
1. KL(IP, Q) ≥ 0
2. If IP and Q are product measures, i.e.,
    IP = \bigotimes_{i=1}^n IP_i   and   Q = \bigotimes_{i=1}^n Q_i ,
then
    KL(IP, Q) = \sum_{i=1}^n KL(IP_i, Q_i) .
Proof. If IP is not absolutely continuous then the result is trivial. Next, assume
that IP ≪ Q and let X ∼ IP.
1. Observe that by Jensen's inequality,
    KL(IP, Q) = −IE log\Big[ \frac{dQ}{dIP}(X) \Big] ≥ −log IE\Big[ \frac{dQ}{dIP}(X) \Big] = −log(1) = 0 .
    inf_{θ̂} sup_{θ∈Θ} IP_θ\Big( |θ̂ − θ|_2^2 ≥ \frac{2ασ^2}{n} \Big) ≥ \frac{1}{2} − α .
Proof. Write for simplicity IPj = IPθj , j = 0, 1. Recall that it follows from the
Clearly the result of Theorem 5.9 matches the upper bound for Θ = IRd
only for d = 1. How about larger d? A quick inspection of our proof shows
that our technique, in its present state, cannot yield better results. Indeed,
there are only two known candidates for the choice of θ∗ . With this knowledge,
one can obtain upper bounds that do not depend on d by simply projecting
Y onto the linear span of θ0 , θ1 and then solving the GSM in two dimensions.
To obtain larger lower bounds, we need to use more than two hypotheses. In
particular, in view of the above discussion, we need a set of hypotheses that
spans a linear space of dimension proportional to d. In principle, we should
need at least order d hypotheses but we will actually need much more.
The reduction to hypothesis testing from Section 5.2 allows us to use more
than two hypotheses. Specifically, we should find θ1 , . . . , θM such that
    inf_ψ max_{1≤j≤M} IP_{θ_j}\big( ψ ≠ j \big) ≥ C′ ,
where the infimum is taken over all tests with values in {1, . . . , M }.
where
h(x) = x log(x) + (1 − x) log(1 − x)
and
    q_j = \frac{IP(Z = j|X)}{IP(Z ≠ ψ(X)|X)}
is such that q_j ≥ 0 and \sum_{j≠ψ(X)} q_j = 1. It implies by Jensen's inequality that
    \sum_{j≠ψ(X)} q_j log(q_j) = − \sum_{j≠ψ(X)} q_j log\frac{1}{q_j} ≥ −log\Big( \sum_{j≠ψ(X)} \frac{q_j}{q_j} \Big) = −log(M − 1) .
It implies
    \int \Big\{ \sum_{j=1}^M IP(Z = j|X = x) log[IP(Z = j|X = x)] \Big\} dP_Z(x)
        = \sum_{j=1}^M \int \frac{1}{M} \frac{dP_j}{dP_Z}(x) log\Big[ \frac{1}{M} \frac{dP_j}{dP_Z}(x) \Big] dP_Z(x)
        = \frac{1}{M} \sum_{j=1}^M \int log\Big[ \frac{dP_j(x)}{\sum_{k=1}^M dP_k(x)} \Big] dP_j(x)
        ≤ \frac{1}{M^2} \sum_{j,k=1}^M \int log\Big[ \frac{dP_j(x)}{dP_k(x)} \Big] dP_j(x) − log M        (by Jensen)
        = \frac{1}{M^2} \sum_{j,k=1}^M KL(P_j, P_k) − log M ,
Since
    IP(Z ≠ ψ(X)) = \frac{1}{M} \sum_{j=1}^M P_j(ψ(X) ≠ j) ≤ max_{1≤j≤M} P_j(ψ(X) ≠ j) ,
    KL(IP_j, IP_k) = \frac{n|θ_j − θ_k|_2^2}{2σ^2} ≤ α log(M) .
Moreover, since M ≥ 5,
    \frac{ \frac{1}{M^2}\sum_{j,k=1}^M KL(IP_j, IP_k) + log 2 }{ log(M − 1) } ≤ \frac{α log(M) + log 2}{log(M − 1)} ≤ 2α + \frac{1}{2} .
Lemma 5.12 (Varshamov-Gilbert). For any γ ∈ (0, 1/2), there exist binary
vectors ω_1, . . . , ω_M ∈ {0,1}^d such that
    (i)  ρ(ω_j, ω_k) ≥ \big( \frac{1}{2} − γ \big) d   for all j ≠ k ,
    (ii) M = ⌊e^{γ^2 d}⌋ ≥ e^{\frac{γ^2 d}{2}} .
    \frac{M(M−1)}{2}\, IP\Big( X − \frac{d}{2} > γd \Big) ≤ exp\Big( −2γ^2 d + log\frac{M(M−1)}{2} \Big) < 1
as soon as
    \frac{M(M−1)}{2} < exp\big( 2γ^2 d \big) .
A sufficient condition for the above inequality to hold is to take M = ⌊e^{γ^2 d}⌋ ≥
e^{\frac{γ^2 d}{2}}. For this value of M, we have
    IP\Big( ∀ j ≠ k ,  ρ(ω_j, ω_k) ≥ \big( \frac{1}{2} − γ \big) d \Big) > 0
and by virtue of the probabilistic method, there exist ω_1, . . . , ω_M ∈ {0,1}^d that
satisfy (i) and (ii).
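The probabilistic method used here is easy to visualize numerically. The following Python sketch draws random binary vectors and reports the smallest pairwise Hamming distance; for the demo, M is capped at 500, an arbitrary simplification.

    import numpy as np

    # Sketch of the probabilistic argument behind the Varshamov-Gilbert lemma:
    # draw M random binary vectors and check that all pairwise Hamming distances
    # exceed (1/2 - gamma) d.
    rng = np.random.default_rng(9)
    d, gamma = 200, 0.25
    M = int(np.floor(np.exp(gamma ** 2 * d)))          # may be huge; cap it for the demo
    M = min(M, 500)

    omega = rng.integers(0, 2, size=(M, d))
    ham = (omega[:, None, :] != omega[None, :, :]).sum(axis=2)   # pairwise distances
    np.fill_diagonal(ham, d)                           # ignore the diagonal
    print("minimum pairwise distance:", ham.min(), " target:", (0.5 - gamma) * d)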
    inf_{θ̂} sup_{θ∈IR^d} IP_θ\Big( |θ̂ − θ|_2^2 ≥ \frac{α σ^2 d}{256\, n} \Big) ≥ \frac{1}{2} − 2α .
    ≤ \sum_{x∈\{0,1\}^d, |x|_0 = k} \sum_{j=1}^M \frac{1}{\binom{d}{k}}\, IP\Big( ω_j ≠ x : ρ(ω_j, x) < \frac{k}{2} \Big)
    = M\, IP\Big( ω ≠ x_0 : ρ(ω, x_0) < \frac{k}{2} \Big)
where Zi = 1I(Ui ∈ supp(x0 )). Indeed the left hand side is the number of
coordinates on which the vectors ω, x0 disagree and the right hand side is
the number of coordinates in supp(x0 ) on which the two vectors disagree. In
particular, we have that Z_1 ∼ Ber(k/d) and, for any i = 2, . . . , d, conditionally
on Z_1, . . . , Z_{i−1}, we have Z_i ∼ Ber(Q_i), where
    Q_i = \frac{k − \sum_{l=1}^{i−1} Z_l}{d − (i−1)} ≤ \frac{k}{d−k} ≤ \frac{2k}{d}
since k ≤ d/2.
Next we apply a Chernoff bound to get that for any s > 0,
    IP\Big( ω ≠ x_0 : ρ(ω, x_0) < \frac{k}{2} \Big) ≤ IP\Big( \sum_{i=1}^k Z_i > \frac{k}{2} \Big) ≤ IE\Big[ exp\Big( s\sum_{i=1}^k Z_i \Big) \Big] e^{−\frac{sk}{2}} .
    IE\Big[ exp\Big( s\sum_{i=1}^k Z_i \Big) \Big] = IE\Big[ exp\Big( s\sum_{i=1}^{k−1} Z_i \Big)\, IE\big[ exp(sZ_k) \,\big|\, Z_1, . . . , Z_{k−1} \big] \Big]
        = IE\Big[ exp\Big( s\sum_{i=1}^{k−1} Z_i \Big) \big( Q_k(e^s − 1) + 1 \big) \Big]
        ≤ IE\Big[ exp\Big( s\sum_{i=1}^{k−1} Z_i \Big) \Big] \Big( \frac{2k}{d}(e^s − 1) + 1 \Big)
        ⋮
        ≤ \Big( \frac{2k}{d}(e^s − 1) + 1 \Big)^k
        = 2^k
for s = log\big( 1 + \frac{d}{2k} \big). Putting everything together, we get
    IP\big( ∃\, ω_j ≠ ω_k : ρ(ω_j, ω_k) < k \big) ≤ exp\Big( log M + k log 2 − \frac{sk}{2} \Big)
        = exp\Big( log M + k log 2 − \frac{k}{2} log\big( 1 + \frac{d}{2k} \big) \Big)
        ≤ exp\Big( log M − \frac{k}{4} log\big( 1 + \frac{d}{2k} \big) \Big)      (for d ≥ 8k)
        < 1 .
    inf_{θ̂} sup_{θ∈IR^d, |θ|_0 ≤ k} IP_θ\Big( |θ̂ − θ|_2^2 ≥ \frac{α^2 σ^2}{64 n} k log\big( 1 + \frac{d}{2k} \big) \Big) ≥ \frac{1}{2} − 2α .
Corollary 5.16. Recall that B_1(R) ⊂ IR^d denotes the set of vectors θ ∈ IR^d such
that |θ|_1 ≤ R. Then there exists a constant C > 0 such that if d ≥ n^{1/2+ε},
ε > 0, the minimax rate of estimation over B_1(R) in the Gaussian sequence
model is
    φ(B_1(R)) = min\Big( R^2 ,\, Rσ \sqrt{\frac{log d}{n}} \Big) .
Moreover, it is attained by the constrained least squares estimator θ̂^{ls}_{B_1(R)} if
R ≥ σ \sqrt{\frac{log d}{n}} and by the trivial estimator θ̂ = 0 otherwise.
Proof. To complete the proof of the statement, we need to study the risk of the
trivial estimator equal to zero for small R. Note that if |θ^*|_1 ≤ R, we have
Remark 5.17. Note that the inequality |θ^*|_2^2 ≤ |θ^*|_1^2 appears to be quite loose.
Nevertheless, it is tight up to a multiplicative constant for the vectors of the
form θ_j = ω_j \frac{R}{k} that are employed in the lower bound. Indeed, if R ≤ σ\sqrt{\frac{log d}{n}},
we have k ≤ 2/β and
    |θ_j|_2^2 = \frac{R^2}{k} ≥ \frac{β}{2} |θ_j|_1^2 .
PROBLEM SET
Problem 5.5. Fix β ≥ 5/3, Q > 0 and prove that the minimax rate of esti-
mation over Θ(β, Q) with the ‖·‖_{L_2([0,1])}-norm is given by n^{−\frac{2β}{2β+1}}.
[Hint: Consider functions of the form
    f_j = \frac{C}{\sqrt{n}} \sum_{i=1}^{N} ω_{ji} ϕ_i
[AS08] Noga Alon and Joel H. Spencer. The probabilistic method. Wiley-
Interscience Series in Discrete Mathematics and Optimization.
John Wiley & Sons, Inc., Hoboken, NJ, third edition, 2008. With
an appendix on the life and work of Paul Erdős.
[Ber09] Dennis S. Bernstein. Matrix mathematics. Princeton University
Press, Princeton, NJ, second edition, 2009. Theory, facts, and
formulas.
[Bil95] Patrick Billingsley. Probability and measure. Wiley Series in
Probability and Mathematical Statistics. John Wiley & Sons
Inc., New York, third edition, 1995. A Wiley-Interscience Pub-
lication.
[Bir83] Lucien Birgé. Approximation dans les espaces métriques et
théorie de l’estimation. Z. Wahrsch. Verw. Gebiete, 65(2):181–
237, 1983.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Con-
centration inequalities. Oxford University Press, Oxford, 2013.
A nonasymptotic theory of independence, With a foreword by
Michel Ledoux.
[BRT09] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Si-
multaneous analysis of Lasso and Dantzig selector. Ann. Statist.,
37(4):1705–1732, 2009.
[BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-
thresholding algorithm for linear inverse problems. SIAM J.
Imaging Sci., 2(1):183–202, 2009.
[Cav11] Laurent Cavalier. Inverse problems in statistics. In Inverse prob-
lems and high-dimensional estimation, volume 203 of Lect. Notes
Stat. Proc., pages 3–96. Springer, Heidelberg, 2011.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information
theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ,
second edition, 2006.
[CT07] Emmanuel Candes and Terence Tao. The Dantzig selector: sta-
tistical estimation when p is much larger than n. Ann. Statist.,
35(6):2313–2351, 2007.
[CZ12] T. Tony Cai and Harrison H. Zhou. Minimax estimation of large
covariance matrices under ℓ1 -norm. Statist. Sinica, 22(4):1319–
1349, 2012.
[CZZ10] T. Tony Cai, Cun-Hui Zhang, and Harrison H. Zhou. Opti-
mal rates of convergence for covariance matrix estimation. Ann.
Statist., 38(4):2118–2144, 2010.
[DDGS97] M.J. Donahue, C. Darken, L. Gurvits, and E. Sontag. Rates
of convex approximation in non-Hilbert spaces. Constructive
Approximation, 13(2):187–220, 1997.
[EHJT04] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tib-
shirani. Least angle regression. Ann. Statist., 32(2):407–499,
2004. With discussion, and a rejoinder by the authors.
[FHT10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Reg-
ularization paths for generalized linear models via coordinate
descent. Journal of Statistical Software, 33(1), 2010.
[FR13] Simon Foucart and Holger Rauhut. A mathematical introduc-
tion to compressive sensing. Applied and Numerical Harmonic
Analysis. Birkhäuser/Springer, New York, 2013.
[Gru03] Branko Grunbaum. Convex polytopes, volume 221 of Graduate
Texts in Mathematics. Springer-Verlag, New York, second edi-
tion, 2003. Prepared and with a preface by Volker Kaibel, Victor
Klee and Günter M. Ziegler.
[GVL96] Gene H. Golub and Charles F. Van Loan. Matrix computa-
tions. Johns Hopkins Studies in the Mathematical Sciences.
Johns Hopkins University Press, Baltimore, MD, third edition,
1996.
[HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The
elements of statistical learning. Springer Series in Statistics.
Springer-Verlag, New York, 2001. Data mining, inference, and
prediction.
[IH81] I. A. Ibragimov and R. Z. Hasminskii. Statistical estimation,
volume 16 of Applications of Mathematics. Springer-Verlag, New
York, 1981. Asymptotic theory, Translated from the Russian by
Samuel Kotz.
[Joh11] Iain M. Johnstone. Gaussian estimation: Sequence and wavelet
models. Unpublished Manuscript., December 2011.