Statistical Estimation
Cheng Mao
May 2, 2023
This set of notes, based on [LC06, Kee11, Ber18, RH19, Tsy08] and other sources, is provided
to the students in the course MATH 6262 at Georgia Tech as a complement to the lectures. It is
not meant to be a complete introduction to the subject.
Contents
2.5 Admissibility
2.5.1 Admissible estimators
2.5.2 Inadmissible estimators
2.6 Shrinkage estimators and Stein's effect
2.6.1 Gaussian estimation
2.6.2 Poisson estimation
3 Asymptotic estimation
3.1 Convergence of random variables
3.1.1 Convergence in probability
3.1.2 Convergence in distribution
3.2 Asymptotic efficiency
3.3 Asymptotic properties of maximum likelihood estimation
3.3.1 Asymptotic consistency
3.3.2 Asymptotic efficiency
3.4 Examples of maximum likelihood estimation
3.4.1 Some examples
3.4.2 Linear regression
3.5 Bernstein–von Mises theorem
3.6 Bootstrap methods
3.6.1 Jackknife estimator and bias reduction
3.6.2 Mean estimation and asymptotics
3.7 Sampling methods
3.7.1 Sampling with quantile function
3.7.2 Importance sampling
3.7.3 Metropolis–Hastings algorithm
3.7.4 Gibbs sampler
4 Finite-sample analysis
4.1 Rates of estimation for linear regression
4.1.1 Fast rate for low-dimensional linear regression
4.1.2 Maximal inequalities
4.1.3 Slow rate for high-dimensional linear regression
4.2 High-dimensional linear regression
4.2.1 Setup and estimators
4.2.2 Fast rate for sparse linear regression
4.2.3 Fast rate for LASSO
4.3 Generalized linear regression
4.3.1 Setup and models
4.3.2 Maximum likelihood estimation for logistic regression
4.4 Nonparametric regression
4.4.1 Model and estimators
4.4.2 Rates of estimation for local polynomial estimators
5 Information-theoretic lower bounds
5.1 Reduction to hypothesis testing
5.2 Le Cam's two-point method
5.2.1 General theory
5.2.2 Lower bounds for nonparametric regression at a point
5.3 Assouad's lemma
5.3.1 General theory
5.3.2 Applications
5.4 Fano's inequality
5.4.1 General theory
5.4.2 Application to Gaussian mean estimation
5.4.3 Application to nonparametric regression
5.5 Generalization of the two-point method
Chapter 1
• A finite or countable set X equipped with the counting measure µ. For example, when we roll
a die, the outcome lies in X = {1, 2, 3, 4, 5, 6}. Moreover, if we consider the number of times we
throw a coin before a “heads” is observed, then this number lies in X = {1, 2, 3, . . . }.
• If X is discrete, i.e., X is finite or countable, we can specify the probability mass function (PMF)
fX of X. For example, for the uniform random variable X ∼ Unif([n]) where [n] := {1, . . . , n},
we have fX (i) = P{X = i} = 1/n for i = 1, . . . , n.
• The cumulative distribution function (CDF) of a random variable X on R is F_X(t) = P{X ≤ t}. We have F_X'(t) = f_X(t) and ∫_{-∞}^t f_X(s) ds = F_X(t). The CDF of X = (X_1, …, X_d) on R^d is
F_X(t_1, …, t_d) = P{X_1 ≤ t_1, …, X_d ≤ t_d}.
Examples:
• Roll a die; the outcome is X ∼ Unif([6]). The probability of seeing 2 or 3 is P{X ∈ {2, 3}} = Σ_{i=2}^3 1/6 = 1/3.
• Consider X ∼ N(0, 1). The probability that X is positive is P{X > 0} = ∫_0^∞ (1/√(2π)) e^{-t²/2} dt = 1/2.
The expectation of X is E[X] = ∫_X t f_X(t) dµ(t). Given a function g : X → R, the expectation of g(X) is E[g(X)] = ∫_X g(t) f_X(t) dµ(t).
• For X ∼ Unif([6]), E[X] = Σ_{i=1}^6 i · (1/6) = 3.5.
• For X ∼ N(0, 1), the variance of X is E[(X − 0)²] = ∫_{-∞}^∞ t² · (1/√(2π)) e^{-t²/2} dt = 1.
• Family of distributions: the set P containing all P_θ. E.g., P = {Ber(θ) : θ ∈ [0, 1]}.
• Estimand: a function g(θ) of the parameter to be estimated. E.g., g(θ) = θ².
• Loss function: a bivariate function L(g, ĝ) ≥ 0. E.g., the squared loss L(θ, θ̂) = (θ̂ − θ)².
• Risk: the expectation of the loss E[L(g, ĝ)] with respect to the observations. E.g., E[(θ̂ − θ)²] = E[((1/n) Σ_{i=1}^n X_i − θ)²] for i.i.d. X_1, …, X_n ∼ Ber(θ).
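As a sanity check, the risk of the empirical mean under Bernoulli observations can be verified by simulation. The sketch below (θ, n, and the number of trials are arbitrary illustrative choices) compares a Monte Carlo estimate of the risk against the exact value θ(1 − θ)/n.

```python
import random

# Monte Carlo estimate of the risk E[(theta_hat - theta)^2] of the empirical
# mean theta_hat = (1/n) * sum(X_i) for i.i.d. X_i ~ Ber(theta).
# The exact risk is Var(theta_hat) = theta * (1 - theta) / n.
# theta, n, trials are arbitrary illustrative choices.
random.seed(0)
theta, n, trials = 0.3, 50, 20000

total = 0.0
for _ in range(trials):
    x_bar = sum(random.random() < theta for _ in range(n)) / n
    total += (x_bar - theta) ** 2
mc_risk = total / trials

exact = theta * (1 - theta) / n  # = 0.0042 here
assert abs(mc_risk - exact) < 5e-4
```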
Here, η = η(θ) ∈ R^m is the natural parameter, T(x) ∈ R^m is the sufficient statistic, and
A(η) = log ∫_X exp(η^⊤ T(x)) · h(x) dµ(x).
f(x | µ, σ²) = (1/(√(2π)σ)) exp( −(x − µ)²/(2σ²) ) = (1/(√(2π)σ)) exp( µx/σ² − x²/(2σ²) − µ²/(2σ²) )
= exp( η₁x + η₂x² + η₁²/(4η₂) + (1/2) log(−2η₂) ) · (1/√(2π)),
where η₁ = µ/σ² and η₂ = −1/(2σ²).
• Poisson distribution Poi(λ): With η = log λ,
f(x | λ) = (λ^x e^{−λ}) / x! = exp(x log λ − λ) · (1/x!) = exp(ηx − e^η) · (1/x!).
• Binomial distribution Bin(n, p): If n is fixed and known, and η = log(p/(1 − p)), then
f(x | p) = C(n, x) p^x (1 − p)^{n−x} = exp( x log(p/(1 − p)) + n log(1 − p) ) · C(n, x) = exp( ηx − n log(1 + e^η) ) · C(n, x),
where C(n, x) denotes the binomial coefficient.
Therefore, we have
α_{r₁,…,r_m} = ∂^{r₁+⋯+r_m} M_X(u) / (∂u₁^{r₁} ⋯ ∂u_m^{r_m}) evaluated at u = 0.
The cumulant generating function (CGF) is defined as K_X(u) := log M_X(u). Its power series expansion is
K_X(u) = Σ_{r₁,…,r_m} ( κ_{r₁,…,r_m} / (r₁! ⋯ r_m!) ) u₁^{r₁} ⋯ u_m^{r_m},
where we call κr1 ,...,rm the cumulants of X.
In the case that m = 1, i.e., the random variable is real-valued, we have K_X(u) = Σ_r (κ_r/r!) u^r.
Examples:
• X ∼ Bin(n, p): T(x) = x, η = log(p/(1 − p)), and A(η) = n log(1 + e^η). Hence
M_X(u) = ( (1 + e^{η+u}) / (1 + e^η) )^n = (1 − p + p e^u)^n.
On the other hand, if we use the definition of the MGF, we need to compute
M_X(u) = Σ_{x=0}^n C(n, x) p^x (1 − p)^{n−x} e^{ux}.
This is slightly more involved than using the general formula for exponential families.
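The agreement between the two computations can be checked numerically. The sketch below (n, p, and the evaluation points u are arbitrary test choices) uses the exponential-family identity M_X(u) = exp(A(η + u) − A(η)) for T(x) = x and compares it with both the closed form and the direct sum.

```python
import math

# Numerical check that the exponential-family route to the MGF of
# X ~ Bin(n, p) agrees with the direct definition:
#   exp(A(eta + u) - A(eta)) = (1 - p + p*e^u)^n
#     = sum_x C(n, x) p^x (1 - p)^(n - x) e^(u x).
# n, p, and the evaluation points u are arbitrary test choices.
n, p = 7, 0.35
eta = math.log(p / (1 - p))

def A(e):
    # log-partition function of Bin(n, .) in the natural parameter
    return n * math.log(1 + math.exp(e))

for u in (-1.0, 0.0, 0.5, 2.0):
    closed_form = (1 - p + p * math.exp(u)) ** n
    via_family = math.exp(A(eta + u) - A(eta))
    direct = sum(math.comb(n, x) * p**x * (1 - p)**(n - x) * math.exp(u * x)
                 for x in range(n + 1))
    assert abs(via_family - closed_form) <= 1e-9 * closed_form
    assert abs(direct - closed_form) <= 1e-9 * closed_form
```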
1.2.3 Stein’s lemma
Lemma 1.2 (Stein). Let {P_θ : θ ∈ Θ} be an exponential family. Suppose that X ∼ P_θ has density f(x | θ) = exp(η^⊤ T(x) − A(η)) · h(x) for x ∈ R. Let g be a differentiable function such that lim_{x→±∞} g(x) f(x | θ) = 0. Then we have
E[ g(X) ( h'(X)/h(X) + η^⊤ T'(X) ) ] = − E[g'(X)].
In particular, for X ∼ N(µ, σ²), we have
E[g(X)(X − µ)] = σ² E[g'(X)].     (1.1)
For X ∼ N(µ, σ²), setting g(x) = 1 gives E[X] = µ, and setting g(x) = x gives E[X²] = σ² + µ².
Remarks:
• If T is a sufficient statistic and there exists a function g such that T = g(S), then S is a sufficient
statistic.
• A sufficient statistic T is minimal if for any sufficient statistic S, there exists a function g such
that T = g(S).
• If sufficient statistics S and T are functions of each other, then we say that they are equivalent.
f(x₁, …, xₙ | µ, σ²) = ( 1/(2πσ²)^{n/2} ) exp( −(1/(2σ²)) Σᵢ xᵢ² + (µ/σ²) Σᵢ xᵢ − nµ²/(2σ²) ).
• Let P_θ be from an exponential family with density f(x | θ) = exp(η^⊤ T(x) − A(η)) · h(x). Given i.i.d. X₁, …, Xₙ ∼ P_θ, the joint density is again of this form, where Σᵢ T(Xᵢ) ∈ R^m is a sufficient statistic.
Lemma 1.4. Consider a finite family of distributions with densities f₀, f₁, …, f_k, all having the same support. Then the statistic T = ( f₁/f₀, f₂/f₀, …, f_k/f₀ ) is minimal sufficient.
Proof. We need to show that for any sufficient statistic S, there exists a function g such that
T = g(S). This follows immediately from the previous result.
Lemma 1.5. Let P be a family of distributions with common support and P 0 ⊂ P. If a statistic
T is minimal sufficient for P 0 and sufficient for P, then it is minimal sufficient for P.
Proof. If S is a sufficient statistic for P, it is a sufficient statistic for P 0 . Hence there exists a
function g such that T = g(S).
Consider an exponential family with density f(x | θ) = exp(η^⊤ T(x) − A(η)) · h(x), where θ ∈ Θ. If the interior of the set η(Θ) ⊂ R^m is not empty and if T does not satisfy an affine constraint v^⊤ T = c for nonzero v ∈ R^m and c ∈ R, then the exponential family is said to be of full rank.
Theorem 1.6. Consider an exponential family with density f(x | θ) = exp(η^⊤ T(x) − A(η)) · h(x), where θ ∈ Θ and η = η(θ) ∈ R^m. Suppose that T does not satisfy an affine constraint of the form v^⊤ T = c. If there exist natural parameters η^{(0)}, η^{(1)}, …, η^{(m)} such that {η^{(i)} − η^{(0)} : i ∈ [m]} spans R^m, then T is minimal sufficient.
In particular, the sufficient statistic T in a full-rank exponential family P is minimal.
Proof. Let P' ⊂ P be a subfamily consisting of m + 1 distributions, with natural parameters η^{(0)}, η^{(1)}, …, η^{(m)}. By Lemma 1.4, a minimal sufficient statistic for P' is
T₀ = ( exp( (η^{(1)} − η^{(0)})^⊤ T − A(η^{(1)}) + A(η^{(0)}) ), …, exp( (η^{(m)} − η^{(0)})^⊤ T − A(η^{(m)}) + A(η^{(0)}) ) ),
which is equivalent to
T̃ = ( (η^{(1)} − η^{(0)})^⊤ T, …, (η^{(m)} − η^{(0)})^⊤ T ).
This is equivalent to T if and only if the matrix with columns {η^{(i)} − η^{(0)} : i ∈ [m]} is nonsingular. Conclude using Lemma 1.5. Such a subfamily can be chosen if the exponential family is full-rank.
Proposition 1.7. Consider a real-valued convex function f on a convex open set S ⊂ Rn . At each
x ∈ S, there exists a vector v ∈ Rn such that f (y) − f (x) ≥ v > (y − x) for any y ∈ S. This vector
v is called a subgradient of f at x.
If f is strictly convex, then v can be chosen so that the inequality is strict unless y = x.
Proposition 1.8 (Jensen's inequality). Consider a real-valued convex function f on a convex set S ⊂ R^n. For any x₁, …, xₙ ∈ S and a₁, …, aₙ ∈ [0, 1] such that Σᵢ aᵢ = 1, we have f(Σᵢ aᵢxᵢ) ≤ Σᵢ aᵢ f(xᵢ).
More generally, if X is a random variable taking values in S ⊂ R^n and E‖X‖ < ∞, then f(E[X]) ≤ E[f(X)]. If f is strictly convex, the inequality is strict unless P{X = E[X]} = 1.
An example is eλ E[X] ≤ E[eλX ].
Proof. Define L(y) = f (x) + v > (y − x) ≤ f (y) for x = E[X]. Then E[f (X)] ≥ E[L(X)] =
L(E[X]) = f (E[X]).
Then P ∗ is from an exponential family with density C exp θ> T (x) − A(θ) , where θ = θ(µ).
Proof. By Jensen’s inequality,
Taking the expectation on both sides with respect to T yields the result.
E[ (ĝ(X) − g(θ))² ]
Proof. “⇒”: Fix such a U and θ ∈ Θ. For any λ ∈ R, g̃ = ĝ + λU is an unbiased estimator. Since
ĝ is the MVUE by assumption, we have
Theorem 1.12. If X ∼ Pθ for a full-rank exponential family {Pθ }, then T is complete.
For a proof, see Theorem 4.3.1 of [LR06]. The above result leads to an important theorem by
Lehmann and Scheffé.
• uniformly minimizes the risk for any loss L(g, ·) convex in the second variable;
• is the MVUE.
1.5.2 Examples
• Gaussian MVUE: Consider i.i.d. X₁, …, Xₙ ∼ N(µ, σ²), where µ and σ are unknown. Recall that the empirical mean µ̂ = (1/n) Σ_{i=1}^n Xᵢ and empirical variance σ̂² = (1/n) Σ_{i=1}^n (Xᵢ − µ̂)² are jointly sufficient statistics. They are also complete by full-rankness. Since E[µ̂] = µ and E[σ̂²] = E[(X₁ − µ̂)²] = ((n − 1)/n) σ², we have that (µ̂, (n/(n − 1)) σ̂²) is the MVUE of (µ, σ²).
• Uniform MVUE: Consider i.i.d. X1 , . . . , Xn ∼ Unif(0, θ) and g(θ) = θ. It can be shown that the
sufficient statistic T = X(n) is complete. Note that 2X1 is an unbiased estimator of θ, so the
MVUE is
ĝ(t) = E[2X₁ | X₍ₙ₎ = t] = (1/n) · 2t + ((n − 1)/n) · t = t(n + 1)/n.
• Binomial MVUE: Consider X ∼ Bin(N, p) and g(p) = p(1 − p). Recall that T (X) = X. Then
E[ĝ(T)] = g(p) says
Σ_{x=0}^N C(N, x) ĝ(x) p^x (1 − p)^{N−x} = p(1 − p).
Let r = p/(1 − p). Then we have p = r/(1 + r) and 1 − p = 1/(1 + r). Hence
Σ_{x=0}^N C(N, x) ĝ(x) r^x = p(1 − p)^{1−N} = r(1 + r)^{N−2} = Σ_{x=1}^{N−1} C(N − 2, x − 1) r^x,
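The uniform example above lends itself to a quick simulation. The sketch below (θ, n, and the number of trials are arbitrary choices) compares the mean squared error of the crude unbiased estimator 2X₁ with that of the MVUE (n + 1)X₍ₙ₎/n; both are unbiased, but the MVUE should do far better.

```python
import random

# Monte Carlo comparison of two unbiased estimators of theta from
# i.i.d. X_1, ..., X_n ~ Unif(0, theta): the crude 2*X_1 versus the
# MVUE (n+1)/n * max_i X_i. theta, n, trials are arbitrary choices.
random.seed(1)
theta, n, trials = 2.0, 20, 20000

mse_crude = mse_mvue = 0.0
for _ in range(trials):
    xs = [theta * random.random() for _ in range(n)]
    mse_crude += (2 * xs[0] - theta) ** 2
    mse_mvue += ((n + 1) / n * max(xs) - theta) ** 2
mse_crude /= trials
mse_mvue /= trials

# exact values: Var(2 X_1) = theta^2/3, Var(MVUE) = theta^2/(n(n+2))
assert mse_mvue < mse_crude
```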
1.6 Lower bounds on the variance of an unbiased estimator
Let us assume a few technical conditions throughout this section:
3. (Cramér–Rao) Suppose that there exists a function B(x, θ) and ε > 0 such that
E_θ[B(X, θ)²] < ∞ and |f(x | θ + δ) − f(x | θ)| / ( |δ| f(x | θ) ) ≤ B(x, θ) for all |δ| ≤ ε.
If g is differentiable, taking the limit δ → 0 on the right-hand side of the above inequality, and applying dominated convergence, we obtain
Var_θ(ĝ) ≥ (g'(θ))² / E_θ[ ( (∂f(X | θ)/∂θ) / f(X | θ) )² ] = (g'(θ))² / I(θ).
If (∂²/∂θ²) log f(x | θ) exists for all x and θ and differentiation under the integral sign holds, then taking the expectation of
(∂²/∂θ²) log f(x | θ) = ( ∂²f(x | θ)/∂θ² ) / f(x | θ) − ( (∂f(x | θ)/∂θ) / f(x | θ) )²
yields
I(θ) = − E_θ[ (∂²/∂θ²) log f(X | θ) ].
Note that if θ is a differentiable function of ξ, the Fisher information X contains about ξ is
Ĩ(ξ) = I(θ) · (θ'(ξ))².
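For a concrete check of the two expressions for I(θ), take X ∼ Ber(θ): both expectations are finite sums over x ∈ {0, 1}, and both equal 1/(θ(1 − θ)). The following sketch verifies this numerically (θ is an arbitrary choice).

```python
# Exact check, for X ~ Ber(theta), that the two expressions for the Fisher
# information coincide:
#   E[((d/dtheta) log f(X|theta))^2] = -E[(d^2/dtheta^2) log f(X|theta)]
#     = 1/(theta*(1 - theta)).
# The expectations are finite sums over x in {0, 1}; theta is arbitrary.
theta = 0.3
pmf = {0: 1 - theta, 1: theta}

# score: d/dtheta [x log(theta) + (1 - x) log(1 - theta)]
score_sq = sum(p * (x / theta - (1 - x) / (1 - theta)) ** 2
               for x, p in pmf.items())
# negative second derivative of the log-density, in expectation
neg_hess = sum(p * (x / theta**2 + (1 - x) / (1 - theta) ** 2)
               for x, p in pmf.items())
closed = 1 / (theta * (1 - theta))

assert abs(score_sq - closed) < 1e-12
assert abs(neg_hess - closed) < 1e-12
```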
1.6.2 Extensions
• (i.i.d. observations) By definition, we can check:
Lemma 1.14. Let X1 and X2 be independent random variables with densities f1 (x | θ) and
f2 (x | θ) respectively. If I1 (θ), I2 (θ) and I(θ) denote the information X1 , X2 and (X1 , X2 )
contain about θ, then I(θ) = I1 (θ) + I2 (θ).
• (Biased case) If ĝ is a biased estimator with E_θ[ĝ] = g(θ) + b(θ), then the same argument yields
Var_θ(ĝ) ≥ (g'(θ) + b'(θ))² / I(θ).
Under some regularity conditions similar to the one-dimensional case, the information matrix I(θ) ∈ R^{m×m} is defined by
I_{ij}(θ) = E_θ[ (∂/∂θᵢ) log f(X | θ) · (∂/∂θⱼ) log f(X | θ) ]
= Cov_θ( (∂/∂θᵢ) log f(X | θ), (∂/∂θⱼ) log f(X | θ) )
= − E_θ[ (∂²/∂θᵢ∂θⱼ) log f(X | θ) ].
Hence I(θ) = − E_θ[∇² log f(X | θ)].
Theorem 1.16 (Cramér–Rao, Information Inequality). Assume mild regularity conditions (similar to the one-dimensional case) and that I(θ) is positive definite. Define α ∈ R^m by αᵢ = (∂/∂θᵢ) E_θ[ĝ]. Then we have
Var_θ(ĝ) ≥ α^⊤ I(θ)^{−1} α.
1.6.3 Examples
• Exponential family: X ∼ P_η with f(x | η) = exp(η^⊤ T(x) − A(η)) h(x). Then ∇² log f(x | η) = −∇²A(η), so the information matrix is I(η) = ∇²A(η).
Chapter 2
The Bayes risk of an estimator ĝ with respect to a prior distribution π on Θ is defined as
R_π(ĝ) := E_{θ∼π} R(g, ĝ).
An estimator ĝ is called a Bayes (optimal) estimator if Rπ (ĝ) ≤ Rπ (g̃) for any other estimator g̃.
A stronger condition is that the estimator ĝ minimizes the posterior loss E[L(g(θ), ĝ(X)) | X].
• For the squared loss L(g, g̃) = (g − g̃)2 , the posterior mean ĝ(X) := E[g(θ) | X] is a Bayes
estimator because for any estimator g̃,
• For the `1 loss L(g, g̃) = |g − g̃|, it is not hard to check that the posterior median is a Bayes
estimator.
2.1.2 Examples
Gaussian Consider X ∼ N(θ, 1) where θ ∼ N(0, σ 2 ). The Bayes estimator under the squared loss
L(θ, θ̂) = (θ − θ̂)2 is the posterior mean
θ̂(X) := E[θ | X] = ∫ θ f(X | θ) p(θ) dθ / ∫ f(X | θ) p(θ) dθ.
One can check that the posterior distribution of θ is N( σ²X/(1 + σ²), σ²/(1 + σ²) ) conditional on X. Therefore, the posterior mean is σ²X/(1 + σ²).
The risk of θ̂ is
E_{X∼N(θ,1)}[ ( (σ²/(1 + σ²)) X − θ )² ] = ( σ²/(1 + σ²) )² + ( 1/(1 + σ²) )² θ²,
and the Bayes risk is
E_{θ∼N(0,σ²)}[ ( σ²/(1 + σ²) )² + ( 1/(1 + σ²) )² θ² ] = ( σ²/(1 + σ²) )² + σ²/(1 + σ²)² = σ²/(1 + σ²).
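This Bayes risk can be confirmed by simulation. The sketch below (σ² and the trial count are arbitrary choices) draws θ from the prior, X from the model, and averages the squared error of the posterior mean.

```python
import random

# Monte Carlo check of the Bayes risk sigma^2/(1 + sigma^2) attained by the
# posterior mean theta_hat(X) = sigma^2/(1 + sigma^2) * X when X ~ N(theta, 1)
# and theta ~ N(0, sigma^2). sigma2 and trials are arbitrary choices.
random.seed(2)
sigma2, trials = 4.0, 100000
c = sigma2 / (1 + sigma2)

total = 0.0
for _ in range(trials):
    theta = random.gauss(0.0, sigma2 ** 0.5)   # draw from the prior
    x = random.gauss(theta, 1.0)               # draw from the model
    total += (c * x - theta) ** 2
bayes_risk_mc = total / trials

exact = sigma2 / (1 + sigma2)  # = 0.8 here
assert abs(bayes_risk_mc - exact) < 0.02
```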
We then conclude that the posterior distribution of θ conditional on X1 , . . . , Xn is
N( ( nσ²/(τ² + nσ²) ) X̄ + ( τ²/(τ² + nσ²) ) µ, σ²τ²/(τ² + nσ²) ).
(Poisson, Gamma) pair Consider X ∼ Poi(λ), where λ ∼ Gamma(a, b) for a, b > 0. Recall that
the gamma distribution has density
p(λ) = ( b^a / Γ(a) ) λ^{a−1} e^{−bλ}, where Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx.
Then we have
f(x) = ∫_0^∞ f(x | λ) p(λ) dλ = ∫_0^∞ ( λ^x e^{−λ} / x! ) ( b^a / Γ(a) ) λ^{a−1} e^{−bλ} dλ = b^a Γ(a + x) / ( x! (b + 1)^{a+x} Γ(a) ).
E_π[ (λ − λ̂)²/λ^k | X ] = ∫_0^∞ ( (λ − λ̂)²/λ^k ) p(λ | X) dλ
= ( Γ(a + X − k) / Γ(a + X) ) (b + 1)^{k−2} { (b + 1)λ̂ ( (b + 1)λ̂ − 2(a + X − k) ) + (a + X − k)(a + X − k + 1) },
which is minimized at λ̂ = λ̂(X) = (a + X − k)/(b + 1). Therefore, this λ̂ is the Bayes estimator.
– w = (w₁, …, w_k) follows the Dirichlet distribution with parameter α > 0. The Dirichlet distribution is defined on the probability simplex ∆ := { v ∈ R^k : Σ_{i=1}^k vᵢ = 1, vᵢ ≥ 0 } and has density proportional to Π_{i=1}^k vᵢ^{α−1} on ∆.
Such a hierarchical model may be useful partly because the hyperparameter space is more man-
ageable and allows tuning of the sparsity of the mixing weights.
Maximum likelihood estimation Consider the likelihood L(θ | x) := f(x | θ) and the log-likelihood log L(θ | x) := log f(x | θ). Given i.i.d. X₁, …, Xₙ ∼ P_θ, the maximum likelihood estimator (MLE) of g(θ) is defined to be ĝ := g(θ̂), where
θ̂ := argmax_{θ∈Θ} Σ_{i=1}^n log f(Xᵢ | θ).
Let us recall the unbiased estimation for Gaussian mean and variance. Given i.i.d. X₁, …, Xₙ ∼ N(θ, σ²) where θ and σ are unknown, the empirical mean θ̂ is the MVUE of θ and (n/(n − 1)) σ̂² is the MVUE of σ², where σ̂² = (1/n) Σ_{i=1}^n (Xᵢ − θ̂)². If we do not require the estimators to be unbiased, can they achieve better risks? How about the MLEs? We have
(θ̂, σ̂) = argmin_{(θ,σ)} Σ_{i=1}^n (Xᵢ − θ)²/(2σ²) + (n/2) log(2πσ²),
so the MLEs are θ̂ = (1/n) Σ_{i=1}^n Xᵢ and σ̂² = (1/n) Σ_{i=1}^n (Xᵢ − θ̂)². In particular, σ̂² is a biased estimator of σ². One can check
E[(cσ̂² − σ²)²] = σ⁴ ( ((n² − 1)/n²) c² − 2((n − 1)/n) c + 1 ),
so we have
E[(σ̂² − σ²)²] < E[( (n/(n − 1)) σ̂² − σ² )²].
In fact, (n/(n + 1)) σ̂² is an even better choice in terms of minimizing the risk, since the quadratic in c above is minimized at c = n/(n + 1).
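Since the risk above is an explicit quadratic in c, the comparison of the three natural scalings can be done exactly. The sketch below evaluates it for c = 1 (the MLE), c = n/(n − 1) (the unbiased estimator), and c = n/(n + 1) (the risk minimizer); n is an arbitrary choice and σ⁴ is set to 1.

```python
# Exact risk E[(c*sigma_hat^2 - sigma^2)^2] of scaled variance estimators,
# using the formula sigma^4 * ((n^2-1)/n^2 * c^2 - 2*(n-1)/n * c + 1)
# derived above (Gaussian data). n is arbitrary; sigma^4 = 1 for simplicity.
n = 10

def risk(c):
    return (n**2 - 1) / n**2 * c**2 - 2 * (n - 1) / n * c + 1

r_mle = risk(1.0)                 # c = 1: the MLE sigma_hat^2
r_unbiased = risk(n / (n - 1))    # the unbiased estimator
r_best = risk(n / (n + 1))        # minimizer of the quadratic in c

# algebraically, r_unbiased = 2/(n-1) and r_best = 2/(n+1)
assert r_best < r_mle < r_unbiased
assert abs(r_unbiased - 2 / (n - 1)) < 1e-12
assert abs(r_best - 2 / (n + 1)) < 1e-12
```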
Theorem 2.2 (Bayesian Cramér–Rao, van Trees). Let π be a distribution on Θ := (a, b) ⊂ R with
density p such that p(θ) → 0 as θ → a or b. Consider X ∼ Pθ where θ ∼ π. Suppose that f (x | θ)
is bounded. Define
I(θ) = E_θ[ ( (∂/∂θ) log f(X | θ) )² ] and I(π) = E[ ( (∂/∂θ) log p(θ) )² ].
For any estimator ĝ(X) of a differentiable estimand g(θ), we have
E[ (ĝ(X) − g(θ))² ] ≥ (E[g'(θ)])² / ( E[I(θ)] + I(π) ).
Proof. By the Cauchy–Schwarz inequality, we obtain
E[ (ĝ(X) − g(θ))² ] · E[ ( (∂/∂θ) log( f(X | θ) p(θ) ) )² ] ≥ ( E[ (ĝ(X) − g(θ)) · (∂/∂θ) log( f(X | θ) p(θ) ) ] )².
Moreover, for the left-hand side of the Cauchy–Schwarz inequality, we have
E[ ( (∂/∂θ) log( f(X | θ) p(θ) ) )² ]
= E[ ( (∂/∂θ) log f(X | θ) )² ] + E[ ( (∂/∂θ) log p(θ) )² ] + 2 E[ (∂/∂θ) log f(X | θ) · (∂/∂θ) log p(θ) ]
= E[I(θ)] + I(π) + 0,
where we used E[ (∂/∂θ) log f(X | θ) | θ ] = 0. Combining everything completes the proof.
For example, for X ∼ N(θ, 1) where θ ∼ N(0, σ²), we have shown that the estimator θ̂ = σ²X/(1 + σ²) of θ achieves the Bayes risk σ²/(1 + σ²) under the squared loss. On the other hand, we have E[I(θ)] = 1 and I(π) = 1/σ², so the van Trees bound reads
E[ (ĝ(X) − θ)² ] ≥ 1/(1 + 1/σ²) = σ²/(1 + σ²)
for any estimator ĝ, matching the Bayes risk.
• For a loss function L(g, ĝ), we then consider the estimator ĝ(X) that minimizes the empirical posterior loss:
ĝ(X) := argmin_{g̃} ∫ L( g(θ), g̃(X) ) p( θ | X, γ̂(X) ) dθ.
Compare this to the posterior loss
E[ L( g(θ), g̃(X) ) | X ] = ∫ L( g(θ), g̃(X) ) p( θ | X, γ ) dθ.
In the original Bayesian framework, γ is assumed to be known, while in hierarchical Bayes, γ is
assumed to follow a known distribution. Here in the empirical Bayes approach, γ is estimated from
data. Moreover, we remark that γ̂ and ĝ can be replaced by other estimators.
Bayes estimator and its risk We start with the case where σ 2 is known. Similar to the
univariate case, the Bayes estimator, denoted by θ̂B (X), is again the posterior mean:
θ̂_B(X) = E[θ | X, σ²] = ∫ θ · p(θ | X, σ²) dθ = ∫ θ · ( f(X | θ, σ²) p(θ | σ²) / f(X | σ²) ) dθ.
The density of X marginalized over θ is
f(X | σ²) = ∫ f(X | θ, σ²) p(θ | σ²) dθ
= ∫ ( 1/(2π)^{n/2} ) e^{−‖X−θ‖₂²/2} · ( 1/(2πσ²)^{n/2} ) e^{−‖θ‖₂²/(2σ²)} dθ
= ( 1/(2π(1 + σ²))^{n/2} ) e^{−‖X‖₂²/(2(1+σ²))},     (2.1)
and similarly
∫ θ · f(X | θ, σ²) p(θ | σ²) dθ = ( 1/(2π(1 + σ²))^{n/2} ) e^{−‖X‖₂²/(2(1+σ²))} · ( σ²/(1 + σ²) ) X.
Hence the Bayes estimator is
θ̂_B(X) = ( σ²/(1 + σ²) ) X = ( 1 − 1/(1 + σ²) ) X,     (2.2)
with Bayes risk
R_π(θ̂_B) = ∬ ‖θ̂_B(x) − θ‖₂² · f(x | θ, σ²) p(θ | σ²) dx dθ = nσ²/(1 + σ²).
Empirical Bayes Let us now consider the empirical Bayes approach to the case where σ² is unknown. Note that the marginal distribution given by (2.1) is N(0, (1 + σ²)Iₙ). To choose an estimator of σ², we require the associated estimator of 1/(1 + σ²) to be unbiased, in view of (2.2). A basic fact is that, if Y follows the chi-squared distribution with n degrees of freedom, then E[1/Y] = 1/(n − 2). Therefore, if we let τ̂(X) := (n − 2)/‖X‖₂², then
E[ τ̂(X) | σ² ] = 1/(1 + σ²).
The associated empirical Bayes estimator is therefore
θ̂_JS(X) = ( 1 − (n − 2)/‖X‖₂² ) X,
which is the James–Stein estimator.
Risk of the James–Stein estimator We now compute the risk of the James–Stein estimator:
R(θ, θ̂_JS) = E‖ ( 1 − (n − 2)/‖X‖₂² ) X − θ ‖₂² = E‖X − θ‖₂² + E[ (n − 2)²/‖X‖₂² ] − 2 E[ ( (n − 2)/‖X‖₂² ) X^⊤(X − θ) ],     (2.3)
where the expectation is with respect to X ∼ N(θ, Iₙ). Conditional on all Xⱼ for j ≠ i, Stein's lemma applied to Xᵢ with g(Xᵢ) = ( (n − 2)/‖X‖₂² ) Xᵢ (see Lemma 1.2 and (1.1)) yields that
E_{Xᵢ∼N(θᵢ,1)}[ ( (n − 2)/‖X‖₂² ) Xᵢ (Xᵢ − θᵢ) ] = E_{Xᵢ∼N(θᵢ,1)}[ (∂/∂Xᵢ) ( (n − 2) Xᵢ/‖X‖₂² ) ]
= E_{Xᵢ∼N(θᵢ,1)}[ (n − 2)/‖X‖₂² − 2(n − 2)Xᵢ²/‖X‖₂⁴ ].
Summing the above equation over i and taking the expectation with respect to X ∼ N(θ, Iₙ), we obtain
E[ ( (n − 2)/‖X‖₂² ) X^⊤(X − θ) ] = E[ n(n − 2)/‖X‖₂² ] − E[ 2(n − 2)/‖X‖₂² ] = E[ (n − 2)²/‖X‖₂² ].
Therefore,
R(θ, θ̂_JS) = n − E[ (n − 2)²/‖X‖₂² ].
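Stein's effect can be observed directly in simulation. The sketch below (the true mean θ, dimension n, and trial count are arbitrary choices) estimates the risks of the MLE X and of the James–Stein estimator by Monte Carlo and checks that the latter is strictly smaller.

```python
import random

# Monte Carlo illustration of Stein's effect for X ~ N(theta, I_n), n >= 3:
# the James-Stein estimator (1 - (n-2)/||X||^2) X has risk
# n - E[(n-2)^2/||X||^2], strictly below the risk n of the MLE X.
# theta, n, trials are arbitrary choices.
random.seed(3)
n, trials = 10, 20000
theta = [1.0] * n  # arbitrary true mean vector

risk_mle = risk_js = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]
    s = sum(xi * xi for xi in x)
    shrink = 1 - (n - 2) / s
    risk_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    risk_js += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))
risk_mle /= trials
risk_js /= trials

assert risk_js < risk_mle  # Stein's effect
```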
Positive-part Stein estimator Instead, we can consider the maximum likelihood estimator (MLE) of 1/(1 + σ²) based on the marginal density (2.1). Namely, we solve
max_τ ( τ/(2π) )^{n/2} e^{−τ‖X‖₂²/2},
which yields τ̃(X) = min{ n/‖X‖₂², 1 }. Therefore, the associated empirical Bayes estimator is
θ̂_PS(X) = ( 1 − min{ n/‖X‖₂², 1 } ) X = ( 1 − n/‖X‖₂² )₊ X,
where (x)₊ := max{x, 0}.
where x ∈ R^n and η ∈ R^m. Suppose that η ∼ p(η) for a prior density p. Let the marginal density of X be f(x) = ∫ f(x | η) p(η) dη. Define a matrix D ∈ R^{n×m} by D_{i,j} = ∂Tⱼ/∂xᵢ. Then
Proof. Lengthy but straightforward computation, which uses Stein’s lemma. See [LC06], Chapter 4,
Theorem 3.2, Corollary 3.3, and Theorem 3.5.
where x, η ∈ R^m. Suppose that the prior is p(η | γ). Let γ̂ be the MLE of γ based on the marginal density f(x | γ) = ∫ f(x | η, γ) p(η | γ) dη. Then the empirical Bayes estimator under the squared loss is
2.4 Minimax estimation
2.4.1 Definitions and examples
Consider X ∼ Pθ where θ ∈ Θ. For an estimator g̃ of g, the maximum risk is supθ∈Θ R(g(θ), g̃(X)).
The minimax risk is the minimum of the maximum risk:
R∗ := inf sup R(g(θ), g̃(X)).
g̃ θ∈Θ
An estimator ĝ(X) of g(θ) is called a minimax estimator if it achieves the minimax risk:
sup R(g(θ), ĝ(X)) = R∗ .
θ∈Θ
• For µ ∼ N(0, τ²), the Bayes estimator, the posterior mean, is µ̃(X) = ( nτ²/(σ² + nτ²) ) X̄, with the Bayes risk R*_π = R_π(µ̃) = σ²τ²/(σ² + nτ²). Since τ² ≥ 0 is arbitrary, Proposition 2.5 implies that R* ≥ σ²/n. Therefore, R* = σ²/n, and µ̂(X) = X̄ is a minimax estimator.
Lemma 2.6. Consider parameter spaces Θ ⊂ Θ0 . Let ĝ(X) be a minimax estimator of g(θ) over
Θ. If
sup R(g(θ), ĝ(X)) = sup R(g(θ), ĝ(X)),
θ∈Θ θ∈Θ0
then ĝ is also minimax over Θ0 .
Proof. If there exists another estimator g̃ with a smaller maximum risk over Θ0 than ĝ, then the
same is true over Θ, contradicting that ĝ is minimax over Θ.
Recall that X̄ is minimax over {(µ, σ 2 ) : σ 2 = M } with minimax risk M/n. Hence it is minimax
over {(µ, σ 2 ) : σ 2 ≤ M } in view of Lemma 2.6.
Since X̄ does not depend on M , we say that X̄ adapts to M and call X̄ an adaptive estimator.
• We now consider the loss L(µ, µ̂) = (µ − µ̂)²/σ², and impose no upper bound on σ² > 0. The estimator X̄ has maximum risk
sup_{(µ,σ²)} E[(µ − X̄)²]/σ² = 1/n.
Since X̄ is minimax over {(µ, σ²) : σ² = 1} with the same minimax risk 1/n, it is minimax over {(µ, σ²) : σ² > 0} as well.
2. If ĝ is the unique Bayes estimator for π, then it is the unique minimax estimator;
so ĝ is minimax.
Replacing the second ≥ with > gives uniqueness.
For any distribution π 0 on Θ, let ĝ 0 be the Bayes estimator for π 0 . Then
R*_{π'} = R_{π'}(ĝ') ≤ R_{π'}(ĝ) ≤ sup_θ R(g, ĝ) = R_π(ĝ) = R*_π.
Theorem 2.8. Let πₙ be a sequence of prior distributions on Θ such that the following limit exists:
R := lim_{n→∞} R*_{πₙ}.
If an estimator ĝ satisfies sup_{θ∈Θ} R(g, ĝ) = R, then ĝ is minimax.
Proof. For any other estimator g̃, it holds that sup_θ R(g, g̃) ≥ R_{πₙ}(g̃) ≥ R*_{πₙ} for every n. Letting n → ∞ gives sup_θ R(g, g̃) ≥ R, so an estimator whose maximum risk equals R is minimax.
For c = nτ²/(1 + nτ²) and X̄ the mean of n i.i.d. N(µ, 1) observations, one can check
E[(cX̄ − µ)²] = c² E[(X̄ − µ)²] + (1 − c)² µ² = ( nτ⁴ + µ² ) / (1 + nτ²)²,
2.5 Admissibility
An estimator ĝ is inadmissible if there exists an estimator g̃ that dominates ĝ, i.e., R(g, g̃) ≤ R(g, ĝ) for all θ ∈ Θ, with strict inequality for some θ.
Proof. If ĝ is the unique Bayes estimator of g with respect to prior π and is dominated by g̃, then R_π(g̃) ≤ R_π(ĝ) = R*_π, so g̃ is also a Bayes estimator for π, contradicting uniqueness.
Proof. If a minimax estimator is inadmissible, then another estimator dominates it and thus is also
minimax, contradicting uniqueness.
Theorem 2.11. If an estimator has constant risk and is admissible, then it is minimax.
Proof. If an estimator with constant risk is not minimax, then another estimator has smaller
maximum risk and thus uniformly smaller risk.
Theorem 2.12. Suppose that L(g, ·) is a strictly convex loss, and ĝ(X) is an admissible estimator
of g(θ). If g̃(X) is another estimator of g(θ) such that R(g, ĝ) = R(g, g̃) at all θ, then ĝ = g̃ with
probability 1.
By the strict convexity of L(g, ·), the average ḡ := (ĝ + g̃)/2 satisfies
L(g, ḡ) < (1/2) ( L(g, ĝ) + L(g, g̃) )
wherever ĝ ≠ g̃. If this happens with nonzero probability, then
R(g, ḡ) < (1/2) ( R(g, ĝ) + R(g, g̃) ) = R(g, ĝ),
contradicting the admissibility of ĝ.
R(µ, µ̂) ≤ σ²/n = R(µ, X̄).
By the bias–variance decomposition, we have
R(µ, µ̂) ≥ (1 + b'(µ))²/I(µ) + b(µ)² = σ²(1 + b'(µ))²/n + b(µ)²,     (2.4)
where I(µ) = − E[ (∂²/∂µ²) log f(X | µ) ] = n/σ². Hence we obtain
(1 + b'(µ))²/n + b(µ)²/σ² ≤ R(µ, µ̂)/σ² ≤ 1/n.     (2.5)
We claim that µ̂ is unbiased, i.e., b(µ) ≡ 0:
3. If b0 (µ) < −ε for a fixed ε > 0 as µ → ±∞, then b(µ) cannot be bounded. Hence there is a
sequence µi → ±∞ such that b0 (µi ) → 0.
4. By (2.5), b(µi ) → 0 as µi → ±∞. Since b(µ) is nonincreasing, b(µ) ≡ 0.
Hence we also have b'(µ) ≡ 0, and (2.4) implies that
R(µ, µ̂) ≥ σ²/n.
We conclude that R(µ, µ̂) = R(µ, X̄), and thus X̄ is admissible. Moreover, X̄ is the only minimax
estimator by the above two theorems.
Consider i.i.d. X1 , . . . , Xn ∼ N(µ, 1) where µ ≥ a for a fixed a ∈ R. Then the above proposition
implies that X̄ is not admissible. However, X̄ is still minimax. To see this, suppose that X̄ is not
minimax. Let µ̂ be an estimator such that
R(µ, µ̂) ≤ 1/n − ε
for all µ ≥ a and a fixed ε > 0. Hence the Cramér–Rao lower bound for biased estimators shows that
(1 + b'(µ))²/n + b(µ)² ≤ 1/n − ε,
where b(µ) = E[µ̂] − µ. Consequently, b(µ) is bounded, and b'(µ) ≤ √(1 − εn) − 1 ≤ −εn/2 for all µ ≥ a, giving a contradiction.
If, in addition, µ ∈ [a, b], then X̄ is neither admissible nor minimax. One can show that max{a, min{X̄, b}} has a uniformly smaller risk.
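The truncation claim can be illustrated numerically. The sketch below (a, b, µ, n, and the trial count are arbitrary choices; X̄ ∼ N(µ, 1/n) is sampled directly rather than averaging n Gaussians) compares the risk of X̄ with that of max{a, min{X̄, b}} when µ ∈ [a, b].

```python
import random

# Monte Carlo check that truncating the sample mean to a known range [a, b]
# containing mu reduces the risk: compare X_bar with max(a, min(X_bar, b)).
# a, b, mu, n, trials are arbitrary choices.
random.seed(4)
a, b, mu, n, trials = 0.0, 1.0, 0.2, 5, 50000

risk_mean = risk_trunc = 0.0
for _ in range(trials):
    x_bar = random.gauss(mu, (1 / n) ** 0.5)  # X_bar ~ N(mu, 1/n)
    risk_mean += (x_bar - mu) ** 2
    risk_trunc += (max(a, min(x_bar, b)) - mu) ** 2
risk_mean /= trials
risk_trunc /= trials

assert risk_trunc < risk_mean
```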
Proof. For the last statement, it can be shown that X is minimax in a way similar to the one-
dimensional case. Hence, µ̂ is also minimax. It remains to prove that µ̂ dominates X.
First, note that the risk of X is
The risk of µ̂ is
Let us write Y = Σ^{−1/2} X ∼ N(η, I_d), where η := Σ^{−1/2} µ. Then ‖X‖₂² = X^⊤X = Y^⊤ΣY.
Conditioning on {Yⱼ}_{j≠i} and applying Stein's lemma E[g(Yᵢ)(Yᵢ − ηᵢ)] = E[g'(Yᵢ)] for Yᵢ ∼ N(ηᵢ, 1), we obtain that each summand above is equal to
E[ (∂/∂Yᵢ) ( ( h(Y^⊤ΣY)/(Y^⊤ΣY) ) Σ_{j=1}^d Yⱼ Σ_{j,i} ) ]
= E[ ( h(Y^⊤ΣY)/(Y^⊤ΣY) ) Σ_{i,i} + ( 2[ h'(Y^⊤ΣY) Y^⊤ΣY − h(Y^⊤ΣY) ] Σ_{k=1}^d Σ_{i,k} Yₖ / (Y^⊤ΣY)² ) Σ_{j=1}^d Yⱼ Σ_{j,i} ],
Note that for d ≥ 3 and Σ = Id , there exists a function h satisfying the assumptions, making
X inadmissible. This counterintuitive result is known as Stein’s example or Stein’s effect.
2.6.2 Poisson estimation
Lemma 2.16. For a random variable X and functions f and g, suppose that E[f (X)], E[g(X)]
and E[f (X)g(X)] all exist. If f and g are both nondecreasing, then
E[f (X)g(X)] ≥ E[f (X)] · E[g(X)].
Moreover, if f and g are strictly increasing and X is not constant, then the above inequality is
strict.
Proof. Let Y be an independent copy of X. Then
f (X)g(X) + f (Y )g(Y ) − f (X)g(Y ) − f (Y )g(X) = f (X) − f (Y ) g(X) − g(Y ) ≥ 0.
Taking the expectation yields the desired inequality.
Proposition 2.17. Consider independent Poisson random variables Xᵢ ∼ Poi(λᵢ) for i ∈ [d] where d ≥ 2. Let λ = (λ₁, …, λ_d) ∈ (0, ∞)^d. Define the class of estimators
λ̂(X) := ( 1 − h(Σ_{i=1}^d Xᵢ) / (Σ_{i=1}^d Xᵢ + b) ) X,
where h is a real-valued nondecreasing function and b > 0. Consider the loss
L(λ, λ̂) := Σ_{i=1}^d (λᵢ − λ̂ᵢ)² / λᵢ.
If h is nondecreasing, 0 < h(·) ≤ 2(d − 1), and b ≥ d − 1, then the estimator λ̂ dominates X.
Proof. Let Z = Σᵢ Xᵢ. Then we have
R(λ, λ̂) = E[ Σ_{i=1}^d (1/λᵢ) ( Xᵢ − ( h(Z)/(Z + b) ) Xᵢ − λᵢ )² ]
= d − 2 E[ ( h(Z)/(Z + b) ) Σ_{i=1}^d (1/λᵢ) Xᵢ(Xᵢ − λᵢ) ] + E[ ( h(Z)²/(Z + b)² ) Σ_{i=1}^d (1/λᵢ) Xᵢ² ].
It is known that the conditional distribution of (X₁, …, X_d) given Z is multinomial with E[Xᵢ | Z] = Zλᵢ/Λ and Var(Xᵢ | Z) = Z(λᵢ/Λ)(1 − λᵢ/Λ), where Λ := Σ_{i=1}^d λᵢ. Therefore,
E[ Σ_{i=1}^d (1/λᵢ) Xᵢ² | Z ] = Σ_{i=1}^d ( (Z/Λ)(1 − λᵢ/Λ) + Z²λᵢ/Λ² ) = (Z/Λ)(d − 1 + Z),
E[ Σ_{i=1}^d (1/λᵢ) Xᵢ(Xᵢ − λᵢ) | Z ] = (Z/Λ)(d − 1 + Z) − Z = (Z/Λ)(d − 1 + Z − Λ).
We obtain
R(λ, λ̂) = d + E[ ( h(Z)Z/((Z + b)Λ) ) ( ( h(Z)/(Z + b) )(d − 1 + Z) − 2(d − 1) + 2(Λ − Z) ) ]
≤ d + 2 E[ ( h(Z)Z/((Z + b)Λ) ) (Λ − Z) ]
by the assumptions b ≥ d − 1 and h(·) ≤ 2(d − 1). By Lemma 2.16,
E[ ( h(Z)Z/((Z + b)Λ) ) (Λ − Z) ] < E[ h(Z)Z/((Z + b)Λ) ] · E[Λ − Z] = 0,
so R(λ, λ̂) < d = R(λ, X), i.e., λ̂ dominates X.
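The domination claimed by Proposition 2.17 is easy to observe in simulation. The sketch below uses the constant function h ≡ d − 1 and b = d − 1 (both allowed by the assumptions); d, λ, and the trial count are arbitrary choices, and a simple inverse-transform Poisson sampler keeps the code self-contained.

```python
import math
import random

# Monte Carlo check that the shrinkage estimator of Proposition 2.17
# dominates X under the weighted loss sum_i (lambda_i - l_i)^2 / lambda_i.
# We take h = d-1 (constant, nondecreasing, 0 < h <= 2(d-1)) and b = d-1.
# d, lambda, trials are arbitrary choices.
random.seed(5)
d, trials = 5, 50000
lam = [0.5, 1.0, 1.5, 2.0, 2.5]
h, b = d - 1, d - 1

def poisson(mu):
    # inverse-transform Poisson sampler (adequate for small mu)
    u, k = random.random(), 0
    p = math.exp(-mu)
    cdf = p
    while u > cdf:
        k += 1
        p *= mu / k
        cdf += p
    return k

risk_x = risk_shrunk = 0.0
for _ in range(trials):
    x = [poisson(m) for m in lam]
    z = sum(x)
    c = 1 - h / (z + b)  # shrinkage factor
    risk_x += sum((xi - m) ** 2 / m for xi, m in zip(x, lam))
    risk_shrunk += sum((c * xi - m) ** 2 / m for xi, m in zip(x, lam))
risk_x /= trials
risk_shrunk /= trials

assert risk_shrunk < risk_x  # the shrinkage estimator dominates X here
```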
Chapter 3
Asymptotic estimation
In this chapter, we are interested in the scenario where the sample size n goes to infinity. Sections 3.1
to 3.5 establish the general theory. Topics of Sections 3.6 and 3.7 are not directly about asymptotic
estimation but are related.
A sequence of random variables Xₙ converges in probability to X, written Xₙ →p X, if for every ε > 0,
P{|Xₙ − X| ≥ ε} → 0 as n → ∞.
• If E(ĝn − g(θ))2 → 0 as n → ∞ for all θ ∈ Θ, then ĝn is consistent for estimating g(θ).
• As a result, if the bias and variance of ĝn both converge to zero for all θ ∈ Θ, then ĝn is a
consistent estimator. In particular, any unbiased estimator with variance converging to zero is
consistent.
• (Weak law of large numbers) Let X₁, …, Xₙ be i.i.d. with finite mean µ and variance σ². Since X̄ has variance σ²/n, it is consistent for estimating µ by the above theorem, i.e., X̄ →p µ.
• In the above setting, consider the unbiased estimator of the variance,
Sₙ² := ( 1/(n − 1) ) Σ_{i=1}^n (Xᵢ − X̄)².
Theorem 3.4. If X₁, …, Xₙ are i.i.d. with expectation µ, and g is a function that is continuous at µ, then g(X̄) →p g(µ). In particular, if g is continuous, then the plug-in estimator g(X̄) is consistent for estimating g(µ).
Proof. Since X̄ →p µ as above, this follows from the continuous mapping theorem.
• A sequence of estimators ĝn of g(θ) is asymptotically unbiased (or unbiased in the limit) if
E[ĝn ] → g(θ).
• Instead of consistency, sometimes we are interested in finer results—rates of convergence. Namely,
we may aim to establish
P{|ĝn − g(θ)| ≤ rn } ≥ 1 − δn ,
where rn and δn both go to zero as n → ∞. This is clearly stronger than convergence in
probability, and the quantity rn is an upper bound on the rate of convergence.
• Convergence in distribution is preserved under continuous mapping, but not necessarily under
sum or product.
• We have Xₙ →d X if and only if E[f(Xₙ)] → E[f(X)] for every bounded continuous real-valued function f.
• (Central limit theorem) Let X₁, …, Xₙ be i.i.d. with mean µ and variance σ². Then √n(X̄ − µ)/σ converges in distribution to the standard Gaussian N(0, 1).
• If Xₙ →d x for a constant x, then Xₙ →p x.
• If Xₙ →d X, Aₙ →d a for a constant a, and Bₙ →d b for a constant b, then Aₙ + BₙXₙ →d a + bX.
• If Xₙ →d X, yₙ → y where {yₙ} is a sequence of real numbers, and X has CDF F(t) which is continuous at t = y, then we have P{Xₙ ≤ yₙ} → P{X ≤ y} = F(y).
Theorem 3.6 (Delta method). Suppose that a real-valued function g on Θ has a nonzero derivative g′(θ) at θ. If √n(Tn − θ) →d N(0, σ²), then

√n(g(Tn) − g(θ)) →d N(0, (g′(θ))² σ²).

Proof. The first-order Taylor expansion of g around θ gives

g(Tn) = g(θ) + g′(θ)(Tn − θ) + Rn(Tn − θ),

where Rn → 0 as Tn → θ. The result then follows from the above properties of convergence.

If g′(θ) = 0, consider the second-order Taylor expansion:

g(Tn) = g(θ) + (1/2) g″(θ)(Tn − θ)² + (1/2) Rn(Tn − θ)²,

where Rn → 0 as Tn → θ. Note that n(Tn − θ)² →d σ² χ²₁, so the same reasoning yields

n(g(Tn) − g(θ)) →d (1/2) σ² g″(θ) χ²₁,

which finishes the proof.
For p = 1/2, we have g′(1/2) = 0 and g″(1/2) = −2. Hence the Delta method implies

n(X̄(1 − X̄) − 1/4) →d −(1/4) χ²₁.
3.2 Asymptotic efficiency
Definition 3.7. Consider i.i.d. X1 , . . . , Xn ∼ Pθ and an estimator ĝn(X1 , . . . , Xn) of g(θ) ∈ R. We say that ĝn is asymptotically normal if

√n(ĝn − g(θ)) →d N(0, v(θ)) for v(θ) > 0.

The quantity v(θ) is called the asymptotic variance of ĝn.

Definition 3.8. A sequence of estimators {ĝn}_{n=1}^∞ is called asymptotically efficient if

√n(ĝn − g(θ)) →d N(0, (g′(θ))²/I(θ)),

where I(θ) is the Fisher information each Xi contains about θ.
If ĝn is unbiased, the Cramér–Rao bound says that

Varθ(ĝn) ≥ (g′(θ))²/(n I(θ)).

When do we have

v(θ) ≥ (g′(θ))²/I(θ)?

A sufficient condition If ĝn is unbiased and Var(√n ĝn) → v(θ), then

v(θ) = lim_{n→∞} Var(√n ĝn) ≥ (g′(θ))²/I(θ).
General result Under some reasonable general conditions, we have v(θ) ≥ (g′(θ))²/I(θ) except on a set of measure zero. See Chapter 6, Theorem 2.6 of [LC06].
Finite parameter space Let us consider a finite parameter space Θ. A sequence of estimators θ̂n is consistent if and only if Pθ{θ̂n = θ} → 1 as n → ∞ for every θ ∈ Θ.

Corollary 3.10. If Θ is finite, then the MLE exists, is unique with probability going to 1, and is consistent.
Real parameter space We consider an open set of parameters Θ ⊂ R and use the shorthand
`(θ | x) = log L(θ | x).
Theorem 3.11. Suppose that f(x | θ) is differentiable with respect to θ ∈ Θ ⊂ R for all x. Then with probability going to 1, the likelihood equation

ℓ′(θ | X) = Σ_{i=1}^n [ (∂/∂θ) f(Xi | θ) ] / f(Xi | θ) = 0

has a root θ̂n = θ̂n(X1 , . . . , Xn), and that root satisfies θ̂n →p θ∗.
By Theorem 3.9, Pθ∗ {Sn } → 1. For any x ∈ Sn , there exists θ̂n ∈ (θ∗ − ε, θ∗ + ε) at which
`0 (θ̂n | x) = 0. Hence we have
Pθ∗ {|θ̂n − θ∗ | < ε} → 1.
Note that we can choose θ̂n to be the root closest to θ∗ so that it does not depend on ε.
However, Theorem 3.11 does not provide a practical way of choosing θ̂n in general, because θ∗
is unknown and we cannot choose the root closest to θ∗ .
Corollary 3.12. If the likelihood equation has a unique root for all x and n, then θ̂n is a consistent
estimator of θ. If, in addition, Θ = (a, b) where −∞ ≤ a < b ≤ ∞, then θ̂n is the MLE with
probability going to 1.
Proof. The first statement is immediate. To prove the second, suppose that θ̂n is not the MLE with
probability bounded away from zero. No other interior point can be the MLE without satisfying
the likelihood equation. On the other hand, if the likelihood converges to a supremum as θ → a or
b, this contradicts Theorem 3.9.
Theorem 3.13. Suppose that for all x and all θ ∈ (θ∗ − ε, θ∗ + ε),

|(∂³/∂θ³) log f(x | θ)| ≤ M(x) and E_{θ∗}[M(X)] < ∞.
Proof. For any fixed x, the Taylor expansion of ℓ′(θ̂n) = ℓ′(θ̂n | x) about θ∗ gives

0 = ℓ′(θ̂n) = ℓ′(θ∗) + (θ̂n − θ∗) ℓ″(θ∗) + (1/2)(θ̂n − θ∗)² ℓ‴(β),

so that

√n(θ̂n − θ∗) = [ (1/√n) ℓ′(θ∗) ] / [ −(1/n) ℓ″(θ∗) − (1/(2n))(θ̂n − θ∗) ℓ‴(β) ].

Note that

(1/√n) ℓ′(θ∗) = √n · (1/n) Σ_{i=1}^n [ (∂/∂θ) f(Xi | θ∗) ] / f(Xi | θ∗) →d N(0, I(θ∗))

by the central limit theorem. In addition, by the bound on the third derivative,

|(1/n) ℓ‴(β)| ≤ (1/n) Σ_{i=1}^n M(Xi) →p E_{θ∗}[M(X)].
Corollary 3.14. If Θ = (a, b) and the likelihood equation has a unique root for all x and n, then
the MLE is asymptotically efficient.
Exponential family Consider the exponential family f(xi | η) = exp(η T(xi) − A(η)). The likelihood equation is

(1/n) Σ_{i=1}^n T(Xi) = A′(η) = E_η[T(Xi)].

Moreover, A″(η) = Var_η(T(Xi)) = I(η) > 0, so the RHS of the above equation is increasing in η. Hence the likelihood equation has exactly one solution η̂n, which satisfies

√n(η̂n − η) →d N(0, 1/Var(T)).
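For a concrete instance (a Python sketch; the Poisson family is our illustrative choice): for Poisson data we have T(x) = x and A(η) = e^η, so the likelihood equation (1/n) Σ T(Xi) = A′(η) has the closed-form root η̂ = log X̄.

```python
import math

# Poisson in exponential-family form: T(x) = x, A(eta) = exp(eta), A'(eta) = exp(eta)
xs = [3, 1, 4, 1, 5, 9, 2, 6]        # a fixed sample for illustration
xbar = sum(xs) / len(xs)

# The likelihood equation (1/n) sum T(X_i) = A'(eta) reads xbar = exp(eta);
# since A'(eta) = exp(eta) is strictly increasing, the root is unique.
eta_hat = math.log(xbar)
print(eta_hat, math.exp(eta_hat))  # A'(eta_hat) recovers the sample mean
```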
45
3.4.1 Some examples
Let X1 , . . . , Xn be i.i.d. observations from some parametric model. We consider finding the MLE
by solving the log-likelihood equation. As we will see, this is not always possible.
Weibull distribution (MLE is the unique solution) The Weibull distribution on (0, ∞) parametrized by λ, k > 0 is used in, for example, survival analysis. Its density is f(x | λ, k) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k} for x > 0. The log-likelihood is

ℓ(λ, k | x) = Σ_{i=1}^n [ log(k/λ) + (k − 1) log(xi/λ) − (xi/λ)^k ].
i=1
Therefore, the likelihood equation, or more precisely, the system of likelihood equations, is

(∂/∂λ) ℓ(λ, k | x) = Σ_{i=1}^n [ −k/λ + k xi^k/λ^{k+1} ] = 0,

(∂/∂k) ℓ(λ, k | x) = Σ_{i=1}^n [ 1/k + log(xi/λ) − (xi/λ)^k log(xi/λ) ] = 0.
• (∂/∂k) ℓ(λ, k | x) → ∞ as k → 0,

• (∂/∂k) ℓ(λ, k | x) < 0 as k → ∞, and

• (∂²/∂k²) ℓ(λ, k | x) = Σ_{i=1}^n [ −1/k² − (xi/λ)^k (log(xi/λ))² ] < 0.
We see that the second likelihood equation also has a unique solution k̂n , which gives the MLE of
k. The asymptotic efficiency of λ̂n and k̂n is guaranteed by the general theory. (However, we need
a multivariate version of the theory proved above.)
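Numerically, the MLE can be computed by eliminating λ via the first equation (which gives λ^k = (1/n) Σ xi^k) and solving the resulting one-dimensional equation in k by bisection. The following Python sketch assumes this standard profile-likelihood reduction; the helper names are our own:

```python
import math
import random

def profile_equation(k, xs):
    """Likelihood equation in k after substituting lambda^k = (1/n) sum x_i^k."""
    s0 = sum(x**k for x in xs)
    s1 = sum(x**k * math.log(x) for x in xs)
    return s1 / s0 - 1.0 / k - sum(math.log(x) for x in xs) / len(xs)

def weibull_mle(xs, lo=1e-3, hi=100.0, iters=100):
    """Bisection for the root in k (the equation increases in k), then recover lambda."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if profile_equation(mid, xs) < 0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    lam = (sum(x**k for x in xs) / len(xs)) ** (1.0 / k)
    return lam, k

random.seed(1)
data = [random.weibullvariate(2.0, 1.5) for _ in range(3000)]  # scale 2, shape 1.5
lam_hat, k_hat = weibull_mle(data)
print(lam_hat, k_hat)  # close to (2.0, 1.5) for a large sample
```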
Mixture of Gaussian distributions (MLE does not exist) Consider the mixture of two Gaussians p N(µ, σ²) + (1 − p) N(η, τ²) where p ∈ (0, 1). The likelihood is

L(µ, σ², η, τ² | x) = Π_{i=1}^n [ (p/(√(2π) σ)) exp(−(xi − µ)²/(2σ²)) + ((1 − p)/(√(2π) τ)) exp(−(xi − η)²/(2τ²)) ].

Consider the factor with i = 1: when µ = x1 and σ → 0, this term goes to infinity. Therefore, the likelihood is unbounded, and the MLE does not exist.
Uniform distribution (MLE is not asymptotically normal) Consider the uniform distributions Unif(0, θ) parametrized by θ > 0, which do not have a common support. The MLE of θ is θ̂n = X_(n) and the MVUE is θ̃n = ((n+1)/n) X_(n). In addition, we have

n(θ − θ̂n) →d Exp(0, θ) and n(θ − θ̃n) →d Exp(−θ, θ),

where Exp(a, b) denotes the exponential distribution with density (1/b) e^{−(x−a)/b} 1_{[a,∞)}(x). In addition, one can show that

E[(n(θ̂n − θ))²] = 2θ² n²/(n² + 3n + 2) → 2θ²  and  E[(n(θ̃n − θ))²] = θ² n²/(n² + 2n) → θ².
3.4.2 Linear regression

Consider the linear regression model

Y = Xβ∗ + ε,

where Y is the vector of observations in R^n, X is the design matrix in R^{n×d}, β∗ is the parameter vector in R^d to be estimated, and ε is the random vector of errors in R^n. There are two types of assumptions on the design matrix X:

• Fixed design: X is a deterministic matrix.

• Random design: X is a random matrix. For example, the n rows of X are i.i.d. random vectors in R^d.
In this section, we assume that ε ∼ N(0, σ² In), so, equivalently, Y ∼ N(Xβ∗, σ² In) with density

f(Y | β∗) = (2πσ²)^{−n/2} exp(−‖Y − Xβ∗‖₂²/(2σ²))

in the case of a fixed design. For a random design, the above is still true conditional on X. The log-likelihood of β is therefore

ℓ(β | Y) = −‖Y − Xβ‖₂²/(2σ²) − (n/2) log(2πσ²).

As a result, the maximum likelihood estimator β̂ is the least squares estimator (LSE)

β̂ := argmin_{β∈R^d} ‖Y − Xβ‖₂².
Low-dimensional case Let us first consider the case where n ≥ d and X is of rank d. The likelihood equation is −2X^⊤(Y − Xβ) = 0, which we can solve to obtain the LSE β̂ = (X^⊤X)^{−1} X^⊤Y.
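A minimal numerical check (a NumPy sketch; the design and parameters are illustrative) that solving the normal equations recovers β∗ up to noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))            # full column rank with probability one
beta_star = np.array([1.0, -2.0, 0.5])
Y = X @ beta_star + 0.1 * rng.standard_normal(n)

# Solve the likelihood equation X^T (Y - X beta) = 0; `solve` is preferable
# to forming (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # close to beta_star
```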
Consider a random design where the n rows {Xi^⊤}_{i=1}^n of X are i.i.d. from some nice distribution with covariance Σ = E[Xi Xi^⊤]. Then the n entries of Y are i.i.d., and the asymptotic efficiency of β̂ is guaranteed by the general multivariate theory. While it is not easy to verify this statement directly, let us show that ∆n := √n(β̂ − β∗) is Gaussian conditional on X and has the correct asymptotic covariance.
We fix n and condition on X in the sequel. First, the conditional Gaussianity of ∆n is clearly
true. Second, β̂ is unbiased:
so the Gaussian vector ∆n has mean zero. Third, the covariance matrix of ∆n can be computed:

E[∆n ∆n^⊤ | X] = n E[(β̂ − β∗)(β̂ − β∗)^⊤ | X]
High-dimensional case We now consider the case where the columns of X are not linearly independent, which is always true if n < d. Then the LSE β̂ may not be unique. In this case, the problem with the formula (X^⊤X)^{−1} X^⊤Y is that X^⊤X is not invertible. However, it suffices to replace (X^⊤X)^{−1} by the Moore–Penrose pseudoinverse (X^⊤X)†, defined using the singular value decomposition (SVD). More precisely, define

To see that β̂ solves the likelihood equation ∇β ‖Y − Xβ‖₂² = −2(X^⊤Y − X^⊤Xβ) = 0, it suffices to use basic properties of the pseudoinverse to check that
As β̂ is not the unique solution of the likelihood equation, the general theory does not apply.
In fact, since n < d, the number of dimensions has to grow as n → ∞. Therefore, it is not even
clear how we can talk about any asymptotic property.
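Even so, the pseudoinverse LSE is well defined for any fixed n < d, and one can verify numerically that it solves the likelihood equation (a NumPy sketch with illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                                  # n < d, so X^T X is singular
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

# Moore-Penrose pseudoinverse computed from the SVD
beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ Y

# beta_hat satisfies the (non-unique) likelihood equation X^T Y - X^T X beta = 0
residual = X.T @ Y - X.T @ X @ beta_hat
print(np.max(np.abs(residual)))  # numerically zero
```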
3.5 Bernstein–von Mises theorem

Gaussian mean estimation Recall the example of Bayesian Gaussian mean estimation: We observe i.i.d. X1 , . . . , Xn ∼ N(θ∗, 1) where θ∗ ∼ N(0, τ²). Then the posterior mean is θ̃n = (nτ²/(1 + nτ²)) X̄. This is very close to the MLE θ̂n = X̄ when n is large, and θ̃n →p θ∗ as n → ∞. Moreover,

√n(θ̃n − θ∗) = √n(θ̃n − θ̂n) + √n(θ̂n − θ∗) →d N(0, 1),

because √n(θ̃n − θ̂n) = −(√n/(1 + nτ²)) X̄ →p 0 and √n(θ̂n − θ∗) →d N(0, 1) by the asymptotic efficiency of the MLE. Therefore, the Bayes estimator also enjoys asymptotic efficiency just like the MLE.
We now introduce the Bernstein–von Mises theorem, which states that, as n → ∞, the posterior
distribution behaves like a Gaussian distribution centered at an efficient estimator (such as the
MLE). The rigorous statement involves a set of regularity assumptions which we omit, and we only
provide a very brief sketch of some ideas of the proof. See [vdV00] for the full statement and proof.
For two probability distributions P and Q with densities f(x) and g(x) respectively, define the total variation distance between them as

TV(P, Q) := (1/2) ∫ |f(x) − g(x)| dµ(x) = (1/2) max_{|h(x)|≤1} ∫ h(x) (f(x) − g(x)) dµ(x).
Theorem 3.15 (Bernstein–von Mises; informal). Consider i.i.d. X1 , . . . , Xn from a “nice” parametric model Pθ∗ where θ∗ ∈ Θ. Let π be a prior on Θ with density p(θ) > 0. Let πn denote the posterior with density p(θ | X) where X = (X1 , . . . , Xn). Moreover, let φn denote the distribution N(θ̂n, 1/(n I(θ∗))) where θ̂n is an asymptotically efficient estimator. Then we have that

TV(πn, φn) →p 0 as n → ∞.
Sketch of ideas. We are interested in the posterior around θ̂n. Recall that the posterior density satisfies

log p(θ | x) = ℓ(θ) + log p(θ) + C(x),

where ℓ(θ) = log L(θ | x) denotes the log-likelihood, and C(x) is a quantity that only depends on x and whose particular value is not important. Let q(φ | x) denote the posterior density of φ := √n(θ − θ̂n). Taylor expansion yields

log q(φ | x) ≈ ℓ(θ̂n) + ℓ′(θ̂n) φ/√n + (1/2) ℓ″(θ̂n) φ²/n + log p(θ̂n + φ/√n) + C(x).
Suppose that θ̂n is the MLE. Then ℓ(θ̂n) depends only on x, and ℓ′(θ̂n) = 0. Moreover, E[ℓ″(θ)] = −nI(θ), so we have ℓ″(θ̂n) ≈ ℓ″(θ∗) ≈ −nI(θ∗). Finally, we approximate the term log p(θ̂n + φ/√n) by log p(θ∗). In summary,

log q(φ | x) ≈ −(1/2) I(θ∗) φ² + C₂(x)

for a quantity C₂(x) that only depends on x. Therefore, q(φ | x) is proportional to exp(−I(θ∗) φ²/2). In other words, the posterior distribution of φ is approximately N(0, 1/I(θ∗)).
3.6 Bootstrap methods

Let Un be the distribution of resampling from the observations uniformly, i.e., Un = Unif({Xi}_{i=1}^n). Consider i.i.d. Yn,i ∼ Un for i = 1, . . . , n. We call

Ȳn := (1/n) Σ_{i=1}^n Yn,i
the bootstrap sample mean. Conditional on X1 , . . . , Xn , it is clear that each Yn,i has mean X̄n .
Therefore, the bootstrap sample mean Ȳn is an estimator of the sample mean X̄n . To understand
the behavior of X̄n , the main idea of bootstrapping is that the distribution of X̄n − θ can be
approximated by the distribution of Ȳn − X̄n . In the case of mean estimation, this claim is justified
by the following theorem.
Theorem 3.16. Let Φ denote the CDF of N(0, 1). We have that, as n → ∞,

sup_{t∈R} | P{√n(Ȳn − X̄n) ≤ t | X1 , . . . , Xn} − Φ(t/σ) | → 0

almost surely with respect to the randomness of X1 , . . . , Xn.
The central limit theorem says that √n(X̄n − θ) →d N(0, σ²), and the convergence is in fact uniform in the sense that

sup_{t∈R} | P{√n(X̄n − θ) ≤ t} − Φ(t/σ) | → 0.

Comparing the above two displays, we see that the behavior of X̄n − θ is indeed similar to that of Ȳn − X̄n when n is large.
In this setting, we understand the distribution of X̄n − θ very well in view of the central limit
theorem, so there is no need to study Ȳn − X̄n . However, for more complicated models, if we have
no idea of the behavior of
θ̂n − θ, where θ̂n := θ̂(X1 , . . . , Xn ),
bootstrapping becomes useful because we can generate bootstrap samples {Yn,1 , . . . , Yn,n } and study
the distribution of
θ̃n − θ̂n , where θ̃n := θ̂(Yn,1 , . . . , Yn,n ).
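The resampling scheme above is straightforward to implement; the following Python sketch (the estimator and constants are our illustrative choices) builds a basic bootstrap confidence interval for the median, a statistic whose sampling distribution is less transparent than that of the mean:

```python
import random
import statistics

def bootstrap_deltas(theta, data, B=2000, seed=0):
    """B bootstrap replicates of theta(resample) - theta(data), resampling
    uniformly with replacement from the observed data."""
    rng = random.Random(seed)
    n = len(data)
    t_hat = theta(data)
    return [theta([data[rng.randrange(n)] for _ in range(n)]) - t_hat
            for _ in range(B)]

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
deltas = sorted(bootstrap_deltas(statistics.median, data))
t_hat = statistics.median(data)
# Basic bootstrap 95% interval: [t_hat - q_{0.975}, t_hat - q_{0.025}] of the deltas
ci = (t_hat - deltas[1949], t_hat - deltas[49])
print(ci)
```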
There are typically two ways to generate the bootstrap samples {Yn,1 , . . . , Yn,n }:
Lemma 3.17. If Xn has CDF Fn and X has CDF F which is continuous, then

Xn →d X =⇒ sup_{t∈R} |Fn(t) − F(t)| → 0.
Proof. Since F is monotone and continuous, there exists xi such that F (xi ) = i/k for each i =
1, . . . , k and −∞ = x0 < x1 < · · · < xk = ∞. For all x ∈ [xi−1 , xi ], we have
Given any ε > 0, take k sufficiently large such that 1/k ≤ ε/2. Then take n sufficiently large
depending on ε and k such that
It follows that
sup |Fn (x) − F (x)| ≤ ε/2 + 1/k ≤ ε.
x∈ R
with respect to the randomness of X. By a version of the central limit theorem for the “triangular array” {Yn,i}, we obtain

√n(Ȳn − X̄n) →d N(0, σ²).

Lemma 3.17 then implies the desired result.
Define the quantile function of a CDF F by Q(u) := inf{t ∈ R : F(t) ≥ u} for u ∈ (0, 1), which is simply the inverse of F if F is invertible. If we know Q and can sample from Unif(0, 1), then we can sample from P.

Proposition 3.18. For any distribution P with quantile function Q, if U ∼ Unif(0, 1), then X = Q(U) has distribution P.

Proof. It suffices to note that {Q(U) ≤ t} = {U ≤ F(t)}, so P{Q(U) ≤ t} = P{U ≤ F(t)} = F(t).
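A Python sketch of Proposition 3.18 (the exponential distribution is our illustrative choice): for Exp(rate), Q(u) = −log(1 − u)/rate, and the samples Q(U) match the exponential CDF:

```python
import math
import random

def exp_quantile(u, rate):
    """Quantile function Q(u) = -log(1 - u) / rate of the Exp(rate) distribution."""
    return -math.log(1.0 - u) / rate

random.seed(0)
samples = [exp_quantile(random.random(), rate=2.0) for _ in range(100_000)]

# P{X <= 1} should match F(1) = 1 - e^{-2}
empirical = sum(s <= 1.0 for s in samples) / len(samples)
print(empirical, 1.0 - math.exp(-2.0))
```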
For i.i.d. X, X1 , . . . , Xn ∼ P, by the law of large numbers, we have

(1/n) Σ_{i=1}^n g(Xi) →p E[g(X)].
Therefore, to approximate E_P[g(X)], where P has density f, we can sample i.i.d. Y1 , . . . , Yn ∼ Q, where Q has density h, and use the fact

(1/n) Σ_{i=1}^n g(Yi) f(Yi)/h(Yi) →p E_P[g(X)].
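A Python sketch of this importance-sampling identity (the target, proposal, and g are illustrative): with target P = N(0, 1), proposal Q = N(0, 4), and g(x) = x², the weighted average approaches E_P[X²] = 1:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(0)
n = 200_000
ys = [random.gauss(0.0, 2.0) for _ in range(n)]   # i.i.d. draws from Q = N(0, 4)
g = lambda x: x * x                               # E_P[g(X)] = 1 under P = N(0, 1)

# (1/n) sum g(Y_i) f(Y_i) / h(Y_i) with f, h the densities of P, Q
estimate = sum(g(y) * normal_pdf(y, 0.0, 1.0) / normal_pdf(y, 0.0, 2.0) for y in ys) / n
print(estimate)  # close to 1
```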
∫_{−∞}^t P{U ≤ f(y)/(M h(y))} h(y) dy / ∫ P{U ≤ f(y)/(M h(y))} h(y) dy = [ (1/M) ∫_{−∞}^t f(y) dy ] / [ (1/M) ∫ f(y) dy ] = F(t).
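A Python sketch of the corresponding rejection sampler (the target and proposal are illustrative): for f(x) = 2x on [0, 1], proposal h = Unif(0, 1), and M = 2, a proposal y is accepted when U ≤ f(y)/(M h(y)) = y:

```python
import random

def rejection_sample(n, seed=0):
    """Sample from f(x) = 2x on [0, 1] with proposal h = Unif(0, 1) and M = 2."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        y = rng.random()                 # proposal draw from h
        if rng.random() <= y:            # accept with probability f(y) / (M h(y))
            out.append(y)
    return out

xs = rejection_sample(100_000)
# Accepted draws have CDF F(t) = t^2, so P{X <= 1/2} = 1/4
frac = sum(x <= 0.5 for x in xs) / len(xs)
print(frac)
```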
1. Yt ∼ fY |X (· | X = Xt−1 );
2. Xt ∼ fX|Y (· | Y = Yt ).
This algorithm generates a sequence (Xt, Yt)_{t=1}^n which is a Markov chain with invariant distribution fX,Y. By the ergodic theorem, for a bivariate function g, we have

(1/n) Σ_{t=1}^n g(Xt, Yt) →p E[g(X, Y)].

This can be generalized to the multivariate case and is often useful in Bayesian estimation, for example, when computing posterior means.
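A Python sketch of a two-variable Gibbs sampler (the bivariate Gaussian target is our illustrative choice): for (X, Y) jointly Gaussian with unit variances and correlation ρ, both conditional distributions are N(ρ · other, 1 − ρ²), and the ergodic average of g(x, y) = xy approaches E[XY] = ρ:

```python
import math
import random

def gibbs_bivariate_normal(rho, n, burn=1000, seed=0):
    """Gibbs sampler alternating the two Gaussian conditionals."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x, chain = 0.0, []
    for t in range(n + burn):
        y = rng.gauss(rho * x, sd)       # step 1: Y_t ~ f_{Y|X}(. | X = x)
        x = rng.gauss(rho * y, sd)       # step 2: X_t ~ f_{X|Y}(. | Y = y)
        if t >= burn:
            chain.append((x, y))
    return chain

chain = gibbs_bivariate_normal(0.8, 50_000)
estimate = sum(x * y for x, y in chain) / len(chain)
print(estimate)  # close to rho = 0.8
```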
Chapter 4
Finite-sample analysis
In this chapter, we focus on the minimax point of view and finite-sample analysis. We frequently prove results of the form

sn ≲ inf_{ĝn} sup_{θ∈Θ} R(g(θ), ĝn) ≲ rn,   (4.1)

where “≲” means “≤” up to a constant factor (independent of n), and sn and rn are sequences that converge to zero as n → ∞. The sequences sn and rn are referred to as rates of estimation or rates of convergence. Hopefully, we have sn = rn, in which case we obtain matching upper and lower bounds on the minimax risk.
A simple example we have seen is that, for X1 , . . . , Xn ∼ N(µ, 1), we have
with a fixed design matrix X ∈ R^{n×d} and ε ∼ N(0, σ² In). Let β̂ be an estimator of β∗. In the sequel, we consider the following loss functions: (1) the mean squared error (1/d)‖β̂ − β∗‖₂², and (2) the mean squared prediction error (1/n)‖Xβ̂ − Xβ∗‖₂². There are usually two types of rates of estimation for the problems studied here: (1) the slow rate 1/√n, and (2) the fast rate 1/n.
Theorem 4.1. For the linear regression model (4.2), the LSE β̂ = (X^⊤X)† X^⊤Y satisfies

(1/n) E‖Xβ̂ − Xβ∗‖₂² = σ² r/n,

where r is the rank of X. In addition, if X is of rank d, then

(1/d) E‖β̂ − β∗‖₂² = (σ²/d) Σ_{i=1}^d 1/λi,

It follows that

As a result, we obtain

E‖β̂ − β∗‖₂² = σ² tr((X^⊤X)^{−1}),
We say that a random variable X is sub-Gaussian with variance proxy σ², and write X ∼ subG(σ²), if

E e^{λ(X−E[X])} ≤ e^{σ²λ²/2}

for any λ ∈ R. In particular, if X ∼ N(µ, σ²), then X ∼ subG(σ²). Moreover, if X ∼ subG(σ²) and a ∈ R, then aX ∼ subG(a²σ²).

Proposition 4.2. For (not necessarily independent) zero-mean random variables Xi ∼ subG(σi²) where i ∈ [n], we have

E[ max_{i∈[n]} Xi ] ≤ max_{i∈[n]} σi · √(2 log n).
Proof. Let σ = max_{i∈[n]} σi. For any λ > 0, we have

E max_{i∈[n]} Xi = (1/λ) E log e^{λ max_{i∈[n]} Xi} ≤ (1/λ) log E e^{λ max_{i∈[n]} Xi} ≤ (1/λ) log Σ_{i∈[n]} E e^{λXi} ≤ (1/λ) log Σ_{i∈[n]} e^{σi²λ²/2} ≤ (log n)/λ + λσ²/2.

Taking λ = σ^{−1}√(2 log n) yields the result.
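A quick Monte Carlo check of Proposition 4.2 in Python (the sizes are illustrative), using n i.i.d. N(0, σ²) variables, which are subG(σ²):

```python
import math
import random

random.seed(0)
n, trials, sigma = 1000, 2000, 1.0
est = sum(max(random.gauss(0.0, sigma) for _ in range(n))
          for _ in range(trials)) / trials          # Monte Carlo estimate of E[max_i X_i]
bound = sigma * math.sqrt(2.0 * math.log(n))
print(est, bound)  # the estimate sits below the bound
```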
We say that a random vector X in R^n is sub-Gaussian with variance proxy σ², and write X ∼ subGn(σ²), if v^⊤X ∼ subG(σ²) for any fixed unit vector v ∈ R^n. In particular, if X ∼ N(µ, σ² In), then X ∼ subGn(σ²).

Proposition 4.3. Let K be a convex polytope in R^n with vertices v1 , . . . , vd, and consider a zero-mean random vector X ∼ subGn(σ²). Then we have

E[ max_{v∈K} X^⊤v ] ≤ max_{i∈[d]} ‖vi‖₂ · σ √(2 log d).

Proof. Since a linear function attains its maximum over K at a vertex, we have max_{v∈K} X^⊤v = max_{i∈[d]} X^⊤vi. By the sub-Gaussianity of X, we have X^⊤vi ∼ subG(σ²‖vi‖₂²), so the result follows from Proposition 4.2.
In this case, the optimization problem (4.3) is a computationally feasible convex program.
Theorem 4.4. For the linear regression model (4.2) where ε is a zero-mean subGn(σ²) random vector, suppose that we have β∗ ∈ B1(κ) and max_{j∈[d]} ‖Xj‖₂ ≤ √n where Xj denotes the jth column of X. Then the constrained LSE β̂_{B1(κ)} satisfies

(1/n) E‖Xβ̂_{B1(κ)} − Xβ∗‖₂² ≲ σκ √(log d / n).
Proof. For simplicity, we write β̂ = β̂B1 (κ) . By the definition of β̂, we have
As a result, we obtain
by Proposition 4.3.
In fact, a rate of estimation like that in Theorem 4.1 also holds for β̂_{B1(κ)}, so that we have

(1/n) E‖Xβ̂_{B1(κ)} − Xβ∗‖₂² ≲ min( σ² d/n, σκ √(log d / n) ).

Let us consider the case σ = 1 for simplicity. In the low-dimensional case d ≪ n, we observe a fast rate d/n. In the high-dimensional case d ≫ n, we obtain a slow yet consistent rate κ √(log d / n). This is usually called the elbow effect. In fact, one can still achieve a fast rate that scales as 1/n in the high-dimensional setting; we discuss a special case in the next section.

There are many potential estimators of β∗:
• For ‖β∗‖₀ ≤ k, the constrained LSE is

Unlike the constraint ‖β‖₁ ≤ k, which is convex, here ‖β‖₀ ≤ k is a discrete constraint with more than (d choose k) possible choices of β. Hence this optimization problem is computationally infeasible in the worst case.
• If β ∗ is k-sparse and each entry of β ∗ is in [−1, 1], then kβ ∗ k1 ≤ k. Hence we can still consider
• The `1 -constrained LSE requires the knowledge of the (typically unknown) sparsity k. Instead,
the most popularly used estimator is the LASSO estimator
Then we define

‖β‖∗ := Σ_{i=1}^d λi |β(i)|.

The SLOPE estimator achieves a slightly better rate of convergence than the LASSO estimator. Although the norm ‖β‖∗ is more involved than ‖β‖₁, the SLOPE estimator can be efficiently computed.
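The LASSO objective can be minimized by proximal gradient descent (ISTA), whose proximal step is coordinatewise soft-thresholding. The following NumPy sketch is illustrative (the tuning λ and the problem sizes are our choices, not prescribed by the notes):

```python
import numpy as np

def ista_lasso(X, Y, lam, iters=3000):
    """Minimize 0.5 * ||Y - X beta||_2^2 + lam * ||beta||_1 by proximal gradient."""
    L = np.linalg.norm(X, 2) ** 2                    # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = beta - X.T @ (X @ beta - Y) / L          # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return beta

rng = np.random.default_rng(0)
n, d, k, sigma = 100, 200, 5, 0.5                    # high-dimensional: d > n
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:k] = 3.0                                  # k-sparse truth
Y = X @ beta_star + sigma * rng.standard_normal(n)

lam = 2.0 * sigma * np.sqrt(2.0 * n * np.log(2.0 * d))   # illustrative tuning
beta_hat = ista_lasso(X, Y, lam)
print(np.linalg.norm(beta_hat - beta_star))          # small; most coordinates are exactly zero
```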
Theorem 4.5. For the linear regression model (4.2) where ε is a zero-mean subGn(σ²) random vector, suppose that we have β∗ ∈ B0(k) and βi∗ ∈ {−1, 0, 1} for each i ∈ [d]. Define an estimator

Then, with probability at least 1 − δ,

(1/n) ‖Xβ̂ − Xβ∗‖₂² ≤ 16σ² (k/n) log(5ed/(2kδ)) ≲ σ² (k/n) log(d/(kδ)).
Proof. Using the same argument as in the proof of Theorem 4.4, we obtain

‖X(β̂ − β∗)‖₂² ≤ 2 ε^⊤X(β̂ − β∗) = 2 ‖X(β̂ − β∗)‖₂ · ε^⊤ [ X(β̂ − β∗) / ‖X(β̂ − β∗)‖₂ ].

Define a set

D := { v ∈ R^n : v = Xu/‖Xu‖₂ for u ∈ {−2, −1, 0, 1, 2}^d, ‖u‖₀ ≤ 2k }.

Then we have

‖X(β̂ − β∗)‖₂ ≤ 2 sup_{v∈D} ε^⊤v.

Since each v ∈ D is a unit vector, we have ε^⊤v ∼ subG(σ²). As a result of a homework problem,

P{ε^⊤v > t} ≤ exp(−t²/(2σ²))

for any t > 0. Moreover, the set D has cardinality at most (d choose 2k) 5^{2k} ≤ (ed/(2k))^{2k} 5^{2k}. By a union bound,

P{ sup_{v∈D} ε^⊤v > t } ≤ (ed/(2k))^{2k} 5^{2k} · exp(−t²/(2σ²)).

For δ ∈ (0, 1), we take t = 2σ √(k log(5ed/(2kδ))) and obtain

P{ sup_{v∈D} ε^⊤v > 2σ √(k log(5ed/(2kδ))) } ≤ δ.

This combined with ‖X(β̂ − β∗)‖₂ ≤ 2 sup_{v∈D} ε^⊤v finishes the proof.
• We say that the design matrix X ∈ R^{n×d} is δ-incoherent if the matrix

(1/n) X^⊤X − Id

is entrywise bounded in absolute value by δ > 0. Note that if the rows xi^⊤ of X are sampled i.i.d. from a distribution with mean zero and covariance Id, then the above matrix is equal to

(1/n) Σ_{i=1}^n xi xi^⊤ − Id

and converges to 0 by the law of large numbers. Therefore, incoherence is a reasonable assumption. Furthermore, it can be shown that as long as n ≳ (log d)/δ², we can sample a random matrix X ∈ R^{n×d} that is δ-incoherent with probability 0.99.
• For any β ∈ R^d and S ⊂ [d], let βS denote the vector in R^d with (βS)i = βi for i ∈ S and (βS)i = 0 for i ∈ S^c := [d] \ S. Define a cone of vectors

CS := { β ∈ R^d : ‖β_{S^c}‖₁ ≤ 3‖βS‖₁ }.

If |S| ≤ k, then the cone CS contains approximately k-sparse vectors β with support approximately contained in S. We say that X satisfies the restricted eigenvalue (RE) condition for k-sparse vectors if

inf_{|S|≤k} inf_{β∈CS} ‖Xβ‖₂² / (n‖β‖₂²) ≥ 1/2.   (4.4)

Note that if the infimum is taken over all β ∈ R^d, then the condition is saying that the smallest eigenvalue of (1/n) X^⊤X is lower bounded by 1/2. This is why the above condition is referred to as the RE condition. Furthermore, it can be shown that as soon as n ≳ k log d, we can sample a random matrix X ∈ R^{n×d} that satisfies (4.4) with probability 0.99.
The following result shows that incoherence is stronger than the RE condition when the param-
eters are appropriately chosen.
Proof. We have

‖Xβ‖₂² = ‖XβS + Xβ_{S^c}‖₂² = ‖XβS‖₂² + ‖Xβ_{S^c}‖₂² + 2 βS^⊤ X^⊤ X β_{S^c}.

Moreover,

(1/n) βS^⊤ X^⊤ X β_{S^c} = βS^⊤ β_{S^c} + βS^⊤ [ (1/n) X^⊤X − Id ] β_{S^c} ≥ −(1/(32k)) ‖βS‖₁ ‖β_{S^c}‖₁ ≥ −(3/(32k)) ‖βS‖₁².
Lemma 4.7. For X ∈ R^{n×d}, suppose that max_{j∈[d]} ‖Xj‖₂² ≤ 2n where Xj denotes the jth column of X. Let ε ∼ subGn(σ²). Then we have ‖X^⊤ε‖∞ ≤ 2σ √(n log(2d/δ)) with probability at least 1 − δ.

Proof. We have Xj^⊤ε ∼ subG(2nσ²) and thus P{|Xj^⊤ε| > t} ≤ 2 exp(−t²/(4nσ²)) by a homework problem. Therefore,

P{‖X^⊤ε‖∞ > t} ≤ P{ max_{j∈[d]} |Xj^⊤ε| > t } ≤ 2d exp(−t²/(4nσ²)).

Choosing t = 2σ √(n log(2d/δ)) completes the proof.
By Hölder’s inequality and the above lemma, it holds with probability at least 1 − δ that

ε^⊤X(β̂ − β∗) ≤ ‖X^⊤ε‖∞ ‖β̂ − β∗‖₁ ≤ 4σ √(n log(2d/δ)) ‖β̂ − β∗‖₁ = (λ/2) ‖β̂ − β∗‖₁.

As a result, with S chosen to be the support of β∗, we get
that is, β̂ − β∗ satisfies the cone condition. Then it follows from the Cauchy–Schwarz inequality and (4.4) that

‖β̂S − βS∗‖₁ ≤ √k ‖β̂S − βS∗‖₂ ≤ √k ‖β̂ − β∗‖₂ ≤ √(2k/n) ‖Xβ̂ − Xβ∗‖₂.

This completes the proof in view of the definition of λ and the fact that ‖β̂ − β∗‖₂² ≤ (2/n) ‖Xβ̂ − Xβ∗‖₂².
f(yi | xi^⊤β∗) = exp( (yi · xi^⊤β∗ − A(xi^⊤β∗)) / σ² ) · h(yi, σ)

for functions A(·), h(·), and noise parameter σ > 0. Suppose the observations are independent so that the log-likelihood is

ℓ(β | Y) = Σ_{i=1}^n [ (Yi · xi^⊤β − A(xi^⊤β)) / σ² + log h(Yi, σ) ].

Usually the function A is convex, so we can efficiently solve for the MLE.
Let us see two examples of the above general model:
Gaussian linear regression In linear regression (4.2) with Gaussian noise ε ∼ N(0, σ² In), let Yi be the ith entry of Y, and let xi^⊤ be the ith row of X. Then we have

f(yi | xi^⊤β∗) = exp( (yi · xi^⊤β∗ − (xi^⊤β∗)²/2) / σ² ) · exp( −yi²/(2σ²) − log √(2πσ²) ).
Then we have

f(yi | xi^⊤β∗) = [ 1/(1 + exp(−xi^⊤β∗)) ]^{yi} [ exp(−xi^⊤β∗)/(1 + exp(−xi^⊤β∗)) ]^{1−yi}
= exp( yi log[ 1/(1 + exp(−xi^⊤β∗)) ] + (1 − yi) log[ exp(−xi^⊤β∗)/(1 + exp(−xi^⊤β∗)) ] )
= exp( yi · xi^⊤β∗ − log(1 + exp(xi^⊤β∗)) ),

which is again a special case of the general model. What motivates the logistic regression model? The task of classification: β∗ is a linear classifier that we aim to learn, each xi is a vector of d features, and each Yi represents an outcome of classification.
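The logistic MLE has no closed form, but since the log-likelihood is concave it is easy to compute by gradient ascent. A NumPy sketch (the dimensions, step size, and true β∗ are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_mle(X, Y, steps=1000, lr=1.0):
    """Gradient ascent on l(beta) = sum_i [Y_i x_i^T beta - log(1 + exp(x_i^T beta))];
    the gradient is X^T (Y - sigmoid(X beta))."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        beta += lr * X.T @ (Y - sigmoid(X @ beta)) / len(Y)
    return beta

rng = np.random.default_rng(0)
n, d = 5000, 3
X = rng.standard_normal((n, d))
beta_star = np.array([1.0, -0.5, 2.0])
Y = (rng.random(n) < sigmoid(X @ beta_star)).astype(float)  # Y_i ~ Bernoulli(F(x_i^T beta*))

beta_hat = logistic_mle(X, Y)
print(beta_hat)  # approaches beta_star as n grows
```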
There is a different class of generalized linear models:

Yi = F(xi^⊤β∗) + εi

Θ := {β ∈ R^d : |xi^⊤β| ≤ B, i ∈ [n]},
Theorem 4.9. Consider the logistic regression model (4.6) where β∗ ∈ Θ. Then the MLE β̂ satisfies

(1/n) E‖Xβ̂ − Xβ∗‖₂² ≲_B r/n,

where the hidden constant depends on B and r is the rank of X.
Proof. One can check that g(t) is κ-strongly concave in the sense that

for all t, t∗ with |t| ≤ B, |t∗| ≤ B, for a constant κ = κ(B) > 0. Since |xi^⊤β| ≤ B, we see that

ℓ(β̂) = ℓ(β̂ | Y) = Σ_{i=1}^n [ Yi g(xi^⊤β̂) + (1 − Yi) g(−xi^⊤β̂) ]
≤ Σ_{i=1}^n [ Yi g(xi^⊤β∗) + (1 − Yi) g(−xi^⊤β∗) ]
+ Σ_{i=1}^n [ Yi g′(xi^⊤β∗) − (1 − Yi) g′(−xi^⊤β∗) ] (xi^⊤β̂ − xi^⊤β∗)
− κ Σ_{i=1}^n (xi^⊤β̂ − xi^⊤β∗)²
= ℓ(β∗) + ε^⊤(Xβ̂ − Xβ∗) − κ ‖Xβ̂ − Xβ∗‖₂²,
where uj is the jth column of U. Moreover, the fact that g′(t) = 1 − F(t) = F(−t) implies

In view of the model (4.6), we see that εi is simply the deviation of Yi from its mean, so E[εi²] ≤ 1/4. It follows that

E[(ε^⊤uj)²] = E[ ( Σ_{i=1}^n εi (uj)i )² ] ≤ (1/4) ‖uj‖₂² = 1/4.

Combining everything yields

(1/n) E‖Xβ̂ − Xβ∗‖₂² ≤ (1/(κ²n)) Σ_{j=1}^r E[(ε^⊤uj)²] ≤ r/(4κ²n),
4.4 Nonparametric regression

Let us consider an even more general regression model

Yi = f(xi) + εi,  i ∈ [n],

where xi ∈ R^d are the design points, and εi are independent noise variables with E[εi] = 0 and E[εi²] ≤ σ². The linear and generalized linear regression models assume f(xi) = xi^⊤β∗ and f(xi) = F(xi^⊤β∗) respectively. Nonparametric regression, on the other hand, does not assume that there is an underlying parameter vector β∗. Instead, we impose a certain nonparametric assumption on the function f, such as smoothness, monotonicity, or convexity.
Definition 4.10. Fix β > 0 and let ℓ := ⌊β⌋. The Hölder class Σ(β) on [0, 1] is defined as the set of ℓ times differentiable functions f : [0, 1] → R whose ℓth derivative f^(ℓ) satisfies

|f^(ℓ)(x) − f^(ℓ)(x′)| ≤ L|x − x′|^{β−ℓ}, for all x, x′ ∈ [0, 1],

for some constant L > 0. We also use Σ(β, L) to denote all functions f satisfying the above conditions for a fixed L > 0.

Note that for a larger β, the condition is stronger and thus Σ(β) is smaller.
Kernels Before defining the estimator of interest, we first introduce a kernel function K : R → R,
such as:
Note that the above kernels except the Gaussian kernel satisfy these conditions.
Nadaraya–Watson estimator Given a kernel K and a bandwidth h > 0, a prominent kernel estimator of the regression function f is the Nadaraya–Watson estimator f̂^NW, defined as

f̂^NW(x) := Σ_{i=1}^n Yi K((xi − x)/h) / Σ_{i=1}^n K((xi − x)/h).
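A Python sketch of the Nadaraya–Watson estimator (the kernel choice, bandwidth, and regression function are illustrative):

```python
import math
import random

def nadaraya_watson(x, xs, ys, h):
    """Locally weighted average with the Epanechnikov kernel K(u) = 0.75 (1 - u^2)_+."""
    K = lambda u: 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0
    ws = [K((xi - x) / h) for xi in xs]
    total = sum(ws)
    return sum(w * y for w, y in zip(ws, ys)) / total if total > 0 else 0.0

random.seed(0)
n, h = 2000, 0.1
xs = [i / n for i in range(1, n + 1)]                 # fixed design x_i = i/n
f = lambda t: math.sin(2.0 * math.pi * t)
ys = [f(x) + 0.2 * random.gauss(0.0, 1.0) for x in xs]

estimate = nadaraya_watson(0.25, xs, ys, h)
print(estimate, f(0.25))  # the estimate is close to f(0.25) = 1
```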
More generally, one can consider a linear nonparametric regression estimator f̂^linear of the form

f̂^linear(x) = Σ_{i=1}^n Yi Wi(x),
Local polynomial estimator Following the intuition of weighted least squares, we can design a more sophisticated estimator using the Taylor expansion of f, with θ replaced by a polynomial of degree ℓ = ⌊β⌋. For f ∈ Σ(β, L) where β > 1, for z close to x, we have

f(z) ≈ f(x) + f′(x)(z − x) + (f″(x)/2)(z − x)² + · · · + (f^(ℓ)(x)/ℓ!)(z − x)^ℓ = θ(x)^⊤ U((z − x)/h),

where the vectors θ(x) = θh(x) and U(u) are defined by

θ(x) = ( f(x), f′(x)h, f″(x)h², . . . , f^(ℓ)(x)h^ℓ )^⊤,
U(u) = ( 1, u, u²/2, . . . , u^ℓ/ℓ! )^⊤.
Definition 4.11. The local polynomial estimator of order ℓ (LP(ℓ) estimator) of θ(x) is the vector θ̂(x) in R^{ℓ+1} defined by

θ̂(x) := argmin_{θ∈R^{ℓ+1}} Σ_{i=1}^n [ Yi − θ^⊤ U((xi − x)/h) ]² K((xi − x)/h).
Note that f̂^NW is simply the LP(0) estimator.
We can rewrite θ̂(x) as
If B(x) is positive definite, then the solution θ̂(x) of the quadratic program is given by
Consequently, we have

f̂(x) = Σ_{i=1}^n Yi Wi(x)   (4.10)
where

Wi(x) := (1/(nh)) U(0)^⊤ B(x)^{−1} U((xi − x)/h) K((xi − x)/h).   (4.11)

In particular, the LP(ℓ) estimator f̂(x) of f(x) is a linear estimator (linear in the data Yi).
Proposition 4.12. Suppose that B(x) is positive definite. Let Q(x) be a polynomial of degree ≤ ℓ. Then we have

Σ_{i=1}^n Q(xi) Wi(x) = Q(x).

Proof. Note that

Q(xi) = Q(x) + Q′(x)(xi − x) + · · · + (Q^(ℓ)(x)/ℓ!)(xi − x)^ℓ = q(x)^⊤ U((xi − x)/h),

where q(x) := ( Q(x), Q′(x)h, . . . , Q^(ℓ)(x)h^ℓ )^⊤ ∈ R^{ℓ+1}. Set Yi = Q(xi). Then we have

θ̂(x) = argmin_{θ∈R^{ℓ+1}} Σ_{i=1}^n [ Q(xi) − θ^⊤ U((xi − x)/h) ]² K((xi − x)/h)
= argmin_{θ∈R^{ℓ+1}} Σ_{i=1}^n [ (q(x) − θ)^⊤ U((xi − x)/h) ]² K((xi − x)/h)
= argmin_{θ∈R^{ℓ+1}} (q(x) − θ)^⊤ B(x) (q(x) − θ).

Since B(x) is positive definite, we have θ̂(x) = q(x) and therefore f̂(x) = θ̂(x)₁ = Q(x). The reproducing property then follows from (4.10).

For (4.12), take respectively Q(t) ≡ 1 and Q(t) = (t − x)^k.
In addition, we impose an assumption: the smallest eigenvalue λmin(B(x)) of B(x) satisfies

λmin(B(x)) ≥ λ0   (4.13)

for any x ∈ [0, 1], for a constant λ0 > 0. In particular, ‖B(x)^{−1}v‖₂ ≤ ‖v‖₂/λ0 for any v ∈ R^{ℓ+1}.

Lemma 4.13. Under assumptions (4.9) and (4.13), the weights defined in (4.11) satisfy:

• Wi(x) = 0 if |x − xi| > h, for any i ∈ [n];

• |Wi(x)| ≤ 2/(nhλ0) for any x ∈ [0, 1] and i ∈ [n];

• Σ_{i=1}^n |Wi(x)| ≤ 8/λ0 for any x ∈ [0, 1] if h ≥ 1/(2n).
We now study the rate of estimation for the local polynomial estimator f̂(x) of f(x) in terms of the mean squared risk E[(f̂(x) − f(x))²]. To this end, we consider the bias–variance decomposition:
Theorem 4.14. Suppose that f : [0, 1] → R belongs to the Hölder class Σ(β, L) for β, L > 0. Consider the model Yi = f(xi) + εi where i ∈ [n], xi = i/n, and εi are independent with E[εi] = 0 and E[εi²] ≤ σ². Let f̂ be the LP(ℓ) estimator of f with ℓ = ⌊β⌋ and kernel K satisfying (4.9). Assume (4.13) and h ≥ 1/(2n).

• For any x ∈ [0, 1], we have the following upper bounds on the bias and the variance of f̂:

|Bias(x)| ≤ 8Lh^β/(ℓ! λ0),  Var(x) ≤ 16σ²/(λ0² nh).

• As a result, for C = C(β, L, λ0, σ) := 32 (2L/(λ0² ℓ!))^{2/(2β+1)} σ^{4β/(2β+1)}, we have

E[(f̂(x) − f(x))²] ≤ C n^{−2β/(2β+1)}.

for some τi ∈ [0, 1], where we note that we could insert a term −f^(ℓ)(x) in the numerator since the corresponding sum vanishes. It follows from f ∈ Σ(β, L) and Lemma 4.13 that

|Bias(x)| ≤ Σ_{i=1}^n (L|xi − x|^β/ℓ!) |Wi(x)| ≤ (Lh^β/ℓ!) Σ_{i=1}^n |Wi(x)| ≤ 8Lh^β/(ℓ! λ0).

Therefore, choosing h = (ℓ!σ²/(2L))^{2/(2β+1)} n^{−1/(2β+1)} yields
Some remarks:

• Note that we have the rate of estimation n^{−2β/(2β+1)} for the pointwise risk E[(f̂(x) − f(x))²] at each x ∈ [0, 1]. This is stronger than bounding an average risk like E‖f̂ − f‖₂².

• As β grows, the function becomes smoother, so the rate n^{−2β/(2β+1)} improves as expected. In particular, as β → ∞, the nonparametric rate n^{−2β/(2β+1)} tends to the parametric rate 1/n.

• Here we choose h depending on the smoothness parameters β and L. In practice, we may not know how smooth the function is a priori. To address this issue, we can in fact design adaptive estimators that do not depend on these smoothness parameters.

• In dimension d, when we estimate a β-Hölder smooth function f : [0, 1]^d → R from n noisy observations, one can similarly establish the rate of estimation n^{−2β/(2β+d)}.
The name “nonparametric” simply refers to the setup where there is no obvious parameter (like β in linear regression). In fact, it is without loss of generality to focus on the framework of parametric estimation by viewing nonparametric estimation as a setup which has a “large” parameter space. For example, for the nonparametric regression discussed in these two sections, we can view the Hölder class Σ(β, L) as the parameter space and view the function f as the parameter.
Chapter 5

In the previous chapter, we studied several regression models and proved rates of estimation for specific estimators. That is, we established an upper bound on the minimax risk of the form

inf_{θ̂n} sup_{θ∈Θ} R(θ, θ̂n) ≲ rn,

for some rate rn. Can we show that the minimax risk is also lower bounded by some rate sn which is hopefully equal to rn? If so, this suggests that the estimator that achieves the upper bound is essentially the best we can hope for in the minimax sense.
Reduction to bounds in probability Suppose that the risk we would like to lower bound is of the form R(θ, θ̂) = Eθ[d(θ, θ̂)²] for some pseudometric d(·, ·). If we can establish

then (5.1) obviously holds. The difficulty for proving lower bounds usually lies in how to appropriately choose θ1 , . . . , θM, which we call hypotheses.
Reduction to hypothesis testing The crucial requirement is that d(θi, θj)² ≥ 4sn for any distinct i, j ∈ [M]. Given any estimator θ̂, consider the minimum distance test

By the triangle inequality, for i ≠ ψ(θ̂),

d(θi, θ̂) ≥ d(θi, θ_{ψ(θ̂)}) − d(θ_{ψ(θ̂)}, θ̂) ≥ d(θi, θ_{ψ(θ̂)}) − d(θi, θ̂),

so that

d(θi, θ̂) ≥ (1/2) d(θi, θ_{ψ(θ̂)}) ≥ √sn.

Therefore, we obtain

inf_{θ̂} max_{i∈[M]} Pθi{d(θi, θ̂)² ≥ sn} ≥ inf_{θ̂} max_{i∈[M]} Pθi{ψ(θ̂) ≠ i} ≥ inf_ψ max_{i∈[M]} Pθi{ψ ≠ i},

where the infimum is taken over all tests ψ that are measurable with respect to the observations and take values in [M]. We have proved the following theorem.
Theorem 5.1. Let $\theta_1, \dots, \theta_M \in \Theta$ be such that $d(\theta_i, \theta_j)^2 \ge 4 s_n$ for any distinct $i, j \in [M]$. Then
\[ \inf_{\hat\theta} \max_{i \in [M]} \mathbb{P}_{\theta_i}\{d(\theta_i, \hat\theta)^2 \ge s_n\} \ge \inf_{\psi} \max_{i \in [M]} \mathbb{P}_{\theta_i}\{\psi \ne i\}, \]
where the infimum on the right-hand side is taken over all tests ψ that are measurable with respect to the observations and take values in $[M]$.
Lemma 5.2. Let $P_0$ and $P_1$ be probability measures with densities $p_0$ and $p_1$. Then
\[ \inf_{\psi} \big( P_0\{\psi = 1\} + P_1\{\psi = 0\} \big) = \int \min(p_0, p_1), \]
where the infimum is over all tests ψ taking values in $\{0, 1\}$. Moreover, the equality holds for the likelihood ratio test $\psi^* := 1\{p_1/p_0 \ge 1\}$.
This is the Neyman–Pearson lemma, although the name sometimes refers to a different formulation.
Proof. First note that
\[ P_0\{\psi^* = 1\} + P_1\{\psi^* = 0\} = \int_{\{\psi^* = 1\}} p_0 + \int_{\{\psi^* = 0\}} p_1 = \int_{\{p_1 \ge p_0\}} p_0 + \int_{\{p_1 < p_0\}} p_1 = \int_{\{p_1 \ge p_0\}} \min(p_0, p_1) + \int_{\{p_1 < p_0\}} \min(p_0, p_1) = \int \min(p_0, p_1). \]
Next, for any test ψ, define $R := \{\psi = 1\}$. Also define $R^* := \{p_1 \ge p_0\}$. Then we have
\begin{align*}
P_0\{\psi = 1\} + P_1\{\psi = 0\} &= P_0\{R\} + 1 - P_1\{R\} \\
&= 1 + \int_R (p_0 - p_1) \\
&= 1 + \int_{R \cap R^*} (p_0 - p_1) + \int_{R \cap (R^*)^c} (p_0 - p_1) \\
&= 1 - \int_{R \cap R^*} |p_0 - p_1| + \int_{R \cap (R^*)^c} |p_0 - p_1| \\
&= 1 - \int |p_0 - p_1| \big( 1\{R \cap R^*\} - 1\{R \cap (R^*)^c\} \big),
\end{align*}
which is minimized at $R = R^*$.
The total variation distance between $P_0$ and $P_1$ is defined as any of the following quantities:
\[ \mathrm{TV}(P_0, P_1) = \frac{1}{2} \int |p_0 - p_1| = 1 - \int \min(p_0, p_1) = 1 - \inf_{\psi} \big( P_0\{\psi = 1\} + P_1\{\psi = 0\} \big). \]
The equivalence of the first two definitions is proved as a homework problem. The above lemma
gives the second equivalence.
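As a quick numerical sanity check (a sketch, not part of the notes; the two Gaussian densities, the grid, and the step size are arbitrary choices), the first two expressions for TV can be compared by direct integration:

```python
import numpy as np

# Densities of P0 = N(0, 1) and P1 = N(1, 1) on a fine grid.
x = np.linspace(-8.0, 9.0, 170_001)
dx = x[1] - x[0]
p0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
p1 = np.exp(-(x - 1) ** 2 / 2) / np.sqrt(2 * np.pi)

# TV via (1/2) * integral of |p0 - p1| and via 1 - integral of min(p0, p1).
tv_abs = 0.5 * np.sum(np.abs(p0 - p1)) * dx
tv_min = 1.0 - np.sum(np.minimum(p0, p1)) * dx

print(tv_abs, tv_min)  # both approximately 2*Phi(1/2) - 1 ~ 0.3829
```

The two values agree up to discretization error, illustrating the first equivalence above.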
Combining Theorem 5.1 and Lemma 5.2 with the definition of the total variation distance, we have established Le Cam's two-point bound.
Theorem 5.3 (Le Cam's two-point bound). For $\theta_0, \theta_1 \in \Theta$,
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}_\theta[d(\theta, \hat\theta)^2] \ge \frac{1}{8} d(\theta_0, \theta_1)^2 \big( 1 - \mathrm{TV}(P_{\theta_0}, P_{\theta_1}) \big). \]
A couple of remarks: The constant 1/8 can be refined to 1/4 with an improved argument. Moreover, by the chain of inequalities between f-divergences proved in the homework, TV in the above theorem can be replaced by $H$, $\sqrt{\mathrm{KL}}$, or $\sqrt{\chi^2}$, which are typically easier to compute.
Lemma 5.4. We have
\[ \mathrm{KL}\big( N(\mu_1, \sigma^2 I_d), N(\mu_2, \sigma^2 I_d) \big) = \frac{\|\mu_1 - \mu_2\|_2^2}{2\sigma^2}. \]
Proof. The one-dimensional case follows from direct computation:
\[ \mathrm{KL}\big( N(\mu_1, \sigma^2), N(\mu_2, \sigma^2) \big) = \int \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu_1)^2}{2\sigma^2}} \Big[ -\frac{(x-\mu_1)^2}{2\sigma^2} + \frac{(x-\mu_2)^2}{2\sigma^2} \Big] \, dx = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}. \]
The multivariate case follows from the tensorization property of KL established in the homework:
\[ \mathrm{KL}\big( N(\mu_1, \sigma^2 I_d), N(\mu_2, \sigma^2 I_d) \big) = \sum_{i=1}^d \mathrm{KL}\big( N((\mu_1)_i, \sigma^2), N((\mu_2)_i, \sigma^2) \big) = \frac{\|\mu_1 - \mu_2\|_2^2}{2\sigma^2}. \]
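The closed form in Lemma 5.4 is easy to verify numerically; the following sketch (not from the notes; the dimension, means, and sample size are arbitrary choices) compares a Monte Carlo estimate of the KL divergence with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 2.0
mu1 = np.array([1.0, -1.0, 0.5])
mu2 = np.array([0.0, 1.0, 0.0])

# KL(P, Q) = E_{X ~ P}[log p(X) - log q(X)]; for Gaussians with equal covariance
# sigma^2 I_d, log p(x) - log q(x) = (||x - mu2||^2 - ||x - mu1||^2) / (2 sigma^2).
X = mu1 + sigma * rng.standard_normal((1_000_000, d))
log_ratio = (np.sum((X - mu2) ** 2, axis=1) - np.sum((X - mu1) ** 2, axis=1)) / (2 * sigma**2)
kl_mc = log_ratio.mean()

kl_formula = np.sum((mu1 - mu2) ** 2) / (2 * sigma**2)
print(kl_mc, kl_formula)  # the two agree up to Monte Carlo error
```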
Consider the nonparametric regression model (4.8), where we assume that $x_i = i/n$ and the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$ noise for $i \in [n]$. We aim to establish a minimax lower bound over the Hölder class $\Sigma(\beta, L)$ where $\beta, L > 0$, for the distance $d(f, g) = |f(x_0) - g(x_0)|$ at a fixed point $x_0 \in [0, 1]$.
Theorem 5.5. For any $x_0 \in [0, 1]$, there exists a constant $c = c_2(\beta) L^{2/(2\beta+1)} \sigma^{4\beta/(2\beta+1)} > 0$ such that
\[ \inf_{\hat f} \sup_{f \in \Sigma(\beta, L)} \mathbb{E}_f \big( \hat f(x_0) - f(x_0) \big)^2 \ge c \, n^{-2\beta/(2\beta+1)}. \]
Therefore, we obtain an upper bound on $\mathrm{KL}(P_{f_0}, P_{f_1})$.
Separation We have
\[ d(f_0, f_1) = f_1(x_0) = L h^\beta K(0) = L c_0^\beta n^{-\beta/(2\beta+1)} c_1 / e. \]
If $c_0 = \big( \frac{\sigma^2 e^2}{4 L^2 c_1^2} \big)^{1/(2\beta+1)}$, then $\mathrm{TV}(P_{f_0}, P_{f_1}) \le \sqrt{\mathrm{KL}(P_{f_0}, P_{f_1})} \le 1/2$ by a homework problem. The proof is complete thanks to Theorem 5.3.
Note that this minimax lower bound matches the pointwise upper bound in Theorem 4.14 up to a constant factor. However, the two-point method is not sufficient for establishing a matching lower bound on the integrated error $\|\hat f - f\|_{L^2}^2$.
To establish a lower bound, we apply the two-point method to the hypotheses $P_0 = \otimes_{i=1}^n N(v, I_d)$ and $P_1 = \otimes_{i=1}^n N(w, I_d)$. Then
\[ \mathrm{KL}(P_0, P_1) = \sum_{i=1}^n \mathrm{KL}\big( N(v, I_d), N(w, I_d) \big) = \frac{n}{2} \|v - w\|_2^2. \]
This is optimal in the sample size n but not in the dimension d unless it is a constant.
One powerful tool for proving high-dimensional lower bounds is the following theorem called
Assouad’s lemma.
Theorem 5.6 (Assouad). Let $\{P_\omega : \omega \in \{0,1\}^d\}$ be a set of $2^d$ probability measures, and let $\mathbb{E}_\omega$ denote the corresponding expectations. Then
\[ \inf_{\hat\omega} \max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, \rho(\hat\omega, \omega) \ge \frac{d}{2} \min_{\rho(\omega, \omega') = 1} \big( 1 - \mathrm{TV}(P_\omega, P_{\omega'}) \big), \]
where the infimum is over all estimators $\hat\omega$ taking values in $\{0,1\}^d$, and $\rho(\hat\omega, \omega) = \sum_{i=1}^d 1\{\hat\omega_i \ne \omega_i\}$ denotes the Hamming distance between $\hat\omega$ and $\omega$.
Proof. We have
\[ \max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, \rho(\hat\omega, \omega) \ge \frac{1}{2^d} \sum_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, \rho(\hat\omega, \omega) = \frac{1}{2^d} \sum_{\omega \in \{0,1\}^d} \sum_{i=1}^d \mathbb{E}_\omega \, 1\{\hat\omega_i \ne \omega_i\} = \frac{1}{2^d} \sum_{i=1}^d \sum_{\omega \in \{0,1\}^d} P_\omega \{\hat\omega_i \ne \omega_i\}. \]
Let $\omega_{-i} \in \{0,1\}^{d-1}$ denote the subvector of $\omega$ with its $i$th entry removed. Let $(\omega_{-i}, 1)$ denote the vector $\omega$ whose $i$th entry is equal to 1. Then
\begin{align*}
\max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, \rho(\hat\omega, \omega) &\ge \frac{1}{2^d} \sum_{i=1}^d \sum_{\omega_{-i} \in \{0,1\}^{d-1}} \big( P_{(\omega_{-i},1)} \{\hat\omega_i = 0\} + P_{(\omega_{-i},0)} \{\hat\omega_i = 1\} \big) \\
&\ge \frac{1}{2^d} \sum_{i=1}^d \sum_{\omega_{-i} \in \{0,1\}^{d-1}} \big( 1 - \mathrm{TV}(P_{(\omega_{-i},1)}, P_{(\omega_{-i},0)}) \big) \\
&\ge \frac{d}{2} \min_{\rho(\omega, \omega') = 1} \big( 1 - \mathrm{TV}(P_\omega, P_{\omega'}) \big).
\end{align*}
Lemma 5.7. In the problem of estimating $\theta \in \Theta$ where $\Theta$ is a closed set, let $\tilde\theta$ denote an estimator that takes values in $\Theta$, and let $\hat\theta$ denote an arbitrary estimator. Then we have
\[ \frac{1}{4} \inf_{\tilde\theta} \sup_{\theta \in \Theta} \mathbb{E}[d(\theta, \tilde\theta)^2] \le \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}[d(\theta, \hat\theta)^2] \le \inf_{\tilde\theta} \sup_{\theta \in \Theta} \mathbb{E}[d(\theta, \tilde\theta)^2]. \]
Proof. The second inequality is trivial. Let us focus on the first inequality. Consider an arbitrary estimator $\hat\theta$. Define
\[ \tilde\theta := \operatorname*{argmin}_{\theta \in \Theta} d(\theta, \hat\theta), \]
so that $d(\tilde\theta, \hat\theta) \le d(\theta, \hat\theta)$ for every $\theta \in \Theta$. By the triangle inequality, $d(\theta, \tilde\theta) \le d(\theta, \hat\theta) + d(\hat\theta, \tilde\theta) \le 2\, d(\theta, \hat\theta)$. As a result,
\[ \sup_{\theta \in \Theta} \mathbb{E}[d(\theta, \tilde\theta)^2] \le 4 \sup_{\theta \in \Theta} \mathbb{E}[d(\theta, \hat\theta)^2]. \]
Corollary 5.8. Suppose that to each $\omega \in \{0,1\}^d$, we can associate a parameter $\theta_\omega \in \Theta$ such that
\[ d(\theta_\omega, \theta_{\omega'})^2 \ge c_n \, \rho(\omega, \omega') \]
for a constant $c_n > 0$ that may depend on the sample size $n$. Let $P_\omega = P_{\theta_\omega}$ denote the model at $\theta_\omega$ for $\omega \in \{0,1\}^d$. Then we have
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}\, d(\hat\theta, \theta)^2 \ge \frac{c_n d}{8} \min_{\rho(\omega, \omega') = 1} \big( 1 - \mathrm{TV}(P_\omega, P_{\omega'}) \big). \]
Proof. Consider estimators $\tilde\theta$ taking values in the finite (hence closed) set $\{\theta_\omega : \omega \in \{0,1\}^d\}$. Writing $\tilde\theta = \theta_{\tilde\omega}$, the separation assumption and Theorem 5.6 give
\[ \inf_{\tilde\theta} \max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, d(\tilde\theta, \theta_\omega)^2 \ge c_n \inf_{\hat\omega} \max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, \rho(\hat\omega, \omega) \ge \frac{c_n d}{2} \min_{\rho(\omega, \omega') = 1} \big( 1 - \mathrm{TV}(P_\omega, P_{\omega'}) \big), \]
where the first infimum is over all $\tilde\theta$ that take values in $\{\theta_\omega : \omega \in \{0,1\}^d\}$. Lemma 5.7, applied with the parameter set $\{\theta_\omega : \omega \in \{0,1\}^d\}$, implies
\[ \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}\, d(\hat\theta, \theta)^2 \ge \inf_{\hat\theta} \max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, d(\hat\theta, \theta_\omega)^2 \ge \frac{1}{4} \inf_{\tilde\theta} \max_{\omega \in \{0,1\}^d} \mathbb{E}_\omega \, d(\tilde\theta, \theta_\omega)^2, \]
where the first infimum is over arbitrary estimators. It suffices to combine the two bounds.
5.3.2 Applications
Gaussian mean estimation The typical way of applying Assouad's lemma is to associate each $\omega$ with a parameter. For example, for the above Gaussian mean estimation problem, we define $\mu_\omega = \omega/\sqrt{n} \in \mathbb{R}^d$ for each $\omega \in \{0,1\}^d$. Then
\[ \rho(\omega, \omega') = \|\omega - \omega'\|_2^2 = n \|\mu_\omega - \mu_{\omega'}\|_2^2, \]
and $P_\omega = \otimes_{i=1}^n N(\mu_\omega, I_d)$. If $1 = \rho(\omega, \omega') = n \|\mu_\omega - \mu_{\omega'}\|_2^2$, then by Lemma 5.4,
\[ \mathrm{TV}(P_\omega, P_{\omega'}) \le \sqrt{\mathrm{KL}(P_\omega, P_{\omega'})} = \sqrt{n \cdot (1/2) \cdot \|\mu_\omega - \mu_{\omega'}\|_2^2} = 1/\sqrt{2}. \]
Applying Corollary 5.8 with $c_n = 1/n$, we obtain
\[ \inf_{\hat\mu} \sup_{\mu \in \mathbb{R}^d} \mathbb{E}_\mu \|\hat\mu - \mu\|_2^2 \ge \frac{d}{8n} \big( 1 - 1/\sqrt{2} \big) \gtrsim \frac{d}{n}. \]
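To see that this lower bound has the right order, one can check numerically (a sketch, not from the notes; the dimension, sample size, and number of repetitions are arbitrary choices) that the sample mean attains risk exactly $d/n$ in this model.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, reps = 5, 50, 20_000

# X_1, ..., X_n ~ N(mu, I_d); the sample mean has risk E||mean - mu||^2 = d/n.
mu = rng.standard_normal(d)
X = mu + rng.standard_normal((reps, n, d))
risks = np.sum((X.mean(axis=1) - mu) ** 2, axis=1)

print(risks.mean(), d / n)  # empirical risk close to d/n
```

So the minimax risk for Gaussian mean estimation is of order $d/n$, matching the Assouad lower bound above up to a constant.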
Linear regression Consider the linear regression model $Y = X\beta^* + \varepsilon$ where $\varepsilon \sim N(0, \sigma^2 I_n)$. Let $r$ be the rank of $X \in \mathbb{R}^{n \times d}$, and let $U \in \mathbb{R}^{n \times r}$ be a matrix whose columns form an orthonormal basis of the column space of $X$. We associate each $\omega \in \{0,1\}^r$ with a vector $\beta_\omega \in \mathbb{R}^d$ such that
\[ U \omega = \frac{1}{\sigma} X \beta_\omega. \]
Then we have
\[ \rho(\omega, \omega') = \|\omega - \omega'\|_2^2 = \|U\omega - U\omega'\|_2^2 = \frac{1}{\sigma^2} \|X\beta_\omega - X\beta_{\omega'}\|_2^2. \]
In addition, $P_\omega = N(X\beta_\omega, \sigma^2 I_n)$. Hence, if $1 = \rho(\omega, \omega') = \frac{1}{\sigma^2} \|X\beta_\omega - X\beta_{\omega'}\|_2^2$, then by Lemma 5.4,
\[ \mathrm{TV}(P_\omega, P_{\omega'}) \le \sqrt{\mathrm{KL}(P_\omega, P_{\omega'})} = \sqrt{\frac{1}{2\sigma^2} \|X\beta_\omega - X\beta_{\omega'}\|_2^2} = 1/\sqrt{2}. \]
Applying Corollary 5.8, we obtain
\[ \inf_{\hat\beta} \sup_{\beta \in \mathbb{R}^d} \frac{1}{n} \mathbb{E}_\beta \|X\hat\beta - X\beta\|_2^2 \gtrsim \frac{\sigma^2}{n} \inf_{\hat\omega} \max_{\omega \in \{0,1\}^r} \mathbb{E}_\omega \, \rho(\hat\omega, \omega) \gtrsim \sigma^2 \frac{r}{n}. \]
5.4 Fano’s inequality
5.4.1 General theory
We move on to study minimax lower bounds based on multiple hypothesis testing. Recall that to
apply Theorem 5.1, we need to find separated parameters θ1 , . . . , θM ∈ Θ such that
\[ \inf_{\psi} \max_{i \in [M]} P_i \{\psi \ne i\} \ge c, \]
where we write Pi = Pθi and the infimum is over all tests ψ. To this end, we use a special case of
Fano’s inequality. Let us start with a lemma.
Lemma 5.9. For $p, q \in (0, 1)$, define
\[ h(p, q) := \mathrm{KL}\big( \mathrm{Ber}(p), \mathrm{Ber}(q) \big) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}. \]
Then $h$ is convex.
Proof. We first show that the function $(p, q) \mapsto p \log \frac{p}{q}$ is convex for $p, q > 0$. The Hessian of the function is
\[ H = \begin{pmatrix} 1/p & -1/q \\ -1/q & p/q^2 \end{pmatrix}. \]
We have $\det(H) = 0$ and $\mathrm{tr}(H) > 0$, so $H$ is positive semidefinite. Moreover, since the composition of a convex function with a linear function is convex, and a sum of two convex functions is convex, we see that $h$ is convex.
Theorem 5.10 (Data processing). Let $P$ and $Q$ be two probability measures that are absolutely continuous with respect to each other. For $X \sim P$, $Y \sim Q$, and a function $g$, we have
\[ \mathrm{KL}\big( g(X), g(Y) \big) \le \mathrm{KL}(X, Y), \]
where we abuse notation by writing the KL divergence between random variables for the KL divergence between their distributions.
Proof. Let $f_X$ denote the density of $X$. Then $f_X$ can be identified with $f_{X, g(X)} = f_{g(X)} f_{X \mid g(X)}$. It follows that
\begin{align*}
\mathbb{E}_X \Big[ \log \frac{f_X}{f_Y} \Big] &= \mathbb{E}_{X, g(X)} \Big[ \log \frac{f_{X, g(X)}}{f_{Y, g(Y)}} \Big] = \mathbb{E}_{X, g(X)} \Big[ \log \frac{f_{g(X)}}{f_{g(Y)}} + \log \frac{f_{X \mid g(X)}}{f_{Y \mid g(Y)}} \Big] \\
&= \mathbb{E}_{g(X)} \Big[ \log \frac{f_{g(X)}}{f_{g(Y)}} \Big] + \mathbb{E}_{g(X)} \mathbb{E}_X \Big[ \log \frac{f_{X \mid g(X)}}{f_{Y \mid g(Y)}} \,\Big|\, g(X) \Big] \ge \mathbb{E}_{g(X)} \Big[ \log \frac{f_{g(X)}}{f_{g(Y)}} \Big],
\end{align*}
where we used that the conditional KL divergence $\mathbb{E}_X \big[ \log \frac{f_{X \mid g(X)}}{f_{Y \mid g(Y)}} \mid g(X) \big]$ is nonnegative.
Theorem 5.11 (Fano’s inequality). Let P1 , . . . , PM be probability measures that are absolutely
continuous with respect to each other. Then we have
Pi , Pj ) + log 2
1 PM
i,j=1 KL(
inf max Pi {ψ 6= i} ≥ 1 − M2
,
ψ i∈[M ] log M
Proof. Fix a test ψ. Let $p_i := P_i\{\psi = i\}$ and $q_i := \frac{1}{M} \sum_{j=1}^M P_j\{\psi = i\}$. Moreover, let
\[ \bar p = \frac{1}{M} \sum_{i=1}^M p_i = \frac{1}{M} \sum_{i=1}^M P_i\{\psi = i\}, \qquad \bar q = \frac{1}{M} \sum_{i=1}^M q_i = \frac{1}{M}. \]
Let $X_i$ denote the observation under $P_i$ for $i \in [M]$. By the data processing inequality (Theorem 5.10) applied to the function $x \mapsto 1\{\psi(x) = i\}$,
\[ \mathrm{KL}\big( 1\{\psi(X_i) = i\}, 1\{\psi(X_j) = i\} \big) \le \mathrm{KL}(X_i, X_j), \]
which reads $h(P_i\{\psi = i\}, P_j\{\psi = i\}) \le \mathrm{KL}(P_i, P_j)$. Together with the convexity of $h$ (Lemma 5.9), this yields
\[ h(\bar p, \bar q) \le \frac{1}{M^2} \sum_{i,j=1}^M h\big( P_i\{\psi = i\}, P_j\{\psi = i\} \big) \le \frac{1}{M^2} \sum_{i,j=1}^M \mathrm{KL}(P_i, P_j). \]
On the other hand, since $\bar q = 1/M$,
\[ h(\bar p, \bar q) = \bar p \log(\bar p M) + (1 - \bar p) \log \frac{(1 - \bar p) M}{M - 1} \ge \bar p \log M - \log 2. \]
Combining the two bounds on $h(\bar p, \bar q)$, we obtain
\[ 1 - \max_{i \in [M]} P_i\{\psi \ne i\} \le 1 - \frac{1}{M} \sum_{i=1}^M P_i\{\psi \ne i\} = \bar p \le \frac{\frac{1}{M^2} \sum_{i,j=1}^M \mathrm{KL}(P_i, P_j) + \log 2}{\log M}. \]
Combining Theorem 5.1 with Fano’s inequality, we obtain the following corollary.
Corollary 5.12. Suppose that for θ1 , . . . , θM ∈ Θ, we have
1
d(θi , θj )2 ≥ 4sn , KL(Pθi , Pθj ) ≤ log M − log 2
2
for any distinct i, j ∈ [M ]. Then it holds that
5.4.2 Application to Gaussian mean estimation
Let us again consider estimating $\mu \in \mathbb{R}^d$ given i.i.d. $X_1, \dots, X_n \sim N(\mu, I_d)$. Recall that
\[ \mathrm{KL}(P_\mu, P_{\mu'}) = \frac{n}{2} \|\mu - \mu'\|_2^2. \]
Therefore, we would like to choose $\mu_1, \dots, \mu_M$ such that
\[ 4 s_n \le \|\mu_i - \mu_j\|_2^2 \le \frac{1}{n} (\log M - 2 \log 2). \]
On the one hand, we need many $\mu_i$ so that $M$ is large. On the other hand, if there are too many $\mu_i$ packed together, the separation $s_n$ becomes too small and so does the lower bound. We need to strike a balance between these two competing requirements.
Let us introduce the notions of ε-packing and ε-net.
Definition 5.13. Let $B$ be a subset of a metric space with metric $d$. A finite set $N \subseteq B$ is called an ε-packing of $B$ if $d(x, y) \ge \varepsilon$ for all distinct $x, y \in N$, and an ε-net of $B$ if for every $x \in B$ there exists $y \in N$ with $d(x, y) \le \varepsilon$.
We will not prove the following result, but the intuition is clear by considering the ratio of
volumes.
Lemma 5.14. Let $B^d$ denote the unit ball in $\mathbb{R}^d$. There exists an ε-packing $N$ of $B^d$, which is also an ε-net of $B^d$, such that
\[ (1/\varepsilon)^d \le |N| \le (3/\varepsilon)^d. \]
Note that in a homework problem, we assume that there is a 1/4-net $N$ of the unit sphere $S^{d-1}$ in $\mathbb{R}^d$ such that $|N| \le 12^d$. This is simply replacing $B^d$ with the subset $S^{d-1}$ and setting $\varepsilon = 1/4$ in the above lemma.
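A packing of this kind can be built greedily; the sketch below (not from the notes; the grid resolution is an arbitrary choice, and the grid makes the construction only approximately maximal) constructs a 1/4-packing of the unit disk in $d = 2$ and checks its size against the bounds $(1/\varepsilon)^d \le |N| \le (3/\varepsilon)^d$.

```python
import numpy as np

eps, d = 0.25, 2

# Candidate points: a fine grid restricted to the closed unit disk.
g = np.linspace(-1.0, 1.0, 101)
xs, ys = np.meshgrid(g, g)
pts = np.column_stack([xs.ravel(), ys.ravel()])
pts = pts[np.sum(pts**2, axis=1) <= 1.0]

# Greedy packing: keep a point iff it is at distance >= eps from all kept points.
packing = []
for p in pts:
    if not packing or np.min(np.linalg.norm(np.array(packing) - p, axis=1)) >= eps:
        packing.append(p)

# Lemma 5.14 predicts (1/eps)^d <= |N| <= (3/eps)^d, i.e., between 16 and 144 here.
print(len(packing), (1 / eps) ** d, (3 / eps) ** d)
```

By construction the kept points are pairwise ε-separated, and since the greedy pass is maximal over the grid, the result is (up to grid error) also an ε-net, matching the intuition behind the lemma.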
d d
q the lemma given, let us take a 1/4-packing N = {θ1 , . . . , θM } ⊂ B where M ≥ 4 . Set
With
d
µi = c n θi for each i ∈ [M ] and some constant c > 0 to be determined. Then by definition, we
c2 d
can set sn = 43 n
so that
d c2 d
kµi − µj k22 = c2 kθi − θj k22 ≥ 2 = 4sn .
n 4 n
On the other hand,
d 4c2 d 1
kµi − µj k22 = c2 kθi − θj k22 ≤ ≤ (log M − 2 log 2)
n n n
if we choose c > 0 to be a sufficiently small constant. We conclude that
sn d
Eµ kµ̂ − µk22 ≥
inf sup & .
µ̂ µ∈Rd 2 n
5.4.3 Application to nonparametric regression
Lemma 5.15 (Hoeffding’s lemma). Suppose that a random variable X has mean zero and satisfies
a ≤ X ≤ b for constants a, b ∈ R. Then, for any λ ∈ R, we have
λ2 (b − a)2
E[eλX ] ≤ exp .
8
Proof. By convexity, it holds that
b − X λa X − a λb
eλX ≤ e + e ,
b−a b−a
so
b λa a λb at a(1 − et )
E[eλX ] ≤ e − e = eL(λ(b−a)) , L(t) := + log 1 + .
b−a b−a b−a b−a
We have
ab(et − 1) −abet 1
L0 (t) = , L00 (t) = ≤ .
(a − b)(b − aet ) t
(b − ae ) 2 4
Using the second-order Taylor approximation, we obtain that L(t) ≤ t2 /8 for t ∈ R.
Lemma 5.16 (Hoeffding’s inequality). Let X1 , . . . , Xn be independent random variables taking
values in [0, 1]. Then for all t > 0,
nX n o
P (Xi − E[Xi ]) ≤ −t ≤ exp(−2t2 /n).
i=1
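A quick empirical check of Lemma 5.16 (a sketch, not from the notes; the Bernoulli example and the constants $n$, $t$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, t, reps = 100, 20.0, 200_000

# Sum of n i.i.d. Ber(1/2) variables; its mean is n/2 and each X_i lies in [0, 1].
sums = rng.binomial(n, 0.5, size=reps)
empirical = np.mean(sums - n / 2 <= -t)
hoeffding = np.exp(-2 * t**2 / n)  # = exp(-8), about 3.4e-4

print(empirical, hoeffding)  # the empirical tail probability lies below the bound
```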
For any distinct $i, j \in [M]$, $\rho(\omega_i, \omega_j) = \sum_{k=1}^d 1\{\omega_{i,k} \ne \omega_{j,k}\}$, so it is a sum of $\mathrm{Ber}(1/2)$ random variables. By a union bound and Hoeffding's inequality with $t = 3d/8$, we obtain
\[ \mathbb{P}\{E^c\} \le \sum_{i,j \in [M],\, i \ne j} \mathbb{P}\{\rho(\omega_i, \omega_j) < d/8\} \le M^2 \exp\Big( \frac{-9d}{32} \Big). \]
It is not hard to see that this is strictly less than 1, as taking the logarithm gives $\frac{d}{4} - \frac{9d}{32} < 0$.
Theorem 5.18. Consider the nonparametric regression model (4.8), where $f \in \Sigma(\beta, L)$ for $\beta, L > 0$, $x_i = i/n$, and the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$ noise for $i \in [n]$. For a constant $c = c_3(\beta) L^{2/(2\beta+1)} \sigma^{4\beta/(2\beta+1)} > 0$, the following minimax lower bound in the squared $L^2$ distance holds over the Hölder class:
\[ \inf_{\hat f} \sup_{f \in \Sigma(\beta, L)} \mathbb{E} \|\hat f - f\|_{L^2}^2 \ge c \, n^{-2\beta/(2\beta+1)}. \]
Proof. This time, for the squared $L^2$ distance $d(f, g)^2 = \|f - g\|_{L^2}^2 = \int_0^1 (f(x) - g(x))^2 \, dx$, the proof is based on multiple hypothesis testing over $\{f_1, \dots, f_M\} \subset \Sigma(\beta, L)$.
Construction of hypotheses Fix a constant $C_0 = C_0(\beta, L, \sigma) > 0$ to be determined later. Let
\[ d = \lceil C_0 n^{1/(2\beta+1)} \rceil, \quad h = \frac{1}{d}, \quad z_k = \frac{k - 1/2}{d}, \quad \phi_k(x) = L h^\beta K\Big( \frac{x - z_k}{h} \Big), \quad k \in [d], \; x \in [0, 1], \]
where $K$ is defined by (5.3). Recall that using the proof of Theorem 5.5, we can show that $\phi_k \in \Sigma(\beta, L/2)$ if the constant $c_1(\beta) > 0$ in (5.3) is taken to be sufficiently small. Moreover, $\phi_k$ is supported in $[z_k - \frac{h}{2}, z_k + \frac{h}{2}] = [\frac{k-1}{d}, \frac{k}{d}]$ for each $k \in [d]$.
Let $\omega_1, \dots, \omega_M$ be given by Lemma 5.17. For each $i \in [M]$, we define $f_i(x) := \sum_{k=1}^d \omega_{i,k} \phi_k(x)$. Since the supports of the $\phi_k$ are disjoint (up to a set of measure zero), it is easily seen that $f_i \in \Sigma(\beta, L)$.
To apply Corollary 5.12, we need this bound to be smaller than $\frac{1}{2} \log M - \log 2 \gtrsim d \ge C_0 n^{1/(2\beta+1)}$, i.e.,
\[ C_0 \gtrsim \Big( \frac{L^2 c_1^2}{\sigma^2} \Big)^{1/(2\beta+1)}. \]
Plugging this into the separation bound above finishes the proof.
We have established matching upper and lower bounds for the minimax risk at a point or in
the L2 norm for nonparametric regression.
Given a finite parameter set $\Theta$, define the uniform mixture
\[ \bar{\mathbb{P}} := \frac{1}{|\Theta|} \sum_{\theta \in \Theta} \mathbb{P}_\theta, \tag{5.4} \]
where each $\mathbb{P}_\theta$ is a distribution. In other words, $Y \sim \bar{\mathbb{P}}$ can be generated as follows: First sample $\theta$ uniformly at random from $\Theta$ and then, conditional on $\theta$, sample $Y \sim \mathbb{P}_\theta$. We will study hypothesis testing between a distribution $\mathbb{P}_0$ and the mixture $\bar{\mathbb{P}}$, where the latter is usually referred to as a composite hypothesis.
Let ψ denote a test, which equals 0 if it selects $\mathbb{P}_0$ and equals 1 if it selects $\bar{\mathbb{P}}$. By the Neyman–Pearson lemma and a homework problem, we have that for any test ψ,
\[ \mathbb{P}_0\{\psi = 1\} + \bar{\mathbb{P}}\{\psi = 0\} \ge 1 - \mathrm{TV}(\mathbb{P}_0, \bar{\mathbb{P}}) \ge 1 - \sqrt{\chi^2(\bar{\mathbb{P}}, \mathbb{P}_0)}, \tag{5.5} \]
where $\mathrm{TV}(\cdot,\cdot)$ and $\chi^2(\cdot,\cdot)$ denote the total variation distance and the $\chi^2$-divergence respectively.
To showcase how this inequality leads to a minimax lower bound for an estimation problem, we
consider the following example. Suppose that we aim to estimate θ given the observation
Y = θ + ε, (5.6)
where $\theta$ is $k$-sparse and $\varepsilon \sim N(0, \sigma^2 I_n)$. Recall that this is called the sparse sequence model in a homework problem, and it is a special case of sparse linear regression with $n = d$ and the design matrix $X$ being orthogonal. Assume that $1 \le k \le \sqrt{n}$. We aim to prove a lower bound of order $k/n$ up to a logarithmic factor, which then matches the upper bound.
Theorem 5.19. Let $\mathbb{P}_\theta$ denote the probability associated with the model (5.6). For $\lambda \in (0, 1)$, set
\[ \mu := \sigma \sqrt{ \frac{k}{2} \log\Big( 1 + \frac{\lambda n}{k^2} \Big) }. \]
Define
\[ \Theta := \Big\{ \theta = \frac{\mu}{\sqrt{k}} 1_S : S \subset [n], \, |S| = k \Big\}. \]
In other words, each vector in $\Theta$ is $k$-sparse with support $S$, and its nonzero entries are all equal to $\mu/\sqrt{k}$. Let $\bar{\mathbb{P}}$ be defined as in (5.4). Then
\[ \chi^2(\bar{\mathbb{P}}, \mathbb{P}_0) \le 2\lambda. \]
Proof. Let $p_0$, $p_\theta$, and $\bar p$ denote the densities of $\mathbb{P}_0$, $\mathbb{P}_\theta$, and $\bar{\mathbb{P}}$ respectively. Let $\theta$ and $\theta'$ be two independent uniform random variables over $\Theta$. By the definition of the $\chi^2$-divergence, we have
\[ \chi^2(\bar{\mathbb{P}}, \mathbb{P}_0) = \int \frac{(\bar p - p_0)^2}{p_0} = \int \frac{\bar p^2}{p_0} - 1 = \mathbb{E}_{\theta, \theta'} \int \frac{p_\theta \, p_{\theta'}}{p_0} - 1. \]
Let $S$ and $S'$ denote the supports of $\theta$ and $\theta'$ respectively; they are independent random subsets of $[n]$, each of size $k$. Since the noise is Gaussian, it holds that
\[ \frac{p_\theta(x)}{p_0(x)} = \exp\Big( -\frac{1}{2\sigma^2} \Big\| x - \frac{\mu}{\sqrt{k}} 1_S \Big\|_2^2 + \frac{1}{2\sigma^2} \|x\|_2^2 \Big) = \exp\Big( \frac{\mu}{\sigma^2 \sqrt{k}} x^\top 1_S - \frac{\mu^2}{2\sigma^2} \Big). \]
For $Z \sim N(0, I_n)$, define $Z_S := \sum_{i \in S} Z_i$. Let $\nu := \frac{\mu}{\sigma \sqrt{k}}$. Then we have
\[ \int \frac{p_\theta \, p_{\theta'}}{p_0} = \int \exp\Big( \frac{\mu}{\sigma^2 \sqrt{k}} (x^\top 1_S + x^\top 1_{S'}) - \frac{\mu^2}{\sigma^2} \Big) p_0(x) \, dx = \mathbb{E} \exp\Big( \frac{\mu}{\sigma \sqrt{k}} (Z_S + Z_{S'}) - \frac{\mu^2}{\sigma^2} \Big) = \mathbb{E} \exp\big( \nu (Z_S + Z_{S'}) - \nu^2 k \big). \]
Note that $Z_S + Z_{S'} = 2 \sum_{i \in S \cap S'} Z_i + \sum_{i \in S \triangle S'} Z_i$ and $|S \triangle S'| \le 2k$. Therefore, conditionally on $S$ and $S'$,
\[ \int \frac{p_\theta \, p_{\theta'}}{p_0} = \mathbb{E} \exp\Big( 2\nu \sum_{i \in S \cap S'} Z_i \Big) \cdot \mathbb{E} \exp\Big( \nu \sum_{i \in S \triangle S'} Z_i \Big) \cdot \exp(-\nu^2 k) = \exp\big( 2\nu^2 |S \cap S'| \big) \exp\Big( \frac{\nu^2 |S \triangle S'|}{2} \Big) \exp(-\nu^2 k) \le \exp\big( 2\nu^2 |S \cap S'| \big), \]
where we used the Gaussian MGF $\mathbb{E}[e^{t Z_1}] = e^{t^2/2}$ and $|S \triangle S'| \le 2k$.
It follows that
\[ \chi^2(\bar{\mathbb{P}}, \mathbb{P}_0) \le \mathbb{E} \exp\big( 2\nu^2 |S \cap S'| \big) - 1. \]
By symmetry, we may condition on $S' = [k]$. The random variable $|S \cap [k]|$ is a sampling-without-replacement version of $\mathrm{Bin}(k, k/n)$, and it can be shown that the MGF of the former is dominated by the MGF of the latter. As a result,
\[ \chi^2(\bar{\mathbb{P}}, \mathbb{P}_0) \le \Big( e^{2\nu^2} \frac{k}{n} + 1 - \frac{k}{n} \Big)^k - 1 = \Big( 1 + \frac{k}{n} (e^{2\nu^2} - 1) \Big)^k - 1. \]
Recall that
\[ \nu = \frac{\mu}{\sigma \sqrt{k}} = \sqrt{ \frac{1}{2} \log\Big( 1 + \frac{\lambda n}{k^2} \Big) }. \]
Hence, we conclude that
\[ \chi^2(\bar{\mathbb{P}}, \mathbb{P}_0) \le \Big( 1 + \frac{k}{n} \cdot \frac{\lambda n}{k^2} \Big)^k - 1 = \Big( 1 + \frac{\lambda}{k} \Big)^k - 1 \le e^\lambda - 1 \le 2\lambda \]
for $\lambda \in (0, 1)$.
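The key quantity $\mathbb{E} \exp(2\nu^2 |S \cap S'|) - 1$ can be evaluated exactly by enumerating supports in a small instance; the sketch below (not from the notes; $n$, $k$, $\lambda$ are arbitrary small choices) checks that the guarantee $\chi^2(\bar{\mathbb{P}}, \mathbb{P}_0) \le 2\lambda$ indeed holds.

```python
import numpy as np
from itertools import combinations

n, k, lam = 10, 2, 0.25
nu2 = 0.5 * np.log(1 + lam * n / k**2)  # nu^2, chosen so that e^{2 nu^2} = 1 + lam*n/k^2

# Enumerate all k-subsets S, S' of [n] and average exp(2 nu^2 |S ∩ S'|).
supports = [set(c) for c in combinations(range(n), k)]
vals = [np.exp(2 * nu2 * len(S & T)) for S in supports for T in supports]
chi2_upper = float(np.mean(vals)) - 1  # upper-bounds chi^2(P-bar, P_0) in the proof

print(chi2_upper, 2 * lam)  # the theorem guarantees chi2_upper <= 2*lam
```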
Corollary 5.20. We have the following minimax lower bound for the model (5.6):
\[ \inf_{\hat\theta} \sup_{\theta \in \mathbb{R}^n} \frac{1}{n} \mathbb{E} \|\hat\theta - \theta\|_2^2 \gtrsim \sigma^2 \frac{k}{n} \log\Big( 1 + \frac{n}{k^2} \Big). \]
Proof. We continue to use the notation from above. Let $\lambda = 1/8$ and $\mu = \sigma \sqrt{\frac{k}{2} \log(1 + \frac{n}{8k^2})}$. Let $\hat\theta$ be any estimator of $\theta$. Define a test ψ by $\psi = 0$ if $\|\hat\theta\|_2 \le \mu/2$ and $\psi = 1$ if $\|\hat\theta\|_2 > \mu/2$. From (5.5) and the above theorem, we obtain
\[ \max\Big\{ \mathbb{P}_0\{\psi = 1\}, \, \max_{\theta \in \Theta} \mathbb{P}_\theta\{\psi = 0\} \Big\} \ge \frac{1}{2} \big( \mathbb{P}_0\{\psi = 1\} + \bar{\mathbb{P}}\{\psi = 0\} \big) \ge \frac{1 - \sqrt{1/4}}{2} = \frac{1}{4}. \]
Suppose that $\frac{1}{n} \|\hat\theta - \theta\|_2^2 \le \frac{\mu^2}{9n}$ with probability at least 0.9 for every $\theta \in \{0\} \cup \Theta$. On this event,
\[ \big| \|\hat\theta\|_2 - \|\theta\|_2 \big| \le \|\hat\theta - \theta\|_2 \le \frac{\mu}{3}. \]
By the definition of ψ, if $\theta = 0$, then $\psi = 0$; if $\theta \in \Theta$, then $\|\theta\|_2 = \mu$ and thus $\psi = 1$. Hence both error probabilities of ψ are at most 0.1, and we reach a contradiction. Therefore, $\frac{1}{n} \|\hat\theta - \theta\|_2^2 > \frac{\mu^2}{9n}$ with probability at least 0.1 for some $\theta$, proving the conclusion.
Bibliography
[GL95] Richard D. Gill and Boris Y. Levit. Applications of the van Trees inequality: a Bayesian Cramér–Rao bound. Bernoulli, 1(1–2):59–79, 1995.
[Kee11] Robert W. Keener. Theoretical Statistics: Topics for a Core Course. Springer, 2011.
[LC06] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Science & Business Media, 2006.
[LR06] Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.
[RH19] Philippe Rigollet and Jan-Christian Hütter. High-Dimensional Statistics. Lecture notes, 2019.
[vdV00] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.