
Statistical Estimation

Lecture Notes for MATH 6262 at Georgia Tech

Cheng Mao

School of Mathematics, Georgia Tech

May 2, 2023
This set of notes, based on [LC06, Kee11, Ber18, RH19, Tsy08] and other sources, is provided
to the students in the course MATH 6262 at Georgia Tech as a complement to the lectures. It is
not meant to be a complete introduction to the subject.

Contents

1 Fundamentals of statistical estimation 7


1.1 Background and setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Review of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2 Setup of statistical estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Definition and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Moments and cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Stein’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Sufficient statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Definitions and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Some results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Convexity, maximum entropy, Rao–Blackwell . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Bias, variance, MVUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Lower bounds on the variance of an unbiased estimator . . . . . . . . . . . . . . . . 17
1.6.1 Lower bounds and the Fisher information . . . . . . . . . . . . . . . . . . . . 17
1.6.2 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Bayesian versus minimax 21


2.1 Bayesian estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Bayes risk and Bayes estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.3 Hierarchical Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.4 Several perspectives of estimation . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Bayesian Cramér–Rao, a.k.a. van Trees inequality . . . . . . . . . . . . . . . . . . . . 25
2.3 Empirical Bayes and the James–Stein estimator . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 The empirical Bayes approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 James–Stein estimator and its variant . . . . . . . . . . . . . . . . . . . . . . 27
2.3.3 General results for exponential families . . . . . . . . . . . . . . . . . . . . . 29
2.4 Minimax estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Definitions and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.2 Some theoretical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.3 Efron–Morris estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.5 Admissibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 Admissible estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.2 Inadmissible estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Shrinkage estimators and Stein’s effect . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.1 Gaussian estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.2 Poisson estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Asymptotic estimation 39
3.1 Convergence of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 Convergence in probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.2 Convergence in distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Asymptotic properties of maximum likelihood estimation . . . . . . . . . . . . . . . 43
3.3.1 Asymptotic consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2 Asymptotic efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Examples of maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Some examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Bernstein–von Mises theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Bootstrap methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6.1 Jackknife estimator and bias reduction . . . . . . . . . . . . . . . . . . . . . . 50
3.6.2 Mean estimation and asymptotics . . . . . . . . . . . . . . . . . . . . . . . . 50
3.7 Sampling methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.1 Sampling with quantile function . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7.2 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7.3 Metropolis–Hastings algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.7.4 Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Finite-sample analysis 55
4.1 Rates of estimation for linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.1 Fast rate for low-dimensional linear regression . . . . . . . . . . . . . . . . . . 55
4.1.2 Maximal inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.3 Slow rate for high-dimensional linear regression . . . . . . . . . . . . . . . . . 57
4.2 High-dimensional linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Setup and estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 Fast rate for sparse linear regression . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Fast rate for LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Generalized linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Setup and models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Maximum likelihood estimation for logistic regression . . . . . . . . . . . . . 64
4.4 Nonparametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.1 Model and estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.2 Rates of estimation for local polynomial estimators . . . . . . . . . . . . . . . 68

5 Information-theoretic lower bounds 73
5.1 Reduction to hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Le Cam’s two-point method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 General theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.2 Lower bounds for nonparametric regression at a point . . . . . . . . . . . . . 75
5.3 Assouad’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 General theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Fano’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.1 General theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4.2 Application to Gaussian mean estimation . . . . . . . . . . . . . . . . . . . . 82
5.4.3 Application to nonparametric regression . . . . . . . . . . . . . . . . . . . . . 83
5.5 Generalization of the two-point method . . . . . . . . . . . . . . . . . . . . . . . . . 84

Chapter 1

Fundamentals of statistical estimation

1.1 Background and setup


1.1.1 Review of probability
Consider a sample space X containing all possible outcomes of an experiment. Let µ be the reference
(or natural) measure on X . We primarily consider the following spaces:

• A finite or countable set X equipped with the counting measure µ. For example, when we roll
a die, the outcome lies in X = {1, 2, 3, 4, 5, 6}. Moreover, if we count the number of times we
toss a coin before a “heads” is observed, then this number lies in X = {1, 2, 3, . . . }.

• X = R^d equipped with the Lebesgue measure µ. For example, tomorrow’s temperature lies in
X = R, while tomorrow’s temperature and humidity jointly lie in X = R².

A random variable X is the outcome of an experiment taking values in X. We write X ∼ P if X
follows a distribution P. There are several ways to describe a random variable or a distribution:

• If X is discrete, i.e., X is finite or countable, we can specify the probability mass function (PMF)
fX of X. For example, for the uniform random variable X ∼ Unif([n]) where [n] := {1, . . . , n},
we have fX (i) = P{X = i} = 1/n for i = 1, . . . , n.

• If X is continuous, e.g., X = R or R^d, we can specify the probability density function (PDF or
density) fX of X. For example, for the standard Gaussian random variable X ∼ N(0, 1), we have
fX(t) = (1/√(2π)) e^(−t²/2) for t ∈ R.

• The cumulative distribution function (CDF) of a random variable X on R is FX(t) = P{X ≤ t}.
We have FX′(t) = fX(t) and ∫_{−∞}^t fX(s) ds = FX(t). The CDF of X = (X1, . . . , Xd) on R^d is
FX(t1, . . . , td) = P{X1 ≤ t1, . . . , Xd ≤ td}.

In general, for a subset E ⊂ X, the probability of the event {X ∈ E} is P{X ∈ E} = ∫_E fX dµ.

Examples:

• Roll a die; the outcome is X ∼ Unif([6]). The probability of seeing 2 or 3 is P{X ∈ {2, 3}} =
Σ_{i=2}^3 1/6 = 1/3.

• Consider X ∼ N(0, 1). The probability that X is positive is P{X > 0} = ∫_0^∞ (1/√(2π)) e^(−t²/2) dt = 1/2.

The expectation of X is E[X] = ∫_X t fX(t) dµ(t). Given a function g : X → R, the expectation
of g(X) is E[g(X)] = ∫_X g(t) fX(t) dµ(t). Examples:

• For X ∼ Unif([6]), E[X] = Σ_{i=1}^6 i · (1/6) = 3.5.

• For X ∼ N(0, 1), the variance of X is E[(X − 0)²] = ∫_{−∞}^∞ t² · (1/√(2π)) e^(−t²/2) dt = 1.

1.1.2 Setup of statistical estimation


Statistical estimation is in some sense the reverse engineering of probability. Observing realizations
of random variables X1 , X2 , . . . , Xn , we aim to estimate certain (functions of) parameters of the
underlying distribution. Let us describe the basic setup of parametric estimation using throwing a
biased coin as a running example. Consider a biased coin for which we see 1 (heads) with probability
θ ∈ [0, 1] and see 0 (tails) with probability 1 − θ. In other words, the observation follows the Ber(θ)
distribution. The following is a list of basic concepts in parametric estimation:

• Parameter: θ, which is typically a real number. E.g., θ = 0.3, 0.5, or 0.8.

• Parameter space: the set Θ of parameters. E.g., Θ = [0, 1].

• Probability distribution: Pθ . E.g., Pθ = Ber(θ).

• Family of distributions: the set P containing all Pθ . E.g., P = {Ber(θ) : θ ∈ [0, 1]}.

• Observations: i.i.d. X1, . . . , Xn ∼ Pθ. E.g., X1, . . . , Xn are the binary outcomes of n independent
coin tosses.

• Statistic: a function of the observations X1, . . . , Xn. E.g., h(X1, . . . , Xn) = (1/n) Σ_{i=1}^n Xi.

• Estimand: a function g(θ) of the parameter θ. E.g., g(θ) = θ or θ2 .

• Estimator: a statistic used to estimate the estimand, denoted by θ̂ = θ̂(X1, . . . , Xn) or
ĝ = ĝ(X1, . . . , Xn). E.g., θ̂ = (1/n) Σ_{i=1}^n Xi when g(θ) = θ, or ĝ = g(θ̂) = θ̂² = ((1/n) Σ_{i=1}^n Xi)² when
g(θ) = θ².

• Loss function: a bivariate function L(g, ĝ) ≥ 0. E.g., the squared loss L(θ, θ̂) = (θ̂ − θ)2 .

• Risk: the expectation of the loss E[L(g, ĝ)] with respect to the observations. E.g., E[(θ̂ − θ)²] =
E[((1/n) Σ_{i=1}^n Xi − θ)²] for i.i.d. X1, . . . , Xn ∼ Ber(θ).
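The risk in the running coin example can be approximated by Monte Carlo simulation. The following sketch (the values of θ, n, and the number of trials are arbitrary choices) estimates the risk of the sample mean under squared loss and compares it to the exact value θ(1 − θ)/n, the variance of the unbiased estimator θ̂:

```python
import random

random.seed(0)
theta, n, trials = 0.3, 50, 20000

def sample_mean():
    # theta_hat = (1/n) * sum of n i.i.d. Ber(theta) observations
    return sum(random.random() < theta for _ in range(n)) / n

# Monte Carlo approximation of the risk E[(theta_hat - theta)^2]
risk = sum((sample_mean() - theta) ** 2 for _ in range(trials)) / trials

# Since theta_hat is unbiased, its risk equals Var(theta_hat) = theta(1-theta)/n
print(risk, theta * (1 - theta) / n)
```

Both printed values should be close to 0.0042.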

1.2 Exponential families


1.2.1 Definition and examples
Definition 1.1. Let Θ ⊂ R^d be a parameter space. A family {Pθ}θ∈Θ of probability distributions
on a sample space X with measure µ is called an exponential family if Pθ has PDF (or PMF)

f(x | θ) = exp(η⊤T(x) − A(η)) · h(x).
Here, η = η(θ) ∈ R^m is the natural parameter, T(x) ∈ R^m is the sufficient statistic,

A(η) = log ∫_X exp(η⊤T(x)) · h(x) dµ(x)

is the log-partition function, and h(x) is the base measure. Moreover, the natural parameter space
is E := {η : A(η) < ∞} = {η : ∫_X exp(η⊤T(x)) · h(x) dµ(x) < ∞}.

Examples of exponential families include:


• Gaussian distribution N(µ, σ²): If η1 = µ/σ² and η2 = −1/(2σ²), then

f(x | µ, σ²) = (1/(√(2π)σ)) exp(−(x − µ)²/(2σ²)) = (1/(√(2π)σ)) exp(µx/σ² − x²/(2σ²) − µ²/(2σ²))
= exp(η1 x + η2 x² + η1²/(4η2) + (1/2) log(−2η2)) · (1/√(2π)).

• Poisson distribution Poi(λ): If η = log λ, then

f(x | λ) = (λ^x / x!) e^(−λ) = exp(x log λ − λ) · (1/x!) = exp(ηx − e^η) · (1/x!).

• Binomial distribution Bin(n, p): If n is fixed and known, and η = log(p/(1 − p)), then

f(x | p) = (n choose x) p^x (1 − p)^(n−x) = (n choose x) exp(x log(p/(1 − p)) + n log(1 − p)) = (n choose x) exp(ηx − n log(1 + e^η)).

1.2.2 Moments and cumulants


For a random variable X taking values in R^m, consider its moments

α_{r1,...,rm} = E[X1^{r1} · · · Xm^{rm}], ri ∈ Z≥0.

The moment generating function (MGF) is defined as

MX(u) := E[exp(u⊤X)], u ∈ R^m.

If MX exists in a neighborhood of the origin, then all moments exist and

MX(u) = Σ_{r1,...,rm} (α_{r1,...,rm} / (r1! · · · rm!)) u1^{r1} · · · um^{rm}.

Therefore, we have

α_{r1,...,rm} = ∂^{r1+···+rm} MX(u) / (∂u1^{r1} · · · ∂um^{rm}) |_{u=0}.

The cumulant generating function (CGF) is defined as KX(u) := log MX(u). Its power series
expansion is

KX(u) = Σ_{r1,...,rm} (κ_{r1,...,rm} / (r1! · · · rm!)) u1^{r1} · · · um^{rm},

where we call κ_{r1,...,rm} the cumulants of X.
In the case m = 1, i.e., the random variable is real-valued, we have

α1 = κ1, α2 = κ2 + κ1², α3 = κ3 + 3κ1κ2 + κ1³, . . .

There is a general relation via Bell polynomials.
If X1, . . . , Xn are independent real-valued random variables and X := Σ_{i=1}^n Xi, then MX(u) =
Π_{i=1}^n MXi(u) and thus KX(u) = Σ_{i=1}^n KXi(u). Therefore, κr(X) = Σ_{i=1}^n κr(Xi), i.e., the rth
cumulant of X is the sum of the rth cumulants of X1, . . . , Xn.
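The additivity of cumulants can be illustrated with Bernoulli summands. The third cumulant of Ber(p) is p(1 − p)(1 − 2p), and for r = 2, 3 the rth cumulant coincides with the rth central moment, so the third central moment of Bin(n, p), the sum of n i.i.d. Ber(p) variables, should equal n · p(1 − p)(1 − 2p). A sketch with arbitrary n and p:

```python
from math import comb

n, p = 12, 0.2
mean = n * p

# Third central moment of Bin(n, p), computed exactly from the PMF
m3 = sum(comb(n, x) * p**x * (1 - p) ** (n - x) * (x - mean) ** 3
         for x in range(n + 1))

# kappa_3 of one Ber(p) summand, added over the n independent summands
kappa3_sum = n * p * (1 - p) * (1 - 2 * p)

assert abs(m3 - kappa3_sum) < 1e-9
print(m3)  # approximately 1.152
```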

Consider X ∼ Pθ where Pθ is from an exponential family. Note that T = T(X) is a random variable.
Assuming mild regularity conditions, e.g., {η(θ) : θ ∈ Θ} is open, and MT(u) and KT(u) exist in a
neighborhood of the origin, we have

MT(u) = E[exp(u⊤T)] = ∫_X exp((u + η)⊤T(x) − A(η)) h(x) dµ(x) = exp(A(η + u) − A(η)).

Then we can compute moments of T from MT(u). It also follows that

KT(u) = A(η + u) − A(η).

From this, it can be derived that

E[T] = ∇A(η), Cov(T) = ∇²A(η), . . .

Examples:

• X ∼ Poi(λ): T(x) = x, η = log λ, and A(η) = e^η. Hence

MX(u) = exp(e^(η+u) − e^η) = exp(λ(e^u − 1)).

From MX(u), we can compute

E[X] = λ, E[X²] = λ² + λ, E[X³] = λ³ + 3λ² + λ, . . .

• X ∼ Bin(n, p): T(x) = x, η = log(p/(1 − p)), and A(η) = n log(1 + e^η). Hence

MX(u) = ((1 + e^(η+u)) / (1 + e^η))^n = (1 − p + pe^u)^n.

On the other hand, if we use the definition of the MGF directly, we need to compute

MX(u) = Σ_{x=0}^n (n choose x) p^x (1 − p)^(n−x) e^(ux).

This is slightly more involved than using the general formula for exponential families.
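Since K_T(u) = A(η + u) − A(η), the mean and variance of T are the first two derivatives of A at η. For the Poisson case A(η) = e^η, this can be checked by finite differences (a sketch; λ and the step size h are arbitrary choices):

```python
from math import exp, log

lam = 3.0
eta = log(lam)
A = lambda e: exp(e)   # log-partition function of Poi(lambda), lambda = e^eta
h = 1e-4

mean = (A(eta + h) - A(eta - h)) / (2 * h)            # approximates A'(eta)
var = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # approximates A''(eta)

assert abs(mean - lam) < 1e-4   # E[X] = lambda
assert abs(var - lam) < 1e-4    # Var(X) = lambda
print(mean, var)
```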

1.2.3 Stein’s lemma
Lemma 1.2 (Stein). Let {Pθ : θ ∈ Θ} be an exponential family. Suppose that X ∼ Pθ has density
f(x | θ) for x ∈ R. Let g be a differentiable function such that lim_{x→±∞} g(x)f(x | θ) = 0. Then we
have

E[g(X)(η⊤T′(X) + h′(X)/h(X))] = −E[g′(X)].

In particular, for X ∼ N(µ, σ²), we have

E[g(X)(X − µ)] = σ² E[g′(X)]. (1.1)

Proof. By integration by parts, the RHS is equal to

−∫_R g′(x) exp(η⊤T(x) − A(η)) h(x) dµ(x)
= ∫_R g(x) [η⊤T′(x) exp(η⊤T(x) − A(η)) h(x) + exp(η⊤T(x) − A(η)) h′(x)] dµ(x)
= ∫_R g(x) (η⊤T′(x) + h′(x)/h(x)) exp(η⊤T(x) − A(η)) h(x) dµ(x),

which is equal to the LHS.


For X ∼ N(µ, σ²), we have η1 = µ/σ², η2 = −1/(2σ²), T1(x) = x, T2(x) = x², and h(x) = 1/√(2π), so
the conclusion follows.

For X ∼ N(µ, σ 2 ), setting g(x) = 1 gives E[X] = µ, and setting g(x) = x gives E[X 2 ] = σ 2 + µ2 .
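Identity (1.1) is easy to spot-check by Monte Carlo, e.g. with g(x) = x², for which both sides equal 2µσ² (a sketch; the values of µ, σ, and the sample size are arbitrary choices):

```python
import random

random.seed(1)
mu, sigma, n = 2.0, 1.5, 200000
xs = [random.gauss(mu, sigma) for _ in range(n)]

g = lambda x: x * x   # test function with g'(x) = 2x
lhs = sum(g(x) * (x - mu) for x in xs) / n    # approximates E[g(X)(X - mu)]
rhs = sigma**2 * sum(2 * x for x in xs) / n   # approximates sigma^2 E[g'(X)]

print(lhs, rhs, 2 * mu * sigma**2)  # all three close to 9
```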

1.3 Sufficient statistics


1.3.1 Definitions and examples
For X ∼ Pθ and a statistic T = T (X), the following are equivalent definitions or characterizations
of the sufficiency of T (in which case we call T a sufficient statistic):

• The conditional distribution of X given T does not depend on θ.

• Given T , it is possible to construct a random variable X 0 having the same distribution as X.

• (Fisher–Neyman) There exist nonnegative functions gθ and h such that f (x | θ) = gθ (T (x))·h(x).


This is called the factorization criterion.

Remarks:

• Obviously, X itself is a sufficient statistic.

• If T is a sufficient statistic and there exists a function g such that T = g(S), then S is a sufficient
statistic.

• A sufficient statistic T is minimal if for any sufficient statistic S, there exists a function g such
that T = g(S).

• If sufficient statistics S and T are functions of each other, then we say that they are equivalent.

Examples of sufficient statistics:

• Let X have a symmetric distribution on R. Then T = |X| is a sufficient statistic.


• Let X1 , . . . , Xn be i.i.d. random variables sampled from a distribution on R. Then the set of
order statistics T = (X(1) , . . . , X(n) ) is sufficient.

• Consider i.i.d. X1, . . . , Xn ∼ Unif(0, θ). Then T = X(n) = max_{i∈[n]} Xi is a sufficient statistic by
the Fisher–Neyman criterion:

f(x1, . . . , xn | θ) = (1/θ)^n 1{0 ≤ xi ≤ θ for all i ∈ [n]}
= (1/θ)^n 1{x(n) ≤ θ} · 1{xi ≥ 0 for all i ∈ [n]} = gθ(T) · h(x1, . . . , xn).

• Consider i.i.d. X1, . . . , Xn ∼ Poi(λ). Then T = Σ_{i=1}^n Xi is a sufficient statistic, since
f(x1, . . . , xn | λ) = λ^(Σi xi) e^(−nλ) / Πi(xi!).

• Consider i.i.d. X1, . . . , Xn ∼ N(µ, σ²). A set of sufficient statistics is T = (Σi Xi, Σi Xi²) or
T′ = (µ̂, σ̂²) where µ̂ := (1/n) Σi Xi and σ̂² := (1/n) Σi (Xi − µ̂)², since

f(x1, . . . , xn | µ, σ²) = (1/(2πσ²)^(n/2)) exp(−(1/(2σ²)) Σi xi² + (µ/σ²) Σi xi − (n/(2σ²)) µ²).

• Let Pθ be from an exponential family with density f(x | θ) = exp(η⊤T(x) − A(η)) · h(x), where
η(θ), T(x) ∈ R^m. For i.i.d. X1, . . . , Xn ∼ Pθ, the distribution of (X1, . . . , Xn) has density

exp(η⊤ Σi T(xi) − nA(η)) · Πi h(xi),

where Σi T(Xi) ∈ R^m is a sufficient statistic.

Although there is no loss statistically (or information-theoretically) in retaining only sufficient


statistics, it may not be computationally favorable to do so. See the excellent paper [Mon15]. (It
is not even obvious how to generate n i.i.d. Gaussians from the empirical mean and variance.)

1.3.2 Some results


Lemma 1.3. Consider a family of distributions {Pθ : θ ∈ Θ} where every Pθ is absolutely con-
tinuous with respect to µ. A statistic S is sufficient if and only if for any θ, θ0 ∈ Θ, the ratio
f (x | θ)/f (x | θ0 ) is a function only of S(x).

Proof. This follows immediately from the factorization criterion.

Lemma 1.4. Consider a finite family of distributions with densities f0, f1, . . . , fk, all having the
same support. Then the statistic T = (f1/f0, f2/f0, . . . , fk/f0) is minimal sufficient.

Proof. We need to show that for any sufficient statistic S, there exists a function g such that
T = g(S). This follows immediately from the previous result.

Lemma 1.5. Let P be a family of distributions with common support and P 0 ⊂ P. If a statistic
T is minimal sufficient for P 0 and sufficient for P, then it is minimal sufficient for P.
Proof. If S is a sufficient statistic for P, it is a sufficient statistic for P 0 . Hence there exists a
function g such that T = g(S).

Consider an exponential family with density f(x | θ) = exp(η⊤T(x) − A(η)) · h(x), where θ ∈ Θ.
If the interior of the set η(Θ) ⊂ R^m is not empty and if T does not satisfy an affine constraint
v⊤T = c for nonzero v ∈ R^m and c ∈ R, then the exponential family is said to be of full rank.

Theorem 1.6. Consider an exponential family with density f(x | θ) = exp(η⊤T(x) − A(η)) · h(x),
where θ ∈ Θ and η = η(θ) ∈ R^m. Suppose that T does not satisfy an affine constraint of the form
v⊤T = c. If there exist natural parameters η(0), η(1), . . . , η(m) such that {η(i) − η(0) : i ∈ [m]} spans
R^m, then T is minimal sufficient.
In particular, the sufficient statistic T in a full-rank exponential family P is minimal.

Proof. Let P′ ⊂ P be a subfamily consisting of the m + 1 distributions with natural parameters
η(0), η(1), . . . , η(m). By Lemma 1.4, a minimal sufficient statistic for P′ is

T0 = (exp((η(1) − η(0))⊤T − A(η(1)) + A(η(0))), . . . , exp((η(m) − η(0))⊤T − A(η(m)) + A(η(0)))),

which is equivalent to

((η(1) − η(0))⊤T, . . . , (η(m) − η(0))⊤T).

This is equivalent to T if and only if the matrix with columns {η(i) − η(0) : i ∈ [m]} is nonsingular,
which holds by the spanning assumption. Conclude using Lemma 1.5. Such a subfamily can be
chosen because the exponential family is of full rank.

1.4 Convexity, maximum entropy, Rao–Blackwell


Here are some basic facts about convex functions:

• A function f : (a, b) → R is convex if and only if its epigraph is a convex set.


• If f : (a, b) → R is convex, it is continuous on (a, b), and has a left and right derivative at every
point in (a, b).
• If f is differentiable on (a, b), then f is convex if and only if f 0 is nondecreasing.
• If f is twice differentiable on (a, b), then f is convex if and only if f 00 ≥ 0.

Proposition 1.7. Consider a real-valued convex function f on a convex open set S ⊂ Rn . At each
x ∈ S, there exists a vector v ∈ Rn such that f (y) − f (x) ≥ v > (y − x) for any y ∈ S. This vector
v is called a subgradient of f at x.
If f is strictly convex, then v can be chosen so that the inequality is strict unless y = x.
Proposition 1.8 (Jensen’s inequality). Consider a real-valued convex function f on a convex set
S ⊂ R^n. For any x1, . . . , xn ∈ S and a1, . . . , an ∈ [0, 1] such that Σi ai = 1, we have f(Σi ai xi) ≤
Σi ai f(xi).
More generally, if X is a random variable taking values in S ⊂ R^n and E[X] < ∞, then
f(E[X]) ≤ E[f(X)]. If f is strictly convex, the inequality is strict unless P{X = E[X]} = 1.

An example is e^(λ E[X]) ≤ E[e^(λX)].

Proof. Define L(y) = f(x) + v⊤(y − x) ≤ f(y) for x = E[X]. Then E[f(X)] ≥ E[L(X)] =
L(E[X]) = f(E[X]).

The entropy of a random variable X ∼ P with density p is defined as


H(X) = H(P) := EP [− log p(X)].
In the case of a continuous distribution, the entropy is also called the differential entropy.
Consider probability distributions P and Q on X with densities p and q respectively, such that P
is absolutely continuous with respect to Q. The relative entropy (or Kullback–Leibler divergence)
between them is

KL(P, Q) = D(P ∥ Q) := EP[log(p(X)/q(X))] = ∫ p(x) log(p(x)/q(x)) dµ(x) ≥ 0.

The KL divergence is not symmetric, but it is nonnegative, and it is zero if and only if p = q almost
everywhere:

EP[log(p(X)/q(X))] = EQ[(p(X)/q(X)) log(p(X)/q(X))] = EQ[f(p(X)/q(X))] ≥ f(EQ[p(X)/q(X)]) = f(1) = 0,

where f(x) = x log x is convex.
Theorem 1.9 (Maximum entropy principle). Consider a random variable X, a vector of statistics
T = T(X) ∈ R^m, and a fixed vector µ ∈ R^m (which is a value that T may take). Denote by P* the
solution to the optimization problem:

max_P H(P) s.t. X ∼ P, E[T(X)] = µ.

Then P* is from an exponential family with density C exp(θ⊤T(x) − A(θ)), where θ = θ(µ).

Proof. Consider P and Pθ such that EP[T(X)] = EPθ[T(X)] = µ, with densities p and pθ(x) =
C exp(θ⊤T(x) − A(θ)) respectively. Then

H(P) = −EP[log p(X)] = −EP[log(p(X)/pθ(X))] − EP[log pθ(X)]
= −KL(P, Pθ) − EP[θ⊤T(X) − A(θ)] − log C
≤ −EPθ[θ⊤T(X) − A(θ)] − log C
= −EPθ[log pθ(X)] = H(Pθ),

where the inequality holds because KL(P, Pθ) ≥ 0 and EP[T(X)] = EPθ[T(X)] = µ.
Recall that for an estimator ĝ and a loss function L(g, ĝ), the risk is the expected loss R(g, ĝ) =
E[L(g, ĝ)].
Theorem 1.10 (Rao–Blackwell). Consider X ∼ Pθ from the family {Pθ : θ ∈ Θ} and a suffi-
cient statistic T = T (X). Suppose that the loss function L(g, ·) is convex in the second variable.
Moreover, consider an estimator ĝ = ĝ(X) such that E[ĝ(X)] < ∞ and R(g, ĝ) < ∞. Define an
estimator g̃ = g̃(T ) = E[ĝ(X) | T ]. Then we have R(g, g̃) ≤ R(g, ĝ).
If L(g, ·) is strictly convex in the second variable, then the above inequality is strict unless
P{ĝ = g̃} = 1.

Proof. By Jensen’s inequality,

L(g, g̃) = L(g, E[ĝ(X) | T]) ≤ E[L(g, ĝ(X)) | T].

Taking the expectation on both sides with respect to T yields the result.
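As an illustration, consider estimating g(λ) = e^(−λ) = P{X = 0} from i.i.d. X1, . . . , Xn ∼ Poi(λ). The crude unbiased estimator ĝ = 1{X1 = 0}, conditioned on the sufficient statistic T = Σi Xi, becomes g̃ = E[ĝ | T] = (1 − 1/n)^T, since X1 | T = t ∼ Bin(t, 1/n). The Monte Carlo sketch below (λ, n, and the trial count are arbitrary choices, and the Poisson sampler is Knuth's method) compares the two risks under squared loss:

```python
import math
import random

random.seed(2)

def poisson(lam):
    # Knuth's method for sampling Poi(lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

lam, n, trials = 1.0, 10, 5000
target = math.exp(-lam)   # estimand g(lambda) = P{X = 0}

naive_se = rb_se = 0.0
for _ in range(trials):
    xs = [poisson(lam) for _ in range(n)]
    naive = 1.0 if xs[0] == 0 else 0.0   # unbiased but crude
    rb = (1 - 1 / n) ** sum(xs)          # Rao-Blackwellized version
    naive_se += (naive - target) ** 2
    rb_se += (rb - target) ** 2

print(naive_se / trials, rb_se / trials)  # the second risk is much smaller
```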

1.5 Bias, variance, MVUE


1.5.1 Theory
Consider X ∼ Pθ. Let g(θ) be an estimand, and let ĝ(X) be an estimator of g(θ). The bias of ĝ is
E[ĝ(X)] − g(θ). The variance of ĝ is Var(ĝ(X)). An estimator is unbiased if E[ĝ(X)] = g(θ). The
bias-variance decomposition for the squared loss refers to the following:

E[(ĝ(X) − g(θ))²] = E[((ĝ(X) − E[ĝ(X)]) + (E[ĝ(X)] − g(θ)))²]
= E[(ĝ(X) − E[ĝ(X)])²] + 2 E[ĝ(X) − E[ĝ(X)]] · (E[ĝ(X)] − g(θ)) + (E[ĝ(X)] − g(θ))²
= Var(ĝ(X)) + Bias(ĝ(X))²,

where the cross term vanishes because E[ĝ(X) − E[ĝ(X)]] = 0.

An unbiased estimator ĝ(X) is called the uniformly minimum-variance unbiased estimator (MVUE)
of g(θ) if Varθ(ĝ(X)) ≤ Varθ(g̃(X)) for all θ ∈ Θ for any unbiased estimator g̃(X).
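The decomposition can be verified on simulated data; in fact, when the empirical MSE, variance, and bias are computed from the same batch of estimates, MSE = Var + Bias² holds exactly as an algebraic identity. A sketch using the (biased) plug-in estimator θ̂² of θ² from Bernoulli samples (θ, n, and the trial count are arbitrary choices):

```python
import random

random.seed(3)
theta, n, trials = 0.4, 20, 10000
g = theta**2   # estimand g(theta) = theta^2

# Plug-in estimator (sample mean)^2, which is biased for theta^2
ests = []
for _ in range(trials):
    mean = sum(random.random() < theta for _ in range(n)) / n
    ests.append(mean**2)

m = sum(ests) / trials
mse = sum((e - g) ** 2 for e in ests) / trials
var = sum((e - m) ** 2 for e in ests) / trials
bias2 = (m - g) ** 2

assert abs(mse - (var + bias2)) < 1e-9   # exact up to rounding
print(mse, var, bias2)
```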
Theorem 1.11. Consider X ∼ Pθ where θ ∈ Θ, and let ĝ(X) be an estimator of g(θ) such that
Eθ[ĝ²] < ∞ for all θ ∈ Θ. Let U denote the set of statistics U = U(X) such that Eθ[U] = 0 and
Eθ[U²] < ∞ for all θ ∈ Θ. Then ĝ(X) is the MVUE if and only if it is unbiased and

Cov(ĝ, U) = Eθ[ĝU] = 0 for all U ∈ U and all θ ∈ Θ.

Interpretation: If U is “irrelevant” for estimating g(θ), then the MVUE ĝ is orthogonal to U.
In addition, for any unbiased estimator g̃, the MVUE is ĝ = g̃ − Ũ where Ũ is the orthogonal
projection of g̃ onto U.

Proof. “⇒”: Fix such a U and θ ∈ Θ. For any λ ∈ R, g̃ = ĝ + λU is an unbiased estimator. Since
ĝ is the MVUE by assumption, we have

Var(ĝ) ≤ Var(ĝ + λU) = Var(ĝ) + 2λ Cov(ĝ, U) + λ² Var(U).

This is violated at λ = −Cov(ĝ, U)/Var(U) unless Cov(ĝ, U) = 0.
“⇐”: Let g̃(X) be an unbiased estimator with E[g̃²] < ∞. Then U := ĝ − g̃ has zero mean and
finite variance, so E[ĝ(ĝ − g̃)] = 0 by assumption. This implies

Var(ĝ) = E[ĝ²] − (E[ĝ])² = E[ĝg̃] − (E[ĝ])(E[g̃]) = E[(ĝ − E[ĝ])(g̃ − E[g̃])]
≤ √(E[(ĝ − E[ĝ])²]) · √(E[(g̃ − E[g̃])²]) = √(Var(ĝ)) · √(Var(g̃))

by the Cauchy–Schwarz inequality. Hence Var(ĝ) ≤ Var(g̃).

A statistic T = T(X) is called complete if

Eθ[f(T)] = 0 for all θ ∈ Θ implies f(T) = 0 almost surely for every θ ∈ Θ.

Theorem 1.12. If X ∼ Pθ for a full-rank exponential family {Pθ }, then T is complete.

For a proof, see Theorem 4.3.1 of [LR06]. The above result leads to an important theorem by
Lehmann and Scheffé.

Theorem 1.13 (Lehmann–Scheffé). Consider X ∼ Pθ , and let T be a complete sufficient statistic


for {Pθ : θ ∈ Θ}. Suppose that g̃(X) is an unbiased estimator of g(θ). Define ĝ(T ) by ĝ(t) =
E[g̃(X) | T = t] (as in the Rao–Blackwell theorem). Then ĝ(T )
• is an unbiased estimator of g(θ);

• is the only unbiased estimator that is a function of T ;

• uniformly minimizes the risk for any loss L(g, ·) convex in the second variable;

• is the MVUE.

Proof. We check that E[ĝ(T )] = E[g̃(X)] = g(θ), so ĝ(T ) is unbiased.


For uniqueness, let ĝ(T ) and g̃(T ) be two unbiased estimators. Then E[ĝ(T ) − g̃(T )] = 0, so by
completeness we have ĝ = g̃.
The Rao–Blackwell theorem implies that E[L(g(θ), ĝ(T))] ≤ E[L(g(θ), g̃(X))] for any unbiased
g̃(X). Taking L(x, y) = (x − y)², since both estimators are unbiased, this gives Var(ĝ(T)) ≤ Var(g̃(X)).
Taking L(x, y) = (x − y)2 , we have Var(ĝ(T )) ≤ Var(g̃(X)).

1.5.2 Examples
• Gaussian MVUE: Consider i.i.d. X1, . . . , Xn ∼ N(µ, σ²), where µ and σ are unknown. Recall
that the empirical mean µ̂ = (1/n) Σ_{i=1}^n Xi and empirical variance σ̂² = (1/n) Σ_{i=1}^n (Xi − µ̂)² are jointly
sufficient statistics. They are also complete by full-rankness. Since E[µ̂] = µ and E[σ̂²] =
E[(X1 − µ̂)²] = ((n − 1)/n) σ², we have that (µ̂, (n/(n − 1)) σ̂²) is the MVUE of (µ, σ²).

• Uniform MVUE: Consider i.i.d. X1, . . . , Xn ∼ Unif(0, θ) and g(θ) = θ. It can be shown that the
sufficient statistic T = X(n) is complete. Note that 2X1 is an unbiased estimator of θ, so the
MVUE is

ĝ(t) = E[2X1 | X(n) = t] = (1/n) · 2t + ((n − 1)/n) · t = t(n + 1)/n.
• Binomial MVUE: Consider X ∼ Bin(N, p) and g(p) = p(1 − p). Recall that T(X) = X. Then
E[ĝ(T)] = g(p) says

Σ_{x=0}^N (N choose x) ĝ(x) p^x (1 − p)^(N−x) = p(1 − p).

Let r = p/(1 − p). Then we have p = r/(1 + r) and 1 − p = 1/(1 + r). Dividing both sides by
(1 − p)^N gives

Σ_{x=0}^N (N choose x) ĝ(x) r^x = p(1 − p)(1 + r)^N = r(1 + r)^(N−2) = Σ_{x=1}^{N−1} (N−2 choose x−1) r^x,

which holds for p ∈ (0, 1) or r ∈ (0, ∞). Matching coefficients, we can take

ĝ(x) = (N−2 choose x−1) / (N choose x) = x(N − x)/(N(N − 1)).
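The uniform case can be checked by simulation: the MVUE ((n + 1)/n) · X(n) should be unbiased with much smaller variance than the unbiased comparison estimator 2X̄, the average of the 2Xi. A sketch with arbitrary θ, n, and trial count:

```python
import random

random.seed(4)
theta, n, trials = 5.0, 10, 20000

mvue_vals, naive_vals = [], []
for _ in range(trials):
    xs = [random.uniform(0, theta) for _ in range(n)]
    mvue_vals.append((n + 1) / n * max(xs))   # MVUE based on X_(n)
    naive_vals.append(2 * sum(xs) / n)        # unbiased, but larger variance

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(mean(mvue_vals), mean(naive_vals))   # both close to theta = 5
print(var(mvue_vals), var(naive_vals))     # the MVUE variance is far smaller
```

(The exact variances are θ²/(n(n + 2)) ≈ 0.21 for the MVUE and θ²/(3n) ≈ 0.83 for 2X̄.)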

1.6 Lower bounds on the variance of an unbiased estimator
Let us assume a few technical conditions throughout this section:

• Θ ⊂ R and Θ is an open interval;

• The support {x : f (x | θ) > 0} is independent of θ;


• ∂f(x | θ)/∂θ exists and is finite for all x and θ;

• Differentiation under the integral sign works.

1.6.1 Lower bounds and the Fisher information


For an estimator ĝ(X) with Eθ[ĝ] = g(θ) and Eθ[ĝ²] < ∞, we now provide three lower bounds on
Varθ(ĝ(X)).

1. (Cauchy–Schwarz) For any function φ(x, θ) with Eθ[φ(X, θ)²] < ∞,

Varθ(ĝ) ≥ Covθ(ĝ, φ)² / Varθ(φ). (1.2)

The problem with this simple bound is that the right-hand side depends on the estimator ĝ.

2. (Hammersley–Chapman–Robbins inequality) Let us choose φ(x, θ) = f(x | θ + δ)/f(x | θ) − 1. Then we have
Covθ(ĝ, φ) = Eθ[ĝφ] = Eθ+δ[ĝ] − Eθ[ĝ] = g(θ + δ) − g(θ). Hence

Varθ(ĝ) ≥ (g(θ + δ) − g(θ))² / Eθ[(f(X | θ + δ)/f(X | θ) − 1)²].

3. (Cramér–Rao) Suppose that there exists a function B(x, θ) and ε > 0 such that

Eθ[B(X, θ)²] < ∞ and |f(x | θ + δ) − f(x | θ)| / (|δ| f(x | θ)) ≤ B(x, θ) for all |δ| ≤ ε.

If g is differentiable, taking the limit δ → 0 in the Hammersley–Chapman–Robbins bound and
applying dominated convergence, we obtain

Varθ(ĝ) ≥ g′(θ)² / Eθ[((∂f(X | θ)/∂θ) / f(X | θ))²] = g′(θ)² / I(θ).

Here I(θ) is the Fisher information that X contains about θ, defined by

I(θ) = Eθ[(∂/∂θ log f(X | θ))²].

Since Eθ[∂/∂θ log f(X | θ)] = 0, we have

I(θ) = Varθ(∂/∂θ log f(X | θ)).

If, in addition, ∂²/∂θ² log f(x | θ) exists for all x and θ and differentiation under the integral sign holds,
then taking the expectation of

∂²/∂θ² log f(x | θ) = (∂²f(x | θ)/∂θ²) / f(x | θ) − ((∂f(x | θ)/∂θ) / f(x | θ))²

yields

I(θ) = −Eθ[∂²/∂θ² log f(X | θ)].

Note that if θ is a differentiable function of ξ, the Fisher information X contains about ξ is

Ĩ(ξ) = I(θ) · (θ′(ξ))².

1.6.2 Extensions
• (i.i.d. observations) By definition, we can check:
Lemma 1.14. Let X1 and X2 be independent random variables with densities f1 (x | θ) and
f2 (x | θ) respectively. If I1 (θ), I2 (θ) and I(θ) denote the information X1 , X2 and (X1 , X2 )
contain about θ, then I(θ) = I1 (θ) + I2 (θ).

Therefore, if we observe i.i.d. X1 , . . . , Xn ∼ Pθ , then


2
g 0 (θ)
Varθ (ĝ) ≥ .
nI1 (θ)

• (Biased case) If ĝ is a biased estimator with Eθ ĝ = g(θ) + b(θ), then the same argument yields
2
g 0 (θ) + b0 (θ)
Varθ (ĝ) ≥ .
I(θ)

• (Multivariate case) Consider θ ∈ Rm . Analogous to (1.2), we have the following result.


Theorem 1.15. Consider an unbiased estimator ĝ and functions φi (x, θ) with finite second
moments where i ∈ [m]. Define γ ∈ Rm by γi = Cov(ĝ, φi ), and define C ∈ Rm×m by Cij =
Cov(φi , φj ). Then
       Var(ĝ) ≥ γ⊤ C⁻¹ γ.

Under some regularity conditions similar to the one-dimensional case, the information matrix
I ∈ Rm×m is defined by
       Iij (θ) = Eθ [(∂/∂θi ) log f (X | θ) · (∂/∂θj ) log f (X | θ)]
               = Covθ ((∂/∂θi ) log f (X | θ), (∂/∂θj ) log f (X | θ))
               = − Eθ [(∂²/∂θi ∂θj ) log f (X | θ)].
Hence I(θ) = − Eθ [∇2 log f (X | θ)].
Theorem 1.16 (Cramér–Rao, Information Inequality). Assume mild regularity conditions (sim-
ilar to the one-dimensional case) and that I(θ) is positive definite. Define α ∈ R^m by
αi = (∂/∂θi ) Eθ [ĝ]. Then we have

       Varθ (ĝ) ≥ α⊤ I(θ)⁻¹ α.

1.6.3 Examples
• Exponential family: X ∼ Pη with f (x | η) = exp(η⊤T (x) − A(η)) h(x). Then ∇² log f (x | η) =
−∇²A(η), so I(η) = ∇²A(η).

• Gaussian: X ∼ N(µ, σ²) where σ is known. Then f (x | µ) = (1/(√(2π)σ)) exp(µx/σ² − x²/(2σ²) − µ²/(2σ²)),
and thus (∂/∂µ) log f (x | µ) = x/σ² − µ/σ². It follows that I(µ) = 1/σ² and I(µ²) = I(µ)·(1/(2µ))² = 1/(4µ²σ²).

• Poisson: X ∼ Poi(λ) so that f (x | λ) = (λ^x/x!) e^{−λ}. Hence (∂/∂λ) log f (x | λ) = x/λ − 1. It follows
that I(λ) = 1/λ. However, by a change of variable, I(log λ) = I(λ)·((d/dξ) e^ξ |_{ξ=log λ})² = λ.

• Binomial: X ∼ Bin(N, p) where N is known. Then (∂/∂p) log f (x | p) = x/p − (N − x)/(1 − p), so
I(p) = N p(1 − p)[1/p + 1/(1 − p)]² = N/(p(1 − p)). Then Var(X/N ) = p(1 − p)/N, and the equality is
attained in the Cramér–Rao bound.
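As a quick numerical sanity check (not part of the notes), the identity I(θ) = Varθ ((∂/∂θ) log f (X | θ)) can be tested by Monte Carlo. The sketch below, using only the Python standard library, estimates the Fisher information of Poi(λ) from the empirical second moment of the score x/λ − 1; for λ = 4 the target is I(λ) = 1/λ = 0.25. The helper `poisson_sample` is our illustrative choice, not from the notes.

```python
import math
import random

def poisson_sample(lam, rng):
    # Knuth's multiplication method for sampling Poi(lam); fine for moderate lam
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def score(x, lam):
    # score function of the Poisson family: d/dlam log f(x | lam) = x/lam - 1
    return x / lam - 1.0

rng = random.Random(0)
lam, n = 4.0, 200_000
scores = [score(poisson_sample(lam, rng), lam) for _ in range(n)]
mean_score = sum(scores) / n                 # E[score] = 0
fisher_mc = sum(s * s for s in scores) / n   # Var(score) = I(lam) = 1/lam = 0.25
```

The same check works for any of the families above once the score function is substituted.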

Chapter 2

Bayesian versus minimax

2.1 Bayesian estimation


Consider X ∼ Pθ with density f (x | θ). Suppose that θ ∼ π where π is the prior distribution with
density p(θ). The marginal distribution of X has density

       f (x) = ∫ f (x | θ) · p(θ) dµ(θ).

The posterior distribution of θ refers to the conditional distribution with density

       p(θ | x) = (f (x | θ)/f (x)) p(θ).

2.1.1 Bayes risk and Bayes estimator


For a loss function L(g, ĝ), recall that the risk is R(g, ĝ) := EX∼Pθ [L(g(θ), ĝ(X))]. The Bayes risk
is defined as
       Rπ (ĝ) := Eθ∼π [R(g, ĝ)].

An estimator ĝ is called a Bayes (optimal) estimator if Rπ (ĝ) ≤ Rπ (g̃) for any other estimator g̃.
A stronger condition is that the estimator ĝ minimizes the posterior loss

       Eθ∼π [L(g(θ), ĝ(X)) | X].

• For the squared loss L(g, g̃) = (g − g̃)2 , the posterior mean ĝ(X) := E[g(θ) | X] is a Bayes
estimator because for any estimator g̃,

E[(g(θ) − g̃(X))2 | X] = E[(g(θ) − ĝ(X))2 | X] + (ĝ(X) − g̃(X))2 .


Note that we have

       E[g(θ) | X] = ∫ g(θ) p(θ | X) dθ = ∫ g(θ) (f (X | θ)/f (X)) p(θ) dθ
                   = ∫ g(θ) f (X | θ) p(θ) dθ / ∫ f (X | θ) p(θ) dθ.

• For the `1 loss L(g, g̃) = |g − g̃|, it is not hard to check that the posterior median is a Bayes
estimator.

2.1.2 Examples
Gaussian Consider X ∼ N(θ, 1) where θ ∼ N(0, σ²). The Bayes estimator under the squared loss
L(θ, θ̂) = (θ − θ̂)² is the posterior mean

       θ̂(X) := E[θ | X] = ∫ θ f (X | θ) p(θ) dθ / ∫ f (X | θ) p(θ) dθ.

One straightforward but tedious way to obtain θ̂(X) is to compute

       ∫ f (X | θ) p(θ) dθ = (1/(2πσ)) ∫ e^{−(X−θ)²/2 − θ²/(2σ²)} dθ = (1/√(2π(1 + σ²))) exp(−X²/(2(1 + σ²))),

and

       ∫ θ f (X | θ) p(θ) dθ = (1/(2πσ)) ∫ θ e^{−(X−θ)²/2 − θ²/(2σ²)} dθ = (σ²X/√(2π(1 + σ²)³)) exp(−X²/(2(1 + σ²))).

Therefore,
       θ̂(X) = σ²X/(1 + σ²).
Alternatively, we may note that the posterior density of θ is

       p(θ | X) = f (X | θ) p(θ)/f (X) = f (X | θ) p(θ) / ∫ f (X | θ) p(θ) dθ,

where

       f (X | θ) p(θ) = (1/(2πσ)) e^{−(X−θ)²/2 − θ²/(2σ²)} = hσ (X) · exp(−((σ² + 1)/(2σ²)) (θ − σ²X/(1 + σ²))²)

for some inessential function hσ (X). Since the above quantity is proportional to the PDF of the
Gaussian N(σ²X/(1 + σ²), σ²/(1 + σ²)), we conclude that this Gaussian is the posterior distribution
of θ conditional on X. Therefore, the posterior mean is σ²X/(1 + σ²).

The risk of θ̂ is

       EX∼N(θ,1) (σ²X/(1 + σ²) − θ)² = (σ²/(1 + σ²))² + (1/(1 + σ²))² θ²,

and the Bayes risk is

       Eθ∼N(0,σ²) [(σ²/(1 + σ²))² + (1/(1 + σ²))² θ²] = (σ²/(1 + σ²))² + (σ/(1 + σ²))² = σ²/(1 + σ²).
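As a sanity check on this example (our own code, not from the notes), the snippet below simulates the model θ ∼ N(0, σ²), X | θ ∼ N(θ, 1) and compares the empirical Bayes risk of θ̂(X) = σ²X/(1 + σ²) with the theoretical value σ²/(1 + σ²), as well as with the unshrunk estimator X.

```python
import random

rng = random.Random(1)
sigma2 = 2.0
shrink = sigma2 / (1.0 + sigma2)   # Bayes estimator: shrink * X
n = 400_000

se_bayes = se_raw = 0.0
for _ in range(n):
    theta = rng.gauss(0.0, sigma2 ** 0.5)   # theta ~ N(0, sigma^2)
    x = rng.gauss(theta, 1.0)               # X | theta ~ N(theta, 1)
    se_bayes += (shrink * x - theta) ** 2
    se_raw += (x - theta) ** 2

bayes_risk = se_bayes / n   # theory: sigma^2/(1 + sigma^2) = 2/3
raw_risk = se_raw / n       # theory: 1
```

With σ² = 2 the simulated Bayes risk should land near 2/3, strictly below the risk 1 of X itself.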

i.i.d. Gaussians Suppose that we observe i.i.d. X1 , . . . , Xn ∼ N(θ, τ²) where θ ∼ N(µ, σ²). By
the same argument, it suffices to consider

       ∏_{i=1}^n f (Xi | θ) p(θ) = (1/((2π)^{(n+1)/2} τ^n σ)) exp(− Σ_{i=1}^n (Xi − θ)²/(2τ²) − (θ − µ)²/(2σ²))
                                 = hµ,τ,σ (X) · exp(−((nσ² + τ²)/(2σ²τ²)) (θ − (nσ²/(τ² + nσ²)) X̄ − (τ²/(τ² + nσ²)) µ)²).

We then conclude that the posterior distribution of θ conditional on X1 , . . . , Xn is

       N( (nσ²/(τ² + nσ²)) X̄ + (τ²/(τ² + nσ²)) µ,  σ²τ²/(τ² + nσ²) ).

(Poisson, Gamma) pair Consider X ∼ Poi(λ), where λ ∼ Gamma(a, b) for a, b > 0. Recall that
the gamma distribution has density

       p(λ) = (b^a/Γ(a)) λ^{a−1} e^{−bλ},  where Γ(a) = ∫₀^∞ x^{a−1} e^{−x} dx.

Then we have

       f (x) = ∫₀^∞ f (x | λ) p(λ) dλ = ∫₀^∞ (λ^x e^{−λ}/x!) (b^a/Γ(a)) λ^{a−1} e^{−bλ} dλ = b^a Γ(a + x)/(x! (b + 1)^{a+x} Γ(a)).

Hence, the posterior has density

       p(λ | x) = (f (x | λ)/f (x)) p(λ) = ((b + 1)^{a+x}/Γ(a + x)) λ^{a+x−1} e^{−(b+1)λ},

which turns out to be the density of Gamma(a + x, b + 1).


In this case, the prior and the posterior are conjugate distributions (i.e., in the same family of
distributions), and the prior is called a conjugate prior. Other (likelihood, conjugate prior) pairs
include (Binomial, Beta), (Multinomial, Dirichlet), ...
For a fixed k ≥ 0, consider the loss L(λ, λ̂) = (λ − λ̂)²/λ^k . Then we have

       Eπ [(λ − λ̂)²/λ^k | X] = ∫₀^∞ ((λ − λ̂)²/λ^k ) p(λ | X) dλ
                             = ((b + 1)^{k−2} Γ(a + X − k)/Γ(a + X)) {(b + 1)λ̂ ((b + 1)λ̂ − 2(a + X − k)) + (a + X − k)(a + X − k + 1)},

which is minimized at λ̂ = λ̂(X) = (a + X − k)/(b + 1). Therefore, this λ̂ is the Bayes estimator.
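The formula λ̂ = (a + X − k)/(b + 1) can be double-checked numerically: since the posterior is Gamma(a + X, b + 1), the posterior expected loss is a quadratic in λ̂ whose coefficients are Gamma moments. The sketch below is our own (the function names and the grid search are illustrative choices) and verifies that the quadratic is minimized at the claimed point.

```python
import math

def gamma_moment(alpha, beta, m):
    # E[lambda^m] for lambda ~ Gamma(alpha, beta) (rate parametrization),
    # valid whenever alpha + m > 0
    return math.gamma(alpha + m) / (math.gamma(alpha) * beta ** m)

def posterior_loss(lam_hat, a, b, x, k):
    # E[(lambda - lam_hat)^2 / lambda^k | X = x], posterior Gamma(a + x, b + 1)
    alpha, beta = a + x, b + 1
    return (gamma_moment(alpha, beta, 2 - k)
            - 2 * lam_hat * gamma_moment(alpha, beta, 1 - k)
            + lam_hat ** 2 * gamma_moment(alpha, beta, -k))

a, b, x, k = 3.0, 2.0, 5, 1
bayes = (a + x - k) / (b + 1)      # claimed minimizer: (a + X - k)/(b + 1) = 7/3
best_on_grid = min((posterior_loss(t / 100, a, b, x, k), t / 100)
                   for t in range(1, 1000))[1]
```

Completing the square in the quadratic gives the minimizer E[λ^{1−k} | X]/E[λ^{−k} | X] = (a + X − k)/(b + 1), which is what the grid search recovers.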

2.1.3 Hierarchical Bayes


It is sometimes useful to consider a hierarchical framework of Bayesian estimation consisting of more
than one “level” of prior. For example, let X be the random observation with density f (x | θ),
where θ is the parameter with prior density p(θ | γ) further parametrized by a hyperparameter γ;
moreover, suppose that γ is a random variable with density φ(γ). Examples:

• In the conjugate normal hierarchy, we have X ∼ N(θ, 1), θ ∼ N(0, σ²), and 1/σ² ∼ Gamma(a, b),
where a, b are known.

• As an illustrative example, consider the following mixture model:

  – X ∼ Σ_{i=1}^k wi N(θi , 1), where θ1 , . . . , θk are fixed.

  – w = (w1 , . . . , wk ) follows the Dirichlet distribution with parameter α > 0. The Dirichlet
    distribution is defined on the probability simplex ∆ := {v ∈ R^k : Σ_{i=1}^k vi = 1, vi ≥ 0} and
    has density (Γ(kα)/Γ(α)^k ) ∏_{i=1}^k wi^{α−1}.
    If α = 1, the Dirichlet distribution becomes the uniform distribution on the simplex ∆. The
    smaller α is, the “sparser” a corresponding Dirichlet random variable. As α → ∞, the Dirichlet
    distribution converges to the point mass at (1/k, . . . , 1/k). As α → 0, the Dirichlet distribution
    converges to the discrete distribution (1/k) Σ_{i=1}^k δ_{ei}.

  – α follows the exponential distribution with density e^{−α}, for example.

Such a hierarchical model may be useful partly because the hyperparameter space is more man-
ageable and allows tuning of the sparsity of the mixing weights.

2.1.4 Several perspectives of estimation


Unbiased versus Bayesian In the Gaussian example X ∼ N(θ, 1) where θ ∼ N(0, σ²), the
Bayes estimator θ̂(X) = σ²X/(1 + σ²) is biased. The next result shows an intrinsic contradiction
between unbiased and Bayesian estimation.
Theorem 2.1. Consider X ∼ Pθ where θ ∼ π. Under the squared loss L(g, ĝ) = (g − ĝ)2 , no
unbiased estimator ĝ can be a Bayes estimator, unless the Bayes risk is zero.
Proof. For the squared loss, the Bayes estimator is the posterior mean ĝ(X) = E[g(θ) | X]. If ĝ(X)
is unbiased, we have E[ĝ(X) | θ] = g(θ) for any θ. Then
• E[g(θ)ĝ(X)] = E[E[g(θ)ĝ(X) | θ]] = E[g(θ) E[ĝ(X) | θ]] = E[g(θ)2 ];
• E[g(θ)ĝ(X)] = E[E[g(θ)ĝ(X) | X]] = E[E[g(θ) | X] ĝ(X)] = E[ĝ(X)2 ].
Therefore E[(g(θ) − ĝ(X))2 ] = 0.

Maximum likelihood estimation Consider the likelihood L(θ | x) := f (x | θ) and the log-
likelihood log L(θ | x) := log f (x | θ). Given i.i.d. X1 , . . . , Xn ∼ Pθ , the maximum likelihood
estimator (MLE) of g(θ) is defined to be ĝ := g(θ̂) where

       θ̂ := argmax_{θ∈Θ} Σ_{i=1}^n log f (Xi | θ).

Let us recall the unbiased estimation for Gaussian mean and variance. Given i.i.d. X1 , . . . , Xn ∼
N(θ, σ²) where θ and σ are unknown, the empirical mean θ̂ is the MVUE of θ and (n/(n − 1)) σ̂² is
the MVUE of σ², where σ̂² = (1/n) Σ_{i=1}^n (Xi − θ̂)². If we do not require the estimators to be
unbiased, can they achieve better risks? How about the MLEs? We have

       (θ̂, σ̂) = argmin_{(θ,σ)} Σ_{i=1}^n (Xi − θ)²/(2σ²) + (n/2) log(2πσ²),

so the MLEs are θ̂ = (1/n) Σ_{i=1}^n Xi and σ̂² = (1/n) Σ_{i=1}^n (Xi − θ̂)². In particular, σ̂² is a
biased estimator of σ². One can check

       E(cσ̂² − σ²)² = σ⁴ (c² (n² − 1)/n² − 2c (n − 1)/n + 1),

so we have
       E(σ̂² − σ²)² < E((n/(n − 1)) σ̂² − σ²)².

In fact, (n/(n + 1)) σ̂² is an even better choice in terms of minimizing the risk.
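The comparison of the three scalings of σ̂² can be made concrete by evaluating the closed-form MSE above (a direct transcription of the displayed formula, in units of σ⁴; the function name is ours):

```python
def mse_scaled_var(c, n):
    # E(c*sigma_hat^2 - sigma^2)^2 in units of sigma^4, from the displayed formula
    return c ** 2 * (n ** 2 - 1) / n ** 2 - 2 * c * (n - 1) / n + 1

n = 10
mse_mle = mse_scaled_var(1.0, n)                # c = 1: the MLE sigma_hat^2
mse_unbiased = mse_scaled_var(n / (n - 1), n)   # c = n/(n-1): unbiased estimator
mse_best = mse_scaled_var(n / (n + 1), n)       # c = n/(n+1): risk minimizer
```

For n = 10 this gives 0.19 for the MLE, 2/9 ≈ 0.222 for the unbiased estimator, and 2/11 ≈ 0.182 for c = n/(n + 1), confirming the stated ordering.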

2.2 Bayesian Cramér–Rao, a.k.a. van Trees inequality


We establish a Bayesian version of the Cramér–Rao bound, which is also known as the van Trees
inequality. In fact, this bound holds for any estimator, rather than an unbiased estimator as in the
case of the Cramér–Rao bound. For simplicity, let us focus on the univariate case. See [GL95] for
a multivariate version of the theorem below. As before, we assume mild regularity conditions so
that we can differentiate under the integral sign.

Theorem 2.2 (Bayesian Cramér–Rao, van Trees). Let π be a distribution on Θ := (a, b) ⊂ R with
density p such that p(θ) → 0 as θ → a or b. Consider X ∼ Pθ where θ ∼ π. Suppose that f (x | θ)
is bounded. Define

       I(θ) = Eθ [((∂/∂θ) log f (X | θ))²]  and  I(π) = E[((∂/∂θ) log p(θ))²].

For any estimator ĝ(X) of a differentiable estimand g(θ), we have

       E[(ĝ(X) − g(θ))²] ≥ (E[g′(θ)])² / (E[I(θ)] + I(π)).
Proof. By the Cauchy–Schwarz inequality, we obtain

       E[(ĝ(X) − g(θ))²] · E[((∂/∂θ) log(f (X | θ) p(θ)))²] ≥ (E[(ĝ(X) − g(θ)) · (∂/∂θ) log(f (X | θ) p(θ))])².

Let us first compute the right-hand side. Since p(θ) → 0 as θ → a or b, we have

       ∫ (∂/∂θ)(f (x | θ) p(θ)) dθ = 0

and
       ∫ g(θ) · (∂/∂θ)(f (x | θ) p(θ)) dθ = − ∫ g′(θ) · f (x | θ) p(θ) dθ.

Consequently,

       E[(ĝ(X) − g(θ)) · (∂/∂θ) log(f (X | θ) p(θ))]
         = ∫∫ (ĝ(x) − g(θ)) · (∂/∂θ) log(f (x | θ) p(θ)) · f (x | θ) p(θ) dθ dµ(x)
         = ∫∫ (ĝ(x) − g(θ)) · (∂/∂θ)(f (x | θ) p(θ)) dθ dµ(x)
         = ∫∫ g′(θ) f (x | θ) p(θ) dθ dµ(x) = E[g′(θ)].
Moreover, for the left-hand side of the Cauchy–Schwarz inequality, we have

       E[((∂/∂θ) log(f (X | θ) p(θ)))²]
         = E[((∂/∂θ) log f (X | θ))²] + E[((∂/∂θ) log p(θ))²] + 2 E[(∂/∂θ) log f (X | θ) · (∂/∂θ) log p(θ)]
         = E[I(θ)] + I(π) + 0,

where we used E[(∂/∂θ) log f (X | θ) | θ] = 0. Combining everything completes the proof.
For example, for X ∼ N(θ, 1) where θ ∼ N(0, σ²), we have shown that the estimator θ̂ = σ²X/(1 + σ²)
of θ achieves the Bayes risk σ²/(1 + σ²) under the squared loss. On the other hand, we have

       g′(θ) = 1,  I(θ) = 1,  I(π) = 1/σ²,

so that
       (E[g′(θ)])² / (E[I(θ)] + I(π)) = 1/(1 + 1/σ²) = σ²/(1 + σ²).

Therefore, the Bayes risk achieved by the estimator θ̂ matches the Bayesian Cramér–Rao lower
bound, so θ̂ is optimal in this sense.
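One way to see this tightness concretely (our sketch, not from the notes): among all linear estimators cX, the Bayes risk is c² + (1 − c)²σ², which is minimized exactly at the van Trees bound σ²/(1 + σ²). The code below scans c over a grid and checks that no linear rule beats the bound.

```python
sigma2 = 2.0
vt_bound = 1.0 / (1.0 + 1.0 / sigma2)   # van Trees: sigma^2/(1 + sigma^2) = 2/3

def bayes_risk_linear(c, sigma2):
    # Bayes risk of c*X when X ~ N(theta, 1), theta ~ N(0, sigma2):
    # E(cX - theta)^2 = c^2 * E(X - theta)^2 + (1 - c)^2 * E[theta^2]
    return c ** 2 + (1 - c) ** 2 * sigma2

min_risk = min(bayes_risk_linear(c / 1000, sigma2) for c in range(0, 1001))
```

The minimum over the grid sits at c ≈ σ²/(1 + σ²), matching the bound up to grid resolution.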

2.3 Empirical Bayes and the James–Stein estimator


2.3.1 The empirical Bayes approach
Suppose that we observe i.i.d. X1 , . . . , Xn ∼ f (x | θ) where θ ∼ p(θ | γ). If γ is known, we can
estimate θ based on the data and the prior (for example, by the posterior mean). If γ is unknown,
what we can do is to first estimate the hyperparameter γ from data to obtain an approximate prior,
and then use it to estimate θ. Such an approach is called empirical Bayes. For example, we may
do the following:
• Note that the marginal density of X = (X1 , . . . , Xn ) is

       f (x | γ) = ∫ ∏_{i=1}^n f (xi | θ) p(θ | γ) dθ.

We can estimate γ by, for example, the MLE

       γ̂(X) := argmax_{γ̃} f (X | γ̃).

• For a loss function L(g, ĝ), we then consider the estimator ĝ(X) that minimizes the empirical
posterior loss
       ĝ(X) := argmin_{g̃} ∫ L(g(θ), g̃(X)) p(θ | X, γ̂(X)) dθ.

Compare this to the posterior loss

       E[L(g(θ), g̃(X)) | X] = ∫ L(g(θ), g̃(X)) p(θ | X, γ) dθ,

which we would minimize if γ were known.
In the original Bayesian framework, γ is assumed to be known, while in hierarchical Bayes, γ is
assumed to follow a known distribution. Here in the empirical Bayes approach, γ is estimated from
data. Moreover, we remark that γ̂ and ĝ can be replaced by other estimators.

2.3.2 James–Stein estimator and its variant

Consider X ∼ N(θ, In ), where θ ∼ N(0, σ²In ) and n ≥ 3. We would like to estimate θ ∈ R^n under
the squared loss
       L(θ, θ̂) = ‖θ − θ̂‖₂² = Σ_{i=1}^n (θi − θ̂i )².

Bayes estimator and its risk We start with the case where σ² is known. Similar to the
univariate case, the Bayes estimator, denoted by θ̂B (X), is again the posterior mean:

       θ̂B (X) = E[θ | X, σ²] = ∫ θ · p(θ | X, σ²) dθ = ∫ θ · (f (X | θ, σ²) p(θ | σ²)/f (X | σ²)) dθ.

The density of X marginalized over θ is

       f (X | σ²) = ∫ f (X | θ, σ²) p(θ | σ²) dθ
                  = ∫ (1/(2π)^{n/2}) e^{−‖X−θ‖₂²/2} · (1/(2πσ²)^{n/2}) e^{−‖θ‖₂²/(2σ²)} dθ
                  = (1/(2π(1 + σ²))^{n/2}) e^{−‖X‖₂²/(2(1+σ²))},                        (2.1)

which is simply the density of N(0, (1 + σ²)In ). Similarly, we can compute

       ∫ θ · f (X | θ, σ²) p(θ | σ²) dθ = (1/(2π(1 + σ²))^{n/2}) e^{−‖X‖₂²/(2(1+σ²))} · (σ²/(1 + σ²)) X.

Hence the Bayes estimator is

       θ̂B (X) = (σ²/(1 + σ²)) X = (1 − 1/(1 + σ²)) X,                                  (2.2)

with Bayes risk

       Rπ (θ̂B ) = ∫∫ ‖θ̂B (x) − θ‖₂² · f (x | θ, σ²) p(θ | σ²) dx dθ = nσ²/(1 + σ²).
Empirical Bayes Let us now consider the empirical Bayes approach to the case where σ² is
unknown. Note that the marginal distribution given by (2.1) is N(0, (1 + σ²)In ). To choose an
estimator of σ², we require the associated estimator of 1/(1 + σ²) to be unbiased, in view of (2.2).
A basic fact is that, if Y follows the chi-squared distribution with n degrees of freedom, then
E[1/Y ] = 1/(n − 2). Therefore, if we let τ̂ (X) := (n − 2)/‖X‖₂², then

       E[τ̂ (X) | σ²] = 1/(1 + σ²).

The associated empirical Bayes estimator is therefore

       θ̂JS (X) = (1 − (n − 2)/‖X‖₂²) X,

which is called the James–Stein estimator.

Risk of the James–Stein estimator We now compute the risk of the James–Stein estimator:

       R(θ, θ̂JS ) = E ‖(1 − (n − 2)/‖X‖₂²) X − θ‖₂²
                   = E ‖X − θ‖₂² + E[(n − 2)²/‖X‖₂²] − 2 E[((n − 2)/‖X‖₂²) X⊤(X − θ)],   (2.3)

where the expectation is with respect to X ∼ N(θ, In ). Conditional on all Xj for j ≠ i, Stein's
lemma applied to Xi with g(Xi ) = (n − 2)Xi /‖X‖₂² (see Lemma 1.2 and (1.1)) yields that

       E_{Xi∼N(θi,1)} [((n − 2)/‖X‖₂²) Xi (Xi − θi )] = E_{Xi∼N(θi,1)} [(∂/∂Xi )((n − 2)Xi /‖X‖₂²)]
                                                      = E_{Xi∼N(θi,1)} [(n − 2)/‖X‖₂² − 2(n − 2)Xi²/‖X‖₂⁴].

Summing the above equation over i and taking the expectation with respect to X ∼ N(θ, In ), we
obtain

       E[((n − 2)/‖X‖₂²) X⊤(X − θ)] = E[n(n − 2)/‖X‖₂²] − E[2(n − 2)/‖X‖₂²] = E[(n − 2)²/‖X‖₂²].

Plugging this together with E ‖X − θ‖₂² = n into (2.3), we conclude that

       R(θ, θ̂JS ) = n − E[(n − 2)²/‖X‖₂²].

Furthermore, the Bayes risk of the James–Stein estimator is

       Rπ (θ̂JS ) = Eθ∼N(0,σ²In ) R(θ, θ̂JS )
                 = ∫∫ (n − (n − 2)²/‖x‖₂²) f (x | θ, σ²) p(θ | σ²) dx dθ
                 = ∫ (n − (n − 2)²/‖x‖₂²) f (x | σ²) dx
                 = n − (n − 2)/(1 + σ²) = nσ²/(1 + σ²) + 2/(1 + σ²) = Rπ (θ̂B ) + 2/(1 + σ²).

The relative increase of risk is

       (Rπ (θ̂JS ) − Rπ (θ̂B ))/Rπ (θ̂B ) = 2/(nσ²),

which is small if σ² is fixed and n is large.
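A quick Monte Carlo illustration of the risk formula R(θ, θ̂JS ) = n − E[(n − 2)²/‖X‖₂²] (our code, not from the notes): at θ = 0 we have ‖X‖₂² distributed as chi-squared with n degrees of freedom, so E[1/‖X‖₂²] = 1/(n − 2) and the risk equals exactly 2 for any n ≥ 3, far below the risk n of the estimator X.

```python
import random

rng = random.Random(2)
n = 10                      # dimension; n >= 3 is required
reps = 100_000
theta = [0.0] * n           # at theta = 0 the James-Stein risk equals exactly 2

total = 0.0
for _ in range(reps):
    x = [rng.gauss(t, 1.0) for t in theta]
    s = sum(v * v for v in x)
    shrink = 1.0 - (n - 2) / s            # James-Stein shrinkage factor
    total += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))

js_risk = total / reps      # theory: n - (n - 2) = 2, versus risk n for X itself
```

Moving θ away from 0 shrinks the gap, but the theorem guarantees the risk never exceeds n.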

Positive-part Stein estimator Instead, we can consider the maximum likelihood estimator
(MLE) of 1/(1 + σ²) based on the marginal density (2.1). Namely, we solve

       max_τ (τ/(2π))^{n/2} e^{−τ‖X‖₂²/2},

which yields τ̃ (X) = min{n/‖X‖₂², 1}. Therefore, the associated empirical Bayes estimator is

       θ̂PS (X) = (1 − min{n/‖X‖₂², 1}) X = (1 − n/‖X‖₂²)₊ X,

which is called a positive-part Stein estimator.

2.3.3 General results for exponential families

Theorem 2.3. Consider the exponential family with density

       f (x | η) = exp(η⊤T (x) − A(η)) h(x),

where x ∈ R^n and η ∈ R^m . Suppose that η ∼ p(η) for a prior density p. Let the marginal density
of X be f (x) = ∫ f (x | η) p(η) dη. Define a matrix D ∈ R^{n×m} by Di,j = ∂Tj /∂xi . Then

       E[Dη | x] = ∇ log f (x) − ∇ log h(x).

In particular, if T (x) = x, then we have D = I and

       E[η | x] = ∇ log f (x) − ∇ log h(x).


Moreover, under the squared loss L(η, η̂) = ‖η − η̂‖₂², the risk achieved by the Bayes estimator
E[η | x] is

       R(η, E[η | X]) = R(η, −∇ log h(X)) + Σ_{i=1}^m E[2 (∂²/∂Xi²) log f (X) + ((∂/∂Xi ) log f (X))²].

Proof. Lengthy but straightforward computation, which uses Stein's lemma. See [LC06], Chapter 4,
Theorem 3.2, Corollary 3.3, and Theorem 3.5.

Theorem 2.4. Consider X ∼ Pη from the exponential family with density

       f (x | η) = exp(η⊤x − A(η)) h(x),

where x, η ∈ R^m . Suppose that the prior is p(η | γ). Let γ̂ be the MLE of γ based on the marginal
f (x | γ) = ∫ f (x | η, γ) p(η | γ) dη. Then the empirical Bayes estimator under the squared loss is

       E[η | X, γ̂] = ∇ log f (X | γ̂(X)) − ∇ log h(X).

Proof. See Chapter 4, Theorem 6.3 of [LC06].
2.4 Minimax estimation

2.4.1 Definitions and examples

Consider X ∼ Pθ where θ ∈ Θ. For an estimator g̃ of g, the maximum risk is sup_{θ∈Θ} R(g(θ), g̃(X)).
The minimax risk is the minimum of the maximum risk:

       R∗ := inf_{g̃} sup_{θ∈Θ} R(g(θ), g̃(X)).

An estimator ĝ(X) of g(θ) is called a minimax estimator if it achieves the minimax risk:

       sup_{θ∈Θ} R(g(θ), ĝ(X)) = R∗.

Consider the minimum Bayes risk

       Rπ∗ := inf_{g̃} Rπ (g̃) = inf_{g̃} Eθ∼π R(g(θ), g̃(X)).

Proposition 2.5. We have

       R∗ = inf_{g̃} sup_{θ∈Θ} R(g(θ), g̃(X)) = inf_{g̃} sup_π Rπ (g̃) ≥ sup_π inf_{g̃} Rπ (g̃) = sup_π Rπ∗.

As an example, consider i.i.d. X1 , . . . , Xn ∼ N(µ, σ²) where σ² is known. We aim to estimate µ
under the squared loss L(µ, µ̂) = (µ − µ̂)².

• For µ̂(X) = X̄ = (1/n) Σ_{i=1}^n Xi , we have R(µ, µ̂) = E(X̄ − µ)² = σ²/n. Hence R∗ ≤ σ²/n.

• For µ ∼ N(0, τ²), the Bayes estimator, the posterior mean, is µ̃(X) = (nτ²/(σ² + nτ²)) X̄ with the
Bayes risk Rπ∗ = Rπ (µ̃) = σ²τ²/(σ² + nτ²). Since τ² ≥ 0 is arbitrary, Proposition 2.5 implies that
R∗ ≥ σ²/n.

Therefore, R∗ = σ²/n, and µ̂(X) = X̄ is a minimax estimator.
Lemma 2.6. Consider parameter spaces Θ ⊂ Θ0 . Let ĝ(X) be a minimax estimator of g(θ) over
Θ. If
sup R(g(θ), ĝ(X)) = sup R(g(θ), ĝ(X)),
θ∈Θ θ∈Θ0
then ĝ is also minimax over Θ0 .
Proof. If there exists another estimator g̃ with a smaller maximum risk over Θ0 than ĝ, then the
same is true over Θ, contradicting that ĝ is minimax over Θ.

Let us consider variants of the above Gaussian example:

• Consider i.i.d. X1 , . . . , Xn ∼ N(µ, σ²) where µ and σ² are unknown. We aim to estimate µ under
the squared loss. Assume that σ² ≤ M where M > 0, for otherwise the minimax risk is infinite.
The maximum risk of the estimator X̄ is

       sup_{(µ,σ²):σ²≤M} E(X̄ − µ)² = M/n.

Recall that X̄ is minimax over {(µ, σ²) : σ² = M } with minimax risk M/n. Hence it is minimax
over {(µ, σ²) : σ² ≤ M } in view of Lemma 2.6.
Since X̄ does not depend on M , we say that X̄ adapts to M and call X̄ an adaptive estimator.
• We now consider the loss L(µ, µ̂) = (µ − µ̂)²/σ², and impose no upper bound on σ² > 0. The
estimator X̄ has maximum risk

       sup_{(µ,σ²)} E[(µ − X̄)²/σ²] = 1/n.

Since X̄ is minimax over {(µ, σ²) : σ² = 1} with the same minimax risk 1/n, it is minimax over
{(µ, σ²) : σ² > 0} as well.

2.4.2 Some theoretical results


Theorem 2.7. Let ĝ be a Bayes estimator for the prior π. If

       sup_{θ∈Θ} R(g(θ), ĝ(X)) = Rπ (ĝ),

then the following holds:

1. ĝ is minimax with minimax risk R∗ = Rπ∗ = Rπ (ĝ);

2. If ĝ is the unique Bayes estimator for π, then it is the unique minimax estimator;

3. Rπ∗ ≥ Rπ′∗ for any other prior distribution π′ on Θ.

In particular, if π(Ω) = 1 where Ω := {θ ∈ Θ : R(g(θ), ĝ(X)) = sup_{θ′∈Θ} R(g(θ′), ĝ(X))}, then ĝ
is minimax.

Proof. For any estimator g̃, we have

       sup_θ R(g, g̃) ≥ Rπ (g̃) ≥ Rπ (ĝ) = sup_θ R(g, ĝ),

so ĝ is minimax.
Replacing the second ≥ with > gives uniqueness.
For any distribution π′ on Θ, let ĝ′ be the Bayes estimator for π′. Then

       Rπ′∗ = Rπ′ (ĝ′) ≤ Rπ′ (ĝ) ≤ sup_θ R(g, ĝ) = Rπ (ĝ) = Rπ∗.

In particular, if π(Ω) = 1, then Rπ (ĝ) = sup_θ R(g, ĝ).

Theorem 2.8. Let πn be a sequence of prior distributions on Θ such that the following limit exists:

       R := lim_{n→∞} Rπn∗.

If ĝ is an estimator for which

       sup_{θ∈Θ} R(g(θ), ĝ(X)) = R,

then we have:

1. ĝ is minimax with minimax risk R∗ = R;

2. lim_{n→∞} Rπn∗ ≥ Rπ∗ for any prior distribution π on Θ.

Proof. For any other estimator g̃, it holds that

       sup_θ R(g, g̃) ≥ Rπn (g̃) ≥ Rπn∗.

Taking the limit as n → ∞, we have

       sup_θ R(g, g̃) ≥ R = sup_{θ∈Θ} R(g, ĝ).

The second statement follows from Rπ∗ ≤ R∗ = R.

2.4.3 Efron–Morris estimator

Consider i.i.d. X1 , . . . , Xn ∼ N(µ, 1) where µ ∼ N(0, τ²). We aim to estimate µ under the squared
loss. The Bayes estimator is cX̄ with c = nτ²/(1 + nτ²). Its risk is

       E(cX̄ − µ)² = c² E(X̄ − µ)² + (1 − c)² µ² = (nτ⁴ + µ²)/(1 + nτ²)²,

and the maximum risk is infinite over µ ∈ R.
As a compromise, consider

       µ̂ := X̄ + M   if X̄ < −M/(1 − c),
       µ̂ := cX̄      if X̄ ∈ [−M/(1 − c), M/(1 − c)],
       µ̂ := X̄ − M   if X̄ > M/(1 − c),

for c ∈ (0, 1) and M > 0. Its risk is bounded.
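The piecewise rule above can be transcribed directly (the function name is ours; note that the three branches agree at the cutoffs ±M/(1 − c), so µ̂ is a continuous function of X̄):

```python
def efron_morris(xbar, c, M):
    # Limited-translation compromise between the Bayes rule c*xbar and xbar:
    # shrink in the middle, translate by M in the tails
    cutoff = M / (1.0 - c)
    if xbar < -cutoff:
        return xbar + M
    if xbar > cutoff:
        return xbar - M
    return c * xbar

c, M = 0.8, 1.0                      # cutoff = M/(1 - c) = 5
mid = efron_morris(3.0, c, M)        # inside [-5, 5]: shrink to 0.8 * 3 = 2.4
right = efron_morris(10.0, c, M)     # beyond 5: translate down to 9.0
left = efron_morris(-10.0, c, M)     # below -5: translate up to -9.0
edge = efron_morris(5.0, c, M)       # both branches give 4.0 at the cutoff
```

In the tails the estimation error is at most |X̄ − µ| + M, which is why the risk stays bounded, unlike the pure Bayes rule cX̄.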

2.5 Admissibility
An estimator ĝ is inadmissible if there exists an estimator g̃ that dominates ĝ, i.e.,

• R(g(θ), g̃(X)) ≤ R(g(θ), ĝ(X)) for all θ ∈ Θ, and

• R(g(θ), g̃(X)) < R(g(θ), ĝ(X)) for some θ ∈ Θ.

Otherwise, the estimator is called admissible.

2.5.1 Admissible estimators


Theorem 2.9. Any unique Bayes estimator is admissible.

Proof. If ĝ is the unique Bayes estimator of g with respect to prior π and is dominated by g̃, then

Eθ∼π R(θ, g̃) ≤ Eθ∼π R(θ, ĝ),


contradicting uniqueness.

Theorem 2.10. Any unique minimax estimator is admissible.

Proof. If a minimax estimator is inadmissible, then another estimator dominates it and thus is also
minimax, contradicting uniqueness.

Theorem 2.11. If an estimator has constant risk and is admissible, then it is minimax.

Proof. If an estimator with constant risk is not minimax, then another estimator has smaller
maximum risk and thus uniformly smaller risk.

Theorem 2.12. Suppose that L(g, ·) is a strictly convex loss, and ĝ(X) is an admissible estimator
of g(θ). If g̃(X) is another estimator of g(θ) such that R(g, ĝ) = R(g, g̃) at all θ, then ĝ = g̃ with
probability 1.

Proof. Define ḡ = (ĝ + g̃)/2. Then

       L(g, ḡ) < (L(g, ĝ) + L(g, g̃))/2

wherever ĝ ≠ g̃. If this happens with nonzero probability, then

       R(g, ḡ) < (R(g, ĝ) + R(g, g̃))/2 = R(g, ĝ),

contradicting the admissibility of ĝ.

In Gaussian mean estimation, is the minimax estimator X̄ admissible?

Proposition 2.13. Given i.i.d. X1 , . . . , Xn ∼ N(µ, σ²) where σ² is known, X̄ is an admissible
estimator and is the unique minimax estimator of µ under the squared loss.

Proof. Consider any estimator µ̂ such that

       R(µ, µ̂) ≤ σ²/n = R(µ, X̄).

By the bias-variance decomposition, we have

       R(µ, µ̂) = Varµ (µ̂) + b(µ)²,

where b(µ) = Eµ [µ̂] − µ. Then the Cramér–Rao bound gives

       R(µ, µ̂) ≥ (1 + b′(µ))²/I(µ) + b(µ)² = σ²(1 + b′(µ))²/n + b(µ)²,          (2.4)

where I(µ) = − E[(∂²/∂µ²) log f (X | µ)] = n/σ². Hence we obtain

       (1 + b′(µ))²/n + b(µ)²/σ² ≤ R(µ, µ̂)/σ² ≤ 1/n.                             (2.5)

We claim that µ̂ is unbiased, i.e., b(µ) ≡ 0:

1. The bias b(µ) is clearly bounded.

2. We have (1 + b′(µ))² = 1 + 2b′(µ) + b′(µ)² ≤ 1, so b′(µ) ≤ 0, i.e., b(µ) is nonincreasing.

3. If b′(µ) < −ε for a fixed ε > 0 as µ → ±∞, then b(µ) cannot be bounded. Hence there is a
sequence µi → ±∞ such that b′(µi ) → 0.

4. By (2.5), b(µi ) → 0 as µi → ±∞. Since b(µ) is nonincreasing, b(µ) ≡ 0.

Hence we also have b′(µ) ≡ 0, and (2.4) implies that

       R(µ, µ̂) ≥ σ²/n.

We conclude that R(µ, µ̂) = R(µ, X̄), and thus X̄ is admissible. Moreover, X̄ is the only minimax
estimator by the above two theorems.

2.5.2 Inadmissible estimators

The estimator X̄ is no longer admissible in the truncated case.

Proposition 2.14. Consider g(θ) in a fixed interval [a, b]. Suppose the loss function L(g, ĝ) is zero
at ĝ = g and strictly increasing as ĝ moves away from g. If ĝ(X) takes values outside [a, b] with
positive probability, then ĝ is inadmissible.

Proof. Define an estimator g̃ by g̃ = ĝ if ĝ ∈ [a, b], g̃ = a if ĝ < a, and g̃ = b if ĝ > b. Then g̃
dominates ĝ.

Consider i.i.d. X1 , . . . , Xn ∼ N(µ, 1) where µ ≥ a for a fixed a ∈ R. Then the above proposition
implies that X̄ is not admissible. However, X̄ is still minimax. To see this, suppose that X̄ is not
minimax. Let µ̂ be an estimator such that

       R(µ, µ̂) ≤ 1/n − ε

for all µ ≥ a and a fixed ε > 0. Hence the Cramér–Rao lower bound for biased estimators shows
that
       (1 + b′(µ))²/n + b(µ)² ≤ 1/n − ε,

where b(µ) = E[µ̂] − µ. Consequently, b(µ) is bounded, and b′(µ) ≤ √(1 − εn) − 1 ≤ −εn/2 for all
µ ≥ a, giving a contradiction.
If, in addition, µ ∈ [a, b], then X̄ is neither admissible nor minimax. One can show that
max{a, min{X̄, b}} has a uniformly smaller risk.
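The final claim is easy to check by simulation (our code, not from the notes): truncating X̄ to [a, b] moves it weakly closer to µ ∈ [a, b] on every sample, so its risk is uniformly smaller.

```python
import random

rng = random.Random(3)
a, b = 0.0, 1.0
n, reps = 5, 200_000
mu = 0.1                    # true mean, inside [a, b] and near the boundary

se_mean = se_trunc = 0.0
for _ in range(reps):
    xbar = rng.gauss(mu, (1.0 / n) ** 0.5)   # X-bar ~ N(mu, 1/n)
    se_mean += (xbar - mu) ** 2
    t = max(a, min(xbar, b))                 # truncate X-bar to [a, b]
    se_trunc += (t - mu) ** 2

risk_mean = se_mean / reps     # theory: 1/n = 0.2
risk_trunc = se_trunc / reps   # strictly smaller: truncation never hurts here
```

The improvement is largest when µ sits near a boundary of [a, b], since a large fraction of samples get clipped back toward µ.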

2.6 Shrinkage estimators and Stein's effect

2.6.1 Gaussian estimation

Proposition 2.15. Consider X ∼ N(µ, Σ) where µ ∈ R^d is to be estimated and Σ ∈ R^{d×d} is
known. Define a class of estimators

       µ̂(X) := (1 − h(‖X‖₂²)/‖X‖₂²) X,

where h is a real-valued function. If h is nondecreasing and 0 < h(·) ≤ 2 tr(Σ) − 4 λmax (Σ), then
under the loss L(µ, µ̂) = ‖µ − µ̂‖₂², the estimator µ̂ dominates X. In particular, X is inadmissible
and µ̂ is minimax.

Proof. For the last statement, it can be shown that X is minimax in a way similar to the one-
dimensional case. Hence, µ̂ is also minimax. It remains to prove that µ̂ dominates X.
First, note that the risk of X is

       R(µ, X) = E ‖µ − X‖₂² = E[(X − µ)⊤(X − µ)] = E tr((X − µ)⊤(X − µ))
               = E tr((X − µ)(X − µ)⊤) = tr(E[(X − µ)(X − µ)⊤]) = tr(Σ).

The risk of µ̂ is

       R(µ, µ̂) = E ‖µ − µ̂‖₂² = E[(µ̂ − µ)⊤(µ̂ − µ)]
               = E[(X − µ)⊤(X − µ)] − 2 E[(h(‖X‖₂²)/‖X‖₂²) X⊤(X − µ)] + E[h(‖X‖₂²)²/‖X‖₂²].

Let us write Y = Σ^{−1/2} X ∼ N(η, Id ) where η := Σ^{−1/2} µ. Then ‖X‖₂² = X⊤X = Y⊤ΣY , so

       E[(h(‖X‖₂²)/‖X‖₂²) X⊤(X − µ)] = E[(h(Y⊤ΣY )/(Y⊤ΣY )) Y⊤Σ(Y − η)]
                                      = Σ_{i=1}^d E[(h(Y⊤ΣY )/(Y⊤ΣY )) Σ_{j=1}^d Yj Σj,i (Yi − ηi )].

Conditioning on {Yj }_{j≠i} and applying Stein's lemma E[g(Yi )(Yi − ηi )] = E[g′(Yi )] for Yi ∼ N(ηi , 1),
we obtain that each summand above is equal to

       E[(∂/∂Yi )((h(Y⊤ΣY )/(Y⊤ΣY )) Σ_{j=1}^d Yj Σj,i )]
         = E[(h(Y⊤ΣY )/(Y⊤ΣY )) Σi,i
             + (2[h′(Y⊤ΣY ) Y⊤ΣY − h(Y⊤ΣY )]/(Y⊤ΣY )²) (Σ_{k=1}^d Σi,k Yk ) Σ_{j=1}^d Yj Σj,i ],

where we used the fact that

       (∂/∂Yi )(Y⊤ΣY ) = (∂/∂Yi ) Σ_{j,k=1}^d Yj Σj,k Yk = 2 Σ_{k=1}^d Σi,k Yk .

Summing over i yields

       E[(h(‖X‖₂²)/‖X‖₂²) X⊤(X − µ)]
         = tr(Σ) E[h(‖X‖₂²)/‖X‖₂²] + 2 E[((h′(‖X‖₂²)‖X‖₂² − h(‖X‖₂²))/‖X‖₂⁴) X⊤ΣX].

Combining everything, we conclude that

       R(µ, µ̂) = tr(Σ) + E[(h(‖X‖₂²)/‖X‖₂²)(h(‖X‖₂²) − 2 tr(Σ) + 4 X⊤ΣX/‖X‖₂²)]
                 − 4 E[(h′(‖X‖₂²)/‖X‖₂²) X⊤ΣX],

which is smaller than tr(Σ) by the assumptions on h.

Note that for d ≥ 3 and Σ = Id , there exists a function h satisfying the assumptions, making
X inadmissible. This counterintuitive result is known as Stein’s example or Stein’s effect.

2.6.2 Poisson estimation
Lemma 2.16. For a random variable X and functions f and g, suppose that E[f (X)], E[g(X)]
and E[f (X)g(X)] all exist. If f and g are both nondecreasing, then
E[f (X)g(X)] ≥ E[f (X)] · E[g(X)].
Moreover, if f and g are strictly increasing and X is not constant, then the above inequality is
strict.
Proof. Let Y be an independent copy of X. Then

       f (X)g(X) + f (Y )g(Y ) − f (X)g(Y ) − f (Y )g(X) = (f (X) − f (Y ))(g(X) − g(Y )) ≥ 0.

Taking the expectation yields the desired inequality.

Proposition 2.17. Consider independent Poisson random variables Xi ∼ Poi(λi ) for i ∈ [d] where
d ≥ 2. Let λ = (λ1 , . . . , λd ) ∈ (0, ∞)^d . Define the class of estimators

       λ̂(X) := (1 − h(Σ_{i=1}^d Xi )/(Σ_{i=1}^d Xi + b)) X,

where h is a real-valued function and b > 0. Consider the loss

       L(λ, λ̂) := Σ_{i=1}^d (λi − λ̂i )²/λi .

If h is nondecreasing, 0 < h(·) ≤ 2(d − 1), and b ≥ d − 1, then the estimator λ̂ dominates X.
P
Proof. Let Z = i Xi . Then we have
d
hX 1 h(Z) 2 i
R(λ, λ̂) = E Xi − Xi − λi
λi Z +b
i=1
d
h h(Z) X h h(Z)2 X d
1 i 1 2i
= d − 2E Xi (Xi − λi ) + E X .
Z +b λi (Z + b)2 λi i
i=1 i=1

Xi | Z is multinomial with
It is known that the conditional distribution P E[Xi | Z] = Zλi /Λ and
Var(Xi | Z) = Z(λi /Λ)(1 − λi /Λ) where Λ := di=1 λi . Therefore,
d d
hX 1 2 i X 1 λi  λi Z
E Xi | Z = Z 1− + Z 2 2 = (d − 1 + Z),
λi Λ Λ Λ Λ
i=1 i=1
d
hX 1 i Z Z
E Xi (Xi − λi ) | Z = (d − 1 + Z) − Z = (d − 1 + Z − Λ).
λi Λ Λ
i=1

We obtain
h h(Z)Z  h(Z) i
R(λ, λ̂) = d + E (d − 1 + Z) − 2(d − 1) + 2(Λ − Z)
(Z + b)Λ Z + b
h h(Z)Z i
≤ d + 2E (Λ − Z)
(Z + b)Λ

36
by the assumptions b ≥ d − 1 and h(·) ≤ 2(d − 1). By Lemma 2.16,
h h(Z)Z i h h(Z)Z i
E (Λ − Z) < E · E[Λ − Z] = 0,
(Z + b)Λ (Z + b)Λ

which completes the proof.
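A Monte Carlo illustration of Proposition 2.17 (our code; the constant choice h ≡ d − 1, b = d − 1 satisfies the assumptions): under the weighted loss, the risk of X is exactly d, while the shrinkage estimator does strictly better.

```python
import math
import random

def poisson_sample(lam, rng):
    # Knuth's multiplication method for sampling Poi(lam)
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(4)
d, reps = 5, 100_000
lam = [1.0] * d
h, b = d - 1, d - 1         # constant h <= 2(d - 1) and b >= d - 1

loss_x = loss_shrink = 0.0
for _ in range(reps):
    x = [poisson_sample(l, rng) for l in lam]
    z = sum(x)
    factor = 1.0 - h / (z + b)
    loss_x += sum((xi - li) ** 2 / li for xi, li in zip(x, lam))
    loss_shrink += sum((factor * xi - li) ** 2 / li for xi, li in zip(x, lam))

risk_x = loss_x / reps            # equals d = 5 in expectation
risk_shrink = loss_shrink / reps  # strictly smaller by Proposition 2.17
```

This is the Poisson analogue of Stein's effect: simultaneous estimation of d ≥ 2 independent means admits a uniform improvement over the componentwise MLE.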

Chapter 3

Asymptotic estimation

In this chapter, we are interested in the scenario where the sample size n goes to infinity. Sections 3.1
to 3.5 establish the general theory. Topics of Sections 3.6 and 3.7 are not directly about asymptotic
estimation but are related.

3.1 Convergence of random variables


3.1.1 Convergence in probability
Definition 3.1. A sequence of random variables {Xn }_{n=1}^∞ converges to a random variable X in
probability, denoted by Xn →p X, if for every ε > 0,

       P{|Xn − X| ≥ ε} → 0  as n → ∞.

Definition 3.2. Given i.i.d. observations X1 , . . . , Xn ∼ Pθ where θ ∈ Θ, let us consider the
estimator ĝn = ĝn (X1 , . . . , Xn ) of g(θ). A sequence of estimators {ĝn }_{n=1}^∞ is consistent if for
every θ ∈ Θ, ĝn →p g(θ) with respect to Pθ .

Theorem 3.3. Let {ĝn }_{n=1}^∞ be a sequence of estimators of g(θ) ∈ R, and consider the mean
squared error (MSE) E(ĝn − g(θ))².

• If E(ĝn − g(θ))² → 0 as n → ∞ for all θ ∈ Θ, then ĝn is consistent for estimating g(θ).

• As a result, if the bias and variance of ĝn both converge to zero for all θ ∈ Θ, then ĝn is a
consistent estimator. In particular, any unbiased estimator with variance converging to zero is
consistent.

Proof. This is a result of Chebyshev's inequality Pθ {|ĝn − g(θ)| ≥ ε} ≤ (1/ε²) E(ĝn − g(θ))².
Some remarks about convergence in probability and consistency:

• Convergence in probability is preserved under sum, product, and continuous mapping.

• (Weak law of large numbers) Let X1 , . . . , Xn be i.i.d. with finite mean µ and variance σ². Since
X̄ has variance σ²/n, it is consistent for estimating µ by the above theorem, i.e., X̄ →p µ.
• In the above setting, consider the unbiased estimator of the variance,

       Sn² := (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

To see that this estimator is consistent, it suffices to note that

       Sn² = (n/(n − 1)) ((1/n) Σ_{i=1}^n (Xi − µ)² − (X̄ − µ)²),

which converges to σ² in probability.


Asymptotically, “optimal” estimators are typically not unique.

Theorem 3.4. If X1 , . . . , Xn are i.i.d. with expectation µ, and g is a function that is continuous
at µ, then g(X̄) →p g(µ). In particular, if g is continuous, then the plug-in estimator g(X̄) is
consistent for estimating g(µ).

Proof. Since X̄ →p µ as above, this follows from the continuous mapping theorem.

• A sequence of estimators ĝn of g(θ) is asymptotically unbiased (or unbiased in the limit) if
E[ĝn ] → g(θ).
• Instead of consistency, sometimes we are interested in finer results—rates of convergence. Namely,
we may aim to establish
P{|ĝn − g(θ)| ≤ rn } ≥ 1 − δn ,
where rn and δn both go to zero as n → ∞. This is clearly stronger than convergence in
probability, and the quantity rn is an upper bound on the rate of convergence.

3.1.2 Convergence in distribution


Definition 3.5. Let {Xn }_{n=1}^∞ be a sequence of random variables with CDFs Fn (t) = P{Xn ≤ t}.
Suppose that there exists a random variable X with CDF F (t) such that Fn (t) → F (t) for all t at
which F is continuous. Then we say that Xn converges to X in distribution or in law, denoted by
Xn →d X, and that Fn converges to F weakly.

• Convergence in distribution is preserved under continuous mapping, but not necessarily under
sum or product.
• We have Xn →d X if and only if E f (Xn ) → E f (X) for every bounded continuous real-valued
function f .

• (Central limit theorem) Let X1 , . . . , Xn be i.i.d. with mean µ and variance σ². Then √n(X̄ − µ)/σ
converges in distribution to the standard Gaussian N(0, 1).

Some properties about the two types of convergence:

• If Xn →p X, then Xn →d X.

• If Xn →d x for a constant x, then Xn →p x.

• If Xn →d X, An →d a for a constant a, and Bn →d b for a constant b, then An + Bn Xn →d a + bX.

• If Xn →d X, yn → y where {yn } is a sequence of real numbers, and X has CDF F (t) which is
continuous at t = y, then we have P{Xn ≤ yn } → P{X ≤ y} = F (y).

Theorem 3.6 (Delta method). Suppose that a real-valued function g on Θ has a nonzero derivative g′(θ) at θ. If √n(Tn − θ) →ᵈ N(0, σ²), then

√n( g(Tn ) − g(θ) ) →ᵈ N( 0, (g′(θ))² σ² ).

If g′(θ) = 0 and g″(θ) ≠ 0, then

n( g(Tn ) − g(θ) ) →ᵈ (1/2) σ² g″(θ) χ²₁ ,

where χ²₁ is the chi-squared distribution with 1 degree of freedom.

Proof. Consider the first-order Taylor expansion of g(Tn ) around g(θ):

g(Tn ) = g(θ) + g′(θ)(Tn − θ) + Rn (Tn − θ),

where Rn → 0 as Tn → θ. The result then follows from the above properties of convergence.
If g′(θ) = 0, consider the second-order Taylor expansion:

g(Tn ) = g(θ) + (1/2) g″(θ)(Tn − θ)² + (1/2) Rn (Tn − θ)²,

where Rn → 0 as Tn → θ. Note that n(Tn − θ)² →ᵈ σ² χ²₁ , so the same reasoning finishes the proof.

Example: Bernoulli variance estimation Let X1 , . . . , Xn be i.i.d. Ber(p) random variables. Then the central limit theorem implies √n(X̄ − p) →ᵈ N(0, p(1 − p)).
Consider estimating g(p) = p(1 − p) by X̄(1 − X̄). Since g′(p) = 1 − 2p ≠ 0 for p ≠ 1/2, it follows from the Delta method that

√n( X̄(1 − X̄) − p(1 − p) ) →ᵈ N( 0, (1 − 2p)² p(1 − p) ).

For p = 1/2, we have g′(1/2) = 0 and g″(1/2) = −2. Hence the Delta method implies

n( X̄(1 − X̄) − 1/4 ) →ᵈ −(1/4) χ²₁ .
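The limits above can be checked numerically. The following simulation sketch (not from the notes; the seed, the sample size n = 2000, and the 5000 repetitions are arbitrary choices) compares the empirical variance of √n(X̄(1 − X̄) − p(1 − p)) with the variance predicted by the delta method:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 2000, 0.3, 5000

# Draw reps independent samples of size n and form sqrt(n) * (g(Xbar) - g(p))
# for g(p) = p * (1 - p).
xbar = rng.binomial(n, p, size=reps) / n
stat = np.sqrt(n) * (xbar * (1 - xbar) - p * (1 - p))

# The delta method predicts variance (1 - 2p)^2 * p * (1 - p).
predicted_var = (1 - 2 * p) ** 2 * p * (1 - p)
empirical_var = stat.var()
```

For p close to 1/2 the normal approximation degrades, consistent with the chi-squared limit in the second case of Theorem 3.6.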

3.2 Asymptotic efficiency

Definition 3.7. Consider i.i.d. X1 , . . . , Xn ∼ Pθ and an estimator ĝn (X1 , . . . , Xn ) of g(θ) ∈ R. We say that ĝn is asymptotically normal if

√n( ĝn − g(θ) ) →ᵈ N( 0, v(θ) ) for some v(θ) > 0.

The quantity v(θ) is called the asymptotic variance of ĝn .

Definition 3.8. A sequence of estimators {ĝn}_{n=1}^∞ is called asymptotically efficient if

√n( ĝn − g(θ) ) →ᵈ N( 0, (g′(θ))²/I(θ) ),

where I(θ) is the Fisher information each Xi contains about θ.

If ĝn is unbiased, the Cramér–Rao bound says that

Varθ (ĝn ) ≥ (g′(θ))²/(n I(θ)).

When do we have v(θ) ≥ (g′(θ))²/I(θ)?

A sufficient condition If ĝn is unbiased and Var(√n ĝn ) → v(θ), then

v(θ) = lim_{n→∞} Var(√n ĝn ) ≥ (g′(θ))²/I(θ).

Counterexample: superefficient estimator For i.i.d. X1 , . . . , Xn ∼ N(θ, 1) and g(θ) = θ, we have seen that I(θ) = 1 (where I(θ) denotes the Fisher information of a single observation). Is it always true that v(θ) ≥ 1? No.
Consider the sequence of estimators

θ̂n = X̄ if |X̄| ≥ n^{−1/4} , and θ̂n = aX̄ if |X̄| < n^{−1/4} ,

where a ∈ (0, 1). Therefore, we have

√n(θ̂n − θ) = √n(X̄ − θ) if |√n X̄| ≥ n^{1/4} , and √n(θ̂n − θ) = a√n X̄ − √n θ if |√n X̄| < n^{1/4} .

Let Zn = √n(X̄ − θ) ∼ N(0, 1). Then

√n(θ̂n − θ) = Zn if |Zn + √n θ| ≥ n^{1/4} , and √n(θ̂n − θ) = a(Zn + √n θ) − √n θ if |Zn + √n θ| < n^{1/4} .

Consequently, if θ ≠ 0, then we have

P{√n(θ̂n − θ) ≤ c} − P{Zn ≤ c} → 0.

That is, √n(θ̂n − θ) →ᵈ N(0, 1). On the other hand, if θ = 0, then

P{√n(θ̂n − θ) ≤ c} − P{aZn ≤ c} → 0.

That is, √n(θ̂n − θ) →ᵈ N(0, a²). However, v(0) = a² < 1.
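A short simulation illustrating the superefficiency at θ = 0 (a hypothetical setup; the values n = 10⁴, a = 0.5, and the number of repetitions are arbitrary). At θ = 0 the normalized error behaves like aZ with Z ∼ N(0, 1), so its variance is a², while at θ ≠ 0 it behaves like Z:

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, reps = 10_000, 0.5, 4000

def hodges(xbar, n, a):
    # Shrink toward 0 when |Xbar| is below the threshold n^{-1/4}.
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, a * xbar)

# theta = 0: the estimator almost always shrinks, giving variance a^2 < 1.
xbar0 = rng.normal(0.0, 1.0, size=reps) / np.sqrt(n)
v0 = (np.sqrt(n) * hodges(xbar0, n, a)).var()

# theta = 1: the estimator almost always equals Xbar, giving variance 1.
theta = 1.0
xbar1 = theta + rng.normal(0.0, 1.0, size=reps) / np.sqrt(n)
v1 = (np.sqrt(n) * (hodges(xbar1, n, a) - theta)).var()
```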

General result Under some reasonable general conditions, we have v(θ) ≥ (g′(θ))²/I(θ) except on a set of measure zero. See Chapter 6, Theorem 2.6 of [LC06].

3.3 Asymptotic properties of maximum likelihood estimation


Throughout the section, we consider i.i.d. observations X1 , . . . , Xn ∼ Pθ∗ , where the distributions
{Pθ }θ∈Θ are distinct and have common support. Suppose that θ∗ is in the interior of Θ.
Recall that the likelihood (function) is L(θ | x) = ∏_{i=1}^n f (xi | θ), and thus the log-likelihood is ℓ(θ | x) = log L(θ | x) = ∑_{i=1}^n log f (xi | θ). Moreover, the MLE of θ is defined as θ̂ := argmax_{θ∈Θ} L(θ | x). The MLE of g(θ) is defined to be g(θ̂). We call ∂/∂θ ℓ(θ | x) = 0 the likelihood equation, solving which yields the MLE (if it is unique).

3.3.1 Asymptotic consistency


Theorem 3.9. We have that for any fixed θ ≠ θ∗ ,

Pθ∗ { L(θ∗ | X) > L(θ | X) } → 1

as n → ∞.

Proof. By the weak law of large numbers, we have

(1/n) ∑_{i=1}^n log( f (Xi | θ)/f (Xi | θ∗ ) ) →ᵖ Eθ∗ [ log( f (X | θ)/f (X | θ∗ ) ) ].

In addition, since log(·) is strictly concave, Jensen's inequality implies

Eθ∗ [ log( f (X | θ)/f (X | θ∗ ) ) ] < log Eθ∗ [ f (X | θ)/f (X | θ∗ ) ] = 0,

where the last equality holds because Eθ∗ [ f (X | θ)/f (X | θ∗ ) ] = ∫ f (x | θ) dµ(x) = 1. Therefore, it holds that

Pθ∗ { (1/n) ∑_{i=1}^n log( f (Xi | θ)/f (Xi | θ∗ ) ) < 0 } → 1 as n → ∞,

which is equivalent to what we need to prove.

Finite parameter space Let us consider a finite parameter space Θ. A sequence of estimators
θ̂n is consistent if and only if

Pθ∗ {θ̂n = θ∗ } → 1 for any θ∗ ∈ Θ.

We immediately obtain the following result.

Corollary 3.10. If Θ is finite, then the MLE exists, is unique with probability going to 1, and is
consistent.

Real parameter space We consider an open set of parameters Θ ⊂ R and use the shorthand ℓ(θ | x) = log L(θ | x).

Theorem 3.11. Suppose that f (x | θ) is differentiable with respect to θ ∈ Θ ⊂ R for all x. Then with probability going to 1, the likelihood equation

ℓ′(θ | X) = ∑_{i=1}^n ( ∂/∂θ f (Xi | θ) ) / f (Xi | θ) = 0

has a root θ̂n = θ̂n (X1 , . . . , Xn ), and that root satisfies θ̂n →ᵖ θ∗ .

Proof. For (θ∗ − ε, θ∗ + ε) ⊂ Θ, let

Sn := { ℓ(θ∗ | X) > ℓ(θ∗ − ε | X) and ℓ(θ∗ | X) > ℓ(θ∗ + ε | X) }.

By Theorem 3.9, Pθ∗ {Sn } → 1. On the event Sn , there exists θ̂n ∈ (θ∗ − ε, θ∗ + ε) at which ℓ′(θ̂n | X) = 0. Hence we have

Pθ∗ {|θ̂n − θ∗ | < ε} → 1.

Note that we can choose θ̂n to be the root closest to θ∗ so that it does not depend on ε.

However, Theorem 3.11 does not provide a practical way of choosing θ̂n in general, because θ∗
is unknown and we cannot choose the root closest to θ∗ .

Corollary 3.12. If the likelihood equation has a unique root for all x and n, then θ̂n is a consistent
estimator of θ. If, in addition, Θ = (a, b) where −∞ ≤ a < b ≤ ∞, then θ̂n is the MLE with
probability going to 1.

Proof. The first statement is immediate. To prove the second, suppose for contradiction that θ̂n fails to be the MLE with probability bounded away from zero. Since any maximizer in the interior of Θ must satisfy the likelihood equation, whose unique root is θ̂n , the likelihood would then have to approach its supremum as θ → a or θ → b; but this contradicts Theorem 3.9.

3.3.2 Asymptotic efficiency


We in fact have asymptotic efficiency for the sequence θ̂n in Theorem 3.11.

Theorem 3.13. Suppose that for all x and all θ ∈ (θ∗ − ε, θ∗ + ε),

| ∂³/∂θ³ log f (x | θ) | ≤ M (x) and Eθ∗ [M (X)] < ∞.

Then the sequence θ̂n in Theorem 3.11 satisfies

√n(θ̂n − θ∗ ) →ᵈ N( 0, 1/I(θ∗ ) ),

i.e., it is asymptotically efficient.

Proof. For any fixed x, the Taylor expansion of ℓ′(θ̂n ) = ℓ′(θ̂n | x) about θ∗ gives

0 = ℓ′(θ̂n ) = ℓ′(θ∗ ) + (θ̂n − θ∗ ) ℓ″(θ∗ ) + (1/2)(θ̂n − θ∗ )² ℓ‴(β),

where β lies between θ∗ and θ̂n . Hence we have

√n(θ̂n − θ∗ ) = [ (1/√n) ℓ′(θ∗ ) ] / [ −(1/n) ℓ″(θ∗ ) − (1/(2n)) (θ̂n − θ∗ ) ℓ‴(β) ].

Note that

(1/√n) ℓ′(θ∗ ) = √n · (1/n) ∑_{i=1}^n ( ∂/∂θ f (Xi | θ∗ ) ) / f (Xi | θ∗ ) →ᵈ N( 0, I(θ∗ ) )

by the central limit theorem. Moreover,

−(1/n) ℓ″(θ∗ ) = (1/n) ∑_{i=1}^n [ ( ∂/∂θ f (Xi | θ∗ ) )² − f (Xi | θ∗ ) · ∂²/∂θ² f (Xi | θ∗ ) ] / f (Xi | θ∗ )²
→ᵖ Eθ∗ [ ( ∂/∂θ f (X | θ∗ ) )² / f (X | θ∗ )² ] − Eθ∗ [ ( ∂²/∂θ² f (X | θ∗ ) ) / f (X | θ∗ ) ] = I(θ∗ )

by the law of large numbers. In addition, by the bound on the third derivative,

| (1/n) ℓ‴(β) | ≤ (1/n) ∑_{i=1}^n M (Xi ) →ᵖ Eθ∗ [M (X)].

Combining the above pieces completes the proof.

Corollary 3.14. If Θ = (a, b) and the likelihood equation has a unique root for all x and n, then
the MLE is asymptotically efficient.


Exponential family Consider the exponential family f (xi | η) = exp( ηT (xi ) − A(η) ). The likelihood equation is

(1/n) ∑_{i=1}^n T (Xi ) = A′(η) = Eη [T (Xi )].

Moreover, A″(η) = Varη (T (Xi )) = I(η) > 0, so the RHS of the above equation is increasing in η. Hence the likelihood equation has exactly one solution η̂n , which satisfies

√n(η̂n − η) →ᵈ N( 0, 1/Var(T ) ).

3.4 Examples of maximum likelihood estimation


In this section, we discuss some examples of maximum likelihood estimation, complementing the
theory developed in Section 3.3.

3.4.1 Some examples
Let X1 , . . . , Xn be i.i.d. observations from some parametric model. We consider finding the MLE
by solving the log-likelihood equation. As we will see, this is not always possible.

Weibull distribution (MLE is the unique solution) The Weibull distribution on (0, ∞) parametrized by λ, k > 0 is used in, for example, survival analysis. Its density is

f (x | λ, k) = (k/λ) (x/λ)^{k−1} e^{−(x/λ)^k} for x > 0.

The log-likelihood is

ℓ(λ, k | x) = ∑_{i=1}^n [ log(k/λ) + (k − 1) log(xi /λ) − (xi /λ)^k ].

Therefore, the likelihood equation, or more precisely, the system of likelihood equations, is

∂/∂λ ℓ(λ, k | x) = ∑_{i=1}^n ( −k/λ + k xi^k / λ^{k+1} ) = 0,
∂/∂k ℓ(λ, k | x) = ∑_{i=1}^n ( 1/k + log(xi /λ) − (xi /λ)^k log(xi /λ) ) = 0.

From the first equation, we obtain that the MLE of λ (for a given k) is λ̂n = ( (1/n) ∑_{i=1}^n xi^k )^{1/k}. To see the uniqueness of the MLE of k, note that with probability one, we have xi ≠ λ for all i ∈ [n]. Then

• ∂/∂k ℓ(λ, k | x) → ∞ as k → 0,

• ∂/∂k ℓ(λ, k | x) < 0 as k → ∞, and

• ∂²/∂k² ℓ(λ, k | x) = ∑_{i=1}^n ( −1/k² − (xi /λ)^k (log(xi /λ))² ) < 0.

We see that the second likelihood equation also has a unique solution k̂n , which gives the MLE of k. The asymptotic efficiency of λ̂n and k̂n is guaranteed by the general theory. (However, we need a multivariate version of the theory proved above.)
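Numerically, k̂n can be found by bisection. Substituting λ̂(k)^k = (1/n) ∑ xi^k into the second likelihood equation and dividing by −n reduces it to ∑ xi^k log xi / ∑ xi^k − 1/k − (1/n) ∑ log xi = 0, whose left-hand side is increasing in k. A sketch on synthetic data (the true parameters, the sample size, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
k_true, lam_true, n = 2.0, 3.0, 20_000
x = lam_true * rng.weibull(k_true, size=n)   # Weibull(k) draws, rescaled by lambda
logx = np.log(x)

def profile_eq(k):
    # Score in k after substituting the profile MLE of lambda; increasing in k.
    xk = x ** k
    return (xk * logx).sum() / xk.sum() - 1.0 / k - logx.mean()

lo, hi = 1e-2, 50.0
for _ in range(200):                         # bisection on the monotone equation
    mid = 0.5 * (lo + hi)
    if profile_eq(mid) > 0:
        hi = mid
    else:
        lo = mid
k_hat = 0.5 * (lo + hi)
lam_hat = np.mean(x ** k_hat) ** (1.0 / k_hat)
```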

Mixture of Gaussian distributions (MLE does not exist) Consider the mixture of two Gaussians p N(µ, σ²) + (1 − p) N(η, τ²) where p ∈ (0, 1). The likelihood is

L(µ, σ², η, τ² | x) = ∏_{i=1}^n [ (p/(√(2π) σ)) exp( −(xi − µ)²/(2σ²) ) + ((1 − p)/(√(2π) τ)) exp( −(xi − η)²/(2τ²) ) ].

If we expand the product, one term will be

( p(1 − p)^{n−1} / ((2π)^{n/2} σ τ^{n−1}) ) exp( −(x1 − µ)²/(2σ²) − ∑_{i=2}^n (xi − η)²/(2τ²) ).

When µ = x1 and σ → 0, this term goes to infinity. Therefore, the likelihood is unbounded, and the MLE does not exist.

Uniform distribution (MLE is not asymptotically normal) Consider the uniform distributions Unif(0, θ) parametrized by θ > 0, which do not have a common support. The MLE of θ is θ̂n = X(n) and the MVUE is θ̃n = ((n + 1)/n) X(n) . In addition, we have

n(θ − θ̂n ) →ᵈ Exp(0, θ) and n(θ − θ̃n ) →ᵈ Exp(−θ, θ),

where Exp(a, b) denotes the exponential distribution with density (1/b) e^{−(x−a)/b} 1_{[a,∞)} (x). In addition, one can show that

E[ ( n(θ̂n − θ) )² ] = ( 2n²/(n² + 3n + 2) ) θ² → 2θ² and E[ ( n(θ̃n − θ) )² ] = ( n²/(n² + 2n) ) θ² → θ².

3.4.2 Linear regression


We now move away from the i.i.d. case. Consider the linear regression model

Y = Xβ ∗ + ε,

where Y is the vector of observations in Rn , X is the design matrix in Rn×d , β ∗ is the parameter
vector in Rd to be estimated, and ε is the random vector of errors in Rn . There are two types of
assumptions on the design matrix X:

• Fixed design: X is a deterministic matrix.

• Random design: X is a random matrix. For example, the n rows of X are i.i.d. random vectors
in Rd .

In this section, we assume that ε ∼ N(0, σ² In ), so, equivalently, Y ∼ N(Xβ∗ , σ² In ) with density

f (Y | β∗ ) = (2πσ²)^{−n/2} exp( −‖Y − Xβ∗ ‖₂²/(2σ²) )

in the case of a fixed design. For a random design, the above is still true conditional on X. The log-likelihood of β is therefore

ℓ(β | Y ) = −‖Y − Xβ‖₂²/(2σ²) − (n/2) log(2πσ²).

As a result, the maximum likelihood estimator β̂ is the least squares estimator (LSE)

β̂ := argmin_{β∈R^d} ‖Y − Xβ‖₂².

Low-dimensional case Let us first consider the case where n ≥ d and X is of rank d. The likelihood equation is −2Xᵀ(Y − Xβ) = 0, which we can solve to obtain the LSE

β̂ = (XᵀX)⁻¹ XᵀY.

Consider a random design where the n rows {Xiᵀ}_{i=1}^n of X are i.i.d. from some nice distribution with covariance Σ = E[Xi Xiᵀ]. Then the n entries of Y are i.i.d., and the asymptotic efficiency of β̂ is guaranteed by the general multivariate theory. While it is not easy to verify this statement directly, let us show that ∆n := √n(β̂ − β∗ ) is Gaussian conditional on X and has the correct asymptotic covariance.
We fix n and condition on X in the sequel. First, the conditional Gaussianity of ∆n is clearly true. Second, β̂ is unbiased:

E[β̂ | X] = (XᵀX)⁻¹ XᵀXβ∗ = β∗ ,

so the Gaussian vector ∆n has mean zero. Third, the covariance matrix of ∆n can be computed:

E[∆n ∆nᵀ | X] = n E[(β̂ − β∗ )(β̂ − β∗ )ᵀ | X]
= n E[(XᵀX)⁻¹ Xᵀ εεᵀ X(XᵀX)⁻¹ | X]
= n (XᵀX)⁻¹ Xᵀ (σ² In ) X (XᵀX)⁻¹ = σ² ( (1/n) XᵀX )⁻¹ .

To see that this gives the correct covariance asymptotically, note that (1/n) XᵀX →ᵖ Σ, so we need to check that σ² Σ⁻¹ is the inverse Fisher information matrix. Indeed, the Fisher information that each Yi contains about β∗ conditional on Xi is

I(β∗ | Xi ) = −E[ ∇²_β log f (Yi | β∗ , Xi ) | Xi ] = −E[ ∇_β ( (1/σ²) Xi (Yi − Xiᵀ β) ) | Xi ] = (1/σ²) Xi Xiᵀ .

Its expectation is I(β∗ ) = E[ (1/σ²) Xi Xiᵀ ] = (1/σ²) Σ, as expected.

High-dimensional case We now consider the case where the columns of X are not linearly independent, which is always true if n < d. Then the LSE β̂ may not be unique. In this case, the problem with the formula (XᵀX)⁻¹ XᵀY is that XᵀX is not invertible. However, it suffices to replace (XᵀX)⁻¹ by the Moore–Penrose pseudoinverse (XᵀX)†, defined using the singular value decomposition (SVD). More precisely, define

β̂ = (XᵀX)† XᵀY.

To see that β̂ solves the likelihood equation ∇β ‖Y − Xβ‖₂² = −2(XᵀY − XᵀXβ) = 0, it suffices to use basic properties of the pseudoinverse to check that

XᵀX β̂ = XᵀX(XᵀX)† XᵀY = XᵀX X† Y = XᵀY.

As β̂ is not the unique solution of the likelihood equation, the general theory does not apply. In fact, since n < d, the number of dimensions has to grow as n → ∞. Therefore, it is not even clear how we can talk about any asymptotic property.
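A minimal numpy check of the pseudoinverse formula (synthetic data; the dimensions and seed are arbitrary choices): when n < d, the estimator β̂ = (XᵀX)†XᵀY satisfies the normal equations and, since X here has rank n, even interpolates the data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 50                      # high-dimensional: n < d
X = rng.normal(size=(n, d))
beta_star = np.zeros(d)
beta_star[:3] = 1.0
y = X @ beta_star + 0.1 * rng.normal(size=n)

# Minimum-norm least squares solution via the Moore-Penrose pseudoinverse.
beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y

# It satisfies the normal equations X^T X beta = X^T y ...
resid_normal_eq = np.linalg.norm(X.T @ X @ beta_hat - X.T @ y)
# ... and, because rank(X) = n < d, it fits the observations exactly.
resid_fit = np.linalg.norm(X @ beta_hat - y)
```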

3.5 Bernstein–von Mises theorem


While we have taken the frequentist point of view when discussing asymptotic estimation, it is also
possible to analyze Bayesian procedures and talk about their asymptotics.

Gaussian mean estimation Recall the example of Bayesian Gaussian mean estimation: We observe i.i.d. X1 , . . . , Xn ∼ N(θ∗ , 1) where θ∗ ∼ N(0, τ²). Then the posterior mean is θ̃n = ( nτ²/(1 + nτ²) ) X̄. This is very close to the MLE θ̂n = X̄ when n is large, and θ̃n →ᵖ θ∗ as n → ∞. Moreover,

√n(θ̃n − θ∗ ) = √n(θ̃n − θ̂n ) + √n(θ̂n − θ∗ ) →ᵈ N(0, 1),

because √n(θ̃n − θ̂n ) = −√n X̄/(1 + nτ²) →ᵖ 0 and √n(θ̂n − θ∗ ) →ᵈ N(0, 1) by the asymptotic efficiency of the MLE. Therefore, the Bayes estimator also enjoys the asymptotic efficiency just like the MLE.
We now introduce the Bernstein–von Mises theorem, which states that, as n → ∞, the posterior
distribution behaves like a Gaussian distribution centered at an efficient estimator (such as the
MLE). The rigorous statement involves a set of regularity assumptions which we omit, and we only
provide a very brief sketch of some ideas of the proof. See [vdV00] for the full statement and proof.
For two probability distributions P and Q with densities f (x) and g(x) respectively, define the total variation distance between them as

TV(P, Q) := (1/2) ∫ | f (x) − g(x) | dµ(x) = (1/2) max_{|h|≤1} ∫ h(x) ( f (x) − g(x) ) dµ(x).

Theorem 3.15 (Bernstein–von Mises; informal). Consider i.i.d. X1 , . . . , Xn from a “nice” parametric model Pθ∗ where θ∗ ∈ Θ. Let π be a prior on Θ with density p(θ) > 0. Let πn denote the posterior with density p(θ | X) where X = (X1 , . . . , Xn ). Moreover, let φn denote the distribution N( θ̂n , 1/(n I(θ∗ )) ) where θ̂n is an asymptotically efficient estimator. Then we have that

TV(πn , φn ) →ᵖ 0 as n → ∞

with respect to the probability distribution Pθ∗ .

Sketch of ideas. We are interested in the posterior around θ̂n . Recall that the posterior is

p(θ | x) = f (x | θ) · p(θ) / f (x) = L(θ | x) · p(θ) / f (x),

where L is the likelihood. Consider a change of variable φ = √n(θ − θ̂n ), and let q denote the density of φ. Then our goal is to show that the posterior q(φ | x) is close to N( 0, 1/I(θ∗ ) ). Note that we have θ = θ̂n + φ/√n and

q(φ | x) = (1/√n) p( θ̂n + φ/√n | x ).

Therefore, combining the above two equations yields

log q(φ | x) = ℓ( θ̂n + φ/√n ) + log p( θ̂n + φ/√n ) + C(x),

where ℓ(θ) = log L(θ | x) denotes the log-likelihood, and C(x) is a quantity that only depends on x and whose particular value is not important. Taylor expansion yields

log q(φ | x) ≈ ℓ(θ̂n ) + ℓ′(θ̂n ) φ/√n + (1/2) ℓ″(θ̂n ) φ²/n + log p( θ̂n + φ/√n ) + C(x).

Suppose that θ̂n is the MLE. Then ℓ(θ̂n ) depends only on x, and ℓ′(θ̂n ) = 0. Moreover, E[ℓ″(θ)] = −n I(θ), so we have ℓ″(θ̂n ) ≈ ℓ″(θ∗ ) ≈ −n I(θ∗ ). Finally, we approximate the term log p( θ̂n + φ/√n ) by log p(θ∗ ). In summary,

log q(φ | x) ≈ −(1/2) I(θ∗ ) φ² + C₂(x)

for a quantity C₂(x) that only depends on x. Therefore, q(φ | x) is proportional to exp( −φ²/(2/I(θ∗ )) ). In other words, the posterior distribution of φ is approximately N( 0, 1/I(θ∗ ) ).

3.6 Bootstrap methods


In this section, we introduce bootstrapping, a method based on sampling with replacement, and
then study its asymptotic properties.

3.6.1 Jackknife estimator and bias reduction


Given i.i.d. observations X1 , . . . , Xn ∼ Pθ , let θ̂n = θ̂(X1 , . . . , Xn ) be an estimator of θ with bias
denoted by b(θ) = Eθ [θ̂n ] − θ. Can we estimate and reduce the bias?
Let θ̂(−i) = θ̂(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) be the estimator of θ with the ith observation removed. The jackknife estimator of the bias b(θ) is defined to be

b̂n := (n − 1) ( (1/n) ∑_{i=1}^n θ̂(−i) − θ̂n ).

The jackknife bias-corrected estimator of θ is defined to be

θ̃n := θ̂n − b̂n = n θ̂n − ((n − 1)/n) ∑_{i=1}^n θ̂(−i) .

It is not hard to see that if

b(θ) = A/n + B/n² + O(1/n³),

then the bias of the jackknife bias-corrected estimator is of a smaller order:

E[θ̃n ] − θ = O(1/n²).

Bias reduction, however, does not necessarily yield a smaller risk.
Bias reduction, however, does not necessarily yield a smaller risk.

3.6.2 Mean estimation and asymptotics


Let us use the simple problem of mean estimation to explain why bootstrapping is useful for
studying properties of an estimator.
Fix i.i.d. observations X1 , . . . , Xn from a distribution with population mean θ ∈ R and variance σ² < ∞. Suppose that we would like to study the sample mean

X̄n := (1/n) ∑_{i=1}^n Xi .

Let Un be the distribution of resampling from the observations uniformly, i.e., Un = Unif({Xi }_{i=1}^n ). Consider i.i.d. Yn,i ∼ Un for i = 1, . . . , n. We call

Ȳn := (1/n) ∑_{i=1}^n Yn,i

the bootstrap sample mean. Conditional on X1 , . . . , Xn , it is clear that each Yn,i has mean X̄n . Therefore, the bootstrap sample mean Ȳn is an estimator of the sample mean X̄n . To understand the behavior of X̄n , the main idea of bootstrapping is that the distribution of X̄n − θ can be approximated by the distribution of Ȳn − X̄n . In the case of mean estimation, this claim is justified by the following theorem.

Theorem 3.16. Let Φ denote the CDF of N(0, 1). We have that, as n → ∞,

sup_{t∈R} | P{ √n( Ȳn − X̄n ) ≤ t | X1 , . . . , Xn } − Φ(t/σ) | → 0

almost surely with respect to the randomness of X1 , . . . , Xn .
The central limit theorem says that √n(X̄n − θ) →ᵈ N(0, σ²), and the convergence is in fact uniform in the sense that

sup_{t∈R} | P{ √n( X̄n − θ ) ≤ t } − Φ(t/σ) | → 0.

Comparing the above two displays, we see that the behavior of X̄n − θ is indeed similar to that of Ȳn − X̄n when n is large.
In this setting, we understand the distribution of X̄n − θ very well in view of the central limit
theorem, so there is no need to study Ȳn − X̄n . However, for more complicated models, if we have
no idea of the behavior of
θ̂n − θ, where θ̂n := θ̂(X1 , . . . , Xn ),
bootstrapping becomes useful because we can generate bootstrap samples {Yn,1 , . . . , Yn,n } and study
the distribution of
θ̃n − θ̂n , where θ̃n := θ̂(Yn,1 , . . . , Yn,n ).
There are typically two ways to generate the bootstrap samples {Yn,1 , . . . , Yn,n }:

• Nonparametric bootstrapping: We sample i.i.d. Yn,1 , . . . , Yn,n ∼ Un as above.

• Parametric bootstrapping: If we know that X1 , . . . , Xn are from a parametric model Pθ , instead


of resampling Yn,1 , . . . , Yn,n from {X1 , . . . , Xn }, we can sample Yn,1 , . . . , Yn,n from the model Pθ̂n ,
where θ̂n is the estimator in consideration. Then we can compute the bootstrap estimator θ̃n in
the same way.
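The nonparametric bootstrap for the sample mean can be sketched as follows (synthetic Exp(1) data, so σ = 1; the choices of n and of the number B of bootstrap samples are arbitrary). By Theorem 3.16, √n(Ȳn − X̄n) should be approximately N(0, σ²):

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 500, 4000
x = rng.exponential(scale=1.0, size=n)     # sigma^2 = 1 for Exp(1)
xbar = x.mean()

# Nonparametric bootstrap: resample from U_n = Unif({X_i}) with replacement.
boot = rng.choice(x, size=(B, n), replace=True).mean(axis=1)
stat = np.sqrt(n) * (boot - xbar)          # approximately N(0, sigma^2)

boot_sd = stat.std()                       # close to the sample SD of the data
```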

For the proof of Theorem 3.16, we need the following lemma.

Lemma 3.17. If Xn has CDF Fn and X has CDF F which is continuous, then

Xn →ᵈ X =⇒ sup_{t∈R} |Fn (t) − F (t)| → 0.
Proof. Since F is monotone and continuous, for each k there exist points −∞ = x0 < x1 < · · · < xk = ∞ such that F (xi ) = i/k for each i = 1, . . . , k. For all x ∈ [xi−1 , xi ], we have

Fn (x) − F (x) ≤ Fn (xi ) − F (xi−1 ) = Fn (xi ) − F (xi ) + 1/k,
Fn (x) − F (x) ≥ Fn (xi−1 ) − F (xi ) = Fn (xi−1 ) − F (xi−1 ) − 1/k.

Given any ε > 0, take k sufficiently large such that 1/k ≤ ε/2. Then take n sufficiently large depending on ε and k such that

|Fn (xi ) − F (xi )| ≤ ε/2 for all i = 0, 1, . . . , k.

It follows that

sup_{x∈R} |Fn (x) − F (x)| ≤ ε/2 + 1/k ≤ ε.

Proof sketch of Theorem 3.16. Let X = (X1 , . . . , Xn ). We have

Var(Yn,i | X) = E[Yn,i² | X] − (E[Yn,i | X])² = (1/n) ∑_{j=1}^n Xj² − X̄n² →ᵖ σ²

with respect to the randomness of X. By a version of the central limit theorem for the “triangular array” {Yn,i }, we obtain

√n( Ȳn − X̄n ) →ᵈ N(0, σ²).

Lemma 3.17 then implies the desired result.

3.7 Sampling methods


For a random variable X ∼ P and a function g, it is a common task to compute the expectation
E[g(X)]. If this cannot be done analytically, then we can use numerical approximation. However,
sometimes it is not even clear how we can sample X ∼ P in practice.

3.7.1 Sampling with quantile function


Not every distribution is built into standard software, so suppose that we would like to sample from a distribution P. Let F be the CDF of P. Then the quantile function of P is

Q(u) := inf{t : u ≤ F (t)},

which is simply the inverse of F if F is invertible. If we know Q and can sample from Unif(0, 1),
then we can sample from P.
Proposition 3.18. For any distribution P with quantile function Q, if U ∼ Unif(0, 1), then X =
Q(U ) has distribution P.
Proof. It suffices to note that

P{X ≤ t} = P{Q(U ) ≤ t} = P{U ≤ F (t)} = F (t).

For i.i.d. X, X1 , . . . , Xn ∼ P, by the law of large numbers, we have

(1/n) ∑_{i=1}^n g(Xi ) →ᵖ E[g(X)].

Therefore, we can approximate expectations via sampling.
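For example, for the Exp(1) distribution we have F(t) = 1 − e^{−t} and hence Q(u) = −log(1 − u), so Proposition 3.18 gives a sampler from uniform draws. A minimal sketch (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
u = rng.uniform(size=100_000)

# Exp(1): F(t) = 1 - exp(-t), so the quantile function is Q(u) = -log(1 - u).
x = -np.log(1.0 - u)

mean_x, var_x = x.mean(), x.var()   # both should be close to 1
```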

3.7.2 Importance sampling


Sometimes the quantile function of a distribution P can be hard to compute, so we need another
method to sample from P. Let f be the density of P. Let h be the density of a distribution Q
whose support includes that of f . Suppose that we can sample from the distribution Q. Let X ∼ P
and Y ∼ Q. We have

E_Q[ g(Y ) f (Y )/h(Y ) ] = ∫ ( g(y) f (y)/h(y) ) h(y) dy = ∫ g(x) f (x) dx = E_P [g(X)].

Therefore, to approximate E_P [g(X)], we can sample i.i.d. Y1 , . . . , Yn ∼ Q and use the fact that

(1/n) ∑_{i=1}^n g(Yi ) f (Yi )/h(Yi ) →ᵖ E_P [g(X)].
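A small sketch of importance sampling (the choice of target P = N(0, 1), proposal Q = N(0, 4), and g(x) = x² is arbitrary), estimating E_P[g(X)] = 1 using only draws from Q:

```python
import numpy as np

rng = np.random.default_rng(7)
m = 200_000

def f(x):                      # target density: N(0, 1)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def h(x):                      # proposal density: N(0, 4); its support covers that of f
    return np.exp(-x ** 2 / 8) / np.sqrt(8 * np.pi)

def g(x):                      # we want E_P[g(X)] = Var of N(0,1) = 1
    return x ** 2

y = rng.normal(0.0, 2.0, size=m)                 # i.i.d. draws from Q
estimate = np.mean(g(y) * f(y) / h(y))           # importance-weighted average
```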

3.7.3 Metropolis–Hastings algorithm


We introduce a very simple example: an accept–reject scheme, classically known as rejection sampling, which is closely related to the Metropolis–Hastings family of algorithms. Consider a setup similar to the previous subsection. Suppose that the densities f and h satisfy f ≤ M h for a constant M > 0. To sample X ∼ P, we can run:

1. Generate Y ∼ Q and U ∼ Unif(0, 1).

2. If U ≤ f (Y )/(M h(Y )), take X = Y ; otherwise, return to Step 1.

To see that X ∼ P, note that

P{X ≤ t} = P{ Y ≤ t | U ≤ f (Y )/(M h(Y )) } = P{ Y ≤ t, U ≤ f (Y )/(M h(Y )) } / P{ U ≤ f (Y )/(M h(Y )) }
= ( ∫_{−∞}^t P{ U ≤ f (y)/(M h(y)) } h(y) dy ) / ( ∫ P{ U ≤ f (y)/(M h(y)) } h(y) dy )
= ( (1/M) ∫_{−∞}^t f (y) dy ) / ( (1/M) ∫ f (y) dy ) = F (t),

which is the CDF of P at t.
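The two steps above can be sketched as follows for the (hypothetical) target Beta(2, 2) with density f(x) = 6x(1 − x), proposal Q = Unif(0, 1) (so h = 1 on [0, 1]), and envelope constant M = 3/2 = max f:

```python
import numpy as np

rng = np.random.default_rng(8)

def f(x):                          # target density: Beta(2, 2)
    return 6.0 * x * (1.0 - x)

M = 1.5                            # f(x) <= M * h(x) with h the Unif(0,1) density
h = 1.0

samples = []
while len(samples) < 50_000:
    y = rng.uniform(size=1000)     # Step 1: proposals Y ~ Q and U ~ Unif(0,1)
    u = rng.uniform(size=1000)
    samples.extend(y[u <= f(y) / (M * h)])   # Step 2: accept when U <= f/(Mh)
samples = np.asarray(samples[:50_000])

mean_s, var_s = samples.mean(), samples.var()  # Beta(2,2): mean 1/2, variance 1/20
```

The acceptance probability is 1/M, so a smaller envelope constant M makes the sampler more efficient.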

3.7.4 Gibbs sampler


When dealing with a joint distribution, one may use the Gibbs sampler. Let us consider a sim-
ple example in the bivariate setting. Suppose that we would like to sample (X, Y ) with density
fX,Y . Suppose that it is hard to sample from the joint distribution, but easy to sample from the conditional distribution when one of the coordinates is fixed. Starting from a fixed value X0 , for t ≥ 1, we do:

1. Yt ∼ fY |X (· | X = Xt−1 );

2. Xt ∼ fX|Y (· | Y = Yt ).

This algorithm generates a sequence (Xt , Yt )_{t=1}^n which is a Markov chain with invariant distribution fX,Y . By the ergodic theorem, for a bivariate function g, we have

(1/n) ∑_{t=1}^n g(Xt , Yt ) →ᵖ E[g(X, Y )].

This can be generalized to the multivariate case and is often useful in Bayesian estimation, for
example, when computing posterior means.
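A minimal Gibbs sampler sketch for a standard bivariate normal with correlation ρ, where both conditionals are N(ρ·, 1 − ρ²) (the value of ρ, the chain length, and the burn-in are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(9)
rho, T, burn = 0.8, 60_000, 1000
s = np.sqrt(1.0 - rho ** 2)        # conditional SD of the bivariate normal

xs = np.empty(T)
ys = np.empty(T)
x = 0.0                            # fixed starting value X_0
for t in range(T):
    y = rng.normal(rho * x, s)     # Step 1: Y_t | X_{t-1}
    x = rng.normal(rho * y, s)     # Step 2: X_t | Y_t
    xs[t], ys[t] = x, y

corr = np.corrcoef(xs[burn:], ys[burn:])[0, 1]   # should be close to rho
```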

Chapter 4

Finite-sample analysis

In this chapter, we focus on the minimax point of view and finite-sample analysis. We frequently
prove results of the form

sn ≲ inf_{ĝn} sup_{θ∈Θ} R( g(θ), ĝn ) ≲ rn , (4.1)

where “≲” means “≤” up to a constant factor (independent of n), and sn and rn are sequences that converge to zero as n → ∞. The sequences sn and rn are referred to as rates of estimation or rates of convergence. Hopefully, we have sn = rn , in which case we obtain matching upper and lower bounds on the minimax risk.
A simple example we have seen is that, for X1 , . . . , Xn ∼ N(µ, 1), we have

inf_{µ̂n} sup_{µ∈R} E(µ − µ̂n )² = 1/n.
For more complex problems, we typically cannot obtain such a precise result.
Note that a result like (4.1) falls in the finite-sample category as it holds for any fixed n. Yet
it has an asymptotic flavor because we are interested in large n and ignore constant factors. An
upper bound like that in (4.1) strengthens asymptotic consistency since it gives an explicit rate of
convergence over the entire parameter space.

4.1 Rates of estimation for linear regression


Recall the linear regression model
Y = Xβ ∗ + ε (4.2)

with a fixed design matrix X ∈ R^{n×d} and ε ∼ N(0, σ² In ). Let β̂ be an estimator of β∗ . In the sequel, we consider the following loss functions: (1) mean squared error (1/d) ‖β̂ − β∗ ‖₂², and (2) mean squared prediction error (1/n) ‖X β̂ − Xβ∗ ‖₂². There are usually two types of rates of estimation for problems studied here: (1) slow rate 1/√n, and (2) fast rate 1/n.

4.1.1 Fast rate for low-dimensional linear regression


Let us start with proving a fast rate of estimation in the low-dimensional case.

Theorem 4.1. For the linear regression model (4.2), the LSE β̂ = (XᵀX)† XᵀY satisfies

(1/n) E ‖X β̂ − Xβ∗ ‖₂² = σ² r/n,

where r is the rank of X. In addition, if X is of rank d, then

(1/d) E ‖β̂ − β∗ ‖₂² = (σ²/d) ∑_{i=1}^d 1/λi ,

where λ1 , . . . , λd are the eigenvalues of XᵀX.

Proof. By the definition of β̂ and basic properties of the pseudoinverse, we have

X β̂ − Xβ∗ = X(XᵀX)† XᵀY − Xβ∗
= X(XᵀX)† XᵀXβ∗ − Xβ∗ + X(XᵀX)† Xᵀε
= X(XᵀX)† Xᵀε.

It follows that

E ‖X β̂ − Xβ∗ ‖₂² = E ‖X(XᵀX)† Xᵀε‖₂²
= tr( E[ X(XᵀX)† Xᵀ εεᵀ X(XᵀX)† Xᵀ ] )
= σ² tr( X(XᵀX)† Xᵀ ) = σ² r,

where the last equality can be seen from the SVD of X.
In addition, if X is of rank d, then similarly,

β̂ − β∗ = (XᵀX)⁻¹ XᵀXβ∗ − β∗ + (XᵀX)⁻¹ Xᵀε = (XᵀX)⁻¹ Xᵀε.

As a result, we obtain

E ‖β̂ − β∗ ‖₂² = σ² tr( (XᵀX)⁻¹ ),

from which the desired result follows.
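The identity tr(X(XᵀX)†Xᵀ) = r and the risk formula can be checked numerically (synthetic design; all constants are arbitrary choices). The matrix H below is the orthogonal projection onto the column space of X:

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, sigma, reps = 100, 5, 2.0, 3000
X = rng.normal(size=(n, d))
H = X @ np.linalg.pinv(X.T @ X) @ X.T       # hat matrix X (X^T X)^dagger X^T

# tr(H) equals the rank of X ...
trace_H = np.trace(H)

# ... and the prediction risk E ||X beta_hat - X beta*||^2 equals sigma^2 * r.
beta_star = rng.normal(size=d)
risk = 0.0
for _ in range(reps):
    eps = sigma * rng.normal(size=n)
    y = X @ beta_star + eps
    risk += np.sum((H @ y - X @ beta_star) ** 2)   # = ||H eps||^2 since H X = X
risk /= reps                                       # about sigma^2 * d = 20
```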

4.1.2 Maximal inequalities


As a preparation for the next subsection, we now introduce sub-Gaussian random variables and maximal inequalities. We say that a random variable X is sub-Gaussian with variance proxy σ² and write X ∼ subG(σ²) if the MGF of X satisfies

E[ e^{λ(X−E[X])} ] ≤ e^{σ²λ²/2} for any λ ∈ R.

In particular, if X ∼ N(µ, σ²), then X ∼ subG(σ²). Moreover, if X ∼ subG(σ²) and a ∈ R, then aX ∼ subG(a²σ²).

Proposition 4.2. For (not necessarily independent) zero-mean random variables Xi ∼ subG(σi²) where i ∈ [n], we have

E[ max_{i∈[n]} Xi ] ≤ max_{i∈[n]} σi · √(2 log n).
Proof. Let σ = max_{i∈[n]} σi . For any λ > 0, we have

E[ max_{i∈[n]} Xi ] = (1/λ) E[ log e^{λ max_{i∈[n]} Xi} ] ≤ (1/λ) log E[ e^{λ max_{i∈[n]} Xi} ]
≤ (1/λ) log ∑_{i∈[n]} E[ e^{λXi} ] ≤ (1/λ) log ∑_{i∈[n]} e^{σi²λ²/2} ≤ (log n)/λ + λσ²/2.

Taking λ = σ⁻¹ √(2 log n) yields the result.
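A quick numerical check of Proposition 4.2 in the Gaussian case σi = 1 (the sizes are arbitrary choices): the empirical mean of the maximum of n standard Gaussians stays below the bound √(2 log n), and is not far from it:

```python
import numpy as np

rng = np.random.default_rng(11)
n, reps = 1000, 2000

# Empirical E[max of n standard Gaussians] versus the bound sqrt(2 log n).
z = rng.normal(size=(reps, n))
emp = z.max(axis=1).mean()
bound = np.sqrt(2 * np.log(n))
```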

We say that a random vector X in R^n is sub-Gaussian with variance proxy σ² and write X ∼ subG_n(σ²) if vᵀX ∼ subG(σ²) for any fixed unit vector v ∈ R^n. In particular, if X ∼ N(µ, σ² In ), then X ∼ subG_n(σ²).

Proposition 4.3. Let K be a convex polytope in R^n with vertices v1 , . . . , vd , and consider a zero-mean random vector X ∼ subG_n(σ²). Then we have

E[ max_{v∈K} Xᵀv ] ≤ max_{i∈[d]} ‖vi ‖₂ · σ √(2 log d).

Proof. Let us define the simplex

∆ := { α ∈ R^d : αi ≥ 0, ∑_{i=1}^d αi = 1 }.

For any vector v ∈ K, we can write v = ∑_{i=1}^d αi vi where α ∈ ∆. Hence

max_{v∈K} Xᵀv = max_{α∈∆} ∑_{i=1}^d αi Xᵀvi = max_{i∈[d]} Xᵀvi .

By the sub-Gaussianity of X, we have Xᵀvi ∼ subG(σ² ‖vi ‖₂²), so the result follows from Proposition 4.2.

4.1.3 Slow rate for high-dimensional linear regression

Theorem 4.1 gives the fast rate d/n for low-dimensional linear regression when X is of rank d. However, this rate becomes vacuous in the high-dimensional case since d/n is not shrinking to zero as n grows. For consistent estimation in high dimensions, we have to assume that β∗ is of low complexity in a certain sense.
Let K ⊂ R^d be a closed set. If we know β∗ ∈ K a priori, we may consider the constrained LSE

β̂_K := argmin_{β∈K} ‖Y − Xβ‖₂² . (4.3)

A prominent example of K is the ℓ1 -ball of radius κ > 0, i.e., K is equal to

B1 (κ) = { β ∈ R^d : ‖β‖₁ = ∑_{i=1}^d |βi | ≤ κ }.

In this case, the optimization problem (4.3) is a computationally feasible convex program.

Theorem 4.4. For the linear regression model (4.2) where ε is a zero-mean subG_n(σ²) random vector, suppose that we have β∗ ∈ B1 (κ) and max_{j∈[d]} ‖Xj ‖₂ ≤ √n, where Xj denotes the jth column of X. Then the constrained LSE β̂_{B1 (κ)} satisfies that

(1/n) E ‖X β̂_{B1 (κ)} − Xβ∗ ‖₂² ≲ σκ √(log d / n).
Proof. For simplicity, we write β̂ = β̂_{B1 (κ)} . By the definition of β̂, we have

‖Y − X β̂‖₂² ≤ ‖Y − Xβ∗ ‖₂² = ‖ε‖₂² .

Expanding the LHS gives

‖Y − X β̂‖₂² = ‖Xβ∗ + ε − X β̂‖₂² = ‖X(β̂ − β∗ )‖₂² − 2εᵀX(β̂ − β∗ ) + ‖ε‖₂² .

As a result, we obtain

‖X(β̂ − β∗ )‖₂² ≤ 2εᵀX(β̂ − β∗ ) ≤ 4 max_{β∈B1 (κ)} εᵀXβ ≤ 4 max_{v∈D} εᵀv,

where we used that ‖β̂ − β∗ ‖₁ ≤ 2κ, and D := {Xβ : β ∈ B1 (κ)} ⊂ R^n .
To bound this supremum, note that D is a polytope in R^n with at most 2d vertices that are in the set {κX1 , −κX1 , . . . , κXd , −κXd }. In addition, each vertex has ℓ2 -norm bounded by κ ‖Xj ‖₂ ≤ κ√n by assumption. As a result, we obtain

E ‖X(β̂ − β∗ )‖₂² ≤ 4 E[ max_{v∈D} εᵀv ] ≤ 4κ√n · σ √(2 log(2d))

by Proposition 4.3.

In fact, a rate of estimation like that in Theorem 4.1 also holds for β̂_{B1 (κ)} , so that we have

(1/n) E ‖X β̂_{B1 (κ)} − Xβ∗ ‖₂² ≲ min( σ² d/n, σκ √(log d / n) ).

Let us consider the case σ = 1 for simplicity. In the low-dimensional case d ≪ n, we observe a fast rate d/n. In the high-dimensional case d ≫ n, we obtain a slow yet consistent rate κ √(log d / n). This is usually called the elbow effect. In fact, one can still achieve a fast rate that scales as 1/n in the high-dimensional setting; we discuss a special case in the next section.

4.2 High-dimensional linear regression


4.2.1 Setup and estimators
As we have seen, when the dimension d of the problem exceeds the sample size n, we can still do consistent estimation if ‖β∗ ‖₁ ≤ κ for a small κ. Another popular assumption is that β∗ is k-sparse, i.e., β∗ is in

B0 (k) = { β ∈ R^d : ‖β‖₀ = ∑_{i=1}^d 1{βi ≠ 0} ≤ k }.

There are many potential estimators of β∗ :

• For ‖β∗ ‖₀ ≤ k, the constrained LSE is

β̂_{B0 (k)} := argmin_{‖β‖₀ ≤ k} ‖Y − Xβ‖₂² .

Unlike the constraint ‖β‖₁ ≤ k which is convex, here ‖β‖₀ ≤ k is a discrete constraint with more than (d choose k) possible supports of β. Hence this optimization problem is computationally infeasible in the worst case.

• If β∗ is k-sparse and each entry of β∗ is in [−1, 1], then ‖β∗ ‖₁ ≤ k. Hence we can still consider

β̂_{B1 (k)} := argmin_{‖β‖₁ ≤ k} ‖Y − Xβ‖₂²

in this case, which is a computationally efficient and statistically consistent estimator.

• The ℓ₁-constrained LSE requires the knowledge of the (typically unknown) sparsity k. Instead, the most popularly used estimator is the LASSO estimator

    β̂^{LASSO} := argmin_{β ∈ R^d} ‖Y − Xβ‖₂² + 2λ‖β‖₁,

  where λ = Cσ√(n log d) for a constant C > 0. This is an ℓ₁-penalized estimator, in comparison to the ℓ₁-constrained LSE above. They enjoy similar rates of estimation. Note that λ does not depend on k, so LASSO adapts to the sparsity of β∗.

• A more recently proposed estimator is the SLOPE estimator

    β̂^{SLOPE} := argmin_{β ∈ R^d} ‖Y − Xβ‖₂² + 2τ‖β‖∗,

  where τ = Cσ√n for a constant C > 0 and the norm ‖β‖∗ is defined as follows. Let λᵢ = √(log(2d/i)) for i ∈ [d]. Let β₍₁₎, . . . , β₍d₎ be the order statistics of β₁, . . . , β_d such that

    |β₍₁₎| ≥ |β₍₂₎| ≥ · · · ≥ |β₍d₎|.

  Then we define

    ‖β‖∗ := Σ_{i=1}^d λᵢ |β₍ᵢ₎|.

  The SLOPE estimator achieves a slightly better rate of convergence than the LASSO estimator. Although the norm ‖β‖∗ is more involved than ‖β‖₁, the SLOPE estimator can be efficiently computed.
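As an illustration not taken from the notes, the LASSO objective above can be minimized by proximal gradient descent (ISTA): each iteration takes a gradient step on ‖Y − Xβ‖₂² followed by soft-thresholding, the proximal operator of the ℓ₁ penalty. The simulated data, the step size, and the choice λ = 2σ√(n log d) below are all illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1: shrink each coordinate toward 0 by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=3000):
    # minimize ||Y - X b||_2^2 + 2 * lam * ||b||_1 by proximal gradient descent
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / (Lipschitz const of the gradient)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - Y)
        beta = soft_threshold(beta - step * grad, 2.0 * lam * step)
    return beta

rng = np.random.default_rng(0)
n, d, sigma = 200, 50, 0.1
beta_star = np.zeros(d)
beta_star[:3] = [1.0, -1.0, 0.5]                    # 3-sparse truth
X = rng.standard_normal((n, d))
Y = X @ beta_star + sigma * rng.standard_normal(n)
beta_hat = lasso_ista(X, Y, lam=2.0 * sigma * np.sqrt(n * np.log(d)))
```

The soft-thresholding step is what sets small coordinates exactly to zero, which is how the ℓ₁ penalty induces sparsity.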

4.2.2 Fast rate for sparse linear regression


We now prove that the estimator β̂B0 (k) achieves a fast rate of estimation when the entries of β ∗
take discrete values. This setting is very special and the estimator is computationally infeasible.
However, the simple proof provides some intuition for why we can obtain the fast rate of estimation
in certain cases.

Theorem 4.5. For the linear regression model (4.2) where ε is a zero-mean subGₙ(σ²) random vector, suppose that we have β∗ ∈ B₀(k) and βᵢ∗ ∈ {−1, 0, 1} for each i ∈ [d]. Define an estimator

  β̂ := argmin_{β ∈ B₀(k) ∩ {−1,0,1}^d} ‖Xβ − Y‖₂².

For any δ ∈ (0, 1), it holds with probability at least 1 − δ that

  (1/n) ‖Xβ̂ − Xβ∗‖₂² ≤ 16σ² (k/n) log(5ed/(2kδ)) ≲ σ² (k/n) log(d/(kδ)).

Proof. Using the same argument as in the proof of Theorem 4.4, we obtain

  ‖X(β̂ − β∗)‖₂² ≤ 2ε⊤X(β̂ − β∗) = 2‖X(β̂ − β∗)‖₂ · ε⊤( X(β̂ − β∗) / ‖X(β̂ − β∗)‖₂ ).

Define a set

  D := { v ∈ Rⁿ : v = Xu/‖Xu‖₂ for u ∈ {−2, −1, 0, 1, 2}^d, ‖u‖₀ ≤ 2k }.

Then we have

  ‖X(β̂ − β∗)‖₂ ≤ 2 sup_{v∈D} ε⊤v.

Since each v ∈ D is a unit vector, we have ε⊤v ∼ subG(σ²). As a result of a homework problem,

  P{ε⊤v > t} ≤ exp( −t²/(2σ²) )

for any t > 0. Moreover, the set D has cardinality at most (d choose 2k) 5^{2k} ≤ (ed/(2k))^{2k} 5^{2k}. By a union bound,

  P{ sup_{v∈D} ε⊤v > t } ≤ 5^{2k} (ed/(2k))^{2k} exp( −t²/(2σ²) ).

For δ ∈ (0, 1), we take t = 2σ√(k log(5ed/(2kδ))) and obtain

  P{ sup_{v∈D} ε⊤v > 2σ√(k log(5ed/(2kδ))) } ≤ δ.

This combined with ‖X(β̂ − β∗)‖₂ ≤ 2 sup_{v∈D} ε⊤v finishes the proof.
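To make the discrete nature of this estimator concrete, here is a brute-force sketch (not from the notes) on a tiny illustrative instance, enumerating all of B₀(k) ∩ {−1, 0, 1}^d by support and sign pattern; the sizes are chosen only so that exact enumeration is possible.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d, k, sigma = 60, 8, 2, 0.3
beta_star = np.zeros(d)
beta_star[0], beta_star[3] = 1.0, -1.0   # 2-sparse truth with entries in {-1, 0, 1}
X = rng.standard_normal((n, d))
Y = X @ beta_star + sigma * rng.standard_normal(n)

# The candidate count grows like (d choose k) 2^k, which is exactly why this
# estimator is computationally infeasible at scale.
best, best_rss, n_candidates = None, np.inf, 0
for j in range(k + 1):
    for support in itertools.combinations(range(d), j):
        for signs in itertools.product((-1.0, 1.0), repeat=j):
            n_candidates += 1
            beta = np.zeros(d)
            beta[list(support)] = signs
            rss = np.sum((Y - X @ beta) ** 2)
            if rss < best_rss:
                best, best_rss = beta, rss
```

On this instance the enumeration visits 129 candidates; already at d = 100, k = 10 the count exceeds 10^16.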

4.2.3 Fast rate for LASSO


The computationally efficient LASSO estimator in fact achieves the fast rate of estimation for sparse
linear regression under reasonable assumptions on the design matrix X ∈ Rn×d . We introduce two
related assumptions:

• We say that the design matrix X ∈ R^{n×d} is δ-incoherent if the matrix

    (1/n) X⊤X − I_d

  is entrywise bounded in absolute value by δ > 0. Note that if the rows xᵢ⊤ of X are sampled i.i.d. from a distribution with mean zero and covariance I_d, then the above matrix is equal to

    (1/n) Σ_{i=1}^n xᵢxᵢ⊤ − I_d

  and converges to 0 by the law of large numbers. Therefore, incoherence is a reasonable assumption. Furthermore, it can be shown that as long as n ≳ (log d)/δ², we can sample a random matrix X ∈ R^{n×d} that is δ-incoherent with probability 0.99.

• For any β ∈ R^d and S ⊂ [d], let β_S denote the vector in R^d with (β_S)ᵢ = βᵢ for i ∈ S and (β_S)ᵢ = 0 for i ∈ S^c := [d] \ S. Define a cone of vectors

    C_S := { β ∈ R^d : ‖β_{S^c}‖₁ ≤ 3‖β_S‖₁ }.

  If |S| ≤ k, then the cone C_S contains approximately k-sparse vectors β with support approximately contained in S. We say that X satisfies the restricted eigenvalue (RE) condition for k-sparse vectors if

    inf_{|S| ≤ k} inf_{β ∈ C_S} ‖Xβ‖₂² / (n‖β‖₂²) ≥ 1/2.    (4.4)

  Note that if the infimum were taken over all β ∈ R^d, then the condition would say that the smallest eigenvalue of (1/n)X⊤X is lower bounded by 1/2. This is why the above condition is referred to as the RE condition. Furthermore, it can be shown that as soon as n ≳ k log d, we can sample a random matrix X ∈ R^{n×d} that satisfies (4.4) with probability 0.99.
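As a quick numerical sanity check (illustrative, not part of the notes), one can sample a Gaussian design and measure its empirical incoherence parameter; the sizes n = 2000, d = 30 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 30
X = rng.standard_normal((n, d))          # rows are i.i.d. N(0, I_d)
M = X.T @ X / n - np.eye(d)
delta = np.abs(M).max()                  # empirical incoherence parameter of X
```

With these sizes, delta is on the order of √((log d)/n) ≈ 0.04, consistent with the claim that n ≳ (log d)/δ² samples suffice for δ-incoherence.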

The following result shows that incoherence is stronger than the RE condition when the param-
eters are appropriately chosen.

Proposition 4.6. Consider a subset S ⊂ [d] with |S| ≤ k and a (1/(32k))-incoherent matrix X ∈ R^{n×d}. If β satisfies the cone condition ‖β_{S^c}‖₁ ≤ 3‖β_S‖₁, then it holds that

  (1/n) ‖Xβ‖₂² ≥ (1/2) ‖β‖₂².

As a result, every (1/(32k))-incoherent matrix X ∈ R^{n×d} satisfies (4.4).

Proof. We have

  ‖Xβ‖₂² = ‖Xβ_S + Xβ_{S^c}‖₂² = ‖Xβ_S‖₂² + ‖Xβ_{S^c}‖₂² + 2β_S⊤X⊤Xβ_{S^c}.

The three terms can be controlled as follows:

• (1/n)‖Xβ_S‖₂² = ‖β_S‖₂² + β_S⊤((1/n)X⊤X − I_d)β_S ≥ ‖β_S‖₂² − (1/(32k))‖β_S‖₁²;

• (1/n)‖Xβ_{S^c}‖₂² = ‖β_{S^c}‖₂² + β_{S^c}⊤((1/n)X⊤X − I_d)β_{S^c} ≥ ‖β_{S^c}‖₂² − (1/(32k))‖β_{S^c}‖₁² ≥ ‖β_{S^c}‖₂² − (9/(32k))‖β_S‖₁²;

• (1/n)β_S⊤X⊤Xβ_{S^c} = β_S⊤β_{S^c} + β_S⊤((1/n)X⊤X − I_d)β_{S^c} ≥ −(1/(32k))‖β_S‖₁‖β_{S^c}‖₁ ≥ −(3/(32k))‖β_S‖₁².

Combining the three terms yields

  (1/n)‖Xβ‖₂² ≥ ‖β‖₂² − (1/(2k))‖β_S‖₁² ≥ ‖β‖₂² − (|S|/(2k))‖β_S‖₂² ≥ (1/2)‖β‖₂²,

where we used the Cauchy–Schwarz inequality ‖β_S‖₁² ≤ |S| · ‖β_S‖₂².

Lemma 4.7. For X ∈ R^{n×d}, suppose that max_{j∈[d]} ‖Xⱼ‖₂² ≤ 2n, where Xⱼ denotes the jth column of X. Let ε ∼ subGₙ(σ²). Then we have ‖X⊤ε‖∞ ≤ 2σ√(n log(2d/δ)) with probability at least 1 − δ.

Proof. We have Xⱼ⊤ε ∼ subG(2nσ²) and thus P{|Xⱼ⊤ε| > t} ≤ 2 exp(−t²/(4nσ²)) by a homework problem. Therefore,

  P{‖X⊤ε‖∞ > t} = P{ max_{j∈[d]} |Xⱼ⊤ε| > t } ≤ 2d exp( −t²/(4nσ²) ).

Choosing t = 2σ√(n log(2d/δ)) completes the proof.
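A small Monte Carlo sketch (assumptions: Gaussian noise, unit-normalized columns, illustrative sizes; not from the notes) confirms that the bound of Lemma 4.7 holds far more often than the guaranteed 1 − δ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, delta = 500, 40, 1.0, 0.1
X = rng.standard_normal((n, d))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # normalize columns: ||X_j||_2^2 = n <= 2n
bound = 2 * sigma * np.sqrt(n * np.log(2 * d / delta))
violations = sum(
    np.max(np.abs(X.T @ (sigma * rng.standard_normal(n)))) > bound
    for _ in range(200)
)
```

Since Gaussian tails are strictly lighter than the generic sub-Gaussian bound used in the proof, the observed violation rate is in fact far below δ.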

Theorem 4.8. Consider the linear regression model Y = Xβ∗ + ε where ‖β∗‖₀ ≤ k and ε ∼ subGₙ(σ²). Suppose that X satisfies the RE condition (4.4) and that max_{j∈[d]} ‖Xⱼ‖₂² ≤ 2n (both of which hold if X is (1/(32k))-incoherent). Define the LASSO estimator

  β̂ := argmin_{β ∈ R^d} ‖Y − Xβ‖₂² + 2λ‖β‖₁,  where λ := 8σ√(n log(2d/δ)).

For any δ ∈ (0, 1), it holds with probability at least 1 − δ that

  ‖β̂ − β∗‖₂² ≤ (2/n) ‖Xβ̂ − Xβ∗‖₂² ≲ σ² (k/n) log(d/δ).
Proof. By the definition of the LASSO estimator, we have

  ‖Y − Xβ̂‖₂² + 2λ‖β̂‖₁ ≤ ‖Y − Xβ∗‖₂² + 2λ‖β∗‖₁.

Expanding ‖Y − Xβ̂‖₂² = ‖ε + Xβ∗ − Xβ̂‖₂² and adding λ‖β̂ − β∗‖₁ to both sides, we obtain

  ‖Xβ∗ − Xβ̂‖₂² + λ‖β̂ − β∗‖₁ ≤ 2ε⊤X(β̂ − β∗) + λ‖β̂ − β∗‖₁ + 2λ‖β∗‖₁ − 2λ‖β̂‖₁.

By Hölder's inequality and the above lemma, it holds with probability at least 1 − δ that

  ε⊤X(β̂ − β∗) ≤ ‖X⊤ε‖∞ ‖β̂ − β∗‖₁ ≤ 4σ√(n log(2d/δ)) ‖β̂ − β∗‖₁ = (λ/2) ‖β̂ − β∗‖₁.

As a result, with S chosen to be the support of β∗, we get

  ‖Xβ∗ − Xβ̂‖₂² + λ‖β̂ − β∗‖₁ ≤ 2λ‖β̂ − β∗‖₁ + 2λ‖β∗‖₁ − 2λ‖β̂‖₁
    = 2λ‖β̂_S − β∗_S‖₁ + 2λ‖β∗_S‖₁ − 2λ‖β̂_S‖₁ ≤ 4λ‖β̂_S − β∗_S‖₁.    (4.5)

This in particular implies that

  ‖β̂_{S^c} − β∗_{S^c}‖₁ ≤ 3‖β̂_S − β∗_S‖₁,

that is, β̂ − β∗ satisfies the cone condition. Then it follows from the Cauchy–Schwarz inequality and (4.4) that

  ‖β̂_S − β∗_S‖₁ ≤ √k ‖β̂_S − β∗_S‖₂ ≤ √k ‖β̂ − β∗‖₂ ≤ √(2k/n) ‖Xβ̂ − Xβ∗‖₂.

Plugging this bound back into (4.5), we obtain

  ‖Xβ∗ − Xβ̂‖₂² ≤ 4λ√(2k/n) ‖Xβ̂ − Xβ∗‖₂  ⟹  ‖Xβ∗ − Xβ̂‖₂² ≤ 32λ²k/n.

This completes the proof in view of the definition of λ and the fact that ‖β̂ − β∗‖₂² ≤ (2/n)‖Xβ̂ − Xβ∗‖₂².

4.3 Generalized linear regression


4.3.1 Setup and models
A generalized linear model can be defined using an exponential family. Fix a parameter vector β∗ ∈ R^d. For each i = 1, . . . , n, suppose that we observe a design point xᵢ ∈ R^d and a random outcome Yᵢ ∼ P_{xᵢ⊤β∗} with density of the form

  f(yᵢ | xᵢ⊤β∗) = exp( (yᵢ · xᵢ⊤β∗ − A(xᵢ⊤β∗)) / σ² ) · h(yᵢ, σ)

for functions A(·), h(·), and noise parameter σ > 0. Suppose the observations are independent, so that the log-likelihood is

  ℓ(β | Y) = Σ_{i=1}^n [ (Yᵢ · xᵢ⊤β − A(xᵢ⊤β)) / σ² + log h(Yᵢ, σ) ].

Usually the function A is convex, so the negative log-likelihood is convex in β and we can efficiently solve for the MLE.
Let us see two examples of the above general model:

Gaussian linear regression  In linear regression (4.2) with Gaussian noise ε ∼ N(0, σ²Iₙ), let Yᵢ be the ith entry of Y, and let xᵢ⊤ be the ith row of X. Then we have

  f(yᵢ | xᵢ⊤β∗) = exp( (yᵢ · xᵢ⊤β∗ − (xᵢ⊤β∗)²/2) / σ² ) · exp( −yᵢ²/(2σ²) − log √(2πσ²) ).

Hence, this is a special case of the general model.

Logistic regression  Consider independent binary observations Y₁, . . . , Yₙ, where

  Yᵢ ∼ Ber( 1 / (1 + exp(−xᵢ⊤β∗)) ).    (4.6)

Then we have

  f(yᵢ | xᵢ⊤β∗) = ( 1/(1 + exp(−xᵢ⊤β∗)) )^{yᵢ} ( exp(−xᵢ⊤β∗)/(1 + exp(−xᵢ⊤β∗)) )^{1−yᵢ}
    = exp( yᵢ log( 1/(1 + exp(−xᵢ⊤β∗)) ) + (1 − yᵢ) log( exp(−xᵢ⊤β∗)/(1 + exp(−xᵢ⊤β∗)) ) )
    = exp( yᵢ · xᵢ⊤β∗ − log(1 + exp(xᵢ⊤β∗)) ),

which is again a special case of the general model. What motivates the logistic regression model? The task of classification: β∗ is a linear classifier that we aim to learn, each xᵢ is a vector of d features, and each Yᵢ represents an outcome of classification.
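The log-likelihood above is concave in β, so the MLE can be computed by gradient ascent. The following sketch (not from the notes; unconstrained, with illustrative step size and sample size) fits a logistic model to simulated data:

```python
import numpy as np

def logistic_mle(X, Y, lr=1.0, n_iter=2000):
    # gradient ascent on (1/n) * sum_i [ Y_i x_i^T b - log(1 + exp(x_i^T b)) ]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current model probabilities
        beta += lr * X.T @ (Y - p) / len(Y)   # gradient of the averaged log-likelihood
    return beta

rng = np.random.default_rng(0)
n = 4000
beta_star = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, 3))
Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_star))).astype(float)
beta_hat = logistic_mle(X, Y)
```

The gradient X⊤(Y − p) is the generic exponential-family form "response minus model mean", matching the density derived above.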
There is a different class of generalized linear models:

  Yᵢ = F(xᵢ⊤β∗) + εᵢ

for i = 1, . . . , n, where F : R → R is a known, increasing function, and εᵢ is zero-mean noise. In matrix form, we have

  Y = F(Xβ∗) + ε,    (4.7)

where X ∈ R^{n×d}, β∗ ∈ R^d, ε ∈ Rⁿ, and F applies entrywise to Xβ∗.
Note that (4.7) reduces to linear regression if F(t) = t. Moreover, if F is the logistic function defined by F(t) = 1/(1 + e^{−t}) and εᵢ = Yᵢ − F(xᵢ⊤β∗), then (4.7) reduces to logistic regression (4.6). Another example is the probit model, where F is taken to be the CDF of a standard Gaussian, and Yᵢ ∼ Ber(F(xᵢ⊤β∗)).

4.3.2 Maximum likelihood estimation for logistic regression


To study logistic regression (4.6), let us focus on the low-dimensional regime where d ≤ n and X is of rank r. For Gaussian linear regression, Theorem 4.1 shows that the MLE (i.e., the LSE) achieves the rate of estimation r/n. We now show that this is also the case for logistic regression.
To be more precise, for a constant B > 0, consider the parameter space

  Θ := { β ∈ R^d : |xᵢ⊤β| ≤ B, i ∈ [n] },

and suppose β∗ ∈ Θ. The function F(t) = 1/(1 + e^{−t}) satisfies 1 − F(t) = F(−t), so we have

  f(yᵢ | xᵢ⊤β∗) = F(xᵢ⊤β∗)^{yᵢ} F(−xᵢ⊤β∗)^{1−yᵢ}.

Define g(t) := log F(t). Then the MLE is

  β̂ := argmax_{β ∈ Θ} Σ_{i=1}^n [ Yᵢ g(xᵢ⊤β) + (1 − Yᵢ) g(−xᵢ⊤β) ].

Theorem 4.9. Consider the logistic regression model (4.6) where β∗ ∈ Θ. Then the MLE β̂ satisfies

  (1/n) E‖Xβ̂ − Xβ∗‖₂² ≲_B r/n,

where the hidden constant depends on B and r is the rank of X.

Proof. One can check that g(t) is κ-strongly concave in the sense that

  g(t) ≤ g(t∗) + g′(t∗)(t − t∗) − κ(t − t∗)²

for all t, t∗ with |t| ≤ B, |t∗| ≤ B, for a constant κ = κ(B) > 0. Since |xᵢ⊤β| ≤ B, we see that

  ℓ(β̂) = ℓ(β̂ | Y) = Σ_{i=1}^n [ Yᵢ g(xᵢ⊤β̂) + (1 − Yᵢ) g(−xᵢ⊤β̂) ]
    ≤ Σ_{i=1}^n [ Yᵢ g(xᵢ⊤β∗) + (1 − Yᵢ) g(−xᵢ⊤β∗) ]
      + Σ_{i=1}^n [ Yᵢ g′(xᵢ⊤β∗) − (1 − Yᵢ) g′(−xᵢ⊤β∗) ] (xᵢ⊤β̂ − xᵢ⊤β∗)
      − Σ_{i=1}^n κ (xᵢ⊤β̂ − xᵢ⊤β∗)²
    = ℓ(β∗) + ε⊤(Xβ̂ − Xβ∗) − κ‖Xβ̂ − Xβ∗‖₂²,

where ε is the vector with entries εᵢ := Yᵢ g′(xᵢ⊤β∗) − (1 − Yᵢ) g′(−xᵢ⊤β∗). In addition, ℓ(β̂) ≥ ℓ(β∗) by the definition of the MLE, so

  ‖Xβ̂ − Xβ∗‖₂² ≤ (1/κ) ε⊤(Xβ̂ − Xβ∗) = (1/κ) ε⊤XX†X(β̂ − β∗) ≤ (1/κ) ‖ε⊤XX†‖₂ ‖X(β̂ − β∗)‖₂

by the definition of the pseudoinverse X† and the Cauchy–Schwarz inequality. Rearranging terms yields

  ‖Xβ̂ − Xβ∗‖₂² ≤ (1/κ²) ‖ε⊤XX†‖₂².

Consider the SVD X = UΣV⊤. Then XX† = UΣΣ†U⊤ and thus

  ‖ε⊤XX†‖₂² = ‖ε⊤UΣΣ†‖₂² = Σ_{j=1}^r (ε⊤uⱼ)²,

where uⱼ is the jth column of U. Moreover, the fact that g′(t) = 1 − F(t) = F(−t) implies

  εᵢ = Yᵢ(1 − F(xᵢ⊤β∗)) − (1 − Yᵢ)F(xᵢ⊤β∗) = Yᵢ − F(xᵢ⊤β∗).

In view of the model (4.6), we see that εᵢ is simply the deviation of Yᵢ from its mean, so E[εᵢ²] ≤ 1/4. It follows that

  E(ε⊤uⱼ)² = E( Σ_{i=1}^n εᵢ (uⱼ)ᵢ )² ≤ (1/4)‖uⱼ‖₂² = 1/4.

Combining everything yields

  (1/n) E‖Xβ̂ − Xβ∗‖₂² ≤ (1/(κ²n)) Σ_{j=1}^r E(ε⊤uⱼ)² ≤ r/(4κ²n),

which completes the proof.

4.4 Nonparametric regression
Let us consider an even more general regression model

Yi = f (xi ) + εi , i ∈ [n], (4.8)

where xi ∈ Rd are the design points, and εi are independent noise with E[εi ] = 0 and E[ε2i ] ≤ σ 2 .
The linear and generalized linear regression models assume f(xᵢ) = xᵢ⊤β∗ and f(xᵢ) = F(xᵢ⊤β∗)
respectively. Nonparametric regression, on the other hand, does not assume that there is an
underlying parameter vector β ∗ . Instead, we impose a certain nonparametric assumption on the
function f , such as smoothness, monotonicity, or convexity.

4.4.1 Model and estimators


Let us focus on the simple setting where d = 1, xi = i/n, and f : [0, 1] → R is Hölder smooth in
the following sense.

Definition 4.10. Fix β > 0 and let ` := bβc. The Hölder class Σ(β) on [0, 1] is defined as the set
of ` times differentiable functions f : [0, 1] → R whose `th derivative f (`) satisfies

|f (`) (x) − f (`) (x0 )| ≤ L|x − x0 |β−` , for all x, x0 ∈ [0, 1],

for some constant L > 0. We also use Σ(β, L) to denote all functions f satisfying the above
conditions for a fixed L > 0.

Note that for a larger β, the condition is stronger and thus Σ(β) is smaller.

Kernels  Before defining the estimator of interest, we first introduce a kernel function K : R → R, such as:

• Rectangular kernel: K(u) = (1/2) 1{|u| ≤ 1};

• Triangular kernel: K(u) = (1 − |u|) 1{|u| ≤ 1};

• Parabolic kernel: K(u) = (3/4)(1 − u²) 1{|u| ≤ 1};

• Quartic kernel: K(u) = (15/16)(1 − u²)² 1{|u| ≤ 1};

• Gaussian kernel: K(u) = (1/√(2π)) e^{−u²/2}.

In the sequel, we focus on kernels that satisfy

  0 ≤ K ≤ 1,  supp(K) ⊂ [−1, 1],  K(u) = K(−u),  ∫K = 1.    (4.9)

Note that the above kernels except the Gaussian kernel satisfy these conditions.
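The normalization ∫K = 1 for the compactly supported kernels above can be checked numerically; the following sketch (illustrative, using a simple Riemann sum on [−1, 1]) is not part of the notes.

```python
import numpy as np

kernels = {
    "rectangular": lambda u: 0.5 * (np.abs(u) <= 1),
    "triangular":  lambda u: (1 - np.abs(u)) * (np.abs(u) <= 1),
    "parabolic":   lambda u: 0.75 * (1 - u ** 2) * (np.abs(u) <= 1),
    "quartic":     lambda u: (15 / 16) * (1 - u ** 2) ** 2 * (np.abs(u) <= 1),
}
u = np.linspace(-1.0, 1.0, 200001)
du = u[1] - u[0]
integrals = {name: float(np.sum(K(u)) * du) for name, K in kernels.items()}
```

Each value in `integrals` is 1 up to discretization error, confirming (4.9) for these kernels.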

Nadaraya–Watson estimator  Given a kernel K and a bandwidth h > 0, a prominent kernel estimator of the regression function f is the Nadaraya–Watson estimator f̂^{NW}, defined as

  f̂^{NW}(x) := Σ_{i=1}^n Yᵢ K((xᵢ − x)/h) / Σ_{i=1}^n K((xᵢ − x)/h)

if Σ_{i=1}^n K((xᵢ − x)/h) ≠ 0, and f̂^{NW}(x) := 0 otherwise.
More generally, one can consider a linear nonparametric regression estimator f̂^{linear} of the form

  f̂^{linear}(x) = Σ_{i=1}^n Yᵢ Wᵢ(x),

where Wᵢ(x) = W_{n,i}(x, x₁, . . . , xₙ) with Σ_{i=1}^n Wᵢ(x) = 1.
Consider the Nadaraya–Watson estimator with the rectangular kernel: If K(u) = (1/2)1{|u| ≤ 1}, then f̂^{NW}(x) is the average of the Yᵢ for which xᵢ ∈ [x − h, x + h].

• As h → ∞, f̂^{NW}(x) tends to (1/n)Σ_{i=1}^n Yᵢ, which is a constant that does not depend on x. Then the bias is too large, and this situation is called over-smoothing.

• As h → 0, we have f̂^{NW}(xᵢ) = Yᵢ and f̂^{NW}(x) = 0 if x ≠ xᵢ for every i ∈ [n]. Then the variance is too large, and this situation is called under-smoothing.

We need to choose an appropriate bandwidth h to achieve an optimal bias–variance trade-off.
The Nadaraya–Watson estimator can also be interpreted as a weighted LSE:

  f̂^{NW}(x) = argmin_{θ ∈ R} Σ_{i=1}^n (Yᵢ − θ)² K((xᵢ − x)/h),

where the kernel downweights xᵢ if xᵢ is far away from x.
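A minimal implementation of the Nadaraya–Watson estimator with the rectangular kernel follows (the regression function f(x) = sin(2πx), sample size, and bandwidth are illustrative assumptions, not from the notes):

```python
import numpy as np

def nw_estimate(x, xs, Ys, h, K=lambda u: 0.5 * (np.abs(u) <= 1)):
    # Nadaraya-Watson estimate at a single point x (rectangular kernel by default)
    w = K((xs - x) / h)
    return np.sum(w * Ys) / np.sum(w) if np.sum(w) > 0 else 0.0

rng = np.random.default_rng(0)
n, sigma, h = 1000, 0.1, 0.05
xs = np.arange(1, n + 1) / n
Ys = np.sin(2 * np.pi * xs) + sigma * rng.standard_normal(n)
fhat = nw_estimate(0.25, xs, Ys, h)      # true value f(0.25) = sin(pi/2) = 1
```

Replacing h = 0.05 by a very large or very small value exhibits the over- and under-smoothing behavior described above.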

Local polynomial estimator  Following the intuition of weighted least squares, we can design a more sophisticated estimator using the Taylor expansion of f, with θ replaced by a polynomial of degree ℓ = ⌊β⌋. For f ∈ Σ(β, L) where β > 1, for z close to x, we have

  f(z) ≈ f(x) + f′(x)(z − x) + (f″(x)/2)(z − x)² + · · · + (f^{(ℓ)}(x)/ℓ!)(z − x)^ℓ = θ(x)⊤U((z − x)/h),

where the vectors θ(x) = θ_h(x) and U(u) are defined by

  θ(x) = ( f(x), f′(x)h, f″(x)h², . . . , f^{(ℓ)}(x)h^ℓ )⊤,
  U(u) = ( 1, u, u²/2, . . . , u^ℓ/ℓ! )⊤.

Definition 4.11. The local polynomial estimator of order ℓ (LP(ℓ) estimator) of θ(x) is the vector θ̂(x) in R^{ℓ+1} defined by

  θ̂(x) := argmin_{θ ∈ R^{ℓ+1}} Σ_{i=1}^n [ Yᵢ − θ⊤U((xᵢ − x)/h) ]² K((xᵢ − x)/h).

Moreover, the LP(ℓ) estimator of f(x) is defined by

  f̂(x) := U(0)⊤θ̂(x) = θ̂(x)₁.

Note that f̂^{NW} is simply the LP(0) estimator.
We can rewrite θ̂(x) as

  θ̂(x) = argmin_{θ ∈ R^{ℓ+1}} ( −2θ⊤a(x) + θ⊤B(x)θ ),

where the vector a(x) and the matrix B(x) are defined by

  a(x) = (1/(nh)) Σ_{i=1}^n Yᵢ U((xᵢ − x)/h) K((xᵢ − x)/h),
  B(x) = (1/(nh)) Σ_{i=1}^n U((xᵢ − x)/h) U((xᵢ − x)/h)⊤ K((xᵢ − x)/h).

If B(x) is positive definite, then the solution θ̂(x) of the quadratic program is given by

  θ̂(x) = B(x)⁻¹a(x).

Consequently, we have

  f̂(x) = Σ_{i=1}^n Yᵢ Wᵢ(x),    (4.10)

where

  Wᵢ(x) := (1/(nh)) U(0)⊤B(x)⁻¹U((xᵢ − x)/h) K((xᵢ − x)/h).    (4.11)

In particular, the LP(ℓ) estimator f̂(x) of f(x) is a linear estimator (linear in the data Yᵢ).
4.4.2 Rates of estimation for local polynomial estimators


Before analyzing the rate of estimation achieved by the local polynomial estimator, we first show
that the weights defined by (4.11) are able to “reproduce” any polynomial of degree ≤ `.

Proposition 4.12. Suppose that B(x) is positive definite. Let Q be a polynomial of degree ≤ ℓ. Then we have

  Σ_{i=1}^n Q(xᵢ) Wᵢ(x) = Q(x),

where the weights are defined in (4.11). In particular,

  Σ_{i=1}^n Wᵢ(x) = 1,  Σ_{i=1}^n (xᵢ − x)^k Wᵢ(x) = 0 for all k ∈ [ℓ].    (4.12)

Proof. Since Q is a polynomial of degree ≤ ℓ, we have

  Q(xᵢ) = Q(x) + Q′(x)(xᵢ − x) + · · · + (Q^{(ℓ)}(x)/ℓ!)(xᵢ − x)^ℓ = q(x)⊤U((xᵢ − x)/h),

where q(x) := ( Q(x), Q′(x)h, . . . , Q^{(ℓ)}(x)h^ℓ )⊤ ∈ R^{ℓ+1}. Set Yᵢ = Q(xᵢ). Then we have

  θ̂(x) = argmin_{θ ∈ R^{ℓ+1}} Σ_{i=1}^n [ Q(xᵢ) − θ⊤U((xᵢ − x)/h) ]² K((xᵢ − x)/h)
    = argmin_{θ ∈ R^{ℓ+1}} Σ_{i=1}^n [ (q(x) − θ)⊤U((xᵢ − x)/h) ]² K((xᵢ − x)/h)
    = argmin_{θ ∈ R^{ℓ+1}} (q(x) − θ)⊤B(x)(q(x) − θ).

Since B(x) is positive definite, we have θ̂(x) = q(x) and therefore f̂(x) = θ̂(x)₁ = Q(x). The reproducing property then follows from (4.10).
For (4.12), take respectively Q(t) ≡ 1 and Q(t) = (t − x)^k.

In addition, we impose an assumption: the smallest eigenvalue λ_min(B(x)) of B(x) satisfies

  λ_min(B(x)) ≥ λ₀    (4.13)

for any x ∈ [0, 1], for a constant λ₀ > 0. In particular, ‖B(x)⁻¹v‖₂ ≤ ‖v‖₂/λ₀ for any v ∈ R^{ℓ+1}.

Lemma 4.13. Under assumptions (4.9) and (4.13), the weights defined in (4.11) satisfy

• Wᵢ(x) = 0 if |x − xᵢ| > h, for any i ∈ [n];

• |Wᵢ(x)| ≤ 2/(nhλ₀) for any x ∈ [0, 1] and i ∈ [n];

• Σ_{i=1}^n |Wᵢ(x)| ≤ 8/λ₀ for any x ∈ [0, 1] if h ≥ 1/(2n).

Proof. The first statement is obvious since supp(K) ⊂ [−1, 1].
Using ‖U(0)‖₂ = 1, 0 ≤ K ≤ 1, and ‖B(x)⁻¹v‖₂ ≤ ‖v‖₂/λ₀, we obtain

  |Wᵢ(x)| ≤ (1/(nh)) ‖U(0)‖₂ ‖B(x)⁻¹U((xᵢ − x)/h)‖₂ K((xᵢ − x)/h)
    ≤ (1/(nhλ₀)) ‖U((xᵢ − x)/h)‖₂
    ≤ (1/(nhλ₀)) ‖U(1)‖₂ ≤ (1/(nhλ₀)) ( 1 + 1 + 1/2² + · · · + 1/(ℓ!)² )^{1/2} ≤ 2/(nhλ₀).

In addition, we have

  Σ_{i=1}^n |Wᵢ(x)| ≤ (2/(nhλ₀)) Σ_{i=1}^n 1{|x − xᵢ| ≤ h} ≤ (2/(nhλ₀)) (2hn + 1) ≤ 4/λ₀ + 2/(nhλ₀) ≤ 8/λ₀,

where the last inequality uses nh ≥ 1/2, finishing the proof.

We now study the rate of estimation for the local polynomial estimator f̂(x) of f(x) in terms of the mean squared risk E(f̂(x) − f(x))². To this end, we consider the bias–variance decomposition:

  E(f̂(x) − f(x))² = Bias(x)² + Var(x),

where the bias and variance of f̂(x) are given, respectively, by

  Bias(x) := E[f̂(x)] − f(x),  Var(x) := E(f̂(x) − E[f̂(x)])².
Theorem 4.14. Suppose that f : [0, 1] → R belongs to the Hölder class Σ(β, L) for β, L > 0. Consider the model Yᵢ = f(xᵢ) + εᵢ where i ∈ [n], xᵢ = i/n, and the εᵢ are independent with E[εᵢ] = 0 and E[εᵢ²] ≤ σ². Let f̂ be the LP(ℓ) estimator of f with ℓ = ⌊β⌋ and kernel K satisfying (4.9). Assume (4.13) and h ≥ 1/(2n).

• For any x ∈ [0, 1], we have the following upper bounds on the bias and the variance of f̂:

    |Bias(x)| ≤ 8Lh^β/(ℓ!λ₀),  Var(x) ≤ 16σ²/(λ₀²nh).

• As a result, for C = C(β, L, λ₀, σ) := (32σ²/λ₀²) (2L/(ℓ!σ))^{2/(2β+1)} and the bandwidth h chosen as in the proof, we have

    E(f̂(x) − f(x))² ≤ C n^{−2β/(2β+1)}.

• For g : [0, 1] → R, let ‖g‖²_{L²} := ∫₀¹ g(x)² dx. Then we have

    E‖f̂ − f‖²_{L²} ≤ C n^{−2β/(2β+1)}.

Proof. Applying (4.12) and the Taylor expansion, we obtain

  Bias(x) = Σ_{i=1}^n E[Yᵢ]Wᵢ(x) − f(x) = Σ_{i=1}^n ( f(xᵢ) − f(x) ) Wᵢ(x)
    = Σ_{i=1}^n [ Σ_{k=1}^{ℓ−1} (f^{(k)}(x)/k!)(xᵢ − x)^k + (f^{(ℓ)}(x + τᵢ(xᵢ − x))/ℓ!)(xᵢ − x)^ℓ ] Wᵢ(x)
    = Σ_{i=1}^n ( (f^{(ℓ)}(x + τᵢ(xᵢ − x)) − f^{(ℓ)}(x))/ℓ! ) (xᵢ − x)^ℓ Wᵢ(x)

for some τᵢ ∈ [0, 1], where note that we could insert the term −f^{(ℓ)}(x) in the numerator since the corresponding sum vanishes by (4.12). It follows from f ∈ Σ(β, L), Lemma 4.13, and the fact that Wᵢ(x) = 0 when |xᵢ − x| > h that

  |Bias(x)| ≤ Σ_{i=1}^n (L|xᵢ − x|^β/ℓ!) |Wᵢ(x)| ≤ (Lh^β/ℓ!) Σ_{i=1}^n |Wᵢ(x)| ≤ 8Lh^β/(ℓ!λ₀).

The variance can be bounded, using Lemma 4.13 again, as

  Var(x) = E( Σ_{i=1}^n εᵢWᵢ(x) )² = Σ_{i=1}^n Wᵢ(x)² E[εᵢ²] ≤ σ² ( max_{i∈[n]} |Wᵢ(x)| ) Σ_{i=1}^n |Wᵢ(x)| ≤ 16σ²/(λ₀²nh).

Therefore, choosing h = (ℓ!σ/(2L))^{2/(2β+1)} n^{−1/(2β+1)} yields

  E(f̂(x) − f(x))² ≤ (64L²/((ℓ!)²λ₀²)) h^{2β} + (16σ²/λ₀²)(1/(nh)) = (32σ²/λ₀²) (2L/(ℓ!σ))^{2/(2β+1)} n^{−2β/(2β+1)}.

Integrating over x ∈ [0, 1] gives the last statement.

Some remarks:

• Note that we have the rate of estimation n^{−2β/(2β+1)} for the pointwise risk E(f̂(x) − f(x))² at each x ∈ [0, 1]. This is stronger than bounding an averaged risk like E‖f̂ − f‖²_{L²}.

• As β grows, the function becomes smoother, so the rate n^{−2β/(2β+1)} improves as expected. In particular, as β → ∞, the nonparametric rate n^{−2β/(2β+1)} tends to the parametric rate 1/n.

• Here we choose h depending on the smoothness parameters β and L. In practice, we may not know how smooth the function is a priori. To address this issue, we can in fact design adaptive estimators that do not depend on these smoothness parameters.

• In dimension d, when we estimate a β-Hölder smooth function f : [0, 1]^d → R from n noisy observations, one can similarly establish the rate of estimation n^{−2β/(2β+d)}.

The name “nonparametric” simply refers to the setup where there is no obvious parameter
(like β in linear regression). In fact, it is without loss of generality to focus on the framework
of parametric estimation by viewing nonparametric estimation as a setup which has a “large”
parameter space. For example, for nonparametric regression discussed in these two sections, we
can view the Hölder class Σ(β, L) as the parameter space, and view the function f as the parameter.

Chapter 5

Information-theoretic lower bounds

In the previous chapter, we studied several regression models and proved rates of estimation for specific estimators. That is, we established an upper bound on the minimax risk of the form

  inf_{θ̂ₙ} sup_{θ∈Θ} R(θ, θ̂ₙ) ≲ rₙ

for some rate rₙ. Can we show that the minimax risk is also lower bounded by some rate sₙ which is hopefully equal to rₙ? If so, this suggests that the estimator achieving the upper bound is essentially the best we can hope for in the minimax sense.

5.1 Reduction to hypothesis testing


To establish such minimax lower bounds, we reduce the problem to hypothesis testing and use
information-theoretic tools. Such lower bounds are referred to as statistical lower bounds or
information-theoretic lower bounds.

Reduction to bounds in probability  Suppose that the risk we would like to lower bound is of the form R(θ, θ̂) = E_θ[d(θ, θ̂)²] for some pseudometric d(·, ·). If we can establish

  inf_{θ̂} sup_{θ∈Θ} P_θ{ d(θ, θ̂)² ≥ sₙ } ≥ c    (5.1)

for a universal constant c > 0, then by Markov's inequality,

  inf_{θ̂} sup_{θ∈Θ} E_θ[d(θ, θ̂)²] ≥ c sₙ.

Reduction to a finite number of parameters  For any θ₁, . . . , θ_M ∈ Θ, if we can establish

  inf_{θ̂} max_{i∈[M]} P_{θᵢ}{ d(θᵢ, θ̂)² ≥ sₙ } ≥ c,    (5.2)

then (5.1) obviously holds. The difficulty in proving lower bounds usually lies in how to appropriately choose θ₁, . . . , θ_M, which we call hypotheses.

Reduction to hypothesis testing  The crucial requirement is that d(θᵢ, θⱼ)² ≥ 4sₙ for any distinct i, j ∈ [M]. Given any estimator θ̂, consider the minimum distance test

  ψ(θ̂) := argmin_{i∈[M]} d(θᵢ, θ̂).

For any i ∈ [M] such that ψ(θ̂) ≠ i, we have

  d(θᵢ, θ̂) ≥ d(θᵢ, θ_{ψ(θ̂)}) − d(θ_{ψ(θ̂)}, θ̂) ≥ d(θᵢ, θ_{ψ(θ̂)}) − d(θᵢ, θ̂),

so that

  d(θᵢ, θ̂) ≥ (1/2) d(θᵢ, θ_{ψ(θ̂)}) ≥ √sₙ.

Therefore, we obtain

  inf_{θ̂} max_{i∈[M]} P_{θᵢ}{ d(θᵢ, θ̂)² ≥ sₙ } ≥ inf_{θ̂} max_{i∈[M]} P_{θᵢ}{ ψ(θ̂) ≠ i } ≥ inf_ψ max_{i∈[M]} P_{θᵢ}{ ψ ≠ i },

where the infimum is taken over all tests ψ that are measurable with respect to the observations and take values in [M]. We have proved the following theorem.

Theorem 5.1. Let θ₁, . . . , θ_M ∈ Θ be such that d(θᵢ, θⱼ)² ≥ 4sₙ for any distinct i, j ∈ [M]. Then

  inf_{θ̂} sup_{θ∈Θ} E_θ[d(θ, θ̂)²] ≥ sₙ · inf_ψ max_{i∈[M]} P_{θᵢ}{ ψ ≠ i },

where the infimum on the right-hand side is taken over all tests ψ that are measurable with respect to the observations and take values in [M].

5.2 Le Cam’s two-point method


5.2.1 General theory
To study lower bounds for hypothesis testing, let us start with the simplest case where we have two probability measures P₀ and P₁. Suppose that P₀ and P₁ have densities p₀ and p₁ respectively, with respect to a measure µ. We write ∫f = ∫f(x) dµ(x). Given an observation X from either P₀ or P₁, consider a test ψ = ψ(X) ∈ {0, 1}.

Lemma 5.2 (Neyman–Pearson). For any test ψ, the sum of the type I error and the type II error satisfies

  P₀{ψ = 1} + P₁{ψ = 0} ≥ ∫min(p₀, p₁).

Moreover, equality holds for the likelihood ratio test ψ∗ := 1{p₁/p₀ ≥ 1}.

This is the Neyman–Pearson lemma, although the name sometimes refers to a different formulation.

Proof. First note that

  P₀{ψ∗ = 1} + P₁{ψ∗ = 0} = ∫_{ψ∗=1} p₀ + ∫_{ψ∗=0} p₁
    = ∫_{p₁≥p₀} p₀ + ∫_{p₁<p₀} p₁
    = ∫_{p₁≥p₀} min(p₀, p₁) + ∫_{p₁<p₀} min(p₀, p₁)
    = ∫min(p₀, p₁).

Next, for any test ψ, define R := {ψ = 1}. Also define R∗ := {p₁ ≥ p₀}. Then we have

  P₀{ψ = 1} + P₁{ψ = 0} = P₀{R} + 1 − P₁{R}
    = 1 + ∫_R (p₀ − p₁)
    = 1 + ∫_{R∩R∗} (p₀ − p₁) + ∫_{R∩(R∗)^c} (p₀ − p₁)
    = 1 − ∫_{R∩R∗} |p₀ − p₁| + ∫_{R∩(R∗)^c} |p₀ − p₁|
    = 1 − ∫ |p₀ − p₁| ( 1{R ∩ R∗} − 1{R ∩ (R∗)^c} ),

which is minimized at R = R∗.

The total variation distance between P₀ and P₁ is defined as any of the following quantities:

  TV(P₀, P₁) = (1/2)∫|p₀ − p₁| = 1 − ∫min(p₀, p₁) = 1 − inf_ψ ( P₀{ψ = 1} + P₁{ψ = 0} ).

The equivalence of the first two definitions is proved as a homework problem. The above lemma gives the second equivalence.
Combining Theorem 5.1 and Lemma 5.2 with the definition of the total variation distance, we have established Le Cam's two-point bound.

Theorem 5.3 (Le Cam). For any θ₀, θ₁ ∈ Θ, we have

  inf_{θ̂} sup_{θ∈Θ} E_θ[d(θ, θ̂)²] ≥ (1/8) d(θ₀, θ₁)² ( 1 − TV(P_{θ₀}, P_{θ₁}) ).

A couple of remarks: The constant 1/8 can be refined to 1/4 with an improved argument. Moreover, by the chain of inequalities between f-divergences proved in the homework, TV in the above theorem can be replaced by H, √KL, or √(χ²), which are typically easier to compute.
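The equivalent expressions for TV, and the optimality of the likelihood ratio test, can be checked on a small discrete example (illustrative distributions on five points, with µ the counting measure; not part of the notes):

```python
import numpy as np

# two distributions on {0, ..., 4}
p0 = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
p1 = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
tv_l1 = 0.5 * np.sum(np.abs(p0 - p1))
tv_min = 1.0 - np.sum(np.minimum(p0, p1))
psi = (p1 >= p0).astype(float)                        # likelihood ratio test psi*
errors = np.sum(p0 * psi) + np.sum(p1 * (1.0 - psi))  # type I + type II error
```

Here tv_l1 = tv_min = 0.5, and the summed errors of ψ∗ equal 1 − TV, as Lemma 5.2 predicts.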

5.2.2 Lower bounds for nonparametric regression at a point


We establish a lemma which will be useful later.

Lemma 5.4. We have

  KL( N(µ₁, σ²I_d), N(µ₂, σ²I_d) ) = ‖µ₁ − µ₂‖₂² / (2σ²).

Proof. The one-dimensional case follows from the direct computation

  KL( N(µ₁, σ²), N(µ₂, σ²) ) = ∫ (1/√(2πσ²)) e^{−(x−µ₁)²/(2σ²)} [ −(x − µ₁)²/(2σ²) + (x − µ₂)²/(2σ²) ] dx = (µ₁ − µ₂)²/(2σ²).

The multivariate case follows from the tensorization property of KL established in the homework:

  KL( N(µ₁, σ²I_d), N(µ₂, σ²I_d) ) = Σ_{i=1}^d KL( N((µ₁)ᵢ, σ²), N((µ₂)ᵢ, σ²) ) = ‖µ₁ − µ₂‖₂²/(2σ²).
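The closed form can be checked against a direct numerical integration of ∫ p log(p/q) in one dimension (illustrative parameters, not from the notes):

```python
import numpy as np

mu1, mu2, sigma = 0.3, -0.5, 0.7
x = np.linspace(-8.0, 8.0, 400001)
p = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
q = np.exp(-(x - mu2) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])   # Riemann sum of the KL integrand
kl_formula = (mu1 - mu2) ** 2 / (2 * sigma ** 2)
```

The two quantities agree up to discretization error, in line with Lemma 5.4.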

Consider the nonparametric regression model (4.8), where we assume that xᵢ = i/n and the εᵢ are i.i.d. N(0, σ²) noise for i ∈ [n]. We aim to establish a minimax lower bound over the Hölder class Σ(β, L) where β, L > 0, for the distance d(f, g) = |f(x₀) − g(x₀)| at a fixed point x₀ ∈ [0, 1].

Theorem 5.5. For any x₀ ∈ [0, 1], there exists a constant c = c₂(β) L^{2/(2β+1)} σ^{4β/(2β+1)} > 0 such that

  inf_{f̂} sup_{f∈Σ(β,L)} E_f[ (f̂(x₀) − f(x₀))² ] ≥ c n^{−2β/(2β+1)}.

Proof. Let us consider the hypotheses f₀ ≡ 0 and

  f₁(x) = Lh^β K((x − x₀)/h)

for all x ∈ [0, 1], where

  h = c₀ n^{−1/(2β+1)},  c₀ = c₀(β, L, σ) > 0,

and the function K is defined as

  K(u) = c₁ exp( −1/(1 − 4u²) ) 1{|u| ≤ 1/2},  c₁ = c₁(β) > 0.    (5.3)

Smoothness  First, we check that f₁ ∈ Σ(β, L). For ℓ = ⌊β⌋, we have

  f₁^{(ℓ)}(x) = Lh^{β−ℓ} K^{(ℓ)}((x − x₀)/h).

Moreover, we take c₁ > 0 to be sufficiently small depending on β, so that |K^{(ℓ+1)}(u)| ≤ 1. Then for −1/2 ≤ u, u′ ≤ 1/2, it holds that

  |K^{(ℓ)}(u) − K^{(ℓ)}(u′)| ≤ |u − u′| ≤ |u − u′|^{β−ℓ}.

Therefore, we obtain

  |f₁^{(ℓ)}(x) − f₁^{(ℓ)}(x′)| ≤ Lh^{β−ℓ} |(x − x′)/h|^{β−ℓ} = L|x − x′|^{β−ℓ}.
Separation  We have

  d(f₀, f₁) = f₁(x₀) = Lh^β K(0) = (c₁/e) L c₀^β n^{−β/(2β+1)}.

KL divergence  The joint distribution of (Y₁, . . . , Yₙ) is N(0, σ²Iₙ) under f₀, and it is

  ⊗_{i=1}^n N(f₁(xᵢ), σ²) = N( (f₁(xᵢ))_{i=1}^n , σ²Iₙ )

under f₁. By Lemma 5.4, we have

  KL(P_{f₀}, P_{f₁}) = Σ_{i=1}^n f₁(xᵢ)²/(2σ²) = (L²h^{2β}/(2σ²)) Σ_{i=1}^n K((xᵢ − x₀)/h)²
    ≤ (L²h^{2β}c₁²/(2σ²e²)) Σ_{i=1}^n 1{|xᵢ − x₀| ≤ h/2}
    ≤ (L²h^{2β}c₁²/(2σ²e²)) (nh + 1) ≤ (L²c₁²/(σ²e²)) c₀^{2β+1} = 1/4,

if c₀ = ( σ²e²/(4L²c₁²) )^{1/(2β+1)}. As a result, TV(P_{f₀}, P_{f₁}) ≤ √KL(P_{f₀}, P_{f₁}) ≤ 1/2 by a homework problem. The proof is complete thanks to Theorem 5.3.

Note that this minimax lower bound matches the pointwise upper bound in Theorem 4.14 up
to a constant factor. However, the two-point method is not sufficient for establishing a matching
lower bound on the integrated error kfˆ − f k2L2 .

5.3 Assouad’s lemma


5.3.1 General theory
To exhibit the difficulty of applying the two-point method in multivariate estimation, let us consider estimating µ ∈ R^d given i.i.d. observations X₁, . . . , Xₙ ∼ N(µ, I_d). The empirical mean X̄ achieves the minimax risk:

  E‖X̄ − µ‖₂² = Σ_{i=1}^d E(X̄ᵢ − µᵢ)² = d/n.

To establish a lower bound, we apply the two-point method on the hypotheses P₀ = ⊗_{i=1}^n N(v, I_d) and P₁ = ⊗_{i=1}^n N(w, I_d). Then

  KL(P₀, P₁) = Σ_{i=1}^n KL( N(v, I_d), N(w, I_d) ) = (n/2) ‖v − w‖₂².

Therefore, if we choose v, w ∈ R^d such that ‖v − w‖₂² = 1/n, then

  inf_{µ̂} sup_{µ∈R^d} E‖µ̂ − µ‖₂² ≳ 1/n.

This is optimal in the sample size n but not in the dimension d, unless d is a constant.
One powerful tool for proving high-dimensional lower bounds is the following theorem called
Assouad’s lemma.

Theorem 5.6 (Assouad). Let {P_ω : ω ∈ {0,1}^d} be a set of 2^d probability measures, and let E_ω denote the corresponding expectations. Then

  inf_{ω̂} max_{ω∈{0,1}^d} E_ω ρ(ω̂, ω) ≥ (d/2) min_{ρ(ω,ω′)=1} ( 1 − TV(P_ω, P_{ω′}) ),

where the infimum is over all estimators ω̂ taking values in {0,1}^d, and ρ(ω̂, ω) = Σ_{i=1}^d 1{ω̂ᵢ ≠ ωᵢ} denotes the Hamming distance between ω̂ and ω.
Proof. We have

  max_{ω∈{0,1}^d} E_ω ρ(ω̂, ω) ≥ (1/2^d) Σ_{ω∈{0,1}^d} E_ω ρ(ω̂, ω)
    = (1/2^d) Σ_{ω∈{0,1}^d} Σ_{i=1}^d E_ω 1{ω̂ᵢ ≠ ωᵢ} = (1/2^d) Σ_{i=1}^d Σ_{ω∈{0,1}^d} P_ω{ω̂ᵢ ≠ ωᵢ}.

Let ω₋ᵢ ∈ {0,1}^{d−1} denote the subvector of ω with its ith entry removed. Let (ω₋ᵢ, 1) denote the vector ω whose ith entry is equal to 1, and similarly for (ω₋ᵢ, 0). Then

  max_{ω∈{0,1}^d} E_ω ρ(ω̂, ω) ≥ (1/2^d) Σ_{i=1}^d Σ_{ω₋ᵢ∈{0,1}^{d−1}} ( P_{(ω₋ᵢ,1)}{ω̂ᵢ = 0} + P_{(ω₋ᵢ,0)}{ω̂ᵢ = 1} )
    ≥ (1/2^d) Σ_{i=1}^d Σ_{ω₋ᵢ∈{0,1}^{d−1}} ( 1 − TV( P_{(ω₋ᵢ,1)}, P_{(ω₋ᵢ,0)} ) )
    ≥ (d/2) min_{ρ(ω,ω′)=1} ( 1 − TV(P_ω, P_{ω′}) ).

Lemma 5.7. In the problem of estimating θ ∈ Θ where Θ is a closed set, let θ̃ denote an estimator that takes values in Θ, and let θ̂ denote an arbitrary estimator. Then we have

  (1/4) inf_{θ̃} sup_{θ∈Θ} E[d(θ, θ̃)²] ≤ inf_{θ̂} sup_{θ∈Θ} E[d(θ, θ̂)²] ≤ inf_{θ̃} sup_{θ∈Θ} E[d(θ, θ̃)²].

Proof. The second inequality is trivial. Let us focus on the first inequality. Consider an arbitrary estimator θ̂. Define

  θ̃ := argmin_{θ∈Θ} d(θ, θ̂),

which is an estimator that takes values in Θ. Then for any θ ∈ Θ, using d(θ̃, θ̂) ≤ d(θ, θ̂),

  d(θ̃, θ)² ≤ ( d(θ̃, θ̂) + d(θ̂, θ) )² ≤ 2d(θ̃, θ̂)² + 2d(θ̂, θ)² ≤ 4d(θ̂, θ)².

As a result,

  sup_{θ∈Θ} E[d(θ, θ̃)²] ≤ 4 sup_{θ∈Θ} E[d(θ, θ̂)²].

Taking an infimum over θ̂ and then over θ̃ completes the proof.

Corollary 5.8. Suppose that to each ω ∈ {0,1}^d, we can associate a parameter θ_ω ∈ Θ such that

  d(θ_ω, θ_{ω′})² ≥ cₙ ρ(ω, ω′)

for a constant cₙ > 0 that may depend on the sample size n. Let P_ω = P_{θ_ω} denote the model at θ_ω for ω ∈ {0,1}^d. Then we have

  inf_{θ̂} sup_{θ∈Θ} E[d(θ̂, θ)²] ≥ (cₙd/8) min_{ρ(ω,ω′)=1} ( 1 − TV(P_ω, P_{ω′}) ).

Proof. By Assouad's lemma, we have

  inf_{θ̃} max_{ω∈{0,1}^d} E[d(θ̃, θ_ω)²] ≥ cₙ inf_{ω̂} max_{ω∈{0,1}^d} E_ω ρ(ω̂, ω) ≥ (cₙd/2) min_{ρ(ω,ω′)=1} ( 1 − TV(P_ω, P_{ω′}) ),

where the first infimum is over all θ̃ taking values in {θ_ω : ω ∈ {0,1}^d}. Lemma 5.7 implies

  inf_{θ̂} sup_{θ∈Θ} E[d(θ̂, θ)²] ≥ (1/4) inf_{θ̃} sup_{θ∈Θ} E[d(θ̃, θ)²] ≥ (1/4) inf_{θ̃} max_{ω∈{0,1}^d} E[d(θ̃, θ_ω)²],

where the first infimum is over arbitrary estimators. It suffices to combine the two bounds.

5.3.2 Applications
Gaussian mean estimation The typically way of applying Assouad’s lemma is to associate
each ω with a parameter. For example, for the above Gaussian mean estimation problem, we define

µω = ω/ n ∈ Rd for each ω ∈ {0, 1}d . Then
ρ(ω, ω 0 ) = kω − ω 0 k22 = nkµω − µω0 k22 ,
and Pω = ⊗ni=1 N(µω , Id ). If 1 = ρ(ω, ω0 ) = nkµω − µω0 k22 , then by Lemma 5.4,
q √
TV(Pω , Pω0 ) ≤ KL(Pω , Pω0 ) = n · (1/2) · kµω − µω0 k22 = 1/ 2.
p

Therefore, Corollary 5.8 implies that


inf_{µ̂} sup_{µ∈R^d} E[‖µ̂ − µ‖₂²] ≳ (1/n) inf_{ω̂} max_{ω∈{0,1}^d} Eω[ρ(ω̂, ω)] ≳ d/n.
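As a numerical sanity check on the total variation bound above, recall the standard closed form TV(N(m₁, I), N(m₂, I)) = 2Φ(‖m₁ − m₂‖₂/2) − 1 (a standard fact, not derived in these notes); for neighboring hypotheses, the mean shift of the joint n-sample Gaussian has norm √n‖µω − µω′‖₂ = 1. A minimal sketch:

```python
import math

def std_normal_cdf(x):
    # Phi(x) via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tv_gaussians(mean_dist):
    """Exact TV distance between N(m1, I) and N(m2, I) with ||m1 - m2||_2 = mean_dist."""
    return 2.0 * std_normal_cdf(mean_dist / 2.0) - 1.0

# Neighboring hypotheses rho(w, w') = 1 give a joint mean shift of norm 1
tv = tv_gaussians(1.0)
print(f"TV = {tv:.4f} <= 1/sqrt(2) = {1 / math.sqrt(2):.4f}")
assert tv <= 1 / math.sqrt(2)  # the sqrt(KL) bound is loose but valid here
```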

Linear regression Consider the linear regression model Y = Xβ ∗ + ε where ε ∼ N(0, σ 2 In ). Let
r be the rank of X ∈ Rn×d , and let U ∈ Rn×r be a matrix whose columns form an orthonormal
basis of the column space of X. We associate each ω ∈ {0, 1}r with a vector βω ∈ Rd such that
Uω = (1/σ) Xβω.

Then we have

ρ(ω, ω′) = ‖ω − ω′‖₂² = ‖Uω − Uω′‖₂² = (1/σ²)‖Xβω − Xβω′‖₂².

In addition, Pω = N(Xβω, σ² In). Hence, if 1 = ρ(ω, ω′) = (1/σ²)‖Xβω − Xβω′‖₂², then by Lemma 5.4,

TV(Pω, Pω′) ≤ √(KL(Pω, Pω′)) = √((1/(2σ²))‖Xβω − Xβω′‖₂²) = 1/√2.

Applying Corollary 5.8, we obtain
inf_{β̂} sup_{β∈R^d} Eβ[(1/n)‖Xβ̂ − Xβ‖₂²] ≳ (σ²/n) inf_{ω̂} max_{ω∈{0,1}^r} Eω[ρ(ω̂, ω)] ≳ σ² r/n.

5.4 Fano’s inequality
5.4.1 General theory
We move on to study minimax lower bounds based on multiple hypothesis testing. Recall that to
apply Theorem 5.1, we need to find separated parameters θ1 , . . . , θM ∈ Θ such that

inf_ψ max_{i∈[M]} Pi{ψ ≠ i} ≥ c,

where we write Pi = Pθi and the infimum is over all tests ψ. To this end, we use a special case of
Fano’s inequality. Let us start with a lemma.

Lemma 5.9. Define a function h(p, q) for p, q ∈ (0, 1) by

h(p, q) := KL(Ber(p), Ber(q)) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)).

Then h is convex.

Proof. We first show that the function (p, q) ↦ p log(p/q) is convex for p, q > 0. The Hessian of this function is

H = [  1/p    −1/q  ]
    [ −1/q    p/q²  ].

We have det(H) = 0 and tr(H) > 0, so H is positive semidefinite.
Moreover, since the composition of a convex function with a linear function is convex, and a
sum of two convex functions is convex, we see that h is convex.
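The joint convexity of h can also be checked numerically by testing the midpoint inequality on a small grid of pairs; a minimal sketch:

```python
import math

def h(p, q):
    """KL divergence between Ber(p) and Ber(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Midpoint-convexity check of (p, q) -> h(p, q) on a grid of pairs
pts = [(0.1, 0.3), (0.5, 0.2), (0.8, 0.6), (0.25, 0.9)]
for (p1, q1) in pts:
    for (p2, q2) in pts:
        mid = h((p1 + p2) / 2, (q1 + q2) / 2)
        avg = (h(p1, q1) + h(p2, q2)) / 2
        assert mid <= avg + 1e-12
print("midpoint convexity of h holds on the grid")
```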

Theorem 5.10 (Data processing). Let P and Q be two probability measures that are absolutely
continuous with respect to each other. For X ∼ P, Y ∼ Q, and a function g, we have

KL(g(X), g(Y )) ≤ KL(X, Y ) = KL(P, Q).

Proof. Let fX denote the density of X. Then fX can be identified with f_{X,g(X)} = f_{g(X)} f_{X|g(X)}. It follows that

E_X[log(fX/fY)] = E_{X,g(X)}[log(f_{X,g(X)}/f_{Y,g(Y)})] = E_{X,g(X)}[log(f_{g(X)}/f_{g(Y)}) + log(f_{X|g(X)}/f_{Y|g(Y)})]
= E_{g(X)}[log(f_{g(X)}/f_{g(Y)})] + E_{g(X)}[ E_X[ log(f_{X|g(X)}/f_{Y|g(Y)}) | g(X) ] ] ≥ E_{g(X)}[log(f_{g(X)}/f_{g(Y)})],

where we used that the conditional KL divergence E_X[log(f_{X|g(X)}/f_{Y|g(Y)}) | g(X)] is nonnegative.
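The data processing inequality is easy to illustrate for discrete distributions, where KL divergences are finite sums. A sketch with a hypothetical coarsening map g that keeps only the parity of X:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pushforward(p, g, out_size):
    """Distribution of g(X) when X ~ p on {0, ..., len(p) - 1}."""
    out = [0.0] * out_size
    for i, pi in enumerate(p):
        out[g(i)] += pi
    return out

P = [0.1, 0.4, 0.2, 0.3]
Q = [0.25, 0.25, 0.25, 0.25]
g = lambda i: i % 2  # many-to-one map: only the parity of X is retained

kl_full = kl(P, Q)
kl_coarse = kl(pushforward(P, g, 2), pushforward(Q, g, 2))
print(f"KL(X, Y) = {kl_full:.4f} >= KL(g(X), g(Y)) = {kl_coarse:.4f}")
assert kl_coarse <= kl_full + 1e-12
```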

Theorem 5.11 (Fano’s inequality). Let P1 , . . . , PM be probability measures that are absolutely
continuous with respect to each other. Then we have

inf_ψ max_{i∈[M]} Pi{ψ ≠ i} ≥ 1 − ( (1/M²) Σ_{i,j=1}^M KL(Pi, Pj) + log 2 ) / log M,

where the infimum is over all tests ψ that take values in [M ].

Proof. Fix a test ψ. Let pi := Pi{ψ = i} and qi := (1/M) Σ_{j=1}^M Pj{ψ = i}. Moreover, let

p̄ = (1/M) Σ_{i=1}^M pi = (1/M) Σ_{i=1}^M Pi{ψ = i},    q̄ = (1/M) Σ_{i=1}^M qi = 1/M.

We claim that for any test ψ,

1 − max_{i∈[M]} Pi{ψ ≠ i} ≤ 1 − (1/M) Σ_{i=1}^M Pi{ψ ≠ i} = p̄ ≤ ( (1/M²) Σ_{i,j=1}^M KL(Pi, Pj) + log 2 ) / log M.

It suffices to prove the last inequality.


Using the inequality
−p̄ log p̄ − (1 − p̄) log(1 − p̄) ≤ log 2
(which says that the entropy of Ber(p) is maximized at p = 1/2), we obtain
h(p̄, q̄) + log 2 ≥ p̄ log(p̄/q̄) + (1 − p̄) log((1 − p̄)/(1 − q̄)) − p̄ log p̄ − (1 − p̄) log(1 − p̄)
= p̄ log(1/q̄) + (1 − p̄) log(1/(1 − q̄)) ≥ p̄ log(1/q̄) = p̄ log M,

and thus

p̄ ≤ (h(p̄, q̄) + log 2) / log M.
Moreover, the convexity of h yields
h(p̄, q̄) ≤ (1/M) Σ_{i=1}^M h(pi, qi) ≤ (1/M²) Σ_{i,j=1}^M h(Pi{ψ = i}, Pj{ψ = i}).

Hence it remains to show that

h(Pi {ψ = i}, Pj {ψ = i}) ≤ KL(Pi , Pj ).

Let Xi denote the observation under Pi for i ∈ [M ]. The above inequality is equivalent to
KL(1{ψ(Xi ) = i}, 1{ψ(Xj ) = i}) ≤ KL(Xi , Xj ),

which holds thanks to the data processing inequality.

Combining Theorem 5.1 with Fano’s inequality, we obtain the following corollary.
Corollary 5.12. Suppose that for θ1 , . . . , θM ∈ Θ, we have
d(θi, θj)² ≥ 4sn,    KL(Pθi, Pθj) ≤ (1/2) log M − log 2
for any distinct i, j ∈ [M ]. Then it holds that

inf_{θ̂} sup_{θ∈Θ} Eθ[d(θ, θ̂)²] ≥ sn/2.

See Theorem 2.5 of [Tsy08] for a more precisely stated version.
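To see the testing bound in action numerically, one can evaluate the right-hand side of Theorem 5.11 for a concrete family. The sketch below uses hypothetical Gaussian hypotheses Pi = N(µi, I)^⊗n, for which KL(Pi, Pj) = (n/2)‖µi − µj‖₂²; the means are illustrative, not from the notes.

```python
import math

def fano_test_error_bound(kl_matrix):
    """Right-hand side of Fano's inequality: 1 - (avg pairwise KL + log 2) / log M."""
    M = len(kl_matrix)
    avg_kl = sum(sum(row) for row in kl_matrix) / M ** 2
    return 1.0 - (avg_kl + math.log(2)) / math.log(M)

def kl_gauss(mu_i, mu_j, n):
    """KL between n-fold products of N(mu_i, I) and N(mu_j, I)."""
    return 0.5 * n * sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))

n, M, eps = 100, 16, 0.05
# hypothetical means: eps-scaled standard basis vectors of R^M
mus = [[eps if j == i else 0.0 for j in range(M)] for i in range(M)]
K = [[kl_gauss(mus[i], mus[j], n) for j in range(M)] for i in range(M)]
bound = fano_test_error_bound(K)
print(f"any test errs with probability at least {bound:.3f}")
assert 0 < bound < 1
```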

5.4.2 Application to Gaussian mean estimation
Let us again consider estimating µ ∈ Rd given i.i.d. X1 , . . . , Xn ∼ N(µ, Id ). Recall that
KL(Pµ, Pµ′) = (n/2)‖µ − µ′‖₂².
Therefore, we would like to choose µ1 , . . . , µM such that

4sn ≤ ‖µi − µj‖₂² ≤ (1/n)(log M − 2 log 2).
On the one hand, we need many µi so that M is large. On the other hand, if there are too many µi packed together, the separation sn becomes too small and so does the lower bound. We need to balance these two competing requirements.
Let us introduce the notions of ε-packing and ε-net.

Definition 5.13. A set N ⊂ B ⊂ R^d is called an ε-packing of B in the Euclidean distance if ‖µ − µ′‖₂ ≥ ε for any distinct µ, µ′ ∈ N.
A set N ⊂ B ⊂ R^d is called an ε-net of B in the Euclidean distance if for every µ ∈ B, there exists µ′ ∈ N such that ‖µ − µ′‖₂ ≤ ε.

We will not prove the following result, but the intuition is clear by considering the ratio of
volumes.

Lemma 5.14. Let B^d denote the unit ball in R^d. There exists an ε-packing N of B^d, which is also an ε-net of B^d, such that

(1/ε)^d ≤ |N| ≤ (3/ε)^d.

Note that in a homework problem, we assume that there is a 1/4-net N of the unit sphere S^{d−1} in R^d such that |N| ≤ 12^d. This is simply replacing B^d with the subset S^{d−1} and setting ε = 1/4 in the above lemma.
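The construction behind Lemma 5.14 can be illustrated in low dimension by the usual greedy argument: keep any candidate point at distance at least ε from all points kept so far, producing a maximal ε-packing whose size can be compared to the volumetric bounds. A sketch with hypothetical grid parameters:

```python
import itertools
import math

def greedy_packing(candidates, eps):
    """Greedily keep each candidate that is at distance >= eps from all kept points."""
    centers = []
    for x in candidates:
        if all(math.dist(x, c) >= eps for c in centers):
            centers.append(x)
    return centers

d, eps, step = 2, 0.5, 0.05
grid = [(i * step, j * step)
        for i in range(-20, 21) for j in range(-20, 21)
        if (i * step) ** 2 + (j * step) ** 2 <= 1.0]  # fine grid inside B^2
N = greedy_packing(grid, eps)

# packing property: pairwise distances are at least eps
assert all(math.dist(a, b) >= eps for a, b in itertools.combinations(N, 2))
# size is consistent with the volumetric bounds of Lemma 5.14
assert (1 / eps) ** d <= len(N) <= (3 / eps) ** d
print(f"|N| = {len(N)}, volumetric bounds [{(1 / eps) ** d:.0f}, {(3 / eps) ** d:.0f}]")
```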
With the lemma given, let us take a 1/4-packing N = {θ1, . . . , θM} ⊂ B^d where M ≥ 4^d. Set µi = c√(d/n) θi for each i ∈ [M] and some constant c > 0 to be determined. Then by definition, we can set sn = (c²/4³)(d/n) so that

‖µi − µj‖₂² = c²(d/n)‖θi − θj‖₂² ≥ (c²/4²)(d/n) = 4sn.
On the other hand,

‖µi − µj‖₂² = c²(d/n)‖θi − θj‖₂² ≤ 4c²d/n ≤ (1/n)(log M − 2 log 2)
if we choose c > 0 to be a sufficiently small constant. We conclude that

inf_{µ̂} sup_{µ∈R^d} E[‖µ̂ − µ‖₂²] ≥ sn/2 ≳ d/n.

5.4.3 Application to nonparametric regression
Lemma 5.15 (Hoeffding’s lemma). Suppose that a random variable X has mean zero and satisfies
a ≤ X ≤ b for constants a, b ∈ R. Then, for any λ ∈ R, we have
E[e^{λX}] ≤ exp( λ²(b − a)²/8 ).
Proof. By convexity, it holds that

e^{λX} ≤ ((b − X)/(b − a)) e^{λa} + ((X − a)/(b − a)) e^{λb},

so

E[e^{λX}] ≤ (b/(b − a)) e^{λa} − (a/(b − a)) e^{λb} = e^{L(λ(b−a))},    where L(t) := at/(b − a) + log( 1 + a(1 − e^t)/(b − a) ).

We have

L′(t) = ab(e^t − 1) / ((a − b)(b − ae^t)),    L″(t) = −ab e^t / (b − ae^t)² ≤ 1/4,

where the last inequality follows from AM–GM since a ≤ 0 ≤ b. Since L(0) = L′(0) = 0, the second-order Taylor expansion gives L(t) ≤ t²/8 for all t ∈ R.
Lemma 5.16 (Hoeffding’s inequality). Let X1 , . . . , Xn be independent random variables taking
values in [0, 1]. Then for all t > 0,
P{ Σ_{i=1}^n (Xi − E[Xi]) ≤ −t } ≤ exp(−2t²/n).

Proof. Use Hoeffding's lemma, which gives E[e^{λ(Xi − E[Xi])}] ≤ exp(λ²/8), together with Chernoff's bound.
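For Ber(1/2) variables the left tail can be computed exactly, which lets us verify Hoeffding's inequality without simulation: Σᵢ(Xi − E[Xi]) ≤ −t is exactly the event Bin(n, 1/2) ≤ n/2 − t. A minimal sketch:

```python
import math

def binom_tail_le(n, m):
    """P(Bin(n, 1/2) <= m), computed exactly."""
    return sum(math.comb(n, j) for j in range(m + 1)) / 2 ** n

def hoeffding_bound(n, t):
    """Hoeffding's bound exp(-2 t^2 / n) for variables taking values in [0, 1]."""
    return math.exp(-2 * t ** 2 / n)

for n, t in [(20, 4), (100, 20), (200, 25)]:
    exact = binom_tail_le(n, n // 2 - t)
    bound = hoeffding_bound(n, t)
    print(f"n={n}, t={t}: exact tail {exact:.2e} <= bound {bound:.2e}")
    assert exact <= bound
```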
Lemma 5.17 (Varshamov–Gilbert bound). Let d ≥ 8. There exists {ω1, . . . , ωM} ⊂ {0, 1}^d such that ρ(ωi, ωj) ≥ d/8 for any distinct i, j ∈ [M] and M ≥ e^{d/8}, where ρ(·, ·) denotes the Hamming distance.
Proof. Let ωi,k be i.i.d. Ber(1/2) for i ∈ [M ] and k ∈ [d]. Consider the event
E := {ρ(ωi , ωj ) ≥ d/8 for any distinct i, j ∈ [M ]}.
It suffices to show that P{E} > 0, i.e., P{E^c} < 1. (This is called the probabilistic method.)
For any distinct i, j ∈ [M], ρ(ωi, ωj) = Σ_{k=1}^d 1{ωi,k ≠ ωj,k}, so it is a sum of d i.i.d. Ber(1/2) random variables. By a union bound and Hoeffding's inequality with t = 3d/8, we obtain
P{E^c} ≤ Σ_{i,j∈[M], i≠j} P{ρ(ωi, ωj) < d/8} ≤ M² exp(−9d/32).

With M = ⌈e^{d/8}⌉, it is not hard to see that this is strictly less than 1, as taking the logarithm gives roughly d/4 − 9d/32 = −d/32 < 0.
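The probabilistic argument suggests a simple randomized construction. The sketch below uses a greedy variant (keep a random string only if it is d/8-far from all strings kept so far), which is not exactly the one-shot argument of the proof but produces a valid codebook:

```python
import math
import random

def random_code(d, M, seed=0):
    """Sample random binary strings, greedily keeping those at Hamming
    distance >= d/8 from every string kept so far, until M are collected."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < M:
        w = tuple(rng.randint(0, 1) for _ in range(d))
        if all(sum(a != b for a, b in zip(w, v)) >= d / 8 for v in kept):
            kept.append(w)
    return kept

d = 40
M = math.ceil(math.exp(d / 8))  # e^{d/8}, about 149 codewords
code = random_code(d, M)
min_dist = min(sum(a != b for a, b in zip(u, v))
               for i, u in enumerate(code) for v in code[i + 1:])
print(f"built {len(code)} codewords in {{0,1}}^{d} with min Hamming distance {min_dist}")
assert len(code) == M and min_dist >= d / 8
```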
Theorem 5.18. Consider the nonparametric regression model (4.8), where f ∈ Σ(β, L) for β, L > 0, xi = i/n, and εi are i.i.d. N(0, σ²) noise for i ∈ [n]. For a constant c = c₃(β) L^{2/(2β+1)} σ^{4β/(2β+1)} > 0, the following minimax lower bound in the squared L² distance holds over the Hölder class:

inf_{f̂} sup_{f∈Σ(β,L)} E‖f̂ − f‖²_{L²} ≥ c n^{−2β/(2β+1)}.
Proof. This time, for the squared L² distance d(f, g)² = ‖f − g‖²_{L²} = ∫₀¹ (f(x) − g(x))² dx, the proof is based on multiple hypothesis testing over {f1, . . . , fM} ⊂ Σ(β, L).

Construction of hypotheses  Fix a constant C₀ = C₀(β, L, σ) > 0 to be determined later. Let

d = ⌈C₀ n^{1/(2β+1)}⌉,    h = 1/d,    z_k = (k − 1/2)/d,    φ_k(x) = L h^β K((x − z_k)/h),    k ∈ [d], x ∈ [0, 1],

where K is defined by (5.3). Recall that using the proof of Theorem 5.5, we can show that φ_k ∈ Σ(β, L/2) if the constant c₁(β) > 0 in (5.3) is taken to be sufficiently small. Moreover, φ_k is supported in [z_k − h/2, z_k + h/2] = [(k − 1)/d, k/d] for each k ∈ [d].
Let ω1, . . . , ωM be given by Lemma 5.17. For each i ∈ [M], we define fi(x) := Σ_{k=1}^d ωi,k φ_k(x). Since the supports of the φ_k are disjoint (up to a set of measure zero), it is easily seen that fi ∈ Σ(β, L).

Separation  For distinct i, j ∈ [M], we have

‖fi − fj‖²_{L²} = ∫₀¹ (fi(x) − fj(x))² dx = ∫₀¹ ( Σ_{k=1}^d (ωi,k − ωj,k) φ_k(x) )² dx
= Σ_{k=1}^d (ωi,k − ωj,k)² ∫₀¹ φ_k(x)² dx = L² h^{2β+1} ‖K‖²_{L²} ρ(ωi, ωj).

As a result of Lemma 5.17, for c₂ = c₂(β) = ‖K‖²_{L²} > 0,

‖fi − fj‖²_{L²} ≥ L² h^{2β+1} c₂ d/8 ≳ L² c₂ C₀^{−2β} n^{−2β/(2β+1)}.

KL divergence  Finally, we check

KL(P_{fi}, P_{fj}) = Σ_{ℓ=1}^n (fi(x_ℓ) − fj(x_ℓ))²/(2σ²) = (1/(2σ²)) Σ_{ℓ=1}^n Σ_{k=1}^d (ωi,k − ωj,k)² φ_k(x_ℓ)²
≤ (L² h^{2β} c₁²/(2σ²)) Σ_{ℓ=1}^n Σ_{k=1}^d 1{|x_ℓ − z_k| ≤ h/2} ≤ (L² h^{2β} c₁²/(2σ²)) n ≲ (L² c₁²/σ²) C₀^{−2β} n^{1/(2β+1)}.

To apply Corollary 5.12, we need this bound to be smaller than (1/2) log M − log 2 ≳ d ≥ C₀ n^{1/(2β+1)}, i.e., C₀ ≳ (L² c₁²/σ²)^{1/(2β+1)}. Plugging this into the separation bound above finishes the proof.
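The separation identity ‖fi − fj‖²_{L²} = L² h^{2β+1} ‖K‖²_{L²} ρ(ωi, ωj) can be verified numerically. Since (5.3) is not reproduced here, the sketch below substitutes a standard smooth bump supported on [−1/2, 1/2] for K (an assumption, not the notes' exact kernel):

```python
import math

# Assumed smooth bump supported on [-1/2, 1/2]; the notes define K in (5.3),
# which is not reproduced here, so this kernel is only an illustration.
def K(u):
    return math.exp(-1.0 / (1.0 - 4.0 * u * u)) if abs(u) < 0.5 else 0.0

def check_separation(beta, L, d, omega_i, omega_j, grid=50_000):
    """Midpoint-rule check of ||f_i - f_j||^2 = L^2 h^{2b+1} ||K||^2 rho(w_i, w_j)."""
    h = 1.0 / d
    z = [(k + 0.5) * h for k in range(d)]  # bump centers z_k = (k - 1/2)/d
    def f(x, omega):
        return sum(w * L * h ** beta * K((x - zk) / h) for w, zk in zip(omega, z))
    dx = 1.0 / grid
    lhs = sum((f((m + 0.5) * dx, omega_i) - f((m + 0.5) * dx, omega_j)) ** 2
              for m in range(grid)) * dx
    K2 = sum(K((m + 0.5) / grid - 0.5) ** 2 for m in range(grid)) / grid  # ||K||_{L^2}^2
    rho = sum(a != b for a, b in zip(omega_i, omega_j))
    return lhs, L ** 2 * h ** (2 * beta + 1) * K2 * rho

lhs, rhs = check_separation(beta=1.0, L=1.0, d=8,
                            omega_i=(1, 0, 1, 1, 0, 0, 1, 0),
                            omega_j=(0, 0, 1, 0, 1, 0, 1, 1))
print(f"lhs = {lhs:.6e}, rhs = {rhs:.6e}")
assert abs(lhs - rhs) <= 1e-3 * rhs
```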

We have established matching upper and lower bounds for the minimax risk at a point or in
the L2 norm for nonparametric regression.

5.5 Generalization of the two-point method


One way to generalize the two-point method is through composite hypothesis testing. For a finite
set Θ, consider the mixture
P̄ := (1/|Θ|) Σ_{θ∈Θ} Pθ,    (5.4)

where each Pθ is a distribution. In other words, Y ∼ P̄ can be generated as follows: first sample θ uniformly at random from Θ and then, conditional on θ, sample Y ∼ Pθ. We will study hypothesis

testing between a distribution P0 and the mixture P̄, where the latter is usually referred to as a
composite hypothesis.
Let ψ denote a test, which equals 0 if it selects P0 and equals 1 if it selects P̄. By the Neyman–
Pearson lemma and a homework problem, we have that for any test ψ,
P0{ψ = 1} + P̄{ψ = 0} ≥ 1 − TV(P0, P̄) ≥ 1 − √(χ²(P̄, P0)),    (5.5)

where TV(·, ·) and χ²(·, ·) denote the total variation distance and the χ²-divergence respectively.
To showcase how this inequality leads to a minimax lower bound for an estimation problem, we
consider the following example. Suppose that we aim to estimate θ given the observation

Y = θ + ε, (5.6)

where θ is k-sparse and ε ∼ N(0, σ² In). Recall that this is called the sparse sequence model in a homework problem, and it is a special case of sparse linear regression with n = d and the design matrix X being orthogonal. Assume that 1 ≤ k ≤ n. We aim to prove a lower bound of order σ²k/n up to a logarithmic factor, which then matches the upper bound.
Theorem 5.19. Let Pθ denote the probability associated with the model (5.6). For λ ∈ (0, 1), set

µ := (σ/2) √( k log(1 + λn/k²) ).

Define

Θ := { θ = (µ/√k) 1_S : S ⊂ [n], |S| = k }.

In other words, each vector in Θ is k-sparse with support S, and its nonzero entries are all equal to µ/√k. Let P̄ be defined as in (5.4). Then

χ²(P̄, P0) ≤ 2λ.

Proof. Let p0, pθ, and p̄ denote the densities of P0, Pθ, and P̄ respectively. Let θ and θ′ be two independent uniform random variables over Θ. By the definition of the χ²-divergence, we have

χ²(P̄, P0) = ∫ (p̄ − p0)²/p0 = ∫ p̄²/p0 − 1 = E_{θ,θ′}[ ∫ pθ pθ′/p0 ] − 1.

Let S and S′ denote the supports of θ and θ′ respectively; they are independent random subsets of [n], each of size k. Since the noise is Gaussian, it holds that

pθ(x)/p0(x) = exp( −(1/(2σ²))‖x − (µ/√k)1_S‖₂² + (1/(2σ²))‖x‖₂² ) = exp( (1/(2σ²))( (2µ/√k) x^⊤1_S − µ² ) ).
For Z ∼ N(0, In), define Z_S := Σ_{i∈S} Zi, and let ν := µ/(σ√k). Then we have

∫ pθ pθ′/p0 = ∫ (pθ(x)/p0(x)) (pθ′(x)/p0(x)) p0(x) dx = ∫ exp( (1/(2σ²))( (2µ/√k)(x^⊤1_S + x^⊤1_{S′}) − 2µ² ) ) p0(x) dx
= E[ exp( (1/(2σ²))( (2µσ/√k)(Z_S + Z_{S′}) − 2µ² ) ) ] = E[ exp( ν(Z_S + Z_{S′}) − ν²k ) ].

Note that Z_S + Z_{S′} = 2 Σ_{i∈S∩S′} Zi + Σ_{i∈S△S′} Zi and |S△S′| ≤ 2k. Therefore,

∫ pθ pθ′/p0 = E[ exp( 2ν Σ_{i∈S∩S′} Zi ) ] · E[ exp( ν Σ_{i∈S△S′} Zi ) ] · exp(−ν²k)
= exp( 2ν²|S∩S′| ) · exp( ν²|S△S′|/2 ) · exp(−ν²k) ≤ exp( 2ν²|S∩S′| ).


 

It follows that

χ²(P̄, P0) ≤ E_{S,S′}[ exp(2ν²|S∩S′|) ] − 1 = E_{S′}[ E_S[ exp(2ν²|S∩S′|) | S′ ] ] − 1 = E_S[ exp(2ν²|S∩[k]|) ] − 1,

where the last equality uses that S and S′ are independent and exchangeable, so we may condition on S′ = [k].
The random variable |S∩[k]| is a sampling-without-replacement version of Bin(k, k/n), and it can be shown that the MGF of the former is dominated by the MGF of the latter. As a result,

χ²(P̄, P0) ≤ ( (k/n) e^{2ν²} + 1 − k/n )^k − 1 = ( 1 + (k/n)(e^{2ν²} − 1) )^k − 1.

Recall that

ν = µ/(σ√k) = (1/2) √( log(1 + λn/k²) ),

so e^{2ν²} − 1 = √(1 + λn/k²) − 1 ≤ λn/k². Hence, we conclude that

χ²(P̄, P0) ≤ ( 1 + (k/n) · (λn/k²) )^k − 1 = ( 1 + λ/k )^k − 1 ≤ e^λ − 1 ≤ 2λ

for any λ ∈ (0, 1) and k ≥ 1.
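Both inequalities in the last two displays — the hypergeometric-vs-binomial MGF domination and the final bound by 2λ — can be checked exactly for concrete (n, k, λ), since |S ∩ S′| has an explicit hypergeometric distribution. A sketch with illustrative parameter values:

```python
import math

def hypergeom_pmf(n, k, j):
    """P(|S ∩ S'| = j) for independent uniform k-subsets S, S' of [n]."""
    return math.comb(k, j) * math.comb(n - k, k - j) / math.comb(n, k)

def chi2_bounds(n, k, lam):
    """Exact E[exp(2 nu^2 |S ∩ S'|)] - 1 versus the binomial-MGF bound."""
    nu2 = 0.25 * math.log(1 + lam * n / k ** 2)  # nu^2 for mu = (sigma/2) sqrt(k log(1 + lam n / k^2))
    exact = sum(hypergeom_pmf(n, k, j) * math.exp(2 * nu2 * j)
                for j in range(k + 1)) - 1
    binom = (1 + (k / n) * (math.exp(2 * nu2) - 1)) ** k - 1
    return exact, binom

n, k, lam = 100, 5, 1 / 8
exact, binom = chi2_bounds(n, k, lam)
print(f"exact chi^2 bound {exact:.4f} <= binomial bound {binom:.4f} <= 2*lambda = {2 * lam}")
assert exact <= binom <= 2 * lam
```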

Corollary 5.20. We have the following minimax lower bound for the model (5.6):

inf_{θ̂} sup_{θ∈R^n} (1/n) E‖θ̂ − θ‖₂² ≳ σ² (k/n) log(1 + n/k²).
Proof. We continue to use the notation from above. Let λ = 1/8 and µ = (σ/2)√(k log(1 + n/(8k²))). Let θ̂ be any estimator of θ. Define a test ψ by ψ = 0 if ‖θ̂‖₂ ≤ µ/2 and ψ = 1 if ‖θ̂‖₂ > µ/2. From (5.5) and the above theorem, we obtain

max{ P0{ψ = 1}, max_{θ∈Θ} Pθ{ψ = 0} } ≥ (1/2)( P0{ψ = 1} + P̄{ψ = 0} ) ≥ (1 − √(1/4))/2 = 1/4.
θ∈Θ 2 2 4
1 µ2
Suppose that n kθ̂ − θk22 ≤ 9n with probability at least 0.9. Then
µ
kθ̂k2 − kθk2 ≤ kθ̂ − θk2 ≤ .
3
By the definition of ψ, if θ = 0, then ψ = 0; if θ ∈ Θ, then kθk2 = µ and thus ψ = 1. We reach a
µ2
contradiction. Therefore, n1 kθ̂ − θk22 > 9n with probability at least 0.1, proving the conclusion.

Bibliography

[Ber18] Quentin Berthet. Principles of Statistics, Lecture Notes. 2018.

[GL95] Richard D. Gill and Boris Y. Levit. Applications of the van Trees inequality: a Bayesian Cramér–Rao bound. Bernoulli, 1(1-2):59–79, 1995.

[Kee11] Robert W. Keener. Theoretical statistics: Topics for a core course. Springer, 2011.

[LC06] Erich L. Lehmann and George Casella. Theory of point estimation. Springer Science &
Business Media, 2006.

[LR06] Erich L. Lehmann and Joseph P. Romano. Testing statistical hypotheses. Springer Science & Business Media, 2006.

[Mon15] Andrea Montanari. Computational implications of reducing data to sufficient statistics. Electron. J. Statist., 9(2):2370–2390, 2015.

[RH19] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. 2019.

[Tsy08] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.

[vdV00] Aad W. van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
