
MATH 350 - Statistical Inference By Tanujit Chakraborty

Chapter 2: Theory of Estimation

Figure 1: Interplay between Probability and Statistical Inference for Data Science
The word “inference” refers to drawing conclusions based on some evidence. Thus, Statistical Inference refers to drawing conclusions based on evidence obtained from the data.¹

Figure 2: A loose Taxonomy of Statistical Inference


¹ Reference book: Rice, John A. Mathematical Statistics and Data Analysis. Cengage Learning, 2006.
The main challenges to do so are:

(i) How to summarise the information in the data using formal mathematical tools?

(ii) How to use these summaries to answer questions about the phenomenon of interest?

There is no unique way of answering these questions. There exist several schools
of thought that perform statistical inference using different tools and starting from
different philosophical views. These philosophical differences, as well as the different
mathematical tools employed in these approaches, will be discussed in detail in this
module. The two main schools of thought are the “Frequentist approach” and the
“Bayesian approach”. An appealing feature of this chapter is that both approaches are
presented in parallel, giving the student a balanced perspective.
In many areas, researchers collect data as a means to obtain information about a
phenomenon of interest or to collect information about a population. For example,

• The National Bureau of Statistics in UAE collects information about UAE resi-
dents, such as the age of those persons.

• UAE national cancer registry monitors the survival times of cancer patients diag-
nosed in the UAE in specific years (cohorts).

• Pharmaceutical companies conduct experiments (trials) to assess the effectiveness of new drugs on a group of people.

• The National Aeronautics and Space Administration (NASA) has a Data Portal
(publicly available) with many data sets produced in their experiments and moni-
toring.

• Many companies monitor their financial performance by looking at the daily price of the saleable stocks of the company (“share price”).

• Many others ...

We can understand the collected data as a sample of observations x₁, . . . , xₙ, where each xⱼ, j = 1, . . . , n can be either a scalar or a vector. In statistical inference, this sample is interpreted as a realization of the random variables (or random vectors) X₁, . . . , Xₙ. We will use bold capital letters to denote the vector of random variables X = (X₁, . . . , Xₙ)ᵀ, and bold lowercase letters to denote the sample of observations x = (x₁, . . . , xₙ)ᵀ.

Data Reduction

To analyze, understand, and communicate the information contained in the sample x, we need tools to summarise it.
as data reduction or data summary. There exist many quantities that are used as
summaries. These quantities are functions of the sample, and they are called “statis-
tics”. In mathematical terms, a statistic is any function of the sample T (x), with
T : Rn → Rm , 1 ≤ m ≤ n. The statistic summarises the data in that, instead of
reporting the entire sample, only the value of the statistic T (x) = t is reported. An
example of a statistic (also known as “summary statistic”) is the sample mean (or
average) T(x) = x̄ = (x₁ + · · · + xₙ)/n. For instance, if the sample x consists of the ages of
individuals in the UAE population, a way of summarising this sample is to report
only the average age of the population, in which case T (x) is the sample mean. In
many cases, a statistical data analysis consists only of summarising a data set, using a
choice of different summary statistics. This kind of analysis is known as “Descriptive
Statistics” or “Descriptive Analysis”. In fact, a descriptive analysis is usually the first
step in statistical data analysis as it helps the statistician gain an understanding of the
features of the data. Other summaries that are used in practice are the median (0.5
quantile) as well as other quantiles, the minimum of the sample, the maximum of the
sample, etc. In fact, the set of summary statistics given by the minimum of the sample,
the first quartile ( 0.25 quantile), the median, the third quartile (0.75 quantile), and
the maximum is known as the “Five Number Summary”. Visual tools (boxplots, violin
plots, histograms, scatter plots, etc) are also used in applied statistics to understand
other features of the data (covered in the Descriptive Statistics Course). Now, we will
discuss two of the important concepts in the theory of estimation.

Point Estimation: In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as a “best estimate” of an unknown (fixed or random) population parameter.

Let (X₁, X₂, . . . , Xₙ) be a random sample drawn from a population having distribution function F_θ, θ ∈ Θ, where the functional form of F is known except for the parameter θ. If we are to guess a specific feature of the parent distribution, it can be explicitly written as a function of θ.

Suppose we are to guess γ(θ), a real-valued function of θ. The statistic T(X₁, X₂, . . . , Xₙ) is said to be an estimator of γ(θ) if we guess γ(θ) by T(X₁, X₂, . . . , Xₙ); given (X₁, X₂, . . . , Xₙ) = (x₁, x₂, . . . , xₙ), the realized value T(x₁, x₂, . . . , xₙ) is said to be an estimate of γ(θ).

Interval Estimation: In statistics, interval estimation is the use of sample data
to calculate an interval of possible (probable) values, of an unknown population
parameter, in contrast to point estimation, which is a single number (estimation
by unique estimate).

An interval estimate of a real-valued parameter θ is any pair of functions, L(x₁, x₂, . . . , xₙ) and U(x₁, x₂, . . . , xₙ), of a sample that satisfy L(x) ≤ U(x) for all x ∈ X. If X = x is observed, the inference L(x) ≤ θ ≤ U(x) is made. The random interval [L(X), U(X)] is called an interval estimator.

The most prevalent forms of interval estimation are confidence intervals (a frequentist method) and credible intervals (a Bayesian method). Other common approaches to interval estimation, which are encompassed by statistical theory, are tolerance and prediction intervals (used mainly in regression analysis).

• Credible intervals can readily deal with prior information, while confidence
intervals cannot.

• Confidence intervals are more flexible and can be used practically in more
situations than credible intervals: one area where credible intervals suffer in
comparison is in dealing with non-parametric models.

1 Parametric models and methods of estimation

In this chapter, we discuss the question: how do we estimate the parameter(s) of a given probability distribution?

A parametric model is a family of probability distributions that can be described by a finite number of parameters.² We’ve already seen many examples of parametric models:

• The family of normal distributions N(µ, σ²), with parameters µ and σ².
• The family of Bernoulli distributions Bernoulli(p), with a single parameter p.

• The family of Gamma distributions Gamma(α, β), with parameters α and β.

We will denote a general parametric model by {f(x | θ) : θ ∈ Ω}, where θ ∈ Rᵏ represents k parameters, Ω ⊆ Rᵏ is the parameter space to which the parameters must belong, and f(x | θ) is the PDF or PMF for the distribution having parameters θ. For example, in the N(µ, σ²) model above, θ = (µ, σ²), Ω = R × R⁺ where R⁺ is the set of positive real numbers, and

f(x | θ) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

² The number of parameters is fixed and cannot grow with the sample size.
Given data X1 , . . . , Xn , the question of which parametric model we choose to fit the
data usually depends on what the data values represent (number of occurrences over a
period of time? aggregation of many small effects?) as well as a visual examination of
the shape of the data histogram. This question is discussed in the context of several
examples in Rice Sections 8.2-8.3.

Our main question of interest in this unit will be the following: after specifying an appropriate parametric model {f(x | θ) : θ ∈ Ω}, and given IID observations X₁, . . . , Xₙ ∼ f(x | θ), how can we estimate the unknown parameter θ and quantify the uncertainty in our estimate?

1.1 Method of moments


If θ is a single number, then a simple idea to estimate θ is to find the value of θ for which the theoretical mean of X ∼ f(x | θ) equals the observed sample mean X̄ = (1/n)(X₁ + . . . + Xₙ).

Example 1.1. The Poisson distribution with parameter λ > 0 is a discrete distribution over the non-negative integers {0, 1, 2, 3, . . .} having PMF

f(x | λ) = e^{−λ} λˣ / x!.

If X ∼ Poisson(λ), then it has mean E[X] = λ. Hence for IID data X₁, . . . , Xₙ ∼ Poisson(λ), a simple estimate of λ is the sample mean λ̂ = X̄.

Example 1.2. The exponential distribution with parameter λ > 0 is a continuous distribution over R⁺ having PDF

f(x | λ) = λe^{−λx}.

If X ∼ Exponential(λ), then E[X] = 1/λ. Hence for IID data X₁, . . . , Xₙ ∼ Exponential(λ), we estimate λ by the value λ̂ which satisfies 1/λ̂ = X̄, i.e. λ̂ = 1/X̄.

More generally, for X ∼ f(x | θ) where θ contains k unknown parameters, we may consider the first k moments of the distribution of X, which are the values

µ₁ = E[X], µ₂ = E[X²], . . . , µₖ = E[Xᵏ],

and compute these moments in terms of θ. To estimate θ from data X₁, . . . , Xₙ, we solve for the value of θ for which these moments equal the observed sample moments

µ̂₁ = (1/n)(X₁ + . . . + Xₙ), . . . , µ̂ₖ = (1/n)(X₁ᵏ + . . . + Xₙᵏ).

(This yields k equations in k unknown parameters.) The resulting estimate of θ is called the method of moments estimator.
Example 1.3. Let X₁, . . . , Xₙ ∼ N(µ, σ²) be IID. If X ∼ N(µ, σ²), then E[X] = µ and E[X²] = µ² + σ². So the method of moments estimators µ̂ and σ̂² for µ and σ² solve the equations

µ̂ = µ̂₁,
σ̂² + µ̂² = µ̂₂.

The first equation yields µ̂ = µ̂₁ = X̄, and the second yields

σ̂² = µ̂₂ − µ̂₁² = (1/n) Σᵢ₌₁ⁿ Xᵢ² − X̄² = (1/n)(Σᵢ₌₁ⁿ Xᵢ² − 2 Σᵢ₌₁ⁿ Xᵢ X̄ + nX̄²) = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

Example 1.4. Let X₁, . . . , Xₙ ∼ Gamma(α, β) be IID. If X ∼ Gamma(α, β), then E[X] = α/β and E[X²] = (α + α²)/β². So the method of moments estimators α̂, β̂ solve the equations

α̂/β̂ = µ̂₁,
(α̂ + α̂²)/β̂² = µ̂₂.

Substituting the first equation into the second,

(1/α̂ + 1) µ̂₁² = µ̂₂,

so

α̂ = µ̂₁² / (µ̂₂ − µ̂₁²) = X̄² / ((1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²).

The first equation then yields

β̂ = α̂/µ̂₁ = X̄ / ((1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²).
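As a quick numerical illustration, here is a minimal sketch of these estimators (assuming numpy is available; the parameter values, sample size, and seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = 2.0, 3.0
# numpy parameterizes the Gamma sampler by shape and scale = 1/beta
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=10000)

xbar = x.mean()
s2 = ((x - xbar) ** 2).mean()   # (1/n) * sum of (x_i - xbar)^2

alpha_hat = xbar ** 2 / s2      # method-of-moments alpha-hat
beta_hat = xbar / s2            # method-of-moments beta-hat
print(alpha_hat, beta_hat)      # should be close to (2.0, 3.0)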

1.2 Bias, variance, and mean-squared-error


Consider the case of a single parameter θ ∈ R. Any estimator θ̂ := θ̂(X₁, . . . , Xₙ) is a statistic: it has variability due to the randomness of the data X₁, . . . , Xₙ from which it is computed. Supposing that X₁, . . . , Xₙ ∼ f(x | θ) are IID (so the parametric model is correct and the true parameter is θ), we can think about whether θ̂ is a “good” estimate of the true parameter θ in a variety of different ways:

• The bias of θ̂ is E_θ[θ̂] − θ. Here and below, E_θ denotes the expectation with respect to IID X₁, . . . , Xₙ ∼ f(x | θ).

• The standard error of θ̂ is its standard deviation √(Var_θ[θ̂]). Here and below, Var_θ denotes the variance with respect to IID X₁, . . . , Xₙ ∼ f(x | θ).

• The mean-squared-error (MSE) of θ̂ is E_θ[(θ̂ − θ)²].

The bias measures how close the average value of θ̂ is to the true parameter θ; the standard error measures how variable θ̂ is around this average value. An estimator with small bias need not be an accurate estimator if it has large standard error, and conversely an estimator with small standard error need not be accurate if it has large bias. The mean-squared-error encompasses both bias and variance: for any random variable X and any constant c ∈ R,

E[(X − c)²] = E[(X − EX + EX − c)²]
= E[(X − EX)²] + E[2(X − EX)(EX − c)] + E[(EX − c)²]
= Var[X] + 2(EX − c) E[X − EX] + (EX − c)²
= Var[X] + (EX − c)²,

where we used that EX − c is a constant and E[X − EX] = 0. Applying this to X = θ̂ and c = θ,

E_θ[(θ̂ − θ)²] = Var_θ[θ̂] + (E_θ[θ̂] − θ)²,

we obtain the bias-variance decomposition of mean-squared-error:

Mean-squared-error = Variance + Bias².


An important remark is that the bias, standard error, and MSE may depend on the true parameter θ and take different values for different θ. We say that θ̂ is unbiased for θ if E_θ[θ̂] = θ for all θ ∈ Ω.
Example 1.5. In the model X₁, . . . , Xₙ ∼ Poisson(λ) (IID), the method-of-moments estimator of λ was λ̂ = X̄. Then

E_λ[λ̂] = E_λ[X̄] = (1/n) Σᵢ₌₁ⁿ E_λ[Xᵢ] = λ,

where the last equality uses E[X] = λ if X ∼ Poisson(λ). So E_λ[λ̂] − λ = 0 for all λ > 0, and λ̂ is an unbiased estimator of λ. Also,

Var_λ[λ̂] = Var_λ[X̄] = (1/n²) Σᵢ₌₁ⁿ Var_λ[Xᵢ] = λ/n,

where we have used that X₁, . . . , Xₙ are independent and Var[X] = λ if X ∼ Poisson(λ). Hence the standard error of λ̂ is √(λ/n), and the MSE is λ/n. Note that both of these depend on λ; they are larger when λ is larger.

As we do not know λ, in practice to determine the variability of λ̂ we may estimate the standard error by √(λ̂/n) = √(X̄/n). For large n, this is justified by the fact that λ̂ is unbiased with standard error of the order 1/√n, so we expect λ̂ − λ to be of this order. Hence the estimated standard error √(λ̂/n) should be very close to the true standard error √(λ/n). (We expect the difference between √(λ/n) and √(λ̂/n) to be of the smaller order 1/n.)
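A small Monte Carlo simulation can illustrate these quantities; a minimal sketch (assuming numpy; λ = 4, n = 50, the seed, and the number of replications are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 4.0, 50, 100000
lam_hat = rng.poisson(lam, size=(reps, n)).mean(axis=1)  # lambda-hat = Xbar, replicated

print("bias:", lam_hat.mean() - lam)          # ~ 0 (unbiased)
print("variance:", lam_hat.var())             # ~ lam/n = 0.08
print("MSE:", ((lam_hat - lam) ** 2).mean())  # ~ lam/n = 0.08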

Remark 1.1. A nonsense unbiased estimator. Let X be a Poisson random variable with mean λ > 0. Recall that the PMF of X is given by

p(x; λ) = P(X = x; λ) = λˣ e^{−λ} / x!.
Suppose that we are interested in estimating the parameter θ = e^{−3λ} based on a sample of size one. Let T(X) = (−2)^X. Then the expectation is

E(T) = Σₓ₌₀^∞ (−2)ˣ λˣ e^{−λ} / x! = e^{−λ} Σₓ₌₀^∞ (−2λ)ˣ / x! = e^{−λ} e^{−2λ} = e^{−3λ}.
Therefore, T is unbiased for e^{−3λ}. However, T is unreasonable in the sense that it may be negative (for X odd), even though it is an estimator of a strictly positive quantity. This suggests that one should not automatically assume that a unique unbiased estimator is necessarily good.
Example 1.6. In the model X₁, . . . , Xₙ ∼ Exponential(λ) (IID), the method-of-moments estimator of λ was λ̂ = 1/X̄. This estimator is biased: recall Jensen’s inequality, which says that for any strictly convex function g : R → R and non-degenerate X, E[g(X)] > g(E[X]). The function x ↦ 1/x is strictly convex on R⁺, so

E_λ[λ̂] = E_λ[1/X̄] > 1/E_λ[X̄] = 1/(1/λ) = λ,

where we used E_λ[X̄] = E_λ[X₁] = 1/λ when X₁, . . . , Xₙ ∼ Exponential(λ) are IID. So E_λ[λ̂] − λ > 0 for all λ > 0, meaning λ̂ always has positive bias.
To compute exactly the bias, variance, and MSE of λ̂, note that Exponential(λ) is the same distribution as Gamma(1, λ). Then X̄ = (1/n)(X₁ + . . . + Xₙ) ∼ Gamma(n, nλ). (This may be shown by calculating the MGF of X̄.) The distribution of λ̂ = 1/X̄ is called the Inverse-Gamma(n, nλ) distribution, which has mean λn/(n − 1) and variance λ²n²/((n − 1)²(n − 2)) for n ≥ 3. So the bias, variance, and MSE are given by

Bias = E_λ[λ̂] − λ = λn/(n − 1) − λ = λ/(n − 1),
Variance = Var_λ[λ̂] = λ²n²/((n − 1)²(n − 2)),
MSE = λ²n²/((n − 1)²(n − 2)) + (λ/(n − 1))² = λ²(n + 2)/((n − 1)(n − 2)).
So far, we have introduced the method of moments for estimating one or more parameters θ in a parametric model. In the next section, we discuss a different method called maximum likelihood estimation. The focus of the next section will be on how to compute this estimate; subsequent sections will study its statistical properties.

1.3 Maximum likelihood estimation
IID
Consider data X1 , . . . , Xn ∼ f (x | θ), for a parametric model {f (x | θ) : θ ∈ Ω}. The
“x given θ” implies that given a particular value of θ, f (· | θ) defines a density. The
parameter θ can be a vector of parameters. Suppose we simulate data from F or collect
some real data (say, waiting times in a popular restaurant), and we want to answer the
following questions:

1. How to estimate θ?

2. How to construct confidence intervals around the estimator of θ?

A useful method of estimating θ is the method of maximum likelihood estimation. Given the observed data X = (X₁, . . . , Xₙ), the function

L(θ | X) = ∏ᵢ₌₁ⁿ f(Xᵢ | θ), sometimes written lik(θ) = f(X₁ | θ) × . . . × f(Xₙ | θ),

of the parameter θ is called the likelihood function. If f(x | θ) is the PMF of a discrete distribution, then lik(θ) is simply the probability of observing the values X₁, . . . , Xₙ if the true parameter were θ. The maximum likelihood estimator (MLE) of θ is the value of θ ∈ Ω that maximizes lik(θ) or L(θ | X). Intuitively, it is the value of θ that makes the observed data “most probable” or “most likely”. The likelihood function measures how likely a particular value of θ is, given the observed data, and the MLE is the θ that maximizes this likelihood.

It is important to note that L(θ | X) is not a probability distribution over θ. The “most likely” value is the value that maximizes the likelihood:

θ̂_MLE = arg max_{θ∈Ω} L(θ | X).

Computing the MLE is an optimization problem. Maximizing lik(θ) or L(θ | X) is equivalent to maximizing its (natural) logarithm

l(θ) = log(lik(θ)) = Σᵢ₌₁ⁿ log f(Xᵢ | θ),

which in many examples is easier to work with as it involves a sum rather than a product. Let’s work through several examples:

Example 1.7. (Bernoulli). Let X₁, . . . , Xₙ ∼ Bernoulli(p) be IID. Then the likelihood is

L(p | x) = ∏ᵢ₌₁ⁿ p(xᵢ | p) = ∏ᵢ₌₁ⁿ p^{xᵢ}(1 − p)^{1−xᵢ} = p^{Σxᵢ}(1 − p)^{n−Σxᵢ}.

To obtain the MLE of p, we will maximize the likelihood. Note that maximizing the likelihood is the same as maximizing the log of the likelihood, but the calculations are easier after taking a log. So we take a log:

l(p) := log L(p | x) = (Σᵢ xᵢ) log p + (n − Σᵢ xᵢ) log(1 − p),
dl(p)/dp = (Σᵢ xᵢ)/p − (n − Σᵢ xᵢ)/(1 − p), which we set to 0,
⇒ p̂ = (1/n) Σᵢ₌₁ⁿ xᵢ.

Verify for yourself that the second derivative is negative at this p̂. Thus,

p̂_MLE = (1/n) Σᵢ₌₁ⁿ xᵢ.
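As a sanity check, a numerical optimizer applied to this log-likelihood recovers the same answer; a minimal sketch (assuming numpy and scipy, with an arbitrary simulated p = 0.3):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=1000)  # Bernoulli(0.3) sample

def neg_log_lik(p):
    # negative of l(p) = (sum x_i) log p + (n - sum x_i) log(1 - p)
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())  # the two values agree: the MLE is the sample mean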

Example 1.8. (Poisson). Let X₁, . . . , Xₙ ∼ Poisson(λ) be IID. Then

l(λ) = Σᵢ₌₁ⁿ log(λ^{Xᵢ} e^{−λ} / Xᵢ!) = Σᵢ₌₁ⁿ (Xᵢ log λ − λ − log(Xᵢ!)) = (log λ) Σᵢ₌₁ⁿ Xᵢ − nλ − Σᵢ₌₁ⁿ log(Xᵢ!).

This is differentiable in λ, so we maximize l(λ) by setting its first derivative equal to 0:

0 = l′(λ) = (1/λ) Σᵢ₌₁ⁿ Xᵢ − n.

Solving for λ yields the estimate λ̂ = X̄. Since l(λ) → −∞ as λ → 0 or λ → ∞, and since λ̂ = X̄ is the unique value for which 0 = l′(λ), this must be the maximum of l. In this example, λ̂ is the same as the method-of-moments estimate.
Example 1.9. (Normal). Let X₁, . . . , Xₙ ∼ N(µ, σ²) be IID. Then

l(µ, σ²) = Σᵢ₌₁ⁿ log((1/√(2πσ²)) e^{−(Xᵢ−µ)²/(2σ²)})
= Σᵢ₌₁ⁿ (−(1/2) log(2πσ²) − (Xᵢ − µ)²/(2σ²))
= −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − µ)².

Considering σ² (rather than σ) as the parameter, we maximize l(µ, σ²) by setting its partial derivatives with respect to µ and σ² equal to 0:

0 = ∂l/∂µ = (1/σ²) Σᵢ₌₁ⁿ (Xᵢ − µ),
0 = ∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ⁿ (Xᵢ − µ)².

Solving the first equation yields µ̂ = X̄, and substituting this into the second equation yields σ̂² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)². Since l(µ, σ²) → −∞ as µ → −∞, µ → ∞, σ² → 0, or σ² → ∞, and as (µ̂, σ̂²) is the unique value for which 0 = ∂l/∂µ and 0 = ∂l/∂σ², this must be the maximum of l. Again, the MLEs are the same as the method-of-moments estimates.

Example 1.10. (Two-parameter exponential). The density of a two-parameter exponential distribution is

f(x | µ, λ) = λe^{−λ(x−µ)}; x ≥ µ, µ ∈ R, λ > 0.

We want to compute the MLEs of both λ and µ. The likelihood is

L(λ, µ | x) = ∏ᵢ₌₁ⁿ f(xᵢ | µ, λ) = ∏ᵢ₌₁ⁿ λe^{−λ(xᵢ−µ)} I(xᵢ ≥ µ) = λⁿ exp{−λ(Σᵢ xᵢ − nµ)} ∏ᵢ I(xᵢ ≥ µ).

But x₁, . . . , xₙ ≥ µ if and only if min{xᵢ} ≥ µ. So

L(λ, µ | x) = λⁿ exp{−λ(Σᵢ xᵢ − nµ)} I(minᵢ{xᵢ} ≥ µ) for all µ.

We will first maximize with respect to µ and then with respect to λ. Note that L(λ, µ) is an increasing function of µ within the restriction, so the MLE of µ is the largest value allowed by the restriction µ ≤ min{Xᵢ}. So

µ̂_MLE = min{X₁, . . . , Xₙ} = X₍₁₎.

Next, note that

L(X₍₁₎, λ | x) = λⁿ exp{−λ(Σᵢ Xᵢ − nX₍₁₎)}
⇒ l(X₍₁₎, λ) := log L(X₍₁₎, λ | x) = n log λ − λ(Σᵢ Xᵢ − nX₍₁₎)
⇒ dl/dλ = n/λ − (Σᵢ Xᵢ − nX₍₁₎), which we set to 0, and
d²l/dλ² = −n/λ² < 0.

So the function is concave in λ, and thus there is a unique maximum. Setting dl/dλ = 0,

n/λ = Σᵢ₌₁ⁿ Xᵢ − nX₍₁₎ ⇒ λ̂_MLE = n / (Σᵢ Xᵢ − nX₍₁₎).
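A minimal simulation sketch of these two formulas (assuming numpy; the true µ, λ, sample size, and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
mu_true, lam_true, n = 2.0, 0.5, 100000
x = mu_true + rng.exponential(scale=1.0 / lam_true, size=n)

mu_hat = x.min()                      # X_(1), the sample minimum
lam_hat = n / (x.sum() - n * mu_hat)  # n / (sum X_i - n X_(1))
print(mu_hat, lam_hat)                # close to (2.0, 0.5)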

Example 1.11. The tank problem.

“The enemy” has an unknown number N of tanks, which he has obligingly numbered
1, 2, . . . , N . Spies have reported sighting 8 tanks with numbers

x = (137, 24, 86, 33, 92, 129, 17, 111)⊤ .

Assume that sightings are independent and that each of the N tanks has a probability
1/N of being observed at each sighting. What is the MLE of N ?
Let Xᵢ be the serial number observed at the i-th sighting. Then the PMF of each sighting is

P(x; N) = 1/N for x ∈ {1, . . . , N}, and 0 for x > N.

Given that each tank has the same probability of being observed, and that the largest sample value is x₍₈₎ = 137, it follows that the likelihood function of N is

L(N | x) = 1/N⁸ for N ≥ 137, and 0 for N < 137.

It is straightforward to see that the likelihood function is maximised at

N̂ = maxᵢ₌₁,...,₈ xᵢ = 137.

Q: Would you trust this estimate?
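One way to probe this question is a quick simulation: since N̂ = max Xᵢ can never exceed N, it systematically underestimates N. A minimal sketch (assuming numpy; the true N = 200 and the seed are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
N, k, reps = 200, 8, 100000
sightings = rng.integers(1, N + 1, size=(reps, k))  # k sightings, uniform on {1, ..., N}
n_hat = sightings.max(axis=1)
print(n_hat.mean())  # noticeably below N = 200: the MLE is biased downward

For this reason, bias-corrected variants of the maximum are often used in practice.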

1.3.1 Why MLE?


One main reason for using MLE is that often (though not always) the resulting estimators are consistent and asymptotically normal. That is, for a general likelihood L(θ | x) and n being the size of the data,

θ̂_MLE → θ in probability as n → ∞,

and under some additional conditions we also have

√n (θ̂_MLE − θ) → N(0, Σ*) in distribution,

where Σ* is an estimable matrix called the inverse Fisher information matrix. So if we use MLE estimation (and after verifying certain conditions), we know that we can construct confidence intervals around θ̂_MLE. The conditions required for consistency and asymptotic normality are important (to be discussed in Sec. 1.3.3).

1.3.2 No closed-form MLEs

In inference, obtaining the MLE for a problem requires maximizing the likelihood. However, it is possible that the maximizer has no closed analytical form! This is a common challenge in many models and estimation problems, and it requires sophisticated optimization tools. We will give examples in which we cannot get an analytical form of the MLE.
Example 1.12. (Gamma). Let X₁, . . . , Xₙ ∼ Gamma(α, β) be IID. Then

l(α, β) = Σᵢ₌₁ⁿ log((β^α / Γ(α)) Xᵢ^{α−1} e^{−βXᵢ})
= Σᵢ₌₁ⁿ (α log β − log Γ(α) + (α − 1) log Xᵢ − βXᵢ)
= nα log β − n log Γ(α) + (α − 1) Σᵢ₌₁ⁿ log Xᵢ − β Σᵢ₌₁ⁿ Xᵢ.

To maximize l(α, β), we set its partial derivatives equal to 0:

0 = ∂l/∂α = n log β − nΓ′(α)/Γ(α) + Σᵢ₌₁ⁿ log Xᵢ,
0 = ∂l/∂β = nα/β − Σᵢ₌₁ⁿ Xᵢ.

The second equation implies that the MLEs α̂ and β̂ satisfy β̂ = α̂/X̄. Substituting into the first equation and dividing by n, α̂ satisfies

0 = log α̂ − Γ′(α̂)/Γ(α̂) − log X̄ + (1/n) Σᵢ₌₁ⁿ log Xᵢ. (1)

The function f(α) = log α − Γ′(α)/Γ(α) decreases from ∞ to 0 as α increases from 0 to ∞, and the value −log X̄ + (1/n) Σᵢ₌₁ⁿ log Xᵢ is always negative (by Jensen’s inequality, E[log X] ≤ log E[X], since log is concave), hence equation (1) always has a unique root α̂, which is the MLE for α. The MLE for β is then β̂ = α̂/X̄.

Unfortunately, there is no closed-form expression for this root α̂. (In particular, the MLE α̂ is not the method-of-moments estimator for α.) We may compute the root

numerically using the Newton-Raphson method: we start with an initial guess α⁽⁰⁾, which (for example) may be the method-of-moments estimator

α⁽⁰⁾ = X̄² / ((1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²).

Having computed α⁽ᵗ⁾ for any t = 0, 1, 2, . . ., we compute the next iterate α⁽ᵗ⁺¹⁾ by approximating equation (1) with a linear equation using a first-order Taylor expansion around α̂ = α⁽ᵗ⁾, and set α⁽ᵗ⁺¹⁾ as the value of α̂ that solves this linear equation. In detail, let f(α) = log α − Γ′(α)/Γ(α). A first-order Taylor expansion around α̂ = α⁽ᵗ⁾ in equation (1) yields the linear approximation

0 ≈ f(α⁽ᵗ⁾) + (α̂ − α⁽ᵗ⁾) f′(α⁽ᵗ⁾) − log X̄ + (1/n) Σᵢ₌₁ⁿ log Xᵢ,

and we set α⁽ᵗ⁺¹⁾ to be the value of α̂ solving this linear equation, i.e.³

α⁽ᵗ⁺¹⁾ = α⁽ᵗ⁾ + (−f(α⁽ᵗ⁾) + log X̄ − (1/n) Σᵢ₌₁ⁿ log Xᵢ) / f′(α⁽ᵗ⁾).

The iterates α⁽⁰⁾, α⁽¹⁾, α⁽²⁾, . . . converge to the MLE α̂.

³ If this update yields α⁽ᵗ⁺¹⁾ ≤ 0, we may reset α⁽ᵗ⁺¹⁾ to be a very small positive value.
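A sketch of this iteration (assuming numpy and scipy, whose scipy.special.digamma and scipy.special.polygamma provide Γ′(α)/Γ(α) and its derivative; the simulated parameters and seed are arbitrary):

import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=1.0 / 3.0, size=5000)  # Gamma(alpha=2, beta=3) data

log_xbar = np.log(x.mean())
mean_logx = np.log(x).mean()

f = lambda a: np.log(a) - digamma(a)          # f(alpha) = log(alpha) - psi(alpha)
fprime = lambda a: 1.0 / a - polygamma(1, a)  # f'(alpha) = 1/alpha - psi'(alpha)

alpha = x.mean() ** 2 / x.var()  # method-of-moments starting point alpha^(0)
for _ in range(20):
    # Newton-Raphson update from equation (1)
    alpha = alpha + (-f(alpha) + log_xbar - mean_logx) / fprime(alpha)

beta = alpha / x.mean()          # beta-hat = alpha-hat / Xbar
print(alpha, beta)               # MLEs, close to (2.0, 3.0)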

Example 1.13. Let (X₁, . . . , Xₖ) ∼ Multinomial(n, (p₁, . . . , pₖ)). (This is not quite the setting of n IID observations from a parametric model, as we have been considering, although you can think of (X₁, . . . , Xₖ) as a summary of n such observations Y₁, . . . , Yₙ from the parametric model Multinomial(1, (p₁, . . . , pₖ)), where Yᵢ indicates which of k possible outcomes occurred for the i-th observation.) Writing C(n; X₁, . . . , Xₖ) for the multinomial coefficient, the log-likelihood is given by

l(p₁, . . . , pₖ) = log(C(n; X₁, . . . , Xₖ) p₁^{X₁} · · · pₖ^{Xₖ}) = log C(n; X₁, . . . , Xₖ) + Σᵢ₌₁ᵏ Xᵢ log pᵢ,

and the parameter space is

Ω = {(p₁, . . . , pₖ) : 0 ≤ pᵢ ≤ 1 for all i and p₁ + . . . + pₖ = 1}.

To maximize l(p₁, . . . , pₖ) subject to the linear constraint p₁ + . . . + pₖ = 1, we may use the method of Lagrange multipliers: consider the Lagrangian

L(p₁, . . . , pₖ, λ) = log C(n; X₁, . . . , Xₖ) + Σᵢ₌₁ᵏ Xᵢ log pᵢ + λ(p₁ + . . . + pₖ − 1),

for a constant λ to be chosen later. Clearly, subject to p₁ + . . . + pₖ = 1, maximizing l(p₁, . . . , pₖ) is the same as maximizing L(p₁, . . . , pₖ, λ). Ignoring momentarily the constraint p₁ + . . . + pₖ = 1, the unconstrained maximizer of L is obtained by setting, for each i = 1, . . . , k,

0 = ∂L/∂pᵢ = Xᵢ/pᵢ + λ,

which yields p̂ᵢ = −Xᵢ/λ. For the specific choice of constant λ = −n, we obtain p̂ᵢ = Xᵢ/n and Σᵢ₌₁ᵏ p̂ᵢ = Σᵢ₌₁ᵏ Xᵢ/n = 1, so the constraint is satisfied. As p̂ᵢ = Xᵢ/n is the unconstrained maximizer of L(p₁, . . . , pₖ, −n), it must also be the constrained maximizer of L(p₁, . . . , pₖ, −n), and hence the constrained maximizer of l(p₁, . . . , pₖ). So the MLE is given by p̂ᵢ = Xᵢ/n for i = 1, . . . , k.

Recall the Taylor expansion behind the Newton-Raphson step: setting

f(x) ≈ f(a) + f′(a)(x − a) = 0 ⇒ f′(a)(x − a) = −f(a) ⇒ x = a − f(a)/f′(a).

1.3.3 Consistency and asymptotic normality of the MLE


We showed in the last section that given IID data X₁, . . . , Xₙ ∼ Poisson(λ), the maximum likelihood estimator for λ is simply λ̂ = X̄. How accurate is λ̂ for λ? Recall from Section 1.1 the following computations:

E_λ[X̄] = (1/n) Σᵢ₌₁ⁿ E[Xᵢ] = λ,
Var_λ[X̄] = (1/n²) Σᵢ₌₁ⁿ Var[Xᵢ] = λ/n.

So λ̂ is unbiased, with variance λ/n.

When n is large, asymptotic theory provides us with a more complete picture of the “accuracy” of λ̂: by the Law of Large Numbers, X̄ converges to λ in probability as n → ∞. Furthermore, by the Central Limit Theorem,

√n (X̄ − λ) → N(0, Var[Xᵢ]) = N(0, λ)

in distribution as n → ∞. So for large n, we expect λ̂ to be close to λ, and the sampling distribution of λ̂ is approximately N(λ, λ/n). This normal approximation is useful for many reasons; for example, it allows us to understand other measures of error (such as E[|λ̂ − λ|] or P[|λ̂ − λ| > 0.01]), and it will allow us to obtain a confidence interval for λ̂. In a parametric model, we say that an estimator θ̂ based on X₁, . . . , Xₙ is consistent if θ̂ → θ in probability as n → ∞. We say that it is asymptotically normal if √n(θ̂ − θ) converges in distribution to a normal distribution (or a multivariate normal distribution, if θ has more than one parameter). So λ̂ above is consistent and asymptotically normal.
The goal of this section is to explain why, rather than being a curiosity of this Poisson
example, consistency and asymptotic normality of the MLE hold quite generally for
many “typical” parametric models, and there is a general formula for its asymptotic
variance. The following is one statement of such a result:

Theorem 1.1. Let {f(x | θ) : θ ∈ Ω} be a parametric model, where θ ∈ R is a single parameter. Let X₁, . . . , Xₙ ∼ f(x | θ₀) be IID for θ₀ ∈ Ω, and let θ̂ be the MLE based on X₁, . . . , Xₙ. Suppose certain regularity conditions hold, including:ᵃ

• All PDFs/PMFs f(x | θ) in the model have the same support,

• θ₀ is an interior point (i.e., not on the boundary) of Ω,

• The log-likelihood l(θ) is differentiable in θ, and

• θ̂ is the unique value of θ ∈ Ω that solves the equation 0 = l′(θ).

Then θ̂ is consistent and asymptotically normal, with √n (θ̂ − θ₀) → N(0, 1/I(θ₀)) in distribution. Here, I(θ) is defined by the two equivalent expressions

I(θ) := Var_θ[z(X, θ)] = −E_θ[z′(X, θ)],

where Var_θ and E_θ denote variance and expectation with respect to X ∼ f(x | θ), and

z(x, θ) = (∂/∂θ) log f(x | θ), z′(x, θ) = (∂²/∂θ²) log f(x | θ).

ᵃ Some technical conditions in addition to the ones stated are required to make this theorem rigorously true; these additional conditions will hold for the examples we discuss, and we won’t worry about them in this class.

Here z(x, θ) is called the score function, and I(θ) is called the Fisher information. Heuristically, for large n, the above theorem tells us the following about the MLE θ̂:

• θ̂ is asymptotically unbiased. More precisely, the bias of θ̂ is of smaller order than 1/√n. (Otherwise √n (θ̂ − θ₀) would not converge to a distribution with mean 0.)

• The variance of θ̂ is approximately 1/(nI(θ₀)). In particular, the standard error is of order 1/√n, and the variance (rather than the squared bias) is the main contributing factor to the mean-squared-error of θ̂.

• If the true parameter is θ₀, the sampling distribution of θ̂ is approximately N(θ₀, 1/(nI(θ₀))).

Example 1.14. Let’s verify that this theorem is correct for the above Poisson example. There,

log f(x | λ) = log(λˣ e^{−λ} / x!) = x log λ − λ − log(x!),

so the score function and its derivative are given by

z(x, λ) = (∂/∂λ) log f(x | λ) = x/λ − 1, z′(x, λ) = (∂²/∂λ²) log f(x | λ) = −x/λ².

We may compute the Fisher information as

I(λ) = −E_λ[z′(X, λ)] = E_λ[X/λ²] = 1/λ,

so √n (λ̂ − λ) → N(0, λ) in distribution. This is the same result as what we obtained using a direct application of the CLT.
Proof sketch of Theorem 1.1.

Proof. We’ll sketch heuristically the proof of Theorem 1.1, assuming f(x | θ) is the PDF of a continuous distribution. (The discrete case is analogous with integrals replaced by sums.)

To see why the MLE θ̂ is consistent, note that θ̂ is the value of θ which maximizes

(1/n) l(θ) = (1/n) Σᵢ₌₁ⁿ log f(Xᵢ | θ).

Suppose the true parameter is θ₀, i.e. X₁, . . . , Xₙ ∼ f(x | θ₀) are IID. Then for any θ ∈ Ω (not necessarily θ₀), the Law of Large Numbers implies the convergence in probability

(1/n) Σᵢ₌₁ⁿ log f(Xᵢ | θ) → E_θ₀[log f(X | θ)].

Under suitable regularity conditions, this implies that the value of θ maximizing the left side, which is θ̂, converges in probability to the value of θ maximizing the right side, which we claim is θ₀. Indeed, for any θ ∈ Ω,

E_θ₀[log f(X | θ)] − E_θ₀[log f(X | θ₀)] = E_θ₀[log(f(X | θ)/f(X | θ₀))].

Noting that x ↦ log x is concave, Jensen’s inequality implies E[log X] ≤ log E[X] for any positive random variable X, so

E_θ₀[log(f(X | θ)/f(X | θ₀))] ≤ log E_θ₀[f(X | θ)/f(X | θ₀)] = log ∫ (f(x | θ)/f(x | θ₀)) f(x | θ₀) dx = log ∫ f(x | θ) dx = 0.

So θ ↦ E_θ₀[log f(X | θ)] is maximized at θ = θ₀, which establishes consistency of θ̂.


To show asymptotic normality, we first compute the mean and variance of the score:

Lemma 1.1. (Properties of the score). For θ ∈ Ω,

E_θ[z(X, θ)] = 0, Var_θ[z(X, θ)] = −E_θ[z′(X, θ)].

Proof. By the chain rule of differentiation,

z(x, θ) f(x | θ) = ((∂/∂θ) log f(x | θ)) f(x | θ) = (((∂/∂θ) f(x | θ)) / f(x | θ)) f(x | θ) = (∂/∂θ) f(x | θ). (2)

Then, since ∫ f(x | θ) dx = 1,

E_θ[z(X, θ)] = ∫ z(x, θ) f(x | θ) dx = ∫ (∂/∂θ) f(x | θ) dx = (∂/∂θ) ∫ f(x | θ) dx = 0.

Next, we differentiate this identity with respect to θ:

0 = (∂/∂θ) E_θ[z(X, θ)]
= (∂/∂θ) ∫ z(x, θ) f(x | θ) dx
= ∫ (z′(x, θ) f(x | θ) + z(x, θ) (∂/∂θ) f(x | θ)) dx
= ∫ (z′(x, θ) f(x | θ) + z(x, θ)² f(x | θ)) dx
= E_θ[z′(X, θ)] + E_θ[z(X, θ)²]
= E_θ[z′(X, θ)] + Var_θ[z(X, θ)],

where the fourth line above applies equation (2) and the last line uses E_θ[z(X, θ)] = 0.
Since θ̂ maximizes l(θ), we must have 0 = l′(θ̂). Consistency of θ̂ ensures that (when n is large) θ̂ is close to θ₀ with high probability. This allows us to apply a first-order Taylor expansion to the equation 0 = l′(θ̂) around θ̂ = θ₀:

0 ≈ l′(θ₀) + (θ̂ − θ₀) l″(θ₀),

so

√n (θ̂ − θ₀) ≈ −√n l′(θ₀)/l″(θ₀) = −((1/√n) l′(θ₀)) / ((1/n) l″(θ₀)). (3)

For the denominator, by the Law of Large Numbers,

(1/n) l″(θ₀) = (1/n) Σᵢ₌₁ⁿ (∂²/∂θ²)[log f(Xᵢ | θ)]_{θ=θ₀} = (1/n) Σᵢ₌₁ⁿ z′(Xᵢ, θ₀) → E_θ₀[z′(X, θ₀)] = −I(θ₀)

in probability. For the numerator, recall by Lemma 1.1 that z(X, θ₀) has mean 0 and variance I(θ₀) when X ∼ f(x | θ₀). Then by the Central Limit Theorem,

(1/√n) l′(θ₀) = (1/√n) Σᵢ₌₁ⁿ z(Xᵢ, θ₀) → N(0, I(θ₀))

in distribution. Applying these conclusions, the Continuous Mapping Theorem, and Slutsky’s Lemma⁴ to equation (3),

√n (θ̂ₙ − θ₀) → (1/I(θ₀)) N(0, I(θ₀)) = N(0, I(θ₀)⁻¹),

as desired.

⁴ Slutsky’s Lemma says: if Xₙ → c in probability and Yₙ → Y in distribution, then XₙYₙ → cY in distribution.

1.3.4 Applications of MLE in Regression Analysis

We start with the linear regression setup: let Y₁, Y₂, . . . , Yₙ be observations known as the response. Let xᵢ = (xᵢ₁, . . . , xᵢₚ)ᵀ ∈ Rᵖ be the corresponding vector of covariates for the i-th observation. Let β ∈ Rᵖ be the regression coefficient, so that for σ² > 0,

Yᵢ = xᵢᵀβ + ϵᵢ where ϵᵢ ∼ N(0, σ²).

Let X be the n × p matrix with rows x₁ᵀ, x₂ᵀ, . . . , xₙᵀ. In vector form we have

Y = Xβ + ϵ ∼ Nₙ(Xβ, σ²Iₙ).

The linear regression model is built to estimate β, which measures the linear effect of X on Y. We will show below how to use MLE in the computation of regression coefficients in the linear regression model.
Example 1.15. (MLE for Linear Regression).

In order to understand the linear relationship between X and Y, we will need to estimate β. We have

L(β, σ² | y) = ∏ᵢ₌₁ⁿ f(yᵢ | X, β, σ²) = (1/√(2πσ²))ⁿ exp(−(y − Xβ)ᵀ(y − Xβ) / (2σ²))

⇒ l(β, σ²) := log L(β, σ² | y) = −(n/2) log(2π) − (n/2) log σ² − (y − Xβ)ᵀ(y − Xβ) / (2σ²).

Note that

(y − Xβ)ᵀ(y − Xβ) = (yᵀ − βᵀXᵀ)(y − Xβ)
= yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ
= yᵀy − 2βᵀXᵀy + βᵀXᵀXβ.

Using this we have

∂l/∂β = −(1/(2σ²))(−2Xᵀy + 2XᵀXβ) = (Xᵀy − XᵀXβ)/σ², which we set to 0,
∂l/∂σ² = −n/(2σ²) + (y − Xβ)ᵀ(y − Xβ)/(2σ⁴), which we set to 0.

The first equation leads to β̂_MLE satisfying

Xᵀy − XᵀX β̂_MLE = 0 ⇒ β̂_MLE = (XᵀX)⁻¹Xᵀy, if (XᵀX)⁻¹ exists.

And σ̂²_MLE is

σ̂²_MLE = (y − Xβ̂_MLE)ᵀ(y − Xβ̂_MLE) / n.
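A minimal numerical sketch of these two formulas (assuming numpy; the design matrix, coefficients, and noise level below are arbitrary):

import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)  # sigma = 0.7

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X^T X) beta = X^T y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n
print(beta_hat, sigma2_hat)  # close to beta_true and sigma^2 = 0.49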

Remark 1.2. It can be verified that the second derivative (the Hessian in β) is negative definite, so this is indeed the maximum. The next question is: what if (XᵀX)⁻¹ does not exist? For example, if p > n, then the number of observations is less than the number of parameters, and since X is n × p, XᵀX is p × p of rank at most n < p. So XᵀX is not full rank and cannot be inverted. In this case, the MLE does not exist and other estimators need to be constructed. This is one of the motivations of penalized regression, which we will discuss next.

Note that in the linear regression setup, the MLE for β satisfies the following:

β̂_MLE = arg min_β (y − Xβ)ᵀ(y − Xβ).

Suppose X is such that XᵀX is not invertible; then the MLE does not exist, and we don’t know how to estimate β. In such cases, we may use a penalized likelihood that penalizes the coefficients β so that some of the βs are pushed towards zero. The covariates corresponding to those small βs are essentially unimportant, removing the singularity from XᵀX. The penalized likelihood is

Q̃(β) = L(β | y) + P̃(β),

where P̃(β) is called the penalization function. Since the optimization of L(β | y) only depends on the (y − Xβ)ᵀ(y − Xβ) term, a penalized (negative) log-likelihood is used, and the final penalized (negative) log-likelihood is

Q(β) = −log L(β | y) + P(β).

There are many ways of penalizing β and each method yields a different estimator. A
popular one is the ridge penalty.
Example 1.16. (MLE for Ridge Regression).

The ridge penalization term is P(β) = λβᵀβ/2 for λ > 0, giving

Q(β) = (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ.

We will minimize Q(β) over the space of β, and since we are adding a term that grows with the size of β, smaller values of β will be preferred. Small values of β mean the corresponding covariates in X are less important, and this will eventually nullify the singularity in XᵀX. The larger λ is, the more “penalization” there is for large values of β.

β̂ = arg min_β { (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ }.

To carry out the minimization, we take the derivative and set it to 0:

∂Q(β)/∂β = (1/2)(−2Xᵀy + 2XᵀXβ) + λβ = 0
⇒ (XᵀX + λIₚ) β̂ − Xᵀy = 0
⇒ β̂_ridge = (XᵀX + λIₚ)⁻¹ Xᵀy

(verify for yourself that the second derivative is positive definite). Note that XᵀX + λIₚ is always positive definite for λ > 0, since for any nonzero a ∈ Rᵖ,

aᵀ(XᵀX + λIₚ)a = aᵀXᵀXa + λaᵀa = ‖Xa‖² + λ‖a‖² > 0.

Thus, the final ridge solution always exists even if XᵀX is not invertible.
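The following sketch (assuming numpy; the dimensions and λ = 1 are arbitrary) illustrates that the ridge solution is well defined even when p > n, where the MLE fails:

import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 20, 50, 1.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is 50 x 50 but has rank at most 20; adding lam * I makes it invertible
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)  # (50,): a well-defined estimate despite p > n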

2 Fisher information and the Cramer-Rao bound

2.1 Fisher information for one or more parameters


For a parametric model {f(x | θ) : θ ∈ Ω} where θ ∈ R is a single parameter, we showed in Section 1.3 that the MLE θ̂ₙ based on IID X₁, . . . , Xₙ ∼ f(x | θ) is, under certain regularity conditions, asymptotically normal:

√n (θ̂ₙ − θ) → N(0, 1/I(θ))

in distribution as n → ∞, where

I(θ) := Var_θ[(∂/∂θ) log f(X | θ)] = −E_θ[(∂²/∂θ²) log f(X | θ)]

is the Fisher information. As an application of this result, let us study the sampling distribution of the MLE in a one-parameter Gamma model:

Example 2.1. Let X₁, . . . , Xₙ ∼ Gamma(α, 1) be IID. (For this example, we are assuming that we know β = 1 and only need to estimate α.) Then

log f(x | α) = log((1/Γ(α)) x^{α−1} e^{−x}) = −log Γ(α) + (α − 1) log x − x.
The log-likelihood of all observations is then

l(α) = Σᵢ₌₁ⁿ (−log Γ(α) + (α − 1) log Xᵢ − Xᵢ) = −n log Γ(α) + (α − 1) Σᵢ₌₁ⁿ log Xᵢ − Σᵢ₌₁ⁿ Xᵢ.

Introducing the digamma function ψ(α) = Γ′(α)/Γ(α), the MLE α̂ is obtained by (numerically) solving

0 = l′(α) = −nψ(α) + Σᵢ₌₁ⁿ log Xᵢ.

What is the sampling distribution of α̂? We compute

(∂²/∂α²) log f(x | α) = −ψ′(α).

As this does not depend on x, the Fisher information is I(α) = −E_α[−ψ′(α)] = ψ′(α). Then for large n, α̂ is distributed approximately as N(α, 1/(nψ′(α))).

Asymptotic normality of the MLE extends naturally to the setting of multiple parameters:

Theorem 2.1. Let {f(x | θ) : θ ∈ Ω} be a parametric model, where θ ∈ Rᵏ has k parameters. Let X₁, . . . , Xₙ ∼ f(x | θ) be IID for θ ∈ Ω, and let θ̂ₙ be the MLE based on X₁, . . . , Xₙ. Define the Fisher information matrix I(θ) ∈ Rᵏˣᵏ as the matrix whose (i, j) entry is given by the equivalent expressions

I(θ)ᵢⱼ = Cov_θ[(∂/∂θᵢ) log f(X | θ), (∂/∂θⱼ) log f(X | θ)] = −E_θ[(∂²/∂θᵢ∂θⱼ) log f(X | θ)]. (4)

Then under the same conditions as Theorem 1.1,

√n (θ̂ₙ − θ) → N(0, I(θ)⁻¹),

where I(θ)⁻¹ is the k × k matrix inverse of I(θ) (and the distribution on the right is the multivariate normal distribution having this covariance).

(For k = 1, this definition of I(θ) is exactly the same as our previous definition, and I(θ)⁻¹ is just 1/I(θ). The proof of the above result is analogous to the k = 1 case from the previous section, employing a multivariate Taylor expansion of the equation 0 = ∇l(θ̂) around θ̂ = θ₀.)
Example 2.2. Consider now the full Gamma model, with X₁, . . . , Xₙ ∼ Gamma(α, β) IID. Numerical computation of the MLEs α̂ and β̂ in this model was discussed in Section 1.3. To approximate their sampling distributions, note

log f(x | α, β) = log((β^α / Γ(α)) x^{α−1} e^{−βx}) = α log β − log Γ(α) + (α − 1) log x − βx,

so

(∂²/∂α²) log f(x | α, β) = −ψ′(α), (∂²/∂α∂β) log f(x | α, β) = 1/β, (∂²/∂β²) log f(x | α, β) = −α/β².

These partial derivatives again do not depend on x, so the Fisher information matrix is

I(α, β) = [ ψ′(α)   −1/β ]
          [ −1/β    α/β² ],

and its inverse is

I(α, β)⁻¹ = (1 / (ψ′(α) α/β² − 1/β²)) [ α/β²   1/β   ]
                                      [ 1/β    ψ′(α) ].

(α̂, β̂) is approximately distributed as the bivariate normal distribution N((α, β), (1/n) I(α, β)⁻¹). In particular, the marginal distribution of α̂ is approximately

N(α, (1/n) · (α/β²) / (ψ′(α) α/β² − 1/β²)).

Suppose, in this example, that in fact the true parameter is β = 1. Then the variance of α̂ reduces to 1/(n(ψ′(α) − 1/α)), which is not the variance 1/(nψ′(α)) obtained in Example 2.1; the variance here is larger. The difference is that in this example, we do not assume that we know β = 1, and instead we estimate β by its MLE β̂. As a result, the MLEs of α in these two examples are not the same, and here our uncertainty about β is also increasing the variability of our estimate of α.
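A numerical sketch of this comparison (assuming numpy and scipy; scipy.special.polygamma(1, ·) evaluates the trigamma function ψ′, and the values α = 2, β = 1, n = 1000 are arbitrary):

import numpy as np
from scipy.special import polygamma

alpha, beta, n = 2.0, 1.0, 1000
trigamma = polygamma(1, alpha)  # psi'(alpha)

I = np.array([[trigamma, -1.0 / beta],
              [-1.0 / beta, alpha / beta**2]])
var_alpha_joint = np.linalg.inv(I)[0, 0] / n  # beta unknown and estimated
var_alpha_known = 1.0 / (n * trigamma)        # beta = 1 known (Example 2.1)
print(var_alpha_joint, var_alpha_known)       # the first is larger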

More generally, for any 2 × 2 Fisher information matrix

I = [ a  b ]
    [ b  c ],

the first definition in equation (4) implies that a, c ≥ 0. The upper-left element of I⁻¹ is 1/(a − b²/c), which is always at least 1/a. This implies, for any model with a single parameter θ₁ that is contained inside a larger model with parameters (θ₁, θ₂), that the variability of the MLE for θ₁ in the larger model is always at least that of the MLE for θ₁ in the smaller model; they are equal when the off-diagonal entry b is equal to 0. The same observation is true for any number of parameters k ≥ 2 in the larger model.

This is a simple example of a trade-off between model complexity and accuracy of estimation, which is fundamental to many areas of statistics and machine learning: a complex model with more parameters might better capture the true distribution of data, but these parameters will also be more difficult to estimate than those in a simpler model.

2.2 The Cramer-Rao lower bound


Let’s return to the setting of a single parameter θ ∈ R. Why is the Fisher information I(θ) called “information”, and why should we choose to estimate θ by the MLE θ̂? If X₁, . . . , Xₙ ∼ f(x | θ₀) are IID for a true parameter θ₀, and l(θ) = Σᵢ₌₁ⁿ log f(Xᵢ | θ) is the log-likelihood function, then

I(θ₀) = −E_θ₀[(∂²/∂θ²)[log f(X | θ)]_{θ=θ₀}] = −(1/n) E_θ₀[l″(θ₀)].

I(θ₀) measures the expected curvature of the log-likelihood function l(θ) around the true parameter θ = θ₀. If l(θ) is sharply curved around θ₀, in other words, if I(θ₀) is large, then a small change in θ can lead to a large decrease in the log-likelihood l(θ), and hence the data provides a lot of “information” that the true value of θ is close to θ₀. Conversely, if I(θ₀) is small, then a small change in θ does not affect l(θ) by much, and the data provides less information about θ. In this (heuristic) sense, I(θ₀) quantifies the amount of information that each observation Xᵢ contains about the unknown parameter.

The Fisher information I(θ) is an intrinsic property of the model {f (x | θ) : θ ∈ Ω}, not
of any specific estimator. (We’ve shown that it is related to the variance of the MLE,
but its definition does not involve the MLE.) There are various information-theoretic
results stating that I(θ) describes a fundamental limit to how accurate any estimator
of θ based on X1 , . . . , Xn can be. We’ll prove one such result, called the Cramer-Rao
lower bound (CRLB):

Theorem 2.2. Consider a parametric model {f(x | θ) : θ ∈ Ω} (satisfying certain mild regularity assumptions) where θ ∈ R is a single parameter. Let T be any unbiased estimator of θ based on IID data X₁, . . . , Xₙ ∼ f(x | θ). Then

Var_θ[T] ≥ 1/(nI(θ)).

Proof. Recall the score function

z(x, θ) = (∂/∂θ) log f(x | θ) = ((∂/∂θ) f(x | θ)) / f(x | θ),

and let Z := Z(X₁, . . . , Xₙ, θ) = Σᵢ₌₁ⁿ z(Xᵢ, θ). By the definition of correlation and the fact that the correlation of two random variables is always between −1 and 1,

Cov_θ[Z, T]² ≤ Var_θ[Z] × Var_θ[T].

The random variables z(X₁, θ), . . . , z(Xₙ, θ) are IID, and by Lemma 1.1 they have mean 0 and variance I(θ). Then

Var_θ[Z] = n Var_θ[z(X₁, θ)] = nI(θ).

Since T is unbiased,

θ = E_θ[T] = ∫_{Rⁿ} T(x₁, . . . , xₙ) f(x₁ | θ) × . . . × f(xₙ | θ) dx₁ . . . dxₙ.

Differentiating both sides with respect to θ and applying the product rule of differentiation,

1 = ∫_{Rⁿ} T(x₁, . . . , xₙ) [ (∂/∂θ) f(x₁ | θ) × f(x₂ | θ) × . . . × f(xₙ | θ)
+ f(x₁ | θ) × (∂/∂θ) f(x₂ | θ) × . . . × f(xₙ | θ) + . . .
+ f(x₁ | θ) × f(x₂ | θ) × . . . × (∂/∂θ) f(xₙ | θ) ] dx₁ . . . dxₙ
= ∫_{Rⁿ} T(x₁, . . . , xₙ) Z(x₁, . . . , xₙ, θ) f(x₁ | θ) × . . . × f(xₙ | θ) dx₁ . . . dxₙ
= E_θ[TZ].

Since E_θ[Z] = 0, this implies Cov_θ[T, Z] = E_θ[TZ] = 1, so Var_θ[T] ≥ 1/(nI(θ)) as desired.

For two unbiased estimators of θ, the ratio of their variances is called their relative efficiency. An unbiased estimator is efficient if its variance equals the lower bound 1/(nI(θ)). Since the MLE achieves this lower bound asymptotically, we say it is asymptotically efficient.

The Cramer-Rao bound ensures that no unbiased estimator can achieve asymptotically lower variance than the MLE. Stronger results, which we will not prove here, in fact show that no estimator, biased or unbiased, can asymptotically achieve lower mean-squared-error than 1/(nI(θ)), except possibly on a small set of special values θ ∈ Ω.⁵ In particular, when the method-of-moments estimator differs from the MLE, we expect it to have higher mean-squared-error than the MLE for large n, which explains why the MLE is usually the preferred estimator in simple parametric models.

3 MLE under model misspecification

The eminent statistician George Box once said, “All models are wrong, but some are
useful.”

When we fit a parametric model to a set of data X1 , . . . , Xn , we are usually not certain
that the model is correct (for example, that the data truly have a normal or Gamma
distribution). Rather, we think of the model as an approximation to what might be
the true distribution of data. It is natural to ask, then, whether the MLE estimate θ̂
in a parametric model is at all meaningful, if the model itself is incorrect. The goal of
this section is to explore this question and to discuss how the properties of θ̂ change
under model misspecification.

3.1 MLE and the KL-divergence


Consider a parametric model {f(x | θ) : θ ∈ Ω}. We’ll assume throughout this section that f(x | θ) is the PDF of a continuous distribution, and θ ∈ R is a single parameter.

Thus far, we have been measuring the error of an estimator θ̂ by its distance to the true parameter θ, via the bias, variance, and MSE. If X₁, . . . , Xₙ ∼ g are IID for a PDF g that is not in the model, then there is no true parameter value θ associated to g. We will instead think about a measure of “distance” between two general PDFs:

⁵ For example, the constant estimator θ̂ = c for fixed c ∈ Ω achieves 0 mean-squared-error if the true parameter happened to be the special value c, but at all other parameter values it is worse than the MLE for sufficiently large n.
Definition 3.1. For two PDFs f and g, the Kullback-Leibler (KL) divergence from f to g is

D_KL(g‖f) = ∫ g(x) log(g(x)/f(x)) dx.

Equivalently, if X ∼ g, then

D_KL(g‖f) = E[log(g(X)/f(X))].

D_KL has many information-theoretic interpretations and applications. For our purposes, we’ll just note the following properties: if f = g, then log(g(x)/f(x)) ≡ 0, so D_KL(g‖f) = 0. By Jensen’s inequality, since x ↦ −log x is convex,

D_KL(g‖f) = E[−log(f(X)/g(X))] ≥ −log E[f(X)/g(X)] = −log ∫ (f(x)/g(x)) g(x) dx = −log ∫ f(x) dx = 0.

Furthermore, since x ↦ −log x is strictly convex, the inequality above can only be an equality if f(X)/g(X) is a constant random variable, i.e. if f = g. Thus, like an ordinary distance measure, D_KL(g‖f) ≥ 0 always, and D_KL(g‖f) = 0 if and only if f = g.

Example 3.1. To get an intuition for what the KL-divergence is measuring, let f and g be the PDFs of the distributions N(µ₀, σ²) and N(µ₁, σ²). Then

log(g(x)/f(x)) = log((1/√(2πσ²)) e^{−(x−µ₁)²/(2σ²)} / ((1/√(2πσ²)) e^{−(x−µ₀)²/(2σ²)}))
= −(x − µ₁)²/(2σ²) + (x − µ₀)²/(2σ²)
= (2(µ₁ − µ₀)x − (µ₁² − µ₀²)) / (2σ²).

So letting X ∼ g,

D_KL(g‖f) = E[log(g(X)/f(X))] = (1/(2σ²))(2(µ₁ − µ₀) E[X] − (µ₁² − µ₀²))
= (1/(2σ²))(2(µ₁ − µ₀)µ₁ − (µ₁² − µ₀²)) = (µ₁ − µ₀)²/(2σ²).

Thus D_KL(g‖f) is proportional to the square of the mean difference normalized by the standard deviation σ. In this example we happen to have D_KL(f‖g) = D_KL(g‖f), but in general this is not true: for two arbitrary PDFs f and g, we may have D_KL(f‖g) ≠ D_KL(g‖f).
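A numerical check of this closed form (assuming numpy and scipy; the chosen means and σ are arbitrary):

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu0, mu1, sigma = 0.0, 1.5, 2.0
g = norm(mu1, sigma).pdf
f = norm(mu0, sigma).pdf

# integrate g(x) log(g(x)/f(x)) over an interval covering both densities
integrand = lambda x: g(x) * np.log(g(x) / f(x))
kl_numeric, _ = quad(integrand, -40, 40)
kl_closed = (mu1 - mu0) ** 2 / (2 * sigma**2)
print(kl_numeric, kl_closed)  # both ~ 0.28125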
Suppose X₁, . . . , Xₙ ∼ g are IID, and consider a parametric model {f(x | θ) : θ ∈ Ω} which may or may not contain the true PDF g. The MLE θ̂ is the value of θ that maximizes

(1/n) l(θ) = (1/n) Σᵢ₌₁ⁿ log f(Xᵢ | θ),

and this quantity by the Law of Large Numbers converges in probability to

E_g[log f(X | θ)],

where E_g denotes expectation with respect to X ∼ g. In Section 2, we showed that when g(x) = f(x | θ₀) (meaning g belongs to the parametric model, and the true parameter is θ₀), then E_g[log f(X | θ)] is maximized at θ = θ₀; this explained consistency of the MLE. More generally, when g does not necessarily belong to the parametric model, we may write

E_g[log f(X | θ)] = E_g[log g(X)] − E_g[log(g(X)/f(X | θ))] = E_g[log g(X)] − D_KL(g‖f(· | θ)).

The term E_g[log g(X)] does not depend on θ, so the value of θ maximizing E_g[log f(X | θ)] is the value of θ that minimizes D_KL(g‖f(· | θ)). This (heuristically) shows the following result:⁶

Theorem 3.1. Let X₁, . . . , Xₙ ∼ g be IID and suppose D_KL(g‖f(· | θ)) has a unique minimum at θ = θ*. Then, under suitable regularity conditions on {f(x | θ) : θ ∈ Ω} and on g, the MLE θ̂ converges to θ* in probability as n → ∞.

The density f(x | θ*) may be interpreted as the “KL-projection” of g onto the parametric model {f(x | θ) : θ ∈ Ω}. In other words, the MLE is estimating the distribution in our model that is closest, with respect to KL-divergence, to g.

⁶ For a rigorous statement of necessary regularity conditions, see for example White, H. (1982). “Maximum Likelihood Estimation of Misspecified Models.” Econometrica, https://www.jstor.org/stable/1912526.

3.2 The sandwich estimator of variance

When X₁, . . . , Xₙ ∼ g are IID, how close is the MLE θ̂ to this KL-projection θ*? Analogous to our proof in Section 2, we may answer this question by performing a Taylor expansion of the identity 0 = l′(θ̂) around the point θ̂ = θ*. This yields

0 ≈ l′(θ*) + (θ̂ − θ*) l″(θ*),

so

√n (θ̂ − θ*) ≈ −((1/√n) l′(θ*)) / ((1/n) l″(θ*)). (5)

Recall the score function

z(x, θ) = (∂/∂θ) log f(x | θ).

The Law of Large Numbers applied to the denominator of equation (5) gives

(1/n) l″(θ*) = (1/n) Σᵢ₌₁ⁿ z′(Xᵢ, θ*) → E_g[z′(X, θ*)]

in probability, while the Central Limit Theorem applied to the numerator of equation (5) gives

(1/√n) l′(θ*) = (1/√n) Σᵢ₌₁ⁿ z(Xᵢ, θ*) → N(0, Var_g[z(X, θ*)])

in distribution. The quantity z(X, θ*) has mean 0 when X ∼ g because θ* maximizes E_g[log f(X | θ)], so differentiating with respect to θ yields

0 = (∂/∂θ) E_g[log f(X | θ)]|_{θ=θ*} = E_g[z(X, θ*)].

Hence by Slutsky’s lemma,

√n (θ̂ − θ*) → N(0, Var_g[z(X, θ*)] / E_g[z′(X, θ*)]²).

These are the same formulas as in Section 2 (with θ* in place of θ₀), except expectations and variances are taken with respect to X ∼ g rather than X ∼ f(x | θ*). If g(x) = f(x | θ*), meaning the model is correct, then Var_g[z(X, θ*)] = −E_g[z′(X, θ*)] = I(θ*), and we recover our theorem from Section 2. However, when g(x) ≠ f(x | θ*), in general

Var_g[z(X, θ*)] ≠ −E_g[z′(X, θ*)],

so we cannot simplify the variance of the above normal limit any further. We may instead estimate the individual quantities Var_g[z(X, θ*)] and E_g[z′(X, θ*)] using the sample variance of z(Xᵢ, θ̂) and the sample mean of z′(Xᵢ, θ̂); this yields the sandwich estimator for the variance of the MLE.
Example 3.2. Suppose we fit the model Exponential(λ) to data X₁, . . . , Xₙ by computing the MLE. The log-likelihood is

l(λ) = Σᵢ₌₁ⁿ log(λe^{−λXᵢ}) = n log λ − λ Σᵢ₌₁ⁿ Xᵢ,

so the MLE solves the equation 0 = l′(λ) = n/λ − Σᵢ₌₁ⁿ Xᵢ. This yields the MLE λ̂ = 1/X̄ (which is the same as the method-of-moments estimator from Section 1.1).

We may compute the sandwich estimate of the variance of λ̂ as follows: in the exponential model,

z(x, λ) = (∂/∂λ) log f(x | λ) = 1/λ − x, z′(x, λ) = (∂²/∂λ²) log f(x | λ) = −1/λ².

Let Z̄ = (1/n) Σᵢ₌₁ⁿ z(Xᵢ, λ̂) = (1/n) Σᵢ₌₁ⁿ (1/λ̂ − Xᵢ) = 1/λ̂ − X̄ be the sample mean of z(X₁, λ̂), . . . , z(Xₙ, λ̂). We estimate Var_g[z(X, λ)] by the sample variance of z(X₁, λ̂), . . . , z(Xₙ, λ̂):

(1/(n−1)) Σᵢ₌₁ⁿ (z(Xᵢ, λ̂) − Z̄)² = (1/(n−1)) Σᵢ₌₁ⁿ ((1/λ̂ − Xᵢ) − (1/λ̂ − X̄))² = (1/(n−1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)² = S²_X.

We estimate E_g[z′(X, λ)] by the sample mean of z′(X₁, λ̂), . . . , z′(Xₙ, λ̂):

(1/n) Σᵢ₌₁ⁿ z′(Xᵢ, λ̂) = −(1/n) Σᵢ₌₁ⁿ 1/λ̂² = −1/λ̂².

So the sandwich estimate of Var_g[z(X, λ)] / E_g[z′(X, λ)]² is S²_X λ̂⁴ = S²_X / X̄⁴, and we may estimate the standard error of λ̂ by S_X / (X̄² √n).
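A sketch of this recipe (assuming numpy; the Gamma data-generating choice and seed are arbitrary), fitting the Exponential(λ) model to data whose true distribution g is not exponential:

import numpy as np

rng = np.random.default_rng(8)
x = rng.gamma(shape=2.0, scale=1.0, size=2000)  # true g is Gamma(2, 1), not exponential

n = len(x)
lam_hat = 1.0 / x.mean()  # MLE under the (misspecified) exponential model
se_sandwich = x.std(ddof=1) / (x.mean() ** 2 * np.sqrt(n))  # S_X / (Xbar^2 sqrt(n))
se_model = lam_hat / np.sqrt(n)  # SE from I(lambda) = 1/lambda^2, valid only if the model is correct
print(lam_hat, se_sandwich, se_model)

When the model is correct the two standard errors agree asymptotically; under misspecification only the sandwich version is justified.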

4 Plugin estimators and the delta method

4.1 Estimating a function of θ


In the setting of a parametric model, we have been discussing how to estimate the parameter θ. We showed how to compute the MLE θ̂, derived its variance and sampling distribution for large n, and showed that no unbiased estimator can achieve variance much smaller than that of the MLE for large n (the Cramer-Rao lower bound).

In many examples, the quantity we are interested in is not θ itself, but some value
g(θ). The obvious way to estimate g(θ) is to use g(θ̂), where θ̂ is an estimate (say, the
MLE) of θ. This is called the plugin estimate of g(θ), because we are just “plugging
in” θ̂ for θ.

Example 4.1. (Odds). You play a game with a friend, where you flip a biased coin. If the coin lands heads, you give your friend $1. If the coin lands tails, your friend gives you $x. What is the value of x that makes this a fair game?

If the coin lands heads with probability p, then your expected winnings are −p + (1 − p)x. The game is fair when −p + (1 − p)x = 0, i.e. when x = p/(1 − p). This value p/(1 − p) is the odds of getting heads to getting tails. To estimate the odds from n coin flips

X₁, . . . , Xₙ ∼ Bernoulli(p) (IID),

we may first estimate p by p̂ = X̄. (This is both the method of moments estimator and the MLE.) Then the plugin estimate of p/(1 − p) is simply X̄/(1 − X̄).

The odds falls in the interval (0, ∞) and is not symmetric about p = 1/2. We oftentimes think instead in terms of the log-odds, log(p/(1 − p)); this can be any real number and is symmetric about p = 1/2. The plugin estimate for the log-odds is log(X̄/(1 − X̄)).

Example 4.2. (The Pareto mean). The Pareto(x₀, θ) distribution for x₀ > 0 and θ > 1 is a continuous distribution over the interval [x₀, ∞), given by the PDF

f(x | x₀, θ) = θx₀^θ x^{−θ−1} for x ≥ x₀, and 0 for x < x₀.

It is commonly used in economics as a model for the distribution of income. x₀ represents the minimum possible income; let’s assume that x₀ is known and equal to 1. We then have a one-parameter model with PDFs f(x | θ) = θx^{−θ−1} supported on [1, ∞).

The mean of the Pareto distribution is

E_θ[X] = ∫₁^∞ x · θx^{−θ−1} dx = θ [x^{−θ+1}/(−θ+1)]₁^∞ = θ/(θ − 1),

so we might estimate the mean income by θ̂/(θ̂ − 1) where θ̂ is the MLE. To compute θ̂ from observations X₁, . . . , Xₙ, the log-likelihood is

l(θ) = Σᵢ₌₁ⁿ log(θXᵢ^{−θ−1}) = Σᵢ₌₁ⁿ (log θ − (θ + 1) log Xᵢ) = n log θ − (θ + 1) Σᵢ₌₁ⁿ log Xᵢ.

Solving the equation

0 = l′(θ) = n/θ − Σᵢ₌₁ⁿ log Xᵢ

yields the MLE, θ̂ = n / Σᵢ₌₁ⁿ log Xᵢ.

4.2 The delta method

We would like to be able to quantify our uncertainty about g(θ̂) using what we know about the uncertainty of θ̂ itself. When n is large, this may be done using a first-order Taylor approximation of g, formalized as the delta method:

Theorem 4.1. (Delta method). If a function g : R → R is differentiable at θ₀ with g′(θ₀) ≠ 0, and if

√n (θ̂ − θ₀) → N(0, v(θ₀))

in distribution as n → ∞ for some variance v(θ₀), then

√n (g(θ̂) − g(θ₀)) → N(0, (g′(θ₀))² v(θ₀))

in distribution as n → ∞.

Proof sketch. We perform a Taylor expansion of g(θ̂) around θ̂ = θ₀:

g(θ̂) ≈ g(θ₀) + (θ̂ − θ₀) g′(θ₀).

Rearranging yields

√n (g(θ̂) − g(θ₀)) ≈ √n (θ̂ − θ₀) g′(θ₀),

and multiplying a mean-zero normal variable by a constant c scales its variance by c².
Example 4.3. (Log-odds). Let X1, . . . , Xn ∼ Bernoulli(p) (IID), and recall the plugin
estimate of the log-odds log(p/(1 − p)) given by log(X̄/(1 − X̄)). By the Central Limit
Theorem,

$$\sqrt{n}(\bar X - p) \to N(0,\, p(1-p))$$

in distribution, where p(1 − p) is the variance of a Bernoulli(p) random variable. The
function g(p) = log(p/(1 − p)) = log p − log(1 − p) has derivative

$$g'(p) = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)},$$

so by the delta method,

$$\sqrt{n}\left(\log\frac{\bar X}{1-\bar X} - \log\frac{p}{1-p}\right) \to N\left(0,\, \frac{1}{p(1-p)}\right).$$

In other words, our estimate of the log-odds of heads to tails is approximately normally
distributed around the true log-odds log(p/(1 − p)), with variance 1/(np(1 − p)).

Suppose we toss this biased coin n = 100 times and observe 60 heads, i.e. X̄ = 0.6. We
would estimate the log-odds by log(X̄/(1 − X̄)) ≈ 0.41, and we may estimate our standard
error by √(1/(nX̄(1 − X̄))) ≈ 0.20.
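These two numbers are easy to reproduce (a minimal sketch):

```python
import numpy as np

n, heads = 100, 60
p_hat = heads / n
log_odds_hat = np.log(p_hat / (1 - p_hat))        # about 0.41
se_hat = np.sqrt(1 / (n * p_hat * (1 - p_hat)))   # about 0.20
print(log_odds_hat, se_hat)
```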
Example 4.4. (The Pareto mean). Let X1, . . . , Xn ∼ Pareto(1, θ) (IID), and recall that
the MLE for θ is θ̂ = n / ∑ᵢ log Xᵢ. We may use the maximum-likelihood theory developed
in Section 1.3 to understand the distribution of θ̂: We compute (for x ≥ 1) the
following:

$$\log f(x \mid \theta) = \log\left(\theta x^{-\theta-1}\right) = \log\theta - (\theta+1)\log x$$

$$\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \frac{1}{\theta} - \log x$$

$$\frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta) = -\frac{1}{\theta^2}.$$

Then the Fisher information is given by I(θ) = 1/θ², so

$$\sqrt{n}(\hat\theta - \theta) \to N(0, \theta^2)$$
in distribution as n → ∞. For the function g(θ) = θ/(θ − 1), we have

$$g'(\theta) = \frac{1}{\theta-1} - \frac{\theta}{(\theta-1)^2} = -\frac{1}{(\theta-1)^2}.$$

So the delta method implies

$$\sqrt{n}\left(\frac{\hat\theta}{\hat\theta-1} - \frac{\theta}{\theta-1}\right) \to N\left(0,\, \frac{\theta^2}{(\theta-1)^4}\right).$$
Say, for a data set with n = 1000 income values, we obtain the MLE θ̂ = 1.5. We might
then estimate the mean income as θ̂/(θ̂ − 1) = 3, and estimate our standard error by
√(θ̂²/(n(θ̂ − 1)⁴)) ≈ 0.19.

What if we decided to just estimate the mean income by the sample mean, X̄? Since
E[Xi] = θ/(θ − 1), the Central Limit Theorem implies

$$\sqrt{n}\left(\bar X - \frac{\theta}{\theta-1}\right) \to N(0,\, \mathrm{Var}[X_i])$$

in distribution. For θ > 2, we may compute

$$E[X_i^2] = \int_1^\infty x^2 \cdot \theta x^{-\theta-1}\,dx = \theta\left[\frac{x^{-\theta+2}}{-\theta+2}\right]_1^\infty = \frac{\theta}{\theta-2},$$

so

$$\mathrm{Var}[X_i] = E[X_i^2] - (E[X_i])^2 = \frac{\theta}{\theta-2} - \left(\frac{\theta}{\theta-1}\right)^2 = \frac{\theta}{(\theta-1)^2(\theta-2)}.$$

(If θ ≤ 2, the variance of Xi is actually infinite.) For any θ, this variance is greater
than θ²/(θ − 1)⁴.

Thus, if the Pareto model for income is correct, then our previous estimate θ̂/(θ̂ − 1)
is more accurate for the mean income than is the sample mean X̄. Intuitively, this
is because the Pareto distribution is heavy-tailed, and the sample mean X̄ is heavily
influenced by rare but extremely large data values. On the other hand, θ̂ is estimating
the shape of the Pareto distribution and estimating the mean by its relationship to
this shape in the Pareto model. The formula for θ̂ involves the values log Xi rather
than Xi , so θ̂ is not as heavily influenced by extremely large data values. Of course,
the estimate θ̂/(θ̂ − 1) relies strongly on the correctness of the Pareto model, whereas
X̄ would be a valid estimate of the mean even if the Pareto model doesn’t hold true.
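This comparison can be checked by simulation; the sketch below (illustrative θ, n, and repetition count) estimates the Monte Carlo MSE of both estimators when the Pareto model is correct:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.5, 1000, 2000       # theta > 2, so Var[X_i] is finite
true_mean = theta / (theta - 1)

err_plugin, err_xbar = [], []
for _ in range(reps):
    x = rng.uniform(size=n) ** (-1.0 / theta)    # Pareto(1, theta) sample
    theta_hat = n / np.log(x).sum()
    err_plugin.append(theta_hat / (theta_hat - 1) - true_mean)
    err_xbar.append(x.mean() - true_mean)

print(np.mean(np.square(err_plugin)))   # MSE of the plugin estimator
print(np.mean(np.square(err_xbar)))     # MSE of the sample mean (larger)
```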

That the plugin estimate g(θ̂) performs better than X̄ in the previous example is not a
coincidence - it is in certain senses the best we can do for estimating g(θ). For example,
we have the following more general version of the Cramer-Rao lower bound:
Theorem 4.2. For a parametric model {f (x | θ) : θ ∈ Ω} (satisfying certain
mild regularity assumptions) where θ is a single parameter, let g be any function
differentiable on all of Ω, and let T be any unbiased estimator of g(θ) based on
data X1, . . . , Xn ∼ f(x | θ) (IID). Then

$$\mathrm{Var}_\theta[T] \ge \frac{g'(\theta)^2}{nI(\theta)}.$$

Hint. The proof is identical to that of Theorem 2.2, except with the equation θ = Eθ[T]
replaced by g(θ) = Eθ[T]. (Differentiating this equation yields g′(θ) = Eθ[T Z] =
Covθ[T, Z], as in Theorem 2.2.) An estimator T for g(θ) that achieves this variance
g′(θ)²/(nI(θ)) is called efficient. The plugin estimate g(θ̂), where θ̂ is the MLE,
achieves this variance asymptotically, so we say it is asymptotically efficient. This
theorem ensures that no unbiased estimator of g(θ) can achieve variance much smaller
than that of g(θ̂) when n is large, and in particular it applies to the estimator T = X̄
of the previous example.

5 Confidence intervals (CI)

The estimation of a parameter by a single value is referred to as point estimation.

In a wide variety of inference problems, one is not interested only in a point estimate
or a hypothesis test for the parameter. Rather, one wishes to establish a lower bound,
an upper bound, or both for the parameter. An alternative procedure is thus to give
an interval within which the parameter may be supposed to lie with a certain
probability or confidence; this is called interval estimation.

We have seen how to understand the variability of an estimate θ̂ for a parameter θ, or
of g(θ̂) for a quantity g(θ), in terms of its sampling distribution and its standard
error. This understanding may be used to construct a confidence interval for θ or g(θ).

5.1 Exact confidence intervals


In a parametric model, let g(θ) be any quantity of interest (which might be the pa-
rameter θ itself). Informally, a confidence interval for g(θ) is a random interval
calculated from the data that contains this value g(θ) with a specified probability. For
example, a 90% confidence interval contains g(θ) with probability 0.90, and a 95% con-
fidence interval contains g(θ) with probability 0.95. (If we construct 100 different 90%
confidence intervals for θ using 100 independent sets of data, then we would expect
about 90 of them to contain θ.)
WARNING! A common misconception is to interpret a CI as an interval that contains the
true value θ with probability (1 − α) for the particular observed sample x. This
interpretation is incorrect. The interpretation of CIs has to be made in terms of
repeated sampling, as discussed above.

More formally, what this means is the following: Let X1 , . . . , Xn be a sample of data. By
random interval, we mean an interval whose lower and upper endpoints L (X1 , . . . , Xn )
and U (X1 , . . . , Xn ) are functions of the data X1 , . . . , Xn . (Hence the interval is random
in the same sense that the data itself is random - a different realization of the data leads
to a different interval.) The interval [L (X1 , . . . , Xn ) , U (X1 , . . . , Xn )] is a 100(1 − α)%
confidence interval for g(θ) if, for all θ ∈ Ω,

Pθ [L (X1 , . . . , Xn ) ≤ g(θ) ≤ U (X1 , . . . , Xn )] = 1 − α;


where Pθ denotes probability under X1, . . . , Xn ∼ f(x | θ) (IID).

A confidence interval for g(θ) is commonly constructed from an estimate of g(θ) and an
estimate of the associated standard error:
Example 5.1. Consider data X1, . . . , Xn ∼ N(µ, σ²) (IID), where both µ and σ² are
unknown. To construct a confidence interval for µ, consider the estimate X̄. As
X̄ ∼ N(µ, σ²/n), the standard error of X̄ is σ/√n, which we may estimate by S/√n, where

$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar X\right)^2}.$$
Recall from Chapter 2 Section 3 that when X1, . . . , Xn ∼ N(µ, σ²) (IID), the quantity

$$\frac{\bar X - \mu}{S/\sqrt{n}} = \frac{\sqrt{n}(\bar X - \mu)}{S}$$

has a t-distribution with n − 1 degrees of freedom. (In Chapter 2 Section 3 we assumed
µ = 0, but the distribution of this quantity doesn't depend on µ.) Letting tn−1(α/2) be
the upper-α/2 point of the tn−1 distribution and noting that −tn−1(α/2) is then the
lower-α/2 point by symmetry, this means

$$P_{\mu,\sigma^2}\left[-t_{n-1}(\alpha/2) \le \frac{\sqrt{n}(\bar X - \mu)}{S} \le t_{n-1}(\alpha/2)\right] = 1-\alpha.$$
The upper inequality above may be rearranged as

$$\bar X - \frac{S}{\sqrt{n}}\, t_{n-1}(\alpha/2) \le \mu,$$

and the lower inequality may be rearranged as

$$\mu \le \bar X + \frac{S}{\sqrt{n}}\, t_{n-1}(\alpha/2).$$

Hence

$$P_{\mu,\sigma^2}\left[\bar X - \frac{S}{\sqrt{n}}\, t_{n-1}(\alpha/2) \le \mu \le \bar X + \frac{S}{\sqrt{n}}\, t_{n-1}(\alpha/2)\right] = 1-\alpha,$$

so [X̄ − (S/√n) tn−1(α/2), X̄ + (S/√n) tn−1(α/2)] is a 100(1 − α)% confidence interval
for µ. We'll use the notation X̄ ± (S/√n) tn−1(α/2) as shorthand for this interval.
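In practice this interval is computed directly from the data; a minimal sketch using SciPy (simulated data, illustrative parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=30)      # data with unknown mu and sigma^2

n, alpha = len(x), 0.05
xbar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # upper-alpha/2 point of t_{n-1}
half = s / np.sqrt(n) * t_crit
print(xbar - half, xbar + half)                  # exact 95% CI for mu
```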

Definition 5.1.

(i) An interval I(x), a subset of Ω ⊆ R, is said to constitute a confidence interval
with confidence coefficient (1 − α) if P[I(x) ∋ θ] = 1 − α for all θ ∈ Ω, i.e., the
random interval I(x) covers the true parameter with probability 1 − α.

(ii) A subset S(x) of Ω ⊆ Rᵏ is said to constitute a confidence set at confidence level
(1 − α) if P[S(x) ∋ θ] ⩾ 1 − α for all θ ∈ Ω.
5.2 Methods of finding confidence intervals

Let θ be a parameter and T be a statistic based on a random sample of size n from a
population. Most often it is possible to find a function ψ(T, θ) whose distribution does
not depend on θ; such a function is called a pivotal quantity (or pivot). Then

$$P\left[\psi_{1-\alpha/2} < \psi(T,\theta) < \psi_{\alpha/2}\right] = 1-\alpha,$$

where the quantiles ψα do not depend on θ, since the distribution of ψ(T, θ) is
independent of θ.

Now, ψ1−α/2 < ψ(T, θ) < ψα/2 can often be put in the form θ1(T) ⩽ θ ⩽ θ2(T). Then
P[θ1(T) ⩽ θ ⩽ θ2(T)] = 1 − α, and the observed value of the interval [θ1(T), θ2(T)]
will be the confidence interval for θ with confidence coefficient (1 − α).
Example 5.2. Let X1, . . . , Xn be a random sample from N(µ, σ²), where µ and σ are both
unknown. Find the confidence interval with confidence coefficient (1 − α) for
(i) µ; (ii) σ²; (iii) (µ, σ²).

Solutions.

(i) For a confidence interval for µ, we select the statistic T = X̄. Then ψ(T, µ) =
√n(X̄ − µ)/s ∼ tn−1, which is independent of µ. Now,

$$1-\alpha = P\left[-t_{\alpha/2,n-1} < \frac{\sqrt{n}(\bar X-\mu)}{s} < t_{\alpha/2,n-1}\right] = P\left[\bar X - \frac{s}{\sqrt{n}}\,t_{\alpha/2,n-1} \le \mu \le \bar X + \frac{s}{\sqrt{n}}\,t_{\alpha/2,n-1}\right].$$

Hence [X̄ − (s/√n) tα/2,n−1, X̄ + (s/√n) tα/2,n−1] is an observed confidence interval for
µ with confidence coefficient (1 − α).
(ii) For a confidence interval for σ², we select the statistic s² = (1/(n−1)) ∑ᵢ (Xi − X̄)².
Then ψ(s², σ²) = (n − 1)s²/σ² ∼ χ²ₙ₋₁, whose distribution is independent of σ². Now,

$$1-\alpha = P\left[\chi^2_{1-\alpha/2,n-1} \le (n-1)\frac{s^2}{\sigma^2} \le \chi^2_{\alpha/2,n-1}\right] = P\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}\right].$$

Hence

$$\left[\frac{\sum_{i=1}^n (X_i-\bar X)^2}{\chi^2_{\alpha/2,n-1}},\; \frac{\sum_{i=1}^n (X_i-\bar X)^2}{\chi^2_{1-\alpha/2,n-1}}\right]$$

is an observed confidence interval for σ² with confidence coefficient (1 − α).
(iii) We have

$$P\left[\bar X - \frac{s}{\sqrt{n}}\,t_{\alpha_1/2,n-1} \le \mu \le \bar X + \frac{s}{\sqrt{n}}\,t_{\alpha_1/2,n-1}\right] = 1-\alpha_1$$

and

$$P\left[\frac{(n-1)s^2}{\chi^2_{\alpha_2/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha_2/2,n-1}}\right] = 1-\alpha_2.$$

Note that, by Bonferroni's inequality, P(A ∩ B) ⩾ P(A) + P(B) − 1. Therefore,

$$P\left[\bar X - \frac{s}{\sqrt{n}}\,t_{\alpha_1/2,n-1} \le \mu \le \bar X + \frac{s}{\sqrt{n}}\,t_{\alpha_1/2,n-1};\; \frac{(n-1)s^2}{\chi^2_{\alpha_2/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha_2/2,n-1}}\right] \ge (1-\alpha_1)+(1-\alpha_2)-1 = 1-\alpha,$$

where α = α1 + α2. Hence

$$S(\mathbf{x}) = \left[\bar X - \frac{s}{\sqrt{n}}\,t_{\alpha_1/2,n-1},\; \bar X + \frac{s}{\sqrt{n}}\,t_{\alpha_1/2,n-1}\right] \times \left[\frac{(n-1)s^2}{\chi^2_{\alpha_2/2,n-1}},\; \frac{(n-1)s^2}{\chi^2_{1-\alpha_2/2,n-1}}\right]$$

is an observed confidence region for (µ, σ²) with confidence coefficient at least (1 − α).
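A minimal sketch computing the intervals of parts (i) and (ii) with SciPy (simulated data; parameter values illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=3.0, size=25)
n, alpha = len(x), 0.10

xbar, s2 = x.mean(), x.var(ddof=1)

# (i) 90% confidence interval for mu
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_mu = (xbar - np.sqrt(s2 / n) * t_crit, xbar + np.sqrt(s2 / n) * t_crit)

# (ii) 90% confidence interval for sigma^2
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)   # chi^2_{alpha/2, n-1}
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)       # chi^2_{1-alpha/2, n-1}
ci_var = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)
print(ci_mu, ci_var)
```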

5.3 Asymptotic confidence intervals


In the previous example, we were able to construct an exact confidence interval because
we knew the exact distribution of √n(X̄ − µ)/S, which is tn−1 (and which does not depend
on µ and σ²). Suppose that we had forgotten this fact. If n is large, we could still
have reasoned as follows: By the Central Limit Theorem, as n → ∞,

$$\sqrt{n}(\bar X - \mu) \to N(0, \sigma^2)$$

in distribution. By our addendum at the end of Section 6.4 of Chapter 2, S² → σ² in
probability (meaning, the sample variance S² is consistent for σ²). Then, applying the
Continuous Mapping Theorem and Slutsky's Lemma,

$$\frac{\sqrt{n}(\bar X - \mu)}{S} = \frac{\sigma}{S} \times \frac{\sqrt{n}(\bar X - \mu)}{\sigma} \to N(0,1)$$
in distribution, so

$$P_{\mu,\sigma^2}\left[-z_{\alpha/2} \le \frac{\sqrt{n}(\bar X - \mu)}{S} \le z_{\alpha/2}\right] \to 1-\alpha$$

as n → ∞. Rearranging the inequalities above in the same way as the previous example
yields a 100(1 − α)% asymptotic confidence interval X̄ ± (S/√n) zα/2 for µ. We expect
this interval to be accurate (meaning its coverage of µ is close to 100(1 − α)%) for
large n - indeed, for large n, zα/2 ≈ tn−1(α/2) because the tn−1 distribution is very
close to the standard normal distribution, so this interval is almost the same as the
exact interval of the previous example.

This method may be applied to construct an approximate confidence interval from any
asymptotically normal estimator, as we will see in the following examples.
Example 5.3. Let X1, . . . , Xn ∼ Poisson(λ) (IID). To construct an asymptotic confidence
interval for λ, let's start with the estimator λ̂ = X̄. By the Central Limit Theorem,

$$\sqrt{n}(\hat\lambda - \lambda) \to N(0, \lambda).$$

We don't know the variance λ of this limiting normal distribution, but we can estimate
it by λ̂. By the Law of Large Numbers, λ̂ → λ in probability as n → ∞, i.e. λ̂ is
consistent for λ. Then by the Continuous Mapping Theorem and Slutsky's Lemma,

$$\frac{\sqrt{n}(\hat\lambda - \lambda)}{\sqrt{\hat\lambda}} = \frac{\sqrt{\lambda}}{\sqrt{\hat\lambda}} \times \frac{\sqrt{n}(\hat\lambda - \lambda)}{\sqrt{\lambda}} \to N(0,1),$$

so

$$P_\lambda\left[-z_{\alpha/2} \le \frac{\sqrt{n}(\hat\lambda - \lambda)}{\sqrt{\hat\lambda}} \le z_{\alpha/2}\right] \to 1-\alpha.$$

Rearranging these inequalities yields the asymptotic 100(1 − α)% confidence interval
λ̂ ± √(λ̂/n) zα/2.
For various values of λ and n, the table below shows the simulated true probabilities
that the 90% and 95% confidence intervals constructed in this way cover λ:
                 Desired coverage: 90%          Desired coverage: 95%
             λ = 0.1    λ = 1    λ = 5      λ = 0.1    λ = 1    λ = 5
  n = 10       0.63      0.91     0.90        0.63      0.93     0.95
  n = 30       0.79      0.89     0.90        0.80      0.93     0.95
  n = 100      0.91      0.90     0.90        0.93      0.94     0.95

(Meaning, we simulated X1, . . . , Xn ∼ Poisson(λ) (IID), computed the confidence interval,
checked whether it contained λ, and repeated this B = 1, 000, 000 times. The table
reports the fraction of simulations for which the interval covered λ.) We observe that
coverage is closer to the desired levels for larger values of n, as well as for larger values
of λ. For small n and/or small λ, the normal approximation to the distribution of λ̂
is inaccurate, and the simulations show that we underestimate the variability of λ̂.
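The simulation behind such a table can be sketched as follows (a scaled-down version; the repetition count here is far smaller than the B = 1,000,000 used above):

```python
import numpy as np
from scipy import stats

def poisson_ci_coverage(lam, n, alpha=0.10, reps=100_000, seed=0):
    """Simulated coverage of the interval lambda_hat +/- z * sqrt(lambda_hat / n)."""
    rng = np.random.default_rng(seed)
    z = stats.norm.ppf(1 - alpha / 2)
    lam_hat = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    half = z * np.sqrt(lam_hat / n)
    return np.mean((lam_hat - half <= lam) & (lam <= lam_hat + half))

print(poisson_ci_coverage(lam=0.1, n=10))    # well below the nominal 0.90
print(poisson_ci_coverage(lam=5.0, n=100))   # close to 0.90
```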

Example 5.4. More generally, let {f(x | θ) : θ ∈ Ω} be any parametric model satisfying
the regularity conditions of Theorem 1.1, where θ is a single parameter. To obtain a
confidence interval for θ, consider the MLE θ̂, which satisfies

$$\sqrt{n}(\hat\theta - \theta) \to N\left(0,\, I(\theta)^{-1}\right)$$

as n → ∞. We may estimate I(θ) by the plugin estimator I(θ̂). If I(θ) is continuous in
θ and θ̂ is consistent for θ, then the Continuous Mapping Theorem implies I(θ̂) → I(θ)
in probability, and hence

$$\sqrt{nI(\hat\theta)}\,(\hat\theta - \theta) = \sqrt{\frac{I(\hat\theta)}{I(\theta)}} \times \sqrt{nI(\theta)}\,(\hat\theta - \theta) \to N(0,1).$$

So

$$P_\theta\left[-z_{\alpha/2} \le \sqrt{nI(\hat\theta)}\,(\hat\theta - \theta) \le z_{\alpha/2}\right] \to 1-\alpha,$$

and rearranging yields the asymptotic 100(1 − α)% confidence interval θ̂ ± zα/2/√(nI(θ̂)).
This is oftentimes called the Wald interval for θ.
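As a sketch, in the Pareto(1, θ) model above we have I(θ) = 1/θ², so the Wald interval is θ̂ ± zα/2 θ̂/√n (illustrative values; NumPy/SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
theta_true, n, alpha = 1.5, 1000, 0.05
x = rng.uniform(size=n) ** (-1.0 / theta_true)   # Pareto(1, theta) draws

theta_hat = n / np.log(x).sum()                  # MLE
z = stats.norm.ppf(1 - alpha / 2)
half = z * theta_hat / np.sqrt(n)                # z / sqrt(n I(theta_hat)), I = 1/theta^2
print(theta_hat - half, theta_hat + half)        # 95% Wald interval for theta
```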
Example 5.5. Let X1, . . . , Xn ∼ Bernoulli(p) (IID). Suppose we wish to construct a
confidence interval for the log-odds g(p) = log(p/(1 − p)). In Section 4, we showed
using the delta method that

$$\sqrt{n}\left(g(\hat p) - g(p)\right) \to N\left(0,\, \frac{1}{p(1-p)}\right),$$

where p̂ = X̄. Since p̂ → p in probability, by the Continuous Mapping Theorem and
Slutsky's Lemma,

$$\sqrt{n\hat p(1-\hat p)}\left(g(\hat p) - g(p)\right) = \sqrt{\frac{\hat p(1-\hat p)}{p(1-p)}} \times \sqrt{np(1-p)}\left(g(\hat p) - g(p)\right) \to N(0,1),$$

so

$$P_p\left[-z_{\alpha/2} \le \sqrt{n\hat p(1-\hat p)}\left(g(\hat p) - g(p)\right) \le z_{\alpha/2}\right] \to 1-\alpha.$$

An asymptotic 100(1 − α)% confidence interval for the log-odds g(p) = log(p/(1 − p)) is then

$$[L(\hat p), U(\hat p)] := \left[\log\frac{\hat p}{1-\hat p} - z_{\alpha/2}\sqrt{\frac{1}{n\hat p(1-\hat p)}},\; \log\frac{\hat p}{1-\hat p} + z_{\alpha/2}\sqrt{\frac{1}{n\hat p(1-\hat p)}}\right].$$
If we wish to obtain a confidence interval for the odds p/(1 − p) rather than the
log-odds, note that

$$P\left[L(\hat p) \le \log\frac{p}{1-p} \le U(\hat p)\right] = P\left[e^{L(\hat p)} \le \frac{p}{1-p} \le e^{U(\hat p)}\right],$$

so that [e^{L(p̂)}, e^{U(p̂)}] is a confidence interval for the odds. This interval is
not symmetric around the estimate p̂/(1 − p̂), and is different from what we would have
obtained if we instead applied the delta method directly to g(p) = p/(1 − p). The
interval [e^{L(p̂)}, e^{U(p̂)}] for the odds is typically used in practice because the
distribution of log(p̂/(1 − p̂)) is less skewed than that of p̂/(1 − p̂) for small to
moderate n, so the normal approximation and resulting confidence interval are more
accurate if we consider odds on the log scale.
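A minimal sketch computing both intervals, reusing the n = 100, 60-heads numbers from Example 4.3:

```python
import numpy as np
from scipy import stats

n, heads, alpha = 100, 60, 0.05
p_hat = heads / n
z = stats.norm.ppf(1 - alpha / 2)

center = np.log(p_hat / (1 - p_hat))
half = z / np.sqrt(n * p_hat * (1 - p_hat))
L, U = center - half, center + half
print(L, U)                      # CI for the log-odds
print(np.exp(L), np.exp(U))      # CI for the odds; asymmetric around p_hat/(1-p_hat)
```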

Let us caution that in the construction of these asymptotic confidence intervals, a


number of different approximations are being made:

• The true distribution of √n(θ̂ − θ) is being approximated by a normal distribution.

• The true variance of this normal distribution, say I(θ)−1 , is being approximated
by a plugin estimate I(θ̂)−1 .

• In the case where we are interested in g(θ) and g is a nonlinear function, the value
g(θ̂) is being approximated by the Taylor expansion g(θ) + (θ̂ − θ)g ′ (θ). (This is
what is done in the delta method.)
These approximations are all valid in the limit n → ∞, but their accuracy is not
guaranteed for the finite sample size n of any given problem. Coverage of asymptotic
confidence intervals should be checked by simulation, as Example 5.3 illustrates that
they might be severely overconfident for small n.

6 Bayesian analysis

Our treatment of parameter estimation thus far has assumed that θ is an unknown
but non-random quantity - it is some fixed parameter describing the true distribution
of data, and our goal was to determine this parameter. This is called the Frequentist
Paradigm of statistical inference. In this and the next section, we will describe an
alternative Bayesian Paradigm, in which θ itself is modeled as a random variable.
The Bayesian paradigm naturally incorporates our prior belief about the unknown
parameter θ and updates this belief based on observed data.

6.1 Prior and posterior distributions


Recall that if X, Y are two random variables having joint PDF or PMF fX,Y(x, y), then
the marginal distribution of X is given by the PDF
$$f_X(x) = \int f_{X,Y}(x,y)\,dy$$

in the continuous case and by the PMF


$$f_X(x) = \sum_y f_{X,Y}(x,y)$$

in the discrete case; this describes the probability distribution of X alone. The con-
ditional distribution of Y given X = x is defined by the PDF or PMF

$$f_{Y\mid X}(y\mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)},$$
and represents the probability distribution of Y if it is known that X = x. (This is
a PDF or PMF as a function of y, for any fixed x.) Defining similarly the marginal
distribution fY (y) of Y and the conditional distribution fX|Y (x | y) of X given Y = y,
the joint PDF fX,Y(x, y) factors in two ways as

fX,Y (x, y) = fY |X (y | x)fX (x) = fX|Y (x | y)fY (y).


In Bayesian analysis, before data is observed, the unknown parameter is modeled as a
random variable Θ having a probability distribution fΘ (θ), called the prior distribu-
tion. This distribution represents our prior belief about the value of this parameter.

Page 45
Conditional on Θ = θ, the observed data X is assumed to have distribution fX|Θ (x | θ),
where fX|Θ (x | θ) defines a parametric model with parameter θ, as in our previous
chapters.⁷ The joint distribution of Θ and X is then the product

fX,Θ (x, θ) = fX|Θ (x | θ)fΘ (θ),


and the marginal distribution of X (in the continuous case) is

$$f_X(x) = \int f_{X,\Theta}(x,\theta)\,d\theta = \int f_{X\mid\Theta}(x\mid\theta)\,f_\Theta(\theta)\,d\theta.$$

The conditional distribution of Θ given X = x is

$$f_{\Theta\mid X}(\theta\mid x) = \frac{f_{X,\Theta}(x,\theta)}{f_X(x)} = \frac{f_{X\mid\Theta}(x\mid\theta)\,f_\Theta(\theta)}{\int f_{X\mid\Theta}(x\mid\theta')\,f_\Theta(\theta')\,d\theta'}. \tag{6}$$
This is called the posterior distribution of Θ : It represents our knowledge about
the parameter Θ after having observed the data X. We often summarize the preceding
equation simply as

$$f_{\Theta\mid X}(\theta\mid x) \propto f_{X\mid\Theta}(x\mid\theta)\,f_\Theta(\theta) \tag{7}$$

$$\text{Posterior density} \propto \text{Likelihood} \times \text{Prior density}$$

where the symbol ∝ hides the proportionality factor fX(x) = ∫ fX|Θ(x | θ′) fΘ(θ′) dθ′,
which does not depend on θ.
Example 6.1. Let P ∈ (0, 1) be the probability of heads for a biased coin, and let
X1 , . . . , Xn be the outcomes of n tosses of this coin. If we do not have any prior
information about P, we might choose for its prior distribution Uniform(0, 1), having
PDF fP(p) = 1 for all p ∈ (0, 1). Given P = p, we model X1, . . . , Xn ∼ Bernoulli(p)
(IID). Then the joint distribution of P, X1, . . . , Xn is given by

$$f_{X,P}(x_1,\ldots,x_n,p) = f_{X\mid P}(x_1,\ldots,x_n\mid p)\,f_P(p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \times 1 = p^{\sum_i x_i}(1-p)^{n-\sum_i x_i}.$$
Let s = x1 +. . .+xn . The marginal distribution of X1 , . . . , Xn is obtained by integrating
fX,P (x1 , . . . , xn , p) over p:
$$f_X(x_1,\ldots,x_n) = \int_0^1 p^s(1-p)^{n-s}\,dp = B(s+1,\, n-s+1),$$

where B(x, y) is the Beta function

$$B(x,y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}.$$

⁷For notational simplicity, we are considering here a single data value X, but this
extends naturally to the case where X = (X1, . . . , Xn) is a data vector and fX|Θ(x | θ)
is the joint distribution of X given θ.
Hence the posterior distribution of P given X1 = x1, . . . , Xn = xn has PDF

$$f_{P\mid X}(p\mid x_1,\ldots,x_n) = \frac{f_{X,P}(x_1,\ldots,x_n,p)}{f_X(x_1,\ldots,x_n)} = \frac{1}{B(s+1,\,n-s+1)}\, p^s(1-p)^{n-s}.$$

This is the PDF of the Beta(s + 1, n − s + 1) distribution⁸, so the posterior distribution


of P given X1 = x1 , . . . , Xn = xn is Beta(s + 1, n − s + 1), where s = x1 + . . . + xn .
We computed explicitly the marginal distribution fX (x1 , . . . , xn ) above, but this was
not necessary to arrive at the answer. Indeed, equation (7) gives

$$f_{P\mid X}(p\mid x_1,\ldots,x_n) \propto f_{X\mid P}(x_1,\ldots,x_n\mid p)\, f_P(p) = p^s(1-p)^{n-s}.$$


This tells us that the PDF of the posterior distribution of P is proportional to
p^s(1 − p)^{n−s}, as a function of p. Then it must be the PDF of the Beta(s + 1, n − s + 1)
distribution, and the proportionality constant must be whatever constant is required
to make this PDF integrate to 1 over p ∈ (0, 1). We will repeatedly use this trick to
simplify our calculations of posterior distributions.
Example 6.2. Suppose now we have a prior belief that P is close to 1/2. There are
various prior distributions that we can choose to encode this belief; it will turn out to
be mathematically convenient to use the prior distribution Beta(α, α), which has mean
1/2 and variance 1/(8α + 4). The constant α may be chosen depending on how confi-
dent we are, a priori, that P is near 1/2 - choosing α = 1 reduces to the Uniform (0, 1)
prior of the previous example, whereas choosing α > 1 yields a prior distribution more
concentrated around 1/2.
The prior distribution Beta(α, α) has PDF fP(p) = p^{α−1}(1 − p)^{α−1}/B(α, α). Then,
applying equation (7), the posterior distribution of P given X1 = x1 , . . . , Xn = xn has
PDF

$$f_{P\mid X}(p\mid x_1,\ldots,x_n) \propto f_{X\mid P}(x_1,\ldots,x_n\mid p)\, f_P(p) \propto p^s(1-p)^{n-s} \times p^{\alpha-1}(1-p)^{\alpha-1} = p^{s+\alpha-1}(1-p)^{n-s+\alpha-1},$$
where s = x1 + . . . + xn as before, and where the symbol ∝ hides any proportionality
constants that do not depend on p. This is proportional to the PDF of the distribution
Beta (s + α, n − s + α), so this Beta distribution is the posterior distribution of P .
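A minimal sketch of this conjugate update in Python (prior strength and data are illustrative; SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha = 4                              # Beta(4, 4) prior, concentrated near 1/2
x = rng.binomial(1, 0.7, size=50)      # coin flips with true p = 0.7
s, n = x.sum(), len(x)

# Conjugate update: Beta(alpha, alpha) prior -> Beta(s + alpha, n - s + alpha) posterior
posterior = stats.beta(s + alpha, n - s + alpha)
print(posterior.mean())                # posterior mean (s + alpha) / (n + 2 alpha)
```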
⁸The Beta(α, β) distribution is a continuous distribution on (0, 1) with PDF
f(x) = x^{α−1}(1 − x)^{β−1}/B(α, β).
In the previous example, the parametric form for the prior was (cleverly) chosen so that
the posterior would be of the same form - they were both Beta distributions. This type
of prior is called a conjugate prior for P in the Bernoulli model. Use of a conjugate
prior is mostly for mathematical and computational convenience - in principle, any prior
fP(p) on (0, 1) may be used. The resulting posterior distribution may not be a simple
named distribution with a closed-form PDF, but the PDF may be computed numerically from
equation (6) by numerically evaluating the integral in the denominator of this equation.
Example 6.3. Let Λ ∈ (0, ∞) be the parameter of the Poisson model X1, . . . , Xn ∼
Poisson(λ) (IID). As a prior distribution for Λ, let us take the Gamma distribution
Gamma(α, β). The prior and likelihood are given by

$$f_\Lambda(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}$$

$$f_{X\mid\Lambda}(x_1,\ldots,x_n\mid\lambda) = \prod_{i=1}^n \frac{\lambda^{x_i}e^{-\lambda}}{x_i!}.$$
Dropping proportionality constants that do not depend on λ, the posterior distribution
of Λ given X1 = x1, . . . , Xn = xn is then

$$f_{\Lambda\mid X}(\lambda\mid x_1,\ldots,x_n) \propto f_{X\mid\Lambda}(x_1,\ldots,x_n\mid\lambda)\, f_\Lambda(\lambda) \propto \prod_{i=1}^n \lambda^{x_i}e^{-\lambda} \times \lambda^{\alpha-1}e^{-\beta\lambda} = \lambda^{s+\alpha-1}e^{-(n+\beta)\lambda},$$

where s = x1 + . . . + xn. This is proportional to the PDF of the Gamma(s + α, n + β)
distribution, so the posterior distribution of Λ must be Gamma(s + α, n + β).
As the prior and posterior are both Gamma distributions, the Gamma distribution is
a conjugate prior for Λ in the Poisson model.

6.2 Point estimates and credible intervals


To the Bayesian statistician, the posterior distribution is the complete answer to the
question: What is the value of θ? In many applications, though, we would still like to
have a single estimate θ̂, as well as an interval describing our uncertainty about θ.

The posterior mean and posterior mode are the mean and mode of the posterior
distribution of Θ; both of these are commonly used as a Bayesian estimate θ̂ for θ.
A 100(1 − α)% Bayesian credible interval is an interval I such that the posterior
probability P[Θ ∈ I | X] = 1 − α, and is the Bayesian analogue to a frequentist
confidence interval. One common choice for I is simply the interval [θ(α/2), θ(1−α/2)],
where θ(α/2) and θ(1−α/2) are the α/2 and 1 − α/2 quantiles of the posterior distribution
of Θ. Note that the interpretation of a Bayesian credible interval is different from
the interpretation of a frequentist confidence interval - in the Bayesian framework, the
parameter Θ is modeled as random, and 1 − α is the probability that this random
parameter Θ belongs to an interval that is fixed conditional on the observed data.
Example 6.4. From Example 6.2, the posterior distribution of P is Beta(s + α, n −
s + α). The posterior mean is then (s + α)/(n + 2α), and the posterior mode is
(s + α − 1)/(n + 2α − 2). Both of these may be taken as a point estimate p̂ for p. The
interval from the 0.05 to the 0.95 quantile of the Beta(s + α, n − s + α) distribution
forms a 90% Bayesian credible interval for p.
Example 6.5. From Example 6.3, the posterior distribution of Λ is Gamma(s + α, n +
β). The posterior mean and mode are then (s + α)/(n + β) and (s + α − 1)/(n + β),
and either may be used as a point estimate λ̂ for λ. The interval from the 0.05 to the
0.95 quantile of the Gamma(s + α, n + β) distribution forms a 90% Bayesian credible
interval for λ.
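Both credible intervals are straightforward to compute with SciPy's quantile functions (a sketch; the values of s, n and the prior parameters are illustrative):

```python
from scipy import stats

s, n = 60, 100                  # data summary (illustrative)
alpha_p, beta_p = 1.0, 1.0      # prior parameters (illustrative)

beta_post = stats.beta(s + alpha_p, n - s + alpha_p)               # posterior for p
gamma_post = stats.gamma(a=s + alpha_p, scale=1.0 / (n + beta_p))  # posterior for lambda

print(beta_post.ppf([0.05, 0.95]))    # 90% credible interval for p
print(gamma_post.ppf([0.05, 0.95]))   # 90% credible interval for lambda
```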

6.3 Conjugate priors and improper priors


Last section, we saw two examples of conjugate priors:
1. If X1, . . . , Xn ∼ Poisson(λ) (IID), then a conjugate prior for λ is Gamma(α, β), and
the corresponding posterior given X1 = x1, . . . , Xn = xn is Gamma(s + α, n + β), where
s = x1 + . . . + xn. A Bayesian estimate of λ is the posterior mean

$$\hat\lambda = \frac{s+\alpha}{n+\beta} = \frac{n}{n+\beta}\cdot\frac{s}{n} + \frac{\beta}{n+\beta}\cdot\frac{\alpha}{\beta}.$$

2. If X1, . . . , Xn ∼ Bernoulli(p) (IID), then a conjugate prior for p is Beta(α, β),
and the corresponding posterior given X1 = x1, . . . , Xn = xn is Beta(s + α, n − s + β).⁹
A Bayesian estimate of p is the posterior mean

$$\hat p = \frac{s+\alpha}{n+\alpha+\beta} = \frac{n}{n+\alpha+\beta}\cdot\frac{s}{n} + \frac{\alpha+\beta}{n+\alpha+\beta}\cdot\frac{\alpha}{\alpha+\beta}.$$

In addition to being mathematically convenient, conjugate priors oftentimes have
intuitive interpretations: In example 1 above, the posterior mean behaves as if we
observed, a priori, β additional count observations that sum to α. β may be interpreted
as an effective prior sample size and α/β as a prior mean, and the posterior mean is
a weighted average of the prior mean and the data mean. In example 2 above, the
posterior mean behaves as if we observed, a priori, α additional heads and β additional
tails. α + β is an effective prior sample size, α/(α + β) is a prior mean, and the
posterior mean is again a weighted average of the prior mean and the data mean. These
interpretations may serve as a guide for choosing the prior parameters α and β.

⁹If we assume α = β then the prior is centered around 1/2, but the same calculation of
the posterior distribution holds when α ≠ β.

Sometimes it is convenient to use the formalism of Bayesian inference, but with an “un-
informative prior” that does not actually impose prior knowledge, so that the resulting
analysis is more objective. In both examples above, the priors are “uninformative” for
the posterior mean when α and β are small. We may take this idea to the limit by
considering α = β = 0. As the PDF of the Gamma distribution is proportional to
x^{α−1}e^{−βx} on (0, ∞), the "PDF" for α = β = 0 may be considered to be

f(x) ∝ x⁻¹.

Similarly, as the PDF of the Beta distribution is proportional to x^{α−1}(1 − x)^{β−1}
on (0, 1), the "PDF" for α = β = 0 may be considered to be

f(x) ∝ x⁻¹(1 − x)⁻¹.


These are not real probability distributions: There is no such distribution as
Gamma(0, 0), and f(x) ∝ x⁻¹ does not actually describe a valid PDF on (0, ∞), because
∫ x⁻¹ dx = ∞, so that it is impossible to choose a normalizing constant to
make this PDF integrate to 1. Similarly, there is no such distribution as Beta(0, 0),
and f(x) ∝ x⁻¹(1 − x)⁻¹ does not describe a valid PDF on (0, 1). These types of priors
are called improper priors.

Nonetheless, we may formally carry out Bayesian analysis using improper priors, and
this oftentimes yields valid posterior distributions: In the Poisson example, we obtain
the posterior PDF
$$f_{\Lambda\mid X}(\lambda\mid x_1,\ldots,x_n) \propto f_{X\mid\Lambda}(x_1,\ldots,x_n\mid\lambda)\, f_\Lambda(\lambda) \propto \lambda^s e^{-n\lambda} \times \lambda^{-1} = \lambda^{s-1}e^{-n\lambda},$$

which is the PDF of Gamma(s, n). In the Bernoulli example, we obtain the posterior
PDF
$$f_{P\mid X}(p\mid x_1,\ldots,x_n) \propto f_{X\mid P}(x_1,\ldots,x_n\mid p)\, f_P(p) \propto p^s(1-p)^{n-s} \times p^{-1}(1-p)^{-1} = p^{s-1}(1-p)^{n-s-1},$$
which is the PDF of Beta(s, n − s). These posterior distributions are real probability
distributions (as long as s > 0 in the Poisson example and s, n − s > 0 in the Bernoulli
example), and may be thought of as approximations to the posterior distributions that
we would have obtained if we used proper priors with small but positive values of α
and β.

6.4 Normal approximation for large n


For any fixed α, β in the above examples, as n → ∞, the influence of the prior di-
minishes and the posterior mean becomes close to the MLE s/n. This is true more
generally for parametric models satisfying mild regularity conditions, and in fact the
posterior distribution is approximately a normal distribution centered at the MLE θ̂
with variance 1/(nI(θ̂)) for large n, where I(θ) is the Fisher information. We sketch the
argument for why this occurs:

Consider Bayesian inference applied with the prior fΘ(θ), for a parametric model
fX|Θ(x | θ). Let X1, . . . , Xn ∼ fX|Θ(x | θ) (IID), and let

$$l(\theta) = \sum_{i=1}^n \log f_{X\mid\Theta}(x_i\mid\theta)$$
be the usual log-likelihood. Then the posterior distribution of Θ is given by

fΘ|X (θ | x1 , . . . , xn ) ∝ fX|Θ (x1 , . . . , xn | θ) fΘ (θ) = exp(l(θ))fΘ (θ).


Applying a second-order Taylor expansion of l(θ) around the MLE θ̂,

$$l(\theta) \approx l(\hat\theta) + (\theta-\hat\theta)\,l'(\hat\theta) + \frac{1}{2}(\theta-\hat\theta)^2\, l''(\hat\theta) \approx l(\hat\theta) - \frac{1}{2}(\theta-\hat\theta)^2\cdot nI(\hat\theta),$$

where the second approximation follows because l′(θ̂) = 0 if θ̂ is the MLE, and l′′(θ̂) ≈ −nI(θ̂)
for large n. Since θ̂ is a function of the data x1 , . . . , xn and doesn’t depend on θ, we
may absorb exp(l(θ̂)) into the proportionality constant to obtain
$$f_{\Theta\mid X}(\theta\mid x_1,\ldots,x_n) \propto \exp\left(-\frac{1}{2}(\theta-\hat\theta)^2\cdot nI(\hat\theta)\right) f_\Theta(\theta).$$
For large n, the value of exp(−½(θ − θ̂)² · nI(θ̂)) is small unless θ is within order
1/√n distance from θ̂. In this region of θ, the prior fΘ(θ) is approximately constant
and equal to fΘ(θ̂). Absorbing this constant into the proportionality factor in ∝, we
finally arrive at

Page 51
$$f_{\Theta\mid X}(\theta\mid x_1,\ldots,x_n) \propto \exp\left(-\frac{1}{2}(\theta-\hat\theta)^2\cdot nI(\hat\theta)\right).$$

This describes a normal distribution for Θ with mean θ̂ and variance 1/(nI(θ̂)).

To summarize, the posterior mean of Θ is, for large n, approximately the MLE θ̂.
Furthermore, a 100(1 − α)% Bayesian credible interval is approximately given by
θ̂ ± zα/2/√(nI(θ̂)), which is exactly the 100(1 − α)% Wald confidence interval for θ. In
this sense, frequentist and Bayesian methods yield similar inferences for large n.
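As a concrete sketch of this agreement, compare the exact Beta posterior under a uniform prior with the normal approximation in the Bernoulli model, where I(p) = 1/(p(1 − p)) (values illustrative):

```python
import numpy as np
from scipy import stats

s, n = 60, 100
p_hat = s / n                             # MLE
fisher = 1.0 / (p_hat * (1 - p_hat))      # I(p) = 1/(p(1-p)) for the Bernoulli model

exact = stats.beta(s + 1, n - s + 1)                   # posterior under a uniform prior
approx = stats.norm(p_hat, 1.0 / np.sqrt(n * fisher))

print(exact.ppf([0.05, 0.95]))    # exact 90% credible interval
print(approx.ppf([0.05, 0.95]))   # normal approximation -- nearly identical
```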

6.5 Prior distributions and average MSE


Last section we introduced the prior distribution for Θ as something that encodes our
prior belief about its value. A different (but related) interpretation and motivation for
the prior comes from the following considerations:

Let’s return to the frequentist setting where we assume that there is a true parameter
θ for a parametric model {f (x | θ) : θ ∈ Ω}. Suppose we have two estimators for
θ based on data X1 , . . . , Xn ∼ f (x | θ) : θ̂1 and θ̂2 . Which estimator is “better”?
Without appealing to asymptotic (large n) arguments, one answer to this question is
to compare their mean-squared-errors:
$$\mathrm{MSE}_1(\theta) = E_\theta\left[(\hat\theta_1-\theta)^2\right] = \text{Variance of }\hat\theta_1 + \left(\text{Bias of }\hat\theta_1\right)^2;$$
$$\mathrm{MSE}_2(\theta) = E_\theta\left[(\hat\theta_2-\theta)^2\right] = \text{Variance of }\hat\theta_2 + \left(\text{Bias of }\hat\theta_2\right)^2.$$
The estimator with smaller MSE is “better”. Unfortunately, the problem with this
approach is that the MSEs might depend on the true parameter θ (hence why we have
written MSE1 and MSE2 as functions of θ in the above), and neither may be uniformly
better than the other. For example, suppose X1, . . . , Xn ∼ N(θ, 1) (IID). Let θ̂1 = X̄;
this is unbiased with variance 1/n, so its MSE is 1/n. Let θ̂2 ≡ 0 be the constant
estimator that always estimates θ by 0. This has bias −θ and variance 0, so its MSE is
θ². If the true parameter θ happens to be close to 0 - more specifically, if |θ| is less
than 1/√n - then θ̂2 is "better", and otherwise θ̂1 is "better".
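A small simulation makes this trade-off concrete (a sketch with illustrative n and θ values):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 25, 100_000

for theta in [0.0, 0.1, 1.0]:
    x = rng.normal(theta, 1.0, size=(reps, n))
    mse_xbar = np.mean((x.mean(axis=1) - theta) ** 2)   # close to 1/n for every theta
    mse_zero = theta ** 2                               # MSE of the constant estimator 0
    print(theta, mse_xbar, mse_zero)
# The zero estimator wins when |theta| < 1/sqrt(n) = 0.2, and loses badly otherwise.
```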

To resolve this ambiguity, we might consider a weighted average MSE,

$$\int_\Omega \mathrm{MSE}(\theta)\, w(\theta)\, d\theta,$$
where w(θ) is a weight function over the parameter space such that ∫Ω w(θ)dθ = 1, and
find the estimator that minimizes this weighted average. This weighted average MSE is
called the Bayes risk. Writing the expectation in the definition of the MSE as an
integral, and letting x denote the data and f(x | θ) denote the PDF of the data, we may
write the Bayes risk of an estimator θ̂ as

$$\int \left(\int \left(\hat\theta(x)-\theta\right)^2 f(x\mid\theta)\,dx\right) w(\theta)\,d\theta.$$
Exchanging the order of integration, this is

$$\int \left(\int \left(\hat\theta(x)-\theta\right)^2 f(x\mid\theta)\, w(\theta)\,d\theta\right) dx.$$

In order to minimize the Bayes risk, for each possible value x of the observed data,
θ̂(x) should be defined so as to minimize

$$\int \left(\hat\theta(x)-\theta\right)^2 f(x\mid\theta)\, w(\theta)\,d\theta.$$

Let us now interpret w(θ) as a prior fΘ(θ) for the parameter Θ, and f(x | θ) as the
likelihood fX|Θ(x | θ) given Θ = θ. Then

$$\int (\hat\theta(x)-\theta)^2 f(x\mid\theta)\, w(\theta)\,d\theta = \int (\hat\theta(x)-\theta)^2 f_{X,\Theta}(x,\theta)\,d\theta = f_X(x)\int (\hat\theta(x)-\theta)^2 f_{\Theta\mid X}(\theta\mid x)\,d\theta.$$

So given the observed data x, θ̂(x) should be defined to minimize

$$\int (\hat\theta(x)-\theta)^2 f_{\Theta\mid X}(\theta\mid x)\,d\theta = E\left[(\hat\theta(x)-\Theta)^2\right],$$

where the expectation is with respect to the posterior distribution of Θ for the fixed
and observed value of x. For any random variable Y, E[(c − Y)²] is minimized over c
when c = E[Y] - hence the minimizer θ̂(x) of the above is exactly the posterior mean
of Θ. We have thus arrived at the following conclusion:

The posterior mean of Θ for the prior fΘ(θ) is the estimator that minimizes the average
mean-squared-error

$$\int \mathrm{MSE}(\theta)\, f_\Theta(\theta)\, d\theta.$$

Thus a Bayesian prior may be interpreted as the weighting of parameter values for
which we wish to minimize the weighted-average mean-squared-error.
