SI_Chapter-2
Figure 1: Interplay between Probability and Statistical Inference for Data Science
The word “inference” refers to drawing conclusions based on some evidence. Thus, Statistical Inference refers to drawing conclusions based on evidence obtained from the data.
The main challenges in doing so are:
(i) How to summarise the information in the data using formal mathematical tools?
(ii) How to use these summaries to answer questions about the phenomenon of interest?
There is no unique way of answering these questions. There exist several schools
of thought that perform statistical inference using different tools and starting from
different philosophical views. These philosophical differences, as well as the different
mathematical tools employed in these approaches, will be discussed in detail in this
module. The two main schools of thought are the “Frequentist approach” and the
“Bayesian approach”. An appealing feature of this chapter is that both approaches are
presented in parallel, giving the student a balanced perspective.
In many areas, researchers collect data as a means to obtain information about a
phenomenon of interest or to collect information about a population. For example,
• The National Bureau of Statistics in the UAE collects information about UAE residents, such as their ages.
• The UAE national cancer registry monitors the survival times of cancer patients diagnosed in the UAE in specific years (cohorts).
• The National Aeronautics and Space Administration (NASA) has a publicly available Data Portal with many data sets produced in their experiments and monitoring.
• Many companies monitor their financial performance by looking at the daily price of the saleable stocks of the company (“share price”).
Data Reduction
Interval Estimation: In statistics, interval estimation is the use of sample data to calculate an interval of possible (probable) values of an unknown population parameter, in contrast to point estimation, which produces a single number (a point estimate).
The most prevalent forms of interval estimation are confidence intervals (a frequentist method) and credible intervals (a Bayesian method). Other common approaches to interval estimation, which are encompassed by statistical theory, are tolerance and prediction intervals (used mainly in regression analysis).
• Credible intervals can readily deal with prior information, while confidence
intervals cannot.
• Confidence intervals are more flexible and can be used practically in more
situations than credible intervals: one area where credible intervals suffer in
comparison is in dealing with non-parametric models.
In this chapter, we discuss the question: How do we estimate the parameter(s) of a given probability distribution?
θ. For example, in the $N(\mu, \sigma^2)$ model above, $\theta = (\mu, \sigma^2)$, $\Omega = \mathbb{R} \times \mathbb{R}^+$ where $\mathbb{R}^+$ is the set of positive real numbers, and
$$f(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
Given data X1 , . . . , Xn , the question of which parametric model we choose to fit the
data usually depends on what the data values represent (number of occurrences over a
period of time? aggregation of many small effects?) as well as a visual examination of
the shape of the data histogram. This question is discussed in the context of several
examples in Rice Sections 8.2-8.3.
Our main question of interest in this unit will be the following: After specifying an appropriate parametric model $\{f(x \mid \theta) : \theta \in \Omega\}$, and given observations
$$X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta),$$
how can we estimate the unknown parameter θ and quantify the uncertainty in our estimate?
Example 1.1. The Poisson distribution with parameter λ > 0 is a discrete distribution over the non-negative integers {0, 1, 2, 3, . . .} having PMF
$$f(x \mid \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}.$$
If X ∼ Poisson(λ), then it has mean E[X] = λ. Hence for data $X_1, \ldots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)$, a simple estimate of λ is the sample mean λ̂ = X̄.
The Exponential distribution with parameter λ > 0 has PDF
$$f(x \mid \lambda) = \lambda e^{-\lambda x}.$$
If X ∼ Exponential(λ), then E[X] = 1/λ. Hence for data $X_1, \ldots, X_n \overset{IID}{\sim} \text{Exponential}(\lambda)$, we estimate λ by the value λ̂ which satisfies 1/λ̂ = X̄, i.e. λ̂ = 1/X̄.
More generally, for X ∼ f(x | θ) where θ contains k unknown parameters, we may consider the first k moments of the distribution of X, which are the values
$$\mu_1 = E[X], \quad \mu_2 = E[X^2], \quad \ldots, \quad \mu_k = E[X^k].$$
Equating these to the corresponding sample moments $\hat\mu_j = \frac{1}{n}\sum_{i=1}^n X_i^j$ yields the method-of-moments estimators. For the $N(\mu, \sigma^2)$ model, $E[X] = \mu$ and $E[X^2] = \sigma^2 + \mu^2$, so the moment equations are
$$\hat\mu = \hat\mu_1, \qquad \hat\sigma^2 + \hat\mu^2 = \hat\mu_2.$$
The first equation yields µ̂ = µ̂₁ = X̄, and the second yields
$$\hat\sigma^2 = \hat\mu_2 - \hat\mu_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 = \frac{1}{n}\left(\sum_{i=1}^n X_i^2 - 2\sum_{i=1}^n X_i\bar X + n\bar X^2\right) = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
Example 1.4. Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Gamma}(\alpha, \beta)$. If X ∼ Gamma(α, β), then $E[X] = \frac{\alpha}{\beta}$ and $E[X^2] = \frac{\alpha + \alpha^2}{\beta^2}$. So the method of moments estimators α̂, β̂ solve the equations
$$\frac{\hat\alpha}{\hat\beta} = \hat\mu_1, \qquad \frac{\hat\alpha + \hat\alpha^2}{\hat\beta^2} = \hat\mu_2.$$
Substituting the first equation into the second,
$$\left(\frac{1}{\hat\alpha} + 1\right)\hat\mu_1^2 = \hat\mu_2,$$
so
$$\hat\alpha = \frac{\hat\mu_1^2}{\hat\mu_2 - \hat\mu_1^2} = \frac{\bar X^2}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2}, \qquad \hat\beta = \frac{\hat\alpha}{\hat\mu_1} = \frac{\bar X}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2}.$$
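As a quick numerical illustration (a sketch added here, not part of the original derivation), the following Python snippet simulates Gamma data and applies the closed-form method-of-moments expressions above; the parameter values and sample size are arbitrary.

```python
# Sketch: method-of-moments estimates for Gamma(alpha, beta) on simulated
# data, using the closed-form expressions above; parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1/beta_true, size=10_000)

xbar = x.mean()
s2 = np.mean((x - xbar)**2)      # (1/n) sum (X_i - Xbar)^2
alpha_hat = xbar**2 / s2
beta_hat = xbar / s2
print(alpha_hat, beta_hat)       # close to (3.0, 2.0)
```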
Given an estimator θ̂ = θ̂(X₁, . . . , Xₙ) of θ, we measure its accuracy as follows:
• The bias of θ̂ is $E_\theta[\hat\theta] - \theta$. Here and below, $E_\theta$ denotes the expectation with respect to $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta)$.
• The standard error of θ̂ is its standard deviation $\sqrt{\mathrm{Var}_\theta[\hat\theta]}$. Here and below, $\mathrm{Var}_\theta$ denotes the variance with respect to $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta)$.
• The mean-squared-error (MSE) of θ̂ is $E_\theta[(\hat\theta - \theta)^2]$.
The bias measures how close the average value of θ̂ is to the true parameter θ; the
standard error measures how variable is θ̂ around this average value. An estimator
with small bias need not be an accurate estimator, if it has large standard error, and
conversely an estimator with small standard error need not be accurate if it has large
bias. The mean-squared-error encompasses both bias and variance: For any random
variable X and any constant c ∈ ℝ,
$$E[(X-c)^2] = E[(X - EX + EX - c)^2] = E[(X-EX)^2] + 2(EX-c)E[X - EX] + (EX-c)^2 = \mathrm{Var}[X] + (EX - c)^2,$$
where we used that EX − c is a constant and E[X − EX] = 0. Applying this to X = θ̂ and c = θ, we obtain the bias-variance decomposition of the mean-squared-error:
$$E_\theta[(\hat\theta - \theta)^2] = \mathrm{Var}_\theta[\hat\theta] + \left(E_\theta[\hat\theta] - \theta\right)^2.$$
For the Poisson example, $E_\lambda[\hat\lambda] = E_\lambda[\bar X] = \frac{1}{n}\sum_{i=1}^n E_\lambda[X_i] = \lambda$, where the last equality uses E[X] = λ if X ∼ Poisson(λ). So $E_\lambda[\hat\lambda] - \lambda = 0$ for all λ > 0, and λ̂ is an unbiased estimator of λ. Also,
$$\mathrm{Var}_\lambda[\hat\lambda] = \mathrm{Var}_\lambda[\bar X] = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}_\lambda[X_i] = \frac{\lambda}{n}.$$
Unbiasedness alone does not guarantee a sensible estimator. Let X ∼ Poisson(λ) with PMF
$$p(x; \lambda) = P(X = x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}.$$
Suppose that we are interested in estimating the parameter θ = e⁻³λ based on a sample of size one. Let T(X) = (−2)ˣ. Then, the expectation is
$$E(T) = \sum_{x=0}^\infty (-2)^x\,\frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda}\sum_{x=0}^\infty \frac{(-2\lambda)^x}{x!} = e^{-\lambda}e^{-2\lambda} = e^{-3\lambda}.$$
Therefore, T is unbiased for e⁻³λ. However, T is unreasonable in the sense that it may be negative (for odd X), even though it is an estimator of a strictly positive quantity. This suggests that one should not automatically assume that a unique unbiased estimator is necessarily good.
Example 1.6. In the model $X_1, \ldots, X_n \overset{IID}{\sim} \text{Exponential}(\lambda)$, the method-of-moments estimator of λ was λ̂ = 1/X̄. This estimator is biased: Recall Jensen's inequality, which says for any strictly convex function g : ℝ → ℝ, E[g(X)] > g(E[X]). The function x ↦ 1/x is strictly convex on (0, ∞), so $E_\lambda[\hat\lambda] = E_\lambda[1/\bar X] > 1/E_\lambda[\bar X] = \lambda$. An exact calculation gives
$$\text{Bias} = E_\lambda[\hat\lambda] - \lambda = \frac{\lambda n}{n-1} - \lambda = \frac{\lambda}{n-1},$$
$$\text{Variance} = \mathrm{Var}_\lambda[\hat\lambda] = \frac{\lambda^2 n^2}{(n-1)^2(n-2)},$$
$$\text{MSE} = \frac{\lambda^2 n^2}{(n-1)^2(n-2)} + \left(\frac{\lambda}{n-1}\right)^2 = \frac{\lambda^2(n+2)}{(n-1)(n-2)}.$$
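These exact formulas can be checked by simulation. The sketch below (with arbitrary choices λ = 2 and n = 10, not drawn from the original text) compares the Monte Carlo mean of λ̂ = 1/X̄ against nλ/(n − 1).

```python
# Simulation check of E[lambda_hat] = n*lambda/(n-1) for lambda_hat = 1/Xbar;
# lambda = 2 and n = 10 are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 10, 200_000
x = rng.exponential(scale=1/lam, size=(reps, n))  # numpy uses scale = 1/lambda
lam_hat = 1 / x.mean(axis=1)
print(lam_hat.mean(), n * lam / (n - 1))          # both approximately 2.22
```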
So far, we have introduced the method of moments for estimating one or more parameters θ in a parametric model. In the next section, we discuss a different method called maximum likelihood estimation. The focus of the next section will be on how to compute this estimate; subsequent sections will study its statistical properties.
1.3 Maximum likelihood estimation
Consider data $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta)$, for a parametric model $\{f(x \mid \theta) : \theta \in \Omega\}$. The notation “x given θ” indicates that, given a particular value of θ, f(· | θ) defines a density. The parameter θ can be a vector of parameters. Suppose we simulate data from F or collect some real data (say, waiting times in a popular restaurant), and we want to answer the following questions:
1. How to estimate θ?
The likelihood of θ given the observed data is $L(\theta \mid X) = \prod_{i=1}^n f(X_i \mid \theta)$. It is important to note that L(θ | X) is not a distribution over θ. The “most likely” value is the value that maximizes the likelihood; equivalently, we may maximize the log-likelihood
$$l(\theta) = \log L(\theta \mid X) = \sum_{i=1}^n \log f(X_i \mid \theta),$$
which in many examples is easier to work with as it involves a sum rather than a product. Let's work through several examples:
Example 1.7. (Bernoulli). Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Bernoulli}(p)$. Then the likelihood is
$$L(p \mid x) = \prod_{i=1}^n p(x_i \mid p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}.$$
To obtain the MLE of p, we maximize the likelihood. Maximizing the likelihood is the same as maximizing the log of the likelihood, and the calculations are easier after taking the log:
$$l(p) := \log L(p \mid x) = \left(\sum_i x_i\right)\log p + \left(n - \sum_i x_i\right)\log(1-p),$$
$$\frac{dl(p)}{dp} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1-p} \overset{\text{set}}{=} 0 \ \Rightarrow\ \hat p = \frac{1}{n}\sum_{i=1}^n x_i.$$
Verify for yourself that the second derivative is negative at this p̂. Thus,
$$\hat p_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i.$$
Example 1.8. (Poisson). Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)$. Then
$$l(\lambda) = \sum_{i=1}^n \log\frac{\lambda^{X_i}e^{-\lambda}}{X_i!} = \sum_{i=1}^n \left(X_i\log\lambda - \lambda - \log(X_i!)\right) = (\log\lambda)\sum_{i=1}^n X_i - n\lambda - \sum_{i=1}^n \log(X_i!).$$
This is differentiable in λ, so we maximize l(λ) by setting its first derivative equal to 0:
$$0 = l'(\lambda) = \frac{1}{\lambda}\sum_{i=1}^n X_i - n.$$
Solving for λ yields the estimate λ̂ = X̄. Since l(λ) → −∞ as λ → 0 or λ → ∞, and
since λ̂ = X̄ is the unique value for which 0 = l′ (λ), this must be the maximum of l.
In this example, λ̂ is the same as the method-of-moments estimate.
Example 1.9. (Normal). Let $X_1, \ldots, X_n \overset{IID}{\sim} N(\mu, \sigma^2)$. Then
$$l(\mu, \sigma^2) = \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}\right) = \sum_{i=1}^n\left(-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(X_i-\mu)^2}{2\sigma^2}\right) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2.$$
Considering σ² (rather than σ) as the parameter, we maximize l(µ, σ²) by setting its partial derivatives with respect to µ and σ² equal to 0:
$$0 = \frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu), \qquad 0 = \frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i-\mu)^2.$$
Solving the first equation yields µ̂ = X̄, and substituting this into the second equation yields $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$. Since l(µ, σ²) → −∞ as µ → −∞, µ → ∞, σ² → 0, or σ² → ∞, and as (µ̂, σ̂²) is the unique value for which $0 = \frac{\partial l}{\partial\mu}$ and $0 = \frac{\partial l}{\partial\sigma^2}$, this must be the maximum of l. Again, the MLEs are the same as the method-of-moments estimates.
Example 1.10. (Two-parameter exponential). The density of a two-parameter exponential distribution is
$$f(x \mid \mu, \lambda) = \lambda e^{-\lambda(x-\mu)}, \quad x \ge \mu; \quad \mu \in \mathbb{R},\ \lambda > 0.$$
We want to compute the MLEs of both λ and µ. The likelihood is
$$L(\lambda, \mu \mid x) = \prod_{i=1}^n f(x_i \mid \mu, \lambda) = \prod_{i=1}^n \lambda e^{-\lambda(x_i-\mu)} I(x_i \ge \mu) = \lambda^n \exp\left\{-\lambda\left(\sum_i x_i - n\mu\right)\right\} I\left(\min_i x_i \ge \mu\right).$$
We will first maximize with respect to µ and then with respect to λ. Note that L(λ, µ) is an increasing function of µ on the region where the indicator equals 1, so the MLE of µ is the largest permitted value, µ̂ = min{X₁, . . . , Xₙ}. Substituting µ̂ and maximizing over λ as in the one-parameter exponential model then yields λ̂ = 1/(X̄ − µ̂).
Example 1.11. The tank problem.
“The enemy” has an unknown number N of tanks, which he has obligingly numbered 1, 2, . . . , N. Spies have reported sighting 8 tanks, the largest observed serial number being 137.
Assume that sightings are independent and that each of the N tanks has a probability 1/N of being observed at each sighting. What is the MLE of N?
Let Xᵢ be the serial number of tank i. Then the PMF of these variables is
$$P(x; N) = \begin{cases}\frac{1}{N}, & \text{for } x \le N,\\[2pt] 0, & \text{for } x > N.\end{cases}$$
Given that each tank has the same probability of being observed, and that the largest sample value is $x_{(8)} = 137$, it follows that the likelihood function of N is
$$L(N \mid x) = \begin{cases}\frac{1}{N^8}, & \text{for } N \ge 137,\\[2pt] 0, & \text{for } N < 137.\end{cases}$$
Since 1/N⁸ is decreasing in N, it is straightforward to see that the likelihood function is maximised at
$$\hat N = \max_{i=1,\ldots,8} x_i = 137.$$
1.3.2 No closed-form MLEs
In inference, obtaining the MLE requires maximizing the likelihood. However, it is possible that the maximizer has no closed analytical form! This is a common challenge in many models and estimation problems, and it requires sophisticated optimization tools. We will give examples in which we cannot get an analytical form of the MLE.
Example 1.12. (Gamma). Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Gamma}(\alpha, \beta)$. Then
$$l(\alpha, \beta) = \sum_{i=1}^n \log\left(\frac{\beta^\alpha}{\Gamma(\alpha)}X_i^{\alpha-1}e^{-\beta X_i}\right) = \sum_{i=1}^n\left(\alpha\log\beta - \log\Gamma(\alpha) + (\alpha-1)\log X_i - \beta X_i\right) = n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n\log X_i - \beta\sum_{i=1}^n X_i.$$
To maximize l(α, β), we set its partial derivatives equal to 0:
$$0 = \frac{\partial l}{\partial\alpha} = n\log\beta - \frac{n\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^n\log X_i; \qquad 0 = \frac{\partial l}{\partial\beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^n X_i.$$
The second equation implies that the MLEs α̂ and β̂ satisfy β̂ = α̂/X̄. Substituting into the first equation and dividing by n, α̂ satisfies
$$0 = \log\hat\alpha - \frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} - \log\bar X + \frac{1}{n}\sum_{i=1}^n\log X_i. \tag{1}$$
The function $f(\alpha) = \log\alpha - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$ decreases from ∞ to 0 as α increases from 0 to ∞, and the value $-\log\bar X + \frac{1}{n}\sum_{i=1}^n\log X_i$ is always negative (by Jensen's inequality), so equation (1) has a unique root α̂.
Unfortunately, there is no closed-form expression for this root α̂. (In particular, the
MLE α̂ is not the method-of-moments estimator for α.) We may compute the root
numerically using the Newton-Raphson method: We start with an initial guess α⁽⁰⁾, which (for example) may be the method-of-moments estimator
$$\alpha^{(0)} = \frac{\bar X^2}{\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2}.$$
Having computed α⁽ᵗ⁾ for any t = 0, 1, 2, . . ., we compute the next iterate α⁽ᵗ⁺¹⁾ by approximating equation (1) with a linear equation, using a first-order Taylor expansion around α̂ = α⁽ᵗ⁾, and set α⁽ᵗ⁺¹⁾ as the value of α̂ that solves this linear equation. In detail, let $f(\alpha) = \log\alpha - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$. A first-order Taylor expansion around α̂ = α⁽ᵗ⁾ in equation (1) yields the linear approximation
$$0 \approx f(\alpha^{(t)}) + (\hat\alpha - \alpha^{(t)})f'(\alpha^{(t)}) - \log\bar X + \frac{1}{n}\sum_{i=1}^n\log X_i,$$
and we set α⁽ᵗ⁺¹⁾ to be the value of α̂ solving this linear equation, i.e.
$$\alpha^{(t+1)} = \alpha^{(t)} + \frac{-f(\alpha^{(t)}) + \log\bar X - \frac{1}{n}\sum_{i=1}^n\log X_i}{f'(\alpha^{(t)})}.$$
The iterations α⁽⁰⁾, α⁽¹⁾, α⁽²⁾, . . . converge to the MLE α̂.
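A minimal sketch of this Newton-Raphson iteration in Python, using scipy's digamma function and its derivative (polygamma of order 1); the simulated data and true parameters below are illustrative assumptions, not from the original text.

```python
# A sketch of the Newton-Raphson iteration for the Gamma MLE (equation (1)),
# assuming numpy/scipy; data and true parameters below are illustrative.
import numpy as np
from scipy.special import digamma, polygamma

def gamma_mle(x, tol=1e-10, max_iter=100):
    """Return (alpha_hat, beta_hat) for IID Gamma(alpha, beta) data x."""
    xbar = x.mean()
    c = np.mean(np.log(x)) - np.log(xbar)        # (1/n) sum log X_i - log Xbar < 0
    alpha = xbar**2 / np.mean((x - xbar)**2)     # method-of-moments initial guess
    for _ in range(max_iter):
        f = np.log(alpha) - digamma(alpha)       # f(alpha) = log(alpha) - psi(alpha)
        fprime = 1.0/alpha - polygamma(1, alpha) # f'(alpha)
        step = -(f + c) / fprime                 # solve the linearized equation
        alpha += step
        if abs(step) < tol:
            break
    return alpha, alpha / xbar                   # beta_hat = alpha_hat / Xbar

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1/3.0, size=5000)  # true alpha = 2, beta = 3
print(gamma_mle(x))                                # approximately (2.0, 3.0)
```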
Example 1.13. Let $(X_1, \ldots, X_k) \sim \text{Multinomial}(n, (p_1, \ldots, p_k))$. (This is not quite the setting of n IID observations from a parametric model, as we have been considering, although you can think of (X₁, . . . , X_k) as a summary of n such observations Y₁, . . . , Yₙ from the parametric model Multinomial(1, (p₁, . . . , p_k)), where Yᵢ indicates which of k possible outcomes occurred for the i-th observation.) The log-likelihood is given by
$$l(p_1, \ldots, p_k) = \log\left[\binom{n}{X_1, \ldots, X_k}p_1^{X_1}\cdots p_k^{X_k}\right] = \log\binom{n}{X_1, \ldots, X_k} + \sum_{i=1}^k X_i\log p_i.$$
To maximize this subject to the constraint p₁ + . . . + p_k = 1, we may introduce a Lagrange multiplier λ and consider
$$L(p_1, \ldots, p_k, \lambda) = \log\binom{n}{X_1, \ldots, X_k} + \sum_{i=1}^k X_i\log p_i + \lambda(p_1 + \ldots + p_k - 1).$$
Setting the partial derivative with respect to each pᵢ to 0 and applying the constraint yields the MLE p̂ᵢ = Xᵢ/n.
Returning to the Poisson example: when n is large, asymptotic theory provides us with a more complete picture of the “accuracy” of λ̂. By the Law of Large Numbers, X̄ converges to λ in probability as n → ∞. Furthermore, by the Central Limit Theorem,
$$\sqrt{n}(\bar X - \lambda) \to N(0, \mathrm{Var}[X_i]) = N(0, \lambda)$$
in distribution as n → ∞. So for large n, we expect λ̂ to be close to λ, and the sampling distribution of λ̂ is approximately $N\left(\lambda, \frac{\lambda}{n}\right)$. This normal approximation is useful for many reasons – for example, it allows us to understand other measures of error (such as E[|λ̂ − λ|] or P[|λ̂ − λ| > 0.01]), and will allow us to obtain a confidence interval for λ̂. In a parametric model, we say that an estimator θ̂ based on X₁, . . . , Xₙ is consistent if θ̂ → θ in probability as n → ∞. We say that it is asymptotically normal if √n(θ̂ − θ) converges in distribution to a normal distribution (or a multivariate normal distribution, if θ has more than 1 parameter). So λ̂ above is consistent and asymptotically normal.
The goal of this section is to explain why, rather than being a curiosity of this Poisson
example, consistency and asymptotic normality of the MLE hold quite generally for
many “typical” parametric models, and there is a general formula for its asymptotic
variance. The following is one statement of such a result:
Theorem 1.1. Let $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta_0)$ for a parametric model {f(x | θ) : θ ∈ Ω}, and define
$$z(x, \theta) = \frac{\partial}{\partial\theta}\log f(x \mid \theta), \qquad z'(x, \theta) = \frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta), \qquad I(\theta) = -E_\theta[z'(X, \theta)].$$
Then, under suitable regularity conditions,ᵃ the MLE θ̂ is consistent and
$$\sqrt{n}(\hat\theta - \theta_0) \to N\left(0, \frac{1}{I(\theta_0)}\right)$$
in distribution as n → ∞.
ᵃ Some technical conditions in addition to the ones stated are required to make this theorem rigorously true; these additional conditions will hold for the examples we discuss, and we won't worry about them in this class.
Here z(x, θ) is called the score function, and I(θ) is called the Fisher information. Heuristically, for large n, the above theorem tells us the following about the MLE θ̂:
• θ̂ is asymptotically unbiased. More precisely, the bias of θ̂ is of smaller order than 1/√n. (Otherwise √n(θ̂ − θ₀) would not converge to a distribution with mean 0.)
Example 1.14. Let's verify that this theorem is correct for the above Poisson example. There,
$$\log f(x \mid \lambda) = \log\frac{\lambda^x e^{-\lambda}}{x!} = x\log\lambda - \lambda - \log(x!),$$
so the score function and its derivative are given by
$$z(x, \lambda) = \frac{\partial}{\partial\lambda}\log f(x \mid \lambda) = \frac{x}{\lambda} - 1, \qquad z'(x, \lambda) = \frac{\partial^2}{\partial\lambda^2}\log f(x \mid \lambda) = -\frac{x}{\lambda^2}.$$
We may compute the Fisher information as
$$I(\lambda) = -E_\lambda[z'(X, \lambda)] = E_\lambda\left[\frac{X}{\lambda^2}\right] = \frac{1}{\lambda},$$
so $\sqrt{n}(\hat\lambda - \lambda) \to N(0, \lambda)$ in distribution. This is the same result as what we obtained using a direct application of the CLT.
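A small simulation (a sketch with arbitrary choices of λ and n, added for illustration) shows this asymptotic normality at work: the standard deviation of λ̂ across replications matches √(λ/n).

```python
# Sketch: the sampling distribution of lambda_hat = Xbar is approximately
# N(lambda, lambda/n); lambda = 4 and n = 200 are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 4.0, 200, 100_000
lam_hat = rng.poisson(lam, size=(reps, n)).mean(axis=1)
print(lam_hat.std(), np.sqrt(lam / n))   # both approximately 0.141
```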
Proof Sketch of Theorem 1.1
Proof. We'll sketch heuristically the proof of Theorem 1.1, assuming f(x | θ) is the PDF of a continuous distribution. (The discrete case is analogous with integrals replaced by sums.)
To see why the MLE θ̂ is consistent, note that θ̂ is the value of θ which maximizes
$$\frac{1}{n}l(\theta) = \frac{1}{n}\sum_{i=1}^n\log f(X_i \mid \theta).$$
Suppose the true parameter is θ₀, i.e. $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta_0)$. Then for any θ ∈ Ω (not necessarily θ₀), the Law of Large Numbers implies the convergence in probability
$$\frac{1}{n}\sum_{i=1}^n\log f(X_i \mid \theta) \to E_{\theta_0}[\log f(X \mid \theta)].$$
Under suitable regularity conditions, this implies that the value of θ maximizing the left side, which is θ̂, converges in probability to the value of θ maximizing the right side, which we claim is θ₀. Indeed, for any θ ∈ Ω,
$$E_{\theta_0}[\log f(X \mid \theta)] - E_{\theta_0}[\log f(X \mid \theta_0)] = E_{\theta_0}\left[\log\frac{f(X \mid \theta)}{f(X \mid \theta_0)}\right].$$
Noting that x ↦ log x is concave, Jensen's inequality implies E[log Y] ≤ log E[Y] for any positive random variable Y, so
$$E_{\theta_0}\left[\log\frac{f(X \mid \theta)}{f(X \mid \theta_0)}\right] \le \log E_{\theta_0}\left[\frac{f(X \mid \theta)}{f(X \mid \theta_0)}\right] = \log\int\frac{f(x \mid \theta)}{f(x \mid \theta_0)}f(x \mid \theta_0)\,dx = \log\int f(x \mid \theta)\,dx = \log 1 = 0.$$
For the asymptotic variance, we first record a useful identity (Lemma 1.1): observe that
$$z(x, \theta)f(x \mid \theta) = \left(\frac{\partial}{\partial\theta}\log f(x \mid \theta)\right)f(x \mid \theta) = \frac{\frac{\partial}{\partial\theta}f(x \mid \theta)}{f(x \mid \theta)}\,f(x \mid \theta) = \frac{\partial}{\partial\theta}f(x \mid \theta). \tag{2}$$
Then, since $\int f(x \mid \theta)\,dx = 1$,
$$E_\theta[z(X, \theta)] = \int z(x, \theta)f(x \mid \theta)\,dx = \int\frac{\partial}{\partial\theta}f(x \mid \theta)\,dx = \frac{\partial}{\partial\theta}\int f(x \mid \theta)\,dx = 0.$$
Next, we differentiate this identity with respect to θ:
$$0 = \frac{\partial}{\partial\theta}E_\theta[z(X, \theta)] = \frac{\partial}{\partial\theta}\int z(x, \theta)f(x \mid \theta)\,dx = \int\left(z'(x, \theta)f(x \mid \theta) + z(x, \theta)\frac{\partial}{\partial\theta}f(x \mid \theta)\right)dx$$
$$= \int\left(z'(x, \theta)f(x \mid \theta) + z(x, \theta)^2 f(x \mid \theta)\right)dx = E_\theta[z'(X, \theta)] + E_\theta[z(X, \theta)^2],$$
so $\mathrm{Var}_\theta[z(X, \theta)] = E_\theta[z(X, \theta)^2] = -E_\theta[z'(X, \theta)] = I(\theta)$. This establishes Lemma 1.1.
For asymptotic normality, a first-order Taylor expansion of the equation 0 = l′(θ̂) around θ̂ = θ₀ gives 0 ≈ l′(θ₀) + (θ̂ − θ₀)l′′(θ₀), so
$$\sqrt{n}(\hat\theta - \theta_0) \approx -\sqrt{n}\,\frac{l'(\theta_0)}{l''(\theta_0)} = -\frac{\frac{1}{\sqrt n}l'(\theta_0)}{\frac{1}{n}l''(\theta_0)}. \tag{3}$$
For the denominator, by the Law of Large Numbers,
$$\frac{1}{n}l''(\theta_0) = \frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\left[\log f(X_i \mid \theta)\right]_{\theta=\theta_0} = \frac{1}{n}\sum_{i=1}^n z'(X_i, \theta_0) \to E_{\theta_0}[z'(X, \theta_0)] = -I(\theta_0)$$
in probability. For the numerator, recall by Lemma 1.1 that z(X, θ₀) has mean 0 and variance I(θ₀) when X ∼ f(x | θ₀). Then by the Central Limit Theorem,
$$\frac{1}{\sqrt n}l'(\theta_0) = \frac{1}{\sqrt n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\left[\log f(X_i \mid \theta)\right]_{\theta=\theta_0} = \frac{1}{\sqrt n}\sum_{i=1}^n z(X_i, \theta_0) \to N(0, I(\theta_0))$$
in distribution. Combining these via Slutsky's lemma in equation (3) yields $\sqrt{n}(\hat\theta - \theta_0) \to N(0, I(\theta_0)/I(\theta_0)^2) = N(0, 1/I(\theta_0))$, as claimed.
1.3.4 Applications of MLE in Regression Analysis
We start with the linear regression setup: Let Y₁, Y₂, . . . , Yₙ be observations known as the response. Let $x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p$ be the corresponding vector of covariates for the i-th observation. Let β ∈ ℝᵖ be the regression coefficient vector, so that for σ² > 0,
$$Y_i = x_i^T\beta + \epsilon_i \quad\text{where } \epsilon_i \overset{IID}{\sim} N(0, \sigma^2).$$
Let $X = (x_1, x_2, \ldots, x_n)^T$ be the n × p matrix whose rows are the covariate vectors. In vector form we have
$$Y = X\beta + \epsilon \sim N_n(X\beta, \sigma^2 I_n).$$
The linear regression model is built to estimate β, which measures the linear effect of X on Y. We will show below how to use the MLE to compute the regression coefficients in the linear regression model.
Example 1.15. (MLE for Linear Regression).
In order to understand the linear relationship between X and Y, we will need to estimate β. We have
$$L(\beta, \sigma^2 \mid y) = \prod_{i=1}^n f(y_i \mid X, \beta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\exp\left(-\frac{1}{2}\,\frac{(y - X\beta)^T(y - X\beta)}{\sigma^2}\right)$$
$$\Rightarrow\ l(\beta, \sigma^2) := \log L(\beta, \sigma^2 \mid y) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2}\,\frac{(y - X\beta)^T(y - X\beta)}{\sigma^2}.$$
Note that
$$(y - X\beta)^T(y - X\beta) = (y^T - \beta^TX^T)(y - X\beta) = y^Ty - y^TX\beta - \beta^TX^Ty + \beta^TX^TX\beta = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta.$$
Using this we have
$$\frac{\partial l}{\partial\beta} = -\frac{1}{2\sigma^2}\left(-2X^Ty + 2X^TX\beta\right) = \frac{X^Ty - X^TX\beta}{\sigma^2} \overset{\text{set}}{=} 0,$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{(y - X\beta)^T(y - X\beta)}{2\sigma^4} \overset{\text{set}}{=} 0.$$
The first equation leads to β̂_MLE satisfying
$$X^Ty - X^TX\hat\beta_{\text{MLE}} = 0 \ \Rightarrow\ \hat\beta_{\text{MLE}} = (X^TX)^{-1}X^Ty, \quad\text{if } (X^TX)^{-1}\text{ exists}.$$
And σ̂²_MLE is
$$\hat\sigma^2_{\text{MLE}} = \frac{(y - X\hat\beta_{\text{MLE}})^T(y - X\hat\beta_{\text{MLE}})}{n}.$$
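As an illustrative sketch (with hypothetical dimensions, coefficients, and noise level, not taken from the text), these closed-form MLEs can be computed directly via the normal equations:

```python
# Sketch: the closed-form MLEs for linear regression via the normal equations.
# Dimensions, coefficients, and noise level below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves X^T X beta = X^T y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                # MLE divides by n, not n - p
print(beta_hat, sigma2_hat)
```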
Remark 1.2. It can easily be verified that the second derivative is negative, so this is indeed the maximum. The next question is: What if (XᵀX)⁻¹ does not exist? For example, if p > n, then the number of observations is less than the number of parameters, and since X is n × p, XᵀX is p × p of rank at most n < p. So XᵀX is not full rank and cannot be inverted. In this case the MLE is not uniquely defined, and other estimators need to be constructed. This is one of the motivations for penalized regression, which we discuss next.
Note that in the linear regression setup, the MLE for β satisfies
$$\hat\beta_{\text{MLE}} = \arg\min_\beta\ (y - X\beta)^T(y - X\beta).$$
Since the optimization of L(β | y) depends on β only through the term (y − Xβ)ᵀ(y − Xβ), a penalized (negative) log-likelihood is used, of the form
$$\frac{(y - X\beta)^T(y - X\beta)}{2} + P(\beta),$$
where P(β) is called the penalization function. There are many ways of penalizing β, and each method yields a different estimator. A popular one is the ridge penalty.
Example 1.16. (MLE for Ridge Regression). The ridge objective is
$$Q(\beta) = \frac{(y - X\beta)^T(y - X\beta)}{2} + \frac{\lambda}{2}\beta^T\beta.$$
We will minimize Q(β) over the space of β, and since we are adding a term that grows with the size of β, smaller values of β will be preferred. A small β means the corresponding columns of X are less important, and the penalty will eventually nullify the singularity in XᵀX. The larger λ is, the more “penalization” there is for large values of β:
$$\hat\beta = \arg\min_\beta\left\{\frac{(y - X\beta)^T(y - X\beta)}{2} + \frac{\lambda}{2}\beta^T\beta\right\}.$$
To carry out the minimization, we take the derivative:
$$\frac{dQ(\beta)}{d\beta} = \frac{1}{2}\left(-2X^Ty + 2X^TX\beta\right) + \lambda\beta \overset{\text{set}}{=} 0$$
$$\Rightarrow\ (X^TX + \lambda I_p)\hat\beta - X^Ty = 0 \ \Rightarrow\ \hat\beta_{\text{ridge}} = (X^TX + \lambda I_p)^{-1}X^Ty$$
(verify for yourself that the second derivative is positive definite). Note that XᵀX + λIₚ is always positive definite for λ > 0, since for any nonzero a ∈ ℝᵖ,
$$a^T(X^TX + \lambda I_p)a = a^TX^TXa + \lambda a^Ta > 0.$$
Thus, the final ridge solution always exists even if XᵀX is not invertible.
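A minimal sketch showing that the ridge solution remains computable even when p > n and XᵀX is singular; the sizes and penalty λ below are arbitrary choices.

```python
# Sketch: the ridge estimator remains well defined when p > n and X^T X is
# singular. All sizes and the penalty lambda are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 50, 1.0                      # p > n: (X^T X)^{-1} does not exist
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)                      # (50,): a unique solution exists
```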
Example 2.1. Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Gamma}(\alpha, 1)$, with PDF $f(x \mid \alpha) = \frac{1}{\Gamma(\alpha)}x^{\alpha-1}e^{-x}$. The log-likelihood of all observations is then
$$l(\alpha) = \sum_{i=1}^n\left(-\log\Gamma(\alpha) + (\alpha-1)\log X_i - X_i\right) = -n\log\Gamma(\alpha) + (\alpha-1)\sum_{i=1}^n\log X_i - \sum_{i=1}^n X_i.$$
Introducing the digamma function $\psi(\alpha) = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$, the MLE α̂ is obtained by (numerically) solving
$$0 = l'(\alpha) = -n\psi(\alpha) + \sum_{i=1}^n\log X_i.$$
What is the sampling distribution of α̂? We compute
$$\frac{\partial^2}{\partial\alpha^2}\log f(x \mid \alpha) = -\psi'(\alpha).$$
As this does not depend on x, the Fisher information is $I(\alpha) = -E_\alpha[-\psi'(\alpha)] = \psi'(\alpha)$. Then for large n, α̂ is distributed approximately as $N\left(\alpha, \frac{1}{n\psi'(\alpha)}\right)$.
Asymptotic normality of the MLE extends naturally to the setting of multiple parameters. For θ = (θ₁, . . . , θ_k), define the k × k Fisher information matrix by
$$I(\theta)_{ij} = \mathrm{Cov}_\theta\left[\frac{\partial}{\partial\theta_i}\log f(X \mid \theta),\ \frac{\partial}{\partial\theta_j}\log f(X \mid \theta)\right] = -E_\theta\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f(X \mid \theta)\right]. \tag{4}$$
Then under the same conditions as Theorem 1.1,
$$\sqrt{n}(\hat\theta_n - \theta) \to N(0, I(\theta)^{-1}),$$
where I(θ)⁻¹ is the k × k matrix inverse of I(θ) (and the distribution on the right is the multivariate normal distribution having this covariance).
(For k = 1, this definition of I(θ) is exactly the same as our previous definition, and I(θ)⁻¹ is just 1/I(θ). The proof of the above result is analogous to the k = 1 case from the previous section, employing a multivariate Taylor expansion of the equation 0 = ∇l(θ̂) around θ̂ = θ₀.)
Example 2.2. Consider now the full Gamma model, $X_1, \ldots, X_n \overset{IID}{\sim} \text{Gamma}(\alpha, \beta)$. Numerical computation of the MLEs α̂ and β̂ in this model was discussed in Section 1.3. To approximate their sampling distributions, note
$$\log f(x \mid \alpha, \beta) = \log\left(\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\right) = \alpha\log\beta - \log\Gamma(\alpha) + (\alpha-1)\log x - \beta x,$$
so
$$\frac{\partial^2}{\partial\alpha^2}\log f(x \mid \alpha, \beta) = -\psi'(\alpha), \qquad \frac{\partial^2}{\partial\alpha\,\partial\beta}\log f(x \mid \alpha, \beta) = \frac{1}{\beta}, \qquad \frac{\partial^2}{\partial\beta^2}\log f(x \mid \alpha, \beta) = -\frac{\alpha}{\beta^2}.$$
These partial derivatives again do not depend on x, so the Fisher information matrix is
$$I(\alpha, \beta) = \begin{pmatrix}\psi'(\alpha) & -\frac{1}{\beta}\\[2pt] -\frac{1}{\beta} & \frac{\alpha}{\beta^2}\end{pmatrix},$$
and its inverse is
$$I(\alpha, \beta)^{-1} = \frac{1}{\psi'(\alpha)\frac{\alpha}{\beta^2} - \frac{1}{\beta^2}}\begin{pmatrix}\frac{\alpha}{\beta^2} & \frac{1}{\beta}\\[2pt] \frac{1}{\beta} & \psi'(\alpha)\end{pmatrix}.$$
(α̂, β̂) is approximately distributed as the bivariate normal distribution $N\left((\alpha, \beta), \frac{1}{n}I(\alpha, \beta)^{-1}\right)$. In particular, the marginal distribution of α̂ is approximately
$$N\left(\alpha,\ \frac{1}{n}\cdot\frac{\frac{\alpha}{\beta^2}}{\psi'(\alpha)\frac{\alpha}{\beta^2} - \frac{1}{\beta^2}}\right).$$
Suppose, in this example, that in fact the true parameter β = 1. Then the variance of α̂ reduces to $\frac{1}{n(\psi'(\alpha) - 1/\alpha)}$, which is not the variance $\frac{1}{n\psi'(\alpha)}$ obtained in Example 2.1 – the variance here is larger. The difference is that in this example, we do not assume that we know β = 1 and instead are estimating β by its MLE β̂. As a result, the MLEs of α in these two examples are not the same, and here our uncertainty about β is also increasing the variability of our estimate of α.
More generally, writing a 2 × 2 Fisher information matrix as
$$I = \begin{pmatrix}a & b\\ b & c\end{pmatrix},$$
the first definition in equation (4) implies that a, c ≥ 0. The upper-left element of I⁻¹ is $\frac{1}{a - b^2/c}$, which is always at least $\frac{1}{a}$. This implies, for any model with a single parameter θ₁ that is contained inside a larger model with parameters (θ₁, θ₂), that the variability of the MLE for θ₁ in the larger model is always at least that of the MLE for θ₁ in the smaller model; they are equal when the off-diagonal entry b is equal to 0. The same observation is true for any number of parameters k ≥ 2 in the larger model.
The Fisher information I(θ) is an intrinsic property of the model {f (x | θ) : θ ∈ Ω}, not
of any specific estimator. (We’ve shown that it is related to the variance of the MLE,
but its definition does not involve the MLE.) There are various information-theoretic
results stating that I(θ) describes a fundamental limit to how accurate any estimator
of θ based on X1 , . . . , Xn can be. We’ll prove one such result, called the Cramer-Rao
lower bound (CRLB):
Theorem 2.2. Consider a parametric model {f(x | θ) : θ ∈ Ω} (satisfying certain mild regularity assumptions) where θ ∈ ℝ is a single parameter. Let T be any unbiased estimator of θ based on data $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta)$. Then
$$\mathrm{Var}_\theta[T] \ge \frac{1}{nI(\theta)}.$$
Proof. Since T is unbiased, θ = E_θ[T] for all θ ∈ Ω. Differentiating both sides with respect to θ and applying the product rule inside the integral,
$$1 = \int_{\mathbb{R}^n} T(x_1, \ldots, x_n)\left[\frac{\partial}{\partial\theta}f(x_1 \mid \theta)\cdot f(x_2 \mid \theta)\cdots f(x_n \mid \theta) + \cdots + f(x_1 \mid \theta)\cdots f(x_{n-1} \mid \theta)\cdot\frac{\partial}{\partial\theta}f(x_n \mid \theta)\right]dx_1\cdots dx_n$$
$$= \int_{\mathbb{R}^n} T(x_1, \ldots, x_n)\,Z(x_1, \ldots, x_n, \theta)\,f(x_1 \mid \theta)\cdots f(x_n \mid \theta)\,dx_1\cdots dx_n = E_\theta[TZ],$$
where $Z(x_1, \ldots, x_n, \theta) = \sum_{i=1}^n z(x_i, \theta)$ has mean 0 and variance nI(θ). Since E_θ[Z] = 0, this implies Cov_θ[T, Z] = E_θ[TZ] = 1, so by the Cauchy–Schwarz inequality, $\mathrm{Var}_\theta[T] \ge \frac{\mathrm{Cov}_\theta[T, Z]^2}{\mathrm{Var}_\theta[Z]} = \frac{1}{nI(\theta)}$, as desired.
For two unbiased estimators of θ, the ratio of their variances is called their relative efficiency. An unbiased estimator is efficient if its variance equals the lower bound $\frac{1}{nI(\theta)}$. Since the MLE achieves this lower bound asymptotically, we say it is asymptotically efficient.
The Cramer-Rao bound ensures that no unbiased estimator can achieve asymptotically lower variance than the MLE. Stronger results, which we will not prove here, in fact show that no estimator, biased or unbiased, can asymptotically achieve lower mean-squared-error than $\frac{1}{nI(\theta)}$, except possibly on a small set of special values θ ∈ Ω.⁵ In particular, when the method-of-moments estimator differs from the MLE, we expect it to have higher mean-squared-error than the MLE for large n, which explains why the MLE is usually the preferred estimator in simple parametric models.
The eminent statistician George Box once said, “All models are wrong, but some are
useful.”
When we fit a parametric model to a set of data X1 , . . . , Xn , we are usually not certain
that the model is correct (for example, that the data truly have a normal or Gamma
distribution). Rather, we think of the model as an approximation to what might be
the true distribution of data. It is natural to ask, then, whether the MLE estimate θ̂
in a parametric model is at all meaningful, if the model itself is incorrect. The goal of
this section is to explore this question and to discuss how the properties of θ̂ change
under model misspecification.
Thus far, we have been measuring the error of an estimator θ̂ by its distance to the true parameter θ, via the bias, variance, and MSE. If $X_1, \ldots, X_n \overset{IID}{\sim} g$ for a PDF g that is not in the model, then there is no true parameter value θ associated to g. We will instead think about a measure of “distance” between two general PDFs:
⁵ For example, the constant estimator θ̂ = c for fixed c ∈ Ω achieves 0 mean-squared-error if the true parameter happened to be the special value c, but at all other parameter values is worse than the MLE for sufficiently large n.
Definition 3.1. For two PDFs f and g, the Kullback-Leibler (KL) divergence from f to g is
$$D_{KL}(g\,\|\,f) = \int g(x)\log\frac{g(x)}{f(x)}\,dx.$$
Equivalently, if X ∼ g, then
$$D_{KL}(g\,\|\,f) = E\left[\log\frac{g(X)}{f(X)}\right].$$
D_KL has many information-theoretic interpretations and applications. For our purposes, we'll just note the following properties: If f = g, then log(g(x)/f(x)) ≡ 0, so D_KL(g‖f) = 0. By Jensen's inequality, since x ↦ −log x is convex,
$$D_{KL}(g\,\|\,f) = E\left[-\log\frac{f(X)}{g(X)}\right] \ge -\log E\left[\frac{f(X)}{g(X)}\right] = -\log\int\frac{f(x)}{g(x)}g(x)\,dx = -\log\int f(x)\,dx = 0.$$
Furthermore, since x ↦ −log x is strictly convex, the inequality above can only be an equality if f(X)/g(X) is a constant random variable, i.e. if f = g. Thus, like an ordinary distance measure, D_KL(g‖f) ≥ 0 always, and D_KL(g‖f) = 0 if and only if f = g.
Example 3.1. To get an intuition for what the KL-divergence is measuring, let f and g be the PDFs of the distributions N(µ₀, σ²) and N(µ₁, σ²). Then
$$\log\frac{g(x)}{f(x)} = \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu_1)^2}{2\sigma^2}}\Big/\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu_0)^2}{2\sigma^2}}\right) = -\frac{(x-\mu_1)^2}{2\sigma^2} + \frac{(x-\mu_0)^2}{2\sigma^2} = \frac{2(\mu_1-\mu_0)x - (\mu_1^2 - \mu_0^2)}{2\sigma^2}.$$
So letting X ∼ g,
$$D_{KL}(g\,\|\,f) = E\left[\log\frac{g(X)}{f(X)}\right] = \frac{1}{2\sigma^2}\left(2(\mu_1-\mu_0)E[X] - (\mu_1^2-\mu_0^2)\right) = \frac{1}{2\sigma^2}\left(2(\mu_1-\mu_0)\mu_1 - \mu_1^2 + \mu_0^2\right) = \frac{(\mu_1-\mu_0)^2}{2\sigma^2}.$$
Thus D_KL(g‖f) is proportional to the square of the mean difference, normalized by the standard deviation σ. In this example we happen to have D_KL(f‖g) = D_KL(g‖f), but in general this is not true – for two arbitrary PDFs f and g, we may have D_KL(f‖g) ≠ D_KL(g‖f).
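This closed-form divergence can be sanity-checked by Monte Carlo, using the expectation form of the definition above; the parameter values in this sketch are arbitrary.

```python
# Monte Carlo sanity check of D_KL(g || f) = (mu1 - mu0)^2 / (2 sigma^2)
# for equal-variance normals; the parameter values are arbitrary.
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma = 0.0, 1.5, 2.0
rng = np.random.default_rng(1)
x = rng.normal(mu1, sigma, size=10**6)              # draws from g = N(mu1, sigma^2)
kl_mc = np.mean(norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma))
print(kl_mc, (mu1 - mu0)**2 / (2 * sigma**2))       # both approximately 0.28125
```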
Suppose $X_1, \ldots, X_n \overset{IID}{\sim} g$, and consider a parametric model {f(x | θ) : θ ∈ Ω} which may or may not contain the true PDF g. The MLE θ̂ is the value of θ that maximizes
$$\frac{1}{n}l(\theta) = \frac{1}{n}\sum_{i=1}^n\log f(X_i \mid \theta),$$
and this quantity by the Law of Large Numbers converges in probability to $E_g[\log f(X \mid \theta)]$, where E_g denotes expectation with respect to X ∼ g. In Section 2, we showed that when g(x) = f(x | θ₀) (meaning g belongs to the parametric model, and the true parameter is θ₀), then E_g[log f(X | θ)] is maximized at θ = θ₀ – this explained consistency of the MLE. More generally, when g does not necessarily belong to the parametric model, we may write
$$E_g[\log f(X \mid \theta)] = E_g[\log g(X)] - E_g\left[\log\frac{g(X)}{f(X \mid \theta)}\right] = E_g[\log g(X)] - D_{KL}(g\,\|\,f(\cdot \mid \theta)).$$
The term E_g[log g(X)] does not depend on θ, so the value of θ maximizing E_g[log f(X | θ)] is the value of θ that minimizes D_KL(g‖f(· | θ)). This (heuristically) shows the following result:
Theorem 3.1. Let $X_1, \ldots, X_n \overset{IID}{\sim} g$ and suppose D_KL(g‖f(· | θ)) has a unique minimum at θ = θ*. Then, under suitable regularity conditions on {f(x | θ) : θ ∈ Ω} and on g, the MLE θ̂ converges to θ* in probability as n → ∞.
The asymptotic distribution of θ̂ around θ* follows as before: a first-order Taylor expansion of 0 = l′(θ̂) around θ̂ = θ* gives
$$0 \approx l'(\theta^*) + (\hat\theta - \theta^*)l''(\theta^*),$$
so
$$\sqrt{n}(\hat\theta - \theta^*) \approx -\frac{\frac{1}{\sqrt n}l'(\theta^*)}{\frac{1}{n}l''(\theta^*)}. \tag{5}$$
Recall the score function
$$z(x, \theta) = \frac{\partial}{\partial\theta}\log f(x \mid \theta).$$
The Law of Large Numbers applied to the denominator of equation (5) gives
$$\frac{1}{n}l''(\theta^*) = \frac{1}{n}\sum_{i=1}^n z'(X_i, \theta^*) \to E_g[z'(X, \theta^*)]$$
in probability, while the Central Limit Theorem applied to the numerator of equation (5) gives
$$\frac{1}{\sqrt n}l'(\theta^*) = \frac{1}{\sqrt n}\sum_{i=1}^n z(X_i, \theta^*) \to N\left(0, \mathrm{Var}_g[z(X, \theta^*)]\right)$$
in distribution. The quantity z(X, θ*) has mean 0 when X ∼ g because θ* maximizes E_g[log f(X | θ)], so differentiating with respect to θ yields
$$0 = \frac{\partial}{\partial\theta}E_g[\log f(X \mid \theta)]\Big|_{\theta=\theta^*} = E_g[z(X, \theta^*)].$$
Hence by Slutsky's lemma,
$$\sqrt{n}(\hat\theta - \theta^*) \to N\left(0,\ \frac{\mathrm{Var}_g[z(X, \theta^*)]}{E_g[z'(X, \theta^*)]^2}\right).$$
These are the same formulas as in Section 2 (with θ* in place of θ₀), except expectations and variances are taken with respect to X ∼ g rather than X ∼ f(x | θ*). If g(x) = f(x | θ*), meaning the model is correct, then Var_g[z(X, θ*)] = −E_g[z'(X, θ*)] = I(θ*), and we recover our theorem from Section 2. However, when g(x) ≠ f(x | θ*), in general these two quantities differ, and the asymptotic variance takes the “sandwich” form above.
Example 3.2. Suppose we fit the model Exponential(λ) to data X₁, . . . , Xₙ by computing the MLE. The log-likelihood is
$$l(\lambda) = \sum_{i=1}^n\log\left(\lambda e^{-\lambda X_i}\right) = n\log\lambda - \lambda\sum_{i=1}^n X_i,$$
so the MLE solves the equation $0 = l'(\lambda) = n/\lambda - \sum_{i=1}^n X_i$. This yields the MLE λ̂ = 1/X̄ (which is the same as the method-of-moments estimator from Section 1.1). We may compute the sandwich estimate of the variance of λ̂ as follows: In the exponential model,
$$z(x, \lambda) = \frac{\partial}{\partial\lambda}\log f(x \mid \lambda) = \frac{1}{\lambda} - x, \qquad z'(x, \lambda) = \frac{\partial^2}{\partial\lambda^2}\log f(x \mid \lambda) = -\frac{1}{\lambda^2}.$$
Let $\bar Z = \frac{1}{n}\sum_{i=1}^n z(X_i, \hat\lambda) = \frac{1}{n}\sum_{i=1}^n\left(\frac{1}{\hat\lambda} - X_i\right) = \frac{1}{\hat\lambda} - \bar X$ be the sample mean of $z(X_1, \hat\lambda), \ldots, z(X_n, \hat\lambda)$. We estimate Var_g[z(X, λ)] by the sample variance of $z(X_1, \hat\lambda), \ldots, z(X_n, \hat\lambda)$:
$$\frac{1}{n-1}\sum_{i=1}^n\left(z(X_i, \hat\lambda) - \bar Z\right)^2 = \frac{1}{n-1}\sum_{i=1}^n\left(\left(\frac{1}{\hat\lambda} - X_i\right) - \left(\frac{1}{\hat\lambda} - \bar X\right)\right)^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar X)^2 = S_X^2.$$
We estimate E_g[z'(X, λ)] by the sample mean of $z'(X_1, \hat\lambda), \ldots, z'(X_n, \hat\lambda)$:
$$\frac{1}{n}\sum_{i=1}^n z'(X_i, \hat\lambda) = \frac{1}{n}\sum_{i=1}^n\left(-\frac{1}{\hat\lambda^2}\right) = -\frac{1}{\hat\lambda^2}.$$
So the sandwich estimate of $\mathrm{Var}_g[z(X, \lambda)]/E_g[z'(X, \lambda)]^2$ is $S_X^2\hat\lambda^4 = S_X^2/\bar X^4$, and we may estimate the standard error of λ̂ by $S_X/(\bar X^2\sqrt n)$.
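A sketch of this sandwich standard error applied to data that deliberately do not follow the exponential model (lognormal here, an arbitrary misspecified choice), compared against the model-based standard error λ̂/√n implied by I(λ) = 1/λ² in the exponential model.

```python
# Sketch: sandwich vs model-based standard errors for the Exponential MLE
# lambda_hat = 1/Xbar, fit to data that are deliberately not exponential
# (lognormal here -- an arbitrary misspecified choice for illustration).
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000)

n, xbar = len(x), x.mean()
lam_hat = 1 / xbar                       # MLE under the (wrong) exponential model
s_x = x.std(ddof=1)                      # sample standard deviation S_X
se_sandwich = s_x / (xbar**2 * np.sqrt(n))
se_model = lam_hat / np.sqrt(n)          # 1/sqrt(n I(lam_hat)) with I(lam) = 1/lam^2
print(lam_hat, se_sandwich, se_model)
```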
As discussed above, no unbiased estimator can have variance much smaller than that of the MLE for large n (the Cramer-Rao lower bound).
In many examples, the quantity we are interested in is not θ itself, but some value
g(θ). The obvious way to estimate g(θ) is to use g(θ̂), where θ̂ is an estimate (say, the
MLE) of θ. This is called the plugin estimate of g(θ), because we are just “plugging
in” θ̂ for θ.
Example 4.1. (Odds). You play a game with a friend, where you flip a biased coin. If
the coin lands heads, you give your friend $1. If the coin lands tails, your friend gives
you $x. What is the value of x that makes this a fair game?
If the coin lands heads with probability p, then your expected winnings are −p+(1−p)x.
The game is fair when −p + (1 − p)x = 0, i.e. when x = p/(1 − p). This value p/(1 − p)
is the odds of getting heads to getting tails. To estimate the odds from n coin flips
$$X_1, \ldots, X_n \overset{IID}{\sim} \text{Bernoulli}(p),$$
we may first estimate p by p̂ = X̄. (This is both the method of moments estimator
and the MLE.) Then the plugin estimate of p/(1 − p) is simply X̄/(1 − X̄).
The odds fall in the interval (0, ∞) and are not symmetric about p = 1/2. We oftentimes think instead in terms of the log-odds, $\log\frac{p}{1-p}$ – this can be any real number and is symmetric about p = 1/2. The plugin estimate for the log-odds is $\log\frac{\bar X}{1-\bar X}$.
Example 4.2. (The Pareto mean). The Pareto(x₀, θ) distribution for x₀ > 0 and θ > 1 is a continuous distribution over the interval [x₀, ∞), given by the PDF
$$f(x \mid x_0, \theta) = \begin{cases}\theta x_0^\theta x^{-\theta-1} & x \ge x_0\\ 0 & x < x_0.\end{cases}$$
It is commonly used in economics as a model for the distribution of income. x₀ represents the minimum possible income; let's assume that x₀ is known and equal to 1. We then have a one-parameter model with PDFs f(x | θ) = θx⁻ᶿ⁻¹ supported on [1, ∞). For θ > 1, the mean is $E[X] = \int_1^\infty x\cdot\theta x^{-\theta-1}\,dx = \frac{\theta}{\theta-1}$, so we might estimate the mean income by θ̂/(θ̂ − 1), where θ̂ is the MLE. To compute θ̂ from observations X₁, . . . , Xₙ, the log-likelihood is
$$l(\theta) = \sum_{i=1}^n\log\left(\theta X_i^{-\theta-1}\right) = \sum_{i=1}^n\left(\log\theta - (\theta+1)\log X_i\right) = n\log\theta - (\theta+1)\sum_{i=1}^n\log X_i.$$
Setting $0 = l'(\theta) = n/\theta - \sum_{i=1}^n\log X_i$ yields the MLE $\hat\theta = n/\sum_{i=1}^n\log X_i$.
More generally, suppose √n(θ̂ − θ₀) → N(0, σ²) and g is differentiable at θ₀. A first-order Taylor expansion g(θ̂) ≈ g(θ₀) + (θ̂ − θ₀)g′(θ₀), after rearranging, yields
$$\sqrt{n}\left(g(\hat\theta) - g(\theta_0)\right) \approx \sqrt{n}(\hat\theta - \theta_0)\,g'(\theta_0),$$
so √n(g(θ̂) − g(θ₀)) → N(0, g′(θ₀)²σ²) in distribution; this is the delta method. As an example, in the Bernoulli model the Central Limit Theorem gives
$$\sqrt{n}(\bar X - p) \to N(0, p(1-p))$$
in distribution, where p(1 − p) is the variance of a Bernoulli(p) random variable. The function $g(p) = \log\frac{p}{1-p} = \log p - \log(1-p)$ has derivative
$$g'(p) = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)},$$
so by the delta method,
$$\sqrt{n}\left(\log\frac{\bar X}{1-\bar X} - \log\frac{p}{1-p}\right) \to N\left(0, \frac{1}{p(1-p)}\right).$$
In other words, our estimate of the log-odds of heads to tails is approximately normally distributed around the true log-odds $\log\frac{p}{1-p}$, with variance $\frac{1}{np(1-p)}$.
Suppose we toss this biased coin n = 100 times and observe 60 heads, i.e. X̄ = 0.6. We would estimate the log-odds by $\log\frac{\bar X}{1-\bar X} \approx 0.41$, and we may estimate our standard error by $\sqrt{\frac{1}{n\bar X(1-\bar X)}} \approx 0.20$.
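The same numbers can be reproduced directly:

```python
# The log-odds estimate and delta-method standard error for n = 100 tosses
# with 60 heads, matching the numbers quoted above.
import numpy as np

n, heads = 100, 60
p_hat = heads / n
log_odds = np.log(p_hat / (1 - p_hat))
se = np.sqrt(1 / (n * p_hat * (1 - p_hat)))
print(round(log_odds, 2), round(se, 2))   # 0.41 and 0.20
```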
Example 4.4. (The Pareto mean). Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Pareto}(1, \theta)$, and recall that the MLE for θ is $\hat\theta = n/\sum_{i=1}^n\log X_i$. We may use the maximum-likelihood theory:
$$\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \frac{1}{\theta} - \log x, \qquad \frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta) = -\frac{1}{\theta^2}.$$
Then the Fisher information is given by I(θ) = 1/θ², so
$$\sqrt{n}(\hat\theta - \theta) \to N(0, \theta^2)$$
in distribution. Applying the delta method with g(θ) = θ/(θ − 1), for which g′(θ) = −1/(θ − 1)², then gives $\sqrt{n}\left(g(\hat\theta) - g(\theta)\right) \to N\left(0, \frac{\theta^2}{(\theta-1)^4}\right)$.
Say, for a data set with n = 1000 income values, we obtain the MLE θ̂ = 1.5. We might then estimate the mean income as θ̂/(θ̂ − 1) = 3, and estimate our standard error by $\sqrt{\frac{\hat\theta^2}{n(\hat\theta-1)^4}} \approx 0.19$.
What if we decided to just estimate the mean income by the sample mean, X̄? Since E[Xᵢ] = θ/(θ − 1), the Central Limit Theorem implies
$$\sqrt{n}\left(\bar X - \frac{\theta}{\theta-1}\right) \to N(0, \mathrm{Var}[X_i])$$
in distribution. For θ > 2, we may compute
$$E[X_i^2] = \int_1^\infty x^2\cdot\theta x^{-\theta-1}\,dx = \theta\left[\frac{x^{-\theta+2}}{-\theta+2}\right]_1^\infty = \frac{\theta}{\theta-2},$$
so
$$\mathrm{Var}[X_i] = E[X_i^2] - (E[X_i])^2 = \frac{\theta}{\theta-2} - \left(\frac{\theta}{\theta-1}\right)^2 = \frac{\theta}{(\theta-1)^2(\theta-2)}.$$
(If θ ≤ 2, the variance of Xᵢ is actually infinite.) For any θ, this variance is greater than θ²/(θ − 1)⁴.
Thus, if the Pareto model for income is correct, then our previous estimate θ̂/(θ̂ − 1)
is more accurate for the mean income than is the sample mean X̄. Intuitively, this
is because the Pareto distribution is heavy-tailed, and the sample mean X̄ is heavily
influenced by rare but extremely large data values. On the other hand, θ̂ is estimating
the shape of the Pareto distribution and estimating the mean by its relationship to
this shape in the Pareto model. The formula for θ̂ involves the values log Xi rather
than Xi , so θ̂ is not as heavily influenced by extremely large data values. Of course,
the estimate θ̂/(θ̂ − 1) relies strongly on the correctness of the Pareto model, whereas
X̄ would be a valid estimate of the mean even if the Pareto model doesn’t hold true.
That the plugin estimate g(θ̂) performs better than X̄ in the previous example is not a
coincidence - it is in certain senses the best we can do for estimating g(θ). For example,
we have the following more general version of the Cramer-Rao lower bound:
Theorem 4.2. For a parametric model {f(x | θ) : θ ∈ Ω} (satisfying certain mild regularity assumptions) where θ is a single parameter, let g be any function differentiable on all of Ω, and let T be any unbiased estimator of g(θ) based on data $X_1, \ldots, X_n \overset{IID}{\sim} f(x \mid \theta)$. Then
$$\mathrm{Var}_\theta[T] \ge \frac{g'(\theta)^2}{nI(\theta)}.$$
Hint. The proof is identical to that of Theorem 2.2, except with the equation θ = E_θ[T] replaced by g(θ) = E_θ[T]. (Differentiating this equation yields g′(θ) = E_θ[TZ] = Cov_θ[T, Z], as in Theorem 2.2.)
An estimator T for g(θ) that achieves this variance g′(θ)²/(nI(θ)) is efficient. The plugin estimate g(θ̂), where θ̂ is the MLE, achieves this variance asymptotically, so we say it is asymptotically efficient. This theorem ensures that no unbiased estimator of g(θ) can achieve variance much smaller than that of g(θ̂) when n is large, and in particular it applies to the estimator T = X̄ of the previous example.
WARNING! A common misconception is to interpret CIs as intervals with probability (1 − α) of containing the true value θ, for a particular sample x. This interpretation is incorrect. The interpretation of CIs has to be made in terms of repeated sampling, as discussed above.
More formally, what this means is the following: Let X₁, . . . , Xₙ be a sample of data. By a random interval, we mean an interval whose lower and upper endpoints L(X₁, . . . , Xₙ) and U(X₁, . . . , Xₙ) are functions of the data X₁, . . . , Xₙ. (Hence the interval is random in the same sense that the data itself is random – a different realization of the data leads to a different interval.) The interval [L(X₁, . . . , Xₙ), U(X₁, . . . , Xₙ)] is a 100(1 − α)% confidence interval for g(θ) if, for all θ ∈ Ω,
$$P_\theta\left[L(X_1, \ldots, X_n) \le g(\theta) \le U(X_1, \ldots, X_n)\right] = 1 - \alpha.$$
The upper inequality may be rearranged as
$$\bar X - \frac{S}{\sqrt n}t_{n-1}(\alpha/2) \le \mu,$$
and the lower inequality may be rearranged as
$$\mu \le \bar X + \frac{S}{\sqrt n}t_{n-1}(\alpha/2).$$
Hence
$$P_{\mu,\sigma^2}\left[\bar X - \frac{S}{\sqrt n}t_{n-1}(\alpha/2) \le \mu \le \bar X + \frac{S}{\sqrt n}t_{n-1}(\alpha/2)\right] = 1 - \alpha,$$
so $\left[\bar X - \frac{S}{\sqrt n}t_{n-1}(\alpha/2),\ \bar X + \frac{S}{\sqrt n}t_{n-1}(\alpha/2)\right]$ is a 100(1 − α)% confidence interval for µ. We'll use the notation $\bar X \pm \frac{S}{\sqrt n}t_{n-1}(\alpha/2)$ as shorthand for this interval.
Definition 5.1. A pivotal quantity (or pivot) ψ(T, θ) is a function of a statistic T and the parameter θ whose distribution does not depend on θ.
Now, $\psi_{1-\alpha/2} < \psi(T, \theta) < \psi_{\alpha/2}$ can often be put in the form θ₁(T) ⩽ θ ⩽ θ₂(T). Then P[θ₁(T) ⩽ θ ⩽ θ₂(T)] = 1 − α, and the observed value of the interval [θ₁(T), θ₂(T)] will be the confidence interval for θ with confidence coefficient (1 − α).
Example 5.2. Let X₁, . . . , Xₙ be a random sample from N(µ, σ²); µ and σ are both unknown. Find the confidence interval with confidence coefficient (1 − α) for
(i) µ; (ii) σ²; (iii) (µ, σ²).
Solutions.
(i) For a confidence interval for µ, we select the statistic T = X̄. Then $\psi(T, \mu) = \frac{\sqrt n(\bar X - \mu)}{s} \sim t_{n-1}$, whose distribution does not depend on µ. Now,
$$1 - \alpha = P\left[-t_{\alpha/2,n-1} < \frac{\sqrt n(\bar X - \mu)}{s} < t_{\alpha/2,n-1}\right] = P\left[\bar X - \frac{s}{\sqrt n}t_{\alpha/2,n-1} \le \mu \le \bar X + \frac{s}{\sqrt n}t_{\alpha/2,n-1}\right].$$
Hence $\left[\bar X - \frac{s}{\sqrt n}t_{\alpha/2,n-1},\ \bar X + \frac{s}{\sqrt n}t_{\alpha/2,n-1}\right]$ is an observed confidence interval for µ with confidence coefficient (1 − α).
(ii) For a confidence interval for σ², we select the statistic $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar X)^2$. Then $\psi(s^2, \sigma^2) = (n-1)\frac{s^2}{\sigma^2} \sim \chi^2_{n-1}$, whose distribution does not depend on σ². Now,
$$1 - \alpha = P\left[\chi^2_{1-\alpha/2,n-1} \le (n-1)\frac{s^2}{\sigma^2} \le \chi^2_{\alpha/2,n-1}\right] = P\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}\right].$$
Hence $\left[\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\chi^2_{\alpha/2,n-1}},\ \frac{\sum_{i=1}^n(X_i-\bar X)^2}{\chi^2_{1-\alpha/2,n-1}}\right]$ is an observed confidence interval for σ² with confidence coefficient (1 − α).
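A sketch computing these two exact intervals with scipy, for simulated normal data with hypothetical parameters:

```python
# Sketch: the exact t and chi-square intervals of Example 5.2 via scipy;
# the simulated data and parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=30)
n, alpha = len(x), 0.05
xbar, s = x.mean(), x.std(ddof=1)

t_crit = stats.t.ppf(1 - alpha/2, df=n-1)
ci_mu = (xbar - s/np.sqrt(n)*t_crit, xbar + s/np.sqrt(n)*t_crit)

chi_hi = stats.chi2.ppf(1 - alpha/2, df=n-1)   # upper chi-square quantile
chi_lo = stats.chi2.ppf(alpha/2, df=n-1)       # lower chi-square quantile
ci_sigma2 = ((n-1)*s**2/chi_hi, (n-1)*s**2/chi_lo)
print(ci_mu, ci_sigma2)
```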
Note that the distribution of the pivot in part (i) does not depend on µ and σ². Suppose that we had forgotten this fact. If n is large, we could have still reasoned as follows: By the Central Limit Theorem, as n → ∞,
$$\sqrt{n}(\bar X - \mu) \to N(0, \sigma^2).$$
This method may be applied to construct an approximate confidence interval from any
asymptotically normal estimator, as we will see in the following examples.
Example 5.3. Let $X_1, \ldots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)$. To construct an asymptotic confidence interval for λ, let's start with the estimator λ̂ = X̄. By the Central Limit Theorem,
$$\sqrt{n}(\hat\lambda - \lambda) \to N(0, \lambda).$$
We don't know the variance λ of this limiting normal distribution, but we can estimate it by λ̂. By the Law of Large Numbers, λ̂ → λ in probability as n → ∞, i.e. λ̂ is consistent for λ. Then by the Continuous Mapping Theorem and Slutsky's Lemma,
$$\frac{\sqrt n(\hat\lambda - \lambda)}{\sqrt{\hat\lambda}} = \sqrt{\frac{\lambda}{\hat\lambda}}\times\frac{\sqrt n(\hat\lambda - \lambda)}{\sqrt\lambda} \to N(0, 1),$$
so
$$P_\lambda\left[-z_{\alpha/2} \le \frac{\sqrt n(\hat\lambda - \lambda)}{\sqrt{\hat\lambda}} \le z_{\alpha/2}\right] \to 1 - \alpha.$$
Rearranging these inequalities yields the asymptotic 100(1 − α)% confidence interval $\hat\lambda \pm \sqrt{\frac{\hat\lambda}{n}}\,z_{\alpha/2}$.
For various values of λ and n, the table below shows the simulated true probabilities that the 90% and 95% confidence intervals constructed in this way cover λ:
[Table of simulated coverage probabilities omitted.]
(Meaning, we simulated $X_1, \ldots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)$, computed the confidence interval, checked whether it contained λ, and repeated this B = 1,000,000 times. The table reports the fraction of simulations for which the interval covered λ.) We observe that coverage is closer to the desired levels for larger values of n, as well as for larger values of λ. For small n and/or small λ, the normal approximation to the distribution of λ̂ is inaccurate, and the simulations show that we underestimate the variability of λ̂.
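The coverage simulation described above can be reproduced in miniature; this sketch uses B = 100,000 replications for a single (λ, n) pair rather than the full grid.

```python
# A scaled-down version of the coverage simulation described above
# (B = 100,000 rather than 1,000,000), for a single (lambda, n) pair.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
lam, n, alpha, B = 1.0, 20, 0.10, 100_000
z = norm.ppf(1 - alpha/2)

lam_hat = rng.poisson(lam, size=(B, n)).mean(axis=1)
half = z * np.sqrt(lam_hat / n)
covered = (lam_hat - half <= lam) & (lam <= lam_hat + half)
print(covered.mean())   # typically somewhat below the nominal 0.90 here
```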
Recall from the delta-method example above that for the log-odds g(p) = log(p/(1 − p)),
$$\sqrt{n}\left(g(\hat p) - g(p)\right) \to N\left(0, \frac{1}{p(1-p)}\right),$$
where p̂ = X̄. Since p̂ → p in probability, by the Continuous Mapping Theorem and Slutsky's Lemma,
$$\sqrt{n\hat p(1-\hat p)}\left(g(\hat p) - g(p)\right) = \sqrt{\frac{\hat p(1-\hat p)}{p(1-p)}}\cdot\sqrt{np(1-p)}\left(g(\hat p) - g(p)\right) \to N(0, 1),$$
so
$$P_p\left[-z_{\alpha/2} \le \sqrt{n\hat p(1-\hat p)}\left(g(\hat p) - g(p)\right) \le z_{\alpha/2}\right] \to 1 - \alpha.$$
An asymptotic 100(1 − α)% confidence interval for the log-odds $g(p) = \log\frac{p}{1-p}$ is then
$$[L(\hat p), U(\hat p)] := \left[\log\frac{\hat p}{1-\hat p} - z_{\alpha/2}\sqrt{\frac{1}{n\hat p(1-\hat p)}},\ \log\frac{\hat p}{1-\hat p} + z_{\alpha/2}\sqrt{\frac{1}{n\hat p(1-\hat p)}}\right].$$
If we wish to obtain a confidence interval for the odds $\frac{p}{1-p}$ rather than the log-odds, note that $P\left[L(\hat p) \le \log\frac{p}{1-p} \le U(\hat p)\right] = P\left[e^{L(\hat p)} \le \frac{p}{1-p} \le e^{U(\hat p)}\right]$, so that $\left[e^{L(\hat p)}, e^{U(\hat p)}\right]$ is a confidence interval for the odds. This interval is not symmetric around the estimate $\frac{\hat p}{1-\hat p}$, and is different from what we would have obtained if we instead applied the delta method directly to $g(p) = \frac{p}{1-p}$. The interval $\left[e^{L(\hat p)}, e^{U(\hat p)}\right]$ for the odds is typically used in practice: the distribution of $\log\frac{\hat p}{1-\hat p}$ is less skewed than that of $\frac{\hat p}{1-\hat p}$ for small to moderate n, so the normal approximation and resulting confidence interval are more accurate if we consider odds on the log scale.
• The true variance of this normal distribution, say I(θ)⁻¹, is being approximated by a plugin estimate I(θ̂)⁻¹.
• In the case where we are interested in g(θ) and g is a nonlinear function, the value g(θ̂) is being approximated by the Taylor expansion g(θ) + (θ̂ − θ)g′(θ). (This is what is done in the delta method.)
These approximations are all valid in the limit n → ∞, but their accuracy is not
guaranteed for the finite sample size n of any given problem. Coverage of asymptotic
confidence intervals should be checked by simulation, as Example 5.3 illustrates that
they might be severely overconfident for small n.
6 Bayesian analysis
Our treatment of parameter estimation thus far has assumed that θ is an unknown
but non-random quantity - it is some fixed parameter describing the true distribution
of data, and our goal was to determine this parameter. This is called the Frequentist
Paradigm of statistical inference. In this and the next section, we will describe an
alternative Bayesian Paradigm, in which θ itself is modeled as a random variable.
The Bayesian paradigm naturally incorporates our prior belief about the unknown
parameter θ and updates this belief based on observed data.
The marginal distribution of X is defined by the PDF $f_X(x) = \int f_{X,Y}(x, y)\,dy$ in the continuous case, or the PMF $f_X(x) = \sum_y f_{X,Y}(x, y)$ in the discrete case; this describes the probability distribution of X alone. The conditional distribution of Y given X = x is defined by the PDF or PMF
$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)},$$
and represents the probability distribution of Y if it is known that X = x. (This is a PDF or PMF as a function of y, for any fixed x.) Defining similarly the marginal distribution f_Y(y) of Y and the conditional distribution f_{X|Y}(x | y) of X given Y = y, the joint PDF f_{X,Y}(x, y) factors in two ways as
$$f_{X,Y}(x, y) = f_X(x)\,f_{Y|X}(y \mid x) = f_Y(y)\,f_{X|Y}(x \mid y).$$
Conditional on Θ = θ, the observed data X is assumed to have distribution f_{X|Θ}(x | θ), where f_{X|Θ}(x | θ) defines a parametric model with parameter θ, as in our previous chapters. The joint distribution of Θ and X is then the product
$$f_{\Theta,X}(\theta, x) = f_\Theta(\theta)\,f_{X|\Theta}(x \mid \theta).$$
where B(x, y) denotes the Beta function,
$$B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}.$$
Hence the posterior distribution of P given X₁ = x₁, . . . , Xₙ = xₙ has PDF
$$f_{P|X}(p \mid x_1, \ldots, x_n) = \frac{f_{X,P}(x_1, \ldots, x_n, p)}{f_X(x_1, \ldots, x_n)} = \frac{1}{B(s+1, n-s+1)}p^s(1-p)^{n-s},$$
where s = x₁ + · · · + xₙ; this is the PDF of the Beta(s + 1, n − s + 1) distribution.
In the previous example, the parametric form for the prior was (cleverly) chosen so that the posterior would be of the same form – they were both Beta distributions. This type of prior is called a conjugate prior for P in the Bernoulli model. Use of a conjugate prior is mostly for mathematical and computational convenience – in principle, any prior f_P(p) on (0, 1) may be used. The resulting posterior distribution may not be a simple named distribution with a closed-form PDF, but the PDF may be computed numerically from equation (6) by numerically evaluating the integral in the denominator of this equation.
Example 6.3. Let Λ ∈ (0, ∞) be the parameter of the Poisson model $X_1, \ldots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)$. As a prior distribution for Λ, let us take the Gamma distribution Gamma(α, β). The prior and likelihood are given by
$$f_\Lambda(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}, \qquad f_{X|\Lambda}(x_1, \ldots, x_n \mid \lambda) = \prod_{i=1}^n\frac{\lambda^{x_i}e^{-\lambda}}{x_i!}.$$
Dropping proportionality constants that do not depend on λ, the posterior distribution of Λ given X₁ = x₁, . . . , Xₙ = xₙ is then
$$f_{\Lambda|X}(\lambda \mid x_1, \ldots, x_n) \propto \lambda^{s+\alpha-1}e^{-(n+\beta)\lambda}, \qquad s = x_1 + \cdots + x_n,$$
which is the Gamma(s + α, n + β) distribution.
The posterior mean and posterior mode are the mean and mode of the posterior distribution of Θ; both of these are commonly used as a Bayesian estimate θ̂ for θ. A 100(1 − α)% Bayesian credible interval is an interval I such that the posterior probability P[Θ ∈ I | X] = 1 − α, and is the Bayesian analogue to a frequentist confidence interval. One common choice for I is simply the interval $[\theta_{(\alpha/2)}, \theta_{(1-\alpha/2)}]$, where θ₍α/2₎ and θ₍1−α/2₎ are the α/2 and 1 − α/2 quantiles of the posterior distribution of Θ. Note that the interpretation of a Bayesian credible interval is different from the interpretation of a frequentist confidence interval – in the Bayesian framework, the parameter Θ is modeled as random, and 1 − α is the probability that this random parameter Θ belongs to an interval that is fixed conditional on the observed data.
Example 6.4. From Example 6.2, the posterior distribution of P is Beta(s + α, n −
s + α). The posterior mean is then (s + α)/(n + 2α), and the posterior mode is
(s + α − 1)/(n + 2α − 2). Both of these may be taken as a point estimate p̂ for p. The
interval from the 0.05 to the 0.95 quantile of the Beta(s + α, n − s + α) distribution
forms a 90% Bayesian credible interval for p.
Example 6.5. From Example 6.3, the posterior distribution of Λ is Gamma(s + α, n +
β). The posterior mean and mode are then (s + α)/(n + β) and (s + α − 1)/(n + β),
and either may be used as a point estimate λ̂ for λ. The interval from the 0.05 to the
0.95 quantile of the Gamma(s + α, n + β) distribution forms a 90% Bayesian credible
interval for λ.
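As a sketch of Examples 6.3 and 6.5 (with hypothetical data and prior parameters), these posterior summaries can be computed with scipy:

```python
# Sketch of Examples 6.3/6.5: Poisson-Gamma posterior summaries with scipy.
# The data, true lambda, and prior parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=40)            # simulated counts, true lambda = 3
alpha, beta = 2.0, 1.0                   # Gamma(alpha, beta) prior
s, n = x.sum(), len(x)

posterior = stats.gamma(a=s + alpha, scale=1/(n + beta))  # Gamma(s+alpha, n+beta)
post_mean = (s + alpha) / (n + beta)
post_mode = (s + alpha - 1) / (n + beta)
cred_90 = (posterior.ppf(0.05), posterior.ppf(0.95))
print(post_mean, post_mode, cred_90)
```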
1. If $X_1, \ldots, X_n \overset{IID}{\sim} \text{Poisson}(\lambda)$, then (as in Example 6.3) a conjugate prior for λ is Gamma(α, β), and the corresponding posterior given X₁ = x₁, . . . , Xₙ = xₙ is Gamma(s + α, n + β). A Bayesian estimate of λ is the posterior mean
$$\hat\lambda = \frac{s+\alpha}{n+\beta} = \frac{n}{n+\beta}\cdot\frac{s}{n} + \frac{\beta}{n+\beta}\cdot\frac{\alpha}{\beta}.$$
2. If $X_1, \ldots, X_n \overset{IID}{\sim} \text{Bernoulli}(p)$, then a conjugate prior for p is Beta(α, β), and the corresponding posterior given X₁ = x₁, . . . , Xₙ = xₙ is Beta(s + α, n − s + β). A Bayesian estimate of p is the posterior mean
$$\hat p = \frac{s+\alpha}{n+\alpha+\beta} = \frac{n}{n+\alpha+\beta}\cdot\frac{s}{n} + \frac{\alpha+\beta}{n+\alpha+\beta}\cdot\frac{\alpha}{\alpha+\beta}.$$
In example 1 above, we may interpret β as an effective prior sample size and α/β as a prior mean, and the posterior mean is a weighted average of the prior mean and the data mean. In example 2 above, the posterior mean behaves as if we observed, a priori, α additional heads and β additional tails: α + β is an effective prior sample size, α/(α + β) is a prior mean, and the posterior mean is again a weighted average of the prior mean and the data mean. These interpretations may serve as a guide for choosing the prior parameters α and β.
Sometimes it is convenient to use the formalism of Bayesian inference, but with an “uninformative prior” that does not actually impose prior knowledge, so that the resulting analysis is more objective. In both examples above, the priors are “uninformative” for the posterior mean when α and β are small. We may take this idea to the limit by considering α = β = 0. As the PDF of the Gamma distribution is proportional to $x^{\alpha-1}e^{-\beta x}$ on (0, ∞), the “PDF” for α = β = 0 may be considered to be
$$f(x) \propto x^{-1}.$$
This is not a true PDF, since $\int_0^\infty x^{-1}\,dx = \infty$; such a prior is called improper. Nonetheless, we may formally carry out Bayesian analysis using improper priors, and this oftentimes yields valid posterior distributions: In the Poisson example, we obtain the posterior PDF
$$f_{\Lambda|X}(\lambda \mid x_1, \ldots, x_n) \propto \lambda^{s-1}e^{-n\lambda},$$
which is the PDF of Gamma(s, n). In the Bernoulli example, we obtain the posterior PDF
$$f_{P|X}(p \mid x_1, \ldots, x_n) \propto p^{s-1}(1-p)^{n-s-1},$$
which is the PDF of Beta(s, n − s). These posterior distributions are real probability
distributions (as long as s > 0 in the Poisson example and s, n − s > 0 in the Bernoulli
example), and may be thought of as approximations to the posterior distributions that
we would have obtained if we used proper priors with small but positive values of α
and β.
Consider Bayesian inference applied with the prior f_Θ(θ), for a parametric model f_{X|Θ}(x | θ). Let $X_1, \ldots, X_n \overset{IID}{\sim} f_{X|\Theta}(x \mid \theta)$, and let
$$l(\theta) = \sum_{i=1}^n\log f_{X|\Theta}(x_i \mid \theta)$$
be the usual log-likelihood. Then the posterior distribution of Θ is given by $f_{\Theta|X}(\theta \mid x_1, \ldots, x_n) \propto f_\Theta(\theta)e^{l(\theta)}$. For large n, a second-order Taylor expansion of l(θ) around the MLE θ̂ (where l′(θ̂) = 0 and $-\frac{1}{n}l''(\hat\theta) \approx I(\hat\theta)$), together with the observation that the prior is approximately constant near θ̂, yields
$$f_{\Theta|X}(\theta \mid x_1, \ldots, x_n) \propto \exp\left(-\frac{1}{2}(\theta - \hat\theta)^2\cdot nI(\hat\theta)\right).$$
This describes a normal distribution for Θ with mean θ̂ and variance $\frac{1}{nI(\hat\theta)}$.
To summarize, the posterior mean of Θ is, for large n, approximately the MLE θ̂. Furthermore, a 100(1 − α)% Bayesian credible interval is approximately given by $\hat\theta \pm z_{\alpha/2}\big/\sqrt{nI(\hat\theta)}$, which is exactly the 100(1 − α)% Wald confidence interval for θ. In this sense, frequentist and Bayesian methods yield similar inferences for large n.
Let’s return to the frequentist setting where we assume that there is a true parameter
θ for a parametric model {f (x | θ) : θ ∈ Ω}. Suppose we have two estimators for
θ based on data X1 , . . . , Xn ∼ f (x | θ) : θ̂1 and θ̂2 . Which estimator is “better”?
Without appealing to asymptotic (large n) arguments, one answer to this question is
to compare their mean-squared-errors:
2 2
MSE1 (θ) = Eθ θ̂1 − θ = Variance of θ̂1 + Bias of θ̂1 ;
2 2
MSE2 (θ) = Eθ θ̂2 − θ = Variance of θ̂2 + Bias of θ̂2 .
The estimator with smaller MSE is “better”. Unfortunately, the problem with this
approach is that the MSEs might depend on the true parameter θ (hence why we have
written MSE1 and MSE2 as functions of θ in the above), and neither may be uniformly
IID
better than the other. For example, suppose X1 , . . . , Xn ∼ N (θ, 1). Let θ̂1 = X̄; this
is unbiased with variance n1 , so its MSE is n1 . Let θ̂2 ≡ 0 be the constant estimator that
always estimates θ by 0. This has bias −θ and variance 0, so its MSE is θ2 . If √the true
parameter θ happens to be close to 0 - more specifically, if |θ| is less than 1/ n - then
θ̂2 is “better”, and otherwise θ̂1 is “better”.
One resolution is to compare the weighted-average MSE
$$\int_\Omega \mathrm{MSE}(\theta)\,w(\theta)\,d\theta,$$
where w(θ) is a weight function over the parameter space such that $\int_\Omega w(\theta)\,d\theta = 1$, and to find the estimator that minimizes this weighted average. This weighted-average MSE is called the Bayes risk. Writing the expectation in the definition of the MSE as an integral, and letting x denote the data and f(x | θ) denote the PDF of the data, we may write the Bayes risk of an estimator θ̂ as
$$\int\left[\int(\hat\theta(x) - \theta)^2 f(x \mid \theta)\,dx\right]w(\theta)\,d\theta.$$
In order to minimize the Bayes risk, for each possible value x of the observed data, θ̂(x) should be defined so as to minimize
$$\int(\hat\theta(x) - \theta)^2 f(x \mid \theta)\,w(\theta)\,d\theta.$$
Let us now interpret w(θ) as a prior f_Θ(θ) for the parameter Θ, and f(x | θ) as the likelihood f_{X|Θ}(x | θ) given Θ = θ. Then
$$\int(\hat\theta(x) - \theta)^2 f(x \mid \theta)\,w(\theta)\,d\theta = \int(\hat\theta(x) - \theta)^2 f_{X,\Theta}(x, \theta)\,d\theta = f_X(x)\int(\hat\theta(x) - \theta)^2 f_{\Theta|X}(\theta \mid x)\,d\theta.$$
For each x, the quantity $\int(\hat\theta(x) - \theta)^2 f_{\Theta|X}(\theta \mid x)\,d\theta$ is minimized over θ̂(x) by the posterior mean θ̂(x) = E[Θ | X = x]. The posterior mean of Θ for the prior f_Θ(θ) is therefore the estimator that minimizes the average mean-squared-error
$$\int\mathrm{MSE}(\theta)\,f_\Theta(\theta)\,d\theta.$$
Thus a Bayesian prior may be interpreted as the weighting of parameter values for which we wish to minimize the weighted-average mean-squared-error.