
MTH210a: Statistical Computing

Instructor: Dootika Vats

Semester II, 2022 - 2023

The instructor of this course owns the copyright of all the course materials. This lecture
material was distributed only to the students attending the course MTH210a: “Statistical
Computing” of IIT Kanpur, and should not be distributed in print or through electronic
media without the consent of the instructor. Students can make their own copies of the
course materials for their use.

Contents

1 Lecture-wise Summary

2 Pseudorandom Number Generation
  2.1 Multiplicative congruential method
  2.2 Mixed Congruential Generator
  2.3 Generating U (a, b)
  2.4 Exercises

3 Generating Discrete Random Variables
  3.1 Inverse transform method
  3.2 Accept-Reject for Discrete Random Variables
  3.3 The Composition Method
  3.4 Exercises

4 Generating continuous random variables
  4.1 Inverse transform
  4.2 Accept-reject method
    4.2.1 Choosing a proposal
    4.2.2 Choosing parameters for a fixed proposal family
  4.3 The Box-Muller transformation for N (0, 1)
  4.4 Ratio-of-Uniforms
  4.5 Miscellaneous methods in sampling
    4.5.1 Known relationships
    4.5.2 Sampling from mixture distributions
    4.5.3 Multidimensional target
  4.6 Exercises

5 Importance Sampling
  5.1 Simple Monte Carlo
  5.2 Simple importance sampling
    5.2.1 Optimal proposals
    5.2.2 Questions to think about
  5.3 Weighted Importance Sampling
  5.4 Questions to think about
  5.5 Exercises

6 Likelihood Based Estimation
  6.1 Likelihood Function
  6.2 Maximum Likelihood Estimation
    6.2.1 Examples
  6.3 Regression
  6.4 Penalized Regression
  6.5 No closed-form MLEs
  6.6 Exercises

7 Numerical optimization methods
  7.1 Newton-Raphson's method
  7.2 Gradient Ascent (Descent)
  7.3 MM Algorithm
  7.4 Exercises

8 The EM algorithm
  8.1 Motivating example: Gaussian mixture models
  8.2 The Expectation-Maximization Algorithm
  8.3 (Back to) Gaussian mixture likelihood
  8.4 EM Algorithm for Censored Data
  8.5 Exercises

9 Choosing Tuning Parameters
  9.1 Components in Gaussian Mixture Model
  9.2 Loss functions
  9.3 Cross-validation
    9.3.1 Holdout method
    9.3.2 Leave-one-out Cross-validation
    9.3.3 K-fold cross-validation
  9.4 Bootstrapping
    9.4.1 Nonparametric Bootstrap
    9.4.2 Parametric Bootstrap
  9.5 Exercises

10 Stochastic optimization methods
  10.1 Stochastic gradient ascent
    10.1.1 Mini-batch stochastic gradient ascent
    10.1.2 Logistic regression
  10.2 Simulated annealing
1 Lecture-wise Summary
Lec No. Date Topic
1 Jan 9 FCH and Pseudorandom number generation
2 Jan 9 Pseudorandom numbers and inverse transform
3 Jan 10 Accept-Reject
4 Jan 11 Accept-Reject and composition method
5 Jan 16 Continuous: inverse transform and accept-reject
6 Jan 17 Accept-reject
7 Jan 18 Accept-reject
8 Jan 23 Box-Muller, Ratio-of-uniforms, Quiz 1
9 Jan 24 Ratio-of-uniforms
10 Jan 25 Ratio-of-uniforms, Miscellaneous sampling
11 Jan 30 Multivariate normal sampling, Simple Monte Carlo
12 Jan 31 Simple Monte Carlo, simple importance sampling
13 Feb 1 Simple importance sampling
14 Feb 6 Optimal proposal example, Weighted importance sampling
15 Feb 7 Weighted importance sampling
16 Feb 8 Likelihood function, Maximum Likelihood Estimator
17 Feb 13 MLE and linear regression
18 Feb 14 Penalized regression, Quiz 2
19 Feb 15 Newton-Raphson Algorithm
– Feb 20 - 24 Mid-sem Exam week
20 Feb 27 Newton-Raphson Algorithm
21 Feb 28 Gradient Ascent Algorithm
22 Feb 28 (extra class) Logistic regression GA algorithm
23 Mar 1 More Logistic regression
Mar 6 - 17 Mid-sem Recess + 1 week I was away
24 Mar 20 MM Algorithm
25 Mar 21 Bridge regression
26 Mar 21 (extra class) EM Algorithm
27 Mar 22 EM Algorithm
28 Mar 27 Gaussian Mixture Model
29 Mar 28 Gaussian Mixture
30 Mar 28 (extra class) Gaussian Mixture still
31 Mar 29 Censored data example
32 Apr 3 Loss functions
33 Apr 5 Cross-validation
34 Apr 10 Bootstrapping
35 Apr 11 Bootstrapping

2 Pseudorandom Number Generation
The building block of computational simulation is the generation of uniform random
numbers. If we can draw from U (0, 1), then we can draw from most other distributions.
Thus the construction of samplers for U (0, 1) requires special attention.

Computers can generate numbers in (0, 1) which, although not exactly random (they are in fact deterministic), have the appearance of being U (0, 1) random variables. These draws from U (0, 1) are called pseudorandom draws.

The goal in pseudorandom generation is to draw

X1, . . . , Xn ∼ U (0, 1), approximately iid.

The resultant sample should be as uniformly distributed and as independent as possible. We will learn about two different pseudorandom generators; these are basic ones that are not really used in practice, but they illustrate the main ideas well.

Note: After this lecture, we will always assume that all U (0, 1) draws are exactly iid and perfectly random. We will forget that they are, in fact, pseudorandom. Pseudorandom generation is a whole field in itself; for more on this, check out CS744 at IITK.

2.1 Multiplicative congruential method


A common algorithm to generate a sequence {xn } is the multiplicative congruential method :

1. Set seed x0 , and positive integers a, m.

2. Obtain xt = a xt−1 mod m

3. Return sequence xt /m for t = 1, . . . , n.

Since xt ∈ {0, 1, . . . , m − 1}, xt/m ∈ [0, 1). Also note that after some finite number of steps (< m), the algorithm will repeat itself, since once a seed x0 is set, a deterministic sequence of numbers follows. To allow the sequence xt to mimic uniform and random draws, both a and m should be chosen to be large so as to avoid early repetition. Typically m should be a large prime number.

Example 1. Set a = 123 and m = 10, and let x0 = 7. Then


x1 = 123 ∗ 7 mod 10 = 1

x2 = 123 ∗ 1 mod 10 = 3
x3 = 123 ∗ 3 mod 10 = 9
x4 = 123 ∗ 9 mod 10 = 7
x5 = 123 ∗ 7 mod 10 = 1
. . .  ■

Thus, we see that with the above choices of a, m, x0 the sequence repeats itself (with period 4). It is also recommended that a be large, to ensure large jumps and reduce "dependence" in the sequence. Based on the bits of your machine, it is recommended to set m = 2^31 − 1 and a = 7^5. Notice that both are large.

m <- 2^(31) - 1
a <- 7^5
x <- numeric(length = 1e3)
x[1] <- 7

for(i in 2:1e3)
{
x[i] <- (a * x[i-1]) %% m
}
par(mfrow = c(1,2))
hist(x/m) # looks close to uniformly distributed
plot.ts(x/m) # look like it’s jumping around too

The histogram shows roughly ”uniform” distribution of the samples and the trace plot
shows the lack of dependence between samples.

[Figure: histogram of x/m (left) and trace plot of x/m against time (right).]
Any pseudorandom generation method should satisfy:

1. for any initial seed, the resultant sequence has the “appearance” of being IID
from Uniform[0, 1].

2. for any initial seed, the number of values generated before repetition begins is
large

3. the values can be computed efficiently.

2.2 Mixed Congruential Generator


Notice that in the previous method, if we set the seed to be zero, the algorithms fails!
To combat this, there is another method, the mixed congruential generator :

1. Set seed x0 , and positive integers a, c, m.

2. xt = (a xt−1 + c) mod m

3. Return sequence xt /m for t = 1, . . . , n.

m <- 2^(31) - 1
a <- 7^5
c <- 2^(10) - 1
x <- numeric(length = 1e3)
x[1] <- 7

for(i in 2:1e3)
{
x[i] <- (c + a * x[i-1]) %% m
}
par(mfrow = c(1,2))
hist(x/m) # looks close to uniformly distributed
plot.ts(x/m) # look like it’s jumping around too

[Figure: histogram of x/m (left) and trace plot of x/m against time (right).]

We must be cautious not to be satisfied with just a histogram. A histogram shows that the empirical distribution of all samples is roughly uniform. But we can still get a uniform-looking histogram from a terrible generator, for instance with a = 1, m = 1e3 and c = 1.

m <- 1e3
a <- 1
c <- 1
x <- numeric(length = 1e3)
x[1] <- 7

for(i in 2:1e3)
{
x[i] <- (c + a * x[i-1]) %% m
}
par(mfrow = c(1,2))
hist(x/m) # looks VERY uniformly distributed
plot.ts(x/m) # Clearly "dependent" samples

Although a histogram shows an almost perfect uniform distribution, the trace plot
shows that the draws don’t behave like they are independent.

[Figure: histogram of x/m (left, almost perfectly uniform) and trace plot of x/m (right, showing strongly dependent draws).]

We could also use

xn = (a1 xn−1 + a2 xn−2 + · · · + ak xn−k + c) mod m ,

but this requires more operations per draw, and so is less computationally efficient.

We claim that these methods return "good" pseudosamples, in the sense of the three points stated above. There are statistical hypothesis tests, like the Kolmogorov-Smirnov test, that one can run to assess whether a sample behaves like a truly random one: independent and identically distributed.

runif() in R uses the Mersenne-Twister generator by default (we will not go into this),
but there are options to use other generators. After this, we will assume that runif()
returns truly iid samples from U (0, 1).

2.3 Generating U (a, b)


Suppose we want to draw from U (a, b) for any a < b ∈ R, but we only know how to draw from U (0, 1). Note that if U ∼ U (0, 1), then for any a, b,

(b − a)U + a ∼ U (a, b) .

That means, we can draw U ∼ U (0, 1) and set X = (b − a)U + a. Then X ∼ U (a, b).

# Try for yourself

set.seed(1)
repeats <- 1e4
b <- 10
a <- 5
U <- runif(repeats, min = 0, max = 1)
X <- (b - a) * U + a #R is vectorized

hist(X)

Questions to think about


• Given a sample of pseudorandom draws from U (0, 1) and perfectly IID draws
from U (0, 1), would you be able to tell the difference?

• Could we obtain uniform samples from R?

2.4 Exercises
1. (Using R) Consider the multiplicative congruential method. For a, m positive
integers
xn = axn−1 mod m .

(a) Set seed x0 = 5, m = 10^4, a = 2. Generate n = 10^4 pseudorandom numbers
using the above method. Does this look like a (pseudo) random sample from
Uniform[0, 1]? Maybe plot a histogram to see the empirical distribution.

(b) Now look at only the first 10 numbers: x1 , x2 , . . . , x10 . What is the problem
here?

(c) How can you fix the problem noted in the previous step?

3 Generating Discrete Random Variables
Suppose X is a discrete random variable having probability mass function

Pr(X = xj) = pj ,  j = 0, 1, . . . ,  with Σj pj = 1 .

Examples of such random variables are: Bernoulli, Poisson, Geometric, Negative Binomial, Binomial, etc. We will learn two methods to draw realizations of such a discrete random variable:

1. Inverse transform method

2. The acceptance-rejection technique

3.1 Inverse transform method


Let’s demonstrate the inverse transform method with an example first.

Example 2 (Bernoulli distribution). If X ∼ Bern(p), then

Pr(X = 1) = p and Pr(X = 0) = 1 − p := q .

Let U ∼ U [0, 1]. Define

X = 0 if U ≤ q ,  and  X = 1 if q < U ≤ 1 .

Then X ∼ Bern(p).

Proof. To show the result we only need to show that Pr(X = 1) = p and Pr(X = 0) =
1 − p. Recall that by the cumulative distribution function of U [0, 1], for any 0 < t < 1,
Pr(U ≤ t) = t. Using this,

Pr(X = 0) = Pr(U ≤ q) = q ,

and also
Pr(X = 1) = Pr(q < U ≤ 1) = 1 − q = p .

Algorithm 1 Inverse transform for Bern(p)
1: Draw U ∼ U [0, 1]
2: if U ≤ q then X = 0 else X = 1

Inverse transform method: The principles used in the above example can be extended to any generic discrete distribution. Consider a distribution with mass function

Pr(X = xj) = pj  for j = 0, 1, . . . ,  with Σ_{j=0}^{∞} pj = 1 .

Let U ∼ U [0, 1]. Set X to be

X = x0  if U ≤ p0
X = x1  if p0 < U ≤ p0 + p1
X = x2  if p0 + p1 < U ≤ p0 + p1 + p2
  . . .
X = xj  if Σ_{i=0}^{j−1} pi < U ≤ Σ_{i=0}^{j} pi
  . . .

This works because


Pr(X = xj) = Pr( Σ_{i=0}^{j−1} pi < U ≤ Σ_{i=0}^{j} pi ) = Σ_{i=0}^{j} pi − Σ_{i=0}^{j−1} pi = pj .

This method is called the Inverse transform method since the algorithm is essentially
looking at the inverse cumulative distribution function of the random variable.

Example 3 (Poisson random variables). The probability mass function for the Poisson
random variable is

Pr(X = i) = pi = e^{−λ} λ^i / i! ,  i = 0, 1, 2, . . . .

Algorithm 2 Inverse transform for Poisson(λ)
1: Draw U ∼ U [0, 1]
2: if U ≤ p0 then
3: X=0
4: else if U ≤ p0 + p1 then
5: X=1
6: ...
7: else if U ≤ Σ_{i=0}^{j} pi then
8: X=j
9: ...

However, Algorithm 2 highlights a challenge in implementation.

Q. What happens when λ is large?

A Poisson(λ) distribution with a large λ yields small pj when j is small, so the algorithm must step through many terms before the cumulative sum exceeds U. This implies Algorithm 2 can be quite slow here. We will therefore discuss a more computationally efficient version.

We know that a realization from the Poisson will most likely be close to λ, so it is beneficial to start the search around λ. Set I = ⌊λ⌋, and check whether

Σ_{i=0}^{I−1} pi < U ≤ Σ_{i=0}^{I} pi .

If it is, then return X = I. Else, if U > Σ_{i=0}^{I} pi, then increase I; otherwise, decrease I and check again. ■
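To make this concrete, here is a minimal R sketch of this mode-centered search (the function name rpois_inv is my own; R's built-in ppois(i, lambda) returns the cumulative sum Σ_{j=0}^{i} pj):

rpois_inv <- function(lambda)
{
  U <- runif(1)
  I <- floor(lambda)          # start the search near the mode
  if(U > ppois(I, lambda)) {
    # cumulative sum too small: move up until ppois(I) >= U
    while(U > ppois(I, lambda)) I <- I + 1
  } else {
    # move down while the previous cumulative sum still exceeds U
    while(I > 0 && U <= ppois(I - 1, lambda)) I <- I - 1
  }
  I
}
samples <- replicate(1e4, rpois_inv(100))
mean(samples)   # should be close to lambda = 100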

Questions to think about

• What other example can you think of where the inverse transform method could
take a lot of time?

• Can you try and implement this for a Binomial random variable?

3.2 Accept-Reject for Discrete Random Variables
Although we can draw from any discrete distribution using the inverse transform
method, you can imagine that for distributions on countably infinite spaces (like the
Poisson distribution), the inverse transform method may be very expensive. In such
situations, acceptance-rejection sampling may be more reliable.

Let {pj } denote the pmf of the target distribution with Pr(X = aj ) = pj and let
{qj } denote the pmf of another distribution with Pr(Y = aj ) = qj . Suppose you can
efficiently draw from {qj } and you want to draw from {pj }. Let c be a constant such
that
pj / qj ≤ c < ∞  for all j such that pj > 0 .
If we can find such a {qj } and c, then we can implement an Acceptance-Rejection or
Accept-Reject sampler. The idea is to draw samples from {qj } and accept these samples
if they seem likely to be from {pj }.

Note: When {pj} has a finite set of states, c is always finite (since the maximum exists). However, when the target distribution does not have a finite set of states, then c need not be finite, and accept-reject may not be possible.

Algorithm 3 Acceptance-Rejection sampler to draw 1 sample from {pj }


1: Draw U ∼ U [0, 1]
2: Simulate Y = y with probability mass function qy
3: if U ≤ py / (c qy) then
4: Return X = y and stop
5: else
6: Goto step 1

Theorem 1. When c is finite, the Accept-Reject method generates a random variable X with

Pr(X = aj) = pj .

Further, the number of iterations needed to generate an acceptance is distributed as Geometric(1/c).

Proof. First, we look at the second statement. We note that the number of iterations
required to stop the algorithm is clearly geometrically distributed by the definition of

the geometric distribution – the distribution of the number of Bernoulli trials needed
to get one success (with support 1, 2, 3...).

We will show that the probability of success is 1/c. “Success” here is an acceptance.
First, consider

Pr(Y = aj, accept) = Pr(Y = aj) Pr(Accept | Y = aj)
                   = qj Pr( U ≤ pj/(c qj) | Y = aj )
                   = qj · pj/(c qj)
                   = pj / c .

Using this we can calculate the marginal probability of accepting:

Pr(accept) = Σj Pr(Y = aj, accept) = Σj pj/c = 1/c .

Thus, the second statement is proved. We will now use this to show the main statement. Note that

Pr(X = aj) = Σ_{n=1}^{∞} Pr(aj accepted on iteration n)
           = Σ_{n=1}^{∞} Pr(no acceptance until iteration n − 1) Pr(Y = aj, accept)
           = Σ_{n=1}^{∞} (1 − 1/c)^{n−1} · (pj / c)
           = (pj / c) · c     (the geometric series sums to c)
           = pj .

This completes the proof.

Note: Since the probability of acceptance in any loop is 1/c, the expected number of
loops for one acceptance is c. The larger c is, the more expensive the algorithm.

One important thing to note is that within the support {aj} of {pj}, the proposal distribution must always be positive. That is, for all aj in the support of {pj}, Pr(Y = aj) = qj > 0. In other words, the proposal distribution must have support at least as large as that of the target distribution.

Example 4 (Sampling from Binomial using AR). The binomial distribution has pmf

Pr(X = x) = (n choose x) p^x (1 − p)^{n−x}  for x = 0, 1, . . . , n .

We will use AR to simulate draws from Binomial(n, p). The first task is to choose a
proposal distribution. We could use any of Poisson, negative-binomial, or geometric
distributions. We cannot use Bernoulli, since the support of Bernoulli does not contain
the support of Binomial.

We choose to use the geometric distribution, but we must be a little careful.

We use the version of geometric distribution that is defined as the number of failures
before the first success, so that the support of the geometric distribution has 0 in it.
The pmf of the geometric distribution is

Pr(X = x) = (1 − p)^x p ,  x = 0, 1, . . . .

We will first find c. Note that

p(x)/q(x) = [ (n choose x) p^x (1 − p)^{n−x} ] / [ (1 − p)^x p ] = (n choose x) (1 − p)^{n−2x} p^{x−1} .

Set

c = max_{x=0,1,...,n} (n choose x) (1 − p)^{n−2x} p^{x−1} .

For n = 10, p = 0.25, we get c = 2.373 . . . .

To be safe (since we don't know all the decimal points), we may set c to be slightly larger (say c = 2.5), as c just needs to be an upper bound. Once c is known, the AR algorithm can be implemented as described. Here, we would expect, on average, 2.5 Geometric proposals until one acceptance.

Note that c depends on both n and p. In particular, if n is large, then c increases drastically. A way to understand this is that the mean of the target distribution (np) can be much larger than the mean of the proposal, (1 − p)/p. In that case, the bulk of the mass of the target distribution is far away from the bulk of the mass of the proposal distribution. This is not ideal: we want the pmf of the proposal and target to match each other as much as possible, so that c is close to 1. This suggests that we may not want to choose the same p in the proposal distribution!

A possible fix is to consider a Geometric(p*) proposal, where p* is such that

np = (1 − p*)/p*  ⇒  p* = 1/(np + 1) .

In this case, we have

p(x)/q(x) = [ (n choose x) p^x (1 − p)^{n−x} ] / [ (1 − p*)^x p* ] ,

and the maximum over {0, 1, . . . , n} can be determined on the computer. For n = 100 and p = .25, the old bound is 1028.497 and the new one is 6.0455, which is much more efficient! ■
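A minimal R sketch of this accept-reject sampler with the Geometric(p*) proposal; the bound c is computed numerically with dbinom and dgeom, and the function name rbin_ar is my own choice. Note that R's dgeom uses the same failures-before-first-success convention as above.

n <- 100; p <- 0.25
pstar <- 1 / (n * p + 1)       # match the proposal mean to np
ratio <- dbinom(0:n, n, p) / dgeom(0:n, pstar)
c <- max(ratio)                # approximately 6.0455
rbin_ar <- function()
{
  repeat {
    Y <- rgeom(1, pstar)       # propose from Geometric(p*)
    if(Y <= n && runif(1) <= dbinom(Y, n, p) / (c * dgeom(Y, pstar)))
      return(Y)
  }
}
samples <- replicate(1e4, rbin_ar())
mean(samples)   # should be close to np = 25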

Example 5 (Geometric Random Variable). We consider the Geometric random variable with pmf (number of failures before the first success)

Pr(X = x) = (1 − p)^x p ,  x = 0, 1, 2, . . . .

We cannot use Binomial as a proposal, but we can use Poisson. Let us consider the Poisson(λ) proposal. The Poisson random variable has pmf

Pr(X = x) = e^{−λ} λ^x / x! ,  x = 0, 1, 2, . . . .

The first step is to find c, if it exists:

p(x)/q(x) = (1 − p)^x p / ( e^{−λ} λ^x / x! ) = (p / e^{−λ}) ((1 − p)/λ)^x x! .

For small values of λ (< 1 − p), the above clearly diverges as x increases, so the maximum doesn't exist. This is true for large values of λ as well. To see this (intuitively), consider Stirling's approximation of the factorial:

log(x!) ≈ x log(x) − x  ⇒  x! ≈ e^{x log x − x} .

Using this:

p(x)/q(x) = (p / e^{−λ}) ((1 − p)/λ)^x x!
          ≈ (p / e^{−λ}) ((1 − p)/λ)^x e^{x log x − x}
          = (p / e^{−λ}) ( (1 − p) e^{log(x)} / (eλ) )^x .

Thus, no matter how large λ is, eventually as x increases, e^{log(x)} = x will be larger than λ and the ratio will diverge. Thus, this proposal does not allow an AR sampler for the Geometric distribution. ■

Questions to think about

• Why is c always greater than 1?

• What happens when c is large or small?

3.3 The Composition Method


We have now learned two algorithms for sampling from a discrete distribution: the in-
verse transform method and the accept-reject algorithm. The inverse transform method
can be used for any distribution and the accept-reject can be efficient if used properly.

For certain special distributions, it is easier to use a composition method for sampling. Suppose we have an efficient way of simulating random variables from two pmfs {p_j^(1)} and {p_j^(2)}, and we want to simulate from

Pr(X = j) = α p_j^(1) + (1 − α) p_j^(2) ,  j ≥ 0 ,  where 0 < α < 1 .

First, you should note that the above composition pmf is a valid pmf, since Σj Pr(X = j) = 1. How would we sample in such a situation?

Let X1 ∼ P^(1) and X2 ∼ P^(2). Set

X = X1 with probability α ,  and  X = X2 with probability 1 − α .
Algorithm 4 Composition method
1: Draw U ∼ U [0, 1]
2: if U ≤ α then simulate X ∼ P^(1), else simulate X ∼ P^(2), and stop

Proof. Consider

Pr(X = j)
= Pr(X = j, U ≤ α) + Pr(X = j, α < U ≤ 1) (by law of total probability)
= Pr(X = j | U ≤ α) Pr(U ≤ α) + Pr(X = j | α < U ≤ 1) Pr(α < U ≤ 1)
= Pr(X1 = j) Pr(U ≤ α) + Pr(X2 = j) Pr(α < U ≤ 1) (by independence of U and X1 , X2 )
= α p_j^(1) + (1 − α) p_j^(2) .

We can set this up more generally for k different distributions. In general, let Fi, i = 1, . . . , k, be distribution functions, and let αi be such that 0 < αi < 1 for all i and Σi αi = 1. The composition (or mixture) distribution is

F(x) = Σ_{i=1}^{k} αi Fi(x) .

Let Xi ∼ Fi. To simulate from the composition F, set

X = X1 with probability α1 ,
X = X2 with probability α2 ,
  . . .
X = Xk with probability αk .

Example 6 (Zero inflated Poisson distribution). A Poisson(λ) distribution usually has


a small mass at 0. But sometimes, we need a counting distribution with large mass
at 0. For example, consider the random variable X being the number of COVID-19
patients tested positive every hour. Many hours of the day this number may be 0, and
then this number can be quite high for some hours.

In such a case, we may use the zero inflated Poisson distribution (ZIP). Recall that if X ∼ Poisson(λ),

Pr(X = k) = e^{−λ} λ^k / k! ,  k = 0, 1, . . . .

If X ∼ ZIP(δ, λ) for δ > 0,

Pr(X = k) = δ + (1 − δ) e^{−λ}  if k = 0 ,
Pr(X = k) = (1 − δ) e^{−λ} λ^k / k!  if k ∈ {1, 2, . . . } .

Note that the mean of a ZIP is (1 − δ)λ < λ, since more mass is given to 0. We will use the composition method to sample from the ZIP distribution. To sample from a ZIP, first let {p_k^(1)} be the point mass at zero, defined by

Pr(X1 = 0) = 1 and Pr(X1 ≠ 0) = 0 ,

and let X2 ∼ Poisson(λ) with pmf {p_k^(2)}. Define the pmf:

Pr(X = k) = δ p_k^(1) + (1 − δ) p_k^(2) .

Then X ∼ ZIP(δ, λ). To see this, plug in k = 0 and k = 1, 2, . . . above.

Algorithm 5 Zero inflated Poisson distribution


1: Draw U ∼ U [0, 1]
2: if U ≤ δ then X = 0 else simulate X ∼ Poisson(λ)

Other composition or mixture distributions are also possible. Think about Zero-inflated
Binomial, Zero-inflated Geometric, 2-inflated Poisson, etc.
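A short R sketch of this composition sampler for the ZIP (the function name rzip and the parameter values are my own choices):

rzip <- function(N, delta, lambda)
{
  U <- runif(N)
  X <- rpois(N, lambda)   # the Poisson component
  X[U <= delta] <- 0      # the point-mass-at-zero component
  X
}
x <- rzip(1e4, delta = 0.3, lambda = 4)
mean(x)        # close to (1 - delta) * lambda = 2.8
mean(x == 0)   # the inflated mass at zero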

3.4 Exercises
1. Show that if U ∼ U (0, 1), then for any a, b,

(b − a) ∗ U + a ∼ U (a, b)

2. Use the inverse transform method to sample from a geometric distribution, where for 0 < p < 1 and q = 1 − p,

Pr(X = i) = p q^{i−1} ,  i ≥ 1 .

3. We want to draw a sample from the random vector (X, Y )⊤ , that follows the
distribution with joint probability mass function

P (X = i, Y = j) = θi,j where i, j ∈ {0, 1} .


Here Σ_{i,j} θ_{i,j} = 1. Write an inverse-transform algorithm to draw realizations of (X, Y )⊤.

4. List as many appropriate proposal distributions as you can think of for the fol-
lowing target distributions:

• Binomial

• Bernoulli

• Geometric

• Negative Binomial

• Poisson

5. (Using R) Draw 10,000 draws from a Binomial(20, .75) distribution using an


accept-reject sampler.

6. In an accept-reject algorithm, we need to find c such that

pj / qj ≤ c  for all j for which pj > 0 ,

and the probability of accepting in any iteration is 1/c. Why is c guaranteed to be more than 1?

7. Simulate from a Negative Binomial(n, p) using the inverse transform and accept-
reject methods. Implement in R with n = 10 successes and p = .30.

8. Simulate from the following "truncated Poisson distribution" with pmf:

Pr(X = i) = ( e^{−λ} λ^i / i! ) / ( Σ_{j=0}^{m} e^{−λ} λ^j / j! ) ,  i = 0, 1, 2, . . . , m .

Implement in R with m = 30 and λ = 20.

9. Suppose we want to obtain samples from a discrete distribution with pmf {pi}. We use accept-reject with a proposal distribution with pmf {qi}, such that for some α ∈ R:

pi / qi ∝ i^α ,  i = 1, 2, . . . .

For what values of α would this AR algorithm work?

10. Suppose we want to obtain samples from a discrete distribution with pmf {pi}. Two possible proposal distributions are {q_i^(1)} and {q_i^(2)}, yielding AR bounds c1 and c2 such that c1 > c2. Which proposal distribution is better?

11. Implement an algorithm to sample from a Zero Inflated Binomial distribution. Can you think of an application of such a distribution?

4 Generating continuous random variables
Similar to generating discrete random variables, there are various methods for gener-
ating continuous random variables. We will discuss three main methods:

1. Inverse transform

2. The accept-reject method

3. Ratio of uniforms

We will also discuss a few special samplers.

4.1 Inverse transform


The principles of the inverse transform method for discrete distributions apply similarly to continuous random variables. Consider a random variable X with probability density function f(x), so that f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1, with distribution function

F(x) = ∫_{−∞}^{x} f(t) dt .

The following theorem will be the foundation for the inverse transform method.

Theorem 2. Let U ∼ U [0, 1]. For any continuous distribution F , a random variable
X = F −1 (U ) has distribution F .

Proof. Let FX be the distribution function of X = F −1 (U ). We need to show that


FX = F . Note that for any x ∈ R,

FX (x) = Pr(X ≤ x)
= Pr(F −1 (U ) ≤ x)
= Pr(F (F −1 (U )) ≤ F (x)) (Since F is non-decreasing)
= Pr(U ≤ F (x))
= F (x) .

The above theorem then implies that if we can invert the CDF, then we can obtain random draws from that random variable.

Example 7. Exponential(1): For the Exponential(1) distribution, the cdf is F (x) =
1 − e−x . Thus,
F −1 (u) = − log(1 − u) .

To generate X ∼ Exp(1) we can thus use the following algorithm:

Algorithm 6 Exponential(1) Inverse transform


1: Generate U ∼ U [0, 1]
2: Set X = − log(1 − U ) ∼ Exp(1)

Similarly, we can draw from an Exponential(λ) distribution. ■
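In R, this takes two lines; here is a version for a general Exp(λ), with λ = 5 an arbitrary choice:

lambda <- 5
U <- runif(1e4)
X <- -log(1 - U) / lambda   # F^{-1}(U) for Exp(lambda)
mean(X)   # should be close to 1/lambda = 0.2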

Example 8. Cauchy distribution: The Cauchy distribution has pdf

f(x) = (1/π) · 1/(1 + x²) ,

and

u = F(x) = ∫_{−∞}^{x} f(y) dy = (1/π) arctan(x) + 1/2 .

So, F^{−1}(u) = tan(π(u − .5)).

Algorithm 7 Cauchy distribution


1: Generate U ∼ U [0, 1]
2: Set X = tan(π(U − .5)) ∼ Cauchy

Example 9. Gamma distribution: The CDF of a Gamma(n, λ) distribution is

F(x) = ∫_0^x λ e^{−λy} (λy)^{n−1} / Γ(n) dy .

Here, we don’t know the CDF in closed form and thus cannot analytically find the
inverse. This is an example where the inverse transform method cannot work in practice
(even though it works theoretically). Thus, unlike the discrete case, this genuinely
motivates the need for another method to sample from a distribution.

Questions to think about

• The CDF F (x) is a deterministic function, so how is F −1 (U ) a random quantity?

• Can we use the inverse transform method to generate sample from a normal
distribution?

4.2 Accept-reject method


Suppose we cannot generate from distribution F with pdf f (x), like the Gamma dis-
tribution example in inverse transform. We can use accept-reject in a similar way as
the discrete case. That is, we choose an appropriate proposal distribution with density
g(x), and accept or reject it based on certain probabilities.

Let the support of F be X and choose a proposal distribution G with density g(x) whose support is larger than or the same as the support of F. That is, if Y is the support of G, then X ⊆ Y. If we can find c such that

sup_{x∈X} f(x)/g(x) ≤ c ,

then an accept-reject sampler can be implemented.

Algorithm 8 Accept-reject for continuous random variables


1: Draw U ∼ U [0, 1]
2: Draw proposal Y ∼ G, independently
3: if U ≤ f(Y) / (c g(Y)) then
4: Return X = Y
5: else
6: Go to Step 1.

Theorem 3. Algorithm 8 returns X ∼ F . Further, the number of loops the AR


algorithm takes to return X is distributed Geometric(1/c).

Proof. Consider any set B in X . We will show that

Pr(X ∈ B) = F (B) .

First, we consider the probability of acceptance:

Pr(accept) = Pr( U ≤ f(Y)/(c g(Y)) )
           = E[ I( U ≤ f(Y)/(c g(Y)) ) ]
           = E[ E[ I( U ≤ f(Y)/(c g(Y)) ) | Y ] ]     (using iterated expectations)
           = E[ Pr( U ≤ f(Y)/(c g(Y)) | Y ) ]
           = E[ f(Y)/(c g(Y)) ]
           = ∫_Y f(y)/(c g(y)) g(y) dy
           = (1/c) ∫_Y f(y) dy
           = (1/c) ∫_X f(y) dy + (1/c) ∫_{Y\X} f(y) dy     (since X ⊆ Y, and f = 0 outside X)
           = 1/c .

Now that we have this established, consider

Pr(X ∈ B) = Pr(Y ∈ B | accept)
          = Pr( Y ∈ B, U < f(Y)/(c g(Y)) ) / Pr(accept)
          = c · E[ E[ I( Y ∈ B, U < f(Y)/(c g(Y)) ) | Y ] ]
          = c · E[ I(Y ∈ B) E[ I( U < f(Y)/(c g(Y)) ) | Y ] ]
          = c · E[ I(Y ∈ B) f(Y)/(c g(Y)) ]
          = c · ∫_B f(y)/(c g(y)) g(y) dy
          = ∫_B f(y) dy

          = F(B) .

From the proof, we know that Pr(accept) = 1/c, and so, just like the discrete case, the number of attempts it takes to generate an acceptance is distributed Geometric(1/c). Thus, the expected number of loops for an acceptance is c.

Accept-reject method: intuition

At a proposed value y:

• if f(y) is large but g(y) is small, this value will not be proposed often but is a good value for f, so we accept it with higher probability.

• if f(y) is small but g(y) is large, this value will be proposed often but is unlikely under f, so we accept it less often.

We can choose any g we want as long as its support contains the support of f and the resulting c is finite. However, some gs will be better than others, as measured by the expected number of iterations, c.

Example 10. Beta distribution: Consider the beta distribution Beta(4, 3), where

f(x) = Γ(7)/(Γ(4)Γ(3)) x^{4−1} (1 − x)^{3−1} = 60 x³ (1 − x)² ,  0 < x < 1 .

Consider a uniform proposal distribution, so that G = U (0, 1) and

g(x) = 1 for x ∈ (0, 1) .

Note that X = Y in this case. For this choice of g,

sup_{x∈(0,1)} f(x)/g(x) = sup_{x∈(0,1)} f(x) .

We can show that the maximum of f(x) occurs at x = 3/5, and

sup_{x∈(0,1)} f(x)/g(x) = 60 (3/5)³ (2/5)² = 2.0736 = c .
x∈(0,1) g(x) x∈(0,1) 5 5

Algorithm 9 Accept-reject for Beta(4, 3)
1: Draw U ∼ U [0, 1]
2: Draw proposal Y ∼ U (0, 1)
3: if U ≤ f(Y) / (c g(Y)) then
4: Return X = Y
5: else
6: Go to Step 1.
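A minimal R implementation of Algorithm 9, using the built-in dbeta for f (the function name rbeta_ar is my own):

c <- dbeta(3/5, 4, 3)   # = 60 (3/5)^3 (2/5)^2 = 2.0736
rbeta_ar <- function()
{
  repeat {
    Y <- runif(1)                        # proposal from U(0,1)
    if(runif(1) <= dbeta(Y, 4, 3) / c)   # g(Y) = 1
      return(Y)
  }
}
samples <- replicate(1e4, rbeta_ar())
hist(samples, breaks = 50)   # compare against the Beta(4,3) density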

In order to choose a good proposal distribution (one that yields a finite c), it is important to choose a g that has "fatter tails" than f. This ensures that as x → ∞ or x → −∞, g dominates f, so that the ratio f/g tends to 0 in the extremes rather than blowing up.

Example 11. Normal distribution

The target density function is

f(x) = (1/√(2π)) e^{−x²/2} .

We know that the t-distribution has the right support and fatter tails, and the "fattest" t distribution is the one with 1 degree of freedom, which is the Cauchy. The pdf of a Cauchy distribution is

g(x) = (1/π) · 1/(1 + x²) .

(We know we can sample from a Cauchy using the inverse transform, so that is easy.) We will need to find the supremum of the ratio of the densities. Consider

f(x)/g(x) = (π/√(2π)) (1 + x²) e^{−x²/2} .

When x → ±∞, e^{−x²/2} decreases more rapidly than x² increases, so the ratio tends to zero. This can be shown more formally using L'Hopital's rule.

Taking a derivative of the above ratio and setting it to 0 (and checking the second derivative condition) yields that the supremum occurs at x = −1, 1, so

sup_{x∈R} f(x)/g(x) = f(1)/g(1) = √(2π) e^{−1/2} ≈ 1.52 ,  so we may set c = 1.6 to be safe.

The actual algorithm can now be implemented similarly as before.

Note: What if the target distribution were Cauchy and the proposal N (0, 1)? The ratio would clearly diverge as x → ±∞, and thus an accept-reject sampler would not be possible. ■

Example 12. Sampling from a uniform circle: Consider the unit circle centered at (0, 0):

x² + y² < 1 ,  −1 < x, y < 1 .

We are interested in sampling uniformly from within this circle. Since the area of the circle is π, the target density is

f(x, y) = (1/π) I(x² + y² < 1) .

Consider the uniform distribution on the square as a proposal distribution:

g(x, y) = (1/4) I(−1 < x < 1) I(−1 < y < 1) .

First, we will find c. For x, y such that x² + y² < 1,

f(x, y)/g(x, y) = (4/π) I(x² + y² < 1) ≤ 4/π := c .

Next, note that

f(x, y)/(c g(x, y)) = (π/4) · (4/π) I(x² + y² < 1) = I(x² + y² < 1) .

So for any (x, y) drawn from within the square, the ratio is either 1 or 0; thus, there is no need to draw a uniform at all!

Note: How do we draw uniformly from within the box? Note that

g(x, y) = [ (1/2) I(−1 < x < 1) ] · [ (1/2) I(−1 < y < 1) ] = g1(x) · g1(y) ,

where g1 is the density of a U (−1, 1) random variable. Thus, with two independent draws U1, U2 ~iid U (−1, 1), the pair (U1, U2) is a draw from the uniform box.

Algorithm 10 Accept-reject for Uniform distribution on a circle


1: Draw proposal (U1 , U2 ) ∼ U (−1, 1) × U (−1, 1)
2: if U1² + U2² ≤ 1 then
3: Return (X, Y ) = (U1 , U2 )
4: else
5: Go to Step 1.
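A short R sketch of Algorithm 10:

N <- 1e4
x <- numeric(N); y <- numeric(N)
for(i in 1:N)
{
  repeat {
    u <- runif(2, min = -1, max = 1)   # propose uniformly on the square
    if(sum(u^2) <= 1) break            # accept if inside the circle
  }
  x[i] <- u[1]; y[i] <- u[2]
}
plot(x, y, pch = ".", asp = 1)   # uniform scatter over the unit circle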

Questions to think about

• In A-R, do we want c to be large or small?

• Why is c guaranteed to be more than 1?

• How can you decide whether one proposal distribution is better than another
proposal distribution?

• Try implementing the circle/square example in 3 dimensions, 4 dimensions, and general p dimensions. What happens to c?

• Can a similar A-R algorithm be implemented for Beta(m, n) for all m, n ∈ Z?

4.2.1 Choosing a proposal

Sometimes it is difficult to find a good proposal or even one that works! That is, for
a target density f (x) it can sometimes be challenging to find a proposal density g(x)
such that
sup_x f(x)/g(x) < ∞ .

Here are certain examples of when it may be difficult / impossible to implement accept-
reject.

Example 13 (Beta). Consider a Beta(m, n) with density

f(x) = Γ(m + n)/(Γ(m)Γ(n)) x^{m−1} (1 − x)^{n−1} ,  0 < x < 1 .

30
Depending on m and n, the Beta distribution can behave quite differently. Particularly,
note that when both m, n < 1 the Beta density function is unbounded!

When m, n < 1, if we use a uniform proposal distribution,

sup_{x∈(0,1)} f(x)/g(x) = sup_{x∈(0,1)} Γ(m + n)/(Γ(m)Γ(n)) x^{m−1} (1 − x)^{n−1} = ∞ .

So a Uniform distribution will not work. In fact, any proposal distribution with a
bounded density will not work. So this is an example of a distribution where it is
difficult to find a good proposal distribution.

However, when, say, n ≥ 1, then

f(x) = Γ(m + n)/(Γ(m)Γ(n)) x^{m−1} (1 − x)^{n−1} ≤ Γ(m + n)/(Γ(m)Γ(n)) x^{m−1} .

If we look at the upper bound, the function x^{m−1} on x ∈ (0, 1) can define a valid distribution if normalized. So, consider g(x) = m x^{m−1}, which is a proper density on 0 ≤ x ≤ 1 (easy to sample via the inverse transform: U^{1/m}), and

f(x)/g(x) ≤ Γ(m + n)/(m Γ(m)Γ(n)) := c .

This Accept-Reject sampler can be implemented easily. Similarly if m ≥ 1. Thus, an AR sampler is easier to implement here if one of m or n is more than (or equal to) 1. ■

Example 14 (Accept-Reject for Cauchy target). Consider the target density

f(x) = (1/π) · 1/(1 + x²) ,  x ∈ R .

The Cauchy distribution is known to have "fat tails", so that as x → ±∞ the density function decays to zero slowly. This means that it is very challenging to find a g(x) that "dominates" the density in the tails.

For example, let the proposal be N (0, 1). As discussed before, the ratio of the densities is

f(x)/g(x) = (√(2π)/π) e^{x²/2} / (1 + x²) → ∞ as x → ±∞!

In fact, as far as we know, no standard accept-reject algorithm is possible here. ■

4.2.2 Choosing parameters for a fixed proposal family

If you have chosen a family of proposal distributions that you know gives a finite c, it may be unclear what the best parameter for that proposal family is. That is, if the target is f(x) and the proposal density is g(x | θ) (where θ is a parameter you can change to change the behaviour of the proposal), then you want to find a value of θ so that the resulting proposal is the "best".

Notice that the upper bound will be a function of θ, so that

sup_x f(x)/g(x | θ) ≤ c(θ) .

The value c(θ) is the expected number of loops for the accept-reject algorithm. Since we want this to be small, the best proposal density within this family is the one that minimizes c(θ), so set

θ* := argmin_θ c(θ) .

Example 15 (Gamma distribution). Consider the target distribution Gamma(α, β) with density

f(x) = β^α/Γ(α) x^{α−1} e^{−βx} .

Further, suppose we want to use an Exp(λ) proposal. Then

g(x | λ) = λ e^{−λx} .

We can now find c(λ):

f(x)/g(x | λ) = [ β^α x^{α−1} e^{−βx} ] / [ Γ(α) λ e^{−λx} ] = β^α/(λ Γ(α)) x^{α−1} e^{−x(β−λ)} .

First note that no matter what λ is, if 0 < α < 1, then f (x)/g(x) → ∞ as x → 0. So
accept-reject with this proposal won’t work!

However, when α ≥ 1, then x^{α−1} increases, so we want to choose λ such that e^{−x(β−λ)} decreases (since exponential decay is more powerful than polynomial increase; of course, you should show this more mathematically). Thus we want λ < β!

Thus, restricting attention to α ≥ 1 and λ < β,

c(λ) = sup_x f(x)/g(x | λ) = sup_{x>0} β^α/(λ Γ(α)) x^{α−1} e^{−x(β−λ)} ,

which, you can show, occurs at

x = (α − 1)/(β − λ) ,

for which

c(λ) = β^α/(λ Γ(α)) ( (α − 1)/(β − λ) )^{α−1} e^{1−α} ,

which is minimized for λ = β/α. Thus, the optimal exponential proposal for Gamma(α, β), α > 1, is Exp(β/α). ■
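A minimal R sketch of the resulting sampler, with α = 3 and β = 2 as arbitrary choices (the function name rgamma_ar is my own; in practice one would simply call rgamma):

alpha <- 3; beta <- 2
lam <- beta / alpha                     # optimal Exp rate derived above
xstar <- (alpha - 1) / (beta - lam)     # maximizer of f/g
c <- dgamma(xstar, alpha, rate = beta) / dexp(xstar, rate = lam)
rgamma_ar <- function()
{
  repeat {
    Y <- rexp(1, rate = lam)
    if(runif(1) <= dgamma(Y, alpha, rate = beta) / (c * dexp(Y, rate = lam)))
      return(Y)
  }
}
samples <- replicate(1e4, rgamma_ar())
mean(samples)   # close to alpha/beta = 1.5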

Questions to think about

• How would you implement accept-reject for Gamma(α, β) for 0 < α < 1?

4.3 The Box-Muller transformation for N (0, 1).


A classical method to generate samples from N (0, 1) is the Box-Muller transformation
method. Here, we will draw random variables (R2 , Θ) from a certain distribution in the
polar coordinate system, and then use a transformation h, so that h(R2 , Θ) ∼ N (0, 1).
First, we will need some theory for this.
Let X, Y ~iid N (0, 1). The joint density of (X, Y) is

f(x, y) = (1/(2π)) e^{−x²/2} e^{−y²/2} ,  x ∈ R, y ∈ R.

Let (R², Θ) denote the polar coordinates of (X, Y), so that X = R cos Θ and Y = R sin Θ; here the support of R is (0, ∞) and the support of Θ is (0, 2π). Then,

R² = X² + Y² ,  tan Θ = Y/X .

Notationally, we denote a realization from (R², Θ) as (d, θ) and find the joint density f(d, θ). Thus, let d = x² + y² and θ = tan⁻¹(y/x). The density of (d, θ) can be found by

f(d, θ) = |J| f(x, y) ,  where J = det [ ∂x/∂d  ∂y/∂d ; ∂x/∂θ  ∂y/∂θ ] .

Solving for J, with x = √d cos θ and y = √d sin θ,

J = det [ (1/(2√d)) cos θ   (1/(2√d)) sin θ ;  −√d sin θ   √d cos θ ] = 1/2 .

Since d = x² + y², the joint density of (R², Θ) is

f(d, θ) = (1/2) (1/(2π)) e^{−d/2} ,  0 < d < ∞, 0 < θ < 2π
        = [ (1/(2π)) I(0 < θ < 2π) ] · [ (1/2) e^{−d/2} I(0 < d < ∞) ] ,

where the first factor is the U (0, 2π) density and the second is the Exp(2) density.

This is a separable density, so R2 and Θ are independent, and Θ ∼ U [0, 2π] and
R2 ∼ Exp(2).

To generate from Exp(2), we can use the inverse transform method: if U ∼ U (0, 1), then −2 log U ∼ Exp(2) (verify for yourself). To generate from U (0, 2π), we know that if U ∼ U (0, 1), then 2πU ∼ U (0, 2π). The Box-Muller method is then given in Algorithm 11, which produces X and Y from N (0, 1) independently.

Algorithm 11 Box-Muller algorithm for N (0, 1)

1: Generate U1 and U2 from U [0, 1] independently
2: Set R² = −2 log U1 and Θ = 2πU2
3: Set X = R cos(Θ) = √(−2 log U1) cos(2πU2)
4: and Y = R sin(Θ) = √(−2 log U1) sin(2πU2) .
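In R, the Box-Muller transform is a few vectorized lines:

N <- 1e4
U1 <- runif(N); U2 <- runif(N)
R <- sqrt(-2 * log(U1))   # R^2 ~ Exp(2) by the inverse transform
Theta <- 2 * pi * U2      # Theta ~ U(0, 2*pi)
X <- R * cos(Theta)
Y <- R * sin(Theta)
qqnorm(X)   # points should fall on a straight line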

4.4 Ratio-of-Uniforms
Ratio-of-uniforms is a powerful, though not so popular, method to generate samples from continuous random variables. When it works, it can work really well. The method is based critically on the following theorem.

Theorem 4. Let f(x) be a target density with distribution function F. Define the set

C = { (u, v) : 0 ≤ u ≤ √( f(v/u) ) } .

If C is bounded, let (U, V) be uniformly distributed over the set C; then V/U ∼ F.

Proof. We will show that the density of Z = V /U is f (z). Note that by definition, the
joint density of (U, V ) is

f_{(U,V)}(u, v) = I{(u, v) ∈ C} / ∫∫_C du dv .

Consider the transformation (U, V) ↦ (U, Z) with Z = V/U. Then U = U and V = UZ. It's easy to see that the Jacobian for this transformation is u. So

f_{(U,Z)}(u, z) = u I{ 0 ≤ u ≤ f^{1/2}(z) } / ∫∫_C du dv .

Now that we have the joint distribution of (U, Z), all we need to show is that the
marginal distribution of Z is F . Finding the marginal distribution of Z = V /U , we
integrate out U ,
fZ(z) = ∫ u I{ 0 ≤ u ≤ f^{1/2}(z) } du / ∫∫_C du dv
      = ( 1 / ∫∫_C du dv ) ∫_0^{f^{1/2}(z)} u du
      = f(z) / ( 2 ∫∫_C du dv ) .

Since fZ(z) and f(z) are both densities, this implies that

1 = ∫ fZ(z) dz = ∫ f(z) dz / ( 2 ∫∫_C du dv ) = 1 / ( 2 ∫∫_C du dv )  ⇒  ∫∫_C du dv = 1/2 .

This implies fZ(z) = f(z). Thus, Z = V/U has the desired distribution.

So if we can draw (U, V ) ∼ Unif(C), then V /U ∼ F . But C looks quite complicated,


so how do we uniformly draw from C?

Think back to the AR technique used to draw uniformly from a circle! If C is a bounded
set, then if we enclose C in a rectangle, we can use accept-reject to draw uniform draws
from C! So, the task is to find [0, a] × [b, c] such that

0 ≤ u ≤ a ,  b ≤ v ≤ c  for all (u, v) ∈ C .

We just need to find any such a, b, c.

First, note that if sup_x f^{1/2}(x) exists, then

0 ≤ u ≤ f^{1/2}(v/u) ≤ sup_x f^{1/2}(x) =: a .

Note now that inside C, if x = v/u, then v/x = u ≤ f^{1/2}(x). This implies that

v/x ≤ f^{1/2}(x) .

Now:

for x ≥ 0 :  v ≤ x f^{1/2}(x) ≤ sup_{x≥0} x f^{1/2}(x) =: c ,
for x ≤ 0 :  v ≥ x f^{1/2}(x) ≥ inf_{x≤0} x f^{1/2}(x) =: b .

Note that if √(f(x)) or x² f(x) is unbounded, then C is unbounded, and the method cannot work. Now that we have found the rectangle [0, a] × [b, c], we can propose from the rectangle and check whether the proposed value is in the region C; if it is, we accept it and return V/U. This leads to the following algorithm:

Algorithm 12 Ratio-of-Uniforms
1: Generate (U, V ) ∼ U [0, a] × U [b, c]
2: If U ≤ √( f(V/U) ), then set X = V/U.
3: Else go to 1.

Steps 1 and 2 in Algorithm 12 are implementing an Accept-Reject to sample uniformly from C. To understand how effective this algorithm will be, we can calculate the probability of acceptance for the AR. First, note that

sup_{(u,v)∈C} f(u, v)/g(u, v) = sup_{(u,v)∈C} [ I((u, v) ∈ C) / ∫∫_C du dv ] / [ 1/(a(c − b)) ] = 2a(c − b) .

Thus,

Pr(accepting for AR in RoU) = 1/(2a(c − b)) .
So if a is large and/or (c − b) is large, the probability is small, and thus the algorithm
will take a large number of loops to yield one acceptance.

Example 16 (Exponential(1)).

f (x) = e−x x≥0

Here,
C = { (u, v) : 0 ≤ u ≤ e^{−v/(2u)} } .

Since e^{−x/2} is a decreasing function, a = sup_x e^{−x/2} = 1. Additionally,

b = inf_{x≤0} x e^{−x/2} = 0  (since the support is x ≥ 0) ,

and

c = sup_{x≥0} x e^{−x/2} = 2e^{−1}  (show for yourself) .

So we sample from U [0, 1] × [0, 2/e] and then implement accept-reject. ■
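A minimal R sketch of this ratio-of-uniforms sampler for Exp(1):

a <- 1; b <- 0; cc <- 2 * exp(-1)   # the bounding rectangle [0,1] x [0, 2/e]
rou_exp <- function()
{
  repeat {
    U <- runif(1, 0, a)
    V <- runif(1, b, cc)
    if(U <= exp(-(V / U) / 2))   # check (U, V) in C, i.e. U <= f^{1/2}(V/U)
      return(V / U)
  }
}
samples <- replicate(1e4, rou_exp())
mean(samples)   # should be close to 1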

Example 17 (Normal(θ, σ²)). The target density is

f(x) = 1/√(2πσ²) e^{−(x−θ)²/(2σ²)} .

The set C is

C = { (u, v) : 0 ≤ u ≤ (1/(2πσ²))^{1/4} e^{−(v−uθ)²/(4σ²u²)} } .

In order to draw the region later, we need to rearrange the bound above, which gives us (by taking logs):

(v − θu)² ≤ −4σ²u² ( log u + (1/4) log(2πσ²) ) .

The above defines the region C. Now, in order to bound the region C, we find the
limits a, b, c: (The following calculations were incorrect in class. I have now fixed it
here.)

a = sup_{x∈R} (2πσ²)^{−1/4} e^{−(x−θ)²/(4σ²)} = (2πσ²)^{−1/4} ,

b = inf_{x≤0} (1/(2πσ²))^{1/4} x e^{−(x−θ)²/(4σ²)}  and  c = sup_{x≥0} (1/(2πσ²))^{1/4} x e^{−(x−θ)²/(4σ²)} .

First, we find b, and then c will follow similarly. Note that b will be non-positive, and thus, to find the infimum, we first take the negative and then the log. That is, for x < 0, let

A(x) = (1/(2πσ²))^{1/4} (−x) e^{−(x−θ)²/(4σ²)} .

Then A(x) is non-negative, and we want to find the supremum of A(x) for x ≤ 0. Taking logs:

log A(x) = −(1/4) log(2πσ²) + log(−x) − (x − θ)²/(4σ²)

⇒ d log A(x)/dx = 1/x − (x − θ)/(2σ²) = 0  (set)

⇒ x = ( θ ± √(θ² + 8σ²) ) / 2 .


Now, we need to decide which of ± to choose. Note that √(θ² + 8σ²) > θ. Hence, since we are taking sup_{x≤0} A(x), we obtain

x_b := ( θ − √(θ² + 8σ²) ) / 2 .

Thus,

b = x_b f^{1/2}(x_b) .

Similarly, we obtain

x_c := ( θ + √(θ² + 8σ²) ) / 2 ,

with

c = x_c f^{1/2}(x_c) .

All that needs to be done now is to implement Algorithm 12 with these values of a, b, c,
given the values of θ and σ². ■

Questions to think about

1. Construct a similar RoU sampler for Cauchy distribution.

2. Why does RoU fail when C is unbounded?

3. For N (0, 1) between RoU and AR using Cauchy proposal, which is more efficient,
in terms of the expected number of uniforms required for one acceptance?

4.5 Miscellaneous methods in sampling


4.5.1 Known relationships

It is always useful to remember the relationships between different distributions.


1. Binomial distribution: We know that if Y1, Y2, . . . , Yn ~iid Bern(p), then

X = Y1 + Y2 + · · · + Yn ∼ Bin(n, p) .

So, we can simulate n Bernoulli variables, add them up, and we have a realization
from a Binomial(n, p).

2. Negative binomial distribution: Number of failures until the rth success, so possibly related to the geometric! If Y1, Y2, . . . , Yr ~iid Geom(p) (on failures), then

X = Y1 + Y2 + · · · + Yr ∼ N B(r, p) .

3. Beta distribution If X ∼ Gamma(a, 1) and Y ∼ Gamma(b, 1), then

X/(X + Y) ∼ Beta(a, b) .

4. Dirichlet distribution: The Dirichlet distribution is a distribution over pmfs (probability vectors):

f(x1, x2, . . . , xk) = Γ(α1 + · · · + αk)/( Π_{i=1}^{k} Γ(αi) ) Π_{i=1}^{k} x_i^{αi−1} ,  0 ≤ xi ≤ 1 ,  Σ_{i=1}^{k} xi = 1 .

The Dirichlet distribution is a generalization of the Beta distribution. Similarly, draw independently

Y1 ∼ Gamma(α1, 1), Y2 ∼ Gamma(α2, 1), . . . , Yk ∼ Gamma(αk, 1) ,

and let

Xi = Yi / Σ_{j=1}^{k} Yj .

Then (X1, . . . , Xk) ∼ Dir(α1, α2, . . . , αk).

5. Chi-squared distribution: If Y1, Y2, . . . , Yk ~iid N (0, 1), then

X = Y1² + Y2² + · · · + Yk² ∼ χ²_k .

This way we can simulate χ2 distributions with integer degrees of freedom.

6. t-distribution: Let Z ∼ N (0, 1) and Y ∼ χ²_k, independently; then

X = Z / √(Y/k) ∼ t_k .

7. Location-scale family: Let F be a distribution in the location-scale family, and let Z have the "standard" CDF FZ(z), in the sense that it has no parameters. Then for µ ∈ R and σ > 0,

Y = µ + σZ has CDF FY(y) = FZ( (y − µ)/σ ) .

If Z has pdf f(z), then Y has pdf σ⁻¹ f( (y − µ)/σ ).

So, if Z ∼ N (0, 1), then Y = µ + σZ ∼ N (µ, σ 2 ).

4.5.2 Sampling from mixture distributions

Mixture distributions for continuous densities is very similar to the discrete distri-
butions. For j = 1, . . . , k, let Fj (x) be a distribution function and let fj (x) be the
corresponding density function. A mixture distribution function, F is

F(x) = Σ_{j=1}^{k} pj Fj(x) ,  where pj > 0 and Σj pj = 1 ,

and the corresponding density is

f(x) = Σ_{j=1}^{k} pj fj(x) .

If we can draw from each Fj then we can draw from the mixture distribution F as well
using the same steps as in the discrete case.

Algorithm 13 Sampling from a mixture distribution


1: Generate U ∼ U [0, 1]
2: If U < p1, generate from F1
3: Else if U < p1 + p2, generate from F2
4: . . .
5: Else if U < Σ_{i=1}^{k−1} pi, generate from F_{k−1}
6: Else, generate from Fk .

Example 18 (Mixture of normals). Consider two normal distributions N(µ1, σ1²) and N(µ2, σ2²). For some 0 < p < 1, the mixture density is

f(x) = p f1(x; µ1, σ1²) + (1 − p) f2(x; µ2, σ2²)
     = p (1/√(2πσ1²)) exp{ −(1/2) ((x − µ1)/σ1)² } + (1 − p) (1/√(2πσ2²)) exp{ −(1/2) ((x − µ2)/σ2)² } .

Mixture distributions are particularly useful for clustering problems, and we will come back to them in the data analysis part of the course. To sample from this distribution, we can use the following algorithm.

Algorithm 14 Sampling from a Gaussian mixture
1: Generate U ∼ U [0, 1]
2: If U < p, generate N (µ1 , σ12 ) (using location-scale family trick)
3: Otherwise, generate N (µ2 , σ22 ).
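A short R sketch of Algorithm 14, vectorized over N draws and using the location-scale trick (the function name rnormmix and the parameter values are my own choices):

rnormmix <- function(N, p, mu1, sigma1, mu2, sigma2)
{
  U <- runif(N)
  Z <- rnorm(N)   # N(0,1) draws
  # component 1 with probability p, component 2 otherwise
  ifelse(U < p, mu1 + sigma1 * Z, mu2 + sigma2 * Z)
}
x <- rnormmix(1e4, p = 0.3, mu1 = -2, sigma1 = 1, mu2 = 2, sigma2 = 0.5)
hist(x, breaks = 50)   # a bimodal histogram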

4.5.3 Multidimensional target

We have almost entirely focused on univariate densities, but most often interest is in a multivariate/multidimensional target distribution.

• Conditional Distribution: Consider a variable X = (X1 , X2 , . . . , Xk ), with a


joint pdf
f (x) = f (x1 , x2 , . . . , xk ) .

We can use conditional distribution properties:

f (x) = fX1 (x1 )fX2 |X1 (x2 ) . . . fXk |X1 ,...,Xk−1 (xk ) .

Algorithm 15 Sampling X using conditional distributions


1: Generate X1 ∼ fX1 (x1 )
2: Generate X2 ∼ fX2 |X1 (x2 )
3: Generate X3 ∼ fX3 |X2 ,X1 (x3 )
4: . . .
5: Generate Xk ∼ fXk|Xk−1,...,X1(xk)


6: Return X = (X1 , . . . , Xk )

• Multivariate normal: Consider sampling from a Nk (µ, Σ) where Σ is positive


definite. Then for | · | denoting determinant,
fX(x) = (1/(2π))^{k/2} |Σ|^{−1/2} exp{ −(x − µ)ᵀ Σ⁻¹ (x − µ)/2 } ,

is the density of a multivariate normal distribution with mean µ and covariance
Σ. First, note that since Σ is a positive-definite (symmetric) matrix, we can use
the eigenvalue decomposition
Σ = QΛQ−1

where Q is the matrix of eigenvectors and, since Σ is symmetric, Q is guaranteed to be an orthogonal matrix, so that Q⁻¹ = Qᵀ, and Λ is a diagonal matrix of
eigenvalues. Then, we can define the square-root of Σ as

Σ1/2 := QΛ1/2 Q−1 ,

so that
Σ1/2 Σ1/2 = QΛ1/2 Q−1 QΛ1/2 Q−1 = QΛQ−1 .

Similarly, the inverse square-root is

Σ−1/2 = QΛ−1/2 Q−1 ,

Set Z = Σ−1/2 (X − µ). Then

Z ∼ Nk (0, Ik ) .

That is, Z is a k-dimensional multivariate normal vector with an identity covariance matrix, which implies that if Z = (Z1, . . . , Zk), then Cov(Zi, Zj) = 0 for all i ≠ j.

For the normal distribution, if the covariance is zero, then the random variables are independent! This isn't true in general, but it is true for normal random variables.

So, to sample from Nk(µ, Σ), we can sample Z1, Z2, . . . , Zk ~iid N (0, 1), set Z = (Z1, . . . , Zk), and then

X := µ + Σ^{1/2} Z ∼ Nk(µ, Σ) .
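A minimal R sketch of this construction via the eigendecomposition (the function name rmvn is my own; the µ and Σ below are arbitrary choices):

rmvn <- function(N, mu, Sigma)
{
  k <- length(mu)
  eig <- eigen(Sigma, symmetric = TRUE)
  Sig.half <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
  Z <- matrix(rnorm(N * k), nrow = N, ncol = k)   # iid N(0,1) entries
  sweep(Z %*% Sig.half, 2, mu, FUN = "+")         # each row is mu + Sigma^{1/2} z
}
Sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)
X <- rmvn(1e4, mu = c(1, -1), Sigma = Sigma)
colMeans(X)   # close to mu
cov(X)        # close to Sigma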

Questions to think about

• Can you construct a zero-inflated normal distribution and find a suitable appli-
cation of it?

4.6 Exercises
1. Using the inverse transform method, simulate from Exp(λ) for any λ > 0. Im-
plement this for λ = 5.

2. Use the inverse transform method to obtain samples from the Weibull(α, λ)

f(x) = αλ x^{α−1} e^{−λx^α} ,  x > 0 .

3. (Ross 5.1) Give a method for generating a random variable having density func-
tion
f(x) = e^x/(e − 1) ,  0 ≤ x ≤ 1 .

4. (Ross 5.2) Give a method for generating a random variable having density func-
tion
f(x) = (x − 2)/2  if 2 ≤ x ≤ 3 ,
f(x) = (2 − x/3)/2  if 3 ≤ x ≤ 6 .

5. (Ross 5.3) Use the inverse transform method to generate a random variable having
distribution function

F(x) = (x² + x)/2 ,  0 ≤ x ≤ 1 .

6. Sample from the following distribution using two different methods:

f(x) = (3/4)(1 − x²)  if x ∈ (−1, 1) ,  and 0 otherwise.

7. (Ross 5.6) Let X be an Exp(1). Provide an efficient algorithm for simulating


a random variable whose distribution is the conditional distribution of X given

that X < 0.05. That is, its density function is

f(x) = e^{−x}/(1 − e^{−0.05}) ,  0 < x < 0.05 .

Using R generate 1000 such random variables and use them to estimate E[X |
X < 0.05].

8. (Ross 5.7) Suppose it is relatively easy to generate random variables from any
of the distributions Fi , i = 1, . . . , k. How could we generate a random variable
from the distribution function
F(x) = Σ_{i=1}^{k} pi Fi(x) ,

where pi ≥ 0 and Σi pi = 1.

9. (Ross 5.8) Using the previous exercise, provide algorithms for generating random
variables from the following distributions:
(a) F(x) = (x + x³ + x⁵)/3 ,  0 ≤ x ≤ 1 .

(b) F(x) = (1 − e^{−2x} + 2x)/3  if x ∈ (0, 1) ,  and F(x) = (3 − e^{−2x})/3  if x ∈ [1, ∞) .

10. (Ross 5.9) Give a method to generate a random variable with distribution function
F(x) = ∫_0^∞ x^y e^{−y} dy ,  0 ≤ x ≤ 1 .

11. (Ross 5.15) Give two methods for generating a random variable with density
function
f (x) = xe−x , 0 ≤ x < ∞ .

12. (Ross 5.18) Give an algorithm for generating a random variable having density
function
f(x) = 2x e^{−x²} ,  x > 0 .

13. (Ross 5.19) Show how to generate a random variable whose distribution function is

F(x) = (x + x²)/2 ,  0 ≤ x ≤ 1 ,

using the inverse transform, accept-reject, and composition methods.

14. (Ross 5.20) Use the AR method to find an efficient way to generate a random
variable having density function

f(x) = (1 + x) e^{−x}/2 ,  0 < x < ∞ .

15. (Ross 5.21) Consider the target density to be a truncated Gamma(α, 1), α < 1
defined on (a, ∞) for some a > 0. Suppose the proposal distribution is a truncated
exponential(λ), defined on the same (a, ∞). What is the best λ to use?

16. (Using R)

(a) Implement an accept-reject sampler to sample uniformly from the circle


{x2 + y 2 ≤ 1} and obtain 10000 samples and estimate the probability of
acceptance. Does it approximately equal π/4?

(b) Now consider sampling uniformly from a p-dimensional sphere (a circle is


p = 2). Consider a p-vector x = (x1 , x2 , . . . , xp ) and let ∥ · ∥ denote the
Euclidean norm. The pdf of this distribution is
f(x) = Γ(p/2 + 1)/π^{p/2} I{∥x∥ ≤ 1} .

Use a uniform p-dimensional hypercube to sample uniformly from this sphere.


Implement this for p = 3, 4, 5, and 6. What happens as p increases?

17. (Using R)

(a) Using accept-reject and a standard normal proposal, obtain samples from a
truncated standard normal distribution with pdf:

f(x) = 1/(Φ(a) − Φ(−a)) · (1/√(2π)) e^{−x²/2} I(−a < x < a) ,

where Φ(·) is the CDF of a standard normal distribution. Run for a = 4 and a = 1. What are the differences between the two settings?

(b) Now consider a multivariate truncated normal distribution, where for x =


(x_1, x_2, . . . , x_p), the pdf is

    f (x) = [1/(Φ(a) − Φ(−a))]^p (1/√(2π))^p e^{−xᵀx/2} I(−a < x_i < a for each i) .

Implement an accept-reject sampler with proposal distribution Np (0, I) with
a = 4 and p = 3, 10 and with a = 1 and p = 3, 10. Describe the differences
between these settings.

18. Implement an accept-reject sampler to draw from a Gamma(α, 1) for α > 1.


Using the above method, can you draw samples from Gamma(α, β), for any β?

19. In accept-reject sampling, why is c ≥ 1?

20. Use ratio-of-uniforms method to sample from a truncated exponential distribu-


tion with density
f (x) = e^{−x}/(1 − e^{−a}) ,   0 < x < a.
How efficient is this algorithm?

21. Use ratio-of-uniforms method to sample from the distribution with density

f (x) = 1/x² ,   x ≥ 1.

22. Use ratio-of-uniforms method to draw samples from a tν distribution for ν ≥ 1.

23. (Zero-inflated Gamma distribution) Suppose you are an auto-insurance company


and you want to study the cost of claims associated with each customer. That
is, each customer, if they have an accident, will come to you and claim insurance
money reimbursement for the accident. So

Let X = insurance money asked for by a customer in a month.

However, most customers will not enter into any accidents, so they will claim Rs
0. But when they do, they will claim reimbursement for some amount of money
that, say, will follow a Gamma distribution.

The density function can be defined as follows for 0 < p < 1

f (x) = p I(x = 0) + (1 − p) [β^α/Γ(α)] x^{α−1} e^{−βx} .

Provide an algorithm to sample this random variable.

5 Importance Sampling
We have so far learned many (many!) ways of sampling from different distributions. These sampling methodologies are particularly useful when we want to estimate characteristics of F. Using computer-simulated samples from F to estimate characteristics of F is broadly termed Monte Carlo.

5.1 Simple Monte Carlo


Suppose F is a distribution with density f . We are interested in estimating the expec-
tation of a function h : X → R with respect to F . That is, we want to estimate
θ := E_F[h(X)] = ∫_X h(x) f(x) dx ,

which we assume is finite. We also assume that

σ² = Var_F(h(X)) < ∞ .

Note: there is no “data” here, there is just an integral! We are just interested in
estimating an annoying integral.

Note: notation EF [X] means the expectation is with respect to F . From now on, it is
very important to keep track of what the expectation is with respect to.
Suppose we can draw iid samples X_1, . . . , X_N ∼ F (this we can do using the many
methods we have learned). Then, by the weak law of large numbers, as N → ∞,

θ̂ = (1/N) Σ_{t=1}^N h(X_t) →^p θ .

In addition, we can find the variance of the estimator:

Var(θ̂) = Var( (1/N) Σ_{t=1}^N h(X_t) )
       = (1/N²) Σ_{t=1}^N Var_F(h(X_t))    (because of independence)
       = Var_F(h(X_1))/N                   (because identically distributed)
       = σ²/N .

Naturally, a central limit theorem also holds if σ² < ∞, so that as N → ∞,

√N (θ̂ − θ) →^d N(0, σ²) .

This central limit theorem describes the expected behavior of θ̂ for large values of N.
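As a quick illustration, here is a minimal simple Monte Carlo sketch in R; the target (E[X²] for X ∼ N(0, 1), whose true value is 1) and the sample size are assumptions chosen only for illustration.

set.seed(1)
N <- 1e5
x <- rnorm(N)                         # iid draws from F = N(0, 1)
theta.hat <- mean(x^2)                # simple Monte Carlo estimate of E[X^2]
theta.hat + c(-1, 1) * qnorm(0.975) * sd(x^2)/sqrt(N)   # CLT-based 95% interval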

Q. But is there a way we can obtain a better estimator of θ?

A. Possibly by using importance sampling.

5.2 Simple importance sampling


Our goal is the same. For h : X → R, we want to estimate θ = E_F[h(X)]. Similar to the accept-reject sampler, we will choose a proposal distribution. Let G be a distribution with density g defined on X so that,
E_F[h(X)] = ∫_X h(x) f(x) dx
          = ∫_X [h(x) f(x)/g(x)] g(x) dx
          = E_G[ h(Z) f(Z)/g(Z) ] ,   Z ∼ G .

If Z_1, . . . , Z_N are iid from G, then an estimator of θ is

θ̂_g = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t) .

The estimator θ̂g is the importance sampling estimator, the method is called importance
sampling and G is the importance distribution.

Let

w(Z_t) = f(Z_t)/g(Z_t)

be the weight assigned to each point Z_t. Then θ̂_g is a weighted average of the h(Z_t). Intuitively, this means that depending on how likely a sampled value is under f relative to g, a weight is assigned to that value.

Example 19 (Moments of Gamma distribution). Suppose we want to estimate the
kth moment of a Gamma distribution. That is, let f be the density of a Gamma(α, β) distribution. Then

θ = ∫₀^∞ x^k [β^α/Γ(α)] x^{α−1} e^{−βx} dx .

Suppose we set G to be an Exponential(λ) distribution. Let Z_1, . . . , Z_N ∼ Exp(λ). Then

θ̂_g = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t)
    = (1/N) Σ_{t=1}^N [ β^α Z_t^k Z_t^{α−1} e^{−βZ_t} ] / [ Γ(α) λ e^{−λZ_t} ] .
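A minimal R sketch of this estimator (the values of N, k, α, β, λ below are assumptions chosen for illustration; note λ < β):

set.seed(1)
N <- 1e5; k <- 2; alpha <- 3; beta <- 2; lambda <- 1
z <- rexp(N, rate = lambda)                        # draws from the proposal G
ratio <- z^k * dgamma(z, shape = alpha, rate = beta) / dexp(z, rate = lambda)
mean(ratio)                                        # importance sampling estimate
gamma(alpha + k) / (gamma(alpha) * beta^k)         # exact kth moment, for comparison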

So we have now constructed an alternative estimator of θ. In fact, a different choice of


G will yield a different estimator. It is now important to study the properties of this
importance sampling estimator. We study a sequence of properties.

Theorem 5 (Unbiasedness). The importance sampling estimator θ̂g is unbiased for θ.

Proof. To show an estimator is unbiased, we need to show that E[θ̂_g] = θ. Consider

E[θ̂_g] = E_G[ (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t) ]
       = (1/N) Σ_{t=1}^N E_G[ h(Z_t) f(Z_t)/g(Z_t) ]
       = (1/N) Σ_{t=1}^N E_G[ h(Z_1) f(Z_1)/g(Z_1) ]    (identically distributed)
       = ∫_X [h(z) f(z)/g(z)] g(z) dz
       = ∫_X h(z) f(z) dz
       = θ .

Theorem 6. The importance sampling estimator is consistent for θ. That is, as N → ∞,

θ̂_g →^p θ .

Proof. Note that θ̂_g is just a sample average:

θ̂_g = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t) .

The law of large numbers applies to any sample average whose expectation is finite. So by the law of large numbers, as N → ∞,

θ̂_g →^p E[θ̂_g] = θ .

This means that as we get more and more samples from G, our estimator will get
increasingly closer to the truth.

However, we should never be happy with a point estimator!

It is essential to quantify the variability in our estimator θ̂_g in order to ascertain how “erratic” or “stable” the estimator is. We also want to establish the expected behavior of θ̂_g: does a central limit theorem hold? Notice that θ̂_g, like the simple Monte Carlo estimator, is just a sample average, so we should be able to directly apply the CLT, provided the variance is finite. Note that the variance of θ̂_g is

Var(θ̂_g) = Var_g( (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t) ) = (1/N) Var_g( h(Z_1) f(Z_1)/g(Z_1) ) =: σ_g²/N .

A central limit theorem will hold if σ_g² = Var_g( h(Z_1) f(Z_1)/g(Z_1) ) < ∞.

So the question is, when is this finite?

The following theorem provides a sufficient condition.

Theorem 7. Suppose σ² = Var_F(h(X)) < ∞. If g is chosen such that

sup_{z∈X} f(z)/g(z) ≤ M < ∞ ,

then σ_g² < ∞ .

Proof. First note that the variance of a random variable is finite if and only if its second moment is finite. So, consider the second moment of h(Z)f(Z)/g(Z), where Z ∼ G:

E_G[ (h(Z)f(Z)/g(Z))² ] = ∫_X [h(z)² f(z)²/g(z)²] g(z) dz
                        = ∫_X [f(z)/g(z)] h(z)² f(z) dz
                        ≤ M ∫_X h(z)² f(z) dz
                        = M E_F(h(X)²) < ∞   by assumption .

Thus, if accept-reject sampling is possible with the proposal G, then a simple importance sampling estimator of θ with a finite variance is also possible. We now also have a central limit theorem. Recall

σ_g² = Var_G( h(Z)f(Z)/g(Z) ) .   (1)

By the CLT, if σ_g² < ∞, then as N → ∞,

√N (θ̂_g − θ) →^d N(0, σ_g²) .   (2)

Further, an estimator of σ_g² is easily available since we have N samples of h(Z)f(Z)/g(Z). Thus, an estimator of σ_g² is the sample variance of these values:

σ̂_g² := (1/(N−1)) Σ_{t=1}^N ( h(Z_t)f(Z_t)/g(Z_t) − θ̂_g )² .
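Continuing the Gamma sketch above, here is a hedged R illustration of σ̂_g² and the resulting CLT interval (same assumed parameters):

set.seed(1)
N <- 1e5; k <- 2; alpha <- 3; beta <- 2; lambda <- 1
z <- rexp(N, rate = lambda)
ratio <- z^k * dgamma(z, shape = alpha, rate = beta) / dexp(z, rate = lambda)
theta.hat <- mean(ratio)
sigma2.hat <- var(ratio)                                  # estimates sigma_g^2
theta.hat + c(-1, 1) * qnorm(0.975) * sqrt(sigma2.hat/N)  # 95% CI for theta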

Example 20 (Gamma continued). Recall from the accept-reject example for Gamma(α, β) with α ≥ 1 that an Exponential(λ) proposal works only if λ < β. That means, when λ < β, there exists a finite M, and the importance sampling estimator will have a finite variance. ■

Questions to think about

1. Can we construct G so that its support, Y is larger than X ?

2. Check what happens with β = λ in this simulation.

3. Why would a CLT be useful here?

4. How would we check whether this importance sampler is better than IID Monte
Carlo?

5.2.1 Optimal proposals

How do we choose the importance distribution g? The proposal g should be chosen so


that:

• Sampling from G is relatively easy

• Var_g(θ̂_g) = σ_g²/N is smaller than the variance of the regular Monte Carlo estimator.

Note that one reason to use importance sampling is to obtain estimators with smaller variance than the original. So, if we can choose g such that σ_g² is minimized, that would be ideal!

Let’s look at this term:

σ_g² = Var_G( h(Z)f(Z)/g(Z) ) = E_G[ h(Z)² f(Z)²/g(Z)² ] − θ² = ∫_X h(z)² f(z)²/g(z) dz − θ² ,

where we label the integral as term A.

For the above to be small, term A should be close to θ2 . This logic leads to the following
theorem.

Theorem 8. If ∫_X |h(x)| f(x) dx ≠ 0, the importance density g* that minimizes σ_g² is

g*(z) = |h(z)| f(z) / E_F[|h(X)|] .

Proof. Consider the above importance density. The second moment of the importance sampling estimator with this density is:

θ² + σ²_{g*} = E_{G*}[ (h(Z)f(Z)/g*(Z))² ]
            = ∫_X [h(z)² f(z)²/g*(z)²] g*(z) dz
            = ∫_X [h(z)² f(z)²/(|h(z)| f(z))] E_F[|h(X)|] dz
            = E_F[|h(X)|] ∫_X |h(z)| f(z) dz
            = ( ∫_X |h(z)| f(z) dz )²
            = ( ∫_X [|h(z)| f(z)/g(z)] g(z) dz )²    for any other g defined on X
            = ( E_G[ |h(Z)| f(Z)/g(Z) ] )²
            ≤ E_G[ h(Z)² f(Z)²/g(Z)² ]    (Jensen’s inequality: for a convex function φ, φ(E[X]) ≤ E[φ(X)])
            = θ² + σ_g² .

Thus, for any generic proposal g defined on X, we have σ²_{g*} ≤ σ_g². Since this is true for all g, g* produces the smallest variance.

Note that, with this choice of proposal,

σ²_{g*} = Var_{g*}( h(Z)f(Z)/g*(Z) )
        = E_F[|h(X)|]² Var_{G*}( h(Z)f(Z)/(|h(Z)| f(Z)) )
        = E_F[|h(X)|]² Var_{G*}( h(Z)/|h(Z)| ) .

If on the support X, h(Z) = |h(Z)|, then the variance of the importance sampling estimator is zero!

Example 21 (Gamma distribution). Consider estimating moments of a Gamma(α, β) distribution. We actually know the optimal importance distribution here! For estimating the kth moment,

g*(z) ∝ |h(z)| f(z) ∝ |z^k| z^{α−1} exp{−βz} = z^{α+k−1} exp{−βz} .

So the optimal importance distribution is Gamma(α + k, β). In this case the variance of the estimator will be 0. ■

Example 22 (Mean of standard normal). Let h(x) = x and let f(x) be the density of a standard normal distribution. So we are interested in estimating the mean of the standard normal distribution. The universally optimal proposal in this case is

g*(x) = |x| e^{−x²/2} / ∫ |x| e^{−x²/2} dx .

But it may be quite challenging to draw samples from the above distribution! For importance sampling to be useful, we need not find the optimal proposal, as long as we can find a proposal that is more efficient than sampling from the target.

Consider an importance distribution of N(0, σ²) for some σ² > 0. The variance of the importance estimator is

σ_g² = ∫_{−∞}^{∞} h(x)² f(x)²/g(x) dx
     = ∫ x² (σ/√(2π)) exp{ −x² + x²/(2σ²) } dx
     = σ ∫ x² (1/√(2π)) exp{ −(x²/2)(2 − 1/σ²) } dx
     = (σ/√(2 − σ^{−2})) ∫ x² √((2 − σ^{−2})/(2π)) exp{ −(x²/2)(2 − σ^{−2}) } dx
     = (σ/√(2 − σ^{−2})) · (2 − σ^{−2})^{−1}
     = σ/(2 − σ^{−2})^{3/2}   if σ² > 1/2 ,

where in the second-to-last step the integrand is x² times the density of a N(0, (2 − σ^{−2})^{−1}), which is a valid density only if σ² > 1/2;

if σ² ≤ 1/2, the integral diverges and the variance is infinite. Minimizing the variance,

arg min_{σ² > 1/2} σ/(2 − σ^{−2})^{3/2}   is attained at σ = √2 .

Thus the optimal proposal within this family has standard deviation σ = √2 (variance σ² = 2), not 1! At σ² = 2, the value of σ_g² is ≈ 0.7698, which is less than 1, the corresponding variance for simple Monte Carlo. ■
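A small R check of this calculation, a sketch with an assumed sample size, comparing the estimated σ_g² for the proposals N(0, 1) and N(0, 2):

set.seed(1)
N <- 1e5
is.var <- function(s2) {                  # estimated sigma_g^2 for proposal N(0, s2)
  z <- rnorm(N, mean = 0, sd = sqrt(s2))
  var(z * dnorm(z) / dnorm(z, mean = 0, sd = sqrt(s2)))
}
is.var(1)    # proposal = target: close to 1
is.var(2)    # optimal within this family: close to 0.7698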

5.2.2 Questions to think about

• Does this mean that N (0, 2) is the optimal proposal for estimating the mean of
a standard normal?

• What is the optimal proposal within the class of Beta proposals for estimating
the mean of a Beta distribution?

5.3 Weighted Importance Sampling


Often for many distributions, we do not know the target distribution fully, but only
know it up to a normalizing constant. That is, for some unknown a, the target density
is
f (x) = af˜(x)

and for some known or unknown b, the proposal density is

g(x) = bg̃(x)

For simplicity (or rather uniformity in complexity), we will assume that b is unknown.
Even though f is not fully known, we are interested in expectations under f . Suppose
for some function h, the following integral is of interest:
θ := ∫_X h(x) f(x) dx .

We still want to use g as the importance distribution, so that

θ = ∫_X [h(x) f(x)/g(x)] g(x) dx .

Since a and b are unknown, we can’t evaluate f(x) and g(x). So our original estimator does not work anymore! We can, however, evaluate f̃(x) and g̃(x). So if we can also estimate a and b, that will allow us to estimate θ. Instead, we will estimate b/a, which also works!
Consider Z_1, . . . , Z_N iid from G. The weighted importance sampling estimator of θ is

θ̂_w := [ Σ_{t=1}^N h(Z_t) f̃(Z_t)/g̃(Z_t) ] / [ Σ_{t=1}^N f̃(Z_t)/g̃(Z_t) ] .

Theorem 9. The weighted importance sampling estimator is consistent. That is, as N → ∞, θ̂_w →^p θ.

Proof. Intuitively, both the numerator and the denominator are averages, so the LLN applies to each individually; we may then use Slutsky’s theorem. This is the plan.

As a first step, note that as N → ∞, by the law of large numbers,

(1/N) Σ_{t=1}^N h(Z_t) f̃(Z_t)/g̃(Z_t) →^p E_G[ h(Z) f̃(Z)/g̃(Z) ]   and   (1/N) Σ_{t=1}^N f̃(Z_t)/g̃(Z_t) →^p E_G[ f̃(Z)/g̃(Z) ] .

By an application of Slutsky’s theorem, as N → ∞,

θ̂_w →^p E_G[ h(Z) f̃(Z)/g̃(Z) ] / E_G[ f̃(Z)/g̃(Z) ] .

We need to show that this limit equals θ, so we find both expectations. First,

E_G[ h(Z) f̃(Z)/g̃(Z) ] = ∫_X [h(z) f̃(z)/g̃(z)] g(z) dz = (b/a) ∫_X [h(z) f(z)/g(z)] g(z) dz = (b/a) θ .

Second,

E_G[ f̃(Z)/g̃(Z) ] = ∫_X [f̃(z)/g̃(z)] g(z) dz = (b/a) ∫_X f(z) dz = b/a .

So,

E_G[ h(Z) f̃(Z)/g̃(Z) ] / E_G[ f̃(Z)/g̃(Z) ] = (b/a) θ / (b/a) = θ .

We will denote

w(Z) = f̃(Z)/g̃(Z) .

Then w(Z) is called the un-normalized importance sampling weight.

Thus, even though we do not know a and b, the weighted importance sampling esti-
mator converges to the right quantity and is consistent. However, not knowing b and
a comes at a cost. Unlike the simple importance sampling estimator, the weighted
importance sampling estimator θ̂w is not unbiased.

To see this heuristically, let’s assume (just for theory, we actually don’t do this) that we obtained two different samples from G for the numerator and the denominator. That is, assume that Z_1, . . . , Z_N iid ∼ G and T_1, . . . , T_N iid ∼ G. This allows us to break the expectation as follows:

E_G[ Σ_{t=1}^N h(Z_t) w(Z_t) / Σ_{t=1}^N w(T_t) ] = E_G[ Σ_{t=1}^N h(Z_t) w(Z_t) ] E_G[ 1/Σ_{t=1}^N w(T_t) ]
                                                 ≥ E_G[ Σ_{t=1}^N h(Z_t) w(Z_t) ] · 1/E_G[ Σ_{t=1}^N w(T_t) ]    (by Jensen’s inequality)
                                                 = θ ,

where equality holds only when the denominator is constant. Thus, even if we have two independent samples for the numerator and denominator, we will still not obtain an unbiased estimator. This is then certainly true when we use the same sample as well. However, the proof of this is complex, and outside the scope of the course.

Further, an exact expression for the variance is difficult to obtain. Even if the numerator and denominator are assumed to come from independent samples, we may, after a series of approximations (omitted), obtain:

Var[θ̂_w] ≈ [ Var_G(h(Z)w(Z)) / (E_G[w(Z)])² ] ( 1 + Var_G(w(Z))/(E_G[w(Z)])² )
         = [ Var_G(h(Z)w(Z)) / (b²/a²) ] ( 1 + Var_G(w(Z))/(E_G[w(Z)])² )
         = Var_G( a h(Z) w(Z)/b ) ( 1 + Var_G(w(Z))/(E_G[w(Z)])² )
         = Var_G( h(Z) f(Z)/g(Z) ) ( 1 + Var_G(w(Z))/(E_G[w(Z)])² )
         = σ_g² ( 1 + Var_G(w(Z))/(E_G[w(Z)])² )
         ≥ σ_g² ,

where recall that σg2 is the variance coming from the simple importance sampling es-
timator. This seems to indicate that weighted importance sampling is always worse
than simple importance sampling, but in reality this is not true. Here we assumed two
independent samples in the approximation above. Since the samples from the de-
nominator and numerator are the same, it is sometimes possible for the numerator
and denominator to be negatively correlated, so that weighted importance sampling is
better!

NOTE: Unlike simple importance sampling, the asymptotic normality of the weighted importance sampling estimator is challenging to verify, and the variance may not always be finite. Moreover, estimating the variance Var[θ̂_w] is no longer straightforward here!

Example 23. Consider estimating

θ = ∫₀^π ∫₀^π x y π(x, y) dx dy ,

where

π(x, y) ∝ e^{sin(xy)} ,   0 ≤ x, y ≤ π .

Notice that here, the target distribution is bivariate, but the function h(x, y) = xy is a
univariate mapping. Further, the target distribution is not a product of two marginals,
so we have to implement multivariate importance sampling. Also, we do not know the
normalizing constants. So for some unknown a > 0,

π(x, y) = a e^{sin(xy)} ,   0 ≤ x, y ≤ π .

We will use weighted importance sampling. Consider the importance distribution that is uniform on the box: U[0, π] × U[0, π], so that

g(z, t) = (1/π²) I(0 < z < π) I(0 < t < π) .

Since we assume b is unknown, we work with

g̃(z, t) = 1 · I(0 < z < π) I(0 < t < π) .

Sample (Z_1, T_1), . . . , (Z_N, T_N) ∼ U[0, π] × U[0, π]. The weights are

w(Z_t, T_t) = f̃(Z_t, T_t)/g̃(Z_t, T_t) = e^{sin(Z_t T_t)} .

The final estimator is

θ̂_w = Σ_{t=1}^N Z_t T_t w(Z_t, T_t) / Σ_{t=1}^N w(Z_t, T_t) = Σ_{t=1}^N Z_t T_t e^{sin(Z_t T_t)} / Σ_{t=1}^N e^{sin(Z_t T_t)} .
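A minimal R sketch of this estimator (the sample size is an assumption):

set.seed(1)
N <- 1e5
x <- runif(N, 0, pi)
y <- runif(N, 0, pi)
w <- exp(sin(x * y))               # un-normalized weights f~/g~
sum(x * y * w) / sum(w)            # weighted importance sampling estimate of theta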

5.4 Questions to think about
• How would you choose a good proposal for weighted importance sampling? Would
finding a proposal that yields a small variance suffice?

• Do you have intuition as to why the variance of the weighted importance sampling estimator is often larger than the variance of the simple importance sampling estimator for the same importance proposal?

5.5 Exercises
Still working on below. Will add some more exercises by 8th Feb.
1. Estimate ∫₀¹ e^x dx using importance sampling.

2. Estimate ∫_{−∞}^{∞} e^{−x²/2} dx using importance sampling.

3. The inverse Gaussian distribution has density

   f(x) = √( λ/(2πx³) ) exp{ −λ(x − µ)²/(2µ²x) } ,   x > 0 and µ, λ > 0 .

We are interested in estimating the moment generating function of this distribu-


tion, EF [etX ] for t ∈ R. Sampling from this Inverse Gaussian distribution can be
quite challenging, so we will use importance sampling instead.

For µ = 1 and λ = 3, using importance sampling with importance distribution


Gamma(10, 3) (rate = 3), make a plot with t on the x-axis and the importance
sampling estimate of EF [etX ] on the y-axis.

NOTE: First, use the same Z_1, Z_2, . . . , Z_N points for all chosen values of t, and then try using different importance samples for different values of t.

4. Consider the problem of estimating the kth moment of a Beta(α, β) distribution. For which values of α and β are we sure to obtain an importance sampling estimator of the kth moment with finite variance when using a uniform proposal distribution?

5. In the previous problem, give other examples of importance proposal distributions that will give a finite-variance importance sampling estimator. For what values of α and β is finite variance of the estimator not guaranteed?

6. Consider estimating the mean of a standard Cauchy distribution using impor-


tance sampling with a normal proposal distribution. Does the estimator have

finite variance?

7. For estimating kth moment of a Gamma(α, β) with α > 1 with the importance
distribution Exp(λ), show that the importance sampling estimator has infinite
variance when λ > β.

8. Consider a density f(x) and an importance proposal g(x). Suppose

   sup_x f(x)/g(x) < ∞ .

In order to estimate the mean of the target density, is there any benefit to using
importance sampling over accept-reject sampling?

9. For a target distribution f and a proposal g, if

   sup_x f(x)/g(x) < ∞ ,

then we know that the simple importance estimator has finite variance. Does the
weighted importance estimator also have finite variance?

10. For some known y_i ∈ R, i = 1, . . . , n, and some ν > 2, suppose the target density is

    f(x) ∝ e^{−x²/2} Π_{i=1}^n ( 1 + (y_i − x)²/ν )^{−(ν+1)/2} .

Generate ys using the following code for ν = 5

set.seed(1)
n <- 50
nu <- 5
y <- rt(n, df = nu)

Implement an importance sampling estimator with a N (0, 1) proposal to estimate


the first moment of this distribution. Does the weighted importance sampling
estimator seem to have finite variance? What happens if ν = 1 and ν = 2?

11. Suppose interest is in estimating


θ = ∫₀^{10} exp{ −2|x − 5| } dx .

• What is the optimal simple importance proposal distribution (hint: look up the Laplace distribution), and what is the corresponding simple importance sampling estimator?

• Implement a weighted importance sampling procedure with the same pro-


posal distribution from above. How do the final estimators compare?

The quantity θ can be written as

θ = ∫₀^{10} 10 exp{ −2|x − 5| } f(x) dx ,

where f(x) is the density of a U[0, 10]. We know we can do IID sampling, that is, sample X_1, X_2, . . . , X_N from Unif[0, 10] and estimate θ. But this is simple Monte Carlo and not importance sampling. Using importance sampling, we can reduce the variance. First, note that the optimal importance distribution here is

g*(z) = 10 exp{ −2|z − 5| } (1/10) / θ ,   z ∈ (0, 10) .

The function h(z)f(z) is proportional to the density of a Laplace (double exponential) distribution. For a Laplace random variable with parameters µ and b,

l(z) = (1/(2b)) exp{ −|z − µ|/b } ,   −∞ < z < ∞ .

So g*(z) is the density of a Laplace(5, 1/2), truncated between 0 and 10. That is, the optimal proposal is a Laplace(5, 1/2) truncated to (0, 10).

You can simulate Z_1, Z_2, . . . , Z_N from this g* using an accept-reject algorithm (recall previous exercises), and then the optimal importance sampling estimator of θ is

θ̂_{g*} = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g*(Z_t) = (1/N) Σ_{t=1}^N [ 10 exp{−2|Z_t − 5|} (1/10) ] / [ 10 exp{−2|Z_t − 5|} (1/10)/θ ] = θ !

So notice that for the optimal proposal, the IS estimator is exactly the quantity we want! Thus it is impossible to implement the simple importance sampler here.

However, we can use the weighted importance sampling estimator, since the normalizing constant of g* (which is θ) is unknown. So we have

g*(z) = (1/θ) exp{ −2|z − 5| } ,   z ∈ (0, 10) ,

with b = 1/θ and g̃*(z) = exp{ −2|z − 5| }. We will assume f̃ = f, that is, a = 1. Then

θ̂_w = [ Σ_{t=1}^N h(Z_t) f̃(Z_t)/g̃*(Z_t) ] / [ Σ_{t=1}^N f̃(Z_t)/g̃*(Z_t) ]
    = [ Σ_{t=1}^N exp{−2|Z_t − 5|}/exp{−2|Z_t − 5|} ] / [ Σ_{t=1}^N 10^{−1}/exp{−2|Z_t − 5|} ]
    = 10 N / Σ_{t=1}^N exp{ 2|Z_t − 5| } .
6 Likelihood Based Estimation
We have learned a fair amount about sampling from various distributions and estimating integrals. For the next few weeks we will turn our attention to optimization methods for certain statistical procedures.

Before we study optimization, we want to motivate why exactly optimization is useful


to statisticians. One common use of optimization in statistics is when obtaining a
maximum likelihood estimator (MLE) for a parameter. Thus, we first introduce MLE
below briefly, before going into optimization methods.

6.1 Likelihood Function


Suppose X1 , X2 , . . . , Xn is a random sample from a distribution with density f (x|θ) for
θ ∈ Θ, where Θ is the parameter space. The “x given θ” implies that given a particular
value of θ, f (·|θ) defines a density. f (·|θ) is also written sometimes as fθ (x).

The parameter θ can be a vector of parameters. After having obtained real data, from
F , we want to

1. estimate θ

2. and assess the quality of the estimator of θ.

A useful method of estimating θ is the method of maximum likelihood estimation. Let


X = (X1 , . . . , Xn ). The idea is that we define a function L(θ|X) which measures “how
likely is a particular value of θ given the data observed”. In general L(θ|X = x) is the
joint distribution of all the Xs

L(θ|X = x) = f (x|θ) = f (x1 , x2 , . . . , xn |θ) .

When the sample is independent as well, this likelihood becomes

L(θ|X) = Π_{i=1}^n f(x_i|θ) .

It is important to note that L(θ|x) is not a distribution over θ, it is just a function of


θ. It is a function that quantifies how likely a value of θ is.

iid
Example 24. Suppose we obtain X1 , X2 ∼ N (θ, 1), where we don’t know θ. Suppose

65
we obtain X1 = 2 =: x1 and X2 = 3 =: x2 . Then the likelihood function is:

L(θ|x1 , x2 ) = f (x1 |θ) · f (x2 |θ)


= f (2|θ)f (5|θ)
(2 − θ)2 (3 − θ)2
 
1
= exp − − .
2π 2 2

We can plot the above function of θ for different values of θ to understand what the
likelihood of every value of θ is. ■

6.2 Maximum Likelihood Estimation


The “most likely” value of θ having observed the data is the value that maximizes the
likelihood
θ̂_MLE = arg max_{θ∈Θ} L(θ|x) .

θ̂MLE is called the maximum likelihood estimator of θ. As you can understand, max-
imizing the function L(θ|x) may be complex for some problems and not possible to
do analytically. This is where we require numerical optimization methods to help us
obtain these MLEs. MLEs have nice theoretical properties and you will learn them in
MTH211a or MTH418a.

Before we continue with some examples, we recall a few definitions:

Definition 1. Concave function (one dimension): a function h(x) is concave if h″(x) ≤ 0 for all x. If the equality never holds, it is strictly concave.

Definition 2. Concave function: a function h(x) is concave if the Hessian of the function, ∇²h(x), is negative semi-definite for all x. That is, if all eigenvalues of the Hessian are non-positive.

6.2.1 Examples
Example 25 (Bernoulli). Let X_1, . . . , X_n be iid Bern(p), 0 ≤ p ≤ 1. Then the likelihood
is
L(p|x) = Π_{i=1}^n Pr(X_i = x_i|p) = Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i} = p^{Σ x_i} (1 − p)^{n − Σ x_i} .

To obtain the MLE of p, we will maximize the likelihood. Note that maximizing the likelihood is the same as maximizing the log of the likelihood, and the calculations are easier after taking a log:

l(p) := log L(p|x) = ( Σᵢ x_i ) log p + ( n − Σᵢ x_i ) log(1 − p)

⇒ dl(p)/dp = ( Σᵢ x_i )/p − ( n − Σᵢ x_i )/(1 − p)  =set  0

⇒ p̂ = (1/n) Σ_{i=1}^n x_i .

Taking the second derivative, we obtain

d²l(p)/dp² = −( Σᵢ x_i )/p² − ( n − Σᵢ x_i )/(1 − p)² < 0   for all p .

Thus, the likelihood is concave, and p̂ is a global maximum.

In any example, if I do not check the second derivative, you HAVE to check it for yourself.

Thus,

p̂_MLE = (1/n) Σ_{i=1}^n x_i .

Example 26 (Two parameter exponential). The density of a two parameter exponen-


tial distribution is

f (x|µ, λ) = λe−λ(x−µ) x ≥ µ, µ ∈ R, λ > 0 .

We want to compute the MLEs of both λ and µ. The likelihood is

L(λ, µ|x) = Π_{i=1}^n f(x_i|µ, λ)
          = Π_{i=1}^n λ e^{−λ(x_i − µ)} I(x_i ≥ µ)
          = λ^n exp{ −λ( Σ_{i=1}^n x_i − nµ ) } I(x_1, . . . , x_n ≥ µ)   ∀µ and λ > 0 .

But X_1, . . . , X_n ≥ µ ⇔ min_i{X_i} ≥ µ. So

L(λ, µ|x) = λ^n exp{ −λ( Σᵢ x_i − nµ ) } I( min_i{x_i} ≥ µ )   ∀µ and λ > 0 .

We will first maximize with respect to µ and then with respect to λ. Note that L(λ, µ|x) is an increasing function of µ within the restriction, so the MLE of µ is the largest value of µ satisfying µ ≤ min_i{X_i}. So

µ̂_MLE = min_{1≤i≤n}{X_i} =: X_{(1)} .

Next, note that

L(X_{(1)}, λ|x) = λ^n exp{ −λ( Σᵢ X_i − nX_{(1)} ) }

⇒ l(X_{(1)}, λ) := log L(X_{(1)}, λ|x) = n log λ − λ( Σᵢ X_i − nX_{(1)} )

⇒ dl/dλ = n/λ − ( Σᵢ X_i − nX_{(1)} )  =set  0 ,   and

d²l/dλ² = −n/λ² < 0 .

So, the log-likelihood function is concave in λ, and thus there is a unique maximum. Setting dl/dλ = 0,

n/λ = Σ_{i=1}^n X_i − nX_{(1)}
⇒ λ̂_MLE = n / ( Σᵢ X_i − nX_{(1)} ) .

6.3 Regression
We will focus a lot on variants of linear regression and thus it is important to setup
the premise of linear regression.

Let Y_1, Y_2, . . . , Y_n be observations known as the response. Let x_i = (x_{i1}, . . . , x_{ip})ᵀ ∈ Rᵖ be the corresponding vector of covariates for the ith observation. Let β ∈ Rᵖ be the regression coefficient so that for σ² > 0,

Yi = xTi β + ϵi where ϵi ∼ N (0, σ 2 ) .

Let X = (xT1 , xT2 , . . . , xTn )T . In vector form we have,

Y = Xβ + ϵ ∼ Nn (Xβ, σ 2 In ) .

Note: I use capital Y to denote the population random variable and will use the small
y to denote realized observations.

The linear regression model is built to estimate β, which measures the linear effect of
X on Y . There is much more to linear regression and multiple courses are required to
study all aspects of it. However, here we will just focus on the mathematical properties
and optimization tools required to study them.

Example 27 (MLE for Linear Regression). In order to understand the linear relation-
ship between X and β, we will need to estimate β. Since we assume that the errors
are normally distributed, we have a distribution available for Y s and we may use the
method of MLE. We have
L(β, σ²|y) = Π_{i=1}^n f(y_i|X, β, σ²) = ( 1/√(2πσ²) )^n exp{ −(y − Xβ)ᵀ(y − Xβ)/(2σ²) }

⇒ l(β, σ²) := log L(β, σ²|y) = −(n/2) log(2π) − (n/2) log(σ²) − (y − Xβ)ᵀ(y − Xβ)/(2σ²) .

Note that

(y − Xβ)ᵀ(y − Xβ) = (yᵀ − βᵀXᵀ)(y − Xβ)
                  = yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ
                  = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ .

Using this we have (recall your multivariable calculus courses)

dl/dβ = −(1/(2σ²))( −2Xᵀy + 2XᵀXβ ) = (Xᵀy − XᵀXβ)/σ²  =set  0

dl/dσ² = −n/(2σ²) + (y − Xβ)ᵀ(y − Xβ)/(2σ⁴)  =set  0 .

The first equation leads to β̂_MLE satisfying

Xᵀy − XᵀX β̂_MLE = 0 ⇒ β̂_MLE = (XᵀX)^{−1} Xᵀy ,

if (XᵀX)^{−1} exists. And σ̂²_MLE is

σ̂²_MLE = (y − X β̂_MLE)ᵀ(y − X β̂_MLE)/n .

Verify: that the Hessian matrix is negative definite, and thus the objective function is concave.

Note: What if (X T X)−1 does not exist?

For example, if p > n, then the number of observations is less than the number of
parameters, and since X is n × p, (X T X) is p × p of rank n < p. So X T X is not full
rank and cannot be inverted. In this case, the MLE does not exist and other estimators
need to be constructed. This is one of the motivations of penalized regression, which
we will discuss in detail. ■
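As a quick illustration, here is a hedged R sketch of these closed-form MLEs on simulated data (the sizes and the true coefficients below are assumptions chosen for illustration):

set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), nrow = n, ncol = p-1))   # design with intercept
beta.true <- c(1, -2, 0.5)
y <- X %*% beta.true + rnorm(n)
beta.mle <- solve(t(X) %*% X, t(X) %*% y)        # (X^T X)^{-1} X^T y
sigma2.mle <- sum((y - X %*% beta.mle)^2) / n
beta.mle; sigma2.mle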

6.4 Penalized Regression


Note that in the Linear regression setup, the MLE for β satisfied:

β̂MLE = arg min(y − Xβ)T (y − Xβ)


β

Suppose X is such that (XᵀX) is not invertible; then we don’t know how to estimate β. In such cases, we may use a penalized likelihood, which penalizes the coefficients β so that some of the βs are “pushed towards zero”. The Xs corresponding to those small βs are essentially unimportant, removing the singularity from XᵀX.

Instead of looking at the likelihood, we consider a penalized likelihood. Since the optimization of L(β|y) depends only on the (y − Xβ)ᵀ(y − Xβ) term, a penalized (negative) log-likelihood is used, and the final penalized (negative) log-likelihood is

Q(β) = − log L(β|y) + P (β)

Here P (β) is a penalization function. Note that since we are now looking at the negative
log-likelihood, we now want to minimize Q(β). The penalization function assigns large
values for large β, so that the optimization problem favors small values of β.

There are many ways of penalizing β and each method yields a different estimator. A
popular one is the ridge penalty.

Example 28 (Ridge Regression). The ridge penalization term is P(β) = λβᵀβ/2 for some λ > 0, giving

Q(β) = (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ .

We will minimize Q(β) over the space of β and since we are adding an arbitrary term
that depends on the size of β, smaller sizes of β will be preferred. Small sizes of β
means X are less important, and this will eventually nullify the singularity in X T X.
The larger λ is, the more “penalization” there is for large values of β; λ is typically
user-chosen. We will study choosing λ when we cover “cross-validation” later.

We are now interested in finding:

β̂ = arg min_β { (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ } .

To carry out the minimization, we take the derivative:

dQ(β)/dβ = (1/2)( −2Xᵀy + 2XᵀXβ ) + λβ  =set  0
⇒ (XᵀX + λI_p) β̂ − Xᵀy = 0
⇒ β̂_ridge = (XᵀX + λI_p)^{−1} Xᵀy .

Note: Verify that the Hessian matrix is positive definite for yourself.

Note that (XᵀX + λI_p) is always positive definite for λ > 0, since for any a ∈ Rᵖ with a ≠ 0,

aᵀ(XᵀX + λI_p)a = aᵀXᵀXa + λaᵀa > 0 .

Thus, the final ridge solution always exists, even if XᵀX is not invertible.

Pros: We have an estimate of β! Moreover, in terms of a certain criterion (we will learn this later), we actually do better than non-penalized estimation even when (XᵀX) is invertible.

Cons: The estimator is not an MLE, so we cannot use distributional properties to construct confidence intervals. This is a big problem, and is addressed by bootstrapping, which we will get to.

We will study one more penalization method later. ■
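A short R sketch of the ridge solution, reusing X and y from the regression sketch above (the value of λ here is an arbitrary assumption):

lambda <- 1
beta.ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
beta.ridge   # compare with beta.mle: coefficients are shrunk towards zero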

Questions to think about

1. Under the normal likelihood, what is the distribution of β̂MLE (when it exists)
and β̂ridge ? Are they unbiased? Which one has a smaller variance (covariance)?

2. What other penalization functions can you think of? Recall that βᵀβ = ∥β∥₂².

6.5 No closed-form MLEs


Obtaining MLE estimates for a problem requires maximizing the likelihood. However,
it is possible that no analytical form of the maxima is possible!

This is a common challenge in many models and estimation problems, and requires
sophisticated optimization tools. In the next few weeks, we will go over some of these
optimization methods.

Example 29 (Gamma Distribution). Let X_1, . . . , X_n be iid Gamma(α, 1). The likelihood
function is
L(α|x) = Π_{i=1}^n (1/Γ(α)) x_i^{α−1} e^{−x_i} = (1/Γ(α)^n) e^{−Σ x_i} Π_{i=1}^n x_i^{α−1}

⇒ l(α) := log L(α|x) = −n log(Γ(α)) + (α − 1) Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i .

Taking the first derivative,

dl(α)/dα = −n Γ′(α)/Γ(α) + Σ_{i=1}^n log x_i  =set  0 .

Solving the above analytically is not possible. In fact, the form of Γ′(α)/Γ(α) is challenging to write analytically.

However, taking the second derivative,

d²l(α)/dα² = −n (d²/dα²) log(Γ(α)) < 0 ,

since (d²/dα²) log(Γ(α)) is the polygamma function of order 1, which is always positive (look it up). So we know that the function is concave and a unique maximum exists, but it is not available in closed form.

We cannot get an analytical form of the MLE for α. In such cases, we will use opti-
mization methods. ■

6.6 Exercises
1. Simple linear regression: Load the cars dataset in R:

data(cars)

Fit a linear regression model using maximum likelihood with response y being
the distance and x being speed. Remember to include an intercept term in X by
making the first column as a column of 1s. Do not use inbuilt functions in R to
fit the model.

2. Multiple linear regression: Load the fuel2001 dataset in R:

fuel2001 <- read.csv("https://dvats.github.io/assets/fuel2001.csv",
                     row.names = 1)

Fit the linear regression model using maximum likelihood with response FuelC.

Remember to include an intercept in X.

3. Simulating data in R:

Let X ∈ Rn×p be the design matrix, where all entries in its first column equal
one (to form an intercept). Let xi,j be the (i, j)th element of X. For the ith
case, xi1 = 1 and xi2 , . . . , xip are the values of the p − 1 predictors. Let yi be the
response for the ith case and define y = (y1 , . . . , yn )T . The model assumes that
y is a realization of the random vector

Y ∼ Nn (Xβ∗ , σ∗2 In ) ,

where β∗ ∈ Rp are unknown regression coefficients and σ∗2 > 0 is the unknown
variance.

For our simulation, let’s pick n = 50, p = 5, σ 2 = 1/2 and generate the entries of
β∗ as p independent draws from N (0, 1):

set.seed(1)
n <- 50
p <- 5
sigma2.star <- 1/2
beta.star <- rnorm(p)
beta.star # to output
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078

We will create the design matrix X ∈ Rn×p , so that xi1 = 1 and the other entries
are from N (0, 1).

X <- cbind(1, matrix(rnorm(n*(p-1)), nrow = n, ncol = (p-1)))

Now we will generate a realization of Y ∼ Nn (Xβ∗ , σ∗2 In ):

y <- X %*% beta.star + rnorm(n, mean = 0, sd = sqrt(sigma2.star))

In this way we have generated simulated data to be used in regression.

4. Find the MLE estimator of β and σ 2 from the previous dataset. Is it close to β∗
and σ∗2 ? Find the ridge regression solution with λ = 0.01, 1, 10, 100.

5. Regression: an equivalent optimization

In our original setup X ∈ Rn×p , all entries in its first column equal to one to

form an intercept. The MLE estimate (when it exists) is

β̂ = arg min_{β∈Rᵖ} (y − Xβ)ᵀ(y − Xβ) .

Let X_{−1} be the matrix X with its first column removed. Let ȳ = n^{−1} Σ_{i=1}^n y_i and x̄ᵀ = n^{−1} 1_nᵀ X_{−1} = ( n^{−1} Σᵢ x_{i2}, . . . , n^{−1} Σᵢ x_{ip} ). Let ỹ = (y_1 − ȳ, . . . , y_n − ȳ)ᵀ and X̃ = X_{−1} − 1_n x̄ᵀ. Then ỹ is the centered response and X̃ is the centered design matrix.

Suppose that

β̂_{−1} = arg min_{β̃ ∈ R^{p−1}} (ỹ − X̃ β̃)ᵀ(ỹ − X̃ β̃)

β̂_1 = ȳ − x̄ᵀ β̂_{−1} .

Then (β̂_1, β̂_{−1}ᵀ)ᵀ is equivalent to β̂ above. Verify this for the dataset generated in Exercise 3.

6. Logistic Regression: Often in regression, the response may be a 0 or 1. That is,


the response is a Bernoulli random variable. Let the covariate vector for the ith
observation be xi = (1, xi2 , . . . , xip )T . Suppose yi is a realization of Yi where

Y_i ∼ Bern( exp(x_iᵀβ) / (1 + exp(x_iᵀβ)) ) .

Find the MLE of β∗ . Does a closed form solution exist?

7 Numerical optimization methods
When the true optimizer is not available analytically, we may use numerical methods. A general numerical optimization problem is framed in the following way. Let f(θ) be the objective function, the main function of interest, which needs to be either maximized or minimized. Then, we want to solve the maximization

θ* = arg max_θ f(θ) .

The algorithms we will learn generate a sequence {θ^{(k)}} with the goal that θ^{(k)} → θ* in a deterministic manner (non-random convergence).

All the methods we will learn find a local optimum. Some guarantee a local maximum but not a global maximum; others guarantee only a local optimum (a max or a min). If the objective function is concave, then all methods will find the global maximum!

Recall that a (univariate) function f is concave if f″ < 0, and a (multivariate) function f is concave if its Hessian is negative definite: for all a ≠ 0 ∈ Rᵖ,

aᵀ ∇²f a < 0 .

7.1 Newton-Raphson’s method


To solve this optimization problem, consider starting at a point θ(0) . Then subsequent
elements of the sequence are determined in the following way.

Suppose that the objective function f is such that a second derivative exists. Since f(θ) is maximized at the unknown θ*,

f′(θ*) = 0 .

Applying a first-order Taylor series expansion of f′(θ*) about the current iterate θ^{(k)} (and ignoring higher-order terms),

f′(θ*) ≈ f′(θ^{(k)}) + (θ* − θ^{(k)}) f″(θ^{(k)}) = 0
⇒ θ* ≈ θ^{(k)} − f′(θ^{(k)})/f″(θ^{(k)}) ,

where the approximation is best when θ(k) = θ∗ and the approximation is weak when
θ(k) is far from θ∗ . Thus, if we start from an arbitrary point using successive updates
of the right hand side, we will get closer and closer to θ∗ .

Using the above argument, the Newton-Raphson method was constructed:

θ^{(k+1)} = θ^{(k)} − f′(θ^{(k)})/f″(θ^{(k)}) .

Intuitively, when f′(θ^{(k)}) > 0, the function is increasing at θ^{(k)}, and (as long as f″(θ^{(k)}) < 0) Newton-Raphson increases θ^{(k)}; and vice versa.

You stop iterating when |θ(k+1) −θ(k) | < ϵ for some chosen tolerance ϵ or |f ′ (θ(k+1) )| ≈ 0.

If the objective function is concave, the N-R method will converge to the global maximum. Otherwise it converges to a local optimum or diverges!

Example 30 (Gamma distribution continued). Our objective function is the log-likelihood:

f(α) = −n log(Γ(α)) + (α − 1) Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i .

First derivative:

f′(α) = −n Γ′(α)/Γ(α) + Σ_{i=1}^n log X_i .

Second derivative:

f″(α) = −n (d²/dα²) log(Γ(α)) < 0 .

Thus the log-likelihood is concave, which implies there is a global maximum! The Newton-Raphson algorithm will converge to this global maximum.

Start with a reasonable starting value α_0. Then iterate with

α^{(k+1)} = α^{(k)} − f′(α^{(k)})/f″(α^{(k)}) .

Polygamma functions are available via the psi function in the pracma R package. What is a good starting value α_0? Well, we know that the mean of a Gamma(α, 1) is α, so a good starting value is α_0 = n^{−1} Σ_{i=1}^n X_i.

Note the impact of the sample size of the observed data. The Newton-Raphson algorithm converges to the MLE, but if the data size is small, the MLE itself may not be close to the truth (in the in-class plots, this is why the estimated and true values are far from each other). When we increase the observed data to 1000 observations, the consistency of the MLE kicks in and we expect the estimate and the truth to be close.
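A hedged R sketch of this Newton-Raphson iteration (the data and tolerance are assumptions; base R's digamma and trigamma give Γ′(α)/Γ(α) and its derivative, so pracma is not strictly needed):

set.seed(1)
x <- rgamma(100, shape = 4, rate = 1)     # observed data (assumed)
alpha <- mean(x)                          # starting value
tol <- 1e-8
repeat {
  fp  <- -length(x) * digamma(alpha) + sum(log(x))   # f'(alpha)
  fpp <- -length(x) * trigamma(alpha)                # f''(alpha) < 0
  if (abs(fp) < tol) break
  alpha <- alpha - fp/fpp                            # Newton-Raphson update
}
alpha   # MLE of alpha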

Example 31 (Location Cauchy distribution). Consider the location Cauchy distribution with mode at µ ∈ R. The goal is to find the MLE for µ. The density is

f(x|µ) = (1/π) · 1/(1 + (x − µ)²) .

First, we find the log-likelihood:

L(µ|X) = Π_{i=1}^n f(X_i|µ) = π^{−n} Π_{i=1}^n 1/(1 + (X_i − µ)²)

⇒ l(µ) := log L(µ|X) = −n log π − Σ_{i=1}^n log(1 + (X_i − µ)²) .

It is evident that a closed-form solution is difficult. So, we find the derivatives:

l′(µ) = 2 Σ_{i=1}^n (X_i − µ)/(1 + (X_i − µ)²) ,

l″(µ) = 2 Σ_{i=1}^n [ 2(X_i − µ)²/(1 + (X_i − µ)²)² − 1/(1 + (X_i − µ)²) ] ,

which may be positive or negative. So this is not a concave function. This implies that Newton-Raphson is not guaranteed to converge to the global maximum! It may not even converge to a local maximum, and may converge to a local minimum. Thus, we will need to be careful in choosing starting values.

1. Set µ_0 = Median(X_i), since the mean of the Cauchy does not exist and the Cauchy centered at µ is symmetric around µ.

2. Determine:

   µ^{(k+1)} = µ^{(k)} − l′(µ^{(k)})/l″(µ^{(k)}) .

3. Stop when |l′(µ^{(k+1)})| < ϵ for a chosen tolerance level ϵ.


Newton-Raphson in Higher Dimensions

The NR method can be derived in the same way using the multivariate Taylor expansion. Let θ = (θ_1, θ_2, . . . , θ_p). Let ∇f denote the gradient of f and ∇²f the Hessian:

∇f = ( ∂f/∂θ_1, . . . , ∂f/∂θ_p )ᵀ   and   ∇²f is the p × p matrix with (i, j)th entry ∂²f/∂θ_i ∂θ_j .

Then, the function f is concave if ∇2 f is negative definite. So always check that first
to know if there is a unique maximum.

Using a similar multivariate Taylor series expansion, the Newton-Raphson update solves the system of linear equations

∇f(θ^{(k)}) + ∇²f(θ^{(k)}) (θ^{(k+1)} − θ^{(k)}) = 0 .

If ∇²f(θ^{(k)}) is invertible, then

θ^{(k+1)} = θ^{(k)} − [∇²f(θ^{(k)})]^{−1} ∇f(θ^{(k)}) .

Iterations are stopped with when ∥∇f (θ(k+1) )∥ < ϵ for some chosen tolerance level, ϵ.

Example 32 (Ridge regression). In the ridge regression problem, we have an analytical solution available. But suppose we did not. In that case, our objective function to minimize is

Q(β) = (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ .

We are now interested in finding:

β̂ = arg min_β { (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ } .

Recall that we obtained

∇Q(β) = (X T X + λIp )β − X T y

and the Hessian was

∇2 Q(β) = (X T X + λIp )

So the iterates of Newton-Raphson are then

β_{(k+1)} = β_{(k)} − (XᵀX + λI_p)^{−1} ( (XᵀX + λI_p) β_{(k)} − Xᵀy )
          = β_{(k)} − β_{(k)} + (XᵀX + λI_p)^{−1} Xᵀy
          = (XᵀX + λI_p)^{−1} Xᵀy .

This simplification gives us the actual right solution! So one NR step will lead us to
the analytical solution in this case! ■

Still working on this

Questions

1. Can you implement the Newton-Raphson procedure for linear regression and ridge regression?

2. What are some of the issues in implementing Newton-Raphson? Can we use it


for any problem?

3. If the function is not concave and different starting values yield convergence to
different points (or divergence), then what do we do?

7.2 Gradient Ascent (Descent)


For concave objective functions, Newton-Raphson is essentially the best algorithm.
However, there are a few flaws of the algorithm that make it challenging to use it more
generally:

• when the objective function is not concave, NR is not guaranteed to even con-
verge.

• when the objective function is complicated and high-dimensional, finding the
Hessian, and inverting it repeatedly may be expensive.

In such a case, gradient ascent (or gradient descent if the problem is a minimizing
problem) is a useful alternative as it does not require the Hessian.

Consider the objective function f(θ) that we want to maximize and suppose θ* is the true maximum. Then, by a Taylor series approximation at a fixed θ_0,

f(θ) ≈ f(θ_0) + f′(θ_0)(θ − θ_0) + (f″(θ_0)/2)(θ − θ_0)² .

If f″(θ) is unavailable or we don’t want to use it, consider assuming that the second derivative is a negative constant: f″(θ) = −1/t for some t > 0. That is, assume that f is quadratic and concave. Then,

f(θ) ≈ f(θ_0) + f′(θ_0)(θ − θ_0) − (1/(2t))(θ − θ_0)² .

Maximizing f(θ) using this crude approximation amounts to maximizing the right-hand side. Taking the derivative with respect to θ and setting it to zero:

f′(θ_0) − (θ − θ_0)/t = 0 ⇒ θ = θ_0 + t f′(θ_0) .

Using this intuition, given a θ(k) , the gradient ascent algorithm does the update

θ(k+1) = θ(k) + tf ′ (θ(k) ) ,

for t > 0. The iteration can be stopped when |θ(k+1) − θ(k) | < ϵ for ϵ > 0 or when
|f ′ (θk+1 )| ≈ 0.

For concave functions, there exists a t such that gradient ascent converges to the global maximum. In general (when the function is not concave), there exists a t such that gradient ascent converges to a local maximum, as long as you don’t start exactly at a local minimum.

The algorithm essentially makes a local concave quadratic approximation at the current point θ^{(k)} and then maximizes that quadratic. The value of t indicates how far we want to jump and is a tuning parameter: if t is large, we take big jumps; if t is small, the jumps are smaller.

Example 33 (Location Cauchy distribution). Recall the location Cauchy distribution with mode at µ ∈ R, where the log-likelihood is not guaranteed to be concave and thus Newton-Raphson is difficult to use. The log-likelihood was

L(µ|X) = Π_{i=1}^n f(X_i|µ) = π^{−n} Π_{i=1}^n 1/(1 + (X_i − µ)²)

⇒ l(µ) := log L(µ|X) = −n log π − Σ_{i=1}^n log(1 + (X_i − µ)²) .

We found the first derivative:

l′(µ) = 2 Σ_{i=1}^n (X_i − µ)/(1 + (X_i − µ)²) .

I choose t = 0.3 in this case, so that we obtain the following gradient ascent iterative scheme:

µ^{(k+1)} = µ^{(k)} + (0.3) · 2 Σ_{i=1}^n (X_i − µ^{(k)})/(1 + (X_i − µ^{(k)})²) .

1. Set µ_0 = Median(X_i), since the mean of the Cauchy does not exist and the Cauchy centered at µ is symmetric around µ.

2. Determine:

   µ^{(k+1)} = µ^{(k)} + (0.3) · 2 Σ_{i=1}^n (X_i − µ^{(k)})/(1 + (X_i − µ^{(k)})²) .

3. Stop when |l′(µ^{(k+1)})| < ϵ for a chosen tolerance level ϵ. A sketch of this scheme in R follows.
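The sketch below assumes simulated data; the step size is a tuning choice, and a smaller assumed value 1/(2n) is used here rather than 0.3 simply to keep the iteration stable at this sample size.

set.seed(1)
X <- rcauchy(100, location = -2)
n <- length(X)
mu <- median(X)                    # starting value
t.step <- 1/(2*n)                  # assumed step size (tuning parameter)
repeat {
  grad <- 2 * sum((X - mu)/(1 + (X - mu)^2))
  if (abs(grad) < 1e-8) break      # stop when |l'(mu)| is near 0
  mu <- mu + t.step * grad
}
mu   # gradient ascent estimate of the MLE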

Gradient Ascent in Higher Dimensions

By a multivariate Taylor series expansion, we can obtain a similar motivation and the
iteration in the algorithm is:

θ(k+1) = θ(k) + t∇f (θ(k) ) .

Example 34 (Logistic regression). We have studied linear regression for modeling

continuous responses. But when Y is a response of 1s and 0s (Bernoulli), assuming the Ys are normally distributed is not appropriate. Instead, when the ith covariate vector is x_i = (x_{i1}, . . . , x_{ip})ᵀ, then for β ∈ Rᵖ logistic regression assumes the model

Y_i ∼ Bern( e^{x_iᵀβ} / (1 + e^{x_iᵀβ}) ) .

In other words, the probability that any response takes the value 1 is

Pr(Y_i = 1) = e^{x_iᵀβ} / (1 + e^{x_iᵀβ}) =: p_i .

Our goal is to obtain the MLE of β. As usual, first we write down the log-likelihood:

L(β|y) = Π_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}

⇒ l(β) = Σ_{i=1}^n y_i log p_i + Σ_{i=1}^n (1 − y_i) log(1 − p_i)
       = Σ_{i=1}^n log(1 − p_i) + Σ_{i=1}^n y_i ( log p_i − log(1 − p_i) )
       = −Σ_{i=1}^n log( 1 + exp(x_iᵀβ) ) + Σ_{i=1}^n y_i x_iᵀβ .

To take the derivative, we can either work element-wise for each β_s or use matrix calculus. Let’s first do it element-wise. Writing out the objective,

l(β) = −Σ_{i=1}^n log( 1 + exp( Σ_j x_{ij} β_j ) ) + Σ_{i=1}^n Σ_{j=1}^p y_i x_{ij} β_j

∂l(β)/∂β_s = Σ_{i=1}^n y_i x_{is} − Σ_{i=1}^n x_{is} e^{Σ_j x_{ij}β_j} / ( 1 + e^{Σ_j x_{ij}β_j} ) .

Assembling the whole vector over all s,

⇒ ∇l(β) = Σ_{i=1}^n x_i ( y_i − 1/(1 + e^{−x_iᵀβ}) ) .

The above can be done directly for the full vector as well, in one go:

∇l(β) = Σ_{i=1}^n x_i y_i − Σ_{i=1}^n x_i e^{x_iᵀβ}/(1 + e^{x_iᵀβ})
      = Σ_{i=1}^n x_i ( y_i − e^{x_iᵀβ}/(1 + e^{x_iᵀβ}) )
      = Σ_{i=1}^n x_i ( y_i − 1/(1 + e^{−x_iᵀβ}) )  =set  0 .

An analytical solution here is not possible, thus a numerical optimization tool is required. In order to know what kind of function we have, we also obtain the Hessian. Again, we can compute it element-wise or all at once. Element-wise, for any pair (k, s),

∂²l(β)/∂β_k ∂β_s = ∂/∂β_k [ Σ_{i=1}^n y_i x_{is} − Σ_{i=1}^n x_{is} e^{Σ_j x_{ij}β_j}/(1 + e^{Σ_j x_{ij}β_j}) ]
                = −∂/∂β_k [ Σ_{i=1}^n x_{is} e^{Σ_j x_{ij}β_j}/(1 + e^{Σ_j x_{ij}β_j}) ]
                = −Σ_{i=1}^n x_{is} e^{Σ_j x_{ij}β_j} x_{ik}/(1 + e^{Σ_j x_{ij}β_j})²
                = −Σ_{i=1}^n x_{ik} x_{is} e^{Σ_j x_{ij}β_j}/(1 + e^{Σ_j x_{ij}β_j})² .

Collecting all the entries,

∇²l(β) = −XᵀW X ,

where W is the n × n diagonal matrix with diagonal elements e^{x_iᵀβ}/(1 + e^{x_iᵀβ})².

Note: You can verify by studying the Hessian above that it is negative semi-definite, and thus that the likelihood function is indeed concave; we can therefore use either Newton-Raphson or gradient ascent successfully. We will use gradient ascent, and you should implement N-R on your own.

Gradient Ascent for Logistic regression:

1. Set β(0) = 0p (since there is no information available, and it is reasonable to


assume that none of the covariates are important).

2. Set some appropriate t.

3. For iteration k + 1:

   β_{(k+1)} = β_{(k)} + t Σ_{i=1}^n x_i ( y_i − 1/(1 + e^{−x_iᵀβ_{(k)}}) ) .

4. Stop when ∥∇l(β_{(k+1)})∥ < ϵ.
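A hedged R sketch of this algorithm on simulated data (the sizes, true β, step size, and stopping tolerance below are all assumptions chosen for illustration):

set.seed(1)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), nrow = n, ncol = p-1))
beta.true <- c(-1, 2, 1)
y <- rbinom(n, 1, 1/(1 + exp(-X %*% beta.true)))

beta <- rep(0, p)            # step 1: start at 0
t.step <- 1e-3               # step 2: a conservative assumed step size
for (k in 1:1e5) {
  grad <- t(X) %*% (y - 1/(1 + exp(-X %*% beta)))   # gradient of l(beta)
  if (sqrt(sum(grad^2)) < 1e-6) break
  beta <- beta + t.step * grad
}
c(beta)   # compare with glm(y ~ X - 1, family = binomial)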

Example 35 (Probit Regression). Not completed in class, do it yourself. Logistic


regression is only one way to model the probabilities. For pi , we may actually use any
mapping that will transport xTi β to a probability pi . This can be done through a CDF
function.

For i = 1, . . . , n, let xi = (1, xi2 , . . . , xip )T be the vector of covariates for the ith
observation and β ∈ Rp be the corresponding vector of regression coefficients. Suppose
response yi is a realization of Yi with

Y_i ∼ Bern( Φ(x_iᵀβ) ) ,


where Φ(·) is the CDF of a standard Normal distribution. We want to find the MLE
of β. The likelihood is given by

L(β) = Π_{i=1}^N Φ(x_iᵀβ)^{y_i} (1 − Φ(x_iᵀβ))^{1−y_i} .

We know that 1 − Φ(x) = Φ(−x). Let zi = 2yi − 1, which implies yi = 1 =⇒ zi = 1


and yi = 0 =⇒ zi = −1. This transformation will help us.

So for yi = 1, Φ(xTi β)yi = Φ(zi xTi β)yi , and for yi = 0, Φ(xTi β)yi = 1 = Φ(zi xTi β)yi .
Similarly, for yi = 0, Φ(−xTi β)(1−yi ) = Φ(zi xTi β)(1−yi ) , and for yi = 1, Φ(−xTi β)(1−yi ) =
1 = Φ(zi xTi β)(1−yi ) .

Therefore,

Φ(x_iᵀβ)^{y_i} = Φ(z_i x_iᵀβ)^{y_i}   and   (1 − Φ(x_iᵀβ))^{1−y_i} = Φ(z_i x_iᵀβ)^{1−y_i} .

As a consequence, the likelihood can be written concisely as

L(β) = Π_{i=1}^N Φ(z_i x_iᵀβ)^{y_i} Φ(z_i x_iᵀβ)^{1−y_i} = Π_{i=1}^N Φ(z_i x_iᵀβ) .

Let the log-likelihood be denoted by l(β). We have

l(β) = Σ_{i=1}^N log( Φ(z_i x_iᵀβ) ) .

To find the value of the regression coefficient that maximizes the log-likelihood, we use the iterative Newton-Raphson method. Let f(·) denote the standard normal density. The gradient and Hessian calculations are as follows:

∇l(β) = D = (d/dβ) Σ_{i=1}^N log( Φ(z_i x_iᵀβ) ) = Σ_{i=1}^N f(z_i x_iᵀβ) z_i x_i / Φ(z_i x_iᵀβ) .

For the Hessian, ∇²l(β) = DD, with (p, q)th entry

DD_{pq} = Σ_{i=1}^N (d/dβ_q) [ f(z_i x_iᵀβ) z_i x_{ip} / Φ(z_i x_iᵀβ) ]
        = Σ_{i=1}^N [ −( f(z_i x_iᵀβ)²/Φ(z_i x_iᵀβ)² ) (z_i x_{iq})(z_i x_{ip}) + ( f′(z_i x_iᵀβ)/Φ(z_i x_iᵀβ) ) (z_i x_{iq})(z_i x_{ip}) ] .

We can use that f′(x) = (−x/√(2π)) e^{−x²/2} = −x f(x), and that z_i² = 1. This gives the following formulation for the Hessian entries:

DD_{pq} = −Σ_{i=1}^N [ ( f(z_i x_iᵀβ)/Φ(z_i x_iᵀβ) )² x_{ip} x_{iq} + ( f(z_i x_iᵀβ)(z_i x_iᵀβ)/Φ(z_i x_iᵀβ) ) x_{ip} x_{iq} ] .

Therefore,

DD = −Σ_{i=1}^N [ ( f(z_i x_iᵀβ)/Φ(z_i x_iᵀβ) )² + f(z_i x_iᵀβ)(z_i x_iᵀβ)/Φ(z_i x_iᵀβ) ] x_i x_iᵀ .

One can show that the above is negative definite, and thus the function is concave
(this will be an exercise). Now we can apply Gradient Ascent algorithm or Newton-
Raphson. ■

Still working on this

7.3 MM Algorithm
Consider obtaining a solution to

θ∗ = arg max f (θ)


θ

The “Minorize/Maximize” (MM) algorithm, at a current iterate, finds a “minorizing” function at that point, and then maximizes that minorizing function. That is, at any given iteration, consider a minorizing function f̃(θ|θ^{(k)}) such that:

• f (θk ) = f˜(θk |θk )

• f (θ) ≥ f˜(θ|θk ) for all other θ

Then, θ(k+1) is obtained as

θ(k+1) = arg max f˜(θ|θ(k) ) .


θ

The algorithm has the ascent property in that every update increases the objective value.
That is,

f (θ(k+1) ) ≥ f˜(θ(k+1) | θ(k) )


≥ f˜(θ(k) | θ(k) )
= f (θ(k) ) .

Thus, if we want to maximize f(θ), we may find a minorizing function for it, and then repeatedly maximize it. The key to implementing the MM algorithm is finding a good minorizing function. This can be done in a few different ways and is generally application-specific.

Note: When minimizing an objective function, we do the opposite: we find a majorizing function and then minimize it.

The question is, how to construct such minorizing functions? This is typically done on
a case-by-case basis. Many inequalities are used:

• Jensen’s inequality

• The Cauchy-Schwarz inequality

• The arithmetic mean-geometric mean inequality.

One common way of implementing the MM algorithm is to use the remainder form of the Taylor series expansion:

f(θ) = f(θ^{(k)}) + f′(θ^{(k)})(θ − θ^{(k)}) + (1/2) f″(z)(θ − θ^{(k)})² ,

where z is some point between θ^{(k)} and θ (by the mean value theorem). If we can find a lower bound f″(z) ≥ L, then

f̃(θ|θ^{(k)}) = f(θ^{(k)}) + f′(θ^{(k)})(θ − θ^{(k)}) + (1/2) L (θ − θ^{(k)})²

is a minorizing function, and the iterates are

θ^{(k+1)} = θ^{(k)} − f′(θ^{(k)})/L .
In some way, this is an informed way of choosing the learning rate for gradient ascent!

Example 36 (Location Cauchy distribution). As before, the log-likelihood is


n
X
f (µ) = log L(µ|X) = −n log π − log(1 + (Xi − µ)2 ) ,
t=1

and derivatives n

X Xi − µ
f (µ) = 2 ,
i=1
1 + (Xi − µ)2
n 
(Xi − µ)2 − 1
X 
′′
f (µ) = 2 .
i=1
[1 + (Xi − µ)2 ]2

Consider the Taylor expansion:

f(µ) = f(µ^{(k)}) + f′(µ^{(k)})(µ − µ^{(k)}) + (1/2) f″(z)(µ − µ^{(k)})² .

Note that for any i,

2 [ (X_i − µ)² − 1 ]/[ 1 + (X_i − µ)² ]² ≥ 2 ( −1/[ 1 + (X_i − µ)² ]² ) ≥ −2 .

This implies that f″(µ) ≥ −2n =: L. So the iterations are

µ^{(k+1)} = µ^{(k)} + (1/n) Σ_{i=1}^n (X_i − µ^{(k)})/(1 + (X_i − µ^{(k)})²) .

The main utility of the MM algorithm is that it can be used when derivatives of the objective function are unavailable or ill-behaved. An excellent example of this is the following bridge regression problem.

Example 37 (Bridge regression). Recall the penalized (negative) log-likelihood in ridge regression, where the objective function was

Q(β) = (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) Σ_{i=1}^p β_i² .

Changing the penalty function generalizes this to bridge regression, with objective function

Q_B(β) = (y − Xβ)ᵀ(y − Xβ)/2 + (λ/α) Σ_{i=1}^p |β_i|^α ,

for α ∈ [1, 2] and λ > 0. When α = 2, this is ridge regression, and when α = 1,
this is the popular lasso regression. Different choices of α, lead to different style of
penalization. For a given λ, smaller values of α push the estimates closer towards zero.

We need to find the bridge regression estimate

arg min_β Q_B(β) .

First note that (y − Xβ)ᵀ(y − Xβ) is a convex function and |β_i|^α is convex for α ≥ 1. Since a sum of convex functions is convex, the objective function is convex, and thus our optimization algorithms will find the global minimum.

Note that for α = 1, the objective function is not differentiable at 0, and for α ∈
(1, 2), the function is not twice differentiable at 0. Thus, using Newton-Raphson and
gradient descent is not possible. We will instead use an MM algorithm. Since this
is a minimization problem, we will find a majorizing function and then minimize the
majorizing function.

We will try to find a majorizing function that upper bounds the objective QB (β), and
then minimize the majorizing function. Intuitively, optimizing the majorizing function
will again require derivatives of the majorizing function. Thus our goal is to find a
majorizing function that does not contain an absolute value, and is thus differentiable.

Note further that the (y − Xβ)ᵀ(y − Xβ) term is well behaved and quadratic. Thus, we are not inclined to touch this part. Instead we would like to upper bound Σ_{i=1}^p |β_i|^α .

Consider the function h(u) = u^{α/2} for u ≥ 0, and see that

h′(u) = (α/2) u^{α/2 − 1}   and   h″(u) = (α/2)(α/2 − 1) u^{α/2 − 2} ≤ 0 ,

so h(u) is a concave function for α ∈ [1, 2]. For a concave function, by the “rooftop theorem”, the first-order Taylor series creates a tangent line that lies above the function. Thus for any u*,

h(u) ≤ h(u*) + h′(u*)(u − u*) = h(u*) + (α/2)(u*)^{α/2 − 1}(u − u*) .

At any given iteration of the optimization, given β_{(k)}, take u = |β_i|² and u* = |β_{i,(k)}|², where β_i is the ith component of the vector β. Then,

|β_i|^α = h(u) ≤ |β_{i,(k)}|^α + (α/2)|β_{i,(k)}|^{α−2} ( β_i² − β_{i,(k)}² )
        = |β_{i,(k)}|^α − (α/2)|β_{i,(k)}|^α + (α/2)|β_{i,(k)}|^{α−2} β_i²
        = constants + (m_{i,(k)}/2) β_i² ,

where m_{i,(k)} = α|β_{i,(k)}|^{α−2}. (You will see that the constants are not important.)

Now that we have upper bounded the penalty function, we have an upper bound on the full objective function:

Q_B(β) ≤ constants + (y − Xβ)ᵀ(y − Xβ)/2 + (λ/(2α)) Σ_{j=1}^p m_{j,(k)} β_j² .

Why is this upper bound useful?

• Remember that at any given iteration, the optimization is with respect to β.
Thus, the constants are truly constants.

• The upper bound has no absolute values and is easily differentiable!

• Recall that we obtained the upper bound function using a derivative of h(u).
This derivative is not defined at u = 0. However, we're only using the derivative
function at u∗ = |β_{i,(k)}|², which is the previous iterate. So as long as we DO
NOT START at zero, this upper bound is valid.

• Finally, the upper bound is easily optimizable, as it is similar to ridge. (See below.)

The objective function is similar to ridge regression, except it is “weighted”. Following
the same steps as in ridge optimization, you can show that the minimum occurs at

β_(k+1) = (X^T X + λ M_(k))^{−1} X^T y ,

where M_(k) = diag(m_{1,(k)}/α, m_{2,(k)}/α, . . . , m_{p,(k)}/α). Note that here M_(k) is what drives
the direction of the optimization.
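To make the update concrete, below is a minimal R sketch of this MM iteration. The
simulated data (X, y), the choices of λ and α, the starting value, and the stopping
tolerance are all illustrative assumptions, not part of the original example.

set.seed(1)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
y <- X %*% rnorm(p) + rnorm(n)
lambda <- 1; alpha <- 1.5    # illustrative tuning parameters

beta_k <- rep(1, p)          # starting value; do NOT start at zero
for (k in 1:200) {
  m_k <- alpha * abs(beta_k)^(alpha - 2)      # m_{i,(k)} = alpha |beta_{i,(k)}|^(alpha - 2)
  M_k <- diag(c(m_k) / alpha)                 # M_(k) = diag(m_{i,(k)} / alpha)
  beta_new <- solve(t(X) %*% X + lambda * M_k, t(X) %*% y)
  if (sum((beta_new - beta_k)^2) < 1e-10) break
  beta_k <- beta_new
}
beta_k                       # bridge regression estimate

Each iteration is just a weighted ridge solve, which is why the MM approach is attractive
here despite the non-differentiable penalty.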

Questions to think about

• How do you think we can choose α in any given problem?

• How do you think we can choose λ in any given problem?

• Will the MM algorithm always converge to a global optimum?

7.4 Exercises
1. Find the MLE of (α, µ) for a Pareto distribution with density

f(x) = α µ^α / x^{α+1} ,  x ≥ µ,  µ, α > 0 .

Do you need numerical procedures to calculate the MLE here?

2. Consider an objective function of the form f(θ) = aθ² + bθ + c. Write the iterates
of the Newton-Raphson algorithm. In how many iterates do you expect NR to
converge for any starting value θ_(0)?

3. (Using R) Find the MLE of (µ, σ 2 ) for the N (µ, σ 2 ) distribution using Newton-
Raphson’s method. Compare with the closed form estimates.

4. (Using R) Using both Newton-Raphson and the gradient ascent algorithm, maximize
the objective function

f(x) = cos(x),  x ∈ [−π, 3π] .

5. Consider estimating the first moment of Exp(λ) using simple importance sam-
pling with a Gamma(α, β) proposal distribution. The density of an exponential
is

π(x) = λ e^{−λx} ,  x > 0 ,

and the density of Gamma(α, β) is

g(x) = (β^α / Γ(α)) x^{α−1} e^{−βx} ,  x > 0 .

(a) Construct the simple importance sampling expression for general α, β, and
λ.

(b) Denote the variance of the estimator in (a) as κ_λ(α, β). For what values of
α and β is κ_λ(α, β) infinite? You will have to solve the integral here.

(c) Set β = 1. Describe an algorithm for obtaining the best proposal within
this family of proposals. Give all details. Use the above constructed algo-
rithm to obtain the best Gamma(α, 1) for λ = 5. Here you may require an
optimization algorithm.

(d) Is the proposal obtained in (c) the universally optimal proposal for this prob-
lem?

6. Generate data according to the logistic regression model above with n = 50, and
use the Newton-Raphson and gradient ascent algorithms to find the MLE.

7. Implement Newton-Raphson and gradient ascent to find the MLE of (α, β) for a
Beta(α, β) distribution. Generate your own data to implement this in R.

8. (Using R) Find the MLE of a Γ(α, 1) distribution using Newton-Raphson’s method.


Set α = 4 and n = 10 and generate your own data. Rerun the Newton-Raphson’s
algorithm with different starting values.

9. (Using R) Find the MLE of a location Cauchy distribution with density

f(x) = (1/π) · 1/(1 + (x − µ)²) ,  µ ∈ R, x ∈ R .

Set µ = −2 and n = 4, 100, and run the Newton-Raphson algorithm with
different starting values. Are starting values more impactful here than in the
Gamma problem? Why or why not? Now repeat the same for the gradient
ascent algorithm.

10. (Modified Newton-Raphson): It is possible to “overshoot” when using Newton-


Raphson’s algorithm, when the objective function is not concave. In these sce-
narios, we use a modified Newton-Raphson’s approach, with step factors.

Suppose the kth iteration is such that f ′ (θ(k) ) > 0 (that is the function is in-
creasing at θ(k) ), and NR takes us to θ(k+1) where f ′ (θ(k+1) ) < 0, so that now the
function is decreasing. This means we may have overshot! As a compromise, we
may want to implement the following algorithm:

θ_(k+1) = θ_(k) − λ_(k) f′(θ_(k)) / f′′(θ_(k)) ,
where λ(k) is a step-factor sequence chosen at every iteration so that

f (θ(k+1) ) > f (θ(k) ) .

That is, the next value must be such that we achieve an increase in the objective
function. You can choose λ(k) so that

(a) If f(θ_(k+1)) > f(θ_(k)), then keep λ_(k) = 1. Else set λ_(k) = 1/2 and recalculate f(θ_(k+1)).

(b) If now f(θ_(k+1)) > f(θ_(k)), then continue; else set λ_(k) = 1/2² and so on...

Implement this modified algorithm for the Gamma and Cauchy examples.

11. Consider the logistic regression model: for i = 1, . . . , n, let x_i = (1, x_{i2}, . . . , x_{ip})^T
be the vector of covariates for the ith observation and β ∈ R^p be the correspond-
ing vector of regression coefficients. Suppose response y_i is a realization of Y_i
with

Y_i ∼ Bern(p_i)  where  p_i = exp(−x_i^T β) / (1 + exp(−x_i^T β)) .
(a) Write the negative log-likelihood, −l(β).

(b) Consider the ridge logistic regression problem, which minimizes the follow-
ing penalized negative log-likelihood:
Q(β) = −l(β) + (λ/2) Σ_{i=1}^p β_i² .

Is Q(β) a convex function in β?

(c) Write the Newton-Raphson algorithm for minimizing Q(β). Write all steps
clearly.

12. Consider a typical Poisson regression model. For i = 1, . . . , n, let x_i ∈ R^p be
the vector of covariates for the ith observation and β ∈ R^p be the corresponding
regression coefficients. Let y_i be the observed response, which is count data.
That is,

Y_i ∼ Poisson( e^{x_i^T β} ) .

(a) Find the joint likelihood L(β | y1 , . . . yn ) and obtain the maximum likeli-
hood estimator of β. If not available in closed-form, present the complete
optimization algorithm to obtain β̂MLE .

8 The EM algorithm
An important application of the MM algorithm is the Expectation-Maximization (EM)
algorithm. Since the EM algorithm is an integral part of statistics on its own, we study
it separately. We will first motivate the EM algorithm with an example.

8.1 Motivating example: Gaussian mixture models


Suppose X_1, X_2, . . . , X_n ~iid F, where F is a mixture of two normal distributions so
that the density is:

f(x | µ_1, µ_2, σ_1², σ_2², p∗) = p∗ f_1(x | µ_1, σ_1²) + (1 − p∗) f_2(x | µ_2, σ_2²) ,

where f_i(x | µ_i, σ_i²) is the density of the N(µ_i, σ_i²) distribution for i = 1, 2. Data following
such distributions arises often in real life. For instance, suppose we collect batting
averages for players in cricket. Then bowlers are expected to have low averages and
batters are expected to have high averages, creating a mixture-like distribution.

Given data, suppose we wish to find the maximum likelihood estimates of all 5 param-
eters: (µ_1, µ_2, σ_1², σ_2², p∗). That is, we want to maximize:

l(µ_1, µ_2, σ_1², σ_2², p∗ | X) = Σ_{i=1}^n log f(x_i | µ_1, µ_2, σ_1², σ_2², p∗)
= Σ_{i=1}^n log [ p∗ f_1(x_i | µ_1, σ_1²) + (1 − p∗) f_2(x_i | µ_2, σ_2²) ] .

There is no analytical solution to the above optimization problem and we have to


resort to numerical techniques. We can certainly think of implementing gradient-based
estimation methods here. However, instead of trying to use gradient-based tools, we
use a common trick called the latent variable or missing data trick.

Recall that the data likelihood is a mixture of Gaussians. An interpretation of this is


that with probability p∗ , any observed Xi is from f1 and with probability 1 − p∗ it is
from f2 .

Suppose we have the information about the class of each x_i (classes being class 1
and class 2). Thus, suppose the complete data was of the form

(X_1, Z_1), (X_2, Z_2), . . . , (X_n, Z_n) ,

where each Zi = k means that Xi is from population k. If this complete data is
available to us, then first note that the joint probability density/mass function is

f (xi , zi = k) = f (xi |zi = k) Pr(Zi = k) .

Suppose D_1 = {i : 1 ≤ i ≤ n, z_i = 1} and D_2 = {i : 1 ≤ i ≤ n, z_i = 2}, with cardinalities
d_1 and d_2, respectively. The sets D_1 and D_2 contain the indices of the data that belong to
each component of the mixture.

Then the likelihood of the complete data is

L(µ_1, µ_2, σ_1², σ_2², p∗ | X, Z)
= ∏_{i=1}^n f(x_i, z_i)
= ∏_{i∈D_1} f(x_i, z_i = 1) ∏_{i∈D_2} f(x_i, z_i = 2)
= ∏_{i∈D_1} f(x_i | z_i = 1) Pr(Z_i = 1) ∏_{i∈D_2} f(x_i | z_i = 2) Pr(Z_i = 2)
= ∏_{i∈D_1} p∗ f_1(x_i | µ_1, σ_1²) ∏_{i∈D_2} (1 − p∗) f_2(x_i | µ_2, σ_2²)
= (p∗)^{d_1} (1 − p∗)^{d_2} ∏_{i∈D_1} f_1(x_i | µ_1, σ_1²) ∏_{i∈D_2} f_2(x_i | µ_2, σ_2²) .

This means that the log-likelihood of the complete data is

log L = d_1 log(p∗) + d_2 log(1 − p∗) + Σ_{i∈D_1} log f_1(x_i | µ_1, σ_1²) + Σ_{i∈D_2} log f_2(x_i | µ_2, σ_2²) .

This complete log-likelihood is in a far nicer format, so that closed-form estimates are
available.

Differentiating with respect to p∗ and setting to zero, we get

∂ log L / ∂p∗ = d_1/p∗ − d_2/(1 − p∗) = 0 .

This gives that the MLE for p∗ is

p̂∗ = d_1/(d_1 + d_2) = d_1/n .

You can also see that the MLEs for µ_1, µ_2, σ_1², σ_2² have all been isolated, so that the
MLEs can be easily obtained from the usual Gaussian likelihood:

µ̂_1 = (1/d_1) Σ_{i∈D_1} X_i ,  µ̂_2 = (1/d_2) Σ_{i∈D_2} X_i ,

σ̂_1² = (1/d_1) Σ_{i∈D_1} (X_i − µ̂_1)² ,  σ̂_2² = (1/d_2) Σ_{i∈D_2} (X_i − µ̂_2)² .

(Of course, you must verify the second derivative condition.)

However, remember that the actual Zs are unknown to us, so we can't actually
calculate d_1 and d_2, and the above cannot be evaluated. The EM algorithm will solve
this problem by estimating the unobserved Z_i corresponding to each X_i in an iterative
manner.

We will come back to this Gaussian problem again.

8.2 The Expectation-Maximization Algorithm


Suppose we have a vector of parameters θ, and we have only observed the marginal
data X_1, . . . , X_n from the complete data (X_i, Z_i). The objective to maximize is

l(θ | X) = log f(x | θ) = log ∫ f(x, z | θ) dν_z ,

where ∫ · dν_z denotes an integral or a summation based on whether Z is continuous or
discrete.

The EM algorithm produces iterates {θ_(k)} in order to solve the above maximization
problem. Writing the target objective function in this form allows us to do the op-
timization using EM. The EM algorithm iterates through an “E” step (Expectation)
and an “M” step (Maximization). Consider a starting value θ_(0). Then for the (k + 1)th
iteration:

1. E-Step: Compute the expectation of the complete log-likelihood:

q(θ | θ_(k)) = E_{Z|X} [ log f(x, z | θ) | X = x, θ_(k) ] ,

where the expectation is computed with respect to the conditional distribution
of Z given X = x for the current iterate θ_(k).

2. M-Step: Compute

θ_(k+1) = arg max_{θ∈Θ} q(θ | θ_(k)) .

3. Stop when ∥θ_(k+1) − θ_(k)∥ < ϵ.

The following theorem will convince us that running the above algorithm ensures that
the θ_(k) converge to a local maximum. The trick to implementing the EM algorithm for
a general problem is in finding an appropriate joint distribution (X, Z) for which the
E-step is computable.

Theorem 10. The EM algorithm is an MM algorithm and thus has the ascent property.

Proof. In order to show the result we first need to find a minorizing function.

The objective function is log f(x | θ). So we need to find f̃(θ | θ_(k)) such that
f̃(θ_(k) | θ_(k)) = log f(x | θ_(k)) and, in general,

f̃(θ | θ_(k)) ≤ log f(x | θ) .

We will show that f̃(θ | θ_(k)) is such that

f̃(θ | θ_(k)) = q(θ | θ_(k)) + constants.

Then, maximizing f˜(θ|θ(k) ) is equivalent to maximizing q(θ|θ(k) ) (the M step of both


EM and MM).

Let

f̃(θ | θ_(k)) = ∫_z log{f(x, z | θ)} f(z | x, θ_(k)) dz + log f(x | θ_(k)) − ∫_z log{f(x, z | θ_(k))} f(z | x, θ_(k)) dz .

(The proof technique is set up for continuous Z, but the same proof works for discrete
Z as well.)

Naturally, we can see that at θ = θ_(k), f̃(θ_(k) | θ_(k)) = log f(x | θ_(k)). We will now show
the minorizing property.

f̃(θ | θ_(k))

= ∫_z log{f(x, z | θ)} f(z | x, θ_(k)) dz + log f(x | θ_(k)) − ∫_z log{f(x, z | θ_(k))} f(z | x, θ_(k)) dz

= ∫_z log{f(x, z | θ)} f(z | x, θ_(k)) dz + ∫_z log f(x | θ_(k)) f(z | x, θ_(k)) dz − ∫_z log{f(x, z | θ_(k))} f(z | x, θ_(k)) dz

= ∫_z log [ f(x, z | θ) f(x | θ_(k)) / f(x, z | θ_(k)) ] f(z | x, θ_(k)) dz

= ∫_z log [ f(x, z | θ) f(x | θ_(k)) / f(x, z | θ_(k)) ] f(z | x, θ_(k)) dz + log f(x | θ) − log f(x | θ)

= ∫_z log [ f(x, z | θ) f(x | θ_(k)) / ( f(x, z | θ_(k)) f(x | θ) ) ] f(z | x, θ_(k)) dz + log f(x | θ) .

By Jensen's inequality,

f̃(θ | θ_(k)) ≤ log [ ∫_z ( f(x, z | θ) f(x | θ_(k)) / ( f(x, z | θ_(k)) f(x | θ) ) ) f(z | x, θ_(k)) dz ] + log f(x | θ)

= log [ ∫_z ( f(z | x, θ) / f(z | x, θ_(k)) ) f(z | x, θ_(k)) dz ] + log f(x | θ)

= log ∫_z f(z | x, θ) dz + log f(x | θ)

= log f(x | θ) .

Thus, f̃(θ | θ_(k)) is a minorizing function, and the next iterate is

θ_(k+1) = arg max_θ f̃(θ | θ_(k)) = arg max_θ q(θ | θ_(k)) .

8.3 (Back to) Gaussian mixture likelihood


We will look at the general setup of C groups, so that the density for X1 , . . . , Xn is

f(x | θ) = Σ_{j=1}^C π_j f_j(x | µ_j, σ_j²) ,

where θ = (µ_1, . . . , µ_C, σ_1², . . . , σ_C², π_1, . . . , π_{C−1}). The setup is the same as before, and
suppose we have the complete data (X_i, Z_i), where

[X_i | Z_i = c] ∼ N(µ_c, σ_c²)  and  Pr(Z_i = c) = π_c .

To implement the EM algorithm for this example, we first need to find q(θ | θ_(k)), which
requires finding the distribution of Z | X. This can be done by Bayes' theorem, since

Pr(Z = c | X = x_i) = f(x_i | Z = c) Pr(Z = c) / f(x_i) = f_c(x_i | µ_c, σ_c²) π_c / Σ_{j=1}^C f_j(x_i | µ_j, σ_j²) π_j =: γ_{i,c} .

So for any kth iterate with current value θ_(k) = (µ_{1,k}, . . . , µ_{C,k}, σ²_{1,k}, . . . , σ²_{C,k}, π_{1,k}, . . . , π_{C−1,k}),
we have

Pr(Z = c | X = x_i, θ_(k)) = f_c(x_i | µ_{c,k}, σ²_{c,k}) π_{c,k} / Σ_{j=1}^C f_j(x_i | µ_{j,k}, σ²_{j,k}) π_{j,k} := γ_{i,c,k} .

NOTE: The γ_{i,c} are themselves quantities of interest, since they tell us the probability of the ith
observation being in class c. In this way, at the end we get probabilities of association
for each data point that inform us about the classification of the observations.

Next,

q(θ | θ_(k)) = E_{Z|X} [ log f(x, z | θ) | X = x, θ_(k) ]
= E_{Z|X} [ Σ_{i=1}^n log f(x_i, z_i | θ) | X = x, θ_(k) ]
= Σ_{i=1}^n E_{Z_i|X_i} [ log f(x_i, z_i | θ) | X_i = x_i, θ_(k) ]
= Σ_{i=1}^n Σ_{c=1}^C [ log f(x_i, z_i = c | θ) ] Pr(Z = c | X = x_i, θ_(k))
= Σ_{i=1}^n Σ_{c=1}^C log [ f_c(x_i | µ_c, σ_c²) π_c ] · γ_{i,c,k} .

This expectation is in a complicated form, but we have it available! Notice that we
will need to store the values of

γ_{i,c,k} = f_c(x_i | µ_{c,k}, σ²_{c,k}) π_{c,k} / Σ_{j=1}^C f_j(x_i | µ_{j,k}, σ²_{j,k}) π_{j,k}

in order to implement the E-step.

This completes the E-step. We move on to the M-step. To complete the M-step, we need

θ_(k+1) = arg max_θ q(θ | θ_(k)) .

Expanding,

q(θ | θ_(k)) = Σ_{i=1}^n Σ_{c=1}^C [ log f_c(x_i | µ_c, σ_c²) + log π_c ] γ_{i,c,k}
= Σ_{i=1}^n Σ_{c=1}^C [ −(1/2) log(2π) − (1/2) log σ_c² − (X_i − µ_c)²/(2σ_c²) + log π_c ] γ_{i,c,k}
= const − (1/2) Σ_{i=1}^n Σ_{c=1}^C log σ_c² γ_{i,c,k} − Σ_{i=1}^n Σ_{c=1}^C (X_i − µ_c)²/(2σ_c²) γ_{i,c,k} + Σ_{i=1}^n Σ_{c=1}^C log π_c γ_{i,c,k} .

Taking derivatives and setting to 0, we get that for any c,

∂q/∂µ_c = Σ_{i=1}^n (x_i − µ_c) γ_{i,c,k} / σ_c² = 0 ⇒ µ_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,k} x_i / Σ_{i=1}^n γ_{i,c,k} , (3)

and the second derivative is clearly negative. For σ_c²,

∂q/∂σ_c² = −(1/2) Σ_{i=1}^n γ_{i,c,k}/σ_c² + Σ_{i=1}^n (X_i − µ_c)²/(2σ_c⁴) γ_{i,c,k} = 0 ⇒ σ²_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,k} (x_i − µ_{c,(k+1)})² / Σ_{i=1}^n γ_{i,c,k} . (4)

(You can show for yourself that the second derivative at this solution is negative.)

For π_c, note that the optimization requires a constraint, since Σ_{c=1}^C π_c = 1. So we will
use Lagrange multipliers. The modified objective function, for λ > 0, is

q̃(θ | θ_(k)) = q(θ | θ_(k)) − λ ( Σ_{c=1}^C π_c − 1 ) .

Taking the derivative and setting it to zero,

∂q̃/∂π_c = Σ_{i=1}^n γ_{i,c,k}/π_c − λ = 0
⇒ π_c = Σ_{i=1}^n γ_{i,c,k} / λ
⇒ Σ_{c=1}^C π_c = Σ_{c=1}^C Σ_{i=1}^n γ_{i,c,k} / λ
⇒ 1 = (1/λ) Σ_{i=1}^n 1 = n/λ  (since Σ_{c=1}^C γ_{i,c,k} = 1 for each i)
⇒ λ = n
⇒ π_{c,(k+1)} = (1/n) Σ_{i=1}^n γ_{i,c,k} . (5)

(and the second derivative is clearly negative). Thus equations (3), (4) and (5) provide
the iterative updates for the parameters. The final algorithm is:

Algorithm 16 EM Algorithm for Mixture of Gaussians

1: Set the initial value: θ_(0) = (µ_{1,(0)}, . . . , µ_{C,(0)}, σ²_{1,(0)}, . . . , σ²_{C,(0)}, π_{1,(0)}, . . . , π_{C,(0)})
2: For all c = 1, . . . , C and all i = 1, . . . , n, calculate

γ_{i,c,(k)} = f_c(x_i | µ_{c,(k)}, σ²_{c,(k)}) π_{c,(k)} / Σ_{j=1}^C f_j(x_i | µ_{j,(k)}, σ²_{j,(k)}) π_{j,(k)} .

3: For all c = 1, . . . , C,

µ_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,(k)} x_i / Σ_{i=1}^n γ_{i,c,(k)}
σ²_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,(k)} (x_i − µ_{c,(k+1)})² / Σ_{i=1}^n γ_{i,c,(k)}
π_{c,(k+1)} = (1/n) Σ_{i=1}^n γ_{i,c,(k)}

4: Stop when ∥θ_(k+1) − θ_(k)∥ < ϵ.

Note: The target likelihood f(x | θ) is not concave, so the algorithm is not guaranteed
to converge to a global maximum.
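Below is a minimal R sketch of Algorithm 16; the stopping tolerance and the starting
values in the usage comment are illustrative assumptions.

em_gauss_mix <- function(x, mu, sig2, pi_c, eps = 1e-8, maxit = 1000) {
  C <- length(mu)
  gam <- NULL
  for (k in 1:maxit) {
    old <- c(mu, sig2, pi_c)
    # E-step: gam[i, cc] = gamma_{i,c,(k)} = Pr(Z_i = c | x_i, theta_(k))
    gam <- sapply(1:C, function(cc) pi_c[cc] * dnorm(x, mu[cc], sqrt(sig2[cc])))
    gam <- gam / rowSums(gam)
    # M-step: closed-form updates (3), (4), (5)
    for (cc in 1:C) {
      mu[cc]   <- sum(gam[, cc] * x) / sum(gam[, cc])
      sig2[cc] <- sum(gam[, cc] * (x - mu[cc])^2) / sum(gam[, cc])
      pi_c[cc] <- mean(gam[, cc])
    }
    if (sum((c(mu, sig2, pi_c) - old)^2) < eps) break
  }
  list(mu = mu, sig2 = sig2, pi = pi_c, gamma = gam)
}
# e.g., a two-component fit with illustrative starting values:
# fit <- em_gauss_mix(faithful$eruptions, mu = c(2, 4), sig2 = c(1, 1), pi_c = c(.5, .5))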

Questions to think about

• What happens when you change the starting values drastically?

• What happens when you increase the number of clusters C?

• Can you setup the EM algorithm for multivariate normal distributions?

8.4 EM Algorithm for Censored Data


The EM algorithm is particularly employed when dealing with censored data: censored
data is when the realization of a random variable is only partially known. Consider
the following example.

A light bulb company is testing the failure times of their bulbs and knows that failure
times follow Exp(λ) for some λ > 0. They test n light bulbs, so the failure times of
the bulbs are

Z_1, . . . , Z_n ~iid Exp(λ) .

However, officials recording these failure times walked into the room only at time T
and observed that m < n of the bulbs had already failed. Those failure times cannot be
recorded. Define E_j = I(Z_j < T), so the observed data is

E_1 = 1, . . . , E_m = 1, Z_{m+1}, Z_{m+2}, . . . , Z_n .

Note that E_i ∼ Bern(p) where p = Pr(E_i = 1) = Pr(Z_i ≤ T) = 1 − e^{−λT} (from the
CDF of an exponential distribution). Our goal is to find the MLE for λ. Note that

• If we ignore the first m light bulbs, then not only do we have a smaller sample
size, but we also have a biased sample that does not contain the bottom tail of
the distribution of failure times.

So we must account for the “missing data”. Let us first write down the observed data
likelihood:

L(λ | E_1, . . . , E_m, Z_{m+1}, . . . , Z_n) = f(E_1, . . . , E_m, Z_{m+1}, . . . , Z_n | λ)
= ∏_{i=1}^m Pr(E_i = 1) · ∏_{j=m+1}^n f(z_j | λ)
= ∏_{i=1}^m (1 − e^{−λT}) ∏_{j=m+1}^n λ exp{−λZ_j}
= (1 − e^{−λT})^m λ^{n−m} exp{ −λ Σ_{j=m+1}^n Z_j } .

Closed-form MLEs are difficult here, and some sort of numerical optimization is useful.
Of course, here we can very easily implement our gradient-based methods. But we will
resort to the EM algorithm, as it has the added advantage that estimates of Z_1, . . . , Z_m
may be obtained as well.

Also, note that if we choose to “throw away” the censored data, then the likelihood is

L_bad(λ | Z_{m+1}, . . . , Z_n) = λ^{n−m} exp{ −λ Σ_{j=m+1}^n Z_j }

and the MLE is

λ_{MLE, bad} = (n − m) / Σ_{j=m+1}^n Z_j .

The above MLE is a bad estimator since the data thrown away is not censored at
random, and in fact, all those bulbs fused early. So the bulb company cannot just
throw that data away, as that would be dishonest!

Now we implement the EM algorithm for this example. First, note that the complete
data is Z_1, . . . , Z_n and the complete log-likelihood is

log L_comp(λ | Z_1, . . . , Z_n) = log ∏_{i=1}^n f(z_i | λ) = log ∏_{i=1}^n λ e^{−λz_i} = n log λ − λ Σ_{i=1}^n z_i .

In order to implement the EM algorithm, we need the conditional distribution of the
unobserved data, given the observed data. The unobserved data is Z_1, . . . , Z_m:

f(Z_1, . . . , Z_m | E_1, . . . , E_m, Z_{m+1}, . . . , Z_n)
= f(Z_1, . . . , Z_m | E_1, . . . , E_m)
= ∏_{i=1}^m f(Z_i | E_i)
= ∏_{i=1}^m f(Z_i | Z_i ≤ T)
= ∏_{i=1}^m [ λ e^{−λZ_i} / (1 − e^{−λT}) ] I(Z_i ≤ T) .

Further,

E[Z_i | E_i = 1] = E[Z_i | Z_i ≤ T] = ∫_0^T z_i λ e^{−λz_i}/(1 − e^{−λT}) dz_i = · · · = 1/λ − T e^{−λT}/(1 − e^{−λT}) .

Once we have the conditional distribution of the unobserved observations, we are ready
to implement the EM algorithm. Implementing the EM steps now:

1. E-Step: In the E-step, we find the expectation of the complete log-likelihood
under Z_{1:m} | (E_{1:m}, Z_{(m+1):n}). That is,

q(λ | λ_(k)) = E [ log f(Z_1, . . . , Z_n | λ) | E_1, . . . , E_m, Z_{m+1}, . . . , Z_n ]
= n log λ − λ E_{λ_(k)} [ Σ_{i=1}^n Z_i | E_1 = 1, . . . , E_m = 1, Z_{m+1}, . . . , Z_n ]
= n log λ − λ Σ_{i=m+1}^n Z_i − λ Σ_{i=1}^m E_{λ_(k)}[ Z_i | Z_i ≤ T ] .

2. M-Step: To implement the M-step:

λ_(k+1) = arg max_λ [ n log λ − λ Σ_{i=m+1}^n Z_i − λ Σ_{i=1}^m E_{λ_(k)}[ Z_i | Z_i ≤ T ] ] .

It is then easy to show that the M-step makes the following update (show this
yourself):

λ_(k+1) = n / ( Σ_{i=m+1}^n Z_i + Σ_{i=1}^m E_{λ_(k)}[ Z_i | Z_i ≤ T ] )
= n / ( Σ_{i=m+1}^n Z_i + m [ 1/λ_(k) − T e^{−λ_(k) T}/(1 − e^{−λ_(k) T}) ] ) .
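Below is a minimal R sketch of this EM algorithm; the true λ, the censoring time T,
the sample size, and the starting value are illustrative assumptions.

set.seed(2)
n <- 100; lam.true <- 2; T.cen <- 0.3
z <- rexp(n, rate = lam.true)
z.obs <- z[z > T.cen]        # bulbs that failed after time T are fully observed
m <- n - length(z.obs)       # number of censored (early) failures

lam <- 1                     # starting value lambda_(0)
for (k in 1:200) {
  # E-step: E[Z_i | Z_i <= T] at the current iterate
  cond.mean <- 1/lam - T.cen * exp(-lam * T.cen) / (1 - exp(-lam * T.cen))
  # M-step: closed-form update for lambda_(k+1)
  lam.new <- n / (sum(z.obs) + m * cond.mean)
  if (abs(lam.new - lam) < 1e-10) break
  lam <- lam.new
}
lam                          # EM estimate of lambda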

Questions to think about

• If the E-step is not solvable, in the sense that the expectation is not available,
what can we do?

• What is an added benefit of doing EM rather than NR or Gradient Ascent on


the observed likelihood?

8.5 Exercises
1. Following the steps from class, write the EM algorithm for a mixture of K Gaus-
sians, for any general K. That is, the distribution is

f(x | θ) = Σ_{k=1}^K π_k f_k(x | µ_k, σ_k²) .

2. (Using R) Consider the faithful dataset in R, which contains waiting time


between eruptions and the duration of each eruption for the Old Faithful geyser
in Yellowstone National Park, Wyoming, USA. First, run the following code

data(faithful)
plot(density(faithful$eruptions))

You will see that the length of the eruptions looks like a bimodal distribution.
For any given eruption, let Xi be the eruption time. Let

Z_i = 1 if X_i has a short eruption, and Z_i = 2 if X_i has a long eruption.

Thus Zi is a latent variable which is not observed. Let π1 and π2 be the probability
of short and long eruptions, respectively. Assume that the joint distribution of
(X, Z) is

f(x, z | θ) = π_1 f_1(x | µ_1, σ_1²) I(Z = 1) + π_2 f_2(x | µ_2, σ_2²) I(Z = 2) .

Implement the EM algorithm for this example.

3. (Using R) For the same dataset faithful, we will fit a multivariate mixture of
Gaussians for both the eruption time and waiting times. Let Xi be the eruption
time and Yi be the waiting time for the ith eruption. Let

Z_i = 1 if (X_i, Y_i) has a short eruption and short wait time, and Z_i = 2 if
(X_i, Y_i) has a long eruption and long wait time.

First, we want to find the EM steps for this. The joint distribution of the observed

t = (x, y) and the latent variable z is

f(t, z | θ) = π_1 f_1(t | µ_1, Σ_1) I(Z = 1) + π_2 f_2(t | µ_2, Σ_2) I(Z = 2) ,

where µ_c ∈ R², Σ_c ∈ R^{2×2} and

f_c(t | µ_c, Σ_c) = 1/(2π |Σ_c|^{1/2}) exp{ −(t − µ_c)^T Σ_c^{−1} (t − µ_c) / 2 } .

Similar to the one dimensional case, set up the EM algorithm for this two-
dimensional case, and then implement this on the Old Faithful dataset.

4. Repeat the previous exercises for four latent classes defined as

Z_i = 1 if X_i and Y_i have short eruptions and short wait times
Z_i = 2 if X_i and Y_i have long eruptions and long wait times
Z_i = 3 if X_i has short eruptions and Y_i has long wait times
Z_i = 4 if X_i has long eruptions and Y_i has short wait times

5. (EM algorithm for multinomial) Suppose y = (y_1, y_2, y_3, y_4) has a multinomial
distribution with probabilities

( 1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ) .

The joint distribution of y is

g(y | θ) = [ (Σ y_i)! / (∏_{i=1}^4 y_i!) ] (1/2 + θ/4)^{y_1} ((1 − θ)/4)^{y_2} ((1 − θ)/4)^{y_3} (θ/4)^{y_4} .

Suppose you observe y = (125, 18, 20, 34). Also, suppose that the complete data
is (z_1, z_2, y_2, y_3, y_4) where z_1 + z_2 = y_1. That is, the first variable y_1 is broken into
two groups, and the new probabilities are

( 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ) .

The complete data distribution is

f(z, y | θ) = [ (z_1 + z_2 + y_2 + y_3 + y_4)! / (z_1! z_2! y_2! y_3! y_4!) ] (1/2)^{z_1} (θ/4)^{z_2} ((1 − θ)/4)^{y_2} ((1 − θ)/4)^{y_3} (θ/4)^{y_4} .

The E-Step is

q(θ|θ(k) ) = Eθ(k) [log f (z, y | θ) | y1 , y2 , y3 , y4 ] .

Write the above expectation explicitly, and then write the M-step. Implement
this in R and return the estimate of θ.

6. Suppose the expectation in the E-step is not tractable. That is the expectations
is not available in closed form. In this case, one may replace the expectation
with a Monte Carlo estimate. This is called Monte Carlo EM. Repeat Exercise
5 using Monte Carlo EM.

7. Will the above algorithm have the ascent property?

8. Consider a two-component Gamma(α_i, 1) mixture density

f(x; α_1, α_2, π_1, π_2) = π_1 f_1(x; α_1) + π_2 f_2(x; α_2) ,

where π_1, π_2, α_1, α_2 > 0 and π_1 = 1 − π_2. Recall that:

f_i(x; α_i) = (1/Γ(α_i)) x^{α_i − 1} e^{−x} .

(a) Construct an EM algorithm to obtain the MLE of α1 , α2 , π1 , π2 . Present all


details and do not skip steps.

(b) Implement the algorithm on the eruptions dataset. What are your final
estimates of α1 , α2 , π1 , π2 ? Is there much difference in the clusters from the
Gaussian mixture model?

9 Choosing Tuning Parameters
We will take a break from optimization methods to discuss an important practical
question:

How do we choose model tuning parameters?

In ridge/bridge/lasso regression, we require choosing λ and/or α. In Gaussian mixture


models, we need to choose the number of clusters C.

9.1 Components in Gaussian Mixture Model


Suppose for observed data X_1, X_2, . . . , X_n our goal is to cluster these observations using
a Gaussian mixture model:

f(x | θ) = Σ_{j=1}^C π_j f_j(x | µ_j, σ_j²) .

In the above, C denotes the number of components/classes/groups/populations/clusters.
The choice of C is up to us, so the question is how to choose C?

Recall that we use the log-likelihood to see whether our estimates are good or not. A
larger value of the log-likelihood means a better model! Thus, in principle, we should
be able to use the negative log-likelihood to see how bad our estimates are. However, the
negative log-likelihood will always keep decreasing as C increases, since each data point
will want to be its own cluster.

An alternative model selection procedure is used, called the Bayesian Information
Criterion (BIC). The BIC is a function of the log-likelihood and a penalty term, penalizing
the number of parameters the algorithm has to estimate. So if C increases, the number
of parameters to be estimated increases, and a penalty is imposed. That is,

BIC(θ̂ | x) = 2 l(θ̂ | x) − log(n) K ,

where l is the log-likelihood and K is again the number of parameters being estimated.
The penalty increases with the number of clusters and with the size of the data. This
makes sense, since when a large amount of data is available, we should be able to find
simpler models.

Note: We want to choose C such that the value of BIC is maximized.
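As a small sketch, the BIC for a C-component univariate Gaussian mixture can be
computed as below; here loglik stands for a hypothetical maximized log-likelihood of
the fitted model, and K = 3C − 1 counts the C means, C variances and C − 1 free
weights.

# BIC in the maximization form used above; loglik is a hypothetical input
bic <- function(loglik, C, n) {
  K <- 3 * C - 1    # C means + C variances + (C - 1) free weights
  2 * loglik - log(n) * K
}
# fit the mixture for several values of C and choose the C maximizing bic(loglik, C, n)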

9.2 Loss functions


A loss function is a measure of error of an estimator from the true value. Having
observed data, y, let fˆ be the model fit. Then a loss function is a function L(y, fˆ) that
quantifies the distance between y and fˆ. The form of the loss function may depend on
the type of data. We will focus only on binary/Gaussian regression and the Gaussian
mixture models.

• Linear regression: In penalized or non-penalized regression, we have an obser-


vation y, covariates X, and a regression coefficient, β. Let β̂ be the estimated
regression coefficient – this could be obtained via ridge, bridge, or regular regres-
sion. Then the estimated response is

fˆ(x) = xT β̂ .

A loss function measures the error between y and the estimated response ŷ =
fˆ(x). For continuous response y, there are two popular loss functions:

– Squared error: L(y, fˆ(x)) = (y − fˆ(x))2

– Absolute error: L(y, fˆ(x)) = |y − fˆ(x)|

We will focus mainly on the squared error loss.

• Binary data classification: For binary data models, like logistic regression, a
popular loss function is the misclassification loss, also known as the 0−1 loss. Given
responses y which are Bernoulli(p) where

p = e^{x^T β} / (1 + e^{x^T β}) ,

we can find the estimated p, p̂, by setting

p̂ = e^{x^T β̂_MLE} / (1 + e^{x^T β̂_MLE}) ,

where β̂_MLE is the MLE obtained using an optimization algorithm. These p̂ are
the estimated probabilities of success. Set fˆ(x) = 1 · I{p̂ ≥ .5} + 0 · I{p̂ < .5}.
(The cutoff .5 is the default, but it can be changed depending on the problem.)
Then the misclassification loss function checks whether the model has misclassified
the observation:

Misclassification or 0−1 loss: L(y, fˆ(x)) = I(y ≠ fˆ(x)) .

How do we use these loss functions?

Given a dataset, we are interested in estimating the test error, which is the expected
loss on an independent dataset, given estimates from the current dataset. If our given
dataset is D,

Err_D = E[ L(y, fˆ(x)) | D ] ,

where fˆ(x) denotes the estimated y for a new independent data point, given estimators
from the current dataset. For example, using β̂ from a given dataset D, fˆ(x) = x_new^T β̂.

9.3 Cross-validation
In many data analysis techniques, there are tuning/nuisance parameters that need to
be chosen in order to fit the model. How can we choose these nuisance parameters?
For example:

• λ, the tuning parameter in penalized regression

• α, the bridge regression penalization factor

Or, we can fit multiple different models to the same dataset. For example, we could
fit linear regression, bridge regression, or ridge regression. Which fit is the best? How
can we make that assessment?

Cross-validation provides a way to estimate the test error without requiring new data.
In cross-validation, our original dataset is broken into chunks in order to emulate
independent datasets. Then the model is fit on some chunks and tested on other
chunks, with the loss recorded. The way the data is broken into chunks can lead to
different methods of cross-validation.

9.3.1 Holdout method

One way is to use the holdout method. In this,

• Choose a loss function

• We randomly split the data into two parts: a training set and a test set.

• Fit the model on the training set, and obtain estimates of the observation y for
the test set. Compute loss on the test set.

• Repeat the process for different models and choose the model with smallest error.

This method has two drawbacks:

1. We do not often have the luxury of “throwing away” observations for the test
data, especially in small data problems.

2. It is possible that, just by chance, a bad split makes the test/train data
drastically different.

The main advantage is that, compared to the methods that follow, this is fairly cheap.

9.3.2 Leave-one-out Cross-validation

In leave-one-out cross-validation (LOOCV), the data D of size n is randomly split into


a training set of size n − 1 and a test set of size 1. This is repeated systematically for
all observations so that there are n such splits possible.

For each split, the test error is estimated, and the average error over all splits is
calculated, which estimates the expected test error for a model fit fˆ(x). Let fˆ_{−i}(x_i)
denote the predicted value of y_i using the model that removes the ith data point. Then
the CV estimate of the prediction error is

CV_1(fˆ) = (1/n) Σ_{i=1}^n L(y_i, fˆ_{−i}(x_i)) ≈ E[ L(y, fˆ(x)) ] .

Note that each fˆ−i (xi ) represents model fits using different datasets, with testing on
one observation.

CV_1(fˆ) can be calculated for different models or tuning parameters. Let γ denote a
generic tuning parameter utilized in determining the model fit fˆ, and let fˆ_{−i}(x_i, γ)
denote the predicted value for y_i using a training set that doesn't include the ith
observation and uses γ as a tuning parameter. Then:

CV_1(fˆ, γ) = (1/n) Σ_{i=1}^n L(y_i, fˆ_{−i}(x_i, γ)) .

The chosen model is the one with γ such that

γ_chosen = arg min_γ { CV_1(fˆ, γ) } .

The final model is fˆ(X, γchosen ) fit to all the data. In this way we can accomplish two
things: obtain an estimate of the prediction error and choose a model.

Points:

• CV1 (fˆ) is an approximately unbiased estimator of the test error. This is because
the expectation is being evaluated over n samples of the data that are very close
to the original data D.

• LOOCV is computationally burdensome since the model is fit n times for each
γ. If the number of possible values of γ is large, this becomes computationally
expensive.

9.3.3 K-fold cross-validation

The data is randomly split into K roughly equal-sized parts. For any kth split, the
rest of the K − 1 parts make up the training set and the model is fit to the training
set. We then estimate the prediction error for each element in the kth part. Repeating
this for all k = 1, 2, . . . , K parts, we have an estimate of the prediction error.

Let κ : {1, . . . , n} → {1, . . . , K} indicate the partition to which each ith observation
belongs. Let fˆ_{−κ(i)}(x) be the fitted function with the κ(i)th partition removed. Then,
the estimated prediction error is

CV_K(fˆ, γ) = (1/K) Σ_{k=1}^K [1/(n/K)] Σ_{i ∈ kth split} L(y_i, fˆ_{−κ(i)}(x_i, γ)) = (1/n) Σ_{i=1}^n L(y_i, fˆ_{−κ(i)}(x_i, γ)) .

The chosen model is the one with γ such that

γ_chosen = arg min_γ { CV_K(fˆ, γ) } .

The final model is fˆ(X, γ_chosen) fit to all the data.
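Below is a minimal R sketch of K-fold cross-validation for choosing λ in ridge regression
under squared error loss; the simulated data, K = 5, and the λ grid are illustrative
assumptions.

set.seed(3)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
y <- X %*% rnorm(p) + rnorm(n)

K <- 5
folds <- sample(rep(1:K, length.out = n))   # kappa(i): fold of the ith observation
lambdas <- 10^seq(-3, 3, by = 0.5)
cv.err <- sapply(lambdas, function(lam) {
  mean(sapply(1:K, function(k) {
    tr <- folds != k                        # train on the other K - 1 parts
    b.hat <- solve(t(X[tr, ]) %*% X[tr, ] + lam * diag(p), t(X[tr, ]) %*% y[tr])
    mean((y[!tr] - X[!tr, ] %*% b.hat)^2)   # squared error loss on the kth part
  }))
})
lambdas[which.min(cv.err)]                  # gamma_chosen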

Points:

• For small K, the bias in estimating the true test error is large since each training
data is quite different from the given dataset D.

• The computational burden is lesser when K is small.

Usually, for large datasets, 10-fold or 5-fold CV is common. For small datasets, LOOCV
is more common.

Questions to think about

• Which algorithms do you think would be time-consuming to do LOOCV for?

9.4 Bootstrapping
We have discussed cross-validation, which we use to choose model tuning parameters.
However, once the final model is fit, we would like to do inference. That is, we want to
account for the variability of the final estimators obtained, and potentially do testing.

If our estimates are MLEs then we know that under certain important conditions,
MLEs have asymptotic normality, that is
√n (θ̂_MLE − θ) →d N(0, σ²_MLE) ,

where σ²_MLE is the inverse Fisher information. Then, if we can estimate σ²_MLE, we can
construct asymptotically normal confidence intervals:

θ̂_MLE ± z_{1−α/2} √( σ̂²_MLE / n ) .

We can also conduct hypothesis tests etc and go on to do regular statistical analysis.
But sometimes we cannot use an asymptotic distribution:

1. when our estimates are not MLEs, like ridge and bridge regression

2. when the assumptions for asymptotic normality are not satisfied (I haven’t shared
these assumptions)

3. when n is not large enough for asymptotic normality to hold

In bootstrapping, we approximate the distribution of θ̂, and from there we will obtain
confidence intervals.

Suppose θ̂ is some estimator of θ from a sample X_1, . . . , X_n ~iid F. Then, since θ̂ is random,
it has a sampling distribution G_n that is unknown. If asymptotic normality holds, then
G_n ≈ N(·, ·) for large enough n, but in general we may not know much about G_n. If

we could obtain many (say, B) similar datasets, we could obtain an estimate from each
of those B datasets:
θ̂_1, . . . , θ̂_B ~iid G_n .

Once we have B realizations from Gn , we can easily estimate characteristics about Gn ,


like the overall mean, variance, quantiles, etc.

Thus, in order to learn things about the sampling distribution Gn , our goal is to draw
more samples of such data. But this, of course is not easy in real-data scenarios. We
could obtain more Monte Carlo datasets from F , but we typically do not know the
true F .

In bootstrap, instead of obtaining typical Monte Carlo datasets, we will repeatedly


“resample” from our current dataset. This would give us an approximate sample
from our distribution Gn , and we could estimate characteristics of this distribution!
This resampling using information from the current data is called bootstrapping. We
will study two popular bootstrap methods: nonparameteric bootstrap and parametric
bootstrap.

9.4.1 Nonparametric Bootstrap

In nonparametric bootstrap, we resample data of size n from within {X_1, X_2, . . . , X_n}
(with replacement) and obtain estimates of θ using these samples. That is,

Bootstrap sample 1: X*_{11}, X*_{21}, . . . , X*_{n1} ⇒ θ̂*_1
Bootstrap sample 2: X*_{12}, X*_{22}, . . . , X*_{n2} ⇒ θ̂*_2
...
Bootstrap sample B: X*_{1B}, X*_{2B}, . . . , X*_{nB} ⇒ θ̂*_B .

Each sample is called a bootstrap sample, and there are B bootstrap samples. Now,
the idea is that θ̂*_1, . . . , θ̂*_B are B approximate samples from the distribution of θ̂, G_n.
That is,

θ̂*_1, . . . , θ̂*_B ≈ G_n .

If we want to know the variance of θ̂, then the bootstrap estimate of the variance is

V̂ar(θ̂) = [1/(B − 1)] Σ_{b=1}^B ( θ̂*_b − B^{−1} Σ_{k=1}^B θ̂*_k )² .

Similarly, we may want to construct a 100(1 − α)% confidence interval for θ. A 100(1 −
α)% confidence interval is a random interval (L, U) such that

Pr((L, U) contains θ) = 1 − α .

Note that here L and U are random and θ is fixed. We can find the confidence interval
by looking at the quantiles of the distribution of θ̂, G_n. Since we have bootstrap samples
from G_n, we can estimate these quantiles! So if we order the bootstrap estimates

θ̂*_(1) < θ̂*_(2) < · · · < θ̂*_(B) ,

and set L to be the (α/2)th order statistic and U to be the (1 − α/2)th order statistic,
we get:

L = θ̂*_(⌊(α/2)B⌋)  and  U = θ̂*_(⌊(1−α/2)B⌋) .

Then ( θ̂*_(⌊(α/2)B⌋), θ̂*_(⌊(1−α/2)B⌋) ) is a 100(1 − α)% bootstrap confidence interval.

Example 38 (Median of Gamma(a, b)). Let X_1, . . . , X_n ~iid Gamma(a, b). The median
of this distribution (θ) does not have a closed form expression, but suppose we are
interested in estimating it with the sample median. That is,

θ̂ = Sample Median(X_1, X_2, . . . , X_n) ∼ G_n .

Technically, there is a result on asymptotic normality of the sample median, so that
G_n is approximately normal for large n. However, we may not have enough samples
for the asymptotic normality to hold reasonably. Thus, instead, here we implement the
nonparametric bootstrap:

X*_{11}, X*_{21}, . . . , X*_{n1} ⇒ θ̂*_1
X*_{12}, X*_{22}, . . . , X*_{n2} ⇒ θ̂*_2
...
X*_{1B}, X*_{2B}, . . . , X*_{nB} ⇒ θ̂*_B ,

and find the α/2 and 1 − α/2 sample quantiles from θ̂*_i, i = 1, . . . , B. Then
( θ̂*_(⌊(α/2)B⌋), θ̂*_(⌊(1−α/2)B⌋) ) is a 100(1 − α)% bootstrap confidence interval. ■
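A minimal R sketch of Example 38 is below; the Gamma parameters, n and B are
illustrative assumptions.

set.seed(4)
x <- rgamma(50, shape = 2, rate = 1)    # the observed sample
B <- 2000
boot.med <- replicate(B, median(sample(x, replace = TRUE)))   # resample within the data
quantile(boot.med, c(0.025, 0.975))     # 95% nonparametric bootstrap CI for the median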

9.4.2 Parametric Bootstrap

Suppose X_1, . . . , X_n ∼ F(θ), where θ is a parameter we can estimate. Let θ̂ be a chosen
estimator of θ. Instead of resampling within our data, in parametric bootstrap we use
our estimator of θ to obtain computer-generated samples from F(θ̂):

X*_{11}, X*_{21}, . . . , X*_{n1} ∼ F(θ̂) ⇒ θ̂*_1
X*_{12}, X*_{22}, . . . , X*_{n2} ∼ F(θ̂) ⇒ θ̂*_2
...
X*_{1B}, X*_{2B}, . . . , X*_{nB} ∼ F(θ̂) ⇒ θ̂*_B .

And again, we find the α/2 and 1 − α/2 quantiles of the θ̂*_i s, so that
( θ̂*_(⌊(α/2)B⌋), θ̂*_(⌊(1−α/2)B⌋) ) is a 100(1 − α)% bootstrap confidence interval.

Example 39 (Coefficient of variation). For a population with variance σ² and mean
µ, its coefficient of variation is defined as

θ = σ/µ .

It essentially tells us what the deviation of the population is compared to its mean. Let
F = N(µ, σ²), and let X_1, . . . , X_n ~iid N(µ, σ²). We want to estimate θ, the coefficient
of variation. The obvious estimator is the sample standard deviation divided by the sample
mean:

θ̂ = √(s²) / X̄ ,  where  s² = [1/(n − 1)] Σ_{i=1}^n (X_i − X̄)² .

θ̂ is a complicated estimator, and it is unclear if it has some known distribution. We
may then understand its sampling distribution based on a parametric bootstrap
method:

X*_{11}, X*_{21}, . . . , X*_{n1} ∼ N(X̄, s²) ⇒ θ̂*_1
X*_{12}, X*_{22}, . . . , X*_{n2} ∼ N(X̄, s²) ⇒ θ̂*_2
...
X*_{1B}, X*_{2B}, . . . , X*_{nB} ∼ N(X̄, s²) ⇒ θ̂*_B .

And find α/2 and 1 − α/2 sample quantiles from θ̂i∗ , i = 1, . . . , B. ■
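A minimal R sketch of Example 39 is below; µ, σ², n and B are illustrative assumptions.

set.seed(5)
x <- rnorm(50, mean = 5, sd = sqrt(3))         # the observed sample
n <- length(x); B <- 2000
theta.hat <- sd(x) / mean(x)                   # estimated coefficient of variation
boot.theta <- replicate(B, {
  xb <- rnorm(n, mean = mean(x), sd = sd(x))   # computer generated sample from F(theta.hat)
  sd(xb) / mean(xb)
})
quantile(boot.theta, c(0.025, 0.975))          # 95% parametric bootstrap CI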

Questions to think about

• What happens if we increase or decrease B?

• How will you use bootstrapping to obtain confidence intervals for bridge regres-
sion coefficients?

• Do we need bootstrapping to obtain confidence intervals for ridge regression es-


timates?

9.5 Exercises
1. Comparing different cross-validation techniques: Generate a dataset using the
following code:

set.seed(10)
n <- 100
p <- 50
sigma2.star <- 4
beta.star <- rnorm(p, mean = 2)
beta.star

Generate a dataset (y, X) to fit the linear regression model

y = Xβ + ϵ .

Implement the holdout method, leave-one-out, 10-fold, and 5-fold cross-validation


over 500 replications. Keep track of the CV error from each method and
compare the performance of all cross-validation methods.

2. cars dataset: Estimate the prediction error for the cars dataset using 10-fold,
5-fold, and LOOCV.

3. mtcars dataset: Consider the mtcars dataset from 1974 Motor Trend US maga-
zine, that comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles. Load the dataset using

data(mtcars)

There are 10 covariates in the dataset, and mpg (miles per gallon) is the re-

sponse variable. Fit a ridge regression model for this dataset and find an op-
timal λ using 10-fold, 5-fold, and LOOCV cross-validation. Choose the best
λ ∈ {10^{−8}, 10^{−7.5}, . . . , 10^{7.5}, 10^8}. Make sure you construct the X matrix such that
the first column is a column of 1s.

4. Seeds dataset: Download the seeds dataset from

https://archive.ics.uci.edu/ml/datasets/seeds

This dataset contains information about three varieties of wheat (last column of
the dataset). There are 7 covariate information. Fit a 7-dimensional Gaussian
mixture model algorithm with C = 3 and estimate the mis-classification rate
using cross-validation.

5. Seeds dataset: For the same dataset, with C = 3, use cross-validation to find out
which of the 7 covariates best helps identify between the three kinds of wheat.

6. Generate n observations from a Normal distribution with mean µ and variance


σ 2 . Use code below:

set.seed(1)
mu <- 5
sig2 <- 3
n <- 100
my.samp <- rnorm(n, mean = mu, sd = sqrt(sig2))

Construct bootstrap confidence intervals for estimating the mean of a normal


distribution using both parameteric and nonparametric bootstrap methods and
compare the confidence intervals with the usual normal distribution confidence
intervals for µ.

7. Repeat the previous exercise for estimating the mean of t10 distribution from
sample of size 50. Are the bootstrap confidence intervals similar to the intervals
using CLT? Why or why not?

8. Repeat again to estimate the mean of a Gamma(.05, 1) distribution from a sample


of size n = 50. Are the bootstrap confidence intervals similar to the intervals
using CLT? Why or why not?

9. For Exercise 1, fit a bridge regression model with λ = 5, α = 1.5, and construct
95% parametric and nonparametric bootstrap confidence intervals for each of the 50
βs. In repeated simulations, what is the coverage probability of each of the con-
fidence intervals? What percentage of the confidence intervals contain the true
vector of β, beta.star?

10. Obtain 95% bootstrap confidence intervals for each ridge regression coefficient β
for the chosen λ value in the mtcars Exercise 3.

120
10 Stochastic optimization methods
We go back to optimization this week. The reason we took a break from optimization is
because we will focus on stochastic optimization methods, which will lead the discussion
into other stochastic methods. We will cover two topics:

1. Stochastic gradient ascent - used for large scale data problems

2. Simulated annealing - used for non-convex objective functions

Our goal is the same as before: for an objective function f(θ), our goal is to find

θ∗ = arg max_θ f(θ) .

10.1 Stochastic gradient ascent


Recall, in order to maximize the objective function, the gradient ascent algorithm does
the following update:

θ_(k+1) = θ_(k) + t ∇f(θ_(k)) ,

where ∇f(θ_(k)) is the gradient vector. Now, in many statistics problems, the objective
function is the log-likelihood (for some density f̃). That is, we have X_1, X_2, . . . , X_n ~iid F̃
with density function f̃. Then interest is in:

θ∗ = arg max_θ { Σ_{i=1}^n log f̃(x_i | θ) } = arg max_θ { (1/n) Σ_{i=1}^n log f̃(x_i | θ) } ,

where the term inside the second set of braces is f(θ).

Now, due to the objective function being an average, we have the following:

f(θ) = (1/n) Σ_{i=1}^n log f̃(x_i | θ)
⇒ ∇f(θ) = (1/n) Σ_{i=1}^n ∇ [ log f̃(x_i | θ) ] .

That is, in order to implement a gradient ascent step, the gradient of the log-likelihood
is calculated for the whole data. However, consider the following two situations

• the data size n and/or dimension of θ are prohibitively large so that calculating
the full gradient multiple times is infeasible

• the data is not available at once! In many online data situations, the full data set
is not available, but comes in sequentially. Then, the full data gradient vector is
not available.

In such situations, when the full gradient vector is unavailable, we may replace it with
an estimate of the gradient. Suppose i_k is a randomly chosen index in {1, . . . , n}. Then

E [ ∇ log f̃(x_{i_k} | θ) ] = (1/n) Σ_{i=1}^n ∇ log f̃(x_i | θ) .

Thus, ∇ log f̃(x_{i_k} | θ) is an unbiased estimator of the complete gradient, but uses only
one data point. Replacing the complete gradient with this estimate yields the stochastic
gradient ascent update:

θ_(k+1) = θ_(k) + t { ∇ log f̃(x_{i_k} | θ_(k)) } ,

where i_k is a randomly chosen index. This randomness in choosing the index makes
this a stochastic algorithm.

• advantage: it is much cheaper to implement, since only one data point is required
for gradient evaluation

• disadvantage: it may require larger k for convergence to the optimal solution

• disadvantage: as k increases, θ_(k+1) ↛ θ∗. Rather, after some initial steps, θ_(k+1)
oscillates around θ∗.

After K iterations, the final estimate of θ∗ is

θ̂∗ = (1/K) Σ_{k=1}^K θ_(k+1) .

However, since each step involves estimating the full-data gradient, the variability in the
updates of θ_(k) is larger than when using gradient ascent. To stabilize this behavior,
mini-batch stochastic gradient is often used.

10.1.1 Mini-batch stochastic gradient ascent

Let I_k be a random subset of {1, . . . , n} of size b. Then, the mini-batch stochastic
gradient ascent algorithm implements the following update:

θ_(k+1) = θ_(k) + t [ (1/b) Σ_{i∈I_k} ∇ log f̃(x_i | θ_(k)) ] .

The mini-batch stochastic gradient estimate of θ∗ after K updates is

θ̂∗ = (1/K) Σ_{k=1}^K θ_(k) .

There are not a lot of clear rules about terminating the algorithm in stochastic gra-
dient. Typically, the number of iterations K = n, so that one full pass at the data is
implemented.

10.1.2 Logistic regression

Recall the logistic regression setup where for a response Y and a covariate matrix X,

Y_i ∼ Bern( e^{x_i^T β} / (1 + e^{x_i^T β}) ) .

In order to find the MLE for β, we obtain the log-likelihood:

L(β | Y) = ∏_{i=1}^n (p_i)^{y_i} (1 − p_i)^{1−y_i}

⇒ f(β) = (1/n) log L(β | Y) = −(1/n) Σ_{i=1}^n log( 1 + exp(x_i^T β) ) + (1/n) Σ_{i=1}^n y_i x_i^T β .

Taking the derivative:

∇f(β) = (1/n) Σ_{i=1}^n x_i [ y_i − e^{x_i^T β} / (1 + e^{x_i^T β}) ] .

As noted earlier, the target objective is concave, thus a global optimum exists and the
gradient ascent algorithm will converge to the MLE. We will implement the stochastic
gradient ascent algorithm here. The stochastic gradient ascent algorithm proceeds in
the following way: for a randomly chosen index i_k,

β_(k+1) = β_(k) + t x_{i_k} [ y_{i_k} − e^{x_{i_k}^T β_(k)} / (1 + e^{x_{i_k}^T β_(k)}) ] .

NOTE: Here we chose a fixed learning rate t. A common strategy is to choose a


learning rate tk that reduces to 0 as k increases.
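Below is a minimal R sketch of (mini-batch) stochastic gradient ascent for this model;
the simulated data, the fixed learning rate t, the batch size b, and K = n iterations are
illustrative assumptions. Setting b = 1 gives the single-index update above.

set.seed(6)
n <- 1000; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
y <- rbinom(n, 1, 1 / (1 + exp(-X %*% c(-1, 1, 2))))

t <- 0.1; b <- 10; K <- n
beta <- rep(0, p); beta.sum <- rep(0, p)
for (k in 1:K) {
  Ik <- sample(n, b)                                    # random mini-batch I_k
  pk <- 1 / (1 + exp(-X[Ik, , drop = FALSE] %*% beta))  # fitted probabilities at beta_(k)
  grad <- colMeans(X[Ik, , drop = FALSE] * c(y[Ik] - pk))
  beta <- beta + t * grad                               # stochastic gradient step
  beta.sum <- beta.sum + beta
}
beta.sum / K                                            # averaged iterates, estimate of the MLE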

10.2 Simulated annealing


Last lecture we went over the stochastic gradient ascent algorithm: the merit of this
algorithm was its use for online sequential data and for large data set problems.

This lecture focuses on simulated annealing, an algorithm particularly useful for non-
concave objective functions. Our goal is the same as before: for an objective function
f(θ), our goal is to find

θ∗ = arg max_θ f(θ) .

Recall that when the objective function is non-concave, all of the methods we've dis-
cussed cannot escape a local maximum. This creates challenges in obtaining global
maxima. This is where the method of simulated annealing has an advantage over other
methods.

Consider an objective function f (θ) to maximize. Note that maximizing f (θ) is equiv-
alent to maximizing exp(f (θ)). The idea in simulated annealing is that, instead of
trying to find a maxima directly, we will obtain samples from the density

π(θ) ∝ exp(f (θ)) .

Samples collected from π(θ) are likely to be from areas near the maxima. However,
obtaining samples from π(θ) means there will be samples from low probability areas
as well. So how do we force samples to come from areas near the maxima?

Consider, for T > 0, that we may also look at

arg max_θ f(θ) = arg max_θ e^{f(θ)/T} .

For 0 < T < 1, the objective function's modes are exaggerated, thereby amplifying
the maxima. This feature will help us in trying to “push” the sampling to areas of
high probability.

Example 40. Consider the following objective function

f(θ) = [cos(50θ) + sin(20θ)]² I(0 < θ < 1) .

Below is a plot of e^{f(θ)/T} for various values of T.

[Figure: e^{f(θ)/T} over θ ∈ (0, 1) for T = 1, 0.83, 0.75, 0.71; smaller values of T exaggerate the peaks.]

In simulated annealing, this feature is utilized so that every subsequent sample is drawn
from an increasingly concentrated distribution. That is, at a time point k, a sample
will be drawn from

π_{k,T}(θ) ∝ e^{f(θ)/T_k} ,

where T_k is a decreasing sequence.

How do we generate these samples?

Certainly, we can try and use accept-reject or another Monte Carlo sampling method,
but such methods cannot be implemented generally. Note that for any θ′, θ,

π_{k,T}(θ′) / π_{k,T}(θ) = exp( [ f(θ′) − f(θ) ] / T_k ) .

Let G be a proposal distribution with density g(θ′ |θ) so that g(θ′ |θ) = g(θ|θ′ ). Such a
proposal distribution is a symmetric proposal distribution. Further, given a value of
θk , θ′ is sampled using density g(·|θk ).

Algorithm 17 Simulated Annealing algorithm

1: For k = 1, . . . , N, repeat the following:
2: Generate θ′ ∼ G(·|θ_k) and generate U ∼ U(0, 1)
3: Let α = min{ 1, exp( [ f(θ′) − f(θ_k) ] / T_k ) }
4: If U < α, then let θ_{k+1} = θ′
5: Else θ_{k+1} = θ_k
6: Update T_{k+1}
7: Store θ_{k+1} and e^{f(θ_{k+1})}
8: Return θ∗ = θ_{k∗} where k∗ is such that k∗ = arg max_k e^{f(θ_k)}

Thus, if the proposed value is such that f(θ′) > f(θ_k), then α = 1 and the move is
always accepted. The reason simulated annealing works is that when θ′ is such
that f(θ′) < f(θ_k), the move is still accepted with probability α.

Thus, there is always a chance to move out of local maxima.

Essentially, each θ_k is approximately distributed as π_{k,T}, and as T_k → 0, π_{k,T} puts more
and more mass on the maxima; thus, θ_k will typically get increasingly closer
to θ∗.

• Typically, G(·|θ) is U(θ − r, θ + r) or N(θ, r), which are both valid symmetric
proposals. The parameter r dictates how far/close the proposed values will be.

• T_k is often called the temperature parameter. A common choice is T_k = d/log(k)
for some constant d.
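Below is a minimal R sketch of Algorithm 17 applied to the objective of Example 40,
with a U(θ − r, θ + r) proposal and T_k = d/log(k + 1); r, d, N, and the starting value
are illustrative assumptions.

set.seed(7)
f <- function(th) (cos(50 * th) + sin(20 * th))^2 * (th > 0 & th < 1)
N <- 5000; r <- 0.1; d <- 1
theta <- runif(1); best <- theta
for (k in 1:N) {
  Tk <- d / log(k + 1)                       # decreasing temperature
  prop <- runif(1, theta - r, theta + r)     # symmetric proposal
  alpha <- min(1, exp((f(prop) - f(theta)) / Tk))
  if (runif(1) < alpha) theta <- prop        # accept even some downhill moves
  if (f(theta) > f(best)) best <- theta      # track the best value seen
}
c(best, f(best))                             # approximate arg max and maximum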

