MTH210
The instructor of this course owns the copyright of all the course materials. This lecture
material was distributed only to the students attending the course MTH210a: “Statistical
Computing” of IIT Kanpur, and should not be distributed in print or through electronic
media without the consent of the instructor. Students can make their own copies of the
course materials for their use.
Contents

1 Lecture-wise Summary
4.5.3 Multidimensional target
4.6 Exercises
5 Importance Sampling
5.1 Simple Monte Carlo
5.2 Simple importance sampling
5.2.1 Optimal proposals
5.2.2 Questions to think about
5.3 Weighted Importance Sampling
5.4 Questions to think about
5.5 Exercises
8 The EM algorithm
8.1 Motivating example: Gaussian mixture models
8.2 The Expectation-Maximization Algorithm
8.3 (Back to) Gaussian mixture likelihood
8.4 EM Algorithm for Censored Data
8.5 Exercises
10 Stochastic optimization methods
10.1 Stochastic gradient ascent
10.1.1 Mini-batch stochastic gradient ascent
10.1.2 Logistic regression
10.2 Simulated annealing
1 Lecture-wise Summary
Lec No. Date Topic
1 Jan 9 FCH and Pseudorandom number generation
2 Jan 9 Pseudorandom numbers and inverse transform
3 Jan 10 Accept-Reject
4 Jan 11 Accept-Reject and composition method
5 Jan 16 Continuous: inverse transform and accept-reject
6 Jan 17 Accept-reject
7 Jan 18 Accept-reject
8 Jan 23 Box-Muller, Ratio-of-uniforms, Quiz 1
9 Jan 24 Ratio-of-uniforms
10 Jan 25 Ratio-of-uniforms, Miscellaneous sampling
11 Jan 30 Multivariate normal sampling, Simple Monte Carlo
12 Jan 31 Simple Monte Carlo, simple importance sampling
13 Feb 1 Simple importance sampling
14 Feb 6 Optimal proposal example, Weighted importance sampling
15 Feb 7 Weighted importance sampling
16 Feb 8 Likelihood function, Maximum Likelihood Estimator
17 Feb 13 MLE and linear regression
18 Feb 14 Penalized regression, Quiz 2
19 Feb 15 Newton-Raphson Algorithm
– Feb 20 - 24 Mid-sem Exam week
20 Feb 27 Newton-Raphson Algorithm
21 Feb 28 Gradient Ascent Algorithm
22 Feb 28 (extra class) Logistic regression GA algorithm
23 Mar 1 More Logistic regression
Mar 6 - 17 Mid-sem Recess + 1 week I was away
24 Mar 20 MM Algorithm
25 Mar 21 Bridge regression
26 Mar 21 (extra class) EM Algorithm
27 Mar 22 EM Algorithm
28 Mar 27 Gaussian Mixture Model
29 Mar 28 Gaussian Mixture
30 Mar 28 (extra class) Gaussian Mixture still
31 Mar 29 Censored data example
32 Apr 3 Loss functions
33 Apr 5 Cross-validation
34 Apr 10 Bootstrapping
35 Apr 11 Bootstrapping
2 Pseudorandom Number Generation
The building block of computational simulation is the generation of uniform random
numbers. If we can draw from U (0, 1), then we can draw from most other distributions.
Thus the construction of sampling from U (0, 1) requires special attention.
Computers can generate numbers in (0, 1) which, although not exactly random (they are in fact deterministic), have the appearance of being U(0, 1) random variables. These draws from U(0, 1) are called pseudorandom draws:

X1, . . . , Xn ∼ U(0, 1), approximately iid.

Note: After this lecture, we will always assume that all U(0, 1) draws are exactly iid and perfectly random. We will forget that they are, in fact, pseudorandom. Pseudorandom generation is a whole field in itself; for more on this, check out CS744 at IITK.
The multiplicative congruential method generates a sequence via x_t = a x_{t−1} mod m, for a multiplier a and modulus m. Since x_t ∈ {0, 1, . . . , m − 1}, x_t/m ∈ [0, 1). Also note that after some finite number of steps (< m), the sequence repeats itself, since once a seed x_0 is set, a deterministic sequence of numbers follows. To allow the sequence x_t to mimic uniform and random draws, both a and m should be chosen to be large so as to avoid quick repetition. Typically m is a large prime number.
x2 = 123 × 1 mod 10 = 3
x3 = 123 × 3 mod 10 = 9
x4 = 123 × 9 mod 10 = 7
x5 = 123 × 7 mod 10 = 1
· · · ■
Thus, we see that with the above choices of a, m, x0 the sequence repeats itself very quickly. It is also recommended that a is large, to ensure large jumps and reduce "dependence" in the sequence. Based on the word size of your machine, it is recommended to set m = 2^31 − 1 and a = 7^5. Notice that both are large.
m <- 2^(31) - 1
a <- 7^5
x <- numeric(length = 1e3)
x[1] <- 7
for(i in 2:1e3)
{
x[i] <- (a * x[i-1]) %% m
}
par(mfrow = c(1,2))
hist(x/m) # looks close to uniformly distributed
plot.ts(x/m) # look like it’s jumping around too
The histogram shows a roughly "uniform" distribution of the samples, and the trace plot shows the lack of dependence between samples.
[Figure: histogram of x/m (left) and trace plot of x/m against time (right).]
Any pseudorandom generation method should satisfy:
1. for any initial seed, the resultant sequence has the "appearance" of being iid from Uniform[0, 1];
2. for any initial seed, the number of values generated before repetition begins is large.

2. A second generator is the mixed congruential method: x_t = (a x_{t−1} + c) mod m.
m <- 2^(31) - 1
a <- 7^5
c <- 2^(10) - 1
x <- numeric(length = 1e3)
x[1] <- 7
for(i in 2:1e3)
{
x[i] <- (c + a * x[i-1]) %% m
}
par(mfrow = c(1,2))
hist(x/m) # looks close to uniformly distributed
plot.ts(x/m) # look like it’s jumping around too
[Figure: histogram of x/m (left) and trace plot of x/m against time (right).]
We must be cautious not to be satisfied with just a histogram. A histogram only shows that the empirical distribution of all samples is roughly uniform. But we can still get a uniform-looking histogram from a terrible generator, for example with a = 1, m = 1e3 and c = 1.
m <- 1e3
a <- 1
c <- 1
x <- numeric(length = 1e3)
x[1] <- 7
for(i in 2:1e3)
{
x[i] <- (c + a * x[i-1]) %% m
}
par(mfrow = c(1,2))
hist(x/m) # looks VERY uniformly distributed
plot.ts(x/m) # Clearly "dependent" samples
Although a histogram shows an almost perfect uniform distribution, the trace plot
shows that the draws don’t behave like they are independent.
[Figure: histogram of x/m (left) and trace plot of x/m against time (right).]
but this requires more floating-point operations from the computer, and so is not as computationally viable. We claim that these methods return "good" pseudosamples, in the sense of the points stated above. There are statistical hypothesis tests, like the Kolmogorov-Smirnov test, that one can use to test whether a sample behaves as if truly random: independent and identically distributed.
runif() in R uses the Mersenne-Twister generator by default (we will not go into this),
but there are options to use other generators. After this, we will assume that runif()
returns truly iid samples from U (0, 1).
(b − a)U + a ∼ U (a, b) .
That means, we can draw U ∼ U (0, 1) and set X = (b − a)U + a. Then X ∼ U (a, b).
# Try for yourself
set.seed(1)
repeats <- 1e4
b <- 10
a <- 5
U <- runif(repeats, min = 0, max = 1)
X <- (b - a) * U + a #R is vectorized
hist(X)
2.4 Exercises
1. (Using R) Consider the multiplicative congruential method. For a, m positive
integers
x_n = a x_{n−1} mod m .
(b) Now look at only the first 10 numbers: x1 , x2 , . . . , x10 . What is the problem
here?
(c) How can you fix the problem noted in the previous step?
3 Generating Discrete Random Variables
Suppose X is a discrete random variable having probability mass function

Pr(X = x_j) = p_j ,  j = 0, 1, . . . ,  with Σ_j p_j = 1 .
Examples of such random variables are: Bernoulli, Poisson, Geometric, Negative Binomial, Binomial, etc. We will learn two methods to draw realizations of such discrete random variables:
Then X ∼ Bern(p).
Proof. To show the result we only need to show that Pr(X = 1) = p and Pr(X = 0) =
1 − p. Recall that by the cumulative distribution function of U [0, 1], for any 0 < t < 1,
Pr(U ≤ t) = t. Using this,
Pr(X = 0) = Pr(U ≤ q) = q ,
and also
Pr(X = 1) = Pr(q < U ≤ 1) = 1 − q = p .
Algorithm 1 Inverse transform for Bern(p)
1: Draw U ∼ U [0, 1]
2: if U < q then X = 0 else X = 1
Inverse transform method: The principles used in the above example can be extended to any generic discrete distribution. For a distribution with mass function

Pr(X = x_j) = p_j for j = 0, 1, . . . with Σ_{j=0}^∞ p_j = 1 ,

draw U ∼ U[0, 1] and return X = x_j when Σ_{i=0}^{j−1} p_i < U ≤ Σ_{i=0}^{j} p_i.
This method is called the Inverse transform method since the algorithm is essentially
looking at the inverse cumulative distribution function of the random variable.
Example 3 (Poisson random variables). The probability mass function for the Poisson random variable is

Pr(X = i) = p_i = e^{−λ} λ^i / i! ,  i = 0, 1, 2, . . . ,
Algorithm 2 Inverse transform for Poisson(λ)
1: Draw U ∼ U[0, 1]
2: if U ≤ p_0 then
3: X = 0
4: else if U ≤ p_0 + p_1 then
5: X = 1
6: . . .
7: else if U ≤ Σ_{i=0}^{j} p_i then
8: X = j
9: . . .
We know that a realization from a Poisson will most likely be close to λ, so it is beneficial to start the search around λ. Set I = ⌊λ⌋, and check whether

Σ_{i=0}^{I−1} p_i < U ≤ Σ_{i=0}^{I} p_i .

If it is, then return X = I. Else, if U > Σ_{i=0}^{I} p_i , then increase I; otherwise, decrease I, and check again. ■
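A minimal R sketch of this mode-started search (the function name rpois_invtrans, the use of R's dpois/ppois to accumulate pmf terms, and the value λ = 10 are our own illustrative choices, not from the notes):

rpois_invtrans <- function(lambda)
{
  U <- runif(1)
  I <- floor(lambda)
  cdf <- ppois(I, lambda)    # Pr(X <= I), starting at the mode
  if(U > cdf)
  {
    # search upward until the cdf first reaches U
    while(U > cdf)
    {
      I <- I + 1
      cdf <- cdf + dpois(I, lambda)
    }
  } else {
    # search downward while the cdf at I - 1 still exceeds U
    while(I > 0 && U <= cdf - dpois(I, lambda))
    {
      cdf <- cdf - dpois(I, lambda)
      I <- I - 1
    }
  }
  I
}
set.seed(1)
mean(replicate(1e4, rpois_invtrans(10)))  # should be close to 10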
• What other example can you think of where the inverse transform method could
take a lot of time?
• Can you try and implement this for a Binomial random variable?
3.2 Accept-Reject for Discrete Random Variables
Although we can draw from any discrete distribution using the inverse transform
method, you can imagine that for distributions on countably infinite spaces (like the
Poisson distribution), the inverse transform method may be very expensive. In such
situations, acceptance-rejection sampling may be more reliable.
Let {pj } denote the pmf of the target distribution with Pr(X = aj ) = pj and let
{qj } denote the pmf of another distribution with Pr(Y = aj ) = qj . Suppose you can
efficiently draw from {q_j} and you want to draw from {p_j}. Let c be a constant such that

p_j / q_j ≤ c < ∞ for all j such that p_j > 0 .
If we can find such a {qj } and c, then we can implement an Acceptance-Rejection or
Accept-Reject sampler. The idea is to draw samples from {qj } and accept these samples
if they seem likely to be from {pj }.
Note: When {p_j} has a finite set of states, c is always finite (since the maximum exists). However, when the target distribution does not have a finite set of states, then c need not be finite, in which case accept-reject is not possible.
Proof. First, we look at the second statement. We note that the number of iterations
required to stop the algorithm is clearly geometrically distributed by the definition of
the geometric distribution – the distribution of the number of Bernoulli trials needed
to get one success (with support 1, 2, 3...).
We will show that the probability of success is 1/c. “Success” here is an acceptance.
First, consider
Thus, the second statement is proved. We will now use this to show the main statement. Note that

Pr(X = a_j) = Σ_{n=1}^∞ Pr(a_j accepted on iteration n)
            = Σ_{n=1}^∞ Pr(no acceptance until iteration n − 1) Pr(Y = a_j , accept)
            = Σ_{n=1}^∞ (1 − 1/c)^{n−1} (p_j / c)
            = (p_j / c) · c     [the geometric series sums to c]
            = p_j .
Note: Since the probability of acceptance in any loop is 1/c, the expected number of
loops for one acceptance is c. The larger c is, the more expensive the algorithm.
One important thing to note is that within the support {a_j} of {p_j}, the proposal distribution must always be positive. That is, for all a_j in the support of {p_j}, Pr(Y = a_j) = q_j > 0. In other words, the proposal distribution's support must contain the support of the target distribution.
Example 4 (Sampling from Binomial using AR). The binomial distribution has pmf

Pr(X = x) = (n choose x) p^x (1 − p)^{n−x} for x = 0, 1, . . . , n .
We will use AR to simulate draws from Binomial(n, p). The first task is to choose a
proposal distribution. We could use any of Poisson, negative-binomial, or geometric
distributions. We cannot use Bernoulli, since the support of Bernoulli does not contain
the support of Binomial.
We use the version of geometric distribution that is defined as the number of failures
before the first success, so that the support of the geometric distribution has 0 in it.
The pmf of the geometric distribution is
Pr(X = x) = (1 − p)^x p ,  x = 0, 1, . . . .
Set

c = max_{x=0,1,...,n} (n choose x) (1 − p)^{n−2x} p^{x−1} .
To be safe (since we don’t know all the decimal points), we may set c to be slightly
larger (say c = 2.5) as c just needs to be an upper bound. Once c is known, the AR
algorithm can be implemented simply as described. Now here, we would expect, on
average, 2.5 values of Geometric random variables to be proposed until one acceptance.
This suggests that we may not want to choose the same p in the proposal distribution! Matching the proposal's mean (1 − p*)/p* to the Binomial mean np gives

np = (1 − p*)/p*  ⇒  p* = 1/(np + 1) ,

and the maximum over {0, 1, . . . , n} can be determined on the computer. For n = 100 and p = .25, the old bound is 1028.497 and the new one is 6.0455, which is much more efficient! ■
Pr(X = x) = (1 − p)^x p ,  x = 0, 1, 2, . . . .

We cannot use Binomial as a proposal, but we can use Poisson. Let us consider the Poisson(λ) proposal. The Poisson random variable has pmf

Pr(X = x) = e^{−λ} λ^x / x! ,  x = 0, 1, 2, . . . .
p(x)/q(x) = (1 − p)^x p / ( e^{−λ} λ^x / x! )
          = (p/e^{−λ}) ( (1 − p)/λ )^x x! .
For small values of λ (< 1 − p), the above clearly diverges as x increases, so the maximum doesn't exist. This is true for large values of λ as well. To see this (intuitively), consider Stirling's approximation of the factorial: x! ≈ e^{x log x − x}.
Using this:

p(x)/q(x) = (p/e^{−λ}) ( (1 − p)/λ )^x x!
          ≈ (p/e^{−λ}) ( (1 − p)/λ )^x e^{x log x − x}
          = (p/e^{−λ}) ( (1 − p) e^{log x} / (e λ) )^x .
Thus, no matter how large λ is, eventually as x increases, e^{log x} = x will be larger than λ and the ratio will diverge. Thus, this proposal does not allow an AR sampler for the Geometric distribution. ■
3.3 Composition method

For certain special distributions, it is easier to use a composition method for sampling.
Suppose we have an efficient way of simulating random variables from two pmfs {p_j^{(1)}} and {p_j^{(2)}}, and we want to simulate from

Pr(X = j) = α p_j^{(1)} + (1 − α) p_j^{(2)} ,  j ≥ 0 , where 0 < α < 1 .

First you should note that the above composition pmf is a valid pmf since Σ_j Pr(X = j) = 1. How would we sample in such a situation?
Algorithm 4 Composition method
1: Draw U ∼ U[0, 1]
2: if U ≤ α then simulate X1 ∼ {p^{(1)}} and return X = X1, else simulate X2 ∼ {p^{(2)}} and return X = X2
Proof. Consider
Pr(X = j)
= Pr(X = j, U ≤ α) + Pr(X = j, α < U ≤ 1) (by law of total probability)
= Pr(X = j | U ≤ α) Pr(U ≤ α) + Pr(X = j | α < U ≤ 1) Pr(α < U ≤ 1)
= Pr(X1 = j) Pr(U ≤ α) + Pr(X2 = j) Pr(α < U ≤ 1) (by independence of U and X1 , X2)
= α p_j^{(1)} + (1 − α) p_j^{(2)} .
More generally, a composition of k distributions has distribution function

F(x) = Σ_{i=1}^k α_i F_i(x) .
In such a case, we may use the zero inflated Poisson distribution (ZIP). Recall that if
X ∼ Poisson(λ), then

Pr(X = k) = e^{−λ} λ^k / k! ,  k = 0, 1, . . . .

If X ∼ ZIP(δ, λ) for δ > 0, then

Pr(X = k) = δ + (1 − δ) e^{−λ}            if k = 0 ,
Pr(X = k) = (1 − δ) e^{−λ} λ^k / k!       if k = 1, 2, . . . .
Note that the mean of a ZIP is (1 − δ)λ < λ since more mass is given to 0. We will use the composition method to sample from the ZIP distribution. To sample from a ZIP, first let {p_k^{(1)}} be the point mass at zero, p_k^{(1)} = I(k = 0), and let {p_k^{(2)}} be the Poisson(λ) pmf, so that

Pr(X = k) = δ p_k^{(1)} + (1 − δ) p_k^{(2)} .
Other composition or mixture distributions are also possible. Think about Zero-inflated
Binomial, Zero-inflated Geometric, 2-inflated Poisson, etc.
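A small R sketch of the ZIP composition sampler just described (the function name rzip and the parameter values δ = 0.3, λ = 4 are our own illustrative choices):

rzip <- function(n, delta, lambda)
{
  U <- runif(n)
  x <- numeric(n)                 # U <= delta: the point mass at 0
  pois <- (U > delta)             # otherwise: a Poisson(lambda) draw
  x[pois] <- rpois(sum(pois), lambda)
  x
}
set.seed(1)
samp <- rzip(1e4, delta = 0.3, lambda = 4)
mean(samp)        # close to (1 - 0.3) * 4 = 2.8
mean(samp == 0)   # close to 0.3 + 0.7 * exp(-4)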
3.4 Exercises
1. Show that if U ∼ U (0, 1), then for any a, b,
(b − a)U + a ∼ U(a, b) .
2. Use the inverse transform method to sample from a geometric distribution, where
for 0 < p < 1 and q = 1 − p,
3. We want to draw a sample from the random vector (X, Y )⊤ , that follows the
distribution with joint probability mass function
4. List as many appropriate proposal distributions as you can think of for the fol-
lowing target distributions:
• Binomial
• Bernoulli
• Geometric
• Negative Binomial
• Poisson
p_j / q_j ≤ c for all j for which p_j > 0 .
7. Simulate from a Negative Binomial(n, p) using the inverse transform and accept-
reject methods. Implement in R with n = 10 successes and p = .30.
Pr(X = i) = ( e^{−λ} λ^i / i! ) / ( Σ_{j=0}^m e^{−λ} λ^j / j! ) ,  i = 0, 1, 2, . . . , m .
Implement in R with m = 30 and λ = 20.
9. Suppose we want to obtain samples from a discrete distribution with pmf {pi }.
We use accept-reject with proposal distribution with pmf {qi }, such that for some
α ∈ R:
p_i / q_i ∝ i^α ,  i = 1, 2, . . . .
For what values of α would this AR algorithm work?
10. Suppose we want to obtain samples from a discrete distribution with pmf {pi }.
Two possible proposal distributions are {q_i^{(1)}} and {q_i^{(2)}}, yielding AR bounds c1 and c2 such that c1 > c2. Which proposal distribution is better?
4 Generating continuous random variables
Similar to generating discrete random variables, there are various methods for gener-
ating continuous random variables. We will discuss three main methods:
1. Inverse transform
2. Accept-reject
3. Ratio of uniforms
The following theorem will be the foundation for the inverse transform method.
Theorem 2. Let U ∼ U [0, 1]. For any continuous distribution F , a random variable
X = F −1 (U ) has distribution F .
Proof.

F_X(x) = Pr(X ≤ x)
       = Pr(F^{−1}(U) ≤ x)
       = Pr(F(F^{−1}(U)) ≤ F(x))   (since F is non-decreasing)
       = Pr(U ≤ F(x))
       = F(x) .
The above theorem then implies that if we can invert the CDF function, then we can
obtain random draws from that random variable.
Example 7. Exponential(1): For the Exponential(1) distribution, the cdf is F (x) =
1 − e−x . Thus,
F −1 (u) = − log(1 − u) .
For the Cauchy distribution, the pdf is

f(x) = (1/π) · 1/(1 + x²) ,

and

u = F(x) = ∫_{−∞}^x f(y) dy = (1/π) arctan(x) + 1/2 .

So, F^{−1}(u) = tan(π(u − 0.5)).
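A quick R sketch of this inverse transform, checking the empirical quartiles against the known Cauchy quantiles:

set.seed(1)
U <- runif(1e4)
X <- tan(pi * (U - 0.5))          # inverse CDF applied to uniforms
quantile(X, c(0.25, 0.5, 0.75))   # compare with the true values
qcauchy(c(0.25, 0.5, 0.75))       # -1, 0, 1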
For a distribution like the normal, however, we don't know the CDF in closed form and thus cannot analytically find the inverse. This is an example where the inverse transform method cannot work in practice (even though it works theoretically). Thus, unlike the discrete case, this genuinely motivates the need for another method to sample from a distribution.
Questions to think about
• Can we use the inverse transform method to generate sample from a normal
distribution?
Let the support of F be X and choose a proposal distribution G with density g(x) whose support is larger than or the same as the support of F. That is, if Y is the support of G, then X ⊆ Y. If we can find c such that

sup_{x∈X} f(x)/g(x) ≤ c ,

then an accept-reject sampler returns X such that, for all measurable sets B,

Pr(X ∈ B) = F(B) .
First, we consider the probability of acceptance:

Pr(accept) = Pr( U ≤ f(Y)/(c g(Y)) )
           = E[ I( U ≤ f(Y)/(c g(Y)) ) ]
           = E[ E[ I( U ≤ f(Y)/(c g(Y)) ) | Y ] ]   (using iterated expectations)
           = E[ Pr( U ≤ f(Y)/(c g(Y)) | Y ) ]
           = E[ f(Y)/(c g(Y)) ]
           = ∫_Y ( f(y)/(c g(y)) ) g(y) dy
           = (1/c) ∫_Y f(y) dy
           = (1/c) [ ∫_X f(y) dy + ∫_{Y∖X} f(y) dy ]   (since X ⊆ Y, and f = 0 outside X)
           = 1/c .
A similar conditioning argument then shows that the accepted draw satisfies Pr(X ∈ B) = F(B).
From the proof, we know that Pr(accept) = 1/c, and so, just like in the discrete example, the number of attempts it takes to generate an acceptance is distributed Geometric(1/c), with expected value c. At a proposed value y:
• if f(y) is large but g(y) is small, this value will not be proposed often but is a good value under f, so it is accepted with higher probability;
• if f(y) is small but g(y) is large, this value will be proposed often but is unlikely under f, so it is accepted less often.
We can choose any g we want, as long as its support contains the support of f and the resulting c is finite. However, some gs will be better than others, based on the expected number of iterations, c.
Example 10. Beta distribution: Consider the beta distribution Beta(4, 3), where

f(x) = ( Γ(7)/(Γ(4)Γ(3)) ) x^{4−1} (1 − x)^{3−1} ,  0 < x < 1 .

With a U(0, 1) proposal, g(x) = 1, so

sup_{x∈(0,1)} f(x)/g(x) = sup_{x∈(0,1)} f(x) .
Algorithm 9 Accept-reject for Beta(4, 3)
1: Draw U ∼ U[0, 1]
2: Draw proposal Y ∼ U(0, 1)
3: if U ≤ f(Y)/(c g(Y)) then
4: Return X = Y
5: else
6: Go to Step 1.
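A minimal R implementation of Algorithm 9 (a sketch; the bound c is simply the Beta(4, 3) density at its mode x = 3/5, and the function name ar_beta is our own):

c.bound <- dbeta(3/5, shape1 = 4, shape2 = 3)   # sup of f, about 2.07
ar_beta <- function()
{
  repeat
  {
    Y <- runif(1)    # proposal from U(0, 1)
    U <- runif(1)
    if(U <= dbeta(Y, 4, 3) / c.bound) return(Y)
  }
}
set.seed(1)
samp <- replicate(1e4, ar_beta())
mean(samp)   # close to the true mean 4/7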
In order to choose a good proposal distribution (one that yields a finite c), it is important to choose a g with "fatter tails" than f. This ensures that as x → ∞ or x → −∞, g dominates f, so that the ratio f/g tends to 0 in the extremes, rather than blowing up.
Example 11. Consider sampling from the standard normal target with density

f(x) = (1/√(2π)) e^{−x²/2} .
We know that the t-distribution has the right support and fatter tails and the “fattest”
t distribution is with degrees of freedom 1, which is Cauchy. The pdf of a Cauchy
distribution is
g(x) = (1/π) · 1/(1 + x²) .
(We know we can sample from Cauchy using inverse transform, so that is easy.) We
will need to find the supremum of the ratio of the densities. Consider
f(x)/g(x) = (π/√(2π)) (1 + x²) e^{−x²/2} .
As x → ±∞, e^{−x²/2} decreases more rapidly than x² increases, so the ratio tends to zero. This can be shown more formally using L'Hopital's rule.
Taking a derivative of the above ratio and setting it to 0 (and checking the second derivative condition) yields that the supremum occurs at x = −1, 1, so

sup_{x∈R} f(x)/g(x) = f(1)/g(1) = √(2π) e^{−1/2} ≈ 1.52 ⇒ c = 1.6 .
Note: What if the target distribution were Cauchy, and the proposal N(0, 1)? The ratio would clearly diverge as x → −∞, ∞, and thus an accept-reject sampler would not be possible. ■
Example 12. Sampling from a uniform circle. Consider a unit circle centered at (0, 0):

x² + y² < 1 ,  −1 < x, y < 1 .

We are interested in sampling uniformly from within this circle. Since the area of the circle is π, the target density is

f(x, y) = (1/π) I(x² + y² < 1) .

As the proposal, take the uniform density on the enclosing square,

g(x, y) = (1/4) I(−1 < x < 1) I(−1 < y < 1) .

Then

f(x, y)/g(x, y) = (4/π) I(x² + y² < 1) ≤ 4/π =: c ,

and

f(x, y)/(c g(x, y)) = (4/π) I(x² + y² < 1) · (π/4) = I(x² + y² < 1) .

So for any (x, y) drawn from within the square, the ratio will be either 1 or 0; thus, no need to draw a uniform at all!
Note: How do we draw uniformly from within the box? Note that

g(x, y) = (1/2) I(−1 < x < 1) · (1/2) I(−1 < y < 1) = g1(x) · g1(y) ,

where g1 is the density of a U(−1, 1) random variable. Thus, with two independent draws U1, U2 iid ∼ U(−1, 1), the pair (U1, U2) is a draw from the uniform box.
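In R, the whole scheme is a few lines (a sketch; the function name runif_circle is our own):

runif_circle <- function(n)
{
  out <- matrix(0, nrow = n, ncol = 2)
  for(i in 1:n)
  {
    repeat
    {
      prop <- runif(2, min = -1, max = 1)   # uniform on the box
      if(sum(prop^2) < 1) break             # accept if inside the circle
    }
    out[i, ] <- prop
  }
  out
}
set.seed(1)
xy <- runif_circle(1e3)
plot(xy, asp = 1)   # points spread uniformly over the unit circle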
• How can you decide whether one proposal distribution is better than another
proposal distribution?
Sometimes it is difficult to find a good proposal or even one that works! That is, for
a target density f (x) it can sometimes be challenging to find a proposal density g(x)
such that
sup_x f(x)/g(x) < ∞ .
Here are certain examples of when it may be difficult / impossible to implement accept-
reject.
Example 13 (Beta(m, n)). The density is

f(x) = ( Γ(m + n)/(Γ(m)Γ(n)) ) x^{m−1} (1 − x)^{n−1} ,  0 < x < 1 .
Depending on m and n, the Beta distribution can behave quite differently. Particularly,
note that when both m, n < 1 the Beta density function is unbounded!
So a Uniform distribution will not work. In fact, any proposal distribution with a
bounded density will not work. So this is an example of a distribution where it is
difficult to find a good proposal distribution.
Suppose n ≥ 1, so that (1 − x)^{n−1} ≤ 1. Then

f(x) = ( Γ(m + n)/(Γ(m)Γ(n)) ) x^{m−1} (1 − x)^{n−1}
     ≤ ( Γ(m + n)/(Γ(m)Γ(n)) ) x^{m−1} .

If we look at the upper bound, the function x^{m−1} on x ∈ (0, 1) can define a valid distribution if normalized. So, consider g(x) = m x^{m−1}, which is a proper density on 0 ≤ x ≤ 1, and

f(x)/g(x) ≤ Γ(m + n)/( m Γ(m)Γ(n) ) =: c .
This Accept-Reject sampler can be implemented easily. Similarly if m ≥ 1. Thus, an
AR sampler is easier to implement here if one of m or n is more than (or equal to)
1. ■
Example 14 (Cauchy). The Cauchy density is

f(x) = (1/π) · 1/(1 + x²) ,  x ∈ R .
The Cauchy distribution is known to have “fat tails” so that as x → ±∞, the density
function reduces to zero slowly. This means that it is very challenging to find g(x) that
“dominates” the density in the tails.
For example, let the proposal be N(0, 1). As discussed before, the ratio of the densities is

f(x)/g(x) = (√(2π)/π) · e^{x²/2}/(1 + x²) → ∞ as x → ±∞!
In fact, as far as we know, no standard accept-reject algorithm is possible here. ■
If you have chosen a family of proposal distributions that you know gives a finite c, it may still be unclear what the best parameters for that proposal distribution are. That is, if the target is f(x) and the proposal density is g(x | θ) (where θ is a parameter you can change to alter the behaviour of the proposal), then you want to find a value of the parameter θ so that the resulting proposal is the "best".
Suppose

sup_x f(x)/g(x|θ) ≤ c(θ) .

The value c(θ) is the expected number of loops for the accept-reject algorithm. Since we want this to be small, the best proposal density within this family is the one that minimizes c(θ), so set

θ* = arg min_θ c(θ) .
Example 15. Consider the Gamma(α, β) target density

f(x) = ( β^α/Γ(α) ) x^{α−1} e^{−βx} ,

and the Exponential(λ) proposal

g(x|λ) = λ e^{−λx} .
Then

f(x)/g(x|λ) = ( β^α/(λΓ(α)) ) x^{α−1} e^{−x(β−λ)} .
First note that no matter what λ is, if 0 < α < 1, then f (x)/g(x) → ∞ as x → 0. So
accept-reject with this proposal won’t work!
However, when α ≥ 1, x^{α−1} increases with x, so we want to choose λ such that e^{−x(β−λ)} decreases (since exponential decay is more powerful than polynomial increase; of course, you should show this more mathematically). Thus we want β > λ!
Thus, the optimal exponential proposal for the Gamma(α, β), α > 1 is Exp(β/α). ■
• How would you implement accept-reject for Gamma(α, β) for 0 < α < 1?
Consider the standard bivariate normal density

f(x, y) = (1/(2π)) e^{−x²/2} e^{−y²/2} ,  x ∈ R, y ∈ R .

Let (R², Θ) denote the polar coordinates of (X, Y), so that X = R cos Θ and Y = R sin Θ; here the support of R is (0, ∞) and the support of Θ is (0, 2π). Then

R² = X² + Y² ,  tan Θ = Y/X .
Notationally, we denote a realization from (R², Θ) as (d, θ) and find the joint density f(d, θ). Thus, let d = x² + y² and θ = tan^{−1}(y/x), so that x = √d cos θ and y = √d sin θ. The density of (d, θ) can be found by

f(d, θ) = |J| f(x, y) , where J = det [ ∂x/∂d  ∂y/∂d ; ∂x/∂θ  ∂y/∂θ ] .

Solving for J,

J = det [ cos θ/(2√d)   sin θ/(2√d) ; −√d sin θ   √d cos θ ] = 1/2 .
Therefore

f(d, θ) = (1/2) · (1/(2π)) e^{−d/2} ,  0 < d < ∞ , 0 < θ < 2π
        = [ (1/(2π)) I(0 < θ < 2π) ] · [ (1/2) e^{−d/2} I(0 < d < ∞) ] ,

where the first factor is the U(0, 2π) density and the second is the Exp(2) density (mean 2). This is a separable density, so R² and Θ are independent, with Θ ∼ U[0, 2π] and R² ∼ Exp(2).
To generate from Exp(2), we can use the inverse transform method: if U ∼ U(0, 1), then −2 log U ∼ Exp(2) (verify for yourself). To generate from U(0, 2π), we know that if U ∼ U(0, 1), then 2πU ∼ U(0, 2π). The Box-Muller algorithm, given in Algorithm 11, then produces X and Y independently from N(0, 1).
4.4 Ratio-of-Uniforms

Ratio-of-uniforms is a powerful, though not so popular, method to generate samples from a continuous random variable. When it works, it can work really well. The method is based critically on the following theorem.

Theorem 4. Let f(x) be a target density with distribution function F. Define the set

C = { (u, v) : 0 ≤ u ≤ √( f(v/u) ) } .

If (U, V) is uniformly distributed over C, then Z = V/U has density f.
Proof. We will show that the density of Z = V/U is f(z). Note that by definition, the joint density of (U, V) is

f_{(U,V)}(u, v) = I{(u, v) ∈ C} / ∫∫_C du dv .

Transforming to (U, Z) with Z = V/U (the Jacobian of (u, v) → (u, z) is u),

f_{(U,Z)}(u, z) = u I{ 0 ≤ u ≤ f^{1/2}(z) } / ∫∫_C du dv .
Now that we have the joint distribution of (U, Z), all we need to show is that the marginal density of Z is f. Integrating out U,

f_Z(z) = ∫ u I{ 0 ≤ u ≤ f^{1/2}(z) } du / ∫∫_C du dv
       = ( 1/∫∫_C du dv ) ∫_0^{f^{1/2}(z)} u du
       = f(z) / ( 2 ∫∫_C du dv ) .

Since f_Z(z) and f(z) are both densities, this implies that

1 = ∫ f_Z(z) dz = ∫ f(z) dz / ( 2 ∫∫_C du dv ) = 1 / ( 2 ∫∫_C du dv )  ⇒  ∫∫_C du dv = 1/2 .

This implies f_Z(z) = f(z). Thus, Z = V/U has the desired distribution.
Think back to the AR technique used to draw uniformly from a circle! If C is a bounded set, then if we enclose C in a rectangle, we can use accept-reject to draw uniform draws from C! So, the task is to find [0, a] × [b, c] such that C ⊆ [0, a] × [b, c].

Note now that inside C, if x = v/u, then u ≤ f^{1/2}(x) and v/x = u, which implies

v/x ≤ f^{1/2}(x)  ⇒  v ≤ x f^{1/2}(x) for x ≥ 0 (with the inequality reversed for x ≤ 0).
Now for the bounds:

a = sup_x f^{1/2}(x) ,  b = inf_{x≤0} x f^{1/2}(x) ,  c = sup_{x≥0} x f^{1/2}(x) .

Note that if f(x) or x² f(x) is unbounded, then C is unbounded, and the method cannot work. Now that we have found the rectangle [0, a] × [b, c], we can propose from the rectangle and check if the proposed value is in the region C; if it is, we accept it and return V/U. This leads to the following algorithm:
Algorithm 12 Ratio-of-Uniforms
1: Generate (U, V) ∼ U[0, a] × U[b, c]
2: If U ≤ √( f(V/U) ), then set X = V/U.
3: Else go to 1.
from C. To understand how effective this algorithm will be, we can calculate the probability of acceptance for the AR. First, note that

sup_{(u,v)∈C} f(u, v)/g(u, v) = [ 1/∫∫_C du dv ] / [ 1/(a(c − b)) ] = a(c − b)/(1/2) = 2a(c − b) .

Thus,

Pr(Accepting for AR in RoU) = 1/( 2a(c − b) ) .

So if a is large and/or (c − b) is large, the probability is small, and thus the algorithm will take a large number of loops to yield one acceptance.
Example 16 (Exponential(1)). Here, f(x) = e^{−x} for x ≥ 0, and

C = { (u, v) : 0 ≤ u ≤ e^{−v/(2u)} } .

The bounds are a = sup_{x≥0} e^{−x/2} = 1, b = 0 (since the support is x ≥ 0), and

c = sup_{x≥0} x e^{−x/2} = 2e^{−1} (show for yourself).
Example 17. Now consider the N(θ, σ²) target density

f(x) = (1/√(2πσ²)) e^{−(x−θ)²/(2σ²)} .
In order to draw the region later, we need to rearrange the bound u ≤ f^{1/2}(v/u), which gives us (by taking logs):

(v − θu)² ≤ −4σ²u² ( log u + (1/4) log(2πσ²) ) .

The above defines the region C. Now, in order to bound the region C, we find the limits a, b, c. (The following calculations were incorrect in class. I have now fixed them here.)
a = sup_{x∈R} (2πσ²)^{−1/4} e^{−(x−θ)²/(4σ²)} = (2πσ²)^{−1/4} ,

b = inf_{x≤0} (1/(2πσ²))^{1/4} x e^{−(x−θ)²/(4σ²)}  and  c = sup_{x≥0} (1/(2πσ²))^{1/4} x e^{−(x−θ)²/(4σ²)} .
First, we find b; then c will follow similarly. Note that b will be non-positive, and thus, to find the infimum, we first take the negative and then the log. That is, for x ≤ 0, let

A(x) = (1/(2πσ²))^{1/4} (−x) e^{−(x−θ)²/(4σ²)} .

Then A(x) is non-negative, and we want to find the supremum of A(x) for x ≤ 0. Taking the log:

log A(x) = −(1/4) log(2πσ²) + log(−x) − (x − θ)²/(4σ²)

⇒ d log A(x)/dx = 1/x − (x − θ)/(2σ²) , set to 0

⇒ x = ( θ ± √(θ² + 8σ²) )/2 .

Now, we need to decide which of ± to choose. Note that √(θ² + 8σ²) > θ. Hence, since we need x ≤ 0, we take the negative root and obtain

x_b := ( θ − √(θ² + 8σ²) )/2 .

Thus,

b = x_b f^{1/2}(x_b) .
Similarly, we obtain

x_c := ( θ + √(θ² + 8σ²) )/2 ,

with

c = x_c f^{1/2}(x_c) .
All that needs to be done now is to implement Algorithm 12 with these values of a, b, c, given the values of θ and σ².
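A sketch in R of the resulting sampler, implementing the bounds derived above (the function name rou_norm and the test values θ = 2, σ² = 9 are our own):

rou_norm <- function(N, theta, sigma2)
{
  f12 <- function(x) (2 * pi * sigma2)^(-1/4) * exp(-(x - theta)^2 / (4 * sigma2))
  a <- (2 * pi * sigma2)^(-1/4)
  xb <- (theta - sqrt(theta^2 + 8 * sigma2)) / 2
  xc <- (theta + sqrt(theta^2 + 8 * sigma2)) / 2
  b <- xb * f12(xb)
  cc <- xc * f12(xc)
  out <- numeric(N)
  for(i in 1:N)
  {
    repeat
    {
      U <- runif(1, min = 0, max = a)
      V <- runif(1, min = b, max = cc)
      if(U <= f12(V / U)) { out[i] <- V / U; break }
    }
  }
  out
}
set.seed(1)
samp <- rou_norm(1e4, theta = 2, sigma2 = 9)
c(mean(samp), var(samp))   # approximately 2 and 9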
3. For N (0, 1) between RoU and AR using Cauchy proposal, which is more efficient,
in terms of the expected number of uniforms required for one acceptance?
X = Y1 + Y2 + · · · + Yn ∼ Bin(n, p) .
So, we can simulate n Bernoulli variables, add them up, and we have a realization
from a Binomial(n, p).
X = Y1 + Y2 + · · · + Yr ∼ N B(r, p) .
3. Beta distribution: If X ∼ Gamma(a, 1) and Y ∼ Gamma(b, 1) are independent, then

X/(X + Y) ∼ Beta(a, b) .
4. Dirichlet distribution: The Dirichlet distribution is a distribution over pmfs (probability vectors). Its density is

f(x1, x2, . . . , xk) = ( Γ(α1 + · · · + αk) / ∏_{i=1}^k Γ(αi) ) ∏_{i=1}^k x_i^{αi−1} ,  0 ≤ x_i ≤ 1 ,  Σ_{i=1}^k x_i = 1 .
Draw independently

Y1 ∼ Gamma(α1, 1), Y2 ∼ Gamma(α2, 1), . . . , Yk ∼ Gamma(αk, 1),

and let

X_i = Y_i / Σ_{j=1}^k Y_j .

Then (X1, . . . , Xk) ∼ Dir(α1, α2, . . . , αk).
5. Chi-squared distribution: If Y1, Y2, . . . , Yk iid ∼ N(0, 1), then Σ_{i=1}^k Y_i² ∼ χ²_k.

6. t-distribution: If Z ∼ N(0, 1) and Y ∼ χ²_k are independent, then

X = Z / √(Y/k) ∼ t_k .
4.5.2 Sampling from mixture distributions
Mixture distributions for continuous densities are very similar to the discrete case. For j = 1, . . . , k, let F_j(x) be a distribution function and let f_j(x) be the corresponding density function. A mixture distribution function F is

F(x) = Σ_{j=1}^k p_j F_j(x)  where p_j > 0 , Σ_j p_j = 1 ,

with density

f(x) = Σ_{j=1}^k p_j f_j(x) .

If we can draw from each F_j, then we can draw from the mixture distribution F as well, using the same steps as in the discrete case.
Example 18 (Mixture of normals). Consider two normal distributions N(µ1, σ1²) and N(µ2, σ2²). For some 0 < p < 1, the mixture density is

f(x) = p f1(x) + (1 − p) f2(x) ,

where f_j is the N(µ_j, σ_j²) density. Mixture distributions are particularly useful for clustering problems and we will come back to them again in the data analysis part of the course. If we want to sample from this distribution:
Algorithm 14 Sampling from a Gaussian mixture
1: Generate U ∼ U[0, 1]
2: If U < p, generate X ∼ N(µ1, σ1²) (using the location-scale family trick)
3: Otherwise, generate X ∼ N(µ2, σ2²).
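A small R sketch of Algorithm 14 (the function name rmixnorm and the mixture parameters below are arbitrary illustrative choices):

rmixnorm <- function(n, p, mu1, sd1, mu2, sd2)
{
  U <- runif(n)
  # pick component 1 when U < p, component 2 otherwise
  ifelse(U < p, rnorm(n, mu1, sd1), rnorm(n, mu2, sd2))
}
set.seed(1)
samp <- rmixnorm(1e4, p = 0.3, mu1 = -2, sd1 = 1, mu2 = 3, sd2 = 0.5)
hist(samp, breaks = 50)   # two modes, near -2 and 3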
We have almost entirely focused on univariate densities, but most often interest is in multivariate/multidimensional target distributions. One general strategy is to decompose the joint density into a sequence of conditionals:

f(x) = f_{X1}(x1) f_{X2|X1}(x2 | x1) · · · f_{Xk|X1,...,Xk−1}(xk | x1, . . . , x_{k−1}) .
is the density of a multivariate normal distribution with mean µ and covariance Σ. First, note that since Σ is a positive-definite (symmetric) matrix, we can use the eigenvalue decomposition

Σ = QΛQ^{−1} ,

and define Σ^{1/2} = QΛ^{1/2}Q^{−1}, so that

Σ^{1/2} Σ^{1/2} = QΛ^{1/2}Q^{−1} QΛ^{1/2}Q^{−1} = QΛQ^{−1} = Σ .
Next, let Z ∼ N_k(0, I_k). For the normal distribution, if the covariance is zero, then the random variables are independent! This isn't true in general, but it is true for normal random variables. So, to sample from N_k(µ, Σ), we can sample Z1, Z2, . . . , Zk iid ∼ N(0, 1), set Z = (Z1, . . . , Zk)ᵀ, and then

X := µ + Σ^{1/2} Z ∼ N_k(µ, Σ) .
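A sketch of this in R, with the square root computed from the eigendecomposition (the function name rmvn and the 2 × 2 test covariance are our own choices):

rmvn <- function(n, mu, Sigma)
{
  k <- length(mu)
  eig <- eigen(Sigma, symmetric = TRUE)
  Sig.half <- eig$vectors %*% diag(sqrt(eig$values), k) %*% t(eig$vectors)
  Z <- matrix(rnorm(n * k), nrow = k, ncol = n)   # k iid N(0,1) per column
  t(mu + Sig.half %*% Z)                          # each row is one draw
}
set.seed(1)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
X <- rmvn(1e4, mu = c(1, -1), Sigma = Sigma)
colMeans(X)   # approximately (1, -1)
cov(X)        # approximately Sigma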
Questions to think about
• Can you construct a zero-inflated normal distribution and find a suitable appli-
cation of it?
4.6 Exercises
1. Using the inverse transform method, simulate from Exp(λ) for any λ > 0. Im-
plement this for λ = 5.
2. Use the inverse transform method to obtain samples from the Weibull(α, λ) distribution with density

f(x) = αλ x^{α−1} e^{−λx^α} ,  x > 0 .
3. (Ross 5.1) Give a method for generating a random variable having density function

f(x) = e^x/(e − 1) ,  0 ≤ x ≤ 1 .
4. (Ross 5.2) Give a method for generating a random variable having density function

f(x) = (x − 2)/2 if 2 ≤ x ≤ 3 ;  (2 − x/3)/2 if 3 ≤ x ≤ 6 .
5. (Ross 5.3) Use the inverse transform method to generate a random variable having distribution function

F(x) = (x² + x)/2 ,  0 ≤ x ≤ 1 .
f(x) = (3/4)(1 − x²) if x ∈ (−1, 1) ,  and 0 otherwise.
Suppose X is distributed as an exponential random variable conditioned on the event that X < 0.05. That is, its density function is

f(x) = e^{−x}/(1 − e^{−0.05}) ,  0 < x < 0.05 .

Using R, generate 1000 such random variables and use them to estimate E[X | X < 0.05].
8. (Ross 5.7) Suppose it is relatively easy to generate random variables from any of the distributions F_i, i = 1, . . . , k. How could we generate a random variable from the distribution function

F(x) = Σ_{i=1}^k p_i F_i(x) ,

where p_i ≥ 0 and Σ p_i = 1?
9. (Ross 5.8) Using the previous exercise, provide algorithms for generating random variables from the following distributions:

(a) F(x) = (x + x³ + x⁵)/3 ,  0 ≤ x ≤ 1 .

(b) F(x) = (1 − e^{−2x} + 2x)/3 if x ∈ (0, 1) ;  (3 − e^{−2x})/3 if x ∈ [1, ∞) .
10. (Ross 5.9) Give a method to generate a random variable with distribution function

F(x) = ∫_0^∞ x^y e^{−y} dy ,  0 ≤ x ≤ 1 .
11. (Ross 5.15) Give two methods for generating a random variable with density
function
f (x) = xe−x , 0 ≤ x < ∞ .
12. (Ross 5.18) Give an algorithm for generating a random variable having density function

f(x) = 2x e^{−x²} ,  x > 0 .
13. (Ross 5.19) Show how to generate a random variable whose distribution function is

F(x) = (x + x²)/2 ,  0 ≤ x ≤ 1 ,
using the inverse transform, accept-reject, and composition methods.
14. (Ross 5.20) Use the AR method to find an efficient way to generate a random variable having density function

f(x) = (1 + x) e^{−x}/2 ,  0 < x < ∞ .
15. (Ross 5.21) Consider the target density to be a truncated Gamma(α, 1), α < 1
defined on (a, ∞) for some a > 0. Suppose the proposal distribution is a truncated
exponential(λ), defined on the same (a, ∞). What is the best λ to use?
16. (Using R)
17. (Using R)
(a) Using accept-reject and a standard normal proposal, obtain samples from a truncated standard normal distribution with pdf:

f(x) = ( 1/(Φ(a) − Φ(−a)) ) (1/√(2π)) e^{−x²/2} I(−a < x < a) ,
Implement an accept-reject sampler with proposal distribution Np (0, I) with
a = 4 and p = 3, 10 and with a = 1 and p = 3, 10. Describe the differences
between these settings.
21. Use the ratio-of-uniforms method to sample from the distribution with density

f(x) = 1/x² ,  x ≥ 1 .
However, most customers will not enter into any accidents, so they will claim Rs 0. But when they do, they will claim reimbursement for some amount of money that, say, will follow a Gamma distribution:

f(x) = p I(x = 0) + (1 − p) ( β^α/Γ(α) ) x^{α−1} e^{−xβ} .
5 Importance Sampling
We have so far learned many (many!) ways of sampling from different distributions. These sampling methodologies are particularly useful when we want to estimate characteristics of F. Using computer-simulated samples from F to estimate characteristics of F is broadly termed Monte Carlo.
Note: there is no “data” here, there is just an integral! We are just interested in
estimating an annoying integral.
Note: notation EF [X] means the expectation is with respect to F . From now on, it is
very important to keep track of what the expectation is with respect to.
Suppose we want to estimate θ = E_F[h(X)], and suppose we can draw iid samples X1, . . . , XN ∼ F (this we can do using the many methods we have learned). Then, by the weak law of large numbers, as N → ∞, the simple Monte Carlo estimator satisfies

θ̂ = (1/N) Σ_{t=1}^N h(X_t) →p θ .
Further,

Var(θ̂) = Var( (1/N) Σ_{t=1}^N h(X_t) )
        = (1/N²) Σ_{t=1}^N Var_F( h(X_t) )   (because of independence)
        = Var_F( h(X1) )/N                   (because identically distributed)
        = σ²/N .
The central limit theorem then gives us the expected behavior of θ̂ for large values of N.
If Z1, . . . , ZN iid ∼ G, then an estimator of θ is

θ̂_g = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t) .
The estimator θ̂_g is the importance sampling estimator, the method is called importance sampling, and G is the importance distribution. Let

w(Z_t) = f(Z_t)/g(Z_t)

be the weight assigned to each point Z_t. Then θ̂_g is a weighted average of the h(Z_t). Intuitively, this means that depending on how likely a sampled value is under f relative to g, a weight is assigned to that value.
Example 19 (Moments of a Gamma distribution). Suppose we want to estimate the kth moment of a Gamma distribution. That is, let f be the density of a Gamma(α, β) distribution. Then

θ = ∫_0^∞ x^k ( β^α/Γ(α) ) x^{α−1} e^{−βx} dx .

With an Exp(λ) importance distribution,

θ̂_g = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t)
    = (1/N) Σ_{t=1}^N ( β^α/Γ(α) ) Z_t^k Z_t^{α−1} e^{−βZ_t} / ( λ e^{−λZ_t} ) ,

and taking expectations under G shows that E[θ̂_g] = θ: the estimator is unbiased.
Theorem 6. The importance sampling estimator

θ̂_g = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g(Z_t)

is consistent for θ. That is, as N → ∞, θ̂_g →p θ.
Proof. The law of large numbers applies to any sample average whose expectation is finite. So by the law of large numbers, as N → ∞,

θ̂_g →p E[θ̂_g] = θ .
This means that as we get more and more samples from G, our estimator will get
increasingly closer to the truth.
It is essential to quantify the variability of our estimator θ̂_g in order to ascertain how "erratic" or "stable" the estimator is. We also want to establish the expected behavior of θ̂_g: does a central limit theorem hold? Notice that θ̂_g is just a sample average, so we should be able to directly apply the CLT, if the variance is finite. Note that the variance of θ̂_g is

Var(θ̂_g) = Var_g( (1/N) Σ_{t=1}^N h(Z_t)f(Z_t)/g(Z_t) ) = (1/N) Var_g( h(Z1)f(Z1)/g(Z1) ) =: σ_g²/N .

A central limit theorem will hold if σ_g² = Var_g( h(Z1)f(Z1)/g(Z1) ) < ∞.
Theorem 7. Suppose σ² = Var_F(h(X)) < ∞. If g is chosen such that

sup_{z∈X} f(z)/g(z) ≤ M < ∞ ,

then σ_g² < ∞.
Proof. First note that the variance of a random variable is finite if and only if its second moment is finite. So, consider the second moment of h(Z)f(Z)/g(Z), where Z ∼ G:

E_G[ ( h(Z)f(Z)/g(Z) )² ] = ∫_X ( h(z)² f(z)²/g(z)² ) g(z) dz
                          = ∫_X ( f(z)/g(z) ) h(z)² f(z) dz
                          ≤ M ∫_X h(z)² f(z) dz
                          = M ( σ² + θ² ) < ∞ .
Further, an estimator of σ_g² is easily available since we have N samples of h(Z)f(Z)/g(Z). Thus, an estimator of σ_g² is the sample variance:

σ̂_g² := (1/(N − 1)) Σ_{t=1}^N ( h(Z_t)f(Z_t)/g(Z_t) − θ̂_g )² .
Example 20 (Gamma continued). Recall from the accept-reject example for Gamma(α, β) with α ≥ 1 that an Exponential(λ) proposal works only if λ < β. That means that when λ < β, there exists a finite M, and the importance sampling estimator has finite variance. ■
4. How would we check whether this importance sampler is better than iid Monte Carlo?
• Check whether Var_g(θ̂_g) = σ_g²/N is smaller than the simple Monte Carlo variance estimate.
Note that one reason to use importance sampling would be to obtain smaller-variance estimators than the original. So, if we can choose g such that σ_g² is minimized, that would be ideal! Writing σ_g² = E_G[ (h(Z)f(Z)/g(Z))² ] − θ², for σ_g² to be small the second-moment term should be close to θ². This logic leads to the following theorem.

Theorem 8. The density minimizing σ_g² over all importance densities g is

g*(z) = |h(z)| f(z) / E_F[|h(X)|] .
Proof. Consider the above importance density. The second moment of the importance sampling estimator with this density is:

θ² + σ²_{g*} = E_{G*}[ ( h(Z)f(Z)/g*(Z) )² ]
            = ∫_X ( h(z)² f(z)²/g*(z)² ) g*(z) dz
            = ∫_X ( h(z)² f(z)²/(|h(z)| f(z)) ) E_F[|h(X)|] dz
            = E_F[|h(X)|] ∫_X |h(z)| f(z) dz
            = ( ∫_X |h(z)| f(z) dz )²
            = ( ∫_X ( |h(z)| f(z)/g(z) ) g(z) dz )²   (for any other g defined on X)
            = ( E_G[ |h(Z)| f(Z)/g(Z) ] )²
            ≤ E_G[ h(Z)² f(Z)²/g(Z)² ]   (by Jensen's inequality: for a convex function ϕ, ϕ(E[X]) ≤ E[ϕ(X)])
            = θ² + σ_g² .

Hence σ²_{g*} ≤ σ_g². Since this is true for all g, g* produces the smallest variance.
If on the support X we have h(z) = |h(z)| (that is, h is non-negative), then the variance of the importance sampling estimator under g* is zero!
Example 21 (Gamma distribution). Consider estimating moments of a Gamma(α, β) distribution. We actually know the optimal importance distribution here! For estimating the kth moment, h(x) = x^k ≥ 0, so

g*(x) ∝ x^k · x^{α−1} e^{−βx} = x^{α+k−1} e^{−βx} .

So the optimal importance distribution is Gamma(α + k, β). The variance of the estimator in this case will be 0. ■
Example 22 (Mean of standard normal). Let h(x) = x and let f(x) be the density of a standard normal distribution. So we are interested in estimating the mean of the standard normal distribution. The universally optimal proposal in this case is

g*(x) = |x| e^{−x²/2} / ∫ |x| e^{−x²/2} dx .
But it may be quite challenging to draw samples from the above distribution! In order
for importance sampling to be useful, we need not find the optimal proposal, as long
as we can find a more efficient proposal than sampling from the target.
Consider an importance distribution of N(0, σ²) for some σ² > 0. The variance of the importance estimator is

σ_g² = ∫_{−∞}^∞ h(x)² f(x)²/g(x) dx
     = ∫_{−∞}^∞ x² (σ/√(2π)) exp{ −(x²/2)(2 − σ^{−2}) } dx
     = ( σ/√(2 − σ^{−2}) ) ∫_{−∞}^∞ x² √((2 − σ^{−2})/(2π)) exp{ −(x²/2)(2 − σ^{−2}) } dx
       [the integrand is x² times the density of N(0, (2 − σ^{−2})^{−1}), valid if σ² > 1/2]
     = ( σ/√(2 − σ^{−2}) ) · 1/(2 − σ^{−2})
     = σ/(2 − σ^{−2})^{3/2}   if σ² > 1/2 ,
else if σ² < 1/2, the integral diverges and the variance is infinite. Also, minimizing the variance:

arg min_{σ > 1/√2} σ/(2 − σ^{−2})^{3/2} = √2 .

Thus the optimal proposal has standard deviation σ = √2, not 1! Also, at σ² = 2, the variance is .7698, which is less than 1. ■
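A quick numerical check of this claim in R (a sketch; the helper is_terms is our own name for the per-sample IS terms h(Z)f(Z)/g(Z)):

set.seed(1)
N <- 1e5
is_terms <- function(sigma)
{
  Z <- rnorm(N, 0, sigma)
  Z * dnorm(Z) / dnorm(Z, 0, sigma)   # h(Z) f(Z) / g(Z)
}
var(is_terms(1))         # about 1: simple Monte Carlo
var(is_terms(sqrt(2)))   # about 0.77: the optimal N(0, 2) proposal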
• Does this mean that N (0, 2) is the optimal proposal for estimating the mean of
a standard normal?
• What is the optimal proposal within the class of Beta proposals for estimating
the mean of a Beta distribution?
5.3 Weighted Importance Sampling

Suppose now that the target and proposal densities are known only up to normalizing constants:

f(x) = a f̃(x) and g(x) = b g̃(x) ,

where f̃ and g̃ can be evaluated but a and b cannot. For simplicity (or rather uniformity in complexity), we will assume that both a and b are unknown.
Even though f is not fully known, we are interested in expectations under f. Suppose for some function h, the following integral is of interest:

θ := ∫_X h(x) f(x) dx .
Since a and b are unknown, we can't evaluate f(x) and g(x). So our original estimator does not work anymore! We can, however, evaluate f̃(x) and g̃(x). So if we could also estimate a and b, that would allow us to estimate θ. Instead, we will effectively estimate b/a, which also works!
Consider Z1, . . . , ZN iid ∼ G. The weighted importance sampling estimator of θ is

θ̂_w := [ Σ_{t=1}^N h(Z_t) f̃(Z_t)/g̃(Z_t) ] / [ Σ_{t=1}^N f̃(Z_t)/g̃(Z_t) ] .
Proof. Intuitively, both the numerator and the denominator are averages, so the LLN applies to each individually; then we may use Slutsky's theorem. This is the plan:

(1/N) Σ_{t=1}^N h(Z_t)f̃(Z_t)/g̃(Z_t) →p E_G[ h(Z)f̃(Z)/g̃(Z) ]  and  (1/N) Σ_{t=1}^N f̃(Z_t)/g̃(Z_t) →p E_G[ f̃(Z)/g̃(Z) ] .
First,

E_G[ h(Z)f̃(Z)/g̃(Z) ] = ∫_X h(z) ( f̃(z)/g̃(z) ) g(z) dz = (b/a) ∫_X h(z) f(z) dz = (b/a) θ .

Second,

E_G[ f̃(Z)/g̃(Z) ] = ∫_X ( f̃(z)/g̃(z) ) g(z) dz = (b/a) ∫_X f(z) dz = b/a .

So,

E_G[ h(Z)f̃(Z)/g̃(Z) ] / E_G[ f̃(Z)/g̃(Z) ] = (b/a)θ / (b/a) = θ .
We will denote

w(Z) = f̃(Z)/g̃(Z) .

Then w(Z) is called the un-normalized importance sampling weight.
Thus, even though we do not know a and b, the weighted importance sampling esti-
mator converges to the right quantity and is consistent. However, not knowing b and
a comes at a cost. Unlike the simple importance sampling estimator, the weighted
importance sampling estimator θ̂w is not unbiased.
To see this heuristically, let's assume (just for theory; we don't actually do this) that we obtained two different samples from G for the numerator and the denominator. That is, assume Z1, . . . , ZN iid ∼ G and T1, . . . , TN iid ∼ G. This allows us to break the expectation as follows:

E_G[ Σ_{t=1}^N h(Z_t)w(Z_t) / Σ_{t=1}^N w(T_t) ] = E_G[ Σ_{t=1}^N h(Z_t)w(Z_t) ] · E_G[ 1 / Σ_{t=1}^N w(T_t) ]
    ≥ E_G[ Σ_{t=1}^N h(Z_t)w(Z_t) ] · 1 / E_G[ Σ_{t=1}^N w(T_t) ]   (by Jensen's inequality)
    = θ ,
where equality holds only for a constant function. Thus, even with two independent samples for the numerator and denominator, we will still not obtain an unbiased estimator. This is then certainly true when we use the same sample as well. However, the proof of this is complex, and outside the scope of the course.
Further, an exact expression for the variance is difficult to obtain. Even if the numerator and denominator are assumed to come from independent samples, we may, after a series of approximations (omitted), obtain:

Var[θ̂_w] ≈ ( Var_G(h(Z)w(Z)) / [E_G(w(Z))]² ) ( 1 + Var_G(w(Z))/[E_G(w(Z))]² )
         = ( Var_G(h(Z)w(Z)) / (b²/a²) ) ( 1 + Var_G(w(Z))/[E_G(w(Z))]² )
         = Var_G( a h(Z)w(Z)/b ) ( 1 + Var_G(w(Z))/[E_G(w(Z))]² )
         = Var_G( h(Z) f(Z)/g(Z) ) ( 1 + Var_G(w(Z))/[E_G(w(Z))]² )
         = σ_g² ( 1 + Var_G(w(Z))/[E_G(w(Z))]² )
         ≥ σ_g² ,
where recall that σ_g² is the variance coming from the simple importance sampling estimator. This seems to indicate that weighted importance sampling is always worse than simple importance sampling, but in reality this is not true: here we assumed two independent samples in the approximation above. Since the samples in the denominator and numerator are actually the same, it is sometimes possible for the numerator and denominator to be negatively correlated, so that weighted importance sampling is better!
NOTE: Unlike simple importance sampling, the asymptotic normality of the weighted importance sampling estimator is challenging to verify, and the variance may not always be finite. Moreover, estimating the variance Var[θ̂_w] is no longer straightforward here!
Example 23. Consider estimating

θ = ∫_0^π ∫_0^π x y π(x, y) dx dy ,

where

π(x, y) ∝ e^{sin(xy)} ,  0 ≤ x, y ≤ π .

Notice that here the target distribution is bivariate, but the function h(x, y) = xy is a univariate mapping. Further, the target distribution is not a product of two marginals, so we have to implement multivariate importance sampling. Also, we do not know the normalizing constant: for some unknown a > 0,

π(x, y) = a e^{sin(xy)} ,  0 ≤ x, y ≤ π .

Take the uniform importance density on the square,

g(z, t) = (1/π²) I(0 < z < π) I(0 < t < π) ,

so that the un-normalized weights are

w(Z_t, T_t) = f̃(Z_t, T_t)/g̃(Z_t, T_t) = e^{sin(Z_t T_t)} .
5.4 Questions to think about
• How would you choose a good proposal for weighted importance sampling? Would
finding a proposal that yields a small variance suffice?
• Do you have intuition as to why the variance of the weighted importance sampling estimator is often larger than the variance of the simple importance sampler for the same importance proposal?
5.5 Exercises
Still working on below. Will add some more exercises by 8th Feb.
1. Estimate ∫_0^1 e^x dx using importance sampling.
2. Estimate ∫_{−∞}^∞ e^{−x²/2} dx using importance sampling.
NOTE: First, use the Z1 , Z2 , . . . , ZN points for all chosen values of t, and then
use different importance samples for different values of t.
finite variance?
7. For estimating kth moment of a Gamma(α, β) with α > 1 with the importance
distribution Exp(λ), show that the importance sampling estimator has infinite
variance when λ > β.
8. Suppose the target f and proposal g satisfy

sup_x f(x)/g(x) < ∞ .

In order to estimate the mean of the target density, is there any benefit to using importance sampling over accept-reject sampling?
9. If

sup_x f(x)/g(x) < ∞ ,

then we know that the simple importance estimator has finite variance. Does the weighted importance estimator also have finite variance?
10. For some known y_i ∈ R, i = 1, . . . , n and some ν > 2, suppose the target density is

f(x) ∝ e^{−x²/2} ∏_{i=1}^n ( 1 + (y_i − x)²/ν )^{−(ν+1)/2} .
set.seed(1)
n <- 50
nu <- 5
y <- rt(n, df = nu)
up Laplace distribution) and what is the corresponding simple importance
sampling estimator?
where π(x) is the density of a U[0, 10] and h(x) = e^{−2|x−5|}. We know we can do iid sampling; that is, sample X1, X2, . . . , XN from Unif[0, 10] and estimate θ. But this is simple Monte Carlo, not importance sampling. Using importance sampling, we can reduce the variance. First, note that the optimal importance distribution here is

g*(z) = (1/10) e^{−2|z−5|} / θ ,  z ∈ (0, 10) .

Here h(x) is exactly the density of Laplace(5, 1/2), so the optimal proposal g* is a Laplace(5, 1/2) truncated to (0, 10). But observe:

θ̂_{g*} = (1/N) Σ_{t=1}^N h(Z_t) f(Z_t)/g*(Z_t) = (1/N) Σ_{t=1}^N [ e^{−2|Z_t−5|} (1/10) ] / [ (1/10) e^{−2|Z_t−5|} / θ ] = θ !

So notice that for the optimal proposal, the IS estimator requires the very quantity we want! Thus it is impossible to implement the simple importance sampler here.
The weighted importance sampler, however, still works, even though the normalizing constant for g* (which is θ) is unknown. We have

g*(z) = (1/(10θ)) · e^{−2|z−5|} ,  z ∈ (0, 10) ,

with g̃*(z) = e^{−2|z−5|} and b = 1/(10θ) unknown.
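A sketch of the weighted estimator in R; we sample the truncated Laplace(5, 1/2) by inverse transform and simply reject the (very rare) draws outside (0, 10) (the helper rlaplace is our own):

set.seed(1)
rlaplace <- function(n, m = 5, b = 1/2)   # inverse transform for Laplace(m, b)
{
  U <- runif(n) - 0.5
  m - b * sign(U) * log(1 - 2 * abs(U))
}
N <- 1e5
Z <- rlaplace(N)
Z <- Z[Z > 0 & Z < 10]      # truncate to (0, 10); removes almost nothing
w <- exp(2 * abs(Z - 5))    # f.tilde / g.tilde, with f.tilde = I(0 < z < 10)
h <- exp(-2 * abs(Z - 5))
sum(h * w) / sum(w)         # the weighted IS estimate
(1 - exp(-10)) / 10         # the true theta, for comparison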
6 Likelihood Based Estimation
We have learned a fair amount about sampling from various distributions and estimating integrals. For the next few weeks, we will focus our attention on optimization methods for certain statistical procedures. The parameter θ can be a vector of parameters. After having obtained real data from F, we want to
1. estimate θ
iid
Example 24. Suppose we obtain X1 , X2 ∼ N (θ, 1), where we don’t know θ. Suppose
65
we obtain X1 = 2 =: x1 and X2 = 3 =: x2 . Then the likelihood function is:
We can plot the above function of θ for different values of θ to understand what the
likelihood of every value of θ is. ■
θ̂MLE is called the maximum likelihood estimator of θ. As you can understand, max-
imizing the function L(θ|x) may be complex for some problems and not possible to
do analytically. This is where we require numerical optimization methods to help us
obtain these MLEs. MLEs have nice theoretical properties and you will learn them in
MTH211a or MTH418a.
6.2.1 Examples
Example 25 (Bernoulli). Let X1, . . . , Xn iid ∼ Bern(p), 0 ≤ p ≤ 1. Then the likelihood is

L(p | x) = ∏_{i=1}^n Pr(X_i = x_i | p)
         = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i}
         = p^{Σ x_i} (1 − p)^{n − Σ x_i} .
To obtain the MLE of p, we will maximize the likelihood. Note that maximizing the likelihood is the same as maximizing the log of the likelihood, and the calculations are easier after taking the log:

l(p) := log L(p | x) = ( Σ_i x_i ) log p + ( n − Σ_i x_i ) log(1 − p)

⇒ dl(p)/dp = ( Σ_i x_i )/p − ( n − Σ_i x_i )/(1 − p) , set to 0

⇒ p̂ = (1/n) Σ_{i=1}^n x_i .

Checking the second derivative,

d²l(p)/dp² = −( Σ_i x_i )/p² − ( n − Σ_i x_i )/(1 − p)² < 0  for all p .
In any example, if I do not check the second derivative, you HAVE to check it for yourself.
Thus,

p̂_MLE = (1/n) Σ_{i=1}^n x_i . ■
Example 26 (Shifted exponential). The likelihood is

L(λ, µ | x) = ∏_{i=1}^n λ e^{−λ(x_i − µ)} I(x_i ≥ µ)
            = λ^n exp{ −λ ( Σ_{i=1}^n x_i − nµ ) } I(x1, . . . , xn ≥ µ) ,  ∀µ and λ > 0 .

But X1, . . . , Xn ≥ µ ⇔ min_i{X_i} ≥ µ. So

L(λ, µ | x) = λ^n exp{ −λ ( Σ_i x_i − nµ ) } I( min_i{x_i} ≥ µ ) ,  ∀µ and λ > 0 .
We will first maximize with respect to µ and then with respect to λ. Note that L(λ, µ|x) is an increasing function of µ within the restriction, so the MLE of µ is the largest allowed value: µ̂_MLE = min_i{X_i} = X_(1).
Substituting µ̂ = X_(1), the log-likelihood l(λ) = n log λ − λ( Σ_i X_i − nX_(1) ) is concave, so there is a unique maximum. Set

dl/dλ = 0
⇒ n/λ = Σ_{i=1}^n X_i − nX_(1)
⇒ λ̂_MLE = n / ( Σ_i X_i − nX_(1) ) .
6.3 Regression
We will focus a lot on variants of linear regression, and thus it is important to set up the premise of linear regression.
Y = Xβ + ε ,  where ε ∼ N_n(0, σ²I_n) ,  so that Y ∼ N_n(Xβ, σ²I_n) .
Note: I use capital Y to denote the population random variable and will use the small
y to denote realized observations.
The linear regression model is built to estimate β, which measures the linear effect of
X on Y . There is much more to linear regression and multiple courses are required to
study all aspects of it. However, here we will just focus on the mathematical properties
and optimization tools required to study them.
Example 27 (MLE for Linear Regression). In order to understand the linear relationship between X and Y, we will need to estimate β. Since we assume that the errors are normally distributed, we have a distribution available for the Ys and we may use the method of MLE. We have

L(β, σ² | y) = ∏_{i=1}^n f(y_i | X, β, σ²)
             = (1/√(2πσ²))^n exp{ −(1/2) (y − Xβ)ᵀ(y − Xβ)/σ² }

⇒ l(β, σ²) := log L(β, σ² | y) = −(n/2) log(2π) − (n/2) log(σ²) − (1/2) (y − Xβ)ᵀ(y − Xβ)/σ² .
Note that

(y − Xβ)ᵀ(y − Xβ) = yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ .
Taking derivatives and setting them to zero:

dl/dβ = −(1/(2σ²)) ( −2Xᵀy + 2XᵀXβ ) = ( Xᵀy − XᵀXβ )/σ² , set to 0
⇒ β̂_MLE = (XᵀX)^{−1} Xᵀy  (when XᵀX is invertible) ;

dl/dσ² = −n/(2σ²) + (y − Xβ)ᵀ(y − Xβ)/(2σ⁴) , set to 0
⇒ σ̂²_MLE = (y − Xβ̂_MLE)ᵀ(y − Xβ̂_MLE)/n .
Verify: that the Hessian matrix is negative definite, and thus the objective function is concave.
For example, if p > n, then the number of observations is less than the number of
parameters, and since X is n × p, (X T X) is p × p of rank n < p. So X T X is not full
rank and cannot be inverted. In this case, the MLE does not exist and other estimators
need to be constructed. This is one of the motivations of penalized regression, which
we will discuss in detail. ■
Suppose X is such that (XᵀX) is not invertible; then we don't know how to estimate β. In such cases, we may use a penalized likelihood, which penalizes the coefficients β so that some of the βs are "pushed towards zero". The Xs corresponding to those small βs are essentially unimportant, removing the singularity from XᵀX.
Instead of looking at the likelihood, we consider a penalized likelihood. Since the optimization of L(β|y) only depends on the (y − Xβ)ᵀ(y − Xβ) term, a penalized (negative) log-likelihood is used, and the final penalized (negative) log-likelihood is

Q(β) = (y − Xβ)ᵀ(y − Xβ)/2 + P(β) .

Here P(β) is a penalization function. Note that since we are now looking at the negative log-likelihood, we now want to minimize Q(β). The penalization function assigns large values to large β, so that the optimization problem favors small values of β.
There are many ways of penalizing β and each method yields a different estimator. A
popular one is the ridge penalty.
Example 28 (Ridge Regression). The ridge penalization term is P(β) = λβᵀβ/2 for some λ > 0, giving

Q(β) = (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ .

We will minimize Q(β) over the space of β, and since we are adding a term that grows with the size of β, smaller sizes of β will be preferred. Small sizes of β mean the corresponding Xs are less important, and this will eventually nullify the singularity in XᵀX. The larger λ is, the more "penalization" there is for large values of β; λ is typically user-chosen. We will study choosing λ when we cover "cross-validation" later.

β̂ = arg min_β { (y − Xβ)ᵀ(y − Xβ)/2 + (λ/2) βᵀβ } .

Setting the gradient to zero,

dQ(β)/dβ = (1/2)( −2Xᵀy + 2XᵀXβ ) + λβ , set to 0
⇒ (XᵀX + λI_p) β̂ − Xᵀy = 0
⇒ β̂_ridge = (XᵀX + λI_p)^{−1} Xᵀy .
Note: Verify that the Hessian matrix is positive definite for yourself.
Note that (XᵀX + λI_p) is always positive definite for λ > 0, since for any a ∈ R^p, a ≠ 0,

aᵀ(XᵀX + λI_p)a = ∥Xa∥² + λ∥a∥² > 0 .

Thus, the final ridge solution always exists even if XᵀX is not invertible.
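A short R sketch of the ridge solution on simulated data (the simulation settings mirror the exercises below; λ = 1 is an arbitrary illustrative choice):

set.seed(1)
n <- 50; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # intercept column of 1s
beta.star <- rnorm(p)
y <- X %*% beta.star + rnorm(n, sd = sqrt(1/2))
lambda <- 1
beta.ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(beta.star, beta.ridge)   # ridge shrinks the estimates toward zero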
Pros:
We have an estimate of β!
In terms of a certain criterion (we will learn later), we actually do better than non-
penalized estimation even when (X T X) is invertible.
Cons:
The estimator is not an MLE, so we cannot use distributional properties to construct
confidence intervals. This is a big problem, and is addressed by bootstrapping, which
we will get to.
1. Under the normal likelihood, what is the distribution of β̂MLE (when it exists)
and β̂ridge ? Are they unbiased? Which one has a smaller variance (covariance)?
2. What other penalization functions can you think of? Recall that β T β = ∥β∥22 .
This is a common challenge in many models and estimation problems, and requires
sophisticated optimization tools. In the next few weeks, we will go over some of these
optimization methods.
Example 29 (Gamma Distribution). Let X1, . . . , Xn iid ∼ Gamma(α, 1). The likelihood function is

L(α | x) = ∏_{i=1}^n (1/Γ(α)) x_i^{α−1} e^{−x_i} = Γ(α)^{−n} e^{−Σ x_i} ∏_{i=1}^n x_i^{α−1} ,

so

l(α) := log L(α | x) = −n log(Γ(α)) + (α − 1) Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i

⇒ dl/dα = −n Γ′(α)/Γ(α) + Σ_{i=1}^n log x_i , set to 0 .

Solving the above analytically is not possible. In fact, the form of Γ′(α)/Γ(α) is challenging to write analytically.
The second derivative is

d²l(α)/dα² = −n (d²/dα²) log(Γ(α)) < 0  (the polygamma function of order 1 is > 0) .
We cannot get an analytical form of the MLE for α. In such cases, we will use opti-
mization methods. ■
6.6 Exercises
1. Simple linear regression: Load the cars dataset in R:
data(cars)
Fit a linear regression model using maximum likelihood with response y being
the distance and x being speed. Remember to include an intercept term in X by
making the first column as a column of 1s. Do not use inbuilt functions in R to
fit the model.
Fit the linear regression model using maximum likelihood with response FuelC.
Remember to include an intercept in X.
3. Simulating data in R:
Let X ∈ Rn×p be the design matrix, where all entries in its first column equal
one (to form an intercept). Let xi,j be the (i, j)th element of X. For the ith
case, xi1 = 1 and xi2 , . . . , xip are the values of the p − 1 predictors. Let yi be the
response for the ith case and define y = (y1 , . . . , yn )T . The model assumes that
y is a realization of the random vector
Y ∼ Nn (Xβ∗ , σ∗2 In ) ,
where β∗ ∈ Rp are unknown regression coefficients and σ∗2 > 0 is the unknown
variance.
For our simulation, let's pick n = 50, p = 5, σ∗² = 1/2 and generate the entries of
β∗ as p independent draws from N(0, 1):
set.seed(1)
n <- 50
p <- 5
sigma2.star <- 1/2
beta.star <- rnorm(p)
beta.star # to output
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
We will create the design matrix X ∈ Rn×p , so that xi1 = 1 and the other entries
are from N (0, 1).
4. Find the MLE of β and σ² from the previous dataset. Is it close to β∗
and σ∗²? Find the ridge regression solution with λ = 0.01, 1, 10, 100.

5. In our original setup X ∈ R^{n×p}, all entries in its first column equal one, to
form an intercept. The MLE estimate (when it exists) is

β̂ = (X^T X)^{−1} X^T y .

Define X₋₁ to be the matrix X with its first column removed. Let ȳ = n^{−1} Σ_{i=1}^n y_i
and x̄^T = n^{−1} 1ₙ^T X₋₁ = (n^{−1} Σ_{i=1}^n x_{i2}, …, n^{−1} Σ_{i=1}^n x_{ip}). Let ỹ = (y₁ − ȳ, …, yₙ − ȳ)^T
and X̃ = X₋₁ − 1ₙ x̄^T. Then ỹ is the centered response and X̃ is the centered
design matrix. Suppose that

β̂₋₁ = (X̃^T X̃)^{−1} X̃^T ỹ   and   β̂₁ = ȳ − x̄^T β̂₋₁ .

Then (β̂₁, β̂₋₁^T)^T is equivalent to β̂ above. Verify this for the dataset generated in
Exercise 3.
Y_i ∼ Bern( exp(x_i^T β) / (1 + exp(x_i^T β)) ) .
7 Numerical optimization methods
When the true optimizer is not available analytically, we may use numerical
methods. A general numerical optimization problem is framed in the following way.
Let f(θ) be an objective function, the main function of interest, that needs to be
either maximized or minimized. Then we want to solve the maximization problem

θ∗ = arg max_θ f(θ) .

The algorithms we will learn generate a sequence {θ^(k)} such that
the goal is for θ^(k) → θ∗ in a deterministic manner (non-random convergence).

All methods that we will learn will find a local optima. Some will guarantee a local maxima
but not a global maxima; some will only guarantee a local optima (so max or
min), but not a global maxima. If the objective function is concave, then all methods
will guarantee a global maxima! Recall that f is concave if, for every vector a ≠ 0,

a^T ∇²f a < 0 .
Suppose that the objective function f is such that a second derivative exists. Since
f (θ) is maximized at the unknown θ∗ ,
f ′ (θ∗ ) = 0.
Applying a first-order Taylor series expansion (ignoring higher-order terms) to
f′(θ∗) about the current iterate θ^(k),

0 = f′(θ∗) ≈ f′(θ^(k)) + f″(θ^(k))(θ∗ − θ^(k))   ⇒   θ∗ ≈ θ^(k) − f′(θ^(k)) / f″(θ^(k)) ,

where the approximation is best when θ^(k) = θ∗ and the approximation is weak when
θ^(k) is far from θ∗. Thus, if we start from an arbitrary point and use successive updates
of the right-hand side, we will get closer and closer to θ∗. This is the Newton-Raphson update:

θ^(k+1) = θ^(k) − f′(θ^(k)) / f″(θ^(k)) .

Intuitively, when f′(θ^(k)) > 0 the function is increasing at θ^(k), and (since f″ < 0 near a maximum) Newton-Raphson increases θ^(k); and vice versa.
You stop iterating when |θ(k+1) −θ(k) | < ϵ for some chosen tolerance ϵ or |f ′ (θ(k+1) )| ≈ 0.
If the objective function is concave, the N-R method will converge to the
global maxima. Otherwise it converges to a local optima or diverges!
For the Gamma(α, 1) example, the first derivative is

f′(α) = −n Γ′(α)/Γ(α) + Σ_{i=1}^n log X_i ,

and the second derivative is

f″(α) = −n (d²/dα²) log Γ(α) < 0 .

Thus the log-likelihood is concave, which implies there is a global maxima! The
Newton-Raphson algorithm will converge to this global maxima:

α^(k+1) = α^(k) − f′(α^(k)) / f″(α^(k)) .

What is a good starting value α₀? Well, we know that the mean of a Gamma(α, 1) is
α, so a good starting value is α₀ = n^{−1} Σ_{i=1}^n X_i.
Note the impact of the sample size of the original data. The Newton-Raphson algorithm
converges to the MLE; if the data size is small, the MLE may not be close to the truth.
This is why, in the accompanying figure, the blue and red lines are far from each other. However, when we
increase the observed data to 1000 observations, the consistency of the MLE should
kick in, and we expect the blue and red lines to be similar.
Next, consider the location Cauchy distribution with density

f(x|µ) = 1 / ( π (1 + (x − µ)²) ) .
It is evident that a closed-form solution is difficult. So, we find the derivatives.
l′(µ) = 2 Σ_{i=1}^n (X_i − µ) / (1 + (X_i − µ)²) .

l″(µ) = 2 Σ_{i=1}^n [ (X_i − µ)² − 1 ] / [ 1 + (X_i − µ)² ]² ,
which may be positive or negative. So this is not a concave function. This implies that
Newton-Raphson is not guaranteed to converge to the global maxima! It may not even
converge to a local maxima, and may converge to a local minima. Thus, we will need
to be careful in choosing starting values.
1. Set µ0 = Median(Xi ) since the mean of Cauchy does not exist and the Cauchy
centered at µ is symmetric around µ.
2. Determine:

µ^(k+1) = µ^(k) − l′(µ^(k)) / l″(µ^(k)) .
■
The NR method can be found in the same way using the multivariate Taylor’s expan-
sion. Let θ = (θ1 , θ2 , . . . , θp ). Then first let ∇f denote the gradient of f and ∇2 f
denote the Hessian. So
∇f is the p-vector with ith entry ∂f/∂θ_i, and ∇²f is the p × p Hessian matrix with
(i, j)th entry ∂²f/∂θ_i ∂θ_j.
Then, the function f is concave if ∇²f is negative definite, so always check that first
to know whether there is a unique maximum. The multivariate Newton-Raphson update is

θ^(k+1) = θ^(k) − [∇²f(θ^(k))]^{−1} ∇f(θ^(k)) ,

and iterations are stopped when ∥∇f(θ^(k+1))∥ < ϵ for some chosen tolerance level ϵ.
Consider again the ridge regression problem:

Q(β) = (y − Xβ)^T (y − Xβ)/2 + (λ/2) β^T β ,   β̂ = arg min_β Q(β) .
Recall that we obtained

∇Q(β) = (X^T X + λ I_p) β − X^T y ,   ∇²Q(β) = X^T X + λ I_p .

From any starting point β_(0), one Newton-Raphson step gives

β_(1) = β_(0) − (X^T X + λ I_p)^{−1} [ (X^T X + λ I_p) β_(0) − X^T y ] = (X^T X + λ I_p)^{−1} X^T y .

This simplification gives us exactly the right solution! So one NR step leads us to
the analytical solution in this case! ■
Questions
1. Can you implement the Newton-Raphson procedure for linear regression and
ridge regression?
3. If the function is not concave and different starting values yield convergence to
different points (or divergence), then what do we do?
The Newton-Raphson method has drawbacks:
• when the objective function is not concave, NR is not guaranteed to even converge.
• when the objective function is complicated and high-dimensional, finding the
Hessian, and inverting it repeatedly may be expensive.
In such a case, gradient ascent (or gradient descent if the problem is a minimizing
problem) is a useful alternative as it does not require the Hessian.
Consider the objective function f(θ) that we want to maximize, and suppose θ∗ is the
true maximum. Then, by a Taylor series approximation at a fixed θ₀,

f(θ) ≈ f(θ₀) + f′(θ₀)(θ − θ₀) + (f″(θ₀)/2)(θ − θ₀)² .

If f″(θ) is unavailable or we don't want to use it, consider assuming that the second
derivative is a negative constant: f″(θ) = −1/t for some t > 0. That is, assume that f
is quadratic and concave. Then

f(θ) ≈ f(θ₀) + f′(θ₀)(θ − θ₀) − (1/(2t))(θ − θ₀)² .
Maximizing f(θ) under this crude approximation means maximizing the right-hand
side. Taking the derivative with respect to θ and setting it to zero:

f′(θ₀) − (θ − θ₀)/t  set=  0   ⇒   θ = θ₀ + t f′(θ₀) .

Using this intuition, given a θ^(k), the gradient ascent algorithm does the update

θ^(k+1) = θ^(k) + t f′(θ^(k)) ,

for t > 0. The iteration can be stopped when |θ^(k+1) − θ^(k)| < ϵ for ϵ > 0 or when
|f′(θ^(k+1))| ≈ 0.
For concave functions, there exists a t such that gradient ascent converges
to the global maxima. In general (when the function is not concave), there
exists a t such that gradient ascent converges to a local maxima, as long as
you don’t start from a local minima.
The algorithm essentially does a local concave quadratic approximation at the current
point θ^(k) and then maximizes that quadratic. The value of t indicates how
far we want to jump and is a tuning parameter. If t is large, we take big jumps; if t is
small, the jumps are smaller.
Example 33 (Location Cauchy distribution). Recall the location Cauchy distribution
with mode at µ ∈ R, where the log-likelihood is not guaranteed to be concave and thus
Newton-Raphson is difficult to use. The log-likelihood was
L(µ|X) = ∏_{i=1}^n f(X_i|µ) = π^{−n} ∏_{i=1}^n 1/(1 + (X_i − µ)²)

⇒ l(µ) := log L(µ|X) = −n log π − Σ_{i=1}^n log(1 + (X_i − µ)²) .
I choose t = 0.3 in this case, so that we obtain the following gradient ascent iterative
scheme:

µ^(k+1) = µ^(k) + (0.3) · 2 Σ_{i=1}^n (X_i − µ^(k)) / (1 + (X_i − µ^(k))²) .
1. Set µ0 = Median(Xi ) since the mean of Cauchy does not exist and the Cauchy
centered at µ is symmetric around µ.
2. Determine:

µ^(k+1) = µ^(k) + (0.3) · 2 Σ_{i=1}^n (X_i − µ^(k)) / (1 + (X_i − µ^(k))²) .
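A minimal R sketch of this scheme (the data are simulated, and the step size is scaled as t = 1/(2n) for stability; this scaling is an assumption in place of the fixed t = 0.3 above, which suits smaller samples):

set.seed(3)
X <- rcauchy(100, location = 2)   # hypothetical data; true mu = 2
n <- length(X)
t.step <- 1 / (2 * n)             # conservative step size (assumed)
mu <- median(X)                   # starting value
for (k in 1:1e5) {
  grad <- 2 * sum((X - mu) / (1 + (X - mu)^2))  # l'(mu)
  mu.new <- mu + t.step * grad
  if (abs(mu.new - mu) < 1e-10) break
  mu <- mu.new
}
mu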
By a multivariate Taylor series expansion, we can obtain a similar motivation, and the
iteration in the algorithm is

θ^(k+1) = θ^(k) + t ∇f(θ^(k)) .
The linear regression model above is suited to continuous responses. But when Y is a response of 1s and 0s (Bernoulli), then assuming
the Y s are normally distributed is not appropriate. Instead, when the ith covariate vector is
(x_{i1}, …, x_{ip})^T, logistic regression assumes, for β ∈ R^p, the model
Y_i ∼ Bern( e^{x_i^T β} / (1 + e^{x_i^T β}) ) .

In other words, the probability that any response takes the value 1 is

Pr(Y_i = 1) = e^{x_i^T β} / (1 + e^{x_i^T β}) =: p_i .
Our goal is to obtain the MLE of β. As usual, first we write down the log-likelihood.
L(β|Y) = ∏_{i=1}^n p_i^{y_i} (1 − p_i)^{1−y_i}

⇒ l(β) = Σ_{i=1}^n y_i log p_i + Σ_{i=1}^n (1 − y_i) log(1 − p_i)
= Σ_{i=1}^n log(1 − p_i) + Σ_{i=1}^n y_i ( log p_i − log(1 − p_i) )
= − Σ_{i=1}^n log( 1 + exp(x_i^T β) ) + Σ_{i=1}^n y_i x_i^T β .

Taking partial derivatives with respect to each component β_s,

∂l(β)/∂β_s = Σ_{i=1}^n y_i x_{is} − Σ_{i=1}^n x_{is} e^{Σ_j x_{ij} β_j} / ( 1 + e^{Σ_j x_{ij} β_j} )

⇒ ∇l(β) = Σ_{i=1}^n x_i ( y_i − 1/(1 + e^{−x_i^T β}) ) .
The above can be done directly for the full vector as well in one go:

∇l(β) = Σ_{i=1}^n x_i y_i − Σ_{i=1}^n x_i e^{x_i^T β} / (1 + e^{x_i^T β})
= Σ_{i=1}^n x_i [ y_i − e^{x_i^T β} / (1 + e^{x_i^T β}) ]
= Σ_{i=1}^n x_i [ y_i − 1/(1 + e^{−x_i^T β}) ]  set=  0 .
An analytical solution here is not possible, thus a numerical optimization tool is re-
quired. In order to know what kind of function we have, we also obtain the Hessian.
In order to calculate the Hessian, we can once again work element-wise or all at
once. Let's do it element-wise: for any pair of indices s, k,
" n n
P #
∂ 2 l(β) ∂ X X xis e j xij βj
= yi xis − P
βk βs ∂βk i=1 i=1
1 + e j xij βj
" n P #
∂ X xis e j xij βj
=− P
∂βk i=1 1 + e j xij βj
n
X P xik
=− xis e j xij βj P
(1 + e j xij βj )2
i=1
n
P
X e j xij βj
=− xik xis P
(1 + e j xij βj )2
i=1
= −X T W X ,
T T
where W is the n × n diagonal matrix with diagonal elements exi β /(1 + exi β )2 .
Note: You can verify by studying the Hessian above that it is negative semi-definite,
and thus that the likelihood function is indeed concave, so we can use either
Newton-Raphson or gradient ascent successfully. We will use gradient ascent, and
you should, on your own, implement N-R.
1. Choose a starting value β_(0) (say, β_(0) = 0) and a step size t > 0.
2. Iterate until ∥β_(k+1) − β_(k)∥ < ϵ.
3. For iteration k + 1:

β_(k+1) = β_(k) + t Σ_{i=1}^n x_i ( y_i − 1/(1 + e^{−x_i^T β_(k)}) ) .
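A minimal R sketch of gradient ascent for logistic regression (the simulated data, starting value, and step size are assumptions):

set.seed(4)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
beta.true <- c(-1, 1, 0.5)                          # hypothetical truth
y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta.true)))
beta <- rep(0, p)                                   # starting value
t.step <- 1 / n                                     # step size (assumed)
for (k in 1:1e5) {
  grad <- t(X) %*% (y - 1 / (1 + exp(-X %*% beta))) # gradient of l(beta)
  beta.new <- beta + t.step * c(grad)
  if (sum(abs(beta.new - beta)) < 1e-10) break
  beta <- beta.new
}
beta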
For i = 1, …, n, let x_i = (1, x_{i2}, …, x_{ip})^T be the vector of covariates for the ith
observation and β ∈ R^p be the corresponding vector of regression coefficients. Suppose
response y_i is a realization of Y_i with

Y_i ∼ Bern( Φ(x_i^T β) ) ,

where Φ(·) is the CDF of a standard normal distribution; this is the probit regression model. We want to find the MLE
of β. The likelihood is given by

L(β) = ∏_{i=1}^N Φ(x_i^T β)^{y_i} ( 1 − Φ(x_i^T β) )^{(1−y_i)} .

Define z_i = 2y_i − 1, so that z_i = 1 when y_i = 1 and z_i = −1 when y_i = 0, and recall
that 1 − Φ(u) = Φ(−u). So for y_i = 1, Φ(x_i^T β)^{y_i} = Φ(z_i x_i^T β)^{y_i}, and for y_i = 0, Φ(x_i^T β)^{y_i} = 1 = Φ(z_i x_i^T β)^{y_i}.
Similarly, for y_i = 0, Φ(−x_i^T β)^{(1−y_i)} = Φ(z_i x_i^T β)^{(1−y_i)}, and for y_i = 1, Φ(−x_i^T β)^{(1−y_i)} = 1 = Φ(z_i x_i^T β)^{(1−y_i)}.
Therefore, the likelihood can be written concisely as

L(β) = ∏_{i=1}^N Φ(z_i x_i^T β)^{y_i} Φ(z_i x_i^T β)^{(1−y_i)} = ∏_{i=1}^N Φ(z_i x_i^T β) ,

so that

l(β) = Σ_{i=1}^N log Φ(z_i x_i^T β) .
To find the value of the regression coefficients that maximizes the log-likelihood,
we use the iterative Newton-Raphson method. The gradient and Hessian calculations
are as follows:
∇l(β) = D = (d/dβ) Σ_{i=1}^N log Φ(z_i x_i^T β) = Σ_{i=1}^N f(z_i x_i^T β) z_i x_i / Φ(z_i x_i^T β) ,

where f(·) denotes the standard normal density. For the Hessian ∇²l(β) = DD, the (p, q)th element is

DD_{pq} = Σ_{i=1}^N (d/dβ_q) [ f(z_i x_i^T β) z_i x_{ip} / Φ(z_i x_i^T β) ]
= Σ_{i=1}^N [ − ( f(z_i x_i^T β)² / Φ(z_i x_i^T β)² ) (z_i x_{iq})(z_i x_{ip}) + ( f′(z_i x_i^T β) / Φ(z_i x_i^T β) ) (z_i x_{iq})(z_i x_{ip}) ] .

We can use that f′(x) = (−x/√(2π)) e^{−x²/2} = −x f(x), and that z_i² = 1. This gives us the following formulation
for the Hessian matrix:

DD_{pq} = − Σ_{i=1}^N [ ( f(z_i x_i^T β) / Φ(z_i x_i^T β) )² x_{ip} x_{iq} + ( f(z_i x_i^T β)(z_i x_i^T β) / Φ(z_i x_i^T β) ) x_{ip} x_{iq} ] .

Therefore,

DD = − Σ_{i=1}^N [ ( f(z_i x_i^T β) / Φ(z_i x_i^T β) )² + f(z_i x_i^T β)(z_i x_i^T β) / Φ(z_i x_i^T β) ] x_i x_i^T .
One can show that the above is negative definite, and thus the function is concave
(this will be an exercise). Now we can apply Gradient Ascent algorithm or Newton-
Raphson. ■
7.3 MM Algorithm
Consider obtaining a solution to

θ∗ = arg max_θ f(θ) .

An MM algorithm works with a minorizing function f̃(θ|θ^(k)): a function satisfying
f̃(θ|θ^(k)) ≤ f(θ) for all θ, with equality at θ = θ^(k). Each iteration maximizes the
minorizing function, θ^(k+1) = arg max_θ f̃(θ|θ^(k)). The algorithm has the ascent
property in that every update increases the objective value. That is,

f(θ^(k+1)) ≥ f̃(θ^(k+1)|θ^(k)) ≥ f̃(θ^(k)|θ^(k)) = f(θ^(k)) .

Thus, if we want to maximize f(θ), we may find a minorizing function for it, and then
repeatedly maximize it. The key to implementing the MM algorithm is finding a
good minorizing function. This can be done in a few different ways and is generally
application specific.
The question is, how to construct such minorizing functions? This is typically done on
a case-by-case basis. Many inequalities are used:
• Jensen’s inequality
1
f (θ) = f (θ(k) ) + f ′ (θ(k) )(θ − θ(k) ) + f ′′ (z)(θ − θk )2 ,
2
where z is some constant between θk and θ (by the mean value theorem).
1
f˜(θ|θ(k) ) = f (θ(k) ) + f ′ (θ(k) )(θ − θ(k) ) + L(θ − θk )2
2
Consider again the location Cauchy log-likelihood f(µ) = l(µ), with derivatives

f′(µ) = 2 Σ_{i=1}^n (X_i − µ) / (1 + (X_i − µ)²) ,

f″(µ) = 2 Σ_{i=1}^n [ (X_i − µ)² − 1 ] / [ 1 + (X_i − µ)² ]² .
Consider the Taylor’s expansion:
1
f (µ) = f (µ(k) ) + f ′ (µ(k) )(µ − µ(k) ) + f ′′ (µ(k) )(µ − µ(k) )2
2
(Xi − µ)2 − 1
−1
2 ≥2
[1 + (Xi − µ)2 ]2 [1 + (Xi − µ)2 ]2
≥ −2 := L .
The main utility of the MM algorithm is that it can be used when the gradient of the
objective function is unavailable. An excellent example of this is the following bridge
regression problem.
Example 37 (Bridge regression). Recall the case of the penalized (negative) log-
likelihood problem during ridge regression, where the objective function was:
Q(β) = (y − Xβ)^T (y − Xβ)/2 + (λ/2) Σ_{i=1}^p β_i² .

By changing the penalty function, the above ridge regression objective function can be
generalized to bridge regression, with objective function

Q_B(β) = (y − Xβ)^T (y − Xβ)/2 + (λ/α) Σ_{i=1}^p |β_i|^α ,

for α ∈ [1, 2] and λ > 0. When α = 2, this is ridge regression, and when α = 1,
this is the popular lasso regression. Different choices of α lead to different styles of
penalization. For a given λ, smaller values of α push the estimates closer towards zero.
We need to find the bridge regression estimates

β̂_B = arg min_β Q_B(β) .

First note that (y − Xβ)^T (y − Xβ) is a convex function and |β_i|^α is convex for α ≥ 1.
Since a sum of convex functions is convex, the objective function is convex,
and thus our optimization algorithms will find the global minima.
Note that for α = 1, the objective function is not differentiable at 0, and for α ∈
(1, 2), the function is not twice differentiable at 0. Thus, using Newton-Raphson and
gradient descent is not possible. We will instead use an MM algorithm. Since this
is a minimization problem, we will find a majorizing function and then minimize the
majorizing function.
We will try to find a majorizing function that upper bounds the objective QB (β), and
then minimize the majorizing function. Intuitively, optimizing the majorizing function
will again require derivatives of the majorizing function. Thus our goal is to find a
majorizing function that does not contain an absolute value, and is thus differentiable.
Note further that the (y − Xβ)^T (y − Xβ) term is well behaved and quadratic. Thus, we
are not inclined to touch this part. Instead, we would like to upper bound Σ_{i=1}^p |β_i|^α.
Let h(u) = u^{α/2} for u > 0, so that h(|β_i|²) = |β_i|^α. Then

h′(u) = (α/2) u^{α/2 − 1}

and

h″(u) = (α/2)(α/2 − 1) u^{α/2 − 2} ≤ 0 ,

so h(u) is a concave function for α ∈ [1, 2]. For a concave function, by the "Rooftop
theorem", the first-order Taylor series creates a tangent line that lies above the function.
Thus for any u∗,

h(u) ≤ h(u∗) + h′(u∗)(u − u∗) = h(u∗) + (α/2)(u∗)^{α/2 − 1}(u − u∗) .
For any given iteration of the optimization, given β_(k), take u = |β_i|² and u∗ = |β_{i,(k)}|²,
where β_i is the ith component of the vector β. Then

h(u) = |β_i|^α ≤ |β_{i,(k)}|^α + (α/2) |β_{i,(k)}|^{α−2} ( β_i² − β_{i,(k)}² )
= |β_{i,(k)}|^α − (α/2) |β_{i,(k)}|^α + (α/2) |β_{i,(k)}|^{α−2} β_i²
= constants + (m_{i,(k)}/2) β_i² ,
where mi,(k) = α|βi,(k) |α−2 . (You will see that the constants will not be important.)
Now that we have upper bounded the penalty function, we have an upper bound
on the full objective function:

Q_B(β) ≤ constants + (y − Xβ)^T (y − Xβ)/2 + (λ/(2α)) Σ_{j=1}^p m_{j,(k)} β_j² .
• Recall that we obtained the upper bound function using a derivative of h(u).
This derivative is not defined at u = 0. However, we’re only using the derivative
function at u∗ = |βi,(k) |2 , which is the previous iteration. So as long as we DO
NOT START at zero, this upper bound is valid.
Minimizing this majorizing function is a ridge-type problem; setting its gradient to zero gives the update

β_(k+1) = ( X^T X + λ M_(k) )^{−1} X^T y ,

where M_(k) = diag( m_{1,(k)}/α, m_{2,(k)}/α, …, m_{p,(k)}/α ). Note that here M_(k) is what drives
the direction of the optimization.
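A minimal R sketch of this MM scheme (the data generation, λ, and α are assumptions; note the non-zero starting value):

set.seed(6)
n <- 50; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
y <- X %*% rnorm(p) + rnorm(n)
lambda <- 1; alpha <- 1.5                    # assumed tuning values
beta <- rep(1, p)                            # do NOT start at zero
for (k in 1:1000) {
  m <- alpha * abs(beta)^(alpha - 2)         # m_{i,(k)}
  M <- diag(m / alpha)                       # M_(k)
  beta.new <- c(solve(t(X) %*% X + lambda * M, t(X) %*% y))
  if (sum(abs(beta.new - beta)) < 1e-10) break
  beta <- beta.new
}
beta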
7.4 Exercises
1. Find the MLE of (α, µ) for a Pareto distribution with density
f(x) = α µ^α / x^{α+1} ,   x ≥ µ ;   µ, α > 0 .
2. Consider an objective function of the form f(θ) = aθ² + bθ + c. Write the iterates
of the Newton-Raphson algorithm. In how many iterates do you expect NR to
converge for any starting value θ^(0)?
3. (Using R) Find the MLE of (µ, σ 2 ) for the N (µ, σ 2 ) distribution using Newton-
Raphson’s method. Compare with the closed form estimates.
5. Consider estimating the first moment of Exp(λ) using simple importance sam-
pling with a Gamma(α, β) proposal distribution. The density of an exponential
is
π(x) = λ e^{−λx} ,   x > 0 ,

and the density of the Gamma(α, β) proposal is

g(x) = ( β^α / Γ(α) ) x^{α−1} e^{−βx} ,   x > 0 .
(a) Construct the simple importance sampling estimator for general α, β, and λ.
(b) Denote the variance of the estimator in (a) as κ_λ(α, β). For what values of
α and β is κ_λ(α, β) infinite? You will have to solve the integral here.
(c) Set β = 1. Describe an algorithm for obtaining the best proposal within
this family of proposals. Give all details. Use the above constructed algorithm
to obtain the best Gamma(α, 1) for λ = 5. Here you may require an
optimization algorithm.
(d) Is the proposal obtained in (c) the universally optimal proposal for this problem?
6. Generate data according to the logistic regression model above with n = 50, and
use the Newton-Raphson and gradient ascent algorithms to find the MLE.
7. Implement Newton-Raphson and gradient ascent to find the MLE of (α, β) for a
Beta(α, β) distribution. Generate your own data to implement this in R.
f(x) = 1 / ( π (1 + (x − µ)²) ) ,   µ ∈ R, x ∈ R .
Suppose the kth iteration is such that f′(θ^(k)) > 0 (that is, the function is increasing
at θ^(k)), and NR takes us to θ^(k+1) where f′(θ^(k+1)) < 0, so that now the
function is decreasing. This means we may have overshot! As a compromise, we
may want to implement the following algorithm:

θ^(k+1) = θ^(k) − λ^(k) f′(θ^(k)) / f″(θ^(k)) ,

where λ^(k) is a step-factor sequence chosen at every iteration so that

f(θ^(k+1)) > f(θ^(k)) .

That is, the next value must be such that we achieve an increase in the objective
function. You can choose λ^(k) so that:

(a) If f(θ^(k+1)) > f(θ^(k)), then λ^(k) = 1. Else set λ^(k) = 1/2 and recalculate f(θ^(k+1)).
(b) If f(θ^(k+1)) > f(θ^(k)), then continue; else set λ^(k) = 1/2² and so on…
Implement this modified algorithm for the Gamma and Cauchy examples.
11. Consider the logistic regression model: for i = 1, . . . , n, let xi = (1, xi2 , . . . , xip )T
be the vector of covariates for the ith observation and β ∈ Rp be the correspond-
ing vector of regression coefficients. Suppose response yi is a realization of Yi
with
Y_i ∼ Bern(p_i)   where   p_i = exp(−x_i^T β) / ( 1 + exp(−x_i^T β) ) .
(a) Write the negative log-likelihood, −l(β).
(b) Consider the ridge logistic regression problem, which minimizes the follow-
ing penalized negative log-likelihood:
Q(β) = −l(β) + (λ/2) Σ_{i=1}^p β_i² .
(c) Write the Newton-Raphson algorithm for minimizing Q(β). Write all steps
clearly.
(a) Find the joint likelihood L(β | y1 , . . . yn ) and obtain the maximum likeli-
hood estimator of β. If not available in closed-form, present the complete
optimization algorithm to obtain β̂MLE .
8 The EM algorithm
An important application of the MM algorithm is the Expectation-Maximization (EM)
algorithm. Since the EM algorithm is an integral part of statistics on its own, we study
it separately. We will first motivate the EM algorithm with an example.
8.1 Motivating example: Gaussian mixture models

Consider data from a two-component Gaussian mixture, with density

f(x | µ₁, µ₂, σ₁², σ₂², p∗) = p∗ f₁(x | µ₁, σ₁²) + (1 − p∗) f₂(x | µ₂, σ₂²) ,

where f_i(x | µ_i, σ_i²) is the density of the N(µ_i, σ_i²) distribution for i = 1, 2, and
p∗ ∈ (0, 1). Data following such distributions arise often in real life. For instance, suppose we collect batting
averages for players in cricket. Then bowlers are expected to have low averages and
batters are expected to have high averages, creating a mixture-like distribution.
Given data, suppose we wish to find the maximum likelihood estimates of all 5 param-
eters: (µ1 , µ2 , σ12 , σ22 , p∗ ). That is, we want to maximize:
l(µ₁, µ₂, σ₁², σ₂², p∗ | X) = Σ_{i=1}^n log f(x_i | µ₁, µ₂, σ₁², σ₂², p∗)
= Σ_{i=1}^n log[ p∗ f₁(x_i | µ₁, σ₁²) + (1 − p∗) f₂(x_i | µ₂, σ₂²) ] .
Suppose we had information about the class of each x_i (classes being class 1
and class 2). That is, suppose the complete data were of the form

(X₁, Z₁), (X₂, Z₂), …, (Xₙ, Zₙ) ,

where each Z_i = k means that X_i is from population k. If this complete data were
available to us, then first note that the joint probability density/mass function is

f(x, z) = [ p∗ f₁(x | µ₁, σ₁²) ]^{I(z=1)} [ (1 − p∗) f₂(x | µ₂, σ₂²) ]^{I(z=2)} .

Letting D_k = {i : z_i = k} and d_k = |D_k| for k = 1, 2, the complete log-likelihood is

l(µ₁, µ₂, σ₁², σ₂², p∗ | x, z) = Σ_{i∈D₁} [ log p∗ + log f₁(x_i | µ₁, σ₁²) ] + Σ_{i∈D₂} [ log(1 − p∗) + log f₂(x_i | µ₂, σ₂²) ] .

This complete log-likelihood is in a far nicer format, so that closed-form estimates are
available.
∂l/∂p∗ = d₁/p∗ − d₂/(1 − p∗)  set=  0   ⇒   p̂∗ = d₁/(d₁ + d₂) = d₁/n .
You can also see that the MLEs for µ1 , µ2 , σ12 , σ22 have all been isolated so that the
MLEs can be easily obtained from the usual Gaussian likelihood.
µ̂₁ = (1/d₁) Σ_{i∈D₁} X_i ,   µ̂₂ = (1/d₂) Σ_{i∈D₂} X_i ,

σ̂₁² = (1/d₁) Σ_{i∈D₁} (X_i − µ̂₁)² ,   σ̂₂² = (1/d₂) Σ_{i∈D₂} (X_i − µ̂₂)² .
However, remember that the actual Zs are unknown to us, so we can't actually
calculate d₁ and d₂, and the above cannot be evaluated. The EM algorithm will solve
this problem by estimating the unobserved Z_i corresponding to each X_i in an iterative
manner.
In general, suppose the observed data x have density f(x|θ) that can be written as

f(x|θ) = ∫ f(x, z|θ) dν_z

for some unobserved (latent) data z, where ∫ · dν_z denotes an integral or a sum based on whether Z is continuous or
discrete. The goal is to maximize the observed-data log-likelihood log f(x|θ) over θ.
The EM algorithm produces iterates {θ^(k)} in order to solve the above maximization
problem. Writing the target objective function in this form allows us to do the
optimization using EM. The EM algorithm iterates through an "E" step (Expectation)
and an "M" step (Maximization). Consider a starting value θ₀. Then, for any (k + 1)th
iteration:

1. E-Step: Compute

q(θ|θ^(k)) = E_{Z|X}[ log f(x, Z|θ) | X = x, θ^(k) ] .

2. M-Step: Compute

θ^(k+1) = arg max_{θ∈Θ} q(θ|θ^(k)) .
The following theorem will convince us that running the above algorithm will ensure
the θ(k) converges to a local maxima. The trick to implementing the EM algorithm for
a general problem is in finding an appropriate joint distribution (X, Z) for which the
E-step is computable.
Theorem 10. The EM algorithm is an MM algorithm and thus has the ascent property.
Proof. In order to show the result we first need to find a minorizing function.
The objective function is log f (x|θ). So we need to find f˜(θ|θ(k) ) such that f˜(θ(k) |θ(k) ) =
log f (x|θ(k) ) and in general
f˜(θ|θ(k) ) ≤ log f (x|θ)
Let

f̃(θ|θ^(k)) = ∫_z log{f(x, z|θ)} f(z|x, θ^(k)) dz + log f(x|θ^(k)) − ∫_z log{f(x, z|θ^(k))} f(z|x, θ^(k)) dz .
(The proof technique is setup for continuous Z, but the same proof works for discrete
Z as well.)
Naturally, we can see that at θ = θ^(k), f̃(θ^(k)|θ^(k)) = log f(x|θ^(k)). We will now show
the minorizing property.
f̃(θ|θ^(k))
= ∫_z log{f(x, z|θ)} f(z|x, θ^(k)) dz + log f(x|θ^(k)) − ∫_z log{f(x, z|θ^(k))} f(z|x, θ^(k)) dz
= ∫_z log{f(x, z|θ)} f(z|x, θ^(k)) dz + ∫_z log f(x|θ^(k)) f(z|x, θ^(k)) dz − ∫_z log{f(x, z|θ^(k))} f(z|x, θ^(k)) dz
= ∫_z log[ f(x, z|θ) f(x|θ^(k)) / f(x, z|θ^(k)) ] f(z|x, θ^(k)) dz
= ∫_z log[ f(x, z|θ) f(x|θ^(k)) / f(x, z|θ^(k)) ] f(z|x, θ^(k)) dz + log f(x|θ) − log f(x|θ)
= ∫_z log[ f(x, z|θ) f(x|θ^(k)) / ( f(x, z|θ^(k)) f(x|θ) ) ] f(z|x, θ^(k)) dz + log f(x|θ) .

By Jensen's inequality,

f̃(θ|θ^(k)) ≤ log ∫_z [ f(x, z|θ) f(x|θ^(k)) / ( f(x, z|θ^(k)) f(x|θ) ) ] f(z|x, θ^(k)) dz + log f(x|θ)
= log ∫_z [ f(z|x, θ) / f(z|x, θ^(k)) ] f(z|x, θ^(k)) dz + log f(x|θ)
= log ∫_z f(z|x, θ) dz + log f(x|θ)
= log f(x|θ) .

Thus f̃(θ|θ^(k)) minorizes log f(x|θ), with equality at θ = θ^(k). Since the last two terms of
f̃(θ|θ^(k)) do not depend on θ, maximizing f̃(θ|θ^(k)) over θ is the same as maximizing
q(θ|θ^(k)), so the EM update is an MM update and the ascent property follows.
8.3 (Back to) Gaussian mixture likelihood

We now return to the Gaussian mixture problem, in the general C-component form

f(x | θ) = Σ_{j=1}^C π_j f_j(x | µ_j, σ_j²) ,

where θ = (µ₁, …, µ_C, σ₁², …, σ_C², π₁, …, π_{C−1}). The setup is the same as before:
suppose we had the complete data (X_i, Z_i), where Z_i = j indicates that X_i is from
component j, with Pr(Z_i = j) = π_j.
To implement the EM algorithm for this example, we first need to find q(θ|θ^(k)), which
requires finding the distribution of Z|X. This can be done by Bayes' theorem. For any
kth iterate with current value θ^(k) = (µ_{1,k}, …, µ_{C,k}, σ²_{1,k}, …, σ²_{C,k}, π_{1,k}, …, π_{C−1,k}), we have

Pr(Z = c | X = x_i, θ^(k)) = f_c(x_i | µ_{c,k}, σ²_{c,k}) π_{c,k} / Σ_{j=1}^C f_j(x_i | µ_{j,k}, σ²_{j,k}) π_{j,k} =: γ_{i,c,k} .
NOTE: the γ_{i,c} are themselves quantities of interest, since they tell us the probability of the ith
observation being in class c. In this way, at the end we get probabilities of association
for each data point that inform us about the classification of the observations.
Next,

q(θ | θ^(k)) = E_{Z|X}[ log f(x, z|θ) | X = x, θ^(k) ]
= E_{Z|X}[ Σ_{i=1}^n log f(x_i, z_i|θ) | X = x, θ^(k) ]
= Σ_{i=1}^n E_{Z_i|X_i}[ log f(x_i, z_i|θ) | X_i = x_i, θ^(k) ]
= Σ_{i=1}^n Σ_{c=1}^C [ log f(x_i, z_i = c | θ) ] Pr(Z = c | X = x_i, θ^(k))
= Σ_{i=1}^n Σ_{c=1}^C [ log( f_c(x_i | µ_c, σ_c²) π_c ) ] γ_{i,c,k} .
At each iteration we will need to store the value of γ_{i,c,k} in order to implement the E-step:

γ_{i,c,k} = f_c(x_i | µ_{c,k}, σ²_{c,k}) π_{c,k} / Σ_{j=1}^C f_j(x_i | µ_{j,k}, σ²_{j,k}) π_{j,k} .
This completes the E-step. We move on to the M-step. To complete the M-step, expand

q(θ | θ^(k)) = Σ_{i=1}^n Σ_{c=1}^C [ log f_c(x_i | µ_c, σ_c²) + log π_c ] γ_{i,c,k}
= Σ_{i=1}^n Σ_{c=1}^C [ −(1/2) log(2π) − (1/2) log σ_c² − (x_i − µ_c)²/(2σ_c²) + log π_c ] γ_{i,c,k}
= const − (1/2) Σ_{c=1}^C log σ_c² Σ_{i=1}^n γ_{i,c,k} − Σ_{c=1}^C Σ_{i=1}^n (x_i − µ_c)² γ_{i,c,k}/(2σ_c²) + Σ_{c=1}^C log π_c Σ_{i=1}^n γ_{i,c,k} .
Setting derivatives to zero:

∂q/∂µ_c = Σ_{i=1}^n (x_i − µ_c) γ_{i,c,k}/σ_c²  set=  0   ⇒   µ_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,k} x_i / Σ_{i=1}^n γ_{i,c,k} ,   (3)

∂q/∂σ_c² = −(1/2) Σ_{i=1}^n γ_{i,c,k}/σ_c² + Σ_{i=1}^n (x_i − µ_c)² γ_{i,c,k}/(2σ_c⁴)  set=  0   ⇒   σ²_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,k} (x_i − µ_{c,(k+1)})² / Σ_{i=1}^n γ_{i,c,k} .   (4)
(You can show for yourself that the second derivative at these solutions is negative.)
For the weights, we must respect the constraint Σ_c π_c = 1, so we maximize the Lagrangian

q̃(θ | θ^(k)) = q(θ | θ^(k)) − λ ( Σ_{c=1}^C π_c − 1 ) .
Taking the derivative with respect to π_c,

∂q̃/∂π_c = Σ_{i=1}^n γ_{i,c,k}/π_c − λ  set=  0
⇒ π_c = Σ_{i=1}^n γ_{i,c,k} / λ
⇒ Σ_{c=1}^C π_c = Σ_{c=1}^C Σ_{i=1}^n γ_{i,c,k} / λ
⇒ 1 = (1/λ) Σ_{i=1}^n 1 = n/λ   (since Σ_{c=1}^C γ_{i,c,k} = 1 for each i)
⇒ λ = n
⇒ π_{c,(k+1)} = (1/n) Σ_{i=1}^n γ_{i,c,k} .   (5)
(and the second derivative is clearly negative). Thus equations (3), (4) and (5) provide
the iterative updates for the parameters. The final algorithm is:
1: Choose starting values θ_(0) and repeat the following until convergence.
2: E-step: for all i = 1, …, n and c = 1, …, C, compute

γ_{i,c,(k)} = f_c(x_i | µ_{c,(k)}, σ²_{c,(k)}) π_{c,(k)} / Σ_{j=1}^C f_j(x_i | µ_{j,(k)}, σ²_{j,(k)}) π_{j,(k)} .

3: M-step: for all c = 1, …, C,

µ_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,(k)} x_i / Σ_{i=1}^n γ_{i,c,(k)}

σ²_{c,(k+1)} = Σ_{i=1}^n γ_{i,c,(k)} (x_i − µ_{c,(k+1)})² / Σ_{i=1}^n γ_{i,c,(k)}

π_{c,(k+1)} = (1/n) Σ_{i=1}^n γ_{i,c,(k)} .
Note: The target likelihood f (x|θ) is not concave, so the algorithm is not guaranteed
to converge to a global maxima.
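A minimal R sketch of this EM algorithm for C = 2 components (the simulated data and starting values are assumptions):

set.seed(7)
n <- 500
z.true <- rbinom(n, 1, 0.3) + 1                       # hypothetical true classes
x <- ifelse(z.true == 1, rnorm(n, 0, 1), rnorm(n, 4, 1.5))
mu <- c(-1, 5); sig2 <- c(1, 1); pi.c <- c(0.5, 0.5)  # starting values
for (k in 1:1000) {
  # E-step: responsibilities gamma_{i,c}
  d1 <- pi.c[1] * dnorm(x, mu[1], sqrt(sig2[1]))
  d2 <- pi.c[2] * dnorm(x, mu[2], sqrt(sig2[2]))
  g1 <- d1 / (d1 + d2); g2 <- 1 - g1
  # M-step: weighted means, variances, and weights
  mu.new <- c(sum(g1 * x) / sum(g1), sum(g2 * x) / sum(g2))
  sig2 <- c(sum(g1 * (x - mu.new[1])^2) / sum(g1),
            sum(g2 * (x - mu.new[2])^2) / sum(g2))
  pi.c <- c(mean(g1), mean(g2))
  diff <- sum(abs(mu.new - mu)); mu <- mu.new
  if (diff < 1e-10) break
}
c(mu, sig2, pi.c)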
Questions to think about

8.4 EM Algorithm for Censored Data

A light bulb company is testing the failure times of their bulbs and knows that failure
times follow Exp(λ) for some λ > 0. They test n light bulbs, so the failure time of
each light bulb is
Z₁, …, Zₙ ~iid Exp(λ) .
However, officials recording these failure times walked into the room only at time T
and observed that m < n of the bulbs had already failed. Their failure times cannot be
recorded. Define E_j = I(Z_j < T), so the observed data are

E₁ = 1, …, E_m = 1, Z_{m+1}, Z_{m+2}, …, Z_n .
• If we ignore the first m light bulbs, then not only do we have a smaller sample
size, but we also have a biased sample that does not contain the bottom tail of
the distribution of failure times.
So we must account for the "missing data". Let us first write down the observed data
likelihood. Each censored bulb contributes Pr(Z_i ≤ T) = 1 − e^{−λT}, so

L(λ | E₁, …, E_m, Z_{m+1}, …, Z_n) = (1 − e^{−λT})^m λ^{n−m} exp{ −λ Σ_{j=m+1}^n Z_j } .
Closed-form MLEs are difficult here, and some sort of numerical optimization is useful.
Of course, here we can very easily implement our gradient-based methods. But we will
resort to the EM algorithm, as that will give us something extra: it has the
added advantage that estimates of Z₁, …, Z_m may be obtained as well.
Also, note that if we choose to "throw away" the censored data, then the likelihood is

L_bad(λ | Z_{m+1}, …, Z_n) = λ^{n−m} exp{ −λ Σ_{j=m+1}^n Z_j } .
The above MLE is a bad estimator since the data thrown away is not censored at
random, and in fact, all those bulbs fused early. So the bulb company cannot just
throw that data away, as that would be dishonest!
Now we implement the EM algorithm for this example. First, note that the complete
data are Z₁, …, Zₙ (with Z₁, …, Z_m unobserved), and the complete log-likelihood is

log L_comp(λ | Z₁, …, Zₙ) = log ∏_{i=1}^n λ e^{−λ z_i} = n log λ − λ Σ_{i=1}^n z_i .
Next, the conditional distribution of the unobserved failure times given the observed data is

f(Z₁, …, Z_m | E₁, E₂, …, E_m, Z_{m+1}, …, Z_n)
= f(Z₁, …, Z_m | E₁, …, E_m)
= ∏_{i=1}^m f(Z_i | E_i)
= ∏_{i=1}^m f(Z_i | Z_i ≤ T)
= ∏_{i=1}^m [ λ e^{−λ Z_i} / (1 − e^{−λT}) ] I(Z_i ≤ T) .
Further, for Z ∼ Exp(λ) truncated to Z ≤ T,

E[Z | Z ≤ T] = 1/λ − T e^{−λT} / (1 − e^{−λT}) .

Once we have the conditional likelihood of the unobserved observations, we are ready
to implement the EM algorithm. Implementing the EM steps now:
1. E-Step: In the E-step, we find the expectation of the complete log-likelihood
under Z_{1:m} | (E_{1:m}, Z_{(m+1):n}). That is,

q(λ | λ^(k)) = E[ log f(Z₁, …, Zₙ|λ) | E₁, …, E_m, Z_{m+1}, …, Z_n ]
= n log λ − λ E_{λ^(k)}[ Σ_{i=1}^n Z_i | E₁ = 1, …, E_m = 1, Z_{m+1}, …, Z_n ]
= n log λ − λ Σ_{i=m+1}^n Z_i − λ Σ_{i=1}^m E_{λ^(k)}[ Z_i | Z_i ≤ T ] .

2. M-Step: It is then easy to show that the M-step makes the following update (show this by
yourself):

λ^(k+1) = n / ( Σ_{i=m+1}^n Z_i + Σ_{i=1}^m E_{λ^(k)}[ Z_i | Z_i ≤ T ] )
= n / ( Σ_{i=m+1}^n Z_i + m [ 1/λ^(k) − T e^{−λ^(k) T} / (1 − e^{−λ^(k) T}) ] ) .
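A minimal R sketch of this EM update (the data-generating values and censoring time are assumptions):

set.seed(8)
n <- 100; lam.true <- 2; T.cens <- 0.3     # hypothetical truth and censoring time
Z <- rexp(n, rate = lam.true)
obs <- Z[Z > T.cens]                       # the recorded failure times
m <- n - length(obs)                       # number that failed before time T
lam <- 1                                   # starting value
for (k in 1:1000) {
  EZ <- 1 / lam - T.cens * exp(-lam * T.cens) / (1 - exp(-lam * T.cens))  # E[Z | Z <= T]
  lam.new <- n / (sum(obs) + m * EZ)       # M-step update
  if (abs(lam.new - lam) < 1e-12) break
  lam <- lam.new
}
lam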
• If the E-step is not solvable, in the sense that the expectation is not available,
what can we do?
8.5 Exercises
1. Following the steps from class, write the EM algorithm for a mixture of K Gaus-
sians, for any general K. That is, the distribution is
f(x|θ) = Σ_{k=1}^K π_k f_k(x | µ_k, σ_k²) .
2. (Using R) Consider the Old Faithful dataset in R:

data(faithful)
plot(density(faithful$eruptions))

You will see that the length of the eruptions looks like a bimodal distribution.
For any given eruption, let X_i be the eruption time. Let

Z_i = 1 if X_i has short eruptions, and Z_i = 2 if X_i has long eruptions.
Thus Zi is a latent variable which is not observed. Let π1 and π2 be the probability
of short and long eruptions, respectively. Assume that the joint distribution of
(X, Z) is
3. (Using R) For the same dataset faithful, we will fit a multivariate mixture of
Gaussians for both the eruption time and waiting times. Let Xi be the eruption
time and Yi be the waiting time for the ith eruption. Let
Z_i = 1 if (X_i, Y_i) has short eruption and wait times, and Z_i = 2 if (X_i, Y_i) has long eruption and wait times.
First, we want to find the EM steps for this. The joint distribution of the observed
t = (x, y) and the latent variable z is built from the bivariate normal component densities

f_c(t | µ_c, Σ_c) = ( 1 / (2π |Σ_c|^{1/2}) ) exp{ −(t − µ_c)^T Σ_c^{−1} (t − µ_c) / 2 } .
Similar to the one dimensional case, set up the EM algorithm for this two-
dimensional case, and then implement this on the Old Faithful dataset.
Suppose you observe y = (125, 18, 20, 34). Also, suppose that the complete data
are (z₁, z₂, y₂, y₃, y₄) where z₁ + z₂ = y₁. That is, the first count y₁ is broken into
two groups, with the new cell probabilities

( 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ) .
The E-step is

q(θ | θ^(k)) = E[ log f(z₁, z₂, y₂, y₃, y₄ | θ) | y, θ^(k) ] .

Write this expectation explicitly, and then write the M-step. Implement
this in R and return the estimate of θ.
6. Suppose the expectation in the E-step is not tractable. That is the expectations
is not available in closed form. In this case, one may replace the expectation
with a Monte Carlo estimate. This is called Monte Carlo EM. Repeat Exercise
5 using Monte Carlo EM.
f_i(x; α_i) = (1/Γ(α_i)) x^{α_i − 1} e^{−x} .
(b) Implement the algorithm on the eruptions dataset. What are your final
estimates of α1 , α2 , π1 , π2 ? Is there much difference in the clusters from the
Gaussian mixture model?
9 Choosing Tuning Parameters
We will take a break from optimization methods to discuss an important practical
question: how should we choose tuning parameters in our models?
Recall that we use the log-likelihood to see whether our estimates are good or not. A
larger value of the log-likelihood means a better model! Thus, in principle, we should
be able to use the negative log-likelihood to see how bad our estimates are. However, the
negative log-likelihood will always keep decreasing as C increases, since each data point
will want to be its own cluster.
A standard correction is the Bayesian information criterion (BIC),

BIC = 2 l(θ̂) − K log n ,

where K is again the number of parameters being estimated. The BIC decreases when
the number of clusters increases and when the data size is large. This makes sense, since when a large
amount of data is available, we should be able to find simpler models.
Note: We want to choose C such that the value of BIC is maximized.
For example, in linear regression the fitted model is f̂(x) = x^T β̂.
A loss function measures the error between y and the estimated response ŷ = f̂(x).
For continuous response y, there are two popular loss functions: the squared error loss
L(y, f̂(x)) = (y − f̂(x))² and the absolute error loss L(y, f̂(x)) = |y − f̂(x)|.
• Binary data classification: For binary data models, like logistic regression, a
popular loss function is the misclassification loss, also known as the 0−1 loss. Given
responses y that are Bernoulli(p) where

p = e^{x^T β} / (1 + e^{x^T β}) ,

the estimated success probability is p̂ = e^{x^T β̂_MLE} / (1 + e^{x^T β̂_MLE}), where β̂_MLE
is the MLE obtained using an optimization algorithm. Set f̂(x) = 1·I{p̂ ≥ .5} + 0·I{p̂ < .5}.
(The cutoff .5 is the default cutoff, but can be changed depending on the problem.)
Then the misclassification loss function checks whether the model has misclassified the observation:

L(y, f̂(x)) = I( y ≠ f̂(x) ) .
Given a dataset, we are interested in estimating the test error, which is the expected
loss on an independent dataset given estimates from the current dataset. If our given
dataset is D,

Err_D = E[ L(y, f̂(x)) | D ] ,

where f̂(x) denotes the estimated response for a new independent observation, given estimators
from the current dataset. For example, using β̂ from a given dataset D, f̂(x_new) = x_new^T β̂.
9.3 Cross-validation
In many data analysis techniques, there are tuning/nuisance parameters that need to
be chosen in order to fit the model. How can we choose these nuisance parameters?
For example:
Or, we can fit multiple different models to the same dataset. For example, we could
fit linear regression, bridge regression, or ridge regression. Which fit is the best? How
can we make that assessment?
Cross-validation provides a way to estimate the test error without requiring new data.
In cross-validation, our original dataset is broken into chunks in order to emulate
independent datasets. Then the model is fit on some chunks and tested on other
chunks, with the loss recorded. The way the data is broken into chunks can lead to
different methods of cross-validation.
• Choose a loss function
• We randomly split the data into two parts: a training set and a test set.
• Fit the model on the training set, and obtain estimates of the observation y for
the test set. Compute loss on the test set.
• Repeat the process for different models and choose the model with smallest error.
1. We do not often have the luxury of "throwing away" observations for the test
data, especially in small data problems.
2. It is possible that, just by chance, a bad split makes the training and test data
drastically different.
The main advantage is that, compared to the methods that follow, this is fairly cheap.
In leave-one-out cross-validation (LOOCV), the model is fit n times, each time leaving
out exactly one observation and testing on it. For each split, the test error is estimated,
and the average error over all splits estimates the expected test error for a model fit f̂(x).
Let f̂^{−i}(x_i) denote the predicted value of y_i using the model fit with the ith data point
removed. Then the CV estimate of the prediction error is

CV₁(f̂) = (1/n) Σ_{i=1}^n L(y_i, f̂^{−i}(x_i)) ≈ E[ L(y, f̂(x)) ] .

Note that each f̂^{−i}(x_i) represents a model fit using a different dataset, with testing on
one observation.
CV₁(f̂) can be calculated for different models or tuning parameters. Let γ denote a
generic tuning parameter utilized in determining the model fit f̂, and let f̂^{−i}(x_i, γ)
denote the predicted value for y_i using training data that do not include the ith
observation and use γ as a tuning parameter. Then:

CV₁(f̂, γ) = (1/n) Σ_{i=1}^n L(y_i, f̂^{−i}(x_i, γ)) .
The chosen model is the one with γ such that

γ_chosen = arg min_γ CV₁(f̂, γ) .
The final model is fˆ(X, γchosen ) fit to all the data. In this way we can accomplish two
things: obtain an estimate of the prediction error and choose a model.
Points:
• CV1 (fˆ) is an approximately unbiased estimator of the test error. This is because
the expectation is being evaluated over n samples of the data that are very close
to the original data D.
• LOOCV is computationally burdensome since the model is fit n times for each
γ. If the number of possible values of γ is large, this becomes computationally
expensive.
In K-fold cross-validation, the data is randomly split into K roughly equal-sized parts. For any kth split, the
rest of the K − 1 parts make up the training set and the model is fit to the training
set. We then estimate the prediction error for each element in the kth part. Repeating
this for all k = 1, 2, . . . , K parts, we have an estimate of the prediction error.
Let κ : {1, …, n} → {1, …, K} indicate the partition to which each observation
belongs. Let f̂^{−κ(i)}(x) be the function fitted with the κ(i)th partition removed. Then
the estimated prediction error is

CV_K(f̂, γ) = (1/K) Σ_{k=1}^K (K/n) Σ_{i ∈ kth split} L(y_i, f̂^{−κ(i)}(x_i, γ)) = (1/n) Σ_{i=1}^n L(y_i, f̂^{−κ(i)}(x_i, γ)) .
Points:
• For small K, the bias in estimating the true test error is large since each training
data is quite different from the given dataset D.
• The computational burden is lower when K is small.
Usually, for large datasets, 10-fold or 5-fold CV is common. For small datasets, LOOCV
is more common.
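A minimal R sketch of K-fold CV for choosing the ridge penalty λ under squared error loss (the data generation and λ grid are assumptions, and for simplicity the intercept is penalized too):

set.seed(9)
n <- 100; p <- 5; K <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
y <- X %*% rnorm(p) + rnorm(n)
lambdas <- 10^seq(-2, 2, by = 0.5)               # assumed grid of tuning values
kappa <- sample(rep(1:K, length.out = n))        # random partition of the data
cv <- numeric(length(lambdas))
for (j in seq_along(lambdas)) {
  err <- 0
  for (k in 1:K) {
    test <- (kappa == k)
    # ridge fit on the K-1 training folds
    bhat <- solve(t(X[!test, ]) %*% X[!test, ] + lambdas[j] * diag(p),
                  t(X[!test, ]) %*% y[!test])
    err <- err + sum((y[test] - X[test, ] %*% bhat)^2)   # squared error loss
  }
  cv[j] <- err / n
}
lambdas[which.min(cv)]                           # chosen lambda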
9.4 Bootstrapping
We have discussed cross-validation, which we use to choose model tuning parameters.
However, once the final model is fit, we would like to do inference. That is, we want to
account for the variability of the final estimators obtained, and potentially do testing.
If our estimates are MLEs, then we know that under certain important conditions,
MLEs have asymptotic normality; that is,

√n (θ̂_MLE − θ) →d N(0, σ²_MLE) ,

where σ²_MLE is the inverse Fisher information. Then, if we can estimate σ²_MLE, we can
construct asymptotically valid confidence intervals:

θ̂_MLE ± z_{1−α/2} √( σ̂²_MLE / n ) .
We can also conduct hypothesis tests, etc., and go on to do regular statistical analysis.
But sometimes we cannot use an asymptotic distribution:

1. when our estimates are not MLEs, like ridge and bridge regression;
2. when the assumptions for asymptotic normality are not satisfied (I haven't shared
these assumptions).

In bootstrapping, we approximate the distribution of θ̂, and from there we obtain
confidence intervals.
Suppose θ̂ is some estimator of θ from a sample X₁, …, Xₙ ~iid F. Since θ̂ is random,
it has a sampling distribution G_n that is unknown. If asymptotic normality holds, then
G_n ≈ N(·, ·) for large enough n, but in general we may not know much about G_n. If
we could obtain many (say, B) similar datasets, we could obtain an estimate from each
of those B datasets:
iid
θ̂1 , . . . , θ̂B ∼ Gn .
Thus, in order to learn things about the sampling distribution Gn , our goal is to draw
more samples of such data. But this, of course is not easy in real-data scenarios. We
could obtain more Monte Carlo datasets from F , but we typically do not know the
true F .
Since we typically cannot draw new datasets, the bootstrap instead resamples: each
bootstrap sample draws n observations with replacement from the observed data
X₁, …, Xₙ, and the estimator is recomputed:

Bootstrap sample 1: X∗₁₁, X∗₂₁, …, X∗ₙ₁ ⇒ θ̂∗₁
Bootstrap sample 2: X∗₁₂, X∗₂₂, …, X∗ₙ₂ ⇒ θ̂∗₂
⋮
Bootstrap sample B: X∗₁B, X∗₂B, …, X∗ₙB ⇒ θ̂∗_B .

Each sample is called a bootstrap sample, and there are B bootstrap samples. Now,
the idea is that θ̂∗₁, …, θ̂∗_B are B approximate samples from G_n, the distribution of θ̂.
That is,

θ̂∗₁, …, θ̂∗_B approx.∼ G_n .
If we want to know the variance of θ̂, then the bootstrap estimate of the variance is

Var̂(θ̂) = (1/(B−1)) Σ_{b=1}^B ( θ̂∗_b − B^{−1} Σ_k θ̂∗_k )² .
Similarly, we may want to construct a 100(1 − α)% confidence interval for θ. A 100(1 −
α)% confidence interval is a random interval (L, U) such that

Pr( (L, U) contains θ ) = 1 − α .
Note that here L and U are random and θ is fixed. We can find the confidence interval
by looking at the quantiles of G_n, the distribution of θ̂. Since we have bootstrap samples
from G_n, we can estimate these quantiles! So if we order the bootstrap estimates

θ̂∗_(1) < θ̂∗_(2) < ⋯ < θ̂∗_(B) ,

and set L to be the (α/2)th order statistic and U to be the (1 − α/2)th order statistic,
we get

L = θ̂∗_(⌊(α/2)B⌋)   and   U = θ̂∗_(⌊(1−α/2)B⌋) .

Then ( θ̂∗_(⌊(α/2)B⌋), θ̂∗_(⌊(1−α/2)B⌋) ) is a 100(1 − α)% bootstrap confidence interval.
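A minimal R sketch of this recipe, using the sample median of a Gamma sample as the estimator (anticipating the example below; the data-generating values are assumptions):

set.seed(10)
n <- 100; B <- 2000
x <- rgamma(n, shape = 3, rate = 1)               # hypothetical data
theta.boot <- replicate(B, median(sample(x, n, replace = TRUE)))
var(theta.boot)                                   # bootstrap variance estimate
quantile(theta.boot, c(0.025, 0.975))             # 95% bootstrap confidence interval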
Example 38 (Median of Gamma(a, b)). Let X₁, …, Xₙ ~iid Gamma(a, b). The median
of this distribution, θ, does not have a closed-form expression, but suppose we are
interested in estimating it with the sample median. That is,

θ̂ = Sample Median(X₁, X₂, …, Xₙ) ∼ G_n .
Resampling with replacement B times from (X₁, …, Xₙ):

X∗₁₁, X∗₂₁, …, X∗ₙ₁ ⇒ θ̂∗₁
X∗₁₂, X∗₂₂, …, X∗ₙ₂ ⇒ θ̂∗₂
⋮
X∗₁B, X∗₂B, …, X∗ₙB ⇒ θ̂∗_B .

Find the α/2 and 1 − α/2 sample quantiles from θ̂∗_i, i = 1, …, B. Then ( θ̂∗_(⌊(α/2)B⌋), θ̂∗_(⌊(1−α/2)B⌋) )
is a 100(1 − α)% bootstrap confidence interval. ■
Parametric bootstrap: if we are willing to assume a parametric form F(θ) for the
data-generating distribution, we can instead draw each bootstrap sample of size n from
the fitted distribution F(θ̂):

X∗₁₁, X∗₂₁, …, X∗ₙ₁ ∼ F(θ̂) ⇒ θ̂∗₁
X∗₁₂, X∗₂₂, …, X∗ₙ₂ ∼ F(θ̂) ⇒ θ̂∗₂
⋮
X∗₁B, X∗₂B, …, X∗ₙB ∼ F(θ̂) ⇒ θ̂∗_B .

And again, we find the α/2 and 1 − α/2 quantiles of the θ̂∗_i s, so that ( θ̂∗_(⌊(α/2)B⌋), θ̂∗_(⌊(1−α/2)B⌋) )
is a 100(1 − α)% bootstrap confidence interval.
Consider the coefficient of variation

θ = σ/µ .

It tells us how large the spread of the population is relative to its mean. Let
F = N(µ, σ²) and X₁, …, Xₙ ~iid N(µ, σ²). We want to estimate θ, the coefficient
of variation. The obvious estimator is the sample standard deviation divided by the sample
mean:

θ̂ = √(s²) / X̄ ,   where   s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² .

We use the parametric bootstrap:
X∗₁₁, X∗₂₁, …, X∗ₙ₁ ∼ N(X̄, s²) ⇒ θ̂∗₁
X∗₁₂, X∗₂₂, …, X∗ₙ₂ ∼ N(X̄, s²) ⇒ θ̂∗₂
⋮
X∗₁B, X∗₂B, …, X∗ₙB ∼ N(X̄, s²) ⇒ θ̂∗_B .

Then find the α/2 and 1 − α/2 sample quantiles from θ̂∗_i, i = 1, …, B. ■
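A minimal R sketch of this parametric bootstrap (the data-generating values are assumptions):

set.seed(11)
n <- 50; B <- 2000
x <- rnorm(n, mean = 5, sd = 2)                  # hypothetical data
theta.boot <- replicate(B, {
  xb <- rnorm(n, mean = mean(x), sd = sd(x))     # draw from the fitted N(xbar, s^2)
  sd(xb) / mean(xb)                              # recompute theta-hat
})
quantile(theta.boot, c(0.025, 0.975))            # 95% parametric bootstrap CI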
• How will you use bootstrapping to obtain confidence intervals for bridge regres-
sion coefficients?
9.5 Exercises
1. Comparing different cross-validation techniques: Generate a dataset using the
following code:
set.seed(10)
n <- 100
p <- 50
sigma2.star <- 4
beta.star <- rnorm(p, mean = 2)
beta.star

Generate the design matrix X and the response according to the model

y = Xβ + ϵ .
2. cars dataset: Estimate the prediction error for the cars dataset using 10-fold,
5-fold, and LOOCV.
3. mtcars dataset: Consider the mtcars dataset from 1974 Motor Trend US maga-
zine, that comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles. Load the dataset using
data(mtcars)
There are 10 covariates in the dataset, and mpg (miles per gallon) is the response
variable. Fit a ridge regression model for this dataset and find an optimal
λ using 10-fold, 5-fold, and LOOCV cross-validation. Choose the best
λ ∈ {10^{−8}, 10^{−7.5}, …, 10^{7.5}, 10^8}. Make sure you make the X matrix such that
the first column is a column of 1s.
https://archive.ics.uci.edu/ml/datasets/seeds
This dataset contains information about three varieties of wheat (last column of
the dataset). There are 7 covariates. Fit a 7-dimensional Gaussian
mixture model with C = 3 and estimate the misclassification rate
using cross-validation.
5. Seeds dataset: For the same dataset, with C = 3, use cross-validation to find out
which of the 7 covariates best helps identify between the three kinds of wheat.
6. (Using R) Construct 95% bootstrap confidence intervals for the mean µ using the
sample generated below:

set.seed(1)
mu <- 5
sig2 <- 3
n <- 100
my.samp <- rnorm(n, mean = mu, sd = sqrt(sig2))
7. Repeat the previous exercise for estimating the mean of t10 distribution from
sample of size 50. Are the bootstrap confidence intervals similar to the intervals
using CLT? Why or why not?
9. For Exercise 1, fit a bridge regression model with λ = 5, α = 1.5, and construct
95% parametric and nonparametric bootstrap confidence intervals for each of the 50
βs. In repeated simulations, what is the coverage probability of each of the
confidence intervals? What percentage of the confidence intervals contain the true
vector of β, beta.star?
10. Obtain 95% bootstrap confidence intervals for each ridge regression coefficient β
for the chosen λ value in the mtcars Exercise 3.
10 Stochastic optimization methods
We go back to optimization this week. The reason we took a break from optimization is
because we will focus on stochastic optimization methods, which will lead the discussion
into other stochastic methods. We will cover two topics:
Our goal is the same as before: for an objective function f(θ), we want to find

θ∗ = arg max_θ f(θ) .

Recall the gradient ascent update

θ^(k+1) = θ^(k) + t ∇f(θ^(k)) ,

where ∇f(θ^(k)) is the gradient vector. Now, in many statistics problems, the objective
function is the log-likelihood (for some density f̃). That is, we have X₁, X₂, …, Xₙ ~iid F̃
with density function f̃. Then interest is in

θ∗ = arg max_θ Σ_{i=1}^n log f̃(x_i|θ) = arg max_θ (1/n) Σ_{i=1}^n log f̃(x_i|θ) =: arg max_θ f(θ) .
Now due to the objective function being an average, we have the following:
f(θ) = (1/n) Σ_{i=1}^n log f̃(x_i|θ)   ⇒   ∇f(θ) = (1/n) Σ_{i=1}^n ∇[ log f̃(x_i|θ) ] .
That is, in order to implement a gradient ascent step, the gradient of the log-likelihood
is calculated for the whole data. However, consider the following two situations
• the data size n and/or dimension of θ are prohibitively large so that calculating
the full gradient multiple times is infeasible
• the data is not available at once! In many online data situations, the full data set
is not available, but comes in sequentially. Then, the full data gradient vector is
not available.
In such situations, when the full gradient vector is unavailable, we may replace it with
an estimate of the gradient. Suppose i_k is a randomly chosen index in {1, …, n}. Then

E[ ∇ log f̃(x_{i_k}|θ) ] = (1/n) Σ_{i=1}^n ∇ log f̃(x_i|θ) = ∇f(θ) .

Thus, ∇ log f̃(x_{i_k}|θ) is an unbiased estimator of the complete gradient, but uses only
one data point. Replacing the complete gradient with this estimate yields the stochastic
gradient ascent update

θ^(k+1) = θ^(k) + t ∇ log f̃(x_{i_k}|θ^(k)) ,

where i_k is a randomly chosen index. This randomness in choosing the index makes
this a stochastic algorithm.
A final estimator can average the iterates:

θ̂∗ = (1/K) Σ_{k=1}^K θ_(k+1) .

However, since each step involves estimating the full-data gradient from a single data
point, the variability in the updates of θ_(k) is larger than when using gradient ascent. To
stabilize this behavior, mini-batch stochastic gradient is often used.
10.1.1 Mini-batch stochastic gradient ascent
Rather than estimating the gradient from a single observation, mini-batch stochastic
gradient ascent draws a random batch of indices B_k ⊆ {1, …, n} of some size b and
averages the gradients over the batch:

θ_(k+1) = θ_(k) + (t/b) Σ_{i∈B_k} ∇ log f̃(x_i|θ_(k)) .

A final estimator can again be

θ̂∗ = (1/K) Σ_{k=1}^K θ_(k) .

There are not a lot of clear rules about terminating the algorithm in stochastic gradient
methods. Typically, the number of iterations is K = n, so that one full pass at the data is
implemented.
Recall the logistic regression setup where, for a response Y and a covariate matrix X,

Y_i ∼ Bern( e^{x_i^T β} / (1 + e^{x_i^T β}) ) .

Taking the derivative of the scaled log-likelihood f(β) = (1/n) l(β),

∇f(β) = (1/n) Σ_{i=1}^n x_i [ y_i − e^{x_i^T β} / (1 + e^{x_i^T β}) ] .
As noted earlier, the target objective is concave; thus a global optima exists and the
gradient ascent algorithm would converge to the MLE. We will implement the stochastic
gradient ascent algorithm here. It proceeds in the following way: for a randomly chosen
index i_k,

β_(k+1) = β_(k) + t x_{i_k} [ y_{i_k} − e^{x_{i_k}^T β_(k)} / (1 + e^{x_{i_k}^T β_(k)}) ] .
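A minimal R sketch of one full pass of stochastic gradient ascent for this model (the simulated data, step size, and averaging of iterates are assumptions):

set.seed(12)
n <- 1000; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))
beta.true <- c(-1, 1, 0.5)                        # hypothetical truth
y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta.true)))
beta <- rep(0, p); t.step <- 0.1                  # assumed step size
iters <- matrix(0, n, p)
for (k in 1:n) {                                  # one full pass at the data
  i <- sample(n, 1)                               # random index i_k
  grad <- X[i, ] * (y[i] - 1 / (1 + exp(-sum(X[i, ] * beta))))
  beta <- beta + t.step * grad
  iters[k, ] <- beta
}
colMeans(iters)                                   # averaged iterates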
10.2 Simulated annealing

This lecture focuses on simulated annealing, an algorithm particularly useful for non-concave
objective functions. Our goal is the same as before: for an objective function
f(θ), we want to find

θ∗ = arg max_θ f(θ) .
Recall that when the objective function is non-concave, all of the methods we’ve dis-
cussed cannot escape out of a local maxima. This creates challenges in obtaining global
maximas. This is where the method of simulated annealing has an advantage over other
methods.
Consider an objective function f(θ) to maximize. Note that maximizing f(θ) is equivalent
to maximizing exp(f(θ)). The idea in simulated annealing is that, instead of
trying to find a maxima directly, we will obtain samples from a density of the form

π_T(θ) ∝ exp( f(θ)/T ) ,

for a temperature T > 0. Samples collected from π_T(θ) are likely to be from areas near the maximas. However,
obtaining samples from π_T(θ) means there will be samples from low probability areas
as well. So how do we force samples to come from areas near the maximas?

For 0 < T < 1, the objective function's modes are exaggerated, thereby amplifying
the maximas. This feature will help us in trying to "push" the sampling to areas of
high probability.
[Figure: exp(f(θ)/T) plotted for temperatures T = 1, 0.83, 0.75, 0.71; as T decreases, the modes become more exaggerated.]
In simulated annealing, this feature is utilized so that every subsequent sample is drawn
from an increasingly concentrated distribution. That is, at a time point k, a sample
will be drawn from

π_{k,T}(θ) ∝ e^{f(θ)/T_k} ,

where {T_k} is a temperature schedule decreasing towards 0.
Certainly, we can try to use accept-reject or another Monte Carlo sampling method,
but such methods cannot be implemented generally. Note that for any θ′, θ,

π_{k,T}(θ′) / π_{k,T}(θ) = exp( ( f(θ′) − f(θ) ) / T_k ) .
Let G be a proposal distribution with density g(θ′|θ) so that g(θ′|θ) = g(θ|θ′). Such a
proposal distribution is a symmetric proposal distribution. Given the current value θ_k,
the algorithm proceeds as follows:

1: Draw a proposal θ′ from g(·|θ_k).
2: Compute α = min{ 1, exp( (f(θ′) − f(θ_k)) / T_k ) }.
3: Draw U ∼ U(0, 1).
4: If U ≤ α, set θ_{k+1} = θ′.
5: Else θ_{k+1} = θ_k.
6: Update T_{k+1}.
7: Store θ_{k+1} and e^{f(θ_{k+1})}.
8: Return θ∗ = θ_{k∗} where k∗ = arg max_k e^{f(θ_k)}.
Thus, if the proposed value is such that f (θ′ ) > f (θ), then α = 1 and the move is
always accepted. The reason simulated annealing works is because when θ′ is such
that f (θ′ ) < f (θ), even then, the move is accepted with probability α.
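A minimal R sketch of simulated annealing with a normal (symmetric) proposal; the multimodal objective function and the logarithmic cooling schedule are assumptions for illustration:

set.seed(13)
f <- function(th) sin(10 * th) + cos(5 * th) - th^2 / 10  # hypothetical objective
K <- 5000
th <- 0; Temp <- 1
keep <- numeric(K)
for (k in 1:K) {
  prop <- rnorm(1, mean = th, sd = 0.5)        # symmetric proposal g
  a <- min(1, exp((f(prop) - f(th)) / Temp))   # acceptance probability alpha
  if (runif(1) <= a) th <- prop
  Temp <- 1 / log(1 + k)                       # assumed cooling schedule
  keep[k] <- th
}
keep[which.max(f(keep))]                       # best visited value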