HW 4
1. [20 points] We have a data association problem where there are K objects and we are given K observations. Each observation corresponds to a single object, and we are given one observation for each object.
However, we don’t know which observation corresponds to which object, and we would like to infer that
using a probabilistic model relating observations to objects. Specifically we have
• K objects u1 , . . . , uK
• observations v1 , . . . , vK , where Val(vi ) = {a1 , . . . , aL } (so vi is a discrete random variable), where
each observation corresponds to the appearance of one object and there is exactly one observation of
each object
• correspondence variables C1, . . . , CK, where Val(Ci) = {1, . . . , K}; Ci = k denotes that observation
vi is derived from object uk
• a known appearance model for each object uk , Pk (vi = al |Ci = k).
Note that because of the mutex constraints, the correspondence variables C1 , . . . , CK will be a permutation
over 1, . . . , K. We also assume for simplicity that all permutations are equally likely a priori.
We wish to compute the marginals P (Ci |v1 , . . . , vK ), for i = 1, . . . , K, using Metropolis-Hastings (MH) to
sample the correspondence variables. We will start with an arbitrary assignment to C1, . . . , CK, and take
MH-steps. The proposal distribution that we will use randomly picks two correspondence variables Ci , Cj
from a uniform distribution over all pairs of correspondence variables, and swaps their assignments.
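As a concrete sketch (not the requested derivation), one MH step with this swap proposal might look like the following Python, where `P_app` is a hypothetical K×L array with `P_app[k][a]` holding the appearance probability Pk(v = a | C = k), and observation values are encoded as indices 0 to L−1:

```python
import random

def mh_swap_step(C, v, P_app, rng=random):
    """One Metropolis-Hastings step with a symmetric swap proposal.

    C     : current permutation, C[i] = object assigned to observation i
    v     : observed values, v[i] in {0, ..., L-1}
    P_app : P_app[k][a] = appearance probability of value a for object k
    """
    K = len(C)
    i, j = rng.sample(range(K), 2)          # uniform over pairs (i, j), i != j
    # The proposal is symmetric, so the acceptance probability reduces to a
    # likelihood ratio; only the two swapped terms differ between states.
    num = P_app[C[j]][v[i]] * P_app[C[i]][v[j]]
    den = P_app[C[i]][v[i]] * P_app[C[j]][v[j]]
    alpha = min(1.0, num / den) if den > 0 else 1.0
    if rng.random() < alpha:
        C[i], C[j] = C[j], C[i]             # accept: apply the swap
    return C
```

Note that because each step swaps two entries, the chain always stays on valid permutations, which is exactly what the mutex constraints require.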
(a) [10 points] Compute the acceptance probability for each MH step.
(b) [5 points] Suppose we have run the MH sampler for a long time and collected M samples (C1 [m], . . . , CK [m])
for m = 1, . . . , M after the chain has mixed. Give an explicit expression for estimating the marginal
P (Ci |v1 , . . . , vK ).
(c) [5 points] Your friend Geoff Gibbs hears about your MH algorithm and suggests that you can also
consider using Gibbs sampling to compute your marginals. Briefly explain why this will or will not
work.
where ℓX|Y(θ; D) means the conditional log-likelihood of the dataset D using the distribution Pθ(X | Y)
defined by the Markov network with parameters θ (similarly for ℓY|X). Thus, our objective is a mixture
of two conditional log-likelihoods (0 < α < 1). As usual, we consider a log-linear parameterization of a
Markov network, using a set of n features fi (X i , Y i ) where X i and Y i are some (possibly empty) subsets
of the variables X and Y , respectively.
(a) [10 points] Write down the full objective function g(θ; D) in terms of the features fi and weights θi.
(b) [10 points] Derive ∂g(θ; D)/∂θi, the derivative of the objective with respect to a weight θi. Write your
final answer in terms of feature expectations IEQ [fi ], where Q is either: the empirical distribution of
our dataset P̂ ; or a conditional distribution of the form Pθ (W | Z = z) (for some sets of variables
W , Z, and assignment z.)
for all xi , c, then the EM algorithm converges in one iteration, and give a closed form expression for the
parameter values at this convergence point.
[60 points] Programming Assignment
In this homework, you will apply Gibbs sampling to a simple Markov random field model for image
restoration. In an image restoration problem, you are given a noise-corrupted image X and you want
to recover the original image Y (see Figure 1 and Lecture 4).
Let x = {xij } denote the observed image, with xij ∈ {−1, +1} representing the pixel at row i and column
j. Assume a black-and-white image, with -1 corresponding to white and +1 to black. The image has
dimensions N × M , so that 1 ≤ i ≤ N and 1 ≤ j ≤ M . Assume a set of (unobserved) variables y = {yij }
representing the true (unknown) image, with yij ∈ {−1, +1} indicating the value of xij before noise was
added. Each (internal) yij is linked with four immediate neighbors, yi−1,j , yi+1,j , yi,j−1 , and yi,j+1 , which
together are denoted yN (i,j) . Pixels at the borders of the image (with i ∈ {1, N } or j ∈ {1, M }) also have
neighbors denoted yN(i,j), but these sets are reduced in the obvious way. We denote by E the corresponding
set of edges. For example, the pair ((1, 1), (1, 2)) ∈ E, but the pair ((1, 1), (2, 2)) ∉ E. The joint probability
of y and x can be written as (with no prior preference for black or white):
\[
p(y, x) = \frac{1}{Z} \prod_{i=1}^{N} \prod_{j=1}^{M} \exp\left(\eta\, y_{ij} x_{ij}\right) \times \prod_{((i,j),(i',j')) \in E} \exp\left(\beta\, y_{ij} y_{i'j'}\right) \tag{1}
\]
\[
p(y, x) = \frac{1}{Z} \exp\left(\eta \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} x_{ij} + \beta \sum_{((i,j),(i',j')) \in E} y_{ij} y_{i'j'}\right) \tag{2}
\]
where
\[
Z = \sum_{y, x} \exp\left(\eta \sum_{i,j} y_{ij} x_{ij} + \beta \sum_{((i,j),(i',j')) \in E} y_{ij} y_{i'j'}\right) \tag{3}
\]
(Notice in particular that each pair of neighbors, y_{ij} and y_{i'j'}, enters the formula only once, even
though each variable is a neighbor of the other. Failing to account for this will lead to double counting of β
values.) This is equivalent to a Boltzmann (sometimes called Gibbs) distribution with “energy”:
\[
E(y, x) = -\eta \sum_{i,j} y_{ij} x_{ij} - \beta \sum_{((i,j),(i',j')) \in E} y_{ij} y_{i'j'} \tag{4}
\]
The system will have lower energy, and hence higher probability, in states in which neighboring yij variables, and corresponding yij and xij variables, tend to have the same value (assuming η and β are positive).
This captures the fact that each noisy pixel xij is likely to be similar to the corresponding “true” pixel
yij , and that images tend to be “smooth”.
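For later parts of the assignment it is handy to evaluate the energy in equation (4) directly. A vectorized sketch, assuming y and x are N×M NumPy arrays with entries in {−1, +1}:

```python
import numpy as np

def energy(y, x, eta=1.0, beta=1.0):
    """E(y, x) = -eta * sum_ij y_ij x_ij - beta * sum_edges y_ij y_i'j'.

    Each edge is counted exactly once: horizontal neighbor pairs (j, j+1)
    and vertical neighbor pairs (i, i+1), which avoids the double counting
    of beta warned about above.
    """
    data_term = np.sum(y * x)
    smooth_term = np.sum(y[:, :-1] * y[:, 1:]) + np.sum(y[:-1, :] * y[1:, :])
    return -eta * data_term - beta * smooth_term
```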
There are algorithms for deterministically estimating y given an image x, but here we will use an alternative approach: we will devise a Markov chain Monte Carlo (MCMC) algorithm to sample values of y
conditional on x. Here are some advantages over the deterministic algorithms:
i. It is very general, and can easily be extended to more complex graphs.
ii. It provides great flexibility for quantifying the uncertainty of y (and, potentially, for the parameters
η and β).
iii. It is relatively straightforward in this setting to derive the exact conditional distributions for nodes
given the Markov blanket, so Gibbs sampling is possible, and one need not worry about the acceptance
rate for proposed samples.
You will apply your methods to two small, black-and-white images that have been made available with
the problem set. These two noisy images, and the original, undistorted image from which they derive,
are available both in PNG format and in a simple text format that lists each coordinate pair (i, j) and
the corresponding value of xij . You may find it useful to convert between this text representation and a
viewable image format.
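The exact text layout is not specified beyond listing each (i, j) pair with its value, so the helper below assumes one whitespace-separated `i j x` triple per line with 1-based indices; adjust to the actual files as needed:

```python
import numpy as np

def read_image_txt(path):
    """Read the text image format into an N x M array of {-1, +1}.

    Assumed (not specified) layout: one '<i> <j> <x>' triple per line,
    with 1-based row/column indices i and j.
    """
    entries = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                entries.append((int(parts[0]), int(parts[1]), int(parts[2])))
    n = max(i for i, _, _ in entries)
    m = max(j for _, j, _ in entries)
    img = np.zeros((n, m), dtype=int)
    for i, j, x in entries:
        img[i - 1, j - 1] = x  # convert to 0-based indexing
    return img
```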
(a) [5 points] Derive an expression for the conditional probability that pixel (i, j) is black given its
Markov blanket, i.e. p(yij = 1|yM (i,j) ), where yM (i,j) denotes the variables in the Markov blanket of
yij (but you should be explicit about which variables are included). Your expression should take the
form of a logistic function and should depend only on η, β, and yM(i,j).
(b) [10 points] Outline a Gibbs sampling algorithm (in pseudocode) that iterates over the pixels in the
image and samples each yij given its Markov blanket. Use the simple approach of sweeping across the
image in row-major fashion on every iteration of the algorithm. Thus, an “iteration” will generate
a complete new sample of y. Allow for a burn-in of B iterations, followed by draws of S samples.
You may assume η and β are fixed constants. How can we show in our case that the equilibrium
distribution is in fact the posterior distribution p(y|x)?
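A minimal Python sketch of one such row-major sweep, assuming the logistic conditional implied by the energy in equation (4), namely p(yij = 1 | blanket) = σ(2(η xij + β Σ y over grid neighbors)):

```python
import math
import random

def gibbs_sweep(y, x, eta=1.0, beta=1.0, rng=random):
    """One row-major Gibbs sweep over an N x M array y of {-1, +1} values.

    Resamples each y_ij given its Markov blanket: its (up to four) grid
    neighbors in y plus the observed pixel x_ij. Modifies y in place.
    """
    n, m = y.shape
    for i in range(n):
        for j in range(m):
            nb = 0  # sum of available neighbors; border pixels have fewer
            if i > 0:
                nb += y[i - 1, j]
            if i < n - 1:
                nb += y[i + 1, j]
            if j > 0:
                nb += y[i, j - 1]
            if j < m - 1:
                nb += y[i, j + 1]
            p_black = 1.0 / (1.0 + math.exp(-2.0 * (eta * x[i, j] + beta * nb)))
            y[i, j] = 1 if rng.random() < p_black else -1
    return y
```

The full algorithm would then run B burn-in sweeps followed by S retained sweeps, accumulating statistics from the retained samples only.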
(c) [15 points] Implement your algorithm and apply it to the image with 20% noise (noisy 20.png / noisy 20.txt).
Use values of η = 1, β = 1, B = 100, and S = 1000. On each iteration of your algorithm, compute the
energy E(y, x) for the current sample of y and output it to a log file, keeping track of which values
correspond to the burn-in. Run your algorithm with three different initializations - one in which each
yij is initialized to xij , one in which each yij is initialized to −xij , and one in which the yij are set to
−1 or +1 at random. Plot the energy of the model as a function of the iteration number for all three
chains and visually inspect these traces for signs of convergence. Do all three seem to be converging
to the same general region of the posterior, or are some obviously suboptimal? Does the burn-in seem
to be adequate in length? Is there substantial fluctuation from iteration to iteration, indicating that
the chain is mixing well, or does it become stuck at particular energies for several iterations at a time?
(d) [10 points] Have your program output a restored image after completing its sampling iterations, by
thresholding the estimated posterior probabilities for the yij variables at 0.5 - i.e., by estimating the
“true” color of each pixel (i, j) as:
\[
\hat{y}_{ij} = \begin{cases} +1 & \text{if } p(y_{ij} = 1 \mid x) > 0.5 \\ -1 & \text{otherwise} \end{cases}
\]
To estimate the required posterior probabilities, store a running count cij of the number of (retained)
samples for which each yij = 1, and then use the Monte Carlo estimate:
\[
p(y_{ij} = 1 \mid x) \approx \frac{1}{S} \sum_{t=1}^{S} \mathbf{1}\left(y_{ij}^{(t)} = 1\right) = \frac{c_{ij}}{S} \tag{5}
\]
where $y_{ij}^{(t)}$ represents the $t$-th sample of $y_{ij}$. Restore both the 10%- and 20%-noise images in this
way, using the same values of η, β, B, and S as above. Evaluate the quality of the restoration by
computing the fraction of all pixels that differ between the restored images and the original image.
Prepare a figure for each of the two images, showing the original, the noisy version, and the restoration side by side.
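The thresholding and error computation described above can be sketched as:

```python
import numpy as np

def restore_from_counts(counts, S):
    """Threshold the Monte Carlo posterior estimate c_ij / S at 0.5:
    +1 where the estimated p(y_ij = 1 | x) exceeds 0.5, else -1."""
    return np.where(counts / S > 0.5, 1, -1)

def restoration_error(restored, original):
    """Fraction of pixels that differ between the restored and original images."""
    return np.mean(restored != original)
```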
(e) [10 points] If you have implemented your algorithm correctly, your restored images should be quite
close to the original. But is this because you have a clever algorithm or just because the problem is
easy? To examine this question, implement a trivial reconstruction algorithm that sets each yij equal
to the consensus (majority) of its neighbors (including xij ), and iterates a few times until convergence
(use sequential rather than batch updates, as in Gibbs sampling). This algorithm need not converge in
theory, but in practice it usually converges quickly. To be safe, you can force it to terminate after, say, 30
iterations. You should be able to get this program working quickly and easily by reusing code from
your Gibbs sampler. However, note that in this case, you should not average over samples (there are
no samples here) but instead should use the final value of the yij variables for your restored image.
Run this program on both images and compute its restoration error. Include figures for the images
restored in this way. Does the Gibbs sampler do better than the trivial algorithm? Why or why not?
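A sketch of this trivial consensus algorithm, with sequential updates and a hard cap of 30 iterations as suggested:

```python
import numpy as np

def majority_restore(x, max_iters=30):
    """Trivial restoration: sequentially set each y_ij to the majority vote
    of its grid neighbors plus x_ij, sweeping until no pixel changes or the
    iteration cap is reached. On a tie, the current value is kept (the
    tie-breaking rule is an assumption; the text does not specify one)."""
    y = x.copy()
    n, m = y.shape
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            for j in range(m):
                s = x[i, j]
                if i > 0:
                    s += y[i - 1, j]
                if i < n - 1:
                    s += y[i + 1, j]
                if j > 0:
                    s += y[i, j - 1]
                if j < m - 1:
                    s += y[i, j + 1]
                new = 1 if s > 0 else -1 if s < 0 else y[i, j]
                if new != y[i, j]:
                    y[i, j] = new
                    changed = True
        if not changed:
            break
    return y
```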
(f) [10 points] While the Gibbs sampler is useful for obtaining marginal posterior probabilities of
interest, much of its appeal derives from its flexibility in estimating posterior distributions for more
complex features of the model. To get a sense for its flexibility, use your Gibbs sampler to estimate
the posterior distribution over the number of pixels in the “Z” in the image, which approximately
falls in the rectangle from (i = 125, j = 143) to (i = 162, j = 174). Using the same parameters as
above, simply count the number of cases of yij = +1 within this rectangle for each retained sample,
output one count per iteration as your sampler runs, then use their relative frequencies as an estimate
of the posterior distribution of interest. Plot a histogram showing these relative frequencies for both
images and comment on any differences between the two estimated posterior distributions.
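A sketch of the per-sample count, assuming the rectangle bounds are 1-indexed and inclusive (the indexing convention is an assumption):

```python
import numpy as np

def count_in_rect(y, i1=125, j1=143, i2=162, j2=174):
    """Count the pixels with y_ij = +1 inside the rectangle from (i1, j1)
    to (i2, j2), treating the bounds as 1-indexed and inclusive. Defaults
    are the approximate bounds of the "Z" given in the problem statement."""
    region = y[i1 - 1:i2, j1 - 1:j2]
    return int(np.sum(region == 1))
```

Calling this once per retained sample and histogramming the resulting counts gives the estimated posterior distribution over the number of "Z" pixels.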