HW 4

This document describes a homework assignment involving Gibbs sampling for image restoration. The task is to recover an original image Y given a corrupted observed image X. A graphical model is defined representing the joint probability of Y and X, with variables Y representing true pixel values and X representing observed noisy pixel values. Gibbs sampling will be used to sample values of Y conditioned on X. The student is asked to implement Gibbs sampling for this problem to restore images given their noisy observed versions.

CS228 Homework 4

Instructor: Stefano Ermon – [email protected]

Available: 02/17/2017; Due: 03/03/2017

1. [20 points] We have a data association problem with K objects and K observations. Each observation corresponds to a single object, and we are given one observation for each object. However, we don't know which observation corresponds to which object, and we would like to infer this using a probabilistic model relating observations to objects. Specifically, we have

• K objects u1, ..., uK
• observations v1, ..., vK, where Val(vi) = {a1, ..., aL} (so each vi is a discrete random variable); each observation corresponds to the appearance of one object, and there is exactly one observation of each object
• correspondence variables C1, ..., CK, where Val(Ci) = {1, ..., K}; Ci = k denotes that measurement vi is derived from object uk
• a known appearance model for each object uk: Pk(vi = al | Ci = k).

Note that because of the mutex constraints, the correspondence variables C1, ..., CK will be a permutation of 1, ..., K. We also assume for simplicity that all permutations are equally likely a priori.
We wish to compute the marginals P(Ci | v1, ..., vK), for i = 1, ..., K, using Metropolis-Hastings (MH) to sample the correspondence variables. We will start with an arbitrary assignment to C1, ..., CK and take MH steps. The proposal distribution we will use picks a pair of correspondence variables Ci, Cj uniformly at random from all such pairs and swaps their assignments.

(a) [10 points] Compute the acceptance probability for each MH step.
(b) [5 points] Suppose we have run the MH sampler for a long time and, after the chain has mixed, collected M samples (C1[m], ..., CK[m]) for m = 1, ..., M. Give an explicit expression for estimating the marginal
P (Ci |v1 , . . . , vK ).
(c) [5 points] Your friend Geoff Gibbs hears about your MH algorithm and suggests that you can also
consider using Gibbs sampling to compute your marginals. Briefly explain why this will or will not
work.
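As a concrete illustration of the proposal mechanics, here is a minimal sketch of one MH swap step. This is not the solution to (a); the representation of the appearance model as a nested list P with P[k][a] = Pk(vi = a | Ci = k) is an assumption of this sketch, and it uses the fact that a swap changes only two factors of the likelihood. (Because the swap proposal is symmetric, the Hastings correction cancels; verifying this carefully is part of (a).)

```python
import random

def mh_swap_step(C, v, P, rng=random):
    """One Metropolis-Hastings step with a pairwise-swap proposal.

    C : current assignment, a permutation (C[i] = object index for observation i)
    v : observed appearances, v[i] in {0, ..., L-1}
    P : appearance models, P[k][a] = P_k(v_i = a | C_i = k)
    """
    K = len(C)
    i, j = rng.sample(range(K), 2)          # uniform over pairs of variables
    # Only the two swapped factors of the likelihood change, so the
    # probability ratio between the proposed and current states is local.
    num = P[C[j]][v[i]] * P[C[i]][v[j]]
    den = P[C[i]][v[i]] * P[C[j]][v[j]]
    # Symmetric proposal: acceptance probability is min(1, num/den).
    if den == 0 or rng.random() < min(1.0, num / den):
        C[i], C[j] = C[j], C[i]
    return C
```

Note that the state after each step is always a valid permutation, since the swap move preserves the mutex constraints by construction.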

2. [20 points] Multi-conditional Parameter Learning, Markov Networks


In this problem, we will consider learning the parameters of a Markov network under a specific objective function. In particular, assume that we have two sets of variables Y and X, and a dataset D = {(x1, y1), ..., (xM, yM)}. We will estimate the model parameters θ = [θ1, ..., θn] by maximizing the following objective function:

g(θ; D) = (1 − α) ℓY|X(θ; D) + α ℓX|Y(θ; D)

where ℓX|Y(θ; D) denotes the conditional log-likelihood of the dataset D under the distribution Pθ(X | Y) defined by the Markov network with parameters θ (and similarly for ℓY|X). Thus, our objective is a mixture of two conditional log-likelihoods (0 < α < 1). As usual, we consider a log-linear parameterization of the Markov network, using a set of n features fi(Xi, Yi), where Xi and Yi are some (possibly empty) subsets of the variables X and Y, respectively.

(a) [10 points] Write down the full objective function g(θ; D) in terms of the features fi and weights θi.

(b) [10 points] Derive ∂g(θ; D)/∂θi, the derivative of the objective with respect to a weight θi. Write your final answer in terms of feature expectations E_Q[fi], where Q is either the empirical distribution of our dataset, P̂, or a conditional distribution of the form Pθ(W | Z = z) (for some sets of variables W, Z, and assignment z).

3. [10 points] Expectation Maximization in a Naive Bayes Model


Consider the naive Bayes model with class variable C and discrete evidence variables X1, ..., Xn. The CPDs for the model are parameterized by P(C = c) = θ_c and P(Xi = xi | C = c) = θ_{xi|c}, for i = 1, ..., n and for all assignments xi ∈ Val(Xi) and classes c ∈ Val(C).
Now given a data set D = {x[1], . . . , x[M ]}, where each x[m] is a complete assignment to the evidence
variables, X1 , . . . , Xn , we can use EM to learn the parameters of our model. Note that the class variable,
C, is never observed.
Show that if we initialize the parameters uniformly,

θ_c^0 = 1/|Val(C)|    and    θ_{xi|c}^0 = 1/|Val(Xi)|,

for all xi , c, then the EM algorithm converges in one iteration, and give a closed form expression for the
parameter values at this convergence point.
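The claimed one-iteration convergence is easy to check numerically before deriving it. The sketch below is a hedged illustration: the em_step helper, the toy dataset, and the variable cardinalities are all made up for this example. Running two EM steps from the uniform initialization and comparing the resulting parameters gives a quick empirical check of the claim.

```python
import numpy as np

def em_step(theta_c, theta_x, data):
    """One EM iteration for naive Bayes with a hidden class variable C.

    theta_c : (K,) class prior
    theta_x : list of (K, |Val(X_i)|) CPD tables, one per evidence variable
    data    : (M, n) integer array of complete evidence assignments
    """
    M, n = data.shape
    # E-step: responsibilities g[m, c] = P(C = c | x[m]), computed in log space.
    log_g = np.log(theta_c)[None, :].repeat(M, axis=0)
    for i in range(n):
        log_g += np.log(theta_x[i][:, data[:, i]]).T
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood estimates from expected counts.
    new_c = g.sum(axis=0) / M
    new_x = []
    for i in range(n):
        counts = np.zeros_like(theta_x[i])
        for x_val in range(theta_x[i].shape[1]):
            counts[:, x_val] = g[data[:, i] == x_val].sum(axis=0)
        new_x.append(counts / counts.sum(axis=1, keepdims=True))
    return new_c, new_x
```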

[60 points] Programming Assignment¹

In this homework, you will apply Gibbs sampling to a simple Markov random field model for image restoration. In an image restoration problem, you are given an image X corrupted by noise and you want to recover the original image Y (see Figure 1 and Lecture 4).

[Figure 1: Image denoising with graphical models. (a) Image denoising task. (b) Graphical model structure, for N = M = 3.]

Let x = {xij } denote the observed image, with xij ∈ {−1, +1} representing the pixel at row i and column
j. Assume a black-and-white image, with -1 corresponding to white and +1 to black. The image has
dimensions N × M , so that 1 ≤ i ≤ N and 1 ≤ j ≤ M . Assume a set of (unobserved) variables y = {yij }
representing the true (unknown) image, with yij ∈ {−1, +1} indicating the value of xij before noise was
added. Each (internal) yij is linked with four immediate neighbors, yi−1,j , yi+1,j , yi,j−1 , and yi,j+1 , which
together are denoted yN (i,j) . Pixels at the borders of the image (with i ∈ {1, N } or j ∈ {1, M }) also have
neighbors denoted yN (i,j) , but these sets are reduced in the obvious way. We denote E the corresponding
set of edges. For example, the pair ((1, 1), (1, 2)) ∈ E, but the pair ((1, 1), (2, 2)) ∈
/ E. The joint probability
of y and x can be written (with no prior preference for black or white):

p(y, x) = (1/Z) ∏_{i=1}^N ∏_{j=1}^M exp(η yij xij) × ∏_{((i,j),(i′,j′))∈E} exp(β yij yi′j′)        (1)

        = (1/Z) exp( η ∑_{i=1}^N ∑_{j=1}^M yij xij + β ∑_{((i,j),(i′,j′))∈E} yij yi′j′ )        (2)

where

Z = ∑_{y,x} exp( η ∑_{i,j} yij xij + β ∑_{((i,j),(i′,j′))∈E} yij yi′j′ )        (3)

(Notice in particular that each pair of neighbors, yij and yi′j′, enters the formula only once, even though each variable is a neighbor of the other. Failing to account for this will lead to double counting of β values.) This is equivalent to a Boltzmann (sometimes called Gibbs) distribution with “energy”:
E(y, x) = −η ∑_{i,j} yij xij − β ∑_{((i,j),(i′,j′))∈E} yij yi′j′        (4)

¹ Assignment adapted from Cornell’s BTRY 6790, instructed by Adam Siepel

The system will have lower energy, and hence higher probability, in states in which neighboring yij vari-
ables, and neighboring yij and xij variables, tend to have the same value (assuming η and β are positive).
This captures the fact that each noisy pixel xij is likely to be similar to the corresponding “true” pixel
yij , and that images tend to be “smooth”.
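When implementing Eq. (4), the double-counting warning above can be honored by visiting each edge exactly once, e.g. by pairing every pixel only with its right and down neighbors. A minimal sketch, assuming y and x are NumPy arrays of ±1 values (the function name and array representation are choices of this sketch, not part of the assignment):

```python
import numpy as np

def energy(y, x, eta, beta):
    """Energy E(y, x) from Eq. (4); each edge in E is counted exactly once."""
    data_term = eta * np.sum(y * x)
    # Pair each pixel only with its right and down neighbor, so that
    # ((i,j),(i',j')) and ((i',j'),(i,j)) are never both summed.
    smooth_term = beta * (np.sum(y[:, :-1] * y[:, 1:]) +
                          np.sum(y[:-1, :] * y[1:, :]))
    return -(data_term + smooth_term)
```

On a 2×2 all-black image with x = y, η = β = 1, this gives −(4 + 4) = −8: four data terms and four edges, each counted once.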

There are algorithms for deterministically estimating y given an image x, but here we will use an alternative approach: we will devise a Markov chain Monte Carlo (MCMC) algorithm to sample values of y conditional on x. This approach has some advantages over the deterministic algorithms:
i. It is very general, and can easily be extended to more complex graphs.
ii. It provides great flexibility for quantifying the uncertainty of y (and, potentially, of the parameters η and β).
iii. In this setting it is relatively straightforward to derive the exact conditional distribution of each node given its Markov blanket, so Gibbs sampling is possible, and one need not worry about the acceptance rate for proposed samples.
You will apply your methods to two small, black-and-white images that have been made available with
the problem set. These two noisy images, and the original, undistorted image from which they derive,
are available both in PNG format and in a simple text format that lists each coordinate pair (i, j) and
the corresponding value of xij . You may find it useful to convert between this text representation and a
viewable image format.
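Parsing the text format into an array might look like the following. This is a hedged sketch: it assumes one whitespace-separated "i j x_ij" triple per line with 1-indexed coordinates, which may not match the actual files exactly.

```python
import numpy as np

def read_image_txt(path):
    """Parse lines of the form "i j x_ij" into an (N, M) array of {-1, +1}."""
    triples = [tuple(map(int, line.split()))
               for line in open(path) if line.strip()]
    n = max(i for i, _, _ in triples)
    m = max(j for _, j, _ in triples)
    img = np.zeros((n, m), dtype=int)
    for i, j, v in triples:
        img[i - 1, j - 1] = v      # convert 1-indexed coordinates to 0-indexed
    return img
```

The inverse direction (array to viewable image) can then be handled by any image library that accepts a 2-D array, mapping −1 to white and +1 to black.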

(a) [5 points] Derive an expression for the conditional probability that pixel (i, j) is black given its Markov blanket, i.e., p(yij = 1 | yM(i,j)), where yM(i,j) denotes the variables in the Markov blanket of yij (you should be explicit about which variables are included). Your expression should take the form of a logistic function and should depend only on η, β, and yM(i,j).
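Whatever closed form you derive, you can sanity-check it numerically: up to normalization, the conditional depends only on the local factors involving yij, so evaluating them at yij = +1 and yij = −1 and normalizing gives the exact answer. A sketch of such a check (the scalar neighbor_sum argument, i.e. the sum of the y values in the blanket, is a convention of this sketch):

```python
import math

def conditional_black(x_ij, neighbor_sum, eta, beta):
    """p(y_ij = 1 | Markov blanket), by brute-force normalization of the
    local factors exp(eta*y*x_ij + beta*y*neighbor_sum) for y = +1 and -1."""
    plus = math.exp(eta * x_ij + beta * neighbor_sum)
    minus = math.exp(-eta * x_ij - beta * neighbor_sum)
    return plus / (plus + minus)
```

Comparing this brute-force value against your derived logistic expression on a few inputs is a cheap way to catch sign and factor-of-two errors.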

(b) [10 points] Outline a Gibbs sampling algorithm (in pseudocode) that iterates over the pixels in the
image and samples each yij given its Markov blanket. Use the simple approach of sweeping across the
image in row-major fashion on every iteration of the algorithm. Thus, an “iteration” will generate
a complete new sample of y. Allow for a burn-in of B iterations, followed by draws of S samples.
You may assume η and β are fixed constants. How can we show in our case that the equilibrium
distribution is in fact the posterior distribution p(y|x)?

(c) [15 points] Implement your algorithm and apply it to the image with 20% noise (noisy 20.png,txt). Use values of η = 1, β = 1, B = 100, and S = 1000. On each iteration of your algorithm, compute the energy E(y, x) for the current sample of y and output it to a log file, keeping track of which values correspond to the burn-in. Run your algorithm with three different initializations: one in which each yij is initialized to xij, one in which each yij is initialized to −xij, and one in which the yij are set to −1 or +1 at random. Plot the energy of the model as a function of the iteration number for all three chains and visually inspect these traces for signs of convergence. Do all three chains seem to be converging to the same general region of the posterior, or are some obviously suboptimal? Does the burn-in seem adequate in length? Is there substantial fluctuation from iteration to iteration, indicating that the chain is mixing well, or does it become stuck at particular energies for several iterations at a time?

(d) [10 points] Have your program output a restored image after completing its sampling iterations, by thresholding the estimated posterior probabilities for the yij variables at 0.5, i.e., by estimating the “true” color of each pixel (i, j) as:

ŷij = +1 if p(yij = 1 | x) > 0.5, and ŷij = −1 otherwise.

To estimate the required posterior probabilities, store a running count cij of the number of (retained)
samples for which each yij = 1, and then use the Monte Carlo estimate:

p(yij = 1 | x) ≈ (1/S) ∑_{t=1}^S 1(yij(t) = 1) = cij / S        (5)

where yij(t) denotes the t-th sample of yij. Restore both the 10%- and 20%-noise images in this way, using the same values of η, β, B, and S as above. Evaluate the quality of each restoration by computing the fraction of all pixels that differ between the restored image and the original image. Prepare a figure for each of the two images, showing the original, the noisy version, and the restoration side by side.
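The counting-and-thresholding step reduces to a couple of array operations. A sketch, assuming the running counts cij are held in a NumPy array counts (the helper names are made up for this example):

```python
import numpy as np

def restore(counts, S):
    """Threshold the Monte Carlo estimates c_ij / S at 0.5, per Eq. (5)."""
    return np.where(counts / S > 0.5, 1, -1)

def restoration_error(restored, original):
    """Fraction of pixels that differ from the original image."""
    return np.mean(restored != original)
```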

(e) [10 points] If you have implemented your algorithm correctly, your restored images should be quite close to the original. But is this because you have a clever algorithm, or just because the problem is easy? To examine this question, implement a trivial reconstruction algorithm that sets each yij equal to the consensus (majority) of its neighbors (including xij), and iterates until convergence (use sequential rather than batch updates, as in Gibbs sampling; this algorithm need not converge in theory, but it usually does so quickly in practice, and to be safe you can force it to terminate after, say, 30 iterations). You should be able to get this program working quickly and easily by reusing code from your Gibbs sampler. However, note that in this case you should not average over samples (there are no samples here) but instead should use the final values of the yij variables for your restored image. Run this program on both images and compute its restoration error. Include figures for the images restored in this way. Does the Gibbs sampler do better than the trivial algorithm? Why or why not?
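The baseline can be sketched as follows (a hedged illustration: ties are kept at the current value here, which is one of several reasonable tie-breaking choices, and the helper name is made up):

```python
import numpy as np

def consensus_restore(x, max_iters=30):
    """Set each y_ij to the majority vote of its neighbors and x_ij,
    using sequential (in-place) updates, as in Gibbs sampling."""
    y = x.copy()
    n, m = y.shape
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            for j in range(m):
                votes = x[i, j]
                if i > 0:     votes += y[i - 1, j]
                if i < n - 1: votes += y[i + 1, j]
                if j > 0:     votes += y[i, j - 1]
                if j < m - 1: votes += y[i, j + 1]
                # Ties (votes == 0) keep the current value.
                new = y[i, j] if votes == 0 else (1 if votes > 0 else -1)
                changed |= (new != y[i, j])
                y[i, j] = new
        if not changed:
            break
    return y
```

Because ±1 values are used, the majority is simply the sign of the vote sum, and the structure of the sweep mirrors a Gibbs iteration with the sampling step replaced by a deterministic vote.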

(f) [10 points] While the Gibbs sampler is useful for obtaining marginal posterior probabilities of interest, much of its appeal derives from its flexibility in estimating posterior distributions over more complex features of the model. To get a sense of this flexibility, use your Gibbs sampler to estimate the posterior distribution over the number of pixels in the “Z” in the image, which approximately falls in the rectangle from (i = 125, j = 143) to (i = 162, j = 174). Using the same parameters as above, simply count the number of pixels with yij = +1 within this rectangle for each retained sample, output one count per iteration as your sampler runs, and then use the relative frequencies of these counts as an estimate of the posterior distribution of interest. Plot a histogram showing these relative frequencies for both images and comment on any differences between the two estimated posterior distributions.
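Per-sample counting is a one-liner over the rectangle. A sketch, assuming the current sample y is an (N, M) NumPy array of ±1 values, converting from the assignment's 1-indexed, inclusive coordinates:

```python
import numpy as np

def z_pixel_count(y, top=125, left=143, bottom=162, right=174):
    """Number of black (+1) pixels inside the rectangle (inclusive bounds,
    1-indexed pixel convention as in the assignment)."""
    region = y[top - 1:bottom, left - 1:right]
    return int(np.sum(region == 1))
```

Collecting one such count per retained sample and histogramming the results gives the desired posterior estimate.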
