Understanding the Metropolis-Hastings Algorithm

Author(s): Siddhartha Chib and Edward Greenberg


Source: The American Statistician, Vol. 49, No. 4 (Nov., 1995), pp. 327-335
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association

Stable URL: https://www.jstor.org/stable/2684568


JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide
range of content in a trusted digital archive. We use information technology and tools to increase productivity and
facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at
https://about.jstor.org/terms

Taylor & Francis, Ltd. and American Statistical Association are collaborating with JSTOR to
digitize, preserve and extend access to The American Statistician

This content downloaded from


103.144.93.246 on Fri, 13 Dec 2024 18:59:14 UTC
All use subject to https://about.jstor.org/terms
Understanding the Metropolis-Hastings Algorithm

Siddhartha CHIB and Edward GREENBERG

We provide a detailed, introductory exposition of the Metropolis-Hastings algorithm, a powerful Markov chain method to simulate multivariate distributions. A simple, intuitive derivation of this method is given along with guidance on implementation. Also discussed are two applications of the algorithm, one for implementing acceptance-rejection sampling when a blanketing function is not available and the other for implementing the algorithm with block-at-a-time scans. In the latter situation, many different algorithms, including the Gibbs sampler, are shown to be special cases of the Metropolis-Hastings algorithm. The methods are illustrated with examples.

KEY WORDS: Gibbs sampling; Markov chain Monte Carlo; Multivariate density simulation; Reversible Markov chains.

1. INTRODUCTION

In recent years statisticians have been increasingly drawn to Markov chain Monte Carlo (MCMC) methods to simulate complex, nonstandard multivariate distributions. The Gibbs sampling algorithm is one of the best known of these methods, and its impact on Bayesian statistics, following the work of Tanner and Wong (1987) and Gelfand and Smith (1990), has been immense as detailed in many articles, for example, Smith and Roberts (1993), Tanner (1993), and Chib and Greenberg (1993). A considerable amount of attention is now being devoted to the Metropolis-Hastings (M-H) algorithm, which was developed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and subsequently generalized by Hastings (1970). This algorithm is extremely versatile and gives rise to the Gibbs sampler as a special case, as pointed out by Gelman (1992). The M-H algorithm has been used extensively in physics, yet despite the paper by Hastings, it was little known to statisticians until recently. Papers by Müller (1993) and Tierney (1994) were instrumental in exposing the value of this algorithm and stimulating interest among statisticians in its use.

Because of the usefulness of the M-H algorithm, applications are appearing steadily in the current literature (see Müller (1993), Chib and Greenberg (1994), and Phillips and Smith (1994) for recent examples). Despite its obvious importance, however, no simple or intuitive exposition of the M-H algorithm, comparable to that of Casella and George (1992) for the Gibbs sampler, is available. This article is an attempt to fill this gap. We provide a tutorial introduction to the algorithm, deriving the algorithm from first principles. The article is self-contained since it includes the relevant Markov chain theory, issues related to implementation and tuning, and empirical illustrations. We also discuss applications of the method, one for implementing acceptance-rejection sampling when a blanketing function is not available, developed by Tierney (1994), and the other for applying the algorithm one "block at a time." For the latter situation, we present an important principle that we call the product of kernels principle and explain how it is the basis of many other algorithms, including the Gibbs sampler. In each case we emphasize the intuition for the method and present proofs of the main results. For mathematical convenience, our entire discussion is phrased in the context of simulating an absolutely continuous target density, but the same ideas apply to discrete and mixed continuous-discrete distributions.

The rest of the article is organized as follows. In Section 2 we briefly review the acceptance-rejection (A-R) method of simulation. Although not an MCMC method, it uses some concepts that also appear in the Metropolis-Hastings algorithm and is a useful introduction to the topic. Section 3 introduces the relevant Markov chain theory for continuous state spaces, along with the general philosophy behind MCMC methods. In Section 4 we derive the M-H algorithm by exploiting the notion of reversibility defined in Section 3, and discuss some important features of the algorithm and the mild regularity conditions that justify its use. Section 5 contains issues related to the choice of the candidate-generating density and guidance on implementation. Section 6 discusses how the algorithm can be used in an acceptance-rejection scheme when a dominating density is not available. This section also explains how the algorithm can be applied when the variables to be simulated are divided into blocks. The final section contains two numerical examples, the first involving the simulation of a bivariate normal distribution, and the second the Bayesian analysis of an autoregressive model.

[Author note: Siddhartha Chib is at the John M. Olin School of Business, Washington University, St. Louis, MO 63130. Edward Greenberg is at the Department of Economics, Washington University, St. Louis, MO 63130. The authors express their thanks to the editor of the journal, the associate editor, and four referees for many useful comments on the paper, and to Michael Ogilvie and Pin-Huang Chou for helpful discussions.]

2. ACCEPTANCE-REJECTION SAMPLING

In contrast to the MCMC methods described below, classical simulation techniques generate non-Markov (usually independent) samples; that is, the successive observations are statistically independent unless correlation is artificially introduced as a variance reduction device. An important method in this class is the A-R method, which can be described as follows.

The objective is to generate samples from the absolutely continuous target density π(x) = f(x)/K, where x ∈ ℝ^d, f(x) is the unnormalized density, and K is the (possibly unknown) normalizing constant. Let h(x) be a density that can be simulated by some known method, and suppose there is a known constant c such that f(x) ≤ ch(x) for all x. Then, to obtain a random variate from π(·):

(*) Generate a candidate Z from h(·) and a value u from U(0, 1), the uniform distribution on (0, 1).
© 1995 American Statistical Association    The American Statistician, November 1995, Vol. 49, No. 4    327
* If u ≤ f(Z)/ch(Z)
  - return Z = y.
* Else
  - goto (*).

It is easily shown that the accepted value y is a random variate from π(·). For this method to be efficient, c must be carefully selected. Because the expected number of iterations of steps 1 and 2 to obtain a draw is proportional to c, the rejection method is optimized by setting

c = sup_x [f(x)/h(x)].

Even this choice, however, may result in an undesirably large number of rejections. The notion of a generating density also appears in the M-H algorithm, but before considering the differences and similarities, we turn to the rationale behind MCMC methods.

3. MARKOV CHAIN MONTE CARLO SIMULATION

The usual approach to Markov chain theory on a continuous state space is to start with a transition kernel P(x, A) for x ∈ ℝ^d and A ∈ 𝓑, where 𝓑 is the Borel σ-field on ℝ^d. The transition kernel is a conditional distribution function that represents the probability of moving from x to a point in the set A. By virtue of its being a distribution function, P(x, ℝ^d) = 1, where it is permitted that the chain can make a transition from the point x to x, that is, P(x, {x}) is not necessarily zero.

A major concern of Markov chain theory [see Nummelin (1984), Billingsley (1986), Bhattacharya and Waymire (1990), and, especially, Meyn and Tweedie (1993)] is to determine conditions under which there exists an invariant distribution π* and conditions under which iterations of the transition kernel converge to the invariant distribution. The invariant distribution satisfies

π*(dy) = ∫ P(x, dy) π(x) dx,   (1)

where π is the density with respect to Lebesgue measure of π* (thus π*(dy) = π(y) dy). The nth iterate is given by P^(n)(x, A) = ∫ P^(n-1)(x, dy) P(y, A), where P^(1)(x, dy) = P(x, dy). Under conditions discussed in the following, it can be shown that the nth iterate converges to the invariant distribution as n → ∞.

MCMC methods turn the theory around: the invariant density is known (perhaps up to a constant multiple); it is π(·), the target density from which samples are desired. But the transition kernel is unknown. To generate samples from π(·), the methods find and utilize a transition kernel P(x, dy) whose nth iterate converges to π(·) for large n. The process is started at an arbitrary x and iterated a large number of times. After this large number, the distribution of the observations generated from the simulation is approximately the target distribution.

The problem then is to find an appropriate P(x, dy). What might appear to be a search for the proverbial needle in a haystack is somewhat simplified by the following considerations. Suppose that the transition kernel, for some function p(x, y), is expressed as

P(x, dy) = p(x, y) dy + r(x) δ_x(dy),   (2)

where p(x, x) = 0, δ_x(dy) = 1 if x ∈ dy and 0 otherwise, and r(x) = 1 − ∫ p(x, y) dy is the probability that the chain remains at x. From the possibility that r(x) ≠ 0, it should be clear that the integral of p(x, y) over y is not necessarily 1.

Now, if the function p(x, y) in (2) satisfies the reversibility condition (also called "detailed balance," "microscopic reversibility," and "time reversibility")

π(x) p(x, y) = π(y) p(y, x),   (3)

then π(·) is the invariant density of P(x, ·) (Tierney 1994). To verify this we evaluate the right-hand side of (1):

∫ P(x, A) π(x) dx = ∫ [∫_A p(x, y) dy] π(x) dx + ∫ r(x) δ_x(A) π(x) dx
                  = ∫_A [∫ p(x, y) π(x) dx] dy + ∫_A r(x) π(x) dx
                  = ∫_A [∫ p(y, x) π(y) dx] dy + ∫_A r(x) π(x) dx
                  = ∫_A [1 − r(y)] π(y) dy + ∫_A r(x) π(x) dx
                  = ∫_A π(y) dy.   (4)

Intuitively, the left-hand side of the reversibility condition is the unconditional probability of moving from x to y, where x is generated from π(·), and the right-hand side is the unconditional probability of moving from y to x, where y is also generated from π(·). The reversibility condition says that the two sides are equal, and the above result shows that π*(·) is the invariant distribution for P(·, ·).

This result gives us a sufficient condition (reversibility) that must be satisfied by p(x, y). We now show how the Metropolis-Hastings algorithm finds a p(x, y) with this property.

4. THE METROPOLIS-HASTINGS ALGORITHM

As in the A-R method, suppose we have a density that can generate candidates. Since we are dealing with Markov chains, however, we permit that density to depend on the current state of the process. Accordingly, the candidate-generating density is denoted q(x, y), where ∫ q(x, y) dy = 1. This density is to be interpreted as saying that when a process is at the point x, the density generates a value y from q(x, y). If it happens that q(x, y) itself satisfies the reversibility condition (3) for all x, y, our search is over. But most likely it will not. We might find, for example, that for some x, y,

π(x) q(x, y) > π(y) q(y, x).   (5)

In this case, speaking somewhat loosely, the process moves from x to y too often and from y to x too rarely. A convenient way to correct this condition is to reduce the number of moves from x to y by introducing a probability



α(x, y) ≤ 1 that the move is made. We refer to α(x, y) as the probability of move. If the move is not made, the process again returns x as a value from the target distribution. (Note the contrast with the A-R method in which, when a y is rejected, a new pair (y, u) is drawn independently of the previous value of y.) Thus transitions from x to y (y ≠ x) are made according to

p_MH(x, y) ≡ q(x, y) α(x, y),   x ≠ y,

where α(x, y) is yet to be determined.

[Figure 1. Calculating Probabilities of Move With Symmetric Candidate-Generating Function (see text).]

Consider again inequality (5). It tells us that the movement from y to x is not made often enough. We should therefore define α(y, x) to be as large as possible, and since it is a probability, its upper limit is 1. But now the probability of move α(x, y) is determined by requiring that p_MH(x, y) satisfies the reversibility condition, because then

π(x) q(x, y) α(x, y) = π(y) q(y, x) α(y, x)
                     = π(y) q(y, x).   (6)

We now see that α(x, y) = π(y) q(y, x)/π(x) q(x, y). Of course, if the inequality in (5) is reversed, we set α(x, y) = 1 and derive α(y, x) as above. The probabilities α(x, y) and α(y, x) are thus introduced to ensure that the two sides of (5) are in balance or, in other words, that p_MH(x, y) satisfies reversibility. Thus we have shown that in order for p_MH(x, y) to be reversible, the probability of move must be set to

α(x, y) = min[π(y) q(y, x)/π(x) q(x, y), 1],   if π(x) q(x, y) > 0,
        = 1,   otherwise.

To complete the definition of the transition kernel for the Metropolis-Hastings chain, we must consider the possibly nonzero probability that the process remains at x. As defined above, this probability is

r(x) = 1 − ∫ q(x, y) α(x, y) dy.

Consequently, the transition kernel of the M-H chain, denoted by P_MH(x, dy), is given by

P_MH(x, dy) = q(x, y) α(x, y) dy + [1 − ∫ q(x, y) α(x, y) dy] δ_x(dy),

a particular case of (2). Because p_MH(x, y) is reversible by construction, it follows from the argument in (4) that the M-H kernel has π(x) as its invariant density.

Several remarks about this algorithm are in order. First, the M-H algorithm is specified by its candidate-generating density q(x, y), whose selection we take up in the next section. Second, if a candidate value is rejected, the current value is taken as the next item in the sequence. Third, the calculation of α(x, y) does not require knowledge of the normalizing constant of π(·) because it appears in both the numerator and denominator. Fourth, if the candidate-generating density is symmetric, an important special case, q(x, y) = q(y, x) and the probability of move reduces to π(y)/π(x); hence, if π(y) ≥ π(x), the chain moves to y; otherwise, it moves with probability given by π(y)/π(x). In other words, if the jump goes "uphill," it is always accepted; if "downhill," it is accepted with a nonzero probability. [See Fig. 1 where, from the current point x, a move to candidate y₁ is made with certainty, while a move to candidate y₂ is made with probability π(y₂)/π(x).] This is the algorithm proposed by Metropolis et al. (1953). Interestingly, it also forms the basis for several optimization algorithms, notably the method of simulated annealing.

We now summarize the M-H algorithm in algorithmic form, initialized with the (arbitrary) value x^(0):

* Repeat for j = 1, 2, ..., N.
  - Generate y from q(x^(j), ·) and u from U(0, 1).
  - If u ≤ α(x^(j), y), set x^(j+1) = y.
  - Else, set x^(j+1) = x^(j).
* Return the values {x^(1), x^(2), ..., x^(N)}.

As in any MCMC method, the draws are regarded as a sample from the target density π(x) only after the chain has passed the transient stage and the effect of the fixed starting value has become so small that it can be ignored. In fact, this convergence to the invariant distribution occurs under mild regularity conditions. The regularity conditions required are irreducibility and aperiodicity [see Smith and Roberts (1993)]. What these mean is that, if x and y are in the domain of π(·), it must be possible to move from x to dy in a finite number of iterations with nonzero probability, and the number of moves required to move from x to dy is not required to be a multiple of some integer. These conditions are usually satisfied if q(x, y) has a positive density on the same support as that of π(·). They are usually also satisfied by a q(x, y) with a restricted support (e.g., a uniform distribution around the current point with finite width). These conditions, however, do not determine the rate of convergence [see Roberts and Tweedie (1994)], so there is an empirical question of how large an initial sample of size n₀ (say) should be discarded and how long the sampling should be run. One possibility, due to Gelman and Rubin (1992), is to start multiple chains from dispersed initial values and compare the within and between variation of the sampled draws. A simple heuristic that works in some situations is to make n₀ and N increasing functions of the first-order serial correlation in the output.



This entire area, however, is quite unsettled and is being actively researched. For more details the reader should consult Gelman and Rubin (1992) and the accompanying discussion.

5. IMPLEMENTATION ISSUES: CHOICE OF q(x, y)

To implement the M-H algorithm, it is necessary that a suitable candidate-generating density be specified. Typically, this density is selected from a family of distributions that requires the specification of such tuning parameters as the location and scale. Considerable recent work is being devoted to the question of how these choices should be made and, although the theory is far from complete, enough is known to conduct most practical simulation studies.

One family of candidate-generating densities, which appears in the work of Metropolis et al. (1953), is given by q(x, y) = q₁(y − x), where q₁(·) is a multivariate density [see Müller (1993)]. The candidate y is thus drawn according to the process y = x + z, where z is called the increment random variable and follows the distribution q₁. Because the candidate is equal to the current value plus noise, this case is called a random walk chain. Possible choices for q₁ include the multivariate normal density and the multivariate-t with the parameters specified according to the principles described below. Note that when q₁ is symmetric, the usual circumstance, q₁(z) = q₁(−z); the probability of move then reduces to

α(x, y) = min{π(y)/π(x), 1}.

As mentioned earlier, the same reduction occurs if q(x, y) = q(y, x).

A second family of candidate-generating densities is given by the form q(x, y) = q₂(y) [see Hastings (1970)]. In contrast to the random walk chain, the candidates are drawn independently of the current location x, an independence chain in Tierney's (1994) terminology. As in the first case, we can let q₂ be a multivariate normal or multivariate-t density, but now it is necessary to specify the location of the generating density as well as the spread.

A third choice, which seems to be an efficient solution when available, is to exploit the known form of π(·) to specify a candidate-generating density [see Chib and Greenberg (1994)]. For example, if π(t) can be written as π(t) ∝ ψ(t)h(t), where h(t) is a density that can be sampled and ψ(t) is uniformly bounded, then set q(x, y) = h(y) (as in the independence chain) to draw candidates. In this case, the probability of move requires only the computation of the ψ function (not π or h) and is given by

α(x, y) = min{ψ(y)/ψ(x), 1}.

A fourth method of drawing candidates is to use the A-R method with a pseudodominating density. This method was developed in Tierney (1994), and because it is of independent interest as an M-H acceptance-rejection method, we explain it in Section 6.1.

A fifth family, also suggested by Tierney (1994), is represented by a vector autoregressive process of order 1. These autoregressive chains are produced by letting y = a + B(x − a) + z, where a is a vector and B is a matrix (both conformable with x) and z has q as its density. Then q(x, y) = q(y − a − B(x − a)). Setting B = −I produces chains that are reflected about the point a and is a simple way to induce negative correlation between successive elements of the chain.

We now return to the critical question of choosing the spread, or scale, of the candidate-generating density. This is an important matter that has implications for the efficiency of the algorithm. The spread of the candidate-generating density affects the behavior of the chain in at least two dimensions: one is the "acceptance rate" (the percentage of times a move to a new point is made), and the other is the region of the sample space that is covered by the chain. To see why, consider the situation in which the chain has converged and the density is being sampled around the mode. Then, if the spread is extremely large, some of the generated candidates will be far from the current value, and will therefore have a low probability of being accepted (because the ordinate of the candidate is small relative to the ordinate near the mode). Reducing the spread will correct this problem, but if the spread is chosen too small, the chain will take longer to traverse the support of the density, and low probability regions will be undersampled. Both of these situations are likely to be reflected in high autocorrelations across sample values.

Recent work by Roberts, Gelman, and Gilks (1994) discussed this issue in the context of q₁ (the random walk proposal density). They show that if the target and proposal densities are normal, then the scale of the latter should be tuned so that the acceptance rate is approximately .45 in one-dimensional problems and approximately .23 as the number of dimensions approaches infinity, with the optimal acceptance rate being around .25 in as low as six dimensions. This is similar to the recommendation of Müller (1993), who argues that the acceptance rate should be around .5 for the random walk chain.

The choice of spread of the proposal density in the case of q₂ (the independence proposal density) has also come under recent scrutiny. Chib and Geweke [work in progress] show that it is important to ensure that the tails of the proposal density dominate those of the target density, which is similar to a requirement on the importance sampling function in Monte Carlo integration with importance sampling [see Geweke (1989)]. It is important to mention the caveat that a chain with the "optimal" acceptance rate may still display high autocorrelations. In such circumstances it is usually necessary to try a different family of candidate-generating densities.

6. APPLICATIONS OF THE M-H ALGORITHM

We hope that the reader is now convinced that the M-H algorithm is a useful and straightforward device with which to sample an arbitrary multivariate distribution. In this section we explain two uses of the algorithm, one involving the A-R method, and the other for implementing the algorithm with block-at-a-time scans. In the latter situation many different algorithms, including the Gibbs sampler, are shown to arise as special cases of the M-H algorithm.
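The spread trade-off described in Section 5 is easy to observe numerically. The sketch below is our own illustration (the standard normal target, the three scales, and the seed are arbitrary choices, not from the paper): it runs a random walk chain at each scale and reports the acceptance rate. With a standard normal target, a proposal scale near 2.4 should land close to the .45 one-dimensional rate of Roberts, Gelman, and Gilks (1994), while very small and very large scales push the rate toward 1 and toward 0, respectively.

```python
import math
import random

def rw_chain(log_f, scale, x0, N, seed):
    """Random walk M-H with a normal increment; since q(x, y) = q(y, x),
    alpha reduces to min{pi(y)/pi(x), 1}."""
    rng = random.Random(seed)
    x, accepts, draws = x0, 0, []
    for _ in range(N):
        y = x + rng.gauss(0.0, scale)
        if math.log(rng.random()) <= min(log_f(y) - log_f(x), 0.0):
            x, accepts = y, accepts + 1
        draws.append(x)
    return draws, accepts / N

log_f = lambda x: -0.5 * x * x          # standard normal target
for scale in (0.1, 2.4, 25.0):
    _, rate = rw_chain(log_f, scale, x0=0.0, N=20000, seed=7)
    print(f"scale {scale:5.1f}: acceptance rate {rate:.2f}")
```

Note that the tiny scale "accepts" almost everything yet crawls across the support, which is exactly the undersampling symptom described above; the acceptance rate alone does not certify good mixing.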



6.1 An M-H Acceptance-Rejection Algorithm

Recall that in the A-R method described earlier, a constant c and a density h(x) are needed such that ch(x) dominates or blankets the (possibly) unnormalized target density f(x). Finding a c that does the trick may be difficult in some applications; moreover, if f(x) depends on parameters that are revised during an iterative cycle, finding a new value of c for each new set of the parameters may significantly slow the computations. For these reasons it is worthwhile to have an A-R method that does not require a blanketing function. Tierney's (1994) remarkable algorithm does this by using an A-R step to generate candidates for an M-H algorithm. This algorithm, which seems complicated at first, can be derived rather easily using the intuition we have developed for the M-H algorithm.

To fix the context again: we are interested in sampling the target density π(x) = f(x)/K, where K may be unknown, and a pdf h(·) is available for sampling. Suppose c > 0 is a known constant, but that f(x) is not necessarily less than ch(x) for all x; that is, ch(x) does not necessarily dominate f(x). It is convenient to define the set C where domination occurs:

C = {x : f(x) < ch(x)}.

[Figure 2. Acceptance-Rejection Sampling With Pseudodominating Density ch(x).]

In this algorithm, given x^(n) = x, the next value x^(n+1) is obtained as follows: First, a candidate value z is obtained, independent of the current value x, by applying the A-R algorithm with ch(·) as the "dominating" density. The A-R step is implemented through steps 1 and 2 in Section 2. What is the density of the rv y that comes through this step? Following Rubinstein (1981, pp. 45-46), we have

q(y) = P(y | U ≤ f(Z)/ch(Z)) = [P(U ≤ f(Z)/ch(Z) | Z = y) × h(y)] / Pr(U ≤ f(Z)/ch(Z)).

But because P(U ≤ f(Z)/ch(Z) | Z = y) = min{f(y)/ch(y), 1}, it follows that

q(y) = min{f(y)/ch(y), 1} × h(y) / d,

where d ≡ Pr(U ≤ f(Z)/ch(Z)). By simplifying the numerator of this density we obtain a more useful representation for the candidate-generating density:

q(y) = f(y)/cd,   if y ∈ C,
     = h(y)/d,    if y ∉ C.   (7)

(Note that there is no need to write q(x, y) for this density because the candidate y is drawn independently of x.)

Because ch(y) does not dominate the target density in Cᶜ (by definition), it follows that the target density is not adequately sampled there. See Figure 2 for an illustration of a nondominating density and the C region. This can be corrected with an M-H step applied to the y values that come through the A-R step. Since x and y can each be in C or in Cᶜ, there are four possible cases: (a) x ∈ C, y ∈ C; (b) and (c) x ∉ C, y ∈ C or x ∈ C, y ∉ C; and (d) x ∉ C, y ∉ C.

The objective now is to find the M-H moving probability α(x, y) such that q(y)α(x, y) satisfies reversibility. To proceed, we derive α(x, y) in each of the four possible cases given above. As in (5), we consider π(x)q(y) and π(y)q(x) [or, equivalently, f(x)q(y) and f(y)q(x)] to see how the probability of moves should be defined to ensure reversibility. That is, we need to find α(x, y) and α(y, x) such that

f(x) q(y) α(x, y) = f(y) q(x) α(y, x)

in each of the cases (a)-(d), where q(y) is chosen from (7).

Case (a): x ∈ C, y ∈ C. In this case it is easy to verify that f(x)q(y) = f(x)f(y)/cd is equal to f(y)q(x). Accordingly, setting α(x, y) = α(y, x) = 1 satisfies reversibility.

Cases (b) and (c): x ∉ C, y ∈ C or x ∈ C, y ∉ C. In the first case f(x) > ch(x), or h(x) < f(x)/c, which implies (on multiplying both sides by f(y)/d) that

f(y)h(x)/d < f(y)f(x)/cd,

or, from (7), f(y)q(x) < f(x)q(y): there are relatively too few transitions from y to x and too many in the opposite direction. The problem is alleviated by setting α(y, x) = 1, and α(x, y) is then determined from

α(x, y) f(x)f(y)/cd = f(y)h(x)/d,

which gives α(x, y) = ch(x)/f(x). In case (c), interchange the roles of x and y above to find that α(x, y) = 1 and α(y, x) = ch(y)/f(y).

Case (d): x ∉ C, y ∉ C. In this case we have f(x)q(y) = f(x)h(y)/d and f(y)q(x) = f(y)h(x)/d, and there are two possibilities. There are too few transitions from y to x to satisfy reversibility if

f(x)h(y)/d > f(y)h(x)/d.

In that case set α(y, x) = 1 and determine α(x, y) from

α(x, y) f(x)h(y)/d = f(y)h(x)/d,
which implies

α(x, y) = min{ f(y)h(x) / [f(x)h(y)], 1 }.

If there are too few transitions from x to y, just interchange x and y in the above discussion. We thus see that in two of the cases, those where x ∈ C, the probability of move to y is 1, regardless of where y lies. To summarize, we have derived the following probability of move to the candidates y that are produced from the A-R step:

* Let C1 = {f(x) < ch(x)} and C2 = {f(y) < ch(y)}.
* Generate u from U(0, 1) and
  - if C1 = 1, then let α = 1;
  - if C1 = 0 and C2 = 1, then let α = ch(x)/f(x);
  - if C1 = 0 and C2 = 0, then let α = min{f(y)h(x)/[f(x)h(y)], 1}.
* If u ≤ α,
  - return y.
* Else
  - return x.

6.2 Block-at-a-Time Algorithms

Another interesting situation arises when the M-H algorithm is applied in turn to subblocks of the vector x, rather than simultaneously to all elements of the vector. This "block-at-a-time" or "variable-at-a-time" possibility, which is discussed in Hastings (1970, sec. 2.4), often simplifies the search for a suitable candidate-generating density and gives rise to several interesting hybrid algorithms obtained by combining M-H updates.

The central idea behind these algorithms may be illustrated with two blocks, x = (x1, x2), where xi ∈ R^{di}. Suppose that there exists a conditional transition kernel P1(x1, dy1 | x2) with the property that, for a fixed value of x2, π*1|2(· | x2) is its invariant distribution (with density π1|2(· | x2)), that is,

π*1|2(dy1 | x2) = ∫ P1(x1, dy1 | x2) π1|2(x1 | x2) dx1.   (8)

Also, suppose the existence of a conditional transition kernel P2(x2, dy2 | x1) with the property that, for a given x1, π*2|1(· | x1) is its invariant distribution, analogous to (8). For example, P1 could be the transition kernel generated by a Metropolis-Hastings chain applied to the block x1, with x2 fixed for all iterations. Now, somewhat surprisingly, it turns out that the product of the transition kernels has π*(x1, x2) as its invariant density. The practical significance of this principle (which we call the product of kernels principle) is enormous because it allows us to take draws in succession from each of the kernels, instead of having to run each of the kernels to convergence for every value of the conditioning variable. In addition, as suggested above, this principle is extremely useful because it is usually far easier to find several conditional kernels that converge to their respective conditional densities than to find one kernel that converges to the joint.

To establish the product of kernels principle it is necessary to specify the nature of the "scan" through the elements of x (Hastings mentions several possibilities). Suppose the transition kernel P1(·, · | x2) produces y1 given x1 and x2, and the transition kernel P2(x2, dy2 | y1) generates y2 given x2 and y1. Then the kernel formed by multiplying the conditional kernels has π*(·, ·) as its invariant distribution:

∫∫ P1(x1, dy1 | x2) P2(x2, dy2 | y1) π*(x1, x2) dx1 dx2
  = ∫ P2(x2, dy2 | y1) [∫ P1(x1, dy1 | x2) π1|2(x1 | x2) dx1] π2(x2) dx2
  = ∫ P2(x2, dy2 | y1) π*1|2(dy1 | x2) π2(x2) dx2
  = ∫ P2(x2, dy2 | y1) π2|1(x2 | y1) π*1(dy1) dx2
  = π*1(dy1) ∫ P2(x2, dy2 | y1) π2|1(x2 | y1) dx2
  = π*1(dy1) π*2|1(dy2 | y1)
  = π*(dy1, dy2),

where the third line follows from (8), the fourth from Bayes theorem, the sixth from assumed invariance of P2, and the last from the law of total probability.

With this result in hand, several important special cases of the M-H algorithm can be mentioned. The first special case is the so-called "Gibbs sampler." This algorithm is obtained by letting the transition kernel P1(x1, dy1 | x2) = π*1|2(dy1 | x2), and P2(x2, dy2 | y1) = π*2|1(dy2 | y1), that is, the samples are generated directly from the "full conditional distributions." Note that this method requires that it be possible to generate independent samples from each of the full conditional densities. The calculations above demonstrate that this algorithm is a special case of the M-H algorithm. Alternatively, it may be checked that the M-H acceptance probability α(x, y) = 1 for all x, y.

Another special case of the M-H algorithm is the so-called "M-H within Gibbs" algorithm (but see our comments on terminology below), in which an intractable full conditional density [say π1|2(y1 | x2)] is sampled with the general form of the M-H algorithm described in Section 4 and the others are sampled directly from their full conditional distributions. Many other algorithms can be similarly developed that arise from multiplying conditional kernels.

We conclude this section with a brief digression on terminology. It should be clear from the discussion in this subsection that the M-H algorithm can take many different forms, one of which is the Gibbs sampler. Because much of the literature has overlooked Hastings's discussion of M-H algorithms that scan one block at a time, some unfortunate usage ("M-H within Gibbs," for example) has arisen that should be abandoned. In addition, it may be desirable to define the Gibbs sampler rather narrowly, as we have done above, as the case in which all full conditional kernels are sampled by independent algorithms in a fixed order. Although a special case of the M-H algorithm, it is an extremely important special case.

7. EXAMPLES

We next present two examples of the use of the M-H algorithm. In the first we simulate the bivariate normal to illustrate the effects of various choices of q(x, y); the

second example illustrates the value of setting up blocks of variables in the Bayesian posterior analysis of a second-order autoregressive time series model.

7.1 Simulating a Bivariate Normal

To illustrate the M-H algorithm we consider the simulation of the bivariate normal distribution N2(μ, Σ), where μ = (1, 2)' is the mean vector and Σ = (σij) is the 2 × 2 covariance matrix given by

Σ = | 1.0  .90 |
    | .90  1.0 |.

Because of the high correlation the contours of this distribution are "cigar-shaped," that is, thin and positively inclined. Although this distribution can be simulated directly in the Choleski approach by letting y = μ + P'u, where u ~ N2(0, I2) and P satisfies P'P = Σ, this well-known problem is useful for illustrating the M-H algorithm.

From the expression for the multivariate normal density, the probability of move (for a symmetric candidate-generating density) is

α(x, y) = min{ exp[-(1/2)(y - μ)'Σ^{-1}(y - μ)] / exp[-(1/2)(x - μ)'Σ^{-1}(x - μ)], 1 },   x, y ∈ R².   (9)

We use the following candidate-generating densities, for which the parameters are adjusted by experimentation to achieve an acceptance rate of 40% to 50%:

1. Random walk generating density (y = x + z), where the increment random variable z is distributed as bivariate uniform, that is, the ith component of z is uniform on the interval (-δi, δi). Note that δ1 controls the spread along the first coordinate axis and δ2 the spread along the second. To avoid excessive moves we let δ1 = .75 and δ2 = 1.

2. Random walk generating density (y = x + z) with z distributed as independent normal N2(0, D), where D = diagonal(.6, .4).

3. Pseudorejection sampling generating density with "dominating function" ch(x) = c(2π)^{-1}|D|^{-1/2} exp[-(1/2)(x - μ)'D^{-1}(x - μ)], where D = diagonal(2, 2) and c = .9. The trial draws, which are passed through the A-R step, are thus obtained from a bivariate, independent normal distribution.

4. The autoregressive generating density y = μ - (x - μ) + z, where z is independent uniform with δ1 = 1 = δ2. Thus values of y are obtained by reflecting the current point around μ and then adding the increment.

Note that the probability of move in cases 1, 2, and 4 is given by (9). In addition, the first two generating densities do not make use of the known value of μ, although the values of the δi are related to Σ. In the third generating density we have set the value of the constant c to be smaller than that which leads to true domination. For domination it is necessary to let all diagonal entries of D be equal to 1.9 (the largest eigenvalue of Σ) and to set c = |D|^{1/2}/|Σ|^{1/2} [see Dagpunar (1988, p. 159)].

Each of these four candidate-generating densities reproduces the shape of the bivariate normal distribution being simulated, although overall the best result is obtained from the fourth generating density. To illustrate the characteristics of the output, the top panel of Figure 3 contains the scatter plot of N = 4,000 simulated values from the Choleski approach and the bottom panel the scatter plot of N = 6,000 simulated values using the fourth candidate-generating density. More observations are taken from the M-H algorithm to make the two plots comparable. The plots of the output with the other candidate-generating densities are similar to this and are therefore omitted. At the suggestion of a referee, points that repeat in the M-H chain are "jittered" to improve clarity. The figure clearly reveals that the sampler does a striking job of visiting the entire support of the distribution. This is confirmed by the estimated tail probabilities computed from the M-H output, for which the estimates are extremely close to the true values. Details are not reported to save space.

For the third generating density we found that reductions in the elements of D led to an erosion in the number of times the sampler visited the tails of the distribution. In addition, we found that the first-order serial correlation of the sampled values with the first and second candidate-generating densities is of the order .9, and with the other two it is .30 and .16, respectively. The high serial correlation with the random walk generating densities is not unexpected and stems from the long memory in the candidate draws. Finally, by reflecting the candidates we see that it is possible to obtain a beneficial reduction in the serial correlation of the output with little cost.

7.2 Simulating a Bayesian Posterior

We now illustrate the use of the M-H algorithm to sample an intractable distribution that arises in a stationary second-order autoregressive [AR(2)] time series model. Our presentation is based on Chib and Greenberg (1994), which contains a more detailed discussion and results for the general ARMA(p, q) model.

For our illustration, we simulated 100 observations from the model

y_t = φ1 y_{t-1} + φ2 y_{t-2} + ε_t,   t = 1, 2, ..., 100,   (10)

where φ1 = 1, φ2 = -.5, and ε_t ~ N(0, 1). The values of φ = (φ1, φ2) lie in the region S ⊂ R² that satisfies the stationarity restrictions

φ1 + φ2 < 1;   -φ1 + φ2 < 1;   φ2 > -1.

Following Box and Jenkins (1976), we express the (exact or unconditional) likelihood function for this model given the n = 100 data values Y_n = (y1, y2, ..., y_n)' as

l(φ, σ² | Y_n) = Ψ(φ, σ²) × (σ²)^{-(n-2)/2} × exp[-(1/(2σ²)) Σ_{t=3}^{n} (y_t - w_t'φ)²],   (11)

where w_t = (y_{t-1}, y_{t-2})',

Ψ(φ, σ²) = (σ²)^{-1} |V|^{-1/2} exp[-(1/(2σ²)) Y_2' V^{-1} Y_2]   (12)

is the density of Y_2 = (y1, y2)'

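The first candidate-generating density of Section 7.1, the bivariate uniform random walk with δ1 = .75 and δ2 = 1, can be sketched in a few lines of pure Python. This is an illustrative reconstruction, not the authors' code; the target log density is coded directly from Σ^{-1} for the N2(μ, Σ) distribution of the example, and the symmetric proposal means the probability of move is (9).

```python
import math
import random

# Target of Section 7.1: N2(mu, Sigma) with mu = (1, 2), unit variances,
# and correlation rho = .9.  log_f is the log density up to a constant;
# Sigma^{-1} = [[1, -rho], [-rho, 1]] / (1 - rho^2).
MU = (1.0, 2.0)
RHO = 0.9

def log_f(x):
    d1, d2 = x[0] - MU[0], x[1] - MU[1]
    return -(d1 * d1 - 2.0 * RHO * d1 * d2 + d2 * d2) / (2.0 * (1.0 - RHO * RHO))

def rw_metropolis(n, delta=(0.75, 1.0), seed=1):
    """Random-walk M-H with independent uniform increments on (-delta_i, delta_i)."""
    rng = random.Random(seed)
    x = list(MU)                       # start the chain at the mode
    draws, accepted = [], 0
    for _ in range(n):
        y = [x[i] + rng.uniform(-delta[i], delta[i]) for i in range(2)]
        # Symmetric candidate density, so alpha = min{f(y)/f(x), 1} as in (9).
        if rng.random() < math.exp(min(0.0, log_f(y) - log_f(x))):
            x, accepted = y, accepted + 1
        draws.append(tuple(x))
    return draws, accepted / n

draws, rate = rw_metropolis(6000)
```

With these tuning values the acceptance rate should land near the 40% to 50% range reported in the text, and the sample moments of `draws` approximate μ and Σ.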
[Figure 3. Scatter Plots of Simulated Draws. Top panel: Generated by Choleski approach. Bottom panel: Generated by M-H with reflection candidate-generating density.]
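The reflection proposal of candidate-generating density 4, the one used for the bottom panel of Figure 3, can be sketched similarly. Again this is an illustrative sketch rather than the authors' implementation: the proposal y = μ - (x - μ) + z is symmetric in x and y, so the probability of move (9) applies unchanged.

```python
import math
import random

# Candidate-generating density 4 of Section 7.1: reflect the current point
# about mu, then add an independent uniform increment with delta1 = delta2 = 1.
MU = (1.0, 2.0)
RHO = 0.9

def log_f(x):
    # log N2(mu, Sigma) density up to a constant (unit variances, correlation rho)
    d1, d2 = x[0] - MU[0], x[1] - MU[1]
    return -(d1 * d1 - 2.0 * RHO * d1 * d2 + d2 * d2) / (2.0 * (1.0 - RHO * RHO))

def reflection_mh(n, delta=1.0, seed=4):
    rng = random.Random(seed)
    x = list(MU)
    draws = []
    for _ in range(n):
        # y = mu - (x - mu) + z: reflection about mu plus a uniform perturbation
        y = [2.0 * MU[i] - x[i] + rng.uniform(-delta, delta) for i in range(2)]
        if rng.random() < math.exp(min(0.0, log_f(y) - log_f(x))):
            x = y
        draws.append(tuple(x))
    return draws

draws = reflection_mh(6000)
```

Reflecting the candidate breaks the positive dependence of the plain random walk, which is consistent with the low first-order serial correlation (about .16) that the text reports for this proposal.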

and the third term in (11) is proportional to the density of the observations (y3, ..., y_n) given Y_2.

If the only prior information available is that the process is stationary, then the posterior distribution of the parameters is

π(φ, σ² | Y_n) ∝ l(φ, σ²) I[φ ∈ S],

where I[φ ∈ S] is 1 if φ ∈ S and 0 otherwise.

How can this posterior density be simulated? The answer lies in recognizing two facts. First, the blocking strategy is useful for this problem by taking φ and σ² as blocks. Second, from the regression ANOVA decomposition, the exponential term of (11) is proportional to

exp[-(1/(2σ²)) (φ - φ̂)'G(φ - φ̂)],

where φ̂ = G^{-1} Σ_{t=3}^{n} w_t y_t and G = Σ_{t=3}^{n} w_t w_t'. This expression is the kernel of a normal density with mean φ̂ and covariance matrix σ²G^{-1}, which leads to the following conditional densities:

1. The density π(σ² | Y_n, φ), with parameters available from (11).
2. The density

π(φ | Y_n, σ²) ∝ Ψ(φ, σ²) { f_nor(φ | φ̂, σ²G^{-1}) },   (13)

where f_nor is the normal density function.

A sample of draws from π(σ², φ | Y_n) can now be obtained by successively sampling φ from π(φ | Y_n, σ²) and, given this value of φ, simulating σ² from π(σ² | Y_n, φ). The latter simulation is straightforward. For the former,
Table 1. Summaries of the Posterior Distribution for Simulated AR(2) Model

Param.    Mean     Num. SE    SD      Median    Lower    Upper    Corr.
φ1        1.044    .002       .082    1.045      .882    1.203    .133
φ2        -.608    .001       .082    -.610     -.763    -.445    .109
σ²        1.160    .003       .170    1.143      .877    1.544    .020

because it can be shown that |V|^{-1/2} is bounded for all values of φ in the stationary region, we generate candidates from the density in curly braces of (13), following the idea described in Section 5. Then, the value of φ is simulated as: At the jth iteration (given the current value σ²^{(j)}), draw a candidate φ^{(j+1)} from a normal density with mean φ̂ and covariance σ²^{(j)}G^{-1}; if it satisfies stationarity, move to this point with probability

min{ Ψ(φ^{(j+1)}, σ²^{(j)}) / Ψ(φ^{(j)}, σ²^{(j)}), 1 },

and otherwise set φ^{(j+1)} = φ^{(j)}, where Ψ(·, ·) is defined in (12). The A-R method of Section 2 can also be applied to this problem by drawing candidates φ^{(j+1)} from the normal density in (13) until U ≤ Ψ(φ^{(j+1)}, σ²^{(j)}). Many draws of φ may be necessary, however, before one is accepted because Ψ(φ, σ²) can become extremely small. Thus the direct A-R method, although available, is not a competitive substitute for the M-H scheme described above.

In the sampling process we ignore the first n0 = 500 draws and collect the next N = 5,000. These are used to approximate the posterior distributions of φ and σ². It is worth mentioning that the entire sampling process took just 2 minutes on a 50 MHz PC. For comparison we obtained samples from the A-R method, which took about 4 times as long as the M-H algorithm.

The posterior distributions are summarized in Table 1, where we report the posterior mean (the average of the simulated values), the numerical standard error of the posterior mean (computed by the batch means method), the posterior standard deviation (the standard deviation of the simulated values), the posterior median, the lower 2.5 and upper 97.5 percentiles of the simulated values, and the sample first-order serial correlation in the simulated values (which is low and not of concern). From these results it is clear that the M-H algorithm has quickly and accurately produced a posterior distribution concentrated on the values that generated the data.

8. CONCLUDING REMARKS

Our goal in this article is to provide a tutorial exposition of the Metropolis-Hastings algorithm, a versatile, efficient, and powerful simulation technique. It borrows from the well-known A-R method the idea of generating candidates that are either accepted or rejected, but then retains the current value when rejection takes place. The Markov chain thus generated can be shown to have the target distribution as its limiting distribution. Simulating from the target distribution is then accomplished by running the chain a large number of times. We provide a simple, intuitive justification for the form taken by the probability of move in the M-H algorithm by showing its relation to reversibility. We also discuss implementation issues and two applications, the M-H acceptance-rejection algorithm and the use of the algorithm in a block-at-a-time setting. Finally, the procedures are illustrated with two examples.

[Received April 1994. Revised January 1995.]

REFERENCES

Bhattacharya, R. N., and Waymire, E. C. (1990), Stochastic Processes with Applications, New York: John Wiley.
Billingsley, P. (1986), Probability and Measure (2nd ed.), New York: John Wiley.
Box, G. E. P., and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control (rev. ed.), San Francisco: Holden-Day.
Casella, G., and George, E. (1992), "Explaining the Gibbs Sampler," The American Statistician, 46, 167-174.
Chib, S., and Greenberg, E. (1993), "Markov Chain Monte Carlo Simulation Methods in Econometrics," Econometric Theory, in press.
——— (1994), "Bayes Inference for Regression Models with ARMA(p, q) Errors," Journal of Econometrics, 64, 183-206.
Dagpunar, J. (1988), Principles of Random Variate Generation, New York: Oxford University Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A. (1992), "Iterative and Non-Iterative Simulation Algorithms," in Computing Science and Statistics (Interface Proceedings), 24, 433-438.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-511.
Geweke, J. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317-1340.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), "Equations of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087-1092.
Meyn, S. P., and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, London: Springer-Verlag.
Muller, P. (1993), "A Generic Approach to Posterior Integration and Gibbs Sampling," manuscript.
Nummelin, E. (1984), General Irreducible Markov Chains and Non-Negative Operators, Cambridge: Cambridge University Press.
Phillips, D. B., and Smith, A. F. M. (1994), "Bayesian Faces via Hierarchical Template Modeling," Journal of the American Statistical Association, 89, 1151-1163.
Roberts, G. O., Gelman, A., and Gilks, W. R. (1994), "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms," Technical Report, University of Cambridge.
Roberts, G. O., and Tweedie, R. L. (1994), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Technical Report, University of Cambridge.
Rubinstein, R. Y. (1981), Simulation and the Monte Carlo Method, New York: John Wiley.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Tanner, M. A. (1993), Tools for Statistical Inference (2nd ed.), New York: Springer-Verlag.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528-549.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), Annals of Statistics, 22, 1701-1762.