Understanding the Metropolis-Hastings Algorithm
Siddhartha Chib and Edward Greenberg
The American Statistician, November 1995, Vol. 49, No. 4
A considerable amount of attention is now being devoted to the Metropolis-Hastings (M-H) algorithm, which was developed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and subsequently generalized by Hastings (1970). This algorithm is extremely versatile and gives rise to the Gibbs sampler as a special case, as pointed out by Gelman (1992). The M-H algorithm has been used extensively in physics, yet despite the paper by Hastings, it was little known to statisticians until recently. Papers by Muller (1993) and Tierney (1994) were instrumental in exposing the value of this algorithm and stimulating interest among statisticians in its use.

Because of the usefulness of the M-H algorithm, applications are appearing steadily in the current literature (see Muller (1993), Chib and Greenberg (1994), and Phillips and Smith (1994) for recent examples). Despite its obvious importance, however, no simple or intuitive exposition of the M-H algorithm, comparable to that of Casella and George (1992) for the Gibbs sampler, is available. This article is an attempt to fill this gap. We provide a tutorial introduction to the algorithm, deriving the algorithm from first principles. The article is self-contained. Section 5 is devoted to the choice of the candidate-generating density and guidance on implementation. Section 6 discusses how the algorithm can be used in an acceptance-rejection scheme when a dominating density is not available. This section also explains how the algorithm can be applied when the variables to be simulated are divided into blocks. The final section contains two numerical examples, the first involving the simulation of a bivariate normal distribution, and the second the Bayesian analysis of an autoregressive model.

Siddhartha Chib is at the John M. Olin School of Business, Washington University, St. Louis, MO 63130. Edward Greenberg is at the Department of Economics, Washington University, St. Louis, MO 63130. The authors express their thanks to the editor of the journal, the associate editor, and four referees for many useful comments on the paper, and to Michael Ogilvie and Pin-Huang Chou for helpful discussions.

2. ACCEPTANCE-REJECTION SAMPLING

In contrast to the MCMC methods described below, classical simulation techniques generate non-Markov (usually independent) samples; that is, the successive observations are statistically independent unless correlation is artificially introduced as a variance reduction device. An important method in this class is the A-R method, which can be described as follows.

The objective is to generate samples from the absolutely continuous target density $\pi(x) = f(x)/K$, where $x \in \mathbb{R}^d$, $f(x)$ is the unnormalized density, and $K$ is the (possibly unknown) normalizing constant. Let $h(x)$ be a density that can be simulated by some known method, and suppose there is a known constant $c$ such that $f(x) \le ch(x)$ for all $x$. Then, to obtain a random variate from $\pi(\cdot)$:

(*) Generate a candidate $Z$ from $h(\cdot)$ and a value $u$ from $U(0, 1)$, the uniform distribution on (0, 1).
* If $u \le f(Z)/ch(Z)$,
  - return $Z = y$.
* Else
  - goto (*).
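The step (*) and the accept/reject test translate directly into code. Here is a minimal Python sketch of the A-R method; the normal target, Cauchy envelope, and bound $c$ are our own illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_reject(f, h_sample, h_pdf, c, size):
    """Draw `size` variates from pi(x) = f(x)/K by accept-reject,
    assuming f(x) <= c * h_pdf(x) for all x."""
    draws = []
    while len(draws) < size:
        z = h_sample(rng)                    # (*) candidate Z from h(.)
        u = rng.uniform()                    # u ~ U(0, 1)
        if u <= f(z) / (c * h_pdf(z)):       # accept: return Z = y
            draws.append(z)
        # else: go back to (*) and draw again
    return np.array(draws)

# Illustration: unnormalized N(0, 1) target under a Cauchy envelope.
f = lambda x: np.exp(-0.5 * x ** 2)               # f(x); K = sqrt(2*pi) is never used
h_pdf = lambda x: 1.0 / (np.pi * (1.0 + x ** 2))  # Cauchy(0, 1) density
h_sample = lambda rng: rng.standard_cauchy()
c = 2.0 * np.pi * np.exp(-0.5)                    # sup f/h, attained at x = +/- 1
sample = accept_reject(f, h_sample, h_pdf, c, size=1000)
```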
It is easily shown that the accepted value $y$ is a random variate from $\pi(\cdot)$. For this method to be efficient, $c$ must be carefully selected, because the expected number of candidate draws required per accepted draw is proportional to $c$.

The MCMC methods mentioned above are instead based on a Markov chain transition kernel of the form

$$P(x, dy) = p(x, y)\,dy + r(x)\,\delta_x(dy), \qquad (2)$$

where $p(x, x) = 0$, $\delta_x(dy) = 1$ if $x \in dy$ and 0 otherwise, and $r(x) = 1 - \int p(x, y)\,dy$ is the probability that the chain remains at $x$. From the possibility that $r(x) \neq 0$, it should be clear that the integral of $p(x, y)$ over $y$ is not necessarily 1. Now, if the function $p(x, y)$ in (2) satisfies the reversibility condition (also called "detailed balance," "microscopic reversibility," and "time reversibility")

$$\pi(x)p(x, y) = \pi(y)p(y, x),$$

then $\pi(\cdot)$ is the invariant density of $P(x, \cdot)$.
Suppose now that candidates are drawn from a density $q(x, y)$ and accepted with probability $\alpha(x, y)$, so that $p(x, y) = q(x, y)\alpha(x, y)$ for $y \neq x$. If it happens that $\pi(x)q(x, y) > \pi(y)q(y, x)$, transitions from $y$ to $x$ occur too rarely, so $\alpha(y, x)$ should be set to its upper limit of 1; reversibility then determines $\alpha(x, y)$:

$$\pi(x)q(x, y)\alpha(x, y) = \pi(y)q(y, x)\alpha(y, x) = \pi(y)q(y, x). \qquad (6)$$

Figure 1. Calculating Probabilities of Move With Symmetric Candidate-Generating Function (see text).
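Solving (6) gives the probability of move $\alpha(x, y) = \min\{\pi(y)q(y, x)/(\pi(x)q(x, y)),\ 1\}$, in which the unknown constant $K$ cancels. The following is a minimal Python sketch of one M-H update built on this formula (the function names and the symmetric-proposal example are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_step(x, log_f, q_sample, log_q):
    """One M-H update for the target pi(x) = f(x)/K.

    q_sample(x, rng) draws y ~ q(x, .); log_q(x, y) returns log q(x, y).
    Only the unnormalized log f is needed: K cancels in the ratio."""
    y = q_sample(x, rng)
    log_ratio = (log_f(y) + log_q(y, x)) - (log_f(x) + log_q(x, y))
    if np.log(rng.uniform()) <= min(log_ratio, 0.0):  # move with prob alpha(x, y)
        return y
    return x                                          # reject: the chain stays at x

# Example with a symmetric proposal, for which log q cancels entirely.
log_f = lambda x: -0.5 * x ** 2                       # unnormalized N(0, 1)
q_sample = lambda x, rng: x + rng.uniform(-1.0, 1.0)  # random-walk candidate
log_q = lambda x, y: 0.0                              # symmetric: any constant works

x, chain = 0.0, []
for _ in range(1000):
    x = mh_step(x, log_f, q_sample, log_q)
    chain.append(x)
```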
Figure 2. Target Density f(x) and Dominating Density ch(x).

These ideas apply when candidates are produced by an A-R step based on a pseudodominating density $ch(x)$ that need not bound $f$ everywhere. Interest again centers on the target density $\pi(x) = f(x)/K$, where $K$ may be unknown. Let $C = \{x : f(x) < ch(x)\}$, and let $d$ denote the probability that the A-R step accepts a draw from $h$. The objective now is to find the M-H probability of move $\alpha(x, y)$ such that $q(x, y)\alpha(x, y)$ satisfies reversibility. In the case where neither $x$ nor $y$ lies in $C$ and transitions from $x$ to $y$ occur too often, reversibility requires

$$f(x)\,\frac{h(y)}{d}\,\alpha(x, y) = f(y)\,\frac{h(x)}{d},$$
which implies

$$\alpha(x, y) = \min\left\{\frac{f(y)h(x)}{f(x)h(y)},\ 1\right\}.$$

If there are too few transitions from $x$ to $y$, just interchange $x$ and $y$ in the above discussion. We thus see that in two of the cases, those where $x \in C$, the probability of move to $y$ is 1, regardless of where $y$ lies. To summarize, we have derived the following probability of move to the candidates $y$ that are produced from the A-R step (a code sketch follows the list):

* Let $C_1 = I[f(x) < ch(x)]$ and $C_2 = I[f(y) < ch(y)]$.
* Generate $u$ from $U(0, 1)$.
  - If $C_1 = 1$, then let $\alpha = 1$;
  - if $C_1 = 0$ and $C_2 = 1$, then let $\alpha = ch(x)/f(x)$;
  - if $C_1 = 0$ and $C_2 = 0$, then let $\alpha = \min\{f(y)h(x)/(f(x)h(y)),\ 1\}$.
* If $u \le \alpha$,
  - return $y$.
* Else
  - return $x$.
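A minimal Python rendering of this probability of move; the function and argument names are ours:

```python
def ar_mh_alpha(x, y, f, h, c):
    """Probability of move alpha(x, y) for candidates y produced by an
    A-R step with pseudodominating density c*h (Section 6.1)."""
    C1 = f(x) < c * h(x)   # does c*h dominate f at the current point?
    C2 = f(y) < c * h(y)   # does c*h dominate f at the candidate?
    if C1:
        return 1.0                                   # move with probability 1
    if C2:
        return min(c * h(x) / f(x), 1.0)             # ch(x)/f(x); <= 1 since C1 = 0
    return min((f(y) * h(x)) / (f(x) * h(y)), 1.0)   # neither point dominated
```

The sampler then accepts $y$ when a fresh $u \sim U(0,1)$ satisfies $u \le \alpha$, and otherwise returns the current point $x$, exactly as in the list above.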
6.2 Block-at-a-Time Algorithms

Another interesting situation arises when the M-H algorithm is applied in turn to subblocks of the vector $x$, rather than simultaneously to all elements of the vector. This "block-at-a-time" or "variable-at-a-time" possibility, which is discussed in Hastings (1970, sec. 2.4), often simplifies the search for a suitable candidate-generating density and gives rise to several interesting hybrid algorithms obtained by combining M-H updates.

The central idea behind these algorithms may be illustrated with two blocks, $x = (x_1, x_2)$, where $x_i \in \mathbb{R}^{d_i}$. Suppose that there exists a conditional transition kernel $P_1(x_1, dy_1 \mid x_2)$ with the property that, for a fixed value of $x_2$, $\pi^*_{1|2}(\cdot \mid x_2)$ is its invariant distribution (with density $\pi_{1|2}(\cdot \mid x_2)$), that is,

$$\pi^*_{1|2}(dy_1 \mid x_2) = \int P_1(x_1, dy_1 \mid x_2)\,\pi_{1|2}(x_1 \mid x_2)\,dx_1. \qquad (8)$$

Also, suppose the existence of a conditional transition kernel $P_2(x_2, dy_2 \mid x_1)$ with the property that, for a given $x_1$, $\pi^*_{2|1}(\cdot \mid x_1)$ is its invariant distribution, analogous to (8). For example, $P_1$ could be the transition kernel generated by a Metropolis-Hastings chain applied to the block $x_1$, with $x_2$ fixed for all iterations. Now, somewhat surprisingly, it turns out that the product of the transition kernels has $\pi^*(x_1, x_2)$ as its invariant density. The practical significance of this principle (which we call the product of kernels principle) is enormous because it allows us to take draws in succession from each of the kernels, instead of having to run each of the kernels to convergence for every value of the conditioning variable. In addition, as suggested above, this principle is extremely useful because it is usually far easier to find several conditional kernels that converge to their respective conditional densities than to find one kernel that converges to the joint density.

To see why the principle holds, suppose the chain is currently at $x = (x_1, x_2)$: the transition kernel $P_1(x_1, dy_1 \mid x_2)$ generates $y_1$ given $x_1$ and $x_2$, and the transition kernel $P_2(x_2, dy_2 \mid y_1)$ generates $y_2$ given $x_2$ and $y_1$. Then the kernel formed by multiplying the conditional kernels has $\pi^*(\cdot, \cdot)$ as its invariant distribution:

$$\begin{aligned}
&\int\!\!\int P_1(x_1, dy_1 \mid x_2)\,P_2(x_2, dy_2 \mid y_1)\,\pi(x_1, x_2)\,dx_1\,dx_2 \\
&\quad= \int P_2(x_2, dy_2 \mid y_1)\left[\int P_1(x_1, dy_1 \mid x_2)\,\pi_{1|2}(x_1 \mid x_2)\,dx_1\right]\pi_2(x_2)\,dx_2 \\
&\quad= \int P_2(x_2, dy_2 \mid y_1)\,\pi^*_{1|2}(dy_1 \mid x_2)\,\pi_2(x_2)\,dx_2 \\
&\quad= \int P_2(x_2, dy_2 \mid y_1)\,\pi_{2|1}(x_2 \mid y_1)\,\pi^*_1(dy_1)\,dx_2 \\
&\quad= \pi^*_1(dy_1)\int P_2(x_2, dy_2 \mid y_1)\,\pi_{2|1}(x_2 \mid y_1)\,dx_2 \\
&\quad= \pi^*_1(dy_1)\,\pi^*_{2|1}(dy_2 \mid y_1) \\
&\quad= \pi^*(dy_1, dy_2),
\end{aligned}$$

where the third line follows from (8), the fourth from Bayes theorem, the sixth from assumed invariance of $P_2$, and the last from the law of total probability.

With this result in hand, several important special cases of the M-H algorithm can be mentioned. The first special case is the so-called "Gibbs sampler." This algorithm is obtained by letting the transition kernel $P_1(x_1, dy_1 \mid x_2) = \pi^*_{1|2}(dy_1 \mid x_2)$ and $P_2(x_2, dy_2 \mid y_1) = \pi^*_{2|1}(dy_2 \mid y_1)$; that is, the samples are generated directly from the "full conditional distributions." Note that this method requires that it be possible to generate independent samples from each of the full conditional densities. The calculations above demonstrate that this algorithm is a special case of the M-H algorithm. Alternatively, it may be checked that the M-H acceptance probability $\alpha(x, y) = 1$ for all $x, y$.

Another special case of the M-H algorithm is the so-called "M-H within Gibbs" algorithm (but see our comments on terminology below), in which an intractable full conditional density [say $\pi_{1|2}(y_1 \mid x_2)$] is sampled with the general form of the M-H algorithm described in Section 4 and the others are sampled directly from their full conditional distributions. Many other algorithms can be similarly developed that arise from multiplying conditional kernels; a two-block sketch appears below.

We conclude this section with a brief digression on terminology. It should be clear from the discussion in this subsection that the M-H algorithm can take many different forms, one of which is the Gibbs sampler. Because much of the literature has overlooked Hastings's discussion of M-H algorithms that scan one block at a time, some unfortunate usage ("M-H within Gibbs," for example) has arisen that should be abandoned. In addition, it may be desirable to define the Gibbs sampler rather narrowly, as we have done above, as the case in which all full conditional kernels are sampled by independent algorithms in a fixed order. Although a special case of the M-H algorithm, it is an extremely important special case.
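The following Python sketch illustrates one sweep of a two-block scan in which each block is updated by a symmetric random-walk M-H step. This is our own illustration, not code from the paper; replacing either step with a direct draw from the corresponding full conditional would yield the Gibbs sampler:

```python
import numpy as np

rng = np.random.default_rng(2)

def block_update(x1, x2, log_f, q1_sample, q2_sample):
    """One sweep of a two-block M-H scan: update x1 | x2, then x2 | x1.
    With symmetric proposals the Hastings ratio reduces to a ratio of
    (unnormalized) joint densities; each conditional kernel leaves its
    full conditional invariant, so their product leaves the joint invariant."""
    y1 = q1_sample(x1, rng)                        # block 1 candidate, x2 fixed
    if np.log(rng.uniform()) <= log_f(y1, x2) - log_f(x1, x2):
        x1 = y1
    y2 = q2_sample(x2, rng)                        # block 2 candidate, new x1 fixed
    if np.log(rng.uniform()) <= log_f(x1, y2) - log_f(x1, x2):
        x2 = y2
    return x1, x2

# Example: correlated bivariate normal target (as in Section 7.1).
Sinv = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
log_f = lambda x1, x2: -0.5 * np.array([x1, x2]) @ Sinv @ np.array([x1, x2])
q1_sample = lambda x1, rng: x1 + rng.uniform(-0.75, 0.75)
q2_sample = lambda x2, rng: x2 + rng.uniform(-1.0, 1.0)

x1, x2 = 0.0, 0.0
for _ in range(1000):
    x1, x2 = block_update(x1, x2, log_f, q1_sample, q2_sample)
```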
7.1 Simulating a Bivariate Normal Distribution

The first example involves the simulation of the bivariate normal distribution $N_2(\mu, \Sigma)$, where $\mu = (1, 2)'$ is the mean vector and $\Sigma$ is the $2 \times 2$ covariance matrix given by

$$\Sigma = \begin{pmatrix} 1 & .90 \\ .90 & 1 \end{pmatrix}.$$

Because of the high correlation the contours of this distribution are "cigar-shaped," that is, thin and positively inclined. Although this distribution can be simulated directly in the Choleski approach by letting $y = \mu + P'u$, where $u \sim N_2(0, I_2)$ and $P$ satisfies $P'P = \Sigma$, this well-known problem is useful for illustrating the M-H algorithm.

From the expression for the multivariate normal density, the probability of move (for a symmetric candidate-generating density) is

$$\alpha(x, y) = \min\left\{\exp\left[-\tfrac{1}{2}(y - \mu)'\Sigma^{-1}(y - \mu) + \tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right],\ 1\right\}, \qquad x, y \in \mathbb{R}^2. \qquad (9)$$

We use the following candidate-generating densities, for which the parameters are adjusted by experimentation to achieve an acceptance rate of 40% to 50%:
1. Random walk generating density ($y = x + z$), where the increment random variable $z$ is distributed as bivariate uniform, that is, the $i$th component of $z$ is uniform on the interval $(-\delta_i, \delta_i)$. Note that $\delta_1$ controls the spread along the first coordinate axis and $\delta_2$ the spread along the second. To avoid excessive moves we let $\delta_1 = .75$ and $\delta_2 = 1$.

2. Random walk generating density ($y = x + z$) with $z$ distributed as independent normal $N_2(0, D)$, where $D = \mathrm{diag}(.6, .4)$.

3. Pseudorejection sampling generating density with "dominating function" $ch(x) = c(2\pi)^{-1}|D|^{-1/2}\exp[-\tfrac{1}{2}(x - \mu)'D^{-1}(x - \mu)]$, where $D = \mathrm{diag}(2, 2)$ and $c = .9$. The trial draws, which are passed through the A-R step, are thus obtained from a bivariate, independent normal distribution.

4. The autoregressive generating density $y = \mu - (x - \mu) + z$, where $z$ is independent uniform with $\delta_1 = 1 = \delta_2$. Thus values of $y$ are obtained by reflecting the current point around $\mu$ and then adding the increment.

Note that the probability of move in cases 1, 2, and 4 is given by (9). In addition, the first two generating densities do not make use of the known value of $\mu$, although the values of the $\delta_i$ are related to $\Sigma$. In the third generating density we have set the value of the constant $c$ to be smaller than that which leads to true domination. For domination it is necessary to let all diagonal entries of $D$ be equal to 1.9 (the largest eigenvalue of $\Sigma$) and to set $c = |D|^{1/2}/|\Sigma|^{1/2}$ [see Dagpunar (1988, p. 159)].

Each of these four candidate-generating densities reproduces the shape of the bivariate normal distribution being simulated. Figure 3 contains scatter plots of draws produced by the Choleski approach and by the M-H algorithm with the reflection candidate-generating density. More observations are taken from the M-H algorithm to make the two plots comparable. The plots of the output with the other candidate-generating densities are similar to this and are therefore omitted. At the suggestion of a referee, points that repeat in the M-H chain are "jittered" to improve clarity. The figure clearly reveals that the sampler does a striking job of visiting the entire support of the distribution. This is confirmed by the estimated tail probabilities computed from the M-H output, for which the estimates are extremely close to the true values. Details are not reported to save space.

For the third generating density we found that reductions in the elements of $D$ led to an erosion in the number of times the sampler visited the tails of the distribution. In addition, we found that the first-order serial correlation of the sampled values with the first and second candidate-generating densities is of the order .9, and with the other two it is .30 and .16, respectively. The high serial correlation with the random walk generating densities is not unexpected and stems from the long memory in the candidate draws. Finally, by reflecting the candidates we see that it is possible to obtain a beneficial reduction in the serial correlation of the output with little cost.
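A minimal Python sketch of the first candidate-generating density (random-walk uniform increments) applied to this target, with the probability of move computed from (9); the run length here is our choice, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
delta = np.array([0.75, 1.0])                # spreads (delta_1, delta_2)

def log_kernel(x):
    d = x - mu
    return -0.5 * d @ Sigma_inv @ d          # log of the N2(mu, Sigma) kernel

x = mu.copy()
draws = np.empty((5000, 2))
accepts = 0
for i in range(draws.shape[0]):
    y = x + rng.uniform(-delta, delta)       # candidate 1: y = x + z
    # alpha(x, y) from (9); accept on the log scale to avoid overflow
    if np.log(rng.uniform()) <= log_kernel(y) - log_kernel(x):
        x, accepts = y, accepts + 1
    draws[i] = x                             # repeat the current point on rejection

print("acceptance rate:", accepts / draws.shape[0])   # roughly in the 40%-50% band
```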
Figure 3. Scatter Plots of Simulated Draws. Top panel: Generated by Choleski approach. Bottom panel: Generated by M-H with reflection candidate-generating density.

7.2 Simulating a Bayesian Posterior

We now illustrate the use of the M-H algorithm to sample an intractable distribution that arises in a stationary second-order autoregressive [AR(2)] time series model. Our presentation is based on Chib and Greenberg (1994), which contains a more detailed discussion and results for the general ARMA(p, q) model.

For our illustration, we simulated 100 observations from the model

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \epsilon_t, \qquad t = 1, 2, \ldots, 100, \qquad (10)$$

where $\phi_1 = 1$, $\phi_2 = -.5$, and $\epsilon_t \sim N(0, 1)$. The values of $\phi = (\phi_1, \phi_2)$ lie in the region $S \subset \mathbb{R}^2$ that satisfies the stationarity restrictions

$$\phi_1 + \phi_2 < 1; \qquad -\phi_1 + \phi_2 < 1; \qquad \phi_2 > -1.$$

Following Box and Jenkins (1976), we express the (exact or unconditional) likelihood function for this model given the $n = 100$ data values $Y_n = (y_1, y_2, \ldots, y_n)'$ as

$$f(Y_n \mid \phi, \sigma^2) = \Psi(\phi, \sigma^2) \times (\sigma^2)^{-(n-2)/2} \times \exp\left[-\frac{1}{2\sigma^2}\sum_{t=3}^{n}(y_t - w_t'\phi)^2\right], \qquad (11)$$

where $w_t = (y_{t-1}, y_{t-2})'$ and

$$\Psi(\phi, \sigma^2) = (\sigma^2)^{-1}|V|^{-1/2}\exp\left[-\frac{1}{2\sigma^2}\,Y_2'V^{-1}Y_2\right] \qquad (12)$$

is the density of $Y_2 = (y_1, y_2)'$.
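As a worked complement, the following Python sketch simulates data from (10) and evaluates the exact log-likelihood implied by (11) and (12). The Yule-Walker expressions used to construct $V$ (the stationary covariance matrix of $Y_2$ divided by $\sigma^2$) are standard but are not spelled out in the text above, so treat them as our addition:

```python
import numpy as np

rng = np.random.default_rng(4)

phi1, phi2, sigma2, n = 1.0, -0.5, 1.0, 100

# Simulate model (10); discard a burn-in so the series is effectively stationary.
y = np.zeros(n + 200)
for t in range(2, n + 200):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal(0.0, np.sqrt(sigma2))
y = y[200:]

def ar2_exact_loglik(phi, sigma2, y):
    """log of (11): log Psi(phi, sigma2) from (12) plus the Gaussian
    log-density of y_3, ..., y_n given the first two observations.
    Requires phi inside the stationarity region S."""
    p1, p2 = phi
    # Yule-Walker autocovariances divided by sigma2 (our addition):
    g0 = (1 - p2) / ((1 + p2) * ((1 - p2) ** 2 - p1 ** 2))   # gamma_0 / sigma2
    g1 = p1 * g0 / (1 - p2)                                  # gamma_1 / sigma2
    V = np.array([[g0, g1], [g1, g0]])
    Y2 = y[:2]
    log_psi = (-np.log(sigma2) - 0.5 * np.log(np.linalg.det(V))
               - Y2 @ np.linalg.inv(V) @ Y2 / (2 * sigma2))
    w = np.column_stack([y[1:-1], y[:-2]])                   # rows w_t = (y_{t-1}, y_{t-2})'
    resid = y[2:] - w @ np.array([p1, p2])
    return (log_psi - 0.5 * (len(y) - 2) * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

print(ar2_exact_loglik((phi1, phi2), sigma2, y))
```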
Given the current values $\phi^{(j)}$ and $\sigma^{2(j)}$, draw a candidate $\phi^{(j+1)}$ from a normal density with mean $\hat\phi$ and covariance $\sigma^{2(j)}G^{-1}$; if it satisfies stationarity, move to this point with probability

$$\min\left\{\frac{\Psi(\phi^{(j+1)}, \sigma^{2(j)})}{\Psi(\phi^{(j)}, \sigma^{2(j)})},\ 1\right\},$$

and otherwise set $\phi^{(j+1)} = \phi^{(j)}$, where $\Psi(\cdot, \cdot)$ is defined in (12). The A-R method of Section 2 can also be applied to this problem by drawing candidates $\phi^{(j+1)}$ from the normal density in (13) until $u \le \Psi(\phi^{(j+1)}, \sigma^{2(j)})$. Many draws of $\phi$ may be necessary, however, before one is accepted, because $\Psi(\phi, \sigma^2)$ can become extremely small. Thus the direct A-R method, although available, is not a competitive substitute for the M-H scheme described above.
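A Python sketch of this update. The quantities $\hat\phi$ and $G^{-1}$ come from the proposal density (13), which appears on a page not reproduced above, so they enter here as given inputs; `log_psi` evaluates $\log\Psi$ from (12):

```python
import numpy as np

rng = np.random.default_rng(5)

def stationary(phi):
    """Stationarity region S for the AR(2) model."""
    p1, p2 = phi
    return (p1 + p2 < 1) and (-p1 + p2 < 1) and (p2 > -1)

def draw_phi(phi, sigma2, phi_hat, G_inv, log_psi):
    """One M-H update for phi as described in the text: draw a candidate
    from N(phi_hat, sigma2 * G_inv); if it is stationary, move with
    probability min{Psi(cand, sigma2) / Psi(phi, sigma2), 1}, else stay.
    phi_hat and G_inv are the quantities defined in (13) (assumed given);
    log_psi(phi, sigma2) evaluates log Psi from (12)."""
    cand = rng.multivariate_normal(phi_hat, sigma2 * G_inv)
    if stationary(cand):
        if np.log(rng.uniform()) <= log_psi(cand, sigma2) - log_psi(phi, sigma2):
            return cand
    return phi   # nonstationary or rejected: phi^{(j+1)} = phi^{(j)}
```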
In the sampling process we ignore the first $n_0 = 500$ draws and collect the next $N = 5{,}000$. These are used to approximate the posterior distributions of $\phi$ and $\sigma^2$. It is worth mentioning that the entire sampling process took just 2 minutes on a 50 MHz PC. For comparison we obtained samples from the A-R method, which took about 4 times as long as the M-H algorithm.

The posterior distributions are summarized in Table 1, where we report the posterior mean (the average of the simulated values), the numerical standard error of the posterior mean (computed by the batch means method), the posterior standard deviation (the standard deviation of the simulated values), the posterior median, the lower 2.5 and upper 97.5 percentiles of the simulated values, and the sample first-order serial correlation in the simulated values (which is low and not of concern). From these results it is clear that the M-H algorithm has quickly and accurately produced a posterior distribution concentrated on the values that generated the data.
8. CONCLUDING REMARKS

Our goal in this article is to provide a tutorial exposition of the Metropolis-Hastings algorithm, a versatile, efficient, and powerful simulation technique. It borrows from the well-known A-R method the idea of generating candidates that are either accepted or rejected, but then retains the current value when rejection takes place. The Markov chain thus generated can be shown to have the target distribution as its limiting distribution. Simulating from the target distribution is then accomplished by running the chain a sufficiently long time and basing inferences on the sampled draws.

REFERENCES

Box, G. E. P., and Jenkins, G. M. (1976), Time Series Analysis: Forecasting and Control (rev. ed.), San Francisco: Holden-Day.
Casella, G., and George, E. (1992), "Explaining the Gibbs Sampler," The American Statistician, 46, 167-174.
Chib, S., and Greenberg, E. (1993), "Markov Chain Monte Carlo Simulation Methods in Econometrics," Econometric Theory, in press.
Chib, S., and Greenberg, E. (1994), "Bayes Inference for Regression Models with ARMA(p, q) Errors," Journal of Econometrics, 64, 183-206.
Dagpunar, J. (1988), Principles of Random Variate Generation, New York: Oxford University Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A. (1992), "Iterative and Non-Iterative Simulation Algorithms," in Computing Science and Statistics (Interface Proceedings), 24, 433-438.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-511.
Geweke, J. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317-1340.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), "Equations of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087-1092.
Meyn, S. P., and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, London: Springer-Verlag.
Muller, P. (1993), "A Generic Approach to Posterior Integration and Gibbs Sampling," manuscript.
Nummelin, E. (1984), General Irreducible Markov Chains and Non-Negative Operators, Cambridge: Cambridge University Press.
Phillips, D. B., and Smith, A. F. M. (1994), "Bayesian Faces via Hierarchical Template Modeling," Journal of the American Statistical Association, 89, 1151-1163.
Roberts, G. O., Gelman, A., and Gilks, W. R. (1994), "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms," Technical Report, University of Cambridge.
Roberts, G. O., and Tweedie, R. L. (1994), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Technical Report, University of Cambridge.
Rubinstein, R. Y. (1981), Simulation and the Monte Carlo Method, New York: John Wiley.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Tanner, M. A. (1993), Tools for Statistical Inference (2nd ed.), New York: Springer-Verlag.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528-549.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), Annals of Statistics, 22, 1701-1762.