High-Dimensional Gaussian Sampling: A Review and a Unifying Approach Based on a Stochastic Proximal Point Algorithm
Abstract. Efficient sampling from a high-dimensional Gaussian distribution is an old but high-stakes
issue. Vanilla Cholesky samplers imply a computational cost and memory requirements
that can rapidly become prohibitive in high dimensions. To tackle these issues, multiple
methods have been proposed from different communities ranging from iterative numeri-
cal linear algebra to Markov chain Monte Carlo (MCMC) approaches. Surprisingly, no
complete review and comparison of these methods has been conducted. This paper aims
to review all these approaches by pointing out their differences, close relations, benefits,
and limitations. In addition to reviewing the state of the art, this paper proposes a unify-
ing Gaussian simulation framework by deriving a stochastic counterpart of the celebrated
proximal point algorithm in optimization. This framework offers a novel and unifying re-
visiting of most of the existing MCMC approaches while also extending them. Guidelines
to choosing the appropriate Gaussian simulation method for a given sampling problem in
high dimensions are proposed and illustrated with numerical examples.
Key words. Gaussian distribution, high-dimensional sampling, linear system, Markov chain Monte
Carlo, proximal point algorithm
DOI. 10.1137/20M1371026
Received by the editors October 2, 2020; accepted for publication (in revised form) August 9, 2021; published electronically February 3, 2022.
https://doi.org/10.1137/20M1371026
Funding: This work was partially supported by the ANR-3IA Artificial and Natural Intelligence Toulouse Institute (ANITI) under grant agreement ANITI ANR-19-PI3A-0004, by the ANR project "Chaire IA Sherlock" (ANR-20-CHIA-0031-01), as well as by national support within Programme d'investissements d'avenir (ANR-16-IDEX-0004 ULNE). The work of the third author was supported by his 3IA Chair Sherlock funded by the ANR under grant agreement ANR-20-CHIA-0031-01, Centrale Lille, the I-Site Lille-Nord-Europe, and the region Hauts-de-France.
Lagrange Mathematics and Computing Research Center, Huawei, 75007 Paris, France ([email protected]).
University of Toulouse, IRIT/INP-ENSEEIHT, 31071 Toulouse, France, and Institut Universitaire de France ([email protected]).
In high-dimensional settings, this multivariate sampling task can become computationally demanding, which may prevent us from using statistically sound methods for real-world applications. Therefore, even recently, a host of works have focused on the derivation of surrogate sampling methods. Vanilla Cholesky sampling builds on the factorization

(1.1)    Q = CC^\top,
where C \in \mathbb{R}^{d \times d} is a lower triangular matrix with real and positive diagonal entries. The computation of the Cholesky factor C for dense matrices requires \Theta(d^3) floating point operations (flops), that is, arithmetic operations such as additions, subtractions, multiplications, or divisions [41]; see also subsection 3.1.1 for details. In addition, the Cholesky factor, which involves at most d(d+1)/2 nonzero entries, must be stored. In the general case, this implies a global memory requirement of \Theta(d^2). In high-dimensional settings (d \gg 1), both the computational and the storage requirements rapidly become prohibitive for standard computers. Table 1 illustrates this claim by listing the number of flops (using 64-bit numbers, also called double precision) and the storage space required by the vanilla Cholesky sampler in high dimensions. Note that for d \geq 10^5, which is classical in image processing problems,
for instance, the memory requirement of the Cholesky sampler exceeds the memory
capacity of today's standard laptops. To mitigate these computational issues, much
work has focused on the derivation of surrogate high-dimensional Gaussian sampling
methods. Compared to Cholesky sampling, these surrogate samplers involve addi-
tional sources of approximation in finite-time sampling and as such intend to trade off
computation and storage requirements against sampling accuracy. Throughout this
review, we will say that a Gaussian sampling procedure is "efficient" if, in order to produce a sample with reasonable accuracy, the number of flops and the memory required are significantly lower than those of the Cholesky sampler. For the sake of clarity, at the end of each section presenting an existing Gaussian sampler, we will highlight its theoretical relative efficiency with respect to (w.r.t.) vanilla Cholesky sampling within a dedicated paragraph. In section 3, we review sampling techniques adapting ideas from numerical linear algebra. In section 4, we review another class of
sampling techniques, namely, MCMC approaches, which build a Markov chain admit-
ting the Gaussian distribution of interest (or a close approximation) as a stationary distribution. Throughout this paper, the target is the d-dimensional Gaussian distribution with p.d.f.

(2.1)    \pi(\theta) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2} \exp\bigl( -\tfrac{1}{2}(\theta - \mu)^\top \Sigma^{-1}(\theta - \mu) \bigr),

where \mu \in \mathbb{R}^d and \Sigma \in \mathbb{R}^{d \times d}, respectively, stand for the mean vector and the covari-
ance matrix of the considered Gaussian distribution.
For some approaches and applications, working with the precision Q rather than
with the covariance \Sigma will be more convenient (e.g., for conditional autoregressive
models or hierarchical Bayesian models; see also section 4). In this paper, we choose
to present existing approaches by working directly with Q for the sake of simplicity.
When Q is unknown but \Sigma is available instead, simple and straightforward algebraic
manipulations can be used to implement the same approaches without increasing their computational complexity. Sampling from \mathcal{N}(\mu, Q^{-1}) raises several important issues
which are mainly related to the structure of Q. In the following paragraphs, we will
detail some special instances of (2.1) and well-known associated sampling strategies
before focusing on the general Gaussian sampling problem considered in this paper.
1 http://github.com/mvono/PyGauss.
2.2. Usual Special Instances. For completeness, this subsection recalls special
cases of Gaussian sampling tasks that will not be detailed later but are common
building blocks. Instead, we point out appropriate references for the interested reader.
These special instances include basic univariate sampling and the scenarios where Q
is (i) a diagonal matrix, (ii) a band matrix, or (iii) a circulant Toeplitz matrix. Again,
with basic algebraic manipulations, the same samplers can be used when \Sigma has one
of these specific structures.
2.2.1. Univariate Gaussian Sampling. The simplest Gaussian sampling problem boils down to drawing univariate Gaussian random variables with mean \mu \in \mathbb{R} and precision q > 0. Generating the latter quickly and with high accuracy has been the topic of much research work over the last 70 years. Such methods can, loosely speaking, be divided into four groups, namely, (i) cumulative distribution function (c.d.f.) inversion, (ii) transformation, (iii) rejection, and (iv) recursive methods; they are
now well documented. Interested readers are invited to refer to the comprehensive
overview in [101] for more details. For instance, Algorithm 2.1 details the well-known
Box--Muller method, which transforms a pair of independent uniform random vari-
ables into a pair of Gaussian random variables by exploiting the radial symmetry of
the two-dimensional normal distribution.
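As a concrete illustration, the following minimal NumPy sketch implements the Box--Muller transform described above (the function name and interface are ours, not those of Algorithm 2.1):

```python
import numpy as np

def box_muller(rng, size):
    """Box--Muller transform: map pairs of independent uniforms to pairs of
    independent standard Gaussian variates using the radial symmetry of the
    bivariate normal distribution."""
    u1 = rng.uniform(size=size)
    u2 = rng.uniform(size=size)
    r = np.sqrt(-2.0 * np.log(1.0 - u1))   # 1 - u1 lies in (0, 1], avoiding log(0)
    phi = 2.0 * np.pi * u2                 # uniform angle on [0, 2*pi)
    return r * np.cos(phi), r * np.sin(phi)

rng = np.random.default_rng(0)
z1, z2 = box_muller(rng, 5)                # two arrays of 5 i.i.d. N(0, 1) draws
```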
Algorithmic Efficiency. By using, for instance, Algorithm 2.1 for step 2, Algorithm 2.2
admits a computational complexity of \Theta (d) and a storage capacity of \Theta (d). In this spe-
cific scenario, these requirements are significantly less than those of vanilla Cholesky
sampling, whose complexities are recalled in Table 1.
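Assuming Algorithm 2.2 amounts to drawing the d components independently and rescaling them by the inverse square roots of the diagonal entries of Q, a minimal NumPy counterpart would read:

```python
import numpy as np

def sample_diagonal_precision(rng, mu, q_diag):
    """Draw theta ~ N(mu, Q^{-1}) when Q = diag(q_diag): each component is
    sampled independently, at Theta(d) cost and Theta(d) storage."""
    z = rng.standard_normal(mu.shape[0])
    return mu + z / np.sqrt(q_diag)

rng = np.random.default_rng(1)
theta = sample_diagonal_precision(rng, np.zeros(4), np.array([1.0, 2.0, 4.0, 0.5]))
```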
When Q is not diagonal, we can no longer sample the d components of \bfittheta indepen-
dently. Thus, more sophisticated sampling methods must be used. For well-structured
matrices Q, we show in the following sections that it is possible to draw the random
vector of interest more efficiently than with vanilla Cholesky sampling.
Fig. 1 From left to right: Example of an undirected graph defined on the 544 regions of Germany where those sharing a common border are considered as neighbors, its associated precision matrix Q (bandwidth b = 522), its reordered precision matrix PQP^\top (b = 43) where P is a permutation matrix, and a drawing of a band matrix. For the three matrices, the white entries are equal to zero.
Fig. 2 From left to right: Example of a 3 \times 3 Laplacian filter, the associated circulant precision matrix Q = \Delta^\top\Delta for convolution with periodic boundary conditions, and its counterpart diagonal matrix FQF^H in the Fourier domain, where F and its Hermitian conjugate F^H are the unitary matrices associated with the Fourier and inverse Fourier transforms.
where \{ Qi \} i\in [M ] are M circulant matrices. Such structured matrices frequently ap-
pear in image processing problems since they translate the convolution operator cor-
responding to linear and shift-invariant filters. As an illustration, Figure 2 shows
the circulant structure of the precision matrix associated with the Gaussian distri-
bution with density \pi(\theta) \propto \exp(-\|\Delta\theta\|^2/2). Here, the vector \theta \in \mathbb{R}^d is an image
reshaped in lexicographic order and \Delta is the Laplacian differential operator with pe-
riodic boundaries, also called the Laplacian filter. In this case the precision matrix
Q = \Delta \top \Delta is a circulant matrix [78] so that it is diagonalizable in the Fourier domain.
Therefore, sampling from \scrN (\bfitmu , Q - 1 ) can be performed in this domain as shown in
Algorithm 2.4. For Gaussian distributions with more general Toeplitz precision ma-
trices, Q can be replaced by its circulant approximation and then Algorithm 2.4 can
be used; see [92] for more details. Although not considered in this paper, other ap-
proaches to generating stationary Gaussian processes [59] have been considered, such
as the spectral [71, 97] and turning bands [65] methods.
Algorithmic Efficiency. Thanks to the use of the fast Fourier transform [26, 109],
Algorithm 2.4 admits a computational complexity of \scrO (d log(d)) flops. In addition,
note that only d-dimensional vectors have to be stored, which implies a memory
requirement of \Theta(d). Overall, these complexities are significantly smaller than those of vanilla Cholesky sampling and as such Algorithm 2.4 can be considered to be "efficient."
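As an illustration, here is a minimal NumPy sketch of a one-dimensional analogue of Algorithm 2.4 for a symmetric positive-definite circulant precision matrix; the block circulant case targeted by Algorithm 2.4 follows the same pattern with a two-dimensional FFT (function names are ours):

```python
import numpy as np

def sample_circulant_precision(rng, mu, c):
    """Draw theta ~ N(mu, Q^{-1}) when Q is a symmetric positive-definite
    circulant matrix with first column c. Q is diagonalized by the discrete
    Fourier transform, so the whole draw costs O(d log d) flops."""
    d = mu.shape[0]
    q_hat = np.fft.fft(c).real            # eigenvalues of Q (real for a symmetric circulant)
    if np.any(q_hat <= 0):
        raise ValueError("the precision matrix is not positive definite")
    z = rng.standard_normal(d)            # white Gaussian noise
    # apply the inverse square root of Q in the Fourier domain; the result is real
    w = np.fft.ifft(np.fft.fft(z) / np.sqrt(q_hat)).real
    return mu + w

rng = np.random.default_rng(2)
c = np.array([2.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0])   # 1D Laplacian-like filter
theta = sample_circulant_precision(rng, np.zeros(8), c)
```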
2.2.5. Truncated and Intrinsic Gaussian Distributions. Several works have fo-
cused on sampling from various probability distributions closely related to the Gaus-
sian distribution on \BbbR d . Two cases are worth mentioning here, namely, the truncated
and so-called intrinsic Gaussian distributions. Truncated Gaussian distributions on
Algorithm 2.4 Sampler when Q is a block circulant matrix with circulant blocks
Input: M and N , the number of blocks, and the size of each block, respectively.
\mathcal{D} \subset \mathbb{R}^d admit, for any \theta \in \mathbb{R}^d, p.d.f.s of the form

(2.3)    \pi_{\mathcal{D}}(\theta) = 1_{\mathcal{D}}(\theta) \cdot Z_{\mathcal{D}}^{-1} \exp\Bigl( -\tfrac{1}{2}(\theta - \mu)^\top \Sigma^{-1}(\theta - \mu) \Bigr),
where Z\scrD is the appropriate normalizing constant and 1\scrD (\bfittheta ) = 1 if \bfittheta \in \scrD , and 0 oth-
erwise. The subset \mathcal{D} is usually defined by equalities and/or inequalities. Truncations on the hypercube are such that \mathcal{D} = \prod_{i=1}^{d}[a_i, b_i], (a_i, b_i) \in \mathbb{R}^2, 1 \leq i \leq d, and truncations on the simplex are such that \mathcal{D} = \{\theta \in \mathbb{R}^d \mid \sum_{i=1}^{d}\theta_i = 1\}. Rejection and Gibbs
sampling algorithms dedicated to these distributions can be found in [5, 60, 107].
Intrinsic Gaussian distributions are such that \Sigma is not of full rank, that is, Q
may have eigenvalues equal to zero. This yields an improper Gaussian distribution
often used as a prior in GMRFs to remove trend components [92]. Sampling from
the latter can be done by identifying an appropriate subspace of \BbbR d , where the target
distribution is proper, and then sampling from the proper Gaussian distribution on
this subspace [12, 81].
All the usual special sampling problems above will not be considered in what
follows since they have already been exhaustively reviewed in the literature.
Example 2.1 (Bayesian ridge regression). Let us consider a ridge regression prob-
lem from a Bayesian perspective [13]. For the sake of simplicity and without loss
of generality, assume that the observations y \in \mathbb{R}^n and the known predictor matrix X \in \mathbb{R}^{n \times d} have been standardized. Under these assumptions, we consider the statistical model associated with observations y written as

(2.5)    y = X\theta + \varepsilon,
where \bfittheta \in \BbbR d and \bfitvarepsilon \sim \scrN (0n , \sigma 2 In ). In this example, the standard deviation \sigma is
known and fixed. The conditional prior distribution for \bfittheta is chosen as Gaussian i.i.d.,
that is,
(2.6)    p(\theta \mid \tau) \propto \exp\Bigl( -\frac{1}{2\tau}\|\theta\|^2 \Bigr),

(2.7)    p(\tau) \propto 1_{\mathbb{R}_+\setminus\{0\}}(\tau)\,\frac{1}{\tau},
where \tau > 0 is an unknown variance parameter which is given a diffuse and improper (i.e., nonintegrable) Jeffreys prior [53, 86]. Bayes' rule then leads to the target joint posterior distribution with density
(2.8)    p(\theta, \tau \mid y) \propto 1_{\mathbb{R}_+\setminus\{0\}}(\tau)\,\frac{1}{\tau} \exp\Bigl( -\frac{1}{2\tau}\|\theta\|^2 - \frac{1}{2\sigma^2}\|y - X\theta\|^2 \Bigr).
Sampling from this joint posterior distribution can be conducted using a Gibbs sam-
pler [36, 87], which sequentially samples from the conditional posterior distributions.
In particular, the conditional posterior distribution associated to \bfittheta is Gaussian with
precision matrix and mean vector
(2.9)    Q = \frac{1}{\sigma^2} X^\top X + \tau^{-1} I_d,

(2.10)   \mu = \frac{1}{\sigma^2} Q^{-1} X^\top y.
Challenges related to handling the matrix Q already appear in this classical and
simple regression problem. Indeed, Q is possibly high-dimensional and dense, which
potentially rules out its storage; see Table 1. The inversion required to compute the
mean (2.10) may be very expensive as well. In addition, since \tau is unknown, its value
changes at each iteration of the Gibbs sampler used to sample from the joint distri-
bution with density (2.8). Hence, precomputing the matrix Q - 1 is not possible. As
an illustration on real data, Figure 3 represents three examples of precision matrices2
X\top X for the MNIST [57], leukemia [6], and CoEPrA [25] datasets. One can note that
these precision matrices are potentially both high-dimensional and dense, penalizing
their numerical inversion at each iteration of the Gibbs sampler.
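To make this bottleneck concrete, here is a minimal NumPy sketch of the conditional draw of \theta given \tau in this ridge model, using a vanilla Cholesky factorization of (2.9); it is precisely this step, repeated at every Gibbs iteration, that becomes infeasible when d grows (function names are ours):

```python
import numpy as np

def sample_theta_given_tau(rng, X, y, sigma2, tau):
    """One conditional draw of theta | tau, y in the Bayesian ridge model:
    theta is Gaussian with precision Q = X^T X / sigma2 + I_d / tau and mean
    Q^{-1} X^T y / sigma2, cf. (2.9)-(2.10). A dense Cholesky factor of Q is
    recomputed here, which costs Theta(d^3) flops."""
    d = X.shape[1]
    Q = X.T @ X / sigma2 + np.eye(d) / tau
    C = np.linalg.cholesky(Q)                 # Q = C C^T
    mean = np.linalg.solve(Q, X.T @ y / sigma2)
    z = rng.standard_normal(d)
    return mean + np.linalg.solve(C.T, z)     # C^{-T} z has covariance Q^{-1}

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
theta = sample_theta_given_tau(rng, X, y, sigma2=1.0, tau=0.5)
```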
A host of past contributions relates to high-dimensional Gaussian sampling: it is impossible to cite them exhaustively. As far as possible, the following review aims at gathering and citing the main contributions; we refer the reader to the references therein for more details.
^2 When considering the dataset itself, X^\top X is usually interpreted as the empirical covariance of the data X. The reader should not be disturbed by the fact that, for the variable \theta to be inferred, X^\top X will, however, play the role of a precision matrix.
Fig. 3 Examples of precision matrices X^\top X for three datasets. Left: MNIST dataset [57], for which only the predictors associated with the digits 5 and 3 have been taken into account. Middle: leukemia dataset [6], for which only the first 5,000 predictors (out of 12,600) have been taken into account. Right: CoEPrA dataset [25].
Next, sections 3 and 4 review the two main
families of approaches that deal with the sampling issues raised above. In section 3,
we deal with approaches derived from numerical linear algebra. On the other hand,
section 4 deals with MCMC sampling approaches. Section 5 then proposes a unifying
revisit to Gibbs samplers thanks to a stochastic counterpart of the PPA. Similarly
to subsection 2.2, computational costs, storage requirements, and accuracy of the re-
viewed Gaussian sampling approaches will be detailed for each method in a dedicated
paragraph entitled Algorithmic Efficiency. For a synthetic summary and comparison
of these metrics, we refer the interested reader to Table 7.
3. Sampling Algorithms Derived from Numerical Linear Algebra. This sec-
tion presents sampling approaches corresponding to direct adaptations of classical
techniques used in numerical linear algebra [41]. They can be divided into three main
groups: (i) factorization methods that consider appropriate decompositions of Q, (ii)
inverse square root approximation approaches where approximations of Q - 1/2 are
used to obtain samples from \scrN (\bfitmu , Q - 1 ) at a reduced cost compared to factorization
approaches, and (iii) conjugate gradient based methods.
3.1. Factorization Methods. We begin this review with the most basic but com-
putationally involved sampling techniques, namely, factorization approaches that were
introduced in section 1. These methods exploit the positive definiteness of Q to de-
compose it as a product of simpler matrices and are essentially based on the celebrated
Cholesky factorization [20]. Though helpful for problems in small or moderate dimen-
sions, these basic sampling approaches fail to address, in high-dimensional scenarios,
the computational and memory issues raised in subsection 2.3.
3.1.1. Cholesky Factorization. Since Q is symmetric and positive definite, we
noted in (1.1) that there exists a unique lower triangular matrix C \in \BbbR d\times d , called
the Cholesky factor, with positive diagonal entries such that Q = CC\top [41]. Algo-
rithm 3.1 details how such a decomposition3 can be used to obtain a sample \bfittheta from
\scrN (\bfitmu , Q - 1 ).
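For reference, a minimal NumPy sketch of this Cholesky-based sampling step could read as follows, in both the precision and the covariance parameterizations (function names are ours):

```python
import numpy as np

def cholesky_sampler_precision(rng, mu, Q):
    """Sample theta ~ N(mu, Q^{-1}) from the factorization Q = C C^T: draw
    z ~ N(0_d, I_d) and solve the triangular system C^T w = z, so that
    cov(w) = (C C^T)^{-1} = Q^{-1}."""
    C = np.linalg.cholesky(Q)
    z = rng.standard_normal(mu.shape[0])
    return mu + np.linalg.solve(C.T, z)

def cholesky_sampler_covariance(rng, mu, Sigma):
    """Covariance counterpart: with Sigma = L L^T, the simpler step w = L z
    already has cov(w) = Sigma."""
    L = np.linalg.cholesky(Sigma)
    return mu + L @ rng.standard_normal(mu.shape[0])
```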
Algorithmic Efficiency. In the general case where Q presents no particular structure, the computational cost is \Theta(d^3) and the storage requirement is \Theta(d^2); see also section 1 and Table 1. If the dimension d is large but the matrix Q has a sparse structure, the computational and storage requirements of the Cholesky factorization can be reduced by reordering the components of Q to design an equivalent band matrix [91]; see Figure 1.
^3 When working with the covariance matrix rather than the precision matrix, the Cholesky decomposition \Sigma = LL^\top leads to the simpler step 3: w = Lz.
3.2.1. Polynomial Approximation. A first family of approaches approximates the inverse square root B^{-1} = Q^{-1/2} by a polynomial function f of Q; the bounds \lambda_l and \lambda_u on the eigenvalues of Q needed to build this approximation can be obtained cheaply with the Gershgorin circle theorem [41, Theorem 7.2.1]. In the literature [24, 51, 82], the function f has been built using Chebyshev polynomials [70], which are a family (T_k)_{k\in\mathbb{N}} of polynomials defined over [-1, 1] by T_k(x) = \cos(k \arccos x), or by the recursion
(3.1)    T_0(x) = 1, \quad T_1(x) = x, \quad T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x) \quad (\forall k \geq 1).
This family of polynomials (Tk )k\in \BbbN exhibits several interesting properties: uniform
convergence of the Chebyshev series toward an arbitrary Lipschitz-continuous function
over [ - 1, 1] and near minimax property [70], along with fast computation of the
coefficients of the series via the fast Fourier transform [84]. Algorithm 3.2 describes the
steps to generating arbitrary Gaussian vectors using this polynomial approximation.
Algorithmic Efficiency. Contrary to factorization methods detailed in subsection 3.1,
Algorithm 3.2 does not require the storage of Q since only the computation of matrix-
vector products of the form Qv with v \in \BbbR d is necessary. Assuming that these
operations can be computed efficiently in \mathcal{O}(d^2) flops with some black-box routine, e.g., a fast wavelet transform [64], Algorithm 3.2 admits an overall computational cost of \mathcal{O}(K_{cheby} d^2) flops and a storage capacity of \Theta(d), where K_{cheby} is a truncation
parameter giving the order of the polynomial approximation. When Kcheby \ll d, Al-
gorithm 3.2 becomes more computationally efficient than vanilla Cholesky sampling
while admitting a reasonable memory overhead. For a sparse precision matrix Q com-
posed of nnz nonzero entries, the computational complexity can be reduced down to
\scrO (Kcheby nnz ) flops. Note that compared to factorization approaches, Algorithm 3.2
involves an additional source of approximation coming from the order of the Cheby-
shev series Kcheby . Choosing this parameter in an adequate manner involves some
fine-tuning or additional computationally intensive statistical tests [82].
3.2.2. Lanczos Approximation. Instead of approximating the inverse square
root B - 1 , some approaches directly approximate the matrix-vector product B - 1 z
by building on the Lanczos decomposition [7, 21, 51, 52, 98, 99]. The correspond-
ing simulation-based algorithm is described in Algorithm 3.3. It iteratively builds
an orthonormal basis H = \{h_1, \ldots, h_{K_{kryl}}\} \in \mathbb{R}^{d \times K_{kryl}} with K_{kryl} \leq d for the Krylov subspace \mathcal{K}_{K_{kryl}}(Q, z) \triangleq \mathrm{span}\{z, Qz, \ldots, Q^{K_{kryl}-1}z\}, and a tridiagonal matrix T \approx H^\top Q H \in \mathbb{R}^{K_{kryl} \times K_{kryl}}. Using the orthogonality of H and the fact that z = \beta^{(0)} H e_1 with \beta^{(0)} = \|z\|, where e_1 is the first canonical vector of \mathbb{R}^{K_{kryl}}, the final approximation is the one computed in step 13 of Algorithm 3.3, namely \theta = \mu + \beta^{(0)} H T^{-1/2} e_1.
Algorithm 3.2 Approximate sampler based on a K_{cheby}-truncated Chebyshev series approximation of Q^{-1/2}
 1: Choose bounds \lambda_l \leq \min_{j\in[d]} \lambda_j(Q) and \lambda_u \geq \max_{j\in[d]} \lambda_j(Q) on the spectrum of Q (e.g., via the Gershgorin circle theorem).
 2: for j \in \{0, \ldots, K_{cheby}\} do \triangleleft Do the change of interval.
 3:   Set g_j = \bigl[ \cos\bigl( \pi \frac{2j+1}{2K_{cheby}} \bigr) \frac{\lambda_u - \lambda_l}{2} + \frac{\lambda_u + \lambda_l}{2} \bigr]^{-1/2}.
 4: end for
 5: for k \in \{0, \ldots, K_{cheby}\} do \triangleleft Compute the coefficients of the K_{cheby}-truncated Chebyshev series.
 6:   Compute c_k = \frac{2}{K_{cheby}} \sum_{j=0}^{K_{cheby}} g_j \cos\bigl( \pi k \frac{2j+1}{2K_{cheby}} \bigr).
 7: end for
 8: Draw z \sim \mathcal{N}(0_d, I_d).
 9: Set \alpha = \frac{2}{\lambda_u - \lambda_l} and \beta = \frac{\lambda_u + \lambda_l}{\lambda_u - \lambda_l}.
10: Initialize u_1 = \alpha Q z - \beta z and u_0 = z.
11: Set u = \frac{1}{2} c_0 u_0 + c_1 u_1 and k = 2.
12: while k \leq K_{cheby} do \triangleleft Compute the K_{cheby}-truncated Chebyshev series.
13:   Compute u' = 2(\alpha Q u_1 - \beta u_1) - u_0.
14:   Set u = u + c_k u'.
15:   Set u_0 = u_1 and u_1 = u'.
16:   Set k = k + 1.
17: end while
18: Set \theta = \mu + u. \triangleleft Build the Gaussian vector of interest.
19: return \theta.
Algorithmic Efficiency. As for Algorithm 3.2, one can note that K_{kryl} represents a trade-off between computation,
storage, and accuracy. As emphasized in [7, 98], adjusting this truncation parameter
can be achieved by using the conjugate gradient (CG) algorithm. In addition to pro-
viding an approximate sampling technique when Kkryl < d, the main and well-known
drawback of Algorithm 3.3 is that the basis H loses orthogonality in floating point
arithmetic due to round-off errors. Some possibly complicated procedures to cope
with this problem are surveyed in [100]. Finally, one major problem of the Lanczos
decomposition is the construction and storage of the basis H \in \BbbR d\times Kkryl , which be-
comes as large as Q when Kkryl tends to d. Two main approaches have been proposed
to deal with this problem, namely, a so-called 2-pass strategy and a restarting strat-
egy, both reviewed in [7, 52, 98]. In addition, preconditioning methods have been
proposed to reduce the computational burden of Lanczos samplers [21].
3.2.3. Other Square Root Approximations. At least two other methods have
been proposed to approximate the inverse square root B - 1 . Since these approaches
have been less used than others, only their main principle is given, and we refer
the interested reader to the corresponding references. The first one is the rational
approximation of B - 1 based on numerical quadrature of a contour integral [45], while
the other one is a continuous deformation method based on a system of ordinary
differential equations [4]. These two approaches are reviewed and illustrated using
numerical examples in [7].
Algorithm 3.3 Sampler based on the Lanczos decomposition
 1: Draw z \sim \mathcal{N}(0_d, I_d).
 2: Set r^{(0)} = z, h^{(0)} = 0_d, \beta^{(0)} = \|r^{(0)}\|, and h^{(1)} = r^{(0)}/\beta^{(0)}.
 3: for k \in [K_{kryl}] do
 4:   Set w = Qh^{(k)} - \beta^{(k-1)} h^{(k-1)}.
 5:   Set \alpha^{(k)} = w^\top h^{(k)}.
 6:   Set w = w - \alpha^{(k)} h^{(k)}. \triangleleft Gram--Schmidt orthogonalization process.
 7:   Set \beta^{(k)} = \|w\|.
 8:   Set h^{(k+1)} = w/\beta^{(k)}.
 9: end for
10: Set T = tridiag(\beta, \alpha, \beta).
11: Set H = (h^{(1)}, \ldots, h^{(K_{kryl})}).
12:
13: Set \theta = \mu + \beta^{(0)} H T^{-1/2} e_1, where e_1 = (1, 0, \ldots, 0)^\top \in \mathbb{R}^{K_{kryl}}.
14: return \theta.
where b = Q\mu is called the potential vector. If one is able to draw a Gaussian vector z' \sim \mathcal{N}(0_d, Q), then a sample \theta from \mathcal{N}(\mu, Q^{-1}) is obtained by solving the linear system

(3.4)    Q\theta = b + z',

where Q is positive definite, so that CG methods are relevant. This approach uses
the affine transformation of a Gaussian vector u = b + z\prime : if u \sim \scrN (Q\bfitmu , Q), then
Q - 1 u \sim \scrN (\bfitmu , Q - 1 ).
3.3.1. Perturbation before Optimization. A first possibility to handling the
perturbed linear problem (3.4) consists of first computing the potential vector b, then
perturbing this vector with the Gaussian vector z\prime , and finally solving the linear system
with numerical algebra techniques. This approach is detailed in Algorithm 3.4. While
the computation of b is not difficult in general, drawing z\prime might be computationally
involved. Hence, this sampling approach is of interest only if we are able to draw
the Gaussian vector z' efficiently (i.e., in \mathcal{O}(d^2) flops). This is, for instance, the case when Q = Q_1 + Q_2 with Q_i = G_i^\top \Lambda_i^{-1} G_i (i \in [2]), provided that the symmetric and positive-definite matrices \{\Lambda_i\}_{i\in[2]} have simple structures; see subsection 2.2. Such
situations often arise when Bayesian hierarchical models are considered [86, Chapter
10]. For these scenarios, an efficient way to compute b + z\prime has been proposed in [79]
based on a local perturbation of the mean vectors \{ \bfitmu i \} i\in [2] . Such an approach has
been coined perturbation-optimization (PO) since it draws perturbed versions of the
mean vectors involved in the hierarchical model before using them to define the linear
system to be solved [79].
Algorithm 3.4 Sampler based on perturbation before optimization
1: Draw z' \sim \mathcal{N}(0_d, Q).
2: Set \eta = Q\mu + z'.
3: Solve Q\theta = \eta w.r.t. \theta. \triangleleft with the CG solver as used, for instance, in [47].
4: return \theta.
Algorithmic Efficiency. If K \in \BbbN \ast iterations of an appropriate linear solver (e.g., the
CG method) are used for step 3 in Algorithm 3.4, the global computational and stor-
age complexities of this algorithm are of order \scrO (Kd2 ) and \Theta (d). Regarding sampling
accuracy, while Algorithm 3.4 in theory is an exact approach, the K-truncation pro-
cedure implies an approximate sampling scheme [77]. A solution to correct this bias
has been proposed in [37] which builds upon a reversible-jump approach [43].
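For illustration, here is a minimal SciPy sketch of this perturbation-optimization route for the ridge model of Example 2.1, where Q = X^\top X/\sigma^2 + I_d/\tau; the two terms of Q are perturbed separately in the spirit of the local perturbation of [79], and the linear system is solved with a matrix-free conjugate gradient solver, so that truncating CG makes the draw approximate (all function names are ours):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def po_sampler_ridge(rng, X, y, sigma2, tau):
    """Perturbation-optimization sketch for Q = X^T X / sigma2 + I_d / tau.
    The perturbed potential b + z' with z' ~ N(0_d, Q) is obtained by
    perturbing each term of Q separately, and Q theta = b + z' is then
    solved with CG without ever forming Q."""
    n, d = X.shape
    b = X.T @ y / sigma2
    # z' = X^T e1 / sigma + e2 / sqrt(tau) has covariance X^T X / sigma2 + I_d / tau = Q
    e1 = rng.standard_normal(n)
    e2 = rng.standard_normal(d)
    z_prime = X.T @ e1 / np.sqrt(sigma2) + e2 / np.sqrt(tau)
    Q_op = LinearOperator((d, d), matvec=lambda v: X.T @ (X @ v) / sigma2 + v / tau)
    theta, info = cg(Q_op, b + z_prime)       # truncated CG yields an approximate draw
    return theta

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 30))
theta = po_sampler_ridge(rng, X, rng.standard_normal(100), sigma2=1.0, tau=0.5)
```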
3.3.2. Optimization with Perturbation. Alternatively, (3.4) can also be seen as
a perturbed version of the linear system Q\bfittheta = b. Thus, some works have focused
on modified versions of well-known linear solvers such as the CG solver [81, 91, 95].
Actually, only one additional line of code providing a univariate Gaussian sampling
step (perturbation) is required to turn the classical CG solver into a CG sampler
[81, 95]; see step 8 in Algorithm 3.5. This perturbation step sequentially builds a
Gaussian vector with a covariance matrix that is the best k-rank approximation of
Q - 1 in the Krylov subspace \scrK k (Q, r(0) ) [81]. Then a perturbation vector y(KCG ) is
simulated before addition to \bfitmu so that finally \bfittheta = \bfitmu + y(KCG ) .
Algorithmic Efficiency. Algorithm 3.5 yields approximate samples because of finite machine precision and the K_{CG}-truncation procedure. In addition, the covari-
ance of the generated samples depends on the distribution of the eigenvalues of the
matrix Q. Actually, if these eigenvalues are not well spread out, Algorithm 3.5 stops
after KCG < d iterations, which yields an approximate sample with the best KCG -
rank approximation of Q - 1 as the actual covariance matrix. In order to correct this
approximation, reorthogonalization schemes can be employed but could become as
computationally prohibitive as Cholesky sampling when d is large [95]. These sources
of approximation are detailed in [81]. A generalization of Algorithm 3.5 has been con-
sidered in [30], where a random set of K \prime mutually conjugate directions \{ h(k) \} k\in [K \prime ]
is considered at each iteration of a Gibbs sampler.
4.1. Matrix Splitting. We begin the review of MCMC samplers by detailing so-
called matrix-splitting (MS) approaches that build on the decomposition Q = M - N
of the precision matrix. As we shall see, both exact and approximate MS samplers
have been proposed in the existing literature. These methods embed one of the
simplest MCMC methods, namely, the componentwise Gibbs sampler [36]. Similarly
to Algorithm 3.1 for samplers in section 3, it can be viewed as one of the simplest and
most straightforward approaches to sampling from a target Gaussian distribution.
Algorithm 4.1 Componentwise Gibbs sampler
1: Set t = 1.
2: while t \leq T do
3:   for i \in [d] do
4:     Draw z \sim \mathcal{N}(0, 1).
5:     Set \theta_i^{(t)} = \frac{[Q\mu]_i}{Q_{ii}} + \frac{z}{\sqrt{Q_{ii}}} - \frac{1}{Q_{ii}} \Bigl( \sum_{j>i} Q_{ij}\theta_j^{(t-1)} + \sum_{j<i} Q_{ij}\theta_j^{(t)} \Bigr).
6:   end for
7:   Set t = t + 1.
8: end while
9: return \theta^{(T)}.
Here L and D denote the strictly lower triangular and diagonal parts of Q, respectively. Indeed, each iteration of Algorithm 4.1 solves the linear system

(D + L)\theta^{(t)} = Q\mu + D^{1/2}z - L^\top\theta^{(t-1)}, \quad z \sim \mathcal{N}(0_d, I_d),

which corresponds to the Gauss--Seidel splitting of Q; see Table 2.
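For illustration, a minimal NumPy sketch of the componentwise sweep of Algorithm 4.1 could read as follows (function names are ours; a dense Q is assumed purely for readability):

```python
import numpy as np

def componentwise_gibbs(rng, Q, mu, n_iter):
    """Componentwise Gibbs sweep: each coordinate theta_i is drawn from its
    univariate Gaussian conditional, reusing the coordinates already updated
    during the current sweep (step 5 of Algorithm 4.1)."""
    d = Q.shape[0]
    theta = np.zeros(d)
    b = Q @ mu
    for _ in range(n_iter):
        for i in range(d):
            s = Q[i, :] @ theta - Q[i, i] * theta[i]     # sum_{j != i} Q_ij theta_j
            theta[i] = (b[i] - s) / Q[i, i] + rng.standard_normal() / np.sqrt(Q[i, i])
    return theta

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 6))
Q = A @ A.T + 6 * np.eye(6)                              # symmetric positive definite
theta = componentwise_gibbs(rng, Q, np.zeros(6), n_iter=100)
```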
Table 2 Examples of MS schemes for Q which can be used in Algorithm 4.2. The matrices D and L denote the diagonal and strictly lower triangular parts of Q, respectively. The vector \tilde{z} is the one appearing in step 3 of Algorithm 4.2 and \omega is a positive scalar.

Sampler | M | N | cov(\tilde{z}) = M^\top + N | Convergence
Richardson | I_d/\omega | I_d/\omega - Q | 2I_d/\omega - Q | 0 < \omega < 2/\|Q\|
Jacobi | D | D - Q | 2D - Q | |Q_{ii}| > \sum_{j \neq i} |Q_{ij}| for all i \in [d]
Gauss--Seidel | D + L | -L^\top | D | always
SOR | D/\omega + L | ((1 - \omega)/\omega)D - L^\top | ((2 - \omega)/\omega)D | 0 < \omega < 2
Algorithm 4.2 Gibbs sampler based on exact matrix splitting
1: Set t = 1.
2: while t \leq T do
3:   Draw \tilde{z} \sim \mathcal{N}(0_d, M^\top + N).
4:   Solve M\theta^{(t)} = Q\mu + \tilde{z} + N\theta^{(t-1)} w.r.t. \theta^{(t)}.
5:   Set t = t + 1.
6: end while
7: return \theta^{(T)}.
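As an example, a minimal NumPy/SciPy sketch of Algorithm 4.2 instantiated with the Gauss--Seidel splitting of Table 2 is given below; since cov(\tilde{z}) = D, step 3 reduces to d independent univariate draws and step 4 to a forward substitution (function names are ours):

```python
import numpy as np
from scipy.linalg import solve_triangular

def gauss_seidel_gibbs(rng, Q, mu, n_iter):
    """Exact MS Gibbs sampler with M = D + L (lower triangle of Q including
    the diagonal), N = -L^T, and cov(z_tilde) = M^T + N = D."""
    d = Q.shape[0]
    D = np.diag(Q)
    M = np.tril(Q)                   # D + L
    N = M - Q                        # -L^T
    b = Q @ mu
    theta = np.zeros(d)
    for _ in range(n_iter):
        z_tilde = np.sqrt(D) * rng.standard_normal(d)        # N(0_d, D)
        theta = solve_triangular(M, b + z_tilde + N @ theta, lower=True)
    return theta

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 8))
Q = A @ A.T + 8 * np.eye(8)
theta = gauss_seidel_gibbs(rng, Q, np.ones(8), n_iter=200)
```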
Algorithmic Efficiency. Each iteration of Algorithm 4.2 involves a trade-off between the cost of the linear solve in step 4, akin to (4.2), and the difficulty of sampling \tilde{z} with covariance M^\top + N. As pointed out in [32], the simpler M, the denser M^\top + N and the more difficult the sampling of \tilde{z}. For instance, Jacobi and Richardson schemes yield a simple diagonal linear system
requiring \scrO (d) flops, but one has to sample from a Gaussian distribution with an
arbitrary covariance matrix; see step 3 of Algorithm 4.2. Iterative samplers requiring
at least K steps such as those reviewed in section 3 can be used. This yields a com-
putational burden of \scrO (KT d2 ) flops for step 3 and as such Jacobi and Richardson
samplers admit a computational cost of \scrO (KT d2 ) and a storage requirement of \Theta (d).
On the other hand, both Gauss--Seidel and SOR schemes are associated to a simple
sampling step which can be performed in \scrO (d) flops with Algorithm 2.2, but one
has to solve a lower triangular system which can be done in \scrO (d2 ) flops via forward
substitution. In order to mitigate the trade-off between steps 3 and 4, approximate
MS approaches have been proposed recently [9, 54]; see subsection 4.1.2.
Polynomial Accelerated Gibbs Samplers. When the splitting Q = M - N is
symmetric, that is, both M and N are symmetric matrices, the rate of convergence of
Algorithm 4.2 can be improved by using polynomial preconditioners [32]. For ease of
presentation, we will first explain how such a preconditioning accelerates linear solvers
based on MS, before building upon the one-to-one equivalence between linear solvers
and Gibbs samplers to show how Algorithm 4.2 can be accelerated. Given a linear
system Q\theta = v for v \in \mathbb{R}^d, linear solvers based on the MS Q = M - N consider the recursion, for any t \in \mathbb{N} and \theta^{(0)} \in \mathbb{R}^d, \theta^{(t+1)} = \theta^{(t)} + M^{-1}(v - Q\theta^{(t)}). The error at iteration t, defined by e^{(t+1)} = \theta^{(t+1)} - Q^{-1}v, can be shown to be equal to (I_d - M^{-1}Q)^t e^{(0)} [41]. Since this error is a tth-order polynomial of M^{-1}Q, it is then natural to wonder whether one can find another tth-order polynomial \mathsf{P}_t that achieves a lower error, that is, \rho(\mathsf{P}_t(M^{-1}Q)) < \rho((I_d - M^{-1}Q)^t). This can be accomplished
by considering the second-order iterative scheme defined, for any t \in \mathbb{N}, by [8]

\theta^{(t+1)} = \alpha_t \theta^{(t)} + (1 - \alpha_t)\theta^{(t-1)} + \beta_t M^{-1}(v - Q\theta^{(t)}),

where (\alpha_t, \beta_t)_{t\in\mathbb{N}} is a set of acceleration parameters. This iterative method yields an error at step t given by e^{(t+1)} = \mathsf{P}_t(M^{-1}Q)e^{(0)}, where \mathsf{P}_t is a scaled Chebyshev polynomial; see (3.1). Optimal values for (\alpha_t, \beta_t)_{t\in\mathbb{N}} are given by [8]

\alpha_t = \tau_1 \beta_t \quad \text{and} \quad \beta_t = \bigl(\tau_1 - \tau_2^2 \beta_{t-1}\bigr)^{-1},

where \tau_1 = [\lambda_{\min}(M^{-1}Q) + \lambda_{\max}(M^{-1}Q)]/2 and \tau_2 = [\lambda_{\max}(M^{-1}Q) - \lambda_{\min}(M^{-1}Q)]/4.
Note that these optimal choices suppose that the minimal and maximal eigenvalues
of M - 1 Q are real valued and available. The first requirement is satisfied, for instance,
if the splitting Q = M - N is symmetric, while the second one is met by using the
CG algorithm as explained in [32]. In the literature [32, 88], a classical symmetric
splitting scheme that has been considered is derived from the SOR splitting and as
such is called symmetric SOR (SSOR). Denote by MSOR and NSOR the matrices
involved in the SOR splitting such that Q = MSOR - NSOR ; see row 4 of Table 2.
Then, for any 0 < \omega < 2, the SSOR splitting is defined by Q = MSSOR - NSSOR with
M_{SSOR} = \frac{\omega}{2 - \omega} M_{SOR} D^{-1} M_{SOR}^\top \quad \text{and} \quad N_{SSOR} = \frac{\omega}{2 - \omega} N_{SOR} D^{-1} N_{SOR}^\top.
By resorting to the one-to-one equivalence between linear solvers and Gibbs samplers,
[31, 32] showed that the acceleration above via Chebyshev polynomials can be applied
to Gibbs samplers based on a symmetric splitting. In this context, the main challenge
when dealing with accelerated Gibbs samplers compared to accelerated linear solvers
is the calibration of the noise covariance to ensure that the invariant distribution
coincides with \scrN (\bfitmu , Q - 1 ). For the sake of completeness, the pseudocode associated
to an accelerated version of Algorithm 4.2 based on the SSOR splitting is detailed
in Algorithm 4.3. Associated convergence results and numerical studies associated to
this algorithm can be found in [31, 32].
Algorithmic Efficiency. Similar to Algorithm 4.2, Algorithm 4.3 is exact in the sense
that it admits \scrN (\bfitmu , Q - 1 ) as an invariant distribution. The work [32] gave guidelines
to choosing the truncation parameter T such that the error between the distribution
of \bfittheta (T ) and \scrN (\bfitmu , Q - 1 ) is sufficiently small. Regarding computation and storage,
since triangular linear systems can be solved in \scrO (d2 ) flops by either back or forward
substitution, Algorithm 4.3 admits a computational cost of \mathcal{O}(Td^2) and a storage requirement of \Theta(d).
Algorithmic Efficiency. Regarding sampling accuracy, the Hogwild sampler and clone
MCMC define a Markov chain whose invariant distribution is Gaussian with the cor-
rect mean \mu but with precision matrix \widetilde{Q}_{MS}, where

\widetilde{Q}_{MS} = Q\bigl( I_d - D^{-1}(L + L^\top) \bigr) for the Hogwild sampler,
\widetilde{Q}_{MS} = Q\bigl( I_d - \tfrac{1}{2}(D + 2\omega I_d)^{-1} Q \bigr) for clone MCMC.

Contrary to the Hogwild sampler, clone MCMC is able to sample exactly from \mathcal{N}(\mu, Q^{-1}) in the asymptotic scenario \omega \to 0, since in this case \widetilde{Q}_{MS} \to Q. While
retaining a memory requirement of \Theta (d), the induced approximation yields a highly
parallelizable sampler. Indeed, compared to Algorithm 4.2, the computational com-
plexities associated to step 3 and the solving of the triangular system in step 4 are
decreased by an order of magnitude to \scrO (d). Note that the overall computational
complexity of step 4 is still \scrO (d2 ) because of the matrix-vector product N\bfittheta (t - 1) .
4.2. Data Augmentation. Since the precision matrix Q has been assumed to be
arbitrary, the MS schemes Q = M - N in Table 2 were not motivated by its structure
but rather by the computational efficiency of the associated samplers. Hence, inspired
by efficient linear solvers, relevant choices for M and N given in Tables 2 and 3 have
Table 3 MS schemes for Q that can be used in Algorithm 4.4. The matrices D and L denote the diagonal and strictly lower triangular parts of Q, respectively. The vector \tilde{z}' is the one appearing in step 3 of Algorithm 4.4 and \omega > 0 is a tuning parameter controlling the bias of those methods. Sufficient conditions to guarantee \rho(M^{-1}N) < 1 are given in [9, 54].

Sampler | M | N | cov(\tilde{z}')
Hogwild with blocks of size 1 [54] | D | -L - L^\top | D
Clone MCMC [9] | D + 2\omega I_d | 2\omega I_d - L - L^\top | 2(D + 2\omega I_d)
and yields proper marginal distributions. Figure 4 describes the directed acyclic
graphs (DAGs) associated with two hierarchical models proposed in [66, 67] to de-
couple Q1 from Q2 by involving auxiliary variables. In what follows, we detail the
motivations behind these two DA schemes. Among the two matrices Q1 and Q2
involved in the composite precision matrix Q, without loss of generality, we assume
that Q2 presents a particular and simpler structure (e.g., diagonal or circulant) than
Q1 . We want now to benefit from this structure by leveraging the efficient sampling
schemes previously discussed in subsection 2.2 and well suited to handling a Gaussian
distribution with a precision matrix only involving Q2 . This is the aim of the first
DA model called EDA, which introduces the joint distribution with p.d.f.
(4.5)    \pi(\theta, u_1) \propto \exp\Bigl( -\tfrac{1}{2}\bigl[ (\theta - \mu)^\top Q(\theta - \mu) + (u_1 - \theta)^\top R(u_1 - \theta) \bigr] \Bigr),

with R = \omega^{-1} I_d - Q_1 and 0 < \omega < \|Q_1\|^{-1}, where \|\cdot\| is the spectral norm. The
resulting Gibbs sampler (see Algorithm 4.5) relies on two conditional Gaussian sam-
pling steps whose associated conditional distributions are detailed in Table 4. This
Fig. 4 Hierarchical models proposed in [66, 67], where \omega is such that 0 < \omega < \|Q_1\|^{-1}.
Table 4 Conditional probability distributions of \theta | u_1, u_1 | \theta, u_2, and u_2 | u_1 for the exact DA schemes detailed in subsection 4.2.1. The parameter \omega is such that 0 < \omega < \|Q_1\|^{-1}. For simplicity, the conditioning is notationally omitted.

Sampler | \theta \sim \mathcal{N}(\mu_\theta, Q_\theta^{-1}) | u_1 \sim \mathcal{N}(\mu_{u_1}, Q_{u_1}^{-1}) | u_2 \sim \mathcal{N}(\mu_{u_2}, Q_{u_2}^{-1})
GEDA | Q_\theta = \omega^{-1}I_d + Q_2, \; \mu_\theta = Q_\theta^{-1}(R u_1 + Q\mu) | Q_{u_1} = \omega^{-1}I_d, \; \mu_{u_1} = \theta - \omega(Q_1\theta - G_1^\top \Lambda_1 u_2) | Q_{u_2} = \Lambda_1, \; \mu_{u_2} = G_1 u_1
scheme has the great advantage of decoupling the two precision matrices Q1 and Q2
since they are not simultaneously involved in either of the two steps. In particular,
introducing the auxiliary variable u1 permits us to remove the dependence in Q1
when defining the precision matrix of the conditional distribution of \bfittheta . While effi-
cient sampling from this conditional is possible, we have to ensure that sampling the
auxiliary variable u1 can be achieved with a reasonable computational cost. Again, if
Q1 presents a nice structure, the specific approaches reviewed in subsection 2.2 can
be employed. If this is not the case, the authors of [66, 67] proposed a generalization of EDA, called GEDA, to simplify the whole Gibbs sampling procedure when Q_1 can be written as Q_1 = G_1^\top \Lambda_1 G_1. GEDA introduces an additional auxiliary variable u_2 such that the augmented p.d.f. is
\pi(\theta, u_1, u_2) \propto \exp\Bigl( -\tfrac{1}{2}\bigl[ (\theta - \mu)^\top Q(\theta - \mu) + (u_1 - \theta)^\top R(u_1 - \theta) \bigr] \Bigr)
(4.6)\qquad\qquad\qquad \times \exp\Bigl( -\tfrac{1}{2}(u_2 - G_1 u_1)^\top \Lambda_1 (u_2 - G_1 u_1) \Bigr).
The associated joint distribution yields conditional Gaussian distributions with di-
agonal covariance matrices for both u1 and u2 that can be sampled efficiently with
Algorithm 2.2; see Table 4.
Algorithmic Efficiency. First, both EDA and GEDA admit \scrN (\bfitmu , Q - 1 ) as invariant
distribution and hence are exact. Regarding EDA, since the conditional distribution
of u1 | \bfittheta might admit an arbitrary precision matrix in the worst-case scenario, its com-
putational and storage complexities are \scrO (KT d2 ) and \Theta (d), where K is a truncation
parameter associated to one of the algorithms reviewed in section 3. On the other
hand, GEDA benefits from an additional DA which yields the reduced computational
and storage requirements of \scrO (T d2 ) and \Theta (d).
4.2.2. Approximate Data Augmentation. An approximate DA scheme inspired
by variable-splitting approaches in optimization [2, 3, 15] was proposed in [104]. This
framework, also called asymptotically exact DA (AXDA) [105], was initially intro-
duced to deal with any target distribution, not limited to Gaussian distributions; it
therefore a fortiori applies to them as well. An auxiliary variable u \in \BbbR d is introduced
such that the joint p.d.f. of (\bfittheta , u) is
(4.7)    \pi(\theta, u) \propto \exp\Bigl( -\tfrac{1}{2}\Bigl[ (\theta - \mu)^\top Q_2(\theta - \mu) + (u - \mu)^\top Q_1(u - \mu) + \frac{\|u - \theta\|^2}{\omega} \Bigr] \Bigr),
where \omega > 0. The main idea behind (4.7) is to replicate the variable of interest \bfittheta in
order to sample via a Gibbs sampling strategy two different random variables u and
\bfittheta with covariance matrices involving, separately, Q1 and Q2 . This algorithm, coined
the split Gibbs sampler (SGS), is detailed in Algorithm 4.6 and sequentially draws
from the conditional distributions
(4.8)    u \mid \theta \sim \mathcal{N}\bigl( (\omega^{-1}I_d + Q_1)^{-1}(\omega^{-1}\theta + Q_1\mu), \, (\omega^{-1}I_d + Q_1)^{-1} \bigr),
(4.9)    \theta \mid u \sim \mathcal{N}\bigl( (\omega^{-1}I_d + Q_2)^{-1}(\omega^{-1}u + Q_2\mu), \, (\omega^{-1}I_d + Q_2)^{-1} \bigr).
Again, this approach has the great advantage of decoupling the two precision matrices
Q1 and Q2 defining Q since they are not simultaneously involved in either of the two
steps of the Gibbs sampler. In [66], the authors showed that exact DA schemes
(i.e., EDA and GEDA) generally outperform AXDA as far as Gaussian sampling is
concerned. This was expected since the AXDA framework proposed is not specifically
designed for Gaussian targets but for a wide family of distributions.
Algorithmic Efficiency. The sampling efficiency of Algorithm 4.6 depends upon the
parameter \omega which controls the strength of the coupling between u and \bfittheta as well as the
bias-variance trade-off of this method; it yields exact sampling when \omega \rightarrow 0. Indeed,
the marginal distribution of \bfittheta under the joint distribution with density defined in (4.7)
is a Gaussian with the correct mean \mu but with an approximate precision matrix \widetilde{Q}_{DA} that admits the closed-form expression

(4.10)    \widetilde{Q}_{DA} = Q_2 + \bigl( Q_1^{-1} + \omega I_d \bigr)^{-1}.
In the worst-case scenario where Q1 is arbitrary, sampling from the conditional distri-
bution (4.8) can be performed with an iterative algorithm running K iterations such
as those reviewed in section 3. Hence Algorithm 4.6 admits the same computational
and storage complexities as EDA (see Algorithm 4.5), that is, \scrO (KT d2 ) and \Theta (d).
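For illustration, a minimal NumPy sketch of Algorithm 4.6 alternating the two conditional draws (4.8) and (4.9) could read as follows; dense Cholesky factors of the two fixed conditional precisions are used purely for readability, whereas in practice each draw would exploit the structure of Q_1 or Q_2 (function names are ours):

```python
import numpy as np

def split_gibbs_sampler(rng, Q1, Q2, mu, omega, n_iter):
    """Split Gibbs sampler (SGS): alternate u | theta and theta | u draws,
    each involving only one of the two precision matrices Q1 and Q2."""
    d = mu.shape[0]
    A1 = np.eye(d) / omega + Q1             # precision of u | theta, cf. (4.8)
    A2 = np.eye(d) / omega + Q2             # precision of theta | u, cf. (4.9)
    C1, C2 = np.linalg.cholesky(A1), np.linalg.cholesky(A2)
    theta = mu.copy()
    for _ in range(n_iter):
        m1 = np.linalg.solve(A1, theta / omega + Q1 @ mu)
        u = m1 + np.linalg.solve(C1.T, rng.standard_normal(d))
        m2 = np.linalg.solve(A2, u / omega + Q2 @ mu)
        theta = m2 + np.linalg.solve(C2.T, rng.standard_normal(d))
    return theta

rng = np.random.default_rng(8)
d = 6
B = rng.standard_normal((d, d))
Q1 = B @ B.T + np.eye(d)
Q2 = np.diag(rng.uniform(0.5, 2.0, d))
theta = split_gibbs_sampler(rng, Q1, Q2, np.zeros(d), omega=0.1, n_iter=100)
```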
(5.1)    \kappa(\theta, u) \propto \pi(\theta) \exp\Bigl( -\tfrac{1}{2}(\theta - u)^\top R(\theta - u) \Bigr),

where u \in \mathbb{R}^d is an additional (auxiliary) variable and R \in \mathbb{R}^{d \times d} is a symmetric
matrix acting as a preconditioner such that \kappa defines a proper density on an appro-
priate state space. More precisely, in what follows, depending on the definition of the
variable u, the probability density \kappa in (5.1) will refer to either a joint p.d.f. \pi (\bfittheta , u) or
a conditional probability density \pi (\bfittheta | u). Contrary to the MCMC samplers detailed
in section 4, the methods described in section 3 do not use explicit surrogate distribu-
tions to simplify the sampling procedure. Instead, they directly perturb deterministic
approaches from numerical linear algebra without explicitly defining a simpler sur-
rogate distribution at each iteration. This feature can be encoded with the choice
R \rightarrow 0d\times d so that these methods can be described by this unifying model as well.
Then the main motivation for using the surrogate density \kappa is to precondition the
initial p.d.f. \pi to end up with simpler sampling steps as in section 4.
5.2. Revisiting MCMC Sampling Approaches. This section builds on the prob-
ability kernel density (5.1) to revisit, unify, and extend the exact and approximate
approaches reviewed in section 4. We emphasize that exact approaches indeed target
the distribution of interest \scrN (\bfitmu , Q - 1 ), while approximate approaches only target an
approximation of \scrN (\bfitmu , Q - 1 ).
5.2.1. From Exact Data Augmentation to Exact Matrix Splitting. We assume
here that the variable u refers to an auxiliary variable such that the joint distribution
of the couple (\bfittheta , u) has a density given by \pi (\bfittheta , u) \triangleq \kappa (\bfittheta , u). In addition, here we
restrict R to be positive definite. It follows that
(5.2)    \int_{\mathbb{R}^d} \pi(\theta, u)\, du = Z^{-1}\pi(\theta) \int_{\mathbb{R}^d} \exp\Bigl( -\tfrac{1}{2}(\theta - u)^\top R(\theta - u) \Bigr) du = \pi(\theta)

holds almost surely with Z = \det(R)^{-1/2}(2\pi)^{d/2}. Hence, the joint density (5.1)
yields an exact DA scheme whatever the choice of the positive-definite matrix R.
We will show that the exact DA approaches described in Algorithm 4.5 precisely fit
the proposed generic framework with a specific choice for the preconditioning matrix
R. We will then extend this class of exact DA approaches and show a one-to-one
equivalence between Gibbs samplers based on exact MS (see subsection 4.1.1) and
those based on exact DA (see subsection 4.2.1).
To this end, we start by making the change of variable v = Ru. Combined with
the joint probability density (5.1), this yields the following two conditional probability
distributions: v \mid \theta \sim \mathcal{N}(R\theta, R) and \theta \mid v \sim \mathcal{N}\bigl( (Q + R)^{-1}(Q\mu + v), (Q + R)^{-1} \bigr). Marginally in \theta, the resulting two-step Gibbs recursion corresponds to the splitting Q = M - N with M = Q + R and N = R, which boils down to the Gibbs sampler based on exact MS discussed in subsection 4.1.1
(see Algorithm 4.2).
Table 5 Equivalence relations between exact DA and exact MS approaches. The matrices Q_1 and Q_2 are such that Q = Q_1 + Q_2. The matrix D_1 denotes the diagonal part of Q_1, and \omega > 0 is a positive scalar ensuring the positive definiteness of R. EDAJ refers to a novel sampler derived from the proposed unifying framework.

R = cov(v | \theta) | (Q + R)^{-1} = cov(\theta | v) | M^\top + N = cov(\tilde{z}) | DA sampler | MS sampler
I_d/\omega - Q_1 | (I_d/\omega + Q_2)^{-1} | 2I_d/\omega + Q_2 - Q_1 | EDA [66] | Richardson [32]
D_1/\omega - Q_1 | (D_1/\omega + Q_2)^{-1} | 2D_1/\omega + Q_2 - Q_1 | EDAJ (novel) | Jacobi [32]
To illustrate the relevance of this rewriting when considering the case of two ma-
trices Q1 and Q2 that cannot be efficiently handled in the same basis, Table 5 presents
two possible choices of R which relate two MS strategies to their DA counterparts.
First, one particular choice of R (row 1 of Table 5) shows directly that the Richardson
MS sampler can be rewritten as the EDA sampler. More precisely, the autoregressive
process of order 1 w.r.t. \bfittheta defined by EDA yields a variant of the Richardson sampler.
This finding relates two different approaches proposed by authors from distinct com-
munities (numerical algebra and signal processing). Second, the proposed unifying
framework also permits us to go beyond existing approaches by proposing a novel ex-
act DA approach via a specific choice for the precision matrix R driven by an existing
MS method. Indeed, following the same rewriting trick with another particular choice
of R (row 2 of Table 5), an exact DA scheme can be easily derived from the Jacobi
MS approach. To our knowledge, this novel DA method, referred to as EDAJ in the
table, has not been documented in the existing literature.
Finally, the table reports two particular choices of R which lead to revisiting
existing MS and/or DA methods. It is worth noting that other relevant choices might
be possible, which would allow one to derive new exact DA and MS methods or to
draw further analogies between existing approaches. Note also that Table 5 shows the
main benefit of an exact DA scheme over its MS counterpart thanks to the decoupling
between Q1 and Q2 into two separate simulation steps. This feature can be directly
observed by comparing the first two columns of Table 5 with the third column.
5.2.2. From Approximate Matrix Splitting to Approximate Data Augmen-
tation. We now build on the proposed unifying proposal (5.1) to extend the class
of samplers based on approximate MS and reviewed in subsection 4.1.2. With some
abuse of notation, the variable u in (5.1) now refers to an iterate associated to \bfittheta .
More precisely, let us define u = \bfittheta (t - 1) to be the current iterate within an MCMC
algorithm and \kappa to be
(5.9)    \kappa(\theta, u) \triangleq p\bigl(\theta \mid u = \theta^{(t-1)}\bigr) \propto \pi(\theta) \exp\Bigl( -\tfrac{1}{2}\bigl(\theta - \theta^{(t-1)}\bigr)^\top R \bigl(\theta - \theta^{(t-1)}\bigr) \Bigr).
Readers familiar with MCMC algorithms will recognize in (5.9) a proposal density
that can be used within Metropolis--Hastings schemes [87]. However, unlike the
usual random-walk algorithm which considers the Gaussian proposal distribution
\scrN (\bfittheta (t - 1) , \lambda Id ) with \lambda > 0, the originality of (5.9) is to define the proposal by com-
bining the Gaussian target density \pi with a term that is equal to a Gaussian kernel
density when R is positive definite. If we always accept the proposed sample obtained
from (5.9) without any correction, that is, \bfittheta (t) = \bfittheta \widetilde \sim P (\cdot | u = \bfittheta (t - 1) ) with density
(5.9), this directly implies that the associated Markov chain converges in distribution
toward a Gaussian random variable with distribution \scrN(\bfitmu, \widetilde{\bfQ}^{-1}) with the correct
mean \bfitmu but with precision matrix
(5.10)   \widetilde{\bfQ} = \bfQ\bigl(\bfI_d + (\bfR + \bfQ)^{-1}\bfR\bigr).
This algorithm is detailed in Algorithm 5.1. Note again that one can obtain samples
from the initial target distribution \scrN (\bfitmu , Q - 1 ) by replacing step 4 with an accep-
tance/rejection step; see [87] for details.
Moreover, the instance (5.9) of (5.1) paves the way to an extended class of samplers based on approximate MS. More precisely, the draw of a proposed sample \widetilde{\bfittheta} from (5.9) can be replaced by the following two-step sampling procedure:
This recursion defines an extended class of approximate MS-based samplers and en-
compasses the Hogwild sampler reviewed in subsection 4.1.2 by taking R = - L - L\top .
In addition to the existing Hogwild approach, Table 6 lists two new approximate MS
approaches resulting from specific choices of the preconditioning matrix R. They are
coined approximate Richardson and Jacobi samplers since the expressions for M and
N are very similar to those associated to their exact counterparts; see Table 2. For
those two samplers, note that the approximate precision matrix \widetilde{\bfQ} tends toward 2\bfQ
in the asymptotic regime \omega \rightarrow 0. Indeed, for the approximate Jacobi sampler, we
have
\widetilde{\bfQ} = \bfQ\Bigl(\bfI_d + \omega\bigl(\tfrac{\bfI_d}{\omega} - \bfQ\bigr)\Bigr) = \bfQ\bigl(2\bfI_d - \omega\bfQ\bigr) \;\longrightarrow\; 2\bfQ \quad \text{as } \omega \rightarrow 0.
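As a quick numerical sanity check of this limit (our own illustration, not part of the paper's code, with an arbitrarily chosen SPD matrix Q), the relative gap between \widetilde{\bfQ} = \bfQ(2\bfI_d - \omega\bfQ) and 2\bfQ shrinks linearly in \omega:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
B = rng.standard_normal((d, d))
Q = B @ B.T / d + np.eye(d)                      # arbitrary SPD precision matrix
for omega in (1e-1, 1e-2, 1e-3):
    Q_tilde = Q @ (2.0 * np.eye(d) - omega * Q)  # Qtilde = Q (2 I_d - omega Q)
    gap = np.linalg.norm(Q_tilde - 2.0 * Q) / np.linalg.norm(2.0 * Q)
    print(f"omega = {omega:.0e}, relative gap to 2Q = {gap:.2e}")  # decreases like omega
```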
Table 6 Extended class of Gibbs samplers based on approximate MS with \bfQ = \bfM - \bfN with \bfN =
\bfR and approximate DA. The matrices \bfD and \bfL denote the diagonal and strictly lower
triangular parts of \bfQ , respectively. \omega is a positive scalar. Bold names and acronyms refer
to novel samplers derived from the proposed unifying framework.
\tfrac{1}{2}\bfM = cov(\bfv' \mid \bfittheta) | \tfrac{1}{2}\bfM^{-1} = cov(\bfittheta \mid \bfv') | \bfM = cov(\widetilde{\bfz}') | MS sampler | DA sampler
\tfrac{1}{2}\bfD | \tfrac{1}{2}\bfD^{-1} | \bfD | Hogwild [54] | ADAH
In order to retrieve the original precision matrix Q when \omega \rightarrow 0, [9] proposed an
approximate DA strategy that can be related to the work of [105].
In subsection 5.2.1, we showed that exact DA approaches can be rewritten to
recover exact MS approaches. In what follows, we will take the opposite path to
show that approximate MS approaches admit approximate DA counterparts that are
highly amenable to distributed and parallel computations. Using the fact that \bfz' = \bfQ\bfitmu + \bfz_1 + (\bfQ + \bfR)\bfz_2, where \bfz_1 \sim \scrN(\bfzero_d, \tfrac{1}{2}(\bfR + \bfQ)) and \bfz_2 \sim \scrN(\bfzero_d, \tfrac{1}{2}(\bfR + \bfQ)^{-1}),
the recursion (5.14) can be written equivalently as
\widetilde{\bfittheta} = (\bfQ + \bfR)^{-1}\bigl(\bfQ\bfitmu + \bfR\bfittheta^{(t-1)} + \bfz_1\bigr) + \bfz_2
and targets the joint distribution with density \pi (\bfittheta , v\prime ). Compared to the exact DA
approaches reviewed in subsection 4.2.1, the sampling difficulty associated to each
conditional sampling step is the same and only driven by the structure of the matrix
M = R + Q. In particular, this matrix becomes diagonal for the three specific
choices listed in Table 6. These choices lead to three new sampling schemes that we
name ADAH, ADAR, and ADAJ since they represent the DA counterparts of the
approximate MS samplers discussed above. Interestingly, these DA schemes naturally
emerge here without assuming any explicit decomposition Q = Q1 + Q2 or including
an additional auxiliary variable (as in GEDA). Finally, as previously highlighted, when
compared to their exact counterparts, these DA schemes have the great advantage of
leading to Gibbs samplers suited for parallel computations, hence simplifying the
sampling procedure.
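To make this concrete, here is a minimal Python sketch (ours, not the authors' PyGauss code) of the approximate scheme with the Hogwild-type choice R = -L - L^\top, for which M = R + Q = D is diagonal, so both noise draws and the linear solve are elementwise. It targets \scrN(\bfitmu, \widetilde{\bfQ}^{-1}) with \widetilde{\bfQ} given by (5.10), not the exact target; the function name is ours.

```python
import numpy as np

def hogwild_type_approximate_sampler(Q, mu, n_iter=5000, rng=None):
    # Approximate MS/DA recursion with R = -L - L^T, so that M = R + Q = D (diagonal).
    # Stationary law: N(mu, Qtilde^{-1}) with Qtilde = Q (I + D^{-1} R); see (5.10).
    rng = np.random.default_rng() if rng is None else rng
    d = Q.shape[0]
    D = np.diag(Q).copy()            # diagonal part of Q, stored as a vector
    R = np.diag(D) - Q               # R = -L - L^T
    theta = np.zeros(d)
    samples = np.empty((n_iter, d))
    for t in range(n_iter):
        z1 = rng.normal(scale=np.sqrt(D / 2.0))     # z1 ~ N(0_d, (1/2)(R + Q)) = N(0_d, D/2)
        z2 = rng.normal(scale=np.sqrt(0.5 / D))     # z2 ~ N(0_d, (1/2)(R + Q)^{-1})
        theta = (Q @ mu + R @ theta + z1) / D + z2  # diagonal solve only, no full linear system
        samples[t] = theta
    return samples
```

As for the Hogwild sampler itself, this recursion is stable only when the spectral radius of \bfI_d - \bfD^{-1}\bfQ is below one (e.g., when Q is strictly diagonally dominant).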
5.3. Gibbs Samplers as Stochastic Versions of the PPA. This section aims to
draw new connections between optimization and the sampling approaches discussed
in this paper. In particular, we will focus on the proximal point algorithm (PPA)
[90]. After briefly presenting this optimization method, we will show that the Gibbs samplers discussed above can be interpreted as stochastic counterparts of the PPA. In what follows, for a positive definite matrix \bfR we use the weighted norm defined by

(5.15)   \|\bfittheta\|_{\bfR}^2 \triangleq \bfittheta^\top \bfR \bfittheta.
The Proximal Point Algorithm. The PPA [90] is an important and widely used
method to find zeros of a maximal monotone operator \sansK , that is, to solve problems
of the following form:
(5.16) Find \bfittheta \star \in \scrH such that 0d \in \sansK (\bfittheta \star ) ,
where \scrH is a real Hilbert space. For simplicity, we will take here \scrH = \BbbR d equipped
with the usual Euclidean norm and focus on the particular case \sansK = \partial f , where f is
a lower semicontinuous (l.s.c.), proper, coercive, and convex function and \partial denotes
the subdifferential operator; see Appendix B. In this case, the PPA is equivalent
to the proximal minimization algorithm [68, 69] which aims at solving the following
minimization problem:
(5.17)   Find \bfittheta^\star \in \BbbR^d such that \bfittheta^\star = \arg\min_{\bfittheta \in \BbbR^d} f(\bfittheta).
This algorithm is called the proximal point algorithm in reference to the work by
Moreau [76]. For readability reasons, we refer the reader to Appendix B for details
about this algorithm for a general operator \sansK and to the comprehensive overview
in [90] for more information.
The PPA is detailed in Algorithm 5.2. Note that instead of directly minimizing
the objective function f , the PPA adds a quadratic penalty term depending on the
previous iterate \bfittheta (t - 1) and then solves an approximation of the initial optimization
problem at each iteration. This idea of successive approximations is exactly the
deterministic counterpart of (5.9), which proposes a new sample based on successive
approximations of the target density \pi via a Gaussian kernel with precision matrix
R. In fact, searching for the maximum a posteriori estimator under the proposal
distribution P (\cdot | \bfittheta (t - 1) ) with density p(\cdot | \bfittheta (t - 1) ) in (5.9) boils down to solving
(5.18)   \arg\min_{\bfittheta \in \BbbR^d} \underbrace{-\log \pi(\bfittheta)}_{f(\bfittheta)} + \frac{1}{2}\bigl\|\bfittheta - \bfittheta^{(t-1)}\bigr\|_{\bfR}^2,
which coincides with step 4 in Algorithm 5.2 by taking f = - log \pi . This emphasizes
the tight connection between optimization and simulation that we have highlighted
in previous sections.
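To illustrate this connection concretely, the following sketch (our own, with arbitrarily chosen Q, R, and u) verifies numerically that the proximal step (5.18) coincides with the mean of the Gaussian proposal (5.9), namely (\bfQ + \bfR)^{-1}(\bfQ\bfitmu + \bfR\bfittheta^{(t-1)}); the stochastic step simply adds noise with covariance (\bfQ + \bfR)^{-1}.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d = 5
B = rng.standard_normal((d, d))
Q = B @ B.T + d * np.eye(d)      # SPD precision of pi = N(mu, Q^{-1})
mu = rng.standard_normal(d)
R = 2.0 * np.eye(d)              # positive definite preconditioner in (5.9) and (5.18)
u = rng.standard_normal(d)       # current iterate theta^(t-1)

# Deterministic PPA step (5.18): argmin_theta -log pi(theta) + (1/2) ||theta - u||_R^2.
f = lambda th: 0.5 * (th - mu) @ Q @ (th - mu) + 0.5 * (th - u) @ R @ (th - u)
ppa_step = minimize(f, np.zeros(d)).x

# Mean of the Gaussian proposal (5.9); both equal (Q + R)^{-1} (Q mu + R u).
prox_mean = np.linalg.solve(Q + R, Q @ mu + R @ u)
assert np.allclose(ppa_step, prox_mean, atol=1e-4)

# The stochastic counterpart draws around this proximal point with covariance (Q + R)^{-1}.
C = np.linalg.cholesky(Q + R)
theta_tilde = prox_mean + np.linalg.solve(C.T, rng.standard_normal(d))
```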
The PPA, the ADMM, and the Approximate Richardson Gibbs Sampler. An
important motivation of the PPA is related to the preconditioning idea used in the uni-
fying model proposed in (5.1). Indeed, the PPA has been extensively used within the ADMM framework; for the minimization of a composite objective of the form g_1(\bfz) + g_2(\bfittheta) subject to the splitting constraint \bfA\bfittheta = \bfz, the ADMM iterations read
(5.19)   \bfz^{(t)} = \arg\min_{\bfz \in \BbbR^k} g_1(\bfz) + \frac{1}{2\rho}\bigl\|\bfz - \bfA\bfittheta^{(t-1)} - \bfu^{(t-1)}\bigr\|^2,

(5.20)   \bfittheta^{(t)} = \arg\min_{\bfittheta \in \BbbR^d} g_2(\bfittheta) + \frac{1}{2\rho}\bigl\|\bfA\bfittheta - \bfz^{(t)} + \bfu^{(t-1)}\bigr\|^2,

(5.21)   \bfu^{(t)} = \bfu^{(t-1)} + \bfA\bfittheta^{(t)} - \bfz^{(t)},
where z \in \BbbR k is a splitting variable, u \in \BbbR k is a scaled dual variable, and \rho is a positive
penalty parameter. Without loss of generality,4 let us assume that g2 is a quadratic
function, that is, for any \bfittheta \in \BbbR^d, g_2(\bfittheta) = (\bfittheta - \bar{\bfittheta})^\top(\bfittheta - \bar{\bfittheta})/2. Even in this simple case, one can notice that the \bfittheta-update (5.20) involves a matrix \bfA operating directly on \bfittheta, which implies an expensive inversion of a high-dimensional matrix associated to \bfA.
To deal with such an issue, Algorithm 5.2 is considered to solve approximately the
minimization problem in (5.20). The PPA applied to the minimization problem (5.20)
reads
1 \bigm\| \bigm\| 2 1 \bigm\| \bigm\| 2
\= )+ 1 \bigm\|
\= )\top (\bfittheta - \bfittheta
(5.22) \bfittheta (t) = arg min (\bfittheta - \bfittheta
\bigm\| \bigm\| \bigm\|
\bigm\| A\bfittheta - z(t) + u(t - 1) \bigm\| + \bigm\| \bfittheta - \bfittheta (t - 1) \bigm\| .
\bfittheta \in \BbbR d 2 2\rho 2 \bfR
In order to draw some connections with (5.9), we set \bfQ = \rho^{-1}\bfA^\top\bfA + \bfI_d and \bfitmu = \bfQ^{-1}\bigl[\bfA^\top(\bfz^{(t)} - \bfu^{(t-1)})/\rho + \bar{\bfittheta}\bigr], and rewrite (5.22) as
(5.23)   \bfittheta^{(t)} = \arg\min_{\bfittheta \in \BbbR^d} \frac{1}{2}(\bfittheta - \bfitmu)^\top \bfQ (\bfittheta - \bfitmu) + \frac{1}{2}\bigl\|\bfittheta - \bfittheta^{(t-1)}\bigr\|_{\bfR}^2.
Note that (1/2)(\bfittheta - \bfitmu )\top Q(\bfittheta - \bfitmu ) is the potential function associated to \pi in (2.1)
and, as such, (5.23) can be seen as the deterministic counterpart of (5.9). By defining
\bfR = \omega^{-1}\bfI_d - \bfQ, where 0 < \omega \leq \rho\|\bfA\|^{-2} ensures that \bfR is positive semidefinite, the \bfittheta-update in (5.22) becomes (see Appendix B)
(5.24)   \bfittheta^{(t)} = \arg\min_{\bfittheta \in \BbbR^d} \frac{1}{2\omega}\Bigl\|\bfittheta - \omega\bigl(\bfR\bfittheta^{(t-1)} + \bfQ\bfitmu\bigr)\Bigr\|^2.
Note that (5.24) boils down to solving \omega - 1 \bfittheta = R\bfittheta (t - 1) + Q\bfitmu , which is exactly the de-
terministic counterpart of the approximate Richardson Gibbs sampler in Table 6. This
highlights further the tight links between the proposed unifying framework and the
use of the PPA in the optimization literature. It also paves the way to novel sampling
methods inspired by optimization approaches which are not necessarily dedicated to
Gaussian sampling; this goes beyond the scope of the present paper.
4 If g_2 admits a nonquadratic form, an additional splitting variable can be introduced and the following comments still hold.
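As a small self-contained check of this correspondence (our own sketch, with an arbitrary SPD matrix Q), the deterministic recursion \omega^{-1}\bfittheta^{(t)} = \bfR\bfittheta^{(t-1)} + \bfQ\bfitmu with \bfR = \omega^{-1}\bfI_d - \bfQ is exactly the classical Richardson iteration for the linear system \bfQ\bfittheta = \bfQ\bfitmu, whose fixed point is the mean \bfitmu:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
B = rng.standard_normal((d, d))
Q = B @ B.T / d + np.eye(d)                  # arbitrary SPD precision matrix
mu = rng.standard_normal(d)

omega = 1.0 / np.linalg.eigvalsh(Q).max()    # any 0 < omega < 2 / lambda_max(Q) converges
R = np.eye(d) / omega - Q
theta = np.zeros(d)
for _ in range(500):
    # omega^{-1} theta^(t) = R theta^(t-1) + Q mu, i.e., theta^(t) = theta^(t-1) - omega (Q theta^(t-1) - Q mu)
    theta = omega * (R @ theta + Q @ mu)

assert np.allclose(theta, mu, atol=1e-6)     # the fixed point is the target mean mu
```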
the benefits and bottlenecks associated with each method, illustrate some of them
on numerical applications, and propose an up-to-date selection of the most efficient
algorithms for archetypal Gaussian sampling tasks. General guidelines resulting from
this review and numerical experiments are gathered in Figure 12.
6.1. Summary, Comparison, and Discussion of Existing Approaches. Table 7
summarizes the main features of the sampling techniques reviewed above. In particu-
lar, for each approach, this table recalls its exactness (or not), its computational and
storage costs, the most expensive sampling step to compute, the possible linear sys-
tem to solve, and the presence of tuning parameters. It aims to synthesize the main
pros and cons of each class of samplers. Regarding MCMC approaches, the computa-
tional cost associated to the required sampling step is not taken into account in the
column ``Comp. cost"" since it depends upon the structure of Q. Instead, the column
``Sampling"" indicates the type of sampling step required by the sampling approach.
Rather than conducting a one-to-one comparison between samplers, we choose to
focus on selected important questions raised by the taxonomy reported in this table.
Concerning more technical or in-depth comparisons between specific approaches, we
refer the interested reader to the appropriate references; see, for instance, those high-
lighted in sections 3 and 4. These questions of interest lead to scenarios that motivate
the dedicated numerical experiments conducted in subsection 6.2, then subsection 6.3
gathers guidelines to choosing an appropriate sampler for a given sampling task. The
questions we focus on are as follows.
Question 1. In which scenarios do iterative approaches become interesting com-
pared to factorization approaches?
In section 3, one sees that square root, CG, and PO approaches bypass the compu-
tational and storage costs of factorization thanks to an iterative process of K cheaper
steps, with K \in \BbbN \ast . A natural question is, in which scenarios does the total cost
of K iterations remain efficient when compared to factorization methods? Table 7
tells us that high-dimensional scenarios (d \gg 1) favor iterative methods as soon as
memory requirements of order \Theta (d2 ) become prohibitive. If this storage is not an
issue, iterative samplers become interesting only if the number of iterations K is such
that K \ll (d + T - 1)/T . This inequality is verified only in cases where a small
number of samples T is required (T \ll d), which in turn imposes K \ll d. Note that
this condition remains similar when a Gaussian sampling step is embedded within a
Gibbs sampler with a varying covariance or precision matrix (see Example 2.1): the
condition on K is K \ll d, whatever the number T of samples, since it is no longer
possible to precompute the factorization of Q.
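As a rough numerical illustration of this trade-off (our own back-of-the-envelope check based only on the leading-order costs reported in Table 7, with constants ignored):

```python
# Leading-order flop counts from Table 7: Cholesky ~ d^2 (d + T) versus an
# iterative sampler ~ d^2 K T, so the iterative route is cheaper iff K T < d + T.
def iterative_cheaper(d, T, K):
    return K * T < d + T

print(iterative_cheaper(d=10_000, T=10, K=50))      # True: few samples, K << d
print(iterative_cheaper(d=10_000, T=10_000, K=50))  # False: many samples amortize the factorization
```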
Question 2. In which scenarios should we prefer an iterative sampler from section 3
or an MCMC sampler from section 4?
Table 7 shows that the iterative samplers reviewed in section 3 have to perform
K iterations to generate one sample. In contrast, most MCMC samplers generate
one sample per iteration. However, these samples are distributed according to the
target distribution (or an approximation of it) only in an asymptotic regime, i.e.,
when T \rightarrow \infty (in practice, after a burn-in period). If one considers a burn-in period
of length Tbi whose samples are discarded, MCMC samplers are interesting w.r.t.
Table 7 Taxonomy of existing methods to sample from an arbitrary d-dimensional Gaussian distribution \Pi \triangleq \scrN (\bfitmu , \bfQ - 1 ); see (2.1). K \in \BbbN \ast and Tbi \in \BbbN \ast are the
number of iterations of a given iterative sampler (e.g., the CG one) and the number of burn-in iterations for MCMC algorithms, respectively. ``Target \Pi'' refers to approaches that target the right distribution \Pi under infinite precision arithmetic. Note that some i.i.d. samplers using a truncation
parameter K might target \Pi for specific choices of K (e.g., the classical CG sampler is exact for K = d). ``Finite time sampling"" refers here to
approaches which need a truncation procedure to deliver a sample in finite time. The matrix \bfA is a symmetric and positive-definite matrix associated
to \bfQ ; see section 4. The sampling methods highlighted in bold stand for novel approaches derived from the proposed unifying framework. AMS stands
for ``approximate matrix splitting"" (see Table 6), EDA for ``exact data augmentation"" (see Table 4), and ADA for ``approximate data augmentation""
(see Table 6).
Method | Instance | Target \Pi | Finite time sampling | Comp. cost | Storage cost | Sampling | Linear system | No tuning
Factorization | Cholesky [91, 94] | ✓ | ✓ | \scrO(d^2(d + T)) | \Theta(d^2) | \scrN(0, 1) | triangular | ✓
Factorization | Square root [24] | ✓ | ✓ | \scrO(d^2(d + T)) | \Theta(d^2) | \scrN(0, 1) | full | ✓
\bfQ^{1/2} approx. | Chebyshev [24, 82] | ✗ | ✓ | \scrO(d^2 K T) | \Theta(d) | \scrN(0, 1) | ✗ | ✓
\bfQ^{1/2} approx. | Lanczos [7, 98] | ✗ | ✓ | \scrO(K(K + d^2)T) | \Theta(K(K + d)) | \scrN(0, 1) | ✗ | ✓
CG | Classical [81] | ✗ | ✓ | \scrO(d^2 K T) | \Theta(d) | \scrN(0, 1) | ✗ | ✓
CG | Gradient scan [30] | ✓ | ✗ | \scrO(d^2 K(T + T_{bi})) | \Theta(Kd) | \scrN(0, 1) | ✗ | ✓
PO | Truncated [77, 79] | ✗ | ✓ | \scrO(d^2 K T) | \Theta(d) | \scrN(0, 1) | full | ✓
PO | Reversible jump [37] | ✓ | ✗ | \scrO(d^2 K(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | full | ✓
MS | Richardson [32] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(\bfzero_d, \bfA) | diagonal | ✗
MS | Jacobi [32] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(\bfzero_d, \bfA) | diagonal | ✓
MS | Gauss--Seidel [32] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | triangular | ✓
MS | SOR [32] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | triangular | ✗
MS | SSOR [32] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | triangular | ✗
MS | Cheby-SSOR [31, 32] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | triangular | ✗
MS | Hogwild [54] | ✗ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | diagonal | ✓
MS | Clone MCMC [9] | ✗ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | diagonal | ✗
MS | Unifying AMS | ✗ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | diagonal | ✓
DA | EDA [66] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(\bfzero_d, \bfA) | ✗ | ✓
DA | GEDA [66] | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(0, 1) | ✗ | ✓
DA | SGS [104] | ✗ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(\bfzero_d, \bfA) | ✗ | ✗
DA | Unifying EDA | ✓ | ✗ | \scrO(d^2(T + T_{bi})) | \Theta(d) | \scrN(\bfzero_d, \bfA) | ✗ | ✓
iterative ones only if T + Tbi \ll KT . Since most often K \ll Tbi , this condition
favors MCMC methods when a large number T \gtrsim Tbi of samples is desired. When a
small number T \lesssim Tbi /K of samples is desired, iterative methods are preferred. In
intermediate situations, the choice depends on the precise number of required samples
T , mixing properties of the MCMC sampler, and the number of iterations K of the
alternative iterative algorithm.
Question 3. When is it efficient to use a decomposition Q = Q1 + Q2 of the
precision matrix in comparison with other approaches?
Sections 3 and 4 have shown that some sampling methods, such as Algorithms 3.4
and 4.5, exploit a decomposition of the form Q = Q1 + Q2 to simplify the sampling
task. Regarding the PO approaches, the main benefit lies in the cheap computation
of the vector z\prime \sim \scrN (0d , Q) before solving the linear system Q\bfittheta = z\prime ; see [79] for
more details. On the other hand, MCMC samplers based on exact DA yield simpler
sampling steps a priori and do not require us to solve any high-dimensional linear
system. However, the Achilles' heel of MCMC methods is that they only produce
samples of interest in the asymptotic regime where the number of iterations tends
toward infinity. For a fixed number of MCMC iterations, dependent samples are
obtained and their quality depends highly upon the mixing properties of the MCMC
sampler. Numerical experiments in subsection 6.2 include further discussion on this
point.
6.2. Numerical Illustrations. This section illustrates the main differences be-
tween the methods reviewed in sections 3 and 4. The main purpose is not to give an
exhaustive one-to-one comparison among all approaches listed in Table 7. Instead,
these methods are compared in light of three experimental scenarios that address the
questions raised in subsection 6.1. More specific numerical simulations can be found
in the cited works and references therein. Since the main challenges of Gaussian sam-
pling are related to the properties of the precision matrix Q, or the covariance \Sigma (see
subsection 2.2), the mean vector \bfitmu is set to 0d and only centered distributions are
considered. For the first two scenarios associated with Questions 1 and 2, the unbiased
estimate of the empirical covariance matrix will be used to assess the convergence in
distribution of the samples generated by each algorithm:
(6.1)   \widehat{\Sigma}_T = \frac{1}{T-1}\sum_{t=1}^{T}\bigl(\bfittheta^{(t)} - \bar{\bfittheta}\bigr)\bigl(\bfittheta^{(t)} - \bar{\bfittheta}\bigr)^\top,

where \bar{\bfittheta} = T^{-1}\sum_{t=1}^{T} \bfittheta^{(t)} is the empirical mean. Note that other metrics (such as the
empirical precision matrix) could have been used to assess whether these samples are
distributed according to a sufficiently close approximation of the target distribution
\scrN (\bfitmu , Q - 1 ). Among available metrics, we chose the one that has been the most used
in the literature, that is, (6.1) [9, 32, 37].
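For reference, (6.1) amounts to the following few lines (a sketch in the spirit of the companion notebooks, not taken from them):

```python
import numpy as np

def empirical_covariance(samples):
    # Unbiased estimate (6.1) for T samples stored row-wise in a (T, d) array;
    # equivalent to np.cov(samples, rowvar=False).
    centred = samples - samples.mean(axis=0)
    return centred.T @ centred / (samples.shape[0] - 1)
```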
For scenario 3, associated with the final question above, the considered high-dimensional setting will preclude the computation of exact and empirical
covariance matrices. Instead, MCMC samplers will be compared in terms of the
computational efficiency and quality of the generated chains (see subsection 6.2.3 for
details).
The experimental setting is as follows. To ensure fair comparisons, all algo-
rithms are implemented on equal grounds, with the same quality of optimization.
The programming language used is Python and all loops have been carefully removed or vectorized whenever possible. The considered scenarios are examples of Gaussian sampling problems that often appear in applications and that have
been previously considered in the literature, so that they provide good tutorials to
answer the question raised by each scenario. The code to reproduce all the fig-
ures of this section is available in a Jupyter notebook format available online at
https://github.com/mvono/PyGauss/tree/master/notebooks.
6.2.1. Scenario 1. This first set of experiments addresses Question 1 concerning
iterative versus factorization approaches. We consider a sampling problem also tackled
in [81] to demonstrate the performance of Algorithm 3.5 based on the conjugate
gradient. For the sake of clarity, we divide this scenario into two subexperiments.
Comparison between Factorization and Iterative Approaches. In this first
subexperiment, we compare so-called factorization approaches with iterative ap-
proaches on two Gaussian sampling problems. We consider first the problem of
sampling from \scrN (0d , Q - 1 ), where the covariance matrix \Sigma = Q - 1 is chosen as a
squared exponential matrix that is commonly used in applications involving Gaussian
processes [48, 63, 85, 96, 103, 108]. Its coefficients are defined by
(6.2)   \Sigma_{ij} = 2\exp\Bigl(-\frac{(s_i - s_j)^2}{2a^2}\Bigr) + \epsilon\,\delta_{ij} \quad \forall i, j \in [d],
where \{ si \} i\in [d] are evenly spaced scalars on [ - 3, 3], \epsilon > 0, and the Kronecker symbol
\delta ij = 1 if i = j and zero otherwise. In (6.2), the parameters a and \epsilon have been set to
a = 1.5 and \epsilon = 10^{-6}, which yields a distribution of the eigenvalues of \Sigma such that the large ones are well separated while the small ones are clustered together near 10^{-6}; see
Figure 5 (first row). We compare the Cholesky sampler (Algorithm 3.1), the approxi-
mate inverse square root sampler using Chebyshev polynomials (Algorithm 3.2), and
the CG-based sampler (Algorithm 3.5). The sampler using Chebyshev polynomials
needs Kcheby = 23 iterations on average, while the CG iterations are stopped once
loss of conjugacy occurs, following the guidelines of [81], that is, at Kkryl = 8. In all
experiments, T = 10^5 samples are simulated in dimensions ranging from 1 to several
thousands.
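For completeness, the covariance (6.2) with the stated settings can be assembled in a few lines (a minimal sketch of the setup, not the authors' code); the Cholesky draw at the end is the reference sampler of this comparison:

```python
import numpy as np

# Squared exponential covariance (6.2) with the settings of Scenario 1:
# s_i evenly spaced on [-3, 3], a = 1.5, epsilon = 1e-6 added on the diagonal.
d = 100
s = np.linspace(-3.0, 3.0, d)
a, eps = 1.5, 1e-6
Sigma = 2.0 * np.exp(-((s[:, None] - s[None, :]) ** 2) / (2.0 * a ** 2)) + eps * np.eye(d)

# Reference Cholesky draw from N(0_d, Sigma): theta = C z with Sigma = C C^T.
C = np.linalg.cholesky(Sigma)
theta = C @ np.random.default_rng(0).standard_normal(d)
```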
Figure 5 shows the results associated with these three direct samplers in dimen-
sion d = 100. The generated samples indeed follow a target Gaussian distribution
admitting a covariance matrix close to \Sigma . This is attested to by the small residuals
observed between the estimated and the true covariances. Based on this criterion, all
approaches successfully sample from \scrN (0d , \Sigma ). This is emphasized by the spectrum
of the estimated matrices \widehat{\Sigma}_T, which coincides with the spectrum of \Sigma for large eigen-
values. This observation ensures that most of the covariance information has been
captured. However, note that only the Cholesky method is able to recover accurately
all the eigenvalues, including the smallest ones.
Figure 6 compares the direct samplers above in terms of central processing unit
(CPU) time. To generate only one sample (T = 1), as expected, one can observe that
the Cholesky sampler is faster than the two iterative samplers in small dimension d
and becomes computationally demanding as d grows beyond a few hundred. Indeed,
for small d the cost implied by several iterations (Kcheby or Kkryl ) within each iterative
sampler dominates the cost of the factorization in Algorithm 3.1, while the contrary
holds for large values of d. Since the Cholesky factorization is performed only once,
[Figure 5 panels: \Sigma; eigenvalues of the estimates of \Sigma (CG, Cholesky, Chebyshev, ground truth); residuals \Sigma - \widehat{\Sigma}_{Cholesky}, \Sigma - \widehat{\Sigma}_{CG}, \Sigma - \widehat{\Sigma}_{Chebyshev}.]
Fig. 5 Scenario 1. Results of the three considered samplers for the sampling from \scrN (\bfzero d , \bfSigma ) in
dimension d = 100 with \bfSigma detailed in (6.2).
the Cholesky sampler becomes more attractive than the other two approaches as the
sample size T increases. However, as already pointed out in subsection 2.3, it is
worth noting that precomputing the Cholesky factor is no longer possible once the
Gaussian sampling task involves a matrix \Sigma which changes at each iteration of a
Gibbs sampler, e.g., when considering a hierarchical Bayesian model with unknown
hyperparameters (see Example 2.1). We also point out that a comparison among
direct samplers reviewed in section 3 and their related versions was conducted in [7]
in terms of CPU and GPU times. In agreement with the findings reported here,
this comparison essentially showed that the choice of the sampler in small dimensions
is not particularly important, while iterative direct samplers become interesting in
high-dimensional scenarios where Cholesky factorization becomes impossible.
We complement our analysis by focusing on another sampling problem which
now considers the matrix defined in (6.2) as a precision matrix instead of a covariance
matrix: we now want to generate samples from \scrN(\bfzero_d, \widetilde{\Sigma}) with \widetilde{\Sigma} = \Sigma^{-1}. This sampling problem is expected to be more difficult since the largest eigenvalues of \widetilde{\Sigma}
are now clustered; see Figure 7 (first row). Figure 7 (second row) shows that the
CG and Chebyshev samplers fail to capture covariance information as accurately as
the Cholesky sampler. The residuals between the estimated and the true covariances
remain important on the diagonal: variances are inaccurately underestimated. This
observation is in line with [81], which showed that the CG sampler is not suitable
for the sampling from a Gaussian distribution whose covariance matrix has many
clustered large eigenvalues, probably as a consequence of the bad conditioning of the
matrix. As far as the Chebyshev sampler is concerned, this failure was expected since
the interval [\lambda_l, \lambda_u] on which the function x \mapsto x^{-1/2} has to be well approximated becomes very large, with an extent of about 10^6. Of course, the relative error between
[Figure 6 panels: CPU time [s] versus dimension d for the CG, Cholesky, and Chebyshev samplers; the bottom panels correspond to sample sizes T = 1000 and T = 10000.]
Fig. 6 Scenario 1. Comparison among the three considered direct samplers in terms of CPU time
for the sampling from \scrN (\bfzero d , \bfSigma ) with \bfSigma detailed in (6.2).
\widetilde{\Sigma} and its estimate can be decreased by sufficiently increasing the number of iterations K_{cheby}, but this is possible only at a prohibitive computational cost.
On the choice of the metric to monitor convergence. We saw in Figures 5 and 7
that the covariance estimation error was small if the large values of the covariance
matrix were captured and large in the opposite scenario. Note that if the precision
estimation error was chosen as a metric, we would have observed similar numerical
results: if the largest eigenvalues of the precision matrix were not captured, the pre-
cision estimation error would have been large. Regarding the CG sampler, Fox and
Parker in [31], for instance, highlighted this fact and illustrated it numerically (see
equations (3.1) and (3.2) in that paper).
[Figure 7 panels: \widetilde{\Sigma}; eigenvalues of the estimates of \widetilde{\Sigma} (CG, Cholesky, Chebyshev); residuals, with relative errors \|\widetilde{\Sigma} - \widetilde{\Sigma}_{Cholesky}\|_2/\|\widetilde{\Sigma}\|_2 = 0.0592, \|\widetilde{\Sigma} - \widetilde{\Sigma}_{CG}\|_2/\|\widetilde{\Sigma}\|_2 = 0.9995, \|\widetilde{\Sigma} - \widetilde{\Sigma}_{Chebyshev}\|_2/\|\widetilde{\Sigma}\|_2 = 0.9998.]
Fig. 7 Scenario 1. Results of the three considered samplers for the sampling from \scrN(\bfzero_d, \widetilde{\bfSigma}) in dimension d = 100.
Fig. 8 Scenario 1. Results of the three considered direct samplers for the sampling from \scrN (\bfzero d , \bfSigma )
with \bfSigma diagonal in dimension d = 15.
6.2.2. Scenario 2. Now we turn to Question 2 and compare iterative and MCMC
samplers. In this scenario, we consider a precision matrix Q which is commonly used
to build Gaussian Markov random fields (GMRFs) [92]. Before defining this matrix,
we introduce some notation. Let \scrG = (\scrV , \scrE ) be an undirected two-dimensional graph
(see Figure 9), where \scrV is the set of d nodes in the graph and \scrE represents the edges
between the nodes. We say that nodes i and j are neighbors and write i \sim j if there
is an edge connecting these two nodes. The number of neighbors of node i is denoted
ni and is also called the degree. Using this notation, we set Q to be a second-order
locally linear precision matrix [48, 92] associated to the two-dimensional lattice shown
in Figure 9,
(6.3)   Q_{ij} = \epsilon\,\delta_{ij} + \begin{cases} \phi\, n_i & \text{if } i = j, \\ -\phi & \text{if } i \sim j, \\ 0 & \text{otherwise}, \end{cases} \quad \forall i, j \in [d],
where we set \epsilon = 1 (in fact, \epsilon > 0 suffices) and \phi > 0 to ensure that Q is a nonsingular
matrix yielding a nonintrinsic Gaussian density w.r.t. the d-dimensional Lebesgue
measure; see subsection 2.2. Note that this precision matrix is band-limited with bandwidth of order \scrO(\sqrt{d}) [92], which makes it possible to embed Algorithm 2.3 within the samplers considered in this scenario. Related instances of this precision
matrix have also been considered in [32, 52, 81] in order to show the benefits of
both direct and MCMC samplers. In what follows, we consider the sampling from
\scrN (0d , Q - 1 ) for three different scalar parameters \phi \in \{ 0.1, 1, 10\} leading to three
covariance matrices Q - 1 with different correlation structures; see Figure 9. This will
be of interest since it is known that the efficiency of Gibbs samplers is highly dependent
on the correlation between the components of the Gaussian vector \bfittheta \in \BbbR d [87, 92].
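The precision matrix (6.3) on a \sqrt{d} \times \sqrt{d} lattice with four nearest neighbours can be assembled as follows (our own sketch; the loop-based construction favors readability over speed). The final assertion checks the strict diagonal dominance invoked later in this scenario for the convergence of the Jacobi-based sampler.

```python
import numpy as np

def lattice_precision(n_side, phi=1.0, eps=1.0):
    # Assemble the precision matrix (6.3) on an n_side x n_side 2D lattice
    # with 4-nearest-neighbour edges (d = n_side**2 nodes).
    d = n_side * n_side
    Q = np.zeros((d, d))
    for i in range(n_side):
        for j in range(n_side):
            k = i * n_side + j
            nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            nbrs = [(p, q) for p, q in nbrs if 0 <= p < n_side and 0 <= q < n_side]
            Q[k, k] = eps + phi * len(nbrs)          # eps * delta_ij + phi * n_i on the diagonal
            for p, q in nbrs:
                Q[k, p * n_side + q] = -phi          # -phi for each neighbouring pair i ~ j
    return Q

Q = lattice_precision(10, phi=1.0)  # d = 100, as in Scenario 2
# Strict diagonal dominance |Q_ii| > sum_{j != i} |Q_ij| holds since eps > 0,
# which guarantees the convergence of the Jacobi-splitting Gibbs sampler.
off_diag = np.abs(Q).sum(axis=1) - np.abs(np.diag(Q))
assert np.all(np.abs(np.diag(Q)) > off_diag)
```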
For this scenario, we set d = 100 in order to provide complete diagnostics for
evaluating the accuracy of the samples generated by each algorithm. We implement
the four MCMC samplers based on exact MS (see Table 2) without considering a
burn-in period (i.e., Tbi = 0). These MCMC algorithms are compared with the direct
samplers based on Cholesky factorization and Chebyshev polynomials; see section 3. Since the matrix Q is strictly diagonally dominant, that is, |Q_{ii}| > \sum_{j \neq i} |Q_{ij}| for all
i \in [d], the convergence of the MCMC sampler based on Jacobi splitting is ensured
[32, 41]. Based on this convergence property, we can use an optimal value for the
parameter \omega appearing in the MCMC sampler based on successive overrelaxation
(SOR) splitting; see Appendix C.
Figure 10 shows the relative error between the estimated covariance matrix and
the true one w.r.t. the number of samples generated with i.i.d. samplers from section 3
and MCMC samplers from section 4. Regarding MCMC samplers, no burn-in has been
considered here in order to emphasize that these algorithms do not yield i.i.d. samples
from the first iteration compared to the samplers reviewed in section 3. This behavior
is particularly noticeable when \phi = 10, where one can observe that Gauss--Seidel,
Jacobi, and Richardson samplers need far more samples than Chebyshev and Cholesky
samplers to reach the same precision in terms of covariance estimation. Interestingly,
this claim does not hold for ``accelerated"" Gibbs samplers such as the SOR (accelerated
version of the Gauss--Seidel sampler) and the Chebyshev accelerated SSOR samplers.
Indeed, for \phi = 1, one can note that the latter sampler is as efficient as i.i.d. samplers.
On the other hand, when \phi = 10, these two accelerated Gibbs samplers manage to
achieve lower relative covariance estimation error than the Chebyshev sampler when
the number of iterations increases. This behavior is due to the fact that the Chebyshev
Fig. 9 Scenario 2. Illustrations of the considered Gaussian sampling problem: (top left) 2D lattice
(d = 36) associated to the precision matrix \bfQ in (6.3); (top right) \bfQ depicted for \phi = 1;
(bottom) \bfQ - 1 = \bfSigma for \phi \in \{ 0.1, 1, 10\} . All the results are shown for d = 100. On the 2D
lattice, the green nodes are the neighbors of the blue node while the coordinates of the \surd lattice
correspond to the coordinates of \bfittheta \in \BbbR d with nonzero correlations, that is, 1 \leq i, j \leq d.
Fig. 10 Scenario 2. Relative error associated to the estimation of the covariance matrix \bfQ - 1 de-
fined by \| \bfQ - 1 - var(\bfittheta (1:t) )\| 2 /\| \bfQ - 1 \| 2 w.r.t. the number of iterations t in Algorithm 4.2,
with d = 100 (left: \phi = 1, right: \phi = 10). We also highlight the relative error obtained
with an increasing number of samples generated independently from Cholesky and Cheby-
shev samplers. The results have been averaged over 30 independent runs. The standard
deviations are not shown for readability reasons.
of these samplers is slower than that of the Gauss--Seidel sampler, their CPU times
become roughly equivalent. Note that this computational gain is problem-dependent
and cannot be ensured in general. Cholesky factorization appears to be much faster
in all cases when the same constant covariance is used for many samples. The next
scenario will consider high-dimensional scenarios where Cholesky factorization is no
longer possible.
6.2.3. Scenario 3. Finally, we deal with Question 3 to assess the benefits of
samplers that take advantage of the decomposition structure Q = Q1 + Q2 of the
precision matrix. As motivated in subsection 6.1, we will focus here on the exact DA
approaches detailed in subsection 4.2 and compare the latter to iterative samplers
which produce uncorrelated samples, such as those reviewed in section 3.
To this end, we consider Gaussian sampling problems in high dimensions d \in
[10^4, 10^6] for which Cholesky factorization is both computationally and memory pro-
hibitive when a standard computer is used. This sampling problem commonly appears
in image processing [38, 67, 78, 104] and arises from the linear inverse problem, usually
called deconvolution or deblurring in image processing:

(6.4)   \bfy = \bfS\bfittheta + \bfitvarepsilon,

where \bfy \in \BbbR^d refers to a blurred and noisy observation, \bfittheta \in \BbbR^d is the unknown original
image rearranged in lexicographic order, and \bfitvarepsilon \sim \scrN (0d , \Gamma ) with \Gamma = diag(\gamma 1 , . . . , \gamma d )
is a synthetic structured noise such that \gamma i \sim 0.7\delta \kappa 1 + 0.3\delta \kappa 2 for all i \in [d]. In
what follows, we set \kappa 1 = 13 and \kappa 2 = 40. The matrix S \in \BbbR d\times d is a circulant
convolution matrix associated to the space-invariant box blurring kernel \tfrac{1}{9}\mathbf{1}_{3\times 3}, where \mathbf{1}_{p\times p} is the p \times p matrix filled with ones. We adopt a smoothing conjugate prior distribution on \bfittheta [61, 74, 75], introduced in subsection 2.2 and Figure 2, written as \scrN\bigl(\bfzero_d, (\tfrac{\xi_0}{d}\mathbf{1}_{d\times d} + \xi_1\Delta^\top\Delta)^{-1}\bigr), where \Delta is the discrete two-dimensional Laplacian
operator; \xi 0 = 1 ensures that this prior is nonintrinsic while \xi 1 = 1 controls the
smoothing. Bayes' rule then yields the Gaussian posterior distribution
(6.5)   \bfittheta \mid \bfy \sim \scrN\bigl(\bfitmu, \bfQ^{-1}\bigr),
Table 8 Scenario 2. Comparison between Cholesky-, Chebyshev-, and MS-based Gibbs samplers
for d = 100. The samplers have been run until the relative error between the covariance
matrix \bfQ - 1 and its estimate is lower than 5 \times 10 - 2 . For Richardson, SOR, SSOR, and
Cheby-SSOR samplers, the tuning parameter \omega is the optimal one; see Appendix C. The
Sampler | \phi | \omega | \rho(\bfM^{-1}\bfN) | T | CPU time [s]
Cholesky | 0.1 | -- | -- | 6.3 \times 10^4 | 0.29
Cholesky | 1 | -- | -- | 1.3 \times 10^4 | 0.06
Cholesky | 10 | -- | -- | 2.9 \times 10^3 | 0.01
Chebyshev (K = 21) | 0.1 | -- | -- | 6.4 \times 10^4 | 2.24
Chebyshev (K = 21) | 1 | -- | -- | 1.3 \times 10^4 | 0.44
Chebyshev (K = 21) | 10 | -- | -- | 2.5 \times 10^3 | 0.19
Richardson (MCMC, MS-based) | 0.1 | 0.6328 | 0.3672 | 6.7 \times 10^4 | 5.44
Richardson (MCMC, MS-based) | 1 | 0.1470 | 0.8530 | 3.8 \times 10^4 | 3.03
Richardson (MCMC, MS-based) | 10 | 0.0169 | 0.9831 | 4 \times 10^4 | 3.31
Jacobi (MCMC, MS-based) | 0.1 | -- | 0.4235 | 6.8 \times 10^4 | 5.72
Jacobi (MCMC, MS-based) | 1 | -- | 0.8749 | 3.9 \times 10^4 | 3.24
where

(6.6)   \bfQ = \bfS^\top\Gamma^{-1}\bfS + \frac{\xi_0}{d}\mathbf{1}_{d\times d} + \xi_1\Delta^\top\Delta,

(6.7)   \bfitmu = \bfQ^{-1}\bfS^\top\Gamma^{-1}\bfy.
Sampling from (6.5) is challenging since the size of the precision matrix forbids its
computation. Moreover, the presence of the matrix \Gamma rules out the diagonalization
of Q in the Fourier basis and therefore the direct use of Algorithm 2.4. In addition,
resorting to MCMC samplers based on exact MS to sample from (6.5) raises several
difficulties. First, both Richardson- and Jacobi-based samplers involve a sampling step
with an unstructured covariance matrix; see Table 2. This step can be performed with
one of the direct samplers reviewed in section 3 but this implies an additional com-
putational cost. On the other hand, although Gauss--Seidel and SOR-based MCMC
samplers involve a simple sampling step, they require access to the lower triangular
part of (6.6). In this high-dimensional scenario, the precision matrix cannot be easily
computed on a standard desktop computer and this lower triangular part must be
found with surrogate approaches. One possibility is to compute each nonzero coef-
ficient of this triangular matrix following the matrix-vector products e\top i Qej for all
i, j \in [d] such that j \leq i, where we recall that ei is the ith canonical vector of \BbbR d .
These quantities can be precomputed when Q remains constant along the T iterations
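Although Q itself cannot be stored here, matrix-vector products with Q remain cheap because S and \Delta are circulant convolutions that can be applied by FFT. The sketch below is ours; the kernel FFTs and parameter names are placeholders rather than the paper's code, and it simply implements \bfittheta \mapsto \bfQ\bfittheta for (6.6) without ever forming Q.

```python
import numpy as np

def make_Q_matvec(blur_kernel_fft, lap_kernel_fft, gamma, xi0, xi1, shape):
    # Matrix-free theta -> Q theta for (6.6): Q = S^T Gamma^{-1} S + (xi0/d) 1_{dxd} + xi1 Delta^T Delta,
    # with S and Delta applied as circulant 2D convolutions via the FFT.
    d = int(np.prod(shape))
    gamma_2d = gamma.reshape(shape)
    def Q_matvec(theta):
        x = theta.reshape(shape)
        Sx = np.real(np.fft.ifft2(blur_kernel_fft * np.fft.fft2(x)))                           # S theta
        StGiSx = np.real(np.fft.ifft2(np.conj(blur_kernel_fft) * np.fft.fft2(Sx / gamma_2d)))  # S^T Gamma^{-1} S theta
        Dx = np.real(np.fft.ifft2(lap_kernel_fft * np.fft.fft2(x)))                            # Delta theta
        DtDx = np.real(np.fft.ifft2(np.conj(lap_kernel_fft) * np.fft.fft2(Dx)))                # Delta^T Delta theta
        return StGiSx.ravel() + (xi0 / d) * theta.sum() + xi1 * DtDx.ravel()                   # 1_{dxd} theta = theta.sum() * 1_d
    return Q_matvec
```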
where T1 is the CPU time in seconds required to draw one sample and \rho t (\vargamma ) is the
lag-t autocorrelation of a scalar parameter \vargamma . A variant of the ESSR has, for instance,
been used in [37] in order to measure the efficiency of an MCMC variant of the PO
algorithm (Algorithm 3.4). For a direct sampler providing i.i.d. draws, the ESSR (6.8)
simplifies to 1/T1 and represents the number of samples obtained in one second. In
both cases, the larger the ESSR, the more computationally efficient is the sampler.
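Following the description above (truncation of the autocorrelation sum at the first negative lag, and ESSR reducing to 1/T_1 for i.i.d. draws), a minimal implementation reads as follows; the function name and interface are ours and (6.8) is not reproduced verbatim here.

```python
import numpy as np

def essr(chain, cpu_time_per_sample):
    # Effective sample size ratio per second for a scalar chain, truncating the
    # autocorrelation sum at the first negative lag; reduces to 1 / T1 for i.i.d. draws.
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    rho = acf[1:] / acf[0]                      # rho_t for t = 1, 2, ...
    neg = np.where(rho < 0)[0]
    rho = rho[:neg[0]] if neg.size else rho     # truncation at the first negative rho_t
    ess_over_T = 1.0 / (1.0 + 2.0 * rho.sum())
    return ess_over_T / cpu_time_per_sample
```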
Figure 11 shows the ESSR associated to the two considered algorithms for d \in
[10^4, 10^6]. The latter has been computed by choosing the ``slowest'' component of \bfittheta as
the scalar summary \vargamma , that is, the one with the largest variance. As in the statistical
software Stan [18], we have truncated the infinite sum in (6.8) at the first negative
\rho t . One can note that for the various high-dimensional problems considered here,
the GEDA sampler exhibits good mixing properties which, combined with its low
computational cost per iteration, yields a larger ESSR than the direct CG sampler.
Hence, in this specific case, building on both the decomposition Q = Q1 + Q2 of
the precision matrix and an efficient MCMC sampler is highly beneficial compared to
using Q directly in Algorithm 3.5. Obviously, this gain in computational efficiency
w.r.t. direct samplers is not guaranteed in general since GEDA relies on an appropriate
decomposition Q = Q1 + Q2 .
6.3. Guidelines to Choosing the Appropriate Gaussian Sampling Approach.
In this section, we provide the reader with some insights into how to choose the
most appropriate sampler for a given Gaussian simulation task when vanilla Cholesky
sampling cannot be envisioned. These guidelines are formulated as simple binary
questions in a decision tree (see Figure 12) to determine the class of samplers that is
of potential interest. We choose to start5 from the existence of a natural decomposition Q = Q1 + Q2 of the precision matrix.
5 Note that alternative decision trees could be built by considering other issues as the primary
decision level.
[Figure 11 panels: (left) ESS ratio per second (ESSR) versus dimension d for the CG sampler (K_{kryl} \in [33, 38]) and the GEDA sampler; (right) autocorrelation \rho_t versus lag t for d = 10^4.]
Fig. 11 Scenario 3. (left) ESS ratio per second (ESSR); (right) autocorrelation function \rho t shown
for d = 104 . For both figures, the slowest component of \bfittheta is used as the scalar summary \vargamma .
[Figure 12: decision tree for choosing a sampler; the root question is the existence of a decomposition Q = Q1 + Q2, with yes/no branches leading to the arbitrary or limited accuracy regimes described in the caption.]
Fig. 12 General guidelines to choosing the most appropriate sampling approach based on those reviewed in this paper. When computation of order \Theta (d3 ) and
storage of \Theta (d2 ) elements are not prohibitive, the sampler of choice is obviously Algorithm 3.1. The accuracy of a sampler is referred to as arbitrary
when it is expected to obtain samples with arbitrary accuracy more efficiently than vanilla Cholesky sampling; the accuracy is said to be limited when
Appendix A. Guide to Notation. The following table lists and defines all the
notation used in this paper.
General Definitions.
Definition B.1 (operator). Let the notation 2^{\BbbR^d} represent the family of all subsets of \BbbR^d. An operator or multivalued function \sansK : \BbbR^d \rightarrow 2^{\BbbR^d} maps every point in \BbbR^d to a subset of \BbbR^d.
Definition B.2 (graph). Let \sansK : \BbbR^d \rightarrow 2^{\BbbR^d}. The graph of \sansK is defined by

(B.1)   gra(\sansK) = \{(\bfittheta, u) \in \BbbR^d \times \BbbR^d \mid u \in \sansK(\bfittheta)\}.
Definition B.3 (monotone operator). Let \sansK : \BbbR^d \rightarrow 2^{\BbbR^d}. \sansK is said to be monotone if

(B.2)   \forall (\bfittheta, u) \in gra(\sansK) and \forall (y, p) \in gra(\sansK), \langle \bfittheta - y, u - p\rangle \geq 0.
Definition B.4 (maximal monotone operator). Let \sansK : \BbbR^d \rightarrow 2^{\BbbR^d} be monotone. Then \sansK is maximal monotone if there exists no monotone operator \sansP : \BbbR^d \rightarrow 2^{\BbbR^d} such that gra(\sansP) properly contains gra(\sansK), i.e., for every (\bfittheta, u) \in \BbbR^d \times \BbbR^d,

(B.3)   (\bfittheta, u) \in gra(\sansK) \leftrightarrow \forall (y, p) \in gra(\sansK), \langle \bfittheta - y, u - p\rangle \geq 0.

6 http://github.com/mvono/PyGauss.
Definition B.5 (nonexpansiveness). Let \sansK : \BbbR^d \rightarrow 2^{\BbbR^d}. Then \sansK is nonexpansive if it is Lipschitz continuous with constant 1, i.e., for every (\bfittheta, y) \in \BbbR^d \times \BbbR^d,

(B.4)   \|\sansK(y) - \sansK(\bfittheta)\| \leq \|y - \bfittheta\|.
The PPA. For \lambda > 0, define the Moreau--Yosida resolvent operator associated to \sansK as the operator \sansL = (Id + \lambda\sansK)^{-1}, where Id is the identity operator. The monotonicity of \sansK implies that \sansL is nonexpansive and its maximal monotonicity yields dom(\sansL) = \BbbR^d [72], where the notation ``dom'' means the domain of the operator \sansL. Therefore, solving the problem (5.16) is equivalent to solving the fixed point problem \bfittheta = \sansL(\bfittheta) for all \bfittheta \in \BbbR^d. This result suggests that finding the zeros of \sansK can be achieved by building a sequence of iterates \{\bfittheta^{(t)}\}_{t\in\BbbN} such that for t \in \BbbN, \bfittheta^{(t+1)} = \sansL(\bfittheta^{(t)}). This iteration corresponds to the PPA with an arbitrary monotone operator \sansK.
Proof of (5.23). Applying the PPA with R = W - \rho - 1 A\top A to (5.20) leads to
\bfittheta^{(t+1)} = \arg\min_{\bfittheta \in \BbbR^d} g_2(\bfittheta) + \frac{1}{2\rho}\bigl\|\bfA\bfittheta - \bfz^{(t+1)} + \bfu^{(t)}\bigr\|^2 + \frac{1}{2}\bigl\|\bfittheta - \bfittheta^{(t)}\bigr\|_{\bfR}^2

= \arg\min_{\bfittheta \in \BbbR^d} g_2(\bfittheta) + \frac{1}{2\rho}\bigl\|\bfA\bfittheta - \bfz^{(t+1)} + \bfu^{(t)}\bigr\|^2 + \frac{1}{2}\Bigl\langle \bfR\bigl(\bfittheta - \bfittheta^{(t)}\bigr), \bfittheta - \bfittheta^{(t)}\Bigr\rangle

= \arg\min_{\bfittheta \in \BbbR^d} g_2(\bfittheta) + \frac{1}{2}\Bigl(\bfittheta^\top\Bigl[\frac{1}{\rho}\bfA^\top\bfA + \bfR\Bigr]\bfittheta - 2\bfittheta^\top\Bigl[\frac{1}{\rho}\bfA^\top\bigl\{\bfz^{(t+1)} - \bfu^{(t)}\bigr\} + \bfR\bfittheta^{(t)}\Bigr]\Bigr)

= \arg\min_{\bfittheta \in \BbbR^d} g_2(\bfittheta) + \frac{1}{2}\Bigl(\bfittheta^\top\bfW\bfittheta - 2\bfittheta^\top\Bigl[\bfW\bfittheta^{(t)} + \frac{1}{\rho}\bfA^\top\bigl\{\bfz^{(t+1)} - \bfu^{(t)} - \bfA\bfittheta^{(t)}\bigr\}\Bigr]\Bigr)

= \arg\min_{\bfittheta \in \BbbR^d} g_2(\bfittheta) + \frac{1}{2}\Bigl\|\bfittheta - \Bigl(\bfittheta^{(t)} + \frac{1}{\rho}\bfW^{-1}\bfA^\top\bigl[\bfz^{(t+1)} - \bfu^{(t)} - \bfA\bfittheta^{(t)}\bigr]\Bigr)\Bigr\|_{\bfW}^2.
(C.1)   \omega_{SOR}^\star = \frac{2}{1 + \sqrt{1 - \rho(\bfI_d - \bfD^{-1}\bfQ)^2}},

where \bfD is the diagonal part of \bfQ. Regarding the MCMC sampler based on Richardson splitting, we used the optimal value

(C.2)   \omega_{Richardson}^\star = \frac{2}{\lambda_{min}(\bfQ) + \lambda_{max}(\bfQ)},

where \lambda_{min}(\bfQ) and \lambda_{max}(\bfQ) are the minimum and maximum eigenvalues of \bfQ, respectively. Finally, for the samplers based on SSOR splitting including Algorithm 4.3, we used the optimal relaxation parameter

(C.3)   \omega_{SSOR}^\star = \frac{2}{1 + \sqrt{2\bigl(1 - \rho\bigl(\bfD^{-1}(\bfL + \bfL^\top)\bigr)\bigr)}}.
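For convenience, the optimal tuning parameters (C.1)--(C.3) can be computed directly from Q. The following is a small sketch of ours; it assumes Q is symmetric positive definite and that the relevant spectral radii are below one.

```python
import numpy as np

def optimal_relaxation_parameters(Q):
    # Optimal tuning parameters (C.1)-(C.3) computed directly from the precision matrix Q.
    d = Q.shape[0]
    D = np.diag(np.diag(Q))
    L = np.tril(Q, k=-1)                                   # strictly lower triangular part of Q
    rho = lambda M: np.max(np.abs(np.linalg.eigvals(M)))   # spectral radius
    rho_jacobi = rho(np.eye(d) - np.linalg.solve(D, Q))
    w_sor = 2.0 / (1.0 + np.sqrt(1.0 - rho_jacobi ** 2))                            # (C.1)
    eig_Q = np.linalg.eigvalsh(Q)
    w_richardson = 2.0 / (eig_Q.min() + eig_Q.max())                                # (C.2)
    w_ssor = 2.0 / (1.0 + np.sqrt(2.0 * (1.0 - rho(np.linalg.solve(D, L + L.T)))))  # (C.3)
    return w_sor, w_richardson, w_ssor
```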
REFERENCES
[1] S. L. Adler, Over-relaxation method for the Monte Carlo evaluation of the partition function
for multiquadratic actions, Phys. Rev. D, 23 (1981), pp. 2901--2904. (Cited on p. 20)
[2] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, Fast image recovery using
variable splitting and constrained optimization, IEEE Trans. Image Process., 19 (2010),
pp. 2345--2356. (Cited on p. 27)
[3] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, An augmented Lagrangian
approach to the constrained optimization formulation of imaging inverse problems, IEEE
Trans. Image Process., 20 (2011), pp. 681--695. (Cited on p. 27)
[4] E. Allen, J. Baglama, and S. Boyd, Numerical approximation of the product of the square
root of a matrix with a vector, Linear Algebra Appl., 310 (2000), pp. 167--181, https:
//doi.org/10.1016/S0024-3795(00)00068-9. (Cited on p. 17)
[5] Y. Altmann, S. McLaughlin, and N. Dobigeon, Sampling from a multivariate Gaussian
distribution truncated on a simplex: A review, in Proc. IEEE-SP Workshop Stat. and
Signal Process. (SSP), Gold Coast, Australia, 2014, pp. 113--116, https://doi.org/10.
1109/SSP.2014.6884588. (Cited on p. 12)
[6] S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D.
Minden, S. E. Sallan, E. S. Lander, T. R. Golub, and S. J. Korsmeyer, MLL translo-
cations specify a distinct gene expression profile that distinguishes a unique leukemia,
Nat. Genet., 30 (2002), pp. 41--47, https://doi.org/10.1038/ng765. (Cited on pp. 13, 14)
[7] E. Aune, J. Eidsvik, and Y. Pokern, Iterative numerical methods for sampling from high
dimensional Gaussian distributions, Stat. Comput., 23 (2013), pp. 501--521, https://doi.
org/10.1007/s11222-012-9326-8. (Cited on pp. 16, 17, 36, 39)
[8] O. Axelsson, Iterative Solution Methods, Cambridge University Press, 1994, https://doi.org/
10.1017/CBO9780511624100. (Cited on p. 22)
[9] A.-C. Barbos, F. Caron, J.-F. Giovannelli, and A. Doucet, Clone MCMC: Parallel high-
dimensional Gaussian Gibbs sampling, in Adv. in Neural Information Process. Systems,
2017, pp. 5020--5028. (Cited on pp. 22, 24, 25, 32, 36, 37)
[10] P. Barone and A. Frigessi, Improving stochastic relaxation for Gaussian random fields,
Probab. Eng. Inform. Sci., 4 (1990), pp. 369--389. (Cited on p. 20)
[11] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory
in Hilbert Spaces, 2nd ed., Springer, 2017. (Cited on p. 49)
[12] J. Besag and C. Kooperberg, On conditional and intrinsic autoregressions, Biometrika, 82
(1995), pp. 733--746, https://doi.org/10.2307/2337341. (Cited on pp. 5, 12)
[13] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006. (Cited on
p. 12)
[14] G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, 3rd ed.,
Prentice Hall, 1994. (Cited on p. 29)
[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and
statistical learning via the alternating direction method of multipliers, Found. Trends
Mach. Learn., 3 (2011), pp. 1--122. (Cited on pp. 27, 34)
[16] S. Brahim-Belhouari and A. Bermak, Gaussian process for nonstationary time series pre-
diction, Comput. Statist. Data Anal., 47 (2004), pp. 705--712, https://doi.org/10.1016/j.
csda.2004.02.006. (Cited on p. 5)
[17] K. Bredies and H. Sun, A proximal point analysis of the preconditioned alternating direction
method of multipliers, J. Optim. Theory Appl., 173 (2017), pp. 878--907, https://doi.org/
10.1007/s10957-017-1112-5. (Cited on p. 34)
[18] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt,
M. Brubaker, J. Guo, P. Li, and A. Riddell, Stan: A probabilistic programming lan-
guage, J. Stat. Softw., 76 (2017), pp. 1--32, https://doi.org/10.18637/jss.v076.i01. (Cited
on p. 46)
[19] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with
applications to imaging, J. Math. Imaging Vision, 40 (2011), pp. 120--145, https://doi.
org/10.1007/s10851-010-0251-1. (Cited on p. 34)
[20] A.-L. Cholesky, Sur la résolution numérique des systèmes d'équations linéaires, Bull. Soc. Amis Bibl. École Polytech., 39 (1910), pp. 81--95. (Cited on p. 14)
[21] E. Chow and Y. Saad, Preconditioned Krylov subspace methods for sampling multivariate
Gaussian distributions, SIAM J. Sci. Comput., 36 (2014), pp. A588--A608, https://doi.
org/10.1137/130920587. (Cited on pp. 16, 17)
[22] E. S. Coakley and V. Rokhlin, A fast divide-and-conquer algorithm for computing the
spectra of real symmetric tridiagonal matrices, Appl. Comput. Harmonic Anal., 34 (2013),
pp. 379--414, https://doi.org/10.1016/j.acha.2012.06.003. (Cited on p. 16)
[23] N. A. C. Cressie, Statistics for Spatial Data, 2nd ed., Wiley, 1993. (Cited on p. 5)
[24] M. W. Davis, Generating large stochastic simulations---the matrix polynomial approxima-
tion method, Math. Geosci., 19 (1987), pp. 99--107, https://doi.org/10.1007/BF00898190.
(Cited on pp. 15, 16, 36)
[25] O. Demir-Kavuk, H. Riedesel, and E.-W. Knapp, Exploring classification strategies with
the CoEPrA 2006 contest, Bioinformatics, 26 (2010), pp. 603--609, https://doi.org/10.
1093/bioinformatics/btq021. (Cited on pp. 13, 14)
[26] C. R. Dietrich and G. N. Newsam, Fast and exact simulation of stationary Gaussian
processes through circulant embedding of the covariance matrix, SIAM J. Sci. Comput.,
18 (1997), pp. 1088--1107, https://doi.org/10.1137/S1064827592240555. (Cited on p. 11)
[27] S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth, Hybrid Monte Carlo, Phys.
Lett. B, 195 (1987), pp. 216--222. (Cited on p. 5)
[28] E. Esser, X. Zhang, and T. F. Chan, A general framework for a class of first order primal-
dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 3
(2010), pp. 1015--1046, https://doi.org/10.1137/09076934X. (Cited on p. 34)
[29] L. Fahrmeir and S. Lang, Bayesian semiparametric regression analysis of multicategorical
time-space data, Ann. Inst. Statist. Math., 53 (2001), pp. 11--30, https://doi.org/10.1023/
A:1017904118167. (Cited on p. 5)
[30] O. Féron, F. Orieux, and J. F. Giovannelli, Gradient scan Gibbs sampler: An efficient algorithm for high-dimensional Gaussian distributions, IEEE J. Sel. Topics Signal Process., 10 (2016), pp. 343--352, https://doi.org/10.1109/JSTSP.2015.2510961. (Cited on pp. 20, 36)
[31] C. Fox and A. Parker, Convergence in variance of Chebyshev accelerated Gibbs samplers,
SIAM J. Sci. Comput., 36 (2014), pp. A124--A147, https://doi.org/10.1137/120900940.
(Cited on pp. 23, 36, 40)
[32] C. Fox and A. Parker, Accelerated Gibbs sampling of normal distributions using matrix
splittings and polynomials, Bernoulli, 23 (2017), pp. 3711--3743, https://doi.org/10.3150/
16-BEJ863. (Cited on pp. 7, 20, 21, 22, 23, 30, 36, 37, 42, 43)
[33] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational
problems via finite element approximation, Comput. Math. Appl., 2 (1976), pp. 17--40,
https://doi.org/10.1016/0898-1221(76)90003-1. (Cited on p. 34)
[34] B. Galerne, Y. Gousseau, and J. Morel, Random phase textures: Theory and synthesis,
IEEE Trans. Image Process., 20 (2011), pp. 257--267, https://doi.org/10.1109/TIP.2010.
2052822. (Cited on p. 5)
[35] A. Gelman, C. Robert, N. Chopin, and J. Rousseau, Bayesian Data Analysis, Chapman
& Hall, 1995. (Cited on p. 20)
[36] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6 (1984), pp. 721--741.
(Cited on pp. 5, 13, 20)
[37] C. Gilavert, S. Moussaoui, and J. Idier, Efficient Gaussian sampling for solving large-
scale inverse problems using MCMC, IEEE Trans. Signal Process., 63 (2015), pp. 70--80,
https://doi.org/10.1109/TSP.2014.2367457. (Cited on pp. 5, 19, 36, 37, 46)
[38] J.-F. Giovannelli and J. Idier, Regularization and Bayesian Methods for Inverse Problems
in Signal and Image Processing, 1st ed., Wiley-IEEE Press, 2015. (Cited on p. 44)
[39] P. Giudici and P. J. Green, Decomposable graphical Gaussian model determination,
Biometrika, 86 (1999), pp. 785--801, https://doi.org/10.1093/biomet/86.4.785. (Cited
on p. 5)
[40] R. Glowinski and A. Marroco, Sur l'approximation, par éléments finis d'ordre un, et la
résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires,
RAIRO Anal. Numér., 9 (1975), pp. 41--76, http://www.numdam.org/item/M2AN_1975__9_2_41_0. (Cited on p. 34)
[41] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., The Johns Hopkins
University Press, 1989. (Cited on pp. 6, 10, 14, 16, 21, 22, 42)
[42] J. Goodman and A. D. Sokal, Multigrid Monte Carlo method. Conceptual foundations,
Phys. Rev. D, 40 (1989), pp. 2035--2071, https://doi.org/10.1103/PhysRevD.40.2035.
(Cited on p. 20)
[43] P. J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination, Biometrika, 82 (1995), pp. 711--732, https://doi.org/10.1093/biomet/82.
4.711. (Cited on p. 19)
[44] M. Gu and S. C. Eisenstat, A divide-and-conquer algorithm for the symmetric tridiagonal
eigenproblem, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 172--191, https://doi.org/10.
1137/S0895479892241287. (Cited on p. 16)
[45] N. Hale, N. J. Higham, and L. N. Trefethen, Computing A^\alpha, log(A), and related matrix
functions by contour integrals, SIAM J. Numer. Anal., 46 (2008), pp. 2505--2523, https:
//doi.org/10.1137/070700607. (Cited on p. 17)
[46] J. Havil, Gamma: Exploring Euler's Constant, Princeton University Press, 2003. (Cited on
p. 5)
[47] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems,
J. Res. Natl. Bur. Stand., 49 (1952), pp. 409--436. (Cited on p. 19)
[48] D. Higdon, A primer on space-time modeling from a Bayesian perspective, in Statistical
Methods for Spatio-Temporal Systems, B. Finkenstädt, L. Held, and V. Isham, eds.,
Chapman & Hall/CRC, 2007, pp. 217--279. (Cited on pp. 38, 42)
[49] J. P. Hobert and G. L. Jones, Honest exploration of intractable probability distributions via
Markov chain Monte Carlo, Stat. Sci., 16 (2001), pp. 312--334, https://doi.org/10.1214/
ss/1015346317. (Cited on p. 20)
[50] J. Idier, ed., Bayesian Approach to Inverse Problems, Wiley, 2008, https://doi.org/10.1002/
9780470611197. (Cited on pp. 10, 25)
[51] M. Ilic, T. Pettitt, and I. Turner, Bayesian computations and efficient algorithms for
computing functions of large sparse matrices, ANZIAM J., 45 (2004), pp. 504--518, https:
//eprints.qut.edu.au/22511/. (Cited on p. 16)
[52] M. Ilic, I. W. Turner, and D. P. Simpson, A restarted Lanczos approximation to functions
of a symmetric matrix, IMA J. Numer. Anal., 30 (2009), pp. 1044--1061, https://doi.org/
10.1093/imanum/drp003. (Cited on pp. 16, 17, 42)
[53] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. R.
Soc. A, 186 (1946), pp. 453--461. (Cited on p. 13)
[54] M. Johnson, J. Saunderson, and A. Willsky, Analyzing Hogwild parallel Gaussian Gibbs
sampling, in Adv. in Neural Information Process. Systems, 2013, pp. 2715--2723. (Cited
on pp. 7, 22, 24, 25, 32, 36)
[55] R. E. Kass, B. P. Carlin, A. Gelman, and R. M. Neal, Markov chain Monte Carlo in
practice: A roundtable discussion, Amer. Statist., 52 (1998), pp. 93--100. (Cited on p. 46)
[56] J. Kittler and J. Föglein, Contextual classification of multispectral pixel data, Image Vis.
Comput., 2 (1984), pp. 13--29, https://doi.org/10.1016/0262-8856(84)90040-4. (Cited on
p. 5)
[57] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proc. IEEE, 86 (1998), pp. 2278--2324, https://doi.org/10.1109/5.
726791. (Cited on pp. 13, 14)
[58] M. Li, D. Sun, and K.-C. Toh, A majorized ADMM with indefinite proximal terms for
linearly constrained convex composite optimization, SIAM J. Optim., 26 (2016), pp. 922--
950, https://doi.org/10.1137/140999025. (Cited on p. 34)
[59] S. Z. Li, Markov Random Field Modeling in Image Analysis, 3rd ed., Springer, 2009. (Cited
on p. 11)
[60] Y. Li and S. K. Ghosh, Efficient sampling methods for truncated multivariate normal and
[83] N. G. Polson, J. G. Scott, and J. Windle, Bayesian inference for logistic models using
Pólya-Gamma latent variables, J. Amer. Stat. Assoc., 108 (2013), pp. 1339--1349, https:
//doi.org/10.1080/01621459.2013.829001. (Cited on p. 5)
[107] S. Wilhelm and B. Manjunath, tmvtnorm: A package for the truncated multivariate normal
distribution, The R Journal, 2 (2010), pp. 25--29, https://doi.org/10.32614/RJ-2010-005.
(Cited on p. 12)
[108] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, MIT
Press, 2006. (Cited on p. 38)
[109] A. T. A. Wood and G. Chan, Simulation of stationary Gaussian processes in [0, 1]^d, J.
Comput. Graph. Stat., 3 (1994), pp. 409--432, https://doi.org/10.1080/10618600.1994.
10474655. (Cited on p. 11)
[110] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based
on Bregman iteration, J. Sci. Comput., 46 (2011), pp. 20--46, https://doi.org/10.1007/
s10915-010-9408-8. (Cited on p. 34)