
SIAM REVIEW © 2022 Society for Industrial and Applied Mathematics
Vol. 64, No. 1, pp. 3–56

High-Dimensional Gaussian Sampling:
A Review and a Unifying Approach Based
on a Stochastic Proximal Point Algorithm*

Maxime Vono†
Nicolas Dobigeon‡
Pierre Chainais§

Abstract. Efficient sampling from a high-dimensional Gaussian distribution is an old but high-stakes
issue. Vanilla Cholesky samplers imply a computational cost and memory requirements
that can rapidly become prohibitive in high dimensions. To tackle these issues, multiple
methods have been proposed from different communities, ranging from iterative numerical
linear algebra to Markov chain Monte Carlo (MCMC) approaches. Surprisingly, no
complete review and comparison of these methods has been conducted. This paper aims
to review all these approaches by pointing out their differences, close relations, benefits,
and limitations. In addition to reviewing the state of the art, this paper proposes a unifying
Gaussian simulation framework by deriving a stochastic counterpart of the celebrated
proximal point algorithm in optimization. This framework offers a novel and unifying
revisiting of most of the existing MCMC approaches while also extending them. Guidelines
for choosing the appropriate Gaussian simulation method for a given sampling problem in
high dimensions are proposed and illustrated with numerical examples.

Key words. Gaussian distribution, high-dimensional sampling, linear system, Markov chain Monte
Carlo, proximal point algorithm

AMS subject classifications. 65C10, 68U20, 62H12

DOI. 10.1137/20M1371026

Contents

1 Introduction 5

2 Gaussian Sampling: Problem, Instances, and Issues 8


2.1 Definitions and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . 8

* Received by the editors October 2, 2020; accepted for publication (in revised form) August 9,
2021; published electronically February 3, 2022.
https://doi.org/10.1137/20M1371026
Funding: This work was partially supported by the ANR-3IA Artificial and Natural Intelligence
Toulouse Institute (ANITI) under grant agreement ANITI ANR-19-PI3A-0004, by the ANR project
"Chaire IA Sherlock," ANR-20-CHIA-0031-01, as well as by national support within Programme
d'investissements d'avenir, ANR-16-IDEX-0004 ULNE. The work of the third author was supported
by his 3IA Chair Sherlock funded by the ANR, under grant agreement ANR-20-CHIA-0031-01,
Centrale Lille, the I-Site Lille-Nord-Europe, and the region Hauts-de-France.
† Lagrange Mathematics and Computing Research Center, Huawei, 75007 Paris, France
([email protected]).
‡ University of Toulouse, IRIT/INP-ENSEEIHT, 31071 Toulouse, France, and Institut Universitaire
de France (IUF), France ([email protected]).
§ University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
([email protected]).

2.2 Usual Special Instances . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.2.1 Univariate Gaussian Sampling . . . . . . . . . . . . . . . . . . 9
2.2.2 Multivariate Gaussian Sampling with Diagonal Precision Matrix 9

2.2.3 Multivariate Gaussian Sampling with Sparse or Band Matrix Q 10


2.2.4 Multivariate Gaussian Sampling with Block Circulant (Toeplitz)
Matrix Q with Circulant (Toeplitz) Blocks . . . . . . . . . . . 11
2.2.5 Truncated and Intrinsic Gaussian Distributions . . . . . . . . . 11
2.3 Problem Statement: Sampling from a Gaussian Distribution with an
Arbitrary Precision Matrix Q . . . . . . . . . . . . . . . . . . . . . . . 12

3 Sampling Algorithms Derived from Numerical Linear Algebra 14


3.1 Factorization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Square Root Factorization . . . . . . . . . . . . . . . . . . . . . 15
3.2 Inverse Square Root Approximations . . . . . . . . . . . . . . . . . . . 15
3.2.1 Polynomial Approximation . . . . . . . . . . . . . . . . . . . . 15
3.2.2 Lanczos Approximation . . . . . . . . . . . . . . . . . . . . . . 16
3.2.3 Other Square Root Approximations . . . . . . . . . . . . . . . 17
3.3 Conjugate Gradient--Based Samplers . . . . . . . . . . . . . . . . . . . 18
3.3.1 Perturbation before Optimization . . . . . . . . . . . . . . . . . 18
3.3.2 Optimization with Perturbation . . . . . . . . . . . . . . . . . . 19

4 Sampling Algorithms Based on MCMC 20


4.1 Matrix Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.1 Exact Matrix Splitting . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.2 Approximate Matrix Splitting . . . . . . . . . . . . . . . . . . . 24
4.2 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.1 Exact Data Augmentation . . . . . . . . . . . . . . . . . . . . . 25
4.2.2 Approximate Data Augmentation . . . . . . . . . . . . . . . . . 27

5 A Unifying Revisit of Gibbs Samplers via a Stochastic Version of the PPA 28


5.1 A Unifying Proposal Distribution . . . . . . . . . . . . . . . . . . . . . 28
5.2 Revisiting MCMC Sampling Approaches . . . . . . . . . . . . . . . . . 29
5.2.1 From Exact Data Augmentation to Exact Matrix Splitting . . 29
5.2.2 From Approximate Matrix Splitting to Approximate Data Aug-
mentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 Gibbs Samplers as Stochastic Versions of the PPA . . . . . . . . . . . 32

6 A Comparison of Gaussian Sampling Methods with Numerical Simulations 35
6.1 Summary, Comparison, and Discussion of Existing Approaches . . . . 35
6.2 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2.1 Scenario 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.2.2 Scenario 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2.3 Scenario 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.3 Guidelines to Choosing the Appropriate Gaussian Sampling Approach 46

7 Conclusion 47

Appendix A. Guide to Notation 49


Appendix B. Details and Proofs Associated to Subsection 5.1 49

Appendix C. Details Associated to Subsection 6.2.2 50



Acknowledgments 51

References 51

1. Introduction. If there were only one continuous probability distribution that
we needed to know, it would certainly be the Gaussian (also known as normal) distribution.
Many nice properties of the Gaussian distribution can be listed, such as its
infinite divisibility, maximum entropy property, or its description based on the use of
the first two cumulants only (mean and variance). However, its popularity and ubiq-
uity result from two essential properties, namely, the central limit theorem and the
statistical interpretation of ordinary least squares, which often motivate its use to de-
scribe random noises or residual errors in various applications (e.g., inverse problems
in signal and image processing). The first property originates from gambling theory.
The binomial distribution, which models the probabilities of successes and failures
after a given number of trials, was approximated by a Gaussian distribution in the
seminal work by de Moivre [73]. This famous approximation is a specific instance of
the central limit theorem, which states that the sum of a sufficiently large number of
independent and identically distributed (i.i.d.) random variables with finite variance
converges in distribution toward a Gaussian random variable. Capitalizing on this
theorem, a lot of complex random events have been approximated using the Gaussian
distribution, sometimes called the bell curve. Another well-known use of the Gaussian
distribution has been in the search for an error distribution in empirical sciences. For
instance, since the end of the 16th century, astronomers have been interested in data
summaries to describe their observations. They found that the estimate defined by
the arithmetic mean of the observations was related to the resolution of a least mean
square problem under the assumption of Gaussian measurement errors [46]. The as-
sumption of Gaussian noise has now become so common that it is sometimes implicit
in many applications.
Motivated by all these features, the Gaussian distribution is omnipresent in prob-
lems far beyond the statistics community. In statistical machine learning and sig-
nal processing, Gaussian posterior distributions commonly appear when hierarchical
Bayesian models are derived [37, 67, 80, 83], in particular, when the exponential fam-
ily is involved. As archetypal examples, models based on Gaussian Markov random
fields or conditional autoregressions assume that parameters of interest (associated
with observations) come from a joint Gaussian distribution with a structured covari-
ance matrix reflecting their interactions [92]. Such models have found applications in
spatial statistics [12, 23], image analysis [36, 56], graphical structures [39], and semi-
parametric regression and splines [29]. We can also cite the discretization schemes of
stochastic differential equations involving Brownian motion, which led to a Gaussian
sampling step [27, 89, 106], texture synthesis [34], and time series prediction [16].
Indeed, the Gaussian distribution is also intimately connected to diffusion processes
and statistical physics.
When the dimension of the problem is small or moderate, sampling from this
distribution is an old solved problem that raises no particular difficulty. In high-dimensional
settings this multivariate sampling task can become computationally demanding,
which may prevent us from using statistically sound methods for real-world
applications. Therefore, even recently, a host of works have focused on the derivation
of efficient high-dimensional Gaussian sampling methods. Before summarizing our


contributions and main results, in what follows, we discuss what we mean by com-
plexity and efficiency, in light of the most common sampling technique, i.e., exploiting
the Cholesky factorization.
Computational and Storage Complexities: Notation. In what follows, we will
use the mathematical notation $\Theta(\cdot)$ and $\mathcal{O}(\cdot)$ to describe the complexities of the
computation and the storage required by the sampling algorithms. We recall that
$f(d) = \mathcal{O}(g(d))$ if there exists $c > 0$ such that $f(d) \leq c\,g(d)$ when $d \to \infty$. We
use $f(d) = \Theta(g(d))$ if there exist $c_1, c_2 > 0$ such that $c_1 g(d) \leq f(d) \leq c_2 g(d)$ when
$d \to \infty$.
Table 1  Vanilla Cholesky sampling. Computational and storage requirements to produce one sample
from an arbitrary d-dimensional Gaussian distribution.

                         Vanilla Cholesky sampler
          d         Θ(d^3) flops           Θ(d^2) memory requirement
        10^3        3.34 × 10^8            4 MB
        10^4        3.33 × 10^11           0.4 GB
        10^5        3.33 × 10^14           40 GB
        10^6        3.33 × 10^17           4 TB

Algorithmic Efficiency: Definition. To sample from a given d-dimensional Gaussian
distribution, the most common sampling algorithm is based on the Cholesky factorization [91].
Let us recall that the Cholesky factorization of a symmetric positive-definite
matrix $Q \in \mathbb{R}^{d \times d}$ is a decomposition of the form [92, section 2.4]

(1.1)    $Q = CC^\top,$

where $C \in \mathbb{R}^{d \times d}$ is a lower triangular matrix with real and positive diagonal entries.
The computation of the Cholesky factor C for dense matrices requires $\Theta(d^3)$
floating point operations (flops), that is, arithmetic operations such as additions,
subtractions, multiplications, or divisions [41]; see also subsection 3.1.1 for details. In
addition, the Cholesky factor, which involves at most $d(d+1)/2$ nonzero entries,
must be stored. In the general case, this implies a global memory requirement of
$\Theta(d^2)$. In high-dimensional settings ($d \gg 1$), both these numerical complexity and
storage requirements rapidly become prohibitive for standard computers. Table 1
illustrates this claim by listing the number of flops (using 64-bit numbers, also called
double precision) and the storage space required by the vanilla Cholesky sampler in high
dimensions. Note that for $d \geq 10^5$, which is classical in image processing problems,
for instance, the memory requirement of the Cholesky sampler exceeds the memory
capacity of today's standard laptops. To mitigate these computational issues, much
work has focused on the derivation of surrogate high-dimensional Gaussian sampling
methods. Compared to Cholesky sampling, these surrogate samplers involve additional
sources of approximation in finite-time sampling and as such intend to trade off
computation and storage requirements against sampling accuracy. Throughout this
review, we will say that a Gaussian sampling procedure is "efficient" if, in order to
produce a sample with reasonable accuracy, the number of flops and memory required
are significantly lower than those of the Cholesky sampler. For the sake of clarity, at
the end of each section presenting an existing Gaussian sampler, we will highlight its
theoretical relative efficiency with respect to (w.r.t.) vanilla Cholesky sampling in
a dedicated paragraph. As a typical example, some approaches reviewed in this paper
will only require $\mathcal{O}(d^2)$ flops and $\Theta(d)$ storage space, which are lower than the Cholesky
complexities by one power of d.
Contributions. To the authors' knowledge, no systematic comparison of existing
Gaussian sampling approaches is available in the literature. This is probably due
to the huge number of contributions related to this task from distinct communities.
Hence, it is not always clear which method is best suited to a given Gaussian sampling
task in high dimensions, and what the main similarities and differences among them
are. To this end, this paper both reviews the main sampling techniques dedicated
to an arbitrary high-dimensional Gaussian distribution (see Table 7 for a synthetic
overview) and derives general guidelines for practitioners in determining the appro-
priate sampling approach when Cholesky sampling is not possible (see Figure 12 for
a schematic overview). Beyond that review, we propose to put most of the state-of-
the-art Markov chain Monte Carlo (MCMC) methods under a common umbrella by
deriving a unifying Gaussian sampling framework.
Main Results. Our main results are summarized below.
• At the expense of some approximation, we show that state-of-the-art Gaussian
sampling algorithms are indeed more efficient than Cholesky sampling
in high dimensions. Their computational complexity, memory requirements,
and accuracy are summarized in Table 7.
• Among existing Gaussian samplers, some provide i.i.d. samples while others
(e.g., MCMC-based ones) produce correlated samples. Interestingly, we
show in section 6 for several experiments that MCMC approaches might perform
better than samplers providing i.i.d. samples. This relative efficiency is
particularly important in the case where many samples are required.
• In section 5, we show that most of the existing MCMC approaches can be seen
as special instances of a unifying framework that is a stochastic counterpart
of the proximal point algorithm (PPA) [90]. The proposed Gaussian sampling
framework also allows us both to extend existing algorithms by proposing new
sampling alternatives and to draw a one-to-one equivalence among MCMC
samplers proposed by distinct communities, such as those based on matrix
splitting [32, 54] and data augmentation [66, 67].
• Finally, we show that the choice of an appropriate Gaussian sampling approach
is a subtle compromise between several issues, such as the need to
obtain accurate samples or the existence of a natural decomposition of the
precision matrix. We provide in Figure 12 simple guidelines for practitioners
in the form of a decision tree to help them choose the appropriate Gaussian
sampler based on these parameters.
Structure of the Paper. This paper is structured as follows. In section 2, we
present the multivariate Gaussian sampling problem along with its simple and more
complicated instances. In particular, we will list and illustrate the main difficulties
associated with sampling from a high-dimensional Gaussian distribution with an ar-
bitrary covariance matrix. These difficulties have motivated many works that propose
surrogate sampling approaches. The latter are presented in sections 3 and 4. More
precisely, section 3 presents Gaussian sampling schemes which have been derived by

adapting ideas from numerical linear algebra. In section 4, we review another class of
sampling techniques, namely, MCMC approaches, which build a Markov chain admit-
ting the Gaussian distribution of interest (or a close approximation) as a stationary
distribution. In section 5, we propose to shed new light on most of these MCMC


methods by embedding them into a unifying framework based on a stochastic version
of the PPA. In section 6, we illustrate and compare the reviewed approaches w.r.t.
archetypal experimental scenarios. Finally, section 7 makes concluding remarks. A
guide to the notation used in this paper and technical details associated to each section
are given in the appendices.
Software. All the methods reviewed in this paper have been implemented and
made available in a companion package written in Python called PyGauss. In ad-
dition, PyGauss contains a Jupyter notebook which allows the reader to reproduce
all the figures and numerical results in this paper. This package and its associated
documentation can be found online.1
2. Gaussian Sampling: Problem, Instances, and Issues. This section highlights
the Gaussian sampling problem considered, its already-surveyed special instances, and
its main issues. By recalling these specific instances, this section will also define the
limits of this paper in terms of the reviewed approaches. Note that for the sake of
simplicity, we shall abusively use the same notation for both a random variable and
its realization.
2.1. Definitions and Notation. Throughout this review, we will use capital letters
(e.g., $\Pi$) to denote a probability distribution and corresponding small letters
(e.g., $\pi$) to refer to its associated probability density function (p.d.f.). We address
the problem of sampling from a d-dimensional Gaussian distribution $\Pi \triangleq \mathcal{N}(\mu, \Sigma)$,
where d may be large. Assume throughout that the covariance matrix $\Sigma$ is positive
definite, that is, for all $\theta \in \mathbb{R}^d \setminus \{0_d\}$, $\theta^\top \Sigma \theta > 0$. Hence, its inverse $Q = \Sigma^{-1}$, called
the precision matrix, exists and is also positive definite. When $\Sigma$ is not of full rank,
the distribution $\Pi$ is said to be degenerate and does not admit a density w.r.t. the
d-dimensional Lebesgue measure. In the nondegenerate case considered here, the p.d.f.
of $\Pi$ with respect to the d-dimensional Lebesgue measure, for all $\theta \in \mathbb{R}^d$, can be written as

(2.1)    $\pi(\theta) = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(\theta - \mu)^\top \Sigma^{-1}(\theta - \mu)\right),$

where $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$, respectively, stand for the mean vector and the covariance
matrix of the considered Gaussian distribution.
For some approaches and applications, working with the precision Q rather than
with the covariance $\Sigma$ will be more convenient (e.g., for conditional autoregressive
models or hierarchical Bayesian models; see also section 4). In this paper, we choose
to present existing approaches by working directly with Q for the sake of simplicity.
When Q is unknown but $\Sigma$ is available instead, simple and straightforward algebraic
manipulations can be used to implement the same approaches without increasing their
computational complexity. Sampling from $\mathcal{N}(\mu, Q^{-1})$ raises several important issues
which are mainly related to the structure of Q. In the following paragraphs, we will
detail some special instances of (2.1) and well-known associated sampling strategies
before focusing on the general Gaussian sampling problem considered in this paper.

1 http://github.com/mvono/PyGauss.


2.2. Usual Special Instances. For completeness, this subsection recalls special
cases of Gaussian sampling tasks that will not be detailed later but are common
building blocks. Instead, we point out appropriate references for the interested reader.
These special instances include basic univariate sampling and the scenarios where Q
is (i) a diagonal matrix, (ii) a band matrix, or (iii) a circulant Toeplitz matrix. Again,
with basic algebraic manipulations, the same samplers can be used when \Sigma has one
of these specific structures.
2.2.1. Univariate Gaussian Sampling. The most simple Gaussian sampling prob-
lem boils down to drawing univariate Gaussian random variables with mean \mu \in \BbbR
and precision q > 0. Generating the latter quickly and with high accuracy has been
the topic of much research work in the last 70 years. Such methods can be loosely
speaking divided into four groups, namely, (i) cumulative density function (c.d.f.)
inversion, (ii) transformation, (iii) rejection, and (iv) recursive methods; they are
now well documented. Interested readers are invited to refer to the comprehensive
overview in [101] for more details. For instance, Algorithm 2.1 details the well-known
Box--Muller method, which transforms a pair of independent uniform random vari-
ables into a pair of Gaussian random variables by exploiting the radial symmetry of
the two-dimensional normal distribution.

Algorithm 2.1 Box–Muller sampler
1: Draw $u_1, u_2 \overset{\mathrm{i.i.d.}}{\sim} \mathcal{U}((0, 1])$.
2: Set $\tilde{u}_1 = \sqrt{-2\log(u_1)}$.
3: Set $\tilde{u}_2 = 2\pi u_2$.
4: return $(\theta_1, \theta_2) = \left(\mu + \frac{\tilde{u}_1}{\sqrt{q}}\sin(\tilde{u}_2),\ \mu + \frac{\tilde{u}_1}{\sqrt{q}}\cos(\tilde{u}_2)\right)$.
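For illustration, here is a minimal NumPy sketch of Algorithm 2.1 (the function name and the guard against $u_1 = 0$ are ours; in practice, library routines such as numpy.random.Generator.normal are used directly):

import numpy as np

def box_muller_pair(mu, q, rng=None):
    # Draw a pair of independent N(mu, 1/q) samples via the Box-Muller transform.
    rng = np.random.default_rng() if rng is None else rng
    u1, u2 = rng.uniform(0.0, 1.0, size=2)
    u1 = max(u1, np.finfo(float).tiny)   # rng.uniform samples [0, 1); avoid log(0)
    r = np.sqrt(-2.0 * np.log(u1))
    angle = 2.0 * np.pi * u2
    return (mu + r * np.sin(angle) / np.sqrt(q),
            mu + r * np.cos(angle) / np.sqrt(q))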

2.2.2. Multivariate Gaussian Sampling with Diagonal Precision Matrix. Let
us extend the previous sampling problem and now assume that one wants to generate
a d-dimensional Gaussian vector $\theta$ with mean $\mu$ and diagonal precision matrix
$Q = \mathrm{diag}(q_1, \ldots, q_d)$. The d components of $\theta$ being independent, this problem is as
simple as the univariate one since we can sample the d components in parallel independently.
A pseudocode of the corresponding sampling algorithm is given in Algorithm 2.2.

Algorithm 2.2 Sampler when Q is a diagonal matrix
1: for $i \in [d]$ do    ▷ In some programming languages, this loop can be vectorized.
2:   Draw $\theta_i \sim \mathcal{N}(\mu_i, 1/q_i)$.
3: end for
4: return $\theta = (\theta_1, \ldots, \theta_d)^\top$.

Algorithmic Efficiency. By using, for instance, Algorithm 2.1 for step 2, Algorithm 2.2
admits a computational complexity of $\Theta(d)$ and a storage requirement of $\Theta(d)$. In this
specific scenario, these requirements are significantly less than those of vanilla Cholesky
sampling, whose complexities are recalled in Table 1.
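A vectorized NumPy sketch of Algorithm 2.2 (the function name is ours):

import numpy as np

def sample_diag_precision(mu, q, rng=None):
    # One draw from N(mu, diag(q)^{-1}): the loop of Algorithm 2.2, vectorized.
    rng = np.random.default_rng() if rng is None else rng
    return mu + rng.standard_normal(mu.shape[0]) / np.sqrt(q)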
When Q is not diagonal, we can no longer sample the d components of \bfittheta indepen-
dently. Thus, more sophisticated sampling methods must be used. For well-structured
matrices Q, we show in the following sections that it is possible to draw the random
vector of interest more efficiently than with vanilla Cholesky sampling.


Fig. 1  From left to right: Example of an undirected graph defined on the 544 regions of Germany
where those sharing a common border are considered as neighbors, its associated precision
matrix Q (bandwidth b = 522), its reordered precision matrix $PQP^\top$ (b = 43) where P is
a permutation matrix, and a drawing of a band matrix. For the three matrices, the white
entries are equal to zero.

2.2.3. Multivariate Gaussian Sampling with Sparse or Band Matrix Q. A lot


of standard Gaussian sampling approaches leverage the sparsity of the matrix Q.
Sparse precision matrices appear, for instance, when Gaussian Markov random fields
(GMRFs) are considered, as illustrated in Figure 1. In this figure, German regions are
represented graphically, where each edge between two regions represents a common
border. These edges can then be described by an adjacency matrix which plays the
role of the precision matrix Q of a GMRF. Since there are few neighbors for each
region, Q is symmetric and sparse. By permuting the rows and columns of Q, one
can build a so-called band matrix with minimal bandwidth b, where b is the smallest
integer b < d such that Qij = 0 \forall i > j + b [91]. Note that band matrices also appear
naturally in specific applications, e.g., when the latter involve finite impulse response
linear filters [50]. Problems with such structured (sparse or band) matrices have been
extensively studied in the literature and this paper will not cover them explicitly.
We provide in Algorithm 2.3 the main steps to obtaining a Gaussian vector \bfittheta from
\scrN (\bfitmu , Q - 1 ) in this scenario and refer the interested reader to [92] for more details.
Algorithmic Efficiency. Algorithm 2.3 is a specific instance of Cholesky sampling for
band precision matrices. In this specific scenario, Algorithm 2.3 admits a computational
complexity of $\Theta(b^2 d)$ flops and a memory space of $\Theta(bd)$, since the band matrix
Q can be stored in a so-called "Q.array" [41, section 1.2.5]. When $b \ll d$, one observes
that these computational and storage requirements are smaller than those of vanilla
Cholesky sampling by an order of magnitude w.r.t. d. Similar computational savings
can be obtained in the sparse case [92].

Algorithm 2.3 Sampler when Q is a band matrix
1: Set C = chol(Q).    ▷ Build the Cholesky factor C of Q; see [92, section 2.4].
2: Draw $z \sim \mathcal{N}(0_d, I_d)$.
3: for $i \in [d]$ do    ▷ Solve $C^\top w = z$ w.r.t. w by backward substitution.
4:   Set $j = d - i + 1$.
5:   Set $m_1 = \min\{j + b, d\}$.
6:   Set $w_j = \frac{1}{C_{jj}}\left(z_j - \sum_{k=j+1}^{m_1} C_{kj} w_k\right)$.
7: end for
8: return $\theta = \mu + w$.
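A minimal SciPy sketch of Algorithm 2.3, assuming Q is supplied in the upper banded storage expected by scipy.linalg.cholesky_banded (the function and argument names are ours):

import numpy as np
from scipy.linalg import cholesky_banded, solve_banded

def sample_band_precision(mu, Q_band_upper, b, rng=None):
    # One draw from N(mu, Q^{-1}) for a band matrix Q with bandwidth b, given in
    # upper banded storage of shape (b + 1, d) (last row = main diagonal).
    rng = np.random.default_rng() if rng is None else rng
    U_band = cholesky_banded(Q_band_upper, lower=False)  # Q = U^T U, Theta(b^2 d) flops
    z = rng.standard_normal(mu.shape[0])
    w = solve_banded((0, b), U_band, z)                  # back-substitution: solve U w = z
    return mu + w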

Fig. 2  From left to right: Example of a 3 × 3 Laplacian filter, the associated circulant precision
matrix $Q = \Delta^\top \Delta$ for convolution with periodic boundary conditions, and its counterpart
diagonal matrix $FQF^{\mathsf{H}}$ in the Fourier domain, where F and its Hermitian conjugate $F^{\mathsf{H}}$ are
unitary matrices associated with the Fourier and inverse Fourier transforms.

2.2.4. Multivariate Gaussian Sampling with Block Circulant (Toeplitz) Matrix
Q with Circulant (Toeplitz) Blocks. An important special case of (2.1) which has
already been surveyed [92] is when Q is a block circulant matrix with circulant blocks,
that is,

(2.2)    $Q = \begin{pmatrix} Q_1 & Q_2 & \cdots & Q_M \\ Q_M & Q_1 & \cdots & Q_{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ Q_2 & Q_3 & \cdots & Q_1 \end{pmatrix},$

where $\{Q_i\}_{i \in [M]}$ are M circulant matrices. Such structured matrices frequently ap-
pear in image processing problems since they translate the convolution operator cor-
responding to linear and shift-invariant filters. As an illustration, Figure 2 shows
the circulant structure of the precision matrix associated with the Gaussian distribution
with density $\pi(\theta) \propto \exp(-\|\Delta\theta\|^2/2)$. Here, the vector $\theta \in \mathbb{R}^d$ is an image
reshaped in lexicographic order and $\Delta$ is the Laplacian differential operator with periodic
boundaries, also called the Laplacian filter. In this case the precision matrix
$Q = \Delta^\top\Delta$ is a circulant matrix [78], so that it is diagonalizable in the Fourier domain.
Therefore, sampling from $\mathcal{N}(\mu, Q^{-1})$ can be performed in this domain as shown in
Algorithm 2.4. For Gaussian distributions with more general Toeplitz precision ma-
trices, Q can be replaced by its circulant approximation and then Algorithm 2.4 can
be used; see [92] for more details. Although not considered in this paper, other ap-
proaches to generating stationary Gaussian processes [59] have been considered, such
as the spectral [71, 97] and turning bands [65] methods.
Algorithmic Efficiency. Thanks to the use of the fast Fourier transform [26, 109],
Algorithm 2.4 admits a computational complexity of $\mathcal{O}(d \log(d))$ flops. In addition,
note that only d-dimensional vectors have to be stored, which implies a memory
requirement of $\Theta(d)$. Overall, these complexities are significantly smaller than those
of vanilla Cholesky sampling and as such Algorithm 2.4 can be considered to be
"efficient."
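A minimal NumPy sketch of this FFT-based sampler, restricted to a single circulant block for simplicity (Algorithm 2.4 below treats the general block case; the function name is ours and the eigenvalues of Q are assumed positive):

import numpy as np

def sample_circulant_precision(mu, q_first_col, rng=None):
    # One draw from N(mu, Q^{-1}) when Q is circulant with first column q_first_col.
    rng = np.random.default_rng() if rng is None else rng
    lam = np.fft.fft(q_first_col).real             # eigenvalues of Q (real for symmetric Q)
    z = rng.standard_normal(mu.shape[0])
    w = np.fft.ifft(np.fft.fft(z) / np.sqrt(lam))  # apply Q^{-1/2} in the Fourier domain
    return mu + w.real                             # O(d log d) flops, Theta(d) storage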
Algorithm 2.4 Sampler when Q is a block circulant matrix with circulant blocks
Input: M and N, the number of blocks and the size of each block, respectively.
1: Compute $F = F_M \otimes F_N$.    ▷ $F_M$ is the $M \times M$ unitary matrix associated to the Fourier transform and $\otimes$ denotes the tensor product.
2: Draw $z \sim \mathcal{N}(0_d, I_d)$.
3: Set $\Lambda_q = \mathrm{diag}(q)$.    ▷ q is the d-dimensional vector built by stacking the first columns of each circulant block of Q.
4: Set $\theta = \mu + F^{\mathsf{H}} \Lambda_q^{-1/2} F z$.
5: return $\theta$.

2.2.5. Truncated and Intrinsic Gaussian Distributions. Several works have focused
on sampling from various probability distributions closely related to the Gaussian
distribution on $\mathbb{R}^d$. Two cases are worth mentioning here, namely, the truncated
and so-called intrinsic Gaussian distributions. Truncated Gaussian distributions on

$\mathcal{D} \subset \mathbb{R}^d$ admit, for any $\theta \in \mathbb{R}^d$, p.d.f.s of the form

(2.3)    $\pi_{\mathcal{D}}(\theta) = \mathbf{1}_{\mathcal{D}}(\theta) \cdot Z_{\mathcal{D}}^{-1} \exp\left(-\frac{1}{2}(\theta - \mu)^\top \Sigma^{-1}(\theta - \mu)\right),$

where $Z_{\mathcal{D}}$ is the appropriate normalizing constant and $\mathbf{1}_{\mathcal{D}}(\theta) = 1$ if $\theta \in \mathcal{D}$, and 0 otherwise.
The subset $\mathcal{D}$ is usually defined by equalities and/or inequalities. Truncations
on the hypercube are such that $\mathcal{D} = \prod_{i=1}^{d} [a_i, b_i]$, $(a_i, b_i) \in \mathbb{R}^2$, $1 \leq i \leq d$, and truncations
on the simplex are such that $\mathcal{D} = \{\theta \in \mathbb{R}^d \mid \sum_{i=1}^{d} \theta_i = 1\}$. Rejection and Gibbs
sampling algorithms dedicated to these distributions can be found in [5, 60, 107].
Intrinsic Gaussian distributions are such that $\Sigma$ is not of full rank, that is, Q
may have eigenvalues equal to zero. This yields an improper Gaussian distribution
often used as a prior in GMRFs to remove trend components [92]. Sampling from
the latter can be done by identifying an appropriate subspace of $\mathbb{R}^d$ where the target
distribution is proper, and then sampling from the proper Gaussian distribution on
this subspace [12, 81].
The usual special sampling problems above will not be considered in what follows
since they have already been exhaustively reviewed in the literature.

2.3. Problem Statement: Sampling from a Gaussian Distribution with an Arbitrary
Precision Matrix Q. From now on, we will consider and review approaches
aiming at sampling from an arbitrary nonintrinsic multivariate Gaussian distribution
$\mathcal{N}(\mu, Q^{-1})$ with density defined in (2.1), i.e., without assuming any particular structure
of the precision or covariance matrix. If Q is diagonal or well structured, we
saw in subsection 2.2 that sampling can be performed more efficiently than vanilla
Cholesky sampling, even in high dimension. When this matrix is unstructured and
possibly dense, these methods with reduced numerical and storage complexities can
no longer be used. In such settings the main challenges for Gaussian sampling
are directly related to handling the precision Q (or covariance $\Sigma$) matrix in high dimensions.
Typical issues include the storage of the $d^2$ entries of the matrix Q (or $\Sigma$)
and expensive operations of order $\Theta(d^3)$ flops, such as inversion or square roots, which
become prohibitive when d is large. These challenges are illustrated below with an
example that typically arises in statistical learning.

Example 2.1 (Bayesian ridge regression). Let us consider a ridge regression problem
from a Bayesian perspective [13]. For the sake of simplicity and without loss
of generality, assume that the observations $y \in \mathbb{R}^n$ and the known predictor matrix
$X \in \mathbb{R}^{n \times d}$ are such that

(2.4)    $\sum_{i=1}^{n} y_i = 0, \quad \sum_{i=1}^{n} X_{ij} = 0, \quad \text{and} \quad \sum_{i=1}^{n} X_{ij}^2 = 1 \quad \text{for } j \in [d].$

Under these assumptions, we consider the statistical model associated with observations
y written as

(2.5)    $y = X\theta + \varepsilon,$

where $\theta \in \mathbb{R}^d$ and $\varepsilon \sim \mathcal{N}(0_n, \sigma^2 I_n)$. In this example, the standard deviation $\sigma$ is
known and fixed. The conditional prior distribution for $\theta$ is chosen as Gaussian i.i.d.,
that is,

(2.6)    $p(\theta \mid \tau) \propto \exp\left(-\frac{1}{2\tau}\|\theta\|^2\right),$

(2.7)    $p(\tau) \propto \frac{1}{\tau} \mathbf{1}_{\mathbb{R}_+\setminus\{0\}}(\tau),$

where $\tau > 0$ is an unknown variance parameter which is given a diffuse and improper
(i.e., nonintegrable) Jeffreys prior [53, 86]. Bayes' rule then leads to the target
joint posterior distribution with density

(2.8)    $p(\theta, \tau \mid y) \propto \frac{1}{\tau} \mathbf{1}_{\mathbb{R}_+\setminus\{0\}}(\tau) \exp\left(-\frac{1}{2\tau}\|\theta\|^2 - \frac{1}{2\sigma^2}\|y - X\theta\|^2\right).$

Sampling from this joint posterior distribution can be conducted using a Gibbs sampler
[36, 87], which sequentially samples from the conditional posterior distributions.
In particular, the conditional posterior distribution associated to $\theta$ is Gaussian with
precision matrix and mean vector

(2.9)    $Q = \frac{1}{\sigma^2} X^\top X + \tau^{-1} I_d,$

(2.10)   $\mu = \frac{1}{\sigma^2} Q^{-1} X^\top y.$
Challenges related to handling the matrix Q already appear in this classical and
simple regression problem. Indeed, Q is possibly high-dimensional and dense, which
potentially rules out its storage; see Table 1. The inversion required to compute the
mean (2.10) may be very expensive as well. In addition, since $\tau$ is unknown, its value
changes at each iteration of the Gibbs sampler used to sample from the joint distribution
with density (2.8). Hence, precomputing the matrix $Q^{-1}$ is not possible. As
an illustration on real data, Figure 3 represents three examples of precision matrices²
$X^\top X$ for the MNIST [57], leukemia [6], and CoEPrA [25] datasets. One can note that
these precision matrices are potentially both high-dimensional and dense, penalizing
their numerical inversion at each iteration of the Gibbs sampler.
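To make this cost concrete, here is a small NumPy sketch of one $\theta$-update of such a Gibbs sampler, drawing from the Gaussian conditional defined by (2.9)–(2.10) with a dense Cholesky factorization (illustrative only; X, y, sigma2, and tau are placeholders):

import numpy as np

def sample_theta_conditional(X, y, sigma2, tau, rng=None):
    # One Gibbs update: draw theta ~ N(mu, Q^{-1}) with Q and mu as in (2.9)-(2.10).
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    Q = X.T @ X / sigma2 + np.eye(d) / tau     # dense d x d matrix: Theta(d^2) storage
    C = np.linalg.cholesky(Q)                  # Theta(d^3) flops, redone at every iteration
    mu = np.linalg.solve(Q, X.T @ y / sigma2)  # mean (2.10), another Theta(d^3) solve
    w = np.linalg.solve(C.T, rng.standard_normal(d))  # C^T w = z, so w ~ N(0, Q^{-1})
    return mu + w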
Hosts of past contributions are related to high-dimensional Gaussian sampling:
it is impossible to cite them in an exhaustive manner. As far as possible, the following
review aims at gathering and citing the main contributions. We refer the reader

² When considering the dataset itself, $X^\top X$ is usually interpreted as the empirical covariance of
the data X. The reader should not be disturbed by the fact that, turning to the variable $\theta$ to infer,
$X^\top X$ will, however, play the role of a precision matrix.


Fig. 3  Examples of precision matrices $X^\top X$ for three datasets. Left: MNIST dataset [57]; only
the predictors associated to the digits 5 and 3 have been taken into account. Middle: leukemia
dataset [6]; only the first 5,000 predictors (out of 12,600) have been taken into account.
Right: CoEPrA dataset [25].

to references therein for more details. Next, sections 3 and 4 review the two main
families of approaches that deal with the sampling issues raised above. In section 3,
we deal with approaches derived from numerical linear algebra. On the other hand,
section 4 deals with MCMC sampling approaches. Section 5 then proposes a unifying
revisit to Gibbs samplers thanks to a stochastic counterpart of the PPA. Similarly
to subsection 2.2, computational costs, storage requirements, and accuracy of the re-
viewed Gaussian sampling approaches will be detailed for each method in a dedicated
paragraph entitled Algorithmic Efficiency. For a synthetic summary and comparison
of these metrics, we refer the interested reader to Table 7.
3. Sampling Algorithms Derived from Numerical Linear Algebra. This sec-
tion presents sampling approaches corresponding to direct adaptations of classical
techniques used in numerical linear algebra [41]. They can be divided into three main
groups: (i) factorization methods that consider appropriate decompositions of Q; (ii)
inverse square root approximation approaches, where approximations of $Q^{-1/2}$ are
used to obtain samples from $\mathcal{N}(\mu, Q^{-1})$ at a reduced cost compared to factorization
approaches; and (iii) conjugate gradient-based methods.
3.1. Factorization Methods. We begin this review with the most basic but com-
putationally involved sampling techniques, namely, factorization approaches that were
introduced in section 1. These methods exploit the positive definiteness of Q to de-
compose it as a product of simpler matrices and are essentially based on the celebrated
Cholesky factorization [20]. Though helpful for problems in small or moderate dimen-
sions, these basic sampling approaches fail to address, in high-dimensional scenarios,
the computational and memory issues raised in subsection 2.3.
3.1.1. Cholesky Factorization. Since Q is symmetric and positive definite, we
noted in (1.1) that there exists a unique lower triangular matrix $C \in \mathbb{R}^{d \times d}$, called
the Cholesky factor, with positive diagonal entries such that $Q = CC^\top$ [41]. Algorithm
3.1 details how such a decomposition³ can be used to obtain a sample $\theta$ from
$\mathcal{N}(\mu, Q^{-1})$.
Algorithmic Efficiency. In the general case where Q presents no particular structure,
the computational cost is $\Theta(d^3)$ and the storage requirement is $\Theta(d^2)$; see also section 1

³ When working with the covariance matrix rather than the precision matrix, the Cholesky decomposition
$\Sigma = LL^\top$ leads to the simpler step 3: $w = Lz$.


and Table 1. If the dimension d is large but the matrix Q has a sparse structure, the
computational and storage requirements of the Cholesky factorization can be reduced
by reordering the components of Q to design an equivalent band matrix [91]; see
subsection 2.2 and Algorithm 2.3.

Algorithm 3.1 Cholesky sampler
1: Set C = chol(Q).    ▷ Build the Cholesky factor C of Q; see [92, section 2.4].
2: Draw $z \sim \mathcal{N}(0_d, I_d)$.
3: Solve $C^\top w = z$ w.r.t. w.
4: return $\theta = \mu + w$.
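A minimal NumPy/SciPy sketch of Algorithm 3.1 (the function name is ours):

import numpy as np
from scipy.linalg import solve_triangular

def cholesky_sample(mu, Q, rng=None):
    # Algorithm 3.1: Theta(d^3) flops for the factorization, Theta(d^2) storage.
    rng = np.random.default_rng() if rng is None else rng
    C = np.linalg.cholesky(Q)                          # Q = C C^T with C lower triangular
    z = rng.standard_normal(mu.shape[0])
    w = solve_triangular(C, z, trans='T', lower=True)  # solve C^T w = z by back-substitution
    return mu + w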

3.1.2. Square Root Factorization. The Cholesky factorization in the previous
paragraph was used to decompose Q into a product of a triangular matrix C and
its transpose. Then a Gaussian sample was obtained by solving a triangular linear
system. An extension of this approach was considered in [24], which performed a
singular value decomposition (SVD) of the Cholesky factor C, yielding $Q = B^2$
with $B = U\Lambda^{1/2}U^\top$, where $\Lambda$ is diagonal and U is orthogonal. Similar to the
Cholesky factor and given $z \sim \mathcal{N}(0_d, I_d)$, this square root can then be used to obtain
an arbitrary Gaussian sample by solving $Bw = z$ w.r.t. w and computing $\theta = \mu + w$.
Algorithmic Efficiency. Although the square root factorization is interesting for establishing
the existence of B, the associated sampler is generally as computationally
demanding as Algorithm 3.1 since the eigendecomposition of Q is not cheaper than
finding its Cholesky factor. To avoid these computational problems, and since samplers
based on B boil down to computing $\theta = \mu + B^{-1}z$, some works have focused on
approximations of the inverse square root $B^{-1}$ of Q that require smaller computational
and storage complexities.
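For illustration, a minimal NumPy sketch of a square-root sampler based on an eigendecomposition of Q (computed here directly with numpy.linalg.eigh rather than via an SVD of the Cholesky factor; as noted above, its cost is comparable to Algorithm 3.1):

import numpy as np

def eig_sqrt_sample(mu, Q, rng=None):
    # Q = U diag(lam) U^T, so B^{-1} z = U diag(lam^{-1/2}) U^T z.
    rng = np.random.default_rng() if rng is None else rng
    lam, U = np.linalg.eigh(Q)                  # Theta(d^3) eigendecomposition
    z = rng.standard_normal(mu.shape[0])
    return mu + U @ ((U.T @ z) / np.sqrt(lam))  # apply B^{-1} without forming it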
3.2. Inverse Square Root Approximations. This idea of finding an efficient
(compared to the costs associated to factorization approaches) way to approximate
the inverse square root $B^{-1}$ dates back at least to the 1980s with the work of Davis [24],
who derived a polynomial approximation of the function $x \mapsto x^{1/2}$ to approximate the
square root of a given covariance matrix. Other works used Krylov-based approaches,
building on the Lanczos decomposition to directly approximate any matrix-vector
product $B^{-1}z$ involving the square root B. The following two paragraphs review
these methods.
3.2.1. Polynomial Approximation. In subsection 3.1, we showed that the square
root of Q can be written as $B = U\Lambda^{1/2}U^\top$, which implies that $Q = U\Lambda U^\top$. If f is
a real continuous function, Q and $f(Q) = Uf(\Lambda)U^\top$ are diagonalizable with respect
to the same eigenbasis U, where $f(\Lambda) \triangleq \mathrm{diag}(f(\lambda_1), \ldots, f(\lambda_d))$. This is a well-known
result coming from the Taylor expansion of a real continuous function f. Hence, a
function f such that f(Q) is a good approximation of $B^{-1} = Q^{-1/2}$ has to be such
that

$f(\lambda_i) \approx 1/\sqrt{\lambda_i} \quad \forall i \in [d].$

Since f only needs to be evaluated at the points corresponding to the eigenvalues
$\{\lambda_i\}_{i \in [d]}$ of Q, it suffices to find a good approximation of $B^{-1}$ on the interval
$[\lambda_{\min}, \lambda_{\max}]$, whose extremal values can be lower and upper bounded easily using the


Gershgorin circle theorem [41, Theorem 7.2.1]. In the literature [24, 51, 82], the function
f has been built using Chebyshev polynomials [70], which are a family $(T_k)_{k \in \mathbb{N}}$
of polynomials defined over $[-1, 1]$ by

$T_k(x) = \cos(k\alpha), \quad \text{where } \forall \alpha \in \mathbb{R},\ x = \cos(\alpha),$

or by the recursion

(3.1)    $T_0(x) = 1, \quad T_1(x) = x, \quad T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x) \quad (\forall k \geq 1).$

This family of polynomials $(T_k)_{k \in \mathbb{N}}$ exhibits several interesting properties: uniform
convergence of the Chebyshev series toward an arbitrary Lipschitz-continuous function
over $[-1, 1]$ and a near-minimax property [70], along with fast computation of the
coefficients of the series via the fast Fourier transform [84]. Algorithm 3.2 describes the
steps to generating arbitrary Gaussian vectors using this polynomial approximation.
Algorithmic Efficiency. Contrary to the factorization methods detailed in subsection 3.1,
Algorithm 3.2 does not require the storage of Q since only the computation of matrix-vector
products of the form Qv with $v \in \mathbb{R}^d$ is necessary. Assuming that these
operations can be computed efficiently in $\mathcal{O}(d^2)$ flops with some black-box routine,
e.g., a fast wavelet transform [64], Algorithm 3.2 admits an overall computational
cost of $\mathcal{O}(K_{\mathrm{cheby}} d^2)$ flops and a storage capacity of $\Theta(d)$, where $K_{\mathrm{cheby}}$ is a truncation
parameter giving the order of the polynomial approximation. When $K_{\mathrm{cheby}} \ll d$, Algorithm
3.2 becomes more computationally efficient than vanilla Cholesky sampling
while admitting a reasonable memory overhead. For a sparse precision matrix Q
composed of $n_{\mathrm{nz}}$ nonzero entries, the computational complexity can be reduced down to
$\mathcal{O}(K_{\mathrm{cheby}} n_{\mathrm{nz}})$ flops. Note that compared to factorization approaches, Algorithm 3.2
involves an additional source of approximation coming from the order of the Chebyshev
series $K_{\mathrm{cheby}}$. Choosing this parameter in an adequate manner involves some
fine-tuning or additional computationally intensive statistical tests [82].
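A NumPy sketch of this Chebyshev-based sampler (cf. Algorithm 3.2), which only accesses Q through a user-supplied matrix-vector product; the function name, the matvec_Q argument, and the requirement lam_l > 0 are ours:

import numpy as np

def chebyshev_sqrt_inv_sample(mu, matvec_Q, lam_l, lam_u, K=50, rng=None):
    # Approximate draw from N(mu, Q^{-1}) using a degree-K Chebyshev approximation
    # of x -> x^{-1/2} on [lam_l, lam_u], an interval assumed to bracket the spectrum
    # of Q (e.g., obtained from the Gershgorin circle theorem).
    rng = np.random.default_rng() if rng is None else rng
    d = mu.shape[0]
    # Chebyshev nodes mapped to [lam_l, lam_u] and coefficients of the truncated series.
    j = np.arange(K + 1)
    t = np.cos(np.pi * (2 * j + 1) / (2 * (K + 1)))
    g = ((lam_u - lam_l) / 2 * t + (lam_u + lam_l) / 2) ** (-0.5)
    k = np.arange(K + 1)[:, None]
    c = 2.0 / (K + 1) * np.sum(g * np.cos(np.pi * k * (2 * j + 1) / (2 * (K + 1))), axis=1)
    # Three-term recurrence applied to z, using only matrix-vector products with Q.
    z = rng.standard_normal(d)
    alpha = 2.0 / (lam_u - lam_l)
    beta = (lam_u + lam_l) / (lam_u - lam_l)
    u0, u1 = z, alpha * matvec_Q(z) - beta * z
    u = 0.5 * c[0] * u0 + c[1] * u1
    for kk in range(2, K + 1):
        u0, u1 = u1, 2.0 * (alpha * matvec_Q(u1) - beta * u1) - u0
        u += c[kk] * u1
    return mu + u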
3.2.2. Lanczos Approximation. Instead of approximating the inverse square
root $B^{-1}$, some approaches directly approximate the matrix-vector product $B^{-1}z$
by building on the Lanczos decomposition [7, 21, 51, 52, 98, 99]. The corresponding
simulation-based algorithm is described in Algorithm 3.3. It iteratively builds
an orthonormal basis $H = \{h_1, \ldots, h_{K_{\mathrm{kryl}}}\} \in \mathbb{R}^{d \times K_{\mathrm{kryl}}}$ with $K_{\mathrm{kryl}} \leq d$ for the
Krylov subspace $\mathcal{K}_{K_{\mathrm{kryl}}}(Q, z) \triangleq \mathrm{span}\{z, Qz, \ldots, Q^{K_{\mathrm{kryl}}-1}z\}$, and a tridiagonal
matrix $T \approx H^\top Q H \in \mathbb{R}^{K_{\mathrm{kryl}} \times K_{\mathrm{kryl}}}$. Using the orthogonality of H and the fact that
$z = \|z\| H e_1$, where $e_1$ is the first canonical vector of $\mathbb{R}^{K_{\mathrm{kryl}}}$, the final approximation is

(3.2)    $B^{-1}z = Q^{-1/2}z \approx \|z\| H T^{-1/2} H^\top H e_1 = \|z\| H T^{-1/2} e_1.$

As highlighted in (3.2), the main idea of the Lanczos approximation is to transform
the computation of $Q^{-1/2}z$ into the computation of $T^{-1/2}e_1$, which is expected to be
simpler since T is tridiagonal and of size $K_{\mathrm{kryl}} \times K_{\mathrm{kryl}}$.
Algorithmic Efficiency. Numerous approaches have been proposed to compute $T^{-1/2}e_1$
exactly or approximately, and they generally require $\mathcal{O}(K_{\mathrm{kryl}}^2)$ flops [22, 44]; see also
Algorithm 3.2. By using such approaches, Algorithm 3.3 admits a computational complexity
of $\mathcal{O}(K_{\mathrm{kryl}} d^2)$ and a memory requirement of $\mathcal{O}(K_{\mathrm{kryl}} d)$. Similarly to $K_{\mathrm{cheby}}$ in


Algorithm 3.2 Approximate square root sampler using Chebyshev polynomials
1: Set $\lambda_l = 0$ and $\lambda_u = \max_{i \in [d]} \sum_{j \in [d]} |Q_{ij}|$.
2: for $j \in \{0, \ldots, K_{\mathrm{cheby}}\}$ do    ▷ Do the change of interval.
3:   Set $g_j = \left[\cos\left(\pi \frac{2j+1}{2K_{\mathrm{cheby}}}\right)\frac{\lambda_u - \lambda_l}{2} + \frac{\lambda_u + \lambda_l}{2}\right]^{-1/2}$.
4: end for
5: for $k \in \{0, \ldots, K_{\mathrm{cheby}}\}$ do    ▷ Compute coefficients of the $K_{\mathrm{cheby}}$-truncated Chebyshev series.
6:   Compute $c_k = \frac{2}{K_{\mathrm{cheby}}} \sum_{j=0}^{K_{\mathrm{cheby}}} g_j \cos\left(\pi k \frac{2j+1}{2K_{\mathrm{cheby}}}\right)$.
7: end for
8: Draw $z \sim \mathcal{N}(0_d, I_d)$.
9: Set $\alpha = \frac{2}{\lambda_u - \lambda_l}$ and $\beta = \frac{\lambda_u + \lambda_l}{\lambda_u - \lambda_l}$.
10: Initialize $u_1 = \alpha Q z - \beta z$ and $u_0 = z$.
11: Set $u = \frac{1}{2} c_0 u_0 + c_1 u_1$ and $k = 2$.
12: while $k \leq K_{\mathrm{cheby}}$ do    ▷ Compute the $K_{\mathrm{cheby}}$-truncated Chebyshev series.
13:   Compute $u' = 2(\alpha Q u_1 - \beta u_1) - u_0$.
14:   Set $u = u + c_k u'$.
15:   Set $u_0 = u_1$ and $u_1 = u'$.
16:   Set $k = k + 1$.
17: end while
18: Set $\theta = \mu + u$.    ▷ Build the Gaussian vector of interest.
19: return $\theta$.

Algorithm 3.2, one can note that $K_{\mathrm{kryl}}$ represents a trade-off between computation,
storage, and accuracy. As emphasized in [7, 98], adjusting this truncation parameter
can be achieved by using the conjugate gradient (CG) algorithm. In addition to providing
an approximate sampling technique when $K_{\mathrm{kryl}} < d$, the main and well-known
drawback of Algorithm 3.3 is that the basis H loses orthogonality in floating point
arithmetic due to round-off errors. Some possibly complicated procedures to cope
with this problem are surveyed in [100]. Finally, one major problem of the Lanczos
decomposition is the construction and storage of the basis $H \in \mathbb{R}^{d \times K_{\mathrm{kryl}}}$, which becomes
as large as Q when $K_{\mathrm{kryl}}$ tends to d. Two main approaches have been proposed
to deal with this problem, namely, a so-called 2-pass strategy and a restarting strategy,
both reviewed in [7, 52, 98]. In addition, preconditioning methods have been
proposed to reduce the computational burden of Lanczos samplers [21].
3.2.3. Other Square Root Approximations. At least two other methods have
been proposed to approximate the inverse square root $B^{-1}$. Since these approaches
have been used less than others, only their main principle is given, and we refer
the interested reader to the corresponding references. The first one is the rational
approximation of $B^{-1}$ based on numerical quadrature of a contour integral [45], while
the other one is a continuous deformation method based on a system of ordinary
differential equations [4]. These two approaches are reviewed and illustrated using
numerical examples in [7].

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


18 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

Algorithm 3.3 Approximate square root sampler using Lanczos decomposition


1: Draw z \sim \scrN (0d , Id ). \bigm\| \bigm\|
\bigm\| \bigm\|
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

2: Set r(0) = z, h(0) = 0d , \beta (0) = \bigm\| r(0) \bigm\| and h(1) = r(0) /\beta (0) .
3: for k \in [Kkryl ] do
4: Set w = Qh(k) - \beta (k - 1) h(k - 1) .
5: Set \alpha (k) = w\top h(k) .
6: Set w = w - \alpha (k) h(k) . \triangleleft Gram--Schmidt orthogonalization process.
7: Set \beta (k) = \| w\| .
8: Set h(k+1) = w/\beta (k) .
9: end for
10: Set T = tridiag(\bfitbeta , \bfitalpha , \bfitbeta ).
11: Set H = (h(1) , . . . , h(Kkryl ) ).
12:
13: Set \bfittheta = \bfitmu + \beta (0) HT - 1/2 e1 , where e1 = (1, 0, . . . , 0)\top \in \BbbR Kkryl .
14: return \bfittheta .

3.3. Conjugate Gradient–Based Samplers. Instead of building on factorization


results, some approaches start from the finding that Gaussian densities with invertible
precision matrix Q can be rewritten in a so-called information form, that is,
\biggl( \biggr)
1
(3.3) \pi (\bfittheta ) \propto exp - \bfittheta \top Q\bfittheta + b\top \bfittheta ,
2

where b = Q\bfitmu is called the potential vector. If one is able to draw a Gaussian vector
z\prime \sim \scrN (0d , Q), then a sample \bfittheta from \scrN (\bfitmu , Q - 1 ) is obtained by solving the linear
system

(3.4) Q\bfittheta = b + z\prime ,

where Q is positive definite, so that CG methods are relevant. This approach uses
the affine transformation of a Gaussian vector u = b + z\prime : if u \sim \scrN (Q\bfitmu , Q), then
Q - 1 u \sim \scrN (\bfitmu , Q - 1 ).
3.3.1. Perturbation before Optimization. A first possibility to handling the
perturbed linear problem (3.4) consists of first computing the potential vector b, then
perturbing this vector with the Gaussian vector z\prime , and finally solving the linear system
with numerical algebra techniques. This approach is detailed in Algorithm 3.4. While
the computation of b is not difficult in general, drawing z\prime might be computationally
involved. Hence, this sampling approach is of interest only if we are able to draw
the Gaussian vector z\prime efficiently (i.e., in \scrO (d2 ) flops). This is, for instance, the case
- 1
when Q = Q1 + Q2 with Qi = G\top i \Lambda i Gi (i \in [2]), provided that the symmetric and
positive-definite matrices \{ \Lambda i \} i\in [2] have simple structures; see subsection 2.2. Such
situations often arise when Bayesian hierarchical models are considered [86, Chapter
10]. For these scenarios, an efficient way to compute b + z\prime has been proposed in [79]
based on a local perturbation of the mean vectors \{ \bfitmu i \} i\in [2] . Such an approach has
been coined perturbation-optimization (PO) since it draws perturbed versions of the
mean vectors involved in the hierarchical model before using them to define the linear
system to be solved [79].

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 19

Algorithm 3.4 Perturbation-optimization sampler


1: Draw z\prime \sim \scrN (0d , Q). \triangleleft with local perturbation as in [79].
2: Set \bfiteta = b + z\prime .
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

3: Solve Q\bfittheta = \bfiteta w.r.t. \bfittheta . \triangleleft with the CG solver as used, for instance, in [47].
4: return \bfittheta .

Algorithmic Efficiency. If K \in \BbbN \ast iterations of an appropriate linear solver (e.g., the
CG method) are used for step 3 in Algorithm 3.4, the global computational and stor-
age complexities of this algorithm are of order \scrO (Kd2 ) and \Theta (d). Regarding sampling
accuracy, while Algorithm 3.4 in theory is an exact approach, the K-truncation pro-
cedure implies an approximate sampling scheme [77]. A solution to correct this bias
has been proposed in [37] which builds upon a reversible-jump approach [43].
3.3.2. Optimization with Perturbation. Alternatively, (3.4) can also be seen as
a perturbed version of the linear system Q\bfittheta = b. Thus, some works have focused
on modified versions of well-known linear solvers such as the CG solver [81, 91, 95].
Actually, only one additional line of code providing a univariate Gaussian sampling
step (perturbation) is required to turn the classical CG solver into a CG sampler
[81, 95]; see step 8 in Algorithm 3.5. This perturbation step sequentially builds a
Gaussian vector with a covariance matrix that is the best k-rank approximation of
Q - 1 in the Krylov subspace \scrK k (Q, r(0) ) [81]. Then a perturbation vector y(KCG ) is
simulated before addition to \bfitmu so that finally \bfittheta = \bfitmu + y(KCG ) .

Algorithm 3.5 Conjugate gradient sampler


Input: Threshold \epsilon > 0, fixed initialization \bfitomega (0) , and random vector c \in \BbbR d .
(0)
1: Set k =\bigm\| 1, r\bigm\| = c - Q\bfitomega (0) , h(0) = r(0) , d(0) = h(0)\top Qh(0) , and y(0) = \bfitomega (0) .
\bigm\| (k) \bigm\|
2: while \bigm\| r \bigm\| \geq \epsilon do

r(k - 1)\top r(k - 1)


3: Set \gamma (k - 1) = .
d(k - 1)
4: Set r(k) = r(k - 1) - \gamma (k - 1) Qh(k - 1) .
r(k)\top r(k)
5: Set \eta (k) = - (k - 1)T (k - 1) .
r r
6: Set h(k) = r(k) - \eta (k) h(k - 1) .
7: Set d(k) = h(k)\top Qh(k) .
z
8: Set y(k) = y(k - 1) + \surd h(k - 1) , where z \sim \scrN (0, 1).
d (k - 1)
9: k = k + 1.
10: end while
11: Set \bfittheta = \bfitmu + y(KCG ) , where KCG is the number of CG iterations.
12: return \bfittheta .

Algorithmic Efficiency. From a computational point of view, the CG sampler inherits


the benefits of the CG solver: only matrix-vector products involving Q and the stor-
age of two d-dimensional vectors are needed, and one exact sample from \scrN (\bfitmu , Q - 1 )
is obtained after at most KCG = d iterations. This yields an approximate computa-
tional cost of \scrO (KCG d2 ) flops and a storage requirement of \Theta (d), where KCG is the
number of CG iterations [47]. The CG sampler belongs to the family of Krylov-based
samplers (e.g., Lanczos). As such, it suffers from the same numerical problem due to

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


20 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

finite machine precision and the KCG -truncation procedure. In addition, the covari-
ance of the generated samples depends on the distribution of the eigenvalues of the
matrix Q. Actually, if these eigenvalues are not well spread out, Algorithm 3.5 stops
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

after KCG < d iterations, which yields an approximate sample with the best KCG -
rank approximation of Q - 1 as the actual covariance matrix. In order to correct this
approximation, reorthogonalization schemes can be employed but could become as
computationally prohibitive as Cholesky sampling when d is large [95]. These sources
of approximation are detailed in [81]. A generalization of Algorithm 3.5 has been con-
sidered in [30], where a random set of K \prime mutually conjugate directions \{ h(k) \} k\in [K \prime ]
is considered at each iteration of a Gibbs sampler.

4. Sampling Algorithms Based on MCMC. The previous section presented ex-


isting Gaussian sampling approaches by directly adapting ideas and techniques from
numerical linear algebra such as matrix decompositions and matrix approximations.
In this section, we will present another family of sampling approaches, namely, MCMC
approaches, which build a discrete-time Markov chain (\bfittheta (t) )t\in \BbbN with \scrN (\bfitmu , Q - 1 ) (or a
close approximation of \scrN (\bfitmu , Q - 1 )) as its invariant distribution [87]. In what follows,
we state that an MCMC approach is exact if the associated sampler admits an invari-
ant distribution that coincides with \scrN (\bfitmu , Q - 1 ). Contrary to the approaches reviewed
in section 3, which produce i.i.d. samples from \scrN (\bfitmu , Q - 1 ) or a close approximation to
it, MCMC approaches produce correlated samples that are asymptotically distributed
according to their invariant distribution. Hence, at a first glance, it seems natural
to think that MCMC samplers are less trustworthy since the number of iterations
required until convergence is very difficult to assess in practice [49]. Interestingly,
we will show numerically in section 6 that MCMC methods might perform better
than i.i.d. samplers in some cases and thus might be serious contenders for the most
efficient Gaussian sampling algorithms; see also [32]. Furthermore, we will also show
that most of these MCMC approaches can be unified via a stochastic version of the
PPA [90]. This framework will be presented in section 5.

4.1. Matrix Splitting. We begin the review of MCMC samplers by detailing so-
called matrix-splitting (MS) approaches that build on the decomposition Q = M - N
of the precision matrix. As we shall see, both exact and approximate MS samplers
have been proposed in the existing literature. These methods embed one of the
simplest MCMC methods, namely, the componentwise Gibbs sampler [36]. Similarly
to Algorithm 3.1 for samplers in section 3, it can be viewed as one of the simplest and
most straightforward approaches to sampling from a target Gaussian distribution.

4.1.1. Exact Matrix Splitting. Given the multivariate Gaussian distribution


\scrN (\bfitmu , Q - 1 ) with density \pi in (2.1), an attractive and simple option is to sequentially
draw one component of \bfittheta given the others. This is the well-known componentwise
Gibbs sampler; see Algorithm 4.1 [35, 36, 92]. The main advantage of Algorithm 4.1 is
its simplicity and the low cost per sweep (i.e., internal iteration) of \scrO (d2 ) flops, which
is comparable with Cholesky applied to Toeplitz covariance matrices [102]. More gen-
erally, one can also consider random sweeps over the d components of \bfittheta or blockwise
strategies which update simultaneously several components of \bfittheta . The analysis of these
strategies and their respective convergence rates are detailed in [88].
In [1, 10, 42], the authors showed by rewriting Algorithm 4.1 using a matrix
formulation that it actually provides a stochastic version of the Gauss--Seidel linear
solver that relies on the decomposition Q = L + D + L\top , where L and D are the

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 21

Algorithm 4.1 Componentwise Gibbs sampler


Input: Number T of iterations and initialization \bfittheta (0) .
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

1: Set t = 1.
2: while t \leq T do
3: for i \in [d] do
4: Draw z \sim \scrN (0, 1). \left( \right)
(t) [Q\bfitmu ]i z 1 \sum (t - 1)
\sum (t)
5: Set \theta i = + \surd - Qij \theta j + Qij \theta j .
Qii Qii Qii j>i j<i
6: end for
7: Set t = t + 1.
8: end while
9: return \bfittheta (T ) .

strictly lower triangular and diagonal parts of Q, respectively. Indeed, each iteration
solves the linear system

(4.1) (L + D)\bfittheta (t) = Q\bfitmu + D1/2 z - L\top \bfittheta (t - 1) ,

where z \sim \scrN (0d , Id ). By setting M = L + D and N = - L\top so that Q = M - N, the


updating rule (4.1) can be written as solving the usual Gauss--Seidel linear system

(4.2) M\bfittheta (t) = Q\bfitmu + z


\~ + N\bfittheta (t - 1) ,

where N = - L\top is strictly upper triangular and z


\~ \sim \scrN (0d , D) is easy to sample.
Interestingly, (4.2) is a perturbed instance of MS schemes which are a class of
linear iterative solvers based on the splitting of Q into Q = M - N [41, 93]. Capital-
izing on this one-to-one equivalence between samplers and linear solvers, the authors
in [32] extended Algorithm 4.1 to other MCMC samplers based on different matrix
splittings Q = M - N. These are reported in Table 2 and yield Algorithm 4.2. The
acronym SOR stands for successive overrelaxation.

Table 2 Examples of MS schemes for \bfQ which can be used in Algorithm 4.2. The matrices \bfD and
\~
\bfL denote the diagonal and strictly lower triangular parts of \bfQ , respectively. The vector \bfz
is the one appearing in step 3 of Algorithm 4.2 and \omega is a positive scalar.

\bfS \bfa \bfm \bfp \bfl \bfe \bfr \bfM \bfN \bfz ) = \bfM \top + \bfN
cov(\~ Convergence
Richardson \bfI d /\omega \bfI d /\omega - \bfQ 2\bfI d /\omega - \bfQ 0 < \omega < 2/ \| \bfQ \|
\sum
Jacobi \bfD \bfD - \bfQ 2\bfD - \bfQ | Qii | > j\not =i | Qij | \forall i \in [d]
Gauss--Seidel \bfD + \bfL - \bfL \top \bfD always
1 - \omega 2 - \omega
SOR \bfD /\omega + \bfL \omega
\bfD - \bfL \top \omega
\bfD 0 < \omega < 2

Algorithmic Efficiency. Similarly to linear solvers, Algorithm \bigl( 4.2


\bigr) is guaranteed to con-
verge toward the correct distribution \scrN (\bfitmu , Q - 1 ) if \rho M - 1 N < 1, where \rho (\cdot ) is the
spectral radius of a matrix. In practice, Algorithm 4.2 is stopped after T iterations
and the error between the distribution of \bfittheta (T ) and \scrN (\bfitmu , Q - 1 ) can be assessed quan-
titatively; see [32]. The computational efficiency of Algorithm 4.2 is directly related
to the complexity of solving the linear systems M\bfittheta (t) = Q\bfitmu + z \~ + N\bfittheta (t - 1) , similar

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


22 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

Algorithm 4.2 MCMC sampler based on exact matrix splitting


Input: Number T of iterations, initialization \bfittheta (0) , and splitting Q = M - N.
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

1: Set t = 1.
2: while t \leq T do
3: Draw z \~ \sim \scrN (0d , M\top + N).
4: Solve M\bfittheta (t) = Q\bfitmu + z \~ + N\bfittheta (t - 1) w.r.t. \bfittheta (t) .
5: Set t = t + 1.
6: end while
7: return \bfittheta (T ) .

to (4.2), and the difficulty of sampling z \~ with covariance M\top + N. As pointed out
\top
in [32], the simpler M, the denser M + N and the more difficult the sampling of
\~ For instance, Jacobi and Richardson schemes yield a simple diagonal linear system
z.
requiring \scrO (d) flops, but one has to sample from a Gaussian distribution with an
arbitrary covariance matrix; see step 3 of Algorithm 4.2. Iterative samplers requiring
at least K steps such as those reviewed in section 3 can be used. This yields a com-
putational burden of \scrO (KT d2 ) flops for step 3 and as such Jacobi and Richardson
samplers admit a computational cost of \scrO (KT d2 ) and a storage requirement of \Theta (d).
On the other hand, both Gauss--Seidel and SOR schemes are associated to a simple
sampling step which can be performed in \scrO (d) flops with Algorithm 2.2, but one
has to solve a lower triangular system which can be done in \scrO (d2 ) flops via forward
substitution. In order to mitigate the trade-off between steps 3 and 4, approximate
MS approaches have been proposed recently [9, 54]; see subsection 4.1.2.
Polynomial Accelerated Gibbs Samplers. When the splitting Q = M - N is
symmetric, that is, both M and N are symmetric matrices, the rate of convergence of
Algorithm 4.2 can be improved by using polynomial preconditioners [32]. For ease of
presentation, we will first explain how such a preconditioning accelerates linear solvers
based on MS, before building upon the one-to-one equivalence between linear solvers
and Gibbs samplers to show how Algorithm 4.2 can be accelerated. Given a linear
system Q\bfittheta = v for v \in \BbbR d , linear solvers based on the MS, Q = M - N, consider
the recursion, for any t \in \BbbN and \bfittheta (0) \in \BbbR d , \bfittheta (t+1) = \bfittheta (t) + M - 1 (v - Q\bfittheta (t) ). The
error at iteration t defined by e(t+1) = \bfittheta (t+1) - Q - 1 v can be shown to be equal to
(Id - M - 1 Q)t e(0) [41]. Since this error is a tth-order polynomial of M - 1 Q, it is then
natural to wonder whether one can find another tth order polynomial \sansP t that achieves
a lower error, that is, \rho (\sansP t (M - 1 Q)) < \rho ((Id - M - 1 Q)t ). This can be accomplished
by considering the second-order iterative scheme defined, for any t \in \BbbN , by [8]

\bfittheta (t+1) = \alpha t \bfittheta (t) + (1 - \alpha t )\bfittheta (t - 1) + \beta t M - 1 (v - Q\bfittheta (t) ) ,

where (\alpha t , \beta t )t\in \BbbN is a set of acceleration parameters. This iterative method yields
an error at step t given by e(t+1) = \sansP t (M - 1 Q)e(0) , where \sansP t is a scaled Chebyshev
polynomial; see (3.1). Optimal values for (\alpha t , \beta t )t\in \BbbN are given by [8]
\Bigl( \Bigr) - 1
\alpha t = \tau 1 \beta t and \beta t = \tau 1 - \tau 22 \beta t - 1 ,

\tau 1 = [\lambda min (M - 1 Q) + \lambda max (M - 1 Q)]/2 and \tau 2 = [\lambda max (M - 1 Q) - \lambda min (M - 1 Q)]/4.
Note that these optimal choices suppose that the minimal and maximal eigenvalues
of M - 1 Q are real valued and available. The first requirement is satisfied, for instance,

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 23

if the splitting Q = M - N is symmetric, while the second one is met by using the
CG algorithm as explained in [32]. In the literature [32, 88], a classical symmetric
splitting scheme that has been considered is derived from the SOR splitting and as
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

such is called symmetric SOR (SSOR). Denote by MSOR and NSOR the matrices
involved in the SOR splitting such that Q = MSOR - NSOR ; see row 4 of Table 2.
Then, for any 0 < \omega < 2, the SSOR splitting is defined by Q = MSSOR - NSSOR with
\omega \omega
MSSOR = MSOR D - 1 M\top
SOR and NSSOR = NSOR D - 1 N\top
SOR .
2 - \omega 2 - \omega
By resorting to the one-to-one equivalence between linear solvers and Gibbs samplers,
[31, 32] showed that the acceleration above via Chebyshev polynomials can be applied
to Gibbs samplers based on a symmetric splitting. In this context, the main challenge
when dealing with accelerated Gibbs samplers compared to accelerated linear solvers
is the calibration of the noise covariance to ensure that the invariant distribution
coincides with \scrN (\bfitmu , Q - 1 ). For the sake of completeness, the pseudocode associated
to an accelerated version of Algorithm 4.2 based on the SSOR splitting is detailed
in Algorithm 4.3. Associated convergence results and numerical studies associated to
this algorithm can be found in [31, 32].

Algorithm 4.3 Chebyshev accelerated SSOR sampler


Input: SSOR tuning parameter 0 < \omega < 2, extreme eigenvalues \lambda min (M - 1 SSOR Q),
and \lambda max (M - 1 - 1
SSOR Q) of MSSOR Q, number T of iterations, initialization w
(0)
, diagonal
D of Q, and SOR splitting Q = MSOR - NSOR .
1: Set \surd D\omega = (2/\omega - 1)D.
- 1 - 1
2: Set \delta = (\lambda max (MSSOR Q) - \lambda min (MSSOR Q))/4.
- 1 - 1
3: Set \tau = 2/(\lambda max (MSSOR Q) + \lambda min (MSSOR Q)).
4: Set \beta = 2\tau , \alpha = 1, e = 2/\alpha - 1, c = (2/\tau - 1)e, and \kappa = \tau .
5: Set t = 1.
6: while t \leq T do
7: Draw z1 \sim \scrN (0d , Id ).
\surd 1/2
8: Solve MSOR x1 = MSOR w(t - 1) + eD\omega z1 - Qw(t - 1) w.r.t. x1 .
9: Draw z2 \sim \scrN (0d , Id ).
\surd 1/2
10: Solve M\top SOR x2 = MSOR (x1 - w
\top (t - 1)
) + cD\omega z2 - Qx1 w.r.t. x2 .
11: if t = 1, then
12: Set w(t) = \alpha (w(t - 1) + \tau x2 ).
13: Set \bfittheta (t) = \bfitmu + w(t) .
14: else
15: Set w(t) = \alpha (w(t - 1) - w(t - 2) + \tau x2 ) + w(t - 2) .
16: Set \bfittheta (t) = \bfitmu + w(t) .
17: end if
18: Set \beta = 1/\tau 1 - \beta \delta , \alpha = \beta \tau , e = 2\kappa (1 - \alpha )
\beta + 1, c = \tau 2 - 1 + (e - 1)( \tau 1 + \kappa 1 - 1), and
\kappa = \beta + (1 - \alpha )\kappa .
19: Set t = t + 1.
20: end while
(T )
21: return \bfittheta .

Algorithmic Efficiency. Similar to Algorithm 4.2, Algorithm 4.3 is exact in the sense
that it admits \scrN (\bfitmu , Q - 1 ) as an invariant distribution. The work [32] gave guidelines
to choosing the truncation parameter T such that the error between the distribution

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


24 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

of \bfittheta (T ) and \scrN (\bfitmu , Q - 1 ) is sufficiently small. Regarding computation and storage,
since triangular linear systems can be solved in \scrO (d2 ) flops by either back or forward
substitution, Algorithm 4.3 admits a computational cost of \scrO (T d2 ) and a storage
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

requirement of \Theta (d).


4.1.2. Approximate Matrix Splitting. Motivated by efficiency and parallel com-
putations, the authors in [9] and [54] proposed to relax exact MS and introduced two
MCMC samplers whose invariant distributions are approximations of \scrN (\bfitmu , Q - 1 ).
First, in order to solve efficiently the linear system M\bfittheta (t) = Q\bfitmu + z
\~ + N\bfittheta (t - 1) in-
volved in step 4 of Algorithm 4.2, these approximate approaches consider MS schemes
with diagonal matrices M. For exact samplers, e.g., Richardson and Jacobi, we saw
in the previous paragraph that such a convenient structure for M implies that the
drawing of the Gaussian vector z \~ becomes more demanding. To bypass this issue,
approximate samplers draw Gaussian vectors z \~\prime with simpler covariance matrices M \~
\top \~
instead of M + N. Again, attractive choices for M are diagonal matrices, since the
associated sampling task then boils down to Algorithm 2.2. This yields Algorithm 4.4,
which is highly amenable to parallelization since both the covariance matrix M \~ of z\~\prime
and the matrix M involved in the linear system to be solved are diagonal. Table 3
gathers the respective expressions of M, N, and M \~ for the two approaches introduced
in [54] and [9] and coined ``Hogwild sampler"" and ``clone MCMC,"" respectively.

Algorithm 4.4 MCMC sampler based on approximate matrix splitting


Input: Number T of iterations, initialization \bfittheta (0) , and splitting Q = M - N.
1: Set t = 1.
2: while t \leq T do
3: Draw z \~
\~\prime \sim \scrN (0d , M). \~ = D or 2 (D + 2\omega Id ); see Table 3.
\triangleleft M
(t) \prime (t - 1)
4: Solve M\bfittheta = Q\bfitmu + z \~ + N\bfittheta .
5: Set t = t + 1.
6: end while
7: return \bfittheta (T ) .

Algorithmic Efficiency. Regarding sampling accuracy, the Hogwild sampler and clone
MCMC define a Markov chain whose invariant distribution is Gaussian with the cor-
rect mean \bfitmu but with precision matrix Q \widetilde MS , where
\Biggl\{ \bigl( \bigr)
- 1 \top
\widetilde MS = Q Id - D (L + L ) for the Hogwild sampler,
Q \bigl( 1 - 1 - 1
\bigr)
Q Id - 2 (D + 2\omega Id ) Q for clone MCMC.

Contrary to the Hogwild sampler, clone MCMC is able to sample exactly from
\widetilde MS \rightarrow Q. While
\scrN (\bfitmu , Q - 1 ) in the asymptotic scenario \omega \rightarrow 0, since in this case Q
retaining a memory requirement of \Theta (d), the induced approximation yields a highly
parallelizable sampler. Indeed, compared to Algorithm 4.2, the computational com-
plexities associated to step 3 and the solving of the triangular system in step 4 are
decreased by an order of magnitude to \scrO (d). Note that the overall computational
complexity of step 4 is still \scrO (d2 ) because of the matrix-vector product N\bfittheta (t - 1) .
4.2. Data Augmentation. Since the precision matrix Q has been assumed to be
arbitrary, the MS schemes Q = M - N in Table 2 were not motivated by its structure
but rather by the computational efficiency of the associated samplers. Hence, inspired
by efficient linear solvers, relevant choices for M and N given in Tables 2 and 3 have

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 25

Table 3 MS schemes for \bfQ that can be used in Algorithm 4.4. The matrices \bfD and \bfL denote the
diagonal and strictly lower triangular parts of \bfQ , respectively. The vector \bfz \~\prime is the one
appearing in step 3 of Algorithm 4.4 and \omega > 0 is a tuning parameter controlling the bias
of those methods. Sufficient conditions to guarantee \rho (\bfM - 1 \bfN ) < 1 are given in [9, 54].
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

\bfS \bfa \bfm \bfp \bfl \bfe \bfr \bfM \bfN cov(\~ \~
\bfz \prime ) = \bfM
Hogwild with blocks of size 1 [54] \bfD - \bfL - \bfL \top \bfD
Clone MCMC [9] \bfD + 2\omega \bfI d 2\omega \bfI d - \bfL - \bfL \top 2 (\bfD + 2\omega \bfI d )

been considered. Another line of search explores schemes specifically dedicated to


precision matrices Q of the form
(4.3) Q = Q1 + Q2 ,
where, contrary to the MS schemes discussed in the previous section, the two matrices
Q1 and Q2 are not chosen by the user but result directly from the statistical model
under consideration. In particular, such situations arise when deriving hierarchical
Bayesian models (see, e.g., [50, 78, 92]). By capitalizing on possible specific structures
of \{ Qi \} i\in [2] , it may be desirable to separate Q1 and Q2 into two different hopefully
simpler steps of a Gibbs sampler. To this purpose, this section discusses data augmen-
tation (DA) approaches which introduce one (or several) auxiliary variables u \in \BbbR k
such that the joint distribution of the couple (\bfittheta , u) yields simple conditional distri-
butions, and thus sampling steps within a Gibbs sampler [9, 66, 67, 104]. Then a
straightforward marginalization of the auxiliary variable u permits us to retrieve the
target distribution \scrN (\bfitmu , Q - 1 ), either exactly or in an asymptotic regime depending
on the nature of the DA scheme. Both exact and approximate DA methods have been
proposed.
4.2.1. Exact Data Augmentation. This section reviews some exact DA ap-
proaches to obtaining samples from \scrN (\bfitmu , Q - 1 ). The term exact here means that
the joint distribution of (\bfittheta , u) admits a density \pi (\bfittheta , u) that satisfies almost surely
\int
(4.4) \pi (\bfittheta , u) = \pi (\bfittheta )
\BbbR k

and yields proper marginal distributions. Figure 4 describes the directed acyclic
graphs (DAGs) associated with two hierarchical models proposed in [66, 67] to de-
couple Q1 from Q2 by involving auxiliary variables. In what follows, we detail the
motivations behind these two DA schemes. Among the two matrices Q1 and Q2
involved in the composite precision matrix Q, without loss of generality, we assume
that Q2 presents a particular and simpler structure (e.g., diagonal or circulant) than
Q1 . We want now to benefit from this structure by leveraging the efficient sampling
schemes previously discussed in subsection 2.2 and well suited to handling a Gaussian
distribution with a precision matrix only involving Q2 . This is the aim of the first
DA model called EDA, which introduces the joint distribution with p.d.f.
\biggl( \Bigr] \biggr)
1 \Bigl[ \top \top
(4.5) \pi (\bfittheta , u1 ) \propto exp - (\bfittheta - \bfitmu ) Q(\bfittheta - \bfitmu ) + (u1 - \bfittheta ) R(u1 - \bfittheta ) ,
2
- 1
with R = \omega - 1 Id - Q1 and 0 < \omega < \| Q1 \| , where \| \cdot \| is the spectral norm. The
resulting Gibbs sampler (see Algorithm 4.5) relies on two conditional Gaussian sam-
pling steps whose associated conditional distributions are detailed in Table 4. This

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


26 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

u1 u2 u1 \bfittheta
\bfittheta
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

\bigl( \bigr)
\bfittheta \sim \scrN \bfitmu , Q - 1

\bigl( \bigr) \bigl( \bigr)


\bfittheta \sim \scrN \bfitmu , Q - 1 u1 | \bfittheta \sim \scrN \bfittheta , R - 1
\Bigl( \Bigr)
\bigl( \bigr)
u1 | \bfittheta \sim \scrN \bfittheta , R - 1
, R = \omega - 1
Id - Q1 u2 | u1 \sim \scrN G1 u1 , \Lambda - 1
1

(a) Hierarchical EDA model (b) Hierarchical GEDA model

Fig. 4 Hierarchical models proposed in [66, 67], where \omega is such that 0 < \omega < \| \bfQ 1 \| - 1 .

Table 4 Conditional probability distributions of \bfittheta | \bfu 1 , \bfu 1 | \bfittheta , \bfu 2 , and \bfu 2 | \bfu 1 for the exact DA
schemes detailed in subsection 4.2.1. The parameter \omega is such that 0 < \omega < \| \bfQ 1 \| - 1 . For
simplicity, the conditioning is notationally omitted.

\bfS \bfa \bfm \bfp \bfl \bfe \bfr \bfittheta \sim \scrN (\bfitmu \bfittheta , \bfQ - 1
\bfittheta ) \bfu 1 \sim \scrN (\bfitmu \bfu 1 , \bfQ - 1
\bfu 1 ) \bfu 2 \sim \scrN (\bfitmu \bfu 2 , \bfQ - 1
\bfu 2 )

\bfQ \bfittheta = \omega - 1 \bfI d + \bfQ 2 \bfQ \bfu 1 = \bfR -


EDA \bfitmu \bfittheta = \bfQ - 1
\bfittheta (\bfR \bfu 1 + \bfQ \bfitmu ) \bfitmu \bfu 1 = \bfittheta -

\bfQ \bfittheta = \omega - 1 \bfI d + \bfQ 2 \bfQ \bfu 1 = \omega - 1 \bfI d \bfQ \bfu 2 = \bfLambda 1
GEDA \bfitmu \bfittheta = \bfQ - 1
\bfittheta (\bfR \bfu 1 + \bfQ \bfitmu ) \bfitmu \bfu 1 = \bfittheta - \omega (\bfQ 1 \bfittheta - \bfG \top - 1
1 \bfLambda 1 \bfu 2 ) \bfitmu \bfu 2 = \bfG 1 \bfu 1

Algorithm 4.5 Gibbs sampler based on exact data augmentation (G)EDA


(0)
Input: Number T of iterations and initialization \bfittheta (0) , u1 .
1: Set t = 1.
2: while t \leq T do
(t)
3: Draw u2 \sim \scrN (\bfitmu \bfu 2 , Q - 1 \bfu 2 ). \triangleleft Only if GEDA is considered.
(t) - 1
4: Draw u1 \sim \scrN (\bfitmu \bfu 1 , Q\bfu 1 ).
5: Draw \bfittheta (t) \sim \scrN (\bfitmu \bfittheta , Q - 1
\bfittheta ).
6: Set t = t + 1.
7: end while
8: return \bfittheta (T ) .

scheme has the great advantage of decoupling the two precision matrices Q1 and Q2
since they are not simultaneously involved in either of the two steps. In particular,
introducing the auxiliary variable u1 permits us to remove the dependence in Q1
when defining the precision matrix of the conditional distribution of \bfittheta . While effi-
cient sampling from this conditional is possible, we have to ensure that sampling the
auxiliary variable u1 can be achieved with a reasonable computational cost. Again, if
Q1 presents a nice structure, the specific approaches reviewed in subsection 2.2 can
be employed. If this is not the case, the authors of [66, 67] proposed a generalization
of EDA, called GEDA, to simplify the whole Gibbs sampling procedure when Q arises

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 27

from a hierarchical Bayesian model. In such models, Q1 , as a fortiori Q2 , naturally


admits an explicit decomposition written as Q1 = G\top 1 \Lambda 1 G1 , where \Lambda 1 is a positive-
definite (and very often diagonal) matrix. By building on this explicit decomposition,
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

GEDA introduces an additional auxiliary variable u2 such that the augmented p.d.f. is
\biggl( \Bigr] \biggr)
1 \Bigl[ \top \top
\pi (\bfittheta , u1 , u2 ) \propto exp - (\bfittheta - \bfitmu ) Q(\bfittheta - \bfitmu ) + (u1 - \bfittheta ) R(u1 - \bfittheta )
2
\biggl( \biggr)
1 \top
(4.6) \times exp - (u2 - G1 u1 ) \Lambda 1 (u2 - G1 u1 ) .
2
The associated joint distribution yields conditional Gaussian distributions with di-
agonal covariance matrices for both u1 and u2 that can be sampled efficiently with
Algorithm 2.2; see Table 4.
Algorithmic Efficiency. First, both EDA and GEDA admit \scrN (\bfitmu , Q - 1 ) as invariant
distribution and hence are exact. Regarding EDA, since the conditional distribution
of u1 | \bfittheta might admit an arbitrary precision matrix in the worst-case scenario, its com-
putational and storage complexities are \scrO (KT d2 ) and \Theta (d), where K is a truncation
parameter associated to one of the algorithms reviewed in section 3. On the other
hand, GEDA benefits from an additional DA which yields the reduced computational
and storage requirements of \scrO (T d2 ) and \Theta (d).
4.2.2. Approximate Data Augmentation. An approximate DA scheme inspired
by variable-splitting approaches in optimization [2, 3, 15] was proposed in [104]. This
framework, also called asymptotically exact DA (AXDA) [105], was initially intro-
duced to deal with any target distribution, not limited to Gaussian distributions; it
therefore a fortiori applies to them as well. An auxiliary variable u \in \BbbR d is introduced
such that the joint p.d.f. of (\bfittheta , u) is
\left( \Biggl[ \Biggr] \right)
1 \top \top \| \bfu - \bfittheta \| 2
(4.7) \pi (\bfittheta , \bfu ) \propto \mathrm{e}\mathrm{x}\mathrm{p} - (\bfittheta - \bfitmu ) \bfQ 2 (\bfittheta - \bfitmu ) + (\bfu - \bfitmu ) \bfQ 1 (\bfu - \bfitmu ) + ,
2 \omega

where \omega > 0. The main idea behind (4.7) is to replicate the variable of interest \bfittheta in
order to sample via a Gibbs sampling strategy two different random variables u and
\bfittheta with covariance matrices involving, separately, Q1 and Q2 . This algorithm, coined
the split Gibbs sampler (SGS), is detailed in Algorithm 4.6 and sequentially draws
from the conditional distributions
\Bigl( \Bigr)
(4.8) u | \bfittheta \sim \scrN (\omega - 1 Id + Q1 ) - 1 (\omega - 1 \bfittheta + Q1 \bfitmu ), (\omega - 1 Id + Q1 ) - 1 ,
\Bigl( \Bigr)
(4.9) \bfittheta | u \sim \scrN (\omega - 1 Id + Q2 ) - 1 (\omega - 1 u + Q2 \bfitmu ), (\omega - 1 Id + Q2 ) - 1 .

Again, this approach has the great advantage of decoupling the two precision matrices
Q1 and Q2 defining Q since they are not simultaneously involved in either of the two
steps of the Gibbs sampler. In [66], the authors showed that exact DA schemes
(i.e., EDA and GEDA) generally outperform AXDA as far as Gaussian sampling is
concerned. This was expected since the AXDA framework proposed is not specifically
designed for Gaussian targets but for a wide family of distributions.
Algorithmic Efficiency. The sampling efficiency of Algorithm 4.6 depends upon the
parameter \omega which controls the strength of the coupling between u and \bfittheta as well as the
bias-variance trade-off of this method; it yields exact sampling when \omega \rightarrow 0. Indeed,

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


28 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

the marginal distribution of \bfittheta under the joint distribution with density defined in (4.7)
is a Gaussian with the correct mean \bfitmu but with an approximate precision matrix Q \widetilde DA
that admits the closed-form expression
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

\Bigl( \Bigr) - 1
(4.10) \widetilde DA = Q2 + Q - 1 + \omega Id
Q .
1

In the worst-case scenario where Q1 is arbitrary, sampling from the conditional distri-
bution (4.8) can be performed with an iterative algorithm running K iterations such
as those reviewed in section 3. Hence Algorithm 4.6 admits the same computational
and storage complexities as EDA (see Algorithm 4.5), that is, \scrO (KT d2 ) and \Theta (d).

Algorithm 4.6 Gibbs sampler based on approximate data augmentation


Input: Number T of iterations and initialization \bfittheta (0) .
1: Set t = 1.
2: while t \leq T do
3: Draw u(t) \sim \scrN (\bfitmu \bfu , Q - 1 \bfu ) in (4.8).
4: Draw \bfittheta (t) \sim \scrN (\bfitmu \bfittheta , Q - 1
\bfittheta ) in (4.9).
5: Set t = t + 1.
6: end while
7: return \bfittheta (T ) .

5. A Unifying Revisit of Gibbs Samplers via a Stochastic Version of the PPA.


Sections 3 and 4 showed that numerous approaches have been proposed to sample from
a possibly high-dimensional Gaussian distribution with density (2.1). This section
proposes to unify these approaches within a general Gaussian simulation framework
which is actually a stochastic counterpart of the celebrated PPA in optimization [90].
This viewpoint will shed new light on the connections among the reviewed simulation-
based algorithms, and in particular among Gibbs samplers.
5.1. A Unifying Proposal Distribution. The approaches described in section 4
use surrogate probability distributions (e.g., conditional or approximate distributions)
to make Gaussian sampling easier. In what follows, we show that most of these sur-
rogate distributions can be put under a common umbrella by considering the density
\biggl( \biggr)
1 \top
(5.1) \kappa (\bfittheta , u) \propto \pi (\bfittheta ) exp - (\bfittheta - u) R(\bfittheta - u) ,
2

where u \in \BbbR d is an additional (auxiliary) variable and R \in \BbbR d\times d is a symmetric
matrix acting as a preconditioner such that \kappa defines a proper density on an appro-
priate state space. More precisely, in what follows, depending on the definition of the
variable u, the probability density \kappa in (5.1) will refer to either a joint p.d.f. \pi (\bfittheta , u) or
a conditional probability density \pi (\bfittheta | u). Contrary to the MCMC samplers detailed
in section 4, the methods described in section 3 do not use explicit surrogate distribu-
tions to simplify the sampling procedure. Instead, they directly perturb deterministic
approaches from numerical linear algebra without explicitly defining a simpler sur-
rogate distribution at each iteration. This feature can be encoded with the choice
R \rightarrow 0d\times d so that these methods can be described by this unifying model as well.
Then the main motivation for using the surrogate density \kappa is to precondition the
initial p.d.f. \pi to end up with simpler sampling steps as in section 4.

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 29

5.2. Revisiting MCMC Sampling Approaches. This section builds on the prob-
ability kernel density (5.1) to revisit, unify, and extend the exact and approximate
approaches reviewed in section 4. We emphasize that exact approaches indeed target
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

the distribution of interest \scrN (\bfitmu , Q - 1 ), while approximate approaches only target an
approximation of \scrN (\bfitmu , Q - 1 ).
5.2.1. From Exact Data Augmentation to Exact Matrix Splitting. We assume
here that the variable u refers to an auxiliary variable such that the joint distribution
of the couple (\bfittheta , u) has a density given by \pi (\bfittheta , u) \triangleq \kappa (\bfittheta , u). In addition, here we
restrict R to be positive definite. It follows that
\int \int \biggl( \biggr)
1
(5.2) \pi (\bfittheta , u)du = Z - 1 \pi (\bfittheta ) exp - (\bfittheta - u)\top R(\bfittheta - u) du = \pi (\bfittheta )
\BbbR d \BbbR d 2

holds almost surely with Z = det(R) - 1/2 (2π)d/2 . Hence, the joint density (5.1)
yields an exact DA scheme whatever the choice of the positive-definite matrix R.
We will show that the exact DA approaches described in Algorithm 4.5 precisely fit
the proposed generic framework with a specific choice for the preconditioning matrix
R. We will then extend this class of exact DA approaches and show a one-to-one
equivalence between Gibbs samplers based on exact MS (see subsection 4.1.1) and
those based on exact DA (see subsection 4.2.1).
To this end, we start by making the change of variable v = Ru. Combined with
the joint probability density (5.1), this yields the following two conditional probability
distributions:

(5.3) v | \bfittheta \sim \scrN (R\bfittheta , R) ,


\Bigl( \Bigr)
(5.4) \bfittheta | v \sim \scrN (Q + R) - 1 (v + Q\bfitmu ), (Q + R) - 1 .

As emphasized in subsection 5.1, the aim of introducing the preconditioning matrix


R is to yield simpler sampling steps. In the general case where Q = Q1 + Q2 , with
Q1 and Q2 two matrices that cannot be easily handled jointly (e.g., because they are
not diagonalized in the same basis), an attractive option is R = \omega - 1 Id - Q1 . Indeed,
this choice ensures that Q1 and Q2 are separated and are not simultaneously involved
in either of the two conditional sampling steps. Note that this choice yields the EDA
scheme already discussed in subsection 4.2.1; see Table 4. Now we relate this exact
DA scheme to an exact MS scheme. By rewriting the Gibbs sampling steps associated
with the conditional distributions (5.3) and (5.4) as an autoregressive process of order
1 w.r.t. \bfittheta [14], it follows that an equivalent sampling strategy is

(5.5) \~ \sim \scrN (Q\bfitmu , 2R + Q) ,


z
\Bigl( \Bigr)
(t) - 1
(5.6) \bfittheta = (Q + R) \~ + R\bfittheta (t - 1) .
z

Defining M = Q + R and N = R, or equivalently Q = M - N, this yields


\Bigl( \Bigr)
(5.7) \~ \sim \scrN Q\bfitmu , M\top + N ,
z
\Bigl( \Bigr)
(5.8) \bfittheta (t) = M - 1 z \~ + N\bfittheta (t - 1) ,

which boils down to the Gibbs sampler based on exact MS discussed in subsection 4.1.1
(see Algorithm 4.2).

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


30 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

Table 5 Equivalence relations between exact DA and exact MS approaches. The matrices \bfQ 1 and
\bfQ 2 are such that \bfQ = \bfQ 1 + \bfQ 2 . The matrix \bfD 1 denotes the diagonal part of \bfQ 1 , and
\omega > 0 is a positive scalar ensuring the positive definiteness of \bfR . Bold acronyms refer to
novel samplers derived from the proposed unifying framework.
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

\bfR = cov(\bfv | \bfittheta ) (\bfQ + \bfR ) - 1 = cov(\bfittheta | \bfv ) \bfM \top + \bfN = cov(\~
\bfz ) DA sampler MS sampler
\biggl( \biggr) - 1
\bfI d \bfI d 2\bfI d
- \bfQ 1 + \bfQ 2 + \bfQ 2 - \bfQ 1 EDA [66] Richardson [32]
\omega \biggl( \omega \biggr) - 1 \omega
\bfD 1 \bfD 1 2\bfD 1
- \bfQ 1 + \bfQ 2 + \bfQ 2 - \bfQ 1 \bfE \bfD \bfA \bfJ Jacobi [32]
\omega \omega \omega

To illustrate the relevance of this rewriting when considering the case of two ma-
trices Q1 and Q2 that cannot be efficiently handled in the same basis, Table 5 presents
two possible choices of R which relate two MS strategies to their DA counterparts.
First, one particular choice of R (row 1 of Table 5) shows directly that the Richardson
MS sampler can be rewritten as the EDA sampler. More precisely, the autoregressive
process of order 1 w.r.t. \bfittheta defined by EDA yields a variant of the Richardson sampler.
This finding relates two different approaches proposed by authors from distinct com-
munities (numerical algebra and signal processing). Second, the proposed unifying
framework also permits us to go beyond existing approaches by proposing a novel ex-
act DA approach via a specific choice for the precision matrix R driven by an existing
MS method. Indeed, following the same rewriting trick with another particular choice
of R (row 2 of Table 5), an exact DA scheme can be easily derived from the Jacobi
MS approach. To our knowledge, this novel DA method, referred to as EDAJ in the
table, has not been documented in the existing literature.
Finally, the table reports two particular choices of R which lead to revisiting
existing MS and/or DA methods. It is worth noting that other relevant choices might
be possible, which would allow one to derive new exact DA and MS methods or to
draw further analogies between existing approaches. Note also that Table 5 shows the
main benefit of an exact DA scheme over its MS counterpart thanks to the decoupling
between Q1 and Q2 into two separate simulation steps. This feature can be directly
observed by comparing the two first columns of Table 5 with the third column.
5.2.2. From Approximate Matrix Splitting to Approximate Data Augmen-
tation. We now build on the proposed unifying proposal (5.1) to extend the class
of samplers based on approximate MS and reviewed in subsection 4.1.2. With some
abuse of notation, the variable u in (5.1) now refers to an iterate associated to \bfittheta .
More precisely, let us define u = \bfittheta (t - 1) to be the current iterate within an MCMC
algorithm and \kappa to be
\Bigl( \Bigr) \biggl( \Bigr) \biggr)
1 \Bigl( \Bigr) \top \Bigl(
(5.9) \kappa (\bfittheta , u) \triangleq p \bfittheta | u = \bfittheta (t - 1) \propto \pi (\bfittheta ) exp - \bfittheta - \bfittheta (t - 1) R \bfittheta - \bfittheta (t - 1) .
2

Readers familiar with MCMC algorithms will recognize in (5.9) a proposal density
that can be used within Metropolis--Hastings schemes [87]. However, unlike the
usual random-walk algorithm which considers the Gaussian proposal distribution
\scrN (\bfittheta (t - 1) , \lambda Id ) with \lambda > 0, the originality of (5.9) is to define the proposal by com-
bining the Gaussian target density \pi with a term that is equal to a Gaussian kernel
density when R is positive definite. If we always accept the proposed sample obtained
from (5.9) without any correction, that is, \bfittheta (t) = \bfittheta \widetilde \sim P (\cdot | u = \bfittheta (t - 1) ) with density

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 31

(5.9), this directly implies that the associated Markov chain converges in distribution
toward a Gaussian random variable with distribution \scrN (\bfitmu , Q \widetilde - 1 ) with the correct
mean \bfitmu but with precision matrix
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

\Bigl( \Bigr)
(5.10) \widetilde = Q Id + (R + Q) - 1 R .
Q

This algorithm is detailed in Algorithm 5.1. Note again that one can obtain samples
from the initial target distribution \scrN (\bfitmu , Q - 1 ) by replacing step 4 with an accep-
tance/rejection step; see [87] for details.

Algorithm 5.1 MCMC sampler based on (5.9)


Input: Number T of iterations and initialization \bfittheta (0) .
1: Set t = 1.
2: while t \leq T do \Bigl( \Bigr)
3: Draw \bfittheta \widetilde \sim P \cdot | u = \bfittheta (t - 1) in (5.9).

4: Set \bfittheta (t) = \bfittheta \widetilde .


5: Set t = t + 1.
6: end while
7: return \bfittheta (T ) .

Moreover, the instance (5.9) of (5.1) paves the way to an extended class of sam-
\widetilde
plers based on approximate MS. More precisely, the draw of a proposed sample \bfittheta
from (5.9) can be replaced by the following two-step sampling procedure:

(5.11) \~\prime \sim \scrN (Q\bfitmu , R + Q) ,


z
\Bigl( \Bigr)
- 1
(5.12) \bfittheta (t) = (Q + R) \~\prime + R\bfittheta (t - 1) .
z

The MS form with M = Q + R, N = R is

(5.13) \~\prime \sim \scrN (Q\bfitmu , M) ,


z
\Bigl( \Bigr)
(5.14) \bfittheta (t) = M - 1 z \~\prime + N\bfittheta (t - 1) .

This recursion defines an extended class of approximate MS-based samplers and en-
compasses the Hogwild sampler reviewed in subsection 4.1.2 by taking R = - L - L\top .
In addition to the existing Hogwild approach, Table 6 lists two new approximate MS
approaches resulting from specific choices of the preconditioning matrix R. They are
coined approximate Richardson and Jacobi samplers since the expressions for M and
N are very similar to those associated to their exact counterparts; see Table 2. For
those two samplers, note that the approximate precision matrix Q \widetilde tends toward 2Q
in the asymptotic regime \omega \rightarrow 0. Indeed, for the approximate Jacobi sampler, we
have
\Biggl( \biggl( \biggr) \Biggr)
\widetilde = Q Id + \omega I d
Q - Q
\omega
= Q (2Id - \omega Q)
\rightarrow 2Q .
\omega \rightarrow 0

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


32 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

Table 6 Extended class of Gibbs samplers based on approximate MS with \bfQ = \bfM - \bfN with \bfN =
\bfR and approximate DA. The matrices \bfD and \bfL denote the diagonal and strictly lower
triangular parts of \bfQ , respectively. \omega is a positive scalar. Bold names and acronyms refer
to novel samplers derived from the proposed unifying framework.
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

1 1
2
\bfM = cov(\bfv \prime | \bfittheta ) 2
\bfM - 1 = cov(\bfittheta | \bfv \prime ) \bfz \prime )
\bfM = cov(\~ MS sampler DA sampler
1 1 - 1
2
\bfD 2
\bfD \bfD Hogwild [54] \bfA \bfD \bfA \bfH

\bfI d \omega \bfI d \bfI d


\bfa \bfp \bfp \bfr \bfo \bfx . \bfR \bfi \bfc \bfh \bfa \bfr \bfd \bfs \bfo \bfn \bfA \bfD \bfA \bfR
2\omega 2 \omega
\bfD \omega \bfD - 1 \bfD
\bfa \bfp \bfp \bfr \bfo \bfx . \bfJ \bfa \bfc \bfo \bfb \bfi \bfA \bfD \bfA \bfJ
2\omega 2 \omega

In order to retrieve the original precision matrix Q when \omega \rightarrow 0, [9] proposed an
approximate DA strategy that can be related to the work of [105].
In subsection 5.2.1, we showed that exact DA approaches can be rewritten to
recover exact MS approaches. In what follows, we will take the opposite path to
show that approximate MS approaches admit approximate DA counterparts that are
highly amenable to distributed and parallel computations. Using the fact that z\prime =
Q\bfitmu + z1 + (Q + R)z2 , where z1 \sim \scrN (0d , 21 (R + Q)) and z2 \sim \scrN (0d , 12 (R + Q) - 1 ),
the recursion (5.14) can be written equivalently as
\Bigl( \Bigr)
\widetilde = (Q + R) - 1 Q\bfitmu + R\bfittheta (t - 1) + z1 + z2 .
\bfittheta

By introducing an auxiliary variable v\prime defined by v\prime = R\bfittheta (t - 1) + z1 , the resulting


two-step Gibbs sampling relies on the conditional sampling steps
\biggl( \biggr)
\prime 1
v | \bfittheta \sim \scrN R\bfittheta , (R + Q) ,
2
\biggl( \biggr)
\prime - 1 \prime 1 - 1
\bfittheta | v \sim \scrN (Q + R) (v + Q\bfitmu ), (R + Q) ,
2

and targets the joint distribution with density \pi (\bfittheta , v\prime ). Compared to the exact DA
approaches reviewed in subsection 4.2.1, the sampling difficulty associated to each
conditional sampling step is the same and only driven by the structure of the matrix
M = R + Q. In particular, this matrix becomes diagonal for the three specific
choices listed in Table 6. These choices lead to three new sampling schemes that we
name ADAH, ADAR, and ADAJ since they represent the DA counterparts of the
approximate MS samplers discussed above. Interestingly, these DA schemes naturally
emerge here without assuming any explicit decomposition Q = Q1 + Q2 or including
an additional auxiliary variable (as in GEDA). Finally, as previously highlighted, when
compared to their exact counterparts, these DA schemes have the great advantage of
leading to Gibbs samplers suited for parallel computations, hence simplifying the
sampling procedure.
5.3. Gibbs Samplers as Stochastic Versions of the PPA. This section aims to
draw new connections between optimization and the sampling approaches discussed
in this paper. In particular, we will focus on the proximal point algorithm (PPA)
[90]. After briefly presenting this optimization method, we will show that the Gibbs

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 33

samplers based on the proposal (5.9) can be interestingly interpreted as stochastic


counterparts of the PPA. Let us assume here that R is positive semidefinite and
define the weighted norm w.r.t. R for all \bfittheta \in \BbbR d by
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

2
(5.15) \| \bfittheta \| \bfR \triangleq \bfittheta \top R\bfittheta .
The Proximal Point Algorithm. The PPA [90] is an important and widely used
method to find zeros of a maximal monotone operator \sansK , that is, to solve problems
of the following form:
(5.16) Find \bfittheta \star \in \scrH such that 0d \in \sansK (\bfittheta \star ) ,
where \scrH is a real Hilbert space. For simplicity, we will take here \scrH = \BbbR d equipped
with the usual Euclidean norm and focus on the particular case \sansK = \partial f , where f is
a lower semicontinuous (l.s.c.), proper, coercive, and convex function and \partial denotes
the subdifferential operator; see Appendix B. In this case, the PPA is equivalent
to the proximal minimization algorithm [68, 69] which aims at solving the following
minimization problem:
(5.17) Find \bfittheta \star \in \BbbR d such that \bfittheta \star = arg min f (\bfittheta ) .
\bfittheta \in \BbbR d

This algorithm is called the proximal point algorithm in reference to the work by
Moreau [76]. For readability reasons, we refer the reader to Appendix B for details
about this algorithm for a general operator \sansK and to the comprehensive overview
in [90] for more information.

Algorithm 5.2 Proximal point algorithm (PPA)


1: Choose an initial value \bfittheta (0) , a positive semidefinite matrix R, and a maximal
number of iterations T .
2: Set t = 1.
3: while t \leq T do
1 \bigm\|
\bigm\|
\bigm\| 2
\bigm\|
4: \bfittheta (t) = arg min f (\bfittheta ) + \bigm\| \bfittheta - \bfittheta (t - 1) \bigm\| .
\bfittheta \in \BbbR d 2 \bfR
5: end while
(T )
6: return \bfittheta .

The PPA is detailed in Algorithm 5.2. Note that instead of directly minimizing
the objective function f , the PPA adds a quadratic penalty term depending on the
previous iterate \bfittheta (t - 1) and then solves an approximation of the initial optimization
problem at each iteration. This idea of successive approximations is exactly the
deterministic counterpart of (5.9), which proposes a new sample based on successive
approximations of the target density \pi via a Gaussian kernel with precision matrix
R. In fact, searching for the maximum a posteriori estimator under the proposal
distribution P (\cdot | \bfittheta (t - 1) ) with density p(\cdot | \bfittheta (t - 1) ) in (5.9) boils down to solving
1 \bigm\|
\bigm\|
\bigm\| 2
\bigm\|
(5.18) arg min - log \pi (\bfittheta ) + \bigm\| \bfittheta - \bfittheta (t - 1) \bigm\| ,
\bfittheta \in \BbbR d \underbrace{} \underbrace{} 2 \bfR
f (\bfittheta )

which coincides with step 4 in Algorithm 5.2 by taking f = - log \pi . This emphasizes
the tight connection between optimization and simulation that we have highlighted
in previous sections.

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


34 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

The PPA, the ADMM, and the Approximate Richardson Gibbs Sampler. An
important motivation of the PPA is related to the preconditioning idea used in the uni-
fying model proposed in (5.1). Indeed, the PPA has been extensively used within the
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

alternating direction method of multipliers (ADMM) [15, 33, 40] as a preconditioner


in order to avoid high-dimensional inversions [17, 19, 28, 58, 110]. The ADMM [15]
is an optimization approach that solves the minimization problem in (5.17) when
g(\bfittheta ) = g1 (A\bfittheta ) + g2 (\bfittheta ), A \in \BbbR k\times d , via the iterative scheme

1 \bigm\|
\bigm\|
\bigm\| 2
\bigm\|
(5.19) z(t) = arg min g1 (z) + \bigm\| z - A\bfittheta (t - 1) - u(t - 1) \bigm\| ,
\bfz \in \BbbR k 2\rho
1 \bigm\|
\bigm\|
\bigm\| 2
\bigm\|
(5.20) \bfittheta (t) = arg min g2 (\bfittheta ) + \bigm\| A\bfittheta - z(t) + u(t - 1) \bigm\| ,
\bfittheta \in \BbbR d 2\rho
(5.21) u(t) = u(t - 1) + A\bfittheta (t) - z(t) ,

where z \in \BbbR k is a splitting variable, u \in \BbbR k is a scaled dual variable, and \rho is a positive
penalty parameter. Without loss of generality,4 let us assume that g2 is a quadratic
function, that is, for any $\boldsymbol{\theta} \in \mathbb{R}^d$, $g_2(\boldsymbol{\theta}) = (\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})^\top(\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})/2$. Even in this simple
case, one can notice that the \bfittheta -update (5.20) involves a matrix A operating directly
on \bfittheta , preluding an expensive inversion of a high-dimensional matrix associated to A.
To deal with such an issue, Algorithm 5.2 is considered to solve approximately the
minimization problem in (5.20). The PPA applied to the minimization problem (5.20)
reads
(5.22)    $\boldsymbol{\theta}^{(t)} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \frac{1}{2}(\boldsymbol{\theta} - \bar{\boldsymbol{\theta}})^\top(\boldsymbol{\theta} - \bar{\boldsymbol{\theta}}) + \frac{1}{2\rho}\left\|\mathbf{A}\boldsymbol{\theta} - \mathbf{z}^{(t)} + \mathbf{u}^{(t-1)}\right\|^2 + \frac{1}{2}\left\|\boldsymbol{\theta} - \boldsymbol{\theta}^{(t-1)}\right\|_{\mathbf{R}}^2$.

In order to draw some connections with (5.9), we set $\mathbf{Q} = \rho^{-1}\mathbf{A}^\top\mathbf{A} + \mathbf{I}_d$ and $\boldsymbol{\mu} = \mathbf{Q}^{-1}\left[\mathbf{A}^\top(\mathbf{u}^{(t-1)} - \mathbf{z}^{(t)})/\rho + \bar{\boldsymbol{\theta}}\right]$ and rewrite (5.22) as

(5.23)    $\boldsymbol{\theta}^{(t)} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\mu})^\top\mathbf{Q}(\boldsymbol{\theta} - \boldsymbol{\mu}) + \frac{1}{2}\left\|\boldsymbol{\theta} - \boldsymbol{\theta}^{(t-1)}\right\|_{\mathbf{R}}^2$.

Note that $\frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\mu})^\top\mathbf{Q}(\boldsymbol{\theta} - \boldsymbol{\mu})$ is the potential function associated to $\pi$ in (2.1) and, as such, (5.23) can be seen as the deterministic counterpart of (5.9). By defining $\mathbf{R} = \omega^{-1}\mathbf{I}_d - \mathbf{Q}$, where $0 < \omega \leq \rho\|\mathbf{A}\|^{-2}$ ensures that $\mathbf{R}$ is positive semidefinite, the $\boldsymbol{\theta}$-update in (5.22) becomes (see Appendix B)
(5.24)    $\boldsymbol{\theta}^{(t)} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} \frac{1}{2\omega}\left\|\boldsymbol{\theta} - \omega\left(\mathbf{R}\boldsymbol{\theta}^{(t-1)} + \mathbf{Q}\boldsymbol{\mu}\right)\right\|^2$.

Note that (5.24) boils down to solving $\omega^{-1}\boldsymbol{\theta} = \mathbf{R}\boldsymbol{\theta}^{(t-1)} + \mathbf{Q}\boldsymbol{\mu}$, which is exactly the deterministic counterpart of the approximate Richardson Gibbs sampler in Table 6. This
highlights further the tight links between the proposed unifying framework and the
use of the PPA in the optimization literature. It also paves the way to novel sampling
methods inspired by optimization approaches which are not necessarily dedicated to
Gaussian sampling; this goes beyond the scope of the present paper.
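For intuition (our own illustration, not taken from the cited works): with $\mathbf{R} = \omega^{-1}\mathbf{I}_d - \mathbf{Q}$, the deterministic recursion $\boldsymbol{\theta}^{(t)} = \omega(\mathbf{R}\boldsymbol{\theta}^{(t-1)} + \mathbf{Q}\boldsymbol{\mu})$ is nothing but the classical Richardson iteration $\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} + \omega\mathbf{Q}(\boldsymbol{\mu} - \boldsymbol{\theta}^{(t-1)})$ for the system $\mathbf{Q}\boldsymbol{\theta} = \mathbf{Q}\boldsymbol{\mu}$; the samplers of Table 6 add a suitable Gaussian perturbation to this skeleton at each step.

```python
import numpy as np

def richardson_fixed_point(Q, mu, omega, theta0, n_iter=500):
    """Deterministic skeleton of the approximate Richardson Gibbs sampler:
    theta <- omega * (R theta + Q mu) with R = omega^{-1} I - Q,
    i.e. plain Richardson iteration theta <- theta + omega * Q (mu - theta)."""
    theta = theta0.copy()
    for _ in range(n_iter):
        theta = theta + omega * (Q @ (mu - theta))
    return theta

# The iterates converge to mu as soon as 0 < omega < 2 / lambda_max(Q).
rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
Q = M @ M.T + np.eye(d)
mu = rng.standard_normal(d)
omega = 1.0 / np.linalg.eigvalsh(Q).max()
print(np.allclose(richardson_fixed_point(Q, mu, omega, np.zeros(d)), mu))   # True
```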

^4 If $g_2$ admits a nonquadratic form, an additional splitting variable can be introduced and the following comments still hold.


6. A Comparison of Gaussian Sampling Methods with Numerical Simulations. This section aims to provide readers with a detailed comparison of the Gaus-
sian sampling techniques discussed in sections 3 and 4. In particular, we summarize
the benefits and bottlenecks associated with each method, illustrate some of them
on numerical applications, and propose an up-to-date selection of the most efficient
algorithms for archetypal Gaussian sampling tasks. General guidelines resulting from
this review and numerical experiments are gathered in Figure 12.
6.1. Summary, Comparison, and Discussion of Existing Approaches. Table 7
summarizes the main features of the sampling techniques reviewed above. In particu-
lar, for each approach, this table recalls its exactness (or not), its computational and
storage costs, the most expensive sampling step to compute, the possible linear sys-
tem to solve, and the presence of tuning parameters. It aims to synthesize the main
pros and cons of each class of samplers. Regarding MCMC approaches, the computa-
tional cost associated to the required sampling step is not taken into account in the
column ``Comp. cost"" since it depends upon the structure of Q. Instead, the column
``Sampling"" indicates the type of sampling step required by the sampling approach.
Rather than conducting a one-to-one comparison between samplers, we choose to
focus on selected important questions raised by the taxonomy reported in this table.
Concerning more technical or in-depth comparisons between specific approaches, we
refer the interested reader to the appropriate references; see, for instance, those high-
lighted in sections 3 and 4. These questions of interest lead to scenarios that motivate
the dedicated numerical experiments conducted in subsection 6.2, then subsection 6.3
gathers guidelines to choosing an appropriate sampler for a given sampling task. The
questions we focus on are as follows.
Question 1. In which scenarios do iterative approaches become interesting com-
pared to factorization approaches?
In section 3, one sees that square root, CG, and PO approaches bypass the compu-
tational and storage costs of factorization thanks to an iterative process of K cheaper
steps, with K \in \BbbN \ast . A natural question is, in which scenarios does the total cost
of K iterations remain efficient when compared to factorization methods? Table 7
tells us that high-dimensional scenarios (d \gg 1) favor iterative methods as soon as
memory requirements of order \Theta (d2 ) become prohibitive. If this storage is not an
issue, iterative samplers become interesting only if the number of iterations K is such
that K \ll (d + T - 1)/T . This inequality is verified only in cases where a small
number of samples T is required (T \ll d), which in turn imposes K \ll d. Note that
this condition remains similar when a Gaussian sampling step is embedded within a
Gibbs sampler with a varying covariance or precision matrix (see Example 2.1): the
condition on K is K \ll d, whatever the number T of samples, since it is no longer
possible to precompute the factorization of Q.
Question 2. In which scenarios should we prefer an iterative sampler from section 3
or an MCMC sampler from section 4?
Table 7 shows that the iterative samplers reviewed in section 3 have to perform
K iterations to generate one sample. In contrast, most MCMC samplers generate
one sample per iteration. However, these samples are distributed according to the
target distribution (or an approximation of it) only in an asymptotic regime, i.e.,
when T \rightarrow \infty (in practice, after a burn-in period). If one considers a burn-in period
of length Tbi whose samples are discarded, MCMC samplers are interesting w.r.t.

Table 7  Taxonomy of existing methods to sample from an arbitrary $d$-dimensional Gaussian distribution $\Pi \triangleq \mathcal{N}(\boldsymbol{\mu}, \mathbf{Q}^{-1})$; see (2.1). $K \in \mathbb{N}^*$ and $T_{\mathrm{bi}} \in \mathbb{N}^*$ are the number of iterations of a given iterative sampler (e.g., the CG one) and the number of burn-in iterations for MCMC algorithms, respectively. "Target $\Pi$" refers to approaches that target the right distribution $\Pi$ under infinite precision arithmetic. Note that some i.i.d. samplers using a truncation parameter $K$ might target $\Pi$ for specific choices of $K$ (e.g., the classical CG sampler is exact for $K = d$). "Finite time sampling" refers here to approaches which need a truncation procedure to deliver a sample in finite time. The matrix $\mathbf{A}$ is a symmetric and positive-definite matrix associated to $\mathbf{Q}$; see section 4. The sampling methods highlighted in bold stand for novel approaches derived from the proposed unifying framework. AMS stands for "approximate matrix splitting" (see Table 6), EDA for "exact data augmentation" (see Table 4), and ADA for "approximate data augmentation" (see Table 6).

| Method | Instance | Target $\Pi$ | Finite time sampling | Comp. cost | Storage cost | Sampling | Linear system | No tuning |
|---|---|---|---|---|---|---|---|---|
| Factorization | Cholesky [91, 94] | ✓ | ✓ | O(d²(d + T)) | Θ(d²) | N(0, 1) | triangular | ✓ |
| Factorization | Square root [24] | ✓ | ✓ | O(d²(d + T)) | Θ(d²) | N(0, 1) | full | ✓ |
| Q^{1/2} approx. | Chebyshev [24, 82] | ✗ | ✓ | O(d²KT) | Θ(d) | N(0, 1) | ✗ | ✓ |
| Q^{1/2} approx. | Lanczos [7, 98] | ✗ | ✓ | O(K(K + d²)T) | Θ(K(K + d)) | N(0, 1) | ✗ | ✓ |
| CG | Classical [81] | ✗ | ✓ | O(d²KT) | Θ(d) | N(0, 1) | ✗ | ✓ |
| CG | Gradient scan [30] | ✓ | ✗ | O(d²K(T + T_bi)) | Θ(Kd) | N(0, 1) | ✗ | ✓ |
| PO | Truncated [77, 79] | ✗ | ✓ | O(d²KT) | Θ(d) | N(0, 1) | full | ✓ |
| PO | Reversible jump [37] | ✓ | ✗ | O(d²K(T + T_bi)) | Θ(d) | N(0, 1) | full | ✓ |
| MS | Richardson [32] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0_d, A) | diagonal | ✗ |
| MS | Jacobi [32] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0_d, A) | diagonal | ✓ |
| MS | Gauss--Seidel [32] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | triangular | ✓ |
| MS | SOR [32] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | triangular | ✗ |
| MS | SSOR [32] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | triangular | ✗ |
| MS | Cheby-SSOR [31, 32] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | triangular | ✗ |
| MS | Hogwild [54] | ✗ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | diagonal | ✓ |
| MS | Clone MCMC [9] | ✗ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | diagonal | ✗ |
| MS | **Unifying AMS** | ✗ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | diagonal | ✓ |
| DA | EDA [66] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0_d, A) | ✗ | ✓ |
| DA | GEDA [66] | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | ✗ | ✓ |
| DA | SGS [104] | ✗ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0_d, A) | ✗ | ✗ |
| DA | **Unifying EDA** | ✓ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0_d, A) | ✗ | ✓ |
| DA | **Unifying ADA** | ✗ | ✗ | O(d²(T + T_bi)) | Θ(d) | N(0, 1) | ✗ | ✗ |
iterative ones only if T + Tbi \ll KT . Since most often K \ll Tbi , this condition
favors MCMC methods when a large number T \gtrsim Tbi of samples is desired. When a
small number T \lesssim Tbi /K of samples is desired, iterative methods are preferred. In
intermediate situations, the choice depends on the precise number of required samples
T , mixing properties of the MCMC sampler, and the number of iterations K of the
alternative iterative algorithm.
Question 3. When is it efficient to use a decomposition Q = Q1 + Q2 of the
precision matrix in comparison with other approaches?
Sections 3 and 4 have shown that some sampling methods, such as Algorithms 3.4
and 4.5, exploit a decomposition of the form Q = Q1 + Q2 to simplify the sampling
task. Regarding the PO approaches, the main benefit lies in the cheap computation
of the vector z\prime \sim \scrN (0d , Q) before solving the linear system Q\bfittheta = z\prime ; see [79] for
more details. On the other hand, MCMC samplers based on exact DA yield simpler
sampling steps a priori and do not require us to solve any high-dimensional linear
system. However, the Achilles' heel of MCMC methods is that they only produce
samples of interest in the asymptotic regime where the number of iterations tends
toward infinity. For a fixed number of MCMC iterations, dependent samples are
obtained and their quality depends highly upon the mixing properties of the MCMC
sampler. Numerical experiments in subsection 6.2 include further discussion on this
point.
6.2. Numerical Illustrations. This section illustrates the main differences be-
tween the methods reviewed in sections 3 and 4. The main purpose is not to give an
exhaustive one-to-one comparison among all approaches listed in Table 7. Instead,
these methods are compared in light of three experimental scenarios that address the
questions raised in subsection 6.1. More specific numerical simulations can be found
in the cited works and references therein. Since the main challenges of Gaussian sam-
pling are related to the properties of the precision matrix Q, or the covariance \Sigma (see
subsection 2.2), the mean vector \bfitmu is set to 0d and only centered distributions are
considered. For the first two scenarios associated with Questions 1 and 2, the unbiased
estimate of the empirical covariance matrix will be used to assess the convergence in
distribution of the samples generated by each algorithm:
(6.1)    $\widehat{\boldsymbol{\Sigma}}_T = \frac{1}{T-1}\sum_{t=1}^{T}\left(\boldsymbol{\theta}^{(t)} - \bar{\boldsymbol{\theta}}\right)\left(\boldsymbol{\theta}^{(t)} - \bar{\boldsymbol{\theta}}\right)^\top$,

where $\bar{\boldsymbol{\theta}} = T^{-1}\sum_{t=1}^{T}\boldsymbol{\theta}^{(t)}$ is the empirical mean. Note that other metrics (such as the
empirical precision matrix) could have been used to assess whether these samples are
distributed according to a sufficiently close approximation of the target distribution
\scrN (\bfitmu , Q - 1 ). Among available metrics, we chose the one that has been the most used
in the literature, that is, (6.1) [9, 32, 37].
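A short helper of this kind (our own sketch; numpy.cov with rowvar=False returns the same quantity) is enough to evaluate (6.1) from the $T$ stacked samples.

```python
import numpy as np

def empirical_covariance(samples):
    """Unbiased covariance estimate (6.1) from an array of shape (T, d)
    whose rows are the samples theta^(t)."""
    theta_bar = samples.mean(axis=0)          # empirical mean
    centered = samples - theta_bar
    return centered.T @ centered / (samples.shape[0] - 1)

# Sanity check against numpy's built-in estimator on i.i.d. standard Gaussian draws.
rng = np.random.default_rng(2)
samples = rng.standard_normal((1000, 3))
assert np.allclose(empirical_covariance(samples), np.cov(samples, rowvar=False))
```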
For the scenario 3 associated with the corresponding final question, the consid-
ered high-dimensional setting will preclude the computation of exact and empirical
covariance matrices. Instead, MCMC samplers will be compared in terms of the
computational efficiency and quality of the generated chains (see subsection 6.2.3 for
details).
The experimental setting is as follows. To ensure fair comparisons, all algo-
rithms are implemented on equal grounds, with the same quality of optimization.
The programming language used is Python and all loops have been carefully re-

placed by matrix-vector products as far as possible. Simulations are run on a com-


puter equipped with an Intel Xeon 3.70 GHz processor with 16.0 GB of RAM,
under Windows 7. Among the infinite set of possible examples, we choose exam-
ples of Gaussian sampling problems that often appear in applications and that have
been previously considered in the literature, so that they provide good tutorials to
answer the question raised by each scenario. The code to reproduce all the fig-
ures of this section is available in a Jupyter notebook format available online at
https://github.com/mvono/PyGauss/tree/master/notebooks.
6.2.1. Scenario 1. This first set of experiments addresses Question 1 concerning
iterative versus factorization approaches. We consider a sampling problem also tackled
in [81] to demonstrate the performance of Algorithm 3.5 based on the conjugate
gradient. For the sake of clarity, we divide this scenario into two subexperiments.
Comparison between Factorization and Iterative Approaches. In this first
subexperiment, we compare so-called factorization approaches with iterative ap-
proaches on two Gaussian sampling problems. We consider first the problem of
sampling from \scrN (0d , Q - 1 ), where the covariance matrix \Sigma = Q - 1 is chosen as a
squared exponential matrix that is commonly used in applications involving Gaussian
processes [48, 63, 85, 96, 103, 108]. Its coefficients are defined by
(6.2)    $\Sigma_{ij} = 2\exp\left(-\frac{(s_i - s_j)^2}{2a^2}\right) + \epsilon\,\delta_{ij} \quad \forall i, j \in [d]$,

where $\{s_i\}_{i \in [d]}$ are evenly spaced scalars on $[-3, 3]$, $\epsilon > 0$, and the Kronecker symbol $\delta_{ij} = 1$ if $i = j$ and zero otherwise. In (6.2), the parameters $a$ and $\epsilon$ have been set to $a = 1.5$ and $\epsilon = 10^{-6}$, which yields a distribution of the eigenvalues of $\boldsymbol{\Sigma}$ such that the large ones are well separated while the small ones are clustered together near $10^{-6}$; see
Figure 5 (first row). We compare the Cholesky sampler (Algorithm 3.1), the approxi-
mate inverse square root sampler using Chebyshev polynomials (Algorithm 3.2), and
the CG-based sampler (Algorithm 3.5). The sampler using Chebyshev polynomials
needs Kcheby = 23 iterations on average, while the CG iterations are stopped once
loss of conjugacy occurs, following the guidelines of [81], that is, at Kkryl = 8. In all
experiments, $T = 10^5$ samples are simulated in dimensions ranging from 1 to several
thousands.
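For reference, the covariance (6.2) can be assembled in a few lines; the snippet below is our own sketch of this setup, together with the vanilla Cholesky baseline, which draws $\boldsymbol{\theta} = \mathbf{L}\mathbf{z}$ with $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$ and $\mathbf{z} \sim \mathcal{N}(\mathbf{0}_d, \mathbf{I}_d)$.

```python
import numpy as np

def squared_exponential_cov(d, a=1.5, eps=1e-6):
    """Covariance (6.2): Sigma_ij = 2 exp(-(s_i - s_j)^2 / (2 a^2)) + eps * delta_ij
    with s_1, ..., s_d evenly spaced on [-3, 3]."""
    s = np.linspace(-3.0, 3.0, d)
    Sigma = 2.0 * np.exp(-(s[:, None] - s[None, :]) ** 2 / (2.0 * a ** 2))
    return Sigma + eps * np.eye(d)

# Vanilla Cholesky baseline: one exact draw from N(0_d, Sigma).
rng = np.random.default_rng(3)
d = 100
Sigma = squared_exponential_cov(d)
L = np.linalg.cholesky(Sigma)
theta = L @ rng.standard_normal(d)
```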
Figure 5 shows the results associated with these three direct samplers in dimen-
sion d = 100. The generated samples indeed follow a target Gaussian distribution
admitting a covariance matrix close to \Sigma . This is attested to by the small residuals
observed between the estimated and the true covariances. Based on this criterion, all
approaches successfully sample from \scrN (0d , \Sigma ). This is emphasized by the spectrum
of the estimated matrices $\widehat{\boldsymbol{\Sigma}}_T$, which coincides with the spectrum of $\boldsymbol{\Sigma}$ for large eigen-
values. This observation ensures that most of the covariance information has been
captured. However, note that only the Cholesky method is able to recover accurately
all the eigenvalues, including the smallest ones.
Figure 6 compares the direct samplers above in terms of central processing unit
(CPU) time. To generate only one sample (T = 1), as expected, one can observe that
the Cholesky sampler is faster than the two iterative samplers in small dimension d
and becomes computationally demanding as d grows beyond a few hundred. Indeed,
for small d the cost implied by several iterations (Kcheby or Kkryl ) within each iterative
sampler dominates the cost of the factorization in Algorithm 3.1, while the contrary
holds for large values of d. Since the Cholesky factorization is performed only once,

[Figure 5 (image not reproduced): (top left) the covariance matrix $\boldsymbol{\Sigma}$; (top right) eigenvalues of the estimates of $\boldsymbol{\Sigma}$ obtained with the CG, Cholesky, and Chebyshev samplers, compared to the ground truth; (bottom) residuals between $\boldsymbol{\Sigma}$ and its estimates, with relative errors $\|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}_{\mathrm{Cholesky}}\|_2/\|\boldsymbol{\Sigma}\|_2 = 0.0043$, $\|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}_{\mathrm{CG}}\|_2/\|\boldsymbol{\Sigma}\|_2 = 0.0041$, and $\|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}_{\mathrm{Chebyshev}}\|_2/\|\boldsymbol{\Sigma}\|_2 = 0.0061$.]

Fig. 5  Scenario 1. Results of the three considered samplers for the sampling from $\mathcal{N}(\mathbf{0}_d, \boldsymbol{\Sigma})$ in dimension $d = 100$ with $\boldsymbol{\Sigma}$ detailed in (6.2).

the Cholesky sampler becomes more attractive than the other two approaches as the
sample size T increases. However, as already pointed out in subsection 2.3, it is
worth noting that precomputing the Cholesky factor is no longer possible once the
Gaussian sampling task involves a matrix \Sigma which changes at each iteration of a
Gibbs sampler, e.g., when considering a hierarchical Bayesian model with unknown
hyperparameters (see Example 2.1). We also point out that a comparison among
direct samplers reviewed in section 3 and their related versions was conducted in [7]
in terms of CPU and GPU times. In agreement with the findings reported here,
this comparison essentially showed that the choice of the sampler in small dimensions
is not particularly important, while iterative direct samplers become interesting in
high-dimensional scenarios where Cholesky factorization becomes impossible.
We complement our analysis by focusing on another sampling problem which
now considers the matrix defined in (6.2) as a precision matrix instead of a covariance
matrix: we now want to generate samples from $\mathcal{N}(\mathbf{0}_d, \widetilde{\boldsymbol{\Sigma}})$ with $\widetilde{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}^{-1}$. This sampling problem is expected to be more difficult since the largest eigenvalues of $\widetilde{\boldsymbol{\Sigma}}$
are now clustered; see Figure 7 (first row). Figure 7 (second row) shows that the
CG and Chebyshev samplers fail to capture covariance information as accurately as
the Cholesky sampler. The residuals between the estimated and the true covariances
remain important on the diagonal: variances are inaccurately underestimated. This
observation is in line with [81], which showed that the CG sampler is not suitable
for the sampling from a Gaussian distribution whose covariance matrix has many
clustered large eigenvalues, probably as a consequence of the bad conditioning of the
matrix. As far as the Chebyshev sampler is concerned, this failure was expected since
the interval $[\lambda_l, \lambda_u]$ on which the function $x \mapsto x^{-1/2}$ has to be well approximated becomes very large with an extent of about $10^6$. Of course, the relative error between

[Figure 6 (image not reproduced): CPU time [s] versus dimension $d$ (log--log scale) for the CG, Cholesky, and Chebyshev samplers, shown for sample sizes $T = 1$, $100$, $1000$, and $10000$.]

Fig. 6  Scenario 1. Comparison among the three considered direct samplers in terms of CPU time for the sampling from $\mathcal{N}(\mathbf{0}_d, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma}$ detailed in (6.2).

$\widetilde{\boldsymbol{\Sigma}}$ and its estimate can be decreased by sufficiently increasing the number of iterations $K_{\mathrm{cheby}}$, but this is possible only at a prohibitive computational cost.
On the choice of the metric to monitor convergence. We saw in Figures 5 and 7
that the covariance estimation error was small if the large values of the covariance
matrix were captured and large in the opposite scenario. Note that if the precision
estimation error was chosen as a metric, we would have observed similar numerical
results: if the largest eigenvalues of the precision matrix were not captured, the pre-
cision estimation error would have been large. Regarding the CG sampler, Fox and
Parker in [31], for instance, highlighted this fact and illustrated it numerically (see
equations (3.1) and (3.2) in that paper).

Comparison between Chebyshev- and CG-Based Samplers. In order to dis-


criminate the two iterative direct samplers in Algorithms 3.2 and 3.5, we consider a
toy example in dimension d = 15. The covariance matrix \Sigma is chosen to be diagonal
with diagonal elements drawn randomly from the discrete set J1, 5K. As shown in
Figure 8, \Sigma has repeated and large eigenvalues. Because of this, the CG sampler
stops at Kkryl = 5 (the number of distinct eigenvalues of \Sigma ) and fails to sample ac-
curately from \scrN (0d , \Sigma ), while the sampler based on Chebyshev polynomials yields
samples of interest. Hence, although the CG sampler is an attractive iterative option,
its accuracy is known to be highly dependent on the distribution of the eigenvalues
of \Sigma which is in general unknown in high-dimensional settings. Preconditioning tech-
niques to spread out these eigenvalues can be used but might fail to reduce the error
significantly, as shown in [81].

[Figure 7 (image not reproduced): (top left) the covariance matrix $\widetilde{\boldsymbol{\Sigma}}$; (top right) eigenvalues of the estimates of $\widetilde{\boldsymbol{\Sigma}}$ obtained with the CG, Cholesky, and Chebyshev samplers, compared to the ground truth; (bottom) residuals between $\widetilde{\boldsymbol{\Sigma}}$ and its estimates, with relative errors $\|\widetilde{\boldsymbol{\Sigma}} - \widetilde{\boldsymbol{\Sigma}}_{\mathrm{Cholesky}}\|_2/\|\widetilde{\boldsymbol{\Sigma}}\|_2 = 0.0592$, $\|\widetilde{\boldsymbol{\Sigma}} - \widetilde{\boldsymbol{\Sigma}}_{\mathrm{CG}}\|_2/\|\widetilde{\boldsymbol{\Sigma}}\|_2 = 0.9995$, and $\|\widetilde{\boldsymbol{\Sigma}} - \widetilde{\boldsymbol{\Sigma}}_{\mathrm{Chebyshev}}\|_2/\|\widetilde{\boldsymbol{\Sigma}}\|_2 = 0.9998$.]

Fig. 7  Scenario 1. Results of the three considered samplers for the sampling from $\mathcal{N}(\mathbf{0}_d, \widetilde{\boldsymbol{\Sigma}})$ in dimension $d = 100$.

[Figure 8 (image not reproduced): (top left) the diagonal covariance matrix $\boldsymbol{\Sigma}$; (top right) eigenvalues of the estimates of $\boldsymbol{\Sigma}$ obtained with the CG, Cholesky, and Chebyshev samplers, compared to the ground truth; (bottom) residuals between $\boldsymbol{\Sigma}$ and its estimates, with relative errors $\|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}_{\mathrm{Cholesky}}\|_2/\|\boldsymbol{\Sigma}\|_2 = 0.0148$, $\|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}_{\mathrm{CG}}\|_2/\|\boldsymbol{\Sigma}\|_2 = 0.6042$, and $\|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}_{\mathrm{Chebyshev}}\|_2/\|\boldsymbol{\Sigma}\|_2 = 0.0203$.]

Fig. 8  Scenario 1. Results of the three considered direct samplers for the sampling from $\mathcal{N}(\mathbf{0}_d, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma}$ diagonal in dimension $d = 15$.

6.2.2. Scenario 2. Now we turn to Question 2 and compare iterative and MCMC
samplers. In this scenario, we consider a precision matrix Q which is commonly used
to build Gaussian Markov random fields (GMRFs) [92]. Before defining this matrix,
we introduce some notation. Let \scrG = (\scrV , \scrE ) be an undirected two-dimensional graph
(see Figure 9), where \scrV is the set of d nodes in the graph and \scrE represents the edges
between the nodes. We say that nodes i and j are neighbors and write i \sim j if there
is an edge connecting these two nodes. The number of neighbors of node i is denoted
ni and is also called the degree. Using this notation, we set Q to be a second-order
locally linear precision matrix [48, 92] associated to the two-dimensional lattice shown
in Figure 9,
(6.3)    $Q_{ij} = \epsilon\,\delta_{ij} + \begin{cases} \phi\, n_i & \text{if } i = j, \\ -\phi & \text{if } i \sim j, \\ 0 & \text{otherwise}, \end{cases} \quad \forall i, j \in [d]$,

where we set \epsilon = 1 (in fact, \epsilon > 0 suffices) and \phi > 0 to ensure that Q is a nonsingular
matrix yielding a nonintrinsic Gaussian density w.r.t. the d-dimensional Lebesgue
measure; see subsection 2.2. Note that this precision matrix is band-limited with bandwidth of the order $\mathcal{O}(\sqrt{d})$ [92], preluding the possible embedding of Algorithm 2.3
within the samplers considered in this scenario. Related instances of this precision
matrix have also been considered in [32, 52, 81] in order to show the benefits of
both direct and MCMC samplers. In what follows, we consider the sampling from
\scrN (0d , Q - 1 ) for three different scalar parameters \phi \in \{ 0.1, 1, 10\} leading to three
covariance matrices Q - 1 with different correlation structures; see Figure 9. This will
be of interest since it is known that the efficiency of Gibbs samplers is highly dependent
on the correlation between the components of the Gaussian vector \bfittheta \in \BbbR d [87, 92].
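The precision matrix (6.3) is easy to build explicitly for moderate $d$; the snippet below is our own illustration (dense storage, first-order neighbors on a $\sqrt{d} \times \sqrt{d}$ lattice) of the structure used in this scenario.

```python
import numpy as np

def lattice_precision(side, phi, eps=1.0):
    """Precision matrix (6.3) on a side x side 2D lattice:
    Q_ii = eps + phi * n_i, Q_ij = -phi if i ~ j (4 nearest neighbors), 0 otherwise."""
    d = side * side
    Q = np.zeros((d, d))

    def idx(r, c):
        return r * side + c

    for r in range(side):
        for c in range(side):
            i = idx(r, c)
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbors = [(rr, cc) for rr, cc in neighbors
                         if 0 <= rr < side and 0 <= cc < side]
            Q[i, i] = eps + phi * len(neighbors)
            for rr, cc in neighbors:
                Q[i, idx(rr, cc)] = -phi
    return Q

Q = lattice_precision(side=10, phi=1.0)    # d = 100, as in Figure 9
assert np.allclose(Q, Q.T) and np.all(np.linalg.eigvalsh(Q) > 0)
```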
For this scenario, we set d = 100 in order to provide complete diagnostics for
evaluating the accuracy of the samples generated by each algorithm. We implement
the four MCMC samplers based on exact MS (see Table 2) without considering a
burn-in period (i.e., Tbi = 0). These MCMC algorithms are compared with the direct
samplers based on Cholesky factorization and Chebyshev polynomials; see section 3. Since the matrix $\mathbf{Q}$ is strictly diagonally dominant, that is, $|Q_{ii}| > \sum_{j \neq i} |Q_{ij}|$ for all
i \in [d], the convergence of the MCMC sampler based on Jacobi splitting is ensured
[32, 41]. Based on this convergence property, we can use an optimal value for the
parameter \omega appearing in the MCMC sampler based on successive overrelaxation
(SOR) splitting; see Appendix C.
Figure 10 shows the relative error between the estimated covariance matrix and
the true one w.r.t. the number of samples generated with i.i.d. samplers from section 3
and MCMC samplers from section 4. Regarding MCMC samplers, no burn-in has been
considered here in order to emphasize that these algorithms do not yield i.i.d. samples
from the first iteration compared to the samplers reviewed in section 3. This behavior
is particularly noticeable when \phi = 10, where one can observe that Gauss--Seidel,
Jacobi, and Richardson samplers need far more samples than Chebyshev and Cholesky
samplers to reach the same precision in terms of covariance estimation. Interestingly,
this claim does not hold for ``accelerated"" Gibbs samplers such as the SOR (accelerated
version of the Gauss--Seidel sampler) and the Chebyshev accelerated SSOR samplers.
Indeed, for \phi = 1, one can note that the latter sampler is as efficient as i.i.d. samplers.
On the other hand, when \phi = 10, these two accelerated Gibbs samplers manage to
achieve lower relative covariance estimation error than the Chebyshev sampler when
the number of iterations increases. This behavior is due to the fact that the Chebyshev

[Figure 9: image not reproduced; see the caption below.]

Fig. 9  Scenario 2. Illustrations of the considered Gaussian sampling problem: (top left) 2D lattice ($d = 36$) associated to the precision matrix $\mathbf{Q}$ in (6.3); (top right) $\mathbf{Q}$ depicted for $\phi = 1$; (bottom) $\mathbf{Q}^{-1} = \boldsymbol{\Sigma}$ for $\phi \in \{0.1, 1, 10\}$. All the results are shown for $d = 100$. On the 2D lattice, the green nodes are the neighbors of the blue node while the coordinates of the lattice correspond to the coordinates of $\boldsymbol{\theta} \in \mathbb{R}^d$ with nonzero correlations, that is, $1 \leq i, j \leq \sqrt{d}$.

sampler involves a truncation procedure and as such provides approximate samples


from \scrN (\bfitmu , Q - 1 ) compared to exact MCMC schemes, which produce asymptotically
exact samples. These numerical findings match observations made by [32], which
experimentally showed that accelerated Gibbs approaches can be considered as serious
contenders for the fastest samplers in high-dimensional settings compared to i.i.d.
methods.
Table 8 complements these numerical findings by reporting the spectral radius
of the iteration operator M - 1 N associated to each MCMC sampler. This radius is
particularly important since it is directly related to the convergence factor of the
corresponding MCMC sampler [32]. In order to provide quantitative insights into
the relative performance of each sampler, Table 8 also shows the number of samples
T and corresponding CPU time such that the relative error between the covariance
matrix $\boldsymbol{\Sigma}$ and its estimate $\widehat{\boldsymbol{\Sigma}}_T$ computed from $T$ samples generated by each algorithm is lower than $5 \times 10^{-2}$, i.e., a relative error of 5%. Thanks to the small bandwidth
(b = 11) of Q, the covariance matrix M\top + N of the vector z \~ appearing in step 3
of Algorithm 4.2 is also band-limited with b = 11 for both Jacobi and Richardson
splitting. Hence Algorithm 2.3 specifically dedicated to the band matrix can be used
within Algorithm 4.2 for these two splitting strategies. Although the convergence

[Figure 10 (image not reproduced): relative covariance estimation error versus iteration $t$ for the Gibbs samplers (Gauss--Seidel, SOR, SSOR, Cheby-SSOR, Jacobi, Richardson) and for the Cholesky and Chebyshev samplers, shown for $\phi = 1$ (left) and $\phi = 10$ (right).]

Fig. 10  Scenario 2. Relative error associated to the estimation of the covariance matrix $\mathbf{Q}^{-1}$, defined by $\|\mathbf{Q}^{-1} - \mathrm{var}(\boldsymbol{\theta}^{(1:t)})\|_2 / \|\mathbf{Q}^{-1}\|_2$, w.r.t. the number of iterations $t$ in Algorithm 4.2, with $d = 100$ (left: $\phi = 1$, right: $\phi = 10$). We also highlight the relative error obtained with an increasing number of samples generated independently from Cholesky and Chebyshev samplers. The results have been averaged over 30 independent runs. The standard deviations are not shown for readability reasons.

of these samplers is slower than that of the Gauss--Seidel sampler, their CPU times
become roughly equivalent. Note that this computational gain is problem-dependent
and cannot be ensured in general. Cholesky factorization appears to be much faster
in all cases when the same constant covariance is used for many samples. The next
scenario will consider high-dimensional scenarios where Cholesky factorization is no
longer possible.
6.2.3. Scenario 3. Finally, we deal with Question 3 to assess the benefits of
samplers that take advantage of the decomposition structure Q = Q1 + Q2 of the
precision matrix. As motivated in subsection 6.1, we will focus here on the exact DA
approaches detailed in subsection 4.2 and compare the latter to iterative samplers
which produce uncorrelated samples, such as those reviewed in section 3.
To this end, we consider Gaussian sampling problems in high dimensions $d \in [10^4, 10^6]$ for which Cholesky factorization is both computationally and memory pro-
hibitive when a standard computer is used. This sampling problem commonly appears
in image processing [38, 67, 78, 104] and arises from the linear inverse problem, usually
called deconvolution or deblurring in image processing:

(6.4) y = S\bfittheta + \bfitvarepsilon ,

where y \in \BbbR d refers to a blurred and noisy observation, \bfittheta \in \BbbR d is the unknown original
image rearranged in lexicographic order, and \bfitvarepsilon \sim \scrN (0d , \Gamma ) with \Gamma = diag(\gamma 1 , . . . , \gamma d )
is a synthetic structured noise such that \gamma i \sim 0.7\delta \kappa 1 + 0.3\delta \kappa 2 for all i \in [d]. In
what follows, we set $\kappa_1 = 13$ and $\kappa_2 = 40$. The matrix $\mathbf{S} \in \mathbb{R}^{d \times d}$ is a circulant convolution matrix associated to the space-invariant box blurring kernel $\frac{1}{9}\mathbf{1}_{3 \times 3}$, where $\mathbf{1}_{p \times p}$ is the $p \times p$-matrix filled with ones. We adopt a smoothing conjugate prior distribution on $\boldsymbol{\theta}$ [61, 74, 75], introduced in subsection 2.2 and Figure 2, written as $\mathcal{N}\big(\mathbf{0}_d, (\frac{\xi_0}{d}\mathbf{1}_{d \times d} + \xi_1\boldsymbol{\Delta}^\top\boldsymbol{\Delta})^{-1}\big)$, where $\boldsymbol{\Delta}$ is the discrete two-dimensional Laplacian operator; $\xi_0 = 1$ ensures that this prior is nonintrinsic while $\xi_1 = 1$ controls the smoothing. Bayes' rule then yields the Gaussian posterior distribution

(6.5)    $\boldsymbol{\theta} \mid \mathbf{y} \sim \mathcal{N}\left(\boldsymbol{\mu}, \mathbf{Q}^{-1}\right)$,

Table 8  Scenario 2. Comparison between Cholesky-, Chebyshev-, and MS-based Gibbs samplers for $d = 100$. The samplers have been run until the relative error between the covariance matrix $\mathbf{Q}^{-1}$ and its estimate is lower than $5 \times 10^{-2}$. For Richardson, SOR, SSOR, and Cheby-SSOR samplers, the tuning parameter $\omega$ is the optimal one; see Appendix C. The results have been averaged over 30 independent runs.

| Sampler | $\phi$ | $\omega$ | $\rho(\mathbf{M}^{-1}\mathbf{N})$ | T | CPU time [s] |
|---|---|---|---|---|---|
| Cholesky | 0.1 | - | - | 6.3 × 10⁴ | 0.29 |
| Cholesky | 1 | - | - | 1.3 × 10⁴ | 0.06 |
| Cholesky | 10 | - | - | 2.9 × 10³ | 0.01 |
| Chebyshev (K = 21) | 0.1 | - | - | 6.4 × 10⁴ | 2.24 |
| Chebyshev (K = 21) | 1 | - | - | 1.3 × 10⁴ | 0.44 |
| Chebyshev (K = 21) | 10 | - | - | 2.5 × 10³ | 0.19 |
| Richardson | 0.1 | 0.6328 | 0.3672 | 6.7 × 10⁴ | 5.44 |
| Richardson | 1 | 0.1470 | 0.8530 | 3.8 × 10⁴ | 3.03 |
| Richardson | 10 | 0.0169 | 0.9831 | 4 × 10⁴ | 3.31 |
| Jacobi | 0.1 | - | 0.4235 | 6.8 × 10⁴ | 5.72 |
| Jacobi | 1 | - | 0.8749 | 3.9 × 10⁴ | 3.24 |
| Jacobi | 10 | - | 0.9856 | 4.6 × 10⁴ | 3.69 |
| Gauss--Seidel | 0.1 | - | 0.1998 | 6.5 × 10⁴ | 8.48 |
| Gauss--Seidel | 1 | - | 0.7677 | 2.5 × 10⁴ | 3.34 |
| Gauss--Seidel | 10 | - | 0.9715 | 2.5 × 10⁴ | 3.32 |
| SOR | 0.1 | 1.0494 | 0.1189 | 6.4 × 10⁴ | 8.40 |
| SOR | 1 | 1.3474 | 0.4726 | 1.6 × 10⁴ | 1.31 |
| SOR | 10 | 1.7110 | 0.7852 | 5.4 × 10³ | 0.71 |
| SSOR | 0.1 | 0.9644 | 0.0936 | 6.4 × 10⁴ | 19.65 |
| SSOR | 1 | 1.3331 | 0.4503 | 1.6 × 10⁴ | 4.91 |
| SSOR | 10 | 1.7101 | 0.9013 | 9.3 × 10³ | 2.86 |
| Cheby-SSOR | 0.1 | 0.9644 | 0.0246 | 6.3 × 10⁴ | 9.17 |
| Cheby-SSOR | 1 | 1.3331 | 0.1485 | 1.3 × 10⁴ | 1.89 |
| Cheby-SSOR | 10 | 1.7101 | 0.5213 | 4.5 × 10³ | 0.65 |

(The rows from Richardson to Cheby-SSOR correspond to the MCMC MS-based samplers.)

where

(6.6)    $\mathbf{Q} = \mathbf{S}^\top\boldsymbol{\Gamma}^{-1}\mathbf{S} + \frac{\xi_0}{d}\mathbf{1}_{d \times d} + \xi_1\boldsymbol{\Delta}^\top\boldsymbol{\Delta}$,

(6.7)    $\boldsymbol{\mu} = \mathbf{Q}^{-1}\mathbf{S}^\top\boldsymbol{\Gamma}^{-1}\mathbf{y}$.
Sampling from (6.5) is challenging since the size of the precision matrix forbids its
computation. Moreover, the presence of the matrix \Gamma rules out the diagonalization
of Q in the Fourier basis and therefore the direct use of Algorithm 2.4. In addition,
resorting to MCMC samplers based on exact MS to sample from (6.5) raises several
difficulties. First, both Richardson- and Jacobi-based samplers involve a sampling step
with an unstructured covariance matrix; see Table 2. This step can be performed with
one of the direct samplers reviewed in section 3 but this implies an additional com-
putational cost. On the other hand, although Gauss--Seidel and SOR-based MCMC
samplers involve a simple sampling step, they require access to the lower triangular
part of (6.6). In this high-dimensional scenario, the precision matrix cannot be easily
computed on a standard desktop computer and this lower triangular part must be
found with surrogate approaches. One possibility is to compute each nonzero coef-

ficient of this triangular matrix following the matrix-vector products $\mathbf{e}_i^\top\mathbf{Q}\mathbf{e}_j$ for all $i, j \in [d]$ such that $j \leq i$, where we recall that $\mathbf{e}_i$ is the $i$th canonical vector of $\mathbb{R}^d$.
These quantities can be precomputed when Q remains constant along the T iterations
but, again, this becomes computationally prohibitive when Q depends on unknown


hyperparameters to be estimated within a Gibbs sampler.
Nonetheless, since the precision matrix (6.6) can be decomposed as $\mathbf{Q} = \mathbf{Q}_1 + \mathbf{Q}_2$ with $\mathbf{Q}_1 = \mathbf{S}^\top\boldsymbol{\Gamma}^{-1}\mathbf{S}$ and $\mathbf{Q}_2 = \frac{\xi_0}{d}\mathbf{1}_{d \times d} + \xi_1\boldsymbol{\Delta}^\top\boldsymbol{\Delta}$, we can apply Algorithm 4.5
to sample efficiently from (6.5). This algorithm is particularly interesting in this
example since the three sampling steps involve two diagonal matrices and one circulant
precision matrix, respectively. For the two first ones, one can use Algorithm 2.2, while
Algorithm 2.4 can be used to sample from the last one.
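As an illustration of the kind of step involved (our own one-dimensional sketch with periodic boundary conditions, so that the precision matrix is circulant and diagonalized by the DFT; the circulant precision arising in this imaging example would involve the two-dimensional analogue), an exact draw from $\mathcal{N}(\mathbf{0}_d, \mathbf{Q}_c^{-1})$ with circulant $\mathbf{Q}_c$ costs $\mathcal{O}(d\log d)$ via the FFT.

```python
import numpy as np

def sample_circulant_precision(q_first_col, rng):
    """One exact draw from N(0, Q^{-1}) when Q is circulant with first column q_first_col.

    The DFT diagonalizes Q, so the covariance spectrum is 1 / fft(q_first_col);
    taking the real part of the inverse FFT of complex white noise shaped by the
    square root of that spectrum yields an exact sample (1D, periodic boundaries)."""
    d = q_first_col.size
    lam_q = np.real(np.fft.fft(q_first_col))        # eigenvalues of Q (> 0 if Q is SPD)
    noise = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    return np.sqrt(d) * np.real(np.fft.ifft(np.sqrt(1.0 / lam_q) * noise))

# Example: Q = I + Delta^T Delta with periodic 1D first-order differences Delta.
rng = np.random.default_rng(4)
d = 64
q = np.zeros(d)
q[0], q[1], q[-1] = 3.0, -1.0, -1.0
theta = sample_circulant_precision(q, rng)          # one O(d log d) draw
```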
In what follows, we compare Algorithm 4.5 with the CG direct sampler defined by
Algorithm 3.5. Since we consider high-dimensional scenarios, the covariance estimate
in (6.1) cannot be used to assess the convergence of these samplers. Instead, we
compare the respective efficiency of the considered samplers by computing the effective
sample size ratio per second (ESSR). For an MCMC sampler, the ESSR gives an
estimate of the equivalent number of i.i.d. samples that can be drawn in one second;
see [55, 62]. It is defined as
(6.8)    $\mathrm{ESSR}(\vargamma) = \frac{1}{T_1}\frac{\mathrm{ESS}(\vargamma)}{T} = \frac{1}{T_1\left(1 + 2\sum_{t=1}^{\infty}\rho_t(\vargamma)\right)}$,

where T1 is the CPU time in seconds required to draw one sample and \rho t (\vargamma ) is the
lag-t autocorrelation of a scalar parameter \vargamma . A variant of the ESSR has, for instance,
been used in [37] in order to measure the efficiency of an MCMC variant of the PO
algorithm (Algorithm 3.4). For a direct sampler providing i.i.d. draws, the ESSR (6.8)
simplifies to 1/T1 and represents the number of samples obtained in one second. In
both cases, the larger the ESSR, the more computationally efficient is the sampler.
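A possible implementation of (6.8) for an MCMC chain is sketched below (our own code, not the routine used in the experiments); the empirical autocorrelation sum is truncated at the first negative lag, the convention adopted later in this subsection, and for an i.i.d. sampler the same expression reduces to $1/T_1$.

```python
import numpy as np

def essr(chain, time_per_sample):
    """Effective sample size ratio per second (6.8) for a scalar chain.

    The infinite sum of lag autocorrelations rho_t is estimated empirically and
    truncated at the first negative value."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    T = x.size
    acf = np.correlate(x, x, mode="full")[T - 1:] / (np.arange(T, 0, -1) * x.var())
    rho_sum = 0.0
    for rho_t in acf[1:]:
        if rho_t < 0:
            break
        rho_sum += rho_t
    return 1.0 / (time_per_sample * (1.0 + 2.0 * rho_sum))
```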
Figure 11 shows the ESSR associated to the two considered algorithms for $d \in [10^4, 10^6]$. The latter has been computed by choosing the ``slowest'' component of $\boldsymbol{\theta}$ as
the scalar summary \vargamma , that is, the one with the largest variance. As in the statistical
software Stan [18], we have truncated the infinite sum in (6.8) at the first negative
\rho t . One can note that for the various high-dimensional problems considered here,
the GEDA sampler exhibits good mixing properties which, combined with its low
computational cost per iteration, yields a larger ESSR than the direct CG sampler.
Hence, in this specific case, building on both the decomposition Q = Q1 + Q2 of
the precision matrix and an efficient MCMC sampler is highly beneficial compared to
using Q directly in Algorithm 3.5. Obviously, this gain in computational efficiency
w.r.t. direct samplers is not guaranteed in general since GEDA relies on an appropriate
decomposition Q = Q1 + Q2 .
6.3. Guidelines to Choosing the Appropriate Gaussian Sampling Approach.
In this section, we provide the reader with some insights into how to choose the
most appropriate sampler for a given Gaussian simulation task when vanilla Cholesky
sampling cannot be envisioned. These guidelines are formulated as simple binary
questions in a decision tree (see Figure 12) to determine the class of samplers that is
of potential interest. We choose to start5 from the existence of a natural decompo-
5 Note that alternative decision trees could be built by considering other issues as the primary

decision level.

[Figure 11 (image not reproduced): (left) ESSR of the CG sampler ($K_{\mathrm{kryl}} \in [33, 38]$) and of the GEDA sampler versus dimension $d \in [10^4, 10^6]$; (right) autocorrelation function $\rho_t$ for $d = 10^4$.]

Fig. 11  Scenario 3. (left) ESS ratio per second (ESSR); (right) autocorrelation function $\rho_t$ shown for $d = 10^4$. For both figures, the slowest component of $\boldsymbol{\theta}$ is used as the scalar summary $\vargamma$.

sition Q = Q1 + Q2 since some existing approaches are specifically dedicated to this


scenario. Then we discriminate existing sampling approaches based on several criteria
which have been discussed throughout this review such as the prescribed accuracy,
the number of desired samples, or the eigenvalue spectrum of the precision matrix Q.
Regarding sampling accuracy, we highlight sampling approaches that are expected or
not expected to yield samples with arbitrary accuracy more efficiently than vanilla
Cholesky sampling; see Table 1. MCMC approaches introduced in section 4 and the
iterative i.i.d. samplers of section 3 are distinguished depending on the number of
desired samples T and their relative efficiency measured via the burn-in period Tbi
for MCMC samplers and via a truncation parameter K \in \BbbN \ast for i.i.d. samplers. The
latter guidelines follow from remarks highlighted in sections 3 and 4 and from the
numerical results in subsection 6.2.
As already mentioned in subsection 2.3, we emphasize that this review only aims
to highlight the main approaches dedicated to high-dimensional Gaussian sampling,
which arises in many different contexts. Therefore, it remains difficult to enunciate
here precise rules for each context. Thus, the guidelines in Figure 12 give general
principles to guide the practitioner toward an appropriate class of sampling approaches
that are reviewed and complemented by additional references provided in this paper.
7. Conclusion. Given the ubiquity of the Gaussian distribution and the huge
number of related contributions, this paper aimed to consolidate an up-to-date re-
view of the main approaches dedicated to high-dimensional Gaussian sampling. To
this end, we first presented the Gaussian sampling problem at stake as well as its
specific and previously reviewed instances. Then we pointed out the main difficulties
when the associated covariance matrix is not structured and the dimension of the
problem increases. We reviewed two main classes of approaches from the literature,
namely, approaches derived from numerical linear algebra and those based on MCMC
sampling. In order to help practitioners in choosing the most appropriate algorithm
for a given sampling task, we compared the reviewed methods by highlighting and
illustrating their respective pros and cons. Ultimately, we have provided general in-
sights into how to select one of the most appropriate samplers using a decision tree; see
Figure 12. In addition, we have also unified most of the reviewed MCMC approaches
under a common umbrella by building upon a stochastic counterpart of the celebrated
proximal point algorithm that is well known in optimization. This enabled us to shed
a new light on existing sampling approaches and draw further links between them. To

[Figure 12 (decision tree, image not reproduced): the tree first asks whether a decomposition $\mathbf{Q} = \mathbf{Q}_1 + \mathbf{Q}_2$ is available, and then discriminates according to the prescribed accuracy (arbitrary vs. limited), the condition $T_{\mathrm{bi}} \ll KT$, and the presence of a clustered eigenvalue spectrum; its leaves are the (G)EDA sampler (Algorithm 4.5), the approximate DA sampler (Algorithm 4.6), the PO sampler (Algorithm 3.4), the Chebyshev SSOR sampler (Algorithm 4.3), the approximate MS sampler (Algorithm 4.4), the Chebyshev sampler (Algorithm 3.2), and the CG sampler (Algorithm 3.5).]

Fig. 12  General guidelines to choosing the most appropriate sampling approach based on those reviewed in this paper. When computation of order $\Theta(d^3)$ and storage of $\Theta(d^2)$ elements are not prohibitive, the sampler of choice is obviously Algorithm 3.1. The accuracy of a sampler is referred to as arbitrary when it is expected to obtain samples with arbitrary accuracy more efficiently than vanilla Cholesky sampling; the accuracy is said to be limited when this is not the case. The integer parameters $T_{\mathrm{bi}}$, $T$, and $K$, respectively, refer to the burn-in period of MCMC approaches, the number of desired samples from $\mathcal{N}(\boldsymbol{\mu}, \mathbf{Q}^{-1})$, and the truncation parameter associated to samplers of section 3. The shaded nodes highlight that nonasymptotic convergence bounds were available when this review was written.
promote reproducibility, this article is complemented by a companion package written


in Python named PyGauss (http://github.com/mvono/PyGauss) that implements all the reviewed approaches.
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

Appendix A. Guide to Notation. The following table lists and defines all the
notation used in this paper.

$\mathcal{N}(\cdot \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ : Multivariate Gaussian probability distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.
$\mathbf{M}^\top$ : Transpose of matrix $\mathbf{M}$.
$f(d) = \mathcal{O}(d)$ : Order of the function $f$ when $d \rightarrow \infty$ up to constant factors.
$f(d) = \Theta(d)$ : There exist $C_1, C_2 \in \mathbb{R}$ such that $C_1 d \leq f(d) \leq C_2 d$ when $d \rightarrow \infty$.
$\det(\mathbf{M})$ : Determinant of the matrix $\mathbf{M}$.
$\triangleq$ : By definition.
$\mathbf{0}_d$ : Null vector on $\mathbb{R}^d$.
$\mathbf{0}_{d \times d}$ : Null matrix on $\mathbb{R}^{d \times d}$.
$\mathbf{I}_d$ : Identity matrix on $\mathbb{R}^{d \times d}$.
$\|\cdot\|$ : The $L_2$ norm.
$\mathrm{diag}(\mathbf{v})$ : The $d \times d$ diagonal matrix with diagonal elements $\mathbf{v} = (v_1, \ldots, v_d)^\top$.
$\mathbf{Q} = \mathbf{M} - \mathbf{N}$ : Matrix-splitting decomposition of the precision matrix $\mathbf{Q}$.

Appendix B. Details and Proofs Associated to Subsection 5.1. First, we briefly


recall some useful definitions associated to monotone operators. For more information
on the theory of monotone operators in Hilbert spaces, we refer the interested reader
to the book [11].

General Definitions.

Definition B.1 (operator). Let the notation $2^{\mathbb{R}^d}$ represent the family of all subsets of $\mathbb{R}^d$. An operator or multivalued function $\mathsf{K} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$ maps every point in $\mathbb{R}^d$ to a subset of $\mathbb{R}^d$.

Definition B.2 (graph). Let $\mathsf{K} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$. The graph of $\mathsf{K}$ is defined by

(B.1)    $\mathrm{gra}(\mathsf{K}) = \{(\boldsymbol{\theta}, \mathbf{u}) \in \mathbb{R}^d \times \mathbb{R}^d \mid \mathbf{u} \in \mathsf{K}(\boldsymbol{\theta})\}$.

Definition B.3 (monotone operator). Let $\mathsf{K} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$. $\mathsf{K}$ is said to be monotone if

(B.2)    $\forall (\boldsymbol{\theta}, \mathbf{u}) \in \mathrm{gra}(\mathsf{K})$ and $\forall (\mathbf{y}, \mathbf{p}) \in \mathrm{gra}(\mathsf{K})$, $\langle \boldsymbol{\theta} - \mathbf{y}, \mathbf{u} - \mathbf{p} \rangle \geq 0$.

Definition B.4 (maximal monotone operator). Let $\mathsf{K} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$ be monotone. Then $\mathsf{K}$ is maximal monotone if there exists no monotone operator $\mathsf{P} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$ such that $\mathrm{gra}(\mathsf{P})$ properly contains $\mathrm{gra}(\mathsf{K})$, i.e., for every $(\boldsymbol{\theta}, \mathbf{u}) \in \mathbb{R}^d \times \mathbb{R}^d$,

(B.3)    $(\boldsymbol{\theta}, \mathbf{u}) \in \mathrm{gra}(\mathsf{K}) \leftrightarrow \forall (\mathbf{y}, \mathbf{p}) \in \mathrm{gra}(\mathsf{K}), \langle \boldsymbol{\theta} - \mathbf{y}, \mathbf{u} - \mathbf{p} \rangle \geq 0$.

Definition B.5 (nonexpansiveness). Let $\mathsf{K} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$. Then $\mathsf{K}$ is nonexpansive if it is Lipschitz continuous with constant 1, i.e., for every $(\boldsymbol{\theta}, \mathbf{y}) \in \mathbb{R}^d \times \mathbb{R}^d$,

(B.4)    $\|\mathsf{K}(\mathbf{y}) - \mathsf{K}(\boldsymbol{\theta})\| \leq \|\mathbf{y} - \boldsymbol{\theta}\|$,

where $\|\cdot\|$ is the standard Euclidean norm.

Definition B.6 (domain). Let $\mathsf{K} : \mathbb{R}^d \rightarrow 2^{\mathbb{R}^d}$. The domain of $\mathsf{K}$ is defined by

(B.5)    $\mathrm{dom}(\mathsf{K}) = \{\boldsymbol{\theta} \in \mathbb{R}^d \mid \mathsf{K}(\boldsymbol{\theta}) \neq \varnothing\}$.

The PPA. For $\lambda > 0$, define the Moreau--Yosida resolvent operator associated to $\mathsf{K}$ as the operator $\mathsf{L}$ defined by

(B.6)    $\mathsf{L} = (\mathrm{Id} + \lambda \mathsf{K})^{-1}$,

where $\mathrm{Id}$ is the identity operator. The monotonicity of $\mathsf{K}$ implies that $\mathsf{L}$ is nonexpansive and its maximal monotonicity yields $\mathrm{dom}(\mathsf{L}) = \mathbb{R}^d$ [72], where the notation ``dom'' means the domain of the operator $\mathsf{L}$. Therefore, solving the problem (5.16) is equivalent to solving the following fixed point problem for all $\boldsymbol{\theta} \in \mathbb{R}^d$:

(B.7)    $\boldsymbol{\theta} = \mathsf{L}(\boldsymbol{\theta})$.

This result suggests that finding the zeros of $\mathsf{K}$ can be achieved by building a sequence of iterates $\{\boldsymbol{\theta}^{(t)}\}_{t \in \mathbb{N}}$ such that for $t \in \mathbb{N}$,

(B.8)    $\boldsymbol{\theta}^{(t+1)} = (\mathrm{Id} + \lambda \mathsf{K})^{-1}(\boldsymbol{\theta}^{(t)})$.

This iteration corresponds to the PPA with an arbitrary monotone operator $\mathsf{K}$.
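As a simple worked illustration (ours, not drawn from [11] or [72]): for $\mathsf{K} = \partial f$ with $f(\boldsymbol{\theta}) = \frac{1}{2}\|\boldsymbol{\theta}\|^2$, the resolvent (B.6) is explicit and the PPA iterates (B.8) contract geometrically toward the unique zero $\boldsymbol{\theta} = \mathbf{0}_d$ of $\mathsf{K}$:

$\mathsf{L}(\boldsymbol{\theta}) = (\mathrm{Id} + \lambda\mathsf{K})^{-1}(\boldsymbol{\theta}) = \frac{\boldsymbol{\theta}}{1+\lambda}, \qquad \boldsymbol{\theta}^{(t+1)} = \frac{\boldsymbol{\theta}^{(t)}}{1+\lambda} \;\Longrightarrow\; \boldsymbol{\theta}^{(t)} = \frac{\boldsymbol{\theta}^{(0)}}{(1+\lambda)^{t}} \longrightarrow \mathbf{0}_d \text{ as } t \to \infty.$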
Proof of (5.23). Applying the PPA with $\mathbf{R} = \mathbf{W} - \rho^{-1}\mathbf{A}^\top\mathbf{A}$ to (5.20) leads to

$\boldsymbol{\theta}^{(t+1)} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} g_2(\boldsymbol{\theta}) + \frac{1}{2\rho}\left\|\mathbf{A}\boldsymbol{\theta} - \mathbf{z}^{(t+1)} + \mathbf{u}^{(t)}\right\|^2 + \frac{1}{2}\left\|\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}\right\|_{\mathbf{R}}^2$

$\qquad = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} g_2(\boldsymbol{\theta}) + \frac{1}{2\rho}\left\|\mathbf{A}\boldsymbol{\theta} - \mathbf{z}^{(t+1)} + \mathbf{u}^{(t)}\right\|^2 + \frac{1}{2}\left\langle \mathbf{R}\big(\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}\big), \boldsymbol{\theta} - \boldsymbol{\theta}^{(t)} \right\rangle$

$\qquad = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} g_2(\boldsymbol{\theta}) + \frac{1}{2}\left( \boldsymbol{\theta}^\top\left[\frac{1}{\rho}\mathbf{A}^\top\mathbf{A} + \mathbf{R}\right]\boldsymbol{\theta} - 2\boldsymbol{\theta}^\top\left[\frac{1}{\rho}\mathbf{A}^\top\big\{\mathbf{z}^{(t+1)} - \mathbf{u}^{(t)}\big\} + \mathbf{R}\boldsymbol{\theta}^{(t)}\right] \right)$

$\qquad = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} g_2(\boldsymbol{\theta}) + \frac{1}{2}\left( \boldsymbol{\theta}^\top\mathbf{W}\boldsymbol{\theta} - 2\boldsymbol{\theta}^\top\left[\mathbf{W}\boldsymbol{\theta}^{(t)} + \frac{1}{\rho}\mathbf{A}^\top\big\{\mathbf{z}^{(t+1)} - \mathbf{u}^{(t)} - \mathbf{A}\boldsymbol{\theta}^{(t)}\big\}\right] \right)$

$\qquad = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^d} g_2(\boldsymbol{\theta}) + \frac{1}{2}\left\| \boldsymbol{\theta} - \left(\boldsymbol{\theta}^{(t)} + \frac{1}{\rho}\mathbf{W}^{-1}\mathbf{A}^\top\big[\mathbf{z}^{(t+1)} - \mathbf{u}^{(t)} - \mathbf{A}\boldsymbol{\theta}^{(t)}\big]\right) \right\|_{\mathbf{W}}^2.$

Appendix C. Details Associated to Subsection 6.2.2. The optimal value of the tuning parameter $\omega$ for the two MS schemes SOR and Richardson is given by

(C.1)    $\omega_{\mathrm{SOR}}^{\ast} = \dfrac{2}{1 + \sqrt{1 - \rho\left(\mathbf{I}_d - \mathbf{D}^{-1}\mathbf{Q}\right)^2}}$,

where $\mathbf{D}$ is the diagonal part of $\mathbf{Q}$. Regarding the MCMC sampler based on Richardson splitting, we used the optimal value

(C.2)    $\omega_{\mathrm{Richardson}}^{\ast} = \dfrac{2}{\lambda_{\min}(\mathbf{Q}) + \lambda_{\max}(\mathbf{Q})}$,

where $\lambda_{\min}(\mathbf{Q})$ and $\lambda_{\max}(\mathbf{Q})$ are the minimum and maximum eigenvalues of $\mathbf{Q}$, respectively. Finally, for the samplers based on SSOR splitting including Algorithm 4.3, we used the optimal relaxation parameter

(C.3)    $\omega_{\mathrm{SSOR}}^{\ast} = \dfrac{2}{1 + \sqrt{2\left(1 - \rho\left(\mathbf{D}^{-1}(\mathbf{L} + \mathbf{L}^\top)\right)\right)}}$,

where $\mathbf{L}$ is the strictly lower triangular part of $\mathbf{Q}$.
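These relaxation parameters are straightforward to evaluate once $\mathbf{Q}$ is available; the helper below is our own direct transcription of (C.1)--(C.3) for dense matrices, interpreting $\rho(\cdot)$ as a spectral radius computed by brute force.

```python
import numpy as np

def optimal_relaxation_parameters(Q):
    """Transcription of (C.1)-(C.3) for a symmetric positive-definite precision Q."""
    d = Q.shape[0]
    D = np.diag(np.diag(Q))
    L = np.tril(Q, k=-1)                        # strictly lower triangular part of Q

    def spec_radius(M):
        return np.max(np.abs(np.linalg.eigvals(M)))

    rho_jac = spec_radius(np.eye(d) - np.linalg.solve(D, Q))
    omega_sor = 2.0 / (1.0 + np.sqrt(1.0 - rho_jac ** 2))                     # (C.1)

    eig_Q = np.linalg.eigvalsh(Q)
    omega_richardson = 2.0 / (eig_Q.min() + eig_Q.max())                      # (C.2)

    rho_ssor = spec_radius(np.linalg.solve(D, L + L.T))
    omega_ssor = 2.0 / (1.0 + np.sqrt(2.0 * (1.0 - rho_ssor)))                # (C.3)

    return omega_sor, omega_richardson, omega_ssor
```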


Acknowledgments. The authors would like to thank Dr. Jérôme Idier (LS2N, France) and Prof. Jean-Christophe Pesquet (CentraleSupélec, France) for relevant feedback on an earlier version of this paper. They are also grateful to the editor and two anonymous reviewers whose comments helped to significantly improve the quality of the paper.

REFERENCES

[1] S. L. Adler, Over-relaxation method for the Monte Carlo evaluation of the partition function
for multiquadratic actions, Phys. Rev. D, 23 (1981), pp. 2901--2904. (Cited on p. 20)
[2] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, Fast image recovery using
variable splitting and constrained optimization, IEEE Trans. Image Process., 19 (2010),
pp. 2345--2356. (Cited on p. 27)
[3] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, An augmented Lagrangian
approach to the constrained optimization formulation of imaging inverse problems, IEEE
Trans. Image Process., 20 (2011), pp. 681--695. (Cited on p. 27)
[4] E. Allen, J. Baglama, and S. Boyd, Numerical approximation of the product of the square
root of a matrix with a vector, Linear Algebra Appl., 310 (2000), pp. 167--181, https://doi.org/10.1016/S0024-3795(00)00068-9. (Cited on p. 17)
[5] Y. Altmann, S. McLaughlin, and N. Dobigeon, Sampling from a multivariate Gaussian
distribution truncated on a simplex: A review, in Proc. IEEE-SP Workshop Stat. and
Signal Process. (SSP), Gold Coast, Australia, 2014, pp. 113--116, https://doi.org/10.1109/SSP.2014.6884588. (Cited on p. 12)
[6] S. A. Armstrong, J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D.
Minden, S. E. Sallan, E. S. Lander, T. R. Golub, and S. J. Korsmeyer, MLL translo-
cations specify a distinct gene expression profile that distinguishes a unique leukemia,
Nat. Genet., 30 (2002), pp. 41--47, https://doi.org/10.1038/ng765. (Cited on pp. 13, 14)
[7] E. Aune, J. Eidsvik, and Y. Pokern, Iterative numerical methods for sampling from high
dimensional Gaussian distributions, Stat. Comput., 23 (2013), pp. 501--521, https://doi.
org/10.1007/s11222-012-9326-8. (Cited on pp. 16, 17, 36, 39)
[8] O. Axelsson, Iterative Solution Methods, Cambridge University Press, 1994, https://doi.org/
10.1017/CBO9780511624100. (Cited on p. 22)
[9] A.-C. Barbos, F. Caron, J.-F. Giovannelli, and A. Doucet, Clone MCMC: Parallel high-
dimensional Gaussian Gibbs sampling, in Adv. in Neural Information Process. Systems,
2017, pp. 5020--5028. (Cited on pp. 22, 24, 25, 32, 36, 37)
[10] P. Barone and A. Frigessi, Improving stochastic relaxation for Gaussian random fields,
Probab. Eng. Inform. Sci., 4 (1990), pp. 369--389. (Cited on p. 20)
[11] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory
in Hilbert Spaces, 2nd ed., Springer, 2017. (Cited on p. 49)
[12] J. Besag and C. Kooperberg, On conditional and intrinsic autoregressions, Biometrika, 82
(1995), pp. 733--746, https://doi.org/10.2307/2337341. (Cited on pp. 5, 12)
[13] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006. (Cited on
p. 12)

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


52 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

[14] G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control, 3rd ed.,
Prentice Hall, 1994. (Cited on p. 29)
[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

statistical learning via the alternating direction method of multipliers, Found. Trends
Mach. Learn., 3 (2011), pp. 1--122. (Cited on pp. 27, 34)
[16] S. Brahim-Belhouari and A. Bermak, Gaussian process for nonstationary time series pre-
diction, Comput. Statist. Data Anal., 47 (2004), pp. 705--712, https://doi.org/10.1016/j.
csda.2004.02.006. (Cited on p. 5)
[17] K. Bredies and H. Sun, A proximal point analysis of the preconditioned alternating direction
method of multipliers, J. Optim. Theory Appl., 173 (2017), pp. 878--907, https://doi.org/
10.1007/s10957-017-1112-5. (Cited on p. 34)
[18] B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt,
M. Brubaker, J. Guo, P. Li, and A. Riddell, Stan: A probabilistic programming lan-
guage, J. Stat. Softw., 76 (2017), pp. 1--32, https://doi.org/10.18637/jss.v076.i01. (Cited
on p. 46)
[19] A. Chambolle and T. Pock, A first-order primal-dual algorithm for convex problems with
applications to imaging, J. Math. Imaging Vision, 40 (2011), pp. 120--145, https://doi.
org/10.1007/s10851-010-0251-1. (Cited on p. 34)
[20] A.-L. Cholesky, Sur la r\' esolution num\'
erique des syst\èmes d'\'
equations lin\'
eaires, Bull. Soc.
\'
Amis Bibl. Ecole Polytech., 39 (1910), pp. 81--95. (Cited on p. 14)
[21] E. Chow and Y. Saad, Preconditioned Krylov subspace methods for sampling multivariate
Gaussian distributions, SIAM J. Sci. Comput., 36 (2014), pp. A588--A608, https://doi.
org/10.1137/130920587. (Cited on pp. 16, 17)
[22] E. S. Coakley and V. Rokhlin, A fast divide-and-conquer algorithm for computing the
spectra of real symmetric tridiagonal matrices, Appl. Comput. Harmonic Anal., 34 (2013),
pp. 379--414, https://doi.org/10.1016/j.acha.2012.06.003. (Cited on p. 16)
[23] N. A. C. Cressie, Statistics for Spatial Data, 2nd ed., Wiley, 1993. (Cited on p. 5)
[24] M. W. Davis, Generating large stochastic simulations---the matrix polynomial approxima-
tion method, Math. Geosci., 19 (1987), pp. 99--107, https://doi.org/10.1007/BF00898190.
(Cited on pp. 15, 16, 36)
[25] O. Demir-Kavuk, H. Riedesel, and E.-W. Knapp, Exploring classification strategies with
the CoEPrA 2006 contest, Bioinformatics, 26 (2010), pp. 603--609, https://doi.org/10.
1093/bioinformatics/btq021. (Cited on pp. 13, 14)
[26] C. R. Dietrich and G. N. Newsam, Fast and exact simulation of stationary Gaussian
processes through circulant embedding of the covariance matrix, SIAM J. Sci. Comput.,
18 (1997), pp. 1088--1107, https://doi.org/10.1137/S1064827592240555. (Cited on p. 11)
[27] S. Duane, A. Kennedy, B. J. Pendleton, and D. Roweth, Hybrid Monte Carlo, Phys.
Lett. B, 195 (1987), pp. 216--222. (Cited on p. 5)
[28] E. Esser, X. Zhang, and T. F. Chan, A general framework for a class of first order primal-
dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 3
(2010), pp. 1015--1046, https://doi.org/10.1137/09076934X. (Cited on p. 34)
[29] L. Fahrmeir and S. Lang, Bayesian semiparametric regression analysis of multicategorical
time-space data, Ann. Inst. Statist. Math., 53 (2001), pp. 11--30, https://doi.org/10.1023/
A:1017904118167. (Cited on p. 5)
[30] O. Feron,
\' F. Orieux, and J. F. Giovannelli, Gradient scan Gibbs sampler: An efficient
algorithm for high-dimensional Gaussian distributions, IEEE J. Sel. Topics Signal Pro-
cess., 10 (2016), pp. 343--352, https://doi.org/10.1109/JSTSP.2015.2510961. (Cited on
pp. 20, 36)
[31] C. Fox and A. Parker, Convergence in variance of Chebyshev accelerated Gibbs samplers,
SIAM J. Sci. Comput., 36 (2014), pp. A124--A147, https://doi.org/10.1137/120900940.
(Cited on pp. 23, 36, 40)
[32] C. Fox and A. Parker, Accelerated Gibbs sampling of normal distributions using matrix
splittings and polynomials, Bernoulli, 23 (2017), pp. 3711--3743, https://doi.org/10.3150/
16-BEJ863. (Cited on pp. 7, 20, 21, 22, 23, 30, 36, 37, 42, 43)
[33] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational
problems via finite element approximation, Comput. Math. Appl., 2 (1976), pp. 17--40,
https://doi.org/10.1016/0898-1221(76)90003-1. (Cited on p. 34)
[34] B. Galerne, Y. Gousseau, and J. Morel, Random phase textures: Theory and synthesis,
IEEE Trans. Image Process., 20 (2011), pp. 257--267, https://doi.org/10.1109/TIP.2010.
2052822. (Cited on p. 5)
[35] A. Gelman, C. Robert, N. Chopin, and J. Rousseau, Bayesian Data Analysis, Chapman
\& Hall, 1995. (Cited on p. 20)

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 53

[36] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6 (1984), pp. 721--741.
(Cited on pp. 5, 13, 20)
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

[37] C. Gilavert, S. Moussaoui, and J. Idier, Efficient Gaussian sampling for solving large-
scale inverse problems using MCMC, IEEE Trans. Signal Process., 63 (2015), pp. 70--80,
https://doi.org/10.1109/TSP.2014.2367457. (Cited on pp. 5, 19, 36, 37, 46)
[38] J.-F. Giovannelli and J. Idier, Regularization and Bayesian Methods for Inverse Problems
in Signal and Image Processing, 1st ed., Wiley-IEEE Press, 2015. (Cited on p. 44)
[39] P. Giudici and P. J. Green, Decomposable graphical Gaussian model determination,
Biometrika, 86 (1999), pp. 785--801, https://doi.org/10.1093/biomet/86.4.785. (Cited
on p. 5)
[40] R. Glowinski and A. Marroco, Sur l'approximation, par \' el\'
ements finis d'ordre un, et la
r\'
esolution, par p\'
enalisation-dualit\'
e d'une classe de probl\èmes de Dirichlet non lin\'
eaires,
RAIRO Anal. Numer., 9 (1975), pp. 41--76, http://www.numdam.org/item/M2AN 1975
9 2 41 0. (Cited on p. 34)
[41] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed., The Johns Hopkins
University Press, 1989. (Cited on pp. 6, 10, 14, 16, 21, 22, 42)
[42] J. Goodman and A. D. Sokal, Multigrid Monte Carlo method. Conceptual foundations,
Phys. Rev. D, 40 (1989), pp. 2035--2071, https://doi.org/10.1103/PhysRevD.40.2035.
(Cited on p. 20)
[43] P. J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination, Biometrika, 82 (1995), pp. 711--732, https://doi.org/10.1093/biomet/82.
4.711. (Cited on p. 19)
[44] M. Gu and S. C. Eisenstat, A divide-and-conquer algorithm for the symmetric tridiagonal
eigenproblem, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 172--191, https://doi.org/10.
1137/S0895479892241287. (Cited on p. 16)
[45] N. Hale, N. J. Higham, and L. N. Trefethen, Computing A\alpha , log(a), and related matrix
functions by contour integrals, SIAM J. Numer. Anal., 46 (2008), pp. 2505--2523, https:
//doi.org/10.1137/070700607. (Cited on p. 17)
[46] J. Havil, Gamma: Exploring Euler's Constant, Princeton University Press, 2003. (Cited on
p. 5)
[47] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems,
J. Res. Natl. Inst. Stand., 49 (1952), pp. 409--436. (Cited on p. 19)
[48] D. Higdon, A primer on space-time modeling from a Bayesian perspective, in Statistical
Methods for Spatio-Temporal Systems, B. Finkenstadt, L. Held, and V. lsham, eds.,
Chapman \& Hall/CRC, 2007, pp. 217--279. (Cited on pp. 38, 42)
[49] J. P. Hobert and G. L. Jones, Honest exploration of intractable probability distributions via
Markov chain Monte Carlo, Stat. Sci., 16 (2001), pp. 312--334, https://doi.org/10.1214/
ss/1015346317. (Cited on p. 20)
[50] J. Idier, ed., Bayesian Approach to Inverse Problems, Wiley, 2008, https://doi.org/10.1002/
9780470611197. (Cited on pp. 10, 25)
[51] M. Ilic, T. Pettitt, and I. Turner, Bayesian computations and efficient algorithms for
computing functions of large sparse matrices, ANZIAM J., 45 (2004), pp. 504--518, https:
//eprints.qut.edu.au/22511/. (Cited on p. 16)
[52] M. Ilic, I. W. Turner, and D. P. Simpson, A restarted Lanczos approximation to functions
of a symmetric matrix, IMA J. Numer. Anal., 30 (2009), pp. 1044--1061, https://doi.org/
10.1093/imanum/drp003. (Cited on pp. 16, 17, 42)
[53] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. R.
Soc. A, 186 (1946), pp. 453--461. (Cited on p. 13)
[54] M. Johnson, J. Saunderson, and A. Willsky, Analyzing Hogwild parallel Gaussian Gibbs
sampling, in Adv. in Neural Information Process. Systems, 2013, pp. 2715--2723. (Cited
on pp. 7, 22, 24, 25, 32, 36)
[55] R. E. Kass, B. P. Carlin, A. Gelman, and R. M. Neal, Markov chain Monte Carlo in
practice: A roundtable discussion, Amer. Statist., 52 (1998), pp. 93--100. (Cited on p. 46)
[56] J. Kittler and J. Foglein,
\" Contextual classification of multispectral pixel data, Image Vis.
Comput., 2 (1984), pp. 13--29, https://doi.org/10.1016/0262-8856(84)90040-4. (Cited on
p. 5)
[57] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to
document recognition, Proc. IEEE, 86 (1998), pp. 2278--2324, https://doi.org/10.1109/5.
726791. (Cited on pp. 13, 14)
[58] M. Li, D. Sun, and K.-C. Toh, A majorized ADMM with indefinite proximal terms for
linearly constrained convex composite optimization, SIAM J. Optim., 26 (2016), pp. 922--
950, https://doi.org/10.1137/140999025. (Cited on p. 34)

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


54 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

[59] S. Z. Li, Markov Random Field Modeling in Image Analysis, 3rd ed., Springer, 2009. (Cited
on p. 11)
[60] Y. Li and S. K. Ghosh, Efficient sampling methods for truncated multivariate normal and
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

Student-t distributions subject to linear inequality constraints, J. Stat. Theory Pract., 9


(2015), pp. 712--732, https://doi.org/10.1080/15598608.2014.996690. (Cited on p. 12)
[61] A. C. Likas and N. P. Galatsanos, A variational approach for Bayesian blind image de-
convolution, IEEE Trans. Signal Process., 52 (2004), pp. 2222--2233. (Cited on p. 44)
[62] J. S. Liu, Monte Carlo Strategies in Scientific Computing, Springer, 2001. (Cited on p. 46)
[63] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge Uni-
versity Press, 2003. (Cited on p. 38)
[64] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed., Academic Press,
2008. (Cited on p. 16)
[65] A. Mantoglou and J. L. Wilson, The turning bands method for simulation of random fields
using line generation by a spectral method, Water Resour. Res., 18 (1982), pp. 1379--1394,
https://doi.org/10.1029/WR018i005p01379. (Cited on p. 11)
[66] Y. Marnissi, D. Abboud, E. Chouzenoux, J.-C. Pesquet, M. El-Badaoui, and
A. Benazza-Benyahia, A data augmentation approach for sampling Gaussian models
in high dimension, in Proc. European Signal Process. Conf. (EUSIPCO), Coruna, Spain,
2019, pp. 1--5. (Cited on pp. 7, 25, 26, 27, 30, 36)
[67] Y. Marnissi, E. Chouzenoux, A. Benazza-Benyahia, and J.-C. Pesquet, An auxiliary
variable method for Markov chain Monte Carlo algorithms in high dimension, Entropy,
20 (2018), art. 110, https://doi.org/10.3390/e20020110. (Cited on pp. 5, 7, 25, 26, 44)
[68] B. Martinet, Br\ève communication. R\' egularisation d'in\'
equations variationnelles par ap-
proximations successives, ESAIM Math. Model. Numer. Anal., 4 (1970), pp. 154--158,
http://www.numdam.org/item/M2AN 1970 4 3 154 0. (Cited on p. 33)
[69] B. Martinet, Determination approach\' e d'un point fixe dune application pseudocontractante.
Cas de l'application prox, C. R. Math. Acad. Sci. Paris, 274 (1972), pp. 163--165. (Cited
on p. 33)
[70] J. Mason and D. Handscomb, Chebyshev Polynomials, CRC Press, 2002. (Cited on p. 16)
[71] J. M. Mej\'{\i}a and I. Rodr\'{\i}guez-Iturbe, On the synthesis of random field sampling from
the spectrum: An application to the generation of hydrologic spatial processes, Water
Resour. Res., 10 (1974), pp. 705--711, https://doi.org/10.1029/WR010i004p00705. (Cited
on p. 11)
[72] G. J. Minty, Monotone (nonlinear) operators in Hilbert space, Duke Math. J., 29 (1962),
pp. 341--346, https://doi.org/10.1215/S0012-7094-62-02933-2. (Cited on p. 50)
[73] A. de Moivre, The Doctrine of Chances: Or, a Method of Calculating the Probability of
Events in Play, 1st ed., W. Pearson, 1718. (Cited on p. 5)
[74] R. Molina, J. Mateos, and A. K. Katsaggelos, Blind deconvolution using a variational
approach to parameter, image, and blur estimation, IEEE Trans. Image Process., 15
(2006), pp. 3715--3727. (Cited on p. 44)
[75] R. Molina and B. D. Ripley, Using spatial models as priors in astronomical image analysis,
J. Appl. Stat., 16 (1989), pp. 193--206. (Cited on p. 44)
[76] J. J. Moreau, Proximit\' e et dualit\' e dans un espace hilbertien, Bull. de la Soc. Math. de
France, 93 (1965), pp. 273--299, https://doi.org/10.24033/bsmf.1625. (Cited on p. 33)
[77] F. Orieux, O. Feron, and J. F. Giovannelli, Sampling high-dimensional Gaussian dis-
tributions for general linear inverse problems, IEEE Signal Process. Lett., 19 (2012),
pp. 251--254, https://doi.org/10.1109/LSP.2012.2189104. (Cited on pp. 19, 36)
[78] F. Orieux, J.-F. Giovannelli, and T. Rodet, Bayesian estimation of regularization and
point spread function parameters for Wiener--Hunt deconvolution, J. Opt. Soc. Am. A,
27 (2010), pp. 1593--1607, https://doi.org/10.1364/JOSAA.27.001593. (Cited on pp. 11,
25, 44)
[79] G. Papandreou and A. L. Yuille, Gaussian sampling by local perturbations, in Adv. in
Neural Information Process. Systems, 2010, pp. 1858--1866. (Cited on pp. 18, 19, 36, 37)
[80] T. Park and G. Casella, The Bayesian Lasso, J. Amer. Stat. Assoc., 103 (2008), pp. 681--
686, https://doi.org/10.1198/016214508000000337. (Cited on p. 5)
[81] A. Parker and C. Fox, Sampling Gaussian distributions in Krylov spaces with conjugate
gradients, SIAM J. Sci. Comput., 34 (2012), pp. B312--B334, https://doi.org/10.1137/
110831404. (Cited on pp. 12, 19, 20, 36, 38, 39, 40, 42)
[82] M. Pereira and N. Desassis, Efficient simulation of Gaussian Markov random fields by
Chebyshev polynomial approximation, Spat. Stat., 31 (2019), art. 100359, https://doi.
org/10.1016/j.spasta.2019.100359. (Cited on pp. 16, 36)

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


HIGH-DIMENSIONAL GAUSSIAN SAMPLING 55

[83] N. G. Polson, J. G. Scott, and J. Windle, Bayesian inference for logistic models using
P\'
olya-Gamma latent variables, J. Amer. Stat. Assoc., 108 (2013), pp. 1339--1349, https:
//doi.org/10.1080/01621459.2013.829001. (Cited on p. 5)
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

[84] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical


Recipes: The Art of Scientific Computing, 3rd ed., Cambridge University Press, 2007.
(Cited on p. 16)
[85] C. E. Rasmussen, Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian
integrals, in Bayesian Statistics 7, Oxford University Press, 2003, pp. 651--660. (Cited on
p. 38)
[86] C. P. Robert, The Bayesian Choice: A Decision-Theoretic Motivation, Springer, 2001.
(Cited on pp. 13, 18)
[87] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 2004. (Cited on
pp. 13, 20, 30, 31, 42)
[88] G. O. Roberts and S. K. Sahu, Updating schemes, correlation structure, blocking and pa-
rameterization for the Gibbs sampler, J. Roy. Stat. Soc. Ser. B, 59 (1997), pp. 291--317,
https://doi.org/10.1111/1467-9868.00070. (Cited on pp. 20, 23)
[89] G. O. Roberts and R. L. Tweedie, Exponential convergence of Langevin distributions and
their discrete approximations, Bernoulli, 2 (1996), pp. 341--363, https://doi.org/10.2307/
3318418. (Cited on p. 5)
[90] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control
Optim., 14 (1976), pp. 877--898, https://doi.org/10.1137/0314056. (Cited on pp. 7, 20,
28, 32, 33)
[91] H. Rue, Fast sampling of Gaussian Markov random fields, J. Roy. Stat. Soc. Ser. B, 63 (2001),
pp. 325--338, https://doi.org/10.1111/1467-9868.00288. (Cited on pp. 6, 10, 15, 19, 36)
[92] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications, Chapman
\& Hall/CRC, 2005. (Cited on pp. 5, 6, 10, 11, 12, 15, 20, 25, 42)
[93] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed., SIAM, 2003, https://doi.org/
10.1137/1.9780898718003. (Cited on p. 21)
[94] E. M. Scheuer and D. S. Stoller, On the generation of normal random vectors, Techno-
metrics, 4 (1962), pp. 278--281, https://doi.org/10.2307/1266625. (Cited on p. 36)
[95] M. K. Schneider and A. S. Willsky, A Krylov subspace method for covariance approxima-
tion and simulation of random processes and fields, Multidimens. Syst. Signal Process.,
14 (2003), pp. 295--318, https://doi.org/10.1023/A:1023530718764. (Cited on pp. 19, 20)
[96] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal.
Mach. Intell., 22 (2000), pp. 888--905, https://doi.org/10.1109/34.868688. (Cited on p. 38)
[97] M. Shinozuka and C.-M. Jan, Digital simulation of random processes and its applications,
J. Sound Vib., 25 (1972), pp. 111--128, https://doi.org/10.1016/0022-460X(72)90600-1.
(Cited on p. 11)
[98] D. P. Simpson, I. W. Turner, and A. N. Pettitt, Fast Sampling from a Gaussian Markov
Random Field Using Krylov Subspace Approaches, 2008, https://eprints.qut.edu.au/
14376/. (Cited on pp. 16, 17, 36)
[99] D. P. Simpson, I. W. Turner, C. M. Strickland, and A. N. Pettitt, Scalable Iterative
Methods for Sampling from Massive Gaussian Random Vectors, preprint, https://arxiv.
org/abs/1312.1476, 2013. (Cited on p. 16)
[100] G. W. Stewart, Matrix Algorithms. Volume 2: Eigensystems, SIAM, 2001, https://doi.org/
10.1137/1.9780898718058. (Cited on p. 17)
[101] D. B. Thomas, W. Luk, P. H. Leong, and J. D. Villasenor, Gaussian random number
generators, ACM Comput. Surv., 39 (2007), art. 11, https://doi.org/10.1145/1287620.
1287622. (Cited on p. 9)
[102] W. F. Trench, An algorithm for the inversion of finite Toeplitz matrices, SIAM J. Appl.
Math., 12 (1964), pp. 512--522, https://doi.org/10.1137/0112045. (Cited on p. 20)
[103] D. W. Vasco, L. R. Johnson, and O. Marques, Global Earth structure: Inference
and assessment, Geophys. J. Int., 137 (1999), pp. 381--407, https://doi.org/10.1046/j.
1365-246X.1999.00823.x. (Cited on p. 38)
[104] M. Vono, N. Dobigeon, and P. Chainais, Split-and-augmented Gibbs sampler: Application
to large-scale inference problems, IEEE Trans. Signal Process., 67 (2019), pp. 1648--1661,
https://doi.org/10.1109/TSP.2019.2894825. (Cited on pp. 25, 27, 36, 44)
[105] M. Vono, N. Dobigeon, and P. Chainais, Asymptotically exact data augmentation: Models,
properties, and algorithms, J. Comput. Graph. Statist., 30 (2021), pp. 335--348 (Cited on
pp. 27, 32)
[106] M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics,
in Proc. Int. Conf. Machine Learning (ICML), 2011, pp. 681--688. (Cited on p. 5)

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


56 MAXIME VONO, NICOLAS DOBIGEON, AND PIERRE CHAINAIS

[107] S. Wilhelm and B. Manjunath, tmvtnorm: A package for the truncated multivariate normal
distribution, The R Journal, 2 (2010), pp. 25--29, https://doi.org/10.32614/RJ-2010-005.
(Cited on p. 12)
Downloaded 08/05/24 to 192.38.67.116 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy

[108] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, MIT
Press, 2006. (Cited on p. 38)
[109] A. T. A. Wood and G. Chan, Simulation of stationary Gaussian processes in [0, 1]d , J.
Comput. Graph. Stat., 3 (1994), pp. 409--432, https://doi.org/10.1080/10618600.1994.
10474655. (Cited on p. 11)
[110] X. Zhang, M. Burger, and S. Osher, A unified primal-dual algorithm framework based
on Bregman iteration, J. Sci. Comput., 46 (2011), pp. 20--46, https://doi.org/10.1007/
s10915-010-9408-8. (Cited on p. 34)

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

You might also like