
Journal of Machine Learning Research 25 (2024) 1-34 Submitted 11/21; Revised 1/24; Published 1/24

Nonasymptotic analysis of Stochastic Gradient Hamiltonian


Monte Carlo under local conditions for nonconvex
optimization

Ö. Deniz Akyildiz [email protected]


Department of Mathematics, Imperial College London
Sotirios Sabanis [email protected]
School of Mathematics, University of Edinburgh
The Alan Turing Institute
National Technical University of Athens

Editor: Ohad Shamir

Abstract
We provide a nonasymptotic analysis of the convergence of the stochastic gradient Hamiltonian
Monte Carlo (SGHMC) to a target measure in Wasserstein-2 distance without assuming
log-concavity. Our analysis quantifies key theoretical properties of the SGHMC as a
sampler under local conditions, which significantly improves upon previous findings.
In particular, we prove that the Wasserstein-2 distance between the target and the law of
the SGHMC is uniformly controlled by the step-size of the algorithm, thereby demonstrating
that the SGHMC can provide high-precision results uniformly in the number of iterations.
The analysis also allows us to obtain nonasymptotic bounds for nonconvex optimization
problems under local conditions and implies that the SGHMC, when viewed as a nonconvex
optimizer, converges to a global minimum with the best known rates. We apply our
results to obtain nonasymptotic bounds for scalable Bayesian inference and nonasymptotic
generalization bounds.
Keywords: Non-convex optimization, underdamped Langevin Monte Carlo, non-log-concave
sampling, nonasymptotic bounds, global optimization.

1. Introduction
We are interested in nonasymptotic estimates for the sampling problem from probability
measures of the form

    πβ(dθ) ∝ exp(−βU(θ)) dθ,    (1)

when only a noisy estimate of ∇U is available. This problem arises in many cases in
machine learning, most notably in large-scale (mini-batch) Bayesian inference (Welling and
Teh, 2011; Ahn et al., 2012) and nonconvex stochastic optimization as for large values of
β, a sample from the target measure (1) is an approximate minimizer of the potential U
(Gelfand and Mitter, 1991). Consequently, nonasymptotic error bounds for the sampling
schemes can be used to obtain guarantees for Bayesian inference or nonconvex optimization.


An efficient method for obtaining a sample from (1) is simulating the overdamped
Langevin stochastic differential equation (SDE) which is given by
    dLt = −h(Lt) dt + √(2/β) dBt,    (2)
with a random initial condition L0 := θ0 where h := ∇U and (Bt )t≥0 is a d-dimensional
Brownian motion. The Langevin SDE (2) admits πβ as the unique invariant measure,
therefore simulating this process will lead to samples from πβ and can be used as a Markov
chain Monte Carlo (MCMC) algorithm (Roberts et al., 1996; Roberts and Stramer, 2002).
Moreover, the fact that the limiting probability measure πβ concentrates around the global
minimum of U for sufficiently large values of β makes the diffusion (2) also an attractive
candidate as a global optimizer (see, e.g., Hwang (1980)). However, since the continuous-
time process (2) cannot be simulated, its first-order Euler discretization with the step-size
η > 0 is used in practice, termed the Unadjusted Langevin Algorithm (ULA) (Roberts et al.,
1996). The ULA scheme has become popular in recent years due to its advantages in high-
dimensional settings and ease of implementation. Nonasymptotic properties of the ULA
were recently established under strong convexity and smoothness assumptions by Dalalyan
(2017); Durmus et al. (2017, 2019) while some extensions about relaxing smoothness as-
sumptions or inaccurate gradients were also considered by Dalalyan and Karagulyan (2019);
Brosse et al. (2019). Similar attractive properties hold for the ULA when the potential
U is nonconvex (Gelfand and Mitter, 1991; Raginsky et al., 2017; Xu et al., 2018; Erdogdu
et al., 2018; Sabanis and Zhang, 2019). In recent works, ULA has been extended for non-
convex cases in several different directions, e.g., under log-Sobolev inequality (Vempala
and Wibisono, 2019), under Hölder continuity and specific tail growth conditions (Erdogdu
and Hosseinzadeh, 2021), under Poincaré inequality (Chewi et al., 2022). Further work
has extended these results, see, e.g., Balasubramanian et al. (2022) for averaged Langevin
schemes, Erdogdu et al. (2022) for the analysis under dissipativity in Chi-squared and Renyi
divergences, Mou et al. (2022) for results under dissipativity with smooth initialisation, and
finally under weak Poincaré inequalities (Mousavi-Hosseini et al., 2023).
While the ULA performs well when the computation of the gradient h(·) is straightfor-
ward, this is not the case in most interesting applications. Usually, a stochastic, unbiased
estimate of h(·) is available, either because the cost function is defined as an expectation
or as a finite sum. Using stochastic gradients in the ULA leads to another scheme called
stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011). The SGLD has
been particularly popular in the fields of (i) large-scale Bayesian inference since it allows
one to construct Markov chains Monte Carlo (MCMC) algorithms using only subsets of
the dataset (Welling and Teh, 2011), (ii) nonconvex optimization since it enables one to es-
timate global minima using only stochastic (often cheap-to-compute) gradients (Raginsky
et al., 2017). As a result, attempts for theoretical understanding of the SGLD have been
recently made in several works, both for the strongly convex potentials (i.e. log-concave
targets), see, e.g., Barkhagen et al. (2021); Brosse et al. (2018) and nonconvex potentials,
see, e.g. Raginsky et al. (2017); Majka et al. (2018); Zhang et al. (2023b). Our particular
interest is in nonasymptotic bounds for the nonconvex case, as it is relevant to our work. In
their seminal paper, Raginsky et al. (2017) obtain a nonasymptotic bound between the law
of the SGLD and the target measure in Wasserstein-2 distance with a rate η^{5/4} n, where η is


the step-size and n is the number of iterations. While this work is first of its kind, the error
rate grows with the number of iterations. In a related contribution, Xu et al. (2018) have
obtained improved rates, albeit still growing with the number of iterations n. In a more
recent work, Chau et al. (2021) have obtained a uniform rate of order η^{1/2} in Wasserstein-1
distance. Majka et al. (2018) achieved error rates of η^{1/2} and η^{1/4} for Wasserstein-1 and
Wasserstein-2 distances, respectively, under the assumption of convexity outside a ball. Fi-
nally, Zhang et al. (2023b) achieved the same rates under only local conditions which can
be verified for a class of practical problems.
An alternative to methods based on the overdamped Langevin SDE (2) is the class
of algorithms based on the underdamped Langevin SDE. To be precise, the underdamped
Langevin SDE is given as
    dVt = −γVt dt − h(θt) dt + √(2γ/β) dBt,    (3)
    dθt = Vt dt,    (4)
where (θt , Vt )t≥0 are called position and momentum process, respectively, and h := ∇U .
Similar to eq. (2), this diffusion can be used as both an MCMC sampler and nonconvex
optimizer, since under appropriate conditions, the Markov process (θt , Vt )t≥0 has a unique
invariant measure given by

    π̄β(dθ, dv) ∝ exp( −β( ‖v‖²/2 + U(θ) ) ) dθ dv.    (5)
Consequently, the marginal distribution of (5) in θ is precisely the target measure defined
in (1). This means that sampling from (5) in the extended space and then keeping the
samples in the θ-space would define a valid sampler for the sampling problem of (1).
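This can be checked directly: integrating (5) over the momentum variable is a Gaussian integral that does not depend on θ, so it only contributes a constant factor. In LaTeX form (with the normalising constant absorbed into the proportionality),

    \int_{\mathbb{R}^d} \bar{\pi}_\beta(\mathrm{d}\theta, \mathrm{d}v)
      \propto \exp(-\beta U(\theta)) \int_{\mathbb{R}^d} \exp\!\Big(-\tfrac{\beta}{2}\|v\|^2\Big)\,\mathrm{d}v \,\mathrm{d}\theta
      = \Big(\tfrac{2\pi}{\beta}\Big)^{d/2} \exp(-\beta U(\theta))\,\mathrm{d}\theta
      \propto \pi_\beta(\mathrm{d}\theta).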
Due to its attractive properties, methods based on the underdamped Langevin SDE have
attracted significant attention. In particular, the first order discretization of (3)–(4), which
is termed underdamped Langevin MCMC (i.e. the underdamped counterpart of the ULA),
has been a focus of attention, see, e.g., Duncan et al. (2017); Dalalyan and Riou-Durand
(2018); Cheng et al. (2018b). Particularly, the underdamped Langevin MCMC has displayed
improved convergence rates in the setting where U is convex, see, e.g., Dalalyan and Riou-
Durand (2018); Cheng et al. (2018b). Similar results have been extended to the nonconvex
case. In particular, Cheng et al. (2018a) have shown that the underdamped Langevin
MCMC converges in Wasserstein-2 with a better dimension and step-size dependence under
the assumptions of smoothness and convexity outside a ball. It has also been shown that the
underdamped Langevin MCMC can be seen as an accelerated optimization method in the
space of measures in Kullback-Leibler divergence (Ma et al., 2019).
Similar to the case in the ULA, oftentimes ∇U (·) is expensive or impossible to compute
exactly, but rather an unbiased estimate of it can be obtained efficiently. The underdamped
Langevin MCMC with stochastic gradients is dubbed as Stochastic Gradient Hamiltonian
Monte Carlo (SGHMC) and given as (Chen et al., 2014; Ma et al., 2015)
    V^η_{n+1} = V^η_n − η[ γV^η_n + H(θ^η_n, X_{n+1}) ] + √(2γη/β) ξ_{n+1},    (6)
    θ^η_{n+1} = θ^η_n + ηV^η_n,    (7)


where η > 0 is a step-size, V^η_0 = v0, θ^η_0 = θ0, and E[H(θ, X0)] = h(θ) for every θ ∈
Rd. We note that SGHMC relies on the Euler-Maruyama discretisation, which is one of the
possible discretisation methods that can be used. Alternative discretisation methods can
be considered and lead to different algorithms, e.g., see Cheng et al. (2018b); Li et al. (2019);
Zhang et al. (2023a).
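To make the recursions concrete, a minimal Python sketch of (6)–(7) is given below. It is illustrative only: the stochastic gradient stoch_grad(theta, x) and the data stream x_stream are placeholders to be supplied by the problem at hand, and the function name is ours.

import numpy as np

def sghmc(stoch_grad, x_stream, theta0, v0, eta, gamma, beta, n_iter, rng=None):
    # SGHMC recursions (6)-(7): step-size eta, friction gamma, inverse temperature beta.
    # stoch_grad(theta, x) returns H(theta, x), an unbiased estimate of grad U(theta).
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.asarray(v0, dtype=float).copy()
    noise_scale = np.sqrt(2.0 * gamma * eta / beta)
    for n in range(n_iter):
        xi = rng.standard_normal(theta.shape)        # xi_{n+1} ~ N(0, I_d)
        v_next = v - eta * (gamma * v + stoch_grad(theta, x_stream[n])) + noise_scale * xi
        theta = theta + eta * v                      # (7): uses the current V_n
        v = v_next                                   # (6)
    return theta, v

For sampling, one keeps the θ iterates; for optimization, a large β concentrates the target around the minimisers of U (cf. Theorem 2.2).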
In this paper, we analyze recursions (6)–(7). We achieve convergence bounds and im-
prove the ones proved in Gao et al. (2022) and Chau and Rásonyi (2022) (see Section 2.1
for a direct comparison).
Notation. For an integer d ≥ 1, the Borel sigma-algebra of Rd is denoted by B(Rd).
We denote the dot product by ⟨·, ·⟩, while |·| denotes the associated norm. The set of
probability measures defined on a measurable space (Rd, B(Rd)) is denoted by P(Rd). For
an Rd-valued random variable X, L(X) and E[X] are used to denote its law and its expectation,
respectively. Note that we also write E[X] as EX. For µ, ν ∈ P(Rd), let C(µ, ν) denote the
set of probability measures Γ on B(R2d) whose marginals are µ and ν. Finally, for µ, ν ∈ P(Rd),
the Wasserstein distance of order p ≥ 1 is defined as

    Wp(µ, ν) := inf_{Γ∈C(µ,ν)} ( ∫_{Rd×Rd} |θ − θ′|^p Γ(dθ, dθ′) )^{1/p}.    (8)

2. Main results and overview


Let (Xn )n∈N be an Rm -valued stochastic process adapted to (Gn )n∈N where Gn := σ(Xk , k ≤
n) for n ∈ N. It is assumed henceforth that θ0 , v0 , G∞ , and (ξn )n∈N are independent. The
main assumptions about U and other quantities follow.

Assumption 2.1 The cost function U takes nonnegative values, i.e., U (θ) ≥ 0.

Next, the local smoothness assumptions on the stochastic gradient H are given.

Assumption 2.2 There exist positive constants L1, L2 and ρ such that, for all x, x′ ∈ Rm
and θ, θ′ ∈ Rd,

    |H(θ, x) − H(θ′, x)| ≤ L1 (1 + |x|)^ρ |θ − θ′|,
    |H(θ, x) − H(θ, x′)| ≤ L2 (1 + |x| + |x′|)^ρ (1 + |θ|) |x − x′|.

The following assumption states that the stochastic gradient is assumed to be unbiased.

Assumption 2.3 The process (Xn)n∈N is i.i.d. with |X0| ∈ L^{4(ρ+1)} and |θ0|, |v0| ∈ L^4. It
satisfies
    E[H(θ, X0)] = h(θ).

It is important to note that Assumption 2.2 is a significant relaxation in comparison
with the corresponding assumptions provided in the literature, see, e.g., Raginsky et al.
(2017); Gao et al. (2022); Chau and Rásonyi (2022). To the best of the authors’ knowledge,
all relevant works in this area have focused on uniform Lipschitz assumptions with the
exception of Zhang et al. (2023b), which provides a nonasymptotic analysis of the SGLD
under similar assumptions to ours.


Remark 2.1 Assumption 2.2 implies, for all θ, θ′ ∈ Rd,

    |h(θ) − h(θ′)| ≤ L1 E[(1 + |X0|)^ρ] |θ − θ′|,    (9)

which consequently implies

    |h(θ)| ≤ L1 E[(1 + |X0|)^ρ] |θ| + h0,    (10)

where h0 := |h(0)|. Let H0 := |H(0, 0)|; then Assumption 2.2 implies

    |H(θ, x)| ≤ L1 (1 + |x|)^ρ |θ| + L2 (1 + |x|)^{ρ+1} + H0.

We denote Cρ := E[(1 + |X0|)^{4(ρ+1)}]. Note that Cρ < ∞ by Assumption 2.3.

Assumption 2.4 There exist a measurable symmetric matrix-valued function A : Rm →
Rd×d and a measurable function b : Rm → R such that ⟨y, A(x)y⟩ ≥ 0 for all x ∈ Rm and
y ∈ Rd, and, for all θ ∈ Rd and x ∈ Rm,

    ⟨H(θ, x), θ⟩ ≥ ⟨θ, A(x)θ⟩ − b(x).

The smallest eigenvalue of E[A(X0)] is a positive real number a > 0 and E[b(X0)] = b > 0.

Note that Assumption 2.4 is a local dissipativity condition. This assumption implies the
usual dissipativity property on the corresponding deterministic (full) gradient. In the next
remark, we motivate these assumptions for the case of linear regression loss.

Remark 2.2 We note that Assumptions 2.2 and 2.4 can be motivated even using the sim-
plest linear regression loss. Consider the problem

    min_{θ∈Rd} E[ |Z − ⟨Y, θ⟩|² ],

where (Z, Y) ∼ P(dz, dy) on R × Rd. The stochastic gradient here is given by, for a sample
(zn, yn) ∼ P(dz, dy),

    H(θ, xn) = 2 yn ⟨yn, θ⟩ − 2 yn zn,

where xn = (zn, yn). This loss is not globally Lipschitz; however, we can prove that it is locally
Lipschitz. For example, it is easy to show that Assumption 2.2 is satisfied with L1 = 1,
L2 = 2, and ρ = 2. Moreover, it is easy to see that this stochastic gradient is not dissipative
uniformly in xn, as

    ⟨θ, H(θ, xn)⟩ ≥ |⟨yn, θ⟩|² − zn².

However, our Assumption 2.4 is satisfied with A(xn) = yn yn⊤ and b(xn) = zn². We would
also like to note that we do not need to assume that the stochastic gradient moments are bounded,
as this is a consequence of our assumptions.
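A minimal Python sketch of this stochastic gradient (purely illustrative; the function name is ours and the sampling of (zn, yn) is left to the data-generating mechanism):

import numpy as np

def linreg_stoch_grad(theta, x):
    # H(theta, x_n) = 2 y_n <y_n, theta> - 2 y_n z_n for a sample x_n = (z_n, y_n),
    # i.e. the gradient of |z_n - <y_n, theta>|^2 evaluated at a single data point.
    z, y = x
    return 2.0 * y * (y @ theta) - 2.0 * y * z

Plugged into the SGHMC sketch of Section 1, this yields an SGHMC scheme for the linear regression loss.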

Remark 2.3 By Assumption 2.4, we obtain ⟨h(θ), θ⟩ ≥ a|θ|² − b, for θ ∈ Rd and a, b > 0.
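Spelled out, this follows by taking expectations in Assumption 2.4, using the unbiasedness of the stochastic gradient (Assumption 2.3) and the lower bound on the smallest eigenvalue of E[A(X0)]:

    \langle h(\theta), \theta \rangle
      = \mathbb{E}\,\langle H(\theta, X_0), \theta \rangle
      \ge \langle \theta, \mathbb{E}[A(X_0)]\,\theta \rangle - \mathbb{E}[b(X_0)]
      \ge a\,|\theta|^2 - b .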


Remark 2.4 Due to Assumption 2.1, Remark 2.1 and Remark 2.3, one observes that As-
sumption 2.1 in Eberle et al. (2019) is satisfied and thus, due to Corollary 2.6 in Eberle
et al. (2019), the underdamped Langevin SDE (3)–(4) has the unique invariant measure
given by (5).
Below, we state our main result about the convergence of the law L(θ^η_k, V^η_k), which is gener-
ated by the SGHMC recursions (6)–(7), to the extended target measure π̄β in Wasserstein-2
(W2) distance. We first define

    ηmax := min{ 1, 1/γ, γλ/(2K1), K3/K2, γλ/(2K̄) },

where K1, K2, K3 are constants explicitly given in the proof of Lemma 3.2 and K̄ is a
constant explicitly given in the proof of Lemma 3.4. Then, the following result is obtained.
Theorem 2.1 Let Assumptions 2.1–2.4 hold. Then, there exist constants C1⋆, C2⋆, C3⋆, C4⋆ >
0 such that, for every 0 < η ≤ ηmax,

    W2(L(θ^η_n, V^η_n), π̄β) ≤ C1⋆ η^{1/2} + C2⋆ η^{1/4} + C3⋆ e^{−C4⋆ ηn},    (11)

where the constants C1⋆, C2⋆, C3⋆, C4⋆ are explicitly provided in the Appendix.
Proof See Section 4.

Remark 2.5 We remark that C1⋆ = O(d^{1/2}), C2⋆ = O(e^d), C3⋆ = O(e^d), and C4⋆ = O(e^{−d}).
We note that although the dependence of C1⋆ on dimension is O(d^{1/2}) and comes from our
main result, the dependence of C2⋆, C3⋆, and C4⋆ on dimension may be exponential, as it is
an immediate consequence of the contraction result of the underdamped Langevin SDE in
Eberle et al. (2019).
The result in Theorem 2.1 demonstrates that the error scales like O(η^{1/4}) and is uniformly
bounded over n, and hence can be made arbitrarily small by choosing η > 0 small enough. This
result is thus a significant improvement over the findings in Gao et al. (2022), where error
bounds grow with the number of iterations, and in Chau and Rásonyi (2022), where the
corresponding error bounds contain an additional term that is independent of η and relates
to the variance of the unbiased estimator.
Remark 2.6 Our proof techniques can be adapted easily when H(θ, x) = h(θ). Hence, Theo-
rem 2.1 provides a novel convergence bound for the underdamped Langevin MCMC as well.
Our result is in W2 distance under dissipativity for the Euler-Maruyama discretisation of the
underdamped Langevin dynamics, which can be contrasted with existing works that focus on
alternative discretisations. Beyond the log-concave case, Ma et al. (2019) analyse a second-
order discretisation of the underdamped Langevin diffusion under the assumption that the target
satisfies a log-Sobolev inequality with a Lipschitz Hessian condition for the log-target. More
recently, Zhang et al. (2023a) removed the Lipschitz Hessian condition and established the
convergence of underdamped Langevin MCMC under log-Sobolev and Poincaré inequalities.
To the best of our knowledge, our result for the Euler-Maruyama discretisation of the un-
derdamped Langevin dynamics is the first nonasymptotic result under dissipativity that is
uniformly controlled by the step-size.


Let (θ^η_k)k∈N be generated by the SGHMC algorithm. Convergence of L(θ^η_k) to πβ in
W2 also implies that one can prove a global convergence result (Raginsky et al., 2017).
More precisely, assume that we aim at solving the problem θ⋆ ∈ arg min_{θ∈Rd} U(θ), which is
a nonconvex optimization problem. We denote U⋆ := inf_{θ∈Rd} U(θ). Then we can bound
the error E[U(θ^η_k)] − U⋆, which gives us a guarantee on the nonconvex optimization
problem. We state it as a formal result as follows.
Theorem 2.2 Under the assumptions of Theorem 2.1, we obtain

    E[U(θ^η_n)] − U⋆ ≤ C̄1⋆ η^{1/2} + C̄2⋆ η^{1/4} + C̄3⋆ e^{−C4⋆ ηn} + (d/(2β)) log( (eL1/a)( bβ/d + 1 ) ),

where C̄1⋆, C̄2⋆, C̄3⋆, C4⋆, L1 are finite constants which are explicitly given in the proofs.
Proof See Section 5.
This result bounds the error in terms of the function value for convergence to the global
minima. Note that C̄1⋆, C̄2⋆ and C̄3⋆ have the same dependence on dimension as C1⋆, C2⋆ and
C3⋆ (see Remark 2.5).

2.1 Related work and contributions


Our work is related to two available analyses of the SGHMC, i.e., Gao et al. (2022) and
Chau and Rásonyi (2022). We compare the bounds provided in Theorems 2.1 and 2.2 to
these two works. Finally, we also briefly consider the relations to Zhang et al. (2023b) at
the end of this section.
The scheme (6)–(7) is analyzed by Gao et al. (2022). In particular, Gao et al. (2022)
provided a convergence rate of the SGHMC (6)–(7) to the underdamped Langevin SDE (3)–
(4) which is of order O((δ^{1/4} + η^{1/4}) √(ηn log(ηn))). This rate grows with n, hence worsens
over the number of iterations. Moreover, it is achieved under a uniform assumption on the
stochastic gradient, i.e., H(θ, x) is assumed to be Lipschitz in θ uniformly in x (as opposed
to our Assumption 2.2). Moreover, the mean-squared error of the gradient is assumed to
be bounded, whereas we do not place such an assumption in our work. Similar analyses
appeared in the literature, e.g., for variance-reduced SGHMC (Zou et al., 2019), which also
has rates growing with the number of iterations.
Another related work was provided by Chau and Rásonyi (2022) who also analyzed
the SGHMC recursions essentially under the same assumptions as in Gao et al. (2022).
However, Chau and Rásonyi (2022) improved the convergence rate of the SGHMC recursions
to the underdamped Langevin SDE significantly, i.e., provided a convergence rate of order
O(δ^{1/4} + η^{1/4}), where δ > 0 is a constant. While this rate significantly improves the rate of
Gao et al. (2022), it cannot be made to vanish by choosing η > 0 small enough, as δ > 0 is
(a priori assumed to be) independent of η.
In contrast, we prove that the SGHMC recursions track the underdamped Langevin
SDE with a rate of order O(η^{1/4}), which can be made arbitrarily small by taking η > 0 small.
Moreover, our assumptions are significantly relaxed compared to Gao et al. (2022) and
Chau and Rásonyi (2022). In particular, we relax the assumptions on stochastic gradients
significantly by allowing growth in both variables (θ, x), which makes our theory hold for
practical settings (Zhang et al., 2023b).


Note that we analyze the SGHMC under similar assumptions to Zhang et al. (2023b),
who analyzed the SGLD, since these assumptions allow for broader applicability of the results.
However, the SGHMC requires the analysis of the underdamped Langevin diffusion and
the discrete-time scheme on R2d and requires significantly different techniques compared to
the ones used in Zhang et al. (2023b). We show that the SGHMC can achieve a similar
performance to the SGLD, as shown in Zhang et al. (2023b). However, due to the
nonconvexity, we do not observe any improved dimension dependence as in the convex case,
and this constitutes a significant direction of future work.

3. Preliminaries
In this section, preliminary results which are essential for proving the main results are pro-
vided. A central idea behind the proof of Theorem 2.1 is the introduction of continuous-time
auxiliary processes whose marginals at chosen discrete times coincide with the marginals of
the (joint) law L(θkη , Vkη ). Hence, these auxiliary stochastic processes can be used to ana-
lyze the recursions (6)–(7). We first introduce the necessary lemmas about the moments of
the auxiliary processes defined in Section 3 and contraction rates of the underlying under-
damped Langevin diffusion. We give the lemmas and some explanatory remarks and defer
the proofs to Appendix C. We then provide the proofs of main theorems in the following
section.
Consider the scaled process (ζ^η_t, Z^η_t) := (θ_{ηt}, V_{ηt}), given (θt, Vt)t∈R+ as in (3)–(4). We
next define

    dZ^η_t = −η( γZ^η_t + h(ζ^η_t) ) dt + √(2γηβ^{−1}) dB^η_t,    (12)
    dζ^η_t = ηZ^η_t dt,    (13)

where η > 0 and B^η_t = (1/√η) B_{ηt}, where (Bs)s∈R+ is a Brownian motion with natural filtration
(Ft)t≥0. We denote the natural filtration of (B^η_t)t∈R+ as F^η_t. We note that F^η_t is independent of
G∞ ∨ σ(θ0, v0). Next, we define the continuous-time interpolation of the SGHMC

    dV̄^η_t = −η( γV̄^η_{⌊t⌋} + H(θ̄^η_{⌊t⌋}, X_{⌈t⌉}) ) dt + √(2γηβ^{−1}) dB^η_t,    (14)
    dθ̄^η_t = ηV̄^η_{⌊t⌋} dt.    (15)

The processes (14)–(15) mimic the recursions (6)–(7) at n ∈ N, i.e., L(θ^η_n, V^η_n) = L(θ̄^η_n, V̄^η_n).
Finally, we define the underdamped Langevin process (ζ̂^{s,u,v,η}_t, Ẑ^{s,u,v,η}_t), for s ≤ t,

    dẐ^{s,u,v,η}_t = −η( γẐ^{s,u,v,η}_t + h(ζ̂^{s,u,v,η}_t) ) dt + √(2γηβ^{−1}) dB^η_t,    (16)
    dζ̂^{s,u,v,η}_t = ηẐ^{s,u,v,η}_t dt,    (17)

with initial conditions ζ̂^{s,u,v,η}_s = u and Ẑ^{s,u,v,η}_s = v.

Definition 3.1 Fix n ∈ N and, for T := ⌊1/η⌋, define

    ζ̄^{η,n}_t := ζ̂^{nT, θ̄^η_{nT}, V̄^η_{nT}, η}_t   and   Z̄^{η,n}_t := Ẑ^{nT, θ̄^η_{nT}, V̄^η_{nT}, η}_t.

The process (ζ̄^{η,n}_t, Z̄^{η,n}_t)_{t≥nT} is an underdamped Langevin process started at time nT from
(θ̄^η_{nT}, V̄^η_{nT}).
To achieve the convergence results, we first define a Lyapunov function, common in the
literature (Mattingly et al., 2002; Eberle et al., 2019), defined as

    V(θ, v) = βU(θ) + (βγ²/4)( |θ + γ^{−1}v|² + |γ^{−1}v|² − λ|θ|² ),    (18)
where λ ∈ (0, 1/4]. This Lyapunov function plays an important role in obtaining uniform
moment estimates for some of the aforementioned processes. First we recall a result from
the literature about the second moments of the processes (θt , Vt )t≥0 in continuous-time.
Lemma 3.1 (Lemma 12(i) in Gao et al. (2022)) Let Assumptions 2.1–2.4 hold. Then

    sup_{t≥0} E|θt|² ≤ C^c_θ := ( ∫_{R^{2d}} V(θ, v) µ0(dθ, dv) + (d + Ac)/λ ) / ( (1/8)(1 − 2λ)βγ² ),    (19)

    sup_{t≥0} E|Vt|² ≤ C^c_v := ( ∫_{R^{2d}} V(θ, v) µ0(dθ, dv) + (d + Ac)/λ ) / ( (1/4)(1 − 2λ)β ).    (20)

Proof See Gao et al. (2022).
Next, we obtain uniform-in-time second moment estimates for the discrete-time processes
(θ^η_k)_{k≥0} and (V^η_k)_{k≥0}.

Lemma 3.2 Let Assumptions 2.1–2.4 hold. Then, for 0 < η ≤ ηmax,

    sup_{k≥0} E|θ^η_k|² ≤ Cθ := ( ∫_{R^{2d}} V(θ, v) µ0(dθ, dv) + 4(Ac + d)/λ ) / ( (1/8)(1 − 2λ)βγ² ),

    sup_{k≥0} E|V^η_k|² ≤ Cv := ( ∫_{R^{2d}} V(θ, v) µ0(dθ, dv) + 4(Ac + d)/λ ) / ( (1/4)(1 − 2λ)β ).

Proof See Appendix C.1.
Using Lemma 3.2, we can get the second moment bounds for our auxiliary process (ζ̄^{η,n}_t)_{t≥0}.

Lemma 3.3 Under the assumptions of Lemmas 3.1 and 3.2, we obtain

    sup_{n∈N} sup_{t∈(nT,(n+1)T)} E|ζ̄^{η,n}_t|² ≤ Cζ := ( ∫_{R^{2d}} V(θ, v) µ0(dθ, dv) + 8(d + Ac)/λ ) / ( (1/8)(1 − 2λ)βγ² ).    (21)

Proof See Appendix C.2.
Moreover, in order to obtain the error bound on the distance between the laws L(ζ̄^{η,n}_t, Z̄^{η,n}_t) and
L(ζ^η_t, Z^η_t), we need to obtain the contraction of EV²(θ^η_k, V^η_k), which is established in the
following lemma.

Lemma 3.4 Let 0 < η ≤ ηmax and Assumptions 2.1–2.4 hold. Then, we have

    sup_{k∈N} E[V²(θ^η_k, V^η_k)] ≤ E[V²(θ0, v0)] + 2D/(γλ),

where D = O(d²) is a constant independent of η, provided explicitly in the proof.


Proof See Appendix C.3.

Finally, we present a convergence result for the underdamped Langevin diffusion adapted
from Eberle et al. (2019). To this end, a functional for probability measures µ, ν on R2d is
introduced below:

    Wρ(µ, ν) = inf_{Γ∈C(µ,ν)} ∫ ρ( (x, v), (x′, v′) ) Γ( d(x, v), d(x′, v′) ),    (22)

where ρ is defined in eq. (2.10) in Eberle et al. (2019). Thus, in view of Remarks 2.1 and
2.3, one recovers the following result.

Theorem 3.1 (Eberle et al., 2019, Theorem 2.3 and Corollary 2.6) Let Assumptions 2.1–
2.4 hold, and let (θt, Vt) and (θ′t, V′t) be underdamped Langevin SDEs started at
(θ0, V0) ∼ µ and (θ′0, V′0) ∼ ν, respectively. Then, there exist constants ċ, Ċ ∈ (0, ∞) such
that

    W2(L(θt, Vt), L(θ′t, V′t)) ≤ √Ċ e^{−ċt/2} √( Wρ(µ, ν) ),    (23)

where the constants ċ = O(e^{−d}) and Ċ = O(e^d) are given in Appendix A.

4. Proof of Theorem 2.1


In order to prove Theorem 2.1, we note first that W2 is a metric on P(R2d). The main
strategy for this proof is to bound W2(L(θ^η_n, V^η_n), π̄β) by using appropriate estimates on the
continuous-time interpolation of (θ^η_n, V^η_n)n∈N. In particular, we obtain the desired bound
by first decomposing

    W2(L(θ̄^η_t, V̄^η_t), π̄β) ≤ W2(L(θ̄^η_t, V̄^η_t), L(ζ̄^{η,n}_t, Z̄^{η,n}_t)) + W2(L(ζ̄^{η,n}_t, Z̄^{η,n}_t), L(ζ^η_t, Z^η_t))
                              + W2(L(ζ^η_t, Z^η_t), π̄β),    (24)

for nT ≤ t ≤ (n + 1)T for every n ∈ N, and then by obtaining suitable (decaying in n)
bounds for each of the terms on the right-hand side of (24). This leads to the proof of our main result,
namely, Theorem 2.1. All proofs are deferred to the Appendix. First, we bound the first
term of (24).
Theorem 4.1 Let Assumptions 2.1–2.4 hold and 0 < η ≤ ηmax. Then, for every t ∈
[nT, nT + 1),

    W2(L(θ̄^η_t, V̄^η_t), L(ζ̄^{η,n}_t, Z̄^{η,n}_t)) ≤ C1⋆ η^{1/2},    (25)

where C1⋆ = O(d^{1/2}) is a finite constant.


Proof See Appendix D.1.
Next, we prove the following result for bounding the second term of (24).

Theorem 4.2 Let Assumptions 2.1–2.4 hold and 0 < η ≤ ηmax. Then,

    W2(L(ζ̄^{η,n}_t, Z̄^{η,n}_t), L(ζ^η_t, Z^η_t)) ≤ C2⋆ η^{1/4},

where C2⋆ = O(e^d).


Proof See Appendix D.2.


The constant C2? comes from the contraction result of Eberle et al. (2019, Corollary 2.6)
which might scale exponentially in d. Finally, the convergence of the last term follows from
Theorem 3.1.
Theorem 4.3 (Eberle et al., 2019) Let Assumptions 2.1–2.4 hold. Then,
?
W2 (L(ζtη , Ztη ), π β ) ≤ C3? e−C4 ηt ,
q
where C3? = ĊWρ (µ0 , ν0 ) and C4? = ċ/2. In particular, C3? = O(ed ) while C4? = O(e−d ).
Finally, considering Theorems 4.1, 4.2, and 4.3 together by putting t = n leads to the full
proof of our main result, namely, Theorem 2.1.

5. Proof of Theorem 2.2


The bound provided for the convergence to the target in W2 distance can be used to obtain
guarantees for nonconvex optimization. In order to do so, we proceed by decomposing the
error as follows:

    E[U(θ^η_n)] − U⋆ = ( E[U(θ^η_n)] − E[U(θ∞)] ) + ( E[U(θ∞)] − U⋆ ) =: T1 + T2,

where θ∞ ∼ πβ. The following proposition presents a bound for T1 under our assumptions.

Proposition 5.1 Under the assumptions of Theorem 2.1, we have

    E[U(θ^η_n)] − E[U(θ∞)] ≤ C̄1⋆ η^{1/2} + C̄2⋆ η^{1/4} + C̄3⋆ e^{−C4⋆ ηn},    (26)

where C̄i⋆ = Ci⋆ (Cm L̄1 + h0) for i = 1, 2, 3 and C²m = max(C^c_θ, Cθ).

Proof See Appendix D.3.


Next, we bound the second term T2 as follows. This result is fairly standard in the literature
(see, e.g., Raginsky et al. (2017); Gao et al. (2022); Chau and Rásonyi (2022)).

Proposition 5.2 (Raginsky et al., 2017) Under the assumptions of Theorem 2.1, we have

    E[U(θ∞)] − U⋆ ≤ (d/(2β)) log( (eL1/a)( bβ/d + 1 ) ).

Merging Propositions 5.1 and 5.2 leads to the bound given in Theorem 2.2, which completes our
proof.

6. Applications
In this section, we present two applications to machine learning. First, we show that the
SGHMC can be used to sample from the posterior probability measure in the context of
scalable Bayesian inference. We also note that our assumptions hold in a practical setting
of Bayesian logistic regression. Secondly, we provide an improved generalization bound for
empirical risk minimization.


6.1 Convergence rates for scalable Bayesian inference


Consider a prior distribution π0(θ) and a likelihood p(yi|θ) for a sequence of data points
{yi}^M_{i=1}, where M is the dataset size. Often, one is interested in sampling from the posterior
probability distribution p(θ|y1:M) dθ ∝ π0(θ) ∏_{i=1}^M p(yi|θ) dθ. This is a sampling problem of
the form (1). The SGHMC is an MCMC method to sample from the posterior measure
and, therefore, our explicit convergence rates provide a guarantee for the sampling. To see
this, note that the underdamped Langevin SDE with h(θ) = −∇ log p(θ|y1:M) converges to
the extended target measure whose θ-marginal is p(θ|y1:M); hence the underdamped Langevin
SDE samples from the posterior distribution.
We note that our setting specifically applies to cases where M is too large. More pre-
cisely, note that we have h(θ) = −∇ log p(θ|y1:M) = −∇ log π0(θ) − Σ_{i=1}^M ∇ log p(yi|θ). When
M is too large, evaluating h(θ) is impractical. However, one can estimate the sum in the last
term in an unbiased way. To be precise, consider random indices i1, . . . , iK drawn uniformly
from {1, . . . , M}; then one can construct a stochastic gradient by using u = {y_{i1}, . . . , y_{iK}} and
setting H(θ, u) = −∇ log π0(θ) − (M/K) Σ_{k=1}^K ∇ log p(y_{ik}|θ). Then, the following
corollary follows from Theorem 2.1.
Corollary 6.1 Assume that the log-posterior density log p(θ|y1:M), its gradient, and the stochas-
tic gradient H(θ, ·) satisfy Assumptions 2.1–2.4. Then,

    W2(L(θ^η_n), p(θ|y1:M)) ≤ C1⋆ η^{1/2} + C2⋆ η^{1/4} + C3⋆ e^{−C4⋆ ηn},

where C1⋆, C2⋆, C3⋆, C4⋆ are finite constants.


This setting becomes practical under our assumptions, e.g., for the Bayesian logis-
tic regression example. Consider the Gaussian mixture prior π0(θ) ∝ exp(−f0(θ)) =
e^{−|θ−m|²/2} + e^{−|θ+m|²/2} and the likelihood p(yi|zi, θ) = (1/(1 + e^{−zi⊤θ}))^{yi} (1 − 1/(1 + e^{−zi⊤θ}))^{1−yi}
for θ ∈ Rd, where xi = (zi, yi) denotes a data point. Then, it is shown by Zhang et al. (2023b) that the stochastic
gradient H(θ, u) for a mini-batch in this case satisfies Assumptions 2.1–2.4. In particular,
our theoretical guarantees in Theorem 2.1 and Corollary 6.1 apply to the Bayesian logistic
regression case.
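For concreteness, a minimal Python sketch of such a mini-batch stochastic gradient is given below. It is illustrative only: the closed form θ − m tanh(⟨θ, m⟩) for −∇ log π0(θ) follows from a short calculation for the mixture prior above, and the function names are ours, not part of any library.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def minibatch_stoch_grad(theta, Z, Y, m, K, rng):
    # Sketch of the unbiased stochastic gradient
    #   H(theta, u) = -grad log pi_0(theta) - (M/K) sum_k grad log p(y_{i_k} | theta)
    # for the Gaussian-mixture prior and logistic likelihood of Section 6.1.
    # Z: (M, d) covariates, Y: (M,) binary labels, m: prior mode, K: mini-batch size.
    M = Z.shape[0]
    idx = rng.integers(0, M, size=K)             # i_1, ..., i_K drawn uniformly from {1, ..., M}
    prior_term = theta - m * np.tanh(theta @ m)  # -grad log pi_0(theta)
    resid = sigmoid(Z[idx] @ theta) - Y[idx]     # sigma(z_i' theta) - y_i for each sampled i
    lik_term = (M / K) * (Z[idx].T @ resid)      # rescaled mini-batch likelihood term
    return prior_term + lik_term

Feeding this H(θ, u) into the SGHMC recursions (6)–(7) yields a sampler that only touches K data points per iteration, which is the scalable Bayesian inference setting covered by Corollary 6.1.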

6.2 A generalization bound for machine learning


Leveraging standard results in the machine learning literature, e.g., Raginsky et al. (2017), we
can prove a generalization bound for the empirical risk minimization problem. Note that
many problems in machine learning can be written as a finite-sum minimization problem,

    θ⋆ ∈ arg min_{θ∈Rd} U(θ) := (1/M) Σ_{i=1}^M f(θ, zi).    (27)

Applying the result of Theorem 2.2, one can get a convergence guarantee on E[U(θ^η_k)] − U⋆.
However, this does not account for the so-called generalization error. Note that one can
see the cost function in (27) as an empirical risk minimization problem, where the
(population) risk is given by 𝒰(θ) := ∫ f(θ, z) P(dz) = E[f(θ, Z)], and P(dz) is the unknown
probability measure from which the real-world data is sampled. Therefore, in order to bound


the generalization error, one needs to bound the error E[𝒰(θ^η_n)] − 𝒰⋆. The generalization
error can be decomposed as

    E[𝒰(θ^η_n)] − 𝒰⋆ = ( E[𝒰(θ^η_n)] − E[𝒰(θ∞)] ) + ( E[𝒰(θ∞)] − E[U(θ∞)] ) + ( E[U(θ∞)] − 𝒰⋆ ) =: B1 + B2 + B3,

where 𝒰⋆ := inf_{θ∈Rd} 𝒰(θ).

In what follows, we present a series of results bounding the terms B1, B2, B3. By using the
results about Gibbs distributions presented in Raginsky et al. (2017), one can bound B1 as
follows.

Proposition 6.1 Under the assumptions of Theorem 2.1, we obtain

    E[𝒰(θ^η_n)] − E[𝒰(θ∞)] ≤ C̄1⋆ η^{1/2} + C̄2⋆ η^{1/4} + C̄3⋆ e^{−C4⋆ ηn},

where C̄i⋆ = Ci⋆ (Cm L̄1 + h0) for i = 1, 2, 3 and C²m = max(C^c_θ, Cθ).

The proof of Proposition 6.1 is similar to the proof of Proposition 5.1 and indeed the rates
match.
Next, we seek a bound for the term B2. In order to prove the following result, we assume
that Assumption 2.2 and Assumption 2.4 hold uniformly in x, as required by, e.g., Raginsky
et al. (2017).

Proposition 6.2 (Raginsky et al., 2017) Assume that Assumptions 2.1 and 2.3 hold and As-
sumptions 2.2 and 2.4 hold uniformly in x, so that, in particular, |H(θ, x)| ≤ L′1 |θ| + B1. Then,

    E[𝒰(θ∞)] − E[U(θ∞)] ≤ (4βcLS/M) ( (L′1/a)(b + d/β) + B1 ),

where cLS is the constant of the logarithmic Sobolev inequality.
Finally, let Θ⋆ ∈ arg min_{θ∈Rd} 𝒰(θ). We note that B3 is bounded trivially as

    E[U(θ∞)] − 𝒰⋆ = E[U(θ∞) − U⋆] + E[U⋆ − U(Θ⋆)] ≤ E[U(θ∞) − U⋆],    (28)

which follows from the proof of Proposition 5.2. Finally, Proposition 6.1, Proposition 6.2,
and (28) lead to the following generalization bound, presented as a corollary.
Corollary 6.2 Under the setting of Proposition 6.2, we obtain the generalization bound for
the SGHMC,

    E[𝒰(θ^η_n)] − 𝒰⋆ ≤ C̄1⋆ η^{1/2} + C̄2⋆ η^{1/4} + C̄3⋆ e^{−C4⋆ ηn} + (4βcLS/M) ( (L′1/a)(b + d/β) + B1 )
                        + (d/(2β)) log( (eL1/a)( bβ/d + 1 ) ).    (29)
We note that this generalization bound improves that of Raginsky et al. (2017); Gao et al.
(2022); Chau and Rásonyi (2022) due to our improved W2 bound which is reflected in
Theorem 2.2 and, consequently, Proposition 6.1. In particular, while the generalization
bounds of Raginsky et al. (2017) and Gao et al. (2022) grow with the number of iterations
and require careful tuning between the step-size and the number of iterations, our bound
decreases with the number of iterations n. We also note that our bound improves that of
Chau and Rásonyi (2022), similar to the W2 bound.


7. Conclusions
We have analyzed the convergence of the SGHMC recursions (6)–(7) to the extended target
measure π̄β in Wasserstein-2 distance, which implies the convergence of the law of the iter-
ates L(θ^η_n) to the target measure πβ in W2. We have proved that the error bound scales like
O(η^{1/4}), where η is the step-size. This significantly improves the existing bounds for the SGHMC,
which either grow with the number of iterations or include constants that cannot
be made to vanish by decreasing the step-size η. This bound on sampling from πβ enables
us to prove a stochastic global optimization result when (θ^η_n)n∈N is viewed as the output
of a nonconvex optimizer. We have shown that our results provide convergence rates for
scalable Bayesian inference, and we have particularized our results to Bayesian logistic
regression. Moreover, we have shown that our improvement of the W2 bounds is reflected in
improved generalization bounds for the SGHMC.

Acknowledgments

Ö. D. A. is supported by the Lloyd’s Register Foundation Data Centric Engineering Pro-
gramme and EPSRC Programme Grant EP/R034710/1 (CoSInES). S.S. acknowledges sup-
port by the Alan Turing Institute under the EPSRC grant EP/N510129/1.

References
Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via
stochastic gradient Fisher scoring. In 29th International Conference on Machine Learn-
ing, ICML 2012, pages 1591–1598, 2012.

Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang.
Towards a theory of non-log-concave sampling: first-order stationarity guarantees for
Langevin Monte Carlo. In Conference on Learning Theory, pages 2896–2923. PMLR,
2022.

Mathias Barkhagen, Ngoc Huy Chau, Éric Moulines, Miklós Rásonyi, Sotirios Sabanis, and
Ying Zhang. On stochastic gradient Langevin dynamics with dependent data streams in
the logconcave case. Bernoulli, 27(1):1–33, 2021.

Nicolas Brosse, Alain Durmus, and Eric Moulines. The promises and pitfalls of stochastic
gradient Langevin dynamics. In Advances in Neural Information Processing Systems,
pages 8268–8278, 2018.

Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unad-
justed Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–
3663, 2019.

Huy N Chau and Miklós Rásonyi. Stochastic gradient hamiltonian Monte Carlo for non-
convex learning. Stochastic Processes and their Applications, 149:341–368, 2022.


Huy N Chau, Chaman Kumar, Miklós Rásonyi, and Sotirios Sabanis. On fixed gain recursive
estimators with discontinuity in the parameters. ESAIM: Probability and Statistics, 23:
217–244, 2019.

Ngoc Huy Chau, Éric Moulines, Miklos Rásonyi, Sotirios Sabanis, and Ying Zhang. On
stochastic gradient Langevin dynamics with dependent data streams: The fully nonconvex
case. SIAM Journal on Mathematics of Data Science, 3(3):959–986, 2021.

Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte
Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.

Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I
Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv
preprint arXiv:1805.01648, 2018a.

Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped
Langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, pages
300–323, 2018b.

Sinho Chewi, Murat A Erdogdu, Mufan Li, Ruoqi Shen, and Shunshi Zhang. Analysis of
Langevin Monte Carlo from poincare to log-sobolev. In Conference on Learning Theory,
pages 1–2. PMLR, 2022.

Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-
concave densities. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology), 79(3):651–676, 2017.

Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte
Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):
5278–5311, 2019.

Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using
kinetic Langevin diffusions. arXiv preprint arXiv:1807.09382, 2018.

AB Duncan, N Nüsken, and GA Pavliotis. Using perturbed underdamped Langevin dynamics
to efficiently sample from probability distributions. Journal of Statistical Physics,
169(6):1098–1131, 2017.

Alain Durmus, Eric Moulines, et al. Nonasymptotic convergence analysis for the unadjusted
Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.

Alain Durmus, Eric Moulines, et al. High-dimensional Bayesian inference via the unadjusted
Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.

Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative con-
traction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.

Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo:
The interplay between tail growth and smoothness. In Conference on Learning Theory,
pages 1776–1822. PMLR, 2021.


Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization
with discretized diffusions. In Advances in Neural Information Processing Systems, pages
9671–9680, 2018.

Murat A Erdogdu, Rasa Hosseinzadeh, and Shunshi Zhang. Convergence of Langevin Monte
Carlo in chi-squared and rényi divergence. In International Conference on Artificial
Intelligence and Statistics, pages 8151–8175. PMLR, 2022.

Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic
gradient hamiltonian Monte Carlo for nonconvex stochastic optimization: nonasymptotic
performance bounds and momentum-based acceleration. Operations Research, 70(5):
2931–2947, 2022.

Saul B Gelfand and Sanjoy K Mitter. Recursive stochastic algorithms for global optimization
in Rd . SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.

Chii-Ruey Hwang. Laplace’s method revisited: weak convergence of probability measures.
The Annals of Probability, pages 1177–1182, 1980.

Xuechen Li, Yi Wu, Lester Mackey, and Murat A Erdogdu. Stochastic runge-kutta accel-
erates Langevin Monte Carlo and beyond. Advances in neural information processing
systems, 32, 2019.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc.
In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and
Michael I Jordan. Is There an Analog of Nesterov Acceleration for MCMC? arXiv
preprint arXiv:1902.00996, 2019.

Mateusz B Majka, Aleksandar Mijatović, and Lukasz Szpruch. Non-asymptotic bounds for
sampling algorithms without log-concavity. arXiv preprint arXiv:1808.07105, 2018.

Jonathan C Mattingly, Andrew M Stuart, and Desmond J Higham. Ergodicity for sdes and
approximations: locally lipschitz vector fields and degenerate noise. Stochastic processes
and their applications, 101(2):185–232, 2002.

Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. Improved
bounds for discretization of Langevin diffusions: Near-optimal rates without convexity.
Bernoulli, 28(3):1577–1601, 2022.

Alireza Mousavi-Hosseini, Tyler Farghly, Ye He, Krishnakumar Balasubramanian, and Murat A
Erdogdu. Towards a complete analysis of Langevin Monte Carlo: Beyond Poincaré
inequality. arXiv preprint arXiv:2303.03589, 2023.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via
Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. In Conference on
Learning Theory, pages 1674–1703, 2017.


Gareth O Roberts and Osnat Stramer. Langevin diffusions and metropolis-hastings algo-
rithms. Methodology and computing in applied probability, 4(4):337–357, 2002.

Gareth O Roberts, Richard L Tweedie, et al. Exponential convergence of Langevin distributions
and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

S. Sabanis and Y. Zhang. Higher order Langevin Monte Carlo algorithm. Electronic Journal
of Statistics, 13(2):3805–3850, 2019.

Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin
algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32,
2019.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics.
In Proceedings of the 28th International Conference on Machine Learning, pages 681–688,
2011.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dy-
namics based algorithms for nonconvex optimization. In Advances in Neural Information
Processing Systems, pages 3122–3133, 2018.

Shunshi Zhang, Sinho Chewi, Mufan Li, Krishna Balasubramanian, and Murat A Erdogdu.
Improved discretization analysis for underdamped Langevin Monte Carlo. In The Thirty
Sixth Annual Conference on Learning Theory, pages 36–71. PMLR, 2023a.

Ying Zhang, Ömer Deniz Akyildiz, Theodoros Damoulas, and Sotirios Sabanis. Nonasymp-
totic estimates for stochastic gradient Langevin dynamics under local conditions in non-
convex optimization. Applied Mathematics & Optimization, 87(2):25, 2023b.

Difan Zou, Pan Xu, and Quanquan Gu. Stochastic gradient Hamiltonian Monte Carlo
methods with recursive variance reduction. In Advances in Neural Information Processing
Systems, pages 3830–3841, 2019.


Appendix
Appendix A. Constants in Theorem 3.1
The constants of Theorem 3.1 are given as follows (Eberle et al., 2019):

    Ċ = 2 e^{2+Λ} ( (1 + γ)² / min(1, α)² ) max{ 1, 4(1 + 2α + 2α²)(d + Ac) γ^{−1} ċ^{−1} / ( min(1, R1) β ) }

and

    ċ = (γ/384) min{ λL1 β^{−1} γ^{−2}, Λ^{1/2} e^{−Λ} L1 β^{−1} γ^{−2}, Λ^{1/2} e^{−Λ} },

where

    Λ = L1 R1²/8 = (12/5)(1 + 2α + 2α²)(d + Ac) L1 β^{−1} γ^{−2} λ^{−1} (1 − 2λ)^{−1},

and α ∈ (0, ∞).

Appendix B. Additional lemmata


We prove the following lemma adapted from Raginsky et al. (2017).

Lemma B.1 For all θ ∈ Rd,

    (a/3)|θ|² − (b/2) log 3 ≤ U(θ) ≤ u0 + (L̄1/2)|θ|² + h0|θ|,

where u0 = U(0) and L̄1 = L1 E[(1 + |X0|)^ρ].

Proof See Appendix C.4.

Lemma B.2 Let G, H ⊂ F be sigma-algebras. Consider two Rd-valued random vectors, denoted X, Y, in
L^p with p ≥ 1, such that Y is measurable w.r.t. H ∨ G. Then,

    E^{1/p}[ |X − E[X|H ∨ G]|^p | G ] ≤ 2 E^{1/p}[ |X − Y|^p | G ].

Proof See Lemma 6.1 in Chau et al. (2019).

Lemma B.3 Let Assumptions 2.1, 2.2, 2.3 and 2.4 hold. For any k = 1, . . . , K + 1, where K + 1 ≤ T, we
obtain

    sup_{n∈N} sup_{t∈[nT,(n+1)T]} E[ |h(ζ̄^{η,n}_t) − H(ζ̄^{η,n}_t, X_{nT+k})|² ] ≤ σH,

where

    σH := 8 L2² σZ (1 + Cζ) < ∞,   with   σZ := E[ (1 + |X0| + |E[X0]|)^{2ρ} |X0 − E[X0]|² ] < ∞.

Proof See Appendix C.5.


Lemma B.4 Under Assumptions 2.1–2.4,

    E[ |V̄^η_{⌊t⌋} − V̄^η_t|² ] ≤ σV η,

where

    σV = 4ηγ² Cv + 4η( L̃1 Cθ + C̃1 ) + 4γβ^{−1} d.

Moreover,

    E[ |θ̄^η_{⌊t⌋} − θ̄^η_t|² ] ≤ η² Cv.

Proof See Appendix C.6.


Next, it is shown that a key assumption appearing in Eberle et al. (2019) holds.

Lemma B.5 There exist constants Ac ∈ (0, ∞) and λ ∈ (0, 1/4] such that

    ⟨θ, h(θ)⟩ ≥ 2λ( U(θ) + γ²|θ|²/4 ) − 2Ac/β    (30)

for all θ ∈ Rd.

Proof See Appendix C.7.

Appendix C. Proofs of the preliminary results


C.1 Proof of Lemma 3.2
For this proof, we use the Lyapunov function defined by Eberle et al. (2019) and follow a proof
similar to the one presented in Gao et al. (2022). We first define the Lyapunov function as

    V(θ, v) = βU(θ) + (βγ²/4)( ‖θ + γ^{−1}v‖² + ‖γ^{−1}v‖² − λ‖θ‖² ).

Next, we will use this Lyapunov function to show that the second moments of the processes (V^η_n)n∈N and
(θ^η_n)n∈N are finite.
We start by defining

    M2(k) := EV(θ^η_k, V^η_k)/β = E[ U(θ^η_k) + (γ²/4)( |θ^η_k + γ^{−1}V^η_k|² + |γ^{−1}V^η_k|² − λ|θ^η_k|² ) ].    (31)
Recall our discrete-time recursions (6)–(7),

    V^η_{k+1} = (1 − ηγ)V^η_k − ηH(θ^η_k, X_{k+1}) + √(2ηγβ^{−1}) ξ_{k+1},
    θ^η_{k+1} = θ^η_k + ηV^η_k,   θ^η_0 = θ0,  V^η_0 = v0,

where (ξk)k∈N is a sequence of i.i.d. standard Normal random variables. Consequently, we have the equality
    E[|V^η_{k+1}|²] = E[ |(1 − γη)V^η_k − ηH(θ^η_k, X_{k+1})|² ] + 2γηβ^{−1} d
                    = (1 − γη)² E[|V^η_k|²] − 2η(1 − γη) E[⟨V^η_k, h(θ^η_k)⟩] + η² E[|H(θ^η_k, X_{k+1})|²] + 2γηβ^{−1} d,

which immediately leads to

    E[|V^η_{k+1}|²] ≤ (1 − γη)² E[|V^η_k|²] − 2η(1 − γη) E[⟨V^η_k, h(θ^η_k)⟩] + η²( L̃1 E[|θ^η_k|²] + C̃1 ) + 2γηβ^{−1} d,    (32)

where

    L̃1 = 2L1² Cρ   and   C̃1 = 4L2² Cρ + 4H0².    (33)


Next, we note that

    E|θ^η_{k+1}|² = E|θ^η_k|² + 2η E⟨θ^η_k, V^η_k⟩ + η² E|V^η_k|².    (34)

Recall h := ∇U and note also that

    U(θ^η_{k+1}) = U(θ^η_k + ηV^η_k) = U(θ^η_k) + ∫_0^1 ⟨h(θ^η_k + τηV^η_k), ηV^η_k⟩ dτ,

which suggests

    | U(θ^η_{k+1}) − U(θ^η_k) − ⟨h(θ^η_k), ηV^η_k⟩ | = | ∫_0^1 ⟨h(θ^η_k + τηV^η_k) − h(θ^η_k), ηV^η_k⟩ dτ |
        ≤ ∫_0^1 |h(θ^η_k + τηV^η_k) − h(θ^η_k)| |ηV^η_k| dτ
        ≤ (1/2) L1 Cρ η² |V^η_k|²,

where the second line follows from the Cauchy-Schwarz inequality and the final line follows from (9). Finally,
we obtain

    EU(θ^η_{k+1}) − EU(θ^η_k) ≤ η E⟨h(θ^η_k), V^η_k⟩ + (1/2) L1 Cρ η² E|V^η_k|².    (35)
Next, we continue computing

    E|θ^η_{k+1} + γ^{−1}V^η_{k+1}|² = E|θ^η_k + γ^{−1}V^η_k − ηγ^{−1}H(θ^η_k, X_{k+1})|² + 2γ^{−1}β^{−1}η d
        = E|θ^η_k + γ^{−1}V^η_k|² − 2ηγ^{−1} E⟨θ^η_k + γ^{−1}V^η_k, h(θ^η_k)⟩ + η²γ^{−2} E|H(θ^η_k, X_{k+1})|² + 2γ^{−1}ηβ^{−1} d
        ≤ E|θ^η_k + γ^{−1}V^η_k|² − 2ηγ^{−1} E⟨θ^η_k + γ^{−1}V^η_k, h(θ^η_k)⟩ + η²γ^{−2}( L̃1 E|θ^η_k|² + C̃1 ) + 2γ^{−1}ηβ^{−1} d,    (36)

where L̃1 and C̃1 are defined as in (33). Next, combining (32), (34), (35), (36),

M2 (k + 1) − M2 (k)
 γ2  2 2

η
) − U (θkη ) + η
+ γ −1 Vk+1
η
− E θkη + γ −1 Vkη

= E U (θk+1 E θk+1
4
1  γ2λ  
2 η 2 2
+ η
E Vk+1 − E |Vk | − η
E θk+1 − E |θkη |2 ,
4 4
2
L C
1 ρ η
≤ ηEhh(θkη ), Vkη i + E |Vkη |2
2
γ2    
+ −2ηγ −1 Ehθkη + γ −1 Vkη , h(θkη )i + η 2 γ −2 L e 1 E |θη |2 + C
k
e1 + 2γ −1 β −1 ηd
4
1  
e1 + 2γηβ −1 d

+ (−2γη + γ 2 η 2 )E |Vkη |2 − 2η(1 − γη)EhVkη , h(θkη )i + η 2 L e 1 E|θη |2 + C
k
4
γ2λ
2ηEhθkη , Vkη i + η 2 E|Vkη |2 ,


4
γη 2 L1 Cρ η 2 η2 γ 2 γ 2 η2 λ
 
ηγ γη
= − Ehθkη , h(θkη )i + Ehh(θkη ), Vkη i + + − − E |Vkη |2
2 2 2 4 2 4
η2 Le1 γ 2 ηλ f1 η 2
C
+ E |θkη |2 − Ehθkη , Vkη i + + γηβ −1 d,
2 2 2
λγ 3 η γη 2 η2 L
e1
≤ −ηγλEU (θkη ) − E|θkη |2 + Ac ηγβ −1 + Ehh(θkη ), Vkη i + E |θkη |2
4 2 2
L1 Cρ η 2 η2 γ 2 γ 2 η2 λ γ 2 ηλ f1 η 2
 
γη C
+ + − − E |Vkη |2 − Ehθkη , Vkη i + + γηβ −1 d.
2 4 2 4 2 2


where the last line is obtained using (30). Next, using the fact that 0 < λ ≤ 1/4 and the form of the
Lyapunov function (31), we obtain

γ γ2 1
− Ehθkη , Vkη i ≤ −M2 (k) + EU (θkη ) + E|θkη |2 + E|Vkη |2 .
2 4 2
Using this, we can obtain

γη 2 η2 L
e1
M2 (k + 1) − M2 (k) ≤ Ac ηγβ −1 + Ehh(θkη ), Vkη i + E |θkη |2 + γηβ −1 d
2 2
L1 Cρ η 2 η2 γ 2 γ 2 η2 λ f1 η 2
 
γη γλη C
+ + − − + E |Vkη |2 + − γληM2 (k).
2 4 2 4 2 2

Next, reorganizing and using ha, bi ≤ (|a|2 + |b|2 )/2

η2 Le1 Cf1 η 2
M2 (k + 1) ≤ (1 − γλη)M2 (k) + Ac ηγβ −1 + E |θkη |2 + + γηβ −1 d
2 2
L1 Cρ η 2 η2 γ 2 γ 2 η2 λ γη 2 γη 2
 
γη γλη
+ + − − + + E |Vkη |2 + E|h(θkη )|2 ,
2 4 2 4 4 4 4
!
L
e1 γ f1 η 2
C
≤ (1 − γλη)M2 (k) + Ac ηγβ −1 + η 2 + L21 Cρ2 E |θkη |2 + + γηβ −1 d
2 2 2
γ2 γ2λ γη 2 h20
 
L1 Cρ γ
+η 2
+ − + E |Vkη |2 + ,
2 4 4 4 2
where the last inequality follows since λ ≤ 1/4 and (10). We note that

    V(θ, v) ≥ max( (1/8)(1 − 2λ)βγ² |θ|², (β/4)(1 − 2λ)|v|² ),

which implies, by the definition of M2(k), that

    M2(k) ≥ max( (1/8)(1 − 2λ)γ² E|θ^η_k|², (1/4)(1 − 2λ) E|V^η_k|² )
          ≥ (1/16)(1 − 2λ)γ² E|θ^η_k|² + (1/8)(1 − 2λ) E|V^η_k|²,    (37)
since max{x, y} ≥ (x + y)/2 for any x, y > 0. Therefore, we obtain

    M2(k + 1) ≤ (1 − γλη + K1 η²) M2(k) + K2 η² + K3 η,

where

    K1 := max{ ( L1Cρ/2 + γ²/4 − γ²λ/4 + γ/4 ) / ( (1/8)(1 − 2λ) ), ( L̃1/2 + (γ/2) L1² Cρ² ) / ( (1/16)(1 − 2λ)γ² ) }

and

    K2 = ( C̃1 + γh0² )/2   and   K3 = (Ac + d) γ β^{−1}.

For 0 < η ≤ min{ K3/K2, γλ/(2K1), 2/(γλ) }, we obtain

    M2(k + 1) ≤ ( 1 − γλη/2 ) M2(k) + 2K3 η,

which implies

    M2(k) ≤ M2(0) + (4/(γλ)) K3.

Combining this with (37) gives the result.


C.2 Proof of Lemma 3.3


We recall that ζ̄^{η,n}_t is the Langevin diffusion started at θ̄^η_{nT} and run until t ∈ (nT, (n + 1)T). First notice
that Lemma 3.1 implies

    sup_{t∈(nT,(n+1)T)} E|ζ̄^{η,n}_t|² ≤ ( EV(θ̄^η_{nT}, V̄^η_{nT}) + 4(d + Ac)/λ ) / ( (1/8)(1 − 2λ)βγ² ),    (38)

which, noting that EV(θ̄^η_{nT}, V̄^η_{nT}) = βM2(nT), implies

    sup_{t∈(nT,(n+1)T)} E|ζ̄^{η,n}_t|² ≤ ( βM2(0) + 8(d + Ac)/λ ) / ( (1/8)(1 − 2λ)βγ² ).

Substituting M2(0) gives

    sup_{t∈(nT,(n+1)T)} E|ζ̄^{η,n}_t|² ≤ Cζ := ( ∫_{R^{2d}} V(θ, v) µ0(dθ, dv) + 8(d + Ac)/λ ) / ( (1/8)(1 − 2λ)βγ² ).

C.3 Proof of Lemma 3.4


We need to obtain the contraction of EV²(θ^η_k, V^η_k). Recall again the Lyapunov function defined by Eberle
et al. (2019),

    V(θ, v) = βU(θ) + (βγ²/4)( ‖θ + γ^{−1}v‖² + ‖γ^{−1}v‖² − λ‖θ‖² ).

In this proof, we follow a strategy similar to the proof of Lemma 3.2 and the approaches of Gao et al. (2022)
and Chau and Rásonyi (2022). We define the notation Vk := V(θ^η_k, V^η_k) and note

    Vk/β = U(θ^η_k) + (γ²/4)( |θ^η_k + γ^{−1}V^η_k|² + |γ^{−1}V^η_k|² − λ|θ^η_k|² ).    (39)

Recall our discrete-time recursions (6)–(7),

    V^η_{k+1} = (1 − ηγ)V^η_k − ηH(θ^η_k, X_{k+1}) + √(2ηγβ^{−1}) ξ_{k+1},
    θ^η_{k+1} = θ^η_k + ηV^η_k,   θ^η_0 = θ0,  V^η_0 = v0,
where (ξk )k∈N is a sequence of i.i.d. Normal random variables. We first define ∆1k = (1 − ηγ)Vkη −
η H(θkη , Xk+1 ) and write

|2 = |∆1k |2 + 2 2ηγβ −1 h∆1k , ξk+1 i + 2γηβ −1 |ξk+1 |2 ,


η
p
|Vk+1
= |(1 − γη)Vkη − ηH(θkη , Xk+1 )|2 + 2 2ηγβ −1 h∆1k , ξk+1 i + 2γηβ −1 |ξk+1 |2 ,
p

= (1 − γη)2 |Vkη |2 − 2η(1 − γη)hVkη , H(θkη , Xk+1 )i


+ η 2 |H(θkη , Xk+1 )|2 + 2 2ηγβ −1 h∆1k , ξk+1 i + 2γηβ −1 |ξk+1 |2 .
p
(40)
Next, we note that
2
η
θk+1 = |θkη |2 + 2ηhθkη , Vkη i + η 2 |Vkη |2 . (41)
Recall h := ∇U and note also that
Z 1
η
U (θk+1 ) = U (θkη + ηVkη ) = U (θkη ) + hh(θkη + τ ηVkη ), ηVkη idτ,
0

which suggests
Z 1
η
U (θk+1 ) − U (θkη ) − hh(θkη ), ηVkη )i = hh(θkη + τ ηVkη ) − h(θkη ), ηVkη idτ ,
0
Z 1
≤ |h(θkη + τ ηVkη ) − h(θkη )| |ηVkη | dτ,
0
1
≤ L1 Cρ η 2 |Vkη |2 ,
2


where, similarly as in the previous proof, the second line follows from the Cauchy-Schwarz inequality and
the final line follows from (9). Finally we arrive at

1
η
U (θk+1 ) − U (θkη ) ≤ ηhh(θkη ), Vkη i + L1 Cρ η 2 |Vkη |2 . (42)
2

Next, we note that ∆2k = θkη + γ −1 Vkη − ηγ −1 H(θkη , Xk+1 ) and continue computing
2 2
+ γ −1 Vk+1 = θkη + γ −1 Vkη − ηγ −1 H(θkη , Xk+1 ) 2γ −1 β −1 ηh∆2k , ξk+1 i + 2γ −1 β −1 η|ξk+1 |2 ,
η η
p
θk+1 +2
2
= θkη + γ −1 Vkη − 2ηγ −1 hθkη + γ −1 Vkη , H(θkη , Xk+1 )i
+ 2 2γ −1 β −1 ηh∆2k , ξk+1 i + η 2 γ −2 |H(θkη , Xk+1 )|2 + 2γ −1 ηβ −1 |ξk+1 |2 ,
p
(43)

Next, combining (40), (41), (42), (43),

Vk+1 − Vk  γ2  η 2 2

η
= U (θk+1 ) − U (θkη ) + θk+1 + γ −1 Vk+1 η
− θkη + γ −1 Vkη
β 4
1 η 2  γ2λ  
2
+ Vk+1 − |Vkη |2 − η
θk+1 − |θkη |2 ,
4 4
1 γ2
≤ ηhh(θkη ), Vkη i + L1 Cρ η 2 |Vkη |2 + −2ηγ −1 hθkη + γ −1 Vkη , H(θkη , Xk+1 )i
2 4
 1
+η 2 γ −2 |H(θkη , Xk+1 )|2 + 2γ −1 ηβ −1 |ξk+1 |2 + (−2γη + γ 2 η 2 ) |Vkη |2 − 2η(1 − γη)hVkη , H(θkη , Xk+1 )i
4
 γ2λ  
+η 2 |H(θkη , Xk+1 )|2 + 2γηβ −1 |ξk+1 |2 − 2ηhθkη , Vkη i + η 2 |Vkη |2 + Σk ,
4
L1 Cρ η 2 γ 2 λη 2 γ 2 η2 η2
 
ηγ η η η η γη
≤ − hθk , H(θk , Xk+1 )i + ηhh(θk ), Vk i + − − + |Vkη |2 + |H(θkη , Xk+1 )|2
2 2 4 2 4 2
 2
γ 2 λη η η

γη
+ ηγβ −1 |ξk+1 |2 + − η hVkη , H(θkη , Xk+1 )i − hθk , Vk i + Σk
2 2

where

γ 2 p −1 −1 1p
Σk = 2γ β ηh∆2k , ξk+1 i + 2ηγβ −1 h∆1k , ξk+1 i.
2 2
Next, using the fact that 0 < λ ≤ 1/4 and the form of the Lyapunov function (39), we obtain

γ Vk γ2 η 2 1 η 2
− hθkη , Vkη i ≤ − + U (θkη ) + |θ | + |Vk | .
2 β 4 k 2

Using this and merging some terms, we obtain

Vk+1 − Vk ηγ
≤ − hθkη , H(θkη , Xk+1 )i + ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i
β 2
L1 Cρ η 2 γ 2 λη 2 γ 2 η2 η2
 
γη
+ − − + |Vkη |2 + |H(θkη , Xk+1 )|2 + ηγβ −1 |ξk+1 |2
2 4 2 4 2
γη 2 η Vk γ 3 λη η 2 γλη η 2
+ hVk , H(θkη , Xk+1 )i − γλη + γληU (θkη ) + |θk | + |Vk | + Σk . (44)
2 β 4 2

Next, by reorganizing and using ha, bi ≤ (|a|2 + |b|2 )/2, we arrive at

Vk+1 Vk ηγ η ηγ η
≤ (1 − γλη) − hθ , h(θkη )i + hθ , h(θkη ) − H(θkη , Xk+1 )i + ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i
β β 2 k 2 k
L1 Cρ η 2 γ 2 λη 2 γ 2 η2 η2 γ 3 λη η 2
 
γη γλη
+ − − + + |Vkη |2 + |H(θkη , Xk+1 )|2 + ηγβ −1 |ξk+1 |2 + γληU (θkη ) + |θk |
2 4 2 4 2 2 4
+ Σk . (45)


Using (30), and λ ≤ 1/4, we obtain

Vk+1 Vk γ 3 ηλ η 2 ηγ η
≤ (1 − γλη) − ηγλU (θkη ) − |θk | + ηγAc β −1 + hθ , h(θkη ) − H(θkη , Xk+1 )i
β β 4 2 k
L1 Cρ η 2 γ 2 λη 2 γ 2 η2 η2
 
γη γλη
+ ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i + − − + + |Vkη |2 + |H(θkη , Xk+1 )|2
2 4 2 4 2 2
γ 3 λη η 2
+ ηγβ −1 |ξk+1 |2 + γληU (θkη ) + |θk | + Σk ,
4
Vk ηγ η
= (1 − γλη) + ηγAc β −1 + hθ , h(θkη ) − H(θkη , Xk+1 )i + ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i
β 2 k
L1 Cρ η 2 γ 2 λη 2 γ 2 η2 η2
 
γη γλη
+ − − + + |Vkη |2 + |H(θkη , Xk+1 )|2 + ηγβ −1 |ξk+1 |2 + Σk ,
2 4 2 4 2 2
Vk ηγ η
≤ (1 − γλη) + ηγAc β −1 + hθ , h(θkη ) − H(θkη , Xk+1 )i + ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i
β 2 k
L1 Cρ η 2 γ 2 λη 2 γ 2 η2
 
γη γλη
+ − − + + |Vkη |2
2 4 2 4 2
η2  2 
+ 2L1 (1 + |Xk+1 |)2ρ |θkη |2 + 4L22 (1 + |Xk+1 |)2(ρ+1) + 4H02 + ηγβ −1 |ξk+1 |2 + Σk ,
2
by Remark 2.1. Let φ = (1 − γλη) and let Hk = Gk ∨ σ(ξ1 , . . . , ξk ). By using λ ≤ 1/4, we obtain
2
V2 L1 Cρ η 2 γ 2 λη 2 γ 2 η2
  
E[Vk+1 |Hk ] Vk
2
≤ φ2 k2 + 2φ ηγAc β −1 + − + |Vkη |2
β β β 2 4 4
η2
 
ηγ η
2L21 Cρ |θkη |2 + 4L22 Cρ + 4H02 + ηγβ −1 d + E ηγAc β −1 + hθ , h(θkη ) − H(θkη , Xk+1 )i

+
2 2 k
L1 Cρ η 2 γ 2 λη 2 γ 2 η2
 
γη γλη
+ ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i + − − + + |Vkη |2
2 4 2 4 2
2 #
η2  2 2ρ η 2 2 2(ρ+1) 2

−1 2
+ 2L1 (1 + |Xk+1 |) |θk | + 4L2 (1 + |Xk+1 |) + 4H0 + ηγβ |ξk+1 | + Σk Hk .
2

We first note that


 
Vk 1 1
≥ max (1 − 2λ)γ 2 |θkη |2 , (1 − 2λ)|Vkη |2 ,
β 8 4
1 1
≥ (1 − 2λ)γ |θk | + (1 − 2λ)|Vkη |2 ,
2 η 2
(46)
16 8
since max{x, y} ≥ (x + y)/2 for any x, y > 0. Then we first obtain
$$
\begin{aligned}
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} &\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + 2\phi\frac{V_k}{\beta}\Big(\eta\gamma A_c\beta^{-1} + 2\eta^2 L_2^2 C_\rho + 2\eta^2 H_0^2 + \eta\gamma\beta^{-1}d\Big) \\
&\quad + \mathbb{E}\bigg[\bigg(\eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{2}\big\langle\theta_k^\eta, h(\theta_k^\eta) - H(\theta_k^\eta, X_{k+1})\big\rangle + \eta\big\langle h(\theta_k^\eta) - H(\theta_k^\eta, X_{k+1}), V_k^\eta\big\rangle \\
&\qquad + \Big(\frac{L_1 C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2}\Big)\big|V_k^\eta\big|^2 \\
&\qquad + \frac{\eta^2}{2}\Big(2L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k\bigg)^2\,\bigg|\,\mathcal{H}_k\bigg],
\end{aligned}
$$

where
$$
\tilde K_1 := \max\Bigg\{\frac{\frac{L_1 C_\rho}{2} + \frac{\gamma^2}{4} - \frac{\gamma^2\lambda}{4}}{\frac{1}{8}(1-2\lambda)},\ \frac{L_1^2 C_\rho}{\frac{1}{16}(1-2\lambda)\gamma^2}\Bigg\}.
$$


Next, using $\langle a, b\rangle \le (|a|^2 + |b|^2)/2$, we obtain

$$
\begin{aligned}
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} &\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + 2\phi\frac{V_k}{\beta}\Big(\eta\gamma A_c\beta^{-1} + 2\eta^2 L_2^2 C_\rho + 2\eta^2 H_0^2 + \eta\gamma\beta^{-1}d\Big) \\
&\quad + \mathbb{E}\bigg[\bigg(\eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{4}\big|\theta_k^\eta\big|^2 + \Big(\frac{\eta\gamma}{4} + \frac{\eta}{2}\Big)\big|h(\theta_k^\eta) - H(\theta_k^\eta, X_{k+1})\big|^2 \\
&\qquad + \Big(\frac{L_1 C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2} + \frac{\eta}{2}\Big)\big|V_k^\eta\big|^2 \\
&\qquad + \frac{\eta^2}{2}\Big(2L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k\bigg)^2\,\bigg|\,\mathcal{H}_k\bigg] \\
&\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + 2\phi\frac{V_k}{\beta}\Big(\eta\gamma A_c\beta^{-1} + 2\eta^2 L_2^2 C_\rho + 2\eta^2 H_0^2 + \eta\gamma\beta^{-1}d\Big) \\
&\quad + \mathbb{E}\bigg[\bigg(\eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{4}\big|\theta_k^\eta\big|^2 + \Big(\frac{\eta\gamma}{2} + \eta\Big)\Big(\big|h(\theta_k^\eta)\big|^2 + \big|H(\theta_k^\eta, X_{k+1})\big|^2\Big) \\
&\qquad + \Big(\frac{L_1 C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2} + \frac{\eta}{2}\Big)\big|V_k^\eta\big|^2 \\
&\qquad + \frac{\eta^2}{2}\Big(2L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k\bigg)^2\,\bigg|\,\mathcal{H}_k\bigg],
\end{aligned}
$$

where we have used $|a + b|^2 \le 2|a|^2 + 2|b|^2$. Now using Remark 2.1, we obtain

$$
\begin{aligned}
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} &\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + 2\phi\frac{V_k}{\beta}\Big(\eta\gamma A_c\beta^{-1} + 2\eta^2 L_2^2 C_\rho + 2\eta^2 H_0^2 + \eta\gamma\beta^{-1}d\Big) \\
&\quad + \mathbb{E}\bigg[\bigg(\eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{4}\big|\theta_k^\eta\big|^2 + \Big(\frac{\eta\gamma}{2} + \eta\Big)\Big(2L_1^2 C_\rho^2\big|\theta_k^\eta\big|^2 + 2h_0^2 + 2L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 \\
&\qquad + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \Big(\frac{L_1 C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2} + \frac{\eta}{2}\Big)\big|V_k^\eta\big|^2 \\
&\qquad + \frac{\eta^2}{2}\Big(2L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k\bigg)^2\,\bigg|\,\mathcal{H}_k\bigg] \\
&= \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + 2\phi\frac{V_k}{\beta}\Big(\eta\gamma A_c\beta^{-1} + 2\eta^2 L_2^2 C_\rho + 2\eta^2 H_0^2 + \eta\gamma\beta^{-1}d\Big) \\
&\quad + \mathbb{E}\bigg[\bigg(\eta\gamma A_c\beta^{-1} + \Big(\frac{\eta\gamma}{4} + \Big(\frac{\eta\gamma}{2} + \eta\Big)\big(2L_1^2 C_\rho^2 + 2L_1^2(1+|X_{k+1}|)^{2\rho}\big) + \frac{\eta^2}{2}\,2L_1^2(1+|X_{k+1}|)^{2\rho}\Big)\big|\theta_k^\eta\big|^2 \\
&\qquad + \Big(\frac{\eta\gamma}{2} + \eta\Big)\Big(2h_0^2 + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \Big(\frac{L_1 C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2} + \frac{\eta}{2}\Big)\big|V_k^\eta\big|^2 \\
&\qquad + \frac{\eta^2}{2}\Big(4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k\bigg)^2\,\bigg|\,\mathcal{H}_k\bigg].
\end{aligned}
$$

Now taking the expectation, regrouping and simplifying the terms above, we finally arrive at

$$
\begin{aligned}
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} &\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + 2\phi\frac{V_k}{\beta^2}\,\eta\tilde c_1 + 2\phi\frac{V_k}{\beta}\,\eta^2\hat c_1 + \eta^2\frac{\tilde c_2}{\beta^2} + \tilde c_3\eta^2\big|\theta_k^\eta\big|^4 + \hat c_3\eta^4\big|\theta_k^\eta\big|^4 \\
&\quad + \eta^2\tilde c_4\big|V_k^\eta\big|^4 + \eta^4\hat c_4\big|V_k^\eta\big|^4 + \eta^2\tilde c_5 + \eta^4\hat c_5 + \eta^2\frac{\tilde c_6}{\beta^2} + \mathbb{E}[\Sigma_k^2\,|\,\mathcal{H}_k], \tag{47}
\end{aligned}
$$


where
$$
\begin{aligned}
\tilde c_1 &= \gamma A_c + \gamma d, \\
\hat c_1 &= 2L_2^2 C_\rho + 2H_0^2, \\
\tilde c_2 &= 6\gamma^2 A_c^2, \\
\tilde c_3 &= \frac{9\gamma^2}{8} + 18(\gamma+2)^2\big(L_1^4 C_\rho^4 + L_1^4 C_\rho\big), \\
\hat c_3 &= 18 L_1^4 C_\rho, \\
\tilde c_4 &= 6\,\frac{(1 + \lambda\gamma - \gamma)^2}{4}, \\
\hat c_4 &= 6\Big(\frac{L_1 C_\rho}{2} + \frac{\gamma}{4} + \frac{1}{2}\Big)^2, \\
\tilde c_5 &= (\gamma+2)^2\big(30h_0^4 + 120L_2^4 + 120H_0^4\big) + 30L_2^4 C_\rho + 30H_0^4, \\
\hat c_5 &= 48\big(L_2^4 + H_0^4\big), \\
\tilde c_6 &= \gamma^2 d(d+2).
\end{aligned}
$$
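Two elementary facts are used repeatedly when the squared bracket leading to (47) is expanded; we record them here as a sketch, since they explain the numerical factors (6, 18, 30, 48, and $d(d+2)$) appearing in the constants above:
$$
\Big(\sum_{i=1}^{m} a_i\Big)^2 \le m\sum_{i=1}^{m} a_i^2, \qquad
\mathbb{E}\big[|\xi_{k+1}|^4 \,\big|\, \mathcal{H}_k\big] = d(d+2) \quad \text{for } \xi_{k+1}\sim\mathcal{N}(0, I_d).
$$
For instance, the term $(\eta\gamma\beta^{-1}|\xi_{k+1}|^2)^2$ has conditional expectation $\eta^2\gamma^2\beta^{-2}d(d+2)$, which accounts for the contribution $\eta^2\tilde c_6/\beta^2$ with $\tilde c_6 = \gamma^2 d(d+2)$.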

Next, recall

$$
\Sigma_k = \frac{\gamma^2}{2}\sqrt{2\gamma^{-1}\beta^{-1}\eta}\,\big\langle \Delta_k^2, \xi_{k+1}\big\rangle + \frac{1}{2}\sqrt{2\eta\gamma\beta^{-1}}\,\big\langle \Delta_k^1, \xi_{k+1}\big\rangle,
$$
where
$$
\Delta_k^1 = (1 - \eta\gamma)V_k^\eta - \eta H(\theta_k^\eta, X_{k+1}), \qquad
\Delta_k^2 = \theta_k^\eta + \gamma^{-1}V_k^\eta - \eta\gamma^{-1} H(\theta_k^\eta, X_{k+1}).
$$

Note that

$$
\begin{aligned}
\Sigma_k^2 &\le 2\gamma\beta^{-1}\eta\,\big|\Delta_k^1\big|^2|\xi_{k+1}|^2 + 2\gamma^3\beta^{-1}\eta\,\big|\Delta_k^2\big|^2|\xi_{k+1}|^2, \\
&\le 2\eta\gamma\beta^{-1}|\xi_{k+1}|^2\Big(\big|V_k^\eta - \eta\big(\gamma V_k^\eta + H(\theta_k^\eta, X_{k+1})\big)\big|^2 + \gamma^2\big|\theta_k^\eta + \gamma^{-1}V_k^\eta - \eta\gamma^{-1}H(\theta_k^\eta, X_{k+1})\big|^2\Big), \\
&\le 2\eta\gamma\beta^{-1}|\xi_{k+1}|^2\Big(2(1-\eta\gamma)^2\big|V_k^\eta\big|^2 + 2\eta^2\big|H(\theta_k^\eta, X_{k+1})\big|^2 + 3\gamma^2\big|\theta_k^\eta\big|^2 + 3\big|V_k^\eta\big|^2 + 3\eta^2\big|H(\theta_k^\eta, X_{k+1})\big|^2\Big) \\
&\le 2\eta\gamma\beta^{-1}|\xi_{k+1}|^2\Big(2(1-\eta\gamma)^2\big|V_k^\eta\big|^2 + 2\eta^2\big(3L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 + 3L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 3H_0^2\big) \\
&\qquad + 3\gamma^2\big|\theta_k^\eta\big|^2 + 3\big|V_k^\eta\big|^2 + 3\eta^2\big(3L_1^2(1+|X_{k+1}|)^{2\rho}\big|\theta_k^\eta\big|^2 + 3L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 3H_0^2\big)\Big).
\end{aligned}
$$

Therefore, taking expectation and using (46) and η ≤ 1/γ, we obtain

$$
\begin{aligned}
\mathbb{E}[\Sigma_k^2\,|\,\mathcal{H}_k] &\le 2\eta\gamma\beta^{-1}d\Big(5\big|V_k^\eta\big|^2 + \big(6L_1^2 C_\rho + 3\gamma^2 + 9L_1^2 C_\rho\big)\big|\theta_k^\eta\big|^2 + 15L_2^2 C_\rho + 15H_0^2\Big) \\
&\le \eta\tilde c_7\,\frac{V_k}{\beta^2} + \eta\,\frac{\tilde c_8}{\beta},
\end{aligned}
$$
where
$$
\tilde c_7 = \max\Bigg\{\frac{10\gamma d}{\frac{1}{8}(1-2\lambda)},\ \frac{(2\gamma d)\big(6L_1^2 C_\rho + 3\gamma^2 + 9L_1^2 C_\rho\big)}{\frac{1}{16}(1-2\lambda)\gamma^2}\Bigg\}, \qquad
\tilde c_8 = 30\gamma d L_2^2 C_\rho + 30\gamma d H_0^2.
$$

Now plugging this into the inequality (47), we arrive at


$$
\begin{aligned}
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} &\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + (2\phi\tilde c_1 + \tilde c_7)\frac{V_k}{\beta^2}\,\eta + 2\phi\frac{V_k}{\beta}\,\eta^2\hat c_1 + \eta^2\frac{\tilde c_2}{\beta^2} + \tilde c_3\eta^2\big|\theta_k^\eta\big|^4 + \hat c_3\eta^4\big|\theta_k^\eta\big|^4 \\
&\quad + \eta^2\tilde c_4\big|V_k^\eta\big|^4 + \eta^4\hat c_4\big|V_k^\eta\big|^4 + \eta^2\tilde c_5 + \eta^4\hat c_5 + \eta^2\frac{\tilde c_6}{\beta^2} + \eta\frac{\tilde c_8}{\beta}, \tag{48}
\end{aligned}
$$


Now, recall our step-size condition for the above inequality, in particular, η ≤ 1, which leads to
$$
\begin{aligned}
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} &\le \big(\phi^2 + 2\phi\tilde K_1\eta^2\big)\frac{V_k^2}{\beta^2} + (2\phi\tilde c_1 + \tilde c_7)\frac{V_k}{\beta^2}\,\eta + 2\phi\frac{V_k}{\beta}\,\eta^2\hat c_1 + \eta^2\frac{\tilde c_2}{\beta^2} + (\tilde c_3 + \hat c_3)\eta^2\big|\theta_k^\eta\big|^4 \\
&\quad + \eta^2(\tilde c_4 + \hat c_4)\big|V_k^\eta\big|^4 + \eta^2(\tilde c_5 + \hat c_5) + \eta^2\frac{\tilde c_6}{\beta^2} + \eta\frac{\tilde c_8}{\beta}. \tag{49}
\end{aligned}
$$
Next, recall that
 
$$
\frac{V_k}{\beta} \ge \max\Big\{\tfrac{1}{8}(1-2\lambda)\gamma^2\big|\theta_k^\eta\big|^2,\ \tfrac{1}{4}(1-2\lambda)\big|V_k^\eta\big|^2\Big\},
$$
which implies
$$
\frac{V_k^2}{\beta^2} \ge \max\Big\{\tfrac{1}{64}(1-2\lambda)^2\gamma^4\big|\theta_k^\eta\big|^4,\ \tfrac{1}{16}(1-2\lambda)^2\big|V_k^\eta\big|^4\Big\}
\ge \tfrac{1}{128}(1-2\lambda)^2\gamma^4\big|\theta_k^\eta\big|^4 + \tfrac{1}{32}(1-2\lambda)^2\big|V_k^\eta\big|^4. \tag{50}
$$
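For clarity, the second inequality in (50) follows by squaring the first display and using that $\max\{x, y\}^2 = \max\{x^2, y^2\} \ge (x^2 + y^2)/2$ for $x, y \ge 0$:
$$
\frac{V_k^2}{\beta^2} \ge \Big(\max\Big\{\tfrac{1}{8}(1-2\lambda)\gamma^2\big|\theta_k^\eta\big|^2,\ \tfrac14(1-2\lambda)\big|V_k^\eta\big|^2\Big\}\Big)^2
\ge \tfrac12\Big(\tfrac{1}{64}(1-2\lambda)^2\gamma^4\big|\theta_k^\eta\big|^4 + \tfrac{1}{16}(1-2\lambda)^2\big|V_k^\eta\big|^4\Big),
$$
which is exactly the right-hand side of (50).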
Now inequalities (49) and (50) together imply
$$
\frac{\mathbb{E}[V_{k+1}^2\,|\,\mathcal{H}_k]}{\beta^2} \le \big(\phi^2 + 2\phi\tilde K_2\eta^2\big)\frac{V_k^2}{\beta^2} + (2\phi\tilde c_1 + \tilde c_7)\frac{V_k}{\beta^2}\,\eta + 2\phi\frac{V_k}{\beta}\,\eta^2\hat c_1 + \eta^2\frac{\tilde c_2}{\beta^2} + \eta^2(\tilde c_5 + \hat c_5) + \eta^2\frac{\tilde c_6}{\beta^2} + \eta\frac{\tilde c_8}{\beta},
$$
where $\tilde K_2 = \tilde K_1 + \tilde c_9$ and
$$
\tilde c_9 = \max\Bigg\{\frac{\tilde c_3 + \hat c_3}{\frac{1}{128}(1-2\lambda)^2\gamma^4},\ \frac{\hat c_4 + \tilde c_4}{\frac{1}{32}(1-2\lambda)^2}\Bigg\}.
$$
Now we take unconditional expectations, use the fact that φ ≤ 1, and obtain
$$
\mathbb{E}[V_{k+1}^2] \le \big(1 - \lambda\gamma\eta + \bar K\eta^2\big)\mathbb{E}[V_k^2] + \tilde c_{10}\,\mathbb{E}[V_k]\,\eta + 2\hat c_1\,\mathbb{E}[V_k]\,\beta\eta^2 + \eta^2\tilde c_2 + \eta^2(\hat c_5 + \tilde c_5)\beta^2 + \eta^2\tilde c_6 + \eta\tilde c_8\beta,
$$

where $\bar K = 2\tilde K_2$ and $\tilde c_{10} = 2\tilde c_1 + \tilde c_7$. First, note that it follows from the proof of Lemma 3.2 that
$$
\mathbb{E}[V_k] \le \mathbb{E}[V_0] + \frac{4(A_c + d)}{\gamma\lambda} =: \tilde c_{11}.
$$
Therefore, using η ≤ 1, we obtain
$$
\mathbb{E}[V_{k+1}^2] \le \big(1 - \lambda\gamma\eta + \bar K\eta^2\big)\mathbb{E}[V_k^2] + D\eta,
$$
where $D = \tilde c_{10}\tilde c_{11} + 2\tilde c_{11}\hat c_1\beta + \tilde c_2 + (\hat c_5 + \tilde c_5)\beta^2 + \tilde c_6 + \tilde c_8\beta$. Now recall that $\eta \le \lambda\gamma/(2\bar K)$, which leads to
 
$$
\mathbb{E}[V_{k+1}^2] \le \Big(1 - \frac{\lambda\gamma\eta}{2}\Big)\mathbb{E}[V_k^2] + D\eta,
$$
which implies that
$$
\mathbb{E}[V_k^2] \le \mathbb{E}[V_0^2] + \frac{2D}{\gamma\lambda},
$$
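For completeness, the last implication is the standard geometric-series bound: iterating the recursion and using $0 < \lambda\gamma\eta/2 < 1$ gives (a sketch)
$$
\mathbb{E}[V_k^2] \le \Big(1 - \tfrac{\lambda\gamma\eta}{2}\Big)^{k}\mathbb{E}[V_0^2] + D\eta\sum_{j=0}^{k-1}\Big(1 - \tfrac{\lambda\gamma\eta}{2}\Big)^{j}
\le \mathbb{E}[V_0^2] + D\eta\cdot\frac{2}{\lambda\gamma\eta} = \mathbb{E}[V_0^2] + \frac{2D}{\gamma\lambda},
$$
uniformly in $k$,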
which concludes the proof. 

C.4 Proof of Lemma B.1


We start by writing that
$$
U(\theta) - U(0) = \int_0^1 \langle\theta, h(t\theta)\rangle\,\mathrm{d}t
\le \int_0^1 |\theta|\,|h(t\theta)|\,\mathrm{d}t
\le \int_0^1 |\theta|\big(\bar L_1 t|\theta| + h_0\big)\,\mathrm{d}t,
$$


from Remark (10), where $\bar L_1 = L_1\,\mathbb{E}[(1 + |X_0|)^\rho]$. This in turn leads to
$$
U(\theta) \le u_0 + \frac{\bar L_1}{2}|\theta|^2 + h_0|\theta|,
$$
2
where u0 = U (0). Next, we prove the lower bound. To this end, take c ∈ (0, 1) and write
$$
U(\theta) = U(c\theta) + \int_c^1 \langle\theta, h(t\theta)\rangle\,\mathrm{d}t
\ge \int_c^1 \frac{1}{t}\,\langle t\theta, h(t\theta)\rangle\,\mathrm{d}t
\ge \int_c^1 \frac{1}{t}\big(a|t\theta|^2 - b\big)\,\mathrm{d}t
= \frac{a(1-c^2)}{2}|\theta|^2 + b\log c.
$$

Taking $c = 1/\sqrt{3}$ leads to the bound.
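Concretely, substituting $c = 1/\sqrt{3}$ into the last display gives (a quick check of the arithmetic)
$$
U(\theta) \ge \frac{a\big(1 - \tfrac13\big)}{2}|\theta|^2 + b\log\frac{1}{\sqrt{3}} = \frac{a}{3}|\theta|^2 - \frac{b}{2}\log 3.
$$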

C.5 Proof of Lemma B.3


Let $\mathcal{H}_t^\infty = \mathcal{F}_\infty^\eta \vee \mathcal{G}_{\lfloor t\rfloor}$. Following Zhang et al. (2023b), we obtain
$$
\begin{aligned}
\mathbb{E}\Big[\big|h(\bar\zeta_t^{\eta,n}) - H(\bar\zeta_t^{\eta,n}, X_{nT+k})\big|^2\Big]
&= \mathbb{E}\Big[\mathbb{E}\Big[\big|h(\bar\zeta_t^{\eta,n}) - H(\bar\zeta_t^{\eta,n}, X_{nT+k})\big|^2\,\Big|\,\mathcal{H}_{nT}^\infty\Big]\Big] \\
&= \mathbb{E}\Big[\mathbb{E}\Big[\big|\mathbb{E}\big[H(\bar\zeta_t^{\eta,n}, X_{nT+k})\,\big|\,\mathcal{H}_{nT}^\infty\big] - H(\bar\zeta_t^{\eta,n}, X_{nT+k})\big|^2\,\Big|\,\mathcal{H}_{nT}^\infty\Big]\Big] \\
&\le 4\,\mathbb{E}\Big[\mathbb{E}\Big[\big|H(\bar\zeta_t^{\eta,n}, X_{nT+k}) - H\big(\bar\zeta_t^{\eta,n}, \mathbb{E}[X_{nT+k}\,|\,\mathcal{H}_{nT}^\infty]\big)\big|^2\,\Big|\,\mathcal{H}_{nT}^\infty\Big]\Big] \\
&\le 4L_2^2\,\sigma_Z\,\mathbb{E}\Big[\big(1 + \big|\bar\zeta_t^{\eta,n}\big|\big)^2\Big],
\end{aligned}
$$

where the first inequality holds due to Lemma B.2 and
$$
\sigma_Z = \mathbb{E}\big[(1 + |X_0| + |\mathbb{E}[X_0]|)^{2\rho}\,|X_0 - \mathbb{E}[X_0]|^2\big].
$$
Then, by using Lemma 3.3, we obtain
$$
\mathbb{E}\Big[\big|h(\bar\zeta_t^{\eta,n}) - H(\bar\zeta_t^{\eta,n}, X_{nT+k})\big|^2\Big] \le 8L_2^2\sigma_Z + 8L_2^2\sigma_Z C_\zeta.
$$
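The last step spells out as follows, assuming (as used here) that Lemma 3.3 provides the uniform second-moment bound $\mathbb{E}|\bar\zeta_t^{\eta,n}|^2 \le C_\zeta$:
$$
4L_2^2\sigma_Z\,\mathbb{E}\big[(1 + |\bar\zeta_t^{\eta,n}|)^2\big] \le 4L_2^2\sigma_Z\big(2 + 2\,\mathbb{E}|\bar\zeta_t^{\eta,n}|^2\big) \le 8L_2^2\sigma_Z + 8L_2^2\sigma_Z C_\zeta.
$$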

C.6 Proof of Lemma B.4


Note that for any t, we have
$$
\bar V_t^\eta = \bar V_{\lfloor t\rfloor}^\eta - \eta\gamma\int_{\lfloor t\rfloor}^{t} \bar V_{\lfloor s\rfloor}^\eta\,\mathrm{d}s - \eta\int_{\lfloor t\rfloor}^{t} H\big(\bar\theta_{\lfloor s\rfloor}^\eta, X_{\lceil s\rceil}\big)\,\mathrm{d}s + \sqrt{2\gamma\eta\beta^{-1}}\,\big(B_t^\eta - B_{\lfloor t\rfloor}^\eta\big).
$$

We therefore obtain
" 2#
h i Z t Z t
η η 2 η p
E V btc −V t =E −ηγ V bsc ds −η H(θbsc , Xdse )ds + 2γηβ −1 (Btη − Bbtc
η
)
btc btc
2 2
≤ 4η γ Cv + 4η (L e1 ) + 4γηβ −1 d.
e 1 Cθ + C 2

Next, we write
$$
\bar\theta_t^\eta = \bar\theta_{\lfloor t\rfloor}^\eta + \eta\int_{\lfloor t\rfloor}^{t} \bar V_{\lfloor\tau\rfloor}^\eta\,\mathrm{d}\tau,
$$

which implies
 
$$
\mathbb{E}\Big[\big|\bar\theta_t^\eta - \bar\theta_{\lfloor t\rfloor}^\eta\big|^2\Big] \le \eta^2 C_v.
$$
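The implication is a direct consequence of the Cauchy–Schwarz inequality on the interval $[\lfloor t\rfloor, t]$ of length at most one, together with the uniform velocity moment bound (we take $C_v$ to denote $\sup_\tau \mathbb{E}|\bar V_{\lfloor\tau\rfloor}^\eta|^2$, consistently with its use above):
$$
\mathbb{E}\Big[\big|\bar\theta_t^\eta - \bar\theta_{\lfloor t\rfloor}^\eta\big|^2\Big]
= \eta^2\,\mathbb{E}\bigg[\Big|\int_{\lfloor t\rfloor}^{t}\bar V_{\lfloor\tau\rfloor}^\eta\,\mathrm{d}\tau\Big|^2\bigg]
\le \eta^2\,(t - \lfloor t\rfloor)\int_{\lfloor t\rfloor}^{t}\mathbb{E}\big|\bar V_{\lfloor\tau\rfloor}^\eta\big|^2\,\mathrm{d}\tau
\le \eta^2 C_v.
$$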


C.7 Proof of Lemma B.5


Choose λ > 0 such that
$$
\lambda = \min\Bigg\{\frac{1}{4},\ \frac{a}{\bar L_1 + 2\bar L_1 b + \frac{\gamma^2}{2}}\Bigg\}.
$$

Using Assumption 2.4, we obtain

$$
\begin{aligned}
\langle\theta, h(\theta)\rangle &\ge a|\theta|^2 - b, \\
&\ge 2\lambda\Big(\frac{\bar L_1}{2} + \bar L_1 b + \frac{\gamma^2}{4}\Big)|\theta|^2 - b, \\
&\ge 2\lambda\Big(U(\theta) - u_0 - \bar L_1 b|\theta| + \bar L_1 b|\theta|^2 + \frac{\gamma^2}{4}|\theta|^2\Big) - b, \\
&\ge 2\lambda\Big(U(\theta) - u_0 - \bar L_1 b + \frac{\gamma^2}{4}|\theta|^2\Big) - b,
\end{aligned}
$$

where the third line follows from Lemma B.1 and the last line follows from the inequality $|x| \le 1 + |x|^2$.
Consequently, we obtain

$$
\langle\theta, h(\theta)\rangle \ge 2\lambda\Big(U(\theta) + \frac{\gamma^2}{4}|\theta|^2\Big) - 2A_c/\beta,
$$

where

$$
A_c = \frac{\beta}{2}\big(b + 2\lambda u_0 + 2\lambda\bar L_1 b\big),
$$

which proves the claim.

Appendix D. Proofs of main results


D.1 Proof of Theorem 4.1
We note that
$$
W_2\big(\mathcal{L}(\bar\theta_t^\eta, \bar V_t^\eta), \mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n})\big)
\le \Big(\mathbb{E}\big[\big|\bar\theta_t^\eta - \bar\zeta_t^{\eta,n}\big|^2\big]\Big)^{1/2} + \Big(\mathbb{E}\big[\big|\bar V_t^\eta - \bar Z_t^{\eta,n}\big|^2\big]\Big)^{1/2}. \tag{51}
$$
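Here (51) is the standard coupling estimate: any coupling of the two laws (in particular the synchronous one used below) upper bounds the infimum defining $W_2$, and $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$, since
$$
W_2\big(\mathcal{L}(\bar\theta_t^\eta, \bar V_t^\eta), \mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n})\big)^2
\le \mathbb{E}\big[\big|\bar\theta_t^\eta - \bar\zeta_t^{\eta,n}\big|^2\big] + \mathbb{E}\big[\big|\bar V_t^\eta - \bar Z_t^{\eta,n}\big|^2\big].
$$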

We first bound the first term of (51). We start by employing the synchronous coupling and obtain

$$
\big|\bar\theta_t^\eta - \bar\zeta_t^{\eta,n}\big| \le \eta\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|\,\mathrm{d}s,
$$

which implies
$$
\sup_{nT\le u\le t}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]
\le \eta\sup_{nT\le u\le t}\int_{nT}^{u}\mathbb{E}\Big[\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s
= \eta\int_{nT}^{t}\mathbb{E}\Big[\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s,
$$


Next, we write for any t ∈ [nT, (n + 1)T )

$$
\begin{aligned}
\big|\bar V_{\lfloor t\rfloor}^\eta - \bar Z_t^{\eta,n}\big|
&\le \big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big| + \big|\bar V_t^\eta - \bar Z_t^{\eta,n}\big|, \\
&\le \big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big| + \Big|-\gamma\eta\int_{nT}^{t}\big[\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big]\,\mathrm{d}s - \eta\int_{nT}^{t}\big[H(\bar\theta_{\lfloor s\rfloor}^\eta, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|, \\
&\le \big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big| + \gamma\eta\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|\,\mathrm{d}s + \eta\Big|\int_{nT}^{t}\big[H(\bar\theta_{\lfloor s\rfloor}^\eta, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|, \\
&\le \big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big| + \gamma\eta\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|\,\mathrm{d}s
+ \eta\int_{nT}^{t}\big|H(\bar\theta_{\lfloor s\rfloor}^\eta, X_{\lceil s\rceil}) - H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil})\big|\,\mathrm{d}s
+ \eta\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|, \\
&\le \big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big| + \gamma\eta\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|\,\mathrm{d}s
+ \eta L_1\int_{nT}^{t}(1 + |X_{\lceil s\rceil}|)^{\rho}\big|\bar\theta_{\lfloor s\rfloor}^\eta - \bar\zeta_s^{\eta,n}\big|\,\mathrm{d}s
+ \eta\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|.
\end{aligned}
$$

We take the squares of both sides and use $(a + b)^2 \le 2(a^2 + b^2)$ twice to obtain

$$
\begin{aligned}
\big|\bar V_{\lfloor t\rfloor}^\eta - \bar Z_t^{\eta,n}\big|^2
&\le 4\big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big|^2 + 4\gamma^2\eta^2\Big(\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|\,\mathrm{d}s\Big)^2 \\
&\quad + 4\eta^2\Big(\int_{nT}^{t} L_1(1 + |X_{\lceil s\rceil}|)^{\rho}\big|\bar\theta_{\lfloor s\rfloor}^\eta - \bar\zeta_s^{\eta,n}\big|\,\mathrm{d}s\Big)^2 + 4\eta^2\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|^2 \\
&\le 4\big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big|^2 + 4\gamma^2\eta\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\,\mathrm{d}s \\
&\quad + 4\eta L_1^2\int_{nT}^{t}(1 + |X_{\lceil s\rceil}|)^{2\rho}\big|\bar\theta_{\lfloor s\rfloor}^\eta - \bar\zeta_s^{\eta,n}\big|^2\,\mathrm{d}s + 4\eta^2\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|^2.
\end{aligned}
$$

Taking expectations of both sides, we obtain

$$
\begin{aligned}
\mathbb{E}\Big[\big|\bar V_{\lfloor t\rfloor}^\eta - \bar Z_t^{\eta,n}\big|^2\Big]
&\le 4\,\mathbb{E}\Big[\big|\bar V_{\lfloor t\rfloor}^\eta - \bar V_t^\eta\big|^2\Big] + 4\gamma^2\eta\int_{nT}^{t}\mathbb{E}\Big[\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s \\
&\quad + 4\eta L_1^2 C_\rho\int_{nT}^{t}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s\rfloor}^\eta - \bar\zeta_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s + 4\eta^2\,\mathbb{E}\bigg[\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|^2\bigg], \\
&\le 4\sigma_V\eta + 4\gamma^2\eta\int_{nT}^{t}\mathbb{E}\Big[\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s \\
&\quad + 4\eta L_1^2 C_\rho\int_{nT}^{t}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s\rfloor}^\eta - \bar\zeta_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s + 4\eta^2\,\mathbb{E}\bigg[\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|^2\bigg].
\end{aligned}
$$

By applying Grönwall’s lemma, we arrive at

$$
\begin{aligned}
\mathbb{E}\Big[\big|\bar V_{\lfloor t\rfloor}^\eta - \bar Z_t^{\eta,n}\big|^2\Big]
&\le 4c_1\sigma_V\eta + 4\eta c_1 L_1^2 C_\rho\int_{nT}^{t}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s\rfloor}^\eta - \bar\zeta_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s \\
&\quad + 4c_1\eta^2\,\mathbb{E}\bigg[\Big|\int_{nT}^{t}\big[H(\bar\zeta_s^{\eta,n}, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|^2\bigg],
\end{aligned}
$$


where $c_1 = \exp(4\gamma^2)$, since $\eta T \le 1$. Next, we write


$$
\begin{aligned}
\sup_{nT\le u\le t}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]
&\le \eta\int_{nT}^{t}\mathbb{E}\Big[\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s, \\
&\le 4c_1\eta\sigma_V + 4c_1\eta^2 L_1^2 C_\rho\int_{nT}^{t}\int_{nT}^{s}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s'\rfloor}^\eta - \bar\zeta_{s'}^{\eta,n}\big|^2\Big]\,\mathrm{d}s'\,\mathrm{d}s \\
&\quad + 4c_1\eta^3\int_{nT}^{t}\mathbb{E}\bigg[\Big|\int_{nT}^{s}\big[H(\bar\zeta_{s'}^{\eta,n}, X_{\lceil s'\rceil}) - h(\bar\zeta_{s'}^{\eta,n})\big]\,\mathrm{d}s'\Big|^2\bigg]\,\mathrm{d}s, \\
&\le 4c_1\eta\sigma_V + 4c_1\eta L_1^2 C_\rho\sup_{nT\le s\le t}\int_{nT}^{s}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s'\rfloor}^\eta - \bar\zeta_{s'}^{\eta,n}\big|^2\Big]\,\mathrm{d}s' \\
&\quad + 4c_1\eta^3\int_{nT}^{t}\mathbb{E}\bigg[\Big|\int_{nT}^{s}\big[H(\bar\zeta_{s'}^{\eta,n}, X_{\lceil s'\rceil}) - h(\bar\zeta_{s'}^{\eta,n})\big]\,\mathrm{d}s'\Big|^2\bigg]\,\mathrm{d}s. \tag{52}
\end{aligned}
$$

First, we bound the supremum term (i.e. the second term of (52)) as
$$
\begin{aligned}
\sup_{nT\le s\le t}\int_{nT}^{s}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s'\rfloor}^\eta - \bar\zeta_{s'}^{\eta,n}\big|^2\Big]\,\mathrm{d}s'
&= \int_{nT}^{t}\mathbb{E}\Big[\big|\bar\theta_{\lfloor s'\rfloor}^\eta - \bar\zeta_{s'}^{\eta,n}\big|^2\Big]\,\mathrm{d}s' \\
&\le \int_{nT}^{t}\sup_{nT\le u\le s'}\mathbb{E}\Big[\big|\bar\theta_{\lfloor u\rfloor}^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]\,\mathrm{d}s' \\
&\le \int_{nT}^{t}\sup_{nT\le u\le s'}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]\,\mathrm{d}s'. \tag{53}
\end{aligned}
$$

Next, we bound the last term of (52) by partitioning the integral. Assume that $nT + K \le s \le t \le nT + K + 1$, where $K + 1 \le T$. Thus we can write

$$
\int_{nT}^{s}\big[h(\bar\zeta_{s'}^{\eta,n}) - H(\bar\zeta_{s'}^{\eta,n}, X_{\lceil s'\rceil})\big]\,\mathrm{d}s' = \sum_{k=1}^{K} I_k + R_K,
$$
where
$$
I_k = \int_{nT+(k-1)}^{nT+k}\big[h(\bar\zeta_{s'}^{\eta,n}) - H(\bar\zeta_{s'}^{\eta,n}, X_{nT+k})\big]\,\mathrm{d}s'
\quad\text{and}\quad
R_K = \int_{nT+K}^{s}\big[h(\bar\zeta_{s'}^{\eta,n}) - H(\bar\zeta_{s'}^{\eta,n}, X_{nT+K+1})\big]\,\mathrm{d}s'.
$$

Taking squares of both sides

$$
\Big|\sum_{k=1}^{K} I_k + R_K\Big|^2 = \sum_{k=1}^{K}|I_k|^2 + 2\sum_{k=2}^{K}\sum_{j=1}^{k-1}\langle I_k, I_j\rangle + 2\sum_{k=1}^{K}\langle I_k, R_K\rangle + |R_K|^2.
$$

Finally, it remains to take the expectations of both sides. We begin by defining the filtration $\mathcal{H}_s^\infty = \mathcal{F}_\infty^\eta \vee \mathcal{G}_{\lfloor s\rfloor}$ and note that for any $k = 2, \dots, K$, $j = 1, \dots, k-1$,

$$
\begin{aligned}
\mathbb{E}\langle I_k, I_j\rangle
&= \mathbb{E}\Big[\mathbb{E}\big[\langle I_k, I_j\rangle\,\big|\,\mathcal{H}_{nT+j}^\infty\big]\Big], \\
&= \mathbb{E}\bigg[\mathbb{E}\bigg[\Big\langle\int_{nT+(k-1)}^{nT+k}\big[H(\bar\zeta_{s'}^{\eta,n}, X_{nT+k}) - h(\bar\zeta_{s'}^{\eta,n})\big]\,\mathrm{d}s',\ \int_{nT+(j-1)}^{nT+j}\big[H(\bar\zeta_{s'}^{\eta,n}, X_{nT+j}) - h(\bar\zeta_{s'}^{\eta,n})\big]\,\mathrm{d}s'\Big\rangle\,\bigg|\,\mathcal{H}_{nT+j}^\infty\bigg]\bigg], \\
&= \mathbb{E}\bigg[\Big\langle\int_{nT+(k-1)}^{nT+k}\mathbb{E}\big[H(\bar\zeta_{s'}^{\eta,n}, X_{nT+k}) - h(\bar\zeta_{s'}^{\eta,n})\,\big|\,\mathcal{H}_{nT+j}^\infty\big]\,\mathrm{d}s',\ \int_{nT+(j-1)}^{nT+j}\big[H(\bar\zeta_{s'}^{\eta,n}, X_{nT+j}) - h(\bar\zeta_{s'}^{\eta,n})\big]\,\mathrm{d}s'\Big\rangle\bigg], \\
&= 0.
\end{aligned}
$$


By the same argument $\mathbb{E}\langle I_k, R_K\rangle = 0$ for all $1 \le k \le K$. Therefore,


"Z #
Z t s 2
η,n η,n
E [H(ζ s0 , Xds0 e ) − h(ζ s0 )]ds0 ds
nT nT
" K
"Z 2 ##
Z t X nT +k
0
= E [h(ζ̄sη,n
0 ) − H(ζ̄sη,n
0 , XnT +k )]ds ds
nT k=1 nT +(k−1)
"Z #
Z t s 2
0
+ E [h(ζ̄sη,n
0 ) − H(ζ̄sη,n
0 , XnT +K+1 )]ds ds
nT nT +K
" K Z
#
Z t nT +k h i
2
X 0
≤ E h(ζ̄sη,n
0 ) − H( ζ̄ η,n
s0 , X nT +k ) ds ds
nT k=1 nT +(k−1)
Z t Z s h i
2
+ E h(ζ̄sη,n η,n
0 ) − H(ζ̄s0 , XnT +K+1 ) ds0 ds
nT nT +K

≤ T 2 σH + T σH . (54)
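To summarize the counting behind (54): the cross terms vanish by the conditioning argument above, each of the $K \le T$ unit-length blocks (and the remainder) is controlled via Jensen's inequality and Lemma B.3, with $\sigma_H$ denoting the resulting bound, and the outer integral runs over an interval of length at most $T$. Schematically,
$$
\mathbb{E}\Big|\sum_{k=1}^{K} I_k + R_K\Big|^2 = \sum_{k=1}^{K}\mathbb{E}|I_k|^2 + \mathbb{E}|R_K|^2 \le K\sigma_H + \sigma_H \le T\sigma_H + \sigma_H,
$$
and integrating this over $s \in [nT, t]$ yields the stated bound $T^2\sigma_H + T\sigma_H$.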

Using (52), (53), and (54), we eventually obtain


$$
\begin{aligned}
\sup_{nT\le u\le t}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]
&\le 4c_1\eta\sigma_V + 4c_1\eta L_1^2 C_\rho\int_{nT}^{t}\sup_{nT\le u\le s'}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]\,\mathrm{d}s' + 4c_1\eta^3\big(T^2\sigma_H + T\sigma_H\big), \\
&\le 4c_1\eta\sigma_V + 4c_1\eta L_1^2 C_\rho\int_{nT}^{t}\sup_{nT\le u\le s'}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big]\,\mathrm{d}s' + 4c_1\eta\sigma_H + 4c_1\eta^2\sigma_H, \tag{55}
\end{aligned}
$$

since ηT ≤ 1. Finally, applying Grönwall’s inequality and using again ηT ≤ 1 provides


 
$$
\sup_{nT\le u\le t}\mathbb{E}\Big[\big|\bar\theta_u^\eta - \bar\zeta_u^{\eta,n}\big|^2\Big] \le \exp\big(4c_1 L_1^2 C_\rho\big)\big(4c_1\sigma_V + 4c_1\sigma_H + 4c_1\eta\sigma_H\big)\eta,
$$

which implies that


$$
\Big(\mathbb{E}\Big[\big|\bar\theta_t^\eta - \bar\zeta_t^{\eta,n}\big|^2\Big]\Big)^{1/2} \le C_{1,1}^\star\sqrt{\eta} \tag{56}
$$

with $C_{1,1}^\star = \sqrt{\exp(4c_1 L_1^2 C_\rho)\big(4c_1\sigma_V + 4c_1\sigma_H + 4c_1\eta\sigma_H\big)}$. Note that $\sigma_V = O(d)$ and $\sigma_H = O(d)$, hence $C_{1,1}^\star = O(\sqrt{d})$.
Next, we upper bound the second term of (51). To prove it, we write
$$
\big|\bar V_t^\eta - \bar Z_t^{\eta,n}\big| \le \gamma\eta\int_{nT}^{t}\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|\,\mathrm{d}s + \eta\Big|\int_{nT}^{t}\big[H(\bar\theta_{\lfloor s\rfloor}^\eta, X_{\lceil s\rceil}) - h(\bar\zeta_s^{\eta,n})\big]\,\mathrm{d}s\Big|,
$$

which leads to
"Z 2
#
h η i Z t h η i t h i
η,n 2 η,n 2 η η,n
E V t − Zt ≤ 2γ 2 η E V bsc − Z s + 2η 2 E H(θbsc ) − h(ζ s ) ds .
nT nT

By arguments similar to those used for bounding the first term, we obtain
$$
\mathbb{E}\Big[\big|\bar V_t^\eta - \bar Z_t^{\eta,n}\big|^2\Big] \le 2\gamma^2\eta\int_{nT}^{t}\mathbb{E}\Big[\big|\bar V_{\lfloor s\rfloor}^\eta - \bar Z_s^{\eta,n}\big|^2\Big]\,\mathrm{d}s + \exp\big(4c_1 L_1^2 C_\rho\big)\big(4c_1\sigma_H\eta + 4c_1\sigma_H\eta^2\big).
$$

Using the fact that the right-hand side is an increasing function of $t$, we obtain
$$
\sup_{nT\le u\le t}\mathbb{E}\Big[\big|\bar V_u^\eta - \bar Z_u^{\eta,n}\big|^2\Big] \le 2\gamma^2\eta\int_{nT}^{t}\sup_{nT\le u\le s}\mathbb{E}\Big[\big|\bar V_u^\eta - \bar Z_u^{\eta,n}\big|^2\Big]\,\mathrm{d}s + \exp\big(4c_1 L_1^2 C_\rho\big)\big(4c_1\sigma_H\eta + 4c_1\sigma_H\eta^2\big),
$$


Applying Grönwall's lemma and $\eta T \le 1$ yields


$$
\sup_{nT\le u\le t}\mathbb{E}\Big[\big|\bar V_u^\eta - \bar Z_u^{\eta,n}\big|^2\Big] \le \exp\big(2\gamma^2 + 4c_1 L_1^2 C_\rho\big)\,4c_1\big(\sigma_H\eta + \sigma_H\eta^2\big),
$$

which leads to
$$
\Big(\mathbb{E}\Big[\big|\bar V_t^\eta - \bar Z_t^{\eta,n}\big|^2\Big]\Big)^{1/2} \le C_{1,2}^\star\sqrt{\eta}, \tag{57}
$$
where $C_{1,2}^\star = \sqrt{\exp(2\gamma^2 + 4c_1 L_1^2 C_\rho)\,4c_1\sigma_H(1+\eta)}$. Note again that $\sigma_H = O(d)$, hence $C_{1,2}^\star = O(d^{1/2})$.
Therefore, combining (51), (56), and (57), we obtain
$$
W_2\big(\mathcal{L}(\bar\theta_t^\eta, \bar V_t^\eta), \mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n})\big) \le C_1^\star\eta^{1/2},
$$
where $C_1^\star = C_{1,1}^\star + C_{1,2}^\star = O(d^{1/2})$.

D.2 Proof of Theorem 4.2


The triangle inequality implies that
$$
W_2\big(\mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n}), \mathcal{L}(\zeta_t^\eta, Z_t^\eta)\big) \le \sum_{k=1}^{n} W_2\big(\mathcal{L}(\bar\zeta_t^{\eta,k}, \bar Z_t^{\eta,k}), \mathcal{L}(\bar\zeta_t^{\eta,k-1}, \bar Z_t^{\eta,k-1})\big),
$$
since $(\bar\zeta_t^{\eta,0}, \bar Z_t^{\eta,0}) = (\hat\zeta_t^{0,\theta_0^\eta,V_0^\eta,\eta}, \hat Z_t^{0,\theta_0^\eta,V_0^\eta,\eta}) = (\zeta_t^\eta, Z_t^\eta)$ by definition. Next we write out the definition of the process $(\bar\zeta_t^{\eta,k}, \bar Z_t^{\eta,k})$ and obtain
$$
\begin{aligned}
&W_2\big(\mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n}), \mathcal{L}(\zeta_t^\eta, Z_t^\eta)\big) \\
&\qquad \le \sum_{k=1}^{n} W_2\Big(\mathcal{L}\big(\hat\zeta_t^{kT,\theta_{kT}^\eta,V_{kT}^\eta,\eta}, \hat Z_t^{kT,\theta_{kT}^\eta,V_{kT}^\eta,\eta}\big),\ \mathcal{L}\big(\hat\zeta_t^{(k-1)T,\theta_{(k-1)T}^\eta,V_{(k-1)T}^\eta,\eta}, \hat Z_t^{(k-1)T,\theta_{(k-1)T}^\eta,V_{(k-1)T}^\eta,\eta}\big)\Big).
\end{aligned}
$$

At this point, we have two processes and in order to be able to use the contraction result, we need their
starting times to match. For notational simplicity, let us define
$$
\hat{\mathcal{B}}_t^{s,\theta_s^\eta,V_s^\eta,\eta} = \big(\hat\zeta_t^{s,\theta_s^\eta,V_s^\eta,\eta}, \hat Z_t^{s,\theta_s^\eta,V_s^\eta,\eta}\big).
$$

Therefore, in order to be able to use a contraction result, we note


$$
\mathcal{L}\Big(\hat{\mathcal{B}}_t^{(k-1)T,\theta_{(k-1)T}^\eta,V_{(k-1)T}^\eta,\eta}\Big) = \mathcal{L}\Big(\hat{\mathcal{B}}_t^{kT,\hat{\mathcal{B}}_{kT}^{(k-1)T,\theta_{(k-1)T}^\eta,V_{(k-1)T}^\eta,\eta},\eta}\Big).
$$
This leads to
$$
\begin{aligned}
&W_2\big(\mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n}), \mathcal{L}(\zeta_t^\eta, Z_t^\eta)\big) \\
&\qquad \le \sum_{k=1}^{n} W_2\Big(\mathcal{L}\big(\hat\zeta_t^{kT,\theta_{kT}^\eta,V_{kT}^\eta,\eta}, \hat Z_t^{kT,\theta_{kT}^\eta,V_{kT}^\eta,\eta}\big),\ \mathcal{L}\big(\hat\zeta_t^{kT,\hat{\mathcal{B}}_{kT}^{(k-1)T,\theta_{(k-1)T}^\eta,V_{(k-1)T}^\eta,\eta},\eta}, \hat Z_t^{kT,\hat{\mathcal{B}}_{kT}^{(k-1)T,\theta_{(k-1)T}^\eta,V_{(k-1)T}^\eta,\eta},\eta}\big)\Big).
\end{aligned}
$$

The reader should notice at this point that this quantity can be upper bounded by a contraction result as
both processes are defined for time t and both started at time kT . By using Theorem 3.1, we obtain
$$
W_2\big(\mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n}), \mathcal{L}(\zeta_t^\eta, Z_t^\eta)\big)
\le \sqrt{\dot C}\,\sum_{k=1}^{n} e^{-\eta\dot c\,(t-kT)/2}\,\sqrt{W_\rho\Big(\mathcal{L}\big(\theta_{kT}^\eta, V_{kT}^\eta\big), \mathcal{L}\big(\bar\zeta_{kT}^{\eta,(k-1)T}, \bar Z_{kT}^{\eta,(k-1)T}\big)\Big)}.
$$

Next, using Lemma 5.4 of Chau and Rásonyi (2022) to upper bound the last term leads to
$$
\begin{aligned}
&W_2\big(\mathcal{L}(\bar\zeta_t^{\eta,n}, \bar Z_t^{\eta,n}), \mathcal{L}(\zeta_t^\eta, Z_t^\eta)\big) \\
&\quad \le 3\max\{1+\alpha, \gamma^{-1}\}\,\sqrt{\dot C}\,\sum_{k=1}^{n} e^{-\eta\dot c\,(t-kT)/2}\,\sqrt{1 + \varepsilon_c\,\mathbb{E}^{1/2}\big[V^2(\theta_{kT}^\eta, V_{kT}^\eta)\big] + \varepsilon_c\,\mathbb{E}^{1/2}\big[V^2\big(\bar\zeta_{kT}^{\eta,(k-1)T}, \bar Z_{kT}^{\eta,(k-1)T}\big)\big]} \\
&\qquad\qquad \times \sqrt{W_2\Big(\mathcal{L}\big(\theta_{kT}^\eta, V_{kT}^\eta\big), \mathcal{L}\big(\bar\zeta_{kT}^{\eta,(k-1)T}, \bar Z_{kT}^{\eta,(k-1)T}\big)\Big)}, \\
&\quad \le C_2^\star\eta^{1/4},
\end{aligned}
$$


where the last line follows from Lemma 3.4 and Theorem 4.1. The bound for the term relating to the continuous dynamics, i.e. $\mathbb{E}^{1/2}[V^2(\bar\zeta_{kT}^{\eta,(k-1)T}, \bar Z_{kT}^{\eta,(k-1)T})]$, follows from Remark 2.3 (dissipativity of the gradient), the fact that the diffusion coefficient is constant, the uniform bound in Lemma 3.1, and, finally, the fact that the initial data (which is given by the iterates of the numerical scheme) has finite fourth moments (Lemma 3.4). We note that $C_2^\star = O(e^d)$.
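As a brief sketch of where the rate $\eta^{1/4}$ comes from (under the convention, assumed here only for illustration, that the block length $T$ satisfies $\eta T \le 1$ and is of order $1/\eta$): each summand above is of order $\sqrt{\eta^{1/2}} = \eta^{1/4}$ by Theorem 4.1 and the moment bounds, while the exponential weights sum to an $\eta$-independent constant, since for $t \in [nT, (n+1)T)$,
$$
\sum_{k=1}^{n} e^{-\eta\dot c\,(t - kT)/2} \le \sum_{j=0}^{\infty} e^{-\dot c\,\eta T j/2} = \frac{1}{1 - e^{-\dot c\,\eta T/2}},
$$
so the whole sum is of order $\eta^{1/4}$, in agreement with the last display.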

D.3 Proof of Proposition 5.1


We denote $\pi_{n,\beta}^\eta := \mathcal{L}(\theta_n^\eta, V_n^\eta)$ and write
$$
\mathbb{E}[U(\theta_n^\eta)] - \mathbb{E}[U(\theta_\infty)] = \int_{\mathbb{R}^{2d}} U(\theta)\,\pi_{n,\beta}^\eta(\mathrm{d}\theta, \mathrm{d}v) - \int_{\mathbb{R}^{2d}} U(\theta)\,\pi_\beta(\mathrm{d}\theta, \mathrm{d}v).
$$

Recall from (10) that
$$
|h(\theta)| \le L_1|\theta| + h_0.
$$

Using Raginsky et al. (2017, Lemma 6), we arrive at


$$
\int_{\mathbb{R}^{2d}} U(\theta)\,\pi_{n,\beta}^\eta(\mathrm{d}\theta, \mathrm{d}v) - \int_{\mathbb{R}^{2d}} U(\theta)\,\pi_\beta(\mathrm{d}\theta, \mathrm{d}v) \le (L_1 C_m + h_0)\,W_2\big(\pi_{n,\beta}^\eta, \pi_\beta\big),
$$

where
$$
C_m := \max\bigg\{\int_{\mathbb{R}^{2d}} \|\theta\|^2\,\pi_{n,\beta}^\eta(\mathrm{d}\theta, \mathrm{d}v),\ \int_{\mathbb{R}^{2d}} \|\theta\|^2\,\pi_\beta(\mathrm{d}\theta, \mathrm{d}v)\bigg\} = \max(C_\theta^c, C_\theta).
$$

We therefore obtain using Theorem 2.1 that


$$
\mathbb{E}[U(\theta_n^\eta)] - \mathbb{E}[U(\theta_\infty)] \le (L_1 C_m + h_0)\Big(C_1^\star d^{1/2}\eta^{1/2} + C_2^\star\eta^{1/4} + C_3^\star e^{-C_4^\star\eta n}\Big).
$$
