Nonasymptotic Analysis of Stochastic Gradient Hamiltonian Monte Carlo Under Local Conditions For Nonconvex Optimization
Abstract
We provide a nonasymptotic analysis of the convergence of the stochastic gradient Hamil-
tonian Monte Carlo (SGHMC) to a target measure in Wasserstein-2 distance without as-
suming log-concavity. Our analysis quantifies key theoretical properties of the SGHMC as a
sampler under local conditions, which significantly improves upon previous results.
In particular, we prove that the Wasserstein-2 distance between the target and the law of
the SGHMC is uniformly controlled by the step-size of the algorithm, thereby demonstrating
that the SGHMC can provide high-precision results uniformly in the number of iterations.
The analysis also allows us to obtain nonasymptotic bounds for nonconvex optimization
problems under local conditions and implies that the SGHMC, when viewed as a noncon-
vex optimizer, converges to a global minimum with the best known rates. We apply our
results to obtain nonasymptotic bounds for scalable Bayesian inference and nonasymptotic
generalization bounds.
Keywords: Non-convex optimization, underdamped Langevin Monte Carlo, non-log-concave
sampling, nonasymptotic bounds, global optimization.
1. Introduction
We are interested in nonasymptotic estimates for the problem of sampling from probability
measures of the form
\[
\pi_\beta(\mathrm{d}\theta) \propto \exp(-\beta U(\theta))\,\mathrm{d}\theta, \tag{1}
\]
where U : \mathbb{R}^d \to \mathbb{R} is a (possibly nonconvex) potential and \beta > 0 is an inverse temperature parameter,
when only a noisy estimate of ∇U is available. This problem arises in many cases in
machine learning, most notably in large-scale (mini-batch) Bayesian inference (Welling and
Teh, 2011; Ahn et al., 2012) and nonconvex stochastic optimization as for large values of
β, a sample from the target measure (1) is an approximate minimizer of the potential U
(Gelfand and Mitter, 1991). Consequently, nonasymptotic error bounds for the sampling
schemes can be used to obtain guarantees for Bayesian inference or nonconvex optimization.
An efficient method for obtaining a sample from (1) is simulating the overdamped
Langevin stochastic differential equation (SDE) which is given by
\[
dL_t = -h(L_t)\,dt + \sqrt{\tfrac{2}{\beta}}\,dB_t, \tag{2}
\]
with a random initial condition L0 := θ0 where h := ∇U and (Bt )t≥0 is a d-dimensional
Brownian motion. The Langevin SDE (2) admits πβ as the unique invariant measure,
therefore simulating this process will lead to samples from πβ and can be used as a Markov
chain Monte Carlo (MCMC) algorithm (Roberts et al., 1996; Roberts and Stramer, 2002).
Moreover, the fact that the limiting probability measure πβ concentrates around the global
minimum of U for sufficiently large values of β makes the diffusion (2) also an attractive
candidate as a global optimizer (see, e.g., Hwang (1980)). However, since the continuous-
time process (2) cannot be simulated, its first-order Euler discretization with the step-size
η > 0 is used in practice, termed the Unadjusted Langevin Algorithm (ULA) (Roberts et al.,
1996). The ULA scheme has become popular in recent years due to its advantages in high-
dimensional settings and ease of implementation. Nonasymptotic properties of the ULA
were recently established under strong convexity and smoothness assumptions by Dalalyan
(2017); Durmus et al. (2017, 2019) while some extensions about relaxing smoothness as-
sumptions or inaccurate gradients were also considered by Dalalyan and Karagulyan (2019);
Brosse et al. (2019). Similar attractive properties hold for the ULA when the potential
U is nonconvex (Gelfand and Mitter, 1991; Raginsky et al., 2017; Xu et al., 2018; Erdogdu
et al., 2018; Sabanis and Zhang, 2019). In recent works, ULA has been extended for non-
convex cases in several different directions, e.g., under log-Sobolev inequality (Vempala
and Wibisono, 2019), under Hölder continuity and specific tail growth conditions (Erdogdu
and Hosseinzadeh, 2021), under Poincaré inequality (Chewi et al., 2022). Further work
has extended these results, see, e.g., Balasubramanian et al. (2022) for averaged Langevin
schemes, Erdogdu et al. (2022) for the analysis under dissipativity in chi-squared and Rényi
divergences, Mou et al. (2022) for results under dissipativity with smooth initialisation, and
finally under weak Poincaré inequalities (Mousavi-Hosseini et al., 2023).
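To make the discretisation concrete, a minimal sketch of the ULA update (the first-order Euler discretisation of (2)) is given below; the quadratic potential used here is only an illustrative placeholder, not an example from this paper.

```python
import numpy as np

def ula_step(theta, grad_U, eta, beta, rng):
    """One ULA step: theta_{k+1} = theta_k - eta * grad_U(theta_k) + sqrt(2*eta/beta) * xi."""
    xi = rng.standard_normal(theta.shape)
    return theta - eta * grad_U(theta) + np.sqrt(2.0 * eta / beta) * xi

# Illustrative placeholder potential U(theta) = 0.5 * ||theta||^2, so grad_U(theta) = theta.
rng = np.random.default_rng(0)
theta = rng.standard_normal(2)
for _ in range(1000):
    theta = ula_step(theta, lambda th: th, eta=0.01, beta=1.0, rng=rng)
print(theta)  # approximately a draw from N(0, I) for this toy potential
```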
While the ULA performs well when the computation of the gradient h(·) is straightfor-
ward, this is not the case in most interesting applications. Usually, a stochastic, unbiased
estimate of h(·) is available, either because the cost function is defined as an expectation
or as a finite sum. Using stochastic gradients in the ULA leads to another scheme called
stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011). The SGLD has
been particularly popular in the fields of (i) large-scale Bayesian inference since it allows
one to construct Markov chain Monte Carlo (MCMC) algorithms using only subsets of
the dataset (Welling and Teh, 2011), (ii) nonconvex optimization since it enables one to es-
timate global minima using only stochastic (often cheap-to-compute) gradients (Raginsky
et al., 2017). As a result, attempts for theoretical understanding of the SGLD have been
recently made in several works, both for the strongly convex potentials (i.e. log-concave
targets), see, e.g., Barkhagen et al. (2021); Brosse et al. (2018) and nonconvex potentials,
see, e.g. Raginsky et al. (2017); Majka et al. (2018); Zhang et al. (2023b). Our particular
interest is in nonasymptotic bounds for the nonconvex case, as it is relevant to our work. In
their seminal paper, Raginsky et al. (2017) obtain a nonasymptotic bound between the law
of the SGLD and the target measure in Wasserstein-2 distance with a rate of order η^{5/4} n, where η is
the step-size and n is the number of iterations. While this work was the first of its kind, the error
rate grows with the number of iterations. In a related contribution, Xu et al. (2018) have
obtained improved rates, albeit still growing with the number of iterations n. In a more
recent work, Chau et al. (2021) have obtained a uniform rate of order η^{1/2} in Wasserstein-1
distance. Majka et al. (2018) achieved error rates of η^{1/2} and η^{1/4} for Wasserstein-1 and
Wasserstein-2 distances, respectively, under the assumption of convexity outside a ball. Fi-
nally, Zhang et al. (2023b) achieved the same rates under only local conditions which can
be verified for a class of practical problems.
An alternative to methods based on the overdamped Langevin SDE (2) is the class
of algorithms based on the underdamped Langevin SDE. To be precise, the underdamped
Langevin SDE is given as
\[
dV_t = -\gamma V_t\,dt - h(\theta_t)\,dt + \sqrt{\tfrac{2\gamma}{\beta}}\,dB_t, \tag{3}
\]
\[
d\theta_t = V_t\,dt, \tag{4}
\]
where (θt , Vt )t≥0 are called position and momentum process, respectively, and h := ∇U .
Similar to eq. (2), this diffusion can be used as both an MCMC sampler and nonconvex
optimizer, since under appropriate conditions, the Markov process (θt , Vt )t≥0 has a unique
invariant measure given by
\[
\pi_\beta(d\theta, dv) \propto \exp\Big(-\beta\Big(\tfrac{1}{2}\|v\|^2 + U(\theta)\Big)\Big)\,d\theta\,dv. \tag{5}
\]
Consequently, the marginal distribution of (5) in θ is precisely the target measure defined
in (1). This means that sampling from (5) in the extended space and then keeping the
samples in the θ-space would define a valid sampler for the sampling problem of (1).
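For completeness, integrating out the momentum variable makes explicit why the θ-marginal of (5) is exactly the target in (1):
\[
\int_{\mathbb{R}^d} \exp\Big(-\beta\Big(\tfrac{1}{2}\|v\|^2 + U(\theta)\Big)\Big)\,dv
= \Big(\frac{2\pi}{\beta}\Big)^{d/2} e^{-\beta U(\theta)} \ \propto\ e^{-\beta U(\theta)}.
\]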
Due to its attractive properties, methods based on the underdamped Langevin SDE have
attracted significant attention. In particular, the first order discretization of (3)–(4), which
is termed underdamped Langevin MCMC (i.e. the underdamped counterpart of the ULA),
has been a focus of attention, see, e.g., Duncan et al. (2017); Dalalyan and Riou-Durand
(2018); Cheng et al. (2018b). Particularly, the underdamped Langevin MCMC has displayed
improved convergence rates in the setting where U is convex, see, e.g., Dalalyan and Riou-
Durand (2018); Cheng et al. (2018b). Similar results have been extended to the nonconvex
case. In particular, Cheng et al. (2018a) have shown that the underdamped Langevin
MCMC converges in Wasserstein-2 with a better dimension and step-size dependence under
the assumptions of smoothness and convexity outside a ball. It has been also shown that the
underdamped Langevin MCMC can be seen as an accelerated optimization method in the
space of measures in Kullback-Leibler divergence (Ma et al., 2019).
Similar to the case in the ULA, oftentimes ∇U (·) is expensive or impossible to compute
exactly, but rather an unbiased estimate of it can be obtained efficiently. The underdamped
Langevin MCMC with stochastic gradients is dubbed as Stochastic Gradient Hamiltonian
Monte Carlo (SGHMC) and is given as (Chen et al., 2014; Ma et al., 2015)
\[
V^\eta_{n+1} = V^\eta_n - \eta\big[\gamma V^\eta_n + H(\theta^\eta_n, X_{n+1})\big] + \sqrt{\tfrac{2\gamma\eta}{\beta}}\,\xi_{n+1}, \tag{6}
\]
\[
\theta^\eta_{n+1} = \theta^\eta_n + \eta V^\eta_n, \tag{7}
\]
where η > 0 is a step-size, V^\eta_0 = v_0, \theta^\eta_0 = \theta_0, and E[H(\theta, X_0)] = h(\theta) for every \theta \in
\mathbb{R}^d. We note that SGHMC relies on the Euler-Maruyama discretisation, which is one of the
possible discretisation methods that can be used. Alternative discretisation methods can
be considered and lead to different algorithms, e.g., see Cheng et al. (2018b); Li et al. (2019);
Zhang et al. (2023a).
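As an illustration of the recursion (6)–(7), a minimal sketch in Python is given below; the stochastic gradient H and the mini-batch source are hypothetical placeholders supplied by the user.

```python
import numpy as np

def sghmc_step(theta, v, stoch_grad, x_next, eta, gamma, beta, rng):
    """One SGHMC step implementing (6)-(7): momentum update with the stochastic
    gradient H(theta, x_next), then position update with the current momentum."""
    xi = rng.standard_normal(v.shape)
    v_new = v - eta * (gamma * v + stoch_grad(theta, x_next)) + np.sqrt(2.0 * gamma * eta / beta) * xi
    theta_new = theta + eta * v   # position update (7) uses the momentum before the update
    return theta_new, v_new

# Toy usage with a placeholder stochastic gradient H(theta, x) = theta + x (illustrative only).
rng = np.random.default_rng(1)
theta, v = np.zeros(2), np.zeros(2)
for _ in range(1000):
    x_next = 0.1 * rng.standard_normal(2)   # hypothetical mini-batch noise
    theta, v = sghmc_step(theta, v, lambda th, x: th + x, x_next,
                          eta=0.01, gamma=1.0, beta=1.0, rng=rng)
```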
In this paper, we analyze recursions (6)–(7). We achieve convergence bounds and im-
prove the ones proved in Gao et al. (2022) and Chau and Rásonyi (2022) (see Section 2.1
for a direct comparison).
Notation. For an integer d ≥ 1, the Borel sigma-algebra of Rd is denoted by B(Rd ).
We denote the dot product with ⟨·, ·⟩ while | · | denotes the associated norm. The set of
probability measures defined on a measurable space (Rd , B(Rd )) is denoted as P(Rd ). For
an Rd -valued random variable, L(X) and E[X] are used to denote its law and its expectation
respectively. Note that we also write E[X] as EX. For µ, ν ∈ P(R^d), let C(µ, ν) denote the
set of probability measures Γ on B(R^{2d}) whose marginals are µ and ν. Finally, for µ, ν ∈ P(R^d),
the Wasserstein distance of order p ≥ 1 is defined as
\[
W_p(\mu, \nu) := \inf_{\Gamma \in C(\mu, \nu)} \bigg( \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |\theta - \theta'|^p\, \Gamma(d\theta, d\theta') \bigg)^{1/p}. \tag{8}
\]
Assumption 2.1 The cost function U takes nonnegative values, i.e., U (θ) ≥ 0.
Next, the local smoothness assumptions on the stochastic gradient H are given.
Assumption 2.2 There exist positive constants L_1, L_2 and ρ such that, for all x, x' ∈ R^m
and θ, θ' ∈ R^d,
The following assumption states that the stochastic gradient is unbiased.
Assumption 2.3 The process (Xn )n∈N is i.i.d. with |X0 | ∈ L4(ρ+1) and |θ0 |, |v0 | ∈ L4 . It
satisfies
E[H(θ, X0 )] = h(θ).
The smallest eigenvalue of E[A(X0 )] is a positive real number a > 0 and E[b(X0 )] = b > 0.
Note that Assumption 2.4 is a local dissipativity condition. This assumption implies the
usual dissipativity property on the corresponding deterministic (full) gradient. In the next
remark, we motivate these assumptions for the case of linear regression loss.
Remark 2.2 We note that Assumptions 2.2 and 2.4 can be motivated even using the sim-
plest linear regression loss. Consider the problem
where (Z, Y ) ∼ P (dz, dy) on R × Rd . The stochastic gradient here is given by, for a sample
(zn , yn ) ∼ P (dz, dy),
where x_n = (z_n, y_n). This loss is not globally Lipschitz; however, we can prove that it is locally
Lipschitz. For example, it is easy to show that Assumption 2.2 is satisfied with L1 = 1,
L2 = 2, and ρ = 2. Moreover, it is easy to see that this stochastic gradient is not dissipative
uniformly in xn as
However, our Assumption 2.4 is satisfied with A(x_n) = y_n y_n^\top and b(x_n) = z_n^2. We would
also like to note that we do not need to assume that stochastic gradient moments are bounded
as it is a result of our assumptions.
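Although the precise form of the loss is omitted above, a standard choice consistent with this remark is the squared loss with stochastic gradient H(θ, x_n) = y_n(⟨y_n, θ⟩ − z_n); the sketch below is only illustrative of that assumed choice and of the role of A(x_n) = y_n y_n^⊤ and b(x_n) = z_n².

```python
import numpy as np

def stoch_grad_linear_regression(theta, z, y):
    """Illustrative stochastic gradient of the squared loss 0.5*(z - <theta, y>)^2."""
    return y * (np.dot(y, theta) - z)

# For this choice, <H(theta, x), theta> = <y, theta>^2 - z*<y, theta>
#                                      >= 0.5 * theta^T (y y^T) theta - 0.5 * z^2,
# which exhibits a random matrix A(x) = y y^T and a random scalar b(x) = z^2.
rng = np.random.default_rng(2)
theta = rng.standard_normal(3)
y, z = rng.standard_normal(3), rng.standard_normal()
lhs = np.dot(stoch_grad_linear_regression(theta, z, y), theta)
rhs = 0.5 * theta @ np.outer(y, y) @ theta - 0.5 * z**2
assert lhs >= rhs - 1e-12
```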
Remark 2.3 By Assumption 2.4, we obtain ⟨h(θ), θ⟩ ≥ a|θ|^2 − b, for θ ∈ R^d and a, b > 0.
Remark 2.4 Due to Assumption 2.1, Remark 2.1 and Remark 2.3, one observes that As-
sumption 2.1 in Eberle et al. (2019) is satisfied and thus, due to Corollary 2.6 in Eberle
et al. (2019), the underdamped Langevin SDE (3)–(4) has the unique invariant measure
given by (5).
Below, we state our main result about the convergence of the law L(θkη , Vkη ), which is gener-
ated by the SGHMC recursions (6)–(7), to the extended target measure π β in Wasserstein-2
(W2 ) distance. We first define
\[
\eta_{\max} = \min\Big\{1,\ \frac{1}{\gamma},\ \frac{\gamma\lambda}{2K_1},\ \frac{K_3}{K_2},\ \frac{\gamma\lambda}{2\bar{K}}\Big\},
\]
where K1 , K2 , K3 are constants explicitly given in the proof of Lemma 3.2 and K̄ is a
constant explicitly given in the proof of Lemma 3.4. Then, the following result is obtained.
Theorem 2.1 Let Assumptions 2.1–2.4 hold. Then, there exist constants C_1^\star, C_2^\star, C_3^\star, C_4^\star >
0 such that, for every 0 < η ≤ η_max,
\[
W_2\big(\mathcal{L}(\theta^\eta_n, V^\eta_n), \pi_\beta\big) \le C_1^\star\,\eta^{1/2} + C_2^\star\,\eta^{1/4} + C_3^\star\, e^{-C_4^\star \eta n}, \tag{11}
\]
where the constants C_1^\star, C_2^\star, C_3^\star, C_4^\star are explicitly provided in the Appendix.
Proof See Section 4.
Remark 2.5 We remark that C_1^\star = O(d^{1/2}), C_2^\star = O(e^d), C_3^\star = O(e^d), and C_4^\star = O(e^{-d}).
We note that although the dependence of C_1^\star on the dimension is O(d^{1/2}) and comes from our
main result, the dependence of C_2^\star, C_3^\star, and C_4^\star on the dimension may be exponential, as it is
an immediate consequence of the contraction result for the underdamped Langevin SDE in
Eberle et al. (2019).
The result in Theorem 2.1 demonstrates that the error scales like O(η^{1/4}) uniformly over n
and can therefore be made arbitrarily small by choosing η > 0 small enough. This
result is thus a significant improvement over the findings in Gao et al. (2022), where error
bounds grow with the number of iterations, and in Chau and Rásonyi (2022), where the
corresponding error bounds contain an additional term that is independent of η and relates
to the variance of the unbiased estimator.
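As a simple (and not necessarily optimal) illustration of how (11) can be used, given a prescribed accuracy ε > 0 it suffices to take
\[
\eta \le \min\Big\{\eta_{\max},\ \Big(\frac{\varepsilon}{3C_1^\star}\Big)^{2},\ \Big(\frac{\varepsilon}{3C_2^\star}\Big)^{4}\Big\},
\qquad
n \ge \frac{1}{C_4^\star\,\eta}\log\frac{3C_3^\star}{\varepsilon},
\]
so that each of the three terms on the right-hand side of (11) is at most ε/3 and hence W_2(\mathcal{L}(\theta^\eta_n, V^\eta_n), \pi_\beta) \le \varepsilon.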
Remark 2.6 Our proof techniques can be adapted easily when H(θ, x) = h(θ). Hence The-
orem 2.1 provides a novel convergence bound for the underdamped Langevin MCMC as well.
Our result is in W2 distance under dissipativity for the Euler-Maruyama discretisation of the
underdamped Langevin dynamics, which can be contrasted with existing work that focuses on
alternative discretisations. Beyond the log-concave case, Ma et al. (2019) analyses a second-
order discretisation of underdamped Langevin diffusion under the assumption that the target
satisfies a log-Sobolev inequality with a Lipschitz Hessian condition for the log-target. More
recently, Zhang et al. (2023a) removed the Lipschitz Hessian condition and established the
convergence of underdamped Langevin MCMC under log-Sobolev and Poincaré inequalities.
To the best of our knowledge, our result for the Euler-Maruyama discretisation of the un-
derdamped Langevin dynamics is the first nonasymptotic result under dissipativity that is
uniformly controlled by the step-size.
Let (θkη )k∈N be generated by the SGHMC algorithm. Convergence of the L(θkη ) to πβ in
W2 also implies that one can prove a global convergence result (Raginsky et al., 2017).
More precisely, assume that we aim at solving the problem θ? ∈ arg minθ∈Rd U (θ) which is
a nonconvex optimization problem. We denote U? := inf θ∈Rd U (θ). Then we can bound
the error E[U (θkη )] − U? which would give us a guarantee on the nonconvex optimization
problem. We state it as a formal result as follows.
Theorem 2.2 Under the assumptions of Theorem 2.1, we obtain
\[
E[U(\theta^\eta_n)] - U_\star \le \bar{C}_1^\star\,\eta^{1/2} + \bar{C}_2^\star\,\eta^{1/4} + \bar{C}_3^\star\, e^{-C_4^\star \eta n} + \frac{d}{2\beta}\log\bigg(\frac{eL_1}{a}\Big(\frac{b\beta}{d}+1\Big)\bigg),
\]
where \bar{C}_1^\star, \bar{C}_2^\star, \bar{C}_3^\star, C_4^\star, L_1 are finite constants which are explicitly given in the proofs.
Proof See Section 5.
This result bounds the error in terms of the function value for convergence to the global
minimum. Note that \bar{C}_1^\star, \bar{C}_2^\star and \bar{C}_3^\star have the same dependence on the dimension as C_1^\star, C_2^\star and
C_3^\star (see Remark 2.5).
Note that we analyze the SGHMC under similar assumptions to Zhang et al. (2023b)
who analyzed SGLD, since these assumptions allow for broader applicability of the results.
However, the SGHMC requires the analysis of the underdamped Langevin diffusion and
the discrete-time scheme on R2d and requires significantly different techniques compared to
the ones used in Zhang et al. (2023b). We show that the SGHMC can achieve a similar
performance to the SGLD, as shown in Zhang et al. (2023b). However, due to the
nonconvexity, we do not observe any improved dimension dependence as in the convex case,
and this constitutes a significant direction for future work.
3. Preliminaries
In this section, preliminary results which are essential for proving the main results are pro-
vided. A central idea behind the proof of Theorem 2.1 is the introduction of continuous-time
auxiliary processes whose marginals at chosen discrete times coincide with the marginals of
the (joint) law L(θkη , Vkη ). Hence, these auxiliary stochastic processes can be used to ana-
lyze the recursions (6)–(7). We first introduce the necessary lemmas about the moments of
the auxiliary processes defined in Section 3 and contraction rates of the underlying under-
damped Langevin diffusion. We give the lemmas and some explanatory remarks and defer
the proofs to Appendix C. We then provide the proofs of main theorems in the following
section.
Consider the scaled process (ζ^η_t, Z^η_t) := (θ_{ηt}, V_{ηt}) given (θ_t, V_t)_{t∈R_+} as in (3)–(4). We
next define
\[
dZ^\eta_t = -\eta\gamma Z^\eta_t\,dt - \eta h(\zeta^\eta_t)\,dt + \sqrt{2\gamma\eta\beta^{-1}}\,dB^\eta_t, \tag{12}
\]
\[
d\zeta^\eta_t = \eta Z^\eta_t\,dt, \tag{13}
\]
where η > 0 and B^η_t = \frac{1}{\sqrt{\eta}} B_{ηt}, where (B_s)_{s∈R_+} is a Brownian motion with natural filtration
F_t. We denote the natural filtration of (B^η_t)_{t∈R_+} as F^η_t. We note that F^η_t is independent of
G∞ ∨ σ(θ0 , v0 ). Next, we define the continuous-time interpolation of the SGHMC
\[
d\bar{V}^\eta_t = -\eta\big(\gamma \bar{V}^\eta_{\lfloor t \rfloor} + H(\bar\theta^\eta_{\lfloor t \rfloor}, X_{\lceil t \rceil})\big)\,dt + \sqrt{2\gamma\eta\beta^{-1}}\,dB^\eta_t, \tag{14}
\]
\[
d\bar\theta^\eta_t = \eta \bar{V}^\eta_{\lfloor t \rfloor}\,dt. \tag{15}
\]
The processes (14)–(15) mimic the recursions (6)–(7) at n ∈ N, i.e., \mathcal{L}(\theta^\eta_n, V^\eta_n) = \mathcal{L}(\bar\theta^\eta_n, \bar{V}^\eta_n).
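For instance, integrating (14)–(15) over [n, n+1] gives
\[
\bar{V}^\eta_{n+1} = \bar{V}^\eta_n - \eta\big[\gamma \bar{V}^\eta_n + H(\bar\theta^\eta_n, X_{n+1})\big] + \sqrt{2\gamma\eta\beta^{-1}}\,\big(B^\eta_{n+1} - B^\eta_n\big),
\qquad
\bar\theta^\eta_{n+1} = \bar\theta^\eta_n + \eta\bar{V}^\eta_n,
\]
where B^\eta_{n+1} - B^\eta_n = \eta^{-1/2}(B_{\eta(n+1)} - B_{\eta n}) \sim \mathcal{N}(0, I_d) plays the role of \xi_{n+1} in (6).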
Finally, we define the underdamped Langevin process (\hat\zeta^{s,u,v,\eta}_t, \hat{Z}^{s,u,v,\eta}_t) for s ≤ t
The process (\bar\zeta^{\eta,n}_t, \bar{Z}^{\eta,n}_t)_{t\ge nT} is an underdamped Langevin process started at time nT from
(\bar\theta^\eta_{nT}, \bar{V}^\eta_{nT}).
To achieve the convergence results, we first define a Lyapunov function, common in the
literature (Mattingly et al., 2002; Eberle et al., 2019), defined as
\[
\mathcal{V}(\theta, v) = \beta U(\theta) + \frac{\beta}{4}\Big(\gamma^2|\theta + \gamma^{-1}v|^2 + |\gamma^{-1}v|^2 - \lambda|\theta|^2\Big), \tag{18}
\]
where λ ∈ (0, 1/4]. This Lyapunov function plays an important role in obtaining uniform
moment estimates for some of the aforementioned processes. First we recall a result from
the literature about the second moments of the processes (θt , Vt )t≥0 in continuous-time.
Lemma 3.1 (Lemma 12(i) in Gao et al. (2022).) Let Assumptions 2.1–2.4 hold. Then
\[
\sup_{t\ge 0} E|\theta_t|^2 \le C^c_\theta := \frac{\int_{\mathbb{R}^{2d}} \mathcal{V}(\theta, v)\,\mu_0(d\theta, dv) + \frac{d+A_c}{\lambda}}{\frac{1}{8}(1-2\lambda)\beta\gamma^2}, \tag{19}
\]
\[
\sup_{t\ge 0} E|V_t|^2 \le C^c_v := \frac{\int_{\mathbb{R}^{2d}} \mathcal{V}(\theta, v)\,\mu_0(d\theta, dv) + \frac{d+A_c}{\lambda}}{\frac{1}{4}(1-2\lambda)\beta}. \tag{20}
\]
Proof See Gao et al. (2022).
Next, we obtain, uniform in time, second moment estimates for the discrete-time processes
(θkη )k≥0 and (Vkη )k≥0 .
Lemma 3.2 Let Assumptions 2.1–2.4 hold. Then, for 0 < η ≤ ηmax ,
\[
\sup_{k\ge 0} E|\theta^\eta_k|^2 \le C_\theta := \frac{\int_{\mathbb{R}^{2d}} \mathcal{V}(\theta, v)\,\mu_0(d\theta, dv) + \frac{4(A_c+d)}{\lambda}}{\frac{1}{8}(1-2\lambda)\beta\gamma^2},
\]
\[
\sup_{k\ge 0} E|V^\eta_k|^2 \le C_v := \frac{\int_{\mathbb{R}^{2d}} \mathcal{V}(\theta, v)\,\mu_0(d\theta, dv) + \frac{4(A_c+d)}{\lambda}}{\frac{1}{4}(1-2\lambda)\beta}.
\]
Proof See Appendix C.1.
Using Lemma 3.2, we can get the second moment bounds of our auxiliary process (\bar\zeta^{\eta,n}_t)_{t\ge 0}.
Lemma 3.3 Under the assumptions of Lemmas 3.1 and 3.2, we obtain
\[
\sup_{n\in\mathbb{N}}\ \sup_{t\in(nT,(n+1)T)} E|\bar\zeta^{\eta,n}_t|^2 \le C_\zeta := \frac{\int_{\mathbb{R}^{2d}} \mathcal{V}(\theta, v)\,\mu_0(d\theta, dv) + \frac{8(d+A_c)}{\lambda}}{\frac{1}{8}(1-2\lambda)\beta\gamma^2}. \tag{21}
\]
Proof See Appendix C.2.
Moreover, in order to obtain a bound on the distance between the laws \mathcal{L}(\bar\zeta^{\eta,n}_t, \bar{Z}^{\eta,n}_t) and
\mathcal{L}(\zeta^\eta_t, Z^\eta_t), we need to control E[\mathcal{V}^2(\theta^\eta_k, V^\eta_k)], which is established in the
following lemma.
Lemma 3.4 Let 0 < η ≤ ηmax and Assumptions 2.1–2.4 hold. Then, we have
\[
\sup_{k\in\mathbb{N}} E[\mathcal{V}^2(\theta^\eta_k, V^\eta_k)] \le E[\mathcal{V}^2(\theta_0, v_0)] + \frac{2D}{\gamma\lambda},
\]
where D = O(d^2) is a constant independent of η and provided explicitly in the proof.
Finally, we present a convergence result for the underdamped Langevin diffusion adapted
from Eberle et al. (2019). To this end, a functional for probability measures µ, ν on R2d is
introduced below
\[
\mathcal{W}_\rho(\mu, \nu) = \inf_{\Gamma\in C(\mu,\nu)} \int \rho\big((x, v), (x', v')\big)\,\Gamma\big(d(x, v), d(x', v')\big), \tag{22}
\]
where ρ is defined in eq. (2.10) in Eberle et al. (2019). Thus, in view of Remarks 2.1 and
2.3, one recovers the following result.
Theorem 3.1 (Eberle et al., 2019, Theorem 2.3 and Corollary 2.6) Let Assumptions 2.1–
2.4 hold and let (θ_t, V_t) and (θ'_t, V'_t) be underdamped Langevin SDEs started at
(θ_0, V_0) ∼ µ and (θ'_0, V'_0) ∼ ν, respectively. Then, there exist constants ċ, Ċ ∈ (0, ∞) such
that
\[
W_2\big(\mathcal{L}(\theta_t, V_t), \mathcal{L}(\theta'_t, V'_t)\big) \le \sqrt{\dot{C}}\, e^{-\dot{c}t/2}\, \sqrt{\mathcal{W}_\rho(\mu, \nu)}, \tag{23}
\]
where θ∞ ∼ πβ . The following proposition presents a bound for T1 under our assumptions.
Proposition 5.1 Under the assumptions of Theorem 2.1, we have,
\[
E[U(\theta^\eta_n)] - E[U(\theta_\infty)] \le \bar{C}_1^\star\,\eta^{1/2} + \bar{C}_2^\star\,\eta^{1/4} + \bar{C}_3^\star\, e^{-C_4^\star \eta n}, \tag{26}
\]
where \bar{C}_i^\star = C_i^\star(C_m \bar{L}_1 + h_0) for i = 1, 2, 3 and C_m^2 = \max(C^c_\theta, C_\theta).
6. Applications
In this section, we present two applications to machine learning. First, we show that the
SGHMC can be used to sample from the posterior probability measure in the context of
scalable Bayesian inference. We also note that our assumptions hold in a practical setting
of Bayesian logistic regression. Secondly, we provide an improved generalization bound for
empirical risk minimization.
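As an illustration of the scalable Bayesian inference setting (not an experiment from this paper), the sketch below applies the recursion (6)–(7) with a mini-batch estimator of the gradient of a Bayesian logistic regression potential U(θ) = −log p(θ) − Σ_i log p(y_i | x_i, θ); the data, prior precision, and hyperparameters are synthetic placeholders, and the mini-batch gradient is rescaled so that it is unbiased.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 1000, 5
X = rng.standard_normal((N, d))
theta_true = rng.standard_normal(d)
labels = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

def minibatch_grad_U(theta, idx, prior_prec=1.0):
    """Unbiased estimate of grad U(theta) for logistic regression with a Gaussian prior:
    grad U(theta) = prior_prec * theta - sum_i (y_i - sigmoid(x_i^T theta)) x_i,
    estimated by rescaling a uniformly drawn mini-batch sum by N / |batch|."""
    Xb, yb = X[idx], labels[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ theta))
    return prior_prec * theta - (N / len(idx)) * Xb.T @ (yb - p)

eta, gamma, beta, batch = 1e-3, 1.0, 1.0, 32
theta, v = np.zeros(d), np.zeros(d)
samples = []
for k in range(5000):
    idx = rng.choice(N, size=batch, replace=False)
    xi = rng.standard_normal(d)
    v_new = v - eta * (gamma * v + minibatch_grad_U(theta, idx)) + np.sqrt(2 * gamma * eta / beta) * xi
    theta = theta + eta * v   # position update (7) uses the momentum before the update
    v = v_new
    samples.append(theta.copy())
print(np.mean(samples[1000:], axis=0))  # crude posterior mean estimate
```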
Applying the result of Theorem 2.2, one can get a convergence guarantee on E[U(\theta^\eta_k)] - U_\star.
However, this does not account for the so-called generalization error. Note that one can
see the cost function in (27) as an empirical risk (expectation) minimization problem, where
the risk is given by \mathcal{U}(\theta) := \int f(\theta, z)\,P(dz) = E[f(\theta, Z)] with Z ∼ P(dz), and P is an unknown
probability measure from which the real-world data is sampled. Therefore, in order to bound
the generalization error, one needs to bound the error E[\mathcal{U}(\theta^\eta_n)] - U_\star. The generalization
error can be decomposed as
\[
E[\mathcal{U}(\theta^\eta_n)] - U_\star = \underbrace{E[\mathcal{U}(\theta^\eta_n)] - E[\mathcal{U}(\theta_\infty)]}_{B_1} + \underbrace{E[\mathcal{U}(\theta_\infty)] - E[U(\theta_\infty)]}_{B_2} + \underbrace{E[U(\theta_\infty)] - U_\star}_{B_3}.
\]
In what follows, we present a series of results bounding the terms B1 , B2 , B3 . By using the
results about Gibbs distributions presented in Raginsky et al. (2017), one can bound B1 as
follows.
Proposition 6.1 Under the assumptions of Theorem 2.1, we obtain
\[
E[\mathcal{U}(\theta^\eta_n)] - E[\mathcal{U}(\theta_\infty)] \le \bar{C}_1^\star\,\eta^{1/2} + \bar{C}_2^\star\,\eta^{1/4} + \bar{C}_3^\star\, e^{-C_4^\star \eta n},
\]
where \bar{C}_i^\star = C_i^\star(C_m \bar{L}_1 + h_0) for i = 1, 2, 3 and C_m^2 = \max(C^c_\theta, C_\theta).
The proof of Proposition 6.1 is similar to the proof of Proposition 5.1 and indeed the rates
match.
Next, we seek a bound for the term B2 . In order to prove the following result, we assume
that Assumption 2.2 and Assumption 2.4 hold uniformly in x, as required by, e.g., Raginsky
et al. (2017).
Proposition 6.2 (Raginsky et al., 2017) Assume that Assumptions 2.1, 2.3 hold and As-
sumptions 2.2 and 2.4 hold uniformly in x, i.e., |H(θ, x)| ≤ L'_1|θ|^2 + B_1. Then,
\[
E[\mathcal{U}(\theta_\infty)] - E[U(\theta_\infty)] \le \frac{4\beta c_{LS}}{M}\Big(\frac{L'_1}{a}(b + d/\beta) + B_1\Big),
\]
where c_{LS} is the constant of the logarithmic Sobolev inequality.
Finally, let Θ_\star ∈ \arg\min_{θ∈\mathbb{R}^d} \mathcal{U}(θ). We note that B_3 is bounded trivially as
\[
E[U(\theta_\infty)] - U_\star = E[U(\theta_\infty) - U_\star] + E[U_\star - U(\Theta_\star)] \le E[U(\theta_\infty) - U_\star], \tag{28}
\]
which follows from the proof of Proposition 5.2. Finally, Proposition 6.1, Proposition 6.2
and (28) lead to the following generalization bound, presented as a corollary.
Corollary 6.2 Under the setting of Proposition 6.2, we obtain the generalization bound for
the SGHMC,
\[
E[\mathcal{U}(\theta^\eta_n)] - U_\star \le \bar{C}_1^\star\,\eta^{1/2} + \bar{C}_2^\star\,\eta^{1/4} + \bar{C}_3^\star\, e^{-C_4^\star \eta n} + \frac{4\beta c_{LS}}{M}\Big(\frac{L'_1}{a}(b + d/\beta) + B_1\Big) + \frac{d}{2\beta}\log\bigg(\frac{eL_1}{a}\Big(\frac{b\beta}{d}+1\Big)\bigg). \tag{29}
\]
We note that this generalization bound improves that of Raginsky et al. (2017); Gao et al.
(2022); Chau and Rásonyi (2022) due to our improved W2 bound which is reflected in
Theorem 2.2 and, consequently, Proposition 6.1. In particular, while the generalization
bounds of Raginsky et al. (2017) and Gao et al. (2022) grow with the number of iterations
and require careful tuning between the step-size and the number of iterations, our bound
decreases with the number of iterations n. We also note that our bound improves that of
Chau and Rásonyi (2022), similar to the W2 bound.
7. Conclusions
We have analyzed the convergence of the SGHMC recursions (6)–(7) to the extended target
measure π β in Wasserstein-2 distance which implies the convergence of the law of the iter-
ates L(θnη ) to the target measure πβ in W2 . We have proved that the error bound scales like
O(η^{1/4}), where η is the step-size. This significantly improves the existing bounds for the SGHMC,
which either grow with the number of iterations or include constants that cannot
be made to vanish by decreasing the step-size η. This bound on sampling from π_β enables
us to prove a stochastic global optimization result when (θnη )n∈N is viewed as an output
of a nonconvex optimizer. We have shown that our results provide convergence rates for
scalable Bayesian inference and we have particularized our results to the Bayesian logistic
regression. Moreover, we have shown that our improvement of the W_2 bounds is reflected in
improved generalization bounds for the SGHMC.
Acknowledgments
Ö. D. A. is supported by the Lloyd’s Register Foundation Data Centric Engineering Pro-
gramme and EPSRC Programme Grant EP/R034710/1 (CoSInES). S.S. acknowledges sup-
port by the Alan Turing Institute under the EPSRC grant EP/N510129/1.
References
Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via
stochastic gradient Fisher scoring. In 29th International Conference on Machine Learn-
ing, ICML 2012, pages 1591–1598, 2012.
Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang.
Towards a theory of non-log-concave sampling: first-order stationarity guarantees for
Langevin Monte Carlo. In Conference on Learning Theory, pages 2896–2923. PMLR,
2022.
Mathias Barkhagen, Ngoc Huy Chau, Éric Moulines, Miklós Rásonyi, Sotirios Sabanis, and
Ying Zhang. On stochastic gradient Langevin dynamics with dependent data streams in
the logconcave case. Bernoulli, 27(1):1–33, 2021.
Nicolas Brosse, Alain Durmus, and Eric Moulines. The promises and pitfalls of stochastic
gradient Langevin dynamics. In Advances in Neural Information Processing Systems,
pages 8268–8278, 2018.
Nicolas Brosse, Alain Durmus, Éric Moulines, and Sotirios Sabanis. The tamed unad-
justed Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–
3663, 2019.
Huy N Chau and Miklós Rásonyi. Stochastic gradient Hamiltonian Monte Carlo for non-
convex learning. Stochastic Processes and their Applications, 149:341–368, 2022.
Huy N Chau, Chaman Kumar, Miklós Rásonyi, and Sotirios Sabanis. On fixed gain recursive
estimators with discontinuity in the parameters. ESAIM: Probability and Statistics, 23:
217–244, 2019.
Ngoc Huy Chau, Éric Moulines, Miklos Rásonyi, Sotirios Sabanis, and Ying Zhang. On
stochastic gradient Langevin dynamics with dependent data streams: The fully nonconvex
case. SIAM Journal on Mathematics of Data Science, 3(3):959–986, 2021.
Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte
Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I
Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv
preprint arXiv:1805.01648, 2018a.
Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped
Langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, pages
300–323, 2018b.
Sinho Chewi, Murat A Erdogdu, Mufan Li, Ruoqi Shen, and Shunshi Zhang. Analysis of
Langevin Monte Carlo from Poincaré to log-Sobolev. In Conference on Learning Theory,
pages 1–2. PMLR, 2022.
Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-
concave densities. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology), 79(3):651–676, 2017.
Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte
Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):
5278–5311, 2019.
Arnak S Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using
kinetic Langevin diffusions. arXiv preprint arXiv:1807.09382, 2018.
Alain Durmus, Eric Moulines, et al. Nonasymptotic convergence analysis for the unadjusted
Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
Alain Durmus, Eric Moulines, et al. High-dimensional Bayesian inference via the unadjusted
Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative con-
traction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.
Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo:
The interplay between tail growth and smoothness. In Conference on Learning Theory,
pages 1776–1822. PMLR, 2021.
Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization
with discretized diffusions. In Advances in Neural Information Processing Systems, pages
9671–9680, 2018.
Murat A Erdogdu, Rasa Hosseinzadeh, and Shunshi Zhang. Convergence of Langevin Monte
Carlo in chi-squared and rényi divergence. In International Conference on Artificial
Intelligence and Statistics, pages 8151–8175. PMLR, 2022.
Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic
gradient Hamiltonian Monte Carlo for nonconvex stochastic optimization: nonasymptotic
performance bounds and momentum-based acceleration. Operations Research, 70(5):
2931–2947, 2022.
Saul B Gelfand and Sanjoy K Mitter. Recursive stochastic algorithms for global optimization
in Rd . SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.
Xuechen Li, Yi Wu, Lester Mackey, and Murat A Erdogdu. Stochastic Runge-Kutta accel-
erates Langevin Monte Carlo and beyond. Advances in neural information processing
systems, 32, 2019.
Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC.
In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, and
Michael I Jordan. Is There an Analog of Nesterov Acceleration for MCMC? arXiv
preprint arXiv:1902.00996, 2019.
Mateusz B Majka, Aleksandar Mijatović, and Lukasz Szpruch. Non-asymptotic bounds for
sampling algorithms without log-concavity. arXiv preprint arXiv:1808.07105, 2018.
Jonathan C Mattingly, Andrew M Stuart, and Desmond J Higham. Ergodicity for SDEs and
approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes
and their applications, 101(2):185–232, 2002.
Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. Improved
bounds for discretization of Langevin diffusions: Near-optimal rates without convexity.
Bernoulli, 28(3):1577–1601, 2022.
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via
Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. In Conference on
Learning Theory, pages 1674–1703, 2017.
Gareth O Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algo-
rithms. Methodology and computing in applied probability, 4(4):337–357, 2002.
S. Sabanis and Y. Zhang. Higher order Langevin Monte Carlo algorithm. Electronic Journal
of Statistics, 13(2):3805–3850, 2019.
Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin
algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32,
2019.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics.
In Proceedings of the 28th International Conference on Machine Learning, pages 681–688,
2011.
Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dy-
namics based algorithms for nonconvex optimization. In Advances in Neural Information
Processing Systems, pages 3122–3133, 2018.
Shunshi Zhang, Sinho Chewi, Mufan Li, Krishna Balasubramanian, and Murat A Erdogdu.
Improved discretization analysis for underdamped Langevin Monte Carlo. In The Thirty
Sixth Annual Conference on Learning Theory, pages 36–71. PMLR, 2023a.
Ying Zhang, Ömer Deniz Akyildiz, Theodoros Damoulas, and Sotirios Sabanis. Nonasymp-
totic estimates for stochastic gradient Langevin dynamics under local conditions in non-
convex optimization. Applied Mathematics & Optimization, 87(2):25, 2023b.
Difan Zou, Pan Xu, and Quanquan Gu. Stochastic gradient Hamiltonian Monte Carlo
methods with recursive variance reduction. In Advances in Neural Information Processing
Systems, pages 3830–3841, 2019.
Appendix
Appendix A. Constants in Theorem 3.1
The constants of Theorem 3.1 are given as follows (Eberle et al., 2019):
\[
\dot{C} = 2 e^{2+\Lambda} \max\bigg\{1,\ 4(1+2\alpha+2\alpha^2)\,\frac{(1+\gamma)^2 (d + A_c)\,\gamma^{-1}\dot{c}^{-1}}{\min(1,\alpha)^2 \min(1, R_1)\,\beta}\bigg\}
\]
and
\[
\dot{c} = \frac{\gamma}{384} \min\Big\{\lambda L_1 \beta^{-1}\gamma^{-2},\ \Lambda^{1/2} e^{-\Lambda} L_1 \beta^{-1}\gamma^{-2},\ \Lambda^{1/2} e^{-\Lambda}\Big\},
\]
where
\[
\Lambda = L_1 R_1^2/8 = \frac{12}{5}(1+2\alpha+2\alpha^2)(d + A_c) L_1 \beta^{-1}\gamma^{-2}\lambda^{-1}(1-2\lambda)^{-1},
\]
and α ∈ (0, ∞).
\[
\frac{a}{3}|\theta|^2 - \frac{b}{2}\log 3 \le U(\theta) \le u_0 + \frac{\bar{L}_1}{2}|\theta|^2 + h_0|\theta|,
\]
where u_0 = U(0) and \bar{L}_1 = L_1 E[(1 + |X_0|)^\rho].
Lemma B.2 Let G, H ⊂ F be sigma-algebras. Consider two Rd -valued random vectors, denoted X, Y , in
Lp with p ≥ 1 such that Y is measurable w.r.t. H ∨ G. Then,
Lemma B.3 Let Assumption 2.1, 2.3, 2.2 and 2.4 hold. For any k = 1, . . . , K + 1 where K + 1 ≤ T , we
obtain
\[
\sup_{n\in\mathbb{N}}\ \sup_{t\in[nT,(n+1)T]} E\Big[\big|h(\bar\zeta^{\eta,n}_t) - H(\bar\zeta^{\eta,n}_t, X_{nT+k})\big|^2\Big] \le \sigma_H,
\]
where
\[
\sigma_H := 8L_2^2\,\sigma_Z\,(1 + C_\zeta) < \infty,
\]
where
Moreover,
\[
E\big|\bar\theta^\eta_{\lfloor t\rfloor} - \bar\theta^\eta_t\big|^2 \le \eta^2 C_v.
\]
Lemma B.5 There exist constants A_c ∈ (0, ∞) and λ ∈ (0, 1/4] such that
\[
\langle\theta, h(\theta)\rangle \ge 2\lambda\Big(U(\theta) + \frac{\gamma^2}{4}|\theta|^2\Big) - \frac{2A_c}{\beta}
\]
for all θ ∈ R^d.
where (ξk )k∈N is a sequence of i.i.d. standard Normal random variables. Consequently, we have the equality
\begin{align*}
E\big|V^\eta_{k+1}\big|^2 &= E\big|(1-\gamma\eta)V^\eta_k - \eta H(\theta^\eta_k, X_{k+1})\big|^2 + 2\gamma\eta\beta^{-1}d, \\
&= (1-\gamma\eta)^2 E\big|V^\eta_k\big|^2 - 2\eta(1-\gamma\eta)E\big[\langle V^\eta_k, h(\theta^\eta_k)\rangle\big] + \eta^2 E\big|H(\theta^\eta_k, X_{k+1})\big|^2 + 2\gamma\eta\beta^{-1}d,
\end{align*}
where
\[
\widetilde{L}_1 = 2L_1^2 C_\rho \quad\text{and}\quad \widetilde{C}_1 = 4L_2^2 C_\rho + 4H_0^2. \tag{33}
\]
which suggests
\begin{align*}
U(\theta^\eta_{k+1}) - U(\theta^\eta_k) - \langle h(\theta^\eta_k), \eta V^\eta_k\rangle &= \int_0^1 \langle h(\theta^\eta_k + \tau\eta V^\eta_k) - h(\theta^\eta_k), \eta V^\eta_k\rangle\, d\tau, \\
&\le \int_0^1 \big|h(\theta^\eta_k + \tau\eta V^\eta_k) - h(\theta^\eta_k)\big|\,\big|\eta V^\eta_k\big|\, d\tau, \\
&\le \frac{1}{2} L_1 C_\rho \eta^2 |V^\eta_k|^2,
\end{align*}
where the second line follows from the Cauchy-Schwarz inequality and the final line follows from (9). Finally
we obtain
\[
E[U(\theta^\eta_{k+1})] - E[U(\theta^\eta_k)] \le \eta E\langle h(\theta^\eta_k), V^\eta_k\rangle + \frac{1}{2}L_1 C_\rho \eta^2 E|V^\eta_k|^2. \tag{35}
\]
Next, we continue computing
\begin{align*}
E\big|\theta^\eta_{k+1} + \gamma^{-1}V^\eta_{k+1}\big|^2 &= E\big|\theta^\eta_k + \gamma^{-1}V^\eta_k - \eta\gamma^{-1}H(\theta^\eta_k, X_{k+1})\big|^2 + 2\gamma^{-1}\beta^{-1}\eta d, \\
&= E\big|\theta^\eta_k + \gamma^{-1}V^\eta_k\big|^2 - 2\eta\gamma^{-1} E\langle\theta^\eta_k + \gamma^{-1}V^\eta_k, h(\theta^\eta_k)\rangle \\
&\quad + \eta^2\gamma^{-2} E\big|H(\theta^\eta_k, X_{k+1})\big|^2 + 2\gamma^{-1}\eta\beta^{-1}d, \\
&\le E\big|\theta^\eta_k + \gamma^{-1}V^\eta_k\big|^2 - 2\eta\gamma^{-1} E\langle\theta^\eta_k + \gamma^{-1}V^\eta_k, h(\theta^\eta_k)\rangle \\
&\quad + \eta^2\gamma^{-2}\big(\widetilde{L}_1 E|\theta^\eta_k|^2 + \widetilde{C}_1\big) + 2\gamma^{-1}\eta\beta^{-1}d, \tag{36}
\end{align*}
where \widetilde{L}_1 and \widetilde{C}_1 are defined as in (33). Next, combining (32), (34), (35), (36),
\begin{align*}
&M_2(k+1) - M_2(k) \\
&= E\big[U(\theta^\eta_{k+1})\big] - E\big[U(\theta^\eta_k)\big] + \frac{\gamma^2}{4}\Big(E\big|\theta^\eta_{k+1} + \gamma^{-1}V^\eta_{k+1}\big|^2 - E\big|\theta^\eta_k + \gamma^{-1}V^\eta_k\big|^2\Big) \\
&\quad + \frac{1}{4}\Big(E\big|V^\eta_{k+1}\big|^2 - E|V^\eta_k|^2\Big) - \frac{\gamma^2\lambda}{4}\Big(E\big|\theta^\eta_{k+1}\big|^2 - E|\theta^\eta_k|^2\Big), \\
&\le \eta E\langle h(\theta^\eta_k), V^\eta_k\rangle + \frac{L_1 C_\rho\eta^2}{2} E|V^\eta_k|^2 \\
&\quad + \frac{\gamma^2}{4}\Big(-2\eta\gamma^{-1}E\langle\theta^\eta_k + \gamma^{-1}V^\eta_k, h(\theta^\eta_k)\rangle + \eta^2\gamma^{-2}\big(\widetilde{L}_1E|\theta^\eta_k|^2 + \widetilde{C}_1\big) + 2\gamma^{-1}\beta^{-1}\eta d\Big) \\
&\quad + \frac{1}{4}\Big((-2\gamma\eta + \gamma^2\eta^2)E|V^\eta_k|^2 - 2\eta(1-\gamma\eta)E\langle V^\eta_k, h(\theta^\eta_k)\rangle + \eta^2\big(\widetilde{L}_1E|\theta^\eta_k|^2 + \widetilde{C}_1\big) + 2\gamma\eta\beta^{-1}d\Big) \\
&\quad - \frac{\gamma^2\lambda}{4}\Big(2\eta E\langle\theta^\eta_k, V^\eta_k\rangle + \eta^2 E|V^\eta_k|^2\Big), \\
&= -\frac{\eta\gamma}{2}E\langle\theta^\eta_k, h(\theta^\eta_k)\rangle + \frac{\gamma\eta^2}{2}E\langle h(\theta^\eta_k), V^\eta_k\rangle + \Big(\frac{L_1C_\rho\eta^2}{2} + \frac{\eta^2\gamma^2}{4} - \frac{\gamma\eta}{2} - \frac{\gamma^2\eta^2\lambda}{4}\Big)E|V^\eta_k|^2 \\
&\quad + \frac{\eta^2\widetilde{L}_1}{2}E|\theta^\eta_k|^2 - \frac{\gamma^2\eta\lambda}{2}E\langle\theta^\eta_k, V^\eta_k\rangle + \frac{\widetilde{C}_1\eta^2}{2} + \gamma\eta\beta^{-1}d, \\
&\le -\eta\gamma\lambda\, E\big[U(\theta^\eta_k)\big] - \frac{\lambda\gamma^3\eta}{4}E|\theta^\eta_k|^2 + A_c\eta\gamma\beta^{-1} + \frac{\gamma\eta^2}{2}E\langle h(\theta^\eta_k), V^\eta_k\rangle + \frac{\eta^2\widetilde{L}_1}{2}E|\theta^\eta_k|^2 \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} + \frac{\eta^2\gamma^2}{4} - \frac{\gamma\eta}{2} - \frac{\gamma^2\eta^2\lambda}{4}\Big)E|V^\eta_k|^2 - \frac{\gamma^2\eta\lambda}{2}E\langle\theta^\eta_k, V^\eta_k\rangle + \frac{\widetilde{C}_1\eta^2}{2} + \gamma\eta\beta^{-1}d,
\end{align*}
where the last line is obtained using (30). Next, using the fact that 0 < λ ≤ 1/4 and the form of the
Lyapunov function (31), we obtain
\[
-\frac{\gamma}{2}E\langle\theta^\eta_k, V^\eta_k\rangle \le -M_2(k) + E U(\theta^\eta_k) + \frac{\gamma^2}{4}E|\theta^\eta_k|^2 + \frac{1}{2}E|V^\eta_k|^2.
\]
Using this, we can obtain
\begin{align*}
M_2(k+1) - M_2(k) &\le A_c\eta\gamma\beta^{-1} + \frac{\gamma\eta^2}{2}E\langle h(\theta^\eta_k), V^\eta_k\rangle + \frac{\eta^2\widetilde{L}_1}{2}E|\theta^\eta_k|^2 + \gamma\eta\beta^{-1}d \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} + \frac{\eta^2\gamma^2}{4} - \frac{\gamma\eta}{2} - \frac{\gamma^2\eta^2\lambda}{4} + \frac{\gamma\lambda\eta}{2}\Big)E|V^\eta_k|^2 + \frac{\widetilde{C}_1\eta^2}{2} - \gamma\lambda\eta M_2(k).
\end{align*}
\begin{align*}
M_2(k+1) &\le (1-\gamma\lambda\eta)M_2(k) + A_c\eta\gamma\beta^{-1} + \frac{\eta^2\widetilde{L}_1}{2}E|\theta^\eta_k|^2 + \frac{\widetilde{C}_1\eta^2}{2} + \gamma\eta\beta^{-1}d \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} + \frac{\eta^2\gamma^2}{4} - \frac{\gamma\eta}{2} - \frac{\gamma^2\eta^2\lambda}{4} + \frac{\gamma\lambda\eta}{2} + \frac{\gamma\eta^2}{4}\Big)E|V^\eta_k|^2 + \frac{\gamma\eta^2}{4}E|h(\theta^\eta_k)|^2, \\
&\le (1-\gamma\lambda\eta)M_2(k) + A_c\eta\gamma\beta^{-1} + \eta^2\Big(\frac{\widetilde{L}_1}{2} + \frac{\gamma}{2}L_1^2C_\rho^2\Big)E|\theta^\eta_k|^2 + \frac{\widetilde{C}_1\eta^2}{2} + \gamma\eta\beta^{-1}d \\
&\quad + \eta^2\Big(\frac{L_1C_\rho}{2} + \frac{\gamma^2}{4} - \frac{\gamma^2\lambda}{4} + \frac{\gamma}{4}\Big)E|V^\eta_k|^2 + \frac{\gamma\eta^2 h_0^2}{2},
\end{align*}
where the last inequality follows since λ ≤ 1/4 and (10). We note that
\[
\mathcal{V}(\theta, v) \ge \max\Big\{\frac{1}{8}(1-2\lambda)\beta\gamma^2|\theta|^2,\ \frac{\beta}{4}(1-2\lambda)|v|^2\Big\},
\]
which implies by the definition of M_2(k) that
\begin{align*}
M_2(k) &\ge \max\Big\{\frac{1}{8}(1-2\lambda)\gamma^2 E|\theta^\eta_k|^2,\ \frac{1}{4}(1-2\lambda)E|V^\eta_k|^2\Big\}, \\
&\ge \frac{1}{16}(1-2\lambda)\gamma^2 E|\theta^\eta_k|^2 + \frac{1}{8}(1-2\lambda)E|V^\eta_k|^2, \tag{37}
\end{align*}
since \max\{x, y\} \ge (x + y)/2 for any x, y > 0. Therefore, we obtain
\[
M_2(k+1) \le \big(1 - \gamma\lambda\eta + K_1\eta^2\big)M_2(k) + K_2\eta^2 + K_3\eta,
\]
where
\[
K_1 := \max\Bigg\{\frac{\frac{L_1 C_\rho}{2} + \frac{\gamma^2}{4} - \frac{\gamma^2\lambda}{4} + \frac{\gamma}{4}}{\frac{1}{8}(1-2\lambda)},\ \frac{\frac{\widetilde{L}_1}{2} + \frac{\gamma}{2}L_1^2 C_\rho^2}{\frac{1}{16}(1-2\lambda)\gamma^2}\Bigg\}
\]
and
\[
K_2 = \frac{\widetilde{C}_1 + \gamma h_0^2}{2} \quad\text{and}\quad K_3 = (A_c + d)\gamma\beta^{-1}.
\]
For 0 < \eta \le \min\Big\{\frac{K_3}{K_2}, \frac{\gamma\lambda}{2K_1}, \frac{2}{\gamma\lambda}\Big\}, we obtain
\[
M_2(k+1) \le \Big(1 - \frac{\gamma\lambda\eta}{2}\Big)M_2(k) + 2K_3\eta,
\]
which implies
\[
M_2(k) \le M_2(0) + \frac{4}{\gamma\lambda}K_3.
\]
Combining this with (37) gives the result.
which suggests
\begin{align*}
U(\theta^\eta_{k+1}) - U(\theta^\eta_k) - \langle h(\theta^\eta_k), \eta V^\eta_k\rangle &= \int_0^1 \langle h(\theta^\eta_k + \tau\eta V^\eta_k) - h(\theta^\eta_k), \eta V^\eta_k\rangle\, d\tau, \\
&\le \int_0^1 \big|h(\theta^\eta_k + \tau\eta V^\eta_k) - h(\theta^\eta_k)\big|\,\big|\eta V^\eta_k\big|\, d\tau, \\
&\le \frac{1}{2} L_1 C_\rho \eta^2 |V^\eta_k|^2,
\end{align*}
where, similarly as in the previous proof, the second line follows from the Cauchy-Schwarz inequality and
the final line follows from (9). Finally we arrive at
\[
U(\theta^\eta_{k+1}) - U(\theta^\eta_k) \le \eta\langle h(\theta^\eta_k), V^\eta_k\rangle + \frac{1}{2}L_1 C_\rho \eta^2 |V^\eta_k|^2. \tag{42}
\]
Next, we note that \Delta^2_k = \theta^\eta_k + \gamma^{-1}V^\eta_k - \eta\gamma^{-1}H(\theta^\eta_k, X_{k+1}) and continue computing
\begin{align*}
\big|\theta^\eta_{k+1} + \gamma^{-1}V^\eta_{k+1}\big|^2 &= \big|\theta^\eta_k + \gamma^{-1}V^\eta_k - \eta\gamma^{-1}H(\theta^\eta_k, X_{k+1})\big|^2 + 2\sqrt{2\gamma^{-1}\beta^{-1}\eta}\,\langle\Delta^2_k, \xi_{k+1}\rangle + 2\gamma^{-1}\beta^{-1}\eta|\xi_{k+1}|^2, \\
&= \big|\theta^\eta_k + \gamma^{-1}V^\eta_k\big|^2 - 2\eta\gamma^{-1}\langle\theta^\eta_k + \gamma^{-1}V^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle \\
&\quad + 2\sqrt{2\gamma^{-1}\beta^{-1}\eta}\,\langle\Delta^2_k, \xi_{k+1}\rangle + \eta^2\gamma^{-2}|H(\theta^\eta_k, X_{k+1})|^2 + 2\gamma^{-1}\eta\beta^{-1}|\xi_{k+1}|^2, \tag{43}
\end{align*}
Writing \mathcal{V}_k := \mathcal{V}(\theta^\eta_k, V^\eta_k), we have
\begin{align*}
\frac{\mathcal{V}_{k+1} - \mathcal{V}_k}{\beta} &= U(\theta^\eta_{k+1}) - U(\theta^\eta_k) + \frac{\gamma^2}{4}\Big(\big|\theta^\eta_{k+1} + \gamma^{-1}V^\eta_{k+1}\big|^2 - \big|\theta^\eta_k + \gamma^{-1}V^\eta_k\big|^2\Big) \\
&\quad + \frac{1}{4}\Big(\big|V^\eta_{k+1}\big|^2 - |V^\eta_k|^2\Big) - \frac{\gamma^2\lambda}{4}\Big(\big|\theta^\eta_{k+1}\big|^2 - |\theta^\eta_k|^2\Big), \\
&\le \eta\langle h(\theta^\eta_k), V^\eta_k\rangle + \frac{1}{2}L_1C_\rho\eta^2|V^\eta_k|^2 + \frac{\gamma^2}{4}\Big(-2\eta\gamma^{-1}\langle\theta^\eta_k + \gamma^{-1}V^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle \\
&\qquad + \eta^2\gamma^{-2}|H(\theta^\eta_k, X_{k+1})|^2 + 2\gamma^{-1}\eta\beta^{-1}|\xi_{k+1}|^2\Big) \\
&\quad + \frac{1}{4}\Big((-2\gamma\eta + \gamma^2\eta^2)|V^\eta_k|^2 - 2\eta(1-\gamma\eta)\langle V^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle + \eta^2|H(\theta^\eta_k, X_{k+1})|^2 + 2\gamma\eta\beta^{-1}|\xi_{k+1}|^2\Big) \\
&\quad - \frac{\gamma^2\lambda}{4}\Big(2\eta\langle\theta^\eta_k, V^\eta_k\rangle + \eta^2|V^\eta_k|^2\Big) + \Sigma_k, \\
&\le -\frac{\eta\gamma}{2}\langle\theta^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle + \eta\langle h(\theta^\eta_k), V^\eta_k\rangle + \Big(\frac{L_1C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4}\Big)|V^\eta_k|^2 + \frac{\eta^2}{2}|H(\theta^\eta_k, X_{k+1})|^2 \\
&\quad + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Big(\frac{\gamma\eta^2}{2} - \eta\Big)\langle V^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle - \frac{\gamma^2\lambda\eta}{2}\langle\theta^\eta_k, V^\eta_k\rangle + \Sigma_k,
\end{align*}
where
\[
\Sigma_k = \frac{\gamma^2}{2}\sqrt{2\gamma^{-1}\beta^{-1}\eta}\,\langle\Delta^2_k, \xi_{k+1}\rangle + \frac{1}{2}\sqrt{2\eta\gamma\beta^{-1}}\,\langle\Delta^1_k, \xi_{k+1}\rangle.
\]
Next, using the fact that 0 < λ ≤ 1/4 and the form of the Lyapunov function (39), we obtain
\[
-\frac{\gamma}{2}\langle\theta^\eta_k, V^\eta_k\rangle \le -\frac{\mathcal{V}_k}{\beta} + U(\theta^\eta_k) + \frac{\gamma^2}{4}|\theta^\eta_k|^2 + \frac{1}{2}|V^\eta_k|^2.
\]
\begin{align*}
\frac{\mathcal{V}_{k+1} - \mathcal{V}_k}{\beta} &\le -\frac{\eta\gamma}{2}\langle\theta^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle + \eta\langle h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1}), V^\eta_k\rangle \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4}\Big)|V^\eta_k|^2 + \frac{\eta^2}{2}|H(\theta^\eta_k, X_{k+1})|^2 + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 \\
&\quad + \frac{\gamma\eta^2}{2}\langle V^\eta_k, H(\theta^\eta_k, X_{k+1})\rangle - \gamma\lambda\eta\frac{\mathcal{V}_k}{\beta} + \gamma\lambda\eta\, U(\theta^\eta_k) + \frac{\gamma^3\lambda\eta}{4}|\theta^\eta_k|^2 + \frac{\gamma\lambda\eta}{2}|V^\eta_k|^2 + \Sigma_k. \tag{44}
\end{align*}
\begin{align*}
\frac{\mathcal{V}_{k+1}}{\beta} &\le (1-\gamma\lambda\eta)\frac{\mathcal{V}_k}{\beta} - \frac{\eta\gamma}{2}\langle\theta^\eta_k, h(\theta^\eta_k)\rangle + \frac{\eta\gamma}{2}\langle\theta^\eta_k, h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1})\rangle + \eta\langle h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1}), V^\eta_k\rangle \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2}\Big)|V^\eta_k|^2 + \frac{\eta^2}{2}|H(\theta^\eta_k, X_{k+1})|^2 + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 \\
&\quad + \gamma\lambda\eta\, U(\theta^\eta_k) + \frac{\gamma^3\lambda\eta}{4}|\theta^\eta_k|^2 + \Sigma_k. \tag{45}
\end{align*}
\begin{align*}
\frac{\mathcal{V}_{k+1}}{\beta} &\le (1-\gamma\lambda\eta)\frac{\mathcal{V}_k}{\beta} - \eta\gamma\lambda\, U(\theta^\eta_k) - \frac{\gamma^3\eta\lambda}{4}|\theta^\eta_k|^2 + \eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{2}\langle\theta^\eta_k, h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1})\rangle \\
&\quad + \eta\langle h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1}), V^\eta_k\rangle + \Big(\frac{L_1C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2}\Big)|V^\eta_k|^2 + \frac{\eta^2}{2}|H(\theta^\eta_k, X_{k+1})|^2 \\
&\quad + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \gamma\lambda\eta\, U(\theta^\eta_k) + \frac{\gamma^3\lambda\eta}{4}|\theta^\eta_k|^2 + \Sigma_k, \\
&= (1-\gamma\lambda\eta)\frac{\mathcal{V}_k}{\beta} + \eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{2}\langle\theta^\eta_k, h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1})\rangle + \eta\langle h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1}), V^\eta_k\rangle \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2}\Big)|V^\eta_k|^2 + \frac{\eta^2}{2}|H(\theta^\eta_k, X_{k+1})|^2 + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k, \\
&\le (1-\gamma\lambda\eta)\frac{\mathcal{V}_k}{\beta} + \eta\gamma A_c\beta^{-1} + \frac{\eta\gamma}{2}\langle\theta^\eta_k, h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1})\rangle + \eta\langle h(\theta^\eta_k) - H(\theta^\eta_k, X_{k+1}), V^\eta_k\rangle \\
&\quad + \Big(\frac{L_1C_\rho\eta^2}{2} - \frac{\gamma^2\lambda\eta^2}{4} - \frac{\gamma\eta}{2} + \frac{\gamma^2\eta^2}{4} + \frac{\gamma\lambda\eta}{2}\Big)|V^\eta_k|^2 \\
&\quad + \frac{\eta^2}{2}\Big(2L_1^2(1+|X_{k+1}|)^{2\rho}|\theta^\eta_k|^2 + 4L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 4H_0^2\Big) + \eta\gamma\beta^{-1}|\xi_{k+1}|^2 + \Sigma_k,
\end{align*}
by Remark 2.1. Let φ = (1 − γλη) and let Hk = Gk ∨ σ(ξ1 , . . . , ξk ). By using λ ≤ 1/4, we obtain
2
V2 L1 Cρ η 2 γ 2 λη 2 γ 2 η2
E[Vk+1 |Hk ] Vk
2
≤ φ2 k2 + 2φ ηγAc β −1 + − + |Vkη |2
β β β 2 4 4
η2
ηγ η
2L21 Cρ |θkη |2 + 4L22 Cρ + 4H02 + ηγβ −1 d + E ηγAc β −1 + hθ , h(θkη ) − H(θkη , Xk+1 )i
+
2 2 k
L1 Cρ η 2 γ 2 λη 2 γ 2 η2
γη γλη
+ ηhh(θkη ) − H(θkη , Xk+1 ), Vkη i + − − + + |Vkη |2
2 4 2 4 2
2 #
η2 2 2ρ η 2 2 2(ρ+1) 2
−1 2
+ 2L1 (1 + |Xk+1 |) |θk | + 4L2 (1 + |Xk+1 |) + 4H0 + ηγβ |ξk+1 | + Σk Hk .
2
where
\[
\widetilde{K}_1 := \max\Bigg\{\frac{\frac{L_1 C_\rho}{2} + \frac{\gamma^2}{4} - \frac{\gamma^2\lambda}{4}}{\frac{1}{8}(1-2\lambda)},\ \frac{L_1^2 C_\rho}{\frac{1}{16}(1-2\lambda)\gamma^2}\Bigg\}
\]
2
E[Vk+1 |Hk ] V2 Vk
≤ (φ2 + 2φK̃1 η 2 ) k2 + 2φ ηγAc β −1 + 2η 2 L2 Cρ + 2η 2 H02 + ηγβ −1 d
β 2 β β
h ηγ η 2 ηγ η
+ E ηγAc β −1 + |θk | + + |h(θkη ) − H(θkη , Xk+1 )|2
4 4 2
L1 Cρ η 2 γ 2 λη 2 γ 2 η2
γη γλη η
+ − − + + + |Vkη |2
2 4 2 4 2 2
2 #
η2 2 2ρ η 2 2 2(ρ+1) 2
−1 2
+ 2L1 (1 + |Xk+1 |) |θk | + 4L2 (1 + |Xk+1 |) + 4H0 + ηγβ |ξk+1 | + Σk Hk
2
V2 Vk
≤ (φ2 + 2φK̃1 η 2 ) k2 + 2φ ηγAc β −1 + 2η 2 L2 Cρ + 2η 2 H02 + ηγβ −1 d
β β
h ηγ η 2 ηγ
+ E ηγAc β −1 + |θk | + + η (|h(θkη )|2 + |H(θkη , Xk+1 )|2 )
4 2
L1 Cρ η 2 γ 2 λη 2 γ 2 η2
γη γλη η
+ − − + + + |Vkη |2
2 4 2 4 2 2
2 #
η2 2
+ 2L1 (1 + |Xk+1 |)2ρ |θkη |2 + 4L22 (1 + |Xk+1 |)2(ρ+1) + 4H02 + ηγβ −1 |ξk+1 |2 + Σk Hk
2
where we have used |a + b|2 ≤ 2|a|2 + 2|b|2 . Now using Remark 2.1, we obtain
2 2
E[Vk+1 |Hk ] 2 Vk Vk
2
ηγAc β −1 + 2η 2 L2 Cρ + 2η 2 H02 + ηγβ −1 d
≤ (φ + 2φ K̃ 1 η ) + 2φ
β2 β2 β
h ηγ η 2 ηγ
+ E ηγAc β −1 + |θk | + + η 2L21 Cρ2 |θkη |2 + 2h20 + 2L21 (1 + |Xk+1 |)2ρ |θkη |2
4 2
L C η2 γ 2 λη 2 γη γ 2 η2 γλη η
2 2(ρ+1) 2 1 ρ
+4L2 (1 + |Xk+1 |) + 4H0 + − − + + + |Vkη |2
2 4 2 4 2 2
2 #
η2 2
+ 2L1 (1 + |Xk+1 |)2ρ |θkη |2 + 4L22 (1 + |Xk+1 |)2(ρ+1) + 4H02 + ηγβ −1 |ξk+1 |2 + Σk Hk
2
V2 Vk
= (φ2 + 2φK̃1 η 2 ) k2 + 2φ ηγAc β −1 + 2η 2 L2 Cρ + 2η 2 H02 + ηγβ −1 d
β β
η2 2
ηγ ηγ
+ E ηγAc β −1 + + + η 2L21 Cρ2 + 2L21 (1 + |Xk+1 |)2ρ + 2L1 (1 + |Xk+1 |)2ρ |θkη |2
4 2 2
ηγ L C η2 γ λη 2
2
γη γ 2 η2 γλη η
1 ρ
+ + η 2h20 + 4L22 (1 + |Xk+1 |)2(ρ+1) + 4H02 + − − + + + |Vkη |2
2 2 4 2 4 2 2
2 #
η2 2
+ 4L2 (1 + |Xk+1 |)2(ρ+1) + 4H02 + ηγβ −1 |ξk+1 |2 + Σk Hk
2
Now taking the expectation, regrouping and simplifying the terms above, we finally arrive at
\begin{align*}
\frac{E[\mathcal{V}^2_{k+1}\,|\,\mathcal{H}_k]}{\beta^2} &\le (\phi^2 + 2\phi\widetilde{K}_1\eta^2)\frac{\mathcal{V}^2_k}{\beta^2} + 2\phi\frac{\mathcal{V}_k}{\beta^2}\eta\tilde{c}_1 + 2\phi\frac{\mathcal{V}_k}{\beta}\eta^2\hat{c}_1 + \eta^2\frac{\tilde{c}_2}{\beta^2} + \tilde{c}_3\eta^2|\theta^\eta_k|^4 + \hat{c}_3\eta^4|\theta^\eta_k|^4 \\
&\quad + \eta^2\tilde{c}_4|V^\eta_k|^4 + \eta^4\hat{c}_4|V^\eta_k|^4 + \eta^2\tilde{c}_5 + \eta^4\hat{c}_5 + \eta^2\frac{\tilde{c}_6}{\beta^2} + E[\Sigma^2_k\,|\,\mathcal{H}_k]. \tag{47}
\end{align*}
where
Next, recall
\[
\Sigma_k = \frac{\gamma^2}{2}\sqrt{2\gamma^{-1}\beta^{-1}\eta}\,\langle\Delta^2_k, \xi_{k+1}\rangle + \frac{1}{2}\sqrt{2\eta\gamma\beta^{-1}}\,\langle\Delta^1_k, \xi_{k+1}\rangle,
\]
where
Note that
\begin{align*}
&\le 2\eta\gamma\beta^{-1}|\xi_{k+1}|^2\Big(2(1-\eta\gamma)^2|V^\eta_k|^2 + 2|H(\theta^\eta_k, X_{k+1})|^2 + 3\gamma^2|\theta^\eta_k|^2 + 3|V^\eta_k|^2 + 3\eta^2|H(\theta^\eta_k, X_{k+1})|^2\Big) \\
&\le 2\eta\gamma\beta^{-1}|\xi_{k+1}|^2\Big(2(1-\eta\gamma)^2|V^\eta_k|^2 + 2\big(3L_1^2(1+|X_{k+1}|)^{2\rho}|\theta^\eta_k|^2 + 3L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 3H_0^2\big) \\
&\qquad + 3\gamma^2|\theta^\eta_k|^2 + 3|V^\eta_k|^2 + 3\eta^2\big(3L_1^2(1+|X_{k+1}|)^{2\rho}|\theta^\eta_k|^2 + 3L_2^2(1+|X_{k+1}|)^{2(\rho+1)} + 3H_0^2\big)\Big),
\end{align*}
so that, after taking conditional expectations,
\[
E[\Sigma^2_k\,|\,\mathcal{H}_k] \le \eta\tilde{c}_7\frac{\mathcal{V}_k}{\beta^2} + \eta\frac{\tilde{c}_8}{\beta},
\]
where
\[
\tilde{c}_7 = \max\Bigg\{\frac{10\gamma d}{\frac{1}{8}(1-2\lambda)},\ \frac{(2\gamma d)(6L_1^2 C_\rho + 3\gamma^2 + 9L_1^2 C_\rho)}{\frac{1}{16}(1-2\lambda)\gamma^2}\Bigg\},
\qquad
\tilde{c}_8 = 30\gamma d L_2^2 C_\rho + 30\gamma d H_0^2.
\]
Now, recall our step-size condition for the above inequality, in particular, η ≤ 1, which leads to
\begin{align*}
\frac{E[\mathcal{V}^2_{k+1}\,|\,\mathcal{H}_k]}{\beta^2} &\le (\phi^2 + 2\phi\widetilde{K}_1\eta^2)\frac{\mathcal{V}^2_k}{\beta^2} + (2\phi\tilde{c}_1 + \tilde{c}_7)\frac{\mathcal{V}_k}{\beta^2}\eta + 2\phi\frac{\mathcal{V}_k}{\beta}\eta^2\hat{c}_1 + \eta^2\frac{\tilde{c}_2}{\beta^2} + (\tilde{c}_3 + \hat{c}_3)\eta^2|\theta^\eta_k|^4 \\
&\quad + \eta^2(\tilde{c}_4 + \hat{c}_4)|V^\eta_k|^4 + \eta^2(\tilde{c}_5 + \hat{c}_5) + \eta^2\frac{\tilde{c}_6}{\beta^2} + \eta\frac{\tilde{c}_8}{\beta}, \tag{49}
\end{align*}
Next, recall that
\[
\frac{\mathcal{V}_k}{\beta} \ge \max\Big\{\frac{1}{8}(1-2\lambda)\gamma^2|\theta^\eta_k|^2,\ \frac{1}{4}(1-2\lambda)|V^\eta_k|^2\Big\},
\]
which implies
\begin{align*}
\frac{\mathcal{V}^2_k}{\beta^2} &\ge \max\Big\{\frac{1}{64}(1-2\lambda)^2\gamma^4|\theta^\eta_k|^4,\ \frac{1}{16}(1-2\lambda)^2|V^\eta_k|^4\Big\}, \\
&\ge \frac{1}{128}(1-2\lambda)^2\gamma^4|\theta^\eta_k|^4 + \frac{1}{32}(1-2\lambda)^2|V^\eta_k|^4. \tag{50}
\end{align*}
Now inequalities (49) and (50) together imply
\begin{align*}
\frac{E[\mathcal{V}^2_{k+1}\,|\,\mathcal{H}_k]}{\beta^2} &\le (\phi^2 + 2\phi\widetilde{K}_2\eta^2)\frac{\mathcal{V}^2_k}{\beta^2} + (2\phi\tilde{c}_1 + \tilde{c}_7)\frac{\mathcal{V}_k}{\beta^2}\eta + 2\phi\frac{\mathcal{V}_k}{\beta}\eta^2\hat{c}_1 + \eta^2\frac{\tilde{c}_2}{\beta^2} + \eta^2(\tilde{c}_5 + \hat{c}_5) + \eta^2\frac{\tilde{c}_6}{\beta^2} + \eta\frac{\tilde{c}_8}{\beta},
\end{align*}
where \widetilde{K}_2 = \widetilde{K}_1 + \tilde{c}_9 and
\[
\tilde{c}_9 = \max\Bigg\{\frac{\tilde{c}_3 + \hat{c}_3}{\frac{1}{128}(1-2\lambda)^2\gamma^4},\ \frac{\hat{c}_4 + \tilde{c}_4}{\frac{1}{32}(1-2\lambda)^2}\Bigg\}.
\]
Now we take unconditional expectations, use the fact that φ ≤ 1, and obtain
2
] ≤ 1 − λγη + K̄η 2 E[Vk2 ] + c̃10 E[Vk ]η + 2ĉ1 E[Vk ]βη 2 + η 2 c̃2 + η 2 (ĉ5 + c̃5 )β 2 + η 2 c̃6 + ηc̃8 β,
E[Vk+1
where K̄ = 2K̃2 , c̃10 = 2c̃1 + c̃7 . First note that, it follows from the proof of Lemma 3.2 that
4(Ac + d)
E[Vk ] ≤ E[V0 ] + := c̃11 .
γλ
Therefore, using η ≤ 1, we obtain
2
] ≤ 1 − λγη + K̄η 2 E[Vk2 ] + Dη
E[Vk+1
where D = c̃10 c̃11 + 2c̃11 ĉ1 β + c̃2 + (ĉ5 + c̃5 )β 2 + c̃6 + c̃8 β. Now recall that η ≤ λγ/2K̄ which leads to
2 λγη
E[Vk+1 ]≤ 1− E[Vk2 ] + Dη,
2
which implies that
2D
E[Vk2 ] ≤ E[V02 ] + ,
γλ
which concludes the proof.
from Remark (10) where L̄1 = L1 E[(1 + |X0 |)ρ ]. This in turn leads to
\[
U(\theta) \le u_0 + \frac{\bar{L}_1}{2}|\theta|^2 + h_0|\theta|,
\]
where u0 = U (0). Next, we prove the lower bound. To this end, take c ∈ (0, 1) and write
\begin{align*}
U(\theta) &= U(c\theta) + \int_c^1 \langle\theta, h(t\theta)\rangle\,dt, \\
&\ge \int_c^1 \frac{1}{t}\langle t\theta, h(t\theta)\rangle\,dt, \\
&\ge \int_c^1 \frac{1}{t}\big(a|t\theta|^2 - b\big)\,dt, \\
&= \frac{a(1-c^2)}{2}|\theta|^2 + b\log c.
\end{align*}
Taking c = 1/\sqrt{3} leads to the bound.
We therefore obtain
" 2#
h i Z t Z t
η η 2 η p
E V btc −V t =E −ηγ V bsc ds −η H(θbsc , Xdse )ds + 2γηβ −1 (Btη − Bbtc
η
)
btc btc
2 2
≤ 4η γ Cv + 4η (L e1 ) + 4γηβ −1 d.
e 1 Cθ + C 2
Next, we write
\[
\bar\theta^\eta_t = \bar\theta^\eta_{\lfloor t\rfloor} + \int_{\lfloor t\rfloor}^t \eta\bar{V}^\eta_{\lfloor\tau\rfloor}\,d\tau,
\]
which implies
\[
E\big|\bar\theta^\eta_t - \bar\theta^\eta_{\lfloor t\rfloor}\big|^2 \le \eta^2 C_v.
\]
where the third line follows from Lemma B.1 and the last line follows from the inequality |x| ≤ 1 + |x|2 .
Consequently, we obtain
\[
\langle\theta, h(\theta)\rangle \ge 2\lambda\Big(U(\theta) + \frac{\gamma^2}{4}|\theta|^2\Big) - \frac{2A_c}{\beta},
\]
where
\[
A_c = \frac{\beta}{2}\big(b + 2\lambda u_0 + 2\lambda\bar{L}_1 b\big).
\]
We first bound the first term of (51). We start by employing the synchronous coupling and obtain
\[
\big|\bar\theta^\eta_t - \bar\zeta^{\eta,n}_t\big| \le \eta\int_{nT}^t \big|\bar{V}^\eta_{\lfloor s\rfloor} - \bar{Z}^{\eta,n}_s\big|\,ds,
\]
which implies
\begin{align*}
\sup_{nT\le u\le t} E\big|\bar\theta^\eta_u - \bar\zeta^{\eta,n}_u\big|^2 &\le \eta\,\sup_{nT\le u\le t}\int_{nT}^u E\Big[\big|\bar{V}^\eta_{\lfloor s\rfloor} - \bar{Z}^{\eta,n}_s\big|^2\Big]\,ds, \\
&= \eta\int_{nT}^t E\Big[\big|\bar{V}^\eta_{\lfloor s\rfloor} - \bar{Z}^{\eta,n}_s\big|^2\Big]\,ds,
\end{align*}
η η,n η η η η,n
V btc − Z t ≤ V btc − V t + V t − Z t ,
Z t Z t
η η η η,n η η,n
≤ V btc − V t + −γη [V bsc − Z s ]ds − η [H(θbsc , Xdse ) − h(ζ s )]ds ,
nT nT
Z t Z t
η η η η,n η η,n
≤ V btc −V t + γη V bsc − Zs ds + η [H(θbsc , Xdse ) − h(ζ s )]ds ,
nT nT
Z t
η η η η,n
≤ V btc − V t + γη V bsc − Z s ds
nT
Z t Z t
η η,n η,n η,n
+η H(θbsc , Xdse ) − H(ζ s , Xdse ) ds + η [H(ζ s , Xdse ) − h(ζ s )]ds ,
nT nT
Z t
η η η η,n
≤ V btc −V t + γη V bsc − Z s ds
nT
Z t Z t
η η,n η,n η,n
+η L1 (1 + Xdse )ρ θbsc − ζ s ds + η [H(ζ s , Xdse ) − h(ζ s )]ds .
nT nT
We take the squares of both sides and use (a + b)2 ≤ 2(a2 + b2 ) twice to obtain
Z t 2
η η,n 2 η η 2 η η,n
V btc − Z t ≤ 4 V btc − V t + 4γ 2 η 2 V bsc − Z s ds
nT
Z t 2 Z t 2
η η,n η,n η,n
+ 4η 2 L1 (1 + Xdse )ρ θbsc − ζ s ds + 4η 2 [H(ζ s , Xdse ) − h(ζ s )]ds
nT nT
Z t
η η 2 η η,n 2
≤4 V btc −V t + 4γ 2 η V bsc − Zs ds
nT
Z t Z t 2
η η,n 2 η,n η,n
+ 4ηL21 (1 + Xdse )2ρ θbsc − ζ s ds + 4η 2 [H(ζ s , Xdse ) − h(ζ s )]ds .
nT nT
h η i h η i Z t h η i
η,n 2 η 2 η,n 2
E V btc − Z t ≤ 4E V btc − V t + 4γ 2 η E V bsc − Z s ds
nT
" 2
#
Z t Z t
η η,n 2 η,n η,n
+ 4ηL21 Cρ E θbsc − ζs ds + 4η E 2
[H(ζ s , Xdse ) − h(ζ s )]ds ,
nT nT
Z t h η i
η,n 2
≤ 4σV η + 4γ 2 η E V bsc − Z s ds
nT
"Z 2
#
Z t t
η η,n 2 η,n η,n
+ 4ηL21 Cρ E θbsc − ζs ds + 4η E 2
[H(ζ s , Xdse ) − h(ζ s )]ds .
nT nT
Z t
η,n 2
h η η,n 2
i η
E V btc − Z t ≤ 4c1 σV η + 4ηc1 L21 Cρ E θbsc − ζ s ds
nT
"Z 2
#
t
η,n η,n
+ 4c1 η 2 E [H(ζ s , Xdse ) − h(ζ s )]ds ,
nT
First, we bound the supremum term (i.e. the second term of (52)) as
\begin{align*}
\sup_{nT\le s\le t} E\int_{nT}^s \big|\bar\theta^\eta_{\lfloor s'\rfloor} - \bar\zeta^{\eta,n}_{s'}\big|^2\,ds' &= E\int_{nT}^t \big|\bar\theta^\eta_{\lfloor s'\rfloor} - \bar\zeta^{\eta,n}_{s'}\big|^2\,ds' \\
&\le \int_{nT}^t \sup_{nT\le u\le s'} E\big|\bar\theta^\eta_{\lfloor u\rfloor} - \bar\zeta^{\eta,n}_u\big|^2\,ds' \\
&\le \int_{nT}^t \sup_{nT\le u\le s'} E\big|\bar\theta^\eta_u - \bar\zeta^{\eta,n}_u\big|^2\,ds'. \tag{53}
\end{align*}
Next, we bound the last term of (52) by partitioning the integral. Assume that nT +K ≤ s ≤ t ≤ nT +K +1
where K + 1 ≤ T . Thus we can write
\[
\int_{nT}^s \big[h(\bar\zeta^{\eta,n}_{s'}) - H(\bar\zeta^{\eta,n}_{s'}, X_{\lceil s'\rceil})\big]\,ds' = \sum_{k=1}^K I_k + R_K,
\]
where
\[
I_k = \int_{nT+(k-1)}^{nT+k} \big[h(\bar\zeta^{\eta,n}_{s'}) - H(\bar\zeta^{\eta,n}_{s'}, X_{nT+k})\big]\,ds'
\quad\text{and}\quad
R_K = \int_{nT+K}^{s} \big[h(\bar\zeta^{\eta,n}_{s'}) - H(\bar\zeta^{\eta,n}_{s'}, X_{nT+K+1})\big]\,ds'.
\]
Then
\[
\Big|\sum_{k=1}^K I_k + R_K\Big|^2 = \sum_{k=1}^K |I_k|^2 + 2\sum_{k=2}^K\sum_{j=1}^{k-1}\langle I_k, I_j\rangle + 2\sum_{k=1}^K \langle I_k, R_K\rangle + |R_K|^2,
\]
Finally, it remains to take the expectations of both sides. We begin by defining the filtration \mathcal{H}^\infty_s = F^\eta_\infty \vee G_{\lfloor s\rfloor}
and note that for any k = 2, \ldots, K, j = 1, \ldots, k-1,
\begin{align*}
E\langle I_k, I_j\rangle &= E\big[E[\langle I_k, I_j\rangle\,|\,\mathcal{H}^\infty_{nT+j}]\big], \\
&= E\Bigg[E\bigg[\Big\langle \int_{nT+(k-1)}^{nT+k} \big[H(\bar\zeta^{\eta,n}_{s'}, X_{nT+k}) - h(\bar\zeta^{\eta,n}_{s'})\big]\,ds',\ \int_{nT+(j-1)}^{nT+j} \big[H(\bar\zeta^{\eta,n}_{s'}, X_{nT+j}) - h(\bar\zeta^{\eta,n}_{s'})\big]\,ds'\Big\rangle\,\bigg|\,\mathcal{H}^\infty_{nT+j}\bigg]\Bigg], \\
&= E\Bigg[\Big\langle \int_{nT+(k-1)}^{nT+k} E\big[H(\bar\zeta^{\eta,n}_{s'}, X_{nT+k}) - h(\bar\zeta^{\eta,n}_{s'})\,\big|\,\mathcal{H}^\infty_{nT+j}\big]\,ds',\ \int_{nT+(j-1)}^{nT+j} \big[H(\bar\zeta^{\eta,n}_{s'}, X_{nT+j}) - h(\bar\zeta^{\eta,n}_{s'})\big]\,ds'\Big\rangle\Bigg], \\
&= 0.
\end{align*}
≤ T 2 σH + T σH . (54)
+ 4c1 η 3 (T 2 σH + T σH ),
Z t
η η,n 2
≤ 4c1 ησV + 4c1 ηL21 Cρ sup E θu − ζ u ds0
nT nT ≤u≤s0
?
p
with C1,1 = exp(4c1 L21 Cρ )(4c1 σV + 4c1 σH + 4c1 ησH ). Note that σV = O(d) and σH = O(d) hence
?
√
C1,1 = O( d).
Next, we upper bound the second term of (51). To prove it, we write
Z t Z t h i
η η,n η η,n η η,n
V t − Zt ≤ γη V bsc − Z s ds + η H(θbsc ) − h(ζ s ) ds ,
nT nT
which leads to
"Z 2
#
h η i Z t h η i t h i
η,n 2 η,n 2 η η,n
E V t − Zt ≤ 2γ 2 η E V bsc − Z s + 2η 2 E H(θbsc ) − h(ζ s ) ds .
nT nT
By similar arguments we have used for bounding the the first term, we obtain
h η i Z t h i
η,n 2 η η,n 2
E V t − Zt ≤ 2γ 2 η E V bsc − Z s + exp(4c1 L21 Cρ )(4c1 σH η + 4c1 η 2 σH η 2 ).
nT
Using the fact that the rhs is an increasing function of t and we obtain
h η i Z t h η i
η,n 2 η,n 2
sup E V u − Z u ≤ 2γ 2 η sup E V u − Z u ds + exp(4c1 L21 Cρ )(4c1 σH η + 4c1 η 2 σH η 2 ),
nT ≤u≤t nT nT ≤u≤s
which leads to
\[
E\Big[\big|\bar{V}^\eta_t - \bar{Z}^{\eta,n}_t\big|^2\Big]^{1/2} \le C^\star_{1,2}\,\sqrt{\eta}, \tag{57}
\]
where C^\star_{1,2} = \sqrt{\exp(2\gamma^2 + 4c_1L_1^2C_\rho)\,4c_1\sigma_H(1+\eta)}. Note again that \sigma_H = O(d), hence C^\star_{1,2} = O(d^{1/2}).
Therefore, combining (51), (56), (57), we obtain
\[
W_2\big(\mathcal{L}(\bar\theta^\eta_t, \bar{V}^\eta_t), \mathcal{L}(\bar\zeta^{\eta,n}_t, \bar{Z}^{\eta,n}_t)\big) \le C_1^\star\,\eta^{1/2},
\]
At this point, we have two processes and in order to be able to use the contraction result, we need their
starting times to match. For notational simplicity, let us define
η η η η η η
bts,θs ,V s ,η = (ζbts,θs ,V s ,η , Z
B bts,θs ,V s ,η )
The reader should notice at this point that this quantity can be upper bounded by a contraction result as
both processes are defined for time t and both started at time kT . By using Theorem 3.1, we obtain
p X n q
η,n η,n η,(k−1)T η,(k−1)T
W2 (L(ζ t , Z t ), L(ζtη , Ztη )) ≤ Ċ e−ηċ(t−kT )/2 Wρ (L(θkT
η η
, VkT ), L(ζ kT , Z kT ).
k=1
Next, using Lemma 5.4 of Chau and Rásonyi (2022) to upper bound the the last term leads
η,n η,n
W2 (L(ζ t , Z t ), L(ζtη , Ztη ))
p X n q
η,(k−1)T η,(k−1)T
≤ 3 max{1 + α, γ −1 } Ċ e−ηċ(t−kT )/2 1 + εc E1/2 [V 2 (θkT
η η
, VkT )] + εc E1/2 [V 2 (ζ kT , Z kT )]
k=1
q
η η η,(k−1)T η,(k−1)T
× W2 (L(θkT , VkT ), L(ζ kT , Z kT )),
≤ C2? η 1/4 ,
where the last line follows from Lemma 3.4 and Theorem 4.1. The bound for the term relating to the contin-
uous dynamics (i.e., E^{1/2}[\mathcal{V}^2(\bar\zeta^{\eta,(k-1)T}_{kT}, \bar{Z}^{\eta,(k-1)T}_{kT})]) follows from Remark 2.3 (dissipativity of the gradient),
the fact that the diffusion coefficient is constant, the uniform bound in Lemma 3.1 and, finally, the fact that the
initial data (which consists of the iterates of the numerical scheme) has finite fourth moments (Lemma 3.4). We note
that C_2^\star = O(e^d).
|h(θ)| ≤ L1 |θ| + h0 .
where
\[
C_m^2 := \max\bigg\{\int_{\mathbb{R}^{2d}} \|\theta\|^2\,\pi^\eta_{n,\beta}(d\theta, dv),\ \int_{\mathbb{R}^{2d}} \|\theta\|^2\,\pi_\beta(d\theta, dv)\bigg\} = \max(C^c_\theta, C_\theta).
\]