A Diffusion Process Perspective On Posterior Contraction Rates For Parameters
UC Berkeley
Voleon Group‡
September 4, 2019
Abstract
We show that diffusion processes can be exploited to study the posterior contraction
rates of parameters in Bayesian models. By treating the posterior distribution as a station-
ary distribution of a stochastic differential equation (SDE), posterior convergence rates
can be established via control of the moments of the corresponding SDE. Our results de-
pend on the structure of the population log-likelihood function, obtained in the limit of an
infinite sample size, and stochastic perturbation bounds between the population
and sample log-likelihood functions. When the population log-likelihood is strongly con-
cave, we establish posterior convergence of a d-dimensional parameter at the optimal rate
(d/n)1/2 . In the weakly concave setting, we show that the convergence rate is determined
by the unique solution of a non-linear equation that arises from the interplay between the
degree of weak concavity and the stochastic perturbation bounds. We illustrate this gen-
eral theory by deriving posterior convergence rates for three concrete examples: Bayesian
logistic regression models, Bayesian single index models, and over-specified Bayesian mix-
ture models.
1 Introduction
Bayesian inference is one of the central pillars of statistics. In Bayesian analysis, we first endow
the parameter space with a prior distribution, which represents a form of prior belief and
knowledge; the posterior distribution obtained by Bayes’ rule combines this prior information
with observations. Fundamental questions that arise in Bayesian inference include consistency
of the posterior distribution as the sample size goes to infinity, and from a more refined point
of view, the contraction rate of the posterior distribution.
The earliest work on posterior consistency dates back to the seminal work by Doob [8],
who demonstrated that, for any given prior, the posterior distribution is consistent for all
parameters apart from a set of zero measure. Subsequent work by Freedman [10, 11] pro-
vided examples showing that this null set can be problematic for Bayesian consistency in
nonparametric problems. In order to address this issue, Schwartz [29] proposed a general
framework for establishing posterior consistency for both semiparametric and nonparamet-
ric models. Since then, a number of researchers have isolated conditions that are useful for
studying posterior distributions [1, 36, 37].
⋆ Wenlong Mou and Nhat Ho contributed equally to this work.
Moving beyond consistency of the posterior distribution, convergence rates for the poste-
rior density function and associated parameters remain an active area of research. The seminal
paper by Ghosal et al. [13] lays out a general framework for analyzing rates of convergence of
posterior densities in both finite and infinite dimensional models. Their result relies on a sieve
construction, meaning a series of subsets of the full model class such that: (a) the difference
between these model subsets and the full model, as measured in terms of mass under the
prior, vanishes as the sample size goes to infinity, and (b) the prior distribution puts a
sufficient quantity of mass on the neighborhood around the true density function. Drawing
on the framework of that paper, various works established the posterior convergence rates of
density functions under several statistical models. For instance, the papers [14, 15, 27, 30]
provided (adaptive) rates for the density function in Dirichlet process and nonparametric
beta mixture models. Other work [2, 41, 40] established minimax-optimal rates for the regression
function in nonparametric regression models. Related problems include adaptive rates for the
density in nonparametric Bayesian inference [7, 12], and posterior contraction rates of density
under misspecified models [18]. Other popular general frameworks for analyzing the density
functions of posterior distributions include those of Shen and Wasserman [31], and Walker et
al. [38].
While the convergence of posterior densities is relatively well understood, various issues
remain unresolved for the corresponding questions about model
parameters. On one hand, for certain classes of Bayesian (in)finite mixture models, the pre-
cise posterior convergence rates of parameters are well-studied [22, 16]. However, these results
rely on the boundedness of the parameter space, which can be problematic under model
misspecification. Another popular example is the latent Dirichlet allocation (LDA) model, which is widely
applied in machine learning and genetics [3, 24]. Even though some recent works provided spe-
cific dependence of posterior convergence rates of topics on the number of documents [23, 39],
the concrete dependence of these rates on the number of words in each document or on the
vocabulary size remains unknown. In these works, posterior convergence rates for parameters
are established by using techniques previously used in analyzing densities. More precisely, the
approach was based on deriving lower bounds on distances between density functions, including the
Hellinger or total variation distances [13], in terms of distances between the model parameters.
Given lower bounds of this type, a convergence guarantee for the density can immediately be
translated into a convergence guarantee for the parameters. A drawback of this technique is
that it typically requires strong global geometric assumptions that are not always necessary,
especially for weakly-concave population log-likelihood functions, namely, the limit of sample
log-likelihood function when the number of data points n grows to infinity. For example, in order
to guarantee, for some α ≥ 1, an n^{−1/(2α)} rate for parameter convergence, the bound in Ghosal et
al. [13] requires the following conditions:
" #
pθ ∗ 2α pθ 2
−Eθ∗ log - kθ − θ k2 , −Eθ∗ log ∗ - kθ − θ ∗ k2α
2 , and
pθ ∗ pθ (1)
kθ1 − θ2 kα2 - h(pθ1 , pθ2 ) - kθ1 − θ2 kα2 .
Here θ ∗ denotes the true parameter, the quantity pθ is the density function at θ, and h is the
Hellinger distance. The conditions (1) do not always hold for the statistical models mentioned
earlier, since the log-likelihood function behaves differently according to its location within
the unbounded parameter space. Furthermore, the precise dependence of the convergence rate
on other parameters in addition to sample size n, including the dimension d, can be difficult
to obtain using this technique.
In this paper, we study the posterior contraction rates of parameters via the lens of
diffusion processes. This approach does not require the parameter space to be bounded and
allows us to quantify the dependence of convergence rates on quantities other than the sample
size n. More specifically, we recast the posterior distribution as the stationary distribution of
a stochastic differential equation (SDE). In this way, by controlling the moments of the given
diffusion process, we can obtain bounds on posterior convergence rates of parameters. This
approach exploits two main features that have not been extensively explored in the Bayesian
literature: (i) the geometric structure of the population log-likelihood function and (ii) uniform
perturbation bounds between the empirical and population log-likelihood functions.
At a high-level, our main contributions can be summarized as follows:
• Strongly concave setting: We first consider settings in which the population log-
likelihood function is strongly concave around the true parameter. We demonstrate that
as long as the prior distribution is sufficiently smooth and the perturbation error between
the population and empirical log-likelihood function is well-controlled, the posterior
contraction rate around the true parameter is (d/n)1/2 . In addition, this technique
also allows us to quantify the dependence of the rate on other properties of the model.
• Weakly concave setting: Moving beyond the strongly concave setting, we then turn
to population log-likelihood functions that are only weakly concave. In this setting, our
analysis depends on two auxiliary functions: (i) a function ψ that captures the local
and global behavior of the population log-likelihood function around the true parameter,
and (ii) a function ζ that describes the growth of the perturbation error between the population
and sample log-likelihood functions. Our analysis shows how the posterior contraction
rates for parameters can be derived from the unique positive solution of a non-linear
equation involving these two functions.
• Illustrative examples: We illustrate the general results for three concrete classes of
models: Bayesian logistic regression models, Bayesian single index models, and over-
specified Bayesian location Gaussian mixture models. The obtained rates show the
influence of different modeling assumptions. For instance, the posterior convergence
rate of parameter under Bayesian logistic regression models is (d/n)1/2 . In contrast, for
Bayesian single index models with polynomial link functions, we exhibit rates of the
form (d/n)1/(2p) , where p ≥ 2 is the degree of the polynomial. Finally, for over-specified
location Gaussian mixtures, we establish a convergence rate of the order (d/n)1/4 . While
the n−1/4 component of this rate is known from past work [6, 22], the corresponding
scaling of d1/4 with dimension appears to be a novel result.
The remainder of the paper is organized as follows. In Section 2, we set up the basic framework
for Bayesian models and introduce a diffusion process that admits the posterior distribution as its
stationary distribution. Section 3 is devoted to establishing the general results for posterior
convergence rates of parameters under various assumptions on the concavity of the popula-
tion log-likelihood. In Section 4, we apply these general results to derive concrete rates of
convergence for a number of illustrative examples. The proofs of main theorems are provided
in Section 5 while the proofs of auxiliary lemmas and corollaries in the paper are deferred to
the Appendices. We conclude our work with a discussion in Section 6.
Notation. In the paper, the expression a_n ≿ b_n will be used to denote a_n ≥ c b_n for some
positive universal constant c that does not change with n; similarly, a_n ≾ b_n denotes a_n ≤ c b_n. Additionally, we write a_n ≍ b_n if
both a_n ≿ b_n and a_n ≾ b_n hold. For any n ∈ N, we denote [n] = {1, 2, . . . , n}. The notation
S^{d−1} stands for the unit sphere, namely, the set of vectors u ∈ R^d such that ‖u‖_2 = 1. For any
subset Θ of R^d, r ≥ 1, and ε > 0, we denote by N(ε, Θ, ‖·‖_r) the covering number of Θ in the
‖·‖_r norm, namely, the minimum number of ε-balls in the ‖·‖_r norm needed to cover the entire set Θ.
Finally, for any x, y ∈ R, we denote x ∨ y = max{x, y}.
In a Bayesian analysis, the parameter space Θ is endowed with a prior distribution π. Com-
bining this prior with the likelihood (2a) yields the posterior distribution
As the sample size n increases, we expect that the posterior distribution will concentrate
more of its mass over increasingly smaller neighborhoods of the true parameter θ ∗ . Posterior
contraction rates allow us to study how quickly this concentration of mass takes place. In
particular, for a given norm, we study the posterior mass of a ball of the form ‖θ − θ*‖ ≤ ρ
for a suitably chosen radius. For a given δ ∈ (0, 1), our goal is to prove statements of the form

Π( ‖θ − θ*‖ ≥ ρ(n, d, δ) | X_1^n ) ≤ δ,   (3)
with probability at least 1 − δ over the randomly drawn data X1n . Our interest is in the
scaling of the radius ρ(n, d, δ) as a function of sample size n, problem dimension d, and the
error tolerance δ, as well as other problem-specific parameters.
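For intuition, a statement of the form (3) can be verified in closed form in a toy conjugate model of our own choosing (not one of the models analyzed in this paper): a N(0, 1) prior combined with a N(θ, 1) likelihood yields a Gaussian posterior, and the posterior mass outside a ball of radius ρ(n, δ) ≍ √(log(1/δ)/n) indeed falls below δ:

```python
import math
import random

def posterior_tail_mass(n, rho, theta_star=0.0, seed=0):
    """Posterior mass of {|theta - theta*| >= rho} for a N(0, 1) prior and a
    N(theta, 1) likelihood; the posterior is N(n*xbar/(n+1), 1/(n+1))."""
    rng = random.Random(seed)
    xbar = sum(rng.gauss(theta_star, 1.0) for _ in range(n)) / n
    mean = n * xbar / (n + 1)              # posterior mean
    sd = 1.0 / math.sqrt(n + 1)            # posterior standard deviation
    # Gaussian tail masses outside [theta* - rho, theta* + rho]
    upper = 0.5 * math.erfc((theta_star + rho - mean) / (sd * math.sqrt(2)))
    lower = 0.5 * math.erfc((mean - (theta_star - rho)) / (sd * math.sqrt(2)))
    return upper + lower

delta = 0.05
for n in (100, 10_000):
    rho = 3.0 * math.sqrt(math.log(1.0 / delta) / n)   # radius rho(n, delta)
    print(n, posterior_tail_mass(n, rho))              # stays below delta
```

Here the factor 3 plays the role of the universal constant in ρ(n, d, δ); the radius must shrink at the n^{−1/2} rate that the posterior actually achieves in this conjugate example.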
where (Bt , t ≥ 0) is a standard d-dimensional Brownian motion [25], and the potential function
U : Rd → R is assumed to satisfy the regularity conditions: (a) its gradient ∇U is locally
Lipschitz, and (b) its gradient satisfies the inequality
where c1 , c2 are positive constants. Under these conditions, from known results on general
Langevin diffusions [20, 26], we have:
Proposition 1. Under the stated regularity conditions, the solution to the Langevin SDE (4)
exists and is unique in the strong sense. Furthermore, the density of θt converges in L2 to the
stationary distribution with density proportional to e−βU .
Let us consider this general result in the context of Bayesian inference. In particular,
suppose that we apply Proposition 1 to the potential function Un (θ) := −nFn (θ) − log π(θ).
Doing so will require us to verify that Un satisfies the requisite regularity conditions. Assuming
this validity, we are guaranteed that the posterior distribution Π(θ | X1n ) is the stationary
distribution of the following stochastic differential equation (SDE):
dθ_t = (1/2) ∇F_n(θ_t) dt + (1/(2n)) ∇log π(θ_t) dt + (1/√n) dB_t,   (5)
where Z = ∫ e^{−U_n(θ)} dθ. Given this bound, we can establish posterior contraction rates for the
parameters by controlling the moments of the diffusion process {θt }t≥0 . The main theoretical
results of this paper are obtained by following this general roadmap.
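To sanity-check this roadmap numerically (a toy illustration of our own, not part of the paper's analysis), one can discretize the SDE (5) with an Euler–Maruyama scheme. For a one-dimensional N(θ, 1) likelihood and a standard normal prior, the posterior is N(n x̄/(n + 1), 1/(n + 1)), and the long-run moments of the discretized diffusion match it:

```python
import math
import random

rng = random.Random(1)
n = 50
data = [rng.gauss(0.0, 1.0) for _ in range(n)]
xbar = sum(data) / n

def drift(theta):
    # (1/2) grad F_n(theta) + (1/(2n)) grad log pi(theta), as in the SDE (5)
    grad_Fn = xbar - theta          # F_n(theta) = -(1/(2n)) sum_i (x_i - theta)^2 + const
    grad_log_prior = -theta         # standard normal prior
    return 0.5 * grad_Fn + grad_log_prior / (2 * n)

h = 0.01                            # step size (ad hoc choice)
theta, burn_in, steps = 0.0, 50_000, 400_000
samples = []
for t in range(burn_in + steps):
    # Euler-Maruyama step: the Brownian increment is scaled by 1/sqrt(n)
    theta += drift(theta) * h + math.sqrt(h / n) * rng.gauss(0.0, 1.0)
    if t >= burn_in:
        samples.append(theta)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, n * xbar / (n + 1))     # both close to the posterior mean
print(var, 1.0 / (n + 1))           # both close to the posterior variance
```

The step size h and run length are ad hoc choices; the discretized chain tracks the stationary (posterior) distribution only when h is small relative to the relaxation time of the diffusion.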
where the expectation is taken with respect to X ∼ Pθ∗ . Throughout the paper, we impose the
following smoothness conditions on the population log-likelihood F and the log prior density:
(A) There exist positive constants L1 and L2 such that for any θ1 , θ2 ∈ Rd , we have
(B) There exists a constant B > 0 such that
3 Main results
We now turn to the presentation of our main results, which provide guarantees on posterior
contraction rates under different conditions. Our first main result, stated as Theorem 1 in
Section 3.1, establishes the posterior convergence rate of parameters when the population log-
likelihood is strongly concave around the true parameter θ ∗ . Then in Section 3.2, we study
the same question when the population log-likelihood is weakly concave around θ ∗ , with our
conclusions stated in Theorem 2.
(S.2) For any δ > 0, there exist non-negative functions ε1 and ε2 that map from N × (0, 1] to
R+ such that
Theorem 1. Suppose that Assumptions A, B, S.1, and S.2 hold. Then there is a universal
constant C such that for any δ ∈ (0, 1), given a sample size n large enough that ε_1(n, δ) ≤ µ/6, we have

Π( ‖θ − θ*‖_2 ≥ C( √((d + log(1/δ) + B)/(nµ)) + ε_2(n, δ)/µ ) | X_1^n ) ≤ δ,
See Section 5.1 for the proof of Theorem 1.
The result of Theorem 1 establishes the posterior convergence rate (d/n)^{1/2} for the parameter
when F is strongly concave. It also provides a detailed dependence of the rate
on other model parameters, including B and µ, both of which might vary as a function of θ*.
At the moment, we do not know whether the dependence on these parameters is optimal.
(W.1) There exists a convex, non-decreasing function ψ : [0, +∞) → R such that ψ(0) = 0 and
for any θ ∈ Rd , we have
Assumption W.1 characterizes the weak concavity of the function F around the global maximum
θ*. This condition can hold when the log likelihood is locally strongly concave around θ* but
only weakly concave in a global sense, or it can hold when the log likelihood is weakly concave
but nowhere strongly concave. An example of the former type is the logistic regression model
analyzed in Section 4.1, whereas an example of the latter type is given by certain kinds of
single index models, as analyzed in Section 4.2.
Our next assumption controls the deviation between the gradients of the population and
sample likelihoods:
(W.2) For any δ > 0, there exists a function ε : N × (0, 1] → R+ and a non-decreasing function
ζ : R → R such that ζ(0) ≥ 0 and
Note that the function ζ can depend on the sample size n and other model parameters, as is the
case for the over-specified Bayesian mixture model analyzed in Section 4.3. We suppress this
dependence in the notation for brevity of presentation.
The previous conditions involved two functions, namely ψ and ζ. We let ξ : R+ → R
denote the inverse function of the strictly increasing function r ↦ rζ(r). Our next assumption
imposes certain inequalities on these functions and their derivatives:
(W.3) The function r 7→ ψ(ξ(r)) is convex, and moreover, for any r > 0, the functions ψ and
ζ satisfy the following differential inequalities:
7
These differential inequalities are needed to control the moments of the diffusion process
{θt }t>0 in equation (5). In our discussion of concrete examples, we provide instances in which
they are satisfied.
Our result involves a certain fixed point equation that depends on the parameters and
functions in our assumptions. In particular, for any tolerance parameter δ ∈ (0, 1) and
sample size n, consider the following equation in a variable z > 0:
ψ(z) = ε(n, δ) ζ(z) z + (B + d log(1/δ))/n.   (9)
In order to ensure that equation (9) has a unique positive solution, our final assumption
imposes a condition on the growth of the functions ψ and ζ:
(W.4) The sample size n and tolerance parameter δ ∈ (0, 1) are such that ε(n, δ) < lim inf_{z→+∞} ψ(z)/(zζ(z)).
With this set-up, we are now ready to state our second main result:
Theorem 2. Assume that Assumptions A, B, and W.1—W.4 hold. Then for a given sample
size n and δ ∈ (0, 1), equation (9) has a unique positive solution z ∗ (n, δ) such that
Π( ‖θ − θ*‖_2 ≥ z*(n, δ) | X_1^n ) ≤ δ,   (10)
where z_p^* is the unique positive solution to the equation ψ(z) = ε(n, δ)ζ(z)z + (B + pd)/n. In light
of the above result and the inequality (6), when p is of the order log(1/δ), we obtain the
posterior convergence rate (10).
Second, it is (at least in general) not possible to compute an explicit form for the positive
solution z*(n, δ) of the non-linear equation (9). However, for certain forms of the functions ψ
and ζ, it is possible to compute an analytic and relatively simple upper bound. For instance,
given positive parameters (α, β) such that α > β + 1, suppose that these functions are
defined locally, in an interval above zero, as follows:
ψ(r) = r^α, and ζ(r) = r^β, for all r in some interval [0, r̄).   (11a)
As shown in the analysis to follow in Section 4, these particular forms arise in several statistical
models, including Bayesian logistic regression and certain forms of Bayesian single index
models. The following result shows that we have a simple upper bound on z ∗ (n, δ).
Corollary 1. Assume that the functions ψ, ζ have the local behavior (11a), and the pertur-
bation term ε(n, δ) has the form (11b). If, in addition, the global forms of ψ and ζ satisfy
Assumption W.3, then we are guaranteed that
z*(n, δ) ≤ c [ ((d + log(1/δ) + B)/n)^{1/(2(α−(β+1)))} ∨ ((d + log(1/δ) + B)/n)^{1/α} ],
where z ∗ (n, δ) is defined in Theorem 2.
Corollary 1 directly leads to a posterior convergence rate for the parameters—namely, we
have
Π( ‖θ − θ*‖_2 ≥ c [ ((d + log(1/δ) + B)/n)^{1/(2(α−(β+1)))} ∨ ((d + log(1/δ) + B)/n)^{1/α} ] | X_1^n ) ≤ δ,   (12)
with probability 1 − δ with respect to the training data. Note that the convergence rate scales
as (d/n)^{1/(2(α−(β+1)))} when α ≥ 2(β + 1). On the other hand, this rate becomes (d/n)^{1/α} when
α < 2(β + 1).
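Since an explicit formula for z*(n, δ) is generally unavailable, one can solve equation (9) numerically. The sketch below is our own illustration: it assumes the perturbation scaling ε(n, δ) = √((d + log(1/δ))/n) (the form (11b) is not reproduced here, but this matches the examples of Section 4), takes the power-law forms (11a) with unit constants, and compares the bisection solution with the closed-form bound of Corollary 1:

```python
import math

def solve_fixed_point(psi, zeta, eps, c):
    """Bisection for the unique z > 0 with psi(z) = eps * zeta(z) * z + c.
    Assumes psi(z) - eps*zeta(z)*z - c is negative near 0 and positive for large z."""
    f = lambda z: psi(z) - eps * zeta(z) * z - c
    lo, hi = 1e-12, 1.0
    while f(hi) < 0:          # grow the bracket until the sign flips
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Power-law forms (11a): psi(r) = r^alpha, zeta(r) = r^beta, with alpha > beta + 1.
alpha, beta = 4.0, 1.0
n, d, delta, B = 10**6, 10, 0.05, 1.0
eps = math.sqrt((d + math.log(1 / delta)) / n)   # assumed form of eps(n, delta)
c = (d + math.log(1 / delta) + B) / n
z_star = solve_fixed_point(lambda z: z ** alpha, lambda z: z ** beta, eps, c)

bound = max(c ** (1 / (2 * (alpha - beta - 1))), c ** (1 / alpha))
print(z_star, bound)   # z_star is below a constant multiple of the bound
```

The bisection relies only on the sign change guaranteed by Assumption W.4; the constants here are arbitrary normalizations, not the universal constants of Corollary 1.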
P(Y_i = 1 | X_i, θ*) = e^{⟨X_i, θ*⟩} / (1 + e^{⟨X_i, θ*⟩}),   (14)
where θ ∗ ∈ Rd is a fixed but unknown value of the parameter vector. Consequently, the
sample log-likelihood function of the samples Z1n takes the form
F_n^R(θ) := (1/n) Σ_{i=1}^n { log P(Y_i | X_i, θ) + log φ(X_i) },   (15)
where φ denotes the density of a standard normal vector. Combining this log likelihood with
a given prior π over θ yields the posterior distribution in the usual way. We assume that the
prior function π satisfies Assumptions A and B.
With this set-up, the following result establishes the posterior convergence rate of θ around
∗
θ , conditionally on the observations Z1n .
Corollary 2. For a given δ ∈ (0, 1), suppose that we are given n i.i.d. samples Z_1^n, with
n/log n ≥ c′ d log(1/δ), from the Bayesian logistic regression model (13) with true parameter θ* for some universal
constant c′ . Then there is a universal constant c such that
Π( ‖θ − θ*‖_2 ≥ c √((d + log(1/δ) + B)/n) | Z_1^n ) ≤ δ
with probability at least 1 − δ, where this outer probability is taken with respect to X and Y | X
drawn from the logistic model (14).
In Appendix A.1, we prove that there are universal constants c, c1, c2 such that

⟨∇F^R(θ), θ* − θ⟩ ≥ c1 { ‖θ − θ*‖_2^2,  for all ‖θ − θ*‖_2 ≤ 1;   ‖θ − θ*‖_2,  otherwise },   (17a)

and

sup_{θ∈R^d} ‖∇F_n^R(θ) − ∇F^R(θ)‖_2 ≤ c2 ( √(d/n) + √(log(1/δ)/n) + log(1/δ)/n ),   (17b)

for any r > 0, with probability 1 − δ, as long as n/log n ≥ cd log(1/δ). The above results with F^R
and F_n^R indicate that the functions ψ and ζ in Assumptions W.1 and W.2 have the following
closed forms:

ψ(r) = c1 { r^2,  for all 0 < r ≤ 1;   r,  otherwise },  and  ζ(r) = c2 for all r > 0.   (18)
We can check that the functions ψ and ζ satisfy the conditions in Assumptions W.3 and W.4.
Therefore, an application of Theorem 2 to these functions leads to the posterior contraction
rate of θ around θ ∗ in Corollary 2.
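To make this concrete, the fixed-point equation (9) can be solved numerically with the logistic forms (18). The sketch below is our own illustration; it sets the constants c1, c2, and B to one and assumes ε(n, δ) = √(d/n) + √(log(1/δ)/n), the scaling suggested by (17b):

```python
import math

def psi(r):                   # psi from (18), with c1 = 1 (assumed constant)
    return r ** 2 if r <= 1 else r

def solve(n, d, delta, B=1.0):
    # zeta is constant (c2 = 1 assumed), per (18)
    eps = math.sqrt(d / n) + math.sqrt(math.log(1 / delta) / n)
    c = (d + math.log(1 / delta) + B) / n
    f = lambda z: psi(z) - eps * z - c
    lo, hi = 1e-12, 1.0
    while f(hi) < 0:
        hi *= 2.0
    for _ in range(200):      # bisection for the unique positive root
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

d, delta = 10, 0.05
for n in (10**4, 10**6):
    z = solve(n, d, delta)
    print(n, z, z / math.sqrt(d / n))   # ratio stays bounded: z* = O(sqrt(d/n))
```

The bounded ratio across two decades of n reflects the (d/n)^{1/2} rate of Corollary 2; the actual constants in the corollary are of course not reproduced by this normalization.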
consider specific settings of single index models in which the link function is known and has a
specific form. In particular, we assume that the data points Z_i = (Y_i, X_i) ∈ R^{d+1} are generated
as follows:

Y_i = g(X_i^⊤ θ*) + ε_i,   (19)

for i ∈ [n], where the link function is g(r) = r^p for a given integer p ≥ 2. Here (X_i, Y_i) correspond to the covariate
vector and response variable, respectively, whereas g is a known link function while θ ∗ is a
true but unknown parameter. Furthermore, the noise variables {ε_i}_{i=1}^n are i.i.d. standard
Gaussian, whereas the covariates X_i are assumed to be i.i.d. draws from the standard
multivariate Gaussian distribution. The choice g(r) = r^2 has been used to model the problem
of phase retrieval in computational imaging.
Now, we consider a Bayesian single index model to study θ ∗ . More specifically, we endow
the parameter θ with a prior function π. Conditioning on Xi and θ, we have
Y_i | X_i, θ  ∼  N( g(X_i^⊤ θ), 1 ), independently for each i ∈ [n].   (20)
The main goal of this section is to study the posterior convergence rate of θ around θ ∗ given
the data. To do that, we assume the prior function π satisfies the Assumptions A and B and
follow the framework in the Bayesian logistic regression case. In particular, we first study
the structure of the sample log-likelihood function around the true parameter θ ∗ . Then we
establish the uniform perturbation bound between the population and sample log-likelihood
functions.
Given the Bayesian single index model (20), the sample log-likelihood function FnI of the
samples Z1n = {Zi }ni=1 admits the following form
F_n^I(θ) := (1/n) Σ_{i=1}^n { −( Y_i − g(X_i^⊤ θ) )^2 / 2 + log φ(X_i) },   (21)
where φ is the density of the standard multivariate normal distribution of the covariates. Hence,
the population log-likelihood function F^I has the following form
" #
Y − g X ⊤θ 2
F I (θ) := E(X,Y ) − + log φ(X) , (22)
2
Given these choices of g and θ ∗ , the population log- likelihood function has the closed-form
expression
and there are universal constants (c, c2) such that for any r > 0 and δ ∈ (0, 1), as long as
n ≥ c (d + log(d/δ))^{2p}, we have
sup_{θ∈B(θ*,r)} ‖∇F_n^I(θ) − ∇F^I(θ)‖_2 ≤ c2 ( r^{p−1} + r^{2p−1} ) √((d + log(1/δ))/n),   (24b)
with probability at least 1 − δ. Therefore, the functions ψ and ζ in Assumptions W.1 and
W.2 take the specific forms
for all r > 0. Simple algebra shows that these functions satisfy Assumptions W.3 and W.4.
Therefore, under the setting (23), a direct application of Theorem 2 leads to the following
result regarding the posterior contraction rate of θ around θ ∗ :
Corollary 3. Consider the Bayesian single index model (19) with true parameter θ ∗ = 0 and
link function g(r) = r^p for some p ≥ 2. Then there are universal constants c, c′ such that
for any δ > 0, given a sample size n ≥ c′ (d + log(d/δ))2p , we have
Π( ‖θ − θ*‖_2 ≥ c ((d + log(1/δ) + B)/n)^{1/(2p)} | Z_1^n ) ≤ δ
valid for each r > 0 with probability 1 − δ. The condition n ≥ c(d + log(d/δ))^{2p} is required to
guarantee that the RHS of the above display is upper bounded by the RHS of equation (24b);
this bound permits us to apply Theorem 2 to establish the posterior convergence rate of
parameter under the Bayesian single index models.
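The same numerical fixed-point computation reproduces the (d/n)^{1/(2p)} scaling of Corollary 3. Because the display (25) giving ψ and ζ is not reproduced above, the sketch below assumes the forms suggested by (24b), namely ζ(r) = r^{p−1} + r^{2p−1} and ψ(r) = r^{2p}, with ε(n, δ) = √((d + log(1/δ))/n) and all constants set to one:

```python
import math

def solve_rate(p, n, d, delta, B=1.0):
    # Assumed forms for the single index model (constants normalized to one):
    psi = lambda r: r ** (2 * p)                     # local weak concavity
    zeta = lambda r: r ** (p - 1) + r ** (2 * p - 1) # growth from (24b)
    eps = math.sqrt((d + math.log(1 / delta)) / n)
    c = (d + math.log(1 / delta) + B) / n
    f = lambda z: psi(z) - eps * zeta(z) * z - c
    lo, hi = 1e-12, 1.0
    while f(hi) < 0:
        hi *= 2.0
    for _ in range(200):                             # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

p, d, delta = 2, 10, 0.05
for n in (10**4, 10**6):
    z = solve_rate(p, n, d, delta)
    target = ((d + math.log(1 / delta)) / n) ** (1 / (2 * p))
    print(n, z, z / target)   # ratio stays bounded as n grows
```

With p = 2 (the phase-retrieval link), the solution tracks ((d + log(1/δ))/n)^{1/4} up to a constant, matching the exponent 1/(2p) in Corollary 3.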
hand, when the covariance matrices are known and the parameter space is bounded, location
parameters have been shown to have the posterior convergence rate n^{−1/4} in the Wasserstein-2
metric [22]. However, neither the dependence on the dimension d nor that on the true number of
components has been established.
In this section, we consider a class of overspecified location Gaussian mixture models and
provide convergence rates for the parameters with precise dependence on the dimension d,
without requiring any boundedness assumption on the parameter space. Concretely, suppose
that X_1^n = (X_1, . . . , X_n) are i.i.d. samples from a single Gaussian distribution N(θ*, I_d) where
θ* = 0. Suppose that we fit such a dataset using an overspecified Bayesian location Gaussian
mixture model of the form
θ ∼ π(θ),   c_i ∈ {−1, 1} ∼i.i.d. Cat(1/2, 1/2),   X_i | c_i, θ ∼ N(c_i θ, I_d),   (26)
where Cat(1/2, 1/2) stands for the categorical distribution with parameters (1/2, 1/2). Here
the prior function π is chosen to satisfy smoothness Assumptions A and B, with one example
being the location Gaussian density function. Our goal in this section is to characterize the
posterior contraction rate of the location parameter θ around θ ∗ .
In order to do so, we first define the sample log-likelihood function FnG given data X1n . It
has the form
F_n^G(θ) := (1/n) Σ_{i=1}^n log( (1/2) φ(X_i; −θ, I_d) + (1/2) φ(X_i; θ, I_d) ),   (27)

where x ↦ φ(x; θ, I_d) = (2π)^{−d/2} e^{−‖x−θ‖_2^2/2} denotes the density of the multivariate Gaussian
distribution N(θ, I_d). Similarly, the population log-likelihood function is given by

F^G(θ) := E_X[ log( (1/2) φ(X; −θ, I_d) + (1/2) φ(X; θ, I_d) ) ],   (28)
where the outer expectation in the above display is taken with respect to X ∼ N (θ ∗ , Id ).
In Appendix A.3, we prove that there is a universal constant c1 > 0 such that

⟨∇F^G(θ), θ* − θ⟩ ≥ { c1 ‖θ − θ*‖_2^4,  for all ‖θ − θ*‖_2 ≤ √2;   4c1( ‖θ − θ*‖_2^2 − 1 ),  otherwise },   (29a)
and moreover, there are universal constants (c, c2 ) such that for any δ ∈ (0, 1), given a sample
size n ≥ cd log(1/δ), we have
sup_{θ∈B(θ*,r)} ‖∇F_n^G(θ) − ∇F^G(θ)‖_2 ≤ c2 ( r + 1/√n ) √((d + log(log n/δ))/n),   (29b)
with probability 1 − δ.
Given the above results, the functions ψ and ζ in Assumptions W.1 and W.2 take the
form
ψ(r) = { c1 r^4,  for all 0 < r ≤ √2;   4c1( r^2 − 1 ),  otherwise },  and  ζ(r) = r + 1/√n for all r > 0.   (30)
These functions satisfy the conditions of Assumptions W.3 and W.4. Applying Theorem 2 therefore leads to
the following result regarding the posterior contraction rate of parameters under overspecified
Bayesian location Gaussian mixtures (26):
Corollary 4. Given the overspecified Bayesian location Gaussian mixture model (26), there
are universal constants c, c′ such that given any δ ∈ (0, 1) and a sample size n ≥ c′ d log(1/δ),
we have
Π( ‖θ − θ*‖_2 ≥ c ((d + log(log n/δ) + B)/n)^{1/4} | X_1^n ) ≤ δ
5 Proofs
This section is devoted to the proofs of Theorem 1 and Theorem 2, with more technical aspects
deferred to the appendices.
≤ (pC)^{p/4} E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ*‖_2^2 · ∫_0^T e^{αs} ds )^{p/4} ]
  ≤ ( pCe^{αT}/α )^{p/4} E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ*‖_2^2 )^{p/4} ],
For the right-hand side of the above inequality, we can relate it to the left-hand side by using
Young's inequality, which yields

( pCe^{αT}/α )^{p/4} E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ*‖_2^2 )^{p/4} ] ≤ (1/2)( pCe^{αT}/α )^{p/2} + (1/2) E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ*‖_2^2 )^{p/2} ]
for universal constant C ′ > 0. Therefore, the diffusion process defined in equation (5) satisfies
the following inequality
sup_{t≥0} ( E[ ‖θ_t − θ*‖_2^p ] )^{1/p} ≤ c ( √((B + d)/(µn)) + ε_2(n, δ)/µ + √(p/(nµ)) )
for any p ≥ 1. Combining the above inequality with the inequality (6) yields the conclusion
of the theorem.
Proof of claim (31): For the given choice α > 0, an application of Itô’s formula yields the
decomposition
(1/2) e^{αt} ‖θ_t − θ*‖_2^2 = −(1/2) ∫_0^t ⟨θ* − θ_s, ∇F_n(θ_s)⟩ e^{αs} ds + (1/(2n)) ∫_0^t ⟨θ_s − θ*, ∇log π(θ_s)⟩ e^{αs} ds
  + (d/(2n)) ∫_0^t e^{αs} ds + (1/√n) ∫_0^t e^{αs} ⟨θ_s − θ*, dB_s⟩ + (α/2) ∫_0^t e^{αs} ‖θ_s − θ*‖_2^2 ds
  =: J_1 + J_2 + J_3 + J_4 + J_5.   (32)
We begin by bounding the term J_1 in equation (32). Based on Assumption S.2 on the
perturbation error between F_n and F and the strong concavity of F, we have
J_1 = −(1/2) ∫_0^t ⟨θ* − θ_s, ∇F_n(θ_s)⟩ e^{αs} ds
  ≤ −(1/2) ∫_0^t ⟨θ* − θ_s, ∇F(θ_s)⟩ e^{αs} ds + (1/2) ∫_0^t ‖θ_s − θ*‖_2 ‖∇F(θ_s) − ∇F_n(θ_s)‖_2 e^{αs} ds
  ≤ −(µ/2) ∫_0^t ‖θ_s − θ*‖_2^2 e^{αs} ds + (1/2) ∫_0^t ‖θ_s − θ*‖_2 ( ε_1(n, δ) ‖θ_s − θ*‖_2 + ε_2(n, δ) ) e^{αs} ds
  ≤ −(µ/2) ∫_0^t ‖θ_s − θ*‖_2^2 e^{αs} ds + (1/2) ∫_0^t ‖θ_s − θ*‖_2^2 ( ε_1(n, δ) + µ/3 ) e^{αs} ds + (3ε_2^2(n, δ)/(2µ)) ∫_0^t e^{αs} ds.
The second term J2 involving prior π can be controlled in the following way:
J_2 = (1/(2n)) ∫_0^t ⟨θ_s − θ*, ∇log π(θ_s)⟩ e^{αs} ds ≤ (B/(2n)) ∫_0^t e^{αs} ds = (e^{αt} − 1)B/(2αn).
For the third term J3 , a direct calculation leads to
J_3 = d(e^{αt} − 1)/(2αn).
Moving to the fourth term J_4, it is a martingale, as J_4 = M_t/√n. Putting the above results
together, with α = µ/2 − ε_1(n, δ) > µ/6, we obtain

(1/2) e^{αt} ‖θ_t − θ*‖_2^2 ≤ (1/√n) M_t + ((e^{αt} − 1)/(2α)) U_n.
Since ν_{(p)} is convex and increasing, the numerator is a decreasing positive function of r.
Additionally, the denominator is an increasing positive function of r. Therefore, the derivative
in equation (34) is a decreasing function of r, and the function r ↦ [ν_{(p)}^{−1}(r)]^{(p−2)/(p−1)} is concave. We
denote

φ(r) := −r + ε(n, δ) τ_{(p)}^{−1}(r) + ((B + (p − 1)d)/n) [ν_{(p)}^{−1}(r)]^{(p−2)/(p−1)}.
n
Then, we have φ(0) = 0 and φ is a concave function. Suppose that r∗ is the smallest positive
solution to the following equation:
Since ν(p) is a convex and strictly increasing function, by means of Jensen’s inequality, we
have
R_p(t) = E[ ‖θ_t − θ*‖_2^{p−2} ψ(‖θ_t − θ*‖_2) ] ≥ ν_{(p)}( E[ ‖θ_t − θ*‖_2^{p−1} ] ).

Therefore, by denoting z_* := ( lim_{t→+∞} E[ ‖θ_t − θ*‖_2^{p−1} ] )^{1/(p−1)}, we have z_*^{p−1} ≤ ν_{(p)}^{−1}(r_*). Hence,
we arrive at the following inequality:

z_*^{p−2} ψ(z_*) ≤ ε(n, δ) τ_{(p)}^{−1}( ν_{(p)}(z_*^{p−1}) ) + ((B + (p − 1)d)/n) z_*^{p−2}
  = ε(n, δ) z_*^{p−1} ζ(z_*) + ((B + (p − 1)d)/n) z_*^{p−2}.

As a consequence, we find that

ψ(z_*) ≤ ε(n, δ) ζ(z_*) z_* + (B + (p − 1)d)/n.
n
Now, we claim that there exists a unique positive solution to equation (9). Given this claim,
replacing p by (p + 1) and putting the above results together yields

lim_{t→+∞} ( E[ ‖θ_t − θ*‖_2^p ] )^{1/p} ≤ z_p^*,

where z_p^* is the unique positive solution to

ψ(z) = ε(n, δ) ζ(z) z + (B + pd)/n.
Combining the above inequality with the inequality (6) yields the conclusion of the theorem.
We now return to prove our earlier claims about the behavior of the functions ν(p) , τ(p) , the
moment bound (33), and the existence of a unique positive solution to equation (9).
Structure of the function ν_{(p)}: Since ψ is a convex and strictly increasing function, writing
u = r^{1/(p−1)} so that ν_{(p)}(r) = u^{p−2} ψ(u), a direct computation of the second derivative gives

ν_{(p)}''(r) = ( u^{2−p} / (p − 1)^2 ) [ ψ''(u) + (p − 2) ( uψ'(u) − ψ(u) ) / u^2 ] ≥ 0

for all r > 0, where the inequality uses the convexity of ψ together with ψ(0) = 0, which implies
uψ'(u) ≥ ψ(u). As a consequence, the function ν_{(p)} is convex.
for all r > 0. As a consequence, the function ν(p) is convex.
Structure of the function τ_{(p)}: This proof exploits Assumption W.3 on the functions
ψ and ζ. For any p ≥ 2, we denote by ζ_{(p)}: r ↦ r^{p−1}ζ(r) and ψ_{(p)}: r ↦ r^{p−2}ψ(r) two
strictly increasing functions. Therefore, we can define the function τ_{(p)} := ψ_{(p)} ◦ ζ_{(p)}^{−1}; namely,
τ_{(p)}(r^{p−1}ζ(r)) = r^{p−2}ψ(r) for any r > 0. Following some calculation, we find that

∇_r [ τ_{(p)}(r^{p−1}ζ(r)) ] = ( (p − 1)r^{p−2}ζ(r) + r^{p−1}ζ'(r) ) τ_{(p)}'(r^{p−1}ζ(r))
  = (p − 2)r^{p−3}ψ(r) + r^{p−2}ψ'(r).
Setting z = ζ_{(p)}(r) leads to

∇_z τ_{(p)}(z) = ( (p − 2)ψ(r) + rψ'(r) ) / ( (p − 1)rζ(r) + r^2 ζ'(r) ).

Taking another derivative of the above term, we find that

∇_z^2 τ_{(p)}(z) = ( ζ_{(p)}'(r) )^{−1} · g(r, p) / ( (p − 1)rζ(r) + r^2 ζ'(r) )^2,

where we denote

g(r, p) := ( (p − 1)rζ(r) + r^2 ζ'(r) ) · ( (p − 1)ψ'(r) + rψ''(r) )
  − ( (p − 1)ζ(r) + (p + 1)rζ'(r) + r^2 ζ''(r) ) · ( (p − 2)ψ(r) + rψ'(r) ).
According to Assumption W.3, τ_{(2)} = ψ_{(2)} ◦ ζ_{(2)}^{−1} is a convex function. Therefore, we have
g(r, 2) ≥ 0 for any r > 0. Simple algebra with the first-order derivative of the function g with respect
to the parameter p leads to

∇_p g(r, p) = ζ(r) · ( (p − 1)rψ'(r) + r^2 ψ''(r) − (p − 2)ψ(r) − rψ'(r) )
  − rζ'(r) ( (p − 2)ψ(r) + rψ'(r) ) + rψ'(r) · ( (p − 1)ζ(r) + rζ'(r) )
  − ψ(r) · ( (p − 1)ζ(r) + (p + 1)rζ'(r) + r^2 ζ''(r) )
  = 2(p − 2) ( rψ'(r)ζ(r) − ψ(r)ζ(r) − rζ'(r)ψ(r) )
  + ( r^2 ζ(r)ψ''(r) + rψ'(r)ζ(r) − 3ψ(r)ζ(r) − r^2 ψ(r)ζ''(r) ) ≥ 0
for all r > 0, where the last inequality is due to Assumption W.3. Therefore, the function g is increasing in p for p ≥ 2, which implies that g(r, p) ≥ g(r, 2) ≥ 0 for all r > 0. Given that inequality, we have $\frac{d^2}{dz^2} \tau_{(p)}(z) \ge 0$ for any z ≥ 0 and p ≥ 2; that is, the function τ(p)(z) is convex in z = ζ(p)(r).
Proof of claim (33): For any p ≥ 2, an application of Itô's formula yields the bound $\|\theta_t - \theta^*\|_2^p \le \sum_{j=1}^{5} T_j$, where
$$\begin{aligned}
T_1 &:= -\frac{p}{2} \int_0^t \langle \theta^* - \theta_s, \nabla F(\theta_s) \rangle\, \|\theta_s - \theta^*\|_2^{p-2}\, ds, &(35a)\\
T_2 &:= \frac{p}{2} \int_0^t \langle \theta^* - \theta_s, \nabla F(\theta_s) - \nabla F_n(\theta_s) \rangle\, \|\theta_s - \theta^*\|_2^{p-2}\, ds, &(35b)\\
T_3 &:= \frac{p}{2n} \int_0^t \langle \theta_s - \theta^*, \nabla \log \pi(\theta_s) \rangle\, \|\theta_s - \theta^*\|_2^{p-2}\, ds, &(35c)\\
T_4 &:= \frac{p}{\sqrt{n}} \int_0^t \|\theta_s - \theta^*\|_2^{p-2}\, \langle \theta_s - \theta^*, dB_s \rangle, &(35d)\\
T_5 &:= \frac{p(p-1)d}{2n} \int_0^t \|\theta_s - \theta^*\|_2^{p-2}\, ds. &(35e)
\end{aligned}$$
We now upper bound the terms {Tj}_{j=1}^{5} in terms of functionals of the quantity Rp. From the weak convexity of F guaranteed by Assumption W.1, we have
$$\mathbb{E}[T_1] = -\frac{p}{2}\, \mathbb{E}\left[ \int_0^t \langle \theta^* - \theta_s, \nabla F(\theta_s) \rangle\, \|\theta_s - \theta^*\|_2^{p-2}\, ds \right] \le -\frac{p}{2} \int_0^t R_p(s)\, ds. \qquad (36a)$$
Based on Assumption W.2, we find that
$$\mathbb{E}[T_2] = \frac{p}{2}\, \mathbb{E}\left[ \int_0^t \langle \theta^* - \theta_s, \nabla F(\theta_s) - \nabla F_n(\theta_s) \rangle\, \|\theta_s - \theta^*\|_2^{p-2}\, ds \right] \le \frac{p}{2}\, \varepsilon(n, \delta) \int_0^t \mathbb{E}\left[ \|\theta_s - \theta^*\|_2^{p-1}\, \zeta(\|\theta_s - \theta^*\|_2) \right] ds. \qquad (36b)$$
Since the function τ(p) is convex, invoking Jensen's inequality, we obtain the following inequalities:
$$\int_0^t \mathbb{E}\left[ \|\theta_s - \theta^*\|_2^{p-1}\, \zeta(\|\theta_s - \theta^*\|_2) \right] ds \le \int_0^t \tau_{(p)}^{-1}\left( \mathbb{E}\left[ \tau_{(p)}\left( \|\theta_s - \theta^*\|_2^{p-1}\, \zeta(\|\theta_s - \theta^*\|_2) \right) \right] \right) ds = \int_0^t \tau_{(p)}^{-1}(R_p(s))\, ds.$$
Moving to T3 in equation (35c), given Assumptions A and B on the smoothness of the prior distribution π, its expectation is bounded as
$$\mathbb{E}[T_3] = \frac{p}{2n}\, \mathbb{E}\left[ \int_0^t \langle \theta_s - \theta^*, \nabla \log \pi(\theta_s) \rangle\, \|\theta_s - \theta^*\|_2^{p-2}\, ds \right] \le \frac{pB}{2n} \int_0^t \mathbb{E}\left[ \|\theta_s - \theta^*\|_2^{p-2} \right] ds. \qquad (36c)$$
Since ν(p) is a strictly increasing and convex function on [0, +∞), the function ν(p)^{−1} is a concave function on [0, +∞). Invoking Jensen's inequality leads to the following inequality:
$$\int_0^t \mathbb{E}\left[ \|\theta_s - \theta^*\|_2^{p-2} \right] ds \le \int_0^t \left( \mathbb{E}\left[ \|\theta_s - \theta^*\|_2^{p-1} \right] \right)^{\frac{p-2}{p-1}} ds \le \int_0^t \left( \nu_{(p)}^{-1}\left( \mathbb{E}\left[ \nu_{(p)}\left( \|\theta_s - \theta^*\|_2^{p-1} \right) \right] \right) \right)^{\frac{p-2}{p-1}} ds \le \int_0^t \left( \nu_{(p)}^{-1}(R_p(s)) \right)^{\frac{p-2}{p-1}} ds. \qquad (36d)$$
For the term T4, the stochastic integral defines a martingale, so that E[T4] = 0. Furthermore, in light of the moment bound in equation (36d), we have
$$\mathbb{E}[T_5] \le \frac{p(p-1)d}{2n} \int_0^t \left( \nu_{(p)}^{-1}(R_p(s)) \right)^{\frac{p-2}{p-1}} ds. \qquad (36g)$$
Collecting the bounds on the expectations of the terms {Tj}_{j=1}^{5} from the displays above yields the claim (33).
Unique positive solution to equation (9): We now establish that equation (9) has a unique positive solution under the stated assumptions. Define the function
$$\vartheta(z) := \psi(z) - \varepsilon(n, \delta)\, \zeta(z)\, z - \frac{B + d \log(1/\delta)}{n}.$$
Since ψ(0) = 0, we have ϑ(0) < 0. On the other hand, based on Assumption W.4, lim inf_{z→+∞} ϑ(z) > 0. Therefore, there exists a positive solution to the equation ϑ(z) = 0. Recall that ξ: R+ → R is the inverse of the strictly increasing function z ↦ zζ(z); rewriting the function ϑ in terms of ξ then shows that ϑ crosses zero exactly once, which yields uniqueness.
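The root-finding argument above is easy to check numerically. The sketch below uses illustrative functional forms, ψ(z) = z⁴ (convex, strictly increasing, ψ(0) = 0) and ζ(z) = 1 + z, together with arbitrary values of ε(n, δ), B, d, δ, and n; none of these choices come from the paper:

```python
import math

# Sketch: bisection for the unique positive root of
#   vartheta(z) = psi(z) - eps * zeta(z) * z - (B + d * log(1/delta)) / n,
# under illustrative choices psi(z) = z**4 and zeta(z) = 1 + z.
psi = lambda z: z ** 4            # convex, strictly increasing, psi(0) = 0
zeta = lambda z: 1.0 + z
eps, B, d, delta, n = 0.05, 1.0, 10, 0.01, 1000
offset = (B + d * math.log(1.0 / delta)) / n

def vartheta(z):
    return psi(z) - eps * zeta(z) * z - offset

# vartheta(0) < 0 and vartheta(z) -> +inf, so a sign change exists;
# bisection then locates the positive root.
lo, hi = 0.0, 10.0
assert vartheta(lo) < 0 < vartheta(hi)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if vartheta(mid) < 0:
        lo = mid
    else:
        hi = mid
z_star = 0.5 * (lo + hi)
print(z_star, vartheta(z_star))
```

In this toy setting the root z* plays the role of the posterior contraction rate delivered by the weakly concave theory.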
6 Discussion
In this paper, we described an approach for analyzing the posterior contraction rates of parameters based on diffusion processes. Our theory depends on two important features: the convex-analytic structure of the population log-likelihood function F, and stochastic perturbation bounds between the gradient of F and the gradient of its sample counterpart Fn. For log-likelihoods that are strongly concave around the true parameter θ*, we established posterior convergence rates for parameter estimation of the order (d/n)^{1/2}, along with other model parameters, valid under sufficiently smooth conditions on the prior distribution π and mild conditions on the perturbation error between ∇Fn and ∇F. On the other hand, when the population log-likelihood function is weakly concave, our analysis shows that convergence rates are more delicate: they depend on an interplay between the degree of weak concavity and the stochastic error bounds. In this setting, we proved that the posterior convergence rate of the parameter is upper bounded by the unique positive solution of a non-linear equation determined by this interplay. As an illustration of our general theory, we derived posterior convergence rates for three concrete examples: Bayesian logistic regression models, Bayesian single index models, and over-specified Bayesian location Gaussian mixture models.
Let us now discuss a few directions that arise naturally from our work. First, the current results are not directly applicable for establishing the asymptotic convergence of the posterior distribution of the parameter when F is only locally weakly concave around the true parameter θ*. More precisely, when F is locally strongly concave around θ*, it is well known from the Bernstein-von Mises theorem that the posterior distribution of the parameter converges to a multivariate normal distribution centered at the maximum likelihood estimate (MLE) with covariance matrix given by I(θ*)^{−1}/n (e.g., see the book [32]), where I(θ*) denotes the Fisher information matrix at θ*. Under the weak concavity setting of F, the Fisher information matrix I(θ*) is degenerate. Therefore, the posterior distribution of the parameter can no longer be approximated by a multivariate Gaussian distribution in this setting.
Second, it is desirable to understand the posterior convergence rate of the parameter under non-convex settings of the population log-likelihood function F, which arise in various statistical models such as Latent Dirichlet Allocation [3]. Under these settings, without a careful analysis of the multi-modal structure of the population and sample log-likelihood functions, the diffusion approach may lead to an exponential dependence on the dimension d and sub-optimal dependence on other model parameters.
Acknowledgements
This work was partially supported by Office of Naval Research grant DOD ONR-N00014-18-
1-2640 and National Science Foundation grant NSF-DMS-1612948 to MJW; by NSF-DMS-
grant-1909365 joint to PLB and MJW; and by the Mathematical Data Science program of
the Office of Naval Research under grant number N00014-18-1-2764 to MIJ.
A Proofs of corollaries
In this appendix, we collect the proofs of several corollaries stated in Section 4. To summarize, we make use of Theorem 2 to establish the posterior contraction rates of parameters in all three examples. The crux of the proofs of these corollaries is a verification of Assumptions W.1, W.2, and W.3 so as to invoke the respective theorem. We first present the proof of Corollary 2 in Appendix A.1. Then, we move to the proofs of Corollary 3 and Corollary 4 in Appendices A.2 and A.3, respectively. Note that the values of universal constants may change from line to line.
where the above expectations are taken with respect to X ∼ N(0, σ²Id) and Y | X following the probability distribution generated from the logistic model (14). By taking the derivative of F^R(θ), we obtain
$$\langle \nabla F^R(\theta), \theta^* - \theta \rangle = \mathbb{E}\left[ \left( \frac{1 + e^{\langle X, \theta \rangle}}{1 + e^{\langle X, \theta^* \rangle}} - \frac{1 + e^{-\langle X, \theta \rangle}}{1 + e^{-\langle X, \theta^* \rangle}} \right) \frac{e^{-\langle X, \theta \rangle}}{\left( 1 + e^{-\langle X, \theta \rangle} \right)^2}\, \langle X, \theta - \theta^* \rangle \right].$$
By the mean value theorem, there exists ξ between 0 and ⟨X, θ − θ*⟩ such that
$$\frac{1 + e^{\langle X, \theta \rangle}}{1 + e^{\langle X, \theta^* \rangle}} - \frac{1 + e^{-\langle X, \theta \rangle}}{1 + e^{-\langle X, \theta^* \rangle}} = \langle X, \theta - \theta^* \rangle \left( \frac{e^{\langle X, \theta^* \rangle + \xi}}{1 + e^{\langle X, \theta^* \rangle}} + \frac{e^{-\langle X, \theta^* \rangle - \xi}}{1 + e^{-\langle X, \theta^* \rangle}} \right).$$
In light of the above equality, we arrive at the following inequalities:
$$\begin{aligned}
\langle \nabla F^R(\theta), \theta^* - \theta \rangle &\ge \mathbb{E}\left[ \inf_{|\xi| \in [0, |\langle X, \theta - \theta^* \rangle|]} \left( \frac{e^{\langle X, \theta^* \rangle + \xi}}{1 + e^{\langle X, \theta^* \rangle}} + \frac{e^{-\langle X, \theta^* \rangle - \xi}}{1 + e^{-\langle X, \theta^* \rangle}} \right) \frac{e^{-\langle X, \theta \rangle}}{\left( 1 + e^{-\langle X, \theta \rangle} \right)^2}\, |\langle X, \theta - \theta^* \rangle|^2 \right] \\
&\ge \mathbb{E}\left[ \frac{1}{2}\, e^{-|\langle X, \theta - \theta^* \rangle|}\, \frac{e^{-\langle X, \theta \rangle}}{\left( 1 + e^{-\langle X, \theta \rangle} \right)^2}\, |\langle X, \theta - \theta^* \rangle|^2 \right] \\
&\ge \frac{1}{8}\, \mathbb{E}\left[ e^{-|\langle X, \theta - \theta^* \rangle| - |\langle X, \theta \rangle|}\, |\langle X, \theta - \theta^* \rangle|^2 \right] \\
&\ge \frac{1}{8 e^4}\, \mathbb{E}\left[ \mathbf{1}_{\{ |\langle X, \theta \rangle| \le 2,\; |\langle X, \theta - \theta^* \rangle| \le 2 \}}\, |\langle X, \theta - \theta^* \rangle|^2 \right].
\end{aligned}$$
Since X ∼ N(0, Id), the pair (⟨X, θ⟩, ⟨X, θ − θ*⟩) is jointly Gaussian with mean zero and covariance matrix
$$\begin{pmatrix} \|\theta\|_2^2 & \langle \theta, \theta - \theta^* \rangle \\ \langle \theta, \theta - \theta^* \rangle & \|\theta - \theta^*\|_2^2 \end{pmatrix}.$$
Given that result, a direct calculation leads to
$$\mathbb{E}\left[ \mathbf{1}_{\{ |\langle X, \theta \rangle| \le 2,\; |\langle X, \theta - \theta^* \rangle| \le 2 \}}\, |\langle X, \theta - \theta^* \rangle|^2 \right] \ge \frac{c}{(1 + \|\theta\|_2)(1 + \|\theta - \theta^*\|_2)}\, \|\theta - \theta^*\|_2^2,$$
for a universal constant c > 0. Collecting the above results, for all θ such that ‖θ − θ*‖₂ ≤ 1, we achieve that
$$\langle \nabla F^R(\theta), \theta^* - \theta \rangle \ge \frac{c}{(1 + \|\theta\|_2)(1 + \|\theta - \theta^*\|_2)}\, \|\theta - \theta^*\|_2^2 \ge c\, \frac{1}{1 + \|\theta^*\|_2}\, \|\theta - \theta^*\|_2^2.$$
For θ with ‖θ − θ*‖₂ > 1, let $\tilde{\theta} = \theta^* + \frac{\theta - \theta^*}{\|\theta - \theta^*\|_2}$. Then, we find that
$$\langle \nabla F^R(\theta), \theta^* - \theta \rangle \ge \langle \nabla F^R(\tilde{\theta}), \theta^* - \theta \rangle \ge \frac{c}{2(1 + \|\theta^*\|_2)}\, \|\theta - \theta^*\|_2,$$
with probability at least 1 − δ for any n/log n ≥ c₀ d log(1/δ), where c₀ is a universal constant.
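The key truncated second-moment bound used above is straightforward to probe by Monte Carlo. The sketch below checks, for an arbitrary pair of test points θ and θ* (hypothetical values chosen only for this demo), that the left-hand side indeed dominates the stated lower bound:

```python
import numpy as np

# Sketch: Monte Carlo check of the truncated second-moment bound
#   E[ 1{|<X,theta>| <= 2, |<X,theta-theta*>| <= 2} <X,theta-theta*>^2 ]
#     >= c * ||theta-theta*||^2 / ((1+||theta||)(1+||theta-theta*||))
# for X ~ N(0, I_d); theta and theta* below are arbitrary test points.
rng = np.random.default_rng(1)
d = 5
theta_star = np.zeros(d)
theta = 0.5 * np.ones(d)

X = rng.standard_normal((200_000, d))
a = X @ theta                  # <X, theta>, a mean-zero Gaussian
b = X @ (theta - theta_star)   # <X, theta - theta*>
lhs = float(np.mean((np.abs(a) <= 2) * (np.abs(b) <= 2) * b ** 2))

diff = float(np.linalg.norm(theta - theta_star))
rhs = diff ** 2 / ((1 + float(np.linalg.norm(theta))) * (1 + diff))
print(lhs, rhs)
```

For these test points the simulated left-hand side exceeds the right-hand side by a comfortable constant factor, consistent with the display above (with c = 1 already sufficing here).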
We begin by writing Z as the supremum of a stochastic process. Let Sd−1 denote the Euclidean sphere in Rd, and define the stochastic process
$$Z_{u, \theta} := \frac{1}{n} \sum_{i=1}^{n} \left( f_{u, \theta}(X_i, Y_i) - \mathbb{E}[f_{u, \theta}(X, Y)] \right), \quad \text{where } f_{u, \theta}(x, y) = \frac{y \langle x, u \rangle\, e^{y \langle x, \theta \rangle}}{1 + e^{y \langle x, \theta \rangle}},$$
indexed by vectors u ∈ Sd−1 and θ ∈ B(θ*; r). The outer expectation in the above display is taken with respect to (X, Y) drawn from the logistic model (14). Observe that $Z = \sup_{u \in S^{d-1}} \sup_{\theta \in \mathbb{R}^d} Z_{u, \theta}$. Let {u1, ..., uN} be a 1/8-covering of Sd−1 in the Euclidean norm; there exists such a set with N ≤ 17^d elements. By a standard discretization argument (see Chapter 6 of [35]), the supremum over Sd−1 can be reduced to a maximum over this covering set at the cost of a universal constant factor. Accordingly, the remainder of our argument focuses on bounding the random variable $V := \sup_{\theta \in \mathbb{R}^d} Z_{u, \theta}$, where the vector u ∈ Sd−1 should be understood as arbitrary but fixed.
Define the symmetrized random variable
$$V' := \sup_{\theta \in \mathbb{R}^d} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \langle X_i, u \rangle\, \varphi_\theta(X_i, Y_i) \right|, \quad \text{where } \varphi_\theta(x, y) = \frac{y\, e^{y \langle x, \theta \rangle}}{1 + e^{y \langle x, \theta \rangle}},$$
and {εi}_{i=1}^{n} is an i.i.d. sequence of Rademacher variables. By a symmetrization inequality for probabilities [33], we have
$$\mathbb{P}[V \ge t] \le c_1\, \mathbb{P}\left[ V' \ge c_2 t \right],$$
where c1 and c2 are two positive universal constants. We now analyze the Rademacher process that defines V′ conditionally on {(Xi, Yi)}_{i=1}^{n}. We first use a functional Bernstein inequality to control its deviations above its expectation. For a parameter b > 0 to be chosen, define the event
$$\mathcal{E}_b := \left\{ \frac{1}{n} \sum_{i=1}^{n} \langle X_i, u \rangle^2 \le 2 \quad \text{and} \quad |\langle X_i, u \rangle| \le b \ \text{for all } i = 1, \ldots, n \right\}.$$
Setting $t = \sqrt{n/d}$ and using the assumption that n/log n ≥ c₀ d, we find that
$$\mathbb{P}\left[ \max_{i=1,\ldots,n} |\langle X_i, u \rangle| \ge b \right] \le e^{-c_2 n/d}, \quad \text{where } b := c_1 \sqrt{n/d}.$$
Consequently, we have
$$\mathbb{P}\left( V' \ge \mathbb{E}[V'] + s \right) \le \exp\left( - \frac{n s^2}{16 e + c_3 \sqrt{n/d}\, s} \right) + e^{-c_2 n/d} + e^{-n/8} \le e^{-\frac{n s^2}{32 e}} + e^{-\frac{s \sqrt{n d}}{2 c_3}} + e^{-c_2 n/d} + e^{-n/8}. \qquad (38)$$
We now bound the expectation of V′, first over the Rademacher variables. Consider the following function class:
$$\mathcal{G} := \left\{ g_\theta: (x, y) \mapsto \langle x, u \rangle\, \varphi_\theta(x, y) \;\middle|\; \theta \in \mathbb{R}^d \right\}.$$
It is clear that the function class G has the envelope function Ḡ(x) := |⟨x, u⟩|. We claim that the L2-covering number of G can be bounded as
$$\bar{N}(t) := \sup_{Q} N\left( \mathcal{G}, \|\cdot\|_{L^2(Q)}, t\, \| \bar{G} \|_{L^2(Q)} \right) \le \left( \frac{1}{t} \right)^{c(d+1)} \quad \text{for all } t > 0. \qquad (39)$$
Up to this point, we have been conditioning on the observations {Xi}_{i=1}^{n}. Taking expectations over them as well yields
$$\mathbb{E}_{\varepsilon, X_1^n}[V'] \le C \cdot \mathbb{E}_{X_1^n}\left[ \sqrt{\frac{d}{n}\, P_n(\bar{G}^2)} \right] \overset{(i)}{\le} C \cdot \sqrt{\frac{d}{n}\, \mathbb{E}_{X_1^n}\left[ P_n(\bar{G}^2) \right]} \overset{(ii)}{=} C \sqrt{\frac{d}{n}}, \qquad (40)$$
where step (i) follows from Jensen's inequality, and step (ii) uses the fact that $\mathbb{E}_{X_1^n}[P_n(\bar{G}^2)] = 1$.
Putting together the bounds (38) and (40) yields
$$\mathbb{P}\left( V' \ge C' \sqrt{\frac{d}{n}} + s \right) \le e^{-\frac{n s^2}{32 e}} + e^{-\frac{s \sqrt{n d}}{2 c_3}} + e^{-c_2 n/d} + e^{-n/8}.$$
This probability bound holds for each u ∈ Sd−1. Taking the union bound over the 1/8-covering set {u1, ..., uN} of Sd−1 with N ≤ 17^d, and choosing
$$s = c' \left( \sqrt{\frac{d}{n}} + \sqrt{\frac{\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{n} \right)$$
yields the claim.
Proof of claim (39): Consider a fixed sequence (xi, yi, ti)_{i=1}^{m}, where yi ∈ {−1, 1}, xi ∈ Rd, and ti ∈ R for i ∈ [m]. Now, we suppose that for any binary sequence (zi)_{i=1}^{m} ∈ {0, 1}^m, there exists θ ∈ Rd such that
where we define
$$J_1 := p \sup_{\theta \in B(\theta^*, r)} \left\| \frac{1}{n} \sum_{i=1}^{n} Y_i X_i \left( X_i^\top \theta \right)^{p-1} \right\|_2, \qquad (41a)$$
and
$$J_2 := p \sup_{\theta \in B(\theta^*, r)} \left\| \frac{1}{n} \sum_{i=1}^{n} X_i \left( X_i^\top \theta \right)^{2p-1} - \mathbb{E}_X\left[ X \left( X^\top \theta \right)^{2p-1} \right] \right\|_2. \qquad (41b)$$
We claim that there is a universal constant c such that for any δ ∈ (0, 1), the quantities J1 and J2 can be bounded as
$$J_1 \le c\, r^{p-1} \left( \sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}} \left( d + \log \frac{n}{\delta} \right)^{p+1} \right), \qquad (42a)$$
$$J_2 \le c\, r^{2p-1} \left( \sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}} \left( d + \log \frac{n}{\delta} \right)^{2p+1} \right). \qquad (42b)$$
Thus, in order to establish the claim (42a), it suffices to show that there is a universal constant c such that
$$\mathbb{P}\left( Z \le c \left( \sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}} \left( d + \log \frac{n}{\delta} \right)^{p+1} \right) \right) \ge 1 - \delta. \qquad (44)$$
where N(1/8, Sd−1, ‖·‖₂) is the 1/8-covering of Sd−1 under the ‖·‖₂ norm. Therefore, it is sufficient to bound Zu for any fixed u ∈ N(1/8, Sd−1, ‖·‖₂). For any even integer q ≥ 2, a symmetrization argument (e.g., Theorem 4.10 of [35]) yields
$$\mathbb{E}\left[ \sup_{\theta \in S^{d-1}} \left| \frac{1}{n} \sum_{i=1}^{n} Y_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{p-1} \right|^q \right] \le \mathbb{E}\left[ \sup_{\theta \in S^{d-1}} \left| \frac{2}{n} \sum_{i=1}^{n} \varepsilon_i Y_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{p-1} \right|^q \right].$$
By choosing t = 2^{−(p+2)}, the above inequality leads to R(Sd−1) ≤ 2R(N(2^{−(p+2)})). In order to obtain a high-probability upper bound on R(N(2^{−(p+2)})), we bound its moments. By the union bound, for any q ≥ 1, we have
$$\mathbb{E}\left[ R\left( N(2^{-(p+2)}) \right)^q \right] \le p \cdot N(2^{-(p+2)}) \cdot \sup_{\theta \in S^{d-1},\, p' \in [1, p]} \underbrace{\mathbb{E}\left[ \left| \frac{4}{n} \sum_{i=1}^{n} \varepsilon_i Y_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{p'-1} \right|^q \right]}_{:= T_1(\theta, p')}.$$
In order to upper bound T1(θ, p′), we apply Khintchine's inequality [4]; it guarantees that there is a universal constant C such that
$$T_1(\theta, p') \le \left( \frac{C q}{n^2} \right)^{q/2} \mathbb{E}\left[ \left( \sum_{i=1}^{n} Y_i^2 \left( X_i^\top u \right)^2 \left( X_i^\top \theta \right)^{2(p'-1)} \right)^{q/2} \right], \qquad (46a)$$
for any p′ ∈ [1, p]. In order to further upper bound the right-hand side, we define the function g_{θ,u}(x, y) := y²(x⊤u)²(x⊤θ)^{2(p′−1)}. For any i ∈ [n], we can verify that
$$\mathbb{E}\left[ g_{\theta, u}(X_i, Y_i) \right] = \mathbb{E}\left[ Y_i^2 \left( X_i^\top u \right)^2 \left( X_i^\top \theta \right)^{2(p'-1)} \right] \le (2p')^{p'},$$
$$\mathbb{E}\left[ g_{\theta, u}(X_i, Y_i)^q \right] = \mathbb{E}\left[ Y_i^{2q} \left( X_i^\top u \right)^{2q} \left( X_i^\top \theta \right)^{2(p'-1)q} \right] \le (2q)^q\, (2 p' q)^{p' q}.$$
Given the above bounds, invoking the result of Lemma 2 leads to the following probability bound:
$$\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} g_{\theta, u}(X_i, Y_i) - \mathbb{E}_{(X, Y)}\left[ g_{\theta, u}(X, Y) \right] \right| > (8p')^{p'} \sqrt{\frac{\log(4/\delta)}{n}} + \frac{1}{n} \left( 2 p' \log \frac{n}{\delta} \right)^{p'+1} \right) \le \delta,$$
for all δ > 0, where the outer expectation in the above display is taken with respect to (X, Y) such that X ∼ N(0, Id) and Y | X = x ∼ N((x⊤θ*)^p, 1). Putting the previous bounds together,
we obtain that
$$\begin{aligned}
\mathbb{E}\left[ \left( \frac{1}{n} \sum_{i=1}^{n} g_{\theta, u}(X_i, Y_i) \right)^{q/2} \right] &\le 2^{q/2} \left( \mathbb{E}_{(X, Y)}\left[ g_{\theta, u}(X, Y) \right] \right)^{q/2} + 2^{q/2}\, \mathbb{E}\left[ \left| \frac{1}{n} \sum_{i=1}^{n} g_{\theta, u}(X_i, Y_i) - \mathbb{E}_{(X, Y)}\left[ g_{\theta, u}(X, Y) \right] \right|^{q/2} \right] \\
&\le (4p')^{p'q} + q \int_0^{+\infty} \lambda^{q-1}\, \mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} g_{\theta, u}(X_i, Y_i) - \mathbb{E}_{(X, Y)}\left[ g_{\theta, u}(X, Y) \right] \right| > \lambda \right) d\lambda \\
&\le (4p')^{p'q} + C^{p'q}\, q \left( \frac{(16p')^{p'q}}{n^{q/2}}\, \Gamma(q/2) + \frac{(2p')^{(p'+1)q}}{n^q} \left( (2 \log n)^{(p'+1)q} + \Gamma\left( (p'+1)q \right) \right) \right), \qquad (46b)
\end{aligned}$$
where Γ(·) stands for the Gamma function. Combining the bounds (46a) and (46b), we arrive at the following upper bound on T1(θ, p′):
$$T_1(\theta, p') \le \left( \frac{C q}{n} \right)^{q/2} (4p')^{p'q} + C^{p'q}\, q \left( \frac{(16p')^{p'q}}{n^{q/2}}\, \Gamma(q/2) + \frac{(2p')^{(p'+1)q}}{n^q} \left( (2 \log n)^{(p'+1)q} + \Gamma\left( (p'+1)q \right) \right) \right), \qquad (47)$$
for any given u ∈ N(1/8, Sd−1, ‖·‖₂). Taking the supremum over u ∈ N(1/8, Sd−1, ‖·‖₂) on both sides of the above bound and applying Minkowski's inequality, we obtain that
$$\left( \mathbb{E}|Z|^q \right)^{1/q} \le \left( \frac{64}{7} \right)^{d/q} 2 \left( \mathbb{E}\left[ \sup_{\theta \in S^{d-1}} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i Y_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{p-1} \right|^q \right] \right)^{1/q} \le 2 \cdot 10^{d/q} \left( 2^{p+3}\, C_p \sqrt{\frac{q}{n}} + \frac{C_p\, q}{n} + \frac{C_p}{n^{3/2}} \left( \log n + q \right)^{p+1} \right),$$
where Cp is a universal constant depending only on p. By choosing q = d(p + 7) + log(2/δ) and using Markov's inequality, we find that
$$\mathbb{P}\left( |Z| \ge C_p \left( \sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}} \left( d + \log \frac{n}{\delta} \right)^{p+1} \right) \right) \le \delta.$$
Proof of claim (42b): In order to obtain a uniform concentration bound for J2, we use an argument similar to that in the proof of claim (42a). In particular, since the polynomial (x⊤θ)^{2p−1} is homogeneous in θ, using the same normalization as in equation (43), it suffices to demonstrate that
$$\mathbb{P}\left( W \le c\, r^{2p-1} \left( \sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}} \left( d + \log \frac{n}{\delta} \right)^{2p+1} \right) \right) \ge 1 - \delta, \qquad (48)$$
for any δ > 0, where
$$W := \sup_{\theta \in S^{d-1}} \left\| \frac{1}{n} \sum_{i=1}^{n} X_i \left( X_i^\top \theta \right)^{2p-1} - \mathbb{E}_X\left[ X \left( X^\top \theta \right)^{2p-1} \right] \right\|_2.$$
For each u ∈ Rd, define the random variable
$$W_u := \sup_{\theta \in S^{d-1}} \left| \frac{1}{n} \sum_{i=1}^{n} \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{2p-1} - \mathbb{E}_X\left[ \left( X^\top u \right) \left( X^\top \theta \right)^{2p-1} \right] \right|.$$
It suffices to bound Wu for fixed u ∈ N(1/8, Sd−1, ‖·‖₂). We bound Wu by controlling its moments. By a symmetrization argument, we have
$$\mathbb{E}\left[ \sup_{\theta \in S^{d-1}} \left| \frac{1}{n} \sum_{i=1}^{n} \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{2p-1} - \mathbb{E}_X\left[ \left( X^\top u \right) \left( X^\top \theta \right)^{2p-1} \right] \right|^q \right] \le \mathbb{E}\left[ \sup_{\theta \in S^{d-1}} \left| \frac{2}{n} \sum_{i=1}^{n} \varepsilon_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{2p-1} \right|^q \right].$$
From here, we can use the same technique as that in and after inequality (45) to bound the right-hand side of the above display. Therefore, we only highlight the main differences here. For any compact set Ω ⊆ Rd, we define the random variable
$$Q(\Omega) := \sup_{\theta \in \Omega,\, p' \in [1, p]} \left| \frac{2}{n} \sum_{i=1}^{n} \varepsilon_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{2p'-1} \right|.$$
Following a similar argument to that in equation (45), we can check that Q(Sd−1) ≤ 2Q(N(2^{−(2p+2)})). A direct application of the union bound leads to
$$\mathbb{E}\left[ Q\left( N(2^{-(2p+2)}) \right)^q \right] \le 2p \cdot N(2^{-(2p+2)}) \cdot \sup_{\theta \in S^{d-1},\, p' \in [1, p]} \underbrace{\mathbb{E}\left[ \left| \frac{4}{n} \sum_{i=1}^{n} \varepsilon_i \left( X_i^\top u \right) \left( X_i^\top \theta \right)^{2p'-1} \right|^q \right]}_{:= T_2(\theta, p')}.$$
We control T2(θ, p′) using the same approach as in the proof of claim (42a). For notational convenience, we denote h_{θ,u}(x) := (x⊤u)²(x⊤θ)^{2(2p′−1)}. Simple algebra leads to a moment bound analogous to (47),
where Cp is a universal constant depending only upon p. With the choice of q = d(2p + 7) + log(2/δ), we obtain that
$$\mathbb{P}\left( |W| \ge C_p \left( \sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}} \left( d + \log \frac{n}{\delta} \right)^{2p+1} \right) \right) \le \delta.$$
A.3.1 Structure of F^G
Direct algebra leads to the following equation:
$$\langle \nabla F^G(\theta), \theta^* - \theta \rangle = \left[ \theta - \mathbb{E}\left[ X \tanh\left( X^\top \theta \right) \right] \right]^\top (\theta - \theta^*) \ge \|\theta\|_2^2 - \|\theta\|_2 \left\| \mathbb{E}\left[ X \tanh\left( X^\top \theta \right) \right] \right\|_2, \qquad (49)$$
where tanh(x) := (exp(x) − exp(−x))/(exp(x) + exp(−x)) for all x ∈ R. From Theorem 2 in Dwivedi et al. [9], we have
$$\left\| \mathbb{E}\left[ X \tanh\left( X^\top \theta \right) \right] \right\|_2 \le \left( 1 - \frac{p\, \|\theta\|_2^2 / 2}{1 + \|\theta\|_2^2 / 2} \right) \|\theta\|_2$$
for all θ ∈ Rd, where p := P(|Y| ≤ 1) + (1/2) P(|Y| > 1) with Y ∼ N(0, 1). Plugging the above inequality into equation (49) leads to
$$\langle \nabla F^G(\theta), \theta^* - \theta \rangle \ge \frac{p\, \|\theta\|_2^4}{2 + \|\theta\|_2^2} \ge \begin{cases} \dfrac{p}{4}\, \|\theta\|_2^4, & \text{for } \|\theta\|_2 \le \sqrt{2}, \\[4pt] \dfrac{p}{2}\, \left( 2 \|\theta\|_2 - 1 \right), & \text{otherwise}. \end{cases}$$
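Both the constant p and the piecewise lower bound above are easy to verify numerically; the following sketch computes p from the standard normal distribution function and checks the two branches of the inequality on a grid of norms:

```python
import math

# Sketch: check the curvature lower bound for the mixture model,
#   p * t^4 / (2 + t^2) >= (p/4) * t^4        for t = ||theta||_2 <= sqrt(2),
#   p * t^4 / (2 + t^2) >= (p/2) * (2t - 1)   otherwise,
# with p = P(|Y| <= 1) + (1/2) * P(|Y| > 1) for Y ~ N(0, 1).
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
p_inside = 2.0 * Phi(1.0) - 1.0           # P(|Y| <= 1), about 0.6827
p = p_inside + 0.5 * (1.0 - p_inside)     # about 0.8413

ok = True
for k in range(1, 500):
    t = 0.01 * k                          # grid over ||theta||_2 in (0, 5]
    lhs = p * t ** 4 / (2.0 + t ** 2)
    rhs = p / 4.0 * t ** 4 if t <= math.sqrt(2.0) else p / 2.0 * (2.0 * t - 1.0)
    ok = ok and lhs >= rhs - 1e-12
print(p, ok)
```

The first branch uses 2 + t² ≤ 4 on the stated range, and the second uses t² ≥ 2t − 1, which holds for all t.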
Moving to the perturbation error, we have
$$\nabla F_n^G(\theta) - \nabla F^G(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i \tanh\left( X_i^\top \theta \right) - \mathbb{E}\left[ X \tanh\left( X^\top \theta \right) \right] \right).$$
The outer expectation in the above display is taken with respect to X ∼ N(θ*, σ²Id), where θ* = 0. Based on the proof of Lemma 1 in the paper [9], for each r > 0, we have the following concentration inequality:
$$\mathbb{P}\left( \sup_{\theta \in B(\theta^*, r)} \left\| \frac{1}{n} \sum_{i=1}^{n} \left( X_i \tanh\left( X_i^\top \theta \right) - \mathbb{E}\left[ X \tanh\left( X^\top \theta \right) \right] \right) \right\|_2 \le c\, r \sqrt{\frac{d + \log(1/\delta)}{n}} \right) \ge 1 - \delta, \qquad (50)$$
for any δ > 0, as long as the sample size satisfies n ≥ c′ d log(1/δ), where c and c′ are universal constants.
For any M ∈ N+, by the concentration bound (50) and the union bound, we find that
$$\mathbb{P}\left( \forall r \in [2^{-M}, 1], \ \sup_{\theta \in B(\theta^*, r)} \left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 \le c\, r \sqrt{\frac{d + \log(M/\delta)}{n}} \right) \ge 1 - \delta. \qquad (51)$$
On the other hand, based on the standard inequality |tanh(x)| ≤ |x| for all x ∈ R, we find that
$$\begin{aligned}
\left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 &\le \frac{1}{n} \sum_{i=1}^{n} \|X_i\|_2 \left| \tanh\left( X_i^\top \theta \right) \right| + \mathbb{E}\left[ \|X\|_2 \left| \tanh\left( X^\top \theta \right) \right| \right] \\
&\le \frac{1}{n} \sum_{i=1}^{n} \|X_i\|_2 \left| X_i^\top \theta \right| + \mathbb{E}\left[ \|X\|_2 \left| X^\top \theta \right| \right] \\
&\le \left( \frac{1}{n} \sum_{i=1}^{n} \|X_i\|_2^2 + \mathbb{E}\left[ \|X\|_2^2 \right] \right) \|\theta\|_2.
\end{aligned}$$
Therefore, we have $\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \|_2 \le 2 d \|\theta\|_2 \log(1/\delta)$ with probability at least 1 − δ. By choosing M1 := log(2nd), based on the previous bound, we obtain that
$$\mathbb{P}\left( \forall r < 2^{-M_1}, \ \sup_{\theta \in B(\theta^*, r)} \left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 \le \frac{\log(1/\delta)}{n} \right) \ge 1 - \delta. \qquad (52)$$
Furthermore, for vectors θ ∈ Rd with large norm, by the concentration bound (50) combined with the union bound, for any M′ ∈ N+, we find that
$$\mathbb{P}\left( \forall r \in [1, 2^{M'}], \ \sup_{\theta \in B(\theta^*, r)} \left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 \le c\, r \sqrt{\frac{d + \log(M'/\delta)}{n}} \right) \ge 1 - \delta.$$
When r in the above bound is too large, we can simply use the fact that tanh is a bounded function. We thus have the upper bound
$$\left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 \le \mathbb{E}\left[ \|X\|_2 \right] + \frac{1}{n} \sum_{i=1}^{n} \|X_i\|_2,$$
for any θ. Given the above bound, by choosing M2 := log(2√n), we obtain that
$$\mathbb{P}\left( \forall r > 2^{M_2}, \ \sup_{\theta \in B(\theta^*, r)} \left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 \le r \sqrt{\frac{d + \log(1/\delta)}{n}} \right) \ge \mathbb{P}\left( \mathbb{E}\left[ \|X\|_2 \right] + \frac{1}{n} \sum_{i=1}^{n} \|X_i\|_2 \le 2^{M_2} \sqrt{\frac{d + \log(1/\delta)}{n}} \right) \ge 1 - \delta. \qquad (53)$$
Putting the bounds (51), (52), and (53) together, for n ≥ c d log(1/δ), the following probability bound holds:
$$\mathbb{P}\left( \forall r > 0, \ \sup_{\theta \in B(\theta^*, r)} \left\| \nabla F_n^G(\theta) - \nabla F^G(\theta) \right\|_2 \le c \left( r \sqrt{\frac{d + \log(\log n / \delta)}{n}} + \frac{\log(1/\delta)}{n} \right) \right) \ge 1 - \delta,$$
B Proofs of some auxiliary results
In this appendix, we state and prove a few technical lemmas used in the proofs of our main
results.
Proof. The proof of the lemma is a direct combination of a truncation argument and Bernstein's inequality. In particular, for each i ∈ [n], define the truncated random variable $\tilde{Y}_i := Y_i \cdot \mathbf{1}\left\{ |Y_i| \le 3 \left( a \log \frac{n}{\delta} \right)^b \right\}$. With this definition, we have
$$\mathbb{P}\left( (Y_i)_{i=1}^{n} \neq (\tilde{Y}_i)_{i=1}^{n} \right) = \mathbb{P}\left( \max_{1 \le i \le n} |Y_i| > 3 \left( a \log \frac{n}{\delta} \right)^b \right) \le n\, \mathbb{P}\left( |Y_i| > 3 \left( a \log \frac{n}{\delta} \right)^b \right) \le \frac{\delta}{2}.$$
Therefore, it is sufficient to study the concentration behavior of the quantity $\frac{1}{n} \sum_{i=1}^{n} \tilde{Y}_i$. Invoking Bernstein's inequality [4], we obtain that
$$\mathbb{P}\left( \left| \frac{1}{n} \sum_{i=1}^{n} \tilde{Y}_i \right| \ge \varepsilon \right) \le 2 \exp\left( - \frac{n \varepsilon^2}{2 (2a)^{2b} + \frac{2}{3}\, \varepsilon \cdot 3 \left( a \log \frac{n}{\delta} \right)^b} \right).$$
In order to make the right-hand side of the above inequality at most δ/2, it suffices to set
$$\varepsilon = (4a)^b \sqrt{\frac{\log(4/\delta)}{n}} + \left( a \log \frac{n}{\delta} \right)^b \frac{\log(4/\delta)}{n}.$$
Collecting all of the above inequalities yields the claim.
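The truncation-plus-Bernstein recipe is easy to exercise on a concrete heavy-tailed example. The sketch below takes Y = g³ for g ∼ N(0, 1), whose tails match the lemma's condition with the illustrative choices a = 2 and b = 3/2 (these values, and the simulation itself, are assumptions for the demo, not part of the lemma):

```python
import math
import numpy as np

# Sketch: truncation + Bernstein on a heavy-tailed example. For Y = g^3 with
# g ~ N(0,1), we have P(|Y| > (2*log u)^{3/2}) <= 2/u, matching the lemma's
# tail condition with the illustrative choices a = 2 and b = 3/2.
rng = np.random.default_rng(3)
n, delta, a, b = 100_000, 0.01, 2.0, 1.5
Y = rng.standard_normal(n) ** 3

level = 3.0 * (a * math.log(n / delta)) ** b
Y_trunc = np.where(np.abs(Y) <= level, Y, 0.0)   # Y_i * 1{|Y_i| <= level}

# With high probability no sample is truncated, and the empirical mean of the
# truncated variables concentrates at the rate eps from the lemma.
eps = (4 * a) ** b * math.sqrt(math.log(4 / delta) / n) \
    + (a * math.log(n / delta)) ** b * math.log(4 / delta) / n
untruncated = bool(np.all(Y == Y_trunc))
within_eps = abs(float(np.mean(Y_trunc))) <= eps
print(untruncated, within_eps)
```

In this run the truncation level is far above the largest sample, so the truncation step costs nothing, and the empirical mean falls well inside the Bernstein radius ε.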
References
[1] A. Barron, M. Schervish, and L. Wasserman. The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 27:536–561, 1999. (Cited on page 1.)
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. (Cited on pages 2 and 20.)
[5] R. J. Carroll and P. Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83:1184–1186, 1988. (Cited on page 10.)
[6] J. Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics,
23(1):221–233, 1995. (Cited on pages 3 and 14.)
[7] R. de Jonge and J. H. van Zanten. Adaptive nonparametric Bayesian inference using
location-scale mixture priors. Annals of Statistics, 38:3300–3320, 2010. (Cited on page 2.)
[10] D. A. Freedman. On the asymptotic behavior of Bayes estimates in the discrete case. Annals of Mathematical Statistics, 34:1386–1403, 1963. (Cited on page 1.)
[11] D. A. Freedman. On the asymptotic behavior of Bayes estimates in the discrete case. II. Annals of Mathematical Statistics, 36:454–456, 1965. (Cited on page 1.)
[12] C. Gao and H. H. Zhou. Rate exact Bayesian adaptation with modified block priors.
Annals of Statistics, 44:318–345, 2016. (Cited on page 2.)
[13] S. Ghosal, J. K. Ghosh, and A. van der Vaart. Convergence rates of posterior distributions.
Annals of Statistics, 28:500–531, 2000. (Cited on page 2.)
[14] S. Ghosal and A. van der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics, 29:1233–1263, 2001. (Cited on pages 2 and 12.)
[15] S. Ghosal and A. van der Vaart. Posterior convergence rates of Dirichlet mixtures at
smooth densities. Annals of Statistics, 35:697–723, 2007. (Cited on page 2.)
[16] N. Ho and X. Nguyen. Convergence rates of parameter estimation for some weakly
identifiable finite mixtures. Annals of Statistics, 44:2726–2755, 2016. (Cited on page 2.)
[17] H. Ishwaran, L. F. James, and J. Sun. Bayesian model selection in finite mixtures
by marginal density decompositions. Journal of the American Statistical Association,
96:1316–1332, 2001. (Cited on page 14.)
[19] B. Lindsay. Mixture Models: Theory, Geometry and Applications. In NSF-CBMS Re-
gional Conference Series in Probability and Statistics. IMS, Hayward, CA., 1995. (Cited
on page 12.)
[20] X. Mao. Stochastic Differential Equations and Applications. Elsevier, 2007. (Cited on
page 5.)
[21] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall/CRC,
1989. (Cited on pages 7 and 9.)
[22] X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013. (Cited on pages 2, 3, 13, and 14.)
[23] X. Nguyen. Borrowing strength in hierarchical Bayes: convergence of the Dirichlet base
measure. Bernoulli, 22:1535–1571, 2016. (Cited on page 2.)
[25] D. Revuz and M. Yor. Continuous Martingales and Brownian Motion, volume 293.
Springer-Verlag, third edition, 1999. (Cited on pages 5 and 14.)
[26] H. Risken. The Fokker-Planck Equation. Springer, 1996. (Cited on page 5.)
[27] J. Rousseau. Rates of convergence for the posterior distributions of mixtures of Betas
and adaptive nonparametric estimation of the density. Annals of Statistics, 38:146–180,
2010. (Cited on page 2.)
[29] L. Schwartz. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 4:10–26, 1965. (Cited on page 1.)
[30] W. Shen, S. R. Tokdar, and S. Ghosal. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika, 100:623–640, 2013. (Cited on page 2.)
[32] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998. (Cited on
page 20.)
[33] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes.
Springer-Verlag, New York, NY, 1996. (Cited on page 23.)
[34] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes:
With Applications to Statistics. Springer-Verlag, New York, NY, 2000. (Cited on page 25.)
[36] S. Walker. On sufficient conditions for Bayesian consistency. Biometrika, 90:482–488, 2003. (Cited on page 1.)
[38] S. G. Walker, A. Lijoi, and I. Prunster. On rates of convergence for posterior distributions
in infinite-dimensional models. Annals of Statistics, 35:738–746, 2007. (Cited on page 2.)
[39] Y. Wang. Convergence rates of latent topic models under relaxed identifiability conditions.
Electronic Journal of Statistics, 13:37–66, 2019. (Cited on page 2.)
[40] Y. Yang and D. B. Dunson. Bayesian manifold regression. Annals of Statistics, 44:876–
905, 2016. (Cited on page 2.)