
A Diffusion Process Perspective on Posterior Contraction Rates for Parameters

Wenlong Mou⋆,⋄   Nhat Ho⋆,⋄   Martin J. Wainwright⋄,†,‡
Peter Bartlett⋄,†   Michael I. Jordan⋄,†

Department of Electrical Engineering and Computer Sciences⋄
Department of Statistics†
UC Berkeley
Voleon Group‡

arXiv:1909.00966v1 [math.ST] 3 Sep 2019

September 4, 2019

Abstract

We show that diffusion processes can be exploited to study the posterior contraction rates of parameters in Bayesian models. By treating the posterior distribution as a stationary distribution of a stochastic differential equation (SDE), posterior convergence rates can be established via control of the moments of the corresponding SDE. Our results depend on the structure of the population log-likelihood function, obtained in the limit of an infinite sample size, and stochastic perturbation bounds between the population and sample log-likelihood functions. When the population log-likelihood is strongly concave, we establish posterior convergence of a d-dimensional parameter at the optimal rate (d/n)^{1/2}. In the weakly concave setting, we show that the convergence rate is determined by the unique solution of a non-linear equation that arises from the interplay between the degree of weak concavity and the stochastic perturbation bounds. We illustrate this general theory by deriving posterior convergence rates for three concrete examples: Bayesian logistic regression models, Bayesian single index models, and over-specified Bayesian mixture models.

1 Introduction
Bayesian inference is one of the central pillars of statistics. In Bayesian analysis, we first endow
the parameter space with a prior distribution, which represents a form of prior belief and
knowledge; the posterior distribution obtained by Bayes’ rule combines this prior information
with observations. Fundamental questions that arise in Bayesian inference include consistency
of the posterior distribution as the sample size goes to infinity, and from a more refined point
of view, the contraction rate of the posterior distribution.
The earliest work on posterior consistency dates back to the seminal work by Doob [8],
who demonstrated that, for any given prior, the posterior distribution is consistent for all
parameters apart from a set of zero measure. Subsequent work by Freedman [10, 11] pro-
vided examples showing that this null set can be problematic for Bayesian consistency in
nonparametric problems. In order to address this issue, Schwartz [29] proposed a general
framework for establishing posterior consistency for both semiparametric and nonparamet-
ric models. Since then, a number of researchers have isolated conditions that are useful for
studying posterior distributions [1, 36, 37].
⋆ Wenlong Mou and Nhat Ho contributed equally to this work.

Moving beyond consistency of the posterior distribution, convergence rates for the poste-
rior density function and associated parameters remain an active area of research. The seminal
paper by Ghosal et al. [13] lays out a general framework for analyzing rates of convergence of
posterior densities in both finite and infinite dimensional models. Their result relies on a sieve construction, meaning a series of subsets of the full model class such that: (a) the difference between these model subsets and the full model, as measured in terms of mass under the prior, vanishes as the sample size goes to infinity, and (b) the prior distribution puts a sufficient quantity of mass on a neighborhood around the true density function. Drawing on the framework of that paper, various works have established posterior convergence rates of density functions under several statistical models. For instance, the papers [14, 15, 27, 30] provided (adaptive) rates for the density function in Dirichlet process and nonparametric beta mixture models. Other work [2, 41, 40] established minimax optimal rates for the regression function in nonparametric regression models. Related problems include adaptive rates for the density in nonparametric Bayesian inference [7, 12], and posterior contraction rates of the density
density in nonparametric Bayesian inference [7, 12], and posterior contraction rates of density
under misspecified models [18]. Other popular general frameworks for analyzing the density
functions of posterior distributions include those of Shen and Wasserman [31], and Walker et
al. [38].
While convergence of posterior densities is relatively well understood, various issues remain unresolved for the corresponding questions about model parameters. On one hand, for certain classes of Bayesian (in)finite mixture models, the precise posterior convergence rates of parameters are well-studied [22, 16]. However, these results rely on the boundedness of the parameter space, which can be problematic due to model misspecification. Another popular example is the Latent Dirichlet Allocation model, which is widely applied in machine learning and genetics [3, 24]. Even though some recent works provided the specific dependence of posterior convergence rates of topics on the number of documents [23, 39], the concrete dependence of these rates on the number of words in each document or on the size of the vocabulary remains unknown. In these works, posterior convergence rates for parameters are established using techniques previously used in analyzing densities. More precisely, the approach is based on deriving lower bounds on distances between density functions, including the Hellinger or total variation distances [13], in terms of distances between the model parameters. Given lower bounds of this type, a convergence guarantee for the density can immediately be translated into a convergence guarantee for the parameters. A drawback of this technique is that it typically requires strong global geometric assumptions that are not always necessary, especially for weakly concave population log-likelihood functions, namely, the limit of the sample log-likelihood function as the sample size n grows to infinity. For example, in order to
guarantee, for some α ≥ 1, an n^{−1/(2α)} rate for parameter convergence, the bound in Ghosal et al. [13] requires the following conditions:

    −E_{θ∗}[ log(p_θ/p_{θ∗}) ] ≲ ‖θ − θ∗‖_2^{2α},   E_{θ∗}[ ( log(p_θ/p_{θ∗}) )^2 ] ≲ ‖θ − θ∗‖_2^{2α},   and
    ‖θ_1 − θ_2‖_2^{α} ≲ h(p_{θ_1}, p_{θ_2}) ≲ ‖θ_1 − θ_2‖_2^{α}.                                         (1)

Here θ∗ denotes the true parameter, the quantity p_θ is the density function at θ, and h is the Hellinger distance. The conditions (1) do not always hold for the statistical models mentioned earlier, since the log-likelihood function behaves differently according to its location within the unbounded parameter space. Furthermore, the precise dependence of the convergence rate on quantities other than the sample size n, including the dimension d, can be difficult to obtain using this technique.

In this paper, we study the posterior contraction rates of parameters via the lens of
diffusion processes. This approach does not require the parameter space to be bounded and
allows us to quantify the dependence of convergence rates on quantities other than the sample
size n. More specifically, we recast the posterior distribution as the stationary distribution of
a stochastic differential equation (SDE). In this way, by controlling the moments of the given
diffusion process, we can obtain bounds on posterior convergence rates of parameters. This
approach exploits two main features that have not been extensively explored in the Bayesian
literature: (i) the geometric structure of population log-likelihood function and (ii) uniform
perturbation bounds between the empirical and population log-likelihood functions.
At a high-level, our main contributions can be summarized as follows:

• Strongly concave setting: We first consider settings in which the population log-
likelihood function is strongly concave around the true parameter. We demonstrate that
as long as the prior distribution is sufficiently smooth and the perturbation error between
the population and empirical log-likelihood function is well-controlled, the posterior
contraction rate around the true parameter is (d/n)^{1/2}. In addition, this technique allows us to also quantify the dependence of the rate on other properties of the model.

• Weakly concave setting: Moving beyond the strongly concave setting, we then turn
to population log-likelihood functions that are only weakly concave. In this setting, our
analysis depends on two auxiliary functions: (i) a function ψ is used to capture the local
and global behavior of population log-likelihood function around the true parameter,
and (ii) a function ζ describes the growth of perturbation error between the population
and sample log-likelihood functions. Our analysis shows how the posterior contraction
rates for parameters can be derived from the unique positive solution of a non-linear
equation involving these two functions.

• Illustrative examples: We illustrate the general results for three concrete classes of
models: Bayesian logistic regression models, Bayesian single index models, and over-
specified Bayesian location Gaussian mixture models. The obtained rates show the
influence of different modeling assumptions. For instance, the posterior convergence rate of the parameter under Bayesian logistic regression models is (d/n)^{1/2}. In contrast, for Bayesian single index models with polynomial link functions, we exhibit rates of the form (d/n)^{1/(2p)}, where p ≥ 2 is the degree of the polynomial. Finally, for over-specified location Gaussian mixtures, we establish a convergence rate of the order (d/n)^{1/4}. While the n^{−1/4} component of this rate is known from past work [6, 22], the corresponding scaling of d^{1/4} with dimension appears to be a novel result.

The remainder of the paper is organized as follows. In Section 2, we set up the basic framework for Bayesian models and introduce a diffusion process that admits the posterior distribution as its stationary distribution. Section 3 is devoted to establishing the general results for posterior convergence rates of parameters under various assumptions on the concavity of the population log-likelihood. In Section 4, we apply these general results to derive concrete rates of convergence for a number of illustrative examples. The proofs of the main theorems are provided in Section 5, while the proofs of auxiliary lemmas and corollaries are deferred to the appendices. We conclude our work with a discussion in Section 6.

Notation. In the paper, the expression a_n ≳ b_n will be used to denote a_n ≥ c b_n for some positive universal constant c that does not change with n. Additionally, we write a_n ≍ b_n if both a_n ≳ b_n and a_n ≲ b_n hold. For any n ∈ N, we denote [n] = {1, 2, . . . , n}. The notation S^{d−1} stands for the unit sphere, namely, the set of vectors u ∈ R^d such that ‖u‖_2 = 1. For any subset Θ of R^d, r ≥ 1, and ε > 0, we denote by N(ε, Θ, ‖·‖_r) the covering number of Θ under the ‖·‖_r norm, namely, the minimum number of ε-balls under the ‖·‖_r norm needed to cover the entire set Θ. Finally, for any x, y ∈ R, we denote x ∨ y = max{x, y}.

2 Background and problem formulation


We first provide the formulation of the posterior distribution of the parameter and its convergence rate in Section 2.1. Then we demonstrate that the posterior distribution is the stationary distribution of a diffusion process in Section 2.2. Finally, we define the population log-likelihood function and provide smoothness conditions on that function and the prior distribution in Section 2.3.

2.1 Posterior contraction rates for parameters


Consider a parametric family of distributions {P_θ | θ ∈ Θ}. Throughout the paper, we assume that each distribution P_θ has density p_θ with respect to the Lebesgue measure. Let X_1^n := (X_1, . . . , X_n) be a sequence of random variables drawn i.i.d. from P_{θ∗}, where θ∗ ∈ Θ is the true parameter, albeit unknown. The log-likelihood F_n of the data is given by

    F_n(θ) := (1/n) ∑_{i=1}^n log p_θ(X_i).                                  (2a)

In a Bayesian analysis, the parameter space Θ is endowed with a prior distribution π. Combining this prior with the likelihood (2a) yields the posterior distribution

    Π(θ | X_1^n) := e^{nF_n(θ)} π(θ) / ∫_Θ e^{nF_n(u)} π(u) du.              (2b)

As the sample size n increases, we expect that the posterior distribution will concentrate more of its mass over increasingly smaller neighborhoods of the true parameter θ∗. Posterior contraction rates allow us to study how quickly this concentration of mass takes place. In particular, for a given norm, we study the posterior mass of a ball of the form ‖θ − θ∗‖ ≤ ρ for a suitably chosen radius. For a given δ ∈ (0, 1), our goal is to prove statements of the form

    Π( ‖θ − θ∗‖ ≥ ρ(n, d, δ) | X_1^n ) ≤ δ,                                  (3)

with probability at least 1 − δ over the randomly drawn data X_1^n. Our interest is in the scaling of the radius ρ(n, d, δ) as a function of the sample size n, problem dimension d, and the error tolerance δ, as well as other problem-specific parameters.
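As a calibration for the rates to come, consider the simplest conjugate case (our own illustration, not an example from the paper): a Gaussian location model with X_i ∼ N(θ∗, I_d) and a flat prior, for which the posterior is N(X̄_n, (1/n) I_d). Standard χ^2-tail bounds applied to ‖θ − X̄_n‖_2 under the posterior and to ‖X̄_n − θ∗‖_2 over the data give

    Π( ‖θ − θ∗‖_2 ≥ c √( (d + log(1/δ))/n ) | X_1^n ) ≤ δ

with probability at least 1 − δ over the data, which matches the (d/n)^{1/2} scaling that Theorem 1 below recovers under strong concavity.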

2.2 From diffusion processes to the posterior distribution


The analysis of this paper relies on a connection between the posterior distribution and a particular stochastic differential equation (SDE). Accordingly, we begin by introducing some background on Langevin processes. For a parameter β > 0, consider an SDE of the form

    dθ_t = −∇U(θ_t) dt + √(2/β) dB_t,                                        (4)

where (B_t, t ≥ 0) is a standard d-dimensional Brownian motion [25], and the potential function U : R^d → R is assumed to satisfy the following regularity conditions: (a) its gradient ∇U is locally Lipschitz, and (b) its gradient satisfies the inequality

    ⟨∇U(θ), θ⟩ ≥ c_1 ‖θ‖_2 − c_2   for any θ ∈ R^d,

where c_1, c_2 are positive constants. Under these conditions, from known results on general Langevin diffusions [20, 26], we have:

Proposition 1. Under the stated regularity conditions, the solution to the Langevin SDE (4) exists and is unique in the strong sense. Furthermore, the density of θ_t converges in L^2 to the stationary distribution with density proportional to e^{−βU}.

Let us consider this general result in the context of Bayesian inference. In particular, suppose that we apply Proposition 1 to the potential function U_n(θ) := −nF_n(θ) − log π(θ). Doing so requires us to verify that U_n satisfies the requisite regularity conditions. Assuming this validity, we are guaranteed that the posterior distribution Π(θ | X_1^n) is the stationary distribution of the following stochastic differential equation (SDE):

    dθ_t = (1/2) ∇F_n(θ_t) dt + (1/(2n)) ∇ log π(θ_t) dt + (1/√n) dB_t,      (5)

where θ_0 = θ∗. Moreover, the density of θ_t converges in L^2 to the posterior density.


This connection—between the SDE (5) and the posterior distribution (2b)—provides an avenue for analysis. In particular, by characterizing the behavior of the process (θ_t, t ≥ 0) as a function of time, we can obtain bounds on the posterior distribution by taking limits. In particular, Fatou's lemma guarantees that for any p > 0, we have

    Z^{−1} ∫ ‖θ‖_2^p e^{−U_n(θ)} dθ ≤ lim inf_{t→+∞} E[ ‖θ_t‖_2^p ],         (6)

where Z = ∫ e^{−U_n(θ)} dθ. Given this bound, we can establish posterior contraction rates for the parameters by controlling the moments of the diffusion process {θ_t}_{t≥0}. The main theoretical results of this paper are obtained by following this general roadmap.
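To make this roadmap concrete, the following minimal sketch (our own illustration, not code from the paper) simulates the SDE (5) by an Euler–Maruyama discretization; the arguments grad_log_lik and grad_log_prior stand for ∇F_n and ∇ log π, which must be supplied for the model at hand.

    import numpy as np

    def sample_posterior_sde(grad_log_lik, grad_log_prior, theta0, n,
                             step=1e-3, n_steps=50_000, seed=0):
        """Euler-Maruyama discretization of the SDE (5):
        d theta_t = (1/2) grad F_n dt + (1/(2n)) grad log pi dt + (1/sqrt(n)) dB_t."""
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(n_steps):
            drift = 0.5 * grad_log_lik(theta) + grad_log_prior(theta) / (2.0 * n)
            theta = theta + step * drift + np.sqrt(step / n) * rng.standard_normal(theta.shape)
        return theta

Averaging ‖θ_t − θ∗‖_2^p over many independent runs gives a Monte Carlo estimate of the moments whose control, via the bound (6), yields the contraction rates established below.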

2.3 From empirical to population likelihood


Before proceeding to our main results, let us introduce some additional definitions and conditions. A useful notion for our analysis is the population log-likelihood F. It corresponds to the limit of the log-likelihood function F_n, as previously defined in equation (2a), as the sample size n goes to infinity—viz.

    F(θ) := E[ log p_θ(X) ],                                                 (7)

where the expectation is taken with respect to X ∼ P_{θ∗}. Throughout the paper, we impose the following smoothness conditions on the population log-likelihood F and the log prior density:

(A) There exist positive constants L_1 and L_2 such that for any θ_1, θ_2 ∈ R^d, we have

    ‖∇F(θ_1) − ∇F(θ_2)‖_2 ≤ L_1 ‖θ_1 − θ_2‖_2,
    ‖∇ log π(θ_1) − ∇ log π(θ_2)‖_2 ≤ L_2 ‖θ_1 − θ_2‖_2.

(B) There exists a constant B > 0 such that

    sup_{θ∈R^d} ⟨∇ log π(θ), θ − θ∗⟩ ≤ B.

Note that the constant B in Assumption B depends on θ∗; we suppress this dependence for simplicity of presentation. The above conditions are relatively mild, and we provide a number of examples in the sequel for which they are satisfied.
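As a quick illustration (our own example, in the spirit of the corollaries below), a standard Gaussian prior π(θ) ∝ e^{−‖θ‖_2^2/2} satisfies both conditions: ∇ log π(θ) = −θ, so Assumption A holds with L_2 = 1, and

    ⟨∇ log π(θ), θ − θ∗⟩ = −‖θ − θ∗‖_2^2 − ⟨θ∗, θ − θ∗⟩ ≤ ‖θ∗‖_2 ‖θ − θ∗‖_2 − ‖θ − θ∗‖_2^2 ≤ ‖θ∗‖_2^2/4,

so Assumption B holds with B = ‖θ∗‖_2^2/4, exhibiting the dependence of B on θ∗ noted above.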

3 Main results
We now turn to the presentation of our main results, which provide guarantees on posterior contraction rates under different conditions. Our first main result, stated as Theorem 1 in Section 3.1, establishes the posterior convergence rate of parameters when the population log-likelihood is strongly concave around the true parameter θ∗. Then in Section 3.2, we study the same question when the population log-likelihood is weakly concave around θ∗, with our conclusions stated in Theorem 2.

3.1 Posterior contraction under strong concavity


We first study the setting in which F is strongly concave around the true parameter θ∗. We collect a few assumptions that are needed for the analysis:

(S.1) There exists a scalar µ > 0 such that

    ⟨∇F(θ), θ∗ − θ⟩ ≥ µ ‖θ − θ∗‖_2^2   for any θ ∈ R^d.                      (8)

(S.2) For any δ > 0, there exist non-negative functions ε_1 and ε_2 that map from N × (0, 1] to R_+ such that

    sup_{θ∈B(θ∗,r)} ‖∇F_n(θ) − ∇F(θ)‖_2 ≤ ε_1(n, δ) r + ε_2(n, δ),

for any radius r > 0 with probability at least 1 − δ.

Assumption S.1 is a standard strong concavity condition on the function F around θ∗. Furthermore, Assumption S.2 is used to control the uniform perturbation error between the gradient of the empirical log-likelihood function F_n and the gradient of the population log-likelihood function F.
Given the above assumptions, we are ready to state our first result regarding the posterior convergence rate of parameters for a strongly concave population log-likelihood:

Theorem 1. Suppose that Assumptions A, B, S.1, and S.2 hold. Then there is a universal constant C such that for any δ ∈ (0, 1), given a sample size n large enough such that ε_1(n, δ) ≤ µ/6, we have

    Π( ‖θ − θ∗‖_2 ≥ C ( √( (d + log(1/δ) + B)/(nµ) ) + ε_2(n, δ)/µ ) | X_1^n ) ≤ δ,

with probability 1 − δ, taken with respect to the random observations X_1^n.

See Section 5.1 for the proof of Theorem 1.
The result of Theorem 1 establishes the posterior convergence rate (d/n)^{1/2} of the parameter under the strong concavity setting for F. It also provides a detailed dependence of the rate on other model parameters, including B and µ, both of which might vary as a function of θ∗. At the moment, we do not know whether the dependence on these parameters is optimal.

3.2 Posterior contraction under weak concavity


Moving beyond the strong concavity setting, we consider the weakly concave setting for the population log-likelihood function F. Weak concavity arises when working with singular statistical models, meaning those whose Fisher information matrix at the true parameter θ∗ is rank-degenerate. Examples of such models include single index models [21] with certain choices of link functions, as well as over-specified Bayesian mixture models [28], in which the fitted mixture model has more components than the true mixture distribution. In order to analyze the posterior contraction rate of parameters for weakly concave log-likelihoods, we impose the following assumptions:

(W.1) There exists a convex, non-decreasing function ψ : [0, +∞) → R such that ψ(0) = 0 and for any θ ∈ R^d, we have

    ⟨∇F(θ), θ∗ − θ⟩ ≥ ψ(‖θ − θ∗‖_2).

Assumption W.1 characterizes the weak concavity of the function F around the global maximum θ∗. This condition can hold when the log-likelihood is locally strongly concave around θ∗ but only weakly concave in a global sense, or it can hold when the log-likelihood is weakly concave but nowhere strongly concave. An example of the former type is the logistic regression model analyzed in Section 4.1, whereas an example of the latter type is given by certain kinds of single index models, as analyzed in Section 4.2.
Our next assumption controls the deviation between the gradients of the population and sample likelihoods:

(W.2) For any δ > 0, there exist a function ε : N × (0, 1] → R_+ and a non-decreasing function ζ : R → R such that ζ(0) ≥ 0 and

    sup_{θ∈B(θ∗,r)} ‖∇F_n(θ) − ∇F(θ)‖_2 ≤ ε(n, δ) ζ(r),

for any radius r > 0 with probability at least 1 − δ.

Note that the function ζ can depend on the sample size n and other model parameters, as in the analysis of the over-specified Bayesian mixture model in Section 4.3. Here we suppress this dependence in ζ for brevity of presentation.
The previous conditions involved two functions, namely ψ and ζ. We let ξ : R_+ → R denote the inverse function of the strictly increasing function r ↦ rζ(r). Our next assumption imposes certain inequalities on these functions and their derivatives:

(W.3) The function r ↦ ψ(ξ(r)) is convex, and moreover, for any r > 0, the functions ψ and ζ satisfy the following differential inequalities:

    r ψ′(r) ζ(r) ≥ r ψ(r) ζ′(r) + ψ(r) ζ(r),   and
    r^2 ψ′′(r) ζ(r) + r ψ′(r) ζ(r) ≥ 3 ψ(r) ζ(r) + r^2 ψ(r) ζ′′(r).

These differential inequalities are needed to control the moments of the diffusion process {θ_t}_{t>0} in equation (5). In our discussion of concrete examples, we provide instances in which they are satisfied.
Our result involves a certain fixed-point equation that depends on the parameters and functions in our assumptions. In particular, for any tolerance parameter δ ∈ (0, 1) and sample size n, consider the following equation in a variable z > 0:

    ψ(z) = ε(n, δ) ζ(z) z + (B + d log(1/δ))/n.                              (9)

In order to ensure that equation (9) has a unique positive solution, our final assumption imposes a certain condition on the growth of the functions ψ and ζ:

(W.4) The sample size n and tolerance parameter δ ∈ (0, 1) are such that ε(n, δ) < lim inf_{z→+∞} ψ(z)/(z ζ(z)).

With this set-up, we are now ready to state our second main result:

Theorem 2. Assume that Assumptions A, B, and W.1–W.4 hold. Then for a given sample size n and δ ∈ (0, 1), equation (9) has a unique positive solution z∗(n, δ) such that

    Π( ‖θ − θ∗‖_2 ≥ z∗(n, δ) | X_1^n ) ≤ δ                                   (10)

with probability 1 − δ with respect to the random observations X_1^n.


See Section 5.2 for the proof of Theorem 2.
A few comments are in order. First, the convergence guarantee (10) depends on the weak concavity function ψ and the perturbation function ζ through the non-linear equation (9). It is natural to wonder about the origins of this equation. As shown in our analysis, it stems from an upper bound on the moments E[ ‖θ_t − θ∗‖_2^p ] of the diffusion process {θ_t}_{t>0} defined in equation (5). In particular, we find that

    lim_{t→+∞} ( E[ ‖θ_t − θ∗‖_2^p ] )^{1/p} ≤ z_p∗,

where z_p∗ is the unique positive solution to the equation ψ(z) = ε(n, δ) ζ(z) z + (B + pd)/n. In light of the above result and the inequality (6), when p is of the order log(1/δ), we obtain the posterior convergence rate (10).
Second, it is (at least in general) not possible to compute an explicit form for the positive solution z∗(n, δ) to the non-linear equation (9). However, for certain forms of the functions ψ and ζ, it is possible to compute an analytic and relatively simple upper bound. For instance, given some positive parameters (α, β) such that α > β + 1, suppose that these functions are defined locally, in an interval above zero, as follows:

    ψ(r) = r^α,   and   ζ(r) = r^β   for all r in some interval [0, r̄).     (11a)

Moreover, suppose that the perturbation function takes the form

    ε(n, δ) = √( (d + log(1/δ))/n ).                                         (11b)

As shown in the analysis to follow in Section 4, these particular forms arise in several statistical models, including Bayesian logistic regression and certain forms of Bayesian single index models. The following result shows that we have a simple upper bound on z∗(n, δ).

Corollary 1. Assume that the functions ψ, ζ have the local behavior (11a), and the perturbation term ε(n, δ) has the form (11b). If, in addition, the global forms of ψ and ζ satisfy Assumption W.3, then we are guaranteed that

    z∗(n, δ) ≤ c [ ( (d + log(1/δ) + B)/n )^{1/(2(α−(β+1)))} ∨ ( (d + log(1/δ) + B)/n )^{1/α} ],

where z∗(n, δ) is defined in Theorem 2.

Corollary 1 directly leads to a posterior convergence rate for the parameters—namely, we have

    Π( ‖θ − θ∗‖_2 ≥ c [ ( (d + log(1/δ) + B)/n )^{1/(2(α−(β+1)))} ∨ ( (d + log(1/δ) + B)/n )^{1/α} ] | X_1^n ) ≤ δ,   (12)

with probability 1 − δ with respect to the training data. Note that the convergence rate scales as (d/n)^{1/(2(α−(β+1)))} when α ≥ 2(β + 1). On the other hand, this rate becomes (d/n)^{1/α} when α < 2(β + 1).
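Although z∗(n, δ) rarely has a closed form, it is cheap to compute numerically. The sketch below (our own illustration, not code from the paper) finds the root of equation (9) by bisection and can be used to check the scaling in Corollary 1; the polynomial forms of ψ and ζ in the example are those of (11a).

    import numpy as np

    def solve_fixed_point(psi, zeta, eps, slack, tol=1e-12):
        """Bisection for the unique positive root of psi(z) = eps*zeta(z)*z + slack,
        i.e. equation (9); assumes psi eventually dominates (Assumption W.4)."""
        f = lambda z: psi(z) - eps * zeta(z) * z - slack
        lo, hi = 0.0, 1.0
        while f(hi) < 0:            # grow the bracket until the sign changes
            hi *= 2.0
        while hi - lo > tol * max(hi, 1.0):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return 0.5 * (lo + hi)

    # Example: alpha = 4, beta = 1, the boundary case alpha = 2(beta + 1), where
    # Corollary 1 predicts z* of order ((d + log(1/delta) + B)/n)^{1/4}.
    d, n, delta, B = 10, 100_000, 0.05, 1.0
    eps = np.sqrt((d + np.log(1 / delta)) / n)
    slack = (B + d * np.log(1 / delta)) / n
    print(solve_fixed_point(lambda z: z**4, lambda z: z, eps, slack),
          ((d + np.log(1 / delta) + B) / n) ** 0.25)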

4 Some illustrative examples


In this section, we study the posterior contraction rates of parameters for a few interesting
statistical examples that fall under the general framework of this paper.

4.1 Bayesian logistic regression


We begin with the method of logistic regression, which is a popular approach to modeling the relation between a binary response variable Y ∈ {−1, +1} and a vector X ∈ R^d of explanatory variables [21]. In this model, the pair Y and X are related by a conditional distribution of the form

    P(Y = 1 | X, θ) = e^{⟨X, θ⟩} / (1 + e^{⟨X, θ⟩}),                         (13)

where θ ∈ R^d is a parameter vector.
Suppose that we observe a collection Z_1^n = {Z_i}_{i=1}^n of n i.i.d. paired samples Z_i = (X_i, Y_i), generated in the following way. First, the covariate vector X_i is drawn from a standard Gaussian distribution N(0, I_d), and then the binary response Y_i is drawn according to the conditional distribution

    P(Y_i = 1 | X_i, θ∗) = e^{⟨X_i, θ∗⟩} / (1 + e^{⟨X_i, θ∗⟩}),              (14)

where θ∗ ∈ R^d is a fixed but unknown value of the parameter vector. Consequently, the sample log-likelihood function of the samples Z_1^n takes the form

    F_n^R(θ) := (1/n) ∑_{i=1}^n { log P(Y_i | X_i, θ) + log φ(X_i) },        (15)

where φ denotes the density of a standard normal vector. Combining this log-likelihood with a given prior π over θ yields the posterior distribution in the usual way. We assume that the prior π satisfies Assumptions A and B.
With this set-up, the following result establishes the posterior convergence rate of θ around θ∗, conditionally on the observations Z_1^n.

Corollary 2. For a given δ ∈ (0, 1), suppose that we are given i.i.d. samples Z_1^n from the Bayesian logistic regression model (13) with true parameter θ∗, where n/log n ≥ c′ d log(1/δ) for some universal constant c′. Then there is a universal constant c such that

    Π( ‖θ − θ∗‖_2 ≥ c √( (d + log(1/δ) + B)/n ) | Z_1^n ) ≤ δ

with probability 1 − δ over the data Z_1^n.

See Appendix A.1 for the proof of this claim.

A few comments are in order. First, the result of Corollary 2 shows that the posterior convergence rate of parameters under the Bayesian logistic regression model (13) is (d/n)^{1/2}. Furthermore, this result also gives a concrete dependence of the rate on B, which characterizes the degree to which the prior is concentrated away from the true parameter.
Second, by taking the sample size in the function F_n^R to infinity, the population log-likelihood is given by

    F^R(θ) := E_{(X,Y)}[ −log( 1 + e^{−Y⟨X, θ⟩} ) + log φ(X) ],              (16)

where the outer expectation in the above display is taken with respect to X and Y | X from the logistic model (14).
In Appendix A.1, we prove that there are universal constants c, c_1, c_2 such that

    ⟨∇F^R(θ), θ∗ − θ⟩ ≥ c_1 ‖θ − θ∗‖_2^2   for all ‖θ − θ∗‖_2 ≤ 1,   and
    ⟨∇F^R(θ), θ∗ − θ⟩ ≥ c_1 ‖θ − θ∗‖_2     otherwise,                        (17a)

and

    sup_{θ∈R^d} ‖∇F_n^R(θ) − ∇F^R(θ)‖_2 ≤ c_2 ( √(d/n) + √(log(1/δ)/n) + log(1/δ)/n ),   (17b)

with probability 1 − δ, as long as n/log n ≥ c d log(1/δ). The above results for F^R and F_n^R indicate that the functions ψ and ζ in Assumptions W.1 and W.2 have the following closed forms:

    ψ(r) = c_1 r^2 for all 0 < r ≤ 1,   ψ(r) = c_1 r otherwise,   and   ζ(r) = c_2 for all r > 0.   (18)

We can check that the functions ψ and ζ satisfy the conditions in Assumptions W.3 and W.4. Therefore, an application of Theorem 2 to these functions leads to the posterior contraction rate of θ around θ∗ given in Corollary 2.
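Echoing the Euler–Maruyama sketch from Section 2.2, the gradient entering the SDE (5) is available in closed form for this model, so the posterior is easy to simulate. The snippet below (our own illustration; the constants and the Gaussian prior are assumptions, not choices made in the paper) generates logistic data and runs the discretized Langevin dynamics.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 2000, 5
    theta_star = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    Y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

    def grad_log_lik(theta):
        # gradient of F_n^R(theta): (1/n) sum_i Y_i X_i sigmoid(-Y_i <X_i, theta>)
        s = 1.0 / (1.0 + np.exp(Y * (X @ theta)))
        return (Y * s) @ X / n

    # Langevin discretization of the SDE (5) with a standard Gaussian prior
    theta, step = np.zeros(d), 1e-2
    for _ in range(20_000):
        drift = 0.5 * grad_log_lik(theta) - theta / (2.0 * n)
        theta = theta + step * drift + np.sqrt(step / n) * rng.standard_normal(d)
    print(np.linalg.norm(theta - theta_star))   # typically of order sqrt(d/n)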

4.2 Bayesian single index models


We now turn to single index models, which are generalizations of linear regression models in which the linear coefficients are composed with a general link function [5]. These models are widely used in both econometrics and biostatistics, and have also seen application in computational imaging problems. Due to the special structure of single index models, the curse of dimensionality associated with the model index coefficients can be avoided. In this section, we consider specific settings of single index models in which the link function is known and has a specific form. In particular, we assume that the data points Z_i = (Y_i, X_i) ∈ R^{d+1} are generated as follows:

    Y_i = g(X_i^⊤ θ∗) + ε_i,   for i ∈ [n].                                  (19)

Here (X_i, Y_i) correspond to the covariate vector and response variable, respectively, whereas g is a known link function and θ∗ is a true but unknown parameter. Furthermore, the random variables {ε_i}_{i=1}^n are i.i.d. standard Gaussian variables, whereas the covariates X_i are assumed to be i.i.d. samples from the standard multivariate Gaussian distribution. The choice g(r) = r^2 has been used to model the problem of phase retrieval in computational imaging.
Now, we consider a Bayesian single index model to study θ∗. More specifically, we endow the parameter θ with a prior π. Conditioning on X_i and θ, we have

    Y_i | X_i, θ ∼ N( g(X_i^⊤ θ), 1 ),   independently for i ∈ [n].          (20)

The main goal of this section is to study the posterior convergence rate of θ around θ∗ given the data. To do so, we assume that the prior π satisfies Assumptions A and B, and we follow the framework used in the Bayesian logistic regression case. In particular, we first study the structure of the sample log-likelihood function around the true parameter θ∗. Then we establish the uniform perturbation bound between the population and sample log-likelihood functions.
Given the Bayesian single index model (20), the sample log-likelihood function F_n^I of the samples Z_1^n = {Z_i}_{i=1}^n admits the following form:

    F_n^I(θ) := (1/n) ∑_{i=1}^n { −( Y_i − g(X_i^⊤ θ) )^2 / 2 + log φ(X_i) },   (21)

where φ is the standard normal density function of X_1, . . . , X_n. Hence, the population log-likelihood function F^I has the following form:

    F^I(θ) := E_{(X,Y)}[ −( Y − g(X^⊤ θ) )^2 / 2 + log φ(X) ],               (22)

where the outer expectation in the above display is taken with respect to X ∼ N(0, I_d) and Y | X = x ∼ N( g(x^⊤ θ∗), 1 ).
We can check that the function F^I is weakly concave around θ∗ when the link function g and the true parameter θ∗ take the following values:

    g(r) = r^p for some p ≥ 2,   and   θ∗ = 0.                               (23)

Given these choices of g and θ∗, the population log-likelihood function has the closed-form expression

    F^I(θ) = −( 1 + (2p − 1)!! ‖θ − θ∗‖_2^{2p} )/2 + E[ log φ(X) ]   for all θ ∈ R^d.
Furthermore, in Appendix A.2, we prove that there is a universal constant c_1 > 0 such that

    ⟨∇F^I(θ), θ∗ − θ⟩ ≥ c_1 ‖θ − θ∗‖_2^{2p}   for all θ ∈ R^d,               (24a)

and there are universal constants (c, c_2) such that for any r > 0 and δ ∈ (0, 1), as long as n ≥ c (d log(d/δ))^{2p}, we have

    sup_{θ∈B(θ∗,r)} ‖∇F_n^I(θ) − ∇F^I(θ)‖_2 ≤ c_2 ( r^{p−1} + r^{2p−1} ) √( (d + log(1/δ))/n ),   (24b)

with probability at least 1 − δ. Therefore, the functions ψ and ζ in Assumptions W.1 and W.2 take the specific forms

    ψ(r) = c_1 r^{2p},   and   ζ(r) = r^{p−1} + r^{2p−1},                     (25)

for all r > 0. Simple algebra shows that these functions satisfy Assumptions W.3 and W.4. Therefore, under the setting (23), a direct application of Theorem 2 leads to the following result regarding the posterior contraction rate of θ around θ∗:

Corollary 3. Consider the Bayesian single index model (19) with true parameter θ∗ = 0 and link function g(r) = r^p for some p ≥ 2. Then there are universal constants c, c′ such that for any δ > 0, given a sample size n ≥ c′ (d + log(d/δ))^{2p}, we have

    Π( ‖θ − θ∗‖_2 ≥ c ( (d + log(1/δ) + B)/n )^{1/(2p)} | Z_1^n ) ≤ δ

with probability 1 − δ over the data Z_1^n.

See Appendix A.2 for the proof of Corollary 3.

It is worth noting that the proof of Corollary 3 actually leads to the following stronger uniform perturbation bound:

    sup_{θ∈B(θ∗,r)} ‖∇F_n^I(θ) − ∇F^I(θ)‖_2 ≤ c [ r^{p−1} ( √( (d + log(1/δ))/n ) + (1/n^{3/2}) ( d + log(n/δ) )^{p+1} )
        + r^{2p−1} ( √( (d + log(1/δ))/n ) + (1/n^{3/2}) ( d + log(n/δ) )^{2p+1} ) ],

valid for each r > 0 with probability 1 − δ. The condition n ≥ c (d + log(d/δ))^{2p} is required to guarantee that the right-hand side of the above display is upper bounded by the right-hand side of equation (24b); this bound permits us to apply Theorem 2 to establish the posterior convergence rate of the parameter under Bayesian single index models.
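The curvature condition (24a) can also be checked numerically, since at θ∗ = 0 the population curvature has the closed form ⟨∇F^I(θ), θ∗ − θ⟩ = p (2p − 1)!! ‖θ‖_2^{2p}. The sketch below (our own illustration, not from the paper) compares a Monte Carlo estimate of the empirical inner product against this value.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, p = 200_000, 3, 2
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal(n)        # theta* = 0, so Y_i = eps_i

    theta = 0.3 * rng.standard_normal(d)
    u = X @ theta
    grad = ((Y - u**p) * p * u**(p - 1)) @ X / n    # gradient of F_n^I at theta
    empirical = grad @ (-theta)                     # <grad F_n^I(theta), theta* - theta>
    double_fact = np.prod(np.arange(2 * p - 1, 0, -2))   # (2p - 1)!!
    print(empirical, p * double_fact * np.linalg.norm(theta) ** (2 * p))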

4.3 Over-specified Bayesian Gaussian mixture models


Bayesian Gaussian mixture models have been widely used by statisticians to study data sets with heterogeneity, which can be modeled in terms of different mixture components [19]. In fitting such models, the true number of components is generally unknown, and several approaches have been proposed to deal with this challenge. One of the most popular methods is to over-specify the true number of components and fit the data with the larger model; the result is referred to as an over-specified Gaussian mixture model [28]. Posterior convergence rates of the density function in over-specified Bayesian Gaussian mixture models are well-studied [14]. However, rates for the parameters in these models are not fully understood. On one hand, when the covariance matrices are known and the parameter space is bounded, location parameters have been shown to have posterior convergence rate n^{−1/4} in the Wasserstein-2 metric [22]. However, neither the dependence on the dimension d nor that on the true number of components has been established.
In this section, we consider a class of over-specified location Gaussian mixture models and provide convergence rates for the parameters with precise dependence on the dimension d, without requiring any boundedness assumption on the parameter space. Concretely, suppose that X_1^n = (X_1, . . . , X_n) are i.i.d. samples from a single Gaussian distribution N(θ∗, I_d) where θ∗ = 0. Suppose that we fit such a data set using an over-specified Bayesian location Gaussian mixture model of the form

    θ ∼ π(θ),   c_i ∈ {−1, 1} i.i.d. ∼ Cat(1/2, 1/2),   X_i | c_i, θ ∼ N(c_i θ, I_d),   (26)

where Cat(1/2, 1/2) stands for the categorical distribution with parameters (1/2, 1/2). Here the prior π is chosen to satisfy the smoothness Assumptions A and B, one example being a Gaussian location density. Our goal in this section is to characterize the posterior contraction rate of the location parameter θ around θ∗.
In order to do so, we first define the sample log-likelihood function F_n^G given the data X_1^n. It has the form

    F_n^G(θ) := (1/n) ∑_{i=1}^n log( (1/2) φ(X_i; −θ, I_d) + (1/2) φ(X_i; θ, I_d) ),   (27)

where x ↦ φ(x; θ, I_d) = (2π)^{−d/2} e^{−‖x−θ‖_2^2/2} denotes the density of the multivariate Gaussian distribution N(θ, I_d). Similarly, the population log-likelihood function is given by

    F^G(θ) := E_X[ log( (1/2) φ(X; −θ, I_d) + (1/2) φ(X; θ, I_d) ) ],         (28)

where the outer expectation in the above display is taken with respect to X ∼ N(θ∗, I_d).
In Appendix A.3, we prove that there is a universal constant c_1 > 0 such that

    ⟨∇F^G(θ), θ∗ − θ⟩ ≥ c_1 ‖θ − θ∗‖_2^4   for all ‖θ − θ∗‖_2 ≤ √2,   and
    ⟨∇F^G(θ), θ∗ − θ⟩ ≥ 4 c_1 ( ‖θ − θ∗‖_2^2 − 1 )   otherwise,              (29a)

and moreover, there are universal constants (c, c_2) such that for any δ ∈ (0, 1), given a sample size n ≥ c d log(1/δ), we have

    sup_{θ∈B(θ∗,r)} ‖∇F_n^G(θ) − ∇F^G(θ)‖_2 ≤ c_2 ( r + 1/√n ) √( (d + log(log n/δ))/n ),   (29b)

with probability 1 − δ.
Given the above results, the functions ψ and ζ in Assumptions W.1 and W.2 take the form

    ψ(r) = c_1 r^4 for all 0 < r ≤ √2,   ψ(r) = 4 c_1 (r^2 − 1) otherwise,   and   ζ(r) = r + 1/√n for all r > 0.   (30)

These functions satisfy the conditions of Assumptions W.3 and W.4. Therefore, Theorem 2 leads to the following result regarding the posterior contraction rate of parameters under the over-specified Bayesian location Gaussian mixture model (26):

Corollary 4. Given the over-specified Bayesian location Gaussian mixture model (26), there are universal constants c, c′ such that given any δ ∈ (0, 1) and a sample size n ≥ c′ d log(1/δ), we have

    Π( ‖θ − θ∗‖_2 ≥ c ( (d + log(log n/δ) + B)/n )^{1/4} | X_1^n ) ≤ δ

with probability 1 − δ over the data X_1^n.

See Appendix A.3 for the proof of Corollary 4.
The dependence on n in the posterior contraction rate of θ in Corollary 4 is consistent with previous results for location parameters in over-specified Bayesian location Gaussian mixtures [6, 17, 22]. The novel findings in the above corollary are the dependence on the dimension d, which enters as d^{1/4}, and the dependence on the smoothness of the prior distribution π, which enters as B^{1/4}. Finally, our result does not require boundedness of the parameter space, which had been a main assumption in previous works studying the posterior rate of θ [6, 17, 22].
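For completeness, the gradient entering the perturbation bound (29b) has a simple closed form, which follows directly from equation (27): ∇F_n^G(θ) = (1/n) ∑_{i=1}^n tanh(⟨X_i, θ⟩) X_i − θ. The sketch below (our own derivation and illustration, not code from the paper) implements it and probes the fourth-order flatness of F^G near θ∗ = 0 that underlies the n^{−1/4} rate.

    import numpy as np

    def grad_mixture_loglik(theta, X):
        # gradient of F_n^G in (27): (1/n) sum_i tanh(<X_i, theta>) X_i - theta
        return (np.tanh(X @ theta) @ X) / X.shape[0] - theta

    rng = np.random.default_rng(2)
    n, d = 500_000, 2
    X = rng.standard_normal((n, d))   # samples from N(0, I_d), i.e. theta* = 0

    for scale in [0.5, 0.25, 0.125]:
        theta = scale * np.ones(d) / np.sqrt(d)
        curv = grad_mixture_loglik(theta, X) @ (-theta)
        # a Taylor expansion gives curv ~ ||theta||_2^4 for small theta
        print(scale, curv / np.linalg.norm(theta) ** 4)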

5 Proofs
This section is devoted to the proofs of Theorem 1 and Theorem 2, with more technical aspects
deferred to the appendices.

5.1 Proof of Theorem 1


Throughout the proof, in order to simplify notation, we omit the conditioning on the σ-field F_n := σ(X_1^n); it should be taken as given. For α = (1/2)µ − ε_1(n, δ) > µ/6, we claim that

    (1/2) e^{αt} ‖θ_t − θ∗‖_2^2 ≤ (1/√n) M_t + ( (e^{αt} − 1)/(2α) ) U_n,     (31)

where U_n := B/n + 3ε_2^2(n, δ)/µ + d/n and M_t := ∫_0^t e^{αs} ⟨θ_s − θ∗, dB_s⟩, which is a martingale.
Assume that the above claim is given for the moment (its proof is deferred to the end of the proof of the theorem). In order to bound the moments of the martingale M_t, for any p ≥ 4, we invoke the Burkholder–Davis–Gundy inequality [25] to find that

    E[ sup_{0≤t≤T} |M_t|^{p/2} ] ≤ (pC)^{p/4} E[ ⟨M, M⟩_T^{p/4} ] = (pC)^{p/4} E[ ( ∫_0^T e^{2αs} ‖θ_s − θ∗‖_2^2 ds )^{p/4} ]
        ≤ (pC)^{p/4} E[ ( sup_{0≤t≤T} e^{αt} ‖θ_t − θ∗‖_2^2 · ∫_0^T e^{αs} ds )^{p/4} ]
        ≤ ( pC e^{αT}/α )^{p/4} E[ ( sup_{0≤t≤T} e^{αt} ‖θ_t − θ∗‖_2^2 )^{p/4} ],

where C is a universal constant. Therefore, we arrive at the following bound:

    E[ ( e^{αt} ‖θ_t − θ∗‖_2^2 )^{p/2} ] ≤ E[ ( (2/√n) M_t + ( (e^{αt} − 1)/α ) U_n )^{p/2} ]
        ≤ ( e^{αT} U_n/α )^{p/2} + ( pC e^{αT}/(αn) )^{p/4} E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ∗‖_2^2 )^{p/4} ].

For the right-hand side of the above inequality, we can relate it to the left-hand side by using Young's inequality, which gives

    ( pC e^{αT}/(αn) )^{p/4} E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ∗‖_2^2 )^{p/4} ]
        ≤ (1/2) ( pC e^{αT}/(αn) )^{p/2} + (1/2) E[ ( sup_{0≤s≤T} e^{αs} ‖θ_s − θ∗‖_2^2 )^{p/2} ].

Putting the above results together, we find that

    ( E[ ‖θ_T − θ∗‖_2^p ] )^{1/p} ≤ ( e^{−αT} E[ sup_{0≤t≤T} e^{αt} ‖θ_t − θ∗‖_2^p ] )^{1/p} ≤ C′ ( √(U_n/µ) + √( 2p/(nµ) ) ),

for a universal constant C′ > 0. Therefore, the diffusion process defined in equation (5) satisfies the inequality

    sup_{t≥0} ( E[ ‖θ_t − θ∗‖_2^p ] )^{1/p} ≤ c ( √( (B + d)/(µn) ) + ε_2(n, δ)/µ + √( p/(nµ) ) )

for any p ≥ 1. Combining the above inequality with the inequality (6) yields the conclusion of the theorem.

Proof of claim (31): For the given choice α > 0, an application of Itô's formula yields the decomposition

    (1/2) e^{αt} ‖θ_t − θ∗‖_2^2 = −(1/2) ∫_0^t ⟨θ∗ − θ_s, ∇F_n(θ_s)⟩ e^{αs} ds + (1/(2n)) ∫_0^t ⟨θ_s − θ∗, ∇ log π(θ_s)⟩ e^{αs} ds
        + (d/(2n)) ∫_0^t e^{αs} ds + (1/√n) ∫_0^t e^{αs} ⟨θ_s − θ∗, dB_s⟩ + (1/2) ∫_0^t α e^{αs} ‖θ_s − θ∗‖_2^2 ds
    = J_1 + J_2 + J_3 + J_4 + J_5.                                           (32)

We begin by bounding the term J_1 in equation (32). Based on Assumption S.2 on the perturbation error between F_n and F, and the strong concavity of F, we have

    J_1 = −(1/2) ∫_0^t ⟨θ∗ − θ_s, ∇F_n(θ_s)⟩ e^{αs} ds
        ≤ −(1/2) ∫_0^t ⟨θ∗ − θ_s, ∇F(θ_s)⟩ e^{αs} ds + (1/2) ∫_0^t ‖θ_s − θ∗‖_2 ‖∇F(θ_s) − ∇F_n(θ_s)‖_2 e^{αs} ds
        ≤ −(1/2) ∫_0^t µ ‖θ_s − θ∗‖_2^2 e^{αs} ds + (1/2) ∫_0^t ‖θ_s − θ∗‖_2 ( ε_1(n, δ) ‖θ_s − θ∗‖_2 + ε_2(n, δ) ) e^{αs} ds
        ≤ −(1/2) ∫_0^t µ ‖θ_s − θ∗‖_2^2 e^{αs} ds + (1/2) ∫_0^t ‖θ_s − θ∗‖_2^2 ( ε_1(n, δ) + µ/3 ) e^{αs} ds + ( 3ε_2^2(n, δ)/(2µ) ) ∫_0^t e^{αs} ds.

The second term J_2, involving the prior π, can be controlled in the following way:

    J_2 = (1/(2n)) ∫_0^t ⟨θ_s − θ∗, ∇ log π(θ_s)⟩ e^{αs} ds ≤ (1/(2n)) B ∫_0^t e^{αs} ds = ( (e^{αt} − 1) B )/(2αn).

For the third term J_3, a direct calculation leads to

    J_3 = ( d (e^{αt} − 1) )/(2αn).

Moving to the fourth term J_4, it is a martingale, as J_4 = M_t/√n. Putting the above results together, since α = (1/2)µ − ε_1(n, δ) > µ/6, we obtain

    (1/2) e^{αt} ‖θ_t − θ∗‖_2^2 ≤ (1/√n) M_t + ( (e^{αt} − 1)/(2α) ) U_n.

Putting together the pieces yields the claim (31).

5.2 Proof of Theorem 2


As in the proof of Theorem 1, we omit the conditioning on F_n := σ(X_1^n). For any p ≥ 2, we define the function ν_(p) by ν_(p)(r) := ψ( r^{1/(p−1)} ) r^{(p−2)/(p−1)} for any r > 0. Additionally, the function τ_(p) is defined so as to satisfy τ_(p)( r^{p−1} ζ(r) ) := r^{p−2} ψ(r) for any r > 0. Note that, due to Assumption W.2, the function r ↦ r^{p−1} ζ(r) is a strictly increasing and surjective function that maps from [0, +∞) to [0, +∞). Therefore, it is invertible, and the function τ_(p) is well-defined.
Now, we claim that the one-dimensional functions ν_(p)(·) and τ_(p)(·) are convex and strictly increasing for any p ≥ 2. Furthermore, the following inequality holds:

    E[ ‖θ_t − θ∗‖_2^p ] ≤ (p/2) ∫_0^t ( −R_p(s) + ε(n, δ) τ_(p)^{−1}(R_p(s)) + ( (B + (p − 1)d)/n ) ν_(p)^{−1}(R_p(s))^{(p−2)/(p−1)} ) ds,   (33)

where R_p(s) := E[ ‖θ_s − θ∗‖_2^{p−2} ψ(‖θ_s − θ∗‖_2) ].
Taking the above claims as given for the moment, let us now complete the proof of the theorem. Since the process (θ_t : t ≥ 0) converges in L^q norm for arbitrarily large q, the limit lim_{t→+∞} R_p(t) exists. Since the functions τ_(p) and ν_(p) are convex and strictly increasing, their inverse functions are concave. Moreover, a simple calculation leads to

    ∇_r ( ν_(p)^{−1}(r)^{(p−2)/(p−1)} ) = ( (p − 2)/(p − 1) ) · ν_(p)^{−1}(r)^{−1/(p−1)} / ν_(p)′( ν_(p)^{−1}(r) ).   (34)

Since ν_(p) is convex and increasing, the numerator is a decreasing positive function of r. Additionally, the denominator is an increasing positive function of r. Therefore, the derivative in equation (34) is a decreasing function of r, and the function r ↦ ν_(p)^{−1}(r)^{(p−2)/(p−1)} is concave. We denote

    φ(r) := −r + ε(n, δ) τ_(p)^{−1}(r) + ( (B + (p − 1)d)/n ) ν_(p)^{−1}(r)^{(p−2)/(p−1)}.

Then we have φ(0) = 0 and φ is a concave function. Suppose that r_∗ is the smallest positive solution to the following equation:

    r = ε(n, δ) τ_(p)^{−1}(r) + ( (B + (p − 1)d)/n ) ν_(p)^{−1}(r)^{(p−2)/(p−1)}.

Then we have φ(r) < 0 for r > r_∗ and φ(r) > 0 for r ∈ (0, r_∗). By the result of Lemma 1, we have lim_{t→+∞} R_p(t) ≤ r_∗.
Since ν_(p) is a convex and strictly increasing function, by means of Jensen's inequality, we have

    R_p(t) = E[ ‖θ_t − θ∗‖_2^{p−2} ψ(‖θ_t − θ∗‖_2) ] ≥ ν_(p)( E[ ‖θ_t − θ∗‖_2^{p−1} ] ).

Therefore, by denoting z_∗ := lim_{t→+∞} ( E[ ‖θ_t − θ∗‖_2^{p−1} ] )^{1/(p−1)}, we have z_∗^{p−1} ≤ ν_(p)^{−1}(r_∗). Hence, we arrive at the following inequality:

    z_∗^{p−2} ψ(z_∗) ≤ ε(n, δ) τ_(p)^{−1}( ν_(p)(z_∗^{p−1}) ) + ( (B + (p − 1)d)/n ) z_∗^{p−2}
        = ε(n, δ) z_∗^{p−1} ζ(z_∗) + ( (B + (p − 1)d)/n ) z_∗^{p−2}.

As a consequence, we find that

    ψ(z_∗) ≤ ε(n, δ) ζ(z_∗) z_∗ + (B + (p − 1)d)/n.

Now, we claim that there exists a unique positive solution to equation (9). Given this claim, replacing p by (p + 1) and putting the above results together yields

    lim_{t→+∞} ( E[ ‖θ_t − θ∗‖_2^p ] )^{1/p} ≤ z_p∗,

where z_p∗ is the unique positive solution to the following equation:

    ψ(z) = ε(n, δ) ζ(z) z + (B + pd)/n.

Combining the above inequality with the inequality (6) yields the conclusion of the theorem.

We now return to prove our earlier claims about the behavior of the functions ν_(p) and τ_(p), the moment bound (33), and the existence of a unique positive solution to equation (9).

Structure of the function ν_(p): Since ψ is a convex and strictly increasing function, by taking the second derivative, we find that

    ν_(p)′′(r) = ∇_r^2 ( ψ( r^{1/(p−1)} ) r^{(p−2)/(p−1)} )
        = (1/((p − 1) r)) [ (1/(p − 1)) r^{1/(p−1)} ψ′′( r^{1/(p−1)} ) + ( (p − 2)/(p − 1) ) ψ′( r^{1/(p−1)} ) − ( (p − 2)/(p − 1) ) r^{−1/(p−1)} ψ( r^{1/(p−1)} ) ] ≥ 0

for all r > 0; the final inequality holds since ψ′′ ≥ 0 and since s ψ′(s) ≥ ψ(s) for s = r^{1/(p−1)}, the latter because ψ is convex with ψ(0) = 0. As a consequence, the function ν_(p) is convex.

Structure of the function τ_(p): This proof exploits Assumption W.3 on the functions ψ and ζ. For any p ≥ 2, we denote by ζ_(p) : r ↦ r^{p−1} ζ(r) and ψ_(p) : r ↦ r^{p−2} ψ(r) two strictly increasing functions. Therefore, we can define the function τ_(p) := ψ_(p) ∘ ζ_(p)^{−1}, namely, τ_(p)( r^{p−1} ζ(r) ) = r^{p−2} ψ(r) for any r > 0. Following some calculation, we find that

    ∇_r ( τ_(p)( r^{p−1} ζ(r) ) ) = ( (p − 1) r^{p−2} ζ(r) + r^{p−1} ζ′(r) ) τ_(p)′( r^{p−1} ζ(r) )
        = (p − 2) r^{p−3} ψ(r) + r^{p−2} ψ′(r).

Setting z = ζ_(p)(r) leads to

    ∇_z τ_(p)(z) = ( (p − 2) ψ(r) + r ψ′(r) ) / ( (p − 1) r ζ(r) + r^2 ζ′(r) ).

Taking another derivative of the above term, we find that

    ∇_z^2 τ_(p)(z) = ( ζ_(p)′(r) )^{−1} g(r, p) / ( (p − 1) r ζ(r) + r^2 ζ′(r) )^2,

where we denote

    g(r, p) := ( (p − 1) r ζ(r) + r^2 ζ′(r) ) · ( (p − 1) ψ′(r) + r ψ′′(r) )
        − ( (p − 1) ζ(r) + (p + 1) r ζ′(r) + r^2 ζ′′(r) ) · ( (p − 2) ψ(r) + r ψ′(r) ).

According to Assumption W.3, τ_(2) = ψ_(2) ∘ ζ_(2)^{−1} is a convex function. Therefore, we have g(r, 2) ≥ 0 for any r > 0. Simple algebra with the first-order derivative of the function g with respect to the parameter p leads to

    ∇_p g(r, p) = ζ(r) · ( (p − 1) r ψ′(r) + r^2 ψ′′(r) − (p − 2) ψ(r) − r ψ′(r) )
        − r ζ′(r) ( (p − 2) ψ(r) + r ψ′(r) ) + r ψ′(r) · ( (p − 1) ζ(r) + r ζ′(r) )
        − ψ(r) · ( (p − 1) ζ(r) + (p + 1) r ζ′(r) + r^2 ζ′′(r) )
    = 2(p − 2) ( r ψ′(r) ζ(r) − ψ(r) ζ(r) − r ζ′(r) ψ(r) )
        + ( r^2 ζ(r) ψ′′(r) + r ψ′(r) ζ(r) − 3 ψ(r) ζ(r) − r^2 ψ(r) ζ′′(r) ) ≥ 0

for all r > 0, where the last inequality is due to Assumption W.3. Therefore, the function g is increasing in p for p ≥ 2, which implies that g(r, p) ≥ g(r, 2) ≥ 0 for all r > 0. Given that inequality, we have (d^2/dz^2) τ_(p)(z) ≥ 0 for any z ≥ 0 and p ≥ 2; that is, the function τ_(p)(z) is convex in z = ζ_(p)(r).

Proof of claim (33): For any p ≥ 2, an application of Itô's formula yields the bound ‖θ_t − θ∗‖_2^p ≤ ∑_{j=1}^5 T_j, where

    T_1 := −(p/2) ∫_0^t ⟨θ∗ − θ_s, ∇F(θ_s)⟩ ‖θ_s − θ∗‖_2^{p−2} ds,            (35a)
    T_2 := (p/2) ∫_0^t ⟨θ∗ − θ_s, ∇F(θ_s) − ∇F_n(θ_s)⟩ ‖θ_s − θ∗‖_2^{p−2} ds,  (35b)
    T_3 := (p/(2n)) ∫_0^t ⟨θ_s − θ∗, ∇ log π(θ_s)⟩ ‖θ_s − θ∗‖_2^{p−2} ds,      (35c)
    T_4 := p ∫_0^t ‖θ_s − θ∗‖_2^{p−2} ⟨θ_s − θ∗, dB_s⟩,                        (35d)
    T_5 := ( p(p − 1) d/(2n) ) ∫_0^t ‖θ_s − θ∗‖_2^{p−2} ds.                    (35e)

We now upper bound the terms {T_j}_{j=1}^5 in terms of functionals of the quantity R_p. From the weak concavity of F guaranteed by Assumption W.1, we have

    E[T_1] = −(p/2) ∫_0^t E[ ⟨θ∗ − θ_s, ∇F(θ_s)⟩ ‖θ_s − θ∗‖_2^{p−2} ] ds ≤ −(p/2) ∫_0^t R_p(s) ds.   (36a)

Based on Assumption W.2, we find that

    E[T_2] = (p/2) ∫_0^t E[ ⟨θ∗ − θ_s, ∇F(θ_s) − ∇F_n(θ_s)⟩ ‖θ_s − θ∗‖_2^{p−2} ] ds
        ≤ (p/2) ε(n, δ) ∫_0^t E[ ‖θ_s − θ∗‖_2^{p−1} ζ(‖θ_s − θ∗‖_2) ] ds.

Since the function τ_(p) is convex, invoking Jensen's inequality, we obtain the following inequalities:

    ∫_0^t E[ ‖θ_s − θ∗‖_2^{p−1} ζ(‖θ_s − θ∗‖_2) ] ds ≤ ∫_0^t τ_(p)^{−1}( E[ τ_(p)( ‖θ_s − θ∗‖_2^{p−1} ζ(‖θ_s − θ∗‖_2) ) ] ) ds
        = ∫_0^t τ_(p)^{−1}( R_p(s) ) ds.

In light of the above inequalities, we have

    E[T_2] ≤ (p/2) ε(n, δ) ∫_0^t τ_(p)^{−1}( R_p(s) ) ds.                     (36b)

Moving to T_3 in equation (35c), given Assumptions A and B about the smoothness of the prior distribution π, its expectation is bounded as

    E[T_3] = (p/(2n)) ∫_0^t E[ ⟨θ_s − θ∗, ∇ log π(θ_s)⟩ ‖θ_s − θ∗‖_2^{p−2} ] ds
        ≤ (pB/(2n)) ∫_0^t E[ ‖θ_s − θ∗‖_2^{p−2} ] ds.                          (36c)

Since ν_(p) is a strictly increasing and convex function on [0, +∞), the function ν_(p)^{−1} is a concave function on [0, +∞). Invoking Jensen's inequality leads to the following inequality:

    ∫_0^t E[ ‖θ_s − θ∗‖_2^{p−2} ] ds ≤ ∫_0^t ( E[ ‖θ_s − θ∗‖_2^{p−1} ] )^{(p−2)/(p−1)} ds
        ≤ ∫_0^t ν_(p)^{−1}( E[ ν_(p)( ‖θ_s − θ∗‖_2^{p−1} ) ] )^{(p−2)/(p−1)} ds
        ≤ ∫_0^t ν_(p)^{−1}( R_p(s) )^{(p−2)/(p−1)} ds.                         (36d)

Combining the inequalities (36c) and (36d), we have

    E[T_3] ≤ (pB/(2n)) ∫_0^t ν_(p)^{−1}( R_p(s) )^{(p−2)/(p−1)} ds.            (36e)

Moving to the fourth term T_4 from equation (35d), we have

    E[T_4] = E[ ∫_0^t ‖θ_s − θ∗‖_2^{p−2} ⟨θ_s − θ∗, dB_s⟩ ] = 0,               (36f)

where we have used the martingale structure. Furthermore, in light of the moment bound in equation (36d), we have

    E[T_5] ≤ ( p(p − 1) d/(2n) ) ∫_0^t ν_(p)^{−1}( R_p(s) )^{(p−2)/(p−1)} ds.  (36g)

Collecting the bounds on the expectations of {T_j}_{j=1}^5 from equations (36a), (36b), (36e), (36f), and (36g), respectively, yields the claim (33).

Unique positive solution to equation (9): We now establish that equation (9) has a unique positive solution under the stated assumptions. Define the function

    ϑ(z) := ψ(z) − ( ε(n, δ) ζ(z) z + (B + d log(1/δ))/n ).

Since ψ(0) = 0, we have ϑ(0) < 0. On the other hand, based on Assumption W.4, lim inf_{z→+∞} ϑ(z) > 0. Therefore, there exists a positive solution to the equation ϑ(z) = 0.
Recall that ξ : R_+ → R is the inverse function of the strictly increasing function z ↦ z ζ(z). Therefore, we can write the function ϑ as follows:

    ϑ(z) = ϑ̃(r) := ψ(ξ(r)) − ε(n, δ) r − (B + d log(1/δ))/n,

where r = z · ζ(z). Given the convexity of the function r ↦ ψ(ξ(r)) guaranteed by Assumption W.3, the functions ϑ̃ and ϑ are convex. Putting the above results together, there exists a unique positive solution to equation (9).

6 Discussion
In this paper, we described an approach to analyzing the posterior contraction rates of parameters based on diffusion processes. Our theory depends on two important features: the convex-analytic structure of the population log-likelihood function F, and stochastic perturbation bounds between the gradient of F and the gradient of its sample counterpart F_n. For log-likelihoods that are strongly concave around the true parameter θ∗, we established posterior convergence rates for estimating parameters of the order (d/n)^{1/2}, with explicit dependence on other model parameters, valid under sufficient smoothness conditions on the prior distribution π and mild conditions on the perturbation error between ∇F_n and ∇F. On the other hand, when the population log-likelihood function is weakly concave, our analysis shows that convergence rates are more delicate: they depend on an interaction between the degree of weak concavity and the stochastic error bounds. In this setting, we proved that the posterior convergence rate of the parameter is upper bounded by the unique positive solution of a non-linear equation determined by this interplay. As an illustration of our general theory, we derived posterior convergence rates for three concrete examples: Bayesian logistic regression models, Bayesian single index models, and over-specified Bayesian location Gaussian mixture models.
Let us now discuss a few directions that arise naturally from our work. First, the current results are not directly applicable to establishing the asymptotic convergence of the posterior distribution of the parameter under locally weakly concave settings of F around the true parameter θ∗. More precisely, when F is locally strongly concave around θ∗, it is well known from the Bernstein–von Mises theorem that the posterior distribution of the parameter converges to a multivariate normal distribution centered at the maximum likelihood estimate (MLE) with covariance matrix given by (n I(θ∗))^{−1} (e.g., see the book [32]), where I(θ∗) denotes the Fisher information matrix at θ∗. Under the weak concavity setting of F, the Fisher information matrix I(θ∗) is degenerate. Therefore, the posterior distribution of the parameter can no longer be approximated by a multivariate Gaussian distribution in this setting.
Second, it is desirable to understand the posterior convergence rate of the parameter under non-concave settings of the population log-likelihood function F, which arise in various statistical models such as Latent Dirichlet Allocation [3]. In such settings, without a careful analysis of the multi-modal structure of the population and sample log-likelihood functions, the diffusion approach may lead to exponential dependence on the dimension d and sub-optimal dependence on other model parameters.

Acknowledgements
This work was partially supported by Office of Naval Research grant DOD ONR-N00014-18-
1-2640 and National Science Foundation grant NSF-DMS-1612948 to MJW; by NSF-DMS-
grant-1909365 joint to PLB and MJW; and by the Mathematical Data Science program of
the Office of Naval Research under grant number N00014-18-1-2764 to MIJ.

A Proofs of corollaries
In this appendix, we collect the proofs of several corollaries stated in Section 4. To summarize, we make use of Theorem 2 to establish the posterior contraction rates of parameters in all three examples. The crux of the proofs of these corollaries is a verification of Assumptions W.1, W.2, and W.3 so as to invoke the respective theorem. We first present the proof of Corollary 2 in Appendix A.1. Then we move to the proofs of Corollary 3 and Corollary 4 in Appendices A.2 and A.3, respectively. Note that the values of universal constants may change from line to line.

A.1 Proof of Corollary 2


We begin by verifying claim (17a) about the structure of the population log-likelihood function F^R, and claim (17b) about the uniform perturbation error between ∇F^R and ∇F_n^R.

A.1.1 Proof of claim (17a)

Following some algebra, we find that, up to an additive constant (the log φ(X) term) that does not depend on θ,

    F^R(θ) = E[ −Y log( 1 + e^{−⟨X, θ⟩} ) − (1 − Y) log( 1 + e^{⟨X, θ⟩} ) ]
        = −E[ ( 1/(1 + e^{−⟨X, θ∗⟩}) ) log( 1 + e^{−⟨X, θ⟩} ) + ( 1/(1 + e^{⟨X, θ∗⟩}) ) log( 1 + e^{⟨X, θ⟩} ) ],

where the above expectations are taken with respect to X ∼ N(0, I_d) and Y | X following the distribution generated by the logistic model (14), with Y interpreted as the indicator of the label being 1. By taking the derivative of F^R(θ), we obtain

    ⟨∇F^R(θ), θ∗ − θ⟩ = E[ ( (1 + e^{⟨X, θ⟩})/(1 + e^{⟨X, θ∗⟩}) − (1 + e^{−⟨X, θ⟩})/(1 + e^{−⟨X, θ∗⟩}) ) · ( e^{−⟨X, θ⟩}/(1 + e^{−⟨X, θ⟩})^2 ) ⟨X, θ∗ − θ⟩ ].

By the mean value theorem, there exists ξ between 0 and ⟨X, θ − θ∗⟩ such that

    (1 + e^{⟨X, θ⟩})/(1 + e^{⟨X, θ∗⟩}) − (1 + e^{−⟨X, θ⟩})/(1 + e^{−⟨X, θ∗⟩})
        = ⟨X, θ − θ∗⟩ ( e^{⟨X, θ∗⟩+ξ}/(1 + e^{⟨X, θ∗⟩}) + e^{−⟨X, θ∗⟩−ξ}/(1 + e^{−⟨X, θ∗⟩}) ).

In light of the above equality, we arrive at the following inequalities:

    ⟨∇F^R(θ), θ∗ − θ⟩ ≥ E[ inf_{|ξ|∈[0, |⟨X, θ−θ∗⟩|]} ( e^{⟨X, θ∗⟩+ξ}/(1 + e^{⟨X, θ∗⟩}) + e^{−⟨X, θ∗⟩−ξ}/(1 + e^{−⟨X, θ∗⟩}) )
            × ( e^{−⟨X, θ⟩}/(1 + e^{−⟨X, θ⟩})^2 ) |⟨X, θ − θ∗⟩|^2 ]
        ≥ E[ (1/2) e^{−|⟨X, θ−θ∗⟩|} ( e^{−⟨X, θ⟩}/(1 + e^{−⟨X, θ⟩})^2 ) |⟨X, θ − θ∗⟩|^2 ]
        ≥ (1/8) E[ e^{−|⟨X, θ−θ∗⟩| − |⟨X, θ⟩|} |⟨X, θ − θ∗⟩|^2 ]
        ≥ (1/(8e^4)) E[ 1{|⟨X, θ⟩| ≤ 2, |⟨X, θ−θ∗⟩| ≤ 2} |⟨X, θ − θ∗⟩|^2 ].

Since X ∼ N(0, I_d), the pair ( ⟨X, θ⟩, ⟨X, θ − θ∗⟩ ) is jointly Gaussian with mean zero, variances ‖θ‖_2^2 and ‖θ − θ∗‖_2^2, and covariance ⟨θ, θ − θ∗⟩. Given that result, a direct calculation leads to

    E[ 1{|⟨X, θ⟩| ≤ 2, |⟨X, θ−θ∗⟩| ≤ 2} |⟨X, θ − θ∗⟩|^2 ] ≥ c ‖θ − θ∗‖_2^2 / ( (1 + ‖θ‖_2)(1 + ‖θ − θ∗‖_2) ),

for a universal constant c > 0. Collecting the above results, for all θ such that ‖θ − θ∗‖_2 ≤ 1, we obtain

    ⟨∇F^R(θ), θ∗ − θ⟩ ≥ c ‖θ − θ∗‖_2^2 / ( (1 + ‖θ‖_2)(1 + ‖θ − θ∗‖_2) ) ≥ c′ ‖θ − θ∗‖_2^2 / (1 + ‖θ∗‖_2).

For θ with ‖θ − θ∗‖_2 > 1, let θ̃ = θ∗ + (θ − θ∗)/‖θ − θ∗‖_2. Then we find that

    ⟨∇F^R(θ), θ∗ − θ⟩ ≥ ⟨∇F^R(θ̃), θ∗ − θ⟩ ≥ ( c/(2(1 + ‖θ∗‖_2)) ) ‖θ − θ∗‖_2,

which yields the claim (17a).

A.1.2 Proof of the bound (17b)


In this appendix, we prove the uniform bound (17b) between the empirical and population likelihood gradients. It suffices to establish the following stronger result:

    Z := sup_{θ∈R^d} ‖∇F_n^R(θ) − ∇F^R(θ)‖_2 ≤ c ( √(d/n) + √(log(1/δ)/n) + log(1/δ)/n ),   (37)

with probability at least 1 − δ, for any n/log n ≥ c_0 d log(1/δ), where c_0 is a universal constant.
We begin by writing Z as the supremum of a stochastic process. Let S^{d−1} denote the Euclidean sphere in R^d, and define the stochastic process

    Z_{u,θ} := (1/n) ∑_{i=1}^n { f_{u,θ}(X_i, Y_i) − E[ f_{u,θ}(X, Y) ] },   where   f_{u,θ}(x, y) = y ⟨x, u⟩ e^{y⟨x, θ⟩} / (1 + e^{y⟨x, θ⟩}),

indexed by vectors u ∈ S^{d−1} and θ ∈ R^d. The outer expectation in the above display is taken with respect to (X, Y) from the logistic model (14). Observe that Z = sup_{u∈S^{d−1}} sup_{θ∈R^d} Z_{u,θ}. Let {u^1, . . . , u^N} be a 1/8-covering of S^{d−1} in the Euclidean norm; there exists such a set with N ≤ 17^d elements. By a standard discretization argument (see Chapter 6 of [35]), we have

    Z ≤ 2 max_{j=1,...,N} sup_{θ∈R^d} Z_{u^j,θ}.

Accordingly, the remainder of our argument focuses on bounding the random variable V := sup_{θ∈R^d} Z_{u,θ}, where the vector u ∈ S^{d−1} should be understood as arbitrary but fixed.
Define the symmetrized random variable
\[
V' := \sup_{\theta \in \mathbb{R}^d}\Biggl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i\langle X_i, u\rangle\,\varphi_\theta(X_i, Y_i)\Biggr|, \quad \text{where } \varphi_\theta(x, y) = \frac{y\,e^{y\langle x, \theta\rangle}}{1 + e^{y\langle x, \theta\rangle}},
\]
and $\{\varepsilon_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables. By a symmetrization inequality for probabilities [33], we have
\[
\mathbb{P}[V \geq t] \leq c_1\,\mathbb{P}\bigl[V' \geq c_2 t\bigr],
\]

where $c_1$ and $c_2$ are two positive universal constants. We now analyze the Rademacher process that defines $V'$, conditionally on $\{(X_i, Y_i)\}_{i=1}^n$. We first use a functional Bernstein inequality to control its deviations above its expectation. For a parameter $b > 0$ to be chosen, define the event
\[
\mathcal{E}_b := \Biggl\{\frac{1}{n}\sum_{i=1}^n \langle X_i, u\rangle^2 \leq 2 \quad \text{and} \quad |\langle X_i, u\rangle| \leq b \text{ for all } i = 1, \ldots, n\Biggr\}.
\]
Conditioned on $\mathcal{E}_b$, we have $|\langle X_i, u\rangle\varphi_\theta(X_i, Y_i)| \leq b$ for all $i = 1, \ldots, n$. Moreover, we have
\[
\Sigma^2 := \sup_{\theta \in \mathbb{R}^d}\frac{1}{n}\sum_{i=1}^n \langle X_i, u\rangle^2\varphi_\theta^2(X_i, Y_i) \leq 2.
\]
Consequently, by Talagrand's theorem on empirical processes (Theorem 3.27 in [35]), we find that
\[
\mathbb{P}\bigl[V' \geq \mathbb{E}[V'] + s \mid \mathcal{E}_b\bigr] \leq 2\exp\Biggl(-\frac{ns^2}{16e + 4bs}\Biggr) \quad \text{for all } s > 0.
\]
We now bound the probability $\mathbb{P}[\mathcal{E}_b^c]$ for a suitable choice of $b$. By standard $\chi^2$-tail bounds (Example 2.11 of [35]), we have $\mathbb{P}[\frac{1}{n}\sum_{i=1}^n \langle X_i, u\rangle^2 \geq 2] \leq e^{-n/8}$. By concentration of Gaussian maxima (Example 2.29 of [35]), we have
\[
\mathbb{P}\Bigl[\max_{i=1,\ldots,n}|\langle X_i, u\rangle| \geq \sqrt{4\log n} + t\Bigr] \leq e^{-nt^2/2}.
\]
Setting $t = \sqrt{n/d}$ and using the assumption that $n/\log n \geq c_0 d$, we find that
\[
\mathbb{P}\Bigl[\max_{i=1,\ldots,n}|\langle X_i, u\rangle| \geq b\Bigr] \leq e^{-c_2 n^2/d}, \quad \text{where } b := c_1\sqrt{n/d}.
\]
Consequently, we have
\begin{align}
\mathbb{P}\bigl[V' \geq \mathbb{E}[V'] + s\bigr] &\leq \exp\Biggl(-\frac{ns^2}{16e + c_3\sqrt{n/d}\,s}\Biggr) + e^{-c_2 n^2/d} + e^{-n/8} \notag \\
&\leq e^{-\frac{ns^2}{32e}} + e^{-\frac{s\sqrt{nd}}{2c_3}} + e^{-c_2 n^2/d} + e^{-n/8}. \tag{38}
\end{align}

We now bound the expectation of $V'$, first over the Rademacher variables. Consider the following function class:
\[
\mathcal{G} := \bigl\{g_\theta : (x, y) \mapsto \langle x, u\rangle\varphi_\theta(x, y) \mid \theta \in \mathbb{R}^d\bigr\}.
\]
It is clear that the function class $\mathcal{G}$ has the envelope function $\bar{G}(x) := |\langle x, u\rangle|$. We claim that the $L^2$-covering number of $\mathcal{G}$ can be bounded as
\[
\bar{N}(t) := \sup_{Q} N\bigl(\mathcal{G}, \|\cdot\|_{L^2(Q)}, t\,\|\bar{G}\|_{L^2(Q)}\bigr) \leq \Bigl(\frac{1}{t}\Bigr)^{c(d+1)} \quad \text{for all } t > 0, \tag{39}
\]
where $c > 0$ is a universal constant.


Let us take the claim (39) as given for the moment, and use it to bound the expectation of $V'$ over the Rademacher variables. Define the empirical expectation $P_n(\bar{G}^2) := \frac{1}{n}\sum_{i=1}^n \langle X_i, u\rangle^2$. Invoking Dudley's entropy integral bound (e.g., Theorem 5.22 of [35]), we find that there are universal constants $C, C'$ such that
\[
\mathbb{E}_\varepsilon[V'] = \mathbb{E}_\varepsilon\Biggl[\sup_{g \in \mathcal{G}}\Biggl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i g(X_i, Y_i)\Biggr|\Biggr] \leq C\sqrt{\frac{P_n(\bar{G}^2)}{n}}\int_0^1\sqrt{1 + \log\bar{N}(t)}\,dt \leq C'\sqrt{P_n(\bar{G}^2)}\,\sqrt{\frac{d}{n}}.
\]
Up to this point, we have been conditioning on the observations $\{X_i\}_{i=1}^n$. Taking expectations over them as well yields
\[
\mathbb{E}_{\varepsilon, X_1^n}[V'] \leq C'\sqrt{\frac{d}{n}}\,\mathbb{E}_{X_1^n}\Bigl[\sqrt{P_n(\bar{G}^2)}\Bigr] \overset{(i)}{\leq} C'\sqrt{\frac{d}{n}}\sqrt{\mathbb{E}_{X_1^n}\bigl[P_n(\bar{G}^2)\bigr]} \overset{(ii)}{=} C'\sqrt{\frac{d}{n}}, \tag{40}
\]
where step (i) follows from Jensen's inequality, and step (ii) uses the fact that $\mathbb{E}_{X_1^n}[P_n(\bar{G}^2)] = 1$.
Putting together the bounds (38) and (40) yields
\[
\mathbb{P}\Biggl[V' \geq C'\sqrt{\frac{d}{n}} + s\Biggr] \leq e^{-\frac{ns^2}{32e}} + e^{-\frac{s\sqrt{nd}}{2c_3}} + e^{-c_2 n^2/d} + e^{-n/8}.
\]
This probability bound holds for each $u \in \mathbb{S}^{d-1}$. By taking the union bound over the $1/8$-covering set $\{u^1, \ldots, u^N\}$ of $\mathbb{S}^{d-1}$, where $N \leq 17^d$, and choosing $s = c'\Bigl(\sqrt{\frac{d}{n}} + \sqrt{\frac{\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{n}\Bigr)$ for a sufficiently large constant $c' > 0$, we obtain the claim (37).
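Before turning to the covering-number claim, we remark that the $\sqrt{d/n}$ scaling in (37) can be checked in simulation. The sketch below replaces the supremum over $\mathbb{R}^d$ by a maximum over a crude random collection of parameter values, so it only produces a lower-bound proxy for $Z$; the net size, sample sizes, and the fresh-sample approximation of the population gradient are all rough, assumed choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def Z_proxy(n, d, n_dirs=200, n_mc=100_000):
    # Lower-bound proxy for Z = sup_theta ||grad F_n^R(theta) - grad F^R(theta)||_2:
    # the supremum over R^d is replaced by a maximum over a random net of thetas.
    theta_star = np.ones(d) / np.sqrt(d)
    X = rng.normal(size=(n, d))
    Y = rng.binomial(1, sigmoid(X @ theta_star))
    X_pop = rng.normal(size=(n_mc, d))  # fresh draw approximates the population term
    best = 0.0
    for _ in range(n_dirs):
        theta = theta_star + rng.uniform(0.0, 3.0) * rng.normal(size=d) / np.sqrt(d)
        g_emp = ((Y - sigmoid(X @ theta))[:, None] * X).mean(axis=0)
        g_pop = ((sigmoid(X_pop @ theta_star) - sigmoid(X_pop @ theta))[:, None] * X_pop).mean(axis=0)
        best = max(best, np.linalg.norm(g_emp - g_pop))
    return best

d = 5
for n in [500, 2000, 8000]:
    print(f"n={n:5d}  Z_proxy={Z_proxy(n, d):.4f}  sqrt(d/n)={np.sqrt(d / n):.4f}")
```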

Proof of claim (39): We consider a fixed sequence $(x_i, y_i, t_i)_{i=1}^m$, where $y_i \in \{-1, 1\}$, $x_i \in \mathbb{R}^d$, and $t_i \in \mathbb{R}$ for $i \in [m]$. Now, we suppose that for any binary sequence $(z_i)_{i=1}^m \in \{0, 1\}^m$, there exists $\theta \in \mathbb{R}^d$ such that
\[
z_i = \mathbb{I}\bigl[\langle x_i, u\rangle\varphi_\theta(x_i, y_i) \geq t_i\bigr] \quad \text{for all } i \in [m].
\]
Following some algebra, we find that
\[
y_i x_i^\top\theta - \log\Biggl(\frac{y_i t_i}{\langle x_i, u\rangle - y_i t_i}\Biggr) \begin{cases} \geq 0 & \text{if } z_i = 1, \\ < 0 & \text{if } z_i = 0. \end{cases}
\]
Consequently, the set $\bigl\{\bigl[y_i x_i,\ \log\bigl(y_i t_i/(\langle x_i, u\rangle - y_i t_i)\bigr)\bigr]\bigr\}_{i=1}^m$ of $(d+1)$-dimensional points can be shattered by linear separators. Therefore, we have $m \leq d + 2$, which shows that the VC subgraph dimension of $\mathcal{G}$ is at most $d + 2$ (see, e.g., the book [34]). As a consequence, we obtain the conclusion of the claim (39).

A.2 Proof of Corollary 3


The claim (24a) on the weak concavity of the population log-likelihood function $F^I$ is straightforward. Therefore, we only need to establish the claim (24b) about the uniform perturbation bound between $\nabla F^I$ and $\nabla F_n^I$.

A.2.1 Bounding the difference ∇F I − ∇FnI

It is convenient to introduce the shorthand
\[
\log p_\theta(x, y) = -\bigl(y - (x^\top\theta)^p\bigr)^2/2 \quad \text{for all } (x, y) \in \mathbb{R}^{d+1},
\]
valid up to an additive constant. We then compute the gradient
\[
\nabla\log p_\theta(x, y) = p\bigl(y - (x^\top\theta)^p\bigr)\bigl(x^\top\theta\bigr)^{p-1}x.
\]

Fix an arbitrary $r > 0$. By applying the triangle inequality, we find that
\[
\sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^I(\theta) - \nabla F^I(\theta)\bigr\|_2 = \sup_{\theta \in \mathbb{B}(\theta^*, r)}\Biggl\|\frac{1}{n}\sum_{i=1}^n \nabla\log p_\theta(X_i, Y_i) - \mathbb{E}_{(X,Y)}\bigl[\nabla\log p_\theta(X, Y)\bigr]\Biggr\|_2 \leq p\,\{J_1 + J_2\},
\]
where we define
\begin{align}
J_1 &:= \sup_{\theta \in \mathbb{B}(\theta^*, r)}\Biggl\|\frac{1}{n}\sum_{i=1}^n Y_i X_i\bigl(X_i^\top\theta\bigr)^{p-1}\Biggr\|_2, \quad \text{and} \tag{41a} \\
J_2 &:= \sup_{\theta \in \mathbb{B}(\theta^*, r)}\Biggl\|\frac{1}{n}\sum_{i=1}^n X_i\bigl(X_i^\top\theta\bigr)^{2p-1} - \mathbb{E}_X\Bigl[X\bigl(X^\top\theta\bigr)^{2p-1}\Bigr]\Biggr\|_2. \tag{41b}
\end{align}

We claim that there is a universal constant $c$ such that for any $\delta \in (0, 1)$, the quantities $J_1$ and $J_2$ can be bounded as
\begin{align}
J_1 &\leq c\,r^{p-1}\Biggl[\sqrt{\frac{d + \log\frac{1}{\delta}}{n}} + \frac{1}{n^{3/2}}\Bigl(d + \log\frac{n}{\delta}\Bigr)^{p+1}\Biggr], \quad \text{and} \tag{42a} \\
J_2 &\leq c\,r^{2p-1}\Biggl[\sqrt{\frac{d + \log\frac{1}{\delta}}{n}} + \frac{1}{n^{3/2}}\Bigl(d + \log\frac{n}{\delta}\Bigr)^{2p+1}\Biggr], \tag{42b}
\end{align}
with probability at least $1 - \delta$.


Taking the above claims as given for the moment, we proceed to finish the proof of the uniform perturbation bound between $\nabla F_n^I$ and $\nabla F^I$ in (24b). Plugging the concentration bounds (42a) and (42b) into (41), we obtain
\[
\sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^I(\theta) - \nabla F^I(\theta)\bigr\|_2 \leq c\Biggl[\bigl(r^{p-1} + r^{2p-1}\bigr)\sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{r^{p-1}\bigl(d + \log(1/\delta) + \log n\bigr)^{p+1} + r^{2p-1}\bigl(d + \log(1/\delta) + \log n\bigr)^{2p+1}}{n^{3/2}}\Biggr],
\]
for any $r > 0$, with probability at least $1 - 2\delta$, where $c$ is a universal constant. When $n \geq c'(d + \log(d/\delta))^{2p}$ for some universal constant $c'$, the second term on the right-hand side of the above inequality is dominated by the first. As a consequence, we have proved the claim (24b).

Proof of claim (42a): Following some algebra, we find that
\[
\sup_{r > 0}\frac{\sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\frac{1}{n}\sum_{i=1}^n Y_i X_i(X_i^\top\theta)^{p-1}\bigr\|_2}{r^{p-1}} \leq \sup_{r > 0}\sup_{\theta \in \mathbb{B}(\theta^*, r)}\Biggl\|\frac{1}{n}\sum_{i=1}^n Y_i X_i\Bigl(X_i^\top\frac{\theta}{\|\theta\|_2}\Bigr)^{p-1}\Biggr\|_2 = \underbrace{\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl\|\frac{1}{n}\sum_{i=1}^n Y_i X_i\bigl(X_i^\top\theta\bigr)^{p-1}\Biggr\|_2}_{=:\,Z}. \tag{43}
\]

Thus, in order to establish the claim (42a), it suffices to show that there is a universal constant $c$ such that
\[
\mathbb{P}\Biggl(Z \leq c\Biggl[\sqrt{\frac{d + \log(1/\delta)}{n}} + \frac{1}{n^{3/2}}\Bigl(d + \log\frac{n}{\delta}\Bigr)^{p+1}\Biggr]\Biggr) \geq 1 - \delta. \tag{44}
\]

By the variational definition of the Euclidean norm, we have
\[
Z = \sup_{\theta \in \mathbb{S}^{d-1}}\Biggl\|\frac{1}{n}\sum_{i=1}^n Y_i X_i\bigl(X_i^\top\theta\bigr)^{p-1}\Biggr\|_2 = \sup_{u \in \mathbb{S}^{d-1}}\underbrace{\sup_{\theta \in \mathbb{S}^{d-1}}\frac{1}{n}\sum_{i=1}^n Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p-1}}_{=:\,Z_u}.
\]
Using a discretization argument as in Appendix A.1.2, we find that
\[
Z \leq 2\sup_{u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)} Z_u,
\]
where $\mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$ is a $\frac{1}{8}$-covering of $\mathbb{S}^{d-1}$ under the $\|\cdot\|_2$ norm. Therefore, it is sufficient to bound $Z_u$ for any fixed $u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$.
For any even integer $q \geq 2$, a symmetrization argument (e.g., Theorem 4.10 of [35]) yields
\[
\mathbb{E}\Biggl[\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{1}{n}\sum_{i=1}^n Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p-1}\Biggr|^q\Biggr] \leq \mathbb{E}\Biggl[\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{2}{n}\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p-1}\Biggr|^q\Biggr],
\]
where $\{\varepsilon_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables. In order to facilitate the proof argument, for any $t > 0$, we introduce the shorthand $\mathcal{N}(t) := \mathcal{N}(t, \mathbb{S}^{d-1}, \|\cdot\|_2) = \{\theta_1, \ldots, \theta_{\bar{N}(t)}\}$, where $\bar{N}(t) = N(t, \mathbb{S}^{d-1}, \|\cdot\|_2)$. For any compact set $\Omega \subseteq \mathbb{R}^d$, we define the following random variable:
\[
R(\Omega) := \sup_{\theta \in \Omega,\ p' \in [1, p]}\frac{2}{n}\Biggl|\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p'-1}\Biggr|.
\]

By the definition of a $t$-covering, we obtain
\begin{align}
R(\mathbb{S}^{d-1}) &= \sup_{\theta \in \mathbb{S}^{d-1},\ p' \in [1, p]}\frac{2}{n}\Biggl|\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p'-1}\Biggr| \notag \\
&\leq \sup_{\theta_k \in \mathcal{N}(t),\ \|\eta\|_2 \leq t,\ p' \in [1, p]}\frac{2}{n}\Biggl|\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top(\theta_k + \eta)\bigr)^{p'-1}\Biggr| \tag{45} \\
&\leq \sup_{\theta_k \in \mathcal{N}(t),\ p' \in [1, p]}\frac{4}{n}\Biggl|\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta_k\bigr)^{p'-1}\Biggr| + \max_{p' \in [1, p]}\sum_{b=1}^{p'-1}\binom{p'-1}{b}\sup_{\|\eta\|_2 \leq t}\frac{4}{n}\Biggl|\sum_{i=1}^n \varepsilon_i Y_i\langle X_i, u\rangle\langle X_i, \eta\rangle^b\bigl(X_i^\top\theta_k\bigr)^{p'-1-b}\Biggr| \notag \\
&\leq R(\mathcal{N}(t)) + 2^{p+1}\,t\cdot R(\mathbb{S}^{d-1}). \notag
\end{align}
By choosing $t = 2^{-(p+2)}$, the above inequality leads to $R(\mathbb{S}^{d-1}) \leq 2R\bigl(\mathcal{N}(2^{-(p+2)})\bigr)$.
In order to obtain a high-probability upper bound on $R(\mathcal{N}(2^{-(p+2)}))$, we bound its moments. By the union bound, for any $q \geq 1$, we have
\[
\mathbb{E}\Bigl[R^q\bigl(\mathcal{N}(2^{-(p+2)})\bigr)\Bigr] \leq p\cdot\bar{N}\bigl(2^{-(p+2)}\bigr)\cdot\sup_{\theta \in \mathbb{S}^{d-1},\ p' \in [1, p]}\underbrace{\mathbb{E}\Biggl[\Biggl|\frac{4}{n}\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p'-1}\Biggr|^q\Biggr]}_{=:\,T_1(\theta, p')}.
\]

In order to upper bound $T_1(\theta, p')$, we apply Khintchine's inequality [4]; it guarantees that there is a universal constant $C$ such that
\[
T_1(\theta, p') \leq \Bigl(\frac{Cq}{n^2}\Bigr)^{q/2}\,\mathbb{E}\Biggl[\Biggl(\sum_{i=1}^n Y_i^2\bigl(X_i^\top u\bigr)^2\bigl(X_i^\top\theta\bigr)^{2(p'-1)}\Biggr)^{q/2}\Biggr], \tag{46a}
\]
for any $p' \in [1, p]$. In order to further upper bound the right-hand side, we define the function $g_{\theta, u}(x, y) := y^2(x^\top u)^2(x^\top\theta)^{2(p'-1)}$. For any $i \in [n]$, we can verify that
\begin{align*}
\mathbb{E}[g_{\theta, u}(X_i, Y_i)] &= \mathbb{E}\bigl[Y_i^2\bigr]\cdot\mathbb{E}\Bigl[\bigl(X_i^\top u\bigr)^2\bigl(X_i^\top\theta\bigr)^{2(p'-1)}\Bigr] \leq (2p')^{p'}, \\
\mathbb{E}[g_{\theta, u}(X_i, Y_i)^q] &= \mathbb{E}\bigl[Y_i^{2q}\bigr]\cdot\mathbb{E}\Bigl[\bigl(X_i^\top u\bigr)^{2q}\bigl(X_i^\top\theta\bigr)^{2(p'-1)q}\Bigr] \leq (2q)^q(2p'q)^{p'q}.
\end{align*}

Given the above bounds, invoking the result of Lemma 2 leads to the following probability bound:
\[
\mathbb{P}\Biggl(\Biggl|\frac{1}{n}\sum_{i=1}^n g_{\theta, u}(X_i, Y_i) - \mathbb{E}_{(X,Y)}[g_{\theta, u}(X, Y)]\Biggr| > (8p')^{p'}\sqrt{\frac{\log(4/\delta)}{n}} + \frac{1}{n}\Bigl(2p'\log\frac{n}{\delta}\Bigr)^{p'+1}\Biggr) \leq \delta,
\]
for all $\delta > 0$, where the outer expectation in the above display is taken with respect to $(X, Y)$ such that $X \sim \mathcal{N}(0, I_d)$ and $Y \mid X = x \sim \mathcal{N}\bigl((x^\top\theta^*)^p, 1\bigr)$. Putting the previous bounds together, we obtain
\begin{align}
\mathbb{E}\Biggl[\Biggl(\frac{1}{n}\sum_{i=1}^n g_{\theta, u}(X_i, Y_i)\Biggr)^{q/2}\Biggr] &\leq 2^{q/2}\bigl(\mathbb{E}_{(X,Y)}[g_{\theta, u}(X, Y)]\bigr)^{q/2} + 2^{q/2}\,\mathbb{E}\Biggl[\Biggl|\frac{1}{n}\sum_{i=1}^n g_{\theta, u}(X_i, Y_i) - \mathbb{E}_{(X,Y)}[g_{\theta, u}(X, Y)]\Biggr|^{q/2}\Biggr] \notag \\
&\leq (4p')^{p'q} + q\int_0^{+\infty}\lambda^{q-1}\,\mathbb{P}\Biggl(\Biggl|\frac{1}{n}\sum_{i=1}^n g_{\theta, u}(X_i, Y_i) - \mathbb{E}_{(X,Y)}[g_{\theta, u}(X, Y)]\Biggr| > \lambda\Biggr)d\lambda \notag \\
&\leq (4p')^{p'q} + q(p'+1)\int_0^1\Biggl((8p')^{p'}\sqrt{\frac{\log(4/\delta)}{n}} + \frac{1}{n}\Bigl(2p'\log\frac{n}{\delta}\Bigr)^{p'+1}\Biggr)^q\log^{-1}\Bigl(\frac{4}{\delta}\Bigr)\frac{d\delta}{\delta} \notag \\
&\leq (4p')^{p'q} + Cp'q^q\Biggl(\frac{(16p')^{p'q}}{n^{q/2}}\,\Gamma(q/2) + \frac{(2p')^{(p'+1)q}}{n^q}\Bigl((2\log n)^{(p'+1)q} + \Gamma\bigl((p'+1)q\bigr)\Bigr)\Biggr), \tag{46b}
\end{align}

where $\Gamma(\cdot)$ stands for the Gamma function. Combining the bounds (46a) and (46b), we arrive at the following upper bound on $T_1(\theta, p')$:
\[
T_1(\theta, p') \leq \Bigl(\frac{Cq}{n}\Bigr)^{q/2}\Biggl[(4p')^{p'q} + Cp'q^q\Biggl(\frac{(16p')^{p'q}}{n^{q/2}}\,\Gamma(q/2) + \frac{(2p')^{(p'+1)q}}{n^q}\Bigl((2\log n)^{(p'+1)q} + \Gamma\bigl((p'+1)q\bigr)\Bigr)\Biggr)\Biggr]. \tag{47}
\]

Plugging the upper bound on $T_1$ from equation (47) into equation (45) and taking the union bound over all $\theta_k \in \mathcal{N}(2^{-(p+2)}, \mathbb{S}^{d-1}, \|\cdot\|_2)$, we find that
\begin{align*}
\mathbb{E}\bigl[R^q(\mathbb{S}^{d-1})\bigr] &\leq 2^q\,\mathbb{E}\Bigl[R^q\bigl(\mathcal{N}(2^{-(p+2)})\bigr)\Bigr] \\
&\leq 2^q\,p\,2^{(p+3)d}\sup_{\theta \in \mathbb{S}^{d-1},\ p' \in [1, p]} T_1(\theta, p') \\
&\leq 2^q\,p\,2^{(p+3)d}\Bigl(\frac{Cq}{n}\Bigr)^{q/2}\Biggl[(4p)^{pq} + Cpq^q\Biggl(\frac{(16p)^{pq}}{n^{q/2}}\,\Gamma(q/2) + \frac{(2p)^{(p+1)q}}{n^q}\Bigl((2\log n)^{(p+1)q} + \Gamma\bigl((p+1)q\bigr)\Bigr)\Biggr)\Biggr],
\end{align*}
for any given $u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$. Taking the supremum over $u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$ on both sides of the above bound and applying Minkowski's inequality, we obtain
\begin{align*}
\bigl(\mathbb{E}|Z|^q\bigr)^{1/q} &\leq \Bigl(\frac{64}{7}\Bigr)^{d/q}\Biggl(\mathbb{E}\Biggl[\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{2}{n}\sum_{i=1}^n \varepsilon_i Y_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{p-1}\Biggr|^q\Biggr]\Biggr)^{1/q} \\
&\leq 2\bigl(10\cdot 2^{p+3}\bigr)^{d/q}\Biggl[\sqrt{\frac{C_p q}{n}} + \frac{C_p q}{n} + \frac{C_p}{n^{3/2}}(\log n + q)^{p+1}\Biggr],
\end{align*}
where $C_p$ is a universal constant depending only on $p$. By choosing $q = d(p + 7) + \log\frac{2}{\delta}$ and using Markov's inequality, we find that
\[
\mathbb{P}\Biggl(|Z| \geq C_p\Biggl[\sqrt{\frac{d + \log\frac{1}{\delta}}{n}} + \frac{1}{n^{3/2}}\Bigl(d + \log\frac{n}{\delta}\Bigr)^{p+1}\Biggr]\Biggr) \leq \delta.
\]
Thus, we have established the claim (42a).

Proof of claim (42b): In order to obtain a uniform concentration bound for $J_2$, we use an argument similar to that from the proof of claim (42a). In particular, since the polynomial $(x^\top\theta)^{2p-1}$ is homogeneous in $\theta$, using the same normalization as in equation (43), it suffices to demonstrate that
\[
\mathbb{P}\Biggl(W \leq c\Biggl[\sqrt{\frac{d + \log\frac{1}{\delta}}{n}} + \frac{1}{n^{3/2}}\Bigl(d + \log\frac{n}{\delta}\Bigr)^{2p+1}\Biggr]\Biggr) \geq 1 - \delta, \tag{48}
\]
for any $\delta > 0$, where $W := \sup_{\theta \in \mathbb{S}^{d-1}}\bigl\|\frac{1}{n}\sum_{i=1}^n X_i(X_i^\top\theta)^{2p-1} - \mathbb{E}_X[X(X^\top\theta)^{2p-1}]\bigr\|_2$.
For each $u \in \mathbb{R}^d$, define the random variable
\[
W_u := \sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{1}{n}\sum_{i=1}^n\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{2p-1} - \mathbb{E}_X\Bigl[\bigl(X^\top u\bigr)\bigl(X^\top\theta\bigr)^{2p-1}\Bigr]\Biggr|.
\]
It suffices to bound $W_u$ for each fixed $u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$; we do so by controlling its moments. By a symmetrization argument, we have
\[
\mathbb{E}\Biggl[\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{1}{n}\sum_{i=1}^n\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{2p-1} - \mathbb{E}_X\Bigl[\bigl(X^\top u\bigr)\bigl(X^\top\theta\bigr)^{2p-1}\Bigr]\Biggr|^q\Biggr] \leq \mathbb{E}\Biggl[\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{2}{n}\sum_{i=1}^n \varepsilon_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{2p-1}\Biggr|^q\Biggr].
\]
From here, we can use the same technique as that in and after inequality (45) to bound the right-hand side of the above display; we therefore only highlight the main differences. For any compact set $\Omega \subseteq \mathbb{R}^d$, we define the random variable
\[
Q(\Omega) := \sup_{\theta \in \Omega,\ p' \in [1, p]}\frac{2}{n}\Biggl|\sum_{i=1}^n \varepsilon_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{2p'-1}\Biggr|.
\]

Following an argument similar to that in equation (45), we can check that $Q(\mathbb{S}^{d-1}) \leq 2Q\bigl(\mathcal{N}(2^{-(2p+2)})\bigr)$. A direct application of the union bound leads to
\[
\mathbb{E}\Bigl[Q^q\bigl(\mathcal{N}(2^{-(2p+2)})\bigr)\Bigr] \leq 2p\cdot\bar{N}\bigl(2^{-(2p+2)}\bigr)\cdot\sup_{\theta \in \mathbb{S}^{d-1},\ p' \in [1, p]}\underbrace{\mathbb{E}\Biggl[\Biggl|\frac{4}{n}\sum_{i=1}^n \varepsilon_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{2p'-1}\Biggr|^q\Biggr]}_{=:\,T_2(\theta, p')}.
\]

We control $T_2(\theta, p')$ using the same approach as in the proof of claim (42a). For notational convenience, we denote $h_{\theta, u}(x) := (x^\top u)^2(x^\top\theta)^{2(2p'-1)}$. Simple algebra leads to the following upper bounds:
\[
\mathbb{E}[h_{\theta, u}(X_i)] \leq (4p')^{2p'}, \qquad \mathbb{E}[h_{\theta, u}(X_i)^q] \leq (4p'q)^{2p'q}.
\]
Invoking the result of Lemma 2, the above bounds lead to the following probability bound:
\[
\mathbb{P}\Biggl(\Biggl|\frac{1}{n}\sum_{i=1}^n h_{\theta, u}(X_i) - \mathbb{E}_X[h_{\theta, u}(X)]\Biggr| > (16p')^{2p'}\sqrt{\frac{\log(4/\delta)}{n}} + \Bigl(4p'\log\frac{n}{\delta}\Bigr)^{2p'}\frac{\log(4/\delta)}{n}\Biggr) \leq \delta.
\]

Therefore, we further obtain that
\[
\mathbb{E}\Biggl[\Biggl(\frac{1}{n}\sum_{i=1}^n h_{\theta, u}(X_i)\Biggr)^{q/2}\Biggr] \leq (8p')^{2p'q} + Cp'q^q\Biggl(\frac{(32p')^{2p'q}}{n^{q/2}}\,\Gamma(q/2) + \frac{(4p')^{(2p'+1)q}}{n^q}\Bigl((2\log n)^{(2p'+1)q} + \Gamma\bigl((2p'+1)q\bigr)\Bigr)\Biggr).
\]
Combining the above bound with an upper bound on $T_2(\theta, p')$ based on Khintchine's inequality, we obtain the following inequality:
\[
T_2(\theta, p') \leq \Bigl(\frac{Cq}{n}\Bigr)^{q/2}\Biggl[(8p')^{2p'q} + Cp'q^q\Biggl(\frac{(32p')^{2p'q}}{n^{q/2}}\,\Gamma(q/2) + \frac{(4p')^{(2p'+1)q}}{n^q}\Bigl((2\log n)^{(2p'+1)q} + \Gamma\bigl((2p'+1)q\bigr)\Bigr)\Biggr)\Biggr].
\]
Collecting the above bounds leads to
\begin{align*}
\mathbb{E}\bigl[Q^q(\mathbb{S}^{d-1})\bigr] &\leq 2^{q+1}\,p\,2^{(2p+3)d}\sup_{\theta \in \mathbb{S}^{d-1},\ p' \in [1, p]} T_2(\theta, p') \\
&\leq 2^{q+1}\,p\,2^{(2p+3)d}\Bigl(\frac{Cq}{n}\Bigr)^{q/2}\Biggl[(8p)^{2pq} + Cpq^q\Biggl(\frac{(32p)^{2pq}}{n^{q/2}}\,\Gamma(q/2) + \frac{(4p)^{(2p+1)q}}{n^q}\Bigl((2\log n)^{(2p+1)q} + \Gamma\bigl((2p+1)q\bigr)\Bigr)\Biggr)\Biggr],
\end{align*}
for any fixed $u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$. Taking the supremum over $u \in \mathcal{N}(\frac{1}{8}, \mathbb{S}^{d-1}, \|\cdot\|_2)$ on both sides of the above bound and applying Minkowski's inequality, we arrive at the following bound:
\begin{align*}
\bigl(\mathbb{E}[|W|^q]\bigr)^{1/q} &\leq \Bigl(\frac{64}{7}\Bigr)^{d/q}\Biggl(\mathbb{E}\Biggl[\sup_{\theta \in \mathbb{S}^{d-1}}\Biggl|\frac{2}{n}\sum_{i=1}^n \varepsilon_i\bigl(X_i^\top u\bigr)\bigl(X_i^\top\theta\bigr)^{2p-1}\Biggr|^q\Biggr]\Biggr)^{1/q} \\
&\leq 2\bigl(10\cdot 2^{2p+3}\bigr)^{d/q}\Biggl[\sqrt{\frac{C_p q}{n}} + \frac{C_p q}{n} + \frac{C_p}{n^{3/2}}(\log n + q)^{2p+1}\Biggr],
\end{align*}
where $C_p$ is a universal constant depending only upon $p$. With the choice of $q = d(2p + 7) + \log\frac{2}{\delta}$, we obtain
\[
\mathbb{P}\Biggl(|W| \geq C_p\Biggl[\sqrt{\frac{d + \log\frac{1}{\delta}}{n}} + \frac{1}{n^{3/2}}\Bigl(d + \log\frac{n}{\delta}\Bigr)^{2p+1}\Biggr]\Biggr) \leq \delta.
\]

Thus, we have established the claim (42b).

A.3 Proof of Corollary 4


The proof of Corollary 4 follows directly by verifying the claims (29a) and (29b).

A.3.1 Structure of F G
Direct algebra leads to the following bound (recalling that $\theta^* = 0$ in this example):
\begin{align}
\langle \nabla F^G(\theta), \theta^* - \theta\rangle &= \Bigl[\theta - \mathbb{E}\bigl[X\tanh\bigl(X^\top\theta\bigr)\bigr]\Bigr]^\top(\theta - \theta^*) \notag \\
&\geq \|\theta\|_2^2 - \|\theta\|_2\,\Bigl\|\mathbb{E}\bigl[X\tanh\bigl(X^\top\theta\bigr)\bigr]\Bigr\|_2, \tag{49}
\end{align}
where $\tanh(x) := \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$ for all $x \in \mathbb{R}$. From Theorem 2 in Dwivedi et al. [9], we have
\[
\Bigl\|\mathbb{E}\bigl[X\tanh\bigl(X^\top\theta\bigr)\bigr]\Bigr\|_2 \leq \Biggl(1 - p + \frac{p}{1 + \frac{\|\theta\|_2^2}{2}}\Biggr)\|\theta\|_2
\]
for all $\theta \in \mathbb{R}^d$, where $p := \mathbb{P}(|Y| \leq 1) + \frac{1}{2}\mathbb{P}(|Y| > 1)$ with $Y \sim \mathcal{N}(0, 1)$. Plugging the above inequality into equation (49) leads to
\[
\langle \nabla F^G(\theta), \theta^* - \theta\rangle \geq \frac{p\,\|\theta\|_2^4}{2 + \|\theta\|_2^2} \geq \begin{cases} \frac{p}{4}\|\theta\|_2^4, & \text{for } \|\theta\|_2 \leq \sqrt{2}, \\ \frac{p}{2}\bigl(2\|\theta\|_2 - 1\bigr), & \text{otherwise}. \end{cases}
\]
This establishes claim (29a).
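The bound from Dwivedi et al. [9] invoked above is simple to verify by simulation: for $X \sim \mathcal{N}(0, I_d)$, the vector $\mathbb{E}[X\tanh(X^\top\theta)]$ can be estimated by Monte Carlo and compared with the right-hand side. The Python sketch below is an illustration only; the dimension and sample size are arbitrary choices.

```python
import math

import numpy as np

rng = np.random.default_rng(3)
d, n_mc = 3, 400_000

# p = P(|Y| <= 1) + 0.5 * P(|Y| > 1) for Y ~ N(0, 1), computed via the error function.
prob_in = math.erf(1.0 / math.sqrt(2.0))
p = prob_in + 0.5 * (1.0 - prob_in)

X = rng.normal(size=(n_mc, d))
for scale in [0.25, 1.0, 2.0]:
    theta = scale * np.ones(d) / np.sqrt(d)
    lhs = np.linalg.norm((np.tanh(X @ theta)[:, None] * X).mean(axis=0))
    t = np.linalg.norm(theta)
    rhs = (1.0 - p + p / (1.0 + t**2 / 2.0)) * t
    print(f"||theta||={t:.2f}  ||E[X tanh(X^T theta)]||~{lhs:.4f}  bound={rhs:.4f}")
```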

A.3.2 Perturbation error between ∇F G and ∇FnG


Direct calculation yields
\[
\nabla F_n^G(\theta) - \nabla F^G(\theta) = \frac{1}{n}\sum_{i=1}^n\Bigl\{X_i\tanh\bigl(X_i^\top\theta\bigr) - \mathbb{E}\bigl[X\tanh\bigl(X^\top\theta\bigr)\bigr]\Bigr\}.
\]
The outer expectation in the above display is taken with respect to $X \sim \mathcal{N}(\theta^*, \sigma^2 I_d)$, where $\theta^* = 0$. Based on the proof argument of Lemma 1 from the paper [9], for each $r > 0$, we have the following concentration inequality:
\[
\mathbb{P}\Biggl(\sup_{\theta \in \mathbb{B}(\theta^*, r)}\Biggl\|\frac{1}{n}\sum_{i=1}^n\Bigl[X_i\tanh\bigl(X_i^\top\theta\bigr) - \mathbb{E}\bigl[X\tanh\bigl(X^\top\theta\bigr)\bigr]\Bigr]\Biggr\|_2 \leq c\,r\sqrt{\frac{d + \log(1/\delta)}{n}}\Biggr) \geq 1 - \delta, \tag{50}
\]
for any $\delta > 0$, as long as the sample size satisfies $n \geq c'd\log(1/\delta)$, where $c$ and $c'$ are universal constants.
For any $M \in \mathbb{N}_+$, the concentration bound (50) combined with the union bound yields
\[
\mathbb{P}\Biggl(\forall r \in [2^{-M}, 1],\quad \sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 \leq c\,r\sqrt{\frac{d + \log(M/\delta)}{n}}\Biggr) \geq 1 - \delta. \tag{51}
\]

On the other hand, based on the standard inequality $|\tanh(x)| \leq |x|$ for all $x \in \mathbb{R}$, we find that
\begin{align*}
\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 &\leq \frac{1}{n}\sum_{i=1}^n\|X_i\|_2\bigl|\tanh\bigl(X_i^\top\theta\bigr)\bigr| + \mathbb{E}\Bigl[\|X\|_2\bigl|\tanh\bigl(X^\top\theta\bigr)\bigr|\Bigr] \\
&\leq \frac{1}{n}\sum_{i=1}^n\|X_i\|_2\bigl|X_i^\top\theta\bigr| + \mathbb{E}\Bigl[\|X\|_2\bigl|X^\top\theta\bigr|\Bigr] \\
&\leq \Biggl(\frac{1}{n}\sum_{i=1}^n\|X_i\|_2^2 + \mathbb{E}\bigl[\|X\|_2^2\bigr]\Biggr)\|\theta\|_2.
\end{align*}
Therefore, we have $\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\|_2 \leq 2d\|\theta\|_2\log(1/\delta)$ with probability $1 - \delta$. By choosing $M_1 := \log(2nd)$, the previous bound yields
\[
\mathbb{P}\Biggl(\forall r < 2^{-M_1},\quad \sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 \leq \frac{\log(1/\delta)}{n}\Biggr) \geq 1 - \delta. \tag{52}
\]

Furthermore, for vectors $\theta \in \mathbb{R}^d$ with large norm, the concentration bound (50) combined with the union bound gives, for any $M' \in \mathbb{N}_+$,
\[
\mathbb{P}\Biggl(\forall r \in [1, 2^{M'}],\quad \sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 \leq c\,r\sqrt{\frac{d + \log(M'/\delta)}{n}}\Biggr) \geq 1 - \delta.
\]
When $r$ in the above bound is too large, we can simply use the fact that $\tanh$ is a bounded function. We thus have the upper bound
\[
\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 \leq \mathbb{E}\bigl[\|X\|_2\bigr] + \frac{1}{n}\sum_{i=1}^n\|X_i\|_2,
\]

for any $\theta$. Given the above bound, by choosing $M_2 := \log(2n)$, we obtain
\begin{align}
\mathbb{P}\Biggl(\forall r > 2^{M_2},\quad \sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 \leq r\sqrt{\frac{d + \log(1/\delta)}{n}}\Biggr) &\geq \mathbb{P}\Biggl(\mathbb{E}\bigl[\|X\|_2\bigr] + \frac{1}{n}\sum_{i=1}^n\|X_i\|_2 \leq 2^{M_2}\sqrt{\frac{d + \log(1/\delta)}{n}}\Biggr) \notag \\
&\geq 1 - \delta. \tag{53}
\end{align}
Putting the bounds (51), (52), and (53) together, for $n \geq cd\log(1/\delta)$, the following probability bound holds:
\[
\mathbb{P}\Biggl(\forall r > 0,\quad \sup_{\theta \in \mathbb{B}(\theta^*, r)}\bigl\|\nabla F_n^G(\theta) - \nabla F^G(\theta)\bigr\|_2 \leq c\Biggl[r\sqrt{\frac{d + \log(\log n/\delta)}{n}} + \frac{\log(1/\delta)}{n}\Biggr]\Biggr) \geq 1 - \delta,
\]
which completes the proof of the claim (29b).
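The linear dependence on the radius $r$ in this bound is also visible numerically. The sketch below compares a random-net proxy for the supremum over $\mathbb{B}(\theta^*, r)$ with $r\sqrt{d/n}$; the net is a crude stand-in for the covering arguments used above, and all constants are assumed values.

```python
import numpy as np

rng = np.random.default_rng(4)

def gap_proxy(n, d, r, n_dirs=300, n_mc=200_000):
    # Proxy for sup_{||theta||_2 <= r} ||grad F_n^G(theta) - grad F^G(theta)||_2, where
    # the difference equals (1/n) sum_i X_i tanh(<X_i, theta>) - E[X tanh(<X, theta>)].
    X = rng.normal(size=(n, d))
    X_pop = rng.normal(size=(n_mc, d))  # fresh draw approximates the expectation
    best = 0.0
    for _ in range(n_dirs):
        theta = rng.normal(size=d)
        theta *= r * rng.uniform() ** (1.0 / d) / np.linalg.norm(theta)  # uniform in the ball
        emp = (np.tanh(X @ theta)[:, None] * X).mean(axis=0)
        pop = (np.tanh(X_pop @ theta)[:, None] * X_pop).mean(axis=0)
        best = max(best, np.linalg.norm(emp - pop))
    return best

n, d = 2000, 3
for r in [0.25, 0.5, 1.0, 2.0]:
    print(f"r={r:4.2f}  gap~{gap_proxy(n, d, r):.4f}  r*sqrt(d/n)={r * np.sqrt(d / n):.4f}")
```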

B Proofs of some auxiliary results
In this appendix, we state and prove a few technical lemmas used in the proofs of our main
results.

B.1 A limit result


We begin with a lemma on the limiting behavior of a function; it is used in the proof of Theorem 2 in Section 5.2.
Lemma 1. Let $\varphi$ be a non-increasing continuous function on the real line with $\varphi(c) = 0$ and $\varphi(t) < 0$ for all $t \in (c, \infty)$. Suppose that there exist two continuous functions $f, g : [0, +\infty) \to \mathbb{R}$, with $f$ non-negative, such that $\lim_{t \to +\infty} g(t)$ exists and $f(t) \leq \int_0^t \varphi(g(s))\,ds$ for all $t \geq 0$. Under these conditions, we have $\lim_{t \to +\infty} g(t) \leq c$.
Proof. Define the limit $A := \lim_{t \to +\infty} g(t)$, which exists according to the assumptions. We proceed via proof by contradiction. In particular, suppose that $A > c$. Based on the definition of $A$, for the positive constant $\varepsilon = (A - c)/2 > 0$, we can find a sufficiently large positive constant $T$ such that $g(t) > A - \varepsilon = c + \varepsilon$ for any $t \geq T$. According to the assumptions on $\varphi$, we obtain
\[
\delta := -\sup_{s \geq c + \varepsilon}\varphi(s) > 0.
\]
Therefore, for all $t > T$, we arrive at the following inequalities:
\[
0 \leq f(t) \leq \int_0^T \varphi(g(s))\,ds + \int_T^t \varphi(g(s))\,ds \leq \int_0^T \varphi(g(s))\,ds - \delta(t - T).
\]
By choosing $t = 1 + T + \delta^{-1}\int_0^T \varphi(g(s))\,ds$, the above inequality cannot hold. This yields the desired contradiction, which completes the proof.
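As a concrete illustration of the lemma, take $\varphi(s) = c - s$ and the constant function $g \equiv A$ with $A > c$. Then $\int_0^t \varphi(g(s))\,ds = (c - A)t \to -\infty$ as $t \to +\infty$, so no non-negative $f$ can satisfy the hypothesis $f(t) \leq \int_0^t \varphi(g(s))\,ds$ for all $t$; the hypothesis is only tenable when $\lim_{t \to +\infty} g(t) \leq c$, exactly as the lemma asserts.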

B.2 A tail bound based on truncation


We now state an upper deviation inequality based on a truncation argument. This lemma is used in Appendix A.2 to prove the uniform concentration bound (24b). Consider a sequence of random variables $\{Y_i\}_{i=1}^n$ satisfying the moment bounds
\[
\mathbb{E}[|Y_i|^q] \leq (aq)^{bq} \quad \text{for all } q = 1, 2, \ldots, \tag{54}
\]
where $a, b$ are universal constants.
Lemma 2. Given an i.i.d. sequence of zero-mean random variables $\{Y_i\}_{i=1}^n$ satisfying the moment bounds (54), we have
\[
\mathbb{P}\Biggl(\frac{1}{n}\sum_{i=1}^n Y_i \geq (4a)^b\sqrt{\frac{\log(4/\delta)}{n}} + \Bigl(a\log\frac{n}{\delta}\Bigr)^b\frac{\log(4/\delta)}{n}\Biggr) \leq \delta.
\]

Proof. The proof of the lemma is a direct combination of a truncation argument and Bernstein's inequality. In particular, for each $i \in [n]$, define the truncated random variable $\tilde{Y}_i := Y_i\,\mathbb{I}\bigl[|Y_i| \leq 3(a\log\frac{n}{\delta})^b\bigr]$. With this definition, we have
\[
\mathbb{P}\Bigl((Y_i)_{i=1}^n \neq (\tilde{Y}_i)_{i=1}^n\Bigr) = \mathbb{P}\Biggl(\max_{1 \leq i \leq n}|Y_i| > 3\Bigl(a\log\frac{n}{\delta}\Bigr)^b\Biggr) \leq n\,\mathbb{P}\Biggl(|Y_i| > 3\Bigl(a\log\frac{n}{\delta}\Bigr)^b\Biggr) \leq \frac{\delta}{2}.
\]
Therefore, it suffices to study the concentration behavior of the quantity $\sum_{i=1}^n \tilde{Y}_i$. Invoking Bernstein's inequality [4], we obtain
\[
\mathbb{P}\Biggl(\frac{1}{n}\sum_{i=1}^n \tilde{Y}_i \geq \varepsilon\Biggr) \leq 2\exp\Biggl(-\frac{n\varepsilon^2}{2(2a)^{2b} + \frac{2}{3}\varepsilon\cdot 3(a\log\frac{n}{\delta})^b}\Biggr).
\]
In order to make the right-hand side of the above inequality less than $\delta/2$, it suffices to set
\[
\varepsilon = (4a)^b\sqrt{\frac{\log(4/\delta)}{n}} + \Bigl(a\log\frac{n}{\delta}\Bigr)^b\frac{\log(4/\delta)}{n}.
\]
Collecting all of the above inequalities yields the claim.
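To illustrate the shape of the bound in Lemma 2, the sketch below simulates the tail probability for the zero-mean variables $Y_i = X_i^2 - 1$ with $X_i \sim \mathcal{N}(0, 1)$, whose moments satisfy (54) with, roughly, $(a, b) = (4, 1)$; these constants, and all simulation sizes, are crude assumed values rather than sharp ones.

```python
import numpy as np

rng = np.random.default_rng(5)

# Y = X^2 - 1 with X ~ N(0, 1): zero mean, and E|Y|^q <= (a q)^{b q} holds
# with the crude choice (a, b) = (4, 1).
a, b = 4.0, 1.0
n, delta, n_rep = 2000, 0.05, 2000

means = (rng.normal(size=(n_rep, n)) ** 2 - 1.0).mean(axis=1)
bound = (4.0 * a) ** b * np.sqrt(np.log(4.0 / delta) / n) \
    + (a * np.log(n / delta)) ** b * np.log(4.0 / delta) / n
print(f"bound = {bound:.4f}")
print(f"empirical P(mean >= bound) = {(means >= bound).mean():.4f}  (target <= {delta})")
```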

References
[1] A. Barron, M. Schervish, and L. Wasserman. The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 27:536–561, 1999. (Cited on page 1.)

[2] A. Bhattacharya, D. Pati, and D. B. Dunson. Anisotropic function estimation using multi-bandwidth Gaussian processes. Annals of Statistics, 42:352–381, 2014. (Cited on page 2.)

[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. (Cited on pages 2 and 20.)

[4] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic


Theory of Independence. Oxford University Press, 2016. (Cited on pages 27 and 34.)

[5] R. J. Carroll and P. Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83:1184–1186, 1988. (Cited on page 10.)

[6] J. Chen. Optimal rate of convergence for finite mixture models. Annals of Statistics,
23(1):221–233, 1995. (Cited on pages 3 and 14.)

[7] R. de Jonge and J. H. van Zanten. Adaptive nonparametric Bayesian inference using
location-scale mixture priors. Annals of Statistics, 38:3300–3320, 2010. (Cited on page 2.)

[8] J. L. Doob. Application of the theory of martingales. Actes du Colloque International: Le Calcul des Probabilités et ses Applications (Lyon, 28 Juin – 3 Juillet 1948), pages 23–27, 1949. (Cited on page 1.)

[9] R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, M. I. Jordan, and B. Yu. Singularity,


misspecification, and the convergence rate of EM. arXiv preprint arXiv:1810.00828, 2018.
(Cited on page 31.)

[10] D. A. Freedman. On the asymptotic behavior of Bayes estimates in the discrete case. Annals of Mathematical Statistics, 34:1386–1403, 1963. (Cited on page 1.)

[11] D. A. Freedman. On the asymptotic behavior of Bayes estimates in the discrete case. II. Annals of Mathematical Statistics, 36:454–456, 1965. (Cited on page 1.)

[12] C. Gao and H. H. Zhou. Rate exact Bayesian adaptation with modified block priors.
Annals of Statistics, 44:318–345, 2016. (Cited on page 2.)

[13] S. Ghosal, J. K. Ghosh, and A. van der Vaart. Convergence rates of posterior distributions.
Annals of Statistics, 28:500–531, 2000. (Cited on page 2.)

[14] S. Ghosal and A. van der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics, 29:1233–1263, 2001. (Cited on pages 2 and 12.)

[15] S. Ghosal and A. van der Vaart. Posterior convergence rates of Dirichlet mixtures at
smooth densities. Annals of Statistics, 35:697–723, 2007. (Cited on page 2.)

[16] N. Ho and X. Nguyen. Convergence rates of parameter estimation for some weakly
identifiable finite mixtures. Annals of Statistics, 44:2726–2755, 2016. (Cited on page 2.)

[17] H. Ishwaran, L. F. James, and J. Sun. Bayesian model selection in finite mixtures
by marginal density decompositions. Journal of the American Statistical Association,
96:1316–1332, 2001. (Cited on page 14.)

[18] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian


statistics. Annals of Statistics, 34:837–877, 2006. (Cited on page 2.)

[19] B. Lindsay. Mixture Models: Theory, Geometry and Applications. In NSF-CBMS Re-
gional Conference Series in Probability and Statistics. IMS, Hayward, CA., 1995. (Cited
on page 12.)

[20] X. Mao. Stochastic Differential Equations and Applications. Elsevier, 2007. (Cited on
page 5.)

[21] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall/CRC,
1989. (Cited on pages 7 and 9.)

[22] X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013. (Cited on pages 2, 3, 13, and 14.)

[23] X. Nguyen. Borrowing strength in hierarchical Bayes: convergence of the Dirichlet base
measure. Bernoulli, 22:1535–1571, 2016. (Cited on page 2.)

[24] J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945–959, 2000. (Cited on page 2.)

[25] D. Revuz and M. Yor. Continuous Martingales and Brownian Motion, volume 293.
Springer-Verlag, third edition, 1999. (Cited on pages 5 and 14.)

[26] H. Risken. The Fokker-Planck Equation. Springer, 1996. (Cited on page 5.)

[27] J. Rousseau. Rates of convergence for the posterior distributions of mixtures of Betas
and adaptive nonparametric estimation of the density. Annals of Statistics, 38:146–180,
2010. (Cited on page 2.)

[28] J. Rousseau and K. Mengersen. Asymptotic behaviour of the posterior distribution in


overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 73:689–710, 2011. (Cited on pages 7 and 12.)

[29] L. Schwartz. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 4:10–26, 1965. (Cited on page 1.)

[30] W. Shen, S. R. Tokdar, and S. Ghosal. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika, 100:623–640, 2013. (Cited on page 2.)

[31] X. Shen and L. Wasserman. Rates of convergence of posterior distributions. Annals of


Statistics, 29:687–714, 2001. (Cited on page 2.)

[32] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998. (Cited on
page 20.)

[33] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes.
Springer-Verlag, New York, NY, 1996. (Cited on page 23.)

[34] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes:
With Applications to Statistics. Springer-Verlag, New York, NY, 2000. (Cited on page 25.)

[35] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cam-


bridge University Press, 2019. (Cited on pages 23, 24, and 27.)

[36] S. Walker. On sufficient conditions for Bayesian consistency. Biometrika, 90:482–488, 2003. (Cited on page 1.)

[37] S. Walker. New approaches to Bayesian consistency. Annals of Statistics, 32:2028–2043,


2004. (Cited on page 1.)

[38] S. G. Walker, A. Lijoi, and I. Prunster. On rates of convergence for posterior distributions
in infinite-dimensional models. Annals of Statistics, 35:738–746, 2007. (Cited on page 2.)

[39] Y. Wang. Convergence rates of latent topic models under relaxed identifiability conditions.
Electronic Journal of Statistics, 13:37–66, 2019. (Cited on page 2.)

[40] Y. Yang and D. B. Dunson. Bayesian manifold regression. Annals of Statistics, 44:876–
905, 2016. (Cited on page 2.)

[41] Y. Yang and S. T. Tokdar. Minimax-optimal nonparametric regression in high dimensions.


Annals of Statistics, 43:652–674, 2015. (Cited on page 2.)

