2309.04860 Approximation Results For Gradient Descent Trained
2309.04860 Approximation Results For Gradient Descent Trained
Abstract
The paper contains approximation guarantees for neural networks that
are trained with gradient flow, with error measured in the continuous
L2 (Sd−1 )-norm on the d-dimensional unit sphere and targets that are
Sobolev smooth. The networks are fully connected of constant depth
and increasing width. Although all layers are trained, the gradient flow
convergence is based on a neural tangent kernel (NTK) argument for the
non-convex second but last layer. Unlike standard NTK analysis, the con-
tinuous error norm implies an under-parametrized regime, possible by the
natural smoothness assumption required for approximation. The typical
over-parametrization re-enters the results in form of a loss in approxi-
mation rate relative to established approximation methods for Sobolev
smooth functions.
Contents
1 Introduction 2
2 Main Result 5
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1
4 Proof Overview 11
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Neural Tangent Kernel . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Abstract Convergence result . . . . . . . . . . . . . . . . . . . . . 14
4.3 Assumption (20): Hölder continuity . . . . . . . . . . . . . . . . 16
4.4 Assumption (19): Concentration . . . . . . . . . . . . . . . . . . 17
4.5 Assumption (17): Weights stay Close to Initial . . . . . . . . . . 18
6 Technical Supplements 46
6.1 Hölder Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Hermite Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Sobolev Spaces on the Sphere . . . . . . . . . . . . . . . . . . . . 57
6.4.1 Definition and Properties . . . . . . . . . . . . . . . . . . 57
6.4.2 Kernel Bounds . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4.3 NTK on the Sphere . . . . . . . . . . . . . . . . . . . . . 61
1 Introduction
Direct approximation results for a large variety of methods, including neural
networks, are typically of the form
2
On the other hand, the neural network optimization literature, typically
considers discrete error norms (or losses)
n
!1/2
1X
∥fθ − f ∥∗ := |fθ (xi ) − f (xi )|2 ,
n i=1
together with neural networks that are over-parametrized, i.e. for which the
number of weights is larger than the number of samples n so that they can
achieve zero training error
inf ∥fθ − f ∥∗ = 0,
θ
3
Paper Organization The paper is organized as follows. Section 2.2 defines
the neural networks and training procedures and Section 2.3 contains the main
result. The coercivity of the NTK is discussed in Section 3. The proof is split
into two parts. Section 4 provides an overview and all major lemmas. The proof
the these lemmas and further details are provided in Section 5. Finally, to keep
the paper self contained, Section 6 contains several facts from the literature.
Literature Review
• Approximation: Some recent surveys are given in [53, 15, 69, 8]. Most of
the results prove direct approximation guarantees as in (1) for a variety of
classes K and network architectures. They show state of the art or even
superior performance of neural networks, but typically do not provide
training methods and rely on hand-picked weights, instead.
– Results for classical Sobolev and Besov regularity are in [25, 27, 50,
44, 64].
– [72, 73, 74, 14, 57, 47] show better than classical approximation rates
for Sobolev smoothness. Since classical methods are optimal (with
regard to nonlinear width and entropy), this implies that the weight
assignment f → θ must be discontinuous.
– Function classes that are specifically tailored to neural networks are
Barron spaces for which approximation results are given in [5, 37, 70,
46, 58, 59, 10].
– Many papers address specialized function classes [56, 54], often from
applications like PDEs [39, 52, 40, 48].
Besides approximation guarantees (1) many of the above papers also dis-
cuss limitations of neural networks, for more information see [20].
• Optimization: We confine the literature overview to neural tangent kernel
based approaches, which are most relevant to this paper. The NTK is
introduced in [32] and similar arguments together with convergence and
perturbation analysis appear simultaneously in [45, 2, 19, 18], Related
optimization ideas are further developed in many papers, including [75, 4,
43, 62, 76, 36, 13, 51, 49, 6, 61, 41]. In particular, [3, 63, 34, 12] refine
the analysis based on expansions of the target f in the NTK eigenbasis
and are closely related to the arguments in this paper, with the major
difference that they rely on the typical over-parametrized regime, whereas
we do solemnly rely on smoothness.
The papers [23, 28, 21, 42, 55, 68] discuss to what extend the lineariza-
tion approach of the NTK can describe real neural network training.
Characterizations of the NTK are fundamental for this paper and given
[9, 22, 35, 11]. Convergence analysis for optimizing NTK models directly
are in [65, 66].
4
• Approximation and Optimization: Since the approximation question is
under-parametrized and the optimization literature largely relies on over-
parametrization there is little work on optimization methods for approx-
imation. The gap between approximation theory and practice is consid-
ered in [1, 26]. The previous paper [24] contains comparable results for 1d
shallow networks. Similar approximation results for gradient flow trained
shallow 1d networks are in [33, 31], with slightly different assumptions on
the target f , more general probability weighted L2 loss and an alternative
proof technique. Other approximation and optimization guarantees rely
on alternative optimizers. [60, 29] use greedy methods and [30] uses a
two step procedure involving a classical and subsequent neural network
approximation.
L2 error bounds are also proven in generalization error bounds for sta-
tistical estimation. E.g. the papers [17, 38] show generalization errors
for parallel fully connected networks in over-parametrized regimes with
Hölder continuity.
2 Main Result
2.1 Notations
• ≲, ≳, ∼ denote less, bigger and equivalence up to a constant that can
change in every occurrence and is independent of smoothness and number
of weights. It can depend on the number of layers L and input dimension d.
Likewise, c is a generic constant that can be different in each occurrence.
• [n] := {1, . . . , n}
• λ = ij; ℓ is the index of the weight Wλ := Wijℓ with |λ| := ℓ. Likewise, we
∂
set ∂λ = ∂W λ
.
• ⊙: Element wise product
• Ai· and A·j are ith row and jth column of matrix A, respectively.
2.2 Setup
Neural Networks We train fully connected deep neural networks without
bias and a few modifications: The first and last layer remain untrained, we use
gradient flow instead of (stochastic) gradient descent and the first layer remains
unscaled. For x in some bounded domain D ⊂ Rd , the networks are defined by
f 1 (x) = W 0 V x,
−1/2
f ℓ+1 (x) = W ℓ nℓ σ f ℓ (x) ,
ℓ = 1, . . . , L (2)
f (x) = f L+1 (x),
5
which we abbreviate by f ℓ = f ℓ (x) if x is unimportant or understood from
context. The weights are initialized as follows
all trained by gradient flow, except for the last layer W L+1 and the first matrix
V , which is pre-chosen with orthonormal columns. All layers have conventional
√
1/ nℓ scaling, except for the first, which ensures that the NTK is of unit size
on the diagonal and common in the literature [18, 9, 22, 11]. We also require
that the layers are of similar size, except for the last one which ensures scalar
valued output of the network
m := nL−1 , 1 = nL+1 ≤ nL ∼ · · · ∼ n0 ≥ d.
and continuous second and third derivative with at most polynomial growth
6
Smoothness Since we are in an under-parametrized regime, we require smooth-
ness of f to guarantee meaningful convergence bounds. In this paper, we use
Sobolev spaces H α (Sd−1 ) on the sphere D = Sd−1 , with norms and scalar prod-
ucts denoted by ∥ · ∥H α (Sd−1 ) and ⟨·, ·⟩H α (Sd−1 ) . We drop the explicit reference
to the domain Sd−1 when convenient. Definitions and required properties are
summarized in Section 6.4.1.
Neural Tangent Kernel The analysis is based on the neural tangent kernel,
which for the time being, we informally define as
X
Γ(x, y) = lim ∂λ frL+1 (x)∂λ frL+1 (y). (7)
width→∞
|λ|=L−1
where Σ̇L (x, y) and ΣL−1 (x, y) are the covariances of two Gaussian processes
1/2
that characterize the forward evaluation of the networks W L nL σ̇ f L and
f L−1 in the infinite width limit, see Section 4.1.1 for their rigorous definition.
We require that
2.3 Result
We are now ready to state the main result of the paper.
7
Theorem 2.1. Assume that the neural network (2) - (5) is trained by gradient
flow (6). Let κ(t) := fθ(t) − f be the residual and assume:
β
1. The NTK satisfies coercivity (8) for some 0 ≤ α ≤ 2 and the forward
process satisfies (9).
2. All hidden layers are of similar size: n0 ∼ · · · ∼ nL−1 =: m.
3. Smoothness is bounded by 0 < α < 1/2.
4. 0 < γ < 1 − α is an arbitrary number (used for Hölder continuity of the
NTK in the proof ).
5. For τ specified below, m is sufficiently large so that
1 1 1 cd τ
∥κ(0)∥−α
2
∥κ(0)∥α2 m− 2 ≲ 1, ≤ 1, ≤ 1.
m m
where κ := fθ(t) − f is the gradient flow residual for sufficiently large time t.
8
For traditional approximation methods, one would expect convergence rate
m−α/d for functions in the Sobolev space H α . Our rates are lower, which seems
to be a variation of over-parametrization is disguise: In the over-parametrized
as well as in our approximation regime the optimizer analysis seems to require
some redundancy and thus more weights than necessary for the approximation
alone. Of course, we only provide upper bounds and practical neural networks
may perform better. Some preliminary experiments in [24] show that shallow
networks in one dimension outperform the theoretical bounds but are still worse
than classical approximation theory would suggest. In addition, the linearization
argument of the NTK results in smoothness measures in Hilbert spaces H α and
not in larger Lp based smoothness spaces with p < 2 or even Barron spaces, as
is common for nonlinear approximation.
Remark 2.3. Although Theorem 2.1 and Corollary 2.2 seem to show dimen-
sion independent convergence rates, they are not. Indeed, β depends on the
dimension and smoothness of the activation function as we see in Section 3 and
Lemma 3.2.
of all layers. Coercivity easily follows once we understand the NTK’s spectral
decomposition. To this end, first note that Γ(x, y) and Θ(x, y) are both zonal
kernels, i.e. they only depend on xT y, and as consequence their eigenfunctions
are spherical harmonics.
Lemma 3.1 ([22, Lemma 1]). The eigenfunctions of the kernels Γ(x, y) and
Θ(x, y) on the sphere with uniform measure are spherical harmonics.
Proof. See [22, Lemma 1] and the discussion thereafter.
Hence, it is sufficient to show lower bounds for the eigenvalues. These are
provided in [9, 22, 11] under slightly different assumptions than required in this
paper:
1. They use all layers Θ(x, y) instead of only the second but last one in
Γ(x, y). (The reference [18] does consider Γ(x, y) and shows that the eigen-
values are strictly positive in the over-parametrized regime with discrete
loss and non-degenerate data.)
9
2. They use bias, whereas we don’t. We can however easily introduce bias
into the first layer by the usual technique to incorporate one fixed input
component x0 = 1.
3. The cited papers use ReLU activations, which do not satisfy the third
derivative smoothness requirements (4).
The proof is given at the end of Section 6.4.3. Note that this implies β = d/2
and thus Theorem 2.1 cannot be expected to be dimension independent. In
fact, due to smoother activations, the kernel Γ(x, y) is expected to be more
smoothing than Θ(x, y) resulting in a faster decay of the eigenvalues and larger
β. This leads to Sobolev coercivity (Lemmas 6.18 and 3.2) as long as the decay
is polynomial, which we only verify numerically in this paper, as shown in Figure
1 for n = 100 uniform samples on the d = 2 dimensional sphere and L − 1 = 1
hidden layers of width m = 1000. The plot uses log-log axes so that straight
lines represent polynomial decay. As expected, ReLU and ELU activations show
polynomials decay with higher order for the latter, which are smoother. For
comparison the C ∞ activation GELU seems to show super polynomial decay.
However, the results are preliminary and have to be considered carefully:
1. The oscillations at the end, are for eigenvalues of size ∼ 10−7 , which is
machine accuracy for floating point numbers.
2. Most eigenvalues are smaller than the difference between the empirical
NTK and the actual NTK. For comparison, the difference between two
randomly sampled empirical NTKs (in matrix norm) is: ReLU: 0.280,
ELU: 0.524, GELU: 0.262 .
3. According to [9], for shallow networks without bias, every other eigenvalue
of the NTK should be zero. This is not clear from the experiments (which
do not use bias, but have one more layer), likely because of the large errors
in the previous item.
4. The errors should be better for wider hidden layers, but since the networks
involve dense matrices, their size quickly becomes substantial.
In conclusion, the experiments show the expected polynomial decay of NTK
eigenvalues and activations with singularities in higher derivatives, but the re-
sults have to be regraded with care.
10
Figure 1: Eigenvalues of the NTK Γ(x, y) for different activation functions.
4 Proof Overview
4.1 Preliminaries
4.1.1 Neural Tangent Kernel
In this section, we recall the definition of the neural tangent kernel (NTK) and
setup notations for its empirical variants. Our definition differs slightly from
the literature because we only use the last hidden layer (weights W L−1 ) to
reduce the loss, whereas all other layers are trained but only estimated by a
perturbation analysis. Throughout the paper, we only need the definitions as
stated, not that they are the infinite width limit of the network derivatives as
stated in (7), although we sometimes refer to this for motivation.
As usual, we start with the recursive definition of the covariances
ℓ
Σ (x, x) Σℓ (x, y)
ℓ+1
Σ (x, y) := Eu,v∼N (0,A) [σ (u) , σ (v)] , A = , Σ0 (x, y) = xT y,
Σℓ (y, x) Σℓ (y, y)
which define a Gaussian process that is the infinite width limit of the forward
evaluation of the hidden layer f ℓ (x), see [32]. Likewise, we define
ℓ
Σ (x, x) Σℓ (x, y)
Σ̇ℓ+1 (x, y) := Eu,v∼N (0,A) [σ̇ (u) , σ̇ (v)] , A= ,
Σℓ (y, x) Σℓ (y, y)
with activation function of the last layer is exchanged with its derivative. Then
the neural tangent kernel (NTK) is defined by
Γ(x, y) := Σ̇L (x, y)ΣL−1 (x, y). (11)
The paper [32] shows that all three definitions above are infinite width limits of
the corresponding empirical processes (denoted with an extra hat ˆ·)
nℓ
1 X 1 T
Σ̂ℓ (x, y) := σ frℓ (x) σ frℓ (y) = σ f ℓ (x) σ f ℓ (y) ,
nℓ r=1 nℓ
nℓ
(12)
ˆ ℓ (x, y) := 1 X σ̇ f ℓ (x) σ̇ f ℓ (y) = 1 σ̇ f ℓ (x)T σ̇ f ℓ (y)
Σ̇ r r
nℓ r=1 nℓ
11
and X
Γ̂(x, y) := ∂λ frL+1 (x)∂λ frL+1 (y).
|λ|=L−1
Note that unlike the usual definition of the NTK, we only include weights from
the second but last layer. Formally, we do not show that Σℓ , Σ̇ℓ and Γ arise as
ˆ ℓ and Γ̂, but rather concen-
infinite width limits of the empirical versions Σ̂ℓ , Σ̇
tration inequalities between them.
The next lemma shows that the empirical kernels satisfy the same identity
(11) as their limits.
Lemma 4.1. Assume that WijL ∈ {−1, +1}. Then
ˆ L (x, y)Σ̂L−1 (x, y).
Γ̂(x, y) = Σ̇
Proof. By definitions of f L and f L−1 , we have
nL
−1/2
X
∂W L−1 frL+1 = W·rL nL ∂W L−1 σ frL
ij ij
1=r
nL
−1/2
X
W·rL nL σ̇ frL ∂W L−1 frL
=
ij
1=r
nL
−1/2 −1/2
X
W·rL nL σ̇ frL δir nL−1 σ fjL−1
=
1=r
−1/2 −1/2
= W·iL nL fiL σ fjL−1 .
nL−1 σ̇
It follows that
nL nX
X L−1
The NTK and empirical NTK induce integral operators, which we denote
by
Z Z
Hf := Γ(·, y)f (y) dy, Hθ f := Γ̂(·, y)f (y) dy
D D
The last definition makes the dependence on the weights explicit, which is hidden
in Γ̂.
12
4.1.2 Norms
We use several norms for our analysis.
1. ℓ2 and matrix norms: ∥ · ∥ denotes the ℓ2 norm when applied to a vector
and the matrix norm when applied to a matrix.
∥f (x) − f (x̄)∥V
∥f ∥C 0 (D;V ) := sup ∥f (x)∥V + sup .
x∈D x̸=x̄∈D ∥x − x̄∥α
U
See Section 6.1 for the full definitions and basic properties.
13
4.1.3 Neural Networks
Many results use a generic activation function denoted by σ with derivative
σ̇, which is allowed to change in each layer, although we always use the same
symbol for notational simplicity. They satisfy the linear growth condition
are Lipschitz
|σ (x) − σ (x̄) | ≲ |x − x̄| (14)
and have uniformly bounded derivatives
f : Θ = ℓ2 (Rm ) → H, θ → fθ .
Hθ := Dfθ (Dfθ )∗ ,
14
1. With probability at least 1 − p0 (m), the distance of the weights from their
initial value is controlled by
r Z t
2
∥θ(t) − θ(0)∥∗ ≤ 1 ⇒ ∥θ(t) − θ(0)∥∗ ≲ ∥κ(τ )∥0 dτ. (17)
m 0
for all −α − β ≤ a ≤ b ≤ c ≤ α.
3. Let H : Hα → H−α be an operator that satisfies the concentration inequal-
ity " r #
r
d c∞ τ
Pr ∥H − Hθ(0) ∥α←−α ≥ c + ≤ p∞ (τ ) (19)
m m
p c∞ τ
for all τ with m ≤ 1. (In our application H is the NTK and Hθ(0)
the empirical NTK.)
4. Hölder continuity with high probability:
∥κ∥2α ≲ ∥κ(0)∥2α
15
We defer the proof to Section 5.1 and only consider a sketch here. As for
standard NTK arguments, the proof is based on the following observation
1 d
∥κ∥2 = − κ, Hθ(t) κ ≈ − ⟨κ, H κ⟩ (22)
2 dt
which can be shown by a short computation. The last step relies on the ob-
servation that empirical NTK stays close to its initial Hθ(t) ≈ Hθ(0) and that
the initial is close to the infinite width limit Hθ(0) ≈ H. However, since we are
not in an over-parametrized regime, the NTK’s eigenvalues can be arbitrarily
close to zero and we only have coercivity in the weaker norm ⟨κ, H κ⟩ ≳ ∥κ∥−α ,
which is not sufficient to show convergence by e.g. Grönwall’s inequality. To
avoid this problem, we derive a closely related system of coupled ODEs
1 d 2 2α+β −2 β βγ
∥κ∥2−α ≲ −c∥κ∥−α2α ∥κ∥α 2α + h β−α ∥κ∥2−α
2 dt
1 d β
2 2α 2 2α−β
∥κ∥2α ≲ −c∥κ∥−α ∥κ∥α 2α + hγ ∥κ∥α ∥κ∥−α .
2 dt
The first one is used to bound the error in the H−α norm and the second ensures
that the smoothness of the residual κ(t) is uniformly bounded during gradient
flow. Together with the interpolation inequality (18), this shows convergence in
the H = H0 norm.
It remains to verify all assumption of Lemma 4.2, which we do in the fol-
lowing subsections. Details are provided in Section 5.5.
For the initial weights W ℓ , this holds with high probability because its entries are
i.i.d. standard Gaussian. For perturbed weights we only need continuity bounds
−1/2
under the condition that θ − θ̄ ∗ ≤ 1 or equivalently that ∥W ℓ − W̄ ℓ ∥nℓ ≤1
ℓ
so that the weight bound of the perturbation W̄ follow from the bounds for
W ℓ . With this setup, we show the following lemma.
Lemma 4.3. Assume that σ and σ̇ satisfy the growth and Lipschitz conditions
(13), (14) and may be different in each layer. Assume the weights, perturbed
weights and domain are bounded (23) and nL ∼ nL−1 ∼ · · · ∼ n0 . Then for
16
0<α<1
Γ̂ ≲1
C 0;α,α
¯
Γ̂ ≲1
C 0;α,α
"L−1 #1−α
¯ n0 X −1/2
Γ̂ − Γ̂ ≲ W k − W̄ k nk .
C 0;α,α nL
k=0
The proof is at the end of Section 5.2. The lemma shows that the kernels
¯
Γ̂ − Γ̂ℓ
ℓ
are Hölder continuous (w.r.t. weights) in a Hölder norm (w.r.t.
C 0;α,α
x and y). This directly implies that the induced integral operators ∥Hθ −
Hθ̄ ∥α←−α are bounded in operator norms induced by Sobolev norms (up to ϵ
less smoothness), which implies Assumption (20), see Section 5.5 for details.
∂ i (σa ) N
≲ 1, ∂ i (σ̇a ) N
≲ 1, a ∈ {Σk (x, x) : x ∈ D}, i = 1, . . . , 3,
17
we have
L−1
"√ √ #
X n0 d + uk d + uk 1
Γ̂ − Γ ≲ √ + ≤ cΣ
C 0;α,β nk nk nk 2
k=0
for all u1 , . . . , uL−1 ≥ 0 sufficiently small so that the rightmost inequality holds.
18
Proof of Lemma 4.2
Proof of Lemma 4.2. For the time being, we assume that the weights remain
within a finite distance
( r )
d
h := max sup ∥θ(t) − θ(0)∥∗ , c ≤1 (28)
t≤T m
with probability at least 1 − p∞ (τ ) − pL (m, h), where the second but last in-
equality follows from assumptions (19), (20) and in the last
p c∞inequality we have
τ
used the coercivity, (28) and chosen τ = h2γ m so that m ≲ h γ
. The left
hand side contains one negative term −∥κ∥2S−β , which decreases the residual
d 2
dt ∥κ∥S , and one positive term which enlarges it. In the following, we ensure
that these terms are properly balanced.
We eliminate all norms that are not ∥κ∥−α or ∥κ∥α so that we obtain a
closed system of ODEs in these two variables. We begin with ∥κ∥S̄ , which is
already of the right type if S̄ = α but ∥κ∥−3α for S̄ = −α. Since 0 < α < β2 ,
we have −α − β ≤ −3α ≤ α so that we can invoke the interpolation inequality
from Assumption 2
2α β−2α
β
∥v∥−3α ≤ ∥v∥−α−β ∥v∥−αβ .
19
Together with Young’s inequality, this implies
2α 2β−2α
≤ c∥κ∥−α−β + c h ∥κ∥−αβ
β β
α β β γβ
= c α ∥κ∥2−α−β + c (β−α) h β−α ∥κ∥2−α
β
for any generic constant c > 0. Choosing this constant sufficiently small and
plugging into the evolution equation for ∥κ∥−α , we obtain
1 d γβ
∥κ∥2−α ≲ −c∥κ∥2−α−β + h β−α ∥κ∥2−α ,
2 dt
with a different generic constant c. Hence, together with the choice S = α, we
arrive at the system of ODEs
1 d γβ
∥κ∥2−α ≲ −c∥κ∥2−α−β + h β−α ∥κ∥2−α ,
2 dt
1 d
∥κ∥2α ≲ −c∥κ∥2α−β + hγ ∥κ∥α ∥κ∥−α .
2 dt
Next, we eliminate the ∥κ∥2−α−β and ∥κ∥2α−β norms. Since 0 < α < β2 implies
−α − β < α − β < −α < α the interpolation inequalities in Assumption 2 yield
2α β 2α+β β
2α+β −
∥κ∥−α ≤ ∥κ∥−α−β ∥κ∥α2α+β ⇒ ∥κ∥−α−β ≥ ∥κ∥−α
2α
∥κ∥α 2α
2α β−2α β 2α−β
β
∥κ∥−α ≤ ∥κ∥α−β ∥κ∥α β ⇒ ∥κ∥α−β ≥ ∥κ∥−α
2α
∥κ∥α 2α ,
i.e. the error ∥κ∥−α is still larger than the right hand side, which will be our
final error bound, we have
βγ
2α
β
βγ β β
β−α β t
∥κ∥2−α ≲ h β−α ∥κ(0)∥αα + ∥κ(0)∥−α
α
e−ch 2α (30)
20
The second condition B(t) ≥ 0 in Lemma 5.1 is equivalent to axρ0 ≥ by0ρ (no-
tation of the lemma), which in our case is identical to (29) at t = 0. Notice
that the right hand side of (29) corresponds to the first summand in the ∥κ∥2−α
bound so that the second summand must dominate and we obtain the simpler
expression
βγ
Notice that by assumption m is sufficiently large so that the right hand side
is strictly
p smaller than one and thus T is only constrained by (29). In case
h = c d/m there is nothing to show and we obtain
( r )
h 1 1
i β−α d
− 21 β(1+γ)−α
h ≲ max ∥κ(0)∥−α ∥κ(0)∥α m
2 2
,c .
m
Finally, we extend the result beyond the largest time T for which (29) is satisfied
and hence (29) holds with equality. Since ∥κ∥20 is defined by a gradient flow, it
is monotonically decreasing and thus for any time t > T , we have
2α
β
γα γβ β
2 β−α
∥κ(t)∥2−α ≤ ∥κ(T )∥2−α = ch ∥κ(0)∥2α =c h β−α ∥κ(0)∥α
α
βγ
2α
β
βγ β β β
−ch β−α 2α t
≲ h β−α ∥κ(0)∥α + ∥κ(0)∥−α e
α α
so that the error bound (30) holds for all times up to an adjustment of the
constants. This implies the statement of the lemma with our choice of h and τ .
21
Technical Supplements
1
Lemma 5.1. Assume a, b, c, d > 0, ρ ≥ 2 and that x, y satisfy the differential
inequality
with
" −ρ #
b b x0
A := y0ρ , B(t) := 1 − e−bρt
a a y0
we have
−1
x(t) ≤ A (1 − B(t)) , y(t) ≤ y0 .
Proof. First, we show that y(t) ≤ y0 for all t ∈ T . To this end, note that
condition (35) states that we are above a critical point for the second ODE
(34). Indeed, setting y ′ (t) = 0 and thus y(t) = y0 and solving the second ODE
(with = instead of ≤) for x(t), we have
2
2ρ−1
d
x(t) = y0 .
c
22
which upon rearrangement is equivalent to
√
−cxρ y 1−ρ + d xy ≤ 0,
so that the differential equation (34) yields y ′ (t) ≤ 0 and hence y(t) ≤ y0 for
all t < τϵ . On the other hand, for all t > τϵ we have y(t) > y0 (1 + ϵ), which
contradicts the continuity of y. It follows that τϵ ≥ Tϵ and with limϵ→0 Tϵ = T ,
we obtain
y(t) ≤ y0 , t < T.
Next, we show the bounds for x(t). For any fixed function y, the function x is
bounded by the solution z of the equality case
of the first equation (33). This is a Bernoulli differential equation, with solution
Z t − ρ1
−bρt bρτ −ρ −ρ
x(t) ≤ z(t) = e aρ e y(τ ) dτ + x0 .
0
which shows the first bound for x(t). We can estimate this further by
23
5.2 Proof of Lemma 4.3: NTK Hölder continuity
The proof is technical but elementary. We start with upper bounds and Hölder
continuity for simple objects, like hidden layers, and then compose these for
derived objects with results for the NTK at the end of the section.
Throughout this section, we use a bar ¯· to denote a perturbation. In partic-
ular W̄ ℓ is a perturbed weight,
−1/2
f¯ℓ+1 (x) = W̄ ℓ nℓ σ f¯ℓ (x) , f¯1 (x) = W̄ 0 V x
¯ ¯ˆ ¯
is the neural network with perturbed weights and Σ̂, Σ̇, Γ̄ and Γ̂ are the kernels of
the perturbed network. The bounds in this section depend on the operator norm
−1/2
of the weight matrices. At initialization, they are bounded W ℓ nℓ ≲ 1,
with high probability. All perturbations of the weights that we need are close
−1/2
W ℓ − W̄ ℓ nℓ ≲ 1 so that we may assume
−1/2
W ℓ nℓ ≲1 (36)
−1/2
W̄ ℓ nℓ ≲1 (37)
2. Assume that σ satisfies the growth and Lipschitz conditions (13) and (14)
and may be different in each layer. Assume the weights and perturbed
weights are bounded (36), (37). Then
ℓ−1 ℓ−1
1/2 −1/2 −1/2
X Y
f ℓ (x) − f¯ℓ (x) ≲ n0 W k − W̄ k nk W j , W̄ j
max nj .
k=0 j=0
j̸=k
3. Assume that σ has bounded derivative (15) and may be different in each
layer. Assume the weights are bounded (36). Then
" ℓ−1 #
1/2 −1/2
Y
ℓ ℓ k
f (x) − f (x̄) ≲ n0 W nk ∥x − x̄∥.
k=0
24
Proof. 1. For ℓ = 0, we have
1/2 −1/2
f 1 (x) = W 0 V x ≤ n0 W 0 n0 ,
where in the last step we have used that V has orthonormal columns and
∥x∥ ≲ 1. For ℓ > 0, we have
(13)
−1/2 −1/2 −1/2
f ℓ+1 = W ℓ nℓ σ fℓ ≤ W ℓ nℓ σ fℓ W ℓ nℓ fℓ
≲
induction
ℓ−1 ℓ
−1/2 1/2 −1/2 1/2 −1/2
Y Y
≲ W ℓ nℓ n0 W k nk = n0 W k nk ,
k=0 k=0
where in the first step we have used the definition of f ℓ+1 , in the third the
growth condition and in the fourth the induction hypothesis.
2. For ℓ = 0 we have
1/2 −1/2
f 1 − f¯1 = [W 0 − W̄ 0 ]V x = n0 W 0 − W̄ 0 n0 ,
where in the last step we have used that V has orthonormal columns and
∥x∥ ≲ 1. For ℓ > 0, we have
−1/2 −1/2
f ℓ+1 − f¯ℓ+1 = W ℓ nℓ σ f ℓ − W̄ ℓ nℓ σ f¯ℓ
−1/2
≤ W ℓ − W̄ ℓ nℓ σ fℓ
−1/2
+ W̄ ℓ nℓ σ f ℓ − σ f¯ℓ
=: I + II
For the first term, the growth condition (13) implies σ f ℓ ≲ f ℓ and
thus the first part of the Lemma yields
ℓ−1
−1/2 1/2 −1/2
Y
I ≲ W ℓ − W̄ ℓ nℓ n0 W k nk .
k=0
For the second term, we have by Lipschitz continuity (14) and induction
−1/2 −1/2
II = W̄ ℓ nℓ σ f ℓ − σ f¯ℓ ≲ W̄ ℓ nℓ f ℓ − f¯ℓ
ℓ−1 ℓ
1/2 −1/2 −1/2
X Y
W k − W̄ k nk Wj , Wj
≲ n0 max nj .
k=0 j=0
j̸=k
By I and II we obtain
ℓ ℓ
1/2 −1/2 −1/2
X Y
ℓ+1
− f¯ℓ+1 ≲ n0 k k
Wj , Wj
f W − W̄ nk max nj ,
k=0 j=0
j̸=k
25
3. Follows from the mean value theorem because by Lemma 5.3 below the
first derivatives are uniformly bounded.
Lemma 5.3. Assume that σ has bounded derivative (15) and may be different
in each layer. Assume the weights are bounded (36). Then
ℓ−1
1/2 −1/2
Y
Df ℓ (x) ≲ n0 W k nk .
k=0
where in the last step we have used that V has orthonormal columns and ∥Dx∥ =
∥I∥ = 1. For ℓ > 0, we have
−1/2
Df ℓ+1 = W ℓ nℓ Dσ f ℓ
−1/2 −1/2
= W ℓ nℓ Dσ f ℓ ≤ W ℓ nℓ σ̇ f ℓ ⊙ Df ℓ
(15) induction
ℓ−1
−1/2 −1/2 1/2 −1/2
Y
≲ W ℓ nℓ Df ℓ ≲ W ℓ nℓ n0 W k nk
k=0
ℓ
1/2 −1/2
Y
= n0 W k nk ,
k=0
where in the first step we have used the definition of f ℓ+1 , in the fourth the
boundedness of σ̇ and in the fifth the induction hypothesis.
Remark 5.4. An argument analogous to Lemma 5.3 does not show that the
derivative is Lipschitz or similarly second derivatives ∂xi ∂xj f ℓ are bounded.
Indeed, the argument uses that
where we bound the first factor by the upper bound of σ̇ and the second by
induction. However, higher derivatives produce products
1/2
With bounded weights (36) the hidden layers are of size ∂xi f ℓ ≲ n0 but a
naive estimate of their product by Cauchy Schwarz and embedding ∂xi f ℓ ⊙ ∂xj f ℓ ≤
∥∂xi f ℓ ∥ℓ4 ∥∂xi f ℓ ∥ℓ4 ≤ ∥∂xi f ℓ ∥∥∂xi f ℓ ∥ ≲ n0 is much larger.
26
Given the difficulties in the last remark, we can still show that f ℓ is Hölder
continuous with respect to the weights in a Hölder norm with respect to x.
Lemma 5.5. Assume that σ satisfies the growth and Lipschitz conditions (13),
(14) and may be different in each layer. Assume the weights, perturbed weights
and domain are bounded (36), (37), (38). Then for 0 < α < 1
1/2
σ fℓ
C 0;α
≲ n0 .
1/2
σ f¯ℓ
C 0;α
≲ n0 .
" ℓ−1 #1−α
1/2 −1/2
X
σ f ℓ − σ f¯ℓ k k
C 0;α
≲ n0 W − W̄ nk .
k=0
Proof. By the growth condition (13) and the Lipschitz continuity (14) of the
activation function, we have
σ f ℓ C0 ≲ f ℓ C0 , σ f ℓ C 0;1 ≲ f ℓ C 0;1 .
where in the last step we have used the bounds form Lemma 5.2 together with
−1/2 −1/2
W ℓ nℓ ≲ 1 and W̄ ℓ nℓ ≲ 1 from Assumptions (36), (37). Likewise,
by the interpolation inequality in Lemma 6.3 we have
1−α α
σ f ℓ − σ f¯ℓ ≲ σ f ℓ − σ f¯ℓ σ f ℓ − σ f¯ℓ C 0;1
C 0;α C0
n
1−α α α
o
≲ σ f ℓ − σ f¯ℓ σ f ℓ C 0;1 σ f¯ℓ
C0
max C 0;1
.
1−α
n α α
o
≲ f ℓ − f¯ℓ C 0 max f ℓ C 0;1 f¯ℓ C 0;1 .
" ℓ−1 #1−α
1/2 −1/2
X
k k
≲ n0 W − W̄ nk ,
k=0
where in the third step we have used that σ is Lipschitz and in the last step
−1/2
the bounds from Lemma 5.2 together with the bounds W ℓ nℓ ≲ 1 and
ℓ −1/2
W̄ nℓ ≲ 1 from Assumptions (36), (37).
Lemma 5.6. Assume that σ satisfies the growth and Lipschitz conditions (13),
(14) and may be different in each layer. Assume the weights, perturbed weights
27
and domain are bounded (36), (37), (38). Then for 0 < α, β < 1
n0
Σ̂ℓ ≲ ,
C 0;α,β nℓ
¯ n0
Σ̂ℓ ≲ ,
C 0;α,β nℓ
" ℓ−1 #1−α
¯ n0 X −1/2
Σ̂ℓ − Σ̂ℓ ≲ k k
W − W̄ nk .
C 0;α,α nℓ
k=0
¯ 1 T T
Σ̂ℓ − Σ̂ℓ = σ fℓ σ f˜ℓ − σ f¯ℓ σ f˜¯ℓ
C 0;α,α nℓ C 0;α,α
1 T T h ℓ i
σ f ℓ − σ f¯ℓ σ f˜ℓ − σ f¯ℓ σ f˜ − σ f˜¯ℓ
=
nℓ C 0;α,α
1 T
T
h i
˜ℓ − σ f˜¯ℓ
σ f ℓ − σ f¯ℓ σ f˜ℓ ¯ℓ
≤ + σ f σ f
nℓ C 0;α,α C 0;α,α
2 T
σ f ℓ − σ f¯ℓ σ f˜ℓ
= ,
nℓ 0;α,α C
where in the last step we have used symmetry in x and y. Thus, by the product
identity Item 3 in Lemma 6.3, we obtain
¯ 2
Σ̂ℓ − Σ̂ℓ σ f ℓ − σ f¯ℓ C 0;α σ f˜ℓ
≤
C 0;α,α nℓ C 0;α
" ℓ−1 #1−α
n0 X k k −1/2
≲ W − W̄ nk ,
nℓ
k=0
Lemma 5.7 (Lemma 4.3 restated form overview). Assume that σ and σ̇ satisfy
the growth and Lipschitz conditions (13), (14) and may be different in each
28
layer. Assume the weights, perturbed weights and domain are bounded (23) and
nL ∼ nL−1 ∼ · · · ∼ n0 . Then for 0 < α < 1
Γ̂ ≲1
C 0;α,α
¯
Γ̂ ≲1
C 0;α,α
"L−1 #1−α
¯ n0 X k k −1/2
Γ̂ − Γ̂ ≲ W − W̄ nk .
C 0;α,α nL
k=0
Thus, since Hölder spaces are closed under products, Lemma 6.3 Item 4, it
follows that
¯ ¯ˆ L
ˆ L (x, y)Σ̂L−1 (x, y) − Σ̇ ¯
Γ̂ − Γ̂ = Σ̇ (x, y)Σ̂L−1 (x, y)
C 0;α,α C 0;α,α
ˆ ¯ˆ L
≤ Σ̇ (x, y) − Σ̇ (x, y) Σ̂L−1 (x, y)
L
C 0;α,α
¯ ¯ L−1
ˆ L (x, y) Σ̂L−1 (x, y) − Σ̂
h i
+ Σ̇ (x, y)
C 0;α,α
¯ˆ L
ˆ L (x, y) − Σ̇
≤ Σ̇ (x, y) Σ̂L−1 (x, y)
C 0;α,α C 0;α,α
¯
ˆ L (x, y) ¯
+ Σ̇ Σ̂L−1 (x, y) − Σ̂L−1 (x, y)
C 0;α,α C 0;α,α
" ℓ−1 #1−α
n0 X k k −1/2
≲ W − W̄ nk ,
nℓ
k=0
where in the last step we have used Lemma 5.6 and nL ∼ nL−1 .
29
5.3 Proof of Lemma 4.4: Concentration
Concentration for the NTK
is derived from concentration for the forward kernels Σ̇L and ΣL−1 . They are
shown inductively by splitting off the expectation Eℓ [·] with respect to the last
layer W ℓ in
h i h i
Σ̂ℓ+1 − Σℓ+1 ≤ Σ̂ℓ+1 − Eℓ Σ̂ℓ+1 + Eℓ Σ̂ℓ+1 − Σℓ+1 .
C 0;α,β C 0;α,β C 0;α,β
Concentration for the first term is shown in Section 5.3.1 by a chaining argument
and bounds for the second term in Section 5.3.2 with an argument similar to
[18]. The results are combined into concentration for the NTK in Section 5.3.3.
For fixed weights W 0 , . . . , W ℓ−2 and random W ℓ−1 , all Λ̂ℓr , r ∈ [nℓ ] are random
variables dependent only on the random vector Wr·ℓ−1 and thus independent.
Hence, we can show concentration uniform in x and y by chaining. For Dudley’s
inequality, one would bound the increments
where the right hand side is a metric for α ≤ 1. However, this is not sufficient
in our case. First, due to the product in the definition of Λ̂ℓr , we can only
bound the ψ1 norm and second this leads to a concentration of the supremum
norm ∥Λ̂ℓr ∥C 0 , whereas we need a Hölder norm. Therefore, we bound the finite
difference operators
β β
∆α ℓ α ℓ
x,hx ∆y,hy Λ̂r (x, y) − ∆x,h̄x ∆y,h̄ Λ̂r (x̄, ȳ)
y ψ1
which can be conveniently expressed by the Orlicz space valued Hölder norm
∆α β ℓ
x ∆y Λ̂r ≲ 1,
C 0;α,β (∆D×∆D;ψ1 )
30
1. Finite difference operators ∆α : (x, h) → h−α [f (x + h) − f (x)], depending
both on x and h, with partial application two variables x and y denoted
by ∆α α
x and ∆y , respectively. See Section 6.1.
2. Assume that σ has bounded derivative (15) and may be different in each
layer. Then
1/2
n0
frℓ (x) frℓ (x̄)
σ −σ ψ2
≲ ∥x − x̄∥.
nℓ−1
ℓ−1 −1/2
is a sum of independent random variables Wrs nℓ−1 σ fsℓ−1 , s ∈ [nℓ−1 ],
by Hoeffding’s inequality (general version for sub-gaussian norms, see e.g.
[67, Proposition 2.6.1]) we have
−1/2 −1/2
Wr·ℓ−1 nℓ−1 σ f ℓ−1 σ f ℓ−1
≲ nℓ−1 .
ψ2
Thus
−1/2
σ frℓ ≲ frℓ = Wr·ℓ−1 nℓ−1 σ f ℓ−1
ψ2 ψ2 ψ2
1/2
−1/2 ℓ−1 −1/2 n0
f ℓ−1 ≲
≤ nℓ−1 σ f ≤ nℓ−1 ,
nℓ−1
31
where in the first step we have used the growth condition and Lemma 6.7,
in the fourth step the growth condition and in the last step the upper
bounds from Lemma 5.2.
2. Using Hoeffding’s inequality analogous to the previous item, we have
−1/2
Wr·ℓ−1 nℓ−1 σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
ψ2
−1/2
σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
≲ nℓ−1
and
ℓ−1 −1/2
= Wr· nℓ−1 σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
ψ2
−1/2
≲ nℓ−1 σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
−1/2
≲ nℓ−1 f ℓ−1 (x) − f ℓ−1 (x̄)
1/2
n0
≲ ∥x − x̄∥,
nℓ−1
where in the first step we have used the Lipschitz condition and Lemma
6.7, in the fourth step the Lipschitz condition and in the last step the
Lipschitz bounds from Lemma 5.2.
Lemma 5.9. Let U and V be two normed spaces and D ⊂ U . For all 0 ≤ α ≤ 21 ,
we have
∥∆α f ∥C 0;α (∆D;V ) ≤ 4 ∥f ∥C 0;2α (D;V ) ,
with ∆D defined in (48).
Proof. Throughout the proof, let C 0;2α = C 0;2α (D; V ) and | · | = ∥ · ∥U or | · | =
∥ · ∥V depending on context. Unraveling the definitions, for every (x, h), (x̄, h̄) ∈
∆D, we have to show
∆α α α
h f (x) − ∆h̄ f (x̄) ≤ 4∥f ∥C 0;2α max{|x − x̄|, |h − h̄|} .
We consider two cases. First, assume that |h| ≤ max{|x − x̄|, |h − h̄|} and h̄ is
arbitrary. Then |h̄| ≤ |h̄ − h| + |h| ≤ 2 max{|x − x̄|, |h − h̄|} and thus
∆α α α α
h f (x) − ∆h̄ f (x̄) ≤ |∆h f (x)| + ∆h̄ f (x̄)
≤ ∥f ∥C 0;2α |h|α + ∥f ∥C 0;2α |h̄|α ≤ 3∥f ∥C 0;2α max{|x − x̄|, |h − h̄|}α .
32
In the second case, assume that max{|x − x̄|, |h − h̄|} ≤ |h| and without loss of
generality that |h| ≤ |h̄|. Then
−α
∆α α
h f (x) − ∆h̄ f (x̄) ≤ [f (x + h) − f (x)]|h| − [f (x̄ + h̄) − f (x̄)]|h̄|−α
≤ f (x + h) − f (x) − f (x̄ + h̄) + f (x̄) |h|−α
+ |f (x̄ + h̄) − f (x̄)| |h|−α − |h̄|−α
=: I + II.
Lemma 5.10. Assume for k = 0, . . . , ℓ−2 the weights Wk are fixed and bounded
−1/2
∥W k ∥nk ≲ 1. Assume that W ℓ−1 is i.i.d. sub-gaussian with ∥Wijℓ−1 ∥ψ2 ≲ 1.
Assume that σ satisfies the growth condition (13), has bounded derivative (15)
and may be different in each layer. Let r ∈ [nℓ ]. Then for α, β ≤ 1/2
n0
∆α β ℓ
x ∆y Λ̂r ≲ ,
C 0;α,β (∆D×∆D;ψ1 ) nℓ−1
with ∆D defined in (48).
Proof. Throughout the proof, we abbreviate
Since by Lemma 6.8 we have ∥XY ∥ψ1 ≤ ∥X∥ψ2 ∥Y ∥ψ2 by the product inequality
Lemma 6.3 Item 3 for Hölder norms we obtain
∆α ∆β Λ̂ℓ = ∆α σ f ℓ ∆β σ f˜ℓ
x y r x r y r
C 0;α,β (ψ1 ) C 0;α,β (ψ1 )
∆α ℓ
∆βy σ f˜rℓ
≲ xσ fr C 0;α (ψ2 )
.
C 0;β (ψ2 )
33
Next, we use Lemma 5.9 to eliminate the finite difference in favour of a higher
Hölder norm
∆α β ℓ ℓ
f˜rℓ
∆ Λ̂
x y r ≲ σ f r C 0;2α (ψ )
σ .
C 0;α,β (ψ1 ) 2 C 0;2β (ψ2 )
1/2 −1/2
Finally, Lemma 5.8 implies that σ frℓ C 0;2α (D;ψ2 )
≤ n0 nℓ−1 and likewise
for f˜ℓ and thus
r
n0
∆α β ℓ
x ∆y Λ̂r ≲ .
C 0;α,β (ψ1 ) nℓ−1
Lemma 5.11. Assume for k = 0, . . . , ℓ−2 the weights Wk are fixed and bounded
−1/2
∥W k ∥nk ≲ 1. Assume that W ℓ−1 is i.i.d. sub-gaussian with ∥Wijℓ−1 ∥ψ2 ≲ 1.
Assume that the domain D is bounded, that σ satisfies the growth condition
(13), has bounded derivative (15) and may be different in each layer. Then for
α = β = 1/2
" "√ √ ##
h i n0 d+ u d+u
Pr ℓ
Σ̂ − E Σ̂ ℓ
≥C √ + ≤ e−u .
C 0;α,β (D) nℓ−1 nℓ−1 nℓ−1
Proof. Since ∆α β ℓ
x ∆y Λ̂rfor r ∈ [nℓ ] only depends on the random vector Wr·ℓ−1 , all
β
stochastic processes ∆α ℓ
x,hx ∆y,hy Λ̂r (x, y) are independent
(x,hx ,y,hy )∈∆D×∆D
and satisfy
n0
∆α β ℓ
x ∆y Λ̂r ≲
nℓ−1
C 0;α,β (∆D×∆D;ψ1 )
by Lemma 5.10. Thus, we can estimate the processes’ supremum by the chaining
Corollary 6.12
nℓ−1
1 X h i
Pr sup ∆α β ℓ α β ℓ
x ∆y Λ̂r − E ∆x ∆y Λ̂r ≥ Cτ ≤ e−u ,
(x,hx )∈∆D nℓ−1 r=1
(y,hy )∈∆D
and
nℓ−1 nℓ−1
1 X 1 X
∆α β ℓ α β
x ∆y Λ̂r = ∆x ∆y Λ̂ℓr = ∆α β ℓ
x ∆y Σ̂
nℓ−1 r=1
nℓ−1 r=1
completes the proof.
34
5.3.2 Perturbation of Covariances
This section contains the tools to estimate
h i
Eℓ Σ̂ℓ+1 − Σℓ+1 ,
C 0;α,β
35
Thus, by Mehler’s theorem (Theorem 6.14 in the appendix) we conclude that
∞
ρk
ZZ X
E(u,v)∼N (0,A) [σ(u)σ(v)] = σ(au)σ(bv) Hk (u)Hk (v) dN (0, 1)(u) dN (0, 1)(v)
k!
k=0
∞
X ρk
= ⟨σa , Hk ⟩N ⟨σb , Hk ⟩N .
k!
k=0
a2 ρab
Lemma 5.13. Assume A = is positive semi-definite and all deriva-
ρab b2
(γb +γρ )
tives up to σ (γa +γρ ) and σb are continuous and have at most polynomial
growth for x → ±∞. Then
We first
estimate
the ρ derivative.
Since 0 ⪯ A and a, b > 0, we must have
1 ρ 1 ρ
0⪯ and thus det = 1 − ρ2 ≥ 0. It follows that |ρ| ≤ 1. Therefore
ρ 1 ρ 1
ρk 1 k! 1
∂ γρ = ρk−γρ ≤ . (41)
k! k! (k − γρ )! (k − γρ )!
36
Plugging the last equation and (41) into (40), we obtain
where in the second step we have used Cauchy-Schwarz and in the last that H̄k
are an orthonormal basis.
Lemma 5.14. Let f (a11 , a22 , a12 ) be implicitly defined by solving the identity
a11 a12 a ρab
=
a12 a22 ρab b
for a, b and ρ. Let Df be a domain with a11 , a22 ≥ c > 0 and |a12 | ≲ 1. Then
∥f ′′′ ∥C 1 (Df ) ≲ 1.
Since the denominator is bounded away from zero, all third partial derivatives
exist and are bounded.
with
∂ i (σa ) N
≲ 1, i = 1, . . . , 3,
37
with σa defined in (39). Then, for α, β ≤ 1 the functions
satisfy
Proof. Define
a ρab
F (a, b, ρ) = E(u,v)∼N (0,Ā) [σ(u)σ(v)] . Ā =
ρab b
and
38
5.3.3 Concentration of the NTK
We combine the results from the last two sections to show concentration in-
equalities, first for the forward kernels Σℓ and Σ̇ℓ and then for the NTK Γ.
Lemma 5.16. Let α = β = 1/2 and k = 0, . . . , ℓ.
1. Assume that all W k are are i.i.d. standard normal.
2. Assume that σ satisfies the growth condition (13), has uniformly bounded
derivative (15), derivatives σ (i) , i = 0, . . . , 3 are continuous and have at
most polynomial growth for x → ±∞ and the scaled activations satisfy
∂ i (σa ) N
≲ 1, a ∈ {Σk (x, x) : x ∈ D}, i = 1, . . . , 3,
we have
Σℓ C 0;α,β
≲1
Σ̂ℓ ≲1
C 0;α,β
ℓ−1
"√ √ #
ℓ ℓ
X n0 d + uk d + uk 1
Σ̂ − Σ ≲ √ + ≤ cΣ
C 0;α,β nk nk nk 2
k=0
for all u1 , . . . , uℓ−1 ≥ 0 sufficiently small so that the last inequality holds.
Proof. We prove the statement by induction. Let us first consider ℓ ≥ 1. We
split off the expectation over the last layer
h i h i
Σ̂ℓ+1 − Σℓ+1 ≤ Σ̂ℓ+1 − Eℓ Σ̂ℓ+1 + Eℓ Σ̂ℓ+1 − Σℓ+1
C 0;α,β C 0;α,β C 0;α,β
= I + II,
39
which is true with probability at least 1 − 2e−nk , see e.g. [67, Theorem 4.4.5].
Then, by Lemma 5.11 for uℓ ≥ 0
" "√ √ ##
h i n0 d + uℓ d + uℓ
Pr Σ̂ ℓ+1
− E Σ̂ ℓ+1
≥C √ + ≤ e−uℓ . (43)
C 0;α,β (D) nℓ nℓ nℓ
Next we estimate II. To this end, recall that Σ̂ℓ+1 (x, y) is defined by
nℓ+1
ℓ+1 1 X
σ frℓ+1 (x) σ frℓ+1 (y) .
Σ̂ (x, y) =
nℓ r=1
It follows that
40
Next, we bound the off diagonal terms. Since the weights are bounded (42),
Lemma 5.6 implies
n0
Σ̂ℓ ≲ ≲ 1, Σℓ C 0;α,β
≲ 1,
C 0;α,β nl
where the last inequality follows from (45). In particular,
where the last line follows by induction. Together with (42), (43) and a union
bound, this shows the result for ℓ ≥ 1.
Finally, we consider the induction start for ℓ = 0. The proof is the same,
except that in (44) the covariance simplifies to
Hence,
h for ℓ i= 1 the two covariances A and  are identical and therefore
∥E0 Σ̂1 (x, y) − Σ1 ∥C 0;α,β = 0.
Lemma 5.17 (Lemma 4.4, restated from the overview). Let α = β = 1/2 and
k = 0, . . . , L − 1.
1. Assume that W L ∈ {−1, +1} with probability 1/2 each.
2. Assume that all W k are are i.i.d. standard normal.
3. Assume that σ and σ̇ satisfy the growth condition (13), have uniformly
bounded derivatives (15), derivatives σ (i) , i = 0, . . . , 3 are continuous and
have at most polynomial growth for x → ±∞ and the scaled activations
satisfy
∂ i (σa ) N
≲ 1, ∂ i (σ̇a ) N
≲ 1, a ∈ {Σk (x, x) : x ∈ D}, i = 1, . . . , 3,
41
Then, with probability at least
L−1
X
1−c e−nk + e−uk (46)
k=1
we have
L−1
"√ √ #
X n0 d + uk d + uk 1
Γ̂ − Γ ≲ √ + ≤ cΣ
C 0;α,β nk nk nk 2
k=0
for all u1 , . . . , uL−1 ≥ 0 sufficiently small so that the rightmost inequality holds.
Proof. By definition (11) of Γ and Lemma 4.1 for Γ̂, we have
and therefore
where in the last step we have used Lemma 6.3 Item 4. Thus, the result follows
from
and
ˆL
n o
max ΣL−1 − Σ̂L−1 , Σ̇L − Σ̇
C 0;α,β C 0;α,β
L−1
"√ √ #
X n0 d + uk d + uk 1
≲ √ + ≤ cΣ ,
nk nk nk 2
k=0
with probability (46) by Lemma 5.16. For Σ̇L , we do not require the lower
bound Σ̇k (x, x) ≥ cΣ > 0 because in the recursive definition σ̇ is only used in
the last layer and therefore not necessary in the induction step in the proof of
Lemma 5.16.
42
and the corresponding maximum norm ∥ · ∥C 0 (D;∗) for functions mapping x to a
tensor measured in the ∥ · ∥∗ norm. We use this norm for an inductive argument
in a proof, but later only apply it for the last layer ℓ = L + 1. In this case
nL+1 = 1 and the norm reduces to a regular matrix norm.
Lemma 5.18. Assume that σ satisfies the growth and derivative bounds (13),
(15) and may be different in each layer. Assume the weights are bounded
−1/2
∥W k ∥nk ≲ 1, k = 1, . . . , ℓ − 1. Then for 0 ≤ α ≤ 1
1/2
ℓ n0
∂W k f C 0 (D;∗) ≲ .
nk
Proof. First note that for any tensor T
X
ur vi wj Trij ≤ C∥u∥∥v∥∥w∥
r,i,j
C0
−1/2
X
ur vi wj ∂Wijk frk (x) (uT v) wT σ f k
= nk
C0
r,i,j
C0
1/2
−1/2 k
n0
≤ nk ∥u∥∥v∥∥w∥ σ f C0
≲ ∥u∥∥v∥∥w∥ ,
nk
where in the last step we have used Lemma 5.5. Thus, we conclude that
1/2
n0
∂W k f k+1 (x) C 0 (D;∗) ≲ .
nk
For k < ℓ − 1, we have
h i
−1/2 −1/2
∂Wijk f ℓ (x) = ∂Wijk W ℓ−1 nℓ−1 σ f ℓ−1 = W ℓ−1 nℓ−1 σ̇ f ℓ−1 ⊙ ∂Wijk f ℓ−1
and therefore
−1/2
X
ur vi wj ∂Wijℓ frk ≤ ∥uT W ℓ−1 nℓ−1 ∥∥v∥∥w∥ σ̇ f ℓ−1 ⊙ ∂Wijk f ℓ−1
C 0 (D;∗)
r,i,j
C0
43
−1/2
where in the second step we have used that ∥W ℓ−1 ∥nℓ−1 ≲ 1 and in the last
step we have used that σ̇ f ℓ−1 ℓ ≲ 1 because |σ̇(·)| ≲ 1 and the induction
∞
hypothesis. It follows that
1/2
n0
∂W k f ℓ (x) C 0 (D;∗)
≲ .
nk
Lemma 5.19 (Lemma 4.5, restated from the overview). Assume that σ satisfies
the growth and derivative bounds (13), (15) and may be different in each layer.
Assume the weights are defined by the gradient flow (6) and satisfy
−1/2
∥W ℓ (0)∥nℓ ≲ 1, ℓ = 1, . . . , L,
ℓ ℓ −1/2
∥W (0) − W (τ )∥nℓ ≲ 1, 0 ≤ τ < t.
Then
1/2 Z t
−1/2 n0
W ℓ (t) − W ℓ (0) nℓ ≲ ∥κ∥C 0 (D)′ dx dτ,
nℓ 0
′
where C 0 (D) is the dual space of C 0 (D).
we have
Z t
d ℓ
W ℓ (t) − W ℓ (0) = W (τ ) dτ
0 dτ
Z tZ
= κ(x)DWℓ f L+1 (x) dx dτ
0 D
Z tZ
≤ |κ(x)| DWℓ f L+1 (x) dx dτ
0 D
1/2 Z t
n0
≲ ∥κ∥C 0 (D)′ dx dτ,
nℓ 0
−1/2
where in the last step we have used Lemma 5.18. Multiplying with nℓ shows
the result.
44
5.5 Proof of Theorem 2.1: Main Result
Proof of Theorem 2.1. The result follows directly from Lemma 4.2 with the
smoothness spaces Hα = H α (Sd−1 ). While the lemma bounds the residual
κ in the H−α and Hα norms, we aim for an H0 = L2 (Sd−1 ) bound. This follows
directly from the interpolation inequality
1/2 1/2
∥ · ∥L2 (Sd−1 ) = ∥ · ∥H 0 (Sd−1 ) ≤ ∥ · ∥H −α (Sd−1 ) ∥ · ∥H α (Sd−1 ) .
It remains to verify all assumptions. To this end, first note that the initial
weights satisfy
−1/2
∥W (0)ℓ ∥nℓ ≲ 1, ℓ = 0, . . . , L, (47)
−1/2
∥θ(t) − θ(0)∥∗ = max W ℓ (t) − W ℓ (0) nℓ
ℓ∈[L]
1/2 Z t Z t
n0
≲ ∥κ∥C 0 (Sd−1 )′ dx dτ, ≲ m−1/2 ∥κ∥H 0 (Sd−1 ) dx dτ,
nℓ 0 0
for all a ∈ {Σk (x, x) : x ∈ D} contained in the set {cΣ , CΣ } for some
CΣ ≥ 0, by assumption. Together with α + ϵ < 1/2 for sufficiently small ϵ,
hidden dimensions d ≲ n0 ∼ . . . , ∼ nL =: m and the concentration result
Lemma 4.4 we obtain, with probability at least
45
the bound "r r #
d τ τ
Γ̂ − Γ ≲L + +
C 0;α+ϵ,α+ϵ m m m
for the neural tangent kernel for all 0 ≤ τ = u0 = · · · = uL−1 ≲ 1. By
Lemma 6.16, the kernel bound directly implies the operator norm bound
"r r #
d τ τ
H − Hθ(0) −α,α ≲ L + +
m m m
for the corresponding integral operators H and Hθ(0) , with kernels Γ and
Γ̂, respectively. If τ /m ≲ 1, we can drop the last term and thus satisfy
assumption (19).
4. Hölder continuity of the NTK (20): By (47) with probability at least
1 − pL (m) := 1 − Le−m
we have ∥θ(0)∥∗ ≲ 1 and thus for all perturbations θ̄ with θ̄ − θ(0) ∗
≤
h ≤ 1 by Lemma 4.3 that
¯
Γ̂ − Γ̂ ≲ Lh1−α−ϵ
C 0;α+ϵ,α+ϵ
for any sufficiently small ϵ > 0. By Lemma 6.16, the kernel bound implies
the operator norm bound
Hθ(0) − Hθ̄ α←−α
≲ Lhγ
for any γ < 1 − α and integral operators Hθ(0) and Hθ̄ corresponding to
kernels Γθ (0) and Γ̂θ̄ , respectively.
5. Coercivity (5): Is given by assumption.
Thus, all assumptions of Lemma 4.2 are satisfied, which directly implies the
theorem as argued above.
6 Technical Supplements
6.1 Hölder Spaces
Definition 6.1. Let U and V be two normed spaces.
1. For 0 < α ≤ 1, we define the Hölder spaces on the domain D ⊂ U as all
functions f : D → V for which the norm
∥f ∥C 0;α (D;V ) := max{∥f ∥C 0 (D;V ) , |f |C 0;α (D;V ) } < ∞
is finite, with
∥f (x) − f (x̄)∥V
|f |C 0 (D;V ) := sup ∥f (x)∥V , |f |C 0;α (D;V ) := sup .
x∈D x̸=x̄∈D ∥x − x̄∥α
U
46
2. For 0 < α, β ≤ 1, we define the mixed Hölder spaces on the domain
D × D ⊂ U × U as all functions g : D × D → V for which the norm
with
which satisfy product and chain rules similar to derivatives. We may also con-
sider these as functions in both x and h
∆α f : (x, h) ∈ ∆D → V, ∆α f (x, h) = ∆α
h f (x)
on the domain
∆D := {(x, h) : x ∈ D, x + h ∈ D} ⊂ U × U. (48)
47
If f = f (x, y) depends on multiple variables, we denote the partial finite differ-
ence operators by ∆α α
x,hx and ∆y,hy defined by
−α
∆0x,hx f (x, y) := f (x, y), ∆α
x,hx f (x, y) := ∥hx ∥U [f (x + hx , y) − f (x, y)],
∆α α
x f (x, y, hx ) = ∆x,hx f (x, y), ∆α α
y f (x, y, hy ) = ∆y,hy f (x, y).
∆α α α
h [f g](x) = [∆h f (x)] g(x) + f (x + h) [∆h g(x)] .
Then
∆α ¯ α
h (f ◦ g)(x) = ∆h (f, g)(x)∆h g(x).
48
In the following lemma, we summarize several useful properties of Hölder
spaces.
Lemma 6.3. Let U and V be two normed spaces, D ⊂ U and 0 < α, β ≤ 1.
1. Interpolation Inequality: For any f ∈ C 1 (D; V ), we have
∥f ∥C 0;α (D;V ) ≤ 2∥f ∥1−α α
C 0 (D;V ) ∥f ∥C 0;1 (D;V ) .
2. Follows from
∥σ(f (x)) − σ(f (x̄))∥V
|σ ◦ f |C 0;α (D;V ) = sup
x,x̄∈D ∥x − x̄∥α
U
∥f (x) − f (x̄)∥α
V
≲ sup = ∥f ∥C 0;α (D;V ) .
x,x̄∈D ∥x − x̄∥α
U
49
and analogous identities for the remaining semi norms |f g|C 0;0,0 (D;V2 ) ,
|f g|C 0;α,0 (D;V2 ) , |f g|C 0;0,β (D;V2 ) .
4. We only show the bound for | · |C 0;α,β (D) . The other semi-norms follow
analogously. Applying the product rule (Lemma 6.2)
∆α
α α
x,hx [f (x, y)g(x, y)] = ∆x,hx f (x, y) g(x, y)+ f (x+hx , y) ∆x,hx f (x, y)
= ∆βy,hy ∆α
α
x,hx f (x, y) g(x, y) + f (x + hx , y) ∆x,hx f (x, y)
h i h β i
= ∆βy,hy ∆α
α
x,hx f (x, y) g(x, y) + ∆ x,hx f (x, y + hy ) ∆ y,hy g(x, y)
h i h i
+ ∆y,hy f (x + hx , y) ∆x,hx f (x, y) + f (x + hx , y + hy ) ∆βy,hy ∆α
β α
x,hx f (x, y) .
The following two lemmas contain chain rules for Hölder and mixed Hölder
spaces.
Lemma 6.4. Let D ⊂ U and Df ⊂ V be domains in normed spaces U , V and
W . Let g : D → Df and f : Df → W . Let 0 < α, β ≤ 1. Then
∥∆α (f ◦ g)∥C 0 (∆D;W ) ≤ ∥f ′ ∥C 0;0 (Df ;L(V,W )) ∥g∥C 0;α (D;V )
and
where L(V, W ) is the space of all linear maps V → W with induced operator
norm.
Proof. Note that
Z 1
¯ h (f, g)(x) :=
∆ f ′ (tg(x + h) + (1 − t)g(x)) dt
0
¯ h (f, g)(x)v∥W ≤ ∥∆
takes values in the linear maps L(V, W ) and thus ∥∆ ¯ h (f, g)(x)∥L(V,W ) ∥v∥V ,
for all v ∈ V . Using the chain rule Lemma 6.2, it follows that
∥∆α ¯ α
h (f ◦ g)(x)∥W = ∆h (f, g)(x)∆h g(x) W
¯ h (f, g)(x)
≤ ∆ ∥∆α g(x)∥
L(V,W ) h V
50
and
∥∆α α ¯ α ¯ α
h (f ◦ g)(x) − ∆h (f ◦ ḡ)(x)∥W = ∆h (f, g)(x)∆h g(x) − ∆h (f, ḡ)(x)∆h ḡ(x) W
¯ h (f, g)(x) − ∆
≤ ∆ ¯ h (f, ḡ)(x) ∥∆α h g(x)∥
L(V,W ) V
¯ h (f, ḡ)(x)
+ ∆ α α
∥∆h g(x) − ∆h ḡ(x)∥V .
L(V,W )
and
¯ h (f, g)(x) − ∆
∆ ¯ h (f, ḡ)(x)
L(V,W )
Z 1
≤ ∥f ′ ∥C 0;1 (Df ;L(V,W )) ∥t(g − ḡ)(x + h) + (1 − t)(g − ḡ)(x)∥ dt
0
≤ ∥f ′ ∥C 0;1 (Df ;L(V,W )) ∥g − ḡ∥C 0 (D;V ) , (50)
∆α ∆β [f ◦ g − f ◦ ḡ] C 0 (∆D×∆D;W )
∆βy,hy [f ◦ g − f ◦ ḡ] = ∆
¯ y,h (f, g)∆β g − ∆
y y,hy
¯ y,h (f, ḡ)∆β ḡ
y y,hy
β
¯ ¯
= ∆y,hy (f, g) − ∆y,hy (f, ḡ) ∆y,hy g
h i
+∆ ¯ y,h (f, ḡ) ∆β g − ∆β ḡ
y y,hy y,hy
=: I + II.
Applying the product rule Lemma 6.2 to the first term yields
51
Likewise, applying the product Lemma rule 6.2 to the second term yields
h i
¯ β β
∆α α
x,hx II W = ∆x,hx ∆y,hy (f, ḡ) ∆y,hy g − ∆y,hy ḡ
W
h i
¯ β β
+ ∆y,hy (f, ḡ)(x + hx , y) ∆x,hx ∆y,hy g − ∆α
α
x,hx ∆y,hy ḡ
W
≤ ∆α ¯
x,hx ∆y,hy (f, ḡ) L(V,W )
∆βy,hy g − ∆βy,hy ḡ
W
¯ y,h (f, ḡ)(x + hx , y) β β
+ ∆ y L(V,W )
∆α
x,hx ∆y,hy g − ∆α
x,hx ∆y,hy ḡ .
W
All terms involving only g and ḡ can easily be upper bounded by ∥g∥C 0;α,β (D;V ) ,
∥ḡ∥C 0;α,β (D;V ) or ∥g − ḡ∥C 0;α,β (D;V ) . The terms
are bounded by (49) and (50) in the proof of Lemma 6.4. For the remaining
terms, define
G(x) := tg(x, y + hy ) + (1 − t)g(x, y)
and likewise Ḡ. Then
∥G∥C 0;α (D,V ) ≲ ∥g∥C 0;α,β (D,V ) , ∥G − Ḡ∥C 0;α (D,V ) ≲ ∥g − ḡ∥C 0;α,β (D,V ) .
and
∆α ¯ ¯
x,hx ∆y,hy (f, g) − ∆y,hy (f, ḡ) L(V,W )
Z 1
′ ′
∆α
= x,hx f ◦ G − f ◦ Ḡ dt
0 L(V,W )
′′
≤ 2∥f ∥ C 0;1 (Df ;L(V,L(V,W ))) ∥g − ḡ∥C 0;α,β (D;V ) max{1, ∥ḡ∥C 0;α,β (D;V ) }.
6.2 Concentration
In this section, we recall the definition of Orlicz norms, some basic properties
and the chaining concentration inequalities we use to show that the empirical
NTK is close to the NTK.
52
Definition 6.6. For random variable X, we define the sub-gaussian and sub-
exponential norms by
∥X∥ψ2 = inf t > 0 : E exp(X 2 /t2 ) ≤ 2 ,
1 ∥Y ∥ψ2 X 2 1 ∥X∥ψ2 Y 2
|XY |
exp ≤ exp +
t 2 ∥X∥ψ2 t2 2 ∥Y ∥ψ2 t2
1/2
∥Y ∥ψ2 X 2 ∥X∥ψ2 Y 2 √ √
≤ exp 2
+ 2
≤ 2 2 ≤ 2.
∥X∥ψ2 t ∥Y ∥ψ2 t
Hence
∥XY ∥ψ1 ≤ t ≤ ∥X∥ψ2 ∥Y ∥ψ2 .
53
Theorem 6.9 ([16, Theorem 3.5]). Let X be a normed linear space. Assume
the X valued separable random process (Xt )t∈T , has a mixed tail, with respect
to some semi-metrics d1 and d2 on T , i.e.
√
Pr ∥Xt − Xs ∥ ≥ ud2 (t, s) + ud1 (t, s) ≤ 2e−u
where the infimum is taken over all admissible sequences Tn ⊂ T with |T0 | = 1
n
and |Tn | ≤ 22 . Then for any t0 ∈ T
√
Pr sup ∥Xt − Xt0 ∥ ≥ C γ2 (T, d2 ) + γ1 (T, d1 ) + u∆d2 (T ) + u∆d1 (T ) ≤ e−u .
t∈T
Remark 6.10. [16, Theorem 3.5] assumes that T is finite. Using separability
and monotone convergence, this can be extended to infinite T by standard
arguments.
Lemma 6.11. Let 0 ≤ α ≤ 1 and D ⊂ Rd be as set of Euclidean norm | · |-
diameter smaller than R ≥ 1. Then
α 1/2
α 3α + 1 1+α α 3
γ1 (D, | · | ) ≲ R d, γ2 (D, | · | ) ≲ Rα/2 d1/2 .
α 4α
Proof. Let N (D, | · |α , u) be the covering number of D, i.e. the smallest number
of u-balls in the metric | · |α necessary to cover D. It is well known (e.g. [16,
(2.3)]) that
Z ∞ Z Rα
1/i 1/i
γi (D, | · |α ) ≲ [log N (D, | · |α , u)] du ≲ [log N (D, | · |α , u)] du,
0 0
where in the last step we have used that N (D, | · |α , u) = 1 for u ≥ Rα and thus
its logarithm is zero. Since every u-cover in the | · | norm is a uα cover in the
| · |α metric, the covering numbers can be estimated by
d d/α
(3R)α
α 1/α 3R
N (D, | · | , u) = N (D, | · |, u ) ≤ = ,
u1/α u
see e.g. [67]. Hence
Rα d/α Z α
(3R)α d R
Z
γ1 (D, | · |α ) ≲ log du = α log(3R) − log u du
0 u α 0
d d
3αR1+α − Rα log Rα + Rα ≤ (3α + 1)R1+α
≤
α α
54
and using log x ≤ x − 1 ≤ x
1/2 Z Rα 1/2 1/2 Z Rα 1/2
d (3R)α d (3R)α
γ2 (D, |·|α ) ≲ log du ≲ du
α 0 u α 0 u
α α 1/2 Z Rα 1/2 α 1/2
3 dR 1 3 d
≲ du ≲ Rα/2 .
α 0 u 4α
Then
" #
N 1/2 u 1/2
1 X d d u ≤ e−u .
Pr sup Xj,t − E [Xj,t ] ≥ CL + + +
t∈T N j=1 N N N N
Proof. We show the result with Theorem 6.9 for the process
N
1 X
Yt := Xj,t − E [Xj,t ] .
N j=1
τ2
τ
≤ 2 exp −cN min , .
L2 |t − s|2α L|t − s|α
τ2
r
τ u u
u := cN min , ⇒ τ = L|t − s|α max ,
L2 |t − s|2 L|t − s| cN cN
55
and thus
r
α u u
Pr |Yt − Ys | ≥ L|t − s| max , ≤ 2 exp(−u). (51)
cN cN
which directly yields the corollary with supt∈D ∥Yt ∥ ≤ supt∈D ∥Yt − Yt0 ∥ + ∥Yt0 ∥
and (51).
⟨Hn , Hm ⟩N = n! δnm .
56
Proof. The normalization is well known, we only show the formula for the
dn−k−1 −x2 /2
derivative. By the growth condition, we have f (k) (x) dx n−k−1 e → 0 for
x → ±∞. Thus, in the integration by parts formula below all boundary terms
vanish and we have
Z
1 2
⟨f, Hn ⟩ = √ f (u)Hn (u)e−x /2 du.
2π R
dn
Z
1 2 2 2
=√ f (u) (−1)n ex /2 n e−x /2 e−x /2 du.
2π R dx
Z n
1 d 2
= √ (−1)n f (u) n e−x /2 du.
2π R dx
dn−k
Z
1 2
= √ (−1)n−k f (k) (u) n−k e−x /2 du.
2π R dx
dn−k
Z
1 2 2 2
=√ f (k) (u) (−1)n−k ex /2 n−k e−x /2 e−x /2 du.
2π R dx
D E
(k)
= f , Hn−k .
N
Proof. See [71] for Mehler’s theorem in the form stated here.
Yℓj , ℓ = 0, 1, 2, . . . , 1 ≤ j ≤ ν(ℓ)
of degree ℓ and order j are an orthonormal basis on the sphere L2 (Sd−1 ), com-
parable to Fourier bases for periodic functions. For any f ∈ L2 (Sd−1 ), we
57
D E
denote by fˆℓj = f, Yℓj the corresponding basis coefficient. The Sobolev space
H α (Sd−1 ) consists of all function for which the norm
ν(ℓ)
∞ X 2α
X 2
∥f ∥2H α (Sd−1 ) = 1 + ℓ1/2 (ℓ + d − 2)1/2 fˆℓj
ℓ=0 j=1
set Z
At (f )(x) := − f (τ ) dτ.
C(x,t)
With Z π
2
Sα (f )2 (x) := |At f (x) − f (x)| t−2α−1 dt
0
the Sobolev norm on the sphere is equivalent to
Using the definition (52) for a < b < c, the interpolation inequality
c−b b−a
c−a c−a
∥ · ∥H b (Sd−1 ) ≲ ∥ · ∥H a (Sd−1 ) ∥ · ∥H c (Sd−1 ) , ⟨·, ·⟩−α ≲ ∥ · ∥−3α ∥ · ∥α , (54)
Lemma 6.15. Let 0 < α < 1. Then for any ϵ > 0 with α + ϵ ≤ 1, we have
58
6.4.2 Kernel Bounds
In this section, we provide bounds for the kernel integral
ZZ
⟨f, g⟩k := f (x)k(x, y)g(y) dx dy
D×D
on the sphere D = Sd−1 in Sobolev norms on the sphere. Clearly, for 0 ≤ α, β <
2, we have
Z
⟨f, g⟩k ≤ ∥f ∥H −α k(·, y)g(y) dy ≤ ∥f ∥H −α ∥k∥H α ←H −β ∥g∥H −β ,
D Hα
where the norm of k is the induced operator norm. While the norms for f and
g are the ones used in the convergence analysis, concentration and perturbation
results for k are computed in mixed Hölder norms. We show in this section,
that these bound the operator norm.
Indeed, ⟨f, g⟩k is a bilinear form on f and g and thus is bounded by the
tensor product norms
⟨f, g⟩k ≤ ∥f ⊗ g∥(H α ⊗H β )′ ∥k∥H α ⊗H β ≤ ∥f ∥H −α ∥g∥H −β ∥k∥H α ⊗H β ,
where ·′ denotes the dual norm. The H α ⊗H β norm contains mixed smoothness
and with Sobolev-Slobodeckij type definition (53) is easily bounded by corre-
sponding mixed Hölder regularity. In order to avoid rigorous characterization
of tensor product norms on the sphere, the following lemma shows the required
bounds directly.
Lemma 6.16. Let 0 < α, β < 1. Then for any ϵ > 0 with α + ϵ ≤ 1 and
β + ϵ < 1, we have
ZZ
f (x)k(x, y)g(y) dx dy ≤ ∥f ∥H −α (Sd−1 ) ∥g∥H −β (Sd−1 ) ∥k∥C 0;α+ϵ,β+ϵ (Sd−1 ) .
D×D
so that it remains to estimate the last term. Plugging in definition (53) of the
Sobolev norm, we obtain
Z 2 Z Z π Z 2
k(·, y)g(y) = (Axt − I) k(·, y)g(y) dy (x) t−2α−1 dt dx,
D Hα D 0 D
59
where Axt is the average in (53) applied to the x variable only and I the identity.
Swapping the inner integral with the one inside the definition of Axt , we estimate
Z 2 Z Z π Z 2
k(·, y)g(y) = [(Axt − I)(k(·, y))(x)] g(y) dy t−2α−1 dt dx,
D Hα D 0 D
Z Z π
2
≤ ∥(Axt − I)(k(·, y))(x)∥H β ∥g∥2H −β t−2α−1 dt dx,
ZD
Z 0
ZZ π
2
= |(Ays − I)(Axt − I)(k)(x, y)| t−2α−1 s−2β−1 dst dxy ∥g∥2H −β .
D×D 0
Plugging in the definition of the averages Ays and Axt , the integrand is estimated
by the mixed Hölder norm
Z Z
|(Ays − I)(Axt − I)(k)(x, y)| = − − |k(τ, σ) − k(x, σ) − k(τ, y) + k(x, y)| dτ σ
C(y,s) C(x,t)
Z Z
≤− − |x − τ |α+ϵ |y − σ|β+ϵ ∥k∥C 0;α+ϵ,β+ϵ dτ σ.
C(y,s) C(x,t)
The difference |x − τ |, and likewise |y − σ|, is bounded by the angle of the cap
C(x, t). Indeed
|x − τ | ≲ min{t, T }, |y − σ| ≲ min{s, T }.
It follows that
|(Ays − I)(Axt − I)(k)(x, y)| ≲ min{t, T }α+ϵ min{s, T }β+ϵ ∥k∥C 0;α+ϵ,β+ϵ .
60
6.4.3 NTK on the Sphere
This section fills in the proofs for Section 3. Recall that we denote the normal
NTK used in [9, 22, 11] by
X
Θ(x, y) = lim ∂λ f L+1 (x)∂λ f L+1 (y),
width→∞
λ
whereas the NTK Γ(x, y) used in this paper confines the sum to |λ| = L − 1,
i.e. the second but last layer, see Section 3. We first show that the reproducing
kernel Hilbert space (RKHS) of the NTK is a Sobolev space.
Lemma 6.17. Let Θ(x, y) be the neural tangent kernel for a fully connected
neural network on the sphere Sd−1 with bias and ReLU activation. Then the
corresponding RKHS HΘ is the Sobolev space H^{d/2}(Sd−1) with equivalent norms
∥ · ∥_{HΘ} ∼ ∥ · ∥_{H^{d/2}}.
Proof. By [11, Theorem 1] the RKHS HΘ is the same as the RKHS HLap of the
Laplace kernel
$$
  k(x, y) = e^{-\|x - y\|} .
$$
An inspection of their proof reveals that these spaces have equivalent norms.
By [22, Theorem 2], the Laplace kernel has the same eigenfunctions as the NTK
(both are spherical harmonics) and eigenvalues
$$
  \lambda_{\ell j} \sim (\ell + 1)^{-d}
$$
for all eigenvalues. With Mercer's theorem and the definition (52) of Sobolev
norms, we conclude that
$$
  \|f\|_{H_\Theta}^2
  \sim \|f\|_{\mathrm{Lap}}^2
  = \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}^{-1}\, |\hat f_{\ell j}|^2
  \sim \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{d}\, |\hat f_{\ell j}|^2
  \sim \|f\|_{H^{d/2}(S^{d-1})}^2 .
$$
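The step from the kernel eigenvalues to the RKHS norm uses the Mercer representation in its standard form: if a kernel expands in spherical harmonics with positive eigenvalues, its RKHS norm is the correspondingly weighted coefficient sum,
$$
  k(x, y) = \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}\, Y_{\ell j}(x)\, Y_{\ell j}(y)
  \qquad \Longrightarrow \qquad
  \|f\|_{H_k}^2 = \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}^{-1}\, |\hat f_{\ell j}|^2 ,
$$
applied here to the Laplace kernel.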
Lemma 6.18. Let Θ(x, y) be the neural tangent kernel for a fully connected
neural network on the sphere Sd−1 with bias and ReLU activation. Its eigen-
functions are spherical harmonics with eigenvalues
$$
  \lambda_{\ell j} \sim (\ell + 1)^{-d} .
$$
Proof. This follows directly from the norm equivalence ∥ · ∥_{HΘ} ∼ ∥ · ∥_{H^{d/2}} in
Lemma 6.17 and the Mercer theorem representation of the RKHS
$$
  \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}^{-1}\, |\hat f_{\ell j}|^2
  = \|f\|_{H_\Theta}^2
  \sim \|f\|_{H^{d/2}(S^{d-1})}^2
  \sim \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{d}\, |\hat f_{\ell j}|^2 .
$$
For the integral operator LΘ associated with Θ, the same expansion yields
$$
\begin{aligned}
  \langle f, L_\Theta f \rangle_{H^\alpha(S^{d-1})}
  &= \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{2\alpha}\, \hat f_{\ell j}\, \widehat{L_\Theta f}_{\ell j} \\
  &= \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{2\alpha}\, \lambda_{\ell j}\, |\hat f_{\ell j}|^2 .
\end{aligned}
$$
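This identity relates the quadratic form of LΘ to a Sobolev norm of lower order; for instance, if λℓj ∼ (ℓ + 1)^{−d} as in Lemma 6.18, then up to constants
$$
  \langle f, L_\Theta f \rangle_{H^\alpha(S^{d-1})}
  \sim \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{2\alpha - d}\, |\hat f_{\ell j}|^2
  \sim \|f\|_{H^{\alpha - d/2}(S^{d-1})}^2 .
$$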
References
[1] B. Adcock and N. Dexter. The gap between theory and practice in function
approximation with deep neural networks. SIAM Journal on Mathematics
of Data Science, 3(2):624–655, 2021.
[2] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning
via over-parameterization. In K. Chaudhuri and R. Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, vol-
ume 97 of Proceedings of Machine Learning Research, page 242–252, Long
Beach, California, USA, 09–15 Jun 2019. PMLR. Full version available at
https://arxiv.org/abs/1811.03962.
[3] S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of op-
timization and generalization for overparameterized two-layer neural net-
works. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the
36th International Conference on Machine Learning, volume 97 of Proceed-
ings of Machine Learning Research, page 322–332, Long Beach, California,
USA, 09–15 Jun 2019. PMLR.
[4] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang.
On exact computation with an infinitely wide neural net. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 32.
Curran Associates, Inc., 2019.
[5] F. Bach. Breaking the curse of dimensionality with convex neural networks.
Journal of Machine Learning Research, 18(19):1–53, 2017.
[6] Y. Bai and J. D. Lee. Beyond linearization: On quadratic and higher-order
approximation of wide neural networks. In International Conference on
Learning Representations, 2020.
[7] J. A. Barceló, T. Luque, and S. Pérez-Esteva. Characterization of Sobolev
spaces on the sphere. Journal of Mathematical Analysis and Applications,
491(1):124240, 2020.
[8] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathe-
matics of deep learning, 2021. https://arxiv.org/abs/2105.04026.
[9] A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.
[10] G. Bresler and D. Nagaraj. Sharp representation theorems for ReLU net-
works with precise dependence on depth. In H. Larochelle, M. Ranzato,
R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Informa-
tion Processing Systems, volume 33, page 10697–10706. Curran Associates,
Inc., 2020.
[11] L. Chen and S. Xu. Deep neural tangent kernel and Laplace kernel have
the same RKHS. In International Conference on Learning Representations,
2021.
[12] Z. Chen, Y. Cao, D. Zou, and Q. Gu. How much over-parameterization is
sufficient to learn deep ReLU networks? In International Conference on
Learning Representations, 2021.
[13] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable
programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 32. Curran Associates, Inc., 2019.
[14] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear
Approximation and (Deep) ReLU Networks. Constructive Approximation,
55(1):127–172, Feb. 2022.
[15] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta
Numerica, 30:327–444, 2021.
[16] S. Dirksen. Tail bounds via generic chaining. Electronic Journal of Proba-
bility, 20:1 – 29, 2015.
[17] S. Drews and M. Kohler. On the universal consistency of an over-
parametrized deep neural network estimate learned by gradient descent,
2022.
[18] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global
minima of deep neural networks. In K. Chaudhuri and R. Salakhutdi-
nov, editors, Proceedings of the 36th International Conference on Machine
Learning, volume 97 of Proceedings of Machine Learning Research, page
1675–1685, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[19] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably
optimizes over-parameterized neural networks. In International Conference
on Learning Representations, 2019.
[20] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural
network approximation theory. IEEE Transactions on Information Theory,
67(5):2581–2623, 2021.
[21] S. Fort, G. K. Dziugaite, M. Paul, S. Kharaghani, D. M. Roy, and S. Gan-
guli. Deep learning versus kernel learning: an empirical study of loss
landscape geometry and the time evolution of the neural tangent kernel.
In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, edi-
tors, Advances in Neural Information Processing Systems, volume 33, page
5850–5861. Curran Associates, Inc., 2020.
[22] A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and B. Ro-
nen. On the similarity between the Laplace and neural tangent kernels.
In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, edi-
tors, Advances in Neural Information Processing Systems, volume 33, page
1451–1461. Curran Associates, Inc., 2020.
[23] M. Geiger, A. Jacot, S. Spigler, F. Gabriel, L. Sagun, S. d’Ascoli, G. Biroli,
C. Hongler, and M. Wyart. Scaling description of generalization with num-
ber of parameters in deep learning. CoRR, abs/1901.01608, 2019.
[24] R. Gentile and G. Welper. Approximation results for gradient descent
trained shallow neural networks in 1d, 2022. https://arxiv.org/abs/
2209.08399.
[25] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approx-
imation Spaces of Deep Neural Networks. Constructive Approximation,
55(1):259–367, Feb. 2022.
[26] P. Grohs and F. Voigtlaender. Proof of the theory-to-practice gap in deep
learning via sampling complexity bounds for neural network approximation
spaces, 2021. https://arxiv.org/abs/2104.02746.
[27] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations
with deep ReLU neural networks in W^{s,p} norms. Analysis and Applications,
18(05):803–859, 2020.
[28] B. Hanin and M. Nica. Finite depth and width corrections to the neural
tangent kernel. In International Conference on Learning Representations,
2020.
[29] W. Hao, X. Jin, J. W. Siegel, and J. Xu. An efficient greedy training
algorithm for neural networks and applications in PDEs, 2021. https:
//arxiv.org/abs/2107.04466.
[30] L. Herrmann, J. A. A. Opschoor, and C. Schwab. Constructive deep ReLU
neural network approximation. Journal of Scientific Computing, 90(2):75,
2022.
[31] S. Ibragimov, A. Jentzen, and A. Riekert. Convergence to good non-optimal
critical points in the training of neural networks: Gradient descent opti-
mization with one random initialization overcomes all bad non-global lo-
cal minima with high probability, 2022. https://arxiv.org/abs/2212.
13111.
[32] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Conver-
gence and generalization in neural networks. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 31. Curran
Associates, Inc., 2018.
[33] A. Jentzen and A. Riekert. A proof of convergence for the gradient descent
optimization method with random initializations in the training of neural
networks with ReLU activation for piecewise linear target functions. Journal
of Machine Learning Research, 23(260):1–50, 2022.
[34] Z. Ji and M. Telgarsky. Polylogarithmic width suffices for gradient descent
to achieve arbitrarily small test error with shallow ReLU networks. In
International Conference on Learning Representations, 2020.
[35] Z. Ji, M. Telgarsky, and R. Xian. Neural tangent kernels, transportation
mappings, and universal approximation. In International Conference on
Learning Representations, 2020.
[36] K. Kawaguchi and J. Huang. Gradient descent finds global minima for
generalizable deep neural networks of practical sizes. In 2019 57th Annual
Allerton Conference on Communication, Control, and Computing (Aller-
ton), page 92–99, 2019.
[37] J. M. Klusowski and A. R. Barron. Approximation by combinations of
ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE
Transactions on Information Theory, 64(12):7649–7656, 2018.
[38] M. Kohler and A. Krzyzak. Analysis of the rate of convergence of an over-
parametrized deep neural network estimate learned by gradient descent,
2022.
[46] Z. Li, C. Ma, and L. Wu. Complexity measures for neural networks
with general activation functions using path-based norms, 2020. https:
//arxiv.org/abs/2009.06132.
[47] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approxima-
tion for smooth functions. SIAM Journal on Mathematical Analysis,
53(5):5465–5506, 2021.
[48] C. Marcati, J. A. A. Opschoor, P. C. Petersen, and C. Schwab. Exponential
ReLU neural network approximation rates for point and edge singularities.
Foundations of Computational Mathematics, 2022.
[55] M. Seleznova and G. Kutyniok. Neural tangent kernel beyond the infinite-
width limit: Effects of depth and initialization. In K. Chaudhuri, S. Jegelka,
L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the
39th International Conference on Machine Learning, volume 162 of Pro-
ceedings of Machine Learning Research, page 19522–19560. PMLR, 17–23
Jul 2022.
[56] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation
properties for deep neural networks. Applied and Computational Harmonic
Analysis, 44(3):537–557, 2018.
[57] Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions.
Neural Networks, 119:74–84, 2019.
[58] J. W. Siegel and J. Xu. Approximation rates for neural networks with
general activation functions. Neural Networks, 128:313–321, 2020.
[59] J. W. Siegel and J. Xu. High-order approximation rates for shallow neural
networks with cosine and ReLU^k activation functions. Applied and Com-
putational Harmonic Analysis, 58:1–26, 2022.
[60] J. W. Siegel and J. Xu. Optimal convergence rates for the orthogonal greedy
algorithm. IEEE Transactions on Information Theory, 68(5):3354–3361,
2022.
[61] C. Song, A. Ramezani-Kebrya, T. Pethick, A. Eftekhari, and V. Cevher.
Subquadratic overparameterization for shallow neural networks. In M. Ran-
zato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, edi-
tors, Advances in Neural Information Processing Systems, volume 34, page
11247–11259. Curran Associates, Inc., 2021.
[62] Z. Song and X. Yang. Quadratic suffices for over-parametrization via matrix
Chernoff bound, 2019. https://arxiv.org/abs/1906.03593.
[63] L. Su and P. Yang. On learning over-parameterized neural networks:
A functional approximation perspective. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances
in Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019.
[64] T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and
mixed smooth Besov spaces: optimal rate and curse of dimensionality. In
International Conference on Learning Representations, 2019.
[65] M. Velikanov and D. Yarotsky. Universal scaling laws in the gradient de-
scent training of neural networks, 2021. https://arxiv.org/abs/2105.
00507.
[66] M. Velikanov and D. Yarotsky. Tight convergence rate bounds for opti-
mization under power law spectral conditions, 2022. https://arxiv.org/
abs/2202.00992.
[67] R. Vershynin. High-dimensional probability: an introduction with applica-
tions in data science. Number 47 in Cambridge series in statistical and
probabilistic mathematics. Cambridge University Press, Cambridge ; New
York, NY, 2018.
[68] N. Vyas, Y. Bansal, and P. Nakkiran. Limitations of the NTK for under-
standing generalization in deep learning, 2022.
[69] E. Weinan, M. Chao, W. Lei, and S. Wojtowytsch. Towards a mathemat-
ical understanding of neural network-based machine learning: What we
know and what we don’t. CSIAM Transactions on Applied Mathematics,
1(4):561–615, 2020.
[70] E. Weinan, C. Ma, and L. Wu. The Barron Space and the Flow-Induced
Function Spaces for Neural Network Models. Constructive Approximation,
55(1):369–406, Feb. 2022.
[71] C. S. Withers and S. Nadarajah. Expansions for the multivariate normal.
Journal of Multivariate Analysis, 101(5):1311–1316, 2010.
[72] D. Yarotsky. Error bounds for approximations with deep ReLU networks.
Neural Networks, 94:103–114, 2017.
[73] D. Yarotsky. Optimal approximation of continuous functions by very deep
ReLU networks. In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceed-
ings of the 31st Conference On Learning Theory, volume 75 of Proceedings
of Machine Learning Research, page 639–649. PMLR, 06–09 Jul 2018.
[74] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation
rates for deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell,
M. Balcan, and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, page 13005–13015. Curran Associates, Inc., 2020.
[75] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Gradient descent optimizes over-
parameterized deep ReLU networks. Machine Learning, 109(3):467 – 492,
2020.
[76] D. Zou and Q. Gu. An improved analysis of training over-parameterized
deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural In-
formation Processing Systems, volume 32. Curran Associates, Inc., 2019.