2309.04860 Approximation Results For Gradient Descent Trained
2309.04860 Approximation Results For Gradient Descent Trained
Abstract
The paper contains approximation guarantees for neural networks that
are trained with gradient flow, with error measured in the continuous
L2 (Sd−1 )-norm on the d-dimensional unit sphere and targets that are
Sobolev smooth. The networks are fully connected of constant depth
and increasing width. Although all layers are trained, the gradient flow
convergence is based on a neural tangent kernel (NTK) argument for the
non-convex second but last layer. Unlike standard NTK analysis, the con-
tinuous error norm implies an under-parametrized regime, possible by the
natural smoothness assumption required for approximation. The typical
over-parametrization re-enters the results in form of a loss in approxi-
mation rate relative to established approximation methods for Sobolev
smooth functions.
Contents
1 Introduction 2
2 Main Result 5
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1
4 Proof Overview 11
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Neural Tangent Kernel . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.1.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Abstract Convergence result . . . . . . . . . . . . . . . . . . . . . 14
4.3 Assumption (20): Hölder continuity . . . . . . . . . . . . . . . . 16
4.4 Assumption (19): Concentration . . . . . . . . . . . . . . . . . . 17
4.5 Assumption (17): Weights stay Close to Initial . . . . . . . . . . 18
6 Technical Supplements 46
6.1 Hölder Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Hermite Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Sobolev Spaces on the Sphere . . . . . . . . . . . . . . . . . . . . 57
6.4.1 Definition and Properties . . . . . . . . . . . . . . . . . . 57
6.4.2 Kernel Bounds . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4.3 NTK on the Sphere . . . . . . . . . . . . . . . . . . . . . 61
1 Introduction
Direct approximation results for a large variety of methods, including neural
networks, are typically of the form
2
On the other hand, the neural network optimization literature, typically
considers discrete error norms (or losses)
n
!1/2
1X
∥fθ − f ∥∗ := |fθ (xi ) − f (xi )|2 ,
n i=1
together with neural networks that are over-parametrized, i.e. for which the
number of weights is larger than the number of samples n so that they can
achieve zero training error
inf ∥fθ − f ∥∗ = 0,
θ
3
Paper Organization The paper is organized as follows. Section 2.2 defines
the neural networks and training procedures and Section 2.3 contains the main
result. The coercivity of the NTK is discussed in Section 3. The proof is split
into two parts. Section 4 provides an overview and all major lemmas. The proof
the these lemmas and further details are provided in Section 5. Finally, to keep
the paper self contained, Section 6 contains several facts from the literature.
Literature Review
• Approximation: Some recent surveys are given in [53, 15, 69, 8]. Most of
the results prove direct approximation guarantees as in (1) for a variety of
classes K and network architectures. They show state of the art or even
superior performance of neural networks, but typically do not provide
training methods and rely on hand-picked weights, instead.
– Results for classical Sobolev and Besov regularity are in [25, 27, 50,
44, 64].
– [72, 73, 74, 14, 57, 47] show better than classical approximation rates
for Sobolev smoothness. Since classical methods are optimal (with
regard to nonlinear width and entropy), this implies that the weight
assignment f → θ must be discontinuous.
– Function classes that are specifically tailored to neural networks are
Barron spaces for which approximation results are given in [5, 37, 70,
46, 58, 59, 10].
– Many papers address specialized function classes [56, 54], often from
applications like PDEs [39, 52, 40, 48].
Besides approximation guarantees (1) many of the above papers also dis-
cuss limitations of neural networks, for more information see [20].
• Optimization: We confine the literature overview to neural tangent kernel
based approaches, which are most relevant to this paper. The NTK is
introduced in [32] and similar arguments together with convergence and
perturbation analysis appear simultaneously in [45, 2, 19, 18], Related
optimization ideas are further developed in many papers, including [75, 4,
43, 62, 76, 36, 13, 51, 49, 6, 61, 41]. In particular, [3, 63, 34, 12] refine
the analysis based on expansions of the target f in the NTK eigenbasis
and are closely related to the arguments in this paper, with the major
difference that they rely on the typical over-parametrized regime, whereas
we do solemnly rely on smoothness.
The papers [23, 28, 21, 42, 55, 68] discuss to what extend the lineariza-
tion approach of the NTK can describe real neural network training.
Characterizations of the NTK are fundamental for this paper and given
[9, 22, 35, 11]. Convergence analysis for optimizing NTK models directly
are in [65, 66].
4
• Approximation and Optimization: Since the approximation question is
under-parametrized and the optimization literature largely relies on over-
parametrization there is little work on optimization methods for approx-
imation. The gap between approximation theory and practice is consid-
ered in [1, 26]. The previous paper [24] contains comparable results for 1d
shallow networks. Similar approximation results for gradient flow trained
shallow 1d networks are in [33, 31], with slightly different assumptions on
the target f , more general probability weighted L2 loss and an alternative
proof technique. Other approximation and optimization guarantees rely
on alternative optimizers. [60, 29] use greedy methods and [30] uses a
two step procedure involving a classical and subsequent neural network
approximation.
L2 error bounds are also proven in generalization error bounds for sta-
tistical estimation. E.g. the papers [17, 38] show generalization errors
for parallel fully connected networks in over-parametrized regimes with
Hölder continuity.
2 Main Result
2.1 Notations
• ≲, ≳, ∼ denote less, bigger and equivalence up to a constant that can
change in every occurrence and is independent of smoothness and number
of weights. It can depend on the number of layers L and input dimension d.
Likewise, c is a generic constant that can be different in each occurrence.
• [n] := {1, . . . , n}
• λ = ij; ℓ is the index of the weight Wλ := Wijℓ with |λ| := ℓ. Likewise, we
∂
set ∂λ = ∂W λ
.
• ⊙: Element wise product
• Ai· and A·j are ith row and jth column of matrix A, respectively.
2.2 Setup
Neural Networks We train fully connected deep neural networks without
bias and a few modifications: The first and last layer remain untrained, we use
gradient flow instead of (stochastic) gradient descent and the first layer remains
unscaled. For x in some bounded domain D ⊂ Rd , the networks are defined by
f 1 (x) = W 0 V x,
−1/2
f ℓ+1 (x) = W ℓ nℓ σ f ℓ (x) ,
ℓ = 1, . . . , L (2)
f (x) = f L+1 (x),
5
which we abbreviate by f ℓ = f ℓ (x) if x is unimportant or understood from
context. The weights are initialized as follows
all trained by gradient flow, except for the last layer W L+1 and the first matrix
V , which is pre-chosen with orthonormal columns. All layers have conventional
√
1/ nℓ scaling, except for the first, which ensures that the NTK is of unit size
on the diagonal and common in the literature [18, 9, 22, 11]. We also require
that the layers are of similar size, except for the last one which ensures scalar
valued output of the network
m := nL−1 , 1 = nL+1 ≤ nL ∼ · · · ∼ n0 ≥ d.
and continuous second and third derivative with at most polynomial growth
6
Smoothness Since we are in an under-parametrized regime, we require smooth-
ness of f to guarantee meaningful convergence bounds. In this paper, we use
Sobolev spaces H α (Sd−1 ) on the sphere D = Sd−1 , with norms and scalar prod-
ucts denoted by ∥ · ∥H α (Sd−1 ) and ⟨·, ·⟩H α (Sd−1 ) . We drop the explicit reference
to the domain Sd−1 when convenient. Definitions and required properties are
summarized in Section 6.4.1.
Neural Tangent Kernel The analysis is based on the neural tangent kernel,
which for the time being, we informally define as
X
Γ(x, y) = lim ∂λ frL+1 (x)∂λ frL+1 (y). (7)
width→∞
|λ|=L−1
where Σ̇L (x, y) and ΣL−1 (x, y) are the covariances of two Gaussian processes
1/2
that characterize the forward evaluation of the networks W L nL σ̇ f L and
f L−1 in the infinite width limit, see Section 4.1.1 for their rigorous definition.
We require that
2.3 Result
We are now ready to state the main result of the paper.
7
Theorem 2.1. Assume that the neural network (2) - (5) is trained by gradient
flow (6). Let κ(t) := fθ(t) − f be the residual and assume:
β
1. The NTK satisfies coercivity (8) for some 0 ≤ α ≤ 2 and the forward
process satisfies (9).
2. All hidden layers are of similar size: n0 ∼ · · · ∼ nL−1 =: m.
3. Smoothness is bounded by 0 < α < 1/2.
4. 0 < γ < 1 − α is an arbitrary number (used for Hölder continuity of the
NTK in the proof ).
5. For τ specified below, m is sufficiently large so that
1 1 1 cd τ
∥κ(0)∥−α
2
∥κ(0)∥α2 m− 2 ≲ 1, ≤ 1, ≤ 1.
m m
where κ := fθ(t) − f is the gradient flow residual for sufficiently large time t.
8
For traditional approximation methods, one would expect convergence rate
m−α/d for functions in the Sobolev space H α . Our rates are lower, which seems
to be a variation of over-parametrization is disguise: In the over-parametrized
as well as in our approximation regime the optimizer analysis seems to require
some redundancy and thus more weights than necessary for the approximation
alone. Of course, we only provide upper bounds and practical neural networks
may perform better. Some preliminary experiments in [24] show that shallow
networks in one dimension outperform the theoretical bounds but are still worse
than classical approximation theory would suggest. In addition, the linearization
argument of the NTK results in smoothness measures in Hilbert spaces H α and
not in larger Lp based smoothness spaces with p < 2 or even Barron spaces, as
is common for nonlinear approximation.
Remark 2.3. Although Theorem 2.1 and Corollary 2.2 seem to show dimen-
sion independent convergence rates, they are not. Indeed, β depends on the
dimension and smoothness of the activation function as we see in Section 3 and
Lemma 3.2.
of all layers. Coercivity easily follows once we understand the NTK’s spectral
decomposition. To this end, first note that Γ(x, y) and Θ(x, y) are both zonal
kernels, i.e. they only depend on xT y, and as consequence their eigenfunctions
are spherical harmonics.
Lemma 3.1 ([22, Lemma 1]). The eigenfunctions of the kernels Γ(x, y) and
Θ(x, y) on the sphere with uniform measure are spherical harmonics.
Proof. See [22, Lemma 1] and the discussion thereafter.
Hence, it is sufficient to show lower bounds for the eigenvalues. These are
provided in [9, 22, 11] under slightly different assumptions than required in this
paper:
1. They use all layers Θ(x, y) instead of only the second but last one in
Γ(x, y). (The reference [18] does consider Γ(x, y) and shows that the eigen-
values are strictly positive in the over-parametrized regime with discrete
loss and non-degenerate data.)
9
2. They use bias, whereas we don’t. We can however easily introduce bias
into the first layer by the usual technique to incorporate one fixed input
component x0 = 1.
3. The cited papers use ReLU activations, which do not satisfy the third
derivative smoothness requirements (4).
The proof is given at the end of Section 6.4.3. Note that this implies β = d/2
and thus Theorem 2.1 cannot be expected to be dimension independent. In
fact, due to smoother activations, the kernel Γ(x, y) is expected to be more
smoothing than Θ(x, y) resulting in a faster decay of the eigenvalues and larger
β. This leads to Sobolev coercivity (Lemmas 6.18 and 3.2) as long as the decay
is polynomial, which we only verify numerically in this paper, as shown in Figure
1 for n = 100 uniform samples on the d = 2 dimensional sphere and L − 1 = 1
hidden layers of width m = 1000. The plot uses log-log axes so that straight
lines represent polynomial decay. As expected, ReLU and ELU activations show
polynomials decay with higher order for the latter, which are smoother. For
comparison the C ∞ activation GELU seems to show super polynomial decay.
However, the results are preliminary and have to be considered carefully:
1. The oscillations at the end, are for eigenvalues of size ∼ 10−7 , which is
machine accuracy for floating point numbers.
2. Most eigenvalues are smaller than the difference between the empirical
NTK and the actual NTK. For comparison, the difference between two
randomly sampled empirical NTKs (in matrix norm) is: ReLU: 0.280,
ELU: 0.524, GELU: 0.262 .
3. According to [9], for shallow networks without bias, every other eigenvalue
of the NTK should be zero. This is not clear from the experiments (which
do not use bias, but have one more layer), likely because of the large errors
in the previous item.
4. The errors should be better for wider hidden layers, but since the networks
involve dense matrices, their size quickly becomes substantial.
In conclusion, the experiments show the expected polynomial decay of NTK
eigenvalues and activations with singularities in higher derivatives, but the re-
sults have to be regraded with care.
10
Figure 1: Eigenvalues of the NTK Γ(x, y) for different activation functions.
4 Proof Overview
4.1 Preliminaries
4.1.1 Neural Tangent Kernel
In this section, we recall the definition of the neural tangent kernel (NTK) and
setup notations for its empirical variants. Our definition differs slightly from
the literature because we only use the last hidden layer (weights W L−1 ) to
reduce the loss, whereas all other layers are trained but only estimated by a
perturbation analysis. Throughout the paper, we only need the definitions as
stated, not that they are the infinite width limit of the network derivatives as
stated in (7), although we sometimes refer to this for motivation.
As usual, we start with the recursive definition of the covariances
ℓ
Σ (x, x) Σℓ (x, y)
ℓ+1
Σ (x, y) := Eu,v∼N (0,A) [σ (u) , σ (v)] , A = , Σ0 (x, y) = xT y,
Σℓ (y, x) Σℓ (y, y)
which define a Gaussian process that is the infinite width limit of the forward
evaluation of the hidden layer f ℓ (x), see [32]. Likewise, we define
ℓ
Σ (x, x) Σℓ (x, y)
Σ̇ℓ+1 (x, y) := Eu,v∼N (0,A) [σ̇ (u) , σ̇ (v)] , A= ,
Σℓ (y, x) Σℓ (y, y)
with activation function of the last layer is exchanged with its derivative. Then
the neural tangent kernel (NTK) is defined by
Γ(x, y) := Σ̇L (x, y)ΣL−1 (x, y). (11)
The paper [32] shows that all three definitions above are infinite width limits of
the corresponding empirical processes (denoted with an extra hat ˆ·)
nℓ
1 X 1 T
Σ̂ℓ (x, y) := σ frℓ (x) σ frℓ (y) = σ f ℓ (x) σ f ℓ (y) ,
nℓ r=1 nℓ
nℓ
(12)
ˆ ℓ (x, y) := 1 X σ̇ f ℓ (x) σ̇ f ℓ (y) = 1 σ̇ f ℓ (x)T σ̇ f ℓ (y)
Σ̇ r r
nℓ r=1 nℓ
11
and X
Γ̂(x, y) := ∂λ frL+1 (x)∂λ frL+1 (y).
|λ|=L−1
Note that unlike the usual definition of the NTK, we only include weights from
the second but last layer. Formally, we do not show that Σℓ , Σ̇ℓ and Γ arise as
ˆ ℓ and Γ̂, but rather concen-
infinite width limits of the empirical versions Σ̂ℓ , Σ̇
tration inequalities between them.
The next lemma shows that the empirical kernels satisfy the same identity
(11) as their limits.
Lemma 4.1. Assume that WijL ∈ {−1, +1}. Then
ˆ L (x, y)Σ̂L−1 (x, y).
Γ̂(x, y) = Σ̇
Proof. By definitions of f L and f L−1 , we have
nL
−1/2
X
∂W L−1 frL+1 = W·rL nL ∂W L−1 σ frL
ij ij
1=r
nL
−1/2
X
W·rL nL σ̇ frL ∂W L−1 frL
=
ij
1=r
nL
−1/2 −1/2
X
W·rL nL σ̇ frL δir nL−1 σ fjL−1
=
1=r
−1/2 −1/2
= W·iL nL fiL σ fjL−1 .
nL−1 σ̇
It follows that
nL nX
X L−1
The NTK and empirical NTK induce integral operators, which we denote
by
Z Z
Hf := Γ(·, y)f (y) dy, Hθ f := Γ̂(·, y)f (y) dy
D D
The last definition makes the dependence on the weights explicit, which is hidden
in Γ̂.
12
4.1.2 Norms
We use several norms for our analysis.
1. ℓ2 and matrix norms: ∥ · ∥ denotes the ℓ2 norm when applied to a vector
and the matrix norm when applied to a matrix.
∥f (x) − f (x̄)∥V
∥f ∥C 0 (D;V ) := sup ∥f (x)∥V + sup .
x∈D x̸=x̄∈D ∥x − x̄∥α
U
See Section 6.1 for the full definitions and basic properties.
13
4.1.3 Neural Networks
Many results use a generic activation function denoted by σ with derivative
σ̇, which is allowed to change in each layer, although we always use the same
symbol for notational simplicity. They satisfy the linear growth condition
are Lipschitz
|σ (x) − σ (x̄) | ≲ |x − x̄| (14)
and have uniformly bounded derivatives
f : Θ = ℓ2 (Rm ) → H, θ → fθ .
Hθ := Dfθ (Dfθ )∗ ,
14
1. With probability at least 1 − p0 (m), the distance of the weights from their
initial value is controlled by
r Z t
2
∥θ(t) − θ(0)∥∗ ≤ 1 ⇒ ∥θ(t) − θ(0)∥∗ ≲ ∥κ(τ )∥0 dτ. (17)
m 0
for all −α − β ≤ a ≤ b ≤ c ≤ α.
3. Let H : Hα → H−α be an operator that satisfies the concentration inequal-
ity " r #
r
d c∞ τ
Pr ∥H − Hθ(0) ∥α←−α ≥ c + ≤ p∞ (τ ) (19)
m m
p c∞ τ
for all τ with m ≤ 1. (In our application H is the NTK and Hθ(0)
the empirical NTK.)
4. Hölder continuity with high probability:
∥κ∥2α ≲ ∥κ(0)∥2α
15
We defer the proof to Section 5.1 and only consider a sketch here. As for
standard NTK arguments, the proof is based on the following observation
1 d
∥κ∥2 = − κ, Hθ(t) κ ≈ − ⟨κ, H κ⟩ (22)
2 dt
which can be shown by a short computation. The last step relies on the ob-
servation that empirical NTK stays close to its initial Hθ(t) ≈ Hθ(0) and that
the initial is close to the infinite width limit Hθ(0) ≈ H. However, since we are
not in an over-parametrized regime, the NTK’s eigenvalues can be arbitrarily
close to zero and we only have coercivity in the weaker norm ⟨κ, H κ⟩ ≳ ∥κ∥−α ,
which is not sufficient to show convergence by e.g. Grönwall’s inequality. To
avoid this problem, we derive a closely related system of coupled ODEs
1 d 2 2α+β −2 β βγ
∥κ∥2−α ≲ −c∥κ∥−α2α ∥κ∥α 2α + h β−α ∥κ∥2−α
2 dt
1 d β
2 2α 2 2α−β
∥κ∥2α ≲ −c∥κ∥−α ∥κ∥α 2α + hγ ∥κ∥α ∥κ∥−α .
2 dt
The first one is used to bound the error in the H−α norm and the second ensures
that the smoothness of the residual κ(t) is uniformly bounded during gradient
flow. Together with the interpolation inequality (18), this shows convergence in
the H = H0 norm.
It remains to verify all assumption of Lemma 4.2, which we do in the fol-
lowing subsections. Details are provided in Section 5.5.
For the initial weights W ℓ , this holds with high probability because its entries are
i.i.d. standard Gaussian. For perturbed weights we only need continuity bounds
−1/2
under the condition that θ − θ̄ ∗ ≤ 1 or equivalently that ∥W ℓ − W̄ ℓ ∥nℓ ≤1
ℓ
so that the weight bound of the perturbation W̄ follow from the bounds for
W ℓ . With this setup, we show the following lemma.
Lemma 4.3. Assume that σ and σ̇ satisfy the growth and Lipschitz conditions
(13), (14) and may be different in each layer. Assume the weights, perturbed
weights and domain are bounded (23) and nL ∼ nL−1 ∼ · · · ∼ n0 . Then for
16
0<α<1
Γ̂ ≲1
C 0;α,α
¯
Γ̂ ≲1
C 0;α,α
"L−1 #1−α
¯ n0 X −1/2
Γ̂ − Γ̂ ≲ W k − W̄ k nk .
C 0;α,α nL
k=0
The proof is at the end of Section 5.2. The lemma shows that the kernels
¯
Γ̂ − Γ̂ℓ
ℓ
are Hölder continuous (w.r.t. weights) in a Hölder norm (w.r.t.
C 0;α,α
x and y). This directly implies that the induced integral operators ∥Hθ −
Hθ̄ ∥α←−α are bounded in operator norms induced by Sobolev norms (up to ϵ
less smoothness), which implies Assumption (20), see Section 5.5 for details.
∂ i (σa ) N
≲ 1, ∂ i (σ̇a ) N
≲ 1, a ∈ {Σk (x, x) : x ∈ D}, i = 1, . . . , 3,
17
we have
L−1
"√ √ #
X n0 d + uk d + uk 1
Γ̂ − Γ ≲ √ + ≤ cΣ
C 0;α,β nk nk nk 2
k=0
for all u1 , . . . , uL−1 ≥ 0 sufficiently small so that the rightmost inequality holds.
18
Proof of Lemma 4.2
Proof of Lemma 4.2. For the time being, we assume that the weights remain
within a finite distance
( r )
d
h := max sup ∥θ(t) − θ(0)∥∗ , c ≤1 (28)
t≤T m
with probability at least 1 − p∞ (τ ) − pL (m, h), where the second but last in-
equality follows from assumptions (19), (20) and in the last
p c∞inequality we have
τ
used the coercivity, (28) and chosen τ = h2γ m so that m ≲ h γ
. The left
hand side contains one negative term −∥κ∥2S−β , which decreases the residual
d 2
dt ∥κ∥S , and one positive term which enlarges it. In the following, we ensure
that these terms are properly balanced.
We eliminate all norms that are not ∥κ∥−α or ∥κ∥α so that we obtain a
closed system of ODEs in these two variables. We begin with ∥κ∥S̄ , which is
already of the right type if S̄ = α but ∥κ∥−3α for S̄ = −α. Since 0 < α < β2 ,
we have −α − β ≤ −3α ≤ α so that we can invoke the interpolation inequality
from Assumption 2
2α β−2α
β
∥v∥−3α ≤ ∥v∥−α−β ∥v∥−αβ .
19
Together with Young’s inequality, this implies
2α 2β−2α
≤ c∥κ∥−α−β + c h ∥κ∥−αβ
β β
α β β γβ
= c α ∥κ∥2−α−β + c (β−α) h β−α ∥κ∥2−α
β
for any generic constant c > 0. Choosing this constant sufficiently small and
plugging into the evolution equation for ∥κ∥−α , we obtain
1 d γβ
∥κ∥2−α ≲ −c∥κ∥2−α−β + h β−α ∥κ∥2−α ,
2 dt
with a different generic constant c. Hence, together with the choice S = α, we
arrive at the system of ODEs
1 d γβ
∥κ∥2−α ≲ −c∥κ∥2−α−β + h β−α ∥κ∥2−α ,
2 dt
1 d
∥κ∥2α ≲ −c∥κ∥2α−β + hγ ∥κ∥α ∥κ∥−α .
2 dt
Next, we eliminate the ∥κ∥2−α−β and ∥κ∥2α−β norms. Since 0 < α < β2 implies
−α − β < α − β < −α < α the interpolation inequalities in Assumption 2 yield
2α β 2α+β β
2α+β −
∥κ∥−α ≤ ∥κ∥−α−β ∥κ∥α2α+β ⇒ ∥κ∥−α−β ≥ ∥κ∥−α
2α
∥κ∥α 2α
2α β−2α β 2α−β
β
∥κ∥−α ≤ ∥κ∥α−β ∥κ∥α β ⇒ ∥κ∥α−β ≥ ∥κ∥−α
2α
∥κ∥α 2α ,
i.e. the error ∥κ∥−α is still larger than the right hand side, which will be our
final error bound, we have
βγ
2α
β
βγ β β
β−α β t
∥κ∥2−α ≲ h β−α ∥κ(0)∥αα + ∥κ(0)∥−α
α
e−ch 2α (30)
20
The second condition B(t) ≥ 0 in Lemma 5.1 is equivalent to axρ0 ≥ by0ρ (no-
tation of the lemma), which in our case is identical to (29) at t = 0. Notice
that the right hand side of (29) corresponds to the first summand in the ∥κ∥2−α
bound so that the second summand must dominate and we obtain the simpler
expression
βγ
Notice that by assumption m is sufficiently large so that the right hand side
is strictly
p smaller than one and thus T is only constrained by (29). In case
h = c d/m there is nothing to show and we obtain
( r )
h 1 1
i β−α d
− 21 β(1+γ)−α
h ≲ max ∥κ(0)∥−α ∥κ(0)∥α m
2 2
,c .
m
Finally, we extend the result beyond the largest time T for which (29) is satisfied
and hence (29) holds with equality. Since ∥κ∥20 is defined by a gradient flow, it
is monotonically decreasing and thus for any time t > T , we have
2α
β
γα γβ β
2 β−α
∥κ(t)∥2−α ≤ ∥κ(T )∥2−α = ch ∥κ(0)∥2α =c h β−α ∥κ(0)∥α
α
βγ
2α
β
βγ β β β
−ch β−α 2α t
≲ h β−α ∥κ(0)∥α + ∥κ(0)∥−α e
α α
so that the error bound (30) holds for all times up to an adjustment of the
constants. This implies the statement of the lemma with our choice of h and τ .
21
Technical Supplements
1
Lemma 5.1. Assume a, b, c, d > 0, ρ ≥ 2 and that x, y satisfy the differential
inequality
with
" −ρ #
b b x0
A := y0ρ , B(t) := 1 − e−bρt
a a y0
we have
−1
x(t) ≤ A (1 − B(t)) , y(t) ≤ y0 .
Proof. First, we show that y(t) ≤ y0 for all t ∈ T . To this end, note that
condition (35) states that we are above a critical point for the second ODE
(34). Indeed, setting y ′ (t) = 0 and thus y(t) = y0 and solving the second ODE
(with = instead of ≤) for x(t), we have
2
2ρ−1
d
x(t) = y0 .
c
22
which upon rearrangement is equivalent to
√
−cxρ y 1−ρ + d xy ≤ 0,
so that the differential equation (34) yields y ′ (t) ≤ 0 and hence y(t) ≤ y0 for
all t < τϵ . On the other hand, for all t > τϵ we have y(t) > y0 (1 + ϵ), which
contradicts the continuity of y. It follows that τϵ ≥ Tϵ and with limϵ→0 Tϵ = T ,
we obtain
y(t) ≤ y0 , t < T.
Next, we show the bounds for x(t). For any fixed function y, the function x is
bounded by the solution z of the equality case
of the first equation (33). This is a Bernoulli differential equation, with solution
Z t − ρ1
−bρt bρτ −ρ −ρ
x(t) ≤ z(t) = e aρ e y(τ ) dτ + x0 .
0
which shows the first bound for x(t). We can estimate this further by
23
5.2 Proof of Lemma 4.3: NTK Hölder continuity
The proof is technical but elementary. We start with upper bounds and Hölder
continuity for simple objects, like hidden layers, and then compose these for
derived objects with results for the NTK at the end of the section.
Throughout this section, we use a bar ¯· to denote a perturbation. In partic-
ular W̄ ℓ is a perturbed weight,
−1/2
f¯ℓ+1 (x) = W̄ ℓ nℓ σ f¯ℓ (x) , f¯1 (x) = W̄ 0 V x
¯ ¯ˆ ¯
is the neural network with perturbed weights and Σ̂, Σ̇, Γ̄ and Γ̂ are the kernels of
the perturbed network. The bounds in this section depend on the operator norm
−1/2
of the weight matrices. At initialization, they are bounded W ℓ nℓ ≲ 1,
with high probability. All perturbations of the weights that we need are close
−1/2
W ℓ − W̄ ℓ nℓ ≲ 1 so that we may assume
−1/2
W ℓ nℓ ≲1 (36)
−1/2
W̄ ℓ nℓ ≲1 (37)
2. Assume that σ satisfies the growth and Lipschitz conditions (13) and (14)
and may be different in each layer. Assume the weights and perturbed
weights are bounded (36), (37). Then
ℓ−1 ℓ−1
1/2 −1/2 −1/2
X Y
f ℓ (x) − f¯ℓ (x) ≲ n0 W k − W̄ k nk W j , W̄ j
max nj .
k=0 j=0
j̸=k
3. Assume that σ has bounded derivative (15) and may be different in each
layer. Assume the weights are bounded (36). Then
" ℓ−1 #
1/2 −1/2
Y
ℓ ℓ k
f (x) − f (x̄) ≲ n0 W nk ∥x − x̄∥.
k=0
24
Proof. 1. For ℓ = 0, we have
1/2 −1/2
f 1 (x) = W 0 V x ≤ n0 W 0 n0 ,
where in the last step we have used that V has orthonormal columns and
∥x∥ ≲ 1. For ℓ > 0, we have
(13)
−1/2 −1/2 −1/2
f ℓ+1 = W ℓ nℓ σ fℓ ≤ W ℓ nℓ σ fℓ W ℓ nℓ fℓ
≲
induction
ℓ−1 ℓ
−1/2 1/2 −1/2 1/2 −1/2
Y Y
≲ W ℓ nℓ n0 W k nk = n0 W k nk ,
k=0 k=0
where in the first step we have used the definition of f ℓ+1 , in the third the
growth condition and in the fourth the induction hypothesis.
2. For ℓ = 0 we have
1/2 −1/2
f 1 − f¯1 = [W 0 − W̄ 0 ]V x = n0 W 0 − W̄ 0 n0 ,
where in the last step we have used that V has orthonormal columns and
∥x∥ ≲ 1. For ℓ > 0, we have
−1/2 −1/2
f ℓ+1 − f¯ℓ+1 = W ℓ nℓ σ f ℓ − W̄ ℓ nℓ σ f¯ℓ
−1/2
≤ W ℓ − W̄ ℓ nℓ σ fℓ
−1/2
+ W̄ ℓ nℓ σ f ℓ − σ f¯ℓ
=: I + II
For the first term, the growth condition (13) implies σ f ℓ ≲ f ℓ and
thus the first part of the Lemma yields
ℓ−1
−1/2 1/2 −1/2
Y
I ≲ W ℓ − W̄ ℓ nℓ n0 W k nk .
k=0
For the second term, we have by Lipschitz continuity (14) and induction
−1/2 −1/2
II = W̄ ℓ nℓ σ f ℓ − σ f¯ℓ ≲ W̄ ℓ nℓ f ℓ − f¯ℓ
ℓ−1 ℓ
1/2 −1/2 −1/2
X Y
W k − W̄ k nk Wj , Wj
≲ n0 max nj .
k=0 j=0
j̸=k
By I and II we obtain
ℓ ℓ
1/2 −1/2 −1/2
X Y
ℓ+1
− f¯ℓ+1 ≲ n0 k k
Wj , Wj
f W − W̄ nk max nj ,
k=0 j=0
j̸=k
25
3. Follows from the mean value theorem because by Lemma 5.3 below the
first derivatives are uniformly bounded.
Lemma 5.3. Assume that σ has bounded derivative (15) and may be different
in each layer. Assume the weights are bounded (36). Then
ℓ−1
1/2 −1/2
Y
Df ℓ (x) ≲ n0 W k nk .
k=0
where in the last step we have used that V has orthonormal columns and ∥Dx∥ =
∥I∥ = 1. For ℓ > 0, we have
−1/2
Df ℓ+1 = W ℓ nℓ Dσ f ℓ
−1/2 −1/2
= W ℓ nℓ Dσ f ℓ ≤ W ℓ nℓ σ̇ f ℓ ⊙ Df ℓ
(15) induction
ℓ−1
−1/2 −1/2 1/2 −1/2
Y
≲ W ℓ nℓ Df ℓ ≲ W ℓ nℓ n0 W k nk
k=0
ℓ
1/2 −1/2
Y
= n0 W k nk ,
k=0
where in the first step we have used the definition of f ℓ+1 , in the fourth the
boundedness of σ̇ and in the fifth the induction hypothesis.
Remark 5.4. An argument analogous to Lemma 5.3 does not show that the
derivative is Lipschitz or similarly second derivatives ∂xi ∂xj f ℓ are bounded.
Indeed, the argument uses that
where we bound the first factor by the upper bound of σ̇ and the second by
induction. However, higher derivatives produce products
1/2
With bounded weights (36) the hidden layers are of size ∂xi f ℓ ≲ n0 but a
naive estimate of their product by Cauchy Schwarz and embedding ∂xi f ℓ ⊙ ∂xj f ℓ ≤
∥∂xi f ℓ ∥ℓ4 ∥∂xi f ℓ ∥ℓ4 ≤ ∥∂xi f ℓ ∥∥∂xi f ℓ ∥ ≲ n0 is much larger.
26
Given the difficulties in the last remark, we can still show that f ℓ is Hölder
continuous with respect to the weights in a Hölder norm with respect to x.
Lemma 5.5. Assume that σ satisfies the growth and Lipschitz conditions (13),
(14) and may be different in each layer. Assume the weights, perturbed weights
and domain are bounded (36), (37), (38). Then for 0 < α < 1
1/2
σ fℓ
C 0;α
≲ n0 .
1/2
σ f¯ℓ
C 0;α
≲ n0 .
" ℓ−1 #1−α
1/2 −1/2
X
σ f ℓ − σ f¯ℓ k k
C 0;α
≲ n0 W − W̄ nk .
k=0
Proof. By the growth condition (13) and the Lipschitz continuity (14) of the
activation function, we have
σ f ℓ C0 ≲ f ℓ C0 , σ f ℓ C 0;1 ≲ f ℓ C 0;1 .
where in the last step we have used the bounds form Lemma 5.2 together with
−1/2 −1/2
W ℓ nℓ ≲ 1 and W̄ ℓ nℓ ≲ 1 from Assumptions (36), (37). Likewise,
by the interpolation inequality in Lemma 6.3 we have
1−α α
σ f ℓ − σ f¯ℓ ≲ σ f ℓ − σ f¯ℓ σ f ℓ − σ f¯ℓ C 0;1
C 0;α C0
n
1−α α α
o
≲ σ f ℓ − σ f¯ℓ σ f ℓ C 0;1 σ f¯ℓ
C0
max C 0;1
.
1−α
n α α
o
≲ f ℓ − f¯ℓ C 0 max f ℓ C 0;1 f¯ℓ C 0;1 .
" ℓ−1 #1−α
1/2 −1/2
X
k k
≲ n0 W − W̄ nk ,
k=0
where in the third step we have used that σ is Lipschitz and in the last step
−1/2
the bounds from Lemma 5.2 together with the bounds W ℓ nℓ ≲ 1 and
ℓ −1/2
W̄ nℓ ≲ 1 from Assumptions (36), (37).
Lemma 5.6. Assume that σ satisfies the growth and Lipschitz conditions (13),
(14) and may be different in each layer. Assume the weights, perturbed weights
27
and domain are bounded (36), (37), (38). Then for 0 < α, β < 1
n0
Σ̂ℓ ≲ ,
C 0;α,β nℓ
¯ n0
Σ̂ℓ ≲ ,
C 0;α,β nℓ
" ℓ−1 #1−α
¯ n0 X −1/2
Σ̂ℓ − Σ̂ℓ ≲ k k
W − W̄ nk .
C 0;α,α nℓ
k=0
¯ 1 T T
Σ̂ℓ − Σ̂ℓ = σ fℓ σ f˜ℓ − σ f¯ℓ σ f˜¯ℓ
C 0;α,α nℓ C 0;α,α
1 T T h ℓ i
σ f ℓ − σ f¯ℓ σ f˜ℓ − σ f¯ℓ σ f˜ − σ f˜¯ℓ
=
nℓ C 0;α,α
1 T
T
h i
˜ℓ − σ f˜¯ℓ
σ f ℓ − σ f¯ℓ σ f˜ℓ ¯ℓ
≤ + σ f σ f
nℓ C 0;α,α C 0;α,α
2 T
σ f ℓ − σ f¯ℓ σ f˜ℓ
= ,
nℓ 0;α,α C
where in the last step we have used symmetry in x and y. Thus, by the product
identity Item 3 in Lemma 6.3, we obtain
¯ 2
Σ̂ℓ − Σ̂ℓ σ f ℓ − σ f¯ℓ C 0;α σ f˜ℓ
≤
C 0;α,α nℓ C 0;α
" ℓ−1 #1−α
n0 X k k −1/2
≲ W − W̄ nk ,
nℓ
k=0
Lemma 5.7 (Lemma 4.3 restated form overview). Assume that σ and σ̇ satisfy
the growth and Lipschitz conditions (13), (14) and may be different in each
28
layer. Assume the weights, perturbed weights and domain are bounded (23) and
nL ∼ nL−1 ∼ · · · ∼ n0 . Then for 0 < α < 1
Γ̂ ≲1
C 0;α,α
¯
Γ̂ ≲1
C 0;α,α
"L−1 #1−α
¯ n0 X k k −1/2
Γ̂ − Γ̂ ≲ W − W̄ nk .
C 0;α,α nL
k=0
Thus, since Hölder spaces are closed under products, Lemma 6.3 Item 4, it
follows that
¯ ¯ˆ L
ˆ L (x, y)Σ̂L−1 (x, y) − Σ̇ ¯
Γ̂ − Γ̂ = Σ̇ (x, y)Σ̂L−1 (x, y)
C 0;α,α C 0;α,α
ˆ ¯ˆ L
≤ Σ̇ (x, y) − Σ̇ (x, y) Σ̂L−1 (x, y)
L
C 0;α,α
¯ ¯ L−1
ˆ L (x, y) Σ̂L−1 (x, y) − Σ̂
h i
+ Σ̇ (x, y)
C 0;α,α
¯ˆ L
ˆ L (x, y) − Σ̇
≤ Σ̇ (x, y) Σ̂L−1 (x, y)
C 0;α,α C 0;α,α
¯
ˆ L (x, y) ¯
+ Σ̇ Σ̂L−1 (x, y) − Σ̂L−1 (x, y)
C 0;α,α C 0;α,α
" ℓ−1 #1−α
n0 X k k −1/2
≲ W − W̄ nk ,
nℓ
k=0
where in the last step we have used Lemma 5.6 and nL ∼ nL−1 .
29
5.3 Proof of Lemma 4.4: Concentration
Concentration for the NTK
is derived from concentration for the forward kernels Σ̇L and ΣL−1 . They are
shown inductively by splitting off the expectation Eℓ [·] with respect to the last
layer W ℓ in
h i h i
Σ̂ℓ+1 − Σℓ+1 ≤ Σ̂ℓ+1 − Eℓ Σ̂ℓ+1 + Eℓ Σ̂ℓ+1 − Σℓ+1 .
C 0;α,β C 0;α,β C 0;α,β
Concentration for the first term is shown in Section 5.3.1 by a chaining argument
and bounds for the second term in Section 5.3.2 with an argument similar to
[18]. The results are combined into concentration for the NTK in Section 5.3.3.
For fixed weights W 0 , . . . , W ℓ−2 and random W ℓ−1 , all Λ̂ℓr , r ∈ [nℓ ] are random
variables dependent only on the random vector Wr·ℓ−1 and thus independent.
Hence, we can show concentration uniform in x and y by chaining. For Dudley’s
inequality, one would bound the increments
where the right hand side is a metric for α ≤ 1. However, this is not sufficient
in our case. First, due to the product in the definition of Λ̂ℓr , we can only
bound the ψ1 norm and second this leads to a concentration of the supremum
norm ∥Λ̂ℓr ∥C 0 , whereas we need a Hölder norm. Therefore, we bound the finite
difference operators
β β
∆α ℓ α ℓ
x,hx ∆y,hy Λ̂r (x, y) − ∆x,h̄x ∆y,h̄ Λ̂r (x̄, ȳ)
y ψ1
which can be conveniently expressed by the Orlicz space valued Hölder norm
∆α β ℓ
x ∆y Λ̂r ≲ 1,
C 0;α,β (∆D×∆D;ψ1 )
30
1. Finite difference operators ∆α : (x, h) → h−α [f (x + h) − f (x)], depending
both on x and h, with partial application two variables x and y denoted
by ∆α α
x and ∆y , respectively. See Section 6.1.
2. Assume that σ has bounded derivative (15) and may be different in each
layer. Then
1/2
n0
frℓ (x) frℓ (x̄)
σ −σ ψ2
≲ ∥x − x̄∥.
nℓ−1
ℓ−1 −1/2
is a sum of independent random variables Wrs nℓ−1 σ fsℓ−1 , s ∈ [nℓ−1 ],
by Hoeffding’s inequality (general version for sub-gaussian norms, see e.g.
[67, Proposition 2.6.1]) we have
−1/2 −1/2
Wr·ℓ−1 nℓ−1 σ f ℓ−1 σ f ℓ−1
≲ nℓ−1 .
ψ2
Thus
−1/2
σ frℓ ≲ frℓ = Wr·ℓ−1 nℓ−1 σ f ℓ−1
ψ2 ψ2 ψ2
1/2
−1/2 ℓ−1 −1/2 n0
f ℓ−1 ≲
≤ nℓ−1 σ f ≤ nℓ−1 ,
nℓ−1
31
where in the first step we have used the growth condition and Lemma 6.7,
in the fourth step the growth condition and in the last step the upper
bounds from Lemma 5.2.
2. Using Hoeffding’s inequality analogous to the previous item, we have
−1/2
Wr·ℓ−1 nℓ−1 σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
ψ2
−1/2
σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
≲ nℓ−1
and
ℓ−1 −1/2
= Wr· nℓ−1 σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
ψ2
−1/2
≲ nℓ−1 σ f ℓ−1 (x) − σ f ℓ−1 (x̄)
−1/2
≲ nℓ−1 f ℓ−1 (x) − f ℓ−1 (x̄)
1/2
n0
≲ ∥x − x̄∥,
nℓ−1
where in the first step we have used the Lipschitz condition and Lemma
6.7, in the fourth step the Lipschitz condition and in the last step the
Lipschitz bounds from Lemma 5.2.
Lemma 5.9. Let U and V be two normed spaces and D ⊂ U . For all 0 ≤ α ≤ 21 ,
we have
∥∆α f ∥C 0;α (∆D;V ) ≤ 4 ∥f ∥C 0;2α (D;V ) ,
with ∆D defined in (48).
Proof. Throughout the proof, let C 0;2α = C 0;2α (D; V ) and | · | = ∥ · ∥U or | · | =
∥ · ∥V depending on context. Unraveling the definitions, for every (x, h), (x̄, h̄) ∈
∆D, we have to show
∆α α α
h f (x) − ∆h̄ f (x̄) ≤ 4∥f ∥C 0;2α max{|x − x̄|, |h − h̄|} .
We consider two cases. First, assume that |h| ≤ max{|x − x̄|, |h − h̄|} and h̄ is
arbitrary. Then |h̄| ≤ |h̄ − h| + |h| ≤ 2 max{|x − x̄|, |h − h̄|} and thus
∆α α α α
h f (x) − ∆h̄ f (x̄) ≤ |∆h f (x)| + ∆h̄ f (x̄)
≤ ∥f ∥C 0;2α |h|α + ∥f ∥C 0;2α |h̄|α ≤ 3∥f ∥C 0;2α max{|x − x̄|, |h − h̄|}α .
32
In the second case, assume that max{|x − x̄|, |h − h̄|} ≤ |h| and without loss of
generality that |h| ≤ |h̄|. Then
−α
∆α α
h f (x) − ∆h̄ f (x̄) ≤ [f (x + h) − f (x)]|h| − [f (x̄ + h̄) − f (x̄)]|h̄|−α
≤ f (x + h) − f (x) − f (x̄ + h̄) + f (x̄) |h|−α
+ |f (x̄ + h̄) − f (x̄)| |h|−α − |h̄|−α
=: I + II.
Lemma 5.10. Assume for k = 0, . . . , ℓ−2 the weights Wk are fixed and bounded
−1/2
∥W k ∥nk ≲ 1. Assume that W ℓ−1 is i.i.d. sub-gaussian with ∥Wijℓ−1 ∥ψ2 ≲ 1.
Assume that σ satisfies the growth condition (13), has bounded derivative (15)
and may be different in each layer. Let r ∈ [nℓ ]. Then for α, β ≤ 1/2
n0
∆α β ℓ
x ∆y Λ̂r ≲ ,
C 0;α,β (∆D×∆D;ψ1 ) nℓ−1
with ∆D defined in (48).
Proof. Throughout the proof, we abbreviate
Since by Lemma 6.8 we have ∥XY ∥ψ1 ≤ ∥X∥ψ2 ∥Y ∥ψ2 by the product inequality
Lemma 6.3 Item 3 for Hölder norms we obtain
∆α ∆β Λ̂ℓ = ∆α σ f ℓ ∆β σ f˜ℓ
x y r x r y r
C 0;α,β (ψ1 ) C 0;α,β (ψ1 )
∆α ℓ
∆βy σ f˜rℓ
≲ xσ fr C 0;α (ψ2 )
.
C 0;β (ψ2 )
33
Next, we use Lemma 5.9 to eliminate the finite difference in favour of a higher
Hölder norm
∆α β ℓ ℓ
f˜rℓ
∆ Λ̂
x y r ≲ σ f r C 0;2α (ψ )
σ .
C 0;α,β (ψ1 ) 2 C 0;2β (ψ2 )
1/2 −1/2
Finally, Lemma 5.8 implies that σ frℓ C 0;2α (D;ψ2 )
≤ n0 nℓ−1 and likewise
for f˜ℓ and thus
r
n0
∆α β ℓ
x ∆y Λ̂r ≲ .
C 0;α,β (ψ1 ) nℓ−1
Lemma 5.11. Assume for k = 0, . . . , ℓ−2 the weights Wk are fixed and bounded
−1/2
∥W k ∥nk ≲ 1. Assume that W ℓ−1 is i.i.d. sub-gaussian with ∥Wijℓ−1 ∥ψ2 ≲ 1.
Assume that the domain D is bounded, that σ satisfies the growth condition
(13), has bounded derivative (15) and may be different in each layer. Then for
α = β = 1/2
" "√ √ ##
h i n0 d+ u d+u
Pr ℓ
Σ̂ − E Σ̂ ℓ
≥C √ + ≤ e−u .
C 0;α,β (D) nℓ−1 nℓ−1 nℓ−1
Proof. Since ∆α β ℓ
x ∆y Λ̂rfor r ∈ [nℓ ] only depends on the random vector Wr·ℓ−1 , all
β
stochastic processes ∆α ℓ
x,hx ∆y,hy Λ̂r (x, y) are independent
(x,hx ,y,hy )∈∆D×∆D
and satisfy
n0
∆α β ℓ
x ∆y Λ̂r ≲
nℓ−1
C 0;α,β (∆D×∆D;ψ1 )
by Lemma 5.10. Thus, we can estimate the processes’ supremum by the chaining
Corollary 6.12
nℓ−1
1 X h i
Pr sup ∆α β ℓ α β ℓ
x ∆y Λ̂r − E ∆x ∆y Λ̂r ≥ Cτ ≤ e−u ,
(x,hx )∈∆D nℓ−1 r=1
(y,hy )∈∆D
and
nℓ−1 nℓ−1
1 X 1 X
∆α β ℓ α β
x ∆y Λ̂r = ∆x ∆y Λ̂ℓr = ∆α β ℓ
x ∆y Σ̂
nℓ−1 r=1
nℓ−1 r=1
completes the proof.
34
5.3.2 Perturbation of Covariances
This section contains the tools to estimate
h i
Eℓ Σ̂ℓ+1 − Σℓ+1 ,
C 0;α,β
35
Thus, by Mehler’s theorem (Theorem 6.14 in the appendix) we conclude that
∞
ρk
ZZ X
E(u,v)∼N (0,A) [σ(u)σ(v)] = σ(au)σ(bv) Hk (u)Hk (v) dN (0, 1)(u) dN (0, 1)(v)
k!
k=0
∞
X ρk
= ⟨σa , Hk ⟩N ⟨σb , Hk ⟩N .
k!
k=0
a2 ρab
Lemma 5.13. Assume A = is positive semi-definite and all deriva-
ρab b2
(γb +γρ )
tives up to σ (γa +γρ ) and σb are continuous and have at most polynomial
growth for x → ±∞. Then
We first
estimate
the ρ derivative.
Since 0 ⪯ A and a, b > 0, we must have
1 ρ 1 ρ
0⪯ and thus det = 1 − ρ2 ≥ 0. It follows that |ρ| ≤ 1. Therefore
ρ 1 ρ 1
ρk 1 k! 1
∂ γρ = ρk−γρ ≤ . (41)
k! k! (k − γρ )! (k − γρ )!
36
Plugging the last equation and (41) into (40), we obtain
where in the second step we have used Cauchy-Schwarz and in the last that H̄k
are an orthonormal basis.
Lemma 5.14. Let f (a11 , a22 , a12 ) be implicitly defined by solving the identity
a11 a12 a ρab
=
a12 a22 ρab b
for a, b and ρ. Let Df be a domain with a11 , a22 ≥ c > 0 and |a12 | ≲ 1. Then
∥f ′′′ ∥C 1 (Df ) ≲ 1.
Since the denominator is bounded away from zero, all third partial derivatives
exist and are bounded.
with
∂ i (σa ) N
≲ 1, i = 1, . . . , 3,
37
with σa defined in (39). Then, for α, β ≤ 1 the functions
satisfy
Proof. Define
a ρab
F (a, b, ρ) = E(u,v)∼N (0,Ā) [σ(u)σ(v)] . Ā =
ρab b
and
38
5.3.3 Concentration of the NTK
We combine the results from the last two sections to show concentration in-
equalities, first for the forward kernels Σℓ and Σ̇ℓ and then for the NTK Γ.
Lemma 5.16. Let α = β = 1/2 and k = 0, . . . , ℓ.
1. Assume that all W k are are i.i.d. standard normal.
2. Assume that σ satisfies the growth condition (13), has uniformly bounded
derivative (15), derivatives σ (i) , i = 0, . . . , 3 are continuous and have at
most polynomial growth for x → ±∞ and the scaled activations satisfy
∂ i (σa ) N
≲ 1, a ∈ {Σk (x, x) : x ∈ D}, i = 1, . . . , 3,
we have
Σℓ C 0;α,β
≲1
Σ̂ℓ ≲1
C 0;α,β
ℓ−1
"√ √ #
ℓ ℓ
X n0 d + uk d + uk 1
Σ̂ − Σ ≲ √ + ≤ cΣ
C 0;α,β nk nk nk 2
k=0
for all u1 , . . . , uℓ−1 ≥ 0 sufficiently small so that the last inequality holds.
Proof. We prove the statement by induction. Let us first consider ℓ ≥ 1. We
split off the expectation over the last layer
h i h i
Σ̂ℓ+1 − Σℓ+1 ≤ Σ̂ℓ+1 − Eℓ Σ̂ℓ+1 + Eℓ Σ̂ℓ+1 − Σℓ+1
C 0;α,β C 0;α,β C 0;α,β
= I + II,
39
which is true with probability at least 1 − 2e−nk , see e.g. [67, Theorem 4.4.5].
Then, by Lemma 5.11 for uℓ ≥ 0
" "√ √ ##
h i n0 d + uℓ d + uℓ
Pr Σ̂ ℓ+1
− E Σ̂ ℓ+1
≥C √ + ≤ e−uℓ . (43)
C 0;α,β (D) nℓ nℓ nℓ
Next we estimate II. To this end, recall that Σ̂ℓ+1 (x, y) is defined by
nℓ+1
ℓ+1 1 X
σ frℓ+1 (x) σ frℓ+1 (y) .
Σ̂ (x, y) =
nℓ r=1
It follows that
40
Next, we bound the off diagonal terms. Since the weights are bounded (42),
Lemma 5.6 implies
n0
Σ̂ℓ ≲ ≲ 1, Σℓ C 0;α,β
≲ 1,
C 0;α,β nl
where the last inequality follows from (45). In particular,
where the last line follows by induction. Together with (42), (43) and a union
bound, this shows the result for ℓ ≥ 1.
Finally, we consider the induction start for ℓ = 0. The proof is the same,
except that in (44) the covariance simplifies to
Hence,
h for ℓ i= 1 the two covariances A and  are identical and therefore
∥E0 Σ̂1 (x, y) − Σ1 ∥C 0;α,β = 0.
Lemma 5.17 (Lemma 4.4, restated from the overview). Let α = β = 1/2 and
k = 0, . . . , L − 1.
1. Assume that W L ∈ {−1, +1} with probability 1/2 each.
2. Assume that all W k are are i.i.d. standard normal.
3. Assume that σ and σ̇ satisfy the growth condition (13), have uniformly
bounded derivatives (15), derivatives σ (i) , i = 0, . . . , 3 are continuous and
have at most polynomial growth for x → ±∞ and the scaled activations
satisfy
∂ i (σa ) N
≲ 1, ∂ i (σ̇a ) N
≲ 1, a ∈ {Σk (x, x) : x ∈ D}, i = 1, . . . , 3,
41
Then, with probability at least
L−1
X
1−c e−nk + e−uk (46)
k=1
we have
L−1
"√ √ #
X n0 d + uk d + uk 1
Γ̂ − Γ ≲ √ + ≤ cΣ
C 0;α,β nk nk nk 2
k=0
for all u1 , . . . , uL−1 ≥ 0 sufficiently small so that the rightmost inequality holds.
Proof. By definition (11) of Γ and Lemma 4.1 for Γ̂, we have
and therefore
where in the last step we have used Lemma 6.3 Item 4. Thus, the result follows
from
and
ˆL
n o
max ΣL−1 − Σ̂L−1 , Σ̇L − Σ̇
C 0;α,β C 0;α,β
L−1
"√ √ #
X n0 d + uk d + uk 1
≲ √ + ≤ cΣ ,
nk nk nk 2
k=0
with probability (46) by Lemma 5.16. For Σ̇L , we do not require the lower
bound Σ̇k (x, x) ≥ cΣ > 0 because in the recursive definition σ̇ is only used in
the last layer and therefore not necessary in the induction step in the proof of
Lemma 5.16.
42
and the corresponding maximum norm ∥ · ∥C 0 (D;∗) for functions mapping x to a
tensor measured in the ∥ · ∥∗ norm. We use this norm for an inductive argument
in a proof, but later only apply it for the last layer ℓ = L + 1. In this case
nL+1 = 1 and the norm reduces to a regular matrix norm.
Lemma 5.18. Assume that σ satisfies the growth and derivative bounds (13),
(15) and may be different in each layer. Assume the weights are bounded
−1/2
∥W k ∥nk ≲ 1, k = 1, . . . , ℓ − 1. Then for 0 ≤ α ≤ 1
1/2
ℓ n0
∂W k f C 0 (D;∗) ≲ .
nk
Proof. First note that for any tensor T
X
ur vi wj Trij ≤ C∥u∥∥v∥∥w∥
r,i,j
C0
−1/2
X
ur vi wj ∂Wijk frk (x) (uT v) wT σ f k
= nk
C0
r,i,j
C0
1/2
−1/2 k
n0
≤ nk ∥u∥∥v∥∥w∥ σ f C0
≲ ∥u∥∥v∥∥w∥ ,
nk
where in the last step we have used Lemma 5.5. Thus, we conclude that
1/2
n0
∂W k f k+1 (x) C 0 (D;∗) ≲ .
nk
For k < ℓ − 1, we have
h i
−1/2 −1/2
∂Wijk f ℓ (x) = ∂Wijk W ℓ−1 nℓ−1 σ f ℓ−1 = W ℓ−1 nℓ−1 σ̇ f ℓ−1 ⊙ ∂Wijk f ℓ−1
and therefore
−1/2
X
ur vi wj ∂Wijℓ frk ≤ ∥uT W ℓ−1 nℓ−1 ∥∥v∥∥w∥ σ̇ f ℓ−1 ⊙ ∂Wijk f ℓ−1
C 0 (D;∗)
r,i,j
C0
43
−1/2
where in the second step we have used that ∥W ℓ−1 ∥nℓ−1 ≲ 1 and in the last
step we have used that σ̇ f ℓ−1 ℓ ≲ 1 because |σ̇(·)| ≲ 1 and the induction
∞
hypothesis. It follows that
1/2
n0
∂W k f ℓ (x) C 0 (D;∗)
≲ .
nk
Lemma 5.19 (Lemma 4.5, restated from the overview). Assume that σ satisfies
the growth and derivative bounds (13), (15) and may be different in each layer.
Assume the weights are defined by the gradient flow (6) and satisfy
−1/2
∥W ℓ (0)∥nℓ ≲ 1, ℓ = 1, . . . , L,
ℓ ℓ −1/2
∥W (0) − W (τ )∥nℓ ≲ 1, 0 ≤ τ < t.
Then
1/2 Z t
−1/2 n0
W ℓ (t) − W ℓ (0) nℓ ≲ ∥κ∥C 0 (D)′ dx dτ,
nℓ 0
′
where C 0 (D) is the dual space of C 0 (D).
we have
Z t
d ℓ
W ℓ (t) − W ℓ (0) = W (τ ) dτ
0 dτ
Z tZ
= κ(x)DWℓ f L+1 (x) dx dτ
0 D
Z tZ
≤ |κ(x)| DWℓ f L+1 (x) dx dτ
0 D
1/2 Z t
n0
≲ ∥κ∥C 0 (D)′ dx dτ,
nℓ 0
−1/2
where in the last step we have used Lemma 5.18. Multiplying with nℓ shows
the result.
44
5.5 Proof of Theorem 2.1: Main Result
Proof of Theorem 2.1. The result follows directly from Lemma 4.2 with the
smoothness spaces Hα = H α (Sd−1 ). While the lemma bounds the residual
κ in the H−α and Hα norms, we aim for an H0 = L2 (Sd−1 ) bound. This follows
directly from the interpolation inequality
1/2 1/2
∥ · ∥L2 (Sd−1 ) = ∥ · ∥H 0 (Sd−1 ) ≤ ∥ · ∥H −α (Sd−1 ) ∥ · ∥H α (Sd−1 ) .
It remains to verify all assumptions. To this end, first note that the initial
weights satisfy
−1/2
∥W (0)ℓ ∥nℓ ≲ 1, ℓ = 0, . . . , L, (47)
−1/2
∥θ(t) − θ(0)∥∗ = max W ℓ (t) − W ℓ (0) nℓ
ℓ∈[L]
1/2 Z t Z t
n0
≲ ∥κ∥C 0 (Sd−1 )′ dx dτ, ≲ m−1/2 ∥κ∥H 0 (Sd−1 ) dx dτ,
nℓ 0 0
for all a ∈ {Σk (x, x) : x ∈ D} contained in the set {cΣ , CΣ } for some
CΣ ≥ 0, by assumption. Together with α + ϵ < 1/2 for sufficiently small ϵ,
hidden dimensions d ≲ n0 ∼ . . . , ∼ nL =: m and the concentration result
Lemma 4.4 we obtain, with probability at least
45
the bound "r r #
d τ τ
Γ̂ − Γ ≲L + +
C 0;α+ϵ,α+ϵ m m m
for the neural tangent kernel for all 0 ≤ τ = u0 = · · · = uL−1 ≲ 1. By
Lemma 6.16, the kernel bound directly implies the operator norm bound
"r r #
d τ τ
H − Hθ(0) −α,α ≲ L + +
m m m
for the corresponding integral operators H and Hθ(0) , with kernels Γ and
Γ̂, respectively. If τ /m ≲ 1, we can drop the last term and thus satisfy
assumption (19).
4. Hölder continuity of the NTK (20): By (47) with probability at least
1 − pL (m) := 1 − Le−m
we have ∥θ(0)∥∗ ≲ 1 and thus for all perturbations θ̄ with θ̄ − θ(0) ∗
≤
h ≤ 1 by Lemma 4.3 that
¯
Γ̂ − Γ̂ ≲ Lh1−α−ϵ
C 0;α+ϵ,α+ϵ
for any sufficiently small ϵ > 0. By Lemma 6.16, the kernel bound implies
the operator norm bound
Hθ(0) − Hθ̄ α←−α
≲ Lhγ
for any γ < 1 − α and integral operators Hθ(0) and Hθ̄ corresponding to
kernels Γθ (0) and Γ̂θ̄ , respectively.
5. Coercivity (5): Is given by assumption.
Thus, all assumptions of Lemma 4.2 are satisfied, which directly implies the
theorem as argued above.
6 Technical Supplements
6.1 Hölder Spaces
Definition 6.1. Let U and V be two normed spaces.
1. For 0 < α ≤ 1, we define the Hölder spaces on the domain D ⊂ U as all
functions f : D → V for which the norm
∥f ∥C 0;α (D;V ) := max{∥f ∥C 0 (D;V ) , |f |C 0;α (D;V ) } < ∞
is finite, with
∥f (x) − f (x̄)∥V
|f |C 0 (D;V ) := sup ∥f (x)∥V , |f |C 0;α (D;V ) := sup .
x∈D x̸=x̄∈D ∥x − x̄∥α
U
46
2. For 0 < α, β ≤ 1, we define the mixed Hölder spaces on the domain
D × D ⊂ U × U as all functions g : D × D → V for which the norm
with
which satisfy product and chain rules similar to derivatives. We may also con-
sider these as functions in both x and h
∆α f : (x, h) ∈ ∆D → V, ∆α f (x, h) = ∆α
h f (x)
on the domain
∆D := {(x, h) : x ∈ D, x + h ∈ D} ⊂ U × U. (48)
47
If f = f (x, y) depends on multiple variables, we denote the partial finite differ-
ence operators by ∆α α
x,hx and ∆y,hy defined by
−α
∆0x,hx f (x, y) := f (x, y), ∆α
x,hx f (x, y) := ∥hx ∥U [f (x + hx , y) − f (x, y)],
∆α α
x f (x, y, hx ) = ∆x,hx f (x, y), ∆α α
y f (x, y, hy ) = ∆y,hy f (x, y).
∆α α α
h [f g](x) = [∆h f (x)] g(x) + f (x + h) [∆h g(x)] .
Then
∆α ¯ α
h (f ◦ g)(x) = ∆h (f, g)(x)∆h g(x).
48
In the following lemma, we summarize several useful properties of Hölder
spaces.
Lemma 6.3. Let U and V be two normed spaces, D ⊂ U and 0 < α, β ≤ 1.
1. Interpolation Inequality: For any f ∈ C 1 (D; V ), we have
∥f ∥C 0;α (D;V ) ≤ 2∥f ∥1−α α
C 0 (D;V ) ∥f ∥C 0;1 (D;V ) .
2. Follows from
∥σ(f (x)) − σ(f (x̄))∥V
|σ ◦ f |C 0;α (D;V ) = sup
x,x̄∈D ∥x − x̄∥α
U
∥f (x) − f (x̄)∥α
V
≲ sup = ∥f ∥C 0;α (D;V ) .
x,x̄∈D ∥x − x̄∥α
U
49
and analogous identities for the remaining semi norms |f g|C 0;0,0 (D;V2 ) ,
|f g|C 0;α,0 (D;V2 ) , |f g|C 0;0,β (D;V2 ) .
4. We only show the bound for | · |C 0;α,β (D) . The other semi-norms follow
analogously. Applying the product rule (Lemma 6.2)
∆α
α α
x,hx [f (x, y)g(x, y)] = ∆x,hx f (x, y) g(x, y)+ f (x+hx , y) ∆x,hx f (x, y)
= ∆βy,hy ∆α
α
x,hx f (x, y) g(x, y) + f (x + hx , y) ∆x,hx f (x, y)
h i h β i
= ∆βy,hy ∆α
α
x,hx f (x, y) g(x, y) + ∆ x,hx f (x, y + hy ) ∆ y,hy g(x, y)
h i h i
+ ∆y,hy f (x + hx , y) ∆x,hx f (x, y) + f (x + hx , y + hy ) ∆βy,hy ∆α
β α
x,hx f (x, y) .
The following two lemmas contain chain rules for Hölder and mixed Hölder
spaces.
Lemma 6.4. Let D ⊂ U and Df ⊂ V be domains in normed spaces U , V and
W . Let g : D → Df and f : Df → W . Let 0 < α, β ≤ 1. Then
∥∆α (f ◦ g)∥C 0 (∆D;W ) ≤ ∥f ′ ∥C 0;0 (Df ;L(V,W )) ∥g∥C 0;α (D;V )
and
where L(V, W ) is the space of all linear maps V → W with induced operator
norm.
Proof. Note that
Z 1
¯ h (f, g)(x) :=
∆ f ′ (tg(x + h) + (1 − t)g(x)) dt
0
¯ h (f, g)(x)v∥W ≤ ∥∆
takes values in the linear maps L(V, W ) and thus ∥∆ ¯ h (f, g)(x)∥L(V,W ) ∥v∥V ,
for all v ∈ V . Using the chain rule Lemma 6.2, it follows that
∥∆α ¯ α
h (f ◦ g)(x)∥W = ∆h (f, g)(x)∆h g(x) W
¯ h (f, g)(x)
≤ ∆ ∥∆α g(x)∥
L(V,W ) h V
50
and
∥∆α α ¯ α ¯ α
h (f ◦ g)(x) − ∆h (f ◦ ḡ)(x)∥W = ∆h (f, g)(x)∆h g(x) − ∆h (f, ḡ)(x)∆h ḡ(x) W
¯ h (f, g)(x) − ∆
≤ ∆ ¯ h (f, ḡ)(x) ∥∆α h g(x)∥
L(V,W ) V
¯ h (f, ḡ)(x)
+ ∆ α α
∥∆h g(x) − ∆h ḡ(x)∥V .
L(V,W )
and
¯ h (f, g)(x) − ∆
∆ ¯ h (f, ḡ)(x)
L(V,W )
Z 1
≤ ∥f ′ ∥C 0;1 (Df ;L(V,W )) ∥t(g − ḡ)(x + h) + (1 − t)(g − ḡ)(x)∥ dt
0
≤ ∥f ′ ∥C 0;1 (Df ;L(V,W )) ∥g − ḡ∥C 0 (D;V ) , (50)
∆α ∆β [f ◦ g − f ◦ ḡ] C 0 (∆D×∆D;W )
∆βy,hy [f ◦ g − f ◦ ḡ] = ∆
¯ y,h (f, g)∆β g − ∆
y y,hy
¯ y,h (f, ḡ)∆β ḡ
y y,hy
β
¯ ¯
= ∆y,hy (f, g) − ∆y,hy (f, ḡ) ∆y,hy g
h i
+∆ ¯ y,h (f, ḡ) ∆β g − ∆β ḡ
y y,hy y,hy
=: I + II.
Applying the product rule Lemma 6.2 to the first term yields
51
Likewise, applying the product Lemma rule 6.2 to the second term yields
h i
¯ β β
∆α α
x,hx II W = ∆x,hx ∆y,hy (f, ḡ) ∆y,hy g − ∆y,hy ḡ
W
h i
¯ β β
+ ∆y,hy (f, ḡ)(x + hx , y) ∆x,hx ∆y,hy g − ∆α
α
x,hx ∆y,hy ḡ
W
≤ ∆α ¯
x,hx ∆y,hy (f, ḡ) L(V,W )
∆βy,hy g − ∆βy,hy ḡ
W
¯ y,h (f, ḡ)(x + hx , y) β β
+ ∆ y L(V,W )
∆α
x,hx ∆y,hy g − ∆α
x,hx ∆y,hy ḡ .
W
All terms involving only g and ḡ can easily be upper bounded by ∥g∥C 0;α,β (D;V ) ,
∥ḡ∥C 0;α,β (D;V ) or ∥g − ḡ∥C 0;α,β (D;V ) . The terms
are bounded by (49) and (50) in the proof of Lemma 6.4. For the remaining
terms, define
G(x) := tg(x, y + hy ) + (1 − t)g(x, y)
and likewise Ḡ. Then
∥G∥C 0;α (D,V ) ≲ ∥g∥C 0;α,β (D,V ) , ∥G − Ḡ∥C 0;α (D,V ) ≲ ∥g − ḡ∥C 0;α,β (D,V ) .
and
∆α ¯ ¯
x,hx ∆y,hy (f, g) − ∆y,hy (f, ḡ) L(V,W )
Z 1
′ ′
∆α
= x,hx f ◦ G − f ◦ Ḡ dt
0 L(V,W )
′′
≤ 2∥f ∥ C 0;1 (Df ;L(V,L(V,W ))) ∥g − ḡ∥C 0;α,β (D;V ) max{1, ∥ḡ∥C 0;α,β (D;V ) }.
6.2 Concentration
In this section, we recall the definition of Orlicz norms, some basic properties
and the chaining concentration inequalities we use to show that the empirical
NTK is close to the NTK.
52
Definition 6.6. For random variable X, we define the sub-gaussian and sub-
exponential norms by
∥X∥ψ2 = inf t > 0 : E exp(X 2 /t2 ) ≤ 2 ,
1 ∥Y ∥ψ2 X 2 1 ∥X∥ψ2 Y 2
|XY |
exp ≤ exp +
t 2 ∥X∥ψ2 t2 2 ∥Y ∥ψ2 t2
1/2
∥Y ∥ψ2 X 2 ∥X∥ψ2 Y 2 √ √
≤ exp 2
+ 2
≤ 2 2 ≤ 2.
∥X∥ψ2 t ∥Y ∥ψ2 t
Hence
∥XY ∥ψ1 ≤ t ≤ ∥X∥ψ2 ∥Y ∥ψ2 .
53
Theorem 6.9 ([16, Theorem 3.5]). Let X be a normed linear space. Assume
the X valued separable random process (Xt )t∈T , has a mixed tail, with respect
to some semi-metrics d1 and d2 on T , i.e.
√
Pr ∥Xt − Xs ∥ ≥ ud2 (t, s) + ud1 (t, s) ≤ 2e−u
where the infimum is taken over all admissible sequences Tn ⊂ T with |T0 | = 1
n
and |Tn | ≤ 22 . Then for any t0 ∈ T
√
Pr sup ∥Xt − Xt0 ∥ ≥ C γ2 (T, d2 ) + γ1 (T, d1 ) + u∆d2 (T ) + u∆d1 (T ) ≤ e−u .
t∈T
Remark 6.10. [16, Theorem 3.5] assumes that T is finite. Using separability
and monotone convergence, this can be extended to infinite T by standard
arguments.
Lemma 6.11. Let 0 ≤ α ≤ 1 and D ⊂ Rd be as set of Euclidean norm | · |-
diameter smaller than R ≥ 1. Then
α 1/2
α 3α + 1 1+α α 3
γ1 (D, | · | ) ≲ R d, γ2 (D, | · | ) ≲ Rα/2 d1/2 .
α 4α
Proof. Let N (D, | · |α , u) be the covering number of D, i.e. the smallest number
of u-balls in the metric | · |α necessary to cover D. It is well known (e.g. [16,
(2.3)]) that
Z ∞ Z Rα
1/i 1/i
γi (D, | · |α ) ≲ [log N (D, | · |α , u)] du ≲ [log N (D, | · |α , u)] du,
0 0
where in the last step we have used that N (D, | · |α , u) = 1 for u ≥ Rα and thus
its logarithm is zero. Since every u-cover in the | · | norm is a uα cover in the
| · |α metric, the covering numbers can be estimated by
d d/α
(3R)α
α 1/α 3R
N (D, | · | , u) = N (D, | · |, u ) ≤ = ,
u1/α u
see e.g. [67]. Hence
Rα d/α Z α
(3R)α d R
Z
γ1 (D, | · |α ) ≲ log du = α log(3R) − log u du
0 u α 0
d d
3αR1+α − Rα log Rα + Rα ≤ (3α + 1)R1+α
≤
α α
54
and using log x ≤ x − 1 ≤ x
1/2 Z Rα 1/2 1/2 Z Rα 1/2
d (3R)α d (3R)α
γ2 (D, |·|α ) ≲ log du ≲ du
α 0 u α 0 u
α α 1/2 Z Rα 1/2 α 1/2
3 dR 1 3 d
≲ du ≲ Rα/2 .
α 0 u 4α
Then
" #
N 1/2 u 1/2
1 X d d u ≤ e−u .
Pr sup Xj,t − E [Xj,t ] ≥ CL + + +
t∈T N j=1 N N N N
Proof. We show the result with Theorem 6.9 for the process
N
1 X
Yt := Xj,t − E [Xj,t ] .
N j=1
τ2
τ
≤ 2 exp −cN min , .
L2 |t − s|2α L|t − s|α
τ2
r
τ u u
u := cN min , ⇒ τ = L|t − s|α max ,
L2 |t − s|2 L|t − s| cN cN
55
and thus
r
α u u
Pr |Yt − Ys | ≥ L|t − s| max , ≤ 2 exp(−u). (51)
cN cN
which directly yields the corollary with supt∈D ∥Yt ∥ ≤ supt∈D ∥Yt − Yt0 ∥ + ∥Yt0 ∥
and (51).
⟨Hn , Hm ⟩N = n! δnm .
56
Proof. The normalization is well known, we only show the formula for the
dn−k−1 −x2 /2
derivative. By the growth condition, we have f (k) (x) dx n−k−1 e → 0 for
x → ±∞. Thus, in the integration by parts formula below all boundary terms
vanish and we have
Z
1 2
⟨f, Hn ⟩ = √ f (u)Hn (u)e−x /2 du.
2π R
dn
Z
1 2 2 2
=√ f (u) (−1)n ex /2 n e−x /2 e−x /2 du.
2π R dx
Z n
1 d 2
= √ (−1)n f (u) n e−x /2 du.
2π R dx
dn−k
Z
1 2
= √ (−1)n−k f (k) (u) n−k e−x /2 du.
2π R dx
dn−k
Z
1 2 2 2
=√ f (k) (u) (−1)n−k ex /2 n−k e−x /2 e−x /2 du.
2π R dx
D E
(k)
= f , Hn−k .
N
Proof. See [71] for Mehler’s theorem in the form stated here.
Yℓj , ℓ = 0, 1, 2, . . . , 1 ≤ j ≤ ν(ℓ)
of degree ℓ and order j are an orthonormal basis on the sphere L2 (Sd−1 ), com-
parable to Fourier bases for periodic functions. For any f ∈ L2 (Sd−1 ), we
57
D E
denote by fˆℓj = f, Yℓj the corresponding basis coefficient. The Sobolev space
H α (Sd−1 ) consists of all function for which the norm
ν(ℓ)
∞ X 2α
X 2
∥f ∥2H α (Sd−1 ) = 1 + ℓ1/2 (ℓ + d − 2)1/2 fˆℓj
ℓ=0 j=1
set Z
At (f )(x) := − f (τ ) dτ.
C(x,t)
With Z π
2
Sα (f )2 (x) := |At f (x) − f (x)| t−2α−1 dt
0
the Sobolev norm on the sphere is equivalent to
Using the definition (52) for a < b < c, the interpolation inequality
c−b b−a
c−a c−a
∥ · ∥H b (Sd−1 ) ≲ ∥ · ∥H a (Sd−1 ) ∥ · ∥H c (Sd−1 ) , ⟨·, ·⟩−α ≲ ∥ · ∥−3α ∥ · ∥α , (54)
Lemma 6.15. Let 0 < α < 1. Then for any ϵ > 0 with α + ϵ ≤ 1, we have
58
6.4.2 Kernel Bounds
In this section, we provide bounds for the kernel integral
ZZ
⟨f, g⟩k := f (x)k(x, y)g(y) dx dy
D×D
on the sphere D = Sd−1 in Sobolev norms on the sphere. Clearly, for 0 ≤ α, β <
2, we have
Z
⟨f, g⟩k ≤ ∥f ∥H −α k(·, y)g(y) dy ≤ ∥f ∥H −α ∥k∥H α ←H −β ∥g∥H −β ,
D Hα
where the norm of k is the induced operator norm. While the norms for f and
g are the ones used in the convergence analysis, concentration and perturbation
results for k are computed in mixed Hölder norms. We show in this section,
that these bound the operator norm.
Indeed, ⟨f, g⟩k is a bilinear form on f and g and thus is bounded by the
tensor product norms
⟨f, g⟩k ≤ ∥f ⊗ g∥(H α ⊗H β )′ ∥k∥H α ⊗H β ≤ ∥f ∥H −α ∥g∥H −β ∥k∥H α ⊗H β ,
where ·′ denotes the dual norm. The H α ⊗H β norm contains mixed smoothness
and with Sobolev-Slobodeckij type definition (53) is easily bounded by corre-
sponding mixed Hölder regularity. In order to avoid rigorous characterization
of tensor product norms on the sphere, the following lemma shows the required
bounds directly.
Lemma 6.16. Let 0 < α, β < 1. Then for any ϵ > 0 with α + ϵ ≤ 1 and
β + ϵ < 1, we have
ZZ
f (x)k(x, y)g(y) dx dy ≤ ∥f ∥H −α (Sd−1 ) ∥g∥H −β (Sd−1 ) ∥k∥C 0;α+ϵ,β+ϵ (Sd−1 ) .
D×D
so that it remains to estimate the last term. Plugging in definition (53) of the
Sobolev norm, we obtain
Z 2 Z Z π Z 2
k(·, y)g(y) = (Axt − I) k(·, y)g(y) dy (x) t−2α−1 dt dx,
D Hα D 0 D
59
where Axt is the average in (53) applied to the x variable only and I the identity.
Swapping the inner integral with the one inside the definition of Axt , we estimate
Z 2 Z Z π Z 2
k(·, y)g(y) = [(Axt − I)(k(·, y))(x)] g(y) dy t−2α−1 dt dx,
D Hα D 0 D
Z Z π
2
≤ ∥(Axt − I)(k(·, y))(x)∥H β ∥g∥2H −β t−2α−1 dt dx,
ZD
Z 0
ZZ π
2
= |(Ays − I)(Axt − I)(k)(x, y)| t−2α−1 s−2β−1 dst dxy ∥g∥2H −β .
D×D 0
Plugging in the definition of the averages Ays and Axt , the integrand is estimated
by the mixed Hölder norm
Z Z
|(Ays − I)(Axt − I)(k)(x, y)| = − − |k(τ, σ) − k(x, σ) − k(τ, y) + k(x, y)| dτ σ
C(y,s) C(x,t)
Z Z
≤− − |x − τ |α+ϵ |y − σ|β+ϵ ∥k∥C 0;α+ϵ,β+ϵ dτ σ.
C(y,s) C(x,t)
The difference |x − τ |, and likewise |y − σ|, is bounded by the angle of the cap
C(x, t). Indeed
|x − τ | ≲ min{t, T }, |y − σ| ≲ min{s, T }.
It follows that
|(Ays − I)(Axt − I)(k)(x, y)| ≲ min{t, T }α+ϵ min{s, T }β+ϵ ∥k∥C 0;α+ϵ,β+ϵ .
60
6.4.3 NTK on the Sphere
This section fills in the proofs for Section 3. Recall that we denote the normal
NTK used in [9, 22, 11] by
X
Θ(x, y) = lim ∂λ f L+1 (x)∂λ f L+1 (y),
width→∞
λ
whereas the NTK Γ(x, y) used in this paper confines the sum to |λ| = L − 1,
i.e. the second but last layer, see Section 3. We first show that the reproducing
kernel Hilbert space (RKHS) of the NTK is a Sobolev space.
Lemma 6.17. Let Θ(x, y) be the neural tangent kernel for a fully connected
neural network on the sphere Sd−1 with bias and ReLU activation. Then the
corresponding RKHS HΘ is the Sobolev space H^{d/2}(Sd−1) with equivalent norms
∥ · ∥_{HΘ} ∼ ∥ · ∥_{H^{d/2}}.
Proof. By [11, Theorem 1] the RKHS HΘ is the same as the RKHS HLap of the
Laplace kernel
$$
  k(x, y) = e^{-\|x - y\|} .
$$
An inspection of their proof reveals that these spaces have equivalent norms.
By [22, Theorem 2], the Laplace kernel has the same eigenfunctions as the NTK
(both are spherical harmonics) and eigenvalues
$$
  \lambda_{\ell j} \sim (\ell + 1)^{-d}
$$
for all eigenvalues. With Mercer's theorem and the definition (52) of Sobolev
norms, we conclude that
$$
  \|f\|_{H_\Theta}^2
  \sim \|f\|_{\mathrm{Lap}}^2
  = \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}^{-1}\, |\hat f_{\ell j}|^2
  \sim \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{d}\, |\hat f_{\ell j}|^2
  \sim \|f\|_{H^{d/2}(S^{d-1})}^2 .
$$
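The step from the kernel eigenvalues to the RKHS norm uses the Mercer representation in its standard form: if a kernel expands in spherical harmonics with positive eigenvalues, its RKHS norm is the correspondingly weighted coefficient sum,
$$
  k(x, y) = \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}\, Y_{\ell j}(x)\, Y_{\ell j}(y)
  \qquad \Longrightarrow \qquad
  \|f\|_{H_k}^2 = \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}^{-1}\, |\hat f_{\ell j}|^2 ,
$$
applied here to the Laplace kernel.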
Lemma 6.18. Let Θ(x, y) be the neural tangent kernel for a fully connected
neural network on the sphere Sd−1 with bias and ReLU activation. Its eigen-
functions are spherical harmonics with eigenvalues
$$
  \lambda_{\ell j} \sim (\ell + 1)^{-d} .
$$
Proof. This follows directly from the norm equivalence ∥ · ∥_{HΘ} ∼ ∥ · ∥_{H^{d/2}} in
Lemma 6.17 and the Mercer theorem representation of the RKHS
$$
  \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} \lambda_{\ell j}^{-1}\, |\hat f_{\ell j}|^2
  = \|f\|_{H_\Theta}^2
  \sim \|f\|_{H^{d/2}(S^{d-1})}^2
  \sim \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{d}\, |\hat f_{\ell j}|^2 .
$$
For the integral operator LΘ associated with Θ, the same expansion yields
$$
\begin{aligned}
  \langle f, L_\Theta f \rangle_{H^\alpha(S^{d-1})}
  &= \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{2\alpha}\, \hat f_{\ell j}\, \widehat{L_\Theta f}_{\ell j} \\
  &= \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{2\alpha}\, \lambda_{\ell j}\, |\hat f_{\ell j}|^2 .
\end{aligned}
$$
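This identity relates the quadratic form of LΘ to a Sobolev norm of lower order; for instance, if λℓj ∼ (ℓ + 1)^{−d} as in Lemma 6.18, then up to constants
$$
  \langle f, L_\Theta f \rangle_{H^\alpha(S^{d-1})}
  \sim \sum_{\ell=0}^{\infty} \sum_{j=1}^{\nu(\ell)} (\ell + 1)^{2\alpha - d}\, |\hat f_{\ell j}|^2
  \sim \|f\|_{H^{\alpha - d/2}(S^{d-1})}^2 .
$$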
References
[1] B. Adcock and N. Dexter. The gap between theory and practice in function
approximation with deep neural networks. SIAM Journal on Mathematics
of Data Science, 3(2):624–655, 2021.
[2] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning
via over-parameterization. In K. Chaudhuri and R. Salakhutdinov, editors,
Proceedings of the 36th International Conference on Machine Learning, vol-
ume 97 of Proceedings of Machine Learning Research, page 242–252, Long
Beach, California, USA, 09–15 Jun 2019. PMLR. Full version available at
https://arxiv.org/abs/1811.03962.
[3] S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of op-
timization and generalization for overparameterized two-layer neural net-
works. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the
36th International Conference on Machine Learning, volume 97 of Proceed-
ings of Machine Learning Research, page 322–332, Long Beach, California,
USA, 09–15 Jun 2019. PMLR.
[4] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang.
On exact computation with an infinitely wide neural net. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 32.
Curran Associates, Inc., 2019.
[5] F. Bach. Breaking the curse of dimensionality with convex neural networks.
Journal of Machine Learning Research, 18(19):1–53, 2017.
[6] Y. Bai and J. D. Lee. Beyond linearization: On quadratic and higher-order
approximation of wide neural networks. In International Conference on
Learning Representations, 2020.
[7] J. A. Barceló, T. Luque, and S. Pérez-Esteva. Characterization of Sobolev
spaces on the sphere. Journal of Mathematical Analysis and Applications,
491(1):124240, 2020.
[8] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathe-
matics of deep learning, 2021. https://arxiv.org/abs/2105.04026.
[9] A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.
[10] G. Bresler and D. Nagaraj. Sharp representation theorems for ReLU net-
works with precise dependence on depth. In H. Larochelle, M. Ranzato,
R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Informa-
tion Processing Systems, volume 33, page 10697–10706. Curran Associates,
Inc., 2020.
[11] L. Chen and S. Xu. Deep neural tangent kernel and Laplace kernel have
the same RKHS. In International Conference on Learning Representations,
2021.
[12] Z. Chen, Y. Cao, D. Zou, and Q. Gu. How much over-parameterization is
sufficient to learn deep ReLU networks? In International Conference on
Learning Representations, 2021.
[13] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable
programming. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 32. Curran Associates, Inc., 2019.
[14] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear
Approximation and (Deep) ReLU Networks. Constructive Approximation,
55(1):127–172, Feb. 2022.
[15] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta
Numerica, 30:327–444, 2021.
[16] S. Dirksen. Tail bounds via generic chaining. Electronic Journal of Proba-
bility, 20:1 – 29, 2015.
[17] S. Drews and M. Kohler. On the universal consistency of an over-
parametrized deep neural network estimate learned by gradient descent,
2022.
[18] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global
minima of deep neural networks. In K. Chaudhuri and R. Salakhutdi-
nov, editors, Proceedings of the 36th International Conference on Machine
Learning, volume 97 of Proceedings of Machine Learning Research, page
1675–1685, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[19] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably
optimizes over-parameterized neural networks. In International Conference
on Learning Representations, 2019.
[20] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural
network approximation theory. IEEE Transactions on Information Theory,
67(5):2581–2623, 2021.
[21] S. Fort, G. K. Dziugaite, M. Paul, S. Kharaghani, D. M. Roy, and S. Gan-
guli. Deep learning versus kernel learning: an empirical study of loss
landscape geometry and the time evolution of the neural tangent kernel.
In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, edi-
tors, Advances in Neural Information Processing Systems, volume 33, page
5850–5861. Curran Associates, Inc., 2020.
[22] A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and B. Ro-
nen. On the similarity between the Laplace and neural tangent kernels.
In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, edi-
tors, Advances in Neural Information Processing Systems, volume 33, page
1451–1461. Curran Associates, Inc., 2020.
[23] M. Geiger, A. Jacot, S. Spigler, F. Gabriel, L. Sagun, S. d’Ascoli, G. Biroli,
C. Hongler, and M. Wyart. Scaling description of generalization with num-
ber of parameters in deep learning. CoRR, abs/1901.01608, 2019.
[24] R. Gentile and G. Welper. Approximation results for gradient descent
trained shallow neural networks in 1d, 2022. https://arxiv.org/abs/
2209.08399.
[25] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approx-
imation Spaces of Deep Neural Networks. Constructive Approximation,
55(1):259–367, Feb. 2022.
[26] P. Grohs and F. Voigtlaender. Proof of the theory-to-practice gap in deep
learning via sampling complexity bounds for neural network approximation
spaces, 2021. https://arxiv.org/abs/2104.02746.
[27] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations
with deep ReLU neural networks in W^{s,p} norms. Analysis and Applications,
18(05):803–859, 2020.
[28] B. Hanin and M. Nica. Finite depth and width corrections to the neural
tangent kernel. In International Conference on Learning Representations,
2020.
[29] W. Hao, X. Jin, J. W. Siegel, and J. Xu. An efficient greedy training
algorithm for neural networks and applications in PDEs, 2021. https:
//arxiv.org/abs/2107.04466.
[30] L. Herrmann, J. A. A. Opschoor, and C. Schwab. Constructive deep ReLU
neural network approximation. Journal of Scientific Computing, 90(2):75,
2022.
[31] S. Ibragimov, A. Jentzen, and A. Riekert. Convergence to good non-optimal
critical points in the training of neural networks: Gradient descent opti-
mization with one random initialization overcomes all bad non-global lo-
cal minima with high probability, 2022. https://arxiv.org/abs/2212.
13111.
[32] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Conver-
gence and generalization in neural networks. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 31. Curran
Associates, Inc., 2018.
[33] A. Jentzen and A. Riekert. A proof of convergence for the gradient descent
optimization method with random initializations in the training of neural
networks with ReLU activation for piecewise linear target functions. Journal
of Machine Learning Research, 23(260):1–50, 2022.
[34] Z. Ji and M. Telgarsky. Polylogarithmic width suffices for gradient descent
to achieve arbitrarily small test error with shallow ReLU networks. In
International Conference on Learning Representations, 2020.
[35] Z. Ji, M. Telgarsky, and R. Xian. Neural tangent kernels, transportation
mappings, and universal approximation. In International Conference on
Learning Representations, 2020.
[36] K. Kawaguchi and J. Huang. Gradient descent finds global minima for
generalizable deep neural networks of practical sizes. In 2019 57th Annual
Allerton Conference on Communication, Control, and Computing (Aller-
ton), page 92–99, 2019.
[37] J. M. Klusowski and A. R. Barron. Approximation by combinations of
ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE
Transactions on Information Theory, 64(12):7649–7656, 2018.
[38] M. Kohler and A. Krzyzak. Analysis of the rate of convergence of an over-
parametrized deep neural network estimate learned by gradient descent,
2022.
[46] Z. Li, C. Ma, and L. Wu. Complexity measures for neural networks
with general activation functions using path-based norms, 2020. https:
//arxiv.org/abs/2009.06132.
[47] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approxima-
tion for smooth functions. SIAM Journal on Mathematical Analysis,
53(5):5465–5506, 2021.
[48] C. Marcati, J. A. A. Opschoor, P. C. Petersen, and C. Schwab. Exponential
ReLU neural network approximation rates for point and edge singularities.
Foundations of Computational Mathematics, 2022.
[55] M. Seleznova and G. Kutyniok. Neural tangent kernel beyond the infinite-
width limit: Effects of depth and initialization. In K. Chaudhuri, S. Jegelka,
L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the
39th International Conference on Machine Learning, volume 162 of Pro-
ceedings of Machine Learning Research, page 19522–19560. PMLR, 17–23
Jul 2022.
[56] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation
properties for deep neural networks. Applied and Computational Harmonic
Analysis, 44(3):537–557, 2018.
[57] Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions.
Neural Networks, 119:74–84, 2019.
[58] J. W. Siegel and J. Xu. Approximation rates for neural networks with
general activation functions. Neural Networks, 128:313–321, 2020.
[59] J. W. Siegel and J. Xu. High-order approximation rates for shallow neural
networks with cosine and ReLU^k activation functions. Applied and Com-
putational Harmonic Analysis, 58:1–26, 2022.
[60] J. W. Siegel and J. Xu. Optimal convergence rates for the orthogonal greedy
algorithm. IEEE Transactions on Information Theory, 68(5):3354–3361,
2022.
[61] C. Song, A. Ramezani-Kebrya, T. Pethick, A. Eftekhari, and V. Cevher.
Subquadratic overparameterization for shallow neural networks. In M. Ran-
zato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, edi-
tors, Advances in Neural Information Processing Systems, volume 34, page
11247–11259. Curran Associates, Inc., 2021.
[62] Z. Song and X. Yang. Quadratic suffices for over-parametrization via matrix
Chernoff bound, 2019. https://arxiv.org/abs/1906.03593.
[63] L. Su and P. Yang. On learning over-parameterized neural networks:
A functional approximation perspective. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances
in Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019.
[64] T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and
mixed smooth Besov spaces: optimal rate and curse of dimensionality. In
International Conference on Learning Representations, 2019.
[65] M. Velikanov and D. Yarotsky. Universal scaling laws in the gradient de-
scent training of neural networks, 2021. https://arxiv.org/abs/2105.
00507.
[66] M. Velikanov and D. Yarotsky. Tight convergence rate bounds for opti-
mization under power law spectral conditions, 2022. https://arxiv.org/
abs/2202.00992.
[67] R. Vershynin. High-dimensional probability: an introduction with applica-
tions in data science. Number 47 in Cambridge series in statistical and
probabilistic mathematics. Cambridge University Press, Cambridge ; New
York, NY, 2018.
[68] N. Vyas, Y. Bansal, and P. Nakkiran. Limitations of the NTK for under-
standing generalization in deep learning, 2022.
[69] E. Weinan, M. Chao, W. Lei, and S. Wojtowytsch. Towards a mathemat-
ical understanding of neural network-based machine learning: What we
know and what we don’t. CSIAM Transactions on Applied Mathematics,
1(4):561–615, 2020.
[70] E. Weinan, C. Ma, and L. Wu. The Barron Space and the Flow-Induced
Function Spaces for Neural Network Models. Constructive Approximation,
55(1):369–406, Feb. 2022.
[71] C. S. Withers and S. Nadarajah. Expansions for the multivariate normal.
Journal of Multivariate Analysis, 101(5):1311–1316, 2010.
[72] D. Yarotsky. Error bounds for approximations with deep ReLU networks.
Neural Networks, 94:103–114, 2017.
[73] D. Yarotsky. Optimal approximation of continuous functions by very deep
ReLU networks. In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceed-
ings of the 31st Conference On Learning Theory, volume 75 of Proceedings
of Machine Learning Research, page 639–649. PMLR, 06–09 Jul 2018.
[74] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation
rates for deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell,
M. Balcan, and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, page 13005–13015. Curran Associates, Inc., 2020.
[75] D. Zou, Y. Cao, D. Zhou, and Q. Gu. Gradient descent optimizes over-
parameterized deep ReLU networks. Machine Learning, 109(3):467 – 492,
2020.
[76] D. Zou and Q. Gu. An improved analysis of training over-parameterized
deep neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural In-
formation Processing Systems, volume 32. Curran Associates, Inc., 2019.