Deep Neural Network Approximation Theory
Abstract
This paper develops fundamental limits of deep neural network learning by characterizing what is possible if
no constraints are imposed on the learning algorithm and on the amount of training data. Concretely, we consider
Kolmogorov-optimal approximation through deep neural networks with the guiding theme being a relation between
the complexity of the function (class) to be approximated and the complexity of the approximating network in terms
of connectivity and memory requirements for storing the network topology and the associated quantized weights.
The theory we develop establishes that deep networks are Kolmogorov-optimal approximants for markedly different
function classes, such as unit balls in Besov spaces and modulation spaces. In addition, deep networks provide
exponential approximation accuracy—i.e., the approximation error decays exponentially in the number of nonzero
weights in the network—of the multiplication operation, polynomials, sinusoidal functions, and certain smooth
functions. Moreover, this holds true even for one-dimensional oscillatory textures and the Weierstrass function, a fractal function, for neither of which a method achieving exponential approximation accuracy was previously known.
We also show that in the approximation of sufficiently smooth functions finite-width deep networks require strictly
smaller connectivity than finite-depth wide networks.
I. INTRODUCTION
Triggered by the availability of vast amounts of training data and drastic improvements in computing power,
deep neural networks have become state-of-the-art technology for a wide range of practical machine learning tasks
such as image classification [1], handwritten digit recognition [2], speech recognition [3], or game intelligence [4].
For an in-depth overview, we refer to the survey paper [5] and the recent book [6].
A neural network effectively implements a mapping approximating a function that is learned based on a given
set of input-output value pairs, typically through the backpropagation algorithm [7]. Characterizing the fundamental
limits of approximation through neural networks shows what is possible if no constraints are imposed on the learning
algorithm and on the amount of training data [8].
D. Elbrächter is with the Department of Mathematics, University of Vienna, Austria (e-mail: [email protected]).
D. Perekrestenko and H. Bölcskei are with the Chair for Mathematical Information Science, ETH Zurich, Switzerland (e-mail:
[email protected], [email protected]).
P. Grohs is with the Department of Mathematics and the Research Platform DataScience@UniVienna, University of Vienna, Austria (e-mail:
[email protected]).
D. Elbrächter was supported through the FWF projects P 30148 and I 3403 as well as the WWTF project ICT19-041.
The theory of function approximation through neural networks has a long history dating back to the work by
McCulloch and Pitts [9] and the seminal paper by Kolmogorov [10], who showed, when interpreted in neural
network parlance, that any continuous function of n variables can be represented exactly through a 2-layer neural
network of width 2n + 1. However, the nonlinearities in Kolmogorov’s neural network are highly nonsmooth
and the outer nonlinearities, i.e., those in the output layer, depend on the function to be represented. In modern
neural network theory, one is usually interested in networks with nonlinearities that are independent of the function
to be realized and exhibit, in addition, certain smoothness properties. Significant progress in understanding the
approximation capabilities of such networks has been made in [11], [12], where it was shown that single-hidden-
layer neural networks can approximate continuous functions on bounded domains arbitrarily well, provided that the
activation function satisfies certain (mild) conditions and the number of nodes is allowed to grow arbitrarily large.
In practice one is, however, often interested in approximating functions from a given function class C determined by
the application at hand. It is therefore natural to ask how the complexity of a neural network approximating every
function in C to within a prescribed accuracy depends on the complexity of C (and on the desired approximation
accuracy). The recently developed Kolmogorov-Donoho rate-distortion theory for neural networks [13] formalizes
this question by relating the complexity of C—in terms of the number of bits needed to describe any element in C to
within prescribed accuracy—to network complexity in terms of connectivity and memory requirements for storing
the network topology and the associated quantized weights. The theory is based on a framework for quantifying
the fundamental limits of nonlinear approximation through dictionaries as introduced by Donoho [14], [15].
The purpose of this paper is to provide a comprehensive, principled, and self-contained introduction to Kolmogorov-
Donoho rate-distortion optimal approximation through deep neural networks. The idea is to equip the reader with
a working knowledge of the mathematical tools underlying the theory at a level that is sufficiently deep to enable
further research in the field. Part of this paper is based on [13], but extends the theory therein to the rectified linear
unit (ReLU) activation function and to networks with depth scaling in the approximation error.
The theory we develop educes remarkable universality properties of finite-width deep networks. Specifically, deep
networks are Kolmogorov-Donoho optimal approximants for vastly different function classes such as unit balls in
Besov spaces [16] and modulation spaces [17]. This universality is afforded by a concurrent invariance property
of deep networks to time-shifts, scalings, and frequency-shifts. In addition, deep networks provide exponential
approximation accuracy—i.e., the approximation error decays exponentially in the number of parameters employed
in the approximant, namely the number of nonzero weights in the network—for vastly different functions such
as the squaring operation, multiplication, polynomials, sinusoidal functions, general smooth functions, and even
one-dimensional oscillatory textures [18] and the Weierstrass function, a fractal function, for neither of which a method achieving exponential approximation accuracy was previously known.
While we consider networks based on the ReLU1 activation function throughout, certain parts of our theory carry
over to strongly sigmoidal activation functions of order k ≥ 2 as defined in [13]. For the sake of conciseness, we
refrain from providing these extensions.
Outline of the paper. In Section II, we introduce notation, formally define neural networks, and record basic
elements needed in the neural network constructions throughout the paper. Section III presents an algebra of
function approximation by neural networks. In Section IV, we develop the Kolmogorov-Donoho rate-distortion
framework that will allow us to characterize the fundamental limits of deep neural network learning of function
classes. This theory is based on the concept of metric entropy, which is introduced and reviewed starting from first
principles. Section V then puts the Kolmogorov-Donoho framework to work in the context of nonlinear function
approximation with dictionaries. This discussion serves as a basis for the development of the concept of best M -
weight approximation in neural networks presented in Section VI. We proceed, in Section VII, with the development
of a method—termed the transference principle—for transferring results on function approximation through dictio-
naries to results on approximation by neural networks. The purpose of Section VIII is to demonstrate that function
classes that are optimally approximated by affine dictionaries (e.g., wavelets), are optimally approximated by neural
networks as well. In Section IX, we show that this optimality transfer extends to function classes that are optimally
approximated by Weyl-Heisenberg dictionaries. Section X demonstrates that neural networks can improve the best-
known approximation rates for two example functions, namely oscillatory textures and the Weierstrass function, from
polynomial to exponential. The final Section XI makes a formal case for depth in neural network approximation by
establishing a provable benefit of deep networks over shallow networks in the approximation of sufficiently smooth
functions. The Appendices collect ancillary technical results.
Notation. For a function f : R^d → R and a set Ω ⊆ R^d, we define ‖f‖_{L∞(Ω)} := sup{|f(x)| : x ∈ Ω}. L^p(R^d) and L^p(R^d, C) denote the space of real-valued, respectively complex-valued, L^p-functions. When dealing with the approximation error for simple functions such as, e.g., (x, y) ↦ xy, we will, for brevity of exposition and with slight abuse of notation, make the arguments inside the norm explicit according to ‖f(x, y) − xy‖_{L^p(Ω)}. For a vector b ∈ R^d, we let ‖b‖_∞ := max_{i=1,...,d} |b_i|; similarly, we write ‖A‖_∞ := max_{i,j} |A_{i,j}| for the matrix A ∈ R^{m×n}. We denote the identity matrix of size n × n by I_n. log stands for the logarithm to base 2. For a set X ⊆ R^d, we write |X| for its Lebesgue measure. Constants like C are understood to be allowed to take on different values in different uses.
¹ ReLU stands for the Rectified Linear Unit nonlinearity defined as x ↦ max{0, x}.
II. SETUP AND BASIC RELU CALCULUS
This section defines neural networks, introduces the basic setup as well as further notation, and lists basic elements
needed in the neural network constructions considered throughout, namely compositions and linear combinations of
neural networks. There is a plethora of neural network architectures and activation functions in the literature. Here,
we restrict ourselves to the ReLU activation function and consider the following general network architecture.
Definition II.1. Let L ∈ N and N_0, N_1, ..., N_L ∈ N. A ReLU neural network Φ is a map Φ : R^{N_0} → R^{N_L} given by

Φ = { W_1,                                              L = 1,
      W_2 ∘ ρ ∘ W_1,                                    L = 2,      (1)
      W_L ∘ ρ ∘ W_{L−1} ∘ ρ ∘ ··· ∘ ρ ∘ W_1,            L ≥ 3,

where, for ℓ ∈ {1, 2, ..., L}, W_ℓ : R^{N_{ℓ−1}} → R^{N_ℓ}, W_ℓ(x) := A_ℓ x + b_ℓ are the associated affine transformations with matrices A_ℓ ∈ R^{N_ℓ×N_{ℓ−1}} and (bias) vectors b_ℓ ∈ R^{N_ℓ}, and the ReLU activation function ρ : R → R, ρ(x) := max(0, x) acts component-wise, i.e., ρ(x_1, ..., x_N) := (ρ(x_1), ..., ρ(x_N)). We denote by N_{d,d'} the set of all ReLU networks with input dimension N_0 = d and output dimension N_L = d'. Moreover, we define the following quantities related to the notion of size of the ReLU network Φ:
• the connectivity M(Φ) is the total number of nonzero entries in the matrices A_ℓ, ℓ ∈ {1, 2, ..., L}, and the vectors b_ℓ, ℓ ∈ {1, 2, ..., L},
• depth L(Φ) := L,
• width W(Φ) := max_{ℓ=0,...,L} N_ℓ,
• weight magnitude B(Φ) := max_{ℓ=1,...,L} max{‖A_ℓ‖_∞, ‖b_ℓ‖_∞}.
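To make the bookkeeping in Definition II.1 concrete, the following Python sketch (an illustration of ours, not part of the formal development; the list-of-(A_ℓ, b_ℓ) representation and all identifiers are assumptions made for this example) evaluates a map of the form (1) and computes M(Φ), L(Φ), W(Φ), and B(Φ).

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def evaluate(phi, x):
    # phi: list of (A_l, b_l) pairs; realizes W_L o rho o ... o rho o W_1 as in (1)
    for A, b in phi[:-1]:
        x = relu(A @ x + b)
    A, b = phi[-1]
    return A @ x + b

def connectivity(phi):      # M(Phi): nonzero entries of all A_l and b_l
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in phi)

def depth(phi):             # L(Phi): number of affine transformations
    return len(phi)

def width(phi):             # W(Phi): max_{l=0,...,L} N_l
    return max([phi[0][0].shape[1]] + [A.shape[0] for A, _ in phi])

def weight_magnitude(phi):  # B(Phi): largest absolute value of any weight
    return max(max(np.abs(A).max(), np.abs(b).max()) for A, b in phi)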
Remark II.2. Note that for a given function f : R^{N_0} → R^{N_L}, which can be expressed according to (1), the underlying affine transformations W_ℓ are highly nonunique in general [19], [20]. The question of uniqueness in this context is of independent interest and was addressed recently in [21], [22]. Whenever we talk about a given ReLU network Φ, we will either explicitly or implicitly associate Φ with a given set of affine transformations W_ℓ.
N_0 is the dimension of the input layer, indexed as the 0-th layer, N_1, ..., N_{L−1} are the dimensions of the L − 1 hidden layers, and N_L is the dimension of the output layer. Our definition of depth L(Φ) counts the number of affine transformations involved in the representation (1). Single-hidden-layer neural networks hence have depth 2 in this terminology. Finally, we consider standard affine transformations as neural networks of depth 1 for technical purposes.
The matrix entry (A_ℓ)_{i,j} represents the weight associated with the edge between the j-th node in the (ℓ−1)-th layer and the i-th node in the ℓ-th layer; (b_ℓ)_i is the weight associated with the i-th node in the ℓ-th layer. These
assignments are schematized in Figure 1. The real numbers (A_ℓ)_{i,j} and (b_ℓ)_i are referred to as the network's edge weights and node weights, respectively.
Throughout the paper, we assume that every node in the input layer and in layers 1, . . . , L − 1 has at least
one outgoing edge and every node in the output layer L has at least one incoming edge. These nondegeneracy
assumptions are basic as nodes that do not satisfy them can be removed without changing the functional relationship
realized by the network.
The term "network" stems from the interpretation of the mapping Φ as a weighted acyclic directed graph with nodes arranged in hierarchical layers and edges only between adjacent layers. Finally, we note that the connectivity satisfies M(Φ) ≤ L(Φ)W(Φ)(W(Φ) + 1).
Fig. 1: Assignment of the weights (A` )i,j and (b` )i of a two-layer network to the edges and nodes, respectively.
We mostly consider the case Φ : Rd → R, i.e., NL = 1, but emphasize that our results readily generalize to
NL > 1.
The neural network constructions provided in the paper frequently make use of basic elements introduced next,
namely compositions and linear combinations of networks [23].
Lemma II.3. Let d_1, d_2, d_3 ∈ N, Φ_1 ∈ N_{d_1,d_2}, and Φ_2 ∈ N_{d_2,d_3}. Then, there exists a network Ψ ∈ N_{d_1,d_3} with L(Ψ) = L(Φ_1) + L(Φ_2), M(Ψ) ≤ 2M(Φ_1) + 2M(Φ_2), W(Ψ) ≤ max{2d_2, W(Φ_1), W(Φ_2)}, B(Ψ) = max{B(Φ_1), B(Φ_2)}, and satisfying Ψ(x) = Φ_2(Φ_1(x)), for all x ∈ R^{d_1}.
Proof. The proof is based on the identity x = ρ(x) − ρ(−x). First, note that by Definition II.1, we can write Φ_1 = W^1_{L_1} ∘ ρ ∘ W^1_{L_1−1} ∘ ρ ∘ ··· ∘ ρ ∘ W^1_1 and Φ_2 = W^2_{L_2} ∘ ρ ∘ W^2_{L_2−1} ∘ ρ ∘ ··· ∘ ρ ∘ W^2_1. Next, let N^1_{L_1−1} denote the width of layer L_1 − 1 in Φ_1 and let N^2_1 denote the width of layer 1 in Φ_2. We define the affine transformations W̃^1_{L_1} : R^{N^1_{L_1−1}} → R^{2d_2} and W̃^2_1 : R^{2d_2} → R^{N^2_1} according to

W̃^1_{L_1}(x) := (I_{d_2}; −I_{d_2}) W^1_{L_1}(x)   and   W̃^2_1(y) := W^2_1((I_{d_2}  −I_{d_2}) y),

where (I_{d_2}; −I_{d_2}) ∈ R^{2d_2×d_2} stacks I_{d_2} on top of −I_{d_2} and (I_{d_2}  −I_{d_2}) ∈ R^{d_2×2d_2} concatenates them horizontally. The proof is finalized by noting that the network

Ψ := W^2_{L_2} ∘ ρ ∘ ··· ∘ ρ ∘ W^2_2 ∘ ρ ∘ W̃^2_1 ∘ ρ ∘ W̃^1_{L_1} ∘ ρ ∘ W^1_{L_1−1} ∘ ρ ∘ ··· ∘ ρ ∘ W^1_1

satisfies the claimed properties.
Unless explicitly stated otherwise, the composition of two neural networks will be understood in the sense of
Lemma II.3.
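The interface trick in the proof of Lemma II.3 can be mirrored in the same toy representation used above; the sketch below (ours, illustrative only) doubles the output of the first network via (I_{d_2}; −I_{d_2}) and recombines with (I_{d_2}  −I_{d_2}), so that the ReLU applied in between leaves the realized function unchanged.

import numpy as np

def compose(phi1, phi2):
    # Composition as in the proof of Lemma II.3: the interface uses x = rho(x) - rho(-x),
    # so the ReLU applied between the two networks does not alter the realized function.
    A1L, b1L = phi1[-1]      # last affine map of phi1 (output dimension d2)
    A21, b21 = phi2[0]       # first affine map of phi2 (input dimension d2)
    d2 = A1L.shape[0]
    I = np.eye(d2)
    W_out = (np.vstack([I, -I]) @ A1L, np.concatenate([b1L, -b1L]))   # maps into R^{2 d2}
    W_in = (A21 @ np.hstack([I, -I]), b21)                            # recombines rho(z), rho(-z)
    return phi1[:-1] + [W_out, W_in] + phi2[1:]

Note that the two interface layers at most double the nonzero weights of the affected layers, in line with M(Ψ) ≤ 2M(Φ_1) + 2M(Φ_2).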
In order to formalize the concept of a linear combination of networks with possibly different depths, we need
the following two technical lemmas which show how to augment network depth while retaining the network’s
input-output relation and how to parallelize networks.
Lemma II.4. Let d1 , d2 , K ∈ N, and Φ ∈ Nd1 ,d2 with L(Φ) < K. Then, there exists a network Ψ ∈ Nd1 ,d2 with
L(Ψ) = K, M(Ψ) ≤ M(Φ) + d2 W(Φ) + 2d2 (K − L(Φ)), W(Ψ) = max{2d2 , W(Φ)}, B(Ψ) = max{1, B(Φ)},
and satisfying Ψ(x) = Φ(x) for all x ∈ Rd1 .
Proof. Let W̃_j(x) := diag(I_{d_2}, I_{d_2}) x, for j ∈ {L(Φ)+1, ..., K−1}, and W̃_K(x) := (I_{d_2}  −I_{d_2}) x, and note that, with Φ = W_{L(Φ)} ∘ ρ ∘ W_{L(Φ)−1} ∘ ρ ∘ ··· ∘ ρ ∘ W_1, the network

Ψ := W̃_K ∘ ρ ∘ W̃_{K−1} ∘ ρ ∘ ··· ∘ ρ ∘ W̃_{L(Φ)+1} ∘ ρ ∘ (W_{L(Φ)}; −W_{L(Φ)}) ∘ ρ ∘ W_{L(Φ)−1} ∘ ρ ∘ ··· ∘ ρ ∘ W_1
satisfies the claimed properties.
For the sake of simplicity of exposition, we state the following two lemmas only for networks of the same depth; the extension to the general case follows by straightforward application of Lemma II.4. The first of these two lemmas
formalizes the notion of neural network parallelization, concretely of combining neural networks implementing the
functions f and g into a neural network realizing the mapping x 7→ (f (x), g(x)).
Lemma II.5. Let n, L ∈ N and, for i ∈ {1, 2, ..., n}, let d_i, d'_i ∈ N and Φ_i ∈ N_{d_i,d'_i} with L(Φ_i) = L. Then, there exists a network Ψ ∈ N_{Σ_{i=1}^n d_i, Σ_{i=1}^n d'_i} with L(Ψ) = L, M(Ψ) = Σ_{i=1}^n M(Φ_i), W(Ψ) = Σ_{i=1}^n W(Φ_i), B(Ψ) = max_i B(Φ_i), and satisfying

Ψ(x) = (Φ_1(x_1), Φ_2(x_2), ..., Φ_n(x_n)) ∈ R^{Σ_{i=1}^n d'_i},

for x = (x_1, x_2, ..., x_n) ∈ R^{Σ_{i=1}^n d_i} with x_i ∈ R^{d_i}, i ∈ {1, 2, ..., n}.

Proof. Write Φ_i = W^i_L ∘ ρ ∘ W^i_{L−1} ∘ ρ ∘ ··· ∘ ρ ∘ W^i_1, with W^i_ℓ(x) = A^i_ℓ x + b^i_ℓ. Furthermore, we denote the layer dimensions of Φ_i by N^i_0, ..., N^i_L and set N_ℓ := Σ_{i=1}^n N^i_ℓ, for ℓ ∈ {0, 1, ..., L}. Next, define, for ℓ ∈ {1, 2, ..., L}, the block-diagonal matrices A_ℓ := diag(A^1_ℓ, A^2_ℓ, ..., A^n_ℓ), the vectors b_ℓ = (b^1_ℓ, b^2_ℓ, ..., b^n_ℓ), and the affine transformations W_ℓ(x) := A_ℓ x + b_ℓ. The proof is concluded by noting that

Ψ := W_L ∘ ρ ∘ W_{L−1} ∘ ρ ∘ ··· ∘ ρ ∘ W_1

satisfies the claimed properties.
We are now ready to formalize the concept of a linear combination of neural networks.
Lemma II.6. Let n, L, d' ∈ N and, for i ∈ {1, 2, ..., n}, let d_i ∈ N, a_i ∈ R, and Φ_i ∈ N_{d_i,d'} with L(Φ_i) = L. Then, there exists a network Ψ ∈ N_{Σ_{i=1}^n d_i, d'} with L(Ψ) = L, M(Ψ) ≤ Σ_{i=1}^n M(Φ_i), W(Ψ) ≤ Σ_{i=1}^n W(Φ_i), B(Ψ) = max_i {|a_i| B(Φ_i)}, and satisfying

Ψ(x) = Σ_{i=1}^n a_i Φ_i(x_i) ∈ R^{d'},

for x = (x_1, x_2, ..., x_n) ∈ R^{Σ_{i=1}^n d_i} with x_i ∈ R^{d_i}, i ∈ {1, 2, ..., n}.

Proof. The proof follows by taking the construction in Lemma II.5, replacing A_L by (a_1 A^1_L, a_2 A^2_L, ..., a_n A^n_L) and b_L by Σ_{i=1}^n a_i b^i_L, and noting that the resulting network satisfies the claimed properties.
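In the same toy representation, Lemmas II.5 and II.6 amount to block-diagonal stacking; the following sketch (ours, assuming all networks share the same depth and using scipy only for block_diag) is one way to spell this out.

import numpy as np
from scipy.linalg import block_diag

def parallelize(nets):
    # Lemma II.5: x = (x_1, ..., x_n) -> (Phi_1(x_1), ..., Phi_n(x_n)) via block-diagonal A_l
    L = len(nets[0])                       # all networks assumed to have equal depth
    return [(block_diag(*[net[l][0] for net in nets]),
             np.concatenate([net[l][1] for net in nets])) for l in range(L)]

def linear_combination(nets, a):
    # Lemma II.6: x = (x_1, ..., x_n) -> sum_i a_i Phi_i(x_i); all Phi_i map into R^{d'}
    par = parallelize(nets)
    A_L = np.hstack([ai * net[-1][0] for ai, net in zip(a, nets)])
    b_L = sum(ai * net[-1][1] for ai, net in zip(a, nets))
    return par[:-1] + [(A_L, b_L)]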
This section constitutes the first part of the paper dealing with the approximation of basic function “templates”
through neural networks. Specifically, we shall develop an algebra of neural network approximation by starting with
the squaring function, building thereon to approximate the multiplication function, proceeding to polynomials and
general smooth functions, and ending with sinusoidal functions.
The basic element of the neural network algebra we develop is based on an approach by Yarotsky [24] and by
Schmidt-Hieber [25], both of whom, in turn, employed the “sawtooth” construction from [26].
7
We start by reviewing the sawtooth construction underlying our program. Consider the hat function g : R → [0, 1],

g(x) = 2ρ(x) − 4ρ(x − 1/2) + 2ρ(x − 1) = { 2x, if 0 ≤ x < 1/2;  2(1 − x), if 1/2 ≤ x ≤ 1;  0, else },

let g_0(x) = x, g_1(x) = g(x), and define the s-th order sawtooth function g_s as the s-fold composition of g with itself, i.e.,

g_s := g ∘ g ∘ ··· ∘ g  (s factors),  s ≥ 2.     (2)

Note that g_s can be realized by a ReLU network of width 3 and depth s + 1 according to

Φ^s_g := W_2 ∘ ρ ∘ W_g ∘ ρ ∘ ··· ∘ ρ ∘ W_g ∘ ρ ∘ W_1 = g_s  (with s − 1 copies of W_g),     (3)

with W_1(x) := (x, x − 1/2, x − 1)^T, W_2(x) := (2, −4, 2) x, and

W_g(x) := (2x_1 − 4x_2 + 2x_3, 2x_1 − 4x_2 + 2x_3 − 1/2, 2x_1 − 4x_2 + 2x_3 − 1)^T.
The following restatement of [26, Lemma 2.4] summarizes the self-similarity and symmetry properties of g_s(x) we will frequently make use of.

Lemma III.1. For s ∈ N, k ∈ {0, 1, ..., 2^{s−1} − 1}, it holds that g(2^{s−1} · − k) is supported in [k/2^{s−1}, (k+1)/2^{s−1}],

g_s(x) = Σ_{k=0}^{2^{s−1}−1} g(2^{s−1} x − k),  for x ∈ [0, 1],

and

g_s(k/2^{s−1} + x) = g_s((k+1)/2^{s−1} − x),  for x ∈ [0, 1/2^{s−1}].
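Both the realization (3) and the decomposition in Lemma III.1 are easy to check numerically; the short Python sketch below (ours, purely illustrative) does so for s = 4.

import numpy as np

def g(x):
    # hat function g(x) = 2 rho(x) - 4 rho(x - 1/2) + 2 rho(x - 1)
    r = lambda t: np.maximum(0.0, t)
    return 2 * r(x) - 4 * r(x - 0.5) + 2 * r(x - 1.0)

def g_s(x, s):
    # s-fold composition of g with itself (g_0 = identity, g_1 = g)
    for _ in range(s):
        x = g(x)
    return x

x = np.linspace(0.0, 1.0, 2001)
s = 4
lhs = g_s(x, s)
rhs = sum(g(2 ** (s - 1) * x - k) for k in range(2 ** (s - 1)))
print(np.max(np.abs(lhs - rhs)))   # numerically zero: g_s(x) = sum_k g(2^{s-1} x - k) on [0, 1]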
We are now ready to proceed with the statement of the basic building block of our neural network algebra, namely the approximation of the squaring function through deep ReLU networks.

Proposition III.2. There exists a constant C > 0 such that for all ε ∈ (0, 1/2), there is a network Φ_ε ∈ N_{1,1} with L(Φ_ε) ≤ C log(ε^{−1}), W(Φ_ε) = 3, B(Φ_ε) = 1, Φ_ε(0) = 0, satisfying

‖Φ_ε(x) − x²‖_{L∞([0,1])} ≤ ε.
Fig. 2: First three steps of approximating F (x) = x − x2 by an equispaced linear interpolation Im at 2m + 1 points.
Proof. The proof builds on two rather elementary observations. The first one concerns the linear interpolation I_m : [0, 1] → R, m ∈ N, of the function F(x) := x − x² at the points j/2^m, j ∈ {0, 1, ..., 2^m}, and in particular the self-similarity of the refinement step I_m → I_{m+1}. For every m ∈ N, the residual F − I_m is identical on each interval between two points of interpolation (see Figure 2). Concretely, let f_m : [0, 2^{−m}] → [0, 2^{−2m−2}] be defined as f_m(x) = 2^{−m} x − x² and consider its linear interpolation h_m : [0, 2^{−m}] → [0, 2^{−2m−2}] at the midpoint and the endpoints of the interval [0, 2^{−m}] given by

h_m(x) := { 2^{−m−1} x,  x ∈ [0, 2^{−m−1}];  −2^{−m−1} x + 2^{−2m−1},  x ∈ [2^{−m−1}, 2^{−m}] }.

Direct calculation shows that

f_m(x) − h_m(x) = { f_{m+1}(x),  x ∈ [0, 2^{−m−1}];  f_{m+1}(x − 2^{−m−1}),  x ∈ [2^{−m−1}, 2^{−m}] }.

As F = f_0 and I_1 = h_0, this implies that, for all m ∈ N,

F(x) − I_m(x) = f_m(x − j/2^m),  for x ∈ [j/2^m, (j+1)/2^m],  j ∈ {0, 1, ..., 2^m − 1},

and I_m = Σ_{k=0}^{m−1} H_k, where H_k : [0, 1] → R is given by

H_k(x) = h_k(x − j/2^k),  for x ∈ [j/2^k, (j+1)/2^k],  j ∈ {0, 1, ..., 2^k − 1}.
Thus, we have

sup_{x∈[0,1]} |x² − (x − I_m(x))| = sup_{x∈[0,1]} |F(x) − I_m(x)| = sup_{x∈[0,2^{−m}]} |f_m(x)| = 2^{−2m−2}.     (4)
The second observation we build on is a manifestation of the sawtooth construction described above and leads to economic realizations of the H_k through k-layer networks with two neurons in each layer; a third neuron is used to realize the approximation x − I_m(x) to x². Concretely, let s_k(x) := 2^{−1}ρ(x) − ρ(x − 2^{−2k−1}), and note that, for x ∈ [0, 1], H_0 = s_0 and H_k = s_k ∘ H_{k−1}. We can thus construct a network realizing x − I_m(x), for x ∈ [0, 1], as follows. Let A_1 := (1, 1, 1)^T ∈ R^{3×1}, b_1 := (0, −2^{−1}, 0)^T ∈ R^3,

A_ℓ := [ 2^{−1}  −1  0 ;  2^{−1}  −1  0 ;  −2^{−1}  1  1 ] ∈ R^{3×3},  b_ℓ := (0, −2^{−2ℓ+1}, 0)^T ∈ R^3,  for ℓ ∈ {2, ..., m},

and A_{m+1} := (−2^{−1}, 1, 1) ∈ R^{1×3}, b_{m+1} = 0. Setting W_ℓ(x) := A_ℓ x + b_ℓ, ℓ ∈ {1, 2, ..., m + 1}, and

Φ̃_m := W_{m+1} ∘ ρ ∘ W_m ∘ ρ ∘ ··· ∘ ρ ∘ W_1,

a direct calculation yields Φ̃_m(x) = x − Σ_{k=0}^{m−1} H_k(x), for x ∈ [0, 1]. The proof is completed upon noting that the networks Φ_ε := Φ̃_{⌈log(ε^{−1})/2⌉} satisfy the claimed properties.
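For concreteness, the construction just described can be assembled and tested numerically; the sketch below (ours, illustrative) builds the matrices A_ℓ, b_ℓ from the proof and confirms that the measured error matches (4).

import numpy as np

def squaring_network(m):
    # the matrices A_l, b_l of Phi_tilde_m from the proof above
    phi = [(np.ones((3, 1)), np.array([0.0, -0.5, 0.0]))]
    for l in range(2, m + 1):
        A = np.array([[0.5, -1.0, 0.0], [0.5, -1.0, 0.0], [-0.5, 1.0, 1.0]])
        b = np.array([0.0, -2.0 ** (-2 * l + 1), 0.0])
        phi.append((A, b))
    phi.append((np.array([[-0.5, 1.0, 1.0]]), np.zeros(1)))
    return phi

def evaluate(phi, x):
    for A, b in phi[:-1]:
        x = np.maximum(0.0, A @ x + b)
    A, b = phi[-1]
    return (A @ x + b)[0]

m = 6
xs = np.linspace(0.0, 1.0, 20001)
err = max(abs(evaluate(squaring_network(m), np.array([t])) - t ** 2) for t in xs)
print(err, 2.0 ** (-2 * m - 2))   # the measured error matches the bound 2^{-2m-2} in (4)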
The symmetry properties of gs (x) according to Lemma III.1 lead to the interpolation error in the proof of
Proposition III.2 being identical in each interval, with the maximum error taken on at the centers of the respective
intervals. More importantly, however, the approximating neural networks realize linear interpolation at a number
of points that grows exponentially in network depth. This is a manifestation of the fact that the number of linear
regions in the sawtooth construction (3) grows exponentially with depth, which, owing to Lemma XI.1, is optimal.
We emphasize that the theory developed in this paper hinges critically on this optimality property, which, however,
is brittle in the sense that networks with weights obtained through training will, as observed in [27], in general, not
exhibit exponential growth of the number of linear regions with network depth. An interesting approach to neural
network training which manages to partially circumvent this problem was proposed recently in [28]. Understanding
how the number of linear regions grows in general trained networks and quantifying the impact of this—possibly
subexponential—growth behavior on the approximation-theoretic fundamental limits of neural networks constitutes
a major open problem.
We proceed to the construction of networks that approximate the multiplication function over the interval [−D, D]. This will be effected by using the result on the approximation of x² just established combined with the polarization identity xy = (1/4)((x + y)² − (x − y)²), the fact that ρ(x) + ρ(−x) = |x|, and a scaling argument exploiting that the ReLU function is positive homogeneous, i.e., ρ(λx) = λρ(x), for all λ ≥ 0, x ∈ R.
Proposition III.3. There exists a constant C > 0 such that, for all D ∈ R_+ and ε ∈ (0, 1/2), there is a network Φ_{D,ε} ∈ N_{2,1} with L(Φ_{D,ε}) ≤ C(log(⌈D⌉) + log(ε^{−1})), W(Φ_{D,ε}) ≤ 5, B(Φ_{D,ε}) = 1, satisfying Φ_{D,ε}(0, x) = Φ_{D,ε}(x, 0) = 0, for all x ∈ R, and

‖Φ_{D,ε}(x, y) − xy‖_{L∞([−D,D]²)} ≤ ε.     (5)
Proof. We first note that, w.l.o.g., we can assume D ≥ 1 in the following, as for D < 1, we can simply employ the network constructed for D = 1 to guarantee the claimed properties. The proof builds on the polarization identity and essentially constructs two squaring networks according to Proposition III.2 which share the neuron responsible for summing up the H_k, preceded by a layer mapping (x, y) to (|x + y|/(2D), |x − y|/(2D)) and followed by layers realizing the multiplication by D² through weights bounded by 1. Specifically, consider the network Ψ̃_m with associated matrices A_ℓ and vectors b_ℓ given by

A_1 := (1/(2D)) [ 1  1 ;  −1  −1 ;  1  −1 ;  −1  1 ] ∈ R^{4×2},  b_1 := 0 ∈ R^4,

A_2 := [ 1  1  0  0 ;  1  1  0  0 ;  1  1  −1  −1 ;  0  0  1  1 ;  0  0  1  1 ] ∈ R^{5×4},  b_2 := (0, −2^{−1}, 0, 0, −2^{−1})^T ∈ R^5,

A_ℓ := [ 2^{−1}  −1  0  0  0 ;  2^{−1}  −1  0  0  0 ;  −2^{−1}  1  1  2^{−1}  −1 ;  0  0  0  2^{−1}  −1 ;  0  0  0  2^{−1}  −1 ] ∈ R^{5×5},  b_ℓ := (0, −2^{−2ℓ+3}, 0, 0, −2^{−2ℓ+3})^T ∈ R^5,  for ℓ ∈ {3, ..., m + 1},
and A_{m+2} := (−2^{−1}, 1, 1, 2^{−1}, −1) ∈ R^{1×5}, b_{m+2} := 0. A direct calculation yields

Ψ̃_m(x, y) = [ |x+y|/(2D) − Σ_{k=0}^{m−1} H_k(|x+y|/(2D)) ] − [ |x−y|/(2D) − Σ_{k=0}^{m−1} H_k(|x−y|/(2D)) ]
           = Φ̃_m(|x+y|/(2D)) − Φ̃_m(|x−y|/(2D)),     (6)

with H_k and Φ̃_m as defined in the proof of Proposition III.2. With (4) this implies

sup_{(x,y)∈[−D,D]²} |Ψ̃_m(x, y) − xy/D²|
  = sup_{(x,y)∈[−D,D]²} |Φ̃_m(|x+y|/(2D)) − Φ̃_m(|x−y|/(2D)) − ((|x+y|/(2D))² − (|x−y|/(2D))²)|
  ≤ 2 sup_{z∈[0,1]} |Φ̃_m(z) − z²| ≤ 2^{−2m−1}.     (7)

Next, let Ψ_D(x) = D² x be the scalar multiplication network according to Lemma A.1 and take Φ_{D,ε} := Ψ_D ∘ Ψ̃_{m(D,ε)}, where m(D, ε) := ⌈2^{−1}(1 + log(D² ε^{−1}))⌉. Then, the error estimate (5) follows directly from (7), and Lemma II.3 establishes the desired bounds on depth, width, and weight magnitude. Finally, Φ_{D,ε}(0, x) = Φ_{D,ε}(x, 0) = 0, for all x ∈ R, follows directly from (6).
Remark III.4. Note that the multiplication network just constructed has weights bounded by 1, irrespective of the size D of the domain. This is accomplished by trading network depth for weight magnitude according to Lemma A.1.
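The mechanism of Proposition III.3 can also be emulated functionally; the sketch below (ours) applies the squaring approximation of Proposition III.2 to |x ± y|/(2D) and rescales by D², rather than assembling the shared-neuron weight matrices of the proof, so it illustrates the error behavior (7) but not the weight-magnitude bookkeeping.

import numpy as np

def sq_approx(z, m):
    # functional form of Phi_tilde_m from Proposition III.2: z - sum_k H_k(z) on [0, 1]
    h, acc = z, z
    for k in range(m):
        h = 0.5 * np.maximum(0.0, h) - np.maximum(0.0, h - 2.0 ** (-2 * k - 1))   # H_k = s_k(H_{k-1})
        acc = acc - h
    return acc

def mult_approx(x, y, D, m):
    # polarization: xy = D^2 [ (|x+y|/(2D))^2 - (|x-y|/(2D))^2 ]
    z1, z2 = np.abs(x + y) / (2 * D), np.abs(x - y) / (2 * D)
    return D ** 2 * (sq_approx(z1, m) - sq_approx(z2, m))

D, m = 3.0, 10
xs = np.linspace(-D, D, 201)
X, Y = np.meshgrid(xs, xs)
err = np.max(np.abs(mult_approx(X, Y, D, m) - X * Y))
print(err, D ** 2 * 2.0 ** (-2 * m - 1))   # the error respects the bound D^2 2^{-2m-1}, cf. (7)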
We proceed to the approximation of polynomials, effected by networks that realize linear combinations of
monomials, which, in turn, are built by composing multiplication networks. Before presenting the specifics of
this construction, we hasten to add that a similar approach was considered previously in [24] and [25]. While there
are slight differences in formulation, the main distinction between our construction and those in [24] and [25] resides
in their purpose. Specifically, the goal in [24] and [25] is to establish, by way of local Taylor-series approximation,
that d-variate, k-times (weakly) differentiable functions can be approximated in L∞ -norm to within error ε with
networks of connectivity scaling according to ε−d/k log(ε−1 ). Here, on the other hand, we will be interested in
functions that allow approximation with networks of connectivity scaling polylogarithmically in ε−1 (i.e., as a
polynomial in log(ε−1 )). Moreover, for ease of exposition, we will employ finite-width networks. Polylogarithmic
connectivity scaling will turn out to be crucial (see Sections VI-IX) in establishing Kolmogorov-Donoho rate-
distortion optimality of neural networks in the approximation of a variety of prominent function classes. Finally, we
would like to mention related recent work [29], [30], [31] on the approximation of Sobolev-class functions in certain
Sobolev norms enabled by neural network approximations of the multiplication operation and of polynomials.
Proposition III.5. There exists a constant C > 0 such that for all m ∈ N, a = (a_i)_{i=0}^m ∈ R^{m+1}, D ∈ R_+, and ε ∈ (0, 1/2), there is a network Φ_{a,D,ε} ∈ N_{1,1} with L(Φ_{a,D,ε}) ≤ C m(log(ε^{−1}) + m log(⌈D⌉) + log(m) + log(⌈‖a‖_∞⌉)), W(Φ_{a,D,ε}) ≤ 9, B(Φ_{a,D,ε}) ≤ 1, and satisfying

‖Φ_{a,D,ε}(x) − Σ_{i=0}^m a_i x^i‖_{L∞([−D,D])} ≤ ε.
Proof. As in the proof of Proposition III.3 and for the same reason, it suffices to consider the case D ≥ 1. For m = 1, we simply have an affine transformation and the statement follows directly from Corollary A.2. The proof for m ≥ 2 will be effected by realizing the monomials x^k, k ≥ 2, through iterative composition of multiplication networks and combining this with a construction that uses the network realizing x^k not only as a building block in the network implementing x^{k+1} but also to approximate the partial sum Σ_{i=0}^k a_i x^i in parallel.
We start by setting B_k = B_k(D, η) := ⌈D⌉^k + η Σ_{s=0}^{k−2} ⌈D⌉^s, k ∈ N, η ∈ R_+, and take Φ_{B_k,η} to be the multiplication network from Proposition III.3. Next, we recursively define the functions

f_{k,D,η}(x) := Φ_{B_{k−1},η}(x, f_{k−1,D,η}(x)),  for k ≥ 2,

with f_{0,D,η}(x) = 1 and f_{1,D,η}(x) = x. For notational simplicity, we use the abbreviation f_k = f_{k,D,η} in the following. First, we verify that the f_{k,D,η} approximate monomials sufficiently well. Specifically, we prove by induction that

‖f_k(x) − x^k‖_{L∞([−D,D])} ≤ η Σ_{s=0}^{k−2} ⌈D⌉^s,     (8)
for all k ≥ 2. The base case k = 2, i.e.,

‖f_2(x) − x²‖_{L∞([−D,D])} ≤ η,

follows directly from Proposition III.3 upon noting that D ≤ B_1 = ⌈D⌉ (we take the sum in the definition of B_k to equal zero when the upper limit of summation is negative). We proceed to establish the induction step (k − 1) → k with the induction assumption given by

‖f_{k−1}(x) − x^{k−1}‖_{L∞([−D,D])} ≤ η Σ_{s=0}^{k−3} ⌈D⌉^s.

As

‖f_{k−1}‖_{L∞([−D,D])} ≤ ‖x^{k−1}‖_{L∞([−D,D])} + ‖f_{k−1}(x) − x^{k−1}‖_{L∞([−D,D])} ≤ B_{k−1},

Proposition III.3 applies and we obtain

‖f_k(x) − x^k‖_{L∞([−D,D])} ≤ ‖f_k(x) − x f_{k−1}(x)‖_{L∞([−D,D])} + ‖x f_{k−1}(x) − x^k‖_{L∞([−D,D])}
  ≤ ‖Φ_{B_{k−1},η}(x, f_{k−1}(x)) − x f_{k−1}(x)‖_{L∞([−D,D])} + D ‖f_{k−1}(x) − x^{k−1}‖_{L∞([−D,D])}
  ≤ η + ⌈D⌉ η Σ_{s=0}^{k−3} ⌈D⌉^s = η Σ_{s=0}^{k−2} ⌈D⌉^s,

which completes the induction argument. It remains to show that the mapping (x, s, y) ↦ (x, s + a_i y, Φ_{B_i,η}(x, y)) can be realized by a network with the claimed width and weight bounds. To see that this is, indeed, the case, consider the following chain of mappings

(x, s, y) ↦(I) (x, s, y, y) ↦(II) (x, s + a_i y, y) ↦(III) (x, s + a_i y, x, y) ↦(IV) (x, s + a_i y, Φ_{B_i,η}(x, y)).
Observe that the mapping (I) is an affine transformation with coefficients in {0, 1}, which we can simply consider to be a depth-1 network. The mapping (II) is obtained by using Corollary A.2 in order to implement the affine transformation (s, y) ↦ s + a_i y with weights bounded by 1, followed by application of Lemmas II.4 and II.5 to put this network in parallel with two networks realizing the identity mapping according to x = ρ(x) − ρ(−x). Mapping (III) is obtained along the same lines by putting the result of mapping (II) in parallel with another network realizing the identity mapping. Finally, mapping (IV) is realized by putting the network Φ_{B_i,η} in parallel with two identity networks. Composing these four networks according to Lemma II.3 yields, for i ∈ {1, ..., m − 1}, a network Ψ^i_{a,D,η} with the claimed properties. Next, we employ Corollary A.2 to get networks Ψ^0_{a,D,η} which implement x ↦ (x, a_0, x)
as well as networks Ψ^m_{a,D,η} realizing (x, s, y) ↦ s + a_m y. Let now η = η(a, D, ε) := (‖a‖_∞ (m − 1)² ⌈D⌉^{m−2})^{−1} ε and define

Φ_{a,D,ε} := Ψ^m_{a,D,η} ∘ Ψ^{m−1}_{a,D,η} ∘ ··· ∘ Ψ^1_{a,D,η} ∘ Ψ^0_{a,D,η}.

The approximation error bound then follows from (8) according to ‖Φ_{a,D,ε}(x) − Σ_{i=0}^m a_i x^i‖_{L∞([−D,D])} ≤ Σ_{i=2}^m |a_i| ‖f_i(x) − x^i‖_{L∞([−D,D])} ≤ ‖a‖_∞ (m − 1)² ⌈D⌉^{m−2} η = ε, and the claimed bounds on depth, width, and weight magnitude are obtained by applying Lemma II.3 to the composition above.
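The data flow of the proof, namely carrying (x, partial sum, current monomial) through a chain of approximate multiplications, can be sketched as follows (ours, with exact affine steps and the functional multiplication emulation from above; domain bounds are chosen as powers of D in the spirit of the B_k).

import numpy as np

def sq_approx(z, m):
    # approximates z^2 on [0, 1] (cf. Proposition III.2)
    h, acc = z, z
    for k in range(m):
        h = 0.5 * np.maximum(0.0, h) - np.maximum(0.0, h - 2.0 ** (-2 * k - 1))
        acc = acc - h
    return acc

def mult_approx(x, y, B, m):
    # approximate multiplication on [-B, B]^2 via the polarization identity (cf. Prop. III.3)
    return B ** 2 * (sq_approx(np.abs(x + y) / (2 * B), m) - sq_approx(np.abs(x - y) / (2 * B), m))

def poly_approx(a, x, D, m):
    # iterate (x, s, y) -> (x, s + a_i y, approx(x * y)) as in the proof above
    s, y = a[0] * np.ones_like(x), x
    for i in range(1, len(a)):
        s = s + a[i] * y
        y = mult_approx(x, y, D ** (i + 1), m)   # domain bound for the next monomial x^{i+1}
    return s

a = [1.0, -2.0, 0.5, 3.0]                        # coefficients of 1 - 2x + 0.5 x^2 + 3 x^3
D, m = 2.0, 18
xs = np.linspace(-D, D, 1001)
exact = sum(c * xs ** i for i, c in enumerate(a))
print(np.max(np.abs(poly_approx(a, xs, D, m) - exact)))   # maximum deviation; decays exponentially in m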
Next, we recall that the Weierstrass approximation theorem states that every continuous function on a closed
interval can be approximated to within arbitrary accuracy by a polynomial.
Theorem III.6 ([32]). Let [a, b] ⊆ R and f ∈ C([a, b]). Then, for every ε > 0, there exists a polynomial π such
that
kf − πkL∞ ([a,b]) ≤ ε.
Proposition III.5 hence allows us to conclude that every continuous function on a closed interval can be approxi-
mated to within arbitrary accuracy by a deep ReLU network of width no more than 9. This amounts to a variant of
the universal approximation theorem [11], [12] for finite-width deep ReLU networks. A quantitative statement in
terms of making the approximating network’s width, depth, and weight bounds explicit can be obtained for (very)
smooth functions by applying Proposition III.5 to Lagrangian interpolation with Chebyshev points.
Lemma III.7. Consider the set

S_{[−1,1]} := { f ∈ C^∞([−1, 1], R) : ‖f^{(n)}(x)‖_{L∞([−1,1])} ≤ n!, for all n ∈ N_0 }.

There exists a constant C > 0 such that for all f ∈ S_{[−1,1]} and ε ∈ (0, 1/2), there is a network Ψ_{f,ε} ∈ N_{1,1} with L(Ψ_{f,ε}) ≤ C (log(ε^{−1}))², W(Ψ_{f,ε}) ≤ 9, B(Ψ_{f,ε}) ≤ 1, and satisfying

‖Ψ_{f,ε} − f‖_{L∞([−1,1])} ≤ ε.
Proof. A fundamental result on Lagrangian interpolation with Chebyshev points (see e.g. [33, Lemma 3]) guarantees, for all f ∈ S_{[−1,1]}, m ∈ N, the existence of a polynomial P_{f,m} of degree m such that

‖f − P_{f,m}‖_{L∞([−1,1])} ≤ (1/((m + 1)! 2^m)) ‖f^{(m+1)}‖_{L∞([−1,1])} ≤ 2^{−m}.

Note that P_{f,m} can be expressed in the Chebyshev basis (see e.g. [34, Section 3.4.1]) according to P_{f,m} = Σ_{j=0}^m c_{f,m,j} T_j(x) with |c_{f,m,j}| ≤ 2 and the Chebyshev polynomials defined through the two-term recursion T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), k ≥ 2, with T_0(x) = 1 and T_1(x) = x. We can moreover use this recursion to conclude that the coefficients of the T_k in the monomial basis are upper-bounded by 3^k. Consequently, we can express P_{f,m} according to P_{f,m} = Σ_{j=0}^m a_{f,m,j} x^j with |a_{f,m,j}| ≤ 2(m + 1) 3^m. Application of Proposition III.5 to P_{f,m} in the monomial basis, with m = ⌈log(2/ε)⌉ and approximation error ε/2, completes the proof upon noting that

‖Ψ_{f,ε} − f‖_{L∞([−1,1])} ≤ ‖Ψ_{f,ε} − P_{f,m}‖_{L∞([−1,1])} + ‖P_{f,m} − f‖_{L∞([−1,1])} ≤ ε/2 + 2^{−m} ≤ ε.
An extension of Lemma III.7 to approximation over general intervals is provided in Lemma A.6. While Lemma
III.7 shows that a specific class of C ∞ -functions, namely those whose derivatives are suitably bounded, can be
approximated by neural networks with connectivity growing polylogarithmically in ε−1 , it turns out that this is not
possible for general (Sobolev-class) k-times differentiable functions [24, Thm. 4].
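The interpolation estimate underlying Lemma III.7 is readily observed numerically; the sketch below (ours) interpolates a member of S_{[−1,1]} at Chebyshev points and reports the sup-norm error for increasing degrees (the test function anticipates Theorem III.8).

import numpy as np
from numpy.polynomial import chebyshev as C

f = lambda x: (6 / np.pi ** 3) * np.cos(np.pi * x)     # an element of S_[-1,1]
xs = np.linspace(-1.0, 1.0, 5001)
for m in [4, 8, 12, 16]:
    nodes = np.cos((2 * np.arange(m + 1) + 1) * np.pi / (2 * (m + 1)))   # Chebyshev points
    coeffs = C.chebfit(nodes, f(nodes), m)             # degree-m interpolating polynomial
    err = np.max(np.abs(C.chebval(xs, coeffs) - f(xs)))
    print(m, err)    # decays at least like 2^{-m}, in line with the bound used in the proof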
We are now ready to proceed to the approximation of sinusoidal functions. Before stating the corresponding result,
we comment on the basic idea enabling the approximation of oscillatory functions through deep neural networks. In
essence, we exploit the optimality of the sawtooth construction (3) in terms of achieving exponential—in network
depth—growth in the number of linear regions. As indicated in Figure 3, the composition of the cosine function
(realized according to Lemma III.7) with the sawtooth function, combined with the symmetry properties of the
cosine function and the sawtooth function, yields oscillatory behavior that increases exponentially with network
depth.
Theorem III.8. There exists a constant C > 0 such that for every a, D ∈ R_+, ε ∈ (0, 1/2), there is a network Ψ_{a,D,ε} ∈ N_{1,1} with L(Ψ_{a,D,ε}) ≤ C((log(ε^{−1}))² + log(⌈aD⌉)), W(Ψ_{a,D,ε}) ≤ 9, B(Ψ_{a,D,ε}) ≤ 1, and satisfying

‖Ψ_{a,D,ε}(x) − cos(ax)‖_{L∞([−D,D])} ≤ ε.

Proof. Note that f(x) := (6/π³) cos(πx) is in S_{[−1,1]}. Thus, by Lemma III.7, there exists a constant C > 0 such that for every ε ∈ (0, 1/2), there is a network Φ_ε ∈ N_{1,1} with L(Φ_ε) ≤ C (log(ε^{−1}))², W(Φ_ε) ≤ 9, B(Φ_ε) ≤ 1, and satisfying

‖Φ_ε − f‖_{L∞([−1,1])} ≤ (6/π³) ε.     (9)
We now extend this result to the approximation of x ↦ cos(ax) on the interval [−1, 1] for arbitrary a ∈ R_+. This will be accomplished by exploiting that x ↦ cos(πx) is 2-periodic and even. Let g_s : [0, 1] → [0, 1], s ∈ N, be the s-th order sawtooth functions as defined in (2) and note that, due to the periodicity and the symmetry of the cosine function (see Figure 3 for illustration), we have, for all s ∈ N_0, x ∈ [−1, 1],

cos(π 2^s x) = cos(π g_s(|x|)).

For a > π, we define s = s(a) := ⌈log(a) − log(π)⌉ and α = α(a) := (π 2^s)^{−1} a ∈ (1/2, 1], and note that

cos(ax) = cos(π 2^s α x) = cos(π g_s(α|x|)),  for x ∈ [−1, 1].     (10)

In order to realize Φ_ε(g_s(α|x|)) as a neural network, we start from the networks Φ^s_g defined in (3) and apply Proposition A.3 to convert them into networks Ψ^s_g(x) = g_s(x), for x ∈ [0, 1], with B(Ψ^s_g) ≤ 1, L(Ψ^s_g) = 7(s + 1), and W(Ψ^s_g) = 3. Furthermore, let Ψ(x) := αρ(x) + αρ(−x) = α|x| and take Φ^mult_{π³/6} to be the scalar multiplication network from Lemma A.1. Noting that Ψ_{a,ε} := Φ^mult_{π³/6} ∘ Φ_ε ∘ Ψ^s_g ∘ Ψ satisfies Ψ_{a,ε}(x) = Φ^mult_{π³/6}(Φ_ε(g_s(α|x|))) and concluding from Lemma II.3 that L(Ψ_{a,ε}) ≤ C((log(ε^{−1}))² + log(⌈a⌉)), W(Ψ_{a,ε}) ≤ 9, and B(Ψ_{a,ε}) ≤ 1, together with (10), establishes the desired result for a > π and for approximation over the interval [−1, 1]. For a ∈ (0, π), we can simply take Ψ_{a,ε} := Φ^mult_{π³/6} ∘ Φ_ε as x ↦ (6/π³) cos(ax) is in S_{[−1,1]} in this case.
Finally, we consider the approximation of x ↦ cos(ax) on intervals [−D, D], for arbitrary D ≥ 1. To this end, we define the networks Ψ_{a,D,ε}(x) := Ψ_{aD,ε}(x/D) and observe that

sup_{x∈[−D,D]} |Ψ_{a,D,ε}(x) − cos(ax)| = sup_{y∈[−1,1]} |Ψ_{aD,ε}(y) − cos(aDy)| ≤ ε.     (11)
Fig. 3: Approximation of the function cos(2πax) according to Theorem III.8 using "sawtooth" functions g_s(x) as per (2); left: a = 2, right: a = 4.
The result just obtained extends to the approximation of x 7→ sin(ax), formalized next, simply by noting that
sin(x) = cos(x − π/2).
Corollary III.9. There exists a constant C > 0 such that for every a, D ∈ R_+, b ∈ R, ε ∈ (0, 1/2), there is a network Ψ_{a,b,D,ε} ∈ N_{1,1} with L(Ψ_{a,b,D,ε}) ≤ C((log(ε^{−1}))² + log(⌈aD + |b|⌉)), W(Ψ_{a,b,D,ε}) ≤ 9, B(Ψ_{a,b,D,ε}) ≤ 1, and satisfying

‖Ψ_{a,b,D,ε}(x) − cos(ax − b)‖_{L∞([−D,D])} ≤ ε.
Proof. For given a, D ∈ R_+, b ∈ R, ε ∈ (0, 1/2), consider the network Ψ_{a,b,D,ε}(x) := Ψ_{a,D+|b|/a,ε}(x − b/a) with Ψ_{a,D,ε} as defined in the proof of Theorem III.8, and observe that, owing to (11),

sup_{x∈[−D,D]} |Ψ_{a,b,D,ε}(x) − cos(ax − b)| ≤ sup_{y∈[−(D+|b|/a), D+|b|/a]} |Ψ_{a,D+|b|/a,ε}(y) − cos(ay)| ≤ ε.
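The driving identity behind Theorem III.8, namely that the sawtooth g_s converts the slowly varying cos(π·) into the rapidly oscillating cos(a·), can be verified directly; the sketch below (ours) uses the exact cosine in place of its network approximation.

import numpy as np

def g(x):
    return 2 * np.maximum(0.0, x) - 4 * np.maximum(0.0, x - 0.5) + 2 * np.maximum(0.0, x - 1.0)

def g_s(x, s):
    for _ in range(s):
        x = g(x)
    return x

a = 200.0                                     # target: cos(a x) on [-1, 1]
s = int(np.ceil(np.log2(a / np.pi)))          # s = ceil(log a - log pi)
alpha = a / (np.pi * 2 ** s)                  # alpha in (1/2, 1]
x = np.linspace(-1.0, 1.0, 20001)
print(np.max(np.abs(np.cos(a * x) - np.cos(np.pi * g_s(alpha * np.abs(x), s)))))
# numerically zero: cos(a x) = cos(pi g_s(alpha |x|)) on [-1, 1], so only the smooth cosine
# needs to be approximated by a network; the oscillations are produced by the sawtooth layers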
Remark III.10. The results in this section all have approximating networks of finite width and depth scaling
polylogarithmically in ε−1 . Owing to
M(Φ) ≤ L(Φ)W(Φ)(W(Φ) + 1)
this implies that the connectivity scales no faster than polylogarithmic in ε−1 . It therefore follows that the approx-
imation error ε decays (at least) exponentially fast in the connectivity or equivalently in the number of parameters
the approximant (i.e., the neural network) employs. We say that the network provides exponential approximation
accuracy.
So far we considered the explicit construction of deep neural networks for the approximation of a wide range of
functions, namely polynomials, smooth functions, and sinusoidal functions, in all cases with exponential accuracy,
i.e., with an approximation error that decays exponentially in network connectivity. We now proceed to lay the
foundation for the development of a framework that allows us to characterize the fundamental limits of deep neural
network approximation of entire function classes. But first, we provide a review of relevant literature.
The best-known results on approximation by neural networks are the universal approximation theorems of Hornik
[12] and Cybenko [11], stating that continuous functions on bounded domains can be approximated arbitrarily well
by a single-hidden-layer (L = 2 in our terminology) neural network with sigmoidal activation function. The
literature on approximation-theoretic properties of networks with a single hidden layer continuing this line of work
is abundant. Without any claim to completeness, we mention work on approximation error bounds in terms of the
number of neurons for functions with Fourier transforms of bounded first moments [35], [36], the nonexistence of
localized approximations [37], a fundamental lower bound on approximation rates [38], [39], and the approximation
of smooth or analytic functions [40], [41].
Approximation-theoretic results for networks with multiple hidden layers were obtained in [42], [43] for general
functions, in [44] for continuous functions, and for functions together with their derivatives in [45]. In [37] it was
shown that for certain approximation tasks deep networks can perform fundamentally better than single-hidden-
layer networks. We also highlight two recent papers, which investigate the benefit—from an approximation-theoretic
perspective—of multiple hidden layers. Specifically, in [46] it was shown that there exists a function which, although
expressible through a small three-layer network, can only be represented through a very large two-layer network;
here size is measured in terms of the total number of neurons in the network.
In the setting of deep convolutional neural networks first results of a nature similar to those in [46] were reported
in [47]. Linking the expressivity properties of neural networks to tensor decompositions, [48], [49] established the
existence of functions that can be realized by relatively small deep convolutional networks but require exponentially
larger shallow convolutional networks.
We conclude by mentioning recent results bearing witness to the approximation power of deep ReLU networks in
the context of PDEs. Specifically, it was shown in [29] that deep ReLU networks can approximate very effectively
certain solution families of parametric PDEs depending on a large (possibly infinite) number of parameters. The
series of papers [50], [51], [52], [53] constructs and analyzes a deep-learning-based numerical solver for Black-
Scholes PDEs.
For survey articles on approximation-theoretic aspects of neural networks, we refer the interested reader to [54]
and [55] as well as the very recent [56]. Most closely related to the framework we develop here is the paper by
Shaham, Cloninger, and Coifman [57], which shows that for functions that are sparse in specific wavelet frames,
the best M -weight approximation rate (see Definition VI.1 below) of three-layer neural networks is at least as large
as the best M -term approximation rate in piecewise linear wavelet frames.
We begin the development of our framework with a review of a widely used theoretical foundation for deterministic
lossy data compression [58], [59]. Our presentation essentially follows [14], [60].
Let d ∈ N, Ω ⊆ R^d, and consider a set of functions C ⊆ L²(Ω), which we will frequently refer to as a function class. Then, for each ℓ ∈ N, we denote by

E^ℓ := {E : C → {0, 1}^ℓ}   and   D^ℓ := {D : {0, 1}^ℓ → L²(Ω)}

the sets of binary encoders and decoders of length ℓ, respectively. An encoder-decoder pair (E, D) ∈ E^ℓ × D^ℓ is said to achieve uniform error ε over the function class C, if

sup_{f∈C} ‖D(E(f)) − f‖_{L²(Ω)} ≤ ε.
Note that here we quantified the approximation error in L2 (Ω)-norm, whereas in the previous section we used
the L∞ (Ω)-norm. While results in terms of L∞ (Ω)-norm are stronger, we shall employ the L2 (Ω)-norm in order
to parallel the Kolmogorov-Donoho framework for nonlinear approximation through dictionaries [14], [15]. We
furthermore note that for sets Ω of finite Lebesgue measure |Ω|, the two norms are related through ‖f‖_{L²(Ω)} ≤ |Ω|^{1/2} ‖f‖_{L∞(Ω)}. Finally, whenever we talk about compactness and related topological notions, we shall always mean w.r.t. the topology induced by the L²(Ω)-norm.
A quantity of central interest is the minimal length ` ∈ N for which there exists an encoder-decoder pair
(E, D) ∈ E` × D` that achieves uniform error ε over the function class C, along with its asymptotic behavior as
made precise in the following definition.
Definition IV.1. Let d ∈ N, Ω ⊆ R^d, and let C ⊆ L²(Ω) be compact. Then, for ε > 0, the minimax code length L(ε, C) is

L(ε, C) := min{ ℓ ∈ N : ∃ (E, D) ∈ E^ℓ × D^ℓ : sup_{f∈C} ‖D(E(f)) − f‖_{L²(Ω)} ≤ ε }.     (12)

Moreover, the optimal exponent γ*(C) is defined as

γ*(C) := sup{ γ ∈ R : L(ε, C) ∈ O(ε^{−1/γ}), ε → 0 }.
The optimal exponent γ ∗ (C) determines the minimum growth rate of L(ε, C) as the error ε tends to zero and
can hence be seen as quantifying the “description complexity” of the function class C. Larger γ ∗ (C) results in
smaller growth rate and hence smaller memory requirements for storing functions f ∈ C such that reconstruction
with uniformly bounded error is possible.
Remark IV.2. The optimal exponent γ*(C) can equivalently be thought of as quantifying the asymptotic behavior of the minimal achievable error for the function class C with a given code length. Specifically, we have

γ*(C) = sup{ γ ∈ R : ε*(ℓ, C) ∈ O(ℓ^{−γ}), ℓ → ∞ },     (13)

where

ε*(ℓ, C) := inf{ ε > 0 : ∃ (E, D) ∈ E^ℓ × D^ℓ : sup_{f∈C} ‖D(E(f)) − f‖_{L²(Ω)} ≤ ε }.
The quantity γ ∗ (C) is closely related to the concept of Kolmogorov-Tikhomirov epsilon entropy a.k.a. metric
entropy [61]. We next make this connection explicit.
B. Metric entropy
Most of the discussion in this subsection, which is almost exclusively of review nature, follows very closely [62,
Chapter 5]. Consider the metric space (X , ρ) with X a nonempty set and ρ : X × X → R a distance function.
A natural measure for the size of a compact subset C of X is given by the number of balls of a fixed radius ε
required to cover C, a quantity known as the covering number (for covering radius ε).
Definition IV.3. [62] Let (X , ρ) be a metric space. An ε-covering of a compact set C ⊆ X with respect to the
metric ρ is a set {x1 , . . . , xN } ⊆ C such that for each x ∈ C, there exists an i ∈ {1, . . . , N } so that ρ(x, xi ) ≤ ε.
The ε-covering number N (ε; C, ρ) is the cardinality of the smallest ε-covering.
Equivalently, {x_1, ..., x_N} is an ε-covering of C if C ⊆ ∪_{i=1}^N B(x_i, ε), where B(x_i, ε) is a ball (in the metric ρ) of radius ε centered at x_i. The covering number is nonincreasing in ε, i.e., N(ε; C, ρ) ≥ N(ε'; C, ρ), for all ε ≤ ε'. When the set C is not finite, the covering number goes to infinity
as ε goes to zero. We shall be interested in the corresponding rate of growth, more specifically in the quantity
log N (ε; C, ρ) known as the metric entropy of C with respect to ρ. Recall that log is to the base 2, hence the unit
of metric entropy is “bits”. The operational significance of metric entropy follows from the question: What is the
minimum number of bits needed to represent any element x ∈ C with error—quantified in terms of the distance
measure ρ—of at most ε? By what was just developed, the answer to this question is ⌈log N(ε; C, ρ)⌉. Specifically, for a given x ∈ C, the corresponding encoder E(x) simply identifies the closest ball center x_i and encodes the index i using ⌈log N(ε; C, ρ)⌉ bits. The corresponding decoder D delivers the ball center x_i, which guarantees that the resulting error satisfies ρ(D(E(x)), x) ≤ ε.
We proceed with a simple example ([62, Example 5.2]) computing an upper bound on the metric entropy of the interval C = [−1, 1] in R with respect to the metric ρ(x, x') = |x − x'|. To this end, we divide C into intervals of length 2ε by setting x_i = −1 + 2(i − 1)ε, for i ∈ [1, L], where L = ⌊1/ε⌋ + 1. This guarantees that, for every point x ∈ [−1, 1], there is an i ∈ [1, L] such that |x − x_i| ≤ ε, which, in turn, establishes

N(ε; C, ρ) ≤ ⌊1/ε⌋ + 1 ≤ 1/ε + 1

and hence yields an upper bound on metric entropy according to²

log N(ε; C, ρ) ≤ log(1/ε + 1) ≍ log(ε^{−1}), as ε → 0.     (14)
² The notation f(ε) ≍ g(ε), as ε → 0, means that there are constants c, C, ε_0 > 0 such that c f(ε) ≤ g(ε) ≤ C f(ε), for all ε ≤ ε_0. For ease of exposition, we shall usually omit the qualifier ε → 0.
This result can be generalized to the d-dimensional unit cube to yield log(N(ε; C, ρ)) ≤ d log(1/ε + 1) ≍ d log(ε^{−1}). In order to show that the upper bound (14) correctly reflects metric entropy scaling for C = [−1, 1] with respect to ρ(x, x') = |x − x'|, we would need a lower bound on N(ε; C, ρ) that exhibits the same scaling (in ε) behavior. A
systematic approach to establishing lower bounds on metric entropy is through the concept of packing, which will
be introduced next.
We start with the definition of the packing number of a compact set C in a metric space (X , ρ).
Definition IV.4. [62, Definition 5.4] Let (X, ρ) be a metric space. An ε-packing of a compact set C ⊆ X with respect to the metric ρ is a set {x_1, ..., x_N} ⊆ C such that ρ(x_i, x_j) > ε, for all distinct i, j. The ε-packing number M(ε; C, ρ) is the cardinality of the largest ε-packing.
An ε-packing is a collection of nonintersecting balls of radius ε/2 and centered at elements in X . Although
different, the covering number and the packing number provide essentially the same measure of size of a set as
formalized next.
Lemma IV.5. [62, Lemma 5.5] Let (X, ρ) be a metric space and C a compact set in X. For all ε > 0, the packing and the covering number are related according to

M(2ε; C, ρ) ≤ N(ε; C, ρ) ≤ M(ε; C, ρ).

Proof. [62], [63] First, choose a minimal ε-covering and a maximal 2ε-packing of C. Since no two centers of the 2ε-packing can lie in the same ball of the ε-covering, it follows that M(2ε; C, ρ) ≤ N(ε; C, ρ). To establish N(ε; C, ρ) ≤ M(ε; C, ρ), we note that, given a maximal ε-packing, for any x ∈ C, there is at least one packing center within distance at most ε. If this were not the case, we could add another ball to the packing, thereby violating its maximality. This maximal packing hence also provides an ε-covering and, since N(ε; C, ρ) is the size of a minimal covering, we must have N(ε; C, ρ) ≤ M(ε; C, ρ).
We now return to the example in which we computed an upper bound on the metric entropy of C = [−1, 1] with respect to ρ(x, x') = |x − x'| and show how Lemma IV.5 can be employed to establish the scaling behavior of metric entropy. To this end, we simply note that the points x_i = −1 + 2(i − 1)ε, i ∈ [1, L], are separated according to |x_i − x_j| ≥ 2ε > ε, for all i ≠ j, which implies that M(ε; C, |·|) ≥ L = ⌊1/ε⌋ + 1 ≥ 1/ε. Combining this with the upper bound (14) and Lemma IV.5, we obtain log N(ε; C, |·|) ≍ log(ε^{−1}). Likewise, it can be established that log N(ε; C, ‖·‖) ≍ d log(ε^{−1}) for the d-dimensional unit cube. This illustrates how an explicit construction of a packing set can be used to determine the scaling behavior of metric entropy.
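The covering and packing constructions for C = [−1, 1] are simple enough to be reproduced numerically; the sketch below (ours) compares the covering from the text with a greedily computed packing and exhibits the common 1/ε scaling.

import numpy as np

def greedy_packing_count(points, eps):
    # greedily selects points with pairwise distance > eps (yields a maximal packing)
    chosen = []
    for p in points:
        if all(abs(p - q) > eps for q in chosen):
            chosen.append(p)
    return len(chosen)

grid = np.linspace(-1.0, 1.0, 10001)
for eps in [0.1, 0.03, 0.01]:
    N_cov = int(np.floor(1.0 / eps)) + 1        # covering from the text: x_i = -1 + 2(i-1) eps
    M_pack = greedy_packing_count(grid, eps)    # packing number of [-1, 1] (up to the grid)
    print(eps, N_cov, M_pack)                   # both scale like 1/eps, so log N ~ log(1/eps)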
We next formalize the notion that metric entropy is determined by the volume of the corresponding covering
balls. Specifically, the following result establishes a relationship between a certain volume ratio and metric entropy.
Lemma IV.6. [62, Lemma 5.7] Consider a pair of norms ‖·‖ and ‖·‖' on R^d, and let B and B' be their corresponding unit balls, i.e., B = {x ∈ R^d : ‖x‖ ≤ 1} and B' = {x ∈ R^d : ‖x‖' ≤ 1}. Then, the ε-covering number of B in the ‖·‖'-norm satisfies

(1/ε)^d vol(B)/vol(B') ≤ N(ε; B, ‖·‖') ≤ vol((2/ε)B + B')/vol(B').     (15)
Proof. Starting with the lower bound, note that any ε-covering {x_1, ..., x_N} of B with respect to ‖·‖' satisfies B ⊆ ∪_{j=1}^N (x_j + εB'), which implies vol(B) ≤ N(ε; B, ‖·‖') ε^d vol(B'), thus establishing the lower bound in (15). The upper bound is obtained by starting with a maximal ε-packing {x_1, ..., x_{M(ε;B,‖·‖')}} of B in the ‖·‖'-norm. The balls {x_j + (ε/2)B', j = 1, ..., M(ε; B, ‖·‖')} are all disjoint and contained within B + (ε/2)B'. We can therefore conclude that

Σ_{j=1}^{M(ε;B,‖·‖')} vol(x_j + (ε/2)B') ≤ vol(B + (ε/2)B'),

and hence

M(ε; B, ‖·‖') vol((ε/2)B') ≤ vol(B + (ε/2)B').

Finally, we have vol((ε/2)B') = (ε/2)^d vol(B') and vol(B + (ε/2)B') = (ε/2)^d vol((2/ε)B + B'), which, together with M(ε; B, ‖·‖') ≥ N(ε; B, ‖·‖') due to Lemma IV.5, yields the upper bound in (15).
This result now allows us to establish the scaling of the metric entropy of unit balls in terms of their own norm, thus yielding a measure of the massiveness of unit balls in d-dimensional spaces. Specifically, we set B' = B in Lemma IV.6 and get

vol((2/ε)B + B) = vol((2/ε + 1)B) = (2/ε + 1)^d vol(B),

which, when used in (15), yields N(ε; B, ‖·‖) ≍ ε^{−d} and hence results in metric entropy scaling according to log(N(ε; B, ‖·‖)) ≍ d log(ε^{−1}). Particularizing this result to the unit ball B^d_∞ = [−1, 1]^d and the metric ‖·‖_∞, we recover the result of our direct analysis in the example above.
So far we have been concerned with the metric entropy of subsets of R^d. We now proceed to analyzing the metric entropy of function classes, which will eventually allow us to establish the desired connection between the optimal exponent γ*(C) and metric entropy. We begin with the simple one-parameter function class considered in [62, Example 5.9] and follow closely the exposition in [62]. For fixed θ ∈ [0, 1], define the real-valued function f_θ(x) = 1 − e^{−θx}, x ∈ [0, 1], and consider the class

P := { f_θ : θ ∈ [0, 1] }.
The set P constitutes a metric space under the sup-norm given by ‖f − g‖_{L∞([0,1])} = sup_{x∈[0,1]} |f(x) − g(x)|. We show that the covering number of P satisfies

1 + (1 − 1/e)/(2ε) ≤ N(ε; P, ‖·‖_{L∞([0,1])}) ≤ 1/(2ε) + 2,

which leads to the scaling behavior N(ε; P, ‖·‖_{L∞([0,1])}) ≍ ε^{−1} and hence to metric entropy scaling according to log(N(ε; P, ‖·‖_{L∞([0,1])})) ≍ log(ε^{−1}). We start by establishing the upper bound. For given ε ∈ (0, 1], set T = ⌊1/(2ε)⌋, and define the points θ_i = 2εi, for i = 0, 1, ..., T. By also adding the point θ_{T+1} = 1, we obtain a collection of T + 2 points {θ_0, θ_1, ..., θ_{T+1}} in [0, 1]. We show that the associated functions {f_{θ_0}, f_{θ_1}, ..., f_{θ_{T+1}}} form an ε-covering for P. Indeed, for any f_θ ∈ P, we can find some θ_i in the covering such that |θ − θ_i| ≤ ε. We then have

‖f_θ − f_{θ_i}‖_{L∞([0,1])} = max_{x∈[0,1]} |e^{−θx} − e^{−θ_i x}| ≤ |θ − θ_i| ≤ ε.

To verify the middle inequality, assume first that θ ≤ θ_i and note that

max_{x∈[0,1]} |e^{−θx} − e^{−θ_i x}| = max_{x∈[0,1]} (e^{−θx} − e^{−θ_i x}) = max_{x∈[0,1]} e^{−θx}(1 − e^{−(θ_i−θ)x}) ≤ max_{x∈[0,1]} (1 − e^{−(θ_i−θ)x}) ≤ θ_i − θ,

as a consequence of 1 − e^{−x} ≤ x, for x ∈ [0, 1], which is easily verified by noting that the function g(x) = 1 − e^{−x} − x satisfies g(0) = 0 and g'(x) ≤ 0, for x ∈ [0, 1]. The case θ > θ_i follows similarly. In summary, we have shown that N(ε; P, ‖·‖_{L∞([0,1])}) ≤ T + 2 ≤ 1/(2ε) + 2.
In order to derive the lower bound, we first bound the packing number from below and then use Lemma IV.5. We start by constructing an explicit packing as follows. Set θ_0 = 0 and define θ_i = −log(1 − εi), for all i such that θ_i ≤ 1. The largest index T such that this holds is given by T = ⌊(1 − 1/e)/ε⌋. Moreover, note that for all i, j with i ≠ j, we have ‖f_{θ_i} − f_{θ_j}‖_{L∞([0,1])} ≥ |f_{θ_i}(1) − f_{θ_j}(1)| = |ε(i − j)| ≥ ε. We can therefore conclude that M(ε; P, ‖·‖_{L∞([0,1])}) ≥ ⌊(1 − 1/e)/ε⌋ + 1, and hence, due to the lower bound in Lemma IV.5,

N(ε; P, ‖·‖_{L∞([0,1])}) ≥ M(2ε; P, ‖·‖_{L∞([0,1])}) ≥ (1 − 1/e)/(2ε) + 1,

as claimed. We have thus established that the function class P has metric entropy scaling according to log(N(ε; P, ‖·‖_{L∞([0,1])})) ≍ log(ε^{−1}).
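The covering construction for P can be checked numerically as well; the sketch below (ours) verifies that the T + 2 centers θ_0, ..., θ_{T+1} indeed form an ε-covering in the sup-norm.

import numpy as np

xg = np.linspace(0.0, 1.0, 2001)
f = lambda theta: 1.0 - np.exp(-theta * xg)               # elements of the class P
sup_dist = lambda t1, t2: np.max(np.abs(f(t1) - f(t2)))

eps = 0.06
T = int(np.floor(1.0 / (2 * eps)))
centers = [2 * eps * i for i in range(T + 1)] + [1.0]     # theta_0, ..., theta_T and theta_{T+1} = 1
worst = max(min(sup_dist(t, c) for c in centers) for t in np.linspace(0.0, 1.0, 500))
print(len(centers), worst <= eps)    # T + 2 centers, and every f_theta lies within eps of one of them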
We conclude this subsection with a function class that is parametrized by infinitely many parameters, namely the class of functions f : [0, 1]^d → R that are Lipschitz-continuous with Lipschitz constant L. This class, denoted as F_L([0, 1]^d), has metric entropy scaling [64], [62]

log N(ε; F_L([0, 1]^d), ‖·‖_{L∞([0,1]^d)}) ≍ (L/ε)^d.     (16)
The exponential dependence of the metric entropy in (16) on the ambient dimension d stands in stark contrast to the linear dependence we identified earlier for simpler sets such as unit balls in R^d, where we had log(N(ε; B, ‖·‖)) ≍ d log(ε^{−1}).
We now show how Kolmogorov-Donoho rate-distortion theory can be put to work in the context of optimal
approximation with dictionaries. Again, this subsection is of review nature. We start with a brief discussion of
basics on optimal approximation in Hilbert spaces. Specifically, we shall consider two types of approximation,
namely linear and nonlinear.
Let H be a Hilbert space equipped with inner product ⟨·,·⟩ and induced norm ‖·‖_H, and let e_k, k = 1, 2, ..., be an orthonormal basis for H. For linear approximation, we use the linear space H_M := span{e_k : 1 ≤ k ≤ M} to approximate a given element f ∈ H. We measure the approximation error by

E_M(f) := inf_{g∈H_M} ‖f − g‖_H.

In nonlinear approximation, we consider best M-term approximation, which replaces H_M by the set Σ_M consisting of all elements g ∈ H that can be expressed as

g = Σ_{k∈Λ} c_k e_k,

where Λ ⊆ N is a set of indices with |Λ| ≤ M. Note that, in contrast to H_M, the set Σ_M is not a linear space as a linear combination of two elements in Σ_M will, in general, need 2M terms in its representation by the e_k. Analogous to E_M, we define the error of best M-term approximation

Γ_M(f) := inf_{g∈Σ_M} ‖f − g‖_H.
The key difference between linear and nonlinear approximation resides in the fact that in nonlinear approximation,
we can choose the M elements ek participating in the approximation of f freely from the entire orthonormal
basis whereas in linear approximation we are constrained to the first M elements. A classical example for linear
approximation is the approximation of periodic functions by the Fourier series elements corresponding to the M
lowest frequencies (assuming natural ordering of the dictionary). This approach clearly leads to poor approximation
if the function under consideration consists of high-frequency components. In contrast, in nonlinear approximation
we would seek the M frequencies that yield the smallest approximation error. In summary, it is clear that (nonlinear)
best M -term approximation can achieve smaller approximation error than linear M -term approximation.
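The gap between linear and nonlinear approximation is easy to visualize in coordinates; by Parseval, working directly with the coefficient sequence of f in an orthonormal basis suffices. The sketch below (ours, with a synthetic coefficient sequence) compares keeping the first M coefficients with keeping the M largest ones.

import numpy as np

rng = np.random.default_rng(0)
# coefficients of f in an orthonormal basis; by Parseval, the L2 approximation error equals
# the l2 norm of the discarded coefficients
c = rng.standard_normal(1024) / np.arange(1, 1025)
rng.shuffle(c)                            # energy no longer concentrated on the first indices

M = 32
linear_err = np.linalg.norm(c[M:])                        # linear: keep the first M coefficients
keep = np.argsort(np.abs(c))[::-1][:M]                    # nonlinear: keep the M largest ones
nonlinear_err = np.linalg.norm(np.delete(c, keep))
print(linear_err, nonlinear_err)          # the best M-term (nonlinear) error is never larger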
We shall consider nonlinear approximation in arbitrary, possibly redundant, dictionaries, i.e., in frames [68], and
will exclusively be interested in the case H = L2 (Ω), in particular the approximation error will be measured in
terms of L2 (Ω)-norm. Specifically, let C be a set of functions in L2 (Ω) and consider a countable family of functions
D := (ϕi )i∈N ⊆ L2 (Ω), termed dictionary.
We consider the best M -term approximation error of f ∈ C in D defined as follows.
Definition V.1. [58] Given d ∈ N, Ω ⊆ R^d, a function class C ⊆ L²(Ω), and a dictionary D = (φ_i)_{i∈N} ⊆ L²(Ω), we define, for f ∈ C and M ∈ N,

Γ^D_M(f) := inf_{I_{f,M} ⊆ N, |I_{f,M}| = M, (c_i)_{i∈I_{f,M}}} ‖ f − Σ_{i∈I_{f,M}} c_i φ_i ‖_{L²(Ω)}.     (17)

We call Γ^D_M(f) the best M-term approximation error of f in D. Every f_M = Σ_{i∈I_{f,M}} c_i φ_i attaining the infimum in (17) is referred to as a best M-term approximation of f in D. The supremal γ > 0 such that

sup_{f∈C} Γ^D_M(f) ∈ O(M^{−γ}), M → ∞,

will be denoted by γ*(C, D). We say that the best M-term approximation rate of C in the dictionary D is γ*(C, D).
Function classes C widely studied in the approximation theory literature include unit balls in Lebesgue, Sobolev,
or Besov spaces [59], as well as α-cartoon-like functions [69]. A wealth of structured dictionaries D is provided
by the area of applied harmonic analysis, starting with wavelets [70], followed by ridgelets [39], curvelets [71],
shearlets [72], parabolic molecules [73], and most generally α-molecules [69], which include all previously named
dictionaries as special cases. Further examples are Gabor frames [17], Wilson bases [74], and wave atoms [18].
The best M -term approximation rate γ ∗ (C, D) according to Definition V.1 quantifies how difficult it is to
approximate a given function class C in a fixed dictionary D. It is sensible to ask whether for given C, there
is a fundamental limit on γ ∗ (C, D) when one is allowed to vary over D. To answer this question, we first note that
for every dense (and countable) D, for any given f ∈ C, by density of D, there exists a single dictionary element
that approximates f to within arbitrary accuracy thereby effectively realizing a 1-term approximation for arbitrary
approximation error ε. Formally, this can be expressed through γ ∗ (C, D) = ∞. Identifying this single dictionary
26
element or, more generally, the M elements participating in the best M -term approximation is in general, however,
practically infeasible as it entails searching through the infinite set D and requires an infinite number of bits to
describe the indices of the participating elements. This insight leads to the concept of “best M -term approximation
subject to polynomial-depth search” as introduced by Donoho in [15]. Here, the basic idea is to restrict the search
for the elements in D participating in the best M -term approximation to the first π(M ) elements of D, with π a
polynomial. We formalize this under the name of effective best M -term approximation as follows.
Definition V.2. Let d ∈ N, Ω ⊆ Rd , C ⊆ L2 (Ω) be compact, and D = (ϕi )i∈N ⊆ L2 (Ω). We define for M ∈ N
and π a polynomial
X
επC,D (M ) := sup
inf
f − c ϕ
i i
(18)
f ∈C If,M ⊆{1,2,...,π(M )},
|If,M |=M, |ci |≤π(M ) i∈If,M
L2 (Ω)
and
We refer to γ ∗,eff (C, D) as the effective best M -term approximation rate of C in the dictionary D.
Note that we required the coefficients ci in the approximant in Definition V.2 to be polynomially bounded in
M . This condition, not present in [14], [60] and easily met for generic C and D, is imposed for technical reasons
underlying the transference results in Section VII. Strictly speaking—relative to [14], [60]—we hence get a subtly
different notion of approximation rate. Exploring the implications of this difference is certainly worthwhile, but
deemed beyond the scope of this paper.
We next present a central result in best M -term approximation theory stating that for compact C ⊆ L2 (Ω), the
effective best M -term approximation rate in any dictionary D is upper-bounded by γ ∗ (C) and hence limited by the
“description complexity” of C. This endows γ ∗ (C) with operational meaning.
Theorem V.3. [14], [60] Let d ∈ N, Ω ⊆ Rd , and let C ⊆ L2 (Ω) be compact. The effective best M -term
approximation rate of the function class C ⊆ L2 (Ω) in the dictionary D = (ϕi )i∈N ⊆ L2 (Ω) satisfies
In light of this result the following definition is natural (see also [60]).
Definition V.4. (Kolmogorov-Donoho optimality) Let d ∈ N, Ω ⊆ Rd , and let C ⊆ L2 (Ω) be compact. If the
effective best M -term approximation rate of the function class C ⊆ L2 (Ω) in the dictionary D = (ϕi )i∈N ⊆ L2 (Ω)
satisfies
γ ∗,eff (C, D) = γ ∗ (C),
27
As the ideas underlying the proof of Theorem V.3 are essential ingredients in the development of a kindred
theory of best M -weight approximation rates for neural networks, we present a detailed proof, which is similar to
that in [60]. We perform, however, some minor technical modifications with an eye towards rendering the proof a
suitable genesis for the new theory of best M -weight approximation with neural networks, developed in the next
section. The spirit of the proof is to construct, for every given M ∈ N an encoder that, for each f ∈ C, maps
the indices of the dictionary elements participating in the effective best M -term approximation3 of f , along with
the corresponding coefficients ci , to a bitstring. This bitstring needs to be of sufficient length for the decoder to
be able to reconstruct an approximation to f with an error which is of the same order as that of the best M -term
approximation we started from. As elucidated in the proof, this can be accomplished while ensuring that the length
of the bitstring is proportional to M log(M ), which upon noting that ε = M −γ implies M = ε−1/γ , establishes
optimality.
Proof of Theorem V.3. The proof will be based on showing that for every γ ∈ R+ the following Implication (I)
holds: Assume that there exist a constant C > 0 and a polynomial π such that for every M ∈ N, the following
holds: For every f ∈ C, there are an index set If,M ⊆ {1, 2, . . . , π(M )} and coefficients (ci )i∈If,M ⊆ R with
|ci | ≤ π(M ) so that
X
ci ϕi
L2 (Ω) ≤ CM −γ .
f − (20)
i∈If,M
This implies the existence of a constant C 0 > 0 such that for every M ∈ N, there is an encoder-decoder pair
(EM , DM ) ∈ E`(M ) × D`(M ) with `(M ) ≤ C 0 M log(M ) and
The implication will be proven by explicit construction. For a given f ∈ C, we pick an M -term approximation
according to (20) and encode the associated index set If,M and weights ci as follows. First, note that owing
to |If,M | ≤ π(M ), each index in If,M can be represented by at most Cπ log(M ) bits; this results in a total of
Cπ M log(M ) bits needed to encode the indices of all dictionary elements participating in the M -term approximation.
The encoder and the decoder are assumed to know Cπ , which allows stacking of the binary representations of the
indices such that the decoder can read them off uniquely from the sequence of their binary representations.
We proceed to the encoding of the coefficients ci . First, note that even though the ci are bounded (namely,
polynomially in M ) by assumption, we did not impose bounds on the norms of the dictionary elements {ϕi }i∈If,M
participating in the M -term approximation under consideration. Hence, we can not, in general, expect to be able to
control the approximation error incurred by reconstructing f from quantized ci . We can get around this by performing
a Gram-Schmidt orthogonalization on the dictionary elements {ϕi }i∈If,M and, as will be seen later, using the fact
3 Note that as we have an infimum in (18) an effective best M -term approximation need not exist, but we can pick an M -term approximation
that yields an error arbitrarily close to the infimum.
28
that the function class C was assumed to be compact. Specifically, this Gram-Schmidt orthogonalization yields a
set of functions {ϕ̃i }i∈I˜ f ≤ M , that has the same span as {ϕi }i∈I . Next, we define (implicitly) the
, with M f,M
f,M
f
As C is compact by assumption, we have supf ∈C kf k2L2 (Ω) < ∞, which establishes that the coefficients c̃i are
uniformly bounded. This, in turn, allows us to quantize them, specifically, we shall round the c̃i to integer multiples
of M −(γ+1/2) , and denote the resulting rounded coefficients by ĉi . As the c̃i are uniformly bounded, this results
in a number of quantization levels that is proportional to M (γ+1/2) . The number of bits needed to store the binary
representations of the quantized coefficients is therefore proportional to M log(M ). Again, the proportionality
constant is assumed known to encoder and decoder, which allows us to stack the binary representations of the
quantized coefficients in a uniquely decodable manner. The resulting bitstring is then appended to the bitstring
encoding the indices of the participating dictionary elements. We finally note that the specific choice of the exponent
γ + 1/2 is informed by the upper bound on the reconstruction error we are allowed, this will be made explicit
below in the description of the decoder.
In summary, we have mapped the function f to a bitstring of length O(M log(M )). The decoder is presented
with this bitstring and reconstructs an approximation to f as follows. It first reads out the indices of the set If,M
and the quantized coefficients ĉi . Recall that this is uniquely possible. Next, the decoder performs a Gram-Schmidt
orthonormalization on the set of dictionary elements indexed by If,M . The error resulting from reconstructing the
function f from the quantized coefficients ĉi rather than the exact coefficients c̃i can be bounded according to
X
X X X
f − ĉi ϕ̃i
=
f − c̃i ϕ̃i + c̃i ϕ̃i − ĉi ϕ̃i
i∈I˜f,M
2
i∈I˜f,M i∈I˜f,M i∈I˜f,M
2
L (Ω) L (Ω)
f f f f
X
X
≤
f − c̃i ϕ̃i
+
(c̃i − ĉi )ϕ̃i
(23)
i∈I˜f,M
2
i∈I˜f,M
2
L (Ω) L (Ω)
f f
1/2
X
X
=
f − c̃i ϕ̃i
+ |c̃i − ĉi |2 ,
i∈I˜f,M
i∈I˜f,M
L2 (Ω)
f f
29
where in the last step we again exploited the orthonormality of the ϕ̃i . Next, note that due to the choice of the
quantizer resolution, we have |c̃i − ĉi |2 ≤ C 00 M −2γ−1 for some constant C 00 . With M
f ≤ M this yields
X
|c̃i − ĉi |2 ≤ C 00 M −2γ .
i∈I˜f,M
f
for some constant C 0 . As the length of the bitstring used in this construction is proportional to M log(M ), the
claim (21) is established.
Now, we note that the antecedent of Implication (I) holds for all γ < γ ∗,eff (C, D). Assume next, towards a
contradiction, that the antecedent holds for a γ > γ ∗ (C). This would imply that for any γ 0 < γ,
0
inf sup kD(E(f )) − f kL2 (Ω) ∈ O L−γ , L → ∞. (24)
(E,D)∈EL ×DL f ∈C
In particular, (24) would hold for some γ 0 > γ ∗ (C) which, owing to (13) stands in contradiction to the definition
of γ ∗ (C). This completes the proof.
Table 1: Optimal exponents and corresponding optimal dictionaries. U(X) = {f ∈ X : kf kX ≤ 1} denotes the
unit ball in the space X and Ω ⊆ Rd is a Lipschitz domain. Recall that compactness of these unit balls is w.r.t.
L2 -norm.
30
The optimal exponent γ ∗ (C) is known for various function classes such as unit balls in Besov spaces Bp,q
m
(Rd )
with p, q ∈ (0, ∞] and m > d(1/p − 1/2)+ , where γ ∗ (C) = m/d (see [76]), and unit balls in (polynomially)
s
weighted modulation spaces Mp,p (Rd ) with p ∈ (1, 2) and s ∈ R+ , where γ ∗ (C) = ( p1 − 1
2 + 2s −1
d ) (see [77]). A
further example is the set of β-cartoon-like functions, which are β-smooth on some bounded d-dimensional domain
with sufficiently smooth boundary and zero otherwise. Here, we have γ ∗ (C) = β(d − 1)/2 (see [79], [78], [23]).
These examples along with additional ones are summarized in Table 1. For an extensive summary of metric entropy
results and techniques for their derivation, we also refer to [64].
We conclude this section with general remarks on certain formal aspects of the Kolmogorov-Donoho rate-distortion
framework. First, we note that for the set C ⊆ L2 (Ω) to have a well-defined optimal exponent it must be relatively
compact9 . This follows from the fact that the set over which the minimum in the definition (12) of L(ε, C) is taken
must be nonempty for every ε ∈ (0, ∞). To see this, note that every length-L(ε, C) encoder-decoder pair induces an
ε-covering of C with at most 2L(ε,C) balls (and ball centers {D(E(f ))}f ∈C ). It hence follows that C must be totally
bounded and thus relatively compact as a consequence of L2 (Ω) being a complete metric space [80, Thm. 45.1].
As shown in the proof of Theorem V.3, effective best M -term approximations construct encoder-decoder pairs
and thereby induce ε-coverings. By the arguments just made, this implies that also γ ∗,eff (C, D) is well-defined only
for compact function classes C.
A consequence of the compactness requirement on C is that the spaces in Table 1 either consist of functions on
bounded domains or, in the case of modulation spaces, are equipped with a weighted norm. In order to provide
intuition on why this must be so, let us consider a function space (X, k·kX ) with X ⊆ L2 (Rd ) and k·kX translation
4
invariant. Take ε > 0 and f ∈ X with kf kX = 1 and choose C > 0 such that kf kL2 ([−C,C]d ) > 5 kf kL2 (Rd ) .
Now, consider the family of translates of f given by fi (x) := f (x − 2Ci), i ∈ Zd , and note that kfi kX = 1 for all
i ∈ Zd by translation invariance of k · kX . Furthermore, we have
21 12
kfi kL2 ([−C,C]d ) = kfi k2L2 (Rd ) − kfi k2L2 (Rd \[−C,C]d ) ≤ kf k2L2 (Rd ) − kf k2L2 ([−C,C]d ) < 35 kf kL2 (Rd )
kfi − fj kL2 (Rd ) = kfi−j − f kL2 (Rd ) ≥ kfi−j − f kL2 ([−C,C]d ) > 51 kf kL2 (Rd ) (25)
for all i, j ∈ Zd , with i 6= j, by the reverse triangle inequality. As such no ε-ball (w.r.t. L2 (Rd )-norm) with
1
ε≤ 10 kf kL2 (Rd ) can contain more than one of the infinitely many (fi )i∈Zd which are, however, all contained in
the unit ball U(X) of the space (X, k · kX ). This implies that U(X) cannot be totally bounded and thereby not
relatively compact (w.r.t. L2 (Rd )-norm). Somewhat nonchalantly speaking, for spaces equipped with translation-
invariant norms this issue can be avoided by considering functions that live on a bounded domain, which ensures that
9 For the sake of simplicity, we assume, however, compactness throughout even though relative compactness (i.e. having a compact closure)
would be sufficient.
31
(25) pertains only to a finite number of translates. Alternatively, for spaces of functions living on unbounded domains
once can consider weighted norms that are not translation invariant. Here, the weighting effectively constrains the
functions to a bounded domain.
The less restrictive concept of best M -term approximation rate γ ∗ (C, D) (see Definition V.1) is, in apparent
contrast, often studied for noncompact function classes C.
In [75, Sec. 15.2] a condition for γ ∗,eff (C, D) and γ ∗ (C, D) to coincide is presented. Specifically, this condition,
referred to as tail compactness, is expressed as follows. Let C ⊆ L2 (Ω) be bounded and let D = {ϕi }i∈N be an
ordered orthonormal basis for C. We say that tail compactness holds if there exist C, β > 0 such that for all N ∈ N,
X N
sup
f − hf, ϕi iϕi
≤ CN −β . (26)
f ∈C
i=1
2
L (Ω)
∗,eff ∗
In order to see that (26) implies γ (C, D) = γ (C, D), we consider, for fixed f ∈ C, the (unconstrained) best
P
M -term approximation fM = i∈I hf, ϕi iϕi with I ⊆ N, |I| = M . We now modify this M -term approximation by
letting α := dγ ∗ (C, D)/βe ∈ N and removing, in the expansion fM = i∈I hf, ϕi iϕi , all terms corresponding to
P
indices that are larger than M α . Recalling that in Definition V.2 the same polynomial π bounds the search depth and
the size of the coefficients, it follows that the modified approximation we just constructed obeys a polynomial depth
search constraint with constraining polynomial πα (x) = xα + S, where S := supf ∈C kf kL2 (Ω) . Here, owing to
orthonormality of D, S accounts for the size of the expansion coefficients hf, ϕi i. In order to complete the argument,
P
we need to show that the additional approximation error incurred by removing terms in fM = i∈I hf, ϕi iϕi is in
∗
O(M −γ (C,D)
), i.e., it is of the same order as the error corresponding to the original (unconstrained) best M -term
P
approximation. Due to orthonormality of D this additional error is given by the norm of i∈I,i>πα (M ) hf, ϕi iϕi
and can, by virtue of (26), be bounded as
∞
πα (M )
X
X
X
hf, ϕi iϕ i
≤
hf, ϕi iϕ i
=
f − hf, ϕi iϕi
i∈I,i>πα (M )
2
i=πα (M )+1
i=1
L (Ω) L2 (Ω) L2 (Ω)
∗
≤ C(πα (M ))−β ∈ O(M −γ (C,D)
),
which establishes the claim. We have hence shown that under tail compactness of arbitrary rate β > 0, γ ∗ (C, D) =
γ ∗,eff (C, D), and hence there is no cost incurred by imposing a polynomial depth search constraint combined with
a polynomial bound on the size of the expansion coefficients. We hasten to add that the assumptions stated at the
beginning of this paragraph together with what was just established imply that γ ∗,eff (C, D) is, indeed, well-defined.
For the more general case of D a frame, we refer to [60, Sec. 5.4.3] for analogous arguments. Finally, we remark
that the tail compactness inequality (26) can be interpreted as quantifying the rate of linear approximation for C
in D. Two examples of pairs (C, D) satisfying tail compactness, namely Besov spaces with wavelet bases and
modulation spaces with Wilson bases, are provided in Appendices B and C, respectively.
32
As already mentioned, a larger optimal exponent γ ∗ (C) leads to faster error decay (specifically according to
∗
L−γ (C)
) and hence corresponds to a function class of smaller complexity. As such, techniques for deriving lower
bounds on the optimal exponent are often based on variations of the approach employed in the proof of Theorem V.3,
namely on the explicit construction of encoder-decoder pairs (in the case of the proof of Theorem V.3 by encoding
the dictionary elements participating in the M -term approximation). A powerful method for deriving upper bounds
on the optimal exponent is the hypercube embedding approach proposed by Donoho in [79]; the basic idea here is
to show that the function class C under consideration contains a sufficiently complex embedded set of orthogonal
hypercubes and to then find the exponent corresponding to this set. An interesting alternative technique for deriving
optimal exponents was proposed in the context of modulation spaces in [77]. The essence of this approach is to
exploit the isomorphism between weighted modulation spaces and weighted mixed-norm sequence spaces [17] and
to then utilize results about entropy numbers of operators between sequence spaces.
Inspired by the theory of best M -term approximation with dictionaries, we now develop the new concept of best
M -weight approximation through neural networks. At the heart of this theory lies the interpretation of the network
weights as the counterpart of the coefficients ci in best M -term approximation. In other words, parsimony in terms
of the number of participating elements in a dictionary is replaced by parsimony in terms of network connectivity.
Our development will parallel that for best M -term approximation in the previous section.
Before proceeding to the specifics, we would like to issue a general remark. While the neural network ap-
proximation results in Section III were formulated in terms of L∞ -norm, we shall be concerned with L2 -norm
approximation here, on the one hand paralleling the use of L2 -norm in the context of best M -term approximation,
and on the other hand allowing for the approximation of discontinuous functions by ReLU neural networks, which,
owing to the continuity of the ReLU nonlinearity, necessarily realize continuous functions.
We start by introducing the concept of best M -weight approximation rate.
Definition VI.1. Given d ∈ N, Ω ⊆ Rd , and a function class C ⊆ L2 (Ω), we define, for f ∈ C and M ∈ N,
ΓN
M (f ) := inf kf − ΦkL2 (Ω) . (27)
Φ∈Nd,1
M(Φ)≤M
We call ΓN
M (f ) the best M -weight approximation error of f . The supremal γ > 0 such that
sup ΓN
M (f ) ∈ O(M
−γ
), M → ∞,
f ∈C
∗ ∗
will be denoted by γN (C). We say that the best M -weight approximation rate of C by neural networks is γN (C).
We emphasize that the infimum in (27) is taken over all networks with fixed input dimension d, no more than M
nonzero (edge and node) weights, and arbitrary depth L. In particular, this means that the infimum is with respect
33
to all possible network topologies and weight choices. The best M -weight approximation rate is fundamental as it
benchmarks all algorithms that map a function f and an ε > 0 to a neural network approximating f with error no
more than ε.
The two restrictions underlying the concept of effective best M -term approximation through dictionaries, namely
polynomial depth search and polynomially bounded coefficients, are next addressed in the context of approximation
through deep neural networks. We start by noting that the need for the former is obviated by the tree-like-structure
of neural networks. To see this, first note that W(Φ) ≤ M(Φ) and L(Φ) ≤ M(Φ). As the total number of nonzero
weights in the network can not exceed L(Φ)W(Φ)(W(Φ) + 1), this yields at most O(M(Φ)3 ) possibilities for the
“locations” (in terms of entries in the A` and the b` ) of the M(Φ) nonzero weights. Encoding the locations of the
3
M(Φ) nonzero weights hence requires log( CM(Φ) M(Φ)
) = O(M(Φ) log(M(Φ))) bits. This assumes, however, that
the architecture of the network, i.e., the number of layers L(Φ) and the Nk are known. Proposition VI.7 below
shows that the architecture can, indeed, also be encoded with O(M(Φ) log(M(Φ))) bits. In summary, we can
therefore conclude that the tree-like-structure of neural networks automatically guarantees what we had to enforce
through the polynomial depth search constraint in the case of best M -term approximation.
Inspection of the approximation results in Section III reveals that a sublinear growth restriction on L(Φ) as a
function of M(Φ) is natural. Specifically, the approximation results in Section III all have L(Φ) proportional to a
polynomial in log(ε−1 ). As we are interested in approximation error decay according to M(Φ)−γ , see Definition
VI.1, this suggests to restrict L(Φ) to growth that is polynomial in log(M(Φ)).
The second restriction imposed in the definition of effective best M -term approximation, namely polynomially
bounded coefficients, will be imposed in monomorphic manner on the magnitude of the weights. This growth
condition will turn out natural in the context of the approximation results we are interested in and will, together
with polylogarithmic depth growth, be seen below to allow rate-distortion-optimal quantization of the network
weights. We remark, however, that networks with weights growing polynomially in M(Φ) can be converted into
networks with uniformly bounded weights at the expense of increased—albeit still of polylogarithmic scaling in
M(Φ)—depth (see Proposition A.3). In summary, we will develop the concept of “best M -weight approximation
subject to polylogarithmic depth and polynomial weight growth”.
We start by introducing the following notation for neural networks with depth and weight magnitude bounded
polylogarithmically respectively polynomially w.r.t. their connectivity.
π
NM,d,d0 := {Φ ∈ Nd,d0 : M(Φ) ≤ M, L(Φ) ≤ π(log(M )), B(Φ) ≤ π(M )} .
Next, we formalize the notion of effective best M -weight approximation rate subject to polylogarithmic depth
and polynomial weight growth.
34
Definition VI.3. Let d ∈ N, Ω ⊆ Rd , and let C ⊆ L2 (Ω) be compact. We define for M ∈ N and π a polynomial
and
∗,eff
γN (C) := sup{γ ≥ 0 : ∃ polynomial π s.t. επN (M ) ∈ O(M −γ ), M → ∞}.
∗,eff
We refer to γN (C) as the effective best M -weight approximation rate of C.
We now state the equivalent of Theorem V.3 for approximation by deep neural networks. Specifically, we establish
that the optimal exponent γ ∗ (C) constitutes a fundamental bound on the effective best M -weight approximation
rate of C as well.
The key ingredients of the proof of Theorem VI.4 are developed throughout this section and the formal proof
appears at the end of the section. Before getting started, we note that, in analogy to Definition V.4, what we just
found suggests the following.
Definition VI.5. Let d ∈ N, Ω ⊆ Rd , and let C ⊆ L2 (Ω) be compact. We say that the function class C ⊆ L2 (Ω)
is optimally representable by neural networks if
∗,eff
γN (C) = γ ∗ (C).
It is interesting to observe that the fundamental limits of effective best M -term approximation (through dictionar-
ies) and effective best M -weight approximation in neural networks are determined by the same quantity, although
the approximants in the two cases are vastly different. We have linear combinations of elements of a dictionary
under polynomial weight growth of the coefficients and with the participating functions identified subject to a
polynomial-depth search constraint in the former, and concatenations of affine functions followed by nonlinearities
under polynomial growth constraints on the coefficients of the affine functions and with a polylogarithmic growth
constraint on the number of concatenations in the latter case.
We now commence the program developing the proof of Theorem VI.4. As in the arguments in the proof sketch
of Theorem V.3, the main idea is to compare the length of the bitstring needed to encode the approximating network
to the minimax code length of the function class C to be approximated. To this end, we will need to represent the
approximating network’s nonzero weights, its architecture, i.e., L and the Nk , and the nonzero weights’ locations
as a bitstring. As the weights are real numbers and hence require, in principle, an infinite number of bits for their
binary representations, we will have to suitably quantize them. In particular, the resolution of the corresponding
35
quantizer will have to increase appropriately with decreasing ε. To formalize this idea, we start by defining the
quantization employed.
Definition VI.6. Let m ∈ N and ε ∈ (0, 1/2). The network Φ is said to have (m, ε)-quantized weights if all its
−1
weights are elements of 2−mdlog(ε )e
Z ∩ [−ε−m , ε−m ].
A key ingredient of the proof of Theorem VI.4 is the following result, which establishes a fundamental lower
bound on the connectivity of networks with quantized weights achieving uniform error ε over a given function class
C.
Ψ : 0, 12 × C → Nd,d0
be a map such that for every ε ∈ (0, 1/2), f ∈ C, the network Ψ(ε, f ) has (dπ(log(ε−1 ))e, ε)-quantized weights
and satisfies
sup kf − Ψ(ε, f )kL2 (Ω) ≤ ε.
f ∈C
Then,
/ O ε−1/γ , ε → 0,
sup M(Ψ(ε, f )) ∈ for all γ > γ ∗ (C).
f ∈C
Proof. The proof is by contradiction. Let γ > γ ∗ (C) and assume that supf ∈C M(Ψ(ε, f )) ∈ O(ε−1/γ ), ε → 0. The
contradiction will be effected by constructing encoder-decoder pairs (Eε , Dε ) ∈ E`(ε) × D`(ε) achieving uniform
error ε over C with
where C0 , C1 , q > 0 are constants not depending on f, ε and γ > ν > γ ∗ (C). The specific form of the upper bound
(28) will become apparent in the construction of the bitstring representing Ψ detailed below.
We proceed to the construction of the encoder-decoder pairs (Eε , Dε ) ∈ E`(ε) ×D`(ε) , which will be accomplished
by encoding the network architecture, its topology, and the quantized weights in bitstrings of length `(ε) satisfying
(28) while guaranteeing unique reconstruction (of the network). For the sake of notational simplicity, we fix ε ∈
(0, 1/2) and f ∈ C and set Ψ := Ψ(ε, f ), M := M(Ψ), and L := L(Ψ). Recall that the number of nodes in layers
0, . . . , L is denoted by N0 , . . . , NL and that N0 = d, NL = d0 (see Definition II.1). Moreover, note that due to our
PL
nondegeneracy assumption (see Remark II.2) we have `=0 N` ≤ 2M and L ≤ M . The bitstring representing Ψ
is constructed according to the following steps.
36
Step 1: If M = 0, we encode the network by a single 0. Using the convention 0 log(0) = 0, we then note that
(28) holds trivially and we terminate the encoding procedure. Else, we encode the network connectivity, M , by
starting the overall bitstring with M 1’s followed by a single 0. The length of this bitstring is therefore given by
M + 1.
Step 2: We continue by encoding the number of layers which, due to L ≤ M , requires no more than dlog(M )e
bits. We thus reserve the next dlog(M )e bits for the binary representation of L.
Step 3: Next, we store the layer dimensions N0 , . . . , NL . As L ≤ M and N` ≤ M , for all ` ∈ {0, . . . , L}, owing
to nondegeneracy, we can encode the layer dimensions using (M + 1)dlog(M )e bits. In combination with Steps 1
and 2 this yields an overall bitstring of length at most
Step 4: We encode the topology of the graph associated with the network Ψ. To this end, we enumerate all nodes
by assigning a unique index i to each one of them, starting from the 0-th layer and increasing from left to right
PL
within a given layer. The indices range from 1 to N := `=0 N` ≤ 2M . Each of these indices can be encoded by
a bitstring of length dlog(N )e. We denote the bitstring corresponding to index i by b(i) ∈ {0, 1}dlog(N )e and let
for all nodes, except for those in the last layer, n(i) be the number of children of the node with index i, i.e., the
number of nodes in the next layer connected to the node with index i via an edge. For each of these nodes i, we
form a bitstring of length n(i)dlog(N )e by concatenating the bitstrings indexing its children. We follow this string
with an all-zeros bitstring of length dlog(N )e to signal that all children of the current node have been encoded.
Overall, this yields a bitstring of length
−d0
NX
(n(i) + 1)dlog(N )e ≤ 3M dlog(2M )e, (30)
i=1
PN −d0
where we used i=1 n(i) ≤ M .
Step 5: We encode the weights of Ψ. By assumption, Ψ has (dπ(log(ε−1 ))e, ε)-quantized weights, which means
that each weight of Ψ can be represented by no more than Bε := 2(dπ(log(ε−1 ))edlog(ε−1 )e + 1) bits. For each
node i = 1, . . . , N , we reserve the first Bε bits to encode its associated node weight and, for each of its children
a bitstring of length Bε to encode the weight corresponding to the edge between the current node and that child.
Concatenating the results in ascending order of child node indices, we get a bitstring of length (n(i) + 1)Bε for
node i, and an overall bitstring of length
−d0
NX
(n(i) + 1)Bε + d0 Bε ≤ 3M Bε
i=1
representing the weights. Combining this with (29) and (30), we find that the overall number of bits needed to
encode the network architecture, topology, and weights is no more than
37
The network can be recovered by sequentially reading out M, L, the N` , the topology, and the quantized weights
from the overall bitstring. It is not difficult to verify that the individual steps in the encoding procedure were crafted
such that this yields unique recovery. As (31) can be upper-bounded by
for constants C0 , q > 0 depending on π only, we have constructed an encoder-decoder pair (Eε , Dε ) ∈ E`(ε) ×D`(ε)
with `(ε) satisfying (28). This concludes the proof.
Proposition VI.7 states that the connectivity growth rate of networks with quantized weights achieving uniform
∗
approximation error ε over a function class C must exceed O ε−1/γ (C) , ε → 0. As Proposition VI.7 applies to
networks that have each weight represented by a finite number of bits scaling polynomially in log(ε−1 ), while
guaranteeing that the underlying encoder-decoder pair achieves uniform error ε over C, it remains to establish that
such a compatibility is, indeed, possible. Specifically, this requires a careful interplay between the network’s depth
and connectivity scaling, and its weight growth, all as a function of ε. Establishing that this delicate balancing is
implied by our technical assumptions is the subject of the remainder of this section. We start with a perturbation
result quantifying how the error induced by weight quantization in the network translates to the output function
realized by the network.
Lemma VI.8. Let d, d0 , k ∈ N, D ∈ R+ , Ω ⊆ [−D, D]d , ε ∈ (0, 1/2), let Φ ∈ Nd,d0 with M(Φ) ≤ ε−k ,
B(Φ) ≤ ε−k , and let m ∈ N satisfy
Then, there exists a network Φ̃ ∈ Nd,d0 with (m, ε)-quantized weights satisfying
More specifically, the network Φ̃ can be obtained simply by replacing every weight in Φ by a closest element in
−mdlog(ε−1 )e
2 Z ∩ [−ε−m , ε−m ].
Proof of Theorem VI.8. We first consider the case L(Φ) = 1. Here, it follows from Definition II.1 that the network
simply realizes an affine transformation and hence
−1
sup kΦ(x) − Φ̃(x)k∞ ≤ M(Φ)dDe2−mdlog(ε )e−1
≤ ε.
x∈Ω
In the remainder of the proof, we can therefore assume that L(Φ) ≥ 2. For simplicity of notation, we set L :=
L(Φ), M := M(Φ), and, as usual, write
Φ = WL ◦ ρ ◦ WL−1 ◦ ρ ◦ · · · ◦ ρ ◦ W1
38
with W` (x) = A` x + b` , A` ∈ RN` ×N`−1 , and b` ∈ RN` . We now consider the partial networks Φ` : Ω → RN` ,
` ∈ {1, 2, . . . , L − 1}, given by
ρ ◦ W1 , `=1
Φ` := ρ ◦ W2 ◦ ρ ◦ W1 , `=2
ρ◦W ◦ρ◦W
` `−1 ◦ · · · ◦ ρ ◦ W1 , ` = 3, . . . , L − 1,
and set ΦL := Φ. We hasten to add that we decided—for ease of exposition—to deviate from the convention used in
Definition II.1 and to have the partial networks include the application of ρ at the end. Now, for ` ∈ {1, 2, . . . , L},
let Φ̃` be the (partial) network obtained by replacing all the entries of the A` and b` by a closest element in
−1
2−mdlog(ε )e
Z ∩ [−ε−m , ε−m ]. We denote these replacements by Ã` and b̃` , respectively, and note that
−1
max |A`,i,j − Ã`,i,j | ≤ 1
2 2−mdlog(ε )e
≤ 1
2 εm ,
i,j
(33)
−1
max |b`,i,j − b̃`,i,j | ≤ 1
2 2−mdlog(ε )e
≤ 1
2 εm .
i,j
The proof will be effected by upper-bounding the error building up across layers as a result of this quantization.
To this end, we define, for ` ∈ {1, 2, . . . , L}, the error in the `-th layer as
We further set C0 := dDe and C` := max{1, supx∈Ω kΦ` (x)k∞ }. As each entry of the vector Φ` (x) ∈ RN` is
obtained by applying10 the 1-Lipschitz function ρ to the sum of a weighted sum of at most N`−1 components
of the vector Φ`−1 (x) ∈ RN`−1 and a bias component b`,i , and B(Φ) ≤ ε−k by assumption, we have for all
` ∈ {1, 2, . . . , L},
Next, note that the components (Φ̃1 (x))i , i ∈ {1, 2, . . . , N1 }, of the vector Φ̃1 (x) ∈ RN1 can be written as
XN0
(Φ̃1 (x))i = ρ Ã1,i,j xj + b̃1,i ,
j=1
which, combined with (33) and the fact that ρ is 1-Lipschitz implies
m
εm m
e1 ≤ C0 N0 ε2 + 2 ≤ C0 (N0 + 1) ε2 . (35)
10 Note that going from ΦL−1 to ΦL the activation function is not applied anymore, which nevertheless leads to the same estimate as the
identity mapping is 1-Lipschitz.
39
Due to ρ and the identity mapping being 1-Lipschitz, we have, for ` = 1, . . . , L,
e` = sup kΦ` (x) − Φ̃` (x)k∞ = sup |(Φ` (x))i − (Φ̃` (x))i |
x∈Ω x∈Ω,i∈{1,...,N` }
N`−1 N`−1
X X
`−1 `−1
≤ sup A`,i,j (Φ (x))j + b`,i −
Ã`,i,j (Φ̃ (x))j + b̃`,i
x∈Ω,i∈{1,...,N` } (36)
j=1 j=1
N`−1
X
≤ sup A`,i,j (Φ`−1 (x))j − Ã`,i,j (Φ̃`−1 (x))j + b`,i − b̃`,i .
x∈Ω,i∈{1,...,N` } j=1
As |(Φ`−1 (x))j − (Φ̃`−1 (x))j | ≤ e`−1 and |(Φ`−1 (x))j | ≤ C`−1 for all x ∈ Ω, j ∈ {1, . . . , N`−1 } by definition,
and |A`,i,j | ≤ ε−k by assumption, upon invoking (33), we get
m m
|A`,i,j (Φ`−1 (x))j − Ã`,i,j (Φ̃`−1 (x))j | ≤ e`−1 ε−k + C`−1 ε2 + e`−1 ε2 .
Since ε ∈ (0, 1/2), it therefore follows from (36), that for all ` ∈ {2, . . . , L},
m m
εm m
e` ≤ N`−1 (e`−1 ε−k + C`−1 ε2 + e`−1 ε2 ) + 2 ≤ (N`−1 + 1)(2e`−1 ε−k + C`−1 ε2 ). (37)
which we prove by induction. The base case ` = 1 was already established in (35). For the induction step we
assume that (38) holds for a given ` which, in combination with (34) and (37), implies
m
e`+1 ≤ N` + 1)(2e` ε−k + C` ε2
`−1 `−1
!
m
Y Y
m−(`−1)k −k
`
≤ (N` + 1) (2 − 1)C0 ε ε (Ni + 1) + C0 ε−k` ε2 (Ni + 1)
i=0 i=0
`
1 `+1 Y
= (2 − 1)C0 εm−`k (Ni + 1).
2 i=0
QL−1
This completes the induction argument and establishes (38). Using 2L−1 ≤ ε−(L−1) , i=0 (Ni +1) ≤ M L ≤ ε−kL ,
and m ≥ 3kL + log(dDe) by assumption, we get
L−1
Y
sup kΦ(x) − Φ̃(x)k∞ = eL ≤ 12 (2L − 1)C0 εm−(L−1)k (Ni + 1)
x∈Ω i=0
≤ εm−(L−1+kL−k+log(dDe)+kL)
≤ εm−(3kL+log(dDe)−1) ≤ ε.
40
∗,eff ∗,eff
(C) > γ ∗ (C) and let γ ∈ γ ∗ (C), γN
Proof of Theorem VI.4. Suppose towards a contradiction that γN (C) .
Then, by Definition VI.3, there exist a polynomial π and a constant C > 0 such that
sup inf
π
kf − ΦkL2 (Ω) ≤ CM −γ , for all M ∈ N.
f ∈C Φ∈NM,d,1
Setting Mε := (ε/(4C))−1/γ , it follows that, for every f ∈ C and every ε ∈ (0, 1/2), there exists a neural
π
network Φε,f ∈ NM ε ,d,1
such that
ε
kf − Φε,f kL2 (Ω) ≤ 2 sup inf kf − ΦkL2 (Ω) ≤ 2CMε−γ ≤ . (39)
π
f ∈C Φ∈NMε ,d,1 2
By Lemma VI.8 there exists a polynomial π ∗ such that for every f ∈ C, ε ∈ (0, 1/2), there is a network Φ
e ε,f with
The conditions of Lemma VI.8 are satisfied as Mε can be upper-bounded by ε−k with a suitably chosen k, the
π
weights in Φε,f are polynomially bounded in Mε , and (32) follows from the depth of networks in Φ ∈ NM ε ,d,1
Ψ : 0, 21 × C → Nd,1 ,
(ε, f ) 7→ Φ
e ε,f ,
it follows from (39) and (40), by application of the triangle inequality, that
We conclude this section with a discussion of the conceptual implications of the results established above.
Proposition VI.7 combined with Lemma VI.8 establishes that neural networks achieving uniform approximation
error ε while having weights that are polynomially bounded in ε−1 and depth growing polylogarithmically in ε−1
∗
cannot exhibit connectivity growth rate smaller than O(ε−1/γ (C)
), ε → 0; in other words, a decay of the uniform
∗
approximation error, as a function of M , faster than O(M −γ (C)
), M → ∞, is not possible.
We have seen that a wide array of function classes can be approximated in Kolmogorov-Donoho optimal fashion
through dictionaries, provided that the dictionary D is chosen to consort with the function class C according to
γ ∗,eff (C, D) = γ ∗ (C). Examples of such pairs are unit balls in Besov spaces with wavelet bases and unit balls in
weighted modulation spaces with Wilson bases. A more extensive list of optimal pairs is provided in Table 1. On
the other hand, as shown in [14], Fourier bases are strictly suboptimal—in terms of approximation rate—for balls
C of finite radius in the spaces BV (R) and Wpm (R).
41
In light of what was just said, it is hence natural to let neural networks play the role of the dictionary D and to ask
which function classes C are approximated in Kolmogorov-Donoho-optimal fashion by neural networks. Towards
answering this question, we next develop a general framework for transferring results on function approximation
through dictionaries to results on approximation by neural networks. This will eventually lead us to a characterization
of function classes C that are optimally representable by neural networks in the sense of Definition VI.5.
We start by introducing the notion of effective representability of dictionaries through neural networks.
Definition VII.1. Let d ∈ N, Ω ⊆ Rd , and D = (ϕi )i∈N ⊆ L2 (Ω) be a dictionary. We call D effectively
representable by neural networks, if there exists a bivariate polynomial π such that for all i ∈ N, ε ∈ (0, 1/2),
there is a neural network Φi,ε ∈ Nd,1 satisfying M(Φi,ε ) ≤ π(log(ε−1 ), log(i)), B(Φi,ε ) ≤ π(ε−1 , i), and
The next result will allow us to conclude that optimality—in the sense of Definition V.4—of a dictionary D for a
function class C combined with effective representability of D by neural networks implies optimal representability of
C by neural networks. The proof is, in essence, effected by noting that every element of the effectively representable
D participating in a best M -term-rate achieving approximation fM of f ∈ C can itself be approximated by
neural networks well enough for an overall network to approximate fM with connectivity M π(log(M )). As
this connectivity is only polylogarithmically larger than the number of terms M participating in the best M -
term approximation fM , we will be able to conclude that the optimal approximation rate, indeed, transfers from
approximation in D to approximation in neural networks. The conditions on M(Φi,ε ) and B(Φi,ε ) in Definition
VII.1 guarantee precisely that the connectivity increase is at most by a polylogarithmic factor. To see this, we first
recall that effective best M -term approximation has a polynomial depth search constraint, which implies that the
indices i under consideration are upper-bounded by a polynomial in M . In addition, the approximation error behavior
we are interested in is ε = M −γ . Combining these two insights, it follows that M(Φi,ε ) ≤ π(log(ε−1 ), log(i))
implies polylogarithmic (in M ) connectivity for each network Φi,ε and hence connectivity M π(log(M )) for the
overall network realizing fM , as desired. By the same token, B(Φi,ε ) ≤ π(ε−1 , i) guarantees that the weights of
Φi,ε are polynomial in M .
There is another aspect to effective representability by neural networks that we would like to illustrate by way
of example, namely that of ordering the dictionary elements. Specifically, we consider, for d = 1 and Ω = [−π, π),
the class C of real-valued even functions in C = L2 (Ω), and take the dictionary as D = {cos(ix), i ∈ N0 }. As the
index i enumerating the dictionary elements corresponds to frequencies, the basis functions in D are hence ordered
according to increasing frequencies. Next, note that the parameter a in Theorem III.8 corresponds to the frequency
index i in our example. As the network Ψa,D,ε in Theorem III.8 is of finite width, it hence follows, upon replacing
a in the expression for L(Ψa,D,ε ) by i, that M(Ψi,D,ε ) ≤ π(log(ε−1 ), log(i)). The condition on the weights for
effective representability is satisfied trivially, simply as B(Ψi,D,ε ) ≤ 1 ≤ π(ε−1 , i).
42
We are now ready to state the rate optimality transfer result.
Theorem VII.2. Let d ∈ N, Ω ⊆ Rd be bounded, and consider the compact function class C ⊆ L2 (Ω). Suppose
that the dictionary D = (ϕi )i∈N ⊆ L2 (Ω) is effectively representable by neural networks. Then, for every γ ∈
(0, γ ∗,eff (C, D)), there exist a polynomial π and a map
Ψ : 0, 12 × C → Nd,1 ,
such that for all f ∈ C, ε ∈ (0, 1/2), the network Ψ(ε, f ) has (dπ(log(ε−1 ))e, ε)-quantized weights while satisfying
kf − Ψ(ε, f )kL2 (Ω) ≤ ε, L(Ψ(ε, f )) ≤ π(log(ε−1 )), B(Ψ(ε, f )) ≤ π(ε−1 ), and we have
with the implicit constant in (41) being independent of f . In particular, it holds that
∗,eff
γN (C) ≥ γ ∗,eff (C, D).
Remark VII.3. Theorem VII.2 allows us to draw the following conclusion. If D optimally represents the function
class C in the sense of Definition V.4, i.e., γ ∗,eff (C, D) = γ ∗ (C), and if it is, in addition, effectively representable by
∗,eff
neural networks in the sense of Definition VII.1, then, due to Theorem VI.4, which states that γN (C) ≤ γ ∗ (C), we
∗,eff
have γN (C) = γ ∗ (C) and hence C is optimally representable by neural networks in the sense of Definition VI.5.
Proof of Theorem VII.2. Let γ 0 ∈ (γ, γ ∗,eff (C, D)). According to Definition V.2, there exist a constant C ≥ 1 and
a polynomial π1 , such that for every f ∈ C, M ∈ N, there is an index set If,M ⊆ {1, . . . , π1 (M )} of cardinality
M and coefficients (ci )i∈If,M with |ci | ≤ π1 (M ), such that
0
CM −γ
X
f − ci ϕi
≤ . (42)
2
i∈If,M
2
L (Ω)
1/2
Let A := max{1, |Ω| }. Effective representability of D according to Definition VII.1 ensures the existence of a
bivariate polynomial π2 such that for all M ∈ N, i ∈ If,M , there is a neural network Φi,M ∈ Nd,1 satisfying
C −(γ 0 +1)
kϕi − Φi,M kL2 (Ω) ≤ 4Aπ1 (M ) M (43)
with
−1
C −(γ 0 +1)
M(Φi,M ) ≤ π2 log 4Aπ1 (M ) M , log(i)
= π2 (γ 0 + 1) log(M ) + log 4AπC1 (M ) , log(i) , (44)
−1
C −(γ 0 +1) 4Aπ1 (M ) γ 0 +1
B(Φi,M ) ≤ π2 4Aπ1 (M ) M , i = π 2 C M , i .
43
Due to max(If,M ) ≤ π1 (M ), (44) and Lemma A.8 imply the existence of a polynomial π3 such that L(Ψf,M ) ≤
π3 (log(M )), M(Ψf,M ) ≤ M π3 (log(M )), and B(Ψf,M ) ≤ π3 (M ), for all f ∈ C, M ∈ N, and, owing to (43), we
get
|If,M |
0 0 maxi∈If,M |ci | 0
CM −γ CM −γ
X X X
|ci | 4AπC1 (M ) M −(γ +1)
Ψf,M − ci ϕi
≤ ≤ ≤ . (45)
4A M π1 (M ) 4A
i∈If,M
i∈If,M i=1
L2 (Ω)
Lemma VI.8 therefore ensures the existence of a polynomial π4 such that for all f ∈ C, M ∈ N, there is a
0
e f,M ∈ Nd,1 with (dπ4 (log( 4A M γ 0 ))e, CM −γ )-quantized weights satisfying L(Ψ
network Ψ e f,M ) = L(Ψf,M ),
C 4A
−γ 0
M(Ψ e f,M ) ≤ B(Ψf,M ) + CM , and
e f,M ) = M(Ψf,M ), B(Ψ
4A
0
CM −γ
Ψf,M − Ψ ≤ . (46)
e f,M
∞ 4A
L (Ω)
for all f ∈ C, M ∈ N. Combining (47) with (42) and (45), we get, for all f ∈ C, M ∈ N,
X
X
f − Ψf,M
2 ≤
f − ci ϕi
+
ci ϕi − Ψf,M
+
Ψf,M − Ψ
e
e f,M
2
L (Ω) L (Ω)
i∈If,M
2
i∈If,M
L (Ω)
2
L (Ω)
(48)
0
≤ CM −γ .
l 0
m
For ε ∈ (0, 1/2) and f ∈ C, we now set Mε := (C/ε)1/γ and
Ψ(ε, f ) := Ψ
e f,M .
ε
Since Mε and π3 are independent of f , the implicit constant in (49) does not depend on f .
Next, note that, in general, an (n, η)-quantized network is also (m, δ)-quantized for n ≥ m and η ≤ δ, simply
as
−1 −1
2−mdlog(δ )e
Z ∩ [−δ −m , δ −m ] ⊆ 2−ndlog(η )e
Z ∩ [−η −n , η −n ].
44
0
CMε−γ
Since 4A ≤ ε this ensures the existence of a polynomial π such that, for every f ∈ C, ε ∈ (0, 1/2), the network
Ψ(ε, f ) is (dπ(log(ε−1 ))e, ε)-quantized, L(Ψ(ε, f )) ≤ π(log(ε−1 )), and B(Ψ(ε, f )) ≤ π(ε−1 ). With (49) this
π
establishes the first claim of the theorem. In order to verify the second claim, note that Ψ(ε, f ) ∈ NM(Ψ(ε,f )),d,1 ,
sup inf
π
kf − ΦkL2 (Ω) ∈ O(M −γ ), M → ∞.
f ∈C Φ∈NM,d,1
Remark VII.4. We note that Theorem VII.2 continues to hold for Ω = Rn if the elements of D = (ϕi )i∈N are
compactly supported with the size of their support sets growing no more than polynomially in i. The technical
elements required to show this can be found in the context of the approximation of Gabor dictionaries in the proof
of Theorem IX.3, but are omitted here for ease of exposition.
The last piece needed to complete our program is to establish that the conditions in Definition VII.1 guaranteeing
effective representability in neural networks are, indeed, satisfied by a wide variety of dictionaries.
Inspecting Table 1, we can see that all example function classes provided therein are optimally represented
either by affine dictionaries, i.e., wavelets, the Haar basis, and curvelets or Weyl-Heisenberg dictionaries, namely
Fourier bases and Wilson bases. The next two sections will be devoted to proving effective representability of
affine dictionaries and Weyl-Heisenberg dictionaries by neural networks, thus allowing us to draw the conclusion
that neural networks are universally Kolmogorov-Donoho optimal approximators for all function classes listed in
Table 1.
The purpose of this section is to establish that affine dictionaries, including wavelets [70], ridgelets [39], curvelets
[71], shearlets [72], α-shearlets and more generally α-molecules [69], which contain all aforementioned dictionaries
as special cases, are effectively representable by neural networks. Due to Theorem VII.2 and Theorem VI.4, this
will then allow us to conclude that any function class that is optimally representable—in the sense of Definition
V.4—by an affine dictionary with a suitable generator function is optimally representable by neural networks in the
sense of Definition VI.5. By “suitable” we mean that the generator function can be approximated well by ReLU
networks in a sense to be made precise below.
45
In order to elucidate the main ideas underlying the general definition of affine dictionaries that are effectively
representable by neural networks, we start with a basic example, namely the Haar wavelet dictionary on the unit
interval, i.e., the set of functions
n
ψn,k : [0, 1] 7→ R, x 7→ 2 2 ψ(2n x − k), n ∈ N0 , k = 0, . . . , 2n − 1,
with
1, x ∈ [0, 1/2)
ψ : R → R, x 7→ −1, x ∈ [1/2, 1)
0, else.
We approximate the piecewise constant mother wavelet ψ through a continuous piecewise linear function realized
by a neural network as follows
1 1
Ψδ (x) := 2δ ρ(x + δ) − 2δ ρ(x − δ) − 1δ ρ(x − ( 21 − δ)) + 1δ ρ(x − ( 21 + δ)) + 1
2δ ρ(x − (1 − δ)) − 1
2δ ρ(x − (1 + δ))
The basic idea in the approximation of ψ through Ψδ is to let the transition regions around 0, 1/2, and 1 shrink,
as a function of ε, sufficiently fast for the construction to realize an approximation error of no more than ε. Now,
a direct calculation yields that, indeed, for ε ∈ (0, 1/2),
46
A. Affine Dictionaries with Canonical Ordering
Definition VIII.1. Let d, S ∈ N, δ > 0, Ω ⊆ Rd be bounded, and let gs ∈ L∞ (Rd ), s ∈ {1, . . . , S}, be compactly
supported. Furthermore, for s ∈ {1, . . . , S}, let Js ⊆ N and As,j ∈ Rd×d , j ∈ Js , be full-rank and with eigenvalues
bounded below by 1 in absolute value. We define the affine dictionary D ⊆ L2 (Ω) with generator functions (gs )Ss=1
as
n 1
o
D := gsj,e := |det(As,j )| 2 gs (As,j · − δe) Ω : s ∈ {1, . . . , S}, e ∈ Zd , j ∈ Js , and gsj,e 6= 0 .
where the elements within each Dj may be ordered arbitrarily, and there exist constants a, c > 0 such that
j−1
X
| det(As,k )| ≥ ckAs,j ka∞ , for all j ∈ Js \{1}, s ∈ {1, . . . , S}. (51)
k=1
We call an affine dictionary nondegenerate if for every j ∈ Js , s ∈ {1, . . . , S}, the sub-dictionary Ds,j contains at
least one element.
Note that for sake of greater generality, we associate possibly different sets Js ⊆ N with the generator functions
gs and, in particular, also allow these sets to be finite. The Haar wavelet dictionary example above is recovered
as a nondegenerate affine dictionary by taking d = 1, Ω = [0, 1], S = 1, Js = N, g1 = ψ, δ = 1, A1,j = 2j−1 ,
a = 1, c = 1/2, and noting that nondegeneracy is verified as for scale j, the sub-dictionary Ds,j contains 2j−1
elements. Moreover, the weights of the networks approximating the individual Haar wavelet dictionary elements
grow linearly in the index of the dictionary elements. This is a consequence of the weights being determined by
the dilation factor 2n and 2n(i) ≤ i due to the ordering we chose. As will be shown below, morally this continues
to hold for general nondegenerate affine dictionaries, thereby revealing what informed our definition of canonical
ordering. Besides, our notion of canonical ordering is also inspired by the ordering employed in the tail compactness
considerations for Besov spaces and orthonormal wavelet dictionaries as detailed in Appendix B. We remark that
(51) constitutes a very weak restriction on how fast the size of dilations may grow; in fact, we are not aware of
any affine dictionaries in the literature that would violate this condition. Finally, we note that the dilations As,j are
not required to be ordered in ascending size, as was the case in the Haar wavelet dictionary example. Canonical
ordering does, however, ensure a modicum of ordering.
47
B. Invariance to Affine Transformations
Affine dictionaries consist of dilations and translations of a given generator function. It is therefore important to
understand the impact of these operations on the approximability—by neural networks—of a given function. As
neural networks realize concatenations of affine functions and nonlinearities, it is clear that translations and dilations
can be absorbed into the first layer of the network and the transformed function should inherit the approximability
properties of the generator function. However, what we will have to understand is how the weights, the connectivity,
and the domain of approximation of the resulting network are impacted. The following result makes this quantitative.
Proposition VIII.2. Let d ∈ N, p ∈ [1, ∞], and f ∈ Lp (Rd ). Assume that there exists a bivariate polynomial π
such that for all D ∈ R+ , ε ∈ (0, 1/2), there is a network ΦD,ε ∈ Nd,1 satisfying
with M(ΦD,ε ) ≤ π(log(ε−1 ), log(dDe)). Then, for all full-rank matrices A ∈ Rd×d , and all e ∈ Rd , E ∈ R+ ,
and η ∈ (0, 1/2), there is a network ΨA,e,E,η ∈ Nd,1 satisfying
1
|det(A)| p f (A · − e) − ΨA,e,E,η
≤ η,
Lp ([−E,E]d )
1
with M(ΨA,e,E,η ) ≤ π 0 (log(η −1 ), log(dF e)) and B(ΨA,e,E,η ) ≤ max{B(ΦF,η ), |det(A)| p , kAk∞ , kek∞ }, where
F = dEkAk∞ + kek∞ and π 0 is of the same degree as π.
|det(A)| p1 f (A · − e) − ΨA,e,E,η
p
L ([−E,E]d )
= kf − ΦF,η kLp (A·[−E,E]d − e) ≤ kf − ΦF,η kLp ([−F,F ]d ) ≤ η.
The next result establishes that canonically ordered affine dictionaries with generator functions that can be
approximated well by neural networks are effectively representable by neural networks.
48
Theorem VIII.3. Let d, S ∈ N, Ω ⊆ Rd be bounded with nonempty interior, (gs )Ss=1 ∈ L∞ (Rd ) compactly
supported, and D = (ϕi )i∈N ⊆ L2 (Ω) a nondegenerate canonically ordered affine dictionary with generator
functions (gs )Ss=1 . Assume that there exists a polynomial π such that, for all s ∈ {1, . . . , S}, ε ∈ (0, 1/2), there is
a network Φs,ε ∈ Nd,1 satisfying
kgs − Φs,ε kL2 (Rd ) ≤ ε, (55)
with M(Φs,ε ) ≤ π(log(ε−1 )) and B(Φs,ε ) ≤ π(ε−1 ). Then, D is effectively representable by neural networks.
Proof. By Definition VII.1 we need to establish the existence of a bivariate polynomial π such that for each i ∈ N,
η ∈ (0, 1/2), there is a network Φi,η ∈ Nd,1 satisfying
with M(Φi,η ) ≤ π(log(η −1 ), log(i)) and B(Φi,η ) ≤ π(η −1 , i). Note that we have
1
ϕi = gsjii ,ei = |det(Asi ,ji )| 2 gsi (Asi ,ji · − δei ) Ω ,
for si ∈ {1, . . . , S}, ji ∈ Jsi , and ei ∈ Zd . In order to devise networks satisfying (56), we employ Proposition VIII.2,
upon noting that, by virtue of (55), the networks Φs,ε satisfy (52) with p = 2, f = gs , for every D ∈ R+ .
Consequently Proposition VIII.2 yields a connectivity bound that is even slightly stronger than needed, as it is
independent of i. It remains to ensure that the desired bound on B(Φi,η ) holds. This is the case for kAsi ,ji k∞ and
kei k∞ both bounded polynomially in i. In order to verify this, we first bound kei k∞ relative to kAsi ,ji k∞ . As the
generators (gs )Ss=1 are compactly supported by assumption, there exists E ∈ R+ such that, for every s ∈ {1, . . . , S},
the support of gs is contained in [−E, E]d . We thus get, for all s ∈ {1, . . . , S}, j ∈ Js , and e ∈ Zd , that
Since Ω is bounded by assumption, there hence exists a constant c = c(Ω, (gs )Ss=1 , δ, d) such that, for all s ∈
{1, . . . , S}, j ∈ Js , and e ∈ Zd , we have
It remains to show that kAsi ,ji k∞ is polynomially bounded in i. We start by claiming that, for every s ∈ {1, . . . , S},
there is a constant cs := cs (Ω, δ, d) > 0 such that
To verify this claim, first note that |Ds,j | ≥ 1, for all s ∈ {1, . . . , S}, j ∈ Js , owing to the nondegeneracy condition.
Thus, for every s ∈ {1, . . . , S}, j ∈ Js , there exist x0 ∈ Ω and e0 ∈ Zd such that gsj,e0 (x0 ) 6= 0, which implies
1
gsj,e (x0 + A−1 j,e0
s,j δ(e − e0 )) = | det(As,j )| gs (As,j x0 − δe0 ) = gs (x0 ) 6= 0.
2
49
We can therefore conclude that x0 + A−1 j,e
s,j δ(e − e0 ) ∈ Ω implies gs ∈ Ds,j . Consequently, we have
As Ω was assumed to have nonempty interior, there exists a constant C = C(Ω) such that
We have hence established the claim (57). Combining (51) and (57), we obtain, for all si ∈ {1, . . . , S}, j ∈ Js\{1},
i −1
jX i −1
jX
ckAsi ,ji ka∞ ≤ | det(Asi ,k )| ≤ csi |Dk,si | ≤ cs i,
k=1 k=1
where the last inequality follows from the fact that ϕi ∈ Dji ,si and hence its index i must be larger than the number
of elements contained in preceding sub-dictionaries. This ensures that
a1
1 1
kAsi ,ji k∞ ≤ max cs i a + max kAs,1 k∞ , for all i ∈ N,
c s=1,...,S s=1,...,S
Remark VIII.4. Theorem VIII.3 is restricted, for ease of exposition, to bounded Ω and compactly supported
generator functions gs . The result can be extended to Ω = Rd and to generator functions gs of unbounded support
but sufficiently fast decay. This extension requires additional technical steps and an alternative definition of canonical
ordering. For conciseness we do not provide the details here, but instead refer to the proofs of Theorems IX.3 and
IX.5, which deal with the corresponding technical aspects in the context of approximation of Gabor dictionaries by
neural networks.
We can now put the results together to conclude a remarkable universality and optimality property of neural
networks: Consider an affine dictionary generated by functions gs that can be approximated well by neural networks.
If this dictionary provides Kolmogorov-Donoho-optimal approximation for a given function class, then so do neural
networks.
Theorem VIII.5. Let d, S ∈ N, Ω ⊆ Rd be bounded with nonempty interior, (gs )Ss=1 ∈ L∞ (Rd ) compactly
supported, and D = (ϕi )i∈N ⊆ L2 (Ω) a nondegenerate canonically ordered affine dictionary with generator
functions (gs )Ss=1 . Assume that there exists a polynomial π such that, for all s ∈ {1, . . . , S}, ε ∈ (0, 1/2), there
is a network Φs,ε ∈ Nd,1 satisfying kgs − Φs,ε kL2 (Rd ) ≤ ε with M(Φs,ε ) ≤ π(log(ε−1 )) and B(Φs,ε ) ≤ π(ε−1 ).
Then, we have
∗,eff
γN (C) ≥ γ ∗,eff (C, D)
for all compact function classes C ⊆ L2 (Ω). In particular, if C is optimally representable by D (in the sense of
Definition V.4), then C is optimally representable by neural networks (in the sense of Definition VI.5).
Proof. The first statement follows from Theorem VII.2 and Theorem VIII.3, the second from Theorem VI.4.
50
D. Spline wavelets
We next particularize the results developed above to show that neural networks Kolmogorov-Donoho optimally
represent all function classes C that are optimally representable by spline wavelet dictionaries. As spline wavelet
dictionaries have B-splines as generator functions, we start by showing how B-splines can be realized through
neural networks. For simplicity of exposition, we restrict ourselves to the univariate case throughout.
Nm+1 := N1 ∗ Nm ,
where ∗ stands for convolution. We refer to Nm as the univariate cardinal B-spline of order m.
Recognizing that B-splines are piecewise polynomial, we can build on Proposition III.5 to get the following
statement on the approximation of B-splines by deep neural networks.
Lemma VIII.7. Let m ∈ N. There exists a constant C > 0 such that for all ε ∈ (0, 1/2), there is a neural network
Φε ∈ N1,1 satisfying
Proof. The proof is based on the following representation [81, Eq. 19]
m+1
1 X k m+1
Nm (x) = (−1) ρ((x − k)m ). (58)
m! k
k=0
While Nm is supported on [0, m], the networks Φε can have support outside [0, m] as well. We only need to ensure
that Φε is “close” to Nm on [0, m] and at the same time “small” outside the interval [0, m]. To accomplish this,
we first approximate Nm on the slightly larger domain [−1, m + 1] by a linear combination of networks realizing
shifted monomials according to (58), and then multiply the resulting network by another one that takes on the value
1 on [0, m] and 0 outside of [−1, m + 1]. Specifically, we proceed as follows. Proposition III.5 ensures the existence
of a constant C1 such that for all ε ∈ (0, 1/2), there is a network Ψm+2,ε ∈ N1,1 satisfying
with M(Ψm+2,ε ) ≤ C1 log(ε−1 ) and B(Ψm+2,ε ) ≤ 1. Note that we did not make the dependence of M(Ψm+2,ε )
on m explicit as we consider m to be fixed. Next, let Tk (x) := x − k and observe that ρ((x − k)m ) can be realized
as a neural network according to ρ ◦ Ψm+2,ε ◦ Tk , where Tk is taken pursuant to Corollary A.2. Next, we define,
for ε ∈ (0, 1/2), the network
m+1
1 X k m+1
Φε :=
e (−1) ρ ◦ Ψm+2,ε ◦ Tk
m! k
k=0
51
and note that
1 m+1 m+1
= ≤ 2,
m! k k!(m − k + 1)!
for k = 0, . . . , m + 1. As ρ is 1-Lipschitz, we have, for all ε ∈ (0, 1/2),
m+1
X 1 m + 1
kΦε − Nm kL∞ ([−1,m+1]) ≤
e kρ ◦ Ψm+2,ε ◦ Tk − ρ ◦ Tkm kL∞ ([−1,m+1])
m! k
k=0
(59)
m+1
X
≤2 kΨm+2,ε (x) − xm kL∞ ([−(m+2),m+2]) ≤ 2ε .
k=0
Let now Γ(x) := ρ(x + 1) − ρ(x) − ρ(x − m) + ρ(x − (m + 1)), note that 0 ≤ Γ(x) ≤ 1, and take Φmult
1+ε/2,ε/2 to
and first note that, for x ∈ [0, m], Γ(x) = 1, which implies kΦ
e ε · Γ − Nm kL∞ ([0,m]) = kΦ
e ε − Nm kL∞ ([0,m]) ≤ ε/2,
again owing to (59). For x ∈ [−1, m + 1] \ [0, m], we have Nm (x) = 0 and Γ(x) ≤ 1, which yields
|Φ
e ε (x) · Γ(x) − Nm (x)| ≤ |Φ
e ε (x)| ≤ |Φ
e ε (x) − Nm (x)| + |Nm (x)| = |Φ
e ε (x) − Nm (x)| ≤ ε/2,
ε
again by (59). In summary, (59) hence ensures that the second term in (60) is also upper-bounded by 2 and therefore
kΦε − Nm kL∞ (R) ≤ ε. Combining Lemma II.3, Proposition III.3, Corollary A.2, Lemma A.4, and Lemma A.7
establishes the desired bounds on M(ΦD,ε ) and B(ΦD,ε ).
Remark VIII.8. As both Nm and the approximating networks Φε we constructed in the proof of Lemma VIII.7 are
supported in [−1, m+1], we have kΦε −Nm kL2 (R) ≤ (m+2)1/2 kΦε −Nm kL∞ (R) , which shows that Lemma VIII.7
continues to hold when the approximation error is measured in L2 (R)-norm, albeit with a different constant C.
where closL2 denotes closure with respect to L2 -norm. Spline spaces Vn , n ∈ Z, constitute a multiresolution analysis
[82] of L2 (R) according to
{0} ⊆ . . . V−1 ⊆ V0 ⊆ V1 ⊆ · · · ⊆ L2 (R).
Moreover, with the orthogonal complements (. . . , W−1 , W0 , W1 , . . . ) such that Vn+1 = Vn ⊕ Wn , where ⊕ denotes
the orthogonal sum, we have
∞
M
L2 (R) = V0 ⊕ Wk .
k=0
52
Theorem VIII.9 ([83, Theorem 1]). Let m ∈ N. The m-th order spline
2m−2
1 X dm
ψm (x) = (−1)j N2m (j + 1) N2m (2x − j), (61)
2m−1 j=0
dxm
with support [0, 2m − 1], is a basic wavelet that generates W0 and thereby all the spaces Wn , n ∈ Z. Consequently,
the set
Wm := {ψk,n (x) = 2n/2 ψm (2n x − k) : n ∈ N0 , k ∈ Z} ∪ {φk (x) = Nm (x − k) : k ∈ Z} (62)
is a nondegenerate canonically ordered affine dictionary with generators g1 = ψm and g2 = Nm . The canonical
ordering condition (51) is satisfied with a = 1 and c = 1/2. Nondegeneracy follows upon noting that supp(ψk,n ) =
[2−n k, 2−n (2m − 1 + k)] and supp(Nm ( · − k)) = [k, m + k], which implies that all sub-dictionaries contain at
least one element as required.
We have therefore established the following.
Theorem VIII.10. Let Ω ⊆ R be bounded and of nonempty interior and D = (ϕi )i∈N ⊆ L2 (Ω) a spline wavelet
dictionary according to (63) ordered per (50). Then, all compact function classes C ⊆ L2 (Ω) that are optimally
representable by D (in the sense of Definition V.4) are optimally representable by neural networks (in the sense of
Definition VI.5).
Proof. As the canonical ordering and the nondegeneracy conditions were already verified, it remains to establish
that the generators ψm and Nm satisfy the antecedent of Theorem VIII.3. To this end, we first devise an alternative
representation of (61). Specifically, using the identity [83, Eq. 2.2]
m
dm
j m
X
N2m (x) = (−1) Nm (x − j),
dxm j=0
j
we get
3m−1
X
ψm (x) = qn Nm (2x − n + 1), (64)
n=1
with
m
(−1)n+1 X m
qn = N2m (n − j).
2m−1 j=0 j
As (64) shows that ψm is a linear combination of shifts and dilations of Nm , combining Lemma VIII.7 and
Remark VIII.8 with Lemma II.6 and Proposition VIII.2 ensures that (55) is satisfied. Application of Theorem VIII.5
then establishes the claim.
53
IX. W EYL -H EISENBERG DICTIONARIES
In this section, we consider Weyl-Heisenberg a.k.a. Gabor dictionaries [17], which consist of time-frequency
translates of a given generator function. Gabor dictionaries play a fundamental role in time-frequency analysis [17]
and in the study of partial differential equations [84]. We start with the formal definition of Gabor dictionaries.
Definition IX.1 (Gabor dictionaries). Let d ∈ N, f ∈ L2 (Rd ), and x, ξ ∈ Rd . We define the translation operator
Tx : L2 (Rd ) → L2 (Rd ) as
Tx f (t) := f (t − x)
Let Ω ⊆ Rd , α, β > 0, and g ∈ L2 (Rd ). The Gabor dictionary G(g, α, β, Ω) ⊆ L2 (Ω) is defined as
In order to describe representability in neural networks in the sense of Definition VII.1, we need to order the
elements in G(g, α, β, Ω). To this end, let G0 (g, α, β, Ω) := {g Ω } and define Gn (g, α, β, Ω), n ∈ N, recursively
according to
n−1
[
Gn (g, α, β, Ω) := {Mξ Tx g Ω : (x, ξ) ∈ αZd × βZd , kxk∞ ≤ nα, kξk∞ ≤ nβ}\
Gk (g, α, β, Ω).
k=0
where the ordering within the sets Gn (g, α, β, Ω) is arbitrary. We hasten to add that the specifics of the overall
ordering in (65) are irrelevant as long as G(g, α, β, Ω) = (ϕi )i∈N with ϕi = Mξ(i) Tx(i) g Ω is such that kx(i)k∞
and kξ(i)k∞ do not grow faster than polynomially in i; this will become apparent in the proof of Theorem IX.3.
We note that this ordering is also inspired by that employed in the tail compactness considerations for modulation
spaces and Wilson bases as detailed in Appendix C.
As Gabor dictionaries are built from time-shifted and modulated versions of the generator function g, and invari-
ance to time-shifts was already established in Proposition VIII.2, we proceed to showing that the approximation-
theoretic properties of the generator function are inherited by its modulated versions. This result can be interpreted
as an invariance property to frequency shifts akin to that established in Proposition VIII.2 for affine transformations
in the context of affine dictionaries. In summary, neural networks exhibit a remarkable invariance property both to
the affine group operations of scaling and translation and to the Weyl-Heisenberg group operations of modulation
and translation.
54
Lemma IX.2. Let d ∈ N, f ∈ L2 (Rd ) ∩ L∞ (Rd ), and for every D ∈ R+ , ε ∈ (0, 1/2), let ΦD,ε ∈ Nd,1 satisfy
Then, there exists a constant C > 0 (which does not depend on f ) such that for all D ∈ R+ , ε ∈ (0, 1/2), ξ ∈ Rd ,
there are networks ΦRe Im
D,ξ,ε , ΦD,ξ,ε ∈ Nd,1 satisfying
kRe(Mξ f ) − ΦRe Im
D,ξ,ε kL∞ ([−D,D]d ) + kIm(Mξ f ) − ΦD,ξ,ε kL∞ ([−D,D]d ) ≤ 3ε
with
−1 2
L(ΦRe Im
D,ξ,ε ), L(ΦD,ξ,ε ) ≤ C((log(ε )) + log(ddDkξk∞ e) + (log(dSf e))2 ) + L(ΦD,ε ),
−1 2
M(ΦRe Im
D,ξ,ε ), M(ΦD,ξ,ε ) ≤ C((log(ε )) + log(ddDkξk∞ e) + (log(dSf e))2 + d) + 4M(ΦD,ε ) + 4L(ΦD,ε ),
and B(ΦRe
D,ξ,ε ) ≤ 1, where Sf := max{1, kf kL∞ (Rd ) }.
Proof. All statements in the proof involving ε pertain to ε ∈ (0, 1/2) without explicitly stating this every time. We
start by observing that
due to f ∈ R. Note that for given ξ ∈ Rd , the map t 7→ hξ, ti = ξ T t = t1 ξ1 + · · · + td ξd is simply a linear
transformation. Hence, combining Lemma II.3, Theorem III.8, and Corollary A.2 establishes the existence of a
constant C1 such that for all D ∈ R+ , ξ ∈ Rd , ε ∈ (0, 1/2), there is a network ΨD,ξ,ε ∈ Nd,1 satisfying
ε
sup | cos(2πhξ, ti) − ΨD,ξ,ε (t)| ≤ 6Sf (66)
t∈[−D,D]d
with
L(ΨD,ξ,ε ) ≤ C1 ((log(ε−1 ))2 + (log(Sf ))2 + log(ddDkξk∞ e)),
(67)
M(ΨD,ξ,ε ) ≤ C1 ((log(ε−1 ))2 + (log(Sf ))2 + log(ddDkξk∞ e) + d),
and B(ΨD,ξ,ε ) ≤ 1. Moreover, Proposition III.3 guarantees the existence of a constant C2 > 0 such that for all
ε ∈ (0, 1/2), there is a network µε ∈ N2,1 satisfying
ε
sup |µε (x, y) − xy| ≤ 6 (68)
x,y∈[−Sf −1/2,Sf +1/2]
with
and B(µε ) ≤ 1. Using Lemmas II.4 and II.5, we get that the network ΓD,ξ,ε := (ΨD,ξ,ε , ΦD,ε ) ∈ Nd,2 satisfies
55
and B(ΓD,ξ,ε ) ≤ 1. Finally, applying Lemma II.3 to concatenate the networks ΓD,ξ,ε and µε , we obtain the network
ΦRe
D,ξ,ε := µε ◦ ΓD,ξ,ε = µε ◦ (ΨD,ξ,ε , ΦD,ε ) ∈ Nd,1
satisfying
L(ΦRe
D,ξ,ε ) ≤ max{L(ΨD,ξ,ε ), L(ΦD,ε )} + L(µε ), (70)
M(ΦRe
D,ξ,ε ) ≤ 4M(ΨD,ξ,ε ) + 4M(ΦD,ε ) + 4L(ΨD,ξ,ε ) + 4L(ΦD,ε ) + 2M(µε ), (71)
and B(ΦRe
D,ξ,ε ) ≤ 1. Next, observe that (66) and (68) imply that
kΦRe
D,ξ,ε − Re(Mξ f )kL∞ ([−D,D]d ) = kµε (ΨD,ξ,ε ( · ), ΦD,ε ( · )) − cos(2πhξ, · i)f ( · )kL∞ ([−D,D]d )
Combining (67), (69), (71), and (70) we can further see that there exists a constant C > 0 such that
−1 2
L(ΦRe
D,ξ,ε ) ≤ C((log(ε )) + log(ddDkξk∞ e) + (log(dSf e))2 ) + L(ΦD,ε ),
−1 2
M(ΦRe
D,ξ,ε ) ≤ C((log(ε )) + log(ddDkξk∞ e) + (log(dSf e))2 + d) + 4M(ΦD,ε ) + 4L(ΦD,ε ),
and B(ΦRe Im
D,ξ,ε )) ≤ 1. The results for ΦD,ξ,ε follow analogously, simply by using sin(x) = cos(x − π/2).
Note that Gabor dictionaries necessarily contain complex-valued functions. The theory developed so far was,
however, phrased for neural networks with real-valued outputs. As is evident from the proof of Lemma IX.2, this
is not problematic when the generator function g is real-valued. For complex-valued generator functions we would
need a version of Proposition III.3 that applies to the multiplication of complex numbers. Due to (a+ib)(a0 +ib0 ) =
(aa0 − bb0 ) + i(ab0 + a0 b) such a network can be constructed by realizing the real and imaginary parts of the product
as a sum of real-valued multiplication networks and then proceeding as in the proof above. We omit the details as
they are straightforward and would not lead to new conceptual insights. Furthermore, an extension—to the complex-
valued case—of the concept of effective representability by neural networks according to Definition VII.1 would be
needed. This can be effected by considering the set of neural networks with 1-dimensional complex-valued output
as neural networks with 2-dimensional real-valued output, i.e., by setting
C
Nd,1 := Nd,2 ,
56
with the convention that the first component represents the real part and the second the imaginary part.
We proceed to establish conditions for effective representability of Gabor dictionaries by neural networks.
Theorem IX.3. Let d ∈ N, Ω ⊆ Rd , α, β > 0, g ∈ L2 (Rd ) ∩ L∞ (Rd ), and let G(g, α, β, Ω) be the corresponding
Gabor dictionary with ordering as defined in (65). Assume that Ω is bounded or that Ω = Rd and g is compactly
supported. Further, suppose that there exists a polynomial π such that for every x ∈ Rd , ε ∈ (0, 1/2), there is a
network Φx,ε ∈ Nd,1 satisfying
with M(Φx,ε ) ≤ π(log(ε−1 ), log(kxk∞ )), B(Φx,ε ) ≤ π(ε−1 , kxk∞ ). Then, G(g, α, β, Ω) is effectively repre-
sentable by neural networks.
Proof. We start by noting that owing to (65), we have G(g, α, β, Ω) = (ϕi )i∈N with ϕi = Mξ(i) Tx(i) g ∈
Gn(i) (g, α, β, Ω), where
Next, we take the affine transformation Wx (y) := y − x to be a depth-1 network and observe that, due to (72) and
Lemma II.3, we have, for all x ∈ Rd , ε ∈ (0, 1/2),
with
We first consider the case where Ω is bounded and let E ∈ R+ be such that Ω ⊆ [−E, E]d . Combining (74) with
Proposition VIII.2 and Lemma IX.2, we can infer the existence of a multivariate polynomial π1 such that for all
i ∈ N, ε ∈ (0, 1/2), there is a network Φi,ε = (ΦRe Im
i,ε , Φi,ε ) ∈ Nd,1 satisfying
C
d
−2
kRe(Mξ(i) Tx(i) g) − ΦRe Im
i,ε kL∞ (Ω) + kIm(Mξ(i) Tx(i) g) − Φi,ε kL∞ (Ω) ≤ (2E) ε, (75)
with
−1
M(ΦRe Im
i,ε ), M(Φi,ε ) ≤ π1 (log(ε ), log(kξ(i)k∞ ), log(kx(i)k∞ )),
(76)
−1
B(ΦRe Im
i,ε ), B(Φi,ε ) ≤ π1 (ε , kξ(i)k∞ , kx(i)k∞ ).
Note that here we did not make the dependence of the connectivity and the weight upper bounds on d and E
explicit as these quantities are irrelevant for the purposes of what we want to show, as long as they are finite, of
57
course, which is the case by assumption. Likewise, we did not explicitly indicate the dependence of π1 on g. As
|z| ≤ |Re(z)| + |Im(z)|, it follows from (75) that for all i ∈ N, ε ∈ (0, 1/2),
d
kϕi − Φi,ε kL2 (Ω,C) ≤ (2E) 2 kϕi − Φi,ε kL∞ (Ω,C)
d
≤ (2E) 2 kRe(ϕi ) − ΦRe Im
i,ε kL∞ (Ω) + kIm(ϕi ) − Φi,ε kL∞ (Ω) ≤ ε.
Moreover, (73) and (76) imply the existence of a polynomial π2 such that
−1 −1
M(ΦRe Im
i,ε ), M(Φi,ε ) ≤ π2 (log(ε ), log(i)), B(ΦRe Im
i,ε ), B(Φi,ε ) ≤ π2 (ε , i),
for all i ∈ N, ε ∈ (0, 1/2). We can therefore conclude that G(g, α, β, Ω) is effectively representable by neural
networks.
We proceed to proving the statement for the case Ω = Rd and g compactly supported, i.e., there exists E ∈ R+
such that supp(g) ⊆ [−E, E]d . This implies
Again, combining (74) with Proposition VIII.2 and Lemma IX.2 establishes the existence of a polynomial π3 such
that for all x, ξ ∈ Rd , ε ∈ (0, 1/2), there are networks ΨRe Im
x,ξ,ε , Ψx,ξ,ε ∈ Nd,1 satisfying
kRe(Mξ Tx g) − ΨRe Im
x,ξ,ε kL∞ (Sx ) + kIm(Mξ Tx g) − Ψx,ξ,ε kL∞ (Sx ) ≤
ε
2sx , (77)
with
−1
M(ΨRe Im
x,ξ,ε ), M(Ψx,ξ,ε ) ≤ π3 (log(ε ), log(kxk∞ ), log(kξk∞ )),
−1
B(ΨRe Im
x,ξ,ε ), B(Ψx,ξ,ε ) ≤ π3 (ε , kxk∞ , kξk∞ ),
where we set Sx := [−(kxk∞ + E + 1), kxk∞ + E + 1]d and sx := |Sx |1/2 to simplify notation. As we want to
establish effective representability for Ω = Rd , the estimate in (77) is insufficient. In particular, we have no control
over the behavior of the networks ΨRe Im
x,ξ,ε , Ψx,ξ,ε outside the set Sx . We can, however, construct networks which
exhibit the same scaling behavior in terms of M and B, are supported in Sx , and realize the same output for all
inputs in Sx . To this end let, for y ∈ R+ , the network αy ∈ N1,1 be given by
58
and note that
0 ≤ χx (t) ≤ 1, ∀t ∈ Rd .
As d and E are considered fixed here, there exists a constant C1 such that, for all x ∈ Rd , we have M(χx ) ≤ C1
and B(χx ) ≤ C1 max{1, kxk∞ }. Now, let B := max{1, kgkL∞ (R) }. Next, by Proposition III.3 there exists a
constant C2 such that, for all x ∈ Rd , ε ∈ (0, 1/2), there is a network µx,ε ∈ N1,1 satisfying
ε
sup |µx,ε (y, z) − yz| ≤ 4sx , (78)
y,z∈[−2B,2B]
with M(µx,ε ) ≤ C2 (log(ε−1 ) + log(sx )) and B(µx,ε ) ≤ 1. Note that in the upper bound on M(µx,ε ), we did not
make the dependence on B explicit as we consider g fixed for the purposes of the proof. Next, as E is fixed, there
exists a constant C3 such that M(µx,ε ) ≤ C3 (log(ε−1 ) + log(kxk∞ + 1)), for all x ∈ Rd , ε ∈ (0, 1/2).
We now take
ΓRe Re
x,ξ,ε := µx,ε ◦ (Ψx,ξ,ε , χx ) and ΓIm Im
x,ξ,ε := µx,ε ◦ (Ψx,ξ,ε , χx )
according to Lemmas II.5 and II.3, which ensures the existence of a polynomial π4 such that, for all x, ξ ∈ Rd ,
ε ∈ (0, 1/2),
−1
M(ΓRe Im
x,ξ,ε ), M(Γx,ξ,ε ) ≤ π4 (log(ε ), log(kxk∞ ), log(kξk∞ )),
(80)
−1
B(ΓRe Im
x,ξ,ε ), B(Γx,ξ,ε ) ≤ π4 (ε , kxk∞ , kξk∞ ).
Furthermore,
kΓRe Re Re
x,ξ,ε − Re(Mξ Tx g)kL∞ (Sx ) ≤ kµx,ε ◦ (Ψx,ξ,ε , χx ) − Ψx,ξ,ε · χx kL∞ (Sx )
(81)
+ kΨRe
x,ξ,ε · χx − Re(Mξ Tx g)kL∞ (Sx ) ,
ε
where the first term is upper-bounded by 4sx due to (78). The second term on the right-hand side of (81) is upper-
bounded as follows. First, note that for t ∈ Sx \ [−(kxk∞ + E), kxk∞ + E]d , we have Re(Mξ Tx g)(t) = 0 and
|χx (t)| ≤ 1, which implies
|ΨRe Re Re
x,ξ,ε (t) · χx (t) − Re(Mξ Tx g)(t)| ≤ |Ψx,ξ,ε (t)| ≤ |Ψx,ξ,ε (t) − Re(Mξ Tx g)(t)| + |Re(Mξ Tx g)(t)|
= |ΨRe
x,ξ,ε (t) − Re(Mξ Tx g)(t)|.
As |χx (t)| = 1 for t ∈ [−(kxk∞ + E), kxk∞ + E]d , together with (81), this yields
kΓRe
x,ξ,ε − Re(Mξ Tx g)kL∞ (Sx ) ≤
ε
4sx + kΨRe
x,ξ,ε − Re(Mξ Tx g)kL∞ (Sx ) .
59
The analogous estimate for kΓIm
x,ξ,ε − Im(Mξ Tx g)kL∞ (Sx ) is obtained in exactly the same manner. Together with
kRe(Mξ Tx g) − ΓRe Im
x,ξ,ε kL∞ (Sx ) + kIm(Mξ Tx g) − Γx,ξ,ε kL∞ (Sx ) ≤
ε
sx .
As Mξ Tx g, ΓRe Im d
x,ξ,ε , and Γx,ξ,ε are supported in Sx for all x, ξ ∈ R , ε ∈ (0, 1/2), using (79), we get
kRe(Mξ Tx g) − ΓRe Im
x,ξ,ε kL2 (Rd ) + kIm(Mξ Tx g) − Γx,ξ,ε kL2 (Rd )
= kRe(Mξ Tx g) − ΓRe Im
x,ξ,ε kL2 (Sx ) + kIm(Mξ Tx g) − Γx,ξ,ε kL2 (Sx )
(82)
≤ sx kRe(Mξ Tx g) − ΓRe Im
x,ξ,ε kL∞ (Sx ) + sx kIm(Mξ Tx g) − Γx,ξ,ε kL∞ (Sx ) ≤ ε.
Consider now, for i ∈ N, ε ∈ (0, 1/2), the complex-valued network Γi,ε ∈ Nd,1
C
given by
Γi,ε := (ΓRe Im
x(i),ξ(i),ε , Γx(i),ξ(i),ε )
Finally, using (73) in (80), it follows that there exists a polynomial π5 such that for all i ∈ N, ε ∈ (0, 1/2), we
−1 −1
have M(ΓRe Im
x(i),ξ(i),ε ), M(Γx(i),ξ(i),ε ) ≤ π5 (log(ε ), log(i)) and B(ΓRe Im
x(i),ξ(i),ε ), B(Γx(i),ξ(i),ε ) ≤ π5 (ε , i), which
finalizes the proof.
Next, we establish the central result of this section. To this end, we first recall that according to Theorem
VIII.5 neural networks provide optimal approximations for all function classes that are optimally approximated
by affine dictionaries (generated by functions f that can be approximated well by neural networks). While this
universality property is significant as it applies to all affine dictionaries, it is perhaps not completely surprising
as affine dictionaries are generated by affine transformations and neural networks consist of concatenations of
affine transformations and nonlinearities. Gabor dictionaries, on the other hand, exhibit a fundamentally different
mathematical structure. The next result shows that neural networks also provide optimal approximations for all
function classes that are optimally approximated by Gabor dictionaries (again, with generator functions that can be
approximated well by neural networks).
60
Theorem IX.4. Let d ∈ N, Ω ⊆ Rd , α, β > 0, g ∈ L2 (Rd ) ∩ L∞ (Rd ), and let G(g, α, β, Ω) be the corresponding
Gabor dictionary with ordering as defined in (65). Assume that Ω is bounded or that Ω = Rd and g is compactly
supported. Further, suppose that there exists a polynomial π such that for every x ∈ Rd , ε ∈ (0, 1/2), there is a
network Φx,ε ∈ Nd,1 satisfying
with M(Φx,ε ) ≤ π(log(ε−1 ), log(kxk∞ )), B(Φx,ε ) ≤ π(ε−1 , kxk∞ ). Then, for all compact function classes C ⊆
L2 (Ω), we have
∗,eff
γN (C) ≥ γ ∗,eff (C, G(g, α, β, Ω)).
In particular, if C is optimally representable by G(g, α, β, Ω) (in the sense of Definition V.4), then C is optimally
representable by neural networks (in the sense of Definition VI.5).
Proof. The first statement follows from Theorem VII.2 and Theorem IX.3, the second is by Theorem VI.4.
We complete the program in this section by showing that the Gaussian function satisfies the conditions on the
generator g in Theorem IX.3 for bounded Ω. Gaussian functions are widely used generator functions for Gabor
dictionaries owing to their excellent time-frequency localization and their frame-theoretic optimality properties [17].
We hasten to add that the result below can be extended to any generator function g of sufficiently fast decay and
sufficient smoothness.
There exists a constant C > 0 such that, for all d ∈ N and ε ∈ (0, 1/2), there is a network Φd,ε ∈ Nd,1 satisfying
Proof. Observe that gd can be written as the composition h ◦ fd of the functions fd : Rd → R+ and h : R+ → R
given by
d
X
fd (x) := kxk22 = x2i and h(y) := e−y .
i=1
By Proposition III.3 and Lemma II.6, there exists a constant C1 > 0 such that, for every d ∈ N, D ∈ [1, ∞),
ε ∈ (0, 1/2), there is a network Ψd,D,ε ∈ Nd,1 satisfying
61
n
−y
d
Moreover, as | dy ne | = |e−y | ≤ 1 for all n ∈ N, y ≥ 0, Lemma A.6 implies the existence of a constant C2 > 0
such that for every d ∈ N, D ∈ [1, ∞), ε ∈ (0, 1/2), there is a network Γd,D,ε ∈ N1,1 satisfying
from (84) and (86) that there exists a constant C2 > 0 such that for all d ∈ N, ε ∈ (0, 1/2), we have M(Φ
e d,ε ) ≤
2
≤ |e−kxk2 − e−Ψd,Dε ,ε (x) | + |e−Ψd,Dε ,ε (x) − Γd,Dε ,ε (Ψd,Dε ,ε (x))|
ε ε
≤ 2 + 2 = ε.
We can now use the same approach as in the proof of Theorem IX.3 to construct networks Φd,ε supported on the inter-
val [−Dε , Dε ]d over which they approximate g to within error ε, and obey M(Φε ) ≤ Cd(log(ε−1 ))2 ((log(ε−1 ))2 +
log(d)), B(Φd,ε ) ≤ 1 for some absolute constant C. Together with |g(x)| ≤ ε, for all x ∈ Rd \[−Dε , Dε ]d , this
completes the proof.
Remark IX.6. Note that Lemma IX.5 establishes an approximation result that is even stronger than what is required
by Theorem IX.3. Specifically, we achieve ε-approximation over all of Rd with a network that does not depend on
the shift parameter x, while exhibiting the desired growth rates on M and B, which consequently do not depend
on the shift parameter as well. The idea underlying this construction can be used to strengthen Theorem IX.3 to
apply to Ω = Rd and generator functions of unbounded support, but sufficiently rapid decay.
We conclude this section with a remark on the neural network approximation of the real-valued counterpart of
Gabor dictionaries known as Wilson dictionaries [74], [17] and consisting of cosine-modulated and time-shifted
versions of a given generator function, see also Appendix C. The techniques developed in this section, mutatis
mutandis, show that neural networks provide Kolmogorov-Donoho optimal approximation for all function classes
that are optimally approximated by Wilson dictionaries (generated by functions that can be approximated well by
neural networks). Specifically, we point out that the proofs of Lemma IX.2 and Theorem IX.3 explicitly construct
neural network approximations of time-shifted and cosine- and sine-modulated versions of the generator g. As
identified in Table 1, Wilson bases provide optimal nonlinear approximation of (unit) balls in modulation spaces
[85], [74]. Finally, we note that similarly the techniques developed in the proofs of Lemma IX.2 and Theorem IX.3
can be used to establish optimal representability of Fourier bases.
62
X. I MPROVING P OLYNOMIAL A PPROXIMATION R ATES TO E XPONENTIAL R ATES
Having established that for all function classes listed in Table 1, Kolmogorov-Donoho-optimal approximation
through neural networks is possible, this section proceeds to show that neural networks, in addition to their striking
Kolmogorov-Donoho universality property, can also do something that has no classical equivalent.
Specifically, as mentioned in the introduction, for the class of oscillatory textures as considered below and for the
Weierstrass function, there are no known methods that achieve exponential accuracy, i.e., an approximation error
that decays exponentially in the number of parameters employed in the approximant. We establish below that deep
networks fill this gap.
Let us start by defining one-dimensional “oscillatory textures” according to [18]. To this end, we recall the
following definition from Lemma A.6,
n o
S[a,b] = f ∈ C ∞ ([a, b], R) : kf (n) (x)kL∞ ([a,b]) ≤ n!, for all n ∈ N0 .
The efficient approximation of functions in FD,a with a large represents a notoriously difficult problem due to
the combination of the rapidly oscillating cosine term and the warping function g. The best approximation results
available in the literature [18] are based on wave-atom dictionaries11 and yield low-order polynomial approximation
rates. In what follows we show that finite-width deep networks drastically improve these results to exponential
approximation rates.
We start with our statement on the neural network approximation of oscillatory textures.
Proposition X.2. There exists a constant C > 0 such that for all D, a ∈ R+ , f ∈ FD,a , and ε ∈ (0, 1/2), there is
a network Γf,ε ∈ N1,1 satisfying
with L(Γf,ε ) ≤ CdDe((log(ε−1 ) + log(dae))2 + log(dDe) + log(dD−1 e)), W(Γf,ε ) ≤ 32, B(Γf,ε ) ≤ 1.
Proof. For D, a ∈ R+ , f ∈ FD,a , let gf , hf ∈ S[−D,D] be functions such that f = cos(agf )hf . Note that Lemma
A.6 guarantees the existence of a constant C1 > 0 such that for all D, a ∈ R+ , ε ∈ (0, 1/2), there are networks
Ψgf ,ε , Ψhf ,ε ∈ N1,1 satisfying
ε ε
kΨgf ,ε − gf kL∞ ([−D,D]) ≤ 12dae , kΨhf ,ε − hf kL∞ ([−D,D]) ≤ 12dae (87)
11 To be precise, the results of [18] are concerned with the two-dimensional case, whereas here we focus on the one-dimensional case. Note,
however, that all our results are readily extended to the multi-dimensional case.
63
with
ε
L(Ψgf ,ε ), L(Ψhf ,ε ) ≤ C1 dDe(log(( 12dae )−1 )2 + log(dDe) + log(dD−1 e)),
W(Ψgf ,ε ), W(Ψhf ,ε ) ≤ 16, and B(Ψgf ,ε ), B(Ψhf ,ε ) ≤ 1. Furthermore, Theorem III.8 ensures the existence of a
constant C2 > 0 such that for all D, a ∈ R+ , ε ∈ (0, 1/2), there is a neural network Φa,D,ε ∈ N1,1 satisfying
with L(Φa,D,ε ) ≤ C2 ((log(ε−1 ))2 + log(d3a/2e)), W(Φa,D,ε ) ≤ 9, and B(Φa,D,ε ) ≤ 1. Moreover, due to
Proposition III.3, there exists a constant C3 > 0 such that for all ε ∈ (0, 1/2), there is a network µε ∈ N2,1
satisfying
with L(µε ) ≤ C3 log(ε−1 ), W(µε ) ≤ 5, and B(µε ) ≤ 1. By Lemma II.3 there exists a network Ψ1 satisfying
Ψ1 = Φa,D,ε ◦ Ψgf ,ε with W(Ψ1 ) ≤ 16, L(Ψ1 ) = L(Φa,D,ε ) + L(Ψgf ,ε ), and B(Ψ1 ) ≤ 1. Furthermore,
combining Lemma II.4 and Lemma A.7, we can conclude the existence of a network Ψ2 (x) = (Ψ1 (x), Ψhf ,ε (x)) =
(Φa,D,ε (Ψgf ,ε (x)), Ψhf ,ε (x)) with W(Ψ2 ) ≤ 32, L(Ψ2 ) = max{L(Φa,D,ε ) + L(Ψgf ,ε ), L(Ψhf ,ε )}, and B(Ψ2 ) ≤
1. Next, for all D, a ∈ R+ , f ∈ FD,a , ε ∈ (0, 1/2), we define the network Γf,ε := µε ◦ Ψ2 . By (87), (88), and
d
supx∈R | dx cos(ax)| = a, we have, for all x ∈ [−D, D],
|Φa,D,ε (Ψgf ,ε (x)) − cos(agf (x))| ≤ |Φa,D,ε (Ψgf ,ε (x)) − cos(aΨgf ,ε (x))|
Combining this with (87), (89), and k cos kL∞ ([−D,D]) , kf kL∞ ([−D,D]) ≤ 1 yields for all x ∈ [−D, D],
|Γf,ε (x) − f (x)| = |µε (Φa,D,ε (Ψgf ,ε (x)), Ψhf ,ε (x)) − cos(agf (x))hf (x)|
≤ |µε (Φa,D,ε (Ψgf ,ε (x)), Ψhf ,ε (x)) − Φa,D,ε (Ψgf ,ε (x))Ψhf ,ε (x)|
Finally, by Lemma II.3 there exists a constant C4 such that for all D, a ∈ R+ , f ∈ FD,a , ε ∈ (0, 1/2), it holds
that W(Γf,ε ) ≤ 32,
and B(Γf,ε ) ≤ 1.
64
Fig. 4: Left: A function in F1,100 . Right: The function W √1 ,2 .
2
Finally, we show how the Weierstrass function—a fractal function, which is continuous everywhere but dif-
ferentiable nowhere—can be approximated with exponential accuracy by deep ReLU networks. Specifically, we
consider
∞
X
Wp,a (x) = pk cos(ak πx), for p ∈ (0, 1/2), a ∈ R+ , with ap ≥ 1,
k=0
log(p)
and let α = − log(a) , see Figure 4 right for an example. It is well known [86] that Wp,a possesses Hölder smoothness
α which may be made arbitrarily small by suitable choice of a. While classical approximation methods achieve
polynomial approximation rates only, it turns out that finite-width deep networks yield exponential approximation
rates. This is formalized as follows.
Proposition X.3. There exists a constant C > 0 such that for all ε, p ∈ (0, 1/2), D, a ∈ R+ , there is a network
Ψp,a,D,ε ∈ N1,1 satisfying
with L(Ψp,a,D,ε ) ≤ C((log(ε−1 ))3 +(log(ε−1 ))2 log(dae)+log(ε−1 ) log(dDe)), W(Ψp,a,D,ε ) ≤ 13, B(Ψp,a,D,ε ) ≤ 1.
PN
Proof. For every N ∈ N, p ∈ (0, 1/2), a ∈ R+ , x ∈ R, let SN,p,a (x) = k=0 pk cos(ak πx) and note that
∞ ∞
1−pN +1
X X
|SN,p,a (x) − Wp,a (x)| ≤ |pk cos(ak πx)| ≤ pk = 1
1−p − 1−p ≤ 2−N . (90)
k=N +1 k=N +1
Let Nε := dlog(2/ε)e for ε ∈ (0, 1/2). Next, note that Theorem III.8 ensures the existence of a constant C1 > 0
such that for all D, a ∈ R+ , k ∈ N0 , ε ∈ (0, 1/2), there is a network φak ,D,ε ∈ N1,1 satisfying
65
with L(φak ,D,ε ) ≤ C1 ((log(ε−1 ))2 + log(dak πDe)), W(φak ,D,ε ) ≤ 9, B(φak ,D,ε ) ≤ 1. Let A : R3 → R3 and
B : R3 → R be the affine transformations given by A(x1 , x2 , x3 ) = (x1 , x1 , x2 +x3 )T and B(x1 , x2 , x3 ) = x2 +x3 ,
respectively. We now define, for all p ∈ (0, 1/2), D, a ∈ R+ , k ∈ N0 , ε ∈ (0, 1/2), the networks
x x1
p,a,0 p,a,k
ψD,ε (x) = p0 φa0 ,D,ε (x) and ψD,ε (x1 , x2 , x3 ) = pk φak ,D,ε (x2 ) , k > 0,
0 x3
and, for all p ∈ (0, 1/2), D, a ∈ R+ , ε ∈ (0, 1/2), the network
p,a,Nε p,a,Nε −1 p,a,0
Ψp,a,D,ε := B ◦ ψD,ε ◦ A ◦ ψD,ε ◦ · · · ◦ A ◦ ψD,ε .
Due to (91) we get, for all p ∈ (0, 1/2), D, a ∈ R+ , ε ∈ (0, 1/2), x ∈ [−D, D], that
N Nε
Xε X
k k k
|Ψp,a,D,ε (x) − SNε ,p,a (x)| = p φak ,D,ε (x) − p cos(a πx)
k=0 k=0
Nε
X Nε
X
≤ pk |φak ,D,ε (x) − cos(ak πx)| ≤ ε
4 2−k ≤ 2ε .
k=0 k=0
Combining this with (90) establishes, for all p ∈ (0, 1/2), D, a ∈ R+ , ε ∈ (0, 1/2), x ∈ [−D, D],
2
|Ψp,a,D,ε (x) − Wp,a (x)| ≤ 2−dlog( ε )e + ε
2 ≤ ε
2 + ε
2 = ε.
Applying Lemmas II.3, II.4, and II.5 establishes the existence of a constant C2 such that for all p ∈ (0, 1/2),
D, a ∈ R+ , ε ∈ (0, 1/2),
Nε
X
L(Ψp,a,D,ε ) ≤ (L(φak ,D,ε ) + 1) ≤ Nε + 1 + (Nε + 1)C1 ((log(ε−1 ))2 + log(daNε πDe))
k=0
We finally note that the restriction p ∈ (0, 1/2) in Proposition X.3 was made for simplicity of exposition and
can be relaxed to p ∈ (0, r), with r < 1, while only changing the constant C.
The recent successes of neural networks in machine learning applications have been enabled by various technolog-
ical factors, but they all have in common the use of deep networks as opposed to shallow networks studied intensely
in the 1990s. It is hence of interest to understand whether the use of depth offers fundamental advantages. In this
spirit, the goal of this section is to make a formal case for depth in neural network approximation by establishing that,
for nonconstant periodic functions, finite-width deep networks require asymptotically—in the function’s “highest
frequency”—smaller connectivity than finite-depth wide networks. This statement is then extended to sufficiently
66
smooth nonperiodic functions, thereby formalizing the benefit of deep networks over shallow networks for the
approximation of a broad class of functions.
We start with preparatory material taken from [26].
Definition XI.1 ([26]). Let k ∈ N. A function f : R → R is called k-sawtooth if it is piecewise linear with no
more than k pieces, i.e., its domain R can be partitioned into k intervals such that f is linear on each of these
intervals.
The quantity ξ(f ) measures the error incurred by the best linear approximation of f on any segment of length
equal to the period of f ; ξ(f ) can hence be interpreted as quantifying the nonlinearity of f . The next result states that
finite-depth networks with width and hence also connectivity scaling polylogarithmically in the “highest frequency”
of the periodic function to be approximated can not achieve arbitrarily small approximation error.
Proposition XI.4. Let f ∈ C(R) be a nonconstant u-periodic function, L ∈ N, and π a polynomial. Then, there
exists an a ∈ N such that for every network Φ ∈ N1,1 with L(Φ) ≤ L and W(Φ) ≤ π(log(a)), we have
Proof. First note that there exists an even a ∈ N such that a/2 > (2π(log(a)))L . Lemma XI.2 now implies that
every network Φ ∈ N1,1 with L(Φ) ≤ L and W(Φ) ≤ π(log(a)) is (2π(log(a)))L -sawtooth and therefore consists
of no more than a/2 different linear pieces. Hence, there exists an interval [u1 , u2 ] ⊆ [0, u] with u2 − u1 ≥ (2u/a)
on which Φ is linear. Since u2 − u1 ≥ (2u/a) the interval supports two full periods of f (a · ) and we can therefore
conclude that
kf (a · ) − ΦkL∞ ([0,u]) ≥ kf (a · ) − ΦkL∞ ([u1 ,u2 ]) ≥ inf kf (x) − (cx + d)kL∞ ([0,2u])
c,d∈R
Finally, note that ξ(f ) > 0 as ξ(f ) = 0 for u-periodic f ∈ C(R) necessarily implies that f is constant, which,
however, is ruled out by assumption.
Application of Proposition XI.4 to f (x) = cos(x) shows that finite-depth networks, owing to ξ(cos) > 0, require
faster than polylogarithmic growth of connectivity in a to approximate x 7→ cos(ax) with arbitrarily small error,
whereas finite-width networks, due to Theorem III.8, can accomplish this with polylogarithmic connectivity growth.
The following result from [87] allows a similar observation for functions that are sufficiently smooth.
67
Theorem XI.5 ([87]). Let [a, b] ⊆ R, f ∈ C 3 ([a, b]), and for ε ∈ (0, 1/2), let s(ε) ∈ N denote the smallest number
such that there exists a piecewise linear approximation of f with s(ε) pieces and error at most ε in L∞ ([a, b])-norm.
Then, it holds that Z b
c 1 p
s(ε) ∼ √ , ε → 0, where c = |f 00 (x)|dx.
ε 4 a
Combining this with Lemma XI.2 yields the following result on depth-width tradeoff for three-times continuously
differentiable functions.
Rbp
Theorem XI.6. Let f ∈ C 3 ([a, b]) with a
|f 00 (x)|dx > 0, L ∈ N, and π a polynomial. Then, there exists ε > 0
such that for every network Φ ∈ N1,1 with L(Φ) ≤ L and W(Φ) ≤ π(log(ε−1 )), we have
Proof. The proof will be effected by contradiction. Assume that for every ε > 0, there exists a network Φε ∈ N1,1
with L(Φε ) ≤ L, W(Φε ) ≤ π(log(ε−1 )), and kf −Φε kL∞ ([a,b]) ≤ ε. By Lemma XI.2 every (ReLU) neural network
realizes a piecewise linear function. Application of Theorem XI.5 hence allows us to conclude the existence of a
1
constant C such that, for all ε > 0, the network Φε must have at least Cε− 2 different linear pieces. This, however,
leads to a contradiction as, by Lemma XI.2, Φε is at most (2π(log(ε−1 )))L -sawtooth and π̃(log(ε−1 )) ∈ o(ε−1/2 ),
ε → 0, for every polynomial π̃.
In summary, we have hence established that any function which is at least three times continuously differentiable
(and does not have a vanishing second derivative) cannot be approximated by finite-depth networks with connectivity
scaling polylogarithmically in the inverse of the approximation error. Our results in Section III establish that, in
contrast, this “is” possible with finite-width deep networks for various interesting types of smooth functions such
as polynomials and sinusoidal functions. Further results on the limitations of finite-depth networks akin to Theorem
XI.6 were reported in [23].
ACKNOWLEDGMENTS
The authors are indebted to R. Gül and W. Ou for their careful proofreading of the paper, to E. Riegler and the
reviewers for their constructive and insightful comments, and to the handling editor, P. Narayan, for his helpful
comments and his patience.
68
A PPENDIX A
AUXILIARY NEURAL NETWORK CONSTRUCTIONS
The following three results are concerned with the realization of affine transformations of arbitrary weights by
neural networks with weights upper-bounded by 1.
Lemma A.1. Let d ∈ N and a ∈ R. There exists a network Φa ∈ Nd,d satisfying Φa (x) = ax, with L(Φa ) ≤
blog(|a|)c + 4, W(Φa ) ≤ 3d, B(Φa ) ≤ 1.
Proof. First note that for |a| ≤ 1 the claim holds trivially, which can be seen by taking Φa to be the affine
transformation x 7→ ax and interpreting it according to Definition II.1 as a depth-1 neural network. Next, we
consider the case |a| > 1 for d = 1, set K := blog(a)c, α := a2−(K+1) , and define A1 := (1, −1)T ∈ R2×1 ,
1 0 1 1 −1
3×2 3×3
A2 := 1 1 ∈ R , Ak := 1 1 1 ∈ R , k ∈ {3, . . . , K + 3},
0 1 −1 1 1
and AK+4 := (α, 0, −α). Note that (ρ ◦ A2 ◦ ρ ◦ A1 )(x) = (ρ(x), ρ(x) + ρ(−x), ρ(−x)) and ρ(Ak (x, x + y, y)T ) =
2(x, x + y, y), for k ∈ {3, . . . , K + 3}. The network Ψa := AK+4 ◦ ρ ◦ · · · ◦ ρ ◦ A1 hence satisfies Ψa (x) = ax,
L(Ψa ) = blog(a)c + 4, W(Ψa ) = 3, and B(Φa ) ≤ 1. Applying Lemma II.5 to get a parallelization of d copies of
Ψa completes the proof.
0 0
Corollary A.2. Let d, d0 ∈ N, a ∈ R+ , A ∈ [−a, a]d ×d , and b ∈ [−a, a]d . There exists a network ΦA,b ∈ Nd,d0
satisfying ΦA,b (x) = Ax + b, with L(ΦA,b ) ≤ blog(|a|)c + 5, W(ΦA,b ) ≤ max{d, 3d0 }, B(ΦA,b ) ≤ 1.
Proof. Let Φa ∈ Nd0 ,d0 be the multiplication network from Lemma A.1, consider W (x) := a−1 (Ax + b) as a
1-layer network, and take ΦA,b := Φa ◦ W according to Lemma II.3.
Proposition A.3. Let d, d0 ∈ N and Φ ∈ Nd,d0 . There exists a network Ψ ∈ Nd,d0 satisfying Ψ(x) = Φ(x), for all
x ∈ Rd , and with L(Ψ) ≤ (dlog(B(Φ))e + 5)L(Φ), W(Ψ) ≤ max{3d0 , W(Φ)}, B(Ψ) ≤ 1.
Φ fL(Φ) ◦ ρ ◦ · · · ◦ ρ ◦ W
e := W f1 ,
and Ψ := Φa ◦ Φ
e according to Lemma II.3. Note that Φ
e has weights upper-bounded by 1 and is of the same depth
and width as Φ. As ρ is positively homogeneous, i.e., ρ(λx) = λρ(x), for all λ ≥ 0, x ∈ R, we have Ψ(x) = Φ(x),
for all x ∈ Rd . Application of Lemma II.3 and Lemma A.1 completes the proof.
Next we record a technical Lemma on how to realize a sum of networks with the same input by a network whose
width is independent of the number of constituent networks.
69
Lemma A.4. Let d, d0 ∈ N, N ∈ N, and Φi ∈ Nd,d0 , i ∈ {1, . . . , N }. There exists a network Φ ∈ Nd,d0 satisfying
N
X
Φ(x) = Φi (x), for all x ∈ Rd ,
i=1
PN
with L(Φ) = i=1 L(Φi ), W(Φ) ≤ 2d + 2d0 + max{2d, maxi {W(Φi )}}, B(Φ) = max{1, maxi B(Φi )}.
eN ◦ Ψ
Φ=Ψ e N −1 ◦ · · · ◦ Ψ
e 1,
70
when the compositions are taken in the sense of Lemma II.3. Due to Lemmas II.4 and II.5, we have L(Ψi ) = L(Φi ),
W(Ψi ) = 2d + 2d0 + W(Φi ), and B(Ψi ) = max{1, B(Φi )}. The proof is finalized by noting that, owing to the
structure of the involved matrices, the depth and the weight magnitude remain unchanged by turning Ψi into Ψ
e i,
whereas the width can not increase, but may decrease owing to the replacement of E11 by E11 Ain .
The following lemma shows how to patch together local approximations using multiplication networks and a
partition of unity consisting of hat functions. We note that this argument can be extended to higher dimensions
using tensor products (which can be realized efficiently through multiplication networks) of the one-dimensional
hat function.
Lemma A.5. Let ε ∈ (0, 1/2), n ∈ N, a0 < a1 < · · · < an ∈ R, f ∈ L∞ ([a0 , an ]), and
1
A := max{|a0 |, |an |, 2 max } , B := max{1, kf kL∞ ([a0 ,an ]) }.
i∈{2,...,n−1} |ai −ai−1 |
Assume that for every i ∈ {1, . . . , n − 1}, there exists a network Φi ∈ N1,1 with kf − Φi kL∞ ([ai−1 ,ai+1 ]) ≤ ε/3.
Then, there is a network Φ ∈ N1,1 satisfying
1 1
Ψ1 (x) := 1 − a2 −a1 ρ(x − a1 ) + a2 −a1 ρ(x − a2 ),
1 1 1 1
Ψi (x) := ai −ai−1 ρ(x − ai−1 ) − ( ai −a i−1
+ ai+1 −ai ) ρ(x − ai ) + ai+1 −ai ρ(x − ai+1 ), i ∈ {2, . . . , n − 2},
1 1
Ψn−1 (x) := an−1 −an−2 ρ(x − an−2 ) − an−1 −an−2 ρ(x − an−1 ).
Note that supp(Ψ1 ) = (∞, a2 ), supp(Ψn−1 ) = [an−2 , ∞), and supp(Ψi ) = [ai−1 , ai+1 ]. Proposition A.3 now
ensures that, for all i ∈ {1, . . . , n−1}, Ψi can be realized as a network with L(Ψi ) ≤ 2(dlog(A)e+5), W(Ψi ) ≤ 3,
and B(Ψi ) ≤ 1. Next, let ΦB+1/6,ε/3 ∈ N2,1 be the multiplication network according to Proposition III.3 and define
the networks
Φ
e i (x) := ΦB+1/6,ε/3 (Φi (x), Ψi (x))
according to Lemma II.5 and Lemma II.3, along with their sum
n−1
X
Φ(x) := Φ
e i (x)
i=1
71
according to Lemma A.4. Proposition III.3 ensures, for all i ∈ {1, . . . , n − 1}, x ∈ [ai−1 , ai+1 ], that
|f (x)Ψi (x) − Φ
e i (x)| ≤ |f (x)Ψi (x) − Φi (x)Ψi (x)| + |Φi (x)Ψi (x) − ΦB+1/6,ε/3 (Φi (x), Ψi (x))|
≤ (Ψi (x) + 1) 3ε
I(x) := {i ∈ {1, . . . , n − 1} : Φ
e i (x) 6= 0}
P
of active indices contains at most two elements. Moreover, we have i∈I(x) Ψi (x) = 1 by construction, which
implies that, for all x ∈ R,
X X X
|f (x) − Φ(x)| = Ψi (x)f (x) − Φ̃i (x) ≤ (Ψi (x) + 1) 3ε ≤ ε.
i∈I(x) i∈I(x) i∈I(x)
Due to Lemma II.3, Lemma II.5, Proposition III.3, and Lemma A.4, we can conclude that Φ, indeed, satisfies the
claimed properties.
There exists a constant C > 0 such that for all a, b ∈ R with a < b, f ∈ S[a,b] , and ε ∈ (0, 1/2), there is a network
Ψf,ε ∈ N1,1 satisfying
Proof. We first recall that the case [a, b] = [−1, 1] has already been dealt with in Lemma III.7. Here, we will
first prove the statement for the interval [−D, D] with D ∈ (0, 1) and then use this result to establish the general
case through a patching argument according to Lemma A.5. We start by noting that for g ∈ S[−D,D] , the function
fg : [−1, 1] → R, x 7→ g(Dx) is in S[−1,1] due to D < 1. Hence, by Lemma III.7, there exists a constant C > 0
such that for all g ∈ S[−D,D] and ε ∈ (0, 1/2), there is a network Ψ
e g,ε ∈ N1,1 satisfying kΨ
e g,ε −fg kL∞ ([−1,1]) ≤ ε,
e g,ε ) ≤ C(log(ε−1 ))2 , W(Ψ
with L(Ψ e g,ε ) ≤ 9, B(Ψ
e g,ε ) ≤ 1. The claim is then established by taking the network
e g,ε ◦ ΦD−1 , where ΦD−1 is the scalar multiplication network from Lemma A.1,
approximating g to be Ψg,ε := Ψ
and noting that
= sup |Ψ
e g,ε (x) − fg (x)| ≤ ε.
x∈[−1,1]
72
Due to Lemma II.3, we have L(Ψg,ε ) ≤ C((log(ε−1 ))2 + log(d D
1
e)), W(Ψg,ε ) ≤ 9, and B(Ψg,ε ) ≤ 1. We are now
ready to proceed to the proof of the statement for general intervals [a, b]. This will be accomplished by approximating
f on intervals of length no more than 2 and stitching the resulting approximations together according to Lemma A.5.
We start with the case b−a ≤ 2 and note that here we can simply shift the function by (a+b)/2 to center its domain
around the origin and then use the result above for approximation on [−D, D] with D ∈ (0, 1) or Lemma III.7
if b − a = 2, both in combination with Corollary A.2 to realize the shift through a neural network with weights
bounded by 1. Using Lemma II.3 to implement the composition of the network realizing this shift with that realizing
g, we can conclude the existence of a constant C 0 > 0 such that, for all [a, b] ⊆ R with b − a ≤ 2, g ∈ S[a,b] ,
ε ∈ (0, 1/2), there is a network satisfying kg − Ψg,ε kL∞ ([a,b]) ≤ ε with L(Ψg,ε ) ≤ C 0 ((log(ε−1 ))2 + log(d b−a
1
e)),
W(Ψg,ε ) ≤ 9, and B(Ψg,ε ) ≤ 1. Finally, for b − a > 2, we partition the interval [a, b] and apply Lemma A.5 as
follows. We set n := db − ae and define
ai := a + i b−a
n , i ∈ {0, . . . , n}.
Next, for i ∈ {1, . . . , n−1}, let gi : [ai−1 , ai+1 ] → R be the restriction of g to the interval [ai−1 , ai+1 ], and note that
2(b−a)
ai+1 −ai−1 = n ∈ ( 43 , 2]. Furthermore, for i ∈ {1, . . . , n−1}, let Ψgi ,ε/3 be the network approximating gi with
ε
error ε/3 as constructed above. Then, for every i ∈ {1, . . . , n − 1}, it holds that kg − Ψgi ,ε/3 kL∞ ([ai−1 ,ai+1 ]) ≤ 3
We finally record, for technical purposes, slight variations of Lemmas II.5 and II.6 to account for parallelizations
and linear combinations, respectively, of neural networks with shared input.
Lemma A.7. Let n, d, L ∈ N and, for i ∈ {1, 2, . . . , n}, let d0i ∈ N and Φi ∈ Nd,d0i with L(Φi ) = L. Then,
Pn Pn
there exists a network Ψ ∈ Nd,Pni=1 d0i with L(Ψ) = L, M(Ψ) = i=1 M(Φi ), W(Ψ) ≤ i=1 W(Φi ), B(Ψ) =
maxi B(Φi ), and satisfying
d0i
Pn
Ψ(x) = (Φ1 (x), Φ2 (x), . . . , Φn (x)) ∈ R i=1 ,
for x ∈ Rd .
Proof. The claim is established by following the construction in the proof of Lemma II.5, but with the matrix
A1 = diag(A11 , A21 , . . . , An1 ) replaced by
A11
.
Pn i
A1 = .. ∈ R( i=1 N1 )×d ,
An1
73
Lemma A.8. Let n, d, d0 , L ∈ N and, for i ∈ {1, 2, . . . , n}, let ai ∈ R and Φi ∈ Nd,d0 with L(Φi ) = L.
Pn Pn
Then, there exists a network Ψ ∈ Nd,d0 with L(Ψ) = L, M(Ψ) ≤ i=1 M(Φi ), W(Ψ) ≤ i=1 W(Φi ),
Proof. The proof follows directly from that of Lemma A.7 with the same modifications as those needed in the
proof of Lemma II.6 relative to that of Lemma II.5.
A PPENDIX B
TAIL COMPACTNESS FOR B ESOV SPACES
m
We consider the Besov space Bp,q ([0, 1]) [16] given by the set of functions f ∈ L2 ([0, 1]) satisfying
1 1 n
kf km,p,q := k(2n(m+ 2 − p ) k(hf, ψn,k i)2k=0
−1
k`p )n∈N0 k`q < ∞, (92)
with D = {ψn,k : n ∈ N0 , k = 0, . . . , 2n − 1} an orthonormal wavelet basis12 for L2 ([0, 1]) and `p denoting the
usual sequence norm
p1
P |ai |p , 1≤p<∞
i∈I
k(ai )i∈I k`p = .
i∈I |ai |, p=∞
sup
m
The unit ball in Bp,q ([0, 1]) is
m
U(Bp,q ([0, 1])) = {f ∈ L2 ([0, 1]) : kf km,p,q ≤ 1}. (93)
n n
−1
For simplicity of notation, we set an,k (f ) := hf, ψn,k i and An (f ) := (an,k (f ))2k=0 ∈ R2 , for n ∈ N0 . We
m
now want to verify that for q ∈ [1, 2] tail compactness holds for the pair (U(Bp,q ([0, 1])), D) under the ordering
PN
D = (D0 , D1 , . . . ), where Dn := {ψn,k : k = 0, . . . , 2n − 1}. To this end, we first note that owing to n=0 |Dn | =
2N +1 − 1, we have tail compactness according to (26) if there exist C, β > 0 such that for all f ∈ U(Bp,q
m
([0, 1])),
N ∈ N,
n
N 2X
X −1
f − an,k (f )ψn,k
≤ C(2N +1 )−β . (94)
n=0 k=0 L2 ([0,1])
To see that (92) implies (94), we note that by orthonormality of D,
N 2Xn
−1
∞ 2n −1
∞ n
2X −1
! 12
X
X X
X
f − an,k (f )ψn,k
=
an,k (f )ψn,k
= |an,k (f )|2
n=0 k=0 L2 ([0,1]) n=N +1 k=0 L2 ([0,1]) n=N +1 k=0
= k(kAn (f )k`2 )∞
n=N +1 k`2 .
12 The space does not depend on the particular choice of mother wavelet ψ as long as ψ has at least r vanishing moments and is in C r ([0, 1])
for some r > m. For further details we refer to Section 9.2.3 in [16].
74
As the An (f ) are finite sequences of length |Dn | = 2n , it follows, by application of Hölder’s inequality, that
1 1
kAn (f )k`2 ≤ 2n( 2 − p ) kAn (f )k`p . Together with k · k`2 ≤ k · k`q , for q ≤ 2, (92) then ensures, for all f ∈
m
U(Bp,q ([0, 1])) and q ∈ [1, 2], that
1 1 1 1
n( 2 − p )
k(kAn (f )k`2 )∞
n=N +1 k`2 ≤ k(2 kAn (f )k`p )∞
n=N +1 k`q ≤ 2
−(N +1)m
k(2n(m+ 2 − p ) kAn (f )k`p )∞
n=N +1 k`q
A PPENDIX C
TAIL COMPACTNESS FOR MODULATION SPACES
We consider tail compactness for unit balls in (polynomially) weighted modulation spaces, which, for p, q ∈
[1, ∞), are defined as follows
s
Mp,q (R) := {f : kf kMp,q
s (R) < ∞},
with
Z Z pq ! q1
kf kMp,q
s (R) := |Vw f (x, ξ)|p (1 + |x| + |ξ|)sp dx dξ ,
R R
where
Z
Vw f (x, ξ) := f (t) w(t − x)e−2πitξ dt, x, ξ ∈ R,
R
is the short-time Fourier transform of f with respect to the window function13 w ∈ S(R).
Next, let g ∈ L2 (R) with kgkL2 (R) = 1 and g(x) = g(−x) such that the Gabor dictionary G(g, 12 , 1, R) is a tight
frame [68] for L2 (R). Then, the Wilson dictionary D = {ψk,n : (k, n) ∈ Z × N0 } with
ψk,0 = Tk g, k ∈ Z,
ψk,n = √1 T k (Mn
2 2
+ (−1)k+n M−n )g, (k, n) ∈ Z × N,
is an orthonormal basis for L2 (R) (see [17, Thm. 8.5.1]). We have, for every f ∈ Mp,q
s
(R), the expansion [17,
Thm. 12.3.4]
X
f= ck,n (f )ψk,n , where ck,n (f ) = hf, ψk,n i, c(f ) ∈ `sp,q (Z × N0 ),
(k,n)∈Z×N0
13 The resulting modulation space does not depend on the specific choice of window function w as long as w is in the Schwartz space
S(R) = {f ∈ C ∞ (R) : supx∈R |xα f (β) (x)| < ∞, for all α, β ∈ N0 }, where f (n) stands for the n-th derivative of f .
75
with `sp,q (Z × N0 ) the space of sequences c ∈ RZ×N0 satisfying
! pq q1
X X
kck`sp,q (Z×N0 ) := |ck,n |p (1 + | k2 | + |n|)sp < ∞.
n∈N0 k∈Z
s
Moreover, there exists [17, Thm. 12.3.1] a constant D ≥ 1 such that, for all f ∈ Mp,q (R),
1
D kf kMp,q (R) ≤ kc(f )k`sp,q (Z×N0 ) ≤ Dkf kMp,q
s s (R) .
s
In particular, we can characterize the unit ball of Mp,q (R) according to
s
U(Mp,q (R)) = {f : kc(f )k`sp,q (Z×N0 ) ≤ D}.
We now order the Wilson basis dictionary as follows. Define D0 := {ψ0,0 } and
`−1
[
D` := {ψk,n : |k|, n ≤ `} \ Di
i=0
PN
for ` ≥ 1, and order the overall dictionary according to D = (D0 , D1 , . . . ). Owing to `=0 |D` | = (2N +1)(N +1),
s s
we have tail compactness for the pair (U(Mp,q (R)), D) if there exist C, β > 0 such that, for all f ∈ U(Mp,q (R)),
N ∈ N,
N N
X X
f − ck,n (f )ψk,n
≤ CN −β . (95)
n=0 k=−N L2 (R)
We restrict our attention to p, q ≤ 2 and use orthonormality of D and the fact that k · k`2 ≤ k · k`p , for p ≤ 2, to
s
obtain, for all f ∈ U(Mp,q (R)),
12
N
X N
X
X X
X X
|ck,n (f )|2
f − ck,n (f )ψk,n
=
ck,n (f )ψk,n
=
n=0 k=−N L2 (R)
n>N |k|>N
n>N |k|>N
L2 (R)
p
q q1
X X
p
≤ |ck,n (f )|
n>N |k|>N
pq q1
X X
≤ (1 + 32 N )−s |ck,n (f )|p (1 + | k2 | + |n|)sp
n>N |k|>N
76
R EFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in
Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available:
http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[2] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Müller, E. Säckinger, P. Simard, and
V. Vapnik, “Comparison of learning algorithms for handwritten digit recognition,” International Conference on Artificial Neural Networks,
pp. 53–60, 1995.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury,
“Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process.
Mag., vol. 29, no. 6, pp. 82–97, 2012.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M.
Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D.
Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[Online]. Available: http://www.nature.com/nature/journal/v529/n7587/abs/nature16961.html#supplementary-information
[5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available:
http://dx.doi.org/10.1038/nature14539
[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088,
pp. 533–536, Oct. 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0
[8] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[9] W. McCulloch and W. Pitts, “A logical calculus of ideas immanent in nervous activity,” Bull. Math. Biophys., vol. 5, pp. 115–133, 1943.
[10] A. N. Kolmogorov, “On the representation of continuous functions of many variables by superposition of continuous functions of one
variable and addition,” Dokl. Akad. Nauk SSSR, vol. 114, no. 5, pp. 953–956, 1957.
[11] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems, vol. 2, no. 4,
pp. 303–314, 1989. [Online]. Available: http://dx.doi.org/10.1007/BF02551274
[12] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991.
[Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
[13] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, “Optimal approximation with sparsely connected deep neural networks,” SIAM Journal
on Mathematics of Data Science, vol. 1, no. 1, pp. 8–45, 2019.
[14] D. L. Donoho, “Unconditional bases are optimal bases for data compression and for statistical estimation,” Appl. Comput. Harmon.
Anal., vol. 1, no. 1, pp. 100 – 115, 1993. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1063520383710080
[15] ——, “Unconditional bases and bit-level compression,” Appl. Comput. Harm. Anal., vol. 3, pp. 388–392, 1996.
[16] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed. USA: Academic Press, Inc., 2008.
[17] K. Gröchenig, Foundations of time-frequency analysis. Springer Science & Business Media, 2013.
[18] L. Demanet and L. Ying, “Wave atoms and sparsity of oscillatory patterns,” Appl. Comput. Harmon. Anal., vol. 23, no. 3, pp. 368–387,
2007.
[19] C. L. Fefferman, “Reconstructing a neural net from its output,” Revista Matemática Iberoamericana, vol. 10, no. 3, pp. 507–555, 1994.
[20] D. M. Elbrächter, J. Berner, and P. Grohs, “How degenerate is the parametrization of neural networks with the ReLU activation
function?” in Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, p. 7788–7799. [Online]. Available:
https://arxiv.org/abs/1905.09803
[21] V. Vlačić and H. Bölcskei, “Neural network identifiability for a family of sigmoidal nonlinearities,” Constructive Approximation, 2021.
[Online]. Available: https://arxiv.org/abs/1906.06994
[22] ——, “Affine symmetries and neural network identifiability,” Advances in Mathematics, vol. 376, no. 107485, pp. 1–72, 2021. [Online].
Available: http://www.sciencedirect.com/science/article/pii/S0001870820305132
77
[23] P. Petersen and F. Voigtlaender, “Optimal approximation of piecewise smooth functions using deep ReLU neural networks,” Neural
Networks, vol. 108, pp. 296–330, Sep. 2018.
[24] D. Yarotsky, “Error bounds for approximations with deep ReLU networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
[25] J. Schmidt-Hieber, “Nonparametric regression using deep neural networks with ReLU activation function,” Annals of Statistics, vol. 48,
no. 4, pp. 1875–1897, 2020. [Online]. Available: https://arxiv.org/abs/1708.06633
[26] M. Telgarsky, “Representation benefits of deep feedforward networks,” arXiv:1509.08101, 2015.
[27] B. Hanin and D. Rolnick, “Deep ReLU networks have surprisingly few activation patterns,” in Advances in Neural
Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 361–370. [Online]. Available: http://papers.nips.cc/paper/
8328-deep-relu-networks-have-surprisingly-few-activation-patterns.pdf
[28] D. Fokina and I. Oseledets, “Growing axons: Greedy learning of neural networks with application to function approximation,” 2019.
[Online]. Available: https://arxiv.org/abs/1910.12686
[29] C. Schwab and J. Zech, “Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions
in UQ,” Analysis and Applications, vol. 17, no. 1, pp. 19–55, 2019.
[30] J. A. A. Opschoor, P. C. Petersen, and C. Schwab, “Deep ReLU networks and high-order finite element methods,” Analysis and
Applications, vol. 18, no. 5, pp. 715–770, 2020. [Online]. Available: https://doi.org/10.1142/S0219530519410136
[31] I. Gühring, G. Kutyniok, and P. Petersen, “Error bounds for approximations with deep ReLU neural networks in W s,p norms,” Analysis
and Applications, vol. 18, no. 5, pp. 803–859, 2020. [Online]. Available: https://doi.org/10.1142/S0219530519410021
[32] M. H. Stone, “The generalized Weierstrass approximation theorem,” Mathematics Magazine, vol. 21, pp. 167–184, 1948.
[33] S. Liang and R. Srikant, “Why deep neural networks for function approximation?” International Conference on Learning Representations,
2017. [Online]. Available: https://arxiv.org/abs/1610.04161
[34] A. Gil, J. Segura, and N. M. Temme, Numerical Methods for Special Functions. Society for Industrial and Applied Mathematics, 2007.
[Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9780898717822
[35] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Transactions on Information Theory,
vol. 39, no. 3, pp. 930–945, 1993.
[36] ——, “Approximation and estimation bounds for artificial neural networks,” Mach. Learn., vol. 14, no. 1, pp. 115–133, 1994. [Online].
Available: http://dx.doi.org/10.1007/BF00993164
[37] C. K. Chui, X. Li, and H. N. Mhaskar, “Neural networks for localized approximation,” Math. Comp., vol. 63, no. 208, pp. 607–623,
1994. [Online]. Available: http://dx.doi.org/10.2307/2153285
[38] R. DeVore, K. Oskolkov, and P. Petrushev, “Approximation by feed-forward neural networks,” Ann. Numer. Math., vol. 4, pp. 261–287,
1996.
[39] E. J. Candès, “Ridgelets: Theory and applications,” Ph.D. dissertation, Stanford University, 1998.
[40] H. N. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Comput., vol. 8, no. 1, pp. 164–177,
1996.
[41] H. N. Mhaskar and C. A. Micchelli, “Degree of approximation by neural and translation networks with a single hidden layer,” Adv. Appl.
Math., vol. 16, no. 2, pp. 151–183, 1995.
[42] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5,
pp. 359–366, 1989.
[43] H. N. Mhaskar, “Approximation properties of a multilayered feedforward artificial neural network,” Advances in Computational
Mathematics, vol. 1, no. 1, pp. 61–80, Feb 1993. [Online]. Available: https://doi.org/10.1007/BF02070821
[44] K.-I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp.
183–192, 1989. [Online]. Available: //www.sciencedirect.com/science/article/pii/0893608089900038
[45] T. Nguyen-Thien and T. Tran-Cong, “Approximation of functions and their derivatives: A neural network implementation with
applications,” Appl. Math. Model., vol. 23, no. 9, pp. 687–704, 1999. [Online]. Available: //www.sciencedirect.com/science/article/pii/
S0307904X99000062
78
[46] R. Eldan and O. Shamir, “The power of depth for feedforward neural networks,” in Proceedings of the 29th Conference on Learning
Theory, 2016, pp. 907–940.
[47] H. N. Mhaskar and T. Poggio, “Deep vs. shallow networks: An approximation theory perspective,” Analysis and Applications, vol. 14,
no. 6, pp. 829–848, 2016. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S0219530516400042
[48] N. Cohen, O. Sharir, and A. Shashua, “On the expressive power of deep learning: A tensor analysis,” in Proceedings of the 29th Conference
on Learning Theory, vol. 49, 2016, pp. 698–728.
[49] N. Cohen and A. Shashua, “Convolutional rectifier networks as generalized tensor decompositions,” in Proceedings of the 33rd International
Conference on Machine Learning, vol. 48, 2016, pp. 955–963.
[50] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger, “A proof that artificial neural networks overcome the curse of dimensionality
in the numerical approximation of Black-Scholes partial differential equations,” arXiv e-prints, p. arXiv:1809.02362, Sep. 2018.
[51] J. Berner, P. Grohs, and A. Jentzen, “Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks
overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations,” SIAM Journal on
Mathematics of Data Science, vol. 2, no. 3, pp. 631–657, 2020.
[52] C. Beck, S. Becker, P. Grohs, N. Jaafari, and A. Jentzen, “Solving stochastic differential equations and Kolmogorov equations by means
of deep learning,” arXiv:1806.00421, 2018.
[53] D. Elbrächter, P. Grohs, A. Jentzen, and C. Schwab, “DNN expression rate analysis of high-dimensional PDEs: Application to option
pricing,” arXiv:1809.07669, 2018.
[54] S. Ellacott, “Aspects of the numerical analysis of neural networks,” Acta Numer., vol. 3, pp. 145–202, 1994.
[55] A. Pinkus, “Approximation theory of the MLP model in neural networks,” Acta Numer., vol. 8, pp. 143–195, 1999.
[56] R. DeVore, B. Hanin, and G. Petrova, “Neural network approximation,” arXiv:2012.14501, 2020.
[57] U. Shaham, A. Cloninger, and R. R. Coifman, “Provable approximation properties for deep neural networks,” Appl. Comput. Harmon.
Anal., vol. 44, no. 3, pp. 537–557, May 2018. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1509.html#ShahamCC15
[58] R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer, 1993.
[59] R. A. DeVore, “Nonlinear approximation,” Acta Numer., vol. 7, pp. 51–150, 1998.
[60] P. Grohs, “Optimally sparse data representations,” in Harmonic and Applied Analysis. Springer, 2015, pp. 199–248.
[61] E. Ott, Chaos in Dynamical Systems. Cambridge University Press, 2002.
[62] M. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[63] R. T. Prosser, “The ε-entropy and ε-capacity of certain time-varying channels,” Journal of Mathematical Analysis and Applications, vol. 16,
pp. 553–573, 1966.
[64] A. Kolmogorov and V. Tikhomirov, “ε-entropy and ε-capacity of sets in function spaces,” Uspekhi Mat. Nauk., vol. 14, no. 2, pp. 3–86,
1959.
[65] M. Ehler and F. Filbir, “Metric entropy, n-widths, and sampling of functions on manifolds,” Journal of Approximation Theory, vol. 225, pp. 41–57, 2018.
[66] J. Schmidt-Hieber, “Deep ReLU network approximation of functions on a manifold,” arXiv:1908.00695, 2019.
[67] H. Mhaskar, “A direct approach for function approximation on data defined manifolds,” Neural Networks, vol. 132, pp. 253–268, 2020.
[68] V. Morgenshtern and H. Bölcskei, “A short course on frame theory,” in Mathematical Foundations for Signal Processing, Communications, and Networking. Boca Raton, FL, 2012, pp. 737–789.
[69] P. Grohs, S. Keiper, G. Kutyniok, and M. Schäfer, “α-molecules,” Appl. Comput. Harmon. Anal., vol. 41, no. 1, pp. 297–336, 2016.
[Online]. Available: http://dx.doi.org/10.1016/j.acha.2015.10.009
[70] I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.
[71] E. J. Candès and D. L. Donoho, “New tight frames of curvelets and optimal representations of objects with piecewise C² singularities,” Comm. Pure Appl. Math., vol. 57, pp. 219–266, 2004.
[72] K. Guo, G. Kutyniok, and D. Labate, “Sparse multidimensional representations using anisotropic dilation and shear operators,” in Wavelets
and Splines (Athens, GA, 2005). Nashboro Press, Nashville, TN, 2006, pp. 189–201.
[73] P. Grohs and G. Kutyniok, “Parabolic molecules,” Found. Comput. Math., vol. 14, pp. 299–337, 2014.
[74] K. Gröchenig and S. Samarah, “Nonlinear approximation with local Fourier bases,” Constructive Approximation, vol. 16, no. 3, pp. 317–331,
Jul. 2000.
[75] D. L. Donoho, M. Vetterli, R. A. DeVore, and I. Daubechies, “Data compression and harmonic analysis,” IEEE Transactions on Information
Theory, vol. 44, no. 6, pp. 2435–2476, 1998.
[76] P. Grohs, A. Klotz, and F. Voigtlaender, “Phase transitions in rate distortion theory and deep learning,” arXiv:2008.01011, 2020.
[77] A. Hinrichs, I. Piotrowska-Kurczewski, and M. Piotrowski, “On the degree of compactness of embeddings between weighted modulation spaces,” J. Funct. Spaces Appl., vol. 6, pp. 303–317, 2008.
[78] P. Grohs, S. Keiper, G. Kutyniok, and M. Schäfer, “Cartoon approximation with α-curvelets,” J. Fourier Anal. Appl., vol. 22, no. 6, pp.
1235–1293, 2016. [Online]. Available: http://dx.doi.org/10.1007/s00041-015-9446-6
[79] D. L. Donoho, “Sparse components of images and optimal atomic decompositions,” Constr. Approx., vol. 17, no. 3, pp. 353–382, 2001.
[Online]. Available: http://dx.doi.org/10.1007/s003650010032
[80] J. Munkres, Topology. Prentice Hall, 2000.
[81] M. Unser, “Ten good reasons for using spline wavelets,” Wavelet Applications in Signal and Image Processing V, vol. 3169, pp. 422–431,
1997.
[82] S. Mallat, “Multiresolution approximations and wavelet orthonormal bases of L²(R),” Trans. Amer. Math. Soc., vol. 315, no. 1, pp. 69–87, Sep. 1989.
[83] C. K. Chui and J.-Z. Wang, “On compactly supported spline wavelets and a duality principle,” Trans. Amer. Math. Soc., vol. 330, no. 2,
pp. 903–915, Apr. 1992.
[84] C. L. Fefferman, “The uncertainty principle,” Bull. Amer. Math. Soc. (N.S.), vol. 9, no. 2, pp. 129–206, 1983. [Online]. Available:
https://doi.org/10.1090/S0273-0979-1983-15154-6
[85] H. G. Feichtinger, “On a new Segal algebra,” Monatshefte für Mathematik, vol. 92, pp. 269–289, 1981.
[86] A. Zygmund, Trigonometric Series. Cambridge University Press, 2002.
[87] C. Frenzen, T. Sasao, and J. T. Butler, “On the number of segments needed in a piecewise linear approximation,” Journal of Computational
and Applied Mathematics, vol. 234, no. 2, pp. 437–446, 2010.