
arXiv:2403.04807v1 [cs.LG] 6 Mar 2024

Mathematics of Neural Networks

Bart M.N. Smets

November 12, 2022


“You insist that there is something a machine cannot do. If you
tell me precisely what it is a machine cannot do, then I can
always make a machine which will do just that.”
– John von Neumann

“As a technical discussion grows longer, the probability of


someone suggesting deep learning as a solution approaches 1.”
– comment on Youtube anno 2020

Thanks to my colleagues Gijs Bellaard, Remco Duits, Jim Portegies and Alessandro Di Bucchianico for valuable input and feedback during the writing of these lecture notes.

Contents

Introduction

1 The Basics
  1.1 Supervised Learning
  1.2 Artificial Neurons & Activation Functions
  1.3 Shallow Networks
  1.4 Stochastic Gradient Descent
  1.5 Training

2 Deep Learning
  2.1 Deep Neural Networks
    2.1.1 Feed Forward Networks
    2.1.2 Vanishing and Exploding Gradients
    2.1.3 Scaling to High Dimensional Data
  2.2 Initialization
    2.2.1 Stochastic Initialization
    2.2.2 Xavier Initialization
    2.2.3 Those are a lot of Assumptions
  2.3 Convolutional Neural Networks
    2.3.1 Discrete Convolution
    2.3.2 Padding
    2.3.3 Max Pooling
    2.3.4 Convolutional Layers
    2.3.5 Classification Example: MNIST & LeNet-5
  2.4 Automatic Differentiation & Backpropagation
    2.4.1 Notation and an Example
    2.4.2 Automation & the Computational Graph
    2.4.3 Implementing Operations
  2.5 Adaptive Learning Rate Algorithms
    2.5.1 Adagrad
    2.5.2 RMSProp
    2.5.3 Adam
    2.5.4 Which variant to use?

3 Equivariance
  3.1 Manifolds
    3.1.1 Characterization
    3.1.2 Smooth Maps
  3.2 Lie Groups
    3.2.1 Basic Definitions
    3.2.2 Lie Subgroups
    3.2.3 Group Actions
    3.2.4 Equivariant Maps and Operators
    3.2.5 Homogeneous Spaces
  3.3 Linear Operators
    3.3.1 Integration
    3.3.2 Equivariant Linear Operators
  3.4 Building a Rotation-translation Equivariant CNN
    3.4.1 Lifting Layer
    3.4.2 Group Convolution Layer
    3.4.3 Projection
    3.4.4 Discretization
  3.5 Tropical Operators
    3.5.1 Semirings
    3.5.2 Tropical Semiring
    3.5.3 Equivariant Tropical Operators

Introduction

The last decade has seen great experimental progress being made in machine learning, spear-
headed by deep learning methods that make use of so-called deep neural networks. Many
challenging, high-dimensional tasks that were previously beyond reach have become feasible
with remarkably simple (in the algorithmic sense) techniques coupled with modern computa-
tional resources. Particularly in the fields of computer vision and natural language processing,
deep learning is currently the go-to tool.
Machine learning is usually positioned under the umbrella of artificial intelligence. While
artificial intelligence is a fairly nebulous and broad term, the term machine learning refers
specifically to algorithms that can improve their performance at a given task when presented
with more data about the problem. Machine learning algorithms are used in a wide variety of
applications such as speech recognition and computer vision, where it is difficult or impossible
to develop conventional algorithms to perform the required task. These learning algorithms
generally start from a very general model that is then trained on sample data to learn how
to perform a specific task. When the underlying model being used is a (deep) neural network
we speak of deep learning.
The idea of using computational models that are inspired by the workings of biological neurons
dates as far back as McCulloch and Pitts (1943). Over the following decades more of the
ingredients that we think of as standard today were added: training the network on data was
tried by Ivakhnenko and Lapa (1966), Fukushima (1987) originated the ancestor of our current
convolutional neural networks and LeCun, Boser, et al. (1989) introduced backpropagation as
a training mechanism for neural networks.
However, none of these efforts led to a breakthrough in the use of neural networks in practice and
such models were mainly regarded as an academic curiosity during this time period. This state of
affairs only changed at the start of the new millennium with the appearance of programmable
GPUs (Graphics Processing Units): while initially designed with rendering 3D graphics in mind,
these devices could be leveraged for other purposes, such as by Oh and Jung (2004) for neural
networks. This eventually led to such breakthroughs as on the ImageNet (Deng et al. 2009)
image classification challenge by Krizhevsky, Sutskever, and G. E. Hinton (2012), where neural
networks managed to dominate other, more traditional, techniques. These events can be thought
of as the start of the modern deep learning era.
Despite more than a decade of impressive experimental results, theoretical understanding of
why deep learning works as well as it does is lacking. This presents an opportunity both for
understanding and improvement, particularly for mathematicians.
This course presents an introduction to neural networks from the point of view of a mathe-
matician. We cover the basic vocabulary and functioning of neural networks in Chapter 1. In
Chapter 2 we look at deep neural networks and the associated techniques that allow them to
work. Chapter 3 covers a novel application of geometry to neural networks. We will discuss

how the theory of Lie groups and homogeneous spaces can be leveraged to endow neural net-
works with certain structural symmetries, i.e. make them equivariant under certain geometric
transformations.

Chapter 1

The Basics

1.1 Supervised Learning

While there are different kinds of machine learning, we will be focusing on supervised learning.
Supervised learning is a machine learning paradigm where the available data consists of a
pairing of inputs with known, correct outputs. What is unknown is exactly how the mapping
between those inputs and outputs works. The goal is to infer, from the available data, the
general structure of the mapping in the hope that this will generalize to unseen situations.
Formally we can state the problem as follows.

The supervised learning problem Given an unknown function 𝑓 : 𝑋 → 𝑌 between spaces 𝑋


and 𝑌, find a good approximation of 𝑓 using only a dataset of 𝑁 samples:
\[ \mathcal{D} = \{ (x_i, y_i) \}_{i=1}^{N} \quad \text{with } y_i = f(x_i) \text{ for all } i = 1 \ldots N. \]

The space 𝑋 is also called the feature space and its elements are referred to as feature vectors.
The space 𝑌 is also called the label space and its elements are referred to as labels.

Example 1.1. Let 𝑋 be the space of all images of cats and dogs and 𝑓 the classifier that maps
to 𝑌 = {"dog", "cat"}, which we can express numerically as 𝑋 = [0, 1]^{3×𝐻×𝑊} (i.e. an image of
height 𝐻 and width 𝑊 with 3 color channels) and 𝑌 = [0, 1]² (one probability for each of the
two classes).

The general approach to solving this problem consists of three main steps.

1. Choose a model 𝐹 : 𝑋 × 𝑊 → 𝑌 parametrized by a parameter space 𝑊 (𝑊 for weights, since that is how parameters in NNs are often referred to).
2. We need to quantify the quality of the model’s output, so we need to choose a loss function
ℓ : 𝑌 × 𝑌 → ℝ, where ℓ (𝑦1 , 𝑦2 ) indicates “how different” 𝑦1 and 𝑦2 are.
3. Based on the dataset and the loss function choose 𝑤 ∈ 𝑊 so that 𝐹( · ; 𝑤) is the “best”
possible approximation of the target function 𝑓 .

This high-level approach leaves some open questions however.


• How to choose the model?


• How to choose the loss function?
• How to optimize?

None of these questions have canonical answers and are largely handled by trial-and-error
and heuristics simply because we have to resolve them before we can make any progress.
Regardless, these choices will have a large impact on the final result even though they are in
no way supported by the available data or some form of first principle. The collective set of
assumptions we make to be able to proceed with the problem is called inductive bias in the
machine learning field.
For the purpose of this course we will of course pick neural networks as our models of choice
and we will talk about those later. The second thing we have to do is choose a loss function. A
loss function is a function ℓ : 𝑌 × 𝑌 → ℝ, it works much like a metric, but it is less restrictive.

• Usually but not always 𝑌 × 𝑌 → ℝ+


• Since we want to minimize loss, the minimum should exist so that min 𝑦∈𝑌 ℓ (𝑦0 , 𝑦) is well
posed.
• Differentiable (a.e. at least) would be necessary since we want to do gradient descent.
• Metric properties like identity of indiscernibles, symmetry and triangle inequality need
not hold.

Having chosen a model and a loss function we move on to optimization, or: what is the “best”
choice of 𝑤 ∈ 𝑊? The straightforward (but not necessarily preferred) answer: minimize the
loss on the dataset, i.e. find:
\[ w^* = \operatorname*{arg\,min}_{w \in W} \sum_{i=1}^{N} \ell\big( F(x_i; w), y_i \big). \]

Example 1.2 (Linear least squares). Take
\[ F(x; w) = \sum_{j=1}^{n} w_j \varphi_j(x) \]
and ℓ(𝑦1, 𝑦2) = |𝑦1 − 𝑦2|² for some basis functions {𝜑 𝑗 } 𝑗 . Then we get the familiar linear least squares setting.
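As an illustration, the following is a minimal NumPy sketch of Example 1.2 (not part of the original notes): it fits 𝐹(𝑥; 𝑤) = Σ 𝑤 𝑗 𝜑 𝑗 (𝑥) by ordinary least squares; the monomial basis functions, the sine target and the sample size are assumptions made for the example.

```python
# Minimal sketch of Example 1.2: linear least squares with monomial basis functions.
import numpy as np

def fit_linear_least_squares(x, y, n_basis=4):
    # Design matrix Phi[i, j] = phi_j(x_i) = x_i**j for j = 0, ..., n_basis - 1.
    Phi = np.vander(x, N=n_basis, increasing=True)
    # Minimizing sum_i |F(x_i; w) - y_i|^2 over w is an ordinary least-squares problem.
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)            # stand-in for the unknown target f
w = fit_linear_least_squares(x, y)   # the "best" parameters for this model and loss
```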

The “best” 𝑤 ∈ 𝑊 is not necessarily the one that minimizes the loss on the dataset; overfitting is often an issue in any optimization problem. This brings us to regularization. We will discuss regularization techniques for NNs more in the future, but for now we mention that parameter regularization is a common technique from regression that is often used in NNs. This type of regularization is characterized by the addition of a penalty term to the data loss that discourages parameter values from straying into undesirable areas (such as becoming too large). The modified optimization problem becomes:
\[ w^* = \operatorname*{arg\,min}_{w \in W} \sum_{i=1}^{N} \ell\big( F(x_i; w), y_i \big) + \lambda\, C(w) \]

with 𝜆 > 0 and 𝐶 : 𝑊 → ℝ+ to penalize complexity in some fashion.


Example 1.3 (Tikhonov regularization). 𝑊 = ℝ𝑛 and 𝐶(𝑤) = ∥𝑤∥ 2 . In the context of neural
networks this type of parameter regularization is also called weight decay.

Regression & classification Supervised learning should sound familiar by now. Indeed if
𝑋 = ℝ𝑛 and 𝑌 = ℝ𝑚 then it amounts to regression.
Regression is a large part of supervised learning but it is more generally formulated so that it
encompasses other tasks such as classification.
Classification is the ML designation for (multiclass) logistic regression where 𝑋 = ℝ𝑛 and we
try to assign each 𝑥 ∈ 𝑋 to one of 𝑚 classes. The numeric output for this type of problem is
normally a discrete probability distribution over 𝑚 classes:

\[ Y = \Big\{ (y_1, \ldots, y_m) \in [0,1]^m \ \Big|\ \sum_{i=1}^{m} y_i = 1 \Big\}. \]

Example 1.1 is of this type.

Remark 1.4 (Statistical learning theory viewpoint). In SLT the assumption is made that the
data samples (𝑥 𝑖 , 𝑦 𝑖 ) are drawn i.i.d. from some probability distribution 𝜇 on 𝑋 × 𝑌. Think
about why this is a fairly big leap. What we are then interested in is minimizing the population
risk:
\[ R(w) := \mathbb{E}_{(x,y) \sim \mu}\big[ \ell(F(x; w), y) \big] := \int_{X \times Y} \ell(F(x; w), y)\, \mathrm{d}\mu(x, y). \]
The goal would be finding the parameter set that minimizes this population risk:

\[ w^* = \operatorname*{arg\,min}_{w \in W} R(w). \]

But in reality we do not know 𝜇 and so we cannot even calculate the population risk, let
alone minimize it. So we do the next best thing: minimize the empirical risk:
\[ \hat{R}(w) := \frac{1}{N} \sum_{i=1}^{N} \ell\big( F(x_i; w), y_i \big). \]

The parameter set that minimizes the empirical risk is called the empirical minimizer:
\[ \hat{w} = \operatorname*{arg\,min}_{w \in W} \hat{R}(w), \]

which is the same thing as minimizing the loss on the dataset in the supervised learning
setting.
When we add a regularization term as before it is called structural risk minimization:
\[ \hat{w} = \operatorname*{arg\,min}_{w \in W} \hat{R}(w) + \lambda\, C(w). \]

SLT is then concerned with studying things such as bounds on the excess risk $R(\hat{w}) - R(w^*)$. For more see 2MMS80 Statistical Learning Theory.


1.2 Artificial Neurons & Activation Functions

As the name suggests, artificial neural networks are inspired by biology. Just like their biological
counterparts their constituent parts are artificial neurons. The basic structure of a biological
neuron is illustrated in Figure 1.1. Each neuron can send and receive signals to and from other
neurons so that together they form a network.

Figure 1.1: A simplified biological neuron. The dendrites on the left receive electric signals from other neurons; once a certain threshold is reached the neuron will fire a signal along its axon and through its synapses on the right relay a signal to other neurons.

In similar fashion artificial neural networks consist of artificial neurons. Each artificial neuron
takes some inputs (usually in the form of real numbers) and produces one or more outputs
(again, usually real numbers) that it passes to other neurons. The most common model neuron
is given by an affine transform followed by a non-linear function. Let 𝒙 ∈ ℝ𝑛 be the input signal
and 𝒚 ∈ ℝ𝑚 be the output signal then we calculate

𝒚 = 𝜎(𝐴𝒙 + 𝒃), (1.1)

where 𝐴 ∈ ℝ𝑚×𝑛 , 𝒃 ∈ ℝ𝑚 and 𝜎 is a choice of activation function. The components of the matrix
𝐴 are referred to as the linear weights, the vector 𝒃 is called the bias. The intermediate value
𝐴𝒙 + 𝒃 is sometimes referred to as the activation. The activation function 𝜎 is also sometimes
called the transfer function or the non-linearity.
Inputs and outputs need not span the real numbers. Depending on the application we could
encounter:

• {0, 1}: binary,


• ℝ+ : non-negative reals,
• [0, 1]: intervals (probabilities for example),
• ℂ: complex,
• etc.

A historically significant choice of activation function is the Heaviside function, given by
\[ H(x) := \mathbb{1}_{x \geq 0}(x) := \begin{cases} 1 & \text{if } x \geq 0, \\ 0 & \text{else.} \end{cases} \]

A neuron of the type (1.1) that uses the Heaviside function as its activation function is called a
perceptron. Let us see what we can do with it. Let 𝒘 ∈ ℝ𝑛 and 𝑏 ∈ ℝ then a neuron with input


𝒙 ∈ ℝⁿ and output 𝑦 ∈ ℝ would look like
\[ \mathcal{N}(\boldsymbol{x}) = H(\boldsymbol{w}^T \boldsymbol{x} + b) = \begin{cases} 1 & \text{if } \boldsymbol{w}^T \boldsymbol{x} \geq -b, \\ 0 & \text{if } \boldsymbol{w}^T \boldsymbol{x} < -b. \end{cases} \]
This is nothing but a linear binary classifier on ℝⁿ, since 𝒘ᵀ𝒙 = −𝑏 is a hyperplane in ℝⁿ. This hyperplane divides the space into two halves and assigns the value 0 to one half and 1 to the other.
Example 1.5 (Boolean gate). We can (amongst other things) use a perceptron to model some
elementary Boolean logic. Let 𝑥 1 , 𝑥2 ∈ {0, 1} and let AND(𝑥 1 , 𝑥2 ) := 𝐻(𝑥 1 + 𝑥 2 − 1.5) then the
neuron behaves as follows.

𝑥1 𝑥2 AND(𝑥 1 , 𝑥2 )
0 0 0
0 1 0
1 0 0
1 1 1
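The following is a small sketch (not from the notes) of this AND perceptron in NumPy; the weights (1, 1) and bias −1.5 are read off from the definition of AND(𝑥 1 , 𝑥2 ) above.

```python
# A perceptron realizing the Boolean AND gate via the Heaviside activation.
import numpy as np

def heaviside(x):
    return np.where(x >= 0, 1, 0)

def perceptron(x, w, b):
    return heaviside(w @ x + b)

w, b = np.array([1.0, 1.0]), -1.5
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(np.array([x1, x2]), w, b))  # matches the truth table
```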

The Heaviside function is an example of a scalar or pointwise activation function. Often when we use a scalar function as an activation function we abuse notation and let it accept vectors (and matrices) as follows. Let 𝜎 : ℝ → ℝ, then we allow
\[ \sigma\left( \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \right) \equiv \begin{pmatrix} \sigma(x_1) \\ \vdots \\ \sigma(x_n) \end{pmatrix}. \]
We list some commonly used scalar activation functions which are also illustrated in Figure 1.2.

• Rectified Linear Unit (ReLU): arguably the most used activation function in modern
neural networks, it is calculated as
𝜎(𝜆) = ReLU(𝜆) := max{0, 𝜆}.

• Sigmoid (also known as logistic sigmoid or soft-step):


\[ \sigma(\lambda) := \frac{1}{1 + e^{-\lambda}}. \]
The sigmoid was commonly used as activation function in early neural networks, which
is the reason that activations functions in general are still often labeled with a 𝜎.
• Hyperbolic tangent: very similar to the sigmoid, it is given by
\[ \tanh(\lambda) := \frac{e^{\lambda} - e^{-\lambda}}{e^{\lambda} + e^{-\lambda}}. \]
• Swish: a more recent choice of activation function that can be thought of as a smooth
variant of the ReLU. It is given by the multiplication of the input itself with the sigmoid
function:
\[ \mathrm{swish}_\beta(\lambda) := \lambda\, \sigma(\beta\lambda) = \frac{\lambda}{1 + e^{-\beta\lambda}}, \]
where 𝛽 > 0. The 𝛽 parameter is usually chosen to be 1 but could be treated as a trainable
parameter if desired. In case of 𝛽 = 1, this function is also called the sigmoid-weighted
linear unit or SiLU.


Figure 1.2: Some common scalar activation functions. From left to right: the rectified linear unit, the logistic sigmoid, the hyperbolic tangent and the swish function with 𝛽 = 1.

Activation functions need not be scalar; we list some common multivariate functions.

• Softmax, also known as the normalized exponential function: softmax : ℝⁿ → [0, 1]ⁿ is given by
\[ \mathrm{softmax}\left( \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \right) := \frac{1}{\sum_{i=1}^{n} e^{x_i}} \begin{pmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{pmatrix}. \]
Softmax has the useful property that its output is a discrete probability distribution, i.e. each value is a non-negative real in the range [0, 1], and all the values in its output add up to exactly 1; a short implementation sketch is given after these lists.
• Maxpool: here each output is the maximum of a certain subset of the inputs: the function maxpool : ℝⁿ → ℝᵐ is given by
\[ \mathrm{maxpool}\left( \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \right) := \begin{pmatrix} \max_{j \in I_1} x_j \\ \max_{j \in I_2} x_j \\ \vdots \\ \max_{j \in I_m} x_j \end{pmatrix}, \]
where for each 𝑖 ∈ {1, . . . , 𝑚} we have an 𝐼 𝑖 ⊂ {1, . . . , 𝑛} that specifies over which inputs to take the maximum for each output. Maxpooling can easily be generalised by replacing the max operation with min, the average, etc.
• Normalization: sometimes it is desirable to re-center and re-scale a signal:
\[ \mathrm{normalize}\left( \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} \right) := \begin{pmatrix} \frac{x_1 - \mu}{\sigma} \\ \frac{x_2 - \mu}{\sigma} \\ \vdots \\ \frac{x_n - \mu}{\sigma} \end{pmatrix}, \]
where 𝜇 = 𝔼[𝒙] and 𝜎² = Var(𝒙). There are many variants on normalization where the difference is how 𝜇 and 𝜎 are computed: over time, over subsets of the incoming signals, etc.

All the previous examples of activation functions are deterministic, but stochastic activation
functions are also used.


• Dropout is a stochastic function that is often used during the training process but is removed once the training is finished. It works by randomly setting individual values of a signal to zero with probability 𝑝:
\[ \big( \mathrm{dropout}_p(\boldsymbol{x}) \big)_i := \begin{cases} 0 & \text{with probability } p, \\ x_i & \text{with probability } 1 - p. \end{cases} \]
• Heatbath is a scalar function that outputs 1 or −1 with a probability that depends on the input:
\[ \mathrm{heatbath}(\lambda) := \begin{cases} 1 & \text{with probability } \frac{1}{1 + e^{-\lambda}}, \\ -1 & \text{otherwise.} \end{cases} \]
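As a small illustration, here is a NumPy sketch (not part of the notes) of two of the functions above, softmax and dropout; the max-subtraction in softmax is a standard numerical-stability trick and does not change the output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max(x) to avoid overflow; result unchanged
    return e / e.sum()               # entries lie in [0, 1] and sum to 1

def dropout(x, p=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p  # keep each entry with probability 1 - p
    return np.where(mask, x, 0.0)

x = np.array([1.0, 2.0, 3.0])
print(softmax(x), softmax(x).sum())  # a discrete probability distribution
print(dropout(x, p=0.5))
```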

All the activation functions we have seen are essentially fixed functions; swish and dropout have a parameter, but it is usually fixed to some chosen value. That means that the trainable parameters of a neural network are usually the linear weights and biases. There is however no a-priori reason why that needs to be the case; in fact we will see a class of non-linear operators with trainable
parameters at the end of Chapter 3. Regardless, having parameters in the non-linear part of a
network is somewhat rare in practice at the time of this writing.

1.3 Shallow Networks

While we are interested in deep networks we start out with a look at shallow networks. Because
of their simplicity we can still approach them constructively and gain some intuition about
neural networks in general along the way. You can also think about the study of shallow
networks as the study of single layers of deep networks.
Let us consider a shallow ReLU network with scalar in- and output. Let 𝒘 = (𝒂, 𝒃, 𝒄) ∈ 𝑊 =
(ℝ𝑁 )3 be our set of parameters for some 𝑁 ∈ ℕ , then we define our model 𝐹 : ℝ × 𝑊 → ℝ as
\[ F(x; \boldsymbol{w}) := \sum_{i=1}^{N} c_i\, \sigma(a_i x + b_i), \tag{1.2} \]

where 𝜎 is the ReLU. Diagrammatically this network is represented in Figure 1.3.

Figure 1.3: Diagrammatic representation of a shallow ℝ → ℝ neural network per (1.2). In deep learning literature the input and output of a network are often referred to as the input unit respectively the output unit. The intermediate values are often called the hidden units. What is commonly referred to as the width and depth of the network is also indicated.

We will restrict ourselves to 𝑥 ∈ [0, 1] for the time being. In practice this would not be much of
a restriction since real world data is compactly supported.


Let us explore what types of functions our network can express: the output is a linear combi-
nation of functions of the type
𝑥 ↦→ 𝜎(𝑎𝑥 + 𝑏).
Depending on the value of 𝑎, this gives us one of the following types of functions.

(Plots of the three cases: for 𝑎 = 0 the constant function 𝑦 = 𝜎(𝑏); for 𝑎 > 0 a ramp that is zero up to 𝑥 = −𝑏/𝑎 and then increases with slope 𝑎; for 𝑎 < 0 a ramp that decreases with slope 𝑎 until 𝑥 = −𝑏/𝑎 and is zero afterwards.)

All three of these classes of functions are piecewise linear functions (or piecewise affine functions
really), so any linear combination of these functions would again be a piecewise linear function.
Hence, our model (1.2) is really just a particular parameterization for a piecewise linear function
on [0, 1].
Can we then represent any piecewise linear function on [0, 1] by a shallow ReLU network? Let
𝑓 : [0, 1] → ℝ be a piecewise linear function with 𝑁 pieces. We will denote the inflection points
with
0 = 𝛽1 < 𝛽2 < . . . < 𝛽 𝑁+1 = 1
and the slopes (i.e. the constant derivatives of each piece) with 𝛼1 , . . . , 𝛼 𝑁 . An example of this
setup is illustrated in Figure 1.4.

Figure 1.4: Example of a piecewise linear function on [0, 1] with 4 pieces. The locations of the inflection points are labeled with 𝛽's and the slope of each piece is denoted with an 𝛼.

Now choose a network of the type in (1.2) with 𝑁 + 1 neurons:
\[ F(x) := \sum_{i=1}^{N+1} c_i\, \sigma(a_i x + b_i). \]

Next we choose the parameters 𝒂, 𝒃 and 𝒄 of the network as:
\[ a_{N+1} = 0, \qquad b_{N+1} = 1, \qquad c_{N+1} = f(0), \]
\[ a_1 = \ldots = a_N = 1, \qquad b_i = -\beta_i \ (\text{for } i = 1 \ldots N), \qquad c_1 = \alpha_1, \qquad c_i = \alpha_i - \alpha_{i-1} \ (\text{for } i = 2 \ldots N). \]

This turns our model into
\[ F(x) = f(0) + \alpha_1 x + \sum_{i=2}^{N} (\alpha_i - \alpha_{i-1})\, \sigma(x - \beta_i). \tag{1.3} \]


When we examine the third term of (1.3) we see that each ReLU term vanishes until 𝑥
reaches the appropriate threshold, i.e. 𝜎(𝑥 − 𝛽 𝑖 ) = 0 until 𝑥 > 𝛽 𝑖 . Hence for 𝑥 ∈ [𝛽 1 , 𝛽2 ] we
get the function 𝑥 ↦→ 𝑓 (0) + 𝛼 1 𝑥 which exactly matches the function 𝑓 in that interval. As we
proceed to the [𝛽2 , 𝛽 3 ] interval the first term of the right hand sum becomes non-zero and the
model becomes
\[ \begin{aligned} x &\mapsto f(0) + \alpha_1 x + (\alpha_2 - \alpha_1)(x - \beta_2) \\ &= f(0) + \alpha_1 x - \alpha_1 x + \alpha_1 \beta_2 + \alpha_2 (x - \beta_2) \\ &= f(\beta_2) + \alpha_2 (x - \beta_2), \end{aligned} \]
which is exactly 𝑓 in the interval [𝛽 2 , 𝛽3 ]. We can keep advancing along the 𝑥-axis like this and
every time we pass an inflection point a new term will activate and bend the line towards a new
heading. The model can effectively be rewritten as
\[ F(x) = \begin{cases} f(0) + \alpha_1 x & \text{if } x \in [\beta_1, \beta_2], \\ f(\beta_2) + \alpha_2 (x - \beta_2) & \text{if } x \in [\beta_2, \beta_3], \\ \qquad \vdots \\ f(\beta_N) + \alpha_N (x - \beta_N) & \text{if } x \in [\beta_N, \beta_{N+1}], \end{cases} \]
which matches 𝑓 exactly. Hence, we may conclude that on compact intervals the shallow scalar
ReLU neural networks are exactly the space of piecewise linear functions.
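A small NumPy sketch (not part of the notes) of this construction: given the slopes 𝛼 𝑖 , the inflection points 𝛽 𝑖 and the value 𝑓 (0), it builds 𝐹 as in (1.3); the particular slopes and inflection points below are made up for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def shallow_relu_from_pwl(f0, alphas, betas):
    """Return F(x) = f(0) + alpha_1*x + sum_{i>=2} (alpha_i - alpha_{i-1}) * relu(x - beta_i)."""
    def F(x):
        out = f0 + alphas[0] * x
        for a_prev, a_cur, b in zip(alphas[:-1], alphas[1:], betas[1:-1]):
            out = out + (a_cur - a_prev) * relu(x - b)
        return out
    return F

betas = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # beta_1 .. beta_5 (4 pieces)
alphas = np.array([1.0, -2.0, 3.0, 0.5])        # slopes alpha_1 .. alpha_4
F = shallow_relu_from_pwl(f0=0.0, alphas=alphas, betas=betas)
print(F(np.linspace(0.0, 1.0, 5)))              # reproduces the piecewise linear f
```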
Piecewise linear functions are a simple class of functions but can be used to approximate many
other classes of functions to an arbitrary degree, as the following lemma shows.

Lemma 1.6. Let 𝑓 ∈ 𝐶([0, 1], ℝ), then for all 𝜀 > 0 there exists a piecewise linear function 𝐹 so that
\[ \sup_{x \in [0,1]} | f(x) - F(x) | < \varepsilon. \]

Proof. Let 𝜀 > 0 be arbitrary. Since 𝑓 is a continuous function on a compact domain it is also uniformly continuous on said domain, i.e.
\[ \exists \delta > 0\ \forall x_1, x_2 \in [0,1] : |x_1 - x_2| < \delta \ \Rightarrow\ |f(x_1) - f(x_2)| < \frac{\varepsilon}{2}. \]
Now choose 𝑁 > 1/𝛿 and partition [0, 1] as 𝑥 𝑖 = 𝑖/𝑁 for 𝑖 = 0 . . . 𝑁. Define the piecewise linear function 𝐹 as
\[ F(x) = \sum_{i=1}^{N} \mathbb{1}_{[x_{i-1}, x_i)}(x) \left( f(x_{i-1}) + \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}}\, (x - x_{i-1}) \right). \]
Note that for all 𝑖 = 0 . . . 𝑁 we have 𝐹(𝑥 𝑖 ) = 𝑓 (𝑥 𝑖 ). Now let 𝑥 ∈ [0, 1] be arbitrary; 𝑥 will always fall inside some interval [𝑥 𝑖−1 , 𝑥 𝑖 ) where we have
\[ \begin{aligned} |f(x) - F(x)| &\leq |f(x) - f(x_i)| + |f(x_i) - F(x)| \\ &< \frac{\varepsilon}{2} + |F(x_i) - F(x)| \\ &\leq \frac{\varepsilon}{2} + |F(x_i) - F(x_{i-1})| \\ &= \frac{\varepsilon}{2} + |f(x_i) - f(x_{i-1})| \\ &< \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \end{aligned} \]


Since shallow scalar ReLU networks represent the piecewise linear functions on [0, 1] and those
piecewise linear functions are dense in 𝐶([0, 1]) we get the following.
Corollary 1.7. Shallow scalar ReLU neural networks of arbitrary width (1.2) are universal
approximators of 𝐶([0, 1]) in the supremum norm.

This corollary is our first universality result. In the context of deep learning, universality is
the study of what classes of functions can be approximated arbitrarily well by particular neural
networks architectures. Notice that Corollary 1.7 has 4 ingredients:

1. the type of neural network (shallow scalar ReLU network),


2. the growth direction of the network (width),
3. the space of functions to approximate (𝐶([0, 1])),
4. and how the approximation is measured (with the supremum norm).

Generalizing this universal approximation result to ℝ𝑛 is not possible with just one layer, one of
the reasons deep networks are necessary. In general constructive proofs such as for Lemma 1.6
are not possible/available and we will have to content ourselves with existence results.
Universality is theoretically interesting since it tells us that neural networks can in principle
closely approximate most reasonable types of functions (continuous, 𝐿 𝑝 , etc.), thus explaining to
some degree why they are powerful. But universality does not consider many other important
facets of neural networks in practice:

• economy of representation: how does the number of parameters scale with the desired
accuracy,
• economy of finding a good approximation: just because a good approximation exists does
not mean it is easy to find.

We will not discuss universality further for now. Just remember that, under some mild assump-
tions, neural networks are universal approximators.

1.4 Stochastic Gradient Descent

Recall the elements of supervised learning:

• a dataset D = {(𝑥 𝑖 , 𝑦 𝑖 ) ∈ 𝑋 × 𝑌 : 𝑖 = 1 . . . 𝑁 } for some input space 𝑋 and output space 𝑌,
• a model 𝐹 : 𝑋 × 𝑊 → 𝑌 for some parameter space 𝑊,
• and a loss function ℓ : 𝑌 × 𝑌 → ℝ.

We can then look at the total loss over the dataset for a given choice of parameters:
\[ \ell_{\mathrm{total}}(w) := \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \ell\big( F(x; w), y \big). \tag{1.4} \]

Note how the loss is a function of the parameters 𝑤 only; it could also include some additional
regularization terms (as in Section 1.1) but we will leave those details aside for now.


Knowing little about 𝐹 we cannot find a minimum directly, but assuming everything is differ-
entiable we can use gradient descent:

𝑤 𝑡 = 𝑤 𝑡−1 − 𝜂 ∇𝑤 ℓtotal (𝑤 𝑡−1 ),

where 𝜂 > 0 is called the learning rate and where we chose some 𝑤0 ∈ 𝑊 as our starting point.
Sadly, doing this calculation is not practical in the real world due to:

• the datasets being huge,


• the models having an enormous number of parameters.

Just calculating ℓ total is a major undertaking, calculating ∇𝑤 ℓtotal is entirely out of the question.
Instead, we take a divide and conquer approach. We randomly divide the dataset into (roughly)
equal batches and tackle the problem batch by batch. Denote a batch by an index set 𝐼 ⊂
{1 . . . 𝑁 }, then we consider the loss over that single batch:

\[ \ell_I(w) := \frac{1}{|I|} \sum_{i \in I} \ell\big( F(x_i; w), y_i \big). \tag{1.5} \]

Say we divide the dataset into batches 𝐼1 , 𝐼2 , . . . , 𝐼𝐵 , we can then split our gradient descent step
into 𝐵 smaller steps:
𝑤 𝑡 = 𝑤 𝑡−1 − 𝜂 ∇𝑤 ℓ 𝐼𝑡 (𝑤 𝑡−1 ) (1.6)
where we again assume some starting point 𝑤 0 ∈ 𝑊. When we reach 𝑡 = 𝐵 and have exhausted
all the batches we say we have completed an epoch. Subsequently, we generate a new set of
random batches and repeat the process. This process is called stochastic gradient descent, or
SGD for short, the stochastic referring to the random selection of batches.
If the batch 𝐼 is uniformly drawn then

𝔼 [∇𝑤 ℓ 𝐼 (𝑤)] = ∇𝑤 ℓtotal (𝑤),

i.e. the gradient of a uniformly drawn random batch is an unbiased estimator of the gradient
of the full dataset.

Remark 1.8 (Higher order methods). Gradient descent (stochastic or not) is a first order
method since it relies only on the first order derivatives. Higher order methods are rarely
used with neural networks because of the large number of parameters. A model with 𝑁
parameters has 𝑁 first order derivatives but 𝑁² second order derivatives; since neural networks
commonly have millions to billions of parameters, calculating second order derivatives
is simply not feasible.

A simple but important modification to the (stochastic) gradient descent algorithm is momen-
tum. Instead of taking each step based only on the current gradient we take into account the
direction we were moving in in previous steps. The modified gradient descent step is given by

𝑣 𝑡 = 𝜇 𝑣 𝑡−1 − 𝜂 ∇𝑤 ℓ 𝐼𝑡 (𝑤 𝑡−1 ),
𝑤 𝑡 = 𝑤 𝑡−1 + 𝑣 𝑡 ,

where we initialize with 𝑣 0 = 0 and 𝜇 ∈ [0, 1) is called the momentum coefficient. The variable
𝑣 𝑡 can be interpreted as the current velocity vector along which we are moving in the parameter


space. At each iteration our new velocity takes a fraction of the previous velocity (controlled
by 𝜇) and updates it with the current gradient. Observe that for 𝜇 = 0, i.e. no momentum, we
revert to (1.6). The larger we take 𝜇 the more influence the history of the gradients has on our
current step.
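The following is a plain NumPy sketch (not from the notes) of one epoch of minibatch SGD with momentum, following (1.5) and (1.6); the function grad_batch_loss, which computes ∇𝑤 ℓ 𝐼 (𝑤) for a given batch, is assumed to be supplied by the user.

```python
import numpy as np

def sgd_momentum_epoch(w, dataset, grad_batch_loss, lr=0.1, mu=0.9,
                       batch_size=32, v=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    v = np.zeros_like(w) if v is None else v
    indices = rng.permutation(len(dataset))             # randomly divide into batches
    for start in range(0, len(indices), batch_size):
        batch = [dataset[i] for i in indices[start:start + batch_size]]
        v = mu * v - lr * grad_batch_loss(w, batch)      # update the velocity v_t
        w = w + v                                        # w_t = w_{t-1} + v_t
    return w, v
```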

1.5 Training

In practice the training process involves more than applying SGD on the whole dataset. One
thing we need to keep in mind is that the dataset is the only real information we have about
the underlying function we are trying to approximate, so there is no way to judge how good
our approximation is outside of using the dataset. Given that we know that neural networks
are universal approximators we can always find a neural network that perfectly fits any given
dataset, making overfitting a given. What we are really after is generalization: the ability of
our neural network to give the correct output for inputs it has not seen before.
We accomplish this by splitting the dataset and only using part of it to train the network, the
remaining part of the dataset (that the network has never seen) we use to test how well our
network has generalized. The exact split depends on the situation but using 80% of the dataset
for training and 20% for testing is a good rule-of-thumb.
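A minimal sketch of such a random 80/20 split (the function name and the use of NumPy are choices made for this example, not prescribed by the notes):

```python
import numpy as np

def train_test_split(dataset, train_fraction=0.8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    indices = rng.permutation(len(dataset))          # shuffle before splitting
    n_train = int(train_fraction * len(dataset))
    train = [dataset[i] for i in indices[:n_train]]
    test = [dataset[i] for i in indices[n_train:]]
    return train, test
```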
Let us denote our training and testing datasets as
Dtrain , Dtest ⊂ D.
We then perform SGD on the training dataset, i.e. try to minimize the training loss:
\[ \ell_{\mathrm{train}}(w) := \frac{1}{|\mathcal{D}_{\mathrm{train}}|} \sum_{(x,y) \in \mathcal{D}_{\mathrm{train}}} \ell\big( F(x; w), y \big). \tag{1.7} \]

But we judge the performance of our network by the testing loss:


\[ \ell_{\mathrm{test}}(w) := \frac{1}{|\mathcal{D}_{\mathrm{test}}|} \sum_{(x,y) \in \mathcal{D}_{\mathrm{test}}} \ell\big( F(x; w), y \big). \tag{1.8} \]

By monitoring the testing loss during training we are able to judge for how long we should
train, how well our network generalizes and when overfitting sets in. Figure 1.5 illustrates the
typical behaviour of loss curves that are encountered during training.
When we are designing a network for a given application we usually do repeated training while
we play with the hyperparameters to try and improve the testing loss. This can lead us to
overfit the hyperparameters to the given testing dataset. There are two things we can do to
mitigate this:

1. split the dataset in three parts: training/testing/validation,


2. redo the split randomly every time we retrain.

In the author’s opinion the second method is preferable since it is easier to implement and even
if the first method is used it is still necessary to redo the split if we decide to make some changes
after having used the validation dataset (since if we use it more than once we are back where
we started).


Figure 1.5: Typical progression of the training and testing loss. The training loss will generally converge to some very low value. The testing loss either behaves in a similar fashion and will converge on some higher value, as is illustrated on the left. The testing loss could also start to increase again at some point, as on the right; this indicates overfitting and tells you when to stop training.

Remark 1.9 (Hyperparameters). The hyperparameters are all parameters that we ourselves
select and are not trainable parts of the model. Examples are: batch size, learning rate, size
of the model, choice of activation function, ratio of the dataset split, etc.

Chapter 2

Deep Learning

The recent advances in machine learning were made largely with so called deep neural net-
works. The deep in this context refers to neural networks where the input will pass through
many neurons for processing, an arrangement not dissimilar to the brain and in contrast to
our previous shallow networks. Once the potential of these networks was demonstrated the
term deep learning was coined as a synonym for doing machine learning with deep neural
networks. Deep neural networks are a very broad class of models that contains a host of various
architectures for different applications, see Wikipedia (2021b) for an overview. We will only
cover two main architectures but much of what we will discuss is transferable to other types of
networks.
While deep networks can be very powerful there are challenges in getting them to work properly
and many aspects of these networks are not yet well understood. This chapter will cover what
deep neural networks are and the techniques and algorithms that are necessary to make them
work.

2.1 Deep Neural Networks

2.1.1 Feed Forward Networks

The quintessential example of a deep neural network is the feed forward neural network
wherein connections between nodes do not form cycles. A feed forward network simply takes
a set of shallow networks and concatenates them, feeding the output of one layer of neurons as
input into the next layer of neurons. Let us formalize this construction.
Let 𝐿 ∈ ℕ and 𝑁0 , 𝑁1 , . . . , 𝑁𝐿+1 ∈ ℕ . Let 𝜎1 , . . . , 𝜎𝐿 be activation functions, which we will
assume to be scalar functions that we apply pointwise for the moment. Let 𝐹𝑖 : ℝ𝑁𝑖−1 → ℝ𝑁𝑖 be
affine transforms given by
𝐹𝑖 (𝒙) := 𝐴 𝑖 𝒙 + 𝒃 𝑖
where 𝐴 𝑖 ∈ ℝ𝑁𝑖 ×𝑁𝑖−1 and 𝒃 𝑖 ∈ ℝ𝑁𝑖 for 𝑖 ∈ {1, . . . , 𝐿 + 1}. Then we call N : ℝ𝑁0 → ℝ𝑁𝐿+1 given by

N := 𝐹𝐿+1 ◦ 𝜎𝐿 ◦ 𝐹𝐿 ◦ · · · ◦ 𝜎1 ◦ 𝐹1 (2.1)

a feed forward neural network. This network is said to have 𝐿 (hidden) layers, 𝑁𝑖 is said to
be the width of the layer 𝑖 and the maximum of all the layer’s widths, i.e. max𝐿𝑖=1 {𝑁𝑖 }, is also
called the width of the network.


We usually label inputs with 𝒙, outputs with 𝒚 and if necessary the intermediate results with
$\boldsymbol{z}^{(i)} \in \mathbb{R}^{N_i}$ as follows:
\[ \boldsymbol{z}^{(0)} := \boldsymbol{x}, \qquad \boldsymbol{z}^{(i)} := \sigma_i\big( F_i(\boldsymbol{z}^{(i-1)}) \big) \ \text{ for } i \in \{1, \ldots, L\}, \qquad \boldsymbol{y} := F_{L+1}\big( \boldsymbol{z}^{(L)} \big). \]
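A minimal NumPy sketch (not part of the notes) of the forward pass (2.1); the layer sizes in the usage example are made up, and the final affine map 𝐹𝐿+1 is applied without an activation, as in the definition above.

```python
import numpy as np

def feed_forward(x, As, bs, sigmas):
    """As, bs have length L+1; sigmas has length L (one activation per hidden layer)."""
    z = x
    for A, b, sigma in zip(As[:-1], bs[:-1], sigmas):
        z = sigma(A @ z + b)          # z^(i) = sigma_i(A_i z^(i-1) + b_i)
    return As[-1] @ z + bs[-1]        # final affine layer F_{L+1}

relu = lambda v: np.maximum(0.0, v)
As = [np.ones((4, 3)), np.ones((2, 4))]   # L = 1 hidden layer, widths 3 -> 4 -> 2
bs = [np.zeros(4), np.zeros(2)]
print(feed_forward(np.array([1.0, 2.0, 3.0]), As, bs, [relu]))
```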
Variants to this architecture are still called feed forward networks. For example we could
include multi-variate activation functions such as soft-max or max pooling, this would require
us to specify the input and output widths of the layers differently but we could still write the
network down as in (2.1). Another common feature is a skip connection: some of the outputs
of a layer are passed to layers deeper down rather than (or in addition) to the next layer. An
example of a feed forward network is illustrated in graph form in Figure 2.1.
Figure 2.1: Graph representation of a feed forward neural network with 3 hidden layers. Each node is computed as the activation function applied to an affine combination of its inputs. Passing signals to deeper layers rather than the next layer is also a common feature of these types of networks and is called a skip connection.

Remark 2.1 (Recurrent neural networks). The other major class of neural networks besides
feed forward are the recurrent neural networks; these networks allow for cycles to be present
in their graphs. Recurrent neural networks are mainly used for processing sequential data
such as in speech and natural language applications. See Wikipedia (2021b) for a survey.

Previously we saw how a shallow neural network with ReLU activation functions results in a
piecewise linear function. If we want to model a more complex function we need to increase
the width of the network; the number of linear pieces scales linearly with the number of
neurons.
Consider a sawtooth function as in Figure 2.2: if we wanted to model a sawtooth with 𝑛 teeth
with a shallow ReLU network we would need 2𝑛 neurons. But consider the network for 1 tooth:
𝑓 (𝑥) := ReLU(2𝑥) − ReLU(4𝑥 − 2) + ReLU(2𝑥 − 2), (2.2)
and concatenate it multiple times: we would also get a more and more complex sawtooth for
every concatenation as is shown in Figure 2.2.
In fact we would get a doubling of the number of teeth every time we reapply 𝑓 and so an
exponential increase in the amount of linear pieces in the output. This is the major benefit of
depth: the complexity of the functions a network can model grows faster with depth than with
width.
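A rough numerical check of this (not from the notes): composing the one-tooth network 𝑓 from (2.2) with itself doubles the number of teeth each time, so the number of linear pieces grows exponentially while the parameter count only grows linearly with depth.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def f(x):
    # one sawtooth tooth, cf. (2.2)
    return relu(2 * x) - relu(4 * x - 2) + relu(2 * x - 2)

x = np.linspace(0.0, 1.0, 100_001)
y = x
for k in range(1, 5):
    y = f(y)
    # count (grid) local maxima as a proxy for the number of teeth
    teeth = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))
    print(f"f composed {k} time(s): {teeth} teeth")   # 1, 2, 4, 8, ...
```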
This is not to say that we should design our networks with maximum depth and minimum
width. As is usually the case in engineering there are tradeoffs to consider and any real network
strikes a balance between width and depth.


Figure 2.2: Iterative application of the function 𝑓 from (2.2) causes an exponential increase in the complexity of the result as measured by the amount of linear pieces.

Remark 2.2. While in a shallow ReLU network each linear piece is independent, in a deep
ReLU network this is not the case. This can be understood by considering that the number
of parameters in a deep ReLU network scales linearly with the number of layers while the
number of pieces of the output scales exponentially; at some point there will not be enough
parameters to describe any piecewise linear function with that amount of pieces. With an
equivalent shallow ReLU network the output space is exactly the space of all piecewise linear
functions with that amount of pieces.

2.1.2 Vanishing and Exploding Gradients

One difficulty that arises with deep networks is in training. Consider a parameter in an early layer of the network: its effect has to pass through many layers before finally contributing to the loss.
Let us disregard the affine transforms of a network for the moment and focus on the activation functions. Let 𝜎 be the sigmoid activation function given by 𝜎(𝑎) := 1/(1 + 𝑒⁻ᵃ), then define
\[ \sigma^N := \underbrace{\sigma \circ \sigma \circ \cdots \circ \sigma}_{N}. \]

For computing the derivative of 𝜎^𝑁 we apply the chain rule to find the recursion
\[ \frac{\partial}{\partial a} \sigma^N(a) = \sigma'\big( \sigma^{N-1}(a) \big)\, \frac{\partial}{\partial a} \sigma^{N-1}(a). \]
From Figure 2.3 we deduce that 0 < 𝜎′(𝑎) ≤ 1/4, so
\[ \frac{\partial}{\partial a} \sigma^N(a) \leq \left( \frac{1}{4} \right)^N. \]

Consequently performing a gradient descent step of the form
\[ a_{i+1} = a_i - \eta\, \frac{\partial}{\partial a} \sigma^N(a_i), \]
would not change the parameter very much, a problem which becomes worse with every
additional layer. Worse is that 𝜎′(𝑎) goes to zero quickly for large absolute values of 𝑎, as can be
seen in Figure 2.3. In this regime, when 𝜎(𝑎) is close to 0 or 1, we say the sigmoid is saturated.
Theoretically we could train for longer to compensate for very small gradients but in practice
this also does not work because of how floating point math works. Specifically when working
with floating point numbers we have

𝑎+𝑏 = 𝑎 if |𝑏| ≪ |𝑎|,

hence a small enough update to a parameter does not actually change the parameter.
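A one-line illustration of this absorption effect (assuming standard 64-bit floats):

```python
a, b = 1.0, 1e-17
print(a + b == a)   # True: the "update" b is too small to change a at all
```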


Figure 2.3: The sigmoid activation function and its derivative. Observe that the derivative is at most 1/4 and goes to zero quickly as 𝑎 → ±∞; this is called saturation.

This behaviour of gradients becoming smaller and smaller the deeper the network becomes is
called the vanishing gradient problem. The opposite phenomenon can also occur where the
gradients become larger and larger leading to unstable training, this is called the exploding
gradient problem.
We can contrast the behaviour of the sigmoid with the rectified linear unit in the same situation.
Recall that the ReLU is defined as ReLU(𝑥) := max{0, 𝑥}, then we define

\[ \mathrm{ReLU}^N := \underbrace{\mathrm{ReLU} \circ \mathrm{ReLU} \circ \cdots \circ \mathrm{ReLU}}_{N}. \]

Calculating the derivative we get:
\[ \frac{\partial}{\partial a} \mathrm{ReLU}^N(a) = \begin{cases} 1 & \text{if } a > 0, \\ 0 & \text{else,} \end{cases} \tag{2.3} \]

which does not depend on the number of layers 𝑁. Conceptually (2.3) can be interpreted as a
ReLU allowing a derivative of 1 for parameters that are currently affecting the loss (regardless
of which layer the parameter resides in) and giving a zero derivative for parameters that do not.
The ReLU (at least partially) sidestepping the vanishing/exploding gradient problem accounts
for its popularity in deep neural networks.
This is not to say that ReLUs are without issues. If an input to a ReLU is negative it will have
a gradient of zero, we say the ReLU or the neuron is dead. Consequently all neurons in the
previous layer that only (or predominantly) feed into that neuron will also have a gradient of
zero (or close to zero). You could visualize this phenomenon as a dead neuron casting a shadow
on the lower layers as is illustrated in Figure 2.4.
During training, which neurons are dead and where the shadow is cast changes from batch to
batch. As long as most neurons are not perpetually in the shadow we should be able to properly
train the network. However, if we happen to get an unlucky initialization it is quite possible
that a large part of our network is permanently stuck with zero gradients, this phenomenon is
called the dying ReLU problem.
Of course, the choice of activation function is not the only thing we need to consider. The
gradients in a real network also depend on the current values of the parameters of the affine
transforms. So we still need to initialize those parameters to values that do not cause vanishing
or exploding gradient problem right from the start. We will get back to parameter initialization
later on.


Figure 2.4: A dead ReLU (i.e. with negative argument and so zero derivative) will cause the gradients of the upstream neurons to also become zero; this effect will cascade all the way through to the start of the network. Neurons that feed into non-dead ReLUs will still have a gradient via those live connections. The purple shading of the nodes indicates the degree to which each node is affected by the dead ReLU.

2.1.3 Scaling to High Dimensional Data

If we consider all components of a matrix 𝐴 𝑖 from (2.1) to be trainable parameters then we say
layer 𝑖 is fully connected or dense, meaning all outputs of the layer depend on all inputs. When
all layers are fully connected we say that the network is fully connected. Having all components
of the matrices 𝐴 𝑖 and vectors 𝒃 𝑖 as trainable parameters is of course not required. In fact the
general fully connected feed forward network from (2.1) is not often used in practice, but many
architectures are specializations of this type adapted to specific applications. Specialization in
this context primarily means choosing which components of the 𝐴 𝑖 ’s and 𝒃 𝑖 ’s are trainable and
which are fixed to some chosen constant.
The primary reason fully connected networks are not used is that they simply do not scale to
high-dimensional data. Consider an application where we want to apply a transform on a
1000 × 1000 color image. The input and output would be elements of the space ℝ^{3×1000×1000}; a
linear transform on that space would be given by a matrix
\[ A \in \mathbb{R}^{(3 \times 1000 \times 1000)^2}. \]
This matrix has 9 · 10¹² components. Storing that many numbers in 32-bit floating point format
would take a whopping 36 terabytes, and that is just a single matrix. Clearly this rules out fully
connected networks for high dimensional applications such as imaging.
There are 3 strategies employed to deal with this problem:

1) sparsity,
2) weight sharing,
3) parameterization.

Sparsity means employing sparse matrices for the linear operations. When we fix most entries
of a matrix to zero we do not need to store those entries and we simplify the calculation that we
need to perform since we already know the result of the zeroed out parts.
With weight sharing the same parameter is reused at multiple locations in the matrix. In this
case we just have to store the unique parameters and not the whole matrix. Weight sharing is
technically a special case of the last strategy.


With parameterization we let the components of the matrix depend on a (small) number of
parameters. In this case we store the parameters and compute the matrix components when
required.
The following matrices are an example comparing the fully connected case with the three
proposed strategies:

\[ \underbrace{\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}}_{\text{fully connected}}, \quad \underbrace{\begin{pmatrix} a & 0 & 0 \\ 0 & 0 & 0 \\ b & 0 & 0 \end{pmatrix}}_{\text{sparse}}, \quad \underbrace{\begin{pmatrix} a & b & b \\ b & a & b \\ b & b & a \end{pmatrix}}_{\text{shared weights}}, \quad \underbrace{\begin{pmatrix} f_1(a,b) & f_2(a,b) & f_3(a,b) \\ f_4(a,b) & f_5(a,b) & f_6(a,b) \\ f_7(a,b) & f_8(a,b) & f_9(a,b) \end{pmatrix}}_{\text{parametrized}}; \]

we need to store 9 numbers for the fully connected matrix but only 2 numbers for each of the
other matrices.
These 3 strategies are not mutually exclusive and we will look at an architecture that uses a
combination of sparsity and weight sharing later on.

2.2 Initialization

For deep networks, the initial values of the parameters make a significant difference to the
functioning of the SGD algorithms, not least because initial parameters that are too small or
too large in magnitude cause vanishing/exploding gradient problems and
immediate issues for training. In this section we will develop some stochastic initialization
schemes that provide a functional starting point for a network to train from. We generally
use stochastic methods for initialization since we need initial parameter values in a layer to be
sufficiently different. Imagine if the values were the same, then their gradients would also be
the same and they would never diverge from each other, locking the network in an untrainable
state. The most straightforward method of achieving this is drawing values from a probability
distribution, which is currently the common practice. Usually the distributions that are used
are either the normal N(0, 𝜎2 ) or the uniform Unif[−𝑎, 𝑎], where we need to pick 𝜎2 or 𝑎 to avoid
training issues such as vanishing/exploding gradient. Each group of parameters (the linear
coefficients and the biases in each layer) typically get assigned their own distribution. We will
look at how these distributions are currently chosen.

Remark 2.3 (Deterministic initialization scheme). If you recall, in the tutorial notebook
4_FunctionApproximationIn1D.ipynb we developed a deterministic initialization scheme
that outperformed the default stochastic scheme. We did have to use our insight about the
problem as a whole (beyond what the data provided) to do it. It is quite conceivable that
for a given application a deterministic scheme can be developed that gives a much better
starting point for training than the current crop of stochastic schemes.

Remark 2.4 (On the importance of good initialization). A good initialization scheme allows
the network to reach a higher performance level in less time. This can have important
consequences for large production networks. One such network is GPT-3 (see Brown et al.
2020), which is used for natural language processing, it has 175 billion parameters and
training it has reportedly cost 4 million dollars. Hence better initialization schemes can be of
substantial economic value.


2.2.1 Stochastic Initialization

To develop a suitable probability distribution to initialize parameters with we start by looking at the input to a neuron as a vector valued random variable 𝑋 ∈ ℝⁿ. The output of a neuron is then a random variable 𝑌 ∈ ℝᵐ per
\[ Y = \sigma(A X + \boldsymbol{b}), \]
for some constant matrix 𝐴 ∈ ℝ^{𝑚×𝑛}, bias vector 𝒃 ∈ ℝᵐ and activation function 𝜎. Or alternatively written out per component:
\[ Y_i = \sigma\Big( \sum_{j=1}^{n} A_{ij} X_j + b_i \Big). \tag{2.4} \]
The idea is to then initialize 𝐴 and 𝒃 so that the variance of the signal does not change too much
from layer to layer:
\[ \sum_{i=1}^{m} \mathrm{Var}(Y_i)^2 \approx \sum_{j=1}^{n} \mathrm{Var}(X_j)^2. \]

Controlling the 𝐿2 norm of the signal variances is not necessarily the only possibility here, but
it is the choice we will proceed with.
Calculating variances of functions of random variables is difficult in general. The schemes we
will be looking at depend on the following approximation.

Lemma 2.5. Let 𝑋 be a real valued random variable and let 𝑓 : ℝ → ℝ be a differentiable
function then
Var ( 𝑓 (𝑋)) ≈ 𝑓 ′ (𝔼[𝑋])2 · Var(𝑋),
assuming the variance of 𝑋 is finite and 𝑓 ′ is differentiable.

Proof. Let 𝑚 := 𝔼[𝑋] and approximate 𝑓 by its linearization 𝑓 (𝑋) ≈ 𝑓 (𝑚) + 𝑓 ′(𝑚)(𝑋 − 𝑚). Then we find
\[ \begin{aligned} \mathrm{Var}(f(X)) &= \mathbb{E}\big[ (f(X) - \mathbb{E}[f(X)])^2 \big] \\ &\approx \mathbb{E}\big[ (f(m) + f'(m)(X - m) - \mathbb{E}[f(m) + f'(m)(X - m)])^2 \big] \\ &= \mathbb{E}\big[ (f(m) + f'(m)(X - m) - f(m) - f'(m)\,\mathbb{E}[X - m])^2 \big] \\ &= \mathbb{E}\big[ (f'(m)(X - m))^2 \big] \\ &= f'(m)^2\, \mathbb{E}\big[ (X - m)^2 \big] \\ &= f'(\mathbb{E}[X])^2 \cdot \mathrm{Var}(X). \end{aligned} \]

Intuitively, the approximation in Lemma 2.5 can be read as: if the variance of 𝑋 is small and 𝑓 ′
is reasonably bounded then the variance of 𝑓 (𝑋) will also be small.
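A quick Monte Carlo sanity check of Lemma 2.5 (not part of the notes), with 𝑓 the sigmoid and 𝑋 normally distributed with a small variance; the particular mean, variance and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 1.0 / (1.0 + np.exp(-x))
fprime = lambda x: f(x) * (1.0 - f(x))

mean, var = 0.5, 0.01
X = rng.normal(mean, np.sqrt(var), size=1_000_000)
print(np.var(f(X)))                 # empirical Var(f(X))
print(fprime(mean) ** 2 * var)      # the approximation f'(E[X])^2 * Var(X)
```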
Applying Lemma 2.5 to (2.4) yields
\[ \mathrm{Var}(Y_i) \approx \sigma'\Big( \sum_{j=1}^{n} A_{ij}\, \mathbb{E}[X_j] + b_i \Big)^2 \Big( \sum_{j=1}^{n} A_{ij}^2\, \mathrm{Var}(X_j) \Big). \]


To make progress let us assume 𝔼[𝑋1 ] = . . . = 𝔼[𝑋𝑛 ] and Var(𝑋1 ) = . . . = Var(𝑋𝑛 ), then
\[ \mathrm{Var}(Y_i) \approx \underbrace{\sigma'\Big( \mathbb{E}[X_1] \sum_{j=1}^{n} A_{ij} + b_i \Big)^2 \Big( \sum_{j=1}^{n} A_{ij}^2 \Big)}_{\text{ideally } \approx 1} \mathrm{Var}(X_1). \]
When the bracketed term is approximately 1 then the variances of the outputs 𝑌𝑖 are about the same as those of the inputs 𝑋 𝑗 .
Let us now turn the 𝐴 𝑖𝑗 ’s and 𝑏 𝑖 ’s into random variables: 𝐴 𝑖𝑗 ∼ 𝜇1 and 𝑏 𝑖 ∼ 𝜇2 for all 𝑖 and 𝑗 in
their respective ranges and where 𝜇1 and 𝜇2 are some choice of scalar probability distributions.
Ideally we would choose 𝜇1 and 𝜇2 so that
\[ \mathbb{E}\bigg[ \sigma'\Big( \mathbb{E}[X_1] \sum_{j=1}^{n} A_{ij} + b_i \Big)^2 \sum_{j=1}^{n} A_{ij}^2 \bigg] = 1. \tag{2.5} \]
This expression allows us to put a condition on our choice of probability distributions that
ensures that the variances of the signals between layers stay under control (at least at the start
of training). The following examples show how (2.5) is utilized.

Example 2.6 (Sigmoid with balanced inputs). Like before assume 𝔼[𝑋1 ] = . . . = 𝔼[𝑋𝑛 ] and Var(𝑋1 ) = . . . = Var(𝑋𝑛 ) and our goal is choosing a probability distribution for the linear coefficients and one for the biases. Say 𝜎 is the sigmoid activation function and we have 𝔼[𝑋𝑖 ] = 0, i.e. we have balanced inputs. Additionally we want balanced parameter initialization, i.e. 𝔼[𝑏 𝑖 ] = 0 and 𝔼[𝐴 𝑖𝑗 ] = 0 for all 𝑖, 𝑗 in their respective ranges. Since 𝔼[𝑋𝑖 ] = 0, (2.5) reduces to:
\[ \mathbb{E}\bigg[ \sigma'(b_i)^2 \sum_{j=1}^{n} A_{ij}^2 \bigg] = \mathbb{E}\big[ \sigma'(b_i)^2 \big]\, \mathbb{E}\bigg[ \sum_{j=1}^{n} A_{ij}^2 \bigg], \]
since the 𝑏 𝑖 's and 𝐴 𝑖𝑗 's are independent. We know that 0 < 𝜎′(𝑥) ≤ 1/4 and that the maximum is achieved at 𝑥 = 0. Hence for that first factor to not become too small we need the variance of 𝑏 𝑖 to be small since we already decided on setting 𝔼[𝑏 𝑖 ] = 0. Of course the smallest possible variance is zero, so let us be uncompromising and fix 𝑏 𝑖 = 0 for all 𝑖. Thus the previous expression becomes
\[ \sigma'(0)^2\, n\, \mathbb{E}\big[ A_{ij}^2 \big] = \frac{n}{16}\, \mathbb{E}\big[ (A_{ij} - 0)^2 \big] = \frac{n}{16}\, \mathbb{E}\big[ (A_{ij} - \mathbb{E}[A_{ij}])^2 \big] = \frac{n}{16}\, \mathrm{Var}(A_{ij}), \]
which equals 1 if
\[ \mathrm{Var}(A_{ij}) = \frac{16}{n}. \]
So we could choose our probability distribution for the linear coefficients to be the normal distribution N(0, 16/𝑛) or the uniform distribution Unif[−4√(3/𝑛), 4√(3/𝑛)]. Of course the choice of the type of distribution is free as long as the expected value is zero and the variance is 16/𝑛, but in practice you will usually only encounter normal or uniform distributions. In any case this choice of expected values and variances will provide some assurance that the signals will not explode or die out as they travel through the network (at least at the start of training).

Example 2.7 (ReLU with balanced inputs). Again assume 𝔼[𝑋1 ] = . . . = 𝔼[𝑋𝑛 ] and Var(𝑋1 ) =
. . . = Var(𝑋𝑛 ). This time we use the ReLU activation function and we are going to initialize both


our linear coefficients and biases with a uniform distribution Unif[−𝑎, 𝑎]; our goal is choosing 𝑎 > 0 in a suitable manner. Assume again that the inputs are balanced, i.e. 𝔼[𝑋𝑖 ] = 0, then the expression from (2.5) simplifies to
\[ \begin{aligned} \mathbb{E}\bigg[ \mathrm{ReLU}'(b_i)^2 \sum_{j=1}^{n} A_{ij}^2 \bigg] &= \mathbb{E}\big[ \mathbb{1}_{b_i > 0} \big]\, \mathbb{E}\bigg[ \sum_{j=1}^{n} A_{ij}^2 \bigg] \\ &= \mathbb{P}(b_i > 0)\, n\, \mathrm{Var}(A_{ij}) \\ &= \frac{n}{2}\, \mathrm{Var}(A_{ij}) \\ &= \frac{n a^2}{6}, \end{aligned} \]
which equals 1 if 𝑎 = √(6/𝑛), so we would draw our 𝐴 𝑖𝑗 's and 𝑏 𝑖 's from Unif[−√(6/𝑛), √(6/𝑛)].
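A small NumPy sketch (layer sizes made up) of the initialization derived in this example: draw the coefficients and biases of a ReLU layer from Unif[−√(6/𝑛), √(6/𝑛)], with 𝑛 the number of inputs to the layer.

```python
import numpy as np

def init_relu_layer(n_in, n_out, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    a = np.sqrt(6.0 / n_in)
    A = rng.uniform(-a, a, size=(n_out, n_in))   # linear coefficients
    b = rng.uniform(-a, a, size=n_out)           # biases
    return A, b

A, b = init_relu_layer(n_in=256, n_out=128)
print(A.std(), np.sqrt(2.0 / 256))   # empirical vs. theoretical std a/sqrt(3) = sqrt(2/n)
```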

2.2.2 Xavier Initialization

The initialization schemes from the previous section focused on controlling the variance of the
signals going forward through the network. While this does help in controlling the vanishing/exploding gradient problem, we can also look at gradients directly as they backpropagate
through the network; this is the approach taken by Glorot and Bengio (2010). The first author's
name is Xavier Glorot and for that reason the scheme we will be seeing is commonly referred
to as Glorot or Xavier initialization (as it is in PyTorch for example).
The idea is to treat the partial derivatives of the loss function with regards to the linear co-
efficients and biases as random variables as well. Consider a setting with a linear activation
function and no bias:
$$Y_i = \sum_{j=1}^{n} A_{ij} X_j,$$

where we assume all inputs 𝑋 𝑗 are distributed i.i.d. with zero mean. We additionally want to
initialize our coefficients 𝐴 𝑖𝑗 with mean zero as well. Since the 𝐴 𝑖𝑗 and 𝑋 𝑗 ’s are independent the
variance distributes over the sum. Additionally we have that

Var(𝐴 𝑖𝑗 𝑋 𝑗 ) = 𝔼[𝑋 𝑗 ]2 Var(𝐴 𝑖𝑗 ) + 𝔼[𝐴 𝑖𝑗 ]2 Var(𝑋 𝑗 ) + Var(𝐴 𝑖𝑗 ) Var(𝑋 𝑗 ) = Var(𝐴 𝑖𝑗 ) Var(𝑋 𝑗 )

since 𝔼[𝑋 𝑗 ] = 𝔼[𝐴 𝑖𝑗 ] = 0. So we work out that

$$\operatorname{Var}(Y_i) = \sum_{j=1}^{n} \operatorname{Var}(A_{ij}) \operatorname{Var}(X_j) = n \operatorname{Var}(A_{ij}) \operatorname{Var}(X_j).$$

Hence for forward signal propagation we have Var(𝑌𝑖 ) = Var(𝑋 𝑗 ) if

$$\operatorname{Var}(A_{ij}) = \frac{1}{n}. \tag{2.6}$$

But we can look at the backward gradient propagation as well. Let ℓ be a loss function at the
end of the network, then we can look at the partial derivatives of ℓ with respect to the inputs


and outputs as random variables as well. Call these random variables $\frac{\partial \ell}{\partial X_j}$ and $\frac{\partial \ell}{\partial Y_i}$; applying the chain rule gives us
$$\frac{\partial \ell}{\partial X_j} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial Y_i} \frac{\partial Y_i}{\partial X_j} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial Y_i} A_{ij}.$$

Now we make the same assumption about the backward gradients as we did about the forward signals, namely that the partial derivatives $\frac{\partial \ell}{\partial Y_i}$ are i.i.d. with zero mean. Then we can do the
same calculation as before and find

$$\operatorname{Var}\left(\frac{\partial \ell}{\partial X_j}\right) = m \operatorname{Var}(A_{ij}) \operatorname{Var}\left(\frac{\partial \ell}{\partial Y_i}\right).$$
Hence if we want to have $\operatorname{Var}\big(\frac{\partial \ell}{\partial X_j}\big) = \operatorname{Var}\big(\frac{\partial \ell}{\partial Y_i}\big)$ for backward gradient propagation we need to set
$$\operatorname{Var}(A_{ij}) = \frac{1}{m}. \tag{2.7}$$
Now unless $n = m$ we cannot satisfy (2.6) and (2.7) at the same time, but we can compromise and set
$$\operatorname{Var}(A_{ij}) = \frac{2}{n+m}. \tag{2.8}$$
Under this choice we can use the normal distribution $\mathcal{N}\big(0, \frac{2}{n+m}\big)$ or the uniform distribution $\operatorname{Unif}\big[-\sqrt{\frac{6}{n+m}},\, \sqrt{\frac{6}{n+m}}\big]$ to draw our coefficients $A_{ij}$ from.

Of course in reality we never use the linear activation function. The original Xavier initialization scheme has been expanded to include specific activation functions. For example for the ReLU, He et al. (2015) arrive at
$$\operatorname{Var}(A_{ij}) = \frac{4}{n+m},$$
which intuitively makes sense: since the ReLU is zero on half its domain the variance of the coefficients needs to be increased to keep the variances of the signals/gradients constant.
Other choices of activation function lead to other multipliers being introduced to the same basic formula (2.8):
$$\operatorname{Var}(A_{ij}) = \alpha^2 \frac{2}{n+m},$$
where $\alpha$ is called the gain and depends on the choice of activation function.

Remark 2.8. See torch.nn.init.xavier_uniform_ and torch.nn.init.xavier_normal_ for PyTorch's implementation of these initialization schemes.
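To make the remark concrete, a small usage sketch (the layer sizes are arbitrary; the gain follows torch.nn.init.calculate_gain):

```python
import torch.nn as nn

layer = nn.Linear(in_features=512, out_features=256)

# Xavier/Glorot initialization with a gain suited to the activation
# that follows this layer; calculate_gain('relu') returns sqrt(2).
gain = nn.init.calculate_gain('relu')
nn.init.xavier_uniform_(layer.weight, gain=gain)
nn.init.zeros_(layer.bias)
```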

2.2.3 Those are a lot of Assumptions

In the last two sections we made a lot of assumptions to arrive at simple formulas. Some of the
assumptions are even verifiably incorrect in the networks we employ. In spite of the coarse and
inelegant way these initialization schemes were derived they are widely used for the simple
reason that they work. They do not totally solve the vanishing/exploding gradient problem
but they still significantly improve the performance of the gradient descent algorithms.


2.3 Convolutional Neural Networks

→ Visualizations of convolutions from Dumoulin and Visin (2018), also available at https://github.com/vdumoulin/conv_arithmetic.
→ Handy reference that provides an overview of CNNs: CNN cheat-sheet.
We previously saw that using fully connected networks for high dimensional data such as images
is a non-starter due to the memory and computational requirements involved. The proposed
solution was reducing the effective amount of parameters by a combination of using a sparse
matrix and weight sharing/parameterization. How exactly we should sparsify the network
and which weights should be shared or parametrized had to be determined application by
application.
In this section we will look at a combination of sparsity and weight sharing that is suited to data
that has a natural spatial structure, think about signals in time (1D), images (2D) or volumetric
data (3D). What these types of spatial data have in common is that we treat each ‘part’ of the
input data in the same way, e.g. we do not process the left side of an image in another way than
the right side.
The type of network that exploits this spatial structure is called a Convolutional Neural Network, or CNN for short. As the name suggests these networks employ the convolution operation as well as the closely related pooling operation. We will look at how discrete convolution and pooling are defined and how they are used to construct a deep CNN.

2.3.1 Discrete Convolution

Recall that in the familiar continuous setting the convolution of two functions $f, g : \mathbb{R} \to \mathbb{R}$ is defined as:
$$(f * g)(x) := \int_{\mathbb{R}} f(x - y)\, g(y)\, \mathrm{d}y, \tag{2.9}$$
which can be interpreted as a filter (or kernel) $f$ being translated over the data $g$ and at each translated location the $L^2$ inner product is taken.
To switch to the discrete setting we will use notation that is more in line with programming
languages. When we have 𝑓 ∈ ℝ𝑛 we will use square brackets and zero-based indexing to
access its components, so 𝑓 [0] ∈ ℝ is 𝑓 ’s first component and 𝑓 [𝑛 − 1] ∈ ℝ its last. We use this
array-based notation since we will need to do some computations on indices and 𝑓 [𝑖 − 𝑗 + 1] is
easier to read than 𝑓𝑖−𝑗+1 .

Example 2.9. Let $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times n}$, then the familiar matrix product $y = Ax$ can be written in array notation as
$$y[i] = \sum_{j=0}^{n-1} A[i, j]\, x[j]$$
for each $i \in \{0, \ldots, m-1\}$.


Definition 2.10 (Discrete cross-correlation and convolution in 1D). Let $f \in \mathbb{R}^n$ (the input) and $k \in \mathbb{R}^m$ (the kernel), then their discrete cross-correlation $(k \star f) \in \mathbb{R}^{n-m+1}$ is given by:
$$(k \star f)[i] := \sum_{j=0}^{m-1} k[j]\, f[i + j]$$
for $i \in \{0, \ldots, n - m\}$.
Discrete convolution is defined similarly but with one of the inputs reversed:
$$(k * f)[i] := \sum_{j=0}^{m-1} k[m - 1 - j]\, f[i + j]$$
for $i \in \{0, \ldots, n - m\}$.

Making an index substitution in the definition of discrete convolution makes the relation to the continuous convolution (2.9) more apparent:
$$(k * f)[i] = \sum_{j=i}^{i+m-1} k[i - j + (m-1)]\, f[j].$$

The idea here is to let the components of the kernel 𝑘 ∈ ℝ𝑚 be trainable parameters. In that
context it does not matter whether the kernel is reflected or not and there is no real reason to
distinguish convolution from cross-correlation. Consequently in the deep learning field it is
usual to refer to both operations as convolution. Most ‘convolution’ operators in deep learning
software are in fact implemented as cross-correlations, as in PyTorch for example. We will adopt
the same convention and talk about such subjects as convolution layers and convolutional neural
networks but the actual underlying operation we will use is cross-correlation.
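As a small illustration (a plain Python sketch written directly from Definition 2.10, not the implementation PyTorch actually uses internally):

```python
import torch

def cross_correlate_1d(k: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
    """Discrete 1D cross-correlation (k ★ f) per Definition 2.10."""
    n, m = f.shape[0], k.shape[0]
    out = torch.empty(n - m + 1)
    for i in range(n - m + 1):
        # inner product of the kernel with the window f[i : i+m]
        out[i] = torch.dot(k, f[i:i + m])
    return out

f = torch.tensor([1., 2., 3., 4., 5.])
k = torch.tensor([1., 0., -1.])
print(cross_correlate_1d(k, f))                      # tensor([-2., -2., -2.])
print(torch.nn.functional.conv1d(                    # PyTorch's 'convolution'
    f.view(1, 1, -1), k.view(1, 1, -1)).flatten())   # is the same cross-correlation
```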
As in the continuous case, discrete convolution can be generalized to higher dimensions. For
this course we will only look at the 2 dimensional case as extending to higher dimensions is
straightforward. To make things slightly simpler we will restrict ourselves to square kernels, in
practice kernels are almost always chosen to be square anyway.

Definition 2.11 (Discrete cross-correlation in 2D). Let $f \in \mathbb{R}^{h \times w}$ (the input, read $h$ as height and $w$ as width) and $k \in \mathbb{R}^{m \times m}$ (the kernel), then their discrete cross-correlation $(k \star f) \in \mathbb{R}^{(h-m+1) \times (w-m+1)}$ is given by
$$(k \star f)[i_1, i_2] := \sum_{j_1, j_2 = 0}^{m-1} k[j_1, j_2]\, f[i_1 + j_1, i_2 + j_2]$$
for $i_1 \in \{0, \ldots, h - m\}$ and $i_2 \in \{0, \ldots, w - m\}$.

The convolution operator “*” can be extended to 2 dimensions similarly but we will only be
using cross-correlation for the remainder of our discussion of convolutional neural networks.
For a visualization of discrete convolution in 2D, see Figure 2.5.


Figure 2.5: An illustration of convolution (or cross-correlation) in 2D. Here the input $f \in \mathbb{R}^{4 \times 4}$ (colored blue) is convolved with the kernel $k \in \mathbb{R}^{3 \times 3}$, which yields an output $(k \star f) \in \mathbb{R}^{2 \times 2}$ (colored purple).
Remark 2.12 (Index interpretation conventions). In Definition 2.11 we called the first dimension of an array $f \in \mathbb{R}^{h \times w}$ the height and the second the width. Additionally, if you look at Figure 2.5 we place the origin point $(0, 0)$ in the top left. This is different to the Cartesian convention of denoting points in the plane with $(x, y)$ (i.e. width first, height second) and having the origin in the bottom left. Both conventions are illustrated below for an element of $\mathbb{R}^{3 \times 4}$.

[Illustration: the index labels of an element of $\mathbb{R}^{3 \times 4}$ under the Cartesian convention ($(x, y)$, origin in the bottom left) and under the array convention ($(i_1, i_2)$, origin in the top left).]

Of course choosing either convention does not change the underlying object, merely the
interpretation of the indices and shape in colloquial terms. The array convention is almost
universally adopted in software and it is used in PyTorch, for that reason we will adopt it
when dealing with spatial data.

2.3.2 Padding

The convolution operations we proposed so far have the property that the output is of smaller
size than the input. Depending on our goal this might or might not be desirable. If shrinking
output size is not desirable we can use padding to ensure the output size is the same as the
input size. The most common type of padding is zero padding, which we will look at for the 2
dimensional case.


Definition 2.13 (Zero padding in 2D). Let $f \in \mathbb{R}^{h \times w}$ and let $p_t, p_b, p_l, p_r$, which we read as top, bottom, left and right padding respectively. Then we define $\operatorname{ZP}_{p_t, p_b, p_l, p_r} f \in \mathbb{R}^{(h + p_t + p_b) \times (w + p_l + p_r)}$ as
$$\operatorname{ZP}_{p_t, p_b, p_l, p_r} f[i_1, i_2] := \begin{cases} 0 & \text{if } i_1 < p_t \text{ or } i_1 \ge h + p_t \text{ or } i_2 < p_l \text{ or } i_2 \ge w + p_l, \\ f[i_1 - p_t,\, i_2 - p_l] & \text{else,} \end{cases}$$
for all $i_1 \in \{0, \ldots, h + p_t + p_b - 1\}$ and $i_2 \in \{0, \ldots, w + p_l + p_r - 1\}$.

If we then have an $f \in \mathbb{R}^{h \times w}$ and we want to convolve with a kernel $k \in \mathbb{R}^{m \times m}$ while keeping the shape of the output the same we can choose
$$p_t := \left\lceil \frac{m-1}{2} \right\rceil, \quad p_b := \left\lfloor \frac{m-1}{2} \right\rfloor, \quad p_l := \left\lceil \frac{m-1}{2} \right\rceil, \quad p_r := \left\lfloor \frac{m-1}{2} \right\rfloor,$$
then $k \star \operatorname{ZP}_{p_t, p_b, p_l, p_r} f \in \mathbb{R}^{h \times w}$. We leave verifying this claim as an exercise. An example of this technique is illustrated in Figure 2.6.

Figure 2.6: Adding padding to the input allows the output of the convolution operation to retain the size of the original input. The most common type of padding is zero-padding, where each out-of-bounds value is assumed to be zero.
Many more padding techniques exist; they only vary in how the out-of-bounds values are chosen. In PyTorch the available padding modes are listed in pytorch.org/docs/stable/generated/torch.nn.functio
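A brief sketch of the same-size trick above (assuming a $3 \times 3$ kernel, so the padding is 1 on every side):

```python
import torch
import torch.nn.functional as F

f = torch.randn(1, 1, 28, 28)          # a single-channel 28x28 map
k = torch.randn(1, 1, 3, 3)            # a 3x3 kernel, m = 3

# zero-pad by (m-1)/2 = 1 on the left, right, top and bottom ...
f_padded = F.pad(f, (1, 1, 1, 1), mode='constant', value=0.0)
out = F.conv2d(f_padded, k)            # ... so the output keeps the 28x28 shape
print(out.shape)                       # torch.Size([1, 1, 28, 28])

# equivalently, let conv2d do the zero padding itself
out2 = F.conv2d(f, k, padding=1)
print(torch.allclose(out, out2))       # True
```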

2.3.3 Max Pooling

The second operation commonly found in CNNs is max pooling. We already saw an activation function called max pooling previously; this is exactly what is used in a CNN, but with a particular choice of subsets to take maxima over that plays well with the spatial structure of the data.
The idea is to have a 'window' slide over the input data in the same way that a convolution kernel slides over the data and then take the maximum value in each window. We will take a

look at a particular type of 2D max pooling that is commonly found in CNNs used for image
processing and classification applications.

Definition 2.14 ($m \times m$ max pooling in 2D). Let $f \in \mathbb{R}^{h \times w}$ and let $m \in \mathbb{N}$. Then we define $\operatorname{MP}_{m,m} f \in \mathbb{R}^{\lfloor h/m \rfloor \times \lfloor w/m \rfloor}$ as
$$\operatorname{MP}_{m,m} f[i_1, i_2] := \max_{\substack{0 \le j_1 < m \\ 0 \le j_2 < m}} f[i_1 m + j_1,\, i_2 m + j_2]$$
for all $i_1 \in \{0, \ldots, \lfloor h/m \rfloor - 1\}$ and $i_2 \in \{0, \ldots, \lfloor w/m \rfloor - 1\}$.

Two things to note about this particular definition. First, if ℎ and/or 𝑤 is not divisible by 𝑚 then
some values at the edges will be ignored entirely and not contribute to the output. We could
modify the definition to resolve this but in practice 𝑚 is usually set to 2 and so in the worst case
we lose a single row of values at the edge, and that is only if the input has odd dimensions.
Second, we move the window in steps of 𝑚 in both directions instead of taking steps of 1, this is
called having a stride of 𝑚. Having a stride larger than 1 allows max pooling to quickly reduce
the dimensions of the data. Figure 2.7 shows an example of how MP2,2 works.

Figure 2.7: Max pooling is similar to convolution in that a window of a certain size slides over the input, but instead of taking a weighted sum inside the window we take the maximum value. Usually the window is moved in strides equal to its size so that the outputs are the maxima from disjoint sets of the input. The pooling operation in the figure is usually just called 2 × 2 max pooling, referring to both the window size and stride used.
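A quick sketch of 2 × 2 max pooling on a concrete array (plain PyTorch, values chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

f = torch.tensor([[1., 2., 5., 0.],
                  [3., 4., 1., 1.],
                  [0., 2., 7., 8.],
                  [1., 0., 3., 6.]])

# 2x2 max pooling with stride 2: each output entry is the maximum
# of a disjoint 2x2 block of the input.
out = F.max_pool2d(f.view(1, 1, 4, 4), kernel_size=2, stride=2)
print(out.view(2, 2))
# tensor([[4., 5.],
#         [2., 8.]])
```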

2.3.4 Convolutional Layers

So how are convolutions/cross-correlations used in a neural network? First of all we organize our inputs and outputs differently: instead of the inputs and outputs being elements of $\mathbb{R}^n$ for some $n$ we want to keep the spatial structure. In the 2D case it makes sense to talk about objects with height and width, so elements in $\mathbb{R}^{h \times w}$ for some choice of $h, w \in \mathbb{N}$. We will call these objects 2D maps or just maps. We will want more than one map at any stage of the network; we say we want multiple channels.
Concretely, say we have a deep network where we index the number of layers with 𝑖, then we
denote the input of layer 𝑖 as an element in ℝ𝐶 𝑖−1 ×ℎ 𝑖−1 ×𝑤 𝑖−1 and the output of layer 𝑖 as an element

of ℝ𝐶 𝑖 ×ℎ 𝑖 ×𝑤 𝑖 . We say layer 𝑖 has 𝐶 𝑖−1 input channels and 𝐶 𝑖 output channels. We call ℎ 𝑖−1 × 𝑤 𝑖−1
and ℎ 𝑖 × 𝑤 𝑖 the shape of the input respectively output maps.
Example 2.15 (Image inputs). If we design a neural network to process color 1080p images then
𝐶0 = 3 (RGB images have 3 color channels), ℎ0 = 1080 and 𝑤0 = 1920, i.e. the input to the first
layer is an element in ℝ3×1080×1920 . For a monochrome image we would need just 1 channel.

A convolution layer is just a specialized network layer, hence it follows the same pattern of first
doing a linear transform, then adding a bias and finally applying an activation function. The
activation function in a CNN is typically a ReLU or max pooling function (or both). The linear
part will consist of taking convolutions of the input maps, but there are several ways to do that,
we will look at two of them.
The first, and most straightforward one, is called single channel convolution (alternatively known as depthwise convolution). With this method we assign a kernel of a certain size to each input channel and then perform the convolution of each input channel with its kernel. This gives us a number of maps equal to the number of input channels; we subsequently take point-wise linear combinations of those maps to generate the desired number of output maps.

Definition 2.16 (Single channel convolution). Let $f \in \mathbb{R}^{C \times h \times w}$ and let $k \in \mathbb{R}^{C \times m \times m}$. We call $C$ the number of input channels, $h \times w$ the input map shape and $m \times m$ the kernel shape. Let $A \in \mathbb{R}^{C' \times C}$, we call $C'$ the number of output channels. Then we define $\operatorname{SCC}_{k,A}(f) \in \mathbb{R}^{C' \times (h-m+1) \times (w-m+1)}$ as:
$$\operatorname{SCC}_{k,A}(f)[c', i_1, i_2] := \sum_{c=0}^{C-1} A[c', c]\, \big(k[c, \cdot, \cdot] \star f[c, \cdot, \cdot]\big)[i_1, i_2],$$
for all $c' \in \{0, \ldots, C' - 1\}$, $i_1 \in \{0, \ldots, h - m\}$ and $i_2 \in \{0, \ldots, w - m\}$.


This operation allows the entries of the kernel stack 𝑘 and matrix 𝐴 to be trainable, hence the
number of trainable parameters is 𝐶 · 𝑚 2 + 𝐶 ′ · 𝐶. Figure 2.8 gives a visualization of how single
channel convolution works.
An alternative way of using convolution to build a linear operator is multi channel convolution
(alternatively known as multi channel multi kernel (MCMK) convolution). Instead of assigning
a kernel to each input channel and then taking linear combinations we assign a kernel to each
combination of input and output channels.

Definition 2.17 (Multi channel convolution). Let $f \in \mathbb{R}^{C \times h \times w}$ and let $k \in \mathbb{R}^{C' \times C \times m \times m}$. We call $C$ the number of input channels, $C'$ the number of output channels, $h \times w$ the input map shape and $m \times m$ the kernel shape. Then we define $\operatorname{MCC}_k(f) \in \mathbb{R}^{C' \times (h-m+1) \times (w-m+1)}$ as:
$$\operatorname{MCC}_k(f)[c', i_1, i_2] := \sum_{c=0}^{C-1} \big(k[c', c, \cdot, \cdot] \star f[c, \cdot, \cdot]\big)[i_1, i_2],$$
for all $c' \in \{0, \ldots, C' - 1\}$, $i_1 \in \{0, \ldots, h - m\}$ and $i_2 \in \{0, \ldots, w - m\}$.


Under this construction the kernel components are the trainable parameters and so we have a
total of 𝐶 ′ · 𝐶 · 𝑚 2 trainable parameters. The multi channel convolution construction is illustrated
in Figure 2.9.
Neural network frameworks like PyTorch implement the multi channel version. CNNs in the literature are also generally formulated with multi channel convolutions. So why did we also introduce single channel convolutions?


[Illustration: input $\mathbb{R}^{3 \times 4 \times 4}$ $\to$ kernel stack $\mathbb{R}^{3 \times 2 \times 2}$ $\to$ linear combination $\mathbb{R}^{2 \times 3}$ $\to$ output $\mathbb{R}^{2 \times 3 \times 3}$.]
Figure 2.8: With single channel convolution, also called depthwise convolution, each input channel gets assigned a single kernel. After doing $C_{in}$ cross-correlations we take pointwise linear combinations of the resulting maps to generate the desired number of output maps. In this example that yields a total of 18 trainable parameters (or 20 if we include a bias per output channel).

[Illustration: input $\mathbb{R}^{3 \times 4 \times 4}$ $\to$ kernel stack $\mathbb{R}^{2 \times 3 \times 2 \times 2}$ $\to$ pointwise sums $\to$ output $\mathbb{R}^{2 \times 3 \times 3}$.]
Figure 2.9: With multi channel convolution each output channel gets assigned a stack of kernels, where each stack has a kernel per input channel. This results in $C_{out} \cdot C_{in}$ kernels. The resulting outputs of the $C_{out} \cdot C_{in}$ cross-correlations are then summed up pointwise per output channel resulting in $C_{out}$ output channels. In this example that yields a total of 24 trainable parameters (or 26 if we include a bias per output channel).

First of all, single and multi channel convolution are equivalent in the sense that given an
instance of one I can always construct an instance of the second that does the exact same
calculation. Consequently we can work with whichever construction we prefer for whatever
reason without losing anything. The nice thing about single channel convolution is that there
is a clear separation between the processing done inside a particular channel and the way the
input channels are combined to create output channels. These two processing steps are fused
together in the multi channel convolution operation. In the author’s opinion this makes the
multi channel technique harder to reason about, having two distinct steps that do two distinct
things seems more elegant.
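To make the two constructions concrete, here is a small sketch (in PyTorch, with arbitrary channel counts) building a single channel convolution out of a depthwise convolution followed by a pointwise (1 × 1) linear combination; torch.nn.Conv2d itself implements the multi channel version.

```python
import torch
import torch.nn as nn

C_in, C_out, m = 3, 2, 5

# multi channel convolution: one kernel per (output, input) channel pair
mcc = nn.Conv2d(C_in, C_out, kernel_size=m, bias=False)

# single channel convolution: a depthwise convolution (groups=C_in gives one
# kernel per input channel) followed by a pointwise linear combination
depthwise = nn.Conv2d(C_in, C_in, kernel_size=m, groups=C_in, bias=False)
combine = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)

x = torch.randn(1, C_in, 32, 32)
y = combine(depthwise(x))       # SCC-style: C_in*m^2 + C_out*C_in parameters
z = mcc(x)                      # MCC-style: C_out*C_in*m^2 parameters
print(y.shape, z.shape)         # both torch.Size([1, 2, 28, 28])
```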


All that is left now to create a full CNN layer is combining one of the convolution operations, adding padding if desirable and passing the result through a max pooling operation and/or a scalar activation function such as a ReLU.

2.3.5 Classification Example: MNIST & LeNet-5

Figure 2.10: The MNIST dataset (Modified National Institute of Standards and Technology dataset) consists of a large collection of 28 × 28 grayscale images of hand drawn digits. The goal is assigning to each image the correct 0-9 label.

[Architecture: $\mathbb{R}^{1 \times 28 \times 28} \to \mathbb{R}^{6 \times 28 \times 28}$ (convolution 5 × 5 with padding) $\to \mathbb{R}^{6 \times 14 \times 14}$ (max pool 2 × 2) $\to \mathbb{R}^{16 \times 10 \times 10}$ (convolution 5 × 5) $\to \mathbb{R}^{16 \times 5 \times 5}$ (max pool 2 × 2) $\to \mathbb{R}^{400}$ (flatten) $\to \mathbb{R}^{120} \to \mathbb{R}^{84} \to \mathbb{R}^{10}$ (fully connected layers).]
Figure 2.11: A modernized LeNet-5 network; the classic LeNet-5 network from LeCun, Bottou, et al. (1998) used sigmoidal activation functions, unusual sub-sampling layers and Gaussian connections. Replacing these with ReLUs, max pooling and fully connected layers yields a somewhat simpler network that works equally well.
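A minimal PyTorch sketch of the modernized LeNet-5 from Figure 2.11 (our own transcription of the figure, not the original 1998 network):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 1x28x28 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),              # -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # -> 400
            nn.Linear(400, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, 10),                            # one logit per digit
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(8, 1, 28, 28)).shape)             # torch.Size([8, 10])
```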

2.4 Automatic Differentiation & Backpropagation

All our training algorithms are gradient based and so our ability to train a network rests on us
computing those gradients. The traditional approaches we could take are:

(1) compute the derivative formulas ourselves and code them into the program,

(2) use a computer algebra system to perform symbolic differentiation and use the resulting
formula,

(3) use numeric differentiation (i.e. finite differences).


None of these three options are appealing. We definitely do not want to work out the derivatives
ourselves for every network we define so option (1) is out. Symbolic differentiation with
computer algebra systems like Mathematica usually produces complex and cryptic expressions
in non-trivial cases. Subsequently evaluating these complex expressions is usually anything
but efficient, this rules out option (2). For networks with millions if not billions of parameters
computing finite differences would entail millions or billions of evaluations of the network, this
makes option (3) impossible to use.
What we need is a fourth technique: automatic differentiation, also called algorithmic differentiation or simply autodiff. We will restrict ourselves to looking at automatic differentiation in the context of feed-forward neural networks but it is a very general technique, see Baydin et al. (2018) for a broader survey of its uses in machine learning. In introducing automatic differentiation it is perhaps best to emphasize that it is not symbolic differentiation nor numeric differentiation, even though it contains elements of both.

Remark 2.18 (Autograd vs. autodiff). Autograd is a particular autodiff implementation in


Python. Autograd served as a prototype for a lot of autodiff implementations, including
PyTorch’s, for that reason in PyTorch the term autograd is used as a synonym for autodiff.

2.4.1 Notation and an Example

Let us say we have a network $F : \mathbb{R} \times \mathbb{R}^2 \to \mathbb{R}$ with two parameters $a$ and $b$ and loss function $\ell$. We are interested in computing:
$$\frac{\partial}{\partial a} \ell\big(F(x_0; a, b), y_0\big) \bigg|_{(a,b)} \qquad\text{and}\qquad \frac{\partial}{\partial b} \ell\big(F(x_0; a, b), y_0\big) \bigg|_{(a,b)}, \tag{2.10}$$
i.e. the partial derivatives with respect to our parameters at their current values, which are just real numbers. Note how we use $a$ and $b$ first as dummy variables to indicate the input we want to differentiate over and then again as real valued function arguments.
Partial derivative notation can get cluttered quickly, as (2.10) shows. To simplify things we are
going to introduce some notational conventions. Let 𝑓 : ℝ2 → ℝ be given by
𝑓 (𝑥, 𝑦) = 𝑔(ℎ(𝑥, 𝑦), 𝑘(𝑥, 𝑦)), (2.11)
where 𝑔, ℎ and 𝑘 are also functions from ℝ2 to ℝ. We evaluate 𝑓 at a fixed (𝑥, 𝑦) by calculating:
𝑢 = ℎ(𝑥, 𝑦), 𝑣 = 𝑘(𝑥, 𝑦), 𝑧 = 𝑔(𝑢, 𝑣) = 𝑓 (𝑥, 𝑦),
which are all just real numbers and not functions like 𝑓 , 𝑔, ℎ and 𝑘 are. Now we introduce the
following notations for the intermediate results 𝑢 and 𝑣:
$$\frac{\partial u}{\partial x} := \frac{\partial}{\partial x} h(x, y) \bigg|_{(x,y)}, \qquad \frac{\partial u}{\partial y} := \frac{\partial}{\partial y} h(x, y) \bigg|_{(x,y)},$$
$$\frac{\partial v}{\partial x} := \frac{\partial}{\partial x} k(x, y) \bigg|_{(x,y)}, \qquad \frac{\partial v}{\partial y} := \frac{\partial}{\partial y} k(x, y) \bigg|_{(x,y)},$$

and for the final output 𝑧 we define:


$$\frac{\partial z}{\partial u} := \frac{\partial}{\partial u} g(u, v) \bigg|_{(u,v)}, \qquad \frac{\partial z}{\partial v} := \frac{\partial}{\partial v} g(u, v) \bigg|_{(u,v)},$$
$$\frac{\partial z}{\partial x} := \frac{\partial}{\partial x} f(x, y) \bigg|_{(x,y)}, \qquad \frac{\partial z}{\partial y} := \frac{\partial}{\partial y} f(x, y) \bigg|_{(x,y)},$$


which are all real numbers. At first glance the partial derivative of one real number with respect
to another real number is nonsensical, but if we use one real number in the computation of a
second one it does make sense to ask how sensitive the value of the second is with respect to the
first if we changed it a little bit. This sensitivity is of course nothing but the partial derivative of
the function used in the computation of the second value evaluated at the first value. But since
we are not interested in the whole partial derivatives it simplifies things to keep the function
and its partial derivatives implicit and use the simpler notation we just introduced.
Using this notation the chain rule applied to (2.11) can be expressed very succinctly as

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial x} \qquad\text{and}\qquad \frac{\partial z}{\partial y} = \frac{\partial z}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial z}{\partial v}\frac{\partial v}{\partial y}. \tag{2.12}$$

We are always looking to minimize our chosen loss function, hence we are mainly interested
in the partial derivatives of the loss function. We will use the following notation to further
abbreviate the partial derivatives of the loss: let ℓ ∈ ℝ be the loss produced at the end of our
calculation and let 𝛼 ∈ ℝ be either a parameter or intermediate result used in the calculation of
ℓ then we write
$$\bar{\alpha} := \frac{\partial \ell}{\partial \alpha},$$
which we refer to as a gradient (technically the value of the gradient of the loss function with respect to $\alpha$ evaluated at the current value of $\alpha$, but that is quite the mouthful). So in the case of the two-parameter network $F$ from before: for a given set of parameters $a, b \in \mathbb{R}$ we need to calculate the gradients $\bar{a}, \bar{b} \in \mathbb{R}$, which are exactly the evaluated partial derivatives from (2.10).
As an example to see how this notation works let us pick a concrete example and systematically work through differentiating it. Let $F(x; a, b) := \sigma(ax + b)$ for some choice of (differentiable) activation function $\sigma : \mathbb{R} \to \mathbb{R}$. Let $(x_0, y_0) \in \mathbb{R}^2$ be some data point and let us use the loss function $(y, y') \mapsto \frac{1}{2}(y - y')^2$. After we pick our parameter values $a, b \in \mathbb{R}$ we can simply compute the corresponding loss; doing this is called the forward pass and is shown on the left side in (2.13) below.

$$
\begin{array}{ll}
\textbf{forward (top to bottom)} & \textbf{backward (bottom to top)} \\[3pt]
a, b, x_0, y_0 \in \mathbb{R} & \bar{a} = \dfrac{\partial \ell}{\partial z}\dfrac{\partial z}{\partial a} = \bar{z}\, x_0, \qquad \bar{b} = \bar{z} \\[8pt]
z = a x_0 + b & \bar{z} = \dfrac{\partial \ell}{\partial z} = \dfrac{\partial \ell}{\partial y}\dfrac{\partial y}{\partial z} = \bar{y}\, \sigma'(z) \\[8pt]
y = \sigma(z) & \bar{y} = \dfrac{\partial \ell}{\partial y} = y - y_0 \\[8pt]
\ell = \tfrac{1}{2}(y - y_0)^2 & \bar{\ell} = \dfrac{\partial \ell}{\partial \ell} = 1
\end{array}
\tag{2.13}
$$

After having evaluated the network we can start calculating the partial derivatives of the loss with respect to the parameters $a$ and $b$ by applying the chain rule from back to front; this is called the backward pass and is shown on the right side in (2.13). Some things to note about the schematic in (2.13):

• Everything that we wrote down is a numeric computation and the whole schematic can
be executed by a computer as is.
• In writing down the backward pass we did use our symbolic knowledge of how the
operations in the forward pass need to be differentiated if we look at them as functions.


• Writing down the trivial $\bar{\ell} = 1$ is of course redundant, but autodiff implementations such as PyTorch's do actually start the backward pass by creating a single element tensor containing the value 1 (and it makes the schematic look symmetric).

• Some of the intermediate values computed during the forward pass are reused during
the backward pass. In the schematic (2.13) the reused values have been shaded in green.

Remark 2.19. In PyTorch, the intermediate results computed during the forward pass that
need to be retained for the backward pass are called saved tensors. In a typical neural
network many of the intermediate results have to be saved for the backward pass, this is the
reason that training a neural network requires a lot of memory.

Nothing we have done in the example in (2.13) is novel; we just did what we would normally do if asked to calculate these partial derivatives. Only, we wrote it down systematically in a way that we can automate.

2.4.2 Automation & the Computational Graph

The key to automating the type of calculation in (2.13) and (2.14) is splitting it into primitive
operations and tracking how they are composed. Consider the network 𝐹(𝑥; 𝑎, 𝑏) := ReLU(𝑎𝑥+𝑏)
with loss function (𝑦, 𝑦 ′) ↦→ 12 (𝑦 − 𝑦 ′)2 evaluated for some data point (𝑥0 , 𝑦0 ) ∈ ℝ2 and parameter
values 𝑎, 𝑏 ∈ ℝ. Evaluating the network one primitive operation at a time looks as follows.

data: $x_0, y_0 \in \mathbb{R}$, \quad parameters: $a, b \in \mathbb{R}$
$$
\begin{array}{ll}
\textbf{forward (top to bottom)} & \textbf{backward (bottom to top)} \\[3pt]
& \bar{a} = \dfrac{\partial \ell}{\partial t_1}\dfrac{\partial t_1}{\partial a} = \bar{t}_1 x_0, \qquad \bar{b} = \dfrac{\partial \ell}{\partial t_2}\dfrac{\partial t_2}{\partial b} = \bar{t}_2 \\[8pt]
t_1 = a x_0 & \bar{t}_1 = \dfrac{\partial \ell}{\partial t_2}\dfrac{\partial t_2}{\partial t_1} = \bar{t}_2 \\[8pt]
t_2 = t_1 + b & \bar{t}_2 = \dfrac{\partial \ell}{\partial t_3}\dfrac{\partial t_3}{\partial t_2} = \bar{t}_3\, \mathbb{1}_{t_2 \ge 0} \\[8pt]
t_3 = \operatorname{ReLU}(t_2) & \bar{t}_3 = \dfrac{\partial \ell}{\partial t_4}\dfrac{\partial t_4}{\partial t_3} = \bar{t}_4 \\[8pt]
t_4 = t_3 - y_0 & \bar{t}_4 = \dfrac{\partial \ell}{\partial t_4} = \dfrac{\partial \ell}{\partial t_5}\dfrac{\partial t_5}{\partial t_4} = \bar{t}_5 \cdot 2 t_4 = t_4 \\[8pt]
t_5 = t_4^2 & \bar{t}_5 = \dfrac{\partial \ell}{\partial t_5} = \tfrac{1}{2} \\[8pt]
\ell = \tfrac{1}{2} t_5 & \bar{\ell} = 1
\end{array}
\tag{2.14}
$$

The backward pass consists again of numeric computations, but at every step we need to know how the value computed at that line, say $\alpha$, is used so we are able to compute the correct partial derivative $\bar{\alpha}$. In the case of (2.14) that is fairly straightforward as every intermediate value is only used in the next step, i.e. $t_i$ only depends on $t_{i-1}$ and the parameters. Consequently $\bar{t}_i$ only depends on $\bar{t}_{i+1}$ and $t_i$. A more general example that has a slightly more complicated structure is the network $F(x; a, b) := \operatorname{ReLU}(ax + b) + (ax + b)$. We can write this network out in primitive form as well; we omit the forward/backward arrows but add a dependency graph that shows how the intermediate results depend on each other.


data: $x_0, y_0 \in \mathbb{R}$, \quad parameters: $a, b \in \mathbb{R}$
$$
\begin{array}{ll}
& \bar{a} = \bar{t}_1 x_0, \qquad \bar{b} = \bar{t}_2 \\[3pt]
t_1 = a x_0 & \bar{t}_1 = \bar{t}_2 \\[3pt]
t_2 = t_1 + b & \bar{t}_2 = \dfrac{\partial \ell}{\partial t_2} = \dfrac{\partial \ell}{\partial t_3}\dfrac{\partial t_3}{\partial t_2} + \dfrac{\partial \ell}{\partial t_4}\dfrac{\partial t_4}{\partial t_2} = \bar{t}_3\, \mathbb{1}_{t_2 \ge 0} + \bar{t}_4 \\[8pt]
t_3 = \operatorname{ReLU}(t_2) & \bar{t}_3 = \bar{t}_4 \\[3pt]
t_4 = t_2 + t_3 & \bar{t}_4 = \bar{t}_5 \\[3pt]
t_5 = t_4 - y_0 & \bar{t}_5 = \bar{t}_6 \cdot 2 t_5 = t_5 \\[3pt]
t_6 = t_5^2 & \bar{t}_6 = \tfrac{1}{2} \\[3pt]
\ell = \tfrac{1}{2} t_6 & \bar{\ell} = 1
\end{array}
\tag{2.15}
$$
[Illustration: the accompanying dependency graph of the intermediate values $t_i$ is not reproduced here.]
The dependency graph of the gradients $\bar{t}_i$ in (2.15) is naturally the reverse of the dependency graph of the values $t_i$, augmented with dependencies on the results from the forward pass. Both the values $t_3$ and $t_4$ depend on the value $t_2$, hence $\bar{t}_2$ depends on both $\bar{t}_3$ and $\bar{t}_4$ in addition to $t_2$ itself. This double dependency causes the chain rule expression for $\bar{t}_2$ to have two terms; of course this generalizes to multiple dependencies.
Constructing this computational graph is exactly how machine learning frameworks such as
PyTorch implement gradient computation. Each time you perform an operation on one or more
tensors a new node is added to the graph to record what operation was performed and on
which inputs the output depends. Then, when it becomes time to compute the gradient (i.e.
.backward() is called in PyTorch) the graph is traversed back to front.
As mentioned, the graph records what operation was performed at each node. This is necessary because what backward computation needs to be performed at each node depends on what the corresponding forward computation was; this is where our symbolic knowledge needs to be added.
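To illustrate the idea, here is a deliberately tiny reverse-mode autodiff sketch (our own toy code, not how PyTorch is actually implemented): each operation records its inputs and a local backward rule, and backward() traverses the recorded graph back to front.

```python
class Scalar:
    """A toy autodiff node: stores a value, its gradient and how it was made."""
    def __init__(self, value, parents=(), backward_rule=lambda grad: ()):
        self.value = value
        self.grad = 0.0
        self._parents = parents            # nodes this value depends on
        self._backward_rule = backward_rule

    def __add__(self, other):
        return Scalar(self.value + other.value, (self, other),
                      lambda grad: (grad, grad))

    def __mul__(self, other):
        return Scalar(self.value * other.value, (self, other),
                      lambda grad: (grad * other.value, grad * self.value))

    def relu(self):
        return Scalar(max(self.value, 0.0), (self,),
                      lambda grad: (grad * (1.0 if self.value >= 0 else 0.0),))

    def backward(self):
        # topological order of the graph, then traverse it back to front
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0                     # dl/dl = 1, cf. (2.14)
        for node in reversed(order):
            for parent, g in zip(node._parents, node._backward_rule(node.grad)):
                parent.grad += g            # accumulate, cf. the two terms in (2.15)

# F(x; a, b) = ReLU(a*x + b) + (a*x + b) with the squared loss, as in (2.15)
a, b = Scalar(2.0), Scalar(-1.0)
x0, y0 = Scalar(3.0), Scalar(4.0)
t2 = a * x0 + b
t4 = t2.relu() + t2
t5 = t4 + Scalar(-y0.value)
loss = t5 * t5 * Scalar(0.5)
loss.backward()
print(a.grad, b.grad)                       # 36.0 12.0
```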

2.4.3 Implementing Operations

In the previous section we saw how the evaluation of a neural network (or any computation for that matter) can be expressed as a computational graph where each node corresponds to a (primitive) operation. To be able to do the backward pass each node needs to know how to compute its own partial derivative(s). The previous examples (2.13), (2.14) and (2.15) only used simple scalar operations; now we will explore how to implement both the forward and backward computation for a general multivariate vector-valued function.
Let 𝐹 : ℝ𝑛 → ℝ𝑚 be a differentiable map, let 𝒙 = [𝑥 1 · · · 𝑥 𝑛 ]𝑇 ∈ ℝ𝑛 and 𝒚 = [𝑦1 · · · 𝑦𝑚 ]𝑇 ∈ ℝ𝑚
so that 𝒚 = 𝐹(𝒙). Let ℓ ∈ ℝ be the final loss computed for a neural network that contains 𝐹 as


one of its operations. Then we can generalize our gradient notation from scalars to vectors as
$$\bar{\boldsymbol{x}} := \frac{\partial \ell}{\partial \boldsymbol{x}} := \begin{bmatrix} \frac{\partial \ell}{\partial x_1} \\ \vdots \\ \frac{\partial \ell}{\partial x_n} \end{bmatrix}, \qquad \bar{\boldsymbol{y}} := \frac{\partial \ell}{\partial \boldsymbol{y}} := \begin{bmatrix} \frac{\partial \ell}{\partial y_1} \\ \vdots \\ \frac{\partial \ell}{\partial y_m} \end{bmatrix}.$$
During the forward pass the task is: given $\boldsymbol{x}$ compute $\boldsymbol{y} = F(\boldsymbol{x})$. During the backward pass the task is: given $\boldsymbol{x}$ and $\bar{\boldsymbol{y}}$ compute $\bar{\boldsymbol{x}}$; we call this backward operation $\bar{F} : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$. Let us see what it looks like:
$$\bar{\boldsymbol{x}} = \begin{bmatrix} \frac{\partial \ell}{\partial x_1} \\ \vdots \\ \frac{\partial \ell}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{m} \frac{\partial \ell}{\partial y_j} \frac{\partial y_j}{\partial x_1} \\ \vdots \\ \sum_{j=1}^{m} \frac{\partial \ell}{\partial y_j} \frac{\partial y_j}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{m} \bar{y}_j \frac{\partial y_j}{\partial x_1} \\ \vdots \\ \sum_{j=1}^{m} \bar{y}_j \frac{\partial y_j}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_m \end{bmatrix} = J(\boldsymbol{x})^T \bar{\boldsymbol{y}},$$
where 𝐽(𝒙) is the Jacobian matrix of 𝐹 evaluated in 𝒙. So not unexpectedly we end up with the
general form of the chain rule.
Example 2.20 (Pointwise addition). Let $F : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $F(\boldsymbol{a}, \boldsymbol{b}) := \boldsymbol{a} + \boldsymbol{b}$, with $\boldsymbol{a} = [a_1\ a_2]^T \in \mathbb{R}^2$ and $\boldsymbol{b} = [b_1\ b_2]^T$. We can equivalently define $F$ as $F : \mathbb{R}^4 \to \mathbb{R}^2$ as follows:
$$\boldsymbol{c} = F(\boldsymbol{a}, \boldsymbol{b}) = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 + b_1 \\ a_2 + b_2 \end{bmatrix}.$$
Since this is a linear operator the Jacobian matrix is just the constant matrix above. The backward operation $\bar{F} : \mathbb{R}^4 \times \mathbb{R}^2 \to \mathbb{R}^4$ is then given by
$$\begin{bmatrix} \bar{\boldsymbol{a}} \\ \bar{\boldsymbol{b}} \end{bmatrix} = \bar{F}(\boldsymbol{a}, \boldsymbol{b}, \bar{\boldsymbol{c}}) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \bar{c}_1 \\ \bar{c}_2 \end{bmatrix} = \begin{bmatrix} \bar{c}_1 \\ \bar{c}_2 \\ \bar{c}_1 \\ \bar{c}_2 \end{bmatrix} = \begin{bmatrix} \bar{\boldsymbol{c}} \\ \bar{\boldsymbol{c}} \end{bmatrix}$$
for a gradient $\bar{\boldsymbol{c}} = [\bar{c}_1\ \bar{c}_2]^T \in \mathbb{R}^2$.
So to implement this operation we do not have to retain the inputs $\boldsymbol{a}$ and $\boldsymbol{b}$; we can implement the backward calculation by simply copying the incoming gradient $\bar{\boldsymbol{c}}$ and passing the two copies up the graph.


Example 2.21 (Copy). Let $F : \mathbb{R}^n \to \mathbb{R}^{2n}$ be given by
$$\begin{bmatrix} \boldsymbol{b} \\ \boldsymbol{c} \end{bmatrix} = F(\boldsymbol{a}) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \\ a_1 \\ \vdots \\ a_n \end{bmatrix},$$
with $\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{c} \in \mathbb{R}^n$. This operation makes two copies of its input; it is clearly linear with Jacobian
$$J = \begin{bmatrix} I_n \\ I_n \end{bmatrix},$$
where $I_n$ is a unit matrix of size $n \times n$. Given gradients $\bar{\boldsymbol{b}}, \bar{\boldsymbol{c}} \in \mathbb{R}^n$ of the outputs the gradient $\bar{\boldsymbol{a}}$ of the input is calculated as:
$$\bar{\boldsymbol{a}} = \bar{F}(\boldsymbol{a}, \bar{\boldsymbol{b}}, \bar{\boldsymbol{c}}) = \begin{bmatrix} I_n & I_n \end{bmatrix} \begin{bmatrix} \bar{\boldsymbol{b}} \\ \bar{\boldsymbol{c}} \end{bmatrix} = \bar{\boldsymbol{b}} + \bar{\boldsymbol{c}}.$$

Example 2.22 (Inner product). In this example the Jacobian is not constant and we do have to retain the inputs to be able to do the backward pass. Let $F : \mathbb{R}^{2n} \to \mathbb{R}$ be given by
$$c = F(\boldsymbol{a}, \boldsymbol{b}) = \sum_{i=1}^{n} a_i b_i,$$
for $\boldsymbol{a} = [a_1 \cdots a_n]^T$ and $\boldsymbol{b} = [b_1 \cdots b_n]^T \in \mathbb{R}^n$. Then the Jacobian matrix evaluated at $[\boldsymbol{a}\ \boldsymbol{b}]^T$ is given by
$$J(\boldsymbol{a}, \boldsymbol{b}) = \begin{bmatrix} b_1 & \cdots & b_n & a_1 & \cdots & a_n \end{bmatrix} = \begin{bmatrix} \boldsymbol{b}^T & \boldsymbol{a}^T \end{bmatrix}.$$
Given a (scalar) gradient $\bar{c} \in \mathbb{R}$ of the output we compute the gradients of $\boldsymbol{a}$ and $\boldsymbol{b}$ as follows:
$$\begin{bmatrix} \bar{\boldsymbol{a}} \\ \bar{\boldsymbol{b}} \end{bmatrix} = \bar{F}(\boldsymbol{a}, \boldsymbol{b}, \bar{c}) = \begin{bmatrix} \boldsymbol{b}^T & \boldsymbol{a}^T \end{bmatrix}^T \bar{c} = \bar{c} \begin{bmatrix} \boldsymbol{b} \\ \boldsymbol{a} \end{bmatrix},$$
which depends on the input values $\boldsymbol{a}$ and $\boldsymbol{b}$.
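In PyTorch, Example 2.22 corresponds to a custom torch.autograd.Function whose forward saves its inputs (the saved tensors of Remark 2.19) because the backward needs them; a minimal sketch:

```python
import torch

class InnerProduct(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)   # the Jacobian depends on the inputs
        return (a * b).sum()

    @staticmethod
    def backward(ctx, grad_c):
        a, b = ctx.saved_tensors
        # per Example 2.22: grad_a = c̄ * b and grad_b = c̄ * a
        return grad_c * b, grad_c * a

a = torch.randn(5, requires_grad=True)
b = torch.randn(5, requires_grad=True)
c = InnerProduct.apply(a, b)
c.backward()
print(torch.allclose(a.grad, b), torch.allclose(b.grad, a))   # True True
```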

In (2.14) and (2.15) we deconstructed the loss function $(y, y') \mapsto \frac{1}{2}(y - y')^2$ into three primitive operations. As a consequence the backward pass has a step where we first multiply with 2 and then multiply with $\frac{1}{2}$, which is not efficient of course. Hence deconstructing our calculation into the smallest possible operations is not always advisable. In this example we will express the $L^2$ loss function as a single operation instead.
Example 2.23 ($L^2$ loss). Let $F : \mathbb{R}^{2n} \to \mathbb{R}$ be given by
$$\ell = F(\boldsymbol{y}, \boldsymbol{y}') = \frac{1}{2} \sum_{i=1}^{n} (y_i - y_i')^2$$
for $\boldsymbol{y} = [y_1 \cdots y_n]^T$ and $\boldsymbol{y}' = [y_1' \cdots y_n']^T \in \mathbb{R}^n$. Then the Jacobian matrix evaluated at $[\boldsymbol{y}\ \boldsymbol{y}']^T$ is given by
$$J(\boldsymbol{y}, \boldsymbol{y}') = \begin{bmatrix} y_1 - y_1' & \cdots & y_n - y_n' & y_1' - y_1 & \cdots & y_n' - y_n \end{bmatrix} = \begin{bmatrix} (\boldsymbol{y} - \boldsymbol{y}')^T & (\boldsymbol{y}' - \boldsymbol{y})^T \end{bmatrix}.$$


The backward operation has signature $\bar{F} : \mathbb{R}^{2n} \times \mathbb{R} \to \mathbb{R}^{2n}$. The second argument of $\bar{F}$ is a given gradient $\bar{\ell} \in \mathbb{R}$, which is trivially $\bar{\ell} = 1$ if the output $\ell$ is the final loss we are interested in minimizing. We then compute the gradients as follows:
$$\begin{bmatrix} \bar{\boldsymbol{y}} \\ \bar{\boldsymbol{y}}' \end{bmatrix} = \bar{F}(\boldsymbol{y}, \boldsymbol{y}', \bar{\ell} = 1) = \begin{bmatrix} (\boldsymbol{y} - \boldsymbol{y}')^T & (\boldsymbol{y}' - \boldsymbol{y})^T \end{bmatrix}^T \bar{\ell} = \begin{bmatrix} \boldsymbol{y} - \boldsymbol{y}' \\ \boldsymbol{y}' - \boldsymbol{y} \end{bmatrix}.$$
If $\boldsymbol{y}$ is the output of the neural network and $\boldsymbol{y}'$ is the data point then we are only interested in $\bar{\boldsymbol{y}}$ and we would only compute $\boldsymbol{y} - \boldsymbol{y}'$. This is equivalent to the computation in (2.14) and (2.15) but avoids the redundant multiplications.

2.5 Adaptive Learning Rate Algorithms

The learning rate is crucial to the success of the training process. In general the loss is highly
sensitive to some parameters but insensitive to others, just think about parameters in different
layers. Momentum alleviates some of the issues but introduces problems of its own and adds
another hyperparameter we have to tune.
Assigning each parameter its own learning rate (and continually adjusting it) is not feasible. What we need are automatic methods. This has led to the development of adaptive learning (rate) algorithms. The idea is to set the learning rate dynamically per-parameter at each iteration based on the history of the gradients. We will look at three of these methods: Adagrad, RMSProp and Adam.
Let us recall the setting. We have a parameter space 𝑊 = ℝ𝑁 and some initial parameter value
𝑤 0 ∈ 𝑊. Let 𝐼𝑡 be the batch at iteration 𝑡 ∈ ℕ and ℓ 𝐼𝑡 its associated loss function. We will
abbreviate the current batch’s gradient as:
𝑔𝑡 := ∇ℓ 𝐼𝑡 (𝑤 𝑡 ).

The SGD update rule with learning rate 𝜂 > 0 is then:


𝑤 𝑡+1 = 𝑤 𝑡 − 𝜂𝑔𝑡 .
With momentum the update rule becomes:
𝑣 𝑡 = 𝜇𝑣 𝑡−1 − 𝜂𝑔𝑡 ,
𝑤 𝑡+1 = 𝑤 𝑡 + 𝑣 𝑡 ,
where 𝑣 0 = 0 and 𝜇 ∈ [0, 1) is the momentum factor.

2.5.1 Adagrad

Adagrad (Adaptive Gradient Descent) was one of the first adaptive learning rate methods
introduced by Duchi, Hazan, and Singer (2011).
Let 𝑡 ∈ ℕ0 , 𝐼𝑡 the batch index set at iteration 𝑡 and ℓ 𝐼𝑡 its associated loss. Let 𝑤0 be the initial
parameter values and abbreviate 𝑔𝑡 := ∇ℓ 𝐼𝑡 (𝑤 𝑡 ).
The Adagrad update is defined per-parameter as:
$$(w_{t+1})_i = (w_t)_i - \frac{\eta}{\sqrt{\sum_{k=0}^{t} (g_k)_i^2 + \varepsilon}}\, (g_t)_i.$$


Or when we take all the operations on the vectors to mean component-wise operations we can
write more concisely:
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\sum_{k=0}^{t} g_k^2 + \varepsilon}}\, g_t.$$

At each iteration we slow down the effective learning rate of each individual parameter by the $L^2$ norm of the history of the partial derivatives with respect to that parameter. As $(g_t)_i$ may be very small or zero we add a small positive $\varepsilon$ (say $10^{-8}$ or so) for numerical stability.
The benefit of this method is that choosing $\eta$ becomes less important; the step size will eventually decrease to the point that it can settle into a local minimum.
On the other hand, the denominator $\sqrt{\sum_{k=0}^{t} (g_k)_i^2}$ is monotonically increasing and will eventually bring the training process to a halt whether a local minimum has been reached or not.
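A minimal component-wise Adagrad step could be sketched as follows (plain PyTorch with our own helper names; torch.optim.Adagrad is the library version):

```python
import torch

def adagrad_step(w, grad, sum_sq, lr=0.01, eps=1e-8):
    """One Adagrad update; sum_sq accumulates the squared gradients."""
    sum_sq += grad ** 2
    w -= lr * grad / torch.sqrt(sum_sq + eps)
    return w, sum_sq

w = torch.zeros(3)
sum_sq = torch.zeros(3)
for _ in range(2):
    g = torch.tensor([1.0, 0.1, 0.0])     # stand-in for a batch gradient
    w, sum_sq = adagrad_step(w, g, sum_sq) # each parameter gets its own
print(w)                                   # effective learning rate
```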

2.5.2 RMSProp

Root Mean Square Propagation or RMSProp is an adaptive learning rate algorithm introduced
by Tieleman, G. Hinton, et al. (2012).
RMSProp uses the same basic idea as Adagrad, but instead of accumulating the squared gradi-
ents it uses a weighted average of the historic gradients. Let 𝛼 ∈ (0, 1):
$$v_t = \alpha v_{t-1} + (1 - \alpha) g_t^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \varepsilon}}\, g_t,$$
where again all the arithmetic operations are applied component-wise. The 𝛼 factor is called
the forgetting factor, decay rate or smoothing factor. The suggested hyperparameter values to
start with are 𝜂 = 0.001 and 𝛼 = 0.9.

2.5.3 Adam

The Adaptive Moment Estimation method, or Adam, was introduced by Kingma and Ba (2017).
In addition to storing an exponentially decaying average of past square gradients 𝑣 𝑡 like RM-
SProp, Adam also keeps a running average of past gradients 𝑚𝑡 , similar to momentum.
Let $\beta_1, \beta_2 \in (0, 1)$,
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) (g_t)^2,$$
with 𝑚0 = 𝑣 0 = 0. The vectors 𝑚𝑡 and 𝑣 𝑡 are estimates of the first moment (the mean) and the
second moment (the uncentered variance), hence the name of the method.
As we start with 𝑚0 = 𝑣0 = 0 the estimates are biased towards zero. Let us see how biased and
whether we can correct that bias. By induction we have:
$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i, \qquad v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} (g_i)^2,$$


and so:
$$\mathbb{E}[m_t] = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, \mathbb{E}[g_i], \qquad \mathbb{E}[v_t] = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i}\, \mathbb{E}[(g_i)^2].$$

Assuming 𝔼[𝑔𝑖 ] and 𝔼[(𝑔𝑖 )2 ] are stationary (i.e. do not depend on 𝑖) we get:
$$\mathbb{E}[m_t] = (1 - \beta_1) \left( \sum_{i=1}^{t} \beta_1^{t-i} \right) \mathbb{E}[g_t], \qquad \mathbb{E}[v_t] = (1 - \beta_2) \left( \sum_{i=1}^{t} \beta_2^{t-i} \right) \mathbb{E}[(g_t)^2].$$

Using the geometric sum $(1 - \beta) \sum_{i=1}^{t} \beta^{t-i} = 1 - \beta^t$ this simplifies to:
$$\mathbb{E}[m_t] = (1 - \beta_1^t)\, \mathbb{E}[g_t], \qquad \mathbb{E}[v_t] = (1 - \beta_2^t)\, \mathbb{E}[(g_t)^2].$$

Therefore, the bias-corrected moments are:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

The update rule of Adam is then given by:
$$w_{t+1} = w_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}},$$

where again all arithmetic is applied component-wise.


The default hyperparameter values suggested by Kingma and Ba (2017) are $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\varepsilon = 10^{-8}$.
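Putting the pieces together, one Adam step could be sketched as follows (plain PyTorch with our own helper names; torch.optim.Adam is the library implementation):

```python
import torch

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One component-wise Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / torch.sqrt(v_hat + eps)
    return w, m, v

w, m, v = torch.zeros(3), torch.zeros(3), torch.zeros(3)
for t in range(1, 4):
    g = torch.randn(3)                          # stand-in for a batch gradient
    w, m, v = adam_step(w, g, m, v, t)
```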

2.5.4 Which variant to use?

Which of these algorithms should you use? If your data is sparse (very high dimensional
data usually is) then the adaptive learning rate methods usually outperform plain SGD (with
or without momentum). Kingma and Ba (2017) showed that due to its bias correction Adam
performs marginally better late in the training process. In that regard Adam is a safe option
to try first. Nonetheless trying different algorithms to see which one works best for any given
problem can be worthwhile.
Fig. 2.12 illustrates the different behaviour of the methods we have discussed in some artificial
loss landscapes. These landscapes model some of the problems the algorithms may encounter
and show some of the strengths and weaknesses of each method.
There are many more SGD variants we have not discussed; you can look at the torch.optim namespace in the PyTorch documentation to see the available algorithms.


Figure 2.12: Comparing 5 SGD variants in 4 different situations for a number


of steps. In the canyon situation we have one parameter that has
a large effect on the loss and one parameter that has little effect on
the loss. The saddle situation has a prominent saddle point. The
plateau has a large flat section with a vanishing gradient that needs
to be traversed to get to the minimum. The final situation has an
obstacle the algorithm has to go around to reach the minimum.
Dark level sets indicate lower function values. This figure was
generated with SGDDemonstration.ipynb.

Chapter 3

Equivariance

There are many applications where we want the neural network to have certain symmetries,
such as in Fig. 3.1. Most applications come with natural symmetries. It might be rotation-
translation invariance for a medical diagnosis application that detects tumors in X-ray images
or time invariance for a weather forecasting system.

Figure 3.1: We would like a classification network to be invariant under translation, rotation and scaling. We could train our network with
translated, rotated and scaled versions of our original data and
hope the invariance gets encoded that way. But this would drasti-
cally increase the training time and gives no guarantee of success.
Preferable would be designing the network in such a way that it is
intrinsically invariant.

The desired symmetry might not be an actual invariance in that the output is not expected to
remain the same but instead transform in some manner similar to the input. For example: in
an image enhancement application the output image is expected to rotate and translate along
with the input image. Some authors use invariance to denote both cases, we will use the term
equivariance (as in: transforms with) to refer to both and use invariance for the special case
where the output should stay the same.
Expressing many of these symmetries on discrete domains is awkward; indeed rotations and translations of an image are not even well defined unless the translation is on the grid or the rotation is in 90° increments. So instead we will be working in the continuous setting and only
discretizing when it becomes time to implement our ideas. For images, instead of representing
them as elements of ℝ𝐻×𝑊 we represent them as elements of 𝐶 𝑐∞ (ℝ2 ), i.e. smooth functions
with compact support.
Our eventual practical goal for this chapter is building a type of CNN that is not only translation


equivariant but also rotation equivariant. But being mathematicians we want to be general and
develop a theory that allows for other transformations as well. This brings us to the theory of
Lie groups, which are essentially continuous transformation groups, something we will make
precise in this chapter.
Lie group theory plays an important role in many disparate fields of study such as geometric
image analysis, computer vision, particle physics, financial mathematics, robotics, etc. In the
next sections we will build up a general equivariance framework based on Lie groups. The
payoff of this theoretical work will be a general recipe for building a neural network that is
equivariant with respect to an arbitrary transformation group: a so called Group Equivariant
CNN (Cohen, Geiger, and Weiler 2020; Cohen and Welling 2016), or G-CNN for short.

3.1 Manifolds

We start with the basic object we will be working with: the manifold. We are accustomed to
doing analysis on ℝ𝑛 but we also know that there are non-Euclidean spaces of importance.
A classic example is the unit circle $S^1$, which is distinctly non-Euclidean but still admits derivatives, integrals, PDEs, etc. When working with an object such as $S^1$ we usually do it indirectly by parameterizing it in some way, such as with an angle $\theta \in [0, 2\pi)$, so that at least locally it resembles $\mathbb{R}$. We can generalize this and consider all spaces that we can, at least locally, identify with a subset of $\mathbb{R}^n$.

3.1.1 Characterization

The proper way to introduce manifolds is starting with a set and then adding several layers of
structure, as is done in Lee (2010, 2013). After going through that laborious process we would
then see a characterization that is a more practical tool for constructing manifolds. In this course
we will skip straight to that characterization in the form of the following lemma.

Lemma 3.1 (Smooth manifold chart lemma). Let 𝑀 be a set and suppose we are given a
collection of subsets {𝑈 𝛼 } 𝛼∈𝐼 of 𝑀 for some index set 𝐼, together with maps 𝜑 𝛼 : 𝑈 𝛼 → ℝ𝑛 .
Then 𝑀 together with {(𝑈 𝛼 , 𝜑 𝛼 )} 𝛼∈𝐼 gives a smooth 𝑛-dimensional manifold if the following
conditions are satisfied.

(i) For all 𝛼 ∈ 𝐼, 𝜑 𝛼 is a bijection between 𝑈 𝛼 and an open subset of ℝ𝑛 .


(ii) For each $\alpha, \beta \in I$, the sets $\varphi_\alpha(U_\alpha \cap U_\beta)$ and $\varphi_\beta(U_\alpha \cap U_\beta)$ are open in $\mathbb{R}^n$.
(iii) When $U_\alpha \cap U_\beta \neq \emptyset$ then the map $\tau_{\alpha,\beta} := \varphi_\beta \circ \varphi_\alpha^{-1} : \varphi_\alpha(U_\alpha \cap U_\beta) \to \varphi_\beta(U_\alpha \cap U_\beta)$ is smooth.
(iv) There exists an (at most) countably infinite $J \subset I$ so that $\bigcup_{\alpha \in J} U_\alpha = M$, i.e. there exists a countable cover of $M$.
(v) Whenever $p_1$ and $p_2$ are distinct points in $M$, there exist $U_\alpha$ and $U_\beta$ (not necessarily distinct) so that there exist disjoint sets $V \subset U_\alpha$ and $W \subset U_\beta$, with $\varphi_\alpha(V)$ and $\varphi_\beta(W)$ open in $\mathbb{R}^n$, with $p_1 \in V$ and $p_2 \in W$.

We say that each pair (𝑈 𝛼 , 𝜑 𝛼 ) is a (smooth) chart and that the set {(𝑈 𝛼 , 𝜑 𝛼 )} 𝛼∈𝐼 is a (smooth)
atlas of 𝑀. The maps 𝜏𝛼,𝛽 := 𝜑𝛽 ◦ 𝜑 −1 𝛼 are called the atlas’ transition maps. Since the transition
maps are from ℝ𝑛 to ℝ𝑛 we know exactly what it means for these maps to be smooth. Charts


that have smooth transition maps both ways are said to be compatible. This construction is
illustrated in Fig. 3.2, for details and proof see Lee (2013, Ch. 1).

Figure 3.2: A manifold $M$ with two charts $(U_\alpha, \varphi_\alpha)$ and $(U_\beta, \varphi_\beta)$ and their transition maps $\tau_{\alpha,\beta} = \varphi_\beta \circ \varphi_\alpha^{-1}$ and $\tau_{\beta,\alpha} = \varphi_\alpha \circ \varphi_\beta^{-1}$.

Remark 3.2 (The manifold is not the atlas). In Lemma 3.1 we say that the set 𝑀 along with an
atlas {(𝑈 𝛼 , 𝜑 𝛼 )} 𝛼∈𝐼 gives a smooth manifold rather than stating that it is a smooth manifold.
This is because we could construct another atlas compatible with the first but the set together
with the new atlas would still describe the same manifold. In example 3.5 this distinction
will be clear as we can see that we can construct any number of atlases describing the same
manifold.
We say two atlases are compatible if the transition maps between charts of the two atlases
are also smooth, if two atlases are compatible they describe the same manifold. Formally we
could define a smooth manifold as the equivalence class of all atlases per Lemma 3.1 under
the equivalence relation of the atlases being compatible.
We could also consider the union of all possible compatible atlases, called the maximal atlas
as defining the manifold.

Remark 3.3 (Hausdorff space). Item (v) of Lemma 3.1 guarantees that a manifold is a Haus-
dorff space, this simply means that for any two distinct points 𝑝 1 and 𝑝 2 we can find neigh-
borhoods of both that are disjoint (a neighborhood in the manifold being a preimage of a
neighborhood in a chart). This may seem redundant but one can construct counterexamples
that satisfy (i)-(iv) but not (v). The Hausdorff property is needed for limits to be unique and
manifolds are required to be Hausdorff for that reason.

We define what the open subsets of the manifold are by looking at their images in ℝ𝑛 . A subset
𝑉 ⊂ 𝑀 is open if 𝜑 𝛼 (𝑉 ∩ 𝑈 𝛼 ) is open in ℝ𝑛 for all 𝛼 ∈ 𝐼. In other words: we let the atlas and
the standard topology on ℝ𝑛 define the topology of the manifold. With this definition we could
rephrase item (v) of Lemma 3.1 as: for any two distinct points of 𝑀 there exist neighborhoods
(i.e. open subsets containing the point in question) of said two points that are disjoint.
Of course if (𝑈 𝛼 , 𝜑 𝛼 ) is a chart then if 𝑉 ⊂ 𝑈 𝛼 is open then (𝑉 , 𝜑 𝛼 |𝑉 ) is also a chart.


Remark 3.4 (Open sets and continuous charts). Defining the open sets of 𝑀 as the preimages
of open subsets of ℝ𝑛 makes the charts continuous maps by definition. It might be the
case that when we are constructing a manifold we do not start with 𝑀 being just a set. For
example, in many cases we start with 𝑀 being a subset of ℝ𝑛 , in that case we already know
what the open subsets of 𝑀 are (i.e. the topology of 𝑀 is already known). If we already
decided what the open subsets of 𝑀 are going to be then we need to make sure that the two
notions of ‘open’ coincide. This requires that the charts are continuous maps by construction
instead of them being continuous by definition.

Lemma 3.1 is a characterization so it works the other way around as well. If 𝑀 is a smooth
manifold then it is always possible to produce a smooth atlas, i.e. a set {(𝑈 𝛼 , 𝜑 𝛼 )} 𝛼∈𝐼 that
satisfies all the conditions of the lemma.

Example 3.5 (The unit circle). Let $S^1$ be the unit circle in $\mathbb{R}^2$, i.e. $S^1 = \{(x, y) \mid x^2 + y^2 = 1\}$.


Then we can construct two smooth charts that cover 𝑆 1 :

𝑈1 = 𝑆 1 \ {(−1, 0)} , 𝜑1 (𝑥, 𝑦) : 𝑈1 → (−𝜋, 𝜋),


𝑈2 = 𝑆 1 \ {(1, 0)} , 𝜑2 (𝑥, 𝑦) : 𝑈2 → (0, 2𝜋).

Where 𝜑1 gives the angle between the 𝑥-axis and (𝑥, 𝑦) measured from −𝜋 to 𝜋 and 𝜑2 gives
that same angle measured from 0 to 2𝜋, as illustrated in Figure 3.3.
Clearly 𝑈1 ∪ 𝑈2 = 𝑆 1 and we see that 𝜑1 (𝑈1 ) = (−𝜋, 𝜋) and 𝜑2 (𝑈2 ) = (0, 2𝜋), so these two
charts form an atlas of 𝑆 1 . Now we have to check the transition maps, note that 𝑈1 ∩ 𝑈2 =
𝑆 1 \ {(−1, 0), (1, 0)} which consists of two disjoint components, so the transition map from the
first to the second chart would have as domain (−𝜋, 0) ∪ (0, 𝜋) and be defined as:
$$\tau_{1,2}(\theta) = \varphi_2 \circ \varphi_1^{-1}(\theta) = \begin{cases} \theta & \text{if } \theta \in (0, \pi), \\ \theta + 2\pi & \text{if } \theta \in (-\pi, 0), \end{cases}$$

which is smooth on its domain (the discontinuity at 𝜃 = 0 is not part of the domain). The
inverted transition map is similarly smooth. The whole construction is visualized in Fig. 3.3.

Figure 3.3: A smooth atlas on $S^1$ with the transition maps illustrated in grey.

3.1.2 Smooth Maps

An important reason for introducing smooth manifolds is being able to talk about smooth
maps. While the terms map and function are technically interchangeable we will use the term


function for maps whose co-domain is ℝ or ℝ 𝑘 and use map for more general maps between
manifolds.

Definition 3.6 (Smooth map). Let 𝑀 and 𝑁 be two smooth manifolds and let 𝐹 : 𝑀 → 𝑁 be
any map. We say 𝐹 is a smooth map if for every 𝑝 ∈ 𝑀 there exists a smooth chart (𝑈 , 𝜑) of
𝑀 that contains 𝑝 and a smooth chart (𝑉 , 𝜓) of 𝑁 that contains 𝐹(𝑝) so that 𝐹(𝑈) ⊆ 𝑉 and the
map 𝜓 ◦ 𝐹 ◦ 𝜑−1 is smooth from 𝜑(𝑈) to 𝜓(𝐹(𝑈)) ⊆ 𝜓(𝑉).

Due to the smooth structure (i.e. all of its possible atlases) that a smooth manifold is equipped
with this definition is independent of the choice of charts, the smoothness of transition maps
ensures this key property. The definition makes it clear that we can only talk about the smooth-
ness of maps if the manifolds in question have smooth structures, so when we say 𝐹 : 𝑀 → 𝑁
is a smooth map we imply that 𝑀 and 𝑁 are smooth manifolds even if we do not specify that
fact explicitly. The construction from the definition is illustrated in Fig. 3.4.

Figure 3.4: A smooth map $F$ between two manifolds and its representation $\psi \circ F \circ \varphi^{-1}$ between chart co-domains.

With smooth function we now just mean a smooth map from 𝑀 to ℝ or ℝ 𝑘 . Another special
case for which we reserve its own term is that of a smooth curve: a smooth map from ℝ (or
some interval of ℝ) to another smooth manifold.

Example 3.7. The map 𝐹 : ℝ → 𝑆1 given by 𝐹(𝜃) := (cos 𝜃, sin 𝜃) is a smooth map from the
manifold ℝ to the manifold 𝑆1 .

Definition 3.8 (Diffeomorphism). Let 𝑀 and 𝑁 be smooth manifolds, a smooth bijective map
from 𝑀 to 𝑁 that has a smooth inverse is called a diffeomorphism. Two manifolds between
which a diffeomorphism exists are said to be diffeomorphic.

Example 3.9. The unit circle 𝑆 1 is not diffeomorphic with ℝ since there exists no continuous
bijection between the two.


The unit circle is however diffeomorphic with the group $\mathrm{SO}(2)$ (i.e. the group of orthogonal $2 \times 2$ matrices with determinant 1) via the identification
$$S^1 \ni (x, y) \leftrightarrow \begin{bmatrix} x & -y \\ y & x \end{bmatrix} \in \mathrm{SO}(2),$$
usually parametrized with $\theta \in \mathbb{R}/(2\pi\mathbb{Z})$ as
$$S^1 \ni (\cos\theta, \sin\theta) \leftrightarrow \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \in \mathrm{SO}(2).$$

3.2 Lie Groups

3.2.1 Basic Definitions

We assume that the transformations we are interested in form Lie groups. Classical groups such
as the general and special linear groups (the groups of matrices that are invertible resp. have
determinant 1), the orthogonal group (the group of orthogonal matrices), etc. are examples of
Lie groups.

Definition 3.10. A Lie group G is a smooth manifold so that G is also an algebraic group
given by two smooth maps, one for the group product (also called multiplication):

𝐺 × 𝐺 → 𝐺, (𝑔1 , 𝑔2 ) ↦→ 𝑔1 𝑔2 ,

and one for inversion:


𝐺 → 𝐺, 𝑔 ↦→ 𝑔 −1 .

Recall that being a group, a Lie group has the following properties.

• Closure: ∀𝑔1 , 𝑔2 ∈ 𝐺 : 𝑔1 𝑔2 ∈ 𝐺.
• Associativity: ∀𝑔1 , 𝑔2 , 𝑔3 ∈ 𝐺 : (𝑔1 𝑔2 )𝑔3 = 𝑔1 (𝑔2 𝑔3 ).
• Unit element: ∃𝑒 ∈ 𝐺, ∀𝑔 ∈ 𝐺 : 𝑒 𝑔 = 𝑔𝑒 = 𝑔 and this element 𝑒 is unique. We use the
traditional 𝑒 to denote the unit element, which derives from the German Einselement.
• Inverse: ∀𝑔 ∈ 𝐺 ∃𝑔 −1 ∈ 𝐺 : 𝑔 𝑔 −1 = 𝑔 −1 𝑔 = 𝑒.

We should emphasize that a group need not be commutative, indeed the particular Lie groups
we are most interested in are not commutative and have group elements 𝑔1 , 𝑔2 for which
𝑔1 𝑔2 ≠ 𝑔2 𝑔1 .
For 𝑔 ∈ 𝐺, we denote by 𝐿 𝑔 : 𝐺 → 𝐺, 𝐿 𝑔 (ℎ) = 𝑔 ℎ the left multiplication by 𝑔 and by 𝑅 𝑔 : 𝐺 → 𝐺,
𝑅 𝑔 (ℎ) = ℎ 𝑔 the right multiplication by 𝑔. Left and right multiplication are also sometimes
called left and right translation.

Example 3.11. ℝ𝑛 is a (commutative) Lie group under vector addition (𝒙, 𝒚) ↦→ 𝒙 + 𝒚 and
negation 𝒙 ↦→ −𝒙.

Example 3.12 (Multiplicative group of positive real numbers). ℝ>0 is a (commutative) Lie group
under multiplication (𝑥, 𝑦) ↦→ 𝑥 𝑦 and inversion 𝑥 ↦→ 1/𝑥 .


Example 3.13 (General linear group). The general linear group of degree 𝑛, GL(𝑛), is the group
of all invertible 𝑛 × 𝑛 matrices. The group product is the matrix product.

Example 3.14 (Special orthogonal group). The special orthogonal group of degree $n$, $\mathrm{SO}(n)$, is the subgroup of $\mathrm{GL}(n)$ of all orthogonal matrices with determinant 1. It is the group of rotations in $n$ dimensions. For $\mathrm{SO}(2)$ we write the matrices in terms of the angle of the rotation $\theta \in \mathbb{R}/(2\pi\mathbb{Z})$ as
$$R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \tag{3.1}$$
with the group law given by 𝑅(𝜃1 )𝑅(𝜃2 ) = 𝑅(𝜃1 + 𝜃2 ).

Example 3.15 (Special Euclidean group). The special Euclidean group of degree 𝑛, SE(𝑛), is
the group of rotations and translations in 𝑛 dimensions. As a set it equals ℝ𝑛 × SO(𝑛) with the
group law given by:
(𝒙 1 , 𝑅1 ) (𝒙 2 , 𝑅2 ) = (𝒙 1 + 𝑅 1 𝒙 2 , 𝑅 1 𝑅 2 ) . (3.2)
The group law is not the direct product (i.e. not (𝒙 1 + 𝒙 2 , 𝑅 1 𝑅 2 )) but the second group affects
the law of the first group. We call this a semidirect product, to emphasize this difference we
write SE(𝑛) = ℝ𝑛 ⋊ SO(𝑛).
Since we want to design a neural network that takes in 2D images and is rotation-translation
equivariant, the Lie group SE(2) is of special interest to us. In the SE(2) case we can either
represent elements as (𝒙, 𝑅(𝜃)) ∈ ℝ2 × SO(2) with 𝑅(𝜃) the rotation matrix from (3.1), or as
(𝒙, 𝜃) ∈ ℝ2 × [0, 2𝜋). In the latter case the group law can be written as

(𝒙 1 , 𝜃1 )(𝒙 2 , 𝜃2 ) = (𝒙 1 + 𝑅(𝜃1 )𝒙2 , 𝜃1 + 𝜃2 mod 2𝜋) .
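As a concrete illustration (a small Python sketch of our own, representing elements as $(x, \theta)$ pairs), the SE(2) group law and inverse look like this:

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """The SO(2) rotation matrix R(theta) from (3.1)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def se2_product(g1, g2):
    """Group law (x1, th1)(x2, th2) = (x1 + R(th1) x2, th1 + th2 mod 2*pi)."""
    (x1, th1), (x2, th2) = g1, g2
    return (x1 + rot(th1) @ x2, (th1 + th2) % (2 * np.pi))

def se2_inverse(g):
    """Inverse: (x, th)^(-1) = (-R(-th) x, -th)."""
    x, th = g
    return (-rot(-th) @ x, (-th) % (2 * np.pi))

e = (np.zeros(2), 0.0)                       # the unit element
g = (np.array([1.0, 2.0]), np.pi / 3)
h = se2_product(g, se2_inverse(g))           # should give the unit element
print(np.allclose(h[0], e[0]), np.isclose(h[1] % (2 * np.pi), 0.0))
```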

3.2.2 Lie Subgroups

Algebraic groups can have subgroups (think of the many subgroups of the general linear
group). Lie groups also have subgroups but those subgroups are not automatically Lie groups
themselves. Let 𝐺 be a Lie group, then 𝐻 ⊂ 𝐺 is a Lie subgroup of 𝐺 if

(i) 𝐻 is a subgroup of the group 𝐺,


(ii) 𝐻 is an immersed submanifold of the manifold 𝐺,
(iii) the group operations on 𝐻 are smooth.

We have not seen what an immersed submanifold is and this is beyond the scope of this course.
However the following theorem provides an easier way of identifying most Lie subgroups.

Theorem 3.16 (Cartan’s closed subgroup theorem). Any subgroup of a Lie group that is
closed (as a set) is a Lie subgroup.

See Lee (2013, Ch. 7) for proof and details.


Not all Lie subgroups are closed, but the ones we are interested in all are. Consequently Theorem 3.16 is our go-to method for proving that a subgroup is a Lie subgroup.

Example 3.17. Every linear subspace of ℝ𝑛 is a (closed) Lie subgroup under vector addition.

Example 3.18. The groups ℝ𝑛 × {0} and {0} × SO(𝑛) are (closed) Lie subgroups of SE(𝑛).


3.2.3 Group Actions

The most important use of Lie groups in manifold theory involves the action of a Lie group on
a manifold.
Definition 3.19 (Group action). If 𝐺 is a group and 𝑀 a set then a left action of 𝐺 on 𝑀 is a
map 𝐺 × 𝑀 → 𝑀 written as (𝑔, 𝑝) ↦→ 𝑔 · 𝑝 that satisfies:

𝑔2 · (𝑔1 · 𝑝) = (𝑔2 𝑔1 ) · 𝑝    ∀𝑔1 , 𝑔2 ∈ 𝐺, 𝑝 ∈ 𝑀,
𝑒 · 𝑝 = 𝑝    ∀𝑝 ∈ 𝑀.    (3.3)

A right action is defined similarly as a map 𝑀 × 𝐺 → 𝑀 that satisfies:

(𝑝 · 𝑔1 ) · 𝑔2 = 𝑝 · (𝑔1 𝑔2 ) ∀𝑔1 , 𝑔2 ∈ 𝐺, 𝑝 ∈ 𝑀,
𝑝·𝑒 =𝑝 ∀𝑝 ∈ 𝑀.

If both 𝐺 and 𝑀 are manifolds and the map is smooth in both inputs 𝐺 and 𝑀 then we say
the action is a smooth action.
We will focus exclusively on left actions since their group law (3.3) has the property that group
multiplication corresponds to map composition. In any case a right action can always be
converted to a left action by defining 𝑔 · 𝑝 := 𝑝 · 𝑔 −1 , or vice versa for turning a left action into a
right action.
In our setting 𝐺 is always a Lie group and 𝑀 always a manifold and we will only be considering
smooth actions.
Sometimes it is convenient to label an action, say 𝜌 : 𝐺 × 𝑀 → 𝑀. The action of a group element
𝑔 on a point 𝑝 can then be written equivalently as 𝑔 · 𝑝 ≡ 𝜌(𝑔, 𝑝) ≡ 𝜌 𝑔 (𝑝).
If 𝜌 is a smooth action then for all 𝑔 ∈ 𝐺, 𝜌 𝑔 : 𝑀 → 𝑀 is a diffeomorphism since 𝜌 𝑔 −1 is a
smooth inverse.
A group action on a manifold induces an action on any function space on that manifold in a
straightforward manner. Let 𝑋 be a function space on 𝑀, such as 𝐶 𝑘 (𝑀) or 𝐿 𝑝 (𝑀), then the
mapping 𝜌𝑋 : 𝐺 × 𝑋 → 𝑋, defined by
 
𝜌𝑋 (𝑔, 𝑓 ) (𝑝) = 𝑓 (𝑔 −1 · 𝑝) = 𝑓 (𝜌(𝑔 −1 , 𝑝)), (3.4)

for all 𝑓 ∈ 𝑋 and 𝑔 ∈ 𝐺 is a (left) action. We can verify that the group law (3.3) holds:

𝜌𝑋 (𝑔2 , 𝜌𝑋 (𝑔1 , 𝑓 ))(𝑝) = 𝜌𝑋 (𝑔1 , 𝑓 )(𝑔2⁻¹ · 𝑝)
                       = 𝑓 (𝑔1⁻¹ · (𝑔2⁻¹ · 𝑝))
                       = 𝑓 ((𝑔1⁻¹ 𝑔2⁻¹) · 𝑝)
                       = 𝑓 ((𝑔2 𝑔1 )⁻¹ · 𝑝)
                       = 𝜌𝑋 (𝑔2 𝑔1 , 𝑓 )(𝑝).

This action on function spaces has the additional property that it is linear in the second argument. We call actions with this property representations of the Lie group.


Definition 3.20 (Lie group representation). Let 𝐺 be a Lie group and 𝑉 a vector space
(finite dimensional or not) then 𝜈 : 𝐺 → Aut(𝑉) is a representation of 𝐺 if it is a smooth
homomorphism, i.e. is smooth and

𝜈(𝑔1 𝑔2 ) = 𝜈(𝑔1 ) ◦ 𝜈(𝑔2 ) ∀𝑔1 , 𝑔2 ∈ 𝐺.

Recall that the automorphism group Aut(𝑉) is the group of invertible linear transformations
of 𝑉.
It follows from the definition that a representation 𝜈 also has the following properties:

𝜈(𝑒) = id𝑉 and 𝜈(𝑔 −1 ) = 𝜈(𝑔)−1 .

Since we usually have only one group action per manifold, and so one corresponding representation on a given function space, we can overload the meaning of the “·” symbol and use the following equivalent notations:

𝑔 · 𝑓 := 𝜌𝑋𝑔 ( 𝑓 ) := 𝜌𝑋 (𝑔, 𝑓 ).

This is how we are going to model transformations acting on our data. In our rotation-translation case 𝑓 would be an input image on ℝ2 and 𝑔 ∈ SE(2) would be a rotation-translation acting on the image.
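As a small worked example (a hedged sketch; the helper `se2_act_on_image` and its use of scipy are my own assumptions, not part of the notes), the induced action (3.4) of 𝑔 = (𝒙0 , 𝜃) ∈ SE(2) on a sampled image can be computed by evaluating 𝑓 at 𝑔⁻¹ · 𝑝 for every pixel 𝑝:

```python
# Sketch of the induced action (3.4): (g . f)(p) = f(g^{-1} . p) for g = (x0, theta)
# in SE(2), applied to an image sampled on a pixel grid.
import numpy as np
from scipy.ndimage import map_coordinates

def se2_act_on_image(f, x0=(0.0, 0.0), theta=0.0, order=1):
    """Return (g . f) sampled on the same pixel grid as the array f."""
    h, w = f.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # pixel centres as points p in R^2, using (x, y) = (column, row)
    p = np.stack([cols.ravel(), rows.ravel()], axis=0).astype(float)
    c, s = np.cos(theta), np.sin(theta)
    R_inv = np.array([[c, s], [-s, c]])            # R(theta)^{-1} = R(-theta)
    q = R_inv @ (p - np.asarray(x0)[:, None])      # g^{-1} . p = R(-theta)(p - x0)
    # sample f at the points q with linear interpolation; coordinates are (row, col)
    vals = map_coordinates(f, [q[1], q[0]], order=order, mode="constant", cval=0.0)
    return vals.reshape(h, w)
```

Note that this sketch rotates around the pixel-grid origin; in practice one usually shifts the centre of rotation to the centre of the image.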

3.2.4 Equivariant Maps and Operators

Suppose 𝐺 is a Lie group and 𝑀 and 𝑁 are smooth manifolds with smooth (left) actions 𝜌 𝑀
and 𝜌 𝑁 . Then we can consider maps 𝐹 : 𝑀 → 𝑁 that are equivariant with respect to those
group actions, i.e.
𝐹(𝜌 𝑀 (𝑔, 𝑝)) = 𝜌 𝑁 (𝑔, 𝐹(𝑝))
for all 𝑔 ∈ 𝐺 and 𝑝 ∈ 𝑀, or more concisely:

𝐹(𝑔 · 𝑝) = 𝑔 · 𝐹(𝑝).

Equivalently, 𝐹 is equivariant if for each 𝑔 ∈ 𝐺 the following diagram commutes, i.e. 𝜌𝑁𝑔 ∘ 𝐹 = 𝐹 ∘ 𝜌𝑀𝑔 :

[commutative diagram: 𝑀 → 𝑁 via 𝐹 on the top and bottom rows, with vertical maps 𝜌𝑀𝑔 on the left and 𝜌𝑁𝑔 on the right]

This idea extends naturally to operators between function spaces on those manifolds. Let 𝑋
be a function space on 𝑀 and 𝑌 a function space on 𝑁 equipped with the corresponding
representations 𝜌𝑋 and 𝜌𝑌 per (3.4). Then an operator 𝐴 : 𝑋 → 𝑌 is equivariant if

𝐴 ◦ 𝜌𝑋𝑔 = 𝜌𝑌𝑔 ◦ 𝐴, ∀𝑔 ∈ 𝐺. (3.5)

In words: for every group element, performing the corresponding transform on the input space 𝑋 and then applying the operator 𝐴 gives the same result as first applying the operator 𝐴 and then performing the transform corresponding to the group element on the output space 𝑌.
Our goal in the continuous setting is designing our neural network as an equivariant operator
that satisfies (3.5).
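In the discretized setting the condition (3.5) can be sanity-checked numerically. The following sketch (which assumes the `se2_act_on_image` helper from the earlier sketch and some operator `A` acting on image arrays) simply compares both sides of (3.5) on a sample input:

```python
# Numerical check of the equivariance condition (3.5): A(rho_g f) should match
# rho_g (A f) up to discretization and boundary effects.
import numpy as np

def equivariance_error(A, f, x0, theta):
    lhs = A(se2_act_on_image(f, x0, theta))        # A after rho_g
    rhs = se2_act_on_image(A(f), x0, theta)        # rho_g after A
    return np.max(np.abs(lhs - rhs))

# e.g. an isotropic Gaussian blur is rotation-translation equivariant, so the
# error stays small, while a convolution with an oriented kernel does not.
```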


3.2.5 Homogeneous Spaces

While Lie groups represent the transformations we are interested in, homogeneous spaces are
the spaces our data will live on and on which the Lie groups will act.

Definition 3.21 (Homogeneous space). A smooth manifold 𝑀 is a homogeneous space of a


Lie group 𝐺 if there exists a smooth (left) action 𝜌 : 𝐺 × 𝑀 → 𝑀 that is transitive, i.e.

∀𝑝 1 , 𝑝2 ∈ 𝑀 ∃𝑔 ∈ 𝐺 : 𝜌(𝑔, 𝑝1 ) = 𝑔 · 𝑝 1 = 𝑝 2 .

The elements of 𝑀 are called the points of the homogeneous space and 𝐺 is called the motion
group or the fundamental group of the homogeneous space. The transitive property can be
reformulated as: for every point in 𝑀 there is a 𝑔 that takes us to any other point in 𝑀.
Observe that 𝐺 is a homogeneous space of itself, called the principal homogeneous space. The
group action is just the left multiplication, i.e. 𝜌 𝑔 = 𝐿 𝑔 .
On the other end we have the trivial homogeneous space consisting of a single element {0},
which is a homogeneous space of every Lie group under the identity action 𝜌(𝑔, 0) = 0.

Remark 3.22 (Zero-dimensional manifolds). You may wonder whether {0} is a manifold. In
fact all (at most) countable sets 𝑆 are 0-dimensional manifolds. Assign each point 𝑝 ∈ 𝑆 its
own unique chart 𝜑 𝑝 : {𝑝} → ℝ0 = {0}. Since these chart domains do not overlap the charts
are trivially smoothly compatible and form a unique smooth atlas.

For each 𝑝 ∈ 𝑀, the stabilizer or the isotropy group of 𝑝 is the subset 𝐺 𝑝 of 𝐺 (also denoted by
Stab𝐺 (𝑝)) that fixes 𝑝:

𝐺 𝑝 := Stab𝐺 (𝑝) := {𝑔 ∈ 𝐺 | 𝑔 · 𝑝 = 𝜌(𝑔, 𝑝) = 𝑝} . (3.6)

If we have 𝑔1 , 𝑔2 ∈ 𝐺 𝑝 then (𝑔1 𝑔2 ) · 𝑝 = 𝑔1 · (𝑔2 · 𝑝) = 𝑔1 · 𝑝 = 𝑝. So 𝑔1 𝑔2 ∈ 𝐺 𝑝 , meaning that


𝐺 𝑝 is a subgroup of 𝐺. Moreover since the group action is smooth it follows that if (𝑔𝑛 )𝑛∈ℕ is a
sequence in 𝐺 𝑝 with lim𝑛→∞ 𝑔𝑛 = 𝑔 ∈ 𝐺 then
 
𝜌(𝑔, 𝑝) = 𝜌( lim𝑛→∞ 𝑔𝑛 , 𝑝 ) = lim𝑛→∞ 𝜌(𝑔𝑛 , 𝑝) = 𝑝.

From which we conclude that 𝑔 ∈ 𝐺 𝑝 and so 𝐺 𝑝 is closed and consequently by Theorem 3.16,
𝐺 𝑝 is a Lie subgroup of 𝐺 for all 𝑝 ∈ 𝑀.
When we pick a reference element 𝑝 0 ∈ 𝑀, we can define the subset 𝐺 𝑝0 ,𝑝 ⊂ 𝐺 of all group
elements that map 𝑝 0 to 𝑝:
𝐺 𝑝0 ,𝑝 := {𝑔 ∈ 𝐺 | 𝑔 · 𝑝0 = 𝑝} . (3.7)
Note that this is generally not a subgroup, except for 𝐺 𝑝0 ,𝑝0 = 𝐺 𝑝0 . If we have two group
elements that map 𝑝 0 to the same 𝑝, i.e. 𝑔1 , 𝑔2 ∈ 𝐺 𝑝0 ,𝑝 then

𝑔1 · 𝑝 0 = 𝑔2 · 𝑝 0 ⇔ 𝑔1−1 𝑔2 · 𝑝 0 = 𝑝 0 ⇔ 𝑔1−1 𝑔2 ∈ 𝐺 𝑝0 . (3.8)

This condition imposes an equivalence relation on 𝐺, which we can quotient out as follows:

𝐺/𝐺𝑝0 := { 𝑠 ⊂ 𝐺 | ∀𝑔1 , 𝑔2 ∈ 𝑠 : 𝑔1⁻¹ 𝑔2 ∈ 𝐺𝑝0 } = { 𝐺𝑝0 ,𝑝 | 𝑝 ∈ 𝑀 } .

There is a straightforward isomorphism between 𝑀 and 𝐺/𝐺𝑝0 given by 𝑝 ↦→ 𝐺𝑝0 ,𝑝 and 𝐺𝑝0 ,𝑝 ↦→ 𝐺𝑝0 ,𝑝 · 𝑝0 .


From this we conclude that all homogeneous spaces are isomorphic to a Lie group quotient 𝐺/𝐻 for some closed Lie subgroup 𝐻 of 𝐺. For this reason many authors blur the
line between a homogeneous space and its corresponding group quotient and effectively equate
a point of the homogeneous space with its corresponding equivalence class in the group, i.e.
𝑝 ≡ 𝐺 𝑝0 ,𝑝 after fixing a 𝑝 0 ∈ 𝑀. This leads to concise notation such as 𝑔 ∈ 𝑝 ⇔ 𝑔 · 𝑝 0 = 𝑝 and
the dropping of the ‘·’ notation since if 𝑝 is seen as a subset of 𝐺 then 𝑔 · 𝑝 ≡ 𝑔𝑝.

3.3 Linear Operators

Now let us look at how we can start putting together an artificial neuron in our new setting.
We have an input manifold 𝑀 and an output manifold 𝑁 that are both homogeneous spaces of
a Lie group 𝐺. Our input data is a function on 𝑀, say 𝑓 ∈ 𝑋 = B(𝑀) and we are expected to
output a function on 𝑁, say an element of 𝑌 = B(𝑁). Recall that the set of bounded functions
B(𝑀) is a Banach space under the supremum norm (a.k.a. the ∞-norm or the uniform norm)
given by ∥ 𝑓 ∥ ∞ := sup𝑝∈𝑀 | 𝑓 (𝑝)|.
The first part of a discrete artificial neuron was a linear operator 𝐴 : ℝ𝑛 → ℝ𝑚 , given by:
(𝐴𝒙)𝑖 = ∑𝑗 (𝐴)𝑖𝑗 𝑥𝑗 .

Its analogue in the continuous setting is an integral operator 𝐴 : ℝ𝑀 → ℝ𝑁 of the form:

(𝐴 𝑓 )(𝑞) = ∫𝑀 𝑘𝐴 (𝑝, 𝑞) 𝑓 (𝑝) d𝑝,    (3.9)
where the function 𝑘 𝐴 : 𝑀 × 𝑁 → ℝ is called the operator’s kernel.

Remark 3.23 (Measurable functions). Technically for the Lebesgue integral in (3.9) to exist the
integrand needs to be measurable. We will not be dealing with non-measurable functions
and you may assume that when we say function we mean measurable function. If the
concept of measurable functions is new to you, you may ignore the issue.

In this framework, instead of training the matrix 𝐴, we will train the kernel 𝑘 𝐴 . In practice we
cannot train a continuous function so training the kernel will come down to either training a
discretization or training the parameters of some parameterization of 𝑘 𝐴 .
Now we still need to specify how we are going to integrate on a homogeneous space to make
progress.

3.3.1 Integration

Integration on ℝ𝑛 has the desirable property that it is translation invariant: for all 𝒚 ∈ ℝ𝑛 and
integrable functions 𝑓 : ℝ𝑛 → ℝ we have
∫ℝ𝑛 𝑓 (𝒙 − 𝒚) d𝒙 = ∫ℝ𝑛 𝑓 (𝒙) d𝒙.    (3.10)
Ideally we would want integration on a homogeneous space 𝑀 of a Lie group 𝐺 to behave
similarly, namely for all 𝑔 ∈ 𝐺 we would like:
∫𝑀 (𝑔 · 𝑓 )(𝑝) d𝜇𝑀 (𝑝) := ∫𝑀 𝑓 (𝑔⁻¹ · 𝑝) d𝜇𝑀 (𝑝) = ∫𝑀 𝑓 (𝑝) d𝜇𝑀 (𝑝),    (3.11)


for some Radon measure 𝜇𝑀 on 𝑀.

Remark 3.24 (Measures). Recall that measures are the generalization of concepts such as
length, volume, mass, probability etc. A measure assigns a non-negative real number to
subsets of a space in such a way that it behaves similarly to the aforementioned concepts.
A Radon measure on a Hausdorff topological space is a measure that plays well with the
topology of the space (defined for open and closed sets, finite on compact sets, etc.). The
Lebesgue measure is the translation invariant Radon measure on ℝ𝑛 and coincides with
our less general notion of the length/area/volume of subsets of ℝ𝑛 . Integration on ℝ𝑛
such as in (3.10) implicitly uses the Lebesgue measure and so is translation invariant. For
a comprehensive introduction to measure theory see Tao (2011). For the purpose of this
course it is sufficient to think about a measure as measuring the volume of a subset.

This imposes a condition on the measure 𝜇𝑀 , namely: for all measurable subsets 𝑆 of 𝑀 and
𝑔 ∈ 𝐺 we require
𝜇𝑀 (𝑔 · 𝑆) = 𝜇𝑀 (𝑆). (3.12)
In other words we would need a (non-zero) group invariant measure to get the desired integral.
These G-invariant measures, or just invariant measures, do not always exist. In some cases we
can still obtain a covariant measure, which is a measure that satisfies

𝜇(𝑔 · 𝑆) = 𝜒(𝑔) 𝜇(𝑆), (3.13)

where 𝜒 : 𝐺 → ℝ+ is a character of 𝐺.

Definition 3.25 (Character). A multiplicative character or linear character or simply character


of a Lie group 𝐺 is a continuous homomorphism from the group to the multiplicative group
of positive real numbers, i.e. 𝜒 : 𝐺 → ℝ>0 so that:

𝜒(𝑔1 𝑔2 ) = 𝜒(𝑔1 ) 𝜒(𝑔2 ) ∀𝑔1 , 𝑔2 ∈ 𝐺.

The function 𝜒 needs to be a character since by (3.13) we have:

𝜒(𝑔1 𝑔2 ) 𝜇(𝑆) = 𝜇(𝑔1 𝑔2 · 𝑆) = 𝜇(𝑔1 · (𝑔2 · 𝑆)) = 𝜒(𝑔1 ) 𝜇(𝑔2 · 𝑆) = 𝜒(𝑔1 ) 𝜒(𝑔2 ) 𝜇(𝑆),

for all 𝑔1 , 𝑔2 ∈ 𝐺 and all measurable 𝑆 ⊂ 𝑀.


If we integrate with respect to a G-invariant measure we say we have a G-invariant integral,
or just an invariant integral, if the measure is covariant with a character 𝜒 we say we have a
𝜒-covariant integral or just covariant integral.

Definition 3.26 (Covariant integral). Let 𝑀 be a homogeneous space of a Lie group 𝐺, we say the integral ∫𝑀 . . . d𝑝 (using some Radon measure on 𝑀) is covariant with respect to 𝐺 if there exists a character 𝜒𝑀 of 𝐺 so that

∫𝑀 (𝑔 · 𝑓 )(𝑝) d𝑝 = 𝜒𝑀 (𝑔) ∫𝑀 𝑓 (𝑝) d𝑝

for all 𝑔 ∈ 𝐺 and all 𝑓 : 𝑀 → ℝ for which the integral exists. In the special case that 𝜒 𝑀 ≡ 1
we say the integral is invariant.

Remark 3.27 (Abuse of notation). Integration is always with respect to some measure. If we
are integrating with respect to the measure 𝜇 then for the sake of completeness we should


write ∫𝑀 . . . d𝜇(𝑝).
But since we only ever consider one measure per space we integrate over and for the sake of
brevity we abbreviate d𝑝 ≡ d𝜇(𝑝).

If the homogeneous space is 𝐺 itself then an invariant measure is called the (left) Haar measure
on 𝐺 (named after the Hungarian mathematician Alfréd Haar). We can say the Haar measure
since Haar measures are unique up to multiplication with a constant and always exist (see
Federer 2014, Ch. 2.7). Hence when integrating on the group itself we can always have a Haar
measure 𝜇𝐺 so that the following equality holds
∫𝐺 (ℎ · 𝑓 )(𝑔) d𝑔 = ∫𝐺 𝑓 (𝑔) d𝑔    ∀ℎ ∈ 𝐺,    (3.14)

where we abbreviated d𝑔 := d𝜇𝐺 (𝑔). We also call this invariant integral on the group the (left)
Haar integral.
Not all homogeneous spaces admit a covariant integral but those in which we are interested
all do. Going forward we will assume that all homogeneous spaces that we consider admit a
covariant integral and that we can always use the equality from Definition 3.26.

Example 3.28 (𝐺 = SE(2) and 𝑀 = ℝ2 ). In the case we are most interested in, namely 𝐺 = SE(2)
and 𝑀 = ℝ2 , we are fortunate that the Lebesgue measure on ℝ2 is invariant with respect to
𝐺. This is intuitively easy to understand: the area of a subset of ℝ2 is invariant under both
translation and rotation.

Example 3.29 (Haar measure on SE(2)). The Haar measure on SE(2) also conveniently coincides
with the Lebesgue measure on ℝ2 ×[0, 2𝜋) when using the parameterization from Example 3.15.
Indeed, let 𝑔 = (𝒙 1 , 𝜃1 ) and ℎ = (𝒙 2 , 𝜃2 ) then:
∫ℝ2 ∫₀²𝜋 ((𝒙1 , 𝜃1 ) · 𝑓 )(𝒙2 , 𝜃2 ) d𝜃2 d𝒙2 = ∫ℝ2 ∫₀²𝜋 𝑓 ((𝒙1 , 𝜃1 )⁻¹ (𝒙2 , 𝜃2 )) d𝜃2 d𝒙2 .

When we change variables to (𝒙3 , 𝜃3 ) = (𝒙1 , 𝜃1 )⁻¹ (𝒙2 , 𝜃2 ) we obtain the following Jacobian matrix:

𝜕(𝑥2¹ , 𝑥2² , 𝜃2 ) / 𝜕(𝑥3¹ , 𝑥3² , 𝜃3 ) = [ cos 𝜃1  −sin 𝜃1  0 ; sin 𝜃1  cos 𝜃1  0 ; 0  0  1 ],
which has determinant 1. Consequently, the Haar integral (up to a multiplicative constant) on
SE(2) can be calculated as:
∫SE(2) 𝑓 (𝑔) d𝑔 = ∫ℝ2 ∫₀²𝜋 𝑓 (𝒙, 𝜃) d𝜃 d𝒙.    (3.15)
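To make (3.15) tangible, the following rough numerical check (a sketch under my own assumptions, not part of the notes) integrates a rapidly decaying test function on SE(2) before and after acting with a fixed group element; by (3.14) both values should agree up to discretization error:

```python
# Riemann-sum sanity check of the Haar invariance (3.14)/(3.15) on SE(2),
# using the measure dtheta dx of (3.15).
import numpy as np

xs = np.linspace(-6, 6, 241)
thetas = np.linspace(0, 2 * np.pi, 64, endpoint=False)
X, Y, TH = np.meshgrid(xs, xs, thetas, indexing="ij")

def f(x, y, th):
    # an arbitrary smooth, rapidly decaying test function on SE(2)
    return np.exp(-(x**2 + y**2)) * (1 + 0.5 * np.cos(th))

def left_translate(f, yvec, phi):
    """(h . f)(g) = f(h^{-1} g) with h = (yvec, phi)."""
    c, s = np.cos(phi), np.sin(phi)
    # h^{-1} (x, th) = (R(-phi)(x - yvec), th - phi)
    xr = c * (X - yvec[0]) + s * (Y - yvec[1])
    yr = -s * (X - yvec[0]) + c * (Y - yvec[1])
    return f(xr, yr, (TH - phi) % (2 * np.pi))

dx, dth = xs[1] - xs[0], thetas[1] - thetas[0]
I0 = f(X, Y, TH).sum() * dx * dx * dth
I1 = left_translate(f, np.array([1.0, -0.5]), np.pi / 3).sum() * dx * dx * dth
print(I0, I1)   # agree up to discretization error
```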

3.3.2 Equivariant Linear Operators

Of course the objective of this chapter is building equivariant operators, so when is an integral
operator (3.9) equivariant? Equivariance means that

𝐴(𝑔 · 𝑓 ) = 𝑔 · (𝐴 𝑓 )


for all 𝑔 ∈ 𝐺 and 𝑓 ∈ B(𝑀) or equivalently

𝑔 −1 · 𝐴(𝑔 · 𝑓 ) = 𝐴 𝑓 . (3.16)

This extra condition on 𝐴 will naturally impose some restrictions on the kernel of the operator
as the following lemma shows.

Lemma 3.30 (Equivariant linear operators). Let 𝑀 and 𝑁 be homogeneous spaces of a Lie
group 𝐺 so that 𝑀 admits a covariant integral with character 𝜒 𝑀 . Let 𝐴 be an integral
operator (3.9) from C(𝑀) ∩ B(𝑀) to C(𝑁) ∩ B(𝑁) with a kernel 𝑘 𝐴 ∈ C(𝑀 × 𝑁). Then

𝐴(𝑔 · 𝑓 ) = 𝑔 · (𝐴 𝑓 )

for all 𝑔 ∈ 𝐺 and 𝑓 ∈ C(𝑀) ∩ B(𝑀) if and only if

𝜒 𝑀 (𝑔) 𝑘 𝐴 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝑘 𝐴 (𝑝, 𝑞) (3.17)

for all 𝑔 ∈ 𝐺, 𝑝 ∈ 𝑀 and 𝑞 ∈ 𝑁.


Moreover 𝐴 is bounded (and so continuous) in the supremum norm if

sup𝑞∈𝑁 ∫𝑀 |𝑘𝐴 (𝑝, 𝑞)| d𝑝 < ∞.    (3.18)

Proof.
“⇒”
Assuming 𝐴 to be equivariant, take an arbitrary 𝑔 ∈ 𝐺 and 𝑓 ∈ C(𝑀) ∩ B(𝑀) and substitute the
definition of the group representation and 𝐴 in (3.16) to find
∫𝑀 𝑘𝐴 (𝑝, 𝑔 · 𝑞) 𝑓 (𝑔⁻¹ · 𝑝) d𝑝 = ∫𝑀 𝑘𝐴 (𝑝, 𝑞) 𝑓 (𝑝) d𝑝    (3.19)

for all 𝑞 ∈ 𝑁.
Fix 𝑞 ∈ 𝑁 and let 𝐹(𝑝) := 𝑘 𝐴 (𝑔 · 𝑝, 𝑔 · 𝑞) 𝑓 (𝑝) then observe that

(𝑔 · 𝐹)(𝑝) = 𝑘 𝐴 (𝑔 · 𝑔 −1 · 𝑝, 𝑔 · 𝑞) 𝑓 (𝑔 −1 · 𝑝) = 𝑘 𝐴 (𝑝, 𝑔 · 𝑞) 𝑓 (𝑔 −1 · 𝑝),

which is the left integrand from (3.19). Since we have assumed covariant integration we use
Definition 3.26 and have
∫𝑀 (𝑔 · 𝐹)(𝑝) d𝑝 = 𝜒𝑀 (𝑔) ∫𝑀 𝐹(𝑝) d𝑝.

Applying this to (3.19) we find


𝜒𝑀 (𝑔) ∫𝑀 𝑘𝐴 (𝑔 · 𝑝, 𝑔 · 𝑞) 𝑓 (𝑝) d𝑝 = ∫𝑀 𝑘𝐴 (𝑝, 𝑞) 𝑓 (𝑝) d𝑝.    (3.20)

Since 𝑓 was arbitrary and 𝑝 ↦→ 𝑘 𝐴 (𝑝, 𝑞) continuous it follows that

𝜒 𝑀 (𝑔) 𝑘 𝐴 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝑘 𝐴 (𝑝, 𝑞)

for all 𝑝 ∈ 𝑀.


“⇐”
Assuming 𝜒 𝑀 (𝑔) 𝑘 𝐴 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝑘 𝐴 (𝑝, 𝑞) for all 𝑔 ∈ 𝐺, 𝑝 ∈ 𝑀 and 𝑞 ∈ 𝑁 then (3.20) follows
for any choice of 𝑓 ∈ C(𝑀) ∩ B(𝑀), 𝑔 ∈ 𝐺 and 𝑞 ∈ 𝑁. Substituting the covariant integral the
other way yields (3.19), which implies (3.16) since 𝑞 ∈ 𝑁 is arbitrary. The function 𝑓 and group
element 𝑔 were also chosen arbitrarily so (3.16) follows for all 𝑓 ∈ C(𝑀) ∩ B(𝑀) and 𝑔 ∈ 𝐺.
Boundedness of 𝐴 follows from

∥𝐴 𝑓 ∥∞ = sup𝑞∈𝑁 | ∫𝑀 𝑘𝐴 (𝑝, 𝑞) 𝑓 (𝑝) d𝑝 |
        ≤ sup𝑞∈𝑁 ∫𝑀 |𝑘𝐴 (𝑝, 𝑞)| | 𝑓 (𝑝)| d𝑝
        ≤ ∥ 𝑓 ∥∞ · sup𝑞∈𝑁 ∫𝑀 |𝑘𝐴 (𝑝, 𝑞)| d𝑝
        < ∞,    by (3.18). □

The condition on the kernel (3.18) is partially redundant with the symmetry requirement as the
following lemma shows.

Lemma 3.31. In the same setting as Lemma 3.30. If the kernel 𝑘 𝐴 ∈ C(𝑀 × 𝑁) satisfies the
symmetry (3.17) and condition (3.18) then

∥𝑘 𝐴 ( · , 𝑞1 )∥ 𝐿1 (𝑀) = ∥𝑘 𝐴 ( · , 𝑞2 )∥ 𝐿1 (𝑀)

for all 𝑞1 , 𝑞2 ∈ 𝑁.

Proof. Since 𝑁 is a homogeneous space then for all 𝑞1 , 𝑞2 ∈ 𝑁 there exists a 𝑔 ∈ 𝐺 so that
𝑞 1 = 𝑔 · 𝑞2 , then
∫𝑀 |𝑘𝐴 (𝑝, 𝑞1 )| d𝑝 = ∫𝑀 |𝑘𝐴 (𝑝, 𝑔 · 𝑞2 )| d𝑝
                   = ∫𝑀 |𝑘𝐴 (𝑔 · 𝑔⁻¹ · 𝑝, 𝑔 · 𝑞2 )| d𝑝
                   = (1/𝜒𝑀 (𝑔)) ∫𝑀 |𝑘𝐴 (𝑔⁻¹ · 𝑝, 𝑞2 )| d𝑝    (by (3.17))
                   = (𝜒𝑀 (𝑔)/𝜒𝑀 (𝑔)) ∫𝑀 |𝑘𝐴 (𝑝, 𝑞2 )| d𝑝    (by Def. 3.26)
                   = ∫𝑀 |𝑘𝐴 (𝑝, 𝑞2 )| d𝑝. □

The condition on the kernel from Lemma 3.30 can be exploited to express it as a function on
𝑀 instead of 𝑀 × 𝑁. If we fix a 𝑞0 ∈ 𝑁 and for all 𝑞 ∈ 𝑁 we choose a 𝑔𝑞 ∈ 𝐺 𝑞0 ,𝑞 (i.e. so that
𝑔𝑞 · 𝑞0 = 𝑞) then by (3.17) we have

𝑘 𝐴 (𝑝, 𝑞) = 𝜒 𝑀 (𝑔𝑞−1 ) 𝑘 𝐴 (𝑔𝑞−1 · 𝑝, 𝑔𝑞−1 · 𝑞) = 𝜒 𝑀 (𝑔𝑞−1 ) 𝑘 𝐴 (𝑔𝑞−1 · 𝑝, 𝑞0 ),


which fixes the second input of 𝑘 𝐴 . Consequently we could contain all the information of our
kernel in a function that exists only on 𝑀 as 𝜅 𝐴 (𝑝) := 𝑘 𝐴 (𝑝, 𝑞0 ). This reduced kernel 𝜅 𝐴 still
has some restrictions placed on it for the resulting operator to be equivariant, as the following
theorem makes precise.

Theorem 3.32 (Equivariant linear operators). Let 𝑀 and 𝑁 be homogeneous spaces of a Lie
group 𝐺 so that 𝑀 admits a covariant integral with respect to a character 𝜒 𝑀 of 𝐺. Fix a
𝑞0 ∈ 𝑁 and let 𝜅 𝐴 ∈ C(𝑀) ∩ L1 (𝑀) be compatible, i.e. have the property that

∀ℎ ∈ 𝐺 𝑞0 : ℎ · 𝜅 𝐴 = 𝜒 𝑀 (ℎ) 𝜅 𝐴 . (3.21)

Then the operator 𝐴 defined by

(𝐴 𝑓 )(𝑞) := (1/𝜒𝑀 (𝑔𝑞 )) ∫𝑀 (𝑔𝑞 · 𝜅𝐴 )(𝑝) 𝑓 (𝑝) d𝑝,

where for all 𝑞 ∈ 𝑁 we can choose any 𝑔𝑞 so that 𝑔𝑞 · 𝑞0 = 𝑞, is a well defined bounded linear
operator from C(𝑀) ∩ B(𝑀) to C(𝑁) ∩ B(𝑁) that is equivariant with respect to 𝐺.
Conversely every equivariant integral operator with a kernel 𝑘 𝐴 ∈ C(𝑀 × 𝑁) and with
𝑘 𝐴 ( · , 𝑞) ∈ L1 (𝑀) for some 𝑞 ∈ 𝑁 is of this form.

Proof.
“⇒”
Assuming we have a 𝜅 𝐴 ∈ C(𝑀) ∩ L1 (𝑀) that satisfies (3.21). Define 𝑘 𝐴 ∈ 𝐶(𝑀 × 𝑁) by
𝑘𝐴 (𝑝, 𝑞) := (1/𝜒𝑀 (𝑔𝑞 )) (𝑔𝑞 · 𝜅𝐴 )(𝑝).

Then 𝑘𝐴 is well defined since it does not depend on the choice of 𝑔𝑞 for a given 𝑞 ∈ 𝑁. If 𝑔𝑞′ is another group element with 𝑔𝑞′ · 𝑞0 = 𝑞 then there exists an ℎ ∈ 𝐺𝑞0 so that 𝑔𝑞′ = 𝑔𝑞 ℎ; we can check that 𝑘𝐴 is invariant under this choice of ℎ ∈ 𝐺𝑞0 :

(1/𝜒𝑀 (𝑔𝑞 ℎ)) (𝑔𝑞 · ℎ · 𝜅𝐴 )(𝑝) = (𝜒𝑀 (ℎ)/(𝜒𝑀 (𝑔𝑞 ) 𝜒𝑀 (ℎ))) (𝑔𝑞 · 𝜅𝐴 )(𝑝) = (1/𝜒𝑀 (𝑔𝑞 )) (𝑔𝑞 · 𝜅𝐴 )(𝑝).

The kernel 𝑘 𝐴 also satisfies the symmetry requirement (3.17) from Lemma 3.30:
𝜒𝑀 (𝑔) 𝑘𝐴 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝜒𝑀 (𝑔) (1/𝜒𝑀 (𝑔(𝑔·𝑞) )) (𝑔(𝑔·𝑞) · 𝜅𝐴 )(𝑔 · 𝑝)
                        = 𝜒𝑀 (𝑔) (1/𝜒𝑀 (𝑔 𝑔𝑞 )) (𝑔 · 𝑔𝑞 · 𝜅𝐴 )(𝑔 · 𝑝)
                        = (𝜒𝑀 (𝑔)/(𝜒𝑀 (𝑔) 𝜒𝑀 (𝑔𝑞 ))) (𝑔𝑞 · 𝜅𝐴 )(𝑔⁻¹ 𝑔 · 𝑝)
                        = (1/𝜒𝑀 (𝑔𝑞 )) (𝑔𝑞 · 𝜅𝐴 )(𝑝)
                        = 𝑘𝐴 (𝑝, 𝑞),
where in the second equality we used that for 𝑞′ = 𝑔 · 𝑞 we may choose 𝑔𝑞′ = 𝑔 𝑔𝑞 .

By Lemma 3.31 we have

sup𝑞∈𝑁 ∫𝑀 |𝑘𝐴 (𝑝, 𝑞)| d𝑝 = ∥𝑘𝐴 ( · , 𝑞0 )∥𝐿1 (𝑀) = ∥𝜅𝐴 ∥𝐿1 (𝑀) < ∞.


Consequently, 𝐴 also satisfies (3.18) and is a bounded equivariant linear operator per Lemma 3.30.
“⇐”
Assuming we have an equivariant linear operator 𝐴 with kernel 𝑘 𝐴 ∈ C(𝑀 × 𝑁) then we pick a
fixed 𝑞0 ∈ 𝑁 and define 𝜅 𝐴 ∈ C(𝑀)

𝜅 𝐴 (𝑝) := 𝑘 𝐴 (𝑝, 𝑞0 ).

This reduced kernel 𝜅 𝐴 satisfies the compatibility condition (3.21) since if ℎ ∈ 𝐺 𝑞0 then

(ℎ · 𝜅 𝐴 )(𝑝) = 𝑘 𝐴 (ℎ −1 · 𝑝, 𝑞0 )
= 𝑘 𝐴 (ℎ −1 · 𝑝, ℎ −1 · 𝑞0 )
= 𝜒 𝑀 (ℎ) 𝑘 𝐴 (𝑝, 𝑞0 )
= 𝜒 𝑀 (ℎ) 𝜅 𝐴 (𝑝).

Since we required 𝑘 𝐴 ( · , 𝑞) ∈ L1 (𝑀) for some 𝑞 ∈ 𝑁, we apply Lemma 3.31 to find

∥𝜅 𝐴 ∥ 𝐿1 (𝑀) = ∥𝑘 𝐴 ( · , 𝑞0 )∥ 𝐿1 (𝑀) = ∥𝑘 𝐴 ( · , 𝑞)∥ 𝐿1 (𝑀) < ∞.

Theorem 3.32 is at the core of group equivariant CNNs since it allows us to generalize the familiar convolution operation present in CNNs to general linear operators that are equivariant with respect to a group of choice.
Example 3.33 (Group convolution). Let 𝐺 = 𝑀 = 𝑁 be some Lie group. A Lie group always
admits a Haar integral, so we have a trivial character 𝜒 = 1. As reference element we obviously
choose the unit element 𝑒, though any group element would do. Then 𝐺 𝑔 = {𝑒} and 𝐺 𝑒 ,𝑔 = {𝑔}
are both trivial. Hence we have no symmetry condition on the kernel. Any 𝜅 𝐴 ∈ 𝐶(𝐺) ∩ 𝐿1 (𝐺)
defines a linear operator 𝐴 : 𝐶(𝐺) ∩ 𝐵(𝐺) → 𝐶(𝐺) ∩ 𝐵(𝐺) by
(𝐴 𝑓 )(ℎ) = ∫𝐺 (ℎ · 𝜅𝐴 )(𝑔) 𝑓 (𝑔) d𝑔 = ∫𝐺 𝜅𝐴 (ℎ⁻¹ 𝑔) 𝑓 (𝑔) d𝑔.

We also call this operation group cross-correlation and denote it as



(𝜅 ★𝐺 𝑓 )(ℎ) := ∫𝐺 (ℎ · 𝜅)(𝑔) 𝑓 (𝑔) d𝑔.

As in the familiar ℝ𝑛 setting, group cross-correlation is closely related to group convolution, which is defined as

(𝜅ˇ ∗𝐺 𝑓 )(ℎ) := ∫𝐺 𝜅ˇ (𝑔⁻¹ ℎ) 𝑓 (𝑔) d𝑔.
We leave relating the two kernels 𝜅 and 𝜅ˇ as an exercise: when is 𝜅 ★𝐺 𝑓 = 𝜅ˇ ∗𝐺 𝑓 ?
As in the ℝ𝑛 case, when we talk about group convolution we mean both group cross-correlation
and group convolution since they are interchangeable.
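To see the definition in action, here is a toy sketch (my own illustration; the Lie group is replaced by the finite cyclic group ℤ𝑛 and the Haar integral by a sum) of group cross-correlation together with a numerical equivariance check:

```python
# Group cross-correlation on the finite group G = Z_n (a discrete stand-in for a
# Lie group): (kappa *_G f)(h) = sum_g kappa(h^{-1} g) f(g).
import numpy as np

def group_cross_correlation_Zn(kappa, f):
    n = len(f)
    # h^{-1} g corresponds to (g - h) mod n
    return np.array([sum(kappa[(g - h) % n] * f[g] for g in range(n)) for h in range(n)])

kappa = np.array([1.0, 2.0, 0.0, 0.0])
f = np.array([0.0, 1.0, 0.0, 0.0])
# equivariance: acting with h0 (a cyclic shift) on f shifts the output in the same way
h0 = 2
lhs = group_cross_correlation_Zn(kappa, np.roll(f, h0))
rhs = np.roll(group_cross_correlation_Zn(kappa, f), h0)
print(np.allclose(lhs, rhs))   # True
```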
Example 3.34 (Rotation-translation equivariance in ℝ2 ). Let 𝐺 = SE(2) = ℝ2 ⋊ SO(2) and 𝑀 =
𝑁 = ℝ2 . The Lebesgue measure on ℝ2 is rotation-translation invariant so we have a G-invariant
integral on ℝ2 . Choose 𝒚0 = 0 as the reference element then 𝐺 𝒚0 = {(0, 𝑅(𝜃)) ∈ 𝐺 | 𝜃 ∈ [0, 2𝜋)}
is the stabilizer of 𝒚0 . A kernel 𝜅 𝐴 on ℝ2 is then compatible if

(0, 𝑅(𝜃)) · 𝜅 𝐴 = 𝜅 𝐴 ∀𝜃 ∈ [0, 2𝜋),


i.e. 𝜅𝐴 needs to be radially symmetric. Now, we could have figured that out without building up the whole equivariance framework. But the next section will show how we can use the equivariance framework to sidestep the severe restriction imposed on the allowable kernels here.

3.4 Building a Rotation-translation Equivariant CNN

How do we now use this framework to construct a rotation-translation equivariant network for
images? From Example 3.34 we know that we do not have a lot of freedom in training a rotation-
translation equivariant operator from ℝ2 to ℝ2 . We can buy ourselves a lot more freedom by
making the first linear operation in our network one that maps functions on ℝ2 to functions
on SE(2). In the context of image analysis the process of transforming an image to a higher
dimensional representation is called lifting. Therefore we will call the first layer in our network
the lifting layer. Once we are operating on the group we have much more freedom since group
convolution is equivariant. Just like in a conventional CNN we can have a series of group
convolution layers that make up the bulk of our network. Of course we might not want our
final product to live on the group, we might want to go back to ℝ2 , but we also have a recipe
for that. Since going from the 3 dimensional space SE(2) to the 2 dimensional space ℝ2 is akin
to projection we call a layer that does this a projection layer.
This three stage design is illustrated in Figure 3.5 for the retinal vessel segmentation application.
For some more examples of the use of this type of network in medical imaging applications see
Bekkers et al. (2018).

Figure 3.5: A G-CNN for retinal vessel segmentation that is rotation-translation equivariant. Lifting data to a higher dimensional space affords us more freedom in the kernels we train while maintaining equivariance.

3.4.1 Lifting Layer

Let 𝐺 = SE(2) ≡ ℝ2 ⋊ [0, 2𝜋) using the parameterization from Example 3.15. Let 𝑀 = ℝ2 and
𝑁 = 𝐺. Choose 𝑒 = (0, 0) ∈ 𝑁 as the reference element then the stabilizer 𝐺 𝑒 is trivially {𝑒}, so
any kernel on 𝑀 = ℝ2 is compatible. The Lebesgue measure is rotation-translation invariant, so we have an invariant integral.
Let 𝑛0 ∈ {1, 3} be the number of input channels and denote the input functions as 𝑓𝑗(0) : ℝ2 → ℝ for 𝑗 from 1 to 𝑛0 . Let us denote the number of desired feature maps in the first layer as 𝑛1 .
Recall that there are two conventions for convolution layers: single channel and multi channel.


In the multi channel setup we associate with each output channel a number of kernels equal to the number of input channels. Our parameters would be a set {𝜅𝑖𝑗(1) }𝑖𝑗 ⊂ 𝐶(ℝ2 ) ∩ 𝐿1 (ℝ2 ) of kernels and a set {𝑏𝑖(1) }𝑖 ⊂ ℝ of biases for 𝑖 from 1 to 𝑛1 and 𝑗 from 1 to 𝑛0 . The calculation for output channel 𝑖 is then given by

𝑓𝑖(1) (𝒙, 𝜃) = 𝜎( ∑_{𝑗=1}^{𝑛0} ∫ℝ2 ((𝒙, 𝜃) · 𝜅𝑖𝑗(1) )(𝒚) 𝑓𝑗(0) (𝒚) d𝒚 + 𝑏𝑖(1) ),
for some choice of activation function 𝜎.
In the single channel setup we associate a kernel with each input channel and then make linear combinations of the convolved input channels to generate output channels. Our parameters would then consist of a set of kernels {𝜅𝑗(1) }𝑗 ⊂ 𝐶(ℝ2 ) ∩ 𝐿1 (ℝ2 ), a set of weights {𝑎𝑖𝑗(1) }𝑖𝑗 ⊂ ℝ and biases {𝑏𝑖(1) }𝑖 ⊂ ℝ for 𝑖 from 1 to 𝑛1 and 𝑗 from 1 to 𝑛0 . The calculation for output channel 𝑖 is then given by

𝑓𝑖(1) (𝒙, 𝜃) = 𝜎( ∑_{𝑗=1}^{𝑛0} 𝑎𝑖𝑗(1) ∫ℝ2 ((𝒙, 𝜃) · 𝜅𝑗(1) )(𝒚) 𝑓𝑗(0) (𝒚) d𝒚 + 𝑏𝑖(1) ).
In either case the actual lifting happens by translating and rotating the kernel over the image; a particular translation and rotation gives us a particular scalar value at the corresponding location in SE(2).
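A discretized sketch of such a lifting layer (my own minimal illustration for a single input channel and a single kernel, not the notes' implementation) rotates the kernel to a fixed number of orientations and correlates each rotated copy with the image:

```python
# Lifting layer sketch: rotate one 2D kernel to n_theta orientations and correlate
# each copy with the image, giving a feature map sampled on SE(2).
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import correlate2d

def lift(image, kernel, n_theta=8):
    """Return an array of shape (n_theta, H, W) approximating f^{(1)}(x, theta)."""
    out = []
    for k in range(n_theta):
        theta_deg = 360.0 * k / n_theta
        # ((x, theta) . kappa)(y) = kappa(R(-theta)(y - x)): rotating the kernel by
        # theta (with linear interpolation) and sliding it over the image implements
        # the lifting integral on a grid.
        kernel_rot = rotate(kernel, theta_deg, reshape=False, order=1)
        out.append(correlate2d(image, kernel_rot, mode="same", boundary="fill"))
    return np.stack(out)
```

In a full layer one would sum such lifted responses over the input channels, add the bias and apply the activation 𝜎, as in the formulas above.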

Remark 3.35 (Orientation score transform). If you followed the course Differential Geometry for Image Processing (2MMA70) this will seem very familiar to you. Indeed, this is an orientation score transform, except we do not design the wavelet filter (the kernel) ourselves but leave it up to the network to train.

3.4.2 Group Convolution Layer

We already saw in Example 3.33 how to do convolution on the group itself. On the Lie group
we always have an invariant integral (the left Haar integral) and the symmetry requirement on
the kernel is trivial so we have no restrictions to take into account for training the kernel (unlike
for the case ℝ2 → ℝ2 ). In layer ℓ ∈ ℕ with 𝑛ℓ −1 input channels and 𝑛ℓ output channels the
calculation for output channel 𝑖 is given (for the single channel setup) by
𝑓𝑖(ℓ) = 𝜎( ∑_{𝑗=1}^{𝑛ℓ−1} 𝑎𝑖𝑗(ℓ) 𝜅𝑗(ℓ) ★𝐺 𝑓𝑗(ℓ−1) + 𝑏𝑖(ℓ) )

for all 𝑖 ∈ {1, . . . , 𝑛ℓ }, or written out in full:

𝑓𝑖(ℓ) (𝒙, 𝜃) = 𝜎( ∑_{𝑗=1}^{𝑛ℓ−1} 𝑎𝑖𝑗(ℓ) ∫ℝ2 ∫₀²𝜋 ((𝒙, 𝜃) · 𝜅𝑗(ℓ) )(𝒚, 𝛼) 𝑓𝑗(ℓ−1) (𝒚, 𝛼) d𝛼 d𝒚 + 𝑏𝑖(ℓ) ),

for some choice of activation function 𝜎. Here the kernels 𝜅𝑗(ℓ) ∈ 𝐶(𝐺) ∩ 𝐿1 (𝐺), the weights 𝑎𝑖𝑗(ℓ) ∈ ℝ and the biases 𝑏𝑖(ℓ) ∈ ℝ are the trainable parameters. Deducing the formula for the multi channel setup we leave as an exercise.
Group convolution layers can be stacked sequentially just like normal convolution layers in a
CNN to make up the heart of a G-CNN, see Figure 3.5.


3.4.3 Projection

The desired output of our network is likely not a feature map on the group or some other higher
dimensional homogeneous space. So at some point we have to transition away from them.
In traditional CNNs used for classification we saw that at some point we flattened our multi-
dimensional array by ‘forgetting’ the spatial dimensions. Once we have discretized, flattening
is of course also a viable approach for a G-CNN when the goal is classification. However we might not want to throw away our spatial structure: if the goal of the network is to transform its input in some way then we want to go back to our original input space (like in the example in Figure 3.5).
We now apply our equivariance framework again, this time to the case 𝐺 = 𝑀 = SE(2) and 𝑁 = ℝ2 . Choose 0 ∈ 𝑁 as the reference element, then its stabilizer is the subgroup of just the rotations. So constructing an equivariant linear operator from SE(2) to ℝ2 requires a kernel 𝜅 on SE(2) that satisfies

(0, 𝛽) · 𝜅 = 𝜅 ⇔ 𝜅(𝒙, 𝜃) = 𝜅 (𝑅(−𝛽)𝒙, 𝜃 − 𝛽) ∀𝛽, 𝜃 ∈ [0, 2𝜋), 𝒙 ∈ ℝ2 ,

where 𝑅(−𝛽) is the rotation matrix over −𝛽. Consequently we can reduce the trainable (unrestricted) part of the kernel 𝜅 to a 2 dimensional slice:

𝜅(𝒙, 𝜃) = 𝜅 (𝑅(−𝜃)𝒙, 0) .

A kernel like this gives us the desired equivariant linear operator and with a set of them we can
construct a layer in the same fashion as the lifting and group convolution layer.
In practice this type of operator with trainable kernel is not what is used for projection from SE(2) to ℝ2 . Instead the much simpler (and non-trainable) integration over the 𝜃 axis is used: let 𝑓 ∈ B(SE(2)), then the operator 𝑃 : 𝐶(SE(2)) ∩ 𝐵(SE(2)) → 𝐶(ℝ2 ) ∩ 𝐵(ℝ2 ) given by

(𝑃 𝑓 )(𝒙) := ∫₀²𝜋 𝑓 (𝒙, 𝜃) d𝜃,    (3.22)

is a bounded linear operator that is rotation-translation equivariant.

Remark 3.36. Note that the projection operator (3.22) is one of our equivariant linear operators
if we take the kernel to be 𝜅 𝑃 (𝒙, 𝜃) = 𝛿(𝒙) where 𝛿 is the Dirac delta on ℝ2 . The Dirac delta
is not a function in 𝐶(𝐺) ∩ 𝐿1 (𝐺) but we can take a sequence in 𝐶(𝐺) ∩ 𝐿1 (𝐺) that has 𝜅 𝑃 as
the limit in the weak sense, such as a sequence of narrowing Gaussians.

A common alternative to integrating over the orientation axis is taking the maximum over that
axis:
(𝑃max 𝑓 )(𝒙) := max𝜃∈[0,2𝜋) 𝑓 (𝒙, 𝜃).    (3.23)

This is not a linear operator but it is rotation-translation equivariant; we will revisit this projection operator later.
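In the discretized setting, where the SE(2) feature map is an array with a leading orientation axis, both projections reduce to a one-line reduction over that axis. A quick sketch (my own, with assumed array conventions):

```python
# Discretized projection layers: f_se2 has shape (n_theta, H, W).
import numpy as np

def project_integral(f_se2):
    """Discretization of (3.22): Riemann sum over the orientation axis."""
    n_theta = f_se2.shape[0]
    return f_se2.sum(axis=0) * (2 * np.pi / n_theta)

def project_max(f_se2):
    """Discretization of (3.23): maximum over the orientation axis."""
    return f_se2.max(axis=0)
```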
After we have once again obtained feature maps on ℝ2 we can proceed to our desired output
format in the same way as we would do with a classic CNN. Either we forget the spatial
dimensions and transition to a fully connected network for classification applications or we
take a linear combination of the obtained 2D feature maps to generate an output image such as
in Figure 3.5 and Bekkers et al. (2018).


3.4.4 Discretization

To implement our developed G-CNN in practice we will need to switch to a discretized setting.
For our specific case of an SE(2) G-CNN the lifting layer typically uses kernels of size 5 × 5 to
7 × 7. We usually choose the number of discrete orientations to be 8, so an input of ℝ𝐻×𝑊 would be lifted to ℝ8×𝐻×𝑊 ; this may seem low but empirically this is around the sweet spot between performance and memory usage/computation time. The group convolution layers usually
employ 5 × 5 × 5 kernels. In both cases we need to sample the kernel off-grid to be able to rotate
them, for that we almost always use linear interpolation. Example G-CNN implementations are
illustrated in Figure 3.6 for both segmentation and classification, examples for various medical
applications can be found in Bekkers et al. (2018).

Figure 3.6: Example SE(2) G-CNN architectures versus similar traditional CNN architectures; the shapes show the size of the data tensors at each stage in the network. (a) A traditional CNN (left) versus a rotation-translation equivariant G-CNN (right) for segmenting an ℎ × 𝑤 color image; Figure 3.5 shows an application of this type. (b) A traditional CNN (left) versus a rotation-translation equivariant G-CNN (right) for classifying 28 × 28 grayscale images into 10 classes; digit classification falls into this category.

As a general rule of thumb in deep learning we discretize as coarsely as we can get away with.
Increasing the size of the kernels or the number of orientations does increase performance
but nowhere near proportional to the increase in memory and computation time this causes.
Keeping coarse kernels and increasing the depth of the network is a better way of spending a
given memory/time budget.

Remark 3.37 (Linear interpolation alternatives). Higher order polynomial interpolation meth-
ods are highly undesirable on such coarse grids, as the oscillations can make the network
behave erratically. More advanced interpolation techniques have been proposed, see for
example Bekkers (2021), but the added computational complexity can be a drawback. Just
as with discretization, the rule of thumb for interpolation is: as coarsely as you can get away


with.

3.5 Tropical Operators

We previously generalized the idea of a convolution to that of equivariant linear operators in a


Lie group setting, essentially generalizing the domain of our functions. But we can go further
by looking at the codomain and generalizing what we mean by ‘linear’.
Linear and non-linear are generally thought of as an absolute dichotomy. Another way of
thinking about a linear function or operator is as a map that preserves the algebraic properties
of the field of real numbers. But could we not look at maps that preserve the structure of other
algebraic structures rather than the field of real numbers?
We could in principle develop this idea using any algebraic structure, but to be useful in the
context of neural networks we want the underlying set to be numeric and the operations to be
able to be performed by a computer.
Additionally, restricting ourselves to fields would be very limiting; a more congenial algebraic structure is the semiring. Semirings turn up in many places: indeed the first mathematical structure we learn about, the set of natural numbers ℕ, is a semiring. Semirings arise in many
areas of mathematics such as functional analysis, topology, graphs, combinatorics, optimization,
etc. See Golan (1999) for an extensive survey of the applications of semirings.

3.5.1 Semirings

A semiring is an algebraic structure in which we can add and multiply elements, but in which
neither subtraction nor division are necessarily possible.

Definition 3.38 (Semiring). A semiring is a set 𝑅 equipped with two binary operations ⊕ and
⊙, called addition and multiplication, such that

(i) addition and multiplication are associative,


(ii) addition is commutative,
(iii) addition has an identity element 𝟘,
(iv) multiplication has an identity element 𝟙,
(v) multiplication distributes over addition:

𝑎 ⊙ (𝑏 ⊕ 𝑐) = 𝑎 ⊙ 𝑏 ⊕ 𝑎 ⊙ 𝑐,
(𝑎 ⊕ 𝑏) ⊙ 𝑐 = 𝑎 ⊙ 𝑐 ⊕ 𝑏 ⊙ 𝑐,

(vi) multiplication by 𝟘 annihilates:

𝟘 ⊙ 𝑎 = 𝑎 ⊙ 𝟘 = 𝟘.

Additionally if 𝑎 ⊕ 𝑎 = 𝑎 we say the semiring is idempotent and if 𝑎 ⊙ 𝑏 = 𝑏 ⊙ 𝑎 we say the


semiring is commutative or abelian.

Just like with standard multiplication it is conventional to abbreviate 𝑎𝑏 ≡ 𝑎 ⊙ 𝑏 if there can be


no confusion. We also let ⊙ take precedence over ⊕, i.e. 𝑎 ⊙ 𝑏 ⊕ 𝑐 = (𝑎 ⊙ 𝑏) ⊕ 𝑐.


Remark 3.39 (Rig). Some works refer to semirings as rigs, from rings ‘without negatives’,
hence the missing ‘n’.

Example 3.40 (Real linear semiring). The real numbers form a commutative semiring (ℝ , +, · )
under standard addition and multiplication. Naturally all fields and rings are also semirings.
Example 3.41 (Viterbi semiring). The Viterbi semiring is given by the unit interval [0, 1] and
the operations

𝑎 ⊕ 𝑏 := max{𝑎, 𝑏},
𝑎 ⊙ 𝑏 := 𝑎𝑏.
The additive unit is 0 and multiplicative unit is 1. This semiring is both idempotent and
commutative. Note that there exists no element (−𝑎) so that 𝑎 ⊕ (−𝑎) = 𝟘 = 0 and no element
𝑎 −1 so that 𝑎 ⊙ 𝑎 −1 = 𝟙 = 1. So neither subtraction nor division is possible in this semiring.
Example 3.42 (Log semiring). The log semiring is given by the set ℝ ∪ {−∞, +∞} and the
operations
 
𝑎 ⊕ 𝑏 := log 𝑒 𝑎 + 𝑒 𝑏 ,
𝑎 ⊙ 𝑏 := 𝑎 + 𝑏.
The additive unit is −∞ and the multiplicative unit is 0.

The central object of study in linear algebra is the vector space over a field (usually ℝ or ℂ). The
analogue to the vector space in our generalized setting is the semimodule.

Definition 3.43 (Semimodule). Let (𝑅, ⊕, ⊙) be a semiring. A left 𝑅-semimodule or semimod-


ule over 𝑅 is a commutative monoid (𝑀, +) with additive identity 0𝑀 and a map 𝑅 × 𝑀 → 𝑀
denoted by (𝑎, 𝑚) ↦→ 𝑎𝑚, called scalar multiplication, so that for all 𝑎, 𝑏 ∈ 𝑅 and 𝑚, 𝑚 ′ ∈ 𝑀
the following hold:

(i) (𝑎 ⊙ 𝑏)𝑚 = 𝑎(𝑏𝑚),


(ii) 𝑎(𝑚 + 𝑚 ′) = 𝑎𝑚 + 𝑎𝑚 ′,
(iii) (𝑎 ⊕ 𝑏)𝑚 = 𝑎𝑚 + 𝑏𝑚,
(iv) 𝟙𝑚 = 𝑚,
(v) 𝑎0𝑀 = 0𝑀 = 𝟘𝑚.

Naturally every semiring is a semimodule over itself just like every field is a vector space over
itself.
Example 3.44. Let 𝑅 be a semiring and 𝑆 a non-empty set, then 𝑅 𝑆 (the set of all functions
𝑆 → 𝑅) is a semimodule with addition and scalar multiplication defined element-wise: let
𝑓 , 𝑓 ′ ∈ 𝑅 𝑆 and 𝑟 ∈ 𝑅 then

( 𝑓 ⊕ 𝑓 ′)(𝑠) := 𝑓 (𝑠) ⊕ 𝑓 ′(𝑠) and (𝑟 ⊙ 𝑓 )(𝑠) := 𝑟 ⊙ 𝑓 (𝑠)

for all 𝑠 ∈ 𝑆. Here we used ⊕ and ⊙ for the operations in the semimodule as well since
they correspond with the ⊕ and ⊙ operations in the semiring. The additive identity of the
semimodule is the constant function 𝑠 ↦→ 𝟘. This generalizes the prototypical vector space ℝ𝑛
to the semimodule 𝑅 𝑛 using the notational convention 𝑛 ≡ {1, . . . , 𝑛}. For non-finite 𝑆 we get
the generalization of function vector spaces to function semimodules.


Example 3.45. Let 𝑅 = ([0, 1], max, · ) be the Viterbi semiring from Example 3.41 and let 𝑆 be a
manifold. Let 𝑓 , 𝑓 ′ : 𝑆 → [0, 1] and define addition and scalar multiplication as

( 𝑓 + 𝑓 ′)(𝑠) := max { 𝑓 (𝑠), 𝑓 ′(𝑠)} and (𝑟 𝑓 )(𝑠) := 𝑟 𝑓 (𝑠),

then the functions from 𝑆 to [0, 1] form a semimodule.

Now we can formulate a generalization of linear maps in the form of semiring homomorphisms.

Definition 3.46 (Semimodule homomorphism). Let 𝑅 be a semiring and let 𝑋 and 𝑌 be semi-
modules over 𝑅. Then a map 𝐴 : 𝑋 → 𝑌 is an 𝑅-homomorphic map or an 𝑅-homomorphism
if for all 𝑎, 𝑏 ∈ 𝑅 and 𝑓 , 𝑓 ′ ∈ 𝑋:

𝐴(𝑎 𝑓 + 𝑏 𝑓 ′) = 𝑎(𝐴 𝑓 ) + 𝑏(𝐴 𝑓 ′),

where on the left the addition and scalar multiplication happen in 𝑋 and on the right the
addition and scalar multiplication happen in 𝑌.

Just like with the definition of linearity the single condition above is equivalent to the following
two conditions:

𝐴(𝑎 𝑓 ) = 𝑎(𝐴 𝑓 ) and 𝐴( 𝑓 + 𝑓 ′) = 𝐴 𝑓 + 𝐴 𝑓 ′ ∀𝑎 ∈ 𝑅, 𝑓 , 𝑓 ′ ∈ 𝑋.

Under this definition we can understand linear as meaning homomorphic with respect to the real linear semiring (ℝ, +, ·). But now, instead of just considering linear maps we can pick another semiring and consider homomorphisms with respect to this semiring. This allows us to construct equivariant semimodule homomorphic operators in the same fashion as we did with equivariant linear operators in section 3.3.2. We will develop such a class of equivariant operators for a particular choice of semiring.

3.5.2 Tropical Semiring

Definition 3.47 (Tropical semiring). The max tropical semiring or max-plus algebra or simply
tropical semiring is the semiring ℝmax := (ℝ ∪ {−∞}, ⊕, ⊙), with the operations

𝑎 ⊕ 𝑏 := max{𝑎, 𝑏},
𝑎 ⊙ 𝑏 := 𝑎 + 𝑏.

Where the additive identity is 𝟘 := −∞ and the multiplicative identity is 𝟙 := 0.

Since 𝑎 ⊕ 𝑎 = max{𝑎, 𝑎} = 𝑎 and 𝑎 ⊙ 𝑏 = 𝑎 + 𝑏 = 𝑏 + 𝑎 = 𝑏 ⊙ 𝑎 the tropical semiring is both


idempotent and commutative. By definition we set 𝑎 + (−∞) := −∞ so that we satisfy the
annihilation property 𝑎 ⊙ 𝟘 = 𝟘 ⊙ 𝑎 = 𝟘.

Remark 3.48. The tropical semiring can alternatively be defined as (ℝ ∪ {+∞}, min, +), which
is then also called the min tropical semiring or min-plus algebra. But we observe that one is
isomorphic to the other via negation so we make a style choice and go with the max version.

Example 3.49 (ReLU). Recall that the rectified linear unit is defined as 𝑥 ↦→ max{𝑥, 0} for 𝑥 ∈ ℝ. In the tropical setting we can write this as 𝑥 ↦→ 𝑥 ⊕ 0 for 𝑥 ∈ ℝmax , hence the ReLU may not be a
linear or affine function but it is tropically affine. So we can think about a typical ReLU neural
network as really alternating operations from two distinct semirings.
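A small numerical illustration (my own sketch, not part of the notes) of this viewpoint: the ReLU is the tropical sum with the multiplicative unit 0, and "matrix-vector products" in the tropical semiring replace sums by maxima and products by sums.

```python
# Tropical (max-plus) arithmetic in numpy: oplus = max, odot = +.
import numpy as np

def tropical_matvec(A, x):
    """Tropical matrix-vector product: (A . x)_i = max_j (A_ij + x_j)."""
    return np.max(A + x[None, :], axis=1)

x = np.array([-1.5, 2.0, 0.3])
print(np.maximum(x, 0.0))                 # ReLU(x) = x oplus 0, elementwise
# the tropical identity matrix has 0 on the diagonal and -inf elsewhere
I_trop = np.where(np.eye(3, dtype=bool), 0.0, -np.inf)
print(tropical_matvec(I_trop, x))         # reproduces x
```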


We based our construction of equivariant linear operators on integration. When we consider an integral as a limit of a sum of products we can see how we could define a type of integration with respect to another semiring.
Recall that Riemann integration is defined in terms of Darboux sums over ever smaller partitions of the underlying space. Let 𝑀 be a manifold with some Radon measure 𝜇 and (𝑅, ⊕, ⊙) a semiring. Let 𝑃 be a partition of 𝑀 and define 𝜇(𝑃) := sup𝑆∈𝑃 𝜇(𝑆). Then we can generalize a Darboux sum

lim𝜇(𝑃)→0 ∑𝑆∈𝑃 𝑓 (𝑝𝑆 ) · 𝜇(𝑆),

as:

lim𝜇(𝑃)→0 ⨁𝑆∈𝑃 𝑓 (𝑝𝑆 ) ⊙ 𝜇(𝑆),    (3.24)

where 𝑝 𝑆 ∈ 𝑀 is any arbitrary point in the partition element 𝑆 and assuming that for all 𝜀 > 0
we can find a partition 𝑃 so that 𝜇(𝑃) < 𝜀. Filling in the tropical semiring operations we obtain:

lim𝜇(𝑃)→0 max𝑆∈𝑃 ( 𝑓 (𝑝𝑆 ) + 𝜇(𝑆))
  = lim𝜇(𝑃)→0 max𝑆∈𝑃 𝑓 (𝑝𝑆 )    (since 𝜇(𝑆) → 0 as 𝜇(𝑃) → 0)    (3.25)
  = sup𝑝∈𝑀 𝑓 (𝑝)    (assuming 𝑓 is such that the limit exists and is unique).

Lebesgue integration can be generalized similarly. This would yield the same sup𝑝∈𝑀 𝑓 (𝑝) formula but for all measurable functions that are bounded from above rather than only the functions for which the Darboux sum has a unique limit. We will not detail the Lebesgue construction (see Kolokoltsov and Maslov (1997) for that) but proceed with the sup𝑝∈𝑀 𝑓 (𝑝) formula as our definition of the tropical integral.

Remark 3.50. A classic example of a function that is Lebesgue integrable but not Riemann integrable is the indicator function of the rational numbers 𝟙ℚ . The limit (3.25) depends on the choice of points 𝑝𝑆 and so does not exist, but in the Lebesgue sense we simply get sup𝑥∈ℝ 𝟙ℚ (𝑥) = 1 as expected.

Definition 3.51. Let 𝑀 be a manifold, then we define the set of measurable ℝmax -valued
functions that are bounded from above as
BA(𝑀) := BA(𝑀, ℝmax ) := { 𝑓 ∈ ℝmax𝑀 | sup𝑝∈𝑀 𝑓 (𝑝) < ∞ and 𝑓 is measurable } .

Additionally if 𝑓 is not identical to −∞ everywhere we say 𝑓 is proper.

Clearly BA(𝑀) is a tropical semimodule (i.e. a semimodule with respect to the tropical semiring)
under pointwise addition and multiplication:

( 𝑓 ⊕ 𝑓 ′)(𝑝) := max{ 𝑓 (𝑝), 𝑓 ′(𝑝)} and (𝑎 ⊙ 𝑓 )(𝑝) := 𝑎 + 𝑓 (𝑝),


for all 𝑎 ∈ ℝmax , 𝑓 , 𝑓 ′ ∈ BA(𝑀) and all 𝑝 ∈ 𝑀.


The functions in BA(𝑀) are exactly those for which the tropical integral exists.

Definition 3.52 (Tropical integral). Let 𝑀 be a manifold then we call the mapping BA(𝑀) →
ℝmax defined by
𝑓 ↦→ sup𝑝∈𝑀 𝑓 (𝑝)

for 𝑓 ∈ BA(𝑀) the tropical integral over 𝑀.

We can easily check that the tropical integral is a tropical map (i.e. a semimodule homomor-
phism with respect to the tropical semiring) from BA(𝑀) to ℝmax in the same way that the
(linear) integral is a linear map from the integrable functions to ℝ. For all 𝑎, 𝑏 ∈ ℝmax and
𝑓 , 𝑓 ′ ∈ BA(𝑀) we have:
   
sup𝑝∈𝑀 (𝑎 ⊙ 𝑓 ⊕ 𝑏 ⊙ 𝑓 ′)(𝑝) = 𝑎 ⊙ sup𝑝∈𝑀 𝑓 (𝑝) ⊕ 𝑏 ⊙ sup𝑝∈𝑀 𝑓 ′(𝑝) .

Let 𝑀 be a homogeneous space of the Lie group 𝐺, let ‘·’ denote both the action of 𝐺 on 𝑀 and
the corresponding representation on functions on 𝑀. Since this representation does not affect
the codomain of the functions it does not change the supremum, hence the tropical integral is
always invariant:
sup𝑝∈𝑀 (𝑔 · 𝑓 )(𝑝) = sup𝑝∈𝑀 𝑓 (𝑝).

So now we have developed an alternative notion of linearity and an alternative to the (linear)
invariant integral. We will use these elements to construct a new type of equivariant operator
in the same manner as we did before with linear operators.

3.5.3 Equivariant Tropical Operators

The starting point for our equivariant linear operators was the integral operator from (3.9). We
obtain the tropical analogue simply by replacing the linear semiring operations:


(𝐴 𝑓 )(𝑞) := ∫𝑀 𝑘𝐴 (𝑝, 𝑞) · 𝑓 (𝑝) d𝑝,

with the tropical semiring operations:

(𝑇 𝑓 )(𝑞) := sup𝑝∈𝑀 𝑘𝑇 (𝑝, 𝑞) + 𝑓 (𝑝).    (3.26)

We can see that if 𝑘𝑇 ∈ BA(𝑀 × 𝑁) and 𝑓 ∈ BA(𝑀) then 𝑇 𝑓 ∈ BA(𝑁). It is straightforward to verify that 𝑇 is a tropical operator from BA(𝑀) to BA(𝑁).
Now we can proceed in the exact same way as in section 3.3.2 to find out when 𝑇 is equivariant.


Lemma 3.53 (Equivariant tropical integral operators). Let 𝑀 and 𝑁 be homogeneous spaces
of a Lie group 𝐺. Let 𝑇 be a tropical integral operator (3.26) from BA(𝑀) to BA(𝑁) with a
kernel 𝑘𝑇 ∈ BA(𝑀 × 𝑁). Then
𝑇(𝑔 · 𝑓 ) = 𝑔 · (𝑇 𝑓 )
for all 𝑔 ∈ 𝐺 and 𝑓 ∈ BA(𝑀) if and only if

𝑘𝑇 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝑘𝑇 (𝑝, 𝑞) (3.27)

for all 𝑔 ∈ 𝐺, 𝑝 ∈ 𝑀 and 𝑞 ∈ 𝑁.

Proof. First we show that 𝑇 𝑓 ∈ BA(𝑁):


sup𝑞∈𝑁 (𝑇 𝑓 )(𝑞) = sup𝑞∈𝑁 sup𝑝∈𝑀 (𝑘𝑇 (𝑝, 𝑞) + 𝑓 (𝑝))
               ≤ (sup(𝑝,𝑞)∈𝑀×𝑁 𝑘𝑇 (𝑝, 𝑞)) + (sup𝑝∈𝑀 𝑓 (𝑝))
               < ∞,

since both 𝑘𝑇 and 𝑓 are bounded from above.


“⇒”
Assuming 𝑇 to be equivariant, take an arbitrary 𝑔 ∈ 𝐺 and 𝑓 ∈ BA(𝑀) and substitute the
definition of the group representation and 𝑇 in

𝑔 −1 · 𝑇(𝑔 · 𝑓 ) = 𝑇 𝑓

to find
sup𝑝∈𝑀 𝑘𝑇 (𝑝, 𝑔 · 𝑞) + 𝑓 (𝑔⁻¹ · 𝑝) = sup𝑝∈𝑀 𝑘𝑇 (𝑝, 𝑞) + 𝑓 (𝑝)    (3.28)

for all 𝑞 ∈ 𝑁. Since the tropical integral is invariant under domain transformation we can
substitute 𝑝 with 𝑔 · 𝑝 on the left without changing the result:

sup 𝑘𝑇 (𝑔 · 𝑝, 𝑔 · 𝑞) + 𝑓 (𝑝) = sup 𝑘𝑇 (𝑝, 𝑞) + 𝑓 (𝑝)


𝑝∈𝑀 𝑝∈𝑀

for all 𝑞 ∈ 𝑁. The equality only holds for all 𝑓 if

𝑘𝑇 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝑘𝑇 (𝑝, 𝑞) (3.29)

for all 𝑝 ∈ 𝑀, 𝑞 ∈ 𝑁 and 𝑔 ∈ 𝐺.


“⇐”
Assuming 𝑘𝑇 (𝑔 · 𝑝, 𝑔 · 𝑞) = 𝑘𝑇 (𝑝, 𝑞) for all 𝑔 ∈ 𝐺, 𝑝 ∈ 𝑀 and 𝑞 ∈ 𝑁, the equality sup𝑝∈𝑀 𝑘𝑇 (𝑔 · 𝑝, 𝑔 · 𝑞) + 𝑓 (𝑝) = sup𝑝∈𝑀 𝑘𝑇 (𝑝, 𝑞) + 𝑓 (𝑝) follows for any choice of 𝑓 ∈ BA(𝑀), 𝑔 ∈ 𝐺 and 𝑞 ∈ 𝑁. Substituting the invariant tropical integral the other way yields (3.28), which implies the equivariance

𝑔 −1 · 𝑇(𝑔 · 𝑓 ) = 𝑇 𝑓

since 𝑞 ∈ 𝑁 is arbitrary. The function 𝑓 and group element 𝑔 were also chosen arbitrarily so
equivariance follows for all 𝑓 ∈ BA(𝑀) and 𝑔 ∈ 𝐺. □


Now we can use the same strategy as we did in the linear case and use the symmetry (3.27) to fix the second argument of the kernel 𝑘𝑇 . Pick a 𝑞0 ∈ 𝑁, then for all 𝑞 ∈ 𝑁 there exists at least one 𝑔𝑞 ∈ 𝐺𝑞0 ,𝑞 , consequently

𝑘𝑇 (𝑝, 𝑞) = 𝑘𝑇 (𝑝, 𝑔𝑞 · 𝑞0 ) = 𝑘𝑇 (𝑔𝑞 · 𝑔𝑞−1 · 𝑝, 𝑔𝑞 · 𝑞0 ) = 𝑘𝑇 (𝑔𝑞−1 · 𝑝, 𝑞0 )

for all 𝑝 ∈ 𝑀 and 𝑞 ∈ 𝑁. From this we can define a reduced kernel 𝜅𝑇 ∈ BA(𝑀) as

𝜅𝑇 (𝑝) := 𝑘𝑇 (𝑝, 𝑞0 )

that still allows us to recover the full kernel by

𝑘𝑇 (𝑝, 𝑞) = 𝑘𝑇 (𝑔𝑞−1 · 𝑝, 𝑞0 ) = 𝜅𝑇 (𝑔𝑞−1 · 𝑝).

This whole construction in the end gives us the same type of operator as in Theorem 3.32 but
with the linear operations switched for tropical operations:

(𝐴 𝑓 )(𝑞) := ∫𝑀 (𝑔𝑞 · 𝜅𝐴 )(𝑝) · 𝑓 (𝑝) d𝑝,

(𝑇 𝑓 )(𝑞) := sup𝑝∈𝑀 (𝑔𝑞 · 𝜅𝑇 )(𝑝) + 𝑓 (𝑝).

We summarize this result in the following theorem.

Theorem 3.54 (Equivariant tropical operators). Let 𝑀 and 𝑁 be homogeneous spaces of a Lie
group 𝐺. Fix a 𝑞0 ∈ 𝑁 and let 𝜅𝑇 ∈ BA(𝑀) be compatible, i.e.

∀ℎ ∈ 𝐺𝑞0 : ℎ · 𝜅𝑇 = 𝜅𝑇 .

Then the operator 𝑇 defined by

(𝑇 𝑓 )(𝑞) := sup𝑝∈𝑀 (𝑔𝑞 · 𝜅𝑇 )(𝑝) + 𝑓 (𝑝)

where for all 𝑞 ∈ 𝑁 we choose any 𝑔𝑞 so that 𝑔𝑞 · 𝑞0 = 𝑞, is a well defined tropical operator
from BA(𝑀) to BA(𝑁) that is equivariant with respect to 𝐺.
Conversely every tropical integral operator (3.26) with a kernel in BA(𝑀×𝑁) that is equivariant
is of this form.

Proof. (sketch)

“⇒”
Assume we have a 𝜅𝑇 ∈ BA(𝑀) that is compatible. Define

𝑘𝑇 (𝑝, 𝑞) := (𝑔𝑞 · 𝜅𝑇 )(𝑝)

we can check that 𝑘𝑇 is well defined and satisfies the requirements of Lemma 3.53 because of
the compatibility condition on 𝜅𝑇 .


“⇐”
Assume we have a tropical integral operator 𝑇, then by Lemma 3.53 we have a kernel 𝑘𝑇 ∈
BA(𝑀 × 𝑁) that satisfies (3.27). Fix a 𝑞0 ∈ 𝑁 and define

𝜅𝑇 (𝑝) := 𝑘𝑇 (𝑝, 𝑞0 ),

then we can check that 𝜅𝑇 satisfies the compatibility condition. □

Example 3.55 (Max pooling). Let 𝐺 = (ℝ𝑛 , +) be the translation group and 𝑀 = 𝑁 = ℝ𝑛 . Pick a subset 𝑆 ⊂ ℝ𝑛 and define

𝜅𝑇 (𝑝) := 0 if 𝑝 ∈ 𝑆, and 𝜅𝑇 (𝑝) := −∞ elsewhere.

Then the corresponding operator 𝑇 equals

(𝑇 𝑓 )(𝒚) = sup𝒙∈ℝ𝑛 (𝒚 · 𝜅𝑇 )(𝒙) + 𝑓 (𝒙)
         = sup𝒙∈ℝ𝑛 𝜅𝑇 (𝒙 − 𝒚) + 𝑓 (𝒙)
         = sup𝒙∈𝒚+𝑆 𝑓 (𝒙),

so at each point 𝒚 the output equals the supremum of the input over the subset 𝒚 + 𝑆. Think of this as a continuous version of the shift-invariant max pooling operation usually seen in CNNs. In this light we can see that max pooling is a tropical operator.

Example 3.56 (Tropical convolution). Let 𝐺 = 𝑀 = 𝑁 be some Lie group, then we call the
equivariant tropical operation tropical convolution or morphological convolution, which we
denote as
(𝜅 □𝐺 𝑓 )(ℎ) := sup𝑔∈𝐺 (ℎ · 𝜅)(𝑔) + 𝑓 (𝑔).

The name morphological convolution comes from the field of grayscale morphology where
these types of operators have previously been used for image processing applications, see
Wikipedia (2021a). An example use of these types of operations in neural networks can be
found in Smets et al. (2021).
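As with group cross-correlation, the easiest way to play with this is on a finite group. A toy sketch (my own, with 𝐺 replaced by the cyclic group ℤ𝑛 and the supremum by a maximum):

```python
# Morphological (tropical) convolution on G = Z_n:
# (kappa box_G f)(h) = max_g kappa(h^{-1} g) + f(g).
import numpy as np

def morphological_convolution_Zn(kappa, f):
    n = len(f)
    return np.array([max(kappa[(g - h) % n] + f[g] for g in range(n)) for h in range(n)])

# a 'flat' kernel (0 on a window around the identity, -inf elsewhere) reproduces the
# max pooling of Example 3.55: the output is a sliding (circular) maximum of f.
kappa = np.full(8, -np.inf)
kappa[[0, 1, 7]] = 0.0
f = np.array([0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0])
print(morphological_convolution_Zn(kappa, f))   # [3. 3. 3. 1. 2. 2. 2. 0.]
```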

Example 3.57 (Pointwise ReLU). Let 𝐺 = 𝑀 = 𝑁 be some Lie group and let 𝑓 ∈ BA(𝑀). Define a kernel

𝜅𝑇 (𝑔) := 0 if 𝑔 = 𝑒 , and 𝜅𝑇 (𝑔) := − supℎ∈𝐺 𝑓 (ℎ) otherwise.

Then applying the corresponding operator to 𝑓 gives

(𝑇 𝑓 )(ℎ) = sup𝑔∈𝐺 (ℎ · 𝜅𝑇 )(𝑔) + 𝑓 (𝑔)
         = max { 𝑓 (ℎ), sup𝑔∈𝐺 ( − sup𝑧∈𝐺 𝑓 (𝑧) + 𝑓 (𝑔) ) }
         = max { 𝑓 (ℎ), 0 }
         = ReLU( 𝑓 (ℎ)).


We have not said anything about the boundedness of 𝑇 in Theorem 3.54 since it is not clear
what metric or norm we should consider on the function space BA. In the following lemma we
detail some reasonable conditions under which 𝑇 will be bounded in the supremum norm.

Lemma 3.58. In the setting of Theorem 3.54: if 𝑓 ∈ B(𝑀) and sup𝑝∈𝑀 𝜅𝑇 (𝑝) = 0 then 𝑇 𝑓 ∈ B(𝑁)
and 𝑇 is a bounded operator from B(𝑀) to B(𝑁) in the supremum norm.

Proof.

∥𝑇 𝑓 ∥∞ = sup𝑞∈𝑁 sup𝑝∈𝑀 (𝑔𝑞 · 𝜅𝑇 )(𝑝) + 𝑓 (𝑝)
        ≤ sup𝑞∈𝑁 ( sup𝑝∈𝑀 (𝑔𝑞 · 𝜅𝑇 )(𝑝) + sup𝑝∈𝑀 𝑓 (𝑝) )
        = sup𝑞∈𝑁 ( sup𝑝∈𝑀 𝜅𝑇 (𝑝) + sup𝑝∈𝑀 𝑓 (𝑝) )
        = sup𝑝∈𝑀 𝜅𝑇 (𝑝) + sup𝑝∈𝑀 𝑓 (𝑝)
        = sup𝑝∈𝑀 𝑓 (𝑝)
        ≤ sup𝑝∈𝑀 | 𝑓 (𝑝)|
        = ∥ 𝑓 ∥∞ . □

We have seen how we can use semirings, and the tropical semiring in particular, to develop classes of equivariant operators other than linear ones. From the examples we can conclude that many operations currently in use in neural networks (particularly the ReLU activation and max pooling) are special cases of tropical or tropically affine operators and fit well inside the equivariant semimodule homomorphism framework.

Bibliography

Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark
Siskind (2018). Automatic differentiation in machine learning: a survey. arXiv: 1502.05767
[cs.SC] (page 38).

Bekkers, Erik J (2021). B-Spline CNNs on Lie Groups. arXiv: 1909.12057 [cs.LG] (page 68).

Bekkers, Erik J, Maxime W Lafarge, Mitko Veta, Koen AJ Eppenhof, Josien PW Pluim, and
Remco Duits (2018). “Roto-translation covariant convolutional networks for medical image
analysis”. In: International Conference on Medical Image Computing and Computer-
Assisted Intervention. Springer, pp. 440–448. url: https://arxiv.org/abs/1804.03393
(pages 65, 67, 68).

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar-
wal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc-
Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei (2020). Language Models are
Few-Shot Learners. arXiv: 2005.14165 [cs.CL] (page 25).

Cohen, Taco, Mario Geiger, and Maurice Weiler (2020). A General Theory of Equivariant
CNNs on Homogeneous Spaces. arXiv: 1811.02017 [cs.LG] (page 49).

Cohen, Taco and Max Welling (2016). “Group Equivariant Convolutional Networks”. In: In-
ternational Conference on Machine Learning. PMLR, pp. 2990–2999 (page 49).

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009). “Imagenet: A
large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision and
pattern recognition. Ieee, pp. 248–255 (page 5).

Duchi, John, Elad Hazan, and Yoram Singer (2011). “Adaptive subgradient methods for online
learning and stochastic optimization.” In: Journal of Machine Learning Research 12.7
(page 44).

Dumoulin, Vincent and Francesco Visin (2018). A guide to convolution arithmetic for deep
learning. arXiv: 1603.07285 [stat.ML] (page 30).

Federer, Herbert (2014). Geometric Measure Theory. Springer. isbn: 978-3-540-60656-7. doi:
10.1007/978-3-642-62010-2 (page 60).


Fukushima, Kunihiko (1987). “Neural network model for selective attention in visual pattern
recognition and associative recall”. In: Applied Optics 26.23, pp. 4985–4992 (page 5).

Glorot, Xavier and Yoshua Bengio (May 2010). “Understanding the difficulty of training deep
feedforward neural networks”. In: Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics. Ed. by Yee Whye Teh and Mike Titterington. Vol. 9.
Proceedings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR,
pp. 249–256. url: https://proceedings.mlr.press/v9/glorot10a.html (page 28).

Golan, Jonathan S. (1999). Semirings and their Applications. Dordrecht: Springer Netherlands.
isbn: 978-90-481-5252-0. doi: 10.1007/978-94-015-9333-5. url: http://link.springer.com/10.1007/978-94-015-9333-5 (page 69).

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015). Delving Deep into Rectifiers:
Surpassing Human-Level Performance on ImageNet Classification. arXiv: 1502 . 01852
[cs.CV] (page 29).

Ivakhnenko, Aleksei Grigorevich and Valentin Grigorevich Lapa (1966). Cybernetic predicting
devices. Tech. rep. PURDUE UNIV LAFAYETTE IND SCHOOL OF ELECTRICAL ENGI-
NEERING (page 5).

Kingma, Diederik P. and Jimmy Ba (2017). Adam: A Method for Stochastic Optimization. arXiv:
1412.6980 [cs.LG] (pages 45, 46).

Kolokoltsov, Vassili N. and Victor P. Maslov (1997). Idempotent Analysis and Its Applications.
Dordrecht: Springer Netherlands. isbn: 978-90-481-4834-9. doi: 10.1007/978-94-015-8901-7. url: http://link.springer.com/10.1007/978-94-015-8901-7 (page 72).

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012). “ImageNet Classification
with Deep Convolutional Neural Networks”. In: Advances in Neural Information Pro-
cessing Systems. Ed. by F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger. Vol. 25.
Curran Associates, Inc. url: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (page 5).

LeCun, Yann, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne
Hubbard, and Lawrence D Jackel (1989). “Backpropagation applied to handwritten zip code
recognition”. In: Neural computation 1.4, pp. 541–551 (page 5).

LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998). “Gradient-based learn-
ing applied to document recognition”. In: Proceedings of the IEEE 86.11, pp. 2278–2324
(page 37).

Lee, John M. (2010). Introduction to Topological Manifolds. 2nd ed. Graduate Texts in Mathe-
matics. Springer. isbn: 978-1-4419-7939-1. doi: 10.1007/978-1-4419-7940-7 (page 49).

Lee, John M. (2013). Introduction to Smooth Manifolds. 2nd ed. Graduate Texts in Mathematics.
Springer. isbn: 978-1-4419-9981-8. doi: 10.1007/978-1-4419-9982-5_1 (pages 49, 50, 54).


McCulloch, Warren S and Walter Pitts (1943). “A logical calculus of the ideas immanent in
nervous activity”. In: The bulletin of mathematical biophysics 5.4, pp. 115–133 (page 5).

Oh, Kyoung-Su and Keechul Jung (2004). “GPU implementation of neural networks”. In: Pattern
Recognition 37.6, pp. 1311–1314 (page 5).

Smets, Bart, Jim Portegies, Erik Bekkers, and Remco Duits (2021). PDE-based Group Equivari-
ant Convolutional Neural Networks. arXiv: 2001.09046 [cs.LG] (page 76).

Tao, Terence (2011). An Introduction to Measure Theory. Vol. 126. Graduate Studies in Math-
ematics. American Mathematical Society. isbn: 978-1-4704-6640-4. doi: 10.1090/gsm/126
(page 59).

Tieleman, Tijmen, Geoffrey Hinton, et al. (2012). “Lecture 6.5-rmsprop: Divide the gradient by
a running average of its recent magnitude”. In: Coursera: Neural networks for machine
learning 4.2, pp. 26–31 (page 45).

Wikipedia (Sept. 26, 2021a). Mathematical morphology. Page Version ID: 1046500059. url:
https://en.wikipedia.org/w/index.php?title=Mathematical_morphology&oldid=1046500059 (page 76).

Wikipedia (Dec. 16, 2021b). Types of artificial neural networks. Page Version ID: 1053596085.
url: https://en.wikipedia.org/w/index.php?title=Types_of_artificial_neural_networks&oldid=1053596085 (pages 20, 21).
