Neural Networks Lecture Notes Overview

The lecture notes cover the fundamentals of neural networks, focusing on the structure and function of neurons, particularly through the Hodgkin-Huxley model. Key topics include neuron membrane potential, action potentials, and the dynamics of ion channels. The document also discusses various aspects of neural learning, optimization, and advanced neural network architectures.


AMATH 449 / CS 479 / CS 679

Neural Networks
Lecture Notes

© Jeff Orchard, Mohamed Hibat-Allah


University of Waterloo

January 4, 2026
Contents

1 Networks of Neurons
  1.1 Neurons
  1.2 Simpler Neuron Models
  1.3 Synapses

2 Neural Learning
  2.1 Learning
  2.2 Universal Approximation Theorem
  2.3 Loss Functions
  2.4 Gradient Descent Learning
  2.5 Error Backpropagation

3 Automatic Differentiation
  3.1 Automatic Differentiation
  3.2 Neural Networks with Auto-Diff

4 Generalizability
  4.1 Overfitting
  4.2 Combatting Overfitting

5 Optimization Considerations
  5.1 Enhancing Optimization
  5.2 Deep Neural Networks

6 Vision
  6.1 Your Visual System
  6.2 Convolutional Neural Networks (CNNs)

7 Unsupervised Learning
  7.1 Hopfield Networks
  7.2 Restricted Boltzmann Machines (RBMs)
  7.3 Autoencoders
  7.4 Vector Embeddings
  7.5 Variational Autoencoders

8 Recurrent Neural Networks
  8.1 Recurrent Neural Networks (RNNs)
  8.2 Gated Recurrent Units (GRUs)

9 Adversarial Attacks
  9.1 Adversarial Attacks
  9.2 Adversarial Defence

10 Neural Engineering
  10.1 Population Coding
  10.2 Transformations
  10.3 Dynamics

11 Additional Topics
  11.1 Biological Backprop
  11.2 Generative Adversarial Networks
  11.3 Ethics in Neural Networks and AI
  11.4 Attention mechanism
  11.5 Transformers
Chapter 1

Networks of Neurons


1.1 Neurons
Goal: To see the basics of how a neuron works, in the form of the Hodgkin-Huxley neuron
model.

The human brain is estimated to contain approximately 86 billion neurons. Each neuron can form thousands of connections with other neurons, giving rise to extraordinary human intelligence.

To understand how a neuron works, we will examine the Hodgkin-Huxley neuron model more closely. A neuron is a special cell that can send and receive signals from other neurons. It can be quite long, sending its signal over a long distance: up to about a metre in humans! But most are much shorter.

Figure 1.1: Diagram of a neuron with key components labeled: dendrites, soma (body), and
axon.

Neuron Membrane Potential


Ions are molecules or atoms whose number of electrons (-) does not match the number
of protons (+), resulting in a net charge. Many ions float around in your cells. The cell’s
membrane, a lipid bi-layer, stops most ions from crossing. However, ion channels embedded
in the cell membrane can allow ions to pass.

Figure 1.2: Illustration of ion channels and sodium-potassium pump in the cell membrane
of the axon.

Sodium-Potassium Pump
The sodium-potassium pump moves 3 Na+ ions out of the cell in exchange for 2 K+ ions moved in. This process causes a higher concentration of Na+ outside the cell, and a higher concentration of K+ inside the cell. It also creates a net positive charge outside, and thus a net negative charge inside the cell. This difference in charge across the membrane induces a voltage difference, called the membrane potential V, between the inside and the outside of the neuron; it is usually around −70 mV. In this regard, the sodium-potassium pump maintains the value of the resting membrane potential of a neuron. Note, however, that the pump does not drive the rapid changes in membrane potential during a spike. We will see in the Hodgkin-Huxley model that those voltage changes depend on several factors, including the sodium ion (Na+) channels and the potassium ion (K+) channels.

Action Potential
Neurons have a peculiar behavior: they can produce a spike of electrical activity called an action potential. This electrical burst travels along the neuron's axon to its synapses, where it passes signals to other neurons. The water-bucket dump experiment is a great analogy for this phenomenon [Link].

Hodgkin-Huxley Model
Alan Lloyd Hodgkin and Andrew Fielding Huxley received the Nobel Prize in Physiology or
Medicine in 1963 for their model of an action potential (spike). Their model is based on the
nonlinear interaction between membrane potential (voltage) and the opening and closing of
Na+ and K+ ion channels.

Voltage-Gated Ion Channels


Both Na+ and K+ ion channels are voltage-dependent, so their opening and closing changes
with the membrane potential.

The fraction of K+ channels that are open is proportional to n(t)^4, where

\[
\frac{dn}{dt} = \frac{1}{\tau_n(V)} \bigl( n_\infty(V) - n \bigr).
\]

Here n is a variable associated with the probability that an activation subunit of a potassium channel is open. Since there are 4 subunits in the K+ channel, the probability of all subunits being open is n(t)^4. As n increases, more potassium channels open, allowing potassium ions to flow out of the cell.

The fraction of Na+ ion channels that are open is proportional to m(t)^3 h(t), where

\[
\frac{dm}{dt} = \frac{1}{\tau_m(V)} \bigl( m_\infty(V) - m \bigr), \qquad
\frac{dh}{dt} = \frac{1}{\tau_h(V)} \bigl( h_\infty(V) - h \bigr).
\]

Note that:

• m: a variable associated with the probability that an activation subunit of a sodium channel is open. There are 3 subunits of this type in the Na+ channel. As m increases, more sodium channels open, allowing sodium ions to flow into the cell.

• h: a variable associated with the probability that the inactivation subunit of a sodium channel is open (roughly speaking, that the channel is not inactivated). There is only one subunit of this type in the Na+ channel. As the membrane potential increases, the h-gate closes (i.e., the probability h gets closer to zero), leading to the inactivation (closing) of sodium channels and a reduction in the sodium current.

In the following plot, you can observe the typical behavior of the gating parameters n∞(V), m∞(V), and h∞(V) as functions of the membrane potential V.

Dynamics of Membrane Potential:

These two channel types allow ions to flow into and out of the cell, inducing a current that affects the membrane potential V:

\[
C \frac{dV}{dt} = J_{\text{in}} - g_L (V - V_L) - g_{\text{Na}}\, m^3 h\, (V - V_{\text{Na}}) - g_K\, n^4 (V - V_K)
\]

• C: membrane (lipid bilayer) capacitance.
• I = C dV/dt: net current into the cell.
• J_in: input current coming from other neurons' dendrites.
• g_L: leak conductance, reflecting that the cell membrane is not perfectly impermeable to ions.
• g_Na: maximum Na+ conductance.
• g_K: maximum K+ conductance.
• The equilibrium voltages are estimated as follows: V_L = −55 mV, V_Na = +50 mV, V_K = −77 mV.

(See Dayan, Peter; Abbott, L. F., Theoretical Neuroscience, page 162, for more details on the + and − signs.)
The system of the four previous differential equations (DEs) governs the dynamics of the membrane potential. It is challenging to solve analytically; however, we can solve it numerically, as will be illustrated during the lecture. Below is an illustration from the demo, where we can qualitatively observe the spiking behavior of the neuron in the HH model simulation.

Figure 1.3: Simulation of the HH model on a short time window.

Figure 1.4: Typical behavior of the action potential which travels through the axon. Credit:
Wikipedia.
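The numerical solution can be sketched with forward Euler. One caveat: the notes do not list the conductance values or the gating rate functions, so the sketch below fills those in with the conventional parameter set (C = 1 µF/cm², gNa = 120, gK = 36, gL = 0.3 mS/cm²) and the standard α/β rate functions; only the reversal potentials come from these notes, so treat the rest as illustrative assumptions.

```python
import numpy as np

def hh_simulate(J_in=10.0, T=50.0, dt=0.01):
    """Forward-Euler integration of the Hodgkin-Huxley equations.

    Units: mV, ms, uA/cm^2, mS/cm^2, uF/cm^2.
    """
    C, gNa, gK, gL = 1.0, 120.0, 36.0, 0.3
    VNa, VK, VL = 50.0, -77.0, -55.0

    # Conventional rate functions; x_inf = a/(a+b), tau_x = 1/(a+b).
    def an(V): return 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    def bn(V): return 0.125 * np.exp(-(V + 65.0) / 80.0)
    def am(V): return 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    def bm(V): return 4.0 * np.exp(-(V + 65.0) / 18.0)
    def ah(V): return 0.07 * np.exp(-(V + 65.0) / 20.0)
    def bh(V): return 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))

    V, n, m, h = -65.0, 0.32, 0.05, 0.60       # approximate resting state
    Vs = np.empty(int(T / dt))
    for i in range(len(Vs)):
        # Gating: dx/dt = a(V)(1 - x) - b(V)x, equivalent to (x_inf - x)/tau_x
        n += dt * (an(V) * (1.0 - n) - bn(V) * n)
        m += dt * (am(V) * (1.0 - m) - bm(V) * m)
        h += dt * (ah(V) * (1.0 - h) - bh(V) * h)
        # Membrane: C dV/dt = J_in - gL(V-VL) - gNa m^3 h (V-VNa) - gK n^4 (V-VK)
        V += dt / C * (J_in - gL * (V - VL)
                       - gNa * m**3 * h * (V - VNa)
                       - gK * n**4 * (V - VK))
        Vs[i] = V
    return Vs

Vs = hh_simulate()    # 50 ms at dt = 0.01 ms, with a constant input current
```

With a sufficiently positive input current, the recorded voltage trace shows repetitive spiking like Fig. 1.3.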

Intuitively, once the input current coming from other neurons pushes the voltage past a threshold of around −55 mV, the Na+ channels begin to activate, as illustrated in Fig. 1.3. The Na+ channel activation leads to the so-called depolarization phase: Na+ ions are pushed into the cell, which causes a sudden increase in the membrane potential V.

The Na+ channels become effectively inactivated around the maximum voltage, as illustrated in Fig. 1.3.

The repolarization phase is catalyzed by the K+ channel activation, as illustrated by the increase in the gating variable n in Fig. 1.3. These channels send K+ ions out of the cell. This mechanism leads to a sudden drop in voltage to a level below the threshold voltage. After a refractory period with characteristic timescale τref, the voltage returns to around V = −70 mV. These steps are illustrated in Fig. 1.4.
An important observation is that the maximum voltage in Fig. 1.4 is independent of the input current; however, the strength of the input current influences the neuron's firing frequency. This observation will be relevant in the coming lectures.
It is also important to note that the sodium-potassium pump is not explicitly represented in the Hodgkin-Huxley equations. It operates on a much slower timescale than the rapid sodium and potassium channel dynamics that underlie action potential generation.

Remarks:
• Try to observe what happens when the input current is: Negative, Zero, Slightly
positive, or Very positive.
• Here is also a recommended video to watch that illustrates the electrical nature of the
human brain: Video.

1.2 Simpler Neuron Models


Goal: To look at other, less complicated, but more computationally efficient neuron
models.

In this lecture, we take a careful look at other less complicated neuron models. The Hodgkin-
Huxley (HH) model is already greatly simplified:
• A neuron is treated as a point in space.
• Conductances are approximated with formulas.
• Only considers K+ , Na+ , and generic leak currents.
However, modeling a single action potential (spike) takes many time steps of this 4-D differential equation system. Spikes are fairly generic, and it is thought that the presence of a spike is more important than its specific shape.

Leaky Integrate-and-Fire (LIF) Model


The leaky integrate-and-fire (LIF) model only considers the sub-threshold membrane potential (voltage), but does not model the spike itself. Instead, it simply records when a spike occurs (i.e., when the voltage reaches the threshold).
The governing equation for the LIF model is

\[
C \frac{dV}{dt} = J_{\text{in}} - g_L (V - V_L),
\]

where:
• C: capacitance
• g_L = 1/R: conductance
• J_in: input current
• V_L: resting potential
Using Ohm's law (V = IR), we can rewrite it as

\[
RC \frac{dV}{dt} = R J_{\text{in}} - (V - V_L).
\]

Letting V_in = R J_in and τ_m = RC (the membrane time constant), this becomes:

\[
\tau_m \frac{dV}{dt} = V_{\text{in}} - (V - V_L) \quad \text{for } V < V_{\text{th}},
\]

where V_th is the threshold potential. By changing variables,

\[
v = \frac{V - V_L}{V_{\text{th}} - V_L}, \qquad v_{\text{in}} = \frac{V_{\text{in}}}{V_{\text{th}} - V_L},
\]

the dimensionless equation becomes

\[
\tau_m \frac{dv}{dt} = v_{\text{in}} - v \quad \text{for } v < 1. \tag{1.1}
\]

We integrate the differential equation for a given input current until v reaches the threshold value of 1.

The figure below illustrates how we record a spike at time ti . When the membrane potential
v crosses the threshold (v = 1), a spike is recorded and the membrane potential is reset to
0. After a spike, a refractory period τref follows before the membrane potential can start
integrating again.
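The integrate, fire, and reset procedure just described can be sketched with forward Euler; the step size and simulation length below are illustrative choices of mine:

```python
import numpy as np

def lif_simulate(v_in, T=1.0, dt=1e-4, tau_m=0.02, tau_ref=0.002):
    """Simulate the dimensionless LIF equation tau_m dv/dt = v_in - v.

    Records a spike and resets v to 0 whenever v crosses threshold 1,
    then holds v at 0 for the refractory period tau_ref.
    """
    v, refrac = 0.0, 0.0
    spike_times = []
    for i in range(int(T / dt)):
        if refrac > 0.0:
            refrac -= dt              # still in the refractory period
            continue
        v += dt / tau_m * (v_in - v)  # Euler step of Eq. (1.1)
        if v >= 1.0:                  # threshold crossing: record and reset
            spike_times.append(i * dt)
            v = 0.0
            refrac = tau_ref
    return np.array(spike_times)

spikes = lif_simulate(2.0)      # supra-threshold input: regular firing
no_spikes = lif_simulate(0.5)   # sub-threshold input: v settles below 1
```

For v_in = 2 and the typical constants used later in this section, the simulated rate lands close to the analytic steady-state firing rate (about 63 Hz).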

LIF Firing Rate

Suppose we hold the input, v_in, constant. We can solve the DE analytically between spikes.

Claim:

\[
v(t) = v_{\text{in}} \left( 1 - e^{-t/\tau} \right)
\]

is a solution of

\[
\tau \frac{dv}{dt} = v_{\text{in}} - v, \qquad v(0) = 0.
\]

[Plot: v(t) for constant input, rising from v(0) = 0 toward the asymptote v_in.]

It can be shown that the steady-state firing rate for constant v_in is:

\[
G(v_{\text{in}}) =
\begin{cases}
\dfrac{1}{\tau_{\text{ref}} - \tau_m \ln\left(1 - \frac{1}{v_{\text{in}}}\right)} & \text{for } v_{\text{in}} > 1, \\[2mm]
0 & \text{otherwise.}
\end{cases}
\]

You are asked to derive this result in Assignment 1.
Typical Values for Cortical Neurons

\[
\tau_{\text{ref}} = 0.002\ \text{s} \ (2\ \text{ms}), \qquad \tau_m = 0.02\ \text{s} \ (20\ \text{ms})
\]

The steady-state firing rate can be visualized as a function of the input v_in. The plot below uses these typical values:
[Plot: firing rate (Hz) versus input v_in; the rate is zero for v_in ≤ 1, then rises steeply and bends toward saturation, reaching roughly 100 Hz near v_in = 3.]

The previous graph is known as the tuning curve because it tells us how a neuron reacts to different input currents. Note that the tuning curve saturates to 1/τref in the limit v_in → ∞, which means that the neuron's firing rate cannot exceed the frequency characterized by the refractory period.
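The firing-rate formula above is straightforward to evaluate directly; a minimal sketch using the typical cortical values τref = 0.002 s and τm = 0.02 s:

```python
import numpy as np

def lif_rate(v_in, tau_ref=0.002, tau_m=0.02):
    """Steady-state LIF firing rate G(v_in) in Hz, for an array of inputs."""
    v_in = np.asarray(v_in, dtype=float)
    rate = np.zeros_like(v_in)               # G(v_in) = 0 for v_in <= 1
    above = v_in > 1.0
    rate[above] = 1.0 / (tau_ref - tau_m * np.log(1.0 - 1.0 / v_in[above]))
    return rate

rates = lif_rate(np.array([0.5, 2.0, 3.0]))  # ~[0, 63, 99] Hz
```

For very large v_in the log term vanishes, so the rate approaches (but never exceeds) 1/τref = 500 Hz, matching the saturation noted above.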

Artificial Neurons
As we’ve seen, the activity of a neuron is very low, or zero, when the input is low, and the
activity increases and approaches some maximum as the input increases.

The previous tuning curve motivates why we can represent an artificial neuron by a number
that represents its activity. Biologically speaking, this has to be a real positive number but
in practice, we can also consider negative values for the activity of artificial neurons. The
general behavior of a neuron activity can be modeled by a number of different activation
functions.

Logistic Curve

\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]

[Plot: the logistic curve σ(z), increasing from 0 to 1 with σ(0) = 0.5.]

Arctan

\[
\sigma(z) = \arctan(z)
\]

[Plot: arctan(z), increasing from −π/2 to π/2.]

Hyperbolic Tangent

\[
\sigma(z) = \tanh(z)
\]

[Plot: tanh(z), increasing from −1 to 1.]

Threshold

\[
\sigma(z) =
\begin{cases}
0, & \text{if } z < 0 \\
1, & \text{if } z \ge 0
\end{cases}
\]

[Plot: the step function, jumping from 0 to 1 at z = 0.]

Rectified Linear Unit (ReLU)

\[
\text{ReLU}(z) = \max(0, z)
\]

[Plot: ReLU(z), zero for z < 0 and linear with slope 1 for z ≥ 0.]

Softplus

\[
\text{Softplus}(z) = \log(1 + e^z)
\]

[Plot: Softplus(z), a smooth approximation of ReLU.]
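For reference, each of these scalar activation functions takes only a line or two of NumPy; the function names below are mine, not a standard API:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + e^{-z})

def threshold(z):
    return np.where(z >= 0, 1.0, 0.0)    # hard step at z = 0

def relu(z):
    return np.maximum(0.0, z)            # max(0, z)

def softplus(z):
    return np.log1p(np.exp(z))           # log(1 + e^z), a smooth ReLU

# np.tanh and np.arctan cover the remaining two activations directly.
```

All of these act elementwise, so they apply unchanged to whole layers of neuron activities.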

Multi-Neuron Activation Functions


Some activation functions depend on multiple neurons. Here are two of them.

SoftMax

SoftMax is like a probability distribution (or probability vector), so its elements add to 1. If ⃗z is the drive (input) to a set of neurons, then:

\[
\text{SoftMax}(\vec{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}
\]

Then, by definition:

\[
\sum_i \text{SoftMax}(\vec{z})_i = 1.
\]

Example:

\[
\vec{z} = [0.6,\ 3.4,\ -1.2,\ 0.05] \ \xrightarrow{\ \text{SoftMax}\ }\ \vec{y} = [0.06,\ 0.90,\ 0.009,\ 0.031]
\]

[Bar plots: the input ⃗z and the SoftMax output ⃗y; nearly all of the probability mass lands on index 2, the largest input.]

Argmax

Argmax is the extreme case of SoftMax, where only the largest element remains nonzero (and is set to 1), while the others are set to zero.

Example:

\[
\vec{z} = [0.6,\ 3.4,\ -1.2,\ 0.05] \ \xrightarrow{\ \text{Argmax}\ }\ \vec{y} = [0, 1, 0, 0]
\]

[Bar plot: the Argmax output ⃗y, a one-hot vector with a 1 at index 2.]
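Both multi-neuron functions can be sketched in a few lines of NumPy; the code below reproduces the two examples above (up to rounding):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) improves numerical stability without changing the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def argmax_onehot(z):
    # One-hot vector: 1 at the largest element, 0 elsewhere.
    y = np.zeros(len(z))
    y[np.argmax(z)] = 1.0
    return y

z = np.array([0.6, 3.4, -1.2, 0.05])
y = softmax(z)              # ~[0.055, 0.904, 0.009, 0.032]; sums to 1
onehot = argmax_onehot(z)   # [0., 1., 0., 0.]
```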

1.3 Synapses

Goal: To get an overview of how neurons pass information between them, and how we
can model those communication channels.

So far, we’ve just looked at individual neurons, and how they react to their input. But that
input usually comes from other neurons. When a neuron fires, an action potential (the wave
of electrical activity) travels along its axon.

The junction where one neuron communicates with the next neuron is called a synapse.

Note that the neurons are separated by a microscopically small space called the synaptic cleft, which is around 20–50 nm wide.

Post-Synaptic Current
Even though an action potential is very fast, the synaptic processes by which it affects the next neuron take time. Some synapses are fast (taking just about 10 ms), and some are quite slow (taking over 300 ms). If we represent that time constant using τs, then the current entering the post-synaptic neuron can be written:

\[
h(t) =
\begin{cases}
k\, t^n e^{-t/\tau_s} & \text{if } t \ge 0 \quad (\text{for some } n \in \mathbb{Z}_{\ge 0}), \\
0 & \text{if } t < 0,
\end{cases}
\]

where k is chosen so that:

\[
\int_0^\infty h(t)\, dt = 1 \implies k = \frac{1}{n!\, \tau_s^{n+1}}.
\]

The function h(t) is called a Post-Synaptic Current (PSC) filter, or (in keeping with the
ambiguity between current and voltage) Post-Synaptic Potential (PSP) filter.

Multiple spikes form what we call a "spike train," which can be modeled as a sum of Dirac delta functions:

\[
a(t) = \sum_{p=1}^{3} \delta(t - t_p).
\]

The Dirac delta function is defined as

\[
\delta(t) =
\begin{cases}
\infty & \text{if } t = 0, \\
0 & \text{otherwise,}
\end{cases}
\]

with the properties:

\[
\int_{-\infty}^{\infty} \delta(t)\, dt = 1
\quad \text{and} \quad
\int_{-\infty}^{\infty} f(t)\, \delta(t - \tau)\, dt = f(\tau).
\]

[Diagram: a spike train with spikes at times t1, t2, t3.]

How does a spike train influence the post-synaptic neuron?

Answer: You simply add together all the PSC filters, one for each spike. This amounts to convolving the spike train with the PSC filter:

\[
s(t) = a(t) * h(t).
\]

That is,

\[
s(t) = (a * h)(t)
     = \Bigl[ \textstyle\sum_p \delta(t - t_p) \Bigr] * h(t)
     = \int \sum_p \delta(\tau - t_p)\, h(t - \tau)\, d\tau \quad \text{(convolution)}
     = \sum_p \int \delta(\tau - t_p)\, h(t - \tau)\, d\tau
     = \sum_p h(t - t_p),
\]

which is the sum of PSC filters, one for each spike, also known as the filtered spike train.
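A minimal sketch of the filtered spike train, using the normalization k = 1/(n! τs^{n+1}) derived above; the spike times here are made up for illustration:

```python
import numpy as np
from math import factorial

def psc_filter(t, tau_s=0.1, n=1):
    """PSC filter h(t) = k t^n e^{-t/tau_s} for t >= 0, zero for t < 0.

    k = 1 / (n! tau_s^{n+1}) normalizes h to unit area.
    """
    k = 1.0 / (factorial(n) * tau_s ** (n + 1))
    tc = np.maximum(t, 0.0)   # evaluate the formula only on the t >= 0 branch
    return np.where(t >= 0, k * tc ** n * np.exp(-tc / tau_s), 0.0)

def filtered_spike_train(t, spike_times, tau_s=0.1, n=1):
    """s(t) = sum_p h(t - t_p): one PSC filter per spike."""
    return sum(psc_filter(t - tp, tau_s, n) for tp in spike_times)

dt = 1e-4
t = np.arange(0.0, 2.0, dt)
area = psc_filter(t).sum() * dt                 # numerically ~1 (unit area)
s = filtered_spike_train(t, [0.2, 0.5, 0.55])   # example spike train
```

Overlapping filters simply add, which is why closely spaced spikes (0.5 and 0.55 above) produce a larger post-synaptic current.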
Post-synaptic current for a random spike train with τs = 0.1, n = 1: the PSC captures some information about the pre-synaptic spike history.

Interestingly, for constant firing at P = 60 Hz with τs = 0.1 and n = 1, we observe that the post-synaptic current saturates to a constant value, which is determined by the pre-synaptic firing rate P.

More specifically, if we plot the asymptotic value of the post-synaptic current (PSC) against
the firing rate, we find a linear relationship

PSC ≈ P,

as demonstrated in the following plot.

Connection Weight
The total current induced by an action potential onto a particular post-synaptic neuron can
vary widely, depending on:

• the number and sizes of the synapses,

• the amount and type of neurotransmitter,

• the number and type of receptors, etc.

We can combine all those factors into a single number, the connection weight. Thus, the
total input to a neuron is a weighted sum of filtered spike-trains.

[Diagram: neurons A and B connect to neuron C, with connection weights wCA and wCB.]

Weight Matrices
When we have many pre-synaptic neurons, it is more convenient to use matrix-vector notation to represent the weights and activities.

Suppose we have two populations, X and Y :

• X has N nodes,

• Y has M nodes.

If every node in X sends its output to every node in Y , then we will have a total of N × M
connections, each with its own weight.

[Diagram: fully connected populations X (nodes x1, x2) and Y (nodes y1, y2, y3), with a weight on each of the six connections.]

The weights can be represented as a matrix:

\[
W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix} \in \mathbb{R}^{N \times M}.
\]

Vectors of Neuron Activities

We typically store the neuron activities in vectors:

\[
\vec{x} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad \vec{y} = \begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix}.
\]

We can compute the input to the nodes in Y using:

⃗z = ⃗xW + ⃗b,

where ⃗b holds the biases for the neurons in Y .

Thus,
⃗y = σ(⃗z) = σ(⃗xW + ⃗b),

where σ represents an activation function.

Bias Representation
Another way to represent the biases, ⃗b, is by adding an additional input node with a fixed value of 1.

[Diagram: population X augmented with a constant node 1; its connection weights into the nodes of Y are the biases b1, b2, b3.]

Or in other words:

\[
\vec{x}\, W + \vec{b} = \begin{bmatrix} \vec{x} & 1 \end{bmatrix} \begin{bmatrix} W \\ \vec{b} \end{bmatrix}.
\]
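The bias-as-extra-input identity is easy to verify numerically; a quick sketch with random values, using the N = 2, M = 3 shapes from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 2))        # activities of X (N = 2), as a row vector
W = rng.standard_normal((2, 3))        # weight matrix in R^{N x M}
b = rng.standard_normal((1, 3))        # biases for Y (M = 3)

# Append a constant 1 to x, and stack b as an extra row under W.
x_aug = np.hstack([x, np.ones((1, 1))])
W_aug = np.vstack([W, b])

same = np.allclose(x @ W + b, x_aug @ W_aug)   # the two forms agree
```

This is why biases can be folded into the weight matrix: the constant-1 node's outgoing weights simply are the biases.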

Implementing Connections between Spiking Neurons

For simplicity, let n = 0:

\[
h(t) =
\begin{cases}
\dfrac{1}{\tau_s}\, e^{-t/\tau_s} & \text{if } t \ge 0, \\
0 & \text{if } t < 0.
\end{cases}
\]

[Plot: h(t) decaying exponentially from its peak value 1/τs at t = 0.]

Theorem: The function h(t) defined above is the solution of the initial value problem (IVP)

\[
\tau_s \frac{ds}{dt} = -s, \qquad s(0) = \frac{1}{\tau_s}.
\]

Proof: Exercise.

Full LIF Neuron Model

[Diagram: spikes from pre-synaptic neuron j arrive through weight wij, drive the synaptic current si(t), which in turn drives the membrane potential vi(t); output spikes travel along the axon.]

The dynamics of the neuron can be described by:

\[
\begin{cases}
\tau_m \dfrac{dv_i}{dt} = s_i - v_i & \text{if not in the refractory period (see Eq. (1.1))}, \\[2mm]
\tau_s \dfrac{ds_i}{dt} = -s_i.
\end{cases}
\]

If vi reaches 1:
1. Start the refractory period.
2. Send a spike along the axon.
3. Reset vi to 0.

If a spike arrives from neuron j, increment si using

\[
s_i \leftarrow s_i + \frac{w_{ij}}{\tau_s}.
\]
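Putting the pieces together, one Euler step of the full LIF neuron model can be sketched as follows. The population layout and constants are illustrative, and applying the synaptic increments at the end of each step (rather than event by event) is my own simplification:

```python
import numpy as np

def step(v, s, refrac, W, dt=1e-3, tau_m=0.02, tau_s=0.1, tau_ref=0.002):
    """One Euler step for a population of LIF neurons coupled through PSCs.

    v, s, refrac hold membrane potentials, synaptic currents, and remaining
    refractory times; W[i, j] is the connection weight from neuron j to i.
    """
    s = s - dt / tau_s * s                        # tau_s ds/dt = -s
    active = refrac <= 0.0                        # skip neurons in refractory
    v = np.where(active, v + dt / tau_m * (s - v), v)
    refrac = np.maximum(refrac - dt, 0.0)

    spiked = v >= 1.0                             # threshold crossings
    v = np.where(spiked, 0.0, v)                  # 3. reset v to 0
    refrac = np.where(spiked, tau_ref, refrac)    # 1. start refractory period
    s = s + W @ spiked.astype(float) / tau_s      # s_i <- s_i + w_ij / tau_s
    return v, s, refrac, spiked

# Two neurons: neuron 0 starts above threshold and drives neuron 1 (w10 = 0.5).
v, s, refrac = np.array([1.5, 0.0]), np.zeros(2), np.zeros(2)
W = np.array([[0.0, 0.0], [0.5, 0.0]])
v, s, refrac, spiked1 = step(v, s, refrac, W)   # neuron 0 spikes, s[1] jumps
v, s, refrac, spiked2 = step(v, s, refrac, W)   # neuron 1 starts integrating
```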
Chapter 2

Neural Learning


2.1 Learning
Goal: To formulate the problem of supervised learning as an optimization problem.

Getting a neural network to do what you want usually means finding a set of connection
weights that yield the desired behaviour. That is, neural learning is all about adjusting
connection weights.

Overview of Learning
There are three basic categories of learning problems:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

Supervised Learning:
In supervised learning, the desired output is known so we can compute the error and use
that error to adjust our network.
Example:
Given an image of a digit, identify which digit it is.
Input: Image of digit “4”:

Target: [0 0 0 0 1 0 0 0 0 0]

Unsupervised Learning
In unsupervised learning, the output is not known (or not supplied), so it cannot be used to generate an error signal. Instead, this form of learning is all about finding efficient representations for the statistical structure in the input.
Example:
Given spoken English words, transform them into a more efficient representation such as
phonemes, and then syllables. Or, cluster points into categories.

Reinforcement Learning

In reinforcement learning, feedback is given, but usually less often, and the error signal is
usually less specific.

Example:
When playing a game of chess, a person knows their play was good if they win the game.
They can try to learn from the moves they made.

In this course, we will mostly focus on supervised learning. But we will also look at some
examples of unsupervised learning.

Supervised Learning
Our neural network performs some mapping from an input space to an output space.

Input (x1 , x2 ) ∈ R2 → Output y1 ∈ R, Target is t ∈ R

We are given training data, with MANY examples of input/target pairs. This data is
(presumably) the result of some consistent mapping process.

Example: MNIST
Images of handwritten digits map to integers.

Example:

A B XOR(A, B)
1 1 0
1 0 1
0 1 1
0 0 0
Input: (A, B) ∈ {0, 1}2
Output/Target: t ∈ {0, 1}, y ∈ [0, 1]

Our task is to alter the connection weights in our network so that our network mimics this
mapping.

Types of Supervised Learning


Our goal is to bring the output as close as possible to the target.

But what, exactly, do we mean by “close”? For now, we will use the scalar function L(y, t)
as an error (or “loss”) function, which returns a smaller value as our outputs are closer to
the target.

Two common types of mappings encountered in supervised learning are:

• Regression

• Classification

Regression

Output values are a continuous-valued function of the inputs. The outputs can take on a
range of values.

Example: Linear Regression

The plot above demonstrates a linear regression model. The blue line is the regression line
that best fits the data points, minimizing the error between the predicted outputs and the
true values.

Classification

Outputs fall into a number of distinct categories. Here are some examples:

Example:

Learning as Optimization
Once we have a cost function, our neural-network learning problem can be formulated as an
optimization problem.

Let our network be represented by the mapping f so that

y = f (x; θ)

where θ represents all the weights and biases.



Neural learning seeks to find:

\[
\min_\theta \; \mathbb{E}_{(x,t) \in \text{data}} \Bigl[ L\bigl( f(x; \theta), t(x) \bigr) \Bigr]
\]

In other words, find the weights and biases that minimize the expected cost between the outputs and the targets.

2.2 Universal Approximation Theorem

Goal: Can we approximate any function using a neural network?

Question: Can we approximate any function using a neural network?

Given a function f(x), can we find the weights ωj, αj, and biases θj, j = 1, 2, …, N, such that

\[
f(x) \approx \sum_{j=1}^{N} \alpha_j\, \sigma(\omega_j x + \theta_j)
\]

to arbitrary precision?

[Diagram: a network with one hidden layer; the input x feeds hidden units h1, …, hN through weights ω1, …, ωN, and their outputs are combined with coefficients α1, …, αN to produce y.]

Theorem: Let σ be any continuous sigmoidal function. Then finite sums of the form

\[
G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(\omega_j x + \theta_j)
\]

are dense in C(In). In other words, given any f ∈ C(In) and ε > 0, there is a sum, G(x), of the above form, for which

\[
|G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n.
\]

Here, In = [0, 1]^n.

Cybenko, G., "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2:303–314, 1989.

A function σ is "sigmoidal" if

\[
\sigma(x) \to
\begin{cases}
1 & \text{as } x \to \infty, \\
0 & \text{as } x \to -\infty.
\end{cases}
\]
The theorem states that

\[
\exists\, N \ \text{and} \ \exists\, \omega_j, \theta_j, \alpha_j \ \text{for} \ j = 1, \ldots, N, \ \text{such that} \ |G(x) - f(x)| < \varepsilon.
\]

Informal Proof:
Suppose we let ωj → ∞ for j = 1, …, N. Then

\[
\sigma(\omega_j x) \xrightarrow{\ \omega_j \to \infty\ }
\begin{cases}
0 & \text{for } x \le 0, \\
1 & \text{for } x > 0.
\end{cases}
\]

By shifting the x-axis by bj, we get:

\[
\sigma(\omega_j (x - b_j)) \xrightarrow{\ \omega_j \to \infty\ }
\begin{cases}
0 & \text{for } x \le b_j, \\
1 & \text{for } x > b_j.
\end{cases}
\]

In this limit, we obtain the Heaviside step function:

\[
H(x) = \lim_{\omega_j \to \infty} \sigma(\omega_j x).
\]

Let us define:

\[
H(x; b) = \lim_{\omega \to \infty} \sigma(\omega(x - b)).
\]

We can use two such functions to create a piece,

\[
P(x; b, \delta) = H(x; b) - H(x; b + \delta).
\]

Hence, P is made from sigmoidal functions.

[Diagram: the piece function P(x; b, δ) equals 1 on the interval (b, b + δ] and 0 elsewhere.]

Since f(x) is continuous,

\[
\lim_{x \to a} f(x) = f(a), \quad \forall a \in I_n.
\]

Therefore, there exists an interval (aj, aj + Δx) such that

\[
|f(x) - f(a_j)| < \varepsilon, \quad \forall x \in (a_j, a_j + \Delta x).
\]

[Diagram: f stays within ±ε of f(aj) over the interval (aj, aj + Δx).]

Repeat this process for x = aj+1 = bj + δj.

As a result, the function

\[
G(x) = \sum_{j=1}^{N'} f(a_j)\, P(x; b_j, \delta_j)
\]

satisfies the constraint

\[
|G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n.
\]

Here N′ is the number of subintervals. G(x) can also be written in terms of threshold functions (which can be approximated by sigmoids) as:

\[
G(x) = \sum_{j=1}^{N'} f(a_j) \bigl( H(x; b_j) - H(x; b_j + \delta_j) \bigr).
\]

Thus, the total number of hidden neurons required to construct G(x) is N = 2N′. ■

[Plot: a piecewise-constant approximation G(x) of a function f(x) on [0, 1]. Max error = 0.09986354911599937.]
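The construction in the proof can be reproduced numerically by summing pairs of steep sigmoids; the target function, piece count, and slope ω below are arbitrary illustrative choices of mine:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))   # clip avoids overflow

def uat_approx(f, x, n_pieces=50, omega=1e4):
    """G(x) = sum_j f(b_j) [sigma(omega (x - b_j)) - sigma(omega (x - b_j - d))].

    Each pair of steep sigmoids forms one piece P(x; b_j, d), so the
    approximation uses N = 2 * n_pieces hidden units.
    """
    d = 1.0 / n_pieces
    G = np.zeros_like(x)
    for j in range(n_pieces):
        bj = j * d
        G += f(bj) * (sigma(omega * (x - bj)) - sigma(omega * (x - bj - d)))
    return G

f = lambda x: np.sin(2 * np.pi * x) + x
x = np.linspace(0.01, 0.99, 500)
err_50 = np.max(np.abs(uat_approx(f, x, 50) - f(x)))
err_200 = np.max(np.abs(uat_approx(f, x, 200) - f(x)))   # more pieces, less error
```

Shrinking the error requires more pieces, which previews the point below: the theorem says nothing about how fast N must grow.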

So, why would we ever need a neural network with more than one hidden layer?
Answer:
The theorem guarantees existence, but makes no claims about how N, the number of hidden neurons, scales as a function of ε. N might grow exponentially as ε gets smaller.

2.3 Loss Functions


Goal: To become familiar with some of the most common ways to measure error.

We have to choose a way to quantify how close our output is to the target. For this, we use a
“cost function”, also known as an “objective function”, “loss function”, or “error function”.
There are many choices, but here are two commonly-used ones.
Suppose we are given a dataset of input/target pairs {x(i), t(i)}, i = 1, …, N. For input x(i), the network's output is:

\[
y^{(i)} = f\bigl(x^{(i)}; \theta\bigr).
\]

(Mean) Squared Error (MSE)

\[
L(y, t) = \frac{1}{2} \|y - t\|_2^2 = \frac{1}{2} \sum_{j} (y_j - t_j)^2
\]

Taking the expectation (mean) over the entire dataset,

\[
E = \frac{1}{N} \sum_{i=1}^{N} L\bigl(y^{(i)}, t^{(i)}\bigr).
\]

The use of MSE as a cost function is often associated with linear activation functions, or ReLU. This loss-function/activation-function pair is often used for regression problems.

Cross Entropy (Bernoulli Cross-Entropy)

Consider the task of classifying inputs into two categories, labelled 0 and 1. Our neural-network model for this task will output a single value between 0 and 1:

\[
x \xrightarrow{\ f(x;\theta)\ } y \in (0, 1),
\]

where the true class is expressed in the target,

\[
t \in \{0, 1\}.
\]

If we suppose that y is the probability that x → 1 (x belongs to class 1),

\[
y = P(x \to 1 \mid \theta) = f(x; \theta),
\]

then we can treat it as a Bernoulli distribution:

\[
P(x \to 1 \mid \theta) = y \quad (\text{i.e., } t = 1), \qquad
P(x \to 0 \mid \theta) = 1 - y \quad (\text{i.e., } t = 0).
\]

The likelihood of our data sample given our model is:

\[
P(x \to t \mid \theta) = y^t (1 - y)^{1-t}.
\]

The task of "learning" is then to find a model (θ) that maximizes this likelihood. Equivalently, we can minimize the negative log-likelihood:

\[
L(y, t) = -\bigl[\, t \log y + (1 - t) \log(1 - y) \,\bigr].
\]

This negative log-likelihood is the basis of the cross-entropy loss function.

The expected cross-entropy over the entire dataset is

\[
E = -\mathbb{E}_{\text{data}}\!\left[ t^{(i)} \log y^{(i)} + \bigl(1 - t^{(i)}\bigr) \log\bigl(1 - y^{(i)}\bigr) \right]
  = -\frac{1}{N} \sum_{i=1}^{N} \left( t^{(i)} \log y^{(i)} + \bigl(1 - t^{(i)}\bigr) \log\bigl(1 - y^{(i)}\bigr) \right).
\]

Cross-entropy assumes that the output values are in the range [0, 1]. Hence, it works nicely with the logistic activation function.
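A minimal sketch of this loss in NumPy; the small eps guard is my own addition to keep the logs finite when y saturates at exactly 0 or 1:

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """Bernoulli cross-entropy L(y, t) = -[t log y + (1 - t) log(1 - y)]."""
    y = np.clip(y, eps, 1.0 - eps)   # guard the logs against y = 0 or y = 1
    return -(t * np.log(y) + (1 - t) * np.log(1.0 - y))

# The loss shrinks as the predicted probability approaches the true class (t = 1).
losses = [cross_entropy(y, 1) for y in (0.1, 0.6, 0.9)]
```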

Categorical Cross-Entropy (Multinoulli Cross-Entropy)

Consider a classification problem that has K classes (K > 2). Given an input, the task of our model is to output the class of the input (e.g., given an image of a digit, determine the digit class).
Suppose our model is given the input x. Then the network's output is:

\[
y = f(x; \theta).
\]

We interpret yk as the probability of x being from class k. That is, y is the distribution of x's membership over the K classes.
Note that:

\[
\sum_{k=1}^{K} y_k = 1.
\]

Under that distribution, suppose we observed a sample from class k̄. The likelihood of that observation is:

\[
P(x \in C_{\bar{k}} \mid \theta) = y_{\bar{k}}, \quad \text{where } C_{\bar{k}} = \{x \mid x \text{ is in class } \bar{k}\}.
\]

Note that y is a function of the input x and the model parameters θ. If we represent the target class using the one-hot vector

\[
t = [\,0 \cdots 0 \ 1 \ 0 \cdots 0\,],
\]

then we can write the likelihood as

\[
P(x \in C_k \mid \theta) = \prod_{k=1}^{K} y_k^{t_k}.
\]

Thus, the negative log-likelihood of x is

\[
-\log P(x \in C_k \mid \theta) = -\sum_{k=1}^{K} t_k \log y_k.
\]

This loss function is known as categorical cross-entropy:

\[
L(y, t) = -\sum_{k=1}^{K} t_k \log y_k.
\]

The expected categorical cross-entropy for a dataset of N samples is

\[
\mathbb{E}_{\text{data}}\!\left[ -\sum_{k=1}^{K} t_k^{(i)} \log y_k^{(i)} \right]
  = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} t_k^{(i)} \log y_k^{(i)}.
\]

Since \(\sum_k y_k = 1\), this cost function works well with Softmax activation at the output layer.
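A minimal sketch of categorical cross-entropy paired with SoftMax; the drive values and the eps guard below are illustrative additions of mine:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # stable SoftMax: shift by max(z)
    return e / e.sum()

def categorical_cross_entropy(y, t, eps=1e-12):
    """L(y, t) = -sum_k t_k log y_k, for a one-hot target t."""
    return -np.sum(t * np.log(y + eps))

z = np.array([2.0, 0.5, -1.0])   # network drive (made-up values)
y = softmax(z)                   # predicted class distribution; sums to 1
t = np.array([1.0, 0.0, 0.0])    # one-hot target: class 0

# With a one-hot target, the sum collapses to -log(probability of the true class).
loss = categorical_cross_entropy(y, t)
```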

2.4 Gradient Descent Learning

Goal: To see how we can use a simple optimization method to tune our network weights.

Here, we assume that you are familiar with partial derivatives and gradient vectors.

The operation of our network can be written as:

y = f (x; θ)

(Connection weights & biases are represented by θ)

So, if our loss function is L(y, t), where t is the target, then neural learning becomes the optimization problem
\[
\min_{\theta} E(\theta), \quad \text{where } E(\theta) = \mathbb{E}_{x \in \text{data}}\Big[ L\big(f(x; \theta),\, t(x)\big) \Big].
\]

We can apply gradient descent to E using the gradient:
\[
\nabla_{\theta} E = \Big[ \tfrac{\partial E}{\partial \theta_1} \;\; \tfrac{\partial E}{\partial \theta_2} \;\; \cdots \;\; \tfrac{\partial E}{\partial \theta_p} \Big].
\]
Note: we will treat the gradient here as a row vector with the same shape as the vector θ, rather than as a column vector. Sorry for the confusion in lecture 5.

Gradient-Based Optimization
If you want to find a local maximum of a function, you can simply start somewhere and keep walking uphill. For example, suppose you have a function with two inputs, E(a, b), and you wish to find the parameters (ā, b̄) that yield the maximum value of E, i.e.,
\[
(\bar{a}, \bar{b}) = \operatorname*{argmax}_{(a, b)} E(a, b).
\]
No matter where you are, “uphill” is in the direction of the gradient vector:
\[
\nabla E(a, b) = \Big[ \tfrac{\partial E}{\partial a} \;\; \tfrac{\partial E}{\partial b} \Big].
\]

Gradient ascent is an optimization method where you continuously move in the direction of your gradient vector. In other words, you update a and b according to the differential equation
\[
\tau \, \frac{d(a, b)}{dt} = \nabla E(a, b),
\]
where we have introduced the time variable t so that we can move through parameter space and approach our optimum over time.

Let’s solve that DE numerically using Euler’s method. If your current position is (an , bn ),
then
(an+1 , bn+1 ) = (an , bn ) + k∇E(an , bn )

where k is your step multiplier (which swallows up both τ and the time step size).

Gradient DESCENT aims to minimize your objective function. So, you walk downhill,
stepping in the direction opposite the gradient vector as follows:

(an+1 , bn+1 ) = (an , bn ) − k∇E(an , bn ).

Note that there is no guarantee that you will find the global optimum. In general, you will
find a local optimum that may or may not be the global optimum.
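A minimal sketch of the descent update on a simple two-variable function (the function and step multiplier k are arbitrary choices; for this convex bowl the local minimum is also the global one):

```python
import numpy as np

def gradient_descent(grad, x0, k=0.1, steps=200):
    """Repeatedly step opposite the gradient: x_{n+1} = x_n - k * grad(x_n)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - k * grad(x)
    return x

# E(a, b) = (a - 1)^2 + (b + 2)^2 has gradient [2(a - 1), 2(b + 2)]
grad_E = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

opt = gradient_descent(grad_E, x0=[5.0, 5.0])
# converges toward the minimizer (1, -2)
```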

2.5 Error Backpropagation


Goal: To find an efficient method to compute the gradients for gradient-descent opti-
mization.

We can apply gradient descent on a multi-layer network, using chain rule to calculate the
gradients of the error with respect to deeper connection weights and biases.
Consider the network:

[Figure: hidden nodes h1, . . . , h4 connect to output nodes y1, y2 (with input currents β1, β2) through the weights M11, M21, M31, M41 and M12, M22, M32, M42.]

αi is the input current to hidden node i. βj is the input current to output node j. For our cost (loss) function, we will use E(y, t). For learning, suppose we want to know ∂E/∂Mij. E.g.:
\[
\frac{\partial E}{\partial M_{41}} = \frac{\partial E}{\partial \beta_1} \frac{\partial \beta_1}{\partial M_{41}}.
\]
Recall that
\[
E(y, t) = E\big(\sigma(h M + b),\, t\big), \qquad
\frac{\partial E}{\partial \beta_i} = \frac{\partial E}{\partial y_i} \frac{d y_i}{d \beta_i}.
\]

We’ll revisit this later.


Thus,
\[
\frac{\partial E}{\partial M_{41}} = \frac{\partial E}{\partial \beta_1} \frac{\partial \beta_1}{\partial M_{41}}.
\]
Note that
\[
\beta_1 = \sum_{i=1}^{4} h_i M_{i1} + b_1,
\qquad \text{so} \qquad
\frac{\partial \beta_1}{\partial M_{41}} = h_4.
\]
As a result:
\[
\frac{\partial E}{\partial M_{41}} = h_4 \, \frac{\partial E}{\partial \beta_1}.
\]
In general:
\[
\frac{\partial E}{\partial M_{ij}} = h_i \, \frac{\partial E}{\partial \beta_j}.
\]
This result can be summarized in a vectorized form as
\[
\begin{bmatrix}
\frac{\partial E}{\partial M_{11}} & \frac{\partial E}{\partial M_{12}} \\
\frac{\partial E}{\partial M_{21}} & \frac{\partial E}{\partial M_{22}} \\
\frac{\partial E}{\partial M_{31}} & \frac{\partial E}{\partial M_{32}} \\
\frac{\partial E}{\partial M_{41}} & \frac{\partial E}{\partial M_{42}}
\end{bmatrix}
=
\begin{bmatrix} h_1 \\ \vdots \\ h_4 \end{bmatrix}
\begin{bmatrix} \frac{\partial E}{\partial \beta_1} & \frac{\partial E}{\partial \beta_2} \end{bmatrix}
= h^{\top} \cdot (\nabla_{\beta} E).
\]

Now we have seen how backpropagation works for the connection weights between the top two layers. What about the connection weights between layers deeper in the network? Let us take a careful look at the gradient with respect to W21.

[Figure: the full network. The inputs x feed the hidden nodes h1, . . . , h4 (with input currents α1, . . . , α4) through weights W11, W21, W31, . . . , and the hidden nodes feed the outputs y1, y2 (with input currents β1, β2) through the weights M11, M12, . . . ; the gradients ∂E/∂β1 and ∂E/∂β2 are annotated at the output.]

Using the two illustrations above, we can compute ∂E/∂W21 as follows:
\[
\frac{\partial E}{\partial W_{21}} = \frac{\partial E}{\partial \alpha_1}
\underbrace{\frac{\partial \alpha_1}{\partial W_{21}}}_{\alpha_1 = \sum_{j} x_j W_{j1} + a_1}.
\]
Given that \(\frac{\partial \alpha_1}{\partial W_{21}} = x_2\), we can now focus on
\[
\frac{\partial E}{\partial \alpha_1} = \frac{\partial E}{\partial h_1} \frac{d h_1}{d \alpha_1}
= \left( \frac{\partial E}{\partial \beta_1} \frac{\partial \beta_1}{\partial h_1}
       + \frac{\partial E}{\partial \beta_2} \frac{\partial \beta_2}{\partial h_1} \right)
  \frac{d h_1}{d \alpha_1}.
\]
Hence:
\[
\frac{\partial E}{\partial \alpha_1}
= \left( \frac{\partial E}{\partial \beta_1} M_{11} + \frac{\partial E}{\partial \beta_2} M_{12} \right)
  \frac{d h_1}{d \alpha_1}.
\]
Note that we computed \(\frac{\partial E}{\partial \beta_1}\) and \(\frac{\partial E}{\partial \beta_2}\) already when we were learning the weight matrix M; thus,
\[
\frac{\partial E}{\partial \alpha_1}
= \frac{d h_1}{d \alpha_1}
\begin{bmatrix} \frac{\partial E}{\partial \beta_1} & \frac{\partial E}{\partial \beta_2} \end{bmatrix}
\cdot
\begin{bmatrix} M_{11} \\ M_{12} \end{bmatrix}.
\]

More generally, with
\[
x \in \mathbb{R}^X, \quad h \in \mathbb{R}^H, \quad y, t \in \mathbb{R}^Y,
\quad \text{and (recall) } M \in \mathbb{R}^{H \times Y},
\]
we have
\[
\frac{\partial E}{\partial \alpha_i}
= \frac{d h_i}{d \alpha_i}
\begin{bmatrix} \frac{\partial E}{\partial \beta_1} & \cdots & \frac{\partial E}{\partial \beta_Y} \end{bmatrix}
\cdot
\underbrace{\begin{bmatrix} M_{i1} \\ \vdots \\ M_{iY} \end{bmatrix}}_{\text{the } i\text{th column of } M^{\top}}.
\]

Putting everything into a one-row vector, we get
\[
\begin{bmatrix} \frac{\partial E}{\partial \alpha_1} & \cdots & \frac{\partial E}{\partial \alpha_H} \end{bmatrix}
=
\begin{bmatrix} \frac{d h_1}{d \alpha_1} & \cdots & \frac{d h_H}{d \alpha_H} \end{bmatrix}
\odot
\left(
\begin{bmatrix} \frac{\partial E}{\partial \beta_1} & \cdots & \frac{\partial E}{\partial \beta_Y} \end{bmatrix}
\underbrace{\begin{bmatrix}
M_{11} & M_{21} & \cdots & M_{H1} \\
M_{12} & M_{22} & \cdots & M_{H2} \\
\vdots & \vdots & \ddots & \vdots \\
M_{1Y} & M_{2Y} & \cdots & M_{HY}
\end{bmatrix}}_{M^{\top} \in \mathbb{R}^{Y \times H}}
\right).
\]
As a result:
\[
\nabla_{\alpha} E = \frac{dh}{d\alpha} \odot \big( \nabla_{\beta} E \cdot M^{\top} \big),
\qquad \text{where } \frac{dh}{d\alpha} = \Big( \frac{d h_1}{d \alpha_1}, \cdots, \frac{d h_H}{d \alpha_H} \Big).
\]
This formula shows that gradients at layer (l) can be computed using gradients at layer (l + 1).

To derive a generic gradient formula for multiple layers, let us define
\[
\nabla_{z^{(l+1)}} E = \frac{\partial E}{\partial z^{(l+1)}},
\]
and let
\[
h^{(l+1)} = \sigma\big(z^{(l+1)}\big), \qquad z^{(l+1)} = h^{(l)} W^{(l)} + b^{(l+1)},
\]
using our row-vector convention. We have
\[
\nabla_{z^{(l)}} E = \frac{d h^{(l)}}{d z^{(l)}} \odot \big( \nabla_{z^{(l+1)}} E \cdot (W^{(l)})^{\top} \big).
\]
Thus, to compute \(\frac{\partial E}{\partial W_{ij}^{(l)}}\), we can do the following:
\[
\frac{\partial E}{\partial W_{ij}^{(l)}}
= \frac{\partial E}{\partial z_j^{(l+1)}} \cdot \frac{\partial z_j^{(l+1)}}{\partial W_{ij}^{(l)}}
= \frac{\partial E}{\partial z_j^{(l+1)}} \cdot h_i^{(l)}.
\]
As a result (same size as \(W^{(l)}\)):
\[
\frac{\partial E}{\partial W^{(l)}} = \big(h^{(l)}\big)^{\top} \, \nabla_{z^{(l+1)}} E.
\]
In summary:
\[
\nabla_{z^{(l)}} E = \sigma'\big(z^{(l)}\big) \odot \big( \nabla_{z^{(l+1)}} E \cdot (W^{(l)})^{\top} \big),
\qquad
\nabla_{W^{(l)}} E = \big( h^{(l)} \big)^{\top} \, \nabla_{z^{(l+1)}} E.
\]
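The two layer-wise formulas can be sketched in NumPy for a single layer with a logistic activation (the sizes and numeric values below are arbitrary, chosen only to exercise the shapes):

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic activation

H, Y = 4, 2                                   # hidden and output widths
z_l = np.array([[-1.0, 0.5, 0.0, 2.0]])       # z^(l), shape (1, H)
h_l = sigma(z_l)                              # h^(l) = sigma(z^(l))
W_l = np.array([[ 0.1, -0.2],
                [ 0.3,  0.4],
                [-0.5,  0.6],
                [ 0.7, -0.8]])                # W^(l), shape (H, Y)
g_next = np.array([[0.3, -0.7]])              # stand-in for grad_{z^(l+1)} E, shape (1, Y)

# grad_{z^(l)} E = sigma'(z^(l)) ⊙ (grad_{z^(l+1)} E · (W^(l))^T)
g_z = h_l * (1 - h_l) * (g_next @ W_l.T)      # shape (1, H)

# grad_{W^(l)} E = (h^(l))^T grad_{z^(l+1)} E, same shape as W^(l)
g_W = h_l.T @ g_next                          # shape (H, Y)
```

For a single sample, the weight gradient is just the outer product of the hidden activities with the incoming gradient, which is exactly what the matrix product computes.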

Vectorization
These formulas even work when processing many samples at once. Feeding in just one sample (a row vector) gives
\[
x \xrightarrow{\;W^{(\ell)}\;} z^{(\ell)} \xrightarrow{\;\sigma\;} h^{(\ell)} \longrightarrow \cdots,
\qquad z^{(\ell)} = x W^{(\ell)}, \quad h^{(\ell)} = \sigma\big(z^{(\ell)}\big).
\]
But we could instead feed in many samples at once, stacking them as the rows of the matrix x (e.g., 4 samples gives a matrix with 4 rows). Then \(z^{(\ell)} = x W^{(\ell)}\) and \(h^{(\ell)} = \sigma(z^{(\ell)})\) are also matrices with one row per sample, and \(\nabla_{z^{(\ell)}} E\) has the same shape as \(z^{(\ell)}\).

Now how about the term
\[
\nabla_{W^{(\ell)}} E = \big(h^{(\ell)}\big)^{\top} \, \nabla_{z^{(\ell+1)}} E \;?
\]
Writing the columns of \((h^{(\ell)})^{\top}\) as \(a_1, \ldots, a_4\) (one column per sample) and the rows of \(\nabla_{z^{(\ell+1)}} E\) as \(b_1, \ldots, b_4\) (one row per sample), we have
\[
\big(h^{(\ell)}\big)^{\top} \, \nabla_{z^{(\ell+1)}} E
= \begin{bmatrix} a_1 & a_2 & a_3 & a_4 \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}
= a_1 b_1 + \cdots + a_4 b_4.
\]
Following this logic, we can observe that gradients automatically add up over the samples.
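This summing-over-samples behaviour can be verified directly: with one row per sample, the single matrix product \((h^{(\ell)})^{\top} \nabla_{z^{(\ell+1)}} E\) equals the sum of per-sample outer products (the shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, M = 4, 3, 2                      # 4 samples, H hidden units, M outputs
h = rng.standard_normal((N, H))        # h^(l): one row per sample
g = rng.standard_normal((N, M))        # grad_{z^(l+1)} E: one row per sample

# Batched weight gradient, computed in one matrix product
batched = h.T @ g                      # shape (H, M)

# The same quantity, accumulated sample by sample as outer products
summed = sum(np.outer(h[n], g[n]) for n in range(N))
```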
Chapter 3

Automatic Differentiation

45
CHAPTER 3. AUTOMATIC DIFFERENTIATION 46

3.1 Automatic Differentiation


Goal: To learn a way to automate the computation of gradients.

Consider this expression:
\[
f = \underbrace{\sin(x)}_{a} + \underbrace{x y}_{b}
\]
This expression can be built using the following code:

x = Var, y = Var
a = sin(x)
b = x ∗ y
f = a + b

Dependencies: Shown as arrows in the graph.


Creators: Indicate which operations generated each variable.

We will build a data structure to represent the expression graph (illustrated above) using
two different types of objects:

Variables & Operations

• Var: stores a value (val) and a reference to the Op that created it (creator).

• Op: stores references to its argument Vars (args). For example, the + Op for a + b holds references to the Vars a and b.

Example: f = a ∗ b
Given Var objects a and b:
1. Create an Op object.
2. Save references to the args (a, b).
3. Create a Var for the output (f).
4. Set f.val to a.val ∗ b.val.
5. Set f.creator to this Op.

Differentiate
The expression graph can also be used to compute the derivatives. Each Var stores the derivative of the expression with respect to itself, in its member grad.
Consider:
\[
f = F(G(H(x))), \qquad \texttt{x.grad} = \frac{df}{dx}.
\]

Chain Rule Derivation:

For this derivation, define h = H(x), g = G(h), and f = F(g). Then
\[
\frac{df}{dx} = \frac{dF(g)}{dg} \frac{dG(H(x))}{dx}
= \frac{dF(g)}{dg} \frac{dG(h)}{dh} \frac{dH(x)}{dx}
= \frac{df}{dg} \frac{dg}{dh} \frac{dh}{dx}.
\]

Here is an expression graph that can be used to compute the derivatives:

Starting with a value of 1 at the top, we work our way down through the graph, and increment
grad of each Var as we go.

Each Op contributes its factor (according to the chain rule) and passes the updated derivative
down the graph.

Example:
f = (x + y) + sin(y)
| {z } | {z }
a b

Because of the chain rule, we multiply as we work our way down a branch. We also add
whenever multiple branches converge.

Each object has a backward() method that processes the derivative and passes it down the
graph. The backward methods can be coded as follows:

Var Class: backward method

class Var:
    def backward(self, s):
        self.grad += s
        if self.creator is not None:
            self.creator.backward(s)

Note: self.val, self.grad, and s must all have the same shape.

Op Class: backward method

class Op:
    def backward(self, s):
        for x in self.args:
            x.backward(s * dOp_dx)   # dOp_dx stands for ∂Op/∂x, supplied by each concrete Op

Notes:

• s must match the shape of the operation’s output.

• ∂Op/∂x is the derivative of the operation with respect to x.

• The chain rule is applied recursively.


• At each node, the gradient is propagated backward through the graph.
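A minimal runnable sketch of this design, with concrete Op classes for +, ∗, and sin (the class and helper names here are illustrative, not a prescribed API):

```python
import math

class Var:
    def __init__(self, val, creator=None):
        self.val, self.grad, self.creator = val, 0.0, creator

    def backward(self, s=1.0):
        self.grad += s                       # accumulate: converging branches add up
        if self.creator is not None:
            self.creator.backward(s)         # pass the derivative down the graph

class Add:                                   # f = a + b:  df/da = df/db = 1
    def __init__(self, a, b): self.args = (a, b)
    def backward(self, s):
        for x in self.args:
            x.backward(s * 1.0)

class Mul:                                   # f = a * b:  df/da = b, df/db = a
    def __init__(self, a, b): self.args = (a, b)
    def backward(self, s):
        a, b = self.args
        a.backward(s * b.val)
        b.backward(s * a.val)

class Sin:                                   # f = sin(a): df/da = cos(a)
    def __init__(self, a): self.args = (a,)
    def backward(self, s):
        (a,) = self.args
        a.backward(s * math.cos(a.val))

def add(a, b): return Var(a.val + b.val, creator=Add(a, b))
def mul(a, b): return Var(a.val * b.val, creator=Mul(a, b))
def sin(a):    return Var(math.sin(a.val), creator=Sin(a))

# f = sin(x) + x*y, as in the running example
x, y = Var(2.0), Var(3.0)
f = add(sin(x), mul(x, y))
f.backward(1.0)                              # seed df/df = 1 at the top
# By the chain rule, x.grad = cos(x) + y and y.grad = x
```

Note how x receives contributions from both the sin branch and the mul branch, and the `+=` in Var.backward adds them, exactly as described above.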

3.2 Neural Networks with Auto-Diff


Goal: To see how automatic differentiation (AD) can be used to implement backpropa-
gation for neural networks.

Optimization using AD
Consider a scalar function E that depends (possibly remotely) on some variable v.
Suppose we want to minimize E with respect to v, i.e.,
\[
\min_v E(v).
\]
We can use gradient descent:
\[
v \leftarrow v - \kappa \nabla_v E(v),
\]
where \(\nabla_v E(v)\) is the gradient of E with respect to v, and κ is the learning rate.

Pseudo-Code for Gradient Descent (using AD)

1. Initialize v, κ
2. Construct the expression graph for E
3. Until convergence:
   (a) Evaluate E at v
   (b) Set gradients to zero (i.e., v.grad = 0)
   (c) Propagate derivatives down (increment v.grad)
   (d) Update: v ← v − κ · v.grad

Neural Learning
We use the same process to implement error backpropagation for neural networks, and we
optimize w.r.t. (with respect to) the connection weights and biases.

To accomplish this, our network will be composed of a series of layers, each layer transforming
the data from the layer below it, culminating in a scalar-valued cost function.
There are two types of operations in the network:
1. Multiply by connection weights (including adding biases)
2. Apply activation function
Finally, a cost function takes the output of the network, as well as the targets, and returns
a scalar.
Let us consider this (very) small network:

Given dataset (X, T ):


A = identity
B = (z → z · W ) (multiply by W )
C = logistic function
D = Cost/Loss function

These are all functions that transform their input.


Each layer can be called like a function:
\[
x = A(X), \quad \text{e.g., } A(X) = X,
\]
\[
z = B(x), \quad \text{e.g., } B(x) = x \cdot W,
\]
\[
y = C(z), \quad \text{e.g., } C(z) = \sigma(z),
\]
\[
E = D(y, T), \quad \text{e.g., } D(y, T) = \mathbb{E}\Big[ \tfrac{1}{2} \| y - T \|_2^2 \Big].
\]
Each layer, including the cost function, is just a function in a nested mathematical expression:
\[
E = D\big(C(B(A(X))),\, T\big).
\]

Neural learning is:


W ← W − κ∇W E

We construct our network using objects from our AD classes (Variables and Operations) so
that we can take advantage of their backward() methods to compute the gradients.

net ≡ (A, B, C)
y = net(X) = C(B(A(X)))   (Forward pass sets network state)
E = D(y, T)

Then we take gradient steps:

E.zero_grad()   (Set all gradients to zero)
E.backward()    (Backward pass sets gradients)

Pseudo-Code

Algorithm 1 Gradient Descent


1: Given: Dataset (X, T)
2: Given: Network model net with parameters Θ
3: Given: Loss function Cost
4: Given: Learning rate κ
5: for epochs do
6: y ← net(X) ▷ (Forward pass)
7: loss ← Cost(y,T) ▷ (Evaluate loss)
8: loss.zero_grad() ▷ (Set all gradients to zero)
9: loss.backward() ▷ (Backpropagation)
10: Θ ← Θ − κΘ.grad ▷ (Gradient update)
11: end for

Matrix AD

To work with neural networks, our AD library will have to deal with matrix operations.

Example: Matrix Addition


Suppose our scalar function involved a matrix addition:
\[
L(y) \text{ is a scalar function, where } y = A + B, \quad A, B \in \mathbb{R}^{M \times N}.
\]
What are \(\nabla_A L\) and \(\nabla_B L\)? Writing \(s = \nabla_y L\),
\[
\nabla_A L = \nabla_y L \odot \nabla_A y = s \odot 1_{M \times N} \quad (\text{same shape as } A),
\]
\[
\nabla_B L = \nabla_y L \odot \nabla_B y = s \odot 1_{M \times N} \quad (\text{same shape as } B).
\]
As we saw in the previous lecture, within the plus operation’s backward method +.backward(s), we need to call the following commands:
A.backward(s ⊙ 1_{M×N})
B.backward(s ⊙ 1_{M×N})
(Recall that s is the same shape as the output of the operation, i.e., \(\nabla_y L\).)

Example: Matrix Multiplication


Suppose our scalar function involved a matrix multiplication:
\[
L(\ldots, A, B, \ldots), \quad \text{with } y = A * B, \quad
A \in \mathbb{R}^{M \times N},\; B \in \mathbb{R}^{N \times K},\; y \in \mathbb{R}^{M \times K}.
\]
Writing \(s = \nabla_y L\),
\[
\nabla_A L = s \cdot B^{\top} \qquad \big( (M \times K) \cdot (K \times N) \to M \times N \big),
\]
\[
\nabla_B L = A^{\top} \cdot s \qquad \big( (N \times M) \cdot (M \times K) \to N \times K \big).
\]
In the method *.backward(s), you will need to figure out in Assignment 2 how to complete the following commands:
A.backward(?)
B.backward(?)
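Without giving away the assignment, the first identity can be sanity-checked numerically with a finite difference. Here L is a made-up scalar function chosen so that ∇_y L is exactly s:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
s = rng.standard_normal((3, 2))     # plays the role of grad_y L

def L(Amat):
    # L = sum(s ⊙ (A B)), so grad_y L = s by construction
    return np.sum(s * (Amat @ B))

analytic = s @ B.T                  # claimed grad_A L, shape (3, 4)

# Finite-difference check of a single entry, dL/dA[0, 0]
eps = 1e-6
A_pert = A.copy()
A_pert[0, 0] += eps
numeric = (L(A_pert) - L(A)) / eps
```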
Chapter 4

Generalizability

55
CHAPTER 4. GENERALIZABILITY 56

4.1 Overfitting
Goal: Develop a process to use our labelled data to generate models that can predict
future, unseen samples.

Suppose you have a dataset of drug dosage vs. your blood-sugar level.
Your doctor would like to train a neural network so that, given a dose, she can predict your
blood sugar.

That dataset has 6 samples, and since this is a regression problem, we will use a linear
activation function on the output, and MSE as a loss function.
Your doctor creates a neural network with 1 input node, two hidden layers, each with 250
ReLU nodes, and 1 output node, and trains it on your dataset for 2000 epochs.

The doctor wants to give you a dose of 0.65, so she uses the network to estimate what your
blood sugar will be.
Blood sugar is 1.0043
Does this seem reasonable?

This week’s exercises will show you that the neural network model looks very much like a
linear interpolator of the points.

Suppose the doctor takes 300 more blood samples from you, at a variety of different doses.
Once you’re drained of blood, she runs the dataset through her model to see what the MSE
loss is:
MSE loss is 0.018

That’s orders of magnitude worse than the 1.5 × 10−10 on the training dataset.

The false sense of success we get from the results on our training dataset is known as
overfitting or overtraining.

Suppose your model has enough flexibility and you train it long enough. In that case, it
will start to fit the specific points in your training dataset, rather than fit the underlying
phenomenon that produced the noisy data.

Recall that our sole purpose was to create a model to predict the output for samples it hasn’t
seen. How can we tell if we are overfitting?

A common practice is to keep some of your data as test data, which your model does not
train on.

A large discrepancy between test loss and train loss is a sign of overfitting. Notice
that the test loss is going up!
We are looking for a balance between underfitting and overfitting as demonstrated here:

4.2 Combatting Overfitting


Goal: See some tricks for how to mitigate against overtraining.

We saw that if a model has enough degrees of freedom, it can become hyper-adapted to the
training set, and start to fit the noise in the dataset.

Training error is very small.


This is a problem because the model does not generalize well to new samples.

Test error is much bigger than training error, and gets worse with more training.
There are some strategies to try to stop our network from trying to fit the noise.

Validation

If we want to estimate how well our model will generalize to samples it hasn’t trained on,
we can withhold part of the training set and try our model on that validation set. Once our
model does reasonably well on the validation set, then we have more confidence that it will
perform reasonably well on the test set.

It’s common to use a random subset of the training set as a validation set.

Regularization by Weight Decay

We can limit overfitting by creating a preference for solutions with smaller weights, achieved
by adding a term to the loss function that penalizes the magnitude of the weights:

\[
\tilde{E}(\hat{y}, t; \theta) = E(\hat{y}, t; \theta) + \frac{\lambda}{2} \|\theta\|_F^2,
\]
where \(E(\hat{y}, t; \theta)\) is the original loss function, \(\|\theta\|_F = \sqrt{\sum_j \theta_j^2}\) is the Frobenius norm of the weights, and λ controls the weight of the regularization term.

How does this change our gradients, and thus our update rule?
\[
\nabla_{\theta_i} \tilde{E} = \nabla_{\theta_i} E + \lambda \theta_i.
\]
As a result:
\[
\theta_i \leftarrow \theta_i - \kappa \nabla_{\theta_i} E - \kappa \lambda \theta_i.
\]

• Original test loss = 0.01728, with \(\|\theta\|_F^2 = 254.4\).

• Weight-decay test loss = 0.01219, with \(\|\theta\|_F^2 = 7.1\).

One can also use different norms. For example, it is common to use the L1 norm:
\[
L_1(\theta) = \sum_i |\theta_i|.
\]

The L1 norm tends to favour sparsity (most weights are close to zero, with only a small
number of non-zero weights).
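The two penalties push the weights toward zero in different ways, which a small sketch makes concrete (λ and the weight values are arbitrary):

```python
import numpy as np

def penalty_grad(theta, lam, norm="L2"):
    """Gradient of the regularization term alone.
    L2: (lam/2) * ||theta||_F^2  ->  lam * theta        (shrinks big weights more)
    L1: lam * sum |theta_i|      ->  lam * sign(theta)  (constant-size push to 0)
    """
    if norm == "L2":
        return lam * theta
    # note |theta| is not differentiable at 0; np.sign(0) = 0 is used here
    return lam * np.sign(theta)

theta = np.array([0.5, -2.0, 0.0])
g2 = penalty_grad(theta, lam=0.1, norm="L2")
g1 = penalty_grad(theta, lam=0.1, norm="L1")
```

The L2 gradient scales with the weight itself, while the L1 gradient has the same magnitude for every non-zero weight, which is what drives small weights all the way to zero (sparsity).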

Other Methods to Combat Overfitting


• Data augmentation: Apply transformations to some data samples to create more
samples.
e.g. rotate and shift images.
• Dropout: Randomly disable hidden nodes so the network has to learn to distribute
its computation.
Chapter 5

Optimization Considerations

61
CHAPTER 5. OPTIMIZATION CONSIDERATIONS 62

5.1 Enhancing Optimization


Goal: To learn some methods that help learning go faster.

Stochastic Gradient Descent


Computing the gradient of the cost function can be very expensive and time-consuming, especially if you have a huge training set. We define the cost function as:
\[
E(y, \tau) = \frac{1}{D} \sum_{d=1}^{D} L(y_d, t_d).
\]
Rather than compute the full gradient, we can try to get a cheaper estimate by computing the gradient from a random sampling.
Let γ be a random sampling of B elements from {1, 2, . . . , D}. We can estimate E(y, τ) as:
\[
E(y, \tau) \approx E(\tilde{y}, \tilde{\tau}) = \frac{1}{B} \sum_{d=1}^{B} L\big(y_{\gamma_d}, t_{\gamma_d}\big),
\]

where γi ∈ {1, 2, . . . , D} are indices of randomly sampled elements.


We refer to the set
\[
\big\{ (y_{\gamma_1}, t_{\gamma_1}), (y_{\gamma_2}, t_{\gamma_2}), \ldots, (y_{\gamma_B}, t_{\gamma_B}) \big\}
\]
as a batch, or mini-batch.
We use the estimate from this batch to update our weights, and then choose subsequent
batches from the remaining samples, etc. This method is called Mini-batch Gradient
Descent.
Note: Stochastic Gradient Descent (SGD) often refers to using just 1 sample at
a time.
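A sketch of how the random index sets γ can be drawn so that each epoch covers the dataset once (the helper name and toy data below are made up):

```python
import numpy as np

def minibatches(X, T, batch_size, rng):
    """Yield (X_batch, T_batch) pairs in a fresh random order each call."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        gamma = idx[start:start + batch_size]   # the random index set
        yield X[gamma], T[gamma]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)   # 10 toy samples
T = 2 * X
batches = list(minibatches(X, T, batch_size=4, rng=rng))
# 10 samples with B = 4 gives batch sizes 4, 4, 2; together they cover the data
```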

Momentum
Consider gradient descent optimization in these situations:

Case 1:
• The optimization involves oscillations that are inefficient.
• This is represented by level sets of the loss function in the space of weights.
• The trajectory exhibits oscillatory behavior while converging to the minimum.

Case 2:
• The optimization stops in a shallow local minimum.
• However, we would prefer the optimization to “get over the hump” and reach the
deeper minimum.
• The trajectory may fail to escape the local basin of attraction.

A technique to improve our prospects in both situations is called momentum.

Current Approach:
Thus far, we have been moving through parameter space by stepping in the opposite direction
of the gradient:
θn+1 = θn − η∇θ E

Momentum as a Force:
But what if we thought of the gradient as a force that pushes us?
Recall from physics, if θ is our position:
\[
\frac{d\theta}{dt} = v \quad \text{(velocity)}, \qquad
\frac{dv}{dt} = a \quad \text{(acceleration)}.
\]

In this analogy, momentum helps us overcome the inefficiencies in Case 1 (oscillations) and
Case 2 (local minimum trapping), by incorporating the gradient as a force to push the
optimization process forward.

So, solving numerically using Euler’s method, we obtain updates of the parameters and
velocity using the following equations:

θn+1 = θn + ∆t vn ,
vn+1 = (1 − r)vn + ∆t An .

Here:

• r represents resistance, such as friction.

• An is the acceleration (gradient-based force).

Example: When driving a car, A corresponds to the gas pedal, brakes, or even steering
adjustments.

Momentum and Error Gradients

What if we treat our error gradients as A? By integrating the gradients, we gain velocity v,
and thus momentum. Recall the physics definition of momentum:

momentum: ρ = mv.

Intuition

It’s like our weights are dictated by their location in parameter space. We move through
weight space, accelerated by the error gradients. Over time:

• We build speed.

• If we maintain acceleration in the same direction, we gain significant momentum.

Visualization

In parameter space:

• A represents acceleration (gradient).

• v represents velocity.

• w moves toward an optimal point.



[Figure: a trajectory through parameter space (θ1, θ2), showing the acceleration A (the gradient), the velocity v, and the weights w curving toward the optimum.]

The trajectory begins with slower motion, but momentum builds over time, helping us overcome inefficiencies and improve optimization performance.
Or, as is commonly used (re-parameterized):
\[
v^{(t)} \leftarrow \beta v^{(t-1)} + \nabla_w E.
\]

Then, update our weights using:

w(t) ← w(t−1) − ηv (t)

This approach not only smooths out oscillations but can also help to avoid getting stuck in
local minima.
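The two update equations can be sketched on a quadratic bowl (η, β, and the test function are arbitrary choices, not prescribed values):

```python
import numpy as np

def sgd_momentum(grad, w0, eta=0.1, beta=0.9, steps=300):
    """v <- beta*v + grad(w);  w <- w - eta*v"""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - eta * v
    return w

# E(w) = 0.5 * ||w||^2 has gradient w; the minimum is at the origin
w_final = sgd_momentum(lambda w: w, w0=[4.0, -3.0])
```

With β = 0 this reduces to plain gradient descent; larger β means past gradients keep contributing to the velocity for longer.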

• The image above shows the comparison of optimization techniques such as SGD, Mo-
mentum, NAG, Adagrad, Adadelta, and RMSprop.
• Source: [Link]

Additional techniques include:


• Nesterov Momentum: A modification of momentum that looks ahead before up-
dating.
• Adam: Combines momentum with adaptive learning rates for better convergence.

5.2 Deep Neural Networks

Goal: To see the advantages and disadvantages of deep neural networks: representational
power vs. vanishing or exploding gradients.

How many layers should our neural network have?

Recall the Universal Approximation Theorem:

Theorem: Let σ be any continuous sigmoidal function. Then finite sums of the form
\[
G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\big( \omega_j^{\top} x + \theta_j \big)
\]
are dense in \(C(I_n)\). In other words, given any \(f \in C(I_n)\) and \(\varepsilon > 0\), there is a sum, G(x), of the above form, for which
\[
|G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n.
\]
Here, \(I_n = [0, 1]^n\).


Cybenko, G., “Approximation by Superpositions of a Sigmoidal Function,”
Math. Control Signals Systems, 2:303-314, 1989.

Thus, we really only ever need one hidden layer. But is that the best approach in terms of
the number of nodes or learning efficiency? Not necessarily: it can be shown that such a
shallow network may require an exponentially large number of nodes (i.e. a very large N ) to
work.

Hence, a deeper network is preferred in many cases.

So, why don’t we always use really deep networks?

Vanishing Gradients
Suppose the initial weights and biases were large enough that the input current to many of the nodes was not too close to zero. As an example, consider one of the output nodes, where we have \(z_1^{(l)} = 5\). Then
\[
y_1 = \sigma\big(z_1^{(l)}\big) = \frac{1}{1 + e^{-5}} \approx 0.9933,
\qquad
\frac{dy_1}{dz_1} = y_1 (1 - y_1) \approx 0.0066.
\]

Compare that to the case where the input current was 0.1. Then
\[
y_1 = \sigma(0.1) = 0.525, \qquad
\frac{dy_1}{dz_1} = y_1 (1 - y_1) = 0.525 \times 0.475 = 0.249,
\]
which is almost 40 times larger than 0.0066.

Recall that for a sigmoid function, the slope (its derivative) is largest near inputs close to
zero and becomes very small for inputs far from zero. Therefore, when the input currents
are large in magnitude, the derivative is small, and hence the updates to the weights (based
on the gradient) will also be smaller.

What about the next layer down?

Suppose
\[
\big\| \nabla_{z^{(3)}} E \big\|_2 = 0.01.
\]
Define the unit vector
\[
U_{z^{(3)}} = \frac{\nabla_{z^{(3)}} E}{\big\| \nabla_{z^{(3)}} E \big\|_2},
\qquad \text{so that} \qquad
\nabla_{z^{(3)}} E = 0.01 \, U_{z^{(3)}}.
\]

What if the inputs to the penultimate layer were around 4 in magnitude? Then the corresponding slopes of their sigmoid functions will also be small:
\[
\sigma(4) = 0.982, \qquad \sigma'(4) = 0.0177.
\]
Recall that
\[
\frac{\partial E}{\partial z_j^{(2)}}
= \sigma'\big(z_j^{(2)}\big) \odot \Big[ \nabla_{z^{(3)}} E \, \big(W^{(2)}\big)^{\top} \Big]_j.
\]
Plugging in the small derivative factor and the scale for \(\nabla_{z^{(3)}} E\), we get:
\[
\frac{\partial E}{\partial z_j^{(2)}}
= (0.0177)\,(0.01)\, \Big[ U_{z^{(3)}} \big(W^{(2)}\big)^{\top} \Big]_j
= (0.000177)\, \Big[ U_{z^{(3)}} \big(W^{(2)}\big)^{\top} \Big]_j.
\]

As we move to deeper layers, these factors can become even smaller. When this happens,
learning effectively comes to a halt in those layers. This is referred to as the vanishing
gradients problem.

Another way to look at it: Consider this simple but deep network:
\[
x \xrightarrow{\;w^{(1)},\, b^{(1)}\;} h^{(1)}
  \xrightarrow{\;w^{(2)},\, b^{(2)}\;} h^{(2)}
  \xrightarrow{\;w^{(3)},\, b^{(3)}\;} h^{(3)}
  \xrightarrow{\;w^{(4)},\, b^{(4)}\;} h^{(4)}
  \xrightarrow{\;w^{(5)},\, b^{(5)}\;} y,
\]
with target t and loss E(y, t).

Start with the loss at the output side: E(y, t). The gradient with respect to the input current \(z^{(5)}\) of the output node is
\[
\frac{\partial E}{\partial z^{(5)}} = y - t, \qquad \nabla_{z^{(5)}} E = y - t.
\]
Then, using backprop, we can write a single formula for
\[
\nabla_{z^{(4)}} E = (y - t)\, w^{(5)}\, \sigma'\big(z^{(4)}\big).
\]
Going deeper, we continue in the same way:
\[
\nabla_{z^{(1)}} E = (y - t)\, w^{(5)}\, \sigma'\big(z^{(4)}\big)\, w^{(4)}\, \sigma'\big(z^{(3)}\big)\, w^{(3)}\, \sigma'\big(z^{(2)}\big)\, w^{(2)}\, \sigma'\big(z^{(1)}\big).
\]
What is the steepest slope that σ′(z) attains for the logistic σ?
\[
\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big), \qquad 0 \le \sigma(z) \le 1.
\]
Its maximum value is 0.25, occurring at σ(z) = 0.5.



Hence, all else being equal, the gradient can shrink by a factor of at least 4 at
each layer. This is one of the key observations in the vanishing gradients discussion.

We can observe the vanishing gradient phenomenon by looking at the norm of the gradients at each layer:
\[
\big\| \nabla_{z^{(i)}} E \big\|^2 = \sum_j \Big( \frac{\partial E}{\partial z_j^{(i)}} \Big)^2.
\]
For a network with 5 layers (labeling them from the output back to the input), we might see something like:

layer 4 norm = 0.7598, gradient ≈ [0.76];
layer 3 norm = 0.2266, gradient ≈ [−0.038, −0.217, −0.039, −0.039];
layer 2 norm = 0.0316, gradient ≈ [0.018, −0.018, −0.007, 0.018];
layer 1 norm = 0.0094, gradient ≈ [−0.004, 0.002, 0.001, 0.008].

Notice how the gradient norm becomes smaller as we move toward the earlier layers.
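The shrink-by-at-least-4 argument can be checked directly on a toy scalar chain with unit weights (w and the input value are arbitrary choices for this sketch):

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))    # maximum value 0.25, at z = 0

# Forward pass through a 5-layer scalar chain with w = 1, b = 0
w, h = 1.0, 0.3
zs = []
for _ in range(5):
    z = w * h
    zs.append(z)
    h = sigma(z)

# Backward pass: each layer multiplies the gradient by w * sigma'(z) <= 0.25
g = 1.0                       # stand-in for dE/dz at the output layer
grads = [g]
for z in reversed(zs[:-1]):
    g = g * w * dsigma(z)
    grads.append(abs(g))
# each entry is more than 4x smaller than the previous one
```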

Skip Connections
Skip connections (also called residual connections) allow the network to bypass one or more
layers by directly adding the input of a previous layer to a later one. This creates an explicit
short path for gradient flow, which helps mitigate the problem of vanishing gradients and
allows much deeper networks to be trained effectively. Intuitively, the skip path enables the

model to learn residual mappings (i.e., deviations from the identity), making optimization
easier and convergence faster.

Example of a forward pass with a skip connection

h3 = σ(z3 ) = σ(w3 h2 + b3 + h1 )

Backward Pass (Gradient Derivation)

\[
\frac{\partial L}{\partial w_1}
= \underbrace{\frac{\partial L}{\partial z_3} \frac{\partial z_3}{\partial h_2} \frac{\partial h_2}{\partial z_2} \frac{\partial z_2}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_1}}_{\text{direct path term}}
+ \frac{\partial L}{\partial z_3} \frac{\partial z_3}{\partial h_1} \frac{\partial h_1}{\partial z_1} \frac{\partial z_1}{\partial w_1}.
\]
The second term corresponds to the path with a skip connection. Thus,
\[
\frac{\partial L}{\partial w_1}
= \frac{\partial L}{\partial z_3}\, w_3\, \sigma'(z_2)\, w_2\, \sigma'(z_1)\, x
+ \frac{\partial L}{\partial z_3} \times 1 \times \sigma'(z_1) \times x.
\]
The second term on the right-hand side has fewer factors than the direct path term (which is all we get when there is no skip connection). This explains how skip connections can enhance the magnitude of the gradients and mitigate the vanishing gradient problem in deep neural nets, especially when the weights are very small.
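Plugging in small numbers shows the effect; this is a hypothetical sketch with tiny weights, matching the "weights are very small" scenario above:

```python
import numpy as np

sigma  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

# Tiny weights make the direct path's product of factors tiny
x, w1, w2, w3 = 1.0, 0.05, 0.05, 0.05
z1 = w1 * x
h1 = sigma(z1)
z2 = w2 * h1

# The two contributions to dL/dw1 (each multiplies dL/dz3):
direct = w3 * dsigma(z2) * w2 * dsigma(z1) * x    # through h2 (no skip)
skip   = 1.0 * dsigma(z1) * x                     # through the skip connection

# The skip path dominates, keeping the overall gradient from vanishing
```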

Effect on Training Loss

[Figure: training loss vs. number of layers, with curves for theory, practice, and practice with skip connections.]

Exploding Gradients
A similar (though less frequent) phenomenon can cause very large gradients. For instance, consider a deep chain of neurons, each with
\[
w = 8, \qquad b = -4, \qquad \sigma'(z) = \tfrac{1}{4}.
\]
Then each layer multiplies the gradient by \(w\,\sigma'(z) = 2\), so over four layers the backprop calculation yields
\[
\frac{\partial E}{\partial z^{(1)}} = 16 \times \frac{\partial E}{\partial z^{(5)}}.
\]
This “exploding” behavior occurs if the weights are large and the biases position the inputs in the high-slope region of the logistic function (hence each layer can amplify the gradient).

Note on Gradient Clipping

To prevent unstable updates when the gradient of the loss E with respect to a parameter θ becomes too large, its magnitude can be limited to a maximum threshold τ:
\[
\frac{\partial E}{\partial \theta} \leftarrow
\frac{\tau}{\big\| \frac{\partial E}{\partial \theta} \big\|} \frac{\partial E}{\partial \theta}
\quad \text{if } \Big\| \frac{\partial E}{\partial \theta} \Big\| > \tau.
\]
This keeps the update direction unchanged but bounds its size, ensuring stable learning when gradients would otherwise explode.
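The rule translates directly into a few lines (the function name is assumed, not from any particular library):

```python
import numpy as np

def clip_gradient(g, tau):
    """Rescale g to norm tau when its norm exceeds tau; direction is unchanged."""
    norm = np.linalg.norm(g)
    if norm > tau:
        return (tau / norm) * g
    return g

g = np.array([30.0, 40.0])                             # norm 50
clipped = clip_gradient(g, tau=5.0)                    # rescaled to norm 5
small = clip_gradient(np.array([0.3, 0.4]), tau=5.0)   # left untouched
```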
Chapter 6

Vision

73
CHAPTER 6. VISION 74

6.1 Your Visual System

Goal: To learn about the mammalian visual system, which is the basis for many neural-network vision systems.

Most of the networks we have looked at assume an all-to-all connectivity between populations
of neurons, like between layers in a network. But that’s not the way our brains are wired,
thankfully.

If every one of your 86 billion neurons was connected to every other neuron, your head would
have to be much bigger.

It’s Layered

Although the details are more complicated, your visual system is roughly arranged into a
hierarchy of layers.

The image1 shows how visual information flows from the retina through successive processing
stages in the lateral geniculate nucleus (LGN) and primary visual cortex (V1), onward to
higher areas such as V2, V4, and IT. Ultimately, these layers of processing allow your brain
to form coherent perceptions (e.g., recognizing a cat).

It’s Topological

Neurons close to each other in the primary visual cortex process parts of the visual scene
that are close to each other.

1 Source: Kubilius, Jonas (2017): Ventral visual stream. [Link] 106794.v3.

Images from: [Link]

This dot in the visual field only excites a small patch of neurons in V1. Each neuron in V1
is only activated by a small patch in the visual field.

Conversely, each patch in the visual field excites only a small neighborhood of neurons in
V1.

This topological mapping between the visual field and the surface of the cortex is called a
retinotopic mapping.

Moreover, neurons in V1 project to the next layer, V2, and again, the connections are
retinotopically local. The “footprint” of a region in the visual field gets larger as the data
progresses up the layers. By the time signals reach IT (Inferior Temporal cortex), the neurons
may be influenced by the entire visual field.

It’s a bit like a filter bank


In the lower levels of the hierarchy, the neurons seem to respond to standard patterns of
input.

Each little square corresponds to one V1 neuron and shows the pattern that most activates
that neuron—its receptive field.

Figure 6.1: Comparison of V1 receptive fields in a macaque (A) versus those learned by
a neural network, SAILnet (B). Adapted from Zylberberg, Joel; Timothy Murphy, Jason;
Robert DeWeese, Michael (2015): SAILnet learns receptive fields (RFs) with the same di-
versity of shapes as those of simple cells in macaque primary visual cortex (V1). PLOS
Computational Biology. [Link]

Each column of squares is a different neuron’s preferred stimulus. Notice that both the
macaque and the neural network learn a diverse set of oriented patterns, reflecting the
system’s attempt to efficiently encode natural images. You are encouraged to take a careful
look at the following demo through this demo link.

6.2 Convolutional Neural Networks (CNNs)


Goal: To see how we can take advantage of some of our visual system’s features in
artificial neural networks.

Inspired by the brain’s topological (retinotopic) connectivity, and in an effort to reduce the
number of connection weights that must be learned, scientists devised the Convolutional
Neural Network (CNN).

Convolution
In typical CNNs, g (often called a kernel or filter) has small spatial support, e.g. a D × D region:
\[
g \in \mathbb{R}^{D \times D}.
\]
When we apply g to an image or feature map f, we take an inner product of g with the “patch” of f whose upper-left corner is at (n, m). Then we can write the (2D) filtering operation at (n, m) as
\[
(f * g)_{n,m} = \sum_{i=0}^{D-1} \sum_{j=0}^{D-1} f_{i+n,\, j+m}\, g_{i,j}, \tag{6.1}
\]
where ∗ is the convolution operator².

Then we add a bias term:
\[
z_{n,m} = (f * g)_{n,m} + b.
\]
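Equation (6.1) translates into a short sketch with explicit loops, here using no padding and stride 1 (the toy image and kernel values are made up):

```python
import numpy as np

def conv2d(f, g, b=0.0):
    """z_{n,m} = sum_{i,j} f_{i+n, j+m} g_{i,j} + b   (no padding, stride 1)."""
    N, M = f.shape
    D = g.shape[0]
    out = np.empty((N - D + 1, M - D + 1))
    for n in range(out.shape[0]):
        for m in range(out.shape[1]):
            out[n, m] = np.sum(f[n:n + D, m:m + D] * g) + b
    return out

f = np.arange(16.0).reshape(4, 4)       # a toy 4x4 "image"
g = np.ones((2, 2)) / 4.0               # a 2x2 averaging kernel
z = conv2d(f, g)                        # 3x3 output: the kernel stays inside f
```

As the footnote notes, this is strictly a correlation; reversing g in both axes would turn it into a true convolution.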

Multiple Kernels per Layer


A single convolutional layer often uses many kernels (filters). For example, a layer might
have K = 128 different kernels. Each kernel produces one output feature map; thus, if there
are K kernels, we end up with K feature maps. Symbolically, if
f ∈ RN ×N
2 Actually, (6.1) describes a correlation operation, but this correlation sum can be turned into a convolution sum by simply reversing g, replacing \(g_{i,j}\) with \(g_{D-1-i,\, D-1-j}\).

is our input (just one channel for simplicity), and
\[
g^{(k)} \in \mathbb{R}^{D \times D} \quad \text{for } k = 1, \ldots, K
\]
are our K kernels, then each kernel produces an output map \(z^{(k)} \in \mathbb{R}^{N \times N}\) (depending on how we pad/stride). Concretely,
\[
z^{(k)}_{n,m} = \big(f * g^{(k)}\big)_{n,m} + b^{(k)},
\]

where b(k) is the bias (scalar) for the k-th kernel.

Multi-channel Inputs
If the input f has multiple channels (e.g. an RGB image with C = 3 channels), then each
kernel must have the same number of channels. In this scenario,

f ∈ RN ×N ×C , g ∈ RD×D×C .

We do not convolve across channels; rather, we convolve each channel separately and sum across them. For a single kernel,
\[
z_{n,m} = \sum_{c=1}^{C} \big( f_{\cdot,\cdot,c} * g_{\cdot,\cdot,c} \big)_{n,m} + b,
\]
c=1

where b is a bias term (one per kernel). For K kernels, each has its own parameters g (k) and
bias b(k) , yielding K output feature maps.
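The multi-channel, multi-kernel case above can be sketched the same way. `conv_layer` and its argument shapes are our naming conventions, assumed for this example.

```python
import numpy as np

def conv_layer(f, kernels, biases):
    """f: (N, N, C) input; kernels: (K, D, D, C); biases: (K,).
    Convolves each channel separately and sums across channels,
    producing one valid-mode feature map per kernel."""
    N = f.shape[0]
    K, D = kernels.shape[0], kernels.shape[1]
    out = np.empty((K, N - D + 1, N - D + 1))
    for k in range(K):
        for n in range(N - D + 1):
            for m in range(N - D + 1):
                # sum over the D x D window AND over all C channels
                out[k, n, m] = np.sum(f[n:n + D, m:m + D, :] * kernels[k]) + biases[k]
    return out
```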

Edge cases
At the edges of the image, the convolution window (kernel) might partially lie outside the
valid region of f . We typically have two common choices:
• Choice 1 (Padding): Pad the input f with zeros (or another value) so that the kernel
can be applied at the borders without losing spatial resolution.

• Choice 2 (No Padding): Do not pad. In this case, the output z is smaller than the
input f , because we only convolve where the kernel is fully within the image boundaries.

Stride
The stride is the number of pixels by which we move or “slide” the kernel between consecutive
convolution positions.

• Stride = 1: The kernel is shifted by 1 pixel each time (typical default), so we sample
every adjacent position.

• Stride = 3 (example): The kernel is shifted by 3 pixels between each convolution.


This leads to a smaller output feature map because we skip more positions.

In general, larger stride reduces the output’s spatial dimension.
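The effect of padding and stride on the output size follows the standard formula ⌊(N − D + 2P)/S⌋ + 1, which is not stated explicitly in the notes but is consistent with the two choices above:

```python
def conv_output_size(N, D, padding=0, stride=1):
    """Spatial size of the output feature map for an N-wide input,
    D-wide kernel, given padding P and stride S: floor((N - D + 2P)/S) + 1."""
    return (N - D + 2 * padding) // stride + 1
```

For example, with no padding and stride 1 the output shrinks by D − 1; with "same" padding P = (D − 1)/2 (odd D) and stride 1, the output size equals the input size.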

Why CNNs?
A Parameter counting argument

Consider a 100 × 100 input image and a 100 × 100 hidden representation.

Fully connected approach: if we flatten the 100 × 100 image into a 10 000-dimensional
vector, and want a 10 000-dimensional hidden layer, we have

10 000 × 10 000 = 10^8

connection weights (not including biases). That is extremely large.

Convolutional approach: suppose we use a convolutional layer with 64 kernels, each of


size 5 × 5. Each kernel has
5 × 5 = 25 weights

plus 1 bias term, giving 26 parameters per kernel. Hence, for 64 such kernels:

64 × (25 + 1) = 64 × 26 = 1664

total trainable parameters. This is much smaller than 10^8, yet CNNs with relatively few
parameters still perform extremely well in practice.

Locality argument

FFNNs do not exploit properties such as locality and translational invariance that are inher-
ent to many physical systems and images ⇒ a lot of spatial information is lost. To illustrate
this point, let us consider a 4 × 4 image (picture, Ising spin configuration, etc.). Let us label
the pixels (sites) as:

13 14 15 16
9 10 11 12
5 6 7 8
1 2 3 4

When taken as input to a FFNN, the image gets flattened:

1
2
3
4    (spatially close to 3)
5    (not spatially close to 4)
⋮
16

It is clear from this flattening example that locality is easily lost!



Online tools for visualizing CNNs:


• Dumoulin and Visin, “Convolution animations” section of [Link] conv_arithmetic/blob/master/[Link].
• Powell, [Link]
• CNN Explainer, [Link]
• Drawing CNN architectures: [Link]
Chapter 7

Unsupervised Learning


7.1 Hopfield Networks


Goal: Train a Hopfield network to converge to a set of target states by minimizing the
Hopfield energy.

Content-Addressable Memory (CAM)


A CAM is a system that can take part of a pattern, and produce the most likely match from
memory.

A CAM for instance would be able to interpret these:


intel gent nueroscience
Wa rloo pa$sv0rd
1,3, ,7, ,11

A CAM system can find an input’s closest match to a set of known patterns. It retrieves
data by directly comparing input queries with stored memory locations.

Hopfield Networks mimic the behavior of CAM in a biologically inspired way using neural
networks.

Hopfield Networks
Suppose we have a network of N neurons, each connected to all the others. We want this
network to converge to the nearest of a set of M targets or inputs.

Targets or Inputs: X = {⃗x(s) ∈ {−1, 1}N | s = 1, . . . , M }

John Hopfield proposed this network in a paper in 1982, and won the Nobel Prize in physics
in 2024 for it (with Hinton).


x_i = \begin{cases} -1 & \text{if } (\vec{x}W)_i + b_i < 0 \\ 1 & \text{if } (\vec{x}W)_i + b_i \ge 0 \end{cases}    (7.1)

Example: ⃗x(1) = [1, −1, 1, −1]

 
W = \begin{pmatrix} 0 & -1 & 1 & -1 \\ -1 & 0 & -1 & 1 \\ 1 & -1 & 0 & -1 \\ -1 & 1 & -1 & 0 \end{pmatrix} \qquad \vec{b} = [0\; 0\; 0\; 0]

One target is easy. What if we have many targets?

Problem: The graph of the Hopfield net has cycles, so backpropagation won’t work.

Hopfield Energy (for symmetric W )


E = -\frac{1}{2} \sum_i \sum_{j \ne i} x_i W_{ij} x_j - \sum_i b_i x_i

E = -\frac{1}{2} \vec{x} W \vec{x}^\top - \vec{b}\, \vec{x}^\top

where W_{ii} = 0.

To minimize energy (like physics tends to do), we use gradient descent.

\frac{\partial E}{\partial x_j} = -\sum_{i \ne j} x_i W_{ij} - b_j,

or

\nabla_{\vec{x}} E = -\vec{x} W - \vec{b} \quad \Rightarrow \quad \tau \frac{d\vec{x}}{dt} = \vec{x} W + \vec{b} \quad \text{(similar to Eq. (7.1))}.

If i ̸= j:

\frac{\partial E}{\partial W_{ij}} = -x_i x_j.

If i = j:

\frac{\partial E}{\partial W_{ii}} = -x_i^2 = -1.
As a result, the gradient vector is:

\nabla_W E = -\vec{x}^\top \vec{x} + I_{N \times N},

where ⃗x⊤⃗x is a rank-1 N × N matrix. We add the identity matrix to the right-hand side, so
that the gradient of the diagonal weights is zero to keep Wii = 0 when performing gradient
descent.
Over all M targets, we have:
\nabla_W E = -\frac{1}{M} \sum_{s=1}^{M} (\vec{x}^{(s)})^\top \vec{x}^{(s)} + I = -\frac{1}{M} X^\top X + I.

Thus:

W \leftarrow W + \kappa \left( \frac{1}{M} X^\top X - I \right),

where X^\top X computes coactivation states between all pairs of neurons.
Because the input patterns X are fixed, the coactivation matrix \frac{1}{M} X^\top X - I remains constant across iterations, so the gradient direction does not change, and repeated updates simply move W linearly towards the steady-state solution, which is proportional to W^* = \frac{1}{M} X^\top X - I.
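The training rule and the sign update of Eq. (7.1) can be sketched together in NumPy. The function names `train_hopfield` and `recall`, and the synchronous update, are our illustrative choices.

```python
import numpy as np

def train_hopfield(X, kappa=1.0):
    """X: (M, N) matrix of ±1 target patterns, one per row.
    Starting from W = 0, one gradient step already gives a multiple of the
    steady-state solution W* = X^T X / M - I (symmetric, zero diagonal)."""
    M, N = X.shape
    return kappa * (X.T @ X / M - np.eye(N))

def recall(W, x, b=None, iters=10):
    """Iterate the sign update of Eq. (7.1) from a (possibly partial) state x."""
    b = np.zeros(len(x)) if b is None else b
    x = x.copy()
    for _ in range(iters):
        x = np.where(x @ W + b >= 0, 1, -1)  # Eq. (7.1), applied to all units
    return x
```

A quick check is that each stored pattern is a fixed point of the update, which is exactly the content-addressable-memory behaviour described above.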

7.2 Restricted Boltzmann Machines (RBMs)


Goal: To formulate unsupervised learning as energy minimization, and train a Restricted
Boltzmann Machine (RBM).

The topic of this lecture is related to the 2024 Nobel Prize in Physics, which went to:
• John Hopfield (Hopfield nets, 1982)
• Geoff Hinton (RBMs, 1983–1985)

Restricted Boltzmann Machines (RBMs)

The network consists of:


• A “hidden” layer: h ∈ {0, 1}n
• A “visible” layer: v ∈ {0, 1}m , because this layer interacts with the environment.
Connections between layers are symmetric, represented by weight matrix W .

RBM energy
Similar to the Hopfield network, an RBM is characterized by an energy:
E(v, h) = -\sum_{i=1}^{m} \sum_{j=1}^{n} v_i W_{ij} h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j

This can be rewritten as:


E(v, h) = −vW hT − bv T − chT
where W ∈ Rm×n . The terms in this equation correspond to:
• Discordance cost: −vW hT
• Operating cost: −bv T − chT

Boltzmann Probability
The RBM network states are visited with the Boltzmann probability:
q(v, h) = \frac{1}{Z} e^{-E(v,h)},

where the partition function Z is defined as:

Z = \sum_{v,h} e^{-E(v,h)}

Since lower-energy states are visited more frequently:

E(v (1) , h(1) ) < E(v (2) , h(2) ) ⇒ q(v (1) , h(1) ) > q(v (2) , h(2) )

Training an RBM as a Generative Model


Suppose our inputs v ∼ p(v). We want an RBM to behave as a generative model qθ such
that:
\max_\theta \mathbb{E}_{v \sim p}[\ln q_\theta(v)] \quad \text{or equivalently} \quad \min_\theta \mathbb{E}_{v \sim p}[-\ln q_\theta(v)]

Let the loss function be:


L = − ln qθ (V ) for a given V

Expanding q_\theta(V):

\mathcal{L} = -\ln\left( \frac{1}{Z} \sum_h e^{-E_\theta(V,h)} \right)

Rewriting:

\mathcal{L} = -\ln\left( \sum_h e^{-E_\theta(V,h)} \right) + \ln\left( \sum_v \sum_h e^{-E_\theta(v,h)} \right)

Thus, we decompose the loss into:


L = L 1 + L2

Gradient of L1
To optimize the parameters, we need to compute the gradient of the loss function:
\nabla_\theta \mathcal{L}_1 = -\frac{\nabla_\theta \sum_h e^{-E_\theta(V,h)}}{\sum_h e^{-E_\theta(V,h)}}

= \frac{\sum_h e^{-E_\theta(V,h)}\, \nabla_\theta E_\theta(V,h)}{\sum_h e^{-E_\theta(V,h)}}

= \sum_h \frac{e^{-E_\theta(V,h)}}{\sum_{h'} e^{-E_\theta(V,h')}}\, \nabla_\theta E_\theta(V,h)

Since

\frac{e^{-E_\theta(V,h)}}{\sum_{h'} e^{-E_\theta(V,h')}} = q_\theta(h|V),

we rewrite:

\nabla_\theta \mathcal{L}_1 = \sum_h q_\theta(h|V)\, \nabla_\theta E_\theta(V,h) = \mathbb{E}_{q(h|V)}[\nabla_\theta E_\theta(V,h)]

Gradient of L2

\nabla_\theta \mathcal{L}_2 = -\frac{\sum_{v,h} e^{-E_\theta}\, \nabla_\theta E_\theta}{\sum_{v,h} e^{-E_\theta}}

= -\sum_{v,h} q_\theta(v,h)\, \nabla_\theta E_\theta(v,h)

= -\mathbb{E}_{q(v,h)}[\nabla_\theta E_\theta(v,h)]

Thus, combining both terms:

∇θ L = ∇θ L1 + ∇θ L2
= Eq(h|V ) [∇θ Eθ ] − Eq(v,h) [∇θ Eθ ]

Computing the Gradient for Wij

Considering the parameter θ = Wij , we have:


\nabla_{W_{ij}} E(V, h) = \nabla_{W_{ij}}\left[ -\sum_{i=1}^{m}\sum_{j=1}^{n} V_i W_{ij} h_j - \sum_{i=1}^{m} b_i V_i - \sum_{j=1}^{n} c_j h_j \right] = -V_i h_j

and
∇Wij E(v, h) = −vi hj ,

we have:
∇Wij L = −Eq(h|V ) [Vi hj ] + Eq(v,h) [vi hj ]

where the first term represents the expected value under the posterior distribution, and the
second term is the expected value under the joint distribution.

Contrastive Divergence for Training RBMs


Step 1: Clamp visible states to V and calculate the hidden probabilities

q(hj |V ) = σ(V W·j + cj )

Then,
∇W L1 = −V T σ(V W + c)

where this results in a rank-1 outer product in Rm×n (you can show this as an exercise).

Step 2: Compute the expectation using Gibbs Sampling


\langle v_i h_j \rangle_{q(v,h)} \equiv \mathbb{E}_{q(v,h)}[v_i h_j] = \sum_v \sum_h q(v,h)\, v_i h_j

In practice, a single network state is used to approximate the expectation. Gibbs sampling
is employed to run the network freely and compute the average vi hj .

∇W L2 = v T σ(vW + c)

which is again an outer product.

Finally, the weight update rule is:

W ← W − η(∇W L1 + ∇W L2 )
W ← W + ηV T σ(V W + c) − ηv T σ(vW + c)

where η is the learning rate, the first term corresponds to the positive phase (clamped
visible state), and the second term corresponds to the negative phase (after one Gibbs
sampling step).
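The two phases above can be sketched as a single CD-1 update in NumPy. The function name `cd1_update` is ours; the bias updates for b and c follow standard contrastive-divergence practice and are an assumption beyond the W rule derived in the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, V, eta=0.1, rng=None):
    """One contrastive-divergence step for one binary visible vector V (shape (m,)).
    W: (m, n); b: (m,) visible biases; c: (n,) hidden biases."""
    rng = np.random.default_rng(0) if rng is None else rng
    # positive phase: clamp the visible units to V
    ph = sigmoid(V @ W + c)                            # q(h_j = 1 | V)
    h = (rng.random(ph.shape) < ph).astype(float)      # sample hidden state
    # negative phase: one Gibbs sampling step back to the visible layer
    pv = sigmoid(h @ W.T + b)                          # q(v_i = 1 | h)
    v = (rng.random(pv.shape) < pv).astype(float)
    nh = sigmoid(v @ W + c)
    # W <- W + eta * (clamped outer product - free-running outer product)
    W = W + eta * (np.outer(V, ph) - np.outer(v, nh))
    b = b + eta * (V - v)      # assumed bias rule (analogous to the W rule)
    c = c + eta * (ph - nh)
    return W, b, c
```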

Sampling an RBM

After training an RBM, we can use it as a generative model to generate M new data points V^{(1)}, V^{(2)}, . . . , V^{(M)} by performing Gibbs sampling from the conditional probabilities P(h|v) and P(v|h), as illustrated below:

where we start from a random data point V (0) .


If you are interested in the practical side of what is covered in this lecture, you can find an
RBM demo here: [Link]

7.3 Autoencoders
Goal: To introduce a common and useful neural architecture.

Definition
An autoencoder is a neural network that learns to encode (and decode) a set of inputs. It’s
called an autoencoder because it learns the encoding automatically.

Key Observations
• The code layer is smaller than the input/output layers.

• The autoencoder consists of:

– An encoder that compresses the input x into a lower-dimensional representation


z.

– A decoder that reconstructs the original input from z.

Loss Function
The model is trained using a loss function that minimizes the reconstruction error:

L(x′ , x)

where x′ is the output (reconstructed input) and x is the original input.



Encoding Dimension
Consider the set of 5 binary vectors, stored in the rows of the matrix
 
\begin{pmatrix} 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \end{pmatrix}.

Even though the vectors are 8-D (so they could take on 256 different inputs), the actual
dataset has only 5 patterns. We can, in principle, encode each of them with a unique 3-bit
code. However, we can choose the dimension of the encoding layer.

Unrolling the Autoencoder


We can also think of our autoencoder as just 2 layers, and we can “unfold” it (or “unroll” it)
into 3 layers, where the input and output layers are the same size and have the same state.

Tied Weights
Instead of:

we use

If we allow W and M to be different, then it’s just a 3-layer network.


If we enforce that M = W T , then we say the weights are tied.
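A tied-weight autoencoder forward pass is then just two matrix products sharing one parameter matrix. The function name `tied_autoencoder` and the tanh activation are our assumptions for this sketch.

```python
import numpy as np

def tied_autoencoder(x, W, b_enc, b_dec, act=np.tanh):
    """Forward pass with tied weights: the decoder matrix M is forced to be W^T.
    x: (d,) input; W: (d, k) with k < d (the code layer is smaller)."""
    z = act(x @ W + b_enc)        # encoder: compress x into the k-dim code z
    x_hat = act(z @ W.T + b_dec)  # decoder: reconstruct using M = W^T
    return z, x_hat
```

Tying the weights halves the parameter count of the two-layer network and is one common regularization choice; untied W and M recover the ordinary 3-layer network described above.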

7.4 Vector Embeddings


Goal: To see an efficient and semantic strategy to encode large input spaces.

Vector Representations
We have been using vectors to represent inputs and outputs.

eg. “2” = [0 0 1 0 0 0 0 0 0 0] ∈ {0, 1}^{10}

    “e” = [0 0 0 0 1 0 . . . 0] ∈ {0, 1}^{26}

Word Representations
What about words? Consider the set of all words encountered in a dataset. We will call this
our vocabulary.

Let’s order and index our vocabulary and represent words using one-hot vectors, like above.

Let wordi be the ith word in the vocabulary.

eg. “cat” = v ∈ W where W = {0, 1}^{N_v} ⊂ R^{N_v}

and Nv is the number of words in our vocabulary (e.g., 70,000). The vector v is a one-hot
encoding of the word,
v_i = \begin{cases} 0 & \text{if } \mathrm{word}_i \ne \text{“cat”} \\ 1 & \text{if } \mathrm{word}_i = \text{“cat”} \end{cases}

Handling Similar Meanings


This is nice, but when we are doing Natural Language Processing (NLP), how do we handle
the common situation in which different words can be used to form a similar meaning?

Example:
These two sentences have similar meanings, but use different words.
• "CS 479 is interesting."
• "CS 479 is fascinating."
We could form synonym groups, but how do we decide on groups when words have similar,
but not identical, meanings? In the following list of words, where does one draw the line
between “misery” and “ecstasy”?
misery - despair - sadness - contentment - joy - elation - ecstasy

Semantic Relationships Between Words

These issues reflect the semantic relationships between words. We would like to find a
different representation for each word, but one that also incorporates their semantics.

Example:

misery - despair - sadness - contentment - joy - elation - ecstasy


happiness scale −−−−−−−−→

How can we tease out the complex, semantic relationships between words?

Predicting Word Pairs

We can get a lot of information from the simple fact that some words often occur together
(or nearby) in sentences.

Consider the text:

“The scientist visited Paris last week, enjoying the city’s famous museums.”

For the purposes of this topic, we will consider “nearby” to be within a fixed window size d.

Example: Nearby the word “week”, with d = 2

“The scientist visited Paris last week, enjoying the city’s famous museums.”

This gives us the word pairings:


(week, Paris), (week, last), (week, enjoying), (week, the)

These word pairings help us understand the relationships between words based on their
co-occurrence in text.
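The window extraction above is easy to state in code. `nearby_pairs` is our name, and punctuation handling is deliberately simplified (words are assumed already tokenized).

```python
def nearby_pairs(words, centre_index, d=2):
    """All (centre, context) pairs within a window of size d around one word,
    clipped at the sentence boundaries."""
    lo = max(0, centre_index - d)
    hi = min(len(words), centre_index + d + 1)
    return [(words[centre_index], words[j])
            for j in range(lo, hi) if j != centre_index]
```

Running this over every position in a corpus produces the training pairs fed to the co-occurrence-prediction network described next.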

Approach: Predicting Word Co-occurrences

Our approach is to try to predict these word co-occurrences using a 3-layer neural network.

• Its input is a one-hot word vector.

• Its output is the probability of each word’s co-occurrence.

Neural Network Formulation

Our neural network takes a one-hot word vector, v, as input,

y = f (v, θ) where v ∈ W

and we can interpret the output as a distribution over the one-hot vocabulary,

y \in P^{N_v} = \left\{ p \in \mathbb{R}^{N_v} \;\middle|\; p \text{ is a probability vector, i.e. } \sum_i p_i = 1,\; p_i \ge 0 \;\forall i \right\}

Then, yj equals the probability that wordj is nearby.



Neural Network Architecture

The output layer uses the softmax activation function.

This hidden-layer squeezing forces a compressed representation, requiring similar words to


take on similar representations. This is called ‘embedding’.

word2vec
Word2vec is a popular embedding strategy for words (or phrases, or sentences). It uses
additional tricks to speed up the learning.

1. Treats common phrases as new words e.g., ”New York” is one word.

2. Randomly ignores very common words e.g.,

”the car hit the post on the curb”

Of the 56 possible word pairs (permutations), only 20 don’t involve ”the”.

3. Negative Sampling: backpropagates only some of the negative cases.

Embedding Space

The embedding space is a relatively low-dimensional space where similar inputs are mapped
to similar locations.

Why does this work? Words with similar meanings will likely co-occur with the same
set of words, so the network should produce similar outputs, and therefore have similar
hidden-layer activations.

Cosine Similarity
The cosine of the angle between two vectors is often used to measure the “distance” between them in the embedding (latent) space.
Example:
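As a minimal sketch (function name ours), the cosine similarity is just the inner product of the two vectors after normalization:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = u.v / (|u||v|); 1 = same direction, 0 = orthogonal,
    -1 = opposite direction.  Assumes neither vector is the zero vector."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Note it ignores vector length, which is usually desirable in an embedding space: two words mapped to the same direction are treated as maximally similar.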

Vector Arithmetic in Word Embeddings


To some extent, you can do a sort of vector addition on these representations.
Example: king - man + woman = ?

7.5 Variational Autoencoders


Goal: To see a type of autoencoder that can generate reasonable samples that were not
in the training set.

Recall what an autoencoder is:

We would like to be able to reconstruct the samples in our dataset. In fact, we would like to
be able to generate ANY valid sample. In essence, we would like to sample the distribution

p(x) the distribution of the inputs

We generate samples by choosing elements from some lower-dimensional latent space,

z ∼ p(z) e.g., z could represent digit class, line thickness, slant, etc.

and then generate the samples from those latent representations,

p(x) = \int p_\theta(x|z)\, p(z)\, dz

We have a dataset of samples, X, and we want to find θ to maximize the likelihood of


observing X.

Note: The mapping between random variable z and x can be written p(x|z).

But, we will assume


p(x|z) is Gaussian, N (d(z, θ), Σ)

If we assume p_\theta(x|z) is Gaussian, with mean d(z, \theta) and standard deviation \Sigma, then

-\ln p_\theta(x|z) = \frac{1}{2\Sigma^2}\, \|X - d(z, \theta)\|^2 + C.

So, given samples z, we have a way to learn d(z, θ).

We can solve

\max_\theta \mathbb{E}_{z \sim p(z)}\left[ p_\theta(x|z) \right] \quad \text{by} \quad \min_\theta \mathbb{E}_{z \sim p(z)}\left[ \|X - d(z, \theta)\|^2 \right]

Note that

\mathbb{E}_{p(z)}[p_\theta(x|z)] = \int p_\theta(x|z)\, p(z)\, dz,

where we can use a Monte Carlo method to evaluate the previous integral.
Problem: We don’t know how to sample z ∼ p(z).
Illustration: Suppose we train an AE on a dataset of simple shapes. ⃝□△
The latent space is 2D, and the clusters are well separated.

However, latent vectors between the clusters generate samples that don’t look like shapes.
For MNIST 2D Latent space, we have the following:

What’s happening? Why is our generator so bad? Answer: We are choosing improbable z
samples, p(zi ) ≈ 0
We would like to sample only z’s that yield reasonable samples with high probability.
We are now placing requirements on the distribution in our latent space. Can we get away
with this?
Let’s assume that we can choose the distribution of z’s in the latent space; call it q(z). Then
p(x) = \mathbb{E}_{z \sim p}\left[ p(x|z) \right]
     = \int p(x|z)\, p(z)\, dz
     = \int p(x|z)\, \frac{p(z)}{q(z)}\, q(z)\, dz
     = \mathbb{E}_{z \sim q}\left[ p(x|z)\, \frac{p(z)}{q(z)} \right]

Thus the expected Negative log likelihood (NLL),


 
-\ln p(x) \le -\mathbb{E}_{q(z)}\left[ \ln p(x|z) + \ln \frac{p(z)}{q(z)} \right]

The previous inequality can be shown using Jensen’s inequality (where − ln is a convex function over (0, +∞)). We can rewrite the right-hand side as follows:

-\ln p(x) \le \underbrace{\mathrm{KL}(q(z)\,\|\,p(z))}_{(1)} \underbrace{-\, \mathbb{E}_{q(z)}[\ln p(x|z)]}_{(2)},

where

\mathrm{KL}(q(z)\,\|\,p(z)) = -\mathbb{E}_{z \sim q}\left[ \ln \frac{p(z)}{q(z)} \right]

is the Kullback-Leibler (KL) divergence. Note that the right-hand side (1) + (2) is an upper
bound of the NLL, and we are going to use it as our loss function since minimizing the upper
bound would also minimize the NLL, which would maximize the likelihood p(x), which is
our end goal to get quality generated samples.
(1) Let’s choose a latent distribution that is convenient for us:

p(z) = N(0, I)

Then, our aim is to design q(z) so that it is close to N(0, I), i.e.

\min_q \mathrm{KL}\left( q(z)\,\|\,N(0, I) \right)

How do we design our latent representations to achieve this? Answer: We design an encoder
and ask its outputs to be N (µ, σ 2 ).

Here N (µ, σ 2 ) defines a distribution. Next, we keep pressuring the encoder to give us µ = 0,
σ 2 = I.

These Gaussians are convenient because there is a closed-form expression for

\mathrm{KL}\left( N(\mu, \sigma^2)\,\|\,N(0, I) \right) = \frac{1}{2}\left( \sigma^2 + \mu^2 - \ln \sigma^2 - 1 \right)

We want to minimize this, but there are other forces at play...
(2) The other term in the objective,

Eq [ln p(x|z)] ,

is our reconstruction loss, which can be written as

Eq [ln p(x|x̂)]

where x̂ = d(z, θ) (deterministic decoder) and

z = µ(x, θ) + ϵ ⊙ σ(x, θ), ϵ ∼ N (0, I)

This is called the ”reparameterization trick,” where the distribution is differentiable, and
stochasticity is brought in by a separate variable, ϵ. Note that ϵ is generally a vector of
random values, not a scalar.
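The reparameterization trick and the closed-form KL term can be sketched directly. Function names are ours; parameterizing log σ² instead of σ is a common numerical convenience we assume here, not something the notes require.

```python
import numpy as np

def reparameterize(mu, log_var, rng=None):
    """z = mu + eps * sigma with eps ~ N(0, I).  All the stochasticity lives
    in eps, so z is a differentiable function of mu and sigma."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return mu + eps * np.exp(0.5 * log_var)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims:
    (1/2) * sum(sigma^2 + mu^2 - ln sigma^2 - 1)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - log_var - 1.0)
```

Note the KL term is exactly zero when µ = 0 and σ² = 1, i.e. when the encoder output matches the prior N(0, I).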

Intuition
Think of a cloud of matter floating in space, but collapsing in by its own gravity, eventually
forming a star.

Here is the process:


• Encode x by computing µ(x, θ) and σ(x, θ) using neural networks.

• Sample z = µ + ϵσ, ϵ ∼ N (0, I).

• Calculate KL loss:

  \frac{1}{2}\left( \sigma^2 + \mu^2 - \ln \sigma^2 - 1 \right)

• Decode x̂ using another neural network, x̂ = f(x, θ) = d(z).

• Calculate reconstruction loss, L(x, x̂):

  \frac{1}{2}\, \|\hat{x} - x\|^2 \quad \text{for Gaussian } p(x|\hat{x})

  or

  -\sum_i \left[ x_i \ln \hat{x}_i + (1 - x_i) \ln(1 - \hat{x}_i) \right] \quad \text{for Bernoulli } p(x|\hat{x})

• Both terms of our objective function are differentiable w.r.t. θ,

  E = \mathbb{E}_x\Big[\, \underbrace{L(x, \hat{x}) + \beta\left( \sigma^2 + \mu^2 - \ln \sigma^2 - 1 \right)}_{\text{all depend on net. params } \theta} \,\Big]

  so we can do gradient descent on θ. β adjusts the relative importance of the reconstruction loss vs. the KL divergence loss.

Because the VAE seeks a distribution in the latent space that is close to N (0, I), there are
fewer holes.

MNIST using a VAE:

Note: Points chosen randomly from N (0, I). Decode → Generated samples.

Now, there are no (or fewer) gaps in the latent space, which enhances the quality of samples
generated by the VAE.
Chapter 8

Recurrent Neural Networks


8.1 Recurrent Neural Networks (RNNs)

Goal: To learn the basics of Recurrent Neural Networks (RNNs): what they are used
for, and how to train them.

So far, we have focused on FFNNs, CNNs, AE, VAEs and Diffusion models, where we have
an architecture similar to the following:

Consider the task of predicting the next word in a sentence:

“I live in France, hence I speak ?”

The most likely word is ‘French’ because of the context created by the previous words. To
model this task using a neural network, we have two solutions.

Solution 1: (not optimal)

1. The NN only considers fixed-length sequences.

2. It cannot do a forward pass unless a full sequence is given.



As a result, a conventional Feedforward neural network (FFNN) is not suitable for this task.

Solution 2: RNNs

Key Differences:

• We process one input at a time.

• RNNs are more suitable for this task.

Historical Significance of RNNs

• RNNs have historically enabled machine translation and speech recognition.

• Theoretically, an RNN can approximate any computation you can do on your laptop
(i.e., Universal Turing Machine).

“Siegelmann and Sontag, ACM, 1992”



Illustration of a Vanilla RNN

Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state that
carries information from previous time steps. The following figure illustrates the unrolled
structure of an RNN, showing how information flows over time.

At each time step i, the hidden state is updated using the previous hidden state and the
current input. This is mathematically defined as:
 
\vec{h}^i = f\left( \vec{x}^{\,i} U + \vec{h}^{i-1} W + \vec{b} \right),

where:
• ⃗hi is the hidden state at time step i.
• ⃗x i is the input at time step i.
• U is the weight matrix for the input-to-hidden transformation.

• W is the weight matrix for the hidden-to-hidden transformation (which allows infor-
mation to persist over time).
• ⃗b is the bias term.
• f is a non-linear activation function (e.g. tanh, sigmoid, or ReLU).
The output at each time step is computed as:
 
\vec{y}^{\,i} = \mathrm{Softmax}\left( \vec{h}^i V + \vec{c} \right),

where:
• ⃗y i is the output at time step i.
• V is the weight matrix for the hidden-to-output transformation.
• ⃗c is the bias term.
• The Softmax function ensures that the output represents probabilities or conditional
probabilities.
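The two equations above compose into a simple forward pass over a whole sequence. The function name `rnn_forward` and the row-vector convention (matching the notes' xU + hW form) are our choices for this sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, h0=None):
    """Run a vanilla RNN over a sequence.
    xs: (N, d_in) inputs; returns hidden states (N, d_h) and outputs (N, d_out)."""
    d_h = W.shape[0]
    h = np.zeros(d_h) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(x @ U + h @ W + b)   # hidden-state update
        ys.append(softmax(h @ V + c))    # output distribution at this step
        hs.append(h)
    return np.array(hs), np.array(ys)
```

The same matrices U, W, V are reused at every time step, which is what makes the parameter count independent of the sequence length.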

Remarks
One of the key strengths of RNNs is their ability to maintain a form of memory through the
hidden state ⃗hi . This makes them particularly effective for tasks involving sequences, such
as language modeling, time-series forecasting, and speech recognition.
• The hidden state ⃗hi acts as a memory of previous inputs. This allows the network to
capture temporal dependencies and make informed predictions based on past informa-
tion.
• Increasing the size of ⃗hi (i.e., increasing the number of hidden units) enhances the
model’s ability to store and process long-term dependencies. However, this also leads
to higher computational costs and memory usage.
Despite their advantages, standard (vanilla) RNNs have limitations, particularly when deal-
ing with long sequences. They often suffer from vanishing and exploding gradient problems,
making it difficult to learn dependencies over long time spans. More advanced architectures
such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs)
have been developed to address these issues. In particular, we will cover the GRU architec-
ture in the next lecture.

Training an RNN: Backprop Through Time (BPTT)


Training a Recurrent Neural Network (RNN) involves optimizing its parameters so that
the predicted outputs match the true targets as closely as possible. The training process

propagates information across time steps while minimizing a loss function.

The total loss function L over a sequence of time steps is computed as the sum of individual losses at each step:

L(\vec{y}^{\,1}, \ldots, \vec{y}^{\,N}, \vec{t}^{\,1}, \ldots, \vec{t}^{\,N}) = \sum_{i=1}^{N} \alpha_i\, L(\vec{y}^{\,i}, \vec{t}^{\,i})
where:
• ⃗y i is the predicted output at time step i.
• ⃗t i is the target output (ground truth) at time step i.
• N is the sequence length.
• L(⃗y i , ⃗t i ) represents the loss function measuring the error between prediction and target.
This could be cross-entropy loss for classification tasks or mean squared error (MSE)
for regression tasks.
• αi are weights for different time steps.
• The total loss is accumulated over all time steps, ensuring that all predictions contribute
to the optimization process.
The objective of training an RNN is to find the optimal parameters θ that minimize the expected loss over the entire dataset:

\theta^* = \arg\min_\theta\, \mathbb{E}_{(X,T) \in D}[L]

where:

• θ = {U, V, W, ⃗b, ⃗c} represents the set of trainable parameters:

U : Input-to-hidden weight matrix.

W : Hidden-to-hidden recurrent weight matrix.

V : Hidden-to-output weight matrix.

⃗b : Bias term for the hidden state.

⃗c : Bias term for the output.

• E[·] denotes the expectation over the entire dataset D.

• (X, T) are the input sequences (e.g. a sentence of words) and their corresponding
target outputs from the dataset.

In an unsupervised learning (UL) setting where we only have access to the input sequences X, we can minimize the expectation of the negative log likelihood L(Y) = -\sum_{i=1}^{N} \log\big( y^{\,i}_{j_i} \big), where j_i is the index of the token actually observed at step i.

Deep RNNs

Recurrent Neural Networks (RNNs) can be extended beyond a single layer by stacking mul-
tiple RNN layers vertically. This results in a Deep RNN, which allows for hierarchical
feature extraction across multiple levels of representation.

Key Idea: Instead of a single hidden state per time step, we introduce multiple layers of
hidden states. Each layer processes information before passing it to the next layer.

The structure of a deep RNN with L layers can be described as follows:


• The input ⃗xn is first processed by the first RNN layer, generating a hidden state ⃗hn1 .
• Each subsequent layer l receives the hidden state from the previous layer l − 1 and
computes a new hidden representation.
• The final layer L produces the output ⃗y n .
For a deep RNN with L stacked layers, the hidden states at each layer evolve as follows:

\vec{h}^n_1 = f(\vec{x}^{\,n} U_1 + \vec{h}^{n-1}_1 W_1 + \vec{b}_1),
\vec{h}^n_2 = f(\vec{h}^n_1 U_2 + \vec{h}^{n-1}_2 W_2 + \vec{b}_2),
\;\;\vdots
\vec{h}^n_L = f(\vec{h}^n_{L-1} U_L + \vec{h}^{n-1}_L W_L + \vec{b}_L).

Here:
• ⃗hnl is the hidden state at time step n in layer l.
• Ul is the input-to-hidden weight matrix for layer l.
• Wl is the recurrent weight matrix within layer l.

• f is a non-linear activation function (e.g., tanh or ReLU).


• ⃗bi (for i = 1, . . . , L) are vector biases.
The final output is computed as:

⃗y n = Softmax(⃗hnL V + ⃗c),

where: V is the weight matrix from the last hidden layer to the output and ⃗c is a bias vector.

Why use deep RNNs?


• Numerical cost: Instead of increasing the size of the hidden state h (dh ), we can
instead increase the number of layers while keeping dh small enough to maintain a
reasonable computational cost.
• Improved Representation Learning: Helps in modeling complex temporal depen-
dencies.
• Better Performance: In tasks like speech recognition and language modeling, deep
RNNs could outperform shallow ones.

8.2 Gated Recurrent Units (GRUs)

Goal: To understand the limitations of vanilla RNNs, and see one solution: Gated Recurrent Units (GRUs).

Long-Range Dependencies
One of the major challenges in sequence processing with RNNs is the loss of information as
the sequence length increases. Consider the following example:

Cats make great house pets. If you have one, it is important that you care for
it, including taking it for regular health check-ups at the .

The expected word might be “vet”. However, due to the long distance between the subject
(Cats) and the missing word, standard RNNs may struggle to maintain the necessary context.

Why does this happen?


• As the sequence grows longer, information from earlier words gradually fades away.

• This is commonly referred to as the vanishing gradient problem, which makes it


difficult for RNNs to retain long-term dependencies.

To simplify our analysis of the long-range dependency issue, assume the following:

• Y = X = H = 1 (real numbers).

• The activation function is the identity function: f = Id.

• The input-to-hidden weight is U = 1, and the bias term is b = 0.

The recurrent hidden state update is given by:

hn = whn−1 + xn

Unrolling this equation over time (taking h_0 as the initial hidden state):

h_n = w^n h_0 + w^{n-1} x_1 + w^{n-2} x_2 + \ldots + x_n

• If |w| < 1, then w^n shrinks exponentially as n grows. This means that earlier information h_0 and inputs x_1, x_2, . . . have exponentially less influence on h_n over time.

• If |w| > 1, the magnitude of h_n grows exponentially, making RNN training unstable.
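The two regimes are easy to see numerically: the coefficient multiplying x_1 in the unrolled sum is w^{n−1} (function name ours, scalar case with U = 1, b = 0, f = Id as assumed above).

```python
def influence_of_first_input(w, n):
    """Coefficient of x_1 in the unrolled update h_n = w h_{n-1} + x_n.
    |w| < 1: vanishes exponentially; |w| > 1: explodes exponentially."""
    return w ** (n - 1)
```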

Solution: Gated Recurrent Units (GRU)


To address this issue, we will consider Gated Recurrent Units (GRU), which introduce
mechanisms to preserve long-term dependencies.
Gated Recurrent Units (GRUs) introduce gating mechanisms to control the flow of informa-
tion and mitigate the vanishing gradient problem. Below is a simplified version of the GRU
equations.

Candidate Hidden State


The candidate hidden state \vec{\tilde h}^n is computed as:

\vec{\tilde h}^n = \tanh\left( \vec{h}^{n-1} W + \vec{x}^{\,n} U + \vec{b} \right)
where:
• W is the hidden-to-hidden weight matrix.
• U is the input-to-hidden weight matrix.
• ⃗b is the bias vector.
• The tanh function ensures that h̃ni ∈ (−1, 1).

Gate Mechanism
The gate \vec{g}^{\,n} determines how much past information is retained:

\vec{g}^{\,n} = \sigma\left( \vec{h}^{n-1} W_g + \vec{x}^{\,n} U_g + \vec{b}_g \right)

where:
• Wg and Ug are the gate’s weight matrices.
• b⃗g is the bias vector for the gate.
• The σ (sigmoid) function ensures that gin ∈ (0, 1), meaning it acts as a soft switch.

Final Hidden State Update


The new hidden state ⃗hn is a combination of the previous hidden state and the candidate
hidden state:
⃗hn = ⃗g n ⊙ ⃗h̃n + (1 − ⃗g n ) ⊙ ⃗hn−1

where:
• ⊙ denotes the Hadamard (element-wise) product.

• ⃗g n controls how much of the new candidate state ⃗h̃n is retained.

• (1 − ⃗g n ) controls how much of the previous state ⃗hn−1 is preserved.

Intuition

• If g^n_i ≈ 1, the new state is mostly the candidate state component \tilde h^n_i, meaning the network updates to new information.

• If g^n_i ≈ 0, the previous hidden state component h^{n-1}_i is mostly preserved, preventing unnecessary updates.

• This gating mechanism allows GRUs to adaptively retain or forget past information,
addressing the vanishing gradient problem.

The new hidden state is a combination of the previous hidden state and the candidate
hidden state, allowing for more efficient information retention in sequential learning.

Consider the following sentence:

Cats make great . . . check-ups at the vet.

In a standard RNN, long-range dependencies might be lost, making it difficult to associate “Cats” with “vet.” However, in GRUs, the gating mechanism helps retain relevant information over long distances.

Cats make great ... the vet.


h = h̃ = 1 h=1 h=1 ··· h=1 h=1
g=1 g=0 g=0 ··· g=0 g=0
“g = 0” means don’t update the hidden state.

• At the beginning, g = 1 allows the hidden state to update.

• Once g = 0, the hidden state stops updating, preserving information from earlier in
the sentence.

When g = 0, the hidden state remains unchanged, preventing unnecessary updates. This
ensures that crucial information, such as the subject (Cats), is retained and can still influence
later words. As a result, GRUs can effectively model long-range dependencies.

GRU (Full Version)


The Gated Recurrent Unit (GRU) extends the simple RNN by introducing two gating mech-
anisms: the update gate and the reset gate.
 
⃗g n = σ(⃗hn−1 Wg + ⃗x n Ug + ⃗bg ),
⃗r n = σ(⃗hn−1 Wr + ⃗x n Ur + ⃗br ),
⃗h̃n = tanh((⃗hn−1 ⊙ ⃗r n )W + ⃗x n U + ⃗b),
⃗hn = ⃗g n ⊙ ⃗h̃n + (1 − ⃗g n ) ⊙ ⃗hn−1 .

The reset gate ⃗r n determines how much of the previous hidden state ⃗hn−1 should be for-
gotten before computing the new candidate hidden state ⃗h̃n . When ⃗r n is close to 0, the past
information is largely discarded, making the model more reliant on the new input ⃗x n . When
⃗r n is close to 1, more of the past hidden state is retained.
There is also an advanced RNN variant called Long Short-Term Memory (LSTM), but
it is not covered in this course.
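The full GRU update above can be sketched in NumPy. This is an illustrative, untrained cell with hypothetical weight names, not a drop-in library implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wg, Ug, bg, Wr, Ur, br, W, U, b):
    """One step of the full GRU, following the four equations above."""
    g = sigmoid(h_prev @ Wg + x @ Ug + bg)           # update gate
    r = sigmoid(h_prev @ Wr + x @ Ur + br)           # reset gate
    h_tilde = np.tanh((h_prev * r) @ W + x @ U + b)  # candidate state
    return g * h_tilde + (1.0 - g) * h_prev          # gated combination
```

Because the new state is a convex combination of the previous state and tanh(·), driving the gate bias strongly negative forces g ≈ 0 and freezes the hidden state, as in the “Cats … vet” example.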
Chapter 9

Adversarial Attacks

123
CHAPTER 9. ADVERSARIAL ATTACKS 124

9.1 Adversarial Attacks


Goal: Learn how to generate inputs that fool a discriminator neural network into giving
the wrong answer.

Dataset and Classifier


Consider the dataset
D = {(x, t) | x ∈ X, t ∈ {1, . . . , K}},

where X is a set of inputs drawn from the input space X , and t is the class of x.
Let f : X → PK be a discriminator (classifier) network, where

PK = { y ∈ RK | 0 ≤ yi ≤ 1, Σi yi = 1 }

represents probability vectors, such as those produced by a softmax function: y = Softmax(z).

Classification Error
The classification error is defined as:
R(f ) ≜ E(x,t)∼D [ card{ arg maxi yi ̸= t } ] ,   where y = f (x),

and:
• arg maxi yi gives the index of the largest element of y.
• card denotes the cardinality (number of elements), so the bracketed quantity is 1 when the prediction is wrong and 0 otherwise.

ε-Ball (Neighbourhood)

(Figure: a ball B(x, ε) of radius ε centred at x.)

The ε-ball around x is defined as:

B(x, ε) = {x′ ∈ X | ∥x − x′ ∥ ≤ ε}

Adversarial Attacks
We ask: given (x, t) ∈ D, is there x′ ∈ B(x, ε) such that

arg maxi (yi ) ̸= t for y = f (x′ )?

These inputs can be found quite easily. This is called an Adversarial Attack. There are
two main classes of adversarial attacks:

• Whitebox: attacker has access to the whole model, e.g., weights, activation functions,
etc.

• Blackbox: attacker only has access to inputs & outputs.

It is very important to consider mitigating adversarial attacks, as they pose a challenge to the safe deployment of neural network models in critical real-world applications. Here are two examples:

• A self-driving vehicle’s model misclassifying a stop sign as a speed limit sign can lead
to a fatal accident. For more details: [Link]

• A model misclassifying the MRI scan of a sick patient as normal can have severe
consequences on the patient’s health.

Overcoming these adversarial attacks is a very active field of research.

Gradient-Based Whitebox Attack


Recall gradient descent learning:
θ = θ − k∇θ E

Consider ∇x E: The gradient of the loss with respect to the input, x.


Gradient Adjustment for Adversarial Attacks


∇x E tells us how to adjust our input to decrease (or increase) the loss.

Untargeted Attack

x′ = x + k∇x E(f (x; θ), t(x))

Gradient ascent nudges input in the direction to increase loss.

Targeted Attack

x′ = x − k∇x E(f (x; θ), l)

where l ̸= t(x).

Gradient descent nudges input in the direction to decrease loss for the wrong target class.

Fast Gradient Sign Method (FGSM)


Adjust each pixel by ε.
∆x = ε sign(∇x E)

Example: 24-bit RGB Perturbation


For a 24-bit image where (R, G, B) ∈ {0, . . . , 255}3 , the perturbed image is computed as:

(R, G, B)′ = (R, G, B) ± ε sign(∇x E)

which ensures that the perturbation follows:

∥∆x∥∞ = ε
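FGSM can be sketched end-to-end for a small linear softmax classifier. The weights W, b here are hypothetical stand-ins for a trained model; for the cross-entropy loss, the input gradient of this model has the closed form (y − onehot(t)) W ⊤:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_perturb(x, t, W, b, eps):
    """Untargeted FGSM on a linear softmax classifier y = softmax(xW + b)."""
    y = softmax(x @ W + b)
    onehot = np.eye(len(b))[t]
    grad_x = (y - onehot) @ W.T         # gradient of cross-entropy w.r.t. x
    return x + eps * np.sign(grad_x)    # each component moves by exactly eps
```

Every component of x moves by ±ε, so ∥∆x∥∞ = ε; for this model (whose loss is convex in x), the loss can only increase.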

Example:
Here is an illustration from Goodfellow, I. J., Shlens, J., Szegedy, C. (2015). Explaining
and Harnessing Adversarial Examples. In Proc. of ICLR.

Finding Minimal Perturbation

Instead of applying a fixed perturbation, one can also search for the smallest ∥∆x∥ that
causes misclassification:
min ∥∆x∥ such that arg maxi ( yi (x + ∆x) ) ̸= t(x)

Why Are Neural Networks So Easily Fooled?


Consider the MNIST dataset, where each 28 × 28 input image is flattened into a vector:

R28×28 → R784

The neural network partitions the space into 10 regions, each corresponding to a digit class.

Summary
Learning is defined as optimizing the expected loss:

minθ ED [L(f (x), t(x))]

An Adversarial Attack of size ε is formulated as:

Untargeted Attack

maxx′ ∈B(x,ε) [L(f (x′ ), t)]

Targeted Attack

minx′ ∈B(x,ε) [L(f (x′ ), l)] ,   l ̸= t

9.2 Adversarial Defence


Goal: Deploy a strategy to make our discriminative model less prone to adversarial
attacks.

How can we train our models so that they are harder to attack?

Adversarial Training
During training, we add adversarial samples to the dataset.

What if we built that process into our training? Idea: Incorporate a mini adversarial
attack into every gradient step while training.

TRADES:
“TRadeoff-inspired Adversarial DEfense via Surrogate loss minimization”

Model f : X → R

Dataset (X, T ), where X ⊂ X, and T ∈ {−1, 1}

• sign(f (X)) indicates the class of x.

• Classification is correct if f (X)T > 0.

Classification (“natural”) Loss


 
Rnat (f ) = E(X,T ) [ card{f (X)T ≤ 0} ]

Robust Loss

Rrob (f ) = E(X,T ) [ card{X ′ ∈ B(X, ε) | f (X ′ )T ≤ 0} ]

Explanation:

• Built-in pessimism that looks for a worst-case scenario in the neighbourhood of X.

• Rrob accounts for all possible misclassified points within B(X, ε).

Making the Counting Differentiable


Instead of counting misclassified points directly, we approximate it using a smooth function
g:

Rlearning = minf E(X,T ) [g(f (X)T )]

(Figure: the counting function card{α ≤ 0}, a step at α = 0, next to a smooth, decreasing surrogate g(α), both plotted against α.)

Training a Robust Model

We aim to train a robust model by minimizing the following objective:

 

minf E(X,T ) [ g(f (X)T ) + maxX ′ ∈B(X,ε) g(f (X)f (X ′ )) ]

• The first term ensures that X is correctly classified.

• The second term adds a penalty for models f that place the decision boundary within
ε of X, where f (X) and f (X ′ ) will have opposite signs.

Implementation

• For each gradient update:

– Run several steps of gradient ascent to find X ′ .

– Evaluate the joint loss:

loss = g(f (X)T ) + βg(f (X)f (X ′ ))

where β is a hyperparameter.

– Use the gradient of the loss to update weights.
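The loop above can be sketched numerically for a linear scorer f(x) = w · x with the logistic surrogate g(α) = log(1 + e^(−α)). The step size, number of ascent steps, and the ∞-norm ball are illustrative choices, not the TRADES defaults:

```python
import numpy as np

def g(alpha):
    """Smooth surrogate for the count card{alpha <= 0} (logistic loss)."""
    return np.log1p(np.exp(-alpha))

def trades_loss(w, x, t, eps, beta, steps=20, lr=0.5):
    """TRADES-style loss for f(x) = w . x with labels t in {-1, +1}."""
    f = lambda z: float(w @ z)
    x_adv = x.copy()
    for _ in range(steps):
        # gradient ascent on g(f(x) f(x')) with respect to x'
        s = f(x) * f(x_adv)
        grad = -1.0 / (1.0 + np.exp(s)) * f(x) * w     # d g(s) / d x'
        x_adv = x + np.clip(x_adv + lr * grad - x, -eps, eps)  # step + project
    return g(f(x) * t) + beta * g(f(x) * f(x_adv))
```

The inner loop only ever increases the second (robustness) term, so the loss with ε > 0 is at least the loss with ε = 0.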



Results

• Original: B(X, ε) regions cross decision boundaries.


• TRADES: B(X, ε) regions tend to stay within the class.
TRADES won the NeurIPS Adversarial Vision Challenge in 2018 and has been highly
influential since. As of March 2025, in RobustBench, the top method is based on TRADES.
[Link]
Note: TRADES was invented by Hongyang Zhang, CS Professor at University of
Waterloo.
Chapter 10

Neural Engineering

133
CHAPTER 10. NEURAL ENGINEERING 134

10.1 Population Coding


Goal: To learn how to systematically encode, decode, and transform data stored in the
activity of a population of neurons.

If we know what processing we want done, how do we design a network to do it effectively?


Can we build a compiler to implement a program on a neural machine?
Like for any machine, we will add a hardware abstraction layer between our data pro-
cessing and the hardware.
This approach will allow us to focus on the data that we are encoding, and the transforma-
tions applied to that data.
Black Box NNs

Input x → [ ? ] → Output y

Neural Engineering

Input x → [ f ] → [ g ] → Output z = g(f (x))

We can build a neural network that processes our data by putting together a series of
interpretable transformations, such as regression layers.
The first step is to encode data into the collective activity of a population of neurons.
Consider this simple regression network with an imaginary linear readout node, connected
to all the neurons in the hidden layer.

(Figure: input x encoded into hidden neurons h1 , . . . , hN , each connected to an imaginary readout node y through decoding weights d1 , . . . , dN .)

We call the di the decoding weights. The output is computed as:

y = h1 d1 + h2 d2 + · · · + hN dN = h · d

Suppose we want our imaginary readout node to approximate x for a range of possible input
x values. This is a regression problem in which we adjust only the decoding weights and
keep the encoding fixed.

Since we know the function the network should compute, we can create a training dataset:
• Choose a set of input values {x(p) }Pp=1
• Generate training samples of the form:

Input = x(p) , Target = x(p) (like an autoencoder)

Then we could learn the decoding weights by minimizing the loss:

Output: y (p) = h(x(p) ) · d = h(p) · d

mind E[ L(y (p) , x(p) ) ] = mind E[ L(h(p) · d, x(p) ) ]

Since it’s a regression problem, we use mean squared error (MSE) loss.
We could train this using backpropagation, but let’s look more closely at the structure of
this optimization:

mind (1/2P ) Σp=1..P ( h(p) · d − x(p) )2    (MSE Loss)

This can be rewritten as:


mind (1/2P ) ∥ [ h(1) · d ; h(2) · d ; . . . ; h(P ) · d ] − [ x(1) ; x(2) ; . . . ; x(P ) ] ∥22 = mind (1/2P ) ∥Hd − X∥22

where the matrix

H = [ h(1) ; h(2) ; . . . ; h(P ) ]    (one row per sample)

holds the activities of the hidden population, with each row corresponding to one input sample.

Optimal Linear Decoding

Given a set of P inputs stored in X and corresponding encoding activations (e.g. firing rates) stored as the rows of

H = [ h(1) ; . . . ; h(P ) ] ,

the optimal linear decoding weights d∗ solve

d∗ = arg mind ∥Hd − X∥22

and can be computed by solving the normal equations:

d∗ = (H ⊤ H)−1 H ⊤ X.
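A small NumPy example of the whole pipeline, using made-up logistic tuning curves as the encoder. Here `np.linalg.lstsq` solves the same least-squares problem as the normal equations, but more stably:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 200, 30

X = np.linspace(-1, 1, P).reshape(P, 1)        # values to encode
E = rng.normal(size=(1, N))                    # random encoders
b = rng.normal(size=N)                         # random biases
H = 1.0 / (1.0 + np.exp(-(X @ E + b)))         # hidden activities, P x N

# Optimal linear decoders (least-squares / normal-equations solution)
d_star = np.linalg.lstsq(H, X, rcond=None)[0]  # N x 1
X_hat = H @ d_star                             # decoded estimate of each x
```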

We can encode and decode vectors as well.

(Figure: a two-dimensional input (x1 , x2 ) encoded by hidden neurons h1 , h2 , h3 and decoded to outputs (y1 , y2 ).)

For example, suppose an input vector (x1 , x2 ) is encoded by a hidden layer h = σ([x1 , x2 ]E +
b). Collecting P such samples yields

H = σ(XE + b),

where X ∈ RP ×2 and H ∈ RP ×N .
If we have two output nodes, say y1 and y2 , we need two different sets of decoding weights—one
column per output:
D∗ = arg minD ∥HD − X∥2F ,

where D ∈ RN ×2 and ∥ · ∥F denotes the Frobenius norm.


Finally, we can also decode nonlinear functions of the values encoded. For instance, if the

neuron population encodes x1 , x2 , we could decode t(x1 , x2 ) = x1 x2 . All we do is solve
another least squares (LS) problem with a new target matrix T .
Let  
T = [ x1(1) x2(1) ; x1(2) x2(2) ; . . . ; x1(P ) x2(P ) ]    (one product per sample)

and D∗ = arg minD ∥HD − T ∥2F .

So long as an error (target) can be specified, the decoding weights can be learned.

A Word on Solving the Normal Equations


What if H ⊤ H is (almost) singular?
The usual solution is:
d∗ = (H ⊤ H)−1 H ⊤ X.
But this can be problematic if H ⊤ H is poorly conditioned.
If there is noise in H, say H + ε,

∥(H + ε)D − T ∥22 = ∥(HD − T ) + εD∥22 .

= (HD − T )⊤ (HD − T ) + 2(HD − T )⊤ (εD) + D⊤ ε⊤ ε D.


The middle term is zero on average, since ε is independent of HD − T . If ε⊤ ε ≈ σ 2 I, then, in expectation,

∥(H + ε)D − T ∥22 ≈ ∥HD − T ∥22 + σ 2 ∥D∥22 ,


Hence, adding a bit of noise to your activation matrix H before solving could be helpful.
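A quick Monte Carlo check of this identity. Here the noise entries are scaled so that E[ε⊤ ε] = σ² I, matching the assumption above; all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 50, 10
H = rng.normal(size=(P, N))
T = rng.normal(size=(P, 1))
D = np.linalg.lstsq(H, T, rcond=None)[0]       # fixed decoding weights

sigma, trials = 0.5, 20000
vals = []
for _ in range(trials):
    # entry variance sigma^2 / P, so that E[eps.T @ eps] = sigma^2 * I
    eps = (sigma / np.sqrt(P)) * rng.normal(size=H.shape)
    vals.append(np.sum(((H + eps) @ D - T) ** 2))
mc = np.mean(vals)                             # Monte Carlo estimate

predicted = np.sum((H @ D - T) ** 2) + sigma**2 * np.sum(D**2)
```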

10.2 Transformations
Goal: To see how to use population coding to pass data between populations of neurons.

So far, we’ve only looked at interpreting the activity of a population of neurons, decoding
its encoded value(s) to an imaginary readout node. But neurons send their output to other
neurons.

Suppose we want population B to encode the same value encoded in population A. Can we
just copy the neural activities?

(Figure: population A receives the input; population B produces the output.)

Answer: No, A and B might have different numbers of neurons.

A different approach:

To copy the value to B, we decode it from A and then re-encode it using B’s encoders.

Step-by-step Transformation Process


1. Encode x into population A:

A = σ (xEA + βA ) , A ∈ R1×N , x ∈ R1×K , EA ∈ RK×N

2. Decode the value x̂ from A:

x̂ = ADA , DA ∈ RN ×K

3. Re-encode x̂ into population B:

B = σ (x̂EB + βB ) , EB ∈ RK×M , βB ∈ R1×M



Alternatively, substituting x̂ = ADA , we get:


B = σ (ADA EB + βB )
As a result:
B = σ (AW + βB ) , where W = DA EB ∈ RN ×M

4. (Optional) Decode x̂ from B:


x̂ = BDB , DB ∈ RM ×K
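The four steps above can be sketched in NumPy with hypothetical random encoders. Note that passing A through the single connection matrix W = DA EB gives exactly the same result as decoding and re-encoding:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, M, P = 1, 30, 40, 200
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.linspace(-1, 1, P).reshape(P, K)          # values to communicate

# 1. Encode into population A and fit its decoders by least squares
EA, bA = rng.normal(size=(K, N)), rng.normal(size=N)
A = sigma(X @ EA + bA)
DA = np.linalg.lstsq(A, X, rcond=None)[0]        # N x K

# 2-3. Decode from A and re-encode into B -- or use W = DA EB directly
EB, bB = rng.normal(size=(K, M)), rng.normal(size=M)
W = DA @ EB                                      # N x M connection weights
B_two_step = sigma((A @ DA) @ EB + bB)
B_direct = sigma(A @ W + bB)
```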

Transformations with Neural Populations


The reason we connect populations of neurons is that the connections can perform transformations on the data.
Example: Given (x, y), we want to build a network that computes x · y.

(Figure: the pair (x, y) is encoded into population A (N neurons) via EA ; the product xy is decoded from A with Dxy and re-encoded into population B (M neurons) via EB , from which z is read out.)

The decoder is trained by:


Dxy = arg minD̃ ∥ AD̃ − xy ∥2

Low-Dimensional Bottleneck
Even though we can form a full connection matrix W from the product DE, it is actually
a low-rank matrix.
In the example above:

Dxy ∈ RN ×1 , EB ∈ R1×M ⇒ W = Dxy EB ∈ RN ×M (a rank-1 matrix)

This might seem like a limitation, but it actually makes things more efficient.

How many FLOPs for AW ?


A ∈ RN , W ∈ RN ×M ⇒ FLOPs = O(N M )

How many FLOPs for ADE?


A ∈ RN , D ∈ RN ×1 , E ∈ R1×M
AD ∈ R1 , FLOPs = O(N )
(AD)E ∈ RM , FLOPs = O(M )
Total FLOPs = O(N + M )

This shows why using low-rank representations (like DE) is not only compact but computationally efficient.
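The two orderings compute the same output; only the cost differs. A quick check, with N and M made up:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 800
A = rng.normal(size=(1, N))   # population activity
D = rng.normal(size=(N, 1))   # decoders
E = rng.normal(size=(1, M))   # encoders

via_W = A @ (D @ E)           # forms the N x M matrix W = DE first: O(NM) work
factored = (A @ D) @ E        # decode, then encode: O(N + M) work
```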

10.3 Dynamics
Goal: To see how to build dynamic, recurrent networks using population coding methods.

Consider a dynamic model of a leaky integrate-and-fire (LIF) neuron that contains two
time-dependent processes:

τs ds/dt = −s + C        (Current)
τm dv/dt = −v + σ(s)     (Activity)

• τs : synaptic time constant

• τm : membrane time constant

• C: input current

• σ(s): nonlinear activation function

Recall previous forms of σ(s):

(Figure: three activation functions σ(s) — the LIF response curve, the logistic function, and tanh.)

Equilibrium Solutions
Current:

τs ds/dt = −s + C    (suppose C is constant)

At equilibrium:

ds/dt = 0 ⇒ 0 = −s + C ⇒ s = C

Activity:

τm dv/dt = −v + σ(s)    (suppose s is constant)

At equilibrium:

dv/dt = 0 ⇒ v = σ(s)

Variants
Case 1: τm ≪ τs . If the neuron membranes react quickly (i.e., reach equilibrium quickly) compared to the synaptic dynamics:

τs ds/dt = −s + C
v = σ(s)

Case 2: τs ≪ τm . If the synaptic current changes quickly compared to the neuron activity:

s = C
τm dv/dt = −v + σ(s)

In steady state:

v = σ(s)

Recurrent Networks
The feedback afforded by recurrent connections can lead to interesting dynamics.
Consider a neural integrator: a fully-connected population of N neurons.


Key ideas:

• Recurrent connections: W = DE

• Identity decoding: νD ≈ y

Input Integration in Recurrent Networks


If there is no input (i.e., x = 0), then we want our population to maintain its value by
re-injecting its current state.

What about integrating the input?

We let the time dynamics be dictated by the synaptic time constant. So we set τm = 0, and
therefore:
v = σ(s)

Then:
τs ds/dt = −s + σ(s)W + β + C    (recurrent)

To get the input current C, we need to encode the input value x.

ds/dt = ( −s + νW + β ) / τs + C̃ ,   with C̃ = xE

Thus,

τs ds/dt = νW + β − s + τs xE
         = (νD + τs x)E + β − s
         = (y + τs x)E + β − s,

where:

• y is the identity re-injection term (feedback from the decoded activity),

• τs x is the input scaled by the synaptic time constant τs .



Example:

You can get the recurrent network to simulate the dynamical system:
dy/dt = f (y)
by setting the input to be f (y).

Nengo Demo: Simple Harmonic Oscillator


We define a second-order system using:
τ dx/dt = −θy
τ dy/dt = θx

This produces oscillatory (circular) behavior in the (x, y) phase plane.
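An Euler integration of this idealized system stays close to a circle in phase space (τ, θ, and the step size are arbitrary choices; the neural implementation is ignored here):

```python
import numpy as np

tau, theta = 0.1, 1.0
dt, steps = 1e-4, 20000        # simulate 2 seconds

x, y = 1.0, 0.0
for _ in range(steps):
    dx = -theta * y / tau
    dy = theta * x / tau
    x, y = x + dt * dx, y + dt * dy

radius = np.hypot(x, y)        # stays near 1 (Euler drifts slightly outward)
```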



Network Structure:

• Represent state as (x, y)


• Feedback encodes the right-hand side of the system: (−θy, θx)
• Input node scales by τs
Chapter 11

Additional Topics

147
CHAPTER 11. ADDITIONAL TOPICS 148

11.1 Biological Backprop

Goal: Implement BP-like learning in a biologically plausible way.

This topic is still an active area of research.

Neurons in the brain are chemical machines.

→ All updates must use only local information

By what physical mechanism could L influence w?

Predictive Coding (PC)


(Figure: a hierarchy in which predictions (“pred.”) flow down while errors (∆) flow up.)

• Predictions sent down the hierarchy

• Errors sent up the hierarchy

Feed-forward network:
(Figure: Layer 1 units x11 , x12 feeding Layer 2 units x21 , x22 .)

In the PC network, each neuron is split into 2 parts, as follows:



PC Node

Prediction:

µi = σ(xi+1 )M i + β i

For now, assume W i = (M i )T .

Error Node:

τ dεi /dt = xi − µi − ν i εi    (the ν i εi term is a leak)

At equilibrium:

εi = (xi − µi )/ν i
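Euler-integrating the error-node equation for fixed xi and µi (with made-up constants) confirms this equilibrium value:

```python
import numpy as np

tau, nu, dt = 0.05, 0.5, 1e-3
x = np.array([1.0, -0.3])      # activity of the layer
mu = np.array([0.2, 0.1])      # top-down prediction

eps = np.zeros(2)
for _ in range(5000):          # 5 seconds, many effective time constants
    eps = eps + (dt / tau) * (x - mu - nu * eps)
```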

Generative Network

Given dataset (x, y) and θ = {M i , W i }i=1,...,L−1 ,

maxθ E(x,y) [log p(x | y)]

p(x | y) = p(x1 | x2 ) p(x2 | x3 ) · · · p(xL−1 | y)
         = p(x1 | µ1 ) · · · p(xL−1 | µL−1 )



Assume xi ∼ N (µi , ν i ). Then

p(xi | µi ) = (1/Z) exp( −∥xi − µi ∥2 / (2(ν i )2 ) ),    where Z is a normalization constant,

− log p(xi | µi ) = (1/2) ∥(xi − µi )/ν i ∥2 + const.

As a result:

− log p(x | y) = (1/2) Σi=1..L−1 ∥εi ∥2

Hopfield Energy

F = (1/2) Σi=1..L−1 ∥εi ∥2

τ dxi /dt = −∇xi F

xi shows up in two terms in F :

1. εi = xi − µi (xi+1 ) = xi − (σ(xi+1 )M i + β i )

2. εi−1 = xi−1 − µi−1 (xi ) = xi−1 − (σ(xi )M i−1 + β i−1 )

Therefore:

∇xi F = εi (∂εi /∂xi ) + εi−1 (∂εi−1 /∂xi )
      = εi − σ ′ (xi ) ⊙ (εi−1 W i−1 ),

using W i−1 = (M i−1 )T .

Dynamics

① τ dxi /dt = σ ′ (xi ) ⊙ (εi−1 W i−1 ) − εi

② τ dεi /dt = xi − µi − ν i εi

How about learning M i ?

∇M i F = −σ(xi+1 )T εi    (an outer product)

③ dM i /dt = σ(xi+1 )T εi        ④ τ dW i /dt = (εi )T σ(xi+1 )

At equilibrium (dxi /dt = 0), we get:

εi = σ ′ (xi ) ⊙ (εi−1 W i−1 )    (This looks like BP!)

where
∂F/∂µi = −εi ,    ∂F/∂xi = . . .

Training
Clamp x1 = X and xL = Y ⇒ run to equilibrium.
xi , εi reach equilibrium quickly.
Then ③ and ④ use that equilibrium to update M i and W i .

Generating
Clamp xL = Y ⇒ run to equilibrium ⇒ x1 is generated sample.

Inference (approx.)
Clamp x1 = X ⇒ run to equilibrium ⇒ arg maxj (xLj ) is the class.

How does this work overcome the local learning condition?


Answer: Running to equilibrium allows information to spread through the net.

11.2 Generative Adversarial Networks


Goal: To see a game-theoretic approach to generating new input samples.

A Generative Adversarial Network (GAN) consists of two networks:

• Generator (G): Receives noise Z as input and generates data that tries to mimic the
real data distribution.

• Discriminator (D): Tries to distinguish between real and fake data produced by the generator. The output is a probability y, where a value close to 1 indicates “real” and a value close to 0 indicates “fake.”

The GAN is trained through a min-max game where the generator tries to maximize the
probability of the discriminator making a mistake, and the discriminator tries to minimize
this probability.

Cost Function of GANs


The cost function for GANs involves a sigmoid output that represents the probability of real and fake data:

C(θD , θG ) = −(1/2) Exreal ∼pdata [log DθD (xreal )] − (1/2) Ez∼pz [log(1 − DθD (GθG (z)))],

where the first term corresponds to the negative log probability of the real datapoints and the second term corresponds to the negative log probability of the fake datapoints, with xfake = G(z) for noise z. Note that θD and θG are the parameters of the discriminator and the generator, respectively. Additionally:

• D(xreal ) is the discriminator’s estimate of the probability that real data is real.

• D(xfake ) is the discriminator’s estimate of the probability that fake data is real.
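The cost can be computed directly from the discriminator's outputs on a batch of real and a batch of fake samples; at the equilibrium D ≈ 0.5 it equals log 2:

```python
import numpy as np

def gan_cost(d_real, d_fake):
    """GAN cost from discriminator outputs (probabilities in (0, 1))."""
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))
```

A sharp discriminator (D(xreal) near 1, D(xfake) near 0) achieves a lower cost than the equilibrium value.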

Training Strategy
Training can be seen as the min–max game

maxθG minθD C(θD , θG ).

In practice this is implemented by alternating gradient descent on the discriminator parameters and gradient ascent on the generator parameters:

θD ← θD − ηD ∇θD C(θD , θG ),    θG ← θG + ηG ∇θG C(θD , θG ).

The discriminator D aims to minimize the cost function C, ideally such that D(xreal ) = 1
and D(xfake ) = 0. At equilibrium, the generator makes fake samples indistinguishable from
real ones, so
D(xreal ) ≈ D(xfake ) ≈ 0.5.

Relation to untargeted adversarial attacks. For a classifier fθ (x) with loss ℓ(fθ (x), y), an untargeted adversarial attack solves

xadv = arg maxx′ ∈B(x,ϵ) ℓ(fθ (x′ ), y),

often via gradient ascent on the input:

x ← x + α∇x ℓ(fθ (x), y).

Adversarial training (such as in TRADES) uses the min–max objective

minθ maxx′ ∈B(x,ϵ) ℓ(fθ (x′ ), y).

This has the same structure as the GAN game

minθD maxθG C(θD , θG ),

where the discriminator DθD plays the role of the classifier fθ and the generator GθG (z) plays the role of a learned adversary that searches (by gradient ascent on θG ) for inputs xfake = GθG (z) that maximally confuse the discriminator.

Intuition and Training Phases


The training process can be described as follows:
1. Initially, the discriminator D can easily distinguish between fake and real data points.
2. After sufficient training, the generator G improves, and the discriminator’s task becomes more challenging.

3. Ideally, the generator produces data indistinguishable from real data, and the discriminator assigns a probability of 0.5 to both real and fake data.

The GAN is trained until the discriminator can no longer reliably distinguish between real
and generated data. So the generator G has, in principle, the winning strategy. These steps
are illustrated as follows:

Example: SVHN (Street View House Numbers)

Example: MNIST

Example: Pictures of faces which do not exist


[Link]

If you want to get more intuition, you can play with GANs online on this link.

11.3 Epilogue: Ethics in Neural Networks and AI


We have now reached the end of this course on neural networks. We have spent most of
our time developing models, training algorithms, and applications, focusing on how to make
these systems work well in practice.
The story of neural networks and AI does not end with accuracy curves, benchmarks, or clever
architectures. As researchers, engineers, and scientists, we also have a social responsibility
to think carefully about how these tools are used, who they might harm, who they might
benefit, and what kinds of futures they help create.
The goal of this epilogue is not to provide a complete ethics curriculum, but to highlight a
few key perspectives and resources. These references can help you start forming your own
informed, critical view about the ethical use of neural networks and AI, including questions
of bias, misinformation, environmental impact, and power imbalances.

Ethical Use of Neural Networks and AI: Selected References


Talk: Philosophy, Machine Learning, and Quantum Context
• Skorburg, J. A. (2023). Ethics PROBES: Artificial Intelligence, Machine Learning,
& Quantum Computing. Perimeter Institute (PIRSA:23060042).
Overview-style talk on ethical issues around AI, machine learning, and quantum tech-
nologies.
Available at: [Link]

Critical Paper on Large Language Models


• Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021).
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings
of ACM FAccT 2021.
Classic critique of large language models discussing data provenance, documentation
and consent, environmental costs, and the limits of “stochastic parrots” as models of
language understanding.
Preprint (open access): [Link]

Investigative Journalism on Misleading AI-Generated Ads


• CBC Marketplace. AI-generated misleading ads on online platforms. [Video].
YouTube.
Investigative report showing how AI tools are used to create deceptive or misleading
online advertisements, illustrating real-world harms related to deepfakes, scam cam-
paigns, and gaps in platform governance.
Available at: [Link]

11.4 Attention mechanism


Goal: To learn the attention mechanism used in Transformer architectures – the
building block of large language models (LLMs) such as ChatGPT.

Historical context
Transformers became famous after the publication of the paper Attention Is All You Need (2017): [Link]

Main highlights:
• Transformers enabled state-of-the-art results in Natural Language Processing.
• Transformers are based on the attention mechanism.

Tokenization (vector embedding)


Assume X is an embedding of a French sentence, “Je suis étudiant”, and Y is an embedding of its English translation, “I am a student”. Then:

X ∈ R3×d ,    Y ∈ R4×d

Each word is converted to a vector of size d using an embedding.

Input Sequence and Projections


Consider an input sequence X:
X = (⃗x1 , ⃗x2 , . . . , ⃗xn )

From this, we compute:

Queries: ⃗qi = ⃗xi W (Q) ∈ Rℓ

Keys: ⃗ki = ⃗xi W (K) ∈ Rℓ

Values: ⃗vi = ⃗xi W (V ) ∈ Rℓ

with W (Q) , W (K) , W (V ) ∈ Rd×ℓ , where ℓ is a hyperparameter and ⃗xi ∈ Rd .


Motivation: (YouTube search example)

• Query: Text on search bar


• Keys: Video titles and description
• Values: Video matches

Vectorization

Q = XW (Q)    (n × d · d × ℓ = n × ℓ)
K = XW (K)    (n × d · d × ℓ = n × ℓ)
V = XW (V )    (n × d · d × ℓ = n × ℓ)

We can use these Q, K, V to compute self-attention.

Computing Attention Scores


First, we need to compute the attention of ⃗qi on ⃗kj :

Sij = ⃗qi · ⃗kj , j = 1, . . . , n

Sij is interpreted as vector i’s score for vector j. This gives us a full score matrix:

S = QK T    (n × ℓ · ℓ × n = n × n)

Softmax and Self-Attention Output


We use softmax (on the second index / over rows) to obtain attention scores:

 
A = Softmax( S/√d )    (√d is for normalization purposes)
As a result, we obtain
A ∈ Rn×n (where rows sum to 1)

Output of self-attention:
H = A · V ,    H ∈ Rn×ℓ ,
where H is an attention head. Alternatively, for each token output:
H⃗ i = Σj=1..n Aij ⃗vj

• Aij : Attention score of query ⃗qi on key ⃗kj


• ⃗vj : Value associated with input j.
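One attention head, assembled in NumPy. The projection matrices here are random stand-ins for learned parameters, and the √d scaling follows the notes:

```python
import numpy as np

def self_attention(X, WQ, WK, WV):
    """Single attention head for a sequence X of shape (n, d)."""
    Q, K, V = X @ WQ, X @ WK, X @ WV                # each n x l
    S = (Q @ K.T) / np.sqrt(X.shape[1])             # scaled scores, n x n
    A = np.exp(S - S.max(axis=1, keepdims=True))    # numerically stable
    A = A / A.sum(axis=1, keepdims=True)            # row-wise softmax
    return A @ V, A                                 # head H (n x l), scores A
```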

Positional Encoding

Let us consider these examples:

X = I am not sad I am happy

X ′ = I am not happy I am sad

These sentences have the same words, thus the same attention head, but they have different
meanings. ⇒ We need to impose order on the inputs.

⃗xi ⇒ ⃗x′i = ⃗xi + PE(i)

Positional Encoding Definition:

   
PE(i)2j = sin( i / 10000^{2j/d} )        PE(i)2j+1 = cos( i / 10000^{2j/d} )

• PE(i) is the positional encoding vector for position i

• Frequency (or wavelength) changes with dimension j

• 10000 is a scaling constant to allow coverage of a large max sequence length

Here is an example of position encoding for d = 256, sequence length n = 100.



Summary of the computations performed in one attention head

Multi-Head Attention

Multi-Head Attention allows the model to jointly attend to information from different representation subspaces. Each head can learn distinct aspects or features (e.g., job, age, address, . . . ), enabling the model to capture various semantic aspects simultaneously. The mechanism can be formally expressed as follows:

Multi-Head Attention = Concat( H (1) , H (2) , . . . , H (h) ) WO ∈ Rn×dmodel ,

where Concat(H (1) , . . . , H (h) ) ∈ Rn×(hℓ) , WO ∈ R(hℓ)×dmodel , and each H (µ) is computed independently in its own attention head µ. Here, WµQ , WµK , and WµV are the learned projection matrices for queries Q, keys K, and values V in head µ. The number of heads is typically denoted by h.

Further Reading and Exploration


To dive deeper and gain an intuitive understanding of Transformers, including detailed ex-
planations, visualizations, and interactive examples, students are highly encouraged to visit
the following resource:
[Link]

11.5 Transformers
Goal:

• Batch and layer normalization.

• Learn about the Transformer architecture.

• Introduction to large language models.

Batch Normalization
Consider a dataset in which you are trying to estimate if someone is a vegetarian from their
age and income.
Age (years) | Income ($) | Veg?
27          | 31,000     | yes
52          | 120,000    | no
16          | 10,000     | yes
...         | ...        | ...
Most of the variance in this dataset is along the income axis (income values vary over a
much larger range than ages).

Of course, the connection weights in a neural network can accommodate these differences in
scale, but doing so will typically result in weights with very different magnitudes. This, in
turn, forces us to use a small learning rate to keep training stable.

Batch normalization on a single neuron


Consider a particular neuron (or feature) indexed by i and a mini-batch of size D. Let hi(d) denote the activation of neuron i on example d, for d = 1, . . . , D.

The mini-batch mean of neuron i is µi = (1/D) Σd=1..D hi(d) .

The corresponding mini-batch variance is σi2 = (1/D) Σd=1..D ( hi(d) − µi )2 .

We then normalize each activation: ĥi(d) = (hi(d) − µi )/σi or, more stably, ĥi(d) = (hi(d) − µi )/√(σi2 + ε) for a small ε > 0 to avoid division by zero.

Layer Normalization (LN) in Transformers


We can also perform layer normalization: y = LN(x). This is similar to Batch Normalization
(BN), except that in BN, you normalize the input to (or output of) a single neuron over all
examples in a mini-batch, while in layer normalization (LN) you normalize the inputs to a

layer for each sample individually. In other words, BN normalizes across the batch for each
neuron, whereas LN normalizes across neurons/features for each sample.

For a hidden vector ⃗h ∈ RH we first compute the mean µ and variance σ 2 of its coordinates, and form the normalized activations ⃗ĥ = (⃗h − µ)/√(σ 2 + ε), for a small ε > 0. Layer norm is then defined as LN(⃗h) = α ⊙ ⃗ĥ + β, where α, β ∈ RH are learned scale and shift parameters and ⊙ denotes elementwise multiplication.
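LN for a single sample can be sketched in NumPy (in practice α and β are learned):

```python
import numpy as np

def layer_norm(h, alpha, beta, eps=1e-5):
    """Normalize one sample's features to zero mean / unit variance, then scale and shift."""
    h_hat = (h - h.mean()) / np.sqrt(h.var() + eps)
    return alpha * h_hat + beta
```

With α = 1 and β = 0, features on wildly different scales come out comparable.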

Effect of different feature scales


Consider a hidden layer with two coordinates h1 and h2 whose values live on very different
scales: h1 = 5, h2 = 5000. A neuron’s pre-activation might compute z = w1 h1 + w2 h2 + b. To
keep z in a reasonable magnitude, the weight w2 must be extremely small compared to w1 .
This difference in scales makes learning with gradient descent more difficult. LN rescales
h1 and h2 so that they are comparable, allowing the corresponding weights to have similar
magnitudes.

Add and Norm module

For example, after a multi-head attention (MHA) block in a Transformer, we use

y = LN( x + MHA(x) ).

We prefer LN over BN in Transformers because the sequence length (e.g. the number of words in a sentence) and the batch size are often variable (the batch can even be a single sequence at inference time), and LN does not depend on batch statistics.

Feed-Forward Layer

Inside each encoder (and decoder) block, there is a position-wise feed-forward network (FFN).
The FF layer simply applies weights and biases independently to each position of the se-
quence.
Concretely, for an input vector x at a given position, the FFN typically has the form f (x) =
W2 ϕ(W1 x+b1 )+b2 , where W1 , W2 are weight matrices, b1 , b2 are biases, and ϕ is a nonlinearity
such as ReLU or GELU. The same parameters are shared across all positions, but each
position is processed independently.
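A position-wise FFN, sketched in NumPy with row-vector conventions (one row of X per position); the parameters are random stand-ins:

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """f(x) = W2 phi(W1 x + b1) + b2 applied to each row of X (phi = ReLU)."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
```

“Position-wise” means transforming one row alone gives the same answer as transforming it inside the full sequence.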

Decoder

The decoder component is made up of many of the same parts as the encoder:
• masked multi-head self-attention;
• add & norm;

• cross-attention over the encoder outputs;

• feed-forward layer;

• another add & norm.

However, after the first block of masked multi-head attention (MHA), the decoder receives
the keys and values from the encoder. Intuitively, while generating a translated sentence
(for example, English) the decoder can ask the encoder: “Given the French sentence, which
source words should I focus on now for this output token?” The cross-attention mechanism
answers this question by attending over the encoder representations.

Example: Machine Translation with a Transformer

Consider a French input sentence such as Je suis étudiant. After tokenization and em-
bedding (plus positional encoding), the sequence is passed through the encoder stack. The
encoder produces a sequence of hidden vectors that are used as keys and values for the
decoder’s cross-attention.

On the decoder side, we generate the English sentence I am a student . <end> one
token at a time. The decoder is autoregressive: at step t it can only attend to positions
1, . . . , t in the output sequence (future tokens are masked).
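This masking can be sketched with a lower-triangular mask over the attention scores (single head, uniform made-up scores, for illustration only):

```python
import numpy as np

T = 5
scores = np.zeros((T, T))                    # pretend raw attention scores (all equal)
mask = np.tril(np.ones((T, T), dtype=bool))  # position t may see positions 1..t
scores[~mask] = -np.inf                      # block attention to future tokens

# Row-wise softmax; masked (future) positions receive exactly zero weight.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(weights[1])  # step 2 attends only to the first two tokens: [0.5, 0.5, 0, 0, 0]
```

Setting masked scores to -inf before the softmax is the standard trick: exp(-inf) = 0, so the normalization distributes all weight over the visible prefix.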

Let y1 , . . . , yT be the target tokens. At training time, the decoder operates roughly as follows:

1. Step 1: input is a start or masked token; the decoder predicts y1 = I.

2. Step 2: the visible prefix is I and the rest of the positions are masked; the decoder
predicts y2 = am.

3. Step 3: the visible prefix is I am; the decoder predicts y3 = a.

4. Step 4: the visible prefix is I am a; the decoder predicts y4 = student.

5. Step 5: the visible prefix is I am a student; the decoder predicts y5 = . or an <end> token.

At each step, the decoder combines:

• masked self-attention over the previously generated target tokens, and

• cross-attention to all encoder outputs (keys and values).

The final linear layer followed by a softmax transforms the decoder’s hidden state at each
position into a categorical distribution over the vocabulary. Concretely, the hidden
representation hi is mapped to logits (softmax pre-activations) W hi + b, and applying a softmax
produces a valid probability distribution. This yields the conditional probability pθ (xi |
x<i ) = Softmax(W hi + b), which appears inside the negative log-likelihood (NLL) loss

L = E_D [ − Σ_{i=1}^{N} log pθ (xi | x<i ) ],

where N is the sentence length and the expectation E_D is taken over sentences drawn from
the dataset D.
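As a small numerical sketch, with a toy five-word vocabulary and made-up logits (not trained values):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["I", "am", "a", "student", "<end>"]
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])  # W h_i + b for one position (made up)
p = softmax(logits)
target = vocab.index("I")
nll = -np.log(p[target])   # this token's contribution to the loss L
print(round(nll, 3))
```

Training drives the logit of the correct token upward, which raises p[target] and shrinks its -log p contribution to the loss.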

What is the benefit of a Transformer?


The key advantage of the Transformer is the attention mechanism. Attention shortens the
path length between related inputs in the sequence. In an RNN, information must flow
through O(T ) recurrent steps to connect token 1 and token T , whereas in a self-attention
layer each output vector can attend directly to every input position in a single step.

For an attention layer, every output vector contains information aggregated from all vectors
in the sequence. This makes it much easier to model long-range dependencies and enables
highly parallel computation over all sequence positions.
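A minimal single-head self-attention sketch (with random illustrative weights) makes this concrete: the attention matrix connects every pair of positions, so each output row mixes all T inputs in one step:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
X = rng.standard_normal((T, d))              # one input vector per position
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                # (T, T): every pair of positions interacts
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)            # attention weights, rows sum to 1
Y = A @ V                                    # each output mixes all T value vectors

print(A.shape, bool((A > 0).all()))          # token 1 reaches token T in a single step
```

Every entry of A is strictly positive here, so the first and last tokens influence each other directly, instead of through O(T) recurrent steps.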

Brief introduction to large language models


Large language models (LLMs) are typically built from stacked Transformer blocks and
trained on very large text corpora.
• GPT = Generative Pre-trained Transformer.
• GPT-3 (as an example of a large GPT model):
– about 175B parameters,
– 96 Transformer layers,
– trained on roughly 300B tokens using a next-token negative log-likelihood (NLL)
loss,
– followed by reinforcement learning from human feedback (RLHF) to better align
outputs with human preferences.
In class, we will do a short “demo” of GPT-type architectures [Link]

Environmental concerns
Training very large language models is computationally and financially expensive, and it also
has a non-negligible environmental footprint.
• Training cost on the order of 4-5 million USD.
• Estimated emissions of about 550 tons of CO2 for a single large training run.
To give some intuition, 550 tons of CO2 is roughly comparable to
• the emissions generated by driving around the Earth on the order of 100 times, or
• the amount of CO2 that would need to be absorbed by approximately 10,000 to 100,000
trees over multiple years.
Because of these costs, architectures such as recurrent neural networks (RNNs) (for
example, minGRU: [Link]) and other lightweight models can sometimes offer cheaper and
more sustainable alternatives.
