Neural Networks
Lecture Notes
January 4, 2026
Contents

1 Networks of Neurons
  1.1 Neurons
  1.2 Simpler Neuron Models
  1.3 Synapses

2 Neural Learning
  2.1 Learning
  2.2 Universal Approximation Theorem
  2.3 Loss Functions
  2.4 Gradient Descent Learning
  2.5 Error Backpropagation

3 Automatic Differentiation
  3.1 Automatic Differentiation
  3.2 Neural Networks with Auto-Diff

4 Generalizability
  4.1 Overfitting
  4.2 Combatting Overfitting

5 Optimization Considerations
  5.1 Enhancing Optimization
  5.2 Deep Neural Networks

6 Vision
  6.1 Your Visual System
  6.2 Convolutional Neural Networks (CNNs)

7 Unsupervised Learning
  7.1 Hopfield Networks
  7.2 Restricted Boltzmann Machines (RBMs)
  7.3 Autoencoders
  7.4 Vector Embeddings
  7.5 Variational Autoencoders
Chapter 1

Networks of Neurons
1.1 Neurons
Goal: To see the basics of how a neuron works, in the form of the Hodgkin-Huxley neuron
model.
The human brain is estimated to contain roughly 86 billion neurons. Each neuron can form thousands of connections with other neurons; this enormous web of connectivity underlies human intelligence.
To understand how a neuron works, we will examine the Hodgkin-Huxley neuron model more closely. A neuron is a special cell that can send and receive signals from other neurons. It can be quite long, sending its signal over a long distance: in humans, an axon can be up to about a metre long! But most are much shorter.
Figure 1.1: Diagram of a neuron with key components labeled: dendrites, soma (body), and
axon.
Figure 1.2: Illustration of ion channels and sodium-potassium pump in the cell membrane
of the axon.
Sodium-Potassium Pump

The sodium-potassium pump exchanges 3 Na+ ions inside the cell for 2 K+ ions outside the cell. This process causes a higher concentration of Na+ outside the cell, and a higher concentration of K+ inside the cell. It also creates a net positive charge outside, and thus a net negative charge inside, the cell. This difference in charge across the membrane induces a voltage difference, called the membrane potential V, between the inside and the outside of the neuron, which is usually around −70 mV. In this regard, the sodium-potassium pump maintains the resting membrane potential of the neuron. Also, note that this pump does not play a role in the change of the membrane potential; we will see in the Hodgkin-Huxley model that this voltage change depends on several factors, including the sodium (Na+) channels and the potassium (K+) channels.
Action Potential
Neurons have a peculiar behavior: they can produce a spike of electrical activity called an
action potential. This electrical burst travels along the neuron’s axon to its synapses,
where it passes signals to other neurons. The water dump bucket experiment is a great analogy for this phenomenon.
Hodgkin-Huxley Model
Alan Lloyd Hodgkin and Andrew Fielding Huxley received the Nobel Prize in Physiology or
Medicine in 1963 for their model of an action potential (spike). Their model is based on the
nonlinear interaction between membrane potential (voltage) and the opening and closing of
Na+ and K+ ion channels.
\[ \frac{dn}{dt} = \frac{1}{\tau_n(V)} \big( n_\infty(V) - n \big). \]

Here n is a variable associated with the probability that a subunit of the potassium channel is activated. Since there are 4 subunits in the K+ channel, the probability of all subunits being open is n(t)^4. As n increases, more potassium channels open, allowing potassium ions to flow out of the cell.
The fraction of Na+ ion channels open is proportional to m(t)^3 h(t), where:

\[ \frac{dm}{dt} = \frac{1}{\tau_m(V)} \big( m_\infty(V) - m \big), \qquad \frac{dh}{dt} = \frac{1}{\tau_h(V)} \big( h_\infty(V) - h \big). \]
In the following plot (Fig. 1.3), you can observe the typical behavior of the gating parameters n∞(V), m∞(V), and h∞(V) as functions of the membrane potential V.
Figure 1.4: Typical behavior of the action potential which travels through the axon. Credit:
Wikipedia.
Intuitively, once the input current coming from other neurons pushes the membrane potential past a threshold of around −55 mV, the Na+ channels become active, as illustrated in Fig. 1.3. The Na+ channel activation leads to the so-called depolarization phase: Na+ ions are pushed into the cell, which causes a sudden increase in the membrane potential V. The Na+ channels become effectively inactive around the maximum voltage, as illustrated in Fig. 1.3.
The repolarization phase is driven by the activation of the K+ channels, as illustrated by the increase in the gating variable n in Fig. 1.3. These channels let K+ ions flow out of the cell, which leads to a sudden drop in voltage, to a level below the threshold voltage. After a refractory period of characteristic duration τ_ref, the voltage returns to around V = −70 mV. These steps are illustrated in Fig. 1.4.
An important observation is that the maximum voltage in Fig. 1.4 is independent of the input current; however, the strength of the input current influences how frequently the neuron fires action potentials. This observation will be relevant in the coming lectures.
It is also important to note that the sodium-potassium pump is not explicitly represented
in the Hodgkin-Huxley equation. It operates on a much slower timescale than the rapid
sodium and potassium channel dynamics that underlie action potential generation.
Remarks:
• Try to observe what happens when the input current is: Negative, Zero, Slightly
positive, or Very positive.
• Here is also a recommended video to watch that illustrates the electrical nature of the
human brain: Video.
1.2 Simpler Neuron Models

In this lecture, we take a careful look at simpler neuron models. The Hodgkin-Huxley (HH) model is itself already greatly simplified:
• A neuron is treated as a point in space.
• Conductances are approximated with formulas.
• Only considers K+ , Na+ , and generic leak currents.
However, modeling a single action potential (spike) takes many time steps of this 4-D differential-equation system. Spikes are fairly generic, and it is thought that the presence of a spike is more important than its specific shape. This motivates the leaky integrate-and-fire (LIF) model:

\[ C \frac{dV}{dt} = J_{\text{in}} - g_L (V - V_L), \]

where:

• C: capacitance
• g_L = 1/R: conductance
• J_in: input current
• V_L: resting potential
Using Ohm's law (V = IR), we can rewrite it as

\[ RC \frac{dV}{dt} = R J_{\text{in}} - (V - V_L). \]

Defining the membrane time constant τ_m = RC and the effective input voltage V_in = R J_in, this becomes

\[ \tau_m \frac{dV}{dt} = V_{\text{in}} - (V - V_L) \quad \text{for } V < V_{\text{th}}. \]
Using the change of variables

\[ v = \frac{V - V_L}{V_{\text{th}} - V_L}, \qquad v_{\text{in}} = \frac{V_{\text{in}}}{V_{\text{th}} - V_L}, \]

the equation becomes

\[ \tau_m \frac{dv}{dt} = v_{\text{in}} - v \quad \text{for } v < 1. \tag{1.1} \]

We integrate the differential equation for a given input current until v reaches the threshold value of 1.
The figure below illustrates how we record a spike at time ti . When the membrane potential
v crosses the threshold (v = 1), a spike is recorded and the membrane potential is reset to
0. After a spike, a refractory period τref follows before the membrane potential can start
integrating again.
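The integrate-and-reset loop described above is straightforward to sketch numerically; here is a minimal Euler-method sketch (the time constants and step size below are illustrative choices, not necessarily the course's values):

```python
import numpy as np

def simulate_lif(v_in, tau_m=0.02, tau_ref=0.002, dt=0.001, T=1.0):
    """Euler-integrate tau_m dv/dt = v_in - v; spike and reset when v crosses 1."""
    v, refrac = 0.0, 0.0
    spike_times = []
    for step in range(int(T / dt)):
        if refrac > 0.0:
            refrac -= dt                   # sit out the refractory period
            continue
        v += dt * (v_in - v) / tau_m       # Euler step of Eq. (1.1)
        if v >= 1.0:
            spike_times.append(step * dt)  # record a spike at this time
            v = 0.0                        # reset the membrane potential
            refrac = tau_ref               # start the refractory period
    return spike_times

print(len(simulate_lif(2.0)), "spikes in 1 s")   # supra-threshold input fires
print(len(simulate_lif(0.5)), "spikes in 1 s")   # sub-threshold input never fires
```

Note how a constant sub-threshold input (v_in < 1) lets v relax toward v_in without ever reaching the threshold.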
Suppose we hold the input, v_in, constant. We can solve the DE analytically between spikes.

Claim: Starting from v(0) = 0, the solution of (1.1) is

\[ v(t) = v_{\text{in}} \left( 1 - e^{-t/\tau_m} \right), \]

which relaxes toward v_in.

[Figure: v(t) rising toward the constant v_in, with the threshold v = 1 marked.]
It can be shown that the steady-state firing rate for constant v_in is:

\[ G(v_{\text{in}}) = \begin{cases} \dfrac{1}{\tau_{\text{ref}} - \tau_m \ln\!\left(1 - \frac{1}{v_{\text{in}}}\right)} & \text{for } v_{\text{in}} > 1, \\[2ex] 0 & \text{otherwise.} \end{cases} \]

You are asked to derive this result in Assignment 1.
Typical Values for Cortical Neurons
The steady-state firing rate can be visualized as a function of the input v_in, using typical values for cortical neurons.

[Figure: firing rate (Hz, 0–100) versus input v_in (0–3).]
The previous graph is known as the tuning curve because it tells us how a neuron reacts to different input currents. Note that the tuning curve saturates at 1/τ_ref as v_in → ∞, which means that the neuron's firing rate cannot exceed the frequency set by the refractory period.
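The firing-rate formula and its saturation are easy to check numerically; this sketch assumes illustrative values τ_ref = 2 ms and τ_m = 20 ms (not necessarily the values behind the plot above):

```python
import numpy as np

def firing_rate(v_in, tau_ref=0.002, tau_m=0.02):
    """Steady-state LIF firing rate G(v_in) for a constant input."""
    if v_in <= 1.0:
        return 0.0   # sub-threshold input: the neuron never fires
    return 1.0 / (tau_ref - tau_m * np.log(1.0 - 1.0 / v_in))

print(firing_rate(1.5))    # moderate input, moderate rate
print(firing_rate(1e6))    # saturates near 1 / tau_ref = 500 Hz
```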
Artificial Neurons
As we’ve seen, the activity of a neuron is very low, or zero, when the input is low, and the
activity increases and approaches some maximum as the input increases.
The previous tuning curve motivates why we can represent an artificial neuron by a number
that represents its activity. Biologically speaking, this has to be a real positive number but
in practice, we can also consider negative values for the activity of artificial neurons. The
general behavior of a neuron activity can be modeled by a number of different activation
functions.
Logistic Curve

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

[Figure: logistic curve, rising from 0 to 1.]
Arctan

\[ \sigma(z) = \arctan(z) \]

[Figure: arctangent curve, ranging from −π/2 to π/2.]
Hyperbolic Tangent

\[ \sigma(z) = \tanh(z) \]

[Figure: tanh curve, ranging from −1 to 1.]
Threshold

\[ \sigma(z) = \begin{cases} 0, & \text{if } z < 0, \\ 1, & \text{if } z \geq 0. \end{cases} \]

[Figure: unit step function.]
ReLU

\[ \text{ReLU}(z) = \max(0, z) \]

[Figure: ReLU, zero for z < 0 and linear for z > 0.]
Softplus

\[ \text{Softplus}(z) = \log(1 + e^z) \]

[Figure: softplus, a smooth approximation of ReLU.]
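The activation functions above are one-liners in NumPy; a small sketch (the function names here are our own):

```python
import numpy as np

def logistic(z):  return 1.0 / (1.0 + np.exp(-z))
def relu(z):      return np.maximum(0.0, z)
def softplus(z):  return np.log1p(np.exp(z))        # log(1 + e^z) via log1p
def threshold(z): return np.where(z >= 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 2.0])
print(logistic(z))    # squashes into (0, 1)
print(relu(z))        # [0. 0. 2.]
print(np.tanh(z))     # squashes into (-1, 1)
```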
SoftMax

SoftMax is like a probability distribution (or probability vector), so its elements add to 1. If \vec{z} is the drive (input) to a set of neurons, then:

\[ \text{SoftMax}(\vec{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}. \]

Then, by definition:

\[ \sum_i \text{SoftMax}(\vec{z})_i = 1. \]
Example:

\[ \vec{z} = [0.6, 3.4, -1.2, 0.05] \xrightarrow{\text{SoftMax}} \vec{y} = [0.06, 0.90, 0.009, 0.031] \]

[Figure: bar charts of the input \vec{z} and the SoftMax output \vec{y}.]
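The SoftMax example above can be reproduced in a few lines of NumPy. Subtracting the maximum before exponentiating is a standard numerical-stability trick (our addition, not from the notes); it leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max to avoid overflow
    return e / e.sum()

z = np.array([0.6, 3.4, -1.2, 0.05])
y = softmax(z)
print(y.round(3))    # roughly [0.055, 0.904, 0.009, 0.032]
print(y.sum())       # 1.0
```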
Argmax

Argmax is the limiting case of SoftMax, where only the largest element remains nonzero (set to 1), while the others are set to zero.

Example:

\[ \vec{z} = [0.6, 3.4, -1.2, 0.05] \xrightarrow{\text{Argmax}} \vec{y} = [0, 1, 0, 0] \]
[Figure: bar chart of the Argmax output \vec{y}.]
1.3 Synapses
Goal: To get an overview of how neurons pass information between them, and how we
can model those communication channels.
So far, we’ve just looked at individual neurons, and how they react to their input. But that
input usually comes from other neurons. When a neuron fires, an action potential (the wave
of electrical activity) travels along its axon.
The junction where one neuron communicates with the next neuron is called a synapse.
Note that the neurons are separated by a microscopically small gap called the synaptic cleft, which is around 20–50 nm wide.
Post-Synaptic Current
Even though an action potential is very fast, the synaptic processes by which it affects the next neuron take time. Some synapses are fast (taking about 10 ms), and some are quite slow (taking over 300 ms). If we represent that time constant using τ_s, then the current entering the post-synaptic neuron can be written:

\[ h(t) = \begin{cases} k\, t^n e^{-t/\tau_s} & \text{if } t \geq 0 \quad (\text{for some } n \in \mathbb{Z}_{\geq 0}), \\ 0 & \text{if } t < 0. \end{cases} \]

The function h(t) is called a Post-Synaptic Current (PSC) filter or, in keeping with the ambiguity between current and voltage, a Post-Synaptic Potential (PSP) filter.
Multiple spikes form what we call a "spike train," which can be modeled as a sum of Dirac delta functions; for example, for three spikes at times t_1, t_2, t_3:

\[ a(t) = \sum_{p=1}^{3} \delta(t - t_p). \]
How does a whole spike train affect the post-synaptic current? You simply add together all the PSC filters, one for each spike. This amounts to convolving the spike train with the PSC filter:
That is,

\[
\begin{aligned}
s(t) &= (a * h)(t) \\
&= \Big[ \textstyle\sum_p \delta(t - t_p) \Big] * h(t) \\
&= \int \textstyle\sum_p \delta(\tau - t_p)\, h(t - \tau)\, d\tau && \text{(convolution)} \\
&= \textstyle\sum_p \int \delta(\tau - t_p)\, h(t - \tau)\, d\tau \\
&= \textstyle\sum_p h(t - t_p),
\end{aligned}
\]

which is the sum of PSC filters, one for each spike, also known as the filtered spike train.
[Figure: post-synaptic current for a random spike train with τ_s = 0.1, n = 1. The PSC captures some information about the pre-synaptic spike history.]
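A filtered spike train can be computed directly from the sum-of-filters formula; a minimal sketch (the helper names and the k = 1 scaling are our own choices):

```python
import numpy as np

def psc_filter(t, tau_s=0.1, n=1, k=1.0):
    """PSC filter h(t) = k t^n exp(-t/tau_s) for t >= 0, and 0 for t < 0."""
    t = np.asarray(t, dtype=float)
    h = np.zeros_like(t)
    pos = t >= 0
    h[pos] = k * t[pos]**n * np.exp(-t[pos] / tau_s)
    return h

def filtered_spike_train(t, spike_times, tau_s=0.1, n=1):
    """s(t) = sum_p h(t - t_p): one PSC filter per spike."""
    return sum(psc_filter(t - tp, tau_s, n) for tp in spike_times)

t = np.linspace(0, 2, 2001)
s = filtered_spike_train(t, [0.2, 0.5, 0.55])   # three spikes
print(s.max() > 0)   # the current is zero before the first spike, then rises
```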
More specifically, if we plot the asymptotic value of the post-synaptic current (PSC) against the pre-synaptic firing rate, we find an approximately linear relationship between the two.
Connection Weight
The total current induced by an action potential onto a particular post-synaptic neuron can vary widely, depending on several factors. We can combine all those factors into a single number, the connection weight. Thus, the total input to a neuron is a weighted sum of filtered spike trains.

[Figure: neurons A and B connecting to neuron C with connection weights w_CA and w_CB.]
Weight Matrices
When we have many pre-synaptic neurons, it is more convenient to use matrix-vector nota-
tion to represent the weights and activities.
• Population X has N nodes,
• Population Y has M nodes.

If every node in X sends its output to every node in Y, then we will have a total of N × M connections, each with its own weight.

[Figure: fully connected populations X (nodes x_1, x_2) and Y (nodes y_1, y_2, y_3), with weights w_{ij}.]
\[ \vec{z} = \vec{x} W + \vec{b}. \]

Thus,

\[ \vec{y} = \sigma(\vec{z}) = \sigma(\vec{x} W + \vec{b}). \]
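The weighted-sum-plus-activation step is a single matrix-vector product in NumPy; a sketch with random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 2                      # population X has 3 nodes, Y has 2
W = rng.normal(size=(N, M))      # one weight per connection
b = np.zeros(M)                  # one bias per node in Y
x = rng.normal(size=N)           # activities of population X

z = x @ W + b                    # input currents: z = x W + b
y = 1.0 / (1.0 + np.exp(-z))     # y = sigma(z), logistic activation
print(y.shape)                   # (2,)
```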
Bias Representation
Another way to represent the biases, \vec{b}, is by adding an additional input node with a fixed value of 1; the weights on that node's outgoing connections are the biases.

[Figure: network with an extra constant-1 input node whose outgoing weights are the biases b_1, b_2, b_3.]
Theorem: The function h(t) defined above, with n = 0 and k = 1/τ_s, i.e. h(t) = (1/τ_s) e^{-t/\tau_s} for t ≥ 0, is the solution of the initial value problem (IVP)

\[ \tau_s \frac{ds}{dt} = -s, \qquad s(0) = \frac{1}{\tau_s}. \]

Proof: Exercise.
If v_i reaches 1:

1. Start the refractory period.
2. Send a spike along the axon.
3. Reset v_i to 0.

If a spike arrives from neuron j, increment s_i using

\[ s_i \leftarrow s_i + \frac{w_{ij}}{\tau_s}. \]
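The update rules above can be combined into a small simulation loop. Here is a sketch with two LIF neurons (parameter values are illustrative, and we adopt the convention w[i, j] = weight from neuron j to neuron i):

```python
import numpy as np

dt, tau_m, tau_s, tau_ref = 0.001, 0.02, 0.1, 0.002
w = np.array([[0.0, 0.0],
              [0.5, 0.0]])       # neuron 0 excites neuron 1
v = np.zeros(2)                  # membrane potentials
s = np.zeros(2)                  # filtered synaptic input to each neuron
refrac = np.zeros(2)             # remaining refractory time
v_in = np.array([2.0, 0.0])      # external drive to neuron 0 only
counts = np.zeros(2, dtype=int)

for step in range(1000):         # simulate 1 second
    s -= dt * s / tau_s          # tau_s ds/dt = -s  (synaptic decay)
    active = refrac <= 0.0
    v[active] += dt * (v_in[active] + s[active] - v[active]) / tau_m
    refrac[~active] -= dt
    for i in np.flatnonzero((v >= 1.0) & active):
        counts[i] += 1
        v[i] = 0.0               # reset
        refrac[i] = tau_ref      # start refractory period
        s += w[:, i] / tau_s     # s_j <- s_j + w_ji / tau_s for each target j

print(counts)                    # both neurons end up firing
```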
Chapter 2

Neural Learning
2.1 Learning
Goal: To formulate the problem of supervised learning as an optimization problem.
Getting a neural network to do what you want usually means finding a set of connection
weights that yield the desired behaviour. That is, neural learning is all about adjusting
connection weights.
Overview of Learning
There are three basic categories of learning problems:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
Supervised Learning:
In supervised learning, the desired output is known so we can compute the error and use
that error to adjust our network.
Example:
Given an image of a digit, identify which digit it is.
Input: Image of digit “4”:
Target: [0 0 0 0 1 0 0 0 0 0]
Unsupervised Learning
In unsupervised learning, the output is not known (or not supplied), so it cannot be used to generate an error signal. Instead, this form of learning is all about finding efficient representations for the statistical structure in the input.
Example:
Given spoken English words, transform them into a more efficient representation such as
phonemes, and then syllables. Or, cluster points into categories.
Reinforcement Learning
In reinforcement learning, feedback is given, but usually less often, and the error signal is
usually less specific.
Example:
When playing a game of chess, a person knows their play was good if they win the game.
They can try to learn from the moves they made.
In this course, we will mostly focus on supervised learning. But we will also look at some
examples of unsupervised learning.
Supervised Learning
Our neural network performs some mapping from an input space to an output space.
We are given training data, with MANY examples of input/target pairs. This data is
(presumably) the result of some consistent mapping process.
Example: MNIST
Images of handwritten digits map to integers.
Example:
A B XOR(A, B)
1 1 0
1 0 1
0 1 1
0 0 0
Input: (A, B) ∈ {0, 1}2
Output/Target: t ∈ {0, 1}, y ∈ [0, 1]
Our task is to alter the connection weights in our network so that the network's outputs are close to the targets of this mapping. But what, exactly, do we mean by "close"? For now, we will use a scalar function L(y, t) as an error (or "loss") function, which returns a smaller value the closer our outputs are to the targets.
There are two main flavours of supervised-learning problems:

• Regression
• Classification
Regression
Output values are a continuous-valued function of the inputs. The outputs can take on a
range of values.
The plot above demonstrates a linear regression model. The blue line is the regression line
that best fits the data points, minimizing the error between the predicted outputs and the
true values.
Classification
Outputs fall into a number of distinct categories, such as the ten digit classes in MNIST.
Learning as Optimization

Once we have a cost function, our neural-network learning problem can be formulated as an optimization problem. Writing our network as a parameterized function

\[ y = f(x; \theta), \]

the task is to find the weights and biases θ that minimize the expected cost between the outputs and the targets.
2.2 Universal Approximation Theorem

Given a function f(x), can we find the weights ω_j, α_j, and biases θ_j, j = 1, 2, ..., N, such that

\[ f(x) \approx \sum_{j=1}^{N} \alpha_j\, \sigma(\omega_j x + \theta_j) \]

to arbitrary precision?
[Figure: a single-hidden-layer network with input x, hidden nodes h_1, ..., h_N, input weights ω_j, output weights α_j, and output y.]
Theorem: Let σ be any continuous sigmoidal function. Then finite sums of the form

\[ G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(\omega_j x + \theta_j) \]

are dense in C(I_n). In other words, given any f ∈ C(I_n) and ε > 0, there is a sum, G(x), of the above form, for which

\[ |G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n. \]

A function σ is "sigmoidal" if

\[ \sigma(x) \to \begin{cases} 1 & \text{as } x \to \infty, \\ 0 & \text{as } x \to -\infty. \end{cases} \]
Informal Proof:

Suppose we let ω_j → ∞ for j = 1, ..., N. Then

\[ \sigma(\omega_j x) \xrightarrow{\;\omega_j \to \infty\;} \begin{cases} 0 & \text{for } x \leq 0, \\ 1 & \text{for } x > 0, \end{cases} \]

i.e. the sigmoid approaches a step function. Let us define the shifted step

\[ H(x; b) = \lim_{\omega \to \infty} \sigma(\omega(x - b)). \]
[Figure: the difference H(x; b) − H(x; b + δ) forms a "bump" of width δ; stacking such bumps of height f(a_j) over subintervals [a_j, a_j + Δx] approximates f to within ε, as in a Riemann sum.]
Here N′ is the number of subintervals. G(x) can also be written in terms of threshold functions (which can be approximated by sigmoids) as:

\[ G(x) = \sum_{j=1}^{N'} f(a_j) \big( H(x; b_j) - H(x; b_j + \delta_j) \big). \]
[Figure: a piecewise-constant approximation G(x) of a smooth function f on [0, 1].]
So, why would we ever need a neural network with more than one hidden layer?

Answer: The theorem guarantees existence, but makes no claims about how N, the number of hidden neurons, scales with ε. N might grow exponentially as ε gets smaller.
2.3 Loss Functions

We have to choose a way to quantify how close our output is to the target. For this, we use a "cost function", also known as an "objective function", "loss function", or "error function". There are many choices, but here are two commonly-used ones.

Mean Squared Error (MSE)

Suppose we are given a dataset {(y^{(i)}, t^{(i)})}_{i=1,...,N}, where y^{(i)} = f(x^{(i)}; θ) are the network's outputs. The mean squared error is

\[ E = \frac{1}{N} \sum_{i=1}^{N} \left\| y^{(i)} - t^{(i)} \right\|^2. \]

The use of MSE as a cost function is often associated with linear activation functions, or ReLU. This loss-function/activation-function pair is often used for regression problems.
Cross Entropy

For a binary classification problem, the target is t ∈ {0, 1}, and we interpret the network output as a probability,

\[ y = P(x \to 1 \mid \theta) = f(x; \theta). \]

That is,

\[ P(x \to 1 \mid \theta) = y \quad (\text{i.e. } t = 1), \qquad P(x \to 0 \mid \theta) = 1 - y \quad (\text{i.e. } t = 0), \]

which can be written compactly as

\[ P(x \to t \mid \theta) = y^t (1 - y)^{1 - t}. \]

The task of "learning" is then to find a model (θ) that maximizes this likelihood. Or, we could equivalently minimize the negative log-likelihood:

\[ L(y, t) = -\big[ t \log y + (1 - t) \log(1 - y) \big]. \]

Cross entropy assumes that the output values are in the range [0, 1]. Hence, it works nicely with the logistic activation function.
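A minimal sketch of this loss; the small clipping constant is our own guard against log(0), not part of the formula:

```python
import numpy as np

def bce(y, t, eps=1e-12):
    """Binary cross-entropy L(y, t) = -[t log y + (1 - t) log(1 - y)]."""
    y = np.clip(y, eps, 1 - eps)   # keep the logs finite
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

print(bce(0.9, 1))   # small loss: confident and correct
print(bce(0.9, 0))   # large loss: confident and wrong
```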
Categorical Cross Entropy

For a K-class problem, let

\[ y = f(x; \theta). \]

We interpret y_k as the probability of x being from class k. That is, y is the distribution of x's membership over the K classes. Note that:

\[ \sum_{k=1}^{K} y_k = 1. \]

Under that distribution, suppose we observed a sample from class k̄. The likelihood of that observation is:

\[ P(x \in C_{\bar{k}} \mid \theta) = y_{\bar{k}}, \quad \text{where } C_{\bar{k}} = \{ x \mid x \text{ is in class } \bar{k} \}. \]

Note that y is a function of the input x and the model parameters θ. If we represent the target class using the one-hot vector

\[ t = [0 \cdots 0\ 1\ 0 \cdots 0], \]

with the 1 in position k̄, then minimizing the negative log-likelihood gives the categorical cross-entropy loss

\[ L(y, t) = -\sum_{k=1}^{K} t_k \log y_k. \]
2.4 Gradient Descent Learning

Goal: To see how we can use a simple optimization method to tune our network weights.

Here, we assume that you are familiar with partial derivatives and gradient vectors. Recall that our network implements a function

\[ y = f(x; \theta). \]

So, if our loss function is L(y, t), where t is the target, then neural learning becomes the optimization problem:

\[ \min_\theta E(\theta), \quad \text{where } E(\theta) = \mathop{\mathbb{E}}_{x \in \text{data}} \Big[ L\big( f(x; \theta),\, t(x) \big) \Big]. \]

Note: we will assume the gradient vector here is a row vector with the same shape as the vector θ, instead of a column vector. Sorry for the confusion in lecture 5.
Gradient-Based Optimization
If you want to find a local maximum of a function, you can simply start somewhere, and
keep walking uphill. For example, suppose you have a function with two inputs, E(a, b).
You wish to find the parameters (ā, b̄) that yield the maximum value of E.
No matter where you are, "uphill" is in the direction of the gradient vector:

\[ \nabla E(a, b) = \left( \frac{\partial E}{\partial a},\ \frac{\partial E}{\partial b} \right). \]
Gradient ascent is an optimization method where you continuously move in the direction of your gradient vector. In other words, you update a and b according to the differential equation

\[ \tau \frac{d(a, b)}{dt} = \nabla E(a, b), \]

where we have introduced the time variable t so that we can move through parameter space and approach our optimum over time.
Let's solve that DE numerically using Euler's method. If your current position is (a_n, b_n), then

\[ (a_{n+1}, b_{n+1}) = (a_n, b_n) + k \nabla E(a_n, b_n), \]

where k is your step multiplier (which swallows up both τ and the time-step size).
Gradient DESCENT aims to minimize your objective function. So, you walk downhill, stepping in the direction opposite the gradient vector:

\[ \theta_{n+1} = \theta_n - k \nabla E(\theta_n). \]

Note that there is no guarantee that you will find the global optimum. In general, you will find a local optimum that may or may not be the global optimum.
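Gradient descent on a toy two-parameter loss, sketched with the update rule above (the loss function and step multiplier here are illustrative):

```python
import numpy as np

def E(theta):
    a, b = theta
    return (a - 1.0)**2 + 2.0 * (b + 0.5)**2   # toy loss, minimum at (1, -0.5)

def grad_E(theta):
    a, b = theta
    return np.array([2.0 * (a - 1.0), 4.0 * (b + 0.5)])

theta = np.array([3.0, 2.0])   # arbitrary starting point
k = 0.1                        # step multiplier
for _ in range(200):
    theta = theta - k * grad_E(theta)   # step opposite the gradient
print(theta.round(4))          # approaches [1.0, -0.5]
```

Since this toy loss is convex, the local optimum found here is also the global one; that is not true in general.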
2.5 Error Backpropagation

We can apply gradient descent to a multi-layer network, using the chain rule to calculate the gradients of the error with respect to deeper connection weights and biases. Consider a network with inputs x, hidden activities h, and outputs y, where α_i is the input current to hidden node i and β_j is the input current to output node j. For our cost (loss) function, we will use E(y, t). For learning, suppose we want to know ∂E/∂M_{ij}, where M is the hidden-to-output weight matrix.
For example:

\[ \frac{\partial E}{\partial M_{41}} = \frac{\partial E}{\partial \beta_1} \frac{\partial \beta_1}{\partial M_{41}}. \]

Recall that

\[ E(y, t) = E\big( \sigma(h M + b),\, t \big), \]

so

\[ \frac{\partial E}{\partial \beta_i} = \frac{\partial E}{\partial y_i} \frac{dy_i}{d\beta_i}. \]
Note that

\[ \beta_1 = \sum_{i=1}^{4} h_i M_{i1} + b_1, \]

thus

\[ \frac{\partial \beta_1}{\partial M_{41}} = h_4. \]

As a result:

\[ \frac{\partial E}{\partial M_{41}} = h_4 \frac{\partial E}{\partial \beta_1}. \]

In general:

\[ \frac{\partial E}{\partial M_{ij}} = h_i \frac{\partial E}{\partial \beta_j}. \]
Now we have seen how backpropagation works for the connection weights between the top
two layers. What about the connection weights between layers deeper in the network? Let
us take a careful look at the gradient with respect to W21 .
[Figure: outputs y_1, y_2 with gradients ∂E/∂β_1, ∂E/∂β_2 and weights M_{11}, M_{12}; hidden nodes h_1, ..., h_4 with input currents α_1, ..., α_4.]

Using the two illustrations above, we can compute ∂E/∂W_{21} as follows:

\[ \frac{\partial E}{\partial W_{21}} = \frac{\partial E}{\partial \alpha_1} \frac{\partial \alpha_1}{\partial W_{21}}, \quad \text{where } \alpha_1 = \sum_{j=1}^{3} x_j W_{j1} + a_1. \]

Given that ∂α_1/∂W_{21} = x_2, we can now focus on:

\[ \frac{\partial E}{\partial \alpha_1} = \frac{\partial E}{\partial h_1} \frac{dh_1}{d\alpha_1} = \left( \frac{\partial E}{\partial \beta_1} \frac{\partial \beta_1}{\partial h_1} + \frac{\partial E}{\partial \beta_2} \frac{\partial \beta_2}{\partial h_1} \right) \frac{dh_1}{d\alpha_1}. \]
Hence:

\[ \frac{\partial E}{\partial \alpha_1} = \left( \frac{\partial E}{\partial \beta_1} M_{11} + \frac{\partial E}{\partial \beta_2} M_{12} \right) \frac{dh_1}{d\alpha_1}. \]

Note that we computed ∂E/∂β_1 and ∂E/∂β_2 already when we were learning the weight matrix M, thus:

\[ \frac{\partial E}{\partial \alpha_1} = \left[ \frac{\partial E}{\partial \beta_1}\ \ \frac{\partial E}{\partial \beta_2} \right] \cdot \begin{bmatrix} M_{11} \\ M_{12} \end{bmatrix} \frac{dh_1}{d\alpha_1}. \]

Here x ∈ R^X, h ∈ R^H, y, t ∈ R^Y, and recall that M ∈ R^{H×Y}.
In general:

\[ \frac{\partial E}{\partial \alpha_i} = \left[ \frac{\partial E}{\partial \beta_1} \cdots \frac{\partial E}{\partial \beta_Y} \right] \cdot \underbrace{\begin{bmatrix} M_{i1} \\ \vdots \\ M_{iY} \end{bmatrix}}_{\text{the } i\text{th column of } M^T} \frac{dh_i}{d\alpha_i}. \]

As a result:

\[ \nabla_\alpha E = \frac{dh}{d\alpha} \odot \left( \nabla_\beta E \cdot M^T \right), \]

where dh/dα = (dh_1/dα_1, ..., dh_H/dα_H). This shows that gradients at layer (l) can be computed using gradients at layer (l + 1):

\[ \frac{\partial E}{\partial W_{ij}^{(l)}} = \frac{\partial E}{\partial z_j^{(l+1)}} \cdot \frac{\partial z_j^{(l+1)}}{\partial W_{ij}^{(l)}} = \frac{\partial E}{\partial z_j^{(l+1)}} \cdot h_i^{(l)}. \]
Collecting all the entries,

\[ \frac{\partial E}{\partial W_{ij}^{(l)}} = h_i^{(l)} \cdot \frac{\partial E}{\partial z_j^{(l+1)}}, \]

and as a result,

\[ \nabla_{W^{(l)}} E = \big( h^{(l)} \big)^T \nabla_{z^{(l+1)}} E, \]

which has the same size as W^{(l)}.
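The backprop formulas above can be checked numerically on a tiny two-layer network. This sketch uses tanh hidden units, a linear output, and a squared-error loss (our illustrative choices), and compares one weight gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(1, 3))      # one sample, 3 inputs
W = rng.normal(size=(3, 4))      # input -> hidden weights
M = rng.normal(size=(4, 2))      # hidden -> output weights
t = np.array([[1.0, -1.0]])      # target

alpha = x @ W                    # hidden input currents
h = np.tanh(alpha)               # hidden activities
beta = h @ M                     # output input currents
y = beta                         # linear output (regression)

dE_dbeta = y - t                         # grad of 0.5 ||y - t||^2 w.r.t. beta
dE_dM = h.T @ dE_dbeta                   # (h^T) grad, same shape as M
dE_dalpha = (1 - h**2) * (dE_dbeta @ M.T)   # dh/dalpha ⊙ (grad · M^T)
dE_dW = x.T @ dE_dalpha                  # same shape as W

def loss(W_):                    # scalar loss as a function of W
    return 0.5 * np.sum((np.tanh(x @ W_) @ M - t)**2)

eps = 1e-6
Wp = W.copy(); Wp[1, 0] += eps
numeric = (loss(Wp) - loss(W)) / eps
print(abs(numeric - dE_dW[1, 0]) < 1e-4)   # True: backprop matches
```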
Vectorization

These formulas even work when processing many samples at once. Instead of feeding in just one sample,

\[ x \xrightarrow{\;W^{(\ell)}\;} z^{(\ell)} \xrightarrow{\;\sigma\;} h^{(\ell)} \longrightarrow \cdots, \qquad z^{(\ell)} = x\, W^{(\ell)}, \quad h^{(\ell)} = \sigma\big(z^{(\ell)}\big), \]

we could feed in many samples at once, stacking one sample per row of x. Then z^{(ℓ)} = x W^{(ℓ)} and h^{(ℓ)} = σ(z^{(ℓ)}) also hold one row per sample, and ∇_{z^{(ℓ)}} E has the same shape as z^{(ℓ)}. Now how about the term (h^{(ℓ)})^T ∇_{z^{(ℓ+1)}} E? Writing the rows of h^{(ℓ)} as a_1, ..., a_N (one per sample) and the rows of ∇_{z^{(ℓ+1)}} E as b_1, ..., b_N, we have

\[ \big( h^{(\ell)} \big)^T \nabla_{z^{(\ell+1)}} E = \sum_{n=1}^{N} a_n^T\, b_n, \]

a sum of outer products, one per sample. Following this logic, we can observe that gradients automatically add up over the samples.
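The sum-over-samples observation is easy to verify in NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.normal(size=(4, 3))      # 4 samples, 3 hidden units
g = rng.normal(size=(4, 2))      # gradient w.r.t. z, one row per sample

batch = h.T @ g                  # (3 x 4)(4 x 2) -> 3 x 2, all samples at once
summed = sum(np.outer(h[n], g[n]) for n in range(4))   # one outer product per sample
print(np.allclose(batch, summed))   # True
```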
Chapter 3

Automatic Differentiation

3.1 Automatic Differentiation
x = Var
y = Var
a = sin(x)
b = x * y
f = a + b
We will build a data structure to represent the expression graph (illustrated above) using
two different types of objects:
• Var: stores its value (val), the derivative of the expression with respect to it (grad), and a reference to the Op that created it (creator).
• Op: represents an operation (e.g. +); it stores references to its argument Vars (args).
Example: f = a * b

Given Var objects a and b:

1. Create an Op object.
2. Save references to the args (a, b).
3. Create a Var for the output (f).
4. Set f.val to a.val * b.val.
5. Set f.creator to this Op.
Differentiate

The expression graph can also be used to compute the derivatives. Each Var stores the derivative of the expression with respect to itself, in its member grad. Consider:

\[ f = F(G(H(x))). \]

Then

\[ \texttt{x.grad} = \frac{df}{dx}. \]
Writing g = G(H(x)) and h = H(x), the chain rule gives

\[ \frac{df}{dx} = \frac{dF(g)}{dg} \cdot \frac{dG(h)}{dh} \cdot \frac{dH(x)}{dx} = \frac{df}{dg} \frac{dg}{dh} \frac{dh}{dx}. \]

Starting with a value of 1 at the top, we work our way down through the graph, and increment the grad of each Var as we go. Each Op contributes its factor (according to the chain rule) and passes the updated derivative down the graph.
CHAPTER 3. AUTOMATIC DIFFERENTIATION 49
Example:

\[ f = \underbrace{(x + y)}_{a} + \underbrace{\sin(y)}_{b} \]

Because of the chain rule, we multiply as we work our way down a branch. We also add whenever multiple branches converge. Each object has a backward() method that processes the derivative and passes it down the graph. The backward methods can be coded as follows:
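Here is a minimal sketch of how Var and Op objects with backward() methods might fit together (the class names beyond Var, and the exact method signatures, are our own guesses at one possible implementation):

```python
import math

class Var:
    """Holds a value, the derivative of the expression w.r.t. itself (grad),
    and a reference to the Op that created it (None for input variables)."""
    def __init__(self, val, creator=None):
        self.val, self.grad, self.creator = val, 0.0, creator
    def backward(self, s=1.0):
        self.grad += s                     # converging branches add
        if self.creator is not None:
            self.creator.backward(s)       # pass the derivative down the graph

class Add:
    def __init__(self, a, b):
        self.args = (a, b)
        self.out = Var(a.val + b.val, creator=self)
    def backward(self, s):
        a, b = self.args
        a.backward(s)                      # d(a+b)/da = 1
        b.backward(s)                      # d(a+b)/db = 1

class Mul:
    def __init__(self, a, b):
        self.args = (a, b)
        self.out = Var(a.val * b.val, creator=self)
    def backward(self, s):
        a, b = self.args
        a.backward(s * b.val)              # d(ab)/da = b
        b.backward(s * a.val)              # d(ab)/db = a

class Sin:
    def __init__(self, a):
        self.args = (a,)
        self.out = Var(math.sin(a.val), creator=self)
    def backward(self, s):
        (a,) = self.args
        a.backward(s * math.cos(a.val))    # d(sin a)/da = cos(a)

# f = sin(x) + x * y, as in the running example
x, y = Var(2.0), Var(3.0)
f = Add(Sin(x).out, Mul(x, y).out).out
f.backward()                               # start with a value of 1 at the top
print(x.grad)    # cos(2) + y, the two branches through x added together
print(y.grad)    # x
```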
Notes:

• ∂Op/∂x denotes the derivative of the operation with respect to its argument x.
Optimization using AD

Consider a scalar function E that depends (possibly remotely) on some variable v. Suppose we want to minimize E with respect to v, i.e.,

\[ \min_v E(v). \]
3.2 Neural Networks with Auto-Diff

We use the same process to implement error backpropagation for neural networks, optimizing with respect to the connection weights and biases.
To accomplish this, our network will be composed of a series of layers, each layer transforming
the data from the layer below it, culminating in a scalar-valued cost function.
There are two types of operations in the network:
1. Multiply by connection weights (including adding biases)
2. Apply activation function
Finally, a cost function takes the output of the network, as well as the targets, and returns
a scalar.
Let us consider this (very) small network. Each layer, including the cost function, is just a function in a nested mathematical expression:

\[ E = D(C(B(A(X))), T). \]

We construct our network using objects from our AD classes (Variables and Operations) so that we can take advantage of their backward() methods to compute the gradients:

net ≡ (A, B, C)
y = net(X) = C(B(A(X)))   (forward pass sets the network state)
E = D(y, T)
net.zero_grad()
E.backward()              (backward pass sets the gradients)
Pseudo-Code
Matrix AD
To work with neural networks, our AD library will have to deal with matrix operations.
Suppose y = A + B, where A and B are M × N matrices. What are ∇_A L and ∇_B L? As we saw in the previous lecture, within the plus operation's backward method +.backward(s), we need to call the following commands:

A.backward(s ⊙ 1_{M×N})
B.backward(s ⊙ 1_{M×N})

(Recall that s is the same shape as the output of the operation, i.e., ∇_y L.)

For a matrix product y = A B, the gradients are

\[ \nabla_A L = s\, B^T \qquad (M \times K)(K \times N) \to M \times N, \]
\[ \nabla_B L = A^T s \qquad (N \times M)(M \times K) \to N \times K. \]

In the method *.backward(s), you will need to figure out in Assignment 2 how to complete the following commands:

A.backward(?)
B.backward(?)
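Without writing the backward() bodies, the matrix-product gradient formulas can be sanity-checked numerically against a finite difference (the stand-in scalar loss below is our own construction):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, K = 2, 3, 4
A = rng.normal(size=(M, N))
B = rng.normal(size=(N, K))
s = rng.normal(size=(M, K))      # incoming gradient, same shape as y = A @ B

grad_A = s @ B.T                 # (M x K)(K x N) -> M x N
grad_B = A.T @ s                 # (N x M)(M x K) -> N x K

def L(A_):                       # stand-in loss whose grad w.r.t. (A_ @ B) is s
    return np.sum(s * (A_ @ B))

eps = 1e-6
Ap = A.copy(); Ap[0, 1] += eps
numeric = (L(Ap) - L(A)) / eps
print(abs(numeric - grad_A[0, 1]) < 1e-6)   # True
```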
Chapter 4

Generalizability
4.1 Overfitting
Goal: Develop a process to use our labelled data to generate models that can predict
future, unseen samples.
Suppose you have a dataset of drug dosage vs. your blood-sugar level.
Your doctor would like to train a neural network so that, given a dose, she can predict your
blood sugar.
That dataset has 6 samples, and since this is a regression problem, we will use a linear
activation function on the output, and MSE as a loss function.
Your doctor creates a neural network with 1 input node, two hidden layers, each with 250
ReLU nodes, and 1 output node, and trains it on your dataset for 2000 epochs.
The doctor wants to give you a dose of 0.65, so she uses the network to estimate what your
blood sugar will be.
Blood sugar is 1.0043
Does this seem reasonable?
This week’s exercises will show you that the neural network model looks very much like a
linear interpolator of the points.
Suppose the doctor takes 300 more blood samples from you, at a variety of different doses.
Once you’re drained of blood, she runs the dataset through her model to see what the MSE
loss is:
MSE loss is 0.018

That's orders of magnitude worse than the 1.5 × 10^{−10} on the training dataset.
The false sense of success we get from the results on our training dataset is known as
overfitting or overtraining.
Suppose your model has enough flexibility and you train it long enough. In that case, it
will start to fit the specific points in your training dataset, rather than fit the underlying
phenomenon that produced the noisy data.
Recall that our sole purpose was to create a model to predict the output for samples it hasn’t
seen. How can we tell if we are overfitting?
A common practice is to keep some of your data as test data, which your model does not
train on.
A large discrepancy between test loss and train loss is a sign of overfitting. Notice
that the test loss is going up!
We are looking for a balance between underfitting and overfitting as demonstrated here:
4.2 Combatting Overfitting

We saw that if a model has enough degrees of freedom, it can become hyper-adapted to the training set, and start to fit the noise in the dataset. The test error is then much bigger than the training error, and gets worse with more training. There are some strategies to stop our network from trying to fit the noise.
Validation
If we want to estimate how well our model will generalize to samples it hasn’t trained on,
we can withhold part of the training set and try our model on that validation set. Once our
model does reasonably well on the validation set, then we have more confidence that it will
perform reasonably well on the test set.
It’s common to use a random subset of the training set as a validation set.
We can limit overfitting by creating a preference for solutions with smaller weights, achieved by adding a term to the loss function that penalizes the magnitude of the weights:

\[ \tilde{E}(\hat{y}, t; \theta) = E(\hat{y}, t; \theta) + \frac{\lambda}{2} \|\theta\|_F^2, \]

where E(\hat{y}, t; θ) is the original loss function, \|\theta\|_F = \sqrt{\sum_j \theta_j^2} is the Frobenius norm of the weights, and λ controls the weight of the regularization term.
How does this change our gradients, and thus our update rule? The gradient of the new term is λθ_i, so the update becomes

\[ \theta_i \leftarrow \theta_i - \kappa \nabla_{\theta_i} E - \kappa \lambda \theta_i. \]

The extra term shrinks each weight toward zero on every step, which is why this is often called weight decay.
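The regularized update rule is one line of NumPy (the step size and λ below are illustrative):

```python
import numpy as np

def regularized_step(theta, grad, kappa=0.1, lam=0.01):
    """One gradient-descent step on E + (lam/2) ||theta||^2."""
    return theta - kappa * grad - kappa * lam * theta   # decay pulls toward 0

theta = np.array([2.0, -3.0])
theta = regularized_step(theta, grad=np.zeros(2))
print(theta)   # weights shrink toward zero even with a zero data gradient
```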
One can also use different norms. For example, it is common to use the L1 norm:

\[ L_1(\theta) = \sum_i |\theta_i|. \]

The L1 norm tends to favour sparsity (most weights are close to zero, with only a small number of non-zero weights).
Chapter 5

Optimization Considerations

5.1 Enhancing Optimization
Rather than compute the full gradient, we can try to get a cheaper estimate by computing
the gradient from a random sampling.
Let γ be a random sampling of B elements from {1, 2, . . . , D}. We can estimate E(y, τ ) as:
E(y, τ) ≈ E(ỹ, τ̃) = (1/B) Σ_{d=1}^{B} L(y_{γ_d}, t_{γ_d}),
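A minimal sketch of this minibatch estimate, assuming a squared-error per-example loss (the names and sizes here are illustrative):

```python
import numpy as np

def minibatch_loss(y, t, batch_size, rng):
    """Estimate the mean loss from a random sample of B of the D examples."""
    gamma = rng.choice(len(y), size=batch_size, replace=False)  # random sampling gamma
    per_example = 0.5 * (y[gamma] - t[gamma]) ** 2              # per-example loss L
    return per_example.mean()

rng = np.random.default_rng(0)
y = rng.normal(size=1000)   # D = 1000 "predictions"
t = np.zeros(1000)          # targets
full = 0.5 * (y - t) ** 2
est = minibatch_loss(y, t, 64, rng)
print(est, full.mean())     # the B=64 estimate is close to the full-dataset loss
```

The estimate is noisy but unbiased, and computing it costs B/D of the full-gradient work.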
Momentum
Consider gradient descent optimization in these situations:
Case 1:
• The optimization involves oscillations that are inefficient.
• This is represented by level sets of the loss function in the space of weights.
• The trajectory exhibits oscillatory behavior while converging to the minimum.
Case 2:
• The optimization stops in a shallow local minimum.
• However, we would prefer the optimization to “get over the hump” and reach the
deeper minimum.
• The trajectory may fail to escape the local basin of attraction.
Current Approach:
Thus far, we have been moving through parameter space by stepping in the opposite direction
of the gradient:
θn+1 = θn − η∇θ E
Momentum as a Force:
But what if we thought of the gradient as a force that pushes us?
Recall from physics, if θ is our position:
dθ/dt = v (velocity),
dv/dt = a (acceleration).
In this analogy, momentum helps us overcome the inefficiencies in Case 1 (oscillations) and
Case 2 (local minimum trapping), by incorporating the gradient as a force to push the
optimization process forward.
So, solving numerically using Euler’s method, we obtain updates of the parameters and
velocity using the following equations:
θn+1 = θn + ∆t vn ,
vn+1 = (1 − r)vn + ∆t An .
Here, r ∈ (0, 1) is a damping (friction) coefficient, and A_n is the acceleration at step n.
Example: When driving a car, A corresponds to the gas pedal, brakes, or even steering
adjustments.
What if we treat our error gradients as A? By integrating the gradients, we gain velocity v,
and thus momentum. Recall the physics definition of momentum:
momentum: ρ = mv.
Intuition
It's as if our weights were a particle in parameter space. We move through weight space, accelerated by the error gradients. Over time:
• We build speed.
Visualization

In parameter space (axes θ1 and θ2), v represents the velocity of the weights w, and A the acceleration supplied by the gradient.
The trajectory begins with slower motion, but momentum builds over time, helping us
overcome inefficiencies and improve optimization performance.
Or, as is commonly used (re-parameterized):
v^(t) ← βv^(t−1) + ∇_w E,    followed by    w ← w − η v^(t).
This approach not only smooths out oscillations but can also help to avoid getting stuck in
local minima.
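A minimal numpy sketch of this re-parameterized momentum update on a toy 1-D quadratic loss (the function name, β, and η values are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, beta=0.9, eta=0.1):
    """Re-parameterized momentum: v <- beta*v + grad, then w <- w - eta*v."""
    v = beta * v + grad
    w = w - eta * v
    return w, v

# On E(w) = w^2 / 2, the gradient is simply w; momentum builds speed downhill
# and oscillates through the minimum before settling.
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v, grad=w)
# w has moved from 5.0 to near the minimum at 0
```

The velocity term averages successive gradients, which damps oscillations across a ravine while accumulating speed along it.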
• The image above shows the comparison of optimization techniques such as SGD, Mo-
mentum, NAG, Adagrad, Adadelta, and RMSprop.
• Source: [Link]
5.2 Deep Neural Networks

Goal: To see the advantages and disadvantages of deep neural networks: representational power vs. vanishing or exploding gradients.
Theorem: Let σ be any continuous sigmoidal function. Then finite sums of the form

G(x) = Σ_{j=1}^{N} α_j σ(ω_j x + θ_j)

are dense in C(I_n). In other words, given any f ∈ C(I_n) and ε > 0, there is a sum, G(x), of the above form, for which

|G(x) − f(x)| < ε   for all x ∈ I_n.
Thus, we really only ever need one hidden layer. But is that the best approach in terms of
the number of nodes or learning efficiency? Not necessarily: it can be shown that such a
shallow network may require an exponentially large number of nodes (i.e. a very large N ) to
work.
Vanishing Gradients
Suppose the initial weights and biases were large enough that the input current to many of
the nodes was not too close to zero. As an example, consider one of the output nodes, where
we have z_1^(l) = 5. Then

y_1 = σ(z_1^(l)) = 1/(1 + e^(−5)) ≈ 0.9933,    dy_1/dz_1 = y_1(1 − y_1) ≈ 0.0066.
Compare that to the case where the input current was 0.1. Then

y_1 = σ(0.1) ≈ 0.525,    dy_1/dz_1 = y_1(1 − y_1) ≈ 0.525 × 0.475 ≈ 0.249.
Recall that for a sigmoid function, the slope (its derivative) is largest near inputs close to
zero and becomes very small for inputs far from zero. Therefore, when the input currents
are large in magnitude, the derivative is small, and hence the updates to the weights (based
on the gradient) will also be smaller.
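We can check these numbers directly; a small sketch assuming the logistic sigmoid used above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    """Derivative of the sigmoid: sigma'(z) = y(1 - y)."""
    y = sigmoid(z)
    return y * (1.0 - y)

# The slope peaks at 0.25 for z = 0 and collapses for large |z|:
for z in [0.0, 0.1, 4.0, 5.0]:
    print(z, round(dsigmoid(z), 4))
# prints 0.25, 0.2494, 0.0177, 0.0066 -- matching the numbers above
```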
Suppose

∥∇_{z^(3)} E∥_2 = 0.01.

Define the unit vector

U_{z^(3)} = ∇_{z^(3)} E / ∥∇_{z^(3)} E∥_2,

so that

∇_{z^(3)} E = 0.01 U_{z^(3)}.
What if the inputs to the penultimate layer were around 4 in magnitude? Then the corresponding slopes of their sigmoid functions will also be small: σ′(4) = σ(4)(1 − σ(4)) ≈ 0.982 × 0.018 ≈ 0.0177.

Recall that

∂E/∂z_j^(2) = σ′(z_j^(2)) ⊙ [∇_{z^(3)} E (W^(2))^T]_j.

Plugging in the small derivative factor and the scale for ∇_{z^(3)} E, we get:

∂E/∂z_j^(2) = (0.0177)(0.01) [U_{z^(3)} (W^(2))^T]_j = (0.000177) [U_{z^(3)} (W^(2))^T]_j.
As we move to deeper layers, these factors can become even smaller. When this happens,
learning effectively comes to a halt in those layers. This is referred to as the vanishing
gradients problem.
Another way to look at it: Consider this simple but deep network:
x → h^(1) → h^(2) → h^(3) → h^(4) → y, with weights and biases (w^(l), b^(l)) on the l-th connection, target t, and loss E(y, t).

Start with the loss at the output side: E(y, t). The gradient with respect to the input current z^(5) of the output node is

∂E/∂z^(5) = y − t,    i.e.,    ∇_{z^(5)} E = y − t.
Working backwards by the chain rule, each earlier layer contributes a factor w^(l) σ′(z^(l)), and for the sigmoid σ′(z) = σ(z)(1 − σ(z)) ≤ 1/4. Hence, all else being equal (e.g. weights of magnitude around 1), the gradient can shrink by a factor of at least 4 at each layer. This is one of the key observations in the vanishing gradients discussion.
We can observe the vanishing gradient phenomenon by looking at the norm of the gradients
at each layer:
∥∇_{z^(i)} E∥² = Σ_j ( ∂E/∂z_j^(i) )².
For a network with 5 layers (labeling them from the output back to the input), we might see
something like:
Skip Connections
Skip connections (also called residual connections) allow the network to bypass one or more
layers by directly adding the input of a previous layer to a later one. This creates an explicit
short path for gradient flow, which helps mitigate the problem of vanishing gradients and
allows much deeper networks to be trained effectively. Intuitively, the skip path enables the
model to learn residual mappings (i.e., deviations from the identity), making optimization
easier and convergence faster.
h_3 = σ(z_3) = σ(w_3 h_2 + b_3 + h_1).

The h_1 term corresponds to the path with a skip connection. Thus,

∂L/∂w_1 = (∂L/∂z_3) w_3 σ′(z_2) w_2 σ′(z_1) x + (∂L/∂z_3) · 1 · σ′(z_1) · x.
The second part on the right-hand side has fewer factors than the direct-path term (the one we get when we don't have a skip connection), so it is not attenuated by the extra w and σ′ factors. This explains how skip connections can enhance the magnitude of the gradients and mitigate the vanishing gradient problem in deep neural nets, especially when the weights are very small.
Exploding Gradients
A similar (though less frequent) phenomenon can cause very large gradients. For instance, consider a deep chain of neurons, each with

w = 8,  b = −4,  σ′(z) = 1/4.

Each layer then multiplies the gradient by wσ′(z) = 8 × 1/4 = 2, so the gradient doubles at every layer and grows exponentially with depth.

A common remedy is gradient clipping: rescale the gradient whenever its norm exceeds a threshold τ:

∂E/∂θ ← (τ / ∥∂E/∂θ∥) ∂E/∂θ    if    ∥∂E/∂θ∥ > τ.

This keeps the update direction unchanged but bounds its size, ensuring stable learning when gradients would otherwise explode.
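A minimal sketch of gradient clipping by norm, as described above (the function name and threshold are illustrative):

```python
import numpy as np

def clip_gradient(grad, tau=1.0):
    """Rescale grad to have norm tau if its norm exceeds tau; the
    direction is unchanged, only the magnitude is bounded."""
    norm = np.linalg.norm(grad)
    if norm > tau:
        grad = (tau / norm) * grad
    return grad

g = np.array([30.0, 40.0])            # norm 50
clipped = clip_gradient(g, tau=5.0)
print(clipped)                        # -> [3. 4.]: norm 5, same direction
```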
Chapter 6

Vision
6.1 Your Visual System

Goal: To learn about the mammalian visual system, which is the basis for many neural-network vision systems.
Most of the networks we have looked at assume an all-to-all connectivity between populations
of neurons, like between layers in a network. But that’s not the way our brains are wired,
thankfully.
If every one of your 86 billion neurons was connected to every other neuron, your head would
have to be much bigger.
It’s Layered
Although the details are more complicated, your visual system is roughly arranged into a
hierarchy of layers.
The image¹ shows how visual information flows from the retina through successive processing
stages in the lateral geniculate nucleus (LGN) and primary visual cortex (V1), onward to
higher areas such as V2, V4, and IT. Ultimately, these layers of processing allow your brain
to form coherent perceptions (e.g., recognizing a cat).
It’s Topological
Neurons close to each other in the primary visual cortex process parts of the visual scene
that are close to each other.
¹ Source: Kubilius, Jonas (2017): Ventral visual stream. [Link] 106794.v3.
This dot in the visual field only excites a small patch of neurons in V1. Each neuron in V1
is only activated by a small patch in the visual field.
Conversely, each patch in the visual field excites only a small neighborhood of neurons in
V1.
This topological mapping between the visual field and the surface of the cortex is called a
retinotopic mapping.
Moreover, neurons in V1 project to the next layer, V2, and again, the connections are
retinotopically local. The “footprint” of a region in the visual field gets larger as the data
progresses up the layers. By the time signals reach IT (Inferior Temporal cortex), the neurons
may be influenced by the entire visual field.
Each little square corresponds to one V1 neuron and shows the pattern that most activates
that neuron—its receptive field.
Figure 6.1: Comparison of V1 receptive fields in a macaque (A) versus those learned by a neural network, SAILnet (B). Adapted from Zylberberg, Joel; Murphy, Jason Timothy; DeWeese, Michael Robert (2015): SAILnet learns receptive fields (RFs) with the same diversity of shapes as those of simple cells in macaque primary visual cortex (V1). PLOS Computational Biology. [Link]
Each column of squares is a different neuron’s preferred stimulus. Notice that both the
macaque and the neural network learn a diverse set of oriented patterns, reflecting the
system's attempt to efficiently encode natural images. You are encouraged to take a careful look at the accompanying demo (see the demo link).
6.2 Convolutional Neural Networks (CNNs)

Inspired by the brain's topological (retinotopic) connectivity, and in an effort to reduce the number of connection weights that must be learned, scientists devised the Convolutional Neural Network (CNN).
Convolution
In typical CNNs, g (often called a kernel or filter ) has small spatial support, e.g. a D × D
region:
g ∈ RD×D .
When we apply g to an image or feature map f, we take an inner product of g with the D × D patch of f whose upper-left corner is at (n, m):

(f ∗ g)_{n,m} = Σ_{i=1}^{D} Σ_{j=1}^{D} f_{n+i−1, m+j−1} g_{i,j}.
If g^(1), …, g^(K) are our K kernels, then each kernel produces an output map z^(k) ∈ R^{N×N} (depending on how we pad/stride). Concretely,

z^(k)_{n,m} = ( f ∗ g^(k) )_{n,m} + b^(k),
Multi-channel Inputs
If the input f has multiple channels (e.g. an RGB image with C = 3 channels), then each
kernel must have the same number of channels. In this scenario,
f ∈ RN ×N ×C , g ∈ RD×D×C .
We do not convolve across channels; rather, we convolve each channel separately and sum
across them. For a single kernel,
z_{n,m} = Σ_{c=1}^{C} ( f_{·,·,c} ∗ g_{·,·,c} )_{n,m} + b,
where b is a bias term (one per kernel). For K kernels, each has its own parameters g (k) and
bias b(k) , yielding K output feature maps.
Edge cases
At the edges of the image, the convolution window (kernel) might partially lie outside the
valid region of f . We typically have two common choices:
• Choice 1 (Padding): Pad the input f with zeros (or another value) so that the kernel
can be applied at the borders without losing spatial resolution.
• Choice 2 (No Padding): Do not pad. In this case, the output z is smaller than the
input f , because we only convolve where the kernel is fully within the image boundaries.
Stride
The stride is the number of pixels by which we move or “slide” the kernel between consecutive
convolution positions.
• Stride = 1: The kernel is shifted by 1 pixel each time (typical default), so we sample
every adjacent position.
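The padding and stride choices above can be sketched in a few lines of numpy; this is a naive loop implementation for illustration only (square inputs assumed for simplicity):

```python
import numpy as np

def conv2d(f, g, stride=1, pad=0):
    """Inner product of kernel g with each patch of image f,
    with optional zero padding and stride (square inputs assumed)."""
    if pad:
        f = np.pad(f, pad)                  # Choice 1: zero-pad the borders
    D = g.shape[0]
    N = f.shape[0]
    out = (N - D) // stride + 1             # output size without padding shrinks
    z = np.zeros((out, out))
    for n in range(out):
        for m in range(out):
            patch = f[n*stride:n*stride+D, m*stride:m*stride+D]
            z[n, m] = np.sum(patch * g)     # inner product with the patch at (n, m)
    return z

f = np.arange(16.0).reshape(4, 4)
print(conv2d(f, np.ones((3, 3))).shape)                   # no padding: (2, 2)
print(conv2d(f, np.ones((3, 3)), pad=1).shape)            # padding keeps (4, 4)
print(conv2d(f, np.ones((3, 3)), stride=2, pad=1).shape)  # stride 2: (2, 2)
```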
Why CNNs?
A Parameter counting argument
Consider a 100 × 100 input image and a 100 × 100 hidden representation.
Fully connected approach: if we flatten the 100 × 100 image into a 10 000-dimensional vector, and want a 10 000-dimensional hidden layer, we have 10 000 × 10 000 = 10⁸ weights.

Convolutional approach: a 5 × 5 kernel has 25 weights plus 1 bias term, giving 26 parameters per kernel. Hence, for 64 such kernels:

64 (25 + 1) = 64 × 26 = 1664

total trainable parameters. This is much smaller than 10⁸, yet CNNs with relatively few parameters still perform extremely well in practice.
Locality argument
FFNNs do not exploit properties such as locality and translational invariance that are inher-
ent to many physical systems and images ⇒ a lot of spatial information is lost. To illustrate
this point, let us consider a 4 × 4 image (picture, Ising spin configuration, etc.). Let us label
the pixels (sites) as:
13 14 15 16
9 10 11 12
5 6 7 8
1 2 3 4
Flattening this image into a vector lists the pixels as 1, 2, 3, 4 (spatially close to 3), 5 (not spatially close to 4), …, 16: neighbours in the vector are not always neighbours in the image, so the 2-D spatial structure is lost.
Chapter 7

Unsupervised Learning
A content-addressable memory (CAM) system can find an input's closest match to a set of known patterns. It retrieves data by directly comparing input queries with stored memory locations.
Hopfield Networks mimic the behavior of CAM in a biologically inspired way using neural
networks.
Hopfield Networks
Suppose we have a network of N neurons, each connected to all the others. We want this
network to converge to the nearest of a set of M targets or inputs.
John Hopfield proposed this network in a paper in 1982, and won the Nobel Prize in physics
in 2024 for it (with Hinton).
x_i = −1 if (x⃗W)_i + b_i < 0,    x_i = 1 if (x⃗W)_i + b_i ≥ 0.    (7.1)

For example, with N = 4:

W =
  0 −1  1 −1
 −1  0 −1  1
  1 −1  0 −1
 −1  1 −1  0

b⃗ = [0 0 0 0]
Problem: The graph of the Hopfield net has cycles, so backpropagation won’t work.
The network has an energy function

E = −(1/2) x⃗ W x⃗ᵀ − b⃗ x⃗ᵀ,

where W_ii = 0. The gradient with respect to the state is

∂E/∂x_j = −Σ_{i≠j} x_i W_ij − b_j,

or

∇_x⃗ E = −x⃗W − b⃗   ⇒   τ_x dx⃗/dt = x⃗W + b⃗   (similar to Eq. (7.1)).
If i ≠ j:

∂E/∂W_ij = −x_i x_j.

If i = j:

∂E/∂W_ii = −x_i² = −1.
As a result, the gradient vector is:
∇W E = −⃗x⊤⃗x + IN ×N ,
where ⃗x⊤⃗x is a rank-1 N × N matrix. We add the identity matrix to the right-hand side, so
that the gradient of the diagonal weights is zero to keep Wii = 0 when performing gradient
descent.
Over all M targets, we have:
∇_W E = −(1/M) Σ_{s=1}^{M} (x⃗^(s))ᵀ x⃗^(s) + I = −(1/M) XᵀX + I.
Thus:
W ← W + κ ( (1/M) XᵀX − I ),
where X ⊤ X computes coactivation states between all pairs of neurons.
Because the input patterns X are fixed, the coactivation matrix (1/M)XᵀX − I remains constant across iterations, so the gradient direction does not change, and repeated updates simply move W linearly towards the steady-state solution, which is proportional to W* = (1/M)XᵀX − I.
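A minimal numpy sketch of storing patterns with the steady-state weights W* and recalling by iterating update rule (7.1); the two patterns here are made up for this illustration:

```python
import numpy as np

# Steady-state Hopfield weights for M stored +/-1 patterns:
# W = X^T X / M - I (zero diagonal), with b = 0.
X = np.array([[1.0,  1.0, 1.0, -1.0, -1.0],
              [1.0, -1.0, 1.0,  1.0, -1.0]])   # M = 2 target patterns
M, N = X.shape
W = X.T @ X / M - np.eye(N)
b = np.zeros(N)

# synchronous application of the update rule (7.1)
step = lambda x: np.where(x @ W + b >= 0, 1.0, -1.0)

# each stored pattern is a fixed point of the update rule
for s in X:
    assert np.array_equal(step(s), s)

# recall: corrupt one bit of the first pattern, then update
x = np.array([1.0, 1.0, 1.0, -1.0, 1.0])       # last bit flipped
x = step(x)
print(x)   # -> [ 1.  1.  1. -1. -1.], the stored pattern
```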
The topic of this lecture is related to the 2024 Nobel Prize in Physics, which went to:
• John Hopfield (Hopfield nets, 1982)
• Geoff Hinton (RBMs, 1983–1985)
RBM energy
Similar to the Hopfield network, an RBM is characterized by an energy:
E(v, h) = − Σ_{i=1}^{m} Σ_{j=1}^{n} v_i W_ij h_j − Σ_{i=1}^{m} b_i v_i − Σ_{j=1}^{n} c_j h_j
Boltzmann Probability
The RBM network states are visited with the Boltzmann probability:
q(v, h) = (1/Z) e^{−E(v,h)},

where the partition function Z is defined as:

Z = Σ_{v,h} e^{−E(v,h)}.
E(v (1) , h(1) ) < E(v (2) , h(2) ) ⇒ q(v (1) , h(1) ) > q(v (2) , h(2) )
The loss for a training sample V is its negative log likelihood. Expanding q_θ(V):

L = −ln( (1/Z) Σ_h e^{−E_θ(V,h)} ).

Rewriting:

L = −ln( Σ_h e^{−E_θ(V,h)} ) + ln( Σ_{v,h} e^{−E_θ(v,h)} ) = L₁ + L₂.
Gradient of L1
To optimize the parameters, we need to compute the gradient of the loss function:
∇_θ L₁ = −( Σ_h ∇_θ e^{−E_θ(V,h)} ) / ( Σ_h e^{−E_θ(V,h)} )
       = Σ_h ( e^{−E_θ(V,h)} / Σ_h e^{−E_θ(V,h)} ) ∇_θ E_θ(V, h).

Since

e^{−E_θ(V,h)} / Σ_h e^{−E_θ(V,h)} = q_θ(h|V),

we rewrite:

∇_θ L₁ = Σ_h q_θ(h|V) ∇_θ E_θ(V, h) = E_{q(h|V)}[ ∇_θ E_θ(V, h) ].
Gradient of L2
∇_θ L₂ = −( Σ_{v,h} e^{−E_θ} ∇_θ E_θ ) / ( Σ_{v,h} e^{−E_θ} )
       = −Σ_{v,h} q_θ(v, h) ∇_θ E_θ(v, h).
∇_θ L = ∇_θ L₁ + ∇_θ L₂ = E_{q(h|V)}[ ∇_θ E_θ ] − E_{q(v,h)}[ ∇_θ E_θ ],

and since

∇_{W_ij} E(v, h) = −v_i h_j,

we have:

∇_{W_ij} L = −E_{q(h|V)}[ V_i h_j ] + E_{q(v,h)}[ v_i h_j ],
where the first term represents the expected value under the posterior distribution, and the
second term is the expected value under the joint distribution.
Then,
∇W L1 = −V T σ(V W + c)
where this results in a rank-1 outer product in Rm×n (you can show this as an exercise).
In practice, a single network state is used to approximate the expectation. Gibbs sampling
is employed to run the network freely and compute the average vi hj .
∇W L2 = v T σ(vW + c)
W ← W − η(∇W L1 + ∇W L2 )
W ← W + ηV T σ(V W + c) − ηv T σ(vW + c)
where η is the learning rate, the first term corresponds to the positive phase (clamped
visible state), and the second term corresponds to the negative phase (after one Gibbs
sampling step).
Sampling an RBM
After training an RBM, we can use it as a generative model to generate new M data
points V (1) , V (2) . . . , V (M ) by performing Gibbs sampling from the conditional probabilities:
P (h|v), P (v|h) as illustrated below:
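The one-step Gibbs sampling and update described above (often called contrastive divergence, CD-1) can be sketched in numpy; all shapes and names here are illustrative, and the model is untrained:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

m, n = 6, 3                                    # visible and hidden units
W = 0.01 * rng.normal(size=(m, n))
b, c = np.zeros(m), np.zeros(n)

def gibbs_step(v):
    """One block Gibbs step: sample h ~ P(h|v), then v' ~ P(v|h)."""
    h = (sigmoid(v @ W + c) > rng.random(n)).astype(float)
    v2 = (sigmoid(h @ W.T + b) > rng.random(m)).astype(float)
    return v2, h

V = np.array([1., 0., 1., 1., 0., 0.])         # clamped training sample (positive phase)
v1, _ = gibbs_step(V)                          # negative-phase sample after one step

# update: W <- W + eta*V^T sigma(VW+c) - eta*v^T sigma(vW+c)
eta = 0.1
W += eta * (np.outer(V, sigmoid(V @ W + c)) - np.outer(v1, sigmoid(v1 @ W + c)))
```

Each outer product is the rank-1 matrix of visible-hidden coactivations mentioned above; averaging over many samples approximates the two expectations.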
7.3 Autoencoders
Goal: To introduce a common and useful neural architecture.
Definition
An autoencoder is a neural network that learns to encode (and decode) a set of inputs. It’s
called an autoencoder because it learns the encoding automatically.
Key Observations
• The code layer is smaller than the input/output layers.
Loss Function

The model is trained using a loss function that minimizes the reconstruction error,

L(x′, x),

where x′ is the network's reconstruction of the input x.
Encoding Dimension
Consider the set of 5 binary vectors, stored in the rows of the matrix
1 0 1 0 0 1 1 0
0 1 0 1 0 1 0 1
0 1 1 0 1 0 0 1 .
1 0 0 0 1 0 1 1
1 0 0 1 0 1 0 1
Even though the vectors are 8-D (so they could take on 256 different inputs), the actual
dataset has only 5 patterns. We can, in principle, encode each of them with a unique 3-bit
code. However, we can choose the dimension of the encoding layer.
Tied Weights

Instead of learning separate weight matrices for the encoder and decoder, we can tie them: if the encoder uses W, the decoder uses Wᵀ, halving the number of weights to learn.
Vector Representations
We have been using vectors to represent inputs and outputs.
Word Representations
What about words? Consider the set of all words encountered in a dataset. We will call this
our vocabulary.
Let's order and index our vocabulary and represent words using one-hot vectors, like above. Then v ∈ R^{N_v}, where N_v is the number of words in our vocabulary (e.g., 70,000). The vector v is a one-hot encoding of the word,

v_i = 1 if word_i = "cat",    v_i = 0 if word_i ≠ "cat".
Example:
These two sentences have similar meanings, but use different words.
• "CS 479 is interesting."
• "CS 479 is fascinating."
We could form synonym groups, but how do we decide on groups when words have similar,
but not identical, meanings? In the following list of words, where does one draw the line
between “misery” and “ecstasy”?
misery - despair - sadness - contentment - joy - elation - ecstasy
These issues reflect the semantic relationships between words. We would like to find a
different representation for each word, but one that also incorporates their semantics.
Example:
How can we tease out the complex, semantic relationships between words?
We can get a lot of information from the simple fact that some words often occur together
(or nearby) in sentences.
“The scientist visited Paris last week, enjoying the city’s famous museums.”
For the purposes of this topic, we will consider “nearby” to be within a fixed window size d.
These word pairings help us understand the relationships between words based on their
co-occurrence in text.
Our approach is to try to predict these word co-occurrences using a 3-layer neural network.
y = f(v, θ), where v is the one-hot vector of a word in our vocabulary, and we can interpret the output as a distribution over the vocabulary.
word2vec
Word2vec is a popular embedding strategy for words (or phrases, or sentences). It uses
additional tricks to speed up the learning.
1. Treats common phrases as new words e.g., ”New York” is one word.
Embedding Space
The embedding space is a relatively low-dimensional space where similar inputs are mapped
to similar locations.
Why does this work? Words with similar meanings will likely co-occur with the same
set of words, so the network should produce similar outputs, and therefore have similar
hidden-layer activations.
Cosine Similarity
The cosine of the angle between two vectors is often used to measure the "distance" between them in the embedding (latent) space:

cos θ = (a · b) / (∥a∥ ∥b∥).
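A small sketch of cosine similarity between hypothetical embedding vectors (the vectors are made up for this illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = a.b / (|a||b|): 1 for parallel vectors, 0 for orthogonal ones."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 3-D embeddings: same direction means similar meaning
interesting = np.array([0.9, 0.2, 0.1])
fascinating = np.array([1.8, 0.4, 0.2])      # same direction, twice the length
unrelated = np.array([-0.2, 0.9, 0.0])

sim_same = cosine_similarity(interesting, fascinating)
sim_orth = cosine_similarity(interesting, unrelated)
print(sim_same, sim_orth)   # similarity near 1 for synonyms, near 0 otherwise
```

Note that cosine similarity ignores vector length, comparing only directions, which is why scaling an embedding does not change its neighbours.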
Example:
We would like to be able to reconstruct the samples in our dataset. In fact, we would like to
be able to generate ANY valid sample. In essence, we would like to sample the distribution
z ∼ p(z) e.g., z could represent digit class, line thickness, slant, etc.
p(x) = ∫ p_θ(x|z) p(z) dz
Note: The mapping between random variable z and x can be written p(x|z).
If we assume pθ (x|z) is Gaussian, with mean d(z, θ) and standard deviation Σ, then
−ln p_θ(x|z) = (1/(2Σ²)) ∥X − d(z, θ)∥² + C.
So, given samples z, we have a way to learn d(z, θ).
We can solve

max_θ E_{z∼p(z)}[ p_θ(x|z) ]   by solving   min_θ E_{z∼p(z)}[ ∥X − d(z, θ)∥² ].

Note that

E_{p(z)}[ p_θ(x|z) ] = ∫ p_θ(x|z) p(z) dz,
where we can use a Monte Carlo method to evaluate the previous integral.
Problem: We don’t know how to sample z ∼ p(z).
Illustration: Suppose we train an AE on a dataset of simple shapes. ⃝□△
The latent space is 2D, and the clusters are well separated.
However, latent vectors between the clusters generate samples that don’t look like shapes.
For MNIST 2D Latent space, we have the following:
What’s happening? Why is our generator so bad? Answer: We are choosing improbable z
samples, p(zi ) ≈ 0
We would like to sample only z’s that yield reasonable samples with high probability.
We are now placing requirements on the distribution in our latent space. Can we get away
with this?
Let's assume that we can choose the distribution of z's in the latent space; call it q(z). Then

p(x) = E_{z∼p}[ p(x|z) ]
     = ∫ p(x|z) p(z) dz
     = ∫ p(x|z) (p(z)/q(z)) q(z) dz
     = E_{z∼q}[ p(x|z) p(z)/q(z) ].
Taking −ln of both sides and applying Jensen's inequality (−ln is a convex function over (0, +∞)) gives

−ln p(x) ≤ KL(q(z)∥p(z)) − E_{z∼q}[ ln p(x|z) ] ≡ (1) + (2),

where

KL(q(z)∥p(z)) = −E_{z∼q}[ ln( p(z)/q(z) ) ]
is the Kullback–Leibler (KL) divergence. Note that the right-hand side (1) + (2) is an upper
bound of the NLL, and we are going to use it as our loss function since minimizing the upper
bound would also minimize the NLL, which would maximize the likelihood p(x), which is
our end goal to get quality generated samples.
(1) Let’s choose a latent distribution that is convenient for us.
p(z) ∼ N (0, I)
Then, our aim is to design q(z) so that it is close to N(0, I), i.e. KL(q(z)∥N(0, I)) ≈ 0.
How do we design our latent representations to achieve this? Answer: We design an encoder
and ask its outputs to be N (µ, σ 2 ).
Here N (µ, σ 2 ) defines a distribution. Next, we keep pressuring the encoder to give us µ = 0,
σ 2 = I.
For these Gaussian choices, the KL term has the closed form (per latent dimension)

KL( N(µ, σ²) ∥ N(0, 1) ) = (1/2)( σ² + µ² − ln σ² − 1 ).

We want to minimize this, but there are other forces at play...
We want to minimize this, but there are other forces at play...
(2) The other term in the objective,
Eq [ln p(x|z)] ,
which, writing the decoder's reconstruction as x̂, we evaluate as E_q[ln p(x|x̂)].
This is called the "reparameterization trick": we write z = µ + σ ⊙ ϵ with ϵ ∼ N(0, I), so that z is a differentiable function of µ and σ, and the stochasticity is brought in by the separate variable ϵ. Note that ϵ is generally a vector of random values, not a scalar.
Intuition
Think of a cloud of matter floating in space, but collapsing in by its own gravity, eventually
forming a star.
• Calculate the KL loss:

(1/2)( σ² + µ² − ln σ² − 1 ).

• Calculate the reconstruction loss:

(1/2) ∥x̂ − x∥²   for Gaussian p(x|x̂),

or

−Σ_i x_i ln x̂_i   for Bernoulli p(x|x̂).
E = E_x[ L(x, x̂) + β( σ² + µ² − ln σ² − 1 ) ],

where all of the terms depend on the network parameters θ.
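The two terms of this objective can be sketched as a single loss function; this is a minimal illustration for Gaussian p(x|x̂), with the 1/2 factor on the KL term written explicitly (names are illustrative):

```python
import numpy as np

def vae_loss(x, x_hat, mu, sigma2, beta=1.0):
    """Reconstruction error plus beta-weighted KL( N(mu, sigma2) || N(0, I) )."""
    recon = 0.5 * np.sum((x_hat - x) ** 2)                    # Gaussian p(x|x_hat)
    kl = 0.5 * np.sum(sigma2 + mu**2 - np.log(sigma2) - 1.0)  # closed-form KL
    return recon + beta * kl

# The KL term vanishes exactly when the encoder outputs mu = 0, sigma^2 = 1:
mu, sigma2 = np.zeros(2), np.ones(2)
x = np.array([0.5, -0.5])
print(vae_loss(x, x, mu, sigma2))   # -> 0.0
```

The β weight trades reconstruction quality against how closely the latent distribution matches N(0, I).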
Because the VAE seeks a distribution in the latent space that is close to N (0, I), there are
fewer holes.
Note: Points chosen randomly from N (0, I). Decode → Generated samples.
Now, there are no (or fewer) gaps in the latent space, which enhances the quality of samples
generated by the VAE.
Chapter 8

Recurrent Neural Networks
Goal: To learn the basics of Recurrent Neural Networks (RNNs): what they are used
for, and how to train them.
So far, we have focused on FFNNs, CNNs, AE, VAEs and Diffusion models, where we have
an architecture similar to the following:
The most likely word is ‘French’ because of the context created by the previous words. To
model this task using a neural network, we have two solutions.
As a result, a conventional Feedforward neural network (FFNN) is not suitable for this task.
Solution 2: RNNs
Key Differences:
• Theoretically, an RNN can approximate any computation you can do on your laptop
(i.e., Universal Turing Machine).
Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state that
carries information from previous time steps. The following figure illustrates the unrolled
structure of an RNN, showing how information flows over time.
At each time step i, the hidden state is updated using the previous hidden state and the
current input. This is mathematically defined as:
⃗hi = f ⃗x i U + ⃗hi−1 W + ⃗b ,
where:
• ⃗hi is the hidden state at time step i.
• ⃗x i is the input at time step i.
• U is the weight matrix for the input-to-hidden transformation.
• W is the weight matrix for the hidden-to-hidden transformation (which allows infor-
mation to persist over time).
• ⃗b is the bias term.
• f is a non-linear activation function (e.g. tanh, sigmoid, or ReLU).
The output at each time step is computed as:
⃗y i = Softmax ⃗hi V + ⃗c ,
where:
• ⃗y i is the output at time step i.
• V is the weight matrix for the hidden-to-output transformation.
• ⃗c is the bias term.
• The Softmax function ensures that the output represents probabilities or conditional
probabilities.
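The two equations above can be sketched as a forward pass in numpy; the weights here are random and purely illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, b, V, c):
    """Unroll h_i = tanh(x_i U + h_{i-1} W + b), y_i = softmax(h_i V + c)."""
    h = np.zeros(W.shape[0])       # initial hidden state
    ys = []
    for x in xs:
        h = np.tanh(x @ U + h @ W + b)
        ys.append(softmax(h @ V + c))
    return ys

rng = np.random.default_rng(0)
D, H, K = 4, 8, 3                  # input dim, hidden units, output classes
U, W, b = rng.normal(size=(D, H)), rng.normal(size=(H, H)), np.zeros(H)
V, c = rng.normal(size=(H, K)), np.zeros(K)
ys = rnn_forward(rng.normal(size=(5, D)), U, W, b, V, c)
print(len(ys), ys[0].sum())        # 5 outputs, each a distribution summing to 1
```

Note that the same U, W, and V are reused at every time step: the parameters are shared across the unrolled sequence.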
Remarks
One of the key strengths of RNNs is their ability to maintain a form of memory through the
hidden state ⃗hi . This makes them particularly effective for tasks involving sequences, such
as language modeling, time-series forecasting, and speech recognition.
• The hidden state ⃗hi acts as a memory of previous inputs. This allows the network to
capture temporal dependencies and make informed predictions based on past informa-
tion.
• Increasing the size of ⃗hi (i.e., increasing the number of hidden units) enhances the
model’s ability to store and process long-term dependencies. However, this also leads
to higher computational costs and memory usage.
Despite their advantages, standard (vanilla) RNNs have limitations, particularly when deal-
ing with long sequences. They often suffer from vanishing and exploding gradient problems,
making it difficult to learn dependencies over long time spans. More advanced architectures
such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs)
have been developed to address these issues. In particular, we will cover the GRU architec-
ture in the next lecture.
The total loss function L over a sequence of time steps is computed as the sum of individual
losses at each step:
L( y⃗^1, …, y⃗^N ; t⃗^1, …, t⃗^N ) = Σ_{i=1}^{N} α_i L( y⃗^i, t⃗^i ),
where:
• ⃗y i is the predicted output at time step i.
• ⃗t i is the target output (ground truth) at time step i.
• N is the sequence length.
• L(⃗y i , ⃗t i ) represents the loss function measuring the error between prediction and target.
This could be cross-entropy loss for classification tasks or mean squared error (MSE)
for regression tasks.
• αi are weights for different time steps.
• The total loss is accumulated over all time steps, ensuring that all predictions contribute
to the optimization process.
The objective of training an RNN is to find the optimal parameters θ that minimize the
expected loss over the entire dataset:
θ* = arg min_θ E_{(X,T)∈D}[ L ]
where:
• (X, T) are the input sequences (e.g. a sentence of words) and their corresponding
target outputs from the dataset.
In an unsupervised learning (UL) setting where we only have access to the input sequences X, we can minimize the expectation of the negative log likelihood L(Y), where

L(Y) = −Σ_{i=1}^{N} Σ_j log( y_j^i ).
Deep RNNs
Recurrent Neural Networks (RNNs) can be extended beyond a single layer by stacking mul-
tiple RNN layers vertically. This results in a Deep RNN, which allows for hierarchical
feature extraction across multiple levels of representation.
Key Idea: Instead of a single hidden state per time step, we introduce multiple layers of
hidden states. Each layer processes information before passing it to the next layer.
⋮

h⃗^n_L = f( h⃗^n_{L−1} U_L + h⃗^{n−1}_L W_L + b⃗_L ).
Here:
• ⃗hnl is the hidden state at time step n in layer l.
• Ul is the input-to-hidden weight matrix for layer l.
• Wl is the recurrent weight matrix within layer l.
⃗y n = Softmax(⃗hnL V + ⃗c),
where: V is the weight matrix from the last hidden layer to the output and ⃗c is a bias vector.
Goal: To understand the limitations of vanilla RNNs, and see one solution: Gated Recurrent Units (GRUs).
Long-Range Dependencies
One of the major challenges in sequence processing with RNNs is the loss of information as
the sequence length increases. Consider the following example:
Cats make great house pets. If you have one, it is important that you care for
it, including taking it for regular health check-ups at the .
The expected word might be “vet”. However, due to the long distance between the subject
(Cats) and the missing word, standard RNNs may struggle to maintain the necessary context.
To simplify our analysis of the long-range dependency issue, assume the following:

• Y = X = H = 1 (i.e., scalar inputs, hidden states, and outputs).

Then the hidden-state update reduces to

h_n = w h_{n−1} + x_n,

and unrolling gives h_n = wⁿ h_0 + Σ_{k=1}^{n} w^{n−k} x_k.

• If |w| < 1, then wⁿ shrinks exponentially as n grows. This means that earlier information (h_1 and inputs x_1, x_2, …) has exponentially less influence on h_n over time.
• If |w| > 1, the magnitude of h_n will grow exponentially, making RNN training unstable.
The candidate hidden state is computed as

h̃⃗^n = tanh( h⃗^{n−1} W + x⃗^n U + b⃗ ),

where:
• W is the hidden-to-hidden weight matrix.
• U is the input-to-hidden weight matrix.
• ⃗b is the bias vector.
• The tanh function ensures that h̃ni ∈ (−1, 1).
Gate Mechanism
The gate ⃗g n determines how much past information is retained:
g⃗^n = σ( h⃗^{n−1} W_g + x⃗^n U_g + b⃗_g ),
where:
• Wg and Ug are the gate’s weight matrices.
• b⃗g is the bias vector for the gate.
• The σ (sigmoid) function ensures that gin ∈ (0, 1), meaning it acts as a soft switch.
The new hidden state blends the candidate state with the previous state,

h⃗^n = g⃗^n ⊙ h̃⃗^n + (1 − g⃗^n) ⊙ h⃗^{n−1},

where:
• ⊙ denotes the Hadamard (element-wise) product.
Intuition
• If gin ≈ 1, the new state is mostly the candidate state component h̃ni , meaning the
network updates to new information.
• This gating mechanism allows GRUs to adaptively retain or forget past information,
addressing the vanishing gradient problem.
The new hidden state is a combination of the previous hidden state and the candidate
hidden state, allowing for more efficient information retention in sequential learning.
• Once g = 0, the hidden state stops updating, preserving information from earlier in
the sentence.
When g = 0, the hidden state remains unchanged, preventing unnecessary updates. This
ensures that crucial information, such as the subject (Cats), is retained and can still influence
later words. As a result, GRUs can effectively model long-range dependencies.
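A minimal numpy sketch of this gated update, following the convention in these notes where g multiplies the candidate state (the weights are zeroed out here just to make the gate's effect visible):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, W, U, b, Wg, Ug, bg):
    """Gated update: h = g * h_candidate + (1 - g) * h_prev (element-wise)."""
    h_tilde = np.tanh(h_prev @ W + x @ U + b)   # candidate hidden state
    g = sigmoid(h_prev @ Wg + x @ Ug + bg)      # gate values in (0, 1)
    return g * h_tilde + (1.0 - g) * h_prev

# With the gate saturated at 0 (large negative gate bias), the state is preserved:
H, D = 3, 2
h = np.array([0.7, -0.2, 0.5])
x = np.ones(D)
zeros = lambda *s: np.zeros(s)
h_new = gru_step(h, x, zeros(H, H), zeros(D, H), zeros(H),
                 zeros(H, H), zeros(D, H), -50.0 * np.ones(H))
print(h_new)   # -> [ 0.7 -0.2  0.5], unchanged: information is carried forward
```

Because a closed gate passes h⃗^{n−1} through unchanged, gradients flowing back through that path are not repeatedly shrunk, which is how the gate mitigates vanishing gradients.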
The reset gate ⃗r n determines how much of the previous hidden state ⃗hn−1 should be for-
gotten before computing the new candidate hidden state ⃗h̃n . When ⃗r n is close to 0, the past
information is largely discarded, making the model more reliant on the new input ⃗x n . When
⃗r n is close to 1, more of the past hidden state is retained.
There is also an advanced RNN variant called Long Short-Term Memory (LSTM), but
it is not covered in this course.
Chapter 9

Adversarial Attacks
Classification Error
The classification error is defined as:

R(f) ≜ E_{(x,t)∼D}[ card{ arg max_i y_i ≠ t } | y = f(x) ],

where:

• arg max_i y_i gives the index of the largest element of y.
• "card" denotes the cardinality (number of elements), so the term inside the expectation is 1 when the prediction is wrong and 0 otherwise.
ε-Ball (Neighbourhood)

The ε-ball (neighbourhood) around a point x is

B(x, ε) = { x′ ∈ X | ∥x − x′∥ ≤ ε }.
Adversarial Attacks
We ask: given (x, t) ∈ D, is there an x′ ∈ B(x, ε) such that arg maxi yi ̸= t for y = f (x′ )?
These inputs can be found quite easily. This is called an Adversarial Attack. There are
two main classes of adversarial attacks:
• Whitebox: the attacker has access to the whole model, e.g., weights, activation functions,
etc.
• Blackbox: the attacker can only query the model's outputs, without access to its internals.
• A self-driving vehicle’s model misclassifying a stop sign as a speed limit sign can lead
to a fatal accident. For more details: [Link]
• A model misclassifying the MRI scan of a sick patient as normal can have severe
consequences on the patient’s health.
Untargeted Attack:
max x′ ∈B(x,ε) [L(f (x′ ), t(x))]
Targeted Attack:
min x′ ∈B(x,ε) [L(f (x′ ), l)] , where l ̸= t(x).
Gradient descent nudges input in the direction to decrease loss for the wrong target class.
∥∆x∥∞ = ε
Example:
Here is an illustration from Goodfellow, I. J., Shlens, J., Szegedy, C. (2015). Explaining
and Harnessing Adversarial Examples. In Proc. of ICLR.
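The fixed-size ∞-norm perturbation above can be sketched on a toy logistic model (a hypothetical two-feature example; the model, weights, and loss here are illustrative assumptions, though the signed-gradient step is the same idea used on images). The attack takes one step of size ε in the direction that increases the loss for the true label:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, t, w, b):
    """Cross-entropy loss of a logistic model p = sigmoid(w.x + b) on one sample."""
    p = sigmoid(w @ x + b)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def fgsm_untargeted(x, t, w, b, eps):
    """One fast-gradient-sign step: x' = x + eps * sign(dL/dx), so ||x' - x||_inf = eps."""
    p = sigmoid(w @ x + b)
    grad_x = (p - t) * w          # dL/dx for the cross-entropy loss
    return x + eps * np.sign(grad_x)
```

Because each coordinate of the perturbation is ±ε, the attack exhausts the whole ε-ball in the ∞-norm while being a single, cheap gradient computation.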
Instead of applying a fixed perturbation, one can also search for the smallest ∥∆x∥ that
causes misclassification:
min ∆x ∥∆x∥ subject to arg maxi [yi (x + ∆x)] ̸= t(x)
For MNIST, each input image is flattened: R28×28 → R784 .
The neural network partitions the space into 10 regions, each corresponding to a digit class.
Summary
Learning is defined as optimizing the expected loss:
min f E(x,t)∼D [L(f (x), t)]
Untargeted Attack:
max x′ ∈B(x,ε) [L(f (x′ ), t)]
Targeted Attack:
min x′ ∈B(x,ε) [L(f (x′ ), l)] , l ̸= t
How can we train our models so that they are harder to attack?
Adversarial Training
During training, we add adversarial samples to the dataset.
What if we built that process into our training? Idea: Incorporate a mini adversarial
attack into every gradient step while training.
TRADES:
“TRadeoff-inspired Adversarial DEfense via Surrogate loss minimization”
Model f : X → R
Robust Loss
Rrob (f ) ≜ E(X,T )∼D [ max X ′ ∈B(X,ε) card{ f (X ′ )T ≤ 0 } ]
Explanation:
• Rrob accounts for all possible misclassified points within B(X, ε). For the binary model f : X → R with labels T ∈ {−1, +1}, a point is misclassified when f (X ′ )T ≤ 0.
[Plot: the 0–1 loss card{α ≤ 0} and a smooth surrogate loss g(α), each as a function of α.]
TRADES replaces the 0–1 loss with the surrogate g and solves:
min f E(X,T ) [ g(f (X)T ) + max X ′ ∈B(X,ε) g(f (X)f (X ′ )) ]
• The second term adds a penalty for models f that place the decision boundary within
ε of X, where f (X) and f (X ′ ) will have opposite signs.
Implementation
min f E(X,T ) [ g(f (X)T ) + β max X ′ ∈B(X,ε) g(f (X)f (X ′ )) ]
where β is a hyperparameter that trades off clean accuracy against robustness.
Results
Chapter 10
Neural Engineering
[Diagram: Input x → ? → Output y]
Neural Engineering
We can build a neural network that processes our data by putting together a series of
interpretable transformations, such as regression layers.
The first step is to encode data into the collective activity of a population of neurons.
Consider this simple regression network with an imaginary linear readout node, connected
to all the neurons in the hidden layer.
[Diagram: the input x is encoded into hidden activities h1 , . . . , hN ; decoding weights d1 , . . . , dN connect the hidden neurons to the readout node y.]
y = h1 d1 + h2 d2 + · · · + hN dN = h · d
Suppose we want our imaginary readout node to approximate x for a range of possible input
x values. This is a regression problem in which we adjust only the decoding weights and
keep the encoding fixed.
Since we know the function the network should compute, we can create a training dataset:
• Choose a set of input values {x(p) }Pp=1
• Generate training samples of the form (h(p) , x(p) ), where h(p) is the hidden-layer activity evoked by the input x(p)
Since it’s a regression problem, we use mean squared error (MSE) loss.
We could train this using backpropagation, but let’s look more closely at the structure of
this optimization:
min d (1/2P ) Σp=1,...,P ( h(p) · d − x(p) )2    (MSE Loss)
where
H = [ h(1) ; . . . ; h(P ) ]
holds the activities of the hidden population, with each row h(p) corresponding to one input
sample.
d∗ = (H ⊤ H)−1 H ⊤ X.
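This closed-form fit can be sketched in a few lines of numpy. The random encoders, biases, and logistic nonlinearity below are illustrative assumptions (the notes do not fix these values); np.linalg.lstsq solves the same least-squares problem as (H ⊤ H)−1 H ⊤ X, but more stably than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoding: N neurons with random encoders, biases, and a logistic nonlinearity.
P, N = 200, 30
x = np.linspace(-1.0, 1.0, P)                        # training inputs x^(p)
E = rng.standard_normal(N)                           # one encoder per neuron
b = rng.uniform(-1.0, 1.0, N)                        # one bias per neuron
H = 1.0 / (1.0 + np.exp(-(np.outer(x, E) + b)))      # hidden activities, shape P x N

# Least-squares decoders: same minimizer as d* = (H^T H)^{-1} H^T x.
d, *_ = np.linalg.lstsq(H, x, rcond=None)
x_hat = H @ d                                        # decoded estimate of x
```

With only the decoding weights trained, the readout already reconstructs x closely over the training range.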
[Diagram: two inputs x1 , x2 encoded by hidden neurons h1 , h2 , h3 and decoded to two outputs y1 , y2 .]
For example, suppose an input vector (x1 , x2 ) is encoded by a hidden layer h = σ([x1 , x2 ]E +
b). Collecting P such samples yields
H = σ(XE + b),
where X ∈ RP ×2 and H ∈ RP ×N .
If we have two output nodes, say y1 and y2 , we need two different sets of decoding weights—one
column per output:
D∗ = arg minD ∥HD − X∥2F .
So long as an error (target) can be specified, the decoding weights can be learned.
10.2 Transformations
Goal: To see how to use population coding to pass data between populations of neurons.
So far, we’ve only looked at interpreting the activity of a population of neurons, decoding
its encoded value(s) to an imaginary readout node. But neurons send their output to other
neurons.
Suppose we want population B to encode the same value encoded in population A. Can we
just copy the neural activities?
[Diagram: population A receives the input; population B produces the output.]
A different approach:
To copy the value to B, we decode it from A and then re-encode it using B’s encoders.
x̂ = ADA , DA ∈ RN ×K
[Diagram: population A (N neurons) encodes x via its encoders EA ; a decoder applied to A’s activity recovers the value, which is re-encoded into population B (M neurons) via EB .]
Low-Dimensional Bottle-Neck
Even though we can form a full connection matrix W from the product DE, it is actually
a low-rank matrix.
In the example above, W = DE is the product of an N × K matrix and a K × M matrix, so rank(W ) ≤ K, where K is the dimension of the encoded value (typically K ≪ N, M ).
This might seem like a limitation, but it actually makes things more efficient.
This shows why using low-rank representations (like DE) is not only compact but computationally efficient.
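Both claims can be checked quickly in numpy, using hypothetical population sizes: the factored form (aD)E gives the same result as aW while never materializing the full N × M matrix, replacing N · M multiply-adds with roughly K · (N + M ).

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 500, 400, 3                  # hypothetical population sizes, K-dim encoded value
D = rng.standard_normal((N, K))        # decoders of population A
E = rng.standard_normal((K, M))        # encoders of population B
a = rng.standard_normal(N)             # activities of population A

W = D @ E                              # full N x M connection matrix (rank <= K)
full_input = a @ W                     # direct route: N*M multiply-adds
low_rank_input = (a @ D) @ E           # decode (N*K), then re-encode (K*M): same result
```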
10.3 Dynamics
Goal: To see how to build dynamic, recurrent networks using population coding methods.
Consider a dynamic model of a leaky integrate-and-fire (LIF) neuron that contains two
time-dependent processes:
τs ds/dt = −s + C    (Current)
τm dv/dt = −v + σ(s)    (Activity)
• C: input current
• s: synaptic current state
• v: neuron activity
• σ: the neuron’s activation function
[Plots: activity v = σ(s) as a function of s for LIF, logistic, and tanh neurons.]
Equilibrium Solutions
Current:
τs ds/dt = −s + C    (suppose C is constant)
At equilibrium:
ds/dt = 0 ⇒ 0 = −s + C ⇒ s = C
Activity:
τm dv/dt = −v + σ(s)    (suppose s is constant)
At equilibrium:
dv/dt = 0 ⇒ v = σ(s)
Variants
Case 1: τm ≪ τs . If the neuron membranes react quickly (i.e., reach equilibrium quickly)
compared to the synaptic dynamics:
τs ds/dt = −s + C
v = σ(s)
Case 2: τs ≪ τm . If the synaptic current changes quickly compared to the neuron activity:
s = C
τm dv/dt = −v + σ(s)
In steady state:
v = σ(s)
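Case 1 can be sketched with simple Euler integration (the logistic σ and the particular time constants below are illustrative assumptions): the synaptic state s converges to the constant input C, and the activity is read out as v = σ(s).

```python
import numpy as np

def simulate(C, tau_s=0.05, dt=0.001, T=0.5):
    """Euler-integrate tau_s ds/dt = -s + C for constant input C.
    Case 1 (tau_m << tau_s): the membrane is fast, so the activity is just v = sigma(s)."""
    s = 0.0
    for _ in range(int(T / dt)):
        s += (dt / tau_s) * (-s + C)       # Euler step toward the equilibrium s = C
    v = 1.0 / (1.0 + np.exp(-s))           # logistic sigma as the activation function
    return s, v
```

After roughly ten time constants, s is essentially at its equilibrium value C, matching the analysis above.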
Recurrent Networks
The feedback afforded by recurrent connections can lead to interesting dynamics.
Consider a neural integrator:
Key ideas:
• Recurrent connections: W = DE
• Identity decoding: νD ≈ y
We let the time dynamics be dictated by the synaptic time constant. So we set τm = 0, and
therefore:
v = σ(s)
Then:
Then:
τs ds/dt = −s + σ(s)W + β + C    (recurrent)
Writing ν = σ(s), this is
ds/dt = (−s + νW + β)/τs + C̃ , with C̃ = xE
Thus,
τs ds/dt = νW + β − s + τs xE
= (νD + τs x)E + β − s
= (y + τs x)E + β − s,
where ν = σ(s) is the vector of neural activities and y = νD is the decoded value.
Example:
You can get the recurrent network to simulate the dynamical system:
dy/dt = f (y)
by setting the input to be f (y).
Network Structure:
Chapter 11
Additional Topics
Feed-forward network:
[Diagram: a two-layer feed-forward network with nodes x11 , x12 in Layer 1 and x21 , x22 in Layer 2.]
PC Node
Prediction: µi = σ(xi+1 )M i + β i
Error Node
τ dεi /dt = xi − µi − ν i εi    (ν i : leak coefficient)
At equilibrium:
εi = (1/ν i ) (xi − µi )
Generative Network
Assume xi ∼ N (µi , ν i ). Then
p(xi | µi ) = (1/Z) exp( −∥xi − µi ∥2 / 2(ν i )2 )    (Z: normalization constant)
− log p(xi | µi ) = (1/2) ∥ (xi − µi )/ν i ∥2 + const.
As a result:
− log p(x | y) = (1/2) Σi=1,...,L−1 ∥εi ∥2
Hopfield Energy
F = (1/2) Σi=1,...,L−1 ∥εi ∥2
τ dxi /dt = −∇xi F
1. εi = xi − µi (xi+1 ) = xi − (σ(xi+1 )M i + β i )
Therefore:
∇xi F = εi ∂εi /∂xi + εi−1 ∂εi−1 /∂xi
= εi − σ ′ (xi ) ⊙ [ εi−1 (W i−1 )T ]
Dynamics
○1 τ dxi /dt = σ ′ (xi ) ⊙ (εi−1 M i ) − εi
○2 τ dεi /dt = xi − µi − ν i εi
○3 dM i /dt = σ(xi+1 )T εi
○4 τ dW i /dt = (εi )T · σ(xi+1 )
At equilibrium (dxi /dt = 0), we get:
εi = σ ′ (xi ) ⊙ [ εi−1 (W i−1 )T ]    (This looks like backprop!)
where:
∂F/∂µi = −εi ,   ∂F/∂xi = . . .
Training
Clamp x1 = X and xL = Y ⇒ run to equilibrium.
xi , εi reach equilibrium quickly.
Then ○3 and ○4 use that equilibrium to update M i and W i .
Generating
Clamp xL = Y ⇒ run to equilibrium ⇒ x1 is generated sample.
Inference (approx.)
Clamp x1 = X ⇒ run to equilibrium ⇒ arg maxj (xLj ) is the class.
• Generator (G): Receives noise Z as input and generates data that tries to mimic the
real data distribution.
• Discriminator (D): Tries to distinguish between real and fake data produced by the
generator. The output is a probability y where a value close to 1 indicates “real” and
close to 0 indicates “fake.”
The GAN is trained through a min-max game where the generator tries to maximize the
probability of the discriminator making a mistake, and the discriminator tries to minimize
this probability.
C(θD , θG ) = − 12 Exreal ∼pdata log DθD (xreal ) − 12 Ez∼pz log(1 − DθD (GθG (z))) ,
where the first term corresponds to the negative log probability of the real datapoints and
the second term corresponds to the negative log probability of the fake datapoints, where
we could define xfake = G(z) given a noise z. Note that θD , θG are the parameters of the
discriminator and the generator, respectively. Additionally:
• D(xreal ) is the discriminator’s estimate of the probability that real data is real.
• D(xfake ) is the discriminator’s estimate of the probability that fake data is real.
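The cost C(θD , θG ) can be evaluated directly from discriminator outputs. A small numpy sketch (estimating each expectation by a batch average is an assumption about the implementation): at the equilibrium D(xreal ) = D(xfake ) = 0.5 the cost equals log 2.

```python
import numpy as np

def gan_cost(d_real, d_fake):
    """C(theta_D, theta_G) evaluated from discriminator outputs:
    d_real = D(x_real) and d_fake = D(G(z)), each an array of probabilities in (0, 1)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # First term: negative log probability assigned to the real datapoints.
    # Second term: negative log probability of calling the fakes fake.
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))
```

A confident, correct discriminator (outputs near 1 on real data and near 0 on fakes) drives this cost toward 0, while the generator pushes it back up toward the log 2 equilibrium.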
Training Strategy
Training can be seen as the min–max game maxθG minθD C(θD , θG ).
The discriminator D aims to minimize the cost function C, ideally such that D(xreal ) = 1
and D(xfake ) = 0. At equilibrium, the generator makes fake samples indistinguishable from
real ones, so
D(xreal ) ≈ D(xfake ) ≈ 0.5.
Relation to untargeted adversarial attacks. For a classifier fθ (x) with loss ℓ(fθ (x), y),
an untargeted adversarial attack solves maxx′ ∈B(x,ε) ℓ(fθ (x′ ), y),
where the discriminator DθD plays the role of the classifier fθ and the generator GθG (z)
plays the role of a learned adversary that searches (by gradient ascent on θG ) for inputs
xfake = GθG (z) that maximally confuse the discriminator.
1. The discriminator is trained to distinguish real data from the generator’s samples.
2. The generator is trained to produce samples that the discriminator classifies as real.
3. Ideally, the generator produces data indistinguishable from real data, and the discriminator assigns a probability of 0.5 to both real and fake data.
The GAN is trained until the discriminator can no longer reliably distinguish between real
and generated data. So the generator G has, in principle, the winning strategy. These steps
are illustrated as follows:
Example: MNIST
If you want to get more intuition, you can play with GANs online on this link.
Historical context
Transformers became famous with the publication of the paper Attention Is All You Need
(2017): [Link]
Main highlights:
• Transformers enabled state-of-the-art results in Natural Language Processing.
• Transformers are based on the attention mechanism.
X ∈ R3×d ,  Y ∈ R4×d
Vectorization
Q = XW (Q)    (n × d · d × ℓ = n × ℓ)
K = XW (K)    (n × d · d × ℓ = n × ℓ)
V = XW (V )    (n × d · d × ℓ = n × ℓ)
Sij is interpreted as vector i’s score for vector j. This gives us a full score matrix:
S = QK T    (n × ℓ · ℓ × n = n × n)
Output of self-attention:
H = A · V ,   H ∈ Rn×ℓ ,
where A is obtained by applying a row-wise softmax to the score matrix S, and H is an attention head. Alternatively, for each token output:
H⃗ i = Σj=1,...,n Aij ⃗vj
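The full pipeline (Q, K, V projections, scores, row-wise softmax, weighted sum) fits in a few lines of numpy. This sketch adds the 1/√ℓ score scaling used in the original paper, which the formulas above omit; note also that permuting the input rows just permutes the output rows, which is exactly why positional encoding (next) is needed.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(S)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention. X: n x d token embeddings; W*: d x l projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values (each n x l)
    S = Q @ K.T / np.sqrt(K.shape[1])         # scaled score matrix, n x n
    A = softmax(S, axis=-1)                   # each row: a distribution over the tokens
    return A @ V                              # attention head H, n x l
```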
Positional Encoding
Sentences containing the same words in a different order produce the same attention head, but they can have different meanings. ⇒ We need to impose order on the inputs.
PE(i)2j = sin( i / 10000^(2j/d) )    PE(i)2j+1 = cos( i / 10000^(2j/d) )
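These formulas translate directly to numpy (assuming an even model dimension d):

```python
import numpy as np

def positional_encoding(n, d):
    """PE[i, 2j] = sin(i / 10000^(2j/d)), PE[i, 2j+1] = cos(i / 10000^(2j/d)).
    Assumes the model dimension d is even."""
    i = np.arange(n)[:, None]                 # token positions, column vector
    j = np.arange(d // 2)[None, :]            # frequency index, row vector
    angle = i / 10000.0 ** (2 * j / d)        # broadcast to an n x (d/2) grid
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angle)               # even coordinates: sine
    pe[:, 1::2] = np.cos(angle)               # odd coordinates: cosine
    return pe
```

Each position gets a distinct pattern of phases, so adding PE to the embeddings breaks the permutation symmetry of attention.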
Multi-Head Attention
Multi-Head Attention allows the model to jointly attend to information from different representation subspaces. Each head can learn distinct aspects or features (e.g., job, age,
address,. . . ), enabling the model to capture various semantic aspects simultaneously. The
mechanism can be formally expressed as follows:
MultiHead(X) = Concat(H (1) , . . . , H (h) ) W (O) ,
where each H (µ) is computed independently in each attention head µ. Here, WµQ , WµK , and
WµV are learned projection matrices for queries Q, keys K, and values V respectively, and
W (O) is a learned output projection. The number of heads is typically denoted by h.
11.5 Transformers
Goal:
Batch Normalization
Consider a dataset in which you are trying to estimate if someone is a vegetarian from their
age and income.
Age (years)   Income ($)   Veg?
27            31,000       yes
52            120,000      no
16            10,000       yes
...           ...          ...
Most of the variance in this dataset is along the income axis (income values vary over a
much larger range than ages).
Of course, the connection weights in a neural network can accommodate these differences in
scale, but doing so will typically result in weights with very different magnitudes. This, in
turn, forces us to use a small learning rate to keep training stable.
Batch normalization (BN) normalizes each neuron’s activation across the samples in a batch, while layer normalization (LN) normalizes the activations of a whole layer for each sample individually. In other words, BN normalizes across the batch for each
neuron, whereas LN normalizes across neurons/features for each sample.
For a hidden vector ⃗h ∈ RH we first compute the mean µ and variance σ 2 of its coordinates,
and form the normalized activations ⃗ĥ = (⃗h − µ)/√(σ 2 + ε), for a small ε > 0. Layer norm is then
defined as LN(⃗h) = α ⊙ ⃗ĥ + β, where α, β ∈ RH are learned scale and shift parameters and
⊙ denotes elementwise multiplication.
We prefer LN over BN in Transformers because the batch size (e.g. number of words in the
sentence) is often variable (and can even be 1 at inference time), and LN does not depend
on batch statistics.
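A minimal numpy version of LN for a single hidden vector, following the definition above (α and β would be learned in practice; here they are simply passed in):

```python
import numpy as np

def layer_norm(h, alpha, beta, eps=1e-5):
    """Normalize the coordinates of one hidden vector h, then scale and shift."""
    mu = h.mean()                          # mean over the H features
    var = h.var()                          # variance over the H features
    h_hat = (h - mu) / np.sqrt(var + eps)  # zero-mean, unit-variance activations
    return alpha * h_hat + beta            # elementwise learned scale and shift
```

Note that nothing here depends on any other sample, which is why LN works with variable (even size-1) batches.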
Feed-Forward Layer
Inside each encoder (and decoder) block, there is a position-wise feed-forward network (FFN).
The FF layer simply applies weights and biases independently to each position of the sequence.
Concretely, for an input vector x at a given position, the FFN typically has the form f (x) =
W2 ϕ(W1 x+b1 )+b2 , where W1 , W2 are weight matrices, b1 , b2 are biases, and ϕ is a nonlinearity
such as ReLU or GELU. The same parameters are shared across all positions, but each
position is processed independently.
Decoder
The decoder component is made up of many of the same parts as the encoder:
• masked multi-head self-attention;
• add & norm;
• feed-forward layer;
However, after the first block of masked multi-head attention (MHA), the decoder receives
the keys and values from the encoder. Intuitively, while generating a translated sentence
(for example, English) the decoder can ask the encoder: “Given the French sentence, which
source words should I focus on now for this output token?” The cross-attention mechanism
answers this question by attending over the encoder representations.
Consider a French input sentence such as Je suis étudiant. After tokenization and em-
bedding (plus positional encoding), the sequence is passed through the encoder stack. The
encoder produces a sequence of hidden vectors that are used as keys and values for the
decoder’s cross-attention.
On the decoder side, we generate the English sentence I am a student . <end> one
token at a time. The decoder is autoregressive: at step t it can only attend to positions
1, . . . , t in the output sequence (future tokens are masked).
Let y1 , . . . , yT be the target tokens. At training time, the decoder operates roughly as follows:
1. Step 1: only the start-of-sequence token is visible; the decoder predicts y1 = I.
2. Step 2: the visible prefix is I and the rest of the positions are masked; the decoder
predicts y2 = am.
The final linear layer followed by a softmax transforms the decoder’s hidden state at each
position into a categorical distribution over the vocabulary. Concretely, the hidden representation hi is mapped to logits (softmax pre-activation) W hi + b, and applying a softmax
produces a valid probability distribution. This yields the conditional probability pθ (xi |
x<i ) = Softmax(W hi + b), which appears inside the negative log-likelihood loss (NLL)
L = ED [ − Σi=1,...,N log pθ (xi | x<i ) ] ,
where N is the sentence length and D is a dataset which could represent a sentence of words.
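The loss can be computed from the logit matrix with a numerically stabilized log-softmax. A small numpy sketch (the shapes are illustrative: one row of logits W hi + b per position):

```python
import numpy as np

def nll(logits, targets):
    """Negative log-likelihood of target token ids.
    logits: N x V array (one row of logits per position); targets: N token ids."""
    z = logits - logits.max(axis=1, keepdims=True)               # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True)) # log-softmax per row
    return -log_probs[np.arange(len(targets)), targets].sum()    # sum over positions
```

With uniform logits over a vocabulary of size V, every position contributes log V, the maximum-entropy baseline the model must improve upon.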
For an attention layer, every output vector contains information aggregated from all vectors
in the sequence. This makes it much easier to model long-range dependencies and enables
highly parallel computation over all sequence positions.
Environmental concerns
Training very large language models is computationally and financially expensive, and it also
has a non-negligible environmental footprint.
• Training cost on the order of 4-5 million USD.
• Estimated emissions of about 550 tons of CO2 for a single large training run.
To give some intuition, 550 tons of CO2 is roughly comparable to
• the emissions generated by driving around the Earth on the order of 100 times, or
• the amount of CO2 that would need to be absorbed by approximately 10,000 to 100,000
trees over multiple years.
Because of these costs, architectures such as recurrent neural networks (RNNs) (for example, minGRU: [Link]) and other lightweight models can
sometimes offer cheaper and more sustainable alternatives.
The Hodgkin-Huxley model describes the dynamics of membrane potential by modeling the nonlinear interaction between the membrane voltage and ion channels. It introduces variables like m, n, and h that represent the probability of sodium and potassium channels being open. These probabilities change with voltage, impacting the conductance of ions and thus influencing the membrane potential. The model incorporates differential equations to track these dynamics over time, allowing prediction of action potentials by simulating ion flow through voltage-gated channels.
Sodium and potassium ion channels are crucial for the generation of action potentials in neurons. Voltage-gated sodium channels open when the membrane potential reaches a threshold, allowing Na+ ions to flow into the cell and causing depolarization. This leads to further opening of Na+ channels in a positive feedback loop. As a result, the membrane potential becomes more positive. Subsequently, potassium channels open, allowing K+ ions to flow out of the cell, which repolarizes the membrane by lowering the membrane potential back toward the resting potential. The interplay of opening and closing these channels enables the propagation of an action potential along the neuron.
A significant challenge with Recurrent Neural Networks (RNNs) is their difficulty in maintaining long-term dependencies due to the vanishing gradient problem, where gradients diminish as they are back-propagated through time. This can result in the loss of information from earlier in the sequence, making RNNs struggle with long-range dependencies. As sequence length increases, the network’s ability to remember context from earlier in the sequence diminishes, hindering performance in tasks that require understanding of long-term dependencies.
The Leaky Integrate-and-Fire (LIF) model simplifies neuronal dynamics by focusing on sub-threshold membrane potential and recording spike occurrences upon reaching a threshold, but it does not model the action potential shape in detail. In contrast, the Hodgkin-Huxley model describes the detailed biophysical interactions of ion channels and their voltage-dependent kinetics, capturing the full shape and dynamics of action potentials. Consequently, LIF is more computationally efficient but less biologically detailed than Hodgkin-Huxley.
In the Hodgkin-Huxley model, changes in membrane potential influence the gating variables m, n, and h, which govern the probability of sodium and potassium channels being open. As the membrane potential becomes more positive, the variable m increases, representing more sodium channels opening. The variable h decreases, indicating inactivation of sodium channels. Meanwhile, an increase in n correlates with more potassium channels opening. These changes in gating variables, influenced by the membrane potential, regulate ion flow and thus the neuron’s action potential.
Gradient descent optimizes neural network weights by iteratively adjusting them in the direction of the negative gradient of the loss function. This process minimizes the error between the predicted and target outputs. By applying the gradient to update the weights, the algorithm seeks to find the minimum point of the loss function, improving the network’s accuracy over training iterations. This method is crucial for fine-tuning the model to closely match the data patterns.
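A minimal numpy sketch of this update rule, applied to the illustrative quadratic loss L(w) = ∥w − w⋆ ∥2 (the loss and the particular values below are hypothetical, chosen only to make the convergence visible):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step the weights opposite the gradient of the loss."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)       # move against the gradient direction
    return w
```

For L(w) = ∥w − w⋆ ∥2 , the gradient is 2(w − w⋆ ), so each step shrinks the distance to w⋆ by a constant factor and the iterates converge to the minimizer.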
The sodium-potassium pump maintains the resting membrane potential by exchanging 3 Na+ ions from inside the cell with 2 K+ ions from outside the cell. This active transport process contributes to a higher sodium concentration outside and a higher potassium concentration inside the cell. The net movement of positive charges out of the cell establishes a negative resting membrane potential, typically around -70 mV, crucial for neuron function.
The equilibrium voltage for different ion channels, such as VL for leakage, VNa for sodium, and VK for potassium, sets the potential toward which each ion drives the membrane. During an action potential, sodium influx depolarizes the membrane toward VNa, while potassium efflux drives it back toward VK. These equilibria create the conditions for the rapid depolarization and repolarization phases observed in action potentials, governing the direction and magnitude of ion flow.
Deep RNNs improve learning complex temporal dependencies by stacking multiple RNN layers, which allows hierarchical extraction of temporal features across layers. Each layer captures different levels of temporal dynamics, aiding in better representation learning. This multi-layer approach reduces the burden on a single hidden layer, enhancing the modeling of long-range dependencies and improving performance in tasks like language modeling and speech recognition by capturing nuanced temporal information more effectively.
In the membrane potential equation, the membrane capacitance (C) represents the ability of the cell membrane to store charge, impacting how quickly the membrane potential can change in response to input. The conductance (gL) refers to the membrane’s leakage conductance, indicating how readily ions leak through the membrane. Together, these parameters help describe the neuron’s electrical response to input currents, impacting the dynamics of action potential generation and propagation.