11 Transformers Notes
Herman Kamper
Attention recap
Self-attention
Positional encodings
Multi-head attention
Cross-attention
Transformer
Issues with RNNs
Architectural
Even with changes to deal with long-range dependencies (e.g. LSTM),
more recent observations inevitably have a bigger influence on the
current hidden state than those that are far away.
Computational
• Future RNN states can’t be computed before past hidden states
have been computed.
• We just can't get away from the "for loop" over time in the forward pass of an RNN (see the sketch below).
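To make this concrete, here is a minimal sketch (my own illustration, not from the notes) of a vanilla RNN forward pass in NumPy. The loop over time cannot be parallelised, since each hidden state depends on the previous one.

```python
# Minimal sketch (not from the notes): a vanilla RNN forward pass.
# The point is the unavoidable loop over time: h[t] needs h[t - 1].
import numpy as np

def rnn_forward(X, W_h, W_x, b):
    """X: (T, D) input sequence; W_h: (H, H); W_x: (H, D); b: (H,). Returns (T, H)."""
    T = X.shape[0]
    H = W_h.shape[0]
    h = np.zeros(H)
    states = np.zeros((T, H))
    for t in range(T):  # sequential: step t cannot start before step t - 1 is done
        h = np.tanh(W_h @ h + W_x @ X[t] + b)
        states[t] = h
    return states
```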
Attention doesn’t have these problems
[Figure: attention weights, computed with a softmax, for the example "he threw me".]
Intuition from the Google AI blog post:

[Figure: illustration from the Google AI blog post on the transformer.]
Attention recap
One way to think of attention intuitively is as a soft lookup table:
[Figure: a lookup table of keys and values next to its soft (attention-based) counterpart.]
Computational graph:
[Figure: computational graph of attention. A query q is scored against keys k1, ..., kN, the scores are normalised with a softmax to give weights α1, ..., αN, and the output c is the weighted sum of the values v1, ..., vN.]
Mathematically:

• Attention score: $a(\mathbf{q}, \mathbf{k}_n) \in \mathbb{R}$
• Attention weight: $\alpha_n = \dfrac{\exp\left(a(\mathbf{q}, \mathbf{k}_n)\right)}{\sum_{j=1}^{N} \exp\left(a(\mathbf{q}, \mathbf{k}_j)\right)}$
• Output: $\mathbf{c} = \sum_{n=1}^{N} \alpha_n \mathbf{v}_n$
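A minimal sketch of this soft lookup (my own illustration, with assumed shapes; the scaled dot-product score used here is only introduced as the specific choice later in these notes):

```python
# Minimal sketch: attention as a soft lookup over N key-value pairs.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """q: (D_k,) query; K: (N, D_k) keys; V: (N, D_v) values. Returns c: (D_v,)."""
    scores = K @ q / np.sqrt(K.shape[1])  # a(q, k_n), here a scaled dot product
    alpha = softmax(scores)               # attention weights, sum to one
    return alpha @ V                      # c = sum_n alpha_n v_n

# Usage: one query against N = 4 key-value pairs
rng = np.random.default_rng(0)
c = attend(rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=(4, 5)))
print(c.shape)  # (5,)
```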
Self-attention
[Figure: self-attention for output position 6. The query q6 is compared to the keys k1, ..., k6, and the output y6 is a weighted sum of the values v1, ..., v6, all computed from the inputs x1, ..., x6.]
[Figure: computational graph for the self-attention output $\mathbf{y}_6$: scores $a_{6,t}$ are computed from $\mathbf{q}_6$ and the keys, a softmax gives the weights $\alpha_{6,t}$, and the output is the weighted sum of the values.]

Layer input: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$
Layer output: $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_T$

Each output is a weighted sum of the values:
$$\mathbf{y}_i = \sum_{t=1}^{T} \alpha_{i,t} \mathbf{v}_t$$

The weights are a softmax over the scores:
$$\alpha_{i,t} = \frac{e^{a_{i,t}}}{\sum_{j=1}^{T} e^{a_{i,j}}}$$

The scores are scaled dot products between the query and the keys:
$$a_{i,t} = \frac{\mathbf{q}_i^\top \mathbf{k}_t}{\sqrt{D_k}}$$

The queries, keys and values are all linear projections of the layer inputs:
$$\mathbf{q}_t = \mathbf{W}_q^\top \mathbf{x}_t, \qquad \mathbf{k}_t = \mathbf{W}_k^\top \mathbf{x}_t, \qquad \mathbf{v}_t = \mathbf{W}_v^\top \mathbf{x}_t$$
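Translating these equations directly into code, a minimal sketch (my own, with assumed shapes) that computes each output position in turn:

```python
# Minimal sketch: self-attention computed position by position,
# directly following the equations above.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attention_loop(X, W_q, W_k, W_v):
    """X: (T, D); W_q, W_k: (D, D_k); W_v: (D, D_v). Returns Y: (T, D_v)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v  # q_t = W_q^T x_t, etc., for all t at once
    D_k = K.shape[1]
    Y = np.zeros((X.shape[0], V.shape[1]))
    for i in range(X.shape[0]):           # every output position attends over all inputs
        a = K @ Q[i] / np.sqrt(D_k)       # scores a_{i,t}
        alpha = softmax(a)                # weights alpha_{i,t}
        Y[i] = alpha @ V                  # y_i = sum_t alpha_{i,t} v_t
    return Y
```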
In matrix form
Each of the T queries needs to be compared to each of the T keys.
We can express this in a compact matrix form.
We stack the queries, keys and values as the rows of
$$\mathbf{Q} \in \mathbb{R}^{T \times D_k}, \qquad \mathbf{K} \in \mathbb{R}^{T \times D_k}, \qquad \mathbf{V} \in \mathbb{R}^{T \times D_v}$$
We can then write all the dot products and weighting in a short
condensed form:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{D_k}}\right) \mathbf{V}$$
You can figure out the shapes for the $\mathbf{W}$'s, e.g. $\mathbf{W}_k \in \mathbb{R}^{D \times D_k}$.
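The same computation as a minimal matrix-form sketch (my own, with assumed shapes; the rows of Q, K and V correspond to the T input positions):

```python
# Minimal sketch: scaled dot-product attention in matrix form,
# with a row-wise softmax replacing the per-position loop.
import numpy as np

def attention(Q, K, V):
    """Q, K: (T, D_k); V: (T, D_v). Returns (T, D_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (T, T) matrix of scores
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)     # softmax over each row
    return alpha @ V
```

Apart from the row-wise softmax, the whole layer is just two matrix multiplications, which is why it parallelises so much better than the RNN loop from earlier.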
Self-attention: A new computational block
A new block or layer, like an RNN or a CNN.
We can use this in both encoder and decoder modules, e.g. for machine translation:

[Figure: an encoder-decoder model for machine translation built from self-attention blocks.]
Positional encodings intuition
Positional encodings
In contrast to RNNs, there isn't any order information in the inputs to self-attention. We therefore give every position its own positional encoding
$$\mathbf{p}_t \in \mathbb{R}^D$$
There is a unique $\mathbf{p}_t$ for every input position, e.g. $\mathbf{p}_{10}$ will always be the same for all input sequences. The encoding is added to the input:¹
$$\tilde{\mathbf{x}}_t = \mathbf{x}_t + \mathbf{p}_t$$
¹ I like the idea of concatenation more than adding. But Benjamin van Niekerk pointed out to me that if you pass $\tilde{\mathbf{x}}_t$ through a single linear layer, then concatenation and addition are very similar: in both cases you end up with a new representation that is a weighted sum of the original input and the positional encoding (there are just additional weights specifically for the positional encoding when you concatenate).
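To make the footnote's argument concrete, here is the one-line version (my own rendering, splitting the linear layer's weights as $\mathbf{W} = \begin{bmatrix} \mathbf{W}_x \\ \mathbf{W}_p \end{bmatrix}$):
$$\mathbf{W}^\top \begin{bmatrix} \mathbf{x}_t \\ \mathbf{p}_t \end{bmatrix} = \mathbf{W}_x^\top \mathbf{x}_t + \mathbf{W}_p^\top \mathbf{p}_t, \qquad \text{while} \qquad \mathbf{W}^\top (\mathbf{x}_t + \mathbf{p}_t) = \mathbf{W}^\top \mathbf{x}_t + \mathbf{W}^\top \mathbf{p}_t,$$
so concatenation only differs from addition in having a separate set of weights for the positional encoding.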
Represent position using sinusoids
Let's use a single sinusoid as our $\mathbf{p}_t$:

[Plot: a single sinusoidal encoding feature (dimension d = 6) against position, for positions 0 to 60.]
In this case, we would have a unique positional feature value for inputs with lengths up to roughly T = 36, after which the feature value would repeat. This could be useful if relative position at this scale is more important than absolute position.
[Plot: two sinusoidal encoding features (dimensions d = 6 and d = 7) against position.]
Now we would have unique positional encodings for a longer range.
But the model could also just decide that relative position matters
more.
[Plot: four sinusoidal encoding features (dimensions d = 6, 7, 8 and 9) against position.]
In general we use a pair of sinusoids, a sine and a cosine, for each wavelength:
$$p_{t,2m} = \sin\!\left(\frac{t}{\lambda_m}\right), \qquad p_{t,2m+1} = \cos\!\left(\frac{t}{\lambda_m}\right), \qquad \text{where} \quad \lambda_m = 10\,000^{2m/D}$$
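A minimal sketch (my own) of these sinusoidal encodings, assuming the standard convention of pairing a sine and a cosine per wavelength and an even D:

```python
# Minimal sketch: sinusoidal positional encodings.
# Each row is p_t; the notes plot the transpose, i.e. the D x T matrix P.
import numpy as np

def positional_encodings(T, D):
    """Returns a (T, D) array with P[t, 2m] = sin(t / lambda_m), P[t, 2m + 1] = cos(t / lambda_m)."""
    assert D % 2 == 0, "assumes an even encoding dimension"
    P = np.zeros((T, D))
    t = np.arange(T)[:, None]                    # positions, shape (T, 1)
    lam = 10_000 ** (2 * np.arange(D // 2) / D)  # wavelength lambda_m for each sin/cos pair
    P[:, 0::2] = np.sin(t / lam)
    P[:, 1::2] = np.cos(t / lam)
    return P

P = positional_encodings(T=60, D=32)
print(P.shape)  # (60, 32)
```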
If we stack all these into $\mathbf{P} \in \mathbb{R}^{D \times T}$:

[Plot: heat map of the full positional encoding matrix, with encoding dimension on the vertical axis and position on the horizontal axis.]
There are formal reasons that this encodes relative position (Denk, 2019).² But intuitively you should be able to see that periodicity indicates that absolute position isn't necessarily important.

² For a fixed offset between two positional encodings, there is a linear transformation to take you from the one to the other. E.g. you can go from $\mathbf{p}_{10}$ to $\mathbf{p}_{15}$ using some linear transformation, and this will be the same transformation needed to go from $\mathbf{p}_{30}$ to $\mathbf{p}_{35}$.
The clock analogy for positional encodings³

³ Analogy from Benjamin van Niekerk.
Multi-head attention
Hypothetical example:
[Figure: a hypothetical example with two attention heads, [1] and [2], each with its own queries, keys and values over the same inputs x1, ..., x6.]
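A minimal sketch of multi-head attention (my own illustration, assuming the standard formulation where each head has its own projections and the head outputs are concatenated and projected back to dimension D):

```python
# Minimal sketch: multi-head self-attention with H heads.
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (T, D); W_q, W_k, W_v: lists of H matrices, each (D, D_h); W_o: (H * D_h, D)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[1])
        scores -= scores.max(axis=-1, keepdims=True)
        alpha = np.exp(scores)
        alpha /= alpha.sum(axis=-1, keepdims=True)
        heads.append(alpha @ V)                    # one head's output, (T, D_h)
    return np.concatenate(heads, axis=-1) @ W_o    # concatenate heads and project to (T, D)
```

The idea is that different heads can learn to attend to different aspects of the input.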
Masking the future in self-attention
If we have a network or decoder that needs to be causal, then we
should ensure that it can only attend to the past when making the
current prediction.
[Figure: masked self-attention for output y4: the query q4 attends only over positions 1 to 4, while positions 5 and 6 are masked out.]
Mathematically:
$$a_{i,t} = \begin{cases} \dfrac{\mathbf{q}_i^\top \mathbf{k}_t}{\sqrt{D_k}} & \text{if } t \leq i \\[2mm] -\infty & \text{if } t > i \end{cases}$$
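A minimal sketch (my own, with assumed shapes) of causal masking: set the scores for future positions to $-\infty$ before the softmax, so that their weights become zero.

```python
# Minimal sketch: masked (causal) self-attention.
import numpy as np

def masked_self_attention(Q, K, V):
    """Q, K: (T, D_k); V: (T, D_v). Position i only attends to positions t <= i."""
    T, D_k = Q.shape
    scores = Q @ K.T / np.sqrt(D_k)                     # (T, T) scores a_{i,t}
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where t > i
    scores = np.where(future, -np.inf, scores)          # mask out the future
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)                              # exp(-inf) = 0
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V
```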
Have a careful look at what happens in the Google transformer diagram for machine translation:

[Figure: the Google transformer diagram for machine translation, from the Google AI blog post.]
Cross-attention
[Figure: cross-attention while the decoder generates "he threw me </s>": the decoder's queries attend over the encoder's outputs.]
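A minimal sketch (my own, with assumed shapes): cross-attention is the same attention computation as before, but the queries come from the decoder states while the keys and values come from the encoder outputs.

```python
# Minimal sketch: cross-attention between decoder states and encoder outputs.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V

rng = np.random.default_rng(0)
enc = rng.normal(size=(8, 16))    # encoder outputs, T_enc = 8
dec = rng.normal(size=(3, 16))    # decoder states, T_dec = 3
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = attention(dec @ W_q, enc @ W_k, enc @ W_v)  # queries from decoder, keys/values from encoder
print(out.shape)  # (3, 16)
```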
Transformer
• Residual connections
• Layer normalisation
A sketch of how these fit into a single encoder block is given below.
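A minimal sketch (my own, assuming the standard post-norm arrangement) of how residual connections and layer normalisation wrap the self-attention and feed-forward sub-layers in one encoder block:

```python
# Minimal sketch: one transformer encoder block (post-norm variant).
import numpy as np

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    sigma = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sigma + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ V

def encoder_block(X, W_q, W_k, W_v, W_1, W_2):
    """X: (T, D); W_q, W_k, W_v: (D, D); W_1: (D, D_ff); W_2: (D_ff, D)."""
    X = layer_norm(X + self_attention(X, W_q, W_k, W_v))  # residual connection + layer norm
    ff = np.maximum(X @ W_1, 0) @ W_2                     # position-wise feed-forward (ReLU)
    return layer_norm(X + ff)                             # residual connection + layer norm
```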
Videos covered in this note
• Intuition behind self-attention (12 min)
• Attention recap (6 min)
• Self-attention details (13 min)
• Self-attention in matrix form (5 min)
• Positional encodings in transformers (19 min)
• The clock analogy for positional encodings (5 min)
• Multi-head attention (5 min)
• Masking the future in self-attention (5 min)
• Cross-attention (7 min)
• Transformer (4 min)
Acknowledgments
Christiaan Jacobs and Benjamin van Niekerk were instrumental in
helping me to start to understand self-attention and transformers.
Further reading
A. Goldie, “CS224N: Pretraining,” Stanford University, 2022.
References
T. Denk, “Linear relationships in the transformer’s positional encoding,”
2019.