
Attention

Sequence-to-Sequence with RNNs


Input: Sequence x1, … xT
Output: Sequence y1, …, yT

Encoder: ht = fW(xt, ht-1)

h1 h2 h3 h4

x1 x2 x3 x4

we are eating bread

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT
Output: Sequence y1, … yT

From final hidden state predict:


Encoder: ht = fW(xt, ht-1) Initial decoder state s0 = hT
Context vector c (often c = hT)

h1 h2 h3 h4 s0

x1 x2 x3 x4 c

we are eating bread

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos

From final hidden state predict:


y1
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1

x1 x2 x3 x4 c y0

we are eating bread [START]

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo

From final hidden state predict:


y1 y2
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c y0 y1

we are eating bread [START] estamos

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo pan [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we are eating bread [START] estamos comiendo pan

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo pan [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we are eating bread


[START] estamos comiendo pan

Problem: Input sequence bottlenecked through fixed-sized vector. What if T=1000?
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo pan [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we are eating bread


[START] estamos comiendo pan

Problem: Input sequence bottlenecked through fixed-sized context vector. What if T=1000?

Idea: use new context vector at each step of decoder!
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs and Attention
Input: Sequence x1, … xT
Output: Sequence y1, … yT

From final hidden state:


Encoder: ht = fW(xt, ht-1) Initial decoder state s0

• The context vector is dynamically computed


h1 h2 h3 h4 s0
during each step of the decoding process.

x1 x2 x3 x4 • The context vector at each decoder timestep is


computed as a weighted sum of all the encoder
we are eating bread hidden states.

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Input: Sequence x1, … xT
Output: Sequence y1, … yT

The weights are determined by assessing how much each part of the input
sequence should contribute to the output (the attention mechanism)
at a particular step.

From final hidden state:
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
• A function computes a set of attention
h1 h2 h3 h4 s0 weights for each encoder hidden
state relative to the current decoder
state.

x1 x2 x3 x4 • This function can be as simple as a dot


product followed by a softmax or more
we are eating bread complex functions like those using
trainable parameters.
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Thus, a function fatt is used to compute how much each part of the
input sequence (hi) should contribute to the output at time step t of the
decoder, i.e. how well hi aligns with st-1, the hidden state of the decoder
at the previous time step.
From final hidden state:
e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we are eating bread

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores
et,i = fatt(st-1, hi) (fatt is an MLP)

From final hidden state:


e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we are eating bread

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
Normalize alignment scores
softmax
to get attention weights
From final hidden state: 0 < at,i < 1 ∑i at,i = 1
e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we are eating bread

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖ Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
estamos
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as linear
combination of hidden states
h1 h2 h3 h4 s0 + s1
ct = ∑i at,i hi

Use context vector in


x1 x2 x3 x4 c1 y0 decoder: st = gU(yt-1, st-1, ct)

we are eating bread This is all differentiable! Do not


[START]
supervise attention weights – backprop through everything
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
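As a rough sketch of one decoder step with this attention mechanism (assuming a toy GRU decoder cell for gU and a small MLP for fatt; the names and sizes here are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, S, E = 64, 64, 32   # encoder hidden size, decoder state size, output-embedding size (toy values)

f_att = nn.Sequential(                # alignment MLP: e_{t,i} = f_att(s_{t-1}, h_i)
    nn.Linear(H + S, 64), nn.Tanh(), nn.Linear(64, 1))
g_U = nn.GRUCell(input_size=E + H, hidden_size=S)   # stands in for the decoder update g_U

def attention_decoder_step(h, s_prev, y_prev_emb):
    """h: (T, H) encoder states; s_prev: (S,) previous decoder state; y_prev_emb: (E,)."""
    e = f_att(torch.cat([s_prev.expand(h.size(0), -1), h], dim=1)).squeeze(1)  # (T,) scores
    a = F.softmax(e, dim=0)                    # attention weights: 0 < a_{t,i} < 1, sum to 1
    c = (a.unsqueeze(1) * h).sum(dim=0)        # context vector c_t = sum_i a_{t,i} h_i
    s = g_U(torch.cat([y_prev_emb, c]).unsqueeze(0), s_prev.unsqueeze(0)).squeeze(0)
    return s, a                                # everything is differentiable; no attention labels

s1, a1 = attention_decoder_step(torch.randn(4, H), torch.randn(S), torch.randn(E))
```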
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖ Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
estamos
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as linear
combination of hidden states
h1 h2 h3 h4 s0 + s1
ct = ∑i at,i hi

Intuition: Context vector Use context vector in


x1 x2 x3 x4 attends to the relevant c1 y0 decoder: st = gU(yt-1, st-1, ct)
part of the input sequence
we are eating bread “estamos” = “we are”
This is all differentiable! Do not
so maybe a11=a12=0.45, [START]
supervise attention weights – backprop through everything
a13=a14=0.05
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖ Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
estamos
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as linear
combination of hidden states
h1 h2 h3 h4 s0 + s1
ct = ∑i at,i hi

Intuition: Context vector Use context vector in


x1 x2 x3 x4 attends to the relevant c1 y0 decoder: st = gU(yt-1, st-1, ct)
part of the input sequence
we are eating bread “estamos” = “we are”
i.e. there are no target attention weights;
so maybe a11=a12=0.45, [START] they are learned during training
a13=a14=0.05
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Repeat: Use s1 to compute
new context vector c2
✖ ✖ ✖ ✖
a21 a22 a23 a24
estamos
softmax
y1
e21 e22 e23 e24
+

h1 h2 h3 h4 s0 s1

x1 x2 x3 x4 c1 y0 c2

we are eating bread


[START]

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖
a21 a22 a23 a24
estamos comiendo
softmax
Repeat: Use s1 to
y1 y2
e21 e22 e23 e24 compute new context
vector c2
+
Use c2 to compute s2, y2
h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c1 y0 c2 y1

we are eating bread


[START] estamos

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖
a21 a22 a23 a24
estamos comiendo
softmax
Repeat: Use s1 to
y1 y2
e21 e22 e23 e24 compute new context
vector c2
+
Use c2 to compute s2, y2
h1 h2 h3 h4 s0 s1 s2

Intuition: Context vector


attends to the relevant
x1 x2 x3 x4 part of the input sequence c1 y0 c2 y1
“comiendo” = “eating”
we are eating bread
so maybe a21=a24=0.05, [START] estamos
a22=0.1, a23=0.8
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention

Use a different context vector in each timestep of decoder


- Input sequence not bottlenecked through single vector estamos comiendo pan [STOP]
- At each timestep of decoder, context vector “looks at”
different parts of the input sequence y1 y2 y3 y4

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c1 y0 c2 y1 c3 y2 c4 y3

we are eating bread


[START] estamos comiendo pan

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the


European Economic Area was
signed in August 1992.”

Output: “L’accord sur la zone


économique européenne a
été signé en août 1992.”

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the Diagonal attention means


words correspond in order
European Economic Area was
signed in August 1992.”

Output: “L’accord sur la zone


économique européenne a
été signé en août 1992.”
Diagonal attention means
words correspond in order
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation
Diagonal attention means
Input: “The agreement on the words correspond in order
European Economic Area was
signed in August 1992.” Attention figures out
different word orders
Output: “L’accord sur la zone
économique européenne a
été signé en août 1992.”
Diagonal attention means
words correspond in order
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the Diagonal attention means


words correspond in order
European Economic Area was
signed in August 1992.” Attention figures out
different word orders
Output: “L’accord sur la zone
Verb conjugation
économique européenne a
été signé en août 1992.”
Diagonal attention means
words correspond in order
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
The decoder doesn’t use the fact that
hi form an ordered sequence – it just
treats them as an unordered set {hi} estamos comiendo pan [STOP]

Can use similar architecture given any y1 y2 y3 y4


set of input hidden vectors {hi}!

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c1 y0 c2 y1 c3 y2 c4 y3

we are eating bread


[START] estamos comiendo pan

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Image Captioning with RNNs and Attention

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a


grid of features for an image

Cat image is free to use under the Pixabay License

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3

e1,2,1 e1,2,2 e1,2,3

e1,3,1 e1,3,2 e1,3,3

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a


grid of features for an image

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3 a1,1,1 a1,1,2 a1,1,3
softmax
at,:,: = softmax(et,:,:) e1,2,1 e1,2,2 e1,2,3 a1,2,1 a1,2,2 a1,2,3

e1,3,1 e1,3,2 e1,3,3 a1,3,1 a1,3,2 a1,3,3

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a


grid of features for an image

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3 a1,1,1 a1,1,2 a1,1,3
softmax
at,:,: = softmax(et,:,:) e1,2,1 e1,2,2 e1,2,3 a1,2,1 a1,2,2 a1,2,3

ct = ∑i,jat,i,jhi,j e1,3,1 e1,3,2 e1,3,3 a1,3,1 a1,3,2 a1,3,3

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a c1


grid of features for an image

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3 a1,1,1 a1,1,2 a1,1,3
softmax
at,:,: = softmax(et,:,:) e1,2,1 e1,2,2 e1,2,3 a1,2,1 a1,2,2 a1,2,3 cat

ct = ∑i,jat,i,jhi,j e1,3,1 e1,3,2 e1,3,3 a1,3,1 a1,3,2 a1,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention

et,i,j = fatt(st-1, hi,j)


at,:,: = softmax(et,:,:) cat

ct = ∑i,jat,i,jhi,j y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3

at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 cat

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3

softmax
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3 cat

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3

softmax
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3 cat

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0 c2


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3

softmax
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3 cat sitting

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1 y2

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1 s2
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0 c2 y1


grid of features for an image
[START] cat

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Each timestep of decoder
et,i,j = fatt(st-1, hi,j) uses a different context
at,:,: = softmax(et,:,:) vector that looks at different
cat sitting outside [STOP]

ct = ∑i,jat,i,jhi,j parts of the input image y1 y2 y3 y4

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1 s2 s3 s4
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0 c2 y1 c3 y2 c4 y3


grid of features for an image
[START] cat sitting outside

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
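A minimal sketch of one captioning step over the CNN feature grid (same recipe as the equations above; the dot-product fatt used here is just an illustrative choice and assumes the feature and state dimensions match):

```python
import torch
import torch.nn.functional as F

def spatial_attention_step(features, s_prev, f_att):
    """features: (H, W, D) grid h_{i,j}; s_prev: (S,) decoder state; f_att scores each position."""
    Hh, Ww, D = features.shape
    flat = features.reshape(Hh * Ww, D)              # treat the grid as a set of vectors
    e = f_att(s_prev, flat)                          # (H*W,) alignment scores e_{t,i,j}
    a = F.softmax(e, dim=0)                          # attention over all grid positions
    c = (a.unsqueeze(1) * flat).sum(dim=0)           # context c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    return c, a.reshape(Hh, Ww)                      # the (H, W) weight map can be visualized

f_att = lambda s, X: X @ s                           # illustrative dot-product scorer (D == S)
c1, a_map = spatial_attention_step(torch.randn(3, 3, 64), torch.randn(64), f_att)
```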
Image Captioning with RNNs and Attention

A bird flying over a body of water .

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
X, Attend, and Y
“Show, attend, and tell” (Xu et al, ICML 2015)
Look at image, attend to image regions, produce caption
“Ask, attend, and answer” (Xu and Saenko, ECCV 2016) / “Show, ask, attend, and answer” (Kazemi and Elqursh, 2017)
Read text of question, attend to image regions, produce answer
“Listen, attend, and spell” (Chan et al, ICASSP 2016)
Process raw audio, attend to audio regions while producing text
“Listen, attend, and walk” (Mei et al, AAAI 2016)
Process text, attend to text regions, output navigation commands
“Show, attend, and interact” (Qureshi et al, ICRA 2017)
Process image, attend to image regions, output robot control commands
“Show, attend, and read” (Li et al, AAAI 2019)
Process image, attend to image regions, output text
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

Inputs: ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1
Query vector: q (Shape: DQ)
h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DX)
CNN h2,1 h2,2 h2,3 s0 s1
Similarity function: fatt
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: e (Shape: NX) ei = fatt(q, Xi)
Attention weights: a = softmax(e) (Shape: NX)
Output vector: y = ∑iaiXi (Shape: DX)
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vector: q (Shape: DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) h2,1 h2,2 h2,3
CNN s0 s1
Similarity function: dot product
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: e (Shape: NX) ei = q · Xi Changes:
Attention weights: a = softmax(e) (Shape: NX) - Use dot product for similarity
Output vector: y = ∑iaiXi (Shape: DX)
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vector: q (Shape: DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) h2,1 h2,2 h2,3
CNN s0 s1
Similarity function: scaled dot product
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: e (Shape: NX) ei = q · Xi / sqrt(DQ) Changes:
Attention weights: a = softmax(e) (Shape: NX) - Use scaled dot product for similarity
Output vector: y = ∑iaiXi (Shape: DX)
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vector: q (Shape: DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) h2,1 h2,2 h2,3
CNN s0 s1
Similarity function: scaled dot product
h3,1 h3,2 h3,3
Large similarities will cause softmax to
saturate and give vanishing gradients c1 y0 c2
Recall a · b = |a||b| cos(angle)
Suppose that a and b are constant vectors of [START]

dimension D
Then |a| = (∑i a^2)^(1/2) = a·sqrt(D)
Computation:
Similarities: e (Shape: NX) ei = q · Xi / sqrt(DQ) Changes:
Attention weights: a = softmax(e) (Shape: NX) - Use scaled dot product for similarity
Output vector: y = ∑iaiXi (Shape: DX)
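A quick numeric illustration of this point (toy values): dot products of random D-dimensional vectors have magnitude on the order of sqrt(D), and without the 1/sqrt(D) factor the softmax tends to saturate into a near one-hot distribution.

```python
import torch

D = 512
q, X = torch.randn(D), torch.randn(8, D)
e = X @ q                                      # unscaled scores, typical magnitude ~ sqrt(D)
print(torch.softmax(e, dim=0))                 # usually close to one-hot: saturated softmax
print(torch.softmax(e / D ** 0.5, dim=0))      # scaled scores give a much softer distribution
```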
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vectors: Q (Shape: NQ x DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) CNN h2,1 h2,2 h2,3 s0 s1
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: E = QX^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Xj) / sqrt(DQ) Changes:
Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX) - Use scaled dot product for similarity
Output vectors: Y = AX (Shape: NQ x DX) Yi = ∑jAi,jXj - Multiple query vectors
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vectors: Q (Shape: NQ x DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DX) CNN h2,1 h2,2 h2,3 s0 s1
Key matrix: WK (Shape: DX x DQ) h3,1 h3,2 h3,3
Value matrix: WV (Shape: DX x DV)
c1 y0 c2

[START]

Computation:
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ) Changes:
Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX) - Use scaled dot product for similarity
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj - Multiple query vectors
- Separate key and value
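Putting the recipe above into code, a compact sketch of this attention layer (scaled dot product, multiple queries, separate keys and values; names are illustrative):

```python
import torch

def attention_layer(Q, X, W_K, W_V):
    """Q: (N_Q, D_Q) queries; X: (N_X, D_X) inputs; W_K: (D_X, D_Q); W_V: (D_X, D_V)."""
    K = X @ W_K                                # key vectors    (N_X, D_Q)
    V = X @ W_V                                # value vectors  (N_X, D_V)
    E = Q @ K.T / K.shape[1] ** 0.5            # similarities   (N_Q, N_X), scaled dot product
    A = torch.softmax(E, dim=1)                # attention weights; each row sums to 1
    return A @ V                               # output vectors (N_Q, D_V)

Y = attention_layer(torch.randn(4, 32), torch.randn(3, 64),
                    torch.randn(64, 32), torch.randn(64, 16))
```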
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

Computation: X1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
X3

Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

Computation: X1 K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
X3 K3

Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

ComputaGon: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
SimilariGes: E = QKT / 𝐷! (Shape: NQ x Ei,j = (Qi · Kj) / 𝐷!
X3 K3 E1,3 E2,3 E3,3 E4,3
N X)
AKenGon weights: A = soXmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ) A1,1 A2,1 A3,1 A4,1
Input vectors: X (Shape: NX x DX)
A1,2 A2,2 A3,2 A4,2
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
A1,3 A2,3 A3,3 A4,3

SoDmax( )

ComputaGon: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
SimilariGes: E = QKT / 𝐷! (Shape: NQ x Ei,j = (Qi · Kj) / 𝐷!
X3 K3 E1,3 E2,3 E3,3 E4,3
N X)
AKenGon weights: A = soXmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ) V1 A1,1 A2,1 A3,1 A4,1
Input vectors: X (Shape: NX x DX)
V2 A1,2 A2,2 A3,2 A4,2
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
V3 A1,3 A2,3 A3,3 A4,3

SoDmax( )

ComputaGon: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
SimilariGes: E = QKT / 𝐷! (Shape: NQ x Ei,j = (Qi · Kj) / 𝐷!
X3 K3 E1,3 E2,3 E3,3 E4,3
N X)
AKenGon weights: A = soXmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer Y1 Y2 Y3 Y4

Product( ), Sum( )

Inputs:
Query vectors: Q (Shape: NQ x DQ) V1 A1,1 A2,1 A3,1 A4,1
Input vectors: X (Shape: NX x DX)
V2 A1,2 A2,2 A3,2 A4,2
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
V3 A1,3 A2,3 A3,3 A4,3

SoDmax( )

Computation: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
X3 K3 E1,3 E2,3 E3,3
Attention weights: A = softmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Self-Attention Layer
One query per input vector
Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

Computation:
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)

Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ)

Computation:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
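A sketch of the self-attention layer as a module (one query per input vector, all three projections learned from X; a minimal illustration rather than a reference implementation):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)   # query matrix W_Q
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key matrix   W_K
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value matrix W_V

    def forward(self, X):                            # X: (N_X, D_X)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.T / Q.shape[1] ** 0.5              # (N_X, N_X) scaled dot-product similarities
        A = torch.softmax(E, dim=1)                  # attention weights
        return A @ V                                 # (N_X, D_V) outputs

Y = SelfAttention(64, 32, 32)(torch.randn(5, 64))
```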
Self-Attention Layer
One query per input vector

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ)
K3
ComputaGon: K2
Query vectors: Q = XWQ
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
Computation: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector
A1,3 A2,3 A3,3
Inputs: A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector
V3 A1,3 A2,3 A3,3
Inputs: V2 A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)
One query per input vector
V3 A1,3 A2,3 A3,3
Inputs: V2 A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)
Consider permuting
the input vectors:
Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
So#max(↑)
Query matrix: WQ (Shape: DX x DQ)

ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)

Consider permuting
the input vectors:
Inputs:
Input vectors: X (Shape: NX x DX) Queries and Keys will be
Key matrix: WK (Shape: DX x DQ) the same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2
ComputaGon: K1
Query vectors: Q = XWQ K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)
Consider permuting
the input vectors:
Inputs:
Input vectors: X (Shape: NX x DX) Similarities will be the
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
ComputaGon: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)

Consider permuting
A3,2 A1,2 A2,2
the input vectors:
Inputs: A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Attention weights will be A3,3 A1,3 A2,3
Key matrix: WK (Shape: DX x D Q) the same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
Computation: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
Similarities: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
Attention weights: A = softmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)

Consider permuting
V2 A3,2 A1,2 A2,2
the input vectors:
Inputs: V1 A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Values will be the V3 A3,3 A1,3 A2,3
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
ComputaGon: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y3 Y1 Y2
Product(→), Sum(↑)

Consider permuting
V2 A3,2 A1,2 A2,2
the input vectors:
Inputs: V1 A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Outputs will be the A3,3 A1,3 A2,3
V3
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
ComputaGon: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y3 Y1 Y2
Product(→), Sum(↑)

Consider permuting A3,2 A1,2 A2,2


the input vectors:
V2
Inputs: V1 A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Outputs will be the A3,3 A1,3 A2,3
V3
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) Softmax(↑)
Self-attention layer is
Query matrix: WQ (Shape: DX x DQ) Permutation Equivariant K2 E3,2 E1,2 E2,2
f(s(x)) = s(f(x))
Computation: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ Self-Attention layer works E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ) on sets of vectors
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
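Permutation equivariance is easy to check numerically with the SelfAttention sketch from earlier (assuming that module is in scope):

```python
import torch

layer = SelfAttention(64, 32, 32)          # the sketch module defined above
X = torch.randn(5, 64)
perm = torch.randperm(5)
# f(s(x)) == s(f(x)): permuting the inputs permutes the outputs the same way
assert torch.allclose(layer(X[perm]), layer(X)[perm], atol=1e-6)
```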
Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)

Self-attention doesn’t A1,3 A2,3 A3,3


“know” the order of the
V3
Inputs: vectors it is processing! V2 A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)
Self-attention doesn’t
“know” the order of the V3 A1,3 A2,3 A3,3
vectors it is processing! A1,2 A2,2 A3,2
Inputs: V2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ) In order to make
processing position-
Value matrix: WV (Shape: DX x DV) Softmax(↑)
aware, concatenate or
Query matrix: WQ (Shape: DX x DQ) add positional encoding
K3 E1,3 E2,3 E3,3
to the input
Computation: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E can be learned lookup E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ) table, or fixed function
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj E(1) E(2) E(3)
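One common choice of fixed function for the positional encoding E is the sinusoidal table from Vaswani et al.; a sketch of building it and adding it to the inputs (a learned nn.Embedding lookup is the other common option):

```python
import torch

def sinusoidal_positions(n, d):
    """Return an (n, d) table: sin on even dims, cos on odd dims, geometrically spaced frequencies."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)                 # (n, 1)
    freq = torch.exp(-torch.arange(0, d, 2).float() / d * torch.log(torch.tensor(10000.0)))
    table = torch.zeros(n, d)
    table[:, 0::2] = torch.sin(pos * freq)
    table[:, 1::2] = torch.cos(pos * freq)
    return table

X = torch.randn(5, 64)
X = X + sinusoidal_positions(5, 64)        # position-aware inputs for the self-attention layer
```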
Masked Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)
Don’t let vectors “look ahead” in the sequence
V3 0 0 A3,3

Inputs: V2 0 A2,2 A3,2


Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Softmax(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 -∞ -∞ E3,3
Computation: K2 -∞ E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
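A sketch of the masking step: similarities at positions j > i are set to −∞ before the softmax, so their attention weights become exactly zero and no vector can “look ahead”.

```python
import torch

def masked_self_attention(Q, K, V):
    """Q, K: (N, D_Q); V: (N, D_V). Causal mask: position i only attends to positions j <= i."""
    E = Q @ K.T / Q.shape[1] ** 0.5                          # (N, N) similarities
    lookahead = torch.triu(torch.ones_like(E, dtype=torch.bool), diagonal=1)
    E = E.masked_fill(lookahead, float('-inf'))              # block the "future" entries
    A = torch.softmax(E, dim=1)                              # masked entries get weight 0
    return A @ V
```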
Masked Self-Attention Layer Big cat [END]
Product(→), Sum(↑)
Don’t let vectors “look ahead” in the sequence
Used for language modeling (predict next word) V3 0 0 A3,3
Inputs: V2 0 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Softmax(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 -∞ -∞ E3,3
Computation: K2 -∞ E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) [START] Big cat
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
Computation:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX)
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1 X2 X3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
X1,1 X1,2 X1,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
X1,1 X2,1 X1,2 X2,2 X1,3 X2,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer
Run self-attention in parallel on each set of
input vectors (different weights per head)

Inputs:
Input vectors: X (Shape: NX x DX) Y1,1 Y2,1 Y3,1 Y1,2 Y2,2 Y3,2 Y1,3 Y2,3 Y3,3
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3

Query matrix: WQ (Shape: DX x DQ) Use H independent V3


V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2

“AUen6on Heads” in V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1

parallel K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3
K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2

ComputaGon: K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1

Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3

Query vectors: Q = XWQ X1 X2 X3 X1 X2 X3 X1 X2 X3

Key vectors: K = XWK (Shape: NX x DQ)


X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer
Y1,1 Y2,1 Y3,1
Y1,2 Y2,2 Y3,2
Y1,3 Y2,3 Y3,3
Inputs: Concat
Input vectors: X (Shape: NX x DX) Y1,1 Y2,1 Y3,1 Y1,2 Y2,2 Y3,2 Y1,3 Y2,3 Y3,3
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3

Query matrix: WQ (Shape: DX x DQ) Use H independent V3


V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2

“Attention Heads” in V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1

parallel K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3
K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2

ComputaGon: K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1

Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3

Query vectors: Q = XWQ X1 X2 X3 X1 X2 X3 X1 X2 X3

Key vectors: K = XWK (Shape: NX x DQ)


X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer
Y1 Y2 Y3
Linear projection
Y1,1 Y2,1 Y3,1
Y1,2 Y2,2 Y3,2
Y1,3 Y2,3 Y3,3
Inputs: Concat
Input vectors: X (Shape: NX x DX) Y1,1 Y2,1 Y3,1 Y1,2 Y2,2 Y3,2 Y1,3 Y2,3 Y3,3
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3

Query matrix: WQ (Shape: DX x DQ) Use H independent V3


V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2

“AUen6on Heads” in V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1

parallel K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3
K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2

Computation: K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1

Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3

Query vectors: Q = XWQ X1 X2 X3 X1 X2 X3 X1 X2 X3

Key vectors: K = XWK (Shape: NX x DQ)


X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Split
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
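A compact sketch of the split / attend / concat / linear-projection pattern drawn above (a single large projection split into H heads is equivalent to using independent per-head weight matrices; sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultiheadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model)            # linear projection after concat

    def forward(self, X):                                  # X: (N, d_model)
        N = X.shape[0]
        # Project and split the channel dimension into H heads: (H, N, d_head)
        Q = self.W_Q(X).view(N, self.h, self.d_head).transpose(0, 1)
        K = self.W_K(X).view(N, self.h, self.d_head).transpose(0, 1)
        V = self.W_V(X).view(N, self.h, self.d_head).transpose(0, 1)
        E = Q @ K.transpose(1, 2) / self.d_head ** 0.5     # (H, N, N) per-head similarities
        A = torch.softmax(E, dim=-1)                       # per-head attention weights
        Y = (A @ V).transpose(0, 1).reshape(N, self.h * self.d_head)  # concat the heads
        return self.proj(Y)

Y = MultiheadSelfAttention(d_model=64, n_heads=8)(torch.randn(5, 64))
```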
Example: CNN with Self-Attention

Input Image

CNN

Features:
Cat image is free to use under the Pixabay License CxHxW

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries:
C’ x H x W
Input Image 1x1 Conv

Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW
Values:
C’ x H x W
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries: Attention Weights


C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv


softmax
x
Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW
Values:
C’ x H x W
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries: Attention Weights


C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv


softmax
x
Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW C’ x H x W
Values:
C’ x H x W x
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries: Attention Weights


C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv


softmax
x
CxHxH
Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW C’ x H x W
Values:
C’ x H x W x 1x1 Conv
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Residual Connection
Queries: Attention Weights
C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv softmax


x
CxHxW
Keys:
CNN C’ x H x W +
Features: 1x1 Conv

Cat image is free to use under the Pixabay License CxHxW C’ x H x W


Values:
C’ x H x W x 1x1 Conv
1x1 Conv

Self-Attention Module
Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018
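A sketch of this self-attention module over CNN features (1x1 convolutions produce queries, keys, and values; attention runs over the H·W positions; a residual connection adds the result back to the input). For simplicity this sketch keeps the values at C channels and omits the extra output 1x1 convolution shown on the slide; channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, c, c_hidden):
        super().__init__()
        self.q = nn.Conv2d(c, c_hidden, kernel_size=1)     # queries: C' x H x W
        self.k = nn.Conv2d(c, c_hidden, kernel_size=1)     # keys:    C' x H x W
        self.v = nn.Conv2d(c, c, kernel_size=1)            # values kept at C channels here

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)           # (B, H*W, C')
        k = self.k(x).flatten(2)                           # (B, C', H*W)
        v = self.v(x).flatten(2).transpose(1, 2)           # (B, H*W, C)
        A = torch.softmax(q @ k, dim=-1)                   # (B, H*W, H*W) attention weights
        y = (A @ v).transpose(1, 2).reshape(B, C, H, W)    # re-assemble the spatial grid
        return x + y                                       # residual connection

out = SelfAttention2d(64, 8)(torch.randn(2, 64, 16, 16))
```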
Three Ways of Processing Sequences

Recurrent Neural Network

y1 y2 y3 y4

x1 x2 x3 x4

Works on Ordered Sequences


(+) Good at long sequences: After
one RNN layer, hT ”sees” the whole
sequence
(-) Not parallelizable: need to
compute hidden states sequentially
Three Ways of Processing Sequences

Recurrent Neural Network 1D Convolution

y1 y2 y3 y4 y1 y2 y3 y4

x1 x2 x3 x4 x1 x2 x3 x4

Works on Ordered Sequences Works on Multidimensional Grids


(+) Good at long sequences: After (-) Bad at long sequences: Need to
one RNN layer, hT ”sees” the whole stack many conv layers for outputs
sequence to “see” the whole sequence
(-) Not parallelizable: need to (+) Highly parallel: Each output can
compute hidden states sequentially be computed in parallel
Three Ways of Processing Sequences

Recurrent Neural Network 1D Convolution Self-Attention


Y1 Y2 Y3
Product(→), Sum(↑)

y1 y2 y3 y4 y1 y2 y3 y4 V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
A1,1 A2,1 A3,1
V1
Softmax(↑)

E1,3 E2,3 E3,3


K3
K2 E1,2 E2,2 E3,2
E1,1
K1 E2,1 E3,1

x1 x2 x3 x4 x1 x2 x3 x4 Q1

X1
Q2

X2
Q3

X3

Works on Ordered Sequences Works on Multidimensional Grids Works on Sets of Vectors


(+) Good at long sequences: After (-) Bad at long sequences: Need to (+) Good at long sequences: after one
one RNN layer, hT ”sees” the whole stack many conv layers for outputs self-attention layer, each output
sequence to “see” the whole sequence “sees” all inputs!
(-) Not parallelizable: need to (+) Highly parallel: Each output can (+) Highly parallel: Each output can
compute hidden states sequentially be computed in parallel be computed in parallel
(-) Very memory intensive
Three Ways of Processing Sequences

Recurrent Neural Network 1D Convolution Self-Attention


Y1 Y2 Y3
Product(→), Sum(↑)

y1 y2 y3 y4 y1 y2 y3 y4 V3 A1,3 A2,3 A3,3

Attention is all you need


V2 A1,2 A2,2 A3,2
V1 A1,1 A2,1 A3,1

Softmax(↑)

K3 E1,3 E2,3 E3,3


K2 E1,2 E2,2 E3,2
K1 E1,1 E2,1 E3,1

x1 x2 x3 x4 x1 x2 x3 x4 Q1
Vaswani et al, NeurIPS 2017

X1
Q2

X2
Q3

X3

Works on Ordered Sequences Works on Multidimensional Grids Works on Sets of Vectors


(+) Good at long sequences: After (-) Bad at long sequences: Need to (+) Good at long sequences: after one
one RNN layer, hT ”sees” the whole stack many conv layers for outputs self-attention layer, each output
sequence to “see” the whole sequence “sees” all inputs!
(-) Not parallelizable: need to (+) Highly parallel: Each output can (+) Highly parallel: Each output can
compute hidden states sequentially be computed in parallel be computed in parallel
(-) Very memory intensive
The Transformer

x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer

All vectors interact Self-Attention


with each other

x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer

Residual connection +
All vectors interact Self-Attention
with each other

x1 x2 x3 x4

Vaswani et al, “Attention is all you need”, NeurIPS 2017


The Transformer

Recall Layer Normalization:


Given h1, …, hN (Shape: D)
scale: 𝛾 (Shape: D)
shift: 𝛽 (Shape: D)
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
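A direct sketch of the layer-normalization formulas above (per-vector mean and standard deviation over the D features, then a learned scale and shift; the eps term is the usual numerical-stability addition):

```python
import torch

def layer_norm(h, gamma, beta, eps=1e-5):
    """h: (N, D); gamma, beta: (D,). Each vector h_i is normalized over its D features."""
    mu = h.mean(dim=1, keepdim=True)                                      # mu_i
    sigma = ((h - mu).pow(2).mean(dim=1, keepdim=True) + eps).sqrt()      # sigma_i
    z = (h - mu) / sigma                                                  # z_i
    return gamma * z + beta                                               # y_i

y = layer_norm(torch.randn(4, 64), torch.ones(64), torch.zeros(64))
```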
The Transformer

Recall Layer Normalization:


Given h1, …, hN (Shape: D)
scale: γ (Shape: D) MLP independently MLP MLP MLP MLP
shift: β (Shape: D) on each vector
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer

Recall Layer Normalization:


Given h1, …, hN (Shape: D) Residual connection +
scale: γ (Shape: D) MLP independently MLP MLP MLP MLP
shift: β (Shape: D) on each vector
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer
y1 y2 y3 y4

Layer Normalization
Recall Layer Normalization:
Given h1, …, hN (Shape: D) Residual connection +
scale: γ (Shape: D) MLP independently MLP MLP MLP MLP
shift: β (Shape: D) on each vector
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer
y1 y2 y3 y4

Transformer Block: Layer Normalization


Input: Set of vectors x +
Output: Set of vectors y
MLP MLP MLP MLP
Self-attention is the only
interaction between vectors!
Layer Normalization

Layer norm and MLP work +


independently per vector Self-Attention

Highly scalable, highly


parallelizable x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
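A sketch of one transformer block exactly as drawn on this slide (post-norm: residual add, then layer normalization; the MLP acts on each vector independently). It reuses the MultiheadSelfAttention sketch from earlier and is illustrative only.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiheadSelfAttention(d_model, n_heads)   # the earlier sketch module
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (N, d_model), a set of vectors
        x = self.norm1(x + self.attn(x))       # self-attention is the only interaction between vectors
        x = self.norm2(x + self.mlp(x))        # layer norm and MLP work independently per vector
        return x
```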
Post-Norm Transformer
y1 y2 y3 y4

Layer Normalization
+

MLP MLP MLP MLP

Layer normalization is Layer Normalization


after residual connections
+
Self-Attention

x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
Pre-Norm Transformer
y1 y2 y3 y4

MLP MLP MLP MLP

Layer Normalization
Layer normalization is +
inside residual connections
Self-Attention

Gives more stable training, Layer Normalization

commonly used in practice


x1 x2 x3 x4
Baevski & Auli, “Adaptive Input Representations for Neural Language Modeling”, arXiv 2018
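The pre-norm variant only changes where layer normalization sits; a sketch reusing the post-norm block's layers from the example above (again illustrative):

```python
class PreNormTransformerBlock(TransformerBlock):   # same layers as the post-norm sketch above
    def forward(self, x):
        x = x + self.attn(self.norm1(x))           # layer norm inside the residual branch
        x = x + self.mlp(self.norm2(x))            # the residual path itself stays an identity
        return x
```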
The Transformer Layer Normalization

MLP MLP MLP MLP

Transformer Block: Layer Normalization

Input: Set of vectors x Self-Attention

Output: Set of vectors y A Transformer is a sequence Layer Normalization

of transformer blocks +

Self-attention is the only MLP MLP MLP MLP

interaction between vectors! Vaswani et al:


Layer Normalization

+
Self-Attention
12 blocks, DQ=512, 6 heads
Layer norm and MLP work Layer Normalization

independently per vector +

MLP MLP MLP MLP

Layer Normalization
Highly scalable, highly +

parallelizable Self-Attention

Vaswani et al, “Attention is all you need”, NeurIPS 2017


The Transformer: Transfer Learning Layer Normalization

MLP MLP MLP MLP

Layer Normalization

“ImageNet Moment for Natural Language Processing” +


Self-Attention

Pretraining: Layer Normalization

Download a lot of text from the internet MLP MLP MLP MLP

Layer Normalization

+
Self-Attention
Train a giant Transformer model for language modeling
Layer Normalization

Finetuning: MLP MLP MLP MLP

Fine-tune the Transformer on your own NLP task Layer Normalization

+
Self-Attention

Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)

Vaswani et al, “Attention is all you need”, NeurIPS 2017


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB

Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)

Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019
Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB

Radford et al, "Language models are unsupervised multitask learners", 2019


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)

Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU

Microsoft, "Turing-NLG: A 17-billion parameter language model by Microsoft", 2020


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12,288 96 175B 694GB ?

Brown et al, "Language Models are Few-Shot Learners", arXiv 2020


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12,288 96 175B 694GB ?
Gopher 80 16,384 128 280B 10.55 TB 4096x TPUv3 (38 days)

Rae et al, “Scaling Language Models: Methods, Analysis, & Insights from Training Gopher”, arXiv 2021
Scaling up Transformers $3,768,320 on Google Cloud (eval price)
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12,288 96 175B 694GB ?
Gopher 80 16,384 128 280B 10.55 TB 4096x TPUv3 (38 days)

Rae et al, “Scaling Language Models: Methods, Analysis, & Insights from Training Gopher”, arXiv 2021
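As a rough sanity check on the Params column: in a standard Transformer, each block contributes roughly 12·width² parameters (4·width² for the query/key/value/output projections plus 8·width² for an MLP with hidden size 4·width), and the token embedding table adds about vocab_size·width more. The sketch below is an illustrative back-of-the-envelope estimate, not the exact accounting from any of the papers above; the vocabulary sizes are the commonly cited ones, and biases, layer norms, positional embeddings, and weight tying are ignored.

```python
# Back-of-the-envelope Transformer parameter count (illustrative only).
# Per block: ~4*d^2 (attention projections) + ~8*d^2 (MLP with hidden size 4*d).
# Plus a token-embedding table of vocab_size * d parameters.

def estimate_params(layers: int, width: int, vocab_size: int) -> int:
    """Approximate parameter count: transformer blocks + token embeddings."""
    per_block = 12 * width ** 2       # attention (4 d^2) + MLP (8 d^2)
    embeddings = vocab_size * width   # token embedding table
    return layers * per_block + embeddings

# Compare against a few rows of the table above (vocab sizes are assumptions):
for name, layers, width, vocab in [
    ("BERT-Base", 12, 768, 30_522),     # table: 110M
    ("GPT-2",     48, 1600, 50_257),    # table: 1.5B
    ("GPT-3",     96, 12_288, 50_257),  # table: 175B
]:
    n = estimate_params(layers, width, vocab)
    print(f"{name:10s} ~{n / 1e9:.2f}B parameters")
```

The estimates land within a few percent of the published counts, which is why "Layers" and "Width" alone are a good proxy for model size in the table.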
Generated Text from GPT-3
Bold text: Input prompt written by humans
Italics: Completion by GPT-3

Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one
that is expected to end in the creation of a new denomination, one that will be "theologically and socially
conservative," according to The Washington Post. The majority of delegates attending the church's annual
General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new
rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these
measures have a new plan: They say they will form a separate denomination by 2020, calling their church
the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million
members, was in the early 20th century the "largest Protestant denomination in the U.S.," but that it has
been shrinking in recent decades. The new split will be the second in the church's history. The first occurred
in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church.
The Post notes that the proposed split "comes at a critical time for the church, which has been losing
members for years," which has been "pushed toward the brink of a schism over the role of LGBTQ people in
the church." Gay marriage is not the only issue that has divided the church. In 2016, the denomination was
split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them
from serving as clergy, and the South Pacific regional conference voting to allow them.

Brown et al, "Language Models are Few-Shot Learners", arXiv 2020
GPT-3: Programming by prompt
Bold text: Input prompt written by humans (a few input / output examples, followed by a test example)
Italics: Completion by GPT-3

Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I’d appreciate it.
Good English output: Thank you for choosing me as your designer. I appreciate it.
Poor English input: The mentioned changes have done. or I did the alteration that you
requested. or I changed things you wanted and did the modifications.
Good English output: The requested changes have been made. or I made the alteration
that you requested. or I changed things you wanted and made the modifications.
Poor English input: I’d be more than happy to work with you in another project.
Good English output: I'd be more than happy to work with you on another project.

Brown et al, "Language Models are Few-Shot Learners", arXiv 2020
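"Programming by prompt" here just means concatenating a task description, a handful of input/output examples, and a new test input into a single string, then letting the language model continue the text; no gradient updates are involved. Below is a minimal sketch of that prompt construction under those assumptions; the helper name and formatting are illustrative, and GPT-3 itself only ever sees the final string.

```python
# Minimal sketch of few-shot prompt construction for "programming by prompt".
# The examples are concatenated into the context; whatever the model generates
# after the final "Good English output:" is taken as the answer.

def build_few_shot_prompt(examples, test_input):
    """Format (input, output) pairs plus a test input into one prompt string."""
    parts = []
    for bad, good in examples:
        parts.append(f"Poor English input: {bad}")
        parts.append(f"Good English output: {good}")
    parts.append(f"Poor English input: {test_input}")
    parts.append("Good English output:")   # the model completes from here
    return "\n".join(parts)

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for choosing me as your designer. I appreciate it."),
]
prompt = build_few_shot_prompt(
    examples,
    "I'd be more than happy to work with you in another project.")
print(prompt)
```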


GPT-3: Literary Style Transfer
https://www.gwern.net/GPT-3#successes

Topic: humor: literary parodies and homages, pastiches, style parodies.
Parodies of the fantasy novel series Harry Potter in the style of various famous authors:

By Ernest Hemingway:
"It was a cold day on Privet Drive. A child cried. Harry felt nothing. He was dryer than dust. He had been silent too long. He
had not felt love. He had scarcely felt hate. Yet the Dementor’s Kiss killed nothing. Death didn’t leave him less dead than he
had been a second before. It wasn’t about living or dying really. It wasn’t about death. It was about the Dark Lord, his creator,
his engenderer. He was not resurrected. He was created again. He was whole again in the courtyard of Malfoy Manor."
By Jane Austen:
“It is a truth universally acknowledged, that a broken Harry is in want of a book. This he knows to his cost; pressing that
wretched nose against the window of a bookstore in Diagon Alley the last thing before he goes into hiding for several years,
where upon, he goes straight to Azkaban.”
By Arthur Conan Doyle:
“Harry pushed at the swinging doors of the bookshop hard, and nearly knocked himself unconscious. He staggered in with his
ungainly package, his cheeks scarlet with cold and the shame of having chosen the wrong month to go Christmas shopping. The
proprietor of the store, however, didn’t cast even a cursory glance at him, being far more interested in an enormous hunk of
wood lying in the middle of the floor, which certainly looked like a gallows. Yes, the proprietor said to a reedy old man wearing
a bowler hat and a forlorn expression that made Harry want to kick him, I can rent you such a gallows for a small fee.”
Summary

Sequence-to-Sequence with RNNs: adding attention to RNN models lets them look at different parts of the input at each timestep.

Generalized self-attention is a new, powerful neural network primitive.

Transformers are a new neural network model that only uses attention.

[Figure: an attention-based RNN captioning model; the self-attention computation (queries Q, keys K, values V from inputs X; similarity scores E; softmax giving attention weights A; outputs as weighted sums of values); and a Transformer block (self-attention, layer normalization, MLP, residual connections).]

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
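To make "self-attention as a primitive" concrete, here is a minimal sketch of a single self-attention layer matching the labels in the figure above: queries Q, keys K, and values V are all computed from the same inputs X, similarity scores E are passed through a softmax to give attention weights A, and each output is a weighted sum of the values. The dimensions and random projection matrices are illustrative; multi-head attention, masking, and learned parameters are omitted.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer: queries, keys, and values all come from X.

    X:      (N, D_in) input vectors, one per position
    Wq, Wk: (D_in, D_k) query / key projections
    Wv:     (D_in, D_v) value projection
    Returns (N, D_v): one output vector per input position.
    """
    Q = X @ Wq                                    # queries            (N, D_k)
    K = X @ Wk                                    # keys               (N, D_k)
    V = X @ Wv                                    # values             (N, D_v)
    E = Q @ K.T / np.sqrt(K.shape[1])             # similarity scores  (N, N)
    A = np.exp(E - E.max(axis=1, keepdims=True))  # softmax over each row ...
    A = A / A.sum(axis=1, keepdims=True)          # ... gives attention weights
    return A @ V                                  # weighted sum of values

# Tiny example with 3 input vectors (X1, X2, X3 in the figure):
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
Y = self_attention(X, Wq, Wk, Wv)
print(Y.shape)   # (3, 8): one output Y1, Y2, Y3 per input
```

A Transformer block wraps this layer with residual connections, layer normalization, and a per-position MLP, and stacks many such blocks.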
