
Attention

Sequence-to-Sequence with RNNs


Input: Sequence x1, … xT
Output: Sequence y1, …, yT

Encoder: ht = fW(xt, ht-1)

h1 h2 h3 h4

x1 x2 x3 x4

we are eating bread

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT
Output: Sequence y1, … yT

From final hidden state predict:


Encoder: ht = fW(xt, ht-1) Initial decoder state s0 = hT
Context vector c (often c = hT)

h1 h2 h3 h4 s0

x1 x2 x3 x4 c

we are eating bread

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos

From final hidden state predict:


y1
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1

x1 x2 x3 x4 c y0

we are eating bread [START]

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo

From final hidden state predict:


y1 y2
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c y0 y1

we are eating bread [START] estamos

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo pan [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we are eating bread [START] estamos comiendo pan

Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo pan [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we are eating bread


[START] estamos comiendo pan

Problem: Input sequence bottlenecked through fixed-sized vector. What if T=1000?
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs
Input: Sequence x1, … xT Decoder: st = gU(yt-1, st-1, c)
Output: Sequence y1, … yT
estamos comiendo pan [STOP]

From final hidden state predict:


y1 y2 y3 y4
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
Context vector c (often c=hT)

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c y0 y1 y2 y3

we are eating bread


[START] estamos comiendo pan

Problem: Input sequence bottlenecked through fixed-sized context vector. What if T=1000?

Idea: use new context vector at each step of decoder!
Sutskever et al, “Sequence to sequence learning with neural networks”, NeurIPS 2014
Sequence-to-Sequence with RNNs and Attention
Input: Sequence x1, … xT
Output: Sequence y1, … yT

From final hidden state:


Encoder: ht = fW(xt, ht-1) Initial decoder state s0

• The context vector is dynamically computed


h1 h2 h3 h4 s0
during each step of the decoding process.

x1 x2 x3 x4 • The context vector at each decoder timestep is


computed as a weighted sum of all the encoder
we are eating bread hidden states.

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Input: Sequence x1, … xT
Output: Sequence y1, … yT

The weights are determined by assessing how much each part of the input
sequence should contribute to the output (the attention mechanism)
at a particular step.

From final hidden state:
Encoder: ht = fW(xt, ht-1) Initial decoder state s0
• A function computes a set of attention
h1 h2 h3 h4 s0 weights for each encoder hidden
state relative to the current decoder
state.

x1 x2 x3 x4 • This function can be as simple as a dot


product followed by a softmax or more
we are eating bread complex functions like those using
trainable parameters.
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Thus, a function fatt is used to compute how much each part of the
input sequence (hi) should contribute to the output at time step t of the
decoder, i.e. how well hi aligns with st-1, the hidden state of the decoder
at the previous time step.
From final hidden state:
e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we are eating bread

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores
et,i = fatt(st-1, hi) (fatt is an MLP)

From final hidden state:


e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we are eating bread

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
Normalize alignment scores
softmax
to get attention weights
From final hidden state: 0 < at,i < 1 ∑i at,i = 1
e11 e12 e13 e14
Initial decoder state s0

h1 h2 h3 h4 s0

x1 x2 x3 x4

we are eating bread

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖ Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
estamos
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as linear
combination of hidden states
h1 h2 h3 h4 s0 + s1
ct = ∑i at,i hi

Use context vector in


x1 x2 x3 x4 c1 y0 decoder: st = gU(yt-1, st-1, ct)

we are eating bread This is all differentiable! Do not


[START]
supervise attention weights – backprop through everything
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
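As a rough sketch of one decoder step with this attention mechanism (assuming a toy GRU decoder cell for gU and a small MLP for fatt; the names and sizes here are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, S, E = 64, 64, 32   # encoder hidden size, decoder state size, output-embedding size (toy values)

f_att = nn.Sequential(                # alignment MLP: e_{t,i} = f_att(s_{t-1}, h_i)
    nn.Linear(H + S, 64), nn.Tanh(), nn.Linear(64, 1))
g_U = nn.GRUCell(input_size=E + H, hidden_size=S)   # stands in for the decoder update g_U

def attention_decoder_step(h, s_prev, y_prev_emb):
    """h: (T, H) encoder states; s_prev: (S,) previous decoder state; y_prev_emb: (E,)."""
    e = f_att(torch.cat([s_prev.expand(h.size(0), -1), h], dim=1)).squeeze(1)  # (T,) scores
    a = F.softmax(e, dim=0)                    # attention weights: 0 < a_{t,i} < 1, sum to 1
    c = (a.unsqueeze(1) * h).sum(dim=0)        # context vector c_t = sum_i a_{t,i} h_i
    s = g_U(torch.cat([y_prev_emb, c]).unsqueeze(0), s_prev.unsqueeze(0)).squeeze(0)
    return s, a                                # everything is differentiable; no attention labels

s1, a1 = attention_decoder_step(torch.randn(4, H), torch.randn(S), torch.randn(E))
```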
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖ Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
estamos
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as linear
combination of hidden states
h1 h2 h3 h4 s0 + s1
ct = ∑i at,i hi

Intuition: Context vector Use context vector in


x1 x2 x3 x4 attends to the relevant c1 y0 decoder: st = gU(yt-1, st-1, ct)
part of the input sequence
we are eating bread “estamos” = “we are”
This is all differentiable! Do not
so maybe a11=a12=0.45, [START]
supervise attention weights – backprop through everything
a13=a14=0.05
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖ Compute (scalar) alignment scores
a11 a12 a13 a14 et,i = fatt(st-1, hi) (fatt is an MLP)
estamos
Normalize alignment scores
softmax
to get attention weights
From final hidden state: y1
e11 e12 e13 e14 0 < at,i < 1 ∑iat,i = 1
Initial decoder state s0
Compute context vector as linear
combination of hidden states
h1 h2 h3 h4 s0 + s1
ct = ∑i at,i hi

Intuition: Context vector Use context vector in


x1 x2 x3 x4 attends to the relevant c1 y0 decoder: st = gU(yt-1, st-1, ct)
part of the input sequence
we are eating bread “estamos” = “we are”
i.e. there are no target attention weights;
so maybe a11=a12=0.45, [START] they are learned during training
a13=a14=0.05
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Repeat: Use s1 to compute
new context vector c2
✖ ✖ ✖ ✖
a21 a22 a23 a24
estamos
softmax
y1
e21 e22 e23 e24
+

h1 h2 h3 h4 s0 s1

x1 x2 x3 x4 c1 y0 c2

we are eating bread


[START]

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖
a21 a22 a23 a24
estamos comiendo
softmax
Repeat: Use s1 to
y1 y2
e21 e22 e23 e24 compute new context
vector c2
+
Use c2 to compute s2, y2
h1 h2 h3 h4 s0 s1 s2

x1 x2 x3 x4 c1 y0 c2 y1

we are eating bread


[START] estamos

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
✖ ✖ ✖ ✖
a21 a22 a23 a24
estamos comiendo
softmax
Repeat: Use s1 to
y1 y2
e21 e22 e23 e24 compute new context
vector c2
+
Use c2 to compute s2, y2
h1 h2 h3 h4 s0 s1 s2

Intuition: Context vector


attends to the relevant
x1 x2 x3 x4 part of the input sequence c1 y0 c2 y1
“comiendo” = “eating”
we are eating bread
so maybe a21=a24=0.05, [START] estamos
a22=0.1, a23=0.8
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention

Use a different context vector in each timestep of decoder


- Input sequence not bottlenecked through single vector estamos comiendo pan [STOP]
- At each timestep of decoder, context vector “looks at”
different parts of the input sequence y1 y2 y3 y4

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c1 y0 c2 y1 c3 y2 c4 y3

we are eating bread


[START] estamos comiendo pan

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the


European Economic Area was
signed in August 1992.”

Output: “L’accord sur la zone


économique européenne a
été signé en août 1992.”

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the Diagonal attention means


words correspond in order
European Economic Area was
signed in August 1992.”

Output: “L’accord sur la zone


économique européenne a
été signé en août 1992.”
Diagonal attention means
words correspond in order
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation
Diagonal attention means
Input: “The agreement on the words correspond in order
European Economic Area was
signed in August 1992.” Attention figures out
different word orders
Output: “L’accord sur la zone
économique européenne a
été signé en août 1992.”
Diagonal attention means
words correspond in order
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
Visualize attention weights at,i
Example: English to French
translation

Input: “The agreement on the Diagonal attention means


words correspond in order
European Economic Area was
signed in August 1992.” Attention figures out
different word orders
Output: “L’accord sur la zone
Verb conjugation
économique européenne a
été signé en août 1992.”
Diagonal attention means
words correspond in order
Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Sequence-to-Sequence with RNNs and Attention
The decoder doesn’t use the fact that
hi form an ordered sequence – it just
treats them as an unordered set {hi} estamos comiendo pan [STOP]

Can use similar architecture given any y1 y2 y3 y4


set of input hidden vectors {hi}!

h1 h2 h3 h4 s0 s1 s2 s3 s4

x1 x2 x3 x4 c1 y0 c2 y1 c3 y2 c4 y3

we are eating bread


[START] estamos comiendo pan

Bahdanau et al, “Neural machine translation by jointly learning to align and translate”, ICLR 2015
Image Captioning with RNNs and Attention

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a


grid of features for an image

Cat image is free to use under the Pixabay License

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3

e1,2,1 e1,2,2 e1,2,3

e1,3,1 e1,3,2 e1,3,3

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a


grid of features for an image

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3 a1,1,1 a1,1,2 a1,1,3
softmax
at,:,: = softmax(et,:,:) e1,2,1 e1,2,2 e1,2,3 a1,2,1 a1,2,2 a1,2,3

e1,3,1 e1,3,2 e1,3,3 a1,3,1 a1,3,2 a1,3,3

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a


grid of features for an image

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3 a1,1,1 a1,1,2 a1,1,3
softmax
at,:,: = softmax(et,:,:) e1,2,1 e1,2,2 e1,2,3 a1,2,1 a1,2,2 a1,2,3

ct = ∑i,jat,i,jhi,j e1,3,1 e1,3,2 e1,3,3 a1,3,1 a1,3,2 a1,3,3

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0
h3,1 h3,2 h3,3

Use a CNN to compute a c1


grid of features for an image

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e1,1,1 e1,1,2 e1,1,3 a1,1,1 a1,1,2 a1,1,3
softmax
at,:,: = softmax(et,:,:) e1,2,1 e1,2,2 e1,2,3 a1,2,1 a1,2,2 a1,2,3 cat

ct = ∑i,jat,i,jhi,j e1,3,1 e1,3,2 e1,3,3 a1,3,1 a1,3,2 a1,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention

et,i,j = fatt(st-1, hi,j)


at,:,: = softmax(et,:,:) cat

ct = ∑i,jat,i,jhi,j y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3

at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 cat

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3

softmax
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3 cat

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3

softmax
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3 cat

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0 c2


grid of features for an image
[START]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Alignment scores Attention weights
et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3

softmax
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3 cat sitting

ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1 y2

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1 s2
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0 c2 y1


grid of features for an image
[START] cat

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention
Each timestep of decoder
et,i,j = fatt(st-1, hi,j) uses a different context
at,:,: = softmax(et,:,:) vector that looks at different
cat sitting outside [STOP]

ct = ∑i,jat,i,jhi,j parts of the input image y1 y2 y3 y4

h1,1 h1,2 h1,3

h2,1 h2,2 h2,3


CNN s0 s1 s2 s3 s4
h3,1 h3,2 h3,3

Use a CNN to compute a c1 y0 c2 y1 c3 y2 c4 y3


grid of features for an image
[START] cat sitting outside

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
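A minimal sketch of one captioning step over the CNN feature grid (same recipe as the equations above; the dot-product fatt used here is just an illustrative choice and assumes the feature and state dimensions match):

```python
import torch
import torch.nn.functional as F

def spatial_attention_step(features, s_prev, f_att):
    """features: (H, W, D) grid h_{i,j}; s_prev: (S,) decoder state; f_att scores each position."""
    Hh, Ww, D = features.shape
    flat = features.reshape(Hh * Ww, D)              # treat the grid as a set of vectors
    e = f_att(s_prev, flat)                          # (H*W,) alignment scores e_{t,i,j}
    a = F.softmax(e, dim=0)                          # attention over all grid positions
    c = (a.unsqueeze(1) * flat).sum(dim=0)           # context c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    return c, a.reshape(Hh, Ww)                      # the (H, W) weight map can be visualized

f_att = lambda s, X: X @ s                           # illustrative dot-product scorer (D == S)
c1, a_map = spatial_attention_step(torch.randn(3, 3, 64), torch.randn(64), f_att)
```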
Image Captioning with RNNs and Attention

A bird flying over a body of water .

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Image Captioning with RNNs and Attention

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
X, Attend, and Y
“Show, attend, and tell” (Xu et al, ICML 2015)
Look at image, attend to image regions, produce caption
“Ask, attend, and answer” (Xu and Saenko, ECCV 2016) / “Show, ask, attend, and answer” (Kazemi and Elqursh, 2017)
Read text of question, attend to image regions, produce answer
“Listen, attend, and spell” (Chan et al, ICASSP 2016)
Process raw audio, attend to audio regions while producing text
“Listen, attend, and walk” (Mei et al, AAAI 2016)
Process text, attend to text regions, output navigation commands
“Show, attend, and interact” (Qureshi et al, ICRA 2017)
Process image, attend to image regions, output robot control commands
“Show, attend, and read” (Li et al, AAAI 2019)
Process image, attend to image regions, output text
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

Inputs: ct = ∑i,jat,i,jhi,j e2,3,1 e2,3,2 e2,3,3 a2,3,1 a2,3,2 a2,3,3


y1
Query vector: q (Shape: DQ)
h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DX)
CNN h2,1 h2,2 h2,3 s0 s1
Similarity function: fatt
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: e (Shape: NX) ei = fatt(q, Xi)
Attention weights: a = softmax(e) (Shape: NX)
Output vector: y = ∑iaiXi (Shape: DX)
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vector: q (Shape: DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) h2,1 h2,2 h2,3
CNN s0 s1
Similarity function: dot product
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: e (Shape: NX) ei = q · Xi Changes:
Attention weights: a = softmax(e) (Shape: NX) - Use dot product for similarity
Output vector: y = ∑iaiXi (Shape: DX)
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vector: q (Shape: DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) h2,1 h2,2 h2,3
CNN s0 s1
Similarity function: scaled dot product
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: e (Shape: NX) ei = q · Xi / sqrt(DQ) Changes:
Attention weights: a = softmax(e) (Shape: NX) - Use scaled dot product for similarity
Output vector: y = ∑iaiXi (Shape: DX)
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vector: q (Shape: DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) h2,1 h2,2 h2,3
CNN s0 s1
Similarity function: scaled dot product
h3,1 h3,2 h3,3
Large similarities will cause softmax to
saturate and give vanishing gradients c1 y0 c2
Recall a · b = |a||b| cos(angle)
Suppose that a and b are constant vectors of [START]

dimension D
Then |a| = (∑i a^2)^(1/2) = a·sqrt(D)
Computation:
Similarities: e (Shape: NX) ei = q · Xi / sqrt(DQ) Changes:
Attention weights: a = softmax(e) (Shape: NX) - Use scaled dot product for similarity
Output vector: y = ∑iaiXi (Shape: DX)
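A quick numeric illustration of this point (toy values): dot products of random D-dimensional vectors have magnitude on the order of sqrt(D), and without the 1/sqrt(D) factor the softmax tends to saturate into a near one-hot distribution.

```python
import torch

D = 512
q, X = torch.randn(D), torch.randn(8, D)
e = X @ q                                      # unscaled scores, typical magnitude ~ sqrt(D)
print(torch.softmax(e, dim=0))                 # usually close to one-hot: saturated softmax
print(torch.softmax(e / D ** 0.5, dim=0))      # scaled scores give a much softer distribution
```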
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vectors: Q (Shape: NQ x DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DQ) CNN h2,1 h2,2 h2,3 s0 s1
h3,1 h3,2 h3,3

c1 y0 c2

[START]

Computation:
Similarities: E = QX^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Xj) / sqrt(DQ) Changes:
Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX) - Use scaled dot product for similarity
Output vectors: Y = AX (Shape: NQ x DX) Yi = ∑jAi,jXj - Multiple query vectors
Attention Layer Alignment scores Attention weights

et,i,j = fatt(st-1, hi,j) e2,1,1 e2,1,2 e2,1,3 a2,1,1 a2,1,2 a2,1,3


softmax
seagull
at,:,: = softmax(et,:,:) e2,2,1 e2,2,2 e2,2,3 a2,2,1 a2,2,2 a2,2,3

e2,3,1 e2,3,2 e2,3,3


Inputs: ct = ∑i,jat,i,jhi,j a2,3,1 a2,3,2 a2,3,3
y1
Query vectors: Q (Shape: NQ x DQ) h1,1 h1,2 h1,3
Input vectors: X (Shape: NX x DX) CNN h2,1 h2,2 h2,3 s0 s1
Key matrix: WK (Shape: DX x DQ) h3,1 h3,2 h3,3
Value matrix: WV (Shape: DX x DV)
c1 y0 c2

[START]

Computation:
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ) Changes:
Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX) - Use scaled dot product for similarity
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj - Multiple query vectors
- Separate key and value
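Putting the recipe above into code, a compact sketch of this attention layer (scaled dot product, multiple queries, separate keys and values; names are illustrative):

```python
import torch

def attention_layer(Q, X, W_K, W_V):
    """Q: (N_Q, D_Q) queries; X: (N_X, D_X) inputs; W_K: (D_X, D_Q); W_V: (D_X, D_V)."""
    K = X @ W_K                                # key vectors    (N_X, D_Q)
    V = X @ W_V                                # value vectors  (N_X, D_V)
    E = Q @ K.T / K.shape[1] ** 0.5            # similarities   (N_Q, N_X), scaled dot product
    A = torch.softmax(E, dim=1)                # attention weights; each row sums to 1
    return A @ V                               # output vectors (N_Q, D_V)

Y = attention_layer(torch.randn(4, 32), torch.randn(3, 64),
                    torch.randn(64, 32), torch.randn(64, 16))
```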
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

Computation: X1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
X3

Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

Computation: X1 K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
X3 K3

Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

ComputaGon: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
SimilariGes: E = QKT / 𝐷! (Shape: NQ x Ei,j = (Qi · Kj) / 𝐷!
X3 K3 E1,3 E2,3 E3,3 E4,3
N X)
AKenGon weights: A = soXmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ) A1,1 A2,1 A3,1 A4,1
Input vectors: X (Shape: NX x DX)
A1,2 A2,2 A3,2 A4,2
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
A1,3 A2,3 A3,3 A4,3

SoDmax( )

ComputaGon: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
SimilariGes: E = QKT / 𝐷! (Shape: NQ x Ei,j = (Qi · Kj) / 𝐷!
X3 K3 E1,3 E2,3 E3,3 E4,3
N X)
AKenGon weights: A = soXmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer

Inputs:
Query vectors: Q (Shape: NQ x DQ) V1 A1,1 A2,1 A3,1 A4,1
Input vectors: X (Shape: NX x DX)
V2 A1,2 A2,2 A3,2 A4,2
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
V3 A1,3 A2,3 A3,3 A4,3

SoDmax( )

ComputaGon: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
SimilariGes: E = QKT / 𝐷! (Shape: NQ x Ei,j = (Qi · Kj) / 𝐷!
X3 K3 E1,3 E2,3 E3,3 E4,3
N X)
AKenGon weights: A = soXmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Attention Layer Y1 Y2 Y3 Y4

Product( ), Sum( )

Inputs:
Query vectors: Q (Shape: NQ x DQ) V1 A1,1 A2,1 A3,1 A4,1
Input vectors: X (Shape: NX x DX)
V2 A1,2 A2,2 A3,2 A4,2
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
V3 A1,3 A2,3 A3,3 A4,3

SoDmax( )

Computation: X1 K1 E1,1 E2,1 E3,1 E4,1


Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV) X2 K2 E1,2 E2,2 E3,2 E4,2
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
X3 K3 E1,3 E2,3 E3,3
Attention weights: A = softmax(E, dim=1) (Shape: N Q x NX)
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Q1 Q2 Q3 Q4
Self-Attention Layer
One query per input vector
Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)

Computation:
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NQ x NX) Ei,j = (Qi · Kj) / sqrt(DQ)

Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NQ x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ)

Computation:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
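A sketch of the self-attention layer as a module (one query per input vector, all three projections learned from X; a minimal illustration rather than a reference implementation):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)   # query matrix W_Q
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key matrix   W_K
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value matrix W_V

    def forward(self, X):                            # X: (N_X, D_X)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.T / Q.shape[1] ** 0.5              # (N_X, N_X) scaled dot-product similarities
        A = torch.softmax(E, dim=1)                  # attention weights
        return A @ V                                 # (N_X, D_V) outputs

Y = SelfAttention(64, 32, 32)(torch.randn(5, 64))
```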
Self-Attention Layer
One query per input vector

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ)
K3
ComputaGon: K2
Query vectors: Q = XWQ
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
Computation: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector
A1,3 A2,3 A3,3
Inputs: A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
One query per input vector
V3 A1,3 A2,3 A3,3
Inputs: V2 A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)
One query per input vector
V3 A1,3 A2,3 A3,3
Inputs: V2 A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)
Consider permuting
the input vectors:
Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
So#max(↑)
Query matrix: WQ (Shape: DX x DQ)

ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)

Consider permuting
the input vectors:
Inputs:
Input vectors: X (Shape: NX x DX) Queries and Keys will be
Key matrix: WK (Shape: DX x DQ) the same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2
ComputaGon: K1
Query vectors: Q = XWQ K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)
Consider permuting
the input vectors:
Inputs:
Input vectors: X (Shape: NX x DX) Similarities will be the
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
ComputaGon: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)

Consider permuting
A3,2 A1,2 A2,2
the input vectors:
Inputs: A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Attention weights will be A3,3 A1,3 A2,3
Key matrix: WK (Shape: DX x D Q) the same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
Computation: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
Similarities: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
Attention weights: A = softmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer
Product(→), Sum(↑)

Consider permuting
V2 A3,2 A1,2 A2,2
the input vectors:
Inputs: V1 A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Values will be the V3 A3,3 A1,3 A2,3
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
ComputaGon: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y3 Y1 Y2
Product(→), Sum(↑)

Consider permuting
V2 A3,2 A1,2 A2,2
the input vectors:
Inputs: V1 A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Outputs will be the A3,3 A1,3 A2,3
V3
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K2 E3,2 E1,2 E2,2
ComputaGon: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y3 Y1 Y2
Product(→), Sum(↑)

Consider permuting A3,2 A1,2 A2,2


the input vectors:
V2
Inputs: V1 A3,1 A1,1 A2,1
Input vectors: X (Shape: NX x DX) Outputs will be the A3,3 A1,3 A2,3
V3
Key matrix: WK (Shape: DX x DQ) same, but permuted
Value matrix: WV (Shape: DX x DV) Softmax(↑)
Self-attention layer is
Query matrix: WQ (Shape: DX x DQ) Permutation Equivariant K2 E3,2 E1,2 E2,2
f(s(x)) = s(f(x))
Computation: K1 E3,1 E1,1 E2,1
Query vectors: Q = XWQ Self-Attention layer works E3,3 E1,3 E2,3
K3
Key vectors: K = XWK (Shape: NX x DQ) on sets of vectors
Value Vectors: V = XWV (Shape: NX x DV)
Q3 Q1 Q2
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X3 X1 X2
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
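Permutation equivariance is easy to check numerically with the SelfAttention sketch from earlier (assuming that module is in scope):

```python
import torch

layer = SelfAttention(64, 32, 32)          # the sketch module defined above
X = torch.randn(5, 64)
perm = torch.randperm(5)
# f(s(x)) == s(f(x)): permuting the inputs permutes the outputs the same way
assert torch.allclose(layer(X[perm]), layer(X)[perm], atol=1e-6)
```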
Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)

Self-attention doesn’t A1,3 A2,3 A3,3


“know” the order of the
V3
Inputs: vectors it is processing! V2 A1,2 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) So#max(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 E1,3 E2,3 E3,3
ComputaGon: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
SimilariGes: E = QKT / 𝐷! (Shape: N X x N X) E i,j = (Qi · Kj ) / 𝐷!
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)
Self-attention doesn’t
“know” the order of the V3 A1,3 A2,3 A3,3
vectors it is processing! A1,2 A2,2 A3,2
Inputs: V2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ) In order to make
processing position-
Value matrix: WV (Shape: DX x DV) Softmax(↑)
aware, concatenate or
Query matrix: WQ (Shape: DX x DQ) add positional encoding
K3 E1,3 E2,3 E3,3
to the input
Computation: K2 E1,2 E2,2 E3,2
Query vectors: Q = XWQ E can be learned lookup E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ) table, or fixed function
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj E(1) E(2) E(3)
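One common choice of fixed function for the positional encoding E is the sinusoidal table from Vaswani et al.; a sketch of building it and adding it to the inputs (a learned nn.Embedding lookup is the other common option):

```python
import torch

def sinusoidal_positions(n, d):
    """Return an (n, d) table: sin on even dims, cos on odd dims, geometrically spaced frequencies."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)                 # (n, 1)
    freq = torch.exp(-torch.arange(0, d, 2).float() / d * torch.log(torch.tensor(10000.0)))
    table = torch.zeros(n, d)
    table[:, 0::2] = torch.sin(pos * freq)
    table[:, 1::2] = torch.cos(pos * freq)
    return table

X = torch.randn(5, 64)
X = X + sinusoidal_positions(5, 64)        # position-aware inputs for the self-attention layer
```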
Masked Self-Attention Layer Y1 Y2 Y3
Product(→), Sum(↑)
Don’t let vectors “look ahead” in the sequence
V3 0 0 A3,3

Inputs: V2 0 A2,2 A3,2


Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Softmax(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 -∞ -∞ E3,3
Computation: K2 -∞ E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1 X2 X3
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
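A sketch of the masking step: similarities at positions j > i are set to −∞ before the softmax, so their attention weights become exactly zero and no vector can “look ahead”.

```python
import torch

def masked_self_attention(Q, K, V):
    """Q, K: (N, D_Q); V: (N, D_V). Causal mask: position i only attends to positions j <= i."""
    E = Q @ K.T / Q.shape[1] ** 0.5                          # (N, N) similarities
    lookahead = torch.triu(torch.ones_like(E, dtype=torch.bool), diagonal=1)
    E = E.masked_fill(lookahead, float('-inf'))              # block the "future" entries
    A = torch.softmax(E, dim=1)                              # masked entries get weight 0
    return A @ V
```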
Masked Self-Attention Layer Big cat [END]
Product(→), Sum(↑)
Don’t let vectors “look ahead” in the sequence
Used for language modeling (predict next word) V3 0 0 A3,3
Inputs: V2 0 A2,2 A3,2
Input vectors: X (Shape: NX x DX) V1 A1,1 A2,1 A3,1
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Softmax(↑)
Query matrix: WQ (Shape: DX x DQ)
K3 -∞ -∞ E3,3
Computation: K2 -∞ E2,2 E3,2
Query vectors: Q = XWQ E1,1 E2,1 E3,1
K1
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Q1 Q2 Q3
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) [START] Big cat
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
Computation:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX)
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1 X2 X3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
X1,1 X1,2 X1,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
X1,1 X2,1 X1,2 X2,2 X1,3 X2,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer

Inputs:
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Query matrix: WQ (Shape: DX x DQ) Use H independent
“Attention Heads” in
parallel
ComputaGon:
Query vectors: Q = XWQ
Key vectors: K = XWK (Shape: NX x DQ)
X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer
Run self-attention in parallel on each set of
input vectors (different weights per head)

Inputs:
Input vectors: X (Shape: NX x DX) Y1,1 Y2,1 Y3,1 Y1,2 Y2,2 Y3,2 Y1,3 Y2,3 Y3,3
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3

Query matrix: WQ (Shape: DX x DQ) Use H independent V3


V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2

“AUen6on Heads” in V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1

parallel K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3
K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2

ComputaGon: K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1

Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3

Query vectors: Q = XWQ X1 X2 X3 X1 X2 X3 X1 X2 X3

Key vectors: K = XWK (Shape: NX x DQ)


X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer
Y1,1 Y2,1 Y3,1
Y1,2 Y2,2 Y3,2
Y1,3 Y2,3 Y3,3
Inputs: Concat
Input vectors: X (Shape: NX x DX) Y1,1 Y2,1 Y3,1 Y1,2 Y2,2 Y3,2 Y1,3 Y2,3 Y3,3
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3

Query matrix: WQ (Shape: DX x DQ) Use H independent V3


V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2

“Attention Heads” in V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1

parallel K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3
K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2

ComputaGon: K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1

Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3

Query vectors: Q = XWQ X1 X2 X3 X1 X2 X3 X1 X2 X3

Key vectors: K = XWK (Shape: NX x DQ)


X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
SimilariGes: E = QKT / 𝐷! (Shape: NX x NX) Ei,j = (Qi · Kj ) / 𝐷!
Split
AKenGon weights: A = soXmax(E, dim=1) (Shape: N X x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
Multihead Self-Attention Layer
Y1 Y2 Y3
Linear projection
Y1,1 Y2,1 Y3,1
Y1,2 Y2,2 Y3,2
Y1,3 Y2,3 Y3,3
Inputs: Concat
Input vectors: X (Shape: NX x DX) Y1,1 Y2,1 Y3,1 Y1,2 Y2,2 Y3,2 Y1,3 Y2,3 Y3,3
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV) Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3 Y1 Y2
Product(→), Sum(↑)
Y3

Query matrix: WQ (Shape: DX x DQ) Use H independent V3


V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2

“AUen6on Heads” in V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1 V1 A1,1 A2,1

Softmax(↑)
A3,1

parallel K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3 K3 E1,3 E2,3 E3,3
K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2 K2 E1,2 E2,2 E3,2

Computation: K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1 K1 E1,1 E2,1 E3,1

Q1 Q2 Q3 Q1 Q2 Q3 Q1 Q2 Q3

Query vectors: Q = XWQ X1 X2 X3 X1 X2 X3 X1 X2 X3

Key vectors: K = XWK (Shape: NX x DQ)


X1,1 X2,1 X3,1 X1,2 X2,2 X3,2 X1,3 X2,3 X3,3
Value Vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / sqrt(DQ) (Shape: NX x NX) Ei,j = (Qi · Kj) / sqrt(DQ)
Split
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX) X1,1 X2,1 X3,1
Output vectors: Y = AV (Shape: NX x DV) Yi = ∑jAi,jVj X1,2 X2,2 X3,2
X1,3 X2,3 X3,3
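A compact sketch of the split / attend / concat / linear-projection pattern drawn above (a single large projection split into H heads is equivalent to using independent per-head weight matrices; sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultiheadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model)            # linear projection after concat

    def forward(self, X):                                  # X: (N, d_model)
        N = X.shape[0]
        # Project and split the channel dimension into H heads: (H, N, d_head)
        Q = self.W_Q(X).view(N, self.h, self.d_head).transpose(0, 1)
        K = self.W_K(X).view(N, self.h, self.d_head).transpose(0, 1)
        V = self.W_V(X).view(N, self.h, self.d_head).transpose(0, 1)
        E = Q @ K.transpose(1, 2) / self.d_head ** 0.5     # (H, N, N) per-head similarities
        A = torch.softmax(E, dim=-1)                       # per-head attention weights
        Y = (A @ V).transpose(0, 1).reshape(N, self.h * self.d_head)  # concat the heads
        return self.proj(Y)

Y = MultiheadSelfAttention(d_model=64, n_heads=8)(torch.randn(5, 64))
```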
Example: CNN with Self-Attention

Input Image

CNN

Features:
Cat image is free to use under the Pixabay License CxHxW

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries:
C’ x H x W
Input Image 1x1 Conv

Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW
Values:
C’ x H x W
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries: Attention Weights


C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv


softmax
x
Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW
Values:
C’ x H x W
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries: Attention Weights


C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv


softmax
x
Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW C’ x H x W
Values:
C’ x H x W x
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Queries: Attention Weights


C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv


softmax
x
CxHxH
Keys:
CNN C’ x H x W
1x1 Conv
Features:
Cat image is free to use under the Pixabay License CxHxW C’ x H x W
Values:
C’ x H x W x 1x1 Conv
1x1 Conv

Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018


Example: CNN with Self-Attention

Residual Connection
Queries: Attention Weights
C’ x H x W Transpose (H x W) x (H x W)

Input Image 1x1 Conv softmax


x
CxHxW
Keys:
CNN C’ x H x W +
Features: 1x1 Conv

Cat image is free to use under the Pixabay License CxHxW C’ x H x W


Values:
C’ x H x W x 1x1 Conv
1x1 Conv

Self-Attention Module
Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018
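A sketch of this self-attention module over CNN features (1x1 convolutions produce queries, keys, and values; attention runs over the H·W positions; a residual connection adds the result back to the input). For simplicity this sketch keeps the values at C channels and omits the extra output 1x1 convolution shown on the slide; channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, c, c_hidden):
        super().__init__()
        self.q = nn.Conv2d(c, c_hidden, kernel_size=1)     # queries: C' x H x W
        self.k = nn.Conv2d(c, c_hidden, kernel_size=1)     # keys:    C' x H x W
        self.v = nn.Conv2d(c, c, kernel_size=1)            # values kept at C channels here

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)           # (B, H*W, C')
        k = self.k(x).flatten(2)                           # (B, C', H*W)
        v = self.v(x).flatten(2).transpose(1, 2)           # (B, H*W, C)
        A = torch.softmax(q @ k, dim=-1)                   # (B, H*W, H*W) attention weights
        y = (A @ v).transpose(1, 2).reshape(B, C, H, W)    # re-assemble the spatial grid
        return x + y                                       # residual connection

out = SelfAttention2d(64, 8)(torch.randn(2, 64, 16, 16))
```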
Three Ways of Processing Sequences

Recurrent Neural Network

y1 y2 y3 y4

x1 x2 x3 x4

Works on Ordered Sequences


(+) Good at long sequences: After
one RNN layer, hT ”sees” the whole
sequence
(-) Not parallelizable: need to
compute hidden states sequentially
Three Ways of Processing Sequences

Recurrent Neural Network 1D Convolution

y1 y2 y3 y4 y1 y2 y3 y4

x1 x2 x3 x4 x1 x2 x3 x4

Works on Ordered Sequences Works on Multidimensional Grids


(+) Good at long sequences: After (-) Bad at long sequences: Need to
one RNN layer, hT ”sees” the whole stack many conv layers for outputs
sequence to “see” the whole sequence
(-) Not parallelizable: need to (+) Highly parallel: Each output can
compute hidden states sequentially be computed in parallel
Three Ways of Processing Sequences

Recurrent Neural Network 1D Convolution Self-Attention


Y1 Y2 Y3
Product(→), Sum(↑)

y1 y2 y3 y4 y1 y2 y3 y4 V3
V2
A1,3
A1,2
A2,3
A2,2
A3,3
A3,2
A1,1 A2,1 A3,1
V1
Softmax(↑)

E1,3 E2,3 E3,3


K3
K2 E1,2 E2,2 E3,2
E1,1
K1 E2,1 E3,1

x1 x2 x3 x4 x1 x2 x3 x4 Q1

X1
Q2

X2
Q3

X3

Works on Ordered Sequences Works on Multidimensional Grids Works on Sets of Vectors


(+) Good at long sequences: After (-) Bad at long sequences: Need to (+) Good at long sequences: after one
one RNN layer, hT ”sees” the whole stack many conv layers for outputs self-attention layer, each output
sequence to “see” the whole sequence “sees” all inputs!
(-) Not parallelizable: need to (+) Highly parallel: Each output can (+) Highly parallel: Each output can
compute hidden states sequentially be computed in parallel be computed in parallel
(-) Very memory intensive
Three Ways of Processing Sequences

Recurrent Neural Network 1D Convolution Self-Attention


Y1 Y2 Y3
Product(→), Sum(↑)

y1 y2 y3 y4 y1 y2 y3 y4 V3 A1,3 A2,3 A3,3

Attention is all you need


V2 A1,2 A2,2 A3,2
V1 A1,1 A2,1 A3,1

Softmax(↑)

K3 E1,3 E2,3 E3,3


K2 E1,2 E2,2 E3,2
K1 E1,1 E2,1 E3,1

x1 x2 x3 x4 x1 x2 x3 x4 Q1
Vaswani et al, NeurIPS 2017

X1
Q2

X2
Q3

X3

Works on Ordered Sequences Works on Multidimensional Grids Works on Sets of Vectors


(+) Good at long sequences: After (-) Bad at long sequences: Need to (+) Good at long sequences: after one
one RNN layer, hT ”sees” the whole stack many conv layers for outputs self-attention layer, each output
sequence to “see” the whole sequence “sees” all inputs!
(-) Not parallelizable: need to (+) Highly parallel: Each output can (+) Highly parallel: Each output can
compute hidden states sequentially be computed in parallel be computed in parallel
(-) Very memory intensive
The Transformer

x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer

All vectors interact Self-Attention


with each other

x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer

Residual connection +
All vectors interact Self-Attention
with each other

x1 x2 x3 x4

Vaswani et al, “Attention is all you need”, NeurIPS 2017


The Transformer

Recall Layer Normalization:


Given h1, …, hN (Shape: D)
scale: 𝛾 (Shape: D)
shift: 𝛽 (Shape: D)
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
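A direct sketch of the layer-normalization formulas above (per-vector mean and standard deviation over the D features, then a learned scale and shift; the eps term is the usual numerical-stability addition):

```python
import torch

def layer_norm(h, gamma, beta, eps=1e-5):
    """h: (N, D); gamma, beta: (D,). Each vector h_i is normalized over its D features."""
    mu = h.mean(dim=1, keepdim=True)                                      # mu_i
    sigma = ((h - mu).pow(2).mean(dim=1, keepdim=True) + eps).sqrt()      # sigma_i
    z = (h - mu) / sigma                                                  # z_i
    return gamma * z + beta                                               # y_i

y = layer_norm(torch.randn(4, 64), torch.ones(64), torch.zeros(64))
```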
The Transformer

Recall Layer Normalization:


Given h1, …, hN (Shape: D)
scale: γ (Shape: D) MLP independently MLP MLP MLP MLP
shift: β (Shape: D) on each vector
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer

Recall Layer Normalization:


Given h1, …, hN (Shape: D) Residual connection +
scale: γ (Shape: D) MLP independently MLP MLP MLP MLP
shift: β (Shape: D) on each vector
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer
y1 y2 y3 y4

Layer Normalization
Recall Layer Normalization:
Given h1, …, hN (Shape: D) Residual connection +
scale: γ (Shape: D) MLP independently MLP MLP MLP MLP
shift: β (Shape: D) on each vector
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)^2 / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β
Layer Normalization
Residual connection +
All vectors interact Self-Attention
with each other
Ba et al, 2016
x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer
y1 y2 y3 y4

Transformer Block: Layer Normalization


Input: Set of vectors x +
Output: Set of vectors y
MLP MLP MLP MLP
Self-attention is the only
interaction between vectors!
Layer Normalization

Layer norm and MLP work +


independently per vector Self-Attention

Highly scalable, highly


parallelizable x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
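A sketch of one transformer block exactly as drawn on this slide (post-norm: residual add, then layer normalization; the MLP acts on each vector independently). It reuses the MultiheadSelfAttention sketch from earlier and is illustrative only.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiheadSelfAttention(d_model, n_heads)   # the earlier sketch module
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (N, d_model), a set of vectors
        x = self.norm1(x + self.attn(x))       # self-attention is the only interaction between vectors
        x = self.norm2(x + self.mlp(x))        # layer norm and MLP work independently per vector
        return x
```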
Post-Norm Transformer
y1 y2 y3 y4

Layer Normalization
+

MLP MLP MLP MLP

Layer normalization is Layer Normalization


after residual connections
+
Self-Attention

x1 x2 x3 x4
Vaswani et al, “Attention is all you need”, NeurIPS 2017
Pre-Norm Transformer
y1 y2 y3 y4

MLP MLP MLP MLP

Layer Normalization
Layer normalization is +
inside residual connections
Self-Attention

Gives more stable training, Layer Normalization

commonly used in practice


x1 x2 x3 x4
Baevski & Auli, “Adaptive Input Representations for Neural Language Modeling”, arXiv 2018
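The pre-norm variant only changes where layer normalization sits; a sketch reusing the post-norm block's layers from the example above (again illustrative):

```python
class PreNormTransformerBlock(TransformerBlock):   # same layers as the post-norm sketch above
    def forward(self, x):
        x = x + self.attn(self.norm1(x))           # layer norm inside the residual branch
        x = x + self.mlp(self.norm2(x))            # the residual path itself stays an identity
        return x
```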
The Transformer Layer Normalization

MLP MLP MLP MLP

Transformer Block: Layer Normalization

Input: Set of vectors x Self-Attention

Output: Set of vectors y A Transformer is a sequence Layer Normalization

of transformer blocks +

Self-attention is the only MLP MLP MLP MLP

interaction between vectors! Vaswani et al:


Layer Normalization

+
Self-Attention
12 blocks, DQ=512, 6 heads
Layer norm and MLP work Layer Normalization

independently per vector +

MLP MLP MLP MLP

Layer Normalization
Highly scalable, highly +

parallelizable Self-Attention

Vaswani et al, “Attention is all you need”, NeurIPS 2017


The Transformer: Transfer Learning Layer Normalization

MLP MLP MLP MLP

Layer Normalization

“ImageNet Moment for Natural Language Processing” +


Self-Attention

Pretraining: Layer Normalization

Download a lot of text from the internet MLP MLP MLP MLP

Layer Normalization

+
Self-Attention
Train a giant Transformer model for language modeling
Layer Normalization

Finetuning: MLP MLP MLP MLP

Fine-tune the Transformer on your own NLP task Layer Normalization

+
Self-Attention

Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)

Vaswani et al, “Attention is all you need”, NeurIPS 2017


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB

Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)

Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019
Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB

Radford et al, "Language models are unsupervised multitask learners", 2019


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)

Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019
Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU

Microsoft, "Turing-NLG: A 17-billion parameter language model by Microsoft", 2020


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12,288 96 175B 694GB ?

Brown et al, "Language Models are Few-Shot Learners", arXiv 2020


Scaling up Transformers
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12,288 96 175B 694GB ?
Gopher 80 16,384 128 280B 10.55 TB 4096x TPUv3 (38 days)

Rae et al, “Scaling Language Models: Methods, Analysis, & Insights from Training Gopher”, arXiv 2021
Scaling up Transformers $3,768,320 on Google Cloud (eval price)
Model Layers Width Heads Params Data Training
Transformer-Base 12 512 8 65M 8x P100 (12 hours)
Transformer-Large 12 1024 16 213M 8x P100 (3.5 days)
BERT-Base 12 768 12 110M 13 GB
BERT-Large 24 1024 16 340M 13 GB
XLNet-Large 24 1024 16 ~340M 126 GB 512x TPU-v3 (2.5 days)
RoBERTa 24 1024 16 355M 160 GB 1024x V100 GPU (1 day)
GPT-2 48 1600 ? 1.5B 40 GB
Megatron-LM 72 3072 32 8.3B 174 GB 512x V100 GPU (9 days)
Turing-NLG 78 4256 28 17B ? 256x V100 GPU
GPT-3 96 12,288 96 175B 694GB ?
Gopher 80 16,384 128 280B 10.55 TB 4096x TPUv3 (38 days)

Rae et al, “Scaling Language Models: Methods, Analysis, & Insights from Training Gopher”, arXiv 2021
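As a rough sanity check on the Params column: in a standard Transformer, each block contributes roughly 12·width² parameters (4·width² for the query/key/value/output projections plus 8·width² for an MLP with hidden size 4·width), and the token embedding table adds about vocab_size·width more. The sketch below is an illustrative back-of-the-envelope estimate, not the exact accounting from any of the papers above; the vocabulary sizes are the commonly cited ones, and biases, layer norms, positional embeddings, and weight tying are ignored.

```python
# Back-of-the-envelope Transformer parameter count (illustrative only).
# Per block: ~4*d^2 (attention projections) + ~8*d^2 (MLP with hidden size 4*d).
# Plus a token-embedding table of vocab_size * d parameters.

def estimate_params(layers: int, width: int, vocab_size: int) -> int:
    """Approximate parameter count: transformer blocks + token embeddings."""
    per_block = 12 * width ** 2       # attention (4 d^2) + MLP (8 d^2)
    embeddings = vocab_size * width   # token embedding table
    return layers * per_block + embeddings

# Compare against a few rows of the table above (vocab sizes are assumptions):
for name, layers, width, vocab in [
    ("BERT-Base", 12, 768, 30_522),     # table: 110M
    ("GPT-2",     48, 1600, 50_257),    # table: 1.5B
    ("GPT-3",     96, 12_288, 50_257),  # table: 175B
]:
    n = estimate_params(layers, width, vocab)
    print(f"{name:10s} ~{n / 1e9:.2f}B parameters")
```

The estimates land within a few percent of the published counts, which is why "Layers" and "Width" alone are a good proxy for model size in the table.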
Generated Text from GPT-3
Bold text: Input prompt written by humans
Italics: Completion by GPT-3

Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one
that is expected to end in the creation of a new denomination, one that will be "theologically and socially
conservative," according to The Washington Post. The majority of delegates attending the church's annual
General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new
rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these
measures have a new plan: They say they will form a separate denomination by 2020, calling their church
the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million
members, was in the early 20th century the "largest Protestant denomination in the U.S.," but that it has
been shrinking in recent decades. The new split will be the second in the church's history. The first occurred
in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church.
The Post notes that the proposed split "comes at a critical time for the church, which has been losing
members for years," which has been "pushed toward the brink of a schism over the role of LGBTQ people in
the church." Gay marriage is not the only issue that has divided the church. In 2016, the denomination was
split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them
from serving as clergy, and the South Pacific regional conference voting to allow them.

Brown et al, "Language Models are Few-Shot Learners", arXiv 2020
GPT-3: Programming by prompt
Bold text: Input prompt written by humans (a few input / output examples, followed by a test example)
Italics: Completion by GPT-3

Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I’d appreciate it.
Good English output: Thank you for choosing me as your designer. I appreciate it.
Poor English input: The mentioned changes have done. or I did the alteration that you
requested. or I changed things you wanted and did the modifications.
Good English output: The requested changes have been made. or I made the alteration
that you requested. or I changed things you wanted and made the modifications.
Poor English input: I’d be more than happy to work with you in another project.
Good English output: I'd be more than happy to work with you on another project.

Brown et al, "Language Models are Few-Shot Learners", arXiv 2020
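"Programming by prompt" here just means concatenating a task description, a handful of input/output examples, and a new test input into a single string, then letting the language model continue the text; no gradient updates are involved. Below is a minimal sketch of that prompt construction under those assumptions; the helper name and formatting are illustrative, and GPT-3 itself only ever sees the final string.

```python
# Minimal sketch of few-shot prompt construction for "programming by prompt".
# The examples are concatenated into the context; whatever the model generates
# after the final "Good English output:" is taken as the answer.

def build_few_shot_prompt(examples, test_input):
    """Format (input, output) pairs plus a test input into one prompt string."""
    parts = []
    for bad, good in examples:
        parts.append(f"Poor English input: {bad}")
        parts.append(f"Good English output: {good}")
    parts.append(f"Poor English input: {test_input}")
    parts.append("Good English output:")   # the model completes from here
    return "\n".join(parts)

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for choosing me as your designer. I appreciate it."),
]
prompt = build_few_shot_prompt(
    examples,
    "I'd be more than happy to work with you in another project.")
print(prompt)
```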


GPT-3: Literary Style Transfer
https://www.gwern.net/GPT-3#successes

Topic: humor: literary parodies and homages, pastiches, style parodies.
Parodies of the fantasy novel series Harry Potter in the style of various famous authors:

By Ernest Hemingway:
"It was a cold day on Privet Drive. A child cried. Harry felt nothing. He was dryer than dust. He had been silent too long. He
had not felt love. He had scarcely felt hate. Yet the Dementor’s Kiss killed nothing. Death didn’t leave him less dead than he
had been a second before. It wasn’t about living or dying really. It wasn’t about death. It was about the Dark Lord, his creator,
his engenderer. He was not resurrected. He was created again. He was whole again in the courtyard of Malfoy Manor."
By Jane Austen:
“It is a truth universally acknowledged, that a broken Harry is in want of a book. This he knows to his cost; pressing that
wretched nose against the window of a bookstore in Diagon Alley the last thing before he goes into hiding for several years,
where upon, he goes straight to Azkaban.”
By Arthur Conan Doyle:
“Harry pushed at the swinging doors of the bookshop hard, and nearly knocked himself unconscious. He staggered in with his
ungainly package, his cheeks scarlet with cold and the shame of having chosen the wrong month to go Christmas shopping. The
proprietor of the store, however, didn’t cast even a cursory glance at him, being far more interested in an enormous hunk of
wood lying in the middle of the floor, which certainly looked like a gallows. Yes, the proprietor said to a reedy old man wearing
a bowler hat and a forlorn expression that made Harry want to kick him, I can rent you such a gallows for a small fee.”
Summary

Sequence-to-Sequence with RNNs: adding attention to RNN models lets them look at different parts of the input at each timestep.

Generalized self-attention is a new, powerful neural network primitive.

Transformers are a new neural network model that only uses attention.

[Figure: an attention-based RNN captioning model; the self-attention computation (queries Q, keys K, values V from inputs X; similarity scores E; softmax giving attention weights A; outputs as weighted sums of values); and a Transformer block (self-attention, layer normalization, MLP, residual connections).]

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
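To make "self-attention as a primitive" concrete, here is a minimal sketch of a single self-attention layer matching the labels in the figure above: queries Q, keys K, and values V are all computed from the same inputs X, similarity scores E are passed through a softmax to give attention weights A, and each output is a weighted sum of the values. The dimensions and random projection matrices are illustrative; multi-head attention, masking, and learned parameters are omitted.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer: queries, keys, and values all come from X.

    X:      (N, D_in) input vectors, one per position
    Wq, Wk: (D_in, D_k) query / key projections
    Wv:     (D_in, D_v) value projection
    Returns (N, D_v): one output vector per input position.
    """
    Q = X @ Wq                                    # queries            (N, D_k)
    K = X @ Wk                                    # keys               (N, D_k)
    V = X @ Wv                                    # values             (N, D_v)
    E = Q @ K.T / np.sqrt(K.shape[1])             # similarity scores  (N, N)
    A = np.exp(E - E.max(axis=1, keepdims=True))  # softmax over each row ...
    A = A / A.sum(axis=1, keepdims=True)          # ... gives attention weights
    return A @ V                                  # weighted sum of values

# Tiny example with 3 input vectors (X1, X2, X3 in the figure):
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
Y = self_attention(X, Wq, Wk, Wv)
print(Y.shape)   # (3, 8): one output Y1, Y2, Y3 per input
```

A Transformer block wraps this layer with residual connections, layer normalization, and a per-position MLP, and stacks many such blocks.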
