Attention Deep Learning
Sequence-to-Sequence with RNNs
Input: Sequence x1, …, xT
Output: Sequence y1, …, yT'
Encoder: ht = fW(xt, ht-1)
From the final hidden state, predict the initial decoder state s0 and the context vector c (often c = hT).
Decoder: st = gU(yt-1, st-1, c)
At every step the decoder consumes the previous output yt-1, its previous state st-1, and the same fixed context vector c, emitting tokens until [STOP] (e.g. "estamos comiendo pan [STOP]").
[Figure: encoder states h1…h4 over inputs x1…x4; decoder states s0…s4 read y0…y3 and the context c, and output y1…y4]
Sutskever et al, "Sequence to sequence learning with neural networks", NeurIPS 2014
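As a concrete illustration, here is a minimal PyTorch sketch of this loop. The sizes, the choice of vanilla RNN cells for fW and gU, and reusing the final hidden state as both c and s0 are assumptions for the sketch, not details from the paper.

import torch
import torch.nn as nn

D = 64                                   # hidden size (illustrative)
f_W = nn.RNNCell(D, D)                   # encoder cell: h_t = f_W(x_t, h_{t-1})
g_U = nn.RNNCell(2 * D, D)               # decoder cell; its input is [y_{t-1}; c]

x = torch.randn(4, 1, D)                 # inputs x_1..x_4 (batch size 1)
h = torch.zeros(1, D)
for t in range(4):
    h = f_W(x[t], h)                     # encoder recurrence
c, s = h, h                              # context c and initial decoder state s_0

y_prev = torch.zeros(1, D)               # embedding of the [START] token y_0
outputs = []
for t in range(4):                       # e.g. "estamos comiendo pan [STOP]"
    s = g_U(torch.cat([y_prev, c], dim=1), s)  # every step sees the SAME c
    outputs.append(s)                    # a real model maps s_t to vocab logits
    y_prev = s

The key limitation this makes visible: the entire input sequence is bottlenecked through the single vector c.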
Sequence-to-Sequence with RNNs and Attention
Input: Sequence x1, …, xT
Output: Sequence y1, …, yT'
Encoder: ht = fW(xt, ht-1); from the final hidden state: initial decoder state s0.
Instead of a single fixed context vector, a new context vector is computed at every decoder step. The weights are determined by assessing how much each part of the input sequence should contribute to the output at a particular step (the attention mechanism):
• Compute (scalar) alignment scores: et,i = fatt(st-1, hi) (fatt is an MLP)
• Normalize the alignment scores with a softmax to get attention weights: 0 < at,i < 1, ∑i at,i = 1
• Compute the context vector as a linear combination of the encoder hidden states: ct = ∑i at,i hi
• Decoder: st = gU(yt-1, st-1, ct), which produces yt
Repeat: use s1 to compute a new context vector c2, then use c2 to compute s2 and y2, and so on until [STOP] (e.g. "estamos comiendo pan [STOP]"). One such step is sketched in code after the citation below.
[Figure: at each decoder step, attention weights at,1…at,4 over h1…h4 are multiplied into a weighted sum that forms ct]
Visualize the attention weights at,i. Example: English to French translation.
[Figure: attention-weight matrix showing which source words each translated word attends to]
Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
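A minimal sketch of one decoder step of this mechanism in PyTorch. Only the three equations above come from the slide; the two-layer tanh MLP for fatt and all sizes are assumptions.

import torch
import torch.nn as nn

D = 64
f_att = nn.Sequential(nn.Linear(2 * D, D), nn.Tanh(), nn.Linear(D, 1))  # assumed MLP

H = torch.randn(4, D)                      # encoder states h_1..h_4
s_prev = torch.randn(D)                    # previous decoder state s_{t-1}

# Alignment scores e_{t,i} = f_att(s_{t-1}, h_i): one scalar per encoder state
e = f_att(torch.cat([s_prev.expand(4, D), H], dim=1)).squeeze(1)
a = torch.softmax(e, dim=0)                # attention weights: 0 < a_{t,i} < 1, sum to 1
c = a @ H                                  # context vector c_t = sum_i a_{t,i} h_i
# c then enters the decoder update: s_t = g_U(y_{t-1}, s_{t-1}, c_t)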
Image Captioning with RNNs and Attention
A grid of image features hi,j (e.g. from a CNN) plays the role of the encoder states:
• Alignment scores: et,i,j = fatt(st-1, hi,j)
• Attention weights: at,:,: = softmax(et,:,:) (softmax over all grid positions)
• Context vector: ct = ∑i,j at,i,j hi,j
Each timestep of the decoder uses a different context vector that looks at different parts of the input image, producing e.g. "cat sitting outside [STOP]".
[Figure: attention maps highlighting different image regions for each generated word]
Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
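The same step over a spatial grid, sketched below. For brevity this scores positions with a dot product against st-1 instead of the paper's MLP fatt, and the grid size and feature dimension are made up.

import torch

D, Hgrid, Wgrid = 64, 3, 3
feats = torch.randn(Hgrid, Wgrid, D)       # grid of features h_{i,j} (e.g. from a CNN)
s_prev = torch.randn(D)                    # decoder state s_{t-1}

e = (feats * s_prev).sum(dim=-1)           # e_{t,i,j}: one score per grid position
a = torch.softmax(e.flatten(), dim=0).view(Hgrid, Wgrid)  # softmax over ALL positions
c = (a.unsqueeze(-1) * feats).sum(dim=(0, 1))  # c_t = sum_{i,j} a_{t,i,j} h_{i,j}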
X, Attend, and Y
"Show, attend, and tell" (Xu et al, ICML 2015)
Look at image, attend to image regions, produce caption
"Ask, attend, and answer" (Xu and Saenko, ECCV 2016); "Show, ask, attend, and answer" (Kazemi and Elqursh, 2017)
Read text of question, attend to image regions, produce answer
"Listen, attend, and spell" (Chan et al, ICASSP 2016)
Process raw audio, attend to audio regions while producing text
"Listen, attend, and walk" (Mei et al, AAAI 2016)
Process text, attend to text regions, output navigation commands
"Show, attend, and interact" (Qureshi et al, ICRA 2017)
Process image, attend to image regions, output robot control commands
"Show, attend, and read" (Li et al, AAAI 2019)
Process image, attend to image regions, output text
Attention Layer
Inputs:
Query vector: q (Shape: DQ)
Input vectors: X (Shape: NX x DX)
Similarity function: fatt
Computation:
Similarities: e (Shape: NX); ei = fatt(q, Xi)
Attention weights: a = softmax(e) (Shape: NX)
Output vector: y = ∑i ai Xi (Shape: DX)

Generalize this layer step by step:
• Change 1: use dot product for similarity: ei = q · Xi
• Change 2: use scaled dot product for similarity: ei = q · Xi / √DQ. Why scale? Large similarities saturate the softmax and give vanishing gradients. If a is a vector of dimension D whose entries all equal a constant a, then |a| = (∑i a²)^(1/2) = a√D, so dot products grow with dimension; dividing by √DQ keeps the scores in a reasonable range.
• Change 3: multiple query vectors Q (Shape: NQ x DQ). Similarities: E = QX^T / √DQ (Shape: NQ x NX), Ei,j = (Qi · Xj) / √DQ; Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX); Output vectors: Y = AX (Shape: NQ x DX), Yi = ∑j Ai,j Xj
• Change 4: separate key and value, via a key matrix WK (Shape: DX x DQ) and a value matrix WV (Shape: DX x DV)

This gives the full attention layer:
Inputs:
Query vectors: Q (Shape: NQ x DQ)
Input vectors: X (Shape: NX x DX)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Computation:
Key vectors: K = XWK (Shape: NX x DQ)
Value vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / √DQ (Shape: NQ x NX); Ei,j = (Qi · Kj) / √DQ
Attention weights: A = softmax(E, dim=1) (Shape: NQ x NX)
Output vectors: Y = AV (Shape: NQ x DV); Yi = ∑j Ai,j Vj
[Figure: dataflow for queries Q1…Q4 against inputs X1…X3: keys K1…K3 form the similarity grid E, a softmax over the key axis gives the attention weights A, and the values V1…V3 are summed with weights A to produce the outputs]
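The full layer above translates almost line for line into PyTorch; the specific sizes below are arbitrary.

import torch
import torch.nn as nn

N_X, N_Q, D_X, D_Q, D_V = 3, 4, 64, 32, 48
W_K = nn.Linear(D_X, D_Q, bias=False)      # key matrix W_K
W_V = nn.Linear(D_X, D_V, bias=False)      # value matrix W_V

X = torch.randn(N_X, D_X)                  # input vectors
Q = torch.randn(N_Q, D_Q)                  # query vectors

K = W_K(X)                                 # keys:   (N_X, D_Q)
V = W_V(X)                                 # values: (N_X, D_V)
E = Q @ K.T / D_Q ** 0.5                   # similarities: (N_Q, N_X)
A = torch.softmax(E, dim=1)                # attention weights; each row sums to 1
Y = A @ V                                  # outputs: (N_Q, D_V)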
Self-Attention Layer
One query per input vector.
Inputs:
Input vectors: X (Shape: NX x DX)
Query matrix: WQ (Shape: DX x DQ)
Key matrix: WK (Shape: DX x DQ)
Value matrix: WV (Shape: DX x DV)
Computation:
Query vectors: Q = XWQ (Shape: NX x DQ)
Key vectors: K = XWK (Shape: NX x DQ)
Value vectors: V = XWV (Shape: NX x DV)
Similarities: E = QK^T / √DQ (Shape: NX x NX); Ei,j = (Qi · Kj) / √DQ
Attention weights: A = softmax(E, dim=1) (Shape: NX x NX)
Output vectors: Y = AV (Shape: NX x DV); Yi = ∑j Ai,j Vj
[Figure: inputs X1…X3 produce Q1…Q3 and K1…K3; the similarity grid E, Softmax(↑) giving A, and Product(→), Sum(↑) over V1…V3 yield the outputs Y1…Y3]
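Packaged as a module, this is a short sketch containing nothing beyond the slide's equations:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)   # query matrix
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key matrix
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value matrix
        self.scale = d_q ** 0.5

    def forward(self, X):                  # X: (N_X, D_X)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.T / self.scale           # similarities: (N_X, N_X)
        A = torch.softmax(E, dim=1)        # attention weights
        return A @ V                       # outputs: (N_X, D_V)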
Consider permuting the input vectors of a self-attention layer (X1, X2, X3 → X3, X1, X2):
• Queries and keys will be the same, but permuted
• Similarities will be the same, but permuted
• Attention weights will be the same, but permuted
• Values will be the same, but permuted
• Outputs will be the same, but permuted
In other words, self-attention is permutation equivariant: f(s(X)) = s(f(X)). The layer operates on a set of vectors and has no built-in notion of order.
[Figure: the same dataflow as above with inputs reordered to X3, X1, X2; every intermediate grid is a permutation of the original, and the outputs come out as Y3, Y1, Y2]
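This property is easy to verify numerically with the SelfAttention sketch from above (assumed to be in scope):

import torch

torch.manual_seed(0)
layer = SelfAttention(d_x=64, d_q=32, d_v=48)   # module from the sketch above
X = torch.randn(3, 64)
perm = torch.tensor([2, 0, 1])                  # X1, X2, X3 -> X3, X1, X2

# Permuting inputs permutes outputs the same way: f(s(X)) = s(f(X))
print(torch.allclose(layer(X)[perm], layer(X[perm]), atol=1e-6))   # True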
Masked Self-Attention Layer
Don't let vectors "look ahead" in the sequence: set Ei,j = -∞ for j > i, so that the corresponding attention weights Ai,j become 0. Used for language modeling (predicting the next word).

Multihead Self-Attention Layer
Use H independent "attention heads" in parallel:
• Split each input vector into H chunks (e.g. for H = 3, X1 → X1,1, X1,2, X1,3)
• Run self-attention in parallel on each set of input vectors (different weights per head)
• Concatenate the head outputs (Y1,1 … Y3,3 → Y1, Y2, Y3) and apply an output projection
[Figure: three copies of the self-attention dataflow (queries, keys, similarity grid E, Softmax(↑), Product(→), Sum(↑) over values) running in parallel on the chunks, followed by Concat and the output projection]
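A sketch of the split / parallel-heads / concatenate / project pipeline, reusing the SelfAttention module defined earlier. Equal head sizes and a final linear projection follow common practice and the slide's Concat + projection step.

import torch
import torch.nn as nn

class MultiheadSelfAttention(nn.Module):
    def __init__(self, d_x, num_heads):
        super().__init__()
        assert d_x % num_heads == 0
        d_head = d_x // num_heads
        # H independent heads, each with its own W_Q, W_K, W_V
        self.heads = nn.ModuleList(
            [SelfAttention(d_head, d_head, d_head) for _ in range(num_heads)])
        self.proj = nn.Linear(d_x, d_x)     # output projection after Concat
        self.num_heads = num_heads

    def forward(self, X):                   # X: (N_X, D_X)
        chunks = X.chunk(self.num_heads, dim=1)   # split each vector into H chunks
        Y = torch.cat([h(c) for h, c in zip(self.heads, chunks)], dim=1)
        return self.proj(Y)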
Self-Attention Module for CNNs
Input image → CNN → Features: C x H x W. From the feature map, 1x1 convolutions compute:
Queries: C' x H x W (1x1 Conv)
Keys: C' x H x W (1x1 Conv)
Values: C' x H x W (1x1 Conv)
Transposing and multiplying queries and keys gives attention weights of shape (H x W) x (H x W), which form a weighted sum over the values; a residual connection adds the result back to the input features.
[Figure: the self-attention module applied to CNN features of a cat image; cat image is free to use under the Pixabay License]
Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018
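A sketch of this module. The 1x1-conv query/key branches, the (H x W) x (H x W) attention weights, and the residual connection follow the description above; keeping C channels for the values and skipping the extra output convolution that real SAGAN implementations include are simplifying assumptions.

import torch
import torch.nn as nn

class CNNSelfAttention(nn.Module):
    def __init__(self, c, c_prime):
        super().__init__()
        self.q = nn.Conv2d(c, c_prime, kernel_size=1)  # queries: C' x H x W
        self.k = nn.Conv2d(c, c_prime, kernel_size=1)  # keys:    C' x H x W
        self.v = nn.Conv2d(c, c, kernel_size=1)        # values (C channels here)

    def forward(self, x):                    # x: (N, C, H, W) CNN features
        n, c, h, w = x.shape
        q = self.q(x).flatten(2)             # (N, C', H*W)
        k = self.k(x).flatten(2)             # (N, C', H*W)
        v = self.v(x).flatten(2)             # (N, C,  H*W)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (N, HW, HW) weights
        out = (v @ attn.transpose(1, 2)).view(n, c, h, w)     # attend over positions
        return out + x                       # residual connection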
Three Ways of Processing Sequences
[Figure: three architectures that map inputs x1…x4 to outputs y1…y4: a recurrent network, a 1D convolution, and a self-attention layer]
The Transformer
A transformer block takes a set of vectors x1 … x4 and processes them through the following stages:
• Self-Attention: all vectors interact with each other
• Residual connection
• Layer Normalization
• MLP independently on each vector
• Residual connection
• Layer Normalization
The block outputs y1 … y4.

Recall Layer Normalization (Ba et al, 2016):
Given h1, …, hN (Shape: D), scale γ (Shape: D), shift β (Shape: D):
μi = (∑j hi,j) / D (scalar)
σi = (∑j (hi,j - μi)² / D)^(1/2) (scalar)
zi = (hi - μi) / σi
yi = γ * zi + β

Post-Norm Transformer: layer normalization is applied after the residual connections, as above. Pre-Norm Transformer: layer normalization is moved inside the residual connections of transformer blocks, which tends to make training more stable.

A Transformer is a sequence of transformer blocks (Vaswani et al: 12 blocks, DQ = 512, 6 heads):
• Self-attention is the only interaction between vectors
• Layer norm and MLP work independently per vector
• Highly scalable, highly parallelizable
Vaswani et al, "Attention is all you need", NeurIPS 2017

Pretraining and finetuning:
• Download a lot of text from the internet
• Train a giant Transformer model for language modeling
• Fine-tune the Transformer on your own NLP task
Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
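The block above in code, in post-norm order. This is a sketch: the 4x MLP expansion and ReLU are common defaults rather than slide details, and PyTorch's built-in nn.MultiheadAttention stands in for the multihead layer built earlier.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)         # learnable scale gamma and shift beta
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                    # x: (N, T, D)
        a, _ = self.attn(x, x, x)            # self-attention: the only step where
        x = self.norm1(x + a)                #   vectors interact; then add & norm
        x = self.norm2(x + self.mlp(x))      # MLP runs independently per vector
        return x

# A Transformer is a stack of such blocks, e.g.:
model = nn.Sequential(*[TransformerBlock(512, 8) for _ in range(12)])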
Scaling up Transformers

Model             | Layers | Width  | Heads | Params | Data     | Training
Transformer-Base  | 12     | 512    | 8     | 65M    |          | 8x P100 (12 hours)
Transformer-Large | 12     | 1024   | 16    | 213M   |          | 8x P100 (3.5 days)
BERT-Base         | 12     | 768    | 12    | 110M   | 13 GB    |
BERT-Large        | 24     | 1024   | 16    | 340M   | 13 GB    |
XLNet-Large       | 24     | 1024   | 16    | ~340M  | 126 GB   | 512x TPU-v3 (2.5 days)
RoBERTa           | 24     | 1024   | 16    | 355M   | 160 GB   | 1024x V100 GPU (1 day)
GPT-2             | 48     | 1600   | ?     | 1.5B   | 40 GB    |
Megatron-LM       | 72     | 3072   | 32    | 8.3B   | 174 GB   | 512x V100 GPU (9 days)
Turing-NLG        | 78     | 4256   | 28    | 17B    | ?        | 256x V100 GPU
GPT-3             | 96     | 12,288 | 96    | 175B   | 694 GB   | ?
Gopher            | 80     | 16,384 | 128   | 280B   | 10.55 TB | 4096x TPUv3 (38 days)

Gopher training run: $3,768,320 on Google Cloud (eval price)

Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019
Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019
Rae et al, "Scaling Language Models: Methods, Analysis, & Insights from Training Gopher", arXiv 2021
Generated Text from GPT-3
(Bold text: input prompt written by humans. Italics: completion by GPT-3.)
Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: