Transformers
Anna Goldie
Lecture 8: Transformers
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers
2
Transformers: Is Attention All We Need?
• Last lecture, we learned that attention dramatically improves the performance of
recurrent neural networks.
• Today, we will take this one step further and ask Is Attention All We Need?
• Spoiler: Not Quite!
5
Transformers Have Revolutionized the Field of NLP
• By the end of this lecture, you will deeply understand the neural architecture that
underpins virtually every state-of-the-art NLP model today!
[Figure: the Transformer encoder-decoder architecture — Input/Output Embeddings with Positional Encodings, a stack of 6 encoder layers (Multi-Head Attention, Feed Forward, Add & Norm) and 6 decoder layers (Masked Multi-Head Attention, Multi-Head Attention over the encoder, Feed Forward, Add & Norm), followed by a final Linear layer and Softmax producing Output Probabilities; the decoder inputs are the outputs shifted right]
6
[Vaswani et al., 2017]
Great Results with Transformers: Machine Translation
First, Machine Translation results from the original Transformers paper!
10
Transformers Even Show Promise Outside of NLP
Protein Folding
Image Classification
[Dosovitskiy et al. 2020]: Vision Transformer (ViT) outperforms
ResNet-based baselines with substantially less compute.
ML for Systems
[Zhou et al. 2020]: A Transformer-based
compiler model (GO-one) speeds up a
Transformer model!
13
Scaling Laws: Are Transformers All We Need?
• With Transformers, language modeling performance improves smoothly as we increase
model size, training data, and compute resources in tandem.
• This power-law relationship has been observed over multiple orders of magnitude with
no sign of slowing!
• If we keep scaling up these models (with no change to the architecture), could they
eventually match or exceed human-level performance?
15
As of last lecture: recurrent models for (most) NLP!
16
Why Move Beyond Recurrence?
Motivation for Transformer Architecture
The Transformer authors had 3 desiderata when designing this architecture:
1. Minimize (or at least not increase) computational complexity per layer.
2. Minimize path length between any pair of words to facilitate learning of long-range
dependencies.
3. Maximize the amount of computation that can be parallelized.
17
[Vaswani et al., 2017]
1. Transformer Motivation: Computational Complexity Per Layer
When sequence length (n) << representation dimension (d), complexity per layer is lower for a Transformer than for the recurrent models we’ve learned about so far: self-attention costs O(n²·d) per layer, while a recurrent layer costs O(n·d²).
18
[Vaswani et al., 2017]
2. Transformer Motivation: Minimize Linear Interaction Distance
[Figure: left, a recurrent encoder over h1 … hT, where information between distant words must pass through O(sequence length) intermediate steps; right, stacked self-attention layers over the embeddings h1 … hT, where all words attend to all words in the previous layer (most arrows omitted), so any two words interact within a single layer]
22
Computational Dependencies for Recurrence vs. Attention
[Figure: computational dependency graphs for an RNN-based encoder-decoder model with attention vs. a Transformer-based encoder-decoder model]
Transformer Advantages:
• The number of unparallelizable operations does not increase with sequence length.
• Each "word" interacts with every other word, so the maximum interaction distance is O(1).
24
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers
25
The Transformer Encoder-Decoder [Vaswani et al., 2017]
In this section, you will learn exactly how the Transformer encoder-decoder architecture works!
[Figure: the full Transformer encoder-decoder diagram, as on the earlier slide]
26
Encoder: Self-Attention
Self-Attention is the core building block of
the Transformer, so let's first focus on that!
[Figure: simplified diagram highlighting the Self-Attention blocks that sit above the Input and Output Embeddings in the encoder and decoder]
27
Intuition for Attention Mechanism
▪ Let's think of attention as a "fuzzy" or approximate hashtable:
▪ To look up a value, we compare a query against keys in a table.
▪ In a hashtable (shown on the bottom left):
▪ Each query (hash) maps to exactly one key-value pair.
▪ In (self-)attention (shown on the bottom right):
▪ Each query matches each key to varying degrees.
▪ We return a sum of values weighted by the query-key match.
[Figure: left, a hashtable in which query q maps to exactly one of the key-value pairs (k0, v0) … (k7, v7); right, attention in which query q matches every key to a varying degree and the values are combined by a weighted sum]
28
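To make the analogy concrete, here is a minimal NumPy sketch (not from the lecture) of a single "fuzzy" lookup: one query is scored against every key, the scores are softmaxed, and the output is a weighted sum of the values. The dimensions and random vectors are illustrative assumptions.

```python
import numpy as np

# A "fuzzy" lookup: the query is compared against every key, and the result is
# a weighted average of all the values rather than a single exact match.
d = 4                                    # toy dimensionality (assumption)
rng = np.random.default_rng(0)

q = rng.normal(size=d)                   # one query
K = rng.normal(size=(8, d))              # keys k0..k7
V = rng.normal(size=(8, d))              # values v0..v7

scores = K @ q                           # how well q matches each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the 8 keys
output = weights @ V                     # weighted sum of the values

print(weights.round(3))                  # every key contributes to some degree
print(output)
```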
Recipe for Self-Attention in the Transformer Encoder
▪ Step 1: For each word 𝑥𝑖, calculate its query, key, and value: 𝑞𝑖 = 𝑄𝑥𝑖, 𝑘𝑖 = 𝐾𝑥𝑖, 𝑣𝑖 = 𝑉𝑥𝑖.
• Step 2: Calculate the attention score between each query and each key: 𝑒𝑖𝑗 = 𝑞𝑖⊤𝑘𝑗.
• Step 3: Take the softmax to normalize the attention scores.
• Step 4: Return the sum of the values, weighted by the normalized attention scores.
29
Recipe for (Vectorized) Self-Attention in the Transformer Encoder
▪ Step 1: With embeddings stacked in X, calculate queries, keys, and values.
30
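A minimal NumPy sketch of the recipe from the previous two slides, assuming the word embeddings are stacked as the rows of X and using illustrative projection matrices Wq, Wk, Wv; the √d scaling (introduced later as a training trick) is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Steps 1-4 for all words at once, with word embeddings stacked as rows of X."""
    Q = X @ Wq                         # queries        (T x d)
    K = X @ Wk                         # keys           (T x d)
    V = X @ Wv                         # values         (T x d)
    scores = Q @ K.T                   # pairwise attention scores (T x T)
    alpha = softmax(scores, axis=-1)   # normalize over the keys
    return alpha @ V                   # weighted sums of values, one output per word

T, d = 5, 16                                   # toy sizes (assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 16)
```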
What We Have So Far: (Encoder) Self-Attention!
[Figure: the architecture so far — Input and Output Embeddings feeding Self-Attention blocks in the encoder and decoder]
31
But attention isn't quite all you need!
• Problem: Since there are no element-wise non-linearities, self-
attention is simply performing a re-averaging of the value vectors.
• Easy fix: Apply a feedforward layer to the output of attention, adding the missing element-wise non-linearity.
[Figure: encoder and decoder blocks with a Feed Forward layer added after Self-Attention]
32
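A minimal sketch of such a position-wise feedforward layer, assuming a ReLU non-linearity between two linear maps applied independently to each position's vector; the hidden sizes are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feedforward network: a ReLU non-linearity between two
    # linear layers, applied independently to each position's vector.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d, d_ff = 16, 64                       # hidden sizes are illustrative assumptions
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

attn_out = rng.normal(size=(5, d))     # stand-in for the self-attention output
print(feed_forward(attn_out, W1, b1, W2, b2).shape)   # (5, 16)
```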
But how do we make this work for deep networks?
33
Training Trick #1: Residual Connections [He et al., 2016]
35
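A one-line sketch of the idea, where `sublayer` is a stand-in for self-attention or the feedforward block: the input is added back to the sublayer's output, giving gradients a direct path through a deep stack of layers.

```python
import numpy as np

def with_residual(sublayer, x):
    # Residual (skip) connection: add the input back to the sublayer's output.
    return x + sublayer(x)

x = np.ones((5, 16))
print(with_residual(lambda h: 0.1 * h, x)[0, 0])   # 1.1
```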
Training Trick #2: Layer Normalization [Ba et al., 2016]
36
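A minimal NumPy sketch of layer normalization in the spirit of Ba et al., 2016: each vector is normalized to zero mean and unit variance across its features, then rescaled and shifted by learned parameters (initialized here to 1 and 0).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each vector across its features, then rescale/shift with
    # the learned parameters gamma and beta.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d = 16
x = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(5, d))
out = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))   # ~0 means, ~1 stds
```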
Training Trick #3: Scaled Dot Product Attention
• After LayerNorm, the mean and variance of vector elements are 0 and 1, respectively. (Yay!)
• However, the dot product still tends to take on extreme values, as its variance scales with the dimensionality 𝑑𝑘.
• Fix: scale the dot products by 1/√𝑑𝑘 before taking the softmax.
37
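For reference, scaled dot-product attention as defined in Vaswani et al., 2017, which divides the scores by √𝑑𝑘 before the softmax to keep their variance from growing with the key dimensionality:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$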
Major issue!
• We're almost done with the
Encoder, but we have a
major problem! Has anyone
spotted it?
• Consider this sentence:
• "Man eats small dinosaur."
• Wait a minute, order doesn't
impact the network at all!
• This seems wrong given that
word order does have meaning in many languages, including English!
[Figure: Transformer-based encoder-decoder model with the unordered input tokens "Man eats small dinosaur"]
39
Solution: Inject Order Information through Positional Encodings!
[Figure: the Transformer architecture with Positional Encodings added to the Input and Output Embeddings before the encoder and decoder stacks]
40
Fixing the first self-attention problem: sequence order
• Since self-attention doesn’t build in order information, we need to encode the order of the
sentence in our keys, queries, and values.
• Consider representing each sequence index as a vector
𝑝𝑖 ∈ ℝ𝑑 , for 𝑖 ∈ {1,2, … , 𝑇} are position vectors
$$p_i = \begin{bmatrix} \sin(i / 10000^{2 \cdot 1 / d}) \\ \cos(i / 10000^{2 \cdot 1 / d}) \\ \vdots \\ \sin(i / 10000^{2 \cdot \frac{d}{2} / d}) \\ \cos(i / 10000^{2 \cdot \frac{d}{2} / d}) \end{bmatrix}$$

[Figure: heatmap of the sinusoidal position vectors, with the index in the sequence on one axis and the dimension on the other]
• Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Maybe can extrapolate to longer sequences as periods restart
• Cons:
• Not learnable; also the extrapolation doesn’t really work
42 Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
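A minimal NumPy sketch of these sinusoidal position vectors (indexing positions and dimension pairs from 0 rather than 1; the sizes are illustrative). The resulting matrix P would simply be added to the input embeddings.

```python
import numpy as np

def sinusoidal_positions(T, d):
    # Position vectors p_i: even dimensions use sin, odd dimensions use cos,
    # with wavelengths growing geometrically from 2*pi up to about 10000*2*pi.
    positions = np.arange(T)[:, None]                 # i = 0..T-1 (column)
    dims = np.arange(0, d, 2)[None, :]                # paired dimension indices 2k
    angles = positions / np.power(10000.0, dims / d)  # i / 10000^(2k/d)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = sinusoidal_positions(T=50, d=16)
print(P.shape)          # (50, 16); P would be added to the input embeddings
```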
Extension: Self-Attention w/ Relative Position Encodings
Key Insight: The most salient position information is the relationship (e.g. “cat” is the word before “eat”)
between words, rather than their absolute position (e.g. “cat” is word 2).
44
The Transformer Encoder: Multi-headed Self-Attention
• What if we want to look in multiple places in the sentence at
once?
• For word 𝑖, self-attention “looks” where 𝑥𝑖⊤ 𝑄⊤ 𝐾𝑥𝑗 is high, but
maybe we want to focus on different 𝑗 for different reasons?
• We’ll define multiple attention “heads” through multiple Q,K,V
matrices
• Let 𝑄ℓ, 𝐾ℓ, 𝑉ℓ ∈ ℝ^(𝑑×𝑑/ℎ), where ℎ is the number of attention heads, and ℓ ranges from 1 to ℎ.
46
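A minimal NumPy sketch of multi-headed self-attention under the definition above: the d×d projections are treated as h stacked d×(d/h) head projections, each head attends independently, and the head outputs are concatenated. The output projection used in the full Transformer is omitted, and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    # Wq, Wk, Wv are (d x d); each head uses a (d x d/h) column slice, attends
    # independently over the sequence, and the head outputs are concatenated.
    T, d = X.shape
    d_head = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)   # scaled dot products
        outputs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outputs, axis=-1)              # back to (T x d)

rng = np.random.default_rng(0)
T, d = 5, 16
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(multi_head_self_attention(X, Wq, Wk, Wv, n_heads=4).shape)   # (5, 16)
```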
Decoder: Masked Multi-Head Self-Attention
• Problem: How do we keep the decoder
from “cheating”? If we have a language
modeling objective, can't the network
just look ahead and "see" the answer?
• Solution: Masked Multi-Head
Attention. At a high-level, we hide
(mask) information about future
tokens from the model.
[Figure: Transformer-based encoder-decoder model]
48
Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can’t peek at the future.
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting attention scores to −∞:

$$e_{ij} = \begin{cases} q_i^{\top} k_j, & j < i \\ -\infty, & j \ge i \end{cases}$$

[Figure: attention-score matrix for "[START] The chef who" — for encoding each word we can look only at the earlier (not greyed-out) words; scores for the current and future words are set to −∞]
49
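A minimal NumPy sketch of the parallel masking step, following the slide's definition (entries with j ≥ i set to −∞). Note that many practical implementations leave the diagonal (j = i) unmasked so every position can attend to at least itself.

```python
import numpy as np

def masked_self_attention_scores(Q, K):
    # Compute all pairwise scores in parallel, then set entries where j >= i
    # to -inf so that, after the softmax, no probability mass lands on the
    # current or future positions (matching the e_ij definition on the slide).
    T = Q.shape[0]
    scores = Q @ K.T
    future = np.triu(np.ones((T, T), dtype=bool))   # j >= i (upper triangle incl. diagonal)
    scores[future] = -np.inf
    return scores

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
print(masked_self_attention_scores(Q, K))
```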
Decoder: Masked Multi-Headed Self-Attention
[Figure: the decoder's Masked Multi-Head Attention block within the full architecture]
50
Encoder-Decoder Attention
• We saw that self-attention is when the keys, queries, and values all come from the same place.
• In encoder-decoder attention, the keys and values are drawn from the encoder outputs: 𝑘𝑖 = 𝐾ℎ𝑖 , 𝑣𝑖 = 𝑉ℎ𝑖 .
• And the queries are drawn from the decoder: 𝑞𝑖 = 𝑄𝑧𝑖 .
51
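A minimal NumPy sketch of encoder-decoder attention under these definitions, assuming H_enc holds the encoder outputs ℎ𝑖 and Z_dec the decoder states 𝑧𝑖 (shapes and matrices are illustrative assumptions).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H_enc, Z_dec, Wq, Wk, Wv):
    # Encoder-decoder attention: queries come from the decoder states z_i,
    # while keys and values come from the encoder outputs h_i.
    Q = Z_dec @ Wq
    K = H_enc @ Wk
    V = H_enc @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 16
H_enc = rng.normal(size=(7, d))    # 7 source positions (assumption)
Z_dec = rng.normal(size=(5, d))    # 5 target positions (assumption)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H_enc, Z_dec, Wq, Wk, Wv).shape)   # (5, 16): one output per decoder position
```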
Decoder: Finishing touches!
• Add a feed forward layer (with residual connections and layer norm)
• Add a final linear layer to project the embeddings into a much longer vector of length vocab size (logits)
• Add a final softmax to turn the logits into a probability distribution over the vocabulary
55
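A minimal NumPy sketch of those last two steps, assuming an illustrative vocabulary size and a hypothetical output matrix W_out: the final decoder states are projected to vocab-size logits and then softmaxed into next-word distributions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, vocab_size = 16, 50000            # vocab size is an illustrative assumption
rng = np.random.default_rng(0)
W_out = rng.normal(size=(d, vocab_size))        # hypothetical output projection

decoder_states = rng.normal(size=(5, d))        # final decoder outputs, one per position
logits = decoder_states @ W_out                 # project to vocabulary size
probs = softmax(logits)                         # probability distribution over next words
print(probs.shape, probs.sum(axis=-1))          # (5, 50000), each row sums to ~1
```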
Recap of Transformer Architecture
[Figure: the complete Transformer encoder-decoder — Input/Output Embeddings with Positional Encodings, 6 encoder layers (Multi-Head Attention, Feed Forward, Add & Norm), 6 decoder layers (Masked Multi-Head Attention, Multi-Head Attention over the encoder, Feed Forward, Add & Norm), then a final Linear layer and Softmax producing Output Probabilities]
56
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers
57
What would we like to fix about the Transformer?
58
Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: Can we build models like Transformers without paying the 𝑂(𝑇²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020]
59
Recent work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question: Can we build models like Transformers without paying the 𝑂(𝑇²) all-pairs self-attention cost?
• For example, BigBird [Zaheer et al., 2021]
Key idea: replace all-pairs interactions with a family of other interactions, like local
windows, looking at everything, and random interactions.
60
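A minimal NumPy sketch (not BigBird's actual block-sparse implementation) of building such a sparsity pattern: a local sliding window, a few global tokens that interact with everything, and a handful of random links per row. All parameters are illustrative assumptions.

```python
import numpy as np

def sparse_attention_mask(T, window=2, n_global=1, n_random=2, seed=0):
    # Build a boolean (T x T) mask of which pairs may interact, combining the
    # three interaction families described above: local windows, a few global
    # tokens, and random connections.
    rng = np.random.default_rng(seed)
    mask = np.zeros((T, T), dtype=bool)
    idx = np.arange(T)
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = True   # local windows
    mask[:n_global, :] = True                                    # global tokens see everything
    mask[:, :n_global] = True                                    # and are seen by everything
    for i in range(T):                                           # random links per row
        mask[i, rng.choice(T, size=n_random, replace=False)] = True
    return mask

mask = sparse_attention_mask(T=16)
print(mask.sum(), "of", 16 * 16, "pairs kept")   # fewer than all T^2 pairs; savings grow with T
```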
Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve
performance."
61
Parting remarks
62