XCS224N Module6 Slides
John Hewitt
Lecture 8: Self-Attention and Transformers
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. From recurrence (RNN) to attention-based NLP models
2. The Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers
Reminders:
Extra details are in the brand new lecture notes, wooooo!
Assignment 4 due a week from today!
Final project proposal out tonight, due Tuesday, Feb 14 at 4:30PM PST!
Please try to hand in the project proposal on time; we want to get you feedback
quickly!
2
As of last lecture: recurrent models for (most) NLP!
3
Today: Same goals, different building blocks
[Figure: building blocks over time. 2014–2017ish: recurrence; after lots of trial and error, 2021: ??????]
4
Issues with recurrent models: Linear interaction distance
[Figure: in a recurrent model, information between distant words must flow through O(sequence length) recurrent steps of hidden states h_1 … h_T.]
[Figure: stacked attention layers over an embedding layer (layers 0, 1, 2). All words attend to all words in the previous layer; most arrows are omitted.]
8
Attention as a soft, averaging lookup table
We can think of attention as performing fuzzy lookup in a key-value store.
In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.
In attention, the query matches all keys softly, to a weight between 0 and 1. The keys' values are multiplied by the weights and summed.
9
Self-Attention Hypothetical Example
10
Self-Attention: keys, queries, values from the same sequence
Let w_{1:n} be a sequence of words in vocabulary V, like "Zuko made his uncle tea".
For each word w_i, let x_i be its embedding. Transform each x_i with learned weight matrices Q, K, V (each in ℝ^{d×d}):
q_i = Q x_i (queries),  k_i = K x_i (keys),  v_i = V x_i (values)
Compute pairwise similarities between queries and keys, and normalize them with softmax:
e_ij = q_i^⊤ k_j,   α_ij = exp(e_ij) / Σ_{j′} exp(e_ij′)
Compute the output for each word as a weighted sum of values:
o_i = Σ_j α_ij v_j
11
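To make these equations concrete, here is a minimal single-head self-attention sketch in NumPy. It is a hedged illustration rather than the course's reference code; the matrix names, sizes, and random inputs are all assumptions.

```python
# Minimal single-head self-attention sketch (NumPy), following the equations above.
# All names and sizes here are illustrative assumptions, not course-provided code.
import numpy as np

def self_attention(X, Q, K, V):
    """X: (n, d) word embeddings; Q, K, V: (d, d) learned weight matrices."""
    queries = X @ Q.T                     # row i is q_i = Q x_i
    keys    = X @ K.T                     # row i is k_i = K x_i
    values  = X @ V.T                     # row i is v_i = V x_i
    scores  = queries @ keys.T            # scores[i, j] = e_ij = q_i^T k_j
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    alpha   = np.exp(scores)
    alpha  /= alpha.sum(axis=-1, keepdims=True)    # softmax over j gives alpha_ij
    return alpha @ values                 # row i is o_i = sum_j alpha_ij v_j

n, d = 5, 8                               # e.g., "Zuko made his uncle tea"
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Q, K, V)          # shape (5, 8)
```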
Barriers and solutions for Self-Attention as a building block
Barriers Solutions
• Doesn’t have an inherent
notion of order!
12
Fixing the first self-attention problem: sequence order
• Since self-attention doesn’t build in order information, we need to encode the order of the
sentence in our keys, queries, and values.
• Consider representing each sequence index as a vector: p_i ∈ ℝ^d, for i ∈ {1, 2, …, n}, are position vectors.
• To incorporate this into self-attention, just add the p_i to the inputs: if x_i is the embedding of the word at index i, the positioned embedding is
x̃_i = x_i + p_i
• In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…
13
Position representation vectors through sinusoids
Sinusoidal position representations: concatenate sinusoidal functions of varying periods:
p_i = [ sin(i / 10000^{2·1/d}), cos(i / 10000^{2·1/d}), …, sin(i / 10000^{2·(d/2)/d}), cos(i / 10000^{2·(d/2)/d}) ]
[Figure: heatmap of the position vectors; the vertical axis is the dimension, the horizontal axis is the index in the sequence.]
• Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Maybe can extrapolate to longer sequences as periods restart!
• Cons:
• Not learnable; also the extrapolation doesn’t really work!
14 Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
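A small sketch of how such sinusoidal position vectors can be built, assuming the interleaved sin/cos layout above and an even dimensionality d (a hedged illustration, not the exact code behind the figure):

```python
# Sinusoidal position representations: even dimensions use sin, odd use cos,
# with periods growing with the dimension index (assumes d is even).
import numpy as np

def sinusoidal_positions(n, d, base=10000.0):
    """Return an (n, d) matrix whose i-th row is the position vector p_i."""
    positions = np.arange(n)[:, None]                   # (n, 1)
    pair_idx = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = positions / base ** (2 * pair_idx / d)     # (n, d/2)
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

P = sinusoidal_positions(n=16, d=8)
# These are added to the word embeddings at the first layer: x_tilde_i = x_i + p_i
```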
Position representation vectors learned from scratch
• Pros:
• Flexibility: each position gets to be learned to fit the data
• Cons:
• Definitely can’t extrapolate to indices outside 1, … , 𝑛.
• Most systems use this!
15
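A concrete sketch of the learned-from-scratch alternative; the module choice and sizes are assumptions for illustration:

```python
# Learned absolute position representations: one learnable d-dimensional vector
# per index, up to a fixed maximum sequence length n_max.
import torch
import torch.nn as nn

n_max, d = 512, 256
pos_embed = nn.Embedding(n_max, d)          # p_i = pos_embed.weight[i], learned with the model
positions = torch.arange(16)                # indices 0..15 for a length-16 input
p = pos_embed(positions)                    # (16, d), added to the word embeddings
```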
Barriers and solutions for Self-Attention as a building block
Barriers → Solutions
• Doesn't have an inherent notion of order! → Add position representations to the inputs.
16
Adding nonlinearities in self-attention
• Note that self-attention on its own is just a weighted average of value vectors; there are no elementwise nonlinearities.
• Easy fix: apply a feed-forward network to post-process each output vector.
Masking the future in self-attention
• To use self-attention in decoders (e.g., language models), we need to ensure we don't look at the future when predicting a sequence.
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting attention scores to −∞:
e_ij = q_i^⊤ k_j if j ≤ i;  e_ij = −∞ if j > i
[Figure: masked attention scores for "[START] The chef who"; for each word, entries for future words are set to −∞.]
19
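A small sketch of this masking trick (a hedged illustration; the score matrix here is random, standing in for q_i^⊤ k_j):

```python
# Mask out attention to future words: set e_ij = -inf for j > i before the softmax,
# so the resulting attention weights on future positions are exactly 0.
import numpy as np

def causal_masked_softmax(e):
    """e: (n, n) raw attention scores with e[i, j] = q_i^T k_j."""
    n = e.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    e = np.where(future, -np.inf, e)
    e = e - e.max(axis=-1, keepdims=True)                # numerical stability
    alpha = np.exp(e)
    return alpha / alpha.sum(axis=-1, keepdims=True)

e = np.random.randn(4, 4)              # e.g., scores for "[START] The chef who"
alpha = causal_masked_softmax(e)       # entries above the diagonal are 0
```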
Barriers and solutions for Self-Attention as a building block
Barriers → Solutions
• Doesn't have an inherent notion of order! → Add position representations to the inputs.
• No nonlinearities; it's all just weighted averages! → Apply the same feed-forward network to each self-attention output.
• Need to ensure we don't look at the future when predicting a sequence! → Mask out the future by setting attention scores to −∞.
Putting these together, the necessary components of a self-attention building block:
• Self-attention:
• the basis of the method.
• Position representations:
• Specify the sequence order, since self-attention
is an unordered function of its inputs.
• Nonlinearities:
• At the output of the self-attention block
• Frequently implemented as a simple feed-
forward network.
• Masking:
• In order to parallelize operations while not
looking at the future.
• Keeps information about the future from "leaking" to the past.
21
Outline
1. From recurrence (RNN) to attention-based NLP models
2. The Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers
22
The Transformer Decoder
• A Transformer decoder is how we'll build systems like language models.
• It's a lot like our minimal self-attention architecture, but with a few more components.
• The embeddings and position embeddings are identical.
• We'll next replace our self-attention with multi-head self-attention.
Transformer Decoder
23
Recall the Self-Attention Hypothetical Example
24
Hypothetical Example of Multi-Head Attention
25
Sequence-Stacked form of Attention
• Each head gets to "look" at different things, and construct value vectors differently.
27
Multi-head self-attention is computationally efficient
• Even though we compute ℎ many attention heads, it’s not really more costly.
• We compute XQ ∈ ℝ^{n×d}, and then reshape to ℝ^{n×h×(d/h)}. (Likewise for XK, XV.)
• Then we transpose to ℝ^{h×n×(d/h)}; now the head axis is like a batch axis.
• Almost everything else is identical, and the matrices are the same sizes.
29
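A sketch of the reshape-and-batch trick described above, in PyTorch. Shapes follow the slide; the scaled dot product and the random weights are assumptions added to make the example runnable.

```python
# Multi-head self-attention via reshaping: compute XQ, XK, XV once, then split the
# d-dimensional vectors into h heads and treat the head axis like a batch axis.
import torch

n, d, h = 6, 16, 4
X = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def split_heads(M):                                   # (n, d) -> (h, n, d/h)
    return M.reshape(n, h, d // h).transpose(0, 1)

q, k, v = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)

scores = q @ k.transpose(-2, -1) / (d // h) ** 0.5    # (h, n, n); scaled dot product
alpha = scores.softmax(dim=-1)                        # softmax within each head
out = (alpha @ v).transpose(0, 1).reshape(n, d)       # concatenate heads back to (n, d)
```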
The Transformer Decoder
• Now that we've replaced self-attention with multi-head self-attention, we'll go through two optimization tricks that end up being crucial for training:
• Residual Connections
• Layer Normalization
• In most Transformer diagrams, these are often written together as "Add & Norm"
Transformer Decoder
30
The Transformer Encoder: Residual connections [He et al., 2016]
• Residual connections help models train better: instead of computing X^(i) = Layer(X^(i−1)), we compute X^(i) = X^(i−1) + Layer(X^(i−1)), so the layer only needs to learn "the residual" from the previous layer.
• The identity connection also gives gradients a direct path backward, making deep networks much easier to train.
Transformer Decoder
33
The Transformer Encoder
• The Transformer Decoder is constrained to unidirectional context, as for language models.
• What if we want bidirectional context, like in a bidirectional RNN?
• This is the Transformer Encoder. The only difference is that we remove the masking in the self-attention.
34
The Transformer Encoder-Decoder
• Recall that in machine translation, we processed the source sentence with a bidirectional model and generated the target with a unidirectional model.
• For this kind of seq2seq format, we often use a Transformer Encoder-Decoder.
• We use a normal Transformer Encoder.
• Our Transformer Decoder is modified to perform cross-attention to the output of the Encoder.
35
Cross-attention (details)
• The queries are drawn from the decoder states, while the keys and values are drawn from the encoder outputs, which act like a memory the decoder can read from.
38
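A minimal sketch of cross-attention along those lines (NumPy, single head; variable names and sizes are illustrative assumptions):

```python
# Cross-attention sketch: queries come from decoder states Z, keys and values
# come from encoder outputs H (which act like a memory the decoder reads from).
import numpy as np

def cross_attention(Z, H, Q, K, V):
    """Z: (T_dec, d) decoder states; H: (T_enc, d) encoder outputs; Q, K, V: (d, d)."""
    q = Z @ Q.T                           # queries from the decoder
    k, v = H @ K.T, H @ V.T               # keys and values from the encoder
    e = q @ k.T                           # (T_dec, T_enc) scores
    e -= e.max(axis=-1, keepdims=True)
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return alpha @ v                      # each decoder position attends over the source

rng = np.random.default_rng(0)
d = 8
Z, H = rng.normal(size=(4, d)), rng.normal(size=(6, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(Z, H, Q, K, V)      # shape (4, 8)
```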
Great Results with Transformers
First, Machine Translation from the original Transformers paper!
42
What would we like to fix about the Transformer?
43
Quadratic computation as a function of sequence length
• One of the benefits of self-attention over recurrence was that it’s highly
parallelizable.
• However, its total number of operations grows as O(n²d), where n is the sequence length and d is the dimensionality.
• Think of d as around 1,000 (though for large language models it's much larger!).
• So, for a single (shortish) sentence, n ≤ 30; n² ≤ 900.
• In practice, we set a bound like n = 512.
• But what if we'd like n ≥ 50,000? For example, to work on long documents?
44
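To make the scaling concrete, here is a quick back-of-the-envelope computation of the n²d term using the slide's rough numbers (illustrative, not an exact FLOP count):

```python
# The n^2 * d term in self-attention, for d ~ 1,000 and a few sequence lengths.
d = 1_000
for n in (30, 512, 50_000):
    print(f"n = {n:>6}:  n^2 * d = {n**2 * d:.2e}")
# n =     30:  n^2 * d = 9.00e+05
# n =    512:  n^2 * d = 2.62e+08
# n =  50000:  n^2 * d = 2.50e+12
```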
Work on improving on quadratic self-attention cost
• Considerable recent work has gone into the question, Can we build models like
Transformers without paying the O(T²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020] maps the sequence-length dimension of the keys and values to a lower-dimensional space, so attention no longer scales quadratically with sequence length.
45
Do we even need to remove the quadratic cost of attention?
46
Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve
performance."
47
Natural Language Processing
with Deep Learning
CS224N/Ling284
John Hewitt
Lecture 9: Pretraining
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Decoders
2. Encoders
3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning
Reminders:
Assignment 5 is out on Thursday! It covers lecture 8 and lecture 9 (today)!
It has ~pedagogically relevant math~
2
Word structure and subword models
Let’s take a look at the assumptions we’ve made about a language’s vocabulary.
We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.
3
Word structure and subword models
Finite vocabulary assumptions make even less sense in many languages.
• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.
4 [Wiktionary]
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about
structure below the word level. (Parts of words, characters, bytes.)
• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.
Originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained
models.
In the worst case, words are split into as many subwords as they have characters.
6
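A minimal sketch of the byte-pair-encoding vocabulary-learning loop in the classic style (start from characters, repeatedly merge the most frequent adjacent pair). The toy corpus and merge count are assumptions, and real tokenizers add many details on top of this.

```python
# Learn BPE merges: each merge turns the most frequent adjacent symbol pair into
# a new subword; after enough merges, common words become single tokens while
# rare words split into a few subwords (worst case: one subword per character).
from collections import Counter

def learn_bpe(word_counts, num_merges):
    # Represent each word as a tuple of symbols: its characters plus an end marker.
    vocab = {tuple(word) + ("</w>",): c for word, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, c in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += c
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)      # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, c in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = c
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
# Each merge, e.g. ('e', 's') or ('es', 't'), adds a subword to the vocabulary.
```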
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?
7
Motivating word meaning and context
Recall the adage we mentioned at the beginning of the course:
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
Consider I record the record: the two instances of record mean different things.
8 [Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]
Where we were: pretrained word embeddings
Circa 2017:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.
[Figure: a task model predicting ŷ; only the word embeddings are pretrained, the layers on top are not pretrained.]
9
Where we’re going: pretraining whole models
In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been exceptionally effective at building strong:
• representations of language
• parameter initializations for strong NLP models
• probability distributions over language that we can sample from
[Figure: the whole network predicting ŷ is pretrained jointly; this model has learned how to represent entire sentences through pretraining.]
10
What can we learn from reconstructing the input?
11
What can we learn from reconstructing the input?
12
What can we learn from reconstructing the input?
13
What can we learn from reconstructing the input?
I went to the ocean to see the fish, turtles, seals, and _____.
14
What can we learn from reconstructing the input?
15
What can we learn from reconstructing the input?
16
What can we learn from reconstructing the input?
17
Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
• Model p_θ(w_t | w_{1:t−1}), the probability distribution over words given their past contexts.
• There's lots of data for this! (In English.)
Pretraining through language modeling:
• Train a neural network to perform language modeling on a large amount of text.
• Save the network parameters.
[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and is trained to predict "goes to make tasty tea END".]
18
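A small sketch of the pretraining objective itself: next-token prediction with a decoder. The tiny LSTM here is just a stand-in for "Transformer, LSTM, ++", and all sizes and data are illustrative assumptions.

```python
# Pretraining through language modeling: predict w_t from w_{1:t-1}, then save the parameters.
import torch
import torch.nn as nn

vocab_size, d = 1000, 64
embed = nn.Embedding(vocab_size, d)
decoder = nn.LSTM(d, d, batch_first=True)        # stand-in for a Transformer decoder
lm_head = nn.Linear(d, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 32))   # (batch, T) word ids from a big corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each word from its past context
hidden, _ = decoder(embed(inputs))               # (batch, T-1, d)
logits = lm_head(hidden)                         # (batch, T-1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # train on lots of text, then save the parameters
```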
The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.
Step 1: Pretrain (on language modeling). Lots of text; learn general things!
Step 2: Finetune (on your task). Not many labels; adapt to the task!
[Figure: left, the decoder is pretrained to predict "goes to make tasty tea END"; right, the same network is finetuned to predict a task label ☺/☹.]
19
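A sketch of the second step, under the same assumptions as the pretraining sketch above: reuse the (notionally pretrained) parameters as initialization, add a small task head, and train on a little labeled data.

```python
# Finetuning: start from pretrained parameters, add a new task head, adapt to the task.
import torch
import torch.nn as nn

vocab_size, d, num_classes = 1000, 64, 2
embed = nn.Embedding(vocab_size, d)              # pretend these weights were loaded
decoder = nn.LSTM(d, d, batch_first=True)        # from the pretraining step
classifier = nn.Linear(d, num_classes)           # new head, learned from scratch

params = list(embed.parameters()) + list(decoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

tokens = torch.randint(0, vocab_size, (8, 32))   # a small labeled batch
labels = torch.randint(0, num_classes, (8,))
hidden, _ = decoder(embed(tokens))
logits = classifier(hidden[:, -1])               # predict the label from the last position
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```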
Stochastic gradient descent and pretrain/finetune
Why should pretraining and finetuning help, from a “training neural nets” perspective?
20
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?
21
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
22
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.
23
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional
context, so we can’t do language modeling!
24
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a
pretrained Transformer, a model they labeled BERT.
• The pretraining input to BERT was two separate contiguous chunks of text:
• BERT was trained to predict whether one chunk follows the other or is randomly
sampled.
• Later work has argued this “next sentence prediction” is not necessary.
If your task involves generating sequences, consider using a pretrained decoder; BERT and other
pretrained encoders don’t naturally lead to nice autoregressive (1-word-at-a-time) generation
methods.
[Figure: masked language modeling. Input: "Iroh goes to [MASK] tasty tea"; the model must predict the masked word to reconstruct "Iroh goes to make tasty tea".]
29
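A sketch of the masked-LM input corruption. The 15% masking rate and 80/10/10 replacement scheme are the ones reported in Devlin et al., 2018; the code itself is an illustrative assumption, not BERT's actual implementation.

```python
# BERT-style masking: choose ~15% of positions to predict; of those, replace 80%
# with [MASK], 10% with a random token, and leave 10% unchanged.
import torch

def mask_tokens(tokens, mask_id, vocab_size, p=0.15):
    tokens, labels = tokens.clone(), tokens.clone()
    chosen = torch.rand(tokens.shape) < p                             # positions to predict
    labels[~chosen] = -100                                            # ignored by cross_entropy
    masked = chosen & (torch.rand(tokens.shape) < 0.8)                # 80% of chosen -> [MASK]
    random_tok = chosen & ~masked & (torch.rand(tokens.shape) < 0.5)  # 10% -> random token
    tokens[masked] = mask_id
    tokens[random_tok] = torch.randint(0, vocab_size, (int(random_tok.sum()),))
    return tokens, labels                                             # remaining 10% unchanged

ids = torch.randint(0, 1000, (2, 16))
corrupted, labels = mask_tokens(ids, mask_id=999, vocab_size=1000)
```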
Extensions of BERT
You’ll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task
[Figure: BERT masks and predicts individual words, while SpanBERT masks and predicts contiguous spans of words.]
[Figure: parameter-efficient finetuning via a low-rank update: instead of updating a full pretrained weight matrix W ∈ ℝ^{d×d}, learn W + AB with A ∈ ℝ^{d×k} and B ∈ ℝ^{k×d}; illustrated on a sentiment-style task ("… the movie was …" → ☺/☹).]
35
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a
prefix of every input is provided to the encoder and is not predicted.
h_1, …, h_T = Encoder(w_1, …, w_T)
h_{T+1}, …, h_{2T} = Decoder(w_{T+1}, …, w_{2T}, h_1, …, h_T)
y_i ∼ A h_i + b, i > T
[Figure: the encoder reads w_1 … w_T bidirectionally; the decoder's outputs at positions i > T are used to predict w_{T+2}, … .]
36
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2018 found to work best was span corruption. Their model: T5.
37
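To illustrate span corruption: spans of the input are replaced with sentinel tokens, and the decoder must generate the dropped-out spans. The canonical example from the T5 paper (reproduced from memory, so treat it as illustrative) looks like this:

Original text: "Thank you for inviting me to your party last week."
Encoder input: "Thank you <X> me to your party <Y> week."
Decoder target: "<X> for inviting <Y> last <Z>"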
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2018 found encoder-decoders to work better than decoders for their tasks,
and span corruption (denoising) to work better than language modeling.
Pretraining encoder-decoders: what pretraining objective to use?
A fascinating property
of T5: it can be
finetuned to answer a
wide range of
questions, retrieving
knowledge from its
parameters.
Gradients backpropagate through the whole network. [Note how the linear layer hasn't been pretrained and must be learned from scratch.]
41
Pretraining decoders
It’s natural to pretrain decoders as language models and then
use them as generators, finetuning their p_θ(w_t | w_{1:t−1})!
h_1, …, h_T = Decoder(w_1, …, w_T)
w_t ∼ A h_{t−1} + b
[Figure: the decoder predicts w_1 … w_5 one word at a time.]
[Example input format for a sentence pair: "[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]".]
44
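A tiny sketch of using the formulas above for generation: sample w_t from the distribution defined by A h_{t−1} + b. The projection here is random, standing in for pretrained parameters.

```python
# Sample the next word from a decoder state: w_t ~ softmax(A h_{t-1} + b).
import torch

vocab_size, d = 1000, 64
A = torch.randn(vocab_size, d)           # output projection (pretrained in practice)
b = torch.randn(vocab_size)

def sample_next(h_prev):
    logits = A @ h_prev + b              # scores over the vocabulary
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

h_prev = torch.randn(d)                  # h_{t-1} from the (pretrained) decoder
w_t = sample_next(h_prev)                # next word id
```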
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
GPT results on various natural language inference datasets.
45
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language models.
GPT-2, a larger version (1.5B) of GPT trained on more data, was shown to produce relatively
convincing samples of natural language.
GPT-3, In-context learning, and very large models
So far, we’ve interacted with pretrained models in two ways:
• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about, and take their predictions.
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.
GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.
GPT-3 has 175 billion parameters.
47
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.
The in-context examples seem to specify the task to be performed, and the conditional
distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
“ thanks -> merci
hello -> bonjour
mint -> menthe
otter -> ”
Output (conditional generations):
loutre…”
48
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.
49
Scaling Efficiency: how do we best use our compute
GPT-3 was 175B parameters and trained on 300B tokens of text.
Roughly, the cost of training a large transformer scales as parameters*tokens
Did OpenAI strike the right parameter-token tradeoff to get the best model? No.
A 70B parameter model (Chinchilla; Hoffmann et al., 2022) is better than the much larger models it was compared against!
50
The prefix as task specification and scratch pad: chain-of-thought
51
[Wei et al., 2023]
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?
52
What kinds of things does pretraining teach?
There’s increasing evidence that pretrained models learn a wide variety of things about
the statistical properties of language. Taking our examples from the start of class:
• Stanford University is located in __________, California. [Trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn
and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his
destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don't learn the Fibonacci sequence]
• Models also learn – and can exacerbate – racism, sexism, and all manner of bad biases.
• More on all this in the interpretability lecture!
53