
Natural Language Processing

with Deep Learning


CS224N/Ling284

John Hewitt
Lecture 8: Self-Attention and Transformers
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. From recurrence (RNN) to attention-based NLP models
2. The Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

Reminders:
Extra details are in the brand new lecture notes, wooooo!
Assignment 4 due a week from today!
Final project proposal out tonight, due Tuesday, Feb 14 at 4:30PM PST!
Please try to hand in the project proposal on time; we want to get you feedback
quickly!

2
As of last lecture: recurrent models for (most) NLP!

• Circa 2016, the de facto strategy in NLP is to encode sentences with a bidirectional LSTM
  (for example, the source sentence in a translation).

• Define your output (parse, sentence, summary) as a sequence, and use an LSTM to generate it.

• Use attention to allow flexible access to memory.

3
Today: Same goals, different building blocks

• Last week, we learned about sequence-to-sequence problems and encoder-decoder models.
• Today, we’re not trying to motivate entirely new ways of looking at
problems (like Machine Translation)
• Instead, we’re trying to find the best building blocks to plug into our
models and enable broad progress.

[Figure: timeline from 2014-2017ish (Recurrence) to 2021 (??????), with lots of trial and error in between.]
4
Issues with recurrent models: Linear interaction distance

• RNNs are unrolled “left-to-right”.
• This encodes linear locality: a useful heuristic!
  • Nearby words often affect each other’s meanings (e.g., tasty pizza).
• Problem: RNNs take O(sequence length) steps for distant word pairs to interact.

[Figure: “The chef who … was” — the two related words are O(sequence length) RNN steps apart.]


5
Issues with recurrent models: Linear interaction distance

• O(sequence length) steps for distant word pairs to interact means:
  • Hard to learn long-distance dependencies (because gradient problems!)
  • Linear order of words is “baked in”; we already know linear order isn’t the right way to think about sentences…

[Figure: “The chef who … was” — info about chef has gone through O(sequence length) many layers!]
6
Issues with recurrent models: Lack of parallelizability

• Forward and backward passes have O(sequence length) unparallelizable operations.
• GPUs can perform a bunch of independent computations at once!
• But future RNN hidden states can’t be computed in full before past RNN hidden states have been computed.
• Inhibits training on very large datasets!

[Figure: RNN states h1 … hT annotated 0, 1, 2, …, n; numbers indicate the min # of steps before a state can be computed.]


7
If not recurrence, then what? How about attention?

• Attention treats each word’s representation as a query to access and incorporate information from a set of values.
• We saw attention from the decoder to the encoder; today we’ll think about attention within a single sentence.
• Number of unparallelizable operations does not increase with sequence length.
• Maximum interaction distance: O(1), since all words interact at every layer!

[Figure: an embedding layer (step 0) followed by two attention layers (steps 1 and 2) over h1 … hT. All words attend to all words in the previous layer; most arrows are omitted.]
8
Attention as a soft, averaging lookup table
We can think of attention as performing fuzzy lookup in a key-value store.

In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.

In attention, the query matches all keys softly, to a weight between 0 and 1. The keys’ values are multiplied by the weights and summed.

9
Self-Attention Hypothetical Example

10
Self-Attention: keys, queries, values from the same sequence
Let 𝒘1:𝑛 be a sequence of words in vocabulary 𝑉, like Zuko made his uncle tea.

For each 𝒘𝑖 , let 𝒙𝑖 = 𝐸𝒘𝒊 , where 𝐸 ∈ ℝ𝑑×|𝑉| is an embedding matrix.


1. Transform each word embedding with weight matrices Q, K, V , each in ℝ𝑑×𝑑
𝒒𝑖 = 𝑄𝒙𝒊 (queries) 𝒌𝑖 = 𝐾𝒙𝒊 (keys) 𝒗𝑖 = 𝑉𝒙𝒊 (values)
2. Compute pairwise similarities between keys and queries; normalize with softmax

   e_ij = q_i⊤ k_j        α_ij = exp(e_ij) / Σ_j′ exp(e_ij′)

3. Compute output for each word as a weighted sum of values

   o_i = Σ_j α_ij v_j
11
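To make steps 1-3 concrete, here is a minimal NumPy sketch of single-head self-attention. The shapes and random inputs are illustrative, not the course's reference code:

import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)   # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, Q, K, V):
    """X: (n, d) word embeddings; Q, K, V: (d, d) weight matrices."""
    queries = X @ Q                       # q_i for every position, as rows
    keys    = X @ K
    values  = X @ V
    scores  = queries @ keys.T            # e_ij = q_i . k_j
    alpha   = softmax(scores, axis=-1)    # normalize over j
    return alpha @ values                 # o_i = sum_j alpha_ij v_j

# toy usage with random embeddings for a 5-word sentence
n, d = 5, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Q, K, V)          # shape (5, 16)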
Barriers and solutions for Self-Attention as a building block

Barriers → Solutions
• Doesn’t have an inherent notion of order!

12
Fixing the first self-attention problem: sequence order

• Since self-attention doesn’t build in order information, we need to encode the order of the
sentence in our keys, queries, and values.
• Consider representing each sequence index as a vector

𝒑𝑖 ∈ ℝ𝑑 , for 𝑖 ∈ {1,2, … , 𝑛} are position vectors

• Don’t worry about what the 𝑝𝑖 are made of yet!


• Easy to incorporate this info into our self-attention block: just add the 𝒑𝑖 to our inputs!
• Recall that 𝒙𝑖 is the embedding of the word at index 𝑖. The positioned embedding is:

   𝒙̃ᵢ = 𝒙ᵢ + 𝒑ᵢ

(In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…)
13
Position representation vectors through sinusoids

• Sinusoidal position representations: concatenate sinusoidal functions of varying periods:

   𝒑ᵢ = [ sin(i/10000^(2·1/d)); cos(i/10000^(2·1/d)); … ; sin(i/10000^(2·(d/2)/d)); cos(i/10000^(2·(d/2)/d)) ]

[Image: heatmap of the position vectors; vertical axis = dimension, horizontal axis = index in the sequence.]
• Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Maybe can extrapolate to longer sequences as periods restart!
• Cons:
• Not learnable; also the extrapolation doesn’t really work!

14 Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
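A sketch of how such a position matrix can be built in NumPy. The exact exponent indexing follows the common Vaswani et al. (2017) layout and may differ from the slide's indexing by one; treat it as illustrative:

import numpy as np

def sinusoidal_positions(n, d):
    """Return an (n, d) matrix whose i-th row is the position vector p_i.
    Assumes d is even: even columns get sin, odd columns get cos, with
    geometrically increasing periods."""
    positions = np.arange(n)[:, None]                 # (n, 1)
    dims = np.arange(0, d, 2)[None, :]                # (1, d/2)
    angles = positions / np.power(10000.0, dims / d)  # (n, d/2)
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# add to the word embeddings at the first layer: x_tilde_i = x_i + p_i
# X = X + sinusoidal_positions(n, d)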
Position representation vectors learned from scratch

• Learned absolute position representations: Let all 𝑝𝑖 be learnable parameters!


Learn a matrix 𝒑 ∈ ℝ𝑑×𝑛 , and let each 𝒑𝑖 be a column of that matrix!

• Pros:
• Flexibility: each position gets to be learned to fit the data
• Cons:
• Definitely can’t extrapolate to indices outside 1, … , 𝑛.
• Most systems use this!

• Sometimes people try more flexible representations of position:


• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position [Wang et al., 2019]

15
Barriers and solutions for Self-Attention as a building block

Barriers → Solutions
• Doesn’t have an inherent notion of order! → Add position representations to the inputs
• No nonlinearities for deep learning! It’s all just weighted averages

16
Adding nonlinearities in self-attention

• Note that there are no elementwise nonlinearities in self-attention; stacking more self-attention layers just re-averages value vectors. (Why? Look at the notes!)

• Easy fix: add a feed-forward network to post-process each output vector:

   m_i = MLP(output_i) = W₂ ReLU(W₁ output_i + b₁) + b₂

[Figure: a stack of self-attention layers, each followed by position-wise FF blocks, over inputs w₁ w₂ w₃ … wₙ (“The chef who … food”).]

Intuition: the FF network processes the result of attention


17
Barriers and solutions for Self-Attention as a building block

Barriers → Solutions
• Doesn’t have an inherent notion of order! → Add position representations to the inputs
• No nonlinearities for deep learning magic! It’s all just weighted averages → Easy fix: apply the same feedforward network to each self-attention output.
• Need to ensure we don’t “look at the future” when predicting a sequence
  • Like in machine translation
  • Or language modeling
18
Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can’t peek at the future.

• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)

• To enable parallelization, we mask out attention to future words by setting attention scores to −∞:

   e_ij = q_i⊤ k_j  if j ≤ i;    e_ij = −∞  if j > i

[Figure: lower-triangular attention pattern over “[START] The chef who”. For encoding these words we can look at the (not greyed out) past words; future positions get score −∞.]
19
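A sketch of this masking trick in PyTorch (helper names are hypothetical): compute all scores in parallel, then fill entries with j > i with −∞ before the softmax, so every future position gets weight 0:

import torch

def causal_mask(n):
    """Boolean (n, n) mask that is True exactly where j > i (a future position)."""
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

def masked_attention_weights(queries, keys):
    """queries, keys: (n, d). Scores above the diagonal are set to -inf,
    so after the softmax every future position gets weight 0."""
    scores = queries @ keys.T                                   # e_ij = q_i . k_j
    scores = scores.masked_fill(causal_mask(scores.size(0)), float("-inf"))
    return torch.softmax(scores, dim=-1)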
Barriers and solutions for Self-Attention as a building block

Barriers → Solutions
• Doesn’t have an inherent notion of order! → Add position representations to the inputs
• No nonlinearities for deep learning magic! It’s all just weighted averages → Easy fix: apply the same feedforward network to each self-attention output.
• Need to ensure we don’t “look at the future” when predicting a sequence (like in machine translation, or language modeling) → Mask out the future by artificially setting attention weights to 0!
20
Necessities for a self-attention building block:

• Self-attention:
• the basis of the method.
• Position representations:
• Specify the sequence order, since self-attention
is an unordered function of its inputs.
• Nonlinearities:
• At the output of the self-attention block
• Frequently implemented as a simple feed-
forward network.
• Masking:
• In order to parallelize operations while not
looking at the future.
• Keeps information about the future from “leaking” to the past.
21
Outline
1. From recurrence (RNN) to attention-based NLP models
2. The Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

22
The Transformer Decoder
• A Transformer decoder is how
we’ll build systems like
language models.
• It’s a lot like our minimal self-
attention architecture, but
with a few more components.
• The embeddings and position
embeddings are identical.
• We’ll next replace our self-
attention with multi-head self-
attention.

Transformer Decoder

23
Recall the Self-Attention Hypothetical Example

24
Hypothetical Example of Multi-Head Attention

25
Sequence-Stacked form of Attention

• Let’s look at how key-query-value attention is computed, in matrices.


• Let 𝑋 = 𝑥1 ; … ; 𝑥𝑛 ∈ ℝ𝑛×𝑑 be the concatenation of input vectors.
• First, note that 𝑋𝐾 ∈ ℝ𝑛×𝑑 , 𝑋𝑄 ∈ ℝ𝑛×𝑑 , 𝑋𝑉 ∈ ℝ𝑛×𝑑 .
• The output is defined as output = softmax(XQ (XK)⊤) XV ∈ ℝ^{n×d}.

First, take the query–key dot products in one matrix multiplication: XQ (XK)⊤ = XQK⊤X⊤ ∈ ℝ^{n×n} gives all pairs of attention scores!

Next, softmax, and compute the weighted average with another matrix multiplication: softmax(XQK⊤X⊤) XV = output ∈ ℝ^{n×d}.
26
Multi-headed attention

• What if we want to look in multiple places in the sentence at once?


• For word 𝑖, self-attention “looks” where 𝑥𝑖⊤ 𝑄 ⊤ 𝐾𝑥𝑗 is high, but maybe we want
to focus on different 𝑗 for different reasons?
• We’ll define multiple attention “heads” through multiple Q,K,V matrices
• Let Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^{d×d/h}, where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
   • output_ℓ = softmax(XQ_ℓ K_ℓ⊤ X⊤) XV_ℓ, where output_ℓ ∈ ℝ^{n×d/h}
• Then the outputs of all the heads are combined!
   • output = [output_1; …; output_h] Y, where Y ∈ ℝ^{d×d}

• Each head gets to “look” at different things, and construct value vectors differently.
27
Multi-head self-attention is computationally efficient

• Even though we compute ℎ many attention heads, it’s not really more costly.
• We compute 𝑋𝑄 ∈ ℝ𝑛×𝑑 , and then reshape to ℝ𝑛×ℎ×𝑑/ℎ . (Likewise for 𝑋𝐾, 𝑋𝑉.)
• Then we transpose to ℝℎ×𝑛×𝑑/ℎ ; now the head axis is like a batch axis.
• Almost everything else is identical, and the matrices are the same sizes.

First, take the query–key dot products in one matrix multiplication: XQ (XK)⊤ = XQK⊤X⊤ ∈ ℝ^{h×n×n} (the pictured case has h = 3 sets of all pairs of attention scores!)

Next, softmax to get P = softmax(XQK⊤X⊤), compute the weighted average P XV with another matrix multiplication, and mix the heads to get output ∈ ℝ^{n×d}.
28
Scaled Dot Product [Vaswani et al., 2017]

• “Scaled Dot Product” attention aids in training.


• When dimensionality 𝑑 becomes large, dot products between vectors tend to
become large.
• Because of this, inputs to the softmax function can be large, making the
gradients small.
• Instead of the self-attention function we’ve seen:

   output_ℓ = softmax(XQ_ℓ K_ℓ⊤ X⊤) XV_ℓ

• We divide the attention scores by √(d/h), to stop the scores from becoming large just as a function of d/h (the dimensionality divided by the number of heads):

   output_ℓ = softmax( XQ_ℓ K_ℓ⊤ X⊤ / √(d/h) ) XV_ℓ

29
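Putting the last few slides together, here is a sketch of multi-head, scaled dot-product self-attention with the reshape trick. The weight names W_q, W_k, W_v, W_o are hypothetical; W_o plays the role of the output matrix Y:

import torch

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d); W_q, W_k, W_v, W_o: (d, d); h: number of heads (d % h == 0)."""
    n, d = X.shape
    d_head = d // h

    def split_heads(M):                          # (n, d) -> (h, n, d/h)
        return M.view(n, h, d_head).transpose(0, 1)

    Q = split_heads(X @ W_q)                     # compute XQ once, then reshape
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    scores = Q @ K.transpose(-2, -1) / (d_head ** 0.5)   # scaled dot product
    alpha = torch.softmax(scores, dim=-1)                # (h, n, n) attention weights
    heads = alpha @ V                                    # (h, n, d/h)

    concat = heads.transpose(0, 1).reshape(n, d)         # concatenate the heads
    return concat @ W_o                                  # mix heads (the Y matrix)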
The Transformer Decoder
• Now that we’ve replaced self-
attention with multi-head self-
attention, we’ll go through two
optimization tricks that help
with training:
• Residual Connections
• Layer Normalization
• In most Transformer diagrams,
these are often written
together as “Add & Norm”

Transformer Decoder

30
The Transformer Encoder: Residual connections [He et al., 2016]

• Residual connections are a trick to help models train better.


• Instead of X^(i) = Layer(X^(i−1))   (where i represents the layer)

• We let X^(i) = X^(i−1) + Layer(X^(i−1))   (so we only have to learn “the residual” from the previous layer)

• Gradient is great through the residual connection; it’s 1!
• Bias towards the identity function!

[Figure: block diagrams of X^(i−1) → Layer → X^(i) with and without the skip connection, plus a loss landscape visualization with [no residuals] and with [residuals]; Li et al., 2018, on a ResNet.]
31
The Transformer Encoder: Layer normalization [Ba et al., 2016]

• Layer normalization is a trick to help models train faster.


• Idea: cut down on uninformative variation in hidden vector values by normalizing
to unit mean and standard deviation within each layer.
• LayerNorm’s success may be due to its normalizing gradients [Xu et al., 2019]
• Let 𝑥 ∈ ℝ𝑑 be an individual (word) vector in the model.
• Let μ = (1/d) Σ_{j=1}^d x_j ; this is the mean; μ ∈ ℝ.
• Let σ = sqrt( (1/d) Σ_{j=1}^d (x_j − μ)² ) ; this is the standard deviation; σ ∈ ℝ.
• Let γ ∈ ℝ^d and β ∈ ℝ^d be learned “gain” and “bias” parameters. (Can omit!)
• Then layer normalization computes:

   output = (x − μ) / (σ + ε) ∗ γ + β

   (Normalize by scalar mean and variance; modulate by learned elementwise gain and bias.)
32
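A from-scratch sketch of this formula in PyTorch. Note that torch.nn.LayerNorm instead adds ε to the variance inside the square root; the placement below follows the slide:

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (..., d) hidden vectors; gamma, beta: (d,) learned gain and bias."""
    mu = x.mean(dim=-1, keepdim=True)                          # per-vector mean
    sigma = ((x - mu) ** 2).mean(dim=-1, keepdim=True).sqrt()  # per-vector std
    return (x - mu) / (sigma + eps) * gamma + beta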
The Transformer Decoder
• The Transformer Decoder is a
stack of Transformer Decoder
Blocks.
• Each Block consists of:
• Self-attention
• Add & Norm
• Feed-Forward
• Add & Norm
• That’s it! We’ve gone through
the Transformer Decoder.

Transformer Decoder

33
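Putting the pieces together, a compact sketch of one decoder block using PyTorch's built-in modules. The layer sizes are illustrative, and the lecture does not prescribe a particular implementation:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder block: masked multi-head self-attention,
    Add & Norm, feed-forward, Add & Norm (post-norm, as in the lecture)."""
    def __init__(self, d=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                        # x: (batch, n, d)
        n = x.size(1)
        future = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device),
                            diagonal=1)          # True where j > i
        attn_out, _ = self.attn(x, x, x, attn_mask=future)   # masked self-attention
        x = self.norm1(x + attn_out)             # Add & Norm (residual connection)
        x = self.norm2(x + self.ff(x))           # Add & Norm around the feed-forward
        return x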
The Transformer Encoder
• The Transformer Decoder
constrains to unidirectional
context, as for language
models.
• What if we want bidirectional
context, like in a bidirectional
RNN?
• This is the Transformer
Encoder. The only difference is
that we remove the masking
in the self-attention.

No Masking! Transformer Decoder

34
The Transformer Encoder-Decoder
• Recall that in machine
translation, we processed the
source sentence with a
bidirectional model and
generated the target with a
unidirectional model.
• For this kind of seq2seq
format, we often use a
Transformer Encoder-Decoder.
• We use a normal Transformer
Encoder.
• Our Transformer Decoder is
modified to perform cross-
attention to the output of the Encoder.
35
Cross-attention (details)

• We saw that self-attention is when keys, queries, and values come from the same source.
• In the decoder, we have attention that looks more like what we saw last week.
• Let h₁, …, hₙ be output vectors from the Transformer encoder; hᵢ ∈ ℝ^d.
• Let z₁, …, zₙ be input vectors from the Transformer decoder; zᵢ ∈ ℝ^d.
• Then keys and values are drawn from the encoder (like a memory):
   • kᵢ = K hᵢ , vᵢ = V hᵢ .
• And the queries are drawn from the decoder, qᵢ = Q zᵢ .

[Figure: the decoder states z₁, …, zₙ attend over the encoder outputs h₁, …, hₙ.]
36
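A single-head, unscaled sketch of cross-attention as defined above (weight names are hypothetical):

import torch

def cross_attention(Z, H, W_q, W_k, W_v):
    """Z: (n_dec, d) decoder vectors; H: (n_enc, d) encoder outputs.
    Queries come from the decoder; keys and values come from the encoder."""
    queries = Z @ W_q                      # q_i = Q z_i
    keys    = H @ W_k                      # k_i = K h_i
    values  = H @ W_v                      # v_i = V h_i
    alpha = torch.softmax(queries @ keys.T, dim=-1)   # (n_dec, n_enc)
    return alpha @ values                  # each decoder position reads the encoder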
Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

38
Great Results with Transformers
First, Machine Translation from the original Transformers paper!

Not just better Machine Translation BLEU scores; also more efficient to train!
39 [Test sets: WMT 2014 English-German and English-French] [Vaswani et al., 2017]
Great Results with Transformers
Next, document generation!

[Figure: “The old standard” vs. “Transformers all the way down.”]

40 [Liu et al., 2018]; WikiSum dataset


Great Results with Transformers
Before too long, most Transformers results also included pretraining, a method we’ll
go over on Thursday.
Transformers’ parallelizability allows for efficient pretraining, and has made them
the de-facto standard.

On this popular aggregate benchmark, for example: all top models are Transformer (and pretraining)-based.

More results Thursday when we discuss pretraining.


41 [Liu et al., 2018]
Outline
1. From recurrence (RNN) to attention-based NLP models
2. Introducing the Transformer model
3. Great results with Transformers
4. Drawbacks and variants of Transformers

42
What would we like to fix about the Transformer?

• Quadratic compute in self-attention (today):


• Computing all pairs of interactions means our computation grows
quadratically with the sequence length!
• For recurrent models, it only grew linearly!
• Position representations:
• Are simple absolute indices the best we can do to represent position?
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position [Wang et al., 2019]

43
Quadratic computation as a function of sequence length

• One of the benefits of self-attention over recurrence was that it’s highly
parallelizable.
• However, its total number of operations grows as 𝑂 𝑛2 𝑑 , where 𝑛 is the
sequence length, and 𝑑 is the dimensionality.

   XQ (XK)⊤ = XQK⊤X⊤ ∈ ℝ^{n×n}: need to compute all pairs of interactions! O(n²d)

• Think of d as around 1,000 (though for large language models it’s much larger!).
• So, for a single (shortish) sentence, n ≤ 30; n² ≤ 900.
• In practice, we set a bound like 𝑛 = 512.
• But what if we’d like 𝒏 ≥ 𝟓𝟎, 𝟎𝟎𝟎? For example, to work on long documents?

44
Work on improving on quadratic self-attention cost

• Considerable recent work has gone into the question, Can we build models like
Transformers without paying the 𝑂 𝑇 2 all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020]

Key idea: map the sequence length dimension to a lower-dimensional space for values, keys.

[Figure: inference time (s) vs. sequence length / batch size.]

45
Do we even need to remove the quadratic cost of attention?

• As Transformers grow larger, a larger and larger percent of compute is outside the self-attention portion, despite the quadratic cost.
• In practice, almost no large Transformer language models use anything but the
quadratic cost attention we’ve presented here.
• The cheaper methods tend not to work as well at scale.
• So, is there no point in trying to design cheaper alternatives to self-attention?
• Or would we unlock much better models with much longer contexts (>100k
tokens?) if we were to do it right?

46
Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve
performance."

47
Natural Language Processing
with Deep Learning
CS224N/Ling284

John Hewitt
Lecture 9: Pretraining
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Decoders
2. Encoders
3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Reminders:
Assignment 5 is out on Thursday! It covers lecture 8 and lecture 9 (today)!
It has ~pedagogically relevant math~
2
Word structure and subword models
Let’s take a look at the assumptions we’ve made about a language’s vocabulary.

We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.

word → vocab mapping → embedding

Common words:   hat → pizza (index);  learn → tasty (index)
Variations:     taaaaasty → UNK (index)
Misspellings:   laern → UNK (index)
Novel items:    Transformerify → UNK (index)

3
Word structure and subword models
Finite vocabulary assumptions make even less sense in many languages.
• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.

Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information. (Tense, mood, definiteness, negation, information about the object, ++)

Here’s a small fraction of the conjugations for ambia – to tell.

4 [Wiktionary]
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about
structure below the word level. (Parts of words, characters, bytes.)
• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.

Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary.


1. Start with a vocabulary containing only characters and an “end-of-word” symbol.
2. Using a corpus of text, find the most common adjacent characters “a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.

Originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained
models.

5 [Sennrich et al., 2016, Wu et al., 2016]
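A minimal sketch of the BPE training loop described in steps 1-3 above (illustrative, not the reference implementation):

from collections import Counter

def bpe_merges(corpus_words, num_merges):
    """Learn BPE merges from a list of words; returns the merges in order."""
    # represent each word as a tuple of symbols, ending with an end-of-word marker
    vocab = Counter(tuple(word) + ('</w>',) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():     # replace the pair with a merged symbol
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# e.g. bpe_merges(["low", "lower", "newest", "widest"] * 10, num_merges=10)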


Word structure and subword models
Common words end up being a part of the subword vocabulary, while rarer words are split
into (sometimes intuitive, sometimes not) components.

In the worst case, words are split into as many subwords as they have characters.

word → vocab mapping → embedding

Common words:   hat → hat;  learn → learn
Variations:     taaaaasty → taa## aaa## sty
Misspellings:   laern → la## ern##
Novel items:    Transformerify → Transformer## ify

6
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?

7
Motivating word meaning and context
Recall the adage we mentioned at the beginning of the course:

“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)

This quote is a summary of distributional semantics, and motivated word2vec. But:

“… the complete meaning of a word is always contextual,


and no study of meaning apart from a complete context
can be taken seriously.” (J. R. Firth 1935)

Consider I record the record: the two instances of record mean different things.

8 [Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]
Where we were: pretrained word embeddings
Circa 2017:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.

Some issues to think about:
• The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
• Most of the parameters in our network are randomly initialized!

[Figure: ŷ predicted by a “not pretrained” network on top of pretrained word embeddings for “… the movie was …”. Recall, movie gets the same word embedding, no matter what sentence it shows up in.]

9
Where we’re going: pretraining whole models
In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.

• This has been exceptionally effective at building strong:
   • representations of language
   • parameter initializations for strong NLP models
   • probability distributions over language that we can sample from

[Figure: ŷ predicted by a network that is pretrained jointly, over “… the movie was …”. This model has learned how to represent entire sentences through pretraining.]
10
What can we learn from reconstructing the input?

Stanford University is located in __________, California.

11
What can we learn from reconstructing the input?

I put ___ fork down on the table.

12
What can we learn from reconstructing the input?

The woman walked across the street,


checking for traffic over ___ shoulder.

13
What can we learn from reconstructing the input?

I went to the ocean to see the fish, turtles, seals, and _____.

14
What can we learn from reconstructing the input?

Overall, the value I got from the two hours watching


it was the sum total of the popcorn and the drink.
The movie was ___.

15
What can we learn from reconstructing the input?

Iroh went into the kitchen to make some tea.


Standing next to Iroh, Zuko pondered his destiny.
Zuko left the ______.

16
What can we learn from reconstructing the input?

I was thinking about the sequence that goes


1, 1, 2, 3, 5, 8, 13, 21, ____

17
Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
• Model p_θ(w_t | w_{1:t−1}), the probability distribution over words given their past contexts.
• There’s lots of data for this! (In English.)

Pretraining through language modeling:
• Train a neural network to perform language modeling on a large amount of text.
• Save the network parameters.

[Figure: a Decoder (Transformer, LSTM, ++) reads “Iroh goes to make tasty tea” and predicts “goes to make tasty tea END”.]

18
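A sketch of this objective as a training loss, assuming a hypothetical decoder that maps token ids to next-token logits (teacher forcing: inputs are the tokens, targets are the tokens shifted by one):

import torch
import torch.nn.functional as F

def language_modeling_loss(decoder, token_ids):
    """decoder: any model mapping (batch, T) token ids to (batch, T, |V|) logits
    (a hypothetical interface, not a specific library class)."""
    inputs  = token_ids[:, :-1]              # e.g. "Iroh goes to make tasty"
    targets = token_ids[:, 1:]               # e.g. "goes to make tasty tea"
    logits = decoder(inputs)                 # (batch, T-1, |V|)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))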
The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.

Step 1: Pretrain (on language modeling). Lots of text; learn general things!
   [Figure: a Decoder (Transformer, LSTM, ++) reads “Iroh goes to make tasty tea” and predicts “goes to make tasty tea END”.]

Step 2: Finetune (on your task). Not many labels; adapt to the task!
   [Figure: the same network now reads “… the movie was …” and predicts a label (☺/☹).]

19
Stochastic gradient descent and pretrain/finetune
Why should pretraining and finetuning help, from a “training neural nets” perspective?

• Consider: pretraining provides parameters θ̂ by approximating min_θ ℒ_pretrain(θ).
   • (The pretraining loss.)
• Then, finetuning approximates min_θ ℒ_finetune(θ), starting at θ̂.
   • (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to θ̂ during finetuning.
• So, maybe the finetuning local minima near θ̂ tend to generalize well!
• And/or, maybe the gradients of finetuning loss near θ̂ propagate nicely!

20
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?

21
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What’s the best way to pretrain them?

Decoders:
• Language models! What we’ve seen so far.
• Nice to generate from; can’t condition on future words.

22
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What’s the best way to pretrain them?

Decoders:
• Language models! What we’ve seen so far.
• Nice to generate from; can’t condition on future words.

23
Pretraining encoders: what pretraining objective to use?
So far, we’ve looked at language model pretraining. But encoders get bidirectional
context, so we can’t do language modeling!

Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

   h₁, …, h_T = Encoder(w₁, …, w_T)
   yᵢ ∼ A hᵢ + b

Only add loss terms from words that are “masked out.” If x̃ is the masked version of x, we’re learning p_θ(x | x̃). Called Masked LM.

[Figure: the encoder (with output parameters A, b) reads “I [M] to the [M]” and predicts the masked words “went” and “store”.]
[Devlin et al., 2018]

24
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the “Masked LM” objective and released the weights of a
pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT:


• Predict a random 15% of (sub)word tokens.
   • Replace input word with [MASK] 80% of the time
   • Replace input word with a random token 10% of the time
   • Leave input word unchanged 10% of the time (but still predict it!)
• Why? Doesn’t let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)

[Figure: the Transformer Encoder reads “I pizza to the [M]” ([Replaced] [Not replaced] [Masked]) and is trained to predict “went”, “to”, “store” at those positions.]

25 [Devlin et al., 2018]
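A sketch of this 80/10/10 corruption rule (illustrative, not BERT's reference code; the -100 label value matches PyTorch's default ignore_index for cross-entropy):

import torch

def mask_for_mlm(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Returns (corrupted_ids, labels); labels are -100 where no loss is taken."""
    ids = token_ids.clone()
    labels = token_ids.clone()
    selected = torch.rand_like(ids, dtype=torch.float) < mask_prob  # predict these
    labels[~selected] = -100                       # loss only on selected positions

    roll = torch.rand_like(ids, dtype=torch.float)
    use_mask   = selected & (roll < 0.8)                   # 80%: [MASK]
    use_random = selected & (roll >= 0.8) & (roll < 0.9)   # 10%: random token
    # remaining 10%: leave the input token unchanged (but still predict it)
    ids[use_mask] = mask_id
    ids[use_random] = torch.randint(vocab_size, ids.shape, device=ids.device)[use_random]
    return ids, labels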


BERT: Bidirectional Encoder Representations from Transformers

• The pretraining input to BERT was two separate contiguous chunks of text:

• BERT was trained to predict whether one chunk follows the other or is randomly
sampled.
• Later work has argued this “next sentence prediction” is not necessary.

26 [Devlin et al., 2018, Liu et al., 2019]


BERT: Bidirectional Encoder Representations from Transformers
Details about BERT
• Two models were released:
• BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
• BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
• Trained on:
• BooksCorpus (800 million words)
• English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
• BERT was pretrained with 64 TPU chips for a total of 4 days.
• (TPUs are special tensor operation acceleration hardware)
• Finetuning is practical and common on a single GPU
• “Pretrain once, finetune many times.”

27 [Devlin et al., 2018]


BERT: Bidirectional Encoder Representations from Transformers
BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-
the-art results on a broad range of tasks.
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: corpus of linguistic acceptability (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Microsoft paraphrase corpus
• RTE: a small natural language inference corpus

28 [Devlin et al., 2018]


Limitations of pretrained encoders
Those results looked great! Why not use pretrained encoders for everything?

If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don’t naturally lead to nice autoregressive (1-word-at-a-time) generation methods.

[Figure: a Pretrained Encoder fills in “Iroh goes to [MASK] tasty tea” with make/brew/craft; a Pretrained Decoder reads “Iroh goes to make tasty tea” and generates “goes to make tasty tea END”.]

29
Extensions of BERT
You’ll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task

[Figure: masking comparison on the subwords of “irresistibly”: BERT replaces scattered subword tokens with [MASK], while SpanBERT masks a contiguous span (irr## esi## sti## bly).]

30 [Liu et al., 2019; Joshi et al., 2020]


Extensions of BERT
A takeaway from the RoBERTa paper: more compute, more data can improve pretraining
even when not changing the underlying Transformer encoder.

31 [Liu et al., 2019; Joshi et al., 2020]


Full Finetuning vs. Parameter-Efficient Finetuning
Finetuning every parameter in a pretrained model works well, but is memory-intensive.
But lightweight finetuning methods adapt pretrained models in a constrained way.
Leads to less overfitting and/or more efficient finetuning and inference.

Full Finetuning: adapt all parameters.
Lightweight Finetuning: train a few existing or new parameters.

[Figure: both approaches feed “… the movie was …” through a (Transformer, LSTM, ++) network to predict a label (☺/☹); they differ only in which parameters are updated.]
32 [Liu et al., 2019; Joshi et al., 2020]
Parameter-Efficient Finetuning: Prefix-Tuning, Prompt tuning
Prefix-Tuning adds a prefix of parameters, and freezes all pretrained parameters.
The prefix is processed by the model just like real words would be.
Advantage: each element of a batch at inference could run a different tuned model.

[Figure: learnable prefix parameters are prepended to “… the movie was …”; the frozen (Transformer, LSTM, ++) network processes them just like real words and predicts a label (☺/☹).]
33 [Li and Liang, 2021; Lester et al., 2021]
Parameter-Efficient Finetuning: Low-Rank Adaptation
Low-Rank Adaptation learns a low-rank “diff” between the pretrained and finetuned weight matrices.
Easier to learn than prefix-tuning.

[Figure: a pretrained weight matrix W ∈ ℝ^{d×d} is adapted as W + AB, with A ∈ ℝ^{d×k} and B ∈ ℝ^{k×d} the only trained parameters; the (Transformer, LSTM, ++) network reads “… the movie was …” and predicts a label (☺/☹).]

34 [Hu et al., 2021]
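A sketch of the W + AB idea for one linear layer (illustrative; the published LoRA recipe additionally applies this only to selected attention matrices and scales the update):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained linear layer so its effective weight is W + A B,
    with only A and B trained."""
    def __init__(self, pretrained_linear: nn.Linear, rank: int = 8):
        super().__init__()
        d_out, d_in = pretrained_linear.weight.shape
        self.base = pretrained_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained W (and bias)
        self.A = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_in))  # start as a zero "diff"

    def forward(self, x):
        return self.base(x) + x @ (self.A @ self.B).T   # (W + A B) x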


Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What’s the best way to pretrain them?

Decoders:
• Language models! What we’ve seen so far.
• Nice to generate from; can’t condition on future words.

35
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a
prefix of every input is provided to the encoder and is not predicted.
   h₁, …, h_T = Encoder(w₁, …, w_T)
   h_{T+1}, …, h_{2T} = Decoder(w_{T+1}, …, w_{2T}, h₁, …, h_T)
   yᵢ ∼ A hᵢ + b,  for i > T

The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

[Figure: the encoder reads w₁, …, w_T; the decoder reads w_{T+1}, …, w_{2T} and predicts w_{T+2}, ….]
[Raffel et al., 2018]

36
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2018 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!

This is implemented in text preprocessing: it’s still an objective that looks like language modeling at the decoder side.

37
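A rough sketch of span corruption as text preprocessing (probabilities, span lengths, and the <extra_id_N> sentinel naming are illustrative):

import random

def span_corruption(tokens, corrupt_prob=0.15, mean_span_len=3):
    """Replace spans with sentinel tokens; the target decodes out the removed spans."""
    source, target, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if random.random() < corrupt_prob / mean_span_len:
            span_len = max(1, int(random.gauss(mean_span_len, 1)))
            marker = f"<extra_id_{sentinel}>"
            source.append(marker)
            target.append(marker)
            target.extend(tokens[i:i + span_len])   # decode out the removed span
            sentinel += 1
            i += span_len
        else:
            source.append(tokens[i])
            i += 1
    target.append(f"<extra_id_{sentinel}>")         # final sentinel ends the target
    return source, target

# span_corruption("Thank you for inviting me to your party last week".split())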
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2018 found encoder-decoders to work better than decoders for their tasks,
and span corruption (denoising) to work better than language modeling.
Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property
of T5: it can be
finetuned to answer a
wide range of
questions, retrieving
knowledge from its
parameters.

[Figure: performance on NQ (Natural Questions), WQ (WebQuestions), and TQA (Trivia QA), all “open-domain” versions, for T5 models with 220 million, 770 million, 3 billion, and 11 billion params.]
[Raffel et al., 2018]
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What’s the best way to pretrain them?

Decoders:
• Language models! What we’ve seen so far.
• Nice to generate from; can’t condition on future words.
• All the biggest pretrained models are Decoders.
40
Pretraining decoders
When using language model pretrained decoders, we can ignore that they were trained to model p(w_t | w_{1:t−1}).

We can finetune them by training a classifier on the last word’s hidden state:

   h₁, …, h_T = Decoder(w₁, …, w_T)
   y ∼ A h_T + b

where A and b are randomly initialized and specified by the downstream task. Gradients backpropagate through the whole network.

[Figure: the decoder reads w₁, …, w_T; a Linear layer (A, b) on h_T predicts a label (☺/☹). Note how the linear layer hasn’t been pretrained and must be learned from scratch.]

41
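A sketch of this setup, assuming a hypothetical pretrained decoder that returns hidden states of shape (batch, T, d):

import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Put a randomly initialized linear classifier on the last token's hidden state."""
    def __init__(self, pretrained_decoder, d, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder             # maps (batch, T) -> (batch, T, d)
        self.classifier = nn.Linear(d, num_classes)   # A, b: learned from scratch

    def forward(self, token_ids):
        hidden = self.decoder(token_ids)              # h_1, ..., h_T
        return self.classifier(hidden[:, -1, :])      # y ~ A h_T + b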
Pretraining decoders
It’s natural to pretrain decoders as language models and then use them as generators, finetuning their p_θ(w_t | w_{1:t−1})!

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)

   h₁, …, h_T = Decoder(w₁, …, w_T)
   w_t ∼ A h_{t−1} + b

where A, b were pretrained in the language model!

[Figure: the decoder reads w₁ w₂ w₃ w₄ w₅ and predicts w₂ w₃ w₄ w₅ w₆ through the output layer (A, b). Note how the linear layer has been pretrained.]
42
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
2018’s GPT was a big success in pretraining a decoder!
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges
• Trained on BooksCorpus: over 7000 unique books.
• Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym “GPT” never showed up in the original paper; it could stand for
“Generative PreTraining” or “Generative Pretrained Transformer”

43 [Radford et al., 2018]


Generative Pretrained Transformer (GPT) [Radford et al., 2018]
How do we format inputs to our decoder for finetuning tasks?

Natural Language Inference: Label pairs of sentences as entailing/contradictory/neutral


Premise: The man is in the doorway
Hypothesis: The person is near the door
Label: entailment

Radford et al., 2018 evaluate on natural language inference.


Here’s roughly how the input was formatted, as a sequence of tokens for the decoder.

[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the [EXTRACT] token.

44
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
GPT results on various natural language inference datasets.

45
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language models.
GPT-2, a larger version (1.5B) of GPT trained on more data, was shown to produce relatively
convincing samples of natural language.
GPT-3, In-context learning, and very large models
So far, we’ve interacted with pretrained models in two ways:
• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.
GPT-3 has 175 billion parameters.

47
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

The in-context examples seem to specify the task to be performed, and the conditional
distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
“ thanks -> merci
hello -> bonjour
mint -> menthe
otter -> ”
Output (conditional generations):
loutre…”
48
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

49
Scaling Efficiency: how do we best use our compute
GPT-3 was 175B parameters and trained on 300B tokens of text.
Roughly, the cost of training a large transformer scales as parameters*tokens
Did OpenAI strike the right parameter-token balance to get the best model? No.

This 70B parameter model is better than the much larger other models!
50
The prefix as task specification and scratch pad: chain-of-thought

51
[Wei et al., 2023]
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?

52
What kinds of things does pretraining teach?
There’s increasing evidence that pretrained models learn a wide variety of things about
the statistical properties of language. Taking our examples from the start of class:
• Stanford University is located in __________, California. [Trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn
and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his
destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don’t learn the Fibonacci sequence]
• Models also learn – and can exacerbate – racism, sexism, and all manner of bad biases.
• More on all this in the interpretability lecture!
53
