
Natural Language Processing

with Deep Learning


CS224N/Ling284

Anna Goldie
Lecture 8: Transformers
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers

2
Lecture Plan
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers

3
Transformers: Is Attention All We Need?
• Last lecture, we learned that attention dramatically improves the performance of
recurrent neural networks.
• Today, we will take this one step further and ask Is Attention All We Need?

4
Transformers: Is Attention All We Need?
• Last lecture, we learned that attention dramatically improves the performance of
recurrent neural networks.
• Today, we will take this one step further and ask Is Attention All We Need?
• Spoiler: Not Quite!

5
Transformers Have Revolutionized the Field of NLP
• By the end of this lecture, you will deeply understand the neural architecture that
underpins virtually every state-of-the-art NLP model today!

[Figure: the Transformer encoder-decoder architecture: Inputs and Outputs (shifted right) pass through embeddings plus positional encodings; the encoder stacks Multi-Head Attention and Feed Forward sublayers (each with Add & Norm), repeated 6x (# of layers); the decoder stacks Masked Multi-Head Attention, Multi-Head Attention over the encoder, and Feed Forward sublayers, repeated 6x; a final Linear layer and Softmax produce the Output Probabilities.]

6
[Vaswani et al., 2017]
Great Results with Transformers: Machine Translation
First, Machine Translation results from the original Transformers paper!

Not just better Machine Translation BLEU scores; also more efficient to train!
7 [Test sets: WMT 2014 English-German and English-French] [Vaswani et al., 2017]
Great Results with Transformers: SuperGLUE
SuperGLUE is a suite of challenging NLP tasks, including question-answering, word sense
disambiguation, coreference resolution, and natural language inference.

8 [Test sets: SuperGLUE Leaderboard Version: 2.0] [Wang et al., 2019]
Great Results with Transformers: Rise of Large Language Models!
Today, Transformer-based models dominate LMSYS Chatbot Arena Leaderboard!

9 [Chiang et al., 2024]


Transformers Even Show Promise Outside of NLP

10
Transformers Even Show Promise Outside of NLP
Protein Folding

[Jumper et al. 2021] aka AlphaFold2!

11
Transformers Even Show Promise Outside of NLP
Protein Folding

Image Classification
[Dosovitskiy et al. 2020]: Vision Transformer (ViT) outperforms
ResNet-based baselines with substantially less compute.

[Jumper et al. 2021] aka AlphaFold2!

12
Transformers Even Show Promise Outside of NLP
Protein Folding

Image Classification
[Dosovitskiy et al. 2020]: Vision Transformer (ViT) outperforms
ResNet-based baselines with substantially less compute.

ML for Systems
[Zhou et al. 2020]: A Transformer-based
compiler model (GO-one) speeds up a
Transformer model!

[Jumper et al. 2021] aka AlphaFold2!

13
Scaling Laws: Are Transformers All We Need?
• With Transformers, language modeling performance improves smoothly as we increase
model size, training data, and compute resources in tandem.
• This power-law relationship has been observed over multiple orders of magnitude with
no sign of slowing!
• If we keep scaling up these models (with no change to the architecture), could they
eventually match or exceed human-level performance?

[Kaplan et al., 2020]


14
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers

15
As of last lecture: recurrent models for (most) NLP!

• Circa 2016, the de facto strategy in NLP was to encode sentences with a bidirectional LSTM (for example, the source sentence in a translation).

• Define your output (parse, sentence, summary) as a sequence, and use an LSTM to generate it.

• Use attention to allow flexible access to memory.

16
Why Move Beyond Recurrence?
Motivation for Transformer Architecture
The Transformer authors had three desiderata when designing this architecture:
1. Minimize (or at least not increase) computational complexity per layer.
2. Minimize path length between any pair of words to facilitate learning of long-range
dependencies.
3. Maximize the amount of computation that can be parallelized.

17
[Vaswani et al., 2017]
1. Transformer Motivation: Computational Complexity Per Layer
When sequence length (n) << representation dimension (d), complexity per layer is lower for a Transformer
compared to the recurrent models we’ve learned about so far.

Table 1 of the Transformer paper (n = sequence length, d = representation dimension, k = convolution kernel size):

    Layer Type        Complexity per Layer    Sequential Operations    Maximum Path Length
    Self-Attention    O(n² · d)               O(1)                     O(1)
    Recurrent         O(n · d²)               O(n)                     O(n)
    Convolutional     O(k · n · d²)           O(1)                     O(log_k(n))

18
[Vaswani et al., 2017]
2. Transformer Motivation: Minimize Linear Interaction Distance

• RNNs are unrolled "left-to-right".
• This encodes linear locality: a useful heuristic!
  • Nearby words often affect each other's meanings (e.g., "tasty pizza").
• Problem: RNNs take O(sequence length) steps for distant word pairs to interact.

[Figure: an RNN unrolled over "The chef who … ate", with an O(sequence length) path between the distant words.]


19
2. Transformer Motivation: Minimize Linear Interaction Distance

• O(sequence length) steps for distant word pairs to interact means:
  • It is hard to learn long-distance dependencies (because of gradient problems!)
  • The linear order of words is "baked in"; we already know sequential structure doesn't tell the whole story...

[Figure: "The chef who … ate": the information about "chef" has gone through O(sequence length) many layers before it can interact with "ate".]
20
3. Transformer Motivation: Maximize Parallelizability

• Forward and backward passes have O(seq length) unparallelizable operations


• GPUs (and TPUs) can perform many independent computations at once!
• But future RNN hidden states can’t be computed in full before past RNN hidden
states have been computed
• Inhibits training on very large datasets!
• Particularly problematic as sequence length increases, as we can no longer batch
many examples together due to memory limitations

[Figure: an RNN's hidden states h1, h2, …, hT, each annotated with the minimum number of steps (0, 1, 2, …, T-1) that must complete before that state can be computed.]


21
High-Level Architecture: Transformer is all about (Self) Attention

• To recap, attention treats each word's representation as a query to access and incorporate information from a set of values.
• Last lecture, we saw attention from the decoder to the encoder in a recurrent sequence-to-sequence model.
• Self-attention is encoder-encoder (or decoder-decoder) attention, where each word attends to each other word within the input (or output).

[Figure: a stack of self-attention layers over embeddings h1, …, hT; in each attention layer, all words attend to all words in the previous layer (most arrows omitted here).]
22
Computational Dependencies for Recurrence vs. Attention
[Figure: computational dependency graphs for an RNN-based encoder-decoder model with attention vs. a Transformer-based encoder-decoder model.]

23
Computational Dependencies for Recurrence vs. Attention
Transformer Advantages:
• The number of unparallelizable operations does not increase with sequence length.
• Each "word" interacts with every other word, so the maximum interaction distance is O(1).

[Figure: the same dependency graphs: RNN-based encoder-decoder with attention vs. Transformer-based encoder-decoder.]

24
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers

25
The Transformer Encoder-Decoder [Vaswani et al., 2017]
In this section, you will learn exactly how the Transformer architecture works:
• First, we will talk about the Encoder!
• Next, we will go through the Decoder (which is quite similar)!

[Figure: the full Transformer encoder-decoder architecture diagram (Vaswani et al., 2017).]
26
Encoder: Self-Attention
Self-Attention is the core building block of the Transformer, so let's first focus on that!

[Figure: a simplified encoder-decoder diagram highlighting the Self-Attention blocks above the Input and Output Embeddings.]

27
Intuition for Attention Mechanism
▪ Let's think of attention as a "fuzzy" or approximate hashtable:
▪ To look up a value, we compare a query against keys in a table.
▪ In a hashtable (shown on the bottom left):
▪ Each query (hash) maps to exactly one key-value pair.
▪ In (self-)attention (shown on the bottom right):
▪ Each query matches each key to varying degrees.
▪ We return a sum of values weighted by the query-key match.

[Figure: left, a hashtable, where the query q maps to exactly one of the key-value pairs k0:v0 … k7:v7; right, attention, where q matches every key to a varying degree and the values are combined by a weighted sum.]
28
Recipe for Self-Attention in the Transformer Encoder
▪ Step 1: For each word x_i, calculate its query, key, and value:
    q_i = Q x_i,   k_i = K x_i,   v_i = V x_i
▪ Step 2: Calculate attention scores between the query and all keys:
    e_ij = q_i · k_j
▪ Step 3: Take the softmax to normalize the attention scores:
    α_ij = exp(e_ij) / Σ_j' exp(e_ij')
▪ Step 4: Take a weighted sum of values:
    output_i = Σ_j α_ij v_j

29
Recipe for (Vectorized) Self-Attention in the Transformer Encoder
▪ Step 1: With word embeddings stacked in X ∈ ℝ^(T×d), calculate queries, keys, and values:
    queries = XQ,   keys = XK,   values = XV
▪ Step 2: Calculate all pairs of attention scores:
    E = (XQ)(XK)^T
▪ Step 3: Take the softmax to normalize the attention scores (row-wise):
    A = softmax(E)
▪ Step 4: Take a weighted sum of values:
    Output = A (XV)

30
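A minimal NumPy sketch of the four vectorized steps above, under a few assumptions: no scaling, masking, or multiple heads yet (those come later in the lecture), and the projection matrices are named Wq, Wk, Wv here rather than Q, K, V to avoid clashing with the query/key/value matrices themselves.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Basic (unscaled, unmasked) self-attention over embeddings X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # Step 1: queries, keys, values
    E = Q @ K.T                        # Step 2: all-pairs attention scores, shape (T, T)
    A = softmax(E, axis=-1)            # Step 3: normalize scores over the keys
    return A @ V                       # Step 4: weighted sum of values, shape (T, d)

# Toy example: 5 "words", model dimension 8.
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)
```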
What We Have So Far: (Encoder) Self-Attention!

[Figure: the simplified encoder-decoder diagram again, with Self-Attention sitting above the Input and Output Embeddings.]

31
But attention isn't quite all you need!
• Problem: Since there are no element-wise non-linearities, self-
attention is simply performing a re-averaging of the value vectors.
• Easy fix: Apply a feedforward layer to the output of attention, providing non-linear activation (and additional expressive power).

Equation for the Feed Forward Layer:
    FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

[Figure: the simplified diagram with a Feed Forward block stacked on top of Self-Attention in the encoder.]

32
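As a sketch of that position-wise feed-forward sublayer, assuming the original paper's ReLU form with a wider hidden dimension d_ff (the 4x width below is a convention, not something stated on the slide):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: FFN(x) = max(0, x W1 + b1) W2 + b2.

    x: (T, d) attention outputs; W1: (d, d_ff); W2: (d_ff, d).
    The same weights are applied independently at every position (row of x).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, d, d_ff = 5, 8, 32                       # d_ff = 4 * d, following common practice
x = rng.normal(size=(T, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (5, 8)
```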
But how do we make this work for deep networks?
Training Trick #1: Residual Connections
Training Trick #2: LayerNorm
Training Trick #3: Scaled Dot Product Attention

[Figure: the simplified diagram, now with Self-Attention and Feed Forward blocks repeated 6x (# of layers) in both the encoder and the decoder.]

33
Training Trick #1: Residual Connections [He et al., 2016]
• Residual connections are a simple but powerful technique from computer vision.
  • Deep networks are surprisingly bad at learning the identity function!
• Therefore, directly passing "raw" embeddings to the next layer can actually be very helpful!
  • This prevents the network from "forgetting" or distorting important information as it is processed by many layers.
• Residual connections are also thought to smooth the loss landscape and make training easier!

[Figure: the simplified diagram with Add blocks after Self-Attention and Feed Forward, so each sublayer's input is added to its output.]
34
Training Trick #2: Layer Normalization [Ba et al., 2016]
• Problem: It is difficult to train the parameters of a given layer because its input from the layer beneath keeps shifting.
• Solution: Reduce variation by normalizing to zero mean and standard deviation of one within each layer:

    Mean: μ = (1/d) Σ_j x_j        Standard Deviation: σ = sqrt( (1/d) Σ_j (x_j − μ)² )

[Figure: the simplified diagram, with Add & Norm blocks after Self-Attention and Feed Forward in both the encoder and the decoder (repeated 6x).]

35
Training Trick #2: Layer Normalization [Ba et al., 2016]
[Figure: an example of how LayerNorm works, normalizing each vector to zero mean and unit standard deviation (image by Bala Priya C, Pinecone), alongside the architecture diagram with its Add & Norm blocks.]
36
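A minimal sketch of that normalization, assuming the usual learnable gain (gamma) and bias (beta) and a small epsilon for numerical stability; these extras are standard LayerNorm details rather than anything shown above.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x to zero mean and unit std, then rescale and shift.

    x: (T, d); gamma, beta: (d,) learnable gain and bias.
    """
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(5, 8))   # badly scaled activations
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1).round(6))   # ~0 per row
print(out.std(axis=-1).round(3))    # ~1 per row
```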
Training Trick #3: Scaled Dot Product Attention
• After LayerNorm, the mean and variance of the vector elements are 0 and 1, respectively. (Yay!)
• However, the dot product still tends to take on extreme values, as its variance scales with the dimensionality d_k.

Quick Statistics Review (for a dot product of two d_k-dimensional vectors with zero-mean, unit-variance entries):
• Mean of a sum = sum of the means = 0
• Variance of a sum = sum of the variances = d_k
• To set the variance to 1, simply divide by sqrt(d_k)!

Updated Self-Attention Equation:
    Output = softmax( (XQ)(XK)^T / sqrt(d_k) ) (XV)

[Figure: the simplified diagram, with the attention blocks now labeled Scaled Attention.]

37
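A quick numerical check of the variance argument, followed by the scaled attention itself; this is a sketch in which the 1/sqrt(d_k) factor is the only change relative to the earlier self-attention code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Dot products of zero-mean, unit-variance random vectors have variance ~ d_k ...
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=-1)
print(dots.var())                      # ~64 (= d_k)
print((dots / np.sqrt(d_k)).var())     # ~1 after scaling

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q, K: (T, d_k) and V: (T, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```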
Major issue!
• We're almost done with the
Encoder, but we have a
major problem! Has anyone
spotted it?
• Consider this sentence:
• "Man eats small dinosaur."

[Figure: a Transformer-based encoder-decoder model reading the sentence "Man eats small dinosaur".]

38
Major issue!
• We're almost done with the
Encoder, but we have a
major problem! Has anyone
spotted it?
• Consider this sentence:
• "Man eats small dinosaur."
• Wait a minute, order doesn't impact the network at all!
• This seems wrong given that word order does have meaning in many languages, including English!

[Figure: the same Transformer-based encoder-decoder model reading "Man eats small dinosaur".]

39
Solution: Inject Order Information through Positional Encodings!
[Figure: the simplified diagram, now with a Positional Encoding added (+) to the Input and Output Embeddings before the encoder and decoder stacks.]

40
Fixing the first self-attention problem: sequence order

• Since self-attention doesn’t build in order information, we need to encode the order of the
sentence in our keys, queries, and values.
• Consider representing each sequence index as a vector:
    p_i ∈ ℝ^d, for i ∈ {1, 2, …, T}, are position vectors
• Don't worry about what the p_i are made of yet!
• It is easy to incorporate this info into our self-attention block: just add the p_i to our inputs!
• Let ṽ_i, k̃_i, q̃_i be our old values, keys, and queries. Then:
    v_i = ṽ_i + p_i
    q_i = q̃_i + p_i
    k_i = k̃_i + p_i
  In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…
41
Position representation vectors through sinusoids (original)

• Sinusoidal position representations: concatenate sinusoidal functions of varying periods:

    p_i = [ sin(i / 10000^(2·1/d)),  cos(i / 10000^(2·1/d)),  …,  sin(i / 10000^(2·(d/2)/d)),  cos(i / 10000^(2·(d/2)/d)) ]^T

[Figure: a heatmap of the sinusoidal position encodings, with the dimension on one axis and the index in the sequence on the other.]
• Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Maybe can extrapolate to longer sequences as periods restart
• Cons:
• Not learnable; also the extrapolation doesn’t really work

42 Image: https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/
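A sketch of those sinusoidal encodings in NumPy. The even/odd column layout (sin in even columns, cos in odd) is one common convention; implementations differ on whether the sin/cos pairs are interleaved or concatenated.

```python
import numpy as np

def sinusoidal_positional_encodings(T, d):
    """Return a (T, d) matrix of sinusoidal position vectors p_0, ..., p_{T-1}.

    Assumes d is even. Even columns hold sin(i / 10000^(2j/d)), odd columns the matching cos.
    """
    positions = np.arange(T)[:, None]                     # (T, 1)
    freqs = np.arange(0, d, 2)[None, :]                   # (1, d/2): 0, 2, 4, ...
    angles = positions / np.power(10000.0, freqs / d)     # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encodings(T=50, d=16)
print(pe.shape)   # (50, 16)
# These would simply be added to the input embeddings at the first layer.
```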
Extension: Self-Attention w/ Relative Position Encodings
Key Insight: The most salient position information is the relationship between words (e.g., "cat" is the word before "eat"), rather than their absolute position (e.g., "cat" is word 2).

[Equations: the original self-attention output vs. the relation-aware self-attention output, which folds learned relative-position embeddings into the attention scores and value sums. Table and equations from [Shaw et al., 2018].]


43
Multi-Headed Self-Attention: k heads are better than 1!
• High-Level Idea: Let's perform self-attention multiple times in parallel and combine the results.

[Vaswani et al. 2017]

Wizards of the Coast, Artist: Todd Lockwood

44
The Transformer Encoder: Multi-headed Self-Attention
• What if we want to look in multiple places in the sentence at
once?
• For word i, self-attention "looks" where x_i^T Q^T K x_j is high, but maybe we want to focus on different j for different reasons?
• We'll define multiple attention "heads" through multiple Q, K, V matrices.
• Let Q_ℓ, K_ℓ, V_ℓ ∈ ℝ^(d × d/h), where h is the number of attention heads, and ℓ ranges from 1 to h.
• Each attention head performs attention independently:
    output_ℓ = softmax( X Q_ℓ K_ℓ^T X^T ) X V_ℓ,   where output_ℓ ∈ ℝ^(d/h)
• Then the outputs of all the heads are combined:
    output = Y [output_1; …; output_h],   where Y ∈ ℝ^(d × d)
• Each head gets to "look" at different things, and construct value vectors differently.

(Diagram credit: https://jalammar.github.io/illustrated-transformer/)
45
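A minimal sketch of multi-headed self-attention along those lines: split the model dimension d across h heads, run attention per head, and recombine with the output matrix Y. The per-head scaling by sqrt(d/h) is the standard convention rather than something written on the slide.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Y, h):
    """X: (T, d); Wq, Wk, Wv: lists of h matrices of shape (d, d/h); Y: (d, d)."""
    head_outputs = []
    for l in range(h):
        Q, K, V = X @ Wq[l], X @ Wk[l], X @ Wv[l]        # each (T, d/h)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled dot products
        head_outputs.append(softmax(scores) @ V)         # (T, d/h)
    return np.concatenate(head_outputs, axis=-1) @ Y     # concatenate heads, mix with Y

rng = np.random.default_rng(0)
T, d, h = 5, 16, 4
X = rng.normal(size=(T, d))
Wq = [rng.normal(size=(d, d // h)) for _ in range(h)]
Wk = [rng.normal(size=(d, d // h)) for _ in range(h)]
Wv = [rng.normal(size=(d, d // h)) for _ in range(h)]
Y = rng.normal(size=(d, d))
print(multi_head_self_attention(X, Wq, Wk, Wv, Y, h).shape)   # (5, 16)
```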
Yay, we've completed the Encoder! Time for the Decoder...

[Figure: the architecture diagram with the encoder complete (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm, repeated 6x, with positional encodings added to the input embeddings); the decoder is still to come.]

46
Decoder: Masked Multi-Head Self-Attention
• Problem: How do we keep the decoder
from “cheating”? If we have a language
modeling objective, can't the network
just look ahead and "see" the answer?

[Figure: the Transformer-based encoder-decoder model.]

47
Decoder: Masked Multi-Head Self-Attention
• Problem: How do we keep the decoder
from “cheating”? If we have a language
modeling objective, can't the network
just look ahead and "see" the answer?
• Solution: Masked Multi-Head
Attention. At a high-level, we hide
(mask) information about future
tokens from the model.

[Figure: the Transformer-based encoder-decoder model.]

48
Masking the future in self-attention
• To use self-attention in decoders, we need to ensure we can't peek at the future.
• At every timestep, we could change the set of keys and queries to include only past words. (Inefficient!)
• To enable parallelization, we mask out attention to future words by setting attention scores to −∞:

    e_ij = q_i^T k_j   if j < i
    e_ij = −∞          if j ≥ i

[Figure: a triangular mask over the prefix "[START] The chef who": for encoding each of these words, we can look only at the not-greyed-out (earlier) words, while future positions are set to −∞.]
49
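A sketch of that masking in NumPy, applied to the score matrix before the softmax. Note it allows j <= i (each position may attend to itself), which is the usual implementation choice and slightly more permissive than the strict j < i written above.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder-style self-attention: position i may only attend to positions j <= i."""
    T = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (T, T) attention scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # mask out future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) = 0: no future leakage
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
print(masked_self_attention(Q, K, V).shape)   # (4, 8); row i depends only on rows <= i
```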
Decoder: Masked Multi-Headed Self-Attention
[Figure: the architecture diagram with Masked Multi-Head Attention now in place as the first sublayer of the decoder.]

50
Encoder-Decoder Attention
• We saw that self-attention is when keys, queries, and values come from the same source.
• In the decoder, we have attention that looks more like what we saw last week.
• Let h_1, …, h_T be output vectors from the Transformer encoder; h_i ∈ ℝ^d.
• Let z_1, …, z_T be input vectors from the Transformer decoder; z_i ∈ ℝ^d.
• Then the keys and values are drawn from the encoder (like a memory):
    k_i = K h_i,   v_i = V h_i
• And the queries are drawn from the decoder:
    q_i = Q z_i

[Figure: the architecture diagram with Multi-Head Cross-Attention as the decoder's second sublayer, taking keys and values from the encoder output and queries from the decoder.]

51
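A sketch of that cross-attention, with keys and values computed from the encoder outputs H and queries from the decoder states Z; it is single-headed, uses the slide's K, V, Q naming for the weight matrices, and includes the usual scaling factor.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(H, Z, K, V, Q):
    """H: (T_src, d) encoder outputs; Z: (T_tgt, d) decoder states; K, V, Q: (d, d)."""
    keys, values = H @ K, H @ V          # drawn from the encoder, like a memory
    queries = Z @ Q                      # drawn from the decoder
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])   # (T_tgt, T_src)
    return softmax(scores) @ values      # (T_tgt, d)

rng = np.random.default_rng(0)
d, T_src, T_tgt = 8, 6, 4
H, Z = rng.normal(size=(T_src, d)), rng.normal(size=(T_tgt, d))
K, V, Q = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(H, Z, K, V, Q).shape)   # (4, 8)
```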
Decoder: Finishing touches!

[Figure: the architecture diagram so far: the completed encoder next to a decoder with masked multi-head attention and multi-head (cross-) attention sublayers.]

52
Decoder: Finishing touches!
• Add a feed forward layer (with residual
connections and layer norm)

[Figure: the diagram with Feed Forward and Add & Norm added on top of the decoder's attention sublayers.]

53
Decoder: Finishing touches!
• Add a feed forward layer (with residual
connections and layer norm)
• Add a final linear layer to project the embeddings into a much longer vector of length vocab size (logits)

[Figure: the diagram with a Linear layer added above the decoder stack.]

54
Decoder: Finishing touches!
• Add a feed forward layer (with residual connections and layer norm)
• Add a final linear layer to project the embeddings into a much longer vector of length vocab size (logits)
• Add a final softmax to generate a probability distribution of possible next words!

[Figure: the complete architecture diagram, with Softmax on top of the Linear layer producing the Output Probabilities.]

55
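A sketch of those last two steps, projecting a single decoder output vector to vocabulary-sized logits and normalizing them into a distribution over the next word; W_vocab, b_vocab, and the toy vocabulary size are made-up names for illustration.

```python
import numpy as np

def next_word_distribution(decoder_output, W_vocab, b_vocab):
    """decoder_output: (d,); W_vocab: (d, vocab_size); returns probabilities over the vocab."""
    logits = decoder_output @ W_vocab + b_vocab     # final linear layer
    logits = logits - logits.max()                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                      # final softmax

rng = np.random.default_rng(0)
d, vocab_size = 8, 10_000
decoder_output = rng.normal(size=d)
W_vocab, b_vocab = rng.normal(size=(d, vocab_size)), np.zeros(vocab_size)
probs = next_word_distribution(decoder_output, W_vocab, b_vocab)
print(probs.shape, probs.sum())   # (10000,) ≈ 1.0
```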
Recap of Transformer Architecture
[Figure: the full Transformer encoder-decoder architecture diagram once more (Vaswani et al., 2017).]

56
Outline
1. Impact of Transformers on NLP (and ML more broadly)
2. From Recurrence (RNNs) to Attention-Based NLP Models
3. Understanding the Transformer Model
4. Drawbacks and Variants of Transformers

57
What would we like to fix about the Transformer?

• Quadratic compute in self-attention (today):


• Computing all pairs of interactions means our computation grows
quadratically with the sequence length!
• For recurrent models, it only grew linearly!
• Position representations:
• Are simple absolute indices the best we can do to represent position?
• As we learned: Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position [Wang et al., 2019]
• Rotary Embeddings [Su et al., 2021]

58
Recent work on improving on quadratic self-attention cost

• Considerable recent work has gone into the question, Can we build models like
Transformers without paying the O(T²) all-pairs self-attention cost?
• For example, Linformer [Wang et al., 2020]

Key idea: map the sequence length dimension to a lower-dimensional space for values and keys.

[Figure: inference time (s) vs. sequence length / batch size, with Linformer growing far more slowly than standard self-attention.]

59
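A rough sketch of that idea, not the paper's full method: project the length-T keys and values down to a fixed k rows with (hypothetical) projection matrices E and F before attention, so the score matrix is T x k instead of T x T.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_style_attention(Q, K, V, E, F):
    """Q, K, V: (T, d); E, F: (k, T) projections of the sequence-length dimension."""
    K_proj, V_proj = E @ K, F @ V                      # (k, d): compressed keys and values
    scores = Q @ K_proj.T / np.sqrt(K.shape[-1])       # (T, k) instead of (T, T)
    return softmax(scores) @ V_proj                    # (T, d)

rng = np.random.default_rng(0)
T, d, k = 512, 64, 32
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
E, F = (rng.normal(size=(k, T)) for _ in range(2))
print(linformer_style_attention(Q, K, V, E, F).shape)   # (512, 64)
```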
Recent work on improving on quadratic self-attention cost

• Considerable recent work has gone into the question, Can we build models like
Transformers without paying the O(T²) all-pairs self-attention cost?
• For example, BigBird [Zaheer et al., 2021]

Key idea: replace all-pairs interactions with a family of other interactions, like local
windows, looking at everything, and random interactions.

60
Do Transformer Modifications Transfer?
• "Surprisingly, we find that most modifications do not meaningfully improve
performance."

61
Parting remarks

• Yay, you now understand Transformers!


• Next class, we will see how pre-training can take performance to the next level!
• Good luck on assignment 4!
• Remember to work on your project proposal!

62
