
CS11-711 Advanced NLP

Transformers
Graham Neubig

Site
https://phontron.com/class/anlp2024/
Reminder: Attention
Cross Attention
(Bahdanau et al. 2015)
• Each element in a sequence attends to elements of another sequence

[Figure: each word of “kore wa rei desu” attends to the words of “this is an example”]
Self Attention
(Cheng et al. 2016, Vaswani et al. 2017)
• Each element in the sequence attends to elements of that sequence → context-sensitive encodings!

[Figure: each word of “this is an example” attends to the other words of the same sentence]
Calculating Attention (1)
• Use “query” vector (decoder state) and “key” vectors (all encoder states)
• For each query-key pair, calculate weight
• Normalize to add to one using softmax

[Figure: key vectors are the encoder states for “kono eiga ga kirai”; the query vector is the decoder state after “I hate”; raw scores a1=2.1, a2=-0.1, a3=0.3, a4=-1.0 become α1=0.76, α2=0.08, α3=0.13, α4=0.03 after the softmax]

Calculating Attention (2)
• Combine together the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum, as sketched in the code below

[Figure: value vectors for “kono eiga ga kirai”, each multiplied by its weight α1=0.76, α2=0.08, α3=0.13, α4=0.03 and then summed]
• Use this in any part of the model you like
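
Putting the two slides together, here is a minimal PyTorch sketch of scaled dot-product attention. The √d_k scaling follows Vaswani et al.; the name `attention` matches the helper called in the multi-head code example later, but this particular implementation is an illustrative assumption rather than the course's reference code.

import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    """softmax(QK^T / sqrt(d_k)) V, returning both the output and the weights."""
    d_k = query.size(-1)
    # One score per query-key pair
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Normalize to add to one over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.matmul(weights, value), weights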


Transformers
“Attention is All You Need”
(Vaswani et al. 2017)

• A sequence-to-sequence model based entirely on attention

• Strong results on machine translation

• Fast: only matrix multiplications

[Figure: the Transformer encoder-decoder architecture: input and output embeddings plus positional encodings, Nx encoder blocks (multi-head attention and feed-forward, each followed by add & norm), Nx decoder blocks (masked multi-head attention, cross multi-head attention, and feed-forward, each followed by add & norm), then linear + softmax over output probabilities]
Two Types of Transformers

• Encoder-Decoder Model (e.g. T5, mBART)

• Decoder-Only Model (e.g. GPT, LLaMA)

[Figure: the two architectures side by side: the encoder-decoder Transformer from the previous slide, and a decoder-only model consisting solely of input embedding + positional encoding, Nx blocks of masked multi-head attention and feed-forward layers (each with add & norm), then linear + softmax over output probabilities]
Core Transformer Concepts

• Positional encodings

• Multi-headed attention

• Masked attention

• Residual + layer normalization

• Feed-forward layer
(Review) Inputs and Embeddings

• Inputs: Generally split using subwords (see the tokenizer example below)

  the books were improved → the book _s were improv _ed

• Input Embedding: Looked up, like in previously discussed models
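
Purely as an illustration of subword splitting, here is a minimal sketch using an off-the-shelf subword tokenizer; the library and checkpoint name (`transformers`, `bert-base-uncased`) are assumptions for the example, not something specified in the slides.

from transformers import AutoTokenizer

# Any subword vocabulary works; this checkpoint name is only an example.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Frequent words tend to stay whole; rarer forms get split into pieces
# (marked with "##" in WordPiece, or with "_"-style markers in SentencePiece).
print(tok.tokenize("the books were improved"))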
Multi-head Attention
Intuition for Multi-heads
• Intuition: Information from different parts of the sentence can be useful to disambiguate in different ways, e.g. syntax (nearby context) vs. semantics (farther context)

  I run a small business
  I run a mile in 10 minutes
  The robber made a run for it
  The stocking had a run
Multi-head Attention Concept
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

  1. Multiply Q, K, V by the weights W^Q, W^K, W^V
  2. Split/rearrange into h attention inputs (one per head)
  3. Run attention over each head
  4. Concat the heads and multiply by W^O
Code Example
def forward(self, query, key, value, mask=None):
    nbatches = query.size(0)
    if mask is not None:
        # The same mask is applied to all h heads
        mask = mask.unsqueeze(1)

    # 1) Do all the linear projections (multiply by W^Q, W^K, W^V)
    query = self.W_q(query)
    key = self.W_k(key)
    value = self.W_v(value)

    # 2) Reshape to get h heads: (batch, length, d_model) -> (batch, h, length, d_k)
    query = query.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    key = key.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    value = value.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)

    # 3) Apply attention on all the projected vectors in batch.
    x, self.attn = attention(query, key, value, mask=mask)

    # 4) "Concat" the heads using a view and apply a final linear (W^O).
    x = (
        x.transpose(1, 2)
        .contiguous()
        .view(nbatches, -1, self.h * self.d_k)
    )
    return self.W_o(x)
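
After step 2, each of query/key/value has shape (batch, h, length, d_k), so the `attention` helper defined earlier runs over all heads in parallel; step 4 undoes the rearrangement, giving (batch, length, h·d_k) before the final projection W^O.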
What Happens w/ Multi-heads?
• Example from Vaswani et al.

• See also BertViz: https://github.com/jessevig/bertviz


Positional Encoding
Positional Encoding
• The transformer model is purely attentional

• If only word embeddings were used, there would be no way to distinguish between identical words: the two “big”s in “A big dog and a big cat” would be identical!

• Positional encodings add an embedding based on the word position:

  w_big + w_pos2 (first “big”)    w_big + w_pos6 (second “big”)
Sinusoidal Encoding
(Vaswani+ 2017, Kazemnejad 2019)
• Calculate each dimension with a sinusoidal function
  p_t^(i) = f(t)^(i) := { sin(ω_k · t)   if i = 2k
                        { cos(ω_k · t)   if i = 2k + 1
  where ω_k = 1 / 10000^(2k/d)

• Why? The dot product between two such encodings is a function of the relative offset between their positions, and is higher for nearby positions.
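
A minimal sketch of this computation, following the formula above (assumes an even d_model; tensor shapes are illustrative):

import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)               # t
    omega = torch.exp(-math.log(10000.0) * torch.arange(0, d_model, 2) / d_model)  # ω_k = 1/10000^(2k/d)
    pe[:, 0::2] = torch.sin(position * omega)  # even dimensions: sin(ω_k · t)
    pe[:, 1::2] = torch.cos(position * omega)  # odd dimensions:  cos(ω_k · t)
    return pe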
Learned Encoding
(Shaw+ 2018)

• More simply, just create a learnable embedding

• Advantages: flexibility

• Disadvantages: impossible to extrapolate to longer sequences
Absolute vs. Relative Encodings
(Shaw+ 2018)

• Absolute positional encodings add an encoding to the input in the hope that relative position will be captured

• Relative positional encodings explicitly encode relative position
Rotary Positional Encodings (RoPE)
(Su+ 2021)

• Fundamental idea: we want the dot product of embeddings to result in a function of relative position

  f_q(x_m, m) · f_k(x_n, n) = g(x_m, x_n, m − n)

• In summary, RoPE uses trigonometry and complex numbers to come up with a function that satisfies this property (⊗ below is element-wise multiplication)

  R^d_{Θ,m} x = (x_1, x_2, x_3, x_4, …, x_{d−1}, x_d)ᵀ ⊗ (cos mθ_1, cos mθ_1, cos mθ_2, cos mθ_2, …, cos mθ_{d/2}, cos mθ_{d/2})ᵀ
              + (−x_2, x_1, −x_4, x_3, …, −x_d, x_{d−1})ᵀ ⊗ (sin mθ_1, sin mθ_1, sin mθ_2, sin mθ_2, …, sin mθ_{d/2}, sin mθ_{d/2})ᵀ
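
A minimal sketch of applying this rotation to a single vector at position m (assumes the dimension d is even; the function and variable names are illustrative, not from the paper's released code):

import torch

def apply_rope(x: torch.Tensor, m: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive pairs of dimensions of x (shape (d,)) by angles m·θ_k."""
    d = x.size(-1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)  # θ_k, one per pair
    cos, sin = torch.cos(m * theta), torch.sin(m * theta)
    a, b = x[0::2], x[1::2]          # the pairs (x1, x2), (x3, x4), ... in the notation above
    out = torch.empty_like(x)
    out[0::2] = a * cos - b * sin    # x1·cos(mθ1) − x2·sin(mθ1), ...
    out[1::2] = a * sin + b * cos    # x2·cos(mθ1) + x1·sin(mθ1), ...
    return out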
Layer Normalization and
Residual Connections
Reminder:
Gradients and Training Instability
• In RNNs, we discussed how backprop through a network can cause gradients to vanish or explode

• The same issue occurs in multi-layer transformers!


Layer Normalization
(Ba et al. 2016)
• Normalizes the outputs to be within a consistent range, preventing too much variance in the scale of outputs

  LayerNorm(x; g, b) = (g / σ(x)) ⊙ (x − µ(x)) + b

  where g is a gain vector, b is a bias vector, and

  µ(x) = (1/n) Σ_{i=1}^n x_i        σ(x) = sqrt( (1/n) Σ_{i=1}^n (x_i − µ)^2 )
RMSNorm
(Zhang and Sennrich 2019)

• Simplifies LayerNorm by removing the mean and bias terms

  RMS(x) = sqrt( (1/n) Σ_{i=1}^n x_i^2 )

  RMSNorm(x) = (x / RMS(x)) · g
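
A minimal PyTorch sketch of both normalizations, following the formulas above (the eps terms are illustrative additions for numerical stability):

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))    # gain
        self.b = nn.Parameter(torch.zeros(d))   # bias
        self.eps = eps

    def forward(self, x):
        mu = x.mean(-1, keepdim=True)
        sigma = x.std(-1, keepdim=True, unbiased=False)
        return self.g * (x - mu) / (sigma + self.eps) + self.b

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))    # gain only: no mean, no bias
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).sqrt()
        return x / (rms + self.eps) * self.g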
Residual Connections
• Add an additive connection between the input and output

  Residual(x, f) = f(x) + x

• Prevents vanishing gradients and allows f to learn the difference from the input

• Quiz: what are the implications for self-attention w/ and w/o residual connections?
Post- vs. Pre- Layer Norm
(e.g. Xiong et al. 2020)

• Where should LayerNorm be applied? Before or after?

• Pre-layer-norm is better for gradient propagation

[Figure: post-LayerNorm vs. pre-LayerNorm block diagrams]
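
A minimal sketch of the two orderings for one sublayer (e.g. self-attention or the feed-forward layer); `sublayer` and `norm` are assumed callables, and dropout is omitted:

# Post-LayerNorm (original Transformer): normalize after the residual add
def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LayerNorm (most recent models): normalize only the sublayer input,
# so the residual path itself is untouched, which helps gradient propagation
def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))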
Feed Forward Layers
Feed Forward Layers
• Extract combination features from the attended outputs

  FFN(x; W_1, b_1, W_2, b_2) = f(xW_1 + b_1)W_2 + b_2

  i.e. Linear_1 → non-linearity f() → Linear_2
Some Activation Functions in Transformers

• Vaswani et al.: ReLU

ReLU(x) = max(0, x)

• LLaMA: Swish/SiLU (Hendrycks and Gimpel 2016)

Swish(x; β) = x ⊙ σ(βx)
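
A minimal sketch of the feed-forward layer with a configurable non-linearity (dimensions are illustrative; note that LLaMA's actual FFN uses a gated SwiGLU variant, which is not shown here):

import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)    # xW_1 + b_1
        self.f = activation                        # e.g. nn.ReLU() or nn.SiLU()
        self.linear2 = nn.Linear(d_ff, d_model)    # (...)W_2 + b_2

    def forward(self, x):
        return self.linear2(self.f(self.linear1(x)))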
Optimization Tricks for
Transformers
Transformers are Powerful
but Fickle

• Optimization of models can be difficult, and transformers are more difficult than others!

• e.g. the OPT-175B training logbook:
  https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
Optimizers for Transformers
• SGD: Update in the direction of reducing loss

• Adam: Add a momentum term and normalize by the standard deviation of the gradients

• Adam w/ learning rate schedule (Vaswani et al. 2017): Adds a learning rate increase (warmup) and decrease

  lrate = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))

• AdamW (Loshchilov and Hutter 2017): properly applies weight decay for regularization to Adam
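
A minimal sketch of the warmup-then-decay schedule from Vaswani et al. as a plain function (the function name is illustrative):

def transformer_lrate(step: int, d_model: int, warmup_steps: int) -> float:
    """Linear warmup for warmup_steps steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)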
Low-Precision Training
• Training at full 32-bit precision can be costly

• Low-precision alternatives: 16-bit floating-point formats such as fp16 and bfloat16

Image: Wikipedia
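
One common way to train at lower precision in PyTorch is automatic mixed precision with bfloat16; the toy model and loss below are assumptions for illustration (a CUDA device is required), not something prescribed by the slides:

import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 512, device="cuda")

# The forward pass runs selected ops in bfloat16; master weights stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()        # gradients come back in the parameters' dtype (float32)
optimizer.step()
optimizer.zero_grad()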
Checkpointing/Restarts
• Even with best efforts, training can go south. What to do?

• Monitor possible issues, e.g. by tracking the norm of the gradients

• If training crashes, roll back to a previous checkpoint, shuffle the data, and resume

• (Also, check your code)

Image: OPT Log
Comparing Transformer
Architectures
Original Transformer vs. LLaMA

                      Vaswani et al.     LLaMA
Norm Position         Post               Pre
Norm Type             LayerNorm          RMSNorm
Non-linearity         ReLU               SiLU
Positional Encoding   Sinusoidal         RoPE
How Important is It?
• “Transformer” is Vaswani et al., “Transformer++” is (basically) LLaMA

• The stronger architecture is ≈10x more efficient!

Image: Gu and Dao (2023)
Questions?
