
CS11-711 Advanced NLP

Transformers
Graham Neubig

Site
https://phontron.com/class/anlp2024/
Reminder: Attention
Cross Attention
(Bahdanau et al. 2015)
• Each element in a sequence attends to elements of another sequence

[Figure: each word of “kore wa rei desu” attends to the words of “this is an example”]
Self Attention
(Cheng et al. 2016, Vaswani et al. 2017)
• Each element in the sequence attends to elements of that sequence → context-sensitive encodings!

[Figure: each word of “this is an example” attends to the other words of the same sentence]
Calculating Attention (1)
• Use “query” vector (decoder state) and “key” vectors (all encoder states)
• For each query-key pair, calculate weight
• Normalize to add to one using softmax

[Figure: key vectors are the encoder states for “kono eiga ga kirai”; the query vector is the decoder state after “I hate”; raw scores a1=2.1, a2=-0.1, a3=0.3, a4=-1.0 become α1=0.76, α2=0.08, α3=0.13, α4=0.03 after the softmax]

Calculating Attention (2)
• Combine together the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum, as sketched in the code below

[Figure: value vectors for “kono eiga ga kirai”, each multiplied by its weight α1=0.76, α2=0.08, α3=0.13, α4=0.03 and then summed]
• Use this in any part of the model you like
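
Putting the two slides together, here is a minimal PyTorch sketch of scaled dot-product attention. The √d_k scaling follows Vaswani et al.; the name `attention` matches the helper called in the multi-head code example later, but this particular implementation is an illustrative assumption rather than the course's reference code.

import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    """softmax(QK^T / sqrt(d_k)) V, returning both the output and the weights."""
    d_k = query.size(-1)
    # One score per query-key pair
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Normalize to add to one over the keys
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return torch.matmul(weights, value), weights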


Transformers
“Attention is All You Need”
(Vaswani et al. 2017)

• A sequence-to-sequence model based entirely on attention

• Strong results on machine translation

• Fast: only matrix multiplications

[Figure: the Transformer encoder-decoder architecture: input and output embeddings plus positional encodings, Nx encoder blocks (multi-head attention and feed-forward, each followed by add & norm), Nx decoder blocks (masked multi-head attention, cross multi-head attention, and feed-forward, each followed by add & norm), then linear + softmax over output probabilities]
Two Types of Transformers

• Encoder-Decoder Model (e.g. T5, mBART)

• Decoder-Only Model (e.g. GPT, LLaMA)

[Figure: the two architectures side by side: the encoder-decoder Transformer from the previous slide, and a decoder-only model consisting solely of input embedding + positional encoding, Nx blocks of masked multi-head attention and feed-forward layers (each with add & norm), then linear + softmax over output probabilities]
Core Transformer Concepts

• Positional encodings

• Multi-headed attention

• Masked attention

• Residual + layer normalization

• Feed-forward layer
(Review) Inputs and Embeddings

• Inputs: Generally split using subwords (see the tokenizer example below)

  the books were improved → the book _s were improv _ed

• Input Embedding: Looked up, like in previously discussed models
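
Purely as an illustration of subword splitting, here is a minimal sketch using an off-the-shelf subword tokenizer; the library and checkpoint name (`transformers`, `bert-base-uncased`) are assumptions for the example, not something specified in the slides.

from transformers import AutoTokenizer

# Any subword vocabulary works; this checkpoint name is only an example.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Frequent words tend to stay whole; rarer forms get split into pieces
# (marked with "##" in WordPiece, or with "_"-style markers in SentencePiece).
print(tok.tokenize("the books were improved"))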
Multi-head Attention
Intuition for Multi-heads
• Intuition: Information from different parts of the sentence can be useful to disambiguate in different ways, e.g. syntax (nearby context) vs. semantics (farther context)

  I run a small business
  I run a mile in 10 minutes
  The robber made a run for it
  The stocking had a run
Multi-head Attention Concept
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

  1. Multiply Q, K, V by the weights W^Q, W^K, W^V
  2. Split/rearrange into h attention inputs (one per head)
  3. Run attention over each head
  4. Concat the heads and multiply by W^O
Code Example
def forward(self, query, key, value, mask=None):
    nbatches = query.size(0)
    if mask is not None:
        # The same mask is applied to all h heads
        mask = mask.unsqueeze(1)

    # 1) Do all the linear projections (multiply by W^Q, W^K, W^V)
    query = self.W_q(query)
    key = self.W_k(key)
    value = self.W_v(value)

    # 2) Reshape to get h heads: (batch, length, d_model) -> (batch, h, length, d_k)
    query = query.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    key = key.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    value = value.view(nbatches, -1, self.h, self.d_k).transpose(1, 2)

    # 3) Apply attention on all the projected vectors in batch.
    x, self.attn = attention(query, key, value, mask=mask)

    # 4) "Concat" the heads using a view and apply a final linear (W^O).
    x = (
        x.transpose(1, 2)
        .contiguous()
        .view(nbatches, -1, self.h * self.d_k)
    )
    return self.W_o(x)
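
After step 2, each of query/key/value has shape (batch, h, length, d_k), so the `attention` helper defined earlier runs over all heads in parallel; step 4 undoes the rearrangement, giving (batch, length, h·d_k) before the final projection W^O.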
What Happens w/ Multi-heads?
• Example from Vaswani et al.

• See also BertViz: https://github.com/jessevig/bertviz


Positional Encoding
Positional Encoding
• The transformer model is purely attentional

• If only word embeddings were used, there would be no way to distinguish between identical words: the two “big”s in “A big dog and a big cat” would be identical!

• Positional encodings add an embedding based on the word position:

  w_big + w_pos2 (first “big”)    w_big + w_pos6 (second “big”)
Sinusoidal Encoding
(Vaswani+ 2017, Kazemnejad 2019)
• Calculate each dimension with a sinusoidal function
  p_t^(i) = f(t)^(i) := { sin(ω_k · t)   if i = 2k
                        { cos(ω_k · t)   if i = 2k + 1
  where ω_k = 1 / 10000^(2k/d)

• Why? The dot product between two such encodings is a function of the relative offset between their positions, and is higher for nearby positions.
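
A minimal sketch of this computation, following the formula above (assumes an even d_model; tensor shapes are illustrative):

import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)               # t
    omega = torch.exp(-math.log(10000.0) * torch.arange(0, d_model, 2) / d_model)  # ω_k = 1/10000^(2k/d)
    pe[:, 0::2] = torch.sin(position * omega)  # even dimensions: sin(ω_k · t)
    pe[:, 1::2] = torch.cos(position * omega)  # odd dimensions:  cos(ω_k · t)
    return pe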
Learned Encoding
(Shaw+ 2018)

• More simply, just create a learnable embedding

• Advantages: flexibility

• Disadvantages: impossible to extrapolate to longer sequences
Absolute vs. Relative Encodings
(Shaw+ 2018)

• Absolute positional encodings add an encoding to the input in the hope that relative position will be captured

• Relative positional encodings explicitly encode relative position
Rotary Positional Encodings (RoPE)
(Su+ 2021)

• Fundamental idea: we want the dot product of embeddings to result in a function of relative position

  f_q(x_m, m) · f_k(x_n, n) = g(x_m, x_n, m − n)

• In summary, RoPE uses trigonometry and complex numbers to come up with a function that satisfies this property (⊗ below is element-wise multiplication)

  R^d_{Θ,m} x = (x_1, x_2, x_3, x_4, …, x_{d−1}, x_d)ᵀ ⊗ (cos mθ_1, cos mθ_1, cos mθ_2, cos mθ_2, …, cos mθ_{d/2}, cos mθ_{d/2})ᵀ
              + (−x_2, x_1, −x_4, x_3, …, −x_d, x_{d−1})ᵀ ⊗ (sin mθ_1, sin mθ_1, sin mθ_2, sin mθ_2, …, sin mθ_{d/2}, sin mθ_{d/2})ᵀ
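
A minimal sketch of applying this rotation to a single vector at position m (assumes the dimension d is even; the function and variable names are illustrative, not from the paper's released code):

import torch

def apply_rope(x: torch.Tensor, m: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive pairs of dimensions of x (shape (d,)) by angles m·θ_k."""
    d = x.size(-1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)  # θ_k, one per pair
    cos, sin = torch.cos(m * theta), torch.sin(m * theta)
    a, b = x[0::2], x[1::2]          # the pairs (x1, x2), (x3, x4), ... in the notation above
    out = torch.empty_like(x)
    out[0::2] = a * cos - b * sin    # x1·cos(mθ1) − x2·sin(mθ1), ...
    out[1::2] = a * sin + b * cos    # x2·cos(mθ1) + x1·sin(mθ1), ...
    return out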
Layer Normalization and
Residual Connections
Reminder:
Gradients and Training Instability
• In RNNs, we discussed how backprop through a network can cause gradients to vanish or explode

• The same issue occurs in multi-layer transformers!


Layer Normalization
(Ba et al. 2016)
• Normalizes the outputs to be within a consistent range, preventing too much variance in the scale of outputs

  LayerNorm(x; g, b) = (g / σ(x)) ⊙ (x − µ(x)) + b

  where g is a gain vector, b is a bias vector, and

  µ(x) = (1/n) Σ_{i=1}^n x_i        σ(x) = sqrt( (1/n) Σ_{i=1}^n (x_i − µ)^2 )
RMSNorm
(Zhang and Sennrich 2019)

• Simplifies LayerNorm by removing the mean and bias terms

  RMS(x) = sqrt( (1/n) Σ_{i=1}^n x_i^2 )

  RMSNorm(x) = (x / RMS(x)) · g
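
A minimal PyTorch sketch of both normalizations, following the formulas above (the eps terms are illustrative additions for numerical stability):

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))    # gain
        self.b = nn.Parameter(torch.zeros(d))   # bias
        self.eps = eps

    def forward(self, x):
        mu = x.mean(-1, keepdim=True)
        sigma = x.std(-1, keepdim=True, unbiased=False)
        return self.g * (x - mu) / (sigma + self.eps) + self.b

class RMSNorm(nn.Module):
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))    # gain only: no mean, no bias
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).sqrt()
        return x / (rms + self.eps) * self.g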
Residual Connections
• Add an additive connection between the input and output

  Residual(x, f) = f(x) + x

• Prevents vanishing gradients and allows f to learn the difference from the input

• Quiz: what are the implications for self-attention w/ and w/o residual connections?
Post- vs. Pre- Layer Norm
(e.g. Xiong et al. 2020)

• Where should LayerNorm be applied? Before or after?

• Pre-layer-norm is better for gradient propagation

[Figure: post-LayerNorm vs. pre-LayerNorm block diagrams]
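
A minimal sketch of the two orderings for one sublayer (e.g. self-attention or the feed-forward layer); `sublayer` and `norm` are assumed callables, and dropout is omitted:

# Post-LayerNorm (original Transformer): normalize after the residual add
def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LayerNorm (most recent models): normalize only the sublayer input,
# so the residual path itself is untouched, which helps gradient propagation
def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))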
Feed Forward Layers
Feed Forward Layers
• Extract combination features from the attended outputs

  FFN(x; W_1, b_1, W_2, b_2) = f(xW_1 + b_1)W_2 + b_2

  i.e. Linear_1 → non-linearity f() → Linear_2
Some Activation Functions in Transformers

• Vaswani et al.: ReLU

ReLU(x) = max(0, x)

• LLaMA: Swish/SiLU (Hendrycks and Gimpel 2016)

Swish(x; β) = x ⊙ σ(βx)
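
A minimal sketch of the feed-forward layer with a configurable non-linearity (dimensions are illustrative; note that LLaMA's actual FFN uses a gated SwiGLU variant, which is not shown here):

import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, activation: nn.Module = nn.ReLU()):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)    # xW_1 + b_1
        self.f = activation                        # e.g. nn.ReLU() or nn.SiLU()
        self.linear2 = nn.Linear(d_ff, d_model)    # (...)W_2 + b_2

    def forward(self, x):
        return self.linear2(self.f(self.linear1(x)))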
Optimization Tricks for
Transformers
Transformers are Powerful
but Fickle

• Optimization of models can be difficult, and transformers are more difficult than others!

• e.g. the OPT-175B training logbook:
  https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
Optimizers for Transformers
• SGD: Update in the direction of reducing loss

• Adam: Add a momentum term and normalize by the standard deviation of the gradients

• Adam w/ learning rate schedule (Vaswani et al. 2017): Adds a learning rate increase (warmup) and decrease

  lrate = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))

• AdamW (Loshchilov and Hutter 2017): properly applies weight decay for regularization to Adam
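
A minimal sketch of the warmup-then-decay schedule from Vaswani et al. as a plain function (the function name is illustrative):

def transformer_lrate(step: int, d_model: int, warmup_steps: int) -> float:
    """Linear warmup for warmup_steps steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)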
Low-Precision Training
• Training at full 32-bit precision can be costly

• Low-precision alternatives: 16-bit floating-point formats such as fp16 and bfloat16

Image: Wikipedia
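
One common way to train at lower precision in PyTorch is automatic mixed precision with bfloat16; the toy model and loss below are assumptions for illustration (a CUDA device is required), not something prescribed by the slides:

import torch
import torch.nn as nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 512, device="cuda")

# The forward pass runs selected ops in bfloat16; master weights stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()        # gradients come back in the parameters' dtype (float32)
optimizer.step()
optimizer.zero_grad()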
Checkpointing/Restarts
• Even with best efforts, training can go south. What to do?

• Monitor possible issues, e.g. by tracking the norm of the gradients

• If training crashes, roll back to a previous checkpoint, shuffle the data, and resume

• (Also, check your code)

Image: OPT Log
Comparing Transformer
Architectures
Original Transformer vs. LLaMA

                      Vaswani et al.     LLaMA
Norm Position         Post               Pre
Norm Type             LayerNorm          RMSNorm
Non-linearity         ReLU               SiLU
Positional Encoding   Sinusoidal         RoPE
How Important is It?
• “Transformer” is Vaswani et al., “Transformer++” is (basically) LLaMA

• The stronger architecture is ≈10x more efficient!

Image: Gu and Dao (2023)
Questions?
