anlp-05-transformers
Transformers
Graham Neubig
Site: https://phontron.com/class/anlp2024/
Reminder: Attention
Cross Attention
(Bahdanau et al. 2015)
• Each element in a sequence attends to elements of another sequence
[Figure: cross attention between "this is an example" and "kore wa rei desu"]
Self Attention
(Cheng et al. 2016, Vaswani et al. 2017)
• Each element in the sequence attends to elements of that sequence → context-sensitive encodings!
[Figure: self attention among the words of "this is an example"]
Calculating Attention (1)
• Use “query” vector (decoder state) and “key” vectors (all encoder states)
• For each query-key pair, calculate weight
• Normalize to add to one using softmax
[Figure: example attention weights for the output words "I hate": α1=0.76, α2=0.08, α3=0.13, α4=0.03]
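To make the weight calculation concrete, here is a minimal sketch of dot-product attention with softmax normalization in PyTorch; the √d_k scaling and all names are illustrative choices, not necessarily the exact formulation from the lecture.

import math
import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values):
    # query: (d_k,); keys: (seq_len, d_k); values: (seq_len, d_v)
    scores = keys @ query / math.sqrt(query.size(-1))  # one weight per key, scaled
    alpha = F.softmax(scores, dim=-1)                   # normalize to add to one
    return alpha @ values, alpha                        # weighted sum of values

# toy usage: a query over 4 encoder states of dimension 8
keys = values = torch.randn(4, 8)
query = torch.randn(8)
context, alpha = dot_product_attention(query, keys, values)
print(alpha)  # four weights summing to 1, like α1..α4 in the figure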
Two Types of Transformers
• Encoder-Decoder Model (e.g. T5, mBART)
• Decoder-Only Model (e.g. GPT, LLaMA)
[Figure: the two architecture diagrams: input/output embeddings plus positional encoding, stacked (masked) multi-head attention and feed-forward blocks with Add & Norm, then Linear + Softmax giving output probabilities]
Core Transformer Concepts
• Positional encodings
• Multi-headed attention
• Masked attention
• Feed-forward layer
Inputs and Embeddings (Review)
• Inputs: generally split using subword tokenization, then mapped to input embeddings
Multi-head Attention
Intuition for Multi-heads
• Intuition: Information from different parts of the sentence can be useful to disambiguate in different ways
Multi-head Attention Concept
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
[Figure: Q, K, and V are each projected by W^Q, W^K, and W^V before going into attn()]
Code Example
def forward(self, query, key, value, mask=None):
nbatches = query.size(0)
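The two lines above are just the start of a forward pass; below is a minimal, self-contained sketch of multi-head attention in PyTorch, loosely in the style of the Annotated Transformer. The class name, the omission of dropout, and the mask convention are simplifying assumptions, not the exact course implementation.

import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # projections W^Q, W^K, W^V plus the output projection W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        nbatches = query.size(0)
        # project, then split the model dimension into h heads of size d_k
        q = self.w_q(query).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(key).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(value).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:                   # mask broadcasts over the head dimension
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(nbatches, -1, self.h * self.d_k)
        return self.w_o(x)                     # Concat(head_1, ..., head_h) W^O

# self attention: query = key = value = the same sequence, e.g.
#   mha = MultiHeadAttention(d_model=512, h=8); y = mha(x, x, x)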
Positional Encoding
• If only word embeddings were used, there would be no way to distinguish between identical words
  "A big dog and a big cat" → the two instances of "big" would be identical!
• Positional encodings add an embedding based on the word position
  w_big + w_pos2    vs.    w_big + w_pos6
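As a tiny illustration of adding a position-based embedding to the word embedding, here is a hedged sketch with learned position embeddings (sinusoidal encodings, covered next, could be swapped in); sizes and token ids are made up.

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 16              # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d_model)               # word embeddings
pos_emb = nn.Embedding(max_len, d_model)                  # learned positional embeddings

tokens = torch.tensor([[5, 42, 7, 13, 5, 42, 8]])         # "a big dog and a big cat" as ids
positions = torch.arange(tokens.size(1)).unsqueeze(0)     # 0, 1, 2, ...
x = tok_emb(tokens) + pos_emb(positions)                  # the two "big"s now differ by position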
Sinusoidal Encoding
(Vaswani+ 2017, Kazemnejad 2019)
• Calculate each dimension with a sinusoidal function:
  p_t^(i) = f(t)^(i) := sin(ω_k · t) if i = 2k,  cos(ω_k · t) if i = 2k + 1,  where ω_k = 1 / 10000^(2k/d)
• Advantages: flexibility
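A minimal sketch of building the table of sinusoidal encodings from the formula above (names and sizes are illustrative):

import torch

def sinusoidal_encoding(max_len, d_model):
    # pe[t, 2k] = sin(ω_k · t), pe[t, 2k+1] = cos(ω_k · t)
    t = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_k = torch.arange(0, d_model, 2, dtype=torch.float32)      # even indices 2k
    omega = 1.0 / (10000 ** (two_k / d_model))                    # ω_k = 1 / 10000^(2k/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(t * omega)
    pe[:, 1::2] = torch.cos(t * omega)
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=16)   # added to the input embeddings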
Layer Normalization
• Normalizes outputs so that there is not too much variance in the scale of outputs
  LayerNorm(x) = (x - μ) / σ · g + b    (μ: mean, σ: stddev, g/b: learned gain and bias)
RMSNorm
• Simplifies LayerNorm by dropping the mean and bias terms:
  RMS(x) = sqrt((1/n) Σ_{i=1..n} x_i^2)
  RMSNorm(x) = x / RMS(x) · g
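A minimal sketch of RMSNorm as defined above; the eps term and names are illustrative additions for numerical stability.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d_model))    # learned gain g
        self.eps = eps

    def forward(self, x):
        # RMS(x) = sqrt(mean(x_i^2)); then rescale by g
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.g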
Residual Connections
• Add together the input and output of each layer: output = input + Layer(input)
• Quiz: what are the implications for self-attention w/ and w/o residual connections?
Post- vs. Pre- Layer Norm
(e.g. Xiong et al. 2020)
• Where should LayerNorm be applied? Before or after?
• Pre-layer-norm is better for gradient propagation
[Figure: post-LayerNorm vs. pre-LayerNorm block diagrams]
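A hedged sketch of the two orderings, which also makes the residual connection from the previous slide explicit; sublayer stands for attention or the feed-forward block, and norm could be, e.g., nn.LayerNorm(d_model).

# post-LayerNorm (original Transformer): normalize after the residual addition
def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# pre-LayerNorm (common today): normalize the sublayer input, keeping an
# identity path from input to output, which eases gradient propagation
def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))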
Feed Forward Layers
• Extract combination features from the attended outputs
  FFN(x; W1, b1, W2, b2) = f(xW1 + b1)W2 + b2
[Figure: the feed-forward block is Linear1 → non-linearity f() → Linear2, inside the Transformer layer]
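A minimal sketch of this feed-forward layer; the 4·d_model hidden size and ReLU non-linearity are common choices used here as assumptions.

import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model              # typical hidden size
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.f = nn.ReLU()                      # the non-linearity f()

    def forward(self, x):
        # FFN(x) = f(x W1 + b1) W2 + b2
        return self.linear2(self.f(self.linear1(x)))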
Some Activation Functions in Transformers
ReLU(x) = max(0, x)
Swish(x; β) = x ⊙ σ(βx)
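Both can be written in a couple of lines (a sketch; β = 1 gives the SiLU variant of Swish):

import torch

def relu(x):
    return torch.clamp(x, min=0.0)              # max(0, x)

def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)          # x ⊙ σ(βx)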
Optimization Tricks for Transformers
Transformers are Powerful but Fickle
• Learning rate warmup followed by inverse square root decay:
  lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5))
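A sketch of this schedule as a plain Python function (names are illustrative):

def transformer_lrate(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5))
    step = max(step, 1)                          # avoid step = 0 at the start
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)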
• Low-precision alternatives
Checkpointing/Restarts
• Even with best efforts, training can go south. What to do?
How Important is It?
• "Transformer" is Vaswani et al., "Transformer++" is (basically) LLaMA
[Table: component comparison, including positional encoding: sinusoidal (Transformer) vs. RoPE (Transformer++)]