and Transformers
CS886: Recent Advances on Foundation Models
[Figure: an input sequence $x_1\,x_2\,x_3\,x_4$ is mapped by some unknown model ("magic?") to an output sequence $y_1\,y_2\,y_3\,y_4$]
Sequence to Sequence
• Example scenarios
  • Text → Text (e.g., Q/A, translation, text summarization)
  • Image → Text (e.g., image captioning)
[Figure: an ENCODER reads the input sequence $x_1\,x_2\,x_3\,x_4$ into a context vector (state), which a DECODER uses to produce the output sequence]
Sequence to Sequence with RNNs
• Encoder (LSTM) and decoder (LSTM)
• Fixed-length context vector
Input: a sequence $x_1 \dots x_4$; output: a sequence $y_1 \dots y_4$.
Encoder: $h_t = f(x_t, h_{t-1})$
Decoder: $s_t = g(y_{t-1}, s_{t-1}, c)$
[Figure: encoder states $h_1 \dots h_4$ are summarized into the context vector $c$, which (together with the previous output $y_{t-1}$) drives the decoder states $s_1 \dots s_4$]
I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems
(NIPS), 2014, pp. 3104–3112.
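A minimal Python/NumPy sketch of this encoder-decoder recurrence, assuming a plain tanh RNN cell in place of the paper's LSTMs; the dimensions, random weights, and bare linear output layer are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y = 8, 16, 8                      # illustrative dimensions (not from the paper)

# Encoder h_t = f(x_t, h_{t-1}); a tanh RNN cell stands in for the LSTM.
W_xh, W_hh = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
# Decoder s_t = g(y_{t-1}, s_{t-1}, c); conditioned on the single fixed context c.
W_ys, W_ss, W_cs = rng.normal(size=(d_h, d_y)), rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
W_out = rng.normal(size=(d_y, d_h))

def encode(xs):
    h = np.zeros(d_h)
    for x in xs:                              # h_t = f(x_t, h_{t-1})
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h                                  # final state = fixed-length context vector c

def decode(c, steps):
    s, y = c.copy(), np.zeros(d_y)            # s_0 initialized from c; zeros stand in for the start token
    outputs = []
    for _ in range(steps):                    # s_t = g(y_{t-1}, s_{t-1}, c)
        s = np.tanh(W_ys @ y + W_ss @ s + W_cs @ c)
        y = W_out @ s                         # logits for y_t (softmax/argmax omitted)
        outputs.append(y)
    return outputs

xs = [rng.normal(size=d_x) for _ in range(4)]  # x_1 ... x_4
ys = decode(encode(xs), steps=4)               # y_1 ... y_4
print(len(ys), ys[0].shape)
```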
Sequence to Sequence with RNNs
• Encoder (LSTM) and decoder (LSTM)
• Fixed-length context vector (bottleneck)
Same equations as before: $h_t = f(x_t, h_{t-1})$ and $s_t = g(y_{t-1}, s_{t-1}, c)$.
[Figure: the single fixed-length context vector $c$ between encoder states $h_1 \dots h_4$ and decoder states $s_1 \dots s_4$ is highlighted as the BOTTLENECK]
I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems
(NIPS), 2014, pp. 3104–3112.
Sequence to Sequence with RNNs + Attention
• Idea: use a different context vector for each time step in the decoder
• Craft the context vector so that it "looks at" different parts of the input sequence at each decoder time step
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
Sequence to Sequence with RNNs + Attention
Compute the context vector for the first decoder step:
• Alignment scores: $e_{1,i} = f_{att}(s_0, h_i)$ for each encoder state $h_i$
• Attention weights (normalized alignment scores): $\alpha_{1,i} = \mathrm{softmax}_i(e_{1,i})$
• Context vector: $c_1 = \sum_i \alpha_{1,i} h_i$ (multiply and add)
• Decoder update: $s_t = g(y_{t-1}, s_{t-1}, c_t)$
$\alpha_{t,i}$ represents the probability that the target word $y_t$ is aligned to, or translated from, the source word $x_i$.
[Figure: ENCODER states, attention weights, context vector, DECODER]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
Sequence to Sequence with RNNs + Attention
At the second decoder step, repeat with a new set of weights:
$e_{2,i} = f_{att}(s_1, h_i)$, $\alpha_{2,i} = \mathrm{softmax}_i(e_{2,i})$, $c_2 = \sum_i \alpha_{2,i} h_i$, and $s_2 = g(y_1, s_1, c_2)$.
[Figure: the same encoder states $h_1 \dots h_4$, now weighted by $\alpha_{2,i}$ to produce $c_2$ for decoder state $s_2$]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
Sequence to Sequence with RNNs + Attention
In general, at decoder step $t$:
• Alignment scores: $e_{t,i} = f_{att}(s_{t-1}, h_i)$
• Attention weights (normalized alignment scores): $\alpha_{t,i} = \mathrm{softmax}_i(e_{t,i})$
• Context vector: $c_t = \sum_i \alpha_{t,i} h_i$ (multiply and add)
• Decoder update: $s_t = g(y_{t-1}, s_{t-1}, c_t)$
[Figure: encoder states $h_1 \dots h_4$, per-step context vectors $c_1 \dots c_4$, and decoder states $s_1 \dots s_4$ producing $y_1 \dots y_4$]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
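A small Python/NumPy sketch of this per-step context-vector computation, assuming the additive (MLP) alignment model $f_{att}(s, h) = v^\top \tanh(W_s s + W_h h)$ from Bahdanau et al.; dimensions and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_a = 16, 12                      # illustrative sizes: state dimension and attention dimension

# Parameters of the additive alignment model f_att (a small MLP).
W_s, W_h, v = rng.normal(size=(d_a, d_h)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_vector(s_prev, H):
    """c_t = sum_i alpha_{t,i} h_i, with alpha = softmax(e) and e_{t,i} = f_att(s_{t-1}, h_i)."""
    e = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_i) for h_i in H])  # alignment scores
    alpha = softmax(e)                                                  # attention weights
    return alpha @ H, alpha                                             # weighted sum of encoder states

H = rng.normal(size=(4, d_h))          # encoder states h_1 ... h_4
s_prev = rng.normal(size=d_h)          # previous decoder state s_{t-1}
c_t, alpha = context_vector(s_prev, H)
print(alpha.round(3), c_t.shape)       # weights sum to 1; c_t has the encoder-state dimension
```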
Sequence to Sequence with RNNs + Attention
• All steps are differentiable, so we can backpropagate through everything.
• The encoder is bi-directional, which allows the annotation of each word to summarize both the preceding and the following words.
[Figure: the same attention diagram as on the previous slide]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
Sequence to Sequence with RNNs + Attention
Application: translation
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
Sequence to Sequence with RNNs + Attention
Application: text translation. Models compared: RNNenc (RNN without attention) vs. RNNsearch (RNN with attention).
[Figure: translation quality of RNNenc vs. RNNsearch]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
Image Captioning with Visual Attention
• We can similarly use attention for image captioning (image → text)
• Builds directly on previous work
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
Image Captioning with Visual Attention
• Each annotation vector $h_i$ corresponds to a part of the image; the alignment model $f_{att}$ is an MLP.
• As before: $e_{t,i} = f_{att}(s_{t-1}, h_i)$, $\alpha_{t,i} = \mathrm{softmax}_i(e_{t,i})$, $c_t = \sum_i \alpha_{t,i} h_i$, and $s_t = g(y_{t-1}, s_{t-1}, c_t)$.
• A different context vector is used at every time step; each context vector attends to different image regions.
[Figure: ENCODER = a CNN extracts annotation (feature) vectors $h_i$ from the input image; DECODER generates the caption starting from [START]]
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
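A hedged sketch of how the visual case differs: the CNN's spatial feature map is flattened into annotation vectors $h_i$ (one per image region), and the same additive attention is applied over regions. The $14{\times}14{\times}512$ feature-map shape and all weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative CNN output: a 14x14 grid of 512-dim feature vectors (shapes are assumptions).
feature_map = rng.normal(size=(14, 14, 512))
H = feature_map.reshape(-1, 512)                  # L = 196 annotation vectors h_i, one per image region

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def visual_context(s_prev, H, W_s, W_h, v):
    """Same additive attention as before, but over image regions instead of words."""
    e = np.tanh(H @ W_h.T + (W_s @ s_prev)) @ v   # e_{t,i} = f_att(s_{t-1}, h_i), vectorized over i
    alpha = softmax(e)                            # alpha_{t,i}: how much region i matters at step t
    return alpha @ H                              # context c_t = weighted sum of region features

d_s, d_a = 256, 128
W_s, W_h, v = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, 512)), rng.normal(size=d_a)
c_t = visual_context(rng.normal(size=d_s), H, W_s, W_h, v)
print(c_t.shape)                                  # (512,): one context vector per decoder step
```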
Image Captioning with Visual Attention
[Figure: the same visual-attention decoder unrolled over the full caption "A bird flying [END]", with a new context vector $c_t$ at every step]
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
Image Captioning with Visual Attention
• All steps are differentiable, so we can backpropagate through everything.
• Each context vector attends to different image regions.
[Figure: the same unrolled captioning diagram as on the previous slide]
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
Image Captioning with Visual Attention
• Visualization of the attention for each generated word
• Gives insight into "where" and "what" the attention focused on when generating each word
• Two variants: deterministic "soft" attention, and stochastic "hard" attention (which requires reinforcement learning to train)
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
Image Captioning with Visual Attention
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
Attention is All you Need (2017)
• Key Idea:
• Decouple attention from RNNs
• Use self-attention to make this efficient
• Contributions:
• Multi-head attention
• Transformer architecture
A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
Feature Superposition (Polysemanticity)
• A neural network activation often does not represent a single thing
• "Neural networks want to represent more features than they have neurons for" [1]
• Superposition of features: networks "often pack many unrelated concepts into a single neuron" [1]
• This results in decreased explainability
• A paper from Anthropic seeks to add explainability to LLMs [2]
[1] N. Elhage et al., “Toy Models of Superposition.” arXiv, 2022. doi: 10.48550/arXiv.2209.10652.
[2] T. Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." 2023. [Online]. Available: https://transformer-circuits.pub/2023/monosemantic-features/index.html
Attention we’ve seen so far
Now known as "additive" recurrent attention (a type of encoder-decoder attention).
[Figure: additive recurrent attention. Inputs: encoder states $h_1 \dots h_4$ and decoder state $s_{t-1}$ → alignment scores → softmax → attention weights $\alpha_{t1} \dots \alpha_{t4}$ → multiply and add]
Issues with Recurrent Attention
• Scalability issues
  • Performance degrades as the distance between words increases
• Parallelization limitations
  • Recurrent processing cannot be parallelized
• Memory constraints
  • RNNs have limited memory and struggle with long-range dependencies
  • The impact of earlier elements on the output is diluted as the sequence progresses
Generalizing recurrent attention: the quantities $x_i$, $h_i$, $s_t$, $c_t$, $y_t$ are recast as keys $k_i$, values $v_i$, queries $q_j$, and outputs $o_j$.
Goal: find the "alignment" or "compatibility" of keys with a query, via a score function $f_{att}$, in order to scale the values.
[Figure: keys $k_i$ scored against a query $q_j$ by $f_{att}$, giving scores $e_{ji}$, softmaxed into attention weights $\alpha_{ji}$ that weight the values $v_i$]
A more general attention
$\mathrm{Attention}(q, k, v)$: for each query $q_j$,
• Alignment scores: $e_{ji} = f_{att}(q_j, k_i)$ for every key $k_i$
• Attention weights: $\alpha_{ji} = \mathrm{softmax}_i(e_{ji})$
• Output vector: $o_j = \sum_i \alpha_{ji} v_i$ (multiply and add)
Goal: find the "alignment" or "compatibility" between keys and queries to scale the values.
[Figure: keys, values, and queries $q_1, q_2, q_3$ producing output vectors $o_1, o_2, o_3$]
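A minimal Python/NumPy sketch of this more general attention, with the score function $f_{att}$ left pluggable (a plain dot product is used below as one possible choice); the shapes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(queries, keys, values, f_att):
    """For each query q_j: e_{j,i} = f_att(q_j, k_i), alpha_j = softmax(e_j), o_j = sum_i alpha_{j,i} v_i."""
    outputs = []
    for q in queries:
        e = np.array([f_att(q, k) for k in keys])   # alignment / compatibility scores
        alpha = softmax(e)                          # attention weights
        outputs.append(alpha @ values)              # output = weighted sum of values
    return np.stack(outputs)

rng = np.random.default_rng(3)
d = 8
K = rng.normal(size=(3, d))            # keys k_1 ... k_3
V = rng.normal(size=(3, d))            # values v_1 ... v_3
Q = rng.normal(size=(3, d))            # queries q_1 ... q_3 (same dimension as the keys)

dot_score = lambda q, k: q @ k         # one possible f_att; an additive MLP score would also work
O = attention(Q, K, V, dot_score)
print(O.shape)                         # (3, 8): one output vector o_j per query
```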
Applying the Attention Mechanism
• Self-Attention: keys, values, and queries are all derived from the same source.
• Cross-Attention: keys-values and queries are derived from separate sources (two arbitrary sequences).
• Alignment uses a scaled dot product, $e_{ij} = \frac{q_j \cdot k_i}{\sqrt{d_k}}$; keys and queries therefore share the same dimension.
[Figure: arbitrary inputs $x_1, x_2, x_3$ (and, for cross-attention, a second sequence $y_1, y_2, y_3$) transformed into keys, values, and queries feeding $\mathrm{Attention}(q, k, v)$]
Attention in Attention is All you Need
• Calculate the dot products in parallel with matrix multiplication (matmul): high concurrency on modern hardware (GPUs), and each query is handled independently.
• In matrix form: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$; the row-wise softmax of the scaled alignment scores gives the attention-weight matrix, and the final matmul with $V$ gives the matrix of outputs $O$.
[Figure: $Q$ and $K$ multiplied, scaled, and softmaxed into the attention-weight matrix, which is multiplied with $V$ to give $O$]
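A vectorized sketch of the same computation in matrix form, following the paper's $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$; the toy shapes below are assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, computed for all queries at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # all alignment scores in one matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax -> attention weights
    return weights @ V                                  # matrix of outputs O

rng = np.random.default_rng(4)
Q, K, V = (rng.normal(size=(3, 8)) for _ in range(3))   # 3 queries/keys/values, d_k = d_v = 8
O = scaled_dot_product_attention(Q, K, V)
print(O.shape)                                          # (3, 8)
```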
Misconceptions about Transformers (1)
• What?
• Attention in transformers performs a vector similarity search
• Why?
• Over-simplification in terminology
• The key-query-value explanation is convenient, and many don't know to look past it
[Figure: the self-attention and cross-attention diagrams from before, with the attention block highlighted: there is nothing to learn inside the attention operation itself]
Learning Transformer Attention
• Self-Attention: keys, values, and queries are all derived from the same source. Cross-Attention: keys-values and queries are derived from separate sources.
• The keys, values, and queries are what we have to learn: they are produced by learned projection matrices, e.g. $K = X W^K$ and $V = X W^V$ (and likewise a query projection from $X$ for self-attention, or from the second sequence $Y$ for cross-attention), where $X$ and $Y$ are matrices of arbitrary sequences.
[Figure: the same self-/cross-attention diagrams, annotated with the learned projections]
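A sketch of the learned projections and the self- vs. cross-attention distinction, assuming single-head scaled dot-product attention and randomly initialized (untrained) projection matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
d_model, d_k = 16, 8

# Learned projection matrices (randomly initialized here; in practice trained by backprop).
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

def attend(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = rng.normal(size=(5, d_model))     # one arbitrary input sequence
Y = rng.normal(size=(7, d_model))     # a second arbitrary sequence

# Self-attention: K, V, and Q all come from the same source X.
self_out = attend(X @ W_Q, X @ W_K, X @ W_V)      # shape (5, d_k)

# Cross-attention: K and V come from X, but the queries come from Y.
cross_out = attend(Y @ W_Q, X @ W_K, X @ W_V)     # shape (7, d_k)
print(self_out.shape, cross_out.shape)
```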
Multi-Head Attention
• A single attention head produces one output vector per query by summing over positions ($o_j = \sum_i \alpha_{ji} v_i$); since we summed through the positions, we lose resolution in our representation.
• Multi-head attention runs several attention heads in parallel, each attending over its own lower-dimensional projection, so more information is maintained.
• Together, all heads take roughly the same computational time as one fully dimensioned attention head.
[Figure: several attention heads, each computing its own attention weights and its own weighted sum of the values]
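A compact multi-head sketch: the input is projected into several lower-dimensional heads, each head attends independently, and the concatenated head outputs are projected back to the model dimension. The head count, dimensions, and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, n_heads = 16, 4
d_head = d_model // n_heads                        # each head works in a lower-dimensional subspace

def attend(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Per-head projections plus the output projection W_O (all randomly initialized here).
W_Q = rng.normal(size=(n_heads, d_model, d_head))
W_K = rng.normal(size=(n_heads, d_model, d_head))
W_V = rng.normal(size=(n_heads, d_model, d_head))
W_O = rng.normal(size=(n_heads * d_head, d_model))

def multi_head_attention(X):
    # Each head attends independently over its own low-dimensional projection of X.
    heads = [attend(X @ W_Q[h], X @ W_K[h], X @ W_V[h]) for h in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ W_O    # concatenate head outputs, project back to d_model

X = rng.normal(size=(5, d_model))
print(multi_head_attention(X).shape)               # (5, 16)
```

Because each head works in a $d_{model}/h$-dimensional subspace, the total cost is roughly that of one full-dimension head, matching the last bullet above.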
Learning Multi-Head Attention
• Each head $h$ computes its own output $O_h = \mathrm{Attention}(Q W_h^Q, K W_h^K, V W_h^V)$; the head outputs $O_0, \dots, O_7$ are concatenated and multiplied by a learned matrix $W^O$ to give the final output $O$.
• Encoder-decoder cross-attention
  • Allows decoder layers to attend to all parts of the latent representation produced by the encoder
  • "Pulls context" from the encoder sequence over to the decoder
A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
Transformer Architecture
Why Self-Attention?
• Lower computational complexity
• A greater amount of the computation can be parallelized
• Each representation encodes the positional information of the sequence, via positional encodings added to the sequence representations
[Figure (built up over several slides): the Transformer encoder-decoder architecture. Encoder stack (×N): Input Embedding of $x_1 x_2 \dots x_n$ + Positional Encoding → Multi-Head Self-Attention → Add & Norm → Feed Forward → Add & Norm. Decoder stack (×N): Output Embedding of $y_1 \dots y_{t-1}$ + Positional Encoding → Masked Multi-Head Self-Attention → Add & Norm → attention over the encoder output → Add & Norm → Feed Forward → Add & Norm → next output $y_t$. Decoding is auto-regressive: previous outputs are shifted right and fed back in.]
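Since the architecture relies on positional encodings, here is a short sketch of the sinusoidal encoding used in the paper, $PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$; the sequence length below is an arbitrary example:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...), as in Vaswani et al."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)        # (50, 512); added to the input/output embeddings before the first layer
```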
• Why? The result is something the neural architecture can compute on directly.
• How? Linear MLP (fully connected) layers with a ReLU activation in between, and a hidden space of higher dimension.
[Figure residue: a small table mapping vocabulary entries to dense vectors]
Encoder-Decoder Sublayers (2)
Transformers From The Ground Up
Implementing the position-wise Feed-Forward Network, with $d_{ff} = 2048 = 4\,d_{model}$
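A sketch of that position-wise feed-forward network, following the paper's $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$; the random weights here are placeholders for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_ff = 512, 2048             # d_ff = 4 * d_model, as on the slide

W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def position_wise_ffn(X):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # linear -> ReLU -> linear

X = rng.normal(size=(10, d_model))    # a sequence of 10 positions
print(position_wise_ffn(X).shape)     # (10, 512): same shape in and out
```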
• LayerNorm
• Combats vanishing gradient
• Combats exploding gradient
Encoder-Decoder Sublayers (3)
Transformers From The Ground Up
Implementing sublayer connections
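A minimal sketch of a sublayer connection, assuming the common pre-norm variant $x + \mathrm{sublayer}(\mathrm{LayerNorm}(x))$ (the original paper applies the norm after the residual sum instead); gain/bias parameters and dropout are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean / unit variance (learned gain and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """Residual connection around a sublayer: x + sublayer(LayerNorm(x))."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(8)
X = rng.normal(size=(10, 512))
out = sublayer_connection(X, lambda z: 0.5 * z)   # any sublayer (attention or FFN) plugs in here
print(out.shape)
```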
• A final Linear layer followed by a Softmax converts the decoder's output logits into probabilities for the next output.
[Figure residue: the encoder-decoder diagram showing where the sublayer connections sit, and an "incorrect vs. correct" comparison of how the encoder and decoder stacks are wired]
Putting it all together
Transformers From The Ground Up
Masked positions are given a score of −1e9; since −1e9 is very negative, softmax(−1e9) ≈ 0, so those positions receive essentially zero attention weight.
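A short sketch of that masking trick: disallowed positions get a score of −1e9 before the softmax, so their attention weights become (essentially) zero. The causal mask below is one example use:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Standard masking trick: set disallowed scores to -1e9 so the softmax gives them ~0 weight."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(mask, scores, -1e9)            # mask == False -> score becomes very negative
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(9)
L = 4
Q = K = V = rng.normal(size=(L, 8))
causal = np.tril(np.ones((L, L), dtype=bool))        # position t may only attend to positions <= t
out, w = masked_attention(Q, K, V, causal)
print(w.round(3))                                    # upper triangle is (almost) exactly zero
```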
Results and Impact
A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
Paper Impact
• Highly influential
• The Transformer architecture has been used as the basis for many state-of-the-art models
• Example (ViT): convert images to "sequences" (see the patch-extraction sketch below)
  • Images are split into smaller regions (patches)
  • The regions are flattened and treated as a sequence
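A hedged sketch of that patch extraction, assuming a 224×224×3 image and 16×16 patches (the linear projection of each patch and the position embeddings that ViT adds afterwards are omitted):

```python
import numpy as np

def image_to_patch_sequence(image, patch=16):
    """Split an image into non-overlapping patch x patch regions and flatten each into a vector."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = (image[:rows * patch, :cols * patch]
               .reshape(rows, patch, cols, patch, C)
               .transpose(0, 2, 1, 3, 4)             # (rows, cols, patch, patch, C)
               .reshape(rows * cols, patch * patch * C))
    return patches                                   # a "sequence" of flattened image regions

img = np.random.default_rng(10).normal(size=(224, 224, 3))
seq = image_to_patch_sequence(img)
print(seq.shape)                                     # (196, 768): 14x14 patches of 16x16x3 values
```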
A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," presented at ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
Discussion
2. What elements of the architecture and training paradigm allow the multiple
heads to learn different representations, thereby preventing them from
converging/collapsing to the same content?