The lecture discusses self-attention and transformers, focusing on the concept of attention in deep learning and its applications in sequence-to-sequence tasks such as machine translation and image captioning. It highlights the evolution from RNNs with attention to the transformer architecture, which decouples attention from RNNs, allowing for more effective processing of input sequences. The presentation also covers the impact of transformers in various domains, including computer vision.


Lecture 5: Self-Attention and Transformers
CS886: Recent Advances on Foundation Models

Presenters: Evelien Riddell and James Riddell

2025-02-28 Slides created for CS886 at UWaterloo 1


Outline
• What is attention?
• Attention pre-transformers:
• RNN + Attention in machine translation (NLP)
• RNN + Attention in image captioning (CV)
• Attention Is All You Need:
• Motivation
• Attention decoupled from the RNN
• Transformer architecture
• Masked training
• Impact
• Transformers in other domains (CV):
• Early adoption (Image transformer)
• Visual transformers
• Discussion

2025-02-28 Slides created for CS886 at UWaterloo 2


What is Attention?
• The notion of exploiting context is not new
• CNN – context from spatial locality (useful for images)
• RNN – context from temporal locality (useful for sequences/time-series data)
• Embedding priors into models forces them to pay “attention” to relevant
features for a given problem

• What we now call “attention” in DL:
  • The idea of paying “attention” to the most relevant or important parts of the input at a given step
  • Very useful in sequence-to-sequence modelling
  • Ideally, we’d like to learn this!

2025-02-28 Slides created for CS886 at UWaterloo 3


What is a Learned Attention Mechanism?
• An attention mechanism typically refers to a function that allows a model to attend to different content

• There are many forms of attention mechanisms:
  • Additive
  • Dot-product

• We have names to distinguish attention based on what is attended to:
  • Self-attention (intra-attention)
  • Cross-attention (encoder-decoder attention / inter-attention)

2025-02-28 Slides created for CS886 at UWaterloo 4


Sequence to Sequence
• Example Scenarios
  • Text → Text (e.g. Q/A, translation, text summarization)
  • Image → Text (e.g. image captioning)

[Diagram: an input sequence x_1 … x_4 is mapped to an output sequence y_1 … y_4 by some unspecified “magic”]
2025-02-28 Slides created for CS886 at UWaterloo 5
Sequence to Sequence
• Example Scenarios
  • Text → Text (e.g. Q/A, translation, text summarization)
  • Image → Text (e.g. image captioning)

• How? Usually encoder-decoder models
  • e.g. RNNs, transformers

[Diagram: the input sequence x_1 … x_4 is fed to an ENCODER, whose state is summarized in a context vector passed to a DECODER that emits the output sequence y_1 … y_4]
2025-02-28 Slides created for CS886 at UWaterloo 6
Sequence to Sequence with RNNs
• Encoder (LSTM) and decoder (LSTM)
• Fixed-length context vector

Encoder (input sequence): h_t = f(x_t, h_{t−1})
Decoder (output sequence): s_t = g(y_{t−1}, s_{t−1}, c)

[Diagram: the encoder consumes x_1 … x_4 to produce hidden states h_1 … h_4; the final state is compressed into a single context vector c, which initializes the decoder states s_1 … s_4 that generate y_1 … y_4]
I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems
(NIPS), 2014, pp. 3104–3112.
2025-02-28 Slides created for CS886 at UWaterloo 7
Sequence to Sequence with RNNs
• Encoder (LSTM) and decoder (LSTM)
• Fixed-length context vector (bottleneck)

Encoder (input sequence): h_t = f(x_t, h_{t−1})
Decoder (output sequence): s_t = g(y_{t−1}, s_{t−1}, c)

[Same diagram as the previous slide, with the single context vector c highlighted as the BOTTLENECK between encoder and decoder]
I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems
(NIPS), 2014, pp. 3104–3112.
2025-02-28 Slides created for CS886 at UWaterloo 8
Sequence to Sequence with RNNs + Attention

• Idea! Use a different context vector for each timestep in the decoder

• No more bottleneck through a single vector

• Craft the context vector so that it “looks at” different parts of the
input sequence for each decoder timestep

D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 9
Sequence to Sequence with RNNs + Attention
Compute the context vector for decoder step 1:

• Alignment scores: e_{1,i} = f_att(s_0, h_i), scoring how well input position i matches decoder step 1
• Attention weights: α_{1,i} = softmax_i(e_{1,i}) (normalized alignment scores)
• Context vector: c_1 = Σ_i α_{1,i} · h_i (multiply and add)
• Decoder update: s_t = g(y_{t−1}, s_{t−1}, c_t)

α_{1,i} represents the probability that the target word y_1 is aligned to, or translated from, source word x_i; it reflects the importance of annotation h_i with respect to the previous hidden state s_0 in deciding the next state s_1 and generating y_1.

[Diagram: ENCODER hidden states h_1 … h_4 are scored against s_0, softmaxed into weights α_{1,1} … α_{1,4}, and summed into the context vector c_1 used by the DECODER]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 10
Sequence to Sequence with RNNs + Attention
Compute the context vector for decoder step 2:

• Alignment scores e_{2,i} = f_att(s_1, h_i) → attention weights α_{2,i} (softmax) → context vector c_2
• Decoder update: s_t = g(y_{t−1}, s_{t−1}, c_t)

[Same diagram as the previous slide, now producing c_2 from s_1 and generating y_2]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 11
Sequence to Sequence with RNNs + Attention
Compute the context vector for a general decoder step t:

• Alignment scores e_{t,i} = f_att(s_{t−1}, h_i) → attention weights α_{t,i} (normalized alignment scores) → context vector c_t
• Decoder update: s_t = g(y_{t−1}, s_{t−1}, c_t)

[Diagram: at every decoder step t, a fresh context vector c_t is computed from the encoder states h_1 … h_4 and the previous decoder state s_{t−1}, then used to generate y_t]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 12
Sequence to Sequence with RNNs + Attention
• All steps are differentiable, so we can backpropagate through everything
• The encoder is bi-directional: the annotation of each word summarizes both the preceding and the following words

[Same diagram as the previous slide, annotated with these two observations]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 13
Sequence to Sequence with RNNs + Attention

Application: translation

[Figure: attention weight matrix between a source sentence and its translation. Each pixel shows the weight α_{ij} of the annotation of the j-th source word for the i-th target word.]

D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 14
Sequence to Sequence with RNNs + Attention

Application: text translation

• RNN: RNNenc
• RNN + attention: RNNsearch

[Figure comparing the translation performance of the two models]

D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations (ICLR), 2015.
2025-02-28 Slides created for CS886 at UWaterloo 15
Image Captioning with Visual Attention
• We can similarly use attention for image captioning (image → text)
• Builds directly on the previous work

K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
2025-02-28 Slides created for CS886 at UWaterloo 16
Image Captioning with Visual Attention
• Each annotation vector h_i corresponds to a part of the image
• f_att is an MLP

Compute the context vector:
• Alignment scores e_{ti} = f_att(s_{t−1}, h_i) → attention weights α_{ti} (softmax) → context vector c_t
• Decoder update: s_t = g(y_{t−1}, s_{t−1}, c_t)

• A different context vector is used at every time step
• Each context vector attends to different image regions

[Diagram: ENCODER = a CNN that extracts annotation (feature) vectors h_i from the input image; DECODER = an RNN that starts from [START] and generates the caption word by word]
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
2025-02-28 Slides created for CS886 at UWaterloo 17
Image Captioning with Visual Attention
[Same diagram, rolled out over time: from [START], the decoder generates “A bird flying [END]”, computing a new context vector c_t over the image features at each step]
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
2025-02-28 Slides created for CS886 at UWaterloo 18
Image Captioning with Visual Attention
• All steps are differentiable, so we can backpropagate through everything
• Each context vector attends to different image regions

[Same rolled-out captioning diagram as the previous slide]
K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
2025-02-28 Slides created for CS886 at UWaterloo 19
Image Captioning with Visual Attention
• Visualization of the attention for each generated word
• Gives insight into “where” and “what” the attention focused on when generating each word

• Deterministic “soft” attention
• Stochastic “hard” attention (requires RL to train)

[Figure: attention maps over the image for each word of the generated caption, for both soft and hard attention]

K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
2025-02-28 Slides created for CS886 at UWaterloo 20
Image Captioning with Visual Attention

K. Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in PMLR, 2015, pp. 2048–2057.
2025-02-28 Slides created for CS886 at UWaterloo 21
Attention is All you Need (2017)
• Key Idea:
• Decouple attention from RNNs
• Use self-attention to make this efficient

• Contributions:
• Multi-head attention
• Transformer architecture

• Highly impactful (as we’ll touch on later)

A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
2025-02-28 Slides created for CS886 at UWaterloo 22
Feature Superposition (Polysemanticity)
• A single neural network activation often does not represent a single thing
• “Neural networks want to represent more features than they have neurons for” [1]
• Superposition of features: networks “often pack many unrelated concepts into a single neuron” [1]
• This decreases explainability
• A paper from Anthropic seeks to add explainability to LLMs [2]

[1] N. Elhage et al., “Toy Models of Superposition.” arXiv, 2022. doi: 10.48550/arXiv.2209.10652.
[2] T. Bricken et al., “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” 2023. [Online]. Available: https://transformer-circuits.pub/2023/monosemantic-features/index.html
2025-02-28 Slides created for CS886 at UWaterloo 23
Attention we’ve seen so far
Now known as “additive” recurrent attention (a type of encoder-decoder attention)

• Alignment: e_{ti} = f_att(h_i, s_{t−1}), where f_att is a simple feedforward network (e.g. an MLP)
• Attention weights: α_{ti} = softmax_i(e_{ti})
• Context vector: c_t = Σ_i α_{ti} · h_i (multiply and add), which feeds the decoder state s_t

[Diagram: inputs h_1 … h_4 and s_{t−1} are scored by f_att, normalized by softmax, and combined into c_t]
2025-02-28 Slides created for CS886 at UWaterloo 24
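A minimal PyTorch sketch of this additive recurrent attention, assuming hypothetical dimension names (enc_dim, dec_dim, attn_dim) and a batch-first layout; it is not the exact code from the cited paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: e_ti = v^T tanh(W_h h_i + W_s s_{t-1})."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # projects encoder annotations h_i
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)  # projects previous decoder state s_{t-1}
        self.v = nn.Linear(attn_dim, 1, bias=False)          # scores the combined projection

    def forward(self, h, s_prev):
        # h: (batch, src_len, enc_dim); s_prev: (batch, dec_dim)
        e = self.v(torch.tanh(self.W_h(h) + self.W_s(s_prev).unsqueeze(1)))  # alignment scores
        alpha = F.softmax(e, dim=1)              # attention weights over source positions
        c = (alpha * h).sum(dim=1)               # context vector c_t = sum_i alpha_ti * h_i
        return c, alpha.squeeze(-1)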
Issues with Recurrent Attention
• Scalability issues
  • Performance degrades as the distance between words increases
• Parallelization limitations
  • Recurrent processing cannot be parallelized across time steps
• Memory constraints
  • RNNs have limited memory and struggle with long-range dependencies
  • The impact of earlier elements on the output is diluted as the sequence progresses

• Potential solution: decouple attention from RNNs
  • How? Separate the attention mechanism into smaller, self-contained components
2025-02-28 Slides created for CS886 at UWaterloo 25
Decoupling from RNNs
• Recall: attention determines the importance of elements to be passed forward in the model
  • These weights let the model pay attention to the most significant parts

• Objective: a more general attention mechanism not confined to RNNs

• We need a modified procedure to:
  1. Determine weights, based on context, that indicate which elements to attend to
  2. Apply these weights to enhance the attended features

2025-02-28 Slides created for CS886 at UWaterloo 26


Decoupling from RNNs
• RNN Notation

  x_i — input for position i in the source sequence
  h_i — hidden state for position i in the source sequence
  s_t — hidden state for position t in the target sequence
  c_t — context vector for position t in the target sequence
  y_t — output for position t in the target sequence

2025-02-28 Slides created for CS886 at UWaterloo 27


Decoupling from RNNs
• New Notation (mapping from the RNN quantities):

  h_i → k_i, v_i (the encoder annotations provide both keys and values)
  s_{t−1} → q_j (the querying state becomes the query)
  c_t → o_j (the attention output replaces the context vector)

2025-02-28 Slides created for CS886 at UWaterloo 28


Decoupling from RNNs
• New Notation

  k_i — key vector for position i in an arbitrary sequence
  v_i — value vector for position i in an arbitrary sequence
  q_j — query vector for position j in a (same or different) arbitrary sequence
  o_j — output vector corresponding to query position j

2025-02-28 Slides created for CS886 at UWaterloo 29


A more general attention
Goal: find the “alignment” or “compatibility” of keys with a query, and use it to scale values.

For a single query q_j:
• Alignment scores: e_{ji} = f_att(q_j, k_i)
• Attention weights: α_{ji} = softmax_i(e_{ji})
• Output: o_j = Attention(q_j, k, v) = Σ_i α_{ji} · v_i (multiply and add)

[Diagram: keys k_1 … k_3 are each scored against the query q_j, the scores are softmaxed, and the resulting weights scale the values v_1 … v_3 before summation]
2025-02-28 Slides created for CS886 at UWaterloo 30
A more general attention
Goal: find the “alignment” or “compatibility” between keys and queries to scale values.

For all queries at once:
• Alignment scores: e_{ji} = f_att(q_j, k_i) for every query–key pair
• Attention weights: α_{ji} = softmax_i(e_{ji})
• Outputs: o_j = Attention(q_j, k, v) = Σ_i α_{ji} · v_i, for j = 1, 2, 3

[Diagram: the same mechanism evaluated for queries q_1 … q_3 in parallel, producing outputs o_1 … o_3]
2025-02-28 Slides created for CS886 at UWaterloo 31
Applying the Attention Mechanism

Self-Attention
• Keys, values, and queries are all derived from the same source sequence

Cross-Attention
• Keys and values are derived from one sequence, queries from another

[Diagram: in self-attention, an arbitrary input sequence x_1 … x_3 is transformed into k, v, and q; in cross-attention, x_1 … x_3 provides k and v while a second arbitrary sequence y_1 … y_3 provides q]

2025-02-28 Slides created for CS886 at UWaterloo 32


Attention Mechanism in Attention is All You Need

To decouple the attention mechanism, the paper implements it with two properties:

1. Scaled dot-product attention
  • Good representation of compatibility
  • Fast and interpretable computation
  • Parallelizable evaluation across all queries (can leverage GPUs)
  • Dot-products are scaled to keep softmax gradients stable in high dimensions (prevents large magnitudes)

2. A common dimension imposed for keys, values, and queries
  • Required for the dot-product
  • Simplifies the architecture with a predictable attention output shape
  • Provides consistent hidden-state dimensions for easier model analysis
A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
2025-02-28 Slides created for CS886 at UWaterloo 33
Attention in Attention is All you Need
Scaled Dot-Product Attention
• Faster
• More space-efficient

• Alignment scores are scaled dot-products: e_{ji} = (q_j · k_i) / √d_k
• Keys, values, and queries share the same dimension

[Diagram: the same key/query/value attention as before, with f_att replaced by the scaled dot-product]
2025-02-28 Slides created for CS886 at UWaterloo 34
Attention in Attention is All you Need
Matrix form: O = Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

• Calculate all dot-products in parallel with matrix multiplication
  • High concurrency on modern hardware (GPUs)
  • Each query is handled independently

[Diagram: matmul of Q and Kᵀ, scale, softmax to obtain the matrix of attention weights, then matmul with V to produce the matrix of outputs O]
2025-02-28 Slides created for CS886 at UWaterloo 35
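As a sketch, the whole slide reduces to a few lines of PyTorch mirroring the formula above (the -1e9 masking value anticipates the masked training discussed later); this is a minimal illustration, not the paper's reference code:

import math
import torch

def attention(query, key, value, mask=None):
    "Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # masked positions -> ~0 after softmax
    p_attn = scores.softmax(dim=-1)                   # attention weights, one row per query
    return torch.matmul(p_attn, value), p_attn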
Misconceptions about Transformers (1)
• What?
  • “Attention in transformers performs a vector similarity search”
• Why?
  • Over-simplification in terminology
  • The key-query-value explanation is convenient, and many don’t know to look past it

2025-02-28 Slides created for CS886 at UWaterloo 36


Misconceptions about Transformers (1)
• What?
  • “Attention in transformers performs a vector similarity search”
• Why?
  • Over-simplification in terminology
  • The key-query-value explanation is convenient, and many don’t know to look past it

Questions to ask instead:
• How do we get Q, K, and V?
• What are we learning?
• Is this parametric or non-parametric?

2025-02-28 Slides created for CS886 at UWaterloo 37


Learning Transformer Attention

Self-Attention
• Keys, values, and queries are all derived from the same source

Cross-Attention
• Keys and values are derived from one source, queries from another

[Same self- vs. cross-attention diagram as before, annotated: there is nothing to learn inside the attention operation itself]
2025-02-28 Slides created for CS886 at UWaterloo 38
Learning Transformer Attention

Self-Attention
• Keys, values, and queries are all derived from the same source

Cross-Attention
• Keys and values are derived from one source, queries from another

[Same diagram, annotated: what we have to learn are the transformations that produce the keys, values, and queries from the arbitrary input sequences]

2025-02-28 Slides created for CS886 at UWaterloo 39


Learning Transformer Attention
Self-Attention:
  Q = X·W^Q, K = X·W^K, V = X·W^V

Cross-Attention:
  Q = Y·W^Q, K = X·W^K, V = X·W^V

** X, Y are matrices of arbitrary sequences
2025-02-28 Slides created for CS886 at UWaterloo 40
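A minimal sketch of the learned projections, assuming d_model = 512 and batch-first tensors; the only trainable parts are the three weight matrices:

import torch
import torch.nn as nn

d_model = 512
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(2, 10, d_model)   # source sequence (batch, src_len, d_model)
Y = torch.randn(2, 7, d_model)    # target sequence (batch, tgt_len, d_model)

# Self-attention: Q, K, V all derived from the same sequence X
Q, K, V = W_q(X), W_k(X), W_v(X)

# Cross-attention: queries from Y, keys and values from X
Q, K, V = W_q(Y), W_k(X), W_v(X)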
Multi-Head Attention

• Builds on scaled dot-product attention

• Extension of the generalized attention outlined previously

• Leverages multiple heads to attend to different things

2025-02-28 Slides created for CS886 at UWaterloo 41


Learning Multi-Head Attention
Why do we need multiple heads?

Recall the single-head output: the attention weights α_{j1}, α_{j2}, α_{j3} scale the values v_1, v_2, v_3, which are then summed (multiply and add) into the output vector o_j.
2025-02-28 Slides created for CS886 at UWaterloo 42


Learning Multi-Head Attention
Why do we need multiple heads?

o_j = v_1·α_{j1} + v_2·α_{j2} + v_3·α_{j3}

2025-02-28 Slides created for CS886 at UWaterloo 43


Learning Multi-Head Attention
Why do we need multiple heads?

The weighted values v_1·α_{j1}, v_2·α_{j2}, v_3·α_{j3} are added together into a single vector.
2025-02-28 Slides created for CS886 at UWaterloo 44


Learning Multi-Head Attention
Why do we need multiple heads?

Since we summed over the positions to form o_j, we lose resolution in our representation: a single output vector has to carry everything one head attended to.

2025-02-28 Slides created for CS886 at UWaterloo 45


Learning Multi-Head Attention
• Main idea:
  • Learn multiple sets of weight matrices, so different heads attend to different things
  • Preserve resolution: more heads increase the chance that the information is maintained

• Allows the model to jointly attend to information from different representation subspaces (like ensembling)

[Diagram: h = 8 heads produce separate outputs O_0, O_1, …, O_7]
2025-02-28 Slides created for CS886 at UWaterloo 46


Learning Multi-Head Attention
• To make computation efficient, the weight matrices project to lower-dimensional subspaces:
  d_k = d_v = d_model / h (512 / 8 = 64 in the paper)

• Together, all heads take roughly the same computational cost as one fully-dimensioned attention head
2025-02-28 Slides created for CS886 at UWaterloo 47
Learning Multi-Head Attention

• Each head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
• This yields one output O_i per head (O_0, …, O_7), each of dimension d_v

• Recall, the rest of the model expects vectors of dimension d_model
  → indicates we need to reduce the head outputs to a single matrix

2025-02-28 Slides created for CS886 at UWaterloo 48


Learning Multi-Head Attention

MultiHead(Q, K, V) = Concat(O_0, O_1, …, O_7) · W^O, where W^O ∈ ℝ^{(h·d_v) × d_model}

[Diagram: the per-head outputs O_0 … O_7 are concatenated and multiplied by W^O to give the final output O]

2025-02-28 Slides created for CS886 at UWaterloo 49


Transformer Architecture
Ways attention is used in the transformer:
• Self-attention in the encoder
  • Allows the model to attend to all positions in the previous encoder layer
  • Embeds context about how elements in the sequence relate to one another
• Masked self-attention in the decoder
  • Allows the model to attend to all positions in the previous decoder layer up to and including the current position (during the auto-regressive process)
  • Prevents forward-looking bias by stopping leftward information flow during training
  • Also embeds context about how elements in the sequence relate to one another
• Encoder-decoder cross-attention
  • Allows decoder layers to attend to all parts of the latent representation produced by the encoder
  • Pulls context from the encoder sequence over to the decoder
A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
2025-02-28 Slides created for CS886 at UWaterloo 50
Transformer Architecture
Why Self-Attention?
• Lower computational complexity
• A greater amount of the computation can be parallelized
• Each representation encodes the positional information of the sequence

2025-02-28 Slides created for CS886 at UWaterloo 51


Transformer Architecture
Why Self-Attention?
• Cheaper (more throughput, fewer parameters)
• Faster to train

[Table from the paper comparing layer types used for sequence representations: complexity per layer, sequential operations, and maximum path length]

2025-02-28 Slides created for CS886 at UWaterloo 52


Transformer Architecture
[Full architecture diagram: input embeddings x_1 … x_n plus positional encoding feed the encoder stack (multi-head self-attention → Add & Norm → feed forward → Add & Norm); output embeddings y_1 … y_{t−1} (shifted right) plus positional encoding feed the decoder stack (masked multi-head self-attention → Add & Norm → multi-head cross-attention over the encoder output → Add & Norm → feed forward → Add & Norm); a generator (prediction head: linear + softmax) produces output probabilities for y_t. The decoding procedure is auto-regressive: each generated y_t is appended to the decoder input.]

2025-02-28 Slides created for CS886 at UWaterloo 53


Transformer Architecture
[Same architecture diagram, highlighting the decoding procedure: “greedy” auto-regressive decoding, where the most probable token y_t is chosen at each step and fed back into the decoder input (shift right).]

2025-02-28 Slides created for CS886 at UWaterloo 54


Transformer Architecture
[Same architecture diagram, highlighting that the encoder is a stack of N identical encoder layers.]

2025-02-28 Slides created for CS886 at UWaterloo 55


Transformer Architecture
[Same architecture diagram, highlighting that the decoder is likewise a stack of N identical decoder layers.]

2025-02-28 Slides created for CS886 at UWaterloo 56


Transformer Architecture
[Same architecture diagram, showing the complete model: N encoder layers, N decoder layers, and the linear + softmax output head.]

2025-02-28 Slides created for CS886 at UWaterloo 57


Transformer Architecture
Constant representation size between model components

[Same architecture diagram, annotated to show that every component consumes and produces representations of the same dimension d_model.]

2025-02-28 Slides created for CS886 at UWaterloo 58


Transformers From The Ground Up

Harvard NLP, “The Annotated Transformer.” [Online]. Available: https://nlp.seas.harvard.edu/annotated-transformer/


2025-02-28 Slides created for CS886 at UWaterloo 59
Model Creation Helper
Transformers From The Ground Up
The clones helper function
• What?
  • Create N copies of a PyTorch nn.Module
• Why?
  • The transformer’s structure contains a lot of design repetition (like VGG)
• Remember: these clones shouldn’t share parameters (for the most part)
• Make sure to initialize all model parameters so the clones stay independent
2025-02-28 Slides created for CS886 at UWaterloo 60
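A sketch of the helper in the spirit of the Annotated Transformer; deep copies keep the clones' parameters independent:

import copy
import torch.nn as nn

def clones(module, N):
    "Produce N identical, independently parameterized copies of a module."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])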
Getting Data into the Transformer (1)
Transformers From The Ground Up
Creating Embeddings
• What?
  • Create a vector representation for each element of the sequence vocabulary
• Why?
  • Vectors can be computed on by the neural architecture
  • Dimensionality is usually reduced (~37,000 words → 512 dimensions in the paper)
    → more efficient computation
• How?
  • Learned mapping (linear projection)

[Diagram: each input token x_1, x_2, …, x_n is looked up in a table of learned embedding vectors]

2025-02-28 Slides created for CS886 at UWaterloo 61


Getting Data into the Transformer (1)
Transformers From The Ground Up
Implementing Embeddings
• nn.Embedding creates a lookup table that maps sequence vocabulary entries to unique vectors
• It uses learned weights to handle this mapping (essentially an nn.Linear over one-hot inputs)

2025-02-28 Slides created for CS886 at UWaterloo 62
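A sketch of the embedding module, following the Annotated Transformer's convention of scaling by sqrt(d_model):

import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)  # lookup table: token id -> learned d_model vector
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)  # scale embeddings as in the paper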


Getting Data into the Transformer (2)
Transformers From The Ground Up
Positional encoding
• What?
  • Add information about an element’s position in a sequence to its representation
• Why?
  • Removes the need for recurrence or convolution
• How?
  • Element-wise addition of a sinusoidal encoding: z_i = x_i + PE_i

2025-02-28 Slides created for CS886 at UWaterloo 63


Getting Data into the Transformer (2)
Transformers From The Ground Up
Sinusoidal positional encoding

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

“May allow the model to easily learn to attend by relative positions”

2025-02-28 Slides created for CS886 at UWaterloo 64


Getting Data into the Transformer (2)
Transformers From The Ground Up
Implementing sinusoidal positional encoding
• The encoding is known at model creation time, so we precompute it
• Its dimension matches x, so we add it element-wise to inject positional context into x

2025-02-28 Slides created for CS886 at UWaterloo 65
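A sketch of the module, assuming a maximum length of 5000 positions; the table is precomputed once and registered as a buffer:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine
        self.register_buffer("pe", pe.unsqueeze(0))   # precomputed at model creation time

    def forward(self, x):
        # x: (batch, seq_len, d_model); add positional context element-wise
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)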


Encoder-Decoder Sublayers (1)
Transformers From The Ground Up
Multi-Head Attention Sublayers
• What?
  • Carry out multi-head attention and learn the weights for creating keys, values, and queries
• Why?
  • To extract relevant context from the input sequence
  • Multiple heads provide greater resolution and attend to different sub-representations
• How?
  • Implemented as previously discussed

[Diagram: the multi-head attention blocks highlighted within the encoder and decoder layers]

2025-02-28 Slides created for CS886 at UWaterloo 66


Encoder-Decoder Sublayers (1)
Transformers From The Ground Up
Implementing Multi-Head Attention

2025-02-28 Slides created for CS886 at UWaterloo 67
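A sketch of the module, reusing the attention and clones helpers sketched earlier (dropout on the attention weights is omitted for brevity):

import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h                                # 512 / 8 = 64 in the paper
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)  # W^Q, W^K, W^V, W^O
        self.attn = None                                       # stores the last attention weights

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)                           # same mask for every head
        nbatches = query.size(0)
        # 1) Project, then split into h heads: (batch, h, seq_len, d_k)
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) Scaled dot-product attention on every head in parallel
        x, self.attn = attention(query, key, value, mask=mask)
        # 3) Concatenate heads and apply the output projection W^O
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)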




Encoder-Decoder Sublayers (2)
Transformers From The Ground Up
Position-wise Feed Forward Network
• What?
  • Applies learned transformations to each position in the input representation
  • Applied separately and identically at every position
• Why?
  • Exploits the context added by the previous sublayers
  • Adds depth to the network so it can approximate greater complexity
  • Increases resolution to pull out different parts of the superposition
• How?
  • Two linear (fully connected) layers with a ReLU activation in between
  • The hidden space has a higher dimension
2025-02-28 Slides created for CS886 at UWaterloo 69
Encoder-Decoder Sublayers (2)
Transformers From The Ground Up
Implementing the position-wise Feed Forward Network

d_ff = 2048 = 4 · d_model

2025-02-28 Slides created for CS886 at UWaterloo 70
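A sketch of the sublayer (expand to d_ff = 2048, apply ReLU, project back to d_model = 512):

import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # expand: 512 -> 2048
        self.w_2 = nn.Linear(d_ff, d_model)   # project back: 2048 -> 512
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Applied identically at every position (operates on the last dimension)
        return self.w_2(self.dropout(self.w_1(x).relu()))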


Encoder-Decoder Sublayers (3)
Transformers From The Ground Up
Sublayer connections
• Residual connection (recall ResNet)
  • Can be less expensive to learn residuals
  • Alleviates vanishing gradients
  • Preserves more of the input signal through the skip connection
• Dropout
  • Regularizes the model (combats overfitting)
  • Encourages diversity of attention heads
• LayerNorm
  • Combats vanishing gradients
  • Combats exploding gradients
2025-02-28 Slides created for CS886 at UWaterloo 71
Encoder-Decoder Sublayers (3)
Transformers From The Ground Up
Implementing sublayer connections

2025-02-28 Slides created for CS886 at UWaterloo 72
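A sketch of the residual + dropout + LayerNorm wrapper; note that, like the Annotated Transformer, this applies the norm to the sublayer input (pre-norm) rather than after the residual as the paper describes:

import torch.nn as nn

class SublayerConnection(nn.Module):
    "Residual connection around any sublayer, with dropout and layer normalization."
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Skip connection preserves the input signal; dropout regularizes the sublayer output
        return x + self.dropout(sublayer(self.norm(x)))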


Encoder-Decoder Layers (1)
Transformers From The Ground Up
Encoder Layer
• What?
  • Composable block for the task of encoding an input sequence representation with attention
• Why?
  • Easy construction of the model
  • Allows encoder layers to be stacked to achieve depth
  • Repeating multi-head attention → models more complex position interactions
• How?
  • Multi-head self-attention sublayer (8 heads used)
  • Position-wise feed forward network sublayer
  • All sublayers are wrapped in sublayer connections

2025-02-28 Slides created for CS886 at UWaterloo 73


Encoder-Decoder Layers (1)
Transformers From The Ground Up
Implementing the encoder layer

[Diagram: the encoder layer (self-attention and feed forward, each with Add & Norm) highlighted within the architecture]

2025-02-28 Slides created for CS886 at UWaterloo 74
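A sketch of the encoder layer: a self-attention sublayer followed by the feed-forward sublayer, each wrapped in a SublayerConnection from the earlier sketch:

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))  # self-attention
        return self.sublayer[1](x, self.feed_forward)                     # position-wise FFN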


Encoder-Decoder Layers (2)
Transformers From The Ground Up
Decoder Layer
• What?
  • Composable block for the task of decoding a target sequence auto-regressively
• Same as the encoder layer other than:
  1. An additional multi-head attention block that performs cross-attention with the output representation from the encoder
  2. The addition of masking in self-attention
     • Prevents cheating (forward-looking bias) → the model purely attends to past information

2025-02-28 Slides created for CS886 at UWaterloo 75


Encoder-Decoder Layers (2)
Transformers From The Ground Up
Implementing the decoder layer

[Diagram: the decoder layer (masked self-attention, cross-attention, and feed forward, each with Add & Norm) highlighted within the architecture]

2025-02-28 Slides created for CS886 at UWaterloo 76
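A sketch of the decoder layer: masked self-attention, cross-attention over the encoder output (memory), then the feed-forward sublayer:

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super().__init__()
        self.size = size
        self.self_attn = self_attn      # masked self-attention over the target sequence
        self.src_attn = src_attn        # cross-attention: queries from decoder, keys/values from encoder
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)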


The Prediction Head
Transformers From The Ground Up
Generator
• Sometimes referred to as the predictor
• A final linear mapping
  • Internal representation → logits that capture the likelihood of the next element in the sequence
  • In seq2seq language translation, this maps back to the vocabulary corpus
• Apply softmax to convert the logits into probabilities for the next position

2025-02-28 Slides created for CS886 at UWaterloo 77


The Prediction Head
Transformers From The Ground Up
Implementing a generator

2025-02-28 Slides created for CS886 at UWaterloo 78
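A sketch of the prediction head; log-softmax is used here, as in the Annotated Transformer, so it pairs with an NLL-style loss:

import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    "Final linear projection from d_model to the target vocabulary, plus softmax."
    def __init__(self, d_model, vocab):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return F.log_softmax(self.proj(x), dim=-1)  # log-probabilities over the next token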


Assembling the Encoder-Decoder
Transformers From The Ground Up
• Encoder-Decoder
  • Stack N of the encoder and decoder layers
  • Add the Generator on top

[Diagram: the full architecture with N-layer encoder and decoder stacks and the linear + softmax head producing output probabilities for y_t]

2025-02-28 Slides created for CS886 at UWaterloo 79


Assembling the Encoder-Decoder
Transformers From The Ground Up
• Encoder-Decoder implementation

2025-02-28 Slides created for CS886 at UWaterloo 80
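A sketch of the encoder and decoder stacks and the wrapper that ties them together; note that every decoder layer attends to the same final encoder output (memory), which is the point of the misconception discussed next:

import torch.nn as nn

class Encoder(nn.Module):
    "Stack of N encoder layers with a final layer norm."
    def __init__(self, layer, N):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class Decoder(nn.Module):
    "Stack of N decoder layers; each one cross-attends to the same encoder output."
    def __init__(self, layer, N):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

class EncoderDecoder(nn.Module):
    "Standard encoder-decoder: embed, encode, decode, then generate."
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask, tgt_mask):
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)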


Misconceptions about Transformers (2)
• What?
  • The notion of a whole “transformer block” that is stackable in the vanilla transformer architecture
  • The incorrect belief that the encoder-decoder attention connection is layer-wise
• Why?
  • Incorrect understanding of stacking layers
  • A pervasive amount of bad figures

2025-02-28 Slides created for CS886 at UWaterloo 81


Misconceptions about Transformers (2)

Incorrect: each decoder layer cross-attends to the encoder layer at the same depth (layer-wise connections).
Correct: every decoder layer cross-attends to the final output of the whole encoder stack.

[Diagram: six encoder and six decoder boxes; the incorrect version draws one arrow per matching layer pair, the correct version draws arrows from the top encoder output to every decoder layer]

2025-02-28 Slides created for CS886 at UWaterloo 82




Putting it all together
Transformers From The Ground Up
[Full architecture diagram once more, now with everything assembled: input embeddings x_1 … x_n plus positional encoding → N encoder layers; output embeddings (y_1 … y_{t−1}, shifted right) plus positional encoding → N decoder layers with masked self-attention and cross-attention; linear + softmax produce the output probabilities for y_t.]
2025-02-28 Slides created for CS886 at UWaterloo 84
Putting it all together
Transformers From The Ground Up

2025-02-28 Slides created for CS886 at UWaterloo 85
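A sketch that assembles everything from the previous sketches into one model, with the paper's hyperparameters as defaults and Xavier initialization so the cloned layers start out independent:

import copy
import torch.nn as nn

def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    for p in model.parameters():          # initialize all parameters
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model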


Training Transformers
• “Architecture alone does not make a model”

  Architecture + Training = Model

• A model expresses different properties depending on how it is trained
  • Like nature vs. nurture, both impact what the model does
  • Training is what determines the parameters
2025-02-28 Slides created for CS886 at UWaterloo 86
Training Transformers
• Models fit to their training data

• If shown examples that encourage bidirectional attention, the model will learn that

• If shown only examples that require attending in one direction, it may express more unidirectional behaviour (and won’t generalize as well)

• BERT uses large-scale pre-training to encourage bidirectional behaviour

2025-02-28 Slides created for CS886 at UWaterloo 87


Training Transformers
• Masked training
  • The attention mechanism can support masking directly
• Motivation:
  • We want to prevent the model from learning from future information in the output sequence
• Main idea:
  • Since each decoder layer starts with a self-attention block, we can add custom logic to mask out positions in the target sequence that it shouldn’t see yet
  • Implemented as a rolling (lower-triangular) window over positions

2025-02-28 Slides created for CS886 at UWaterloo 88


Training Transformers
• Masked training
  • Masked positions have their scores set to a very negative value, so softmax(−1e9) ≈ 0

[Figure: the lower-triangular subsequent-position mask]
2025-02-28 Slides created for CS886 at UWaterloo 89
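A sketch of the decoder mask: a lower-triangular matrix of allowed positions, combined with the −1e9 trick inside the attention function sketched earlier:

import torch

def subsequent_mask(size):
    "Mask where position t may attend only to positions <= t."
    attn_shape = (1, size, size)
    upper = torch.triu(torch.ones(attn_shape, dtype=torch.uint8), diagonal=1)
    return upper == 0   # True on and below the diagonal, False strictly above it

# Inside attention(): scores.masked_fill(mask == 0, -1e9)
# softmax(-1e9) is effectively 0, so masked (future) positions get no weight.
print(subsequent_mask(4)[0])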
Results and Impact

2025-02-28 Slides created for CS886 at UWaterloo 90


Performance
• Experimentation on text translation: (1) EN-DE and (2) EN-FR

A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
2025-02-28 Slides created for CS886 at UWaterloo 91
Paper Impact
• Highly influential

• Paper has 113,405 citations

• Transformer architecture has been used as the basis for many state-
of-the-art models

• Transformer is a fundamental building block of all LLMs (e.g. GPT-4,


LLaMA 2, Gemini, etc.)

2025-02-28 Slides created for CS886 at UWaterloo 92


Image Transformer
• Inspired by the transformer architecture on text

• Image pixels are positionally encoded and used in a “local self-attention” mechanism
  • Considers only the pixels in a local neighborhood around the query position (e.g. the nearest 8×8 pixels)

• Aside: not used heavily now
  • Global attention over images seems to be more effective
  • Subsequent models are more powerful
N. Parmar et al., “Image Transformer,” in ICML, 2018, pp. 4052–4061.
2025-02-28 Slides created for CS886 at UWaterloo 93
Image Transformer

N. Parmar et al., “Image Transformer,” in ICML, 2018, pp. 4052–4061.


2025-02-28 Slides created for CS886 at UWaterloo 94
Vision Transformers (ViT)
• Applies a vanilla transformer encoder to image classification

• Convert images to “sequences”:
  • Images are split into smaller patches
  • Patches are flattened and treated as a sequence

A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” presented at ICLR, 2020. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
2025-02-28 Slides created for CS886 at UWaterloo 95
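A minimal sketch of the patch-splitting step, assuming 224×224 RGB images and 16×16 patches; the class token, positional embeddings, and the learned linear projection of each patch are omitted:

import torch

def image_to_patches(images, patch_size=16):
    "Split images (B, C, H, W) into a sequence of flattened patches (B, num_patches, C*P*P)."
    B, C, H, W = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    return patches

x = torch.randn(1, 3, 224, 224)
print(image_to_patches(x).shape)   # torch.Size([1, 196, 768]) -> a 196-token "sequence"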
Discussion

2025-02-28 Slides created for CS886 at UWaterloo 96


Discussion Questions
1. What misconceptions did you have about attention or transformers? How would
you explain that concept to a novice learner to avoid that misconception?

2. What elements of the architecture and training paradigm allow the multiple
heads to learn different representations, thereby preventing them from
converging/collapsing to the same content?

3. The Annotated Transformer implementation swaps the paper’s post-norm residual, LayerNorm(x + Sublayer(x)), for the pre-norm form x + Sublayer(LayerNorm(x)) — why does this still work?


2025-02-28 Slides created for CS886 at UWaterloo 97
Thank you!
Questions?

2025-02-28 Slides created for CS886 at UWaterloo 98


Multi-Head Self-Attention Visualizations

J. Alammar, “The Illustrated Transformer,” 2020. [Online]. Available: https://jalammar.github.io/illustrated-transformer/


2025-02-28 Slides created for CS886 at UWaterloo 99
