Lecture-28-TransformerIntroductionFinal-1
CAP6412 Spring 2024
Mubarak Shah
[email protected]
[Figure: Encoder and Decoder blocks with attention maps; matrix shapes 9x3, 3x3, and 9x9]
Transformer
Slides from Ming Li, University of Waterloo, CS 886 Deep Learning and NLP
02 Attention and Transformers
Encoder
Self Attention
• qi = xi WQ
• ki = xi WK
• vi = xi WV
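A minimal NumPy sketch of these projections (the sizes and random weights below are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 512, 64                    # illustrative sizes; d_k = 64 as on the next slide
X = rng.standard_normal((3, d_model))     # three input token embeddings x_i (one per row)

# learned projection matrices W_Q, W_K, W_V
W_Q = rng.standard_normal((d_model, d_k)) * 0.02
W_K = rng.standard_normal((d_model, d_k)) * 0.02
W_V = rng.standard_normal((d_model, d_k)) * 0.02

Q = X @ W_Q   # q_i = x_i W_Q
K = X @ W_K   # k_i = x_i W_K
V = X @ W_V   # v_i = x_i W_V
print(Q.shape, K.shape, V.shape)          # (3, 64) each
```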
Self Attention
Now we calculate a score to determine how much focus to place on other parts of the input.
Self Attention
Formula
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
d_k = 64 is the dimension of the key vectors
Example: z1 = 0.88 v1 + 0.12 v2
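A sketch of scaled dot-product attention, reusing Q, K, V from the projections above; the softmax weights it produces are exactly the coefficients in the example (e.g. 0.88 and 0.12):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, with the softmax taken over the keys."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    Z = weights @ V                                      # z_i = sum_j w_ij * v_j
    return Z, weights

# z_1 is a weighted sum of the value vectors, e.g. 0.88*v_1 + 0.12*v_2
```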
Multiple heads
1. It expands the model's ability to focus on different positions.
2. It gives the attention layer multiple "representation subspaces".
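A minimal multi-head sketch (the head count and per-head sizes are assumptions for illustration): each head has its own W_Q, W_K, W_V, so it attends in its own subspace, and the concatenated head outputs are mixed by an output projection W_O.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """W_Q, W_K, W_V are lists of per-head projection matrices; W_O mixes the concatenated heads."""
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V)                        # each head attends in its own subspace
    return np.concatenate(heads, axis=-1) @ W_O    # project back to the model dimension
```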
Position Encoding
• Position encodings can also be learned
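A sketch of the sinusoidal position encoding used in the original Transformer; as the bullet notes, a learned embedding table can be used instead.

```python
import numpy as np

def sinusoidal_position_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...). Assumes d_model is even."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                      # added to the token embeddings
```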
The complete transformer
• K and V come from the encoder output; Q comes from the decoder
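A sketch of this cross-attention step in the complete transformer (shapes and weights are illustrative): the decoder's queries attend over the encoder's keys and values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
memory = rng.standard_normal((9, d_model))          # encoder output (9 source tokens) -> K, V
decoder_states = rng.standard_normal((3, d_model))  # decoder states (3 target tokens) -> Q

W_Q = rng.standard_normal((d_model, d_model)) * 0.02
W_K = rng.standard_normal((d_model, d_model)) * 0.02
W_V = rng.standard_normal((d_model, d_model)) * 0.02

Q = decoder_states @ W_Q     # queries come from the decoder
K = memory @ W_K             # keys come from the encoder output
V = memory @ W_V             # values come from the encoder output

scores = Q @ K.T / np.sqrt(d_model)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
cross_attended = w @ V       # (3, d_model): each target token attends over all source tokens
```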
BERT (Stack of Encoder Blocks)
BERT (Bidirectional Encoder Representations from Transformers)
• BERT jointly encodes the right and left context of a word in a sentence to improve the learned feature representations
• BERT is trained on two pretext tasks in a self-supervised manner
• Masked Language Model (MLM)
• Mask a fixed percentage (15%) of the words in a sentence and predict these masked words
• In predicting the masked words, the model learns the bidirectional context.
• Next Sentence Prediction (NSP)
• Given a pair of sentences A and B, the model predicts a binary label, i.e., whether the pair is valid (taken from the original document) or not
• The pair is formed such that B is the actual next sentence (following A) 50% of the time, and B is a random sentence the other 50% of the time
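A sketch of the MLM masking step described above (the 15% rate follows the slide; the vocabulary ids and the mask_id in the usage line are made-up placeholders):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Pick ~15% of positions, replace them with [MASK], and return labels that are
    -100 (ignored by the loss) everywhere except at the masked positions."""
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)
    masked = rng.random(token_ids.shape) < mask_prob
    labels[masked] = token_ids[masked]      # the model must predict these original tokens
    token_ids[masked] = mask_id             # hide them from the input
    return token_ids, labels

# usage (illustrative ids): mask_tokens([101, 2009, 2003, 1037, 7279, 102], mask_id=103)
```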
https://www.youtube.com/shorts/BEt_BACGw6g
Vision Transformers
Mubarak Shah
[email protected]
[Figure fragments: CLS token; channel dimensions C = 96, 128, 192]