DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them
for educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode.
Transformers
vs RNNs
deeplearning.ai
Outline
● Loss of information
● T sequential steps
● Vanishing gradient
RNNs vs Transformer: Encoder-Decoder
[Figure: seq2seq encoder-decoder built from LSTMs with an attention mechanism, translating "It's time for tea" into "C'est …": encoder hidden states h1–h4, context vector c, decoder state s_{i-1}, <sos> fed to the decoder]
● Transformers don't use RNNs, such as LSTMs or GRUs
Transformers
Overview
deeplearning.ai
The Transformer Model
https://arxiv.org/abs/1706.03762
Scaled Dot-Product Attention
Inputs: Queries, Keys, Values
(Vaswani et al., 2017)
Multi-Head Attention
● Scaled dot-product attention applied multiple times in parallel
● Linear transformations of the input queries, keys, and values
The Encoder
● Provides a contextual representation of each item in the input sequence
[Figure: Transformer encoder-decoder overview — embeddings feed an Encoder with self-attention and a Decoder with masked self-attention]
● Easy to parallelize!
Summary
[Figure: the Queries Q come from the embedding stack of "Je suis heureux"; the Keys K and Values V come from the embedding stack of "I am happy". The K and V stacks generally have the same number of rows]
Attention Math

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

The result has one row per query (the number of queries) and one column per value dimension (the size of the value vector): a context vector for each query.
● Ways of Attention
[Figure: attention weight matrix relating "c'est l'heure du thé" to "it's time for tea" — queries come from one sentence, keys and values from the other]
Self-Attention
● Queries, keys, and values come from the same sentence
[Figure: self-attention weight matrix over "it's time for tea" — captures the meaning of each word within the sentence]
Masked Self-Attention
● Queries, keys, and values come from the same sentence
● Queries don't attend to future positions
[Figure: masked self-attention weight matrix over "it's time for tea"]
Masked self-attention math

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k + M) V

The mask M has 0 on and below the diagonal and minus infinity above it, so after the softmax the weights assigned to future positions are equal to 0.
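A sketch of the same idea with the mask added, again in NumPy and with illustrative names; Q, K, and V are taken equal to the raw embeddings x for brevity, whereas the full model uses learned projections.

```python
import numpy as np

def masked_self_attention(x):
    """Self-attention over one sequence x of shape (seq_len, d), where a
    position may attend only to itself and to earlier positions."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # (seq_len, seq_len)
    # Mask M: 0 on and below the diagonal, a very large negative number
    # (standing in for minus infinity) above it.
    mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ x   # future positions contribute (almost) zero weight
```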
Summary
[Figure: original embeddings of "it's time for tea" and "c'est l'heure du thé", and how two attention heads (Head 1, Head 2) group the same words differently]
Multi-Head Attention - Overview
● Apply scaled dot-product attention in parallel over multiple heads
● Learnable linear layers transform the queries, keys, and values for each head
● Concatenate the per-head outputs and apply a final linear layer
● Result: context vectors for each query, with dimension equal to the embedding size
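The overview above can be sketched in a few lines of NumPy, reusing the scaled_dot_product_attention function from earlier; the dictionary of projection matrices and the splitting of heads by slicing columns are simplifying assumptions, not the course implementation.

```python
import numpy as np

def multi_head_attention(x, params, n_heads):
    """Multi-head self-attention sketch.

    x: (seq_len, d_model); params holds learnable projection matrices
    'wq', 'wk', 'wv', 'wo', each of shape (d_model, d_model).
    Assumes d_model is divisible by n_heads.
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = x @ params["wq"], x @ params["wk"], x @ params["wv"]
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        # Scaled dot-product attention on this head's slice of Q, K, V.
        heads.append(scaled_dot_product_attention(q[:, cols], k[:, cols], v[:, cols]))
    concat = np.concatenate(heads, axis=-1)   # (seq_len, d_model)
    return concat @ params["wo"]              # final linear layer
```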
Summary
[Figure: input embedding plus positional encoding applied to the inputs "<start> I am happy"]
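The positional encoding block shown here is the sinusoidal one from Vaswani et al. (2017); a small NumPy sketch of it, assuming an even embedding size, is:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding (Vaswani et al., 2017):
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    The result is added element-wise to the input embeddings."""
    pos = np.arange(max_len)[:, None]                         # (max_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos / div)   # even columns
    pe[:, 1::2] = np.cos(pos / div)   # odd columns
    return pe
```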
The Transformer decoder
● Inputs go through an input embedding plus positional encoding
● Each decoder block applies multi-head attention followed by a feed-forward layer, with each sublayer wrapped in a residual connection and layer normalization ("Add & Norm"): the output vector is LayerNorm(x + Sublayer(x))
● A final linear layer and softmax produce the output probabilities
[Figure: decoder architecture showing the stacked decoder blocks]
The Transformer decoder
Feed forward layer
● Each Add & Norm output passes through a feed-forward layer with ReLU activations, applied at every position
● Decoder blocks and feed-forward blocks are the core of this model's code
[Figure: decoder architecture with the feed-forward (ReLU) layers highlighted]
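Putting the pieces together, a decoder block can be sketched as follows, reusing the multi-head attention sketch above; in the actual decoder the attention inside the block is masked, and the layer normalization would carry learnable scale and shift parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization over the feature dimension (scale/shift omitted)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward layer with a ReLU activation."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def decoder_block(x, attn_params, ffn_params, n_heads):
    """One decoder block: attention and feed-forward sublayers, each wrapped
    in a residual connection plus layer norm, i.e. LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + multi_head_attention(x, attn_params, n_heads))
    x = layer_norm(x + feed_forward(x, *ffn_params))
    return x
```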
Technical details for data processing

Model input: ARTICLE TEXT <EOS> SUMMARY <EOS> <pad> …

Tokenized version: [2, 3, 5, 2, 1, 3, 4, 7, 8, 2, 5, 1, 2, 3, 6, 2, 1, 0, 0]

Loss weights: 0s until the first <EOS>, then 1s from the start of the summary.
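A small sketch of how those loss weights could be built from the tokenized input; the eos_id and pad_id values are assumptions that depend on the tokenizer, and padding is also given weight 0 here.

```python
def loss_weights(token_ids, eos_id, pad_id=0):
    """0 up to and including the first <EOS> (the article part),
    1 over the summary tokens that follow, 0 again on padding."""
    weights, in_summary = [], False
    for tok in token_ids:
        weights.append(1 if (in_summary and tok != pad_id) else 0)
        if tok == eos_id:
            in_summary = True
    return weights
```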
Cost function

Cross-entropy loss:

J = −(1/m) Σᵢ Σⱼ yⱼ⁽ⁱ⁾ log ŷⱼ⁽ⁱ⁾

● j: over the summary positions
● i: batch elements (m examples in the batch)
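Combined with the loss weights from the previous slide, the cost can be sketched as a weighted cross-entropy in NumPy; the normalization by the batch size m follows the formula above, though implementations sometimes normalize by the number of weighted tokens instead.

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    """Cross-entropy counted over the summary positions only.

    log_probs: (batch, seq_len, vocab) log-probabilities from the softmax
    targets:   (batch, seq_len) target token ids
    weights:   (batch, seq_len) 0/1 loss weights
    """
    m, seq_len, _ = log_probs.shape
    # log P(correct token) at every position of every batch element.
    picked = log_probs[np.arange(m)[:, None], np.arange(seq_len)[None, :], targets]
    return -(picked * weights).sum() / m
```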
Inference with a Language Model
Model input:
[Article] <EOS> [Summary] <EOS>
Inference:
● Provide: [Article] <EOS>
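A minimal greedy-decoding sketch of this inference loop; the model(tokens) call, assumed to return next-token log-probabilities for every position, is an illustrative interface rather than the course API.

```python
import numpy as np

def summarize(model, article_ids, eos_id, max_len=50):
    """Feed [Article] <EOS>, then append the most probable next token
    one step at a time until the model produces <EOS> again."""
    tokens = list(article_ids) + [eos_id]
    summary = []
    for _ in range(max_len):
        log_probs = model(tokens)            # shape (len(tokens), vocab)
        next_tok = int(np.argmax(log_probs[-1]))
        if next_tok == eos_id:
            break
        summary.append(next_tok)
        tokens.append(next_tok)
    return summary
```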