0% found this document useful (0 votes)

44 views

Deeplearning - Ai Deeplearning - Ai

These slides are distributed under a Creative Commons license which allows for educational use and distribution as long as DeepLearning.ai is cited as the source. The license details are available at a specified URL and prohibit commercial use or distribution of the slides without permission. Transformers help address issues with RNNs like parallel computing limitations, loss of information for long sequences, and vanishing gradients.

Uploaded by

9f8z4k2cxs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

44 views

Deeplearning - Ai Deeplearning - Ai

Uploaded by

9f8z4k2cxs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 58

Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them
for educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode.
Transformers
vs RNNs
deeplearning.ai
Outline

● Issues with RNNs

● Comparison with Transformers

Neural Machine Translation

Comment allez- vous

How are you

No parallel computing!
Seq2Seq Architectures

...

Loss of information
T sequential steps
...

...

...
...
Vanishing gradient
RNNs vs Transformer: Encoder-Decoder
C’est

⊕ c Decoder

h1 h2 h3 h4
si-1
Attention <sos>
Encoder Mechanism

LSTMs
It’s time for tea
Transformers don’t use RNNs, such
as LSTMs or GRUs
Transformers
Overview
deeplearning.ai
The Transformer Model

https://arxiv.org/abs/1706.03762
Scaled Dot-Product Attention

Values
Queries Keys
(Vaswani et al., 2017)
Multi-Head Attention

Scaled dot-product
attention multiple times in
parallel

Linear transformations of
the input queries, keys and
values
The Encoder

Provides contextual
representation of each item
in the input sequence

Self-Attention

Every item in the input

attends to every other item
in the sequence
The Decoder

Every position from the

Encoder-Decoder
decoder attents to the
Attention
outputs from the encoder

Masked Self-Attention

Every position attends to

previous positions
RNNs vs Transformer: Positional Encoding

POSITIONAL 0 0 1 1 0.84 0.0001 0.52 1 0.91 0.0002 -0.42 1

ENCODING

EMBEDDINGS

INPUT Je suis content

The Transformer

Decoder
Encoder

Easy to parallelize!
Summary

● In RNNs parallel computing is difﬁcult to implement

● For long sequences in RNNs there is loss of information

● In RNNs there is the problem of vanishing gradient

● Transformers help with all of the above

Transformer
Applications
deeplearning.ai
Outline

● Transformers applications in NLP

● Some Transformers
● Introduction to T5
Transformer NLP applications
Translation
Text
summarization
Chat-bots
te Auto-Complete
Other NLP tasks
Named entity Sentiment Analysis
recognition (NER) Market Intelligence
Question Text Classiﬁcation
answering (Q&A) Character Recognition
Spell Checking
State of the Art Transformers

Radford, A., et al. (2018) GPT-2: Generative Pre-training for

Open AI Transformer

Devlin, J., et al. (2018) BERT:Bidirectional Encoder

Google AI Language Representations from Transformers

Colin, R., et al. (2019) T5: Text-to-text transfer transformer

Google
T5: Text-To-Text Transfer Transformer

Translate English into French: “I am happy” “Je suis content”

Translation
Unacceptable
Cola sentence: “He bought fruits and.” Classiﬁcation
*Cola stands for “Corpus of Linguistic Acceptability”
T5 Acceptable
Cola sentence: “He bought fruits and
vegetables.” Q&A

Question: Which volcano in Tanzania is the Answer: Mount

highest mountain in Africa? Kilimanjaro
T5: Text-To-Text Transfer Transformer
Stsb sentence1: “Cats and dogs are
mammals.” Sentence2: “There are four
known forces in nature – gravity, 0.0
electromagnetic, weak and strong.” Regression

Stsb sentence1: “Cats and dogs are

mammals.” Sentence2:“Cats, dogs, and T5 2.6
cows are domesticated.”
Summarization

Summarize: “State authorities “Six people

dispatched emergency crews Tuesday to hospitalized
survey the damage after an onslaught of after a storm in
severe weather in mississippi…” Attala county”
T5: Demo
Summary

● Transformers are suitable for a wide range of NLP applications

● Some transformers include GPT, BERT and T5

● T5 is a powerful multi-task transformer

Scaled Dot-Product
Attention
deeplearning.ai
Outline

● Revisit scaled dot product attention

● Mathematics behind Attention

Scaled dot-product attention
Improves
Weights add up to 1 performance

Queries Values Weighted sum of values V

Keys
Just two matrix multiplications
(Vaswani et al., 2017)
and a Softmax!
Queries, Keys and Values Size of the
Je suis heureux embedding

Embedding Stack
Je suis heureux Q

I am happy
Embedding Stack
I am happy K

Same
Generally the
number of
same
Stack rows
V
Attention Math

Context vectors
for each query

Number of
queries

Size of the
value vector

Weight assigned to the third key

for the second query
Summary

● Scaled Dot-product Attention is essential for Transformer

● The input to Attention are queries, keys, and values

● GPUs and TPUs

Masked
Self-Attention
deeplearning.ai
Outline

● Ways of Attention

● Overview of masked Self-Attention

Encoder-Decoder Attention
Queries from one sentence, keys and values from another

it’s time for tea

c’est

l’heure
Weight matrix

thé
Self-Attention
Queries, keys and values come from the same sentence

it’s time for tea

it’s

time
Weight matrix

for
Meaning of each
word within the
tea sentence
Masked Self-Attention
Queries, keys and values come from the same sentence. Queries don’t
attend to future positions.
it’s time for tea

it’s

time
Weight matrix

for

tea
Masked self-attention math
Minus inﬁnity

0
0 0
0 0 0

0 0
Weights assigned to future
0
positions are equal to 0
Summary

● There are three main ways of Attention: Encoder/Decoder,

self-attention and masked self-attention.

● In self-attention, queries and keys come from the same sentence

● In masked self-attention queries cannot attend to the future

Multi-head
Attention
deeplearning.ai
Outline

● Intuition Multi-Head Attention

● Math of Multi-Head Attention

Multi-Head Attention - Overview
Queries Keys Values

c’est it’s
thé it’s
Original Embeddings tea tea
l’heure time time
du for for

Head 1 Head 2
for tea
thé it’s it’s it’s
du
it’s c’est time tea
tea time tea thé
c’est time
l’heure time l’heure
for
for
for du
Multi-Head Attention - Overview

Linear

Concatenation

Learnable parameters
Scaled Dot-Product heads
Attention

Linear Linear Linear heads

Queries Keys Values

Multi-Head Attention
Head 1

Attention
Context vectors
for each query

Head 2 Concat

Attention

Usual choice of dimensions

: Embedding size
Summary

● Multi-Headed models attend to information from different

representations
● Parallel computations

● Similar computational cost to single-head attention

Transformer
decoder
deeplearning.ai
Outline

● Overview of Transformer decoder

● Implementation (decoder and feed-forward block)

Transformer decoder Overview
Output Probabilities
SoftMax ● input: sentence or paragraph
○ we predict the next word
Linear
● sentence gets embedded, add positional encoding
Add & Norm ○ (vectors representing )
Feed
Forward ● multi-head attention looks at previous words

Add & Norm ● feed-forward layer with ReLU

Multi-Head
○ that’s where most parameters are!
Attention
● residual connection with layer normalization
Positional
Encoding ● repeat N times
Input
Embedding ● dense layer and softmax for output
Inputs
Transformer decoder
Output Probabilities
SoftMax
Explanation
Linear
Add & Norm
Feed
Forward Decoder Block

Add & Norm

Multi-Head
Attention Positional Encoding

Positional
Input Embedding
Encoding
Input
Embedding
<start> I am happy
Inputs
The Transformer decoder
Output Probabilities
Add & Norm Decoder
SoftMax
Block
Linear
Feed Feed Feed
Add & Norm Forward Forward Forward
Feed
Forward
Add & Norm
Multi-Head LayerNormAdd
( & Norm
+ )
Attention
Output Vector
Positional
Encoding
Input Multi-Head Attention
Embedding
Positional input
Inputs embedding
The Transformer decoder
Feed forward layer
Output Probabilities
SoftMax
Linear
Add & Norm
Feed Feed Forward Feed Forward
Forward (ReLu) (ReLu)

Add & Norm

Multi-Head
Attention

Positional Self Attention

Encoding
Input
Embedding
Inputs
The Transformer decoder
Feed forward layer
Output Probabilities
SoftMax
Linear
Add & Norm
Feed Feed Forward Feed Forward
Forward (ReLu) (ReLu)

Add & Norm

Multi-Head
Attention

Positional Self Attention

Encoding
Input
Embedding
Inputs
Summary

● Transformer decoder mainly consists of three layers

● Decoder and feed-forward blocks are the core of this model code

● It also includes a module to calculate the cross-entropy loss

Transformer
summarizer
deeplearning.ai
Outline

● Overview of Transformer summarizer

● Technical details for data processing

● Inference with a Language Model

Transformer for summarization
Output Probabilities
SoftMax
Linear Input Output:
Add & Norm Summary
Feed
Forward
Add & Norm
Multi-Head
Attention

Positional
Encoding
Input
Embedding
Inputs
Technical details for data processing
Output Probabilities
SoftMax Model Input:
Linear
ARTICLE TEXT <EOS> SUMMARY <EOS> <pad> …
Add & Norm
Feed
Forward
Tokenized version:
Add & Norm
Multi-Head
Attention [2,3,5,2,1,3,4,7,8,2,5,1,2,3,6,2,1,0,0]

Positional
Encoding Loss weights: 0s until the ﬁrst <EOS> and then
Input
Embedding 1 on the start of the summary.
Inputs
Cost function
Output Probabilities Cross entropy loss
SoftMax
Linear
Add & Norm
Feed
Forward
Add & Norm
Multi-Head : over summary
Attention
: bach elements
Positional
Encoding
Input
Embedding
Inputs
Inference with a Language Model
Model input:
[Article] <EOS> [Summary] <EOS>

Inference:
● Provide: [Article] <EOS>

● Generate summary word-by-word

○ until the ﬁnal <EOS>

● Pick the next word by random sampling

○ each time you get a different summary!
Summary

● For summarization, a weighted loss function is optimized

● Transformer Decoder summarizes predicting the next word using

● The transformer uses tokenized versions of the input

Oracle: Question & Answers
No ratings yet
Oracle: Question & Answers
11 pages
The Illustrated Transformer - Jay Alammar - Visualizing Machine Learning One Concept at A Time - .Booklet
No ratings yet
The Illustrated Transformer - Jay Alammar - Visualizing Machine Learning One Concept at A Time - .Booklet
14 pages
C4_W2
No ratings yet
C4_W2
57 pages
Attn Is All You Need
No ratings yet
Attn Is All You Need
15 pages
Transformer
No ratings yet
Transformer
5 pages
2024_Transformer_master
No ratings yet
2024_Transformer_master
50 pages
02-Transformer Based NLP Applications
No ratings yet
02-Transformer Based NLP Applications
57 pages
20190630transformer-210110081057
No ratings yet
20190630transformer-210110081057
32 pages
Attention is all you need
No ratings yet
Attention is all you need
15 pages
1706.03762v1
No ratings yet
1706.03762v1
15 pages
Attention_is_All_You_Need__Explained
No ratings yet
Attention_is_All_You_Need__Explained
46 pages
495 Lecture 10 Attall
No ratings yet
495 Lecture 10 Attall
18 pages
Tranformrerz
No ratings yet
Tranformrerz
62 pages
Natural Language Processing With Deep Learning CS224N/Ling284
No ratings yet
Natural Language Processing With Deep Learning CS224N/Ling284
62 pages
Transformer
No ratings yet
Transformer
10 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
4 pages
Deep Neural Network Module 7 Attention Transformer
No ratings yet
Deep Neural Network Module 7 Attention Transformer
40 pages
attention_transformer
No ratings yet
attention_transformer
41 pages
Lecture Notes - Advanced Language Model - BERT, GPT
No ratings yet
Lecture Notes - Advanced Language Model - BERT, GPT
24 pages
A1
No ratings yet
A1
11 pages
11 Transformers Notes
No ratings yet
11 Transformers Notes
25 pages
Attention is All you Need - NIPS-2017-attention-is-all-you-need-Paper
No ratings yet
Attention is All you Need - NIPS-2017-attention-is-all-you-need-Paper
11 pages
DR 68 V 7 BT 98 Ny 9 M
No ratings yet
DR 68 V 7 BT 98 Ny 9 M
23 pages
Lecture 2
No ratings yet
Lecture 2
39 pages
Lecture 25
No ratings yet
Lecture 25
13 pages
Transformers_v1.1
No ratings yet
Transformers_v1.1
1 page
7181-attention-is-all-you-need
No ratings yet
7181-attention-is-all-you-need
11 pages
Attention Is All You Need
67% (3)
Attention Is All You Need
11 pages
Attention Is All You Need Paper - Removed
No ratings yet
Attention Is All You Need Paper - Removed
9 pages
Transformers
No ratings yet
Transformers
41 pages
Transformer
No ratings yet
Transformer
58 pages
TRANSFORMER
No ratings yet
TRANSFORMER
29 pages
generative AI Unit 3 notes
No ratings yet
generative AI Unit 3 notes
8 pages
Transformer
No ratings yet
Transformer
41 pages
12 Transformer
No ratings yet
12 Transformer
41 pages
Notes 2 Transformer Model Architecture
No ratings yet
Notes 2 Transformer Model Architecture
4 pages
Apresentação Deep
No ratings yet
Apresentação Deep
28 pages
The Annotated Transformer: Alexander M. Rush
No ratings yet
The Annotated Transformer: Alexander M. Rush
9 pages
6 Transformers
No ratings yet
6 Transformers
77 pages
Class47 49 - AttentionBasedModels Transformers 10 15may2023
No ratings yet
Class47 49 - AttentionBasedModels Transformers 10 15may2023
27 pages
Transformers Illustraded
No ratings yet
Transformers Illustraded
31 pages
Transformer's Not Working Properly in This Room
No ratings yet
Transformer's Not Working Properly in This Room
65 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
15 pages
Example File
No ratings yet
Example File
3 pages
attention
No ratings yet
attention
15 pages
LLM .Foundation - Models.from - The.ground - Up
No ratings yet
LLM .Foundation - Models.from - The.ground - Up
195 pages
Attention Book Sample
No ratings yet
Attention Book Sample
32 pages
Transformer
No ratings yet
Transformer
59 pages
10 Attention N Bert
No ratings yet
10 Attention N Bert
55 pages
The Illustrated Transformer - Jay Alammar - Visualizing Machine Learning One Concept at A Time
No ratings yet
The Illustrated Transformer - Jay Alammar - Visualizing Machine Learning One Concept at A Time
22 pages
Aiayn
No ratings yet
Aiayn
15 pages
AATN Merged
No ratings yet
AATN Merged
139 pages
L.7
No ratings yet
L.7
54 pages
chapter_4
No ratings yet
chapter_4
24 pages
Chap6 Transformer (20240219) - DL4H practioner guide
No ratings yet
Chap6 Transformer (20240219) - DL4H practioner guide
36 pages
AN2DL_06_2324_AttentionAndTrasformers
No ratings yet
AN2DL_06_2324_AttentionAndTrasformers
60 pages
2022 AIOpen A Survey of Transformers Lin, Wang, Liu, Qiu
No ratings yet
2022 AIOpen A Survey of Transformers Lin, Wang, Liu, Qiu
22 pages
Unlocking Linguistic Intelligence_ Attention Mechanisms and Transformer Architectures in NLP (1)
No ratings yet
Unlocking Linguistic Intelligence_ Attention Mechanisms and Transformer Architectures in NLP (1)
117 pages
C Programming for Arduino
From Everand
C Programming for Arduino
Julien Bayle
4/5 (13)
Python Text Processing with NLTK 2.0 Cookbook: LITE
From Everand
Python Text Processing with NLTK 2.0 Cookbook: LITE
Jacob Perkins
4/5 (1)
Learn Java Programming in 24 Hours
From Everand
Learn Java Programming in 24 Hours
PublishDrive
No ratings yet
An Overview of Affective Speech Synthesis and Conversion in The Deep Learning - Triantafyllopoulos Et Al - 2022
No ratings yet
An Overview of Affective Speech Synthesis and Conversion in The Deep Learning - Triantafyllopoulos Et Al - 2022
24 pages
Blue Simple Artificial Intelligence Presentation
No ratings yet
Blue Simple Artificial Intelligence Presentation
37 pages
maurits-bleeker-phd-thesis-2024
No ratings yet
maurits-bleeker-phd-thesis-2024
195 pages
Conversational Text Extraction With Large Language Models Using Retrieval-Augmented Systems
No ratings yet
Conversational Text Extraction With Large Language Models Using Retrieval-Augmented Systems
7 pages
High-efficiency Transformer Based Network for Radar Interference Recognition
No ratings yet
High-efficiency Transformer Based Network for Radar Interference Recognition
9 pages
AI From Basics To Advanced Levels
No ratings yet
AI From Basics To Advanced Levels
3 pages
Comp 428 Cat Ii
No ratings yet
Comp 428 Cat Ii
2 pages
Gen AI
No ratings yet
Gen AI
26 pages
A Survey On Multimodal Bidirectional Machine Learning Translation of Image and Natural Language Processing
No ratings yet
A Survey On Multimodal Bidirectional Machine Learning Translation of Image and Natural Language Processing
14 pages
A Neural Machine Translation System For Kreol Repiblik Moris and English
No ratings yet
A Neural Machine Translation System For Kreol Repiblik Moris and English
12 pages
DeepClassic: Music Generation With Neural Neural Networks
No ratings yet
DeepClassic: Music Generation With Neural Neural Networks
6 pages
NepaliGPT 2.0: Nepali Text Understanding and Generation
No ratings yet
NepaliGPT 2.0: Nepali Text Understanding and Generation
9 pages
ChatGPT: Jack of All Trades, Master of None
No ratings yet
ChatGPT: Jack of All Trades, Master of None
40 pages
Question Bank-END SEM Exam_CI
No ratings yet
Question Bank-END SEM Exam_CI
2 pages
Intrusion Detection of Foreign Objects in Overhead Power System For Preventive Maintenance in High-Speed Railway Catenary Inspection
No ratings yet
Intrusion Detection of Foreign Objects in Overhead Power System For Preventive Maintenance in High-Speed Railway Catenary Inspection
12 pages
Automatic Extractive Text Summarization For Nepali Language With Bidirectional Encorder Representation Transformer and K Mean Clustering1
No ratings yet
Automatic Extractive Text Summarization For Nepali Language With Bidirectional Encorder Representation Transformer and K Mean Clustering1
16 pages
2401.00908
No ratings yet
2401.00908
16 pages
What Is A Large Language Model A Comprehensive LLMs Guide
No ratings yet
What Is A Large Language Model A Comprehensive LLMs Guide
18 pages
Learning To Prompt For Continual Learning
No ratings yet
Learning To Prompt For Continual Learning
13 pages
Fin Irjmets1642882332
No ratings yet
Fin Irjmets1642882332
17 pages
Untitled Collection 2ye0ujym Composite User Behaviour Assisted Rumour Detection Over 4abixgi4tv
No ratings yet
Untitled Collection 2ye0ujym Composite User Behaviour Assisted Rumour Detection Over 4abixgi4tv
6 pages
Liu A Data-Centric Solution To NonHomogeneous Dehazing Via Vision Transformer CVPRW 2023 Paper
No ratings yet
Liu A Data-Centric Solution To NonHomogeneous Dehazing Via Vision Transformer CVPRW 2023 Paper
10 pages
Med-BERT: Pretrained Contextualized Embeddings On Large-Scale Structured Electronic Health Records For Disease Prediction
No ratings yet
Med-BERT: Pretrained Contextualized Embeddings On Large-Scale Structured Electronic Health Records For Disease Prediction
13 pages
Neural Matching Fields Implici
No ratings yet
Neural Matching Fields Implici
14 pages
GenerativeAIforManagers Paperback
100% (1)
GenerativeAIforManagers Paperback
106 pages
Research Notes
No ratings yet
Research Notes
9 pages
How Stability AI's Stable Code Instruct 3B Outperforms Larger Models
No ratings yet
How Stability AI's Stable Code Instruct 3B Outperforms Larger Models
8 pages
firoz KHAN
No ratings yet
firoz KHAN
31 pages
Vision Transformer Attention With Multi-Reservoir Echo State
No ratings yet
Vision Transformer Attention With Multi-Reservoir Echo State
17 pages

Deeplearning - Ai Deeplearning - Ai

Uploaded by

Deeplearning - Ai Deeplearning - Ai

Uploaded by

Copyright Notice

These slides are distributed under the Creative Commons License.

● Issues with RNNs

● Comparison with Transformers

Comment allez- vous

How are you

Every item in the input

Every position from the

Every position attends to

POSITIONAL 0 0 1 1 0.84 0.0001 0.52 1 0.91 0.0002 -0.42 1

INPUT Je suis content

● In RNNs parallel computing is difﬁcult to implement

● For long sequences in RNNs there is loss of information

● In RNNs there is the problem of vanishing gradient

● Transformers help with all of the above

● Transformers applications in NLP

Radford, A., et al. (2018) GPT-2: Generative Pre-training for

Devlin, J., et al. (2018) BERT:Bidirectional Encoder

Colin, R., et al. (2019) T5: Text-to-text transfer transformer

Translate English into French: “I am happy” “Je suis content”

Question: Which volcano in Tanzania is the Answer: Mount

Stsb sentence1: “Cats and dogs are

Summarize: “State authorities “Six people

● Transformers are suitable for a wide range of NLP applications

● Some transformers include GPT, BERT and T5

● T5 is a powerful multi-task transformer

● Revisit scaled dot product attention

● Mathematics behind Attention

Queries Values Weighted sum of values V

Weight assigned to the third key

● Scaled Dot-product Attention is essential for Transformer

● The input to Attention are queries, keys, and values

● GPUs and TPUs

● Overview of masked Self-Attention

it’s time for tea

it’s time for tea

● There are three main ways of Attention: Encoder/Decoder,

● In self-attention, queries and keys come from the same sentence

● In masked self-attention queries cannot attend to the future

● Intuition Multi-Head Attention

● Math of Multi-Head Attention

Linear Linear Linear heads

Queries Keys Values

Usual choice of dimensions

● Multi-Headed models attend to information from different

● Similar computational cost to single-head attention

● Overview of Transformer decoder

● Implementation (decoder and feed-forward block)

Add & Norm ● feed-forward layer with ReLU

Add & Norm

Add & Norm

Positional Self Attention

Add & Norm

Positional Self Attention

● Transformer decoder mainly consists of three layers

● It also includes a module to calculate the cross-entropy loss

● Overview of Transformer summarizer

● Technical details for data processing

● Inference with a Language Model

● Generate summary word-by-word

● Pick the next word by random sampling

● For summarization, a weighted loss function is optimized

● Transformer Decoder summarizes predicting the next word using

● The transformer uses tokenized versions of the input

You might also like