The document provides an overview of the Transformer model. It discusses how the Transformer uses stacked encoders and decoders. Each encoder contains self-attention and feed-forward layers. The decoder contains these layers as well as an attention layer to focus on the input sequence. Positional encoding is added to maintain word order. The model uses residual connections and layer normalization. During decoding, the decoder predicts outputs sequentially while attending to the encoded input. A final linear and softmax layer convert outputs to predicted word probabilities.

Attention Is All You Need
Igor Caetano Diniz
Introduction
• The Transformer was proposed in the paper Attention Is All You Need.
• A TensorFlow implementation of it is available as part of the Tensor2Tensor package.
• We will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.
The Optimus P... Transformer
The Encoder and Decoder Stacks
Inside the Encoder
• The encoding component consists of a stack of encoders; the paper stacks six of them on top of one another.
• The decoding component is a stack of the same number of decoders.
• All encoders have an identical structure, although they do not share weights. Each encoder is composed of two distinct sub-layers: a self-attention layer and a feed-forward neural network.
The Role of Tensors
• The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
• The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
• As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
• Each word is embedded into a vector of size 512. We'll represent those vectors with simple boxes (a small sketch of this lookup follows below).
• An encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, and then sends the output upwards to the next encoder.
• The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network, with each vector flowing through it separately.
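A minimal sketch of that embedding step, with a hypothetical three-word vocabulary and a randomly initialized lookup table (in a real model the table is learned):

import numpy as np

d_model = 512                                # embedding size used in the paper
vocab = {"je": 0, "suis": 1, "etudiant": 2}  # hypothetical toy vocabulary
embedding_table = np.random.randn(len(vocab), d_model)   # learned in practice, random here

sentence = ["je", "suis", "etudiant"]
# turn each input word into a vector of size 512 via a table lookup
input_vectors = [embedding_table[vocab[word]] for word in sentence]
print(len(input_vectors), input_vectors[0].shape)         # 3 vectors, each of shape (512,)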
Self-Attention
• Example: "The animal didn't cross the street because it was too tired."
• Question: what does "it" refer to in this sentence?
Calculating Self-Attention

• The softmax function normalizes the self-attention scores so that they are all positive and add up to 1.
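For reference, the softmax and the full scaled dot-product self-attention computation from the paper can be written as

\[
\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}},
\qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]

where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors (64 in the paper).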
The Beast With Many Heads – Multi-Head Attention
• It gives the attention layer multiple "representation subspaces".
• As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired".
• In a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
Positional Encoding

• To give the model a sense of the order of the words, we add positional encoding vectors, the values of which follow a specific pattern (see the formulas below).
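The specific pattern used in the paper is sinusoidal: each dimension of the positional encoding corresponds to a sine or cosine of a different frequency,

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]

where pos is the position of the word and i indexes the dimension.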
A Real Example of Positional Encoding

• A real example of positional encoding with a toy embedding size of 4.
A Real Example of Positional Encoding

• A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns).
• It appears split in half down the center because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They are then concatenated to form each of the positional encoding vectors.
Another representation

• It is from the Tranformer2Transformer


implementation of the Transformer. The
method shown in the paper is slightly
different in that it doesn’t directly
concatenate, but interweaves the two
signals. The following figure shows what that
looks like.
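A minimal NumPy sketch of the concatenated variant pictured on the previous slide (left half sine, right half cosine); the function name and exact frequency schedule are illustrative rather than copied from Tensor2Tensor:

import numpy as np

def positional_encoding(num_positions, d_model):
    # one frequency per sine/cosine pair, as in the sinusoidal formulas above
    positions = np.arange(num_positions)[:, np.newaxis]        # (num_positions, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, 2.0 * i / d_model)  # (num_positions, d_model / 2)
    # concatenate: left half sine, right half cosine
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = positional_encoding(20, 512)   # 20 words (rows) x 512 dimensions (columns)
print(pe.shape)                     # (20, 512)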
Residuals

• Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
• If we were to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this image.
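In the notation of the paper, the output of each sub-layer is therefore

\[
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
\]

where Sublayer(x) is the function implemented by the sub-layer itself (self-attention or the feed-forward network) and x is its input, carried around it by the residual connection.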
Residuals

• This is what it looks like if we think of a Transformer of two stacked encoders and decoders.
The Decoder Side

• After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element of the output sequence (the English translation sentence in this case).
The Decoder Side

• Repeat the following steps until a special symbol is reached, indicating that the transformer decoder has completed its output (a sketch of this loop follows below).
• At each step, feed the output to the bottom decoder in the next time step.
• Like the encoders, the decoders propagate their decoding results upwards.
• As with the encoder inputs, embed the decoder inputs and add positional encoding to indicate the position of each word.
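A minimal Python sketch of that decoding loop; encode, decode, and the token ids here are hypothetical stand-ins, not the Tensor2Tensor API:

START_TOKEN, END_TOKEN, MAX_LEN = 1, 2, 50   # illustrative ids and length limit

def greedy_decode(encode, decode, source_tokens):
    memory = encode(source_tokens)           # run the encoder stack once
    output = [START_TOKEN]                   # the decoder starts from a start symbol
    for _ in range(MAX_LEN):
        # the decoder attends to `memory` and to its own previous outputs
        next_token = decode(memory, output)  # most probable next token id
        output.append(next_token)
        if next_token == END_TOKEN:          # special symbol: decoding is complete
            break
    return output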
The Final Linear Layer and Softmax Layer

• The decoder stack outputs a vector of floats; converting it into a word is the job of the final Linear layer followed by a Softmax layer.
• The Linear layer projects the decoder's vector into a much larger vector called logits (one score per word in the vocabulary), while the Softmax layer converts those scores into probabilities.
• The word with the highest probability is selected as the output for the current time step.
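A minimal NumPy sketch of that last step, with illustrative sizes and random weights standing in for the trained Linear layer:

import numpy as np

d_model, vocab_size = 512, 10000

def softmax(x):
    e = np.exp(x - np.max(x))                    # subtract max for numerical stability
    return e / e.sum()

decoder_output = np.random.randn(d_model)        # vector of floats from the decoder stack
W = np.random.randn(vocab_size, d_model) * 0.02  # final Linear layer weights (hypothetical)
b = np.zeros(vocab_size)

logits = W @ decoder_output + b                  # one score per word in the vocabulary
probs = softmax(logits)                          # scores -> probabilities
predicted_word_id = int(np.argmax(probs))        # pick the most probable word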
Evaluation

• Cross-entropy
• Kullback–Leibler divergence
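For a target distribution p and the model's predicted distribution q over the vocabulary, these are

\[
H(p, q) = -\sum_{x} p(x)\,\log q(x),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
\]

and the two differ only by the entropy of the target, since H(p, q) = H(p) + D_KL(p || q).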
Thank you!
