The document provides an overview of the Transformer model. It discusses how the Transformer uses stacked encoders and decoders. Each encoder contains self-attention and feed-forward layers. The decoder contains these layers as well as an attention layer to focus on the input sequence. Positional encoding is added to maintain word order. The model uses residual connections and layer normalization. During decoding, the decoder predicts outputs sequentially while attending to the encoded input. A final linear and softmax layer convert outputs to predicted word probabilities.

Attention Is All You Need
Igor Caetano Diniz
Introduction
• The Transformer was proposed in the paper Attention Is All You Need.
• A TensorFlow implementation of it is available as part of the Tensor2Tensor package.
• We will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.
The Optimus P... Transformer
The Encoder and Decoder Stacks
Inside the Encoder
• The encoding component consists of a stack of encoders; the paper stacks six of them on top of one another.
• The decoding component is a stack of the same number of decoders.
• All encoders have an identical structure, although they do not share weights. Each encoder is composed of two distinct sub-layers: a self-attention layer and a feed-forward neural network.
The Role of Tensors
• The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
• The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
• As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
• Each word is embedded into a vector of size 512. We'll represent those vectors with simple boxes (a small sketch of this lookup follows below).
• An encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, and then sends the output upwards to the next encoder.
• The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network, with each vector flowing through it separately.
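A minimal sketch of that embedding step, with a hypothetical three-word vocabulary and a randomly initialized lookup table (in a real model the table is learned):

import numpy as np

d_model = 512                                # embedding size used in the paper
vocab = {"je": 0, "suis": 1, "etudiant": 2}  # hypothetical toy vocabulary
embedding_table = np.random.randn(len(vocab), d_model)   # learned in practice, random here

sentence = ["je", "suis", "etudiant"]
# turn each input word into a vector of size 512 via a table lookup
input_vectors = [embedding_table[vocab[word]] for word in sentence]
print(len(input_vectors), input_vectors[0].shape)         # 3 vectors, each of shape (512,)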
Self-Attention
• Example: "The animal didn't cross the street because it was too tired."
• Question: what does "it" refer to in this sentence?
Calculating Self-Attention

• The softmax function normalizes the self-attention scores so that they are all positive and add up to 1.
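For reference, the softmax and the full scaled dot-product self-attention computation from the paper can be written as

\[
\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}},
\qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]

where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the key vectors (64 in the paper).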
The Beast With Many Heads – Multi-Head Attention
• It gives the attention layer multiple "representation subspaces".
• As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired".
• In a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
Positional Encoding

• To give the model a sense of the order of the words, we add positional encoding vectors, the values of which follow a specific pattern (see the formulas below).
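The specific pattern used in the paper is sinusoidal: each dimension of the positional encoding corresponds to a sine or cosine of a different frequency,

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]

where pos is the position of the word and i indexes the dimension.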
A Real Example of Positional Encoding

• A real example of positional encoding with a toy embedding size of 4.
A Real Example of Positional Encoding

• A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns).
• It appears split in half down the center because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They are then concatenated to form each of the positional encoding vectors.
Another representation

• It is from the Tranformer2Transformer


implementation of the Transformer. The
method shown in the paper is slightly
different in that it doesn’t directly
concatenate, but interweaves the two
signals. The following figure shows what that
looks like.
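A minimal NumPy sketch of the concatenated variant pictured on the previous slide (left half sine, right half cosine); the function name and exact frequency schedule are illustrative rather than copied from Tensor2Tensor:

import numpy as np

def positional_encoding(num_positions, d_model):
    # one frequency per sine/cosine pair, as in the sinusoidal formulas above
    positions = np.arange(num_positions)[:, np.newaxis]        # (num_positions, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, 2.0 * i / d_model)  # (num_positions, d_model / 2)
    # concatenate: left half sine, right half cosine
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = positional_encoding(20, 512)   # 20 words (rows) x 512 dimensions (columns)
print(pe.shape)                     # (20, 512)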
Residuals

• Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
• If we were to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this image.
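In the notation of the paper, the output of each sub-layer is therefore

\[
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
\]

where Sublayer(x) is the function implemented by the sub-layer itself (self-attention or the feed-forward network) and x is its input, carried around it by the residual connection.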
Residuals

• This is what it looks like if we think of a Transformer of two stacked encoders and decoders.
The Decoder Side

• After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element of the output sequence (the English translation sentence in this case).
The Decoder Side

• Repeat the following steps until a special symbol is reached, indicating that the transformer decoder has completed its output (a sketch of this loop follows below).
• At each step, feed the output to the bottom decoder in the next time step.
• Like the encoders, the decoders propagate their decoding results upwards.
• As with the encoder inputs, embed the decoder inputs and add positional encoding to indicate the position of each word.
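A minimal Python sketch of that decoding loop; encode, decode, and the token ids here are hypothetical stand-ins, not the Tensor2Tensor API:

START_TOKEN, END_TOKEN, MAX_LEN = 1, 2, 50   # illustrative ids and length limit

def greedy_decode(encode, decode, source_tokens):
    memory = encode(source_tokens)           # run the encoder stack once
    output = [START_TOKEN]                   # the decoder starts from a start symbol
    for _ in range(MAX_LEN):
        # the decoder attends to `memory` and to its own previous outputs
        next_token = decode(memory, output)  # most probable next token id
        output.append(next_token)
        if next_token == END_TOKEN:          # special symbol: decoding is complete
            break
    return output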
The Final Linear Layer and Softmax Layer

• The decoder stack outputs a vector of floats; converting it into a word is the job of the final Linear layer followed by a Softmax layer.
• The Linear layer projects the decoder's vector into a much larger vector called logits (one score per word in the vocabulary), while the Softmax layer converts those scores into probabilities.
• The word with the highest probability is selected as the output for the current time step.
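A minimal NumPy sketch of that last step, with illustrative sizes and random weights standing in for the trained Linear layer:

import numpy as np

d_model, vocab_size = 512, 10000

def softmax(x):
    e = np.exp(x - np.max(x))                    # subtract max for numerical stability
    return e / e.sum()

decoder_output = np.random.randn(d_model)        # vector of floats from the decoder stack
W = np.random.randn(vocab_size, d_model) * 0.02  # final Linear layer weights (hypothetical)
b = np.zeros(vocab_size)

logits = W @ decoder_output + b                  # one score per word in the vocabulary
probs = softmax(logits)                          # scores -> probabilities
predicted_word_id = int(np.argmax(probs))        # pick the most probable word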
Evaluation

• Cross-entropy
• Kullback–Leibler divergence
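For a target distribution p and the model's predicted distribution q over the vocabulary, these are

\[
H(p, q) = -\sum_{x} p(x)\,\log q(x),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
\]

and the two differ only by the entropy of the target, since H(p, q) = H(p) + D_KL(p || q).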
Thank you!
