Visualizing A Neural Machine Translation Model

Module 3

Seq2seq Models
Sequence-to-Sequence Models / The Encoder-Decoder Model
• Sequence-to-sequence models are capable of generating contextually appropriate, arbitrary-length output sequences.

• An encoder network takes an input sequence and creates a contextualized representation of it, often called the context.

• This representation is then passed to a decoder, which generates a task-specific output sequence.
The Encoder-Decoder Architecture

The context is a function of the hidden representations of the input and may be used by the decoder in a variety of ways.

Encoder-decoder networks consist of three components:
• An encoder that accepts an input sequence and generates a corresponding sequence of contextualized representations.
• A context vector that is a function of the encoder's hidden states and conveys the essence of the input to the decoder.
• A decoder that uses the context to generate an arbitrary-length sequence of output states, from which the output sequence is produced.
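As a concrete illustration, here is a minimal sketch of these three components, assuming PyTorch; the class names, sizes, and the choice of a GRU are our own illustrative choices, not part of the slides:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of token ids
        outputs, h_n = self.rnn(self.embed(src))
        # h_n, the final hidden state, serves as the fixed-size context
        return outputs, h_n

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, context):
        # one common choice: the context initializes the decoder's hidden state
        outputs, _ = self.rnn(self.embed(tgt), context)
        return self.out(outputs)  # scores over the output vocabulary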
Neural Machine Translation
• In neural machine translation, a sequence is a series of words, processed one after another.

• The output is, likewise, a series of words.

The model is composed of an encoder and
a decoder
• The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context).

• After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
Context Vector
• The context is a vector (an array of numbers, basically) in the case of machine translation.

• The encoder and decoder tend to both be recurrent neural networks.

• In the visualization of the context vector (a vector of floats), brighter colors represent the cells with higher values.
Size of Context Vector
• You can set the size of the context vector when you set up your model.

• It is basically the number of hidden units in the encoder RNN.

• These visualizations show a vector of size 4, but in real-world applications the context vector would be of a size like 256, 512, or 1024.
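A minimal sketch of how the context-vector size falls out of the encoder's hidden size (assuming PyTorch; the numbers are illustrative):

import torch
import torch.nn as nn

encoder = nn.GRU(input_size=300, hidden_size=512, batch_first=True)
src = torch.randn(1, 10, 300)  # batch of 1, 10 input word embeddings of size 300
outputs, h_n = encoder(src)
print(h_n.shape)               # torch.Size([1, 1, 512]): the context has 512 units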
Step 1: Representation of Input
• The RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state.

• The word, however, needs to be represented by a vector.

• To transform a word into a vector, we turn to the class of methods called "word embedding" algorithms.

• These map words into vector spaces that capture a lot of the meaning/semantic information of the words (e.g. king - man + woman ≈ queen).

• Embedding vectors of size 200 or 300 are typical; we're showing a vector of size four for simplicity.
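A minimal sketch of an embedding lookup (assuming PyTorch; the toy vocabulary and the size-4 vectors are illustrative, matching the slides' toy dimension):

import torch
import torch.nn as nn

vocab = {"<pad>": 0, "je": 1, "suis": 2, "étudiant": 3}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab["je"], vocab["suis"], vocab["étudiant"]])
vectors = embed(ids)   # shape (3, 4): one 4-dimensional vector per word
print(vectors.shape)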
Step 2
• The next RNN step takes the second input vector and hidden state #1 to create the output of that time step.

• Each unrolled step of the encoder or decoder is the RNN processing its input and generating an output for that time step.

• Since the encoder and decoder are both RNNs, at each time step one of the RNNs does some processing: it updates its hidden state based on its current input and the previous inputs it has seen.
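A minimal sketch of a single vanilla-RNN time step (our own illustrative implementation, not the slides'): the new hidden state is a function of the current input vector and the previous hidden state:

import torch

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t = tanh(W_x @ x_t + W_h @ h_prev + b): the hidden state is updated
    # from the current input and, via h_prev, everything seen so far
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)

hidden, embed = 4, 4
W_x = torch.randn(hidden, embed)
W_h = torch.randn(hidden, hidden)
b = torch.zeros(hidden)

h = torch.zeros(hidden)             # initial hidden state
for x_t in torch.randn(3, embed):   # three input word vectors
    h = rnn_step(x_t, h, W_x, W_h, b)
# after the loop, h is the encoder's final hidden state (the context)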
Hidden State

The accompanying animation shows the working of the encoder RNN and its hidden states at each time step.

The last hidden state is actually the context we pass along to the decoder.
Decoder RNN and Hidden State
• The decoder also maintains a hidden state that it passes from one
time step to the next.
Why does the seq2seq model fail?
• The encoder takes the input and converts it into a fixed-size vector, and then the decoder makes a prediction and produces the output sequence.

• This works fine for short sequences, but it fails when we have a long sequence.

• It becomes difficult for the encoder to memorize the entire sequence into a fixed-size vector and to compress all the contextual information from the sequence.

• As the sequence length increases, model performance starts to degrade.


Bottleneck
• The context vector turned out to be a bottleneck for these types of
models. It made it challenging for the models to deal with long
sentences.
Attention
• Attention allows the model to focus on the relevant parts of the input sequence as needed.

• Attention greatly improved the quality of machine translation systems.

• At time step 7, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation.

• This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.
Difference between Attention Model and Classic Sequence-to-Sequence Model
• First, the encoder passes a lot more data to the decoder.
• Instead of passing only the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder.
• Second, an attention decoder does an extra step before producing its output.

• To focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:

• Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence.

• Give each hidden state a score.

• Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.

• This scoring exercise is done at each time step on the decoder side (see the sketch below).
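A minimal sketch of these three steps for a single decoder time step (assuming PyTorch; the sizes are illustrative):

import torch

h_enc = torch.randn(10, 512)   # 10 encoder hidden states, one per input word
h_dec = torch.randn(512)       # current decoder hidden state

scores = h_enc @ h_dec                 # one score per encoder hidden state
alpha = torch.softmax(scores, dim=0)   # softmaxed scores sum to 1
context = alpha @ h_enc                # weighted sum of encoder states: shape (512,)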
Decoder Hidden State Calculation

The current decoder hidden state is computed from the output of the previous time step, the previous decoder hidden state, and the context vector for the current time step:

h_i^d = g(ŷ_{i-1}, h_{i-1}^d, c_i)

where ŷ_{i-1} is the output of the previous time step, h_{i-1}^d is the previous decoder hidden state, and c_i is the context vector for the current time step.

The attention mechanism allows each hidden state of the decoder to see a different, dynamic context, which is a function of all the encoder hidden states.
Context Vector Calculation
• Dot-Product Attention

• The goal is to decide how much to focus on each encoder state – how relevant each encoder state is to the decoder state captured in h_{i-1}^d.

• We capture relevance by computing – at each state i during decoding – a score(h_{i-1}^d, h_j^e) for each encoder state j.

• Dot-product attention implements relevance as similarity: it measures how similar the decoder hidden state is to an encoder hidden state by computing the dot product between them:

score(h_{i-1}^d, h_j^e) = h_{i-1}^d · h_j^e

• The score that results from this dot product is a scalar that reflects the degree of similarity between the two vectors.

• The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder; normalizing the scores with a softmax gives the attention distribution:

α_{ij} = softmax_j( score(h_{i-1}^d, h_j^e) )

• Finally, given the distribution in α, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

c_i = Σ_j α_{ij} h_j^e
Example: Sequence-to-Sequence Model + Attention
• It's also possible to create more sophisticated scoring functions for attention models.

• Instead of simple dot-product attention, we can get a more powerful function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, Ws:

score(h_{i-1}^d, h_j^e) = h_{i-1}^d · Ws · h_j^e

• The weights Ws, which are then trained during normal end-to-end training, give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.
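A minimal sketch of this parameterized (bilinear) score, with Ws as a learned weight matrix (assuming PyTorch; the sizes are illustrative):

import torch
import torch.nn as nn

d_dec, d_enc = 512, 512
W_s = nn.Parameter(torch.randn(d_dec, d_enc))  # trained with the rest of the model

def score(h_dec, h_enc):
    # bilinear score h_dec^T @ W_s @ h_enc, giving one scalar per encoder state
    return h_dec @ W_s @ h_enc.T

h_enc = torch.randn(10, d_enc)    # encoder hidden states
h_dec = torch.randn(d_dec)        # current decoder hidden state
print(score(h_dec, h_enc).shape)  # torch.Size([10])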
Teacher Forcing
• During training, instead of feeding the decoder its own (possibly wrong) prediction from the previous time step, we feed it the correct gold target token from the training data.

• This technique is called teacher forcing; it speeds up and stabilizes training, because an early wrong prediction cannot derail the rest of the output sequence.
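A minimal sketch of a teacher-forced training loop (assuming PyTorch; the sizes are illustrative, and a zero vector stands in for a real encoder context):

import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 1000, 300, 512
embed = nn.Embedding(vocab_size, embed_size)
cell = nn.GRUCell(embed_size, hidden_size)
out = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

gold = torch.randint(0, vocab_size, (7,))  # gold target sequence (illustrative)
h = torch.zeros(1, hidden_size)            # here the encoder context would go in
loss = 0.0
for t in range(len(gold) - 1):
    # teacher forcing: feed the gold token gold[t], not the model's own prediction
    h = cell(embed(gold[t]).unsqueeze(0), h)
    logits = out(h)
    loss = loss + loss_fn(logits, gold[t + 1].unsqueeze(0))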
Beam Search Algorithm
• Decoding in MT and other sequence generation problems generally uses a method called beam search.

• In beam search, instead of choosing the single best token to generate at each time step, we keep k possible tokens at each step.

• This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
• In the first step of decoding, we compute a softmax over the entire vocabulary, assigning a probability to each word.

• We then select the k best options from this softmax output.

• These initial k outputs are the search frontier, and these k initial words are called hypotheses.

• A hypothesis is an output sequence, a translation-so-far, together with its probability.
• Consider beam search decoding with a beam width of k = 2.

• At each time step, we choose the k best hypotheses, compute the V possible extensions of each hypothesis, score the resulting k · V possible hypotheses, and choose the best k to continue.

• At time 1, the frontier is filled with the best 2 options from the initial state of the decoder: arrived and the.

• We then extend each of those, compute the probability of all the hypotheses so far (arrived the, arrived aardvark, the green, the witch), and keep the best 2 (in this case the green and the witch) as the search frontier to extend on the next step.

• On the arcs we show the decoders that we run to score the extension words (although for simplicity we haven't shown the context value ci that is input at each step).
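A minimal sketch of the core beam-search loop (our own illustrative implementation; score_fn stands in for a decoder step that returns log-probabilities over the vocabulary). Taking the top k extensions of each hypothesis and then the best k overall is equivalent to scoring all k · V extensions:

import torch

def beam_search(score_fn, bos, eos, k=2, max_len=10):
    # each hypothesis is (sequence-so-far, log-probability)
    frontier = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in frontier:
            if seq[-1] == eos:          # finished hypotheses pass through
                candidates.append((seq, logp))
                continue
            log_probs = score_fn(seq)   # (vocab_size,) log-probs for next token
            top = torch.topk(log_probs, k)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((seq + [tok.item()], logp + lp.item()))
        # keep only the k best of the extended hypotheses
        frontier = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return frontier[0][0]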
Attention Process
This scoring exercise is done at each time step on the decoder side.

• Let us now bring the whole thing together and look, step by step, at how the attention process works:
• The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
• The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
• Attention step: we use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
• We concatenate h4 and C4 into one vector.
• We pass this vector through a feedforward neural network (one trained jointly with the model).
• The output of the feedforward neural network indicates the output word of this time step.
• Repeat for the next time steps.
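A minimal sketch of one such decoder time step (assuming PyTorch; the attention step, concatenation, and feedforward layer mirror the steps above, and all sizes are illustrative):

import torch
import torch.nn as nn

hidden = 512
ffn = nn.Linear(2 * hidden, 10000)        # jointly trained output layer (vocab 10000)

h_enc = torch.randn(10, hidden)           # encoder hidden states
h4 = torch.randn(hidden)                  # decoder hidden state at this step

alpha = torch.softmax(h_enc @ h4, dim=0)  # attention step: scores -> weights
C4 = alpha @ h_enc                        # context vector for this time step
logits = ffn(torch.cat([h4, C4]))         # concatenate h4 and C4, then feedforward
word = logits.argmax()                    # the output word of this time step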
Sequence to Sequence Model With Attention
The attention weights show which part of the input sentence we're paying attention to at each decoding step.
