Visualizing A Neural Machine Translation Model

Module 3

Seq2seq Models
Sequence-to-Sequence Models / The Encoder-Decoder Model
• Sequence-to-sequence models are capable of generating contextually appropriate, arbitrary-length output sequences.

• An encoder network takes an input sequence and creates a contextualized representation of it, often called the context.

• This representation is then passed to a decoder, which generates a task-specific output sequence.
The Encoder-Decoder Architecture

The context is a function of the hidden representations of the input and may be used by the decoder in a variety of ways.

Encoder-decoder networks consist of three components:
• An encoder that accepts an input sequence and generates a corresponding sequence of contextualized representations.
• A context vector that is a function of the encoder's hidden states and conveys the essence of the input to the decoder.
• A decoder that uses the context to generate an arbitrary-length sequence of output states, from which the output sequence is produced.
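As a concrete illustration, here is a minimal sketch of these three components, assuming PyTorch; the class names, sizes, and the choice of a GRU are our own illustrative choices, not part of the slides:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) of token ids
        outputs, h_n = self.rnn(self.embed(src))
        # h_n, the final hidden state, serves as the fixed-size context
        return outputs, h_n

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, context):
        # one common choice: the context initializes the decoder's hidden state
        outputs, _ = self.rnn(self.embed(tgt), context)
        return self.out(outputs)  # scores over the output vocabulary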
Neural Machine Translation
• In neural machine translation, a sequence is a series of words, processed one after another.

• The output is, likewise, a series of words.

The model is composed of an encoder and
a decoder
• The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context).

• After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
Context Vector
• The context is a vector (an array of numbers, basically) in the case of machine translation.

• The encoder and decoder tend to both be recurrent neural networks.

• In the visualization of the context vector (a vector of floats), brighter colors represent the cells with higher values.
Size of Context Vector
• You can set the size of the context vector when you set up your model.

• It is basically the number of hidden units in the encoder RNN.

• These visualizations show a vector of size 4, but in real-world applications the context vector would be of a size like 256, 512, or 1024.
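A minimal sketch of how the context-vector size falls out of the encoder's hidden size (assuming PyTorch; the numbers are illustrative):

import torch
import torch.nn as nn

encoder = nn.GRU(input_size=300, hidden_size=512, batch_first=True)
src = torch.randn(1, 10, 300)  # batch of 1, 10 input word embeddings of size 300
outputs, h_n = encoder(src)
print(h_n.shape)               # torch.Size([1, 1, 512]): the context has 512 units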
Step 1: Representation of Input
• The RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state.

• The word, however, needs to be represented by a vector.

• To transform a word into a vector, we turn to the class of methods called "word embedding" algorithms.

• These map words into vector spaces that capture a lot of the meaning/semantic information of the words (e.g. king - man + woman ≈ queen).

• Embedding vectors of size 200 or 300 are typical; we're showing a vector of size four for simplicity.
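A minimal sketch of an embedding lookup (assuming PyTorch; the toy vocabulary and the size-4 vectors are illustrative, matching the slides' toy dimension):

import torch
import torch.nn as nn

vocab = {"<pad>": 0, "je": 1, "suis": 2, "étudiant": 3}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

ids = torch.tensor([vocab["je"], vocab["suis"], vocab["étudiant"]])
vectors = embed(ids)   # shape (3, 4): one 4-dimensional vector per word
print(vectors.shape)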
Step 2
• The next RNN step takes the second input vector and hidden state #1 to create the output of that time step.

• Each unrolled step of the encoder or decoder is the RNN processing its input and generating an output for that time step.

• Since the encoder and decoder are both RNNs, at each time step one of the RNNs does some processing: it updates its hidden state based on its current input and the previous inputs it has seen.
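A minimal sketch of a single vanilla-RNN time step (our own illustrative implementation, not the slides'): the new hidden state is a function of the current input vector and the previous hidden state:

import torch

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t = tanh(W_x @ x_t + W_h @ h_prev + b): the hidden state is updated
    # from the current input and, via h_prev, everything seen so far
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)

hidden, embed = 4, 4
W_x = torch.randn(hidden, embed)
W_h = torch.randn(hidden, hidden)
b = torch.zeros(hidden)

h = torch.zeros(hidden)             # initial hidden state
for x_t in torch.randn(3, embed):   # three input word vectors
    h = rnn_step(x_t, h, W_x, W_h, b)
# after the loop, h is the encoder's final hidden state (the context)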
Hidden State

The accompanying animation shows the working of the encoder RNN and its hidden states at each time step.

The last hidden state is actually the context we pass along to the decoder.
Decoder RNN and Hidden State
• The decoder also maintains a hidden state that it passes from one
time step to the next.
Why does the seq2seq model fail?
• The encoder takes the input and converts it into a fixed-size vector, and then the decoder makes a prediction and produces the output sequence.

• This works fine for short sequences, but it fails when we have a long sequence.

• It becomes difficult for the encoder to memorize the entire sequence into a fixed-size vector and to compress all the contextual information from the sequence.

• As the sequence length increases, model performance starts to degrade.


Bottleneck
• The context vector turned out to be a bottleneck for these types of
models. It made it challenging for the models to deal with long
sentences.
Attention
• Attention allows the model to focus on the relevant parts of the input sequence as needed.

• Attention greatly improved the quality of machine translation systems.

• At time step 7, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation.

• This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.
Difference between Attention Model and Classic Sequence-to-Sequence Model
• First, the encoder passes a lot more data to the decoder.
• Instead of passing only the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder.
• Second, an attention decoder does an extra step before producing its output.

• To focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:

• Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence.

• Give each hidden state a score.

• Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.

• This scoring exercise is done at each time step on the decoder side (see the sketch below).
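A minimal sketch of these three steps for a single decoder time step (assuming PyTorch; the sizes are illustrative):

import torch

h_enc = torch.randn(10, 512)   # 10 encoder hidden states, one per input word
h_dec = torch.randn(512)       # current decoder hidden state

scores = h_enc @ h_dec                 # one score per encoder hidden state
alpha = torch.softmax(scores, dim=0)   # softmaxed scores sum to 1
context = alpha @ h_enc                # weighted sum of encoder states: shape (512,)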
Decoder Hidden State Calculation

The current decoder hidden state is computed from the output of the previous time step, the previous decoder hidden state, and the context vector for the current time step:

h_i^d = g(ŷ_{i-1}, h_{i-1}^d, c_i)

where ŷ_{i-1} is the output of the previous time step, h_{i-1}^d is the previous decoder hidden state, and c_i is the context vector for the current time step.

The attention mechanism allows each hidden state of the decoder to see a different, dynamic context, which is a function of all the encoder hidden states.
Context Vector Calculation
• Dot-Product Attention

• The goal is to decide how much to focus on each encoder state – how relevant each encoder state is to the decoder state captured in h_{i-1}^d.

• We capture relevance by computing – at each state i during decoding – a score(h_{i-1}^d, h_j^e) for each encoder state j.

• Dot-product attention implements relevance as similarity: it measures how similar the decoder hidden state is to an encoder hidden state by computing the dot product between them:

score(h_{i-1}^d, h_j^e) = h_{i-1}^d · h_j^e

• The score that results from this dot product is a scalar that reflects the degree of similarity between the two vectors.

• The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder; normalizing the scores with a softmax gives the attention distribution:

α_{ij} = softmax_j( score(h_{i-1}^d, h_j^e) )

• Finally, given the distribution in α, we can compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states:

c_i = Σ_j α_{ij} h_j^e
Example: Sequence-to-Sequence Model + Attention
• It's also possible to create more sophisticated scoring functions for attention models.

• Instead of simple dot-product attention, we can get a more powerful function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, Ws:

score(h_{i-1}^d, h_j^e) = h_{i-1}^d · Ws · h_j^e

• The weights Ws, which are then trained during normal end-to-end training, give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.
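A minimal sketch of this parameterized (bilinear) score, with Ws as a learned weight matrix (assuming PyTorch; the sizes are illustrative):

import torch
import torch.nn as nn

d_dec, d_enc = 512, 512
W_s = nn.Parameter(torch.randn(d_dec, d_enc))  # trained with the rest of the model

def score(h_dec, h_enc):
    # bilinear score h_dec^T @ W_s @ h_enc, giving one scalar per encoder state
    return h_dec @ W_s @ h_enc.T

h_enc = torch.randn(10, d_enc)    # encoder hidden states
h_dec = torch.randn(d_dec)        # current decoder hidden state
print(score(h_dec, h_enc).shape)  # torch.Size([10])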
Teacher Forcing
• During training, instead of feeding the decoder its own (possibly wrong) prediction from the previous time step, we feed it the correct gold target token from the training data.

• This technique is called teacher forcing; it speeds up and stabilizes training, because an early wrong prediction cannot derail the rest of the output sequence.
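A minimal sketch of a teacher-forced training loop (assuming PyTorch; the sizes are illustrative, and a zero vector stands in for a real encoder context):

import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 1000, 300, 512
embed = nn.Embedding(vocab_size, embed_size)
cell = nn.GRUCell(embed_size, hidden_size)
out = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

gold = torch.randint(0, vocab_size, (7,))  # gold target sequence (illustrative)
h = torch.zeros(1, hidden_size)            # here the encoder context would go in
loss = 0.0
for t in range(len(gold) - 1):
    # teacher forcing: feed the gold token gold[t], not the model's own prediction
    h = cell(embed(gold[t]).unsqueeze(0), h)
    logits = out(h)
    loss = loss + loss_fn(logits, gold[t + 1].unsqueeze(0))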
Beam Search Algorithm
• Decoding in MT and other sequence generation problems generally uses a method called beam search.

• In beam search, instead of choosing the single best token to generate at each time step, we keep k possible tokens at each step.

• This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
• In the first step of decoding, we compute a softmax over the entire vocabulary, assigning a probability to each word.

• We then select the k best options from this softmax output.

• These initial k outputs are the search frontier, and these k initial words are called hypotheses.

• A hypothesis is an output sequence, a translation-so-far, together with its probability.
• Consider beam search decoding with a beam width of k = 2.

• At each time step, we choose the k best hypotheses, compute the V possible extensions of each hypothesis, score the resulting k · V possible hypotheses, and choose the best k to continue.

• At time 1, the frontier is filled with the best 2 options from the initial state of the decoder: arrived and the.

• We then extend each of those, compute the probability of all the hypotheses so far (arrived the, arrived aardvark, the green, the witch), and keep the best 2 (in this case the green and the witch) as the search frontier to extend on the next step.

• On the arcs we show the decoders that we run to score the extension words (although for simplicity we haven't shown the context value ci that is input at each step).
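A minimal sketch of the core beam-search loop (our own illustrative implementation; score_fn stands in for a decoder step that returns log-probabilities over the vocabulary). Taking the top k extensions of each hypothesis and then the best k overall is equivalent to scoring all k · V extensions:

import torch

def beam_search(score_fn, bos, eos, k=2, max_len=10):
    # each hypothesis is (sequence-so-far, log-probability)
    frontier = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in frontier:
            if seq[-1] == eos:          # finished hypotheses pass through
                candidates.append((seq, logp))
                continue
            log_probs = score_fn(seq)   # (vocab_size,) log-probs for next token
            top = torch.topk(log_probs, k)
            for lp, tok in zip(top.values, top.indices):
                candidates.append((seq + [tok.item()], logp + lp.item()))
        # keep only the k best of the extended hypotheses
        frontier = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return frontier[0][0]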
Attention Process
This scoring exercise is done at each time step on the decoder side.

• Let us now bring the whole thing together and look, step by step, at how the attention process works:
• The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
• The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
• Attention step: we use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
• We concatenate h4 and C4 into one vector.
• We pass this vector through a feedforward neural network (one trained jointly with the model).
• The output of the feedforward neural network indicates the output word of this time step.
• Repeat for the next time steps.
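A minimal sketch of one such decoder time step (assuming PyTorch; the attention step, concatenation, and feedforward layer mirror the steps above, and all sizes are illustrative):

import torch
import torch.nn as nn

hidden = 512
ffn = nn.Linear(2 * hidden, 10000)        # jointly trained output layer (vocab 10000)

h_enc = torch.randn(10, hidden)           # encoder hidden states
h4 = torch.randn(hidden)                  # decoder hidden state at this step

alpha = torch.softmax(h_enc @ h4, dim=0)  # attention step: scores -> weights
C4 = alpha @ h_enc                        # context vector for this time step
logits = ffn(torch.cat([h4, C4]))         # concatenate h4 and C4, then feedforward
word = logits.argmax()                    # the output word of this time step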
Sequence to Sequence Model With Attention
The attention weights show which part of the input sentence we're paying attention to at each decoding step.
