Visualizing A Neural Machine Translation Model
Seq2seq Models
Sequence-to-Sequence Models / The Encoder-Decoder Model
• Encoder-decoder (seq2seq) models are capable of generating contextually appropriate, arbitrary-length output sequences.
• At each time step, the encoder or decoder is an RNN processing its input and generating an output for that time step.
• Since the encoder and decoder are both RNNs, each time one of them does some processing, it updates its hidden state based on its current input and the previous inputs it has seen.
Hidden State
The animation above shows how the encoder RNN processes the input sequence and updates its hidden states.
The last hidden state is actually the context we pass along to the decoder.
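As an illustration only (the function and variable names below are assumptions, not something defined in these slides), a vanilla RNN encoder that produces this context vector can be sketched in NumPy:

import numpy as np

def encode(inputs, Wx, Wh, b):
    """Vanilla RNN encoder over a sequence of input embeddings.
    inputs: list of vectors of shape (d_in,); Wx: (d_h, d_in); Wh: (d_h, d_h); b: (d_h,)."""
    h = np.zeros(Wh.shape[0])
    hidden_states = []
    for x in inputs:
        # update the hidden state from the current input and the previous hidden state
        h = np.tanh(Wx @ x + Wh @ h + b)
        hidden_states.append(h)
    # the last hidden state is the context handed to the decoder in the classic seq2seq model
    return hidden_states, h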
Decoder RNN and Hidden State
• The decoder also maintains a hidden state that it passes from one
time step to the next.
Why does the seq2seq model fail?
• The encoder takes the input sequence and converts it into a fixed-size vector; the decoder then makes predictions from that vector to produce the output sequence.
• This works fine for short sequences, but it fails when the sequence is long.
• It becomes difficult for the encoder to compress the entire sequence, with all of its contextual information, into a single fixed-size vector.
• Attention addresses this bottleneck: rather than relying on a single fixed-size vector, the decoder can draw on all of the encoder's hidden states and focus on the relevant ones.
• This ability to amplify the signal from the relevant part of the input sequence is what makes attention models produce better results than models without attention.
Difference Between an Attention Model and the Classic Sequence-to-Sequence Model
• First, the encoder passes a lot more data to the decoder.
• Instead of passing the last hidden state of the encoding stage,
the encoder passes all the hidden states to the decoder:
• Second, an attention decoder does an extra step before producing its output.
• To focus on the parts of the input that are relevant to this decoding time step,
the decoder does the following:
• Look at the set of encoder hidden states it received; each encoder hidden state is most associated with a certain word in the input sentence
• Give each encoder hidden state a score (the scoring function is described in the next section)
• Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores
• This scoring exercise is done at each time step on the decoder side.
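A minimal sketch of these steps with simple dot-product scoring (illustrative NumPy, not code from the slides):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d_h,); encoder_states: (T, d_h), one row per encoder hidden state."""
    scores = encoder_states @ decoder_state   # dot-product score for each encoder hidden state
    weights = softmax(scores)                 # softmaxed scores
    # weighting amplifies high-scoring states and drowns out low-scoring ones;
    # summing gives the context vector for this decoding time step
    context = weights @ encoder_states
    return context, weights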
Decoder Hidden State Calculation
• How much to focus on each encoder state, i.e., how relevant each encoder state is to the decoder state captured in h^d_{i−1}, is computed as a score, score(h^d_{i−1}, h^e_j), for each encoder state j at each decoding step i.
• The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.
• We normalize these scores with a softmax to obtain a vector of weights, α_ij, giving the proportional relevance of each encoder hidden state j to the current decoder step i.
• Finally, given the distribution in α, we can compute a fixed-length context vector c_i for the current decoder state by taking a weighted average over all the encoder hidden states.
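Written out for the simple dot-product case (h^e_j are the encoder hidden states, h^d_{i−1} is the previous decoder hidden state, and c_i is the context vector for decoder step i):

\mathrm{score}(h^d_{i-1}, h^e_j) = h^d_{i-1} \cdot h^e_j

\alpha_{ij} = \mathrm{softmax}_j\big(\mathrm{score}(h^d_{i-1}, h^e_j)\big) = \frac{\exp(\mathrm{score}(h^d_{i-1}, h^e_j))}{\sum_k \exp(\mathrm{score}(h^d_{i-1}, h^e_k))}

c_i = \sum_j \alpha_{ij}\, h^e_j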
Example: Sequence-to-Sequence Model + Attention
• It’s also possible to create more sophisticated scoring functions for attention
models.
• Instead of simple dot-product attention, we can get a more powerful scoring function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, Ws.
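With its own weight matrix Ws, one standard form of this parameterized score is bilinear:

\mathrm{score}(h^d_{i-1}, h^e_j) = h^d_{i-1}\, W_s\, h^e_j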
Beam Search
• In beam search, instead of choosing the best token to generate at each time step, we keep k possible tokens at each step.
• This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
• At the first step of decoding, we compute a softmax over the entire vocabulary, assigning a probability to each word.
• These initial k outputs are the search frontier and these k initial words are called
hypotheses.
• At each time step, we choose the k best hypotheses, compute the V possible extensions of each hypothesis, score the resulting k · V possible hypotheses, and choose the best k to continue (a minimal code sketch follows the worked example below).
• At time 1, the frontier is filled with the best 2 options from the initial state of
the decoder: arrived and the.
• We then extend each of those, compute the probability of all the hypotheses so far (arrived the, arrived aardvark, the green, the witch) and choose the best 2 (in this case the green and the witch) to be the search frontier to extend on the next step.
• On the arcs we show the decoders that we run to score the extension words (although for simplicity we haven't shown the context value c_i that is input at each step).
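A minimal beam-search sketch in Python; the next_token_logprobs interface is an assumption made for the example (it stands in for running the decoder one step and taking a log-softmax over the vocabulary):

def beam_search(next_token_logprobs, start, end, k=2, max_len=10):
    """next_token_logprobs(prefix) -> dict mapping each vocabulary token to its log-probability."""
    frontier = [(0.0, [start])]              # (cumulative log-probability, token sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in frontier:
            if seq[-1] == end:               # finished hypotheses are carried over unchanged
                candidates.append((logp, seq))
                continue
            # extend the hypothesis with every vocabulary token: k * V candidates in total
            for tok, tok_logp in next_token_logprobs(seq).items():
                candidates.append((logp + tok_logp, seq + [tok]))
        # keep only the best k hypotheses as the new search frontier
        frontier = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == end for _, seq in frontier):
            break
    return max(frontier, key=lambda c: c[0])[1]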
Attention Process
• Let us now bring the whole thing together in the following visualization and look at how
the attention process works:
• The attention decoder RNN takes in the embedding of the <END> token, and an initial
decoder hidden state.
• The RNN processes its inputs, producing an output and a new hidden state vector (h4).
The output is discarded.
• Attention Step: We use the encoder hidden states and the h4 vector to calculate a context
vector (C4) for this time step.
• We concatenate h4 and C4 into one vector.
• We pass this vector through a feedforward neural network (one trained jointly with the
model).
• The output of the feedforward neural network indicates the output word of this time step.
• Repeat for the next time steps.
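A rough NumPy sketch of one such decoder time step (a vanilla RNN cell with dot-product attention; all weights and names here are illustrative assumptions, not the exact model from the visualization):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decoder_step(prev_token_emb, prev_hidden, encoder_states, Wx, Wh, b, W_out, b_out):
    """One attention-decoder time step; returns the predicted word id and the new hidden state."""
    # RNN step: new hidden state (the "h4" of the walkthrough) from the previous output's
    # embedding and the previous hidden state
    h = np.tanh(Wx @ prev_token_emb + Wh @ prev_hidden + b)
    # attention step: score each encoder state against h, softmax, weighted sum (the "C4")
    weights = softmax(encoder_states @ h)
    context = weights @ encoder_states
    # concatenate h and the context vector and pass them through a feedforward layer;
    # the highest-scoring entry indicates this time step's output word
    logits = W_out @ np.concatenate([h, context]) + b_out
    return int(np.argmax(logits)), h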
Sequence to Sequence Model With Attention
The attention weights show which part of the input sentence we're paying attention to at each decoding step.