AD3501-DL-UNIT 3 NOTES
This recurrence (written out below) says that the current hidden state h(t) is a function f of the previous
hidden state h(t-1) and the current input x(t); theta denotes the parameters
of the function f. The network typically learns to use h(t) as a kind of lossy
summary of the task-relevant aspects of the past sequence of inputs up to time t.
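In the book's notation the recurrence is

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)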
Unfolding maps the graph on the left to the unrolled graph on the right in the figure below (both are
computational graphs of an RNN without an output o),
where the black square indicates that an interaction takes place with a delay
of 1 time step, from the state at time t to the state at time t + 1.
Unfolding/parameter sharing is better than using different parameters per
position: there are fewer parameters to estimate, and the model generalizes to sequences of various lengths.
3.1.1 Recurrent Neural Network
Variation 1 of the RNN (basic form): hidden-to-hidden recurrent connections, producing an
output at each time step (sequence output), as in Fig 10.3.
The basic equations that define the above RNN are shown in (10.6) below
(on p. 385 of the book):
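For reference, the update equations of this basic RNN (with tanh hidden units and a softmax output layer) are:

a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})

where U, W and V are the input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices, and b, c are bias vectors.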
The total loss for a given sequence of x values paired with a sequence
of y values would then be just the sum of the losses over all the time steps.
For example, if L(t) is the negative log-likelihood
of y(t) given x(1), . . . , x(t), then summing these over the time steps gives the loss for the whole
sequence, as shown in (10.7):
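Written out, the sequence loss is

L = \sum_t L^{(t)} = -\sum_t \log p_{\mathrm{model}}\big(y^{(t)} \mid x^{(1)}, \dots, x^{(t)}\big)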
The derivations are w.r.t. the basic form of RNN, namely Fig
10.3 and Equation (10.6) . We copy Fig 10.3 again here:
Once the gradients on the internal nodes of the computational graph are
obtained, we can obtain the gradients on the parameter nodes, which have
descendants at all the time steps:
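For example, for the output weights V and the recurrent weights W (with tanh hidden units), summing the contributions over time gives, in sketch form:

\nabla_V L = \sum_t (\nabla_{o^{(t)}} L)\, h^{(t)\top}
\nabla_W L = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^2\big)\, (\nabla_{h^{(t)}} L)\, h^{(t-1)\top}

(the diag term is the derivative of tanh at time t; analogous sums give the gradients for U, b and c).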
Note: We move Section 10.2.3 and Sec 10.2.4, both of which are about
graphical model interpretation of RNN, to the end of the notes, as they are
not essential for the idea flow, in my opinion…
10.3 Bidirectional RNNs
In many applications we want to output a prediction of y (t) which may
depend on the whole input sequence. E.g. co-articulation in speech
recognition, right neighbors in POS tagging, etc.
Bidirectional RNNs combine an RNN that moves forward through time
beginning from the start of the sequence with another RNN that moves
backward through time beginning from the end of the sequence.
Fig. 10.11 (below) illustrates the typical bidirectional RNN,
where h(t) and g(t) stand for the (hidden) states of the sub-RNNs that
move forward and backward through time, respectively. This allows the
output units o(t) to compute a representation that depends on both the past
and the future, while being most sensitive to the input values around time t.
Figure 10.11: Computation of a typical bidirectional recurrent neural
network, meant to learn to map input sequences x to target sequences y, with
loss L(t) at each step t.
Footnote: This idea can be naturally extended to 2-dimensional input, such
as images, by having four RNNs…
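A minimal sketch of this forward/backward combination in plain NumPy with tanh units; the weight names (U_f, W_f, U_b, W_b, V) and the dimensions are illustrative assumptions, not taken from the text:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 6, 3                        # input, hidden, output sizes

# forward sub-RNN parameters (for h) and backward sub-RNN parameters (for g)
U_f = rng.normal(scale=0.1, size=(d_h, d_in))
W_f = rng.normal(scale=0.1, size=(d_h, d_h))
U_b = rng.normal(scale=0.1, size=(d_h, d_in))
W_b = rng.normal(scale=0.1, size=(d_h, d_h))
V = rng.normal(scale=0.1, size=(d_out, 2 * d_h))  # output sees both h(t) and g(t)

def birnn(xs):
    T = len(xs)
    h, g = np.zeros(d_h), np.zeros(d_h)
    hs, gs = [], [None] * T
    for t in range(T):                            # forward pass: start of sequence -> end
        h = np.tanh(U_f @ xs[t] + W_f @ h)
        hs.append(h)
    for t in reversed(range(T)):                  # backward pass: end of sequence -> start
        g = np.tanh(U_b @ xs[t] + W_b @ g)
        gs[t] = g
    # o(t) depends on the past via h(t) and on the future via g(t)
    return [V @ np.concatenate([hs[t], gs[t]]) for t in range(T)]

outputs = birnn([rng.normal(size=d_in) for _ in range(5)])

The output at each step concatenates hs[t] (a summary of the past) with gs[t] (a summary of the future), which is exactly the property described above.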
10.4 Encoder-Decoder Sequence-to-Sequence Architectures
Encoder-Decoder architecture, basic idea:
(1) an encoder or reader or input RNN processes the input sequence. The
encoder emits the context C , usually as a simple function of its final hidden
state.
(2) a decoder or writer or output RNN is conditioned on that fixed-length
vector to generate the output sequence Y = (y(1), . . . , y(n_y)).
Highlight: the lengths of the input and output sequences can differ from each
other. This architecture is now widely used in machine translation, question answering, etc.
See Fig 10.12 below.
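A minimal sketch of the encoder-decoder idea in plain NumPy with simple tanh RNN cells; the parameter names and the choice to initialize the decoder state with the context C are illustrative assumptions, not the book's exact design:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 5, 8, 6                        # input, hidden, output sizes

# encoder (reader) parameters
U_enc = rng.normal(scale=0.1, size=(d_h, d_in))
W_enc = rng.normal(scale=0.1, size=(d_h, d_h))
# decoder (writer) parameters
U_dec = rng.normal(scale=0.1, size=(d_h, d_out))
W_dec = rng.normal(scale=0.1, size=(d_h, d_h))
V_dec = rng.normal(scale=0.1, size=(d_out, d_h))

def encode(xs):
    """Process the input sequence; return the context C (here: the final hidden state)."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(U_enc @ x + W_enc @ h)
    return h

def decode(C, n_steps):
    """Generate an output sequence of length n_steps, conditioned on the context C."""
    h = C                                         # condition the decoder on the context
    y = np.zeros(d_out)
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(U_dec @ y + W_dec @ h)
        y = V_dec @ h                             # a softmax would follow for discrete tokens
        outputs.append(y)
    return outputs

xs = [rng.normal(size=d_in) for _ in range(7)]    # input sequence of length 7
ys = decode(encode(xs), n_steps=4)                # output sequence of length 4

Note that the input and output lengths (7 and 4 here) can differ, which is the property highlighted above.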
To address the vanishing gradient problem in RNNs, one can:
• initialize the weights so that the potential for vanishing gradients is minimized;
• use Echo State Networks, which are designed to solve the vanishing
gradient problem;
• use Long Short-Term Memory Networks (LSTMs).
LSTMs are considered the go-to network for implementing RNNs, and
we discuss this solution in depth next.
Long Short-Term Memory (LSTM)
The long short-term memory (LSTM) network is the most popular solution to
the vanishing gradient problem.
Are you ready to learn how we can elegantly remove the major roadblock to
the use of Recurrent Neural Networks (RNNs)?
Here is our plan of attack for this challenging deep learning topic:
• First of all, we are going to look at a bit of history: where LSTMs came
from, what the main idea behind them was, and why people invented them.
• Then, we will present the LSTM architecture.
We've also defined, as a rule of thumb, that if wrec is small the gradient
vanishes, and if wrec is large the gradient explodes.
But what counts as "large" and "small" in this context? Roughly, we
have a vanishing gradient if wrec < 1 and an exploding gradient if wrec > 1.
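To see why, note that backpropagating through T time steps multiplies the gradient by roughly wrec at each step, so it scales like wrec^T. For example, 0.9^50 ≈ 0.005 (the gradient has all but vanished), while 1.1^50 ≈ 117 (the gradient has exploded).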
Then, what’s the first thing that comes to your mind to solve this problem?
Probably the easiest and fastest solution is to make wrec = 1. That's
exactly what is done in LSTMs. Of course, this is a very simplified
explanation, but in general, making the recurrent weight equal to one is the main
idea behind LSTMs.
Now, let’s dig deeper into the architecture of LSTMs.
LSTM Architecture
The long short-term memory network was first introduced in 1997 by Sepp
Hochreiter and his Ph.D. supervisor Jürgen Schmidhuber. It
suggests a very elegant solution to the vanishing gradient problem.
Overview
To provide you with the simplest and most understandable illustrations of
LSTM networks, we are going to use images created by Christopher Olah,
who does an amazing job of explaining LSTMs in simple terms.
So, the first image below demonstrates what a standard RNN looks like from
the inside.
The hidden layer in the central block receives the input xt from the input layer
and also its own state from time point t-1; it then generates the output ht and
passes its state on to itself at time point t+1.
This is the standard architecture, and it does not solve the vanishing gradient
problem.
The next image shows what an LSTM looks like. This might seem very complex
at first, but don't worry!
We're going to walk through this architecture and explain in detail
what's happening here. By the end, you'll be completely
comfortable navigating LSTMs.
As you might recall, we started with the claim that in LSTMs wrec = 1.
This feature is reflected as a straight pipeline along the top of the diagram,
usually referred to as the memory cell. Its contents can flow through time very
freely: sometimes information is removed or erased from it, and sometimes new
information is added to it, but otherwise it flows through time unchanged.
Therefore, when you backpropagate through an LSTM, you don't have the
vanishing gradient problem.
Notation
Let’s begin with a few words on the notation:
• ct-1 stands for the input from the memory cell at time point t-1;
• xt is an input in time point t;
• ht is an output in time point t that goes to both the output layer and the
hidden layer in the next time point.
Thus, every block has three inputs (xt, ht-1, and ct-1) and two outputs (ht and
ct). An important thing to remember is that all these inputs and outputs are
not single values, but vectors with lots of values behind each of them.
Let’s continue our journey through the legend:
• Concatenate: two lines combining into one, as for example, the vectors
from ht-1 and xt. You can imagine this like two pipes running in
parallel.
• Copy: the information is copied and goes into two different directions,
as for example, at the right bottom of the scheme, where output
information is copied in order to arrive at two different layers ht.
• Pointwise addition (the T-joint into the memory pipeline): you can add
additional memory here if the memory valve below this joint is open.
• Layer operations:
o "sigmoid" – produces a value between 0 and 1 and therefore acts as a
valve that can be closed, open, or open to some extent;
o "tanh" – responsible for transforming the value to be within the
range from -1 to 1 (required due to certain mathematical
considerations).
1. We've got a new value xt and the value ht-1 from the previous node coming
in.
2. These values are combined and go through the sigmoid activation
function, which decides whether the forget valve should be open, closed,
or open to some extent.
3. The same values, or actually vectors of values, also go in parallel through
a "tanh" layer operation, which decides what candidate value to pass to
the memory pipeline, and through another sigmoid layer operation, which
decides whether that value will be passed to the memory pipeline and to
what extent.
4. Then we have the memory flowing through the top pipeline. If the forget
valve is open and the memory valve is closed, the memory does not
change. Conversely, if the forget valve is closed and the memory valve is
open, the memory is replaced completely.
5. Finally, xt and ht-1 are combined to decide what part of the memory
pipeline is going to become the output ht of this module (see the equations
sketched after this list).
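In the standard notation used with Olah's diagrams, the five steps above correspond to the following equations, where \sigma is the sigmoid, \odot is element-wise multiplication, f_t is the forget valve, i_t the memory valve, \tilde{c}_t the candidate memory, and o_t the output valve:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)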
That's basically what's happening within the LSTM network. As you can
see, it has a pretty straightforward architecture, but let's move on to a specific
example to get an even better understanding of Long Short-Term
Memory networks.
Example Walkthrough
You might remember the translation example from one of our previous
articles. Recall that when we change the word “boy” to “girl” in the English
sentence, the Czech translation has two additional words changed because in
Czech the verb form depends on the subject’s gender.
So, let’s say the word “boy” is stored in the memory cell ct-1. It is just
flowing through the module freely if our new information doesn’t tell us that
there is a new subject.
If for instance, we have a new subject (e.g., “girl”, “Amanda”), we’ll close
the forget valve to destroy the memory that we had. Then, we’ll open a
memory valve to put a new memory (e.g., name, subject, gender) to the
memory pipeline via the t-joint.
If we put the word “girl” into the memory pipeline, we can extract different
elements of information from this single piece: the subject is female,
singular, the word is not capitalized, has 4 letters etc.
Next, the output valve facilitates the extraction of the elements required for
the purposes of the next word or sentence (gender in our example). This
information will be transferred as an input to the next module and it will help
the next module to decide on the best translation given the subject’s gender.
Next, let's look at what individual LSTM cells actually learn; that's going to be
quite an interesting and at the same time a bit of a magical experience.
Neuron Activation
Here is our LSTM architecture. To start off, we are going to look at
the hyperbolic tangent function tanh and how it fires up. As you remember, its value
ranges from -1 to 1. In the following images, "-1" is shown in red and "+1"
in blue.
As you can see, this neuron is sensitive to position in the line: as you get
towards the end of a line, it activates. How does it know where the
end of the line is? There are about 80 symbols per line in this novel, so the cell is
effectively counting how many symbols have passed, and that is how it
predicts when the new-line character is coming up.
The next cell recognizes direct speech. It’s keeping track of the quotation
marks and is activating inside the quotes.
This is very similar to our example where the network was keeping track of
the subject to understand if it is male or female, singular or plural, and to
suggest the correct verb forms for the translation. Here we observe the same
logic. It’s important to know if you are inside or outside the quotes because
that affects the rest of the text.
In the next image, we have a snippet from the code of the Linux operating
system. This example refers to a cell that activates inside if-statements. It's
completely dormant everywhere else, but as soon as an if-statement appears,
it activates. It then stays active for the condition of the if-statement
and stops being active in the actual body of the if-statement.
That can be important because, after seeing the condition, you are anticipating
the body of the if-statement.
The next cell is sensitive to how deep you are inside of the nested expression.
As you go deeper, and the expression gets more and more nested, this cell
keeps track of that.
It's very important to remember that none of this is actually hardcoded into
the neural network. All of it is learned by the network itself through
thousands and thousands of iterations.
The network kind of thinks: okay, I have this many hidden states, and out of
them I need to identify what's important in the text to keep track of. Then it
identifies that, in this particular text, understanding how deep you are inside a
nested statement is important. Therefore, it assigns one of its hidden states,
or memory cells, to keep track of that.
So the network is really organizing itself, deciding how to allocate its
resources to best complete the task. That's really fascinating!
The next image demonstrates an example of a cell whose function you can't
really interpret. According to Andrej Karpathy, about 95% of
the cells are like this: they are doing something, but it's just not obvious
to humans, even though it makes sense to the machine.
Output
Now let's move to the actual output ht. This is the resulting value after the
memory passes through the tanh function and the output valve.
What do you think this specific hidden state in the neural network is looking
out for?
For example, after the first “w” it’s pretty confident that the next letter will
be “w” as well. Conversely, its prediction about the letter or symbol after “.”
is very unsure because it could actually be any website.
As you see from the image, the network continues generating predictions
even when the actual neuron is dormant. See, for example, how it was able
to predict the word “language” just from the first two letters.
The neuron activates again in the third row, when another URL appears (see
the image below). That’s quite an interesting case.
You can observe that the network was pretty sure that the next letter after
“co” should be “m” to get “.com”, but it was another dot instead.
Then, the network predicted “u” because the domain “co.uk” (for the United
Kingdom) is quite popular. And again, this was the wrong prediction because
the actual domain was “co.il” (for Israel), which was not at all considered by
the neural network even as 2nd, 3rd, 4th or 5th best guess.
This is how to read the pictures that Andrej has created. There are a couple
more such examples in his blog.
Hopefully, you are now much more comfortable with what's going on
inside the neural network when it's thinking and processing information.
LSTM Variation
Have you followed all of the material on Recurrent Neural Networks (RNNs) so far?
Then you should already be pretty comfortable with the concept of
Long Short-Term Memory networks (LSTMs).
Let's wind up our journey with a very short look at LSTM variations.
You may encounter them sometimes in your work. So, it could be really
important for you to be at least aware of these other LSTM architectures.
Variation #1
In variation #1, we add peephole connections – the lines that feed additional
input about the current state of the memory cell to the sigmoid activation
functions.
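In equation form, the peephole connections let the valve (gate) layers also look at the cell state, so the forget and memory valves see c_{t-1} and the output valve sees c_t (as in the peephole formulation shown in Olah's post):

f_t = \sigma(W_f [c_{t-1}, h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [c_{t-1}, h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o [c_t, h_{t-1}, x_t] + b_o)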
Variation #2
In variation #2, we connect forget valve and memory valve. So, instead of
having separate decisions about opening and closing the forget and memory
valves, we have a combined decision here.
Basically, whenever you close the memory off (forget valve = 0), you have
to put something in (memory valve = 1 – 0 = 1), and vice versa.
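In equation form, the coupled decision replaces the separate memory valve i_t with 1 - f_t, so the cell update becomes:

c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t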
Variation #3
It might look quite complex, but in fact the resulting model is simpler than
the standard LSTM, which is why this modification has become increasingly
popular (a sketch of its equations is given below).
We have discussed three LSTM modifications, which are probably
the most notable. However, be aware that there are many other
LSTM variations out there.
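For reference: variation #3 as described above (simpler than the standard LSTM and increasingly popular) matches the Gated Recurrent Unit (GRU). Assuming that is the variant meant, it merges the cell state and hidden state and uses an update gate z_t and a reset gate r_t:

z_t = \sigma(W_z [h_{t-1}, x_t])
r_t = \sigma(W_r [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t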