AD3501-DL-UNIT 3 NOTES

The document discusses Recurrent Neural Networks (RNNs) and their design patterns, emphasizing their ability to handle sequential data through parameter sharing and various architectures like Bidirectional RNNs and Encoder-Decoder models. It also addresses challenges such as long-term dependencies and the vanishing/exploding gradient problems, presenting solutions like Echo State Networks and Leaky Units. Additionally, it explores advanced concepts like deep recurrent networks and recursive neural networks, highlighting their applications in fields like machine translation and speech recognition.

UNIT III - RECURRENT NEURAL NETWORKS

Unfolding Graphs -- RNN Design Patterns: Acceptor -- Encoder -- Transducer; Gradient Computation -- Sequence Modeling Conditioned on Contexts -- Bidirectional RNN -- Sequence to Sequence RNN -- Deep Recurrent Networks -- Recursive Neural Networks -- Long Term Dependencies; Leaky Units: Skip connections and dropouts; Gated Architecture: LSTM.

Recurrent Neural Networks (RNNs) are designed for handling sequential data.

RNNs share parameters across different positions (time steps) of the sequence, which makes it possible to generalize to examples of different sequence lengths. An RNN is usually a better alternative both to position-independent classifiers and to sequential models that treat each position differently.

How does an RNN share parameters? Each member of the output is produced by applying the same update rule to the previous outputs. That update rule is typically the same neural network layer at every step, shown as the "A" block in the figure below (figure from Colah's blog).

Notation: We refer to RNNs as operating on a sequence that contains vectors x(t), with the time step index t ranging from 1 to τ. Usually, there is also a hidden state vector h(t) for each time step t.
3.1 Unfolding Computational Graphs
The basic formula of the RNN, Equation (10.4), is shown below:

    h(t) = f(h(t-1), x(t); θ)

It says that the current hidden state h(t) is a function f of the previous hidden state h(t-1) and the current input x(t); θ (theta) denotes the parameters of f. The network typically learns to use h(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to time t.
Unfolding maps the left side to the right side in the figure below (both are computational graphs of an RNN without an output o),

where the black square indicates that an interaction takes place with a delay of one time step, from the state at time t to the state at time t + 1.
Unfolding with parameter sharing is better than using different parameters per position: there are fewer parameters to estimate, and the model generalizes to sequences of various lengths.
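To make the unfolded recurrence concrete, here is a minimal NumPy sketch of the shared update rule applied at every time step; the sizes, the tanh nonlinearity, and the variable names are illustrative assumptions rather than anything fixed by the notes:

    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden, n_input, T = 4, 3, 5           # toy sizes for the sketch

    # One set of parameters, shared across all time steps (this is the parameter sharing).
    W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # hidden-to-hidden
    U = rng.normal(scale=0.1, size=(n_hidden, n_input))    # input-to-hidden
    b = np.zeros(n_hidden)

    x = rng.normal(size=(T, n_input))         # an input sequence x(1), ..., x(T)
    h = np.zeros(n_hidden)                    # initial state h(0)

    for t in range(T):
        # The same update rule f(h, x; theta) at every step: h(t) = tanh(W h(t-1) + U x(t) + b)
        h = np.tanh(W @ h + U @ x[t] + b)
        print(f"h({t + 1}) =", np.round(h, 3))

The same loop works for a sequence of any length T, which is exactly the generalization benefit of sharing parameters.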
3.1.1 Recurrent Neural Network
Variation 1 of RNN (basic form): hidden2hidden connections, sequence
output. As in Fig 10.3.

The basic equations that define the above RNN are shown in (10.6) below (p. 385 of the book):

    a(t) = b + W h(t-1) + U x(t)
    h(t) = tanh(a(t))
    o(t) = c + V h(t)
    ŷ(t) = softmax(o(t))

The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps. For example, if L(t) is the negative log-likelihood of y(t) given x(1), . . . , x(t), then summing over t gives the loss for the sequence, as shown in (10.7):

    L = Σ_t L(t) = - Σ_t log p_model( y(t) | x(1), . . . , x(t) )
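A hedged NumPy sketch of this forward pass and the summed negative log-likelihood; the shapes, the class count, and the random data are made up for illustration, while the update and loss follow the standard equations above:

    import numpy as np

    rng = np.random.default_rng(1)
    n_hidden, n_input, n_classes, T = 4, 3, 2, 5

    U = rng.normal(scale=0.1, size=(n_hidden, n_input))
    W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    V = rng.normal(scale=0.1, size=(n_classes, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_classes)

    x = rng.normal(size=(T, n_input))
    y = rng.integers(0, n_classes, size=T)    # target class at each time step

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    h = np.zeros(n_hidden)
    total_loss = 0.0
    for t in range(T):
        a = b + W @ h + U @ x[t]              # pre-activation a(t)
        h = np.tanh(a)                        # hidden state h(t)
        o = c + V @ h                         # output o(t)
        y_hat = softmax(o)                    # predicted distribution
        total_loss += -np.log(y_hat[y[t]])    # per-step negative log-likelihood, summed over t
    print("sequence loss:", total_loss)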

• Forward Pass: The runtime is O(τ) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential; each time step can only be computed after the previous one.
• Backward Pass: see Section 10.2.2.
Variation 2 of RNN: output2hidden connections, sequence output. As shown in Fig 10.4, it produces an output at each time step and has recurrent connections only from the output at one time step to the hidden units at the next time step.
Teacher forcing (Section 10.2.1, p. 385) can be used to train an RNN as in Fig 10.4 (above), where only output2hidden connections exist, i.e. hidden2hidden connections are absent.
In teacher forcing, the model is trained to maximize the conditional probability of the current output y(t) given both the x sequence so far and the previous output y(t-1), i.e. the gold-standard output of the previous time step is fed in during training.
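A minimal sketch of the teacher-forcing idea on a toy model in which the prediction at time t depends on x(t) and the previous output; the linear model, the squared-error loss, and all sizes are illustrative assumptions, not the notes' exact setup:

    import numpy as np

    rng = np.random.default_rng(2)
    n_input, n_output, T = 3, 2, 6

    # Toy output2hidden-style model: the prediction depends on x(t) and the previous output.
    U = rng.normal(scale=0.1, size=(n_output, n_input))
    R = rng.normal(scale=0.1, size=(n_output, n_output))

    x = rng.normal(size=(T, n_input))
    y_gold = rng.normal(size=(T, n_output))   # gold-standard targets

    loss = 0.0
    for t in range(T):
        # Teacher forcing: feed the *gold* previous output y(t-1), not the model's own prediction.
        y_prev = y_gold[t - 1] if t > 0 else np.zeros(n_output)
        y_pred = U @ x[t] + R @ y_prev
        loss += 0.5 * np.sum((y_pred - y_gold[t]) ** 2)
    print("teacher-forced training loss:", loss)

At test time the gold outputs are unavailable, so the model's own previous prediction is fed back instead.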
Variation 3 of RNN: hidden2hidden connections, single output. As in Fig 10.5, recurrent connections between hidden units read an entire sequence and then produce a single output.
10.2.2 Computing the Gradient in a Recurrent Neural Network
How? Use the back-propagation through time (BPTT) algorithm on the unrolled graph. Basically, it is the application of the chain rule on the unrolled graph, for the parameters U, V, W, b and c as well as for the sequence of nodes indexed by t: x(t), h(t), o(t) and L(t).

The derivations are w.r.t. the basic form of RNN, namely Fig 10.3 and Equation (10.6). We copy Fig 10.3 again here.
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes, which have descendants at all the time steps; the gradient for each shared parameter is therefore a sum of contributions from every time step.
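A hedged NumPy sketch of BPTT for the simple tanh RNN above, showing how the gradient of each shared parameter accumulates contributions from every time step. A plain squared-error loss on h(t) is used purely to keep the example short; it is not the notes' softmax/NLL setup, and all sizes are arbitrary:

    import numpy as np

    rng = np.random.default_rng(3)
    n_hidden, n_input, T = 4, 3, 5

    W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    U = rng.normal(scale=0.1, size=(n_hidden, n_input))
    b = np.zeros(n_hidden)

    x = rng.normal(size=(T, n_input))
    target = rng.normal(size=(T, n_hidden))

    # Forward pass, storing every h(t) for reuse in the backward pass.
    hs = [np.zeros(n_hidden)]
    for t in range(T):
        hs.append(np.tanh(W @ hs[-1] + U @ x[t] + b))

    # Backward pass (BPTT): walk the unrolled graph from t = T down to t = 1.
    dW, dU, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(b)
    dh_next = np.zeros(n_hidden)                 # gradient flowing in from step t+1
    for t in reversed(range(T)):
        dh = (hs[t + 1] - target[t]) + dh_next   # local loss gradient + contribution from the future
        da = dh * (1.0 - hs[t + 1] ** 2)         # back through tanh
        dW += np.outer(da, hs[t])                # shared parameters: gradients are summed over t
        dU += np.outer(da, x[t])
        db += da
        dh_next = W.T @ da                       # pass the gradient back to h(t-1)
    print("||dW|| =", np.linalg.norm(dW))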

Note: We move Section 10.2.3 and Sec 10.2.4, both of which are about
graphical model interpretation of RNN, to the end of the notes, as they are
not essential for the idea flow, in my opinion…
10.3 Bidirectional RNNs
In many applications we want to output a prediction of y(t) that may depend on the whole input sequence, e.g. co-articulation in speech recognition, right neighbors in POS tagging, etc.
Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.
Fig. 10.11 (below) illustrates the typical bidirectional RNN, where h(t) and g(t) stand for the (hidden) state of the sub-RNN that moves forward and backward through time, respectively. This allows the output units o(t) to compute a representation that depends on both the past and the future, but is most sensitive to the input values around time t.
Figure 10.11: Computation of a typical bidirectional recurrent neural
network, meant to learn to map input sequences x to target sequences y, with
loss L(t) at each step t.
Footnote: This idea can be naturally extended to 2-dimensional input, such
as images, by having four RNNs…
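A minimal NumPy sketch of the bidirectional idea: run one recurrence forward and one backward over the same inputs, then let each output depend on both states. The parameter names, sizes, and the concatenation-style combination are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(4)
    n_hidden, n_input, n_output, T = 4, 3, 2, 5

    Wf = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # forward sub-RNN
    Uf = rng.normal(scale=0.1, size=(n_hidden, n_input))
    Wb = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # backward sub-RNN
    Ub = rng.normal(scale=0.1, size=(n_hidden, n_input))
    V = rng.normal(scale=0.1, size=(n_output, 2 * n_hidden))

    x = rng.normal(size=(T, n_input))

    # Forward states h(1), ..., h(T)
    h = np.zeros((T, n_hidden))
    prev = np.zeros(n_hidden)
    for t in range(T):
        prev = np.tanh(Wf @ prev + Uf @ x[t])
        h[t] = prev

    # Backward states g(T), ..., g(1)
    g = np.zeros((T, n_hidden))
    nxt = np.zeros(n_hidden)
    for t in reversed(range(T)):
        nxt = np.tanh(Wb @ nxt + Ub @ x[t])
        g[t] = nxt

    # Each output o(t) sees both the past (through h) and the future (through g).
    o = np.array([V @ np.concatenate([h[t], g[t]]) for t in range(T)])
    print(o.shape)   # (T, n_output)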
10.4 Encoder-Decoder Sequence-to-Sequence Architectures
Encoder-Decoder architecture, basic idea:
(1) An encoder or reader or input RNN processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state.
(2) A decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence Y = (y(1), . . . , y(ny)).
Highlight: the lengths of the input and output sequences can differ from each other. This architecture is now widely used in machine translation, question answering, etc.
See Fig 10.12 below.

Training: the two RNNs are trained jointly to maximize the average of log P(y(1), …, y(ny) | x(1), …, x(nx)) over all the pairs of x and y sequences in the training set.
Variations: If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN. As we have seen (in Sec. 10.2.4), there are at least two ways for a vector-to-sequence RNN to receive input. The input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.
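A hedged NumPy sketch of the encoder-decoder pattern: the encoder's final state becomes the context C, which is used as the initial state of a decoder that emits a sequence of a different length. The sizes, the greedy roll-out of ny steps, and the choice of passing C as the initial state are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(5)
    n_hidden, n_input, n_output = 4, 3, 2
    nx, ny = 6, 4                            # input and output lengths may differ

    We = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # encoder weights
    Ue = rng.normal(scale=0.1, size=(n_hidden, n_input))
    Wd = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # decoder weights
    Vd = rng.normal(scale=0.1, size=(n_output, n_hidden))

    x = rng.normal(size=(nx, n_input))

    # Encoder: read the whole input; the final hidden state is the context C.
    h = np.zeros(n_hidden)
    for t in range(nx):
        h = np.tanh(We @ h + Ue @ x[t])
    C = h

    # Decoder: condition on C (here, as the initial state) and roll out ny steps.
    s = C
    outputs = []
    for t in range(ny):
        s = np.tanh(Wd @ s)
        outputs.append(Vd @ s)
    print(np.array(outputs).shape)   # (ny, n_output)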
10.5 Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of
parameters and associated transformations:
1. from the input to the hidden state, x(t) → h(t)
2. from the previous hidden state to the next hidden state, h(t-1) → h(t)
3. from the hidden state to the output, h(t) → o(t)
In the previously discussed models, each of these transformations is a shallow one, i.e. it corresponds to a single layer within a deep MLP. However, we can use multiple layers for each of the above transformations, which results in deep recurrent networks.
Fig 10.13 (below) shows the resulting deep RNN when we
(a) break the hidden-to-hidden transformation into a hierarchy of hidden layers,
(b) introduce deeper architectures for all of transformations 1, 2 and 3 above, and
(c) add "skip connections" for RNNs that have deep hidden2hidden transformations.
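A minimal sketch of one common form of deep RNN, a stack of recurrent layers in which the state sequence of each layer is the input to the next; the two-layer depth and all sizes are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(6)
    n_input, n_h1, n_h2, T = 3, 4, 4, 5

    U1 = rng.normal(scale=0.1, size=(n_h1, n_input))
    W1 = rng.normal(scale=0.1, size=(n_h1, n_h1))
    U2 = rng.normal(scale=0.1, size=(n_h2, n_h1))
    W2 = rng.normal(scale=0.1, size=(n_h2, n_h2))

    x = rng.normal(size=(T, n_input))
    h1, h2 = np.zeros(n_h1), np.zeros(n_h2)
    for t in range(T):
        h1 = np.tanh(W1 @ h1 + U1 @ x[t])   # layer 1: input -> first hidden state
        h2 = np.tanh(W2 @ h2 + U2 @ h1)     # layer 2: layer-1 state -> deeper hidden state
    print("top-level state:", np.round(h2, 3))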
10.6 Recursive Neural Network
A recursive network has a computational graph that generalizes that of the
recurrent network from a chain to a tree.
Pro: Compared with an RNN, for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ).
Con: how to best structure the tree? A balanced binary tree is an option, but it is not optimal for many kinds of data. For natural sentences, one can use a parser to yield the tree structure, but this is both expensive and inaccurate. Thus recursive NNs are NOT popular.
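A minimal sketch of the recursive idea: the same composition function is applied at every internal node of a tree, so a balanced tree over τ leaves needs only O(log τ) compositions along any root-to-leaf path. The tuple-based tree encoding and the sizes are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(7)
    d = 4                                    # dimension of every node representation
    W_left = rng.normal(scale=0.1, size=(d, d))
    W_right = rng.normal(scale=0.1, size=(d, d))
    b = np.zeros(d)

    def compose(node):
        # Leaves are vectors; internal nodes are (left, right) pairs.
        # The same parameters are reused at every internal node.
        if isinstance(node, tuple):
            left, right = compose(node[0]), compose(node[1])
            return np.tanh(W_left @ left + W_right @ right + b)
        return node                          # leaf: already a vector

    leaves = [rng.normal(size=d) for _ in range(4)]
    # Balanced binary tree over 4 leaves: depth 2 instead of a chain of length 4.
    tree = ((leaves[0], leaves[1]), (leaves[2], leaves[3]))
    print(np.round(compose(tree), 3))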
10.7 The challenge of Long-Term Dependency
• Comments: This is the central challenge of RNN, which drives
the rest of the chapter.
The long-term dependency challenge motivates various solutions such as echo state networks (Section 10.8), leaky units (Sec 10.9) and the famous LSTM (Sec 10.10), as well as gradient clipping and neural Turing machines (Sec 10.11).
Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior. But let's focus on a linear simplification of the RNN, in which all the non-linearities are removed, for an easier demonstration of why long-term dependencies can be problematic.
Without non-linearity, the recurrence relation for h(t) in terms of h(t-1) is simply a matrix multiplication:

    h(t) = W^T h(t-1)

If we recurrently apply this until we reach h(0), we get:

    h(t) = (W^t)^T h(0)

and if W admits an eigendecomposition

    W = Q Λ Q^T

with orthogonal Q, the recurrence may be simplified further to:

    h(t) = Q^T Λ^t Q h(0)

In other words, the recurrence raises the eigenvalues to the power of t. This causes the components of h(0) along eigenvectors whose eigenvalues have magnitude less than one to vanish to zero, and the components along eigenvectors whose eigenvalues have magnitude greater than one to explode. This analysis shows the essence of the vanishing and exploding gradient problem for RNNs.
Comment: the same pattern of repeated matrix multiplication appears in the actual RNN, as can be seen by looking back at 10.2.2 "Computing the Gradient in a Recurrent Neural Network".
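A quick numeric sketch of this analysis; the 2x2 diagonal matrix with eigenvalues 0.9 and 1.1 is a made-up example. Components with |λ| < 1 shrink toward zero and components with |λ| > 1 blow up as the recurrence is applied repeatedly:

    import numpy as np

    # Diagonal W so that the eigenvalues (0.9 and 1.1) are explicit.
    W = np.diag([0.9, 1.1])
    h = np.array([1.0, 1.0])        # h(0)

    for t in range(1, 101):
        h = W.T @ h                 # linear recurrence h(t) = W^T h(t-1)
        if t in (10, 50, 100):
            print(f"t={t:3d}  h(t) = {h}")
    # The 0.9-component vanishes and the 1.1-component explodes:
    # the eigenvalues are effectively raised to the power t.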
Bengio et al. (1993, 1994) show that whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction. This means it can be extremely slow, if not practically impossible, to learn long-term dependencies. The following sections are all devoted to solving this problem.
Practical tip: the maximum sequence length that an SGD-trained traditional RNN can handle well is only about 10 to 20.
10.8 Echo State Networks
Note: This approach seems to be non-salient in the literature, so knowing the
concept is probably enough. The techniques are only explained at an abstract
level in the book, anyway.
Basic Idea: Since the recurrence causes all the vanishing/exploding
problems, we can set the recurrent weights such that the recurrent hidden
units do a good job of capturing the history of past inputs (thus “echo”),
and only learn the output weights.
Specifics: The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. But that is under the assumption of no non-linearity. So the modern strategy is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode, thanks to the stabilizing effect of saturating nonlinearities like tanh.
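A hedged sketch of the echo state idea: fix a random recurrent weight matrix, rescale it to a chosen spectral radius, run the reservoir over the inputs, and fit only the output weights. The spectral radius of 1.2, the reservoir size, the toy targets, and the plain least-squares readout are illustrative choices, not a tuned recipe:

    import numpy as np

    rng = np.random.default_rng(8)
    n_res, n_input, n_output, T = 50, 3, 2, 200

    # Fixed (never trained) reservoir weights, rescaled to a chosen spectral radius.
    W = rng.normal(size=(n_res, n_res))
    W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))
    U = rng.normal(scale=0.5, size=(n_res, n_input))

    x = rng.normal(size=(T, n_input))
    y = rng.normal(size=(T, n_output))        # toy targets

    # Collect reservoir states; the saturating tanh keeps them from exploding.
    states = np.zeros((T, n_res))
    h = np.zeros(n_res)
    for t in range(T):
        h = np.tanh(W @ h + U @ x[t])
        states[t] = h

    # Only the readout (output weights) is learned, here by least squares.
    V, *_ = np.linalg.lstsq(states, y, rcond=None)
    print("readout weight shape:", V.shape)   # (n_res, n_output)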
10.9 Leaky Units and Other Strategies for Multiple Time Scales
A common idea shared by the various methods in the following sections: design a model that operates on both a fine time scale (to handle small details) and a coarse time scale (to transfer information over long spans of time).
• Adding skip connections. One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present. Not an ideal solution.
10.9.2 Leaky Units
Idea: each hidden state u(t) is now a "summary of history", which is set to memorize both a coarse-grained summary of the immediate past u(t-1) and some "new stuff" of the present time v(t):

    u(t) = α u(t-1) + (1 - α) v(t)

where α (alpha) is a parameter. This introduces a linear self-connection from u(t-1) to u(t) with a weight of α.
In this case, α plays the role of the matrix W of the plain RNN (in the analysis of Sec 10.7). So if α ends up near 1, the repeated multiplications will not drive the value to zero or make it explode.
Note: The notation seems to suggest α is a scalar, but it also works if α is a vector and the multiplications are element-wise, which resembles the gated recurrent unit (GRU) of the coming Section 10.10.
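A minimal sketch of a leaky-unit update, with α treated as an element-wise vector so that different units operate on different time scales; the sizes and the tanh proposal for the "new stuff" v(t) are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(9)
    n_hidden, n_input, T = 4, 3, 20

    U = rng.normal(scale=0.1, size=(n_hidden, n_input))
    W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    alpha = np.array([0.99, 0.9, 0.5, 0.1])    # one leak rate per unit: slow to fast time scales

    x = rng.normal(size=(T, n_input))
    u = np.zeros(n_hidden)
    for t in range(T):
        v = np.tanh(W @ u + U @ x[t])          # "new stuff" at the present time step
        u = alpha * u + (1 - alpha) * v        # leaky update: linear self-connection with weight alpha
    print("final summary u(T):", np.round(u, 3))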
• Removing Connections: actively removing length-one
connections and replacing them with longer connections

The Vanishing Gradient Problem


As you remember, the gradient descent algorithm searches for a minimum of the cost function, which corresponds to a good setting of the network's weights.
As you might also recall, information travels through the neural network
from input neurons to the output neurons, while the error is calculated and
propagated back through the network to update the weights.
It works quite similarly for RNNs, but here we’ve got a little bit more going
on.
• Firstly, information travels through time in RNNs, which means that
information from previous time points is used as input for the next time
points.
• Secondly, you can calculate the cost function, or your error, at each
time point.
Basically, during the training, your cost function compares your outcomes
(red circles on the image below) to your desired output.
As a result, you have these values throughout the time series, for every single
one of these red circles.

Let’s focus on one error term et.


You’ve calculated the cost function et, and now you want to propagate your
cost function back through the network because you need to update the
weights.
Essentially, every single neuron that participated in the calculation of the output associated with this cost function should have its weights updated in order to minimize that error. And the thing with RNNs is that it's not just the neurons directly below this output layer that contributed, but all of the neurons far back in time. So, you have to propagate all the way back through time to those neurons.
The problem relates to updating wrec (the recurrent weight) – the weight that is used to connect the hidden layers to themselves in the unrolled temporal loop.
For instance, to get from xt-3 to xt-2 we multiply xt-3 by wrec. Then, to get from xt-2 to xt-1 we again multiply xt-2 by wrec. So, we multiply by the same exact weight multiple times, and this is where the problem arises: when you repeatedly multiply something by a small number, the value decreases very quickly.
As we know, weights are assigned random values close to zero at the start of training, and from there the network trains them up. But when you start with wrec close to zero and multiply xt, xt-1, xt-2, xt-3, … by this value, your gradient becomes smaller and smaller with each multiplication.
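A tiny numeric illustration of this effect; the values 0.4 and 1.5 for wrec are arbitrary examples. Repeated multiplication by a weight below 1 drives the gradient toward zero, while a weight above 1 blows it up:

    # Repeatedly multiplying a gradient by the same recurrent weight wrec.
    for wrec in (0.4, 1.5):
        grad = 1.0
        for step in range(20):
            grad *= wrec              # one multiplication per time step travelled back
        print(f"wrec = {wrec}: gradient after 20 steps back in time = {grad:.6g}")
    # wrec < 1 -> the gradient vanishes; wrec > 1 -> the gradient explodes.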

What does this mean for the network?


The lower the gradient is, the harder it is for the network to update the
weights and the longer it takes to get to the final result.
For instance, 1000 epochs might be enough to get the final weight for the
time point t, but insufficient for training the weights for the time point t-3
due to a very low gradient at this point.
However, the problem is not only that part of the network is not trained properly.
The output of the earlier layers is used as the input for the later layers. Thus, the training for the time point t is happening all along based on inputs that are coming from poorly trained layers. So, because of the vanishing gradient, the whole network is not being trained properly.
To sum up, if wrec is small, you have the vanishing gradient problem, and if wrec is large, you have the exploding gradient problem.
With the vanishing gradient problem, the further back you go through the network, the lower your gradient is and the harder it is to train the weights, which has a domino effect on all of the further weights throughout the network.
That was the main roadblock to using Recurrent Neural Networks.
But let's now look at the possible solutions to this problem.
Solutions to the Vanishing Gradient Problem
In case of an exploding gradient, you can:
• stop backpropagating after a certain point, which is usually not optimal because not all of the weights get updated;
• penalize or artificially reduce the gradient;
• put a maximum limit on the gradient (gradient clipping; see the sketch after these lists).
In case of a vanishing gradient, you can:
• initialize the weights so that the potential for vanishing gradients is minimized;
• use Echo State Networks, which are designed to sidestep the vanishing gradient problem;
• use Long Short-Term Memory networks (LSTMs).
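A hedged sketch of the "maximum limit" option above, gradient clipping by global norm; the threshold of 5.0 and the toy gradient arrays are illustrative assumptions:

    import numpy as np

    def clip_by_global_norm(grads, max_norm=5.0):
        # Rescale a list of gradient arrays so that their combined norm is at most max_norm.
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            grads = [g * scale for g in grads]
        return grads, total_norm

    # Toy exploding gradients for two parameter matrices.
    grads = [np.full((4, 4), 10.0), np.full((4, 3), -7.0)]
    clipped, before = clip_by_global_norm(grads, max_norm=5.0)
    after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
    print(f"norm before: {before:.2f}, after clipping: {after:.2f}")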
LSTMs are considered the go-to architecture for implementing RNNs, and we're going to discuss this solution in depth next.
Long Short Term Memory - (LSTM)
The long short-term memory (LSTM) network is the most popular solution to the vanishing gradient problem.
Are you ready to learn how we can elegantly remove the major roadblock to the use of Recurrent Neural Networks (RNNs)?
Here is our plan of attack for this challenging deep learning topic:
• First of all, we are going to look at a bit of history: where LSTM came from, what the main idea behind it was, and why people invented it.
• Then, we will present the LSTM architecture.
• And finally, we're going to have an example walkthrough.


Let’s get started!

Refresh on the Vanishing Gradient Problem


LSTMs were created to deal with the vanishing gradient problem. So, let’s
have a brief reminder on this issue.
As we propagate the error through the network, it has to go through the unrolled temporal loop – the hidden layers connected to themselves in time by means of the weight wrec.
Because this weight is applied many, many times on top of itself, the gradient declines rapidly.
As a result, the weights of the layers on the far left are updated much more slowly than the weights of the layers on the far right.
This creates a domino effect, because the weights of the far-left layers define the inputs to the far-right layers.
Therefore, the whole training of the network suffers, and that is called the problem of the vanishing gradient.

We’ve also defined that as a rule of thumb, if wrec is small – the gradient is
vanishing, and if wrec is large – the gradient is exploding.
But what’s “large” and “small” in this context? In fact, we can say that we
have a vanishing gradient if wrec < 1 and exploding gradient if wrec > 1.
Then, what’s the first thing that comes to your mind to solve this problem?
Probably, the easiest and fastest solution will be to make wrec = 1. That’s
exactly what was done in LSTMs. Of course, this is a very simplified
explanation, but in general, making recurrent weight equal to one is the main
idea behind LSTMs.
Now, let’s dig deeper into the architecture of LSTMs.

LSTM Architecture
The long short-term memory network was first introduced in 1997 by Sepp Hochreiter and his supervisor Jürgen Schmidhuber. It suggests a very elegant solution to the vanishing gradient problem.
Overview
To provide you with the simplest and most understandable illustrations of LSTM networks, we are going to use images created by Christopher Olah, who does an amazing job of explaining LSTMs in simple terms.
So, the first image below demonstrates what a standard RNN looks like from the inside.
The hidden layer in the central block receives input xt from the input layer and also from itself at time point t-1; it then generates output ht and also another input for itself at time point t+1.
This is the standard architecture, and it doesn't solve the vanishing gradient problem.

The next image shows what an LSTM looks like. This might seem very complex at first, but don't worry!
We're going to walk you through this architecture and explain in detail what's happening here. By the end of this section, you'll be completely comfortable with navigating LSTMs.
As you might recall, we started with the claim that in LSTMs wrec = 1. This feature is reflected as the straight pipeline along the top of the scheme, usually referred to as the memory cell. Its contents can flow through time very freely. Sometimes something might be removed or erased from it, and sometimes something might be added to it. Otherwise, it flows through time freely, and therefore when you backpropagate through an LSTM, you don't have the problem of the vanishing gradient.

Notation
Let’s begin with a few words on the notation:

• ct-1 stands for the memory cell state coming in from time point t-1;
• xt is the input at time point t;
• ht is the output at time point t that goes to both the output layer and the hidden layer at the next time point.

Thus, every block has three inputs (xt, ht-1, and ct-1) and two outputs (ht and ct). An important thing to remember is that all these inputs and outputs are not single values but vectors, with lots of values behind each of them.
Let’s continue our journey through the legend:

• Vector transfer: any line on the scheme is a vector.

• Concatenate: two lines combining into one, as for example, the vectors
from ht-1 and xt. You can imagine this like two pipes running in
parallel.
• Copy: the information is copied and goes into two different directions,
as for example, at the right bottom of the scheme, where output
information is copied in order to arrive at two different layers ht.

• Pointwise operation: there are five pointwise operations on the scheme, and they are of three types:
o "x" or valves (forget valve, memory valve, and output valve) – points on the scheme where you can open your pipeline for the flow, close it, or open it to some extent. For instance, the forget valve at the top of the scheme is controlled by a sigmoid layer operation. Based on the decision of this sigmoid activation function (ranging from 0 to 1), the valve will be closed, open, or open to some extent. If it's open, memory flows freely from ct-1 to ct. If it's closed, the memory is cut off, and probably new memory will be added further along the pipeline, where another pointwise operation is depicted.
o "+" – a T-shaped joint, where you have memory flowing through and you can add additional memory if the memory valve below this joint is open.
o "tanh" – responsible for transforming the value to be within the range from -1 to 1 (required due to certain mathematical considerations).

• Neural Network Layer: learned layer operations, with a vector coming in and a vector coming out.

Walk through the architecture


Now we are ready to look into the LSTM architecture step by step:

1. We’ve got new value xt and value from the previous node ht-1 coming
in.
2. These values are combined together and go through the sigmoid
activation function, where it is decided if the forget valve should be
open, closed or open to some extent.
3. The same values, or actually vectors of values, go in parallel through
another layer operation “tanh”, where it is decided what value we’re
going to pass to the memory pipeline, and also sigmoid layer
operation, where it is decided, if that value is going to be passed to the
memory pipeline and to what extent.
4. Then, we have a memory flowing through the top pipeline. If we have
forget valve open and memory valve closed then the memory will not
change. Otherwise, if we have forget valve closed and memory valve
open, the memory will be updated completely.
5. Finally, we’ve got xt and ht-1 combined to decide what part of the
memory pipeline is going to become the output of this module.

That’s basically, what’s happening within the LSTM network. As you can
see, it has a pretty straightforward architecture, but let’s move on to a specific
example to get an even better understanding of the Long Short-Term
Memory networks.
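To tie the valves to concrete equations, here is a minimal NumPy sketch of one LSTM step using the standard gate formulation (forget gate f, input or "memory" gate i, tanh candidate g, output gate o); the sizes and the random parameters are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(10)
    n_hidden, n_input = 4, 3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One weight matrix per gate, each acting on the concatenation [h(t-1), x(t)].
    Wf, Wi, Wg, Wo = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_input)) for _ in range(4))
    bf, bi, bg, bo = (np.zeros(n_hidden) for _ in range(4))

    def lstm_step(x_t, h_prev, c_prev):
        z = np.concatenate([h_prev, x_t])
        f = sigmoid(Wf @ z + bf)          # forget valve: how much of c(t-1) to keep
        i = sigmoid(Wi @ z + bi)          # memory valve: how much new content to add
        g = np.tanh(Wg @ z + bg)          # candidate new memory, squashed to [-1, 1]
        c = f * c_prev + i * g            # memory cell: the (nearly) freely flowing pipeline
        o = sigmoid(Wo @ z + bo)          # output valve
        h = o * np.tanh(c)                # output h(t)
        return h, c

    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    for t in range(5):
        h, c = lstm_step(rng.normal(size=n_input), h, c)
    print("h(5) =", np.round(h, 3))

The additive update of c is what lets gradients flow back through many time steps without repeatedly passing through the same multiplicative weight.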

Example Walkthrough
You might remember the translation example from one of our previous
articles. Recall that when we change the word “boy” to “girl” in the English
sentence, the Czech translation has two additional words changed because in
Czech the verb form depends on the subject’s gender.

So, let’s say the word “boy” is stored in the memory cell ct-1. It is just
flowing through the module freely if our new information doesn’t tell us that
there is a new subject.
If for instance, we have a new subject (e.g., “girl”, “Amanda”), we’ll close
the forget valve to destroy the memory that we had. Then, we’ll open a
memory valve to put a new memory (e.g., name, subject, gender) to the
memory pipeline via the t-joint.

If we put the word “girl” into the memory pipeline, we can extract different
elements of information from this single piece: the subject is female,
singular, the word is not capitalized, has 4 letters etc.

Next, the output valve facilitates the extraction of the elements required for
the purposes of the next word or sentence (gender in our example). This
information will be transferred as an input to the next module and it will help
the next module to decide on the best translation given the subject’s gender.

That’s how LSTM actually works.

LSTM Practical Intuition


Now we are going to dive into some practical applications of Long Short-Term Memory networks (LSTMs).
How do LSTMs work under the hood? How do they think? And how do they come up with the final output?

That’s going to be quite an interesting and at the same time a bit of magical
experience.

Neuron Activation
Here is our LSTM architecture. To start off, we are going to look at the hyperbolic tangent function tanh and how it fires. As you remember, its value ranges from -1 to 1. In the following images, "-1" is going to be red and "+1" is going to be blue.

Below is the first example of an LSTM "thinking". The image includes a snippet from "War and Peace" by Leo Tolstoy. The text was given to an RNN, which learned to read it and predict what text is coming next.

As you can see, this neuron is sensitive to position in a line. As you get towards the end of the line, it activates. How does it know that it is the end of the line? There are about 80 symbols per line in this novel. So, it's counting how many symbols have passed, and that's the way it tries to predict when the newline character is coming up.
The next cell recognizes direct speech. It’s keeping track of the quotation
marks and is activating inside the quotes.

This is very similar to our example where the network was keeping track of
the subject to understand if it is male or female, singular or plural, and to
suggest the correct verb forms for the translation. Here we observe the same
logic. It’s important to know if you are inside or outside the quotes because
that affects the rest of the text.

In the next image, we have a snippet from the code of the Linux operating system. This example refers to a cell that activates inside if-statements. It's completely dormant everywhere else, but as soon as there is an if-statement, it activates. It is then only active for the condition of the if-statement and stops being active at the actual body of the if-statement. That can be important because you're anticipating the body of the if-statement.
The next cell is sensitive to how deep you are inside a nested expression. As you go deeper and the expression gets more and more nested, this cell keeps track of that.

It’s very important to remember that none of these is actually hardcoded into
the neural network. All of these is learned by the network itself through
thousands and thousands of iterations.

The network kind of thinks: okay, I have this many hidden states, an out of
them I need to identify, what’s important in a text to keep track off. Then, it
identifies that in this particular text understanding how deep you’re inside a
nested statement is important. Therefore, it assigns one of its hidden states,
or memory cells, to keep track of that.

So, the network is really evolving on itself and deciding how to allocate its
resources to best complete the task. That’s really fascinating!

The next image demonstrates an example of a cell where you can't really tell what it's doing. According to Andrej Karpathy, about 95% of the cells are like this. They are doing something, but it's just not obvious to humans, even though it makes sense to the machine.
Output
Now let’s move to the actual output ht. This is the resulting value after it
passed the tangent function and the output valve.

So, what do we actually see in the next image?


This is a neural network that is reading a page from Wikipedia. This result
is a bit more detailed. The first line shows us if the neuron is active (green
color) or not (blue color), while the next five lines say us, what the neural
network is predicting, particularly, what letter is going to come next. If it’s
confident about its prediction, the color of the corresponding cell is red and
if it’s not confident – it is light red.

What do you think this specific hidden state in the neural network is looking
out for?

Yes, it’s activating inside URLs!

The first row demonstrates the neuron's activation inside the URL www.ynetnews.com. Then, below each letter you can see the network's prediction for the next letter.

For example, after the first “w” it’s pretty confident that the next letter will
be “w” as well. Conversely, its prediction about the letter or symbol after “.”
is very unsure because it could actually be any website.
As you see from the image, the network continues generating predictions
even when the actual neuron is dormant. See, for example, how it was able
to predict the word “language” just from the first two letters.

The neuron activates again in the third row, when another URL appears (see
the image below). That’s quite an interesting case.

You can observe that the network was pretty sure that the next letter after
“co” should be “m” to get “.com”, but it was another dot instead.

Then, the network predicted “u” because the domain “co.uk” (for the United
Kingdom) is quite popular. And again, this was the wrong prediction because
the actual domain was “co.il” (for Israel), which was not at all considered by
the neural network even as 2nd, 3rd, 4th or 5th best guess.

This is how to read the pictures that Andrej has created. There are a couple more such examples in his blog.

Hopefully, you are now much more comfortable about what’s going on
inside the neural network, when it’s thinking and processing information.

LSTM Variation
Have you checked all our articles on Recurrent Neural Networks (RNNs)? Then you should already be pretty comfortable with the concept of Long Short-Term Memory networks (LSTMs).
Let's wind up our journey with a very short section on LSTM variations.
You may encounter them sometimes in your work, so it could be really important for you to be at least aware of these other LSTM architectures.

Here is the standard LSTM that we discussed.

Now let’s have a look at a couple of variations.

Variation #1
In variation #1, we add peephole connections – the lines that feed additional
input about the current state of the memory cell to the sigmoid activation
functions.

Variation #2
In variation #2, we couple the forget valve and the memory valve. So, instead of having separate decisions about opening and closing the forget and memory valves, we have a single combined decision.
Basically, whenever you close the memory off (forget valve = 0), you have to put something in (memory valve = 1 - 0 = 1), and vice versa.

Variation #3

Variation #3 is usually referred to as the Gated Recurrent Unit (GRU). This modification completely gets rid of the memory cell and replaces it with the hidden state pipeline. So, here, instead of having two separate values – one for the memory and one for the hidden state – you have only one value.

It might look quite complex, but in fact the resulting model is simpler than the standard LSTM, which is why this modification has become increasingly popular. We have discussed the three LSTM modifications that are probably the most notable. However, be aware that there are lots and lots of other LSTM variations out there.
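For reference, a minimal NumPy sketch of one GRU step using the standard update gate z, reset gate r, and candidate state; the sizes and random parameters are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(11)
    n_hidden, n_input = 4, 3

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    Wz, Wr, Wh = (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_input)) for _ in range(3))

    def gru_step(x_t, h_prev):
        zin = np.concatenate([h_prev, x_t])
        z = sigmoid(Wz @ zin)                                      # update gate: keep old state vs. take new
        r = sigmoid(Wr @ zin)                                      # reset gate: how much of the past to use
        h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate hidden state
        return (1 - z) * h_prev + z * h_tilde                      # a single state: no separate memory cell

    h = np.zeros(n_hidden)
    for t in range(5):
        h = gru_step(rng.normal(size=n_input), h)
    print("h(5) =", np.round(h, 3))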
