
RNNs & LSTM

(Recurrent Neural Networks & Long Short-Term Memory)

1
RNNs vs CNNs
▪ RNNs: you can think of an RNN as sharing parameters across time.
▪ CNNs: you can think of a CNN as sharing parameters across space.

2
RNNs: motivation
▪ The cat sat on the mat.
▪ She got up and impatiently climbed on the chair, meowing for food.

▪ Say you were to feed those two sentences, one after the other, into a CNN and ask, where is
the cat? You’d have a problem, because the network has no concept of memory.
▪ This is incredibly important when it comes to dealing with data that has a temporal domain
(e.g., text, speech, video, and time-series data).
▪ RNNs address this problem by giving neural networks a memory via a hidden state. The
purpose of recurrent neural networks is to model sequences of tensors.
▪ You can think of an RNN sharing parameters across time and a CNN sharing parameters
across space.

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019.
3
Can CNNs do the same thing as RNNs?

▪ Note that it’s not impossible to do these things with CNNs; a lot of in-depth research in the
last few years has been done to apply CNN-based networks in the temporal domain.

▪ See “Temporal Convolutional Networks: A Unified Approach to Action Segmentation” by Colin Lea et al. (2016) for further information. See also seq2seq models!

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019.
4
What does an RNN look like?

(figure: unrolling an RNN through time)

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019.
5
Backpropagation through time
▪ When we have our completed predicted
sequence, we then have to backpropagate the
error back through the RNN.
▪ Because this involves stepping back through
the network’s steps, this process is known as
backpropagation through time.
▪ The input vector from the current time step
and the hidden state vector from the previous
time step are mapped to the hidden state
vector of the current time step (a minimal code sketch of this update follows below).

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build
intelligent language applications using deep learning. " O'Reilly Media, Inc.", 2019.
6
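▪ A minimal sketch of this update in code (the weight shapes and the tanh nonlinearity are illustrative assumptions for a standard Elman-style RNN, not taken from the slides):

```python
import torch

# Minimal sketch of the RNN recurrence described above (illustrative
# dimensions only): the current input x_t and the previous hidden state
# h_{t-1} are mapped to the new hidden state h_t.
input_size, hidden_size = 8, 16
W_xh = torch.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unrolling over a toy sequence of 5 time steps.
h = torch.zeros(hidden_size)
sequence = [torch.randn(input_size) for _ in range(5)]
for x_t in sequence:
    h = rnn_step(x_t, h)   # the same weights are reused at every time step
```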
Hidden state (=sequence representation)
▪ In deep learning, modelling sequences involves maintaining hidden “state information,” or a
hidden state.

▪ As each item in the sequence is encountered—for example, as each word in a sentence is


seen by the model—the hidden state is updated.

▪ Thus, the hidden state (usually a vector) encapsulates everything the model has seen in the
sequence so far. This hidden state vector is also called a sequence representation.

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build 7
intelligent language applications using deep learning. " O'Reilly Media, Inc.", 2019.
RNNs: issues
▪ Inability to retain information for long range predictions. RNNs are computing a hidden
state vector at each time step using the hidden state vector of the previous time step and
an input vector at the current time step.
▪ At each time step the RNN simply updates the hidden state vector regardless of whether it makes
sense. As a consequence, the RNN has no control over which values are retained and which
are discarded in the hidden state—that is entirely determined by the input. Intuitively, that
doesn’t make sense.

▪ Ideal situation? What is desired is some way for the RNN to decide if the update is optional,
or if the update happens, by how much and what parts of the state vector, and so on.

▪ Gradient stability: gradients can become unstable, tending either to zero (vanishing) or to infinity (exploding).

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build 8
intelligent language applications using deep learning. " O'Reilly Media, Inc.", 2019.
How to add attention to RNN

• Traditional RNN model for a seq2seq task like language translation parses the entire input
sequence (for instance, one or more sentences) before producing the translation.
• Why is the RNN parsing the whole input sentence before producing the first output? This
is motivated by the fact that translating a sentence word by word would likely result in
grammatical errors. One limitation of this seq2seq approach is that the RNN is trying to
remember the entire input sequence via a single hidden state (vector) before translating it.
Compressing all the information into one hidden state may cause loss of information, especially
for long sequences.
• Thus, similar to how humans translate sentences, it may be beneficial to have access to the
whole input sequence at each time step. In contrast to a regular RNN, an attention mechanism
lets the RNN access all input elements at each given time step. However, having access to all
input sequence elements at each time step can be overwhelming. So, to help the RNN focus on
the most relevant elements of the input sequence, the attention mechanism assigns different
attention weights to each input element. These attention weights designate how important or
relevant a given input sequence element is at a given time step (see the sketch below).
• https://www.kdnuggets.com/2022/03/packt-adding-attention-mechanism-rnns.html

9
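▪ A minimal sketch of the idea (the dot-product scoring and the tensor shapes are assumptions; real attention variants differ in how the scores are computed):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of attention over RNN encoder states (assumed shapes and a
# simple dot-product score, chosen only for illustration).
seq_len, hidden_size = 6, 16
encoder_states = torch.randn(seq_len, hidden_size)  # one vector per input element
decoder_state = torch.randn(hidden_size)            # decoder state at the current step

# Score each input element, then normalize into attention weights that sum to 1.
scores = encoder_states @ decoder_state              # (seq_len,)
attn_weights = F.softmax(scores, dim=0)              # "how relevant is each input now?"

# The context vector is a weighted sum of all input representations,
# so the decoder sees the whole input sequence at every time step.
context = attn_weights @ encoder_states              # (hidden_size,)
```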
RNN & LSTM & GRU
▪ There are different variants of RNNs. Two of the most important are Long Short-Term
Memory (LSTM) and Gated Recurrent Unit (GRU). These two power most of the deep
learning models for text and sequential data.

▪ Some applications include:


• Document classifiers: Identifying the sentiment of a tweet or review, classifying news
articles
• Sequence-to-sequence learning: For tasks such as language translations, converting
English to French
• Time-series forecasting: Predicting the sales of a store given data about previous
days

10
RNN vs. LSTM

(figures: a standard RNN cell and an LSTM cell with its gates)

▪ Standard RNN: tries to “remember” everything forever.

▪ LSTM: the forget gate allows us to model the idea that as we
continue in our input chain, the beginning of the chain
becomes less important.
▪ How much the LSTM forgets is something that is learned.
▪ The cell is the “memory” of the network layer.
▪ There are 3 scenarios:
▪ Data passes through
▪ Data is written to the cell
▪ Data may (or may not!) flow through to the
next layer, modified by the output gate

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019. 11
Gated Recurrent Units


▪ The GRU merges the forget and input gates into a single
update gate and drops the separate cell state (fewer
parameters), so it tends to be quicker to train and to use
fewer resources at runtime.

(figure: a GRU cell)

▪ Strictly speaking, GRUs are somewhat less expressive than
LSTMs because of this merging of gates.

▪ Suggestion: play with both and see what happens (a quick
parameter-count comparison follows below).

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019. 12
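▪ A quick way to see the “fewer parameters” point using PyTorch’s built-in layers (the layer sizes below are arbitrary examples):

```python
import torch.nn as nn

# Rough comparison of parameter counts for one recurrent layer.
input_size, hidden_size = 128, 256
lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# An LSTM layer has 4 gate/candidate blocks, a GRU layer has 3, so the GRU
# here uses roughly three quarters of the LSTM's parameters.
print("LSTM parameters:", n_params(lstm))
print("GRU parameters: ", n_params(gru))
```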
Constant error carousel

The memory cell can maintain its state over time. It is often
called the Constant Error Carousel. This is because at its
core it is a recurrently self-connected linear unit which
recirculates activation and error signals indefinitely. This
allows it to provide short-term memory storage for extended
time periods.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 13
Understanding gating (Link to LSTM)
▪ To intuitively understand gating, suppose that you were adding two quantities, a and b, but you
wanted to control how much of b gets into the sum. Mathematically, you can rewrite the sum a + b
as a + λb, where λ is a value between 0 and 1. If λ = 0, there is no contribution from b, and if λ = 1, b
contributes fully. Looking at it this way, you can interpret λ as a “switch” or a “gate” that
controls the amount of b that gets into the sum (a tiny numeric sketch follows below).

▪ The function λ is context dependent. This is the basic intuition behind all gated networks. The
function λ is usually a sigmoid function, which we know to produce a value between 0 and 1.

▪ In LSTM this basic intuition is extended carefully to incorporate not only conditional updates, but
also intentional forgetting of the values in the previous hidden state. This “forgetting” happens by
multiplying the previous hidden state value ht−1 with another function, μ, that also produces values
between 0 and 1 and depends on the current input.

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build 14
intelligent language applications using deep learning. " O'Reilly Media, Inc.", 2019.
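▪ A tiny numeric sketch of this gating idea (the values are arbitrary):

```python
import torch

# The gate lambda is a sigmoid of some context, and it scales how much of b
# enters the sum a + lambda * b.
a = torch.tensor(2.0)
b = torch.tensor(5.0)

context = torch.tensor(-1.0)          # whatever the gate is conditioned on
lam = torch.sigmoid(context)          # value in (0, 1), here about 0.27

gated_sum = a + lam * b               # b contributes only partially
print(float(lam), float(gated_sum))   # ~0.27, ~3.34
```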
So GRU or LSTM?
▪ When possible, prefer GRUs over LSTMs.
▪ GRUs provide almost comparable performance to LSTMs and use far fewer parameters and
compute resources.

Rao, Delip, and Brian McMahan. Natural language processing with PyTorch: build 15
intelligent language applications using deep learning. " O'Reilly Media, Inc.", 2019.
bidirectional LSTM = biLSTM
▪ LSTMs (and RNNs in general) can look to the
past as they are trained and when they make decisions.
But sometimes you need to see the future,
as in applications like translation and
handwriting recognition.

▪ A biLSTM solves this problem in the simplest of
ways: it’s essentially two stacked LSTMs,
with the input being sent in the forward
direction in one LSTM and reversed in the
second (see the PyTorch sketch below).

Pointer, Ian. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media, Inc., 2019. 16
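▪ A minimal PyTorch sketch of a bidirectional LSTM (sizes are arbitrary; this is an illustration, not the book’s exact recipe):

```python
import torch
import torch.nn as nn

# PyTorch runs the forward and backward LSTMs for us and concatenates their
# hidden states, so the output feature size is 2 * hidden_size.
input_size, hidden_size, seq_len, batch = 10, 20, 7, 3
bilstm = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True)

x = torch.randn(batch, seq_len, input_size)
output, (h_n, c_n) = bilstm(x)

print(output.shape)  # torch.Size([3, 7, 40]) -> forward + backward states
print(h_n.shape)     # torch.Size([2, 3, 20]) -> one final state per direction
```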
Recurrent neural networks: #1
▪ Recurrent neural networks: neural networks in which there are feedback loops.

▪ The idea in these models is to have neurons which fire for some limited duration of time,
before becoming quiescent.

▪ Loops don't cause problems in such a model, since a neuron's output only affects its input
at some later time, not instantaneously.

▪ Recurrent neural nets have been less influential than feedforward networks, in part because
the learning algorithms for recurrent nets are (at least to date) less powerful.

http://neuralnetworksanddeeplearning.com/chap1.html 17
Recurrent neural networks: #1.1
▪ Why would a classical feedforward Multilayer Perceptron network not work?

▪ For example, consider a univariate time series problem, like the price of a stock over time.

▪ This dataset can be framed as a prediction problem for a classical feedforward Multilayer Perceptron
network by defining a window size and training the network to learn to make short-term predictions
from the fixed-sized window of inputs.

▪ This would work, but it is very limited. The window of inputs adds memory to the problem, but it is limited to
just a fixed number of points and must be chosen with sufficient knowledge of the problem. A naive
window would not capture the broader trends over minutes, hours and days that might be relevant to
making a prediction. From one prediction to the next, the network only knows about the specific inputs
it is provided (see the windowing sketch below).

18
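▪ A minimal sketch of this fixed-window framing (the series values and window size are made up):

```python
import numpy as np

# Turn a univariate series into (window of past values -> next value) pairs
# that a feedforward MLP could be trained on.
def make_windows(series, window_size):
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # the last `window_size` points
        y.append(series[i + window_size])     # the value to predict
    return np.array(X), np.array(y)

prices = np.array([101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1])
X, y = make_windows(prices, window_size=3)
print(X.shape, y.shape)   # (4, 3) (4,) -- memory is limited to 3 past points
```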
Recurrent neural
networks: #2

19
Recurrent neural networks: #3

20
Recurrent neural networks: #3.1
▪ Consider the following taxonomy of sequence problems that require a mapping of an input
to an output:

▪ One-to-Many: sequence output, for image captioning.

▪ Many-to-One: sequence input, for sentiment classification.

▪ Many-to-Many: sequence in and out, for machine translation.

▪ Synchronized Many-to-Many: synced sequences in and out, for video classification.

21
Recurrent neural networks: #2

▪ These have directed cycles, which means you can sometimes get back to where you started by following the arrows.

▪ They can have complicated dynamics, and this can make them very difficult to train.

▪ They are more biologically realistic.

▪ Recurrent nets with multiple hidden layers are just a special case that has some of the hidden→hidden connections missing.

https://www.cs.toronto.edu/~hinton/coursera_slides.html
Recurrent neural networks: #3

(figure: an RNN unrolled through time; at each time slice, input → hidden → output)

▪ Recurrent neural networks are a very natural way to model sequential data:
▪ They are equivalent to very deep nets with one hidden layer per time slice,
▪ except that they use the same weights at every time slice and they get input at every time slice.
▪ They have the ability to remember information in their hidden state for a long time.
▪ But it’s very hard to train them to use this potential.

https://www.cs.toronto.edu/~hinton/coursera_slides.html
Issue for RNNs: #1

▪ For the techniques to be effective on real problems, two major issues needed to be
resolved for the network to be useful:

▪ How to train the network with Back propagation.

▪ How to stop gradients vanishing or exploding during training.

24
Issue for RNNs: #2
▪ Back propagation breaks down in a RNN, because of the recurrent or loop connections. This
was addressed with a modification of the Back propagation technique called Back
propagation Through Time or BPTT.

▪ Instead of performing back propagation on the recurrent network as stated, the structure of
the network is unrolled, where copies of the neurons that have recurrent connections are
created. For example a single neuron with a connection to itself (A -> A) could be
represented as two neurons with the same weight values (A -> B).

▪ This allows the cyclic graph of a recurrent neural network to be turned into an acyclic graph
like a classic feedforward neural network, and Back propagation can be applied.

25
Issue for RNNs: #3
▪ When Back propagation is used in very deep neural networks and in unrolled recurrent neural networks,
the gradients that are calculated in order to update the weights can become unstable.

▪ They can become very large numbers called exploding gradients or very small numbers called the
vanishing gradient problem.

▪ The solution depends on the applications:

▪ This problem is alleviated in deep Multilayer Perceptron networks through the use of the Rectifier
transfer function.
▪ In RNN architectures, this problem has been alleviated using a new type of architecture called the
Long Short-Term Memory Networks that allows deep recurrent networks to be trained.

26
Long-short-term memory: #1
▪ The Long Short-Term Memory or LSTM
network is a recurrent neural network
that is trained using Back propagation
Through Time and overcomes the
vanishing gradient problem.

▪ As such it can be used to create large


(stacked) recurrent networks.

▪ Almost all exciting results based on


recurrent neural networks are achieved
with them.

https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 27
Long-short-term memory: #2
▪ In theory, classic (or "vanilla") RNNs can keep track of arbitrary long-term dependencies in
the input sequences.

▪ The problem of vanilla RNNs is computational (or practical) in nature: when training a
vanilla RNN using back-propagation, the gradients which are back-propagated can "vanish"
(that is, they can tend to zero) or "explode" (that is, they can tend to infinity), because of
the computations involved in the process, which use finite-precision numbers.

▪ RNNs using LSTM units partially solve the vanishing gradient problem, because LSTM units
allow gradients to also flow unchanged. However, LSTM networks can still suffer from the
exploding gradient problem.

https://en.wikipedia.org/wiki/Long_short-term_memory 28
Long-short-term memory: #3
(figure: unrolling an RNN into a chain)

▪ A recurrent neural network can be thought of as multiple copies of the


same network, each passing a message to a successor.

▪ This chain-like nature reveals that recurrent neural networks are


intimately related to sequences and lists.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 29
Long-short-term memory: #4

▪ From RNNs to LSTMs: RNNs seem to be pretty good at short-term memory but not so good
at long-term memory. That is where LSTMs come in.

▪ LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for long periods of time is practically their default behaviour, not something
they struggle to learn!

http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 30
Long-short-term memory: #5

▪ There are three types of gates within a memory unit:

▪ Forget Gate: conditionally decides what information to discard from the unit.
▪ Input Gate: conditionally decides which values from the input are used to update the memory
state.
▪ Output Gate: conditionally decides what to output based on the input and the memory of
the unit (a minimal code sketch of these gates follows below).

31
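▪ A minimal sketch of one LSTM step with these three gates (weights, sizes and initialization are illustrative assumptions; real implementations pack the gates into a single matrix multiply):

```python
import torch

# Standard LSTM step: forget, input and output gates plus a candidate update.
input_size, hidden_size = 8, 16

def gate_params():
    return (torch.randn(hidden_size, input_size) * 0.1,
            torch.randn(hidden_size, hidden_size) * 0.1,
            torch.zeros(hidden_size))

(W_xf, W_hf, b_f) = gate_params()   # forget gate
(W_xi, W_hi, b_i) = gate_params()   # input gate
(W_xo, W_ho, b_o) = gate_params()   # output gate
(W_xg, W_hg, b_g) = gate_params()   # candidate cell values

def lstm_step(x_t, h_prev, c_prev):
    f = torch.sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)  # what to discard from the cell
    i = torch.sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)  # which new values to write
    o = torch.sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)  # what to expose as output
    g = torch.tanh(W_xg @ x_t + W_hg @ h_prev + b_g)     # candidate cell update
    c_t = f * c_prev + i * g                              # update the memory cell
    h_t = o * torch.tanh(c_t)                             # gated output / hidden state
    return h_t, c_t

h = c = torch.zeros(hidden_size)
for x_t in [torch.randn(input_size) for _ in range(5)]:
    h, c = lstm_step(x_t, h, c)
```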
Example of LONG & SHORT term memory
▪ Example of short term memory: Sometimes, we only need to look at recent information to perform the
present task. For example, consider a language model trying to predict the next word based on the
previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any
further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap
between the relevant information and the place that it’s needed is small, RNNs can learn to use the past
information.

▪ Example of LONG term memory: But there are also cases where we need more context. Consider trying
to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information
suggests that the next word is probably the name of a language, but if we want to narrow down which
language, we need the context of France, from further back. It’s entirely possible for the gap between
the relevant information and the point where it is needed to become very large. Unfortunately, as that
gap grows, RNNs become unable to learn to connect the information.

▪ The reasons are explained in this paper: Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term
dependencies with gradient descent is difficult." IEEE transactions on neural networks 5.2 (1994): 157-166.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 32
Other types of LSTM NN

▪ Vanilla LSTM. Memory cells of a single LSTM layer are used in a simple network structure.
▪ Stacked LSTM. LSTM layers are stacked one on top of another into deep networks.
▪ CNN LSTM. A convolutional neural network is used to learn features in spatial input like
images and the LSTM can be used to support a sequence of images as input or generate a
sequence in response to an image.
▪ Encoder-Decoder LSTM. One LSTM network encodes input sequences and a separate LSTM
network decodes the encoding into an output sequence.
▪ Bidirectional LSTM. Input sequences are presented and learned both forward and
backward.
▪ Generative LSTM. LSTMs learn the structure relationship in input sequences so well that
they can generate new plausible sequences.

33
What are the limitations of the perceptron and why
do we need LSTM NNs?
▪ Multilayer perceptrons have 5 critical limitations:

▪ Stateless. MLPs learn a fixed function approximation. Any outputs that are conditional on the
context of the input sequence must be generalized and frozen into the network weights.
▪ Unaware of Temporal Structure. Time steps are modelled as input features, meaning that network
has no explicit handling or understanding of the temporal structure or order between observations.
▪ Messy Scaling. For problems that require modelling multiple parallel input sequences, the number
of input features increases as a factor of the size of the sliding window without any explicit
separation of time steps of series.
▪ Fixed Sized Inputs. The size of the sliding window is fixed and must be imposed on all inputs to the
network.
▪ Fixed Sized Outputs. The size of the output is also fixed and any outputs that do not conform must
be forced.

34
Backpropagation Through Time
▪ Back propagation breaks down in a RNN, because of the recurrent or loop connections. This was addressed with a
modification of the Back propagation technique called Back propagation Through Time or BPTT.
▪ How was it solved?
▪ Instead of performing back propagation on the recurrent network as stated, the structure of the network is unrolled,
where copies of the neurons that have recurrent connections are created. For example a single neuron with a
connection to itself (A -> A) could be represented as two neurons with the same weight values (A -> B).
▪ This allows the cyclic graph of a recurrent neural network to be turned into an acyclic graph like a classic feedforward
neural network, and Back propagation can be applied.
▪ Still, how was the issue with the gradient solved?
▪ BPTT can be computationally expensive as the number of time steps increases. If input sequences are comprised of
thousands of time steps, then this will be the number of derivatives required for a single weight update. This can
cause weights to vanish or explode (go to zero or overflow) and make slow learning and model skill noisy.
▪ Truncated Backpropagation Through Time, or TBPTT, is a modified version of the BPTT training algorithm for
recurrent neural networks where the sequence is processed one time step at a time and, periodically, an update is
performed back over a fixed number of time steps (see the sketch below).

35
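▪ A minimal PyTorch sketch of the truncated idea, where the hidden state is detached between chunks so gradients only flow back a fixed number of steps (the sizes, chunk length and loss are arbitrary):

```python
import torch
import torch.nn as nn

# A long sequence is processed in chunks; detaching the hidden state between
# chunks cuts the gradient path, so backprop only covers the current chunk.
input_size, hidden_size, seq_len, chunk = 4, 8, 1000, 50
rnn = nn.RNN(input_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 1)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(1, seq_len, input_size)
target = torch.randn(1, seq_len, 1)

h = torch.zeros(1, 1, hidden_size)
for start in range(0, seq_len, chunk):
    x_chunk = x[:, start:start + chunk]
    y_chunk = target[:, start:start + chunk]

    optimizer.zero_grad()
    out, h = rnn(x_chunk, h)
    loss = nn.functional.mse_loss(head(out), y_chunk)
    loss.backward()            # gradients stop at the start of this chunk
    optimizer.step()

    h = h.detach()             # carry the state forward, but cut the gradient path
```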
Tuning LSTM: #1
▪ You fit the model to your training data and evaluate it on the test dataset, then report the skill. Perhaps
you use k-fold cross-validation to evaluate the model, then report the skill of the model.

▪ This is a mistake made by beginners.

▪ It looks like you’re doing the right thing, but there is a key issue you have not accounted for: deep
learning models are stochastic. Artificial neural networks like LSTMs use randomness while being fit on a
dataset, such as random initial weights and random shuffling of data during each training epoch during
stochastic gradient descent.

▪ This means that each time the same model is fit on the same data, it may give different predictions and
in turn have different overall skill.

36
Tuning LSTM: #2
▪ Stochastic models, like deep neural networks, add an additional source of randomness. This
additional randomness gives the model more flexibility when learning, but can make the
model less stable (e.g. different results when the same model is trained on the same data).

▪ This is different from model variance that gives different results when the same model is
trained on different data.

▪ To get a robust estimate of the skill of a stochastic model, we must take this additional
source of variance into account; we must control for it.

▪ A robust approach is to repeat the experiment of evaluating a stochastic model multiple


times.

37
Tuning LSTM: #3

▪ How Unstable Are Neural Networks? It depends on your problem, on the network, and on
its configuration.

▪ It is recommended to perform a sensitivity analysis to find out. Evaluate the same model on
the same data many times (30, 100, or thousands) and only vary the seed for the random
number generator (see the sketch below).

▪ Then review the mean and standard deviation of the skill scores produced. The standard
deviation (average distance of scores from the mean score) will give you an idea of just how
unstable your model is.

38
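▪ A minimal sketch of such a sensitivity analysis; `build_and_evaluate_model` is a hypothetical stand-in for your own training and evaluation code:

```python
import numpy as np

# Train and evaluate the same model several times, varying only the random
# seed, then summarize the skill scores with their mean and standard deviation.
def run_sensitivity_analysis(build_and_evaluate_model, n_repeats=30):
    scores = []
    for seed in range(n_repeats):
        np.random.seed(seed)            # also seed torch/random as appropriate
        scores.append(build_and_evaluate_model(seed))
    scores = np.array(scores)
    print(f"skill: mean={scores.mean():.4f}, std={scores.std():.4f}")
    return scores

# Example with a stand-in "model" that just returns a noisy score:
run_sensitivity_analysis(lambda seed: 0.85 + np.random.randn() * 0.02)
```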
Recurrent neural networks [RNNs]
▪ Recurrent neural networks , or RNNs, are a family of neural networks for processing
sequential data.
▪ RNNs are notably order dependent, or time dependent: they process the timesteps of their
input sequences in order, and shuffling or reversing the timesteps can completely change
the representations the RNN extracts from the sequence.
▪ In other words, a recurrent neural network is a neural network that is specialized for
processing a sequence of values x1, x2, …, xn. They are very effective in certain tasks because
they have “memory”.

▪ There are two types of them:


▪ Unidirectional RNN: can take information from the past to process later inputs.
▪ Bidirectional RNN can take context from both the past and the future.

39
Bidirectional RNN: #0

▪ This recipe builds on the multilayer LSTM recipe. In a normal LSTM, the LSTM reads the
input sequence from first to last; however, in a bidirectional LSTM, there is a second
LSTM that reads the sequence from last to first—that is, a backward RNN.

▪ This type of LSTM improves the model performance when the prediction at the current
timestamp is dependent on the inputs further on in the sequence.

Jibin Mathew, PyTorch Artificial Intelligence Fundamentals.


40
Bidirectional RNN: #1

41
Bidirectional RNN: #2
▪ This approach has been used to great effect with LSTM Recurrent Neural Networks.
Providing the entire sequence both forwards and backwards is based on the assumption
that the whole sequence is available.

▪ Nevertheless, it may raise a philosophical concern where ideally time steps are provided in
order and just-in-time.

▪ The use of providing an input sequence bi-directionally was justified in the domain of
speech recognition because there is evidence that in humans, the context of the whole
utterance is used to interpret what is being said rather than a linear interpretation.

42
Bidirectional RNN: #3
▪ Relying on knowledge of the future seems at first sight to violate causality.

▪ How can we base our understanding of what we’ve heard on something that hasn’t been
said yet? However, human listeners do exactly that.

▪ Sounds, words, and even whole sentences that at first mean nothing are found to make
sense in the light of future context.

▪ What we must remember is the distinction between tasks that are truly online - requiring
an output after every input - and those where outputs are only needed at the end of some
input segment.

Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures, 2005. 43
How to apply dropout to a RNNs
▪ It has long been known that applying dropout before a recurrent layer hinders learning
rather than helping with regularization.
▪ The same dropout mask (the same pattern of dropped units) should be applied at every
timestep, instead of a dropout mask that varies randomly from timestep to timestep.
What’s more, in order to regularize the representations formed by the recurrent gates of
layers such as GRU and LSTM, a temporally constant dropout mask should be applied to the
inner recurrent activations of the layer (a recurrent dropout mask).
▪ Using the same dropout mask at every timestep allows the network to properly propagate
its learning error through time; a temporally random dropout mask would disrupt this error
signal and be harmful to the learning process (see the sketch below).

Chollet, Francois. Deep learning with Python. Simon and Schuster, 2017.
44
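▪ A minimal Keras sketch (Keras is what the Chollet book uses): the `dropout` argument masks the layer inputs, while `recurrent_dropout` applies the temporally constant mask to the recurrent activations; the sizes below are arbitrary examples:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Recurrent dropout in Keras: a GRU layer with both input dropout and a
# temporally constant recurrent dropout mask.
model = keras.Sequential([
    keras.Input(shape=(None, 16)),                  # (timesteps, features)
    layers.GRU(32, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mse")
model.summary()
```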
RNNs, market and machine learning
▪ Some readers are bound to want to take the techniques we’ve introduced here and try
them on the problem of forecasting the future price of securities on the stock market (or
currency exchange rates, and so on). Markets have very different statistical characteristics
than natural phenomena such as weather patterns. Trying to use machine learning to beat
markets, when you only have access to publicly available data, is a difficult endeavour, and
you’re likely to waste your time and resources with nothing to show for it.

▪ Always remember that when it comes to markets, past performance is not a good predictor
of future returns—looking in the rear-view mirror is a bad way to drive.

▪ Bottom line? Machine learning, in contrast, is applicable to datasets where the past
is a good predictor of the future.

45
Chollet, Francois. Deep learning with Python. Simon and Schuster, 2017.
RNNs are Turing complete
▪ It turns out that with enough neurons and time, RNNs can compute anything that can be
computed by your computer. Computer Scientists refer to this as being Turing complete.
▪ Turing complete roughly means that in theory an RNN can be used to solve any
computation problem.

▪ “In theory” often translates poorly into practice because we don’t have infinite memory or
time. Thus, RNNs can approximate Turing completeness only up to the limits of their available
memory.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 46
Partially recurrent neural network (PRNN)
▪ A partially recurrent neural network (PRNN) is defined as a network formed by
feedforward and feedback connections in which the former are predominant but the
latter are crucial to deal with tasks involving temporal sequences.
▪ Partially recurrent neural networks may be characterized as having a layered topology
where feedback loops exist from one layer to specialized neurons of a preceding layer.
▪ These specialized neurons are the so-called context units. The purpose of these units is to
store information concerning the temporal sequence of the input data.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.57.3012&rep=rep1&type=pdf 47
Elman and Jordan NNs: #1
▪ Elman NNs are a popular partially recurrent NNs. Initially designed to learn
sequential or time-varying patterns. In an Elman neural network the number of
neurons in the context layer is equal to the number of neurons in the hidden
layer. In addition, the context layer neurons are fully connected to all the neurons
in the hidden layer. Memory occurs through the delay (context) units which are
fed by hidden layer neurons.
▪ A Jordan NN is a single-hidden-layer feedforward neural network. It is similar to
the Elman NN. The only difference is that the context (delay) neurons are fed
from the output layer instead of the hidden layer. It therefore “remembers” the
output from the previous time step. Like the Elman neural network, it is useful for
predicting time series observations which have a short memory.

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 48
Elman and Jordan NNs: #2

(figures: Elman and Jordan network diagrams; in both, the context layer is the recurrent/delay layer)

Lewis, N. D. "Deep Time Series Forecasting with Python." Create Space Independent Publishing Platform (2016). 49
