Module2 L7 RNN LSTM

Recurrent Neural Network (RNN)

• Recurrent neural networks (RNNs) are state-of-the-art algorithms for sequential data and are used by Apple's Siri and Google's voice search.
• Sequential data?
• When the points in a dataset depend on other points in the dataset, the data is said to be sequential.
• Ex: time-series data, stock market price data, words in a sentence, gene sequence data, etc.
• Why can't an ANN be used for sequential data?
• It doesn't consider the dependencies within sequential data.
• Ex: given time-series data, develop a DNN to predict the outlook of a day as sunny/rainy/windy.
• A traditional NN makes the prediction for each observation independently of the other observations.
• This violates the fact that the weather on a particular day is strongly correlated with the weather of the previous day and the following day.
• A traditional neural network assumes the data is non-sequential and that each data point is independent of the other data points.
• Hence, the inputs are analyzed in isolation, which causes problems when there are dependencies in the data.
• In traditional neural networks, all the inputs and outputs are independent of each other; but when the task is to predict the next word of a sentence, the previous words are required, and hence there is a need to remember the previous words.
• RNNs are a type of neural network where the output from the previous step is fed as input to the current step.

• The most important feature of an RNN is the hidden state, which remembers some information about the sequence.
• An RNN has a "memory" which remembers information about what has been calculated in the previous steps.
• It uses the same parameters for each input, since it performs the same task on all the inputs or hidden layers to produce the output.
• This reduces the number of parameters, unlike other neural networks.
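A minimal sketch (not from the slides) of the idea above: the same weights W_xh, W_hh and bias b_h are reused at every time step, and the hidden state h carries information from earlier inputs forward. The dimensions and the tanh activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# the SAME parameters are reused at every time step
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-hidden weights
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One recurrent step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                        # initial hidden state ("memory")
sequence = rng.standard_normal((5, input_size))  # a toy sequence of 5 inputs
for x_t in sequence:
    h = rnn_step(x_t, h)                         # h now summarizes everything seen so far
print(h)
```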
Some Applications of RNN
Why not ANN?
• 1. An issue with using an ANN for language translation is that we cannot fix the number of neurons in a layer; it depends on the number of words in the input sentence.
Why not ANN?

• 2. Too many computations.

• Input words have to be converted to vectors (one-hot encoding or word2vec embeddings).
• Hence, that many neurons and parameters have to be learnt by the model.
Why not ANN?
• 3. Doesn't preserve the sequence relationships in the input data.
• A traditional neural network assumes the data is non-sequential and that each data point is independent of the other data points.
• Hence, the inputs are analyzed in isolation, which causes problems when there are dependencies in the data.
• Since each hidden layer has its own weights, biases and activations, the layers behave independently.
• When the input is sequence data, the model should also be able to identify the relationship between successive inputs.
• If the task is to predict the next word in a sentence using an MLP, this will not help: all hidden layers, with different weights and biases, work independently.
• To make the hidden layers preserve the sequence relationship in the input, all the hidden layers have to be combined.
• To combine them, use the same weights and activation functions.
• All these hidden layers can then be rolled together into a single recurrent layer.
How does an RNN work?
• Neurons in the recurrent layer are called recurrent neurons.
• At all time steps, the weights of the recurrent neurons are the same.
• So a recurrent neuron stores the state of a previous input and combines it with the current input, thereby preserving some relationship of the current input with the previous inputs.
• An RNN converts independent activations into dependent activations by providing the same weights and biases to all the layers, thus reducing the complexity of increasing parameters, and it memorizes each previous output by giving it as input to the next hidden layer.
• The entire RNN computation involves computations to update the cell state at each time step and computations to predict the output at that time step.
• During the forward pass, we calculate the output at each time step in order to calculate the individual loss at each time step.
• The individual losses are combined to form the total loss.
• This total loss is used to train the neural network; a sketch of this forward pass follows.
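A hedged sketch of the forward pass just described: the network produces an output at every time step, a per-step loss is computed, and the per-step losses are summed into the total loss used for training. The squared-error loss and the dimensions are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, output_size, T = 4, 3, 2, 6

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

xs = rng.standard_normal((T, input_size))   # input sequence
ys = rng.standard_normal((T, output_size))  # a target at every time step

h, total_loss = np.zeros(hidden_size), 0.0
for t in range(T):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h + b_h)        # update the hidden/cell state
    y_hat = W_hy @ h + b_y                            # prediction at this time step
    total_loss += 0.5 * np.sum((y_hat - ys[t]) ** 2)  # individual loss, accumulated
print(total_loss)  # this total loss is what training differentiates
```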
Desirable Characteristics of RNN for Sequence Modeling

• Ability to handle sequences of variable lengths.
• Information about the next word to be predicted in the sequence might be present much earlier, at the beginning of the sequence.

• Ability to capture and model long-term dependencies.
• This is possible since RNNs keep updating information collected from the past by updating their recurrent/hidden cell state at each time step.
Desirable Characteristics of RNN for Sequence Modeling…

• Ability to capture differences in sequence order.
• Two sentences may contain the same words but have different meanings.
• RNNs capture this difference, since they use the same weight matrices at each time step to update the hidden state and remember past information.
Backpropagation Through Time (BPTT)
• In a FFNN, the gradient of the loss function is backpropagated through one feed-forward network in one time step/input.
• But in an RNN, the gradient of the total error is propagated to the individual time steps and also across the time steps, from the most recent time step back to the very beginning of the sequence.
• Hence the name "Backpropagation Through Time". A sketch using automatic differentiation follows.
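A small sketch of backpropagation through time using automatic differentiation (PyTorch is assumed here, not prescribed by the slides): the total loss over all time steps is backpropagated through the unrolled recurrence, so the shared weights receive gradient contributions from every time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
readout = nn.Linear(3, 2)

x = torch.randn(1, 6, 4)             # (batch, time steps, features)
targets = torch.randn(1, 6, 2)       # a target at every time step

outputs, _ = rnn(x)                  # hidden states for all 6 time steps
loss = nn.functional.mse_loss(readout(outputs), targets)  # total loss over all steps
loss.backward()                      # gradients flow back through every time step
print(rnn.weight_hh_l0.grad.shape)   # the shared W_hh receives gradient from all steps
```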
Types of RNN
• One to One RNN
• One to Many RNN
• Many to One RNN
• Many to Many RNN
One to One RNN
• One-to-One RNN is the most basic and traditional type of neural network, giving a single output for a single input.
• It is also known as a Vanilla Neural Network. It is used to solve regular machine learning problems. Ex: image classification

One to Many
• One-to-Many is a kind of RNN architecture applied in situations that require multiple outputs for a single input.

• Image Captioning – Here, let's say we have an image for which we need a textual description. So we have a single input – the image – and a series or sequence of words as output. The image might be of a fixed size, but the output is a description of varying length.
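A hedged sketch of a one-to-many setup in the image-captioning style described above (PyTorch assumed; all names, sizes and the dummy decoder inputs are illustrative assumptions): a single image feature vector initialises the hidden state, and the RNN then emits one word-score vector per output time step.

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim, vocab_size, caption_len = 16, 32, 500, 7

img_to_h0 = nn.Linear(feat_dim, hidden_dim)      # map image feature -> initial hidden state
decoder = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

image_feature = torch.randn(1, feat_dim)                   # the single input
h0 = torch.tanh(img_to_h0(image_feature)).unsqueeze(0)     # (num_layers, batch, hidden)
steps = torch.zeros(1, caption_len, hidden_dim)            # dummy decoder inputs for the sketch
outputs, _ = decoder(steps, h0)
word_scores = to_vocab(outputs)                  # (1, caption_len, vocab_size): many outputs
print(word_scores.shape)
```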
Many to One
• It takes a sequence of information as input and produces a fixed-size output.
• The many-to-one RNN architecture is commonly used for sentiment analysis models. As the name suggests, this kind of model is used when multiple inputs are required to give a single output.
• Take, for example, a Twitter sentiment analysis model. In that model, a text input (words as multiple inputs) gives a fixed sentiment (single output).
• Another example could be a movie rating model that takes review texts as input and provides a rating for the movie that may range from 1 to 5. A sketch of such a many-to-one model follows.
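A hedged sketch of a many-to-one model along the lines described above (PyTorch assumed; the class name SentimentRNN, the vocabulary size, the dimensions and the final sigmoid are illustrative assumptions): a sequence of word indices goes in, and a single sentiment score comes out.

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                    # (batch, seq_len) of word indices
        _, h_last = self.rnn(self.embed(token_ids))  # keep only the final hidden state
        return torch.sigmoid(self.out(h_last[-1]))   # one output per whole sequence

model = SentimentRNN()
tokens = torch.randint(0, 1000, (2, 12))             # 2 sentences, 12 words each
print(model(tokens).shape)                           # torch.Size([2, 1])
```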
Many-to-Many
• The Many-to-Many RNN architecture takes multiple inputs and gives multiple outputs.
• Ex: language translation
• The input is a sentence with many words, and the output is the translated sentence, which is also a sequence of many words.
Problem of Long-Term Dependencies
• The problem of vanishing gradients:
• Multiplying two small numbers (gradients) results in an even smaller number (gradient).
• It becomes harder and harder for the neurons to propagate the error back to the earlier stages.
• Hence, the parameters will be biased to capture only short-term dependencies.
• RNNs predict the next word in a sequence based on relevant information from the distant past.
• If the distance between the distant past and the current time step is small, RNNs predict the next word correctly.
• As the sequence length increases, RNNs won't be able to remember the relevant information in the distant past and predict the next word.

• This is common in real-life use cases with long sequences.

• This is due to the vanishing gradient problem.
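A tiny numeric illustration (the values are made up) of why gradients vanish: if the per-step gradient factor is smaller than 1, its product over many time steps shrinks toward zero, so distant time steps contribute almost nothing to the parameter update.

```python
# repeatedly multiplying gradients smaller than 1 drives the contribution
# from distant time steps toward zero
local_gradient = 0.5
for steps_back in (1, 5, 10, 20, 50):
    print(steps_back, local_gradient ** steps_back)
# 1 -> 0.5, 5 -> 0.03125, 10 -> ~0.00098, 20 -> ~9.5e-07, 50 -> ~8.9e-16
```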
A solution to the vanishing gradient problem of RNNs:

• Keep track of long-term dependencies by using "gates".

• These "gates" control which information from the "distant past" should be passed through the network to update the current cell state.

• The most commonly used variants of RNNs that are capable of remembering long-term dependencies using "gated cells" are the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit).

• The "gates" perform different tensor operations to decide which information can be removed from or added to the current hidden state.
Problems of RNN
• Recurrent neural networks suffer from short-term memory.
• If a sequence is long enough, they'll have a hard time carrying information from earlier time steps to later ones. So when processing a paragraph of text to make predictions, RNNs may leave out important information from the beginning.

• During backpropagation, recurrent neural networks suffer from the vanishing gradient problem.
• Layers that get a small gradient update stop learning.
• Those are usually the earlier layers.
• Because these layers don't learn, RNNs can forget what they have seen in longer sequences, thus having a short-term memory.
LSTMs and GRUs as a solution
• LSTMs and GRUs were created as a solution to short-term memory. They have internal mechanisms called gates that can regulate the flow of information.
• These gates can learn which data in a sequence is important to keep or throw away.
• By doing that, they can pass relevant information down the long chain of sequences to make predictions.
• LSTMs and GRUs can be found in speech recognition, speech synthesis, and text generation. You can even use them to generate captions for videos.
Recap of RNN
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

• How does a cell in an RNN calculate the hidden state (short-term memory)?

• It combines the current input and the previous hidden state into a vector.
• This vector has information on the current input and the previous inputs.
• The vector goes through the tanh activation, and the output is the new hidden state, or the memory of the network.
Recap of RNN
• The tanh activation is used to help regulate the values flowing through the network.
• It squishes values to always be between -1 and 1.
• RNNs require relatively few computational resources.
• But they work well only for shorter sequences.
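A minimal sketch of that recap (the single combined weight matrix is an assumption of this sketch): the previous hidden state and the current input are joined into one vector and passed through tanh, giving the new hidden state, whose values always lie in (-1, 1).

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size, input_size = 3, 4
W = rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(hidden_size)

h_prev = np.zeros(hidden_size)
x_t = rng.standard_normal(input_size)

combined = np.concatenate([h_prev, x_t])  # previous hidden state + current input
h_t = np.tanh(W @ combined + b)           # new hidden state, every value in (-1, 1)
print(h_t)
```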
LSTM
• An LSTM has a control flow similar to that of a recurrent neural network.
• It processes data, passing on information as it propagates forward.
• The differences are the operations within the LSTM's cells.
• LSTMs keep short-term memory in "hidden states" and long-term memory in "cell states".
• The core concepts of LSTMs are the cell state and its various gates.
• The cell state transfers relevant information all the way down the sequence chain.
• It helps to preserve the "long-term memory" of the network.
• The cell state helps information from earlier time steps make its way to later time steps, reducing the effects of short-term memory.
• As the cell state goes on its journey, information gets added to or removed from the cell state via gates.
• The gates are different neural networks that decide which information is allowed on the cell state.
• The gates can learn which information is relevant to keep or forget during training.
LSTM…

• Gates use the "sigmoid" activation function to update or forget data.

• Sigmoid squishes its input values to between 0 and 1.
• If sigmoid squishes its input X closer to 0, then X is forgotten.
• If sigmoid squishes its input X closer to 1, then X is kept.
• Using sigmoid, the network can learn which data is not important and can therefore be forgotten, and which data is important to keep.
• Three different gates regulate information flow in an LSTM cell: a forget gate, an input gate, and an output gate.
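A small illustration (the numbers are made up) of sigmoid gating: multiplying a candidate value by a gate output near 0 effectively forgets it, while a gate output near 1 keeps it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

candidate = np.array([2.0, -3.0, 0.5])
gate = sigmoid(np.array([-6.0, 6.0, 0.0]))  # ~0.0025, ~0.9975, 0.5
print(gate * candidate)                     # first value nearly erased, second kept
```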
LSTM

Forget Gate:
• This gate decides what information should be thrown away or kept.
• Information from the previous hidden state and from the current input is passed through the sigmoid function.
• Values come out between 0 and 1.
• Closer to 0 means forget; closer to 1 means keep.
Input Gate:
• The input gate has 2 layers:
• A "tanh layer" generates a vector of new candidate information that could be written to the cell state.
• A "sigmoid layer" decides which parts of that candidate information should actually be added to the cell state.
Cell State:
• Now we have enough information to calculate the new cell state.
• First, the previous cell state gets pointwise multiplied by the forget vector.
• This can drop values in the cell state if they are multiplied by values near 0.
• Then we take the output from the input gate and do a pointwise addition, which updates the cell state with the new values that the neural network finds relevant.
• That gives us the new cell state, as the sketch below shows.
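A hedged numpy sketch of one full LSTM cell step following the gate descriptions above (toy dimensions; the concatenation of h_prev and x_t and the weight initialisation are assumptions of the sketch).

```python
import numpy as np

rng = np.random.default_rng(3)
input_size, hidden_size = 4, 3
concat_size = hidden_size + input_size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix per gate / candidate (forget, input, candidate, output)
W_f, W_i, W_c, W_o = (rng.standard_normal((hidden_size, concat_size)) * 0.1 for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden_size)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])     # previous hidden state + current input
    f = sigmoid(W_f @ z + b_f)            # forget gate: what to drop from the old cell state
    i = sigmoid(W_i @ z + b_i)            # input gate: which candidate values to admit
    c_hat = np.tanh(W_c @ z + b_c)        # candidate new information
    c = f * c_prev + i * c_hat            # pointwise multiply (forget) + pointwise add (update)
    o = sigmoid(W_o @ z + b_o)            # output gate
    h = o * np.tanh(c)                    # part of the cell state exposed as the hidden state
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # a toy sequence of 5 inputs
    h, c = lstm_step(x_t, h, c)
print(h, c)
```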
To summarize the LSTM:
• The "forget gate", with a sigmoid function, decides which information from the previous steps is to be forgotten.
• The "input gate", with a tanh and a sigmoid function, decides what new information is added to the cell state.

• The cell state is updated using the outputs of the previous two gates (forget and input).
• The "output gate", with a sigmoid layer applied to the current input and previous hidden state and a tanh applied to the cell state, decides which part of the cell state is output to the hidden state.
• To review,
• the forget gate decides what is relevant to keep from prior steps.
• The input gate decides what information is relevant to add from the current step.
• The output gate determines what the next hidden state should be.
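As a quick usage note tying the summary together (PyTorch assumed), nn.LSTM exposes both memories discussed above: the hidden state (short-term memory) and the cell state (long-term memory).

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 10, 4)          # 2 sequences, 10 time steps, 4 features
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)               # hidden state at every step: (2, 10, 3)
print(h_n.shape, c_n.shape)        # final hidden and cell states: (1, 2, 3) each
```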
