Module 4: RNN, LSTM, and GRU
RNNs are largely being replaced by large language models (LLMs), which are
much more efficient at processing sequential data.
Recurrent Neural Networks (RNNs)
In the architecture diagram, the green blocks are called hidden states. The blue circles, defined by
the vector a within each block, are called hidden nodes or hidden units; the number of nodes is
set by the hyperparameter d.
Vector h — is the output of the hidden state after the activation function
has been applied to the hidden nodes. At time t, the architecture takes into
account what happened at t-1 by including the h from the previous hidden
state as well as the input x at time t. This allows the network to use
information from inputs that appear earlier in the sequence.
It’s important to note that the zeroth h vector will always start as a vector
of 0’s because the algorithm has no information preceding the first
element in the sequence.
Matrices Wx, Wy, Wh — are the weights of the RNN architecture which
are shared throughout the entire network.
The model weights of Wx at t=1 are the exact same as the weights
of Wx at t=2 and every other time step.
Vector xᵢ — is the input to each hidden state where i=1, 2,…, n for each
element in the input sequence.
Recall that text must be encoded into numerical values. For example,
every letter in the word “dogs” would be a one-hot encoded vector with
dimension (4x1).
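As a concrete illustration (a minimal sketch, not from the original material; the vocabulary ordering is an arbitrary assumption), this is how the letters of "dogs" could be one-hot encoded in Python with NumPy:

```python
import numpy as np

# Hypothetical character vocabulary for the word "dogs"; the ordering is arbitrary.
vocab = ["d", "o", "g", "s"]
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a (4x1) one-hot column vector for a character in the vocabulary."""
    vec = np.zeros((len(vocab), 1))
    vec[char_to_idx[ch]] = 1.0
    return vec

one_hot("d")  # [[1.], [0.], [0.], [0.]] -- the vector fed to the RNN at the first time step
```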
RNN Equations
Now that we know what all the variables are, here are all the equations
that we’re going to need in order to go through an RNN calculation:
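The equation slides are not reproduced here; the standard formulation, using the variables defined above, a tanh hidden activation, and optional bias terms b_h and b_y (these conventions are assumptions rather than taken from the slides), is:

h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)
\hat{y}_t = \mathrm{softmax}(W_y h_t + b_y)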
To make a prediction, we take the output from the current hidden state,
weight it by the weight matrix Wy, and apply a softmax activation.
Take the word "dogs," where we want to train an RNN to predict the
letter "s" given the letters "d"-"o"-"g". The architecture unrolls over three
time steps, one for each input letter.
We’ll use 3 hidden nodes in our RNN (d=3). The dimensions for each of our
variables are as follows:
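The dimensions slide is not shown; with a 4-character vocabulary and d = 3, a consistent set of shapes is: xt is (4x1), Wx is (3x4), Wh is (3x3), ht is (3x1), Wy is (4x3), and ŷt is (4x1). A minimal forward pass under those assumptions, sketched in Python/NumPy (the random weight initialization is purely illustrative):

```python
import numpy as np

np.random.seed(0)
d, vocab_size = 3, 4                       # 3 hidden nodes; vocabulary "d", "o", "g", "s"
Wx = np.random.randn(d, vocab_size) * 0.1  # (3x4) input-to-hidden weights, shared across time
Wh = np.random.randn(d, d) * 0.1           # (3x3) hidden-to-hidden weights
Wy = np.random.randn(vocab_size, d) * 0.1  # (4x3) hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["d", "o", "g", "s"]
inputs = [np.eye(vocab_size)[:, [vocab.index(ch)]] for ch in "dog"]  # three (4x1) one-hot vectors

h = np.zeros((d, 1))                       # h0 starts as a vector of zeros
for x in inputs:
    h = np.tanh(Wx @ x + Wh @ h)           # hidden state update at each time step
y_hat = softmax(Wy @ h)                    # (4x1) probabilities; the target letter is "s"
```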
Taking into account all time steps, the overall loss is:
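The loss itself is not shown on the slide; assuming the usual cross-entropy loss for this kind of next-character prediction, the per-step losses are summed over all T time steps:

\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}(\hat{y}_t, y_t) = -\sum_{t=1}^{T} \sum_{k} y_{t,k} \log \hat{y}_{t,k}

where y_t is the one-hot target at time step t.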
Given our loss function, we need to calculate the gradients for our three
weight matrices Wx, Wy, Wh, and update them with a learning rate η.
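In other words, each weight matrix is moved against its gradient, where the gradients themselves are computed with backpropagation through time:

W_x \leftarrow W_x - \eta \, \partial\mathcal{L}/\partial W_x, \quad
W_h \leftarrow W_h - \eta \, \partial\mathcal{L}/\partial W_h, \quad
W_y \leftarrow W_y - \eta \, \partial\mathcal{L}/\partial W_y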
A problem that RNNs face, which is also common in other deep neural
nets, is the vanishing gradient problem. Vanishing gradients make it
difficult for the model to learn long-term dependencies.
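A quick way to see why (a standard backpropagation-through-time argument, not taken from the slides): the gradient that reaches time step t - k contains the product \prod_{j=t-k+1}^{t} \partial h_j / \partial h_{j-1}, and each factor involves W_h and the derivative of tanh, which is at most 1. When these factors are smaller than 1, the product shrinks exponentially with k, so inputs far in the past contribute almost nothing to the weight updates.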
If, for example, we had a sentence like "the brown and black dog is a german
shepherd" and had to predict the last two words "german" and "shepherd," the RNN
would need to take into account the inputs "brown," "black," and "dog,"
which are the nouns and adjectives that describe a german shepherd.
However, the word "brown" is quite far from the word "shepherd."
So, during forward propagation, the word "brown" may have little or no effect
on the prediction of "shepherd," because the weights carrying its information
were never properly updated due to the vanishing gradient.
To sum up a typical RNN architecture: its weights are shared across all time steps and it can process sequences of arbitrary length, but computation is inherently sequential (and therefore slow), and the vanishing gradient problem makes long-term dependencies hard to learn.
Applications of RNNs:
RNN models are mostly used in the fields of NLP and speech recognition.
The different applications are summed up below:
One to Many
In this type of RNN, there is one input and many outputs
associated with it. One of the best-known examples of
this network is image captioning, where, given an image,
the network predicts a sentence made up of multiple words.
Many to One
In this type of network, many inputs are fed to the
network at several states, generating only one output.
This type of network is used in problems like sentiment
analysis, where the model predicts a customer's sentiment
(positive, negative, or neutral) from an input testimonial.
Many to Many
In this type of neural network, there are
multiple inputs and multiple outputs
corresponding to a problem. One example is
language translation: we provide multiple
words from one language as input and predict
multiple words in the second language as
output.
Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a type of RNN that can capture
long-term dependencies in sequential data.
LSTMs are able to process and analyze sequential data, such as time series,
text, and speech.
They use a memory cell and gates to control the flow of information,
allowing them to selectively retain or discard information as needed and
thus avoid the vanishing gradient problem that plagues traditional RNNs.
There are three types of gates in an LSTM: the input gate, the forget gate,
and the output gate.
• The input gate controls the flow of new information into the memory cell.
• The forget gate controls which information is discarded from the memory cell.
• The output gate controls the flow of information out of the LSTM cell and
into the output.
All three gates (input, forget, and output) are implemented using sigmoid
functions, which produce an output between 0 and 1. Their weights are trained
with backpropagation through time, along with the rest of the network.
Memory Cell (Ct): The core of the LSTM, responsible for retaining
information over time. It helps the model “remember” important details
over long sequences.
The input gate decides which information to store in the memory cell. It is
trained to open when the input is important and close when it is not.
The forget gate decides which information to discard from the memory cell.
It is trained to close (output near 0) for information that is no longer
important and to stay open (output near 1) for information that should be kept.
The output gate is responsible for deciding which information to use for the
output of the LSTM. It is trained to open when the information is important
and close when it is not.
The gates in an LSTM are trained to open and close based on the input and
the previous hidden state. This allows the LSTM to selectively retain or
discard information, making it more effective at capturing long-term
dependencies.
1. Forget Gate
Purpose: Decides what information to discard from the cell state.
How it works: A sigmoid layer generates values between 0 and 1, where
0 means "forget everything" and 1 means "keep everything."
2. Input Gate
Purpose: Decides which new information to store in the cell state; a sigmoid layer
selects the values to update and a tanh layer creates the candidate values.
3. Cell State Update
Purpose: Updates the cell state by combining the forget and input gates.
How it works: The old cell state is multiplied by the forget gate output (to forget
irrelevant information), and the result is added to the new candidate values (to
store new relevant information).
4. Output Gate
Purpose: Determines what information will be output from the current time step.
How it works: A sigmoid layer determines what parts of the cell state will be
output, and the cell state is passed through a tanh function to scale the values
between −1 and 1. The final output is a filtered version of the cell state.
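The equation slides are not reproduced; the standard LSTM equations corresponding to steps 1-4, written here with weight matrices W_f, W_i, W_c, W_o and biases b_f, b_i, b_c, b_o acting on the concatenation [h_{t-1}, x_t] (this notation is an assumption, not taken from the slides), are:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     (candidate values)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                        (hidden state / output)

where \sigma is the sigmoid function and \odot denotes element-wise multiplication.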
The gates are used to selectively forget or retain information from the
previous time steps, allowing the LSTM to maintain long-term dependencies
in the input data.
The memory cell that runs along the top of the unit carries information from one
time instance to the next in an efficient manner. As a result, an LSTM can remember
far more information from previous states than a plain RNN and overcomes the
vanishing gradient problem. Information can be added to or removed from the memory
cell with the help of the gates (sometimes described as valves).
The LSTM network is fed with the input data from the current time instance and the
output of the hidden layer from the previous time instance. These two inputs pass
through various activation functions and gates in the network before reaching the output.
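A minimal single-step LSTM cell in Python/NumPy, mirroring the gate equations above; the weight shapes, names, sizes, and random initialization are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM time step: returns the new hidden state h and memory cell c."""
    z = np.vstack([h_prev, x])          # concatenate previous hidden state and current input
    f = sigmoid(Wf @ z + bf)            # forget gate: what to keep from the old cell state
    i = sigmoid(Wi @ z + bi)            # input gate: what new information to store
    c_tilde = np.tanh(Wc @ z + bc)      # candidate values
    c = f * c_prev + i * c_tilde        # memory cell update
    o = sigmoid(Wo @ z + bo)            # output gate: what part of the cell to expose
    h = o * np.tanh(c)                  # new hidden state (the LSTM's output)
    return h, c

# Toy example: 4-dimensional one-hot inputs, 3 hidden units (arbitrary sizes).
np.random.seed(0)
n_in, n_h = 4, 3
Wf, Wi, Wc, Wo = (np.random.randn(n_h, n_h + n_in) * 0.1 for _ in range(4))
bf, bi, bc, bo = (np.zeros((n_h, 1)) for _ in range(4))
h, c = np.zeros((n_h, 1)), np.zeros((n_h, 1))
for t in range(3):                      # run a short input sequence through the cell
    x = np.eye(n_in)[:, [t]]
    h, c = lstm_step(x, h, c, Wf, Wi, Wc, Wo, bf, bi, bc, bo)
```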
Pros of LSTM:
Handles Long-Term Dependencies: LSTMs excel at capturing long-range
patterns in sequential data.
Mitigates Vanishing Gradient Problem: LSTMs largely avoid the vanishing gradient
issue common in traditional RNNs.
Selective Memory: LSTMs selectively keep or discard information using
forget, input, and output gates.
Effective for Sequential Data: Ideal for tasks like time series forecasting,
speech recognition etc.
Versatility: LSTMs are used for various sequence-based tasks such as
classification, regression, text generation.
Cons of LSTM:
High Computational Cost: LSTMs are resource-intensive and slower to train
due to their complex structure.
Memory Consumption: They consume more memory, especially when
handling long sequences or large datasets.
Difficulty in Parallelization: LSTMs process data sequentially, making
parallelization difficult and slowing training.
Overfitting with Small Data: LSTMs tend to overfit on small datasets
without proper regularization.
Architecture Complexity: LSTMs are more complex and harder to tune
compared to simpler recurrent models.
Gated Recurrent Unit (GRU)
Just like an LSTM, a GRU uses gates to control the flow of information. GRUs are
relatively new compared to LSTMs and offer some improvements, most notably a
simpler architecture.
The key difference in a GRU network is that, unlike an LSTM, it does not have a
separate cell state (Ct); it only has a hidden state (Ht).
At each timestamp t, it takes an input Xt and the hidden state Ht-1 from the
previous timestamp t-1, and it outputs a new hidden state Ht, which is passed on
to the next timestamp.
Now there are primarily two gates in a GRU as opposed to three gates in an
LSTM cell. The first gate is the Reset gate and the other one is the update
gate.
The Reset gate is responsible for the short-term memory of the network, i.e.
the hidden state (Ht). In its standard form (biases omitted), the equation of the Reset gate is:
rt = σ(Ur·Xt + Wr·Ht-1)
The value of rt will range from 0 to 1 because of the sigmoid function σ. Here
Ur and Wr are the weight matrices for the reset gate.
Similarly, we have an Update gate for long-term memory. Its equation has the same
form, with its own weight matrices (written here as Uu and Wu):
ut = σ(Uu·Xt + Wu·Ht-1)
How a GRU Works
Prepare the inputs: the GRU takes two vectors as input, the current input Xt and
the previous hidden state Ht-1.
Gate calculations: there are two gates in a GRU, the Reset gate and the Update gate.
Candidate Hidden State
The candidate state combines the current input with the reset-gated previous
hidden state (its weight matrices are written here as Ug and Wg, and ⊙ denotes
element-wise multiplication):
Ĥt = tanh(Ug·Xt + Wg·(rt ⊙ Ht-1))
The most important part of this equation is how we use the value of the reset
gate to control how much influence the previous hidden state has on the
candidate state.
Hidden State
Once we have the candidate state, it is used to generate the current hidden
state Ht. This is where the Update gate comes into the picture. Instead of using
a separate gate as in an LSTM, the GRU uses a single update gate to control both
the historical information Ht-1 and the new information coming from the
candidate state:
Ht = ut ⊙ Ht-1 + (1 − ut) ⊙ Ĥt
Now assume the value of ut is around 0. Then the first term in the equation will
vanish, which means the new hidden state will not carry much information from
the previous hidden state. At the same time, the weight on the second term
becomes almost one, which essentially means the hidden state at the current
timestamp will consist of information from the candidate state only.
Similarly, if the value of ut is around 1, the second term will become almost 0
and the current hidden state will depend entirely on the first term, i.e. the
information from the hidden state at the previous timestamp t-1.
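Putting the reset gate, update gate, candidate state, and hidden-state update together, here is a minimal single-step GRU in Python/NumPy; the weight names follow the Ur/Wr and Uu/Wu convention above, while Ug/Wg for the candidate state, the sizes, and the random initialization are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Ur, Wr, Uu, Wu, Ug, Wg):
    """One GRU time step: returns the new hidden state Ht."""
    r = sigmoid(Ur @ x + Wr @ h_prev)             # reset gate (short-term memory control)
    u = sigmoid(Uu @ x + Wu @ h_prev)             # update gate (long-term memory control)
    h_cand = np.tanh(Ug @ x + Wg @ (r * h_prev))  # candidate state; previous state scaled by r
    return u * h_prev + (1.0 - u) * h_cand        # blend old hidden state and candidate state

# Toy example: 4-dimensional inputs, 3 hidden units (arbitrary sizes).
np.random.seed(0)
n_in, n_h = 4, 3
Ur, Uu, Ug = (np.random.randn(n_h, n_in) * 0.1 for _ in range(3))
Wr, Wu, Wg = (np.random.randn(n_h, n_h) * 0.1 for _ in range(3))
h = np.zeros((n_h, 1))
for t in range(3):
    x = np.eye(n_in)[:, [t]]
    h = gru_step(x, h, Ur, Wr, Uu, Wu, Ug, Wg)
```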
Advantages of GRU
Simpler Architecture: GRUs use only two gates and no separate cell state, so they
have fewer parameters than LSTMs and are faster to train.
Handles Long-Term Dependencies: the update and reset gates still allow the network
to retain information over long sequences and mitigate the vanishing gradient problem.
Comparable Performance: on many sequence tasks GRUs perform on par with LSTMs
despite the simpler design.
Disadvantages of GRU
Less Powerful Gating Mechanism: While effective, GRUs have a simpler
gating mechanism compared to LSTMs which utilize three gates. This can
limit their ability to capture very complex relationships or long-term
dependencies in certain scenarios.
Potential for Overfitting: With a simpler architecture, GRUs might be more
susceptible to overfitting, especially on smaller datasets. Careful
hyperparameter tuning is crucial to avoid this issue.
Limited Interpretability: Understanding how a GRU arrives at its predictions can
be challenging due to the complexity of the gating mechanisms. This makes it
difficult to analyze or explain the network's decision-making process.