module-4-RNN-LSTM-GRU

This document covers Natural Language Processing (NLP) using deep learning techniques, focusing on Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs). It discusses the architecture, functioning, and applications of RNNs, including their advantages and challenges such as vanishing gradients. The document also highlights the importance of LSTMs and GRUs in overcoming traditional RNN limitations, particularly in handling long-term dependencies in sequential data.

Natural Language Processing (CSE3015)

Module 4

NLP Using Deep Learning

Dr. Ch. Balaram Murthy


Syllabus (6H)

Types of learning techniques, Chunking, Information extraction & Relation Extraction, Recurrent neural networks, LSTMs/GRUs, Transformers, Self-attention Mechanism, Sub-word tokenization, Positional encoding

Recurrent Neural Networks (RNNs)

RNN

An RNN is a deep learning model that is trained to process and convert a sequential data input into a specific sequential data output. Sequential data, such as words, sentences, or time-series data, consists of components that interrelate based on complex semantics and syntax rules.

An RNN is a software system consisting of many interconnected components that mimic how humans perform sequential data conversions, such as translating text from one language to another.

RNNs are increasingly being replaced by large language models (LLMs), which are much more efficient at sequential data processing.
Recurrent Neural Networks (RNNs)

RNN

The green blocks are called hidden states. The blue circles, defined by the vector a within each block, are called hidden nodes or hidden units, where the number of nodes is set by the hyperparameter d.
Recurrent Neural Networks (RNNs)

RNN

Vector h — is the output of the hidden state after the activation function has been applied to the hidden nodes. At time t, the architecture takes into account what happened at t-1 by including the h from the previous hidden state as well as the input x at time t. This allows the network to account for information from inputs that come earlier in the sequence.

It’s important to note that the zeroth h vector always starts as a vector of 0s, because the algorithm has no information preceding the first element in the sequence.
Recurrent Neural Networks (RNNs)

RNN

The hidden state at t=2 takes as input the output from t-1 and x at t.
Recurrent Neural Networks (RNNs)

RNN

Matrices Wx, Wy, Wh — are the weights of the RNN architecture, which are shared throughout the entire network.

The model weights of Wx at t=1 are the exact same as the weights
of Wx at t=2 and every other time step.
Recurrent Neural Networks (RNNs)

RNN

Vector xᵢ — is the input to each hidden state, where i = 1, 2, …, n for each element in the input sequence.

Recall that text must be encoded into numerical values. For example, every letter in the word “dogs” would be a one-hot encoded vector with dimension (4x1).

Similarly, x can also be a word embedding or another numerical representation.
Recurrent Neural Networks (RNNs)

RNN

One-Hot Encoding of the word “dogs”
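As a minimal sketch (assuming NumPy and the character ordering d, o, g, s), these one-hot vectors could be built as follows:

import numpy as np

chars = ["d", "o", "g", "s"]                  # vocabulary of k = 4 characters
index = {c: i for i, c in enumerate(chars)}   # d -> 0, o -> 1, g -> 2, s -> 3

def one_hot(char):
    # Return a (4, 1) one-hot column vector for a character in "dogs".
    vec = np.zeros((len(chars), 1))
    vec[index[char]] = 1.0
    return vec

x_d = one_hot("d")   # column vector [1, 0, 0, 0]^T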


Recurrent Neural Networks (RNNs)

RNN Equations

Now that we know what all the variables are, here are all the equations
that we’re going to need in order to go through an RNN calculation:
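In a common formulation (with bias terms omitted), these are:

a_t = W_h h_{t-1} + W_x x_t
h_t = \tanh(a_t)
\hat{y}_t = \mathrm{softmax}(W_y h_t)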
Recurrent Neural Networks (RNNs)

RNN

The hidden nodes are a combination of the previous state’s output weighted by the weight matrix Wh and the input x weighted by the weight matrix Wx.

The tanh function is the activation function, symbolized by the green block. The output of the hidden state is the activation function applied to the hidden nodes.

To make a prediction, we take the output from the current hidden state and weight it by the weight matrix Wy, followed by a softmax activation.
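As an illustration, one forward step of these equations could be sketched as follows (assuming NumPy and the one_hot helper above; the weights are random placeholders, not trained values):

import numpy as np

d, k = 3, 4                          # d hidden units, k-dimensional one-hot inputs
rng = np.random.default_rng(0)
Wx = rng.normal(size=(d, k))         # input-to-hidden weights
Wh = rng.normal(size=(d, d))         # hidden-to-hidden weights
Wy = rng.normal(size=(k, d))         # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev):
    # One RNN time step: combine the previous hidden state and the current input.
    a_t = Wh @ h_prev + Wx @ x_t     # weighted combination of h_{t-1} and x_t
    h_t = np.tanh(a_t)               # output of the hidden state
    y_t = softmax(Wy @ h_t)          # predicted distribution over the 4 characters
    return h_t, y_t

h = np.zeros((d, 1))                 # the zeroth hidden state is a vector of 0s
for x_t in [one_hot("d"), one_hot("o"), one_hot("g")]:
    h, y = rnn_step(x_t, h)          # after "g", y is the model's guess for the next letter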
Recurrent Neural Networks (RNNs)

Take the word “dogs,” where we want to train an RNN to predict the
letter “s” given the letters “d”-“o”-“g”. The architecture would look like
the following:

RNN architecture predicting the letter “s” in “dogs”
Recurrent Neural Networks (RNNs)

We’ll use 3 hidden nodes in our RNN (d=3). The dimensions for each of our
variables are as follows:

where k = 4, because our input x is a 4-dimensional one-hot vector for the letters in “dogs.”
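Concretely, with d = 3 hidden nodes and k = 4 input dimensions, a standard reconstruction of the shapes is:

x_t \in \mathbb{R}^{4 \times 1}, \quad W_x \in \mathbb{R}^{3 \times 4}, \quad W_h \in \mathbb{R}^{3 \times 3}, \quad h_t \in \mathbb{R}^{3 \times 1}, \quad W_y \in \mathbb{R}^{4 \times 3}, \quad \hat{y}_t \in \mathbb{R}^{4 \times 1}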
Recurrent Neural Networks (RNNs)

Backpropagation through time (BPTT)

Like their classical counterparts (MLPs), RNNs use the backpropagation methodology to learn from sequential training data.

Backpropagation with RNNs is a little more challenging due to the recursive nature of the weights and their effect on the loss, which spans over time.
Recurrent Neural Networks (RNNs)

The general workflow (a minimal sketch in code follows the list):

1. Initialize weight matrices Wx, Wy, Wh randomly
2. Forward propagation to compute predictions
3. Compute the loss
4. Backpropagation to compute gradients
5. Update weights based on gradients
6. Repeat steps 2–5
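A minimal sketch of this loop for the “dogs” example, assuming the PyTorch library (nn.RNN and nn.Linear stand in for the hand-written Wx, Wh, Wy; the hyperparameters are illustrative):

import torch
import torch.nn as nn

chars = ["d", "o", "g", "s"]
idx = {c: i for i, c in enumerate(chars)}

x = torch.eye(4)[[idx["d"], idx["o"], idx["g"]]]   # inputs d, o, g as one-hot rows (3, 4)
y = torch.tensor([idx["o"], idx["g"], idx["s"]])   # next-letter targets o, g, s

rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)   # step 1: weights initialized randomly
head = nn.Linear(3, 4)                                        # plays the role of Wy
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

for epoch in range(200):                 # step 6: repeat
    opt.zero_grad()
    h_seq, _ = rnn(x.unsqueeze(0))       # step 2: forward propagation through time
    logits = head(h_seq.squeeze(0))      #         predictions at every time step
    loss = loss_fn(logits, y)            # step 3: multi-class cross-entropy loss
    loss.backward()                      # step 4: backpropagation through time
    opt.step()                           # step 5: update weights from the gradients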
Recurrent Neural Networks (RNNs)

Because this example is a classification problem where we’re trying to predict four possible letters (“d-o-g-s”), it makes sense to use the multi-class cross-entropy loss function:
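In a standard form, with \hat{y}_t the softmax output and y_t the one-hot target at time step t:

L_t = -\sum_{c=1}^{k} y_{t,c} \log \hat{y}_{t,c}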

Taking into account all time steps, the overall loss is:
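A standard way to write this sum over all T time steps is:

L = \sum_{t=1}^{T} L_t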
Recurrent Neural Networks (RNNs)

Visually, this can be seen as the unrolled network computing a loss at each time step, with the per-step losses summed to give the overall loss.
Recurrent Neural Networks (RNNs)

Given our loss function, we need to calculate the gradients for our three
weight matrices Wx, Wy, Wh, and update them with a learning rate η.

Similar to normal backpropagation, the gradient gives us a sense of how the loss is changing with respect to each weight parameter.

We update the weights to minimize loss with the following equation:
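A standard form of this update, assuming gradient descent with learning rate η, is:

W_i \leftarrow W_i - \eta \frac{\partial L}{\partial W_i}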

where i = x, y, and h as a shorthand for the 3 weight matrices


Recurrent Neural Networks (RNNs)

One major problem: vanishing gradients

A problem that RNNs face, which is also common in other deep neural
nets, is the vanishing gradient problem. Vanishing gradients make it
difficult for the model to learn long-term dependencies.

For example, suppose an RNN was given a long sentence describing a brown and black dog and had to predict the last two words “german” and “shepherd.” The RNN would need to take into account the inputs “brown”, “black”, and “dog,” which are the nouns and adjectives that describe a german shepherd. However, the word “brown” is quite far from the word “shepherd.”
Recurrent Neural Networks (RNNs)

From the gradient calculation of Wx that we saw earlier, we can break down the backpropagation error of the word “shepherd” back to “brown” and see what it looks like:

The partial derivative of the state corresponding to the input “shepherd” with respect to the state corresponding to “brown” is actually a chain rule in itself, resulting in:
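In standard notation, writing h_t for the hidden state at “shepherd” and h_k for the state at “brown”, this chain of partial derivatives is:

\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}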
Recurrent Neural Networks (RNNs)

That’s a lot of chain rule! These chains of gradients are troublesome because, if the individual factors are less than 1, the gradient of the loss at the word “shepherd” with respect to the word “brown” can approach 0, thereby vanishing. This makes it difficult for the weights to take into account words that occur at the start of a long sequence.

So when doing a forward propagation, the word “brown” may not have any effect on the prediction of “shepherd,” because the weights weren’t updated due to the vanishing gradient.

This is one of the major disadvantages of RNNs.
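As a quick numerical illustration (the factors are illustrative only), repeatedly multiplying per-step gradient factors that are slightly smaller or larger than 1 shows why long chains vanish or explode:

import numpy as np

steps = 50
print(np.prod(np.full(steps, 0.9)))   # ~0.005: factors < 1, the gradient vanishes
print(np.prod(np.full(steps, 1.1)))   # ~117:   factors > 1, the gradient explodes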


Recurrent Neural Networks (RNNs)

However, there have been advancements in RNNs, such as gated recurrent units (GRUs) and long short-term memory (LSTM) networks, that are able to deal with the problem of vanishing gradients.

The pros and cons of a typical RNN architecture can be summed up as follows: RNNs can process inputs of any length and share their weights across time steps, but their computation is slow because it is sequential, and they have difficulty accessing information from many time steps back.
Recurrent Neural Networks (RNNs)

Applications of RNNs:

RNN models are mostly used in the fields of NLP and speech recognition.
The different applications are summed up below:

One to One
This type of RNN behaves the same as a simple neural network and is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
Recurrent Neural Networks (RNNs)

Applications of RNNs:

One to Many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples of this network is image captioning, where, given an image, we predict a sentence having multiple words.
Recurrent Neural Networks (RNNs)

Applications of RNNs:

Many to One
In this type of network, many inputs are fed to the network at several states, generating only one output. This type of network is used in problems like sentiment analysis, where the model predicts a customer’s sentiment (positive, negative, or neutral) from input testimonials.
Recurrent Neural Networks (RNNs)

Applications of RNNs:

Many to Many
In this type of neural network, there are multiple inputs and multiple outputs corresponding to a problem. One example of this problem is language translation, where we provide multiple words from one language as input and predict multiple words in the second language as output.
Recurrent Neural Networks (RNNs)

Commonly used activation functions: The most common activation functions used in RNN modules are described below:
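These are typically the sigmoid, tanh, and ReLU functions, which in standard form are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \mathrm{ReLU}(z) = \max(0, z)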
Recurrent Neural Networks (RNNs)

The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because it is difficult to capture long-term dependencies, since the multiplicative gradient can be exponentially decreasing or increasing with respect to the number of time steps (layers when the network is unrolled).

Exploding gradients happen when the gradient increases exponentially until the RNN becomes unstable. When gradients become extremely large, the RNN behaves erratically: weight updates diverge, resulting in a model that performs poorly on both training data and real-world data.
Recurrent Neural Networks (RNNs)

The vanishing gradient problem is a condition where the model’s gradient approaches zero during training. When the gradient vanishes, the RNN fails to learn effectively from the training data, resulting in underfitting. An underfit model can’t perform well in real-life applications because its weights weren’t adjusted appropriately. RNNs are at risk of vanishing and exploding gradient issues when they process long data sequences.

To overcome problems like vanishing and exploding gradients, several advanced versions of the RNN have been developed, some of which are:

• Bidirectional Neural Network (BiNN)
• Long Short-Term Memory (LSTM)
Recurrent Neural Networks (RNNs)
Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a type of RNN that can retain long-term dependencies in sequential data.

LSTMs are able to process and analyze sequential data, such as time series, text, and speech.

They use a memory cell and gates to control the flow of information, allowing them to selectively retain or discard information as needed and thus avoid the vanishing gradient problem that plagues traditional RNNs.

LSTMs are widely used in various applications such as natural language processing, speech recognition, and time series forecasting.
Long Short-Term Memory (LSTM)

There are three types of gates in an LSTM: the input gate, the forget gate,
and the output gate.

• The input gate controls the flow of information into the memory cell.
• The forget gate controls the flow of information out of the memory cell.
• The output gate controls the flow of information out of the LSTM and
into the output.

All three gates (input, forget, and output) are implemented using sigmoid functions, which produce an output between 0 and 1. These gates are trained using the backpropagation algorithm through the network.

Memory Cell (Ct): The core of the LSTM, responsible for retaining
information over time. It helps the model “remember” important details
over long sequences.
Long Short-Term Memory (LSTM)

The input gate decides which information to store in the memory cell. It is
trained to open when the input is important and close when it is not.

The forget gate decides which information to discard from the memory cell.
It is trained to open when the information is no longer important and close
when it is.
Long Short-Term Memory (LSTM)

The output gate is responsible for deciding which information to use for the
output of the LSTM. It is trained to open when the information is important
and close when it is not.

The gates in an LSTM are trained to open and close based on the input and
the previous hidden state. This allows the LSTM to selectively retain or
discard information, making it more effective at capturing long-term
dependencies.
Long Short-Term Memory (LSTM)

1. Forget Gate
Purpose: Decides what information to discard from the cell state. It generates values between 0 and 1, where 0 means “forget everything” and 1 means “keep everything.”
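In the usual notation (with [h_{t-1}, X_t] denoting the previous hidden state and current input taken together), the forget gate is commonly written as:

f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)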

where σ — is the sigmoid function, which converts values to between 0 and 1
Wf — weights associated with the hidden state and the current input
ht-1 — output from the previous time step, also called the hidden state, passed as input
Xt — input at the current time step
bf — bias value
Long Short-Term Memory (LSTM)

2. Input Gate

Purpose: Decides what new information to store in the cell state.

Two steps happen here:

• A sigmoid layer decides which parts of the new information to update.
• A tanh layer creates new candidate values to be added to the cell state.
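In the same notation, these two steps are commonly written as:

i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)
\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, X_t] + b_c)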

where Wi and Wc are weights and bi and bc are bias values; the other symbols are the same as in the forget gate.
Long Short-Term Memory (LSTM)

3. Cell State Update

Purpose: Updates the cell state by combining the forget and input gates.

How it works: The old cell state is multiplied by the forget gate output (to forget irrelevant information), and the candidate values, scaled by the input gate, are added to the result (to store new relevant information).
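In equation form (with \odot denoting element-wise multiplication), this is commonly written as:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t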
Long Short-Term Memory (LSTM)

4. Output Gate

Purpose: Determines what information will be output from the current time step.

How it works: A sigmoid layer determines what parts of the cell state will be
output, and the cell state is passed through a tanh function to scale the values
between −1 and 1. The final output is a filtered version of the cell state.
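In the same notation (with W_o and b_o the output gate's weights and bias), this is commonly written as:

o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)
h_t = o_t \odot \tanh(C_t)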
Long Short-Term Memory (LSTM)

The structure of an LSTM network consists of a series of LSTM cells, each of which has a set of gates (input, output, and forget gates) that control the flow of information into and out of the cell.

The gates are used to selectively forget or retain information from the previous time steps, allowing the LSTM to maintain long-term dependencies in the input data.
Long Short-Term Memory (LSTM)

It has a memory cell at the top which helps carry information from one time instance to the next in an efficient manner.

So, it is able to remember much more information from previous states than a plain RNN, and it overcomes the vanishing gradient problem. Information may be added to or removed from the memory cell with the help of valves (the gates).

The LSTM network is fed the input data from the current time instance and the hidden-layer output from the previous time instance. These two inputs pass through various activation functions and valves in the network before reaching the output.
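As a brief usage sketch, assuming the PyTorch library (the sizes are illustrative, not taken from the slides):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)   # one layer of LSTM cells

x = torch.randn(2, 5, 8)          # a batch of 2 sequences, 5 time steps, 8 features each
output, (h_n, c_n) = lstm(x)      # output: hidden state at every time step, shape (2, 5, 16)
                                  # h_n: final hidden state, c_n: final cell (memory) state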
Long Short-Term Memory (LSTM)

Pros of LSTM:
Handles Long-Term Dependencies: LSTMs excel at capturing long-range
patterns in sequential data.
Mitigates Vanishing Gradient Problem: LSTMs solve the vanishing gradient
issue common in traditional RNNs.
Selective Memory: LSTMs selectively keep or discard information using
forget, input, and output gates.
Effective for Sequential Data: Ideal for tasks like time series forecasting,
speech recognition etc.
Versatility: LSTMs are used for various sequence-based tasks such as
classification, regression, text generation.
Long Short-Term Memory (LSTM)

Cons of LSTM:
High Computational Cost: LSTMs are resource-intensive and slower to train
due to their complex structure.
Memory Consumption: They consume more memory, especially when
handling long sequences or large datasets.
Difficulty in Parallelization: LSTMs process data sequentially, making
parallelization difficult and slowing training.
Overfitting with Small Data: LSTMs tend to overfit on small datasets
without proper regularization.
Architecture Complexity: LSTMs are more complex and harder to tune
compared to simpler recurrent models.
Gated Recurrent Unit (GRU)

A GRU, or Gated Recurrent Unit, is an advancement of the standard RNN. GRUs are very similar to Long Short-Term Memory (LSTM) networks.

Just like the LSTM, the GRU uses gates to control the flow of information. GRUs are relatively new compared to LSTMs, and they offer some improvements over the LSTM while having a simpler architecture.
Gated Recurrent Unit (GRU)

A key point about the GRU network is that, unlike the LSTM, it does not have a separate cell state (Ct). It only has a hidden state (Ht).

Due to the simpler architecture, GRUs are faster to train.

Architecture of the Gated Recurrent Unit

Here we have a GRU cell, which is more or less similar to an LSTM cell or an RNN cell.
Gated Recurrent Unit (GRU)

At each timestamp t, it takes an input Xt and the hidden state Ht-1 from the previous timestamp t-1.

It then outputs a new hidden state Ht, which is again passed on to the next timestamp.

There are primarily two gates in a GRU, as opposed to three gates in an LSTM cell. The first gate is the reset gate and the other one is the update gate.
Gated Recurrent Unit (GRU)

Reset Gate (Short-Term Memory)

The reset gate is responsible for the short-term memory of the network, i.e., the hidden state (Ht). Here is the equation of the reset gate:
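In a common formulation (bias omitted), using the weight matrices named below:

r_t = \sigma(X_t U_r + H_{t-1} W_r)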

The value of rt will range from 0 to 1 because of the sigmoid function. Here
Ur and Wr are weight matrices for the reset gate.
Gated Recurrent Unit (GRU)

Update Gate (Long-Term Memory)

Similarly, we have an update gate for long-term memory, and the equation of the gate is shown below:
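In the same common formulation (bias omitted):

u_t = \sigma(X_t U_u + H_{t-1} W_u)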

The only difference is the weight matrices, i.e., Uu and Wu.


Gated Recurrent Unit (GRU)

How a GRU Works
Prepare the Inputs:
The GRU takes two vectors as input: the current input (Xt) and the previous hidden state (Ht-1).
Gate Calculations:
There are two gates in a GRU: the reset gate and the update gate.
Gated Recurrent Unit (GRU)

To do this, we multiply the current input and the previous hidden state by their respective weight matrices and add the results. This is done separately for each gate, essentially creating “parameterized” versions of the inputs specific to each gate.

Finally, we apply an activation function element-wise to each element of these parameterized vectors. This activation function (the sigmoid) outputs values between 0 and 1, which are used by the gates to control information flow.
Gated Recurrent Unit (GRU)

Candidate Hidden State:
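In a common formulation (U_g and W_g are assumed names for the candidate state's weight matrices; \odot is element-wise multiplication):

\hat{H}_t = \tanh(X_t U_g + (r_t \odot H_{t-1}) W_g)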

The most important part of this equation is how we use the value of the reset gate to control how much influence the previous hidden state has on the candidate state.

If the value of rt is equal to 1, the entire information from the previous hidden state Ht-1 is being considered. Likewise, if the value of rt is 0, the information from the previous hidden state is completely ignored.
Gated Recurrent Unit (GRU)

Hidden State
Once we have the candidate state, it is used to generate the current hidden state Ht. This is where the update gate comes into the picture.

Instead of using a separate gate as in the LSTM, the GRU uses a single update gate to control both the historical information, Ht-1, and the new information coming from the candidate state.
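In a common formulation, consistent with the discussion that follows:

H_t = u_t \odot H_{t-1} + (1 - u_t) \odot \hat{H}_t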
Gated Recurrent Unit (GRU)

Now assume the value of ut is around 0. Then the first term in the equation vanishes, which means the new hidden state will not carry much information from the previous hidden state.

On the other hand, the coefficient of the second term becomes almost one, which essentially means the hidden state at the current timestamp will consist of the information from the candidate state only.
Gated Recurrent Unit (GRU)

Similarly, if the value of ut is 1, the second term becomes entirely 0 and the current hidden state will depend entirely on the first term, i.e., the information from the hidden state at the previous timestamp t-1.
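Putting the reset gate, update gate, candidate state, and hidden state together, one GRU step could be sketched as follows (assuming NumPy; the weight shapes and random values are illustrative only):

import numpy as np

n_in, n_h = 4, 3                      # illustrative input and hidden sizes
rng = np.random.default_rng(0)
Ur, Wr = rng.normal(size=(n_in, n_h)), rng.normal(size=(n_h, n_h))   # reset gate weights
Uu, Wu = rng.normal(size=(n_in, n_h)), rng.normal(size=(n_h, n_h))   # update gate weights
Ug, Wg = rng.normal(size=(n_in, n_h)), rng.normal(size=(n_h, n_h))   # candidate state weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    # One GRU time step, mirroring the reset/update/candidate equations above.
    r_t = sigmoid(x_t @ Ur + h_prev @ Wr)             # reset gate (short-term memory)
    u_t = sigmoid(x_t @ Uu + h_prev @ Wu)             # update gate (long-term memory)
    h_cand = np.tanh(x_t @ Ug + (r_t * h_prev) @ Wg)  # candidate hidden state
    return u_t * h_prev + (1.0 - u_t) * h_cand        # new hidden state Ht

h = np.zeros(n_h)
h = gru_step(np.array([1.0, 0.0, 0.0, 0.0]), h)       # e.g. a one-hot input vector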
Gated Recurrent Unit (GRU)

Advantages of GRU

Faster Training and Efficiency: Compared to LSTMs, GRUs have a simpler architecture with fewer parameters. This makes them faster to train and computationally less expensive.
Effective for Sequential Tasks: GRUs excel at handling long-term
dependencies in sequential data like language or time series. Their gating
mechanisms allow them to selectively remember or forget information,
leading to better performance on tasks like machine translation or
forecasting.
Less Prone to Gradient Problems: The gating mechanisms in GRUs help
mitigate the vanishing/exploding gradient problems that plague standard
RNNs. This allows for more stable training and better learning in long
sequences.
Gated Recurrent Unit (GRU)

Disadvantages of GRU
Less Powerful Gating Mechanism: While effective, GRUs have a simpler gating mechanism compared to LSTMs, which utilize three gates. This can limit their ability to capture very complex relationships or long-term dependencies in certain scenarios.
Potential for Overfitting: With a simpler architecture, GRUs might be more susceptible to overfitting, especially on smaller datasets. Careful hyperparameter tuning is crucial to avoid this issue.
Limited Interpretability: Understanding how a GRU arrives at its predictions can be challenging due to the complexity of the gating mechanisms. This makes it difficult to analyze or explain the network’s decision-making process.
Gated Recurrent Unit (GRU)

GRUs have been successfully applied in various domains, such as language modeling, machine translation, and speech-to-text applications, where the balance between complexity and performance is crucial.
