
Recurrent Neural Networks

Unfolding Graphs – RNN Design Patterns: Acceptor – Encoder – Transducer; Gradient Computation – Sequence Modeling Conditioned on Contexts – Bidirectional RNN – Sequence to Sequence RNN – Deep Recurrent Networks – Recursive Neural Networks – Long-Term Dependencies; Leaky Units: Skip Connections and Dropouts; Gated Architecture: LSTM

Unfolding Graphs:

A computational graph is a way to formalize the structure of a set of computations, such as mapping inputs and parameters to outputs and a loss. We can unfold a recursive or recurrent computation into a computational graph that has a repetitive structure, corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

The process of unfolding a graph is illustrated as follows:
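As a rough illustration (a minimal Python sketch, assuming a generic transition function f and a short toy input sequence, neither of which is specified in the original figure), the recurrence ht = f(ht-1, xt) can be unfolded into a chain of repeated applications of the same function with the same shared parameters at every step:

    # Minimal sketch of unfolding a recurrence h_t = f(h_{t-1}, x_t).
    # 'f' and the inputs below are placeholders chosen for illustration.

    def f(h_prev, x, w=0.5, u=1.0):
        # One step of the recurrence; w and u play the role of shared parameters.
        return w * h_prev + u * x

    h0 = 0.0
    xs = [1.0, 2.0, 3.0]          # x1, x2, x3

    # Recurrent (folded) view: a loop that reuses the same f at every step.
    h = h0
    for x in xs:
        h = f(h, x)

    # Unfolded view: the same computation written out as an explicit chain.
    h3 = f(f(f(h0, xs[0]), xs[1]), xs[2])

    assert abs(h - h3) < 1e-12    # both views compute the same value
    print(h, h3)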


RNN Design Patterns:

Some examples of RNN design patterns are:

1. Recurrent networks that produce an output at each time step and have recurrent connections between the hidden units.
2. Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step.
3. Recurrent networks with recurrent connections between hidden units that read an entire sequence and then produce a single output.

Advantages and Disadvantages of Recurrent Neural Networks:

Acceptor – Encoder – Transducer:

These design patterns can be related to finite state machines, which represent a system as a set of states and the transitions between them. An acceptor reads an entire input sequence and produces a single output (for example, a classification); an encoder reads the sequence and summarizes it as a fixed-length vector; a transducer produces an output at every time step. The different sets of notations are given as follows:

What is a Recurrent Neural Network (RNN)?

 A recurrent neural network is a type of neural network in which the output from the previous step is fed as input to the current step.
 In traditional neural networks, all the inputs and outputs are independent of each other, but this is not a good fit if we want to predict the next word in a sentence.
 We need to remember the previous words in order to generate the next word in a sentence; hence traditional neural networks are not efficient for NLP applications.
 RNNs have a hidden state which is used to capture information about a sequence.
 RNNs have a 'memory', which is used to capture information about the calculations made so far.
 In theory, RNNs can use information from arbitrarily long sequences, but in practice they are limited to looking back only a few steps.
Diagrammatic Representation

Unfolding means writing out the network for the complete sequence; for example, if a sequence has 4 words then the network will be unfolded into a 4-layered neural network. We can think of st as the memory of the network, as it captures information about what happened in all the previous steps.
A traditional neural network uses different parameters at each layer, while an RNN shares the same parameters across all the layers; in the diagram we can see that the same parameters (U, V, W) are used across all the layers. Using the same parameters across all layers shows that we are performing the same task with different inputs, thus reducing the total number of parameters to learn.
The three sets of parameters (U, V, and W) are used to apply linear transformations over their respective inputs: parameter U transforms the input xt to the state st, parameter W transforms the previous state st-1 to the current state st, and parameter V maps the computed internal state st to the output Ot.

Formula to calculate the current state:
ht = f(ht-1,xt)
Here, ht is the current state, ht-1 is the previous state and xt is the current input

The equation after applying the activation function (tanh) is:

ht = tanh(Whh ht-1 + Wxh xt)
Here, Whh is the weight at the recurrent neuron and Wxh is the weight at the input neuron.

 After calculating the current state, we can then produce the output.
 The output can be calculated as:
Ot = Why ht
Here, Ot is the output, Why is the weight at the output layer, and ht is the current state (a small sketch of these updates follows).
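A minimal NumPy sketch of these update equations (the sizes and the random initialization are illustrative assumptions, not part of the original text):

    import numpy as np

    # Illustrative sizes (assumed for the sketch).
    input_size, hidden_size, output_size = 3, 4, 2
    rng = np.random.default_rng(0)

    Wxh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # U: input -> state
    Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # W: state -> state
    Why = rng.normal(scale=0.1, size=(output_size, hidden_size))  # V: state -> output

    def rnn_step(h_prev, x_t):
        # One time step: ht = tanh(Whh ht-1 + Wxh xt), Ot = Why ht
        h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)
        o_t = Why @ h_t
        return h_t, o_t

    # Run the same cell (same shared parameters) over a short sequence.
    h = np.zeros(hidden_size)
    for x in rng.normal(size=(5, input_size)):   # a toy sequence of 5 inputs
        h, o = rnn_step(h, x)
    print("final hidden state:", h)
    print("final output:", o)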

Backward Propagation in RNN

Backward phase: To train an RNN, we need a loss function. We will make use of the cross-entropy loss, which is often paired with softmax and can be calculated as:

L = -ln(pc)
Here, pc is the RNN's predicted probability for the correct class (positive or negative). For example, if a positive text is predicted to be 95% positive by the RNN, then the loss is:
L = -ln(0.95) = 0.051
After calculating the loss, we train the RNN using gradient descent to minimize it.
Steps for Back Propagation

 We first compute the cross-entropy error using the predicted and actual output. The network is unfolded for each time step.
 Then, for each time step in the unfolded network, the gradient is calculated with respect to the weights of each parameter.
 Because the weights are shared across all time steps, we can combine the gradients from all the time steps.
 Then we update the weights for both the recurrent neurons and the dense layers (a sketch of these steps follows this list).
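A sketch of these steps using PyTorch's autograd (PyTorch, the toy data, and the single-example cross-entropy are assumptions made for illustration; the original text does not prescribe a framework). The network is unrolled over the sequence, the cross-entropy loss is computed on the final output, and one call to backward() accumulates the gradients from all time steps into the shared weights:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    input_size, hidden_size, num_classes, seq_len = 3, 4, 2, 5

    cell = nn.RNNCell(input_size, hidden_size)      # shared across all time steps
    readout = nn.Linear(hidden_size, num_classes)   # the V / Why mapping
    loss_fn = nn.CrossEntropyLoss()                 # softmax + cross-entropy

    x = torch.randn(seq_len, 1, input_size)         # one toy sequence (batch of 1)
    target = torch.tensor([1])                      # toy label

    # Forward: unroll the same cell over every time step.
    h = torch.zeros(1, hidden_size)
    for t in range(seq_len):
        h = cell(x[t], h)
    logits = readout(h)

    loss = loss_fn(logits, target)                  # L = -ln(p_correct)
    loss.backward()                                 # BPTT: gradients flow back through all steps

    # The recurrent weight is shared, so its gradient sums contributions from every step.
    print("loss:", loss.item())
    print("grad norm of shared recurrent weight:", cell.weight_hh.grad.norm().item())

    # A plain gradient-descent update (an optimizer such as SGD would do this for us).
    with torch.no_grad():
        for p in list(cell.parameters()) + list(readout.parameters()):
            p -= 0.01 * p.grad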

Vanishing and Exploding Gradient Problem

Defining the problem

 During the training of a deep network, the gradients are propagated back in time all the way to the initial layer.
 Gradients that come from deeper layers go through multiple matrix multiplications according to the chain rule, and as they approach the earlier layers, if they have small values (<1) they shrink exponentially until they vanish.
 Vanishing gradients make model learning difficult.
 If they have large values (>1), they eventually blow up and crash the model; this is the exploding gradient problem (a small numeric sketch follows this list).
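A small NumPy sketch of this effect (the factors 0.9, 0.8, 1.1, and 1.2 are arbitrary values chosen for illustration): each backward step through time multiplies the gradient by roughly the same factor, so factors below 1 shrink it exponentially and factors above 1 blow it up.

    import numpy as np

    # Each backward step through time multiplies the gradient by (roughly) the
    # recurrent Jacobian. Model that factor with a single repeated matrix.
    steps = 50
    W_small = np.diag([0.9, 0.8])     # all factors < 1 -> gradients shrink
    W_large = np.diag([1.1, 1.2])     # all factors > 1 -> gradients grow

    g_small = np.ones(2)
    g_large = np.ones(2)
    for _ in range(steps):
        g_small = W_small @ g_small
        g_large = W_large @ g_large

    print("after 50 steps, small-weight gradient:", g_small)   # ~ [0.005, 0.00001]
    print("after 50 steps, large-weight gradient:", g_large)   # ~ [117, 9100]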

Types of RNN Architectures


The common architectures which are used for sequence learning are:
 One to one
 One to many
 Many to one
 Many to many

One to one: This model is similar to a single-layer neural network as it only provides linear predictions. It is mostly used with a fixed-size input 'x' and a fixed-size output 'y' (e.g., image classification).

One to many

 This consists of a single input 'x', activation 'a', and multiple outputs 'y'.
 Example: generating an audio stream. It takes a single audio input and generates new tones or new music based on it.
 In some cases, it propagates the output 'y' to the next RNN units.

Many to one

 This consists of multiple inputs 'x' (such as words or sentences), activations 'a', and produces a single output 'y' at the end.
 This type of architecture is mostly used to perform sentiment analysis, as it processes the entire input (a collection of words or sentences) to produce a single output (positive, negative, or neutral sentiment); see the sketch after this list.
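A minimal many-to-one sketch in PyTorch (the framework, the random token ids, and the three sentiment classes are assumptions made for illustration): the whole sequence is read, and only the final hidden state is used to produce one prediction.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class ManyToOneRNN(nn.Module):
        """Reads a whole sequence and emits a single sentiment prediction."""
        def __init__(self, vocab_size=100, embed_dim=16, hidden_size=32, num_classes=3):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
            self.classify = nn.Linear(hidden_size, num_classes)  # positive / negative / neutral

        def forward(self, token_ids):
            x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
            _, h_last = self.rnn(x)            # h_last: (1, batch, hidden_size)
            return self.classify(h_last[-1])   # one output per sequence

    model = ManyToOneRNN()
    tokens = torch.randint(0, 100, (2, 7))     # 2 toy "sentences" of 7 token ids each
    print(model(tokens).shape)                 # torch.Size([2, 3])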
Many to many

 In this, a single frame is taken as input for each RNN unit. A frame represents multiple inputs 'x' and activations 'a', which are propagated through the network to produce outputs 'y' that are the classification results for each frame.
 It is used mostly in video classification, where we try to classify each frame of the video.

Bi-directional RNNs

 In this neural network, two hidden layers running in opposite directions are connected to produce a single output.
 These layers allow the neural network to receive information from both past and future states.
 For example, given the word sequence 'I like programming', the forward layer will read the sequence as it is, while the backward layer will read the sequence in reverse order: 'programming like I'.
 The output at each time step is calculated by combining (e.g., concatenating) the forward and backward hidden states at that step; see the sketch after this list.
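A minimal sketch of this using PyTorch's built-in bidirectional option (PyTorch and the toy sizes are assumptions; the source describes the idea, not an implementation). The forward and backward hidden states are concatenated at every time step:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    input_size, hidden_size, seq_len, batch = 8, 16, 3, 1

    birnn = nn.RNN(input_size, hidden_size, batch_first=True, bidirectional=True)

    x = torch.randn(batch, seq_len, input_size)   # e.g. embeddings of "I like programming"
    output, h_n = birnn(x)

    # One forward layer and one backward layer, concatenated at every time step.
    print(output.shape)   # torch.Size([1, 3, 32]) -> hidden_size * 2 per step
    print(h_n.shape)      # torch.Size([2, 1, 16]) -> final states of both directions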
Note:

 RNNs remember each piece of information through time steps.
 The memory state, which stores information from all the previous states, is useful for tasks such as sentence generation and time-series prediction.
 RNNs can handle inputs and outputs of arbitrary length.
 RNNs share the same parameters across different time steps, which means fewer parameters to train and a lower computation cost.
 RNNs cannot process very long sequences well while making use of tanh or ReLU as the activation function.
 RNNs face the vanishing and exploding gradient problems.

RNN Applications
 Text summarization: Summarizing the text from any literature; for example, if a news website wants to display a brief summary of important news from each news article on the website, then text summarization will be helpful.
 Text recommendation: Text autofill or sentence generation in data-entry work by making use of RNNs can help in automating the process and make it less time consuming.
 Image recognition: RNNs can be combined with CNNs in order to recognize an image and give its description.
 Music generation: RNNs can be used to generate new music or tunes; by feeding a single tune as an input we can generate new notes or tunes of music.
Types of RNN Networks:

Feedforward networks map one input to one output, and while we’ve visualized recurrent
neural networks in this way in the above diagrams, they do not actually have this constraint.
Instead, their inputs and outputs can vary in length, and different types of RNNs are used for
different use cases, such as music generation, sentiment classification, and machine translation.

1. Vanishing Gradient Problem

Recurrent Neural Networks enable you to model time-dependent and sequential data problems,
such as stock market prediction, machine translation, and text generation. You will find, however, that RNNs are hard to train because of gradient problems.

RNNs suffer from the problem of vanishing gradients. The gradients carry information used in
the RNN, and when the gradient becomes too small, the parameter updates become
insignificant. This makes the learning of long data sequences difficult.
2. Exploding Gradient Problem

While training a neural network, if the slope tends to grow exponentially instead of decaying,
this is called an Exploding Gradient. This problem arises when large error gradients
accumulate, resulting in very large updates to the neural network model weights during the
training process.

Long training time, poor performance, and bad accuracy are the major issues in gradient
problems. Now, let’s discuss the most popular and efficient way to deal with gradient problems,
i.e., Long Short-Term Memory Network (LSTMs).

Consider trying to predict the last word in a text such as "I grew up in Spain… I speak fluent ___." The word you predict will depend on the previous few words in context. Here, you need the context of Spain to predict the last word in the text, and the most suitable answer for this sentence is "Spanish." The gap between the relevant information and the point where it's needed may have become very large. LSTMs help you solve this problem.

Common Activation Functions

Recurrent Neural Networks (RNNs) use activation functions just like other neural networks to
introduce non-linearity to their models. Here are some common activation functions used in
RNNs:

Sigmoid Function:

The sigmoid function is commonly used in RNNs. It has a range between 0 and 1, which makes
it useful for binary classification tasks. The formula for the sigmoid function is:

σ(x) = 1 / (1 + e^(-x))

Hyperbolic Tangent (Tanh) Function:

The tanh function is also commonly used in RNNs. It has a range between -1 and 1, which
makes it useful for non-linear classification tasks. The formula for the tanh function is:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Rectified Linear Unit (Relu) Function:

The ReLU function is a non-linear activation function that is widely used in deep neural
networks. It has a range between 0 and infinity, which makes it useful for models that require
positive outputs. The formula for the ReLU function is:

ReLU(x) = max(0, x)

Leaky Relu Function:

The Leaky ReLU function is similar to the ReLU function, but it introduces a small slope to
negative values, which helps to prevent "dead neurons" in the model. The formula for the Leaky
ReLU function is:

Leaky ReLU(x) = max(0.01x, x)

Softmax Function:

The softmax function is often used in the output layer of RNNs for multi-class classification
tasks. It converts the network output into a probability distribution over the possible classes.
The formula for the softmax function is:

softmax(xi) = e^(xi) / ∑j e^(xj)

These are just a few examples of the activation functions used in RNNs. The choice of
activation function depends on the specific task and the model's architecture
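The same functions written as a short NumPy sketch (the use of NumPy and the stability trick in softmax are assumptions made for illustration):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return np.tanh(x)                      # (e^x - e^-x) / (e^x + e^-x)

    def relu(x):
        return np.maximum(0.0, x)

    def leaky_relu(x, slope=0.01):
        return np.maximum(slope * x, x)        # small slope for negative values

    def softmax(x):
        e = np.exp(x - np.max(x))              # subtract max for numerical stability
        return e / e.sum()                     # a probability distribution over classes

    x = np.array([-2.0, 0.0, 3.0])
    print(sigmoid(x), relu(x), leaky_relu(x), softmax(x), sep="\n")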

Variant RNN Architectures

There are several variant RNN architectures that have been developed over the years to address
the limitations of the standard RNN architecture. Here are a few examples:
Long Short-Term Memory (LSTM) Networks

LSTM is a type of RNN that is designed to handle the vanishing gradient problem that can
occur in standard RNNs. It does this by introducing three gating mechanisms that control the
flow of information through the network: the input gate, the forget gate, and the output gate.
These gates allow the LSTM network to selectively remember or forget information from the
input sequence, which makes it more effective for long-term dependencies.
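A minimal usage sketch with PyTorch's built-in LSTM (the framework and the toy sizes are assumptions; the gate mechanics themselves are walked through step by step later in this unit):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

    x = torch.randn(1, 5, 8)                  # one toy sequence of 5 steps
    output, (h_n, c_n) = lstm(x)              # h_n: final hidden state, c_n: final cell state

    print(output.shape)   # torch.Size([1, 5, 16]) -> hidden state at every step
    print(h_n.shape)      # torch.Size([1, 1, 16])
    print(c_n.shape)      # torch.Size([1, 1, 16]) -> the gated cell state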

Gated Recurrent Unit (GRU) Networks

GRU is another type of RNN that is designed to address the vanishing gradient problem. It has
two gates: the reset gate and the update gate. The reset gate determines how much of the
previous state should be forgotten, while the update gate determines how much of the new state
should be remembered. This allows the GRU network to selectively update its internal state
based on the input sequence.
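A minimal NumPy sketch of a single GRU step (the sizes, the random initialization, and the omission of bias terms are simplifications; conventions differ slightly on how the update gate blends the old state and the candidate):

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 4

    def init(shape):
        return rng.normal(scale=0.1, size=shape)

    # Reset gate, update gate and candidate-state parameters.
    Wr, Ur = init((hidden_size, input_size)), init((hidden_size, hidden_size))
    Wz, Uz = init((hidden_size, input_size)), init((hidden_size, hidden_size))
    Wh, Uh = init((hidden_size, input_size)), init((hidden_size, hidden_size))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(h_prev, x_t):
        r = sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate: how much of the past to forget
        z = sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate: how much new state to keep
        h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))
        return (1.0 - z) * h_prev + z * h_cand       # blend old state and candidate

    h = np.zeros(hidden_size)
    for x in rng.normal(size=(5, input_size)):
        h = gru_step(h, x)
    print(h)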

Bidirectional RNNs:

Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be useful
for speech recognition and natural language processing tasks.

Encoder-Decoder RNNs:

Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder network
that generates the output sequence based on the encoder's representation. This architecture is
commonly used for sequence-to-sequence tasks such as machine translation.
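A minimal encoder-decoder sketch in PyTorch (the GRU layers, the toy vocabularies, the start-of-sequence id 0, and the greedy decoding loop are illustrative assumptions, not a specific published model):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    src_vocab, tgt_vocab, embed_dim, hidden = 50, 60, 16, 32

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(src_vocab, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        def forward(self, src):
            _, h = self.rnn(self.embed(src))
            return h                              # fixed-length summary of the source

    class Decoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(tgt_vocab, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, tgt_vocab)
        def forward(self, token, h):
            o, h = self.rnn(self.embed(token), h) # one target step at a time
            return self.out(o[:, -1]), h

    encoder, decoder = Encoder(), Decoder()
    src = torch.randint(0, src_vocab, (1, 7))     # toy source sentence
    h = encoder(src)

    token = torch.zeros(1, 1, dtype=torch.long)   # assumed start-of-sequence id 0
    for _ in range(5):                            # greedy decoding of 5 target tokens
        logits, h = decoder(token, h)
        token = logits.argmax(dim=-1, keepdim=True)
        print(token.item())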

Attention Mechanisms

Attention mechanisms are a technique that can be used to improve the performance of RNNs on tasks that involve long input sequences. They work by allowing the network to attend selectively to different parts of the input sequence rather than treating all parts equally. This helps the network focus on the most relevant parts of the input and ignore irrelevant information. These are just a few examples of the many variant RNN architectures that have been developed over the years. The choice of architecture depends on the specific task and the characteristics of the input and output sequences.
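A minimal sketch of the attention idea described above, using the simplest dot-product scoring over a set of encoder states (NumPy, the toy shapes, and the scoring function are assumptions; many attention variants exist):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    seq_len, hidden = 6, 8

    encoder_states = rng.normal(size=(seq_len, hidden))   # one vector per input position
    decoder_state = rng.normal(size=hidden)               # current decoder "query"

    scores = encoder_states @ decoder_state               # how relevant is each position?
    weights = softmax(scores)                             # attention distribution over inputs
    context = weights @ encoder_states                    # weighted summary of the input

    print(weights.round(3))   # sums to 1; larger weight = more attention
    print(context.shape)      # (8,)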

Long Short-Term Memory Networks

LSTMs are a special kind of RNN capable of learning long-term dependencies; remembering information for long periods is their default behavior.

All RNNs have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.

Fig: Long Short Term Memory Networks

LSTMs also have a chain-like structure, but the repeating module has a different structure: instead of a single neural network layer, there are four layers interacting in a special way.
Advantages and Shortcomings of RNNs

RNNs have various advantages, such as:

 Ability to handle sequence data


 Ability to handle inputs of varying lengths
 Ability to store or “memorize” historical information
The disadvantages are:
 The computation can be very slow.
 The network does not take future inputs into account when making decisions.
 The vanishing gradient problem, where the gradients used to compute the weight updates may get very close to zero, prevents the network from learning new weights. The deeper the network, the more pronounced this problem is.
Different RNN Architectures

There are different variations of RNNs that are being applied practically in machine learning
problems:

Bidirectional Recurrent Neural Networks (BRNN)

In BRNN, inputs from future time steps are used to improve the accuracy of the network. It is
like knowing the first and last words of a sentence to predict the middle words.

Gated Recurrent Units (GRU)

These networks are designed to handle the vanishing gradient problem. They have a reset and
update gate. These gates determine which information is to be retained for future predictions.
Long Short Term Memory (LSTM)

LSTMs were also designed to address the vanishing gradient problem in RNNs. LSTMs use
three gates called input, output, and forget gate. Similar to GRU, these gates determine which
information to retain.

Key Differences Between CNN and RNN

 CNN is applicable to sparse data like images. RNN is applicable to time-series and sequential data.

 While training the model, CNN uses simple backpropagation, whereas RNN uses backpropagation through time to calculate the loss gradients.

 RNN has no restriction on the length of its inputs and outputs, but CNN has fixed-size inputs and outputs.

 CNN is a feedforward network, while RNN works with loops to handle sequential data.

 CNN can also be used for video and image processing. RNN is primarily used for speech and text analysis.

Working of an RNN:

A recurrent neural network (RNN) is a type of artificial neural network (ANN) that is used in applications such as voice search. An RNN remembers past inputs thanks to an internal memory, which is useful for predicting stock prices, generating text, transcriptions, and machine translation. In a traditional neural network, the inputs and the outputs are independent of each other, whereas the output of an RNN depends on the prior elements within the sequence. Recurrent networks also share parameters across each layer of the network. In feedforward networks, there are different weights at each node, whereas an RNN shares the same weights within each layer of the network, and during gradient descent the weights and biases are adjusted to reduce the loss.
Fig: RNN

The image above is a simple representation of recurrent neural networks. If we are forecasting
stock prices using simple data [45,56,45,49,50,…], each input from X0 to Xt will contain a
past value. For example, X0 will have 45, X1 will have 56, and these values are used to predict
the next number in a sequence.

How Recurrent Neural Networks Work

In RNN, the information cycles through the loop, so the output is determined by the current
input and previously received inputs.

The input layer X processes the initial input and passes it to the middle layer A. The middle layer consists of multiple hidden layers, each with its own activation functions, weights, and biases. These parameters are shared across time steps, so instead of creating multiple hidden layers, the network creates one and loops over it.

Instead of using traditional backpropagation, recurrent neural networks use the backpropagation through time (BPTT) algorithm to determine the gradients. In backpropagation, the model adjusts its parameters by propagating errors from the output layer back to the input layer.
Recurrent Neural Networks
Traditional neural networks cannot carry information from earlier inputs forward to later ones, and that seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It is unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones. Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.

In the above diagram, a chunk of neural network, A, looks at some input xt and outputs a value ht. A loop allows information to be passed from one step of the network to the next. These loops make recurrent neural networks seem somewhat mysterious. However, if you think a bit more, it turns out that they are not all that different from a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:
An unrolled recurrent neural network.

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They are the natural neural network architecture to use for such data, and they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning. Essential to these successes is the use of LSTMs, a very special kind of recurrent neural network which works, for many tasks, much better than the standard version.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky," we don't need any further context; it's pretty obvious the next word is going to be "sky." In such cases, where the gap between the relevant information and the place that it's needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context. Consider trying to predict the last word in
the text “I grew up in France… I speak fluent French.” Recent information suggests that the next
word is probably the name of a language, but if we want to narrow down which language, we need
the context of France, from further back. It’s entirely possible for the gap between the relevant
information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human
could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice,
RNNs don’t seem to be able to learn them.

LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-
term dependency problem. Remembering information for long periods of time is practically
their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In
standard RNNs, this repeating module will have a very simple structure, such as a single tanh
layer.

The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain like structure, but the repeating module has a different structure.
Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some
minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated
by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural
net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each
component should be let through. A value of zero means “let nothing through,” while a value of
one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer." It looks at ht-1 and xt, and outputs a number between 0 and 1 for each number in the cell state Ct-1. A 1 represents "completely keep this," while a 0 represents "completely get rid of this."
Let’s go back to our example of a language model trying to predict the next word based on all the
previous ones. In such a problem, the cell state might include the gender of the present subject, so
that the correct pronouns can be used. When we see a new subject, we want to forget the gender
of the old subject.
The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C̃t, that could be added to the state. These two are then combined to create an update to the cell state.
For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether the
subject is singular or plural, so that we know what form a verb should be conjugated into if that’s
what follows next.
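The walk-through above, written as a single NumPy step (the sizes, the random initialization, and the omission of bias terms are assumptions made for the sketch; the cell-state update and the output gate that complete the step follow the standard LSTM formulation):

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 4
    concat = input_size + hidden_size

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # One weight matrix per gate, acting on [h_{t-1}, x_t].
    Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(hidden_size, concat)) for _ in range(4))

    def lstm_step(h_prev, c_prev, x_t):
        hx = np.concatenate([h_prev, x_t])
        f = sigmoid(Wf @ hx)              # forget gate: 1 = keep, 0 = erase old cell content
        i = sigmoid(Wi @ hx)              # input gate: which values to update
        c_cand = np.tanh(Wc @ hx)         # candidate values C~t
        c = f * c_prev + i * c_cand       # new cell state (the "conveyor belt")
        o = sigmoid(Wo @ hx)              # output gate
        h = o * np.tanh(c)                # new hidden state
        return h, c

    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    for x in rng.normal(size=(5, input_size)):
        h, c = lstm_step(h, c, x)
    print(h, c)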
Bidirectional recurrent neural networks

Bidirectional recurrent neural networks are a combination of two recurrent neural networks
that train in unison. One network trains from the start to the end of a sequence while the other
works in the opposite direction.

The bidirectional method that this type of recurrent neural network uses allows the model to learn from both past and future information in the sequence, not just from what has come before. This feature sets it apart from other types of recurrent neural networks. The dual nature of bidirectional recurrent neural networks is useful in circumstances where context is required.
Long short-term memory

Long short-term memory recurrent neural networks handle long time-series data. This means
that they can recall long-term time-series data collected prior.

This model has three different gates: the input gate, the output gate, and the forget gate. These
gates act as a form of control over features of the network, such as saving or removing memory.

The input gate decides which new information moves into the cell state. The output gate, on
the other hand, regulates which information is selected from the cell state. After that decision
is made, it chooses the next hidden state for the network. Finally, the forget gate removes any
information from the cell state that is deemed irrelevant or insignificant.

Through the cell state, the network automatically discards irrelevant information and retains relevant features. The vanishing gradient problem found in some networks can be mitigated by the use of long short-term memory networks.

Leaky Units:

A leaky unit keeps a running average of its past state through a linear self-connection with a weight close to one, so information can persist over many time steps.

Skip Connections in Recurrent Neural Networks:

Skip connections add direct connections from a state at time t to a state several time steps ahead, shortening the path that gradients must travel through time; a small sketch of a leaky unit follows.
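A minimal sketch of a leaky unit's running-average update (the leak factor and the toy signal are illustrative choices, not values from the original text):

    import numpy as np

    alpha = 0.95                 # leak close to 1 -> the state changes slowly and remembers far back
    h = 0.0
    inputs = np.sin(np.linspace(0, 3, 20))   # an arbitrary toy signal

    for x in inputs:
        # Leaky unit: a linear self-connection keeps most of the old state
        # and mixes in a little of the new information.
        h = alpha * h + (1.0 - alpha) * x

    print(h)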

The gated block diagram is given below.

Points on LSTM:
