JNTUK R20 UNIT-IV DEEP LEARNING TECHNIQUES
Convolutional Neural Networks: Neural Networks and Representation Learning, Convolutional Layers,
Multichannel Convolution Operation, Recurrent Neural Networks: Introduction to RNN, RNN Code, PyTorch
Tensors: Deep Learning with PyTorch, CNN in PyTorch
……………………………………………………………………………………………………………………………
Convolutional neural networks (CNNs) are a type of deep learning neural network that is specifically designed
for image processing and computer vision tasks. CNNs are inspired by the structure and function of the human
visual cortex, which is the part of the brain that is responsible for processing visual information.
CNNs have a number of advantages over other types of neural networks for image processing and computer vision
tasks:
• They are able to extract features from images that are invariant to translation and robust to small rotations and scalings. This means that the same features can be detected even if the object in the image appears in a slightly different position or orientation.
• They are able to extract features from images that are hierarchically organized. This means that they can start
by extracting simple features, such as edges and corners, and then use those features to extract more complex
features, such as faces and objects.
• They are computationally efficient. The convolution operation at the heart of CNNs uses small, shared filters, so a convolutional layer has far fewer parameters than a fully connected layer of the same size; the operation can also be accelerated using the fast Fourier transform (FFT).
CNNs have revolutionized the field of computer vision. They are now used in a wide range of applications,
including:
• Image classification: CNNs can be used to classify images into different categories, such as cats, dogs, and
cars.
• Object detection: CNNs can be used to detect objects in images, such as pedestrians, cars, and traffic signs.
• Facial recognition: CNNs can be used to recognize faces in images.
• Medical imaging: CNNs can be used to analyze medical images, such as X-rays and MRI scans, to
diagnose diseases and identify abnormalities.
• Natural language processing: CNNs can be used to extract features from text, which can then be used for tasks
such as sentiment analysis and machine translation.
CNNs are a powerful tool for a wide range of applications. They are still under active development, and new
applications for CNNs are being discovered all the time.
Here are some specific examples of how CNNs are being used today:
• Facebook uses CNNs to recognize faces in photos.
• Google uses CNNs to power its image search engine.
• Tesla uses CNNs to power its self-driving cars.
• Doctors use CNNs to analyze medical images and diagnose diseases.
• Researchers are using CNNs to develop new methods for machine translation and text analysis.
CNNs are a powerful and versatile tool that is transforming the way we interact with the world around us.
CNNs are a type of ANN that is specifically designed for image processing and computer vision tasks. They are made up of a series of convolutional layers, which extract features from images that are invariant to translation and robust to small rotations and scalings.
Here is a table that summarizes the key differences between ANNs and CNNs:

Characteristic  | ANN                               | CNN
----------------|-----------------------------------|--------------------------------------------------
Architecture    | Fully connected layers            | Convolutional layers
Applications    | General-purpose                   | Image processing, computer vision
Advantages      | Flexible and versatile            | Able to extract invariant features from images
Disadvantages   | Can be computationally expensive  | Requires a large amount of labeled training data
Which type of neural network to use depends on the specific task at hand. If you are working on a general-purpose task, such as classification or regression, then an ANN may be a good choice. If you are working on an image processing or computer vision task, then a CNN is likely to be a better choice.
• CNNs are best suited for tasks such as:
o Classifying images of objects
o Detecting objects in images
o Segmenting images
Convolution operations are performed by sliding a small filter over the image and computing the dot product of the filter
and the image pixels at each location. The filter is typically a small square or rectangular array of weights. The result of
the convolution operation is a new image that is smaller than the original image.
The new image contains the features that were extracted by the filter. For example, a filter might be designed to extract
edge features from an image. The output of the convolution operation with this filter would be an image that highlights
the edges in the original image.
CNNs typically have multiple convolutional layers, each of which uses a different filter to extract different
features from the image. The output of the convolutional layers is then fed into a fully connected neural
network, which performs classification or other tasks.
We compute the output (the re-estimated value of the current pixel) using the following formula:
output[i][j] = sum over a = 0..m-1 and b = 0..n-1 of filter[a][b] * input[i+a][j+b]
Here m refers to the number of rows of the filter (which is 2 in this case) and n refers to the number of columns (which is 2 in this case). Sliding the filter over the rest of the input, we compute the remaining output values in the same way.
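To make the sliding-window computation concrete, here is a minimal NumPy sketch of the single-channel convolution described above (stride 1 and no padding assumed; the names are illustrative):

import numpy as np

# Minimal sketch: 2D convolution with stride 1 and no padding
def convolve2d(image, kernel):
    m, n = kernel.shape                        # filter rows and columns
    h, w = image.shape
    out = np.zeros((h - m + 1, w - n + 1))     # the output is smaller than the input
    for i in range(h - m + 1):
        for j in range(w - n + 1):
            # dot product of the filter and the image patch at (i, j)
            out[i, j] = np.sum(image[i:i+m, j:j+n] * kernel)
    return out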
While each “set of features” detected by a particular set of weights is called a feature map, in the context of a convolutional layer the number of feature maps is referred to as the number of channels of the layer; this is why the operation involved with the layer is called the multichannel convolution. In addition, the f sets of weights Wi are called the convolutional filters.
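A hedged NumPy sketch of this multichannel convolution (array shapes are given in the comments; the names are illustrative):

import numpy as np

# f filters, each spanning all input channels, produce f output channels
def multichannel_conv(image, filters):
    # image: (c_in, H, W); filters: (f, c_in, k, k) -> output: (f, H-k+1, W-k+1)
    f, c_in, k, _ = filters.shape
    _, h, w = image.shape
    out = np.zeros((f, h - k + 1, w - k + 1))  # one feature map (channel) per filter
    for m in range(f):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                # the dot product runs over ALL input channels at once
                out[m, i, j] = np.sum(image[:, i:i+k, j:j+k] * filters[m])
    return out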
Padding
In a convolutional layer, we observe that the pixels located on the corners and the edges are used much less than those
in the middle.
A simple and powerful solution to this problem is padding, which adds rows and columns of zeros around the border of the input image. If we apply padding P to an input image of size WxH, the padded image has dimensions (W+2P)X(H+2P).
By using padding in a convolutional layer, we increase the contribution of pixels at the corners and the edges to the
learning procedure.
The Sobel filter puts a little bit more weight on the central pixels. Instead of using such hand-designed filters, we can also create our own and treat them as parameters which the model will learn using backpropagation.
To compute the output size of a convolutional layer, we have the following inputs: the input size W, the filter size F, the padding P, and the stride S. The output size (per spatial dimension) is then:
O = (W - F + 2P) / S + 1
Example:
Let’s suppose that we have an input image of size 125x49, a filter of size 5x5, padding P=2 and stride S=2. Then the output dimensions are the following:
W_out = (125 - 5 + 2*2) / 2 + 1 = 63
H_out = (49 - 5 + 2*2) / 2 + 1 = 25
so the output is 63x25.
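As a quick sanity check of this formula, the same sizes can be passed to PyTorch's Conv2d (a sketch; the random input is only a placeholder):

import torch

x = torch.randn(1, 1, 49, 125)                # batch, channels, H=49, W=125
conv = torch.nn.Conv2d(1, 1, kernel_size=5, stride=2, padding=2)
print(conv(x).shape)                          # torch.Size([1, 1, 25, 63])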
1. Parameter sharing means that one parameter may be shared by more than one input/connection. This reduces the total number of independent parameters; the parameters that are shared are the (non-zero) weights of the filter.
Q) Pooling Layer
The pooling operation involves sliding a two-dimensional filter over each channel of feature map and
summarising the features lying within the region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output obtained after a pooling layer are
((nh - f + 1) / s) x ((nw - f + 1) / s) x nc
where,
-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of filter
-> s - stride length
A common CNN model architecture is to have a number of convolution and pooling layers stacked one after the other.
Why use Pooling Layers?
• Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of
parameters to learn and the amount of computation performed in the network.
The pooling layer summarises the features present in a region of the feature map generated by a convolution layer. So,
further operations are performed on summarised features instead of precisely positioned features generated by the
convolution layer. This makes the model more robust to variations in the position of the
features in the input image.
Average Pooling
Average pooling computes the average of the elements present in the region of feature map covered by the filter.
Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling
gives the average of features present in a patch.
In convolutional neural networks (CNNs), the pooling layer is a common type of layer that is typically added after
convolutional layers. The pooling layer is used to reduce the spatial dimensions (i.e., the width and height) of
the feature maps, while preserving the depth (i.e., the number of channels).
1. The pooling layer works by dividing the input feature map into a set of non-overlapping regions, called pooling
regions. Each pooling region is then transformed into a single output value, which represents the presence of a
particular feature in that region. The most common types of pooling operations are max pooling and average
pooling.
2. In max pooling, the output value for each pooling region is simply the maximum value of the input values within
that region. This has the effect of preserving the most salient features in each pooling region, while
discarding less relevant information. Max pooling is often used in CNNs for object recognition tasks, as
it helps to identify the most distinctive features of an object, such as its edges and corners.
3. In average pooling, the output value for each pooling region is the average of the input values within that region.
This has the effect of preserving more information than max pooling, but may also dilute the most salient
features. Average pooling is often used in CNNs for tasks such as image segmentation and object detection,
where a more fine-grained representation of the input is required.
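The difference between the two operations is easy to see on a small example. The following sketch applies both to the same 4x4, single-channel feature map (the tensor values are arbitrary):

import torch
import torch.nn.functional as F

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])      # shape (1, 1, 4, 4): batch, channel, H, W
print(F.max_pool2d(x, kernel_size=2, stride=2))   # [[6, 8], [14, 16]]: most salient value per region
print(F.avg_pool2d(x, kernel_size=2, stride=2))   # [[3.5, 5.5], [11.5, 13.5]]: average per region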
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with each pooling layer reducing
the spatial dimensions of the feature maps, while the convolutional layers extract increasingly complex features
from the input. The resulting feature maps are then passed to a fully connected layer, which performs the final
classification or regression task.
Q) LeNet-5 for handwritten character recognition
LeNet-5 is a convolutional neural network (CNN) architecture that was first proposed in 1998 for handwritten digit
recognition. It is one of the earliest and most successful CNN architectures, and it has been used as a
benchmark for many other CNN models.
• Convolutional layer 1: This layer extracts features from the input image using a set of convolution
filters.
• Pooling layer 1: This layer reduces the dimensionality of the feature maps produced by the
convolutional layer by downsampling them.
• Convolutional layer 2: This layer extracts more complex features from the feature maps produced by the first
convolutional layer.
• Pooling layer 2: This layer further reduces the dimensionality of the feature maps produced by the
second convolutional layer.
• Fully connected layer: This layer takes the flattened feature maps produced by the second pooling layer and
produces a vector of outputs, one for each digit class.
• Output layer: This layer is a softmax layer that produces a probability distribution over the digit classes.
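For reference, the layer sizes in the original LeNet-5 (assuming the standard 32×32 grayscale input) flow as: input 32×32 → conv1: 6 feature maps of 28×28 → pool1: 6 maps of 14×14 → conv2: 16 maps of 10×10 → pool2: 16 maps of 5×5 → fully connected: 120 → 84 → output: 10 classes.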
LeNet-5 can be trained to recognize handwritten digits by feeding it a dataset of labeled handwritten digit images.
The network learns to extract features from the images that are discriminative for the different digit classes. Once the
network is trained, it can be used to predict the digit class of a new handwritten digit image.
LeNet-5 has been shown to achieve high accuracy on the MNIST handwritten digit dataset, with an accuracy of over
99%. It has also been used to successfully recognize handwritten characters in other datasets, such as
the USPS zip code dataset and the IAM handwritten text database.
To train a CNN such as LeNet-5 for handwritten character recognition, the typical workflow is:
1. Collect a dataset of labeled handwritten character images.
2. Preprocess the images. This may involve resizing the images, normalizing the pixel values, or converting the images to grayscale.
3. Design the CNN architecture. This involves choosing the number and type of layers, as well as the
hyperparameters for each layer.
4. Initialize the weights of the CNN. This is typically done by randomly initializing the weights.
5. Choose a loss function and optimizer. The loss function measures how well the network is performing on the
training data, and the optimizer updates the weights of the network to minimize the loss function.
6. Train the CNN. This involves feeding the training data to the network and updating the weights of the network
using the optimizer.
7. Evaluate the CNN. Once the CNN is trained, you should evaluate its performance on a held-out test dataset. This
will give you an idea of how well the network will generalize to new images.
Choose a loss function and optimizer: The loss function measures how well the network is performing on the training
data. Common loss functions for CNNs include cross-entropy loss and mean squared error loss. The optimizer updates
the weights of the network to minimize the loss function. Common optimizers for CNNs include Adam and stochastic
gradient descent (SGD).
Train the CNN: This involves feeding the training data to the network and updating the weights of the network using
the optimizer. The training process is typically repeated for a number of epochs, until the network converges
to a good solution.
AlexNet was the first CNN to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It has a
relatively simple architecture, consisting of five convolutional layers followed by three fully connected layers. AlexNet
was trained on a massive dataset of over 1.2 million images, and it achieved a top-5 error rate of 15.3% on the ILSVRC
test set.
ZF-Net was inspired by AlexNet, but it introduced several improvements, such as the use of smaller convolutional
filters and a deeper network architecture. ZF-Net achieved a top-5 error rate of 14.8% on the ILSVRC 2013 test set.
VGGNet is a family of CNNs that were developed by the University of Oxford. VGGNets are characterized by their use
of very small convolutional filters and a very deep network architecture. VGGNet-16 achieved a top-5 error rate of 13.6% on the ILSVRC 2014 test set.
GoogLeNet is a CNN that was developed by Google. It is characterized by its use of the Inception module, which allows
the network to learn multiple levels of abstraction in parallel. GoogLeNet achieved a top-5 error rate of 6.7% on the
ILSVRC 2014 test set, which was a significant improvement over the previous state-of-the-art.
ResNet is a CNN that was developed by Microsoft. It is characterized by its use of residual blocks, which allow the
network to learn deeper representations of the data without overfitting. ResNets have achieved state-of-the-art results
on a wide range of image classification tasks, including the ILSVRC and COCO benchmarks.
All of these CNNs have played a significant role in the development of deep learning and computer vision. They have
demonstrated the power of CNNs to learn complex representations of data and to solve challenging
problems such as image classification.
• A filter acts as a single template or pattern, which, when convolved across the input, finds similarities between
the stored template & different locations/regions in the input image.
• Let us consider an example of detecting a vertical edge in the input image.
• Each element of the 4×4 output matrix looks at exactly three columns & three rows of the input (the coloured boxes show the output of the filter as it moves over the input image). The values in the output matrix represent the change in the intensity along the horizontal direction w.r.t the columns in the input image.
• The output image has the value 0 in the 1st & last column. It means there is no change in intensity across the first three columns & the last three columns of the input image. On the other hand, the output has non-zero values in the middle columns, which is exactly where the intensity changes, i.e., where the vertical edge lies.
• The 2nd reason for choosing an odd-size filter such as a 3×3 or a 5×5 filter is that we get a central position, & at times it is nice to have a distinguisher. (The 1st reason is that an odd-size filter allows the padding to be distributed symmetrically around every pixel.)
• We can use multiple filters to detect various features simultaneously. Let us consider the following
example in which we see vertical edge & curve in the input RGB image. We will have to use two
different filters for this task, and the output image will thus have two feature maps.
• Edge detectors: These filters are designed to detect edges in images. They can be used to extract
features such as horizontal edges, vertical edges, and diagonal edges.
• Corner detectors: These filters are designed to detect corners in images. They can be used to extract
features such as right angles, acute angles, and obtuse angles.
• Texture detectors: These filters are designed to detect textures in images. They can be used to extract
features such as bumps, grooves, and patterns.
1. Introduction to RNN
1.1. Sequence Learning Problems
Sequence learning problems are different from other machine learning problems in two key ways:
• The inputs to the model are not of a fixed size.
• Successive inputs are not independent of one another: each element of the sequence depends on the ones that came before it.
Example:
Consider the task of auto completion. Given a sequence of characters, we want to predict the next character. For
example, given the sequence "d", we want to predict the next character, which is "e".
An RNN would solve this problem by maintaining a hidden state. The hidden state would be initialized with the
information from the first input character, "d". Then, at the next time step, the RNN would take the current input
character, "e", and the hidden state as input and produce a prediction for the next character. The hidden state would then
be updated with the new information.
This process would be repeated until the end of the sequence. At the end of the sequence, the RNN would output the
final prediction.
Advantages of RNNs for sequence learning problems:
• RNNs can handle inputs of any length.
• RNNs can learn long-term dependencies between the inputs in a sequence.
Disadvantages of RNNs:
• RNNs suffer from vanishing and exploding gradients, which make long-term dependencies hard to learn in practice (discussed under BPTT below).
• Training is inherently sequential in time and can therefore be slow.
Despite these drawbacks, RNNs are a powerful tool: they have been used to achieve state-of-the-art results in a wide variety of sequence learning problems, such as natural language processing, machine translation, text summarization, and speech recognition.
How to model sequence learning problems with RNNs:
To model a sequence learning problem with an RNN, we first need to define the function that the RNN will compute at
each time step. The function should take as input the current input and the hidden state from the previous time step, and
output the next hidden state and the prediction for the current time step.
Once we have defined the function, we can train the RNN using backpropagation through time (BPTT). BPTT is
a specialized training algorithm for RNNs that allows us to train the network even though it has recurrent
connections.
• The same function is executed at every time step: This is achieved by sharing the same network
parameters at every time step.
• The model can handle inputs of arbitrary length: This is because the RNN can keep updating its hidden
state based on the previous inputs, regardless of the length of the input sequence.
• The model can learn long-term dependencies between the inputs in a sequence: This is because the RNN's hidden state can capture information from previous inputs, even if they occurred many time steps earlier.
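A minimal NumPy sketch of one such time step (the weight names and shapes are assumptions; note that the same U, W, and V matrices are reused at every step):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# One unrolled time step of a simple RNN
def rnn_step(x_t, s_prev, U, W, V):
    s_t = np.tanh(U @ x_t + W @ s_prev)   # new hidden state from current input and previous state
    y_t = softmax(V @ s_t)                # prediction for the current time step
    return s_t, y_t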
1.3. Backpropagation through time (BPTT)
BPTT is a training algorithm for recurrent neural networks (RNNs). It is used to compute the gradients of the loss
function with respect to the RNN's parameters, which are then used to update the parameters using gradient descent.
To compute the gradients using BPTT, we need to first compute the explicit derivative of the loss function with
respect to the RNN's parameters. This is done by treating all of the other inputs to the RNN as constants.
However, RNNs also have implicit dependencies: the output of the RNN at a given time step depends on the outputs of the RNN at previous time steps. This makes it difficult to compute the gradients using the explicit derivative alone; BPTT therefore unrolls the network over time and sums the gradient contributions flowing through every earlier time step.
BPTT can be computationally expensive, but it is a powerful tool for training RNNs. It has been used to achieve
state-of-the-art results on a variety of sequence learning tasks, such as natural language processing, machine
translation, and speech recognition.
Example:
Consider the following RNN, which is used to predict the next character in a sequence:
s_t = W * s_{t-1} + x_t
y_t = softmax(V * s_t)
where:
• x_t is the input at time step t,
• s_t is the hidden state at time step t,
• W and V are weight matrices shared across all time steps, and
• y_t is the predicted probability distribution over the next character.
There are a number of techniques that can be used to address the problem of vanishing and exploding gradients,
such as:
• Truncated backpropagation: Truncated backpropagation only backpropagates the gradients through a fixed number of time steps. This keeps the computation manageable and limits how far the gradients can vanish or explode.
• Gradient clipping: Gradient clipping rescales the gradients so that their magnitude does not exceed a certain threshold. This helps to prevent the gradients from exploding (see the sketch after this list).
• Weight initialization: The way that the RNN's parameters are initialized can have a big impact on the problem
of vanishing and exploding gradients. It is important to initialize the parameters in a way that prevents the
gradients from becoming too small or too large.
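As an illustration of gradient clipping, here is a hedged PyTorch sketch (the model, input, and loss are placeholders standing in for a real training loop):

import torch

model = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(4, 20, 8)                 # 4 placeholder sequences of 20 time steps
out, _ = model(x)
loss = out.pow(2).mean()                  # dummy loss, for illustration only
loss.backward()                           # backpropagation through time
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the gradient norm
optimizer.step()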
Truncated backpropagation is a common technique used to address the problem of vanishing and exploding gradients
in recurrent neural networks (RNNs). However, it is not the only solution.
Another common solution is to use gated recurrent units (GRUs) or long short-term memory (LSTM) cells. These
units are specifically designed to deal with the problem of vanishing and exploding gradients.
GRUs and LSTMs work by using gates to control the flow of information through the RNN. This allows the
RNN to learn long-term dependencies in the data without the problem of vanishing gradients.
GRUs and LSTMs have been shown to be very effective for training RNNs on a variety of tasks, such as natural
language processing, machine translation, and speech recognition.
GRU Architecture
A GRU cell has two gates: a reset gate and an update gate.
• The reset gate controls how much of the previous cell state is forgotten.
• The update gate controls how much of the previous cell state is combined with the current input to form the
new cell state.
The GRU cell does not have a separate output gate. Instead, the output of the GRU cell is simply the updated cell state.
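The gating logic can be summarized in a short NumPy sketch of a single GRU step (the weight matrix names and shapes are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One step of a GRU cell
def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state; the reset gate forgets old state
    h_t = (1 - z) * h_prev + z * h_tilde             # combine previous state and candidate
    return h_t                                       # the updated state is also the output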
Comparison of LSTMs and GRUs
LSTMs and GRUs are very similar in terms of their performance on most tasks. However, there are a few key differences
between the two architectures:
• LSTMs have more gates and parameters than GRUs, which makes them more complex and
computationally expensive to train.
• GRUs are generally faster to train and deploy than LSTMs.
• GRUs are more robust to noise in the input data than LSTMs.
Which one to choose?
The best choice of architecture for a particular task depends on a number of factors, including the size and complexity
of the dataset, the available computing resources, and the specific requirements of the task.
In general, LSTMs are recommended for tasks where the input sequences are very long or complex, or where the task
requires a high degree of accuracy. GRUs are a good choice for tasks where the input sequences are
shorter or less complex, or where speed and efficiency are important considerations.
from tensorflow import keras

# Define the model: an LSTM layer, a dense layer, and a sigmoid output layer
model = keras.Sequential([
    keras.layers.LSTM(128),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
model.fit(x_train, y_train, epochs=10)
# Evaluate the model
model.evaluate(x_test, y_test)
# Make predictions
predictions = model.predict(x_test)
This code defines a simple RNN model with one LSTM layer, one dense layer, and one output layer. The LSTM layer
has 128 hidden units, and the dense layer has 64 hidden units. The output layer has a single unit, and it uses the sigmoid
activation function to produce a probability score.
The model is compiled using the binary cross-entropy loss function and the Adam optimizer. The model is
then trained on the training data for 10 epochs.
Once the model is trained, it can be evaluated on the test data to assess its performance. The model can also be used to
make predictions on new data.
Here is an example of how to use the model to make predictions:
# new_sample is a hypothetical, already-preprocessed input batch
prediction = model.predict(new_sample)
print(prediction)
This code will print the prediction for the new data sample, which is a probability score between 0 and 1. A
probability score closer to 1 means that the model is more confident in the prediction.
This is just a simple example of RNN code, and there are many other ways to implement RNNs in Python. For more
complex tasks, you may need to use a different RNN architecture or add additional layers to the
model.
3.1. Features
The major features of PyTorch are mentioned below −
Easy Interface − PyTorch offers an easy-to-use API; hence it is considered very simple to operate and runs on Python. Code execution in this framework is straightforward.
Python usage − This library is considered to be Pythonic, integrating smoothly with the Python data science stack. Thus, it can leverage all the services and functionalities offered by the Python environment.
Computational graphs − PyTorch provides an excellent platform offering dynamic computational graphs, so a user can change them during runtime. This is highly useful when a developer does not know in advance how much memory will be required for creating a neural network model.
PyTorch is known for having three levels of abstraction as given below −
• Tensor − Imperative n-dimensional array which runs on GPU.
• Variable − Node in computational graph. This stores data and gradient.
• Module − Neural network layer which will store state or learnable weights.
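A small sketch illustrating these three levels (in current PyTorch versions, Variable has been merged into Tensor, so a tensor created with requires_grad=True plays that role):

import torch

x = torch.ones(2, 3, requires_grad=True)  # Tensor that participates in the dynamic graph
y = (x * 2).sum()
y.backward()                              # gradients flow back through the recorded graph
print(x.grad)                             # gradient of y w.r.t. x: a 2x3 tensor of 2s

layer = torch.nn.Linear(3, 4)             # Module: a layer with learnable weights
print(layer(torch.randn(2, 3)).shape)     # torch.Size([2, 4])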
Loss functions are created using the classes provided by the torch.nn module or the corresponding functions in torch.nn.functional. For example, to create a mean squared error loss, you would use the torch.nn.MSELoss class or the torch.nn.functional.mse_loss function.
Once you have created the model, layers, optimizer, and loss function, you can train the model using the
following steps:
1. Forward pass: The input data is passed through the model to produce predictions.
2. Loss calculation: The loss function is used to calculate the error between the predictions and the ground truth
labels.
3. Backward pass: The gradients of the loss function with respect to the model's parameters are calculated.
4. Optimizer step: The optimizer uses the gradients to update the model's parameters.
This process is repeated for a number of epochs until the model converges and achieves the desired
performance.
Here is a simple example of a PyTorch model (a minimal sketch; the training tensors below are placeholders):
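import torch

# Minimal sketch: a linear model trained with Adam and a mean squared error loss
model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

x_train = torch.randn(100, 1)             # placeholder inputs
y_train = 3 * x_train + 1                 # placeholder targets
for epoch in range(100):
    predictions = model(x_train)          # 1. forward pass
    loss = loss_fn(predictions, y_train)  # 2. loss calculation
    optimizer.zero_grad()
    loss.backward()                       # 3. backward pass
    optimizer.step()                      # 4. optimizer step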
This code defines a simple linear model with one input layer and one output layer. The model is trained using the
Adam optimizer and the mean squared error loss function.
To implement a CNN in PyTorch, you can use the torch.nn.Conv2d layer. This layer performs a convolution operation
on the input data. The convolution operation is a mathematical operation that extracts features from the input
data.
CNNs also use pooling layers to reduce the spatial size of the input data. This helps to reduce the number of parameters
in the network and makes it more efficient to train.
Here is a simple example of a CNN in PyTorch:
import torch

class CNN(torch.nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Assumes a 1-channel (grayscale) 32x32 input, as in LeNet-style models
        self.conv1 = torch.nn.Conv2d(1, 6, 5)        # 6 filters of size 5x5
        self.pool1 = torch.nn.MaxPool2d(2, 2)        # 2x2 pooling with stride 2
        self.conv2 = torch.nn.Conv2d(6, 16, 5)       # 16 filters of size 5x5
        self.pool2 = torch.nn.MaxPool2d(2, 2)
        self.fc1 = torch.nn.Linear(16 * 5 * 5, 120)  # fully connected layers
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool1(torch.relu(self.conv1(x)))
        x = self.pool2(torch.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)                   # flatten the feature maps
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model = CNN()
# Train the model
...
This code defines a simple CNN with two convolutional layers, two pooling layers, and three fully connected layers.
The convolutional layers have 6 and 16 filters, respectively. The pooling layers have a kernel size of
2x2 and a stride of 2. The fully connected layers have 120, 84, and 10 units, respectively.
In PyTorch, the model is trained with an explicit training loop of the kind shown earlier (forward pass, loss calculation, backward pass, optimizer step); unlike Keras, PyTorch modules have no built-in model.fit() or model.predict() methods. Predictions are made by calling the trained model directly on new input tensors, e.g. outputs = model(new_images).
For more complex tasks, you may need to use a different CNN architecture or add additional layers to the model. You
can also use PyTorch to implement other types of neural networks, such as recurrent neural networks (RNNs)
and long short-term memory (LSTM) networks.