NARASARAOPETA INSTITUTE OF TECHNOLOGY
Department of Computer Science and Engineering
DEEP LEARNING (III – AI&ML) – II SEM
UNIT IV
Convolutional Neural Networks: Neural Networks and Representation Learning,
Convolutional Layers, Multichannel Convolution Operation, Recurrent Neural Networks:
Introduction to RNN, RNN Code, PyTorch Tensors: Deep Learning with PyTorch, CNN in
PyTorch.
Convolutional Neural Networks (CNNs):
➢ Convolutional Neural Networks (CNNs) are a specialized type of neural network designed
specifically for processing structured, grid-like data, such as images, time-series data, and
other spatial data in 2D or 3D.
➢ A convolutional neural network is composed of multiple building blocks, such as
convolutional layers, pooling layers, and fully connected layers.
Key Components of CNNs:
Convolutional Layers:
➢ The core building block of a CNN that performs a convolutional operation.
➢ Filters (or kernels) slide across the input image (or previous layer feature map) to produce
feature maps. This operation captures the spatial dependencies in the data.
➢ Each filter detects features at different locations, making CNNs translation invariant.
ReLU Layer:
After each convolution operation, a nonlinear layer (typically a ReLU, or Rectified Linear Unit)
is applied to introduce nonlinear properties into the network, helping it to learn more complex
patterns.
Pooling (Subsampling or Down-sampling) Layers:
➢ These layers reduce the dimensionality of each feature map but retain the most important
information.
➢ Max pooling and average pooling are the most common pooling functions.
Fully Connected Layers:
➢ After several convolutional and pooling layers, the high-level reasoning in the neural
network is done via fully connected layers.
➢ Neurons in a fully connected layer have connections to all activations in the previous
layer, as seen in regular neural networks.
Normalization Layers (optional):
➢ Layers such as Batch Normalization may be used to make the training of deep networks
more efficient and stable by normalizing the input layer by adjusting and scaling
activations.
Dropout (optional):
➢ A regularization technique where randomly selected neurons are ignored during training,
reducing the risk of overfitting.
Neural Networks and Representation Learning:
“Neural networks are computational models inspired by the human brain, consisting of layers of
interconnected nodes or neurons. Each neuron receives input, processes it, and passes its output
to the next layer”.
“Representation learning involves learning how to transform raw data into a form that is more
amenable to a specific task, such as classification or prediction. This transformation effectively
creates new features from raw data automatically, without human intervention”.
Neural Network Data Input:
Neural networks take in data, where each data point (or observation) is represented by a set of
features. For example:
• House Prices Dataset: each house is described by 13 numeric features.
• MNIST Dataset: each handwritten digit image is represented by 784 pixel values.
Scaling and Normalization:
Prior to training, data must be scaled or normalized to ensure the neural network models each
feature appropriately and converges more efficiently during training.
Importance of Hidden Layers:
➢ Enhanced Modeling Capability: Adding hidden layers allows the neural network to
learn complex patterns or nonlinear relationships that cannot be captured by a simple
linear model.
➢ Example with House Prices: The network learns nonlinear interactions between features
that relate to house prices.
Linear Combinations in Prediction:
In machine learning, effective prediction often requires finding useful linear combinations of
input features. For example, specific combinations of pixel values in the MNIST dataset can
indicate a particular digit.
Representation Learning:
➢ Automatic Feature Combination: Neural networks start with random weights and learn
to identify and refine combinations of input features that are most predictive of the
output.
➢ Process: Initially random combinations of features are refined through training, where
the network learns to emphasize helpful combinations and ignore unhelpful ones.
Neural networks start with input features, denoted as n features. These features are typically raw
data points (e.g., pixel values in images, or measurements in datasets like house prices). The
neural networks we have seen so far start with these n features and then learn some number of
“combinations” of these features to make predictions.
A Different Architecture for Image Data:
▪ For image data, the idea, at a high level, is to create combinations of features as before, but
an order of magnitude more of them, and to have each one be a combination of only the
pixels from a small rectangular patch in the input image.
▪ If we had f input features and wanted to compute n new features, we could simply multiply
the ndarray containing our input features by an f × n matrix. But what operation can we use
to compute many combinations of the pixels from local patches of the input image? The
answer is the convolution operation.
The Convolution Operation:
The convolution operation is a fundamental building block in the architecture of convolutional
neural networks (CNNs), primarily used in processing images.
➢ Convolution is a mathematical operation that involves a kernel (also known as a
filter) that is applied to an image to extract features such as edges, corners, and textures.
➢ The kernel is a small matrix of weights that slides over the image, and at each position, a
dot product is computed between the kernel and the portion of the image it covers.
Let's consider the example of a 5x5 input image I. Each element (pixel) in this image has a
specific intensity value, typically represented by a grayscale value ranging from 0 to 255, where
0 represents black and 255 represents white.
➢ And let’s say we want to calculate a new feature that is a function of the 3 × 3 patch of pixels
in the middle.
➢ We’ll define this new feature by specifying a 3 × 3 set of weights, W.
Then we’ll simply take the dot product of W with the relevant patch from I to get the value of the
feature in the output; since the section of the input image involved was centered at (3,3), we’ll
denote this value o33 (the o stands for “output”). Concretely, o33 is the sum of the element-wise
products of the nine weights in W with the corresponding nine pixels of the patch of I.
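As a sketch of this computation in NumPy (the image and weight values here are made up purely for illustration):

import numpy as np

I = np.arange(25, dtype=float).reshape(5, 5)   # an example 5 x 5 "image"
W = np.array([[0., 1., 0.],
              [1., -4., 1.],
              [0., 1., 0.]])                   # an example 3 x 3 set of weights

patch = I[1:4, 1:4]        # the 3 x 3 patch of I centered at (3, 3), 1-indexed
o33 = np.sum(W * patch)    # element-wise products, summed
print(o33)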
Convolutional Layers
▪ Convolutional layers are fundamental building blocks of convolutional neural networks
(CNNs), a class of deep learning models commonly used for image recognition, object
detection, and various other tasks in computer vision.
▪ The first layer of a Convolutional Neural Network is always a Convolutional Layer.
Convolutional layers apply a convolution operation to the input, passing the result to the next
layer.
▪ A convolution converts all the pixels in its receptive field into a single value. For example, if
you apply a convolution to an image, you decrease the image size while bringing all the
information in the receptive field together into a single pixel.
Implementation Implications:
The way two multichannel convolutional layers are connected tells us how to implement the
operation: just as we need h1 × h2 weights to connect a fully connected layer with h1 neurons to
one with h2 neurons, we need m1 × m2 convolutional filters to connect a convolutional layer
with m1 channels to one with m2 channels. The relevant shapes are listed below, with a sketch
following the lists:
1. The input will have shape:
• Batch size
• Input channels
• Image height
• Image width
2. The output will have shape:
• Batch size
• Output channels
• Image height
• Image width
3. The convolutional filters themselves will have shape:
• Input channels
• Output channels
• Filter height
• Filter width
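A quick way to check these shapes is with PyTorch's nn.Conv2d (a sketch; the sizes are arbitrary). Note that PyTorch stores the filters with the output channels first:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(32, 3, 28, 28)   # [batch_size, in_channels, height, width]
out = conv(x)
print(out.shape)                 # torch.Size([32, 16, 28, 28])
print(conv.weight.shape)         # torch.Size([16, 3, 3, 3])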
The Differences Between Convolutional and Fully Connected Layers:
▪ A fully connected layer refers to a neural network in which each input node is connected
to each output node.
▪ In a convolutional layer, by contrast, each output node is connected to only a small, local
subset of the input nodes (its receptive field).
▪ In fully connected layers, each neuron has its own set of weights, leading to a large
number of parameters.
▪ Convolutional layers exploit parameter sharing, meaning that the same set of weights
(filter) is shared across different spatial locations of the input data.
Making Predictions with Convolutional Layers: The Flatten Layer:
▪ The Flatten layer is a simple layer in CNN architectures that serves the purpose of
flattening the output feature maps obtained from the convolutional layers into a 1D
vector.
▪ The Flatten layer reshapes the input tensor into a 1D vector by simply concatenating all
the elements along the spatial dimensions.
• After the convolutional layers, the flatten layer is introduced. Its purpose is to convert the
multi-dimensional feature maps produced by the convolutional layers into a one-
dimensional vector.
• The flatten layer essentially takes each feature map and stretches it out into a single long
vector by unraveling the spatial dimensions.
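For example, a minimal sketch of flattening in PyTorch (the sizes are illustrative):

import torch

x = torch.randn(32, 8, 28, 28)   # [batch_size, channels, height, width]
flat = x.view(x.shape[0], -1)    # unravel the channel and spatial dimensions
print(flat.shape)                # torch.Size([32, 6272]), since 8 * 28 * 28 = 6272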
Pooling Layers:
Pooling layers perform a form of downsampling on the feature maps generated by the
convolutional layers. They reduce the spatial dimensions (width and height) of the feature maps
while retaining their depth (number of channels).
Pooling types- The two most common types of pooling layers are Max Pooling and Average
Pooling.
1. Max pooling is a type of pooling operation commonly used in convolutional neural networks
(CNNs) to downsample feature maps. It works by partitioning the input feature map into
non-overlapping rectangular regions and outputting the maximum value within each region.
It helps in controlling overfitting, improving computational efficiency, and capturing the most
relevant features for downstream tasks such as classification or object detection.
2. Average pooling is another type of pooling operation used in convolutional neural networks
(CNNs) to downsample feature maps. Unlike max pooling, which selects the maximum value
within each region, average pooling computes the average value of the elements within each
region.
• Average pooling is a simple yet effective technique used in CNN architectures for
downsampling feature maps, reducing spatial dimensions, and providing a smoothed
representation of the input data.
• Average pooling helps in reducing the spatial dimensions of the feature maps while
preserving the average intensity or activation level within each region.
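A minimal sketch comparing the two pooling types on a small input (the values are chosen for illustration):

import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])   # shape [1, 1, 4, 4]

print(nn.MaxPool2d(2)(x))   # [[[[ 6.,  8.], [14., 16.]]]]
print(nn.AvgPool2d(2)(x))   # [[[[ 3.5,  5.5], [11.5, 13.5]]]]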
Padding in Convolutional Layers:
Padding in convolutional layers is a technique used to control the spatial dimensions of the
feature maps produced by the convolution operation. It involves adding additional pixels
(padding) around the input data before applying the convolution operation. Padding can be
applied symmetrically or asymmetrically, depending on the desired effect.
Padding Types:
• Valid Padding: Also known as "no padding," in this case, no padding is added to the
input data. As a result, the spatial dimensions of the output feature maps are reduced
compared to the input size.
• Same Padding: In same padding, equal amounts of padding are added to all sides of the
input data. This ensures that the spatial dimensions of the output feature maps are the
same as those of the input.
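For a square input of size i, filter size k, padding p, and stride s (stride is introduced next), the output spatial size is o = (i + 2p − k) / s + 1, rounded down. A minimal sketch:

def conv_output_size(i: int, k: int, p: int = 0, s: int = 1) -> int:
    # o = floor((i + 2p - k) / s) + 1
    return (i + 2 * p - k) // s + 1

print(conv_output_size(28, 3, p=0))  # valid padding: 26
print(conv_output_size(28, 3, p=1))  # same padding (stride 1): 28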
Stride:
The stride in convolutional neural networks (CNNs) refers to the step size or the amount by
which the convolutional filter moves across the input data during the convolution operation. It
directly influences the spatial dimensions of the output feature maps.
• The stride determines how much the filter moves horizontally and vertically after each
convolution operation.
• The stride affects the spatial dimensions (width and height) of the output feature maps
produced by the convolutional operation.
• The stride size is typically set to 1 or greater. A stride of 1 means the filter moves one pixel
at a time, resulting in output feature maps with spatial dimensions similar to those of the
input.
• A stride greater than 1 results in the filter skipping over pixels, causing a greater reduction in
the spatial dimensions of the output feature maps.
• It plays a crucial role in controlling model complexity, computational efficiency, and the
amount of spatial information retained in the feature maps.
If the stride is set to 1, the filter moves across one pixel at a time; if the stride is 2, the filter
moves two pixels at a time. The larger the stride, the smaller the resulting output, and vice versa.
ReLU Layer (Rectified Linear Unit)
ReLU is computed after convolution. It is the most commonly deployed activation function, and
it allows the neural network to account for non-linear relationships. Given a matrix (x), ReLU
sets all negative values to zero, while all other values remain unchanged.
It is mathematically represented as :
y = max(0, x)
For Example : In a given matrix (M),
M = [ [ -3, 19, 5 ], [ 7, -6, 12 ], [ 4, -8, 17 ] ]
ReLU converts it as
[ [ 0, 19, 5 ], [ 7, 0, 12 ], [ 4, 0, 17 ] ]
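The same conversion as a short NumPy sketch:

import numpy as np

M = np.array([[-3, 19, 5], [7, -6, 12], [4, -8, 17]])
print(np.maximum(0, M))   # [[ 0 19  5] [ 7  0 12] [ 4  0 17]]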
Soft-Max Layer:
• The output from the last fully connected layer is directed to the softmax layer, which
converts it into probabilities.
• Softmax assigns decimal probabilities to each class in a multi-class problem, and these
probabilities sum to 1.0.
• This allows the output to be interpreted directly as a probability.
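A minimal sketch of the softmax computation (the logits here are made up):

import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exps = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exps / np.sum(exps)               # probabilities that sum to 1.0

print(softmax(np.array([2.0, 1.0, 0.1])))    # approximately [0.659 0.242 0.099]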
The Multichannel Convolution Operation:
➢ The multichannel convolution operation extends the basic convolution operation used in
CNNs to handle inputs that have multiple channels.
➢ This is particularly relevant for processing color images or any other type of data where
each piece of information is represented by several simultaneous measurements.
➢ For instance, standard color images are typically represented in RGB format, consisting
of three channels: Red, Green, and Blue.
➢ When performing convolution on multichannel input data, each channel is convolved
independently with its corresponding set of filters (kernels).
➢ The results of these convolutions are then combined (usually by summation) across
channels to produce the output feature map.
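A sketch of a single output position, assuming a 3-channel input with one 3 × 3 kernel per input channel (the values are random, for illustration only):

import numpy as np

in_channels, k = 3, 3
patch = np.random.randn(in_channels, k, k)     # one k x k patch per input channel
kernels = np.random.randn(in_channels, k, k)   # one k x k kernel per input channel

# convolve each channel independently, then sum across channels
out_value = sum(np.sum(patch[c] * kernels[c]) for c in range(in_channels))
print(out_value)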
Implementing the Multichannel Convolution Operation:
Implementing a multichannel convolution operation involves applying convolutional
filters to multiple channels (or feature maps) of the input data. Each filter is convolved with its
corresponding input channel, and the results are combined to produce the output feature map.
Understanding Convolution:
Before diving into multichannel convolution, ensure a solid understanding of the basic
convolution operation. Convolution involves sliding a filter/kernel over an input image or signal,
multiplying the filter values with the overlapping input values, and summing them up to produce
an output feature map.
The Forward Pass:
The forward pass in a convolutional neural network (CNN) involves passing input data through
the network to compute the output.
1. Input Data:
The forward pass begins with the input data, which could be a single sample or a batch of
samples.
• Each sample is represented as a vector or a multi-dimensional array, depending on the
nature of the data and the network architecture.
• For a concrete example, let's suppose our input is one-dimensional, of length 5, and that our
convolutional filter is also one-dimensional, with the size of the “patterns” we want to detect
being length 3.
• The first element of the output is created by convolving the filter with the first patch of the
input; the second element is created by sliding the filter one unit to the right and convolving
it with the next set of values in the series, and so on.
• However, when we compute the output values near the edge of the input, the filter runs off
the end of the series; handling this is where padding, described next, comes in.
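A minimal NumPy sketch of this one-dimensional convolution, using zero padding so the output has the same length as the input (the input and filter values are made up):

import numpy as np

def conv1d(inp: np.ndarray, filt: np.ndarray) -> np.ndarray:
    half = len(filt) // 2
    padded = np.concatenate([np.zeros(half), inp, np.zeros(half)])
    # slide the filter across the input, taking a dot product at each position
    return np.array([np.dot(padded[i:i + len(filt)], filt)
                     for i in range(len(inp))])

inp = np.array([1., 2., 3., 4., 5.])   # input of length 5
filt = np.array([1., 1., 1.])          # "pattern" of length 3
print(conv1d(inp, filt))               # [ 3.  6.  9. 12.  9.]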
2. Layer-wise Computations:
The input data is passed through each layer of the neural network sequentially, starting from the
input layer and proceeding through hidden layers until reaching the output layer.
At each layer, the input is transformed using a series of operations specific to the type of layer.
These operations typically include:
Linear transformation: Multiplying the input by a weight matrix and adding a bias vector.
Activation function: Applying a non-linear function element-wise to introduce non-linearity
into the network.
Pooling or normalization (optional): Downsampling or normalizing the activations to control
overfitting or improve efficiency.
Padding:
• Decide whether padding is necessary and apply padding to the input data if needed.
Padding ensures that the spatial dimensions of the output feature map match the input
size.
• We “pad” the input with zeros around the edges, enough so that the output remains the
same size as the input.
Stride:
• In convolutional operations, after applying the filter/kernel to a position in the input data,
the filter moves to the next position based on the stride value.
• A stride of 1 means the filter moves one position at a time.
• A stride of 2 means the filter moves two positions at a time, and so on.
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Filter: [0.5, 1, 0.5]
Stride: 2
[1, 2, 3] * [0.5, 1, 0.5] = (1*0.5) + (2*1) + (3*0.5) = 0.5 + 2 + 1.5 = 4
[3, 4, 5] * [0.5, 1, 0.5] = (3*0.5) + (4*1) + (5*0.5) = 1.5 + 4 + 2.5 = 8
[5, 6, 7] * [0.5, 1, 0.5] = (5*0.5) + (6*1) + (7*0.5) = 2.5 + 6 + 3.5 = 12
[7, 8, 9] * [0.5, 1, 0.5] = (7*0.5) + (8*1) + (9*0.5) = 3.5 + 8 + 4.5 = 16
[9, 10, 0] * [0.5, 1, 0.5] = (9*0.5) + (10*1) + (0*0.5) = 4.5 + 10 + 0 = 14.5 (last window zero-padded)
Output: [4, 8, 12, 16, 14.5]
3. Output Prediction:
After passing through all the layers, the final layer produces the output predictions or activations,
depending on the task the network is designed for.
For classification tasks, the output may be a probability distribution over different classes (e.g.,
using softmax activation).
For regression tasks, the output may be a continuous value representing the predicted target.
4. Loss Calculation:
Once the predictions are obtained, they are compared to the ground truth labels or targets using a
loss function.
The loss function quantifies the difference between the predicted and actual values, providing a
measure of the network's performance.
5. Evaluation:
Finally, the performance of the network is evaluated based on metrics such as accuracy,
precision, recall, or mean squared error, depending on the task.
The Backward Pass:
The backward pass, also known as backpropagation, is a crucial step in training neural networks.
It involves computing the gradients of the loss function with respect to the network parameters,
which allows for updating the parameters through optimization algorithms such as gradient
descent.
1. Loss Gradient Calculation:
The backward pass begins with the computation of the gradient of the loss function with respect
to the network's output. This gradient represents how much the loss would change with a small
change in each output value.
The choice of loss function depends on the task the network is designed for. Common loss
functions include mean squared error for regression and categorical cross-entropy for
classification.
2. Backpropagation through Layers:
Starting from the output layer and moving backward through the network, the gradients are
propagated layer by layer using the chain rule of calculus.
At each layer, the gradient of the loss with respect to the layer's output is multiplied element-
wise with the derivative of the activation function applied to the layer's input.
This step computes the gradient of the loss with respect to the layer's input, which represents
how much each input value contributed to the loss.
Computing the gradient of a 1D convolution:
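A minimal sketch of this gradient for the zero-padded 1D convolution above: each filter weight's gradient is the sum, over output positions, of the output gradient times the input element that weight touched:

import numpy as np

def conv1d_filter_grad(inp, filt, out_grad):
    half = len(filt) // 2
    padded = np.concatenate([np.zeros(half), inp, np.zeros(half)])
    grad = np.zeros_like(filt)
    for i in range(len(inp)):          # each output position
        for j in range(len(filt)):     # each filter weight
            # output[i] depends on padded[i + j] through filt[j]
            grad[j] += out_grad[i] * padded[i + j]
    return grad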
3. Parameter Gradients:
Once the gradients with respect to the layer's input are computed, the gradients with respect to
the layer's parameters (weights and biases) can be obtained using these input gradients.
This involves multiplying the input gradients with the input values (for weights) or with 1 (for
biases) and aggregating them over all samples in the batch.
Chain Rule:
The chain rule states that the derivative of a composite function can be computed by multiplying
the derivatives of its individual components.
In the context of neural networks, we apply the chain rule iteratively backward through the
layers, starting from the output layer and moving towards the input layer.
4. Parameter Updates:
With the gradients of the loss function with respect to the network parameters computed, the
parameters can be updated using optimization algorithms such as stochastic gradient descent
(SGD), Adam, or RMSprop.
The parameters are updated in the direction opposite to the gradient, scaled by a learning rate
parameter that controls the size of the update steps.
5. Iterative Process:
The backward pass is typically performed iteratively over multiple batches of training data, with
parameter updates applied after each batch (batch gradient descent) or after each individual
sample (online gradient descent).
This process continues until the loss converges to a minimum or until a predefined number of
iterations (epochs) is reached.
6. Regularization and Optimization:
During the backward pass, additional techniques such as weight regularization (e.g., L1 or L2
regularization) or dropout may be applied to prevent overfitting and improve generalization
performance.
For example, the layer-by-layer gradient computation described above can be sketched in NumPy
as follows (activations holds each layer's output, and delta starts as the gradient of the loss with
respect to the final output):

import numpy as np

def sigmoid_derivative(a):
    return a * (1 - a)   # sigmoid derivative, written in terms of the activations

def backward(input_data, activations, weights, delta):
    gradients_weights, gradients_biases = [], []
    for i in range(len(activations) - 2, -1, -1):
        delta = np.dot(weights[i+1].T, delta) * sigmoid_derivative(activations[i])
        gradient_weights = np.dot(delta, activations[i-1].T) if i > 0 else np.dot(delta, input_data.T)
        gradients_weights.insert(0, gradient_weights)
        gradients_biases.insert(0, delta)
    return gradients_weights, gradients_biases
Recurrent Neural Networks
A recurrent neural network (RNN) is a deep learning model that is trained to process and convert
a sequential data input into a specific sequential data output. Sequential data is data—such as
words, sentences, or time-series data—where sequential components interrelate based on
complex semantics and syntax rules.
• A Recurrent Neural Network (RNN) is a type of artificial neural network designed to handle
sequential data, where the order of the data points matters.
• RNNs maintain an internal state, or memory, that captures information about what has been
calculated so far.
1. Select a two-dimensional array from the second axis:
• The instruction is to select a subset of the input data array data along the second
axis (axis index 1).
• This operation extracts a two-dimensional array for each sample in the batch,
representing the features of the data at the first time step.
• The resulting array will have a shape of (batch_size, num_features).
2. Initialize a "hidden state" for the RNNLayer:
• Before processing the input data, the RNNLayer initializes a hidden state, which
is an ndarray that accumulates information about the data passed in during prior
time steps.
• The hidden state has a shape of (batch_size, hidden_size), where hidden_size
represents the dimensionality of the hidden state.
• This hidden state will be continually updated as the RNNLayer processes each
sequence element.
3. Pass the input data and hidden state forward through the first time step:
• In the first time step, the RNNLayer takes the selected two-dimensional array of
input data and the initialized hidden state.
• These two arrays are passed forward through the layer to compute the output.
• The output will have a shape of (batch_size, num_outputs), representing the
predictions or activations of the RNNLayer for each sample in the batch at the
first time step.
• Additionally, the RNNLayer updates its representation for each observation,
outputting an ndarray of shape (batch_size, hidden_size), representing the
hidden state after processing the first time step.
4. Select the next two-dimensional array from data:
• After processing the first time step, the RNNLayer selects the next two-
dimensional array of input data from the input sequence.
• This array represents the features of the data at the second time step.
5. Pass the data and updated hidden state into the second time step:
• The RNNLayer takes the selected input data from the second time step and the
updated hidden state from the first time step.
• These arrays are passed into the layer to compute the output and update the
hidden state for the second time step.
• The output will have the same shape as before (batch_size, num_outputs),
representing the predictions or activations for each sample in the batch at the
second time step.
• The hidden state will also be updated and outputted as an ndarray of shape
(batch_size, hidden_size).
6. Continue processing all time steps and concatenate the results:
• The RNNLayer continues processing each time step of the input sequence,
updating the hidden state and computing outputs.
• After processing all time steps, the RNNLayer concatenates all the output arrays
to form the final output from the layer.
• The final output will have a shape of (batch_size, sequence_length,
num_outputs), representing the predictions or activations for each sample in the
batch at each time step of the input sequence.
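A minimal sketch of this loop, assuming an rnn_node object whose forward(x_t, hidden) method returns (output, new_hidden), mirroring the RNNNode described next:

import numpy as np

def rnn_layer_forward(data, rnn_node, hidden_size):
    batch_size, sequence_length, num_features = data.shape
    hidden = np.zeros((batch_size, hidden_size))   # step 2: initialize the hidden state
    outputs = []
    for t in range(sequence_length):               # steps 1 and 3-6
        x_t = data[:, t, :]                        # select the (batch_size, num_features) slice
        out_t, hidden = rnn_node.forward(x_t, hidden)
        outputs.append(out_t)
    # concatenate along a new time axis: (batch_size, sequence_length, num_outputs)
    return np.stack(outputs, axis=1)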
The Second Class for RNNs: RNNNode:
An RNNNode takes two ndarrays as inputs:
—One for the data inputs to the network, of shape [batch_size, num_features]
—One for the representations of the observations at that time step, of shape [batch_size,
hidden_size]
Two ndarrays as outputs:
—One for the outputs of the network at that time step, of shape [batch_size, num_outputs]
—One for the updated representations of the observations at that time step, of shape: [batch_size,
hidden_size]
Putting These Two Classes Together:
The RNNLayer class will wrap around a list of RNNNodes and will (at least) contain a forward
method that has the following inputs and outputs:
• Input: a batch of sequences of observations of shape [batch_size, sequence_length,
num_features]
• Output: the neural network output of those sequences of shape [batch_size, sequence_length,
num_outputs]
In an RNN with two layers designed to process sequences of length 5, data can flow through the
network's nodes in more than one valid order during the forward pass; the only requirement is
that each RNNNode receives both of its inputs before computing its outputs.
At each time step, inputs initially of dimension feature_size are passed successively forward
through the first RNNNode in each RNNLayer, with the network ultimately outputting a
prediction at that time step of dimension output_size. In addition, each RNNNode passes a
“hidden state” forward to the next RNNNode within each layer.
The Backward Pass:
Backpropagation through Recurrent Neural Networks (RNNs) is an extension of the
backpropagation algorithm used to train feedforward neural networks. It involves propagating
gradients backward through time to update the network parameters (weights and biases) and
minimize the loss function.
Backpropagation simply works the same way, but in reverse:
1. We start with a gradient of shape [output_size, sequence_length], representing how much each
element of the output (also of size [output_size, sequence_length]) ultimately impacts the loss
computed for that batch of observations.
2. These gradients are broken up into the individual sequence_length elements and passed
backward through the layers in reverse order.
3. The gradient for an individual element is passed backward through all the layers.
4. At the same time, the layers pass the gradient of the loss with respect to the hidden state at that
time step backward into the layers’ computations at the prior time steps.
5. This continues for all sequence_length time steps, until the gradients have been passed
backward to every layer in the network, thus allowing us to compute the gradient of the loss with
respect to each of the weights, just as we do in the case of regular feed-forward networks.
In the backward pass, RNNs pass data in the opposite direction from the way it is passed during
the forward pass.
Vanishing Gradient Problem:
The vanishing gradient problem is a common issue encountered during the training of deep
neural networks, including recurrent neural networks (RNNs). It occurs when the gradients of the
loss function with respect to the parameters (e.g., weights and biases) become extremely small as
they propagate backward through the network during training. This phenomenon can
significantly slow down or even halt the learning process, particularly in deep networks with
many layers or recurrent connections.
Long Short-Term Memory (LSTM):
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture that
is designed to overcome the vanishing gradient problem of traditional RNNs. LSTM networks
incorporate memory cells, which are composed of a cell state and various gates, including input
gates, forget gates, and output gates.
Key Components of LSTM:
Memory Cell:
The core component of an LSTM unit is the memory cell, which allows the network to store
information over long periods of time.
The memory cell maintains a hidden state (like a traditional RNN) and a cell state, which can be
updated or modified through a series of gates.
Gates:
LSTM units include three types of gates: input gate, forget gate, and output gate.
Each gate is responsible for controlling the flow of information into and out of the memory cell,
allowing the LSTM to selectively update and utilize information over time.
Input Gate:
The input gate determines how much of the new input information should be stored in the cell
state.
It takes the current input and the previous hidden state as inputs and outputs a value between 0
and 1 for each component of the cell state, indicating how much of the new information should
be retained.
Forget Gate:
The forget gate determines how much of the information from the previous time step should be
forgotten or retained in the cell state.
It takes the current input and the previous hidden state as inputs and outputs a value between 0
and 1 for each component of the cell state, indicating how much of the old information should be
retained.
Output Gate:
The output gate determines how much of the cell state should be exposed as the output of the
LSTM unit.
It takes the current input and the previous hidden state as inputs and outputs a value between 0
and 1 for each component of the cell state, indicating how much of the updated cell state should
be passed to the next time step.
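A minimal NumPy sketch of one LSTM step; the weight matrices W_i, W_f, W_o, W_c and the biases are assumed to be appropriately sized parameters:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_c, b_i, b_f, b_o, b_c):
    z = np.concatenate([h_prev, x_t])     # current input and previous hidden state
    i = sigmoid(W_i @ z + b_i)            # input gate
    f = sigmoid(W_f @ z + b_f)            # forget gate
    o = sigmoid(W_o @ z + b_o)            # output gate
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c_t = f * c_prev + i * c_tilde        # forget old information, add new
    h_t = o * np.tanh(c_t)                # expose part of the cell state
    return h_t, c_t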
Gated Recurrent Units (GRUs):
Gated Recurrent Units (GRUs) were introduced to address some limitations of traditional LSTM
units while maintaining similar capabilities for capturing long-range dependencies in sequential
data. GRUs have become a popular alternative to LSTMs due to their simpler architecture and
comparable performance on many tasks. The key idea in GRUs is the use of gating mechanisms
to control the flow of information within the network.
Update Gate:
• The update gate in a GRU controls how much of the previous hidden state should be
retained and how much of the new hidden state should be considered.
• It takes the current input and the previous hidden state as inputs and outputs a value
between 0 and 1, indicating the proportion of the previous hidden state to retain.
Reset Gate:
• The reset gate in a GRU determines how much of the previous hidden state should be
ignored when computing the candidate hidden state.
• It takes the current input and the previous hidden state as inputs and outputs a value
between 0 and 1, indicating the proportion of the previous hidden state to reset.
Candidate Hidden State:
• The candidate hidden state is computed based on the current input and the reset gate.
• It represents the new information that could potentially be added to the hidden state.
Hidden State Update:
• The final hidden state is computed as a combination of the previous hidden state and the
candidate hidden state, controlled by the update gate.
• It determines how much of the previous state to keep and how much of the new state to
incorporate.
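A minimal NumPy sketch of one GRU step, in the same style as the LSTM sketch above (the parameter names are assumptions):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ z_in + b_z)          # update gate
    r = sigmoid(W_r @ z_in + b_r)          # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde  # blend the old and new hidden states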
RNNs: The Code
An RNN still passes data forward through a series of layers, which send outputs forward on the
forward pass and gradients backward on the backward pass.
def forward(self, x_batch: ndarray) -> ndarray:
    assert_dim(x_batch, 3)   # input must be [batch_size, sequence_length, feature_size]
    x_out = x_batch
    for layer in self.layers:
        x_out = layer.forward(x_out)
    return x_out
Initialization and Forward method:
Each RNNLayer will start with:
• An int hidden_size
• An int output_size
• A forward method that takes in an ndarray of shape [batch_size, sequence_length,
feature_size] and outputs an ndarray of shape [batch_size, sequence_length, output_size]
The backward method:
• Since the forward method outputs x_seq_out, the backward method will receive a gradient
of the same shape as x_seq_out, called x_seq_out_grad, and will move in the opposite
direction from the forward method.
“Vanilla” RNNNodes:
"Vanilla" RNN nodes represent the simplest form of recurrent neural network (RNN) nodes,
where the update equations for the hidden state do not include any gating mechanisms such as
those found in Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells. Instead,
these nodes directly compute the new hidden state based on the current input and the previous
hidden state using linear transformations and activation functions.
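A minimal sketch of a vanilla RNNNode's forward step with the shapes described earlier (W_xh, W_hh, W_ho and the biases are assumed parameters):

import numpy as np

def rnn_node_forward(x_in, h_in, W_xh, W_hh, W_ho, b_h, b_o):
    # x_in: [batch_size, num_features], h_in: [batch_size, hidden_size]
    h_out = np.tanh(x_in @ W_xh + h_in @ W_hh + b_h)   # updated representation
    x_out = h_out @ W_ho + b_o                         # output at this time step
    return x_out, h_out   # [batch_size, num_outputs], [batch_size, hidden_size]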
Vanilla RNNs often suffer from the vanishing gradient problem and have difficulty capturing
long-term dependencies in sequential data, compared to more advanced architectures like
LSTMs or GRUs.
PyTorch
PyTorch is an open source machine learning library for Python, used for applications such as
natural language processing. It was initially developed by Facebook's artificial-intelligence
research group, and Uber's Pyro software for probabilistic programming is built on top of it.
PyTorch Tensors:
PyTorch tensors are the fundamental data structures used for representing and manipulating data
in PyTorch. They are similar to NumPy arrays but with additional features optimized for deep
learning computations, such as automatic differentiation for gradient computations.
A simple NumberWithGrad class accumulates gradients by keeping track of the operations
performed on it. This means that if we write:
a = NumberWithGrad(3)
b = a * 4
c = b + 3
d = a + 2
e = c * d
e.backward()
then a.grad would equal 35, which is actually the partial derivative of e with respect to a.
PyTorch Tensors accumulate gradients the same way when created with requires_grad=True:
a = torch.tensor([[3., 3.],
                  [3., 3.]], requires_grad=True)
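The same computation as the NumberWithGrad example can be replayed with PyTorch's autograd to confirm the gradient of 35 (a minimal sketch):

import torch

a = torch.tensor(3.0, requires_grad=True)
b = a * 4
c = b + 3
d = a + 2
e = c * d
e.backward()     # autograd accumulates gradients back to a
print(a.grad)    # tensor(35.)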
Deep Learning with PyTorch
Deep learning with PyTorch is a powerful and flexible way to build and train neural networks.
PyTorch provides a rich set of tools and functionalities for implementing various deep learning
models, from simple feedforward networks to complex architectures like convolutional neural
networks (CNNs) and recurrent neural networks (RNNs).
Deep learning models have several elements that work together to produce a trained model:
• A Model, which contains Layers
• An Optimizer
• A Loss
• A Trainer
PyTorch Elements: Model, Layer, Optimizer, and Loss:
Model: The model represents the architecture of your neural network. It consists of various
layers organized in a specific manner to process input data and produce desired outputs.
Optimizer: The optimizer is responsible for updating the parameters of the model (i.e., weights
and biases) during the training process to minimize the loss function.
Loss Function: The loss function (also known as the objective function or cost function)
quantifies how well the model's predictions match the actual target values.
Trainer: The trainer is responsible for orchestrating the training process. It involves iterating
over the dataset, feeding batches of data to the model, computing predictions, calculating the
loss, performing backpropagation to compute gradients, and having the optimizer update the
parameters.
A key feature of PyTorch is the ability to define models and layers as easy-to-use objects that
handle sending gradients backward and storing parameters automatically, simply by having them
inherit from the torch.nn.Module class.
The inference flag:
We need the ability to change our model’s behavior depending on whether we are running it in
training mode or in inference mode. In PyTorch, we can switch a model or layer from training
mode (its default behavior) to inference mode by calling m.eval() on the model or layer (any
object that inherits from nn.Module).
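For example (a minimal sketch; the layer sizes are arbitrary):

import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 2), nn.Dropout(0.5))
model.eval()    # inference mode: dropout is disabled, for instance
model.train()   # back to training mode (the default)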
Example: Boston Housing Prices Model in PyTorch:
Previously, we implemented this within our object-oriented framework, which had a class for the
Layers and a model with a list of length 2 as its layers attribute. Here we'll use PyTorch to
predict Boston housing prices based on the famous Boston Housing dataset, with a simple
regression model for the task. First, you need to ensure you have the necessary libraries
installed. Then we can define a HousePricesModel class that inherits from PyTorchModel, as
follows:
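A minimal sketch of such a model; since PyTorchModel is this text's own wrapper (assumed to inherit from nn.Module), the sketch inherits directly from nn.Module to stay self-contained, and the hidden size is arbitrary:

import torch
import torch.nn as nn

class HousePricesModel(nn.Module):
    def __init__(self, hidden_size: int = 13):
        super().__init__()
        self.fc1 = nn.Linear(13, hidden_size)   # 13 numeric input features
        self.fc2 = nn.Linear(hidden_size, 1)    # one predicted price

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.sigmoid(self.fc1(x))
        return self.fc2(x)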
PyTorch Elements: Optimizer and Loss:
Optimizers and Losses are implemented in PyTorch as one-liners; for example, an SGD
optimizer with momentum (the equivalent of the SGDMomentum optimizer from earlier) and a
loss can each be created in a single line:
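For example (a sketch; the learning rate and momentum values are arbitrary):

import torch.nn as nn
import torch.optim as optim

model = HousePricesModel()   # from the sketch above
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()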
In PyTorch, models are passed into the Optimizer as an argument; this ensures that the optimizer
is “pointed at” the correct model’s parameters so it knows what to update on each iteration (we
did this using the Trainer class earlier).
These Losses inherit from nn.Module, just like the Layers from earlier, so they can be called the
same way, using loss(x) instead of loss.forward(x).
PyTorch Elements: Trainer
The Trainer pulls all of these elements together. Let’s consider the requirements for the Trainer.
We know that it has to implement the general pattern for training neural Networks.
1. Feed a batch of inputs through the model.
2. Feed the outputs and targets into a loss function to compute a loss value.
3. Compute the gradient of the loss with respect to all of the parameters.
4. Use the Optimizer to update the parameters according to some rule.
By default, Optimizers will retain the gradients of the parameters (what we referred to as
param_grads earlier in the book) after each iteration of a parameter update. To clear these
gradients before the next parameter update, we’ll call self.optim.zero_grad().
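A minimal sketch of this training pattern (model, optimizer, loss_fn, and dataloader are assumed to already exist):

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()              # clear gradients retained from the last step
    outputs = model(x_batch)           # 1. feed a batch through the model
    loss = loss_fn(outputs, y_batch)   # 2. compute a loss value
    loss.backward()                    # 3. gradients w.r.t. all parameters
    optimizer.step()                   # 4. optimizer updates the parameters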
Convolutional Neural Networks in PyTorch
Convolutional Neural Networks (CNNs) are widely used in computer vision tasks such as image
classification, object detection, and image segmentation. PyTorch provides a flexible framework
for building CNNs efficiently. Here's an example of implementing a simple CNN for image
classification using PyTorch:
• The data input shape [batch_size, in_channels, image_height,image_width]
• The parameters input shape [in_channels, out_channels, filter_size, filter_size]
• The output shape [batch_size, out_channels, image_height, image_width]
In terms of this notation, the multichannel convolution operation in PyTorch is:
nn.Conv2d(in_channels, out_channels, filter_size)
With this defined, wrapping a ConvLayer around this operation is straightforward. The example
architecture then consists of:
• A convolutional layer that transforms the input from 1 “channel” to 16 channels
• Another layer that transforms these 16 channels into 8 (with each channel still containing 28 ×
28 neurons)
• Two fully connected layers
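A minimal sketch of this architecture (the padding and activation choices are assumptions; padding=2 with a 5 x 5 filter keeps the 28 x 28 spatial size):

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 5, padding=2)   # 1 channel -> 16 channels
        self.conv2 = nn.Conv2d(16, 8, 5, padding=2)   # 16 channels -> 8 channels
        self.fc1 = nn.Linear(8 * 28 * 28, 32)         # first fully connected layer
        self.fc2 = nn.Linear(32, 10)                  # 10 digit classes

    def forward(self, x):
        x = torch.tanh(self.conv1(x))
        x = torch.tanh(self.conv2(x))
        x = x.view(x.shape[0], -1)                    # flatten before the dense layers
        x = torch.tanh(self.fc1(x))
        return self.fc2(x)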
DataLoader and Transforms:
A simple preprocessing step for the MNIST data is to subtract off the global mean and divide by
the global standard deviation to roughly “normalize” the data:
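A sketch using torchvision transforms; 0.1307 and 0.3081 are the commonly quoted global mean and standard deviation of the MNIST training set:

from torchvision import datasets, transforms

img_transforms = transforms.Compose([
    transforms.ToTensor(),                        # scale pixels to [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),   # subtract mean, divide by std
])
train_dataset = datasets.MNIST(root="./data", train=True, download=True,
                               transform=img_transforms)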