Deep Learning Sem
2m:
○ Weights scale inputs, and bias shifts the output to help the
model learn better.
4. Define hyperparameter tuning.
10m:
1. Inputs (x_i):
2. Weights (w_i):
3. Bias (b):
4. Summation Function:
5. Activation Function:
6. Output (y):
Process Summary:
Example:
Given:
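The worked numbers of the original example did not survive extraction; as a stand-in, here is a hypothetical perceptron forward pass in Python (all values are illustrative):

```python
# Hypothetical perceptron forward pass (illustrative values only)
x = [1.0, 0.0]        # inputs x_i
w = [0.6, 0.4]        # weights w_i
b = -0.5              # bias b

z = sum(xi * wi for xi, wi in zip(x, w)) + b   # summation: 0.6*1 + 0.4*0 - 0.5 = 0.1
y = 1 if z >= 0 else 0                          # step activation
print(z, y)                                     # 0.1 1
```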
● There may be just two layers of neurons in the network – the input and the output layer.
● There can be one or more intermediate 'hidden' layers of neurons.
● The neurons may be connected to all neurons in the next layer, and so on.
Structure:
Key Characteristics:
Structure:
● Consists of:
○ An input layer to receive data.
○ One or more hidden layers for complex feature extraction.
○ An output layer for predictions.
● Each layer is fully connected, and neurons use activation functions
like ReLU, Sigmoid, or Tanh.
Key Characteristics:
Comparison Table
Conclusion
Overfitting is a common problem that occurs when a model learns the training data too well, including the noisy data, resulting in poor generalization performance on test data. Overfit models fail to generalize, that is, to apply learned knowledge to new situations.
For example, suppose we are training a linear regression model to predict the price of a house based on its square footage and a few other specifications. We collect a dataset of houses with their square footage and sale price, and then train our linear regression model on this dataset. A linear regression algorithm draws a straight line that best fits the data points by minimizing the difference between predicted and actual values. The goal is to find a straight line that captures the main pattern in the dataset, so that it can predict new points more accurately. But sometimes overfitting occurs in linear regression, which is like bending that straight line to fit exactly through a few points of the pattern, as shown below in Fig. 1. This might look perfect for those points during training, but it does not work well for other parts of the pattern when it comes to model testing.
Let's discuss the reasons that cause a machine learning model to overfit, which are listed below.
Regularization Technique
Regularization is a technique in machine learning that helps prevent overfitting. It works by introducing penalty terms or constraints on the model's parameters during training. These penalty terms encourage the model to avoid extreme or overly complex parameter values. By doing so, regularization prevents the model from fitting the training data too closely, which is a common cause of overfitting. Instead, it promotes a balance between model complexity and performance, leading to better generalization on new, unseen data.
L1 Regularization
In the above equation, 'y' is the actual target, and 'ŷ' is the predicted target.
Now, to add L1 regularization, we introduce a new term to the model's loss
function:
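The equation itself did not survive extraction; a common form, assuming a mean-squared-error base loss over n samples with weights w_j and regularization strength λ, is:

Loss_L1 = (1/n) Σ_i (y_i − ŷ_i)² + λ Σ_j |w_j|

The absolute-value penalty can push some weights to exactly zero, which is why L1 regularization also acts as a form of feature selection.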
L2 Regularization
Here, 'y' is the actual target, and 'ŷ' is the predicted target. Now, to add L2 regularization, we introduce a new term to the model's loss function:
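Again, the equation did not survive extraction; under the same assumptions (MSE base loss, weights w_j, strength λ), the L2-regularized loss is usually written as:

Loss_L2 = (1/n) Σ_i (y_i − ŷ_i)² + λ Σ_j w_j²

The squared penalty shrinks weights toward zero without forcing them to be exactly zero.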
CO2:
2m:
1. List the Deep Learning frameworks or tools that you have used.
○ Activation functions:
■ Enable networks to model complex relationships by
introducing non-linearity.
■ Help networks distinguish between features.
■ Facilitate the backpropagation process by allowing
gradient calculation.
9. List out the features of Neural Networks.
○ Limitations include:
■ High data requirements: Needs large datasets for
training.
■ Computational intensity: Requires powerful hardware
(GPUs/TPUs).
■ Black-box nature: Hard to interpret decisions.
■ Overfitting risk: If not regularized, it may memorize data
instead of generalizing.
1. Computer Vision:
9. Gaming:
Conclusion
Real-World Applications
Conclusion
The ReLU activation function is one of the most widely used activation
functions in deep learning due to its simplicity and effectiveness.
Definition:
Characteristics:
● Non-linearity: Even though ReLU looks like a linear function for x > 0, the zeroing of negative inputs introduces non-linearity.
● Computational Efficiency: It is computationally simple as it involves
only a thresholding operation.
● Sparsity: Negative inputs are mapped to zero, reducing the number
of active neurons.
● Gradient Flow: For x > 0, the gradient is 1, ensuring stable updates during backpropagation.
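As a quick illustration of the properties above (not part of the original notes), a minimal NumPy sketch of ReLU and its gradient:

```python
import numpy as np

def relu(x):
    # ReLU: element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 otherwise
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```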
Limitations:
Definition:
The ELU (Exponential Linear Unit) function behaves like ReLU for positive inputs, but for negative inputs it follows an exponential curve controlled by a parameter α (typically set to 1). Mathematically:
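The piecewise definition did not survive extraction; the standard form is:

f(x) = x,             if x > 0
f(x) = α (e^x − 1),   if x ≤ 0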
Characteristics:
Limitations:
Conclusion
1. Autoencoders:
Challenges
Applications
the connections are bidirectional). There are two types of nodes in the
Boltzmann Machine —
Visible nodes – those nodes which we can and do measure, and
Hidden nodes – those nodes which we cannot or do not measure.
Although the node types are different, the Boltzmann machine considers
them as the same and everything works as one single system. The training data is fed into the Boltzmann Machine and the weights of the system are adjusted accordingly.
Types of Boltzmann Machines:
learning phase, you can use a DBM to generate new data. When generating new data, the DBM starts with a random pattern and refines it step by step, each time updating the pattern to be more like the patterns it learned during training.
Concepts Related to Deep Boltzmann Machines (DBMs)
Several key concepts underpin Deep Boltzmann Machines:
CO3:
2m:
1. Purpose of Autoencoders
● Applications:
○ Dimensionality Reduction: Extracting meaningful features from
high-dimensional data.
○ Noise Reduction: Denoising autoencoders help clean corrupted
data.
○ Anomaly Detection: Identifying patterns that deviate from the
norm.
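A minimal PyTorch-style sketch of an autoencoder (the 784-dimensional input and layer sizes are illustrative assumptions, not from the notes):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder compresses the input to a low-dimensional code
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Decoder reconstructs the input from that code
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)               # batch of flattened 28x28 inputs
loss = nn.MSELoss()(model(x), x)      # reconstruction loss
```

Training minimizes the reconstruction loss, which is what makes the learned code useful for dimensionality reduction, denoising, and anomaly detection.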
Striding determines how much the filter moves across the input image during
the convolution operation.
● Stride = 1: Filter moves one pixel at a time, keeping the output size
larger.
● Stride > 1: The filter skips pixels, reducing the spatial dimensions of the
output, which also reduces computational load.
Striding controls the spatial resolution of the output and is essential in
designing efficient CNNs.
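A small helper (illustrative, using the standard output-size formula floor((n − f + 2p) / s) + 1) shows how a larger stride shrinks the output feature map:

```python
def conv_output_size(n, f, stride=1, padding=0):
    # Standard formula: floor((n - f + 2*padding) / stride) + 1
    return (n - f + 2 * padding) // stride + 1

print(conv_output_size(32, 3, stride=1))  # 30
print(conv_output_size(32, 3, stride=2))  # 15
```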
3. Convolution Layer
● Structure:
● Examples:
○ Rotations, flips, scaling, cropping.
○ Adding noise or changing brightness.
● Purpose: Improve model generalization and reduce overfitting by
exposing the model to diverse variations of data.
● Steps:
○ A kernel (filter) slides over the input.
○ Dot products between the kernel and input regions are computed.
○ The result is a feature map highlighting important patterns.
● Benefits: Captures spatial dependencies efficiently.
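A minimal NumPy sketch of the sliding-window steps listed above (valid padding, stride 1; the kernel values are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the input and take the dot product
    # with each overlapping patch (no padding, stride 1).
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # simple vertical-edge filter
print(conv2d(image, edge_kernel).shape)          # (3, 3) feature map
```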
ResNet addresses the vanishing gradient problem, which occurs in very deep
networks.
● Benefits:
○ Simplifies learning by focusing on residuals.
○ Allows training of very deep networks (e.g., 152 layers).
● Larger strides reduce the size of the output feature map, which
decreases computational complexity.
● Strides help balance feature resolution and efficiency.
● Steps:
○ Train several base models.
○ Use their predictions as input features for the meta-model.
● Techniques:
○ Dropout: Randomly deactivates neurons during training.
○ Weight Regularization: Penalizes large weights (e.g., L1/L2
regularization).
● Advantages:
○ Simplicity.
○ Efficient computation.
○ Sparse activations.
19. How does a convolutional layer differ from a fully connected layer?
Parameter sharing involves using the same filter (weights) across different
regions of the input.
● Benefits:
○ Reduces the number of parameters.
○ Improves generalization.
○ Makes CNNs computationally efficient.
10m:
1. Input Layer
● How It Works:
3. Activation Layer
4. Pooling Layer
7. Output Layer
Example Architecture
Advantages of CNNs
Applications
Conclusion
H(x) = F(x) + x
Here:
1. Skip Connections:
○ ResNet enables the training of very deep networks (e.g., 50, 101,
or even 152 layers) without performance degradation.
3. Ease of Optimization:
Architecture Components
1. Input Layer
● Accepts the input image data (e.g., size 224 × 224 × 3 for RGB images).
● Initial layers include a convolution layer (7 × 7 filter, stride 2) followed by batch normalization, ReLU activation, and max pooling (3 × 3, stride 2).
2. Residual Block
Where:
● Residual blocks are grouped into stages, with each stage containing
a specific number of blocks.
● Each stage typically doubles the number of filters and reduces
spatial dimensions.
Variants of ResNet
ResNet-18: 18 layers, residual blocks per stage [2, 2, 2, 2]
ResNet-34: 34 layers, residual blocks per stage [3, 4, 6, 3]
Advantages of ResNet
Applications of ResNet
1. Image Classification:
Conclusion
1. Parameters in CNNs
a. Weights (Filters/Kernels):
● Definition: Filters in CNN layers are matrices that slide (convolve) over
the input to extract features like edges, textures, and patterns.
● Key Points:
○ The size of filters (e.g., 3x3, 5x5) determines the receptive field.
○ Filters are initialized randomly and updated during training
using backpropagation.
b. Biases:
c. Hyperparameters:
● Hyperparameters are not learned but are set by the user to control
the training process. Examples include:
○ Learning Rate: The step size for weight updates.
○ Batch Size: Number of samples processed before updating
weights.
○ Number of Filters: Determines the depth of feature maps.
○ Stride: Determines how much the filter moves during
convolution.
○ Padding: Controls the spatial size of the output (e.g., valid vs.
same padding).
2. Regularization in CNNs
a. Dropout:
c. Data Augmentation:
d. Batch Normalization:
e. Early Stopping:
f. Weight Initialization:
g. Pooling:
● Techniques like max pooling and average pooling reduce the spatial
dimensions of feature maps, discouraging overfitting by reducing
parameters.
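A small PyTorch sketch combining two of the techniques above, dropout and L2 weight decay (layer sizes and hyperparameter values are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly deactivates half the activations during training
    nn.Linear(256, 10),
)
# L2 regularization is commonly applied through the optimizer's weight_decay term
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```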
1. Computer Vision
a. Image Classification:
● Applications:
b. Object Detection:
● Applications:
○ Autonomous vehicles: Detecting pedestrians, vehicles, and
traffic signs.
○ Surveillance: Identifying unauthorized persons or suspicious
activities.
○ Retail: Inventory management via automated shelf monitoring.
c. Face Recognition:
● Applications:
○ Security systems (e.g., biometric authentication).
○ Personalized user experiences (e.g., unlocking smartphones).
○ Forensic investigations to identify individuals in images or
videos.
d. Semantic Segmentation:
● Applications:
○ Medical imaging: Segmenting tumors or organs for better
diagnosis.
○ Autonomous vehicles: Understanding the road environment.
○ Augmented reality: Real-time mapping of environments.
● Applications:
3. Healthcare
● Applications:
○ Detecting diseases in X-rays, MRIs, and CT scans (e.g., cancer,
fractures).
○ Classifying skin conditions like melanoma or psoriasis.
b. Drug Discovery:
● Applications:
○ Predicting molecular interactions using image-like molecular
data.
○ Visualizing cell behavior to identify potential drug candidates.
4. Autonomous Systems
a. Self-Driving Cars:
b. Robotics:
a. Visual Search:
b. Inventory Management:
6. Agriculture
a. Crop Monitoring:
b. Livestock Monitoring:
7. Entertainment
a. Content Creation:
b. Gaming:
c. Video Analytics:
8. Financial Services
a. Fraud Detection:
b. Document Analysis:
b. Weather Prediction:
10. Manufacturing
a. Quality Control:
b. Process Monitoring:
AlexNet
When? AlexNet was introduced in 2012, the year of the London Olympics.
Why? AlexNet was born out of the need to improve the results of the ImageNet challenge. It was one of the first deep convolutional networks to achieve considerable accuracy on the ImageNet dataset.
What? The network consists of five convolutional (Conv) layers and three Fully Connected (FC) layers. The activation used is the Rectified Linear Unit (ReLU). The structural details of each layer in the network can be found in the table below.
VGGNet:
When?
Why? VGGNet was born out of the need to reduce the number of parameters in the convolutional layers and to improve training time.
What? There are multiple variants of VGGNet (VGG16, VGG19, etc.) which differ only in the total number of layers in the network. The structural details of a VGG network are shown in the table below. VGG16 has a total of 138 million parameters. The important point to note here is that all the conv kernels are of size 3x3 and the maxpool kernels are of size 2x2 with a stride of two.
How? The idea behind having fixed-size kernels is that any larger convolutional kernel can be replicated by making use of multiple 3x3 kernels as building blocks. The replication is in terms of the receptive field covered by the kernels: for example, two stacked 3x3 convolutions cover the same receptive field as a single 5x5 convolution while using fewer parameters.
ResNet
When?
Why? Neural networks are notorious for not being able to find a simpler mapping when it exists. For example, say we want to train a fully connected network on a dataset where the input equals the output. The simplest solution to this problem is having all weights equaling one and all biases zeros for all the hidden layers. But when such a network is trained using backpropagation, a rather complex mapping is learned where the weights and biases have a wide range of values. Similarly, say a shallow network f(x) already performs well on a dataset. Now adding more layers to this network g(f(x)) should perform at least as well, since the new layers could simply learn an identity mapping and nothing more. But unfortunately, that is not the case. Experiments have shown that the deeper network's accuracy degrades rather than staying at the same value.
What? There are multiple versions of ResNet-XX, where 'XX' denotes the number of layers. The most commonly used ones are ResNet50 and ResNet101. Since the vanishing gradient problem was taken care of (more about it in the How part), CNNs started to get deeper and deeper. Below we present the structural details of the network.
Stacking:
Stacking is a way to ensemble multiple classification or regression models. There are many ways to ensemble models; the widely known ones are Bagging and Boosting. Bagging averages multiple similar models with high variance to decrease variance, while Boosting builds multiple incremental models to decrease the bias while keeping variance small.
Stacking instead combines different types of models, each capable of learning some part of the problem but not the whole problem space. So, you build multiple different learners and use them to produce an intermediate prediction, one prediction from each learned model. Then you add a new model which learns the same target from those intermediate predictions.
This final model is said to be stacked on the top of the others, hence the
name. Thus, you might improve your overall performance, and often you end
up with a model which is better than any individual intermediate model.
Notice however, that it does not give you any guarantee, as is often the case
with any machine learning technique.
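A compact scikit-learn sketch of stacking (the base learners and dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Base learners each capture part of the problem; the meta-model
# (final_estimator) learns the target from their predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50)),
                ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))
```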
For example, consider the red square as a filter. The computer is going to use this filter
to scan the image.
Stride is a Convolutional Neural Network technique with two main effects. The first is to reduce the size of the output feature map: because the filter overlaps with only a subset of the input feature map positions, the output feature map is smaller, which helps reduce the computational complexity.
The second is the overlap of the receptive field. The receptive field is the area of the
input feature map that is used to calculate the output of a neuron.
For example, a stride of 2 reduces the overlap of receptive fields by half because the
filter will overlap with half of the receptive fields in the previous layer. It helps prevent
the CNN from learning redundant features.
Assume a convolutional neural network is analysing the content of an image. If the filter
size is 4x4 pixels, the contained sixteen pixels will be converted down to 1 pixel in the
output layer. As the stride increases, the resulting output decreases.
Stride is a parameter that works in conjunction with padding. Padding is the feature that adds empty (zero) values around the border of the image to minimize the reduction of size in the output layer. In effect, it is a way of increasing the size of an image to balance the size reduction caused by the strides. Padding and stride are fundamental to CNNs.
Now that we have discussed padding and stride, let's see a comparison between the two.
Pooling:
Refer to this link:
(https://medium.com/@abhishekjainindore24/pooling-and-their-types-in-cnn-4a4b8a7a4611)
CO4:
2m:
Advantages:
Drawbacks:
● Unlike ANNs, RNNs must propagate errors across time, making gradient
computation more complex and prone to issues like vanishing/exploding
gradients.
Key Concept
In standard RNNs, the output at each time step is computed based only on
the current and previous inputs, limiting the model to past context.
Bidirectional RNNs address this limitation by:
This bidirectional setup ensures that the model utilizes both past
(previous context) and future (subsequent context) information to make
predictions.
1. Input Layer
2. Forward RNN
3. Backward RNN
4. Output Layer
● Combines the hidden states from both the forward and backward RNNs at each time step: h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]
○ Concatenates (;) or adds the forward and backward hidden states.
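A minimal PyTorch sketch (sizes are illustrative) showing that a bidirectional RNN's output at each time step concatenates the forward and backward hidden states:

```python
import torch
import torch.nn as nn

birnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True, bidirectional=True)
x = torch.rand(4, 7, 10)   # (batch, time steps, features)
out, h_n = birnn(x)
print(out.shape)           # torch.Size([4, 7, 40]) -> 2 * hidden_size per time step
```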
[Diagram: the input sequence is read by a forward RNN (left to right) and a backward RNN (right to left), and their hidden states are combined at each time step.]
1. Context Awareness:
3. Flexibility:
Conclusion
A Seq2Seq model can map one sequence to another, even when the sequences have different lengths.
1. Encoder
● The encoder processes the input sequence step by step and converts
it into a context vector (also known as the hidden state). This
context vector is a representation of the entire input sequence in a
compressed form.
Process:
2. Decoder
Process:
3. Sequence Generation
Mathematical Representation
Let the input sequence be X = [x_1, x_2, ..., x_T] and the output sequence be Y = [y_1, y_2, ..., y_{T'}].
1. Encoder:
3. Decoder:
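The per-step equations did not survive extraction; in standard notation (f and g are the encoder and decoder recurrent cells, c is the context vector) they are usually written as:

Encoder: h_t = f(x_t, h_{t-1}),  c = h_T
Decoder: s_t = g(y_{t-1}, s_{t-1}, c),  P(y_t | y_1, ..., y_{t-1}, X) = softmax(W_o s_t)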
1. Machine Translation:
3. Text Summarization:
○ The input sequence may not always align well with the output
sequence, especially in tasks like machine translation. This can
result in errors when predicting long output sequences based
on a compressed representation of the input.
Enhancements to Seq2Seq
1. Attention Mechanism:
Conclusion
CO5:
Architecture of DBN
1. Layer-wise Pre-training:
○ The DBN is trained in an unsupervised manner using an RBM
for each layer.
○ Each RBM learns to reconstruct the inputs from hidden
representations, capturing more abstract features as you go
deeper into the network.
○ For example, in a DBN used for image recognition:
■ The first RBM might learn edges, corners, and textures
from the raw pixel values of the images.
■ The second RBM might learn higher-level features like
shapes and objects by combining the first layer’s outputs.
Advantages of DBN
● Feature Learning:
○ The layer-wise pre-training allows the network to learn useful
features from the raw data without requiring labeled examples,
making it an effective tool for unsupervised learning.
● Improved Representation:
○ The deep architecture of DBNs provides better representations
of the input compared to shallow networks.
● Generative Model:
○ DBNs are capable of generating new samples from the learned
distribution, which is useful for tasks like generating images or
music.
1. Image Recognition:
○ Example: A DBN might be used to classify handwritten digits
from the MNIST dataset.
Architecture of DBM
1. Pre-training Phase:
○ Similar to the DBN, each layer of the DBM is pre-trained using
RBMs in an unsupervised manner.
○ The DBM learns the most meaningful features from the raw
input data.
2. Inference and Generating Data:
○ After training, the DBM can be used to generate new samples
from the learned distribution.
○ For example, the DBM could generate images, music, or even
text by sampling from the learned representation.
3. Fine-Tuning for Specific Tasks:
○ The DBM can be fine-tuned using supervised learning
methods for specific applications (like image classification,
music generation, etc.).
1. Image Generation:
○ A DBM trained on a large image dataset could generate realistic images based on the features it has learned during training.
Advantages of DBM
1. Representation Power:
○ DBMs can model more complex structures in the data
compared to DBNs by allowing connections between hidden
layers, leading to improved representations.
2. Generative Model:
○ Like DBNs, DBMs are capable of generating new samples from
the learned distribution, which is particularly useful for image,
music, and text generation.
3. Improved Feature Extraction:
○ The presence of connections between hidden layers allows
DBMs to learn more detailed and abstract features from the
input data compared to simpler models.
Limitations of DBM
Conclusion
Both Deep Belief Networks (DBN) and Deep Boltzmann Machines (DBM)
are powerful unsupervised learning models that help in learning complex
representations from raw data. They have been used in various
applications like image recognition, natural language processing, music
generation, and image synthesis. Despite their advantages in feature
learning and representation, they come with challenges like
computational cost and the need for extensive training.
Impact on Learning:
● When the gradients become very small, the weight updates in the
earlier layers (or earlier time steps) become insignificant. As a result:
○ Learning becomes very slow for the earlier layers, or it can
stop altogether.
○ The model fails to learn long-range dependencies and is
unable to capture information from distant time steps.
○ The network struggles to perform well on tasks that require
long-term memory (such as machine translation or speech
recognition).
Example:
Impact on Learning:
Example:
● This helps maintain stable training, especially when using RNNs with
many layers or training on long sequences.
How it works:
How it helps:
○ These methods ensure that the starting weights are neither too
small (causing vanishing gradients) nor too large (causing
exploding gradients).
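For reference (not reproduced from the notes), the usual variance choices are:

Xavier/Glorot initialization: Var(W) = 2 / (n_in + n_out), typically used with sigmoid or tanh layers
He initialization: Var(W) = 2 / n_in, typically used with ReLU layers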
3. Use of LSTM and GRU (Long Short-Term Memory and Gated Recurrent Units)
● ReLU (Rectified Linear Unit) and its variants (like Leaky ReLU or ELU) are
less prone to the vanishing gradient problem because they do not
saturate in the positive domain. Unlike sigmoid and tanh, ReLU has a
constant derivative for positive inputs, reducing the likelihood of
vanishing gradients.
How it helps:
5. Batch Normalization
● How it helps:
Conclusion
CO6:
1. Describe Deep Boltzmann Machine architecture with necessary diagrams.
Deep Boltzmann Machine (DBM) Architecture
1. Visible Layer:
+-------------------------+
|      Visible Layer      |
+-------------------------+
             |
+----------------------------+
|       Hidden Layer 1       |
+----------------------------+
             |
+----------------------------+
|       Hidden Layer 2       |
+----------------------------+
             |
+----------------------------+
|       Hidden Layer 3       |
+----------------------------+
1. Data Representation
4. Inference
● After training, the DBM can be used to generate new data or infer
data from the learned distribution. Given an input (such as an
image), the visible layer is activated, and the hidden layers generate
a probabilistic model of the data.
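For reference, the energy function of a two-hidden-layer DBM is usually written (biases omitted for brevity; W^1 and W^2 are the visible–hidden and hidden–hidden weight matrices) as:

E(v, h^1, h^2) = − v^T W^1 h^1 − (h^1)^T W^2 h^2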
Applications of DBMs
1. Training Complexity:
Conclusion
Despite their representational power, the challenges of training, computational cost, and scalability mean that DBMs are often used in specific applications where their generative capabilities are particularly beneficial.
2. Feature Extraction
● CNNs are highly effective for identifying fake images, as they are
designed to analyze visual patterns in images. CNNs are typically
used for both binary classification (real or fake) and more advanced
tasks like localized manipulation detection.
b. Transfer Learning:
4. GAN-based Detection
6. Adversarial Attacks
● Defense Mechanisms:
8. Multimodal Detection
Conclusion