Unit 2 Deep Learning and Neural Networks
What is a Perceptron?
A Perceptron is the most basic type of artificial neural network, designed to classify data into
two categories (binary classification). It is a single-layer neural network.
Structure of a Perceptron
A perceptron takes several input values, multiplies each by a weight, sums the weighted inputs together with a bias, and passes the result through a step (threshold) activation function to produce the binary output.
Capabilities and Limitations
✔ Works well for linearly separable problems (e.g., AND, OR logic gates).
✖ Fails to solve non-linearly separable problems (e.g., XOR problem).
✖ Does not support multi-class classification.
✖ Cannot learn complex patterns due to its simple structure.
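To make this concrete, here is a minimal NumPy sketch of a perceptron trained with the classic perceptron learning rule on the AND gate (a linearly separable problem); all names and hyper-parameters are illustrative. Running the same loop on XOR data would never converge, which is exactly the limitation noted above.

import numpy as np

# Illustrative data: the AND gate, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)        # weights
b = 0.0                # bias
lr = 0.1               # learning rate

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        error = target - pred
        w += lr * error * xi                       # perceptron learning rule
        b += lr * error

print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # expected: [0, 0, 0, 1]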
Steps in Backpropagation
1. Forward Pass: Pass the inputs through the network to compute the prediction.
2. Compute Loss: Calculate how far the prediction is from the actual value.
3. Backward Pass:
o Compute the gradient of the loss w.r.t. the weights using the chain rule.
4. Update Weights: Adjust each weight in the opposite direction of its gradient (gradient descent), scaled by the learning rate.
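As a sketch of these steps, the following self-contained NumPy example performs one forward pass, loss computation, backward pass, and weight update for a single sigmoid neuron with a squared-error loss (values and names are illustrative).

import numpy as np

# One training example for a single sigmoid neuron: y_hat = sigmoid(w.x + b)
x = np.array([0.5, -1.0])
y = 1.0                                  # target
w = np.array([0.1, 0.2])
b = 0.0
lr = 0.1

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1. Forward pass
z = np.dot(w, x) + b
y_hat = sigmoid(z)

# 2. Compute loss (squared error)
loss = 0.5 * (y_hat - y) ** 2

# 3. Backward pass: chain rule dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = y_hat - y
dyhat_dz = y_hat * (1 - y_hat)
grad_w = dL_dyhat * dyhat_dz * x
grad_b = dL_dyhat * dyhat_dz

# 4. Update weights (gradient descent step)
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)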
The activation function decides whether a neuron should be activated by calculating the
weighted sum of inputs and adding a bias term. This helps the model make complex decisions
and predictions by introducing non-linearities to the output of each neuron.
Neural networks consist of neurons that operate using weights, biases, and activation
functions.
In the learning process, these weights and biases are updated based on the error produced at
the output—a process known as backpropagation. Activation functions enable backpropagation
by providing gradients that are essential for updating the weights and biases.
Without non-linearity, even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions empower neural networks to model highly complex
data distributions and solve advanced deep learning tasks. Adding non-linear activation functions introduces flexibility and enables the network to learn more complex and abstract patterns from data.
To illustrate the need for non-linearity in neural networks with a specific example, let’s consider
a network with two input nodes (i1 and i2), a single hidden layer containing one neuron (h1), and an output neuron (out). We will use w1, w2 as the weights connecting the inputs to the hidden neuron, and w5 as the weight connecting the hidden neuron to the output. We'll also include biases (b1 for the hidden neuron and b2 for the output neuron) to complete the model.
Network Structure
The input to the hidden neuron h1 is calculated as a weighted sum of the inputs plus a bias:
z_h1 = w1*i1 + w2*i2 + b1
The output neuron is then a weighted sum of the hidden neuron's output plus a bias:
output = w5*h1 + b2
If h1 were directly the output of z_h1 (no activation function applied, i.e., h1 = z_h1), then substituting h1 in the output equation yields:
output = w5*(w1*i1 + w2*i2 + b1) + b2
output = w5*w1*i1 + w5*w2*i2 + w5*b1 + b2
This shows that the output neuron is still a linear combination of the inputs i1 and i2. Thus, the entire network, despite having multiple layers and weights, effectively performs a linear transformation, equivalent to a single-layer perceptron.
To introduce non-linearity, let's use a non-linear activation function σ for the hidden neuron. A common choice is the ReLU function, defined as σ(x) = max(0, x).
h1 = σ(z_h1) = σ(w1*i1 + w2*i2 + b1)
output = w5*σ(w1*i1 + w2*i2 + b1) + b2
Effect of Non-linearity
The inclusion of the ReLU activation function σ allows h1 to introduce a non-linear decision boundary in the input space. This non-linearity enables the network to learn more complex patterns that are not possible with a purely linear model, such as:
• Increasing the capacity of the network to form multiple decision boundaries based on
the combination of weights and biases.
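A quick NumPy check (with arbitrary example weights, chosen only for illustration) confirms the collapse described above: without an activation the two-layer computation equals a single collapsed linear expression for every input, while inserting ReLU breaks that equivalence.

import numpy as np

# Arbitrary example weights and biases for the tiny network above
w1, w2, w5 = 0.4, -0.6, 1.5
b1, b2 = 0.1, -0.2

def net_linear(i1, i2):
    h1 = w1 * i1 + w2 * i2 + b1            # no activation
    return w5 * h1 + b2

def net_relu(i1, i2):
    h1 = max(0.0, w1 * i1 + w2 * i2 + b1)  # ReLU activation
    return w5 * h1 + b2

# The linear net always equals one collapsed linear expression ...
i1, i2 = 2.0, 3.0
collapsed = (w5 * w1) * i1 + (w5 * w2) * i2 + (w5 * b1 + b2)
print(np.isclose(net_linear(i1, i2), collapsed))   # True for any inputs

# ... but the ReLU net is not: its output bends where z_h1 crosses 0
print(net_relu(2.0, 3.0), net_relu(2.0, 0.0))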
The Linear Activation Function resembles a straight line defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions, the output is a linear combination of the input.
• The linear activation function is typically used in just one place: the output layer (e.g., for regression).
• Using linear activation across all layers limits the network's ability to learn complex patterns.
Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
The Linear Activation Function, or Identity Function, returns the input as the output: f(x) = x.
1. Sigmoid Function
• It allows neural networks to handle and model complex patterns that linear equations
cannot.
• The output ranges between 0 and 1, hence useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2. This
sensitivity means that small changes in input x can cause significant changes in output y,
which is critical during the training process.
Sigmoid or Logistic Activation Function Graph
2. Tanh Function
The tanh function, or hyperbolic tangent function, is a scaled and shifted version of the sigmoid, stretching its output across the y-axis to the range (−1, 1). It is defined as:
f(x) = tanh(x) = 2/(1 + e^(−2x)) − 1
tanh(x) = 2 × sigmoid(2x) − 1
• Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.
Tanh Activation Function
3. ReLU Function
• Value Range: [0, ∞), meaning the function only outputs non-negative values.
• Advantage over other activations: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, which makes the network sparse and therefore efficient and easy to compute.
ReLU Activation Function
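The three activation functions above can be written directly in NumPy; this small illustrative sketch evaluates each of them on a few inputs and also checks the tanh-sigmoid identity quoted earlier.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # output in (0, 1)

def tanh(x):
    return np.tanh(x)                    # output in (-1, 1), zero-centred

def relu(x):
    return np.maximum(0.0, x)            # output in [0, inf)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
print(2 * sigmoid(2 * x) - 1)            # equals tanh(x), as in the identity above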
Adam Optimizer
Adam (Adaptive Moment Estimation) is an optimization algorithm for gradient descent. The method is very efficient when working with large problems involving a lot of data or parameters. It requires little memory and is computationally efficient. Intuitively, it is a combination of the 'gradient descent with momentum' algorithm and the 'RMSProp' algorithm.
Momentum:
This algorithm is used to accelerate gradient descent by taking into consideration the 'exponentially weighted average' of the gradients. Using averages makes the algorithm converge towards the minima at a faster pace.
w_(t+1) = w_t − α*m_t
where,
m_t = β*m_(t−1) + (1 − β)*(∂L/∂w_t)
w_t = weights at time t, m_t = aggregate of gradients at time t, α = learning rate, β = moving-average parameter
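A minimal NumPy sketch of this momentum update on a toy one-dimensional loss L(w) = w² (the gradient function and hyper-parameter values are illustrative).

import numpy as np

grad = lambda w: 2 * w            # gradient of the toy loss L(w) = w^2
w, m = 5.0, 0.0                   # weight and gradient moving average
alpha, beta = 0.1, 0.9            # learning rate and momentum coefficient

for t in range(200):
    m = beta * m + (1 - beta) * grad(w)   # exponentially weighted average of gradients
    w = w - alpha * m                     # momentum update
print(w)                                  # converges towards the minimum at w = 0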
Root Mean Square Propagation (RMSP):
Root mean square prop or RMSprop is an adaptive learning algorithm that tries to improve
AdaGrad. Instead of taking the cumulative sum of squared gradients like in AdaGrad, it takes the
‘exponential moving average’.
w_(t+1) = w_t − (α_t / (v_t + ε)^(1/2)) * (∂L/∂w_t)
where,
v_t = β*v_(t−1) + (1 − β)*(∂L/∂w_t)²
w_t = weights at time t, v_t = exponential moving average of squared gradients, α_t = learning rate, ε = small positive constant
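The same toy problem with the RMSProp update above; again only a sketch, with typical default values for the hyper-parameters.

import numpy as np

grad = lambda w: 2 * w                 # gradient of the toy loss L(w) = w^2
w, v = 5.0, 0.0
alpha, beta, eps = 0.01, 0.9, 1e-8

for t in range(1000):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2         # moving average of squared gradients
    w = w - alpha * g / (np.sqrt(v) + eps)     # per-parameter scaled update
print(w)                                       # approaches the minimum at w = 0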
Adam Optimizer inherits the strengths or the positive attributes of the above two methods and
builds upon them to give a more optimized gradient descent.
Here, we control the rate of gradient descent in such a way that there is minimum oscillation when it reaches the global minimum, while taking big enough steps (step size) to pass the local-minima hurdles along the way. Hence, it combines the features of the above methods to reach the global minimum efficiently.
m_t = β1*m_(t−1) + (1 − β1)*(∂L/∂w_t)
v_t = β2*v_(t−1) + (1 − β2)*(∂L/∂w_t)²
Parameters Used :
1. ε = a small positive constant to avoid a 'division by zero' error when v_t → 0 (typically 10^(−8)).
2. β1 & β2 = decay rates of the averages of the gradients in the above two methods (β1 = 0.9 and β2 = 0.999).
m_hat_t = m_t / (1 − β1^t)
v_hat_t = v_t / (1 − β2^t)
Intuitively, we are adapting to the gradient descent after every iteration so that it remains
controlled and unbiased throughout the process, hence the name Adam.
Now, instead of our normal weight parameters m_t and v_t, we take the bias-corrected parameters m_hat_t and v_hat_t. Putting them into our general equation, we get
w_(t+1) = w_t − m_hat_t * (α / (√(v_hat_t) + ε))
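Putting the two moments and the bias correction together, here is a compact from-scratch NumPy sketch of the Adam update on the same toy loss L(w) = w² (default hyper-parameters; not a production implementation).

import numpy as np

grad = lambda w: 2 * w                       # gradient of the toy loss L(w) = w^2
w = 5.0
m, v = 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
print(w)                                     # ends up near the minimum at w = 0 (small oscillations remain)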
Performance:
Building upon the strengths of the previous methods, the Adam optimizer gives much higher performance than the previously used optimizers and outperforms them by a big margin, yielding a more optimized gradient descent. Reported comparisons clearly show Adam outperforming the other optimizers by a considerable margin in terms of training cost (low) and performance (high).
What is RMSProp Optimizer?
RMSProp was introduced by Geoffrey Hinton. The algorithm was developed to address the
limitations of previous optimization methods such as SGD (Stochastic Gradient Descent) and
AdaGrad. While SGD uses a constant learning rate, which can be inefficient, and AdaGrad
reduces the learning rate too aggressively, RMSProp strikes a balance by adapting the learning
rates based on a moving average of squared gradients. This approach helps in maintaining a
balance between efficient convergence and stability during the training process, making
RMSProp a widely used optimization algorithm in modern deep learning.
The core idea behind RMSProp is to keep a moving average of the squared gradients to
normalize the gradient updates. By doing so, RMSProp prevents the learning rate from
becoming too small, which was a drawback in AdaGrad, and ensures that the updates are
appropriately scaled for each parameter. This mechanism allows RMSProp to perform well even
in the presence of non-stationary objectives, making it suitable for training deep learning
models.
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for
optimizing machine learning models. It addresses the computational inefficiency of traditional
Gradient Descent methods when dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training
example (or a small batch) is selected to calculate the gradient and update the model
parameters. This random selection introduces randomness into the optimization process, hence the term “stochastic” in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing with large
datasets. By using a single example or a small batch, the computational cost per iteration is
significantly reduced compared to traditional Gradient Descent methods that require processing
the entire dataset.
Stochastic Gradient Descent Algorithm
• Set Parameters: Determine the number of iterations and the learning rate (alpha) for
updating the parameters.
• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum number of iterations:
o Shuffle the training data at the start of each pass.
o Iterate over each training example (or a small batch) in the shuffled order.
o Compute the gradient of the cost function with respect to the model parameters using the current training example (or batch).
o Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.
o Evaluate the convergence criteria, such as the change in the cost function between iterations.
• Return Optimized Parameters: Once the convergence criteria are met or the maximum
number of iterations is reached, return the optimized model parameters.
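The algorithm above, sketched in NumPy for a simple linear-regression objective; the synthetic data, learning rate, and epoch count are illustrative only.

import numpy as np

# Synthetic data: y = 3*x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 3 * X + 2 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0
alpha, epochs = 0.1, 20

for epoch in range(epochs):
    order = rng.permutation(len(X))              # shuffle each pass
    for i in order:                              # one example at a time
        pred = w * X[i] + b
        error = pred - y[i]
        grad_w = error * X[i]                    # gradient of 0.5*error^2
        grad_b = error
        w -= alpha * grad_w                      # step along the negative gradient
        b -= alpha * grad_b

print(w, b)                                      # roughly 3 and 2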
In SGD, since only one sample from the dataset is chosen at random for each iteration, the path
taken by the algorithm to reach the minima is usually noisier than your typical Gradient Descent
algorithm. But that doesn’t matter all that much because the path taken by the algorithm does
not matter, as long as we reach the minimum and with a significantly shorter training time.
One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually takes a higher number of iterations to reach the minima, because of the randomness in its descent. Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a learning algorithm.
Advanced Deep Learning Architectures
a) Convolutional Layer
• Applies a set of filters (kernels) to the input image to detect patterns such as edges,
textures, and shapes.
• Each filter slides over the input image and performs element-wise multiplication
followed by summation, producing a feature map.
b) Pooling Layer
• Reduces the spatial dimensions of feature maps, making the network more efficient and reducing overfitting.
• Max Pooling selects the maximum value in a region, while Average Pooling takes the mean.
c) Fully Connected Layer
• After convolutional and pooling layers, the extracted features are passed to fully connected layers to make final predictions.
• The last layer often uses softmax activation for multi-class classification or sigmoid for binary classification.
2. Working of CNNs for Image Data
1. Input Image: A digital image represented as a matrix of pixel values (e.g., 28×28 for
grayscale or 224×224×3 for RGB).
2. Feature Extraction: The convolutional layers apply filters to extract low- and high-level
features.
3. Dimensionality Reduction: Pooling layers reduce the feature map size while retaining
important information.
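A minimal Keras sketch of this pipeline, assuming 28×28 grayscale inputs and 10 output classes (both assumptions, chosen only to match the example above).

from tensorflow.keras import layers, models

# Convolution -> pooling -> fully connected, as described above
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                     # grayscale image
    layers.Conv2D(32, (3, 3), activation='relu'),        # feature extraction
    layers.MaxPooling2D((2, 2)),                         # dimensionality reduction
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),                 # fully connected layer
    layers.Dense(10, activation='softmax')               # multi-class output
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()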
3. CNN Architectures
• VGGNet: Uses deep layers (VGG-16, VGG-19) with small filters (3×3).
4. Applications of CNNs
Working of RNNs:
The hidden state h_t is updated using the previous state h_(t−1) and the current input x_t:
h_t = f(W_h·h_(t−1) + W_x·x_t + b)
Vanishing Gradient Problem: When training long sequences, gradients become too small,
making it hard to learn long-term dependencies.
LSTMs (Long Short-Term Memory networks) address this by using gates that control what information is forgotten, stored, and output at each time step.
LSTM Equations:
f_t = σ(W_f·[h_(t−1), x_t] + b_f)   (Forget Gate)
i_t = σ(W_i·[h_(t−1), x_t] + b_i)   (Input Gate)
C~_t = tanh(W_c·[h_(t−1), x_t] + b_c)   (New Memory)
o_t = σ(W_o·[h_(t−1), x_t] + b_o)   (Output Gate)
C_t = f_t*C_(t−1) + i_t*C~_t   (Cell State Update)
h_t = o_t*tanh(C_t)   (Hidden State Update)
GRUs are a simplified version of LSTMs with fewer parameters, making them computationally
efficient.
GRU Equations:
r_t = σ(W_r·[h_(t−1), x_t] + b_r)   (Reset Gate)
z_t = σ(W_z·[h_(t−1), x_t] + b_z)   (Update Gate)
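In practice these recurrent cells are rarely implemented by hand; the hedged Keras sketch below (vocabulary size, sequence length, and layer widths are placeholders) shows how an LSTM or GRU layer is dropped into a sequence-classification model, and printing the parameter counts confirms that the GRU is the lighter of the two.

from tensorflow.keras import layers, models

vocab_size, seq_len = 10000, 100               # placeholder values

def make_model(cell='lstm'):
    rnn = layers.LSTM(64) if cell == 'lstm' else layers.GRU(64)
    return models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 32),      # token ids -> dense vectors
        rnn,                                   # gated recurrent layer (LSTM or GRU)
        layers.Dense(1, activation='sigmoid')  # binary sequence classification
    ])

lstm_model = make_model('lstm')
gru_model = make_model('gru')                  # fewer parameters than the LSTM
print(lstm_model.count_params(), gru_model.count_params())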
1. Autoencoders (AEs)
Architecture of Autoencoders:
1. Encoder:
o The encoder maps the input data x to a lower-dimensional latent space representation z.
o The goal of the encoder is to compress the input while retaining the essential
information.
2. Decoder:
o The decoder maps the latent space representation z back to the original data space, aiming to reconstruct the input x̂.
o The goal is to minimize the reconstruction error (difference between the input
and the reconstructed data).
Working of Autoencoders:
• Input: x
• Encoding: z = Encoder(x)
• Reconstruction: x̂ = Decoder(z)
The loss function typically used for training autoencoders is mean squared error (MSE), e.g. L(x, x̂) = ||x − x̂||², or binary cross-entropy.
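A minimal Keras autoencoder following this encoder/decoder structure; the 784-dimensional input (e.g., flattened 28×28 images) and 32-dimensional latent space are illustrative choices.

from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32            # e.g. flattened 28x28 images

# Encoder: compress x down to the latent code z
encoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(latent_dim, activation='relu')
])

# Decoder: reconstruct x_hat from z
decoder = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation='relu'),
    layers.Dense(input_dim, activation='sigmoid')
])

autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')   # reconstruction loss
# autoencoder.fit(x_train, x_train, epochs=10)      # input and target are the same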
Applications of Autoencoders:
2. Denoising: A denoising autoencoder can learn to remove noise from the input data and
output a clean version.
4. Image Compression: By training on large image datasets, autoencoders can be used for
image compression.
2. Variational Autoencoders (VAEs)
1. Probabilistic Approach: VAEs assume that the data comes from a latent variable model and use a probabilistic encoding of the input data.
2. Latent Space Distribution: While regular autoencoders map inputs directly to a latent
space, VAEs introduce randomness by learning a distribution over the latent space.
Typically, this distribution is a Gaussian distribution.
3. Regularization: VAEs include a regularization term (KL divergence) that forces the
learned latent space distribution to be close to a standard normal distribution.
Architecture of VAEs:
1. Encoder: Maps the input x to the parameters μ and σ² of a Gaussian distribution over the latent space.
2. Sampling: During training, we sample from this Gaussian distribution z ~ N(μ, σ²), making the encoder probabilistic.
3. Decoder: Decodes the sampled latent variable z back into the original data space.
Mathematical Formulation:
• The encoder learns the parameters μ and σ² for the distribution q(z|x).
• The training loss combines the reconstruction error with the KL divergence between q(z|x) and the standard normal prior, as described in the regularization point above.
Applications of VAEs:
1. Data Generation: VAEs are particularly well-suited for generating new, synthetic data
(e.g., generating new images similar to the training data).
2. Image Synthesis: VAEs can generate new images that resemble the data they were
trained on by sampling from the latent space.
3. Anomaly Detection: Like autoencoders, VAEs can be used to detect anomalies based on
reconstruction error.
4. Latent Space Exploration: Because the latent space is continuous and structured, it can
be manipulated for tasks like interpolation and latent space visualization.
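The step that distinguishes a VAE from a plain autoencoder is the sampling described above. The NumPy sketch below (with made-up values for μ and log σ²) shows the reparameterization trick and the KL-divergence term against the standard normal prior.

import numpy as np

rng = np.random.default_rng(0)

# Suppose the encoder produced these latent parameters for one input x
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.3])            # log(sigma^2), a common parameterization

# Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, I)
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps       # differentiable w.r.t. mu and log_var

# KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I)
kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
print(z, kl)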
Transfer Learning
Transfer Learning is a technique in machine learning where a model developed for a particular
task is reused (or adapted) for a different, but related task. This approach leverages the
knowledge gained from a previously trained model to solve new problems, often requiring
fewer resources, such as data and computational power, compared to training a model from
scratch.
ResNet (Residual Network)
Overview:
• ResNet is well-suited for image classification tasks and is known for its impressive
performance on benchmark datasets like ImageNet.
Architecture:
• ResNet consists of a series of residual blocks in which the input is added to the output of
the block (after some processing). This makes it easier for the network to learn identity
mappings, which helps the model avoid the degradation problem that occurs with
deeper networks.
• ResNet has various versions with different depths, e.g., ResNet-18, ResNet-34, ResNet-
50, ResNet-101, and ResNet-152. The number refers to the number of layers in the
network.
Key Features:
• Residual Connections: Skip connections between layers, allowing gradients to flow more
easily during backpropagation.
• Very Deep Networks: ResNet can be extremely deep (e.g., ResNet-152 with 152 layers).
• High Performance: It performs very well on various image classification and object
detection tasks.
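A hedged Keras sketch of one residual block of the kind described above; filter counts and input shape are illustrative, and real ResNet blocks also include batch normalization and, in the deeper variants, bottleneck layers.

from tensorflow.keras import layers, models

def residual_block(x, filters=64):
    # Simplified residual block: output = ReLU(F(x) + x), with an identity skip connection
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.Add()([y, shortcut])          # add the input back onto the block output
    return layers.Activation('relu')(y)

# Usage: stack two blocks on a 32x32 feature map with 64 channels
inputs = layers.Input(shape=(32, 32, 64))
outputs = residual_block(residual_block(inputs))
model = models.Model(inputs, outputs)
model.summary()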
Use Cases:
• Image classification
• Object detection
• Semantic segmentation
VGG (Visual Geometry Group Network)
Overview:
• VGG is a CNN architecture developed by the Visual Geometry Group (VGG) at the
University of Oxford. The VGG network was a major breakthrough due to its simplicity
and depth.
• VGG16 and VGG19 are the most commonly used models, named after the number of
layers in the network (16 and 19 layers, respectively).
Architecture:
• The network is very deep, with several convolutional layers stacked on top of each other.
The key characteristic of the VGG models is the use of small 3x3 convolutional filters
throughout the network.
Key Features:
• Deep Architecture: VGG was one of the first to demonstrate that increasing depth could
significantly improve performance.
• Simplicity: The VGG architecture uses very simple 3x3 filters and max-pooling layers,
making it straightforward to implement.
• Large Model Size: VGG models are large with a high number of parameters, making
them computationally expensive.
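The characteristic VGG pattern, stacks of 3×3 convolutions followed by 2×2 max-pooling, can be sketched in Keras as follows. This is a simplified illustrative mini-network, not the full 16-layer model; in practice the pre-trained network is usually loaded directly from tf.keras.applications.

from tensorflow.keras import layers, models

# One simplified VGG-style block: two 3x3 convolutions, then 2x2 max-pooling
def vgg_block(filters):
    return [
        layers.Conv2D(filters, (3, 3), padding='same', activation='relu'),
        layers.Conv2D(filters, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D((2, 2)),
    ]

# Illustrative mini-VGG; the real VGG16 stacks five such blocks and uses
# 4096-unit fully connected layers before the 1000-class softmax.
model = models.Sequential(
    [layers.Input(shape=(64, 64, 3))]
    + vgg_block(64) + vgg_block(128)
    + [layers.Flatten(),
       layers.Dense(256, activation='relu'),
       layers.Dense(10, activation='softmax')]
)
model.summary()

# The pre-trained network can also be loaded directly:
# from tensorflow.keras.applications import VGG16
# base = VGG16(weights='imagenet', include_top=False)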
Use Cases:
• Image classification
• Object detection
BERT (Bidirectional Encoder Representations from Transformers)
Overview:
• BERT is a transformer-based model introduced by Google AI for NLP tasks in the paper
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
• Unlike traditional models that process text in a unidirectional manner, BERT is trained
bidirectionally, meaning it looks at both the left and right context of a word in a
sentence.
• BERT is pre-trained on large corpora and can be fine-tuned for specific NLP tasks such as
text classification, question answering, and named entity recognition (NER).
Architecture:
• Bidirectional: BERT’s pre-training objective is to predict missing words from both left
and right context in the text, which helps capture rich contextual relationships.
• Pre-training Tasks:
o Masked Language Model (MLM): Random words in a sentence are masked, and
the model learns to predict them.
o Next Sentence Prediction (NSP): The model is trained to predict whether a pair
of sentences follow one another in the text.
Key Features:
• Bidirectional Context: BERT learns both left and right contexts, which improves its
understanding of word meanings.
• Pre-training and Fine-tuning: Pre-trained on vast amounts of text data and fine-tuned
for specific tasks.
• State-of-the-art: BERT has set new performance benchmarks for a variety of NLP tasks.
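To see the masked-language-model objective in action, the Hugging Face transformers library provides a fill-mask pipeline; the sketch below uses the public bert-base-uncased checkpoint, and the exact predictions and scores will vary.

from transformers import pipeline

# BERT's MLM head predicts the token hidden behind [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))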
Use Cases:
• Text classification
• Question answering
• Named entity recognition (NER)
1. Fine-tuning
Overview:
Fine-tuning is the process of taking a pre-trained model and continuing its training on your
dataset. The key idea is that the model has already learned useful representations from the
original dataset, and by training it on your task-specific data, you can adjust the model to
perform better for your problem.
Fine-tuning is typically preferred when:
• You have a smaller dataset, and training from scratch would lead to overfitting.
• The pre-trained model’s learned features are highly relevant to your task (e.g., using an
ImageNet-trained model for another image classification task).
Steps in Fine-tuning:
1. Load a Pre-trained Model: First, load a pre-trained model (e.g., ResNet, VGG, BERT) with
weights trained on a large dataset like ImageNet or a language corpus.
2. Replace or Modify the Output Layer: Replace the final classification layers (e.g., fully
connected layers) of the model with new layers suited for your task (e.g., for a different
number of classes in classification).
3. Freeze Early Layers: Initially, freeze the weights of the earlier layers of the model. These
layers capture generic features (such as edges in images or basic syntax in text) and
don’t need to be re-trained. Only the later layers will be fine-tuned.
4. Unfreeze Some Layers: Unfreeze the later layers or the entire model and continue
training on your data. This allows the model to adapt its learned features to your new
task. Fine-tuning is generally done with a lower learning rate to avoid forgetting
previously learned knowledge (catastrophic forgetting).
5. Train the Model: Train the model on your dataset. Typically, fine-tuning requires fewer
epochs than training from scratch because the model already has learned meaningful
representations.
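A hedged Keras sketch of these five steps for an image-classification task; the base model choice (ResNet50), class count, learning rates, and the train_data placeholder are all assumptions for illustration.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.optimizers import Adam

num_classes = 5                                        # placeholder for your task

# 1. Load a pre-trained model without its original classification head
base_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

# 2. Replace the output layer with one suited to the new task
model = models.Sequential([
    base_model,
    layers.Dense(num_classes, activation='softmax')
])

# 3. Freeze the early (pre-trained) layers and train only the new head
base_model.trainable = False
model.compile(optimizer=Adam(1e-3), loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_data, epochs=5)                      # train_data: your dataset

# 4-5. Unfreeze and continue training with a lower learning rate (fine-tuning)
base_model.trainable = True
model.compile(optimizer=Adam(1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_data, epochs=3)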
2. Feature Extraction
Overview:
Feature extraction is a process where you use a pre-trained model as a fixed feature extractor. In
this approach, you don’t modify the pre-trained model’s weights. Instead, you extract the
features learned by the pre-trained model and feed them into a new model or classifier (like a
dense layer or a logistic regression) that will perform the task on top of those features.
Feature extraction is typically preferred when:
• You don’t want to spend too much time or computational resources fine-tuning the
model.
• You have a small dataset and can’t afford to fine-tune the model without overfitting.
Steps in Feature Extraction:
1. Load a Pre-trained Model: Choose a model trained on a large dataset (e.g., ResNet,
VGG, BERT).
2. Freeze the Entire Pre-trained Model: Freeze all the layers of the pre-trained model, so
the weights are not updated during training.
3. Remove the Top Layers: Remove the final layers (usually fully connected layers) and add
new layers suitable for your task (e.g., a classifier for your dataset).
4. Train the New Classifier: Train only the new classifier (e.g., a fully connected layer) using
the features extracted by the pre-trained model.
5. Evaluate and Improve: You can evaluate the model’s performance and, if necessary,
make adjustments.
In short, use feature extraction when:
o You want to use the pre-trained model as a feature extractor without retraining the entire network.
o You only need to modify the last layer(s) for your task.
# Feature extraction: freeze every layer of the pre-trained base model
for layer in base_model.layers:
    layer.trainable = False
model.fit(train_data, epochs=5)                     # train only the new classifier head
# Fine-tuning: unfreeze the base and recompile with a lower learning rate
for layer in base_model.layers:
    layer.trainable = True
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_data, epochs=5)                     # continue training the whole network
Applications in Computer Vision:
1. Image Classification:
o Pre-trained models like ResNet, VGG, and Inception are often used for
classification tasks. These models have been trained on large datasets (e.g.,
ImageNet), and by fine-tuning them, they can be adapted for tasks like medical
image classification, product categorization, etc.
o Example: Identifying diseases in X-ray images. A model like ResNet can be fine-
tuned on a dataset of labeled medical images to detect diseases such as
pneumonia or tuberculosis.
2. Object Detection:
o YOLO (You Only Look Once) and Faster R-CNN are examples of pre-trained
models that can be fine-tuned for tasks like detecting specific objects in images.
These models are pre-trained on large datasets such as COCO and can be
transferred to detect a specific set of objects in a new dataset.
3. Semantic Segmentation:
4. Face Recognition:
o Pre-trained models like FaceNet are often used for face recognition tasks. These
models are fine-tuned on smaller datasets containing faces to adapt them for
specific applications like security or identity verification.
o Example: Face verification for secure access systems. After fine-tuning, these
models can verify if the person in front of the camera matches the stored face
data.
5. Style Transfer:
o Using models pre-trained on large image datasets, you can transfer the style of
one image to another. Neural Style Transfer is a popular technique in artistic
applications.
Applications in Natural Language Processing (NLP):
1. Text Classification:
2. Named Entity Recognition (NER):
o Example: Extracting entities from legal documents. Fine-tuned BERT models can identify relevant entities such as company names, legal terms, or case numbers.
3. Question Answering:
o BERT, RoBERTa, and ALBERT are often fine-tuned for specific question-answering
tasks. These models can answer questions from a given passage or context.
4. Machine Translation:
o GPT-3, T5, and MarianMT are models that can be fine-tuned for specific machine
translation tasks. Transfer learning helps by utilizing knowledge from pre-trained
models, allowing the translation of text between different languages.
5. Text Summarization:
6. Language Modeling:
o Models like GPT-2 and GPT-3 are pre-trained on large text corpora and can be
fine-tuned for specific language generation tasks. These models can generate
human-like text, which can be used for content creation, chatbots, and more.
o Example: Automated content generation for social media posts, blogs, or news
articles. Fine-tuning a model like GPT-3 on a specific domain (e.g., tech news) can
generate relevant and coherent articles.
7. Chatbots and Virtual Assistants:
o BERT, GPT, and other pre-trained transformer models are fine-tuned for building intelligent chatbots and virtual assistants. These systems can answer questions, provide recommendations, or perform specific tasks.
• Reduces Training Time: Since the model is already pre-trained on a large dataset, you
don’t need to start from scratch. Fine-tuning or feature extraction requires fewer epochs
and computational resources.