
Unit 2

Deep Learning and Neural Networks

Fundamentals of Neural Networks


Neural networks are the foundation of deep learning and artificial intelligence (AI). They are
inspired by the structure of the human brain and consist of interconnected layers of neurons
that process information.

1. Perceptron (The Simplest Neural Network)

What is a Perceptron?

A Perceptron is the most basic type of artificial neural network, designed to classify data into
two categories (binary classification). It is a single-layer neural network.

Structure of a Perceptron

A perceptron consists of:

• Input Layer → Takes the input features x1, x2, ..., xn.

• Weights (w) → Each input has an associated weight wi.

• Summation Function → Computes a weighted sum of the inputs plus a bias term b,
z = w1·x1 + w2·x2 + ... + wn·xn + b; a step (threshold) activation applied to z then produces
the binary output, as the sketch below illustrates.

Limitations of the Perceptron

✔ Works well for linearly separable problems (e.g., AND, OR logic gates).
✖ Fails to solve non-linearly separable problems (e.g., XOR problem).
✖ Does not support multi-class classification.
✖ Cannot learn complex patterns due to its simple structure.

To overcome these limitations, multi-layer networks and backpropagation are introduced.

2. Backpropagation (Training Neural Networks)


Backpropagation (Backward Propagation of Errors) is an algorithm used to train deep neural
networks by adjusting weights based on the error (difference between predicted and actual
output).

Why Do We Need Backpropagation?

• A single-layer perceptron cannot solve non-linear problems.

• Multi-layer networks require a way to learn weights.

• Backpropagation efficiently updates weights by propagating errors backward from the


output layer to earlier layers.

Steps in Backpropagation

1. Forward Pass: Compute the output using the current weights.

2. Compute Loss: Calculate how far the prediction is from the actual value.

3. Backward Pass:

o Compute the gradient of the loss w.r.t. weights using the chain rule.

o Adjust weights in the opposite direction of the gradient to reduce loss.

4. Repeat until convergence.
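A hedged NumPy sketch of these four steps for a tiny two-layer network (sigmoid activations, squared-error loss; the toy data, layer sizes, and learning rate are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 3))                   # 4 samples, 3 input features (toy data)
y = rng.random((4, 1))                   # target values

W1, b1 = rng.standard_normal((3, 5)), np.zeros(5)   # input -> hidden (5 units)
W2, b2 = rng.standard_normal((5, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.1

for epoch in range(1000):
    # 1. Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # 2. Compute loss (mean squared error)
    loss = np.mean((y_hat - y) ** 2)
    # 3. Backward pass: chain rule from the output back to earlier layers
    #    (constant factors are folded into the learning rate)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d_hidden, d_hidden.sum(axis=0)
    # 4. Update weights in the opposite direction of the gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(round(loss, 4))                    # loss after training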


3. Gradient Descent (Optimization Algorithm)
Gradient Descent is an optimization algorithm used to minimize the loss function by adjusting
weights iteratively in the direction that reduces the error.

Types of Gradient Descent

1. Batch Gradient Descent (BGD)

o Uses all training data to compute the gradient.

o Slow for large datasets but provides a stable update.


2. Stochastic Gradient Descent (SGD)

o Updates weights after each training sample.

o Faster but has high variance (noisy updates).

3. Mini-Batch Gradient Descent

o Uses a small batch of data to update weights.

o Balances stability and speed.
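The three variants differ only in how many samples feed each weight update. A minimal sketch, assuming a linear model with squared-error loss and toy data (the dataset, learning rate, and batch size here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((100, 3)), rng.random(100)   # toy dataset
w, lr = np.zeros(3), 0.1

def gradient(Xb, yb, w):
    # Gradient of mean squared error for a linear model y_hat = Xb @ w
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

batch_size = 100   # 100 = batch GD, 1 = SGD, e.g. 16 = mini-batch GD
for epoch in range(50):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= lr * gradient(X[batch], y[batch], w)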

Gradient Descent in Neural Networks

During backpropagation, weights are updated using gradient descent:

1. Compute error gradient for each layer.

2. Update weights:

W^(l) = W^(l) − α · ∂L/∂W^(l)

3. Repeat until convergence (loss reaches a minimum).

Activation functions: ReLU, Sigmoid, Tanh


An activation function is a mathematical function applied to the output of a neuron. It
introduces non-linearity into the model, allowing the network to learn and represent complex
patterns in the data. Without this non-linearity feature, a neural network would behave like a
linear regression model, no matter how many layers it has.

The activation function decides whether a neuron should be activated by calculating the
weighted sum of inputs and adding a bias term. This helps the model make complex decisions
and predictions by introducing non-linearities to the output of each neuron.

Why is Non-Linearity Important in Neural Networks?

Neural networks consist of neurons that operate using weights, biases, and activation
functions.
In the learning process, these weights and biases are updated based on the error produced at
the output—a process known as backpropagation. Activation functions enable backpropagation
by providing gradients that are essential for updating the weights and biases.

Without non-linearity, even deep networks would be limited to solving only simple, linearly
separable problems. Activation functions empower neural networks to model highly complex
data distributions and solve advanced deep learning tasks. Adding non-linear activation
functions introduces flexibility and enables the network to learn more complex and abstract
patterns from the data.

Mathematical Proof of Need of Non-Linearity in Neural Networks

To illustrate the need for non-linearity in neural networks with a specific example, let’s consider
a network with two input nodes (i1 and i2), a single hidden layer containing one neuron (h1),
and an output neuron (out). We will use w1, w2 as the weights connecting the inputs to the
hidden neuron, and w5 as the weight connecting the hidden neuron to the output. We’ll also
include biases (b1 for the hidden neuron and b2 for the output neuron) to complete the model.

Network Structure

1. Input Layer: Two inputs i1 and i2.

2. Hidden Layer: One neuron h1.

3. Output Layer: One output neuron.


Mathematical Model Without Non-linearity

Hidden Layer Calculation:

The input to the hidden neuron h1 is calculated as a weighted sum of the inputs plus a
bias:

z_h1 = w1·i1 + w2·i2 + b1

Output Layer Calculation:

The output neuron is then a weighted sum of the hidden neuron’s output plus a bias:

output = w5·h1 + b2

If h1 were directly the output of z_h1 (no activation function applied, i.e., h1 = z_h1),
then substituting h1 in the output equation yields:

output = w5·(w1·i1 + w2·i2 + b1) + b2

output = w5·w1·i1 + w5·w2·i2 + w5·b1 + b2

This shows that the output neuron is still a linear combination of the inputs i1 and i2.
Thus, the entire network, despite having multiple layers and weights, effectively performs a
linear transformation, equivalent to a single-layer perceptron.

Introducing Non-Linearity in Neural Network

To introduce non-linearity, let’s use a non-linear activation function σ for the hidden neuron. A
common choice is the ReLU function, defined as σ(x) = max(0, x).

Updated Hidden Layer Calculation:

h1 = σ(z_h1) = σ(w1·i1 + w2·i2 + b1)

Output Layer Calculation with Non-linearity:

output = w5·σ(w1·i1 + w2·i2 + b1) + b2

Effect of Non-linearity

The inclusion of the ReLU activation function σ allows h1 to introduce a non-linear decision
boundary in the input space. This non-linearity, illustrated numerically after the list below,
enables the network to learn more complex patterns that are not possible with a purely linear
model, such as:

• Modeling functions that are not linearly separable.

• Increasing the capacity of the network to form multiple decision boundaries based on
the combination of weights and biases.
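A short numeric check of the argument above (the weights, biases, and inputs are arbitrary assumed values): without an activation, the two-layer computation collapses into a single linear function of the inputs, while ReLU breaks that equivalence.

import numpy as np

w1, w2, w5, b1, b2 = 0.4, -0.3, 2.0, 0.1, 0.5   # assumed weights and biases
relu = lambda x: np.maximum(0, x)

def net_linear(i1, i2):
    h1 = w1 * i1 + w2 * i2 + b1            # no activation on the hidden neuron
    return w5 * h1 + b2

def net_relu(i1, i2):
    h1 = relu(w1 * i1 + w2 * i2 + b1)      # ReLU activation on the hidden neuron
    return w5 * h1 + b2

# The linear network equals w5*w1*i1 + w5*w2*i2 + (w5*b1 + b2) for every input:
print(net_linear(1.0, 2.0), w5*w1*1.0 + w5*w2*2.0 + w5*b1 + b2)   # identical values
# With ReLU the output is piecewise linear, no longer a single linear map:
print(net_relu(1.0, 2.0), net_relu(-1.0, 2.0))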

Types of Activation Functions in Deep Learning

1. Linear Activation Function

The Linear Activation Function resembles a straight line defined by y = x. No matter how many
layers the neural network contains, if they all use linear activation functions, the output is a
linear combination of the input.

• The range of the output spans (−∞, +∞).

• The linear activation function is typically used in just one place: the output layer.

• Using linear activation across all layers limits the network’s ability to learn complex
patterns.

Linear activation functions are useful for specific tasks but must be combined with non-linear
functions to enhance the neural network’s learning and predictive capabilities.
Linear Activation Function or Identity Function returns the input as the output

2. Non-Linear Activation Functions

1. Sigmoid Function

The Sigmoid Activation Function is characterized by an ‘S’ shape. It is mathematically defined
as A = 1 / (1 + e^(−x)). This formula ensures a smooth and continuous output that is essential
for gradient-based optimization methods.

• It allows neural networks to handle and model complex patterns that linear equations
cannot.

• The output ranges between 0 and 1, hence useful for binary classification.

• The function exhibits a steep gradient when x values are between -2 and 2. This
sensitivity means that small changes in input x can cause significant changes in output y,
which is critical during the training process.
Sigmoid or Logistic Activation Function Graph

2. Tanh Activation Function

The tanh function, or hyperbolic tangent function, is a scaled and shifted version of the sigmoid
whose output stretches across both sides of the x-axis (from −1 to +1). It is defined as:

f(x) = tanh(x) = 2 / (1 + e^(−2x)) − 1

Alternatively, it can be expressed using the sigmoid function:

tanh(x) = 2 × sigmoid(2x) − 1

• Value Range: Outputs values from -1 to +1.

• Non-linear: Enables modeling of complex data patterns.

• Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.
Tanh Activation Function

3. ReLU (Rectified Linear Unit) Function

ReLU activation is defined by A(x) = max(0, x). This means that if the input x is positive,
ReLU returns x; if the input is negative, it returns 0.

• Value Range: [0, ∞), meaning the function only outputs non-negative values.

• Nature: It is a non-linear activation function, allowing neural networks to learn complex


patterns and making backpropagation more efficient.

• Advantage over other Activations: ReLU is less computationally expensive than tanh and
sigmoid because it involves simpler mathematical operations. Only a few neurons are
activated at a time, making the network sparse, efficient, and easy to compute.
ReLU Activation Function
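For reference, the three activation functions can be written in a few lines of NumPy (a sketch; the formulas follow the definitions above):

import numpy as np

def sigmoid(x):
    # S-shaped curve, output in (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # Zero-centered, output in (-1, 1); equals 2*sigmoid(2x) - 1
    return np.tanh(x)

def relu(x):
    # Returns x for positive inputs, 0 otherwise
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x))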

Optimization techniques: Adam, RMSprop, SGD


Adam Optimizer

Adaptive Moment Estimation (Adam) is an optimization technique for gradient descent.
The method is very efficient when working with large problems involving many data points or
parameters, and it requires little memory. Intuitively, it is a combination of the
‘gradient descent with momentum’ algorithm and the ‘RMSProp’ algorithm.

How Adam works?

Adam optimizer involves a combination of two gradient descent methodologies:

Momentum:

This algorithm is used to accelerate gradient descent by taking into consideration the
‘exponentially weighted average’ of the gradients. Using averages makes the algorithm
converge towards the minima at a faster pace.

w_{t+1} = w_t − α·m_t

where,

m_t = β·m_{t−1} + (1 − β)·[∂L/∂w_t]

m_t = aggregate of gradients at time t [current] (initially, m_t = 0)

m_{t−1} = aggregate of gradients at time t−1 [previous]

w_t = weights at time t

w_{t+1} = weights at time t+1

α = learning rate at time t

∂L/∂w_t = gradient of the loss function with respect to the weights at time t

β = moving-average parameter (constant, 0.9)

Root Mean Square Propagation (RMSP):

Root mean square prop or RMSprop is an adaptive learning algorithm that tries to improve
AdaGrad. Instead of taking the cumulative sum of squared gradients like in AdaGrad, it takes the
‘exponential moving average’.

w_{t+1} = w_t − (α_t / √(v_t + ε)) · [∂L/∂w_t]

where,

v_t = β·v_{t−1} + (1 − β)·[∂L/∂w_t]²

w_t = weights at time t

w_{t+1} = weights at time t+1

α_t = learning rate at time t

∂L/∂w_t = gradient of the loss function with respect to the weights at time t

v_t = exponentially weighted moving average of squared past gradients (initially, v_t = 0)

β = moving-average parameter (constant, 0.9)

ε = a small positive constant (10⁻⁸)

NOTE: Time (t) could be interpreted as an Iteration (i).

Adam Optimizer inherits the strengths or the positive attributes of the above two methods and
builds upon them to give a more optimized gradient descent.

Here, we control the rate of gradient descent in such a way that there is minimum oscillation
when it reaches the global minimum, while taking big enough steps (step-size) to pass the
local-minima hurdles along the way. Combining the features of the above methods thus lets
the optimizer reach the global minimum efficiently.

Mathematical Aspect of Adam Optimizer

Taking the formulas used in the above two methods, we get

m_t = β1·m_{t−1} + (1 − β1)·[∂L/∂w_t]

v_t = β2·v_{t−1} + (1 − β2)·[∂L/∂w_t]²

Parameters Used :

1. ε = a small positive constant to avoid a ‘division by zero’ error when v_t → 0 (10⁻⁸)

2. β1 & β2 = decay rates of average of gradients in the above two methods. (β1 = 0.9 & β2 =
0.999)

3. α — Step size parameter / learning rate (0.001)


Since m_t and v_t are both initialized to 0 (as in the above methods), they tend to be ‘biased
towards 0’, because both β1 and β2 ≈ 1. This optimizer fixes the problem by computing
‘bias-corrected’ m_t and v_t. This also helps control the weights while approaching the global
minimum, preventing large oscillations near it. The formulas used are:

m̂_t = m_t / (1 − β1^t)

v̂_t = v_t / (1 − β2^t)

Intuitively, we are adapting to the gradient descent after every iteration so that it remains
controlled and unbiased throughout the process, hence the name Adam.

Now, instead of the raw moment estimates m_t and v_t, we use the bias-corrected estimates
m̂_t and v̂_t. Putting them into the general weight-update equation, we get

w_{t+1} = w_t − α · m̂_t / (√v̂_t + ε)
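Putting the momentum term, the RMSP term, the bias corrections, and the final update together, one Adam step per parameter can be sketched as follows (NumPy; the toy loss and `grad_fn` are illustrative assumptions, not part of the text):

import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially weighted averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (m and v start at 0, so early estimates are biased toward 0)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage with a placeholder gradient function (assumed toy loss ||w - 1||^2):
w = np.zeros(3)
m, v = np.zeros_like(w), np.zeros_like(w)
grad_fn = lambda w: 2 * (w - 1.0)
for t in range(1, 2001):
    w, m, v = adam_step(w, grad_fn(w), m, v, t)
print(w.round(3))   # approaches [1, 1, 1]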

Performance:

Building upon the strengths of the previous methods, the Adam optimizer generally gives much
higher performance than those methods used alone and outperforms them by a considerable
margin, yielding a more optimized gradient descent. The plot referenced here (comparing
optimizers) depicts how Adam outperforms the other optimizers in terms of training cost (low)
and performance (high).
What is RMSProp Optimizer?

RMSProp is an adaptive learning rate optimization algorithm designed to improve the


performance and speed of training deep learning models. It is a variant of the gradient descent
algorithm, which adapts the learning rate for each parameter individually by considering the
magnitude of recent gradients for those parameters. This adaptive nature helps in dealing with
the challenges of non-stationary objectives and sparse gradients commonly encountered in
deep learning tasks.

RMSProp was introduced by Geoffrey Hinton. The algorithm was developed to address the
limitations of previous optimization methods such as SGD (Stochastic Gradient Descent) and
AdaGrad. While SGD uses a constant learning rate, which can be inefficient, and AdaGrad
reduces the learning rate too aggressively, RMSProp strikes a balance by adapting the learning
rates based on a moving average of squared gradients. This approach helps in maintaining a
balance between efficient convergence and stability during the training process, making
RMSProp a widely used optimization algorithm in modern deep learning.

How RMSProp Works?

The core idea behind RMSProp is to keep a moving average of the squared gradients to
normalize the gradient updates. By doing so, RMSProp prevents the learning rate from
becoming too small, which was a drawback in AdaGrad, and ensures that the updates are
appropriately scaled for each parameter. This mechanism allows RMSProp to perform well even
in the presence of non-stationary objectives, making it suitable for training deep learning
models.
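A minimal sketch of a single RMSProp update (NumPy; the default decay rate, learning rate, and the example gradient values are assumptions):

import numpy as np

def rmsprop_step(w, grad, v, alpha=0.001, beta=0.9, eps=1e-8):
    # Exponential moving average of the squared gradient
    v = beta * v + (1 - beta) * grad ** 2
    # Scale the learning rate per parameter by the root of that average
    w = w - alpha * grad / np.sqrt(v + eps)
    return w, v

# Usage with illustrative values:
w, v = np.zeros(3), np.zeros(3)
grad = np.array([0.2, -0.1, 0.05])
w, v = rmsprop_step(w, grad, v)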

Stochastic Gradient Descent (SGD):

Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for
optimizing machine learning models. It addresses the computational inefficiency of traditional
Gradient Descent methods when dealing with large datasets in machine learning projects.

In SGD, instead of using the entire dataset for each iteration, only a single random training
example (or a small batch) is selected to calculate the gradient and update the model
parameters. This random selection introduces randomness into the optimization process, hence
the term “stochastic” in Stochastic Gradient Descent.

The advantage of using SGD is its computational efficiency, especially when dealing with large
datasets. By using a single example or a small batch, the computational cost per iteration is
significantly reduced compared to traditional Gradient Descent methods that require processing
the entire dataset.
Stochastic Gradient Descent Algorithm

• Initialization: Randomly initialize the parameters of the model.

• Set Parameters: Determine the number of iterations and the learning rate (alpha) for
updating the parameters.

• Stochastic Gradient Descent Loop: Repeat the following steps until the model converges
or reaches the maximum number of iterations:

o Shuffle the training dataset to introduce randomness.

o Iterate over each training example (or a small batch) in the shuffled order.

o Compute the gradient of the cost function with respect to the model parameters
using the current training
example (or batch).

o Update the model parameters by taking a step in the direction of the negative
gradient, scaled by the learning rate.

o Evaluate the convergence criteria, such as the change in the cost function
between iterations.

• Return Optimized Parameters: Once the convergence criteria are met or the maximum
number of iterations is reached, return the optimized model parameters.
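A hedged sketch of this loop for a simple linear model (NumPy; the synthetic data, learning rate, and epoch count are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 2))                         # toy inputs (assumed)
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.standard_normal(200)   # toy targets
w, b, lr = np.zeros(2), 0.0, 0.05                # initialization and learning rate

for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle to introduce randomness
    for i in order:                              # one random sample per update
        err = (X[i] @ w + b) - y[i]              # prediction error for this sample
        w -= lr * err * X[i]                     # gradient step for the weights
        b -= lr * err                            # gradient step for the bias

print(w.round(2), round(b, 2))                   # close to the generating weights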

In SGD, since only one sample from the dataset is chosen at random for each iteration, the path
taken by the algorithm to reach the minima is usually noisier than in typical Gradient Descent.
That does not matter much: the exact path is irrelevant as long as we reach the minimum, and
SGD does so with much cheaper iterations and, often, a significantly shorter training time.

The path taken by Batch Gradient Descent is shown below:


Batch gradient optimization path

A path taken by Stochastic Gradient Descent looks as follows –


stochastic gradient optimization path

One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it
usually takes a higher number of iterations to reach the minima, owing to the randomness in
its descent. Even though it requires more iterations to reach the minima than typical Gradient
Descent, it is still computationally much less expensive. Hence, in most scenarios, SGD is
preferred over Batch Gradient Descent for optimizing a learning algorithm.
Advanced Deep Learning Architectures

Convolutional Neural Networks (CNNs) for image data


Convolutional Neural Networks (CNNs) are a specialized type of deep learning model designed
primarily for analyzing image data. Unlike traditional fully connected neural networks, CNNs use
convolutional layers to extract spatial and hierarchical features from images, making them
highly effective for tasks such as image classification, object detection, and segmentation.

1. Key Components of CNNs

a) Convolutional Layer

• Applies a set of filters (kernels) to the input image to detect patterns such as edges,
textures, and shapes.

• Each filter slides over the input image and performs element-wise multiplication
followed by summation, producing a feature map.

• Example: Detecting horizontal or vertical edges in an image.

b) Activation Function (ReLU - Rectified Linear Unit)

• Introduces non-linearity by converting negative values to zero while keeping positive


values unchanged.

• Helps the network learn complex patterns in images.

c) Pooling Layer (Max Pooling / Average Pooling)

• Reduces the spatial dimensions of feature maps, making the network more efficient and
reducing overfitting.

• Max Pooling selects the maximum value in a region, while Average Pooling takes the
mean.

d) Fully Connected (Dense) Layer

• After convolutional and pooling layers, extracted features are passed to fully connected
layers to make final predictions.

• The last layer often uses softmax activation for multi-class classification or sigmoid for
binary classification.
2. Working of CNNs for Image Data

1. Input Image: A digital image represented as a matrix of pixel values (e.g., 28×28 for
grayscale or 224×224×3 for RGB).

2. Feature Extraction: The convolutional layers apply filters to extract low- and high-level
features.

3. Dimensionality Reduction: Pooling layers reduce the feature map size while retaining
important information.

4. Classification: The final fully connected layer outputs class probabilities.
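A compact Keras sketch of this pipeline (the 28×28 grayscale input, layer widths, and the 10-class softmax output are illustrative assumptions):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                  # grayscale input image
    layers.Conv2D(32, (3, 3), activation='relu'),     # feature extraction
    layers.MaxPooling2D((2, 2)),                      # dimensionality reduction
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),              # fully connected layer
    layers.Dense(10, activation='softmax')            # class probabilities
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()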

3. CNN Architectures

• LeNet-5: One of the first CNN architectures for digit recognition.

• AlexNet: Introduced deep CNNs with ReLU and dropout.

• VGGNet: Uses deep layers (VGG-16, VGG-19) with small filters (3×3).

• ResNet: Introduced skip connections to solve the vanishing gradient problem.

• EfficientNet: Scales CNNs efficiently for better performance.

4. Applications of CNNs

• Image Classification (e.g., Cats vs. Dogs)

• Object Detection (e.g., YOLO, Faster R-CNN)

• Image Segmentation (e.g., U-Net for medical images)

• Face Recognition (e.g., FaceNet, DeepFace)

• Self-Driving Cars (e.g., road and obstacle detection)

Recurrent Neural Networks (RNNs), LSTM, GRU for sequential data

Recurrent Neural Networks (RNNs) are specialized neural networks designed for processing
sequential data, such as time series, natural language, and speech. Unlike traditional
feedforward networks, RNNs have memory, enabling them to learn from past inputs while
making predictions.

1. Recurrent Neural Networks (RNNs)

Key Features of RNNs:

• Suitable for sequential data (e.g., text, speech, stock prices).

• Maintains hidden states to store past information.

• Shares parameters across time steps.

Working of RNNs:

• Takes an input sequence x_t at each time step t.

• The hidden state h_t is updated using the previous state h_{t−1} and the current input x_t:

h_t = f(W_h·h_{t−1} + W_x·x_t + b)

• The output y_t is generated based on h_t.

Challenges with RNNs:

Vanishing Gradient Problem: When training long sequences, gradients become too small,
making it hard to learn long-term dependencies.

Exploding Gradient Problem: Gradients become too large, leading to instability.

2. Long Short-Term Memory (LSTM) Networks


LSTMs solve the vanishing gradient problem by introducing a memory cell and gates to control
information flow.

Key Components of LSTMs:

Forget Gate: Decides what information to remove from memory.

Input Gate: Determines what new information to store.

Cell State: Stores long-term dependencies.

Output Gate: Controls what to output.

LSTM Equations:

f_t = σ(W_f·[h_{t−1}, x_t] + b_f)    (Forget Gate)

i_t = σ(W_i·[h_{t−1}, x_t] + b_i)    (Input Gate)

C̃_t = tanh(W_c·[h_{t−1}, x_t] + b_c)    (New Memory)

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t    (Update Cell State)

o_t = σ(W_o·[h_{t−1}, x_t] + b_o)    (Output Gate)

h_t = o_t ∗ tanh(C_t)    (New Hidden State)

Advantages of LSTM: Learns long-term dependencies and solves vanishing gradients.


Applications: Speech recognition, text generation, stock market prediction.

3. Gated Recurrent Unit (GRU)

GRUs are a simplified version of LSTMs with fewer parameters, making them computationally
efficient.

Key Differences Between GRU and LSTM:

Feature                   | LSTM                   | GRU
Memory Cell               | Yes                    | No
Gates                     | Forget, Input, Output  | Reset, Update
Computational Complexity  | Higher                 | Lower

GRU Equations:

r_t = σ(W_r·[h_{t−1}, x_t] + b_r)    (Reset Gate)

z_t = σ(W_z·[h_{t−1}, x_t] + b_z)    (Update Gate)

h̃_t = tanh(W_h·[r_t ∗ h_{t−1}, x_t] + b_h)    (New Hidden State Candidate)

h_t = (1 − z_t) ∗ h_{t−1} + z_t ∗ h̃_t    (Final Hidden State)

Advantages of GRU: Faster than LSTM, performs well on short sequences.

Applications: Machine translation, text summarization, chatbot responses.
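A hedged Keras sketch showing how an LSTM or GRU layer slots into a sequence model (the vocabulary size, sequence length, layer widths, and binary output are assumptions for illustration):

from tensorflow.keras import layers, models

def build_model(cell='lstm', vocab_size=10000, seq_len=100):
    rnn = layers.LSTM(64) if cell == 'lstm' else layers.GRU(64)
    return models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 128),       # token ids -> dense vectors
        rnn,                                     # recurrent layer with gated memory
        layers.Dense(1, activation='sigmoid')    # e.g. binary sentiment output
    ])

lstm_model = build_model('lstm')
gru_model = build_model('gru')                   # fewer parameters than the LSTM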

4. Applications of RNN, LSTM, and GRU


Natural Language Processing (Chatbots, Sentiment Analysis, Machine Translation)

Speech Recognition (Google Assistant, Siri, Alexa)

Stock Price Prediction (Time Series Forecasting)

Handwriting Recognition (e.g., Google Handwriting Input)

Autoencoders and Variational Autoencoders (VAEs)


Autoencoders (AEs) and Variational Autoencoders (VAEs) are unsupervised neural networks
used for tasks like dimensionality reduction, data compression, and feature learning. While both
share the same basic architecture, VAEs introduce a probabilistic approach that makes them
more powerful for generating new data and learning more expressive latent representations.

1. Autoencoders (AEs)

Architecture of Autoencoders:

An autoencoder consists of two main parts:

1. Encoder:

o The encoder maps the input data x to a lower-dimensional latent space
representation z.

o The goal of the encoder is to compress the input while retaining the essential
information.

2. Decoder:

o The decoder maps the latent space representation z back to the original data
space, aiming to reconstruct the input x̂.

o The goal is to minimize the reconstruction error (difference between the input
and the reconstructed data).

Working of Autoencoders:

• Input: x

• Encoder: z = f(x)

• Decoder: x̂ = g(z)

The loss function typically used for training autoencoders is mean squared error (MSE) or
binary cross-entropy:

Loss = ‖x − x̂‖²
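A minimal Keras sketch of this encoder–decoder pair (the 784-dimensional input, e.g. a flattened 28×28 image, and the 32-dimensional latent space are assumed for illustration):

from tensorflow.keras import layers, models

input_dim, latent_dim = 784, 32

# Encoder: compress x into the latent representation z
encoder_in = layers.Input(shape=(input_dim,))
z = layers.Dense(latent_dim, activation='relu')(encoder_in)

# Decoder: reconstruct x_hat from z
x_hat = layers.Dense(input_dim, activation='sigmoid')(z)

autoencoder = models.Model(encoder_in, x_hat)
autoencoder.compile(optimizer='adam', loss='mse')   # reconstruction loss ||x - x_hat||^2
# autoencoder.fit(x_train, x_train, epochs=10)      # input is also the target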

Applications of Autoencoders:

1. Dimensionality Reduction: Autoencoders can learn compact, lower-dimensional


representations of data.

2. Denoising: A denoising autoencoder can learn to remove noise from the input data and
output a clean version.

3. Anomaly Detection: Since the autoencoder learns a compressed representation of


normal data, it can be used to detect anomalies (by checking for poor reconstruction of
outlier data).

4. Image Compression: By training on large image datasets, autoencoders can be used for
image compression.

2. Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) are a probabilistic version of autoencoders. They introduce a


more structured approach to the latent space and ensure that the latent representations are
smooth and continuous. This allows VAEs to generate new data points by sampling from the
latent space, making them suitable for generative tasks.

Key Differences Between AEs and VAEs:

1. Probabilistic Approach: VAEs assume that the data comes from a latent variable model
and use a probabilistic encoding of the input data.

2. Latent Space Distribution: While regular autoencoders map inputs directly to a latent
space, VAEs introduce randomness by learning a distribution over the latent space.
Typically, this distribution is a Gaussian distribution.
3. Regularization: VAEs include a regularization term (KL divergence) that forces the
learned latent space distribution to be close to a standard normal distribution.

Architecture of VAEs:

A VAE consists of the following components:

1. Encoder: Similar to autoencoders, but instead of learning a deterministic encoding, the
encoder learns the parameters (mean μ and variance σ²) of a Gaussian distribution in the
latent space.

2. Sampling: During training, we sample from this Gaussian distribution, z ∼ N(μ, σ²), making
the encoder probabilistic.

3. Decoder: Decodes the sampled latent variable z back into the original data space.

Mathematical Formulation:

• The encoder learns the parameters μ and σ² of the distribution q(z|x).

• The decoder learns to reconstruct the data from p(x|z).

The VAE loss function is a combination of:

1. Reconstruction Loss: Similar to the autoencoder, this is the reconstruction error


between the original and the reconstructed data.

2. KL Divergence: This regularizes the latent space distribution, encouraging it to be close


to the standard normal distribution.

The total loss function is:

L = E_{q(z|x)}[log p(x|z)] − D_KL[q(z|x) ‖ p(z)]

Where:

• D_KL is the Kullback–Leibler divergence, measuring the difference between the learned
distribution q(z|x) and the prior p(z) (often a standard Gaussian).

• E_{q(z|x)}[log p(x|z)] is the reconstruction term.
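A sketch of the sampling step and the two loss terms in TensorFlow (a squared-error reconstruction stands in for −log p(x|z); the shapes and function names are assumptions, and this is not a full training loop):

import tensorflow as tf

def sample_z(mu, log_var):
    # Reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I)
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction term (squared error stands in for -log p(x|z))
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
    # Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I)
    kl = -0.5 * tf.reduce_sum(1 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    # Minimizing recon + kl is equivalent to maximizing the loss L (the ELBO) above
    return tf.reduce_mean(recon + kl)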
Applications of VAEs:

1. Data Generation: VAEs are particularly well-suited for generating new, synthetic data
(e.g., generating new images similar to the training data).

2. Image Synthesis: VAEs can generate new images that resemble the data they were
trained on by sampling from the latent space.

3. Anomaly Detection: Like autoencoders, VAEs can be used to detect anomalies based on
reconstruction error.

4. Latent Space Exploration: Because the latent space is continuous and structured, it can
be manipulated for tasks like interpolation and latent space visualization.

Transfer Learning
Transfer Learning is a technique in machine learning where a model developed for a particular
task is reused (or adapted) for a different, but related task. This approach leverages the
knowledge gained from a previously trained model to solve new problems, often requiring
fewer resources, such as data and computational power, compared to training a model from
scratch.

Transfer learning is particularly useful when:

• You have limited data for the task at hand.

• Training from scratch would be computationally expensive.

• The source task and the target task are similar.

Pre-trained models: ResNet, VGG, BERT


Pre-trained models are widely used in deep learning, especially for transfer learning. These
models are trained on large datasets and are often used as a starting point for solving more
specific tasks in fields like image recognition, object detection, and natural language processing
(NLP). Below, we’ll explore the details of some of the most popular pre-trained models: ResNet,
VGG, and BERT.
1. ResNet (Residual Networks)

Overview:

• ResNet is a deep convolutional neural network (CNN) architecture introduced by


Microsoft Research in the paper "Deep Residual Learning for Image Recognition".

• It introduced the concept of residual connections (skip connections) to solve the


problem of vanishing gradients, which allows deeper networks to be trained effectively.

• ResNet is well-suited for image classification tasks and is known for its impressive
performance on benchmark datasets like ImageNet.

Architecture:

• ResNet consists of a series of residual blocks in which the input is added to the output of
the block (after some processing). This makes it easier for the network to learn identity
mappings, which helps the model avoid the degradation problem that occurs with
deeper networks.

• ResNet has various versions with different depths, e.g., ResNet-18, ResNet-34, ResNet-
50, ResNet-101, and ResNet-152. The number refers to the number of layers in the
network.

Key Features:

• Residual Connections: Skip connections between layers, allowing gradients to flow more
easily during backpropagation.

• Very Deep Networks: ResNet can be extremely deep (e.g., ResNet-152 with 152 layers).

• High Performance: It performs very well on various image classification and object
detection tasks.

Use Cases:

• Image classification

• Object detection

• Semantic segmentation

• Transfer learning in computer vision tasks


2. VGG (Visual Geometry Group)

Overview:

• VGG is a CNN architecture developed by the Visual Geometry Group (VGG) at the
University of Oxford. The VGG network was a major breakthrough due to its simplicity
and depth.

• VGG16 and VGG19 are the most commonly used models, named after the number of
layers in the network (16 and 19 layers, respectively).

• The architecture is based on a series of convolutional layers followed by max-pooling


and fully connected layers at the end.

Architecture:

• The network is very deep, with several convolutional layers stacked on top of each other.
The key characteristic of the VGG models is the use of small 3x3 convolutional filters
throughout the network.

• VGG16 has 16 layers with weights, while VGG19 has 19 layers.

Key Features:

• Deep Architecture: VGG was one of the first to demonstrate that increasing depth could
significantly improve performance.

• Simplicity: The VGG architecture uses very simple 3x3 filters and max-pooling layers,
making it straightforward to implement.

• Large Model Size: VGG models are large with a high number of parameters, making
them computationally expensive.

Use Cases:

• Image classification

• Object detection

• Fine-grained image recognition tasks

3. BERT (Bidirectional Encoder Representations from Transformers)

Overview:
• BERT is a transformer-based model introduced by Google AI for NLP tasks in the paper
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".

• Unlike traditional models that process text in a unidirectional manner, BERT is trained
bidirectionally, meaning it looks at both the left and right context of a word in a
sentence.

• BERT is pre-trained on large corpora and can be fine-tuned for specific NLP tasks such as
text classification, question answering, and named entity recognition (NER).

Architecture:

• Transformers: BERT is based on the Transformer architecture, which uses self-attention


mechanisms to process text efficiently.

• Bidirectional: BERT’s pre-training objective is to predict missing words from both left
and right context in the text, which helps capture rich contextual relationships.

• Pre-training Tasks:

o Masked Language Model (MLM): Random words in a sentence are masked, and
the model learns to predict them.

o Next Sentence Prediction (NSP): The model is trained to predict whether a pair
of sentences follow one another in the text.

Key Features:

• Bidirectional Context: BERT learns both left and right contexts, which improves its
understanding of word meanings.

• Pre-training and Fine-tuning: Pre-trained on vast amounts of text data and fine-tuned
for specific tasks.

• State-of-the-art: BERT has set new performance benchmarks for a variety of NLP tasks.

Use Cases:

• Text classification (e.g., sentiment analysis)

• Named entity recognition (NER)

• Question answering (e.g., SQuAD dataset)

• Text generation
• Language translation

Fine-tuning and feature extraction


When using pre-trained models, there are two main approaches you can take: fine-tuning and
feature extraction. These techniques allow you to leverage the knowledge learned by a pre-
trained model and adapt it to your specific task, especially when you have limited data or
computational resources.

1. Fine-tuning

Overview:

Fine-tuning is the process of taking a pre-trained model and continuing its training on your
dataset. The key idea is that the model has already learned useful representations from the
original dataset, and by training it on your task-specific data, you can adjust the model to
perform better for your problem.

Fine-tuning is especially useful when:

• You have a smaller dataset, and training from scratch would lead to overfitting.

• The pre-trained model’s learned features are highly relevant to your task (e.g., using an
ImageNet-trained model for another image classification task).

Steps for Fine-tuning:

1. Load a Pre-trained Model: First, load a pre-trained model (e.g., ResNet, VGG, BERT) with
weights trained on a large dataset like ImageNet or a language corpus.

2. Replace or Modify the Output Layer: Replace the final classification layers (e.g., fully
connected layers) of the model with new layers suited for your task (e.g., for a different
number of classes in classification).

3. Freeze Early Layers: Initially, freeze the weights of the earlier layers of the model. These
layers capture generic features (such as edges in images or basic syntax in text) and
don’t need to be re-trained. Only the later layers will be fine-tuned.

4. Unfreeze Some Layers: Unfreeze the later layers or the entire model and continue
training on your data. This allows the model to adapt its learned features to your new
task. Fine-tuning is generally done with a lower learning rate to avoid forgetting
previously learned knowledge (catastrophic forgetting).
5. Train the Model: Train the model on your dataset. Typically, fine-tuning requires fewer
epochs than training from scratch because the model already has learned meaningful
representations.

2. Feature Extraction

Overview:

Feature extraction is a process where you use a pre-trained model as a fixed feature extractor. In
this approach, you don’t modify the pre-trained model’s weights. Instead, you extract the
features learned by the pre-trained model and feed them into a new model or classifier (like a
dense layer or a logistic regression) that will perform the task on top of those features.

Feature extraction is useful when:

• You don’t want to spend too much time or computational resources fine-tuning the
model.

• You have a small dataset and can’t afford to fine-tune the model without overfitting.

Steps for Feature Extraction:

1. Load a Pre-trained Model: Choose a model trained on a large dataset (e.g., ResNet,
VGG, BERT).

2. Freeze the Entire Pre-trained Model: Freeze all the layers of the pre-trained model, so
the weights are not updated during training.

3. Remove the Top Layers: Remove the final layers (usually fully connected layers) and add
new layers suitable for your task (e.g., a classifier for your dataset).

4. Train the New Classifier: Train only the new classifier (e.g., a fully connected layer) using
the features extracted by the pre-trained model.

5. Evaluate and Improve: You can evaluate the model’s performance and, if necessary,
make adjustments.

When to Use Which Approach?

• Fine-tuning is best when:

o You have a moderate to large dataset.

o You need the model to adapt closely to the new task.


o The pre-trained model has features that are highly relevant to your task.

• Feature extraction is best when:

o You have a small dataset or limited resources.

o You want to use the pre-trained model as a feature extractor without retraining
the entire network.

o You only need to modify the last layer(s) for your task.
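The Keras example below walks through both approaches with a pre-trained ResNet50: the new classifier layers are first trained with the base model frozen (feature extraction), and the last few ResNet50 layers are then unfrozen and trained with a lower learning rate (fine-tuning).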

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# 1. Load pre-trained ResNet50 model without the top layers
base_model = ResNet50(weights='imagenet', include_top=False)

# 2. Add custom layers for our specific task
x = GlobalAveragePooling2D()(base_model.output)  # Reduce the spatial dimensions
x = Dense(1024, activation='relu')(x)            # Add a fully connected layer
x = Dense(10, activation='softmax')(x)           # 10 classes in the new task

# Create the final model
model = Model(inputs=base_model.input, outputs=x)

# 3. Freeze the layers of the pre-trained model (ResNet50)
for layer in base_model.layers:
    layer.trainable = False

# 4. Compile the model
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# 5. Load the dataset (using ImageDataGenerator for simplicity)
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2,
                                   horizontal_flip=True)
train_data = train_datagen.flow_from_directory('path_to_train_data', target_size=(224, 224),
                                               batch_size=32, class_mode='categorical')

# 6. Train the model (only the new classifier layers)
model.fit(train_data, epochs=5)

# 7. Unfreeze some layers of ResNet50 for fine-tuning
for layer in base_model.layers[-10:]:  # Unfreeze the last 10 layers of ResNet50
    layer.trainable = True

# 8. Re-compile the model with a lower learning rate
model.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy',
              metrics=['accuracy'])

# 9. Continue training with fine-tuning
model.fit(train_data, epochs=5)

Applications in computer vision, natural language processing


1. Transfer Learning in Computer Vision

Applications:

1. Image Classification:

o Pre-trained models like ResNet, VGG, and Inception are often used for
classification tasks. These models have been trained on large datasets (e.g.,
ImageNet), and by fine-tuning them, they can be adapted for tasks like medical
image classification, product categorization, etc.

o Example: Identifying diseases in X-ray images. A model like ResNet can be fine-
tuned on a dataset of labeled medical images to detect diseases such as
pneumonia or tuberculosis.

2. Object Detection:

o YOLO (You Only Look Once) and Faster R-CNN are examples of pre-trained
models that can be fine-tuned for tasks like detecting specific objects in images.
These models are pre-trained on large datasets such as COCO and can be
transferred to detect a specific set of objects in a new dataset.

o Example: Autonomous vehicles use transfer learning for detecting pedestrians,


cars, or traffic signs from street images.

3. Semantic Segmentation:

o In tasks like pixel-wise classification (e.g., identifying the boundary of objects),


models like FCN (Fully Convolutional Networks) or U-Net are fine-tuned for
specific tasks. These models are useful for tasks like satellite image analysis or
medical image segmentation.

o Example: Segmentation of tumors in MRI images. The pre-trained model can be


adapted to segment and identify the region of interest.

4. Face Recognition:
o Pre-trained models like FaceNet are often used for face recognition tasks. These
models are fine-tuned on smaller datasets containing faces to adapt them for
specific applications like security or identity verification.

o Example: Face verification for secure access systems. After fine-tuning, these
models can verify if the person in front of the camera matches the stored face
data.

5. Style Transfer:

o Using models pre-trained on large image datasets, you can transfer the style of
one image to another. Neural Style Transfer is a popular technique in artistic
applications.

o Example: Creating artwork in the style of famous painters. A model pre-trained


on large image datasets can generate new images that resemble the artistic style
of an artist like Van Gogh or Picasso.

2. Transfer Learning in Natural Language Processing (NLP)

Applications:

1. Text Classification:

o Pre-trained models like BERT (Bidirectional Encoder Representations from


Transformers) and GPT (Generative Pretrained Transformers) can be fine-tuned
for text classification tasks such as spam detection, sentiment analysis, or topic
categorization.

o Example: Sentiment analysis of product reviews. BERT can be fine-tuned on a


dataset of labeled reviews to predict whether a review is positive or negative.

2. Named Entity Recognition (NER):

o Pre-trained language models can be fine-tuned to recognize entities like names,


dates, organizations, etc. BERT, SpaCy, and Hugging Face transformers are
commonly used for NER tasks.

o Example: Extracting entities from legal documents. Fine-tuned BERT models can
identify relevant entities such as company names, legal terms, or case numbers.

3. Question Answering:
o BERT, RoBERTa, and ALBERT are often fine-tuned for specific question-answering
tasks. These models can answer questions from a given passage or context.

o Example: Customer support systems. Pre-trained models fine-tuned on customer


support data can help answer customer queries automatically based on available
product documentation.

4. Machine Translation:

o GPT-3, T5, and MarianMT are models that can be fine-tuned for specific machine
translation tasks. Transfer learning helps by utilizing knowledge from pre-trained
models, allowing the translation of text between different languages.

o Example: Translating legal contracts from English to French. A model pre-trained


on multilingual corpora can be adapted to translate domain-specific documents.

5. Text Summarization:

o Pre-trained models like T5 (Text-to-Text Transfer Transformer) and BART


(Bidirectional and Auto-Regressive Transformers) can be fine-tuned for
summarization tasks. These models can be adapted to generate concise
summaries of long documents.

o Example: Generating summaries of news articles. Fine-tuned models can


generate summaries that capture the key points of an article without losing the
context.

6. Language Modeling:

o Models like GPT-2 and GPT-3 are pre-trained on large text corpora and can be
fine-tuned for specific language generation tasks. These models can generate
human-like text, which can be used for content creation, chatbots, and more.

o Example: Automated content generation for social media posts, blogs, or news
articles. Fine-tuning a model like GPT-3 on a specific domain (e.g., tech news) can
generate relevant and coherent articles.

7. Text-to-Speech and Speech Recognition:

o Pre-trained models for speech-to-text (e.g., DeepSpeech) or text-to-speech (e.g.,


Tacotron) can be adapted for specific accents, languages, or domains.
o Example: Voice assistants (like Siri or Alexa) adapt pre-trained models for specific
regional accents or voice commands to make the system more personalized.

8. Chatbots and Virtual Assistants:

o BERT, GPT, and other pre-trained transformer models are used in fine-tuning for
building intelligent chatbots and virtual assistants. These systems can answer
questions, provide recommendations, or perform specific tasks.

o Example: AI-powered customer support chatbots. Fine-tuning pre-trained


models on customer service conversations helps the chatbot understand and
respond to user queries accurately.

Key Benefits of Transfer Learning in CV and NLP:

• Reduces Training Time: Since the model is already pre-trained on a large dataset, you
don’t need to start from scratch. Fine-tuning or feature extraction requires fewer epochs
and computational resources.

• Better Performance: Pre-trained models capture high-level features (such as patterns,


textures, or language structures), which can significantly improve performance on
specific tasks, especially when you have limited data.

• Adaptability: Pre-trained models can be adapted to a wide range of tasks in both


computer vision and NLP, making them highly versatile for real-world applications.

• Improved Generalization: By leveraging a model trained on large, diverse datasets,


transfer learning helps your model generalize better, even if you’re working with a
smaller, domain-specific dataset.
