CST414 Deep Learning, Module 2

This document provides an introduction to deep learning, explaining its significance within artificial intelligence and its applications. It covers the structure of Deep Feed Forward Networks, regularization techniques to prevent overfitting, and various optimization methods used to improve model performance. Key concepts include L1 and L2 regularization, dropout, early stopping, and different gradient descent strategies such as SGD and RMSProp.
MODULE 2

Introduction to Deep Learning



Deep Learning is a subset of machine learning, which itself is a part of
artificial intelligence (AI).

It involves training algorithms, typically artificial neural networks, to
learn from large amounts of data and make predictions or decisions.

Deep learning has become one of the most significant advances in the
field of AI, with applications in areas such as image and speech
recognition, natural language processing, and autonomous driving.
Deep Feed Forward Network

A Deep Feed Forward Network (DFFN) is also known as a multilayer
perceptron (MLP).

It is a class of feedforward neural networks with multiple layers of
nodes, where each node in a layer is connected to all nodes in the
previous and next layers.
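
The sketch below is a minimal NumPy illustration of a forward pass through such a fully connected network; the layer sizes, ReLU activation, and random initialization are illustrative assumptions, not taken from these notes.

```python
import numpy as np

def forward(x, weights, biases):
    # Each hidden layer multiplies by a weight matrix, adds a bias and applies ReLU;
    # every node is connected to all nodes of the previous layer (fully connected).
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ W + b)      # hidden layers with ReLU
    return a @ weights[-1] + biases[-1]     # linear output layer

# Example: 4 inputs -> two hidden layers of 8 nodes -> 2 outputs
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.standard_normal(4), weights, biases))
```
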
Concept of Regularization and Optimization

Regularization and optimization are two core concepts in deep
learning that help improve model performance, prevent overfitting,
and ensure efficient convergence during training.
Regularization in Deep Learning

These techniques are used to prevent overfitting, a scenario where a
model performs well on training data but poorly on unseen data.

Regularization methods reduce model complexity or add constraints to
the model to improve its ability to generalize to new data.
1) L2 Regularization (Ridge Regularization): It adds a penalty term to the
loss function based on the square of the weights. This discourages large
weights, which can lead to overfitting.

Ltotal = Loriginal + λ Σ wi²

where Ltotal is the total loss, Loriginal is the original loss, wi are the
weights and λ is the regularization parameter that controls the penalty.
(A short code sketch of these penalty terms and dropout appears after this list.)
2) L1 Regularization (Lasso Regularization): It adds a penalty term to the
loss function based on the absolute values of the weights:

Ltotal = Loriginal + λ Σ |wi|

It has a tendency to drive some weights to exactly zero, effectively
performing feature selection.
3) Dropout: At each training step, a fraction of the neurons in the network is
randomly ignored (dropped out). The remaining neurons must learn to work
together and generalize better.
4) Early Stopping: The training process is stopped when the validation loss
starts increasing, indicating that the model is overfitting to the training data.
This helps prevent the model from memorizing the training data.
5) Data Augmentation: It is the process of artificially increasing the size of
the training set by applying random transformations to the training data. For
tasks like image classification, transformations such as rotations, flips,
scaling, and crops are applied to the original images to create new training
samples.
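
The following is a minimal NumPy sketch, assuming the standard formulations above, of how the L1/L2 penalty terms and an inverted-dropout mask can be computed; the function names, λ value, and keep probability are illustrative only.

```python
import numpy as np

def l2_penalty(weights, lam):
    # Ridge term added to the original loss: lambda * sum of squared weights
    return lam * np.sum(weights ** 2)

def l1_penalty(weights, lam):
    # Lasso term: lambda * sum of absolute weights; tends to push weights to zero
    return lam * np.sum(np.abs(weights))

def dropout(activations, keep_prob=0.8, training=True):
    # Randomly zero a fraction (1 - keep_prob) of neurons during training.
    # Dividing by keep_prob ("inverted dropout") keeps the expected activation unchanged.
    if not training:
        return activations
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

# Example: total loss = original loss + L2 penalty
w = np.array([0.5, -1.2, 0.0, 2.0])
original_loss = 0.37                     # placeholder value for illustration
total_loss = original_loss + l2_penalty(w, lam=0.01)
```

Early stopping and data augmentation act on the training loop and the input pipeline respectively, so they are not shown here.
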
Optimization in Deep Learning

Optimization in deep learning refers to the process of adjusting the parameters
(weights and biases) of a neural network in order to minimize a loss function.
1) Gradient Descent (GD):
• It is used for finding a local minimum of a differentiable function. In this
method the weights are initialized using some initialization strategy and are
updated in each epoch according to the update equation w ← w − η∇L(w).
• The core idea is to update the parameters in the direction opposite to the
gradient of the function, thus progressively reducing the value of the function.
The size of the step taken in each iteration is controlled by a parameter called
the learning rate (η).
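
A minimal sketch of this update rule on a simple one-dimensional quadratic loss, assuming a hand-coded gradient; the learning rate and epoch count are illustrative.

```python
import numpy as np

def gradient_descent(grad_fn, w_init, lr=0.1, epochs=100):
    # Step opposite to the gradient each epoch: w <- w - lr * grad
    w = np.asarray(w_init, dtype=float)
    for _ in range(epochs):
        w = w - lr * grad_fn(w)
    return w

# Example: minimise L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
print(gradient_descent(lambda w: 2 * (w - 3.0), w_init=[0.0]))   # approaches 3.0
```
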
2) Stochastic Gradient Descent (SGD):

It updates the model parameters using only one data point at a time, as
opposed to the entire dataset.

But this approach may lead to noisier results because it iterates over one
observation at a time.

Mini-Batch SGD: It is a cross-over between GD and SGD. Here the entire
dataset is divided into batches and the gradient is computed for each
batch.
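
A sketch of the mini-batch update loop for a linear model with mean-squared-error loss; the batch size, learning rate, and model choice are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10):
    # Linear model y ~ X @ w trained with mean-squared error;
    # the gradient is computed on one shuffled batch at a time.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w

# batch_size=1 gives plain SGD; batch_size=n gives full-batch gradient descent.
```
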
3) GD with momentum:

Even with Mini-Batch SGD the updates are noisy and convergence can take
too much time.

SGD with momentum works with the concept of an Exponentially
Weighted Moving Average (EWMA).

Here the most recent values get more importance and earlier values get
less importance.

The update is Vt = βVt-1 + η∇L(W) and W = W − Vt. Here βVt-1 is the
momentum term; it reduces the noise while trying to reach the global
minimum.
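
A sketch assuming the standard momentum formulation Vt = βVt-1 + η·gradient described above; β = 0.9 is a typical but illustrative choice.

```python
import numpy as np

def sgd_momentum(grad_fn, w_init, lr=0.01, beta=0.9, epochs=100):
    # The velocity is an exponentially weighted moving average of past gradients.
    w = np.asarray(w_init, dtype=float)
    v = np.zeros_like(w)
    for _ in range(epochs):
        v = beta * v + lr * grad_fn(w)   # v_t = beta * v_{t-1} + lr * gradient
        w = w - v                        # w = w - v_t
    return w
```
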
4) Nesterov Accelerated GD (NAG)

In gradient descent with momentum, the actual movement is large due
to the added momentum.

This added momentum causes problems: we may cross the actual minimum
point and then have to come back to reach it.

That means SGD with momentum oscillates around the minimum
point.

To reduce this problem we can use Nesterov Accelerated GD (NAG).

In NAG we calculate the gradient at a look-ahead point and then use it
to update the weights.

In momentum-based GD, the weight update was based on
ΔW = history of velocity + gradient at the current point

But in NAG the only difference is that
ΔW = history of velocity + gradient at the look-ahead point

Based on this, it decides whether to move forward or backward to reach
the minimum.
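
A sketch of the look-ahead idea, assuming the common NAG formulation in which the gradient is evaluated at W − βVt-1 before the velocity update; all names and values are illustrative.

```python
import numpy as np

def nesterov_gd(grad_fn, w_init, lr=0.01, beta=0.9, epochs=100):
    # The gradient is evaluated at the look-ahead point (w - beta * v),
    # i.e. where the momentum term would carry us, then used to update v and w.
    w = np.asarray(w_init, dtype=float)
    v = np.zeros_like(w)
    for _ in range(epochs):
        lookahead = w - beta * v
        v = beta * v + lr * grad_fn(lookahead)
        w = w - v
    return w
```
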
5) Adagrad Optimizer

In SGD and Mini-Batch SGD the learning rate is the same for every
weight (parameter).

But in the Adagrad optimizer the learning rate is modified based on
how frequently a parameter gets updated during training.

This method is adopted when the dataset has features with very
different characteristics. For example, if it contains both dense and
sparse features, applying the same learning rate to both types of
features makes optimization difficult.
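
A sketch assuming the standard Adagrad rule, which divides the learning rate by the square root of the accumulated squared gradients per parameter; lr and eps are illustrative values.

```python
import numpy as np

def adagrad(grad_fn, w_init, lr=0.1, eps=1e-8, epochs=100):
    # Accumulate squared gradients per parameter; frequently updated parameters
    # see their effective learning rate shrink faster than rarely updated ones.
    w = np.asarray(w_init, dtype=float)
    g_accum = np.zeros_like(w)
    for _ in range(epochs):
        g = grad_fn(w)
        g_accum += g ** 2
        w = w - lr * g / (np.sqrt(g_accum) + eps)
    return w
```
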
6) RMSProp

It is an extension of the popular Adaptive Gradient Algorithm (Adagrad)
and is designed to reduce the effort needed to train neural networks.
Instead of accumulating all past squared gradients as Adagrad does, it
keeps an exponentially decaying average of them, so the effective
learning rate does not shrink toward zero.

RMSProp is able to smoothly adjust the learning rate for each of the
parameters in the network, providing better performance than regular
Gradient Descent.
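
A sketch assuming the usual RMSProp rule with decay rate ρ (commonly 0.9); the hyperparameter values are illustrative only.

```python
import numpy as np

def rmsprop(grad_fn, w_init, lr=0.001, rho=0.9, eps=1e-8, epochs=100):
    # Keep an exponentially decaying average of squared gradients and use it
    # to scale the step for each parameter individually.
    w = np.asarray(w_init, dtype=float)
    s = np.zeros_like(w)
    for _ in range(epochs):
        g = grad_fn(w)
        s = rho * s + (1.0 - rho) * g ** 2
        w = w - lr * g / (np.sqrt(s) + eps)
    return w
```
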
