Deep Learning (R2032423)
Unit-1
----------------------------------------------------------------------------------------------------------------
Deep learning is a specific subfield of machine learning: a new way of learning representations from data that puts an importance on learning successive layers of increasingly meaningful representations.
The "deep" here stands for the idea of successive layers of representations. How many layers contribute to a model of the data is called the depth of the model. Other appropriate names for the field could have been layered representations learning and hierarchical representations learning. Modern deep learning often involves tens or even hundreds of successive layers of representations. In deep learning, these layered representations are (almost always) learned via models called neural networks.
Figure 4: The loss score is used as a feedback signal to adjust the weights
The history of machine learning dates back several decades and has undergone significant developments over time. Here's a brief overview of the key milestones in the history of machine learning:
1. Early Foundations (1950s-1960s):
- The field of machine learning emerged from the intersection of computer
science and statistics, with early pioneers including Alan Turing and Arthur
Samuel.
- In 1950, Alan Turing proposed the "Turing Test" as a way to measure a machine's ability to exhibit intelligent behavior.
- In the 1950s, Arthur Samuel developed the concept of machine learning by creating programs that could improve their performance over time through experience, specifically in the domain of game-playing, such as checkers.
2. Symbolic AI and Expert Systems (1960s-1980s):
- During this period, researchers focused on symbolic AI and expert systems, which relied on rules and logical reasoning.
- Machine learning took a backseat as rule-based systems dominated the field, with projects like DENDRAL (a system for molecular biology) and MYCIN (a system for diagnosing bacterial infections) gaining attention.
3. Connectionism and Neural Networks (1980s-1990s):
- Interest in neural networks and connectionism resurged during this period.
- Backpropagation, a widely used algorithm for training neural networks, was developed in the 1980s.
- The field saw advancements in areas such as pattern recognition and speech
recognition, fueled by neural network models like the Multi-Layer Perceptron (MLP).
4. Statistical Learning and Data-Driven Approaches (1990s-2000s):
- Researchers started emphasizing statistical learning and data-driven approaches.
- Support Vector Machines (SVMs) gained popularity for classification tasks, offering strong theoretical foundations.
- The field saw the emergence of ensemble methods, such as Random Forests and Boosting, which combined multiple models to improve performance.
Deep learning has received a great deal of public attention in recent times, and industry has invested in it at a level never before seen in the history of AI. However, deep learning may not solve every problem: it needs sufficient data, and sometimes other machine learning methods can solve a problem more efficiently than deep learning.
Probabilistic Modeling:
Probabilistic modeling is the process of applying the principles of statistics to perform data
analysis.
Early Neural Networks:
The early neural networks laid the path to deep learning, and they have since been replaced by the modern neural networks covered in this unit. The core ideas of neural networks were coined as early as the 1950s, but the approach was ignored for decades. Interest revived when several people independently rediscovered the Backpropagation algorithm in the 1980s, which got neural networks started again.
Kernel Methods:
Kernel methods are a family of machine learning techniques that operate in a high-
dimensional feature space implicitly through a kernel function. They are particularly useful
for solving complex nonlinear problems while preserving the computational efficiency of
linear methods. Kernel methods have applications in various fields, including
classification, regression, dimensionality reduction, and anomaly detection.
The kernel methods are a group of classification algorithms, of which the support vector machine (SVM) is the best-known. SVMs were developed by Vladimir Vapnik and Corinna Cortes in the early 1990s at Bell Labs. SVMs aim to solve classification problems by finding good decision boundaries between two sets of points belonging to two different categories. Such a decision boundary can be linear or non-linear and separates the space into two regions, one per category. SVMs proceed to find these boundaries in two steps:
1. The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane.
2. A good decision boundary is then computed by maximizing the distance between the hyperplane and the closest data points from each class (maximizing the margin).
The mapping of the data to a high-dimensional space is carried out implicitly using kernel functions. An example is given below, where non-linearly separable data is made separable by adding a second feature y = power(x, 2).
x y=power(x,2)
1.2 1.44
1.4 1.96
1.3 1.69
1.5 2.25
1.3 1.69
1.2 1.44
Figure: Plot of the original values of x (one-dimensional data).
But if we add a second feature using the polynomial expression y = power(x, 2), then the dataset becomes linearly separable, as shown below.
Figure: Plot of x versus y = power(x, 2), where the transformed data becomes linearly separable.
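As a minimal sketch of this idea in Python (assuming NumPy and scikit-learn are installed; the data points and labels below are made up for illustration, not taken from the table above), the one-dimensional data can be mapped to the pair (x, power(x, 2)) and then separated with a linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative 1-D data: the two classes cannot be separated by a single threshold on x
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

# Add the second feature y = power(x, 2), as in the table above
features = np.column_stack([x, x ** 2])

# In the new (x, x^2) space a straight line (hyperplane) can separate the two classes
model = SVC(kernel="linear")
model.fit(features, labels)

test_x = np.array([-1.8, 0.2])
print(model.predict(np.column_stack([test_x, test_x ** 2])))  # expected [1 0] for this toy data
```

In practice the kernel trick lets SVMs perform this kind of mapping implicitly, without computing the new features explicitly.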
Decision Trees:
Decision trees are tree-like structures that let you classify input data points or predict output values given inputs, as shown in Figure 7. A decision tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. Decision trees are easy to visualize and interpret. A decision tree contains three main elements: decision nodes, branches, and leaf nodes. The decision nodes can have multiple branches, whereas the leaf nodes cannot contain any further branches.
• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
• Branch/Sub-Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
Algorithm
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are called leaf nodes.
Example:
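A minimal sketch in Python (assuming scikit-learn is installed; the small [age, income] dataset and the feature names are hypothetical, made up purely for illustration) shows these steps carried out automatically by a decision tree classifier:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical dataset: [age, income] -> buys_product (1 = yes, 0 = no)
X = [[25, 30000], [45, 80000], [35, 60000], [23, 20000], [52, 90000], [40, 50000]]
y = [0, 1, 1, 0, 1, 0]

# The tree picks the best attribute/split at each decision node using an ASM (Gini impurity here)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned decision nodes and leaf nodes
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 70000]]))  # classify a new, unseen sample
```

The criterion argument selects the attribute selection measure, and export_text prints the learned decision nodes and leaf nodes in readable form.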
Random Forest:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is a collection of a large number of specialized decision trees and is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
For the same data, different decision trees are created; instead of depending on one decision tree, the random forest takes the prediction from each tree, and the final output is predicted based on the majority vote of those predictions.
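As a hedged illustration (assuming scikit-learn; the built-in iris dataset is used here only for convenience and is not part of the notes), a random forest that lets 100 decision trees vote can be used like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small labelled dataset and hold out part of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 decision trees are trained; the majority vote across trees is the final prediction
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```

Each tree is trained on a resampled subset of the data, which is what makes the individual trees different even though they see the same dataset.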
Gradient Boosting Machines:
Gradient Boosting Machines (GBMs) are a powerful ensemble learning method that combines multiple weak prediction models, typically decision trees, to create a strong predictive model. GBMs iteratively build an ensemble of models by optimizing a loss function in a gradient-descent manner, focusing on reducing the errors made by the previous models in the ensemble. They are known for their effectiveness in a wide range of machine learning tasks, including regression and classification.
Here are the key characteristics and concepts of Gradient Boosting Machines:
1. Boosting: GBMs belong to the boosting family of algorithms, where weak models are
sequentially trained to correct the mistakes of the previous models. Each subsequent model
in the ensemble focuses on reducing the errors made by the previous models, leading to an
ensemble with improved overall predictive performance.
2. Gradient Descent: GBMs optimize the ensemble by minimizing a differentiable loss
function using gradient descent. The loss function measures the discrepancy between the
predicted values and the true values of the target variable. Gradient descent updates the
model parameters in the direction of steepest descent to iteratively improve the model's
predictions.
3. Weak Learners: GBMs use weak learners as building blocks, typically decision trees
with a small depth (often referred to as "shallow trees" or "decision stumps"). These weak
learners are simple models that make predictions slightly better than random guessing.
They are usually shallow to prevent overfitting and to focus on capturing the specific
patterns missed by previous models.
4. Residuals: In GBMs, the subsequent weak learners are trained to predict the residuals
(the differences between the true values and the predictions of the ensemble so far). By
focusing on the residuals, the subsequent models are designed to correct the errors made
by the previous models and improve the overall prediction accuracy.
5. Learning Rate: GBMs introduce a learning rate parameter that controls the contribution
of each weak learner to the ensemble. A smaller learning rate makes the learning process
more conservative, slowing down the convergence but potentially improving the
generalization ability.
6. Regularization: To prevent overfitting, GBMs often include regularization techniques.
Common regularization methods include limiting the depth or complexity of the weak
learners, applying shrinkage (reducing the impact of each weak learner), and using
subsampling techniques to train each weak learner on a random subset of the data.
7. Feature Importance: GBMs can provide estimates of feature importance based on how
frequently and effectively they are used in the ensemble. This information helps identify
the most informative features for the task.
Gradient Boosting Machines, particularly popular implementations such as
XGBoost, LightGBM, and CatBoost, have achieved state-of-the-art performance in
various machine learning competitions and real-world applications. They excel at handling
complex, high-dimensional data and have become an essential tool in the machine learning
practitioner's toolkit.
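A brief sketch of these ideas (assuming scikit-learn; the dataset and all parameter values are illustrative choices, not prescribed by the notes) using shallow trees, a small learning rate, and subsampling:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,    # number of boosting stages (weak learners)
    max_depth=2,         # shallow trees acting as weak learners
    learning_rate=0.05,  # contribution of each weak learner to the ensemble
    subsample=0.8,       # train each tree on a random subset of the data (regularization)
    random_state=0,
)
gbm.fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))
print("Feature importances (first five):", gbm.feature_importances_[:5])
```

Lowering learning_rate and max_depth makes the boosting process more conservative, trading slower convergence for potentially better generalization, as described above.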
1. Supervised Machine Learning
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we provide training to the machine so that it understands the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, the colour, the height (dogs are taller, cats are smaller), and so on. After completion of training, we input the picture of a cat and ask the machine to identify the object and predict the output. The machine is now well trained, so it checks all the features of the object, such as height, shape, colour, eyes, ears, and tail, and finds that it is a cat; it therefore puts it in the cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc. Supervised machine learning can be classified into two types of problems:
• Classification
• Regression
a) Classification
Classification algorithms are used to solve classification problems in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.
Advantages:
• Since supervised learning works with labelled datasets, we can have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
Applications of Supervised Learning:
• Image Segmentation:
Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
• Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and past labelled data with labels for disease conditions. With such a process, the machine can identify a disease for new patients.
2. Unsupervised Machine Learning
In unsupervised learning, the training data is not labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to the similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects. The machine then discovers patterns and differences on its own, such as differences in colour and shape, and predicts the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
• Clustering
• Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data.
It is a way to group the objects into a cluster such that the objects with the most
similarities remain in
one group and have fewer or no similarities with the objects of other groups. An example
of the clustering algorithm is grouping the customers by their purchasing behaviour.
Some popular clustering algorithms are K-Means clustering, hierarchical clustering, and DBSCAN; a short k-means sketch is given below.
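A minimal sketch of k-means clustering (assuming scikit-learn and NumPy; the customer spending figures are made up for illustration) that groups customers by purchasing behaviour, as in the example above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, number of purchases]
customers = np.array([
    [200, 5], [220, 6], [210, 4],       # low spenders
    [950, 40], [1000, 45], [980, 42],   # high spenders
])

# Group the unlabeled customers into 2 clusters based on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(customers)
print(cluster_labels)            # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # the centre of each discovered group
```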
Advantages:
• These algorithms can be used for more complicated tasks compared to the supervised ones, because they work on unlabeled datasets.
• Unsupervised algorithms are preferable for various tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled dataset.
Disadvantages:
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground
between Supervised (With Labelled training data) and Unsupervised learning (with no
labelled training data) algorithms and uses the combination of labelled and unlabeled
datasets during the training period.
Initially, similar data is clustered using an unsupervised learning algorithm, and this further helps to label the unlabeled data into labelled data. This is done because labelled data is a comparatively more expensive acquisition than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is where a student
is under the supervision of an instructor at home and college. Further, if that student is self-
analysing the same concept without any help from the instructor, it comes under
unsupervised learning. Under semi-supervised learning, the student has to revise himself
after analyzing the same concept under the guidance of an instructor at college.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
4. Reinforcement Learning
In reinforcement learning, there is no labelled data as in supervised learning; agents learn only from their experiences.
The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experiences in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define the states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
Too much reinforcement learning can lead to an overload of states, which can weaken the results.
The curse of dimensionality limits reinforcement learning for real physical systems.
Machine learning is a field of study and application that focuses on developing algorithms
and models that enable computers to learn and make predictions or decisions without being
explicitly programmed. It involves the development of mathematical and statistical
techniques that allow systems to automatically learn patterns and relationships from data
and improve their performance through experience.
Once the model is trained, it is evaluated to measure its performance. The model must be evaluated on data it has never seen before; if the evaluation is done on the same data used for training, it leads to overfitting. Hence the data is split into training, validation, and test sets, and one of the following evaluation recipes is used:
1. Simple hold-out validation,
2. K-fold validation,
3. Iterated K-fold validation with shuffling
SIMPLE HOLD-OUT VALIDATION:
Here the dataset is divided into two parts: a training set and a hold-out validation set. The model is trained on the training set and tested on the validation set. To prevent the information leaks that occur when the model is tuned against the same data it is tested on, the data is often divided into three parts: training, validation, and test sets. Before starting the process, random shuffling can be done to mix the data well.
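A minimal sketch of a simple hold-out split (assuming NumPy; the random feature matrix and labels below are synthetic placeholders rather than a real dataset):

```python
import numpy as np

# Hypothetical dataset: 1,000 samples with 20 features and a binary label each
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 20))
labels = rng.integers(0, 2, size=1000)

# Shuffle, then hold out 20% of the samples as the validation set
indices = rng.permutation(len(data))
num_validation = int(0.2 * len(data))
val_idx, train_idx = indices[:num_validation], indices[num_validation:]

train_data, train_labels = data[train_idx], labels[train_idx]
val_data, val_labels = data[val_idx], labels[val_idx]

# A model would be fit on (train_data, train_labels) and scored on (val_data, val_labels)
print(train_data.shape, val_data.shape)  # (800, 20) (200, 20)
```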
K-FOLD VALIDATION
Here we split the data into K partitions of equal size. For each partition i, we train a model on the remaining K - 1 partitions and evaluate it on partition i. The same process is repeated K times, so each fold serves once as the validation set. The final score of the model is the average of the K scores obtained. This approach is preferred when the model shows significant variance across different train/validation splits.
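A hedged sketch of K-fold validation (assuming scikit-learn and NumPy; the iris dataset, the logistic regression model, and K = 4 are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
fold_scores = []

# Split the data into K = 4 partitions; each fold serves once as the validation set
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)   # fresh model for every fold
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

# The final score is the average of the K validation scores
print("Mean validation accuracy:", np.mean(fold_scores))
```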
We can compute some error measure on the training set, called the training error, and we try to reduce this training error. What separates machine learning from pure optimization is that we also want the generalization error, also called the test error, to be low. The generalization error is defined as the expected value of the error on a new input.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set. The test error can be computed using the MSE (Mean Squared Error) as follows:
1. Measure the distance of each observed y-value from the predicted y-value: (y - y')
2. Square each of these distances: (y - y')^2
3. Take the mean of the squared distances: MSE = (1/n) * Σ (y - y')^2
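For example (a small worked sketch in Python; the y values and predictions are made-up numbers), these three steps give:

```python
# True values y and predicted values y_hat (illustrative numbers)
y     = [3.0, 5.0, 2.5]
y_hat = [2.5, 5.0, 3.5]

# MSE = (1/n) * sum((y - y_hat)^2)
squared_errors = [(t - p) ** 2 for t, p in zip(y, y_hat)]
mse = sum(squared_errors) / len(squared_errors)
print(squared_errors)  # [0.25, 0.0, 1.0]
print(mse)             # 0.4166...
```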
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between the training error and the test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set; that is, the model has not learned enough from the training data. Overfitting occurs when the gap between the training error and the test error is too large; in this case the model has fit the training set too closely, which results in a low training error, but when new samples are given, the gap between the training error and the test error becomes large.
Capacity plays the major role in controlling underfitting and overfitting. Capacity is essentially a measure of the range of functions the model can fit to the dataset. Models with low capacity may struggle to fit the training set, while models with high capacity can overfit.
To prevent a model from learning misleading or irrelevant patterns found in the training data, the best solution is to get more training data: a model trained on more data will naturally generalize better. When that is not possible, the next-best solution is to limit how much information the model can store; the process of fighting overfitting this way is called regularization.
Let's review some of the most common regularization techniques:
1. Reducing the network's size
The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model, which is often referred to as its capacity.
2. Adding weight regularization
A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters, as in the previous technique). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it is done by adding to the loss function a cost associated with having large weights. This cost comes in two flavors:
• L1 regularization: the cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
• L2 regularization: the cost added is proportional to the square of the value of the weight coefficients (the L2 norm of the weights).
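A minimal Keras sketch of L2 weight regularization (assuming TensorFlow/Keras is installed; the layer sizes, the input shape of 10,000, and the 0.001 factor are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Every weight in these Dense layers adds 0.001 * weight ** 2 to the total training loss
model = keras.Sequential([
    keras.Input(shape=(10000,)),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```

Using regularizers.l1(...) instead would apply the L1 flavor; in both cases the penalty is only added during training.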
3. Adding dropout
Dropout is one of the most effective and most commonly used regularization
techniques for neural networks, developed by Geoff Hinton and his students at the
University of Toronto. Dropout, applied to a layer, consists of randomly dropping out
(setting to zero) a number of output features of the layer during training.
Let’s say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a
given input sample during training. After applying dropout, this vector will have a
few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1].
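In Keras this is typically done by inserting Dropout layers after the layers whose outputs should be dropped (a sketch assuming TensorFlow/Keras; the 0.5 rate and layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10000,)),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),  # randomly zero out 50% of this layer's output features during training
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```

The dropped features change at every training batch, and at test time the Dropout layers are inactive.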