Unit - 2 ML notes
Supervised Learning
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., while
Classification algorithms are used to predict/classify discrete values such as Male or
Female, True or False, Spam or Not Spam, etc.
Classification:
Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters. In Classification, a computer program is trained on the
training dataset and based on that training, it categorizes the data into different classes.
The task of the classification algorithm is to find the mapping function to map the input(x) to
the discrete output(y).
Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.
Popular classification algorithms include:
o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
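As a minimal sketch of a classifier in action (scikit-learn and synthetic data are assumptions here; the notes don't prescribe a library or dataset), the example below trains a logistic regression model and predicts discrete labels:

```python
# Minimal classification sketch: train a classifier on synthetic data
# and predict discrete class labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset such as spam/not-spam emails.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict(X_test[:5]))    # discrete labels such as 0/1
print(clf.score(X_test, y_test))  # classification accuracy
```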
Regression:
The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.
Regression vs. Classification:
o In Regression, the output variable must be continuous or a real value; in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to a continuous output variable (y); the task of the classification algorithm is to map the input value (x) to a discrete output variable (y).
o Regression algorithms are used with continuous data; Classification algorithms are used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes.
o Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction; Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, and identification of cancer cells.
o Regression algorithms can be further divided into Linear and Non-linear Regression; Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
Basic Methods:
Distance-based Methods: Distance-based algorithms are machine learning algorithms that
classify queries by computing distances between these queries and a number of internally
stored exemplars. Exemplars that are closest to the query have the largest influence on the
classification assigned to the query. Two specific distance-based algorithms are the nearest-
neighbour algorithm and the nearest-hyperrectangle algorithm.
The k-nearest neighbour algorithm (kNN) outperforms the first-nearest-neighbour algorithm
only under certain conditions: data sets must contain moderate amounts of noise, and training
examples from the different classes must belong to clusters that allow an increase in the value
of k without reaching into clusters of other classes. Methods for choosing the value of k for
kNN have been investigated; it has been shown that one-fold cross-validation on a restricted
number of values for k suffices for best performance. It has also been shown that, for best
performance, the votes of the k nearest neighbours of a query should be weighted in inverse
proportion to their distances from the query.
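A brief sketch of both ideas, assuming scikit-learn and the classic Iris dataset (neither is specified in the notes): cross-validation over a restricted set of k values, with votes weighted by inverse distance.

```python
# Choose k for kNN by cross-validation over a restricted set of values,
# weighting neighbour votes in inverse proportion to their distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = None, 0.0
for k in (1, 3, 5, 7, 9):  # restricted set of candidate k values
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    score = cross_val_score(knn, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```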
The nearest-hyperrectangle algorithm (NGE) is found to give predictions that are
substantially inferior to those given by kNN in a variety of domains. Experiments performed
to understand this inferior performance led to the discovery of several improvements to NGE.
Foremost of these is BNGE, a batch algorithm that avoids construction of overlapping
hyperrectangles from different classes. Although it is generally superior to NGE, BNGE is
still significantly inferior to kNN in a variety of domains. Hence, a hybrid algorithm
(KBNGE), that uses BNGE in parts of the input space that can be represented by a single
hyperrectangle and kNN otherwise, is introduced.
In summary, this line of work on distance-based methods has contributed (a) several
improvements to existing distance-based algorithms, (b) several new distance-based
algorithms, and (c) an experimentally supported understanding of the conditions under which
various distance-based algorithms are likely to give good performance.
Decision Tree Algorithm
A decision tree is a supervised learning algorithm that is mainly used to solve classification
problems but can also be used for regression problems. It can work with both categorical and
continuous variables. It has a tree-like structure of nodes and branches, starting from a root
node that expands into further branches down to the leaf nodes. The internal nodes represent
the features of the dataset, the branches represent the decision rules, and the leaf nodes
represent the outcome of the problem.
Some real-world applications of decision tree algorithms are distinguishing between
cancerous and non-cancerous cells, suggesting to customers which car to buy, etc.
Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
selected by ASM). The root node splits further into the next decision node (distance from the
office) and one leaf node based on the corresponding labels. The next decision node further
splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure, or ASM. Using this measure, we can easily select
the best attribute for the nodes of the tree. There are two popular ASM techniques, which
are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]
where Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), S is the total set of samples, and
P(yes) and P(no) are the proportions of the two classes in S.
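As a worked illustration (added here, not part of the original notes), the plain-Python sketch below computes the information gain of splitting on the Outlook feature of the Play dataset used in the Naïve Bayes section further down:

```python
# Information gain of a split: entropy of all labels minus the
# weighted entropy of each subset produced by the split.
from math import log2

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum(
        (labels.count(c) / total) * log2(labels.count(c) / total)
        for c in set(labels)
    )

def information_gain(pairs):
    labels = [label for _, label in pairs]
    gain = entropy(labels)
    for value in set(v for v, _ in pairs):
        subset = [label for v, label in pairs if v == value]
        gain -= len(subset) / len(pairs) * entropy(subset)
    return gain

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Sunny", "No"), ("Rainy", "Yes"),
        ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"),
        ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]
print(information_gain(data))  # gain of splitting on Outlook, about 0.25
```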
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The Gini index only creates binary splits, and the CART algorithm uses it to create these
splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)²
where Pj is the proportion of samples belonging to class j at a given node.
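The short sketch below (plain Python, added for illustration) evaluates this formula on an impure node and on a pure node:

```python
# Gini index of a node: 1 - sum of squared class proportions.
def gini_index(labels):
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

print(gini_index(["Yes"] * 10 + ["No"] * 4))  # impure node -> about 0.408
print(gini_index(["Yes"] * 5))                # pure node   -> 0.0
```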
Pruning:
Pruning is a process of deleting unnecessary nodes from a tree in order to obtain the optimal
decision tree.
A too-large tree increases the risk of overfitting, while a too-small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is therefore known as pruning. There are mainly two types of
tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
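As a hedged illustration of cost complexity pruning, the sketch below uses scikit-learn's ccp_alpha parameter (the library choice is an assumption; the notes don't prescribe one). A larger alpha prunes more aggressively, yielding a smaller tree:

```python
# Cost-complexity pruning: larger ccp_alpha -> smaller, more pruned tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(unpruned.tree_.node_count, pruned.tree_.node_count)  # pruned tree is smaller
```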
Naïve Bayes Classifier
Naïve Bayes classifier is a supervised learning algorithm that is used to make predictions
based on the probability of an object. The algorithm is named Naïve Bayes because it is
based on Bayes' theorem and follows the naïve assumption that the variables are independent
of each other.
Bayes' theorem is based on conditional probability; it gives the likelihood that event A will
happen given that event B has already happened. The equation for Bayes' theorem is:
P(A|B) = P(B|A) × P(A) / P(B)
Naïve Bayes is one of the best classifiers and provides good results for a wide range of
problems. A naïve Bayesian model is easy to build and well suited to large datasets. It is
mostly used for text classification.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and corresponding target variable "Play".
Using this dataset, we need to decide whether we should play or not on a particular day
according to the weather conditions. To solve this problem, we follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes
Overcast 0 5 5/14 = 0.35
Rainy 2 2 4/14 = 0.29
Sunny 2 3 5/14 = 0.35
All 4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.30 × 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 4/14 = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 × 0.29 / 0.35 ≈ 0.41
Since P(Yes|Sunny) > P(No|Sunny), on a sunny day the Player should play.
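The same calculation can be reproduced in a few lines of plain Python (added here for illustration; the small difference against the hand calculation comes from using exact fractions rather than rounded values):

```python
# Posterior probabilities of Play = Yes / No given Outlook = Sunny,
# using the counts from the frequency table above.
p_sunny_yes, p_yes = 3 / 10, 10 / 14  # P(Sunny|Yes), P(Yes)
p_sunny_no, p_no = 2 / 4, 4 / 14      # P(Sunny|No),  P(No)
p_sunny = 5 / 14                      # P(Sunny)

p_yes_sunny = p_sunny_yes * p_yes / p_sunny  # = 0.60
p_no_sunny = p_sunny_no * p_no / p_sunny     # = 0.40 (about 0.41 with rounded inputs)
print(p_yes_sunny > p_no_sunny)              # True -> the Player should play
```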
Linear Models:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence it is called linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables:
y = a0 + a1x + ε
Here,
o a0 = intercept of the line
o a1 = linear regression coefficient (slope of the line)
o ε = random error
The values of the x and y variables are the training data for the linear regression model.
o Simple Linear Regression: In simple linear regression, a single independent variable is used
to predict the value of the dependent variable.
o Multiple Linear Regression: In multiple linear regression, more than one independent
variable is used to predict the value of the dependent variable.
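A minimal simple-linear-regression sketch follows, using scikit-learn on synthetic data (both are assumptions, added for illustration); the fitted intercept and slope play the roles of a0 and a1 in the equation above:

```python
# Fit y = a0 + a1*x + noise and recover the intercept (a0) and slope (a1).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(scale=1.0, size=50)  # noisy line

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])  # close to a0 = 2.0 and a1 = 3.0
print(model.predict([[5.0]]))            # continuous prediction for x = 5
```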
Logistic Regression
Logistic regression is a supervised learning algorithm used to predict categorical variables or
discrete values. It can be used for classification problems in machine learning, and the output
of the logistic regression algorithm can be Yes or No, 0 or 1, Red or Blue, etc.
Logistic regression is similar to linear regression except in how the two are used: linear
regression solves regression problems and predicts continuous values, whereas logistic
regression solves classification problems and predicts discrete values.
Instead of fitting a best-fit line, it forms an S-shaped curve that lies between 0 and 1. This S-
shaped curve is known as the logistic function, and it uses the concept of a threshold: any
value above the threshold tends to 1, and any value below it tends to 0.
The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps to get the logistic regression equation are given below:
o We know the equation of the straight line:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can be between 0 and 1 only, so we divide the above equation
by (1 - y), giving y / (1 - y), which is 0 for y = 0 and infinity for y = 1.
o But we need a range between -infinity and +infinity; taking the logarithm, the equation
becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
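The sketch below (plain NumPy, added for illustration with made-up coefficients) shows the resulting logistic function and the thresholding step:

```python
# The logistic (sigmoid) function maps the linear term b0 + b1*x
# into (0, 1); thresholding at 0.5 yields a discrete class label.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -3.0, 1.5                   # illustrative coefficients
x = np.array([0.5, 2.0, 4.0])
probs = sigmoid(b0 + b1 * x)         # values strictly between 0 and 1
labels = (probs >= 0.5).astype(int)  # threshold at 0.5
print(probs, labels)
```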
Support Vector Machine (SVM)
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram, in which two different categories are classified using a decision
boundary or hyperplane:
Example: SVM can be understood with the example we used for the KNN classifier. Suppose
we see a strange cat that also has some features of dogs; if we want a model that can
accurately identify whether it is a cat or a dog, such a model can be created using the SVM
algorithm. We first train our model with lots of images of cats and dogs so that it can learn
their different features, and then we test it with this strange creature. The SVM creates a
decision boundary between the two classes (cat and dog) and chooses the extreme cases
(support vectors), so it will consider the extreme cases of cat and dog. On the basis of the
support vectors, it will classify the creature as a cat. Consider the below diagram:
The SVM algorithm can be used for face detection, image classification, text
categorization, etc.
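A minimal SVM sketch follows, again assuming scikit-learn and synthetic data; it fits a linear hyperplane and exposes the support vectors that define it:

```python
# Fit a linear SVM: the hyperplane is determined by the support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)
print(len(clf.support_vectors_))  # the extreme points defining the margin
print(clf.predict([[0.0, 2.0]]))  # classify a new point
```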
Types of SVM
o Linear SVM: used for linearly separable data, i.e. a dataset that can be divided into two
classes by a single straight line.
o Non-linear SVM: used for non-linearly separable data; a kernel function maps the data into
a higher-dimensional space where a separating hyperplane can be found.
Binary Classification:
Example:
In a medical diagnosis, a binary classifier for a specific disease could take a patient's
symptoms as input features and predict whether the patient is healthy or has the disease.
The possible outcomes of the diagnosis are positive and negative.
True Positive (TP): The patient is diseased and the model predicts "diseased"
False Positive (FP): The patient is healthy but the model predicts "diseased"
True Negative (TN): The patient is healthy and the model predicts "healthy"
False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as
follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
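For instance (with made-up counts, purely illustrative):

```python
# Accuracy from confusion-matrix counts: correct predictions / all predictions.
tp, fp, tn, fn = 40, 5, 50, 5  # hypothetical counts, not real data
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9
```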
In machine learning, many methods utilize binary classification. The most common are:
o Logistic Regression
o k-Nearest Neighbours
o Decision Trees
o Support Vector Machine
o Naïve Bayes
For example, the MNIST handwritten-digit dataset is widely used and deeply understood
and, for the most part, is "solved." Top-performing models are deep learning convolutional
neural networks that achieve a classification accuracy above 99%, with an error rate between
0.4% and 0.2% on the hold-out test dataset.
Ranking
Ranking is a machine learning technique for ordering items by relevance. It is useful for
many applications in information retrieval such as e-commerce, social networks,
recommendation systems, and so on. For example, a user searches for an article or an item to
buy online. To build a machine learning model for ranking, we need to define inputs, outputs,
and a loss function.
All Learning to Rank models use a base Machine Learning model (e.g. Decision
Tree or Neural Network) to compute s = f(x). The choice of the loss function is the distinctive
element for Learning to Rank models. In general, we have 3 approaches, depending on how
the loss is computed.
1. Pointwise Methods – The total loss is computed as the sum of loss terms defined
on each document dᵢ (hence pointwise) as the distance between the predicted
score sᵢ and the ground truth yᵢ, for i=1…n. By doing this, we transform our task
into a regression problem, where we train a model to predict y (see the sketch after this
list).
2. Pairwise Methods – The total loss is computed as the sum of loss terms defined
on each pair of documents dᵢ, dⱼ (hence pairwise), for i, j=1…n. The objective
on which the model is trained is to predict whether yᵢ > yⱼ or not, i.e. which of two
documents is more relevant. By doing this, we transform our task into a binary
classification problem.
3. Listwise Methods – The loss is directly computed on the whole list of documents
(hence listwise) with corresponding predicted ranks. In this way, ranking metrics
can be more directly incorporated into the loss.
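To make the pointwise/pairwise distinction concrete, here is a small NumPy sketch (illustrative only, not the API of any particular Learning to Rank library): the pointwise loss is a per-document squared error, while the pairwise loss penalizes pairs whose predicted order contradicts the ground truth.

```python
# Pointwise vs. pairwise ranking losses on toy scores and relevances.
import numpy as np

scores = np.array([2.0, 0.5, 1.0])  # model scores s_i for 3 documents
labels = np.array([2.0, 0.0, 1.0])  # ground-truth relevance y_i

# Pointwise: treat ranking as regression on each document's score.
pointwise = np.mean((scores - labels) ** 2)

# Pairwise (logistic loss): for each pair where document i is more
# relevant than j, penalize the model if s_i does not exceed s_j.
pairwise = 0.0
for i in range(len(scores)):
    for j in range(len(scores)):
        if labels[i] > labels[j]:  # document i should outrank document j
            pairwise += np.log1p(np.exp(-(scores[i] - scores[j])))
print(pointwise, pairwise)
```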