
Unit - 2

Supervised Learning

Regression vs. Classification in Machine Learning:


Regression and Classification algorithms are Supervised Learning algorithms. Both are used for prediction in machine learning and work with labeled datasets.

The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, and age, while
Classification algorithms are used to predict or classify discrete values such as Male or
Female, True or False, and Spam or Not Spam.


Classification:

Classification is the process of finding a function that helps divide a dataset into
classes based on different parameters. In classification, a computer program is trained on a
training dataset and, based on that training, categorizes the data into different classes.

The task of the classification algorithm is to find the mapping function that maps the input (x) to
the discrete output (y).

Example: The best example to understand the Classification problem is Email Spam
Detection. The model is trained on the basis of millions of emails on different parameters,
and whenever it receives a new email, it identifies whether the email is spam or not. If the
email is spam, then it is moved to the Spam folder.

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification

Regression:

Regression is the process of finding correlations between dependent and independent
variables. It helps in predicting continuous variables such as market trends and house prices.

The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training
is completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

o Simple Linear Regression
o Multiple Linear Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression

Difference between Regression and Classification

| Regression Algorithm | Classification Algorithm |
| In Regression, the output variable must be continuous or a real value. | In Classification, the output variable must be a discrete value. |
| The task of the regression algorithm is to map the input value (x) to a continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to a discrete output variable (y). |
| Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data. |
| In Regression, we try to find the best-fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes. |
| Regression algorithms can be used to solve regression problems such as weather prediction and house price prediction. | Classification algorithms can be used to solve classification problems such as identifying spam emails, speech recognition, and identifying cancer cells. |
| Regression algorithms can be further divided into Linear and Non-linear Regression. | Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers. |

Basic Methods:
Distance-based Methods: Distance-based algorithms are machine learning algorithms that
classify queries by computing distances between those queries and a number of internally
stored exemplars. The exemplars closest to the query have the largest influence on the
classification assigned to it. Two specific distance-based algorithms are the nearest-neighbour
algorithm and the nearest-hyperrectangle algorithm.
The k-nearest-neighbour algorithm (kNN) outperforms the first-nearest-neighbour algorithm
only under certain conditions: data sets must contain moderate amounts of noise, and training
examples from the different classes must belong to clusters that allow the value of k to be
increased without reaching into clusters of other classes. Methods for choosing the value of k
for kNN have been investigated; one-fold cross-validation on a restricted number of values for
k suffices for best performance. It has also been shown that, for best performance, the votes of
the k nearest neighbours of a query should be weighted in inverse proportion to their distances
from the query. A minimal code sketch of this idea follows.
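The sketch below is a minimal, hedged illustration of distance-weighted kNN with scikit-learn: `weights="distance"` weights each neighbour's vote by the inverse of its distance, and a small cross-validated sweep picks k from a restricted set of values. The synthetic dataset is purely illustrative, not from the text.

```python
# Distance-weighted kNN with a small cross-validated sweep over k.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try a restricted set of k values; votes are weighted by inverse distance.
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {score:.3f}")
```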

Principal component analysis has been shown to reduce the number of relevant dimensions
substantially in several domains. Two methods for learning feature weights for a weighted
Euclidean distance metric have been proposed; these methods improve the performance of kNN and
NN in a variety of domains.
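As a sketch of the same idea in code, PCA can be placed in front of kNN so that distances are computed in a reduced space. The Iris dataset and the choice of two components are illustrative assumptions, not part of the original study.

```python
# PCA as a dimensionality-reduction step before kNN distance computation.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Project onto 2 principal components, then classify by nearest neighbours.
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
print("mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```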

The nearest-hyperrectangle algorithm (NGE) gives predictions that are
substantially inferior to those of kNN in a variety of domains. Experiments performed
to understand this inferior performance led to several improvements to NGE.
Foremost of these is BNGE, a batch algorithm that avoids constructing overlapping
hyperrectangles from different classes. Although generally superior to NGE, BNGE is
still significantly inferior to kNN in a variety of domains. Hence, a hybrid algorithm
(KBNGE) was introduced that uses BNGE in parts of the input space that can be represented
by a single hyperrectangle, and kNN elsewhere.

The primary contributions of this work are (a) several improvements to existing
distance-based algorithms, (b) several new distance-based algorithms, and (c) an
experimentally supported understanding of the conditions under which various distance-based
algorithms are likely to perform well.
Decision Tree Algorithm
A decision tree is a supervised learning algorithm that is mainly used to solve
classification problems but can also be used for regression. It works with both categorical
and continuous variables. It has a tree-like structure of nodes and branches: it starts at the
root node, which expands into further branches until the leaf nodes are reached. Internal
nodes represent the features of the dataset, branches represent the decision rules, and leaf
nodes represent the outcomes.

Some real-world applications of decision tree algorithms are distinguishing between cancerous
and non-cancerous cells, suggesting which car a customer should buy, etc.

Example: Suppose a candidate has a job offer and wants to decide whether to accept it.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
selected by an attribute selection measure). The root node splits into a decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node splits further into one decision node (cab facility) and one leaf node.
Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve this problem, there is a technique called
the Attribute Selection Measure (ASM). Using this measure, we can easily select the best
attribute for each node of the tree. The two popular ASM techniques are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after segmenting a dataset
based on an attribute.
o It calculates how much information a feature provides about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize information gain; the node/attribute
with the highest information gain is split first. Information gain can be calculated using
the formula below:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric that measures the impurity of a given attribute. It specifies
the randomness in the data. Entropy can be calculated as:

Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)

Where,

o S = the total number of samples
o P(yes) = probability of yes
o P(no) = probability of no

2. Gini Index:
o The Gini index is a measure of impurity or purity used when creating a decision tree with
the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It creates only binary splits; the CART algorithm uses the Gini index to create them.
o The Gini index can be calculated using the formula below:
o Gini index can be calculated using the below formula:

Gini Index = 1 − Σⱼ Pⱼ²
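Both measures are easy to compute directly from the formulas above. The helper below is a small illustrative sketch for the binary case; the 9-yes/5-no counts are hypothetical.

```python
# Entropy and Gini index for a binary split, from the formulas above.
import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); skip zero terms.
    terms = [p for p in (p_yes, p_no) if p > 0]
    return -sum(p * math.log2(p) for p in terms)

def gini(p_yes, p_no):
    # Gini = 1 - sum_j P_j^2
    return 1 - (p_yes ** 2 + p_no ** 2)

# Hypothetical node with 9 "yes" and 5 "no" samples.
p_yes, p_no = 9 / 14, 5 / 14
print(f"Entropy = {entropy(p_yes, p_no):.3f}")  # ~0.940
print(f"Gini    = {gini(p_yes, p_no):.3f}")     # ~0.459
```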


Pruning: Getting an Optimal Decision tree

Pruning is the process of deleting unnecessary nodes from a tree in order to obtain an optimal
decision tree.

A tree that is too large increases the risk of overfitting, while a tree that is too small may
fail to capture the important features of the dataset. A technique that decreases the size of
the learned tree without reducing accuracy is known as pruning. There are two main pruning
techniques in use:

o Cost Complexity Pruning
o Reduced Error Pruning

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process a human follows when making
a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
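As a hedged illustration of the ideas above, the scikit-learn sketch below builds a decision tree using entropy (information gain) as the attribute selection measure and applies cost-complexity pruning via the `ccp_alpha` parameter; the Iris dataset and the parameter values are arbitrary choices for demonstration.

```python
# Decision tree with entropy-based splitting and cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" uses information gain; ccp_alpha > 0 prunes the tree.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("tree depth:", tree.get_depth())
print("test accuracy:", tree.score(X_test, y_test))
```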

Naïve Bayes Algorithm:

The Naïve Bayes classifier is a supervised learning algorithm that makes predictions based on
the probability of an object. The algorithm is named Naïve Bayes because it is based on Bayes'
theorem and follows the naïve assumption that the variables are independent of each other.

Bayes' theorem is based on conditional probability: the likelihood that event A will happen
given that event B has already happened. The equation for Bayes' theorem is:

P(A|B) = P(B|A) · P(A) / P(B)

The Naïve Bayes classifier is one of the best classifiers for obtaining good results on
suitable problems. It is easy to build a naïve Bayesian model, and it is well suited to large
datasets. It is mostly used for text classification.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions with a corresponding target variable "Play".
Using this dataset, we need to decide whether we should play on a particular day according to
the weather conditions. To solve this problem, we follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

| # | Outlook | Play |
| 0 | Rainy | Yes |
| 1 | Sunny | Yes |
| 2 | Overcast | Yes |
| 3 | Overcast | Yes |
| 4 | Sunny | No |
| 5 | Rainy | Yes |
| 6 | Sunny | Yes |
| 7 | Overcast | Yes |
| 8 | Rainy | No |
| 9 | Sunny | No |
| 10 | Sunny | Yes |
| 11 | Rainy | No |
| 12 | Overcast | Yes |
| 13 | Overcast | Yes |
Frequency table for the weather conditions:

| Weather | Yes | No |
| Overcast | 5 | 0 |
| Rainy | 2 | 2 |
| Sunny | 3 | 2 |
| Total | 10 | 4 |

Likelihood table for the weather conditions:

| Weather | No | Yes | P(Weather) |
| Overcast | 0 | 5 | 5/14 ≈ 0.35 |
| Rainy | 2 | 2 | 4/14 ≈ 0.29 |
| Sunny | 2 | 3 | 5/14 ≈ 0.35 |
| All | 4/14 ≈ 0.29 | 10/14 ≈ 0.71 | |

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.30

P(Sunny) = 5/14 ≈ 0.35

P(Yes) = 10/14 ≈ 0.71

So P(Yes|Sunny) = 0.30 × 0.71 / 0.35 ≈ 0.60

P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.50

P(No) = 4/14 ≈ 0.29

P(Sunny) = 5/14 ≈ 0.35

So P(No|Sunny) = 0.50 × 0.29 / 0.35 ≈ 0.41

As we can see from the calculations above, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
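The hand calculation above can be reproduced with a few lines of Python; the sketch below simply re-evaluates the same ratios (small differences from the text come from rounding).

```python
# Reproducing the Naive Bayes posterior calculation for the weather dataset.
p_sunny_given_yes = 3 / 10   # 3 sunny days among the 10 "Yes" days
p_yes = 10 / 14              # ~0.71
p_sunny = 5 / 14             # ~0.35
p_sunny_given_no = 2 / 4     # 2 sunny days among the 4 "No" days
p_no = 4 / 14                # ~0.29

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny
print(f"P(Yes|Sunny) = {p_yes_given_sunny:.2f}")  # 0.60
print(f"P(No|Sunny)  = {p_no_given_sunny:.2f}")   # 0.40 (0.41 in the text, from rounding)
```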

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Linear Models:

Linear regression is one of the easiest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, and product price.

The linear regression algorithm models a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Because the
relationship is linear, the model finds how the value of the dependent variable changes with
the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables.

Mathematically, we can represent a linear regression as:

y = a₀ + a₁x + ε

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a₀ = intercept of the line (gives an additional degree of freedom)
a₁ = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values of the x and y variables come from the training dataset used to fit the linear
regression model.

Linear regression is further divided into two types:

o Simple Linear Regression: In simple linear regression, a single independent variable is used
to predict the value of the dependent variable.
o Multiple Linear Regression: In multiple linear regression, more than one independent
variable is used to predict the value of the dependent variable.
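A minimal scikit-learn sketch of simple linear regression follows. The toy data (generated from y = 1 + 2x) is hypothetical; the fitted `intercept_` and `coef_` correspond to a₀ and a₁ in the equation above.

```python
# Simple linear regression: fit y = a0 + a1*x on toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable x
y = np.array([3, 5, 7, 9, 11])            # dependent variable, y = 1 + 2x

model = LinearRegression().fit(X, y)
print("a0 (intercept):", model.intercept_)        # ~1.0
print("a1 (coefficient):", model.coef_[0])        # ~2.0
print("prediction for x=6:", model.predict([[6]])[0])  # ~13.0
```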

Logistic Regression

Logistic regression is a supervised learning algorithm used to predict categorical variables
or discrete values. It can be used for classification problems in machine learning, and the
output of the logistic regression algorithm can be Yes or No, 0 or 1, Red or Blue, etc.

Logistic regression is similar to linear regression except in how it is used: linear
regression is used to solve regression problems and predict continuous values, whereas
logistic regression is used to solve classification problems and predict discrete values.

Instead of fitting a best-fit line, logistic regression forms an S-shaped curve that lies
between 0 and 1. The S-shaped curve, also known as the logistic function, uses the concept of
a threshold: any value above the threshold tends to 1, and any value below it tends to 0.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of a straight line can be written as:

y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ

o In logistic regression, y can only be between 0 and 1, so we consider y divided by (1 − y);
this ratio is 0 when y = 0 and infinity when y = 1:

y / (1 − y)

o But we need a range from −∞ to +∞, so we take the logarithm, and the equation becomes:

log[ y / (1 − y) ] = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ

The above equation is the final equation for logistic regression.
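The sketch below ties the derivation to code: scikit-learn's LogisticRegression learns b₀ and b₁, and applying the sigmoid to the linear output reproduces the model's predicted probability. The one-feature toy data is hypothetical.

```python
# Logistic regression: the sigmoid maps the linear output into (0, 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])            # binary labels

model = LogisticRegression().fit(X, y)

# p = 1 / (1 + exp(-(b0 + b1*x))) is the probability of class 1.
b0, b1 = model.intercept_[0], model.coef_[0][0]
x = 3.5
p = 1 / (1 + np.exp(-(b0 + b1 * x)))
print("P(y=1 | x=3.5) =", p)                      # near the 0.5 threshold
print("predicted class:", model.predict([[3.5]])[0])
```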


Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is
used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.

Example: SVM can be understood with the example used for the kNN classifier. Suppose we see
a strange cat that also has some features of a dog. If we want a model that can accurately
identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
We first train the model with many images of cats and dogs so that it can learn their
different features, and then test it with this strange creature. Because SVM creates a
decision boundary between the two classes (cat and dog) and chooses the extreme cases
(support vectors), it will look at the extreme cases of cat and dog and, on the basis of the
support vectors, classify it as a cat.

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes using a single straight line, the data is termed linearly separable, and the
classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot
be classified using a straight line, the data is termed non-linear, and the classifier used is
called a Non-linear SVM classifier. (A short code sketch of both follows.)
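Below is a hedged sketch of both types with scikit-learn's SVC; the two-moons dataset is a standard illustrative choice of data that a single straight line cannot separate.

```python
# Linear vs. RBF-kernel SVM on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)  # non-linear data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "accuracy:", clf.score(X_test, y_test))
# The RBF kernel typically scores higher here, since the two moons cannot
# be divided by a single straight line.
```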

Binary Classification:

In machine learning, binary classification is a supervised learning task that categorizes new
observations into one of two classes.

The following are a few binary classification applications, where the 0 and 1 columns are the
two possible classes for each observation:

| Application | Observation | 0 | 1 |
| Medical Diagnosis | Patient | Healthy | Diseased |
| Email Analysis | Email | Not Spam | Spam |
| Financial Data Analysis | Transaction | Not Fraud | Fraud |
| Marketing | Website visitor | Won't Buy | Will Buy |
| Image Classification | Image | Hotdog | Not Hotdog |

Example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's
symptoms as input features and predict whether the patient is healthy or has the disease.
The possible outcomes of the diagnosis are positive and negative.

Evaluation of binary classifiers


If the model correctly predicts a patient as positive, the case is called a True
Positive (TP). If the model correctly predicts a patient as negative, it is called a True
Negative (TN). The binary classifier may also misdiagnose some patients. If a diseased
patient is classified as healthy by a negative test result, the error is called a False
Negative (FN); similarly, if a healthy patient is classified as diseased by a positive test
result, the error is called a False Positive (FP).
We can evaluate a binary classifier based on the following parameters:

 True Positive (TP): The patient is diseased and the model predicts "diseased"
 False Positive (FP): The patient is healthy but the model predicts "diseased"
 True Negative (TN): The patient is healthy and the model predicts "healthy"
 False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as
follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The above parameters can be arranged in a confusion matrix:

|                 | Predicted Positive | Predicted Negative |
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
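These quantities are straightforward to compute with scikit-learn; in the sketch below the label vectors are hypothetical, with 1 meaning diseased and 0 meaning healthy.

```python
# Confusion-matrix entries and accuracy for a binary classifier.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = diseased)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print("accuracy:", accuracy_score(y_true, y_pred))  # (TP+TN)/(TP+TN+FP+FN)
```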

In machine learning, many methods utilize binary classification. The most common are:

 Support Vector Machines
 Naive Bayes
 Nearest Neighbor
 Decision Trees
 Logistic Regression
 Neural Networks
Multiclass/Structured Outputs:

Multi-output classification is a type of machine learning that predicts multiple outputs
simultaneously. In multi-output classification, the model gives two or more outputs after
making a prediction, whereas other types of classification usually predict only a single
output.

An example of a multi-output classification model is one that predicts the type and color of
a fruit simultaneously. The type of fruit can be orange, mango, or pineapple, and the color
can be red, green, yellow, or orange. A multi-output classification model handles this problem
and gives both prediction results.
As a concrete case, we can build a multi-output text classification model on the Netflix
dataset. The model classifies the input text as either TV Show or Movie; this is the first
output. It also classifies the rating as TV-MA, TV-14, TV-PG, R, PG-13, or TV-Y; the rating
is the second output. We can use scikit-learn's MultiOutputClassifier to build this model,
as sketched below.
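The following is a hedged sketch of that approach: MultiOutputClassifier fits one classifier per target, so a single pipeline predicts both the type and the rating. The four training examples are invented stand-ins for the Netflix data, not the real dataset.

```python
# Multi-output text classification: one classifier per target column.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

texts = ["a gritty crime movie", "a family cartoon series",
         "a stand-up comedy special", "a teen drama series"]
# Two targets per sample: [type, rating].
y = np.array([["Movie", "R"], ["TV Show", "TV-Y"],
              ["Movie", "TV-MA"], ["TV Show", "TV-14"]])

model = make_pipeline(TfidfVectorizer(),
                      MultiOutputClassifier(LogisticRegression()))
model.fit(texts, y)
print(model.predict(["an animated kids series"]))  # e.g. [['TV Show' 'TV-Y']]
```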
MNIST
MNIST stands for the Modified National Institute of Standards and Technology dataset. It is a
dataset of 70,000 small square 28×28-pixel grayscale images of handwritten single digits
between 0 and 9 (60,000 training images and 10,000 test images). The task is to classify a
given image of a handwritten digit into one of 10 classes representing the integer values 0
to 9, inclusive.

It is a widely used and deeply understood dataset and, for the most part, is "solved." Top-
performing models are deep convolutional neural networks that achieve a classification
accuracy above 99%, with an error rate between 0.4% and 0.2% on the held-out test dataset.
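For reference, here is a minimal sketch of loading MNIST and fitting a simple, non-convolutional baseline with scikit-learn; the roughly 92% accuracy typical of such a linear model is far below the 99%+ reported for CNNs, and training may take a few minutes.

```python
# Load MNIST via OpenML and fit a linear baseline classifier.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0                         # scale pixel values to [0, 1]
X_train, X_test = X[:60000], X[60000:]  # standard 60k/10k split
y_train, y_test = y[:60000], y[60000:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))  # roughly 0.92
```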

Ranking

Ranking is a machine learning technique for ranking items. It is useful for many applications
in information retrieval, such as e-commerce, social networks, and recommendation systems.
For example, a user searches for an article or an item to buy online. To build a machine
learning model for ranking, we need to define the inputs, the outputs, and a loss function.
 Input – For a query q we have n documents D = {d₁, …, dₙ} to be ranked by
relevance. The elements xᵢ = (q, dᵢ) are the inputs to our model.
 Output – For a query-document input xᵢ = (q, dᵢ), we assume there exists a
true relevance score yᵢ. Our model outputs a predicted score sᵢ = f(xᵢ).

All Learning to Rank models use a base machine learning model (e.g. a decision
tree or a neural network) to compute s = f(x). The choice of the loss function is the
distinctive element of Learning to Rank models. In general, there are three approaches,
depending on how the loss is computed.

1. Pointwise Methods – The total loss is computed as the sum of loss terms defined
on each document dᵢ (hence pointwise) as the distance between the predicted
score sᵢ and the ground truth yᵢ, for i=1…n. By doing this, we transform our task
into a regression problem, where we train a model to predict y.
2. Pairwise Methods – The total loss is computed as the sum of loss terms defined
on each pair of documents dᵢ, dⱼ (hence pairwise), for i, j=1…n. The objective
on which the model is trained is to predict whether yᵢ > yⱼ or not, i.e. which of two
documents is more relevant. By doing this, we transform our task into a binary
classification problem.
3. Listwise Methods – The loss is directly computed on the whole list of documents
(hence listwise) with corresponding predicted ranks. In this way, ranking metrics
can be more directly incorporated into the loss.
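As a hedged sketch of the pointwise approach, the code below treats ranking as a regression problem: a model is trained to predict the relevance score yᵢ from the features xᵢ = (q, dᵢ), and documents are then sorted by the predicted scores sᵢ. The features, labels, and choice of GradientBoostingRegressor are illustrative assumptions.

```python
# Pointwise learning to rank: regress relevance, then sort by prediction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row is a feature vector x_i = (q, d_i); y_i is the true relevance.
X = np.array([[0.9, 0.2], [0.4, 0.8], [0.1, 0.1], [0.7, 0.6]])
y = np.array([3.0, 2.0, 0.0, 2.5])

model = GradientBoostingRegressor().fit(X, y)
scores = model.predict(X)          # s_i = f(x_i)
ranking = np.argsort(-scores)      # document indices, best first
print("predicted ranking of documents:", ranking)
```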
