
Chapter 2: Supervised Machine Learning

2.1. Introduction to Supervised Machine Learning


2.2. Regression in Machine Learning:
2.2.1. Types of Regression Algorithms-Linear Regression, Polynomial,
Stepwise, Ridge,
Lasso, Elastic Net, Bayesian Regression.
2.2.2 Evaluation methods for regression: MAE, RMSE, R-Square error
2.3. Classification in ML: Classification Algorithms
2.3.1. Logistic Regression
2.3.2. Decision Tree
2.3.3. k-Nearest Neighbors
2.3.4. Naive Bayes
2.3.5. Support Vector Machines (SVM)
2.3.6. Evaluation methods: Confusion matrix, precision, recall, F1-score,
Accuracy,
ROC-AUC curve,
2.4. Non-linear regression: Decision Tree, Support Vector, KNN
2.5. Advantages and Disadvantages of supervised ML
2.1. Introduction to Supervised Machine Learning

Based on the type of input provided and the learning technique, machine learning algorithms are broadly classified into three categories: supervised, unsupervised and semi-supervised learning algorithms.

Supervised learning is a type of machine learning where machines are trained using a well-labelled dataset, from which the algorithm builds a model that can be used to make predictions about unseen or future data.
The aim of supervised machine learning is to find a good approximation of the mapping function f() between input variables (X) and output variables (Y), so that when new input data (X) is given to the model, the model predicts the correct output (Y).
Y=f(X)

The name "supervised learning' is given because the process of learning


from the training dataset can be thought of as a teacher who is supervising the
learning process. The algorithm learns from the training data iteratively to make
predictions, these predictions are corrected based on the actual output. The
learning process stops when it achieves an acceptable level of accuracy
The figure above shows the detailed Process of supervised machine learning.
The steps are:
1. A labeled training data is given as an input to the algorithm.
2. The algorithm learns the patterns from given input data and builds a
predictive model.
3. This model will be evaluated against the labeled test dataset.
4. After minimizing the errors the predictive model predicts the result for
new unseen data
Advantages of supervised learning
 Supervised algorithms design a model, which can predict the output
based on prior knowledge or past experiences.
 The input data is very well known and is labeled.
 Produces more accurate results as compared to unsupervised methods
due to prior knowledge of class labels.
Disadvantages of supervised learning
 Supervised techniques are not suitable if you have dynamic, big and growing data.
 They are also not suitable if you are not sure of the labels needed to predefine the rules.
 Training requires lots of computation time
Classification: Classification is a task that extracts and models important data classes from input data. Such models, called classifiers, predict categorical (discrete, unordered) class labels. Classification algorithms are used when the dependent or output variable is a category or a class.

From the above figure, given input data is classified into two different classes A
and B. The model finds a class boundary that separates the classes. If a new data
point is given to the model, it can predict the new data point's class label
according to the boundary.
2.2. Regression in Machine Learning:
2.2.1. Types of Regression Algorithms
-Linear Regression
-Polynomial
-Stepwise
-Ridge
-Lasso
-Elastic Net
-Bayesian Regression

Regression: It is a supervised machine learning technique, used to predict the


value of the dependent variable for new, unseen data. It models the relationship
between the input features and the target variable, allowing for the estimation or
prediction of numerical values.
It seeks to find the best-fitting model, which can be utilized to make predictions
or draw conclusions.

The figure above shows the predictive model, which finds an equation of the regression line. Using this equation, the model predicts the value for a new data point.
Terminologies Related to the Regression Analysis:
• Dependent Variable: The main factor in Regression analysis which we
want to predict or understand is called the dependent variable. It is also
called target variable.
• Independent Variable: The factors which affect the dependent variables
or which are used to predict the values of the dependent variables are
called independent variable, also called as a predictor.
• Outliers: Outlier is an observation which contains either very low value
or very high value in comparison to other observed values. An outlier
may hamper the result, so it should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each other, this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting: If our algorithm works well with the
training dataset but not well with test dataset, then such problem is
called Overfitting. And if our algorithm does not perform well even with
training dataset, then such problem is called underfitting.

-Linear Regression
There are three main types of linear regression:
1. Simple Linear Regression
This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple linear
regression is:
Y = β0 + β1X
where:
 Y is the dependent variable
 X is the independent variable
 β0 is the intercept
 β1 is the slope
2. Multiple Linear Regression
This involves more than one independent variable and one dependent variable.
The equation for multiple linear regression is:

Y = β0 + β1X1 + β2X2 + … + βnXn
where:
 Y is the dependent variable
 X1, X2, …, Xn are the independent variables
 β0 is the intercept
 β1, β2, …, βn are the slopes

The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables.

Some key points about MLR:


 For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
 Each feature variable must model the linear relationship with the
dependent variable.
 MLR tries to fit a regression line through a multidimensional space of
data-points.

Assumptions for Multiple Linear Regression:


 A linear relationship should exist between the Target and predictor
variables.
 The regression residuals must be normally distributed. (The residual (e) is the difference between the observed value of the dependent variable (y) and the predicted value (ŷ).)
 MLR assumes little or no multicollinearity (correlation between the
independent variable) in data.
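
A minimal sketch of fitting a multiple linear regression with scikit-learn; the feature values and names below are made-up illustrations, not data from the text:

from sklearn.linear_model import LinearRegression
import numpy as np

# Hypothetical data: two predictors (e.g., area, age) and a continuous target (price)
X = np.array([[50, 5], [60, 3], [80, 10], [100, 2], [120, 7]])
y = np.array([150, 180, 210, 300, 330])

model = LinearRegression()
model.fit(X, y)

print("Intercept (b0):", model.intercept_)
print("Slopes (b1, b2):", model.coef_)
print("Prediction for [90, 4]:", model.predict([[90, 4]]))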
Dummy Variable –
As we know, the multiple regression model often uses categorical data. Encoding categorical data is a good method to include non-numeric data in the regression model.
Categorical Data refers to data values which represent categories-data values
with the fixed and unordered number of values, for instance, gender
(male/female). In the regression model, these values can be represented by
Dummy Variables.
These variables take values such as 0 or 1, representing the absence or presence of a categorical value.
Variables that classify the data into mutually exclusive categories are called
dummy variables or indicator variables.

Dummy variable trap:


The dummy variable trap is a situation in which two or more dummy variables are highly correlated, i.e. multicollinearity exists. This means the value of one variable can be predicted from the values of the others.
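
As a small illustration (with hypothetical column names), pandas can create dummy variables from a categorical column; dropping the first level is one way to avoid the dummy variable trap:

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"],
                   "salary": [50, 60, 55, 52]})

# drop_first=True keeps k-1 dummies for k categories, avoiding the dummy variable trap
dummies = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(dummies)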
3. Polynomial Regression
• If your data points clearly do not fit a linear regression (a straight line through the data points), the data might be ideal for polynomial regression.
• Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points.
Polynomial Regression is a regression algorithm that models the relationship
between a dependent(y) and independent variable(x) as nth degree polynomial.
The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1² + b3x1³ + … + bnx1ⁿ
• Need for Polynomial Regression:
• If we apply a linear model to a linear dataset, it gives a good result, as we have seen in Simple Linear Regression. But if we apply the same model, without any modification, to a non-linear dataset, it produces poor predictions: the loss function increases, the error rate is high, and accuracy decreases.
• So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model.
We can understand it in a better way using the below comparison diagram of the
linear dataset and non-linear dataset.

Advantages of using Polynomial Regression:


 A broad range of functions can be fit under it.
 Polynomials can fit a wide range of curvature.
 A polynomial can provide the best approximation of the relationship between the dependent and independent variables.
Disadvantages of using Polynomial Regression
 Polynomial models are very sensitive to outliers.
 The presence of one or two outliers in the data can seriously affect the results of a nonlinear analysis.
 In addition, there are unfortunately fewer model-validation tools for detecting outliers in nonlinear regression than for linear regression.
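
A short sketch of polynomial regression using scikit-learn's PolynomialFeatures to expand x into powers before fitting a linear model; the degree and data are illustrative assumptions:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 4, 9, 16, 25])          # roughly quadratic pattern

# Expand x into [1, x, x^2] and fit an ordinary linear model on the expanded features
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("Prediction for x=6:", model.predict(poly.transform([[6]])))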
Equations of the Regression Models:

Simple Linear Regression equation:

y = b0+b1x .........(a)

Multiple Linear Regression equation:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn .........(b)

Polynomial Regression equation:


y = b0 + b1x1 + b2x1² + b3x1³ + … + bnx1ⁿ ..........(c)

Stepwise regression
Stepwise regression is a method of fitting a regression model by iteratively
adding or removing variables. It is used to build a model that is accurate and
parsimonious, meaning that it has the smallest number of variables that can
explain the data.

Following are the important aspects of stepwise regression:


Forward Selection – In forward selection, the algorithm starts with an empty
model and iteratively adds variables to the model until no further improvement
is made.
Backward Elimination – In backward elimination, the algorithm starts with a
model that includes all variables and iteratively removes variables until no
further improvement is made.
Stepwise Selection: This combines elements of both forward selection and
backward elimination. Start with an empty model or a model with all predictors.
At each step, add or remove predictors based on their statistical significance or
impact on the model criteria. The process continues until adding or removing
predictors no longer improves the model.
The advantage of stepwise regression is that it can automatically select the most
important variables for the model and build a parsimonious model. The
disadvantage is that it may not always select the best model, and it can be
sensitive to the order in which the variables are added or removed.
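
scikit-learn does not implement classical p-value-based stepwise regression, but forward or backward selection can be sketched with SequentialFeatureSelector; the dataset and number of features to select are illustrative choices:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Forward selection: start with an empty model and add features until 5 are chosen
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=5,
                                     direction="forward")
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

Setting direction="backward" instead starts from all features and removes them, mirroring backward elimination.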

Ridge Regression
Ridge Regression, also known as Tikhonov regularization, is a technique
used to improve the performance of linear regression models by adding a
penalty to the magnitude of the coefficients.
This helps prevent overfitting and makes the model more robust, especially
when dealing with highly correlated features or when the number of features is
large compared to the number of observations.
Why Use Ridge Regression?
1. Overfitting: In machine learning, overfitting occurs when a model learns
not only the underlying patterns but also the noise in the training data.
This leads to poor performance on new, unseen data. Ridge Regression
helps to reduce overfitting by constraining the size of the coefficients.
2. Multicollinearity: When predictor variables are highly correlated,
ordinary least squares (OLS) regression can produce large coefficients
with high variance. Ridge Regression stabilizes these coefficients by
penalizing their size.
3. High-Dimensional Data: When there are many features, especially
relative to the number of observations, Ridge Regression can prevent the
model from fitting the noise in the data by applying a regularization
penalty.
How Does Ridge Regression Work?
Objective Function: Ridge Regression modifies the standard linear regression objective function by adding a penalty term:

Minimize: RSS + λ Σ βj²

Where:
 RSS: Residual Sum of Squares, which measures the model’s fit.
 λ: Regularization parameter that controls the strength of the penalty.
 βj: Coefficients of the predictors.
Regularization Term: The term λ Σ βj² adds a cost proportional to the square of the magnitude of the coefficients. This penalizes large coefficients, which helps in reducing their impact and thus prevents overfitting.
Choosing λ: The parameter λ controls the trade-off between fitting the data and
keeping the coefficients small. If λ is zero, Ridge Regression is equivalent to
ordinary least squares. As λ increases, the coefficients are shrunk more, which
can lead to a simpler model.
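
A minimal Ridge example; in scikit-learn the parameter alpha plays the role of λ above, and the data and value 1.0 are only illustrations:

from sklearn.linear_model import Ridge
import numpy as np

X = np.array([[1, 2], [2, 4.1], [3, 6.2], [4, 8.1]])   # nearly collinear features
y = np.array([3, 6, 9, 12])

ridge = Ridge(alpha=1.0)     # alpha corresponds to the regularization parameter lambda
ridge.fit(X, y)
print("Shrunken coefficients:", ridge.coef_)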

Lasso regression
Lasso regression, short for Least Absolute Shrinkage and Selection Operator, is
a regularization technique used in linear regression models to enhance
prediction accuracy and interpretability. It adds a penalty term to the regression
model that encourages sparsity in the coefficients, which can effectively
perform feature selection.
Lasso regression is a type of linear regression that includes a regularization
term based on the absolute values of the coefficients. This technique helps in
preventing overfitting and can simplify the model by forcing some coefficients to
be exactly zero.

1. Objective Function
Lasso regression modifies the ordinary least squares (OLS) objective function by adding a penalty proportional to the sum of the absolute values of the coefficients. The modified objective function is:

Minimize: RSS + λ Σ |βj|

Where:
 RSS: Residual Sum of Squares, which measures the model's fit to the data.
 λ: Regularization parameter that controls the strength of the penalty.
 |βj|: Absolute value of the coefficients.
2. Regularization Term
The term λ Σ |βj| is the Lasso penalty. Unlike Ridge Regression, which uses the squared magnitude of the coefficients for regularization, Lasso uses the absolute magnitude. This has the effect of shrinking some coefficients exactly to zero, leading to a sparse model.
3. Effect of the Regularization Parameter λ
 When λ=0 Lasso regression reduces to ordinary least squares (OLS)
regression, as there is no penalty on the coefficients.
 When λ is large: The penalty term becomes significant, causing more
coefficients to be shrunk to zero. This results in a simpler model with
fewer predictors.
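
A comparable Lasso sketch; with a large enough alpha (the λ above) some coefficients are driven exactly to zero. The data and alpha value are illustrative:

from sklearn.linear_model import Lasso
import numpy as np

X = np.array([[1, 0.5, 10], [2, 1.1, 9], [3, 1.4, 11], [4, 2.2, 10]])
y = np.array([2, 4, 6, 8])

lasso = Lasso(alpha=0.5)     # alpha is the regularization strength lambda
lasso.fit(X, y)
print("Coefficients (some may be exactly zero):", lasso.coef_)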

Elastic Net Regression


Elastic Net Regression is a powerful machine learning algorithm that combines
the features of both Lasso and Ridge Regression. It is a regularized regression
technique that is used to deal with the problems of multicollinearity and
overfitting, which are common in high-dimensional datasets.

1. Ridge Regression (L2 Regularization):


o Adds a penalty proportional to the square of the magnitude of the coefficients, i.e. λ Σ βj².
o This helps to shrink the coefficients and prevent overfitting, but it doesn’t perform feature selection.
2. Lasso Regression (L1 Regularization):
o Adds a penalty proportional to the absolute value of the magnitude of the coefficients, i.e. λ Σ |βj|.
o This can shrink some coefficients to zero, effectively performing
feature selection.

Elastic Net Regression


Elastic Net combines both penalties:

Minimize: Loss Function + λ [ α Σ |βj| + (1 − α) Σ βj² ]

Where:
 Loss Function is typically the mean squared error for regression.
 α is a mixing parameter between Ridge and Lasso (with α = 0 corresponding to Ridge and α = 1 corresponding to Lasso).
 λ is a regularization parameter that controls the overall strength of the
regularization.
Advantages of Elastic Net
Feature Selection and Shrinkage: Elastic Net can perform feature selection
like Lasso while also providing the shrinkage of Ridge, which can be
advantageous when features are correlated.
Handling Correlated Features: Elastic Net is better suited for datasets with
correlated features than Lasso alone, as Lasso tends to arbitrarily select one
feature from a group of correlated features while ignoring the others.
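
In scikit-learn, ElasticNet exposes the mixing parameter as l1_ratio (the α above) and the overall strength as alpha (the λ above); the data and values here are illustrative:

from sklearn.linear_model import ElasticNet
import numpy as np

X = np.array([[1, 1.1], [2, 2.1], [3, 2.9], [4, 4.2]])   # correlated features
y = np.array([2, 4, 6, 8])

# l1_ratio=0.5 gives an equal mix of the Lasso (L1) and Ridge (L2) penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print("Coefficients:", enet.coef_)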
Bayesian Regression
Bayesian regression is a statistical approach to regression that incorporates
Bayesian principles to estimate the distribution of model parameters. Unlike
traditional regression methods, which provide point estimates of parameters,
Bayesian regression provides a full probability distribution for each parameter,
which allows for a more nuanced understanding of the uncertainties associated
with the estimates.
Key Concepts in Bayesian Regression
1. Bayesian Inference:
o Bayesian inference updates the probability distribution of a
hypothesis (in this case, the regression parameters) based on
observed data. The goal is to compute the posterior distribution of
the parameters given the data.
2. Prior Distribution:
o Before observing the data, you specify a prior distribution for the
regression coefficients. This reflects your beliefs or assumptions
about the parameters before seeing the data. It can be informed by
prior knowledge or chosen to be non-informative if you prefer not
to impose strong prior beliefs.
3. Likelihood:
o The likelihood function describes how likely the observed data is
given the parameters of the model. In regression, this is typically
based on the assumption of normally distributed errors.
4. Posterior Distribution:
o The posterior distribution combines the prior distribution and the
likelihood of the observed data. It represents the updated beliefs
about the parameters after observing the data. This is computed
using Bayes' theorem:

P(θ∣data) = [ P(data∣θ) · P(θ) ] / P(data)

Where:
 P(θ∣data) is the posterior distribution of the parameters θ given the
data.
 P(data∣θ) is the likelihood of the data given the parameters.
 P(θ) is the prior distribution of the parameters.
 P(data) is the marginal likelihood or evidence, which normalizes the
distribution.
Predictive Distribution:
 The predictive distribution for new data points incorporates the uncertainty
in the parameter estimates. It is obtained by integrating over the posterior
distribution of the parameters.
Advantages of Bayesian Regression
Incorporates Prior Knowledge: You can incorporate prior knowledge or beliefs
about the parameters through the prior distribution.
Quantifies Uncertainty: Provides a full probability distribution for each
parameter, offering a richer understanding of parameter uncertainty.
Flexibility: Allows for more flexible modeling, especially in situations with
limited data or when you want to incorporate specific assumptions.
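
scikit-learn's BayesianRidge gives a practical flavour of Bayesian linear regression: it returns both a point prediction and an uncertainty estimate. The data below are made up for illustration:

from sklearn.linear_model import BayesianRidge
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

model = BayesianRidge()
model.fit(X, y)

# return_std=True also returns the standard deviation of the predictive distribution
mean, std = model.predict([[6]], return_std=True)
print("Predicted mean:", mean, "Predictive std:", std)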

2.2.2 Evaluation methods for regression:


Evaluation metrics for regression are essential for assessing the performance of regression models specifically. These metrics help in measuring how well a regression model is able to predict continuous outcomes.
By utilizing these regression-specific metrics, data scientists and machine learning engineers can evaluate the accuracy and effectiveness of their regression models in making predictions.
Common evaluation metrics for regression include:
Mean Absolute Error (MAE),
Mean Squared Error (MSE),
Root Mean Squared Error (RMSE),
R-squared (Coefficient of Determination), and
Mean Absolute Percentage Error (MAPE).
Mean Absolute Error (MAE)
Definition: The average of the absolute differences between predicted and actual values.
Mathematical Formula
The formula to calculate MAE for a dataset with “n” data points is:

MAE = (1/n) Σ |xi − yi|

Where:
 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.

from sklearn.metrics import mean_absolute_error

true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]

mae = mean_absolute_error(true_values, predicted_values)
print("Mean Absolute Error:", mae)

Root Mean Squared Error (RMSE)

RMSE stands for Root Mean Squared Error. It is a commonly used metric in regression analysis and machine learning to measure the accuracy or goodness of fit of a predictive model, especially when the predictions are continuous numerical values.
The RMSE quantifies how well the predicted values from a model align with
the actual observed values in the dataset. Here’s how it works:
Calculate the Squared Differences: For each data point, subtract the predicted
value from the actual (observed) value, square the result, and sum up these
squared differences.
Compute the Mean: Divide the sum of squared differences by the number of
data points to get the mean squared error (MSE).
Take the Square Root: To obtain the RMSE, simply take the square root of the
MSE.
Mathematical Formula
The formula for RMSE for a dataset with ‘n’ data points is as follows:

RMSE = sqrt( (1/n) Σ (xi − yi)² )

Where:
 RMSE is the Root Mean Squared Error.
 xi represents the actual or observed value for the i-th data point.
 yi represents the predicted value for the i-th data point.

from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data
true_prices = np.array([250000, 300000, 200000, 400000, 350000])
predicted_prices = np.array([240000, 310000, 210000, 380000, 340000])

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(true_prices, predicted_prices))
print("Root Mean Squared Error (RMSE):", rmse)

R-Square error
R-squared (R²) Score
A statistical metric frequently used to assess the goodness of fit of a regression model is the R-squared (R²) score, also referred to as the coefficient of determination. It quantifies the proportion of the variation in the dependent variable that is explained by the model's independent variables. R² is a useful statistic for evaluating the overall effectiveness and explanatory power of a regression model.
Mathematical Formula:
The formula to calculate the R-squared score is as follows:

R² = 1 − (SSR / SST)

Where:
 R² is the R-squared score.
 SSR represents the sum of squared residuals between the predicted values and actual values.
 SST represents the total sum of squares, which measures the total variance in the dependent variable.
from sklearn.metrics import r2_score
true_values = [2.5, 3.7, 1.8, 4.0, 5.2]
predicted_values = [2.1, 3.9, 1.7, 3.8, 5.0]
r2 = r2_score(true_values, predicted_values)
print("R-squared (R²) Score:", r2)

2.3. Classification in ML: Classification Algorithms


Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data. In classification, the model is
fully trained using the training data, and then it is evaluated on test data before
being used to perform prediction on new unseen data.
For instance, an algorithm can learn to predict whether a given email is spam or ham (not spam), as illustrated below.

The main goal of classification is to identify the category/class under which a new data point falls.
Classification algorithms can exhibit two types of learning behavior:
Lazy learners: These classifiers first store the training dataset and wait until they receive the test dataset. Here, classification is done on the basis of the most related data stored in the training dataset. They take less time in training but more time for predictions. Example: KNN

Eager learners: The classifier develops a classification model based on a


training dataset before receiving a test dataset. It must be able to commit to a
single hypothesis that covers the entire instance space. Classifiers take more
time in training and less time in prediction. Example: Decision Trees, Naive
Bayes, Artificial Neural Network
Some terminologies
Classifier: An algorithm that maps the input data to a specific category.
Classification model: A classification model tries to draw some conclusion
from the input values given for training. It will predict the class
labels/categories for the new data.
Feature: A feature is an individual measurable property of a phenomenon
being observed.
Binary classification: Classification task with two possible outcomes. E.g.:
Email classification (Spam/Not Spam).
Multi-class classification: Classification with more than two classes. In
multiclass classification each sample is assigned to one and only one target
label. e.g. An animal can be cat or dog or horse but only one at a time.
Multi-label classification: Classification task where each sample is mapped to
a set of target labels (more than one class). E.g.: A news article can be about
sports, a person, and location at the same time

2.3.1. Logistic Regression


Logistic regression is a supervised machine learning technique for binary
classification, where the goal is to predict one of two possible outcomes. Despite
its name, logistic regression is used for classification, not regression. It estimates
the probability that a given input belongs to a certain class.
Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either
Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0
and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic regression is very similar to linear regression except in how they are used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic (sigmoid) function that returns a probability value which can be mapped to two or more discrete classes.

Threshold value: The sigmoid function is asymptotically bounded between 0 and 1. To predict which class a data point belongs to, a threshold can be set. Based upon this threshold, the estimated probability is assigned to the respective class: a value above the threshold is mapped to 1 and a value below it is mapped to 0. For example, if the threshold value is 0.5 and the predicted value is 0.8, then the predicted value > 0.5, so the classifier will classify it as 1; otherwise it classifies it as 0.
The logistic regression equation can be obtained from the linear regression equation:

y = b0 + b1x1 + b2x2 + … + bnxn

In logistic regression y can only lie between 0 and 1, so we divide the above equation by (1 − y):

y / (1 − y)

But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it becomes:

log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + … + bnxn

For logistic regression, following are the assumptions made about the data
1. The dependent variable must be categorical in nature.
2. The independent variable should not have multi-collinearity
Based on type of categories the logistic regression is classified into three
categories-
Binary logistic regression: The categorical response has only two possible outcomes. Example: Spam or Not Spam, True or False.
Multinomial logistic regression: The response has more than two categories without ordering. Example: predicting which food is preferred more (Veg, Non-Veg, Vegan).
Ordinal logistic regression: The response has more than two categories with ordering. Example: movie rating from 1 to 5, or taste of food as "Bad", "Good", "Excellent".

Difference between Linear Regression and Logistic Regression


Advantages of Logistic regression:
1. Logistic regression is simple, easy to implement, train and test.
2. Training a model with this algorithm doesn't require high computation
power.
3. Logistic regression proves to be very efficient when the dataset has
features that are linearly separable.
4. Due to its simple probabilistic interpretation, the training time of logistic
regression algorithm comes out to be far less than most complex
algorithms, such as an Artificial Neural Network.
5. This algorithm can easily be extended to multi-class classification.
6. Logistic regression not only gives a measure of how relevant a predictor
(coefficient size) is, but also its direction of association (positive or
negative)
Disadvantages of Logistic regression:
1. Main limitation of Logistic regression is the assumption of linearity
between the dependent variable and the independent variables.
2. In the real world, the data is rarely linearly separable.
3. If the number of observations is less than the number of features, Logistic Regression should not be used; otherwise it may lead to overfitting.
4. Logistic regression can only be used to predict discrete functions. Therefore the dependent variable of Logistic regression is restricted to the discrete number set.
5. Logistic regression requires little or no multicollinearity between the independent variables.
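
As a brief illustration of the ideas above, a small binary classification sketch with scikit-learn's LogisticRegression; predict_proba returns the estimated probabilities and predict applies the default 0.5 threshold. The data are made up:

from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical feature: hours studied; target: pass (1) or fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print("P(fail), P(pass) for 3.5 hours:", clf.predict_proba([[3.5]]))
print("Predicted class for 3.5 hours:", clf.predict([[3.5]]))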

2.3.2. Decision Tree


Decision tree is a supervised machine learning algorithm used for both prediction
and classification but most often used for classification. It builds the classification
model in the form of a tree structure. While the regression model gives a
continuous value as output, the classification tree is used for Yes/No outputs. The
decision variable is Categorical/ discrete. Such a tree is built through a process
known as binary recursive partitioning.
Binary Recursive Partitioning
A decision tree is a flowchart-like tree structure where an internal node represents
feature (or attribute), the branch represents a decision rule, and each leaf node
represents the outcome. The topmost node in a decision tree is known as the root
node. It learns to partition on the basis of the attribute value. It partitions the tree
in a recursive manner called recursive partitioning. This flowchart-like structure
helps you in decision-making. It's visualization like a flowchart diagram which
easily mimics the human level thinking. That is why decision trees are easy to
understand and interpret.
Decision Tree algorithm
The algorithms to generate the decision tree adopt a greedy strategy by making
a series of locally optimal decisions to select the attribute to partition the set.
One popular algorithm is the Hunt's algorithm which is the basis for many
others such as ID3 (Iterative Dichotomiser 3), C4.5 and CART(Classification
and Regression Tree).
The basic idea is given below:
1. Iterate through every unused attribute of set S and calculate the Entropy (H) and Information Gain (IG) of this attribute.
2. Select the attribute which has the smallest entropy or largest information gain.
3. Split set S by the selected attribute to produce subsets of the data.
4. The algorithm continues recursively on each subset, considering only attributes never selected before.
Attribute Selection
An attribute selection measure (ASM) is a heuristic for selecting the splitting criterion that partitions the data in the best possible manner. ASM provides a rank to each feature (or attribute), and the attribute with the best score is selected as the splitting attribute. The most popular selection measures are Information Gain, Gain Ratio, and Gini Index.
Information Gain: Entropy is a metric to measure the impurity or randomness in a given attribute. If the dataset contains only one attribute, it is calculated as:

Entropy(S) = − Σc p(c) log2 p(c)

Where S is the current data set, c ranges over the classes in S, and p(c) is the proportion of the number of elements in class c to the number of elements in S. For multiple attributes, the entropy after splitting set T on attribute X is calculated as:

Entropy(T, X) = Σx P(x) · Entropy(x)

Where the x are the values of attribute X in set T and P(x) is the proportion of elements of T taking value x.


Information gain IG(A) is the measurement of the change in entropy after the segmentation of a dataset based on an attribute. It is the difference in entropy in set T before and after the split on attribute X. It is calculated as:

IG(T, X) = Entropy(T) − Entropy(T, X)

According to the value of information gain, we split the node and build the
decision tree. A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest information gain is
split first.
Gain Ratio: Information gain is biased towards attributes with many outcomes. It means it prefers attributes with a large number of distinct values. Gain ratio handles this bias by normalizing the information gain using Split Info:

SplitInfoA(D) = − Σj (|Dj| / |D|) · log2(|Dj| / |D|),  j = 1, …, v

Where |Dj| / |D| acts as the weight of the jth partition and v is the number of discrete values in attribute A.

The gain ratio can be defined as:

GainRatio(A) = Gain(A) / SplitInfoA(D)
Gini Index: The Gini Index can be thought of as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one:

Gini(D) = 1 − Σi pi²

Where pi is the probability that a tuple in D belongs to class Ci.
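
A brief DecisionTreeClassifier sketch; criterion="entropy" selects splits by information gain, while the default "gini" uses the Gini index. The iris dataset and max_depth value are only illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# criterion="entropy" uses information gain; "gini" uses the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))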

2.3.3 k-Nearest Neighbors


The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised
learning classifier, which uses proximity to make classifications or predictions
about the grouping of an individual data point. K-NN algorithm can be used for
Regression as well as for Classification but mostly it is used for the
Classification problems.

K-NN is a non-parametric algorithm, which means it does not make any


assumption on underlying data.
It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
K-NN algorithm assumes the similarity between the new case/data and available
cases and put the new case into the category that is most similar to the available
categories.
K-NN algorithm stores all the available data and classifies a new data point
based on the similarity. This means when new data appears then it can be easily
classified into a well suite category by using K- NN algorithm.

The figure above shows a graph for a dataset consisting of two numeric features X and Y along with target classes A and B. The graph is a scatter plot of X against Y, where the star symbol indicates Class A and the filled triangle indicates Class B.
The problem is to find the class of the new data point represented by the question mark (?). This data point will belong to either class A or class B and nothing else.
To solve this kind of problem, we need the K-NN algorithm. The "K" in K-NN is the number of nearest neighbours we wish to take a vote from. Let's say K = 3. Hence, we now draw a circle centred on the new data point, just big enough to enclose only three data points on the plane.
Out of the three points closest to the new data point, two belong to Class B and one belongs to Class A. Hence, with a good confidence level, we can say that the new data point (indicated by "?") should belong to Class B. Here the choice is obvious, as two of the three votes from the closest neighbours went to Class B. But the right choice of the parameter value K is very crucial in this algorithm.

Algorithm: K-Nearest Neighbor (KNN)


Step 1: Load the training and testing data.
Step 2: Choose the value of K as number of neighbors.
Step 3: For each point in test data:
 Find the distance between test data and each row of training data
using distance metric like Euclidean distance, Manhattan distance,
Minkowski distance, etc.
 Store the distances in a list and sort this list in ascending order.
 Choose the first K-data points from the sorted list (these are min
distance values)
 Assign the class label to the new data point based on the majority
of classes present in the chosen points.
Step 4: End

Distance Metrics
It should be noted that Manhattan distance, Minkowski distance and Euclidean
distance measures are only valid for continuous variables. For categorical data,
Hamming distance must be used.
How to select value of K in K-NN?
 There is no particular way to determine the best value for "K", so we
need to try some values to find the best out of them.
 A very low value of K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
 Large values of K are good, but they may be computationally expensive.

Advantages of KNN Algorithm:


 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because the distance between the new data point and all the training samples must be calculated.
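
A minimal KNeighborsClassifier sketch following the algorithm steps above; the value of K and the data points are illustrative:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Two numeric features; classes A and B as in the figure description
X = np.array([[1, 2], [2, 3], [3, 3], [6, 7], [7, 8], [8, 8]])
y = np.array(["A", "A", "A", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X, y)
print("Predicted class for (5, 6):", knn.predict([[5, 6]]))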

2.3.4. Naive Bayes


The Naïve Bayes classifier is a supervised machine learning algorithm that is
used for classification tasks such as text classification. It uses principles of
probability to perform classification tasks.
It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object. A naïve Bayes classifier assumes that the presence or
absence of a particular feature of a class is unrelated to the presence or absence
of other features
It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes' Theorem:
 Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.
 The formula for Bayes' theorem is given as:

P(A|B) = [ P(B|A) · P(A) ] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
P(B) is Marginal Probability: Probability of Evidence.

Applications
 Naïve Bayes classifiers can also be used for fraud detection.
 In the domain of auto insurance, based on a training set with attributes such as driver's rating, vehicle age, vehicle price, historical claims by the policy holder, police report status, and claim genuineness, Naive Bayes can provide probability-based classification of whether a new claim is genuine or not.
 This classifier is mainly used in text classification that includes a high
dimensional training dataset.
 Naive Bayes based Bayesian spam filtering has become a popular
mechanism to distinguish spam e-mail from legitimate e-mail.
 Many modern mail clients implement variants of Bayesian spam filtering

Advantages of Naive Bayes Classifier


 It can be used for Binary as well as Multi-class Classifications.
 It performs well in Multi-class predictions as compared to the other
Algorithms. It is the most popular choice for text classification problems.
 It is easy to implement and can execute efficiently even without prior
knowledge of the data.
Disadvantages of Naïve Bayes Classifier
Naive Bayes assumes that all features are independent or unrelated, so it cannot
learn the relationship between features
Types of Naïve Bayes Classifiers
Gaussian Naïve Bayes
 Used when data is as per the Gaussian distribution
 Predictors are continuous variables
Multinomial Naïve Bayes
 Uses frequency of present words as features
 Commonly used for document classification problems
 Popular for discrete features as well
Bernoulli Naïve Bayes
 Predictors are Boolean variables
 Used when data is as per multivariate Bernoulli distribution
 Popular for discrete features
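
A short GaussianNB example for continuous features; the feature values and class labels are made up for illustration:

from sklearn.naive_bayes import GaussianNB
import numpy as np

X = np.array([[5.0, 3.0], [5.2, 3.1], [6.8, 2.2], [7.0, 2.4]])
y = np.array(["apple", "apple", "orange", "orange"])

nb = GaussianNB()
nb.fit(X, y)
print("Predicted class:", nb.predict([[5.1, 3.0]]))
print("Class probabilities:", nb.predict_proba([[5.1, 3.0]]))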

2.3.5. Support Vector Machines (SVM)


Support Vector Machine (SVM) is a supervised machine learning algorithm
which can be used for both classification and regression problems. Mostly, it is
used in classification of both linear and non-linear data.
In the SVM algorithm, we plot each data item as a point in n-dimensional space
(where n is number of features you have) with the value of each feature being
the value of a particular coordinate.
Support vectors are simply the coordinates of individual observations. We then perform classification by finding the hyperplane that differentiates the two classes very well. The SVM classifier is the hyperplane or line which best segregates the two classes.
To separate the two classes of data points, there are many possible hyperplanes
that could be chosen. Our objective is to find a plane that has the maximum
margin, i.e. the maximum distance between data points of both classes.
Maximizing the margin distance provides some reinforcement so that future
data points can be classified with more confidence.
Hyperplanes are decision boundaries that help classify the data points. Data
points falling either side of the hyperplane can be attributed to different classes.
Also, the dimension of the hyperplane depends upon the number of features.
If the number of input features is 2, then the hyperplane is just a line. If the
number of input features is 3, then the hyperplane becomes a two dimensional
plane.
Support vectors are the data points closest to the hyperplane, and they play a critical role in deciding the hyperplane and margin. Using these support vectors, we maximize the margin of the classifier.
Margin: Margin is the distance between the support vector and hyperplane. The
main objective of the support vector machine algorithm is to maximize the
margin. The wider margin indicates better classification performance.

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which
means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data, and
classifier is used called as Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated
data, which means if a dataset cannot be classified by using a straight
line, then such data is termed as non-linear data and classifier used is
called as Non-linear SVM classifier.

Kernel: A kernel is a mathematical function used in SVM to map the original input data points into a high-dimensional feature space, so that a separating hyperplane can be found even if the data points are not linearly separable in the original input space. Some of the common kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid.
Advantages of SVM classifiers
1. SVM classifiers give high accuracy.
2. SVM classifiers use little memory because they use only a subset of the training points (the support vectors) in the decision function.
3. SVM is effective and work well with high dimensional space.
4. SVM is effective in cases where the number of dimensions is greater than
the number of samples.
5. SVM uses a subset of training points in the decision function (called
support vectors), so it is also memory efficient.
Disadvantages of SVM classifiers
1. SVM have high training time hence in practice not suitable for large
datasets.
2. SVM classifiers do not work well with overlapping classes.
3. It also doesn't perform very well, when the data set has more noise.
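
A compact SVC sketch; kernel="linear" corresponds to the linear SVM and kernel="rbf" to a common non-linear SVM. The data points are illustrative:

from sklearn.svm import SVC
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")     # use kernel="rbf" for non-linearly separable data
svm.fit(X, y)

print("Support vectors:\n", svm.support_vectors_)
print("Prediction for (4, 4):", svm.predict([[4, 4]]))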

2.3.6. Evaluation methods: Confusion matrix, precision, recall, F1-


score, Accuracy, ROC-AUC curve.
Evaluating the performance of a machine learning algorithm is very important. Choosing the correct evaluation metric influences how the performance of machine learning models will be compared. Various metrics can be used for evaluating the performance of a machine learning model.
When testing the accuracy of a classifier on examples whose real classes are known, there are four possible cases:
I. True Positives (TP): These are tuples having actual value as true
and the predicted value by classifier is also true.
II. True Negatives (TN): These are tuples having actual value as false
and the predicted value by classifier is also false.
III. False Positives (FP): These are tuples having actual value as false
and the predicted value by classifier is true.
IV. False Negatives (FN): These are tuples having actual value as true
and the predicted value by classifier is false
Type I error: This type of error occurs when false tuples are predicted by
classifier as true.
Type II error: This type of error occurs when true tuples are predicted by
classifier as false
Accuracy
Accuracy is a measure to evaluate the performance of a classifier which compares the correct predictions of the classifier with the overall number of input data samples.
Definition: The accuracy of a classifier on a given test dataset is the percentage of test dataset instances that are correctly classified by the classifier:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy doesn't give the best picture of the cost of misclassification or of an unbalanced test dataset. It can be easily calculated using the confusion matrix.

Confusion matrix:
The confusion matrix is a matrix used to determine the performance of the
classification models for a given set of test data.
It is a tabular representation of the model predicted values versus actual
values.

Example: Let's assume we have a binary classification problem. We have some samples belonging to two classes, YES or NO. Also, we have our own classifier which predicts a class for a given input sample. On testing our model on 165 samples, we get the following result (a confusion matrix of predicted versus actual classes).

Accuracy for the matrix can be calculated by summing the values lying on the main diagonal and dividing by the total number of samples, i.e. Accuracy = (TP + TN) / 165.

Precision
Definition: Precision is defined as the ratio of correctly classified positive samples (True Positives) to the total number of samples classified as positive (either correctly or incorrectly):

Precision = TP / (TP + FP)

The precision_score function from sklearn.metrics can be used to calculate the precision score of a classifier.
Recall:
Recall is calculated as the ratio of the number of positive samples correctly classified as positive to the total number of positive samples:

Recall = TP / (TP + FN)

Recall measures the model's ability to detect positive samples. The higher the recall, the more positive samples are detected. It is also called the true positive rate (TPR) or sensitivity.

Recall Vs Precision

F1 Score:
In some applications it is necessary to give higher priority to recall, and in other applications it is necessary to give precision higher priority. But there are many applications in which recall and precision both need to be treated as equally important. An alternative way to use both precision and recall is to combine them into a single measure. This is the approach of the F measure. One popular metric which combines precision and recall is the F1-score, which is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

ROC-AUC curve
The area under the receiver operating characteristic (ROC) curve (AUC) is a
performance metric used to evaluate machine learning models that perform
binary classification.
It's a probability curve that plots the true positive rate (TPR) on the y-axis
and the false positive rate (FPR) on the x-axis at different thresholds.
The AUC is a value between 0 and 1, with 0 indicating a poor model and 1
indicating a perfect model.

Basically, the ROC curve is a graph that shows the performance of a classification model at all possible thresholds (a threshold is a particular value beyond which you say a point belongs to a particular class). The curve is plotted between two parameters:
 TPR – True Positive Rate
 FPR – False Positive Rate
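
The classification metrics above are all available in sklearn.metrics; a small sketch with made-up labels and predicted scores:

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probabilities for class 1

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))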
2.4. Non-linear regression: Decision Tree, Support Vector, KNN
Non-linear regression:
Nonlinear regression refers to a broader category of regression models
where the relationship between the dependent variable and the
independent variables is not assumed to be linear.
If the underlying pattern in the data exhibits a curve, whether it’s
exponential growth, decay, logarithmic, or any other non-linear form, fitting
a nonlinear regression model can provide a more accurate representation of
the relationship. This is because in linear regression it is pre-assumed that
the data is linear.

A nonlinear regression model can be expressed as:

Y = f(X, β) + ϵ

Where,
 Y: Regression function (the dependent variable)
 X: The vector of independent variables, which are used to predict the dependent variable.
 β: The vector of parameters that the model aims to estimate. These parameters determine the shape and characteristics of the regression function f.
 ϵ: The error term.
Nonlinear regression tries to model the dependent variable as a function of
a combination of nonlinear parameters and one or more independent
variables. The model can be univariate (single response variable) or multivariate (multiple response variables).
Decision Tree Regression
Decision tree is a supervised machine learning algorithm used for both
prediction and classification. It is mostly used for classification. It is based
on divide and conquer strategy, where the input space is divided into smaller
regions until that becomes more manageable.
Decision trees can handle both categorical and numerical data. It is a tree-
structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents
the outcome.

Decision tree builds regression or classification models in the form of a tree


structure. To build a decision tree, we first split the space into number of
regions, and model the response by the mean of Y in each region.
We choose the variable and split-point to achieve the best fit. Then each
region is split into two more regions, and this process is continued, until
model does not meet any stopping criteria. The final result is a tree with
decision nodes and leaf nodes. A decision node has two or more branches,
each representing values for the attribute tested. Leaf node represents a
decision on the numerical target. The topmost decision node in a tree
which corresponds to the best predictor called root node.
When should we choose decision tree regression?
Consider the dataset with distribution of points as shown in figure (a). In
such a case, it is clear that a linear model fits well. However, if the data points
are randomly distributed and data has a non-linear shape as shown in figure
(b), then a linear model cannot capture the non-linear features. So in this
case, you can use the decision trees, which do a better job at capturing the
non-linearity in the data by dividing the space into smaller sub-spaces
depending on the questions asked.

How do we build the regression tree?


Consider a problem with response variable y and input variables x1, x2, . . . ,
xp.
1. We divide the predictor space (x1, x2, . . . , xp) into J distinct and non-overlapping regions, R1, R2, . . ., RJ.
2. For every observation that falls into region Rj, we model the response by the mean of the response values of the training observations in region Rj.
3. Choose the variable and split-point to achieve the best fit.
4. Then these regions are split into more regions, and this process
is continued, until some stopping rule is applied.

Consider the dataset shown below in (a). The region splits and decision
tree are shown in (b) and (c) respectively
Advantages
1. It is simple to understand as it mimics a human decision-making approach in real life.
2. These are very useful in finding solution for decision-related problems.
3. It generates all the possible solutions for a problem.
4. It requires less data preprocessing task as compared to other
algorithms.
5. Making predictions is fast just looking up constants in the tree and does
not include complicated calculations.
6. If some data is missing, we might not be able to go all the way down the
tree to a leaf, but we can still make a prediction by averaging all the
leaves in the sub-tree we do reach.
7. The model gives a jagged response, so it can work when the true
regression surface is not smooth.
8. There are fast, reliable algorithms to learn these trees.

Disadvantages
1. Decision trees are sensitive to specific data means a little change in
training data may cause large change in decision tree and that affects
final prediction or output.
2. Decision trees are computationally expensive due to formation of sub
trees and their comparisons.
3. Decision trees carry a risk of over fitting the data.
4. If the dataset has many uncorrelated features, then decision trees become complex and give poor performance.
5. Building decision-trees on dataset having too many features is
exhaustive and time consuming
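
A minimal DecisionTreeRegressor sketch; the tree predicts the mean of the training responses in each region it creates. The data and max_depth are illustrative:

from sklearn.tree import DecisionTreeRegressor
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.2, 1.9, 3.1, 3.8, 8.2, 8.9, 9.1, 9.6])   # non-linear, step-like pattern

reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X, y)
print("Prediction for x=4.5:", reg.predict([[4.5]]))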
Support Vector Regression
Support Vector Machine (SVM) is a popular supervised machine learning
algorithm used for classification and regression analysis. It was first introduced in 1992 by Vladimir Vapnik and his colleagues. The technique was originally designed for binary classification but can be extended to regression and multi-class classification.
The regression problem is a generalization of the classification problem, in
which the model returns a continuous-valued output
Terminology
Hyperplane: In SVM this is basically the separation line between the data
classes. In SVR, it defines as the line that will help us predict the continuous
value or target value.
Boundary lines: These are the two parallel lines drawn on either side of the hyperplane at the error threshold value ε (epsilon). These lines create a margin around the data points.
Support vectors: These are the data points which are closest to the
boundary. The data points or vectors that are the closest to the hyperplane
and which affect the position of the hyperplane.
Kernel: The function used to map a lower dimensional data into a higher
dimensional data. This is important function because SVR performs linear
regression in a higher dimension. There are many types of kernel functions
like Polynomial kernel, Gaussian kernel, Sigmoid kernel, etc

Figure showing One dimensional support vector regression

Support vector regression is a non-linear algorithm that finds the optimal hyperplane for the given input dataset in N-dimensional space (N is the number of features). The optimal hyperplane has the maximum margin of separation between the training data points and the hyperplane. While searching for the optimal hyperplane, the algorithm tries to find the support vectors or boundary points. These support vectors are picked such that the hyperplane is at the maximum possible distance from both support vectors. The figure above shows the optimal hyperplane with the maximum margin of separation and two decision boundaries on either side of the optimal hyperplane, along with potential support vectors.
Why use Support Vector Regression?
In simple regression we try to minimize the error rate. Here, we try to fit the error within a certain threshold. The objective is basically to consider the points that are within the boundary lines. The best-fit line is the hyperplane that contains the maximum number of points.
Our main aim here is to place decision boundaries at distance ∊ from the original hyperplane such that the data points closest to the hyperplane, the support vectors, are within those boundary lines. The width of the boundary is controlled by the error threshold ∊ (epsilon). These vectors are then used to perform the regression.

Advantages of SVR
1. SVR works really well with high dimensional data.
2. It is robust to outliers.
3. Decision model can be easily updated.
4. It has excellent generalization capability, with high prediction
accuracy.
5. Its implementation is easy.
6. It is effective in cases where the number of dimensions is greater
than the number of samples.

Disadvantages
1. It is not suitable when the data has no clear margin of separation i.e.
the target class contains overlapping data points.
2. It does not work well with large data sets
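
A small SVR sketch; epsilon sets the width of the error-insensitive boundary described above and kernel="rbf" maps the data to a higher dimension. The values of C, epsilon and the data are illustrative:

from sklearn.svm import SVR
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.8, 3.4, 3.9, 5.1, 6.2])

svr = SVR(kernel="rbf", C=10, epsilon=0.2)   # epsilon is the error threshold
svr.fit(X, y)
print("Prediction for x=3.5:", svr.predict([[3.5]]))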

Random Forest
To address the weaknesses of decision trees, a random forest can be used, which combines the power of multiple decision trees into one model. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. We will first understand the concept of ensemble learning.
Ensemble learning
An Ensemble method is a technique that combines the predictions from
multiple machine learning algorithms together to make more accurate
predictions than any individual model. A model comprised of many models
is called an Ensemble model.

Figure of Ensemble Learning


Types of Ensemble Learning
Boosting: Combines multiple weak learners (poorly accurate predictors)
into a strong learner (a model with strong accuracy). Each model learns
from its predecessor's mistakes.
Bagging: Creates a strong model from multiple independent weak models,
often trees that are trained in parallel.
Stacking: Combines multiple base learners using a meta-learner to train on
the features that are outputs of the base learners.
Voting: Combines the predictions from multiple other models to improve
model performance
Random forest: A branch of ensemble learning that automatically creates
a number of random decision trees.
What is Random Forest Regression?
Random forest regression is a supervised learning algorithm and bagging
technique that uses an ensemble learning method for regression in machine
learning.
The trees in random forests run in parallel, meaning there is no interaction
between these trees while building the trees.

How Does Random Forest Regression Work?


A random forest is a meta-estimator (i.e. it combines the result of multiple
predictions), which aggregates many decision trees with some helpful
modifications:
1. The number of features that can be split at each node is limited to some
percentage of the total (which is known as the hyper-parameter). This
ensures that the ensemble model does not rely too heavily on any
individual feature and makes fair use of all potentially predictive
features.
2. Each tree draws a random sample from the original data set when
generating its splits, adding a further element of randomness that
prevents overfitting.
The combined decision trees are called base models, and the ensemble can be represented more formally as:
g(x) = f0(x) + f1(x) + f2(x) + . . .
When used for classification, a random forest obtains a class vote from each tree and then classifies using the majority vote. When used for regression, the predictions from each tree at a target point x are simply averaged.
Note: A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Assumptions of Random Forest
1. There should be enough actual values in features of dataset to make
correct predictions.
2. The predictions from each tree must have very low correlations.

Random forest algorithm


1. Select K random data points from the training set.
2. Build the decision tree associated with the selected data points (subset).
3. Choose the number N of decision trees that you want to build.
4. Repeat Steps 1 and 2 until N trees are built.
5. For a new data point, find the prediction of each decision tree and assign the new data point to the category that wins the majority of votes.

Advantages
1. The random forest algorithm produces highly accurate results.
2. It works well on large datasets and can handle many input variables without variable deletion.
3. It is an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
Disadvantages
1. Random forests sometimes overfit on datasets with noisy classification/regression tasks.
2. For categorical attributes with different numbers of levels, random forests are biased towards attributes with more levels.
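
A brief RandomForestRegressor sketch; n_estimators is the number N of trees and each tree is trained on a bootstrap sample of the data. The data and parameter values are illustrative:

from sklearn.ensemble import RandomForestRegressor
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.1, 2.3, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1])

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Each tree predicts for x=4.5 and the forest averages the predictions
print("Prediction for x=4.5:", rf.predict([[4.5]]))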
2.5. Advantages and Disadvantages of Supervised Machine Learning
Advantages of Supervised Machine Learning
1. Predictive Accuracy: Supervised learning can achieve high accuracy
when trained on a well-labeled dataset.
2. Clear Objectives: It involves learning a mapping from inputs to
outputs, making it straightforward to evaluate performance using
metrics like accuracy, precision, and recall.
3. Interpretability: Many supervised learning algorithms, such as linear
regression or decision trees, allow for easier interpretation of how
inputs influence predictions.
4. Well-Defined Problems: Ideal for tasks with clear labels, such as
classification and regression, making it suitable for a wide range of
applications.
5. Data Utilization: It can effectively utilize large amounts of labeled
data, leading to better model generalization if the data is
representative.
Disadvantages of Supervised Machine Learning
1. Dependency on Labeled Data: Requires a large amount of high-quality labeled data, which can be expensive and time-consuming to obtain.
2. Overfitting Risk: Models can easily overfit the training data if not
properly regularized, leading to poor performance on unseen data.
3. Bias in Data: If the training data is biased, the model will likely inherit
and perpetuate those biases in its predictions.
4. Limited to Available Labels: Performance is constrained to the types
of labels present in the training set, making it difficult to generalize to
unseen classes or labels.
5. Complexity of Labeling: Labeling can be a complex task requiring
domain expertise, especially in cases with ambiguous or subjective
categories.

This concludes Chapter 2.
