SUPERVISED LEARNING
A parametric learning model summarizes the data with a set of fixed-size parameters (independent of the number of training instances). Parametric machine learning algorithms are those that optimize a function of a known form.
In a parametric model, you know exactly which model you are going to fit to the data, for example, a linear regression line:
b0 + b1*x1 + b2*x2 = 0
where b0, b1, b2 → the coefficients of the line that control the intercept and slope
x1, x2 → input variables.
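As an aside (a minimal sketch, not part of the original notes), fitting such a parametric linear model can be done with scikit-learn; the tiny dataset below is invented purely to show where the coefficients b0, b1, b2 come from:

    # Minimal sketch: fitting a parametric (linear) model with scikit-learn.
    # The toy data below is hypothetical and only for illustration.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # input variables x1, x2
    y = np.array([5.0, 4.0, 11.0, 10.0])                            # target values

    model = LinearRegression()
    model.fit(X, y)

    print("b0 (intercept):", model.intercept_)
    print("b1, b2 (coefficients):", model.coef_)

However many training instances we use, the fitted model is fully described by these few numbers, which is what makes it parametric.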
Some more examples of parametric machine learning algorithms include:
• Logistic Regression
• Perceptron
• Naive Bayes
A simple-to-understand nonparametric model is the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns. The only assumption it makes about the data set is that the training patterns that are most similar are most likely to have a similar result.
• k-Nearest Neighbors
2. Parametric models are able to infer the traditional measurements associated with normal distributions, including mean, median, and mode. While some nonparametric distributions are normally oriented, often one cannot assume the data comes from a normal distribution.
3. Feature engineering is important in parametric models, because you can poison a parametric model by feeding it many unrelated features. Nonparametric models handle feature engineering largely on their own: we can feed all the data we have to a nonparametric algorithm and it can ignore the unimportant features without causing overfitting.
4. A parametric model can predict future values using only its parameters. Nonparametric machine learning algorithms, while often slower and requiring large amounts of data, are rather flexible because they minimize the assumptions they make about the data.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
The working of the Naïve Bayes classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether or not we should play on a particular day according to the weather conditions. To solve this problem, we follow the steps below:
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather     Yes   No
Overcast     5     0
Rainy        2     2
Sunny        3     2
Total       10     4
Likelihood table for the weather conditions:
Weather     No    Yes
Overcast     0     5     5/14 = 0.35
Rainy        2     2     4/14 = 0.29
Sunny        2     3     5/14 = 0.35
All       4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.
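The same probabilities can be reproduced with a short Python sketch using only the standard library (the two lists below simply encode the table given above):

    # Sketch: reproducing the hand calculation above by plain counting.
    outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
               "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
    play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
               "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

    n = len(play)
    p_yes   = play.count("Yes") / n           # P(Yes)   = 10/14
    p_no    = play.count("No") / n            # P(No)    = 4/14
    p_sunny = outlook.count("Sunny") / n      # P(Sunny) = 5/14

    # Likelihoods P(Sunny|Yes) and P(Sunny|No)
    sunny_yes = sum(1 for o, p in zip(outlook, play) if o == "Sunny" and p == "Yes")
    sunny_no  = sum(1 for o, p in zip(outlook, play) if o == "Sunny" and p == "No")
    p_sunny_given_yes = sunny_yes / play.count("Yes")   # 3/10
    p_sunny_given_no  = sunny_no / play.count("No")     # 2/4

    # Bayes' theorem
    p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
    p_no_given_sunny  = p_sunny_given_no * p_no / p_sunny
    print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))  # about 0.60 vs 0.40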
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, Education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
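A hedged sketch of how the three variants are used via scikit-learn; the tiny arrays are made up purely to show the kind of input each model expects:

    # Sketch: the three Naive Bayes variants as implemented in scikit-learn.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

    y = np.array([0, 0, 1, 1])

    # Gaussian NB: continuous features assumed normally distributed per class.
    X_cont = np.array([[1.2, 3.4], [0.9, 2.8], [5.1, 7.2], [4.8, 6.9]])
    print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

    # Multinomial NB: word-count features, e.g. from a document-term matrix.
    X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
    print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))

    # Bernoulli NB: binary features (word present / absent in a document).
    X_bin = (X_counts > 0).astype(int)
    print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))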
Support Vector Machine Algorithm
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed the Support Vector Machine. Consider the below diagram, in which two
different categories are classified using a decision boundary or hyperplane:
Example:
➢ Suppose we see a strange cat that also has some features of dogs. If we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created
by using the SVM algorithm. We first train our model with lots of images of cats
and dogs so that it can learn the different features of cats and dogs, and then we test
it with this strange creature. Since the support vector machine creates a decision
boundary between these two classes (cat and dog) and chooses the extreme cases
(support vectors), it will see the extreme cases of cat and dog and, on the basis of the
support vectors, classify the creature as a cat.
• SVM algorithm can be used for Face detection, image classification, text categorization,
etc.
Types of SVM
SVM can be of two types:
➢ Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called the Linear SVM classifier.
➢ Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
Support Vectors
➢ The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support
the hyperplane, they are called support vectors.
Linear SVM
➢ The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features,
x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as
either green or blue.
➢ Consider the below image:
➢ So, as it is a 2-d space, just by using a straight line we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:
➢ Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the
lines from both classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
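A minimal sketch of this idea with scikit-learn, assuming two synthetic, linearly separable blobs (not the figure from the notes); it fits a linear SVM and then inspects the support vectors and the margin width it maximizes:

    # Sketch: linear SVM, support vectors, and margin on synthetic data.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + [2, 2],      # "green" class
                   rng.randn(20, 2) + [-2, -2]])   # "blue" class
    y = np.array([0] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)

    print("support vectors:\n", clf.support_vectors_)  # points closest to the hyperplane
    w, b = clf.coef_[0], clf.intercept_[0]             # hyperplane: w.x + b = 0
    print("margin width:", 2 / np.linalg.norm(w))      # the distance SVM tries to maximize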
Non-Linear SVM
➢ If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data we cannot draw a single straight line. Consider the below image:
➢ So, to separate these data points, we need to add one more dimension. For linear data
we have used the two dimensions x and y, so for non-linear data we will add a third
dimension z. It can be calculated as:
z = x² + y²
➢ By adding the third dimension, the sample space will become as below image:
➢ So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
➢ Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we
convert it to 2-d space with z = 1, then it will become as below:
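A minimal sketch of this trick, assuming synthetic ring-shaped data (invented for illustration): adding the hand-crafted dimension z = x² + y² makes the classes separable by a plane, and the RBF kernel of a non-linear SVM achieves a similar lifting implicitly:

    # Sketch: making circular data separable via z = x^2 + y^2, vs. an RBF kernel.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(1)
    angles = rng.uniform(0, 2 * np.pi, 100)
    r = np.where(np.arange(100) < 50, 1.0, 3.0)        # inner circle vs outer circle
    X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
    y = (r > 2).astype(int)

    z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)   # the extra dimension z = x^2 + y^2
    X3d = np.hstack([X, z])
    print(SVC(kernel="linear").fit(X3d, y).score(X3d, y))  # separable by a plane in 3-d

    # The RBF kernel performs a similar lifting implicitly (the usual non-linear SVM):
    print(SVC(kernel="rbf").fit(X, y).score(X, y))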
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into
one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:
There are several criteria used to build a decision tree. They are:
Entropy
Information Gain
Gini Impurity
In a decision tree, for predicting the class of a given dataset, the
algorithm starts from the root node of the tree. The algorithm
compares the value of the root attribute with the corresponding attribute of the record
(real dataset) and, based on the comparison, follows the branch and jumps
to the next node.
For the next node, the algorithm again compares the attribute value
with the other sub-nodes and moves further. It continues the process
until it reaches a leaf node of the tree. The complete process can be
better understood using the following steps:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified any further;
the final nodes are called leaf nodes.
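As an illustration (a sketch only; the job-offer-style numbers below are hypothetical), scikit-learn's CART implementation performs exactly this kind of recursive splitting:

    # Sketch: recursive splitting with scikit-learn's CART on a made-up dataset.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical job-offer data: [salary, distance_to_office_km, cab_facility]
    X = [[12, 5, 1], [12, 25, 0], [6, 5, 1], [15, 30, 1], [7, 10, 0], [14, 8, 0]]
    y = ["Accept", "Decline", "Decline", "Accept", "Decline", "Accept"]

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # ASM = information gain
    tree.fit(X, y)
    print(export_text(tree, feature_names=["salary", "distance", "cab"]))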
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure (ASM). With this measure, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain.
o Gini Index.
“Pruning is a process of deleting the unnecessary nodes from a tree in order to get
the optimal decision tree”.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning.
There are mainly two types of tree pruning techniques used: Cost Complexity Pruning and Reduced Error Pruning.
Entropy: E(S) = -P(Yes)·log2 P(Yes) - P(No)·log2 P(No)
Note: here we typically take the log to base 2. In total there are 14 yes/no examples.
• From the above data, for Outlook we can compute the weighted entropy easily,
i.e., we take the total of the weight of each feature value multiplied by the entropy of its class probabilities:
E(S, Outlook) = (5/14)(-(3/5)log(3/5) - (2/5)log(2/5)) + (4/14)(0) + (5/14)(-(2/5)log(2/5) - (3/5)log(3/5))
= 0.693
Step 4: Now select the feature having the largest information gain.
Here it is Outlook, so it forms the first node (root node) of our decision tree.
Since Overcast contains only examples of class 'Yes', we can set it as Yes; that
means if the outlook is overcast, football will be played. Now our decision tree looks
as follows.
Step 5: The next step is to find the next node in our decision tree.
Now we will find the one under Sunny: we have to determine which of the
following (Temperature, Humidity or Wind) has the higher information gain.
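A short sketch of this arithmetic in plain Python, assuming the classic 9 'Yes' / 5 'No' split that the weighted entropy value 0.693 above corresponds to:

    # Sketch: entropy and information gain for the Outlook feature.
    from math import log2

    def entropy(pos, neg):
        total = pos + neg
        e = 0.0
        for c in (pos, neg):
            if c:                      # 0 * log2(0) is taken as 0
                e -= (c / total) * log2(c / total)
        return e

    E_S = entropy(9, 5)                # entropy of the whole dataset
    # Outlook splits the 14 examples into Sunny(2+,3-), Overcast(4+,0-), Rainy(3+,2-)
    E_outlook = (5/14) * entropy(2, 3) + (4/14) * entropy(4, 0) + (5/14) * entropy(3, 2)
    print(round(E_S, 3), round(E_outlook, 3), round(E_S - E_outlook, 3))  # gain ≈ 0.247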
Random Forest Algorithm
Random Forest builds a number of decision trees on subsets of the data and combines their
predictions by majority vote. It works in the following steps:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the prediction of each decision tree, and assign the new
data point to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example:
Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:
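A minimal sketch of this bagging-and-voting idea with scikit-learn; the built-in digits dataset merely stands in for the fruit images of the example:

    # Sketch: Random Forest = many decision trees on bootstrap subsets + majority vote.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # n_estimators is the number N of decision trees; each tree sees a bootstrap subset.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print("accuracy:", forest.score(X_test, y_test))   # prediction = majority vote of the trees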
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification
of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this
algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Although Random Forest can be used for both classification and regression
tasks, it is less suitable for regression tasks.
Regularization in Machine Learning
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss
function, discouraging the model from assigning too much importance to individual features or
coefficients.
Role of Regularization
1. Complexity Control: Regularization helps control model complexity by preventing overfitting
to training data, resulting in better generalization to new data.
2. Preventing Overfitting: One way to prevent overfitting is to use regularization, which
penalizes large coefficients and constrains their magnitudes, thereby preventing a model from
becoming overly complex and memorizing the training data instead of learning its underlying
patterns.
3. Balancing Bias and Variance: Regularization can help balance the trade-off between model
bias (underfitting) and model variance (overfitting) in machine learning, which leads to
improved performance.
4. Feature Selection: Some regularization methods, such as L1 regularization (Lasso), promote
sparse solutions that drive some feature coefficients to zero. This automatically selects
important features while excluding less important ones.
5. Handling Multicollinearity: When features are highly correlated (multicollinearity),
regularization can stabilize the model by reducing coefficient sensitivity to small data
changes.
6. Generalization: Regularized models learn underlying patterns of data for better
generalization to new data, instead of memorizing specific examples.
Overfitting is a phenomenon that occurs when a machine learning model fits the training set too
closely and is not able to perform well on unseen data, i.e., when our model learns the noise in the
training data as well. This is the case when our model memorizes the training data instead of learning the
patterns in it.
Underfitting, on the other hand, is the case when our model is not able to learn even the basic
patterns available in the dataset. An underfitting model is unable to perform well even on
the training data, hence we cannot expect it to perform well on the validation data. This is the case when
we should increase the complexity of the model or add more features to the feature set.
How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model. Let's consider the
simple linear regression equation:
y= β0+β1x1+β2x2+β3x3+⋯+βnxn +b
β1, β2, ..., βn are the weights or magnitudes attached to the features x1, ..., xn; β0 represents the bias of
the model, and b represents the intercept.
Linear regression models try to optimize β0 and b to minimize the cost function. The loss function for
linear regression is called RSS, the Residual Sum of Squares:
RSS = Σi ( yi − (β0 + β1xi1 + β2xi2 + ⋯ + βnxin + b) )²
Regularization will now add a penalty term to this loss, and the parameters are then optimized so that
the model can still predict the value of Y accurately.
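A small numpy sketch of the RSS computation, with made-up numbers for the features, targets, and weights:

    # Sketch: computing the RSS loss for a linear model on toy data.
    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])   # features x1, x2 (made up)
    y = np.array([3.0, 2.5, 4.0])                        # targets
    beta = np.array([0.8, 0.4])                          # current weights β1, β2
    b = 0.5                                              # intercept

    residuals = y - (X @ beta + b)
    rss = np.sum(residuals ** 2)                         # Residual Sum of Squares
    print(rss)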
Techniques of Regularization.
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model.
It is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias
added to the model is called the Ridge Regression penalty. We can calculate it by multiplying
lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:
Cost = Σi (yi − ŷi)² + λ·Σj βj²
o In the above equation, the penalty term regularizes the coefficients of the model; hence ridge
regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation becomes the
cost function of the plain linear regression model. Hence, for a very small value of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
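A minimal ridge-regression sketch with scikit-learn (synthetic data; the alpha parameter plays the role of λ above), showing how larger λ shrinks the coefficients:

    # Sketch: ridge (L2) regression, coefficient shrinkage as alpha (λ) grows.
    import numpy as np
    from sklearn.linear_model import Ridge, LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.randn(50)

    print("OLS     :", LinearRegression().fit(X, y).coef_.round(2))
    for alpha in (0.1, 10.0, 1000.0):
        print(f"lambda={alpha:<7}:", Ridge(alpha=alpha).fit(X, y).coef_.round(2))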
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands
for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values of the
weights instead of their squares.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression can only
shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:
Cost = Σi (yi − ŷi)² + λ·Σj |βj|
o Some of the features in this technique are completely neglected for model evaluation.
o Hence, Lasso regression can help us to reduce overfitting in the model as well as perform feature
selection.
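A matching lasso sketch on the same kind of synthetic data, showing coefficients being driven exactly to zero as λ grows (which is the feature-selection effect described above):

    # Sketch: lasso (L1) regression, some coefficients become exactly zero.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + 0.1 * rng.randn(50)

    for alpha in (0.01, 0.5, 2.0):
        print(f"lambda={alpha:<5}:", Lasso(alpha=alpha).fit(X, y).coef_.round(2))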
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases and put the
new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can easily be classified into a well-suited category by using
the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately;
instead it stores the dataset and, at the time of classification, performs an action on that dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it
works on a similarity measure. Our KNN model will find the features of the new image that are similar
to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which
of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below
diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is
the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distance, we get the nearest neighbours: three nearest neighbours in
category A and two nearest neighbours in category B. Consider the below image:
o As we can see, the majority (3 of the 5) of the nearest neighbours are from category A, hence this new
data point must belong to category A.
o There is no particular way to determine the best value for "K", so we need to try some values to find
the best among them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K smooth out noise, but they can make the class boundary too coarse and increase the
amount of computation.
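A minimal K-NN sketch with scikit-learn, assuming two synthetic clusters in place of Category A and Category B, and k = 5 as suggested above:

    # Sketch: k-NN classification with k = 5 and Euclidean distance.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) + [0, 0],    # Category A
                   rng.randn(20, 2) + [4, 4]])   # Category B
    y = np.array(["A"] * 20 + ["B"] * 20)

    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X, y)                                 # "training" just stores the data (lazy learner)

    new_point = [[1.0, 1.5]]
    print(knn.predict(new_point))                 # majority class among the 5 nearest neighbours
    print(knn.kneighbors(new_point))              # distances and indices of those neighbours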