
UNIT -IV

SUPERVISED LEARNING

Parametric and Non-parametric Models In Machine Learning

What is the parametric model?

A learning model that summarizes data with a set of parameters of fixed size, independent of the number of training instances. Parametric machine learning algorithms assume a known functional form and optimize its parameters.

In a parametric model, you know in advance which functional form you are going to fit to the data, for example a linear regression line:

b0 + b1*x1 + b2*x2 = 0

where b0, b1, b2 are the coefficients of the line that control the intercept and slope, and x1, x2 are the input variables.
Some more examples of parametric machine learning algorithms include:

• Logistic Regression

• Linear Discriminant Analysis

• Perceptron

• Naive Bayes

• Simple Neural Networks
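
As an illustration (not part of the original notes, scikit-learn API assumed), the sketch below shows that a parametric model keeps a fixed number of coefficients no matter how many training instances it sees:

```python
# Minimal sketch (assumed scikit-learn API): a parametric model summarizes the
# data with a fixed-size set of parameters, independent of the training set size.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(1000, 2)                   # 1000 instances, 2 features
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]       # hypothetical linear relationship

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)          # b0 plus exactly two coefficients (b1, b2)
```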

What is the nonparametric model?


Nonparametric machine learning algorithms do not make strong assumptions about the form of the mapping function; they are free to learn any functional form from the training data.
The word nonparametric does not mean that the model has no parameters at all, but rather that the number and nature of the parameters are flexible and not fixed in advance. Nonparametric methods are also a natural choice for ranked data, where the order of the values carries the significant information rather than their exact magnitudes.

A simple nonparametric model to understand is the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns. The only assumption it makes about the data set is that the most similar training patterns are most likely to have a similar output.

Some more examples of popular nonparametric machine learning algorithms are:

• k-Nearest Neighbors

• Decision Trees like CART and C4.5

• Support Vector Machines
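
For contrast, a minimal sketch (again assuming scikit-learn) of a nonparametric model: k-NN stores the training data itself, so what the model "remembers" grows with the number of instances rather than being a fixed set of parameters.

```python
# Minimal sketch (assumed scikit-learn API): a nonparametric model such as k-NN
# keeps the training instances and defers the work to prediction time.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(1000, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)               # hypothetical labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # "training" just stores the data
print(knn.predict([[0.9, 0.8]]))                      # vote among the 5 closest stored points
```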

Parametric vs. Nonparametric modeling


1. Parametric models deal with discrete values, and nonparametric models use continuous
values.

2. Parametric models can rely on the traditional measures associated with normal distributions, including the mean, median, and mode. With nonparametric methods, even if some of the data looks normally distributed, one often cannot assume that the data comes from a normal distribution.

3. Feature engineering is important in parametric models, because feeding in many unrelated features can poison the model. Nonparametric models largely handle this themselves: we can feed all the available features to a nonparametric algorithm and it can learn to ignore the unimportant ones without this, by itself, causing overfitting.

4. A parametric model can predict future values using only its parameters. Nonparametric machine learning algorithms are often slower and require large amounts of data, but they are more flexible because they minimize the assumptions they make about the data.

Naïve Bayes Classifier Algorithm

o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification with a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' Law. It is used to determine the probability of a hypothesis given prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below example:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No
9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather     No            Yes           P(Weather)

Overcast    0             5             5/14 = 0.35

Rainy       2             2             4/14 = 0.29

Sunny       2             3             5/14 = 0.35

All         4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5
P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
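
The hand calculation above can be reproduced with a short script. This is only a sketch of the arithmetic, using the 14-row Outlook/Play table from these notes (plain Python, no library assumptions):

```python
# Sketch: reproduce P(Yes|Sunny) and P(No|Sunny) from the frequency table above.
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)                                                   # 14 records
n_yes = sum(1 for _, play in data if play == "Yes")             # 10
n_no = n - n_yes                                                # 4
p_yes, p_no = n_yes / n, n_no / n                               # 0.71, 0.29
p_sunny = sum(1 for o, _ in data if o == "Sunny") / n           # 5/14 = 0.35

p_sunny_given_yes = sum(1 for o, p in data if o == "Sunny" and p == "Yes") / n_yes  # 3/10
p_sunny_given_no = sum(1 for o, p in data if o == "Sunny" and p == "No") / n_no     # 2/4

print(p_sunny_given_yes * p_yes / p_sunny)    # P(Yes|Sunny) ≈ 0.60
print(p_sunny_given_no * p_no / p_sunny)      # P(No|Sunny)  ≈ 0.41
```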

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is a fast and easy ML algorithm for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as Sports, Politics, or Education. The classifier uses word frequencies as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also popular for document classification tasks.
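
As a quick illustration (not from the original notes, scikit-learn API assumed), the Multinomial variant can be trained on word counts for document classification; GaussianNB and BernoulliNB are used the same way for continuous and Boolean features respectively.

```python
# Sketch: Multinomial Naive Bayes on word frequencies (toy corpus and labels assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB   # GaussianNB / BernoulliNB are analogous

docs = ["the team won the match", "election results announced today", "parliament passed the budget"]
labels = ["Sports", "Politics", "Politics"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                     # word counts as predictors
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["the team lost the match"])))   # -> ['Sports'] on this toy corpus
```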
Support Vector Machine Algorithm
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:

Example:

➢ Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with many images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each. On the basis of the support vectors, it will classify the new creature as a cat.

• SVM algorithm can be used for Face detection, image classification, text categorization,
etc.

Types of SVM
SVM can be of two types:

➢ Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
➢ Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm

Hyperplane

➢ There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
➢ The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
➢ We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors

➢ The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM

➢ The working of the SVM algorithm can be understood using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.
➢ Consider the below image:

➢ Since this is a 2-D space, we can easily separate these two classes with a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
➢ Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the two classes to the line. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
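
A minimal sketch of the linear case (scikit-learn assumed, toy points made up for illustration): after fitting, support_vectors_ holds the extreme points that define the maximum-margin hyperplane.

```python
# Sketch: linear SVM on two well-separated groups of 2-D points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])                # two tags, e.g. blue vs green

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)                     # the closest points from both classes
print(svm.predict([[4, 4]]))                    # classify a new point
```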

Non-Linear SVM

➢ If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:

➢ So, to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x² + y²

➢ By adding the third dimension, the sample space will become as below image:
➢ So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

➢ Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle:

➢ Hence, we get a circular boundary of radius 1 in the case of non-linear data.
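
The z = x² + y² idea can be checked with a small sketch (NumPy assumed, synthetic circular data made up for illustration): after adding the third dimension, a single threshold on z, i.e. a plane, separates the two rings.

```python
# Sketch: lift 2-D circular data to 3-D with z = x**2 + y**2.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]   # class 0: small circle
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]   # class 1: large circle
X = np.vstack([inner, outer])

z = X[:, 0] ** 2 + X[:, 1] ** 2                 # the added third dimension
print(z[:100].max(), z[100:].min())             # 0.25 vs 4.0: a threshold on z (a plane) separates them
```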

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.


Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:

There are many algorithms to build a decision tree. Common ones are:

o CART (Classification and Regression Trees) — This makes use of Gini impurity as the metric.
o ID3 (Iterative Dichotomiser 3) — This uses entropy and information gain as the metric.

The below diagram explains the general structure of a decision tree:
First, it checks if the customer has a good credit history. Based on that, it classifies the customer into two groups, i.e., customers with good credit history and customers with bad credit history.
Then, it checks the income of the customer and again classifies him/her into two groups.
Finally, it checks the loan amount requested by the customer.
Based on the outcomes of checking these three features, the decision tree decides whether the customer's loan should be approved or not.

Entropy

• In machine learning, entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. For a set S with class proportions p_i,

E(S) = -Σ p_i log2(p_i)

Information Gain

Information gain can be defined as the amount of information gained about a random variable or signal from observing another random variable. It is the difference between the entropy of the parent node and the weighted average entropy of the child nodes:

IG = E(parent) - Σ (|S_v| / |S|) * E(S_v)

Gini Impurity

Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset:

Gini = 1 - Σ p_i²
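
The three measures above can be written down directly. The following is a small sketch in plain Python for a two-class (Yes/No) split; the function names are ours, not from the notes:

```python
# Sketch: entropy, Gini impurity, and information gain for binary (pos, neg) counts.
import math

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c)  # 0*log(0) treated as 0

def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

def information_gain(parent, children):
    """parent and children are (pos, neg) tuples; IG = parent entropy - weighted child entropy."""
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - weighted
```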

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best


algorithm for the given dataset and problem is the main point to
remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking while making a decision, so they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.


How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the following steps:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified any further; the final nodes are called leaf nodes.
Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this problem, we use a technique called the Attribute Selection Measure (ASM). Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain.
o Gini Index.

Pruning: Getting an Optimal Decision tree

“Pruning is a process of deleting the unnecessary nodes from a tree in order to get
the optimal decision tree”.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is known as pruning.
There are mainly two types of tree pruning techniques used:

➢ Cost Complexity Pruning


➢ Reduced Error Pruning.
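
For cost complexity pruning, scikit-learn exposes a pruning path; the sketch below (library API assumed, iris data used only for illustration) shows how increasing the pruning strength ccp_alpha shrinks the tree:

```python
# Sketch: cost-complexity pruning with scikit-learn's DecisionTreeClassifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(round(alpha, 4), pruned.tree_.node_count)   # larger alpha -> smaller (more pruned) tree
```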

Algorithm for decision tree


Classification using the ID3 algorithm

Consider a dataset based on which we will determine whether to play football or not.
Here there are four independent variables to determine the dependent variable. The independent variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether to play football or not.
As the first step, we have to find the parent (root) node for our decision tree. For that, follow the steps below:

Step 1:Find the entropy of the class variable.

Entropy E(S):

E(S) = -[(9/14) log2(9/14) + (5/14) log2(5/14)] = 0.94

Note: Here we take the logarithm to base 2. In total there are 14 yes/no records, of which 9 are yes and 5 are no; the probabilities above are calculated from these counts.

• From the above data, for Outlook we can easily arrive at the following frequency table.

Step 2: Now we have to calculate the average weighted entropy, i.e., the entropy of each feature value weighted by its probability.

E(S, Outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5)) + (4/14)(0) + (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5))
= 0.693

Step 3: The next step is to find the information gain.

• It is the difference between the parent entropy and the average weighted entropy we found above.
➢ IG(S, Outlook) = 0.94 - 0.693 = 0.247
• Similarly, find the information gain for Temperature, Humidity, and Windy:
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.893 = 0.048

Step 4: Now select the feature having the largest information gain.

Here it is Outlook, so it forms the first node (root node) of our decision tree.

Now our data looks as follows:

Since Overcast contains only examples of class 'Yes', we can set it as Yes. That means if the outlook is overcast, football will be played. Now our decision tree looks as follows.
Step 5: The next step is to find the next node in our decision tree.

Now we will find the node under Sunny. We have to determine which of the remaining features (Temperature, Humidity or Wind) has the highest information gain.

Calculate the parent entropy E(Sunny):

E(Sunny) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971

Now calculate the information gain of Temperature, IG(Sunny, Temperature):

E(Sunny, Temperature) = (2/5)*E(0,2) + (2/5)*E(1,1) + (1/5)*E(1,0) = 2/5 = 0.4
IG(Sunny, Temperature) = 0.971 - 0.4 = 0.571

Similarly we get
IG(Sunny, Humidity) = 0.971
IG(Sunny, Windy) = 0.020

Here IG(Sunny, Humidity) is the largest value, so Humidity is the node that comes under Sunny.

For Humidity, from the above table we can say that play will occur if humidity is normal and will not occur if it is high. Similarly, find the nodes under Rainy.

Note: A branch with entropy more than 0 needs further splitting.

Finally, our decision tree will look as below:
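
The arithmetic in Steps 1-3 can be verified with a few lines of Python (a sketch, reusing only the counts from the notes):

```python
# Sketch: check E(S) = 0.94 and IG(S, Outlook) = 0.247 for the 9-Yes / 5-No dataset.
import math

def entropy(pos, neg):
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c)

parent = entropy(9, 5)                                                    # ≈ 0.940
weighted = (5/14) * entropy(3, 2) + (4/14) * entropy(4, 0) + (5/14) * entropy(2, 3)
print(round(parent, 3), round(parent - weighted, 3))                      # 0.94, IG ≈ 0.247
```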

Classification using CART algorithm

Classification using CART is similar to ID3, but instead of entropy we use Gini impurity.

As the first step, we find the root node of our decision tree. For that, calculate the Gini index of the class variable:

Gini(S) = 1 - [(9/14)² + (5/14)²] = 0.459

As the next step, we calculate the Gini gain. For that we first find the average weighted Gini impurity of Outlook, Temperature, Humidity, and Windy.

First, consider the case of Outlook:

Gini(S, Outlook) = (5/14)*gini(3,2) + (4/14)*gini(4,0) + (5/14)*gini(2,3)
= (5/14)(1 - (3/5)² - (2/5)²) + (4/14)*0 + (5/14)(1 - (2/5)² - (3/5)²)
= 0.171 + 0 + 0.171 = 0.342

Gini gain(S, Outlook) = 0.459 - 0.342 = 0.117

Gini gain(S, Temperature) = 0.459 - 0.4405 = 0.0185

Gini gain(S, Humidity) = 0.459 - 0.3674 = 0.0916

Gini gain(S, Windy) = 0.459 - 0.4286 = 0.0304

Choose the attribute that has the highest Gini gain. The Gini gain is highest for Outlook, so we choose it as our root node.

Note: Repeat the same steps we used in the ID3 algorithm.
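
The same check can be done for the Gini calculation (a sketch with the counts from the notes):

```python
# Sketch: check Gini(S) and the Gini gain of Outlook.
def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

root = gini(9, 5)                                                         # ≈ 0.459
weighted = (5/14) * gini(3, 2) + (4/14) * gini(4, 0) + (5/14) * gini(2, 3)
print(round(root - weighted, 3))   # ≈ 0.116, matching the 0.117 above up to rounding
```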

Advantages of the Decision Tree


o It is simple to understand as it follows the same process that a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

RANDOM FOREST ALGORITHM

➢ Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML.
➢ It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
➢ Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and combines their outputs to improve the predictive accuracy on that dataset.
➢ Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
➢ The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
➢ The below diagram explains the working of the Random Forest
algorithm:

Fig.3.8.1 Random forest

Assumptions for Random Forest


o Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees may predict the correct output while others may not. But together, all the trees predict the correct output. Therefore, below are two assumptions for a better Random Forest classifier:
o There should be some actual values in the feature variable of the
dataset so that the classifier can predict accurate results rather than a
guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

• It takes less training time as compared to other algorithms.


• It predicts output with high accuracy, and it runs efficiently even for large datasets.
• It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm work?

o Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.

The Working process


Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (subsets).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat Steps 1 & 2.

Step-5: For new data points, find the prediction of each decision tree, and assign the new data points to the category that wins the majority vote.

The working of the algorithm can be better understood by the below example:

Example:

Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:
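
A minimal sketch of the two phases (scikit-learn assumed, iris data used purely for illustration): N trees are built on bootstrap subsets, and the final class comes from their majority vote.

```python
# Sketch: a random forest of 100 trees with majority-vote predictions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))     # accuracy of the majority vote of the 100 trees
```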
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification
of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this
algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and


Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
Regularization in Machine Learning
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss
function, discouraging the model from assigning too much importance to individual features or
coefficients.
Role of Regularization
1. Complexity Control: Regularization helps control model complexity by preventing overfitting
to training data, resulting in better generalization to new data.
2. Preventing Overfitting: One way to prevent overfitting is to use regularization, which
penalizes large coefficients and constrains their magnitudes, thereby preventing a model from
becoming overly complex and memorizing the training data instead of learning its underlying
patterns.
3. Balancing Bias and Variance: Regularization can help balance the trade-off between model
bias (underfitting) and model variance (overfitting) in machine learning, which leads to
improved performance.
4. Feature Selection: Some regularization methods, such as L1 regularization (Lasso), promote
sparse solutions that drive some feature coefficients to zero. This automatically selects
important features while excluding less important ones.
5. Handling Multicollinearity: When features are highly correlated (multicollinearity),
regularization can stabilize the model by reducing coefficient sensitivity to small data
changes.
6. Generalization: Regularized models learn underlying patterns of data for better
generalization to new data, instead of memorizing specific examples.

What are Overfitting and Underfitting?

Overfitting is a phenomenon that occurs when a machine learning model is fitted too closely to the training set and is not able to perform well on unseen data; the model learns the noise in the training data as well. This is the case when our model memorizes the training data instead of learning the patterns in it.

Underfitting on the other hand is the case when our model is not able to learn even the basic
patterns available in the dataset. In the case of the underfitting model is unable to perform well even on
the training data hence we cannot expect it to perform well on the validation data. This is the case when
we are supposed to increase the complexity of the model or add more features to the feature set.
How does Regularization Work?

Regularization works by adding a penalty or complexity term to the complex model. Let's consider the
simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted; x1, x2, …, xn are the features for y; β0, β1, …, βn are the weights (coefficients) attached to the features; and b represents the bias/intercept of the model.

Linear regression models try to optimize the coefficients β and the intercept b to minimize the cost function. The cost (loss) function for the linear model is the Residual Sum of Squares (RSS):

RSS = Σ_i (y_i - (β0 + β1x_i1 + β2x_i2 + ⋯ + βnx_in + b))²

Regularization adds a penalty term to this loss function and optimizes the parameters so that the model can predict accurate values of y without overfitting.

Techniques of Regularization.

There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model.
It is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. It is calculated by multiplying the parameter lambda (λ) by the squared weight of each individual feature.
o The equation for the cost function in ridge regression will be:

Cost = Σ_i (y_i - ŷ_i)² + λ Σ_j β_j²
o In the above equation, the penalty term regularizes the coefficients of the model; hence ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if λ tends to zero, the equation becomes the cost function of the linear regression model. Hence, for a very small value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to Ridge Regression except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:

Cost = Σ_i (y_i - ŷ_i)² + λ Σ_j |β_j|

o Some of the features in this technique are completely neglected for model evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as well as the feature
selection.

Key Difference between Ridge Regression and Lasso Regression


o Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as feature selection.
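
The difference can be seen in a short sketch (scikit-learn assumed; alpha plays the role of λ above, and the regression data is synthetic): Ridge shrinks all coefficients, while Lasso typically drives several of them exactly to zero.

```python
# Sketch: coefficient shrinkage (Ridge / L2) vs. coefficient selection (Lasso / L1).
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # all coefficients shrunk, but non-zero
print(Lasso(alpha=1.0).fit(X, y).coef_)   # several coefficients typically exactly 0
```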

K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance between two points (x1, y1) and (x2, y2), which we have already studied in geometry, is calculated as:

d = √((x2 - x1)² + (y2 - y1)²)

o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
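
The steps above can be sketched with scikit-learn (toy points made up for illustration): choose k = 5, use the Euclidean distance, and let the five nearest stored points vote.

```python
# Sketch: classify a new point with k = 5 nearest neighbours and Euclidean distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [7, 6], [6, 7], [7, 7]])
y = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict([[3, 3]]))   # -> ['A']: four of the five nearest points are in category A
```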

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values to find
the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Large values for K are good, but a value that is too large may cause difficulties, since distant points start to influence the classification.
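
One practical way to pick K, sketched below with scikit-learn and the iris data (not part of the original notes), is to try several values with cross-validation and keep the one that scores best:

```python
# Sketch: compare a few K values with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))          # pick the K with the best average score
```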
