
UNIT-5

DISTANCE MEASURES:
Minkowski distance is primarily used in machine learning and data science, particularly in algorithms
that require measuring the similarity or dissimilarity between data points, such as K-Nearest Neighbors
(KNN), clustering algorithms (e.g., K-Means), and other classification tasks. For two points a and b it is
defined as d(a, b) = (Σ |a_i − b_i|^p)^(1/p); with p = 1 it reduces to the Manhattan distance and with
p = 2 to the Euclidean distance.
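
As an illustration (not part of the original notes), here is a minimal Python sketch of the Minkowski distance; the points and the values of p are made up for the example.

# Minkowski distance: d(a, b) = (sum_i |a_i - b_i|^p)^(1/p)
# p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance.
def minkowski_distance(a, b, p=2):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = [1.0, 2.0]
point_b = [4.0, 6.0]
print(minkowski_distance(point_a, point_b, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(point_a, point_b, p=2))  # 5.0 (Euclidean)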
K-Nearest Neighbor(KNN) Algorithm for Machine
Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and
puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset, and at the time of classification it performs an
action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the new
image that are similar to the cat and dog images and, based on the most similar features, will
put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1.
To which of these categories does this data point belong? To solve this type of problem, we need a
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular
data point. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is
the distance between two points, which we have already studied in geometry. For two points
A(x1, y1) and B(x2, y2), it can be calculated as d(A, B) = sqrt((x2 − x1)^2 + (y2 − y1)^2).
o By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in
category A and two nearest neighbors in category B. Consider the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new
data point must belong to category A.
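
As an illustration (not part of the original notes), here is a minimal scikit-learn sketch of the same idea; the two-dimensional points and their category labels are made up.

from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: points labelled category A or category B.
X_train = [[1, 1], [1, 2], [2, 1], [2, 2], [6, 5], [7, 7], [8, 6], [7, 5]]
y_train = ["A", "A", "A", "A", "B", "B", "B", "B"]

# K = 5 neighbours, Euclidean distance, as in the walkthrough above.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# The new point is assigned to the category that wins the majority among its 5 nearest neighbours.
print(knn.predict([[3, 2]]))  # ['A']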
How to select the value of K in the K-NN Algorithm?
o There is no particular way to determine the best value for "K", so we need to
try some values to find the best out of them. The most preferred value for K
is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K smooth out noise, but a value that is too large may include points from other
categories and blur the class boundaries. A common approach is to compare the accuracy of several
K values, as in the sketch below.
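
As an illustration (not part of the original notes), here is a minimal Python sketch of that search, assuming scikit-learn is available; the Iris dataset is used only as a stand-in for your own data.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; replace X, y with your own feature matrix and labels.
X, y = load_iris(return_X_y=True)

# Try K = 1..15 and keep the value with the best 5-fold cross-validated accuracy.
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))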

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance to all the training samples must be calculated
for each new data point.
DISTANCE-BASED CLUSTERING:
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group
unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two
differ in how they work: in hierarchical clustering there is no requirement to predetermine the number
of clusters, as we did in the K-Means algorithm.

The hierarchical clustering technique has two approaches:


Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm
starts with taking all data points as single clusters and merging them until one
cluster is left.

Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down
approach.

Why hierarchical clustering?

As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical
clustering? As we have seen, K-means clustering has some challenges: it requires a predetermined number
of clusters, and it always tries to create clusters of the same size. To solve these two challenges, we
can opt for the hierarchical clustering algorithm, because in this algorithm we don't need prior
knowledge of the number of clusters.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data
points into clusters, it follows the bottom-up approach. This means the algorithm considers each data
point as a single cluster at the beginning, and then starts combining the closest pairs of clusters.
It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How does Agglomerative Hierarchical Clustering work?

The working of the AHC algorithm can be explained using the below steps:
Step-1: Treat each data point as a single cluster. Let's say there are N data points,
so the number of clusters will also be N.

Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.

Step-3: Again, take the two closest clusters and merge them together to form one
cluster. There will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following
clusters. Consider the below images:
Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
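
As an illustration (not part of the original notes), here is a minimal Python sketch that builds the dendrogram described in Step-5 using SciPy; the six two-dimensional points are made up.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Made-up 2-D data points.
X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]]

# linkage() records the merge history: which clusters were joined and at what distance.
Z = linkage(X, method="ward")

# dendrogram() draws the tree; cutting it at a chosen height divides the data into clusters.
dendrogram(Z)
plt.title("Agglomerative clustering dendrogram")
plt.show()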

Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for the
hierarchical clustering. There are various ways to calculate the distance between
two clusters, and these ways decide the rule for clustering. These measures are
called Linkage methods. Some of the popular linkage methods are given below:

Single Linkage: It is the Shortest Distance between the closest points of the
clusters. Consider the below image:

Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than
single-linkage.

Average Linkage: It is the linkage method in which the distance between every pair of points, one from
each cluster, is added up and then divided by the total number of such pairs to calculate the average
distance between the two clusters. It is also one of the most popular linkage methods.

Centroid Linkage: It is the linkage method in which the distance between the
centroid of the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the type
of problem or business requirement.
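
As an illustration (not part of the original notes), here is a minimal scikit-learn sketch comparing linkage methods; note that AgglomerativeClustering supports single, complete, average and ward linkage, while centroid linkage is available in scipy.cluster.hierarchy. The data points are made up.

from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D data points forming two obvious groups.
X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]]

for method in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = model.fit_predict(X)
    print(method, labels)  # e.g. [0 0 0 1 1 1] (cluster numbering may be swapped)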
EXAMPLE:
From the above calculation:
P(Yes | Sunny, Hot) < P(No | Sunny, Hot)
27% < 73%
Hence we can conclude 'not to play', as the probability of No is 73%.

Laplace Smoothing

Step 1: Count Occurrences in the Dataset

From the dataset, the occurrences of "Overcast" under Play Tennis are:

Outlook     Play Tennis     Count
Overcast    Yes             4
Overcast    No              0

Total occurrences of "Overcast" = 4 (Yes) + 0 (No) = 4

Step 2: Calculate Probability Without Smoothing

Without smoothing, the likelihood of Outlook being "Overcast" for each class is:

P(Overcast | Yes) = 4/9

P(Overcast | No) = 0/5 = 0
This is problematic because a zero probability would completely eliminate this case from any
classification calculations.

Step 3: Apply Laplace Smoothing

Using Laplace Smoothing (α = 1), we add α to every count. The formula is:

P(w | class) = (count(w, class) + α) / (N + α·K)

where:

α = smoothing parameter (typically 1, but can be any positive number)

K = total number of classes

N = total number of training examples in the class

Since we have two possible classes (Yes and No), we add 2 to the denominator:

P(Overcast | Yes) = (4 + 1) / (9 + 2) = 5/11 ≈ 0.45 = 45%

P(Overcast | No) = (0 + 1) / (5 + 2) = 1/7 ≈ 0.14 = 14%
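
A minimal Python sketch (not from the original notes) that reproduces the arithmetic above, using the same α, K and N as in the formula:

# Laplace smoothing: (count + alpha) / (N + alpha * K)
def laplace_smoothed(count, class_total, alpha=1, num_classes=2):
    return (count + alpha) / (class_total + alpha * num_classes)

# "Overcast" appears 4 times among the 9 "Yes" rows and 0 times among the 5 "No" rows.
print(round(laplace_smoothed(4, 9), 2))  # 0.45 -> 45%
print(round(laplace_smoothed(0, 5), 2))  # 0.14 -> 14%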

Why is Laplace Smoothing Important?

 Without smoothing, P(Overcast | No) = 0 means the whole Naive Bayes product for "No" becomes zero,
so we would never predict "No" when Outlook is Overcast.
 With smoothing, we allow a small probability (14%) for "No", preventing a total elimination of this
case in classification.
 This helps avoid overfitting to a small data set and ensures better generalization.

ENSEMBLE METHODS:
To improve the accuracy of a model, ensemble learning methods were developed. Ensemble learning is a
machine learning concept in which several models are trained using machine learning algorithms. It
combines low-performing classifiers (also called weak learners or base learners) and aggregates the
individual model predictions into the final prediction.
On the basis of the type of base learners, ensemble methods can be categorized as homogeneous or
heterogeneous. If the base learners are of the same type, it is a homogeneous ensemble method; if the
base learners are of different types, it is a heterogeneous ensemble method.
Bagging

Consider a scenario where you are looking at users' ratings for a product. Instead of relying on a
single user's good/bad rating, we consider the average rating given to the product. With the average
rating, we can be considerably more confident about the quality of the product. Bagging makes use of
this principle: instead of depending on one model, it runs the data through multiple models in parallel
and averages their outputs as the model's final output.

What is Bagging? How it works?

 Bagging is an acronym for Bootstrapped Aggregation. Bootstrapping means random selection of
records with replacement from the training dataset. 'Random selection with replacement' can
be explained as follows:

a. Consider that there are 8 samples in the training dataset. Out of these 8 samples, every weak
learner gets 5 samples as training data for the model. These 5 samples need not be unique, or
non-repetitive.

b. The model (weak learner) is allowed to get a sample multiple times. For example, as shown in
the figure, Rec5 is selected 2 times by the model. Therefore, weak learner1 gets Rec2, Rec5,
Rec8, Rec5, Rec4 as training data.

c. All the samples are available for selection to next weak learners. Thus all 8 samples will be
available for next weak learner and any sample can be selected multiple times by next weak
learners.

 Bagging is a parallel method, which means several weak learners learn the data pattern
independently and simultaneously. This can be best shown in the below diagram:
1. The output of each weak learner is averaged to generate the final output of the model.

2. Since the weak learners' outputs are averaged, this mechanism helps to reduce variance or
variability in the predictions. However, it does not help to reduce the bias of the model.

3. Since the final prediction is an average of the outputs of the weak learners, each weak
learner has equal weight in the final output.
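
As an illustration (not part of the original notes), here is a minimal scikit-learn sketch of bagging with decision trees as the weak learners; the parameter values are arbitrary, the Iris dataset is only a stand-in, and the estimator argument is named base_estimator in scikit-learn versions before 1.2.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the weak/base learner
    n_estimators=10,                     # number of weak learners trained in parallel
    max_samples=0.6,                     # fraction of records drawn for each bootstrap sample
    bootstrap=True,                      # sample with replacement
    random_state=0,
)
bagging.fit(X, y)
print(bagging.predict(X[:3]))            # aggregated (majority-vote) predictions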

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of the predictions, predicts the final
output.

A greater number of trees in the forest generally leads to higher accuracy and helps prevent the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions using the trees created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to
the category that wins the majority votes.
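
As an illustration (not part of the original notes), here is a minimal scikit-learn sketch of the steps above; the parameter values are arbitrary and the Iris dataset is only a stand-in.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset

forest = RandomForestClassifier(
    n_estimators=100,     # N, the number of decision trees in the forest
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:3]))       # final class = majority vote over the 100 trees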
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.

2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.

3. Land Use: We can identify the areas of similar land use by this algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.

o It is capable of handling large datasets with high dimensionality.

o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although Random Forest can be used for both classification and regression tasks, it is not as
well suited for regression tasks.

What is Boosting?
Definition: The term 'Boosting' refers to a family of algorithms which convert weak learners into
strong learners.

Let’s understand this definition in detail by solving a problem of spam email identification:

How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to
identify 'spam' and 'not spam' emails using the following criteria. If:

1. Email has only one image file (promotional image), It’s a SPAM

2. Email has only link(s), It’s a SPAM

3. Email body consists of sentences like "You won a prize money of $ xxxxxx", it's a SPAM

4. Email from our official domain “[email protected]” , Not a SPAM

5. Email from known source, Not a SPAM

Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do you think
these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email into 'spam' or 'not
spam'. Therefore, these rules are called weak learners.

To convert weak learners into a strong learner, we'll combine the prediction of each weak learner using
methods like:
• Using the average / weighted average
• Considering the prediction that has the higher vote

For example: above, we have defined 5 weak learners. Out of these 5, 3 vote 'SPAM' and 2 vote
'Not a SPAM'. In this case, by default, we'll consider the email as SPAM because we have a
higher vote (3) for 'SPAM'.

How Boosting Algorithms works?


Now we know that boosting combines weak learners to form a strong rule. An immediate question
which should pop into your mind is, 'How does boosting identify weak rules?'

To find weak rules, we apply base learning (ML) algorithms with a different distribution each time.
Each time a base learning algorithm is applied, it generates a new weak prediction rule. This is an
iterative process. After many iterations, the boosting algorithm combines these weak rules into a
single strong prediction rule.

Here's another question which might haunt you: 'How do we choose a different distribution for each
round?'

For choosing the right distribution, here are the following steps:

Step 1: The base learner takes all the distributions and assigns equal weight or attention to each
observation.

Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay
higher attention to the observations having prediction errors. Then, we apply the next base learning
algorithm.

Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is achieved.

Finally, it combines the outputs from the weak learners and creates a strong learner, which eventually
improves the prediction power of the model. Boosting pays more attention to examples which are
misclassified or have higher errors due to preceding weak rules.

Types of Boosting Algorithms


The underlying engine used for boosting algorithms can be anything. It can be a decision stump, a
margin-maximizing classification algorithm, etc. There are many boosting algorithms which use
different types of engines, such as:

1. AdaBoost (Adaptive Boosting)

2. Gradient Tree Boosting
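
As an illustration (not part of the original notes), here is a minimal scikit-learn sketch of AdaBoost with decision stumps as the weak learners; the parameter values are arbitrary, the breast-cancer dataset is only a stand-in, and the estimator argument is named base_estimator in scikit-learn versions before 1.2. Gradient Tree Boosting is available in the same module as GradientBoostingClassifier.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a decision stump as the weak learner
    n_estimators=50,                                # number of boosting rounds
    learning_rate=1.0,                              # weight applied to each weak learner
    random_state=0,
)
ada.fit(X, y)                 # misclassified examples get higher weight before each new round
print(round(ada.score(X, y), 3))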
