Mod 4 - Clustering
Clustering:
Introduction - Similarity measures - Clustering criteria - Distance functions - K-means clustering,
Hierarchical clustering, Density-based clustering (DBSCAN)
Combining Multiple Learners: Voting, Bagging, Boosting
By:
Sherry O. Panicker
MCA, M. Phil
MCA@NirmalaCollegeMuvattupuzha
Introduction
• Clustering or cluster analysis is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping the data points into different clusters,
consisting of similar data points. The objects with possible similarities remain in a group
that has few or no similarities with another group."
• Clustering is the process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity, but are very dissimilar to objects in other
clusters.
• Dissimilarities and similarities are assessed based on the attribute values and often involve
distance measures.
Introduction
Application areas of clustering techniques:
Clustering is used as a data mining tool in biology, security, business intelligence, and Web search, for example in:
Market Segmentation
Customer Segmentation
Image segmentation
Search Engines
In Land Use
In Biology
Introduction
• It is an unsupervised learning method.
• After applying a clustering technique, each cluster or group is given a cluster ID.
• An ML system can use this ID to simplify the processing of large and complex datasets.
Similarity Measures
• Similarity denotes the strength of the relationship between two data items; it represents how
similar two data patterns are.
• Clustering is done based on a similarity measure to group similar data objects together.
• The clusters are formed in such a way that any two data objects within a cluster have a
minimum distance value and any two data objects across different clusters have a maximum
distance value.
• Clustering using distance functions, called distance-based clustering, is a very popular
technique for clustering objects and has given good results.
• Similarity measure is based on distance functions such as
Euclidean distance
Manhattan distance
Minkowski distance
Cosine similarity, etc. to group objects in clusters.
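A quick Python sketch of these measures for two numeric vectors (the function names and the example vectors are illustrative, not part of the original slides):

import math

def euclidean(a, b):
    # square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # generalises both: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors (1 means the same direction)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a, b = [1, 2, 3], [4, 6, 8]
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3), cosine_similarity(a, b))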
Clustering Criteria
• A good clustering method will produce high quality clusters where:
– the intra-class similarity is high.
– the inter-class similarity is low.
• The quality of a clustering result also depends on both the similarity measure used by
the method and its implementation.
Distance functions
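For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the standard definitions of the distance functions listed above are:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
$$d_{\text{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \quad (p = 1 \text{ gives Manhattan}, \; p = 2 \text{ gives Euclidean})$$
$$\text{cosine\_similarity}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$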
K-Means clustering
• Can be used for clustering many kinds of data.
• Aim: clustering/grouping of data. For example, YouTube groups people on the basis of age, location, etc.
K-Means clustering…
I. 1D data: 2, 4, 10, 8, 16, 30, 12, 6
1. Suppose k = 2, i.e., we are forming 2 clusters.
2. Randomly select the cluster centres, say 6 and 12.
3. Calculate similarity using a distance function: for each remaining data item, compute its
absolute difference from each cluster centre. The item is assigned to the cluster with the
minimum distance, i.e., maximum similarity.
Iteration 1 (centres 6 and 12):
|2-6| = 4 less    |2-12| = 10 more   -> C1
|4-6| = 2 less    |4-12| = 8 more    -> C1
|8-6| = 2 less    |8-12| = 4 more    -> C1
|10-6| = 4 more   |10-12| = 2 less   -> C2
|16-6| = 10 more  |16-12| = 4 less   -> C2
|30-6| = 24 more  |30-12| = 18 less  -> C2
End of iteration 1: C1 = {2, 4, 6, 8}, C2 = {10, 12, 16, 30}.
K-Means clustering…
Iteration 2: Find the cluster centres again by taking the average of the data items in each cluster.
The new centres are 5 (average of 2, 4, 6, 8) and 17 (average of 10, 12, 16, 30). Note that the new
centre values may or may not be part of the actual data.
K-Means clustering…
Repeat the assignment step with the new centres. Note: if the distances to both clusters are the
same for a data item, it can belong to either C1 or C2.
|2-5| = 3 less    |2-17| = 15 more   -> C1
|4-5| = 1 less    |4-17| = 13 more   -> C1
|6-5| = 1 less    |6-17| = 11 more   -> C1
|8-5| = 3 less    |8-17| = 9 more    -> C1
|10-5| = 5 less   |10-17| = 7 more   -> C1
|12-5| = 7 more   |12-17| = 5 less   -> C2
|16-5| = 11 more  |16-17| = 1 less   -> C2
|30-5| = 25 more  |30-17| = 13 less  -> C2
End of iteration 2: C1 = {2, 4, 6, 8, 10}, C2 = {12, 16, 30}.
K-Means clustering…
Iteration 3: Find the cluster centres again by taking the average of the data items in each cluster;
do not include 5 and 17, since they are not part of the data. The new centres are 6 (average of
2, 4, 6, 8, 10) and about 19.3 (average of 12, 16, 30).
End of iteration 3
K-Means clustering…
• Clusters are named starting from label 0 onwards. This is clustering of 1D data.
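A minimal pure-Python sketch of this 1D procedure (the function name, variable names, and the fixed iteration count are illustrative assumptions, not part of the original slides):

import statistics

def kmeans_1d(data, centres, iterations=10):
    clusters = []
    for _ in range(iterations):
        # assignment step: each item goes to the cluster whose centre is nearest
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # update step: each centre becomes the average of its cluster (unchanged if the cluster is empty)
        centres = [statistics.mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = kmeans_1d([2, 4, 10, 8, 16, 30, 12, 6], [6, 12])
print(centres, clusters)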
K-Means clustering…
• Drawback: k-means produces clusters with circular or elliptical, i.e., rigid, boundaries,
so points near or on the border may be assigned poorly. It is better to allow arbitrary
cluster shapes; DBSCAN makes this possible.
II. Clustering of 2D data
Suppose D2 and D4 are chosen as the initial cluster centres. Then calculate the distances of the
remaining points to each centre:
D1-D2 and D1-D4
D3-D2 and D3-D4
D5-D2 and D5-D4
In 2D, the distance between data points is calculated using the Euclidean distance (see the sketch
below for an example).
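A small sketch of this 2D assignment step; the coordinates for D1-D5 are hypothetical, chosen only for illustration, with D2 and D4 taken as the initial centres:

import math

# hypothetical coordinates for D1..D5 (not from the slides)
points = {"D1": (1, 1), "D2": (2, 1), "D3": (4, 3), "D4": (5, 4), "D5": (1, 2)}
centres = {"C1": points["D2"], "C2": points["D4"]}   # D2 and D4 as initial centres

for name in ("D1", "D3", "D5"):
    # Euclidean distance from the point to each centre; assign to the nearest one
    dists = {c: math.dist(points[name], centre) for c, centre in centres.items()}
    print(name, dists, "-> assigned to", min(dists, key=dists.get))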
K-Means clustering…
Hands on for K-Means
• Number of items=150
• Dimensionality = 4
• Since k-means is unsupervised, we take only x and do not consider the output y; the data is treated as unlabelled.
• Group data based on similarity
Within cluster: high similarity
Between clusters: very low similarity. High dissimilarity
• Do the steps mentioned with 1D.
K-Means clustering…- Hands on for K-Means
• Project the data onto a 2D plane, so we take only 2 features; suppose we take sepal length (SL) and sepal width (SW).
Final result:
K-Means clustering…- Hands on for K-Means
#1 Sklearn Package
from sklearn.cluster import KMeans
ML = KMeans(n_clusters=3, max_iter=5)

#2 Load Data
import pandas as pd
file = pd.read_csv("/content/irisexcel.csv")
x = file[["sepal_length", "sepal_width"]]
ML.fit(x)

#3 Finding Centers and labels
centers = ML.cluster_centers_
labels = ML.labels_

#4 Draw the Graph
import matplotlib.pyplot as plt
xaxis = file["sepal_length"]
yaxis = file["sepal_width"]
plt.scatter(xaxis, yaxis, c=labels, cmap="rainbow")   # c=labels: plotting is based on the variable 'labels'

OUTPUT: a scatter plot of sepal length vs sepal width, with points coloured by cluster label.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• Deals with arbitrary shapes
DBSCAN parameters:
• R (the neighbourhood radius, often called eps): the radius considered around each point.
• M (the minimum number of points, often called MinPts): the minimum number of points required within that radius.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
DBSCAN – points
Each point is one of:
• Core point: has at least M other points within radius R.
• Border point: is not a core point itself but lies within radius R of a core point.
• Outlier (noise) point: is neither a core point nor a border point.
Core point example
• Let R = 2 and M = 3.
• Consider a point x and the circle of radius 2 around it. If there are at least 3 other points within this
radius, then x is a core point (there can be more than 3, too).
DBSCAN…
There are 3 types of density reachability.
• p is a core point, and q is a point in p's neighbourhood.
• If q is also a core point, we can grow the cluster from q.
• Let r be a point in q's neighbourhood.
• Through q, r is density reachable from p.
• So p, q, and r can be considered a single group.
DBSCAN…
• Thus the points form a chain, and all the points in the chain can be grouped into a single
category.
DBSCAN…
DBSCAN steps
1. Suppose we have a set of points.
2. Select a random point and check whether it is a core point (are there the minimum number of
points in its neighbourhood?) and whether it can form a cluster.
3. If it forms a cluster, mark all the points inside its neighbourhood as visited with a tick mark
(here, all 4 points inside the blue boundary get a tick mark).
DBSCAN…
7. Continue the process till all points have been visited and marked.
8. Combine the groups based on the 3 density reachability conditions.
9. Advantages: clusters of arbitrary shape are possible, and the number of clusters need not be specified in advance.
Implementing in Python:
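A minimal sketch using scikit-learn's DBSCAN on the same two Iris features as in the K-Means hands-on (the eps and min_samples values here are illustrative, playing the roles of R and M):

from sklearn.cluster import DBSCAN
import pandas as pd

file = pd.read_csv("/content/irisexcel.csv")     # same file as in the K-Means hands-on
x = file[["sepal_length", "sepal_width"]]

# eps plays the role of the radius R, min_samples the role of M; these values are illustrative
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(x)                    # label -1 marks outlier (noise) points
print(labels)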
Hierarchical Clustering
• Hierarchical clustering builds a hierarchy of clusters, usually visualised as a dendrogram.
• There are two approaches: agglomerative (bottom-up, AGNES), which starts with every object in its
own cluster and repeatedly merges the closest clusters, and divisive (top-down, DIANA), which starts
with all objects in one cluster and repeatedly splits.
Hierarchical Clustering- Agglomerative algorithm(AGNES)
• At each iteration of an agglomerative algorithm, we choose the two closest groups to merge.
• There are various ways to calculate the distance between two clusters, and these ways
decide the rule for clustering.
• These measures are called linkage methods (see the scikit-learn sketch after this list).
In single-link clustering, the distance is defined as the smallest distance between all possible
pairs of elements of the two groups.
In complete-link clustering, the distance between two groups is taken as the largest distance
between all possible pairs.
In the average-link method, the average of the distances between all pairs is used.
In centroid distance, the distance between the centroids (means) of the two groups is
measured.
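A minimal scikit-learn sketch of agglomerative clustering on the same two Iris features used earlier (the linkage value shown is illustrative and is the parameter to change):

from sklearn.cluster import AgglomerativeClustering
import pandas as pd

file = pd.read_csv("/content/irisexcel.csv")     # same file as in the K-Means hands-on
x = file[["sepal_length", "sepal_width"]]

# linkage can be "single", "complete", "average", or "ward"
agg = AgglomerativeClustering(n_clusters=3, linkage="single")
labels = agg.fit_predict(x)
print(labels)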
Hierarchical Clustering..
Single-link clustering
Complete-link clustering
Hierarchical Clustering..
• Proceeding this way we get three groups: P1-P2, P3-P4, and P5-P6.
• Similarly, we have three dendrograms, as shown:
Hierarchical Clustering..
• Finally, we bring everything together by joining P1-P2 with the cluster formed from P3-P4 and P5-P6.
• Here, we make use of centroids: the centroid of a cluster is the average of its points.
• Let's first take the points (1, 2) and (2, 1) and group them together because they are close.
For these points, we compute the point in the middle and mark it as (1.5, 1.5); it is the
centroid of those two points.
DIANA Hierarchical Clustering
• DIANA (DIvisive ANAlysis) is the top-down counterpart of AGNES: it starts with all objects in one
cluster and recursively splits clusters until each object stands alone or a stopping condition is met.
When do we stop combining clusters?
• Typically, either when a pre-chosen number of clusters has been reached, or when the distance
between the closest clusters exceeds a chosen threshold (i.e., the dendrogram is cut at that level).
Combining Multiple Learners - Voting, Bagging, Boosting
We discussed many different learning algorithms in the previous chapters. Though these are
generally successful, no one single algorithm is always the most accurate. Now, we are going to
discuss models composed of multiple learners that complement each other so that by combining
them, we attain higher accuracy.
What are the different ways to combine classifiers in machine learning?
They can be divided into two big groups:
1. Ensemble methods: Bagging (Bootstrap Aggregating) and Boosting are the most widely used
ones.
2. Hybrid methods
An ensemble is a machine learning model that combines the predictions from two or more
models. The models that contribute to the ensemble, referred to as ensemble members, may be
the same type or different types and may or may not be trained on the same training data.
Combining Multiple Learners - Model Combination Schemes
1. Multiexpert combination methods use a parallel approach: the base-learners work in parallel
and their outputs are combined. Ex. Voting.
2. Multistage combination methods use a serial approach where the next base-learner is trained
with or tested on only the instances where the previous base-learners are not accurate enough.
OR
base-learners are sorted in increasing complexity so that a complex base-learner is not used
unless the preceding simpler base-learners are not confident. Ex. Cascading.
Types of Ensembling techniques include:
• Bagging or Bootstrap Aggregation
• Boosting
• Stacking Classifier
• Voting Classifier
17.4 Voting
• This is the simplest way to combine multiple classifiers.
• It corresponds to taking a linear combination of the learners, as written out below.
• This is also known as ensembles and linear opinion pools.
• In simple voting, all learners are given equal weight and we take the average.
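One common way to write this, where $d_{ji}$ is the vote of learner $j$ for class $C_i$, $w_j$ is the weight of learner $j$, and $L$ is the number of learners:

$$y_i = \sum_{j=1}^{L} w_j d_{ji}, \qquad w_j \ge 0, \qquad \sum_{j=1}^{L} w_j = 1$$

In simple voting, $w_j = 1/L$ for all $j$, so $y_i$ is just the average of the votes for class $C_i$.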
Table 17.1: Classifier combination rules. Table 17.2: Example of combination rules on three learners and three classes.
17.4 Voting…
Voting Classifier:
• A voting classifier is a machine learning estimator that trains various base models or estimators
and predicts on the basis of aggregating the findings of each base estimator.
• It can be a homogeneous or a heterogeneous type of ensemble learning; that is, the base
classifiers can be of the same or of different types.
• It also works as an extension of bagging (e.g. Random Forest).
The voting criteria can be of two types:
• Hard Voting: Voting is calculated on the predicted output class.
• Soft Voting: Voting is calculated on the predicted probability of the output class.
• Simple voting is a special case where all voters have equal weight.
• Taking the class with the maximum number of votes as the winner is called plurality voting;
when there are two classes, this is majority voting.
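A minimal scikit-learn sketch of a voting classifier (the three base estimators chosen here are illustrative); switching voting between "hard" and "soft" selects between the two criteria above:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# voting="hard" counts predicted class labels; voting="soft" averages predicted class probabilities
clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier()),
                ("nb", GaussianNB())],
    voting="soft")
clf.fit(X, y)
print(clf.predict(X[:5]))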
• Voting schemes can be seen as approximations under a Bayesian framework. This is Bayesian
model combination.
17.4 Voting…
Figure 1: Voting classifier in "hard" mode. Figure 2: Voting classifier in "soft" mode.
17.6 Bagging (Bootstrap Aggregating)
• Bagging is a voting method where base-learners are made different by training them over
slightly different training sets.
• Unstable algorithm: a learning algorithm is unstable if small changes in the training set
cause a large difference in the generated learner.
• Bagging, short for bootstrap aggregating, uses bootstrapping to generate L training sets, trains L
base-learners using an unstable learning procedure, and then, during testing, takes an average
of their predictions (voting in the case of classification).
17.6 Bagging(Bootstrap Aggregating)…
Figure: Bootstrapping and Aggregation.
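A minimal scikit-learn sketch of bagging (the choice of decision trees as base-learners and the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 decision trees, each trained on a bootstrap sample drawn with replacement;
# their predictions are combined by voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=True)
bag.fit(X, y)
print(bag.predict(X[:5]))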
17.7 Boosting
• Here, we try to generate complementary base-learners by training the next learner on the
mistakes of the previous learners (this is the idea of boosting).
• The original boosting algorithm combines three weak learners to generate a strong learner.
1. Given a large training set, randomly divide it into three parts X1, X2, and X3.
2. Use X1 to train d1.
3. Then take X2 and feed it to d1.
4. All instances in X2 misclassified by d1, together with as many instances on which d1 is correct,
form the training set of d2.
5. Then take X3 and feed it to d1 and d2.
6. The instances on which d1 and d2 disagree form the training set of d3.
7. Testing: feed an instance to d1 and d2; if they agree, that is the response, otherwise the response
of d3 is taken as the output.
8. This overall system has a reduced error rate.
17.7 Boosting…
Disadvantage:
• Though successful, it requires a very large training sample.
• The sample has to be divided into three, and the second and third classifiers are trained
only on the subsets on which the previous ones err.
• So without a large training set, d2 and d3 will not have training sets of reasonable size.
17.7 Boosting…
Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training
sample. AdaBoost is a variant of the boosting technique.
• AdaBoost (Adaptive Boosting) was the first boosting algorithm in machine learning to combine various weak
classifiers into a single strong classifier.
• It primarily focuses on classification tasks such as binary classification.
AdaBoost (adaptive boosting) uses the same training set over and over, and thus the set need not be large, but the
classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base-
learners, not just three.
AdaBoost outline:
1. Initially, assign equal probability to all data instances and give a sample to the first learner.
2. Using the trained first learner, classify the whole dataset.
3. Update the probability of each instance so that misclassified instances have a higher chance of being chosen
and fed to the next learner.
4. Repeat this process over all learners in a serial manner; the final combined learner is expected to classify
the data correctly.
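A minimal scikit-learn AdaBoost sketch (the depth-1 decision trees used as simple base-learners and the number of learners are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 50 depth-1 decision trees ("stumps") trained in sequence; each round reweights the
# instances so that previously misclassified ones are more likely to be chosen
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
ada.fit(X, y)
print(ada.predict(X[:5]))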
End of Module 4
Thank you