Mod 4 - Clustering
Clustering:
Introduction - Similarity measures - Clustering criteria - Distance functions - K-means clustering,
Hierarchical clustering, Density-based clustering (DBSCAN)
Combining Multiple Learners: Voting, Bagging, Boosting
By:
Sherry O. Panicker
MCA, M. Phil
MCA@NirmalaCollegeMuvattupuzha
Introduction
• Clustering or cluster analysis is a machine learning technique that groups an unlabelled
dataset. It can be defined as "a way of grouping the data points into different clusters,
consisting of similar data points. The objects with possible similarities remain in a group
that has few or no similarities with another group."
• Clustering is the process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity, but are very dissimilar to objects in other
clusters.
• Dissimilarities and similarities are assessed based on the attribute values and often involve
distance measures.
Introduction
Application areas of clustering techniques:
Clustering is used as a data mining tool in biology, security, business intelligence, and Web search, for example in:
Market Segmentation
Customer Segmentation
Image segmentation
Search Engines
In Land Use
In Biology
Introduction
• It is an unsupervised learning method.
• After applying a clustering technique, each cluster or group is given a cluster ID.
• An ML system can use this ID to simplify the processing of large and complex datasets.
Similarity Measures
• Similarity denotes the strength of the relationship between two data items; it represents how
similar two data patterns are.
• Clustering is done based on a similarity measure to group similar data objects together.
• The clusters are formed in such a way that any two data objects within a cluster have a
minimum distance value and any two data objects across different clusters have a maximum
distance value.
• Clustering using distance functions, called distance-based clustering, is a very popular
technique for clustering objects and has given good results.
• Similarity measure is based on distance functions such as
Euclidean distance
Manhattan distance
Minkowski distance
Cosine similarity, etc. to group objects in clusters.
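A quick Python sketch of these measures for two numeric vectors (the function names and the example vectors are illustrative, not part of the original slides):

import math

def euclidean(a, b):
    # square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # generalises both: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors (1 means the same direction)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a, b = [1, 2, 3], [4, 6, 8]
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 3), cosine_similarity(a, b))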
Clustering Criteria
• A good clustering method will produce high quality clusters where:
– the intra-class similarity is high.
– the inter-class similarity is low.
• The quality of a clustering result also depends on both the similarity measure used by
the method and its implementation.
Distance functions
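For two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the standard definitions of the distance functions listed above are:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
$$d_{\text{Minkowski}}(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \quad (p = 1 \text{ gives Manhattan}, \; p = 2 \text{ gives Euclidean})$$
$$\text{cosine\_similarity}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$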
K-Means clustering
• Can be used for clustering many kinds of data.
• Aim: clustering/grouping of data. For example, YouTube groups people on the basis of age, location, etc.
K-Means clustering…
I. 1D data: 2, 4, 10, 8, 16, 30, 12, 6
1. Suppose k = 2, i.e., we are forming 2 clusters.
2. Randomly select the cluster centres, say 6 and 12.
3. Calculate similarity using a distance function: for each remaining data item, compute its
absolute difference from each cluster centre. The item is assigned to the cluster with the
minimum distance, i.e., maximum similarity.
Iteration 1 (centres 6 and 12):
|2-6| = 4 less    |2-12| = 10 more   -> C1
|4-6| = 2 less    |4-12| = 8 more    -> C1
|8-6| = 2 less    |8-12| = 4 more    -> C1
|10-6| = 4 more   |10-12| = 2 less   -> C2
|16-6| = 10 more  |16-12| = 4 less   -> C2
|30-6| = 24 more  |30-12| = 18 less  -> C2
End of iteration 1: C1 = {2, 4, 6, 8}, C2 = {10, 12, 16, 30}.
K-Means clustering…
Iteration 2: Find the cluster centres again by taking the average of the data items in each cluster.
The new centres are 5 (average of 2, 4, 6, 8) and 17 (average of 10, 12, 16, 30). Note that the new
centre values may or may not be part of the actual data.
K-Means clustering…
Repeat the assignment step with the new centres. Note: if the distances to both clusters are the
same for a data item, it can belong to either C1 or C2.
|2-5| = 3 less    |2-17| = 15 more   -> C1
|4-5| = 1 less    |4-17| = 13 more   -> C1
|6-5| = 1 less    |6-17| = 11 more   -> C1
|8-5| = 3 less    |8-17| = 9 more    -> C1
|10-5| = 5 less   |10-17| = 7 more   -> C1
|12-5| = 7 more   |12-17| = 5 less   -> C2
|16-5| = 11 more  |16-17| = 1 less   -> C2
|30-5| = 25 more  |30-17| = 13 less  -> C2
End of iteration 2: C1 = {2, 4, 6, 8, 10}, C2 = {12, 16, 30}.
K-Means clustering…
Iteration 3: Find the cluster centres again by taking the average of the data items in each cluster;
do not include 5 and 17, since they are not part of the data. The new centres are 6 (average of
2, 4, 6, 8, 10) and about 19.3 (average of 12, 16, 30).
End of iteration 3
K-Means clustering…
• Clusters are named starting from label 0 onwards. This is clustering of 1D data.
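A minimal pure-Python sketch of this 1D procedure (the function name, variable names, and the fixed iteration count are illustrative assumptions, not part of the original slides):

import statistics

def kmeans_1d(data, centres, iterations=10):
    clusters = []
    for _ in range(iterations):
        # assignment step: each item goes to the cluster whose centre is nearest
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # update step: each centre becomes the average of its cluster (unchanged if the cluster is empty)
        centres = [statistics.mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

centres, clusters = kmeans_1d([2, 4, 10, 8, 16, 30, 12, 6], [6, 12])
print(centres, clusters)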
K-Means clustering…
• Drawback: k-means produces clusters with circular or elliptical, i.e., rigid, boundaries,
so points near or on the border may be assigned poorly. It is better to allow arbitrary
cluster shapes; DBSCAN makes this possible.
II. Clustering of 2D data
Suppose D2 and D4 are chosen as the initial cluster centres. Then calculate the distances of the
remaining points to each centre:
D1-D2 and D1-D4
D3-D2 and D3-D4
D5-D2 and D5-D4
In 2D, the distance between data points is calculated using the Euclidean distance (see the sketch
below for an example).
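A small sketch of this 2D assignment step; the coordinates for D1-D5 are hypothetical, chosen only for illustration, with D2 and D4 taken as the initial centres:

import math

# hypothetical coordinates for D1..D5 (not from the slides)
points = {"D1": (1, 1), "D2": (2, 1), "D3": (4, 3), "D4": (5, 4), "D5": (1, 2)}
centres = {"C1": points["D2"], "C2": points["D4"]}   # D2 and D4 as initial centres

for name in ("D1", "D3", "D5"):
    # Euclidean distance from the point to each centre; assign to the nearest one
    dists = {c: math.dist(points[name], centre) for c, centre in centres.items()}
    print(name, dists, "-> assigned to", min(dists, key=dists.get))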
K-Means clustering…
Hands on for K-Means
• Number of items=150
• Dimensionality = 4
• Since k-means is unsupervised, we take only x and do not consider the output y; the data is treated as unlabelled.
• Group data based on similarity
Within cluster: high similarity
Between clusters: very low similarity. High dissimilarity
• Do the steps mentioned with 1D.
K-Means clustering…- Hands on for K-Means
• Project the data onto a 2D plane, so we take only 2 features; suppose we take sepal length (SL) and sepal width (SW).
Final result:
K-Means clustering…- Hands on for K-Means
#1 Sklearn Package
from sklearn.cluster import KMeans
ML = KMeans(n_clusters=3, max_iter=5)

#2 Load Data
import pandas as pd
file = pd.read_csv("/content/irisexcel.csv")
x = file[["sepal_length", "sepal_width"]]
ML.fit(x)

#3 Finding Centers and labels
centers = ML.cluster_centers_
labels = ML.labels_

#4 Draw the Graph
import matplotlib.pyplot as plt
xaxis = file["sepal_length"]
yaxis = file["sepal_width"]
plt.scatter(xaxis, yaxis, c=labels, cmap="rainbow")   # c=labels: plotting is based on the variable 'labels'

OUTPUT: a scatter plot of sepal length vs sepal width, with points coloured by cluster label.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• Deals with arbitrary shapes
DBSCAN parameters:
• R (the neighbourhood radius, often called eps): the radius considered around each point.
• M (the minimum number of points, often called MinPts): the minimum number of points required within that radius.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
DBSCAN – points
Each point is one of:
• Core point: has at least M other points within radius R.
• Border point: is not a core point itself but lies within radius R of a core point.
• Outlier (noise) point: is neither a core point nor a border point.
Core point example
• Let R = 2 and M = 3.
• Consider a point x and the circle of radius 2 around it. If there are at least 3 other points within this
radius, then x is a core point (there can be more than 3, too).
DBSCAN…
There are 3 types of density reachability.
• p is a core point, and q is a point in p's neighbourhood.
• If q is also a core point, we can grow the cluster from q.
• Let r be a point in q's neighbourhood.
• Through q, r is density reachable from p.
• So p, q, and r can be considered a single group.
DBSCAN…
• Thus the points form a chain, and all the points in the chain can be grouped into a single
category.
DBSCAN…
DBSCAN steps
1. Suppose we have a set of points.
2. Select a random point and check whether it is a core point (are there the minimum number of
points in its neighbourhood?) and whether it can form a cluster.
3. If it forms a cluster, mark all the points inside its neighbourhood as visited with a tick mark
(here, all 4 points inside the blue boundary get a tick mark).
DBSCAN…
7. Continue the process till all points have been visited and marked.
8. Combine the groups based on the 3 density reachability conditions.
9. Advantages: clusters of arbitrary shape are possible, and the number of clusters need not be specified in advance.
Implementing in Python:
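A minimal sketch using scikit-learn's DBSCAN on the same two Iris features as in the K-Means hands-on (the eps and min_samples values here are illustrative, playing the roles of R and M):

from sklearn.cluster import DBSCAN
import pandas as pd

file = pd.read_csv("/content/irisexcel.csv")     # same file as in the K-Means hands-on
x = file[["sepal_length", "sepal_width"]]

# eps plays the role of the radius R, min_samples the role of M; these values are illustrative
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(x)                    # label -1 marks outlier (noise) points
print(labels)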
Hierarchical Clustering
• Hierarchical clustering builds a hierarchy of clusters, usually visualised as a dendrogram.
• There are two approaches: agglomerative (bottom-up, AGNES), which starts with every object in its
own cluster and repeatedly merges the closest clusters, and divisive (top-down, DIANA), which starts
with all objects in one cluster and repeatedly splits.
Hierarchical Clustering- Agglomerative algorithm(AGNES)
• At each iteration of an agglomerative algorithm, we choose the two closest groups to merge.
• There are various ways to calculate the distance between two clusters, and these ways
decide the rule for clustering.
• These measures are called linkage methods (see the scikit-learn sketch after this list).
In single-link clustering, the distance is defined as the smallest distance between all possible
pairs of elements of the two groups.
In complete-link clustering, the distance between two groups is taken as the largest distance
between all possible pairs.
In the average-link method, the average of the distances between all pairs is used.
In centroid distance, the distance between the centroids (means) of the two groups is
measured.
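A minimal scikit-learn sketch of agglomerative clustering on the same two Iris features used earlier (the linkage value shown is illustrative and is the parameter to change):

from sklearn.cluster import AgglomerativeClustering
import pandas as pd

file = pd.read_csv("/content/irisexcel.csv")     # same file as in the K-Means hands-on
x = file[["sepal_length", "sepal_width"]]

# linkage can be "single", "complete", "average", or "ward"
agg = AgglomerativeClustering(n_clusters=3, linkage="single")
labels = agg.fit_predict(x)
print(labels)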
Hierarchical Clustering..
Single-link clustering
Complete-link clustering
Hierarchical Clustering..
• Proceeding this way we get three groups: P1-P2, P3-P4, and P5-P6.
• Similarly, we have three dendrograms, as shown:
Hierarchical Clustering..
• Finally, we bring everything together by joining P1-P2 with the cluster formed from P3-P4 and P5-P6.
• Here, we make use of centroids: the centroid of a cluster is the average of its points.
• Let's first take the points (1, 2) and (2, 1) and group them together because they are close.
For these points, we compute the point in the middle and mark it as (1.5, 1.5); it is the
centroid of those two points.
DIANA Hierarchical Clustering
• DIANA (DIvisive ANAlysis) is the top-down counterpart of AGNES: it starts with all objects in one
cluster and recursively splits clusters until each object stands alone or a stopping condition is met.
When do we stop combining clusters?
• Typically, either when a pre-chosen number of clusters has been reached, or when the distance
between the closest clusters exceeds a chosen threshold (i.e., the dendrogram is cut at that level).
Combining Multiple Learners - Voting, Bagging, Boosting
We discussed many different learning algorithms in the previous chapters. Though these are
generally successful, no one single algorithm is always the most accurate. Now, we are going to
discuss models composed of multiple learners that complement each other so that by combining
them, we attain higher accuracy.
What are the different ways to combine classifiers in machine learning?
They can be divided into two big groups:
1. Ensemble methods: Bagging (Bootstrap Aggregating) and Boosting are the most widely used
ones.
2. Hybrid methods
An ensemble is a machine learning model that combines the predictions from two or more
models. The models that contribute to the ensemble, referred to as ensemble members, may be
the same type or different types and may or may not be trained on the same training data.
Combining Multiple Learners - Model Combination Schemes
1. Multiexpert combination methods use a parallel approach: the base-learners work in parallel
and their outputs are combined. Ex. Voting.
2. Multistage combination methods use a serial approach where the next base-learner is trained
with or tested on only the instances where the previous base-learners are not accurate enough.
OR
base-learners are sorted in increasing complexity so that a complex base-learner is not used
unless the preceding simpler base-learners are not confident. Ex. Cascading.
Types of Ensembling techniques include:
• Bagging or Bootstrap Aggregation
• Boosting
• Stacking Classifier
• Voting Classifier
17.4 Voting
• This is the simplest way to combine multiple classifiers.
• It corresponds to taking a linear combination of the learners, as written out below.
• This is also known as ensembles and linear opinion pools.
• In simple voting, all learners are given equal weight and we take the average.
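One common way to write this, where $d_{ji}$ is the vote of learner $j$ for class $C_i$, $w_j$ is the weight of learner $j$, and $L$ is the number of learners:

$$y_i = \sum_{j=1}^{L} w_j d_{ji}, \qquad w_j \ge 0, \qquad \sum_{j=1}^{L} w_j = 1$$

In simple voting, $w_j = 1/L$ for all $j$, so $y_i$ is just the average of the votes for class $C_i$.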
Table 17.1: Classifier combination rules. Table 17.2: Example of combination rules on three learners and three classes.
17.4 Voting…
Voting Classifier:
• A voting classifier is a machine learning estimator that trains various base models or estimators
and predicts on the basis of aggregating the findings of each base estimator.
• It can be a homogeneous or a heterogeneous type of ensemble learning; that is, the base
classifiers can be of the same or of different types.
• It also works as an extension of bagging (e.g. Random Forest).
The voting criteria can be of two types:
• Hard Voting: Voting is calculated on the predicted output class.
• Soft Voting: Voting is calculated on the predicted probability of the output class.
• Simple voting is a special case where all voters have equal weight.
• Taking the class with the maximum number of votes as the winner is called plurality voting;
when there are two classes, this is majority voting.
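A minimal scikit-learn sketch of a voting classifier (the three base estimators chosen here are illustrative); switching voting between "hard" and "soft" selects between the two criteria above:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# voting="hard" counts predicted class labels; voting="soft" averages predicted class probabilities
clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier()),
                ("nb", GaussianNB())],
    voting="soft")
clf.fit(X, y)
print(clf.predict(X[:5]))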
• Voting schemes can be seen as approximations under a Bayesian framework. This is Bayesian
model combination.
17.4 Voting…
Figure 1: Voting classifier in "hard" mode. Figure 2: Voting classifier in "soft" mode.
17.6 Bagging (Bootstrap Aggregating)
• Bagging is a voting method where base-learners are made different by training them over
slightly different training sets.
• Unstable algorithm: a learning algorithm is unstable if small changes in the training set
cause a large difference in the generated learner.
• Bagging, short for bootstrap aggregating, uses bootstrapping to generate L training sets, trains L
base-learners using an unstable learning procedure, and then, during testing, takes an average
of their predictions (voting in the case of classification).
17.6 Bagging(Bootstrap Aggregating)…
Figure: Bootstrapping and Aggregation.
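A minimal scikit-learn sketch of bagging (the choice of decision trees as base-learners and the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 decision trees, each trained on a bootstrap sample drawn with replacement;
# their predictions are combined by voting
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=True)
bag.fit(X, y)
print(bag.predict(X[:5]))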
17.7 Boosting
• Here, we try to generate complementary base-learners by training the next learner on the
mistakes of the previous learners (this is the idea of boosting).
• The original boosting algorithm combines three weak learners to generate a strong learner.
1. Given a large training set, randomly divide it into three parts X1, X2, and X3.
2. Use X1 to train d1.
3. Then take X2 and feed it to d1.
4. All instances in X2 misclassified by d1, together with as many instances on which d1 is correct,
form the training set of d2.
5. Then take X3 and feed it to d1 and d2.
6. The instances on which d1 and d2 disagree form the training set of d3.
7. Testing: feed an instance to d1 and d2; if they agree, that is the response, otherwise the response
of d3 is taken as the output.
8. This overall system has a reduced error rate.
17.7 Boosting…
Disadvantage:
• Though successful, it requires a very large training sample.
• The sample has to be divided into three, and the second and third classifiers are trained
only on the subsets on which the previous ones err.
• So without a large training set, d2 and d3 will not have training sets of reasonable size.
17.7 Boosting…
Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training
sample. AdaBoost is a variant of the boosting technique.
• AdaBoost (Adaptive Boosting) was the first boosting algorithm in machine learning to combine various weak
classifiers into a single strong classifier.
• It primarily focuses on classification tasks such as binary classification.
AdaBoost (adaptive boosting) uses the same training set over and over, and thus the set need not be large, but the
classifiers should be simple so that they do not overfit. AdaBoost can also combine an arbitrary number of base-
learners, not just three.
AdaBoost outline:
1. Initially, assign equal probability to all data instances and give a sample to the first learner.
2. Using the trained first learner, classify the whole dataset.
3. Update the probability of each instance so that misclassified instances have a higher chance of being chosen
and fed to the next learner.
4. Repeat this process over all learners in a serial manner; the final combined learner is expected to classify
the data correctly.
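A minimal scikit-learn AdaBoost sketch (the depth-1 decision trees used as simple base-learners and the number of learners are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 50 depth-1 decision trees ("stumps") trained in sequence; each round reweights the
# instances so that previously misclassified ones are more likely to be chosen
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
ada.fit(X, y)
print(ada.predict(X[:5]))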
End of Module 4
Thank you