Clustering

Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, colour, or behaviour, and divides the data according to the presence or absence of those patterns.
It is an unsupervised learning method, so no supervision is provided to the algorithm; it works entirely with unlabelled data.
After clustering, each cluster or group is assigned a cluster-ID, which an ML system can use to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a mall, we can observe that items with similar usage are grouped together: t-shirts in one section, trousers in another; similarly, in the produce section, apples, bananas, mangoes, and so on are kept in separate groups so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique can be applied to a wide variety of tasks. Some of the most common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, Amazon applies clustering in its recommendation system to suggest products based on a user's past searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The diagram below illustrates the working of a clustering algorithm: different fruits are divided into several groups with similar properties.
Types of Clustering Methods
Clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft clustering (a data point can belong to more than one group). Beyond this distinction, various other approaches to clustering exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. Each cluster centre is created in such a way that the data points within a cluster are closer to their own centroid than to the centroid of any other cluster.
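As a concrete illustration, here is a minimal sketch of partitioning clustering using scikit-learn's KMeans; the make_blobs toy data and the choice of three clusters are assumptions made purely for the example.

```python
# Minimal sketch: partitioning a toy dataset with scikit-learn's KMeans.
# The dataset and n_clusters=3 are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # a cluster-ID for each data point

print(kmeans.cluster_centers_)      # the 3 learned centroids
print(labels[:10])                  # cluster assignments of the first points
```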

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily shaped clusters can be formed as long as the dense regions are connected. The algorithm does this by identifying regions of high density in the dataset and connecting them into clusters; the dense areas in the data space are separated from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has varying densities or many dimensions.
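As an illustration, here is a minimal sketch using scikit-learn's DBSCAN, a common density-based algorithm; the two-moons toy data and the eps/min_samples values are assumptions chosen for the example.

```python
# Minimal sketch: density-based clustering with scikit-learn's DBSCAN.
# eps and min_samples define what counts as a "dense" neighbourhood
# and must be tuned per dataset.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)    # label -1 marks noise points in sparse regions

print(set(labels))            # e.g. {0, 1}: two arbitrarily shaped clusters
```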
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming the data comes from a mixture of distributions, most commonly Gaussian distributions.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
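As an illustration, here is a minimal sketch using scikit-learn's GaussianMixture, which is fitted with Expectation-Maximization; the toy data and component count are assumptions for the example.

```python
# Minimal sketch: distribution model-based clustering with a Gaussian
# Mixture Model, fitted by Expectation-Maximization in scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

hard_labels = gmm.predict(X)    # most likely Gaussian for each point
probs = gmm.predict_proba(X)    # probability of belonging to each Gaussian
print(probs[0].round(2))        # e.g. [0.99 0.01 0.  ]
```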

Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is called a dendrogram. Any desired number of clusters can then be obtained by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical clustering algorithm.
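As an illustration, here is a minimal sketch using SciPy, where linkage builds the dendrogram tree and fcluster "cuts" it at a level; the toy data and the ward linkage choice are assumptions.

```python
# Minimal sketch: agglomerative hierarchical clustering with SciPy.
# linkage() builds the full merge tree (the dendrogram); fcluster()
# cuts it to yield a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="ward")                     # the dendrogram tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut tree into 2 clusters
print(labels)                                     # cluster-ID per point
```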
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that give its degree of membership in each cluster. The Fuzzy C-means algorithm is the classic example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
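Since scikit-learn has no fuzzy clustering, here is a minimal from-scratch sketch of Fuzzy C-means in NumPy; the fuzziness exponent m=2, the iteration count, and the toy data are all assumptions.

```python
# Minimal sketch of Fuzzy C-means: every point gets a membership
# coefficient in every cluster rather than a single hard label.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=42):
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(c), size=len(X))           # random memberships
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))                 # closer => higher
        u /= u.sum(axis=1, keepdims=True)                # rows sum to 1
    return centers, u

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
centers, u = fuzzy_c_means(X, c=2)
print(u[0].round(2))    # e.g. [0.93 0.07]: soft membership in both clusters
```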
What is K-Means Algorithm?
K-Means Clustering is an unsupervised learning algorithm that groups an unlabelled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
It is a centroid-based algorithm in which each cluster is associated with a centroid. The main aim of the algorithm is to minimise the sum of distances between each data point and the centroid of its cluster.
The algorithm takes the unlabelled dataset as input, divides it into k clusters, and repeats the process until the clusters stop changing. The value of k must be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:

o Determines the best positions for the K centre points (centroids) through an iterative process.
o Assigns each data point to its closest centroid; the data points near a particular centroid form a cluster.
Hence each cluster contains data points with some commonalities and is kept apart from the other clusters.
The diagram below illustrates the working of the K-means clustering algorithm.

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They may be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, forming the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
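Before the visual walkthrough, here is a minimal from-scratch sketch of these exact steps in NumPy; the function name and the convergence test via unmoved centroids are assumptions, and empty clusters are not handled.

```python
# Minimal sketch of the K-means steps above.
import numpy as np

def k_means(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1-2: K is given; pick K random points from the data as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3/5: assign each point to its closest centroid.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: place each new centroid at the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: if no centroid moved, no reassignment occurs -> FINISH.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels    # Step 7: the model is ready
```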
Let's understand the above steps by considering some visual plots:
Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given below:

o Let's take the number of clusters as K=2, to identify the dataset and put the points into different clusters; that is, we will try to group the data into two different clusters.
o We need to choose K random points or centroids to form the clusters. These points can be points from the dataset or any other points. Here we select the two points below as K points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest centroid. We compute this using the familiar formula for the distance between two points, and then draw a median line between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 (blue) centroid, and the points on the right of the line are closer to the yellow centroid. Let's colour them blue and yellow for clear visualisation.

o As we need to find the closest clusters, we repeat the process by choosing new centroids. To choose the new centroids, we compute the centre of gravity of the points currently assigned to each centroid, and move the centroids there, as below:
o Next, we reassign each data point to its new closest centroid. For this, we repeat the same process of finding a median line. The median line will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line and two blue points are on the right of the line, so these three points will be assigned to the new centroids.

As reassignment has taken place, we again go to Step-4, which is finding new centroids or K-points.
o We repeat the process by finding the centre of gravity of each cluster, so the new centroids will be as shown in the below image:
o With the new centroids, we again draw the median line and reassign the data points. The result is shown below:

o We can see in the above image that no points are left on the wrong side of the line, so no reassignment occurs, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends on the quality of the clusters that it forms, but choosing the optimal number of clusters is a big task. A common method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = \sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster}_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in \mathrm{Cluster}_3} \mathrm{distance}(P_i, C_3)^2

In the above formula of WCSS, the term \sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 is the sum of the squared distances between each data point P_i in Cluster 1 and its centroid C_1; the same holds for the other two terms.
To measure the distance between data points and a centroid, we can use any distance metric, such as the Euclidean or Manhattan distance.
To find the optimal number of clusters, the elbow method follows these steps:
o It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The point of the sharp bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
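As an illustration, here is a minimal sketch of the elbow method with scikit-learn and matplotlib; KMeans exposes the WCSS directly as its inertia_ attribute, and the toy data is an assumption.

```python
# Minimal sketch of the elbow method: plot WCSS for K = 1..10 and look
# for the sharp "elbow" bend in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)        # inertia_ is the WCSS for this K

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                          # the bend (here near K=4) is the elbow
```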
