K-Means Clustering
Data is essential for data science (as if the name isn’t suggestive enough).
With tons of data being generated every millisecond, it’s no surprise that
most of this data is unlabeled. But that’s okay, because there are
different techniques available to make do with unlabeled datasets. In
fact, there’s an entire domain of Machine Learning called “Unsupervised
Learning” that deals with unlabeled data.
Sometimes we just want to see how the data is organized, and that’s
where clustering comes into play. Even though it’s mostly used for
unlabeled data, clustering works just fine for labeled data as well. The word
‘clustering’ means grouping similar things together. The most
commonly used clustering method is K-Means (because of its
simplicity).
This post explains how K-Means Clustering works (in depth), how to
measure the quality of clusters, how to choose the optimal value of K, and
mentions other clustering algorithms.
The Concept
Imagine you’re opening a small book store. You have a stack of different
books, and 3 bookshelves. Your goal is to place similar books on the same shelf.
What you would do is pick up 3 books, one for each shelf, in order to set
a theme for each shelf. These books will now dictate which of the
remaining books will go on which shelf.
Every time you pick a new book up from the stack, you would compare
it with those first 3 books, and place this new book on the shelf that has
similar books. You would repeat this process until all the books have
been placed.
Once you’re done, you might notice that changing the number of
bookshelves, and picking up different initial books for those shelves
(changing the theme for each shelf), would improve how well you’ve
grouped the books. So, you repeat the process in hopes of a better
outcome.
Figure: The K-Means algorithm
The Algorithm
You don’t have to start with 3 clusters; 2 or 3 is generally a
good place to start, and you can adjust the number later on.
1. Initializing Centroids
As a starting point, you tell your model how many clusters it should
make. First, the model picks K (let K = 3) data points from the
dataset. These data points are called cluster centroids.
Now, there are different ways to initialize the centroids: you can
either choose them at random, or sort the dataset, split it into K
portions, and pick one data point from each portion as a centroid.
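As a rough sketch of random initialization (using NumPy and a hypothetical 2-D dataset X, standing in for your own data), you can simply sample K distinct rows as the starting centroids:

```python
import numpy as np

# Hypothetical dataset: 200 points in 2 dimensions (replace with your own data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))

K = 3  # number of clusters

# Random initialization: pick K distinct rows of X as the initial centroids.
initial_idx = rng.choice(len(X), size=K, replace=False)
centroids = X[initial_idx]
print(centroids)
```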
2. Assigning Clusters
From here onwards, the model performs calculations on its own and
assigns a cluster to each data point. The model calculates the
distance between a data point and all the centroids, and the point is assigned
to the cluster with the nearest centroid. Again, there are different ways
you can calculate this distance, all having their pros and cons. Usually
we use the L2 distance.
The L2 (Euclidean) distance between a centroid and a data point is the
square root of the sum of the squared differences of their coordinates.
Every time the data points have been assigned to clusters, the following
steps are performed.
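As a minimal sketch of the assignment step (continuing with the hypothetical X and centroids from the snippet above), the L2 distances and nearest-centroid assignments can be computed with NumPy:

```python
import numpy as np

def assign_clusters(X, centroids):
    """Assign each data point to the cluster whose centroid is nearest (L2 distance)."""
    # diffs[i, j, :] = coordinate-wise difference between point i and centroid j
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # distances[i, j] = L2 distance
    return distances.argmin(axis=1)                # index of the nearest centroid per point

labels = assign_clusters(X, centroids)
```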
3. Updating Centroids
Because the initial centroids were chosen arbitrarily, your model then
updates them with new values. The new centroid value might or might not
occur in the dataset; in fact, it would be a coincidence if it did. This is
because the updated cluster centroid is the average, or mean, value of
all the data points within that cluster.
Now, if some other algorithm, like K-Mode or K-Median, were used, the
mode or the median would be taken instead of the average,
respectively.
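A minimal sketch of the update step (again assuming the hypothetical X, labels, and K from the snippets above): each centroid is recomputed as the mean of the points currently assigned to it.

```python
import numpy as np

def update_centroids(X, labels, K):
    """Recompute each centroid as the mean of the points assigned to its cluster."""
    # Note: in practice an empty cluster needs special handling (e.g. re-seeding its centroid).
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])

centroids = update_centroids(X, labels, K)
```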
4. Stopping Criterion
Like everything else, there are different ways to set the stopping
criterion. You can even set multiple conditions that, if met, would stop
the iteration and return the results. Some of the stopping conditions are:
1. The data points assigned to each cluster remain the same (this can
take too much time)
2. The centroids remain the same between iterations (this can also be
time-consuming)
3. A fixed (maximum) number of iterations has been reached
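Putting the steps together, here is a bare-bones end-to-end sketch (reusing the hypothetical assign_clusters and update_centroids helpers from above) that stops when the assignments no longer change or a maximum number of iterations is reached:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """A minimal K-Means loop: initialize, then alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # step 1: initialize
    labels = np.full(len(X), -1)

    for _ in range(max_iters):                      # stopping condition: max iterations
        new_labels = assign_clusters(X, centroids)  # step 2: assign to nearest centroid
        if np.array_equal(new_labels, labels):      # stopping condition: assignments unchanged
            break
        labels = new_labels
        centroids = update_centroids(X, labels, K)  # step 3: recompute centroids

    return centroids, labels

centroids, labels = kmeans(X, K=3)
```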
Measuring Cluster Quality
The goal here isn’t just to make clusters, but to make good, meaningful
clusters. Quality clustering is when the data points within a cluster are
close together, and far apart from the data points in other clusters.
The two methods to measure the cluster quality are described below:
1. Inertia: Intuitively, inertia tells how far apart the points within a
cluster are; it is the sum of squared distances of the data points from
their cluster’s centroid. Therefore, a small value of inertia is aimed
for. Inertia ranges from zero upwards.
2. Silhouette score: The silhouette score tells how far away the data
points in one cluster are from the data points in another cluster.
It ranges from -1 to 1, and the score should be closer to 1 than
to -1.
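As a quick sketch with scikit-learn (assuming the hypothetical 2-D array X from earlier), both quantities are readily available after fitting:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Inertia:", km.inertia_)                         # sum of squared distances to closest centroid
print("Silhouette:", silhouette_score(X, km.labels_))  # in [-1, 1]; closer to 1 is better
```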
How many clusters?
You have to specify the number of clusters you want to make. There are
a few methods available to choose the optimal value of K. The most direct
method is to just plot the data points and see if that gives you a hint. As you
can see in the figure below, making 3 clusters seems like a good choice.
Another method is to use the value of inertia. The idea behind good
clustering is having a small value of inertia and a small number of
clusters.
The value of inertia decreases as the number of clusters increases. So,
there’s a trade-off here. Rule of thumb: the elbow point in the inertia graph is
a good choice, because after it the change in the value of inertia isn’t
significant.
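A minimal sketch of the elbow method with scikit-learn and matplotlib (again assuming the hypothetical X): fit K-Means for a range of K values, record the inertia, and look for the bend in the curve.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```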
When you’ve formed a cluster, you give it a name, and all the data points
in that cluster are assigned this name as their label. Now your dataset
has labels! You can perform testing using these labels. To find insights
about your data, you can look at what the data points within a
cluster have in common, and how they differ from other clusters.
Assigning a cluster to a new data point
Once you’ve finalized your model, it can now assign a cluster to a new
data point. The method of assigning a cluster remains the same, i.e.,
assigning it to the cluster with the closest centroid.
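With the scikit-learn model sketched above, this is a one-liner (the new point below is just a hypothetical 2-D example matching the toy dataset):

```python
import numpy as np

new_point = np.array([[0.5, -1.2]])   # hypothetical new observation
print(km.predict(new_point))          # index of the nearest centroid, i.e. the assigned cluster
```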
⚠️ Warning!