02 01 KMeans
Ali Sharifi-Zarchi
CE Department
Sharif University of Technology
CE Department (Sharif University of Technology) Machine Learning (CE 40477) October 15, 2024 1 / 62
Unsupervised Learning Overview K-Means Challenges in K-Means Other Clustering Algorithms
2 K-Means
3 Challenges in K-Means
Unsupervised Learning
Clustering
• When you like a song, the system suggests others from the same cluster.
• Clustering can decipher hidden patterns in gene expression data, which can help
in understanding disease mechanisms or genetic variations.
K-Means overview
K-Means in action
Algorithm
Problem definition
Objective Function
$$J = \sum_{j=1}^{K} \sum_{x^{(i)} \in C_j} \| x^{(i)} - \mu_j \|^2$$
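As a rough illustration of the objective, the sketch below evaluates J for a given assignment. The function names, the toy points, and the representation of clusters as a list of integer labels are assumptions made for this example, not part of the slides.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_objective(points, labels, centroids):
    """J: sum over all samples of the squared distance to their assigned centroid."""
    return sum(squared_distance(x, centroids[j]) for x, j in zip(points, labels))


# Toy data (hypothetical): two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]                    # labels[i] = index j of the cluster C_j containing x^(i)
centroids = [(0.0, 1.0), (10.0, 1.0)]    # the mean of each cluster

J = kmeans_objective(points, labels, centroids)
print(J)  # every point is exactly 1 unit from its centroid, so J = 4.0
```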
Convergence
• Keep each sample’s assignment fixed until a closer centroid is found.
• Each time a sample is reassigned, the total squared distance between samples and their
centroids decreases.
• The number of possible sample-to-centroid assignments is finite.
• The algorithm terminates when no sample changes its assigned centroid.
• In the update step, with f(x) fixed, J is a quadratic function of µj (like SSE), and we can
minimize it by setting its derivative to zero:
$$\frac{\partial J}{\partial \mu_j} = 0 \implies 2 \sum_{x^{(i)} \in C_j} \left( x^{(i)} - \mu_j \right) = 0$$
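Solving this condition for the centroid makes the minimizer explicit. The step below is a short worked derivation, assuming the cluster is non-empty and writing |C_j| for its number of samples (a notation introduced here for illustration):

```latex
\sum_{x^{(i)} \in C_j} \left( x^{(i)} - \mu_j \right) = 0
\;\Longrightarrow\;
\sum_{x^{(i)} \in C_j} x^{(i)} - |C_j| \, \mu_j = 0
\;\Longrightarrow\;
\mu_j = \frac{1}{|C_j|} \sum_{x^{(i)} \in C_j} x^{(i)}
```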
• For each cluster, the mean of its samples minimizes squared distances.
• For Cj, if µ′j was the old centroid, we have:
$$\sum_{x^{(i)} \in C_j} \| x^{(i)} - \mu'_j \|^2 \;\ge\; \sum_{x^{(i)} \in C_j} \| x^{(i)} - \mu_j \|^2$$
So Jnew ≤ Jold.
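The convergence argument can be checked numerically. The following is a minimal sketch of the assignment and update steps that records J after every iteration; the toy points, the hand-picked initial centroids, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    """Return (labels, centroids, history), where history[t] is J after iteration t."""
    centroids = list(initial_centroids)
    labels = None
    history = []
    for _ in range(max_iter):
        # Assignment step: each sample moves to its closest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        if new_labels == labels:
            break  # no sample changed its centroid, so the algorithm terminates
        labels = new_labels
        # Update step: each centroid becomes the mean of its samples.
        for j in range(len(centroids)):
            members = [x for x, l in zip(points, labels) if l == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
        history.append(
            sum(squared_distance(x, centroids[l]) for x, l in zip(points, labels))
        )
    return labels, centroids, history


points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, centroids, history = kmeans(points, [points[0], points[3]])
print(history)  # J never increases from one iteration to the next
```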
Strengths
Initialization
Local Optimum
• The algorithm finds a local minimum, but there is no guarantee that it finds the global
minimum.
• Its result is highly affected by the initialization.
• Some suggestions are:
• Multiple runs with random initial centroids, then select the "best" result.
• Initialization heuristics (K-Means++, farthest-first traversal).
• Initializing with the suggested results of another method.
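The first suggestion can be sketched as follows. This example assumes that "best" means the run with the lowest objective J, and that initial centroids are drawn at random from the data points; the helper names, the seed, and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def run_kmeans(points, init, max_iter=100):
    """Plain K-Means from the given initial centroids; returns (J, labels, centroids)."""
    centroids, labels = list(init), None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [x for x, l in zip(points, labels) if l == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    J = sum(sq_dist(x, centroids[l]) for x, l in zip(points, labels))
    return J, labels, centroids


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-Means from several random initializations and keep the lowest-J result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        init = rng.sample(points, k)  # k distinct data points as initial centroids
        result = run_kmeans(points, init)
        if best is None or result[0] < best[0]:
            best = result
    return best


points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.3, 4.8), (4.9, 5.4),
          (10.0, 0.0), (10.2, 0.5), (9.7, 0.1)]
best_J, best_labels, best_centroids = best_of_restarts(points, k=3)
print(best_J, best_labels)
```

Restarts reduce the dependence on any single initialization but still do not guarantee the global minimum.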
Definition of Mean
• We assume $x^{(i)} \in \mathbb{R}^d$, which is not always the case. K-Means requires a space where
the sample mean is defined.
• Counterexample: categorical data.
• A suggested solution: K-Modes, where the centroid is the most frequent category (the mode)
in each cluster.
• The closest centroid is found by the Hamming distance.
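A minimal sketch of the two ingredients for categorical data is shown below. It assumes each record is a tuple of categorical attributes and that the mode is taken separately for each attribute; the records and function names are made up for this example.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_centroid(members):
    """Per-attribute most frequent category among the cluster's records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))


def assign(records, centroids):
    """Index of the closest centroid (by Hamming distance) for every record."""
    return [
        min(range(len(centroids)), key=lambda j: hamming(r, centroids[j]))
        for r in records
    ]


# Hypothetical records: (color, size)
records = [("red", "small"), ("red", "medium"), ("red", "small"),
           ("blue", "large"), ("blue", "large"), ("green", "large")]
centroids = [records[0], records[3]]

labels = assign(records, centroids)
centroids = [
    mode_centroid([r for r, l in zip(records, labels) if l == j])
    for j in range(len(centroids))
]
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [('red', 'small'), ('blue', 'large')]
```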
Adopted from slides of Dr. Soleymani, Modern Information Retrieval Course, Sharif University of Technology.
Clustering Evaluation
$$\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2$$
• Inter-cluster separation (isolation): How different the data points are between
clusters.
• Single-link (Minimum Distance):
• Measures the minimum distance between any two points from different clusters.
• Centroid-link (closely related to Ward's method, which weights the squared centroid distance by the cluster sizes):
• Measures the distance between the centroids of two clusters.
• Average-link:
• Measures the average distance between all pairs of points from different clusters.
$$d_{\text{average}}(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
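The three separation measures can be compared on a small example. This sketch assumes d is the Euclidean distance and that clusters are given as lists of coordinate tuples; the function names and data are illustrative only.

```python
import math
from itertools import product


def single_link(A, B):
    """Minimum distance between any point of A and any point of B."""
    return min(math.dist(x, y) for x, y in product(A, B))


def centroid_link(A, B):
    """Distance between the centroids (means) of A and B."""
    centroid = lambda C: tuple(sum(c) / len(C) for c in zip(*C))
    return math.dist(centroid(A), centroid(B))


def average_link(A, B):
    """Average distance over all pairs (x, y) with x in A and y in B."""
    return sum(math.dist(x, y) for x, y in product(A, B)) / (len(A) * len(B))


A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (3.0, 1.0)]
print(single_link(A, B))    # 3.0
print(centroid_link(A, B))  # 3.0
print(average_link(A, B))   # (3 + 3 + 2 * sqrt(10)) / 4, about 3.08
```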
Adopted from medium.com
$$S(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$
• where:
• a(i) is the average distance between i and all other points in the same cluster.
• b(i) is the average distance between i and points in the nearest neighboring cluster.
• Interpretation:
• S(i) ∈ [−1, 1]
• S(i) ≈ 1 : Well-clustered.
• S(i) ≈ 0 : On or near the decision boundary between clusters.
• S(i) ≈ −1 : Misclustered.
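The score S(i) defined above (commonly called the silhouette coefficient) can be computed directly from its definition. The sketch below assumes Euclidean distance, at least two clusters, and the convention that a point alone in its cluster gets S(i) = 0; the data and names are hypothetical.

```python
import math


def silhouette(i, points, labels):
    """S(i) = (b(i) - a(i)) / max(a(i), b(i)) for the sample with index i."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for k, (p, l) in enumerate(zip(points, labels))
            if l == own and k != i]
    if not same:
        return 0.0  # assumption: singleton clusters are scored 0
    a = sum(same) / len(same)
    # b(i): smallest average distance from i to the points of any other cluster.
    b = min(
        sum(math.dist(points[i], p) for p, l in zip(points, labels) if l == c)
        / sum(1 for l in labels if l == c)
        for c in set(labels) if c != own
    )
    return (b - a) / max(a, b)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(0, points, [0, 0, 1, 1])  # a = 1, b = 10.5, so S is about 0.905
bad = silhouette(1, points, [0, 1, 1, 1])   # a = 9.5, b = 1, so S is about -0.895
print(good, bad)
```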
• There is a trade-off between having better focus within each cluster or having too
many clusters.
• Don’t want one-element clusters.
• Optimization problem: penalize having too many clusters
Outliers
Data Distribution
• Hard Clustering (Partitional)
• Soft Clustering (Bayesian): Each sample is assigned to different clusters with
probabilities, rather than with {0, 1} memberships.
• Each data point belongs to each cluster with a probability.
Hierarchical Clustering
• Hierarchical algorithms find successive clusters using previously established
clusters. Two Types:
• Agglomerative (bottom-up): Start with individual points and merge clusters.
• Divisive (top-down): Start with all points and split clusters.
Result: A hierarchy of clusters represented by a dendrogram.
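The agglomerative (bottom-up) variant can be sketched in a few lines. This is a naive version that assumes single-link (minimum) distance as the merge criterion and stops at a requested number of clusters; both choices, along with the function name and the data, are assumptions for illustration.

```python
import math
from itertools import combinations, product


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history,
    which is the information a dendrogram would display.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []

    def single_link(a, b):
        return min(math.dist(points[i], points[j])
                   for i, j in product(clusters[a], clusters[b]))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: single_link(*pair))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)  # [[0, 1, 2, 3], [4]]
```

Recomputing all pairwise cluster distances at every merge keeps the sketch short but is expensive; it is not meant as an efficient implementation.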
Hierarchical Algorithms
• Advantages:
• No need to specify the number of clusters.
• Produces a dendrogram for visualization.
• Works with arbitrary-shaped clusters.
• Disadvantages:
• High computational cost.
• Sensitive to noise and outliers.
• Greedy: cannot undo merges.
DBSCAN
Definitions:
• A point xi is a core point if:
• A point is a border point if it is within distance ϵ of a core point, but not itself a core
point.
Algorithm Steps:
1 For each unvisited point xi :
• Mark xi as visited.
• Find all points within distance ϵ (neighborhood).
2 If xi is a core point:
• Create a new cluster and expand it by recursively adding all reachable core and
border points.
3 If xi is not a core point:
• Label it as noise if it does not belong to any cluster.
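The steps above can be sketched as follows. This example assumes the usual core-point test (at least min_pts points, counting the point itself, within distance eps), uses an explicit stack instead of recursion, and marks noise with the label -1; the parameter names and the toy data are assumptions.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # tentative: may later become a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        stack = list(nbrs)
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable non-core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # only core points keep expanding
                stack.extend(j_nbrs)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```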
Advantages of DBSCAN
Limitations of DBSCAN
Clustering Algorithms
• Each algorithm is suited for different kinds of patterns and information in data.
Contributions