Week 10 Lecture - Introduction to Clustering
Uploaded by Sujal Shrestha

Introduction to Clustering

CMP4294 – INTRODUCTION TO AI
DR MARIAM ADEDOYIN-OLOWE
Data Analysis Techniques

• Predictive Analytics: Classification, Prediction
• Descriptive Analytics: Clustering, Association analysis
Data Analysis Techniques

• Supervised learning: Classification, Prediction
• Unsupervised learning: Clustering, Association analysis
Types of learning techniques in data mining
1. Supervised learning:
• Learn to predict an output when given an input vector.
• Binary classification: given x, find y in {1, -1}.
• Training data includes the desired outputs.

https://www.sarahmestiri.com/index.php/category/technology/
Types of learning techniques in data mining
2. Unsupervised learning:
• Discover a good internal representation of the data.
• Training data does not include desired outputs.
• Clustering: partition the data into clusters based on their similarity.
Given a cloud of data points, we want to understand its structure.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
What is Clustering?
• Clustering is the process of grouping similar objects together so that similarity among members of the same group is maximised, and similarity between members of different groups is minimised (they are dissimilar).
• Similarity is defined using a distance measure (e.g., Euclidean distance).
(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.)
Why cluster?

• Labeling is expensive
• Gain insight into the structure of the data
• Find prototypes in the data
Goal of Clustering
• Given a set of data points, each described by a set of attributes, find clusters such that:
– Inter-cluster similarity is minimized
– Intra-cluster similarity is maximized
• Requires the definition of a similarity measure

(Figure: points plotted against features F1 and F2, forming two groups.)
What is a natural grouping of these objects?
Slide from Eamonn Keogh

Clustering is subjective: the same objects can be grouped as Simpson's Family, School Employees, Females, or Males.
What is Similarity?
Slide based on one by Eamonn Keogh

Similarity is hard to define, but "we know it when we see it".
Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• k-means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
Partitioning Algorithms

• Partitioning method: construct a partition of n documents into a set of K clusters

• Given: a set of documents and the number K

• Find: a partition into K clusters that optimizes the chosen partitioning criterion

See also Kleinberg NIPS 2002 – impossibility result for natural clustering


Euclidean distance
• The Euclidean distance between two points measures the length of the segment connecting the two points.

• It is the most common way of representing the distance between two points.
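As a minimal sketch (not part of the original slides), Euclidean distance between two feature vectors can be computed directly from its definition:

```python
import math

def euclidean_distance(p, q):
    # Length of the segment connecting p and q: the square root of
    # the sum of squared per-coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```

Python 3.8+ also provides this as `math.dist(p, q)`.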
Example (worked figure not reproduced)
Why Clustering?
• Clustering is useful when we have a huge amount of varied data.
• Clustering reduces the complexity of the data by deriving a reduced representation of it (summary).
• Clustering helps us find new insights into the structure of the data (discovery).
Examples of Clustering Problems
Assume we have a dataset containing information about people in different occupations, their countries, ages, education, family sizes, the stores where they shop, and the products they purchase in those stores.

http://www.richardafolabi.com/blog/data-analysis/understanding-clustering-for-machine-learning.html
Types of Clustering
1. Partitional Clustering: dividing data objects into non-overlapping subsets (clusters), such that each data object is in exactly one subset (e.g., k-means clustering).

(Figure: original points and a partitional clustering of them.)
Types of Clustering
2. Hierarchical Clustering: a set of nested clusters that are organised as a tree (e.g., agglomerative and divisive clustering).

(Figure: nested clusters C1–C5 and the corresponding dendrogram.)
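The bottom-up (agglomerative) variant can be sketched in a few lines: start with each point in its own cluster and repeatedly merge the two closest clusters. This sketch (my own illustration, not from the slides) uses single linkage, i.e. the distance between two clusters is the distance between their closest members:

```python
import math

def agglomerative(points, k):
    """Agglomerative clustering: begin with one cluster per point,
    then merge the two closest clusters (single linkage) until
    only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge j into i
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(agglomerative(pts, 2))  # two clusters: the three near points, the two far ones
```

Cutting the merge tree at different depths yields the nested clusterings shown in the dendrogram.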
Output of a Clustering Session
• Instance assignment: each instance is assigned to a cluster
(group), or in some methods, some instances are considered
outliers (instances that do not belong to any cluster).

• Cluster statistics:
• Centroids: the centre of each cluster (the average of each
feature of all instances that belong to the same cluster).
• Size: the number of instances that belong to the cluster.
• Variation: the variance or standard deviation of the instances that
belong to each cluster.
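The cluster statistics above are straightforward to compute once instances have been assigned. A small sketch (illustrative, not from the slides) for one cluster of feature vectors:

```python
import statistics

def cluster_stats(cluster):
    """Return (size, centroid, spread) for a list of feature vectors
    assigned to the same cluster."""
    size = len(cluster)
    # Centroid: the per-feature mean over all members.
    centroid = tuple(sum(dim) / size for dim in zip(*cluster))
    # Variation: the per-feature (population) standard deviation.
    spread = tuple(statistics.pstdev(dim) for dim in zip(*cluster))
    return size, centroid, spread

print(cluster_stats([(1, 2), (3, 4), (5, 6)]))
# size 3, centroid (3.0, 4.0), plus the per-feature standard deviation
```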
k-Means
https://www.youtube.com/watch?v=4R8nWDh-wA0
k-means Clustering

• One of the simplest and most common clustering algorithms.

• It partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

• This results in a partitioning of the data space into Voronoi cells.

https://en.wikipedia.org/wiki/K-means_clustering
Example: Assigning Clusters
(Figure: data points (x) and centroids; clusters after round 1.)

Example: Assigning Clusters
(Figure: clusters after round 2.)

Example: Assigning Clusters
(Figure: clusters at the end.)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
k-means Algorithm
1. Choose k, the number of clusters, in advance.

2. Select k points at random as cluster centres.

3. Assign each object to its closest cluster centre according to the Euclidean distance function.

4. Recalculate the centroid (mean) of all objects in each cluster.

5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
https://www.saedsayad.com/clustering_kmeans.htm
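The steps above can be sketched in plain Python (an illustrative implementation, not the one used in the lecture):

```python
import math
import random

def kmeans(points, k, seed=0):
    """k-means: random initial centres, then alternate assignment and
    centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)              # step 2: random centres
    assignment = None
    while True:
        # Step 3: assign each point to its nearest centre (Euclidean).
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centres[i]))
            for p in points
        ]
        if new_assignment == assignment:         # step 5: converged
            return centres, assignment
        assignment = new_assignment
        # Step 4: move each centre to the mean of its assigned points.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centres[i] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))

pts = [(1, 1), (1.5, 2), (1, 0), (8, 8), (9, 9), (8, 9)]
centres, labels = kmeans(pts, 2)
```

Because the initial centres are random, different seeds can converge to different local optima, which is one of the weaknesses discussed below.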
Getting the k right
• k < total number of points
• Try different values of k, looking at the change in the average distance to the centroid as k increases
• The average falls rapidly until the right k is reached, then changes little

WSS(C) = Σ_{i=1..k} Σ_{x ∈ C_i} dist(x, c_i)²

WCSS (Within-Cluster Sum of Squares), i.e. the sum of the squared distances between the points in a cluster and the cluster centroid.
What are the weaknesses of the k-means algorithm?
k-means Properties

• Strengths
• Simple and easy to implement
• Quite efficient
• Weaknesses
• Need to specify the value of k, but we may not know the right value beforehand
• Sensitive to the choice of the initial k centroids: the result can be non-deterministic
• Sensitive to noise
• Initialization
• Initial centroids are often chosen randomly.
• The clusters produced can vary from one run to another.
Based on Prof Mohamed Gaber's slides, BCU
Summary

• Clustering is the process of grouping similar objects together

• In clustering, the goal is to maximize intra-cluster similarity and minimize inter-cluster similarity
