Lect 4

The document discusses clustering algorithms. It begins by defining clustering as grouping similar data objects together and dissimilar objects in different groups. It then discusses several clustering problems: K-means, K-median, and K-center. K-means aims to partition objects into K clusters by minimizing the distances between objects and their assigned cluster centers. The K-median and K-center problems are defined with analogous distance-minimization goals. Iterative algorithms such as K-means and K-medoids are presented for solving these problems. Factors that affect the performance of these algorithms, such as initialization and outliers, are also summarized.

Clustering

Lecture outline
Distance/Similarity between data objects
Data objects as geometric data points
Clustering problems and algorithms
  K-means
  K-median
  K-center

What is clustering?
A grouping of data objects such that the objects within a group are similar
(or related) to one another and different from (or unrelated to) the objects
in other groups

Intra-cluster distances are minimized
Inter-cluster distances are maximized

Outliers
Outliers are objects that do not belong to any cluster or form clusters of
very small cardinality

In some applications we are interested in discovering outliers, not clusters
(outlier analysis)

Why do we cluster?
Clustering: given a collection of data objects, group them so that
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Clustering results are used:
  As a stand-alone tool to get insight into the data distribution
    Visualization of clusters may unveil important information
  As a preprocessing step for other algorithms
    Efficient indexing or compression often relies on clustering

Applications of clustering?

Image Processing
  Cluster images based on their visual content
Web
  Cluster groups of users based on their access patterns on webpages
  Cluster webpages based on their content
Bioinformatics
  Cluster similar proteins together (similarity w.r.t. chemical structure
  and/or functionality, etc.)
Many more

The clustering task

Group observations into groups so that the observations belonging in the same
group are similar, whereas observations in different groups are different

Basic questions:
  What does similar mean?
  What is a good partition of the objects? I.e., how is the quality of a
  solution measured?
  How do we find a good partition of the observations?

Observations to cluster

Real-valued attributes/variables
  e.g., salary, height
Binary attributes
  e.g., gender (M/F), has_cancer (T/F)
Nominal (categorical) attributes
  e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
Ordinal/ranked attributes
  e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Variables of mixed types
  multiple attributes of various types

Observations to cluster

Usually data objects consist of a set of attributes (also known as
dimensions), e.g., (J. Smith, 20, 200K)
If all d dimensions are real-valued then we can visualize each data object as
a point in a d-dimensional space
If all d dimensions are binary then we can think of each data object as a
binary vector

Distance functions

The distance d(x, y) between two objects x and y is a metric if
  d(i, j) ≥ 0 (non-negativity)
  d(i, i) = 0 (isolation)
  d(i, j) = d(j, i) (symmetry)
  d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality) [Why do we need it?]

The definitions of distance functions are usually different for real, boolean,
categorical, and ordinal variables.
Weights may be associated with different variables based on applications and
data semantics.

Data Structures

Data matrix (n tuples/objects × d attributes/dimensions):

    x_11  ...  x_1f  ...  x_1d
    ...   ...  ...   ...  ...
    x_i1  ...  x_if  ...  x_id
    ...   ...  ...   ...  ...
    x_n1  ...  x_nf  ...  x_nd

Distance matrix (objects × objects; symmetric, so only the lower triangle is
shown):

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  0

Distance functions for binary vectors

Jaccard similarity between binary vectors X and Y:
  JSim(X, Y) = |X ∩ Y| / |X ∪ Y|
Jaccard distance between binary vectors X and Y:
  Jdist(X, Y) = 1 - JSim(X, Y)
Example: JSim = 1/6, Jdist = 5/6
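A minimal Python sketch of these two measures on 0/1 vectors; the example
vectors below are hypothetical, chosen only so that they reproduce the
JSim = 1/6 and Jdist = 5/6 values above.

    def jaccard_similarity(x, y):
        # |X ∩ Y| / |X ∪ Y| for two equal-length 0/1 vectors.
        both = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
        either = sum(1 for xi, yi in zip(x, y) if xi == 1 or yi == 1)
        return both / either if either else 1.0  # convention for two all-zero vectors

    def jaccard_distance(x, y):
        return 1.0 - jaccard_similarity(x, y)

    # Hypothetical example: one common 1, six positions with a 1 overall.
    x = [1, 0, 1, 1, 0, 0, 0]
    y = [0, 1, 1, 0, 1, 1, 0]
    print(jaccard_similarity(x, y))  # 1/6
    print(jaccard_distance(x, y))    # 5/6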

Distance functions for real-valued vectors

Lp norms or Minkowski distance:

  L_p(x, y) = (|x_1 - y_1|^p + |x_2 - y_2|^p + ... + |x_d - y_d|^p)^(1/p)
            = (Σ_{i=1}^{d} |x_i - y_i|^p)^(1/p)

where p is a positive integer

If p = 1, L1 is the Manhattan (or city block) distance:

  L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + ... + |x_d - y_d|
            = Σ_{i=1}^{d} |x_i - y_i|

Distance functions for real-valued vectors

If p = 2, L2 is the Euclidean distance:

  d(x, y) = sqrt(|x_1 - y_1|^2 + |x_2 - y_2|^2 + ... + |x_d - y_d|^2)

Also, one can use weighted distances:

  d(x, y) = sqrt(w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + ... + w_d |x_d - y_d|^2)
  d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + ... + w_d |x_d - y_d|

Very often L_p^p is used instead of L_p (why?)
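A small sketch of these distances in Python (the function names are mine). It
also hints at the question above: dropping the final 1/p root gives L_p^p,
which is cheaper to compute and preserves which of two distances is larger,
which is all a nearest-center assignment needs.

    import math

    def minkowski_distance(x, y, p):
        # L_p (Minkowski) distance between two real-valued vectors.
        return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

    def weighted_euclidean(x, y, w):
        # Weighted L_2 distance, one non-negative weight per dimension.
        return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

    a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski_distance(a, b, 1))  # Manhattan: 7.0
    print(minkowski_distance(a, b, 2))  # Euclidean: 5.0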

Partitioning algorithms: basic concept

Construct a partition of a set of n objects into a set of k clusters
  Each object belongs to exactly one cluster
  The number of clusters k is given in advance

The k-means problem

Given a set X of n points in a d-dimensional space and an integer k
Task: choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to
form clusters {C1, C2, ..., Ck} such that

  Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ Ci} L2(x, ci)^2

is minimized
Some special cases: k = 1, k = n
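A direct translation of this objective into Python (a sketch; the data
layout, lists of point tuples grouped per cluster, is an assumption):

    def kmeans_cost(clusters, centers):
        # Sum of squared L2 distances of every point to its cluster center.
        # clusters: list of lists of points (tuples); centers: matching list of centers.
        cost = 0.0
        for points, c in zip(clusters, centers):
            for x in points:
                cost += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        return cost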

Algorithmic properties of the k-means problem

NP-hard if the dimensionality of the data is at least 2 (d >= 2)
  Finding the best solution in polynomial time is infeasible
For d = 1 the problem is solvable in polynomial time (how?)
A simple iterative algorithm works quite well in practice

The k-means algorithm

One way of solving the k-means problem:
  Randomly pick k cluster centers {c1, ..., ck}
  For each i, set the cluster Ci to be the set of points in X that are closer
  to ci than they are to cj for all j ≠ i
  For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci)
  Repeat until convergence
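A minimal Python sketch of this loop (standard library only; points are
assumed to be tuples of floats, and the helper names are mine). It illustrates
the iteration, not a tuned implementation.

    import random

    def kmeans(X, k, max_iter=100):
        centers = random.sample(X, k)  # random initialization
        clusters = [[] for _ in range(k)]
        for _ in range(max_iter):
            # Assignment step: each point joins the cluster of its closest center.
            clusters = [[] for _ in range(k)]
            for x in X:
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
                clusters[i].append(x)
            # Update step: each center becomes the mean of its cluster.
            new_centers = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
                           for i, c in enumerate(clusters)]
            if new_centers == centers:  # convergence: no center moved
                break
            centers = new_centers
        return centers, clusters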

Properties of the k-means algorithm

Finds a local optimum
Converges often quickly (but not always)
The choice of initial points can have a large influence on the result

Two different k-means clusterings

[Figure: the same set of original points clustered two ways by k-means; one
initialization leads to the optimal clustering, another to a sub-optimal
clustering.]

Discussion of the k-means algorithm

Finds a local optimum
Converges often quickly (but not always)
The choice of initial points can have a large influence
Clusters of different densities
Clusters of different sizes
Outliers can also cause a problem (Example?)

Some alternatives to random initialization of the central points

Multiple runs
  Helps, but probability is not on your side

Select the original set of points by methods other than random, e.g., pick
points that are far apart from each other as the initial cluster centers (the
idea behind the k-means++ algorithm, which favors points far from the centers
chosen so far); a sketch of k-means++ seeding follows below
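For reference, standard k-means++ seeding picks the first center uniformly at
random and each further center with probability proportional to its squared
distance from the nearest center already chosen. A sketch (points are assumed
to be tuples of floats):

    import random

    def kmeans_pp_init(X, k):
        centers = [random.choice(X)]
        while len(centers) < k:
            # Squared distance of every point to its nearest chosen center.
            d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
                  for x in X]
            r = random.uniform(0, sum(d2))
            cumulative = 0.0
            for x, weight in zip(X, d2):
                cumulative += weight
                if cumulative >= r:
                    centers.append(x)
                    break
        return centers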

The k-median problem

Given a set X of n points in a d-dimensional space and an integer k
Task: choose a set of k points {c1, c2, ..., ck} from X and form clusters
{C1, C2, ..., Ck} such that

  Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ Ci} L1(x, ci)

is minimized

The k-medoids algorithm

Also known as PAM (Partitioning Around Medoids, 1987)
  Choose randomly k medoids from the original dataset X
  Assign each of the n - k remaining points in X to its closest medoid
  Iteratively replace one of the medoids by one of the non-medoids if doing so
  improves the total clustering cost
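A compact Python sketch of this swap loop (the brute-force cost re-evaluation
and the function names are mine, chosen for clarity rather than efficiency;
dist is any distance function, e.g., L1):

    import random

    def pam(X, k, dist, max_iter=100):
        medoids = random.sample(X, k)  # k random medoids from the dataset

        def total_cost(meds):
            # Every point is assigned to its closest medoid.
            return sum(min(dist(x, m) for m in meds) for x in X)

        cost = total_cost(medoids)
        for _ in range(max_iter):
            improved = False
            for i in range(k):
                for x in X:
                    if x in medoids:
                        continue
                    candidate = medoids[:i] + [x] + medoids[i + 1:]
                    candidate_cost = total_cost(candidate)
                    if candidate_cost < cost:  # keep the swap only if it helps
                        medoids, cost, improved = candidate, candidate_cost, True
            if not improved:
                break
        return medoids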

Discussion of the PAM algorithm


The algorithm is very similar to the
k-means algorithm
It has the same advantages and
disadvantages
How about efficiency?

CLARA (Clustering LARge Applications)

It draws multiple samples of the data set, applies PAM on each sample, and
gives the best of the resulting clusterings as the output
Strength: deals with larger data sets than PAM
Weaknesses:
  Efficiency depends on the sample size
  A good clustering based on samples will not necessarily represent a good
  clustering of the whole data set if the sample is biased

The k-center problem

Given a set X of n points in a d-dimensional space and an integer k
Task: choose a set of k points from X as cluster centers {c1, c2, ..., ck}
such that for the clusters {C1, C2, ..., Ck}

  R(C) = max_j max_{x ∈ Cj} d(x, cj)

is minimized

Algorithmic properties of the k-center problem

NP-hard if the dimensionality of the data is at least 2 (d >= 2)
  Finding the best solution in polynomial time is infeasible
For d = 1 the problem is solvable in polynomial time (how?)
A simple combinatorial algorithm works well in practice

The farthest-first traversal algorithm

Pick any data point and label it as point 1
For i = 2, 3, ..., n
  Find the unlabelled point furthest from {1, 2, ..., i-1} and label it as i
  // Use d(x, S) = min_{y ∈ S} d(x, y) to measure the distance of a point
  // from a set
  π(i) = argmin_{j < i} d(i, j)
  Ri = d(i, π(i))
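A sketch of the traversal stopped after k centers, which is how it is used for
the k-center problem (the starting point X[0] and the function names are
assumptions; dist is any metric):

    def farthest_first(X, k, dist):
        centers = [X[0]]  # point 1: any data point
        while len(centers) < k:
            # d(x, S) = min over chosen centers; pick the point maximizing it.
            next_center = max(X, key=lambda x: min(dist(x, c) for c in centers))
            centers.append(next_center)
        return centers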

The farthest-first traversal is a 2-approximation algorithm

Claim 1: R1 ≥ R2 ≥ ... ≥ Rn
Proof:
  Rj = d(j, π(j)) = d(j, {1, 2, ..., j-1})
     ≤ d(j, {1, 2, ..., i-1})        // j > i, so {1, ..., i-1} ⊆ {1, ..., j-1}
     ≤ d(i, {1, 2, ..., i-1}) = Ri   // i was the farthest unlabelled point
                                     // from {1, ..., i-1} when it was picked

The farthest-first traversal is a 2-approximation algorithm

Claim 2: If C is the clustering reported by the farthest-first algorithm, then
R(C) = Rk+1
Proof:
  For all i > k we have that
  d(i, {1, 2, ..., k}) ≤ d(k+1, {1, 2, ..., k}) = Rk+1

The farthest-first traversal is a 2-approximation algorithm

Theorem: If C is the clustering reported by the farthest-first algorithm and
C* is the optimal clustering, then R(C) ≤ 2 R(C*)
Proof:
  Let C*1, C*2, ..., C*k be the clusters of the optimal k-clustering.
  If each of these clusters contains one of the points {1, ..., k}, then every
  point is within distance 2 R(C*) of one of the chosen centers (triangle
  inequality), so R(C) ≤ 2 R(C*).
  Otherwise, one of these clusters contains two or more of the points in
  {1, ..., k}. These points are at distance at least Rk ≥ Rk+1 = R(C) from
  each other, so by the triangle inequality that cluster must have radius at
  least R(C)/2, i.e., R(C) ≤ 2 R(C*).

What is the right number of clusters?

...or who sets the value of k?
For n points to be clustered, consider the case where k = n. What is the value
of the error function?
What happens when k = 1?
Since we want to minimize the error, why don't we always select k = n?

Occam's razor and the minimum description length principle

Clustering provides a description of the data
For a description to be good it has to be:
  Not too general
  Not too specific
Penalize for every extra parameter that one has to pay for
  Penalize the number of bits you need to describe the extra parameter
So for a clustering C, extend the cost function as follows:
  NewCost(C) = Cost(C) + |C| × log n
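A small numeric sketch of this penalized cost (the base-2 logarithm and the
cost values are hypothetical assumptions, just to show the penalty at work):

    import math

    def new_cost(cost, num_clusters, n):
        # NewCost(C) = Cost(C) + |C| * log n, with log base 2 (bits) assumed here.
        return cost + num_clusters * math.log2(n)

    # With n = 1000 points, each extra cluster must reduce the raw cost by about
    # log2(1000) ≈ 10 to be worth describing.
    print(new_cost(150.0, 5, 1000))   # ≈ 199.8
    print(new_cost(140.0, 10, 1000))  # ≈ 239.7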
