Lect 4

The document discusses clustering algorithms. It begins by defining clustering as grouping similar data objects together and dissimilar objects in different groups. It then discusses several clustering problems: K-means, K-median, and K-center. K-means aims to partition objects into K clusters by minimizing the distances between objects and their assigned cluster centers. The K-median and K-center problems are defined with analogous distance-minimization goals. Iterative algorithms such as K-means and K-medoids are presented for solving these problems. Factors that affect the performance of these algorithms, such as initialization and outliers, are also summarized.

Clustering

Lecture outline
Distance/Similarity between data objects
Data objects as geometric data points
Clustering problems and algorithms
  K-means
  K-median
  K-center

What is clustering?
A grouping of data objects such that the objects within a group are similar
(or related) to one another and different from (or unrelated to) the objects
in other groups

Intra-cluster distances are minimized
Inter-cluster distances are maximized

Outliers
Outliers are objects that do not belong to any cluster or form clusters of
very small cardinality

In some applications we are interested in discovering outliers, not clusters
(outlier analysis)

Why do we cluster?
Clustering: given a collection of data objects, group them so that
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Clustering results are used:
  As a stand-alone tool to get insight into the data distribution
    Visualization of clusters may unveil important information
  As a preprocessing step for other algorithms
    Efficient indexing or compression often relies on clustering

Applications of clustering?

Image Processing
  Cluster images based on their visual content
Web
  Cluster groups of users based on their access patterns on webpages
  Cluster webpages based on their content
Bioinformatics
  Cluster similar proteins together (similarity w.r.t. chemical structure
  and/or functionality, etc.)
Many more

The clustering task

Group observations into groups so that the observations belonging in the same
group are similar, whereas observations in different groups are different

Basic questions:
  What does similar mean?
  What is a good partition of the objects? I.e., how is the quality of a
  solution measured?
  How do we find a good partition of the observations?

Observations to cluster

Real-valued attributes/variables
  e.g., salary, height
Binary attributes
  e.g., gender (M/F), has_cancer (T/F)
Nominal (categorical) attributes
  e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
Ordinal/ranked attributes
  e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Variables of mixed types
  multiple attributes of various types

Observations to cluster

Usually data objects consist of a set of attributes (also known as
dimensions), e.g., (J. Smith, 20, 200K)
If all d dimensions are real-valued then we can visualize each data object as
a point in a d-dimensional space
If all d dimensions are binary then we can think of each data object as a
binary vector

Distance functions

The distance d(x, y) between two objects x and y is a metric if
  d(i, j) ≥ 0 (non-negativity)
  d(i, i) = 0 (isolation)
  d(i, j) = d(j, i) (symmetry)
  d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality) [Why do we need it?]

The definitions of distance functions are usually different for real, boolean,
categorical, and ordinal variables.
Weights may be associated with different variables based on applications and
data semantics.

Data Structures

Data matrix (n tuples/objects × d attributes/dimensions):

    x_11  ...  x_1f  ...  x_1d
    ...   ...  ...   ...  ...
    x_i1  ...  x_if  ...  x_id
    ...   ...  ...   ...  ...
    x_n1  ...  x_nf  ...  x_nd

Distance matrix (objects × objects; symmetric, so only the lower triangle is
shown):

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  0

Distance functions for binary vectors

Jaccard similarity between binary vectors X and Y:
  JSim(X, Y) = |X ∩ Y| / |X ∪ Y|
Jaccard distance between binary vectors X and Y:
  Jdist(X, Y) = 1 - JSim(X, Y)
Example: JSim = 1/6, Jdist = 5/6
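A minimal Python sketch of these two measures on 0/1 vectors; the example
vectors below are hypothetical, chosen only so that they reproduce the
JSim = 1/6 and Jdist = 5/6 values above.

    def jaccard_similarity(x, y):
        # |X ∩ Y| / |X ∪ Y| for two equal-length 0/1 vectors.
        both = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
        either = sum(1 for xi, yi in zip(x, y) if xi == 1 or yi == 1)
        return both / either if either else 1.0  # convention for two all-zero vectors

    def jaccard_distance(x, y):
        return 1.0 - jaccard_similarity(x, y)

    # Hypothetical example: one common 1, six positions with a 1 overall.
    x = [1, 0, 1, 1, 0, 0, 0]
    y = [0, 1, 1, 0, 1, 1, 0]
    print(jaccard_similarity(x, y))  # 1/6
    print(jaccard_distance(x, y))    # 5/6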

Distance functions for real-valued vectors

Lp norms or Minkowski distance:

  L_p(x, y) = (|x_1 - y_1|^p + |x_2 - y_2|^p + ... + |x_d - y_d|^p)^(1/p)
            = (Σ_{i=1}^{d} |x_i - y_i|^p)^(1/p)

where p is a positive integer

If p = 1, L1 is the Manhattan (or city block) distance:

  L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + ... + |x_d - y_d|
            = Σ_{i=1}^{d} |x_i - y_i|

Distance functions for real-valued vectors

If p = 2, L2 is the Euclidean distance:

  d(x, y) = sqrt(|x_1 - y_1|^2 + |x_2 - y_2|^2 + ... + |x_d - y_d|^2)

Also, one can use weighted distances:

  d(x, y) = sqrt(w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + ... + w_d |x_d - y_d|^2)
  d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + ... + w_d |x_d - y_d|

Very often L_p^p is used instead of L_p (why?)
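A small sketch of these distances in Python (the function names are mine). It
also hints at the question above: dropping the final 1/p root gives L_p^p,
which is cheaper to compute and preserves which of two distances is larger,
which is all a nearest-center assignment needs.

    import math

    def minkowski_distance(x, y, p):
        # L_p (Minkowski) distance between two real-valued vectors.
        return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

    def weighted_euclidean(x, y, w):
        # Weighted L_2 distance, one non-negative weight per dimension.
        return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

    a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski_distance(a, b, 1))  # Manhattan: 7.0
    print(minkowski_distance(a, b, 2))  # Euclidean: 5.0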

Partitioning algorithms: basic concept

Construct a partition of a set of n objects into a set of k clusters
  Each object belongs to exactly one cluster
  The number of clusters k is given in advance

The k-means problem

Given a set X of n points in a d-dimensional space and an integer k
Task: choose a set of k points {c1, c2, ..., ck} in the d-dimensional space to
form clusters {C1, C2, ..., Ck} such that

  Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ Ci} L2(x, ci)^2

is minimized
Some special cases: k = 1, k = n
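A direct translation of this objective into Python (a sketch; the data
layout, lists of point tuples grouped per cluster, is an assumption):

    def kmeans_cost(clusters, centers):
        # Sum of squared L2 distances of every point to its cluster center.
        # clusters: list of lists of points (tuples); centers: matching list of centers.
        cost = 0.0
        for points, c in zip(clusters, centers):
            for x in points:
                cost += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        return cost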

Algorithmic properties of the k-means problem

NP-hard if the dimensionality of the data is at least 2 (d >= 2)
  Finding the best solution in polynomial time is infeasible
For d = 1 the problem is solvable in polynomial time (how?)
A simple iterative algorithm works quite well in practice

The k-means algorithm

One way of solving the k-means problem:
  Randomly pick k cluster centers {c1, ..., ck}
  For each i, set the cluster Ci to be the set of points in X that are closer
  to ci than they are to cj for all j ≠ i
  For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci)
  Repeat until convergence
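A minimal Python sketch of this loop (standard library only; points are
assumed to be tuples of floats, and the helper names are mine). It illustrates
the iteration, not a tuned implementation.

    import random

    def kmeans(X, k, max_iter=100):
        centers = random.sample(X, k)  # random initialization
        clusters = [[] for _ in range(k)]
        for _ in range(max_iter):
            # Assignment step: each point joins the cluster of its closest center.
            clusters = [[] for _ in range(k)]
            for x in X:
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
                clusters[i].append(x)
            # Update step: each center becomes the mean of its cluster.
            new_centers = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
                           for i, c in enumerate(clusters)]
            if new_centers == centers:  # convergence: no center moved
                break
            centers = new_centers
        return centers, clusters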

Properties of the k-means algorithm

Finds a local optimum
Converges often quickly (but not always)
The choice of initial points can have a large influence on the result

Two different k-means clusterings

[Figure: the same set of original points clustered two ways by k-means; one
initialization leads to the optimal clustering, another to a sub-optimal
clustering.]

Discussion of the k-means algorithm

Finds a local optimum
Converges often quickly (but not always)
The choice of initial points can have a large influence
Clusters of different densities
Clusters of different sizes
Outliers can also cause a problem (Example?)

Some alternatives to random initialization of the central points

Multiple runs
  Helps, but probability is not on your side

Select the original set of points by methods other than random, e.g., pick
points that are far apart from each other as the initial cluster centers (the
idea behind the k-means++ algorithm, which favors points far from the centers
chosen so far); a sketch of k-means++ seeding follows below
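For reference, standard k-means++ seeding picks the first center uniformly at
random and each further center with probability proportional to its squared
distance from the nearest center already chosen. A sketch (points are assumed
to be tuples of floats):

    import random

    def kmeans_pp_init(X, k):
        centers = [random.choice(X)]
        while len(centers) < k:
            # Squared distance of every point to its nearest chosen center.
            d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
                  for x in X]
            r = random.uniform(0, sum(d2))
            cumulative = 0.0
            for x, weight in zip(X, d2):
                cumulative += weight
                if cumulative >= r:
                    centers.append(x)
                    break
        return centers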

The k-median problem

Given a set X of n points in a d-dimensional space and an integer k
Task: choose a set of k points {c1, c2, ..., ck} from X and form clusters
{C1, C2, ..., Ck} such that

  Cost(C) = Σ_{i=1}^{k} Σ_{x ∈ Ci} L1(x, ci)

is minimized

The k-medoids algorithm

Also known as PAM (Partitioning Around Medoids, 1987)
  Choose randomly k medoids from the original dataset X
  Assign each of the n - k remaining points in X to its closest medoid
  Iteratively replace one of the medoids by one of the non-medoids if doing so
  improves the total clustering cost
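A compact Python sketch of this swap loop (the brute-force cost re-evaluation
and the function names are mine, chosen for clarity rather than efficiency;
dist is any distance function, e.g., L1):

    import random

    def pam(X, k, dist, max_iter=100):
        medoids = random.sample(X, k)  # k random medoids from the dataset

        def total_cost(meds):
            # Every point is assigned to its closest medoid.
            return sum(min(dist(x, m) for m in meds) for x in X)

        cost = total_cost(medoids)
        for _ in range(max_iter):
            improved = False
            for i in range(k):
                for x in X:
                    if x in medoids:
                        continue
                    candidate = medoids[:i] + [x] + medoids[i + 1:]
                    candidate_cost = total_cost(candidate)
                    if candidate_cost < cost:  # keep the swap only if it helps
                        medoids, cost, improved = candidate, candidate_cost, True
            if not improved:
                break
        return medoids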

Discussion of the PAM algorithm


The algorithm is very similar to the
k-means algorithm
It has the same advantages and
disadvantages
How about efficiency?

CLARA (Clustering LARge Applications)

It draws multiple samples of the data set, applies PAM on each sample, and
gives the best of the resulting clusterings as the output
Strength: deals with larger data sets than PAM
Weaknesses:
  Efficiency depends on the sample size
  A good clustering based on samples will not necessarily represent a good
  clustering of the whole data set if the sample is biased

The k-center problem

Given a set X of n points in a d-dimensional space and an integer k
Task: choose a set of k points from X as cluster centers {c1, c2, ..., ck}
such that for the clusters {C1, C2, ..., Ck}

  R(C) = max_j max_{x ∈ Cj} d(x, cj)

is minimized

Algorithmic properties of the k-center problem

NP-hard if the dimensionality of the data is at least 2 (d >= 2)
  Finding the best solution in polynomial time is infeasible
For d = 1 the problem is solvable in polynomial time (how?)
A simple combinatorial algorithm works well in practice

The farthest-first traversal algorithm

Pick any data point and label it as point 1
For i = 2, 3, ..., n
  Find the unlabelled point furthest from {1, 2, ..., i-1} and label it as i
  // Use d(x, S) = min_{y ∈ S} d(x, y) to measure the distance of a point
  // from a set
  π(i) = argmin_{j < i} d(i, j)
  Ri = d(i, π(i))
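A sketch of the traversal stopped after k centers, which is how it is used for
the k-center problem (the starting point X[0] and the function names are
assumptions; dist is any metric):

    def farthest_first(X, k, dist):
        centers = [X[0]]  # point 1: any data point
        while len(centers) < k:
            # d(x, S) = min over chosen centers; pick the point maximizing it.
            next_center = max(X, key=lambda x: min(dist(x, c) for c in centers))
            centers.append(next_center)
        return centers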

The farthest-first traversal is a 2-approximation algorithm

Claim 1: R1 ≥ R2 ≥ ... ≥ Rn
Proof:
  Rj = d(j, π(j)) = d(j, {1, 2, ..., j-1})
     ≤ d(j, {1, 2, ..., i-1})        // j > i, so {1, ..., i-1} ⊆ {1, ..., j-1}
     ≤ d(i, {1, 2, ..., i-1}) = Ri   // i was the farthest unlabelled point
                                     // from {1, ..., i-1} when it was picked

The farthest-first traversal is a 2-approximation algorithm

Claim 2: If C is the clustering reported by the farthest-first algorithm, then
R(C) = Rk+1
Proof:
  For all i > k we have that
  d(i, {1, 2, ..., k}) ≤ d(k+1, {1, 2, ..., k}) = Rk+1

The farthest-first traversal is a 2-approximation algorithm

Theorem: If C is the clustering reported by the farthest-first algorithm and
C* is the optimal clustering, then R(C) ≤ 2 R(C*)
Proof:
  Let C*1, C*2, ..., C*k be the clusters of the optimal k-clustering.
  If each of these clusters contains one of the points {1, ..., k}, then every
  point is within distance 2 R(C*) of one of the chosen centers (triangle
  inequality), so R(C) ≤ 2 R(C*).
  Otherwise, one of these clusters contains two or more of the points in
  {1, ..., k}. These points are at distance at least Rk ≥ Rk+1 = R(C) from
  each other, so by the triangle inequality that cluster must have radius at
  least R(C)/2, i.e., R(C) ≤ 2 R(C*).

What is the right number of clusters?

...or who sets the value of k?
For n points to be clustered, consider the case where k = n. What is the value
of the error function?
What happens when k = 1?
Since we want to minimize the error, why don't we always select k = n?

Occam's razor and the minimum description length principle

Clustering provides a description of the data
For a description to be good it has to be:
  Not too general
  Not too specific
Penalize for every extra parameter that one has to pay for
  Penalize the number of bits you need to describe the extra parameter
So for a clustering C, extend the cost function as follows:
  NewCost(C) = Cost(C) + |C| × log n
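A small numeric sketch of this penalized cost (the base-2 logarithm and the
cost values are hypothetical assumptions, just to show the penalty at work):

    import math

    def new_cost(cost, num_clusters, n):
        # NewCost(C) = Cost(C) + |C| * log n, with log base 2 (bits) assumed here.
        return cost + num_clusters * math.log2(n)

    # With n = 1000 points, each extra cluster must reduce the raw cost by about
    # log2(1000) ≈ 10 to be worth describing.
    print(new_cost(150.0, 5, 1000))   # ≈ 199.8
    print(new_cost(140.0, 10, 1000))  # ≈ 239.7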
