
CHAPTER 4: Clustering

Contents
• Introduction to Unsupervised Learning
• Introduction to Clustering
• Clustering Applications
• Partitioning-based Clustering
• Hierarchical-based Clustering
• Density-based Clustering
• Evaluation Methods for Clustering
Types of ML

Supervised
• Has a target column
• Data points have a known outcome

Unsupervised
• Does not have a target column
• Data points have an unknown outcome

• Unsupervised algorithms are relevant when we don't have an outcome or labeled variable we are trying to predict.
Unsupervised ML – Grouping News Articles by Example

• Unsupervised algorithms are helpful for finding structure within a data set, and for partitioning a data set into smaller pieces.
• They explore the data to find intrinsic structures in it.
Types of Unsupervised ML

Clustering
• The process of partitioning a set of data objects (or observations) into subsets
• Identifies unknown structure in data
• Examples: K-Means, K-Medoids, Agglomerative Clustering, DBSCAN

Dimensionality Reduction
• Uses structural characteristics to simplify data
• Examples: Principal Component Analysis, Non-Negative Matrix Factorization
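As a hedged illustration (not from the original slides), the minimal sketch below runs one method from each family on placeholder data using scikit-learn; the dataset X and all parameter values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)  # placeholder data: 100 objects, 5 attributes

# Clustering: partition the 100 objects into 3 subsets
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: simplify the 5 attributes to 2 components
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10])  # cluster index of the first 10 objects
print(X_2d.shape)   # (100, 2)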
Cluster Analysis

• Cluster: a collection of data objects.
• Clustering: the process of partitioning a set of data objects (or observations) into subsets.
• Each subset is a cluster, such that:
  • objects in a cluster are similar to one another
  • objects are dissimilar to objects in other clusters

Definition. Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster Kj contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = j, 1 ≤ i ≤ n, and ti ∈ D}.
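For instance, with D = {t1, …, t5} and k = 2, the mapping f(t1) = f(t3) = 1 and f(t2) = f(t4) = f(t5) = 2 yields the clusters K1 = {t1, t3} and K2 = {t2, t4, t5}.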
Clustering Applications

• Classification
  • Clustering of documents into topics
• Image pattern recognition
  • Handwritten character recognition
  • Image compression
• Web search
  • Grouping of search results
• Customer segmentation
  • Helps marketers discover distinct groups, so that they can characterize their customer groups based on purchasing patterns
• As a data mining tool, clustering can be used to gain insight into the distribution of data.
Application of Clustering

• Clustering can be used for outlier/anomaly detection, as the sketch below illustrates.
• An outlier is a value that lies far away from the values in any cluster.
• Example: credit card fraud detection.
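A minimal sketch of this idea (assuming scikit-learn; the placeholder data and the three-standard-deviation cutoff are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Distance from each point to its own cluster centroid
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points far from every cluster as candidate outliers
threshold = dist.mean() + 3 * dist.std()  # illustrative cutoff
outliers = X[dist > threshold]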
Requirements for Cluster Analysis

• Scalability: clustering on only a sample of a given large data set may lead to biased results; therefore, highly scalable clustering algorithms are needed.
• Ability to deal with different types of attributes: a clustering algorithm should be able to cluster numeric, nominal, ordinal, and other types of data.
• Ability to deal with noisy data: clustering algorithms can be sensitive to noise (outliers; missing, unknown, or erroneous values) and may then produce poor-quality clusters; therefore, we need clustering methods that are robust to noise.
• Interpretability: the clustering results should be interpretable, comprehensible, and usable.
• High dimensionality: the clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
Categories of Clustering Methods

(Figure: overview of the categories of clustering methods, including partitioning-based, hierarchical-based, and density-based methods.)
Clustering Methods: Partitioning

• Partitioning methods divide n objects into k partitions of the data.
• Each group (cluster) must contain at least one object.
• They adopt exclusive cluster separation: each object must belong to exactly one group.
• Partitioning methods are distance based.
• They follow an iterative relocation technique: by moving objects from one group to another, the partitioning is improved.
• In a good partition, objects in the same cluster are close, while objects in different clusters are far apart.
• Examples: k-means and k-medoids.
Clustering Methods: K-Means

• The k-means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
• Given a set of n distinct objects, the k-means clustering algorithm partitions the objects into k clusters such that intracluster similarity is high but intercluster similarity is low.
• In this algorithm, the user has to specify k, the number of clusters. The objects are assumed to be defined by numeric attributes, so any distance metric can be used to demarcate the clusters.
Clustering Methods: K-Means

The algorithm can be stated as follows.

Step 1: Randomly assign cluster centroids.
• First, select k objects at random from the set of n objects. These k objects are treated as the centroids, or centers of gravity, of the k clusters (see the sketch below).
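A minimal numpy sketch of this initialization step (all names and the toy data are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((16, 2))  # n = 16 objects with 2 numeric attributes
k = 3
# Step 1: pick k distinct objects at random to serve as initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]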
Step 2: Assign each point to its closest centroid.
• Each of the remaining objects is assigned to the closest centroid. The collection of objects assigned to a centroid is called a cluster (a sketch follows).
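Continuing the sketch from Step 1, the assignment step in numpy:

# Step 2: assign every object to its closest centroid (Euclidean distance)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
labels = dists.argmin(axis=1)  # index of the nearest centroid per object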
Step 3: Move each centroid to its cluster's mean.
• The centroid of each cluster is updated by calculating the mean value of each attribute over the objects in the cluster (see the sketch below).
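Continuing the same sketch, the update step in numpy:

# Step 3: move each centroid to the mean of its assigned objects
# (this simple version assumes no cluster ends up empty)
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])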
• The assignment and update steps are repeated until some stopping criterion is reached (such as a maximum number of iterations, centroids remaining unchanged, or no reassignments).
• When the points no longer change clusters, the algorithm has converged.


k-Means Algorithm
Algorithm : k-Means clustering
 Input:
 D : a dataset containing n objects,
 k : the number of cluster
 Output: A set of k clusters

Steps:
1. arbitrarily choose k objects from D as the initial cluster
centroids.
2. For each of the objects in D do
 Compute distance between the current objects and k
cluster centroids
 Assign the current object to that cluster to which it is
closest.
3. Compute the “cluster centers” of each cluster. These become
the new cluster centroids.
4. Repeat step 2-3 until the convergence criterion is satisfied
21
5. Stop
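Putting the steps together, here is a minimal, self-contained numpy sketch of the algorithm; convergence is checked via unchanged centroids, and all names are illustrative rather than a reference implementation:

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X is an (n, m) array of objects with m numeric attributes
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer change (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centroids = k_means(X, k=3) partitions the rows of X into three clusters.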
k-Means Algorithm

Notes:
1) Objects are defined in terms of a set of attributes A = {A1, A2, …, Am}, where each Ai is of continuous data type.
2) Distance computation: any distance can be used, such as L1, L2, L3, or cosine similarity (see the sketch below).
3) Minimum distance is the measure of closeness between an object and a centroid.
4) Mean calculation: the mean value of each attribute over all objects in the cluster.
5) Convergence criteria: any one of the following can serve as the termination condition of the algorithm:
   • The maximum permissible number of iterations is reached.
   • No change of centroid values in any cluster.
   • Zero (or no significant) movement of objects from one cluster to another.
   • Cluster quality reaches a certain level of acceptance.
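As a hedged illustration of note 2 (assuming scipy is available; the data and centroids are placeholders), cdist computes several of these distances directly:

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(16, 2)   # placeholder objects
C = X[:3]                   # placeholder centroids

d_l1 = cdist(X, C, metric='cityblock')        # L1 (Manhattan) distance
d_l2 = cdist(X, C, metric='euclidean')        # L2 (Euclidean) distance
d_l3 = cdist(X, C, metric='minkowski', p=3)   # L3 (Minkowski, p = 3)
d_cos = cdist(X, C, metric='cosine')          # cosine distance (1 - similarity)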
Example: Clustering by k-Means Partitioning

Consider the set of objects in Table 1 below. Let k = 3; that is, the user would like the objects to be partitioned into three clusters.

Table 1: 16 objects with two attributes, A1 and A2.

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

(Fig 1: scatter plot of the data in Table 1, with A1 on the x-axis and A2 on the y-axis.)
Step 1: We arbitrarily choose three objects as the three initial cluster centers.

(Plot of the data in Table 1 with the three chosen centers highlighted.)
Step 2: Each object is assigned to the cluster whose center is nearest.
• Let d1, d2, and d3 denote the Euclidean distance from an object to c1, c2, and c3, respectively.

Initial centroids (chosen randomly):

Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5
• The distance calculations are shown in the table below; these numbers can be reproduced with the short sketch that follows it.

A1    A2     d1     d2     d3
6.8   12.6   4.0    1.1    5.9
0.8   9.8    3.0    7.4    10.2
1.2   11.6   3.1    6.6    8.5
2.8   9.6    1.0    5.6    9.5
3.8   9.9    0.0    4.6    8.9
4.4   6.5    3.5    6.6    12.1
4.8   1.1    8.9    11.5   17.5
6.0   19.9   10.2   7.9    1.4
6.2   18.5   8.9    6.5    0.0
7.6   17.4   8.4    5.2    1.8
7.8   12.2   4.6    0.0    6.5
6.6   7.7    3.6    4.7    10.8
8.2   4.5    7.0    7.7    14.1
8.4   6.9    5.5    5.3    11.8
9.0   3.4    8.3    8.9    15.4
9.6   11.1   5.9    2.1    8.1
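A short numpy sketch that reproduces these numbers (data and centroids taken directly from the tables above):

import numpy as np

X = np.array([[6.8, 12.6], [0.8, 9.8], [1.2, 11.6], [2.8, 9.6],
              [3.8, 9.9], [4.4, 6.5], [4.8, 1.1], [6.0, 19.9],
              [6.2, 18.5], [7.6, 17.4], [7.8, 12.2], [6.6, 7.7],
              [8.2, 4.5], [8.4, 6.9], [9.0, 3.4], [9.6, 11.1]])
C = np.array([[3.8, 9.9], [7.8, 12.2], [6.2, 18.5]])  # c1, c2, c3

# Euclidean distance of every object to each centroid: columns d1, d2, d3
D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
print(D.round(1))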
• The assignment of each object to its nearest centroid is shown in the right-most column below; the resulting clustering is shown in the figure that follows.

A1    A2     d1     d2     d3     cluster
6.8   12.6   4.0    1.1    5.9    2
0.8   9.8    3.0    7.4    10.2   1
1.2   11.6   3.1    6.6    8.5    1
2.8   9.6    1.0    5.6    9.5    1
3.8   9.9    0.0    4.6    8.9    1
4.4   6.5    3.5    6.6    12.1   1
4.8   1.1    8.9    11.5   17.5   1
6.0   19.9   10.2   7.9    1.4    3
6.2   18.5   8.9    6.5    0.0    3
7.6   17.4   8.4    5.2    1.8    3
7.8   12.2   4.6    0.0    6.5    2
6.6   7.7    3.6    4.7    10.8   1
8.2   4.5    7.0    7.7    14.1   1
8.4   6.9    5.5    5.3    11.8   2
9.0   3.4    8.3    8.9    15.4   1
9.6   11.1   5.9    2.1    8.1    2

(Figure: the clustering obtained after the first assignment.)
Step 3: Update the cluster centers.
• The mean value of each cluster is recalculated based on the current objects in the cluster.
• Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest.
New centroids after the first update (computed as in the sketch below):

Centroid   A1    A2
c1         4.6   7.1
c2         8.2   10.7
c3         6.6   18.6
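Continuing the numpy sketch above, the new centroids are just the per-cluster means:

labels = D.argmin(axis=1)  # 0-based cluster index of each object
new_C = np.array([X[labels == j].mean(axis=0) for j in range(3)])
print(new_C.round(1))  # approx. c1 = (4.6, 7.1), c2 = (8.2, 10.7), c3 = (6.6, 18.6)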
• We next reassign the 16 objects to the three clusters by determining which centroid is closest to each one. This gives the revised set of clusters shown in Fig 4.
• Note that point p moves from cluster C2 to cluster C1.

(Fig 4: clusters after the first iteration.)
• The centroids obtained after the second iteration are given in the table below. Note that centroid c3 remains unchanged, while c1 and c2 change slightly.
• With respect to the newly obtained cluster centers, the 16 points are reassigned again. These turn out to be the same clusters as before; hence, their centroids also remain unchanged.
• Taking this as the termination criterion, the k-means algorithm stops here. The final clustering in Fig 5 is the same as in Fig 4.

Cluster centers after the second iteration:

Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6

(Fig 5: clusters after the second iteration.)

• The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation.
• Eventually, no reassignment of the objects in any cluster occurs, and the process terminates.
• The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results may depend on the initial random selection of cluster centers (a common mitigation is sketched below).
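A common mitigation (a hedged scikit-learn sketch, not part of the original slides) is to run k-means from several random initializations and keep the best run; scikit-learn's n_init parameter does exactly this:

from sklearn.cluster import KMeans

# Run 10 independent initializations and keep the run with the lowest
# within-cluster sum of squared errors (inertia), reducing the risk of
# a poor local optimum from one unlucky random start.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # SSE of the best of the 10 runs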
Comments on the k-Means Algorithm

Limitations:
• k-means has trouble clustering data that contains outliers. When the SSE is used as the objective function, outliers can unduly influence the clusters that are produced: in the presence of outliers, the cluster centroids are not as representative as they would otherwise be, and the SSE measure itself is also affected.
• k-means cannot handle non-globular clusters, or clusters of different sizes and densities (a sketch follows).
• k-means does not really escape the scalability issue (and is not so practical for very large databases).
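As a hedged illustration of the non-globular limitation (make_moons and the eps value are illustrative choices), compare k-means with DBSCAN, the density-based method listed earlier, on two crescent-shaped clusters:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3).fit_predict(X)
# k-means tends to split each crescent in half, while DBSCAN
# typically recovers the two crescents as separate clusters.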
Thank You
