Clustering in AI

Cluster analysis is an unsupervised machine learning technique that groups similar data objects together. It finds internal structures within unlabeled data and partitions data into meaningful clusters. The k-means and k-medoids algorithms are popular partitioning clustering methods that assign data points to clusters based on feature similarity, aiming to minimize intra-cluster distances and maximize inter-cluster distances. K-medoids is more robust to outliers than k-means as it uses actual data points as cluster centers rather than averages.


What is Cluster Analysis?

Cluster: a collection of data objects


Similar to one another within the same cluster


Dissimilar to the objects in other clusters

Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Typical applications

As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts


Pattern Recognition

Spatial Data Analysis

Create thematic maps in GIS by clustering feature spaces
Detect spatial clusters or for other spatial mining tasks

Image Processing

Economic Science (especially market research)

WWW

Document classification
Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: identification of areas of similar land use in an earth observation database

Insurance: identifying groups of motor insurance policy holders with a high average claim cost

City planning: identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: observed earthquake epicenters should be clustered along continent faults


Quality: What Is Good Clustering?

A good clustering method will produce high-quality clusters with

high intra-class similarity
low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns


Measure the Quality of Clustering

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)

There is a separate quality function that measures the "goodness" of a cluster

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables

Weights should be associated with different variables based on applications and data semantics

It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective
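To make the notion of a distance function d(i, j) concrete, here is a minimal sketch (not part of the original slides) of two common choices for interval-scaled variables; the function names and the plain-list representation of objects are assumptions of this example.

```python
import math

def euclidean(i, j):
    """d(i, j) for interval-scaled (numeric) objects: straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    """d(i, j) as the sum of absolute attribute differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(i, j))

# Two objects described by three numeric attributes.
x, y = [1.0, 2.0, 0.5], [2.0, 0.0, 0.5]
print(euclidean(x, y))  # about 2.236
print(manhattan(x, y))  # 3.0
```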



Typical Alternatives to Calculate the Distance between Clusters

Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)

Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)

Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)

Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)

Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)

Medoid: one chosen, centrally located object in the cluster
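As a rough illustration of these alternatives (not from the slides), the sketch below computes single-link, complete-link, average, and centroid distances between two clusters of numeric points; the function names and the Euclidean base distance are assumptions of this example.

```python
import math

def single_link(Ki, Kj):
    """Smallest distance between an element of Ki and an element of Kj."""
    return min(math.dist(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(math.dist(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    """Average distance over all pairs with one element from each cluster."""
    return sum(math.dist(p, q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):
    """Distance between the two cluster centroids."""
    ci = [sum(col) / len(Ki) for col in zip(*Ki)]
    cj = [sum(col) / len(Kj) for col in zip(*Kj)]
    return math.dist(ci, cj)

Ki = [[1.0, 1.0], [2.0, 1.5]]
Kj = [[6.0, 7.0], [8.0, 8.0]]
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj), centroid_link(Ki, Kj))
```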



Centroid, Radius and Diameter of a Cluster (for numerical data sets)

Centroid: the "middle" of a cluster

$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

Radius: square root of the average distance from any point of the cluster to its centroid

$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

Diameter: square root of the average mean squared distance between all pairs of points in the cluster

$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
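A minimal sketch of these three quantities for a cluster of numeric points, following the formulas above (the helper names and list-of-lists representation are my own assumptions, not from the slides):

```python
import math

def centroid(cluster):
    """C_m: coordinate-wise mean of the N points in the cluster."""
    N = len(cluster)
    return [sum(coords) / N for coords in zip(*cluster)]

def radius(cluster):
    """R_m: square root of the average squared distance to the centroid."""
    N = len(cluster)
    c = centroid(cluster)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in cluster) / N)

def diameter(cluster):
    """D_m: square root of the average squared distance over all pairs of distinct points."""
    N = len(cluster)
    total = sum(math.dist(p, q) ** 2
                for i, p in enumerate(cluster)
                for j, q in enumerate(cluster) if i != j)
    return math.sqrt(total / (N * (N - 1)))

points = [[1.0, 1.0], [2.0, 1.0], [1.5, 3.0]]
print(centroid(points), radius(points), diameter(points))
```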

Chapter 7. Cluster Analysis


1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary

Partitioning Algorithms: Basic Concept

Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:

$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen'67): each cluster is represented by the center of the cluster

k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
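For illustration, a short sketch of the squared-error criterion E for a given partition (the function name and the list-of-clusters input format are assumptions of this example, not from the slides):

```python
def squared_error(clusters):
    """E = sum over clusters K_m of sum over objects t_mi in K_m of ||C_m - t_mi||^2."""
    E = 0.0
    for K in clusters:
        C = [sum(coords) / len(K) for coords in zip(*K)]  # centroid C_m of cluster K_m
        E += sum(sum((c - x) ** 2 for c, x in zip(C, t)) for t in K)
    return E

partition = [[[1.0, 1.0], [1.5, 2.0]], [[8.0, 8.0], [9.0, 8.5]]]
print(squared_error(partition))
```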


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:


1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no new assignments are made
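Below is a minimal sketch of these four steps in Python (not the book's code; choosing k random objects as the initial seed points and testing convergence on unchanged centroids are assumptions of this sketch):

```python
import math, random

def kmeans(points, k, max_iter=100, seed=0):
    """A minimal sketch of the four steps above (Lloyd-style iteration)."""
    rng = random.Random(seed)
    # Step 1 (variant): pick k arbitrary objects as the initial seed points.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda m: math.dist(p, centers[m]))
            clusters[nearest].append(p)
        # Step 2: recompute seed points as centroids of the current partition.
        new_centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[m]
                       for m, c in enumerate(clusters)]
        # Step 4: stop when no seed point (and hence no assignment) changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

data = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.6], [9.0, 11.0]]
print(kmeans(data, k=2))
```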

The K-Means Clustering Method

Example (K = 2):

[Figure: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the cluster means again, repeating until no reassignment occurs.]


Comments on the K-Means Method

Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))

Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness

Applicable only when mean is defined, then what about categorical data?

Need to specify k, the number of clusters, in advance


Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method

A few variants of the k-means algorithm differ in

Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98), sketched after this list

Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: the k-prototype method
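The following is a rough k-modes sketch, assuming simple-matching dissimilarity (the count of mismatching attributes), a plain frequency count to obtain each cluster's mode, and the first k objects as initial modes; it is a simplified illustration rather than Huang's exact update scheme:

```python
from collections import Counter

def mismatches(x, mode):
    """Simple-matching dissimilarity: number of attributes where x differs from the mode."""
    return sum(a != b for a, b in zip(x, mode))

def column_modes(cluster):
    """Mode of each categorical attribute, via a frequency count."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*cluster)]

def kmodes(objects, k, max_iter=50):
    modes = [list(o) for o in objects[:k]]  # naive initialization: first k objects
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for o in objects:
            clusters[min(range(k), key=lambda m: mismatches(o, modes[m]))].append(o)
        new_modes = [column_modes(c) if c else modes[m] for m, c in enumerate(clusters)]
        if new_modes == modes:
            break
        modes = new_modes
    return modes, clusters

data = [("red", "small"), ("red", "large"), ("blue", "small"),
        ("blue", "large"), ("blue", "small")]
print(kmodes(data, k=2))
```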

What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!

Since an object with an extremely large value may substantially distort the distribution of the data.

K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.


The K-Medoids Clustering Method


Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)
CLARANS (Ng & Han, 1994): randomized sampling
Focusing + spatial data structure (Ester et al., 1995)

A Typical K-Medoids Algorithm (PAM)


[Figure: PAM with K = 2. Arbitrarily choose k objects as the initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid O with O_random (total cost = 26); swap O and O_random if the quality is improved; repeat the loop until no change.]
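A minimal PAM-style sketch of the swap loop described above (not the original algorithm listing; the Euclidean cost, the choice of the first k objects as initial medoids, and the exhaustive scan over candidate swaps are assumptions of this example):

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = [list(p) for p in points[:k]]   # arbitrary initial medoids (assumption)
    cost = total_cost(points, medoids)
    improved = True
    while improved:                            # loop until no change
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [list(o)] + medoids[i + 1:]
                candidate_cost = total_cost(points, candidate)  # cost of swapping
                if candidate_cost < cost:      # swap only if quality improves
                    medoids, cost = candidate, candidate_cost
                    improved = True
    return medoids, cost

data = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.6], [25.0, 25.0]]
print(pam(data, k=2))
```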

