Clustering in AI
Cluster analysis
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
- Pattern recognition
- Create thematic maps in GIS by clustering feature spaces
- Detect spatial clusters or support other spatial mining tasks
- Document classification
- Cluster Weblog data to discover groups of similar access patterns
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j) (see the sketch below)
- There is a separate quality function that measures the goodness of a cluster
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
- Weights should be associated with different variables based on the application and data semantics
- It is hard to define "similar enough" or "good enough"
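As an illustration of the distance-function point above, here is a minimal sketch (the function name and weighting scheme are my own, not from the source) of a weighted Minkowski distance for interval-scaled variables; p = 2 gives the usual Euclidean metric:

```python
import numpy as np

def weighted_minkowski(x, y, weights=None, p=2):
    """Weighted Minkowski distance d(i, j) between two interval-scaled vectors.

    p = 2 gives Euclidean distance, p = 1 gives Manhattan distance.
    Weights let an application emphasize variables according to data semantics.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if weights is None:
        weights = np.ones_like(x)
    return np.power(np.sum(weights * np.abs(x - y) ** p), 1.0 / p)

# Hypothetical example: two customers described by (age, income in $1000s),
# with income down-weighted so it does not dominate the distance.
print(weighted_minkowski([25, 40], [30, 55], weights=[1.0, 0.1]))
```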
Centroid: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
Radius: square root of the average squared distance from the points of the cluster to its centroid

$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
Diameter: square root of the average squared distance between all pairs of points in the cluster

$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
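A minimal sketch (assuming NumPy; the function name is mine, not from the source) that computes the centroid, radius, and diameter defined above for a cluster of numeric points:

```python
import numpy as np

def cluster_statistics(points):
    """Centroid, radius, and diameter of a cluster of N numeric points.

    points: array-like of shape (N, d).
    """
    points = np.asarray(points, dtype=float)
    N = len(points)
    centroid = points.mean(axis=0)                                        # C_m
    radius = np.sqrt(np.mean(np.sum((points - centroid) ** 2, axis=1)))   # R_m
    # Diameter: average squared distance over all ordered pairs (i, j), i != j.
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    diameter = np.sqrt(sq_dists.sum() / (N * (N - 1)))                    # D_m
    return centroid, radius, diameter

print(cluster_statistics([[1, 1], [2, 2], [3, 3]]))
```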
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances

$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion (an evaluation sketch follows below)
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
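A minimal sketch of evaluating the partitioning criterion itself, i.e. the sum of squared errors E for a given assignment and set of centroids (function and argument names are assumptions, not from the source):

```python
import numpy as np

def sum_of_squared_error(data, labels, centroids):
    """E = sum over clusters m of the squared distances (C_m - t_mi)^2
    for all objects t_mi assigned to cluster K_m."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    sse = 0.0
    for m, c in enumerate(np.asarray(centroids, dtype=float)):
        members = data[labels == m]
        sse += np.sum((members - c) ** 2)
    return sse

# Tiny 1-D example: points 1 and 2 in cluster 0, point 9 in cluster 1.
print(sum_of_squared_error([[1.0], [2.0], [9.0]], [0, 0, 1], [[1.5], [9.0]]))  # 0.5
```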
The k-means algorithm (see the sketch below):
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no more new assignments are made
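A short sketch of the steps above, assuming NumPy and using random seed points for the initial partition (one common choice; the function name is mine, not from the source):

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps listed above."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Steps 1-2: pick k objects as the initial seed points (cluster centroids).
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # Step 4: no new assignments
        labels = new_labels
        # Step 2 (repeated): recompute each centroid as the mean of its cluster.
        for m in range(k):
            if np.any(labels == m):
                centroids[m] = data[labels == m].mean(axis=0)
    return labels, centroids
```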
[Figure: k-means example, showing objects repeatedly reassigned to the nearest cluster mean and the means updated until no assignment changes]
Strength: relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
- Applicable only when the mean is defined; what about categorical data?

Variants of the k-means method differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means

Handling categorical data (see the mode-update sketch below):
- Replacing the means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update the modes of clusters
- For a mixture of categorical and numerical data: the k-prototype method
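A minimal illustration of two of the ideas above: a simple-matching dissimilarity for categorical objects and a frequency-based mode of a cluster (helper names are hypothetical, not from the source):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple-matching dissimilarity for categorical objects:
    the number of attributes on which the two objects differ."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(objects):
    """Frequency-based mode of a cluster: for each attribute,
    take the most frequent category among the cluster's members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))

# Hypothetical categorical objects described by (color, size).
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = cluster_mode(cluster)
print(mode)                                           # ('red', 'small')
print(matching_dissimilarity(("red", "large"), mode))  # 1
```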
PAM (Partitioning Around Medoids) starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering (a swap-step sketch follows below).
PAM works effectively for small data sets, but does not scale well to large data sets.
Efficiency improvements on PAM:
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)
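A minimal sketch of the medoid-swap idea described above: a medoid is swapped with a non-medoid whenever the swap lowers the total distance of the clustering. This is a naive greedy variant under my own naming, not the full PAM cost bookkeeping:

```python
import numpy as np

def total_cost(data, medoid_idx):
    """Total distance of the clustering: each object contributes its
    distance to the nearest medoid."""
    dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam_sketch(data, k, seed=0):
    """Greedy k-medoids sketch: swap a medoid with a non-medoid whenever the
    swap reduces the total distance, until no improving swap exists."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(data), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(len(data)):
                if h in medoids:
                    continue
                candidate = medoids[:mi] + [h] + medoids[mi + 1:]
                if total_cost(data, candidate) < total_cost(data, medoids):
                    medoids, improved = candidate, True
    return medoids, total_cost(data, medoids)
```

Because medoids are actual objects rather than means, this scheme is less sensitive to outliers than k-means, at the price of the more expensive swap search.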
[Figure: k-medoids example with K = 2 and total cost = 26, illustrating assignment of objects to the nearest medoid]