
3CP10 MJJ Clustering Intro

The document discusses clustering as a key aspect of unsupervised learning, which involves grouping similar objects without labeled data. It outlines various clustering approaches, including partitioning, hierarchical, model-based, density-based, and graph-theoretic methods, along with their applications in fields like biology and marketing. Additionally, it covers the importance of distance metrics and evaluation methods for assessing clustering quality.


Foundations of Machine Learning

Module 9: Clustering
Part A: Introduction and k-means

Sudeshna Sarkar
IIT Kharagpur
Unsupervised learning
• Unsupervised learning:
– Data with no target attribute. Describe hidden structure from
unlabeled data.
– Explore the data to find some intrinsic structure in it.
• Clustering: the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more
similar to each other than to those in other clusters.
• Useful for
– Automatically organizing data.
– Understanding hidden structure in data.
– Preprocessing for further analysis.
Applications: News Clustering (Google)
Gene Expression Clustering
Other Applications
• Biology: classification of the plant and animal kingdoms
given their features
• Marketing: Customer Segmentation based on a
database of customer data containing their
properties and past buying records
• Clustering weblog data to discover groups of similar
access patterns.
• Recognize communities in social networks.
An illustration
• This data set has four natural clusters.

Aspects of clustering
• A clustering algorithm such as
– Partitional clustering, e.g., k-means
– Hierarchical clustering, e.g., AHC
– Mixture of Gaussians
• A distance or similarity function
– such as Euclidean, Minkowski, cosine
• Clustering quality
– Inter-cluster distance → maximized
– Intra-cluster distance → minimized
• The quality of a clustering result depends on the algorithm, the
distance function, and the application.
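The two quality criteria can be made concrete with a small sketch. It assumes Euclidean distance and measures both quantities through cluster centroids, which is only one of several reasonable definitions; the function names and the toy data are illustrative and do not come from the slides.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def mean_intra_cluster_distance(clusters):
    """Average distance from each point to the centroid of its own cluster."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        total += sum(euclidean(p, c) for p in points)
        count += len(points)
    return total / count

def mean_inter_cluster_distance(clusters):
    """Average distance between the centroids of every pair of clusters."""
    cents = [centroid(points) for points in clusters]
    pairs = list(combinations(cents, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

# The same four points grouped in two different ways (assumed toy data).
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name,
          mean_intra_cluster_distance(clusters),
          mean_inter_cluster_distance(clusters))
```

The grouping labelled `good` has the smaller intra-cluster distance and the larger inter-cluster distance, which is the pattern the slide asks for.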

Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate
them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of
objects using some criterion
• Model-based: Hypothesize a model for each cluster and find
best fit of models to data
• Density-based: Guided by connectivity and density functions
• Graph-Theoretic Clustering

Partitioning Algorithms
• Partitioning method: Construct a partition of a
database D of m objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
– Globally optimal: exhaustively enumerate all partitions
– Heuristic method: k-means (MacQueen, 1967)
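The slide names k-means only as a heuristic. A minimal sketch of the usual iteration (often called Lloyd's algorithm) follows, assuming squared Euclidean distance as the partitioning criterion and random data points as initial centers. The function names, the `seed` parameter, and the sample points are illustrative assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and update steps until the labels stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = mean_point(members)
    return centers, labels

# Two well-separated groups of three points each (assumed toy data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

The result is a local optimum of the within-cluster sum of squares; on harder data, different seeds can give different partitions, which is why the slide calls the method a heuristic.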

Hierarchical Clustering
animal
– vertebrate: fish, reptile, amphibian, mammal
– invertebrate: worm, insect, crustacean

• Produce a nested sequence of clusters.


• One approach: recursive application of a partitional
clustering algorithm.
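A nested sequence of clusters can also be built bottom-up. The sketch below is an agglomerative (AHC-style) variant with single linkage, not the recursive partitional approach named on the slide; the linkage choice, the function names, and the sample data are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(c1, c2):
    """Single linkage: distance between the closest pair across two clusters."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Merge the two closest clusters repeatedly; return every level."""
    clusters = [[p] for p in points]          # start with singletons
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        levels.append([list(c) for c in clusters])
    return levels

# One-dimensional toy data (assumed): two tight pairs and one outlier.
levels = agglomerative([(0,), (1,), (5,), (6,), (20,)])
for level in levels:
    print(len(level), level)
```

Each level is obtained from the previous one by a single merge, so the levels form the nested sequence described above. Cutting the sequence at a chosen level gives a flat clustering.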
Model-based Clustering
• A model is hypothesized
• e.g., assume the data are
generated by a mixture of
underlying probability
distributions
• Fit the model to the data
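One common way to fit a mixture model is expectation-maximization (EM). The slide does not name a fitting procedure, so the sketch below is an assumption: a one-dimensional mixture of Gaussians with hand-picked starting values. All names, the variance floor, and the data are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a zero variance
    return means, variances, weights

# Toy data (assumed): three values near 1 and three values near 5.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, variances, weights)
```

Unlike k-means, each point receives a soft membership (its responsibilities) rather than a single label; a hard clustering can be read off by taking the component with the largest responsibility.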
Density-based Clustering
• Based on density-connected
points
• Locates regions of high
density separated by
regions of low density
• e.g., DBSCAN
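A compact sketch of the DBSCAN idea follows, assuming Euclidean distance and a brute-force neighborhood search. The parameter names `eps` and `min_pts`, the label -1 for noise, and the sample points are illustrative choices, not taken from the slides.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        labels[i] = cluster_id           # i is a core point: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
        cluster_id += 1
    return labels

# Two dense squares and one isolated point (assumed toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The number of clusters is not given in advance; it follows from where the dense regions are, and the isolated point is reported as noise.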
Graph Theoretic Clustering
• Weights of edges
between items (nodes)
based on similarity
• e.g., look for a minimum
cut in the graph
(Dis)similarity measures
• Distance metric (scale-dependent)
– Minkowski family of distance measures:
d(x, y) = (Σ_i |x_i − y_i|^p)^(1/p)
Manhattan (p=1), Euclidean (p=2)
– Cosine distance: 1 − (x · y) / (‖x‖ ‖y‖)
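The measures on this slide are short enough to write out directly. This is a minimal sketch using the standard definitions, with cosine distance taken as one minus cosine similarity; the function names and the sample vectors are illustrative.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

u, v = (0.0, 0.0), (3.0, 4.0)
print(minkowski(u, v, 1))   # Manhattan distance: 7.0
print(minkowski(u, v, 2))   # Euclidean distance: 5.0

# Rescaling a vector changes the Minkowski distances,
# but it leaves the cosine distance unchanged.
a, b = (1.0, 2.0), (2.0, 1.0)
a_scaled = (10.0, 20.0)
print(minkowski(a, b, 2), minkowski(a_scaled, b, 2))
print(cosine_distance(a, b), cosine_distance(a_scaled, b))
```

Because Minkowski distances depend on the scale of each feature, features are often standardized before clustering so that no single attribute dominates the distance.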
(Dis)similarity measures
• Correlation coefficients (scale-invariant)
• Mahalanobis distance

• Pearson correlation
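Both measures can be sketched in a few lines. The Mahalanobis version below is restricted to two dimensions so that the covariance matrix can be inverted by hand; that restriction, the function names, and the sample values are assumptions made to keep the example self-contained.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mahalanobis_2d(u, v, cov):
    """sqrt((u - v)^T cov^{-1} (u - v)) for 2-D points and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of a 2x2 matrix
    dx, dy = u[0] - v[0], u[1] - v[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)

# Correlation ignores scale and offset: these two profiles correlate perfectly.
print(pearson([1, 2, 3], [10, 20, 30]))

# With an identity covariance the Mahalanobis distance is the Euclidean one.
print(mahalanobis_2d((0, 0), (3, 4), ((1, 0), (0, 1))))

# A large variance along the first axis makes differences there count less.
print(mahalanobis_2d((0, 0), (2, 0), ((4, 0), (0, 1))))
```

A correlation can be turned into a dissimilarity, for example as one minus the coefficient, when a clustering algorithm expects distances.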
Quality of Clustering
• Internal evaluation:
– assign the best score to the algorithm that produces clusters with high
similarity within a cluster and low similarity between clusters, e.g.,
Davies-Bouldin index

• External evaluation:
– evaluated based on data such as known class labels and external
benchmarks, e.g., Rand index, Jaccard index, F-measure
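The Rand and Jaccard indices can both be computed by counting pairs of objects. This is a minimal sketch under the usual pair-counting definitions; the function names and the example labelings are illustrative.

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count object pairs by whether they share a class and share a cluster."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            ss += 1   # same class, same cluster
        elif same_true:
            sd += 1   # same class, different clusters
        elif same_pred:
            ds += 1   # different classes, same cluster
        else:
            dd += 1   # different classes, different clusters
    return ss, sd, ds, dd

def rand_index(true_labels, pred_labels):
    """Fraction of pairs on which the clustering agrees with the classes."""
    ss, sd, ds, dd = pair_counts(true_labels, pred_labels)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(true_labels, pred_labels):
    """Like the Rand index, but ignoring pairs separated in both labelings."""
    ss, sd, ds, _ = pair_counts(true_labels, pred_labels)
    return ss / (ss + sd + ds)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]   # one object placed in the wrong cluster
print(pair_counts(true_labels, pred_labels))   # (4, 2, 3, 6)
print(rand_index(true_labels, pred_labels))
print(jaccard_index(true_labels, pred_labels))
```

Both indices compare pairs rather than label values, so renaming the clusters does not change the score.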
Thank You
