Unsupervised Learning Modi
Clustering
Unsupervised Learning
• Unsupervised Learning: The data has no target
attribute.
– We want to explore the data to find some
intrinsic structures (patterns) in it.
• Clustering is a technique for finding similarity
groups in data, called clusters, i.e.,
– data instances that are similar to (near) each
other are placed in the same cluster;
– data instances that are very different (far
away) from each other fall in different clusters.
A few clustering applications
• In marketing, segment customers according to
their similarities
• Given a collection of text documents, organize
them according to their content similarities
DISTANCE METRICS
Distance for numeric attributes
• Denote the distance between two data points x_i and
x_j by dist(x_i, x_j).
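A common concrete choice for numeric attributes (not stated on the slide, so take it as an assumption) is the Euclidean distance over the d numeric attributes:

dist(x_i, x_j) = \sqrt{ \sum_{r=1}^{d} (x_{ir} - x_{jr})^2 }

Manhattan distance (sum of absolute differences) and the more general Minkowski distance are common alternatives.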
• Partitioning approach:
– Construct various partitions and then
evaluate them by some criterion, e.g.,
minimizing the sum of squared errors
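As a sketch of that criterion (assuming a partition into clusters C_1, ..., C_k with centroids m_1, ..., m_k), the sum of squared errors can be written as:

SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2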
Agglomerative Clustering (Hierarchical)
1. Assign each item to its own cluster, so that if you
have N items, you start with N clusters, each
containing just one item.
2. Merge the two most similar clusters into a single
cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new
cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all N items are clustered
into a single cluster.
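A minimal Python sketch of the steps above; the toy points, the single-link cluster distance, and the function names are illustrative assumptions, not part of the slides.

# Minimal sketch of agglomerative (single-linkage) clustering.
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def single_link(c1, c2):
    # Distance between clusters = distance of their closest members (assumption).
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points):
    # Step 1: each item starts in its own cluster.
    clusters = [[p] for p in points]
    merges = []
    # Steps 2-4: repeatedly merge the two closest clusters until one remains.
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
    for step, cluster in enumerate(agglomerative(data), 1):
        print(f"merge {step}: {cluster}")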
K-means Clustering (Partitional)
• K-means is a partitional clustering algorithm, as
it partitions the given data into k clusters.
– Each cluster has a cluster center, called its
centroid.
– k is specified by the user.
K-means Algorithm
• Given k, the k-means algorithm works as follows:
1. Randomly choose k data points (seeds) to be the initial
centroids (cluster centers)
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster
memberships.
4. If a convergence criterion is not met, go to 2.
• Stopping/convergence criterion
- no (or minimum) re-assignments of data points to
different clusters,
- no (or minimum) change of centroids
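A minimal Python sketch of steps 1-4 with the centroid-based stopping test; the toy data, k = 2, and the helper names are assumptions for illustration.

# Minimal sketch of the k-means loop: seed, assign, re-compute, repeat.
import random

def dist2(a, b):
    # Squared Euclidean distance; enough for comparing closeness.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        # Step 3: re-compute each centroid as the mean of its cluster members.
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

if __name__ == "__main__":
    data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
    centroids, clusters = kmeans(data, k=2)
    print("centroids:", centroids)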
• Strengths of k-means
– Simple and efficient: time complexity is O(tkn), where
– n is the number of data points,
– k is the number of clusters, and
– t is the number of iterations.
– Since both k and t are usually small, k-means is
considered a linear-time algorithm (linear in n).
• Problem with K-Means?
– The k-means algorithm is sensitive to outliers!
– K-Medoids: Instead of taking the mean value of the objects
in a cluster as the reference point, a medoid can be used,
i.e., the most centrally located object in the cluster.
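A tiny numeric sketch (the values are chosen for illustration) of why the medoid is more robust to an outlier than the mean:

# One outlier drags the mean far from the group, while the medoid
# stays at a real, centrally located data point.
values = [1.0, 2.0, 3.0, 4.0, 100.0]          # 100.0 is the outlier

mean = sum(values) / len(values)               # 22.0, pulled toward the outlier
medoid = min(values, key=lambda v: sum(abs(v - u) for u in values))  # 3.0

print("mean:", mean, "medoid:", medoid)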