Unsupervised Learning Modi

This document discusses unsupervised learning techniques, specifically clustering. It describes clustering as a way to group similar data instances together into clusters without a predefined target variable. Common clustering algorithms covered include hierarchical (agglomerative) clustering, k-means clustering, and k-medoids clustering. Hierarchical clustering builds a hierarchy of clusters step by step, while k-means and k-medoids are partitioning algorithms that assign data points to clusters so as to optimize a criterion such as the distance of points to their cluster centroid or medoid.


Unsupervised learning:

Clustering
Unsupervised Learning
• Unsupervised Learning: the data has no target
attribute.
– We want to explore the data to find some
intrinsic structure (patterns) in it.
• Clustering is a technique for finding similarity
groups in data, called clusters, i.e.,
– data instances that are similar to (near) each
other are in the same cluster
– data instances that are very different (far
away) from each other fall in different clusters.
A few clustering applications
• In marketing, segment customers according to
their similarities
• Given a collection of text documents, organize
them according to their content similarities
DISTANCE METRICS
Distance for numeric attributes
• Denote distance with dist(xi, xj), where xi and
xj are data points with r numeric attributes
• Minkowski distance:
dist(xi, xj) = ( |xi1 - xj1|^h + |xi2 - xj2|^h + ... + |xir - xjr|^h )^(1/h)
• h = 2 gives the Euclidean distance and h = 1 gives the Manhattan distance


• When attributes have similar scale: (1, 2), (2, 1)
– Manhattan = |1-2| + |2-1| = 2
– Euclidean = sqrt((1-2)^2 + (2-1)^2) = sqrt(2) ≈ 1.41
Choosing the distance metric:
• When attributes have different ranges: (10, 100), (50, 500)
– Manhattan = |10-50| + |100-500| = 440
– Euclidean = sqrt((10-50)^2 + (100-500)^2) ≈ 401.99
• Euclidean squares each difference, so the attribute with the larger
range dominates; Manhattan is therefore more stable than Euclidean
• Chebyshev distance: the largest absolute difference over any single
attribute, dist(xi, xj) = max over k of |xik - xjk|
– For (10, 100) and (50, 500): Chebyshev = max(40, 400) = 400
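Below is a minimal Python sketch (not part of the original slides) that computes the three distances just defined for the example points; the function names are illustrative.

# Manhattan (h = 1), Euclidean (h = 2) and Chebyshev distances
def manhattan(x, y):
    # sum of absolute attribute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared attribute differences
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def chebyshev(x, y):
    # largest absolute difference over any single attribute
    return max(abs(a - b) for a, b in zip(x, y))

print(manhattan((10, 100), (50, 500)))   # 440
print(euclidean((10, 100), (50, 500)))   # ~401.99
print(chebyshev((10, 100), (50, 500)))   # 400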
HOW DO WE EMPLOY DISTANCE IN A CLUSTER?
Distance between Clusters
• Single link: smallest distance between an element in
one cluster and an element in the other cluster
• Complete link: largest distance between an element in
one cluster and an element in the other cluster
• Average: average distance between an element in one
cluster and an element in the other cluster
• Centroid: distance between the centroids of two
clusters
• Medoid: distance between the medoids of two clusters
– Medoid: one chosen, centrally located object in the
cluster
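The linkage definitions above can be turned into a short sketch; the two toy clusters below are assumed for illustration and are not from the slides.

from itertools import product

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def single_link(c1, c2):
    # smallest distance between an element of c1 and an element of c2
    return min(euclidean(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    # largest distance between an element of c1 and an element of c2
    return max(euclidean(p, q) for p, q in product(c1, c2))

def average_link(c1, c2):
    # average distance over all cross-cluster pairs
    d = [euclidean(p, q) for p, q in product(c1, c2)]
    return sum(d) / len(d)

c1 = [(1, 2), (2, 1)]
c2 = [(8, 8), (9, 10)]
print(single_link(c1, c2), complete_link(c1, c2), average_link(c1, c2))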
Representation and Naming Clusters
• Use the centroid of each cluster to represent
the cluster
• Use the most frequent attribute values to represent the cluster
• Use a classification model, with cluster membership as the class label,
to describe the cluster
TYPES OF CLUSTERING ALGORITHMS
• Hierarchical approach:
– Create a hierarchical decomposition of the
set of data (or objects) using some criterion

• Partitioning approach:
– Construct various partitions and then
evaluate them by some criterion, e.g.,
minimizing the sum of squared errors (SSE; see the sketch below)
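To make the SSE criterion concrete, here is a small sketch (toy partition, not from the slides) that computes it for a given set of clusters.

def sse(clusters):
    # clusters: a list of clusters; each cluster is a list of points (tuples)
    total = 0.0
    for points in clusters:
        # centroid = attribute-wise mean of the points in the cluster
        centroid = [sum(vals) / len(vals) for vals in zip(*points)]
        # add the squared Euclidean distance of every point to its centroid
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid))
                     for p in points)
    return total

partition = [[(1, 2), (2, 1), (1, 1)], [(8, 8), (9, 10), (8, 9)]]
print(sse(partition))  # small value: each point lies close to its centroid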
Agglomerative Clustering (Hierarchical)
1. Assign each item to its own cluster, so that if you
have N items, you now have N clusters, each
containing just one item.
2. Merge the two most similar (closest) clusters into a single
cluster, so that you now have one less cluster.
3. Compute distances (similarities) between the new
cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into
a single cluster of size N.
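A compact sketch of this procedure, assuming SciPy is available: linkage builds the full merge hierarchy (the repeated merge loop in steps 2 and 3) and fcluster cuts it at a chosen number of clusters; the toy data are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two visually separated groups
X = np.array([[1, 2], [2, 1], [1, 1],
              [8, 8], [9, 10], [8, 9]], dtype=float)

# build the merge hierarchy with single-link (smallest distance) merging
Z = linkage(X, method="single", metric="euclidean")

# cut the hierarchy so that two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]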
K-means Clustering (Partitional)
• K-means is a partitional clustering algorithm, as
it partitions the given data into k clusters.
– Each cluster has a cluster center, called
centroid.
– “k” is specified by the user
K-means Algorithm
• Given k, the k-means algorithm works as follows:
1. Randomly choose k data points (seeds) to be the initial
centroids (cluster centers)
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster
memberships.
4. If a convergence criterion is not met, go to 2.
• Stopping/convergence criterion
- no (or minimum) re-assignments of data points to
different clusters,
- no (or minimum) change of centroids
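The four steps above can be sketched in a few lines (assuming NumPy); the seeding, toy data, and convergence test below are illustrative choices, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly choose k data points (seeds) as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. assign each data point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. re-compute each centroid as the mean of its current members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. stop when the centroids no longer change, otherwise go to step 2
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 1], [1, 1], [8, 8], [9, 10], [8, 9]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster index of each point, e.g. [0 0 0 1 1 1]
print(centroids)   # the two cluster centers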
• Strengths of k-means
– Simple and efficient: time complexity is O(tkn), where
– n is the number of data points,
– k is the number of clusters, and
– t is the number of iterations.
– Since both k and t are usually small, k-means is considered
a linear algorithm (linear in n).
• Problem with K-Means?
– The k-means algorithm is sensitive to outliers, because an
extreme value drags the mean of its cluster towards it.
– K-Medoids: instead of taking the mean value of the objects
in a cluster as a reference point, a medoid can be used,
which is the most centrally located object in the cluster.
– K-Medoids is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or
other extreme values than a mean.

• Problem with K-Medoids?
– It works efficiently for small data sets but does not scale
well to large data sets.
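A tiny illustration (toy numbers, not from the slides) of why an outlier drags the mean of a cluster but barely affects its medoid:

def medoid(points):
    # the cluster member with the smallest total distance to all other members
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

cluster = [1.0, 2.0, 2.0, 3.0, 100.0]   # 100.0 is an outlier

mean = sum(cluster) / len(cluster)
print(mean)             # 21.6  -- pulled far away from the bulk of the data
print(medoid(cluster))  # 2.0   -- still a typical, centrally located object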
