Clustering_Course_Slides
Cosine Similarity
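A minimal sketch of cosine similarity, assuming samples are equal-length lists of numbers. The function name `cosine_similarity` and the example vectors are illustrative and do not come from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Not defined for a zero vector (division by zero).
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1; orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```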
Normalizing Input Variables
Python for Data Science
[Figure: scaled values of Weight and Height]
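One common way to normalize input variables is to rescale each one to zero mean and unit standard deviation. The sketch below assumes that z-score scaling is the intended method, which the slides do not state. The helper `standardize` and the height and weight values are made up for illustration.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Made-up measurements, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]

# After scaling, both variables are on the same scale, so neither
# dominates a distance calculation just because of its units.
scaled_heights = standardize(heights_cm)
scaled_weights = standardize(weights_kg)
```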
Cluster Analysis Notes
Unsupervised
There is no ‘correct’ clustering
[Figure: example clusters labeled non-fiction and children’s]
Uses of Cluster Results
• Categories for classifying new data
Labeled samples for science fiction customers
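Once clusters have been labeled, a new sample can be classified by finding its closest centroid. The sketch below assumes Euclidean distance and a two-feature space. The centroid coordinates and the `classify` helper are hypothetical.

```python
import math

# Hypothetical labeled centroids in a made-up 2-feature space.
labeled_centroids = {
    "science fiction": (8.0, 2.0),
    "non-fiction": (2.0, 8.0),
    "children's": (1.0, 1.0),
}

def classify(sample, centroids):
    """Return the label of the centroid closest to the sample."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

new_customer = (7.0, 3.0)
category = classify(new_customer, labeled_centroids)
```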
Uses of Cluster Results
• Basis for anomaly detection
Anomalies that require further analysis
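One simple way to use clusters for anomaly detection is to flag samples that lie far from every centroid. This is a sketch under that assumption. The `find_anomalies` helper, the threshold, and the data are illustrative, and choosing a threshold is application-dependent.

```python
import math

def find_anomalies(samples, centroids, threshold):
    """Return samples whose distance to their closest centroid exceeds threshold."""
    anomalies = []
    for sample in samples:
        nearest = min(math.dist(sample, c) for c in centroids)
        if nearest > threshold:
            anomalies.append(sample)
    return anomalies

# Made-up centroids and samples; (5.0, 5.0) sits far from both clusters.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.5, 0.2), (9.0, 10.5), (5.0, 5.0), (-0.3, 0.4)]
anomalies = find_anomalies(samples, centroids, threshold=3.0)
```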
Cluster Analysis Summary
• Organize similar items into groups
Select k initial centroids
Repeat
  Assign each sample to closest centroid
  Calculate mean of cluster to determine new centroid
Until some stopping criterion is reached
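The loop above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes Euclidean distance, samples given as tuples of floats, and "no sample changed cluster" as the stopping criterion, with `max_iter` as a safety limit. The function name `kmeans` and the data are made up.

```python
import math

def kmeans(samples, centroids, max_iter=100):
    """Basic k-means on tuples of floats, starting from the given centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each sample to the closest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(s, centroids[j]))
            for s in samples
        ]
        # Assumed stopping criterion: no sample changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Calculate the mean of each cluster to determine its new centroid.
        for j in range(len(centroids)):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Two well-separated groups of made-up points.
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(samples, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In practice a library routine such as scikit-learn's `KMeans` would normally be used instead of a hand-written loop.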
[Figure: k-Means, panels (a), (b), (c); X marks a centroid]
Issue: Final clusters are sensitive to the initial centroids
Solution: Run k-means multiple times with different random initial centroids, and choose the best result
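A sketch of the restart strategy, assuming "best" means the run with the lowest within-cluster sum of squared error (WSSE) and that initial centroids are drawn at random from the samples. Both are assumptions, as are the names `kmeans` and `best_of_n_runs`, the fixed seed, and the data.

```python
import math
import random

def kmeans(samples, centroids, max_iter=100):
    """Compact k-means; returns (centroids, WSSE) for the given start."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for s in samples:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(s, centroids[j]))
            clusters[nearest].append(s)
        # New centroid is the cluster mean; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # assumed stopping criterion: centroids unchanged
            break
        centroids = updated
    wsse = sum(min(math.dist(s, c) for c in centroids) ** 2 for s in samples)
    return centroids, wsse

def best_of_n_runs(samples, k, n_runs=10, seed=0):
    """Run k-means n_runs times from random starts; keep the lowest-WSSE run."""
    rng = random.Random(seed)
    runs = [kmeans(samples, rng.sample(samples, k)) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[1])

# Made-up data with two well-separated groups.
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
best_centroids, best_wsse = best_of_n_runs(samples, k=2)
```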
Evaluating Cluster Results
error = distance between sample & centroid
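The per-sample errors are usually combined into WSSE, the within-cluster sum of squared error: square each sample's distance to its centroid and add them all up. The sketch below assumes Euclidean distance. The function name `wsse` and the data are illustrative.

```python
import math

def wsse(samples, labels, centroids):
    """Within-cluster sum of squared error for a given cluster assignment."""
    return sum(math.dist(s, centroids[lab]) ** 2 for s, lab in zip(samples, labels))

# Made-up samples, their cluster labels, and the two cluster centroids.
samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 2.0)]

# Errors are 1, 1, 2, 2, so the squared errors sum to 1 + 1 + 4 + 4.
total_error = wsse(samples, labels, centroids)
```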
Caveats:
• A lower WSSE for cluster set 1 does not mean that it is more ‘correct’ than cluster set 2
• Larger values for k will always reduce WSSE
Choosing Value for k
• Approaches:
  • Visualization
  • Application-Dependent
  • Data-Driven
Elbow Method for Choosing k
“Elbow” suggests value for k should be 3
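The elbow is usually read off a plot of WSSE against k by eye. As a rough sketch, it can also be approximated by finding the k where the drop in WSSE slows the most. This second-difference heuristic is an assumption, not the slides' method, and the WSSE values below are made up rather than taken from the plot.

```python
def elbow_k(wsse_by_k):
    """Pick the k with the sharpest bend (largest second difference) in WSSE."""
    ks = sorted(wsse_by_k)  # assumes consecutive values of k
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wsse_by_k[prev_k] - wsse_by_k[k]
        drop_after = wsse_by_k[k] - wsse_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up WSSE values for illustration; not read from the plot on the slide.
wsse_by_k = {1: 100.0, 2: 50.0, 3: 15.0, 4: 12.0, 5: 10.0}
suggested_k = elbow_k(wsse_by_k)
```

With these values the curve drops steeply up to k = 3 and flattens afterwards, so the heuristic returns 3.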
Stopping Criteria
Compare centroids to see how clusters are different
K-Means Summary
• Classic algorithm for cluster analysis