Clustering_Course_Slides

The document provides an overview of cluster analysis in Python, focusing on its goal to organize similar items into groups and its applications in various fields. It explains the k-means clustering algorithm, including the steps involved, the importance of initial centroids, and methods for evaluating and choosing the number of clusters (k). The document emphasizes that cluster analysis is unsupervised, requiring interpretation of results for meaningful insights.

Uploaded by

Autisticsad
Copyright © All Rights Reserved

Python for Data Science

Machine Learning in Python:


Clustering
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS

By the end of this video, you should be able to:

§ Articulate the goal of cluster analysis
§ Discuss whether cluster analysis is supervised or unsupervised
§ List some ways that cluster results can be applied

Cluster Analysis Overview
Goal: Organize similar items into groups


Cluster Analysis Examples
• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots

Cluster Analysis
• Divides data into clusters
• Similar items are placed in same cluster
Intra-cluster differences are minimized
Inter-cluster differences are maximized

Similarity Measures
[Figure: points A and B compared under each measure]
• Euclidean Distance
• Manhattan Distance
• Cosine Similarity
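
The three measures named on this slide can be sketched in plain Python as follows. The function names and the sample points A and B are illustrative assumptions, not taken from the course material.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample points A and B
A = (1.0, 2.0)
B = (4.0, 6.0)
print(euclidean_distance(A, B))  # 5.0
print(manhattan_distance(A, B))  # 7.0
print(cosine_similarity(A, B))
```

Note that the first two are distances (smaller means more similar), while cosine similarity grows as the vectors point in more similar directions.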

Normalizing Input Variables
[Figure: Weight and Height converted to scaled values]
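
One common way to put variables such as weight and height on a comparable scale is z-score standardization. The slide does not say which scaling method is used, so the choice of z-scores here is an assumption (min-max scaling is another option), and the measurements are made up for illustration.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical measurements: weight in pounds, height in feet
weights = [120.0, 150.0, 180.0, 210.0]
heights = [5.0, 5.5, 6.0, 6.5]

scaled_weights = standardize(weights)
scaled_heights = standardize(heights)
print(scaled_weights)
print(scaled_heights)
```

Without scaling, a distance computed on the raw values would be dominated by weight simply because its numbers are larger. After scaling, both variables contribute on the same footing.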

Cluster Analysis Notes
Unsupervised
There is no ‘correct’ clustering
Clusters don’t come with labels
Interpretation and analysis required to make sense of clustering results!

Uses of Cluster Results
• Data segmentation
• Analysis of each segment can provide insights
[Figure: example segments: science fiction, non-fiction, children’s]

Uses of Cluster Results
• Categories for classifying new data
• New sample assigned to closest cluster
Label of closest cluster used to classify new sample
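
A minimal sketch of classifying a new sample by its closest cluster. It assumes Euclidean distance, and the centroid coordinates and the `classify` helper are hypothetical. The labels reuse the segment names from the earlier slide.

```python
import math

# Hypothetical centroids from an earlier clustering run, keyed by the
# label an analyst gave each cluster after interpreting it.
centroids = {
    "science fiction": (8.0, 1.0),
    "non-fiction": (2.0, 7.0),
    "children's": (1.0, 1.0),
}

def classify(sample, centroids):
    """Return the label of the centroid closest to the sample."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

new_sample = (7.0, 2.0)
print(classify(new_sample, centroids))  # science fiction
```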

Uses of Cluster Results
• Labeled data for classification
• Cluster samples used as labeled data
[Figure: labeled samples for science fiction customers]

Uses of Cluster Results
• Basis for anomaly detection
• Cluster outliers are anomalies
Anomalies that require further analysis
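
One simple way to turn cluster results into an anomaly detector is to flag samples that lie far from every centroid. This is a sketch under assumptions: the distance threshold, the data, and the `find_anomalies` helper are all illustrative, and the slide does not prescribe a specific rule.

```python
import math

def find_anomalies(samples, centroids, threshold):
    """Return samples whose distance to the nearest centroid exceeds threshold."""
    anomalies = []
    for sample in samples:
        nearest = min(math.dist(sample, c) for c in centroids)
        if nearest > threshold:
            anomalies.append(sample)
    return anomalies

# Hypothetical centroids and samples
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.5, 0.2), (9.6, 10.3), (5.0, 5.0), (0.1, -0.4)]
print(find_anomalies(samples, centroids, threshold=2.0))  # [(5.0, 5.0)]
```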

Cluster Analysis Summary
• Organize similar items into groups
• Analyzing clusters often leads to useful insights about data
• Clusters require analysis and interpretation
Python for Data Science

Machine Learning in Python:


k-Means Clustering
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS

By the end of this video, you should be able to:

§ Describe the steps in the k-means algorithm
§ Explain what the ‘k’ stands for in k-means
§ Define cluster centroid


Cluster Analysis
• Divides data into clusters
• Similar items are in same cluster
Intra-cluster differences are minimized
Inter-cluster differences are maximized

k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
    Assign each sample to closest centroid
    Calculate mean of cluster to determine new centroid
Until some stopping criterion is reached

k-Means
[Figure: six panels with centroids marked X: (a) Original samples, (b) Initial centroids, (c) Assign samples, (d) Re-calculate centroids, (e) Assign samples, (f) Re-calculate centroids]
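
The steps of the algorithm can be sketched in plain Python as below. This is a minimal illustration, not the course's implementation. It assumes Euclidean distance, picks the initial centroids as k random samples, and stops when the centroids no longer change. The names `k_means` and `mean_point` and the sample data are hypothetical.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(samples, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Select k initial centroids: here, k distinct samples chosen at random.
    centroids = rng.sample(samples, k)
    for _ in range(max_iters):
        # Assign each sample to its closest centroid.
        clusters = [[] for _ in range(k)]
        for s in samples:
            idx = min(range(k), key=lambda i: math.dist(s, centroids[i]))
            clusters[idx].append(s)
        # Recalculate each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        # Stopping criterion: centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two visually obvious groups
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, clusters = k_means(samples, k=2)
print(centroids)
print(clusters)
```

On data this well separated, the two clusters should match the two visible groups. On less tidy data the result depends on the initial centroids.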


Choosing Initial Centroids
Issue:
Final clusters are sensitive to initial centroids
Solution:
Run k-means multiple times with different random initial centroids, and choose the best result
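
A sketch of the multiple-runs idea. To decide which run is "best", it assumes the within-cluster sum of squared error (WSSE, defined on the next slide) as the score, with lower being better. The function names, the number of runs, and the data are illustrative assumptions.

```python
import math
import random

def run_k_means(samples, k, rng, max_iters=100):
    """One k-means run from random initial centroids; returns (centroids, wsse)."""
    centroids = rng.sample(samples, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for s in samples:
            idx = min(range(k), key=lambda i: math.dist(s, centroids[i]))
            clusters[idx].append(s)
        updated = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Score the run: squared distance of every sample to its closest centroid.
    wsse = sum(min(math.dist(s, c) for c in centroids) ** 2 for s in samples)
    return centroids, wsse

def best_of_n_runs(samples, k, n_runs=10, seed=0):
    """Repeat k-means with different random starts and keep the lowest-WSSE run."""
    rng = random.Random(seed)
    runs = [run_k_means(samples, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[1])

# Hypothetical data with three groups
samples = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
best_centroids, best_wsse = best_of_n_runs(samples, k=3)
print(best_centroids)
print(best_wsse)
```

Because all runs share one random generator, each run draws different initial centroids, while the fixed seed keeps the whole experiment repeatable.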

Evaluating Cluster Results
error = distance between sample & centroid
squared error = error²
Sum of squared errors between all samples & centroid
Sum over all clusters = WSSE (Within-Cluster Sum of Squared Error)
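
The WSSE definition translates almost line for line into code. This sketch assumes Euclidean distance and that each cluster is stored as a list of points alongside its centroid. The data are made up so the arithmetic is easy to check by hand.

```python
import math

def wsse(clusters, centroids):
    """Within-cluster sum of squared error over all clusters."""
    total = 0.0
    for cluster, centroid in zip(clusters, centroids):
        for sample in cluster:
            error = math.dist(sample, centroid)  # distance between sample & centroid
            total += error ** 2                  # squared error
    return total

# Hypothetical clusters: every sample is exactly 1 unit from its centroid
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(8.0, 8.0), (8.0, 10.0)]]
centroids = [(2.0, 1.0), (8.0, 9.0)]
print(wsse(clusters, centroids))  # 4.0
```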

Using WSSE
WSSE1 < WSSE2: WSSE1 is better numerically
Caveats:
• Does not mean that cluster set 1 is more ‘correct’ than cluster set 2
• Increasing k reduces the best achievable WSSE (down to 0 when every sample is its own cluster), so a lower WSSE alone does not show that a larger k is better

Choosing Value for k
• Approaches (k = ?):
    • Visualization
    • Application-Dependent
    • Data-Driven

Elbow Method for Choosing k
[Figure: plot in which the “elbow” suggests the value for k should be 3]
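
A sketch of the data-driven approach behind the elbow method: run k-means for several values of k, record the WSSE for each, and look for the k after which the WSSE stops dropping sharply. The helper names and the data (made up with three groups) are assumptions. In practice the values would usually be plotted rather than printed.

```python
import math
import random

def k_means_wsse(samples, k, rng, max_iters=100):
    """Run k-means once from random initial centroids and return its WSSE."""
    centroids = rng.sample(samples, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for s in samples:
            idx = min(range(k), key=lambda i: math.dist(s, centroids[i]))
            clusters[idx].append(s)
        updated = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(s, c) for c in centroids) ** 2 for s in samples)

def wsse_curve(samples, k_values, n_runs=10, seed=0):
    """Lowest WSSE over n_runs random starts for each candidate k."""
    rng = random.Random(seed)
    return {
        k: min(k_means_wsse(samples, k, rng) for _ in range(n_runs))
        for k in k_values
    }

# Hypothetical data with three groups
samples = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
curve = wsse_curve(samples, range(1, 7))
for k, value in curve.items():
    print(f"k={k}  WSSE={value:.2f}")
```

For data like this, the WSSE should fall steeply up to k = 3 and change little afterwards. That bend is the "elbow".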

Stopping Criteria
When to stop iterating?
• No changes to centroids
• Number of samples changing clusters is below threshold
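
Both criteria can be expressed as a small check that runs at the end of each iteration. The `should_stop` helper, its parameter names, and the representation of assignments as lists of cluster indices are assumptions made for illustration.

```python
def should_stop(old_centroids, new_centroids, old_labels, new_labels, threshold=1):
    """Return True if either stopping criterion is met."""
    # Criterion 1: no changes to centroids.
    if old_centroids == new_centroids:
        return True
    # Criterion 2: number of samples changing clusters is below threshold.
    n_changed = sum(1 for a, b in zip(old_labels, new_labels) if a != b)
    return n_changed < threshold

# Centroids moved and two samples switched clusters
old_c = [(0.0, 0.0), (5.0, 5.0)]
new_c = [(0.5, 0.0), (5.0, 5.0)]
print(should_stop(old_c, new_c, [0, 0, 1, 1], [0, 1, 1, 0], threshold=2))  # False
print(should_stop(old_c, new_c, [0, 0, 1, 1], [0, 1, 1, 0], threshold=3))  # True
```

Implementations commonly also cap the number of iterations so the loop always ends. That cap is an addition here, not something stated on the slide.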

Interpreting Results
• Examine cluster centroids
• How are clusters different?
Compare centroids to see how clusters are different

K-Means Summary
• Classic algorithm for cluster analysis
• Simple to understand and implement and is efficient
• Value of k must be specified
• Final clusters are sensitive to initial centroids
