Clustering_Course_Slides

The document provides an overview of cluster analysis in Python, focusing on its goal to organize similar items into groups and its applications in various fields. It explains the k-means clustering algorithm, including the steps involved, the importance of initial centroids, and methods for evaluating and choosing the number of clusters (k). The document emphasizes that cluster analysis is unsupervised, requiring interpretation of results for meaningful insights.

Uploaded by

Autisticsad

Python for Data Science

Machine Learning in Python:

Clustering
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Articulate the goal of cluster analysis
§ Discuss whether cluster analysis is supervised or unsupervised
§ List some ways that cluster results can be applied

Cluster Analysis Overview
Goal: Organize similar items into groups

Cluster Analysis Examples
• Segment customer base into groups
• Characterize different weather patterns for a region
• Group news articles into topics
• Discover crime hot spots
Cluster Analysis
• Divides data into clusters
• Similar items are placed in same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized
Similarity Measures
• Euclidean Distance
• Manhattan Distance
• Cosine Similarity
[Figure: points A and B compared under Euclidean distance and Manhattan distance]
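The three measures named on this slide can be sketched in plain Python. This is a minimal illustration using the standard textbook definitions; the function names and the points A and B are made up for the example and do not come from the slides.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (grid distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical points, chosen only for illustration.
A = (1.0, 2.0)
B = (4.0, 6.0)
print(euclidean_distance(A, B))  # 5.0
print(manhattan_distance(A, B))  # 7.0
print(cosine_similarity(A, B))
```

Note that the two distances get smaller as items become more similar, while cosine similarity gets larger.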
Normalizing Input Variables
[Figure: weight and height shown as scaled values]
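One common way to put variables such as weight and height on a comparable scale is to standardize each one to zero mean and unit standard deviation. The slide does not say which scaling method is intended, so the sketch below is an assumption, and the sample values are invented.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical measurements in very different units and ranges.
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]

scaled_weights = standardize(weights_kg)
scaled_heights = standardize(heights_m)
print(scaled_weights)
print(scaled_heights)
```

Without scaling, a distance computed on the raw values would be dominated by weight, simply because its numeric range is larger.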
Cluster Analysis Notes
• Unsupervised
• There is no ‘correct’ clustering
• Clusters don’t come with labels
• Interpretation and analysis required to make sense of clustering results!
Uses of Cluster Results
• Data segmentation
• Analysis of each segment can provide insights
[Figure: segments labeled science fiction, non-fiction, and children’s]
Uses of Cluster Results
• Categories for classifying new data
• New sample assigned to closest cluster
• Label of closest cluster used to classify new sample
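Assigning a new sample to the closest cluster can be sketched as a nearest-centroid lookup. The centroid coordinates below are invented, Euclidean distance is an assumption, and the labels are ones an analyst would have attached after interpreting the clusters.

```python
import math

# Hypothetical centroids with analyst-assigned labels.
centroids = {
    "science fiction": (8.0, 1.0),
    "non-fiction": (2.0, 7.0),
    "children's": (1.0, 1.0),
}


def classify(sample, labeled_centroids):
    """Return the label of the centroid closest to sample."""
    return min(
        labeled_centroids,
        key=lambda label: math.dist(sample, labeled_centroids[label]),
    )


new_sample = (7.0, 2.0)
print(classify(new_sample, centroids))  # science fiction
```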
Uses of Cluster Results
• Labeled data for classification
• Cluster samples used as labeled data
[Figure: labeled samples for science fiction customers]
Uses of Cluster Results
• Basis for anomaly detection
• Cluster outliers are anomalies
[Figure: anomalies that require further analysis]
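One simple way to turn clusters into an anomaly detector is to flag samples that lie far from every centroid. The slide does not prescribe a rule, so the distance threshold, the function name, and the data below are all assumptions made for illustration.

```python
import math


def find_anomalies(samples, centroids, threshold):
    """Return samples whose distance to the nearest centroid exceeds threshold."""
    anomalies = []
    for sample in samples:
        nearest = min(math.dist(sample, c) for c in centroids)
        if nearest > threshold:
            anomalies.append(sample)
    return anomalies


# Hypothetical centroids and samples.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.5, 0.2), (9.6, 10.3), (5.0, 5.0), (0.1, -0.4)]
print(find_anomalies(samples, centroids, threshold=2.0))  # [(5.0, 5.0)]
```

The flagged samples are candidates for further analysis, not confirmed anomalies; a suitable threshold depends on the data and the application.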
Cluster Analysis Summary
• Organize similar items into groups
• Analyzing clusters often leads to useful insights about data
• Clusters require analysis and interpretation
Python for Data Science

Machine Learning in Python:

k-Means Clustering
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Describe the steps in the k-means algorithm
§ Explain what the ‘k’ stands for in k-means
§ Define cluster centroid

Cluster Analysis
• Divides data into clusters
• Similar items are in same cluster
• Intra-cluster differences are minimized
• Inter-cluster differences are maximized
k-Means Algorithm
Select k initial centroids (cluster centers)
Repeat
    Assign each sample to closest centroid
    Calculate mean of cluster to determine new centroid
Until some stopping criterion is reached
[Figure legend: X marks a centroid]
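The steps above can be sketched in plain Python. This is a minimal illustration, not the course's implementation: it assumes Euclidean distance, picks the initial centroids as random samples, and stops when the centroids no longer change. The names `k_means`, `max_iterations`, and `seed` are chosen for this example only.

```python
import math
import random


def closest_centroid(sample, centroids):
    """Index of the centroid nearest to sample (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(sample, centroids[j]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(samples, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct samples picked at random.
    centroids = rng.sample(samples, k)
    for _ in range(max_iterations):
        # Assign each sample to its closest centroid.
        labels = [closest_centroid(s, centroids) for s in samples]
        # Recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [s for s, label in zip(samples, labels) if label == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Stopping criterion: no change to the centroids.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [closest_centroid(s, centroids) for s in samples]
    return centroids, labels


# Hypothetical data: two well-separated groups of points.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(samples, k=2)
print(centroids)
print(labels)
```

Which cluster ends up with index 0 or 1 depends on the random initial centroids, so the label numbers themselves carry no meaning.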
k-Means
[Figure: six panels illustrating k-means. (a) Original samples, (b) Initial centroids, (c) Assign samples, (d) Re-calculate centroids, (e) Assign samples, (f) Re-calculate centroids]

Choosing Initial Centroids
Issue:
Final clusters are sensitive to initial centroids
Solution:
Run k-means multiple times with different random initial centroids, and choose best results
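The multiple-runs idea can be sketched as follows. The sketch assumes that "best" means the lowest sum of squared distances from each sample to its nearest centroid; the compact `k_means` helper, the function `best_of_n_runs`, and the parameter `n_runs` are hypothetical names written for this example.

```python
import math
import random


def k_means(samples, k, rng, max_iterations=100):
    """Compact k-means: random samples as initial centroids, Euclidean distance."""
    centroids = rng.sample(samples, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: math.dist(s, centroids[j]))
            clusters[nearest].append(s)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wsse(samples, centroids):
    """Sum of squared distances from each sample to its nearest centroid."""
    return sum(min(math.dist(s, c) for c in centroids) ** 2 for s in samples)


def best_of_n_runs(samples, k, n_runs=10, seed=0):
    """Run k-means n_runs times with different random starts; keep the lowest-error run."""
    rng = random.Random(seed)
    runs = [k_means(samples, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda centroids: wsse(samples, centroids))


# Hypothetical data: two well-separated groups of points.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
best = best_of_n_runs(samples, k=2)
print(best, wsse(samples, best))
```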
Evaluating Cluster Results
error = distance between sample & centroid
squared error = error²
Sum of squared errors between all samples & centroid
Sum over all clusters: WSSE (Within-Cluster Sum of Squared Error)
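The WSSE calculation described above can be sketched directly. The dictionary layout (centroid mapped to its assigned samples) and the data are assumptions made for this example; Euclidean distance is used as the error.

```python
import math


def wsse(clusters):
    """Within-cluster sum of squared error.

    clusters maps each centroid to the list of samples assigned to it.
    """
    total = 0.0
    for centroid, members in clusters.items():
        for sample in members:
            error = math.dist(sample, centroid)  # distance between sample and centroid
            total += error ** 2                  # squared error
    return total


# Hypothetical clustering: two clusters with two samples each.
clusters = {
    (1.0, 1.0): [(0.0, 1.0), (2.0, 1.0)],
    (5.0, 5.0): [(5.0, 3.0), (5.0, 7.0)],
}
print(wsse(clusters))  # 10.0
```

Here the first cluster contributes 1 + 1 and the second contributes 4 + 4, giving 10 in total.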
Using WSSE
WSSE1 < WSSE2: WSSE1 is better numerically
Caveats:
• Does not mean that cluster set 1 is more ‘correct’ than cluster set 2
• Larger values for k will always reduce WSSE
Choosing Value for k
Approaches:
• Visualization
• Application-Dependent
• Data-Driven
Elbow Method for Choosing k
“Elbow” suggests value for k should be 3
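The numbers behind an elbow plot can be produced by running k-means for a range of k values and recording the error for each. The sketch below is an illustration under assumptions: it uses a compact hypothetical `k_means` helper, takes the lowest error over several random starts for each k, and runs on invented data with three visible groups.

```python
import math
import random


def k_means(samples, k, rng, max_iterations=100):
    """Compact k-means: random samples as initial centroids, Euclidean distance."""
    centroids = rng.sample(samples, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: math.dist(s, centroids[j]))
            clusters[nearest].append(s)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wsse(samples, centroids):
    """Sum of squared distances from each sample to its nearest centroid."""
    return sum(min(math.dist(s, c) for c in centroids) ** 2 for s in samples)


def wsse_by_k(samples, k_values, n_runs=10, seed=0):
    """For each k, the lowest WSSE found over n_runs random starts."""
    rng = random.Random(seed)
    curve = {}
    for k in k_values:
        runs = (k_means(samples, k, rng) for _ in range(n_runs))
        curve[k] = min(wsse(samples, centroids) for centroids in runs)
    return curve


# Hypothetical data with three visible groups.
samples = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
    (8.0, 1.0), (8.0, 2.0), (9.0, 1.0), (9.0, 2.0),
    (4.5, 8.0), (5.5, 8.0), (4.5, 9.0), (5.5, 9.0),
]
curve = wsse_by_k(samples, range(1, 7))
for k, value in curve.items():
    print(k, round(value, 2))
```

Plotting these values against k and looking for the point where the curve stops dropping sharply gives the "elbow"; reading it remains a judgment call.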
Stopping Criteria
When to stop iterating?
• No changes to centroids
• Number of samples changing clusters is below threshold
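Both criteria can be checked between two consecutive iterations. The function below is a sketch; `should_stop` and `change_threshold` are hypothetical names, and the inputs are the centroids and cluster assignments before and after one iteration.

```python
def should_stop(old_centroids, new_centroids, old_labels, new_labels, change_threshold):
    """Return True if either stopping criterion is met."""
    # Criterion 1: no changes to centroids.
    if new_centroids == old_centroids:
        return True
    # Criterion 2: number of samples changing clusters is below the threshold.
    changed = sum(1 for old, new in zip(old_labels, new_labels) if old != new)
    return changed < change_threshold


# Hypothetical iteration: centroids moved slightly and one sample switched clusters.
old_c = [(1.0, 1.0), (5.0, 5.0)]
new_c = [(1.2, 1.0), (5.0, 5.1)]
old_l = [0, 0, 1, 1, 1]
new_l = [0, 1, 1, 1, 1]
print(should_stop(old_c, new_c, old_l, new_l, change_threshold=2))  # True
print(should_stop(old_c, new_c, old_l, new_l, change_threshold=1))  # False
```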
Interpreting Results
• Examine cluster centroids
• How are clusters different?
• Compare centroids to see how clusters are different
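Comparing centroids feature by feature is one way to see what separates two clusters. In the sketch below the feature names and centroid values are invented, and the features are assumed to be on a common scale so that the differences are comparable.

```python
def centroid_differences(centroid_a, centroid_b, feature_names):
    """Per-feature difference between two centroids (a minus b)."""
    return {
        name: a - b
        for name, a, b in zip(feature_names, centroid_a, centroid_b)
    }


def most_different_feature(centroid_a, centroid_b, feature_names):
    """Name of the feature on which the two centroids differ the most."""
    diffs = centroid_differences(centroid_a, centroid_b, feature_names)
    return max(diffs, key=lambda name: abs(diffs[name]))


# Hypothetical scaled centroids for two weather-pattern clusters.
feature_names = ["temperature", "humidity"]
cluster_0 = (1.2, -0.5)
cluster_1 = (-0.8, -0.4)
print(centroid_differences(cluster_0, cluster_1, feature_names))
print(most_different_feature(cluster_0, cluster_1, feature_names))  # temperature
```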
K-Means Summary
• Classic algorithm for cluster analysis
• Simple to understand and implement and is efficient
• Value of k must be specified
• Final clusters are sensitive to initial centroids
