
Lecture 4: Clustering and KNN

Instructor: Ari Smith


October 10, 2023
Learning objectives

k-means clustering

Hierarchical / agglomerative clustering

Distance metrics and linkage criteria

K-nearest neighbours

Page 2
Netflix
The Netflix Prize

Predict user ratings for films using only previous ratings

Beat Cinematch by 10% and win $1,000,000


• Progress Prize of $50,000 awarded each year a team improved on the previous year's best result by at least 1%

Competition began on October 6, 2006


• 100,000,000 observations in the training set
• 1,500,000 in the validation set
• 1,500,000 in the test set

Page 4
The Netflix Prize

2006
• WXYZConsulting beat Cinematch on Oct 8
• UofT (led by Prof. Hinton) emerged as an early leader

2007
• 40,000 teams from 186 countries
• BellKor beat Cinematch by 8.43%

2008
• An ensemble of BellKor and BigChaos beat Cinematch by 9.54%

Page 5
The Netflix Prize

The Winner
• BellKor’s Pragmatic Chaos beat Cinematch by 10.06%
• Declared the winner on September 18, 2009
• Ensemble of three teams

Page 6
User groups

In 2016, Netflix stopped segmenting users by geography

Users are now clustered into 1300 “taste-communities”

Cluster 290:
• Titles like: Black Mirror, Lost, and Groundhog Day

Page 7
The basics of clustering
Types of clustering

Clustering is an unsupervised learning algorithm


• Partition data into groups / clusters such that the observations within a cluster are similar

Two popular types:

1. k-means clustering

2. Hierarchical / agglomerative clustering

Page 9
How do we define “similar”?

We use distance to determine if two observations are similar

Define an observation with $F$ features:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iF})^T$$

A valid distance metric $d$ must satisfy:

Non-negativity: $d(\mathbf{x}_1, \mathbf{x}_2) \ge 0$, and $d(\mathbf{x}_1, \mathbf{x}_2) = 0$ iff $\mathbf{x}_1 = \mathbf{x}_2$

Symmetry: $d(\mathbf{x}_1, \mathbf{x}_2) = d(\mathbf{x}_2, \mathbf{x}_1)$

Triangle inequality: $d(\mathbf{x}_1, \mathbf{x}_2) + d(\mathbf{x}_2, \mathbf{x}_3) \ge d(\mathbf{x}_1, \mathbf{x}_3)$

Page 10
Distance metrics
Euclidean:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \left( \sum_{f=1}^{F} (x_{1f} - x_{2f})^2 \right)^{1/2}$$

Manhattan:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|$$

Chebychev:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty = \max_{f=1,\ldots,F} |x_{1f} - x_{2f}|$$

Page 11
Distance metrics

Minkowski:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_p = \left( \sum_{f=1}^{F} |x_{1f} - x_{2f}|^p \right)^{1/p}$$

Hamming:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{f=1}^{F} \mathbb{I}(x_{1f} \ne x_{2f})$$
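As a quick reference, all five metrics can be computed directly with NumPy; this is a minimal sketch, with two made-up feature vectors for illustration:

```python
import numpy as np

x1 = np.array([5.1, 3.5, 1.4, 0.2])   # illustrative feature vectors
x2 = np.array([6.3, 3.3, 4.7, 1.6])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))           # L2 norm of the difference
manhattan = np.sum(np.abs(x1 - x2))                   # L1 norm
chebychev = np.max(np.abs(x1 - x2))                   # L-infinity norm
p = 3
minkowski = np.sum(np.abs(x1 - x2) ** p) ** (1 / p)   # general Lp norm
hamming = np.sum(x1 != x2)                            # count of differing features (meant for binary/categorical data)

print(euclidean, manhattan, chebychev, minkowski, hamming)
```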

Page 12
Index sets and centroids

Index set: includes the IDs of all observations in a cluster

$S_k = \{1, 3, 7, 21, 44\}$

Centroid: the “center” or “representative point” of each cluster

$$\mathbf{s}_k = \frac{1}{|S_k|} \sum_{i \in S_k} \mathbf{x}_i$$
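A centroid is simply the feature-wise mean over the cluster's index set; a minimal NumPy sketch (the data matrix and index set below are made up for illustration):

```python
import numpy as np

X = np.random.rand(50, 4)        # illustrative data: 50 observations, F = 4 features
S_k = [1, 3, 7, 21, 44]          # index set of cluster k

s_k = X[S_k].mean(axis=0)        # centroid: feature-wise mean of the cluster's observations
print(s_k)                       # vector of length F
```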

Page 13
Cluster distances

Intra-cluster distance: distance between two points in the same cluster

Inter-cluster distance: distance between two points in different clusters

Page 14
k-means clustering
Basics

Partition observations into k clusters such that the total distance between each observation and its assigned cluster centroid is minimized

Hyperparameters
• k – number of clusters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric

Can be written as an integer programming problem – NP-hard!


• Use heuristic algorithms

Page 16
Lloyd’s Algorithm

1. Randomly initialize k centroids

2. Assign each observation to its closest centroid using the distance metric

3. Recompute the centroid of each cluster

4. Stop if there is no change in the centroids. Otherwise, return to step 2.

Repeat process with many different initializations!
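A minimal NumPy sketch of Lloyd's algorithm with Euclidean distance; the data, the value of k, and the iteration cap are illustrative assumptions (a production implementation would also handle empty clusters):

```python
import numpy as np

def lloyds_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k centroids by picking k distinct observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each observation to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute the centroid of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop if there is no change in the centroids; otherwise repeat
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.rand(150, 4)                  # illustrative data
labels, centroids = lloyds_kmeans(X, k=3)   # in practice: rerun with several seeds, keep the best
```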

Page 17
Fisher’s Iris dataset
Overview

Introduced by Sir Ronald Fisher in 1936; the measurements were collected by botanist Edgar Anderson


• Professor of Eugenics at University College London

50 observations from each of three species of Iris flowers


• Setosa
• Virginica
• Versicolor

4 features
• Petal: length and width
• Sepal: length and width
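The k-means labels shown on the following slides can be reproduced (up to cluster relabeling) with scikit-learn; a minimal sketch, where n_clusters=3 and the random seed are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y_true = iris.data, iris.target          # 150 observations, 4 features, 3 species

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
y_kmeans = kmeans.labels_                   # cluster assignments (cluster IDs are arbitrary)

# Compare cluster assignments with the true species labels
print(pd.crosstab(y_true, y_kmeans, rownames=["species"], colnames=["cluster"]))
```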

Page 19
Overview

Page 20
Visualization – no labels

Page 21
Visualization – true labels

Page 22
Visualization – k-means labels

Page 23
How do we determine the number of clusters?

Create an elbow plot


• Number of clusters vs total intra-cluster distance

Choose the number of clusters corresponding to the “elbow”
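A sketch of an elbow plot using scikit-learn's inertia_ attribute (the total within-cluster squared distance); using the Iris data and k = 1 to 10 is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total intra-cluster (squared) distance")
plt.show()    # pick the k where the curve bends ("elbow")
```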

Page 24
Hierarchical / agglomerative clustering
Basics

Build a hierarchy of clusters by repeatedly merging the closest pair of clusters until only one cluster remains

Hyperparameters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
• $d(S_1, S_2)$ – linkage criterion

Page 26
Algorithm

1. Initialize each observation as its own cluster

2. Merge the two closest clusters according to the chosen distance metric / linkage criterion combination

3. Continue until there is only one cluster (or a stopping criterion is met)

Page 27
Linkage criteria

Centroid:
$$d(S_1, S_2) = d(\mathbf{s}_1, \mathbf{s}_2)$$

Minimum:
$$d(S_1, S_2) = \min_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$

Maximum:
$$d(S_1, S_2) = \max_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$

Page 28
Linkage criteria

Average:
$$d(S_1, S_2) = \frac{1}{|S_1|\,|S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$

Minimum variance:
$$d(S_1, S_2) = \frac{|S_1|\,|S_2|}{|S_1| + |S_2|} \, \|\mathbf{s}_1 - \mathbf{s}_2\|_2^2$$
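These criteria correspond to scipy's linkage methods ('centroid', 'single' for minimum, 'complete' for maximum, 'average', and 'ward' for minimum variance); a minimal sketch producing a dendrogram like the ones on the following slides (using the Iris data is an assumption):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# Agglomerative clustering with Euclidean distance and Ward (minimum variance) linkage
Z = linkage(X, method="ward", metric="euclidean")

dendrogram(Z)
plt.xlabel("Observation")
plt.ylabel("Merge distance")
plt.show()
```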

Page 29
Dendrogram – Iris dataset

Page 30
Dendrogram – Iris dataset

Page 31
DailyKos
Overview

Internet blog, forum, and news site devoted to the Democratic Party and
liberal politics

Obtained 3430 articles with 1545 features from Fall 2004


• Each feature is a binary variable corresponding to a word

What were the hot topics on DailyKos at the time?
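One way to answer this is to cluster the binary word indicators and inspect the most common words in each cluster; a hedged sketch, assuming the articles sit in a CSV with one binary column per word (the file name dailykos.csv and the choice of 7 clusters are assumptions, not from the slides):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumed layout: one row per article, one binary (0/1) column per word
docs = pd.read_csv("dailykos.csv")

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(docs)
docs["cluster"] = kmeans.labels_

# Top 5 words per cluster: the words with the highest average indicator value
for c, group in docs.groupby("cluster"):
    top5 = group.drop(columns="cluster").mean().sort_values(ascending=False).head(5)
    print(f"Cluster {c}: {list(top5.index)}")
```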

Page 33
Hierarchical clustering dendrogram

Page 34
Articles per cluster

Hierarchical vs. k-means

Page 35
Top 5 words in each cluster
Hierarchical vs. k-means

Page 36
K-nearest neighbors
Overview

Simple, intuitive, and widely used method that can capture complex non-linear relationships

Two types:

1. Classification: majority vote of the K-nearest neighbors

2. Regression: weighted average of the K-nearest neighbors

Page 38
Hyperparameters

K – the number of nearest neighbors


• Can range from 1 to n (all observations)

$d(\mathbf{x}_i, \mathbf{x}_j)$ – the distance metric


• Chosen from the same metrics used for clustering

$w_i$ – the weighting used for each neighbor


• Equal: each neighbor is weighted equally
• Distance: each neighbor is weighted by the inverse of its distance (closer neighbors count more)
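These hyperparameters map directly onto scikit-learn's KNN estimators; a minimal sketch on the Iris data (the specific parameter values and train/test split are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# K = 5 neighbors, Euclidean distance, inverse-distance weighting
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="distance")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # classification accuracy on the held-out set
```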

Page 39
Algorithm
Given $n$ observations with features $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and targets $(y_1, \ldots, y_n)$
• Predict for a new observation $\mathbf{x}_p$

1. Compute $d(\mathbf{x}_i, \mathbf{x}_p)$ for $i = 1, \ldots, n$ and let $N_p$ index the K nearest neighbors

2. Compute the prediction:
$$\hat{y}_p = \sum_{i \in N_p} w_i \, y_i$$
where
$$w_i = \frac{1 / d(\mathbf{x}_i, \mathbf{x}_p)}{\sum_{j \in N_p} 1 / d(\mathbf{x}_j, \mathbf{x}_p)} \text{ for distance weighting} \quad \text{OR} \quad w_i = \frac{1}{K} \text{ for uniform weighting}$$
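A minimal NumPy sketch of this prediction rule with Euclidean distance and inverse-distance weighting; the training data and query point below are made up for illustration:

```python
import numpy as np

def knn_predict(X, y, x_p, K=5):
    # 1. Distances from the query point to every training observation
    d = np.linalg.norm(X - x_p, axis=1)
    N_p = np.argsort(d)[:K]                 # indices of the K nearest neighbors

    # 2. Inverse-distance weights, normalized to sum to 1
    #    (a tiny epsilon guards against division by zero for exact matches)
    inv = 1.0 / (d[N_p] + 1e-12)
    w = inv / inv.sum()
    return np.sum(w * y[N_p])               # weighted average of the neighbors' targets

X = np.random.rand(100, 4)                  # illustrative training features
y = np.random.rand(100)                     # illustrative regression targets
x_p = np.random.rand(4)                     # new observation
print(knn_predict(X, y, x_p, K=5))
```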

Page 40
Applied to the Iris dataset

Page 41
Applied to the Iris dataset

Page 42
Applied to the Iris dataset

Page 43
