
Lecture 4: Clustering and KNN

Instructor: Ari Smith


October 10, 2023
Learning objectives

k-means clustering

Hierarchical / agglomerative clustering

Distance metrics and linkage criteria

K-nearest neighbours

Page 2
Netflix
The Netflix Prize

Predict user ratings for films using only previous ratings

Beat Cinematch by 10% and win $1,000,000


• Progress Prize of $50,000 awarded each year a team improved on the previous year's best result by at least 1%

Competition began on October 6, 2006


• 100,000,000 observations in the training set
• 1,500,000 in the validation set
• 1,500,000 in the test set

Page 4
The Netflix Prize

2006
• WXYZConsulting beat Cinematch on Oct 8
• UofT (led by Prof. Hinton) emerged as an early leader

2007
• 40,000 teams from 186 countries
• BellKor beat Cinematch by 8.43%

2008
• An ensemble of BellKor and BigChaos beat Cinematch by 9.54%

Page 5
The Netflix Prize

The Winner
• BellKor’s Pragmatic Chaos beat Cinematch by 10.06%
• Declared the winner on September 18, 2009
• Ensemble of three teams

Page 6
User groups

In 2016, Netflix stopped segmenting users by geography

Users are now clustered into 1300 “taste-communities”

Cluster 290:
• Titles like: Black Mirror, Lost, and Groundhog Day

Page 7
The basics of clustering
Types of clustering

Clustering is an unsupervised learning algorithm


• Partition data into groups / clusters such that the observations within a cluster are similar

Two popular types:

1. k-means clustering

2. Hierarchical / agglomerative clustering

Page 9
How do we define “similar”?

We use distance to determine if two observations are similar

Define an observation with $F$ features:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iF})^T$$

A valid distance metric $d$ must satisfy:

Non-negativity: $d(\mathbf{x}_1, \mathbf{x}_2) \ge 0$, and $d(\mathbf{x}_1, \mathbf{x}_2) = 0$ iff $\mathbf{x}_1 = \mathbf{x}_2$

Symmetry: $d(\mathbf{x}_1, \mathbf{x}_2) = d(\mathbf{x}_2, \mathbf{x}_1)$

Triangle inequality: $d(\mathbf{x}_1, \mathbf{x}_2) + d(\mathbf{x}_2, \mathbf{x}_3) \ge d(\mathbf{x}_1, \mathbf{x}_3)$

Page 10
Distance metrics
Euclidean:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \left( \sum_{f=1}^{F} (x_{1f} - x_{2f})^2 \right)^{1/2}$$

Manhattan:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|$$

Chebychev:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty = \max_{f=1,\ldots,F} |x_{1f} - x_{2f}|$$

Page 11
Distance metrics

Minkowski:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_p = \left( \sum_{f=1}^{F} |x_{1f} - x_{2f}|^p \right)^{1/p}$$

Hamming:
$$d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{f=1}^{F} \mathbb{I}(x_{1f} \ne x_{2f})$$
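As a quick reference, all five metrics can be computed directly with NumPy; this is a minimal sketch, with two made-up feature vectors for illustration:

```python
import numpy as np

x1 = np.array([5.1, 3.5, 1.4, 0.2])   # illustrative feature vectors
x2 = np.array([6.3, 3.3, 4.7, 1.6])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))           # L2 norm of the difference
manhattan = np.sum(np.abs(x1 - x2))                   # L1 norm
chebychev = np.max(np.abs(x1 - x2))                   # L-infinity norm
p = 3
minkowski = np.sum(np.abs(x1 - x2) ** p) ** (1 / p)   # general Lp norm
hamming = np.sum(x1 != x2)                            # count of differing features (meant for binary/categorical data)

print(euclidean, manhattan, chebychev, minkowski, hamming)
```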

Page 12
Index sets and centroids

Index set: includes the IDs of all observations in a cluster

$S_k = \{1, 3, 7, 21, 44\}$

Centroid: the “center” or “representative point” of each cluster

$$\mathbf{s}_k = \frac{1}{|S_k|} \sum_{i \in S_k} \mathbf{x}_i$$
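A centroid is simply the feature-wise mean over the cluster's index set; a minimal NumPy sketch (the data matrix and index set below are made up for illustration):

```python
import numpy as np

X = np.random.rand(50, 4)        # illustrative data: 50 observations, F = 4 features
S_k = [1, 3, 7, 21, 44]          # index set of cluster k

s_k = X[S_k].mean(axis=0)        # centroid: feature-wise mean of the cluster's observations
print(s_k)                       # vector of length F
```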

Page 13
Cluster distances

Intra-cluster distance: distance between two points in the same cluster

Inter-cluster distance: distance between two points in different clusters

Page 14
k-means clustering
Basics

Partition observations into k clusters such that the total distance between each observation and its assigned cluster centroid is minimized

Hyperparameters
• k – number of clusters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric

Can be written as an integer programming problem – NP-hard!


• Use heuristic algorithms

Page 16
Lloyd’s Algorithm

1. Randomly initialize k centroids

2. Assign each observation to its closest centroid using the distance metric

3. Recompute the centroid of each cluster

4. Stop if there is no change in the centroids. Otherwise, return to step 2.

Repeat process with many different initializations!
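A minimal NumPy sketch of Lloyd's algorithm with Euclidean distance; the data, the value of k, and the iteration cap are illustrative assumptions (a production implementation would also handle empty clusters):

```python
import numpy as np

def lloyds_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize k centroids by picking k distinct observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each observation to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute the centroid of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop if there is no change in the centroids; otherwise repeat
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.rand(150, 4)                  # illustrative data
labels, centroids = lloyds_kmeans(X, k=3)   # in practice: rerun with several seeds, keep the best
```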

Page 17
Fisher’s Iris dataset
Overview

Introduced by Sir Ronald Fisher in 1936; the measurements were collected by botanist Edgar Anderson


• Professor of Eugenics at University College London

50 observations from each of three species of Iris flowers


• Setosa
• Virginica
• Versicolor

4 features
• Petal: length and width
• Sepal: length and width
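The k-means labels shown on the following slides can be reproduced (up to cluster relabeling) with scikit-learn; a minimal sketch, where n_clusters=3 and the random seed are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y_true = iris.data, iris.target          # 150 observations, 4 features, 3 species

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
y_kmeans = kmeans.labels_                   # cluster assignments (cluster IDs are arbitrary)

# Compare cluster assignments with the true species labels
print(pd.crosstab(y_true, y_kmeans, rownames=["species"], colnames=["cluster"]))
```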

Page 19
Overview

Page 20
Visualization – no labels

Page 21
Visualization – true labels

Page 22
Visualization – k-means labels

Page 23
How do we determine the number of clusters?

Create an elbow plot


• Number of clusters vs total intra-cluster distance

Choose the number of clusters corresponding to the “elbow”
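A sketch of an elbow plot using scikit-learn's inertia_ attribute (the total within-cluster squared distance); using the Iris data and k = 1 to 10 is an assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total intra-cluster (squared) distance")
plt.show()    # pick the k where the curve bends ("elbow")
```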

Page 24
Hierarchical / agglomerative clustering
Basics

Build a hierarchy of clusters by repeatedly merging the closest pair of clusters until only one cluster remains

Hyperparameters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
• $d(S_1, S_2)$ – linkage criterion

Page 26
Algorithm

1. Initialize each observation as its own cluster

2. Merge the two closest clusters according to the chosen distance metric / linkage criterion combination

3. Continue until there is only one cluster (or a stopping criterion is met)

Page 27
Linkage criteria

Centroid:
$$d(S_1, S_2) = d(\mathbf{s}_1, \mathbf{s}_2)$$

Minimum:
$$d(S_1, S_2) = \min_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$

Maximum:
$$d(S_1, S_2) = \max_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$

Page 28
Linkage criteria

Average:
$$d(S_1, S_2) = \frac{1}{|S_1|\,|S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$$

Minimum variance:
$$d(S_1, S_2) = \frac{|S_1|\,|S_2|}{|S_1| + |S_2|} \, \|\mathbf{s}_1 - \mathbf{s}_2\|_2^2$$
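These criteria correspond to scipy's linkage methods ('centroid', 'single' for minimum, 'complete' for maximum, 'average', and 'ward' for minimum variance); a minimal sketch producing a dendrogram like the ones on the following slides (using the Iris data is an assumption):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris

X = load_iris().data

# Agglomerative clustering with Euclidean distance and Ward (minimum variance) linkage
Z = linkage(X, method="ward", metric="euclidean")

dendrogram(Z)
plt.xlabel("Observation")
plt.ylabel("Merge distance")
plt.show()
```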

Page 29
Dendrogram – Iris dataset

Page 30
Dendrogram – Iris dataset

Page 31
DailyKos
Overview

Internet blog, forum, and news site devoted to the Democratic Party and
liberal politics

Obtained 3430 articles with 1545 features from Fall 2004


• Each feature is a binary variable corresponding to a word

What were the hot topics on DailyKos at the time?
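One way to answer this is to cluster the binary word indicators and inspect the most common words in each cluster; a hedged sketch, assuming the articles sit in a CSV with one binary column per word (the file name dailykos.csv and the choice of 7 clusters are assumptions, not from the slides):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumed layout: one row per article, one binary (0/1) column per word
docs = pd.read_csv("dailykos.csv")

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(docs)
docs["cluster"] = kmeans.labels_

# Top 5 words per cluster: the words with the highest average indicator value
for c, group in docs.groupby("cluster"):
    top5 = group.drop(columns="cluster").mean().sort_values(ascending=False).head(5)
    print(f"Cluster {c}: {list(top5.index)}")
```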

Page 33
Hierarchical clustering dendrogram

Page 34
Articles per cluster

Hierarchical vs. k-means

Page 35
Top 5 words in each cluster
Hierarchical vs. k-means

Page 36
K-nearest neighbors
Overview

Simple, intuitive, and widely used method that can capture complex non-linear relationships

Two types:

1. Classification: majority vote of the K-nearest neighbors

2. Regression: weighted average of the K-nearest neighbors

Page 38
Hyperparameters

K – the number of nearest neighbors


• Can range from 1 to n (all observations)

$d(\mathbf{x}_i, \mathbf{x}_j)$ – the distance metric


• Chosen from the same metrics used for clustering

$w_i$ – the weighting used for each neighbor


• Equal: each neighbor is weighted equally
• Distance: each neighbor is weighted by the inverse of its distance (closer neighbors count more)
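These hyperparameters map directly onto scikit-learn's KNN estimators; a minimal sketch on the Iris data (the specific parameter values and train/test split are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# K = 5 neighbors, Euclidean distance, inverse-distance weighting
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights="distance")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # classification accuracy on the held-out set
```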

Page 39
Algorithm
Given $n$ observations with features $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and targets $(y_1, \ldots, y_n)$
• Predict for a new observation $\mathbf{x}_p$

1. Compute $d(\mathbf{x}_i, \mathbf{x}_p)$ for $i = 1, \ldots, n$ and let $N_p$ index the K nearest neighbors

2. Compute the prediction:
$$\hat{y}_p = \sum_{i \in N_p} w_i \, y_i$$
where
$$w_i = \frac{1 / d(\mathbf{x}_i, \mathbf{x}_p)}{\sum_{j \in N_p} 1 / d(\mathbf{x}_j, \mathbf{x}_p)} \text{ for distance weighting} \quad \text{OR} \quad w_i = \frac{1}{K} \text{ for uniform weighting}$$
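A minimal NumPy sketch of this prediction rule with Euclidean distance and inverse-distance weighting; the training data and query point below are made up for illustration:

```python
import numpy as np

def knn_predict(X, y, x_p, K=5):
    # 1. Distances from the query point to every training observation
    d = np.linalg.norm(X - x_p, axis=1)
    N_p = np.argsort(d)[:K]                 # indices of the K nearest neighbors

    # 2. Inverse-distance weights, normalized to sum to 1
    #    (a tiny epsilon guards against division by zero for exact matches)
    inv = 1.0 / (d[N_p] + 1e-12)
    w = inv / inv.sum()
    return np.sum(w * y[N_p])               # weighted average of the neighbors' targets

X = np.random.rand(100, 4)                  # illustrative training features
y = np.random.rand(100)                     # illustrative regression targets
x_p = np.random.rand(4)                     # new observation
print(knn_predict(X, y, x_p, K=5))
```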

Page 40
Applied to the Iris dataset

Page 41
Applied to the Iris dataset

Page 42
Applied to the Iris dataset

Page 43
