Week 4 v1.1 (hidden) - Supervised Learning (Classification)
(WEEK 4, 2024)
SUPERVISED LEARNING (CLASSIFICATION)
DR ANTHONY CONSTANTINOU
SCHOOL OF ELECTRONIC ENGINEERING AND COMPUTER SCIENCE
TIMETABLE
LECTURE OVERVIEW
WEEK 4 LAB OVERVIEW
Scikit-Learn
What is Scikit-Learn?
▪ A Python library (an extension of SciPy) for machine learning.
▪ Built on SciPy and NumPy: provides access to classic machine
learning algorithms for classification, regression and clustering.
▪ Dimensionality reduction with Principal Component Analysis
(PCA).
▪ Linear and non-linear (including logistic) regression.
▪ K-nearest neighbour (K-NN).
▪ Support vector machines (SVMs).
▪ K-means clustering.
▪ Gaussian Mixture Models (GMMs).
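To make the workflow concrete, here is a minimal sketch (illustrative only; the dataset and parameter choices are assumptions, not part of the lab) that loads the iris dataset bundled with scikit-learn, fits a K-NN classifier and reports accuracy.

```python
# Minimal scikit-learn workflow sketch: load a bundled dataset, split it,
# fit a classifier and report accuracy on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # features and categorical target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)            # a classic classification algorithm
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```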
WHAT IS CLASSIFICATION?
▪ Given data containing input features 𝑥 and a target feature 𝑦, where:
▪ 𝑦 is a categorical variable;
▪ 𝑥 is a set of different (input) features.
[Figure: scatter plot of labelled data points 𝑦1, 𝑦2, 𝑦3, …, 𝑦𝑛 over two features, 𝑥1 (e.g., length) and 𝑥2 (e.g., weight)]
CLASSIFICATION WITH LOGISTIC REGRESSION
▪ Logistic regression is the counterpart of linear regression for the case where the dependent variable 𝑦 is categorical (e.g., car/train/bike) instead of numerical (e.g., 10/15/20).
[Figure: scatter plot of data points 𝑦1, …, 𝑦𝑛 over features 𝑥1 (e.g., length) and 𝑥2 (e.g., weight)]
HOW TO SEARCH EFFECTIVELY?
Which line is more accurate and why?
[Figure: four candidate separating lines, labelled A, B, C and D]
VISUALISING FEATURE IMPORTANCE
▪ Which feature is more important for classification:
▪ Feature 1 or Feature 2?
[Figure: two-class scatter plot over Feature 1 and Feature 2]
LINEAR CLASSIFICATION
▪ Data may be linearly separable if the dimensionality of the data is large relative to the sample size.
▪ The more dimensions we have, and the fewer samples those dimensions contain, the easier it is to find a straight line that separates the data.
▪ But high dimensionality may lead to overfitting (illustrated in the sketch below).
▪ While having more samples makes it harder to find a linear separation, the increased number of samples helps produce a linear separation that is more accurate on previously unseen cases.
[Figure: linearly separable two-class scatter plot over Feature 1 (e.g., length) and Feature 2 (e.g., weight)]
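A small sketch of the point above, using made-up random data (20 samples and 200 dimensions are arbitrary choices): even randomly labelled data is usually linearly separable when dimensions greatly outnumber samples, yet the fitted separator does not generalise.

```python
# With far more dimensions than samples, even randomly labelled data is
# usually linearly separable (training accuracy near 1.0), but the fitted
# line is pure overfitting and performs at chance level on new data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_dims = 20, 200                      # few samples, many dimensions
X = rng.normal(size=(n_samples, n_dims))
y = rng.integers(0, 2, size=n_samples)           # random (meaningless) labels

clf = LogisticRegression(max_iter=5000).fit(X, y)
print("Training accuracy:", clf.score(X, y))     # usually 1.0: linearly separable

X_new = rng.normal(size=(1000, n_dims))          # fresh data from the same source
y_new = rng.integers(0, 2, size=1000)
print("Accuracy on new data:", clf.score(X_new, y_new))  # around 0.5: overfitting
```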
HOW TO SEARCH EFFECTIVELY?
Find the line that is as distant as possible from each class.
▪ Logistic regression relies on the sigmoid function, also known as the
logistic function, for binary classification.
▪ It takes an input 𝑥 and outputs a class probability value.
sigmoid(𝑥) = 𝑒^𝑥 / (𝑒^𝑥 + 1) = 1 / (1 + 𝑒^(−𝑥)), which outputs values between 0 and 1.
For two features, ŷ = sigmoid(𝑎𝑥1 + 𝑏𝑥2 + 𝑐) = 1 / (1 + 𝑒^(−(𝑎𝑥1 + 𝑏𝑥2 + 𝑐))).
[Figure: sigmoid curve over input 𝑥, with CLASS 0 at the low end and CLASS 1 at the high end]
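A small numeric sketch of the logistic function above; the coefficients 𝑎, 𝑏, 𝑐 and the input values are arbitrary illustrative choices.

```python
# Numeric sketch of the logistic (sigmoid) function defined above.
import numpy as np

def sigmoid(z):
    """Map any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5   -> on the decision boundary
print(sigmoid(3.0))    # ~0.95 -> strongly class 1
print(sigmoid(-3.0))   # ~0.05 -> strongly class 0

# Two-feature form used on the slide: y_hat = sigmoid(a*x1 + b*x2 + c).
# The coefficients and inputs below are made up for illustration.
a, b, c = 0.8, -0.5, 0.1
x1, x2 = 2.0, 1.0      # e.g. length and weight of a new data point
print(sigmoid(a * x1 + b * x2 + c))   # probability of class 1
```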
HOW TO SEARCH EFFECTIVELY?
▪ If a new data point 𝑥 is placed somewhere on the red line, then:
▪ ŷ = sigmoid(𝑥) = 0.5 = either class.
▪ This is because the red line is the decision boundary.
▪ If a new data point 𝑥 is placed somewhere on the blue line, then:
▪ ŷ = sigmoid(𝑥) ≈ 0.75 = class train.
▪ If a new data point 𝑥 is placed somewhere on the green line, then:
▪ ŷ = sigmoid(𝑥) ≈ 0.25 = class car.
[Figure: two-class scatter plot over Feature 1 (e.g., length) and Feature 2 (e.g., weight), showing the red decision boundary together with the blue and green lines]
Figure source: https://en.wikipedia.org/wiki/Sigmoid_function
REGRESSION VS CLASSIFICATION
Linear Regression
▪ Given data (𝑥, 𝑦), where 𝑦 is a set of numerical values.
▪ E.g., predict carPrice ŷ given 𝑥𝑖:
▪ carPrice = 𝑎 × carModel + 𝑏 × mileAge + 𝑐
[Figure: regression line fitted through points plotted as 𝑦 against 𝑥]
Logistic Regression
▪ Given data (𝑥, 𝑦), where 𝑦 is a set of categorical values.
▪ E.g., predict category ŷ given 𝑥𝑖:
▪ if sigmoid(𝑥) > 0.5 then: car
▪ else: train
[Figure: two-class scatter plot of 𝑦1, …, 𝑦𝑛 over 𝑥1 (e.g., length) and 𝑥2 (e.g., weight)]
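A sketch of the contrast on synthetic toy data (the feature values, coefficients and class rule are made up): linear regression predicts a number, logistic regression predicts a category.

```python
# Linear regression predicts a numerical value; logistic regression
# predicts a categorical value. Toy data, purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))            # two features, e.g. length and weight

# Regression target: a numerical value (e.g. a price)
y_num = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)
reg = LinearRegression().fit(X, y_num)
print("Predicted value:", reg.predict([[4.0, 2.0]]))

# Classification target: a categorical value (e.g. car vs train)
y_cat = np.where(X[:, 0] + X[:, 1] > 10, "car", "train")
clf = LogisticRegression(max_iter=1000).fit(X, y_cat)
print("Predicted class:", clf.predict([[4.0, 2.0]]))
print("Class probabilities:", clf.predict_proba([[4.0, 2.0]]))
```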
MULTINOMIAL LOGISTIC CLASSIFICATION
▪ From binary (binomial) 𝑦 = {0, 1} to multinomial 𝑦 = {0, 1, …, 𝑘} (a sketch follows).
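A minimal sketch of multinomial logistic classification in scikit-learn, assuming the bundled iris dataset (3 classes) purely for illustration.

```python
# The same logistic regression model extended from two classes to k classes.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)   # handles y = {0, 1, ..., k} directly

print(clf.predict(X[:3]))          # predicted class labels
print(clf.predict_proba(X[:3]))    # one probability per class; each row sums to 1
```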
10 MINUTES BREAK
SUPPORT VECTOR CLASSIFICATION
Finds the support vectors that maximise the margin.
[Figure: two-class scatter plot with the maximum-margin decision boundary; the points lying on the margins are the support vectors]
SUPPORT VECTOR CLASSIFICATION
▪ Unlike Logistic Regression, which uses the sigmoid function to find the line that is as distant as possible from each class, Support Vector Classification (SVC) finds the decision boundary that is as distant as possible from the boundary points (the support vectors) of both classes.
▪ Support Vector Classification (SVC) returns a decision boundary that is similar to that returned by Logistic Regression.
▪ While the decision boundary will often not be identical between these two algorithms, their classification accuracy will often be identical. Why is that?
▪ One possible reason is that the dataset is not big enough: the two accuracies can only differ when test points fall in the narrow region between the two decision boundaries, and with less data there is less chance that any test point lands there.
[Figure: the decision boundaries of model 1 (Logistic Regression) and model 2 (Support Vector Classification) plotted on the same data]
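A sketch of the comparison above on synthetic data (the dataset and its parameters are arbitrary): the fitted boundaries usually differ slightly, yet the measured accuracies often coincide.

```python
# Fit Logistic Regression and a linear SVC on the same data and compare
# their test accuracies.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

log_reg = LogisticRegression().fit(X_train, y_train)
svc = SVC(kernel="linear").fit(X_train, y_train)

print("Logistic Regression accuracy:", log_reg.score(X_test, y_test))
print("SVC accuracy:                ", svc.score(X_test, y_test))
# The accuracies differ only when test points fall between the two
# boundaries, which is rare on a small dataset.
```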
K-NEAREST NEIGHBOUR
▪ K-NN is a well-known and simple machine learning algorithm for classification.
[Figure: scatter plots of data points 𝑦1, 𝑦2, 𝑦3, …, 𝑦𝑛 over two features (e.g., length and weight)]
Distance-based similarity.
▪ Euclidean distance: 𝐷(𝐱, 𝐲) = √(Σ𝑖 (𝑥𝑖 − 𝑦𝑖)²)
▪ Manhattan distance: 𝐷(𝐱, 𝐲) = Σ𝑖 |𝑥𝑖 − 𝑦𝑖|
▪ Hamming distance: measures the number of changes between 𝐴 and 𝐵 required to make 𝐴 = 𝐵.
K-NN: MEASURING DISTANCE
Distance-based similarity.
▪ Euclidean distance:
▪ In Euclidean distance, the green line has length √((0 − 6)² + (0 − 6)²) ≈ 8.49, and it is the unique shortest path.
▪ Manhattan distance:
▪ In Manhattan distance, the red, yellow and blue paths all have the shortest length of 12.
▪ Hamming distance:
▪ Measures the number of changes between 𝐴 and 𝐵 needed to make 𝐴 = 𝐵.
[Figure: Euclidean (green) vs. Manhattan (other colours) distance between the points (0, 0) and (6, 6) on a grid]
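A small sketch reproducing the worked example above (the points (0, 0) and (6, 6) come from the figure; the Hamming strings are illustrative assumptions).

```python
# The three distance measures on the worked example.
import numpy as np

a = np.array([0, 0])
b = np.array([6, 6])

euclidean = np.sqrt(np.sum((a - b) ** 2))     # sqrt((0-6)^2 + (0-6)^2)
manhattan = np.sum(np.abs(a - b))             # |0-6| + |0-6|
print(euclidean)   # 8.485... (the unique shortest path)
print(manhattan)   # 12       (many equally short grid paths)

# Hamming distance: number of positions where A and B differ
A, B = "karolin", "kathrin"
hamming = sum(ca != cb for ca, cb in zip(A, B))
print(hamming)     # 3
```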
▪ There is a risk that the classification will be driven by the weight feature.
▪ The distances generated between weights would be significantly larger than the distances generated between lengths.
▪ Potential solution: normalisation (see the sketch below).
▪ Normalise to a particular scale.
▪ E.g., normalise every feature to the range 0 to 1.
▪ Normalise to a standard deviation.
▪ E.g., normalise every feature to have a standard deviation of 1.
Source of images: https://i.stack.imgur.com/
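A sketch of the two normalisation options using scikit-learn's MinMaxScaler and StandardScaler before K-NN; the length/weight values below are made up to exaggerate the difference in scales.

```python
# Without scaling, distances would be dominated by the large weight values.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[3.5, 1200.0],     # [length, weight]
              [4.1, 1500.0],
              [2.0,  300.0],
              [2.2,  350.0]])
y = np.array(["car", "car", "bike", "bike"])

# Option 1: rescale every feature to the range 0-1
print(MinMaxScaler().fit_transform(X))

# Option 2: standardise every feature to zero mean and unit standard deviation
print(StandardScaler().fit_transform(X))

# In practice the scaler is chained with the classifier in a pipeline
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[3.8, 1400.0]]))   # -> "car"
```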
K-NN: CLASSIFICATION BOUNDARY
▪ What does the classification boundary look like?
▪ Like a Voronoi diagram, i.e., a partition of the plane/surface into regions, each containing the points closest to one of the given objects.
[Figure: two K-NN classification boundaries plotted over Feature 1 and Feature 2, resembling Voronoi cells]
K-NN: OUTLIERS
What if the division of the surface is unrealistic?
▪ Since we take the historical data as is, without generalising, the classification boundaries often become too complex.
▪ How do we smooth the boundaries? (One option is sketched below.)
[Figure: two K-NN classification boundaries over Feature 1 and Feature 2; one point appears with a swapped class label, creating an unrealistic island of one class inside the other]
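One common way to smooth the boundary is to increase 𝐾, so that a single outlier can no longer flip a whole region. The sketch below illustrates this on synthetic data with one deliberately mislabelled point; the choice of 𝐾 values is arbitrary.

```python
# Increasing K smooths the K-NN boundary: a single mislabelled point
# decides the prediction when K=1, but is outvoted when K is larger.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)
y[0] = 1 - y[0]                      # inject an outlier with a "swapped" label

query = X[0].reshape(1, -1)          # a point right next to the outlier

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k}: predicted class {knn.predict(query)[0]}")
# With K=1 the outlier decides the prediction; with a larger K the
# surrounding majority does, giving a smoother boundary.
```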
K-NN: PROCEDURE (READING SLIDE)
▪ Pseudocode:
1. Given data 𝐷 and a new input 𝑥 to classify.
2. Measure the distance of input 𝑥 to every known data point in 𝐷.
3. Search for the 𝐾 smallest distances.
4. Retrieve the majority class from the 𝐾 smallest distances.
5. Assign the majority class to ŷ.
▪ Note that we can also retrieve the probability of each class by simply counting the neighbours associated with each class, relative to 𝐾.
▪ i.e., K-NN could also be used for regression.
[Figure: two-class scatter plot over Feature 1 and Feature 2, with a new point classified by its nearest neighbours]
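A minimal sketch of the pseudocode above in NumPy, assuming Euclidean distance and toy data; it returns both the majority class and the class proportions among the 𝐾 neighbours.

```python
# Minimal K-NN following the five steps of the pseudocode.
import numpy as np
from collections import Counter

def knn_predict(D_X, D_y, x, K=3):
    # 2. Measure the distance of x to every known data point in D
    distances = np.sqrt(np.sum((D_X - x) ** 2, axis=1))
    # 3. Search for the K smallest distances
    nearest = np.argsort(distances)[:K]
    # 4.-5. Retrieve and assign the majority class among the K neighbours
    counts = Counter(D_y[i] for i in nearest)
    y_hat = counts.most_common(1)[0][0]
    probabilities = {label: n / K for label, n in counts.items()}
    return y_hat, probabilities

# 1. Given data D and a new input x to classify (toy values)
D_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
D_y = np.array(["car", "car", "train", "train"])
print(knn_predict(D_X, D_y, np.array([1.1, 0.9]), K=3))
```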
LOGISTIC REGRESSION VS. K-NEAREST NEIGHBOUR CLASSIFICATION (READING SLIDE)
Logistic Regression
▪ Learn the sigmoid of a function 𝑓(𝑥) to predict ŷ given new data 𝑥.
▪ i.e., in the case of logistic regression we have ŷ = sigmoid(𝑎𝑥1 + 𝑏𝑥2 + 𝑐) > 0.5.
K-Nearest Neighbour
▪ Use historical data to predict ŷ given new data 𝑥.
▪ i.e., find similar 𝑦𝑖 from historical data given new 𝑥𝑖.
[Figure: side-by-side scatter plots of data points 𝑦1, …, 𝑦𝑛 over 𝑥2 (e.g., length), one for each approach]
COMPUTATIONAL COMPLEXITY
▪ Notation: 𝑁 = samples, 𝐷 = dimensions.
▪ Logistic Regression:
▪ Memory: 𝑂(𝐷), since only the learned coefficients are stored.
▪ K-Nearest Neighbour:
▪ Memory: 𝑂(𝑁𝐷), since the entire dataset must be stored (K-NN does not produce a model).
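A small sketch of this contrast (synthetic data; the sizes are arbitrary): a fitted binary logistic regression stores only 𝐷 + 1 coefficients, whereas K-NN keeps all 𝑁 × 𝐷 training values and defers the work to prediction time.

```python
# Compare what each fitted model needs to keep in memory.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

N, D = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

log_reg = LogisticRegression(max_iter=1000).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print("Logistic regression stores:", log_reg.coef_.size + log_reg.intercept_.size,
      "parameters (D + 1)")
print("K-NN must keep the full training set:", X.size, "values (N * D)")
```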