2024-SCU-ML-2-1-SVM
Machine Learning
Yen-Kuang Chen, Ph.D., IEEE Fellow
[email protected]
1
Summary of Last Week’s Materials
• Why do we want to use machine learning?
• A huge amount of information must be processed
• However, no simple programmable rules
• ML can be an alternative route to building complicated systems
• Y: +1 (malignant), −1 (benign)
• Note y is single-dimensional in this example
In ℝ² (the 2D plane) vs. in ℝᵈ (higher dimensions):
• Features of tumor x⃗: points on the 2D plane → points in ℝᵈ
• Labels y: +1 (malignant), −1 (benign)
• Hypothesis g(x⃗): a line → a hyperplane in ℝᵈ
  • Positive on one side of the line (hyperplane); negative on the other side
After having the model, how can we find {w_i} and the threshold?
→ ML algorithms
PLA: A Simple Learning Algorithm
• Start from some random w⃗_0, and correct its mistakes on D
• For t = 0, 1, …
  • Find a mistake (x⃗_n(t), y_n(t))
  • Based on the mistake, "correct" w⃗_t → w⃗_{t+1}
• until no more mistakes (see the sketch below)
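As a quick refresher, here is a minimal sketch of PLA in Python (the data format and the zero initialization are illustrative assumptions, not part of the slides):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm on linearly separable data.

    X: (N, d) array of samples; y: (N,) array of labels in {+1, -1}.
    """
    X = np.hstack([X, np.ones((len(X), 1))])  # absorb the threshold/bias into w
    w = np.zeros(X.shape[1])                  # start from some w_0 (zeros here)
    for _ in range(max_iters):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if mistakes.size == 0:                # no more mistakes -> done
            return w
        i = mistakes[0]                       # a mistake (x_n(t), y_n(t))
        w = w + y[i] * X[i]                   # "correct" w_t -> w_{t+1}
    return w
```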
Review of Perceptron Learning Algorithm
• What kinds of data were given?
• Labeled (often created by humans)
• What kinds of labels? Supervised classification
• Two classes
• Assumptions?
• Linearly separable
• Learning algorithm?
• Iteratively updates its weights in response to errors in its prediction
Remaining Question
• How can machines learn better?
• What are the strategies to improve performance/efficiency?
• Review of PLA
• Assumptions: Linearly separable data
• Learning algorithm: Iteratively updates its weights in response to errors in its
prediction
• Guarantee: It will find a linear boundary that separates the data (given that the data are linearly separable)
• However,
• How can we find a separating linear boundary faster?
• If the data are linearly separable, is PLA the best algorithm?
Outline
• Course Atlas
• Support Vector Machine (SVM)
• Inferencing: Decision rule
• Training: Objective function/optimization problem
• Transforming from good to great
9
Course Atlas
• Supervised
  • Classification: Perceptron Learning Algorithm (PLA), Support Vector Machines (SVM), Decision Trees, Linear Discriminant Analysis (LDA)
  • Regression: Linear regression
• Unsupervised
  • Clustering: K-means
  • Dimension reduction: PCA
10
Quiz
• Which of the following problems is a supervised classification algorithm
best suited for? (select one)
• Forecast sales based on past sales data, economy outlook, etc.
• Identify whether an image contains a dog
• Split the dataset into groups based on their similarities
• Reduce the dimensionality of a dataset
11
Basic Assumptions
• Input data (also called samples)
  • x⃗_i
• Binary labels
  • y_i = +1 for positive samples
  • y_i = −1 for negative samples
• Linearly separable
• Clean
[Figure: a messy, non-separable scatter of + and − points (crossed out) vs. a clean, linearly separable one]
12
Support Vector Machine (SVM)
• To classify future data points with more confidence.
• Margin should be maximal
[Figure: the same + and − points separated by a narrow-margin line (crossed out) vs. a wide-margin line]
13
Decision Rule
• Let's define w⃗ to be perpendicular to the median (the decision boundary)
• w⃗ can have any length for now
[Figure: w⃗ drawn perpendicular to the separating line, with + samples on one side, − samples on the other, and an unknown sample x⃗]
• Decision rule
  • w⃗ · x⃗ ≥ c ⟹ positive; w⃗ · x⃗ < c ⟹ negative
  • Equivalently (with b = −c): w⃗ · x⃗ + b ≥ 0 ⟹ positive; w⃗ · x⃗ + b < 0 ⟹ negative
• Decision rule with margins
  • w⃗ · x⃗ + b ≥ +1 ⟹ positive; w⃗ · x⃗ + b ≤ −1 ⟹ negative
  • Equivalently, y_i (w⃗ · x⃗_i + b) ≥ 1
• For x⃗_i on the edges of the margin, y_i (w⃗ · x⃗_i + b) = 1
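As a concrete illustration, a minimal sketch of the decision rule in Python (the values of w, b, and the samples are placeholders, not from the slides):

```python
import numpy as np

def svm_predict(w, b, x):
    """Classify a sample with the linear decision rule: sign(w . x + b)."""
    score = np.dot(w, x) + b
    return +1 if score >= 0 else -1

# Hypothetical weight vector and bias
w = np.array([2.0, -1.0])
b = -0.5
print(svm_predict(w, b, np.array([1.0, 0.5])))   # score = 1.0  -> +1 (positive side)
print(svm_predict(w, b, np.array([-1.0, 1.0])))  # score = -3.5 -> -1 (negative side)
```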
14
How to Train an SVM?
• How can we get 𝑤 and 𝑏, given the labeled dataset?
15
Objective Function
• The objective is to get the widest margin
• Let's find an x⃗_+ on the positive edge of the margin and an x⃗_− on the negative edge of the margin
• Width of the margin = (w⃗ / ‖w⃗‖) · (x⃗_+ − x⃗_−) = (w⃗ · x⃗_+ − w⃗ · x⃗_−) / ‖w⃗‖
• Note that for x⃗_i on the edge of the margin, y_i (w⃗ · x⃗_i + b) = 1
  • w⃗ · x⃗_+ = 1 − b
  • w⃗ · x⃗_− = −1 − b
• Width of the margin = 2 / ‖w⃗‖
• Objective: max 2/‖w⃗‖ ≡ min ‖w⃗‖
[Figure: the margin between the positive and negative edges, with x⃗_+ and x⃗_− on the edges and w⃗ normal to them]
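For reference, the same chain of steps written as one equation (a LaTeX restatement of the bullets above):

```latex
\[
\text{width}
= \frac{\vec{w}}{\|\vec{w}\|}\cdot(\vec{x}_{+}-\vec{x}_{-})
= \frac{\vec{w}\cdot\vec{x}_{+}-\vec{w}\cdot\vec{x}_{-}}{\|\vec{w}\|}
= \frac{(1-b)-(-1-b)}{\|\vec{w}\|}
= \frac{2}{\|\vec{w}\|}
\]
```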
16
Constrained Optimization Problem
• Objective: max 2/‖w⃗‖ ≡ min ‖w⃗‖ ≡ min ½ ‖w⃗‖²
• Constraint: y_i (w⃗ · x⃗_i + b) ≥ 1 for every sample, with equality for x⃗_i on the edge of the margin
• Lagrangian: ℒ = ½ ‖w⃗‖² − Σ_i α_i [ y_i (w⃗ · x⃗_i + b) − 1 ]
• Note that
  • α_i > 0 when x⃗_i is on the edge of the margin (a support vector)
  • α_i = 0 when x⃗_i is not on the edge of the margin
• To find the minimum, set ∂ℒ/∂w⃗ = 0 and ∂ℒ/∂b = 0
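Putting these together, the training problem can be written in standard form (a restatement of the objective and constraint above):

```latex
\[
\min_{\vec{w},\,b}\ \tfrac{1}{2}\|\vec{w}\|^{2}
\quad \text{subject to} \quad
y_i\,(\vec{w}\cdot\vec{x}_i + b) \ge 1 \ \text{for all } i
\]
```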
17
ℒ = ½ ‖w⃗‖² − Σ_i α_i [ y_i (w⃗ · x⃗_i + b) − 1 ]
Quadratic Programming
• ∂ℒ/∂w⃗ = w⃗ − Σ_i α_i y_i x⃗_i = 0 ⟹ w⃗ = Σ_i α_i y_i x⃗_i
• ∂ℒ/∂b = − Σ_i α_i y_i = 0 ⟹ Σ_i α_i y_i = 0
• ℒ = ½ ‖w⃗‖² − Σ_i α_i [ y_i (w⃗ · x⃗_i + b) − 1 ]
    = ½ (Σ_i α_i y_i x⃗_i) · (Σ_j α_j y_j x⃗_j) − (Σ_i α_i y_i x⃗_i) · (Σ_j α_j y_j x⃗_j) − (Σ_i α_i y_i) b + Σ_i α_i
    = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x⃗_i · x⃗_j)
• Given the x⃗_i and y_i, we only need to solve for the α_i that optimize ℒ (the dual problem)
  • Looks simple
  • It is a quadratic programming problem
  • Quadratic programming is NP-hard in general; the SVM problem is convex, but still non-trivial to solve at scale
There are packages to help solve the SVM optimization problem.
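For example, a minimal sketch using scikit-learn's SVC with a linear kernel (the toy dataset here is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, hypothetical linearly separable dataset (labels in {+1, -1})
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
              [0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard-margin behavior
clf.fit(X, y)

print("w =", clf.coef_[0])                  # learned weight vector
print("b =", clf.intercept_[0])             # learned bias
print("support vectors:", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)     # dual coefficients of the support vectors
print("prediction:", clf.predict([[2.0, 1.0]]))
```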
18
Quiz
• What is the primary objective of a Support Vector Machine (SVM)?
1. Maximize the margin between support vectors
2. Minimize the margin between support vectors
3. Maximize the number of support vectors
4. Minimize the number of support vectors
19
Quiz
• In SVM classification, what is the term 'support vectors' referring to?
1. Data points that are hard to classify
2. Data points that lie closest to the decision boundary
3. Data points that belong to the majority class
4. Data points that belong to the minority class
20
Transforming from Good to Great
• What if the data are not linearly separable?
• What if the given data are not clean?
[Figure: two scatter plots of + and − points, one that no line can separate and one with a few noisy outliers on the wrong side]
21
Polynomial Kernels
Source: U. Braga-Neto, Fundamentals of Pattern Recognition and Machine Learning, Springer, 2020
23
Radial Basis Function (RBF) Kernel
• One of the most popular non-linear kernels in SVM
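As an illustration of picking a kernel in practice, a minimal scikit-learn sketch (the dataset, gamma, and degree values are placeholders); the RBF kernel computes K(x, x') = exp(−γ‖x − x'‖²):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Hypothetical non-linearly-separable data: one class inside a circle, one outside
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

rbf_clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)   # K(x, x') = exp(-gamma * ||x - x'||^2)
poly_clf = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)  # polynomial kernel of degree 3

print("RBF  accuracy:", rbf_clf.score(X, y))
print("Poly accuracy:", poly_clf.score(X, y))
```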
24
Quiz
• What is the 'kernel trick' in SVMs used for?
1. To turn a linearly inseparable problem into a linearly separable one
2. To increase the dimensionality of the feature space
3. To reduce the dimensionality of the feature space
4. To adjust the cost parameter 'C'
25
Soft-margin SVM
• There may be outliers or noise in the data from real-world applications.
• To address this issue, a soft margin can be used in a modified
optimization problem, known as a soft-margin SVM:
• Objective: min ½ ‖w⃗‖² + C Σ_i ξ_i
• Constraint: y_i (w⃗ · x⃗_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
• ξ_i is the slack, which allows x⃗_i to be inside the margin
• SVM without the slacks is known as hard-margin SVM.
[Figure: a separating line and its margin, with w⃗ shown and one sample x⃗ inside the margin at slack distance ξ_i]
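For intuition about the slack variables, a small sketch computing ξ_i = max(0, 1 − y_i (w⃗ · x⃗_i + b)) for hypothetical values of w, b, and a few samples (these numbers are made up for illustration):

```python
import numpy as np

# Hypothetical trained parameters (placeholders, not from the slides)
w = np.array([1.0, 1.0])
b = -3.0

# Three hypothetical positive samples at different distances from the boundary
X = np.array([[3.0, 2.0],
              [2.0, 1.5],
              [1.0, 1.0]])
y = np.array([+1, +1, +1])

# Slack needed so that y_i * (w . x_i + b) >= 1 - xi_i holds
margins = y * (X @ w + b)            # y_i * (w . x_i + b)
xi = np.maximum(0.0, 1.0 - margins)
print(xi)                            # xi = [0, 0.5, 2]
```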
26
Group Discussion
• Where is x⃗_i relative to where the margin is when its ξ_i value is 0?
• Where is x⃗_i relative to where the margin is when 0 ≤ ξ_i ≤ 1?
• Where is x⃗_i relative to where the margin is when ξ_i > 1?
[Figure: the soft-margin picture from the previous slide, showing w⃗, the margin, and a sample with slack ξ_i]
27
Quiz
• What is the main difference between hard-margin and soft-margin
SVMs?
1. Hard-margin SVMs have a wider margin
2. Soft-margin SVMs allow for misclassification
3. Soft-margin SVMs have no margin
4. Soft-margin SVMs do not use kernels
28
Quiz
• What is the purpose of the hyperparameter 'C' in soft-margin SVMs?
1. To control the width of the margin
2. To control the trade-off between margin width and misclassification
3. To specify the number of support vectors
4. To adjust the kernel function
29
Quiz
• What happens if the cost parameter 'C' in a soft-margin SVM is set to
a very high value?
1. The margin becomes wider
2. The margin becomes narrower
3. More misclassifications are allowed
4. Fewer misclassifications are allowed
30
Summary of SVM
• Finding a hyperplane that separates the data with a maximal margin,
• In order to classify future data points with more confidence
• To find the optimal hyperplane, we need to solve a quadratic optimization problem
• The name "Support Vector Machine" (SVM) comes from using support
vectors to create a hyperplane that separates the data
• The support vectors are the data points that are closest to the hyperplane
• Kernels in SVMs are used to turn a linearly inseparable problem into a
linearly separable one
• Soft-margin SVMs allow for misclassification
• Fewer misclassifications are allowed if the cost parameter 'C' in a soft-margin SVM is
set to a very high value
31
Geometry Review for Homework #2
• Find the line that separates 3 or 4 points with the maximal margin
  • If 3 points form an isosceles triangle (one + point and two − points)
  • If 4 points form a rectangle (two + points and two − points)
Scikit-Learn SVM packages
Class      Library    Time complexity          Kernel trick
LinearSVC  Liblinear  O(m × n)                 No
SVC        Libsvm     O(m² × n) to O(m³ × n)   Yes
(m = number of training samples, n = number of features)
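A small usage sketch contrasting the two classes (the dataset and hyperparameter values are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Hypothetical dataset: m = 1000 samples, n = 20 features
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Linear SVM trained by liblinear: roughly O(m*n), no kernel trick
linear_clf = LinearSVC(C=1.0).fit(X, y)

# Kernelized SVM trained by libsvm: more expensive, but supports kernels
kernel_clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(linear_clf.score(X, y), kernel_clf.score(X, y))
```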
Group Discussions
• Compare Perceptron Learning Algorithm (PLA) and Support Vector
Machines (SVM)
• Both are classification algorithms
• Limitation: PLA is a linear algorithm that finds the hyperplane to separate the
data, while SVM is a more flexible algorithm that can use nonlinear kernels to
separate the data.
• Quality: SVM finds the hyperplane with the largest margin, which is the
distance between the hyperplane and the nearest data points. PLA does not
consider margin and simply finds a hyperplane that separates the data.
• Computation: PLA can be faster and more computationally efficient than SVM
on simple datasets with linear boundaries.
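To make the comparison concrete, a minimal sketch fitting scikit-learn's Perceptron (closely related to PLA) and a linear SVC on the same made-up data:

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

# Hypothetical linearly separable 2D data
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

pla = Perceptron().fit(X, y)                 # stops at *some* separating hyperplane
svm = SVC(kernel="linear", C=1e6).fit(X, y)  # picks the maximum-margin hyperplane

print("PLA boundary:", pla.coef_[0], pla.intercept_[0])
print("SVM boundary:", svm.coef_[0], svm.intercept_[0])
```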
34
List of Key Questions About Each Machine
Learning Algorithm
• What kinds of data were given?
• Labeled?
• What kind of labels?
• Continuous or classification labels? Number of categories?
• Assumptions?
• Linearly separable?
• Learning algorithm?
• Direct optimization? Iterative optimization? Parameters?
• Inference computation?
• Linear? Polynomial? Non-linear?
• Error function?
• Class labels? Probabilities? Manual threshold?
• Overfitting or underfitting?
35
Any Questions About HW#1?
36