07 Supervised Machine Learning

AI 610 – Machine Learning — Fall 2022

Supervised Machine Learning, 4
Dr. Waad Alhoshan
[email protected]
Department of Computer Science — IMSIU
Previous Lectures
• Defined classification as a machine learning problem
• A brief introduction to the three classification approaches:
• Non-probabilistic approach: discriminant functions
• Probabilistic approaches, such as generative models and discriminative models.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 2


Today’s lecture!
• We will cover some examples of discriminant functions:
• Least squares classification
• The perceptron
• Fisher's linear discriminant (self-study)
• A gentle introduction to example models of the probabilistic approaches, to be discussed next week!

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 3


Approaches to Statistical Classification
Recap from Lecture 06
• Non-probabilistic approach
• Discriminant Function
• Probabilistic approaches
• Probabilistic Generative Models
• Probabilistic Discriminative Models

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 4


Discriminant function
Recap from Lecture 03

• Find a function f(x), called a discriminant function, which maps each input x directly onto a class label.
• For instance, in the case of two-class problems, f(·) might be binary valued, such that f = 0 represents class C1 and f = 1 represents class C2.
In this case, probabilities play no role.
However, probabilities are important for minimizing risk, reject options, and so on! Check Bishop's book (pp. 44–46).

Figure: the optimal value of x (green line) is the decision boundary giving the minimum probability of misclassification.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 5


Discriminant Functions
Examples

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 6


Least Squares Classification
Recap from Lecture 06

• In regression, we considered models that were linear functions of the parameters, and we saw
that the minimization of a sum-of-squares error function led to a simple closed-form solution for
the parameter values (or weights).

• The least-squares solution can also be used to solve classification problems by attempting to find
the optimal decision boundary.

• Not an optimal solution, especially with more than two classes.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 7


Least Squares Classification
Mathematical Concept

Each class has its own discriminant function. Considering a simple generalized linear model:
• $y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$, for classes $k = 1, \dots, K$
• $y_k(\mathbf{x})$ can be read as "how far $\mathbf{x}$ is from the decision boundary" of class $C_k$
• For all classes, a shorter notation would be $\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^T \widetilde{\mathbf{x}}$
• where $\widetilde{\mathbf{W}}$ is a matrix containing $K$ columns, whose $k$-th column is $\widetilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^T)^T$
• and $\widetilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$ is the augmented vector of input features
• and $\mathbf{y}(\mathbf{x})$ is the output vector of the discriminant functions for all classes.
• Finally, input $\mathbf{x}$ will be assigned to the class $C_k$ with the maximum output (predictor) $y_k(\mathbf{x})$.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 8
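To make the notation concrete, here is a minimal NumPy sketch (not from the slides; the array names `W_tilde` and `x` are illustrative) of evaluating the K discriminant outputs and assigning the class with the maximum score:

```python
import numpy as np

# Illustrative sizes: D = 2 input features, K = 3 classes.
D, K = 2, 3
rng = np.random.default_rng(0)

W_tilde = rng.normal(size=(D + 1, K))  # column k holds (w_k0, w_k^T)^T
x = np.array([0.5, -1.2])              # a single input point

x_tilde = np.concatenate(([1.0], x))   # augmented input (1, x^T)^T
y = W_tilde.T @ x_tilde                # y(x) = W~^T x~, one score per class
predicted_class = int(np.argmax(y))    # assign x to the class with the largest y_k(x)
print(y, predicted_class)
```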


Least Squares Classification
Mathematical Concept

If we applied the previous function directly, we would get continuous values of $y_k(\mathbf{x})$ for each class. This won't help in the classification problem! So, we want to predict 0 and 1.
Basically, we transform the output into a one-hot encoding of binary values 0 and 1. The goal is a one-hot encoding that represents the output over the K classes.
FYI: "One-hot encoding is used to convert categorical variables into binary vectors."

Step 1: Dataset: $\widetilde{\mathbf{X}}$, an N × (D+1) input data matrix, and $\mathbf{T}$, an N × K target matrix whose rows are the one-hot target vectors:

$\mathbf{T} = \begin{pmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{pmatrix}$

Example: if K = 5 and data point n belongs to class j = 1, then $\mathbf{t}_n = (1, 0, 0, 0, 0)^T$.

We want to get 0 for a class if data point n does not belong to it, and 1 otherwise!

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 9
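As a side illustration (a minimal sketch, not code from the lecture; `labels` is an assumed array of integer class indices), one way to build the N × K one-hot target matrix T:

```python
import numpy as np

labels = np.array([0, 4, 2, 0])           # class indices for N = 4 points, K = 5 classes
K = 5
T = np.zeros((labels.size, K))
T[np.arange(labels.size), labels] = 1.0   # t_nk = 1 if point n belongs to class k, else 0
print(T)
```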


Cont., Recall this?
Step 2: Use the sum-of-squares error cost function over all data points:

$E_D(\widetilde{\mathbf{W}}) = \tfrac{1}{2}\,\mathrm{Tr}\{(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})^T(\widetilde{\mathbf{X}}\widetilde{\mathbf{W}} - \mathbf{T})\}$

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 10
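For reference, a small sketch (illustrative names, not lecture code) of evaluating this cost for a given weight matrix:

```python
import numpy as np

def sum_of_squares_error(W_tilde, X_tilde, T):
    """E_D(W~) = 1/2 Tr{(X~W~ - T)^T (X~W~ - T)}, i.e. half the squared Frobenius norm."""
    R = X_tilde @ W_tilde - T   # residual matrix, one row per data point
    return 0.5 * np.sum(R * R)
```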


Cont.,
Step 3: Minimize the squared errors, as we did in linear regression, by setting the derivative with respect to $\widetilde{\mathbf{W}}$ to zero: $\partial E_D / \partial \widetilde{\mathbf{W}} = 0$. (Mathematical proof in Bishop's book.) The solution uses the pseudo-inverse of $\widetilde{\mathbf{X}}$:

$\widetilde{\mathbf{W}}_{LS} = (\widetilde{\mathbf{X}}^T \widetilde{\mathbf{X}})^{-1} \widetilde{\mathbf{X}}^T \mathbf{T} = \widetilde{\mathbf{X}}^{\dagger} \mathbf{T}$

Side note, the pseudo-inverse of a matrix: if both the columns and the rows of the matrix are linearly independent, then the matrix is invertible, and the pseudo-inverse is equal to the inverse of the matrix.

Step 4: Obtain the discriminant function in the form

$y_{LS}(\mathbf{x}) = \widetilde{\mathbf{W}}_{LS}^T\, \widetilde{\mathbf{x}}$

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 11
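Putting steps 1–4 together, a hedged end-to-end sketch (function and variable names are illustrative, not lecture code; `np.linalg.pinv` computes the Moore–Penrose pseudo-inverse):

```python
import numpy as np

def fit_least_squares_classifier(X, T):
    """X: N x D inputs, T: N x K one-hot targets. Returns W~_LS of shape (D+1) x K."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias column of ones
    return np.linalg.pinv(X_tilde) @ T                  # W~_LS = X~^dagger T

def predict(W_tilde, X):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_tilde @ W_tilde                               # row n holds y_LS(x_n) = W~_LS^T x~_n
    return np.argmax(Y, axis=1)                         # class with the maximum score
```

A quick sanity check on toy data would be to compare `predict(W_tilde, X)` with the true labels; the outlier sensitivity discussed on the next slide shows up when a few extreme points are added.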


Limitations of LS Classification
• The decision boundaries obtained with LS classification are very sensitive to outliers.

Remember our interpretation of $y_k(\mathbf{x})$: it measures how far $\mathbf{x}$ is from the decision boundary.

It is like penalizing the well-classified cluster in the corner because its points are easy to detect: they lie far from the boundary, yet they still contribute large squared errors that drag the boundary toward them.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 12


Cont.,
• For K > 2, some decision regions can become very small or are even ignored completely → this is known as "masking".

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 13


Cont.,
• The components of $\mathbf{y}(\mathbf{x})$ are not real probabilities!
• We prefer probabilistic approaches since they represent the uncertainty that
we would like to express in machine learning.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 14


Perceptron
• It is one of the oldest classification machine learning models.
• Works with two classes only.
• Fast, and it is used in deep 'multi-layer' neural networks.
• “ It is a type of neural network model, perhaps the simplest type of neural network
model. It consists of a single node or neuron that takes a row of data as input and
predicts a class label.”

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 15


Perceptron
Mathematical Concept

• Input: a dataset; Target: binary classes. It is more convenient to use target values t = +1 for class C1 and t = −1 for class C2, which matches the choice of activation function.
• Classification decision:
  - From the prediction (discriminant) function $y(\mathbf{x}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))$
  - Classify as −1 or +1, where the activation is the step function
    $f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$
  - So $\mathbf{x}$ is assigned to C1 if $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) \geq 0$, and to C2 if $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) < 0$.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 16
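A minimal sketch of this decision rule (assuming the identity basis φ(x) = x plus a bias term; names are illustrative, not lecture code):

```python
import numpy as np

def perceptron_predict(w, x):
    """Step activation: return +1 if w^T phi(x) >= 0, else -1."""
    phi = np.concatenate(([1.0], x))  # phi(x) = (1, x^T)^T: bias term plus raw features
    a = w @ phi                       # activation a = w^T phi(x)
    return 1 if a >= 0 else -1
```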


Cont.,
• The algorithm used to determine the parameters of the perceptron can most easily be motivated
by error function minimization.
• A natural choice of error function would be the total number of misclassified targets.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 17
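As a small illustration of that error function (a sketch with assumed names; targets are in {−1, +1}):

```python
import numpy as np

def num_misclassified(w, Phi, t):
    """Count points whose predicted sign of w^T phi_n disagrees with the target t_n."""
    predictions = np.where(Phi @ w >= 0, 1, -1)
    return int(np.sum(predictions != t))
```

One practical issue (noted in Bishop's book) is that this count is piecewise constant in w, so it cannot be minimized by gradient methods directly; the sketch after the SGD explanation below shows the update rule that is used instead.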


Cont.,

Figure: a two-class example with labels C1 = −1 and C2 = +1.

“For a quick simple explanation:


In both gradient descent (GD) and stochastic gradient descent (SGD), you update a
set of parameters in an iterative manner to minimize an error function.
While in GD, you have to run through ALL the samples in your training set to do a
single update for a parameter in a particular iteration, in SGD, on the other hand,
you use ONLY ONE sample or a SUBSET of training samples from your training set to do the
update for a parameter in a particular iteration. If you use a SUBSET, it is called
Minibatch Stochastic gradient Descent.”
AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 18
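Below is a hedged sketch of the perceptron learning rule driven by SGD, following Bishop's description: cycle through the data one point at a time and, whenever a point is misclassified, update w ← w + η φ(x_n) t_n. The function name, η = 1, and the epoch limit are illustrative choices, not from the lecture.

```python
import numpy as np

def train_perceptron(X, t, epochs=100, eta=1.0):
    """X: N x D inputs, t: targets in {-1, +1}. Returns the weight vector w (bias first)."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])  # phi(x) = (1, x^T)^T
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):              # one sample per update (SGD)
            if (w @ phi_n) * t_n <= 0:              # misclassified (or on the boundary)
                w += eta * phi_n * t_n              # perceptron update rule
                errors += 1
        if errors == 0:                             # every point classified correctly: stop early
            break
    return w
```

If the data are not linearly separable, the loop simply stops after `epochs` passes without converging, which matches the limitation listed on the next slide.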
Limitations
• The perceptron only works for two classes.

• There might be many solutions, depending on the initialization of w and on the order in which data points are presented in SGD.

• If the dataset is not linearly separable, the perceptron algorithm will not converge.

• It is based on a linear combination of fixed basis functions.

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 19


Next Lectures
Probabilistic approaches to classification

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 20


Generative models
Recap from Lecture 03

1. Determine the class-conditional densities p(x|Ck) for each class Ck individually.
2. Infer the prior class probabilities p(Ck) separately.
3. Use Bayes' theorem in the form
   $p(C_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$
4. Then, p(x) is used for normalization, as
   $p(\mathbf{x}) = \sum_k p(\mathbf{x} \mid C_k)\, p(C_k)$

Expensive, and requires a large training dataset. However, it can be useful for detecting new data points that have low probability (outlier detection).
Examples: NB, BN, HMM, LDA, …
AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 21
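To make steps 3–4 concrete, a minimal sketch (illustrative toy model, not lecture code) that turns assumed 1-D Gaussian class-conditional densities and priors into posterior probabilities for K = 2 classes:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: p(x|C1) = N(0, 1), p(x|C2) = N(2, 1), priors p(C1) = 0.6, p(C2) = 0.4.
priors = np.array([0.6, 0.4])
class_conditionals = [norm(loc=0.0, scale=1.0), norm(loc=2.0, scale=1.0)]

x = 1.3
likelihoods = np.array([c.pdf(x) for c in class_conditionals])  # p(x|Ck) for each class
evidence = np.sum(likelihoods * priors)                         # p(x) = sum_k p(x|Ck) p(Ck)
posteriors = likelihoods * priors / evidence                    # Bayes' theorem: p(Ck|x)
print(posteriors)   # decision theory would then pick the class with the larger posterior
```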
Discriminative models
Recap from Lecture 03

1. Determine the posterior class probabilities p(Ck|x) directly.
2. Use decision theory to assign each new x to one of the classes.

A brilliant choice for classification problems!

What about combining generative and discriminative approaches?

Examples: LR, SVM, DT, RF, …


AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 22
Essential Study Readings
• Pattern Recognition and Machine Learning (Bishop)
• Sections 4.1, 4.2, and 4.3

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 23


Presentations next week!
• Don’t forget to follow the submission requirements
• Submit your materials (slides and short summary)
• Each student in the group must submit the same zipped folder

AI 610 – Machine Learning — Fall 2022 — W. Alhoshan 24
