Lecture Slides Week11
(CS4613)
Course schedule and readings:
• Bayesian Inference, Naïve Bayes – Week 6, 7 – Fundamentals of Machine Learning for Predictive Data Analytics, Ch 6
• PCA – Week 8 – Hands-On Machine Learning, Ch 8
• Linear Regression – Week 9, 10 – Fundamentals of Machine Learning for Predictive Data Analytics, Ch 7
• SVM – Week 11, 12 – Fundamentals of Machine Learning for Predictive Data Analytics, Ch 7
• ANN – Week 13, 14 – Neural Networks: A Systematic Introduction, Ch 1, 2, 3, 4, 7 (selected topics); Hands-On Machine Learning, Ch 10
• K-Nearest Neighbor – Week 15 – Fundamentals of Machine Learning for Predictive Data Analytics, Ch 5; Master Machine Learning Algorithms, Ch 22, 23
• K-Means Clustering – Week 16 – Data Mining: The Textbook, Ch 6
Outline Week 11
• Handling Categorical Features in Regression
• Logistic Regression: Handling Categorical Target Features
• Support Vector Machines (SVM) Introduction
Handling Categorical Features
Handling Categorical Features
• We use a transformation that converts a single categorical descriptive feature into a number of continuous descriptive features, one per level, that together encode the levels of the original categorical feature.
• For example, Rating A is converted to 1 0 0 (see the figure and the sketch below).
[Figure: the original data with the categorical Energy Rating feature, and the same data with Energy Rating converted to 3 continuous descriptive features]
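A minimal sketch of this conversion using pandas (the library choice, column name, and levels A/B/C are illustrative assumptions, not taken from the slides):

```python
import pandas as pd

# Toy data: a single categorical descriptive feature (illustrative values).
df = pd.DataFrame({"Energy Rating": ["A", "B", "C", "A"]})

# One-hot encode: one continuous 0/1 feature per level of the original feature.
one_hot = pd.get_dummies(df["Energy Rating"], prefix="Rating").astype(int)
print(one_hot)
# Rating A -> 1 0 0, Rating B -> 0 1 0, Rating C -> 0 0 1
```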
Logistic Regression: Handling Categorical Target Features
• Simple and multivariable linear regression models can be
used to make predictions for continuous target features
(like Rental Price in the previous example).
• Now we focus on Categorical Target Features
• Consider the following figure, which shows a scatter plot of the generators dataset
A snapshot of the generators dataset
Decision Boundary
• We can draw a straight line across the scatter plot that perfectly
separates the good generators from the faulty ones.
• This line is known as a decision boundary, and because we can draw
this line, this dataset is said to be linearly separable in terms of the two
descriptive features used.
• As the decision boundary is a linear separator, it can be defined using the general equation of a line (Ax + By + C = 0).
• In the following figure, the (linear) decision boundary is defined as
0.667 x RPM + 1 x Vibration – 830 = 0
https://doubleroot.in/lessons/straight-line/general-form/
Side note: in the general form Ax + By + C = 0, the coefficients (A, B) form a vector normal to the line.
A Prediction Model
• Hence we can make a prediction model that applies a hard threshold to the decision boundary equation: it outputs 1 when 0.667 x RPM + 1 x Vibration – 830 ≥ 0 and 0 otherwise.
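A minimal sketch of such a hard-threshold model, using the boundary from the earlier figure; which target level maps to the output 1 is an assumption here:

```python
# Hard-threshold prediction model based on the decision boundary
# 0.667 * RPM + 1 * Vibration - 830 = 0.
def predict(rpm, vibration):
    score = 0.667 * rpm + 1.0 * vibration - 830
    return 1 if score >= 0 else 0  # 1 for one side of the boundary, 0 for the other

print(predict(500, 600))  # score = 103.5  -> 1
print(predict(300, 400))  # score = -229.9 -> 0
```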
Decision Surface
• The following figure shows the decision surface for
every possible value of RPM and VIBRATION.
How to determine the values for
the weights?
• In this case we cannot just use the gradient descent algorithm: the hard decision boundary shown in the previous figure is discontinuous, so it is not differentiable, which means we cannot calculate the gradient of the error surface using the derivative.
• Another problem with this model is that the model always makes
completely confident predictions of 0 or 1.
• A model able to distinguish between instances that are very close
to the boundary and those that are farther away would be
preferable.
• We can solve both of these problems by using a more sophisticated threshold function that is continuous, and therefore differentiable, and that allows for the subtlety desired: the logistic function.
Logistic Regression
• Instead of the regression function simply being the dot product of the weights and the descriptive feature values, that dot product is passed through the logistic function: Mw(d) = logistic(w . d) = 1 / (1 + e^(−w . d)).
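A minimal sketch of this idea, reusing the earlier boundary weights as illustrative (not fitted) values; outputs near 0 or 1 indicate points far from the boundary, while outputs near 0.5 indicate points close to it:

```python
import math

def logistic(z):
    # The logistic (sigmoid) function: continuous and differentiable.
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(rpm, vibration, w0=-830.0, w_rpm=0.667, w_vib=1.0):
    # Pass the dot product of weights and feature values through the logistic function.
    return logistic(w0 + w_rpm * rpm + w_vib * vibration)

print(predict_prob(500, 600))  # far above the boundary -> close to 1
print(predict_prob(300, 400))  # far below the boundary -> close to 0
print(predict_prob(400, 563))  # near the boundary      -> close to 0.5
```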
Basis Function Contd..
• What makes this approach really attractive is that, although this new model, stated in terms of basis functions, captures the non-linear relationship between rainfall and grass growth, the model is still linear in terms of the weights and so can be trained using gradient descent without making any changes to the algorithm.
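A minimal sketch, assuming simple polynomial basis functions φ(rainfall) = [1, rainfall, rainfall²] and made-up weights; the prediction is non-linear in rainfall but still a plain weighted sum, so gradient descent applies unchanged:

```python
def basis(rainfall):
    # Basis functions applied to the raw descriptive feature.
    return [1.0, rainfall, rainfall ** 2]

def predict(weights, rainfall):
    # Still a plain weighted sum, now over the basis function outputs,
    # so the model remains linear in the weights.
    return sum(w * phi for w, phi in zip(weights, basis(rainfall)))

weights = [3.0, 0.8, -0.01]   # illustrative values, not fitted
print(predict(weights, 20))   # 3 + 0.8*20 - 0.01*400 = 15.0
```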
Support Vector Machines
(SVM)
Introduction
• We want to find the hyperplane that separates the two classes (like we
have done before).
• The problem is that there can be many such hyperplanes that separate the two classes.
• SVM tries to find the optimal hyperplane: the separating hyperplane whose distance to the closest training points is maximum.
• Those closest training points are called support vectors
• We compute the distance between the hyperplane and the support
vectors (margin) and the goal is to find the hyperplane that maximizes
this margin.
• The hyperplane for which the margin is maximum is the optimal
hyperplane.
• The idea is that a hyperplane which separates the points, but is also as
far away from any training point as possible, will generalize best.
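A minimal sketch with a made-up 2-D dataset and two candidate separating lines: the margin of each candidate is the distance from the line to its closest training point, and the SVM prefers the candidate with the larger margin:

```python
import math

# Toy points: the first two belong to one class, the last two to the other.
points = [(1.0, 1.0), (2.0, 1.5), (4.0, 4.0), (5.0, 4.5)]

def margin(w, w0):
    # Distance from the closest training point to the line w . d + w0 = 0.
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + w0) / norm for x, y in points)

# Two candidate separating lines; both separate the toy classes.
print(margin((1.0, 0.0), -3.0))      # x - 3 = 0              -> margin 1.0
print(margin((2.0, 2.5), -12.875))   # 2x + 2.5y - 12.875 = 0 -> margin ~1.6 (preferred)
```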
Support Vectors
Basics
• We define the separating hyperplane in the same way as we did before, but this time we write the bias term w0 separately: w0 + w . d = 0
• For instances above the separating hyperplane, w0 + w . d > 0, and for instances below it, w0 + w . d < 0
• We want to find a δ such that we can define two parallel hyperplanes on either side of
the separating hyperplane such that these hyperplanes are
w0 + w . d = + δ and w0 + w . d = - δ
• We'd like to maximize δ (which maximizes the margin), subject to the constraint that all points from one class must be on one side of the hyperplane and all points from the other class must be on the other side.
• This will let us find a hyperplane that is exactly in the middle of these two.
• The value of δ is set to 1 and w0 and w are adjusted to maximize the margin
• If the negative target feature level is −1 and the positive target feature level is +1, we
want a prediction model so that instances with the negative target level result in the
model outputting ≤ −1 and instances with the positive target level result in the model
outputting ≥ +1
or, combining both cases into a single constraint, t × (w0 + w . d) ≥ 1 for every training instance, where t is the −1/+1 target level.
Note: w0 is written as b in this figure.
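For reference, with δ = 1 the margin width is 2 / ||w||, so maximizing the margin subject to the constraint above is usually written as the standard hard-margin optimization problem (this formulation comes from the general SVM literature rather than from the slide itself):

```latex
\[
\begin{aligned}
\min_{\mathbf{w},\, w_0} \quad & \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2} \\
\text{subject to} \quad & t_i \,(\mathbf{w} \cdot \mathbf{d}_i + w_0) \ge 1
  \quad \text{for every training instance } (\mathbf{d}_i, t_i)
\end{aligned}
\]
```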
Why δ = 1 works?
• As we know, w represents the
vector normal to the line (our
decision boundary).
• Changing w’s length does not
affect the decision boundary.
https://jeremykun.com/2017/06/05/formulating-the-support-vector-machine-optimization-problem/
Why δ = 1 works? Contd..
• If we increase the length of w, that means the absolute values of the dot
product w.d will increase.
• We are just scaling the length of w so that w.d+w0=1.
• The following figures show the actual distance between a point on the
margin (support vector) and the decision boundary (value: 2.2) and the
dot product of this point with w for two different values of w.
• As we can see, we can adjust w so that the dot product is 1.
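A minimal sketch (the line and point are made up rather than taken from the figure): rescaling w and w0 by the same factor changes the value of w . d + w0 at the support vector but not its geometric distance to the boundary, so we can always rescale until that value is exactly 1:

```python
import math

def value(w, w0, d):
    # w . d + w0 for a 2-D point d.
    return w[0] * d[0] + w[1] * d[1] + w0

def distance(w, w0, d):
    # Geometric distance from d to the line w . d + w0 = 0.
    return abs(value(w, w0, d)) / math.hypot(w[0], w[1])

w, w0 = [3.0, 4.0], -5.0    # boundary: 3x + 4y - 5 = 0
d = [3.0, 1.0]              # a point playing the role of a support vector

scale = 1.0 / value(w, w0, d)                       # value is 8, so scale by 1/8
w_s, w0_s = [scale * w[0], scale * w[1]], scale * w0

print(distance(w, w0, d), value(w, w0, d))          # 1.6  8.0
print(distance(w_s, w0_s, d), value(w_s, w0_s, d))  # 1.6  1.0 (same distance, value now 1)
```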
That is all for Week 11
Appendix: the hyperplane equation w0 + w . d
In the context of Support Vector Machines (SVMs), the expression w0 + w . d defines the hyperplane used for classification. Breaking down the components:
1. w0: the bias term (intercept) of the hyperplane; a constant that shifts the hyperplane away from the origin.
2. w: the weight vector. It contains the weights assigned to each feature in the dataset. The direction of w is perpendicular to the hyperplane, and the relative magnitudes of its components indicate how strongly each feature influences where the hyperplane lies.
3. d: a vector of descriptive feature values representing the data point whose classification we want to determine.
4. · : the dot product operation, which multiplies the corresponding elements of w and d and sums them up.
Putting it all together, w0 + w . d = 0 is the equation of a hyperplane in the feature space. When the expression w0 + w . d is evaluated for a given data point d, its sign indicates on which side of the hyperplane the point lies:
• If w0 + w . d > 0, the point lies on one side of the hyperplane (the positive class).
• If w0 + w . d < 0, the point lies on the other side of the hyperplane (the negative class).