
Machine Learning

(CS4613)

Department of Computer Science


Capital University of Science and Technology (CUST)
Course Outline
Topic | Weeks | Reference
Introduction | Week 1 | Hands-On Machine Learning, Ch. 1
Hypothesis Learning | Week 2 | Tom Mitchell, Ch. 2
Model Evaluation | Week 3 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 8
Classification: Decision Trees | Weeks 4, 5 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 4
Bayesian Inference: Naïve Bayes | Weeks 6, 7 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 6
PCA | Week 8 | Hands-On Machine Learning, Ch. 8
Linear Regression | Weeks 9, 10 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 7
SVM | Weeks 11, 12 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 7
ANN | Weeks 13, 14 | Neural Networks: A Systematic Introduction, Ch. 1-4, 7 (selected topics); Hands-On Machine Learning, Ch. 10
K-Nearest Neighbor | Week 15 | Fundamentals of Machine Learning for Predictive Data Analytics, Ch. 5; Master Machine Learning Algorithms, Ch. 22, 23
K-Means Clustering | Week 16 | Data Mining: The Textbook, Ch. 6

Outline Week 11
• Handling Categorical Features in Regression
• Logistic Regression: Handling Categorical Target Features
• Support Vector Machines (SVM) Introduction

Handling Categorical
Features

Handling Categorical Features
• We use a transformation that converts a single categorical descriptive feature into a number of continuous descriptive feature values that encode the levels of the categorical feature (one-hot encoding).
• For example, the rating level A is converted to 1 0 0 (see the sketch below).
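As a minimal sketch of this conversion (assuming pandas; the column name Rating and its levels A, B, C are made up for illustration):

```python
import pandas as pd

# Hypothetical data: one categorical feature "Rating" with levels A, B, C
df = pd.DataFrame({"Rating": ["A", "B", "C", "A"]})

# One categorical feature becomes three 0/1 features, so level A encodes
# as 1 0 0, level B as 0 1 0, and level C as 0 0 1
encoded = pd.get_dummies(df, columns=["Rating"], dtype=int)
print(encoded)
#    Rating_A  Rating_B  Rating_C
# 0         1         0         0
# 1         0         1         0
# 2         0         0         1
# 3         1         0         0
```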

Original data: the Energy Rating feature converted to 3 continuous features
Logistic Regression: Handling
Categorical Target Features
• Simple and multivariable linear regression models can be
used to make predictions for continuous target features
(like Rental Price in the previous example).
• Now we focus on categorical target features.
• Consider the following figure, which shows a scatter plot of the generators dataset.

A snapshot of the generators dataset

Decision Boundary
• We can draw a straight line across the scatter plot that perfectly
separates the good generators from the faulty ones.
• This line is known as a decision boundary, and because we can draw
this line, this dataset is said to be linearly separable in terms of the two
descriptive features used.
• As the decision boundary is a linear separator, it can be defined using
the equation of the line
• In the following figure, the (linear) decision boundary is defined as
0.667 × RPM + 1 × VIBRATION − 830 = 0

Slope-intercept form: y = mx + b, with m = −0.667 and b = 830.
General form: Ax + By + C = 0, with A = 0.667 (w[1]), B = 1 (w[2]), and C = −830 (w[0]).
The slope of the line is −A/B and the y-intercept is −C/B.
Description
• For any instance that is on the decision boundary, the RPM and VIBRATION values satisfy the equality 0.667 × RPM + 1 × VIBRATION − 830 = 0.
• The descriptive feature values of all instances above the decision boundary will result in a positive value when plugged into the decision boundary equation. For example, for RPM = 810 and VIBRATION = 495:
0.667 × 810 + 1 × 495 − 830 = 205.27
• The descriptive feature values of all instances below the decision boundary will result in a negative value. For example, for RPM = 650 and VIBRATION = 240:
0.667 × 650 + 1 × 240 − 830 = −156.45
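A small sketch of this sign check, using the weights from the example above (the helper function is hypothetical, not from the textbook):

```python
# Decision boundary from the example: 0.667 * RPM + 1 * VIBRATION - 830 = 0
def boundary_value(rpm, vibration):
    """Evaluate the decision boundary equation for one instance."""
    return 0.667 * rpm + 1.0 * vibration - 830.0

print(boundary_value(810, 495))  # ~  205.27 -> positive, above the boundary
print(boundary_value(650, 240))  # ~ -156.45 -> negative, below the boundary
```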
Another Example
Consider the line x + y − 3 = 0 (A = 1, B = 1, C = −3).
(3, 1), i.e., x = 3, y = 1:
1 × 3 + 1 × 1 − 3 = 3 + 1 − 3 = 1 (positive, above the line)
(1, 1), i.e., x = 1, y = 1:
1 × 1 + 1 × 1 − 3 = 1 + 1 − 3 = −1 (negative, below the line)
(2, 1), i.e., x = 2, y = 1:
1 × 2 + 1 × 1 − 3 = 2 + 1 − 3 = 0 (on the line)

https://doubleroot.in/lessons/straight-line/general-form/
Side note: A and B represent the vector normal to the line.
A Prediction Model
• Hence we can make a prediction model that outputs 1 when w · d ≥ 0 and 0 otherwise,
where d is the set of descriptive feature values for an instance (including the dummy feature with value 1),
w is the set of weights in the model, and
the good and faulty generator target feature levels are represented as 1 and 0, respectively.
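A minimal sketch of this hard-threshold model, assuming the weights from the earlier decision boundary (w[0] = -830, w[1] = 0.667 for RPM, w[2] = 1 for VIBRATION):

```python
def predict(d, w):
    """Hard-threshold model: output 1 (good) if w . d >= 0, else 0 (faulty).

    d holds the instance's descriptive feature values, including the dummy
    feature d[0] = 1; w is the corresponding weight vector.
    """
    dot = sum(w_j * d_j for w_j, d_j in zip(w, d))
    return 1 if dot >= 0 else 0

w = [-830.0, 0.667, 1.0]          # [w[0], weight for RPM, weight for VIBRATION]
print(predict([1, 810, 495], w))  # 1: good (above the decision boundary)
print(predict([1, 650, 240], w))  # 0: faulty (below the decision boundary)
```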

Decision Surface
• The following figure shows the decision surface for
every possible value of RPM and VIBRATION.

How to determine the values for
the weights?
• In this case we cannot just use the gradient descent algorithm. The hard decision boundary represented in the previous figure is discontinuous, and so is not differentiable, which means we cannot calculate the gradient of the error surface using the derivative.
• Another problem with this model is that the model always makes
completely confident predictions of 0 or 1.
• A model able to distinguish between instances that are very close
to the boundary and those that are farther away would be
preferable.
• We can solve both these problems by using a more sophisticated threshold function that is continuous, and therefore differentiable, and that allows for the desired subtlety: the logistic function.
Logistic Regression
• Instead of the regression function simply being the dot product of the weights and the descriptive feature values, the dot product is passed through the logistic function:
M_w(d) = logistic(w · d) = 1 / (1 + e^(−w · d))
• Another benefit of using the logistic function is that logistic regression model outputs can be interpreted as probabilities of the occurrence of a target level.
• Other than changing the weight update rule, we don't need to make any other changes to the model training process.
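A short sketch of both points, a logistic output and the corresponding weight update; the toy data, learning rate, and iteration count are made up, and the update shown is the squared-error form in which the logistic derivative p(1 - p) appears:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(d, w):
    """Pass the dot product w . d through the logistic function.

    d includes the dummy feature d[0] = 1; the output can be read as the
    predicted probability of the positive target level.
    """
    return logistic(sum(w_j * d_j for w_j, d_j in zip(w, d)))

def gradient_descent_step(data, targets, w, alpha=0.1):
    """One batch weight update; only this rule changes versus linear regression."""
    new_w = list(w)
    for j in range(len(w)):
        grad = 0.0
        for d, t in zip(data, targets):
            p = predict(d, w)
            grad += (t - p) * p * (1 - p) * d[j]   # error scaled by logistic derivative
        new_w[j] = w[j] + alpha * grad
    return new_w

# Made-up instances: [dummy feature, scaled RPM, scaled VIBRATION]; targets 1 = good, 0 = faulty
data = [[1, 0.9, 0.8], [1, 0.8, 0.9], [1, 0.2, 0.1], [1, 0.1, 0.3]]
targets = [1, 1, 0, 0]

w = [0.0, 0.0, 0.0]
for _ in range(2000):
    w = gradient_descent_step(data, targets, w)

print([round(predict(d, w), 2) for d in data])  # probabilities move toward [1, 1, 0, 0]
```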
Modelling Non-linear relationships
• Linear models work very well when the underlying
relationships in the data are linear.
• Sometimes, however, the underlying data will
exhibit nonlinear relationships.
• A simple linear regression model cannot handle this
non-linear relationship.
• See example on the next slide.
• To successfully model the relationship between
grass growth and rainfall, we need to introduce
non-linear elements.
Basis Functions
• A generalized way in which to introduce non-linearity is to
introduce basis functions that transform the raw inputs to the
model into non-linear representations but still keep the
model itself linear in terms of the weights.
• To use basis functions, we recast the simple linear regression model as
M_w(d) = w[0] × ϕ_0(d) + w[1] × ϕ_1(d) + ... + w[b] × ϕ_b(d)
where d is a set of m descriptive features, w is a set of weights, and ϕ_0 to ϕ_b are a series of basis functions that each transform the input vector d in a different way.
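A brief sketch of this recasting (a single made-up rainfall feature with polynomial basis functions ϕ_0(d) = 1, ϕ_1(d) = d, ϕ_2(d) = d^2; because the model stays linear in the weights, an ordinary linear least-squares solver fits it unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up rainfall values and a non-linear grass-growth response with noise
rainfall = rng.uniform(0, 10, size=50)
growth = 3 + 4 * rainfall - 0.4 * rainfall**2 + rng.normal(0, 0.5, size=50)

# Apply the basis functions to the raw input: phi_0 = 1, phi_1 = d, phi_2 = d^2
Phi = np.column_stack([np.ones_like(rainfall), rainfall, rainfall**2])

# The model is linear in the weights, so a standard linear solver works as-is
w, *_ = np.linalg.lstsq(Phi, growth, rcond=None)
print(w)  # roughly recovers [3, 4, -0.4]
```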
Example

Basis Function Contd..
• What makes this approach really attractive is that, although this new model, stated in terms of basis functions, captures the non-linear relationship between rainfall and grass growth, the model is still linear in terms of the weights and so can be trained using gradient descent without making any changes to the algorithm.

Support Vector Machines
(SVM)

Introduction
• We want to find the hyperplane that separates the two classes (like we
have done before).
• The problem is that there can be many such hyperplanes that separate the two classes.
• SVM tries to find the optimal hyperplane: the separating hyperplane for which the distance to the closest training points is maximum.
• Those closest training points are called support vectors.
• We compute the distance between the hyperplane and the support
vectors (margin) and the goal is to find the hyperplane that maximizes
this margin.
• The hyperplane for which the margin is maximum is the optimal
hyperplane.
• The idea is that a hyperplane which separates the points, but is also as
far away from any training point as possible, will generalize best.
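A brief sketch of this idea with scikit-learn (assuming sklearn is available; the tiny 2-D dataset is made up, and a large C is used to approximate the hard-margin case described here):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data: class -1 on the left, class +1 on the right
X = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 3.0],
              [5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]         # the weight vector w (normal to the hyperplane)
w0 = clf.intercept_[0]   # the bias w0
margin = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin)
```

The support vectors reported by the fitted model are exactly the closest training points described above, and 2/‖w‖ is the width of the margin the optimizer maximizes.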
Support Vectors

Basics
• We define the separating hyperplane in the same way as we did before, except that this time we write the bias w0 separately: w0 + w · d = 0
• For instances above a separating hyperplane, w0 + w · d > 0, and for instances below a separating hyperplane, w0 + w · d < 0.
• We want to find a δ such that we can define two parallel hyperplanes on either side of the separating hyperplane:
w0 + w · d = +δ and w0 + w · d = −δ
• We'd like to maximize δ (which maximizes the margin), subject to the constraint that all points from one class must be on one side of the hyperplane and all points from the other class must be on the other side.
• This will let us find a hyperplane that is exactly in the middle of these two.
• The value of δ is set to 1, and w0 and w are adjusted to maximize the margin.
• If the negative target feature level is −1 and the positive target feature level is +1, we want a prediction model so that instances with the negative target level result in the model outputting ≤ −1 and instances with the positive target level result in the model outputting ≥ +1, or, equivalently, t_i × (w0 + w · d_i) ≥ 1 for every training instance i.
(Note: w0 is denoted b in this figure.)
Why δ = 1 works?
• As we know, w represents the
vector normal to the line (our
decision boundary).
• Changing w’s length does not
affect the decision boundary.

https://jeremykun.com/2017/06/05/formulating-the-support-vector-machine-optimization-problem/
Why δ = 1 works? Contd..
• If we increase the length of w, the absolute value of the dot product w · d will increase.
• We are just scaling the length of w so that w · d + w0 = 1 at the support vectors.
• The following figures show the actual distance between a point on the margin (a support vector) and the decision boundary (value: 2.2), and the dot product of this point with w for two different values of w.
• As we can see, we can adjust w so that the dot product is 1.
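A tiny numeric sketch of this rescaling (the line, the point, and the numbers are made up, so the distance here is about 1.41 rather than the 2.2 shown in the figures): the geometric distance from the support vector to the boundary is unchanged, while scaling w and w0 makes w · d + w0 equal exactly 1 at that point.

```python
import numpy as np

# A separating line w . d + w0 = 0 and a point on the margin (a support vector)
w, w0 = np.array([1.0, 1.0]), -3.0
d = np.array([4.0, 1.0])                 # hypothetical support vector

value = w @ d + w0                       # 2.0 before rescaling
print(value / np.linalg.norm(w))         # ~1.41: geometric distance to the line

# Rescale w and w0 so the value at the support vector becomes exactly 1
scale = 1.0 / value
w_s, w0_s = scale * w, scale * w0

print(w_s @ d + w0_s)                          # 1.0
print((w_s @ d + w0_s) / np.linalg.norm(w_s))  # still ~1.41: the boundary itself is unchanged
```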

That is all for Week 11

In the context of Support Vector Machines (SVMs), the expression w0 + w · d is the equation of a hyperplane used for classification. Breaking down the components:
1. w0 is the bias term, or intercept, of the hyperplane. It is a constant that shifts the hyperplane away from the origin.
2. w is the weight vector. It contains the weights assigned to each feature in the dataset. The direction of w is perpendicular to the hyperplane, and the magnitude of each component of w reflects the importance of the corresponding feature in determining the hyperplane.
3. d is a vector representing a data point, i.e., the set of feature values for the instance you want to classify.
4. The dot product ( · ) multiplies the corresponding elements of w and d and sums them up.
Putting it all together, w0 + w · d is the equation of a hyperplane in feature space. When this expression is evaluated for a given data point d, the result indicates on which side of the hyperplane the point lies:
• If w0 + w · d > 0, the point lies on one side of the hyperplane (positive class).
• If w0 + w · d < 0, the point lies on the other side of the hyperplane (negative class).
