Supervised Machine Learning
VENKATARAMANAIAH G
Assistant Professor, Dept. of ECE
This has the same form as the decision boundary for the two-class case.
The decision regions of such a discriminant are always singly connected and convex. To see this, consider two points xA and xB, both of which lie inside decision region Rk, as illustrated in the figure below. Any point x̂ that lies on the line connecting xA and xB can be expressed in the form
x̂ = λ xA + (1 − λ) xB
where 0 ≤ λ ≤ 1. From the linearity of the discriminant functions, it follows that
yk(x̂) = λ yk(xA) + (1 − λ) yk(xB).
Because both xA and xB lie inside Rk, we have yk(xA) > yj(xA) and yk(xB) > yj(xB) for all j ≠ k, and hence yk(x̂) > yj(x̂), so x̂ also lies inside Rk. Thus Rk is singly connected and convex.
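As a quick numerical illustration of this argument (a sketch that is not part of the original slides; the random discriminants and all variable names are assumptions), the following Python snippet builds K linear discriminants yk(x) = wkTx + wk0 and checks that every convex combination of two points assigned to class k is also assigned to class k.

import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 3                       # number of classes and input dimension
W = rng.normal(size=(K, D))       # weight vectors w_k (rows)
w0 = rng.normal(size=K)           # biases w_k0

def assign(x):
    """Return the class with the largest linear discriminant y_k(x)."""
    return np.argmax(W @ x + w0)

# pick two points that fall in the same decision region R_k
while True:
    xA, xB = rng.normal(size=D), rng.normal(size=D)
    if assign(xA) == assign(xB):
        k = assign(xA)
        break

# every point on the line segment between xA and xB stays in R_k
for lam in np.linspace(0.0, 1.0, 11):
    x_hat = lam * xA + (1.0 - lam) * xB
    assert assign(x_hat) == k
print("segment between xA and xB stays inside R_k =", k)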
Least squares for classification
Consider a general classification problem with K classes, with a 1-of-K binary
coding scheme for the target vector t.
Each class Ck is described by its own linear model, so that
yk(x) = wkTx + wk0, where k = 1, . . . , K.
Here our aim is to determine the parameter matrix W by minimizing a sum-of-squares error function. Consider a training data set {xn, tn} where n = 1, . . . , N, and define a matrix T whose nth row is the vector tnT, together with a matrix X̃ whose nth row is x̃nT. The sum-of-squares error function can then be written as
ED(W̃) = (1/2) Tr{ (X̃W̃ − T)T (X̃W̃ − T) },
which is minimized by W̃ = (X̃TX̃)−1X̃TT, i.e. the pseudo-inverse of X̃ applied to T.
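As a hedged sketch of this least-squares solution (not code from the slides; the toy three-class data, the class centres, and all variable names are assumptions), the snippet below builds the augmented matrix X̃ with a leading column of ones, solves for W̃ with the pseudo-inverse, and classifies a new point by taking the largest of the K linear outputs.

import numpy as np

rng = np.random.default_rng(1)

# toy 3-class data in 2-D, with 1-of-K (one-hot) target vectors t_n
N, D, K = 90, 2, 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(N // K, D))
               for c in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                      # N x K target matrix

X_tilde = np.hstack([np.ones((N, 1)), X])  # prepend x0 = 1 for the bias
W_tilde = np.linalg.pinv(X_tilde) @ T      # least-squares solution

def predict(x):
    """Assign x to the class whose linear model gives the largest output."""
    y = W_tilde.T @ np.concatenate(([1.0], x))
    return int(np.argmax(y))

print(predict(np.array([2.8, 0.2])))       # expected: class 1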
Fig.1a, Fig.1b
It is seen that the additional data points in Fig.1b produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary in Fig.1a. The sum-of-squares error function penalizes predictions that are 'too correct', in that they lie a long way on the correct side of the decision boundary.
However, problems with least squares can be more severe than a simple lack of robustness, as shown in Fig.2. This shows a synthetic data set drawn from three classes in a two-dimensional input space (x1, x2), having the property that linear decision boundaries can give excellent separation between the classes, as shown in Fig.2b. However, the least-squares solution gives poor results, with only a small region of the input space assigned to the green class.
The failure of least squares should not surprise us when we
recall that it corresponds to maximum likelihood under the
assumption of a Gaussian conditional distribution, whereas binary
target vectors clearly have a distribution that is far from
Gaussian.
By adopting more appropriate probabilistic models, we shall obtain
classification techniques with much better properties than least
squares.
Fig.2a Fig.2b
Fisher’s linear discriminant
One way to view a linear classification model is in terms of
dimensionality reduction.
Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using
y = wTx.
If we place a threshold on y and classify y ≥ −w0 as class C1, and otherwise as class C2, we recover the standard linear classifier discussed above.
In general, the projection onto one dimension leads to a considerable
loss of information, and classes that are well separated in the original D-
dimensional space may become strongly overlapping in one dimension.
However, by adjusting the components of the weight vector w, we can
select a projection that maximizes the class separation.
For a two-class problem in which there are N1 points of class C1 and N2 points of class C2, the mean vectors of the two classes are given by
m1 = (1/N1) Σn∈C1 xn,   m2 = (1/N2) Σn∈C2 xn.
The Fisher criterion is the ratio of the between-class variance to the within-class variance of the projected data,
J(w) = (wTSBw) / (wTSWw),
where SB = (m2 − m1)(m2 − m1)T is the between-class covariance matrix and SW is the total within-class covariance matrix, given by
SW = Σn∈C1 (xn − m1)(xn − m1)T + Σn∈C2 (xn − m2)(xn − m2)T.
Maximizing J(w) gives the Fisher direction w ∝ SW−1(m2 − m1).
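A minimal sketch of this in code (the toy Gaussian data and all variable names are assumptions, not material from the slides): the projection direction is computed as w ∝ SW−1(m2 − m1) and the data are then projected onto one dimension with y = wTx.

import numpy as np

rng = np.random.default_rng(2)

# two Gaussian classes in 2-D
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class C1
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(120, 2))   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class mean vectors

# total within-class covariance matrix SW
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to SW^{-1} (m2 - m1)
w = np.linalg.solve(SW, m2 - m1)
w /= np.linalg.norm(w)

# project the data onto one dimension: y = w^T x
y1, y2 = X1 @ w, X2 @ w
print("projected class means:", y1.mean(), y2.mean())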
Gradient descent
Consider a simple linear model y = wx + b with a cost function J(w, b), for example the mean squared error over the training data.
We want to find the values of w and b that correspond to the minimum of the cost function (marked with the red arrow in the illustration).
To start finding the right values we initialize w and b with
some random numbers.
Gradient descent then starts at that point (somewhere around the top of our illustration) and takes one step after another in the direction of steepest descent (i.e., from the top towards the bottom of the illustration) until it reaches the point where the cost function is as small as possible.
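The following Python sketch (illustrative only; the synthetic data, the step count, and the learning rate are assumptions) runs this procedure on a one-variable linear model y = wx + b with a mean-squared-error cost J(w, b), repeatedly stepping in the direction of the negative gradient.

import numpy as np

rng = np.random.default_rng(3)

# synthetic data generated from y = 2x + 1 plus noise
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)

w, b = rng.normal(), rng.normal()   # random initialization
lr = 0.1                            # learning rate

for step in range(500):
    y_hat = w * x + b
    err = y_hat - y
    # gradients of the MSE cost J(w, b) = mean((w*x + b - y)^2)
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    # step in the direction of steepest descent
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")   # should be close to 2 and 1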
Importance of the learning rate
The size of the steps gradient descent takes towards the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.
For gradient descent to reach the local minimum, we must set the learning rate to an appropriate value, neither too low nor too high. If the steps are too big, gradient descent may never reach the local minimum, because it bounces back and forth across the convex cost function (see the left image below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take a very long time (see the right image). So the learning rate should be neither too high nor too low.
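To make the effect of the learning rate concrete, this small sketch (an assumption, using the simple quadratic cost J(w) = w² rather than anything from the slides) compares a rate that is too high, one that is too low, and a reasonable one.

def descend(lr, steps=50, w0=5.0):
    """Gradient descent on J(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

print(descend(lr=1.1))    # too high: the iterates bounce back and forth and diverge
print(descend(lr=1e-4))   # too low: barely moves away from w0 in 50 steps
print(descend(lr=0.1))    # appropriate: converges close to the minimum at w = 0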
Support Vector Machine (SVM)
The support vector machine is another simple algorithm that every machine learning practitioner should have in their arsenal. It is highly preferred by many because it produces significant accuracy with less computational power.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e., the maximum distance between the data points of both classes.
Maximizing the margin distance provides some reinforcement so
that future data points can be classified with more confidence.
Linear Classifiers
A linear classifier maps a data vector x to an estimated label yest using
f(x, w, b) = sign(w · x + b)
where w is the weight vector and b is the bias; an output of +1 denotes one class and −1 denotes the other.
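A one-line Python version of this classifier (illustrative only; numpy arrays are assumed for w and x):

import numpy as np

def f(x, w, b):
    """Linear classifier: returns +1 or -1 depending on the side of the hyperplane."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(f(np.array([1.0, 2.0]), np.array([0.5, -0.25]), b=0.1))   # -> +1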
The hyperplane w · x + b = 0 separates the region where w · x + b > 0 (points classified as +1) from the region where w · x + b < 0 (points classified as −1). How would you classify this data?
Any of these separating hyperplanes would be fine... but which is best?
Classifier Margin
The margin of a linear classifier is the distance from the decision boundary to the closest data points on either side, i.e., the width by which the boundary could be grown before hitting a data point.
Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin. The support vectors are those data points that the margin pushes up against. This is the simplest kind of SVM, called a linear SVM (LSVM).
Why Maximum Margin?
1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically, it works very well.
How do we calculate the distance from a point to a line?
For the line w · x + b = 0, x is a data vector, w is the normal vector to the line, and b is a scale value. The distance from a point x to the line is |w · x + b| / ||w||.
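A small numeric check of this distance formula (the vectors and values below are chosen purely for illustration):

import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])        # normal vector, ||w|| = 5
b = -5.0
x = np.array([2.0, 1.0])
print(distance_to_hyperplane(x, w, b))   # |3*2 + 4*1 - 5| / 5 = 1.0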
Linear SVM Mathematically
Let x+ be a point on the plus-plane w · x + b = +1 and x− the closest point to it on the minus-plane w · x + b = −1. The region w · x + b ≥ +1 is the "predict class = +1" zone, the region w · x + b ≤ −1 is the "predict class = −1" zone, and the decision boundary w · x + b = 0 lies between them. Let M be the margin width.
What we know:
w · x+ + b = +1
w · x− + b = −1
Subtracting gives w · (x+ − x−) = 2, and hence the margin width is
M = (x+ − x−) · w / ||w|| = 2 / ||w||.
Linear SVM Mathematically
Goal:
1) Correctly classify all training data:
w · xi + b ≥ +1 if yi = +1
w · xi + b ≤ −1 if yi = −1
which is the same as requiring yi(w · xi + b) ≥ 1 for all i.
2) Maximize the margin M = 2/||w||, which is the same as minimizing Φ(w) = (1/2) wTw.
We can formulate this as a quadratic optimization problem and solve for w and b:
Minimize (1/2) wTw
subject to yi(w · xi + b) ≥ 1 for all i.
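In practice this quadratic program is handed to an off-the-shelf solver. As a hedged sketch (assuming scikit-learn is available; the toy data and the choice of a very large C to approximate the hard-margin problem are illustrative assumptions), a linear SVM can be fitted and its margin inspected as follows.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# two linearly separable blobs with labels y in {+1, -1}
X = np.vstack([rng.normal([0, 0], 0.4, size=(50, 2)),
               rng.normal([3, 3], 0.4, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # learned weight vector
b = clf.intercept_[0]
print("number of support vectors:", len(clf.support_vectors_))
print("margin width M = 2/||w|| =", 2.0 / np.linalg.norm(w))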
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {+1, −1} be the class label of xi. The decision boundary should classify all points correctly, which gives the constrained optimization problem
minimize (1/2)||w||², where ||w||² = wTw,
subject to yi(w · xi + b) ≥ 1 for all i.
The Lagrangian of the original problem introduces a multiplier αi ≥ 0 for every constraint:
L(w, b, α) = (1/2) wTw − Σi αi [ yi(w · xi + b) − 1 ].
The Dual Optimization Problem
We can transform the problem to its dual, in which the data appear only through dot products xiTxj:
maximize W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xiTxj)
subject to αi ≥ 0 and Σi αi yi = 0.
In the example from the figure, only the support vectors receive non-zero multipliers (e.g. α1 = 0.8, α6 = 1.4), while the remaining points have αi = 0 (e.g. α3 = α4 = α9 = 0).
Non-linearly Separable Problems
We allow an "error" ξi in classification; it is based on the output of the discriminant function wTx + b, and Σi ξi approximates the number of misclassified samples.
New objective function:
minimize (1/2) wTw + C Σi ξi
subject to yi(wTxi + b) ≥ 1 − ξi and ξi ≥ 0 for all i,
where the parameter C trades off the margin width against the training errors.
The Optimization Problem
The dual problem has the same form as before, and w is again recovered as w = Σi αi yi xi. The only difference from the linearly separable case is that there is now an upper bound C on the αi, i.e. 0 ≤ αi ≤ C. Once again, a QP solver can be used to find the αi efficiently.
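As a sketch of this soft-margin case (scikit-learn is assumed; the overlapping toy data and the value of C are illustrative), we can check that w recovered as Σi αi yi xi from the dual coefficients matches the solver's weight vector, and that every |αi yi| is bounded by C.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# two overlapping classes, so some slack xi_i is needed
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([2, 2], 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [+1] * 60)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors
alpha_y = clf.dual_coef_[0]
w_recovered = alpha_y @ clf.support_vectors_        # w = sum_i alpha_i y_i x_i

print(np.allclose(w_recovered, clf.coef_[0]))       # True
print(np.all(np.abs(alpha_y) <= C + 1e-9))          # alphas bounded above by C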
THANK YOU