cs221-lecture11
Lecture 11:
Advanced Machine Learning
Linear Separation
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Basic Neuron
Perceptron Node – Threshold Logic Unit
[Diagram: inputs x1 … xn with weights w1 … wn feed a single threshold unit with output z]
Perceptron Learning Algorithm
[Diagram: two-input perceptron with weights w1 = .4, w2 = -.2 and bias weight .1, output z]
Training set (x1, x2, target t):
x1 = .8, x2 = .3, t = 1
x1 = .4, x2 = .1, t = 0
First Training Instance
[Diagram: first training instance x1 = .8, x2 = .3 fed through the network; output z = 1, target t = 1]
Second Training Instance
[Diagram: second training instance x1 = .4, x2 = .1 fed through the network; output z = 1, target t = 0]
Weight update rule: Δwi = c (t − z) xi
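As a worked check (assuming the .1 weight multiplies a constant bias input of 1 and that the unit outputs z = 1 when the net input is positive): for the first instance, net = .8(.4) + .3(−.2) + .1 = .36, so z = 1 = t and the weights are unchanged; for the second instance, net = .4(.4) + .1(−.2) + .1 = .24, so z = 1 while t = 0, and Δwi = c (t − z) xi subtracts c·xi from each weight.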
Perceptron Rule Learning
Δwij = c (tj − zj) xi
Least perturbation principle:
Only change weights if there is an error
Use a small c rather than changing the weights enough to make the current pattern correct
Scale the change by xi
Create a network with n input and m output nodes
Iteratively apply a pattern from the training set and apply the perceptron rule (a sketch of this loop follows below)
Each iteration through the training set is an epoch
Continue training until the total training-set error is less than some epsilon
Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists
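Below is a minimal sketch of this training loop in plain Python, under the assumptions of a single output node, a constant bias input of 1, and a threshold at 0; the two-pattern data and initial weights reuse the example above.

```python
# Minimal perceptron-rule training sketch (single output node).
# Assumptions: threshold at 0, constant bias input of 1, small learning rate c.

def train_perceptron(patterns, targets, weights, c=0.1, epsilon=0.0, max_epochs=100):
    """Iterate over the training set (epochs) applying Delta w_i = c (t - z) x_i."""
    for epoch in range(max_epochs):
        total_error = 0
        for x, t in zip(patterns, targets):
            x = x + [1.0]                       # append constant bias input
            net = sum(w * xi for w, xi in zip(weights, x))
            z = 1 if net > 0 else 0             # threshold logic unit
            if z != t:                          # least perturbation: only change on error
                weights = [w + c * (t - z) * xi for w, xi in zip(weights, x)]
                total_error += 1
        if total_error <= epsilon:              # stop once training-set error is small enough
            break
    return weights

# Two-pattern example from the slides: weights [w1, w2, bias] = [.4, -.2, .1]
print(train_perceptron([[0.8, 0.3], [0.4, 0.1]], [1, 0], [0.4, -0.2, 0.1]))
```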
Outline
Perceptron Algorithm
Support Vector Machines
Boosting
Support Vector Machines: Overview
• A new, powerful method for 2-class classification
– Original idea: Vapnik, 1965 for linear classifiers
– SVM, Cortes and Vapnik, 1995
– Has become very popular since 2001
• Better generalization (less overfitting)
• Can handle linearly inseparable classification with a globally optimal solution
• Key ideas
– Use a kernel function to transform low-dimensional training samples to a higher-dimensional space (to handle linear inseparability)
– Use quadratic programming (QP) to find the best classifier boundary hyperplane (for a global optimum)
Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: x → f → yest; scatter plot of datapoints labeled +1 and −1 with several candidate separating lines]
Any of these would be fine…
…but which is best?
Classifier Margin
f(x, w, b) = sign(w · x − b)
[Figure: x → f → yest; the margin drawn around a separating line between the +1 and −1 points]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: x → f → yest; the separating line with the widest possible margin]
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the maximum-margin separating line, with the support vectors highlighted]
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There’s some theory that this is a good thing.
5. Empirically it works very, very well.
[Figure repeats the maximum-margin plot: f(x, w, b) = sign(w · x − b), with the support vectors highlighted]
Computing the margin width
[Figure: the “Predict Class = +1” zone above the plus-plane, the “Predict Class = −1” zone below the minus-plane, and the decision boundary w · x + b = 0 between them; M = margin width]
How do we compute M in terms of w and b?
Plus-plane = { x : w · x + b = +1 }
Minus-plane = { x : w · x + b = −1 }
M = 2 / sqrt(w · w)
Maximizing the margin is then a quadratic optimization problem: a quadratic criterion subject to n additional linear inequality constraints and e additional linear equality constraints (a numerical illustration of the margin width follows below).
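A quick numerical illustration of M = 2 / sqrt(w · w); the weight vector here is made up for the example.

```python
import math

# Margin width between the planes w.x + b = +1 and w.x + b = -1 is 2 / ||w||.
w = [3.0, 4.0]                      # example weight vector (illustrative only)
M = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(M)                            # 2 / 5 = 0.4
```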
Learning Maximum Margin with Noise
[Figure: the planes w · x + b = +1, w · x + b = 0, and w · x + b = −1, with some datapoints on the wrong side of their margin]
Given a guess of w, b we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1
The solution of the resulting optimization has the form: f(x) = Σ αi yi xiTx + b
Notice that it relies on an inner product between the test point x and the
support vectors xi – we will return to this later.
Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points.
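A small sketch of evaluating this decision function from dual coefficients; the αi, labels, support vectors, and b below are placeholders for illustration, not values from any real training run.

```python
import numpy as np

def decision(x, alphas, ys, support_vectors, b):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + b, using only inner products."""
    return sum(a * y * np.dot(xi, x) for a, y, xi in zip(alphas, ys, support_vectors)) + b

# Placeholder dual solution (illustrative only).
alphas = [0.5, 0.5]
ys = [+1, -1]
support_vectors = [np.array([2.0, 1.0]), np.array([0.0, 1.0])]
b = -1.0
print(np.sign(decision(np.array([3.0, 1.0]), alphas, ys, support_vectors, b)))
```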
Soft Margin Classification
What if the training set is not linearly separable?
Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
[Figure: points that violate the margin, each labeled with its slack ξi]
Soft Margin Classification Mathematically
Find w and b such that wTw + C Σ ξi is minimized, and for all (xi, yi): yi (wTxi + b) ≥ 1 − ξi with ξi ≥ 0
The parameter C trades off margin width against the total slack (a fitting sketch follows below).
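A hedged sketch of fitting a soft-margin linear SVM with scikit-learn, where the C argument penalizes the slack Σ ξi; the tiny two-class dataset is invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping 2-class data (invented for illustration).
X = np.array([[0, 0], [1, 1], [1, 0], [2, 2], [3, 3], [2, 3]])
y = np.array([-1, -1, -1, +1, +1, +1])

# Smaller C tolerates more slack (softer margin); larger C penalizes slack heavily.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_)     # the datapoints the margin pushes up against
print(clf.predict([[1.5, 1.5]]))
```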
Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to
some higher-dimensional feature space where the training set is
separable:
Φ: x → φ(x)
The “Kernel Trick”
The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
If every data point is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
A kernel function is a function that is equivalent to an inner product in some
feature space.
Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)²
Need to show that K(xi, xj) = φ(xi)Tφ(xj):
K(xi, xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
Thus, a kernel function implicitly maps data to a high-dimensional space
(without the need to compute each φ(x) explicitly).
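A quick numerical check of this identity; the specific vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (1 + x.z)^2 in 2-D."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
K = (1 + xi @ xj) ** 2                 # kernel evaluated directly
print(K, phi(xi) @ phi(xj))            # both print 4.0
```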
What Functions are Kernels?
For some functions K(xi,xj) checking that K(xi,xj)= φ(xi) Tφ(xj) can be
cumbersome.
Mercer’s theorem:
Every positive semi-definite symmetric function is a kernel
Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
K =
K(x1,x1) K(x1,x2) K(x1,x3) … K(x1,xn)
K(x2,x1) K(x2,x2) K(x2,x3) … K(x2,xn)
… … … … …
K(xn,x1) K(xn,x2) K(xn,x3) … K(xn,xn)
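A small sketch of sanity-checking the Mercer condition on a finite sample: build the Gram matrix for a candidate kernel and verify its eigenvalues are non-negative. The sample points and the quadratic kernel below are just examples.

```python
import numpy as np

def K(x, z):
    """Candidate kernel: the quadratic kernel from the earlier example."""
    return (1 + np.dot(x, z)) ** 2

# A few sample points (arbitrary); a true kernel gives a PSD Gram matrix on any sample.
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]])
G = np.array([[K(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(G)        # symmetric matrix, so use eigvalsh
print(np.all(eigvals >= -1e-9))        # True: Gram matrix is positive semi-definite
```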
Examples of Kernel Functions
Linear: K(xi,xj)= xiTxj
Mapping Φ: x → φ(x), where φ(x) is x itself
Why boosting?
A simple algorithm for learning robust classifiers
Freund & Schapire, 1995
Friedman, Hastie, Tibshirani, 1998
Toy example
Weak learners from the family of lines
Each training sample carries a weight, initially wt = 1
At each round we set a new problem for which the previous weak classifier performs at chance again; this is repeated over several rounds.
[Figure: the four weak classifiers f1, f2, f3, f4 obtained over the rounds]
Formal Procedure of AdaBoost
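A minimal sketch of the standard (discrete) AdaBoost procedure with decision-stump weak learners; the stump search and the tiny dataset are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the single-feature threshold stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=10):
    """Returns a list of (alpha, stump) pairs; labels y must be +/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # uniform initial weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = fit_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak classifier
        pred = np.where(X[:, j] > thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)            # upweight the mistakes
        w /= w.sum()                              # renormalize: previous stump is now at chance
        ensemble.append((alpha, (j, thr, sign)))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, j] > thr, s, -s) for a, (j, thr, s) in ensemble)
    return np.sign(score)

# Tiny illustrative dataset.
X = np.array([[0.0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [3, 2], [2, 3], [3, 3]])
y = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
print(predict(adaboost(X, y, T=5), X))
```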
AdaBoost with Perceptron
[Sequence of figures showing the boosted ensemble of perceptron weak learners over successive rounds, starting at t = 1]
Testing Set Performance
Overfitting
● 4916 positive training examples were hand picked, aligned, normalized, and scaled to a base resolution of 24x24
● 10,000 negative examples were selected by randomly picking sub-windows from 9500 images which did not contain faces
Results
Tracking Cars (Hendrik Dahlkamp)