Classification Advanced
■ Neural Network
■ Support Vector Machines
Neural Network
[Figure: a multilayer feed-forward neural network. The input vector X enters the input layer, weights wij connect the layers, one or more hidden layers follow, and the output layer produces the output vector]
Defining a Network Topology
■ Decide the network topology: Specify # of units in the
input layer, # of hidden layers (if > 1), # of units in each
hidden layer, and # of units in the output layer
■ Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0] (see the sketch after this list)
■ For discrete-valued attributes, one input unit per domain value, each initialized to 0
■ Output: for classification with more than two classes, one output unit per class is used
■ If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
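The preprocessing above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the helper names normalize_minmax and one_hot are hypothetical, and the topology (4 input units, one hidden layer of 5 units, 3 output units) is an assumed example.

```python
import numpy as np

def normalize_minmax(X):
    """Scale each attribute (column) of X to the [0.0, 1.0] range."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid divide-by-zero
    return (X - x_min) / span

def one_hot(y, n_classes):
    """One output unit per class: encode class label i as a unit vector."""
    Y = np.zeros((len(y), n_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

# Assumed example topology: 4 input units, one hidden layer of 5 units, 3 output units
X = np.array([[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]])
print(normalize_minmax(X))
print(one_hot(np.array([0, 2]), n_classes=3))
```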
Backpropagation
■ Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
■ For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the actual
target value
■ Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
■ Steps
■ Initialize weights to small random numbers, associated with biases
■ Propagate the inputs forward (by applying activation function)
■ Backpropagate the error (by updating weights and biases)
■ Terminating condition (when error is very small, etc.); a sketch of these steps follows below
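A minimal sketch of the four steps for a one-hidden-layer network with sigmoid units, assuming the squared-error formulation and per-tuple (case) updating discussed on the next slide; the function train_backprop and the toy XOR data are illustrative assumptions, not content from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, Y, n_hidden=4, lr=0.1, epochs=1000):
    """One-hidden-layer network trained by backpropagation (case updating)."""
    rng = np.random.default_rng(0)
    n_in, n_out = X.shape[1], Y.shape[1]
    # Step 1: initialize weights and biases to small random numbers
    W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b2 = np.zeros(n_out)
    for _ in range(epochs):
        for x, t in zip(X, Y):                        # one tuple at a time (case updating)
            # Step 2: propagate the inputs forward through the activation function
            h = sigmoid(x @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            # Step 3: backpropagate the error
            err_o = o * (1 - o) * (t - o)             # output-layer error
            err_h = h * (1 - h) * (err_o @ W2.T)      # hidden-layer error
            # Step 4: update weights and biases
            W2 += lr * np.outer(h, err_o); b2 += lr * err_o
            W1 += lr * np.outer(x, err_h); b1 += lr * err_h
    return W1, b1, W2, b2

# Toy XOR-style data (assumed example)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Y = np.array([[0], [1], [1], [0]], float)
params = train_backprop(X, Y, n_hidden=4, lr=0.5, epochs=5000)
```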
Backpropagation
■ Algorithm: Backpropagation. Neural network learning for
classification or numeric prediction, using the
backpropagation algorithm.
■ Input:
■ D, a data set consisting of the training tuples and their associated target values
■ Note that here we are updating the weights and biases after the
presentation of each tuple. This is referred to as case updating.
■ Alternatively, the weight and bias increments could be
accumulated in variables, so that the weights and biases are
updated after all the tuples in the training set have been
presented. This latter strategy is called epoch updating, where
one iteration through the training set is an epoch. In theory, the
mathematical derivation of backpropagation employs epoch
updating, yet in practice, case updating is more common
because it tends to yield more accurate results.
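A hedged sketch of the two update strategies, assuming a generic grad(W, x, t) callback that returns the weight increment for one tuple (a placeholder, not a real API):

```python
import numpy as np

def case_updating(W, tuples, grad, lr=0.1):
    """Case updating: apply the increment after every tuple is presented."""
    for x, t in tuples:
        W = W + lr * grad(W, x, t)
    return W

def epoch_updating(W, tuples, grad, lr=0.1):
    """Epoch updating: accumulate increments and apply them once per epoch."""
    delta = np.zeros_like(W)
    for x, t in tuples:
        delta += grad(W, x, t)   # increments accumulated, W left unchanged
    return W + lr * delta
```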
Backpropagation
■ Terminating condition: Training stops when
■ All ∆wij in the previous epoch are so small as to be below some specified threshold, or
■ The percentage of tuples misclassified in the previous epoch is below some threshold, or
■ A prespecified number of epochs has expired
■ Neural Network
■ Support Vector Machines
■ Association
Linear Classifiers
f(x, w, b) = sign(w . x - b)
[Figure: the input x passes through the classifier f to produce the estimated label y_est; training points denoting +1 and points denoting -1 are separated by a candidate line]
Linear SVM
Maximum Margin
f(x, w, b) = sign(w . x - b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
[Figure: the same two-class data, with the separating line placed so that the margin between the classes is as wide as possible]
Specifying a line and margin
[Figure: the Classifier Boundary with a parallel Plus-Plane above and Minus-Plane below; the region beyond the Plus-Plane is the "Predict Class = +1" zone and the region beyond the Minus-Plane is the "Predict Class = -1" zone]
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Classify as +1 if w . x + b >= +1
             -1 if w . x + b <= -1
             Universe explodes if -1 < w . x + b < +1
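As an illustration only, the three-way rule above can be written as a small function; the name classify and the sample weight vector are assumptions of this sketch:

```python
import numpy as np

def classify(w, b, x):
    """Margin rule: +1 beyond the plus-plane, -1 beyond the minus-plane, undefined inside."""
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return 0   # inside the margin: no confident prediction ("universe explodes")

w, b = np.array([2.0, 1.0]), -3.0
print(classify(w, b, np.array([3.0, 1.0])))   # falls on the +1 side
print(classify(w, b, np.array([0.5, 0.5])))   # falls on the -1 side
```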
Learning the Maximum Margin Classifier
[Figure: the planes w . x + b = +1, w . x + b = 0, and w . x + b = -1, with a point x+ on the plus-plane and a point x- on the minus-plane, and the "Predict Class = +1" / "Predict Class = -1" zones on either side]
M = Margin Width = 2 / sqrt(w . w)
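The margin width can be recovered with a short derivation. This sketch assumes the standard construction in which x- lies on the minus-plane and x+ = x- + λw is its projection onto the plus-plane along the normal w:

```latex
% Plus-plane:  w \cdot x^{+} + b = +1
% Minus-plane: w \cdot x^{-} + b = -1
% Assume x^{+} = x^{-} + \lambda w for some scalar \lambda.
\begin{align*}
w \cdot (x^{-} + \lambda w) + b = +1
  &\;\Rightarrow\; (w \cdot x^{-} + b) + \lambda\, w \cdot w = +1
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w} \\
M = \lVert x^{+} - x^{-} \rVert = \lambda \lVert w \rVert
  &= \frac{2\,\lVert w \rVert}{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}
\end{align*}
```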
■ xi = (xi1, xi2, xi3, …), yi = +1 or -1
■ x1: # of word "homepage"
■ x2: # of word "welcome"
■ Mathematically, x ∈ X = ℜn, y ∈ Y = {+1, -1}
■ We want to derive a function f: X → Y
■ Linear Classification
■ Binary Classification problem
■ Data above the red line belongs to class 'x'
[Figure: points of class 'x' above a red separating line and points of class 'o' below it]
■ Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples with associated class labels yi
■ There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
■ SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM—Linearly Separable
■ A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
■ The hyperplane defining the sides of the margin:
H1: w0 + w1x1 + w2x2 ≥ +1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ -1 for yi = -1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
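For illustration, a maximum-margin linear classifier and its support vectors can be obtained with scikit-learn; the toy data and the choice of a large C (to approximate the hard-margin, linearly separable case) are assumptions of this sketch, not content from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed example)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], float)
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e3)   # large C approximates the hard-margin case
clf.fit(X, y)

print(clf.support_vectors_)          # the tuples lying on H1 / H2
print(clf.coef_, clf.intercept_)     # the learned W and b of W . X + b = 0
```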
Why Is SVM Effective on High Dimensional Data?
■ Based on the Lagrangian formulation, the above equation can be rewritten as
Kernel functions for Nonlinear Classification
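As a sketch of kernels commonly used for nonlinear SVM classification (polynomial, Gaussian radial basis function, and sigmoid), with parameter values chosen arbitrarily for illustration:

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """K(x, z) = (x . z + c)^degree"""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial basis function: K(x, z) = exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, kappa=1.0, delta=-1.0):
    """K(x, z) = tanh(kappa * x . z + delta)"""
    return np.tanh(kappa * np.dot(x, z) + delta)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))
```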