Topic-04 Classification: Advanced Methods

■ Neural Network
■ Support Vector Machines

Neural Network

■ Backpropagation: A neural network learning algorithm


■ Started by psychologists and neurobiologists to develop
and test computational analogues of neurons
■ A neural network: A set of connected input/output units
where each connection has a weight associated with it
■ During the learning phase, the network learns by
adjusting the weights so as to be able to predict the
correct class label of the input tuples
■ Also referred to as connectionist learning due to the
connections between units
Neural Network
How A Multi-Layer Neural Network Works
■ The inputs to the network correspond to the attributes measured
for each training tuple
■ Inputs are fed simultaneously into the units making up the input
layer
■ They are then weighted and fed simultaneously to a hidden layer
■ The number of hidden layers is arbitrary, although usually only one is used
■ The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network's prediction
■ The network is feed-forward: None of the weights cycles back to
an input unit or to an output unit of a previous layer
■ From a statistical point of view, networks perform nonlinear
regression: Given enough hidden units and enough training
samples, they can closely approximate any function
Neuron: A Hidden/Output Layer Unit
[Figure: a single unit. The components of the input vector x (x0, x1, …, xn) are multiplied by the corresponding weights in the weight vector w (w0, w1, …, wn), summed together with the bias μk, and passed through the activation function f to produce the output y.]
■ An n-dimensional input vector x is mapped into variable y by means of the
scalar product and a nonlinear function mapping
■ The inputs to the unit are the outputs of the units in the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. A nonlinear activation function is then applied to the result.
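To make this concrete, here is a minimal sketch of a single unit's computation in Python with NumPy. The input values, weights, and bias below are made-up values for illustration; the logistic (sigmoid) activation is used, as in the rest of this topic.

```python
import numpy as np

def unit_output(x, w, bias):
    """Output of one hidden/output unit: logistic function of the weighted sum plus bias."""
    net_input = np.dot(w, x) + bias          # weighted sum of inputs plus the unit's bias
    return 1.0 / (1.0 + np.exp(-net_input))  # logistic (sigmoid) activation

x = np.array([0.5, 0.1, 0.9])   # hypothetical inputs (outputs of the previous layer)
w = np.array([0.2, -0.4, 0.7])  # hypothetical connection weights
print(unit_output(x, w, bias=0.1))
```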
A Multi-Layer Feed-Forward Neural Network

[Figure: a multilayer feed-forward neural network. The input vector X feeds the units of the input layer; weighted connections wij link the input layer to the hidden layer and the hidden layer to the output layer, which produces the output vector.]
Defining a Network Topology
■ Decide the network topology: Specify # of units in the
input layer, # of hidden layers (if > 1), # of units in each
hidden layer, and # of units in the output layer
■ Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
■ For discrete-valued attributes, one input unit per domain value is typically used, each initialized to 0
■ For classification with more than two classes, one output unit per class is used
■ If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Backpropagation
■ Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
■ For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the actual
target value
■ Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
■ Steps
■ Initialize weights to small random numbers, associated with biases
■ Propagate the inputs forward (by applying activation function)
■ Backpropagate the error (by updating weights and biases)
■ Terminating condition (when error is very small, etc.)
Backpropagation
■ Algorithm: Backpropagation. Neural network learning for
classification or numeric prediction, using the
backpropagation algorithm.
■ Input:
■ D, a data set consisting of the training tuples and their associated target values;
■ l, the learning rate;
■ network, a multilayer feed-forward network.
■ Output: A trained neural network.
Backpropagation
■ Initialize the weights: The weights in the network are initialized to
small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to
0.5).
■ Each unit has a bias associated with it. The biases are similarly
initialized to small random numbers.
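As an illustration of this initialization step, here is a small sketch. The representation of each layer as a (weight matrix, bias vector) pair is an assumption used throughout the code sketches in this section, not part of the original algorithm statement.

```python
import numpy as np

def init_network(layer_sizes, low=-0.5, high=0.5, seed=0):
    """Build a feed-forward network as a list of (W, theta) pairs with small random values."""
    rng = np.random.default_rng(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.uniform(low, high, size=(n_out, n_in))   # row j holds the weights into unit j
        theta = rng.uniform(low, high, size=n_out)       # one bias per unit
        layers.append((W, theta))
    return layers

# Example: 3 input units, one hidden layer of 2 units, 1 output unit
layers = init_network([3, 2, 1])
```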
■ Each training tuple, X, is processed by the following steps.
■ Propagate the inputs forward: First, the training tuple is fed to the
network’s input layer. The inputs pass through the input units,
unchanged. That is, for an input unit, j, its output, Oj , is equal to its
input value, Ij .
■ Next, the net input and output of each unit in the hidden and output
layers are computed. The net input to a unit in the hidden or output
layers is computed as a linear combination of its inputs.
Backpropagation
■ Each such unit has a number of inputs to it that are, in
fact, the outputs of the units connected to it in the
previous layer. Each connection has a weight. To compute
the net input to the unit, each input connected to the unit
is multiplied by its corresponding weight, and this is
summed.
■ Given a unit j in a hidden or output layer, the net input Ij to unit j is

Ij = Σi wij Oi + θj

■ where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.
Backpropagation
■ Each unit in the hidden and output layers takes its net
input and then applies an activation function to it. The
function symbolizes the activation of the neuron
represented by the unit. The logistic, or sigmoid,
function is used. Given the net input Ij to unit j, then Oj, the output of unit j, is computed as

Oj = 1 / (1 + e^(−Ij))

■ This function is also referred to as a squashing function, because it maps a large input domain onto the smaller range of 0 to 1. The logistic function is nonlinear and differentiable, allowing the backpropagation algorithm to model classification problems that are linearly inseparable.
Backpropagation
■ We compute the output values, Oj , for each hidden layer,
up to and including the output layer, which gives the
network’s prediction. In practice, it is a good idea to cache
(i.e., save) the intermediate output values at each unit as
they are required again later when backpropagating the
error. This trick can substantially reduce the amount of computation required.
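Putting the forward pass together, here is a minimal NumPy sketch. The network representation (a list of weight-matrix/bias-vector pairs, as built by the hypothetical init_network above) is an assumption made for illustration; the cached per-layer outputs are returned because they are needed again when backpropagating the error.

```python
import numpy as np

def sigmoid(I):
    # Logistic (squashing) activation: maps any net input into (0, 1)
    return 1.0 / (1.0 + np.exp(-I))

def forward(x, layers):
    """Propagate the inputs forward through a list of (W, theta) layers.

    Row j of W holds the weights wij into unit j; theta is the layer's bias vector.
    Returns the cached outputs of every layer, starting with the input tuple itself.
    """
    outputs = [np.asarray(x, dtype=float)]  # input units pass their values through unchanged
    for W, theta in layers:
        I = W @ outputs[-1] + theta         # net input: weighted sum of previous outputs plus bias
        outputs.append(sigmoid(I))          # apply the activation function
    return outputs
```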
Backpropagation
■ Backpropagate the error: The error is propagated backward
by updating the weights and biases to reflect the error of the
network’s prediction. For a unit j in the output layer, the error
Errj is computed by

Errj = Oj (1 − Oj) (Tj − Oj)

■ where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple. Note that Oj (1 − Oj) is the derivative of the logistic function.
■ To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is

Errj = Oj (1 − Oj) Σk Errk wjk

■ where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.
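A short sketch of these two error formulas, continuing the NumPy conventions used above (vectors of unit outputs, and weight matrices whose rows index the units of the layer they feed into):

```python
def output_layer_error(O, T):
    """Errj = Oj (1 - Oj)(Tj - Oj) for every unit j of the output layer (NumPy arrays)."""
    return O * (1.0 - O) * (T - O)

def hidden_layer_error(O, W_next, Err_next):
    """Errj = Oj (1 - Oj) * sum over k of Errk * wjk, where k ranges over the next layer."""
    return O * (1.0 - O) * (W_next.T @ Err_next)
```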
Backpropagation
■ The weights and biases are updated to reflect the
propagated errors. Weights are updated by the following
equations, where ∆wij is the change in weight wij:

∆wij = (l) Errj Oi
wij = wij + ∆wij
Backpropagation
■ The variable l is the learning rate, a constant typically
having a value between 0.0 and 1.0.
■ The learning rate helps avoid getting stuck at a local
minimum in decision space (i.e., where the weights appear
to converge, but are not the optimum solution) and
encourages finding the global minimum.
■ If the learning rate is too small, then learning will occur at
a very slow pace. If the learning rate is too large, then
oscillation between inadequate solutions may occur.
■ A rule of thumb is to set the learning rate to 1/t, where t
is the number of iterations through the training set so far.
Backpropagation
■ Biases are updated by the following equations, where ∆θj is the change in bias θj:

∆θj = (l) Errj
θj = θj + ∆θj

■ Note that here we are updating the weights and biases after the
presentation of each tuple. This is referred to as case updating.
■ Alternatively, the weight and bias increments could be
accumulated in variables, so that the weights and biases are
updated after all the tuples in the training set have been
presented. This latter strategy is called epoch updating, where
one iteration through the training set is an epoch. In theory, the
mathematical derivation of backpropagation employs epoch
updating, yet in practice, case updating is more common
because it tends to yield more accurate results.
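Combining the pieces above, a minimal case-updating training loop might look like the sketch below. It reuses the hypothetical forward, output_layer_error, and hidden_layer_error helpers defined earlier and omits the terminating-condition checks discussed next; it is an illustration of the update rules, not a production implementation.

```python
import numpy as np

def train_backprop(D, layers, l=0.1, epochs=100):
    """Case updating: weights and biases are adjusted after each training tuple."""
    for _ in range(epochs):
        for x, target in D:                      # D: iterable of (input tuple, target vector)
            outputs = forward(x, layers)         # cached outputs of every layer
            # Backpropagate the error: output layer first, then each hidden layer
            errs = [None] * len(layers)
            errs[-1] = output_layer_error(outputs[-1], target)
            for i in range(len(layers) - 2, -1, -1):
                W_next, _ = layers[i + 1]
                errs[i] = hidden_layer_error(outputs[i + 1], W_next, errs[i + 1])
            # Update weights and biases in place
            for i, (W, theta) in enumerate(layers):
                W += l * np.outer(errs[i], outputs[i])   # ∆wij = l * Errj * Oi
                theta += l * errs[i]                     # ∆θj = l * Errj
    return layers
```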
Backpropagation
■ Terminating condition: Training stops when
■ All ∆wij in the previous epoch are so small as to be below some specified threshold, or
■ The percentage of tuples misclassified in the previous epoch is below some threshold, or
■ A prespecified number of epochs has expired.
■ “How can we classify an unknown tuple using a trained
network?”
■ To classify an unknown tuple, X, the tuple is input to the
trained network, and the net input and output of each unit
are computed. (There is no need for computation and/or
backpropagation of the error.)
■ If there is one output node per class, then the output
node with the highest value determines the predicted class
label for X.
■ If there is only one output node, then output values
greater than or equal to 0.5 may be considered as
belonging to the positive class, while values less than 0.5
may be considered negative
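A corresponding classification sketch, again under the hypothetical layer representation used in the earlier snippets:

```python
import numpy as np

def classify(x, layers):
    """Classify an unknown tuple X: one forward pass, no error backpropagation needed."""
    out = forward(x, layers)[-1]           # net input and output of each unit are computed
    if len(out) == 1:                      # a single output node: threshold at 0.5
        return 1 if out[0] >= 0.5 else 0
    return int(np.argmax(out))             # one output node per class: highest value wins
```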
Efficiency and Interpretability
■ Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| × w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case
■ For easier comprehension: Rule extraction by network pruning
■ Simplify the network structure by removing weighted links that
have the least effect on the trained network
■ Then perform link, unit, or activation value clustering
■ The set of input and activation values are studied to derive rules
describing the relationship between the input and hidden unit
layers
■ Sensitivity analysis: assess the impact that a given input variable
has on a network output. The knowledge gained from this analysis
can be represented in rules
Neural Network as a Classifier
■ Weakness
■ Long training time
■ Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
■ Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and of “hidden units” in the network
■ Strength
■ High tolerance to noisy data
■ Ability to classify untrained patterns
■ Well-suited for continuous-valued inputs and outputs
■ Successful on an array of real-world data, e.g., hand-written letters
■ Algorithms are inherently parallel
■ Techniques have recently been developed for the extraction of
rules from trained neural networks
Chapter 9. Classification: Advanced Methods

■ Neural Network
■ Support Vector Machines
■ Association

Linear Classifiers
f(x, w, b) = sign(w · x − b), mapping an input x to an estimated label yest ∈ {+1, −1}
[Figure: 2-D training data, where + denotes class +1 and − denotes class −1, with one candidate separating line drawn.]
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: the same data with several different candidate separating lines drawn.]
Any of these would be fine...
...but which is best?
Classifier Margin
f(x, w, b) = sign(w · x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a linear SVM, or LSVM).
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the maximum margin linear classifier; the datapoints lying on the edges of the margin are highlighted.]
The support vectors are those datapoints that the margin pushes up against.
Specifying a line and margin
[Figure: the classifier boundary with a parallel plus-plane above it bounding the “Predict Class = +1” zone and a parallel minus-plane below it bounding the “Predict Class = −1” zone.]
• How do we represent this mathematically?
• …in m input dimensions?
Specifying a line and margin
[Figure: the classifier boundary wx + b = 0, the plus-plane wx + b = +1 bounding the “Predict Class = +1” zone, and the minus-plane wx + b = −1 bounding the “Predict Class = −1” zone.]
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
Classify as:
+1 if w · x + b ≥ 1
−1 if w · x + b ≤ −1
Universe explodes if −1 < w · x + b < 1
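As a small illustration of this decision rule (a sketch only; the weight vector and bias below are arbitrary made-up values):

```python
import numpy as np

def classify_linear(x, w, b):
    """Apply the margin-based rule: +1 beyond the plus-plane, -1 beyond the minus-plane."""
    score = np.dot(w, x) + b
    if score >= 1:
        return +1
    if score <= -1:
        return -1
    return None   # inside the margin: the rule above leaves this region undefined

w, b = np.array([1.0, -2.0]), 0.5        # hypothetical parameters
print(classify_linear(np.array([3.0, 0.5]), w, b))
```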
Learning the Maximum Margin Classifier
[Figure: the planes wx + b = +1, wx + b = 0, and wx + b = −1, with a point x⁺ on the plus-plane and a point x⁻ on the minus-plane.]
M = margin width = 2 / ||w||
■ Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
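A brief sketch of those two checks, assuming the training data are stored as a matrix X (one row per tuple) with labels y ∈ {+1, −1}:

```python
import numpy as np

def margin_width(w):
    """Width of the margin between w.x + b = +1 and w.x + b = -1, i.e. 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

def all_in_correct_half_plane(X, y, w, b):
    """True if every training point lies on its correct side, outside the margin."""
    return bool(np.all(y * (X @ w + b) >= 1))
```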
Classification: A Mathematical Mapping
■ Classification: predicts categorical class labels
■ E.g., personal homepage classification
■ xi = (x1, x2, x3, …), yi = +1 or −1
■ x1: # of occurrences of the word “homepage”
■ x2: # of occurrences of the word “welcome”
■ Mathematically, x ∈ X = ℜⁿ, y ∈ Y = {+1, −1}
■ We want to derive a function f: X → Y
■ Linear Classification
■ Binary classification problem
[Figure: 2-D scatter plot of ‘x’ and ‘o’ points separated by a red line.]
■ Data above the red line belongs to class ‘x’
■ Data below the red line belongs to class ‘o’
■ Examples: SVM, Perceptron, Probabilistic Classifiers


SVM—Support Vector Machines
■ A relatively new classification method for both linear and
nonlinear data
■ It uses a nonlinear mapping to transform the original
training data into a higher dimension
■ With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
■ With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
■ SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
SVM—History and Applications
■ Vapnik and colleagues (1992)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
■ Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
■ Used for: classification and numeric prediction
■ Applications:
■ handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy

Small Margin Large Margin


Support Vectors
SVM—Margins and Support Vectors
SVM—When Data Is Linearly Separable

Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple with associated class label yi
There are infinite lines (hyperplanes) separating the two classes but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
■ A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
■ For 2-D it can be written as
w0 + w1x1 + w2x2 = 0
■ The hyperplanes defining the sides of the margin:
H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1, and
H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1
■ Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
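For a concrete illustration, the sketch below fits a linear SVM on a tiny made-up 2-D data set using scikit-learn and reads off W, b, and the support vectors; the data values and the large C (to approximate a hard margin) are assumptions for the example only.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # made-up tuples of class +1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # made-up tuples of class -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin
print("W =", clf.coef_[0])                    # weight vector of the separating hyperplane
print("b =", clf.intercept_[0])               # bias
print("support vectors:", clf.support_vectors_)
```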
Why Is SVM Effective on High Dimensional Data?

■ The complexity of the trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
■ The support vectors are the essential or critical training examples
—they lie closest to the decision boundary (MMH)
■ If all other training examples are removed and the training is
repeated, the same separating hyperplane would be found
■ The number of support vectors found can be used to compute an
(upper) bound on the expected error rate of the SVM classifier, which
is independent of the data dimensionality
■ Thus, an SVM with a small number of support vectors can have good
generalization, even when the dimensionality of the data is high
SVM-Linearly Inseparable
■ Linear SVMs can be extended to create nonlinear SVMs for the
classification of linearly inseparable data (also called nonlinearly
separable data, or nonlinear data for short). Such SVMs are
capable of finding nonlinear decision boundaries (i.e., nonlinear
hypersurfaces) in input space.
■ There are two main steps.
■ In the first step, we transform the original input data into a higher
dimensional space using a nonlinear mapping. Several common
nonlinear mappings can be used in this step, as we will further
describe next.
■ Once the data have been transformed into the new higher space,
the second step searches for a linear separating hyperplane in the
new space. We again end up with a quadratic optimization problem
that can be solved using the linear SVM formulation. The maximal
marginal hyperplane found in the new space corresponds to a
nonlinear separating hypersurface in the original space.
SVM—Linearly Inseparable
■ Transform the original input data into a higher dimensional space
■ Search for a linear separating hyperplane in the new space
■ Based on the Lagrangian formulation, the maximum marginal hyperplane can be rewritten as the decision boundary

d(XT) = Σi=1..l yi αi Xi · XT + b0

■ where yi is the class label of support vector Xi; XT is a test tuple; the αi and b0 are numeric parameters that were determined automatically by the SVM algorithm (the αi are Lagrangian multipliers); and l is the number of support vectors.
■ Given a test tuple, XT, we plug it into the above equation and then
check to see the sign of the result. This tells us on
which side of the hyperplane the test tuple falls.
■ If the sign is positive, then XT falls on or above the
MMH, and so the SVM predicts that XT belongs to
class C1 (representing buys computer=yes, in our
case).
■ If the sign is negative, then XT falls on or below the
MMH and the class prediction is -1 (representing buys
computer=no).
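A minimal sketch of evaluating this decision boundary directly from the support vectors (the parameter names here are illustrative; in practice the αi and b0 come out of the SVM training step):

```python
import numpy as np

def svm_decision(x_test, support_vectors, y_sv, alphas, b0):
    """d(XT) = sum over the l support vectors of yi * alpha_i * (Xi . XT) + b0."""
    return float(np.sum(y_sv * alphas * (support_vectors @ x_test)) + b0)

def svm_predict(x_test, support_vectors, y_sv, alphas, b0):
    """Positive sign: on or above the MMH (class +1); negative sign: class -1."""
    return +1 if svm_decision(x_test, support_vectors, y_sv, alphas, b0) > 0 else -1
```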
Kernel functions for Nonlinear Classification
■ Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
■ Typical kernel functions:
■ Polynomial kernel of degree h: K(Xi, Xj) = (Xi · Xj + 1)^h
■ Gaussian radial basis function kernel: K(Xi, Xj) = e^(−||Xi − Xj||² / 2σ²)
■ Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi · Xj − δ)
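A short sketch of two of these kernels in NumPy (σ, h, and the example vectors are arbitrary illustrative choices):

```python
import numpy as np

def rbf_kernel(Xi, Xj, sigma=1.0):
    """Gaussian RBF kernel: K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2 * sigma^2))."""
    return np.exp(-np.linalg.norm(Xi - Xj) ** 2 / (2.0 * sigma ** 2))

def poly_kernel(Xi, Xj, h=3):
    """Polynomial kernel of degree h: K(Xi, Xj) = (Xi . Xj + 1)^h."""
    return (np.dot(Xi, Xj) + 1.0) ** h

print(rbf_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
print(poly_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```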

■ SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
Kernel functions for Nonlinear Classification
■ Each of these results in a different nonlinear classifier in (the original) input space.
■ The kernel chosen does not generally make a large
difference in resulting accuracy.
■ A major research goal regarding SVMs is to improve the
speed in training and testing so that SVMs may become a
more feasible option for very large data sets (e.g., millions
of support vectors). Other issues include determining the
best kernel for a given data set and finding more efficient
methods for the multiclass case.
