NN Suppl
Biointelligence Laboratory
Department of Computer Engineering
Seoul National University
Contents
Introduction
Characteristics
Nonlinear I/O mapping
Adaptivity
Generalization ability
Fault-tolerance (graceful degradation)
Biological analogy
ALVINN System
[Figure: image from a forward-mounted camera and the weight values for one of the hidden units]
Application:
Error Correction by a Hopfield Network
[Figure: corrupted input data vs. original target data; corrected data after 10 and 20 iterations; fully corrected data after 35 iterations]
Perceptron
and
Gradient Descent Algorithm
Architecture of a Perceptron
Linear function: $f(\vec{x}) = w_0 + w_1 x_1 + \cdots + w_n x_n$
Perceptron and Decision Hyperplane
A perceptron represents a ‘hyperplane’ decision surface in
the n-dimensional space of instances (i.e. points).
The perceptron outputs 1 for instances lying on one side
of the hyperplane and outputs -1 for instances lying on the
other side.
Equation for the decision hyperplane: $\vec{w} \cdot \vec{x} = 0$.
Some sets of positive and negative examples cannot be separated by any hyperplane (e.g., the XOR function).
$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}$
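A minimal sketch (an illustration, not from the slides) of the thresholded perceptron and its hyperplane decision; the weight values and data points below are made up.

```python
import numpy as np

def perceptron_output(w, x):
    """Output +1 on one side of the hyperplane w0 + w1*x1 + ... + wn*xn = 0, else -1."""
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

# Hypothetical weights realizing the decision boundary -1 + x1 + x2 = 0
w = np.array([-1.0, 1.0, 1.0])
print(perceptron_output(w, np.array([0.2, 0.3])))   # -1: below the hyperplane
print(perceptron_output(w, np.array([0.9, 0.8])))   # +1: above the hyperplane
```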
Gradient Descent Algorithm for
Perceptron Learning
Properties of Gradient Descent
Because the error surface contains only a single global
minimum, the gradient descent algorithm will converge
to a weight vector with minimum error, regardless of
whether the training examples are linearly separable.
Condition: a sufficiently small learning rate
Delta rule: gradient descent on the unthresholded linear output.
Converges only asymptotically toward the error minimum,
possibly requiring unbounded time, but converges regardless of
whether the training data are linearly separable.
Can deal with linearly nonseparable data
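A short sketch (an assumption, not code from the slides) of delta-rule training: batch gradient descent on the unthresholded linear output, with an illustrative learning rate and toy one-dimensional data.

```python
import numpy as np

def delta_rule_train(X, t, eta=0.05, epochs=200):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 with o_d = w . x_d."""
    X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for the bias weight w0
    w = np.zeros(X1.shape[1])
    for _ in range(epochs):
        o = X1 @ w                              # unthresholded linear outputs
        w += eta * (t - o) @ X1                 # dw_i = eta * sum_d (t_d - o_d) * x_id
    return w

# Targets that no threshold on x can separate; the delta rule still converges
# toward the minimum-error weight vector.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([-1.0, 1.0, -1.0, 1.0])
w = delta_rule_train(X, t)
print(w, w[0] + X @ w[1:])
```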
Multilayer Perceptron
Multilayer Network and
Its Decision Boundaries
$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$
Termination Criteria
After a fixed number of iterations (epochs)
Once the error falls below some threshold
Once the validation error meets some criterion
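The validation-error criterion is often implemented as early stopping; a generic sketch, where `run_epoch` and `validation_error` are hypothetical callbacks (assumptions, not names from the slides):

```python
def train_with_early_stopping(run_epoch, validation_error, w, max_epochs=1000, patience=10):
    """Stop once the validation error has not improved for `patience` consecutive epochs."""
    best_w, best_err, bad_epochs = w, float("inf"), 0
    for _ in range(max_epochs):                 # also bounded by a fixed epoch budget
        w = run_epoch(w)                        # one epoch of weight updates
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, bad_epochs = w, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                           # validation criterion met
    return best_w
```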
Adding Momentum
Adding momentum
$\Delta w_{ji}(n) = \eta\, \delta_j x_{ji} + \alpha\, \Delta w_{ji}(n-1), \quad 0 \le \alpha < 1$
Helps escape small local minima in the error surface.
Speeds up convergence.
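A sketch of the momentum update in code (the learning rate and momentum values are illustrative assumptions):

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """dw(n) = -eta * dE/dw + alpha * dw(n-1); note -eta * dE/dw equals eta * delta_j * x_ji."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

# Inside a training loop: w, dw = momentum_step(w, dE_dw, dw), starting from dw = 0.
```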
Derivation of the BP Rule
Notations
x_ji : the ith input to unit j
w_ji : the weight associated with the ith input to unit j
net_j : the weighted sum of inputs for unit j
o_j : the output computed by unit j
t_j : the target output for unit j
σ : the sigmoid function
outputs : the set of units in the final layer of the network
Downstream(j) : the set of units whose immediate inputs include the output of unit j
Derivation of the BP Rule
Error measure: $E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$
Gradient descent: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$
Chain rule: $\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_{ji}$, since $net_j = \sum_i w_{ji} x_{ji}$
Case 1: Rule for Output Unit Weights
Step 1: $\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$
Step 2: $\frac{\partial E_d}{\partial o_j} = -(t_j - o_j)$
Step 3: since $o_j = \sigma(net_j)$, $\frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)$
All together: $\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_{ji}$
Case 2: Rule for Hidden Unit Weights
Step 1:
$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}$
$= \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}$
$= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj} \frac{\partial o_j}{\partial net_j}$
$= \sum_{k \in Downstream(j)} -\delta_k\, w_{kj}\, o_j (1 - o_j)$
Thus: $\Delta w_{ji} = \eta\, \delta_j x_{ji}$, where $\delta_j = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$
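A compact sketch (an illustration, not code from the slides) of one stochastic backpropagation step for a network with a single hidden layer of sigmoid units, applying the output-unit and hidden-unit rules derived above; bias terms are omitted and the learning rate is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_out, eta=0.05):
    """One stochastic gradient step; W_hidden is (n_hidden, n_in), W_out is (n_out, n_hidden)."""
    # Forward pass
    o_h = sigmoid(W_hidden @ x)                  # hidden unit outputs o_j
    o_k = sigmoid(W_out @ o_h)                   # output unit outputs o_k

    # Output units: delta_k = (t_k - o_k) * o_k * (1 - o_k)
    delta_k = (t - o_k) * o_k * (1 - o_k)
    # Hidden units: delta_j = o_j * (1 - o_j) * sum_k delta_k * w_kj
    delta_j = o_h * (1 - o_h) * (W_out.T @ delta_k)

    # Weight updates: dw = eta * delta * input
    W_out += eta * np.outer(delta_k, o_h)
    W_hidden += eta * np.outer(delta_j, x)
    return W_hidden, W_out
```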
Backpropagation for MLP: revisited
Convergence and Local Minima
The error surface for multilayer networks may contain many
different local minima.
BP is guaranteed to converge only to a local minimum.
BP is a highly effective function approximator in practice.
The local minima problem has been found not to be severe in many applications.
Notes
Gradient descent over the complex error surfaces represented by ANNs is still poorly understood.
No methods are known to predict with certainty when local minima will cause difficulties.
We can use only heuristics for avoiding local minima.
Heuristics for Alleviating the Local
Minima Problem
Add a momentum term to the weight-update rule.
Differentiable: provides a useful structure for gradient search.
This structure is quite different from the general-to-specific
ordering in CE, or the simple-to-complex ordering in ID3 or C4.5.
CE: the candidate-elimination algorithm in ‘concept learning’ (T.M. Mitchell)
ID3: a decision-tree learning algorithm for discrete values (R. Quinlan)
C4.5: an improved version of ID3 that handles real values (R. Quinlan)
Hidden Layer Representations
BP can discover useful intermediate representations at the hidden layers of the network, capturing the properties of the input space that are most relevant to learning the target function.
k-fold cross-validation
Cross-validation is performed k different times, each time using a different partitioning of the data into training and validation sets.
The results of the k runs are averaged.
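A generic sketch of the procedure (the `train_fn` and `error_fn` callables are assumptions, not names from the slides):

```python
import numpy as np

def k_fold_cross_validation(X, t, train_fn, error_fn, k=5, seed=0):
    """Average the validation error over k different train/validation partitions."""
    indices = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], t[train_idx])
        errors.append(error_fn(model, X[val_idx], t[val_idx]))
    return float(np.mean(errors))               # the results of the k runs are averaged
```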
Designing an Artificial Neural
Network for Face Recognition
Application
Problem Definition
Input encoding
Output encoding