Probability Neuron Network

An image classifier

Unlike, e.g., sorting a list of numbers, there is no obvious way to
hard-code the algorithm for recognizing a cat or other classes.
Parametric Approach: Linear Classifier

f(x, W) = Wx + b

Input image: array of 32x32x3 numbers (3072 numbers total), stretched into a 3072x1 column vector x.
W: parameters or weights, a 10x3072 matrix; b is a 10x1 bias vector.
Output f(x, W): 10x1, i.e. 10 numbers giving the class scores.
Example with an image with 4 pixels, and 3 classes (cat/dog/ship)

Stretch pixels into a column:

    x = [56, 231, 24, 2]^T

        [ 0.2  -0.5   0.1   2.0 ]         [  1.1 ]
    W = [ 1.5   1.3   2.1   0.0 ]     b = [  3.2 ]
        [ 0.0   0.25  0.2  -0.3 ]         [ -1.2 ]

    Wx + b:   -96.8   cat score
              437.9   dog score
               60.75  ship score   (= 0.25·231 + 0.2·24 - 0.3·2 - 1.2)
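To make the arithmetic concrete, here is a minimal numpy sketch of this slide's computation; the class names and numbers are taken from the example above.

```python
import numpy as np

# Rows of W hold the per-class weights (cat, dog, ship); x is the
# 4-pixel image stretched into a column; b holds the per-class biases.
W = np.array([[0.2, -0.5,  0.1,  2.0],
              [1.5,  1.3,  2.1,  0.0],
              [0.0,  0.25, 0.2, -0.3]])
x = np.array([56.0, 231.0, 24.0, 2.0])
b = np.array([1.1, 3.2, -1.2])

scores = W @ x + b          # f(x, W) = Wx + b
print(scores)               # [-96.8, 437.9, 60.75] -> cat, dog, ship
```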
Linear Classifier
f(x,W) = Wx + b

Suppose: 3 training examples, 3 classes. With some W the scores f(x, W) = Wx are:

        cat    3.2    1.3    2.2
        car    5.1    4.9    2.5
        frog  -1.7    2.0   -3.1

A loss function tells how good our current classifier is.

Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an image and $y_i$ is an (integer) label, the loss over the dataset is a sum of the loss over examples:

$L = \frac{1}{N} \sum_i L_i(f(x_i, W), y_i)$
Softmax Classifier (Multinomial Logistic Regression)

Scores are the unnormalized log probabilities of the classes:

$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$   (the softmax function), where $s = f(x_i; W)$

We want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class:

$L_i = -\log P(Y = y_i \mid X = x_i)$

in summary:

$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$

Worked example (cat is the correct class):

         unnormalized              unnormalized
         log probabilities         probabilities             probabilities
cat       3.2       -- exp -->      24.5    -- normalize -->   0.13
car       5.1                      164.0                       0.87
frog     -1.7                        0.18                      0.00

$L_i = -\log(0.13) = 2.04$
Q: What is the min/max possible loss $L_i$?
(The minimum is 0, approached as the correct class's probability goes to 1; the maximum is unbounded, since $-\log p \to \infty$ as $p \to 0$.)

Q2: Usually at initialization W is small, so all $s \approx 0$. What is the loss?
(All classes then get probability $1/C$, so $L_i = \log C$; here $\log 3 \approx 1.1$. This is a useful sanity check at the start of training.)
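The whole pipeline (scores → exp → normalize → negative log likelihood) fits in a few numpy lines. This is a sketch using the scores from the example, with the usual max-subtraction trick for numerical stability:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # unnormalized log probabilities (cat, car, frog)

shifted = scores - scores.max()       # stability trick: softmax is shift-invariant
probs = np.exp(shifted) / np.exp(shifted).sum()   # [0.13, 0.87, 0.00]

y = 0                                 # index of the correct class (cat)
L_i = -np.log(probs[y])               # -log(0.13) ≈ 2.04
print(probs.round(2), round(float(L_i), 2))
```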
Next: Neural Networks

Neuron Model – Network Architectures – Learning

Neuron model
Network architectures
Learning algorithms
…..
Training Neural Networks:

- Activation Functions (use ReLU)


- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier init)
- Batch Normalization (use)
- Babysitting the Learning process
- Hyperparameter Optimization
(random sample hyperparams, in log space when appropriate)
- Parameter update schemes
- Learning rate schedules
- Gradient checking
- Regularization (Dropout etc.)
- Evaluation (Ensembles etc.)
- Transfer learning / fine-tuning
Neuron model

Neuron: the information processing unit that is fundamental to the operation of a neural network.

Single-input neuron:
- scalar input x
- synaptic weight w
- bias b
- adder or linear combiner Σ
- activation potential v = wx + b
- activation function f
- neuron output y

Adjustable parameters: synaptic weight w and bias b.

y = f(wx + b)
Neuron with vector input

Input vector x = [x1, x2, ... xR], R = number of elements in the input vector.
Weight vector w = [w1, w2, ... wR].

Activation potential v = wx + b: the product of the weight vector and the input vector, plus the bias.

y = f(wx + b) = f(w1 x1 + w2 x2 + ... + wR xR + b)
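A direct transcription of this neuron into numpy might look as follows; tanh is just an illustrative choice of f:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """y = f(v), with activation potential v = w.x + b."""
    v = np.dot(w, x) + b       # linear combiner plus bias
    return f(v)

x = np.array([0.5, -1.0, 2.0])    # input vector, R = 3
w = np.array([0.1,  0.4, -0.2])   # synaptic weights
print(neuron(x, w, b=0.3))
```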
[Figure: biological neuron. Impulses are carried toward the cell body by the dendrites; the axon carries impulses away from the cell body to the presynaptic terminals. The analogous artificial neuron applies an activation function, e.g. a sigmoid, to the weighted sum of its inputs. Image by Felipe Perucho, licensed under CC-BY 3.0.]
Network architectures
About network architectures
Two or more neurons can be combined in a layer
Neural network can contain one or more layers
Strong link between network architecture and learning algorithm

1. Single-layer feedforward networks


• Input layer of source nodes projects onto an output layer of neurons
• Single-layer refers to the output layer (the only computation layer)

2. Multi-layer feedforward networks


• One or more hidden layers
• Can extract higher-order statistics

3. Recurrent networks
• Contains at least one feedback loop
• Powerful temporal learning capabilities
Neural networks: Architectures

Left: "2-layer Neural Net", or "1-hidden-layer Neural Net".
Right: "3-layer Neural Net", or "2-hidden-layer Neural Net".
Both use "fully-connected" layers.
Single-layer feedforward networks
Multi-layer feedforward networks

Data flow is strictly feedforward: input → output.
No feedback → static network, easy learning.
Neural networks: without the brain stuff

(Before) Linear score function:  f = Wx
(Now) 2-layer Neural Network:    f = W2 max(0, W1 x)
or 3-layer Neural Network:       f = W3 max(0, W2 max(0, W1 x))

x (3072) → W1 → h (100) → W2 → s (10)
Full implementation of training a 2-layer Neural Network needs ~20 lines:
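The slide's own listing is not reproduced here, but a sketch in the same spirit, training a sigmoid-hidden-layer network on random data with a squared-error loss, fits comfortably in that budget (sizes, learning rate, and weight scaling are illustrative choices):

```python
import numpy as np

N, D_in, H, D_out = 64, 3072, 100, 10                 # batch, input, hidden, output
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H) / np.sqrt(D_in)          # scaled so the sigmoid doesn't saturate
w2 = np.random.randn(H, D_out) / np.sqrt(H)

for t in range(2000):
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))   # forward: sigmoid hidden layer
    y_pred = h.dot(w2)                     # forward: linear output scores
    loss = np.square(y_pred - y).sum()     # squared-error loss

    grad_y_pred = 2.0 * (y_pred - y)       # backward: chain rule, layer by layer
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid derivative is h(1 - h)

    w1 -= 1e-4 * grad_w1                   # plain gradient-descent update
    w2 -= 1e-4 * grad_w2
```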
Activation functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
Activation Functions: Sigmoid

$\sigma(x) = \frac{1}{1 + e^{-x}}$

3 problems:
1. Saturated neurons "kill" the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
Activation Functions: tanh(x)  [LeCun et al., 1991]

- Squashes numbers to range [-1, 1]
- zero-centered (nice)
- still kills gradients when saturated :(


Activation Functions: ReLU (Rectified Linear Unit)

- Computes f(x) = max(0, x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
- Not zero-centered output
- An annoyance: what is the gradient when x < 0? It is zero, so a "dead ReLU" that falls outside the data cloud will never activate and never update.
  => people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
Activation Functions: Leaky ReLU  [Maas et al., 2013] [He et al., 2015]

f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- will not "die"

Parametric Rectifier (PReLU): f(x) = max(αx, x), where the parameter α is learned by backprop.
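For reference, all of the activations above are one-liners in numpy; this sketch collects them (the α defaults are the commonly used values, not prescribed by the slides):

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))     # (0, 1), not zero-centered
def tanh(x):               return np.tanh(x)                   # (-1, 1), zero-centered
def relu(x):               return np.maximum(0.0, x)           # max(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)    # small negative slope
def prelu(x, alpha):       return np.maximum(alpha * x, x)     # alpha learned by backprop
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1.0))
```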
Backpropagation

Multilayer feedforward networks


Backpropagation algorithm
Working with backpropagation
Advanced algorithms
Performance of multilayer perceptrons
Multilayer feedforward networks
Important class of neural networks
Input layer (only distributing inputs, without processing)
One or more hidden layers
Output layer

Commonly referred to as multilayer perceptron


Properties of multilayer perceptrons
1. Neurons include nonlinear activation function
Without nonlinearity, the capacity of the network is reduced to that of
a single layer perceptron
Nonlinearity must be smooth (differentiable everywhere), not hard-limiting as in the original perceptron.
Often, a logistic function is used:

$y = \frac{1}{1 + \exp(-v)}$
2. One or more layers of hidden neurons
Enable learning of complex tasks by extracting features from the input
patterns

3. Massive connectivity
Neurons in successive layers are fully interconnected
About backpropagation
Multilayer perceptrons can be trained by the backpropagation learning rule
Based on the error-correction learning rule
Backpropagation consists of two passes through the network

1. Forward pass
Input is applied to the network
Input is propagated to the output
Synaptic weights stay frozen
Error signal is calculated

2. Backward pass
Error signal is propagated backward
Error gradients are calculated
Synaptic weights are adjusted
Backpropagation algorithm

A set of learning samples (inputs and target outputs):

$\{(x_n, d_n)\}_{n=1}^{N}, \qquad x_n \in \mathbb{R}^M, \; d_n \in \mathbb{R}^R$

Error signal at the output layer, neuron j, learning iteration n:

$e_j(n) = d_j(n) - y_j(n)$

Instantaneous error energy of the output layer with R neurons:

$E(n) = \frac{1}{2} \sum_{j=1}^{R} e_j^2(n)$

Average error energy over the whole learning set:

$\bar{E} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
Backpropagation algorithm

Average error energy $\bar{E}$ represents a cost function as a measure of learning performance.

$\bar{E}$ is a function of the free network parameters:
synaptic weights
bias levels

Learning objective is to minimize the average error energy $\bar{E}$ by adjusting the free network parameters.

Learning results from many presentations of training examples


Epoch learning: network parameters are adjusted after presenting the entire training set
We use an approximation: pattern-by-pattern learning instead of epoch learning
Parameter adjustments are made for each pattern presented to the network
Minimizing instantaneous error energy at each step instead of average error energy
Backpropagation algorithm

Similar to the LMS algorithm, backpropagation applies a correction of weights proportional to the partial derivative of the instantaneous error energy:

$\Delta w_{ji}(n) \propto -\frac{\partial E(n)}{\partial w_{ji}(n)}$

Expressing this gradient by the chain rule:

$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}$

with: output error $e_j = d_j - y_j$, network output $y_j$, induced local field $v_j$, synaptic weight $w_{ji}$, and $E = \frac{1}{2} \sum_{j=1}^{R} e_j^2$.
Backpropagation algorithm

1. Gradient on the output error

$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n)$

2. Gradient on the network output

$e_j(n) = d_j(n) - y_j(n) \;\Rightarrow\; \frac{\partial e_j(n)}{\partial y_j(n)} = -1$

3. Gradient on the induced local field

$\frac{\partial y_j(n)}{\partial v_j(n)} = f'(v_j(n))$

4. Gradient on the synaptic weight

$v_j(n) = \sum_{i=0}^{R} w_{ji}(n) y_i(n) \;\Rightarrow\; \frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$
Backpropagation algorithm

Putting the gradients together:

$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)} = e_j(n)\,(-1)\, f'(v_j(n))\, y_i(n) = -e_j(n) f'(v_j(n)) y_i(n)$

$\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)} = \eta \underbrace{e_j(n) f'(v_j(n))}_{\delta_j(n)} y_i(n)$

with learning rate $\eta$ and local gradient $\delta_j(n)$:

$\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n)$

The correction of a synaptic weight is defined by the delta rule.
Backpropagation algorithm

CASE 1: Neuron j is an output node
The output error $e_j(n)$ is available, so computation of the local gradient is straightforward:

$\delta_j(n) = e_j(n) f'(v_j(n))$

For the logistic activation $f(v_j(n)) = \frac{1}{1 + \exp(-a v_j(n))}$, the derivative is

$f'(v_j(n)) = \frac{a \exp(-a v_j(n))}{[1 + \exp(-a v_j(n))]^2}$

CASE 2: Neuron j is a hidden node
The hidden error is not available → credit assignment problem.
The local gradient is solved by backpropagating errors through the network:

$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} f'(v_j(n))$

How to calculate the derivative of the output error energy E with respect to the hidden layer output $y_j$?
Backpropagation algorithm

CASE 2: Neuron j is a hidden node ...

Instantaneous error energy of the output layer with R neurons:

$E(n) = \frac{1}{2} \sum_{k=1}^{R} e_k^2(n)$

Expressing the gradient of the output error energy E with respect to the hidden layer output $y_j$, using $e_k(n) = d_k(n) - y_k(n) = d_k(n) - f(v_k(n))$ and $v_k(n) = \sum_{j=0}^{M} w_{kj}(n) y_j(n)$:

$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k \frac{\partial e_k(n)}{\partial y_j(n)} = \sum_k e_k \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} = -\sum_k e_k f'(v_k(n)) \, w_{kj} = -\sum_k \delta_k w_{kj}$
Backpropagation algorithm (8/9)

CASE 2: Neuron j is a hidden node ...

Finally, combining the ansatz for the hidden layer local gradient

$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)} f'(v_j(n))$

and the gradient of the output error energy on the hidden layer output

$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k \delta_k w_{kj}$

gives the final result for the hidden layer local gradient:

$\delta_j(n) = f'(v_j(n)) \sum_k \delta_k w_{kj}$
Backpropagation algorithm (9/9)

Backpropagation summary:

$\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n)$
(weight correction = learning rate × local gradient × input of neuron j)

1. Local gradient of an output node:

$\delta_k(n) = e_k(n) f'(v_k(n))$

2. Local gradient of a hidden node:

$\delta_j(n) = f'(v_j(n)) \sum_k \delta_k w_{kj}$
Two passes of computation

1. Forward pass
Input is applied to the network and propagated to the output:
Inputs → Hidden layer output → Output layer output → Output error

$x_i(n) \;\rightarrow\; y_j = f\Big(\sum_i w_{ji} x_i\Big) \;\rightarrow\; y_k = f\Big(\sum_j w_{kj} y_j\Big) \;\rightarrow\; e_k(n) = d_k(n) - y_k(n)$

2. Backward pass
Recursive computing of local gradients:
Output local gradients → Hidden layer local gradients

$\delta_k(n) = e_k(n) f'(v_k(n)), \qquad \delta_j(n) = f'(v_j(n)) \sum_k \delta_k w_{kj}$

Synaptic weights are adjusted according to the local gradients:

$\Delta w_{kj}(n) = \eta \, \delta_k(n) \, y_j(n), \qquad \Delta w_{ji}(n) = \eta \, \delta_j(n) \, x_i(n)$
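A compact numpy sketch of both passes for a single hidden layer follows. It uses the logistic activation (so f'(v) = y(1 − y)) and pattern-by-pattern updates; the sizes, learning rate, and omission of biases are illustrative choices, not part of the derivation:

```python
import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))           # logistic activation

def fprime_from_y(y):
    return y * (1.0 - y)                       # f'(v) expressed via y = f(v)

def train_step(x, d, W1, W2, eta=0.1):
    """One pattern-by-pattern backprop step for a 1-hidden-layer perceptron.
    W1: hidden weights (H x M), W2: output weights (R x H)."""
    # 1. Forward pass: input -> hidden output -> network output -> error
    y_hid = f(W1 @ x)                          # y_j = f(sum_i w_ji x_i)
    y_out = f(W2 @ y_hid)                      # y_k = f(sum_j w_kj y_j)
    e = d - y_out                              # e_k = d_k - y_k

    # 2. Backward pass: output local gradients, then hidden local gradients
    delta_out = e * fprime_from_y(y_out)                   # delta_k = e_k f'(v_k)
    delta_hid = fprime_from_y(y_hid) * (W2.T @ delta_out)  # delta_j = f'(v_j) sum_k delta_k w_kj

    # Delta rule: correction = learning rate * local gradient * input of the neuron
    W2 += eta * np.outer(delta_out, y_hid)
    W1 += eta * np.outer(delta_hid, x)
    return e

# Hypothetical sizes: M = 2 inputs, H = 3 hidden neurons, R = 1 output neuron
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (3, 2))
W2 = rng.uniform(-0.5, 0.5, (1, 3))
err = train_step(np.array([1.0, 0.0]), np.array([1.0]), W1, W2)
```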
Summary of the backpropagation algorithm

1. Initialization
Pick weights and biases from the uniform distribution with zero mean and variance that induces
local fields between the linear and saturated parts of the logistic function

2. Presentation of training samples


For each sample from the epoch, perform forward pass and backward pass

3. Forward pass
Propagate training sample from network input to the output
Calculate the error signal

4. Backward pass
Recursive computation of local gradients from output layer toward input layer
Adaptation of synaptic weights according to generalized delta rule

5. Iteration
Iterate steps 2-4 until the stopping criterion is met
Backpropagation for ADALINE

Using backpropagation learning for ADALINE:
no hidden layers, one output neuron, linear activation function.

$f(v(n)) = v(n) \;\Rightarrow\; f'(v(n)) = 1$

Backpropagation rule:

$\delta(n) = e(n) f'(v(n)) = e(n)$
$\Delta w_i(n) = \eta \, \delta(n) \, y_i(n), \; y_i = x_i \;\Rightarrow\; \Delta w_i(n) = \eta \, e(n) \, x_i(n)$

Original delta rule:

$\Delta w_i(n) = \eta \, e(n) \, x_i(n)$

Backpropagation is a generalization of the delta rule.
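In code the special case is almost trivial; a sketch of one delta-rule step for a linear neuron:

```python
import numpy as np

def adaline_step(x, d, w, eta=0.01):
    """f(v) = v, so f'(v) = 1 and the local gradient reduces to the error e."""
    y = np.dot(w, x)           # linear activation
    e = d - y                  # output error
    w += eta * e * x           # delta rule: Δw_i = η e x_i
    return e
```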


Working with backpropagation

Efficient application of backpropagation requires some “fine-tuning”

Various parameters, functions, and methods should be selected


Training mode (sequential / batch)
Activation function
Learning rate
Momentum
Stopping criterion
Heuristics for efficient backpropagation
Methods for improving generalization
Sequential vs. batch training
Learning results from many presentations of training examples
Epoch = presentation of the entire training set

Batch training
Weight updating after the presentation of a complete epoch
Training is more accurate but very slow

Sequential training
Weight updating after the presentation of each training example
Stochastic nature of learning, faster convergence
Important practical reasons for sequential learning:
The algorithm is easy to implement
Provides an effective solution to large and difficult problems
Therefore sequential training is the preferred training mode
A good practice is random order of presentation of training examples
Activation function

The derivative of the activation function, $f'(v_j(n))$, is required for the computation of local gradients.
The only requirement for an activation function is differentiability.

For example, the logistic function:

$f(v_j(n)) = \frac{1}{1 + \exp(-a v_j(n))}, \qquad a > 0, \; -\infty < v_j(n) < \infty$

Derivative of the logistic function:

$f'(v_j(n)) = \frac{a \exp(-a v_j(n))}{[1 + \exp(-a v_j(n))]^2} = a \, y_j(n) [1 - y_j(n)], \qquad y_j(n) = f(v_j(n))$

The local gradient can therefore be calculated from the neuron output alone, without explicit knowledge of the activation function.
Other activation functions

Using sin() activation functions:

$f(x) = a + \sum_{k=1} c_k \sin(kx + \varphi_k)$

Equivalent to traditional Fourier analysis.
A network with sin() activation functions can be trained by backpropagation.

Example: approximating a periodic function.
[Figure: 8 sigmoid hidden neurons vs. 4 sin hidden neurons]
Learning rate

The learning procedure requires that the change in the weight space be proportional to the error gradient; true gradient descent requires infinitesimal steps.

Learning in practice: the factor of proportionality is the learning rate $\eta$ in $\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n)$.
Choose a learning rate as large as possible without leading to oscillations.

[Figure: learning curves for η = 0.010, η = 0.035, η = 0.040]
Stopping criteria

Generally, backpropagation cannot be shown to converge


No well-defined criteria for stopping its operation

Possible stopping criteria

1. Gradient vector
– Euclidean norm of the gradient vector reaches a sufficiently small value

2. Output error
– Output error is small enough
– Rate of change in the average squared error per epoch is sufficiently small

3. Generalization performance
– Generalization performance has peaked or is adequate

4. Max number of iterations


– We are out of time ...
Heuristics for efficient backpropagation
1. Maximizing information content
General rule: every training example presented to the backpropagation
algorithm should be chosen on the basis that its information content is
the largest possible for the task at hand
Simple technique: randomize the order in which examples are presented
from one epoch to the next

2. Activation function
Faster learning with antisymmetric sigmoid activation functions.
A popular choice is:

$f(v) = a \tanh(bv), \qquad f(1) = 1, \; f(-1) = -1$
$a = 1.72$  (effective gain $f'(0) \approx 1$)
$b = 0.67$  (maximum of the second derivative at $v = 1$)
Heuristics for efficient backpropagation

3. Target values
Must be in the range of the activation function.
An offset is recommended, otherwise learning is driven into saturation.
Example: max(target) = 0.9 · max(f)

4. Preprocessing inputs
a) Normalizing the mean to zero
b) Decorrelating the input variables (by using principal component analysis)
c) Scaling the input variables (variances should be approximately equal)

[Figure: Original → a) Zero mean → b) Decorrelated → c) Equalized variance]
Heuristics for efficient backpropagation

5. Initialization
The choice of initial weights is important for a successful network design:
Large initial values → saturation.
Small initial values → slow learning due to operation only in the nearly flat region around the origin.
A good choice lies between these extremes: the standard deviation of the induced local fields should lie between the linear and saturated parts of the sigmoid function.
tanh activation example (a = 1.72, b = 0.67): synaptic weights should be chosen from a uniform distribution with zero mean and standard deviation $\sigma_w = m^{-1/2}$, where m is the number of synaptic weights of the neuron (see the sketch after point 6).

6. Learning from hints


Prior information about the unknown mapping can be included in the learning process
Initialization
Possible invariance properties, symmetries, ...
Choice of activation functions
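Point 5's recipe translates directly into code. A sketch, assuming a uniform distribution U(−a, a), whose standard deviation a/√3 is set to m^(−1/2):

```python
import numpy as np

def init_weights(m, shape, rng=np.random.default_rng(0)):
    """Uniform weights, zero mean, standard deviation m**-0.5
    (m = number of synaptic connections feeding the neuron)."""
    a = np.sqrt(3.0 / m)          # std of U(-a, a) is a/sqrt(3)
    return rng.uniform(-a, a, size=shape)

W = init_weights(m=256, shape=(100, 256))
print(W.std())                    # close to 256**-0.5 = 0.0625
```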
Generalization
Neural network is able to generalize:
If input-output mapping computed by the network is correct for test data
Test data were not used during training
Test data are from the same population as training data
If a correct network response is given for inputs that are slightly different from the training examples.

[Figure: overfitting vs. good generalization]
Improving generalization
Methods to improve generalization
1. Keeping the network small
2. Early stopping
3. Regularization

Early stopping
Available data are divided into three sets:

1. Training set – used to train the network
2. Validation set – used for early stopping: training is stopped when the validation error starts to increase
3. Test set – used for final estimation of network performance and comparison of various models
Regularization

$mse = \frac{1}{N} \sum_{n=1}^{N} (d_j(n) - y_j(n))^2$

Improving generalization by regularization: modify the performance function with the mean sum of squares of the network weights and biases,

$msw = \frac{1}{M} \sum_{m=1}^{M} w_m^2$

thus obtaining a new performance function

$msreg = \gamma \, mse + (1 - \gamma) \, msw$

Using this performance function, the network will have smaller weights and biases, which forces the network response to be smoother and less likely to overfit.
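A sketch of the modified performance function; the weight list and the γ value are illustrative:

```python
import numpy as np

def msereg(d, y, weights, gamma=0.9):
    """Regularized performance function: γ·mse + (1 − γ)·msw."""
    mse = np.mean((d - y) ** 2)                                       # mean squared error
    msw = np.mean(np.concatenate([w.ravel() for w in weights]) ** 2)  # mean squared weights
    return gamma * mse + (1.0 - gamma) * msw
```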
Limitations of backpropagation
Some properties of backpropagation do not guarantee the algorithm to be universally
useful:

1. Long training process


Possibly due to the non-optimum learning rate
(advanced algorithms address this problem)

2. Network paralysis
The combination of sigmoidal activation and very large weights can decrease gradients almost to zero (vanishing gradient problem) → training almost stops

3. Local minima
Error surface of a complex network can be very complex, with many hills and valleys
Gradient methods can get trapped in local minima
Solutions: probabilistic learning methods (simulated annealing, ...)
Advanced algorithms
Basic backpropagation is slow
Adjusts the weights in the steepest descent direction (negative of the gradient) in
which the performance function is decreasing most rapidly
It turns out that, although the function decreases most rapidly along the negative
of the gradient, this does not necessarily produce the fastest convergence

1. Advanced algorithms based on heuristics


Developed from an analysis of the performance of the standard steepest descent
algorithm
Momentum technique
Variable learning rate backpropagation
Resilient backpropagation

2. Numerical optimization techniques


Application of standard numerical optimization techniques to network training
Quasi-Newton algorithms
Conjugate Gradient algorithms
Levenberg-Marquardt algorithm
Momentum

A simple method of increasing the learning rate yet avoiding the danger of instability:
modify the delta rule by adding a momentum term,

$\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n) + \alpha \, \Delta w_{ji}(n-1)$

with momentum constant $0 \le \alpha < 1$.

Accelerates backpropagation in steady downhill directions.

[Figure: large learning rate (oscillations) vs. small learning rate vs. learning with momentum]
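As a sketch, the momentum term only requires remembering the previous weight change (the η and α values here are illustrative):

```python
import numpy as np

def momentum_update(w, delta, y_in, dw_prev, eta=0.1, alpha=0.9):
    """Δw(n) = η δ(n) y(n) + α Δw(n-1); steady downhill directions accelerate."""
    dw = eta * np.outer(delta, y_in) + alpha * dw_prev
    w += dw
    return dw              # keep for the next iteration
```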


Variable learning rate η(t)

Another method of manipulating the learning rate and momentum to accelerate backpropagation. After each weight update:

If the error decreases:
• weight update is accepted
• learning rate is increased: η(t+1) = ση(t), σ > 1
• if momentum has been previously reset to 0, it is set to its original value

If the error increases less than ζ:
• weight update is accepted
• learning rate is not changed: η(t+1) = η(t)
• if momentum has been previously reset to 0, it is set to its original value

If the error increases more than ζ:
• weight update is discarded
• learning rate is decreased: η(t+1) = ρη(t), 0 < ρ < 1
• momentum is reset to 0

Possible parameter values: ζ = 4%, ρ = 0.7, σ = 1.05


Resilient backpropagation
The slope of a sigmoid function approaches zero as the input gets large.
This causes a problem when using steepest descent to train a network:
the gradient can have a very small magnitude, so changes in the weights are small, even
though the weights are far from their optimal values.

Resilient backpropagation
Eliminates these harmful effects of the magnitudes of the partial derivatives
Only sign of the derivative is used to determine the direction of weight update, the size
of the weight change is determined by a separate update value
Resilient backpropagation rules:
1. The update value for each weight and bias is increased by a factor δinc if the
derivative of the performance function with respect to that weight has the same sign
for two successive iterations
2. The update value is decreased by a factor δdec if the derivative with respect to that
weight changes sign from the previous iteration
3. If the derivative is zero, the update value remains the same
4. If the weights are oscillating, the weight change is reduced
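A sketch of these rules; the increase/decrease factors (1.2, 0.5) and step bounds are the values commonly used in the literature, not mandated by the slides:

```python
import numpy as np

def rprop_step(w, grad, grad_prev, step, d_inc=1.2, d_dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One resilient-backpropagation update: only the sign of the gradient is used;
    per-weight step sizes grow when the sign repeats and shrink when it flips."""
    sign_change = grad * grad_prev
    step = np.where(sign_change > 0, np.minimum(step * d_inc, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * d_dec, step_min), step)
    w -= np.sign(grad) * step      # weight moves opposite the gradient sign
    return w, step
```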
Numerical optimization

Supervised learning as an optimization problem:
the error surface of a multilayer perceptron, expressed by the instantaneous error energy E(n), is a highly nonlinear function of the synaptic weight vector w(n):

$E(n) = E(w(n))$

[Figure: error surface E(w1, w2) over the weight plane (w1, w2)]
Numerical optimization

Expanding the error energy by a Taylor series:

$E(w(n) + \Delta w(n)) \approx E(w(n)) + g^T(n) \Delta w(n) + \frac{1}{2} \Delta w^T(n) H(n) \Delta w(n)$

Local gradient:  $g(n) = \left.\frac{\partial E(w)}{\partial w}\right|_{w = w(n)}$

Hessian matrix:  $H(n) = \left.\frac{\partial^2 E(w)}{\partial w^2}\right|_{w = w(n)}$
Numerical optimization

Steepest descent method (backpropagation):
weight adjustment proportional to the gradient; simple implementation, but slow convergence.

$\Delta w(n) = -\eta \, g(n)$

Significant improvement by using higher-order information:
adding a momentum term → a crude approximation to second-order information about the error surface.
Quadratic approximation of the error surface → the essence of Newton's method,

$\Delta w(n) = -H^{-1}(n) \, g(n)$

where H⁻¹ is the inverse of the Hessian matrix.
[Figure: gradient descent vs. Newton's method trajectories]
Quasi-Newton algorithms
Problems with the calculation of the Hessian matrix
Inverse Hessian H-1 is required, which is computationally expensive
Hessian has to be nonsingular which is not guaranteed
Hessian for a neural network can be rank deficient
No convergence guarantee for non-quadratic error surface

Quasi-Newton method
Only requires a calculation of the gradient vector g(n)
The method estimates the inverse Hessian directly without matrix inversion
Quasi-Newton variants:
Davidon-Fletcher-Powell algorithm
Broyden-Fletcher-Goldfarb-Shanno algorithm ... best form of Quasi-Newton algorithm!

Application for neural networks


The method is fast for small neural networks
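For illustration, a quasi-Newton (BFGS) minimization of a toy quadratic "error surface" via scipy; only the gradient is supplied, and the inverse Hessian is estimated internally (a sketch, not a full network trainer):

```python
import numpy as np
from scipy.optimize import minimize

def E(w):     return 0.5 * w[0]**2 + 2.0 * w[1]**2    # toy error surface
def grad(w):  return np.array([w[0], 4.0 * w[1]])     # its gradient g(w)

result = minimize(E, x0=np.array([3.0, -2.0]), jac=grad, method='BFGS')
print(result.x)   # converges to ~ [0, 0]
```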
Conjugate gradient algorithms
Conjugate gradient algorithms
Second-order methods, avoid computational problems with the inverse Hessian
Search is performed along the conjugate directions, which produces generally faster convergence than
steepest descent directions
1. In most of the conjugate gradient algorithms, the step size is adjusted at each iteration
2. A search is made along the conjugate gradient direction to determine the step size that minimizes
the performance function along that line

Many variants of conjugate gradient algorithms:
- Fletcher-Reeves Update
- Polak-Ribiére Update
- Powell-Beale Restarts
- Scaled Conjugate Gradient

[Figure: gradient descent vs. conjugate gradient trajectories]
Application for neural networks
Perhaps the only method suitable for large-scale problems (hundreds or thousands of adjustable
parameters) → well suited for multilayer perceptrons
Levenberg-Marquardt algorithm

Levenberg-Marquardt algorithm (LM)
Similar to quasi-Newton methods, the LM algorithm was designed to approach second-order
training speed without having to compute the Hessian matrix.
When the performance function has the form of a sum of squares (typical in neural
network training), the Hessian matrix H can be approximated as

$H \approx J^T J$

where the Jacobian matrix J contains the first derivatives of the network errors with respect to the
weights.
The Jacobian can be computed through a standard backpropagation technique that is much less
complex than computing the Hessian matrix.
Application for neural networks
Algorithm appears to be the fastest method for training moderate-sized feedforward
neural networks (up to several hundred weights)
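A sketch of a single LM step under these approximations; J is the Jacobian of the errors e with respect to the weights, and μ is the damping factor that blends gradient descent with Newton's method:

```python
import numpy as np

def lm_step(w, J, e, mu=0.01):
    """Δw solves (JᵀJ + μI) Δw = Jᵀe, using H ≈ JᵀJ and gradient g = Jᵀe."""
    H_approx = J.T @ J                 # Hessian approximation
    g = J.T @ e                        # gradient of 0.5 * sum(e**2)
    dw = np.linalg.solve(H_approx + mu * np.eye(len(w)), g)
    return w - dw
```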
Advanced algorithms summary

Practical hints
Variable learning rate algorithm is usually much slower than the other methods
Resilient backpropagation method is very well suited for pattern recognition
problems
Function approximation problems, networks with up to a few hundred weights:
Levenberg-Marquardt algorithm will have the fastest convergence and very
accurate training
Conjugate gradient algorithms perform well over a wide variety of problems,
particularly for networks with a large number of weights (modest memory
requirements)
Performance of multilayer perceptrons

Approximation error is influenced by

Learning algorithm used ... (discussed in the last section)

This determines how well the error on the training set is minimized.

Number and distribution of learning samples

This determines how well the training samples represent the actual function.

Number of hidden units


This determines the expressive power of the network. For smooth functions only a
few hidden units are needed, for wildly fluctuating functions more hidden units will
be needed
Number of learning samples

Function approximation example y = f(x).
[Figure: fits with 4 learning samples vs. 20 learning samples]

The training set with 4 samples has a small training error but gives very poor generalization.
Number of hidden units

Function approximation example y = f(x).
[Figure: fits with 5 hidden units vs. 20 hidden units]

A large number of hidden units leads to a small training error, but not necessarily to a small test error.
Adding hidden units always leads to a reduction of the training error.
However, adding hidden units will first lead to a reduction of the test error, but then to an increase of the test error (peaking effect; early stopping can be applied).
Size effect summary

[Figure, left: error rate vs. number of training samples; the test-set and training-set error rates approach each other as the number of training samples grows.
Right: error rate vs. number of hidden units; the training-set error keeps decreasing, while the test-set error passes through a minimum at the optimal number of hidden neurons.]
