Probability Neuron Network
Parametric Approach: Linear Classifier

Image: an array of 32x32x3 numbers (3072 numbers total), flattened to a 3072x1 vector x
Score function: f(x, W) = Wx + b
W: the parameters or weights (10x3072)
b: the bias (10x1)
Output f(x, W): 10 numbers giving the class scores (10x1)
Example with an image with 4 pixels and 3 classes (cat/dog/ship):

Input pixels x = [56, 231, 24, 2]

Cat row of W: [0.2, -0.5, 0.1, 2.0], bias 1.1  ->  cat score  -96.8
Dog row of W: [1.5, 1.3, 2.1, 0.0], bias 3.2  ->  dog score  437.9
(the ship score is computed from the third row of W in the same way)
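The slide's numbers can be checked directly; a minimal numpy sketch, using only the cat and dog rows of W that survive on the slide (the ship row is omitted):

```python
import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])   # flattened image pixels
W = np.array([[0.2, -0.5, 0.1, 2.0],     # cat row (from the slide)
              [1.5,  1.3, 2.1, 0.0]])    # dog row (from the slide)
b = np.array([1.1, 3.2])                 # biases (from the slide)

scores = W @ x + b                       # f(x, W) = Wx + b, approx. [-96.8, 437.9]
```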
Softmax Classifier (Multinomial Logistic Regression)

The scores s = f(x, W) are interpreted as the unnormalized log probabilities of the classes:
P(Y = k | X = x) = exp(s_k) / Σ_j exp(s_j)

In summary: exponentiate the scores to get unnormalized probabilities, then
normalize them so they sum to 1.

Example scores: cat 3.2, car 5.1, frog -1.7
-> exp -> unnormalized probabilities -> normalize -> probabilities
Q2: Usually at initialization W is small, so all s ≈ 0. What is the loss?
(With all scores near zero, every class gets probability 1/C, so the loss is -log(1/C) = log C.)
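The softmax computation and the initialization question can be sketched in a few lines of numpy on the example scores above:

```python
import numpy as np

s = np.array([3.2, 5.1, -1.7])       # unnormalized log probabilities (cat, car, frog)
p = np.exp(s) / np.sum(np.exp(s))    # normalized probabilities, approx. [0.13, 0.87, 0.00]
loss = -np.log(p[0])                 # cross-entropy loss if "cat" is the correct class

# Q2: with W small all scores are near 0, so every class gets probability 1/C
# and the loss is -log(1/C) = log(C)
C = 3
init_loss = -np.log(1.0 / C)
```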
Training Neural Networks:
Neuron model
Network architectures
Learning algorithms
...
Neuron
An information processing unit that is fundamental to the operation of a
neural network.

Single-input neuron
scalar input x
synaptic weight w
bias b
adder or linear combiner Σ
activation potential v = wx + b
activation function f
neuron output y = f(wx + b)

Adjustable parameters: the synaptic weight w and the bias b
Neuron with vector input
Input vector
x = [x1, x2, ..., xR], R = number of elements in the input vector
Weight vector
w = [w1, w2, ..., wR]
Activation potential: v = wx + b
(inner product of the weight vector and the input vector, plus the bias)
Output: y = f(wx + b) = f(w1 x1 + w2 x2 + ... + wR xR + b)
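The vector-input neuron above amounts to a dot product followed by a nonlinearity; a minimal sketch, where the particular numbers and the choice of tanh as f are illustrative, not from the slides:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # input vector, R = 3 (illustrative values)
w = np.array([0.1,  0.4, -0.2])  # synaptic weights
b = 0.3                          # bias

v = np.dot(w, x) + b             # activation potential v = w.x + b
y = np.tanh(v)                   # neuron output y = f(v), with f = tanh here
```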
[Figure: biological neuron - dendrites carry impulses toward the cell body, and the axon
carries impulses away from the cell body toward the presynaptic terminals]
3. Recurrent networks
• Contains at least one feedback loop
• Powerful temporal learning capabilities
Neural networks: Architectures (without the brain stuff)

(Before) Linear score function: s = Wx
(Now) 2-layer neural network: h = f(W1 x), s = W2 h
x (3072) -> W1 -> h (100) -> W2 -> s (10)
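The 3072 -> 100 -> 10 network can be sketched as two matrix multiplies with a nonlinearity in between; random weights and a random input stand in for trained parameters, and ReLU is assumed as the hidden nonlinearity f (one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)               # flattened input image
W1 = rng.standard_normal((100, 3072)) * 0.01
W2 = rng.standard_normal((10, 100)) * 0.01

h = np.maximum(0.0, W1 @ x)                 # h = f(W1 x), 100 hidden units
s = W2 @ h                                  # s = W2 h, 10 class scores
```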
Activation Functions: tanh(x), ReLU, Maxout, ELU
(saturating activations such as tanh(x) suffer from several problems; see the
vanishing gradient discussion below)
3. Massive connectivity
Neurons in successive layers are fully interconnected
About backpropagation
Multilayer perceptrons can be trained by the backpropagation learning rule
Based on the error-correction learning rule
Backpropagation consists of two passes through the network
1. Forward pass
Input is applied to the network
Input is propagated to the output
Synaptic weights stay frozen
Error signal is calculated
2. Backward pass
Error signal is propagated backward
Error gradients are calculated
Synaptic weights are adjusted
Backpropagation algorithm
Training set {(x_n, d_n)}, n = 1..N, with input x_n ∈ R^M and desired response d_n ∈ R
Error signal at output layer, neuron j, learning iteration n:
e_j(n) = d_j(n) - y_j(n)
Learning objective is to minimize the average error energy E̅ by adjusting the free network parameters
Generalized delta rule:
Δw_ji(n) = -η ∂E(n)/∂w_ji = η δ_j(n) y_i(n), with local gradient δ_j(n) = e_j(n) f'(v_j(n))
Backpropagation algorithm

CASE 1: Neuron j is an output node
Output error e_j(n) is available, so computation of the local gradient is straightforward:
δ_j(n) = e_j(n) f'(v_j(n))
For the logistic activation f(v_j(n)) = 1 / (1 + exp(-a v_j(n))):
f'(v_j(n)) = a exp(-a v_j(n)) / [1 + exp(-a v_j(n))]^2

CASE 2: Neuron j is a hidden node
δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
Backpropagation algorithm
CASE 2: Neuron j is a hidden node ...
Finally, combining the ansatz for the hidden-layer local gradient:
δ_j(n) = -f'(v_j(n)) ∂E(n)/∂y_j(n), where ∂E(n)/∂y_j(n) = -Σ_k δ_k(n) w_kj(n)
Backpropagation summary

2. Backward pass
Recursive computation of local gradients:
Output local gradients: δ_k(n) = e_k(n) f'(v_k(n))
Hidden-layer local gradients: δ_j(n) = f'(v_j(n)) Σ_k δ_k(n) w_kj(n)
1. Initialization
Pick weights and biases from the uniform distribution with zero mean and a variance that induces
local fields between the linear and saturated parts of the logistic function
2. Presentation of training examples
Present an epoch of training examples to the network
3. Forward pass
Propagate the training sample from the network input to the output
Calculate the error signal
4. Backward pass
Recursive computation of local gradients from the output layer toward the input layer
Adaptation of synaptic weights according to the generalized delta rule
5. Iteration
Iterate steps 2-4 until the stopping criterion is met
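The steps above can be sketched as a sequential training loop for a small 2-layer perceptron with logistic activations; the XOR task, network size, learning rate, and epoch count are all illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(v):                                   # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 0.0])          # XOR targets (illustrative task)

# 1. Initialization: small zero-mean uniform weights
W1 = rng.uniform(-0.5, 0.5, (2, 2)); b1 = rng.uniform(-0.5, 0.5, 2)
W2 = rng.uniform(-0.5, 0.5, 2);      b2 = rng.uniform(-0.5, 0.5)
eta = 0.5

def epoch_error():
    y = f(W2 @ f(W1 @ X.T + b1[:, None]) + b2)
    return float(np.mean((d - y) ** 2))

for _ in range(5000):                       # 5. Iteration over epochs
    for x, t in zip(X, d):                  # 2. Present training examples one by one
        # 3. Forward pass
        v1 = W1 @ x + b1; y1 = f(v1)
        v2 = W2 @ y1 + b2; y2 = f(v2)
        # 4. Backward pass: local gradients (f' = f(1 - f) for the logistic)
        delta2 = (t - y2) * y2 * (1 - y2)         # output node
        delta1 = y1 * (1 - y1) * delta2 * W2      # hidden nodes
        # generalized delta rule
        W2 += eta * delta2 * y1; b2 += eta * delta2
        W1 += eta * np.outer(delta1, x); b1 += eta * delta1
```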
Backpropagation for ADALINE
Using backpropagation learning for ADALINE:
no hidden layers, one output neuron
linear activation function: f(v(n)) = v(n), so f'(v(n)) = 1
Backpropagation rule:
Δw_i(n) = η δ(n) y_i(n), with y_i = x_i
δ(n) = e(n) f'(v(n)) = e(n)
=> Δw_i(n) = η e(n) x_i(n) ... the original delta rule
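The delta rule above can be sketched directly: a linear neuron trained with Δw_i = η e x_i recovers a known linear function. The target function, learning rate, and sample count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
w = np.zeros(2); b = 0.0; eta = 0.05

for _ in range(2000):
    x = rng.uniform(-1, 1, 2)
    t = 3.0 * x[0] - 2.0 * x[1] + 0.5   # target: an assumed linear function
    e = t - (w @ x + b)                 # error signal e(n) = d(n) - y(n)
    w += eta * e * x                    # original delta rule
    b += eta * e                        # bias update (input fixed at 1)
```

With noiseless linear targets the weights converge toward [3, -2] and the bias toward 0.5.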
Batch training
Weight updating after the presentation of a complete epoch
Training is more accurate but very slow
Sequential training
Weight updating after the presentation of each training example
Stochastic nature of learning, faster convergence
Important practical reasons for sequential learning:
The algorithm is easy to implement
Provides an effective solution to large and difficult problems
Therefore sequential training is the preferred training mode
A good practice is random order of presentation of training examples
Learning in practice

Learning rate
The factor of proportionality in Δw_ji(n) = η δ_j(n) y_i(n) is the learning rate η
Choose the learning rate as large as possible without causing oscillations
[Figure: error curves for learning rates 0.010, 0.035, and 0.040]
Stopping criteria
1. Gradient vector
– Euclidean norm of the gradient vector reaches a sufficiently small threshold
2. Output error
– Output error is small enough
– Rate of change in the average squared error per epoch is sufficiently small
3. Generalization performance
– Generalization performance has peaked or is adequate
2. Activation function
Faster learning with antisymmetric sigmoid activation functions
A popular choice is:
f(v) = a tanh(bv), with f(1) = 1 and f(-1) = -1
a = 1.72 ... effective gain f'(0) = ab
b = 0.67 ... maximum of the second derivative at v = 1
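The recommended constants can be verified numerically: with a = 1.72 and b = 0.67 the activation passes approximately through f(±1) = ±1:

```python
import numpy as np

a, b = 1.72, 0.67
f = lambda v: a * np.tanh(b * v)   # antisymmetric sigmoid from the slide

f1 = f(1.0)    # approx. 1
f_1 = f(-1.0)  # approx. -1
gain = a * b   # effective gain f'(0) = ab
```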
Heuristics for efficient backpropagation
3. Target values
Must be within the range of the activation function
An offset is recommended, otherwise learning is driven into saturation
Example: set max(target) = 0.9 · max(f)
4. Preprocessing inputs
a) Normalizing mean to zero
b) Decorrelating input variables (by using principal component analysis)
c) Scaling input variables (variances should be approx. equal)
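Steps (a)-(c) can be sketched with numpy: subtract the mean, rotate onto the principal components to decorrelate, then scale each variable to unit variance. The data matrix here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic correlated inputs: 500 samples, 3 variables
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.2]])

X = X - X.mean(axis=0)               # (a) normalize mean to zero
C = np.cov(X, rowvar=False)          # covariance of the inputs
eigval, eigvec = np.linalg.eigh(C)   # principal components
X = X @ eigvec                       # (b) decorrelate (rotate onto the PCs)
X = X / X.std(axis=0)                # (c) scale variances to be approx. equal
```

After these steps the covariance of X is approximately the identity matrix.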
Early stopping
Available data are divided into three sets:
1. Training set – used to train the network
2. Validation set – used to decide when to stop training (early stopping)
3. Test set – used to assess generalization performance
2. Network paralysis
The combination of sigmoidal activations and very large weights can decrease the gradients
almost to zero (vanishing gradient problem), so that training almost stops
3. Local minima
Error surface of a complex network can be very complex, with many hills and valleys
Gradient methods can get trapped in local minima
Solutions: probabilistic learning methods (simulated annealing, ...)
Advanced algorithms
Basic backpropagation is slow
Adjusts the weights in the steepest descent direction (negative of the gradient) in
which the performance function is decreasing most rapidly
It turns out that, although the function decreases most rapidly along the negative
of the gradient, this does not necessarily produce the fastest convergence
Momentum
Momentum constant α, with 0 ≤ α < 1
Accelerates backpropagation in steady downhill directions
[Figure: a large learning rate causes oscillations; a small learning rate converges slowly]
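A one-dimensional sketch of the momentum update Δw(n) = α Δw(n-1) - η g(n) on a toy quadratic error; the values α = 0.9 and η = 0.01 are assumed for illustration:

```python
import numpy as np

eta, alpha = 0.01, 0.9
w, dw = 5.0, 0.0
for _ in range(200):
    g = w                       # gradient of E(w) = 0.5 * w**2
    dw = alpha * dw - eta * g   # momentum term accumulates past updates
    w += dw                     # speeds up steady downhill progress
```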
Resilient backpropagation
Eliminates the harmful effects of the magnitudes of the partial derivatives
Only sign of the derivative is used to determine the direction of weight update, the size
of the weight change is determined by a separate update value
Resilient backpropagation rules:
1. Update value for each weight and bias is increased by a factor δinc if the
derivative of the performance function with respect to that weight has the same sign
for two successive iterations
2. Update value is decreased by a factor δdec if the derivative with respect to that
weight changes sign from the previous iteration
3. If the derivative is zero, then the update value remains the same
4. If weights are oscillating, the weight change is reduced
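Rules 1-3 can be sketched as a simplified Rprop update on a toy quadratic error surface; the constants δinc = 1.2 and δdec = 0.5 and the loss are illustrative, and a full Rprop implementation also bounds the update values:

```python
import numpy as np

d_inc, d_dec = 1.2, 0.5            # δinc, δdec (assumed values)
step = np.full(2, 0.1)             # separate update value per weight
w = np.array([5.0, -4.0])
prev_g = np.zeros(2)

def grad(w):
    # gradient of the toy error E(w) = 0.5 * (w0**2 + 10 * w1**2)
    return np.array([w[0], 10.0 * w[1]])

for _ in range(100):
    g = grad(w)
    sign_product = g * prev_g
    step = np.where(sign_product > 0, step * d_inc,         # rule 1: same sign
           np.where(sign_product < 0, step * d_dec, step))  # rule 2: sign changed
    w = w - np.sign(g) * step      # only the sign of the derivative is used
    prev_g = g
```

Note that the magnitude of the gradient (10x larger for the second weight) does not affect the step size, only its sign does.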
Numerical optimization
Supervised learning as an optimization problem:
The error surface of a multilayer perceptron, expressed by the
instantaneous error energy E(n), is a highly nonlinear function of
the synaptic weight vector w(n):
E(n) = E(w(n))
[Figure: error surface E(w1, w2) over the weight plane (w1, w2)]
Numerical optimization
E(n) = E(w(n))
Expanding the error energy in a Taylor series about w(n):
E(w(n) + Δw(n)) ≈ E(w(n)) + g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)
where
local gradient: g(n) = ∂E(w)/∂w evaluated at w = w(n)
Hessian matrix: H(n) = ∂²E(w)/∂w² evaluated at w = w(n)
Numerical optimization

Steepest descent method (backpropagation)
Weight adjustment proportional to the gradient: Δw(n) = -η g(n)
Simple implementation, but slow convergence
Significant improvement by using higher-order information
Adding a momentum term is a crude approximation to using second-order information about the
error surface

Newton's method
A quadratic approximation of the error surface is the essence of Newton's method:
Δw(n) = -H⁻¹(n) g(n)
where H⁻¹ is the inverse of the Hessian matrix
[Figure: convergence paths of gradient descent vs Newton's method]
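The difference between the two update rules can be sketched on a toy quadratic error E(w) = 0.5 wᵀHw, where Newton's step reaches the minimum in one iteration; the Hessian and learning rate here are illustrative:

```python
import numpy as np

H = np.array([[10.0, 0.0],
              [0.0,  1.0]])            # Hessian of the toy quadratic error
w = np.array([1.0, 1.0])
g = H @ w                              # gradient g = Hw at the current point

w_newton = w - np.linalg.solve(H, g)   # Newton step: reaches the minimum [0, 0]
w_gd = w - 0.05 * g                    # one steepest-descent step: still far off
```

On an ill-conditioned quadratic (eigenvalues 10 and 1), gradient descent progresses unevenly along the two axes, while the Newton step rescales by H⁻¹ and lands on the minimum directly.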
Quasi-Newton algorithms
Problems with the calculation of the Hessian matrix
Inverse Hessian H-1 is required, which is computationally expensive
Hessian has to be nonsingular which is not guaranteed
Hessian for a neural network can be rank deficient
No convergence guarantee for non-quadratic error surface
Quasi-Newton method
Only requires a calculation of the gradient vector g(n)
The method estimates the inverse Hessian directly without matrix inversion
Quasi-Newton variants:
Davidon-Fletcher-Powell algorithm
Broyden-Fletcher-Goldfarb-Shanno algorithm ... best form of Quasi-Newton algorithm!
Levenberg-Marquardt algorithm
Approximates the Hessian as H ≈ J^T J and the gradient as g = J^T e,
where the Jacobian matrix J contains the first derivatives of the network errors with respect to the
weights
The Jacobian can be computed through a standard backpropagation technique that is much less
complex than computing the Hessian matrix
Application for neural networks
Algorithm appears to be the fastest method for training moderate-sized feedforward
neural networks (up to several hundred weights)
Advanced algorithms summary
Practical hints
Variable learning rate algorithm is usually much slower than the other methods
Resilient backpropagation method is very well suited for pattern recognition
problems
Function approximation problems, networks with up to a few hundred weights:
Levenberg-Marquardt algorithm will have the fastest convergence and very
accurate training
Conjugate gradient algorithms perform well over a wide variety of problems,
particularly for networks with a large number of weights (modest memory
requirements)
Performance of multilayer perceptrons
A large number of hidden units leads to a small training error but not necessarily to a
small test error
Adding hidden units always leads to a reduction of the training error
However, adding hidden units will first lead to a reduction of test error but then to an
increase of test error ... (peaking effect, early stopping can be applied)
Size effect summary
[Figure: training and test error rate vs the number of hidden neurons; the test error
is minimized at an optimal number of hidden neurons]