L6 Neural Network
Lương Thái Lê
Outline
1. NN Introduction
2. Components of a NN
3. ANN architecture
4. Perceptron
5. Gradient descent
Net = w0 + w1x1 + w2x2 + … + wmxm = Σi=0..m wi xi (with x0 = 1)
• Meaning of the bias w0:
• The family of separating functions Net = Σi wi xi (without bias) cannot always separate the examples into two classes
• But Net = Σi wi xi + w0 can
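A tiny numeric sketch (all values are illustrative, not from the slides) makes the point concrete: without w0, the boundary sign(w·x) passes through the origin, so a 1-D threshold at x = 0.5 cannot be represented.

```python
# Illustrative 1-D example: separate x < 0.5 (class -1) from x > 0.5 (class +1).
def net_no_bias(w, x):
    return w * x

def net_with_bias(w0, w, x):
    return w0 + w * x

# With a bias w0 = -0.5, the boundary sits at x = 0.5:
w, w0 = 1.0, -0.5
assert net_with_bias(w0, w, 0.2) < 0 and net_with_bias(w0, w, 0.8) > 0

# Without a bias, sign(w*0.2) == sign(w*0.8) for every w, so no single
# weight can put the two points on different sides of the boundary.
assert net_no_bias(w, 0.2) * net_no_bias(w, 0.8) > 0
```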
Activation function: Hard limiter
• Also called the threshold function
• θ is the threshold value
• Pitfall: not continuous, and its derivative is not continuous either
Activation function: Threshold logic
• Commonly used
• α determines the slope of the linear interval
• Output is in (-1, 1)
• Advantage: continuous, and so is its derivative
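The two activation functions above can be sketched as follows; the smooth tanh variant at the end is an added assumption, included for comparison because gradient-based learning later requires a continuous derivative.

```python
import math

# Hard limiter (threshold) activation: output jumps at the threshold theta,
# so it is discontinuous and has no usable derivative there.
def hard_limiter(net, theta=0.0):
    return 1.0 if net >= theta else -1.0

# Threshold-logic (ramp) activation: alpha sets the slope of the linear
# interval; the output is clipped to [-1, 1].
def ramp(net, alpha=1.0):
    return max(-1.0, min(1.0, alpha * net))

# tanh: a smooth alternative with output in (-1, 1) and a continuous
# derivative everywhere (an assumption added for comparison).
def smooth(net):
    return math.tanh(net)
```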
ANN – network architecture (1)
• The architecture of an ANN is determined by:
• Number of input and output signals
• Number of layers
• Number of neurons on each layer
• Number of weights of each neuron
• How neurons (on a layer, or between layers) connect to
each other
• Which neurons receive error correction signals
• An ANN must have:
• An input layer
• An output layer
• 0, 1, or more hidden layers
ANN – network architecture (2)
• An ANN is said to be fully connected if every output from one layer is
connected to every neuron of the next layer
• An ANN is called a feedforward network if no output of one node is
the input of another node of the same layer (or of a previous layer).
• When the outputs of a node link back to the inputs of a node of the
same layer (or of a previous layer), it is a feedback network.
• Feedback networks that have closed loops are called recurrent
networks
ANN – network architecture (3)
ANN – Learning rules
• Two types of learning in ANN:
• Parameter learning
The goal is to adaptively change the weights of the links in the neural network
• Structure learning
The goal is to adaptively change the network structure, including the number
of neurons and the connection patterns between them
• These two types of learning can be done simultaneously or separately
• Most of the learning methods in ANNs are parameter learning
=> We will consider only parameter learning
Weight learning rules
• At learning step t, the degree of adjustment of the
weight vector w is proportional to the product of
the learning signal r(t) and input x(t)
∆𝒘(𝑡) = η𝑟 (𝑡) 𝒙(𝑡)
where η > 0 is the learning rate
• The learning signal r is a function of w, x and the desired output value d:
r = g(w, x, d)
• Generalized weight learning rule:
∆w(t) = η·g(w(t), x(t), d(t))·x(t)
• Note: xj can be an input signal or an output value from another neuron
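The generalized rule ∆w(t) = η·g(w(t), x(t), d(t))·x(t) can be sketched with a plug-in learning signal g; the perceptron-style signal g = d − Out used below is just one example choice.

```python
# Generalized weight update for one neuron: Δw = η · g(w, x, d) · x.
# The learning signal g is passed in as a function, so different learning
# rules are obtained by swapping g.
def update_weights(w, x, d, eta, g):
    r = g(w, x, d)                                   # learning signal r = g(w, x, d)
    return [wi + eta * r * xi for wi, xi in zip(w, x)]

# Example learning signal (perceptron-style): r = d - Out, with Out = sign(Net).
def perceptron_signal(w, x, d):
    net = sum(wi * xi for wi, xi in zip(w, x))
    out = 1.0 if net >= 0 else -1.0
    return d - out
```

For instance, with w = [0, 0], x = [1, 1], d = −1 and η = 0.5, the neuron outputs +1, the signal is r = −2, and both weights move to −1.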
Perceptron
• A perceptron is the simplest ANN (it includes only one neuron)
• It uses the hard-limiter activation function:
Out = sign(Net(w, x)) = sign(Σj=0..m wj xj)
=> Requirement: the activation functions used in the network must be continuous functions with respect to the weights and have continuous derivatives.
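A minimal perceptron sketch: one neuron with a bias weight w0 (carried by a fixed extra input x0 = 1) and the sign activation, trained with the perceptron rule w ← w + η(d − Out)x. The dataset and η are illustrative choices, not from the slides.

```python
# Out = sign(Σ_j w_j x_j), with x0 = 1 supplying the bias term w0.
def perceptron_out(w, x):
    net = sum(wi * xi for wi, xi in zip(w, [1.0] + x))
    return 1.0 if net >= 0 else -1.0

# Perceptron learning rule: w <- w + eta * (d - Out) * x, applied
# example by example for a fixed number of epochs.
def train_perceptron(data, eta=1.0, epochs=20):
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, d in data:
            out = perceptron_out(w, x)
            w = [wi + eta * (d - out) * xi for wi, xi in zip(w, [1.0] + x)]
    return w

# AND is linearly separable, so a single perceptron can learn it:
data = [([0.0, 0.0], -1.0), ([0.0, 1.0], -1.0),
        ([1.0, 0.0], -1.0), ([1.0, 1.0], 1.0)]
w = train_perceptron(data)
```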
Gradient descent - visualization
Stopping criteria: number of training cycles (epochs), error threshold, …
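A minimal gradient-descent sketch on the toy error function E(w) = (w − 3)², using both stopping criteria above (an epoch budget and an error threshold); the function and all constants are illustrative assumptions.

```python
# Plain gradient descent: repeatedly step against the gradient,
# w <- w - eta * dE/dw, until the error threshold or epoch budget is hit.
def gradient_descent(eta=0.1, max_epochs=1000, e_th=1e-8):
    w = 0.0
    for _ in range(max_epochs):
        grad = 2.0 * (w - 3.0)          # dE/dw for E(w) = (w - 3)^2
        w -= eta * grad
        if (w - 3.0) ** 2 < e_th:       # error-threshold stopping criterion
            break
    return w
```

The minimum of E is at w = 3, and the iterate converges there geometrically for this η.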
Multi-layer ANN and back propagation
algorithm
• A perceptron can only represent a linear separable function
• A multi-layer neural network (NN) trained by the back-propagation (BP) algorithm can represent a highly non-linear separation function
• The BP learning algorithm is used to learn the weights of a multilayer
neural network
• Fixed network structure (neurons and the links between them are fixed)
• For each neuron, the activation function must have a continuous derivative
• The BP algorithm applies the gradient descent strategy in the weights
update rule to minimize error
Back propagation algorithm
• The backpropagation learning algorithm searches for a vector of weights that
minimizes the overall error of the system for the learning set
• BP includes 2 steps:
1. Signal forward step:
• The input signals are forward propagated from the input layer to the output layer
(passing through hidden layers).
2. Error backward step:
• Based on the desired output value of the input vector, the system calculates the error
value
• From the output layer, the error value is propagated back through the network, from
layer to layer, until the input layer
• Error back-propagation is performed through calculating (recursively) the local gradient
value of each neuron
BP-algorithm network structure
• Consider 3-layer NN:
• m input signals xj
• l hidden-layer neurons zq
• n output neurons yi
• wqj is the weight of the link from the input
signal xj to the hidden layer neuron zq
• wiq is the weight of the link from the
hidden layer neuron zq to output neuron yi
• Outq is the (local) output value of the
hidden layer neuron zq
• Outi is the output value of the network
corresponding to the output neuron yi
BP-algorithm: forward propagation (1)
• For each example x:
• The input vector x is propagated from the input layer to the output layer
• The network will produce an actual output value Out (which is a vector of Outi
values)
• For each input vector x:
• a neuron zq in the hidden layer receives a net input of:
Netq = Σj=1..m wqj xj
• and creates a local output of:
Outq = f(Netq) = f(Σj=1..m wqj xj)
where f is an activation function
BP-algorithm: forward propagation (2)
• The net input value of neuron yi at the output layer:
Neti = Σq=1..l wiq Outq = Σq=1..l wiq f(Σj=1..m wqj xj)
• The vector of values Outi, i = 1…n, is the output of the network for vector x
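The forward pass of the 3-layer network above can be sketched directly: m inputs xj, l hidden neurons zq with weights wqj, and n output neurons yi with weights wiq. Using tanh for the activation f is an assumption for the sketch.

```python
import math

# Forward propagation through one hidden layer and one output layer.
# w_hidden[q][j] = w_qj, w_out[i][q] = w_iq, as in the slides.
def forward(x, w_hidden, w_out, f=math.tanh):
    # Out_q = f(Σ_j w_qj x_j) for each hidden neuron z_q
    out_hidden = [f(sum(wq[j] * x[j] for j in range(len(x))))
                  for wq in w_hidden]
    # Out_i = f(Σ_q w_iq Out_q) for each output neuron y_i
    out = [f(sum(wi[q] * out_hidden[q] for q in range(len(out_hidden))))
           for wi in w_out]
    return out_hidden, out
```

For example, with x = [1, −1] and symmetric hidden weights [0.5, 0.5], the hidden net input is 0, so both the hidden and final outputs are 0.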
BP-algorithm: backward propagate (1)
• For each example x
• Error signals due to the difference between the desired output value d and the
actual output value Out are calculated
• These error signals are back-propagated from the output layer to the front
layers, to update the weights.
• To consider error signals and their back propagation, it is necessary to
define an error evaluation function
E(w) = ½ Σi=1..n (di − Outi)² = ½ Σi=1..n (di − f(Neti))²
     = ½ Σi=1..n (di − f(Σq=1..l wiq Outq))²
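The error function can be written directly in code (a sketch; the network outputs Outi are taken as given):

```python
# Squared error E = 1/2 Σ_i (d_i - Out_i)^2 for one training example,
# where d is the desired output vector and out the actual one.
def error(d, out):
    return 0.5 * sum((di - oi) ** 2 for di, oi in zip(d, out))
```

For instance, d = [1, 0] against out = [0, 0] gives E = ½·(1² + 0²) = 0.5.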
BP-algorithm: backward propagate (2)
• Applying gradient descent, the weights from the hidden layer to the output layer are updated by:
∆wiq = −η·∂E/∂wiq
• Applying the chain rule for derivatives, we get:
∆wiq = η δi Outq
• δi is the error signal of neuron yi at the output layer:
δi = (di − Outi)·f′(Neti)
BP-algorithm: backward propagate (3)
• To update the weights of the links from the input layer to the hidden layer,
we also apply the gradient-descent method and the derivative chain rule
∆wqj = −η·∂E/∂wqj = η δq xj
• δq is the error signal of neuron zq at the hidden layer:
δq = f′(Netq) Σi=1..n δi wiq
• According to the above formulas for calculating error signals δi and δq, the
error signal of a neuron in the hidden layer is different from the error signal
of a neuron in the output layer.
• Due to this difference, the weight update procedure in the BP algorithm is
called the generalized delta learning rule
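The two error-signal formulas of the generalized delta rule can be sketched as follows; choosing f = tanh (so that f′(net) = 1 − tanh²(net)) is an assumption, since the slides leave f generic.

```python
import math

# Derivative of the assumed activation f = tanh.
def f_prime(net):
    return 1.0 - math.tanh(net) ** 2

# Output layer: δ_i = (d_i - Out_i) * f'(Net_i)
def delta_output(d, out, net):
    return (d - out) * f_prime(net)

# Hidden layer: δ_q = f'(Net_q) * Σ_i δ_i w_iq, where w_iq_column holds
# the weights from hidden neuron z_q to every output neuron y_i.
def delta_hidden(net_q, deltas_out, w_iq_column):
    return f_prime(net_q) * sum(di * w for di, w in zip(deltas_out, w_iq_column))
```

Both functions only use values at the two ends of each link, which is the local property noted above.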
BP-algorithm: backward propagate (4)
• The error signal δq of neuron zq in the hidden layer is determined by:
• the error signals δi of the output-layer neurons yi to which zq is linked, and
• the weights wiq of those links
• Important features of the BP algorithm: The weight update rule is
local
• To update the weight of a link, the system only needs to use the values at the
two ends of that link
• The general form of the weight updating rule in the BP algorithm is
∆𝑤𝑎𝑏 = η𝛿𝑎 𝑥𝑏
Back_propagation_incremental Alg (1)
Back_propagation_incremental(D, η)
The neural network includes Q layers, q = 1, 2, …, Q
Net_i^q and Out_i^q are the net input and output of neuron i at layer q
The network has m input signals and n output neurons
w_ij^q is the weight of the link from neuron j of layer (q−1) to neuron i of layer q
Step 0 (Initialization)
Choose an error threshold Eth
Initialize the weights with small random values
Let E = 0
Step 1 (Start a learning epoch)
Apply the input vector of learning example k to the input layer (q = 1):
Out_i^q = Out_i^1 = x_i^(k), ∀i
Step 2 (Forward propagation)
Propagate the input signals forward through the network until the network output values Out_i^q are obtained at the output layer
Back_propagation_incremental Alg (2)
Step 3 (Output error computation)
Calculate the output error of the network and the error signal δ_i^q of each neuron in the output layer:
E = E + ½ Σi=1..n (d_i^(k) − Out_i^q)²
δ_i^q = (d_i^(k) − Out_i^q)·f′(Net_i^q)
Step 4 (Error back-propagation)
Back-propagate the error to update the weights and calculate the error signals δ_i^(q−1) for the preceding layers:
∆w_ij^q = η δ_i^q Out_j^(q−1);  w_ij^q = w_ij^q + ∆w_ij^q
δ_i^(q−1) = f′(Net_i^(q−1)) Σj w_ji^q δ_j^q
Step 5 (Check the end of a learning cycle – epoch)
Check if the entire learning set has been used (a learning cycle has been completed – epoch)
If the entire learning set has been used, go to Step 6; otherwise, go to Step 1
Step 6 (Check for overall errors)
If the overall error E is less than the acceptable error threshold (<Eth), the learning process ends and
the learned weights are returned;
Otherwise, reassign E=0, and start a new learning cycle (return to Step 1).
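Steps 0–6 can be sketched for a single hidden layer as below. The choices f = tanh, the hidden-layer size, η, the AND-style dataset, and the explicit bias input are all illustrative assumptions, not fixed by the slides.

```python
import math, random

# Incremental back-propagation for a network with one hidden layer,
# following Steps 0-6 of the algorithm; f = tanh, f'(net) = 1 - Out^2.
def train_bp(data, n_hidden=4, eta=0.3, e_th=0.05, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    m, n = len(data[0][0]), len(data[0][1])
    # Step 0: small random initial weights
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n)]
    e = 0.0
    for _ in range(max_epochs):
        e = 0.0
        for x, d in data:
            # Steps 1-2: forward propagation
            out_h = [math.tanh(sum(w * xi for w, xi in zip(wq, x))) for wq in w_h]
            out = [math.tanh(sum(w * h for w, h in zip(wi, out_h))) for wi in w_o]
            # Step 3: accumulate error, compute output-layer error signals
            e += 0.5 * sum((di - oi) ** 2 for di, oi in zip(d, out))
            d_o = [(di - oi) * (1.0 - oi * oi) for di, oi in zip(d, out)]
            # Step 4: hidden-layer error signals (using the old output
            # weights), then the weight updates Δw = η δ Out
            d_h = [(1.0 - hq * hq) * sum(d_o[i] * w_o[i][q] for i in range(n))
                   for q, hq in enumerate(out_h)]
            for i in range(n):
                for q in range(n_hidden):
                    w_o[i][q] += eta * d_o[i] * out_h[q]
            for q in range(n_hidden):
                for j in range(m):
                    w_h[q][j] += eta * d_h[q] * x[j]
        # Steps 5-6: epoch finished; stop once the overall error is small
        if e < e_th:
            break
    return w_h, w_o, e

# AND-style data with an explicit bias input (last component fixed at 1):
data = [([0.0, 0.0, 1.0], [-1.0]), ([0.0, 1.0, 1.0], [-1.0]),
        ([1.0, 0.0, 1.0], [-1.0]), ([1.0, 1.0, 1.0], [1.0])]
w_h, w_o, e = train_bp(data)
```

Because the updates are applied after every example, this is the incremental (online) variant; a batch variant would accumulate the ∆w over the whole epoch before applying them.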
[Figure slides: Forward propagation visualization (1)–(7); Error counting; Back propagation (1)–(5); Weights updating (1)–(6)]
Weights Initialization
• Usually, the weights are initialized with small random values
• If the weights have large initial values:
• the sigmoid functions reach saturation soon
• the system gets stuck at a local minimum or on a very flat plateau near the starting point
• Recommendations for w_ab^0 (link from neuron b to neuron a):
• Let n_a be the number of neurons in the same layer as neuron a:
w_ab^0 ∈ [−1/n_a, 1/n_a]
• Let k_a be the number of neurons with forward connections to neuron a (= the number of input connections of neuron a):
w_ab^0 ∈ [−3/k_a, 3/k_a]
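The fan-in based recommendation can be sketched as follows, taking the slide's range [−3/k_a, 3/k_a] at face value:

```python
import random

# Initialize one layer's weight matrix with w in [-3/k_a, 3/k_a], where
# fan_in (= k_a) is the number of input connections of each neuron.
def init_weights(fan_in, n_neurons, rng=random):
    bound = 3.0 / fan_in
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(n_neurons)]
```

For example, a layer of 2 neurons with fan-in 6 gets all weights in [−0.5, 0.5], which keeps the activations out of the saturated region early on.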
Learning rate
• The learning rate has an important influence on the efficiency and convergence of the BP learning algorithm
• A large value of η can accelerate the convergence of the learning process, but may cause the system to miss the global optimum or fall into a local optimum
• A small value of η can make the learning process take a very long time
• η is usually chosen experimentally for each problem
• Good values of the learning rate at the beginning may not be good at a later time
• An adaptive (dynamic) learning rate should be used:
• after updating the weights, check whether the update reduces the error value
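One simple adaptive-η heuristic matching the idea above: after an update, grow η slightly if the error went down, otherwise shrink it. The factors 1.05 and 0.7 are arbitrary illustrative choices, not values from the slides.

```python
# Adapt the learning rate based on whether the last weight update
# reduced the error: reward progress, penalize overshooting.
def adapt_eta(eta, prev_error, new_error, grow=1.05, shrink=0.7):
    return eta * grow if new_error < prev_error else eta * shrink
```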