2023-Lecture11-NeuralNetworks
INTRODUCTION TO
NEURAL NETWORKS
2
Artificial neural network
3
What is a neural network?
• A reasoning model based on the human brain, which contains
billions of neurons and trillions of connections between them
4
Biological neural network
• A highly complex, nonlinear, and parallel information-processing
system
• Learning through experience is an essential characteristic.
• Plasticity: connections between neurons leading to the
“right answer” are strengthened while those leading to the
“wrong answer” are weakened.
5
Artificial neural networks (ANN)
• Resemble the human brain in terms of learning mechanisms
• Improve performance through experience and generalization
6
How does an ANN model the brain?
• An ANN consists of many neurons: simple, highly interconnected
processors arranged in a hierarchy of layers.
8
Biological neuron
Artificial neuron
9
How to build an ANN?
• The network architecture must be decided first:
• How many neurons are to be used?
• How are the neurons to be connected to form a network?
• Then determine which learning algorithm to use:
• Supervised / semi-supervised / unsupervised / reinforcement learning
• And finally train the neural network (a minimal sketch follows this slide):
• How to initialize the weights of the network?
• How to update them from a set of training examples?
10
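To make these three decisions concrete, here is a minimal sketch in Python with NumPy; the layer sizes, the choice of supervised learning, and the variable names are illustrative assumptions, not values from the slides:

```python
import numpy as np

# 1) Architecture: how many neurons, and how they are connected.
#    Here: 3 inputs, one hidden layer of 4 neurons, 1 output (illustrative sizes).
n_in, n_hidden, n_out = 3, 4, 1

# 2) Learning algorithm: supervised learning from (input, desired output) pairs.
# 3) Training: initialize weights and thresholds to small random numbers, then
#    update them from training examples (the update rules appear on later slides).
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, size=(n_in, n_hidden))   # input -> hidden weights
theta1 = rng.uniform(-0.5, 0.5, size=n_hidden)       # hidden thresholds
W2 = rng.uniform(-0.5, 0.5, size=(n_hidden, n_out))  # hidden -> output weights
theta2 = rng.uniform(-0.5, 0.5, size=n_out)          # output thresholds
```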
11
12
Source: http://www.asimovinstitute.org/neural-network-zoo/
13
Perceptron and Learning
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.
15
Perceptron learning rule
• Step 1 – Initialization: Initial weights 𝑤₁, 𝑤₂, …, 𝑤ₙ and threshold 𝜃 are randomly
assigned to small numbers (usually in [−0.5, 0.5], but not restricted to that range).
• Step 2 – Activation: At iteration 𝑝, apply the 𝑝-th example, which has inputs
𝑥₁(𝑝), 𝑥₂(𝑝), …, 𝑥ₙ(𝑝) and desired output 𝑌_𝑑(𝑝), and calculate the actual output
$$Y(p) = \sigma\left( \sum_{i=1}^{n} x_i(p)\, w_i(p) - \theta \right), \qquad \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$
where 𝑛 is the number of perceptron inputs and 𝜎 is the step activation function
(a code sketch follows this slide)
16
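A minimal sketch of this learning rule in Python. The weight-update step is not shown in this excerpt of the slides; the code assumes the standard Rosenblatt update Δwᵢ(p) = α·xᵢ(p)·e(p) with error e(p) = Y_d(p) − Y(p), and the learning rate α = 0.1 and the AND training set are illustrative choices:

```python
import numpy as np

def step(x):
    """Hard limiter: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

# Training set for the logical AND operation (illustrative choice).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_d = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=2)   # Step 1: random initial weights
theta = rng.uniform(-0.5, 0.5)       # ... and threshold
alpha = 0.1                          # learning rate (assumed value)

for epoch in range(100):
    errors = 0
    for x, yd in zip(X, Y_d):
        y = step(np.dot(x, w) - theta)   # Step 2: actual output
        e = yd - y                       # error for this example
        w += alpha * x * e               # assumed Rosenblatt weight update
        theta += alpha * (-1) * e        # threshold treated as a weight on input -1
        errors += abs(e)
    if errors == 0:                      # all examples classified correctly
        break

print(w, theta)  # weights and threshold that realize the AND operation
```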
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations (a quick check follows below).
18
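As a quick check with illustrative weights (these particular values are not from the slides): take 𝑤₁ = 𝑤₂ = 1 with the step activation. Then
$$Y_{\text{AND}} = \sigma(x_1 + x_2 - 1.5), \qquad Y_{\text{OR}} = \sigma(x_1 + x_2 - 0.5)$$
The first is 1 only when both inputs are 1 (AND); the second is 1 whenever at least one input is 1 (OR). XOR, by contrast, is not linearly separable, so no single choice of 𝑤₁, 𝑤₂, 𝜃 works.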
Will a sigmoidal element do better?
• Perceptron can classify only linearly separable patterns
regardless of the activation function used (Shynk, 1990;
Shynk and Bershad, 1992)
• Solution: advanced forms of neural networks (e.g., multi-
layer perceptrons trained with the back-propagation algorithm)
19
An example of a perceptron
(Figure: a perceptron whose output answers "go to the festival?")
20
An example of a perceptron
(Figure: inputs "weather" and "partner wants to go" feeding neurons 𝐴, 𝐵 and 𝐶.)
• For every combination (𝑥₁, 𝑥₂), what are the output values at neurons 𝐴, 𝐵 and 𝐶?
22
Multi-layer neural networks
23
Multi-layer neural network
• A feedforward network with one or more hidden layers.
• The input signals are propagated forward on a layer-by-layer basis.
24
Back-propagation algorithm
• Proposed by Bryson and Ho (1969); the most popular of the more than one
hundred different learning algorithms available
25
Back-propagation learning rule
• Step 1 – Initialization: Initial weights and thresholds are assigned to random numbers.
• The numbers may be uniformly distributed in the range $\left(-\frac{2.4}{F_i}, +\frac{2.4}{F_i}\right)$ (Haykin, 1999),
where 𝐹ᵢ is the total number of inputs of neuron 𝑖
• The weight initialization is done on a neuron-by-neuron basis
• Step 2 – Activation: At iteration 𝑝, apply the 𝑝-th example, which has inputs
𝑥₁(𝑝), 𝑥₂(𝑝), …, 𝑥ₙ(𝑝) and desired outputs 𝑦_{𝑑,1}(𝑝), 𝑦_{𝑑,2}(𝑝), …, 𝑦_{𝑑,𝑙}(𝑝).
• (a) Calculate the actual output of neuron 𝑗 in the hidden layer, which has 𝑛 inputs:
$$y_j(p) = \sigma\left( \sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
• (b) Calculate the actual output of neuron 𝑘 in the output layer, which has 𝑚 inputs:
$$y_k(p) = \sigma\left( \sum_{j=1}^{m} y_j(p)\, w_{jk}(p) - \theta_k \right)$$
26
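A sketch of Step 2 (the forward pass) in Python; the 2-2-1 network shape and the use of the same ±2.4/Fᵢ range for the thresholds are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    """Logistic activation: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_hidden, theta_hidden, W_out, theta_out):
    """Step 2: propagate one input vector x through hidden and output layers."""
    y_hidden = sigmoid(x @ W_hidden - theta_hidden)   # (a) hidden-layer outputs y_j(p)
    y_out = sigmoid(y_hidden @ W_out - theta_out)     # (b) output-layer outputs y_k(p)
    return y_hidden, y_out

# Illustrative 2-2-1 network with parameters drawn as in Step 1 (F_i = 2 everywhere).
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-2.4 / 2, 2.4 / 2, size=(2, 2))
theta_hidden = rng.uniform(-2.4 / 2, 2.4 / 2, size=2)
W_out = rng.uniform(-2.4 / 2, 2.4 / 2, size=(2, 1))
theta_out = rng.uniform(-2.4 / 2, 2.4 / 2, size=1)

y_hidden, y_out = forward(np.array([1.0, 0.0]), W_hidden, theta_hidden, W_out, theta_out)
```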
Back-propagation learning rule
• Step 3 – Weight training: Update the weights in the back-propagation network and
propagate backward the errors 𝑒ₖ(𝑝) associated with the output neurons.
• (a) Calculate the error gradient for neuron 𝑘 in the output layer:
$$\delta_k(p) = y_k(p) \times \left[ 1 - y_k(p) \right] \times e_k(p), \qquad e_k(p) = y_{d,k}(p) - y_k(p)$$
Calculate the weight corrections and apply them:
$$\Delta w_{jk}(p) = \alpha \times y_j(p) \times \delta_k(p), \qquad w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$$
28
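A self-contained sketch of one training iteration (Steps 2 and 3). This excerpt of the slides only shows the output-layer gradient; the hidden-layer gradient below uses the standard back-propagation form δⱼ = yⱼ(1 − yⱼ) Σₖ δₖ wⱼₖ, which is an assumption here, and α = 0.1 and the XOR training set are illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, y_d, W1, t1, W2, t2, alpha=0.1):
    """One back-propagation iteration for a single example (x, y_d)."""
    # Step 2: forward pass
    y_h = sigmoid(x @ W1 - t1)                 # hidden-layer outputs y_j(p)
    y_o = sigmoid(y_h @ W2 - t2)               # output-layer outputs y_k(p)

    # Step 3(a): output-layer error gradients
    e = y_d - y_o                              # e_k(p) = y_{d,k}(p) - y_k(p)
    delta_o = y_o * (1.0 - y_o) * e            # delta_k(p)

    # Step 3(b): hidden-layer gradients (standard form; not shown in this excerpt),
    # computed before the output weights are changed.
    delta_h = y_h * (1.0 - y_h) * (delta_o @ W2.T)

    # Weight corrections: Delta w = alpha * (input to the weight) * delta
    W2 += alpha * np.outer(y_h, delta_o)
    t2 += alpha * (-1.0) * delta_o             # thresholds act as weights on input -1
    W1 += alpha * np.outer(x, delta_h)
    t1 += alpha * (-1.0) * delta_h
    return float(e @ e)                        # squared error for this example

# Illustrative 2-2-1 network trained on XOR.
rng = np.random.default_rng(1)
W1 = rng.uniform(-1.2, 1.2, size=(2, 2))
t1 = rng.uniform(-1.2, 1.2, size=2)
W2 = rng.uniform(-1.2, 1.2, size=(2, 1))
t2 = rng.uniform(-1.2, 1.2, size=1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
for epoch in range(5000):
    sse = sum(train_step(x, y_d, W1, t1, W2, t2) for x, y_d in zip(X, Y))
```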
Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples (one epoch) is
sufficiently small, the network is deemed to have converged.
29
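Written out (this is the standard definition; the formula is not shown explicitly in this excerpt):
$$E_{\text{SSE}} = \sum_{p} \sum_{k} \left[ y_{d,k}(p) - y_k(p) \right]^{2}$$
where the outer sum runs over all training examples in the epoch and the inner sum over all output neurons.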
Decision boundaries for XOR
Decision boundaries are demonstrated
with McCulloch-Pitts neurons using a
sign function.
30
Visualization of the XOR decision problem for different types of classifiers. Markers correspond to the
four data points to be classified. The colored/hatched background corresponds to the output of one
exemplary decision function. (A) The linear decision boundary of a single-layer Perceptron cannot solve
the problem. (B, C) This still holds for the generalization 𝜎(𝑓(𝑥) + 𝑔(𝑦)). (D) A multi-layer Perceptron
(MLP) of the form 𝜎(∑ᵢ 𝑤ᵢ 𝜎(𝑢ᵢ𝑥 + 𝑣ᵢ𝑦 + 𝑏ᵢ)) can be optimized using gradient descent to solve the
problem correctly. (E) An alternative solution using a non-monotonic nonlinearity.
(F) Multiplication of two real-valued variables x, y can be seen as a superset of the XOR problem.
31
Sigmoid neuron vs. Perceptron
• Sigmoid neuron better reflects the fact that small changes in
weights and bias cause only a small change in output.
32
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different
solutions?
• Starting from different initial conditions yields different final weights
and threshold values, and the problem is solved in a different
number of iterations.
• Back-propagation learning cannot be viewed as an emulation of
brain-like learning.
• Biological neurons do not work backward to adjust the strengths of
their interconnections (synapses).
• The training is slow due to extensive calculations.
• Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990
33
Gradient descent method
• Consider two parameters, 𝑤₁ and 𝑤₂, in a network.
• Randomly pick a starting point 𝜃⁰.
• Compute the negative gradient at 𝜃⁰: −∇𝑓(𝜃⁰).
• Multiply it by the learning rate 𝜂: −𝜂∇𝑓(𝜃⁰), and move to the next point.
(Figure: an error surface over (𝑤₁, 𝑤₂); the colors represent the value of the function 𝑓, and 𝜃* marks the minimum.)
35
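A minimal gradient-descent sketch over two parameters, matching the picture above; the quadratic error function and η = 0.1 are illustrative assumptions:

```python
import numpy as np

def f(theta):
    """Illustrative error surface over two parameters (w1, w2)."""
    w1, w2 = theta
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2

def grad_f(theta):
    """Analytic gradient of f."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 1.0), 4.0 * (w2 + 0.5)])

eta = 0.1                                                 # learning rate
theta = np.random.default_rng(0).uniform(-2, 2, size=2)   # random starting point theta^0

for _ in range(100):
    theta = theta - eta * grad_f(theta)                   # step along the negative gradient

print(theta)  # approaches the minimum theta* = (1.0, -0.5)
```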
Gradient descent method
• Gradient descent never guarantees reaching the global minimum.
(Figure: different initial points 𝜃⁰ lead the cost 𝐶 to different local minima of 𝑓.)
36
Gradient descent method
• It also has issues at plateaus and saddle points.
(Figure: a cost curve over the parameter space; progress is very slow on a plateau where ∇𝑓(𝜃) ≈ 0, and ∇𝑓(𝜃) = 0 at a saddle point and at a local minimum.)
37
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function
by a hyperbolic tangent
$$Y^{\tanh} = \frac{2a}{1 + e^{-bX}} - a$$
where 𝑎 = 1.716 and 𝑏 = 0.667 (Guyon, 1991)
38
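A one-function sketch of this scaled activation; the constants follow the slide, while the function name is mine:

```python
import numpy as np

def scaled_tanh(x, a=1.716, b=0.667):
    """Hyperbolic-tangent-style activation: 2a / (1 + exp(-b*x)) - a, with range (-a, a)."""
    return 2.0 * a / (1.0 + np.exp(-b * x)) - a
```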
Accelerated learning in ANNs
• Generalized delta rule: A momentum term is
included in the delta rule (Rumelhart et al., 1986)
$$\Delta w_{jk}(p) = \beta \times \Delta w_{jk}(p-1) + \alpha \times y_j(p) \times \delta_k(p)$$
where 𝛽 is the momentum constant (0 ≤ 𝛽 ≤ 1), typically set to 0.95
39
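A sketch of how the momentum term changes the weight-correction step of the back-propagation sketch shown earlier; the previous correction must be remembered between iterations (β = 0.95 as on the slide, α = 0.1 is an assumed learning rate):

```python
import numpy as np

def delta_rule_with_momentum(y_j, delta_k, prev_dW, alpha=0.1, beta=0.95):
    """Generalized delta rule: Delta w_jk(p) = beta*Delta w_jk(p-1) + alpha*y_j(p)*delta_k(p)."""
    dW = beta * prev_dW + alpha * np.outer(y_j, delta_k)
    return dW  # add this to the weights; keep it as prev_dW for the next iteration
```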
Accelerated learning in ANNs
• Momentum still does not guarantee reaching the global minimum, but it gives some hope.
(Figure: on the cost curve, the real movement is the sum of the negative gradient and the momentum term, so the search can keep moving even where the gradient = 0.)
40
Accelerated learning in ANNs
• Adaptive learning rate: Adjust the learning rate parameter 𝛼
during training
• Small 𝛼 → small weight changes through iterations → smooth
learning curve
• Large 𝛼 → speed up the training process with larger weight changes
→ possible instability and oscillation
• Heuristic-like approaches for adjusting 𝛼 (a sketch follows this slide)
1. If the algebraic sign of the SSE change remains the same for several
consecutive epochs → increase 𝛼.
2. If the algebraic sign of the SSE change alternates for several
consecutive epochs → decrease 𝛼.
• One of the most effective acceleration means
41
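A sketch of this heuristic; the growth/decay factors (1.05 and 0.7) and the window of epochs are illustrative assumptions, not values from the slides:

```python
def adapt_learning_rate(alpha, sse_history, window=3, grow=1.05, decay=0.7):
    """Adjust alpha based on the sign of recent SSE changes (illustrative heuristic)."""
    if len(sse_history) < window + 1:
        return alpha
    changes = [sse_history[i + 1] - sse_history[i] for i in range(-window - 1, -1)]
    signs = [c > 0 for c in changes]
    if all(s == signs[0] for s in signs):      # same sign for several consecutive epochs
        return alpha * grow                    # -> increase alpha
    if all(signs[i] != signs[i + 1] for i in range(len(signs) - 1)):  # alternating sign
        return alpha * decay                   # -> decrease alpha
    return alpha
```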
Learning with momentum only
Learning with momentum for
the logical operation XOR.
42
Learning with adaptive α only
43
Learning with adaptive α and momentum
44
Quiz 02: Multi-layer neural networks
• Consider the feedforward network below with one hidden layer of units.
• If the network is tested with an input vector 𝑥 = (1.0, 2.0, 3.0), then what are
the activation 𝐻₁ of the first hidden neuron and the activation 𝐼₃ of the third
output neuron?
45
Quiz 02: Multi-layer neural networks
• The input vector to the network is 𝑥 = (𝑥₁, 𝑥₂, 𝑥₃)ᵀ
46
Quiz 03: Backpropagation
• The figure shows part of the
network described in Slide 45.
• Use the same weights,
activation functions and bias
values as described.
48