2023-Lecture11-NeuralNetworks

The document provides an introduction to artificial neural networks (ANNs), explaining their structure, learning mechanisms, and the analogy with biological neural networks. It covers the perceptron model, its learning rules, and limitations, particularly in solving non-linearly separable problems like XOR, and introduces multi-layer neural networks and the back-propagation algorithm for training. Additionally, it discusses gradient descent methods and their challenges in finding optimal solutions.


Artificial Intelligence

INTRODUCTION TO
NEURAL NETWORKS

Bùi Duy Đăng


[email protected]
Outline
• Introduction to artificial neural networks
• Perceptron and Learning
• Multi-layer neural networks

2
Artificial neural network

3
What is a neural network?
• A reasoning model based on the human brain, comprising
billions of neurons and trillions of connections between them

4
Biological neural network
• A highly complex, nonlinear, and parallel information-processing
system
• Learning through experience is an essential characteristic.
• Plasticity: connections between neurons leading to the
“right answer” are strengthened while those leading to the
“wrong answer” are weakened.

5
Artificial neural networks (ANN)
• Resemble the human brain in terms of learning mechanisms
• Improve performance through experience and generalization

6
How does an ANN model the brain?
• An ANN consists of many neurons: simple and highly
interconnected processors arranged in a hierarchy of layers.

• Each neuron is an elementary information-processing unit.


7
How does an ANN model the brain?

• Each neuron receives several input signals through its


connections and produces at most a single output signal.
• The neurons are connected by links, which pass signals
from one neuron to another.
• Each link is associated with a numerical weight expressing the strength
of the neuron input.
• The set of weights is the basic means of long-term memory in ANNs.

• ANNs “learn” through iterative adjustments of weights.

8
Biological neuron

Artificial neuron

Analogy between biological and artificial neural networks

9
How to build an ANN?
• The network architecture must be decided first,
• How many neurons are to be used?
• How the neurons are to be connected to form a network?
• Then determine which learning algorithm to use,
• Supervised /semi-supervised / unsupervised / reinforcement learning
• And finally train the neural network
• How to initialize the weights of the network?
• How to update them from a set of training examples.

10
11
12
Source: http://www.asimovinstitute.org/neural-network-zoo/
13

Perceptron and Learning
Perceptron (Frank Rosenblatt, 1958)
• A perceptron has a single neuron with adjustable synaptic
weights and a hard limiter.

A single-layer two-input perceptron


14
How does a perceptron work?
• Divides the n-dimensional input space into two decision regions by
a hyperplane defined by the linearly separable function
    ∑_{i=1}^{n} x_i w_i − θ = 0

15
Perceptron learning rule
• Step 1 – Initialization: Initial weights w_1, w_2, …, w_n and threshold θ are randomly
assigned to small numbers (usually in [−0.5, 0.5], but not restricted to it).

• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs
x_1(p), x_2(p), …, x_n(p) and desired output Y_d(p), and calculate the actual output
    Y(p) = step( ∑_{i=1}^{n} x_i(p) w_i(p) − θ ),    step(x) = 1 if x ≥ 0, 0 if x < 0
where n is the number of perceptron inputs and step is the hard-limiter activation function.

• Step 3 – Weight training
  • Update the weights w_i: w_i(p + 1) = w_i(p) + Δw_i(p),
    where Δw_i(p) is the weight correction at iteration p.
  • The delta rule determines how to adjust the weights: Δw_i(p) = α × x_i(p) × e(p),
    where α is the learning rate (0 < α < 1) and e(p) = Y_d(p) − Y(p).

• Step 4 – Iteration: Increase iteration p by one, go back to Step 2, and repeat the process
until convergence.

16
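The four steps above can be sketched directly in code. A minimal sketch, assuming the slide's setup for logical AND (fixed threshold θ = 0.2, learning rate α = 0.1); the function name, the epoch cap, and the fixed random seed are illustrative choices, not part of the lecture.

```python
import random

def train_perceptron(examples, n_inputs, alpha=0.1, theta=0.2, max_epochs=100):
    """Perceptron learning rule, Steps 1-4: random small weights, hard-limiter
    activation, delta-rule updates, repeated until an epoch has no errors.
    Returns the learned weights, or None if training did not converge."""
    step = lambda v: 1 if v >= 0 else 0
    # Step 1 - Initialization: weights in [-0.5, 0.5]; theta is kept fixed here
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(max_epochs):
        converged = True
        for x, y_d in examples:
            # Step 2 - Activation: actual output Y(p)
            y = step(sum(xi * wi for xi, wi in zip(x, w)) - theta)
            # Step 3 - Weight training: delta rule, dw_i = alpha * x_i * e
            e = y_d - y
            if e != 0:
                converged = False
                w = [wi + alpha * xi * e for xi, wi in zip(x, w)]
        if converged:
            return w
    return None

random.seed(0)  # reproducibility of the random initialization
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND, n_inputs=2)
```

With this setup the AND problem is linearly separable relative to θ = 0.2 (e.g. w_1 = w_2 = 0.15 works), so the delta rule settles on a correct weight pair within a few epochs.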
Perceptron for the logical AND/OR
• A single-layer perceptron can learn the AND/OR operations.

The learning of logical AND converged after several iterations.
Threshold θ = 0.2, learning rate α = 0.1
17
Perceptron for the logical XOR
• It cannot be trained to perform the Exclusive-OR.

18
Will a sigmoidal element do better?
• Perceptron can classify only linearly separable patterns
regardless of the activation function used (Shynk, 1990;
Shynk and Bershad, 1992)
• Solution: advanced forms of neural networks (e.g., multi-layer
perceptrons trained with the back-propagation algorithm)

19
An example of perceptron

Is the weather good?

Does your partner want to accompany you?

Is the festival near public transit? (You don't own a car.)

Go to the festival?

20
An example of perceptron

weather

partner wants to go

near public transit

• w_1 = 6, w_2 = 2, w_3 = 2 → the weather matters to you much more than
whether your partner joins you or the nearness of public transit
• θ = 5 → decisions are made based on the weather only
• θ = 3 → you go to the festival whenever the weather is good, or when
both the festival is near public transit and your partner wants to join you.
21
Quiz 01: Perceptron
• Consider the following neural network, which receives binary input
values x_1 and x_2 and produces a single binary output value.

• For every combination (x_1, x_2), what are the output values at neurons A,
B and C?
22
Multi-layer neural networks
23
Multi-layer neural network
• A feedforward network with one or more hidden layers.
• The input signals are propagated forward on a layer-by-layer
basis.

24
Back-propagation algorithm
• Introduced by Bryson and Ho (1969); the most popular among over a
hundred different learning algorithms available

25
Back-propagation learning rule
• Step 1 – Initialization: Initial weights and thresholds are assigned to random numbers.
  • The numbers may be uniformly distributed in the range (−2.4/F_i, +2.4/F_i) (Haykin, 1999),
    where F_i is the total number of inputs of neuron i.
  • The weight initialization is done on a neuron-by-neuron basis.

• Step 2 – Activation: At iteration p, apply the p-th example, which has inputs
x_1(p), x_2(p), …, x_n(p) and desired outputs y_{d,1}(p), y_{d,2}(p), …, y_{d,l}(p).
  • (a) Calculate the actual output of neuron j in the hidden layer, from its n inputs:
      y_j(p) = σ( ∑_{i=1}^{n} x_i(p) w_{ij}(p) − θ_j ),    σ(x) = 1 / (1 + e^{−x})
  • (b) Calculate the actual output of neuron k in the output layer, from its m hidden inputs:
      y_k(p) = σ( ∑_{j=1}^{m} y_j(p) w_{jk}(p) − θ_k )

26
Back-propagation learning rule
• Step 3 – Weight training: Update the weights in the back-propagation network,
propagating backward the errors e_k(p) associated with the output neurons.
  • (a) Calculate the error gradient for neuron k in the output layer:
      δ_k(p) = y_k(p) × [1 − y_k(p)] × [y_{d,k}(p) − y_k(p)]
    Calculate the weight corrections: Δw_{jk}(p) = α × y_j(p) × δ_k(p)
    Update the weights at the output neurons: w_{jk}(p + 1) = w_{jk}(p) + Δw_{jk}(p)
  • (b) Calculate the error gradient for neuron j in the hidden layer:
      δ_j(p) = y_j(p) × [1 − y_j(p)] × ∑_{k=1}^{l} δ_k(p) w_{jk}(p)
    Calculate the weight corrections: Δw_{ij}(p) = α × x_i(p) × δ_j(p)
    Update the weights at the hidden neurons: w_{ij}(p + 1) = w_{ij}(p) + Δw_{ij}(p)

• Step 4 – Iteration: Increase iteration p by one, go back to Step 2, and repeat the process
until the selected error criterion is satisfied.

• A mathematical explanation can be found here.


27
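Steps 1–4 can be turned into a small program. A minimal sketch for the XOR network of the next slide (2 inputs, 2 hidden neurons, 1 output), assuming plain sequential updates with no momentum; the restart-on-failure loop, the hyperparameters, and the SSE stopping threshold are illustrative additions, since plain back-propagation can get stuck in a local minimum.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def train_xor(alpha=0.5, epochs=10000, max_restarts=8):
    """Back-propagation (Steps 1-4 above) on a 2-2-1 network for XOR.
    Returns a predict function mapping (x1, x2) to the network output."""
    for seed in range(max_restarts):        # restart if stuck in a local minimum
        rng = random.Random(seed)
        # Step 1 - Initialization: small random weights and thresholds
        w_ih = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
        th_h = [rng.uniform(-1, 1) for _ in range(2)]
        w_ho = [rng.uniform(-1, 1) for _ in range(2)]
        th_o = rng.uniform(-1, 1)

        def forward(x):
            y = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(2)) - th_h[j])
                 for j in range(2)]
            return y, sigmoid(sum(y[j] * w_ho[j] for j in range(2)) - th_o)

        for _ in range(epochs):
            for x, t in XOR:
                y_h, y_o = forward(x)                        # Step 2 - Activation
                d_o = y_o * (1 - y_o) * (t - y_o)            # Step 3a - output gradient
                d_h = [y_h[j] * (1 - y_h[j]) * d_o * w_ho[j]
                       for j in range(2)]                    # Step 3b - hidden gradients
                for j in range(2):
                    w_ho[j] += alpha * y_h[j] * d_o
                    for i in range(2):
                        w_ih[i][j] += alpha * x[i] * d_h[j]
                    th_h[j] -= alpha * d_h[j]    # threshold = weight on an input of -1
                th_o -= alpha * d_o
            sse = sum((t - forward(x)[1]) ** 2 for x, t in XOR)
            if sse < 0.01:
                return lambda x: forward(x)[1]
    return lambda x: forward(x)[1]          # best effort if no restart converged

predict = train_xor()
```

Note that the hidden gradients d_h are computed with the output weights as they were before the Step 3a update, matching the order of the equations on the slide.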
Back-propagation network for XOR
• The logical XOR problem took
224 epochs or 896 iterations
for network training.

28
Sum of the squared errors (SSE)
• When the SSE over an entire pass through all training examples is
sufficiently small, the network is deemed to have converged.

Learning curve for the logical operation XOR

29
Decision boundaries for XOR
Decision boundaries are demonstrated
with McCulloch-Pitts neurons using a
sign function.

30
Visualization of the XOR decision problem for different types of classifiers. Markers correspond to the
four data points to be classified. The colored/hatched background corresponds to the output of one
exemplary decision function. (A) The linear decision boundary of a single-layer Perceptron cannot solve
the problem. (B, C) This still holds for the generalization σ(f(x) + g(y)). (D) A multi-layer Perceptron
(MLP) of the form σ( ∑_j w_j σ(u_j x + v_j y + b_j) ) can be optimized using gradient descent to solve the
problem correctly. (E) An alternative solution using a non-monotonic nonlinearity.
(F) Multiplication of two real-valued variables x, y can be seen as a superset of the XOR problem.
31
Sigmoid neuron vs. Perceptron
• A sigmoid neuron better reflects the fact that small changes in
weights and bias cause only a small change in output.

A sigmoidal function is a smoothed-out version of a step function.

32
About back-propagation learning
• Do randomly initialized weights and thresholds lead to different
solutions?
  • Starting with different initial conditions yields different weight and
    threshold values; the problem will still be solved, but within a
    different number of iterations.
• Back-propagation learning cannot be viewed as an emulation of
brain-like learning.
  • Biological neurons do not work backward to adjust the strengths of
    their interconnections (synapses).
• The training is slow due to extensive calculations.
• Improvements: Caudill, 1991; Jacobs, 1988; Stubbs, 1990

33
Gradient descent method
• Consider two parameters, w_1 and w_2, in a network. The error surface
is the value of a function f(θ) over θ = (w_1, w_2); the colors represent
the value of f.
• Randomly pick a starting point θ⁰.
• Compute the negative gradient at θ⁰: −∇f(θ⁰), where
    ∇f(θ) = [∂f(θ)/∂w_1, ∂f(θ)/∂w_2]ᵀ
• Multiply by the learning rate η: −η∇f(θ⁰), and take that step from θ⁰
toward the minimum θ*.
34
Gradient descent method
• Consider two parameters, w_1 and w_2, in a network.
• Repeat the update: compute the negative gradient −∇f(θ) at the current
point, scale it by the learning rate η, and step by −η∇f(θ).
• Eventually, we would reach a minimum.
35
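The procedure pictured on these slides is just the repeated update θ ← θ − η∇f(θ). A minimal sketch on an illustrative quadratic error surface; the function `f`, its gradient, and the hyperparameters are assumptions for demonstration, not values from the lecture.

```python
def gradient_descent(grad, theta0, eta=0.1, steps=200):
    """Repeat theta <- theta - eta * grad(theta), as on the slides."""
    theta = list(theta0)
    for _ in range(steps):
        theta = [t - eta * g for t, g in zip(theta, grad(theta))]
    return theta

# Illustrative error surface f(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2, minimum at (3, -1)
grad_f = lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)]
theta_star = gradient_descent(grad_f, theta0=[0.0, 0.0])
```

On this convex surface the iterates contract toward the minimum geometrically; the next slides show why non-convex surfaces are harder.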
Gradient descent method
• Gradient descent never guarantees reaching the global minimum.
• Starting from a different initial point θ⁰, it may reach a different
minimum and thus a different result.

36
Gradient descent method
• It also has issues at plateaus and saddle points (illustrated on a cost
curve over the parameter space):
  • Very slow at a plateau, where ∇f(θ) ≈ 0
  • Stuck at a saddle point, where ∇f(θ) = 0
  • Stuck at a local minimum, where ∇f(θ) = 0
37
Accelerated learning in ANNs
• Use tanh instead of sigmoid: represent the sigmoidal function
by a hyperbolic tangent
    Y_tanh = 2a / (1 + e^{−bX}) − a,    where a = 1.716 and b = 0.667
    (Guyon, 1991)

38
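The formula above can be checked numerically; algebraically it equals a·tanh(bX/2), a smooth, zero-centered activation ranging over (−a, a). A small sketch:

```python
import math

def y_tanh(x, a=1.716, b=0.667):
    """Y_tanh = 2a / (1 + exp(-b*x)) - a, as on the slide (Guyon, 1991).
    Algebraically identical to a * tanh(b*x/2), so outputs lie in (-a, a)."""
    return 2 * a / (1 + math.exp(-b * x)) - a
```

Unlike the (0, 1) sigmoid, this activation is zero-centered, which is the property credited with accelerating learning.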
Accelerated learning in ANNs
• Generalized delta rule: A momentum term is
included in the delta rule (Rumelhart et al., 1986):
    Δw_{jk}(p) = β × Δw_{jk}(p − 1) + α × y_j(p) × δ_k(p)
where β is the momentum constant (0 ≤ β ≤ 1; typically β = 0.95)

What about putting the momentum of the physical world into gradient descent?

39
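The generalized delta rule above is stateful: each weight carries its previous correction forward. A minimal sketch for a single weight; the function name and the scalar framing are illustrative (grad_term stands for the product y_j(p) × δ_k(p)).

```python
def momentum_step(w, dw_prev, grad_term, alpha=0.1, beta=0.95):
    """One application of the generalized delta rule for a single weight:
    dw(p) = beta * dw(p-1) + alpha * grad_term, then w(p+1) = w(p) + dw(p)."""
    dw = beta * dw_prev + alpha * grad_term
    return w + dw, dw

# Repeated steps in the same direction accumulate speed
w, dw = 0.0, 0.0
for _ in range(2):
    w, dw = momentum_step(w, dw, grad_term=1.0)
```

After two identical gradients the correction has grown from 0.1 to 0.195, which is exactly the "accumulated speed" the next slide appeals to.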
Accelerated learning in ANNs
Still no guarantee of reaching the global minimum,
but it gives some hope ……

Movement = Negative of Gradient + Momentum

Even at a point where the gradient = 0, the accumulated momentum
can keep the real movement going.
40
Accelerated learning in ANNs
• Adaptive learning rate: Adjust the learning rate parameter α
during training.
  • Small α → small weight changes through iterations → smooth
    learning curve
  • Large α → speeds up the training process with larger weight changes
    → possible instability and oscillation
• Heuristic-like approaches for adjusting α:
  1. If the algebraic sign of the SSE change remains the same for several
     consecutive epochs → increase α.
  2. If the algebraic sign of the SSE change alternates for several
     consecutive epochs → decrease α.
• One of the most effective means of acceleration
41
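The two heuristics can be sketched as a small rate controller. The growth/decay factors and the window length here are illustrative choices, not values from the lecture.

```python
def adapt_rate(alpha, sse_history, up=1.05, down=0.7, window=3):
    """Adjust the learning rate from the recent SSE history.

    Heuristic 1: the SSE change keeps the same algebraic sign for
    `window` consecutive epochs -> increase alpha.
    Heuristic 2: the sign alternates every epoch -> decrease alpha.
    """
    if len(sse_history) < window + 1:
        return alpha                      # not enough history yet
    recent = sse_history[-(window + 1):]
    signs = [b - a > 0 for a, b in zip(recent, recent[1:])]
    if all(signs) or not any(signs):      # same sign throughout
        return alpha * up
    if all(s != t for s, t in zip(signs, signs[1:])):
        return alpha * down               # strictly alternating
    return alpha
```

A steadily falling SSE (e.g. 10, 9, 8, 7, 6) grows α, while an oscillating SSE (e.g. 10, 12, 9, 11, 8) shrinks it.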
Learning with momentum only
Learning with momentum for
the logical operation XOR.

42
Learning with adaptive α only

Learning with adaptive learning rate for the logical operation XOR.

43
Learning with adaptive α and momentum

44
Quiz 02: Multi-layer neural networks
• Consider the feedforward network below with one hidden layer of units.

• If the network is tested with an input vector x = (1.0, 2.0, 3.0)ᵀ, then what
are the activation H_1 of the first hidden neuron and the activation I_3 of the
third output neuron?

45
Quiz 02: Multi-layer neural networks
• The input vector to the network is x = (x_1, x_2, x_3)ᵀ

• The vector of hidden-layer outputs is y = (y_1, y_2)ᵀ

• The vector of actual outputs is z = (z_1, z_2, z_3)ᵀ

• The vector of desired outputs is t = (t_1, t_2, t_3)ᵀ

• The network has the following weight vectors
    v_1 = (−2.0, 2.0, −2.0)ᵀ    v_2 = (1.0, 1.0, −1.0)ᵀ
    w_1 = (1.0, −3.5)ᵀ    w_2 = (0.5, −1.2)ᵀ    w_3 = (0.3, 0.6)ᵀ
• Assume that all units have the sigmoid activation function given by
    f(x) = 1 / (1 + exp(−x))
and that each unit has θ = 0 (zero).
• (Hint: on some calculators, exp(x) = e^x where e = 2.7182818)

46
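One way to check your quiz answer is to code the forward pass directly. This sketch assumes a column-wise reading of the flattened weight table: v_1 = (−2.0, 2.0, −2.0)ᵀ, v_2 = (1.0, 1.0, −1.0)ᵀ for the hidden units and w_1 = (1.0, −3.5)ᵀ, w_2 = (0.5, −1.2)ᵀ, w_3 = (0.3, 0.6)ᵀ for the output units; if the original figure grouped the numbers differently, substitute the correct vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed column-wise reading of the weight table (theta = 0 everywhere)
v = [[-2.0, 2.0, -2.0], [1.0, 1.0, -1.0]]   # hidden units v_1, v_2
w = [[1.0, -3.5], [0.5, -1.2], [0.3, 0.6]]  # output units w_1, w_2, w_3

def forward(x):
    """Sigmoid forward pass: hidden activations y, then output activations z."""
    y = [sigmoid(sum(xi * vi for xi, vi in zip(x, vj))) for vj in v]
    z = [sigmoid(sum(yj * wj for yj, wj in zip(y, wk))) for wk in w]
    return y, z

y, z = forward([1.0, 2.0, 3.0])
# Under the assumed weights, H_1 = y[0] = sigmoid(-4) and I_3 = z[2]
```

Under these assumed weights, the first hidden unit sees −2·1 + 2·2 − 2·3 = −4, so H_1 = σ(−4) ≈ 0.018, and the second sees exactly 0, so its output is 0.5.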
Quiz 03: Backpropagation
• The figure shows part of the
network described in Slide 45.
• Use the same weights,
activation functions and bias
values as described.

• A new input pattern is presented to the network and training proceeds as
follows. The actual outputs are given by z = (0.15, 0.36, 0.57)ᵀ and the
corresponding target outputs are given by t = (1.0, 1.0, 1.0)ᵀ.
• The weights w_{12}, w_{22} and w_{32} are also shown.
• What is the error for each of the output units?
• What is the error for each of the output units?
47
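The error computation follows directly from Step 3(a) of the back-propagation rule: the raw error is e_k = t_k − z_k, and the error gradient is δ_k = z_k(1 − z_k)(t_k − z_k). A quick sketch (the function name is illustrative):

```python
def output_errors(z, t):
    """Step 3(a) of back-propagation: raw errors e_k = t_k - z_k and
    error gradients delta_k = z_k * (1 - z_k) * (t_k - z_k)."""
    e = [tk - zk for zk, tk in zip(z, t)]
    delta = [zk * (1 - zk) * ek for zk, ek in zip(z, e)]
    return e, delta

e, delta = output_errors([0.15, 0.36, 0.57], [1.0, 1.0, 1.0])
```

The raw errors are 0.85, 0.64, and 0.43; depending on how the quiz defines "error", either these or the corresponding gradients δ_k are the intended answer.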
THE END

48
