Lecture 8
Components of a neuron
Components of a synapse
2
Neural Communication
§ Electrical potential across cell membrane exhibits spikes called action
potentials.
§ Spike originates in cell body, travels down axon, and causes synaptic
terminals to release neurotransmitters.
§ Chemical diffuses across synapse to dendrites of other neurons.
§ Neurotransmitters can be excitatory or inhibitory.
§ If the net neurotransmitter input to a neuron from other neurons is
excitatory and exceeds some threshold, the neuron fires an action potential.
[Figure: neuron diagram labeling the axon, dendrites, and synapses]
3
Neural Speed Constraints
§ Neurons have a “switching time” on the order of a few
milliseconds, compared to nanoseconds for current
computing hardware.
§ However, neural systems can perform complex cognitive
tasks (vision, speech understanding) in tenths of a second.
§ Only time for performing 100 serial steps in this time frame,
compared to orders of magnitude more for current
computers.
§ Must be exploiting “massive parallelism.”
§ Human brain has about 10^11 neurons with an average of
10^4 connections each.
4
Real Neural Learning
§ Synapses change size and strength with experience.
§ Hebbian learning: “When neuron A repeatedly participates
in firing neuron B, the strength of the action of A onto B
increases”
§ “Neurons that fire together, wire together.”
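In its simplest textbook idealization (a common formalization, given here only for concreteness), the Hebbian update strengthens the connection from A to B in proportion to their correlated activity:
$$ \Delta w_{AB} = \eta \, a_A \, a_B $$
where $a_A, a_B$ are the activity levels of the two neurons and $\eta$ is a small learning rate.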
5
What are artificial neural networks?
Connectionist systems inspired by the biological neural networks that constitute
animal brains. They learn to perform tasks (progressively improving their
performance) by considering examples, generally without task-specific programming.
6
ANNs: goal and design
§ Knowledge about the learning task is given in the form of
a set of examples (training examples)
§ An ANN is specified by:
architecture: a set of neurons and weighted links connecting the neurons.
neuron model: the information processing unit of the ANN.
learning algorithm: used for training the ANN by modifying the weights so that
the particular learning task is solved correctly on the training examples
(see the sketch below).
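As a minimal illustration of these three ingredients, here is a sketch of a single sigmoid neuron and one weight update on one example (all numbers and names below are illustrative, not taken from the slides):

```python
import math

# Neuron model: weighted sum of the inputs passed through a sigmoid activation.
def neuron_output(weights, bias, inputs):
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-net))

# Architecture: a single neuron with two weighted input links (plus a bias).
weights, bias = [0.1, -0.2], 0.0

# Learning algorithm: one gradient step on the squared error for one example.
x, target, eta = [1.0, 0.0], 1.0, 0.5
out = neuron_output(weights, bias, x)
delta = (target - out) * out * (1.0 - out)            # error term for a sigmoid unit
weights = [w + eta * delta * xi for w, xi in zip(weights, x)]
bias = bias + eta * delta
```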
7
ANN applications
8
When to consider neural network models
§ Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
§ Output is discrete or real valued
§ Output is a vector of values
§ Possibly noisy data
§ Form of the target function is unknown
§ Human readability of result is unimportant
§ Examples:
Ø Speech recognition
Ø Image classification
Ø Financial prediction
9
Neural Network Learning
Learning approach based on modeling adaptation in biological
neural systems.
§ Perceptron: Initial algorithm for learning simple neural
networks (single layer) developed in the 1950’s.
§ Backpropagation: More complex algorithm for learning
multi-layer neural networks developed in the 1980’s.
10
Perceptron
W. McCulloch and W. Pitts (1943)
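A perceptron is a thresholded linear unit; in a standard formulation (some treatments use outputs 1/0 instead of 1/−1) it computes
$$ o(x_1,\ldots,x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise.} \end{cases} $$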
11
Representational power of perceptron
A single perceptron can be used to represent many Boolean functions
What weights represent the AND function, g(x1, x2) = AND(x1, x2)?
(A perceptron with inputs x1 and x2, weights w1 = ?, w2 = ?, and bias weight w0 = ?)

AND truth table:
 x1  x2 | x1 AND x2
  0   0 | 0
  0   1 | 0
  1   0 | 0
  1   1 | 1

OR truth table:
 x1  x2 | x1 OR x2
  0   0 | 0
  0   1 | 1
  1   0 | 1
  1   1 | 1
All of the primitive Boolean functions AND, OR, NAND (¬AND), NOR (¬OR)
can be represented by perceptrons.
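One valid answer to the question above (there are infinitely many; these particular numbers are an illustration rather than the slide's own choice): with w1 = w2 = 0.5, a bias weight of w0 = −0.8 implements AND and w0 = −0.3 implements OR. A small check:

```python
# A 0/1 perceptron fires when w0 + w1*x1 + w2*x2 > 0.
def perceptron(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        assert perceptron(-0.8, 0.5, 0.5, x1, x2) == (x1 and x2)  # AND
        assert perceptron(-0.3, 0.5, 0.5, x1, x2) == (x1 or x2)   # OR
```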
14
Decision surface of a perceptron
[Figure: a perceptron's decision surface is a hyperplane (a line in two dimensions) with the + examples on one side and the − examples on the other]
16
Decision surface of a perceptron
A single perceptron can represent all of the primitive Boolean functions AND,
OR, NAND, and NOR, but it fails on functions that are not linearly separable,
such as XOR.
[Figure: the XOR examples, with + and − in opposite corners, are not linearly separable]
17
Perceptron training rule
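The rule itself, in its standard form: after each training example, update every weight by
$$ w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(t - o)\,x_i , $$
where $t$ is the target output, $o$ is the perceptron's current (thresholded) output, $x_i$ is the corresponding input, and $\eta$ is the learning rate. The procedure converges after a finite number of updates provided the training data are linearly separable and $\eta$ is sufficiently small.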
19
Perceptron training rule
Linear separability can be verified using several approaches (e.g., Linear Programming).
20
Gradient descent and delta rule
(least-mean-square (LMS) rule)
Unthresholded perceptron
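The setting for the delta/LMS rule: an unthresholded linear unit $o = \vec{w}\cdot\vec{x}$, trained to minimize the squared error over the training set $D$,
$$ E(\vec{w}) = \tfrac{1}{2}\sum_{d \in D} (t_d - o_d)^2 . $$
Gradient descent repeatedly moves $\vec{w}$ a small step in the direction of $-\nabla E(\vec{w})$.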
21
Gradient descent
22
Gradient descent
23
Gradient descent
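A minimal sketch of batch gradient descent for this linear unit (illustrative NumPy code, not the slides' own listing):

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o = X @ w (delta / LMS rule)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                     # outputs for all training examples
        grad_E = -(X.T @ (t - o))     # dE/dw for E = 1/2 * sum_d (t_d - o_d)^2
        w -= eta * grad_E             # step in the direction of -grad E
    return w

# Tiny example: targets generated by t = 1 + 2*x, with a constant bias input.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
w = gradient_descent(X, t)            # w converges toward [1.0, 2.0]
```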
24
Summary
25
Stochastic approximation to gradient descent
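The stochastic (incremental) version updates the weights after each individual training example $d$ rather than after a full pass through $D$:
$$ \Delta w_i = \eta\,(t_d - o_d)\,x_{id} , $$
which approximates the batch gradient step arbitrarily closely as $\eta$ is made small.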
26
Key differences between standard gradient
descent and stochastic gradient descent
27
perceptron training rule and delta rule
perceptron training rule:
Delta rule:
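Both rules, in their standard forms, share the shape $\Delta w_i = \eta\,(t - o)\,x_i$ but differ in which output $o$ they use:
$$ \text{perceptron training rule: } o = \operatorname{sgn}(\vec{w}\cdot\vec{x}) \qquad\qquad \text{delta rule: } o = \vec{w}\cdot\vec{x} . $$
The perceptron rule uses the thresholded output and is guaranteed to converge only when the data are linearly separable; the delta rule uses the unthresholded linear output and converges (asymptotically, for small $\eta$) toward the minimum-squared-error weights whether or not the data are separable.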
28
Is linear decision surface enough?
29
The first question: a differentiable threshold unit
$$ g(h) = \frac{1}{1 + e^{-h}} $$
sigmoid unit (logistic function, squashing function)
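A property that the derivations below rely on: the sigmoid's derivative can be written in terms of its own output,
$$ \frac{d\,g(h)}{dh} = g(h)\,\bigl(1 - g(h)\bigr) . $$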
30
Sigmoid unit
31
Error gradient for a sigmoid unit
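For a single sigmoid unit with output $o_d = g(\vec{w}\cdot\vec{x}_d)$ and squared error $E = \tfrac{1}{2}\sum_{d\in D}(t_d - o_d)^2$, the chain rule gives
$$ \frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\; o_d\,(1 - o_d)\; x_{id} . $$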
32
Backpropagation algorithm
[Figure: two-layer feed-forward network with input units, hidden units h, and output units k]
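In its standard form for such a network of sigmoid units trained on squared error, the algorithm propagates each training example $(\vec{x}, \vec{t})$ forward to obtain the outputs, then updates the weights as follows:
$$ \delta_k = o_k(1 - o_k)(t_k - o_k) \quad \text{for each output unit } k, $$
$$ \delta_h = o_h(1 - o_h)\sum_{k \in \mathrm{outputs}} w_{kh}\,\delta_k \quad \text{for each hidden unit } h, $$
$$ w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,x_{ji} \quad \text{for every weight, where } x_{ji} \text{ is the input to unit } j \text{ from unit } i . $$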
33
Derivation of the Backpropagation algorithm
34
Derivation of the Backpropagation algorithm
[Figure: example network with five input units, three hidden units, and two output units; weights such as w11, w12, w15 connect the inputs to hidden unit 1]
$$ \vec{x}_1 = \langle x_{11}, x_{12}, x_{13}, x_{14}, x_{15} \rangle $$
35
Derivation of the Backpropagation algorithm
For output units
36
Derivation of the Backpropagation algorithm
For output units:
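For a training example $d$ with error $E_d = \tfrac{1}{2}\sum_k (t_k - o_k)^2$, an output-unit weight $w_{kj}$ influences $E_d$ only through that unit's net input $net_k$, so the usual chain-rule step gives
$$ \frac{\partial E_d}{\partial w_{kj}} = \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}} = -(t_k - o_k)\,o_k(1 - o_k)\,x_{kj} , $$
and with $\delta_k = (t_k - o_k)\,o_k(1 - o_k)$ the gradient-descent update is $\Delta w_{kj} = \eta\,\delta_k\,x_{kj}$ (here $x_{kj}$ is the $j$-th input to output unit $k$, i.e. the output of hidden unit $j$).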
37
Derivation of the Backpropagation algorithm
For hidden units:
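A hidden-unit weight $w_{ji}$ influences the error through every output unit $k$ that hidden unit $j$ feeds:
$$ \frac{\partial E_d}{\partial net_j} = \sum_{k \in \mathrm{outputs}} \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial net_j} = -\Bigl(\sum_{k} \delta_k\, w_{kj}\Bigr)\, o_j (1 - o_j) , $$
so defining $\delta_j = o_j(1 - o_j)\sum_k w_{kj}\,\delta_k$ gives the same update form as before, $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$.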
38
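Putting the update rules above together, a minimal NumPy sketch of one stochastic backpropagation step for a network with a single hidden layer of sigmoid units (array shapes and names are illustrative, not the slides' notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hidden, W_out, eta=0.1):
    """One stochastic backpropagation update.

    x: input vector with a leading 1.0 acting as the bias input.
    t: target output vector.
    W_hidden: (n_hidden, n_inputs) weights; W_out: (n_outputs, n_hidden + 1).
    """
    # Forward pass.
    h = sigmoid(W_hidden @ x)              # hidden-unit outputs
    h_ext = np.concatenate(([1.0], h))     # prepend bias input for the output layer
    o = sigmoid(W_out @ h_ext)             # network outputs

    # Error terms: delta_k for output units, delta_h for hidden units.
    delta_k = o * (1 - o) * (t - o)
    delta_h = h * (1 - h) * (W_out[:, 1:].T @ delta_k)

    # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
    W_out += eta * np.outer(delta_k, h_ext)
    W_hidden += eta * np.outer(delta_h, x)
    return W_hidden, W_out

# Example: 2 inputs (+ bias), 2 hidden units, 1 output, small random initial weights.
rng = np.random.default_rng(0)
W_hidden = rng.uniform(-0.05, 0.05, size=(2, 3))
W_out = rng.uniform(-0.05, 0.05, size=(1, 3))
x, t = np.array([1.0, 0.0, 1.0]), np.array([1.0])
W_hidden, W_out = backprop_step(x, t, W_hidden, W_out)
```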
More on Backpropagation algorithm
39
Convergence of backpropagation
40
Expressive capabilities of ANNs
41
Hidden Layer Representations
• Trained hidden units can be seen as newly constructed
features that make the target concept linearly separable in
the transformed space.
• On many real domains, hidden units can be interpreted as
representing meaningful features such as vowel detectors or
edge detectors.
• However, the hidden layer can also become a distributed
representation of the input in which each individual unit is
not easily interpretable as a meaningful feature.
42
Learning Hidden layer representation
43
Learning Hidden layer representation
[Table: hidden-unit encodings learned for the eight distinct inputs, roughly one binary code per input: 100, 001, 010, 111, 000, 011, 101, 110]
44
Evolution of the hidden layer representation
[Plot: evolution of the hidden-unit values for the input 00000100 as training proceeds]
45
Weights from inputs to one hidden unit
46
Sum of squared errors for each output unit
47
Determining the Best Number of Hidden Units
• Too few hidden units prevents the network from adequately fitting the data.
• Too many hidden units can result in over-fitting.
[Plot: error versus number of hidden units, shown separately for training data and test data]
• Start small and increase the number until satisfactory results are obtained
• Use internal cross-validation to empirically determine an optimal number of
hidden units.
48
Overfitting of ANNs
49
Over-fitting Prevention
• Running too many epochs can result in over-fitting.
[Plot: error versus number of training epochs, shown separately for training data and test data]
• Keep a hold-out validation set and test accuracy on it after every
epoch. Stop training when additional epochs actually increase
validation error (see the sketch after this list).
• To avoid losing training data for validation:
• Use internal K-fold CV on the training set to compute the average
number of epochs that maximizes generalization accuracy.
• Train final network on complete training set for this many epochs.
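A minimal, generic sketch of this early-stopping loop; `train_one_epoch` and `validation_error` are caller-supplied callables and the names are illustrative, not from the slides:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Run epochs until validation error stops improving for `patience` epochs.

    train_one_epoch(): performs one pass of weight updates on the training set.
    validation_error(): returns the current error on the hold-out validation set.
    Returns the epoch with the lowest validation error and that error value.
    """
    best_error, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break        # further epochs only increase validation error
    return best_epoch, best_error
```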
50
Example: NN for face recognition
20 different people and 32 images each with resolution 120x128
51
Learned hidden unit weights
Problems to consider:
• Input preprocessing
• Network Structure
• Output encoding
• Learning method
52
Input Data Preprocessing
• The curse of dimensionality
o The quantity of training data needed grows exponentially with the dimension
of the input space
o In practice we only have a limited quantity of input data, so increasing the
dimensionality of the problem quickly leads to a poor representation of the mapping
Preprocessing methods
• Normalization
o Transform input values so that they can be exploited by
the neural network
• Component reduction
o Build new input variables in order to reduce their number
o Without losing information about their distribution
Normalization
• Inputs of the neural net are often of different types with
different orders of magnitude (E.g. Pressure, Temperature,
etc.)
• It is necessary to normalize the data so that they have the
same impact on the model (i.e., same range as hidden unit
and output activations).
• Center and scale (standardize) the variables
Normalization
$$ \bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_i^{(n)} \qquad \text{(average over all } N \text{ data points)} $$
$$ \sigma_i^2 = \frac{1}{N-1}\sum_{n=1}^{N} \bigl(x_i^{(n)} - \bar{x}_i\bigr)^2 \qquad \text{(variance)} $$
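The centered-and-scaled value actually fed to the network is then the standardized variable
$$ \tilde{x}_i^{(n)} = \frac{x_i^{(n)} - \bar{x}_i}{\sigma_i} , $$
which has zero mean and unit variance over the training set (this is the "center and scale" step referred to above).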
58
Recurrent Networks
59
Other Issues in Neural Nets
• Alternative Error Minimization Procedures:
o Line search
o Conjugate gradient (implicitly exploits second-order, i.e. curvature, information)
60
Summary of Major NN Models
61
Summary of Major ANN Models
62
Acknowledgement
Part of the slide materials were based on Dr. Rong Duan’s Fall 2016 course CPE/EE 695A
Applied Machine Learning at Stevens Institute of Technology.
63
Reference
These lecture notes are mainly based on the following textbooks:
64