
CPE/EE 695: Applied Machine Learning

Lecture 8: Artificial Neural Networks

Dr. Shucheng Yu, Associate Professor


Department of Electrical and Computer Engineering
Stevens Institute of Technology
What are artificial neural networks?
Biological inspiration

How does the human brain work?

Components of a neuron

Components of a synapse
2
Neural Communication
§ Electrical potential across cell membrane exhibits spikes called action
potentials.
§ Spike originates in cell body, travels down axon, and causes synaptic
terminals to release neurotransmitters.
§ Chemical diffuses across synapse to dendrites of other neurons.
§ Neurotransmitters can be excitatory or inhibitory.
§ If the net input of neurotransmitters to a neuron from other neurons is
excitatory and exceeds some threshold, it fires an action potential.

[Figure: neuron anatomy — axon, dendrites, and synapses]
3
Neural Speed Constraints
§ Neurons have a “switching time” on the order of a few
milliseconds, compared to nanoseconds for current
computing hardware.
§ However, neural systems can perform complex cognitive
tasks (vision, speech understanding) in tenths of a second.
§ Only time for performing 100 serial steps in this time frame,
compared to orders of magnitude more for current
computers.
§ Must be exploiting “massive parallelism.”
§ Human brain has about 10^11 neurons with an average of
10^4 connections each.

4
Real Neural Learning
§ Synapses change size and strength with experience.
§ Hebbian learning: “When neuron A repeatedly participates
in firing neuron B, the strength of the action of A onto B
increases”
§ “Neurons that fire together, wire together.”

5
What are artificial neural networks?
Connectionist systems inspired by the biological neural networks that constitute
animal brains; they learn (progressively improve their performance) to do tasks by
considering examples, generally without task-specific programming.

6
ANNs: goal and design
§ Knowledge about the learning task is given in the form of
a set of examples (training examples)
§ An ANN is specified by:
Ø architecture: a set of neurons and weighted links connecting the neurons;
Ø neuron model: the information processing unit of the ANN;
Ø learning algorithm: used for training the ANN by modifying the weights in
order to solve the particular learning task correctly on the training examples.

The aim is to obtain an ANN that generalizes well.

7
ANN application….

ALVINN [Pomerleau’93] drives at 70 mph on highways

8
When to consider neural network models
§ Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
§ Output is discrete or real-valued
§ Output is a vector of values
§ Possibly noisy data
§ The form of the target function is unknown
§ Human readability of the result is unimportant
§ Examples:
Ø Speech recognition
Ø Image classification
Ø Financial prediction

9
Neural Network Learning
Learning approach based on modeling adaptation in biological
neural systems.
§ Perceptron: Initial algorithm for learning simple neural
networks (single layer) developed in the 1950’s.
§ Backpropagation: More complex algorithm for learning
multi-layer neural networks developed in the 1980’s.

10
Perceptron
Based on the threshold unit of W. McCulloch and W. Pitts (1943); the perceptron
learning algorithm was introduced by F. Rosenblatt (1958)

11
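As a concrete reference for the unit sketched on this slide, here is a minimal Python version of a thresholded linear unit; the function name and the bias-as-w0 convention are mine, following the AND/OR examples on the next slides:

# Minimal sketch of a thresholded linear unit (perceptron), assuming the
# convention used on the following slides: w0 is a bias weight attached to
# a constant input x0 = 1, and the unit outputs 1 iff the weighted sum > 0.
def perceptron_output(weights, inputs):
    """weights = [w0, w1, ..., wn], inputs = [x1, ..., xn]."""
    s = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if s > 0 else 0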
Representational power of perceptron
A single perceptron can be used to represent many Boolean functions
What weights represent “AND” function?
g(x1,x2)= AND(x1,x2 )?

[Figure: perceptron with inputs x1 (weight W1), x2 (weight W2), and bias weight W0, computing X1 AND X2]

X1  X2  X1 & X2
0   0   0
0   1   0
1   0   0
1   1   1

w0 + w1*0 + w2*0 <= 0
w0 + w1*0 + w2*1 <= 0
w0 + w1*1 + w2*0 <= 0        One solution: w0 = -0.8, w1 = w2 = 0.5
w0 + w1*1 + w2*1 > 0
12
Representational power of perceptron
A single perceptron can be used to represent many Boolean functions
What weights represent “OR” function?
g(x1,x2)= OR(x1,x2 )?

[Figure: perceptron with inputs x1 (weight W1), x2 (weight W2), and bias weight W0, computing X1 OR X2]

X1  X2  X1 || X2
0   0   0
0   1   1
1   0   1
1   1   1

w0 + w1*0 + w2*0 <= 0
w0 + w1*0 + w2*1 > 0
w0 + w1*1 + w2*0 > 0         One solution: w0 = -0.3, w1 = w2 = 0.5
w0 + w1*1 + w2*1 > 0
13
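A quick check of the weight values suggested here and on the previous slide (w1 = w2 = 0.5 with w0 = -0.8 for AND and w0 = -0.3 for OR); the helper name is mine:

# Check that the suggested weights realize AND and OR with a unit that
# outputs 1 iff w0 + w1*x1 + w2*x2 > 0.
def threshold_unit(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

for name, w0 in [("AND", -0.8), ("OR", -0.3)]:
    w1 = w2 = 0.5
    table = [(x1, x2, threshold_unit(w0, w1, w2, x1, x2))
             for x1 in (0, 1) for x2 in (0, 1)]
    print(name, table)
# AND -> [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
# OR  -> [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]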
Representational power of perceptron
A single perceptron can be used to represent many Boolean functions

All of the primitive Boolean functions AND, OR, NAND (¬AND), NOR (¬OR)
can be represented by perceptrons.

14
Representational power of perceptron
A single perceptron can be used to represent many Boolean functions

All of the primitive Boolean functions AND, OR, NAND (¬AND), NOR (¬OR)
can be represented by perceptrons.

Any Boolean function can be implemented using a (2-level) combination of
these primitives (functional completeness property)!

15
Decision surface of a perceptron

We can view the perceptron as representing a hyperplane decision
surface in the n-dimensional space of instances (i.e., points).

[Figure: + and - examples in the instance space separated by a linear decision surface]

16
Decision surface of a perceptron
Single perceptrons can represent all of the primitive Boolean
functions AND, OR, NAND, and NOR, but fail on functions that are not
linearly separable, such as XOR.

[Figure: XOR — the + and - examples sit at opposite corners; not linearly separable!]

17
Representational power of perceptron
A single perceptron can be used to represent many Boolean functions

All of the primitive Boolean functions AND, OR, NAND (¬AND), NOR (¬OR)
can be represented by perceptrons.

Some Boolean functions such as XOR cannot be represented by a
single perceptron (we need a network of them).

A XOR B = (¬A AND B) OR (A AND ¬B)

18
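Following this decomposition, here is a sketch of XOR as a two-level network of thresholded units; the NOT is folded into negative weights, and the specific weight values are hand-picked for illustration:

# XOR as a two-level network of thresholded units:
#   A XOR B = (NOT A AND B) OR (A AND NOT B)
# Weight values are hand-picked for illustration.
def unit(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

def xor(a, b):
    h1 = unit(-0.5, -1.0, 1.0, a, b)     # fires for (NOT a) AND b
    h2 = unit(-0.5, 1.0, -1.0, a, b)     # fires for a AND (NOT b)
    return unit(-0.5, 1.0, 1.0, h1, h2)  # OR of the two hidden units

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]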
Perceptron training rule

19
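For reference, the perceptron training rule as given in Mitchell (1997), the textbook cited at the end of these notes, is

$w_i \leftarrow w_i + \Delta w_i$, with $\Delta w_i = \eta\,(t - o)\,x_i$,

where $t$ is the target output, $o$ is the thresholded perceptron output on the current example, $x_i$ is the i-th input, and $\eta$ is a small positive learning rate. The rule converges provided the training data are linearly separable and $\eta$ is sufficiently small.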
Perceptron training rule

Linear separability can be verified using several approaches (e.g., Linear Programming).

20
Gradient descent and delta rule
(least-mean-square (LMS) rule)

Unthresholded perceptron

21
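For reference, the squared-error measure and the resulting delta (LMS) rule for the unthresholded linear unit, as given in Mitchell (1997), are

$E(\vec{w}) = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$, and gradient descent gives $\Delta w_i = \eta \sum_{d \in D}(t_d - o_d)\,x_{id}$,

where $o_d = \vec{w}\cdot\vec{x}_d$ is the linear unit's output on training example $d$ and $D$ is the training set.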
Gradient descent

22
Gradient descent

23
Gradient descent

24
Summary

25
Stochastic approximation to gradient descent

26
Key differences between standard gradient
descent and stochastic gradient descent

§ Standard gradient descent:
Ø the error is summed over all examples before updating weights;
Ø requires more computation per weight update step.

§ Stochastic gradient descent:
Ø weights are updated upon examining each training example;
Ø can sometimes avoid falling into local minima when there are
multiple local minima (why?)

Both methods are commonly used in practice.

27
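A minimal sketch contrasting the two update schedules for a linear unit trained with the delta rule; the toy data, learning rates, and epoch counts below are made up for illustration:

import numpy as np

# Delta-rule training of a linear unit o = w . x, comparing the two update
# schedules described above. Toy data: learn y = 1 + 2*x1 - 3*x2, with the
# bias folded in as a constant input x0 = 1.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0])

def batch_gradient_descent(X, y, eta=0.005, epochs=2000):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # error is summed over ALL examples before a single weight update
        w += eta * (y - X @ w) @ X
    return w

def stochastic_gradient_descent(X, y, eta=0.01, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, y):          # update after EACH training example
            w += eta * (t_d - x_d @ w) * x_d
    return w

print(batch_gradient_descent(X, y))          # both approach [1, 2, -3]
print(stochastic_gradient_descent(X, y))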
perceptron training rule and delta rule
perceptron training rule:

Delta rule:

The definition of the output o is different!

The perceptron rule updates weights based on the error in the thresholded
perceptron output, whereas the delta rule updates weights based on the error
in the unthresholded linear combination of inputs.
28
Is linear decision surface enough?

Multilayer networks of cascaded linear units still produce only linear functions.

Multilayer networks of perceptron units can represent highly nonlinear decision
surfaces, but the thresholded output is not differentiable.

29
The first question: a differentiable threshold unit

g(h) = 1 / (1 + exp(-h))

sigmoid unit (logistic function, squashing function)

Note that if you rotate this curve through 180° centered on (0, 1/2) you get
the same curve, i.e., g(h) = 1 - g(-h).

§ Very much like a perceptron, but based on a smoothed, differentiable threshold.
§ Other differentiable functions with easily calculated derivatives are sometimes used.

30
Sigmoid unit

31
Error gradient for a sigmoid unit

32
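For reference, the two results used in the derivation that follows, as given in Mitchell (1997), are the derivative of the sigmoid,

$\frac{dg(h)}{dh} = g(h)\,(1 - g(h))$,

and the error gradient for a single sigmoid unit with output $o_d = g(\vec{w}\cdot\vec{x}_d)$,

$\frac{\partial E}{\partial w_i} = -\sum_{d \in D}(t_d - o_d)\,o_d\,(1 - o_d)\,x_{id}$.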
Backpropagation algorithm

[Figure: feedforward network — inputs, hidden units h, output units k]

33
Derivation of the Backpropagation algorithm

1. Propagate inputs forward in the usual way, i.e.,
o All outputs are computed using sigmoid thresholding of the inner
product of the corresponding weight and input vectors.
o All outputs at stage n are connected to all the inputs at stage n+1.

2. Propagate the errors backwards by apportioning them to each unit
according to the amount of this error the unit is responsible for.


34
Derivation of the Backpropagation algorithm

[Figure: network with input units 1–5, hidden units 1–3, and output units 1–2;
weights w11, w12, ..., w15 shown on the links from the inputs into hidden unit 1]

$\vec{x}_1 = \langle x_{11}, x_{12}, x_{13}, x_{14}, x_{15} \rangle$

35
Derivation of the Backpropagation algorithm
For output units


36
Derivation of the Backpropagation algorithm
For output units

37
Derivation of the Backpropagation algorithm
For Hidden units

[Figure: output units indexed by k, hidden units indexed by j, and the inputs]

38
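For reference, the update rules that come out of this derivation, as given in Mitchell (1997) and using $k$ for output units and $j$ for hidden units, are

for each output unit $k$: $\delta_k = o_k(1 - o_k)(t_k - o_k)$;
for each hidden unit $j$: $\delta_j = o_j(1 - o_j)\sum_{k \in \text{outputs}} w_{kj}\,\delta_k$;
and each weight is updated as $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$.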
More on Backpropagation algorithm

39
Convergence of backpropagation

40
Expressive capabilities of ANNs

41
Hidden Layer Representations
• Trained hidden units can be seen as newly constructed
features that make the target concept linearly separable in
the transformed space.
• On many real domains, hidden units can be interpreted as
representing meaningful features such as vowel detectors
or edge detectors, etc.
• However, the hidden layer can also become a distributed
representation of the input in which each individual unit is
not easily interpretable as a meaningful feature.

42
Learning Hidden layer representation

43
Learning Hidden layer representation

100
001
010
111
000
011
101
110

44
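These eight 3-bit codes appear to be the learned hidden-unit encodings of the classic 8-3-8 identity (auto-encoder) task from Mitchell (1997): eight one-hot input patterns are mapped through three sigmoid hidden units back to themselves. Below is a minimal numpy sketch of training such an 8x3x8 network with the backpropagation updates given earlier; the learning rate, initialization range, and number of epochs are assumptions, not the settings used for these slides:

import numpy as np

# 8-3-8 identity task: learn to reproduce one-hot inputs through 3 hidden units.
rng = np.random.default_rng(0)
X = np.eye(8)                          # eight one-hot patterns; targets equal inputs
W1 = rng.uniform(-0.1, 0.1, (3, 9))    # hidden weights (last column = bias)
W2 = rng.uniform(-0.1, 0.1, (8, 4))    # output weights (last column = bias)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.3

for _ in range(5000):
    for x in X:                              # stochastic updates, one pattern at a time
        xb = np.append(x, 1.0)               # add constant bias input
        h = sigmoid(W1 @ xb)                 # hidden activations
        hb = np.append(h, 1.0)
        o = sigmoid(W2 @ hb)                 # outputs
        delta_o = o * (1 - o) * (x - o)                  # output deltas
        delta_h = h * (1 - h) * (W2[:, :3].T @ delta_o)  # hidden deltas
        W2 += eta * np.outer(delta_o, hb)
        W1 += eta * np.outer(delta_h, xb)

# Inspect the learned hidden codes (with enough training they approach
# distinct 3-bit patterns like those listed above).
for x in X:
    print(np.round(sigmoid(W1 @ np.append(x, 1.0))).astype(int))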
Evolution of the hidden layer representation

Three hidden unit values for one of the possible inputs (00000100)

45
Weights from inputs to one hidden unit

46
Sum of squared errors for each output unit

47
Determining the Best Number of Hidden Units
• Too few hidden units prevents the network from adequately fitting the data.
• Too many hidden units can result in over-fitting.
[Figure: error vs. # hidden units, for training data and test data]

• Start small and increase the number until satisfactory results are obtained
• Use internal cross-validation to empirically determine an optimal number of
hidden units.

48
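One way to carry out the internal cross-validation suggested above is a grid search over candidate hidden layer sizes; a minimal scikit-learn sketch, where the digits dataset and the candidate sizes are placeholders:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Candidate numbers of hidden units; 5-fold CV picks the one that generalizes best.
search = GridSearchCV(
    MLPClassifier(max_iter=1000, random_state=0),
    param_grid={"hidden_layer_sizes": [(2,), (4,), (8,), (16,), (32,)]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)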
Overfitting of ANNs

49
Over-fitting Prevention
• Running too many epochs can result in over-fitting.
[Figure: error vs. # training epochs, for training data and test data]
• Keep a hold-out validation set and test accuracy on it after every
epoch. Stop training when additional epochs actually increase
validation error.
• To avoid losing training data for validation:
• Use internal K-fold CV on the training set to compute the average
number of epochs that maximizes generalization accuracy.
• Train final network on complete training set for this many epochs.

50
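A sketch of the hold-out early-stopping loop described above; train_one_epoch and validation_error are hypothetical helpers standing in for whatever training code is used:

# Early stopping on a hold-out validation set, as described above.
# `train_one_epoch(model, train_set)` and `validation_error(model, val_set)`
# are hypothetical helpers; copy.deepcopy keeps the best weights seen so far.
import copy

def train_with_early_stopping(model, train_set, val_set,
                              train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_err, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)
        err = validation_error(model, val_set)
        if err < best_err:
            best_err, best_model, bad_epochs = err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:     # validation error no longer improving
                break
    return best_model, epoch + 1           # best weights and epochs actually used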
Example: NN for face recognition
20 different people, 32 images each, with resolution 120x128

51
Learned hidden unit weights

Problems to consider:

• Input preprocessing
• Network Structure
• Output encoding
• Learning method

52
Input Data Preprocessing
• The curse of dimensionality
o The quantity of training data needed grows exponentially with the dimension
of the input space
o In practice, we only have a limited quantity of input data, so increasing the
dimensionality of the problem leads to a poor representation of the mapping
Preprocessing methods
• Normalization
o Translate input values so that they can be exploitable by
the neural network

• Component reduction
o Build new input variables in order to reduce their number
o No loss of information about their distribution
Normalization
• Inputs of the neural net are often of different types with
different orders of magnitude (e.g., pressure, temperature, etc.)
• It is necessary to normalize the data so that they have the
same impact on the model (i.e., same range as hidden unit
and output activations).
• Center and reduce the variables
Normalization

Average over all points:      $\bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_i^n$

Variance calculation:         $\sigma_i^2 = \frac{1}{N-1}\sum_{n=1}^{N}\left(x_i^n - \bar{x}_i\right)^2$

Variable transformation:      $x_i^n \leftarrow \dfrac{x_i^n - \bar{x}_i}{\sigma_i}$
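The same centering and scaling as a short numpy sketch; the two-column example (pressure-like and temperature-like values) is made up for illustration:

import numpy as np

def standardize(X):
    """Center each input variable (column of X) and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)   # ddof=1 matches the 1/(N-1) variance above
    return (X - mean) / std

X = np.array([[1013.0, 21.5], [990.0, 35.0], [1005.0, 18.0]])  # e.g., pressure, temperature
print(standardize(X))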
Components reduction
• Sometimes the number of inputs is too large to be exploited directly
• Reducing the number of inputs simplifies the construction of the model
• Goal: a better representation of the data, giving a more compact view
without losing relevant information
• Reduction methods: PCA, LDA, SOM, etc. (see the sketch below)
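Of the listed reduction methods, PCA is the most commonly used; a minimal scikit-learn sketch, where the digits dataset and the number of retained components are placeholders:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)                 # 64 input variables
X_reduced = PCA(n_components=10).fit_transform(X)   # build 10 new input variables
print(X.shape, "->", X_reduced.shape)               # (1797, 64) -> (1797, 10)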
Alternative Error Functions

58
Recurrent Networks

• Applied to time series data

• Use feedback connections and can learn finite state machines with
“backpropagation through time.”

59
Other Issues in Neural Nets
• Alternative Error Minimization Procedures:
o Line search
o Conjugate gradient (exploits the 2nd derivative)

• Dynamically Modifying Network Structure:
o Grow the network until it is able to fit the data
(Cascade Correlation, Upstart)
o Shrink a large network until it is unable to fit the data
(Optimal Brain Damage)

60
Summary of Major NN Models

61
Summary of Major ANN Models

62
Acknowledgement

Part of the slide material was based on Dr. Rong Duan’s Fall 2016 course CPE/EE 695A
Applied Machine Learning at Stevens Institute of Technology.

63
Reference
The notes in this lecture are mainly based on the following textbook:

T. M. Mitchell, Machine Learning, McGraw Hill, 1997. ISBN: 978-0-07-042807-2

64
