
Unit 2: ANNs

The document discusses the history and workings of artificial neural networks. It begins with a brief history of artificial neural networks from 1943 to the 1980s. It then explains that an artificial neural network consists of simple processing units that communicate with each other via weighted connections, similar to neurons in the brain. The network learns by adjusting the weights of these connections. Finally, it describes how a single artificial neuron, also called a perceptron, functions by taking a weighted sum of its inputs and passing the result through an activation function to produce an output.

Artificial Neural Networks

Dr.Ravi Kumar Chandu


Agenda
 History of Artificial Neural Networks
 What is an Artificial Neural Network?
 How does it work?
History of Artificial Neural Networks

5/22/2021
History of the Artificial Neural Networks

• 1943 McCulloch-Pitts neurons


• 1949 Hebb’s law
• 1958 Perceptron (Rosenblatt)
• 1960 Adaline, better learning rule (Widrow, Hoff)
• 1969 Limitations (Minsky, Papert)
• 1972 Kohonen nets, associative memory
History of the Artificial Neural Networks

• 1977 Brain State in a Box (Anderson)


• 1982 Hopfield net, constraint satisfaction
• 1985 ART (Carpenter, Grossberg)
• 1986 Backpropagation (Rumelhart, Hinton, McClelland)
• 1988 Neocognitron, character recognition (Fukushima)
Artificial Neural Network
 An artificial neural network consists of a pool of simple processing units which
communicate by sending signals to each other over a large number of weighted
connections.
A set of major aspects of a parallel distributed model include:
 a set of processing units (cells).
 a state of activation for every unit, which is equivalent to the output of the unit.
 connections between the units. Generally each connection is defined by a weight.
 a propagation rule, which determines the effective input of a unit from its external
inputs.
 an activation function, which determines the new level of activation based on the
effective input and the current activation.
 an external input for each unit.
 a method for information gathering (the learning rule).
 an environment within which the system must operate, providing input signals and,
if necessary, error signals.
Computers vs. Natural Neural Networks

• Computers: precise design, highly constrained, not very adaptive or fault
tolerant, centralized control, deterministic, basic switching
time about $10^{-9}$ sec
• Natural neural networks: massively parallel, highly adaptive and fault tolerant,
self-configuring, self-repairing, noisy, stochastic, basic switching
time about $10^{-3}$ sec

Appropriate Problems for Neural Network Learning
• Instances are represented by many attribute-value pairs (e.g., the
pixels of a picture. ALVINN [Mitchell, p. 84]).
• The target function output may be discrete-valued, real-valued, or a
vector of several real- or discrete-valued attributes.
• The training examples may contain errors.
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
• The ability for humans to understand the learned target function is
not important.

Why Artificial Neural Networks?

• Can be viewed as one approach towards understanding the brain and building
intelligent machines
• Computational architectures inspired by the brain
• Computational methods for ‘learning’ dependencies in a data stream,
e.g. pattern recognition, system identification
• Characteristics: emergent properties, learning, self-adaptation
• Modeling biology? Mathematically purified neurons!
Why Artificial Neural Networks?
There are two basic reasons why we are interested in
building artificial neural networks (ANNs):

• Technical viewpoint: Some problems such as
character recognition or the prediction of future
states of a system require massively parallel and
adaptive processing.

• Biological viewpoint: ANNs can be used to
replicate and simulate components of the human
(or animal) brain, thereby giving us insight into
natural information processing.
What is an Artificial Neural Network?
"A parallel distributed information processor made up of simple processing units
that has a propensity for acquiring problem solving knowledge through
experience "

 A large network of interconnected units


 Each unit has simple input-output mapping
 Each interconnection has numerical weight attached to it
 Output of unit depends on outputs and connection weights of units
connected to it
 ‘Knowledge’ resides in the weights
 Problem solving ability is often through learning

An architecture inspired by the structure of the brain

Artificial Neural Networks
• The “building blocks” of neural networks are the
neurons.
• In technical systems, we also refer to them as units or nodes.
• Basically, each neuron
 receives input from many other neurons.
 changes its internal state (activation) based on the current
input.
 sends one output signal to many other neurons, possibly
including its input neurons (recurrent network).
How do ANNs work?
 An artificial neural network (ANN) is either a hardware
implementation or a computer program which strives to
simulate the information processing capabilities of its biological
exemplar. ANNs are typically composed of a great number of
interconnected artificial neurons. The artificial neurons are
simplified models of their biological counterparts.
 ANN is a technique for solving problems by constructing software
that works like our brains.
Consider humans
• Neuron - the basic computing unit
• Brain is a highly organized structure of networks of interconnected neurons

Neuron switching time: 0.001 second
Number of neurons: $10^{11}$ (100 billion)
Connections per neuron: $10^{4}$ to $10^{5}$
Total synapses: $10^{15}$
Scene recognition time: 0.1 second
Processing: much parallel computation

Properties of artificial neural networks (ANNs)
 Many neuron-like threshold switching units
 Many weighted interconnections among units
 Highly parallel, distributed process
 Emphasis on tuning weights automatically

When to consider neural networks
 Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
 Output is discrete or real valued
 Output is a vector of values
 Possibly noisy data
 Form of target function is unknown
 Human readability of result is unimportant

Application Examples
 Speech phoneme recognition
 Image Classification
 Financial prediction
 Handwriting recognition
 Computer vision

Advantages of ANNs
• Distributed representation
• Representation and processing of fuzziness
• Highly parallel and distributed action
• Speed and fault-tolerance

How do our brains work?
 The brain is a massively parallel information processing system.
 Our brains are a huge network of processing elements. A typical brain contains a
network of tens of billions of neurons (estimates range from 10 billion to 100 billion).
How do our brains work?
 Biological Neuron - A processing element

Dendrites: Input
Soma/Cell body: Processor
Synapse: Link
Axon: Output
How do our brains work?
 dendrites: nerve fibres carrying electrical signals to the cell
 cell body: computes a non-linear function of its inputs
 axon: single long fiber that carries the electrical signal from the cell body
to other neurons
 synapse: the point of contact between the axon of one cell and the dendrite
of another, regulating a chemical connection whose strength affects the
input to the cell.
How do our brains work?
 A neuron is connected to other neurons through about 10,000 synapses
 A neuron receives input from other neurons. Inputs are combined
 Once input exceeds a critical level, the neuron discharges a spike ‐ an
electrical pulse that travels from the body, down the axon, to the next
neuron(s)
 The axon endings almost touch the dendrites or cell body of the next
neuron.
 Transmission of an electrical signal from one neuron to the next is effected
by neurotransmitters.
 Neurotransmitters are chemicals which are released from the first neuron
and which bind to the second.
 This link is called a synapse. The strength of the signal that reaches the
next neuron depends on factors such as the amount of neurotransmitter
available.
How do ANNs work?

• An artificial neuron is an imitation of a human neuron.


Biological Neural Network (BNN)    Artificial Neural Network (ANN)
Soma                               Node
Dendrites                          Input
Synapse                            Weights or interconnections
Axon                               Output
Artificial Neuron
Perceptron

An artificial neuron is also called a perceptron.


A Neuron (= a perceptron)
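[Figure: inputs x0, x1, ..., xn (input vector x) are multiplied by weights w0, w1, ..., wn (weight vector w), combined in a weighted sum Σ with threshold offset −t, and passed through an activation function f to produce the output y.]

 The n-dimensional input vector x is mapped into the variable y by means of
the scalar product and a nonlinear function mapping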
Perceptron
 Basic unit in a neural network
 Linear separator
 Parts
 N inputs, x1 ... xn
 Weights for each input, w1 ... wn
 A bias input x0 (constant) and associated weight w0
 Weighted sum of inputs, y = w0x0 + w1x1 + ... + wnxn
 A threshold function or activation function,
 i.e 1 if y > t, -1 if y <= t
How do ANNs work?
• Now, let us have a look at the model of an artificial neuron.
Perceptron
• Perceptron is a type of artificial neural network (ANN)

Machine Learning | Dr Ravi Kumar Chandu


Perceptron - Operation
• It takes a vector of real-valued inputs, calculates a linear
combination of these inputs, then outputs 1 if the result is greater
than some threshold and -1 otherwise

$$R = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = w_0 + \sum_{i=1}^{n} w_i x_i$$

$$o = \operatorname{sign}(R) = \begin{cases} +1, & \text{if } R > 0 \\ -1, & \text{otherwise} \end{cases}$$
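The operation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `perceptron_output` and the example weights are assumptions chosen so that the unit computes AND over inputs encoded as +1 (true) and -1 (false).

```python
def perceptron_output(weights, inputs):
    """Return +1 if w0 + sum(w_i * x_i) > 0, else -1.

    weights[0] is the bias weight w0; weights[1:] pair up with inputs.
    """
    r = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if r > 0 else -1


# Hypothetical weights chosen for illustration (not from the slides):
# they implement AND over inputs encoded as +1 (true) and -1 (false).
and_weights = [-0.8, 0.5, 0.5]
outputs = [perceptron_output(and_weights, x)
           for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]]
print(outputs)
```

Only the input (1, 1) pushes the weighted sum above zero, so it is the only case mapped to +1.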

Perceptron – Decision Surface
• Perceptron can be regarded as representing a hyperplane decision
surface in the n-dimensional feature space of instances.

• The perceptron outputs a 1 for instances lying on one side of the
hyperplane and a -1 for instances lying on the other side.

• This hyperplane is called the Decision Surface

Perceptron – Decision Surface
• In 2-dimensional space
Perceptron – Representation Power
• The Decision Surface is linear

• Perceptron can only solve Linearly Separable Problems


Perceptron – Representation Power
• Can represent many boolean functions: Assume boolean values of 1
(true) and -1 (false)
Perceptron – Representation Power
• Separate the objects from the rest
Perceptron – Representation Power
• Some problems are linearly non-separable
LIMITATION OF PERCEPTRONS
• Not every set of inputs can be divided by a line like this. Those that
can be are called linearly separable. If the vectors are not linearly
separable, learning will never reach a point where all vectors are
classified properly.
Perceptron – Training Algorithm
• Separate the objects from the rest
Perceptron – Training Algorithm
• Training sample pairs (X, d), where X is the input vector and d is the input
vector’s classification (+1 or -1), are iteratively presented to the network for
training, one at a time, until the process converges
Perceptron – Training Algorithm
• The Procedure is as follows

1. Set the weights to small random values, e.g., in the range (-1, 1)

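2. Present X, and calculate

$$R = w_0 + \sum_{i=1}^{n} w_i x_i, \qquad o = \operatorname{sign}(R) = \begin{cases} +1, & \text{if } R > 0 \\ -1, & \text{otherwise} \end{cases}$$

3. Update the weights

$$w_i \leftarrow w_i + \eta\,(d - o)\,x_i, \quad i = 1, 2, \dots, n$$

where $0 < \eta < 1$ is the training rate and $x_0 = 1$ (constant)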

4. Repeat by going to step 2
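The four steps can be sketched as follows. This is a minimal illustration under stated assumptions: the bias weight w0 is updated with the same rule using x0 = 1 (one reading of the "x0 = 1 (constant)" note), the names `train_perceptron` and `predict` are hypothetical, and AND and XOR in ±1 encoding are example data sets picked to contrast a linearly separable problem with a non-separable one.

```python
import random


def predict(w, x):
    """o = sign(R) with R = w0 * 1 + sum(w_i * x_i)."""
    r = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if r > 0 else -1


def train_perceptron(samples, eta=0.1, max_epochs=100, seed=0):
    """samples: list of (x, d), x a tuple of inputs, d in {+1, -1}.

    Returns (weights, epochs_used); epochs_used is None when the
    epoch limit is reached without a mistake-free pass.
    """
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Step 1: small random initial weights; w[0] is the bias weight.
    w = [rng.uniform(-1, 1) for _ in range(n + 1)]
    for epoch in range(max_epochs):
        mistakes = 0
        for x, d in samples:
            o = predict(w, x)                      # Step 2
            if o != d:
                mistakes += 1
                xs = (1,) + tuple(x)               # x0 = 1 (constant)
                # Step 3: w_i <- w_i + eta * (d - o) * x_i
                w = [wi + eta * (d - o) * xi for wi, xi in zip(w, xs)]
        if mistakes == 0:                          # nothing left to fix
            return w, epoch + 1
    return w, None                                 # Step 4 hit the limit


and_samples = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
xor_samples = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

w_and, epochs_and = train_perceptron(and_samples)
w_xor, epochs_xor = train_perceptron(xor_samples)
print("AND converged after", epochs_and, "epochs; XOR:", epochs_xor)
```

AND is linearly separable, so training stops after a mistake-free pass. XOR is not, so every pass contains at least one misclassified sample and the loop only ends at the epoch limit.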

Perceptron – Training Algorithm
Perceptron training

Weights Adjusting
 After each iteration, weights should be adjusted to
minimize the error.
– All possible weights
Perceptron – Training Rule
Perceptron Learning algorithm
While epoch produces an error
    Present network with next inputs from epoch
    Error = T – O
    If Error <> 0 then
        Wj = Wj + η * xj * Error
    End If
End While
Error Estimation
 The root mean square error (RMSE) is a frequently
used measure of the differences between the values
predicted by a model or an estimator and the values
actually observed from the thing being modeled or
estimated
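As a small illustration of this definition, RMSE is the square root of the mean of the squared differences between predictions and observations. The function name `rmse` and the sample numbers below are illustrative assumptions, not values from the lecture.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example: the errors are 1, 0 and -2, so the mean squared error is 5/3.
value = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
print(value)
```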
Perceptron – Training Rule

The perceptron training rule is guaranteed to succeed if

• the training data is linearly separable
• and η is sufficiently small
Perceptron – Training Algorithm
• Convergence Theorem

• The perceptron training rule will converge (finding a weight vector that
correctly classifies all training samples) within a finite number of
iterations, provided the training examples are linearly separable
and provided a sufficiently small η is used.

Training Perceptrons
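For AND
[Figure: a perceptron with constant bias input −1 (weight W1 = ?), input x (weight W2 = ?), input y (weight W3 = ?), and threshold t = 0.0.]

A  B  Output
0  0  0
0  1  0
1  0  0
1  1  1

• What are the weight values?
• Initialize with random weight values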
Training Perceptrons
For AND
[Figure: a perceptron with constant bias input −1 (weight W1 = 0.3), input x (weight W2 = 0.5), input y (weight W3 = −0.4), and threshold t = 0.0.]

A  B  Output
0  0  0
0  1  0
1  0  0
1  1  1

I1  I2  I3  Summation                                  Output
-1  0   0   (-1*0.3) + (0*0.5) + (0*-0.4) = -0.3       0
-1  0   1   (-1*0.3) + (0*0.5) + (1*-0.4) = -0.7       0
-1  1   0   (-1*0.3) + (1*0.5) + (0*-0.4) = 0.2        1
-1  1   1   (-1*0.3) + (1*0.5) + (1*-0.4) = -0.2       0
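The table can be reproduced with a short script, which also makes it easy to see which rows the random initial weights get wrong. The weights, the bias input −1 and the threshold t = 0.0 are taken from the slide; the variable names and the "fire when the sum is greater than t" convention are assumptions that match the outputs shown.

```python
weights = [0.3, 0.5, -0.4]      # W1, W2, W3 from the slide
threshold = 0.0                 # t from the slide
and_table = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

rows = []
for (a, b), target in and_table:
    inputs = (-1, a, b)         # I1 is the constant bias input -1
    total = sum(w * x for w, x in zip(weights, inputs))
    output = 1 if total > threshold else 0
    rows.append((inputs, round(total, 2), output, target))
    print(inputs, round(total, 2), output, "target:", target)

# Rows where the perceptron output disagrees with the AND truth table.
misclassified = [row for row in rows if row[2] != row[3]]
print("misclassified rows:", len(misclassified))
```

With these initial weights, the inputs (1, 0) and (1, 1) are misclassified, which is why the weights still need to be trained.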

Learning in Neural Networks
• Learn values of weights from I/O pairs
• Start with random weights
• Load training example’s input
• Observe computed output
• Modify weights to reduce difference
• Iterate over all training examples
• Terminate when weights stop changing OR when error is very small

Gradient Descent algorithm and its variants
Gradient Descent is an optimization algorithm used for minimizing
the cost function in various machine learning algorithms. It is
basically used for updating the parameters of the learning model.

Gradient descent is an iterative process that takes us to the
minimum of a function.

Types of Gradient Descent:
• Batch Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
Gradient Descent
The entire Gradient Descent algorithm can be summed up in a single update: at each step, move
the parameters a small distance in the direction of the negative gradient.
To find the local minimum of a function using gradient descent, we must take steps
proportional to the negative of the gradient (move away from the gradient) of the
function at the current point. If we take steps proportional to the positive of the
gradient (moving towards the gradient), we will approach a local maximum of the
function, and the procedure is called Gradient Ascent.
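A minimal sketch of this update, written as theta <- theta - learning_rate * gradient. The function name `gradient_descent` and the cost J(theta) = (theta - 3)^2 are illustrative assumptions picked so that the minimum, theta = 3, is known in advance.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeatedly apply theta <- theta - learning_rate * grad(theta)."""
    history = [theta]
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
        history.append(theta)
    return history


# Illustrative cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
path = gradient_descent(lambda t: 2 * (t - 3), theta=0.0,
                        learning_rate=0.1, steps=50)
print(path[0], path[1], path[-1])
```

Flipping the sign of the update (adding the gradient instead of subtracting it) would turn this into gradient ascent.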
Gradient Descent
Batch Gradient Descent
Incremental(Stochastic) Gradient Descent
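The variants differ only in how many training examples are used for each gradient estimate. The sketch below makes that explicit with a single `batch_size` parameter. The model (a one-parameter fit y ≈ w·x), the data points and the function name `minibatch_gd` are assumptions made for illustration.

```python
import random


def minibatch_gd(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by gradient descent on the mean squared error.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic (incremental) gradient descent
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) over the batch with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Noise-free illustrative data generated from y = 2 * x.
points = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_batch = minibatch_gd(points, batch_size=len(points))
w_sgd = minibatch_gd(points, batch_size=1)
w_mini = minibatch_gd(points, batch_size=2)
print(w_batch, w_sgd, w_mini)
```

On this noise-free toy problem all three settings recover a slope close to 2. On real data they trade gradient accuracy per step against the cost of each step.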
The Learning Rate

The learning rate must be chosen carefully for gradient descent to reach a local minimum.

• If the learning rate is too high, we might OVERSHOOT the minimum and keep
bouncing around it without ever reaching it
• If the learning rate is too small, training might take too long
The Learning Rate

a) Learning rate is optimal: the model converges to the minimum
b) Learning rate is too small: it takes more time but converges to the minimum
c) Learning rate is higher than the optimal value: it overshoots but converges
d) Learning rate is very large: it overshoots and diverges, moving away from the
minimum, and performance degrades as training proceeds
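The four cases can be reproduced on a one-dimensional toy problem. This is a sketch under assumptions: the cost J(theta) = theta^2 and the four learning-rate values are chosen for illustration, because on this cost each step multiplies theta by (1 - 2 * learning_rate), which makes the behaviour easy to predict.

```python
def run(learning_rate, steps=20, theta=1.0):
    """Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= learning_rate * 2 * theta
    return theta


cases = [("a) optimal", 0.5),
         ("b) too small", 0.01),
         ("c) overshoots but converges", 0.9),
         ("d) diverges", 1.1)]
for name, lr in cases:
    print(f"{name:<28} lr={lr:<5} theta after 20 steps = {run(lr):.4g}")
```

With lr = 0.5 the minimum is reached in one step. With lr = 0.01, theta shrinks slowly. With lr = 0.9, theta changes sign on every step while shrinking. With lr = 1.1, theta changes sign on every step and grows.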
Local Minima
The cost function may have many minimum points. Gradient descent may settle in any
one of these minima, depending on the initial point (i.e. the initial parameters theta) and
the learning rate. Therefore, the optimization may converge to different points for
different starting points and learning rates.
How do ANNs work?
The signal is not passed down to the next neuron verbatim

[Figure: inputs x1, x2, ..., xm are multiplied by weights w1, w2, ..., wm, summed (∑) in the processing stage, and passed through the transfer function (activation function) f(vk) to produce the output y.]

The output is a function of the input, that is affected
by the weights, and the transfer functions
ACTIVATION FUNCTION
• A function that transforms the values or states the conditions for the
decision of the output neuron is known as an activation function.
• What does an artificial neuron do? Simply, it calculates a “weighted
sum” of its input, adds a bias and then decides whether it should be
“fired” or not.
• So consider a neuron.
ACTIVATION FUNCTION
• The value of Y can be anything from -inf to +inf; the neuron itself does
not know the bounds of the value. So how do we decide whether the neuron
should fire or not? (Why this firing pattern? Because biology tells us that
this is how the brain works, and the brain is working evidence of an
intelligent system.)
• We add “activation functions” for this purpose: they check the Y value
produced by a neuron and decide whether outside connections should
consider this neuron as “fired” or not, or rather, “activated” or not.
ACTIVATION FUNCTION
• If we do not apply an activation function, the output signal is simply a
linear function of the input. A linear function is just a polynomial of degree one.
• Linear equations are easy to solve, but they are limited in their complexity and have
less power to learn complex functional mappings from data.
• A neural network without an activation function would simply be a linear regression
model, which has limited power and does not perform well most of the time.
• We want our neural network to learn and compute not just a linear function but
something more complicated than that.
• Without an activation function, our neural network would also be unable to learn
and model other complicated kinds of data such as images, video, audio and speech.
That is why we use artificial neural network techniques such as deep learning
to make sense of complicated, high-dimensional, non-linear big datasets,
where the model has many hidden layers and a complicated architecture
that helps us extract knowledge from such datasets.
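The claim that a network without activation functions stays linear can be checked directly: two linear layers applied in sequence give exactly the same result as one linear layer whose weight matrix is the product of the two. The matrices and helper names below are illustrative assumptions, and biases are left out to keep the sketch short.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Product of a matrix and a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# Illustrative weights for two layers with no activation function in between.
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[-1.0, 0.5], [2.0, 1.0]]
x = [3.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))   # layer 1, then layer 2
one_layer = matvec(matmul(W2, W1), x)    # a single layer with weights W2 * W1
print(two_layers, one_layer)
```

Stacking more such layers does not help: the composition is always one linear map, which is why a non-linear activation is needed between layers.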
ACTIVATION FUNCTION
• Activation functions are really important for an artificial neural
network to learn and make sense of complicated, non-linear
functional mappings between the inputs and the response variable.
They introduce non-linear properties to our network.
• Their main purpose is to convert the input signal of a node in an ANN
to an output signal. That output signal is then used as an input to the
next layer in the stack.
WHY DO WE NEED NON-LINEARITIES?
• Non-linear functions are those which have degree more than one, and they
show curvature when plotted. We need a neural network model that can learn
and represent almost any arbitrary complex function which maps inputs to
outputs. Neural networks are considered universal function approximators:
with enough hidden units they can approximate a very wide class of functions
to any desired accuracy. Almost any process we can think of
can be represented as a functional computation in a neural network.

• Hence it all comes down to this: we need to apply an activation function
f(x) to make the network more powerful and give it the ability to learn
something complex from data and to represent non-linear,
arbitrary functional mappings between inputs and outputs. By
using a non-linear activation, we are able to generate non-linear mappings
from inputs to outputs.
TYPES OF ACTIVATION FUNCTIONS*
Step function
• Activation function A = “activated” if Y > threshold else not
• Alternatively, A = 1 if Y > threshold, 0 otherwise
• Well, what we just did is a “step function”, see the below figure.

• DRAWBACK: Suppose you are creating a binary classifier. Something which should say a
“yes” or “no” ( activate or not activate ). A Step function could do that for you! That’s
exactly what it does, say a 1 or 0. Now, think about the use case where you would want
multiple such neurons to be connected to bring in more classes. Class1, class2, class3 etc.
What will happen if more than 1 neuron is “activated”. All neurons will output a 1 ( from
step function). Now what would you decide? Which class is it? Hard, complicated.
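The drawback can be seen in a few lines. This is a sketch with made-up numbers: `class_scores` stands for the weighted sums Y of three hypothetical class neurons, and the threshold of 0 is an assumption.

```python
def step(y, threshold=0.0):
    """A = 1 if Y > threshold, 0 otherwise."""
    return 1 if y > threshold else 0


# Hypothetical weighted sums for three class neurons (class1, class2, class3).
class_scores = [0.7, 2.3, 1.1]
activations = [step(y) for y in class_scores]
print(activations)
```

All three neurons output 1, so the step function hides the fact that the second neuron had the largest weighted sum, and there is no principled way to pick a class from the activations alone.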
* https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
TYPES OF ACTIVATION FUNCTIONS
Linear function
• A = cx
• A straight line function where activation is proportional to input ( which is
the weighted sum from neuron ).
• This way, it gives a range of activations, so it is not binary activation. We
can definitely connect a few neurons together and if more than 1 fires, we
could take the max and decide based on that. So that is ok too. Then what
is the problem with this?
• A = cx, so the derivative with respect to x is c. That means the gradient has no
relationship with x. It is a constant gradient, and the descent is going to be
on a constant gradient. If there is an error in prediction, the changes made
by backpropagation are constant and do not depend on the change in input.
TYPES OF ACTIVATION FUNCTIONS
Sigmoid function

This looks smooth and “step function like”. What are the benefits of this? It is
nonlinear in nature. Combinations of this function are also nonlinear! Great. Now
we can stack layers. What about non-binary activations? Yes, that too! It gives an
analog activation, unlike the step function. It has a smooth gradient too.
And if you notice, between X values -2 and 2 the curve is very steep, which means that any
small change in X in that region causes the value of Y to change significantly. This
function therefore has a tendency to bring the Y values to either end of the curve.
• Looks like it’s good for a classifier considering its property? Yes ! It tends to bring
the activations to either side of the curve ( above x = 2 and below x = -2 for
example). Making clear distinctions on prediction.
• Another advantage of this activation function is, unlike linear function, the output
of the activation function is always going to be in range (0,1) compared to (-inf,
inf) of linear function. So we have our activations bound in a range. It won’t blow
up the activations then. This is great.
• Sigmoid functions are one of the most widely used activation functions today.
Then what are the problems with this?
• If you notice, towards either end of the sigmoid function, the Y values
respond very little to changes in X. What does that mean? The gradient in that
region is going to be small. This gives rise to the problem of “vanishing gradients”. So
what happens when the activations reach the “near-horizontal” part of the
curve on either side?
• The gradient is small or has vanished (it cannot make a significant change because of
its extremely small value). The network refuses to learn further or learns drastically slowly.
There are ways to work around this problem, and sigmoid is still very popular in
classification problems.
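The behaviour described above can be checked numerically. The sketch assumes the usual logistic form of the sigmoid, 1 / (1 + e^(-x)), whose derivative is s * (1 - s). The function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid; the output always lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.6f}")
```

The gradient is largest at x = 0 and shrinks rapidly as |x| grows, which is the vanishing gradient effect in its simplest form.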
TYPES OF ACTIVATION FUNCTIONS
Tanh Function
• Another activation function that is used is the tanh function.

This looks very similar to sigmoid. In fact, it is a scaled sigmoid function!
• This has characteristics similar to sigmoid that we discussed above. It
is nonlinear in nature, so great we can stack layers! It is bound to
range (-1, 1) so no worries of activations blowing up. One point to
mention is that the gradient is stronger for tanh than sigmoid (
derivatives are steeper). Deciding between the sigmoid or tanh will
depend on your requirement of gradient strength. Like sigmoid, tanh
also has the vanishing gradient problem.
• Tanh is also a very popular and widely used activation function.
Especially in time series data.
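The "scaled sigmoid" remark corresponds to the identity tanh(x) = 2 * sigmoid(2x) - 1, which the sketch below checks against Python's `math.tanh`. The helper names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh written as a shifted and scaled sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x), tanh_grad(x))
```

At x = 0 the tanh gradient is 1, compared with 0.25 for the sigmoid, which is the "stronger gradient" mentioned above. For large |x| the gradient still goes to 0.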
TYPES OF ACTIVATION FUNCTIONS
ReLu
• Next comes the ReLu function,
A(x) = max(0,x)
It gives an output x if x is positive and 0 otherwise.
• At first look, this would seem to have the same problems as the linear function,
as it is linear on the positive axis. But ReLu is nonlinear in nature, and
combinations of ReLu are also nonlinear! (In fact it is a good approximator:
a wide class of functions can be approximated with combinations of ReLu.) Great, so this
means we can stack layers. It is not bounded, though. The range of ReLu is [0,
inf). This means it can blow up the activation.
• Another point to discuss here is the sparsity of the activation. Imagine a big
neural network with a lot of neurons. Using a sigmoid or tanh will cause
almost all neurons to fire in an analog way. That means almost all activations
will be processed to describe the output of a network. In other words, the
activation is dense. This is costly. We would ideally want a few neurons in the
network to not activate and thereby making the activations sparse and
efficient.
• ReLu gives us this benefit. Imagine a network with randomly initialized (or
normalized) weights in which almost 50% of the network yields 0 activation because of
the characteristic of ReLu (output 0 for negative values of x). This means
fewer neurons are firing (sparse activation) and the network is lighter. ReLu
seems to be awesome! Yes it is, but nothing is flawless, not even ReLu.
• Because of the horizontal line in ReLu ( for negative X ), the gradient can go
towards 0. For activations in that region of ReLu, gradient will be 0 because
of which the weights will not get adjusted during descent. That means,
those neurons which go into that state will stop responding to variations in
error/ input ( simply because gradient is 0, nothing changes ). This is called
dying ReLu problem. This problem can cause several neurons to just die
and not respond making a substantial part of the network passive. There
are variations in ReLu to mitigate this issue by simply making the horizontal
line into non-horizontal component . For example, y = 0.01x for x<0 will
make it a slightly inclined line rather than horizontal line. This is leaky ReLu.
There are other variations too. The main idea is to let the gradient be non
zero and recover during training eventually.
• ReLu is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations. That is a good point to consider
when we are designing deep neural nets.
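A minimal sketch of ReLu and leaky ReLu together with their gradients. The slope 0.01 comes from the y = 0.01x example in the text; the function names are illustrative assumptions.

```python
def relu(x):
    """A(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient is 1 for positive x and 0 otherwise (the 'dying' region)."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    """Like ReLu, but y = slope * x for x < 0 instead of a flat line."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Gradient never becomes exactly 0, so the unit can still recover."""
    return 1.0 if x > 0 else slope


for x in (-3.0, -0.5, 0.5, 2.5):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For negative inputs the ReLu gradient is exactly 0, so no weight update reaches the unit, while the leaky variant keeps a small non-zero gradient.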
NOW WHICH ONE DO WE USE?
• Does that mean we just use ReLu for everything we do? Or sigmoid or
tanh? Well, yes and no.
• When you know the function you are trying to approximate has certain
characteristics, you can choose an activation function which will
approximate the function faster leading to faster training process. For
example, a sigmoid works well for a classifier, because approximating a
classifier function as combinations of sigmoid is easier than maybe ReLu,
for example. Which will lead to faster training process and convergence.
You can use your own custom functions too! If you don’t know the nature
of the function you are trying to learn, then maybe you can start with ReLu,
and then work backwards. ReLu works most of the time as a general
approximator!
Three elements are particularly important in any model of
artificial neural networks:
• the structure of the nodes,
• the topology of the network,
• the learning algorithm used to find the weights of the network
Network Topology

• Feedforward Network
• Single layer feedforward network
• Multilayer feedforward network
• Feedback Network
• Recurrent networks
• Fully recurrent network
• Jordan network
Different Network Topologies
• Single layer feed-forward networks
• Input layer projecting into the output layer

[Figure: single layer network, with an input layer connected directly to an output layer.]
Different Network Topologies
• Multi-layer feed-forward networks
• One or more hidden layers. Input projects only from previous layers onto a
layer.

[Figure: 2-layer (1-hidden-layer) fully connected network, with input layer, hidden layer and output layer.]
Different Network Topologies

• Multi-layer feed-forward networks

[Figure: multi-layer network with an input layer, several hidden layers and an output layer.]
Multi-layer networks of sigmoid units
Error Gradient for a sigmoid unit
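The slide body for this derivation was not recovered, so the following is a sketch of the standard result rather than the lecture's own notation. It assumes a single sigmoid unit with output $o_d$ for training example $d$, target $t_d$, inputs $x_{i,d}$, and squared error with a factor of one half. All of these symbols are assumptions.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)

E = \frac{1}{2} \sum_{d} (t_d - o_d)^2, \qquad
o_d = \sigma\Bigl(\sum_{i} w_i\, x_{i,d}\Bigr)

\frac{\partial E}{\partial w_i}
  = \sum_{d} (t_d - o_d)\,\frac{\partial (t_d - o_d)}{\partial w_i}
  = -\sum_{d} (t_d - o_d)\; o_d\,(1 - o_d)\; x_{i,d}
```

The factor $o_d (1 - o_d)$ is the sigmoid derivative. It is what makes the gradient small when the unit saturates near 0 or 1.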
Backpropagation Algorithm

• The backpropagation algorithm (Rumelhart and McClelland, 1986)
is used in layered feed-forward artificial neural networks.
• Backpropagation is a supervised learning method for multi-layer
feed-forward networks, based on the gradient descent learning rule.
• We provide the algorithm with examples of the inputs and outputs
we want the network to compute, and then the error (the difference
between actual and expected results) is calculated.
• The idea of the backpropagation algorithm is to reduce this error
until the artificial neural network learns the training data.
Backpropagation Algorithm

• Purpose: To compute the weights of a feedforward
multilayer neural network adaptively, given a set of
labeled training examples.
• Method: By minimizing a cost function (the sum of
squared errors) over all training examples and output units,
where N is the total number of training examples, K is the
total number of output units (useful for multiclass problems),
and fk is the function implemented by the neural net.
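The formula itself is missing from the extracted slide. A common way to write a sum-of-squares cost that matches the description is sketched below. The target symbol $y_k^{(n)}$, the input symbol $x^{(n)}$ and the factor of one half are assumptions, since only N, K and $f_k$ are named in the text.

```latex
E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K}
    \Bigl( y_k^{(n)} - f_k\bigl(x^{(n)}\bigr) \Bigr)^{2}
```

Here $x^{(n)}$ is the n-th training input, $y_k^{(n)}$ is its target value for output unit k, and $f_k(x^{(n)})$ is the network's actual output for that unit.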
Backpropagation: Overview
• Backpropagation works by applying the gradient descent
rule to a feedforward network.
• The algorithm is composed of two parts that get repeated
over and over until a pre-set maximal number of epochs,
EPmax.
• Part I, the feedforward pass: the activation values of the
hidden and then output units are computed.
• Part II, the backpropagation pass: the weights of the
network are updated--starting with the hidden to output
weights and followed by the input to hidden weights--with
respect to the sum of squares error and through a series of
weight update rules called the Delta Rule.
Backpropagation Algorithm
Example of Backpropagation Algorithm

We’re going to use a neural network with two inputs, two hidden neurons, and two output neurons.
Additionally, the hidden and output neurons will include a bias.
Here’s the basic structure:

In order to have some numbers to work with, here are the initial weights, the biases,
and training inputs/outputs:

The goal of backpropagation is to optimize the weights so that the neural network can learn how to
correctly map arbitrary inputs to outputs.
We’re going to work with a single training set: given inputs 0.05 and 0.10, we want the neural network
to output 0.01 and 0.99.

The Forward Pass


To begin, let’s see what the neural network currently predicts given the weights and biases above and
inputs of 0.05 and 0.10. To do this we’ll feed those inputs forward through the network.

We figure out the total net input to each hidden layer neuron, squash the total net input using
an activation function (here we use the logistic function), then repeat the process with the output layer
neurons.
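The forward pass can be sketched as follows. The inputs (0.05, 0.10) and targets (0.01, 0.99) are taken from the text. The weights and biases are illustrative placeholders, because the figure holding the lecture's initial values is not reproduced here, and the function names are assumptions.

```python
import math


def logistic(z):
    """Logistic activation used to squash the total net input."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, w_hidden, b_hidden, w_output, b_output):
    """Return (hidden activations, output activations)."""
    hidden = [logistic(sum(w * x for w, x in zip(row, inputs)) + b_hidden)
              for row in w_hidden]
    outputs = [logistic(sum(w * h for w, h in zip(row, hidden)) + b_output)
               for row in w_output]
    return hidden, outputs


inputs, targets = [0.05, 0.10], [0.01, 0.99]    # from the text

# Placeholder weights and biases (assumed values, one row per neuron).
w_hidden = [[0.15, 0.20], [0.25, 0.30]]
b_hidden = 0.35
w_output = [[0.40, 0.45], [0.50, 0.55]]
b_output = 0.60

hidden, outputs = forward(inputs, w_hidden, b_hidden, w_output, b_output)
# Squared error summed over both output neurons.
total_error = sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))
print(hidden, outputs, total_error)
```

With these placeholder weights both outputs land well away from their targets, which gives the backward pass an error to reduce.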

The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause
the actual output to be closer to the target output, thereby minimizing the error for each output
neuron and the network as a whole.
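One backward pass can be sketched as below. This is a minimal illustration under assumptions: the weights and biases are the same illustrative placeholders as before (the lecture's figure is not reproduced), the learning rate 0.5 is assumed, biases are held fixed for brevity, and the error is the squared error summed over both outputs.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_h, b_h, w_o, b_o):
    h = [logistic(sum(w * xi for w, xi in zip(row, x)) + b_h) for row in w_h]
    o = [logistic(sum(w * hj for w, hj in zip(row, h)) + b_o) for row in w_o]
    return h, o


def error(x, t, w_h, b_h, w_o, b_o):
    _, o = forward(x, w_h, b_h, w_o, b_o)
    return sum(0.5 * (tk - ok) ** 2 for tk, ok in zip(t, o))


def backprop_step(x, t, w_h, b_h, w_o, b_o, lr):
    """One gradient descent update of all weights (biases left unchanged)."""
    h, o = forward(x, w_h, b_h, w_o, b_o)
    # Output deltas: dE/dnet_k = (o_k - t_k) * o_k * (1 - o_k).
    delta_o = [(ok - tk) * ok * (1 - ok) for ok, tk in zip(o, t)]
    # Hidden deltas: push the output deltas back through the old weights.
    delta_h = [hj * (1 - hj) * sum(delta_o[k] * w_o[k][j]
                                   for k in range(len(w_o)))
               for j, hj in enumerate(h)]
    new_w_o = [[w - lr * delta_o[k] * h[j] for j, w in enumerate(row)]
               for k, row in enumerate(w_o)]
    new_w_h = [[w - lr * delta_h[j] * x[i] for i, w in enumerate(row)]
               for j, row in enumerate(w_h)]
    return new_w_h, new_w_o


x, t = [0.05, 0.10], [0.01, 0.99]               # from the text
w_h, b_h = [[0.15, 0.20], [0.25, 0.30]], 0.35   # assumed placeholder values
w_o, b_o = [[0.40, 0.45], [0.50, 0.55]], 0.60   # assumed placeholder values

error_before = error(x, t, w_h, b_h, w_o, b_o)
new_w_h, new_w_o = backprop_step(x, t, w_h, b_h, w_o, b_o, lr=0.5)
error_after = error(x, t, new_w_h, b_h, new_w_o, b_o)
print(error_before, error_after)
```

A single update lowers the error only slightly. Repeating the forward and backward passes many times is what drives the outputs towards the targets.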

Hidden Layer Representations
One intriguing property of BACKPROPAGATION is its ability to discover useful intermediate
representations at the hidden unit layers inside the network.

Here, the eight network inputs are connected to three hidden units, which are in turn connected to the eight output units.
Because of this structure, the three hidden units will be forced to re-represent the eight input values in some way that
captures their relevant features, so that this hidden layer representation can be used by the output units to compute the
correct target values.
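The 8 x 3 x 8 identity task can be sketched in plain Python as below. This is only an illustration under stated assumptions: the learning rate, number of epochs, weight initialization range and random seed are choices made for this sketch, and train_identity_838 is a hypothetical name. The function records the summed squared output error at each epoch, so the error curve can be plotted against the number of gradient descent steps.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_identity_838(epochs=2000, lr=0.3, seed=0):
    """Train an 8-3-8 network to reproduce one-hot inputs at its outputs."""
    rng = random.Random(seed)
    n_in, n_hid, n_out = 8, 3, 8
    # Each row holds one unit's weights plus a trailing bias weight.
    w_hid = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(n_hid + 1)] for _ in range(n_out)]
    # The eight training examples: 10000000, 01000000, ..., 00000001.
    examples = [[1.0 if i == j else 0.0 for i in range(n_in)] for j in range(n_in)]
    history = []
    for _ in range(epochs):
        sse = 0.0
        for x in examples:
            xb = x + [1.0]                       # append constant bias input
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
            hb = h + [1.0]
            o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
            sse += sum((t - ok) ** 2 for t, ok in zip(x, o))
            # Error terms; hidden terms use the output weights before updating.
            d_out = [(t - ok) * ok * (1 - ok) for t, ok in zip(x, o)]
            d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j in range(n_hid)]
            for k in range(n_out):
                for j in range(n_hid + 1):
                    w_out[k][j] += lr * d_out[k] * hb[j]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    w_hid[j][i] += lr * d_hid[j] * xb[i]
        history.append(sse)
    return history, w_hid

history, hidden_weights = train_identity_838()
print("first epoch SSE:", history[0], "last epoch SSE:", history[-1])
```

Because only three hidden units sit between eight inputs and eight outputs, the network can only lower the error by finding a compact hidden encoding of the eight input patterns.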

This ability of multilayer networks to automatically discover useful representations at the hidden layers is
a key feature of ANN learning.

We can directly observe the effect of BACKPROPAGATION'S gradient descent search by plotting the
squared output error as a function of the number of gradient descent search steps.

Learning the 8 x 3 x 8 Network. This plot shows the evolving sum of squared errors for each of the eight
output units, as the number of training iterations (epochs) increases.
Learning the 8 x 3 x 8 Network. This plot shows the evolving hidden layer representation for the
input string "01000000".

Learning the 8 x 3 x 8 Network. This bottom plot shows the evolving weights for one of the three
hidden units.

Convergence of Backpropagation
Expressive Capabilities of ANNs
Overfitting ANN

AN ILLUSTRATIVE EXAMPLE:
FACE RECOGNITION
The learning task here involves classifying camera images of faces of
various people in various poses. Images of 20 different people were
collected, including approximately 32 images per person, varying the
person's expression (happy, sad, angry, neutral), the direction in which
they were looking (left, right, straight ahead, up), and whether or not
they were wearing sunglasses. As can be seen from the example images in
the figure, there is also variation in the background behind the person, the
clothing worn by the person, and the position of the person's face within
the image. In total, 624 greyscale images were collected, each with a
resolution of 120 x 128, with each image pixel described by a greyscale
intensity value between 0 (black) and 255 (white).

Target function: learning the direction in which the person is facing (to
their left, right, straight ahead, or upward).

Learning an artificial neural network to recognize face pose. Here a 960 x 3 x
4 network is trained on grey-level images of faces (see top), to predict whether
a person is looking to their left, right, ahead, or up. After training on 260 such
images, the network achieves an accuracy of 90% over a separate test set. The
learned network weights are shown after one weight-tuning iteration through
the training examples and after 100 iterations. Each output unit (left, straight,
right, up) has four weights, shown by dark (negative) and light (positive)
blocks. The leftmost block corresponds to the weight w0, which determines
the unit threshold, and the three blocks to the right correspond to weights on
inputs from the three hidden units. The weights from the image pixels into
each hidden unit are also shown, with each weight plotted in the position of
the corresponding image pixel.
Design Choice

• Input encoding
• Output encoding
• Network graph structure
• Other learning algorithm parameters
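The input and output encoding choices can be illustrated with a short sketch. It assumes that the 960 network inputs come from averaging 4 x 4 pixel blocks of the 120 x 128 image (giving 30 x 32 = 960 values) and scaling intensities to the range 0 to 1, and that the four output units use a 1-of-4 encoding with targets 0.1 and 0.9 instead of 0 and 1. Neither detail is stated in the text above, so both are assumptions, and the function names are hypothetical.

```python
def downsample(image, factor=4):
    """Average non-overlapping factor x factor blocks; scale 0..255 to 0..1.

    Assumes the image height and width are multiples of `factor`.
    """
    rows, cols = len(image), len(image[0])
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [image[rr][cc]
                     for rr in range(r, r + factor)
                     for cc in range(c, c + factor)]
            coarse_row.append(sum(block) / (len(block) * 255.0))
        coarse.append(coarse_row)
    return coarse

def encode_input(image):
    """Flatten the coarse image into one input vector (one value per input unit)."""
    return [v for row in downsample(image) for v in row]

# Output unit order follows the caption: left, straight, right, up.
DIRECTIONS = ["left", "straight", "right", "up"]

def encode_target(direction, low=0.1, high=0.9):
    """1-of-4 target vector; 0.1/0.9 are assumed values a sigmoid can reach."""
    return [high if d == direction else low for d in DIRECTIONS]

# Usage with a synthetic all-white 120 x 128 image.
white = [[255] * 128 for _ in range(120)]
inputs = encode_input(white)
print(len(inputs), encode_target("right"))
```

Targets of exactly 0 and 1 cannot be produced by sigmoid units with finite weights, which is the usual motivation for choosing values such as 0.1 and 0.9.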
ADVANCED TOPICS IN ARTIFICIAL NEURAL NETWORKS
Alternative Error Functions

Adding a penalty term for weight magnitude
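A minimal sketch of a penalized error function follows, assuming the usual form in which a constant gamma times the sum of squared weights is added to the squared output error. The names penalized_error, decay_update and gamma are chosen for this sketch only.

```python
def penalized_error(targets, outputs, weights, gamma):
    """Squared output error plus gamma * (sum of squared weights)."""
    sse = 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))
    penalty = gamma * sum(w * w for w in weights)
    return sse + penalty

def decay_update(weight, grad_sse, lr, gamma):
    """One gradient step on the penalized error for a single weight.

    The penalty contributes d/dw (gamma * w^2) = 2 * gamma * w to the gradient,
    so every step also pulls the weight toward zero.
    """
    return weight - lr * (grad_sse + 2.0 * gamma * weight)

print(penalized_error([1.0], [0.0], [2.0], gamma=0.1))
print(decay_update(1.0, grad_sse=0.0, lr=0.5, gamma=0.1))
```

Even when the error gradient is zero, the update shrinks the weight by a constant factor, which is why this penalty is often called weight decay.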


Alternative Error Functions

Adding a term for errors in the slope, or derivative of the target function
Alternative Error Functions

Minimizing the cross entropy of the network with respect to the target values
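As a sketch, the cross entropy for outputs interpreted as probabilities can be computed as below. The clipping constant eps is an assumption added here to avoid taking log(0); it is not part of the definition.

```python
import math

def cross_entropy(targets, outputs, eps=1e-12):
    """Sum over examples of -[t*log(o) + (1 - t)*log(1 - o)]."""
    total = 0.0
    for t, o in zip(targets, outputs):
        o = min(max(o, eps), 1.0 - eps)   # keep o strictly inside (0, 1)
        total -= t * math.log(o) + (1.0 - t) * math.log(1.0 - o)
    return total

# Outputs closer to the targets give a smaller cross entropy.
print(cross_entropy([1.0, 0.0], [0.9, 0.1]))
print(cross_entropy([1.0, 0.0], [0.6, 0.4]))
```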
Recurrent Networks
A Recurrent Neural Network (RNN) is a class of artificial neural network in which the connections
between nodes form a directed graph along a temporal sequence, giving the network temporal dynamic
behavior. RNNs are derived from feedforward networks and are used to model sequential data. Because
the output at one time step can depend on earlier inputs, they are well suited to prediction on sequences.
They support a form of directed cycles in the network.
In the recurrent network shown, we have added a new unit b to the hidden layer and a new input unit c(t).
The input value c(t) to the network at one time step is simply copied from the value of unit b on the
previous time step.
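The copy-back mechanism can be sketched with a single recurrent hidden unit. This is a toy illustration under stated assumptions: one scalar input per time step, one hidden unit b, one output y, sigmoid activations, and c starting at 0. The weight names w_xb, w_cb and w_by are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_recurrent(xs, w_xb, w_cb, w_by, bias_b=0.0, bias_y=0.0):
    """Run a one-unit recurrent network over the input sequence xs."""
    c = 0.0                     # context input c(t); assumed to start at 0
    ys = []
    for x in xs:
        b = sigmoid(w_xb * x + w_cb * c + bias_b)   # hidden unit sees x(t) and c(t)
        y = sigmoid(w_by * b + bias_y)
        ys.append(y)
        c = b                   # c(t+1) is a copy of b(t)
    return ys

# Same input at both steps, but the second output differs because of c(t).
print(run_recurrent([1.0, 1.0], w_xb=1.0, w_cb=2.0, w_by=1.0))
# With no weight on c(t) the network has no memory and the outputs are identical.
print(run_recurrent([1.0, 1.0], w_xb=1.0, w_cb=0.0, w_by=1.0))
```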

The following are applications of RNNs:

1. Machine Translation
2. Speech Recognition
3. Sentiment Analysis
4. Automatic Image Tagger
Dynamically Modifying Network Structure

Up to this point we have considered neural network learning as a problem of adjusting weights within
a fixed graph structure. A variety of methods have been proposed to dynamically grow or shrink the
number of network units and interconnections in an attempt to improve generalization accuracy and
training efficiency.

One idea is to begin with a network containing no hidden units, then grow the network as needed by
adding hidden units until the training error is reduced to some acceptable level.
The CASCADE-CORRELATION algorithm is one example of this growing approach.

A second idea for dynamically altering network structure is to take the opposite approach. Instead of
beginning with the simplest possible network and adding complexity, we begin with a complex
network and prune it as we find that certain connections are inessential.
One way to decide whether a particular weight is inessential is to see whether its value is close to
zero. A second way, which appears to be more successful in practice, is to consider the effect that a
small variation in the weight has on the error E.
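The two pruning criteria can be contrasted in a short sketch. The second criterion is approximated here with a central finite difference of the error, which is an assumption of this sketch and not necessarily how any particular pruning method computes it; the toy error function and the names prune_by_magnitude and sensitivity are hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Criterion 1: zero out every weight whose value is close to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def sensitivity(error_fn, weights, index, delta=1e-4):
    """Criterion 2: estimate how strongly E reacts to a small change in one weight."""
    up = list(weights)
    down = list(weights)
    up[index] += delta
    down[index] -= delta
    return abs(error_fn(up) - error_fn(down)) / (2.0 * delta)

# Toy error (assumed): E depends strongly on weight 0 and not at all on weight 1.
def toy_error(w):
    return (3.0 * w[0] - 1.0) ** 2 + 0.0 * w[1]

weights = [0.01, 5.0]
print(prune_by_magnitude(weights, threshold=0.05))
print(sensitivity(toy_error, weights, 0), sensitivity(toy_error, weights, 1))
```

In this toy case the magnitude rule removes weight 0 even though the error depends on it strongly, while the large weight 1 has no effect on the error at all, which illustrates why the second criterion can be the better guide.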
