Artificial Neural Networks
Dr. K. Venkateswara Rao
Professor, CSE
Artificial neural networks (ANNs)
• Artificial neural network (ANN) learning methods provide a robust approach to approximating real-valued, discrete-valued, and vector-valued functions from examples.
• ANN learning is robust to errors in the training data and has been successfully
applied to problems such as interpreting visual scenes, speech recognition, and
learning robot control strategies.
• The Backpropagation learning algorithm has proven surprisingly successful in
many practical problems such as learning to recognize handwritten characters,
learning to recognize spoken words, and learning to recognize faces.
• The study of artificial neural networks (ANNs) has been inspired in part by the
observation that biological learning systems are built of very complex webs of
interconnected neurons.
Neuron: A processing element in Human Brain
Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output
Neuron: A processing element in Human Brain
• A neuron is connected to other neurons through about 10,000 synapses
• A neuron receives input from other neurons. Inputs are combined.
• Once the input exceeds a critical level, the neuron discharges a spike, an electrical pulse that travels from the cell body, down the axon, to the next neuron(s).
• The axon endings almost touch the dendrites or cell body of the next neuron.
• Transmission of an electrical signal from one neuron to the next is effected by
neurotransmitters.
• Neurotransmitters are chemicals which are released from the first neuron and which bind to the second neuron.
• This binding link between neurons is called a synapse. The strength of the
signal that reaches the next neuron depends on factors such as the amount of
neurotransmitter available.
How the Human Brain learns
• In the human brain, a typical neuron collects signals from others through a host of fine
structures called dendrites.
• The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.
• At the end of each branch, a structure called a synapse converts the activity from the axon
into electrical effects that inhibit or excite activity in the connected neurons.
Biological neural networks
• The human brain is estimated to contain a densely interconnected network of approximately $10^{11}$ neurons, each connected, on average, to $10^4$ others. Neuron activity is typically excited or inhibited through connections to other neurons. The fastest neuron switching times are known to be on the order of $10^{-3}$ seconds, quite slow compared to computer switching speeds of $10^{-10}$ seconds.
• Yet humans are able to make surprisingly complex decisions, surprisingly
quickly.
• The information-processing abilities of biological neural systems must follow
from highly parallel processes operating on representations that are distributed
over many neurons
• One motivation for ANN systems is to capture this kind of highly parallel
computation based on distributed representations.
An artificial neuron is an imitation of a human neuron
Artificial neural networks (ANNs)
• Historically, two groups of researchers have worked with artificial neural
networks.
1. One group has been motivated by the goal of using ANNs to study and
model biological learning processes.
2. A second group has been motivated by the goal of obtaining highly
effective machine learning algorithms, independent of whether these
algorithms mirror biological processes.
Model of an Artificial Neuron
How does a Neuron work?
[Figure: model of an artificial neuron. Inputs x1, x2, ..., xm are multiplied by weights w1, w2, ..., wm; a summing unit ∑ computes their weighted combination; a transfer function f(vk) (the activation function) produces the output y.]
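As a concrete illustration, below is a minimal sketch of this model in Python; the input values, weights, and the choice of a sigmoid activation are illustrative assumptions, not taken from the slides.

    import math

    def artificial_neuron(inputs, weights, activation):
        # Processing: weighted sum of the inputs, v_k = sum_i w_i * x_i
        v_k = sum(w * x for w, x in zip(weights, inputs))
        # Transfer (activation) function produces the output y = f(v_k)
        return activation(v_k)

    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    y = artificial_neuron([1.0, 0.5], [0.4, -0.2], sigmoid)
    print(y)  # the neuron's output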
The Neuron with Bias
[Figure: neuron with bias. Input signals x1, x2, ..., xm are scaled by synaptic weights w1, w2, ..., wm and combined by a summing function; adding the bias b gives the local field v, and the activation function φ(v) yields the output y.]
Adding biases
• A linear neuron is a more flexible model if we include a bias:
  $\hat{y} = b + \sum_i x_i w_i$
• We can avoid having to figure out a separate learning rule for the bias by using a trick:
• A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1.
[Figure: the bias b drawn as the weight on a constant input 1, alongside weights w1 and w2 on inputs x1 and x2.]
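A minimal sketch of this trick in Python (the numeric values are illustrative): once each input vector is augmented with a constant 1, the bias is learned like any other weight.

    # Bias as a weight on an extra input line that is always 1
    weights = [0.4, -0.2]            # w1, w2
    bias = 0.3                       # b
    weights_aug = weights + [bias]   # bias becomes the weight on the constant input

    x = [1.0, 0.5]
    x_aug = x + [1.0]                # append the always-on input
    y_hat = sum(w * xi for w, xi in zip(weights_aug, x_aug))
    print(y_hat)                     # equals b + sum_i x_i * w_i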
Transfer functions
• Determines the output from a summation of the weighted inputs of a neuron:
  $O_j = f_j\left(\sum_i w_{ij} x_i\right)$
• Example (the XOR truth table):
  X1  X2  Output
   0   0    0
   0   1    1
   1   0    1
   1   1    0
Gradient Descent and the Delta Rule
• Although the perceptron rule finds a successful weight vector when the training examples are
linearly separable, it can fail to converge if the examples are not linearly separable. The
second training rule, called the delta rule, is designed to overcome this difficulty. If the
training examples are not linearly separable, the delta rule converges toward a best-fit
approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search the hypothesis space
of possible weight vectors to find the weights that best fit the training examples.
• The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by
  $o(\vec{x}) = \vec{w} \cdot \vec{x}$
• In order to derive a weight learning rule for linear units, define a measure for the training error as follows:
  $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
• where D is the set of training examples, $t_d$ is the target output for training example d, and $o_d$ is the output of the linear unit for training example d.
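To make the search concrete, here is a sketch of batch gradient descent for a linear unit under this error measure; the tiny dataset and the learning rate are illustrative assumptions.

    # Batch gradient descent for a linear unit, minimizing
    # E(w) = 1/2 * sum over d in D of (t_d - o_d)^2
    examples = [([1.0, 1.0], 1.9), ([2.0, 0.5], 2.4), ([0.0, 1.0], 0.5)]
    w = [0.0, 0.0]
    eta = 0.05  # learning rate

    for epoch in range(200):
        delta_w = [0.0] * len(w)
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # o = w . x
            for i, xi in enumerate(x):
                delta_w[i] += eta * (t - o) * xi       # accumulate over all of D
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]  # update after the full pass
    print(w)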
Linearly Separable and Non-Separable data
Gradient Descent
VISUALIZING THE HYPOTHESIS SPACE
• To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis space of possible weight vectors and their associated E values, as illustrated in the figure.
DERIVATION OF THE GRADIENT DESCENT RULE
• How can we calculate the direction of steepest descent along the error surface?
• This direction can be found by computing the derivative of E with respect to each component of the vector $\vec{w}$. This vector derivative is called the gradient of E with respect to $\vec{w}$, written $\nabla E(\vec{w})$:
  $\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$
• Notice $\nabla E(\vec{w})$ is itself a vector, whose components are the partial derivatives of E with respect to each of the $w_i$.
• When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E. The negative of this vector therefore gives the direction of steepest decrease.
• Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is
  $\vec{w} \leftarrow \vec{w} + \Delta\vec{w}$, where $\Delta\vec{w} = -\eta \, \nabla E(\vec{w})$
• Here $\eta$ is a positive constant called the learning rate.
DERIVATION OF THE GRADIENT DESCENT RULE
• $\eta$ determines the step size in the gradient descent search. The negative sign is present because we want to move the weight vector in the direction that decreases E. This training rule can also be written in its component form as below:
  $\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$
• where
  $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{id})$
  and $x_{id}$ denotes the input component $x_i$ for training example d.
• Finally,
  $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}$
Stochastic Gradient Descent
• Gradient descent is an important general paradigm for learning. It is a strategy/Algorithm for
searching.
• Search where? The hypothesis space of all possible values for our learnable parameters.
• Search what? The optimal values for the learnable parameters of our model, which would
minimize the loss.
• Gradient descent can be applied whenever
1. The hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a
linear unit), and
2. The error can be differentiated with respect to these hypothesis parameters.
• The key practical difficulties in applying gradient descent are
1. Converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands
of gradient descent steps), and
2. If there are multiple local minima in the error surface, then there is no guarantee that the
procedure will find the global minimum.
• One common variation on gradient descent intended to alleviate these difficulties is called incremental
gradient descent, or alternatively stochastic gradient descent.
Stochastic Gradient Descent
• The gradient descent training rule computes weight updates after summing over all the
training examples in D.
• The idea behind stochastic gradient descent is to approximate this gradient descent search by
updating weights incrementally, following the calculation of the error for each individual
example.
• Stochastic_Gradient_Descent(Training_Examples, $\eta$): for each training example d, update each weight as $w_i \leftarrow w_i + \eta \, (t_d - o_d) \, x_{id}$
• The above training rule is known as the delta rule, or sometimes the LMS (least-mean-square) rule, Adaline rule, or Widrow-Hoff rule.
Stochastic Gradient Descent
• One way to view this stochastic gradient descent is to consider a distinct error function defined for each individual training example d as follows:
  $E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2$
  where $t_d$ and $o_d$ are the target value and the unit output value for training example d.
• Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to $E_d(\vec{w})$.
• The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to our original error function $E(\vec{w})$.
• By making the value of $\eta$ (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.
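For contrast with the batch sketch above, the stochastic variant (same illustrative dataset) updates the weights immediately after each example:

    # Stochastic gradient descent: w_i <- w_i + eta * (t - o) * x_i per example
    examples = [([1.0, 1.0], 1.9), ([2.0, 0.5], 2.4), ([0.0, 1.0], 0.5)]
    w = [0.0, 0.0]
    eta = 0.05

    for epoch in range(200):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    print(w)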
Standard Gradient Descent Versus Stochastic Gradient Descent
Remarks on Perceptron Learning
• The key difference between the two similar algorithms for iteratively learning
perceptron weights is that the perceptron training rule updates weights based on the
error in the thresholded perceptron output, whereas the delta rule updates weights
based on the error in the unthresholded linear combination of inputs.
• The perceptron training rule converges after a finite number of iterations to a
hypothesis that perfectly classifies the training data, provided the training examples
are linearly separable.
• The delta rule converges only asymptotically toward the minimum error hypothesis,
possibly requiring unbounded time, but converges regardless of whether the training
data are linearly separable.
• A third possible algorithm for learning the weight vector is linear programming. Linear programming is a general, efficient method for solving sets of linear inequalities. However, the linear programming approach does not scale to training multilayer networks.
How does an ANN Work?
XOR Gate using Neural Network
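The slide's network diagram is not reproduced here; as a stand-in, the following sketch hand-wires one common two-layer solution with threshold units (these particular weights and thresholds are a standard textbook choice, an assumption rather than the slide's figure): the hidden units compute OR and AND, and the output fires only for OR-and-not-AND, which is XOR.

    def step(v):
        # Discontinuous threshold transfer function
        return 1 if v >= 0 else 0

    def xor_net(x1, x2):
        h_or  = step(1.0 * x1 + 1.0 * x2 - 0.5)      # fires when x1 OR x2
        h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)      # fires when x1 AND x2
        return step(1.0 * h_or - 1.0 * h_and - 0.5)  # OR and not AND = XOR

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, xor_net(a, b))   # reproduces the XOR truth table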
MULTILAYER NETWORKS AND THE
BACKPROPAGATION ALGORITHM
• A single perceptron can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the Backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
• What type of unit shall we use as the basis for constructing multilayer networks?
• Multiple layers of cascaded linear units still produce only linear functions.
• Networks capable of representing highly nonlinear functions are required.
• The perceptron unit is another possible choice, but its discontinuous threshold makes it
undifferentiable and hence unsuitable for gradient descent
• What is required is a unit whose output is a nonlinear function of its inputs, but whose
output is also a differentiable function of its inputs.
• One solution is the sigmoid unit, a unit very much like a perceptron, but based on a smoothed,
differentiable threshold function.
• The function tanh is also sometimes used in place of the sigmoid function
The Sigmoid Unit
• Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result.
• The threshold output is a continuous function of its input.
• More precisely, the sigmoid unit computes its output o as
  $o = \sigma(\vec{w} \cdot \vec{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$
• $\sigma$ is often called the sigmoid function or, alternatively, the logistic function. Its output ranges between 0 and 1, increasing monotonically with its input.
• Because the sigmoid function maps a very large input domain to a small range of outputs, it is often referred to as the squashing function of the unit.
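A short sketch of the sigmoid unit in Python (the weights and input are illustrative):

    import math

    def sigmoid(y):
        # Logistic (squashing) function: maps any real input into (0, 1)
        return 1.0 / (1.0 + math.exp(-y))

    def sigmoid_unit(x, w):
        # Linear combination of the inputs, then the smooth threshold
        net = sum(wi * xi for wi, xi in zip(w, x))
        return sigmoid(net)

    print(sigmoid_unit([1.0, 0.5], [0.4, -0.2]))  # always strictly between 0 and 1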
Network architectures
• Three different classes of network architectures
[Figure: a 3-4-2 feedforward network with an input layer, a hidden layer, and an output layer, alongside an illustration of recurrent networks.]
Applications of NNs
• Classification
  In marketing: consumer spending pattern classification
  In defence: radar and sonar image classification
  In agriculture & fishing: fruit and catch grading
  In medicine: ultrasound and electrocardiogram image classification, EEGs, medical diagnosis
• Recognition and Identification
  In general computing and telecommunications: speech, vision and handwriting recognition
  In finance: signature verification and bank note verification
Applications of NNs
• Assessment
In engineering: product inspection monitoring and control
In defence: target tracking
In security: motion detection, surveillance image analysis and fingerprint
matching
• Forecasting and Prediction
In finance: foreign exchange rate and stock market forecasting
In agriculture: crop yield forecasting
In marketing: sales forecasting
In meteorology: weather prediction
Activation Functions
Historical Development of Neural Network Principles
Multilayer neural networks
A multilayer perceptron is a feedforward neural network with one or
more hidden layers.
The network consists of an input layer of source neurons, at least one
middle or hidden layer of computational neurons, and an output
layer of computational neurons.
The input signals are propagated in a forward direction on a layer-by-
layer basis.
A hidden layer “hides” its desired output. Neurons in the hidden layer
cannot be observed through the input/output behaviour of the network.
There is no obvious way to know what the desired output of the
hidden layer should be.
Multilayer perceptron with two hidden layers
[Figure: input signals enter the input layer, pass through the first and second hidden layers, and leave the output layer as output signals.]
Back-propagation Learning Algorithm
In a back-propagation neural network, the learning algorithm has
two phases.
1. Feed forward pass: A training input pattern is presented to the
network input layer. The network propagates the input pattern
from layer to layer until the output pattern is generated by the
output layer.
2. Backpropagation pass: If the output pattern is different from the
desired output, an error is calculated and then propagated
backwards through the network from the output layer to the input
layer. The weights are modified as the error is propagated.
Three-layer back-propagation neural network
[Figure: inputs x1 ... xn enter the input layer; weights wij connect input unit i to hidden unit j; weights wjk connect hidden unit j to output unit k; outputs y1 ... yl leave the output layer. Input signals propagate forward, error signals propagate backward.]
Sigmoid Unit
Steps in Back propagation Algorithm
1. Initialize the weights and biases.
2. Feed the training sample.
3. Propagate the inputs forward; Compute the net input
and output of each unit in the hidden and output layers.
4. Compute Error and back propagate the error.
5. Update weights and biases to reflect the propagated
errors.
6. Check for the terminating conditions.
Propagate the inputs forward
• For unit j in the input layer, its output is equal to its input, that is,
  $O_j = I_j$
• The net input to each unit in the hidden and output layers is computed as follows. Given a unit j in a hidden or output layer, the net input is
  $I_j = \sum_i w_{ij} O_i + \theta_j$
  where $w_{ij}$ is the weight of the connection from unit i in the previous layer to unit j; $O_i$ is the output of unit i from the previous layer; $\theta_j$ is the bias of the unit.
Propagate the inputs forward
• Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
• Given a net input $I_j$ to unit j, then $O_j = f(I_j)$, the output of unit j, is computed as
  $O_j = \frac{1}{1 + e^{-I_j}}$
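A sketch of this forward pass in Python (the 2-2-1 network shape, weights, and biases are illustrative):

    import math

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    def layer_forward(O_prev, W, theta):
        # W[i][j] is the weight from unit i in the previous layer to unit j;
        # I_j = sum_i W[i][j] * O_i + theta_j, and O_j = 1 / (1 + e^(-I_j))
        nets = [sum(W[i][j] * O_prev[i] for i in range(len(O_prev))) + theta[j]
                for j in range(len(theta))]
        return [sigmoid(I_j) for I_j in nets]

    x = [1.0, 0.5]                    # input layer: O_j = I_j
    hidden = layer_forward(x, [[0.2, -0.3], [0.4, 0.1]], [-0.4, 0.2])
    output = layer_forward(hidden, [[-0.3], [-0.2]], [0.1])
    print(output)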
Back propagate the error
• When reaching the output layer, the error is computed and propagated backwards.
• For a unit k in the output layer, the error is computed by the formula:
  $Err_k = O_k (1 - O_k)(T_k - O_k)$
  where $O_k$ is the produced output of unit k, computed using the activation function
  $O_k = \frac{1}{1 + e^{-I_k}}$
  $T_k$ is the true output based on the known class label; it is part of the data sample.
  $O_k(1 - O_k)$ is the derivative (rate of change) of the activation function.
Derivative of a Sigmoid function
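The slide's plot is not reproduced here; the property it illustrates follows in one line of algebra:

    $\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
     \sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}
               = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
               = \sigma(x)\,(1 - \sigma(x))$

This is exactly the $O_k(1 - O_k)$ factor in the error formula above.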
Back propagate the error
• The error is propagated backwards by updating weights and biases to reflect the error of the network classification.
• For a unit j in the hidden layer, the error is computed by the formula:
  $Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}$
  where $w_{jk}$ is the weight of the connection from unit j to unit k in the next higher layer, and $Err_k$ is the error of unit k.
Update weights and biases
• Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate. The learning rate is fixed for the implementation.
  $\Delta w_{ij} = (l) \, Err_j \, O_i$,  $w_{ij} = w_{ij} + \Delta w_{ij}$
• Biases are updated by the following equations:
  $\Delta \theta_j = (l) \, Err_j$,  $\theta_j = \theta_j + \Delta \theta_j$
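Putting steps 3 to 5 together, here is a minimal sketch of one case update for a three-layer network (the 2-2-1 shape, initial weights, learning rate, and the single training sample are illustrative assumptions):

    import math

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    w_ih = [[0.2, -0.3], [0.4, 0.1]]   # w_ij: input unit i -> hidden unit j
    w_ho = [[-0.3], [-0.2]]            # w_jk: hidden unit j -> output unit k
    theta_h, theta_o = [-0.4, 0.2], [0.1]
    l = 0.9                            # learning rate
    x, target = [1.0, 0.0], [1.0]      # one training sample

    # Step 3: propagate the inputs forward
    O_h = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(2)) + theta_h[j])
           for j in range(2)]
    O_o = [sigmoid(sum(w_ho[j][k] * O_h[j] for j in range(2)) + theta_o[k])
           for k in range(1)]

    # Step 4: compute the error and back-propagate it
    Err_o = [O_o[k] * (1 - O_o[k]) * (target[k] - O_o[k]) for k in range(1)]
    Err_h = [O_h[j] * (1 - O_h[j]) * sum(Err_o[k] * w_ho[j][k] for k in range(1))
             for j in range(2)]

    # Step 5: update weights and biases (case updating)
    for j in range(2):
        for k in range(1):
            w_ho[j][k] += l * Err_o[k] * O_h[j]
    for i in range(2):
        for j in range(2):
            w_ih[i][j] += l * Err_h[j] * x[i]
    theta_o = [theta_o[k] + l * Err_o[k] for k in range(1)]
    theta_h = [theta_h[j] + l * Err_h[j] for j in range(2)]
    print(O_o, Err_o, Err_h)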
Update weights and biases
• We are updating weights and biases after the presentation of each sample. This is called case updating.
• Epoch: one iteration through the training set is called an epoch.
• Adding a momentum term to the weight update has the effect of gradually increasing the step size of the search in regions where the gradient is unchanging, thereby speeding convergence.
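A sketch of a momentum-augmented update, $\Delta w(n) = l \cdot Err_j O_i + \alpha \cdot \Delta w(n-1)$, where the momentum constant $\alpha$ and all values are illustrative:

    l, alpha = 0.9, 0.5        # learning rate and momentum constant
    w, delta_prev = 0.1, 0.0   # one weight and its previous change

    for err_j, o_i in [(0.2, 0.8), (0.19, 0.8), (0.18, 0.8)]:
        # delta_w(n) = l * Err_j * O_i + alpha * delta_w(n - 1)
        delta = l * err_j * o_i + alpha * delta_prev
        w, delta_prev = w + delta, delta   # step grows while the gradient is steady
    print(w)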
Learning in Arbitrary Acyclic Networks
• It is straightforward to generalize the Backpropagation Algorithm to any
directed acyclic graph, regardless of whether the network units are
arranged in uniform layers as assumed up to now. In the case that they are
not, the rule for calculating error