
ARTIFICIAL NEURAL NETWORKS
Dr. K. Venkateswara Rao
Professor CSE
Artificial neural networks (ANNs)
• Artificial neural network (ANN) learning methods provide a robust approach to approximating real-valued, discrete-valued, and vector-valued functions from examples.
• ANN learning is robust to errors in the training data and has been successfully
applied to problems such as interpreting visual scenes, speech recognition, and
learning robot control strategies.
• The Backpropagation learning algorithm has proven surprisingly successful in many practical problems such as learning to recognize handwritten characters, learning to recognize spoken words, and learning to recognize faces.
• The study of artificial neural networks (ANNs) has been inspired in part by the
observation that biological learning systems are built of very complex webs of
interconnected neurons.
Neuron: A processing element in Human Brain

Dendrites: Input
Cell body: Processor
Synapse: Link
Axon: Output
Neuron: A processing element in Human Brain
• A neuron is connected to other neurons through about 10,000 synapses
• A neuron receives input from other neurons. Inputs are combined.
• Once input exceeds a critical level, the neuron discharges a spike ‐ an electrical
pulse that travels from the body, down the axon, to the next neuron(s)
• The axon endings almost touch the dendrites or cell body of the next neuron.
• Transmission of an electrical signal from one neuron to the next is effected by
neurotransmitters.
• Neurotransmitters are chemicals that are released from the first neuron and bind to the second.
• This binding link between neurons is called a synapse. The strength of the
signal that reaches the next neuron depends on factors such as the amount of
neurotransmitter available.
How the Human Brain learns

• In the human brain, a typical neuron collects signals from others through a host of fine
structures called dendrites.
• The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches.
• At the end of each branch, a structure called a synapse converts the activity from the axon
into electrical effects that inhibit or excite activity in the connected neurons.
Biological neural networks
• The human brain is estimated to contain a densely interconnected network of approximately 10^11 neurons, each connected, on average, to 10^4 others. Neuron activity is typically excited or inhibited through connections to other neurons. The fastest neuron switching times are known to be on the order of 10^-3 seconds, quite slow compared to computer switching speeds of 10^-10 seconds.
• Yet humans are able to make surprisingly complex decisions, surprisingly
quickly.
• The information-processing abilities of biological neural systems must follow
from highly parallel processes operating on representations that are distributed
over many neurons
• One motivation for ANN systems is to capture this kind of highly parallel
computation based on distributed representations.
An artificial neuron is an imitation of a human neuron
Artificial neural networks (ANNs)
• Historically, two groups of researchers have worked with artificial neural
networks.
1. One group has been motivated by the goal of using ANNs to study and
model biological learning processes.
2. A second group has been motivated by the goal of obtaining highly
effective machine learning algorithms, independent of whether these
algorithms mirror biological processes.
Model of an Artificial Neuron.
How does a Neuron work?

[Figure: model of an artificial neuron. Inputs x1, x2, …, xm are multiplied by weights w1, w2, …, wm; the processing step sums the weighted inputs (∑); a transfer function f(vk) (the activation function) produces the output y.]
The Neuron with Bias
[Figure: the neuron with bias. Input signals x1, x2, …, xm are multiplied by synaptic weights w1, w2, …, wm and combined by a summing function, together with a bias b, to produce the local field v; an activation function φ(v) gives the output y.]
Adding biases
• A linear neuron is a more flexible model if we include a bias: ŷ = b + Σi xi wi
• We can avoid having to figure out a separate learning rule for the bias by using a trick:
• A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1.
[Figure: the bias b drawn as the weight on a constant input of 1, alongside the weights w1 and w2 on inputs x1 and x2.]
Transfer functions
• Determines the output from a summation of the weighted inputs of a neuron:

  Oj = fj( Σi wij xi )

• Maps any real number into a domain normally bounded by 0 to 1 or -1 to 1, i.e. squashing functions. The most common functions are sigmoid functions:

  logistic: f(x) = 1 / (1 + e^-x)

  hyperbolic tangent: f(x) = (e^x - e^-x) / (e^x + e^-x)
McCulloch–Pitts “neuron” (1943)
• The McCulloch–Pitts neuron is a binary threshold neuron.
• Attributes of the McCulloch–Pitts neuron:
  • n binary inputs xi (i = 1 to n) and 1 output (0 or 1)
  • Synaptic weights wi
  • Threshold T (or theta)
Logic Operators: NOT, AND, OR
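The deck illustrates these gates with figures; the Python sketch below is illustrative, using one common choice of weights and thresholds for a McCulloch–Pitts neuron:

def mp_neuron(inputs, weights, threshold):
    # Fires (outputs 1) when the weighted sum of the binary
    # inputs reaches the threshold T; otherwise outputs 0.
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=2)
OR  = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=1)
NOT = lambda x:      mp_neuron([x],      [-1],   threshold=0)

print(AND(1, 1), OR(0, 1), NOT(1))  # 1 1 0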
Perceptron
Representational Power of Perceptron
• The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a -1 for instances lying on the other side.
• The equation for this decision hyperplane is w · x = 0. (Note: w · x is a vector representation of w0 + w1x1 + … + wnxn.)
• Some sets of positive and negative examples cannot be separated by any hyperplane. Those that can be separated are called linearly separable sets of examples.
• A single perceptron can be used to represent many boolean functions. For example, if we assume boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the weights w0 = -0.8 and w1 = w2 = 0.5. This perceptron can be made to represent the OR function instead by altering the threshold weight to w0 = 0.3 (see the sketch after this list).
• Perceptrons can represent all of the primitive boolean functions AND, OR, NAND, and NOR.
• Some boolean functions, such as the XOR function, whose training examples are linearly non-separable, cannot be represented by a single perceptron.
• In fact, every boolean function can be represented by some network of perceptrons only two levels deep, in which the inputs are fed to multiple units, and the outputs of these units are then input to a second, final stage.
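A minimal Python sketch of the AND/OR example above (illustrative code, not from the slides; it assumes the sign-style output convention just described):

def perceptron(x1, x2, w0, w1, w2):
    # Outputs 1 (true) if w0 + w1*x1 + w2*x2 > 0, else -1 (false)
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

# AND with w0 = -0.8, w1 = w2 = 0.5 over inputs in {-1, +1}
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, '->', perceptron(x1, x2, -0.8, 0.5, 0.5))

# Setting w0 = 0.3 instead turns the same unit into OR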
Representational Power of Perceptrons
Perceptron Learning
• The Perceptron learning problem is to determine a weight vector that causes the
perceptron to produce the correct output for each of the given training examples.
• Several algorithms are known to solve this learning problem. Two of them are:
1. The perceptron rule
2. The delta rule (a variant of the LMS rule)
• One way to learn an acceptable weight vector is to begin with random weights, then
iteratively apply the perceptron to each training example, modifying the perceptron
weights whenever it misclassifies an example. This process is repeated, iterating
through the training examples as many times as needed until the perceptron
classifies all training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi associated
with input xi according to the rule
The Perceptron Training / Learning rule
wi ← wi + Δwi
where Δwi = η (t – o) xi
Where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small positive constant called the learning rate
• The role of the learning rate is to moderate the degree to which weights are changed at each step. It is usually set to some small value (e.g., 0.1) and is sometimes made to decay as the number of weight-tuning iterations increases.
• Perceptron learning will converge
  • if the training data is linearly separable
  • and η is sufficiently small
• Note: If the data are not linearly separable, convergence is not assured.
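The rule can be turned into a small training loop. The sketch below is illustrative (the function name, data, and defaults are choices of this edit, not the slides'):

import random

def train_perceptron(examples, eta=0.1, max_epochs=100):
    # examples: list of (inputs, target) pairs with targets in {-1, +1};
    # a constant input of 1 is prepended so w[0] acts as the bias weight.
    n = len(examples[0][0]) + 1
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]
    for _ in range(max_epochs):
        misclassified = 0
        for x, t in examples:
            x = (1.0,) + tuple(x)
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:
                # Perceptron training rule: wi <- wi + eta * (t - o) * xi
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                misclassified += 1
        if misclassified == 0:   # all training examples classified correctly
            break
    return w

# AND over {-1, +1} inputs is linearly separable, so this converges
and_data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
print(train_perceptron(and_data))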
Perceptron Learning: Intuition
• Weight update
  • Input Ij (j = 1, 2, …, n)
  • Single output O; target output T
• Consider some initial weights
• Define the example error: Err = T – O
• Now just move the weights in the right direction!
• If the error is positive, then we need to increase O:
  • Err > 0 ⇒ need to increase O
  • Err < 0 ⇒ need to decrease O
• Each input unit j contributes Wj Ij to the total input:
  • if Ij is positive, increasing Wj tends to increase O
  • if Ij is negative, decreasing Wj tends to increase O
• So, use: Wj ← Wj + η · Ij · Err, where η is the learning rate
• This is the Perceptron Learning Rule (Rosenblatt 1960)
Problem with the Perceptron
• Can only learn linearly separable tasks.
• Cannot solve linearly non-separable problems,
• e.g. the exclusive-or function (XOR), the simplest non-separable function:

X1  X2  Output
0   0   0
0   1   1
1   0   1
1   1   0
Gradient Descent and the Delta Rule
• Although the perceptron rule finds a successful weight vector when the training examples are
linearly separable, it can fail to converge if the examples are not linearly separable. The
second training rule, called the delta rule, is designed to overcome this difficulty. If the
training examples are not linearly separable, the delta rule converges toward a best-fit
approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search the hypothesis space
of possible weight vectors to find the weights that best fit the training examples.
• The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by o = w · x.
• In order to derive a weight learning rule for linear units, define a measure for the training error as follows:

  E(w) = 1/2 Σd∈D (td – od)²

• where D is the set of training examples, td is the target output for training example d, and od is the output of the linear unit for training example d.
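A short sketch of this error measure for a linear unit (illustrative; the names are hypothetical):

def training_error(w, examples):
    # E(w) = 1/2 * sum over d in D of (t_d - o_d)^2,
    # where o_d = w . x_d is the output of an unthresholded linear unit
    total = 0.0
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        total += (t - o) ** 2
    return 0.5 * total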
Linearly Separable and Non-Separable data
Gradient Descent
VISUALIZING THE HYPOTHESIS SPACE
• To understand the gradient descent algorithm, it is helpful to visualize the entire hypothesis space of possible weight vectors and their associated E values, as illustrated in the figure.
DERIVATION OF THE GRADIENT DESCENT RULE
• How can we calculate the direction of steepest descent along the error surface?
• This direction can be found by computing the derivative of E with respect to each component of the vector w. This vector derivative is called the gradient of E with respect to w, written ∇E(w):

  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn ]

• Notice that ∇E(w) is itself a vector, whose components are the partial derivatives of E with respect to each of the wi.
• When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E. The negative of this vector therefore gives the direction of steepest decrease.
• Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is w ← w + Δw, where Δw = –η ∇E(w).
• Here η is a positive constant called the learning rate.
DERIVATION OF THE GRADIENT DESCENT RULE
• η determines the step size in the gradient descent search. The negative sign is present because we want to move the weight vector in the direction that decreases E. This training rule can also be written in its component form as below:

  wi ← wi + Δwi, where Δwi = –η ∂E/∂wi

• where

  ∂E/∂wi = ∂/∂wi [ 1/2 Σd (td – od)² ] = Σd (td – od)(–xid)

• Finally,

  Δwi = η Σd (td – od) xid
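Putting the component-form rule into code, one batch gradient-descent step might look like this (an illustrative sketch, not the slides' own pseudocode):

def gradient_descent_step(w, examples, eta=0.05):
    # Batch rule: Delta w_i = eta * sum over d of (t_d - o_d) * x_id,
    # i.e. one step in the direction of steepest decrease of E
    delta = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi
    return [wi + di for wi, di in zip(w, delta)]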
Stochastic Gradient Descent
• Gradient descent is an important general paradigm for learning. It is a strategy/algorithm for searching.
• Search where? The hypothesis space of all possible values for our learnable parameters.
• Search what? The optimal values for the learnable parameters of our model, which would
minimize the loss.
• The Gradient descent can be applied whenever
1. The hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a
linear unit), and
2. The error can be differentiated with respect to these hypothesis parameters.
• The key practical difficulties in applying gradient descent are
1. Converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands
of gradient descent steps), and
2. If there are multiple local minima in the error surface, then there is no guarantee that the
procedure will find the global minimum.
• One common variation on gradient descent intended to alleviate these difficulties is called incremental
gradient descent, or alternatively stochastic gradient descent.
Stochastic Gradient Descent
• The gradient descent training rule computes weight updates after summing over all the
training examples in D.
• The idea behind stochastic gradient descent is to approximate this gradient descent search by
updating weights incrementally, following the calculation of the error for each individual
example.
• Stochastic_Gradient_Descent(Training_Examples, η): initialize each wi to a small random value; then, for each training example (x, t), compute the linear unit output o and update each weight as wi ← wi + η (t – o) xi.
• The above training rule is known as the delta rule, or sometimes the LMS (least-mean-square) rule, Adaline rule, or Widrow-Hoff rule.
Stochastic Gradient Descent
• One way to view this stochastic gradient descent is to consider a distinct error function defined for each individual training example d as follows:

  Ed(w) = 1/2 (td – od)²

where td and od are the target value and the unit output value for training example d.
• Stochastic gradient descent iterates over the training examples d in D, at each iteration altering the weights according to the gradient with respect to Ed(w).
• The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to our original error function E(w).
• By making the value of η (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely.
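For contrast with the batch step shown earlier, here is a sketch of one stochastic (per-example) epoch under the delta rule (illustrative names and defaults):

def sgd_epoch(w, examples, eta=0.05):
    # Delta rule: weights change after *each* example d, following
    # the gradient of E_d(w) = 1/2 * (t_d - o_d)^2, rather than
    # after summing the gradient over all of D as in batch descent
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w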
Standard Gradient Descent versus Stochastic Gradient Descent
Remarks on Perceptron Learning
• The key difference between the two similar algorithms for iteratively learning
perceptron weights is that the perceptron training rule updates weights based on the
error in the thresholded perceptron output, whereas the delta rule updates weights
based on the error in the unthresholded linear combination of inputs.
• The perceptron training rule converges after a finite number of iterations to a
hypothesis that perfectly classifies the training data, provided the training examples
are linearly separable.
• The delta rule converges only asymptotically toward the minimum error hypothesis,
possibly requiring unbounded time, but converges regardless of whether the training
data are linearly separable.
• A third possible algorithm for learning the weight vector is linear programming. Linear programming is a general, efficient method for solving sets of linear inequalities. However, the linear programming approach does not scale to training multilayer networks.
How does an ANN work?
XOR Gate using Neural Network
MULTILAYER NETWORKS AND THE
BACKPROPAGATION ALGORITHM
• A single perceptron can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the BACKPROPAGATION algorithm are capable of expressing a rich variety of nonlinear decision surfaces.
• What type of unit shall we use as the basis for constructing multilayer networks?
• Multiple layers of cascaded linear units still produce only linear functions.
• Networks capable of representing highly nonlinear functions are required.
• The perceptron unit is another possible choice, but its discontinuous threshold makes it
undifferentiable and hence unsuitable for gradient descent
• What is required is a unit whose output is a nonlinear function of its inputs, but whose
output is also a differentiable function of its inputs.
• One solution is the sigmoid unit, a unit very much like a perceptron, but based on a smoothed,
differentiable threshold function.
• The function tanh is also sometimes used in place of the sigmoid function
The Sigmoid Unit
• Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then
applies a threshold to the result.
• The threshold output is a continuous function of its input.
• More precisely, the sigmoid unit computes its output o as o = σ(w · x), where

  σ(y) = 1 / (1 + e^-y)

• σ is often called the sigmoid function or, alternatively, the logistic function. Its output ranges between 0 and 1, increasing monotonically with its input.
• Because the sigmoid function maps a very large input domain to a small range of outputs, it is often referred to as the squashing function of the unit.
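A minimal sketch of a sigmoid unit (illustrative; the derivative property noted in the comment is the standard one for the logistic function):

import math

def sigmoid(y):
    # The logistic / sigmoid "squashing" function: R -> (0, 1)
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(w, x):
    # o = sigma(w . x); unlike the thresholded perceptron this is
    # differentiable: d sigma / dy = sigma(y) * (1 - sigma(y)),
    # which is what makes it suitable for gradient descent
    net = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(net)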
Network architectures
• Three different classes of network architectures:
  • single-layer feed-forward
  • multi-layer feed-forward
  • recurrent
• In feed-forward networks, neurons are organized in acyclic layers.
• The architecture of a neural network is linked with the learning algorithm used to train it.
Single Layer Feed-forward

[Figure: a single-layer feed-forward network, with an input layer of source nodes feeding an output layer of neurons.]
Multi layer feed-forward

[Figure: a 3-4-2 multi-layer feed-forward network, with an input layer, a hidden layer of four neurons, and an output layer.]
Recurrent Networks

[Figure: (a) a feedforward network; (b) a recurrent network; (c) the recurrent network unfolded in time.]
Properties of Artificial Neural Networks
• High level abstraction of neural input-output transformation
• Inputs → weighted sum of inputs → nonlinear function → output
• Typically no spikes
• Typically use implausible constraints or learning rules
• Often used where data or functions are uncertain
• Goal is to learn from a set of training data
• And to generalize from learned instances to new unseen data
• Key attributes
• Parallel computation
• Distributed representation and storage of data
• Learning (networks adapt themselves to solve a problem)
• Fault tolerance (insensitive to component failures)
History of Neural Networks
• 1943: McCulloch and Pitts proposed a model of a neuron --> Perceptron
• 1960s: Widrow and Hoff explored Perceptron networks (which they called “Adalines”) and the delta rule.
• 1962: Rosenblatt proved the convergence of the perceptron training rule.
• 1969: Minsky and Papert showed that the Perceptron cannot deal with nonlinearly-separable data sets---even those that represent simple functions such as XOR.
• 1970-1985: Very little research on Neural Nets.
• 1982: Hopfield showed convergence in symmetric networks
• Introduced the energy-function concept
• 1986: Invention of Backpropagation [Rumelhart and McClelland] which can learn
from nonlinearly-separable data sets.
• Since 1985: A lot of research in Neural Nets!
History of Artificial Neural Networks
Appropriate Problems for Neural Network Learning
• ANN learning is well-suited to problems in which the training data corresponds to noisy,
complex sensor data, such as inputs from cameras and microphones.
• It is also applicable to problems for which more symbolic representations are often used,
such as the decision tree learning tasks.
• The BACKPROPAGATION algorithm is the most commonly used ANN learning
technique. It is appropriate for problems with the following characteristics:
1. Instances are represented by many attribute-value pairs.
2. The target function output may be discrete-valued, real-valued, or a vector of several
real- or discrete-valued attributes
3. The training examples may contain errors.
4. Long training times are acceptable
5. Fast evaluation of the learned target function may be required
6. The ability of humans to understand the learned target function is not important.
Applications of NNs

• Classification
In marketing: consumer spending pattern classification
In defence: radar and sonar image classification
In agriculture & fishing: fruit and catch grading
In medicine: ultrasound and electrocardiogram image classification,
EEGs, medical diagnosis
• Recognition and Identification
In general computing and telecommunications: speech, vision and
handwriting recognition
In finance: signature verification and bank note verification
Applications of NNs

• Assessment
In engineering: product inspection monitoring and control
In defence: target tracking
In security: motion detection, surveillance image analysis and fingerprint
matching
• Forecasting and Prediction
In finance: foreign exchange rate and stock market forecasting
In agriculture: crop yield forecasting
In marketing: sales forecasting
In meteorology: weather prediction
Activation Functions
Historical Development of Neural Network Principles
Multilayer neural networks
• A multilayer perceptron is a feedforward neural network with one or more hidden layers.
• The network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons.
• The input signals are propagated in a forward direction on a layer-by-layer basis.
• A hidden layer “hides” its desired output. Neurons in the hidden layer cannot be observed through the input/output behaviour of the network. There is no obvious way to know what the desired output of the hidden layer should be.
Multilayer perceptron with two hidden layers

[Figure: a multilayer perceptron with two hidden layers. Input signals enter the input layer, pass through the first and second hidden layers, and output signals leave the output layer.]
Back-propagation Learning Algorithm
• In a back-propagation neural network, the learning algorithm has two phases.
1. Feed-forward pass: a training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer.
2. Back-propagation pass: if the output pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated.
Three-layer back-propagation neural network
[Figure: a three-layer back-propagation neural network. Input signals x1 … xn enter the input layer; weights wij connect input unit i to hidden unit j; weights wjk connect hidden unit j to output unit k; the output layer produces y1 … yl. Error signals propagate backwards from the output layer to the input layer.]
Sigmoid Unit
Steps in the Backpropagation Algorithm
1. Initialize the weights and biases.
2. Feed the training sample.
3. Propagate the inputs forward; Compute the net input
and output of each unit in the hidden and output layers.
4. Compute Error and back propagate the error.
5. Update weights and biases to reflect the propagated
errors.
6. Check for the terminating conditions.
Propagate the inputs forward
• For unit j in the input layer, its output is equal to its input, that is, Oj = Ij.
• The net input to each unit in the hidden and output layers is computed as follows. Given a unit j in a hidden or output layer, the net input is

  Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; θj is the bias of the unit.
Propagate the inputs forward
• Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
• Given a net input Ij to unit j, then Oj = f(Ij), the output of unit j, is computed as

  Oj = 1 / (1 + e^-Ij)
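The forward pass can be sketched directly from these formulas. The code below is illustrative (the layer layout and example weights are arbitrary choices of this edit, not from the slides):

import math

def forward_pass(inputs, layers):
    # layers: list of (weights, biases) per hidden/output layer, where
    # weights[j][i] is w_ij (unit i of the previous layer -> unit j)
    # and biases[j] is theta_j.
    outputs = list(inputs)          # input-layer units: O_j = I_j
    for weights, biases in layers:
        net = [sum(w * o for w, o in zip(w_j, outputs)) + theta_j
               for w_j, theta_j in zip(weights, biases)]         # I_j
        outputs = [1.0 / (1.0 + math.exp(-i_j)) for i_j in net]  # O_j
    return outputs

# One hidden layer (2 units) and one output unit, with arbitrary weights:
layers = [([[0.2, -0.3], [0.4, 0.1]], [-0.4, 0.2]),   # hidden layer
          ([[-0.3, -0.2]], [0.1])]                    # output layer
print(forward_pass([1.0, 0.0], layers))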
Back propagate the error
• When reaching the output layer, the error is computed and propagated backwards.
• For a unit k in the output layer, the error is computed by the formula:

  Errk = Ok (1 – Ok)(Tk – Ok)

where Ok is the produced output of unit k, computed using the activation function

  Ok = 1 / (1 + e^-Ik)

Tk is the true output based on the known class label; it is part of the data sample.
Ok(1 – Ok) is the derivative (rate of change) of the activation function.
Derivative of a Sigmoid function
Back propagate the error
• The error is propagated backwards by updating weights and biases to reflect the error of the network classification.
• For a unit j in the hidden layer, the error is computed by the formula:

  Errj = Oj (1 – Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
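Both error formulas translate directly into code. This sketch is illustrative; the weight-indexing convention is an assumption stated in the comments:

def output_errors(outputs, targets):
    # Err_k = O_k * (1 - O_k) * (T_k - O_k)
    return [o * (1 - o) * (t - o) for o, t in zip(outputs, targets)]

def hidden_errors(hidden_outputs, next_errors, next_weights):
    # Err_j = O_j * (1 - O_j) * sum over k of Err_k * w_jk, where
    # next_weights[k][j] is assumed to hold the weight w_jk from
    # hidden unit j to unit k in the next higher layer
    return [o_j * (1 - o_j) *
            sum(err_k * w_k[j]
                for err_k, w_k in zip(next_errors, next_weights))
            for j, o_j in enumerate(hidden_outputs)]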
Update weights and biases
• Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate. The learning rate is fixed for the implementation.

  Δwij = (l) Errj Oi
  wij = wij + Δwij

• Biases are updated by the following equations:

  Δθj = (l) Errj
  θj = θj + Δθj
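A sketch of these update equations (illustrative; lr stands in for the slides' learning-rate constant l):

def update_layer(weights, biases, errors, prev_outputs, lr=0.9):
    # Delta w_ij = (l) * Err_j * O_i and Delta theta_j = (l) * Err_j,
    # applied in place to one layer's weights and biases
    for j, err_j in enumerate(errors):
        for i, o_i in enumerate(prev_outputs):
            weights[j][i] += lr * err_j * o_i
        biases[j] += lr * err_j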
Update weights and biases
• We are updating weights and biases after the presentation of each sample. This is called case updating.
• Epoch: one iteration through the training set is called an epoch.
• Epoch updating: alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.
• Case updating is more accurate.
Terminating Conditions

Training stops when:
1. all Δwij in the previous epoch are below some threshold, OR
2. the percentage of samples misclassified in the previous epoch is below some threshold, OR
3. a pre-specified number of epochs has expired.
• In practice, several hundreds of thousands of epochs may be required before the weights will converge.
The stochastic gradient descent version of the Backpropagation Algorithm
for feedforward networks containing two layers of sigmoid units.
Adding Momentum
• The most common variant of the Backpropagation Algorithm is to alter the weight-update rule by making the weight update on the nth iteration depend partially on the update that occurred during the (n-1)th iteration, as follows:

  Δwji(n) = η δj xji + α Δwji(n-1)

where 0 ≤ α < 1 is a constant called the momentum.
• The momentum has the effect of gradually increasing the step size of the search in regions where the gradient is unchanging, thereby speeding convergence.
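A sketch of the momentum update (illustrative; gradient_term stands for the usual η·δj·xji term computed by backpropagation):

def momentum_step(w, gradient_term, prev_delta, alpha=0.9):
    # Delta w(n) = gradient_term + alpha * Delta w(n-1)
    delta = [g + alpha * d for g, d in zip(gradient_term, prev_delta)]
    w = [wi + di for wi, di in zip(w, delta)]
    return w, delta   # keep delta around for the next iteration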
Learning in Arbitrary Acyclic Networks
• It is straightforward to generalize the Backpropagation Algorithm to any directed acyclic graph, regardless of whether the network units are arranged in uniform layers as assumed up to now. In the case that they are not, the rule for calculating the error becomes

  δr = or (1 – or) Σ s∈Downstream(r) wsr δs

• where Downstream(r) is the set of units immediately downstream from unit r in the network: that is, all units whose inputs include the output of unit r.
Derivation of the Backpropagation Rule
• For each training example d, every weight wji is updated by adding to it Δwji, where

  Δwji = –η ∂Ed/∂wji

and Ed is the error on training example d, summed over all output units: Ed(w) = 1/2 Σk∈outputs (tk – ok)².
• The notation used is as follows: xji denotes the ith input to unit j, wji the corresponding weight, netj = Σi wji xji the net input to unit j, oj the output of unit j, and tj its target value.
Derivation of the Backpropagation Rule
Appropriate Problems for Backpropagation
• BACKPROPAGATION is appropriate for problems with the
following characteristics:
1. Instances are represented by many attribute-value pairs.
2. The target function output may be discrete-valued, real-valued,
or a vector of several real- or discrete-valued attributes.
3. The training examples may contain errors.
4. Long training times are acceptable.
5. Fast evaluation of the learned target function may be required.
6. The ability of humans to understand the learned target function
is not important.
