
UNIT 4 - ARTIFICIAL NEURAL NETWORKS: Perceptrons, Multilayer Perceptrons, Gradient Descent
and the Delta Rule, Multilayer Networks, Derivation of the Backpropagation Algorithm,
Generalization

ANN:

A neural network is a machine learning model inspired by the biological neuron. An Artificial
Neural Network (ANN) is an information processing technique that works in a way loosely
analogous to how the human brain processes information: it consists of a large number of
connected processing units that work together to process inputs and produce meaningful results.
A neural network typically contains the following three layers (a minimal forward-pass sketch in
code follows the list):

 Input layer – The activity of the input units represents the raw information that is fed
into the network.
 Hidden layer – The activity of each hidden unit is determined by the activities of the
input units and the weights on the connections between the input and hidden units. There
may be one or more hidden layers.
 Output layer – The behavior of the output units depends on the activity of the hidden
units and the weights between the hidden and output units.
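
To make the layer structure concrete, here is a minimal sketch of a forward pass through such a
network in Python. The layer sizes, random weights, and sigmoid activation are illustrative
assumptions, not specified by the text above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input units, 4 hidden units, 2 output units.
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(4, 3))   # input-to-hidden weights
W_output = rng.normal(scale=0.1, size=(2, 4))   # hidden-to-output weights

def forward(x):
    # Hidden activity depends on the input activities and the input-to-hidden
    # weights; output activity depends on the hidden activities and the
    # hidden-to-output weights.
    hidden = sigmoid(W_hidden @ x)
    output = sigmoid(W_output @ hidden)
    return output

print(forward(np.array([0.5, -1.0, 2.0])))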

NEURAL NETWORK REPRESENTATIONS:

ALVINN:

A prototypical example of ANN learning is provided by Pomerleau's (1993) system ALVINN,
which uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public
highways. The input to the neural network is a 30 x 32 grid of pixel intensities obtained from a
forward-pointed camera mounted on the vehicle. The network output is the direction in which
the vehicle is steered. The ANN is trained to mimic the observed steering commands of a
human driving the vehicle for approximately 5 minutes. ALVINN has used its learned networks
to successfully drive at speeds up to 70 miles per hour and for distances of 90 miles on public
highways.
APPROPRIATE PROBLEMS FOR NEURAL NETWORK LEARNING:

It is appropriate for problems with the following characteristics:

1. Instances are represented by many attribute-value pairs. The target function to be
learned is defined over instances that can be described by a vector of predefined
features, such as the pixel values in the ALVINN example. These input attributes may be
highly correlated or independent of one another. Input values can be any real values.
2. The target function output may be discrete-valued, real-valued, or a vector of several
real- or discrete-valued attributes. For example, in the ALVINN system the output is a
vector of 30 attributes, each corresponding to a recommendation regarding the steering
direction. The value of each output is some real number between 0 and 1, which in this
case corresponds to the confidence in predicting the corresponding steering direction.
3. The training examples may contain errors. ANN learning methods are quite robust to
noise in the training data.
4. Long training times are acceptable. Network training algorithms typically require longer
training times than, say, decision tree learning algorithms. Training times can range
from a few seconds to many hours, depending on factors such as the number of
weights in the network, the number of training examples considered, and the settings of
various learning algorithm parameters.
5. Fast evaluation of the learned target function may be required. Although ANN learning
times are relatively long, evaluating the learned network, in order to apply it to a
subsequent instance, is typically very fast. For example, ALVINN applies its neural
network several times per second to continually update its steering command as the
vehicle drives forward.
6. The ability of humans to understand the learned target function is not important. The
weights learned by neural networks are often difficult for humans to interpret. Learned
neural networks are less easily communicated to humans than learned rules.

PERCEPTRONS:

A perceptron takes a vector of real-valued inputs, calculates a linear combination of these
inputs, and outputs 1 if the result is greater than some threshold and -1 otherwise. More
precisely, given inputs x1 through xn, the output is

o(x1, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise,

where each wi is a real-valued constant, or weight, that determines the contribution of input xi
to the perceptron output.

Representational Power of Perceptrons:


We can view the perceptron as representing a hyperplane decision surface in the n-dimensional
space of instances (i.e., points). The perceptron outputs a 1 for instances lying on one side of
the hyperplane and outputs a -1 for instances lying on the other side, as illustrated in Figure
4.3.
The equation for this decision hyperplane is w · x = 0, i.e., w0 + w1x1 + ... + wnxn = 0
(taking x0 = 1). Of course, some sets of positive and negative examples cannot be separated by
any hyperplane. Those that can be separated are called linearly separable sets of examples.
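
As a concrete illustration of this decision surface, the sketch below implements a perceptron
that computes the boolean AND of two 0/1 inputs, a linearly separable function; the particular
weights (w0 = -0.8, w1 = w2 = 0.5) are one choice that works.

import numpy as np

def perceptron_output(w, x):
    # x carries a leading 1 so that w[0] plays the role of the threshold weight w0.
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([-0.8, 0.5, 0.5])     # weights realising AND: output 1 only if x1 = x2 = 1
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron_output(w, np.array([1, x1, x2])))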

The Perceptron Training Rule:


The precise learning problem is to determine a weight vector that causes the perceptron to
produce the correct +1 or -1 output for each of the given training examples. Several algorithms
are known to solve this learning problem. Here we consider two: the perceptron rule and the
delta rule. These two algorithms are guaranteed to converge to somewhat different acceptable
hypotheses, under somewhat different conditions.
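
For reference, the standard perceptron training rule updates each weight by
wi <- wi + η(t − o)xi, where t is the target output, o is the current perceptron output, and η is
a small learning rate. A minimal sketch, reusing the AND data from the previous example with an
illustrative learning rate, is:

import numpy as np

def perceptron_output(w, x):
    return 1 if np.dot(w, x) > 0 else -1

# Training examples for boolean AND, each with a leading 1 for the threshold
# weight, and targets encoded as +1 / -1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
T = np.array([-1, -1, -1, 1])

eta = 0.1                              # learning rate (illustrative)
w = np.zeros(3)
for epoch in range(20):                # converges because the data are linearly separable
    for x, t in zip(X, T):
        o = perceptron_output(w, x)
        w = w + eta * (t - o) * x      # perceptron rule: w_i <- w_i + eta * (t - o) * x_i
print(w)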
GRADIENT DESCENT AND THE DELTA RULE:

Although the perceptron rule finds a successful weight vector when the training examples are
linearly separable, it can fail to converge if the examples are not linearly separable. A second
training rule, called the delta rule, is designed to overcome this difficulty.

The key idea behind the delta rule is to use gradient descent to search the hypothesis space of
possible weight vectors to find the weights that best fit the training examples. This rule is
important because gradient descent provides the basis for the BACKPROPAGATION algorithm,
which can learn networks with many interconnected units.

Gradient Descent rule and Stochastic Gradient Descent:


Gradient Descent is an iterative optimization algorithm that adjusts the model parameters so as
to minimize a cost function over the training data. For a convex cost function (such as the
squared error of a linear unit), it is guaranteed to find the global minimum, provided it is given
enough time and the learning rate is not too high. Two important variants of Gradient Descent,
widely used both in linear regression and in neural networks, are Batch Gradient Descent and
Stochastic Gradient Descent (SGD).

Batch Gradient Descent vs. Stochastic Gradient Descent (a code sketch of both update schemes
follows this list):

 Gradient computation: Batch GD computes the gradient over the whole training set; SGD
computes it from a single training sample at a time.
 Cost: Batch GD is slow and computationally expensive per update; SGD is faster and less
computationally expensive.
 Scale: Batch GD is not suggested for huge training sets; SGD can be used for large
training sets.
 Nature: Batch GD is deterministic; SGD is stochastic.
 Solution quality: Batch GD gives the optimal solution given sufficient time to converge;
SGD gives a good, but not necessarily optimal, solution.
 Shuffling: Batch GD requires no random shuffling of points; for SGD the samples should
arrive in random order, which is why the training set is shuffled at every epoch.
 Local minima: Batch GD cannot escape shallow local minima easily; SGD can escape them
more easily.
 Convergence speed: Batch GD converges slowly; SGD reaches the neighbourhood of the
minimum much faster.
 Update frequency: Batch GD updates the parameters only after processing the entire
training set; SGD updates them after each individual data point.
 Learning rate: Batch GD is typically run with a fixed learning rate; with SGD the
learning rate is often adjusted (decayed) during training.
 Convergence behaviour: For convex loss functions Batch GD converges to the global
minimum; SGD may oscillate around it or, for non-convex losses, settle near a local
minimum or saddle point.
 Overfitting: Batch GD may overfit if the model is too complex for the dataset; the
noisier, more frequent updates of SGD can help reduce overfitting.
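
The following sketch contrasts the two update schemes for a single linear unit trained with the
squared-error (delta) rule; the synthetic data, number of epochs, and learning rate are
illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                    # illustrative inputs
t = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)  # noisy linear targets
eta = 0.005                                                      # learning rate

# Batch gradient descent: one weight update per pass over the whole training set.
w_batch = np.zeros(3)
for epoch in range(200):
    o = X @ w_batch
    w_batch += eta * X.T @ (t - o)         # delta rule summed over all examples

# Stochastic gradient descent: one update per example, shuffled every epoch.
w_sgd = np.zeros(3)
for epoch in range(200):
    for i in rng.permutation(len(X)):
        o = X[i] @ w_sgd
        w_sgd += eta * (t[i] - o) * X[i]   # incremental (stochastic) delta rule

print(w_batch, w_sgd)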
DERIVATION OF THE GRADIENT DESCENT RULE:
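
In outline, for a linear unit whose output on training example d is o_d = w0 + w1 x_1d + ... + wn x_nd,
the standard textbook derivation proceeds as follows (t_d is the target output for example d,
x_id the i-th input of example d, and η the learning rate):

The training error is E(w) = ½ Σ_d (t_d − o_d)².
The gradient of E with respect to the weights is ∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn].
Differentiating term by term gives ∂E/∂wi = Σ_d (t_d − o_d)(−x_id).
Moving the weights in the direction of steepest decrease then gives the gradient descent (delta)
rule: Δwi = −η ∂E/∂wi = η Σ_d (t_d − o_d) x_id.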
MULTILAYER NETWORKS AND THE BACKPROPAGATION ALGORITHM:

Single perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer
networks learned by the BACKPROPAGATION algorithm are capable of expressing a rich
variety of nonlinear decision surfaces.

For example, a typical multilayer network and decision surface is depicted in Figure 4.5. Here
the speech recognition task involves distinguishing among 10 possible vowels, all spoken in the
context of "h-d" (i.e., "hid," "had," "head," "hood," etc.). The input speech signal is represented
by two numerical parameters obtained from a spectral analysis of the sound, allowing us to
easily visualize the decision surface over the two-dimensional instance space. As shown
in the figure, it is possible for the multilayer network to represent highly nonlinear
decision surfaces that are much more expressive than the linear decision surfaces
of single units.
The BACKPROPAGATION Algorithm:

The BACKPROPAGATION algorithm learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections. It employs gradient descent to attempt
to minimize the squared error between the network output values and the target values for
these outputs.
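
A minimal sketch of stochastic backpropagation for a network with one hidden layer of sigmoid
units follows; the layer sizes, example data, number of iterations, and learning rate are
illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2                     # illustrative layer sizes
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input-to-hidden weights
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))  # hidden-to-output weights
eta = 0.05                                          # learning rate

def backprop_step(x, t):
    global W1, W2
    # Forward pass.
    h = sigmoid(W1 @ x)
    o = sigmoid(W2 @ h)
    # Error terms for the output and hidden units.
    delta_o = o * (1 - o) * (t - o)
    delta_h = h * (1 - h) * (W2.T @ delta_o)
    # Gradient descent weight updates.
    W2 += eta * np.outer(delta_o, h)
    W1 += eta * np.outer(delta_h, x)
    return 0.5 * np.sum((t - o) ** 2)               # squared error on this example

x = np.array([0.5, -1.0, 2.0])
t = np.array([1.0, 0.0])
for step in range(1000):
    err = backprop_step(x, t)
print(err)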
Derivation of the Backpropagation Algorithm:
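
In outline, the standard derivation for a feedforward network of sigmoid units yields the
following update equations (x_ji is the i-th input to unit j, w_ji the corresponding weight, o_j
the output of unit j, t_k the target value of output unit k, and η the learning rate):

For each training example the error is E = ½ Σ_k (t_k − o_k)², summed over the output units.
Applying the chain rule through the sigmoid gives, for each output unit k, the error term
δ_k = o_k (1 − o_k)(t_k − o_k).
For each hidden unit h, the error term is obtained by propagating the output error terms
backward through the weights: δ_h = o_h (1 − o_h) Σ_k w_kh δ_k.
Each network weight is then updated by Δw_ji = η δ_j x_ji.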
Convergence and Local Minima:

 The BACKPROPAGATION algorithm implements a gradient descent search through the
space of possible network weights, iteratively reducing the error E between the training
example target values and the network outputs.
 Because the error surface for multilayer networks may contain many different local
minima, gradient descent can become trapped in any of these. As a result,
BACKPROPAGATION over multilayer networks is only guaranteed to converge toward
some local minimum in E and not necessarily to the global minimum error.
 Despite the lack of assured convergence to the global minimum error,
BACKPROPAGATION is a highly effective function approximation method in practice.

Generalization:
 The performance of an Artificial Neural Network (ANN) depends largely on its
generalization capability, that is, its ability to handle unseen data. The generalization
capability of the network is determined mainly by the complexity of the system and by
how the network is trained.
 The termination condition for the Backpropagation algorithm has been left unspecified.
What is an appropriate condition for terminating the weight-update loop?
 One obvious choice is to continue training until the error E on the training examples
falls below some predetermined threshold. In fact, this is a poor strategy, because
BACKPROPAGATION is susceptible to overfitting the training examples at the cost
of decreasing generalization accuracy over other, unseen examples (a common remedy,
early stopping with a validation set, is sketched below).
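
The usual remedy is early stopping: hold part of the data out as a validation set, monitor the
validation error after each training epoch, and keep the weights that achieved the lowest
validation error. A minimal sketch of this stopping criterion, using a simple linear unit trained
with the stochastic delta rule (the data, learning rate, and patience threshold are illustrative
assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
t = X @ rng.normal(size=5) + 0.3 * rng.normal(size=100)
X_train, t_train = X[:80], t[:80]          # training split
X_val, t_val = X[80:], t[80:]              # held-out validation split

def mse(w, X, t):
    return np.mean((t - X @ w) ** 2)

eta = 0.005
w = np.zeros(5)
best_w, best_val, patience = w.copy(), np.inf, 0
for epoch in range(500):
    for i in rng.permutation(len(X_train)):                    # one SGD epoch
        w += eta * (t_train[i] - X_train[i] @ w) * X_train[i]
    val_err = mse(w, X_val, t_val)
    if val_err < best_val:                                     # remember the best weights so far
        best_val, best_w, patience = val_err, w.copy(), 0
    else:
        patience += 1
        if patience >= 10:                                     # stop once validation error stops improving
            break
w = best_w                                                     # restore the weights with lowest validation error
print("stopped after epoch", epoch, "validation MSE", best_val)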
