Machine Learning-4

machine learning chapter 4

Uploaded by

venu62
Copyright
© © All Rights Reserved

UNIT-IV

Unit IV: Support Vector Machines (SVM): Introduction, Linear Discriminant Functions for Binary
Classification, Perceptron Algorithm, Large Margin Classifier for linearly separable data, Linear Soft Margin
Classifier for Overlapping Classes, Kernel Induced Feature Spaces, Nonlinear Classifier, and Regression by
Support vector Machines.
Learning with Neural Networks: Towards Cognitive Machine, Neuron Models, Network Architectures,
Perceptrons, Linear neuron and the Widrow-Hoff Learning Rule, The error correction delta rule.

4. Support Vector Machines:


4.1 Introduction:
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional
space (N — the number of features) that distinctly classifies the data points.

To separate the two classes of data points, there are many possible hyperplanes that could be chosen.
Our objective is to find the plane that has the maximum margin, i.e. the maximum distance to the nearest data
points of both classes. Maximizing the margin provides some robustness, so that future data points can be
classified with more confidence.
Hyperplanes and Support Vectors:

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side
of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon
the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of
input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
Support vectors are the data points that are closest to the hyperplane and influence the position and orientation
of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support
vectors will change the position of the hyperplane. These are the points that help us build our SVM.

4.2 Linear Discriminant functions for Binary Classification:


A classification model is typically defined using discriminant functions:
• For each class i, define a function gi(x) mapping X → R.
• When a decision on input x must be made, choose the class with the highest value of gi(x):
Class = arg maxi gi(x)
• This works for both binary and multi-class classification.
• Assume a binary classification problem with classes 0 and 1.
• The discriminant functions are g0(x) and g1(x); x is assigned to class 1 when g1(x) > g0(x), and to class 0 otherwise.
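The arg max rule above can be sketched in a few lines of Python. The two linear discriminants g0 and g1 below, and their coefficients, are hypothetical and chosen only for illustration; they do not come from the notes.

```python
def g0(x):
    # Hypothetical linear discriminant for class 0 (coefficients are made up).
    return -1.0 * x[0] - 1.0 * x[1] + 1.0


def g1(x):
    # Hypothetical linear discriminant for class 1 (coefficients are made up).
    return 1.0 * x[0] + 1.0 * x[1] - 1.0


def classify(x, discriminants):
    # Evaluate every discriminant and return the index of the largest score.
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])


label_far = classify((2.0, 2.0), [g0, g1])   # g1 scores higher here
label_near = classify((0.0, 0.0), [g0, g1])  # g0 scores higher here
```

Because the rule only compares scores, the same classify function works unchanged if more discriminants are added for a multi-class problem.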
4.3 Perceptron Algorithm:
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. It is a
type of linear classifier, i.e. a classification algorithm that makes all of its predictions based on a linear
predictor function combining a set of weights with the feature vector.
A linear classifier separates the training data into categories with a linear decision boundary; when there
are two categories, every training example must belong to one of these two categories.
A binary classifier is one in which there are exactly two categories. The basic perceptron algorithm is
used for binary classification, and every training example must lie in one of the two categories. The name
refers to the model's basic unit, an artificial neuron that is called the perceptron.
Following are the major components of a perceptron:
o Input: All the features become the input for a perceptron. We denote the input of a perceptron
by [x1, x2, x3, ..., xn], where x represents the feature value and n represents the total number of
features. There is also a special kind of input called the bias. In the image, the value of the
bias is denoted by w0.
o Weights: The values that are computed while training the model. We start the weights at some
initial value, and these values are updated for each training error. We represent the weights
of the perceptron by [w1, w2, w3, ..., wn].
o Bias: A bias neuron allows a classifier to shift the decision boundary left or right. In algebraic
terms, the bias neuron allows a classifier to translate its decision boundary. It aims to "move
every point a constant distance in a specified direction." Bias helps to train the model faster and
with better quality.
o Weighted summation: The weighted summation is the sum of the values obtained by multiplying
each weight wi by its associated feature value xi. We represent the weighted summation by
∑wixi for i = 1 to n.
o Step/activation function: In general, the role of activation functions is to make neural networks
nonlinear. In the perceptron, a step function thresholds the weighted summation, so the decision
boundary itself stays linear, as linear classification requires.
o Output: The weighted summation is passed to the step/activation function, and the value
obtained after this computation is the predicted output.
Inside the Perceptron:

Description:
• First, the features of an example are given as input to the perceptron.
• These input features are multiplied by the corresponding weights (starting with their initial values).
• The products of each feature and its corresponding weight are summed.
• The bias is added to the value of the summation.
• The step/activation function is applied to the new value.
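The steps above, together with the rule that weights are updated on each training error, can be sketched as follows. This is a minimal illustration under stated assumptions: the step function returns 1 for non-negative sums, the weights start at zero, the learning rate is 1.0, and the logical AND data set is chosen only because it is linearly separable. None of these choices come from the notes.

```python
def step(z):
    # Step activation: fire (1) when the summed input is non-negative.
    return 1 if z >= 0 else 0


def predict(weights, bias, x):
    # Weighted summation of the features, plus the bias, passed through the step.
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)


def train_perceptron(samples, labels, lr=1.0, epochs=50):
    # Assumed initial values: all weights and the bias start at zero.
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                # Update only when the example is misclassified.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                errors += 1
        if errors == 0:
            break  # every training example is classified correctly
    return weights, bias


# Illustrative, linearly separable data: logical AND of two binary features.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

On separable data such as this, the loop stops as soon as a full pass produces no errors; on data that is not linearly separable it simply runs for the full number of epochs.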

4.4 Linear Maximal Margin Classifier for Linearly Separable Data:


The support vector machine is a generalization of a classifier called maximal margin classifier. The
maximal margin classifier is simple, but it cannot be applied to the majority of datasets, since the classes must
be separated by a linear boundary.
That is why the support vector classifier was introduced as an extension of the maximal margin
classifier, which can be applied in a broader range of cases. Finally, the support vector machine is simply a
further extension of the support vector classifier to accommodate non-linear class boundaries. It can be used
for both binary and multiclass classification. This method relies on separating classes using a hyperplane.
What is a hyperplane?
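In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1. Visually, in a 2D
space, the hyperplane will be a line, and in a 3D space, it will be a flat plane. Mathematically, the hyperplane
is simply:

β0 + β1X1 + β2X2 + ... + βpXp = 0

where β0, ..., βp are the coefficients of the hyperplane. If X = (X1, ..., Xp) satisfies the equation above, then
the point lies on the plane. Otherwise, the left-hand side is either positive or negative, and the point lies on
the corresponding side of the plane, as shown below.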

In general, if the data can be perfectly separated using a hyperplane, then there is an infinite number of
hyperplanes, since they can be shifted up or down, or slightly rotated without coming into contact with an
observation.
That is why we use the maximal margin hyperplane, or optimal separating hyperplane, which is the
separating hyperplane that is farthest from the observations. Given a hyperplane, we calculate the
perpendicular distance from each training observation to it; the smallest of these distances is known as the
margin. Hence, the optimal separating hyperplane is the one with the largest margin.

As you can see above, there are three points that are equidistant from the hyperplane. Those observations are
known as support vectors, because if their position shifts, the hyperplane shifts as well. Interestingly, this
means that the hyperplane depends only on the support vectors, and not on any other observations.
What if no separating plane exists?

In this case, there is no maximal margin classifier. Instead, we use a support vector classifier, which can
almost separate the classes using a soft margin.
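To make the idea of a margin concrete, the sketch below computes the perpendicular distance from each point to a hyperplane w·x + b = 0 and takes the smallest one as the margin. The four points and the two candidate hyperplanes are hypothetical; both candidates separate the two groups, and the comparison shows which has the larger margin. It does not solve the optimization problem that finds the best hyperplane.

```python
import math


def distance_to_hyperplane(w, b, x):
    # Perpendicular distance from point x to the hyperplane w.x + b = 0.
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    # The margin is the smallest perpendicular distance over all points.
    return min(distance_to_hyperplane(w, b, p) for p in points)


# Hypothetical data: the first two points form one class, the last two the other.
points = [(1.0, 1.0), (2.0, 2.5), (4.0, 4.0), (5.0, 3.5)]

# Two hypothetical separating hyperplanes (lines, since there are 2 features).
margin_diagonal = margin((1.0, 1.0), -6.25, points)  # x1 + x2 = 6.25
margin_vertical = margin((1.0, 0.0), -3.0, points)   # x1 = 3

better = "diagonal" if margin_diagonal > margin_vertical else "vertical"
```

The points that attain the minimum distance are the ones that would act as support vectors for that candidate hyperplane.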

Support Vector Classifier


Consider the following situation:

Here, a separating hyperplane simply does not exist, hence we need to define another criterion to find one. The
idea is to relax the assumption that the hyperplane has to segregate all the observations well, and instead
segregate most of them. By doing so, we can allow, with different degrees of ‘softness’, some observations to
be on the wrong side of the margin and even on the wrong side of the plane (and so to be misclassified).
In this case, the support vectors are those observations lying on the margin or on the wrong side of it,
including any that are misclassified.

4.5 Linear Soft Margin Classifier for Overlapping Classes:


The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is
often called the soft margin classifier. This change allows some points in the training data to violate the
separating line.
An additional set of coefficients is introduced that gives the margin wiggle room, one coefficient for each
training instance. These coefficients are sometimes called slack variables. This increases the complexity of the
model, as there are more parameters for the model to fit to the data.
A tuning parameter, called simply C, is introduced that defines the total amount of wiggle allowed across
all training instances. The C parameter defines the amount of violation of the margin allowed. C = 0 means no
violation, and we are back to the inflexible Maximal-Margin Classifier described above. The larger the value
of C, the more violations of the hyperplane are permitted. (Some software libraries define C the other way
round, as a penalty on violations, so that a larger C permits fewer of them.)
During the learning of the hyperplane from data, all training instances that lie within the distance of the
margin affect the placement of the hyperplane and are referred to as support vectors. Because C affects the
number of instances that are allowed to fall within the margin, C influences the number of support vectors
used by the model.
• The smaller the value of C, the more sensitive the algorithm is to the training data (higher variance and
lower bias).
• The larger the value of C, the less sensitive the algorithm is to the training data (lower variance and
higher bias).
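The slack variables described in this section can be illustrated with a small calculation. The sketch assumes the common convention, not stated in the notes, that labels are +1 or -1 and that the hyperplane is scaled so the margin boundaries sit at w·x + b = +1 and -1. Under that assumption the slack of a point is max(0, 1 - y(w·x + b)). The hyperplane and points are hypothetical.

```python
def slack(w, b, x, y):
    # Slack of one training point; y must be +1 or -1 (assumed convention).
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


# Hypothetical hyperplane x1 = 0, with the positive class on the right.
w, b = (1.0, 0.0), 0.0

outside = slack(w, b, (2.0, 0.0), +1)        # beyond the margin: no violation
inside = slack(w, b, (0.5, 0.0), +1)         # inside the margin, correct side
misclassified = slack(w, b, (-1.0, 0.0), +1)  # wrong side of the hyperplane

# The total slack is the amount of violation that the budget C has to cover.
total_violation = outside + inside + misclassified
```

A slack of zero means the point respects the margin, a value between 0 and 1 means it lies inside the margin but on the correct side, and a value above 1 means it is misclassified.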

4.6 Nonlinear Classifier:


For a linearly separable dataset having n features (thereby needing n dimensions for representation), a
hyperplane is basically an (n – 1) dimensional subspace used for separating the dataset into two sets, each set
containing data points belonging to a different class. For example, for a dataset having two features X and Y
(therefore lying in a 2-dimensional space), the separating hyperplane is a line (a 1-dimensional subspace).
Similarly, for a dataset having 3-dimensions, we have a 2-dimensional separating hyperplane, and so on.
In machine learning, the Support Vector Machine (SVM) is a non-probabilistic, linear, binary classifier used for
classifying data by learning a hyperplane separating the data.
Classifying a non-linearly separable dataset using an SVM (a linear classifier):
As mentioned above, an SVM is a linear classifier which learns an (n – 1)-dimensional hyperplane for
classification of data into two classes. However, it can also be used for classifying a non-linear dataset. This
can be done by projecting the dataset into a higher dimension in which it is linearly separable.

The dataset is clearly a non-linear dataset and consists of two features (say, X and Y).
In order to use SVM for classifying this data, introduce another feature Z = X² + Y² into the dataset,
thus projecting the 2-dimensional data into 3-dimensional space. The first dimension represents the feature
X, the second represents Y, and the third represents Z (which, mathematically, is equal to the squared radius of
the circle, centred at the origin, on which the point (x, y) lies). Now, clearly, for the data shown above, the
‘yellow’ data points belong to a circle of smaller radius and the ‘purple’ data points belong to a circle of
larger radius. Thus, the data becomes linearly separable along the Z-axis.

Now, we can use SVM (or, for that matter, any other linear classifier) to learn a 2-dimensional separating
hyperplane. This is how the hyperplane would look:

Thus, using a linear classifier we can separate a non-linearly separable dataset.
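The projection described above can be sketched directly. The points below are hypothetical stand-ins for the two rings of data, and the threshold Z = 2.0 is picked by hand for illustration rather than learned by an SVM.

```python
def lift(point):
    # Add the third feature Z = X^2 + Y^2 to a 2-dimensional point.
    x, y = point
    return (x, y, x * x + y * y)


# Hypothetical data: an inner ring (radius 0.5) and an outer ring (radius 2).
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.2, 1.6)]

# In the lifted space the plane Z = 2.0 separates the rings (hand-picked value).
THRESHOLD = 2.0


def classify(point):
    # Linear rule in 3 dimensions: compare the Z coordinate to the plane.
    return 1 if lift(point)[2] > THRESHOLD else 0


inner_labels = [classify(p) for p in inner]
outer_labels = [classify(p) for p in outer]
```

No straight line in the original (X, Y) plane separates the two rings, but a flat plane in (X, Y, Z) does, which is the point the section makes.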


A brief introduction to kernels in machine learning:
In machine learning, a trick known as the “kernel trick” is used to learn a linear classifier that classifies a
non-linear dataset. It treats the linearly inseparable data as if it had been projected into a higher dimension
in which it becomes linearly separable. A kernel function takes a pair of data instances and returns their inner
product in that higher dimensional space, so the mapping of the original data points never has to be computed
explicitly.
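A small numerical check shows what a kernel does. The example uses the degree-2 polynomial kernel k(a, b) = (a·b)², which is one standard kernel and is chosen here only for illustration. For 2-dimensional inputs it equals the ordinary inner product after the explicit mapping φ(x) = (x1², √2·x1·x2, x2²) into 3 dimensions.

```python
import math


def dot(a, b):
    # Ordinary inner product of two vectors.
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(v):
    # Explicit feature map into 3 dimensions for the degree-2 polynomial kernel.
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(a, b):
    # Same quantity computed in the original 2-dimensional space.
    return dot(a, b) ** 2


a, b = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(a, b)       # works in 2 dimensions only
via_mapping = dot(phi(a), phi(b))    # maps to 3 dimensions first
```

Both routes give the same number, but the kernel never builds the higher dimensional vectors, which matters when the mapped space is very large.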

4.7 Regression by Support vector Machines:


Support Vector Regression (SVR) is quite different from other regression models. It uses the Support
Vector Machine (SVM, a classification algorithm) to predict a continuous variable. While other
linear regression models try to minimize the error between the predicted and the actual value, Support Vector
Regression tries to fit the best line within a predefined threshold error value. SVR in effect sorts the
candidate prediction lines into two types: those that stay within the error boundary and those that do not.
Lines that do not stay within the error boundary are not considered, because the difference between the
predicted value and the actual value has exceeded the error threshold. The lines that do are considered as
candidates for predicting the value of an unknown. The following illustration will help you to grasp this concept.
To understand the above image, you first need to learn some important definitions.

1. Kernel: A kernel is a function that is used to map lower dimensional data points into higher
dimensional data points. As SVR performs linear regression in a higher dimension, this function is
crucial. There are many types of kernel, such as the Polynomial Kernel, Gaussian Kernel, and Sigmoid
Kernel.
2. Hyper Plane: In a Support Vector Machine, a hyper plane is a line used to separate two data classes in a
higher dimension than the actual dimension. In SVR, the hyper plane is the line that is used to predict the
continuous value.
3. Boundary Line: The two parallel lines drawn on either side of the hyper plane, at a distance equal to the
error threshold value, are known as boundary lines. These lines create a margin between the data points.
4. Support Vector: The data points that lie closest to the boundary lines, i.e. those whose distance from the
boundary is the minimum.

To perform an SVR, you must do the following steps:

• Collect a training data set
• Choose a kernel and its parameters, as well as any regularization needed
• Form the correlation matrix, K
• Train your machine, exactly or approximately, to get the contraction coefficients, α = {αi}
• Use these coefficients to create your estimator, f(X, α, x*) = y*
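The error threshold that SVR works with is usually expressed as an epsilon-insensitive loss: errors smaller than the threshold cost nothing, and larger errors are charged only for the part that exceeds it. The sketch below illustrates that loss alone, not the full training procedure listed above; the name epsilon and the numeric values are assumptions made for the example.

```python
def eps_insensitive_loss(y_true, y_pred, epsilon):
    # Zero inside the tube of half-width epsilon, linear outside it.
    return max(0.0, abs(y_true - y_pred) - epsilon)


EPSILON = 0.5  # hypothetical error threshold

inside_tube = eps_insensitive_loss(3.0, 3.2, EPSILON)   # error 0.2, tolerated
outside_tube = eps_insensitive_loss(3.0, 4.0, EPSILON)  # error 1.0, excess 0.5
```

Points whose error stays inside the tube do not influence the fitted line, which is why only the points on or outside the boundary lines end up as support vectors.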
Learning with Neural Networks

4.8 Neural Networks Towards Cognitive Machines:


Technology and the brain are very closely related these days. Modern computer applications take into
account the features of human brains (in marketing, for example), and human brains take into account the
features of technologies.
Basically, a neuron is just a node with many inputs and one output. A neural network consists of many
interconnected neurons. In fact, it is a “simple” device that receives data at the input and provides a response.
First, the neural network learns to correlate incoming and outgoing signals with each other; this is called
learning. Then the neural network begins to work: it receives input data and generates output signals based
on the accumulated knowledge.
Most likely, the initial evolutionary task of a neural network in nature was to separate the signal from
noise. “Noise” is random and difficult to build into a pattern. A “signal” is a surge (electrical, mechanical,
molecular), something that is already by no means random. Now, neural systems in technology (that is —
along with biological) have already learned not only how to isolate a signal from noise, but also to create new
levels of abstraction in identifying different states of the world around. That is, not just to take into account
the factors designated by the programmers, but to identify these factors by themselves.
Currently, there are two areas of study of neural networks.
1. Creation of computer models that faithfully repeat the functioning models of neurons of the real brain.
This makes it possible both to explain the mechanisms of real brain operation and to learn the
diagnosis/treatment of diseases and injuries of the central nervous system better. In ordinary life, for
example, it allows us to learn more about what a person prefers (by collecting and analyzing data), to
get closer to the human by creating more personalized interfaces, etc.
2. Creation of computer models that abstractly repeat the functioning models of neurons of the real brain.
This makes it possible to use all the advantages of the real brain, such as noise immunity and energy
efficiency, in the analysis of large amounts of data. Here, for example, deep learning is gaining
popularity.
Like the human brain, neural networks consist of a large number of related elements that mimic neurons. Deep
neural networks are based on such algorithms, due to which computers learn from their own experience,
forming in the learning process multi-level, hierarchical ideas about the world.
Deep learning developers always take into account the human brain features — construction of its
neural networks, learning and memory processes, etc, trying to use the principles of their work and modeling
the structure of billions of interconnected neurons. As a result of this, Deep learning is a step-by-step process
similar to a human’s learning process. To do this, it is necessary to provide a neural network with a huge
amount of data to train the system to classify data clearly and accurately.
In fact, the network receives a series of impulses as inputs and gives outputs, just like the human
brain. At each moment, each neuron has a certain value (analogous to the electric potential of biological
neurons) and, if this value exceeds the threshold, the neuron sends a single impulse, and its value drops to a
level below the average for 2–30 ms (an analog of the recovery process in biological neurons, the so-called
refractory period). When out of equilibrium, the potential of the neuron smoothly tends back to the
average value.
In general, deep learning is very similar to the process of human learning and has a phased process of
abstraction. Each layer will have a different “weighting”, and this weighting reflects what was known about
the components of the images. The higher the layer level, the more specific the components are. Like the
human brain, the source signal in deep learning passes through processing layers; further, it takes a partial
understanding (shallow) to a general abstraction (deep), where it can perceive the object.
An important part of creating and training neural networks is also the understanding and application of
cognitive science. This is a sphere that studies the mind and the processes in it, combining the elements of
philosophy, psychology, linguistics, anthropology, and neurobiology. Many scientists believe that the creation
of artificial intelligence is just another way of applying cognitive science, demonstrating how human thinking
can be modelled in machines. A striking example of cognitive science is the Kahneman decision-making
model, which describes how a person makes a choice at any given moment, consciously or not.

4.9 Neuron Model:


4.9.1 Biological Neuron:
A biological neuron model, also known as a spiking neuron model, is a mathematical description of
the properties of certain cells in the nervous system that generate sharp electrical potentials across their cell
membrane, roughly one millisecond in duration. Spiking neurons are known to be a major signalling unit of
the nervous system, and for this reason characterizing their operation is of great importance. It is worth noting
that not all the cells of the nervous system produce the type of spike that defines the scope of the spiking
neuron models. For example, cochlear hair cells, retinal receptor cells, and retinal bipolar cells do not spike.
Furthermore, many cells in the nervous system are not classified as neurons but instead are classified as glia.
Ultimately, biological neuron models aim to explain the mechanisms underlying the operation of the
nervous system for the purpose of restoring lost control capabilities such as perception (e.g. deafness or
blindness), decision making, and continuous limb control. In that sense, biological neuron models differ
from artificial neuron models that do not presume to predict the outcomes of experiments involving the
biological neural tissue (although artificial neuron models are also concerned with execution of perception and
estimation tasks). Accordingly, an important aspect of biological neuron models is experimental validation,
and the use of physical units to describe the experimental procedure associated with the model predictions.
Neuron models can be divided into two categories according to the physical units of the interface of the
model. Each category could be further divided according to the abstraction/detail level:
1. Electrical input–output membrane voltage models – These models produce a prediction for
membrane output voltage as function of electrical stimulation at the input stage (either voltage or
current). The various models in this category differ in the exact functional relationship between the
input current and the output voltage and in the level of details. Some models in this category are black
box models and distinguish only between two measured voltage levels: the presence of a spike (also
known as "action potential") or a quiescent state. Other models are more detailed and account for sub-
cellular processes.
2. Natural or pharmacological input neuron models – The models in this category connect
the input stimulus, which can be either pharmacological or natural, to the probability of a spike event.
The input stage of these models is not electrical, but rather has either pharmacological (chemical)
concentration units, or physical units that characterize an external stimulus such as light, sound or
other forms of physical pressure. Furthermore, the output stage represents the probability of a spike
event and not an electrical voltage. Typically, this output probability is normalized (divided) by a time
constant, and the resulting normalized probability is called the "firing rate" and has units of Hertz. The
probabilistic description taken by the models in this category was inspired from laboratory
experiments involving either natural or pharmacological stimulation which exhibit variability in the
resulting spike pattern. Nevertheless, when averaging these experimental results across several trials, a
clear pattern is often revealed.

4.9.2 Artificial Neuron:


An artificial neuron is a mathematical function conceived as a model of a biological neuron.
Artificial neurons are the elementary units of an artificial neural network. The artificial neuron receives
one or more inputs (representing excitatory postsynaptic potentials and inhibitory postsynaptic potentials at
neural dendrites) and sums them to produce an output (or activation, representing a neuron's action
potential which is transmitted along its axon). Usually each input is separately weighted, and the sum is
passed through a non-linear function known as an activation function or transfer function. The transfer
functions usually have a sigmoid shape, but they may also take the form of other non-linear
functions, piecewise linear functions, or step functions. They are also often monotonically
increasing, continuous, differentiable and bounded. The thresholding function has inspired building logic
gates referred to as threshold logic; applicable to building logic circuits resembling brain processing. For
example, new devices such as memristors have been extensively used to develop such logic in recent times.
For a given artificial neuron k, let there be m + 1 inputs with signals x0 through xm and weights wk0 through wkm.
Usually, the x0 input is assigned the value +1, which makes it a bias input with wk0 = bk. This leaves
only m actual inputs to the neuron: from x1 to xm.
The output of the kth neuron is:

yk = φ( ∑ wkj xj ), with the sum taken over j = 0 to m

where φ (phi) is the transfer function (commonly a threshold function).

The output is analogous to the axon of a biological neuron, and its value propagates to the input of the next
layer, through a synapse. It may also exit the system, possibly as part of an output vector. The neuron has no
learning process as such: its transfer function weights are calculated and its threshold value is predetermined.
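The output of a single artificial neuron, as described above, can be sketched as a weighted sum over the inputs with x0 = +1 carrying the bias, passed through a transfer function. The threshold transfer function and the numeric weights below are illustrative assumptions.

```python
def threshold(v):
    # Threshold transfer function: 1 when the summed input is non-negative.
    return 1 if v >= 0 else 0


def neuron_output(inputs, weights, bias, phi):
    # Prepend x0 = +1 so that the bias acts as the weight w_k0.
    xs = [1.0] + list(inputs)
    ws = [bias] + list(weights)
    return phi(sum(w * x for w, x in zip(ws, xs)))


# Hypothetical neuron with two real inputs, weights 0.5 each and bias -0.7.
only_one_active = neuron_output([1.0, 0.0], [0.5, 0.5], -0.7, threshold)
both_active = neuron_output([1.0, 1.0], [0.5, 0.5], -0.7, threshold)
```

With these particular values the neuron fires only when both inputs are active, so it behaves like a logical AND gate, in the spirit of the threshold logic mentioned earlier.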

4.10 Neural Network Architectures:


Humans and other animals process information with neural networks. These are formed
from billions of neurons (nerve cells) exchanging brief electrical pulses called action potentials. Computer
algorithms that mimic these biological structures are formally called artificial neural networks to distinguish
them from the squishy things inside of animals. However, most scientists and engineers are not this formal
and use the term neural network to include both biological and nonbiological systems.
Neural network research is motivated by two desires: to obtain a better understanding of the human brain, and
to develop computers that can deal with abstract and poorly defined problems. For example, conventional
computers have trouble understanding speech and recognizing people's faces. In comparison, humans do
extremely well at these tasks.
Many different neural network structures have been tried, some based on imitating what a biologist sees under
the microscope, some based on a more mathematical analysis of the problem. The structure considered here is
formed in three layers, called the input layer, hidden layer, and output layer. Each layer consists of one or
more nodes, represented in the diagram by small circles. The lines between the nodes indicate the flow of
information from one node to the next. In this particular type of neural network, the information flows only
from the input to the output (that is, from left to right). Other types of neural networks have more intricate
connections, such as feedback paths. The nodes of the input layer are passive, meaning they do not modify
the data. They receive a single value on their input, and duplicate the value to their multiple outputs. In
comparison, the nodes of the hidden and output layers are active, meaning they modify the data. The values
entering the input layer hold the data to be evaluated. For example, they
may be pixel values from an image, samples from an audio signal, stock market prices on successive days, etc.
They may also be the output of some other algorithm, such as the classifiers in our cancer detection example:
diameter, brightness, edge sharpness, etc. Each value from the input layer is duplicated and sent to all of the
hidden nodes. This is called a fully interconnected structure.

The values entering a hidden node are multiplied by weights, a set of predetermined numbers stored in
the program. The weighted inputs are then added to produce a single number. This is shown in the diagram by
the symbol, ∑. Before leaving the node, this number is passed through a nonlinear mathematical function
called a sigmoid. This is an "s" shaped curve that limits the node's output. That is, the input to the sigmoid is a
value between -∞ and +∞, while its output can only be between 0 and 1.
The outputs from the hidden layer are represented in the flow diagram (Fig 26-5) by the variables:
X21,X22,X23 and X24. Just as before, each of these values is duplicated and applied to the next layer. The
active nodes of the output layer combine and modify the data to produce the two output values of this
network, X31 and X32.
Neural networks can have any number of layers, and any number of nodes per layer. Most applications
use the three layer structure with a maximum of a few hundred input nodes. The hidden layer is usually about
10% the size of the input layer. In the case of target detection, the output layer only needs a single node. The
output of this node is thresholded to provide a positive or negative indication of the target's presence or
absence in the input data.
Table 26-1 is a program to carry out the flow diagram of Fig. 26-5. The key point is that this architecture
is very simple and very generalized. This same flow diagram can be used for many problems, regardless of
their particular quirks. The ability of the neural network to provide useful data manipulation lies in the proper
selection of the weights. This is a dramatic departure from conventional information processing where
solutions are described in step-by-step procedures.
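The data flow described in this section (passive input nodes, then weighted sums and a sigmoid in the hidden and output layers) can be sketched as below. This is an independent illustration and not the program from Table 26-1; the layer sizes and all weight values are hypothetical, and the bias node is left out for brevity.

```python
import math


def sigmoid(x):
    # Smooth threshold that maps any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights):
    # Each row of weights belongs to one active node: weighted sum, then sigmoid.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(inputs, hidden_weights, output_weights):
    # Input nodes are passive, so the inputs go straight to the hidden layer.
    hidden = layer(inputs, hidden_weights)
    return layer(hidden, output_weights)


# Hypothetical network: 3 inputs, 4 hidden nodes, 2 output nodes.
hidden_w = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.1], [0.0, 0.1, 0.9]]
output_w = [[1.0, -1.0, 0.5, 0.2], [-0.4, 0.6, 0.3, -0.7]]
outputs = forward([1.0, 0.5, -1.5], hidden_w, output_w)
```

The code stays the same for any problem; only the weight values change, which is the sense in which the text calls the architecture very generalized.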

As an example, imagine a neural network for recognizing objects in a sonar signal. Suppose that 1000 samples
from the signal are stored in a computer. How does the computer determine if these data represent a
submarine, whale, undersea mountain, or nothing at all? Conventional DSP would approach this problem with
mathematics and algorithms, such as correlation and frequency spectrum analysis. With a neural network, the
1000 samples are simply fed into the input layer, resulting in values popping from the output layer. By
selecting the proper weights, the output can be configured to report a wide range of information. For instance,
there might be outputs for: submarine (yes/no), whale (yes/no), undersea mountain (yes/no), etc.
With other weights, the outputs might classify the objects as: metal or non-metal, biological or
nonbiological, enemy or ally, etc. No algorithms, no rules, no procedures; only a relationship between the
input and output dictated by the values of the weights selected.

Figure 26-7a shows a closer look at the sigmoid function, mathematically described by the equation:

s(x) = 1 / (1 + e^(-x))        (Eq. 26-1)

The exact shape of the sigmoid is not important, only that it is a smooth threshold. For comparison,
a simple threshold produces a value of one when x > 0, and a value of zero when x < 0. The sigmoid
performs this same basic thresholding function, but is also differentiable, as shown in Fig. 26-7b. While the
derivative is not used in the flow diagram (Fig. 26-5), it is a critical part of finding the proper weights to use.
More about this shortly. An advantage of the sigmoid is that there is a shortcut to calculating the value of its
derivative:

s'(x) = s(x) [1 - s(x)]

For example, if x = 0, then s(x) = 0.5 (by Eq. 26-1), and the first derivative is calculated: s'(x) = 0.5(1 -
0.5) = 0.25. This isn't a critical concept, just a trick to make the algebra shorter.
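The derivative shortcut can be checked numerically. The sketch below uses the standard logistic sigmoid and compares the shortcut s(x)[1 - s(x)] against a central finite difference; the test point 1.3 and the step size are arbitrary choices.

```python
import math


def s(x):
    # Logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-x))


def s_prime(x):
    # Shortcut: the derivative is computed from the sigmoid's own value.
    value = s(x)
    return value * (1.0 - value)


def numeric_derivative(f, x, h=1e-6):
    # Central finite difference, used only to cross-check the shortcut.
    return (f(x + h) - f(x - h)) / (2.0 * h)


at_zero = (s(0.0), s_prime(0.0))
shortcut = s_prime(1.3)
estimate = numeric_derivative(s, 1.3)
```

The practical benefit is that a network which has already computed a node's output can obtain the derivative from that output with one multiplication, without evaluating the exponential again.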
Wouldn't the neural network be more flexible if the sigmoid could be adjusted left-or-right, making it centered
on some other value than x = 0? The answer is yes, and most neural networks allow for this. It is very simple
to implement; an additional node is added to the input layer, with its input always having a value of one.
When this is multiplied by the weights of the hidden layer, it provides a bias (DC offset) to each sigmoid. This
addition is called a bias node. It is treated the same as the other nodes, except for the constant input.
Can neural networks be made without a sigmoid or similar nonlinearity? To answer this, look at the three-
layer network of Fig. 26-5. If the sigmoids were not present, the three layers would collapse into only two
layers. In other words, the summations and weights of the hidden and output layers could be combined into a
single layer, resulting in only a two-layer network.
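The collapse of layers without a nonlinearity can be verified with plain matrix arithmetic: applying one weight matrix and then another gives the same result as applying their product once. The weight values below are hypothetical.

```python
def matvec(matrix, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    # Multiply two matrices given as lists of rows.
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 2 outputs, no sigmoid.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # hidden layer, then output layer
one_layer = matvec(matmul(W2, W1), x)   # single combined weight matrix
```

Since the two results always agree, the hidden layer adds nothing unless a nonlinearity such as the sigmoid sits between the two sets of weights.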

4.10.1 Feed-Forward networks:

Feed-forward networks have the following characteristics:


1. Perceptrons are arranged in layers, with the first layer taking in inputs and the last layer producing
outputs. The middle layers have no connection with the external world, and hence are called hidden layers.
2. Each perceptron in one layer is connected to every perceptron in the next layer. Hence information is
constantly "fed forward" from one layer to the next, which explains why these networks are called
feed-forward networks.
3. There is no connection among perceptrons in the same layer.
What's so cool about feed-forward networks?
Recall that a single perceptron can classify points into two regions that are linearly separable. Now let us
extend the discussion into the separation of points into two regions that are not linearly separable. Consider
the following network:

(Fig.2) A feed-forward network with one hidden layer.


The same point (x, y) is fed into the network through the perceptrons in the input layer. With four perceptrons
that are independent of each other in the hidden layer, the point is classified against 4 pairs of linearly
separable regions, each pair separated by its own line.

(Fig.3) 4 lines each dividing the plane into 2 linearly separable regions.
The top perceptron performs logical operations on the outputs of the hidden layer so that the whole network
classifies input points into 2 regions that might not be linearly separable. For instance, using the AND operator
on these four outputs, one gets the intersection of the 4 regions, which forms the center region.

(Fig.4) Intersection of 4 linearly separable regions forms the center region.
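
A minimal sketch of this construction follows. The four hidden-layer lines and the AND threshold are hypothetical values chosen so that the center region is the square -1 < x < 1, -1 < y < 1; the figures in the original may use different lines.

```python
def step(u):
    """Threshold unit: 1 when the weighted sum is positive, else 0."""
    return 1 if u > 0 else 0

# Hypothetical hidden layer: each row is (weight for x, weight for y, bias).
# Each row defines one line splitting the plane into two half-planes.
HIDDEN = [
    (1.0, 0.0, 1.0),    # fires when x > -1
    (-1.0, 0.0, 1.0),   # fires when x < 1
    (0.0, 1.0, 1.0),    # fires when y > -1
    (0.0, -1.0, 1.0),   # fires when y < 1
]

def classify(x, y):
    """Return 1 if (x, y) lies in the center region, else 0."""
    hidden_out = [step(wx * x + wy * y + b) for wx, wy, b in HIDDEN]
    # Output perceptron acting as AND: fires only if all four inputs are 1.
    return step(sum(hidden_out) - 3.5)

print(classify(0.2, -0.3))  # 1, inside the square
print(classify(2.0, 0.0))   # 0, outside the square
```

No single line can separate the inside of the square from the outside, yet the hidden layer plus one AND unit does so.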


By varying the number of nodes in the hidden layer, the number of layers, and the number of input and output
nodes, one can classify points in arbitrary dimensions into an arbitrary number of groups. Hence
feed-forward networks are commonly used for classification.

4.11 Linear neuron and the Widrow-Hoff Learning Rule:


Linear Neuron: Despite the non-linearities mentioned above, it is still possible to build a simplified, linear
model of a neuron that provides useful insights about the function of neurons in the brain. A schematic of a
linear neuron model is shown below:
Each input is multiplied by a corresponding weight and these values are summed together to form the
output y. Thus, the output is given as a function of the inputs and weights by the equation

$$ y = \sum_{i} w_i x_i $$

or, in vector form, $y = \mathbf{w}^T \mathbf{x}$, or $y = \mathbf{w} \cdot \mathbf{x}$.


• The inputs $x_i$ in a linear neuron model can be thought of as the action potentials from other neurons
that are impinging upon the neuron's synapses. The weights $w_i$ can be thought of as the efficacies of
the synapses. The larger $w_i$, the more $x_i$ affects the neuron's output. Some of the factors in a real
neuron that would determine $w_i$ are the number of synaptic vesicles in the presynaptic terminal, or the
number of ligand-gated channels in the post-synaptic membrane. The sign of $w_i$ reflects whether it is
an excitatory or inhibitory synapse.
• A slight but useful modification of the linear neuron above is to add a non-linear threshold function at
the output, which is meant as a crude model of the all-or-nothing action potential generated by a
neuron. In this case, the output is given by

$$ y = f\left(\sum_{i} w_i x_i\right) $$

where

$$ f(u) = 1 \text{ if } u > 0, \qquad f(u) = 0 \text{ otherwise.} $$
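
The two neuron models can be sketched in a few lines. The weights and inputs below are hypothetical, and the threshold of zero is an assumption made for illustration; the original equation may place the threshold elsewhere.

```python
def linear_neuron(weights, inputs):
    """Linear neuron: output is the weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def threshold_neuron(weights, inputs, threshold=0.0):
    """All-or-nothing output: 1 if the weighted sum exceeds the threshold."""
    return 1 if linear_neuron(weights, inputs) > threshold else 0

# Positive weights model excitatory synapses, negative ones inhibitory.
w = [0.5, -1.0, 2.0]
x = [1.0, 1.0, 0.5]

print(linear_neuron(w, x))     # 0.5
print(threshold_neuron(w, x))  # 1
```

Raising the inhibitory weight's magnitude (for example to -2.0) pushes the weighted sum below zero and silences the thresholded neuron.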

Widrow-Hoff Learning Rule:


The Widrow-Hoff learning rule is very similar to the perceptron learning rule; however, their
origins are different. Units with linear activation functions are called linear units. A network with a single
linear unit is called an Adaline (adaptive linear neuron); that is, in an Adaline the input-output relationship
is linear. Adaline uses bipolar activations for its input signals and its target output. The weights between the
input and the output are adjustable. Adaline is a net that has only one output unit. The Adaline network may
be trained using the delta learning rule, which is also called the least mean square (LMS) rule or the
Widrow-Hoff rule. This learning rule minimizes the mean squared error between the activation and the
target value.
Delta Learning rule:
• The perceptron learning rule originates from the Hebbian assumption, while the delta rule is derived from
the gradient-descent method (it can be generalised to more than one layer).
• The delta rule updates the weights of the connections so as to minimize the difference between the
net input to the output unit and the target value.
• The major aim is to minimize the error over all training patterns. This is done by reducing the error for each
pattern, one at a time.
• The delta rule for adjusting the ith weight (i = 1 to n) is

$$ \Delta w_i = \alpha\,(t - y_{in})\,x_i $$

where $\alpha$ is the learning rate, $t$ is the target value, $y_{in}$ is the net input to the output unit, and $x_i$ is the ith input.
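
As an illustration of the delta (LMS) rule on an Adaline, the sketch below trains a single linear unit on the bipolar AND function. It assumes the update "change in weight = learning rate x (target - net input) x input", applied one pattern at a time; the bias term, learning rate, epoch count and function names are choices made for this example only.

```python
def train_adaline(samples, alpha=0.1, epochs=50):
    """Train one linear unit with the delta (LMS) rule.

    samples: list of (inputs, target) pairs with bipolar values.
    Returns the learned (weights, bias).
    """
    n = len(samples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        for x, t in samples:
            # Net input to the output unit.
            y_in = bias + sum(w * xi for w, xi in zip(weights, x))
            error = t - y_in
            # Delta rule: move each weight along error * input.
            weights = [w + alpha * error * xi for w, xi in zip(weights, x)]
            bias += alpha * error  # bias acts as a weight on a constant input of 1
    return weights, bias

# Bipolar AND: target is +1 only when both inputs are +1.
AND_SAMPLES = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias = train_adaline(AND_SAMPLES)

def predict(x):
    """Apply a bipolar threshold to the trained unit's net input."""
    y_in = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if y_in >= 0 else -1

print(weights, bias)
print([predict(x) for x, _ in AND_SAMPLES])
```

The weights settle near the least-squares solution rather than reaching zero error, because a linear unit cannot reproduce the bipolar AND targets exactly; the thresholded predictions are nevertheless correct.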
Hebb Learning rule
• It is an algorithm developed for training pattern association nets.
• The Hebb learning rule is widely used for finding the weights of an associative neural net. The training
vector pairs are denoted s:t. The algorithm steps are given below:
• Step 0: Set all the initial weights to 0:
wij = 0 (for i = 1 to n, j = 1 to m)
• Step 1: For each training input-target output vector pair s:t, perform Steps 2-4.
• Step 2: Set the activations of the input layer units to the current training input:
xi = si (for i = 1 to n)
• Step 3: Set the activations of the output layer units to the current target output:
yj = tj (for j = 1 to m)
• Step 4: Adjust the weights:
wij(new) = wij(old) + xi yj (for i = 1 to n, j = 1 to m)
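
The steps above can be sketched as follows. The two training pairs are hypothetical bipolar patterns with mutually orthogonal inputs, chosen so that recall is exact; the `recall` helper (a sign threshold on the net input) is an addition for checking the result and is not part of the listed steps.

```python
def hebb_train(pairs):
    """Hebb rule for a pattern-association net.

    pairs: list of (s, t) with input vector s (length n) and target t (length m).
    Returns the n x m weight matrix w[i][j].
    """
    n = len(pairs[0][0])
    m = len(pairs[0][1])
    w = [[0.0] * m for _ in range(n)]      # Step 0: all weights start at 0
    for s, t in pairs:                     # Step 1: loop over training pairs
        x = list(s)                        # Step 2: input activations
        y = list(t)                        # Step 3: output activations
        for i in range(n):                 # Step 4: w_ij += x_i * y_j
            for j in range(m):
                w[i][j] += x[i] * y[j]
    return w

def recall(w, x):
    """Bipolar sign of the net input to each output unit."""
    m = len(w[0])
    net = [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(m)]
    return [1 if v >= 0 else -1 for v in net]

PAIRS = [([1, 1, -1, -1], [1, -1]),
         ([1, -1, 1, -1], [-1, 1])]
w = hebb_train(PAIRS)

print(w)
print([recall(w, s) for s, _ in PAIRS])
```

Each stored input reproduces its own target here because the two input vectors are orthogonal; correlated inputs would interfere with one another.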

4.12 The Error Correction Delta Rule:


In machine learning, the delta rule is a gradient descent learning rule for updating the weights of the inputs to
artificial neurons in a single-layer neural network. It is a special case of the more general back-propagation
algorithm. For a neuron j with activation function g(x), the delta rule for j's ith weight $w_{ji}$ is given by:

$$ \Delta w_{ji} = \alpha\,(t_j - y_j)\,g'(h_j)\,x_i $$

Where,
α : is a small constant called the learning rate
g(x) : is the neuron's activation function
g' : is the derivative of g
tj : is the target output
hj : is the weighted sum of the neuron's inputs
yj : is the actual output
xi : is the ith input

The delta rule is commonly stated in simplified form for a neuron with a linear activation function as

$$ \Delta w_{ji} = \alpha\,(t_j - y_j)\,x_i $$

While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses
the Heaviside step function as the activation function g(h), which means that g'(h) does not exist at zero and
is equal to zero elsewhere, making the direct application of the delta rule impossible.
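
A single update step makes the comparison concrete. The sketch below assumes the general form "change in weight = learning rate x (target - output) x g'(h) x input" for one neuron; the function names, starting weights and inputs are hypothetical.

```python
import math

def sigmoid(h):
    """Logistic activation, used here as one differentiable choice of g."""
    return 1.0 / (1.0 + math.exp(-h))

def sigmoid_prime(h):
    """Derivative of the logistic activation."""
    s = sigmoid(h)
    return s * (1.0 - s)

def delta_rule_step(weights, x, target, alpha, g, g_prime):
    """One update: w_i <- w_i + alpha * (t - y) * g'(h) * x_i."""
    h = sum(w * xi for w, xi in zip(weights, x))  # weighted sum of inputs
    y = g(h)                                      # actual output
    scale = alpha * (target - y) * g_prime(h)
    return [w + scale * xi for w, xi in zip(weights, x)]

# Linear activation: g(h) = h and g'(h) = 1 give the simplified form.
linear_step = delta_rule_step([0.0, 0.0], [1.0, 2.0], 1.0, 0.1,
                              lambda h: h, lambda h: 1.0)

# Sigmoid activation: the derivative scales the update.
sigmoid_step = delta_rule_step([0.0, 0.0], [1.0, 2.0], 1.0, 0.1,
                               sigmoid, sigmoid_prime)

# Heaviside step: the derivative is zero away from h = 0, so the weights
# do not move even though the output (1) differs from the target (0).
heaviside_step = delta_rule_step([0.5, 0.5], [1.0, 2.0], 0.0, 0.1,
                                 lambda h: 1.0 if h > 0 else 0.0,
                                 lambda h: 0.0)

print(linear_step, sigmoid_step, heaviside_step)
```

The last case shows why the perceptron needs its own update rule: with a step activation the gradient carries no information about the error.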
