PNAL4 SingleLayerNets
M. Ali Akcayol
Gazi University
Department of Computer Engineering
Content
Perceptrons
Linear separability
Perceptron training algorithm
Termination criterion
Choice of learning rate
Non-numeric inputs
Adalines
Multiclass discrimination
Perceptrons
In supervised learning algorithms, the desired result is known
for samples in the training data.
Learning algorithms are simpler for networks consisting of only one node in one layer.
The modification of the weights is very simple.
Perceptrons have a simple description but limited capabilities.
A perceptron is defined to be a machine that learns using examples.
A perceptron can also be described as a stochastic gradient-descent algorithm that linearly separates samples in n-dimensional space.
Perceptrons
A perceptron has a single output whose value determines to which of two classes each input pattern belongs.
A perceptron can be represented by a single node.
The perceptron applies a step function to the net weighted sum of its inputs.
The input pattern is considered to belong to one class or the other.
The output class is decided depending on whether the node output is 0 or 1.
Perceptrons
Example
Consider two-dimensional samples (0,0), (0,1), (1,0), (-1,-1)
that belong to one class, and samples (2.1,0), (0, -2.5), (1.6,
-1.6) that belong to another class.
These classes are linearly separable.
The node function is a step function.
The output of the node is 1 if the net weighted input is greater
than 2, and 0 otherwise.
The decision regions are therefore x1 - x2 ≤ 2 (output 0) and x1 - x2 > 2 (output 1).
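A minimal Python sketch of this example; the weights w1 = 1, w2 = -1 and the threshold 2 follow directly from the decision rule x1 - x2 > 2 above:

def perceptron_output(x1, x2, w1=1.0, w2=-1.0, threshold=2.0):
    # Step-function node: output 1 if the net weighted input exceeds the
    # threshold, 0 otherwise.
    net = w1 * x1 + w2 * x2
    return 1 if net > threshold else 0

class_0 = [(0, 0), (0, 1), (1, 0), (-1, -1)]    # expected output 0
class_1 = [(2.1, 0), (0, -2.5), (1.6, -1.6)]    # expected output 1

for point in class_0 + class_1:
    print(point, perceptron_output(*point))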
Linear separability
If there exists a line that separates all samples of one class from all samples of the other class, the classification problem is said to be ‘linearly separable’.
The line’s equation is
w0 + w1 x1 + w2 x2 = 0
Linear separability
Some classes are linearly non-separable: no single straight line can separate all samples of one class from those of the other.
Linear separability
If there is only one input dimension x, then the two-class
problem can be solved using a perceptron if and only if there
is some value x0 of x such that all samples of one class occur
for x > x0 , and all samples of the other class occur for x < x0.
Linear separability
If there are three input dimensions, a two-class problem can
be solved using a perceptron if and only if there is a plane that
separates samples of different classes.
As in the two-dimensional case, coefficients of terms
correspond to the weights of the perceptron.
A generic perceptron for n-dimensional space.
Linear separability
For spaces with a higher number of input dimensions, the geometric representation must be extended.
Hyperplanes can separate samples of different classes in n-
dimensional space.
Each hyperplane in n dimensions is defined by the equation
w0 + w1 x1 + w2 x2 + … + wn xn = 0
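As a small illustration (a sketch, with the function name chosen here for exposition), the class of a sample is decided by the sign of this expression:

def side_of_hyperplane(w, x):
    # w = (w0, w1, ..., wn) are the hyperplane coefficients (the perceptron weights),
    # x = (x1, ..., xn) is the input sample.
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1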
Perceptron training algorithm
The perceptron training algorithm can be used to obtain appropriate weights for a perceptron that separates two classes.
From the weight values, the equation of the hyperplane that divides the input space can be derived.
The trained perceptron can then be used to classify new samples.
The dot product (scalar product) of two vectors w and x is defined as
w . x = w1 x1 + w2 x2 + … + wn xn
Perceptron training algorithm
If i belongs to class C0 (desired node output is -1) but w.i > 0, then the weight vector needs to be modified to w + Δw so that (w + Δw).i < w.i.
Here Δw = -η.i, where η > 0.
After this weight modification, i has a better chance of being classified correctly in the following iteration.
Perceptron training algorithm
If i belongs to class C1 (desired node output is 1) but w.i < 0, then the weight vector needs to be modified to w + Δw so that (w + Δw).i > w.i; in this case Δw = η.i, where η > 0.
Let i1, i2, …, ip denote the training set, containing p input vectors.
We define a function that maps each sample to its desired output: +1 for class C1 and -1 for class C0.
Samples are presented repeatedly to train the weights.
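A Python sketch of this training loop; the bias handling (a constant 1 prepended to each input vector), the random initialization, and the epoch limit are illustrative assumptions, not specified on the slides:

import random

def train_perceptron(samples, eta=0.1, max_epochs=1000):
    # samples: list of (input_vector, desired) pairs with desired in {+1, -1};
    # each input vector is assumed to start with a constant 1 for the bias weight.
    n = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]   # arbitrary starting weights
    for _ in range(max_epochs):
        errors = 0
        for x, d in samples:
            net = sum(wk * xk for wk, xk in zip(w, x))
            if d == 1 and net <= 0:      # desired +1 but w.i is not positive
                w = [wk + eta * xk for wk, xk in zip(w, x)]   # w + η.i
                errors += 1
            elif d == -1 and net > 0:    # desired -1 but w.i is positive
                w = [wk - eta * xk for wk, xk in zip(w, x)]   # w - η.i
                errors += 1
        if errors == 0:                  # all samples classified correctly
            break
    return w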
Perceptron training algorithm
Example
Let there be 7 one-dimensional input patterns as shown
below.
Perceptron training algorithm
Example – cont.
For the input value 0.83, the net input is (0.83)(-0.36) - 1.0 ≈ -1.3.
The net input is negative, so the sample is assigned class 0, which is an error (the desired class is 1).
For η = 0.1, the new weights are calculated as follows.
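A worked version of this step in Python, assuming (as a convention not stated on the slide) that the weight -1.0 is a bias whose input is fixed at 1:

eta = 0.1
w = [-1.0, -0.36]     # [bias weight, input weight]
i = [1.0, 0.83]       # [bias input, sample value]; desired class is 1, so w moves to w + η.i
w_new = [wk + eta * ik for wk, ik in zip(w, i)]
print(w_new)          # approximately [-0.9, -0.277]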
Termination criterion
For many ANN learning algorithms, the termination criterion is
″stop when the goal is achieved″.
For any kind of classifier, the goal is the correct classification
of all samples.
So the perceptron training algorithm runs until all samples are
correctly classified.
For the perceptron, termination is assured if η is sufficiently small and the samples are linearly separable.
If η is not appropriate or the samples are not linearly separable, the algorithm may run indefinitely.
How can we detect that this may be the case?
Termination criterion
The amount of progress achieved in the recent past can be used to decide when to terminate training.
For a linear classifier, if the number of correct classifications has not changed over a large number of steps, the samples may not be linearly separable.
The same problem may occur with an inappropriate choice of η.
Trying different values of η may improve the training phase.
Termination criterion
In some problems, two classes overlap and are not linearly separable.
If the performance requirements allow some amount of misclassification, we can modify the termination criterion.
For example, if it is known that at least 6% of the samples will be misclassified (or the user is satisfied with 6% misclassification), the termination criterion can be relaxed.
We can then terminate the training algorithm as soon as 94% of the samples are correctly classified.
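A sketch of such a relaxed termination check in Python; the 6% tolerance and the size of the 'no progress' window are illustrative values chosen here, not fixed by the slides:

def should_stop(correct_history, total, tolerated_error=0.06, window=1000):
    # correct_history: number of correctly classified samples after each epoch.
    # Stop when accuracy reaches 1 - tolerated_error (94% here), or when the
    # best count has not improved over the last `window` epochs, which suggests
    # that the samples may not be linearly separable (or that η is inappropriate).
    if correct_history and correct_history[-1] / total >= 1 - tolerated_error:
        return True
    if len(correct_history) > window:
        if max(correct_history[-window:]) <= max(correct_history[:-window]):
            return True    # no progress over the last `window` epochs
    return False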
Choice of learning rate
Examining extreme cases can help derive a good choice for η.
If η is too large (e.g., 1,000,000), then the components of Δw = ±ηx can have very large magnitudes.
In that case each weight update swings the perceptron's outputs completely in one direction; as a result, the perceptron considers all samples to be in the same class, and the system oscillates between extremes.
If η = 0, the weights are never modified.
If η is very small but nonzero, the change in the weights at each step is tiny, which makes the algorithm exceedingly slow.
Choice of learning rate
If η is too large, training will initially progress very fast, but the weights will eventually jump around the optimal solution and never settle down.
If η is too small, the training will eventually converge to the best state, but this will take a long time.
To find a fairly good learning rate, the network should be trained with various learning rates and the results compared, as sketched below.
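One simple way to do this, sketched here in Python, is a sweep over candidate learning rates; it reuses the train_perceptron sketch given earlier, and the candidate values are illustrative:

def accuracy(w, samples):
    # Fraction of samples whose output sign matches the desired class (+1 / -1).
    correct = 0
    for x, d in samples:
        net = sum(wk * xk for wk, xk in zip(w, x))
        if (net > 0) == (d == 1):
            correct += 1
    return correct / len(samples)

def pick_learning_rate(samples, candidates=(0.001, 0.01, 0.1, 1.0)):
    # Train once per candidate η and keep the rate that classifies best.
    scores = {eta: accuracy(train_perceptron(samples, eta=eta), samples)
              for eta in candidates}
    return max(scores, key=scores.get)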
Choice of learning rate
What is an appropriate choice for η, one that is neither too small nor too large?
A common choice is η = 1, leading to the simple weight change rule Δw = ±x,
so that (w + Δw).x = w.x ± x.x.
If |w.x| > x.x, a single update of this size cannot change the sign of the net input, so the sample x may still not be correctly classified.
To ensure that the sample x is classified correctly after the update, (w + Δw).x must have the opposite sign to w.x.
Non-numeric inputs
In some problems, the input dimensions are non-numeric.
For example, input dimension may be ″color″.
Its values may range over the set {red, blue, green, yellow}.
We cannot establish a meaningful ordering of the colors along a single axis.
The simplest way is to generate four new dimensions (″red″,
″blue″, ″green″, ″yellow″).
We can replace each original attribute-value pair by a binary
vector.
For instance, color = ″green″ is represented by the input
vector (0, 0, 1, 0), ″blue″ is (0, 1, 0, 0).
The disadvantage of this approach is a drastic increase in the
number of dimensions.
Non-numeric inputs
Example
The day of the week (Sunday/Monday/ . . .) is an important
variable in predicting the amount of electric power consumed
in a city.
However, there is no obvious way of sequencing weekdays.
So it is not appropriate to use a single variable whose values
range from 1 to 7.
Instead, seven different variables should be used; each input sample has a value of 1 for exactly one of these coordinates and 0 for the others.
For instance, ″Tuesday″ is represented as (0, 0, 1, 0, 0, 0, 0),
″Monday″ is (0, 1, 0, 0, 0, 0, 0).
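A small Python sketch of this one-of-n (one-hot) encoding; the helper name is chosen here for illustration:

DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

def one_hot(value, categories):
    # Return a binary vector with a 1 in the position of `value` and 0 elsewhere.
    return [1 if c == value else 0 for c in categories]

print(one_hot("Tuesday", DAYS))   # [0, 0, 1, 0, 0, 0, 0]
print(one_hot("Monday", DAYS))    # [0, 1, 0, 0, 0, 0, 0]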
Adalines
The fundamental principle underlying the perceptron learning
algorithm is to modify weights to reduce the number of
misclassifications.
Perfect classification using a linear element may not be
possible for all problems.
Minimizing the mean squared error (MSE) instead of the
number of misclassified samples may be used while training.
An adaptive linear element or Adaline, proposed by Widrow
(1959, 1960), is a simple perceptron-like system.
Adalines
Adaline accomplishes classification by modifying weights in
such a way as to diminish the MSE at each iteration.
This can be accomplished using gradient descent.
MSE is a quadratic function whose derivative exists
everywhere.
Unlike the perceptron rule, weight changes are made to reduce the MSE rather than the number of misclassifications.
Even when a sample is correctly classified by the network, the
weights may change.
Adalines
In the training process, when a sample is presented to the network, the linear weighted net input is computed.
The computed net value is compared with the desired output.
The resulting error signal is used to modify each weight in the Adaline.
The weight change rule uses the partial derivative of the error with respect to each weight.
Adalines
Let xj = (xj,1, xj,2, …, xj,n) be an input vector for which dj is the desired output value.
Let netj = Σk wk xj,k be the net input to the node.
w = (w1, w2, …, wn) is the present value of the weight vector.
The squared error is Ej = (dj - netj)².
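Differentiating this error with respect to each weight (a standard derivation consistent with the definitions above) gives ∂Ej/∂wk = -2 (dj - netj) xj,k, so a gradient-descent step changes each weight by Δwk = η (dj - netj) xj,k, where the constant factor 2 is absorbed into the learning rate η.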
Adalines
Adaline Least-Mean-Squares (LMS) training algorithm
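A minimal Python sketch of LMS (delta-rule) training consistent with the update derived above; the bias convention (a constant 1 at the start of each input vector) and the fixed number of epochs are illustrative assumptions:

def train_adaline(samples, eta=0.01, epochs=100):
    # samples: list of (input_vector, desired) pairs; each input vector is
    # assumed to start with a constant 1 so the first weight acts as a bias.
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, d in samples:
            net = sum(wk * xk for wk, xk in zip(w, x))   # linear net input
            error = d - net
            # Delta rule: adjust each weight along the negative MSE gradient.
            w = [wk + eta * error * xk for wk, xk in zip(w, x)]
    return w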
Multiclass discrimination
So far, we have considered dichotomies, or two-class
problems.
Many important real-life problems require partitioning data
into three or more classes.
For example, the character recognition problem consists of distinguishing between samples of 29 different classes (for the Turkish alphabet).
A layer of perceptrons or Adalines may be used to solve some such multiclass problems.
Four perceptrons can be put together to solve a four-class classification problem.
Multiclass discrimination
Each weight wi,j indicates the strength of the connection from the jth input to the ith node.
A sample is considered to belong to the ith class if and only if the ith output oi = 1 and every other output ok = 0, for k ≠ i.
This network is trained in the same way as individual perceptrons.
If all outputs are zero, or if more than one output equals 1, the network is considered to have failed in the classification task.
If the outputs can take values between 0 and 1, a ‘maximum-selector’ can be used to select the output with the highest value.
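A small Python sketch of such a layer, one node per class, with a maximum selector as the fallback; the thresholding at 0 and the tie-breaking policy are illustrative choices, not taken from the slides:

def classify_multiclass(x, W):
    # W[i] holds the weights w_i,j from each input j to node i (one node per class).
    nets = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    outputs = [1 if net > 0 else 0 for net in nets]
    if outputs.count(1) == 1:
        return outputs.index(1)      # exactly one node fired: that class wins
    return nets.index(max(nets))     # otherwise fall back to the maximum selector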
Homework