
19AI411 - NEURAL NETWORK

UNIT - II
PERCEPTRON

Contents
• Single Layer Perceptron: Adaptive Filtering Problem
• Unconstrained Optimization Techniques
• Linear Least-Squares Filters
• Least-Mean-Square Algorithm
• Learning Curves, Learning-Rate Annealing Techniques
• Perceptron Convergence Theorem
• Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment
• Multilayer Perceptron: Back-Propagation Algorithm
• XOR Problem
• Heuristics
• Output Representation and Decision Rule
• Feature Detection
Adaptive Filtering Problem

Dynamic system: the external behavior of the system is described by the training data
T: {x(i), d(i); i = 1, 2, …, n, …}
where x(i) = [x1(i), x2(i), …, xm(i)]^T.

x(i) can arise from:
• Spatial data: x(i) is a snapshot of data.
• Temporal data: x(i) is uniformly spaced in time.

(Figure: signal-flow graph of the adaptive filter.)

• Filtering process
  – y(i) is produced in response to x(i).
  – e(i) = d(i) − y(i)
• Adaptive process
  – Automatic adjustment of the synaptic weights in accordance with e(i).

y(i) = v(i) = Σ_{k=1}^{m} w_k(i) x_k(i) = x^T(i) w(i)
e(i) = d(i) − y(i)
where w(i) = [w_1(i), w_2(i), …, w_m(i)]^T
Important Points

• The algorithm starts from an arbitrary setting of the neuron's synaptic weights.
• Adjustments to the synaptic weights, in response to statistical variations in the system's behavior, are made on a continuous basis.
• Computation of the adjustments to the synaptic weights is completed inside a time interval.
Unconstrained Optimization Techniques

• Let C(w) be a continuously differentiable function of some unknown weight (parameter) vector w.
• C(w) maps w into real numbers.
• Goal: find an optimal solution w* that satisfies C(w*) ≤ C(w), i.e., minimize C(w) with respect to w.

Necessary condition for optimality: ∇C(w*) = 0, where ∇ is the gradient operator
∇ = [∂/∂w_1, ∂/∂w_2, …, ∂/∂w_m]^T
∇C(w) = [∂C/∂w_1, ∂C/∂w_2, …, ∂C/∂w_m]^T

A class of unconstrained optimization algorithms:
Starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), …, such that the cost function C(w) is reduced at each iteration of the algorithm.
Method of Steepest Descent

The successive adjustments applied to w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector ∇C(w).

Let g = ∇C(w).
The steepest-descent algorithm: w(n+1) = w(n) − ηg(n)
η: a positive constant called the step size or learning-rate parameter.
Δw(n) = w(n+1) − w(n) = −ηg(n)

Small η: overdamps the transient response.
Large η: underdamps the transient response.
If η exceeds a certain value, the algorithm becomes unstable.
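The update rule above can be sketched in a few lines of Python. This is an illustrative example only (the quadratic cost function, the learning rate eta, and the iteration count are assumptions, not taken from the slides):

import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iter=100):
    # Iterate w(n+1) = w(n) - eta * g(n), where g(n) is the gradient of C at w(n).
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        g = grad(w)            # gradient vector g = grad C(w)
        w = w - eta * g        # step in the direction opposite to the gradient
    return w

# Example: C(w) = 0.5 * ||w - w_star||^2 has gradient (w - w_star).
w_star = np.array([1.0, -2.0])
print(steepest_descent(lambda w: w - w_star, w0=[0.0, 0.0], eta=0.2, n_iter=50))  # approx. w_star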
Newton's Method

Minimize the quadratic approximation of the cost function C(w) around the current point w(n).

Applying a second-order Taylor series expansion of C(w) around w(n):
ΔC(w(n)) = C(w(n+1)) − C(w(n))
         ≈ g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)

where H is the Hessian matrix of C(w):
H = ∇²C(w) =
    [ ∂²C/∂w1²      ∂²C/∂w1∂w2   …  ∂²C/∂w1∂wm ]
    [ ∂²C/∂w2∂w1    ∂²C/∂w2²     …  ∂²C/∂w2∂wm ]
    [ ⋮             ⋮                ⋮          ]
    [ ∂²C/∂wm∂w1    ∂²C/∂wm∂w2   …  ∂²C/∂wm²   ]

ΔC(w(n)) is minimized when
g(n) + H(n) Δw(n) = 0
Δw(n) = −H⁻¹(n) g(n)
w(n+1) = w(n) + Δw(n) = w(n) − H⁻¹(n) g(n)

Generally speaking, Newton's method converges quickly.
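As an illustration (not from the slides), a single Newton step on a quadratic cost reaches the minimizer exactly; the matrix A and vector b below are arbitrary assumptions:

import numpy as np

def newton_step(grad, hessian, w):
    # One Newton update: w(n+1) = w(n) - H^{-1}(n) g(n).
    g = grad(w)
    H = hessian(w)
    return w - np.linalg.solve(H, g)   # solve H * delta = g rather than inverting H

# Quadratic cost C(w) = 0.5 * w^T A w - b^T w, with gradient A w - b and Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = newton_step(lambda w: A @ w - b, lambda w: A, np.zeros(2))
print(w, np.linalg.solve(A, b))        # the single step lands on the minimizer A^{-1} b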
Gauss-Newton Method

The Gauss-Newton method is applicable to a cost function C(w) that is the sum of error squares.

Let C(w) = (1/2) Σ_{i=1}^{n} e²(i)

Linearizing the error around the current point w(n):
e(i, w) = e(i) + [∂e(i)/∂w]^T |_{w=w(n)} (w − w(n)),   i = 1, 2, …, n

In matrix form:
e(n, w) = e(n) + J(n)(w − w(n)),  where e(n) = [e(1), e(2), …, e(n)]^T

The Jacobian J(n) is [∇e(n)]^T:
J(n) =
    [ ∂e(1)/∂w1   ∂e(1)/∂w2   …  ∂e(1)/∂wm ]
    [ ∂e(2)/∂w1   ∂e(2)/∂w2   …  ∂e(2)/∂wm ]
    [ ⋮           ⋮                ⋮        ]
    [ ∂e(n)/∂w1   ∂e(n)/∂w2   …  ∂e(n)/∂wm ]
evaluated at w = w(n).

Goal: w(n+1) = arg min_w { (1/2) ||e(n, w)||² }
Gauss-Newton Method (Cont.)

(1/2) ||e(n, w)||²
  = (1/2) ||e(n) + J(n)(w − w(n))||²
  = (1/2) ||e(n)||² + e^T(n) J(n)(w − w(n)) + (1/2) (w − w(n))^T J^T(n) J(n)(w − w(n))

(e^T(n) J(n)(w − w(n)) and (w − w(n))^T J^T(n) e(n) are both scalars and are equal, so they combine into the single cross term above.)

Differentiating this expression with respect to w and setting the result to zero:
J^T(n) e(n) + J^T(n) J(n)(w − w(n)) = 0
w(n+1) = w(n) − (J^T(n) J(n))⁻¹ J^T(n) e(n)

To guard against the possibility that J(n) is rank deficient:
w(n+1) = w(n) − (J^T(n) J(n) + δI)⁻¹ J^T(n) e(n)
Linear Least-Squares Filter

Characteristics of the linear least-squares filter:
– The single neuron around which it is built is linear.
– The cost function C(w) consists of the sum of error squares.

e(n) = d(n) − [x(1), x(2), …, x(n)]^T w(n)
     = d(n) − X(n) w(n)
where d(n) = [d(1), d(2), …, d(n)]^T and X(n) = [x(1), x(2), …, x(n)]^T.

Differentiating e(n) with respect to w(n) gives ∇e(n) = −X^T(n), so the Jacobian is J(n) = −X(n).

Substituting this into the equation derived from the Gauss-Newton method:
w(n+1) = w(n) + (X^T(n) X(n))⁻¹ X^T(n) [d(n) − X(n) w(n)]
       = (X^T(n) X(n))⁻¹ X^T(n) d(n)

Let X⁺(n) = (X^T(n) X(n))⁻¹ X^T(n) (the pseudoinverse of X(n)); then
w(n+1) = X⁺(n) d(n)
Wiener Filter

The Wiener filter is the limiting form of the linear least-squares filter for an ergodic environment.

Let R_x denote the correlation matrix of the input vector x(i):
R_x = E[x(i) x^T(i)] = lim_{n→∞} (1/n) Σ_{i=1}^{n} x(i) x^T(i) = lim_{n→∞} (1/n) X^T(n) X(n)

Let r_xd denote the cross-correlation vector of x(i) and d(i):
r_xd = E[x(i) d(i)] = lim_{n→∞} (1/n) Σ_{i=1}^{n} x(i) d(i) = lim_{n→∞} (1/n) X^T(n) d(n)

Let w_0 denote the Wiener solution to the linear optimum filtering problem:
w_0 = lim_{n→∞} w(n+1) = lim_{n→∞} (X^T(n) X(n))⁻¹ X^T(n) d(n)
    = [lim_{n→∞} (1/n) X^T(n) X(n)]⁻¹ [lim_{n→∞} (1/n) X^T(n) d(n)]
    = R_x⁻¹ r_xd
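A sketch showing that, for a long (approximately ergodic) data record, the sample estimates of R_x and r_xd give essentially the same weights as the least-squares solution (all data below are synthetic assumptions):

import numpy as np

rng = np.random.default_rng(2)
n, m = 5000, 3
X = rng.normal(size=(n, m))            # rows are input vectors x(i)^T
w_true = np.array([1.0, 0.5, -0.25])
d = X @ w_true + 0.1 * rng.normal(size=n)

R_x = (X.T @ X) / n                    # sample estimate of E[x x^T]
r_xd = (X.T @ d) / n                   # sample estimate of E[x d]
w0 = np.linalg.solve(R_x, r_xd)        # Wiener solution w0 = R_x^{-1} r_xd
print(w0)                              # close to w_true and to the least-squares weights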
Least-Mean-Square (LMS) Algorithm

LMS is based on instantaneous values of the cost function:
C(w) = (1/2) e²(n),  where e(n) is the error signal measured at time n.

∂C(w)/∂w = e(n) ∂e(n)/∂w = −x(n) e(n),  because e(n) = d(n) − x^T(n) w(n)

Using the instantaneous gradient estimate ĝ(n) = −x(n) e(n) gives
ŵ(n+1) = ŵ(n) + η x(n) e(n)

ŵ(n) is used in place of w(n) to emphasize that LMS produces an estimate of the weight vector that would result from the method of steepest descent.

Summary of the LMS algorithm:
Training sample:          input signal vector x(n), desired response d(n)
User-selected parameter:  η
Initialization:           set ŵ(0) = 0
Computation:              for n = 1, 2, …, compute
                          e(n) = d(n) − ŵ^T(n) x(n)
                          ŵ(n+1) = ŵ(n) + η x(n) e(n)
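A direct transcription of the LMS summary into Python (the synthetic data and the choice of eta are assumptions):

import numpy as np

def lms(X, d, eta=0.01):
    # LMS: w_hat(0) = 0; for each n,
    # e(n) = d(n) - w_hat^T(n) x(n),  w_hat(n+1) = w_hat(n) + eta * x(n) * e(n).
    w = np.zeros(X.shape[1])
    for x_n, d_n in zip(X, d):
        e_n = d_n - w @ x_n            # instantaneous error
        w = w + eta * x_n * e_n        # stochastic-gradient weight update
    return w

# Example on synthetic linear data (eta chosen by hand):
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 3))
w_true = np.array([1.0, -0.5, 0.25])
d = X @ w_true + 0.05 * rng.normal(size=2000)
print(lms(X, d, eta=0.05))             # approaches w_true, more slowly than least squares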
Virtues and Limitations of LMS

• Virtues
  – Simplicity
• Limitations
  – Slow rate of convergence
  – Sensitivity to variations in the eigenstructure of the input
Learning Curve

A learning curve plots the mean-square error against the number of iterations and is used to assess the convergence behavior of the algorithm.
Learning Rate Annealing

• Normal approach: η(n) = η₀ for all n.
• Stochastic approximation: η(n) = c/n, where c is a constant.
  There is a danger of parameter blowup for small n when c is large.
• Search-then-converge schedule: η(n) = η₀ / (1 + n/τ), where η₀ and τ are constants (the schedules are sketched in code below).
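The three schedules written out as functions (the parameter values eta0, c, and tau are illustrative assumptions):

import numpy as np

def eta_constant(n, eta0=0.1):
    return eta0                        # normal approach: fixed learning rate

def eta_stochastic(n, c=1.0):
    return c / n                       # stochastic approximation: c/n (can blow up for small n)

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    return eta0 / (1.0 + n / tau)      # roughly eta0 early on, decays like tau*eta0/n for n >> tau

for n in (1, 10, 100, 1000):
    print(n, eta_constant(n), eta_stochastic(n), eta_search_then_converge(n))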
Perceptron

• The simplest form of a neural network used for the classification of patterns said to be linearly separable.

(Figure: signal-flow graph of the perceptron — inputs x1, x2, …, xm weighted by w1, w2, …, wm, plus the bias b, are summed to form v, which passes through a hard limiter φ(·) to produce the output y.)

v = Σ_{i=1}^{m} w_i x_i + b

Letting x0 = 1 and b = w0:
v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)

• Goal: classify the set {x(1), x(2), …, x(n)} into one of two classes, C1 or C2.
• Decision rule: assign x(i) to class C1 if y = +1 and to class C2 if y = −1.
  w^T x > 0 for every input vector x belonging to class C1
  w^T x ≤ 0 for every input vector x belonging to class C2

Perceptron (Cont.)

Algorithm:
1. w(n+1) = w(n)  if w^T(n)x(n) > 0 and x(n) belongs to class C1
   w(n+1) = w(n)  if w^T(n)x(n) ≤ 0 and x(n) belongs to class C2
2. w(n+1) = w(n) − η(n)x(n)  if w^T(n)x(n) > 0 and x(n) belongs to class C2
   w(n+1) = w(n) + η(n)x(n)  if w^T(n)x(n) ≤ 0 and x(n) belongs to class C1

Let d(n) = +1 if x(n) belongs to class C1, and d(n) = −1 if x(n) belongs to class C2. Then
w(n+1) = w(n) + η[d(n) − y(n)]x(n)   (error-correction learning rule form, sketched in code below)

• A smaller η provides stable weight estimates.
• A larger η provides fast adaptation.
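A minimal training loop using the error-correction form of the update (the toy data set and the fixed eta are assumptions, not from the slides):

import numpy as np

def train_perceptron(X, d, eta=1.0, n_epochs=20):
    # Error-correction rule: w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n),
    # with y(n) = sign(w^T(n) x(n)) and labels d(n) in {+1, -1}.
    # Assumes a bias input x0 = 1 has already been prepended to each pattern.
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_n, d_n in zip(X, d):
            y_n = 1.0 if w @ x_n > 0 else -1.0   # hard limiter
            w = w + eta * (d_n - y_n) * x_n      # no change when y(n) equals d(n)
    return w

# Example on a linearly separable toy problem (x0 = 1 prepended to each row):
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
d = np.array([1, 1, -1, -1])
w = train_perceptron(X, d)
print(w, np.sign(X @ w))               # predictions match the labels d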
Perceptron (Cont.)

For n = n_max
Perceptron Convergence Algorithm

Perceptron (Cont.)

Two essential points to design a perceptron
Relation Between the Perceptron and Bayes Classifier

• Classical pattern classifier: the Bayes classifier
• Gaussian distribution
Relation Between the Perceptron and Bayes Classifier (Cont.)

• Likelihood function
• Threshold
Relation Between the Perceptron and Bayes Classifier (Cont.)

Key points:
• C denotes the covariance matrix.
• C is a non-diagonal and non-singular matrix (C⁻¹ exists).
Relation Between the Perceptron and Bayes Classifier (Cont.)

• The conditional probability density function is given by the Gaussian distribution.
• The misclassification costs are equal and the cost of correct classification is zero.
• Under these assumptions, the Bayes classifier reduces to a linear classifier.
Relation Between the Perceptron and Bayes Classifier (Cont.)

Perceptron vs. Bayes classifier:

1. Perceptron: operates on patterns that are linearly separable.
   Bayes classifier: operates on Gaussian distributions of the two patterns, which overlap each other.
2. Perceptron: inputs are separable.
   Bayes classifier: inputs are non-separable.
3. Perceptron: assumes no overlap, so it is difficult to minimize the error probability.
   Bayes classifier: minimizes the probability of error, independently of the overlap between the Gaussian distributions.
4. Perceptron: convergence is non-parametric.
   Bayes classifier: parametric.
5. Perceptron: no assumptions involved in the decision.
   Bayes classifier: assumptions are involved in decision making.
6. Perceptron: operates by concentrating on errors.
   Bayes classifier: operates by focusing on the probability density functions.
7. Perceptron: adaptive and simple.
   Bayes classifier: fixed, but can be made adaptive.
8. Perceptron: less computational complexity.
   Bayes classifier: more computational complexity.