
UNIT II

MATHEMATICAL FOUNDATIONS AND LEARNING MECHANISMS


Re-visiting Vector and Matrix algebra
Vector: An n-tuple (pair, triple, quadruple ...) of scalars can be written as a horizontal row or vertical
column. A column is called a vector. A vector is denoted by an uppercase letter. Its entries are
identified by the corresponding lowercase letter, with subscripts. The row with the same entries is
indicated by a superscript t.
Consider a vector X with entries x1, ..., xn, written as the column X = [x1, ..., xn]^t.
You can also use a superscript t to convert a row back to the corresponding column, so that (X^t)^t = X
for any vector X. You can add two vectors with the same number of entries, entry by entry:
X + Y = [x1 + y1, ..., xn + yn]^t.

Vectors satisfy commutative and associative laws for addition:


X+Y=Y+X
X + (Y + Z ) = (X + Y ) + Z.
The zero vector and the negative of a vector are defined by the equations
O = [0, ..., 0]^t and -X = [-x1, ..., -xn]^t.
Clearly,
-O = O          X + O = X
-(-X) = X       X + (-X) = O.
You can multiply a vector by a scalar, entry by entry:
sX = [sx1, ..., sxn]^t.
This product is also written Xs. You should verify these manipulation rules:
1X = X          (-1)X = -X          s(X + Y) = sX + sY          (s + t)X = sX + tX          (st)X = s(tX).
You can add and subtract rows X^t and Y^t with the same number of entries, and define the zero row
and the negative of a row. The product of a scalar and a row is
sX^t = s[x1, ..., xn] = [sx1, ..., sxn].
These rules are useful:
X^t ± Y^t = (X ± Y)^t
-(X^t) = (-X)^t
s(X^t) = (sX)^t.
Finally, you can multiply a row by a vector with the same number of entries to get their scalar
product:
X^t Y = x1 y1 + x2 y2 + ... + xn yn.
With a little algebra you can verify the following manipulation rules:
X^t(Y + Z) = X^t Y + X^t Z          (sX^t)Y = s(X^t Y) = X^t(sY)          X^t O = 0 = O^t X.
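These vector rules can be checked numerically. A minimal Python/NumPy sketch (added here only for illustration; the vectors and the scalar are arbitrary examples):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 5.0, 6.0])
s = 2.5

# commutative law for vector addition: X + Y = Y + X
print(np.allclose(X + Y, Y + X))                  # True

# scalar multiplication distributes over addition: s(X + Y) = sX + sY
print(np.allclose(s * (X + Y), s * X + s * Y))    # True

# scalar product of a row with a vector: X^t Y = x1*y1 + ... + xn*yn
print(np.dot(X, Y))                               # 1*4 + 2*5 + 3*6 = 32.0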

Matrix algebra. An m×n matrix is a rectangular array of mn scalars in m rows and n columns. A
matrix is denoted by an uppercase letter. Its entries are identified by the corresponding lowercase
letter, with double subscripts: A = [a_ij], where i = 1, ..., m indexes the rows and j = 1, ..., n the columns.
A is called square when m = n. The a_ij with i = j are called diagonal entries. m×1 and 1×n matrices
are columns and rows with m and n entries, and 1×1 matrices are handled like scalars. You can add
or subtract m×n matrices by adding or subtracting corresponding entries, just as you add or subtract
columns and rows. A matrix whose entries are all zeros is called a zero matrix, and denoted by O.
You can also define the negative of a matrix, and the product sA of a scalar s and a matrix A.
You can multiply an m×n matrix A by a vector X with n entries; their product AX is the vector with m
entries, the products of the rows of A by X.
You can verify the following manipulation rules:
OX = O = AO
(sA)X = (AX)s = A(Xs)
(-A)X = -(AX) = A(-X)
(A + B)X = AX + BX          A(X + Y) = AX + AY          (distributive laws).
Similarly, you can multiply a row X^t with m entries by an m×n matrix A; their product X^t A is the row
with n entries, the products of X^t by the columns of A.
Similar manipulation rules hold. Further, you can check the associative law
X^t(AY) = (X^t A)Y.
You can multiply an l×m matrix A by an m×n matrix B. Their product AB is an l×n matrix that you can
describe two ways. Its columns are the products of A by the columns of B, and its rows are the
products of the rows of A by B:

The i,k-th entry of AB is thus a_i1 b_1k + … + a_im b_mk. You can check these manipulation rules:

The definition of the product of two matrices was motivated by the formulas for linear substitution.
Every m×n matrix A has a transpose A^t, the n×m matrix whose j,i-th entry is the i,j-th entry of A.
The following manipulation rules hold:
A^tt = A          O^t = O
(A + B)^t = A^t + B^t          (sA)^t = s(A^t).
If A is an l×m matrix and B is an m×n matrix, then (AB)^t = B^t A^t.

Matrix inverses

A matrix A is called invertible if there's a matrix B such that AB = I = BA. Clearly, invertible matrices
must be square. A zero matrix O isn't invertible, because OB = O ≠ I for any B. Also, some nonzero
square matrices aren't invertible: there are nonzero 2×2 matrices A such that AB ≠ I for every 2×2
matrix B, so such a matrix A has no inverse. When there exists B such that AB = I = BA,
it's unique; if also AC = I = CA, then B = BI = B(AC) = (BA)C = IC = C. Thus an invertible matrix A
has a unique inverse A^-1 such that AA^-1 = I = A^-1A.
Clearly, I is invertible and I^-1 = I.
The inverse and transpose of an invertible matrix are invertible, and any
product of invertible matrices is invertible:
(A^-1)^-1 = A          (A^t)^-1 = (A^-1)^t          (AB)^-1 = B^-1 A^-1.

Determinants

The determinant of an n×n matrix A is

det A = Σψ (sign ψ) a_1ψ(1) a_2ψ(2) ··· a_nψ(n),

where the sum ranges over all n! permutations ψ of {1, ..., n}, and sign ψ = ±1 depending on
whether the permutation ψ is even or odd. In each term of the sum there's one factor from each row
and one from each column.
For the 2×2 case the determinant is
det A = a_11 a_22 − a_12 a_21.
Matrix
 A matrix is a two-dimensional array of numbers. Generally, matrices are represented by
an uppercase bold letter such as A.
 Since a matrix is two dimensional, each element is represented by a small letter with
two indices, such as aij, where i represents the row and j represents the column.
 A representation of an m×n matrix is shown below.

Matrix multiplication
 The result of multiplying two matrices A and B, denoted by C, has elements
cij, each of which is the dot product of the ith row of A with the jth column of the matrix B.
Diagonal Matrix
 A square matrix whose non-diagonal elements are zero is called a diagonal matrix.
For example:

Identity Matrix
 A square matrix whose non-diagonal elements are zero and diagonal elements are 1
is called an identity matrix. For example:

Transpose of a Matrix
 The transpose of matrix X, denoted by XT, is the result of flipping the rows and
columns of the matrix X.

Then the transpose of X is given by

Inverse of a Matrix
 The inverse of a square matrix X, denoted by X-1, is a special matrix which has
the property that their multiplication results in the identity matrix: X X-1 = X-1 X = I.
 Here, I is an identity matrix. X and X-1 are inverses of each other. An
instance of an inverse matrix is illustrated below.

 These two matrices are inverses of each other. When we multiply them, we get an
identity matrix.
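As a numerical illustration of the matrix rules above, the following Python/NumPy sketch (illustrative only; the matrices are arbitrary examples) checks the product, transpose, determinant and inverse:

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 1.0]])

C = A @ B                                   # c_ij = dot(ith row of A, jth column of B)
print(np.allclose((A @ B).T, B.T @ A.T))    # (AB)^T = B^T A^T -> True
print(np.linalg.det(A))                     # determinant: 1*4 - 2*3 = -2
A_inv = np.linalg.inv(A)                    # inverse exists because det(A) != 0
print(np.allclose(A @ A_inv, np.eye(2)))    # A A^-1 = I -> True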

State-space Concepts

State Space Concept

 The set of all possible states for a given problem is known as the state space of the problem.
 Searching is needed for the solution if the steps are not known beforehand and have to be found out.
 For a search process, the following things are required:
 1. Initial state description of the problem
 2. Set of legal operators
 3. Final or goal state

State space search is a process used in the field of computer science, including artificial intelligence (AI),
in which successive configurations or states of an instance are considered, with the intention of finding a
goal state with a desired property.

Problems are often modeled as a state space, a set of states that a problem can be in. The set of
states forms a graph where two states are connected if there is an operation that can be performed to
transform the first state into the second.

State space search often differs from traditional computer science search methods because the state
space is implicit: the typical state space graph is much too large to generate and store in memory.
Instead, nodes are generated as they are explored, and typically discarded thereafter. A solution to a
problem instance may consist of the goal state itself, or of a path from some initial state to the goal
state.

A state space is a set of descriptions or states.

∙ Each search problem consists of:


🡪One or more initial states.
🡪A set of legal actions- Actions are represented by operators or moves applied to
each state. For example, the operators in a state space representation of the 8-puzzle
problem are left, right, up and down.
🡪One or more goal states.
∙ The number of operators is problem dependent and specific to a particular state
space representation. The more operators, the larger the branching factor of the
state space. Thus, the number of operators should be kept to a minimum, e.g. in the 8-
puzzle, operations are defined in terms of moving the blank space instead of the tiles.

In state space search, a state space is formally represented as a tuple (S, A, Action(s),
Result(s, a), Cost(s, a)), in which:
∙ S is the set of all possible states
∙ A is the set of possible actions, not related to a particular state but regarding the whole
state space
∙ Action(s) is the function that establishes which actions are possible to perform in a certain state s
∙ Result(s, a) is the function that returns the state reached by performing action a in state s
∙ Cost(s, a) is the cost of performing action a in state s. In many state spaces this is a
constant, but this is not true in general.

🡪Water jug problem


In the water jug problem, we are provided with two jugs: one having the capacity
to hold 3 gallons of water and the other having the capacity to hold 4 gallons of water.
There is no other measuring equipment available and the jugs do not have any
kind of marking on them. So, the agent's task here is to fill the 4-gallon jug with 2
gallons of water by using only these two jugs and no other material. Initially, both
jugs are empty.
So, to solve this problem, the following set of production rules was proposed.
Production rules for solving the water jug problem
Here, let x denote the contents of the 4-gallon jug and y denote the contents of the 3-gallon jug.

S.No.  Initial state   Condition                 Final state        Description of action taken

1.     (x, y)          if x < 4                  (4, y)             Fill the 4-gallon jug completely

2.     (x, y)          if y < 3                  (x, 3)             Fill the 3-gallon jug completely

3.     (x, y)          if x > 0                  (x − d, y)         Pour some part out of the 4-gallon jug

4.     (x, y)          if y > 0                  (x, y − d)         Pour some part out of the 3-gallon jug

5.     (x, y)          if x > 0                  (0, y)             Empty the 4-gallon jug

6.     (x, y)          if y > 0                  (x, 0)             Empty the 3-gallon jug

7.     (x, y)          if x + y ≥ 4 and y > 0    (4, y − [4 − x])   Pour water from the 3-gallon jug into the 4-gallon jug until it is full

8.     (x, y)          if x + y ≥ 3 and x > 0    (x − [3 − y], 3)   Pour water from the 4-gallon jug into the 3-gallon jug until it is full

9.     (x, y)          if x + y ≤ 4 and y > 0    (x + y, 0)         Pour all the water from the 3-gallon jug into the 4-gallon jug

10.    (x, y)          if x + y ≤ 3 and x > 0    (0, x + y)         Pour all the water from the 4-gallon jug into the 3-gallon jug

The listed production rules contain all the actions that could be performed by
the agent in transferring the contents of the jugs. But, to solve the water jug problem in a
minimum number of moves, the following sequence of rules should be applied.
Solution of the water jug problem according to the production rules:

S.No.  4-gallon jug contents   3-gallon jug contents   Rule followed

1.     0 gallons               0 gallons               Initial state
2.     0 gallons               3 gallons               Rule no. 2
3.     3 gallons               0 gallons               Rule no. 9
4.     3 gallons               3 gallons               Rule no. 2
5.     4 gallons               2 gallons               Rule no. 7
6.     0 gallons               2 gallons               Rule no. 5
7.     2 gallons               0 gallons               Rule no. 9

At the 7th step we reach a state which is our goal state. Therefore, at this
state, our problem is solved.
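The same state space can also be searched mechanically. The sketch below is an illustrative breadth-first search in Python (the function names successors and solve are invented for this example; the moves mirror the production rules above). It finds a shortest 7-state solution equivalent to the one tabulated above.

from collections import deque

def successors(state):
    x, y = state                       # x: 4-gallon jug, y: 3-gallon jug
    moves = set()
    moves.add((4, y))                  # fill the 4-gallon jug
    moves.add((x, 3))                  # fill the 3-gallon jug
    moves.add((0, y))                  # empty the 4-gallon jug
    moves.add((x, 0))                  # empty the 3-gallon jug
    pour = min(y, 4 - x)
    moves.add((x + pour, y - pour))    # pour from the 3-gallon into the 4-gallon jug
    pour = min(x, 3 - y)
    moves.add((x - pour, y + pour))    # pour from the 4-gallon into the 3-gallon jug
    moves.discard(state)               # drop moves that change nothing
    return moves

def solve(start=(0, 0), goal_x=2):
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1][0] == goal_x:      # goal: 2 gallons in the 4-gallon jug
            return path
        for nxt in successors(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])

print(solve())   # a shortest solution, e.g. [(0,0), (0,3), (3,0), (3,3), (4,2), (0,2), (2,0)]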

🡪8-Puzzle
The 8-puzzle problem is a puzzle invented and popularized by Noyes Palmer
Chapman in the 1870s. It is played on a 3-by-3 grid with 8 square blocks labeled 1
through 8 and a blank square. Your goal is to rearrange the blocks so that they are in
order. You are permitted to slide blocks horizontally or vertically into the blank
square.
For example, given the initial state below, we may want the tiles to be moved so that
the following goal state is attained (the underscore marks the blank square).

Initial state        Final (goal) state
1 _ 3                1 2 3
4 2 5                4 5 6
7 8 6                7 8 _

The following shows a sequence of legal moves from the initial board position
(left) to the goal position (right); the underscore marks the blank square.

1 _ 3      1 2 3      1 2 3      1 2 3
4 2 5  →   4 _ 5  →   4 5 _  →   4 5 6
7 8 6      7 8 6      7 8 6      7 8 _
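The legal moves of the 8-puzzle can be generated programmatically. The following Python sketch (illustrative only; a board is represented as a tuple of 9 numbers with 0 marking the blank) returns every state reachable by one slide.

def neighbours(board):
    # board: tuple of 9 ints in row-major order, 0 marks the blank square
    states = []
    blank = board.index(0)
    row, col = divmod(blank, 3)
    # moving the blank up/down/left/right is the same as sliding a tile into it
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            swap = 3 * r + c
            new = list(board)
            new[blank], new[swap] = new[swap], new[blank]
            states.append(tuple(new))
    return states

start = (1, 0, 3,
         4, 2, 5,
         7, 8, 6)          # the initial board shown above, 0 = blank
print(neighbours(start))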

Concepts of Optimization

Optimization is the action of making something, such as a design, situation,
resource, or system, as effective as possible. Using the resemblance between a cost
function and an energy function, we can use highly interconnected neurons to solve
optimization problems. One such neural network is the Hopfield network, which
consists of a single layer containing one or more fully connected recurrent neurons.
This can be used for optimization.
Points to remember while using a Hopfield network for optimization −
∙ The energy function of the network must be at its minimum at the solution.
∙ It will find a satisfactory solution rather than select one out of the stored patterns.
∙ The quality of the solution found by the Hopfield network depends significantly on
the initial state of the network.

🡪Travelling Salesman Problem


Finding the shortest route travelled by the salesman is a classical computational
problem, which can be optimized by using a Hopfield neural network.
Basic Concept of TSP
The Travelling Salesman Problem (TSP) is a classical optimization problem in which a
salesman has to travel through n cities, which are connected with each other, keeping the cost
as well as the distance travelled minimum. For example, the salesman has to travel a
set of 4 cities A, B, C, D and the goal is to find the shortest circular tour, A-B-C-D,
so as to minimize the cost, which also includes the cost of travelling from the last city
D back to the first city A.

Matrix Representation

Actually, each tour of an n-city TSP can be expressed as an n × n matrix whose ith row
describes the ith city's location in the tour. This matrix, M, for 4 cities A, B, C, D can be
expressed as follows −
Solution by Hopfield

While considering the solution of this TSP by Hopfield network, every node in the
network corresponds to one element in the matrix.

Energy Function Calculation


For the solution to be optimal, the energy function must be at its minimum. On the basis of
the following constraints, we can calculate the energy function as follows −
Constraint-I

The first constraint, on the basis of which we will calculate the energy function, is that one
element must be equal to 1 in each row of matrix M and the other elements in each row
must be equal to 0, because each city can occur in only one position in the TSP tour. This
constraint can mathematically be written as follows −

Σj Mx,j = 1   for each city x ∈ {1, ..., n}

Now the energy function to be minimized, based on the above constraint, will contain
a term proportional to

Σx ( 1 − Σj Mx,j )²

Constraint-II
As we know, in a TSP tour each position can be occupied by only one city; hence in each
column of matrix M, one element must be equal to 1 and the other elements must be
equal to 0. This constraint can mathematically be written as follows −

Σx Mx,j = 1   for each position j ∈ {1, ..., n}

Now the energy function to be minimized, based on the above constraint, will contain
a term proportional to −

Σj ( 1 − Σx Mx,j )²

Cost Function Calculation

Let us suppose an n × n square matrix denoted by C is the cost matrix of the TSP
for n cities, where n > 0. The following parameters are used while calculating the cost
function −
∙ Cx,y − the element of the cost matrix that denotes the cost of travelling from city
x to city y.
∙ Adjacency of the outputs for cities A and B in the tour, which can be shown by the following relation −

As we know, in the matrix M the output value of each node can be either 0 or 1; hence for
every pair of cities A, B we can add the following terms to the energy function −

On the basis of the above cost function and constraint value, the final energy function
E can be given as follows –

Here, γ1 and γ2 are two weighting constants.
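For concreteness, one way to evaluate such an energy function for a given assignment matrix M and cost matrix C is sketched below in Python. This is only an illustrative formulation, assuming quadratic penalty terms for the two constraints and a tour-length term; the exact constants and the form of the cost term vary between texts.

import numpy as np

def tsp_energy(M, C, gamma1=1.0, gamma2=1.0):
    # M[x, i] = 1 if city x occupies position i of the tour, else 0
    # C[x, y] = cost of travelling from city x to city y
    n = M.shape[0]
    row_penalty = np.sum((M.sum(axis=1) - 1) ** 2)   # each city in exactly one position
    col_penalty = np.sum((M.sum(axis=0) - 1) ** 2)   # each position holds exactly one city
    tour_cost = 0.0
    for i in range(n):                               # cost between consecutive cities, wrapping D -> A
        nxt = (i + 1) % n
        tour_cost += M[:, i] @ C @ M[:, nxt]
    return gamma1 * row_penalty + gamma2 * col_penalty + tour_cost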


Optimization Concept
Optimization is the action of making the best or most effective use of a situation or resource.
Optimization is already a part of our life, even if you might not be aware of it. You are already
doing optimization in your daily life. For example, you wake up early in the morning and try to
catch a bus or train on time. Your objective is to catch the bus/train on time while
properly using your limited time early in the morning.
You go shopping and decide what to purchase within your budget constraint.
Here, you try to maximize your satisfaction within your budget and within a time
constraint.
There are so many examples in our lives where we try to optimize, maximizing or
minimizing our objectives subject to many other constraints such as time, cost,
etc.
Optimization methods are applied in many fields: economics, management, planning,
logistics, robotics, optimal design, engineering, signal processing, etc.
Phases
The objective of the first phase is the acquisition of information, which is to identify the
parameters that affect the system. These could be profit, time, potential energy, or any
other quantity or combination of quantities.
The second, very important, phase is the development of a model that integrates all the
parameters identified.
The third phase focuses on the evaluation and analysis of the performance of the
system.
Finally, an optimization phase is regularly used to improve system performance.
Any optimization problem has three basic ingredients (Arora 2003):
The optimization variables, also called the design variables.
The cost function, also called the objective function.
The constraints, expressed as equalities or inequalities.
Translating an optimization problem into a mathematical formulation is a critical step in
the process of solving the problem.
The decision process consists of the following steps:
Formulate the problem: In this first step, a decision problem is identified. Then,
an initial problem statement is made.
Model the problem: In this important step, a mathematical model is constructed
for the problem.
Optimize the problem: Once the problem is modeled, the solving procedure
generates a good solution to the problem.
Implement a solution: The solution is tested in practice by the decision maker and
is implemented if it is acceptable.

Types of optimization problems


Linear programming (LP): The objective function and constraints are linear. The
decision variables involved are scalar and continuous.
Nonlinear programming (NLP): The objective function and/or constraints are
nonlinear. The decision variables are scalar and continuous.
Integer programming (IP): The decision variables are scalar and integer.
Mixed integer linear programming (MILP): The objective function and constraints are
linear. The decision variables are scalar; some of them are integers, while others are
continuous variables.
Mixed integer nonlinear programming (MINLP): A nonlinear programming problem
involving both integer and continuous decision variables.
Discrete optimization: Problems involving discrete (integer) decision variables.
Optimal control: The decision variables are vectors.
Stochastic programming or stochastic optimization: Also called optimization
under uncertainty. In these problems, the objective function and/or constraints have
uncertain (random) variables.
Multi-objective optimization: Suppose we try to improve product performance while
trying to minimize the cost at the same time (Yang 2014). In this case, we are dealing
with a multi-objective optimization problem.
Dynamic optimization: In dynamic optimization problems, the objective function
is deterministic at any given point, but varies over time.
Optimization methods to solve problems
There are three major classes of algorithms to solve optimization problems:
Exact algorithms
Approximation algorithms
Heuristic algorithms
Exact algorithms are guaranteed to find the optimal solution, but may take an
exponential number of iterations. In practice, they are usually only applicable to small
instances, due to long running times caused by the complexity. They include
branch-and-bound, branch-and-cut, cutting-plane, and dynamic programming
algorithms.
Approximation algorithms provide near-optimal solutions. Generally these are pretty
good solutions obtained within a reasonable time; approximation algorithms provide
provable solution quality and provable run-time bounds.
Heuristic algorithms provide a solution quickly, but make no guarantee as to its
quality. Heuristic algorithms use trial and error, learning and adapting to solve
problems.

Other Optimization Techniques

🡪Iterated Gradient Descent Technique


Gradient descent, also known as steepest descent, is an iterative optimization
algorithm for finding a local minimum of a function. While minimizing the function, we
are concerned with the cost or error to be minimized (recall the Travelling Salesman
Problem). It is extensively used in deep learning, which is useful in a wide variety of
situations. The point to be remembered here is that we are concerned with local
optimization and not global optimization.
Main Working Idea
We can understand the main working idea of gradient descent with the help of the
following steps −
∙ First, start with an initial guess of the solution.
∙ Then, take the gradient of the function at that point.
∙ Later, repeat the process by stepping the solution in the negative direction of the
gradient.
By following the above steps, the algorithm will eventually converge to a point
where the gradient is zero.

Mathematical Concept
Suppose we have a function f(x) and we are trying to find the minimum of this
function. Following are the steps to find the minimum of f(x).
∙ First, give some initial value x0 for x.
∙ Now take the gradient ∇f of the function, with the intuition that the gradient will
give the slope of the curve at that x and its direction will point towards the increase
in the function, in order to find out the best direction to minimize it.
∙ Now change x as follows −
x(n+1) = x(n) − θ ∇f(x(n))
Here, θ > 0 is the step size (training rate) that forces the algorithm to
take small jumps.
Estimating the Step Size
Actually, a wrong step size θ may not lead to convergence, hence a careful selection of
the step size is very important. The following points must be remembered while
choosing the step size.

∙ Do not choose too large a step size, otherwise it will have a negative impact, i.e. it
will diverge rather than converge.
∙ Do not choose too small a step size, otherwise it will take a lot of
time to converge.
Some options with regard to choosing the step size −
∙ One option is to choose a fixed step size.
∙ Another option is to choose a different step size for every iteration.
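A minimal Python sketch of the gradient descent procedure described above, using a fixed step size (the example function and starting point are illustrative, not part of the notes):

def gradient_descent(grad, x0, step=0.1, iterations=100):
    # grad: function returning the gradient at x; step: the learning rate (theta)
    x = x0
    for _ in range(iterations):
        x = x - step * grad(x)      # step in the negative direction of the gradient
    return x

# Example: f(x) = (x - 3)^2 has gradient 2*(x - 3) and a minimum at x = 3
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # converges towards 3.0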
🡪Simulated Annealing
The basic concept of Simulated Annealing (SA) is motivated by annealing in
solids. In the process of annealing, if we heat a metal above its melting point and
cool it down, the structural properties will depend upon the rate of cooling. We
can also say that SA simulates the metallurgical process of annealing.
Use in ANN
SA is a stochastic computational method, inspired by the annealing analogy, for
approximating the global optimum of a given function. We can use SA to train
feed-forward neural networks.
Algorithm
Step 1 − Generate a random solution.
Step 2 − Calculate its cost using some cost function.
Step 3 − Generate a random neighboring solution.
Step 4 − Calculate the new solution cost by the same cost function.
Step 5 − Compare the cost of the new solution with that of the old solution:

− if Cost(new solution) < Cost(old solution), then move to the new solution.
Step 6 − Test for the stopping condition, which may be the maximum number of
iterations being reached or an acceptable solution being found.
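A minimal Python sketch of this algorithm, assuming a generic cost function and neighbour generator (illustrative names). In addition to Step 5 above, it uses the standard acceptance rule exp(−Δ/T), under which a worse solution is sometimes accepted; this is how simulated annealing escapes local minima in practice.

import math, random

def simulated_annealing(cost, neighbour, start, temp=1.0, cooling=0.95, steps=1000):
    current, current_cost = start, cost(start)
    for _ in range(steps):
        candidate = neighbour(current)              # Step 3: random neighbouring solution
        delta = cost(candidate) - current_cost      # Step 4: cost of the new solution
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta   # Step 5: move
        temp *= cooling                             # gradually cool the system
    return current, current_cost

# Example: minimise f(x) = x^2 starting from x = 5
best, best_cost = simulated_annealing(
    cost=lambda x: x * x,
    neighbour=lambda x: x + random.uniform(-0.5, 0.5),
    start=5.0)
print(best, best_cost)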

Learning Mechanisms:
The property that is of primary significance for a neural network is the ability of the
network to learn from its environment and to improve its performance through
learning.
The improvement in performance takes place over time in accordance with some
prescribed measure.
A neural network learns about its environment through an interactive process of
adjustments applied to its synaptic weights and bias levels. Ideally,
the network becomes more knowledgeable about its environment after each
iteration of the learning process.
We define learning in the context of neural networks as:
Learning is a process by which the free parameters of a neural network are adapted
through a process of stimulation by the environment in which the network is
embedded.
The type of learning is determined by the manner in which the parameter changes take
place.
This definition of the learning process implies the following sequence of events:
The neural network is stimulated by an environment.
The neural network undergoes changes in its free parameters as a result of this
stimulation.
The neural network responds in a new way to the environment because of the
changes that have occurred in its internal structure.

A prescribed set of well-defined rules for the solution of a learning problem is called a
learning algorithm.
We have a "kit of tools" represented by a diverse variety of learning algorithms, each of
which offers advantages of its own.
Basically, learning algorithms differ from each other in the way in which the adjustment
to a synaptic weight of a neuron is formulated.
Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor as a teacher.
Basically, supervised learning is learning in which we teach or train the machine using
data which is well labeled, meaning the data is already tagged with the correct
answer. After that, the machine is provided with a new set of examples (data) so that the
supervised learning algorithm analyses the training data (set of training examples) and
produces a correct outcome from the labeled data.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns and differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to
the machine. Therefore the machine is restricted to finding the hidden structure in unlabeled
data by itself.

Learning Process in Neural Networks


The main goal of an artificial neural network is to learn to recognize patterns
of input and to produce the corresponding output.

By changing the weights of the links between neurons, neural networks may acquire knowledge. The
network is provided with a labeled dataset throughout this procedure, referred to as training, and
the weights are repeatedly updated depending on any mistakes or differences between the network's
predictions and the true labels.
 Forward Propagation − As the input information moves forward through the network, the
weighted sum of the inputs is calculated at every neuron. An activation function that introduces
nonlinearity into the network is then applied to these values. Commonly used activation
functions for introducing non-linearities in the various layers include sigmoid, ReLU, and tanh.
 Loss Function − A loss function is used to calculate the difference between the output of
the network and the real labels. The kind of problem being addressed determines the loss
function to be used. For instance, whereas categorical cross-entropy is appropriate for multi-
class classification, mean squared error (MSE) is frequently employed for regression
tasks.
 Backpropagation − In neural networks, backpropagation is the key to acquiring knowledge.
It applies the chain rule of calculus to determine the gradients of the
loss function with respect to the weights of the network. The gradients give the magnitude and
direction of the weight modifications needed to reduce the loss.
 Gradient Descent − An optimisation procedure, such as gradient descent, is utilized for
modifying the weights once the gradients are known. The goal of gradient descent is to reach
a minimum point on the loss surface by iteratively adjusting the weights in the opposite direction
of the gradients. The reliability of training is frequently increased by using gradient descent
variants like stochastic gradient descent (SGD) and the Adam optimizer.
 Iterative Training − For a given number of epochs, or until convergence is reached, the forward
propagation, loss computation, backpropagation, and weight update steps are repeated.
With every pass, the network lowers the loss and improves its predictive ability.
Types of approaches of Learning in Neural Networks
 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning
Learning types: parameter learning and structure learning.

Learning Mechanisms:

1. Error-Correction Learning


 Error-correction learning is used to identify the error and to reduce it during the learning process.

 Notations used are:


 The output signal of neuron k is denoted by yk(n).

 The desired output signal of neuron k is denoted by dk(n).


 The error signal is denoted by ek(n), where ek(n) = dk(n) − yk(n).

 If ek(n) is equal to 0 then there is no error.


 If there is a difference, it needs to be minimized.
 This objective is achieved by minimizing a cost function or index of performance
E(n), defined in terms of the error signal ek(n) as
E(n) = ½ ek²(n)
 The step-by-step adjustments to the synaptic weights of neuron k are continued until
the system reaches a steady state.
 At that point the learning process is terminated.
 Minimization of the cost function leads to a learning rule commonly referred to as
the delta rule or Widrow-Hoff rule.
 According to the delta rule, the adjustment Δwkj(n) applied to the synaptic weight wkj
at time step n is defined by
Δwkj(n) = η ek(n) xj(n)
 where η is a positive constant that determines the rate of learning as we proceed
from one step in the learning process to another.

 It is therefore natural that we refer to η as the learning-rate parameter.

Having computed the synaptic adjustment Δwkj(n), the updated value of the synaptic
weight wkj is determined by
wkj(n + 1) = wkj(n) + Δwkj(n)
 In effect, wkj(n) and wkj(n + 1) may be viewed as the old and new values of the
synaptic weight wkj respectively. In computational terms we may also write
 wkj(n) = z⁻¹[wkj(n + 1)]

where z⁻¹ is the unit-delay operator; that is, z⁻¹ represents a storage element.
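A small Python sketch of one error-correction (delta rule) update for a single linear neuron, using the formulas above (the training loop and the numbers are illustrative):

import numpy as np

def delta_rule_step(w, x, d, eta=0.1):
    # w: weight vector, x: input vector, d: desired response, eta: learning rate
    y = np.dot(w, x)            # y_k(n): actual output of the neuron
    e = d - y                   # e_k(n) = d_k(n) - y_k(n): error signal
    w_new = w + eta * e * x     # delta (Widrow-Hoff) rule: dw_kj = eta * e_k(n) * x_j(n)
    return w_new, e

w = np.zeros(3)
x = np.array([1.0, 0.5, -1.0])
for _ in range(50):
    w, e = delta_rule_step(w, x, d=2.0)
print(w, e)                     # the error shrinks towards zero as the weights adapt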

2. Memory-Based Learning

 In memory-based learning, all (or most) of the past experiences are explicitly stored
in a large memory of correctly classified input-output examples
 {(xi, di)}, i = 1, ..., N, where xi denotes an input vector and di denotes the corresponding
desired response.
 Without loss of generality, we have restricted the desired response to be a scalar.
 For example, in a binary pattern classification problem there are two classes or
hypotheses, denoted by E1 and E2, to be considered.
 In this example, the desired response di takes the value 0 (or -1) for class E1 and the
value 1 for class E2.
When classification of a test vector xtest (not seen before) is required, the
algorithm responds by retrieving and analyzing the training data in a “local
neighbourhood” of xtest.

 All memory-based learning algorithms involve two essential ingredients:

 a. Criterion used for defining the local neighbourhood of the test vector xtest.
 b. Learning rule applied to the training examples in the local neighbourhood
of xtest.

The algorithms differ from each other in the way in which these two
ingredients are defined.

 In a simple yet effective type of memory-based learning known as the nearest
neighbour rule, the local neighbourhood is defined as the training example which lies in the
immediate neighbourhood of the test vector xtest. In particular, the vector
x'N ∈ {x1, x2, ..., xN}
is said to be the nearest neighbour of xtest if
min over i of d(xi, xtest) = d(x'N, xtest),
 where d(xi, xtest) is the Euclidean distance between the vectors xi and xtest.
 The class associated with the minimum distance, that is, with vector x'N, is reported as the
classification of xtest. This rule is independent of the underlying distribution responsible
for generating the training examples.
 Cover and Hart (1967) have formally studied the nearest neighbour rule as a tool for
pattern classification.
The analysis is based on two assumptions:

 a. The classified examples (xi, di) are independently and identically distributed
(iid), according to the joint probability distribution of the example (x, d).
 b. The sample size N is infinitely large.

 Under these two assumptions, it is shown that the probability of classification error
incurred by the nearest neighbour rule is bounded above by twice the Bayes probability
of error, that is, the minimum probability of error over all decision rules.
 In this sense, it may be said that half the classification information in a training set of
infinite size is contained in the nearest neighbour, which is a surprising result.

 A variant of the nearest neighbour classifier is the k-nearest neighbour classifier,


which proceeds as:
 a. Identify the k classified patterns which lie nearest to the test vector xtest,
for some integer k.
 b. Assign xtest to class (hypothesis) which is most frequently represented in the
k nearest neighbours to xtest (i.e., use a majority vote to make the classification).
 Thus, the k-nearest neighbour classifier acts like an averaging device.

(Figure: (a) two classes of examples, labeled 1 and 0; (b) a test vector d; (c) classification of the test vector by the highest vote among its nearest neighbours.)
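A compact Python sketch of the k-nearest neighbour rule described above (the stored examples and labels are illustrative):

import numpy as np
from collections import Counter

def knn_classify(x_test, X, d, k=3):
    # X: stored input vectors x_i; d: their class labels d_i
    distances = np.linalg.norm(X - x_test, axis=1)   # Euclidean distances d(x_i, x_test)
    nearest = np.argsort(distances)[:k]              # indices of the k closest examples
    return Counter(d[nearest].tolist()).most_common(1)[0][0]   # majority vote

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
d = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X, d, k=3))   # -> 0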
3. Hebbian Learning (Generalised Learning):

 Hebb’s postulate of learning is the oldest and the most famous of all learning rules; it is
named in honor of the neuropsychologist Hebb (1949).
 When an axon of cell A is near enough to excite a cell B and repeatedly or persistently
takes part in firing it, some growth process or metabolic change takes place in one or
both cells such that A's efficiency, as one of the cells firing B, is increased.
 Hebb proposed this change as a basis of associative learning (at the cellular level),
which would result in an enduring modification in the activity pattern of a spatially
distributed “assembly of nerve cells”.

This statement is made in a neurobiological context. We may expand and rephrase it
as a two-part rule:
 a. If two neurons on either side of a synapse are activated simultaneously
(i.e., synchronously), then the strength of that synapse is selectively increased.

 b. If two neurons on either side of a synapse are activated asynchronously, then


that synapse is selectively weakened or eliminated.
 Such a synapse is called a Hebbian synapse. More precisely, we define a Hebbian
synapse as a synapse which uses a time-dependent, highly local, and strongly
interactive mechanism to increase synaptic efficiency as a function of the correlation
between the presynaptic and postsynaptic activities.

Four key properties which characterise a Hebbian synapse:


 i. Time-Dependent Mechanism:
 This mechanism refers to the fact that the modifications in a Hebbian synapse
depend on the exact time of occurrence of the presynaptic and postsynaptic
signals.

 ii. Local Mechanism:


 By its very nature, a synapse is the transmission site where information-bearing
signals (representing ongoing activity in the presynaptic and postsynaptic
units) are in spatio-temporal contiguity. This locally available information is
used by a Hebbian synapse to produce a local synaptic modification which is
input specific.
 iii. Interactive Mechanism:
 The occurrence of a change in a Hebbian synapse depends on signals on
both sides of the synapse. That is, a Hebbian form of learning depends on a
“true interaction” between presynaptic and postsynaptic signals in the sense that
we cannot make a prediction from either one of these two activities by itself.
 iv. Conjunctional or Correlation Mechanism:
 The condition for a change in synaptic efficiency is the conjunction of
presynapticand postsynaptic signals.
 Thus, according to this interpretation, the co-occurrence, of presynaptic and
postsynaptic signals (within a short interval of time) is sufficient to produce the
synaptic modification.
It is for this reason that a Hebbian synapse is sometimes referred to as a conjunctional
synapse or correlational synapse.

 Correlation – It is a mutual relationship or connection


 Positive correlation: When the presynaptic neuron and the postsynaptic neuron fire
at the same time, we call it positive correlation. According to Hebb's
hypothesis there is an increase in the strength of the synapse wkj.

 Negative correlation: When two neurons fire asynchronously, that is,

when neuron A is excited B is in a relaxed state and when B is excited A is in a relaxed
state, the result is a weakening of the synapse wkj.
 Uncorrelated: When both neurons are in a relaxed state, there is no change in the
synapse.
 Classification of synaptic modifications (synaptic enhancement and depression):
 Hebbian: It increases its strength with positively correlated presynaptic and postsynaptic
signals.
 Anti-Hebbian: It weakens positively correlated presynaptic and postsynaptic
signals and strengthens negatively correlated presynaptic and postsynaptic signals.
Both Hebbian and anti-Hebbian synapses are time dependent, highly local and
strongly interactive in nature.
 Non-Hebbian: Does not involve Hebbian or anti-Hebbian mechanisms.

Mathematical Modeling
 Consider a synaptic weight wkj of neuron k with presynaptic and postsynaptic
signals denoted by xj and yk respectively.
 The adjustment applied to wkj at time step n is expressed as
Δwkj(n) = F(yk(n), xj(n))
 where F(·,·) is a function of both the postsynaptic and presynaptic signals.

Hebb’s Hypothesis
 The change in weight is expressed as
Δwkj(n) = ƞ yk(n) xj(n)
 where ƞ is the learning rate; this is also referred to as the activity product rule.
Covariance Hypothesis
 The adjustment applied to the synaptic weight is defined as
Δwkj(n) = ƞ (xj(n) − x̄)(yk(n) − ȳ)
 where ƞ is the learning rate and

 x̄, ȳ are the average values of xj and yk.
In particular, the covariance hypothesis

 allows convergence to a nontrivial state, which is reached when xj = x̄ or yk = ȳ, and

 allows a prediction of both synaptic enhancement and synaptic depression.
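Both hypotheses can be written as one-line weight updates. The following Python sketch (with illustrative signal values) implements the activity product rule and the covariance rule:

import numpy as np

def hebb_update(w, x, y, eta=0.01):
    # activity product (Hebb) rule: dw_kj = eta * y_k * x_j
    return w + eta * y * x

def covariance_update(w, x, y, x_bar, y_bar, eta=0.01):
    # covariance hypothesis: dw_kj = eta * (x_j - x_bar) * (y_k - y_bar)
    return w + eta * (x - x_bar) * (y - y_bar)

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])     # presynaptic signals x_j
y = 1.0                           # postsynaptic signal y_k
print(hebb_update(w, x, y))       # weights grow only where x_j and y_k are both active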
4. Competitive Learning
 In competitive learning, as the name implies, the output neurons of a neural
network compete among themselves to become active.

 Whereas in a neural network based on Hebbian learning several output neurons may be
active simultaneously, in competitive learning only a single output neuron is active at
any one time.
 It is this feature which makes competitive learning highly suited to discover
statistically salient features which may be used to classify a set of input patterns.
 There are three basic elements to a competitive learning rule:
 i. A set of neurons which are all the same except for some randomly distributed
synaptic weights, and which therefore respond differently to a given set of
input patterns.
 ii. A limit imposed on the ‘strength’ of each neuron.
iii. A mechanism which permits the neurons to compete for the right to respond
to a given subset of inputs, such that only one output neuron or only one
neuron per group, is active (i.e., ‘on’) at a time. The neuron which wins the
competition is called a winner-takes-all neuron.
 Accordingly the individual neurons of the network learn to specialize on ensembles of
similar patterns; in so doing they become feature detectors for different classes of input
patterns.
 In the simplest form of competitive learning, the neural network has a single layer of
output neurons, each of which is fully connected to the input nodes. The network may
include feedback connections among the neurons.
 In the network architecture described herein, the feedback connections perform lateral
inhibition, with each neuron tending to inhibit the neuron to which it is laterally
connected. In contrast, the feed forward synaptic connections in the network of all are
excitatory.
For a neuron k to be the winning neuron, its induced local field vk for a specified
input pattern x must be the largest among all the neurons in the network. The
output signal yk of winning neuron k is set equal to one; the output signals of all
the neurons which lose the competition are set equal to zero.
We thus write:
yk = 1 if vk > vj for all j, j ≠ k, and yk = 0 otherwise.
 where the induced local field vk represents the combined action of all the forward and
feedback inputs to neuron k.

 Let ωkj denote the synaptic weight connecting input node j to neuron k. Suppose that
each neuron is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are
positive), which is distributed among its input nodes; that is, Σj ωkj = 1 for all k.
 A neuron then learns by shifting synaptic weights from its inactive to active input
nodes. If a neuron does not respond to a particular input pattern, no learning takes
place in that neuron.
 If a particular neuron wins the competition, each input node of that neuron relinquishes
(voluntarily gives up) some proportion of its synaptic weight, and the weight
relinquished is then distributed equally among the active input nodes. According to the
standard competitive learning rule, the change Δωkj applied to synaptic weight ωkj is
defined by
Δωkj = ƞ (xj − ωkj) if neuron k wins the competition, and Δωkj = 0 if neuron k loses the competition,
 where ƞ is the learning-rate parameter, xj is the jth input signal and ωkj is the synaptic
weight from node j to neuron k.
 This rule has the overall effect of moving the synaptic weight vector ωk of the
winning neuron k towards the input pattern x.
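A minimal Python sketch of the standard competitive (winner-takes-all) learning rule described above, with illustrative input patterns:

import numpy as np

def competitive_step(W, x, eta=0.1):
    # W: one weight vector per output neuron (rows); x: input pattern
    v = W @ x                     # induced local fields v_k
    k = np.argmax(v)              # the winner-takes-all neuron
    W[k] += eta * (x - W[k])      # dw_kj = eta * (x_j - w_kj) for the winning neuron only
    return k

rng = np.random.default_rng(0)
W = rng.random((3, 2))
W /= W.sum(axis=1, keepdims=True)             # each neuron starts with a fixed weight budget
for x in np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]):
    print(competitive_step(W, x))             # index of the winning neuron for each pattern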

5. Boltzmann Learning (Boltzmann Machines)

A Boltzmann Machine (BM) is a type of stochastic neural network that can learn to represent
complex probability distributions. It consists of:

 Visible Units: These correspond to the input features.


 Hidden Units: These help in capturing complex patterns by interacting with the visible units.
 Symmetric Weights: The weights between the visible and hidden units are symmetric, meaning
the connection from a visible unit to a hidden unit has the same weight as the connection from
the hidden unit to the visible unit.

Boltzmann Learning Algorithm

The learning process in Boltzmann Machines can be described in the following steps:

1. Initialization: Start with random weights and biases.


2. Forward Pass: Calculate the probabilities of the hidden units being active, given the
visible units. This involves applying a sigmoid activation function to the weighted sum of
inputs.
3. Sampling: Based on these probabilities, sample the hidden units' states (often using
stochastic methods like Gibbs sampling).
4. Reconstruction: Use the sampled hidden states to reconstruct the visible units. This step
involves another forward pass through the network.
5. Update Weights: Adjust the weights based on the difference between the original and
reconstructed visible units. This is done by comparing the data distribution (from the
original visible states) with the distribution obtained from the reconstruction. The weight
update typically follows a rule like:

Δwij = η (⟨vi hj⟩data − ⟨vi hj⟩recon)
where η (eta) is the learning rate,

⟨vi hj⟩data is the expectation over the data distribution, and

⟨vi hj⟩recon is the expectation over the reconstruction distribution.

6. Iterate: Repeat the process for many iterations to refine the weights.
Restricted Boltzmann Machines (RBMs)

In practice, the standard Boltzmann Machine is often replaced by a variant called the Restricted
Boltzmann Machine (RBM), which has the following simplifications:

 Bipartite Structure: The network is divided into visible and hidden layers with no intra-layer
connections (i.e., units within the same layer do not connect to each other).
 Training Efficiency: The absence of intra-layer connections simplifies the training process and
allows for more efficient learning algorithms, such as Contrastive Divergence.
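A compact Python sketch of one CD-1 weight update for an RBM, following the update rule above (biases are omitted and all names are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, v0, eta=0.1):
    # W: weight matrix (visible x hidden); v0: one binary visible vector
    p_h0 = sigmoid(v0 @ W)                        # forward pass: P(h = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0    # sample the hidden states
    p_v1 = sigmoid(h0 @ W.T)                      # reconstruction of the visible units
    p_h1 = sigmoid(p_v1 @ W)                      # hidden probabilities for the reconstruction
    # dW = eta * (<v h>_data - <v h>_recon)
    W += eta * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    return W

W = np.zeros((4, 2))
v = np.array([1.0, 0.0, 1.0, 1.0])
for _ in range(100):
    W = cd1_step(W, v)
print(np.round(W, 2))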

Learning Objectives

The primary goal of Boltzmann Learning is to model the joint probability distribution of the
visible units. This allows the network to capture complex dependencies between variables and
learn a probabilistic representation of the input data.

Applications

Boltzmann Machines, especially RBMs, are used in various applications, including:

 Dimensionality Reduction: RBMs can be used as feature extractors.


 Pretraining for Deep Networks: RBMs can initialize the weights of deep neural networks, which
can then be fine-tuned with backpropagation.
 Generative Models: They are used to generate new samples similar to the training data.

Challenges

Training Boltzmann Machines can be computationally expensive due to the need for sampling
and the complexity of the optimization process. Techniques like Contrastive Divergence and
parallel sampling methods have been developed to address these challenges.

In summary, Boltzmann Learning in neural networks involves probabilistic modeling and


stochastic optimization to learn complex patterns in data. While practical implementations often
use variants like RBMs, the core principles of Boltzmann Machines remain influential in the
field of machine learning.
