ANN Unit 2
Clearly,
−O = O,   X + O = X,
−(−X) = X,   X + (−X) = O.
You can multiply a vector X by a scalar s: the product is the vector whose entries are s times the entries of X. This product is also written sX. You should verify these manipulation rules:
You can add and subtract rows X^t and Y^t with the same number of entries, and define the zero row
and the negative of a row. The product of a scalar and a row is
s X^t = s[x1, ..., xn] = [s x1, ..., s xn].
These rules are useful:
X^t ± Y^t = (X ± Y)^t
−(X^t) = (−X)^t
s(X^t) = (sX)^t.
Finally, you can multiply a row by a vector with the same number of entries to get their scalar product:
X^t Y = x1 y1 + · · · + xn yn.
With a little algebra you can verify the following manipulation rules:
Matrix algebra An m×n matrix is a rectangular array of mn scalars in m rows and n columns. A
matrix is denoted by an uppercase letter. Its entries are identified by the corresponding lowercase
letter, with double subscripts:
A is called square when m = n. The aij with i = j are called diagonal entries. m×1 and 1×n matrices
are columns and rows with m and n entries, and 1×1 matrices are handled like scalars. You can add
or subtract m×n matrices by adding or subtracting corresponding entries, just as you add or subtract
columns and rows. A matrix whose entries are all zeros is called a zero matrix, and denoted by O.
You can also define the negative of a matrix, and the product sA of a scalar s and a matrix A.
You can multiply an m×n matrix A by a vector X with n entries; their product AX is the vector with m
entries, the products of the rows of A by X:
You can verify the following manipulation rules:
OX = O = AO
(sA)X = (AX)s = A(Xs)
(−A)X = −(AX) = A(−X)
Similar manipulation rules hold. Further, you can check the associative law
X^t(AY) = (X^t A)Y.
You can multiply an l×m matrix A by an m×n matrix B. Their product AB is an l×n matrix that you can
describe two ways. Its columns are the products of A by the columns of B, and its rows are the
products of the rows of A by B:
The i,k-th entry of AB is thus ai1b1k + … + aimbmk. You can check these manipulation rules:
The definition of the product of two matrices was motivated by the formulas for linear substitution;
from
Every m×n matrix A has a transpose A^t, the n×m matrix whose j,i-th entry is the i,j-th entry of A:
(A^t)^t = A   O^t = O
(A + B)^t = A^t + B^t   (sA)^t = s(A^t).
If A is an l×m matrix and B is an m×n matrix, then (AB)^t = B^t A^t.
Matrix inverses
A matrix A is called invertible if there's a matrix B such that AB = I = BA. Clearly, invertible matrices
must be square. A zero matrix O isn't invertible, because OB = O ≠ I for any B. Also, some nonzero
square matrices aren't invertible. For example, for every 2×2 matrix B,
Hence the leftmost matrix in this display isn't invertible. When there exists B such that AB = I = BA,
it's unique; if also AC = I = CA, then B = BI = B(AC) = (BA)C = IC = C. Thus an invertible matrix A
has a unique inverse A^-1 such that AA^-1 = I = A^-1A.
Clearly, I is invertible and I^-1 = I.
The inverse and transpose of an invertible matrix are invertible, and any
product of invertible matrices is invertible:
(A^-1)^-1 = A   (A^t)^-1 = (A^-1)^t   (AB)^-1 = B^-1 A^-1.
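These identities are easy to spot-check numerically. Below is a minimal sketch using NumPy (the choice of library is an assumption of this illustration, not something the notes prescribe) that verifies (AB)^t = B^t A^t and (AB)^-1 = B^-1 A^-1 for a pair of random invertible matrices.

    import numpy as np

    rng = np.random.default_rng(0)
    # Adding 3I makes each matrix strictly diagonally dominant, hence invertible.
    A = rng.random((3, 3)) + 3 * np.eye(3)
    B = rng.random((3, 3)) + 3 * np.eye(3)

    # (AB)^t = B^t A^t
    print(np.allclose((A @ B).T, B.T @ A.T))                      # True

    # (AB)^-1 = B^-1 A^-1
    print(np.allclose(np.linalg.inv(A @ B),
                      np.linalg.inv(B) @ np.linalg.inv(A)))       # True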
Determinants
Matrix
A matrix is a two-dimensional array of numbers. Generally matrices are represented by an uppercase bold letter such as A.
Since a matrix is two-dimensional, each element is represented by a small letter with two indices such as aij, where i represents the row and j represents the column.
A representation of an m×n matrix is shown below:
[ a11  a12  ...  a1n ]
[ a21  a22  ...  a2n ]
[ ...  ...  ...  ... ]
[ am1  am2  ...  amn ]
Matrix multiplication
The product of two matrices A and B, denoted by C, has elements cij, where cij is the dot product of the ith row of A with the jth column of B.
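As a small illustration of this definition (a sketch added here, not part of the original notes), the following Python function builds C entry by entry as the dot product of the ith row of A with the jth column of B.

    def mat_mul(A, B):
        """C[i][j] is the dot product of row i of A with column j of B."""
        inner = len(B)                        # rows of B must equal columns of A
        assert all(len(row) == inner for row in A), "inner dimensions must agree"
        return [[sum(A[i][k] * B[k][j] for k in range(inner))
                 for j in range(len(B[0]))]
                for i in range(len(A))]

    # A 2x3 matrix times a 3x2 matrix gives a 2x2 matrix.
    A = [[1, 2, 3],
         [4, 5, 6]]
    B = [[7, 8],
         [9, 10],
         [11, 12]]
    print(mat_mul(A, B))                      # [[58, 64], [139, 154]]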
Diagonal Matrix
A square matrix whose non-diagonal elements are zero is called a diagonal matrix. For example:
[ 2  0  0 ]
[ 0  5  0 ]
[ 0  0  9 ]
Identity Matrix
A square matrix whose non-diagonal elements are zero and whose diagonal elements are 1 is called an identity matrix. For example:
[ 1  0  0 ]
[ 0  1  0 ]
[ 0  0  1 ]
Transpose of a Matrix
The transpose of a matrix X, denoted by X^T, is the result of flipping the rows and columns of the matrix X.
Inverse of a Matrix
The inverse of a square matrix X, denoted by X^-1, is the special matrix with the property that multiplying the two gives the identity matrix:
X X^-1 = X^-1 X = I
Above, I is an identity matrix; X and X^-1 are inverses of each other. An instance of an inverse matrix is illustrated below.
These two matrices are inverses of each other. When we multiply them, we get an identity matrix.
State-space Concepts
The set of all possible states for a given problem is known as the state space of the problem.
Searching is needed for the solution if the steps are not known beforehand and have to be found out.
For a search process the following things are required:
1. Initial state description of the problem
2. Set of legal operators
3. Final or goal state
State space search is a process used in the field of computer science, including artificial intelligence (AI),
in which successive configurations or states of an instance are considered, with the intention of finding a
goal state with a desired property.
Problems are often modeled as a state space, a set of states that a problem can be in. The set of
states forms a graph where two states are connected if there is an operation that can be performed to
transform the first state into the second.
State space search often differs from traditional computer science search methods because the state
space is implicit: the typical state space graph is much too large to generate and store in memory.
Instead, nodes are generated as they are explored, and typically discarded thereafter. A solution to a
problem instance may consist of the goal state itself, or of a path from some initial state to the goal
state.
In state space search, a state space is formally represented as a tuple (S, A, Action(s), Result(s,a), Cost(s,a)), in which:
∙ S is the set of all possible states.
∙ A is the set of possible actions, not related to a particular state but regarding the whole state space.
∙ Action(s) is the function that establishes which actions are possible to perform in a certain state.
∙ Result(s,a) is the function that returns the state reached by performing action a in state s.
∙ Cost(s,a) is the cost of performing action a in state s. In many state spaces it is a constant, but this is not true in general.
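A minimal sketch of how this tuple drives a search is given below; the function names (goal_test, actions, result) mirror the components above but are assumptions of this illustration rather than a fixed API. The water jug example that follows reuses this routine.

    from collections import deque

    def breadth_first_search(start, goal_test, actions, result):
        """Explore the implicit state-space graph level by level.

        start       -- the initial state
        goal_test   -- function s -> bool, true when s has the desired property
        actions(s)  -- the actions applicable in state s     (Action(s) above)
        result(s,a) -- the state reached by doing a in s     (Result(s,a) above)
        Returns the list of states from start to a goal state, or None.
        """
        frontier = deque([start])
        parent = {start: None}            # doubles as the set of generated states
        while frontier:
            s = frontier.popleft()
            if goal_test(s):
                path = []
                while s is not None:      # walk the parent links back to the start
                    path.append(s)
                    s = parent[s]
                return list(reversed(path))
            for a in actions(s):
                nxt = result(s, a)
                if nxt not in parent:     # nodes are generated only as they are explored
                    parent[nxt] = s
                    frontier.append(nxt)
        return None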
Water Jug Problem production rules (state (x, y): x = gallons in the 4-gallon jug, y = gallons in the 3-gallon jug):
4.  (x, y)  if y > 0         →  (x, y − d)          Pour some water out of the 3-gallon jug
7.  (x, y)  if (x + y) < 7   →  (4, y − [4 − x])    Pour water from the 3-gallon jug to fill the 4-gallon jug
8.  (x, y)  if (x + y) < 7   →  (x − [3 − y], 3)    Pour water from the 4-gallon jug to fill the 3-gallon jug
9.  (x, y)  if (x + y) < 4   →  (x + y, 0)          Pour all the water from the 3-gallon jug into the 4-gallon jug
10. (x, y)  if (x + y) < 3   →  (0, x + y)          Pour all the water from the 4-gallon jug into the 3-gallon jug
The listed production rules contain all the actions that could be performed by the agent in transferring the contents of the jugs. But, to solve the water jug problem in a minimum number of moves, the following set of rules should be applied in the given sequence.
Solution of the water jug problem according to the production rules:
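As a concrete sketch (reusing the breadth_first_search routine above), the production rules can be encoded and searched automatically. The goal is assumed here to be exactly 2 gallons in the 4-gallon jug, and the rules are paraphrased as fill, empty, and pour actions rather than copied verbatim from the table.

    def jug_actions(state):
        x, y = state                      # x: 4-gallon jug, y: 3-gallon jug
        acts = []
        if x < 4: acts.append("fill4")
        if y < 3: acts.append("fill3")
        if x > 0: acts.append("empty4")
        if y > 0: acts.append("empty3")
        if x > 0 and y < 3: acts.append("pour4to3")
        if y > 0 and x < 4: acts.append("pour3to4")
        return acts

    def jug_result(state, action):
        x, y = state
        if action == "fill4":  return (4, y)
        if action == "fill3":  return (x, 3)
        if action == "empty4": return (0, y)
        if action == "empty3": return (x, 0)
        if action == "pour4to3":
            t = min(x, 3 - y)             # pour until the 3-gallon jug is full or x runs out
            return (x - t, y + t)
        if action == "pour3to4":
            t = min(y, 4 - x)             # pour until the 4-gallon jug is full or y runs out
            return (x + t, y - t)

    path = breadth_first_search((0, 0), lambda s: s[0] == 2, jug_actions, jug_result)
    print(path)   # a shortest sequence of states: 7 states (6 moves), ending with x = 2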
🡪8-Puzzle
The 8-puzzle problem is a puzzle invented and popularized by Noyes Palmer
Chapman in the 1870s. It is played on a 3-by-3 grid with 8 square blocks labeled 1
through 8 and a blank square. Your goal is to rearrange the blocks so that they are in
order. You are permitted to slide blocks horizontally or vertically into the blank
square.
For example, given the initial state shown below, we may want the tiles to be moved so that
the following goal state is attained.
Initial State        Final State
1  _  3              1  2  3
4  2  5              4  5  6
7  8  6              7  8  _
The following shows a sequence of legal moves from an initial board position (left) to the goal position (right).
1  _  3      1  2  3      1  2  3      1  2  3
4  2  5  →   4  _  5  →   4  5  _  →   4  5  6
7  8  6      7  8  6      7  8  6      7  8  _
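The legal moves can also be generated mechanically: a move swaps the blank with a horizontally or vertically adjacent tile. The sketch below (an illustration, with 0 standing for the blank square) enumerates the boards reachable in one move; feeding it to a search such as the breadth-first sketch earlier reproduces move sequences like the one above.

    def puzzle_neighbors(board):
        """board is a tuple of 9 entries read row by row; 0 marks the blank square."""
        i = board.index(0)                        # position of the blank
        row, col = divmod(i, 3)
        neighbors = []
        for drow, dcol in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # up, down, left, right
            r, c = row + drow, col + dcol
            if 0 <= r < 3 and 0 <= c < 3:
                j = 3 * r + c
                new = list(board)
                new[i], new[j] = new[j], new[i]   # slide the adjacent tile into the blank
                neighbors.append(tuple(new))
        return neighbors

    start = (1, 0, 3,
             4, 2, 5,
             7, 8, 6)
    print(len(puzzle_neighbors(start)))           # 3 boards are one move away from this state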
Concepts of Optimization
Matrix Representation
Actually, each tour of an n-city TSP can be expressed as an n × n matrix whose ith row
describes the ith city's position in the tour. This matrix, M, for 4 cities A, B, C, D can be
expressed as follows:
Solution by Hopfield
While considering the solution of this TSP by a Hopfield network, every node in the
network corresponds to one element of the matrix.
Constraint-I
The first constraint, on the basis of which we will calculate the energy function, is that one
element must be equal to 1 in each row of matrix M and the other elements in each row
must be equal to 0, because each city can occur in only one position in the TSP tour. This
constraint can mathematically be written as follows −
Now the energy function to be minimized, based on the above constraint, will contain
a term proportional to
Constraint-II
As we know, in the TSP each position in the tour can be occupied by only one city; hence in each
column of matrix M, one element must be equal to 1 and the other elements must be
equal to 0. This constraint can mathematically be written as follows −
Now the energy function to be minimized, based on the above constraint, will contain
a term proportional to −
Let us suppose a square (n × n) matrix denoted by C is the cost matrix of the TSP
for n cities, where n > 0. The following parameters are used while calculating the cost function:
∙ Cx,y − the element of the cost matrix that denotes the cost of travelling from city x to city y.
∙ Adjacency of the elements.
As we know, the output value of each node in the matrix can be either 0 or 1, hence for
every pair of cities A, B we can add the following terms to the energy function −
On the basis of the above cost function and constraint value, the final energy function
E can be given as follows –
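Since the displayed terms are not reproduced above, the following should be read only as a rough sketch of the idea rather than the exact energy function of the notes: it scores a candidate matrix M by penalising rows and columns whose entries do not sum to 1 (constraints I and II) and by adding the travel cost of the tour that M encodes. The penalty weights A, B, D are assumptions of this sketch.

    def tsp_energy(M, C, A=1.0, B=1.0, D=1.0):
        """M[x][i] = 1 if city x occupies position i of the tour, else 0.
        C[x][y] is the cost of travelling from city x to city y."""
        n = len(M)
        row_penalty = sum((sum(M[x]) - 1) ** 2 for x in range(n))        # constraint I
        col_penalty = sum((sum(M[x][i] for x in range(n)) - 1) ** 2      # constraint II
                          for i in range(n))
        tour_cost = 0.0
        for x in range(n):
            for y in range(n):
                for i in range(n):
                    # city y follows city x when y sits in the next position (cyclically)
                    tour_cost += C[x][y] * M[x][i] * M[y][(i + 1) % n]
        return A * row_penalty + B * col_penalty + D * tour_cost

    C = [[0, 2, 9],
         [2, 0, 6],
         [9, 6, 0]]
    M = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 1]]               # the tour A -> B -> C -> A
    print(tsp_energy(M, C))       # 0 + 0 + (2 + 6 + 9) = 17.0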
Mathematical Concept
Suppose we have a function f(x) and we are trying to find the minimum of this
function. Following are the steps to find the minimum of f(x).
∙ First, give some initial value x0 for x.
∙ Now take the gradient ∇f of the function, with the intuition that the gradient gives
the slope of the curve at that x and its direction points toward the increase in
the function, in order to find the best direction to minimize it.
∙ Now change x as follows:
x := x − θ ∇f(x)
Here, θ > 0 is the training rate (step size) that forces the algorithm to take small jumps.
Estimating Step Size
Actually, a wrong step size θ may prevent convergence, hence a careful selection of the
step size is very important. The following points must be remembered while choosing the step size.
∙ Do not choose too large a step size, otherwise it will have a negative impact, i.e. it
will diverge rather than converge.
∙ Do not choose too small a step size, otherwise it will take a lot of time to converge.
Some options with regard to choosing the step size:
∙ One option is to choose a fixed step size.
∙ Another option is to choose a different step size for every iteration.
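The sketch below applies these steps to a one-variable function, f(x) = (x − 3)^2, chosen only for illustration, using a fixed step size θ.

    def gradient_descent(grad, x0, theta=0.1, iterations=100):
        """Repeat x := x - theta * grad(x); theta is the step size (training rate)."""
        x = x0
        for _ in range(iterations):
            x = x - theta * grad(x)
        return x

    # Minimise f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3).
    x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
    print(round(x_min, 4))        # approximately 3.0, the true minimiser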
🡪Simulated Annealing
The basic concept of Simulated Annealing (SA) is motivated by annealing in
solids. In the process of annealing, if we heat a metal above its melting point and
cool it down, then the structural properties will depend upon the rate of cooling. We
can also say that SA simulates the metallurgical process of annealing.
Use in ANN
SA is a stochastic computational method, inspired by the annealing analogy, for
approximating the global optimum of a given function. We can use SA to train
feed-forward neural networks.
Algorithm
Step 1 − Generate a random solution.
Step 2 − Calculate its cost using some cost function.
Step 3 − Generate a random neighboring solution.
Step 4 − Calculate the new solution cost by the same cost function.
Step 5 − Compare the cost of the new solution with that of the old solution as follows:
if Cost(new solution) < Cost(old solution), then move to the new solution.
Step 6 − Test for the stopping condition, which may be that the maximum number of
iterations is reached or that an acceptable solution has been obtained.
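A minimal sketch of the algorithm follows, again minimising (x − 3)^2 as a stand-in cost function. The acceptance of occasional worse moves with probability exp(−Δcost / T) and the geometric cooling schedule are standard simulated annealing choices assumed here; they are not spelled out in the steps above.

    import math
    import random

    def simulated_annealing(cost, neighbor, x0, T0=1.0, cooling=0.95, iterations=500):
        x, c = x0, cost(x0)                 # Steps 1-2: a starting solution and its cost
        T = T0
        for _ in range(iterations):         # Step 6: stop after a fixed number of iterations
            x_new = neighbor(x)             # Step 3: a random neighbouring solution
            c_new = cost(x_new)             # Step 4: its cost
            # Step 5: always accept improvements; accept worse moves with prob. exp(-dc/T)
            if c_new < c or random.random() < math.exp(-(c_new - c) / T):
                x, c = x_new, c_new
            T *= cooling                    # cool the temperature down
        return x, c

    random.seed(0)
    x_best, c_best = simulated_annealing(cost=lambda x: (x - 3) ** 2,
                                         neighbor=lambda x: x + random.uniform(-0.5, 0.5),
                                         x0=0.0)
    print(round(x_best, 2), round(c_best, 4))   # typically ends near x = 3 with cost near 0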
Learning Mechanisms:
The property that is of primary significance for a neural network is the ability of the
network to learn from its environment and to improve its performance through learning.
The improvement in performance takes place over time in accordance with some
prescribed measure.
A neural network learns about its environment through an interactive process of
adjustments applied to its synaptic weights and bias levels. Ideally, the network becomes
more knowledgeable about its environment after each iteration of the learning process.
We define learning in the context of neural networks as:
Learning is a process by which the free parameters of a neural network are adapted
through a process of stimulation by the environment in which the network is
embedded.
The type of learning is determined by the manner in which the parameter changes take
place.
This definition of the learning process implies the following sequence of events:
1. The neural network is stimulated by an environment.
2. The neural network undergoes changes in its free parameters as a result of this stimulation.
3. The neural network responds in a new way to the environment because of the changes
that have occurred in its internal structure.
By changing the weights of the links between neurons, neural networks acquire knowledge.
During this procedure, referred to as training, the network is provided with a labeled dataset,
and the weights are repeatedly updated depending on the errors or differences between the
network's predictions and the true labels.
Forward Propagation − As the input information moves forward through the network, the
weighted total of the inputs is calculated at every neuron. An activation function, which
introduces nonlinearity into the network, is then applied to these values. Activation
functions such as sigmoid, ReLU, and tanh are frequently used to introduce non-linearities
in the various layers.
Loss Function − A loss function is used to calculate the difference between the output of
the network and the true labels. The kind of problem being addressed determines the loss
function to be used. For instance, mean squared error (MSE) is frequently employed for
regression tasks, whereas categorical cross-entropy is appropriate for multi-class
classification.
Backpropagation − In neural networks, backpropagation is the key to acquiring knowledge.
It applies the chain rule of calculus to determine the gradients of the loss function with
respect to the weights of the network. The gradients give the magnitude and direction of
the weight modifications needed to reduce the loss.
Gradient Descent − An optimisation procedure, such as gradient descent, is used to modify
the weights once the gradients are known. The goal of gradient descent is to reach a
minimum of the loss by iteratively adjusting the weights in the direction opposite to the
gradients. Variants of gradient descent such as stochastic gradient descent (SGD) and the
Adam optimizer are frequently used to make training more reliable.
Iterative Training − The forward propagation, loss computation, backpropagation, and
weight-update steps are repeated for a given number of epochs or until convergence is
reached. With every pass the network lowers the loss and improves its predictive ability.
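The sketch below ties these steps together for a tiny network: one hidden layer of four sigmoid units, MSE loss, and plain full-batch gradient descent on the XOR data. The architecture, data, and learning rate are assumptions of this illustration, not a prescribed implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
    T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)                 # input -> hidden
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)                 # hidden -> output
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    eta = 1.0                                                     # learning rate

    for epoch in range(5000):                                     # iterative training
        H = sigmoid(X @ W1 + b1)                                  # forward propagation
        Y = sigmoid(H @ W2 + b2)
        loss = np.mean((Y - T) ** 2)                              # MSE loss
        dY = 2 * (Y - T) / len(X) * Y * (1 - Y)                   # backpropagation (chain rule)
        dW2, db2 = H.T @ dY, dY.sum(axis=0)
        dH = dY @ W2.T * H * (1 - H)
        dW1, db1 = X.T @ dH, dH.sum(axis=0)
        W1 -= eta * dW1; b1 -= eta * db1                          # gradient descent step
        W2 -= eta * dW2; b2 -= eta * db2

    print(round(float(loss), 4))      # the loss shrinks over the course of training
    print(np.round(Y.ravel(), 2))     # outputs, to be compared with the targets [0, 1, 1, 0]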
Types of Learning Approaches in Neural Networks
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Learning types: parameter learning and structure learning.
1. Error-Correction Learning
The step-by-step adjustments to the synaptic weights of neuron k are continued until
the system reaches a steady state.
At that point the learning process is terminated.
Minimization of the cost function leads to a learning rule commonly referred to as
the delta rule or Widrow-Hoff rule.
According to the delta rule, the adjustment Δwkj(n) applied to the synaptic weight wkj
at time step n is defined by
Δwkj(n) = η ek(n) xj(n)
where η is a positive constant that determines the rate of learning as we proceed
from one step in the learning process to another, ek(n) is the error signal of neuron k,
and xj(n) is the input signal at synapse j.
Having computed the synaptic adjustment Δwkj(n), the updated value of the synaptic
weight wkj is determined by
wkj(n + 1) = wkj(n) + Δwkj(n)
In effect, wkj(n) and wkj(n + 1) may be viewed as the old and new values of the
synaptic weight wkj respectively. In computational terms we may also write
wkj(n) = z^-1[wkj(n + 1)]
where z^-1 is the unit-delay operator.
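A one-neuron sketch of this update is shown below; the linear output, the desired response d = 3, and the two-entry input are arbitrary choices made only to keep the illustration short.

    def delta_rule_step(w, x, d, eta=0.1):
        """One time step of the delta (Widrow-Hoff) rule for a single linear neuron."""
        y = sum(wj * xj for wj, xj in zip(w, x))              # neuron output y_k(n)
        e = d - y                                             # error signal e_k(n) = d - y
        return [wj + eta * e * xj for wj, xj in zip(w, x)]    # w_kj(n+1) = w_kj(n) + eta*e*x_j

    w = [0.0, 0.0]
    for _ in range(50):                                       # repeat until (near) steady state
        w = delta_rule_step(w, x=[1.0, 2.0], d=3.0)
    print([round(wj, 2) for wj in w])                         # [0.6, 1.2]; then w . x = 3.0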
2. Memory-Based Learning
In memory-based learning, all (or most) of the past experiences are explicitly stored
in a large memory of correctly classified input-output examples
{(xi, di)}, i = 1, ..., N,
where xi denotes an input vector and di denotes the corresponding desired response.
Without loss of generality, we have restricted the desired response to be a scalar.
For example, in a binary pattern classification problem there are two classes of
hypotheses, denoted by E1 and E2, to be considered.
In this example, the desired response di takes the value 0 (or −1) for class E1 and the
value 1 for class E2.
When classification of a test vector xtest (not seen before) is required, the
algorithm responds by retrieving and analyzing the training data in a “local
neighborhood” of xtest.
All memory-based learning algorithms involve two essential ingredients:
a. The criterion used for defining the local neighbourhood of the test vector xtest.
b. The learning rule applied to the training examples in the local neighbourhood of xtest.
The algorithms differ from each other in the way in which these two ingredients are defined.
In the simplest version, known as the nearest neighbour rule, the test vector is assigned the
class of the stored example that lies closest to it. The classical analysis of this rule rests on
two assumptions:
a. The classified examples (xi, di) are independently and identically distributed
(iid), according to the joint probability distribution of the example (x, d).
b. The sample size N is infinitely large.
Under these two assumptions, it is shown that the probability of classification error
incurred by the nearest neighbour rule is bounded above by twice the Bayes probability
of error, that is, the minimum probability of error over all decision rules.
In this sense, it may be said that half the classification information in a training set of
infinite size is contained in the nearest neighbour, which is a surprising result.
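A minimal sketch of the nearest-neighbour version of memory-based learning is given below: the whole labelled set is kept in memory, and a test vector is assigned the class of its closest stored example. Euclidean distance and the toy data are assumptions of this sketch.

    import math

    def nearest_neighbor_classify(memory, x_test):
        """memory is a list of (x_i, d_i) pairs; return d_i of the x_i closest to x_test."""
        def distance(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        x_nearest, d_nearest = min(memory, key=lambda pair: distance(pair[0], x_test))
        return d_nearest

    memory = [((0.0, 0.0), 0), ((0.2, 0.1), 0),            # class E1, labelled 0
              ((1.0, 1.0), 1), ((0.9, 1.2), 1)]            # class E2, labelled 1
    print(nearest_neighbor_classify(memory, (0.8, 0.9)))   # 1, the label of the closest example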
3. Hebbian Learning
Hebb’s postulate of learning is the oldest and the most famous of all learning rules; it is
named in honor of the neuropsychologist Hebb (1949).
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently
takes part in firing it, some growth process or metabolic change takes place in one or
both cells such that A’s efficiency, as one of the cells firing B, is increased.
Hebb proposed this change as a basis of associative learning (at the cellular level),
which would result in an enduring modification in the activity pattern of a spatially
distributed “assembly of nerve cells”.
Mathematical Modeling
Consider a synaptic weight wkj of neuron k with presynaptic and postsynaptic signals
denoted by xj and yk respectively.
The adjustment applied to wkj at time step n is expressed as
Δwkj(n) = η yk(n) xj(n)
where η is the learning rate. This is also referred to as the activity product rule.
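A small sketch of the activity product rule follows; the presynaptic and postsynaptic signal values are arbitrary illustrations.

    def hebbian_update(w_kj, x_j, y_k, eta=0.1):
        """Activity product rule: delta w_kj(n) = eta * y_k(n) * x_j(n)."""
        return w_kj + eta * y_k * x_j

    w = 0.0
    for x_j, y_k in [(1.0, 1.0), (0.5, 0.8), (1.0, 0.9)]:   # repeatedly correlated activity
        w = hebbian_update(w, x_j, y_k)
    print(round(w, 3))   # 0.23: coincident pre/post activity strengthens the synapse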
Covariance Hypothesis
The adjustment applied to the synaptic weight is defined as
Δwkj(n) = η (xj(n) − x̄)(yk(n) − ȳ)
where x̄ and ȳ denote the time-averaged values of the presynaptic signal xj and the
postsynaptic signal yk.
4. Competitive Learning
Whereas in a neural network based on Hebbian learning several output neurons may be
active simultaneously, in competitive learning only a single output neuron is active at
any one time.
It is this feature that makes competitive learning highly suited to discovering
statistically salient features which may be used to classify a set of input patterns.
There are three basic elements to a competitive learning rule:
i. A set of neurons which are all the same except for some randomly distributed
synaptic weights, and which therefore respond differently to a given set of
input patterns.
ii. A limit imposed on the ‘strength’ of each neuron.
iii. A mechanism which permits the neurons to compete for the right to respond
to a given subset of inputs, such that only one output neuron or only one
neuron per group, is active (i.e., ‘on’) at a time. The neuron which wins the
competition is called a winner-takes-all neuron.
Accordingly the individual neurons of the network learn to specialize on ensembles of
similar patterns; in so doing they become feature detectors for different classes of input
patterns.
In the simplest form of competitive learning, the neural network has a single layer of
output neurons, each of which is fully connected to the input nodes. The network may
include feedback connections among the neurons.
In the network architecture described here, the feedback connections perform lateral
inhibition, with each neuron tending to inhibit the neuron to which it is laterally
connected. In contrast, the feedforward synaptic connections in the network are all
excitatory.
For a neuron k to be the winning neuron, its induced local field vk for a specified
input pattern x must be the largest among all the neurons in the network. The
output signal yk of winning neuron k is set equal to one; the output signals of all
the neurons which lose the competition are set equal to zero.
We thus write:
yk = 1 if vk > vj for all j, j ≠ k;  yk = 0 otherwise,
where the induced local field vk represents the combined action of all the forward and
feedback inputs to neuron k.
Let ωkj denote the synaptic weight connecting input node j to neuron k. Suppose that
each neuron is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are
positive), which is distributed among its input nodes; that is,
Σj ωkj = 1 for all k.
A neuron then learns by shifting synaptic weights from its inactive to active input
nodes. If a neuron does not respond to a particular input pattern, no learning takes
place in that neuron.
If a particular neuron wins the competition, each input node of that neuron relinquishes
(voluntarily gives up) some proportion of its synaptic weight, and the weight
relinquished is then distributed equally among the active input nodes. According to the
standard competitive learning rule, the change Δωkj applied to synaptic weight ωkj is
defined by
Δωkj = η (xj − ωkj)   if neuron k wins the competition,
Δωkj = 0              if neuron k loses the competition,
where η is the learning-rate parameter, xj is the signal at the jth input node, and ωkj is the
synaptic weight from node j to neuron k.
This rule has the overall effect of moving the synaptic weight vector ωk of the
winning neuron k towards the input pattern x.
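The sketch below carries out one step of this winner-takes-all rule for a small layer of neurons; the input pattern, the initial weights, and the learning rate are arbitrary choices of the illustration.

    import numpy as np

    def competitive_step(W, x, eta=0.5):
        """W[k] is the weight vector of neuron k; only the winner's weights move toward x."""
        v = W @ x                            # induced local fields v_k
        k = int(np.argmax(v))                # winner-takes-all neuron
        W = W.copy()
        W[k] += eta * (x - W[k])             # delta w_kj = eta * (x_j - w_kj), winner only
        return W, k

    W = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.1, 0.9]])
    x = np.array([0.2, 0.8])
    W, winner = competitive_step(W, x)
    print(winner, np.round(W[winner], 2))    # 2 [0.15 0.85]: the winner moves toward x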
Boltzmann Machines
A Boltzmann Machine (BM) is a type of stochastic neural network that can learn to represent
complex probability distributions. It consists of visible units, hidden units, and symmetrically
weighted connections between units, with each unit taking stochastic binary states.
The learning process in Boltzmann Machines can be described in the following steps:
1. Initialize the weights to small random values.
2. Positive phase: clamp a training vector on the visible units, sample the hidden units, and collect the statistics ⟨vi hj⟩data.
3. Negative phase: let the network run freely (or reconstruct the visible units from the hidden units) and collect ⟨vi hj⟩recon.
4. Compute the required weight changes from the two sets of statistics.
5. Update the weights:
Δwij = η (⟨vi hj⟩data − ⟨vi hj⟩recon)
where η (eta) is the learning rate.
6. Iterate: Repeat the process for many iterations to refine the weights.
Restricted Boltzmann Machines (RBMs)
In practice, the standard Boltzmann Machine is often replaced by a variant called the Restricted
Boltzmann Machine (RBM), which has the following simplifications:
Bipartite Structure: The network is divided into visible and hidden layers with no intra-layer
connections (i.e., units within the same layer do not connect to each other).
Training Efficiency: The absence of intra-layer connections simplifies the training process and
allows for more efficient learning algorithms, such as Contrastive Divergence.
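A compact sketch of Contrastive Divergence (CD-1) for a binary RBM is shown below. The layer sizes, learning rate, toy data, and the omission of bias terms are all simplifying assumptions of this illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    n_visible, n_hidden, eta = 4, 2, 0.1
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    data = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1]], dtype=float)        # toy binary training vectors

    for epoch in range(200):
        v0 = data
        h_prob0 = sigmoid(v0 @ W)                       # positive phase: hidden given data
        h0 = (rng.random(h_prob0.shape) < h_prob0).astype(float)
        v_prob1 = sigmoid(h0 @ W.T)                     # negative phase: reconstruct visibles
        h_prob1 = sigmoid(v_prob1 @ W)
        # CD-1 update: eta * ( <v h>_data - <v h>_recon ), averaged over the batch
        W += eta * (v0.T @ h_prob0 - v_prob1.T @ h_prob1) / len(data)

    print(np.round(sigmoid(data @ W), 2))               # hidden-unit probabilities after learning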
Learning Objectives
The primary goal of Boltzmann Learning is to model the joint probability distribution of the
visible units. This allows the network to capture complex dependencies between variables and
learn a probabilistic representation of the input data.
Applications
Challenges
Training Boltzmann Machines can be computationally expensive due to the need for sampling
and the complexity of the optimization process. Techniques like Contrastive Divergence and
parallel sampling methods have been developed to address these challenges.