CCS355 Neural Networks and Deep Learning
After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal of reducing the cost function, we modify the parameters using the gradient descent algorithm over the given data.
The cost function represents the discrepancy between the predicted output of the model and the
actual output. The goal of gradient descent is to find the set of parameters that minimizes this
discrepancy and improves the model’s performance.
The algorithm operates by calculating the gradient of the cost function, which indicates the
direction and magnitude of steepest ascent. However, since the objective is to minimize the cost
function, gradient descent moves in the opposite direction of the gradient, known as the negative
gradient direction.
By iteratively updating the model’s parameters in the negative gradient direction, gradient
descent gradually converges towards the optimal set of parameters that yields the lowest cost.
The learning rate, a hyperparameter, determines the step size taken in each iteration, influencing
the speed and stability of convergence.
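The iterative update described above can be sketched in a few lines of Python; the one-parameter quadratic cost J(theta) = (theta - 3)^2 and the learning rate used here are illustrative assumptions, not from the text:

```python
# Minimal gradient descent sketch on an assumed toy cost J(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). The minimum is at theta = 3.

def gradient(theta):
    # First-order derivative of the cost at the current point
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, alpha=0.1, iterations=100):
    theta = theta0
    for _ in range(iterations):
        # Step in the negative gradient direction, scaled by the learning rate
        theta = theta - alpha * gradient(theta)
    return theta

theta_opt = gradient_descent(theta0=0.0)
print(round(theta_opt, 4))  # converges towards the minimum at theta = 3
```

Each iteration moves theta against the slope, so the parameter slides toward the minimum; shrinking alpha slows but stabilizes this motion.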
Gradient descent can be applied to various machine learning algorithms, including linear
regression, logistic regression, neural networks, and support vector machines. It provides a
general framework for optimizing models by iteratively refining their parameters based on the
cost function.
Intuitively, imagine standing on a hillside: the best way down is to observe the ground and find where the land descends. From that position, take a step in the descending direction, and iterate this process until we reach the lowest point.
Gradient descent is an iterative optimization algorithm for finding the local minimum of a
function.
To find the local minimum of a function using gradient descent, we must take steps proportional
to the negative of the gradient (move away from the gradient) of the function at the current point.
If we take steps proportional to the positive of the gradient (moving towards the gradient), we
will approach a local maximum of the function, and the procedure is called Gradient Ascent.
Gradient descent was originally proposed by Cauchy in 1847. It is also known as steepest descent.
The goal of the gradient descent algorithm is to minimize the given function (say cost function).
To achieve this goal, it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at the current point.
2. Make a step (move) in the direction opposite to the gradient, i.e., move from the current point by alpha times the gradient at that point.
Alpha is called the learning rate, a tuning parameter in the optimization process. It decides the length of the steps.
Batch gradient descent updates the model’s parameters using the gradient of the entire training
set. It calculates the average gradient of the cost function for all the training examples and
updates the parameters in the opposite direction. Batch gradient descent guarantees convergence
to the global minimum, but can be computationally expensive and slow for large datasets.
Mini-batch gradient descent updates the model’s parameters using the gradient of a small subset
of the training set, known as a mini-batch. It calculates the average gradient of the cost function
for the mini-batch and updates the parameters in the opposite direction. Mini-batch gradient
descent combines the advantages of both batch and stochastic gradient descent, and is the most
commonly used method in practice. It is computationally efficient and less noisy than stochastic
gradient descent, while still being able to converge to a good solution.
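As a sketch of the mini-batch procedure, the following fits a one-weight linear model y = w·x on made-up data; the data set, batch size, and learning rate are illustrative assumptions, not from the text:

```python
# Mini-batch gradient descent sketch: average the gradient over a small random
# subset of the training set, then step opposite to that average gradient.
import random

random.seed(0)
xs = [i / 10.0 for i in range(100)]
ys = [2.0 * x for x in xs]          # assumed data: true weight is 2.0

def grad_on(batch_x, batch_y, w):
    # Average gradient of the squared-error cost 0.5*(w*x - y)^2 over the batch
    return sum((w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)

w, alpha, batch_size = 0.0, 0.01, 10
for epoch in range(50):
    idx = list(range(len(xs)))
    random.shuffle(idx)                         # new mini-batch split each epoch
    for start in range(0, len(idx), batch_size):
        chosen = idx[start:start + batch_size]  # one mini-batch of indices
        g = grad_on([xs[i] for i in chosen], [ys[i] for i in chosen], w)
        w -= alpha * g                          # step opposite to the gradient
print(round(w, 3))  # converges to the true weight 2.0
```

Using a batch of 10 instead of all 100 examples gives noisier but much cheaper updates, which is exactly the batch/stochastic trade-off described above.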
Figure: Cost along the z-axis and parameters (thetas) along the x-axis and y-axis (source: ResearchGate)
It can also be visualized using contours. A contour plot shows a 3-D surface in two dimensions, with the parameters along both axes and the response as a contour. The value of the response increases away from the center and is the same along each ring. The response is directly proportional to the distance of a point from the center (along a given direction).
Figure: Gradient descent on a contour plot (source: Coursera)
We have the direction we want to move in, now we must decide the size of the step we must
take.
● If the learning rate is too high, we might overshoot the minimum and keep bouncing without ever reaching it.
● If the learning rate is too small, training may take too long to converge.
Note: As the gradient decreases while moving towards the local minima, the size of the step
decreases. So, the learning rate (alpha) can be constant over the optimization and need not be
varied iteratively.
Local Minima
The cost function may consist of many minimum points. The gradient may settle on any one of the minima, depending on the initial point (i.e., the initial parameters theta) and the learning rate. Therefore, the optimization may converge to different points for different starting points and learning rates.
Figure: Convergence of the cost function from different starting points (source: Gfycat)
1. Local Optima: Gradient descent can converge to local optima instead of the global
optimum, especially if the cost function has multiple peaks and valleys.
2. Learning Rate Selection: The choice of learning rate can significantly impact the
performance of gradient descent. If the learning rate is too high, the algorithm may overshoot the
minimum, and if it is too low, the algorithm may take too long to converge.
3. Overfitting: Gradient descent can overfit the training data if the model is too complex or
the learning rate is too high. This can lead to poor generalization performance on new data.
4. Convergence Rate: The convergence rate of gradient descent can be slow for large
datasets or high-dimensional spaces, which can make the algorithm computationally expensive.
5. Saddle Points: In high-dimensional spaces, the gradient of the cost function can have
saddle points, which can cause gradient descent to get stuck in a plateau instead of converging to
a minimum.
To overcome these challenges, several variations of gradient descent have been developed, such
as adaptive learning rate methods, momentum-based methods, and second-order methods.
Additionally, choosing the right regularization method, model architecture, and hyperparameters
can also help improve the performance of gradient descent.
NNDL - Unit 2 - Notes

UNIT II
Associative Memory and Unsupervised Learning Networks

Syllabus
Training Algorithms for Pattern Association - Autoassociative Memory Network - Heteroassociative Memory Network - Bidirectional Associative Memory (BAM) - Hopfield Networks - Iterative Autoassociative Memory Networks - Temporal Associative Memory Network - Fixed Weight Competitive Nets - Kohonen Self-Organizing Feature Maps - Learning Vector Quantization - Counter Propagation Networks - Adaptive Resonance Theory Network.

Contents
2.1 Training Algorithms for Pattern Association
2.2 Associative Memory Network
2.3 Kohonen Self-Organizing Feature Maps
2.4 Learning Vector Quantization
2.5 Counter Propagation Networks
2.6 Adaptive Resonance Theory Network
2.7 Two Marks Questions with Answers

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
Fig.: Input signals and output signals of the network.
• Using Hebb's law, we can express the adjustment applied to the weight w_ij at iteration p in the following form:
Δw_ij(p) = F[y_j(p), x_i(p)]
• where F[y_j(p), x_i(p)] is a function of both postsynaptic and presynaptic activities.
• As a special case, we can represent Hebb's law as
Δw_ij(p) = α y_j(p) x_i(p)
w_i(new) = w_i(old) + x_i y
Hebb training algorithm (flowchart steps):
1. Initialize weights.
2. Activate input: x_i = s_i
3. Activate output: y = t
4. Weight update: w_i(new) = w_i(old) + x_i y
5. Bias update: b(new) = b(old) + y
6. Stop.
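The Hebb training steps can be sketched in Python; the bipolar AND training pairs used below are an illustrative choice, not taken from the text:

```python
# Hebb rule sketch for a single output neuron with bipolar inputs and targets.

def hebb_train(samples):
    # samples: list of (input vector s, target t)
    n = len(samples[0][0])
    w = [0.0] * n          # 1. initialize weights
    b = 0.0
    for s, t in samples:
        x = list(s)        # 2. activate input: x_i = s_i
        y = t              # 3. activate output: y = t
        for i in range(n): # 4. weight update: w_i(new) = w_i(old) + x_i * y
            w[i] += x[i] * y
        b += y             # 5. bias update: b(new) = b(old) + y
    return w, b

# Bipolar AND patterns (illustrative)
data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
w, b = hebb_train(data)
print(w, b)  # [2.0, 2.0] -2.0
```

One pass over the four patterns produces weights that correctly separate the bipolar AND function, illustrating the purely local, one-shot nature of Hebbian updates.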
Delta Rule
• An important generalization of the perceptron training algorithm was presented by Widrow and Hoff as the least mean square (LMS) learning procedure, also known as the delta rule.
• The learning rule was applied to the adaptive linear element, also named Adaline.
• The perceptron learning rule uses the output of the threshold function for learning. The delta rule uses the net output without further mapping into output values -1 or +1.
Fig.: Adaline - input pattern switches, gains (weights), summer, quantizer and reference switch.
y = Σ (i = 1 to n) w_i x_i + θ, where θ = w_0
• In a simple physical implementation, this device consists of a set of controllable resistors connected to a circuit which can sum up currents caused by the input voltage signals.
• The error for pattern p is
E_p = (t_p - o_p)^2
where t_p is the target output and o_p is the actual output of the Adaline.
• The derivative of E_p with respect to each weight w_i is
∂E_p/∂w_i = -2 (t_p - o_p) x_i
• To decrease E_p by gradient descent, the update formula for w_i on the p-th input-output pattern is
Δw_i = η (t_p - o_p) x_i
• The delta rule tries to minimize squared errors; it is also referred to as the least mean square learning procedure or Widrow-Hoff learning rule.
• Features of the delta rule are as follows:
1. Simplicity.
2. Distributed learning: learning is not reliant on central control of the network.
3. Online learning: weights are updated after presentation of each pattern.
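The delta (Widrow-Hoff) rule can be sketched as follows; the toy training data approximating t = x1 + x2, the learning rate, and the epoch count are illustrative assumptions, not from the text:

```python
# Delta rule sketch: Adaline weights are updated with eta * (t - o) * x,
# using the net output o (no quantizer), as described above.

def adaline_train(samples, eta=0.1, epochs=100):
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, t in samples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # net output
            for i in range(n):
                w[i] += eta * (t - o) * x[i]          # delta rule update
    return w

# Assumed data consistent with t = x1 + x2
data = [([1, 0], 1), ([0, 1], 1), ([1, 1], 2), ([0, 0], 0)]
w = adaline_train(data)
print([round(wi, 3) for wi in w])  # close to [1.0, 1.0]
```

Because updates use the continuous net output, the weights converge smoothly to the least-squares solution rather than jumping between thresholded values.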
• In an associative memory, data can be searched for a value in a single memory cycle rather than using a software loop.
• Associative memories can be implemented using networks with or without feedback. Such associative neural networks are used to associate one set of vectors with another set of vectors, say input and output patterns.
• A pattern is presented to the network either as an input or as an initial state, and the output pattern is observed at the outputs of some neurons constituting the network.
• Associative memories belong to a class of neural networks that learn according to a certain recording algorithm. They require information a priori, and their connectivity matrices most often need to be formed in advance. Writing into memory produces changes in the neural interconnections. Reading of the stored information from memory, named recall, is a transformation of the input signals by the network.
• All memory information is spatially distributed throughout the network. Associative memory enables a parallel search within stored data. The purpose of the search is to output one or all stored items that match the search argument and to retrieve them entirely or partially.
• Fig. 2.2.1 shows a block diagram of an associative memory.
Auto-associative Memory
• Auto-associative networks are a special subset of the hetero-associative networks, in which each vector is associated with itself, i.e. y_i = x_i for i = 1, ..., m.
• The simplest version of auto-associative memory is the linear associator, which is a two-layer feed-forward fully connected neural network where the output is constructed in a single feed-forward computation.
Fig.: An auto-associative network maps inputs x_1, ..., x_n to themselves; a hetero-associative network maps them to different outputs y_k.
• A Hopfield network is created by supplying input data vectors, or pattern vectors, corresponding to the different classes. These patterns are called class patterns.
Fig.: A Hopfield network of three units (Unit 1, Unit 2, Unit 3), each connected to the others.
• The Hopfield model consists of a single layer of processing elements where each unit is connected to every other unit in the network other than itself.
• The output of each neuron is a binary number in {-1, 1}. The output vector is the state vector. Starting from an initial state (given as the input vector), the state of the network changes from one to another like an automaton. If the state converges, the point to which it converges is called the attractor.
• In its simplest form, the output function is the sign function, which yields 1 for arguments greater than or equal to 0 and -1 otherwise.
• The connection weight matrix W of this type of network is square and symmetric. The units in the Hopfield model act as both input and output units.
• A Hopfield network consists of "n" totally coupled units. Each unit is connected to all other units except itself. The network is symmetric because the weight w_ij for the connection between unit i and unit j is equal to the weight w_ji of the connection from unit j to unit i. The absence of a connection from each unit to itself avoids a permanent feedback of its own state value.
• Hopfield networks are typically used for classification problems with binary pattern vectors.
• The Hopfield model is classified into two categories:
1. Discrete Hopfield Model
2. Continuous Hopfield Model
• In both the discrete and continuous Hopfield network, weights are trained in a one-shot fashion and not trained incrementally as was done in the case of the Perceptron and MLP.
• In the discrete Hopfield model, the units use a slightly modified bipolar output function, where the states of the units, i.e., the outputs of the units, remain the same if the current state is equal to some threshold value.
• The continuous Hopfield model is just a generalization of the discrete case. Here, the units use a continuous output function such as the sigmoid or hyperbolic tangent function. In the continuous Hopfield model, each unit has an associated capacitor C_i and resistance r_i that model the capacitance and resistance of a real neuron's cell membrane, respectively.
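As a sketch of one-shot training and recall in a discrete Hopfield network: the Hebbian outer-product weights and the five-unit example pattern below are assumptions for illustration, not taken from the text:

```python
# Discrete Hopfield sketch: one-shot weights (symmetric, zero diagonal) and
# sign-function state updates, as described above.

def train_hopfield(patterns):
    n = len(patterns[0])
    # One-shot Hebbian weights: w_ij = sum over patterns of p_i * p_j, w_ii = 0
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, steps=5):
    n = len(state)
    s = list(state)
    for _ in range(steps):            # iterate until the state settles
        for i in range(n):            # asynchronous unit updates
            net = sum(w[i][j] * s[j] for j in range(n))
            s[i] = 1 if net >= 0 else -1   # sign output function
    return s

stored = [1, -1, 1, -1, 1]
w = train_hopfield([stored])
noisy = [1, -1, -1, -1, 1]            # one flipped bit
print(recall(w, noisy))  # recovers [1, -1, 1, -1, 1]
```

Starting from the corrupted state, the network settles into the stored pattern, which acts as the attractor of the dynamics.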
Fig.: A two-layer network with first layer x and second layer y.
Difference between auto-associative and hetero-associative memory:

Auto-associative memory:
• The input and output vectors s and t are the same.
• Recalls a memory of the same modality as the one that evoked it.
• A picture of a favorite object might evoke a mental image of that object in vivid detail.
• Retrieves the same pattern.

Hetero-associative memory:
• The input and output vectors s and t are different.
• Recalls a memory that is different in character from the input.
• A picture of a favorite object might evoke a particular smell or sound, for example.
• Retrieves a stored pattern different from the input.
• A similarity measure is selected, and the winning unit is considered to be the one with the largest activation. For Kohonen feature maps, all the weights in a neighborhood around the winning unit are also updated. The neighborhood's size generally decreases slowly with each iteration.
• Steps to train a Kohonen self-organizing network are as follows:
For an n-dimensional input space and m output neurons:
1. Choose a random weight vector w_i for each neuron i, i = 1, ..., m.
2. Choose a random input x.
3. Determine the winner neuron k: ||w_k - x|| = min_i ||w_i - x|| (Euclidean distance).
4. Update the weight vectors of all neurons i in the neighborhood of neuron k: w_i := w_i + η · φ(i, k) · (x - w_i) (w_i is shifted towards x).
5. If the convergence criterion is met, STOP. Otherwise, narrow the neighborhood function φ and the learning parameter η and go to step (2).
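The five steps above can be sketched as follows; the Gaussian neighborhood function φ, the 1-D arrangement of the m output neurons, and the uniform 2-D toy inputs are assumptions for illustration:

```python
# SOM training sketch: winner search plus neighborhood-weighted updates.
import math
import random

random.seed(1)
m, dim = 10, 2
weights = [[random.random() for _ in range(dim)] for _ in range(m)]  # step 1

def phi(i, k, sigma):
    # Assumed Gaussian neighborhood: closeness of neuron i to winner k on a 1-D map
    return math.exp(-((i - k) ** 2) / (2 * sigma ** 2))

eta, sigma = 0.5, 3.0
for step in range(2000):
    x = [random.random(), random.random()]                # step 2: random input
    dists = [sum((wv[d] - x[d]) ** 2 for d in range(dim)) for wv in weights]
    k = dists.index(min(dists))                           # step 3: winner neuron
    for i in range(m):                                    # step 4: shift toward x
        h = eta * phi(i, k, sigma)
        for d in range(dim):
            weights[i][d] += h * (x[d] - weights[i][d])
    eta *= 0.999                                          # step 5: narrow eta
    sigma = max(0.5, sigma * 0.999)                       #          and phi
print([[round(v, 2) for v in wv] for wv in weights])
```

Because neighbors of the winner are also pulled toward each input, adjacent neurons end up with similar weight vectors, which is what produces the topology-preserving map.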
Fig. 2.3.2 and Fig. 2.3.3: Lattice of a Kohonen self-organizing map during training.
Learning Vector Quantization
• Learning Vector Quantization (LVQ) is an adaptive data classification method. It is based on training data with desired class information.
• LVQ uses unsupervised data clustering techniques to preprocess the data set and obtain cluster centers.
• LVQ learning method: the weight vector w (among the output units) that is closest to the input vector x must be found. If x belongs to the same class as w, we move w towards x; otherwise we move w away from the input vector x.
Step 1 : Initialize the cluster centers by a clustering method.
Step 2 : Label each cluster by the voting method.
Step 3 : Randomly select a training input vector x and find k such that ||x - w_k|| is a minimum.
Step 4 : If x and w_k belong to the same class, update w_k by
Δw_k = η (x - w_k)
Otherwise update w_k by
Δw_k = -η (x - w_k)
Step 5 : If the maximum number of iterations is reached, stop. Otherwise return to step 3.
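Steps 3-5 above can be sketched as follows, assuming steps 1-2 (initializing and labelling two cluster centers) are already done; the two Gaussian classes, the center positions, and the learning rate are illustrative assumptions:

```python
# LVQ sketch: attract the winning labelled center when the class matches,
# repel it otherwise.
import random

random.seed(0)
# Two labelled cluster centers (steps 1-2 assumed done)
centers = [[0.2, 0.2], [0.8, 0.8]]
labels = [0, 1]

def lvq_step(x, x_class, centers, labels, eta=0.1):
    # Step 3: find k minimizing ||x - w_k||
    d = [sum((c[i] - x[i]) ** 2 for i in range(len(x))) for c in centers]
    k = d.index(min(d))
    # Step 4: move toward x on class match, away otherwise
    sign = 1.0 if labels[k] == x_class else -1.0
    for i in range(len(x)):
        centers[k][i] += sign * eta * (x[i] - centers[k][i])
    return k

# Step 5: loop for a fixed number of iterations over assumed 2-class data
for _ in range(500):
    if random.random() < 0.5:
        x = [random.gauss(0.25, 0.05), random.gauss(0.25, 0.05)]; c = 0
    else:
        x = [random.gauss(0.75, 0.05), random.gauss(0.75, 0.05)]; c = 1
    lvq_step(x, c, centers, labels)
print([[round(v, 2) for v in c] for c in centers])  # near [0.25, 0.25] and [0.75, 0.75]
```

Each labelled center drifts toward the mean of its own class, so the centers end up as class prototypes usable for nearest-prototype classification.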
Counter Propagation Networks
• The counter propagation network is a hybrid network. It consists of an outstar network and a competitive filter network. It was developed in 1986 by Robert Hecht-Nielsen.
• Counter propagation networks are multilayer networks based on a combination of input, clustering and output layers. This network can be used to compress data, to approximate functions or to associate patterns.
• CPN is an unsupervised winner-take-all competitive learning network.
• The hidden layer is a Kohonen network with unsupervised learning, and the output layer is a Grossberg (outstar) layer fully connected to the hidden layer. The output layer is trained by the Widrow-Hoff rule.
• The counter propagation network can be applied in data compression, approximation of functions or pattern association.
• Three major components:
1. Instar : Hidden node with input weights. The instar is a single processing element that shares its general structure and processing functions with many other processing elements.
2. Competitive layer : Hidden layer composed of instars.
3. Outstar : A structure.
• Counter propagation network training includes two stages:
1. Input vectors are clustered. Clusters are formed using the dot product metric or the Euclidean norm metric.
2. Weights from the cluster units to the output units are made to produce the desired response.
• Counter Propagation Operation:
1. Present input to the network.
2. Calculate output of all neurons in the Kohonen layer.
3. Determine the winner (the neuron with maximum output).
4. Set output of the winner to 1 (others to 0).
5. Calculate the output vector.
• Possible drawbacks of counter propagation networks:
1. Training a counter propagation network has the same difficulty associated with training a Kohonen network.
2. Counter propagation networks tend to be larger than back propagation networks. If a certain number of mappings are to be learned, the middle layer must have that many neurons.
• ART-1 networks receive binary input vectors. Bottom-up weights are used to determine output-layer candidates that may best match the current input.
• Top-down weights represent the "prototype" for the cluster defined by each output neuron.
• A close match between input and prototype is necessary for categorizing the input.
Fig.: ART network with an input layer and an output layer.
Types of ART :
• ART 1 : It is the simplest variety of ART networks, accepting only binary inputs.
• Fuzzy ART : Fuzzy ART implements fuzzy logic into ART's pattern recognition, thus enhancing generalizability.
• Fuzzy ARTMAP : Fuzzy ARTMAP is merely ARTMAP using fuzzy ART units, resulting in a corresponding increase in efficiency.
• Storage capacity of the BAM : The maximum number of associations to be stored in the BAM should not exceed the number of neurons in the smaller layer.
• Incorrect convergence : The BAM may not always produce the closest association.
Ans. :
• A content-addressable memory is a type of memory that allows for the recall of data based on the degree of similarity between the input pattern and the patterns stored in memory.
• It refers to a memory organization in which the memory is accessed by its content, as opposed to an explicit address like in the traditional computer memory system. Therefore, this type of memory allows the recall of information based on partial knowledge of its contents.
Q.10 What is continuous BAM ?
Ans. : Continuous BAM transforms input smoothly and continuously into output in the range [0, 1] using the logistic sigmoid function as the activation function for all units.
Q.11 What is the delta rule for pattern association ?
Ans. :
• When the input vectors are linearly independent, the delta rule produces exact solutions.
• Whether the input vectors are linearly independent or not, the delta rule produces a least squares solution, i.e., it optimizes for the lowest sum of least squared errors.
Ans. : Rules :
I. If two neurons on either side of a connection are activated synchronously, then the
weight of that connection is increased.
2. If two neurons on either side of a connection are activated asynchronously, then the weight of that connection is decreased.
Q.14 What is Hopfield model ?
Ans. : The Hopfield model is a single-layered recurrent network. Like the associative memory, it is usually initialized with appropriate weights instead of being trained.
Ans. :
• It is a simplified version of the full counter propagation network.
• It is intended to approximate a y = f(x) function that is not necessarily invertible.
• It may be used if the mapping from x to y is well defined, but the mapping from y to x is not.
Q.23 Define plasticity.
Ans. : The ability of a net to respond to (learn) a new pattern equally well at any stage of learning is called plasticity.
Artificial neural networks that closely mimic natural neural networks are known as spiking
neural networks (SNNs). In addition to neuronal and synaptic status, SNNs incorporate time into
their working model. The idea is that neurons in the SNN do not transmit information at the end
of each propagation cycle (as they do in traditional multi-layer perceptron networks), but only
when a membrane potential – a neuron’s intrinsic quality related to its membrane electrical
charge – reaches a certain value, known as the threshold.
The neuron fires when the membrane potential hits the threshold, sending a signal to neighboring
neurons, which increase or decrease their potentials in response to the signal. A spiking neuron
model is a neuron model that fires at the moment of threshold crossing.
Artificial neurons, despite their striking resemblance to biological neurons, do not behave in the
same way. Biological and artificial NNs differ fundamentally in the following ways:
● General structure
● Computations in the brain
● Learning rules, in comparison to the brain
Alan Hodgkin and Andrew Huxley created the first scientific model of a Spiking Neural
Network in 1952. The model characterized the initialization and propagation of action potentials
in biological neurons. Biological neurons, on the other hand, do not transfer impulses directly. In
order to communicate, chemicals called neurotransmitters must be exchanged in the synaptic
gap.
Key Concepts
What distinguishes a traditional ANN from an SNN is the information propagation approach.
SNN aspires to be as close to a biological neural network as feasible. As a result, rather than
working with continually changing time values as ANN does, SNN works with discrete events
that happen at defined times. SNN takes a set of spikes as input and produces a set of spikes as
output (a series of spikes is usually referred to as spike trains).
Each neuron has a value that is equivalent to the electrical potential of biological neurons at any
given time.
The value of a neuron can change according to its mathematical model; for example, if a neuron
gets a spike from an upstream neuron, its value may rise or fall.
If a neuron’s value surpasses a certain threshold, the neuron will send a single impulse to each
downstream neuron connected to the first one, and the neuron’s value will immediately drop
below its average.
As a result, the neuron will go through a refractory period similar to that of a biological neuron.
The neuron’s value will gradually return to its average over time.
Artificial spiking neural networks are designed to do neural computation. This necessitates that
neural spiking is given meaning: the variables important to the computation must be defined in
terms of the spikes with which spiking neurons communicate. A variety of neuronal information
encodings have been proposed based on biological knowledge:
Binary Coding:
Binary coding is an all-or-nothing encoding in which a neuron is either active or inactive within
a specific time interval, firing one or more spikes throughout that time frame. The finding that
physiological neurons tend to activate when they receive input (a sensory stimulus such as light
or external electrical inputs) encouraged this encoding.
Rate Coding:
Only the rate of spikes in an interval is employed as a metric for the information communicated
in rate coding, which is an abstraction from the timed nature of spikes. The fact that
physiological neurons fire more frequently for stronger (sensory or artificial) stimuli motivates
rate encoding.
Fully Temporal Coding:
The encoding of a fully temporal code is dependent on the precise timing of all spikes. Evidence
from neuroscience suggests that spike-timing can be incredibly precise and repeatable. Timings
are related to a certain (internal or external) event in a fully temporal code (such as the onset of a
stimulus or spike of a reference neuron).
Latency Coding:
The timing of spikes is used in latency coding, but not the number of spikes. The latency
between a specific (internal or external) event and the first spike is used to encode information.
This is based on the finding that significant sensory events cause upstream neurons to spike
earlier.
SNN Architecture
Spiking neurons and linking synapses are described by configurable scalar weights in an SNN
architecture. The analogue input data is encoded into the spike trains using either a rate-based
technique, some sort of temporal coding or population coding as the initial stage in building an
SNN.
A biological neuron in the brain (and a simulated spiking neuron) gets synaptic inputs from other
neurons in the neural network, as previously explained. Both action potential production and
network dynamics are present in biological brain networks.
The network dynamics of artificial SNNs are much simplified as compared to actual biological
networks. It is useful in this context to suppose that the modelled spiking neurons have pure
threshold dynamics (as opposed to refractoriness, hysteresis, resonance dynamics, or post-
inhibitory rebound features).
When the membrane potential of postsynaptic neurons reaches a threshold, the activity of
presynaptic neurons affects the membrane potential of postsynaptic neurons, resulting in an
action potential or spike.
The main feature of spike-timing-dependent plasticity (STDP), the learning mechanism described here, is that the weight (synaptic efficacy) connecting a pre- and post-synaptic neuron is altered based on their relative spike times within tens-of-millisecond time intervals. The weight adjustment is based on information that is both local to the synapse and local in time. The next subsections cover both unsupervised and supervised learning techniques in SNNs.
In theory, SNNs can be used in the same applications as standard ANNs. SNNs can also simulate the central nervous systems of biological organisms, such as an insect seeking food in an unfamiliar environment. Due to their realism, they can be used to study the operation of biological neural networks.
Advantages
SNN is a dynamic system. As a result, it excels in dynamic processes like speech and dynamic
picture identification.
When an SNN is already working, it can still train.
To train an SNN, you simply need to train the output neurons.
SNNs typically need fewer neurons than traditional ANNs to accomplish the same task.
Because the neurons send impulses rather than a continuous value, SNNs can work incredibly
quickly.
Because they leverage the temporal presentation of information, SNNs have boosted
information processing productivity and noise immunity.
Disadvantages
A deep neural network has multiple hidden layers, and each layer has multiple nodes.
The inputs to nodes in a single layer each have a weight assigned to them that changes the effect that parameter has on the overall prediction result. Since the weights are assigned on the links between nodes, each node may be influenced by multiple weights.
The neural network takes all of the training data in the input layer. Then it passes the data
through the hidden layers, transforming the values based on the weights at each node. Finally it
returns a value in the output layer.
It can take some time to properly tune a neural network to get consistent, reliable results. Testing
and training your neural network is a balancing process between deciding what features are the
most important to your model.
Below is a neural network that identifies two types of flowers: Orchid and Rose.
In CNN, every image is represented in the form of an array of pixel values.
The convolution operation forms the basis of any convolutional neural network. Let’s understand
the convolution operation using two matrices, a and b, of 1 dimension.
a = [5,3,7,5,9,7]
b = [1,2,3]
In convolution operation, the arrays are multiplied element-wise, and the product is summed to
create a new array, which represents a*b.
The first three elements of the matrix a are multiplied with the elements of matrix b. The product
is summed to get the result.
The next three elements from the matrix a are multiplied by the elements in matrix b, and the
product is summed up.
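The sliding multiply-and-sum just described can be sketched in plain Python (a minimal sketch; note that, strictly speaking, applying b without flipping it is called cross-correlation, which is what most deep learning libraries implement under the name convolution):

```python
a = [5, 3, 7, 5, 9, 7]
b = [1, 2, 3]

result = []
for i in range(len(a) - len(b) + 1):
    # multiply the current window of a element-wise with b and sum the products
    window = a[i:i + len(b)]
    result.append(sum(x * y for x, y in zip(window, b)))

print(result)  # first window: 5*1 + 3*2 + 7*3 = 32
```

The first window [5, 3, 7] gives 5 + 6 + 21 = 32, the next window [3, 7, 5] gives 3 + 14 + 15 = 32, and so on.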
Layers in a Convolutional Neural Network
A convolution neural network has multiple hidden layers that help in extracting information from
an image. The four important layers in CNN are:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer
Convolution Layer
This is the first step in the process of extracting valuable features from an image. A convolution
layer has several filters that perform the convolution operation. Every image is considered as a
matrix of pixel values.
Consider the following 5x5 image whose pixel values are either 0 or 1. There’s also a filter
matrix with a dimension of 3x3. Slide the filter matrix over the image and compute the dot
product to get the convolved feature matrix.
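The same sliding dot product in two dimensions can be sketched as follows (the all-ones image and the particular 0/1 filter are hypothetical values chosen for illustration):

```python
import numpy as np

# Hypothetical 5x5 binary image and 3x3 filter
image = np.ones((5, 5), dtype=int)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

h = image.shape[0] - kernel.shape[0] + 1   # output height: 3
w = image.shape[1] - kernel.shape[1] + 1   # output width: 3
convolved = np.zeros((h, w), dtype=int)
for i in range(h):
    for j in range(w):
        # dot product of the filter with the 3x3 patch it currently covers
        convolved[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)
```

Because the image here is all ones, every entry of the convolved feature matrix equals the sum of the filter values.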
ReLU layer
ReLU stands for the rectified linear unit. Once the feature maps are extracted, the next step is to
move them to a ReLU layer.
ReLU performs an element-wise operation and sets all the negative pixels to 0. It introduces non-
linearity to the network, and the generated output is a rectified feature map. Below is the graph
of a ReLU function:
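In code, the rectification itself is just an element-wise maximum with zero (a minimal sketch; the sample feature map values are hypothetical):

```python
import numpy as np

def relu(feature_map):
    # element-wise: negative values become 0, non-negative values pass through
    return np.maximum(0, feature_map)

rectified = relu(np.array([[-2.0, 1.5],
                           [ 0.0, -0.5]]))
```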
The original image is scanned with multiple convolutions and ReLU layers for locating the
features.
Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map now goes through a pooling layer to generate a pooled feature map.
The different parts of the image, like edges, corners, body, feathers, eyes, and beak, are
identified by the convolution filters; the pooling layer then condenses each of these feature maps.
Here’s how the structure of the convolution neural network looks so far:
The next step in the process is called flattening. Flattening is used to convert all the resultant 2-
Dimensional arrays from pooled feature maps into a single long continuous linear vector.
The flattened matrix is fed as input to the fully connected layer to classify the image.
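The pooling and flattening steps above can be sketched together (a minimal illustration; the 4 by 4 rectified feature map holds hypothetical values):

```python
import numpy as np

# Hypothetical 4x4 rectified feature map
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 1],
                        [8, 2, 1, 0],
                        [3, 4, 2, 6]])

# 2x2 max pooling with stride 2: keep the largest value in each 2x2 block
pooled = np.array([[feature_map[i:i+2, j:j+2].max()
                    for j in range(0, 4, 2)]
                   for i in range(0, 4, 2)])

# flattening: convert the 2D pooled map into one long 1D vector
flattened = pooled.flatten()
```

The 4 by 4 map becomes a 2 by 2 pooled map, which flattens to a length-4 vector ready for the fully connected layer.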
CNNs work by applying filters to your input data. What makes them so special is that CNNs are
able to tune the filters as training happens. That way the results are fine-tuned in real time, even
when you have huge data sets, like with images.
Since the filters can be updated to train the CNN better, this removes the need for hand-created
filters. That gives us more flexibility in the number of filters we can apply to a data set and the
relevance of those filters. Using this algorithm, we can work on more sophisticated problems like
face recognition.
Convolutional neural networks are based on neuroscience findings. They are made of layers of
artificial neurons called nodes. These nodes are functions that calculate the weighted sum of the
inputs and return an activation map. This is the convolution part of the neural network.
Each node in a layer is defined by its weight values. When you give a layer some data, like an
image, it takes the pixel values and picks out some of the visual features.
When you're working with data in a CNN, each layer returns activation maps. These maps point
out important features in the data set. If you gave the CNN an image, it'll point out features based
on pixel values, like colors, and give you an activation function.
Usually with images, a CNN will initially find the edges of the picture. Then this slight definition
of the image will get passed to the next layer. Then that layer will start detecting things like
corners and color groups. Then that image definition will get passed to the next layer and the
cycle continues until a prediction is made.
As the layers get deeper, max pooling condenses the activation map so that only the most
relevant features from the layer are kept. This is what gets passed to each successive layer
until you get the final layer.
The last layer of a CNN is the classification layer which determines the predicted value based on
the activation map. If you pass a handwriting sample to a CNN, the classification layer will tell
you what letter is in the image. This is what autonomous vehicles use to determine whether an
object is another car, a person, or some other obstacle.
Training a CNN is similar to training many other machine learning algorithms. You'll start with
some training data that is separate from your test data and you'll tune your weights based on the
accuracy of the predicted values. Just be careful that you don't overfit your model.
There are multiple kinds of CNNs you can use depending on your problem.
1D CNN: With these, the CNN kernel moves in one direction. 1D CNNs are usually used on
time-series data.
2D CNN: These kinds of CNN kernels move in two directions. You'll see these used with image
labelling and processing.
3D CNN: This kind of CNN has a kernel that moves in three directions. Researchers use this type
of CNN on 3D images like CT scans and MRIs.
In most cases, you'll see 2D CNNs because those are commonly associated with image data.
Here are some of the applications that you might see CNNs used for.
Recognize images with little preprocessing
Recognize different hand-writing
Computer vision applications
Used in banking to read digits on checks
Used in postal services to read zip codes on an envelope
Architecture of CNN
A typical CNN has the following 4 layers (O’Shea and Nash 2015)
1. Input layer
2. Convolution layer
3. Pooling layer
4. Fully connected layer
Please note that we will explain a 2 dimensional (2D) CNN here. But the same concepts apply to
a 1 (or 3) dimensional CNN as well.
Input layer
The input layer represents the input to the CNN. An example input could be a 28 pixel by 28
pixel grayscale image. Unlike in an FNN, we do not “flatten” the input to a 1D vector; the input is
presented to the network in 2D as a 28 x 28 matrix. This makes capturing spatial relationships
easier.
Convolution layer
The convolution layer is composed of multiple filters (also called kernels). Filters for a 2D image
are also 2D. Suppose we have a 28 pixel by 28 pixel grayscale image. Each pixel is represented
by a number between 0 and 255, where 0 represents the color black, 255 represents the color
white, and the values in between represent different shades of gray. Suppose we have a 3 by 3
filter (9 values in total), and the values are randomly set to 0 or 1. Convolution is the process of
placing the 3 by 3 filter on the top left corner of the image, multiplying filter values by the pixel
values and adding the results, moving the filter to the right one pixel at a time and repeating this
process. When we get to the top right corner of the image, we simply move the filter down one
pixel and restart from the left. This process ends when we get to the bottom right corner of the
image.
Figure 2: A 3 by 3 filter applied to a 4 by 4 image, resulting in a
2 by 2 image (Dumoulin and Visin 2016)
A convolution layer is characterized by the following hyperparameters:
1. Filter size
2. Padding
3. Stride
4. Dilation
5. Activation function
Filter size can be 5 by 5, 3 by 3, and so on. Larger filter sizes should be avoided, as the learning
algorithm needs to learn the filter values (weights), and larger filters increase the number of weights
to be learned (more compute capacity, more training time, more chance of overfitting). Also, odd-
sized filters are preferred to even-sized filters, due to the nice geometric property that all the input
pixels are symmetrically arranged around the output pixel.
If you look at Figure 2 you see that after applying a 3 by 3 filter to a 4 by 4 image, we end up
with a 2 by 2 image – the size of the image has gone down. If we want to keep the resultant
image size the same, we can use padding. We pad the input in every direction with 0’s before
applying the filter. If the padding is 1 by 1, then we add 1 zero in every direction. If it is 2 by 2,
then we add 2 zeros in every direction, and so on.
Figure 3: A 3 by 3 filter applied to a 5
by 5 image, with padding of 1, resulting in a 5 by 5 image (Dumoulin and Visin 2016)
As mentioned before, we start the convolution by placing the filter on the top left corner of the
image, and after multiplying filter and image values (and adding them), we move the filter to the
right and repeat the process. How many pixels we move to the right (or down) is the stride. In
figure 2 and 3, the stride of the filter is 1. We move the filter one pixel to the right (or down). But
we could use a different stride. Figure 4 shows an example of using stride of 2.
Figure 4: A 3 by 3 filter applied to a 5 by 5 image, with
stride of 2, resulting in a 2 by 2 image (Dumoulin and Visin 2016)
When we apply a, say 3 by 3, filter to an image, our filter’s output is affected by pixels in a 3 by
3 subset of the image. If we would like a larger receptive field (the portion of the image that affects
our filter’s output), we could use dilation. If we set the dilation to 2 (Figure 5), instead of a
contiguous 3 by 3 subset of the image, every other pixel of a 5 by 5 subset of the image affects
the filter’s output.
Given the input size, filter size, padding, stride and dilation you can calculate the output size of
the convolution operation as below.
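The formula referred to here is not reproduced in this text; a standard reconstruction, consistent with Figures 2 through 4, can be written as a small helper function (the floor convention follows common deep learning frameworks):

```python
import math

def conv_output_size(n, f, padding=0, stride=1, dilation=1):
    """Output size of a convolution along one dimension.

    n: input size, f: filter size. A dilated filter covers an
    effective span of dilation*(f-1)+1 input pixels.
    """
    effective = dilation * (f - 1) + 1
    return math.floor((n + 2 * padding - effective) / stride) + 1
```

Checking it against the figures: a 3 by 3 filter on a 4 by 4 image gives 2 (Figure 2), adding padding 1 on a 5 by 5 image gives 5 (Figure 3), and stride 2 on a 5 by 5 image gives 2 (Figure 4).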
Figure 7:
Illustration of single input channel two dimensional convolution
Figure 7 illustrates the calculations for a convolution operation, via a 3 by 3 filter on a single
channel 5 by 5 input vector (5 x 5 x 1). Figure 8 illustrates the calculations when the input vector
has 3 channels (5 x 5 x 3). To show this in 2 dimensions, we are displaying each channel in input
vector and filter separately. Figure 9 shows a sample multi-channel 2D convolution in 3
dimensions.
Figure 8: Illustration of multiple input channel two dimensional convolution
As Figures 8 and 9 show, the output of a multi-channel 2 dimensional filter is a single channel 2
dimensional image. Applying multiple filters to the input image results in a multi-channel 2
dimensional image for the output. For example, if the input image is 28 by 28 by 3 (rows x
columns x channels), and we apply a 3 by 3 filter with 1 by 1 padding, we would get a 28 by 28
by 1 image. If we apply 15 filters to the input image, our output would be 28 by 28 by 15. Hence,
the number of filters in a convolution layer allows us to increase or decrease the channel size.
Pooling layer
The pooling layer performs down sampling to reduce the spatial dimensionality of the input. This
decreases the number of parameters, which in turn reduces the learning time and computation,
and the likelihood of overfitting. The most popular type of pooling is max pooling. It is usually a 2
by 2 filter with a stride of 2 that returns the maximum value as it slides over the input data
(similar to convolution filters).
The last layer in a CNN is a fully connected layer. We connect all the nodes from the previous
layer to this fully connected layer, which is responsible for classification of the image.
Deep Learning
Deep learning is a branch of machine learning, which is itself a subset of artificial
intelligence. Just as neural networks imitate the human brain, so does deep learning. In deep
learning, nothing is programmed explicitly. Basically, it is a class of machine learning that makes
use of numerous nonlinear processing units to perform feature extraction as well as
transformation. Each successive layer takes the output of the preceding layer as its input.
Deep learning models are capable of finding the right features themselves, requiring only a little
guidance from the programmer, and are very helpful in solving the problem of dimensionality.
Deep learning algorithms are used especially when we have a huge number of inputs and
outputs.
Since deep learning evolved from machine learning, which is itself a subset of artificial
intelligence, and since the idea behind artificial intelligence is to mimic human behavior, the idea
of deep learning is likewise to build algorithms that can mimic the brain. Deep learning is
implemented with the help of neural networks, and the motivation behind neural networks is the
biological neuron, which is nothing but a brain cell.
“Deep learning is a collection of statistical techniques of machine learning for learning feature
hierarchies that are actually based on artificial neural networks.”
Example of Deep Learning
In the example given above, we provide the raw image data to the input layer. The input layer then
determines patterns of local contrast, that is, it differentiates on the basis of colors, luminosity, and
so on. The first hidden layer then determines facial features: it fixates on the eyes, nose, lips, and so
on, and matches those facial features against the correct face template. The second hidden layer then
actually determines the correct face, as can be seen in the image above, after which the result is sent
to the output layer. Likewise, more hidden layers can be added to solve more complex problems, for
example, finding a particular kind of face with a lighter or darker complexion. As the number of
hidden layers increases, we are able to solve more complex problems.
Architectures
A feed-forward neural network is an Artificial Neural Network in which the nodes do not form a
cycle. In this kind of neural network, all the perceptrons are organized in layers, such that the
input layer takes the input and the output layer generates the output. Since the hidden layers do
not link with the outside world, they are called hidden layers. Each perceptron in one layer is
connected to every node in the subsequent layer, so all of the nodes are fully connected. There
are no connections, visible or invisible, between nodes in the same layer, and there are no
back-loops in the feed-forward network. To minimize the prediction error, the backpropagation
algorithm can be used to update the weight values.
Applications:
Data Compression
Pattern Recognition
Computer Vision
Sonar Target Recognition
Speech Recognition
Handwritten Characters Recognition
Recurrent neural networks are yet another variation of feed-forward networks. Here each of the
neurons present in the hidden layers receives an input with a specific delay in time. The
recurrent neural network mainly accesses the preceding information of earlier iterations. For
example, to guess the next word in a sentence, one must know the words that were used before
it. An RNN not only processes the inputs but also shares weights across time steps, which keeps
the size of the model from growing with the size of the input. However, recurrent neural
networks have slow computational speed, do not consider any future input for the current state,
and have trouble remembering information from long ago.
Applications:
Machine Translation
Robot Control
Time Series Prediction
Speech Recognition
Speech Synthesis
Time Series Anomaly Detection
Rhythm Learning
Music Composition
Convolutional Neural Networks are a special kind of neural network mainly used for image
classification, clustering of images and object recognition. DNNs enable unsupervised
construction of hierarchical image representations. To achieve the best accuracy, deep
convolutional neural networks are preferred more than any other neural network.
4. Restricted Boltzmann Machines
RBMs are yet another variant of Boltzmann Machines. Here the neurons present in the input
layer and the hidden layer have symmetric connections between them. However, there are no
internal connections within either layer. In contrast to RBMs, ordinary Boltzmann machines do
have internal connections inside the hidden layer. These restrictions in RBMs help the model
train efficiently.
Applications:
Filtering.
Feature Learning.
Classification.
Risk Detection.
Business and Economic analysis.
5. Autoencoders
Applications:
Classification.
Clustering.
Feature Compression.
Self-Driving Cars
A self-driving car captures images of its surroundings and processes a huge amount of data to
decide which action to take: turn left, turn right, or stop. By deciding what action to take
accordingly, it can reduce the accidents that happen every year.
Voice Controlled Assistance
When we talk about voice-controlled assistance, Siri is the first thing that comes to mind. You
can tell Siri whatever you want it to do, and it will search for it and display the result for you.
Automatic Image Caption Generation
For whatever image you upload, the algorithm generates a caption accordingly. If you upload a
blue-colored eye, it will display the blue-colored eye with a caption at the bottom of the image.
Automatic Machine Translation
With the help of automatic machine translation, we are able to convert one language into
another with the help of deep learning.
The feedforward neural network was the earliest and most basic type of artificial neural network
to be developed. In this network, information flows in only one direction, forward, from the input
nodes to the output nodes, passing through any hidden nodes. The network is devoid of cycles or
loops.
Extreme learning machines are feed-forward neural networks having a single layer or multiple
layers of hidden nodes for classification, regression, clustering, sparse approximation,
compression, and feature learning, where the hidden node parameters do not need to be
modified. These hidden nodes might be assigned at random and never updated, or they can be
inherited from their predecessors and never modified. In most cases, the weights of hidden nodes
are usually learned in a single step which essentially results in a fast learning scheme.
These models, according to their inventors, are capable of producing good generalization
performance and learning thousands of times quicker than backpropagation networks. These
models can also outperform support vector machines in classification and regression
applications, according to the research.
Fundamentals of ELM
An ELM is a quick way to train single hidden layer feed-forward networks (SLFNs, shown in the
figure below). An SLFN comprises three layers of neurons; however, the name “Single” refers to
the model’s one layer of non-linear neurons, the hidden layer. The input layer offers data features
but does not do any computations, whereas the output layer is linear with no transformation
function and no bias.
The ELM technique sets input layer weights W and biases b at random and never adjusts them.
Because the input weights are fixed, the output weights β are independent of them (unlike in
the Backpropagation training method) and have a straightforward solution that does not require
iteration. Such a solution is also linear and very fast to compute for a linear output layer.
Random input layer weights improve the generalization qualities of a linear output layer solution
because they provide virtually orthogonal (weakly correlated) hidden layer features. A linear
system’s solution always lies in the span of its inputs. If the solution weight range is constrained,
orthogonal inputs provide a bigger solution space volume with these constrained weights.
Smaller weight norms tend to make the system more stable and noise resistant, since input errors
are not amplified in the output of a linear system with smaller coefficients. As a result, the
random hidden layer creates weakly correlated hidden layer features, allowing for a solution with
a low norm and strong generalization performance.
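The training procedure described above can be sketched with NumPy (a minimal illustration; the toy data, hidden layer size, tanh activation, and random seed are all assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): 20 samples, 2 features
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2

n_hidden = 50

# 1. Set input-layer weights W and biases b at random; they are never adjusted.
W = rng.normal(size=(2, n_hidden))
b = rng.normal(size=n_hidden)

# 2. Compute the non-linear hidden layer output H.
H = np.tanh(X @ W + b)

# 3. Solve for the output weights beta in one step (no iteration),
#    using the Moore-Penrose pseudo-inverse of H.
beta = np.linalg.pinv(H) @ y

predictions = H @ beta
```

Because the output layer is linear, step 3 is an ordinary least-squares solve rather than an iterative weight update, which is where ELM's training speed comes from.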
Variants of ELM
In this section, we will summarize several variants of ELM and will introduce them briefly.
There are numerous types of data in real-world applications, thus ELM must be changed to
effectively learn from these data. For example, because the dataset is increasing, we may not
always be able to access the entire dataset. From time to time, new samples are added to the
dataset. Every time the set grows, we must retrain the ELM.
However, because the new samples frequently account for only a small portion of the total, re-
training the network on the entire dataset is inefficient. Huang and Liang proposed an online
sequential ELM (OS-ELM) to address this issue. The fundamental idea behind OS-ELM
is to avoid re-training over old samples by employing a sequential approach. OS-ELM can
update settings over new samples consecutively after startup. As a result, OS-ELM can be
trained one at a time or block by block.
Incremental ELM
Pruning ELM
Rong et al. proposed a pruned-ELM (P-ELM) algorithm as a systematic and automated strategy
for building ELM networks in light of the fact that using too few/many hidden nodes could lead
to underfitting/overfitting concerns in pattern categorization. P-ELM started with a large number
of hidden nodes and subsequently deleted the ones that were irrelevant or lowly relevant during
learning by considering their relevance to the class labels.
ELM’s architectural design can thus be automated. When compared to the traditional
ELM, simulation results indicated that the P-ELM resulted in compact network classifiers that
generate fast response and robust prediction accuracy on unseen data.
Error-Minimized ELM
Feng et al. suggested an error-minimization-based method for ELM (EM-ELM) that can
automatically identify the number of hidden nodes in generalized SLFNs by growing hidden
nodes one by one or group by group. The output weights were changed incrementally as the
networks grew, reducing the computational complexity dramatically. The simulation results on
sigmoid type hidden nodes demonstrated that this strategy may greatly reduce the computational
cost of ELM and offer an ELM implementation that is both efficient and effective.
Evolutionary ELM
When ELM is used, the number of hidden neurons is usually selected at random. Due to the
random determination of input weights and hidden biases, ELM may require a greater number of
hidden neurons. Zhu et al. introduced a novel learning algorithm called evolutionary extreme
learning machine (E-ELM) for optimizing input weights and hidden biases and determining
output weights.
To improve the input weights and hidden biases in E-ELM, the modified differential
evolutionary algorithm was utilized. The output weights were determined analytically using
Moore–Penrose (MP) generalized inverse.
Applications of ELM
Extreme learning machine has been used in many application domains such as medicine,
chemistry, transportation, economy, robotics, and so on due to its superiority in training speed,
accuracy, and generalization. This section highlights some of the most common ELM
applications.
IoT Application
As the Internet of Things (IoT) has gained more attention from academic and industry circles in
recent years, a growing number of scientists have developed a variety of IoT approaches or
applications based on modern information technologies.
Using ELM in IoT applications can be done in a variety of ways. Rathore and Park developed an
ELM-based strategy for detecting cyber-attacks. To identify assaults from ordinary visits, they
devised a fog computing-based attack detection system and used an updated ELM as a classifier.
Transportation Application
The application of machine learning in transportation is a popular topic. Scientists, for example,
have used machine learning techniques to create driver sleepiness monitoring systems to prevent
unsafe driving and save lives. Extreme learning machines have also long been used
to solve transportation-related challenges. Sun and Ng suggested a two-stage approach to
transportation system optimization that integrated linear programming and extreme learning
machines. Two trials showed that combining their approaches might extend the life of a
transportation system while also increasing its reliability.
Convolutional Networks
Convolutional networks (LeCun, 1989), also known as convolutional neural networks or
CNNs, are a specialized kind of neural network for processing data that has a known, grid-
like topology. Examples include time-series data, which can be thought of as a 1D grid
taking samples at regular time intervals, and image data, which can be thought of as a 2D
grid of pixels.
Suppose we are tracking the location of a spaceship with a laser sensor. Our
laser sensor provides a single output x(t), the position of the spaceship at time
t. Both x and t are real-valued, i.e., we can get a different reading from the laser sensor at
any instant in time.
Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of
the spaceship’s position, we would like to average together several measurements. Of
course, more recent measurements are more relevant, so we will want this to be a
weighted average that gives more weight to recent measurements. We can do this with a
weighting function w(a), where a is the age of a measurement. If we apply such a weighted
average operation at every moment, we obtain a new function providing a smoothed
estimate of the position s of the spaceship:
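The equation itself is not reproduced here; reconstructed from the surrounding definitions (x is the sensor reading, w the weighting function, and s the smoothed estimate), the standard form of this operation, called convolution, is:

```latex
s(t) = \int x(a)\, w(t - a)\, da = (x * w)(t)
```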
In convolutional network terminology, the first argument (in this example, the function x)
to the convolution is often referred to as the input and the second argument (in this
example, the function w) as the kernel. The output is sometimes referred to as the feature
map.
In our example, it might be more realistic to assume that our laser provides a measurement
once per second. The time index t can then take on only integer values. If we now assume
that x and w are defined only on integer t, we can define the discrete convolution:
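The discrete version can also be sketched directly in code (a small illustration with hypothetical measurements and weights; w[0] weights the most recent reading most heavily):

```python
# x[t]: noisy position readings, one per second (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0]
# w[a]: weight for a measurement of age a; newer readings count more
w = [0.5, 0.3, 0.2]

def smoothed(t):
    # s(t) = sum over ages a of x(t - a) * w(a)
    return sum(x[t - a] * w[a] for a in range(len(w)))

# smoothed estimates for every t with a full window of measurements
estimates = [smoothed(t) for t in range(len(w) - 1, len(x))]
```

At t = 2, for instance, the estimate is 0.5·3.0 + 0.3·2.0 + 0.2·1.0 = 2.3.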
While the commutative property is useful for writing proofs, it is not usually an important
property of a neural network implementation. Instead, many neural network libraries
implement a related function called the cross-correlation, which is the same as convolution
but without flipping the kernel:
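For a 2-D image I and kernel K, the missing equation is usually written as follows (a standard reconstruction of the cross-correlation formula):

```latex
S(i, j) = (I \star K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)
```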
See Fig. 9.1 for an example of convolution (without kernel flipping) applied to a 2-D
tensor. Discrete convolution can be viewed as multiplication by a matrix. However, the
matrix has several entries constrained to be equal to other entries. For example, for
univariate discrete convolution, each row of the matrix is constrained to be equal to the
row above shifted by one element. This is known as a Toeplitz matrix.
In two dimensions, a doubly block circulant matrix corresponds to convolution. In addition
to these constraints that several elements be equal to each other, convolution usually
corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This
is because the kernel is usually much smaller than the input image. Any neural network
algorithm that works with matrix multiplication and does not depend on specific
properties of the matrix structure should work with convolution, without requiring any
further changes to the neural network. Typical convolutional neural networks do make use
of further specializations in order to deal with large inputs efficiently, but these are not
strictly necessary from a theoretical perspective.
Motivation
Convolution leverages three important ideas that can help improve a machine learning
system: sparse interactions, parameter sharing, and equivariant representations. Moreover,
convolution provides a means for working with inputs of variable
size. We now describe each of these ideas in turn.
Traditional neural network layers use matrix multiplication by a matrix of
parameters with a separate parameter describing the interaction between each
input unit and each output unit. This means every output unit interacts with every input
unit. Convolutional networks, however, typically have sparse interactions (also referred to
as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller
than the input. For example, when processing an image, the input image might have
thousands or millions of pixels, but we can detect small, meaningful features such as edges
with kernels that occupy only tens or hundreds of pixels. This means that we need to store
fewer parameters, which both reduces the memory requirements of the model and
improves its statistical efficiency. It also means that computing the output requires fewer
operations. These improvements in efficiency are usually quite large. If there are m inputs
and n outputs, then matrix multiplication requires m × n parameters, and the algorithms used
in practice have O(m × n) runtime (per example). If we limit the number of connections
each output may have to k, then the sparsely connected approach requires only k × n
parameters and O(k × n) runtime.
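A quick back-of-the-envelope comparison makes the savings concrete (the 320 by 280 image size is a hypothetical choice for illustration):

```python
m = 320 * 280        # input pixels
n = 320 * 280        # output units, roughly the same size as the input
k = 9                # a 3x3 kernel connects each output to only 9 inputs

dense_parameters = m * n     # fully connected layer: every input to every output
sparse_parameters = k * n    # sparsely connected (convolution-like) layer
ratio = dense_parameters // sparse_parameters
```

Here the dense layer needs roughly ten thousand times more parameters than the sparsely connected one.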
Parameter sharing refers to using the same parameter for more than one function in a
model. In a traditional neural net, each element of the weight matrix is used exactly once
when computing the output of a layer. It is multiplied by one element of the input and then
never revisited. As a synonym for parameter sharing, one can say that a network has tied
weights, because the value of the weight applied to one input is tied to the value of a weight
applied elsewhere. In a convolutional neural net, each member of the kernel is used at
every position of the input (except perhaps some of the boundary pixels, depending on the
design decisions regarding the boundary). The parameter sharing used by the convolution
operation means that rather than learning a separate set of parameters for every location,
we learn
only one set. This does not affect the runtime of forward propagation—it is still O(k × n)—
but it does further reduce the storage requirements of the model to k parameters. Recall
that k is usually several orders of magnitude smaller than m. Since m and n are usually
roughly the same size, k is practically insignificant compared to m × n.
In the case of convolution, the particular form of parameter sharing causes the layer to
have a property called equivariance to translation. To say a function is equivariant means
that if the input changes, the output changes in the same way. Specifically, a function f(x) is
equivariant to a function g if f (g(x)) = g(f(x)). In the case of convolution, if we let g be any
function that translates the input, i.e., shifts it, then the convolution function is equivariant
to g. For example, let I be a function giving image brightness at integer coordinates. Let g be
a function mapping one image function to another image function, such that I′ = g(I) is
the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one unit to the right.
If we apply this transformation to I, then apply convolution, the result will be the same as if
we applied convolution to I, then applied the transformation g to the output.
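This equivariance can be checked numerically; the sketch below uses a circular shift and circular cross-correlation so that the equality holds exactly at the edges (the signal and kernel values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)          # hypothetical 1-D input signal
w = np.array([0.25, 0.5, 0.25])  # hypothetical kernel

def conv(signal):
    # circular cross-correlation: output[i] = sum_m w[m] * signal[(i+m) % n]
    n = len(signal)
    return np.array([sum(w[m] * signal[(i + m) % n] for m in range(len(w)))
                     for i in range(n)])

g = lambda s: np.roll(s, 1)      # translate the signal one step to the right

# equivariance: shifting then convolving equals convolving then shifting
lhs = conv(g(x))
rhs = g(conv(x))
```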
Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.7). In the first
stage, the layer performs several convolutions in parallel to produce a set of linear
activations. In the second stage, each linear activation is run through a nonlinear activation
function, such as the rectified linear activation function. This stage is sometimes called the
detector stage. In the third stage, we use a pooling function to modify the output of the layer
further.
A pooling function replaces the output of the net at a certain location with a summary
statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988)
operation reports the maximum output within a rectangular neighborhood. Other popular
pooling functions include the average of a rectangular neighborhood, the L2 norm of a
rectangular neighborhood, or a weighted average based on the distance from the central
pixel. In all cases, pooling helps to make the
representation become approximately invariant to small translations of the input.
Invariance to translation means that if we translate the input by a small amount, the values
of most of the pooled outputs do not change. See Fig. for an example 9.8 of how this works.
Invariance to local translation can be a very useful property if we care more about
whether some feature is present than exactly where it is.
For example, when determining whether an image contains a face, we need not know the
location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye
on the left side of the face and an eye on the right side of the face. In other contexts, it is
more important to preserve the location of a feature. For example, if we want to find a
corner defined by two edges meeting at a specific orientation, we need to preserve the
location of the edges well enough to test whether they meet.
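A small numerical sketch of this invariance, using a hypothetical 1-D max-pooling helper: a one-step translation of the input changes only a minority of the pooled outputs.

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Max over a sliding window of the given width, stride 1."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

x = np.array([0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # detector-stage outputs
x_shift = np.roll(x, 1)                                  # input translated one step

p, p_shift = max_pool_1d(x), max_pool_1d(x_shift)
print((p == p_shift).mean())  # most pooled outputs are unchanged by the translation
```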
Pooling over spatial regions produces invariance to translation, but if we pool over the
outputs of separately parametrized convolutions, the features can learn which
transformations to become invariant to (see Fig. 9.9). Because pooling summarizes the
responses over a whole neighborhood, it is possible to use fewer pooling units than
detector units, by reporting summary statistics for pooling regions spaced k pixels apart
rather than 1 pixel apart. See Fig. 9.10 for an example. This improves the computational
efficiency of the network because the next layer has roughly k times fewer inputs to
process. When the number of parameters in the next layer is a function of its input size
(such as when the next layer is fully connected and based on matrix multiplication) this
reduction in the input size can also result in improved statistical efficiency and reduced
memory requirements for storing the parameters.
For many tasks, pooling is essential for handling inputs of varying size. For example, if we
want to classify images of variable size, the input to the classification layer must have a
fixed size. This is usually accomplished by varying the size of an offset between pooling
regions so that the classification layer always receives the same number of summary
statistics regardless of the input size. For example, the final pooling layer of the network
may be defined to output four sets of summary statistics, one for each quadrant of an
image, regardless of the image size.
Pooling can complicate some kinds of neural network architectures that use
top-down information, such as Boltzmann machines and autoencoders.
Some examples of complete convolutional network architectures for classification
using convolution and pooling are shown in Fig. 9.11.
Variants of the Basic Convolution Function
When discussing convolution in the context of neural networks, we usually do not refer
exactly to the standard discrete convolution operation as it is usually understood in the
mathematical literature. The functions used in practice differ slightly. Here we describe
these differences in detail, and highlight some useful properties of the functions used in
neural networks. First, when we refer to convolution in the context of neural networks, we
usually actually mean an operation that consists of many applications of convolution in
parallel. This is because convolution with a single kernel can only extract one kind of
feature, albeit at many spatial locations. Usually we want each layer of our network to
extract many kinds of features, at many locations.
Additionally, the input is usually not just a grid of real values. Rather, it is a grid of vector-
valued observations. For example, a color image has a red, green and blue intensity at each
pixel. In a multilayer convolutional network, the input to the second layer is the output of
the first layer, which usually has the output of many different convolutions at each position.
When working with images, we usually think of the input and output of the convolution as
being 3-D tensors, with one index into the different channels and two indices into the
spatial coordinates of each channel. Software implementations usually work in batch mode,
so they will actually use 4-D tensors, with the fourth axis indexing different examples in
the batch, but we will omit the batch axis in our description here for simplicity. Because
convolutional networks usually use multi-channel convolution, the linear operations they
are based on are not guaranteed to be commutative, even if kernel-flipping is used. These
multi-channel operations are only commutative if each operation has the same number of
output channels as input channels.
Assume we have a 4-D kernel tensor K with element Ki,j,k,l giving the connection strength
between a unit in channel i of the output and a unit in channel j of the input, with an offset
of k rows and l columns between the output unit and the
input unit. Assume our input consists of observed data V with element Vi,j,k giving the
value of the input unit within channel i at row j and column k. Assume our output consists
of Z with the same format as V. If Z is produced by convolving K across V without flipping
K, then

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}

where the summation over l, m and n is over all values for which the tensor indexing
operations inside the summation are valid. In linear algebra notation, we index into arrays
using a 1 for the first entry. This necessitates the −1 in the above formula. Programming
languages such as C and Python index starting from 0, rendering the above expression even
simpler.
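With 0-based indexing, the formula can be transcribed almost literally; the quadruple loop below is a minimal (and deliberately slow) sketch, not a practical implementation:

```python
import numpy as np

def multichannel_conv(V, K):
    """Z[i,j,k] = sum over l,m,n of V[l, j+m, k+n] * K[i,l,m,n] (0-indexed, valid)."""
    C_in, H, W = V.shape
    C_out, _, kh, kw = K.shape
    Z = np.zeros((C_out, H - kh + 1, W - kw + 1))
    for i in range(C_out):                       # output channel
        for j in range(Z.shape[1]):              # output row
            for k in range(Z.shape[2]):          # output column
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

rng = np.random.default_rng(1)
V = rng.standard_normal((3, 5, 5))      # 3 input channels (e.g. RGB)
K = rng.standard_normal((4, 3, 2, 2))   # 4 output channels, 2x2 kernels
print(multichannel_conv(V, K).shape)    # (4, 4, 4)
```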
We may want to skip over some positions of the kernel in order to reduce the
computational cost (at the expense of not extracting our features as finely). We
can think of this as downsampling the output of the full convolution function. If we want to
sample only every s pixels in each direction in the output, then we can define a
downsampled convolution function c such that

c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)×s+m, (k−1)×s+n} K_{i,l,m,n}

where s is referred to as the stride of the downsampled convolution.
Three special cases of the zero-padding setting are worth mentioning. One is the
extreme case in which no zero-padding is used whatsoever, and the convolution kernel is
only allowed to visit positions where the entire kernel is contained within the
image. In MATLAB terminology, this is called valid convolution. In this case, all pixels in the
output are a function of the same number of pixels in the input, so the behavior of an
output pixel is somewhat more regular. However, the size of the output shrinks at each
layer. If the input image has width m and the kernel has width k, the output will be of width
m − k + 1. The rate of this shrinkage can be dramatic if the kernels used are large. Since the
shrinkage is greater than 0, it limits the number of convolutional layers that can be
included in the network. As layers are added, the spatial dimension of the network will
eventually drop to 1 × 1, at which point additional layers cannot meaningfully be
considered convolutional. Another special case of the zero-padding setting is when just
enough zero-padding is added to keep the size of the output equal to the size of the input.
MATLAB calls this same convolution. In this case, the network can contain as many
convolutional layers as the available hardware can support, since the operation of
convolution does not modify the architectural possibilities available to the next layer.
However, the input pixels near the border influence fewer output pixels than the input
pixels near the center. This can make the border pixels somewhat underrepresented in the
model. This motivates the other extreme case, which MATLAB refers to as full convolution,
in which enough zeroes are added for every pixel to be visited k times in each direction,
resulting in an output image of width m + k − 1. In this case, the output pixels near the
border are a function of fewer pixels than the output pixels near the center. This can make
it difficult to learn a single kernel that performs well at all positions in the convolutional
feature map. Usually the optimal amount of zero padding (in terms of test set classification
accuracy) lies somewhere between “valid” and “same” convolution.
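The three padding regimes differ only in output width; a small helper (illustrative) makes the m − k + 1 / m / m + k − 1 relationship concrete:

```python
def conv_output_width(m, k, mode):
    """Output width for input width m, kernel width k, under each padding regime."""
    if mode == "valid":    # kernel must fit entirely inside the input
        return m - k + 1
    if mode == "same":     # pad so output width equals input width
        return m
    if mode == "full":     # pad so every input pixel is visited k times
        return m + k - 1
    raise ValueError(mode)

for mode in ("valid", "same", "full"):
    print(mode, conv_output_width(32, 5, mode))  # valid 28, same 32, full 36
```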
In some cases, we do not actually want to use convolution, but rather locally connected
layers (LeCun, 1986, 1989). In this case, the adjacency matrix in the graph of our MLP is the
same, but every connection has its own weight, specified
by a 6-D tensor W. The indices into W are respectively: i, the output channel, j, the output
row, k, the output column, l, the input channel, m, the row offset within the input, and n, the
column offset within the input. The linear part of a locally connected layer is then given by

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}
This is sometimes also called unshared convolution, because it is a similar operation to
discrete convolution with a small kernel, but without sharing parameters across locations.
Fig. 9.14 compares local connections, convolution, and full connections.
In tiled convolution, if the number of distinct kernels t is equal to the output width, this is
the same as a locally connected layer.
Convolution can be viewed as multiplication by a matrix. The matrix involved is a function
of the convolution kernel. The matrix is sparse and each
element of the kernel is copied to several elements of the matrix. This view helps us to
derive some of the other operations needed to implement a convolutional network.
Multiplication by the transpose of the matrix defined by convolution is one such operation.
This is the operation needed to back-propagate error derivatives through a convolutional
layer, so it is needed to train convolutional networks that have more than one hidden layer.
This same operation is also needed if we wish to reconstruct the visible units from the
hidden units (Simard et al., 1992).
Reconstructing the visible units is an operation commonly used in the models described in
Part III of this book, such as autoencoders, RBMs, and sparse coding. Transpose
convolution is necessary to construct convolutional versions of those models. Like the
kernel gradient operation, this input gradient operation can be implemented using a
convolution in some cases, but in the general case requires a third operation to be
implemented. Care must be taken to coordinate this transpose operation with the forward
propagation. The size of the output that the transpose operation should return depends on
the zero padding policy and stride of the forward propagation operation, as well as the size
of the forward propagation’s output map. In some cases, multiple sizes of input to forward
propagation can result in the same size of output map, so the transpose operation must be
explicitly told what the size of the original input was.
2000-2010
Around the year 2000, the Vanishing Gradient Problem appeared. It was
discovered that “features” (lessons) formed in lower layers were not being learned
by the upper layers, because no learning signal reached these layers. This was
not a fundamental problem for all neural networks, just the ones with gradient-
based learning methods. The source of the problem turned out to be certain
activation functions. A number of activation functions squashed their input into
a small output range in a somewhat chaotic fashion. This produced
large areas of input mapped onto an extremely small output range. In these areas of
input, a large change in the input is reduced to a small change in the output, resulting
in a vanishing gradient. Two solutions used to solve this problem were layer-
by-layer pre-training and the development of long short-term memory.
In 2001, a research report by META Group (now called Gartner) described the
challenges and opportunities of data growth as three-dimensional: the
increasing volume of data, the increasing velocity of data, and the increasing
variety of data sources and types. This was a call to prepare for the
onslaught of Big Data, which was just starting.
In 2009, Fei-Fei Li, an AI professor at Stanford, launched ImageNet, a
free database of more than 14 million labeled images. The Internet is, and was,
full of unlabeled images. Labeled images were needed to “train” neural nets.
Professor Li said, “Our vision was that big data would change the way machine
learning works. Data drives learning.”
2011-2020
By 2011, the speed of GPUs had increased significantly, making it possible to
train convolutional neural networks “without” the layer-by-layer pre-training.
With the increased computing speed, it became obvious deep learning had
significant advantages in terms of efficiency and speed. One example is AlexNet,
a convolutional neural network whose architecture won several international
competitions during 2011 and 2012. It used rectified linear units to
enhance speed, and dropout to reduce overfitting.
Also in 2012, Google Brain released the results of an unusual project known as
The Cat Experiment. The free-spirited project explored the difficulties of
“unsupervised learning.” Deep learning uses “supervised learning,” meaning the
convolutional neural net is trained using labeled data (think images from
ImageNet). Using unsupervised learning, a convolutional neural net is given
unlabeled data, and is then asked to seek out recurring patterns.
The Cat Experiment used a neural net spread over 1,000 computers. Ten million
“unlabeled” images were taken randomly from YouTube, shown to the system,
and then the training software was allowed to run. At the end of the training,
one neuron in the highest layer was found to respond strongly to images of
cats. Andrew Ng, the project’s founder, said, “We also found a neuron that
responded very strongly to human faces.” Unsupervised learning remains a
significant goal in the field of deep learning.
The Generative Adversarial Neural Network (GAN) was introduced in 2014.
GAN was created by Ian Goodfellow. With GAN, two neural networks play
against each other in a game. The goal of the game is for one network to imitate
a photo, and trick its opponent into believing it is real. The opponent is, of
course, looking for flaws. The game is played until the near perfect photo tricks
the opponent. GAN provides a way to perfect a product (and has also begun
being used by scammers).
Incorporating Uncertainty
Gradient Learning
Cost Function
An important aspect of the design of deep neural networks is the
cost function. Cost functions for neural networks are similar to those
for parametric models such as linear models. In most cases, the
parametric model defines a distribution p(y | x; θ) and we simply use
the principle of maximum likelihood. Most modern neural networks are
trained using maximum likelihood, so the cost function is simply the
cross-entropy between the training data and the model's predictions.
The cost function is given by

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x)
The advantage of this approach to cost is that deriving cost from maximum
likelihood removes the burden of designing cost functions for each model.
• A property of the cross-entropy cost used for maximum likelihood estimation is
that it usually does not have a minimum value. For discrete output variables,
most models cannot represent a probability of zero or one, but can come
arbitrarily close; logistic regression is an example.
• For real-valued output variables, it becomes possible to assign extremely
high density to the correct training set outputs, e.g., by learning the variance
parameter of a Gaussian output, and the resulting cross-entropy
approaches negative infinity.
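As a minimal sketch, the maximum-likelihood cost for a discrete classifier reduces to the average negative log-probability assigned to the correct labels (the helper and data below are illustrative):

```python
import numpy as np

def nll_cost(probs, labels):
    """J(theta): average negative log-probability of the correct labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

probs = np.array([[0.9, 0.1],    # predicted class probabilities, 3 examples
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 0])
print(nll_cost(probs, labels))   # ~0.28; shrinks toward 0 but never reaches it,
                                 # since the probabilities only approach 1
```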
Learning a function:
Training procedure:
Training Algorithm
Dataset Augmentation:
Regularization Effect:
Implementation:
Dataset augmentation is typically applied during the training phase, where each
training sample is randomly transformed before being fed into the model for
training. The transformed samples are treated as additional training data,
effectively enlarging the training dataset.
Modern deep learning frameworks often provide built-in support for dataset
augmentation through data preprocessing pipelines or dedicated augmentation
modules. These frameworks allow users to easily specify the desired
transformations and apply them to the training data on-the-fly during training.
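A minimal on-the-fly augmentation sketch (pure NumPy, with illustrative transformations; real pipelines would use a framework's augmentation module): each call yields a randomly transformed copy while the label stays the same.

```python
import numpy as np

def augment(image, rng):
    """One random transform per call: maybe flip, then a small horizontal shift."""
    if rng.random() < 0.5:
        image = image[:, ::-1]              # horizontal flip
    shift = int(rng.integers(-2, 3))        # shift by -2..2 pixels
    return np.roll(image, shift, axis=1)

rng = np.random.default_rng(0)
img = np.arange(64, dtype=float).reshape(8, 8)
batch = [augment(img, rng) for _ in range(4)]    # fresh transforms each epoch
print(all(a.shape == img.shape for a in batch))  # True -- label is unchanged
```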
Applying the chain rule
Let’s use the chain rule to calculate the derivative of cost with respect to any
weight in the network. The chain rule will help us identify how much each
weight contributes to our overall error and the direction to update each weight
to reduce our error. Here are the equations we need to make a prediction and
calculate total error, or cost:
Given a network consisting of a single neuron, total cost could be calculated as:
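Since the referenced equations are not reproduced here, the sketch below assumes a single sigmoid neuron with squared-error cost (an assumption, not necessarily the network in the original figure) and applies the chain rule dC/dw = dC/da · da/dz · dz/dw, checking the result against a numerical derivative:

```python
import math

# Assumed forms: z = w*x + b, a = sigmoid(z), cost C = (a - y)^2 / 2
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.5, 0.0          # input and target
w, b = 0.8, 0.1          # parameters

z = w * x + b
a = sigmoid(z)
cost = 0.5 * (a - y) ** 2

# Chain rule: dC/dw = dC/da * da/dz * dz/dw
dC_da = a - y
da_dz = a * (1.0 - a)
dz_dw = x
grad_w = dC_da * da_dz * dz_dw

# Check against a numerical derivative
eps = 1e-6
num = (0.5 * (sigmoid((w + eps) * x + b) - y) ** 2 - cost) / eps
print(abs(grad_w - num) < 1e-4)  # True
```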
Noise robustness
Noise robustness, in the context of machine learning, and particularly deep
learning, refers to the ability of a model to maintain its performance and make
accurate predictions even when presented with noisy or corrupted input data.
Noise in data can arise
from various sources, including sensor errors, transmission errors,
environmental factors, or imperfections in data collection processes.
Here's how noise robustness is addressed in machine learning, particularly in
deep learning:
1. Data Preprocessing:
• Noise Removal: In some cases, it's possible to preprocess the data to
remove or reduce noise before feeding it into the model. Techniques such
as denoising filters, signal processing methods, or data cleaning
algorithms can be employed to mitigate noise in the data.
2. Model Architecture:
3. Data Augmentation:
4. Training Strategies:
5. Uncertainty Estimation:
• Probabilistic Models: Probabilistic deep learning models, such as
Bayesian neural networks or ensemble methods, can provide uncertainty
estimates along with predictions. These uncertainty estimates can help
the model recognize when it's uncertain about its predictions, which is
particularly useful in the presence of noisy or ambiguous input data.
6. Transfer Learning:
• Pretrained Models: Transfer learning from pretrained models trained
on large datasets can help improve noise robustness. Pretrained models
have learned robust features from vast amounts of data, which can
generalize well even in the presence of noise in the target domain.
Early Stopping:
Early stopping is a regularization technique used to prevent overfitting during
the training of machine learning models, including neural networks. The basic
idea is to monitor the performance of the model on a separate validation set
during training. Training is stopped early (i.e., before the model starts to overfit)
when the performance on the validation set starts to degrade.
Specifically, early stopping involves tracking the validation error after each
epoch, saving the model parameters whenever the validation error improves, and
halting training when no improvement has been seen for a fixed number of
epochs (the patience).
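A minimal sketch of this policy (the function and loss values are illustrative): track the best validation loss and stop after `patience` epochs without improvement, keeping the best epoch's parameters.

```python
def best_epoch_with_early_stopping(val_losses, patience=2):
    """Return the epoch whose parameters would be kept: training halts after
    `patience` consecutive epochs without validation improvement."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:      # validation started to degrade: stop
                break
    return best_epoch

# Validation loss falls, then rises; epoch 2's weights are kept
print(best_epoch_with_early_stopping([0.9, 0.6, 0.5, 0.55, 0.7, 0.8]))  # 2
```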
Bagging
Pseudocode:
1. Given training data (x₁, y₁), …, (xₘ, yₘ)
2. For t = 1, …, T:
   a. Form bootstrap replicate dataset Sₜ by selecting m random examples
      from the training set with replacement.
   b. Let hₜ be the result of training the base learning algorithm on Sₜ.
Output combined classifier:
H(x) = Majority(h₁(x), …, h_T(x))
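The pseudocode above can be sketched as follows. The base learner here (a nearest-centroid rule) is an arbitrary illustrative choice; any base learning algorithm could be substituted:

```python
import numpy as np

def bagging_predict(X_train, y_train, x, T=15, seed=0):
    """Bagging sketch: T bootstrap replicates, one weak learner each, majority vote."""
    rng = np.random.default_rng(seed)
    m = len(X_train)
    votes = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)           # step 2a: bootstrap replicate S_t
        Xb, yb = X_train[idx], y_train[idx]
        # step 2b: "train" h_t -- here an illustrative nearest-centroid rule
        centroids = {c: Xb[yb == c].mean(axis=0) for c in np.unique(yb)}
        votes.append(min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])))
    return max(set(votes), key=votes.count)        # H(x): majority vote

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
print(bagging_predict(X, y, np.array([5.5, 5.0])))  # 1
```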
Dropout:
Dropout is a regularization technique specifically designed for training neural
networks to prevent overfitting. It involves randomly "dropping out" (i.e.,
deactivating) a fraction of neurons during training.
The key aspects of dropout are:
• Random Deactivation: During each training iteration, a fraction of
neurons in the network is randomly set to zero with a probability p,
typically chosen between 0.2 and 0.5.
• Training and Inference: Dropout is only applied during training. During
inference (i.e., making predictions), all neurons are active, but their
outputs are scaled by the keep probability 1 − p to maintain the expected
output magnitude.
• Ensemble Effect: Dropout can be interpreted as training an ensemble of
exponentially many subnetworks, which encourages the network to learn
more robust and generalizable features.
Dropout effectively prevents the co-adaptation of neurons and encourages the
network to learn more distributed representations, leading to improved
generalization performance.
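In practice, frameworks usually implement the equivalent "inverted" form, which rescales the kept activations by 1/(1 − p) during training so that inference needs no rescaling at all. A minimal sketch (names illustrative):

```python
import numpy as np

def dropout(activations, p, rng, train=True):
    """Inverted dropout: zero units with probability p during training,
    rescale survivors by 1/(1-p) so the expected activation is unchanged."""
    if not train:
        return activations          # inference: no dropout, no rescaling needed
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = dropout(a, p=0.5, rng=rng)
print(round(float(dropped.mean()), 1))  # 1.0 -- expected magnitude preserved
```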
A simple RNN has a feedback loop, as shown in the first diagram of the above
figure.
The feedback loop shown in the gray rectangle can be unrolled in three-time
steps to produce the second network of the above figure. Of course, you can vary
the architecture so that the network unrolls 𝑘 time steps. In the figure, the
following notation is used:
Hence, in the feedforward pass of an RNN, the network computes the values of
the hidden units and the output after 𝑘 time steps. The weights associated with
the network are shared temporally.
Each recurrent layer has two sets of weights:
• One for the input
• One for the hidden unit
The last feedforward layer, which computes the final output for the kth
time step, is just like an ordinary layer of a traditional feedforward
network.
• One to Many
This type of neural network has a single input and multiple
outputs. Image captioning is a common example.
• Many to One
This RNN takes a sequence of inputs and generates one output. Sentiment
analysis is an example of this sort of network, where a given sentence
can be classified as expressing positive or negative sentiment.
• Many to Many
This RNN takes a sequence of inputs and generates a sequence of outputs.
Machine translation is one example.
Suppose you wish to predict the last word in the text: “The clouds
are in the ______.”
The most obvious answer is “sky.” We don’t need any further
context to predict the last word in the above sentence.
Now consider this sentence: “I have been staying in Spain for the last 10 years… I can
speak fluent ______.”
The word you predict will depend on the last few words of
context. Here, you need the context of Spain to predict the last word,
and the most fitting answer to this sentence
is “Spanish.” The gap between the relevant information and the point
where it is needed may become very large. LSTMs help to solve this
problem.
*****************************************************************
Inputting a sequence:
A sequence of data points, each represented as a vector with the same
dimensionality, is fed into a BRNN. Different sequences may have different lengths.
Dual Processing:
The data is processed in both the forward and backward directions. In the
forward direction, the hidden state at time step t is computed from the input at
step t and the hidden state at step t−1. In the backward direction, the hidden
state at step t is computed from the input at step t and the hidden state at
step t+1.
A non-linear activation function on the weighted sum of the input and previous
hidden state is used to calculate the hidden state at each step. This creates a
memory mechanism that enables the network to remember data from earlier
steps in the process.
Training:
The network is trained through a supervised learning approach where the goal
is to minimize the discrepancy between the predicted output and the actual
output. The network adjusts its weights in the input-to-hidden and hidden-to-
output connections during training through backpropagation.
To calculate the output of an RNN unit, we use the following formula:

hₜ = A(W_x · xₜ + W_h · hₜ₋₁ + b)

where,
A = activation function, W_x, W_h = weight matrices, b = bias
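The temporally shared weights make the feedforward pass a simple loop; a minimal sketch with tanh as the activation A (dimensions and values illustrative):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b, h0):
    """h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b); the same weights at every step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5
xs = [rng.standard_normal(d_in) for _ in range(T)]
Wx = 0.5 * rng.standard_normal((d_h, d_in))   # input weights
Wh = 0.5 * rng.standard_normal((d_h, d_h))    # hidden-unit weights
hs = rnn_forward(xs, Wx, Wh, np.zeros(d_h), np.zeros(d_h))
print(len(hs), hs[0].shape)  # 5 (4,)
```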
Advantages:
• Enhanced accuracy:
BRNNs frequently yield more precise answers since they take both
historical and upcoming data into account.
Bi-RNNs have been applied to various natural language processing (NLP) tasks,
including:
• Sentiment Analysis:
By taking into account both the prior and subsequent context, BRNNs can
be utilized to categorize the sentiment of a particular sentence.
• Part-of-Speech Tagging:
The classification of words in a phrase into their corresponding parts of
speech, such as nouns, verbs, adjectives, etc., can be done using BRNNs.
• Machine Translation:
BRNNs can be used in encoder-decoder models for machine translation,
where the decoder creates the target sentence and the encoder analyses
the source sentence in both directions to capture its context.
• Speech Recognition:
When the input voice signal is processed in both directions to capture the
contextual information, BRNNs can be used in automatic speech
recognition systems.
Disadvantages:
• Computational complexity:
Given that they analyze data both forward and backward, BRNNs can be
computationally expensive due to the increased number of calculations
needed.
• Difficulty in parallelization:
Due to the requirement for sequential processing in both the forward and
backward directions, BRNNs can be challenging to parallelize.
• Overfitting:
BRNNs are prone to overfitting since they include many parameters that
might result in too complicated models, especially when trained on short
datasets.
• Interpretability:
Due to the processing of data in both forward and backward directions,
BRNNs can be tricky to interpret since it can be difficult to comprehend
what the model is doing and how it is producing predictions.
• RNNs are a type of neural network designed to work with sequential data,
where the output of each step is dependent on the previous steps.
• This makes them particularly suitable for tasks like natural language
processing (NLP), time series prediction, and speech recognition.
• Each layer in a deep recurrent network (DRN) passes its output as input to the
next layer, enabling
the network to learn hierarchical representations of sequential data.
There are several types of recurrent units that can be used in deep recurrent
networks, such as:
• Vanilla RNNs:
These are the simplest form of recurrent units, where the output is
computed based on the current input and the previous hidden state.
• Long Short-Term Memory (LSTM):
LSTMs are a type of recurrent unit that introduces gating mechanisms to
control the flow of information within the network, allowing it to learn
long-range dependencies more effectively and mitigate the vanishing
gradient problem.
• Gated Recurrent Units (GRUs):
GRUs are like LSTMs but have a simpler structure with fewer parameters,
making them computationally more efficient.
Steps to develop a deep RNN application
Developing an end-to-end deep RNN application involves several steps,
including data preparation, model architecture design, training the model, and
deploying it. Here is an example of an end-to-end deep RNN application for
sentiment analysis.
Data preparation:
The first step is to gather and preprocess the data. In this case, we’ll need a
dataset of text reviews labelled with positive or negative sentiment. The text
data needs to be cleaned, tokenized, and converted to a numerical format.
This can be done using libraries like NLTK or spaCy in Python.
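A bare-bones version of the tokenize-and-index step, in plain Python rather than NLTK/spaCy (the vocabulary and padding scheme are illustrative):

```python
def build_vocab(texts):
    """Map every token seen in training to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab, max_len=6):
    """Tokenize, map to ids (unknown words -> <unk>), pad to a fixed length."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in text.lower().split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

reviews = ["great movie loved it", "terrible plot boring"]
vocab = build_vocab(reviews)
print(encode("loved the movie", vocab))  # [4, 1, 3, 0, 0, 0]
```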
• Embedding Layer:
Converts the input sequence into a dense representation suitable for
processing by the recurrent layers. It typically involves mapping each
element of the sequence (e.g., word or data point) to a high-dimensional
vector space.
• Recurrent Layers:
Consist of multiple recurrent units stacked together. Each layer processes
the input sequence sequentially, capturing temporal dependencies.
Common types of recurrent units include vanilla RNNs, LSTMs, and GRUs.
• Output Layer:
Takes the output from the recurrent layers and produces the final
prediction or output. The structure of this layer depends on the specific
task, such as classification (e.g., softmax activation) or regression (e.g.,
linear activation).
• Output (Prediction):
The final output of the network, which could be a sequence of predictions
for each time step or a single prediction for the entire sequence,
depending on the task.
• Transfer Learning:
Pre-training deep recurrent networks on large-scale datasets for related
tasks (e.g., language modeling) and fine-tuning them for specific tasks
often leads to improved performance. The hierarchical representations
learned during pre-training capture generic features of the data, which
can be beneficial for downstream tasks with limited labeled data.
• Computational Complexity:
Deep recurrent networks with multiple layers can be computationally
expensive to train and deploy, especially when dealing with large-scale
datasets or complex architectures. The computational complexity
increases with the number of layers, making it challenging to train deep
models on resource-constrained devices or in real-time applications.
• Overfitting:
Deep recurrent networks are prone to overfitting, especially when
dealing with small datasets or overly complex models. With a large
number of parameters, deep models have a high capacity to memorize
noise or irrelevant patterns in the training data, leading to poor
generalization performance on unseen data. Regularization techniques
such as dropout and weight decay are commonly used to prevent
overfitting.
• Difficulty in Interpretability:
Understanding the internal workings of deep recurrent networks and
interpreting their decisions can be challenging. With multiple layers of
non-linear transformations, it can be difficult to interpret the learned
representations and understand how the network arrives at a particular
prediction. This lack of interpretability can be a significant drawback in
applications where transparency and interpretability are essential.
In the above diagram, there are three modules with two additional novel blocks in
the end-to-end framework, i.e., encoder network, analysis block, binarizer,
decoder network, and synthesis block. Image patches are directly given to the
analysis block as an input that generates latent features using the proposed
analysis encoder block. The entire framework architecture is presented in
architecture diagram.
A single iteration of the end-to-end framework is represented in the equation
below.
• L1 and L2 Regularization:
L1 and L2 regularization penalize the magnitude of the weights in the
autoencoder's neural network. By adding a regularization term to the
loss function proportional to either the L1 or L2 norm of the weights,
these techniques encourage sparsity (in the case of L1 regularization) or
small weights (in the case of L2 regularization), helping prevent
overfitting.
• Dropout:
Dropout is a regularization technique that randomly sets a fraction of the
input units to zero during each training iteration. This helps prevent the
autoencoder's neural network from relying too heavily on any individual
input features, forcing it to learn more robust representations.
• Batch Normalization:
Batch normalization normalizes the activations of each layer in the
autoencoder's neural network, helping stabilize and accelerate the
training process. By reducing internal covariate shift, batch
normalization acts as a regularizer, making the autoencoder more
resistant to overfitting.
• Noise Injection:
Noise injection involves adding noise to the input data or the activations
of the autoencoder's hidden layers during training. This helps prevent the
autoencoder from memorizing the training data and encourages it to
learn more generalizable representations.
• Contractive Regularization:
Contractive regularization penalizes the Frobenius norm of the Jacobian
matrix of the encoder with respect to the input data. This encourages the
encoder to learn representations that are invariant to small changes in
the input data, making the autoencoder more robust to variations in the
input.
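As one concrete instance of the above, L2 regularization simply adds a scaled sum of squared weights to the reconstruction loss; a minimal sketch (the weight values are illustrative):

```python
import numpy as np

def l2_regularized_loss(x, x_hat, weights, lam=1e-3):
    """Reconstruction error plus lam * sum of squared weights (weight decay)."""
    recon = np.mean((x - x_hat) ** 2)
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return recon + penalty

x = np.array([1.0, 0.0, 1.0])                      # input
x_hat = np.array([0.9, 0.1, 0.8])                  # reconstruction
W = [np.array([[0.5, -0.5], [1.0, 0.0]])]          # encoder weights
print(round(float(l2_regularized_loss(x, x_hat, W)), 6))  # 0.0215
```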
Stochastic Encoder:
In a VAE, the encoder network outputs the parameters of a probability
distribution instead of a deterministic encoding. Instead of directly
outputting the latent representation of the input data, the encoder
outputs the mean and variance (or other parameters) of a Gaussian
distribution that represents the distribution of possible latent variables
given the input. The latent variable is then sampled from this distribution
to generate a stochastic representation.
Stochastic Decoder:
Similarly, the decoder network in a VAE accepts a sampled latent variable
as input instead of a deterministic encoding. This sampled latent variable
is generated by sampling from the distribution outputted by the encoder.
The decoder then generates the reconstructed output based on this
sampled latent variable.
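The stochastic encoder can be sketched with the reparameterization trick, z = μ + σ·ε with ε ∼ N(0, I) (dimensions and values below are illustrative); averaging many samples recovers the mean the encoder produced:

```python
import numpy as np

def stochastic_encode(mu, log_var, rng):
    """Sample z = mu + sigma * eps, eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([0.0, 2.0])          # encoder's mean output for one input
log_var = np.array([0.0, -2.0])    # encoder's log-variance output

samples = np.stack([stochastic_encode(mu, log_var, rng) for _ in range(5000)])
print(np.round(samples.mean(axis=0), 1))  # close to mu
```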
Cost Function Calculation
Contractive autoencoders