AD3451 ML UNIT 4 NOTES

Uploaded by

thiyagaraj3561
Copyright
© © All Rights Reserved

AL3451-MACHINE LEARNING

UNIT-4
UNIT IV NEURAL NETWORKS 9

Multilayer perceptron, activation functions, network training – gradient descent optimization – stochastic gradient descent, error backpropagation, from shallow networks to deep networks – Unit saturation (aka the vanishing gradient problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout

Multi-layer Perceptron

A multi-layer perceptron (MLP) is one of the basic architectures of artificial neural networks. It is formed from multiple layers of perceptrons.

The pictorial representation of multi-layer perceptron learning is as shown below -

MLP networks are used for supervised learning. The typical learning algorithm for MLP networks is the backpropagation algorithm.

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. An MLP with several hidden layers is a deep learning model.
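As a rough illustration of the feedforward idea described above, the following sketch passes an input vector through two fully connected layers. The layer sizes, the random weights and the helper name `mlp_forward` are assumptions made only for this example.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers):
    # layers is a list of (W, b) pairs; each layer feeds the next one
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
# Assumed architecture: 2 inputs -> 3 hidden units -> 1 output
layers = [
    (rng.normal(size=(3, 2)), np.zeros(3)),
    (rng.normal(size=(1, 3)), np.zeros(1)),
]
output = mlp_forward(np.array([0.5, -1.0]), layers)
print(output)
```

The data flows in one direction only, from the input layer through the hidden layer to the output layer, which is what "feedforward" means here.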

Activation Functions in Neural Networks

Elements of a Neural Network

Downloaded from www.eduengineering.net
Input Layer: This layer accepts input features. It provides information from the outside world to the network. No computation is performed at this layer; nodes here just pass on the information (features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the abstraction provided by any neural network. The hidden layer performs all sorts of computation on the features entered through the input layer and transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network up to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Explanation: We know that the neural network has neurons that work in correspondence with weight, bias, and their respective activation function. In a neural network, we update the weights and biases of the neurons on the basis of the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible since the gradients are supplied along with the error to update the weights and biases.
Why do we need a non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
Mathematical proof
Suppose we have a neural net like this:-

Elements of the diagram are as follows:
Hidden layer i.e. layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
• z(1) is the vectorized output of layer 1
• W(1) is the vectorized weights assigned to the neurons of the hidden layer, i.e. w1, w2, w3 and w4
• X is the vectorized input features, i.e. i1 and i2
• b(1) is the vectorized bias assigned to the neurons in the hidden layer, i.e. b1 and b2
• a(1) is the vectorized form of any linear function.
(Note: We are not considering an activation function here)

Layer 2 i.e. output layer:-
Note: Input for layer 2 is the output from layer 1
z(2) = W(2)a(1) + b(2)
a(2) = z(2)

Calculation at Output layer
z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b
Final output : z(2) = W*X + b
which is again a linear function.
The result is still a linear function even after adding a hidden layer. Hence we can conclude that no matter how many hidden layers we attach to the neural net, all layers together behave like a single linear layer, because the composition of two linear functions is itself a linear function. A neuron cannot learn complex mappings with just a linear function attached to it. A non-linear activation function lets it learn according to the error. Hence we need an activation function.
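The collapse of two linear layers into one can be checked numerically. The sketch below uses randomly chosen weights (an assumption for illustration; any values work) and compares the two-layer output with the single-layer form W*X + b derived above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers without activation functions (random illustrative values)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
X = rng.normal(size=2)

# Output computed layer by layer: z(2) = W(2) * (W(1)X + b(1)) + b(2)
z2_two_layers = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer: W = W(2)W(1), b = W(2)b(1) + b(2)
W = W2 @ W1
b = W2 @ b1 + b2
z2_collapsed = W @ X + b

print(np.allclose(z2_two_layers, z2_collapsed))
```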
Variants of Activation Function
Linear Function
• Equation: A linear function has an equation similar to that of a straight line, i.e. y = x
• No matter how many layers we have, if all are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
• Range: -inf to +inf
• Uses: The linear activation function is used at just one place, i.e. the output layer.
• Issues: If we differentiate a linear function, the result no longer depends on the input "x" and becomes a constant, so it won't introduce any ground-breaking behavior to our algorithm.
For example: Calculation of the price of a house is a regression problem. House price may have any big/small value, so we can apply linear activation at the output layer. Even in this case the neural net must have a non-linear function at the hidden layers.
Sigmoid Function

• It is a function which is plotted as an 'S' shaped graph.
• Equation: $A = \frac{1}{1 + e^{-x}}$
• Nature: Non-linear. Notice that for x values between -2 and 2 the curve is very steep. This means that small changes in x bring about large changes in the value of Y.
• Value Range: 0 to 1
• Uses: Usually used in the output layer of a binary classification, where the result is either 0 or 1. As the value of the sigmoid function lies between 0 and 1 only, the result can be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.

Tanh Function

• The activation that almost always works better than the sigmoid function is the Tanh function, also known as the Tangent Hyperbolic function. It is actually a mathematically scaled and shifted version of the sigmoid function. Both are similar and can be derived from each other.
• Equation :- $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,\mathrm{sigmoid}(2x) - 1$
• Value Range :- -1 to +1
• Nature :- non-linear
• Uses :- Usually used in the hidden layers of a neural network. As its values lie between -1 and 1, the mean for the hidden layer comes out to be 0 or very close to it, which helps in centering the data by bringing the mean close to 0. This makes learning for the next layer much easier.
RELU Function

• It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network.
• Equation :- A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
• Value Range :- [0, inf)
• Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
• Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At a time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
In simple words, RELU learns much faster than the sigmoid and Tanh functions.

Softmax Function

The softmax function is also a type of sigmoid function but is handy when we are trying to handle multi-class classification problems.
• Nature :- non-linear
• Uses :- Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification problems. The softmax function squeezes the output for each class between 0 and 1 and also divides by the sum of the outputs, so that the outputs add up to 1.
• Output :- The softmax function is ideally used in the output layer of the classifier, where we are actually trying to attain the probabilities that define the class of each input.
• The basic rule of thumb is: if you really don't know what activation function to use, then simply use RELU, as it is a general activation function for hidden layers and is used in most cases these days.
• If your output is for binary classification, then the sigmoid function is a very natural choice for the output layer.
• If your output is for multi-class classification, then Softmax is very useful to predict the probabilities of each class.
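The activation functions listed above can be written in a few lines of NumPy. This is a minimal sketch for experimentation; the function names and the sample input are chosen here for illustration only.

```python
import numpy as np

def linear(x):
    # y = x, range (-inf, +inf)
    return x

def sigmoid(x):
    # A = 1 / (1 + e^-x), range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Scaled and shifted sigmoid: 2 * sigmoid(2x) - 1, range (-1, 1)
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    # A(x) = max(0, x), range [0, inf)
    return np.maximum(0.0, x)

def softmax(x):
    # Exponentiate (shifted by the max for numerical stability),
    # then divide by the sum so the outputs add up to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
for f in (linear, sigmoid, tanh, relu, softmax):
    print(f.__name__, f(x))
```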

4.3. Network Training
• Training: It is the process in which the network is taught to change its weights and biases.
• Learning: It is the internal process of training where the artificial neural system learns to update/adapt the weights and biases.

Different Training/Learning procedures available in ANN are

• Supervised learning
• Unsupervised learning
• Reinforced learning
• Hebbian learning
• Gradient descent learning
• Competitive learning
• Stochastic learning

1.4.1. Requirements of Learning Laws:
• Learning law should lead to convergence of weights
• Learning or training time should be less for capturing the information from the training pairs
• Learning should use the local information
• Learning process should be able to capture the complex nonlinear mapping available between the input & output pairs
• Learning should be able to capture as many patterns as possible
• Storage of pattern information gathered at the time of learning should be high for the given network

Figure 3: Different Training methods of ANN

Supervised learning:

Every input pattern that is used to train the network is associated with an output pattern, which is the target or the desired pattern.
A teacher is assumed to be present during the training process, when a comparison is made between the network's computed output and the correct expected output, to determine the error. The error can then be used to change network parameters, which results in an improvement in performance.
Unsupervised learning:

In this learning method the target output is not presented to the network. It is as if there is no teacher to present the desired patterns, and hence the system learns on its own by discovering and adapting to structural features in the input patterns.
Reinforced learning:

In this method, a teacher, though available, does not present the expected answer but only indicates whether the computed output is correct or incorrect. The information provided helps the network in the learning process.
Hebbian learning:

This rule was proposed by Hebb and is based on correlative weight adjustment. This is the oldest learning mechanism inspired by biology. In this, the input-output pattern pairs $(x_i, y_i)$ are associated by the weight matrix W, known as the correlation matrix.
It is computed as

$W = \sum_{i=1}^{n} x_i y_i^{T}$ ------------ eq(1)

Here $y_i^{T}$ is the transpose of the associated output vector $y_i$. Numerous variants of the rule have been proposed.
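Eq(1) is a sum of outer products, one per training pair. The following sketch computes the correlation matrix for two made-up bipolar pattern pairs (the patterns are an assumption for illustration) and checks it against the equivalent matrix product.

```python
import numpy as np

# Two illustrative input patterns x_i (length 3) and output patterns y_i (length 2)
X = np.array([[1, -1, 1],
              [-1, 1, 1]])
Y = np.array([[1, -1],
              [-1, 1]])

# W = sum_i x_i y_i^T, accumulated one pattern pair at a time
W = np.zeros((X.shape[1], Y.shape[1]), dtype=int)
for x_i, y_i in zip(X, Y):
    W += np.outer(x_i, y_i)

# The same sum written as a single matrix product
W_matrix_form = X.T @ Y
print(W)
```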
Gradient descent learning:

This is based on the minimization of the error E defined in terms of the weights and the activation function of the network. It is also required that the activation function employed by the network is differentiable, as the weight update is dependent on the gradient of the error E.
Thus if $\Delta w_{ij}$ is the weight update of the link connecting the $i^{th}$ and $j^{th}$ neurons of two neighbouring layers, then $\Delta w_{ij}$ is defined as,

$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$ ----------- eq(2)

where $\eta$ is the learning rate parameter and $\frac{\partial E}{\partial w_{ij}}$ is the error gradient with reference to the weight $w_{ij}$. The negative sign moves the weight against the gradient, so that E decreases.

Gradient Descent:
• Gradient Descent is a popular optimization technique in Machine Learning and Deep Learning, and it can be used with most learning algorithms.
• A gradient is the slope of a function.
• It measures the degree of change of a variable in response to the changes of another variable.
• Mathematically, the gradient is the vector of partial derivatives of the cost function with respect to its parameters. Gradient Descent is guaranteed to reach the global minimum only when the cost function is convex.
• The greater the gradient, the steeper the slope. Starting from an initial value, Gradient Descent is run iteratively to find the optimal values of the parameters, i.e. those that give the minimum possible value of the given cost function.
Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
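The iterative procedure can be sketched on a one-parameter convex cost. The cost function E(w) = (w - 3)^2, the learning rate and the number of steps below are assumptions picked for illustration.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    # Repeatedly step against the gradient: w <- w - eta * dE/dw
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Illustrative convex cost E(w) = (w - 3)^2 with gradient dE/dw = 2 (w - 3)
def cost_gradient(w):
    return 2.0 * (w - 3.0)

w_min = gradient_descent(cost_gradient, w0=0.0)
print(w_min)  # approaches the minimiser w = 3
```

With a learning rate that is too large (here, above 1.0) the same loop would overshoot and diverge, which is why the learning rate is treated as a hyperparameter to be tuned.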
Stochastic Gradient Descent (SGD):

• The word 'stochastic' means a system or a process that is linked with a random probability.
• Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration.
• In Gradient Descent, there is a term called "batch" which denotes the total number of samples from a dataset that is used for calculating the gradient for each iteration.
• In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset.
• Using the whole dataset is really useful for getting to the minima in a less noisy and less random manner, but the problem arises when our dataset gets big.
• Suppose you have a million samples in your dataset. If you use a typical Gradient Descent optimization technique, you will have to use all of the one million samples for completing one iteration, and this has to be done for every iteration until the minima is reached. Hence, it becomes computationally very expensive to perform.
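The difference between the three variants is only the batch size used for each gradient estimate. The sketch below fits a straight line to synthetic, noise-free data (the data, learning rate and epoch count are assumptions for illustration) with the batch set to the whole dataset, to one sample, and to ten samples.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size, eta=0.1, epochs=500, seed=0):
    # batch_size == len(y): batch gradient descent
    # batch_size == 1:      stochastic gradient descent
    # anything in between:  mini-batch gradient descent
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean squared error on the current batch only
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= eta * grad
    return w

# Synthetic data from the line y = 2x + 1 (second column of X is the bias input)
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0

w_batch = minibatch_sgd(X, y, batch_size=100)
w_sgd = minibatch_sgd(X, y, batch_size=1)
w_mini = minibatch_sgd(X, y, batch_size=10)
print(w_batch, w_sgd, w_mini)
```

Each SGD update touches a single sample, so it is far cheaper than a full-batch update, at the cost of a noisier path towards the minimum.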

Backpropagation

• A backpropagation network consists of an input layer of neurons, an output layer, and at least one hidden layer.
• The neurons compute a weighted sum of their inputs, which is then used by the activation function as its input, typically the sigmoid activation function.
• It makes use of supervised learning to teach the network.
• It keeps updating the weights of the network until the desired output is produced by the network.
• The following factors are responsible for the training and performance of the network:

o Random (initial) values of weights.
o A number of training cycles.
o A number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.

Working of Backpropagation

Consider the diagram given below.

1. The preconnected paths transfer the inputs X.
2. Then the weights W are randomly selected, which are used to model the input.
3. After that, the output is calculated for every individual neuron, passing from the input layer to the hidden layer and then to the output layer.
4. Lastly, the errors are evaluated at the outputs. ErrorB = Actual Output - Desired Output
5. The errors are sent back to the hidden layer from the output layer for adjusting the weights to lessen the error.
6. Keep iterating all of the steps until the desired result is achieved.

Need of Backpropagation
o It is fast as well as simple, so it is very easy to implement.
o Apart from the number of inputs, it has no other parameters to tune.
o As it does not require any kind of prior knowledge, it tends to be more flexible.
o It is a standard method that generally works well.

What is a Feed Forward Network?
A feedforward neural network is an artificial neural network where the nodes never form a cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is the first and simplest type of artificial neural network.

Types of Backpropagation Networks
Two types of Backpropagation Networks are:

• Static Back-propagation
• Recurrent Backpropagation

Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input to a static output. It is useful for solving static classification issues like optical character recognition.

Recurrent Backpropagation:
In recurrent backpropagation, the activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.

The main difference between these two methods is that the mapping is rapid in static back-propagation, while it is nonstatic in recurrent backpropagation.

Best practice Backpropagation
Backpropagation in a neural network can be explained with the help of the "Shoe Lace" analogy.

Too little tension =

• Not enough constraining and very loose

Too much tension =

• Too much constraint (overtraining)
• Taking too much time (relatively slow process)
• Higher likelihood of breaking

Pulling one lace more than the other =

• Discomfort (bias)

Disadvantages of using Backpropagation

• The actual performance of backpropagation on a specific problem is dependent on the input data.
• The backpropagation algorithm can be quite sensitive to noisy data.
• You need to use the matrix-based approach for backpropagation instead of mini-batch.

Backpropagation Process in Deep Neural Network

Backpropagation is one of the important concepts of a neural network. Our task is to classify our data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model, we use gradient descent to optimize the parameters. Similarly, here we also use the gradient descent algorithm, with the gradients computed by Backpropagation.

For a single training example, the Backpropagation algorithm calculates the gradient of the error function. Backpropagation can be written as a function of the neural network. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach which exploits the chain rule.

The main feature of Backpropagation is the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation function to be known at network design time.

Now, how is the error function used in Backpropagation, and how does Backpropagation work? Let us start with an example and work through it mathematically to understand exactly how the weights are updated using Backpropagation.
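Input values

x1 = 0.05
x2 = 0.10

Initial weights

w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55

Bias values

b1 = 0.35    b2 = 0.60

Target values

T1 = 0.01
T2 = 0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the value of H1 we first multiply the input values by the weights as

H1 = x1×w1 + x2×w2 + b1
H1 = 0.05×0.15 + 0.10×0.20 + 0.35
H1 = 0.3775

To calculate the final result of H1, we apply the sigmoid function as

$H1_{final} = \frac{1}{1 + e^{-H1}} = \frac{1}{1 + e^{-0.3775}} = 0.593269992$

We will calculate the value of H2 in the same way as H1

H2 = x1×w3 + x2×w4 + b1
H2 = 0.05×0.25 + 0.10×0.30 + 0.35
H2 = 0.3925

To calculate the final result of H2, we apply the sigmoid function as

$H2_{final} = \frac{1}{1 + e^{-H2}} = \frac{1}{1 + e^{-0.3925}} = 0.596884378$

Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.

To find the value of y1, we first multiply the input values, i.e. the outcomes of H1 and H2, by the weights as

y1 = H1×w5 + H2×w6 + b2
y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60
y1 = 1.10590597

To calculate the final result of y1 we apply the sigmoid function as

$y1_{final} = \frac{1}{1 + e^{-y1}} = \frac{1}{1 + e^{-1.10590597}} = 0.75136507$

We will calculate the value of y2 in the same way as y1

y2 = H1×w7 + H2×w8 + b2
y2 = 0.593269992×0.50 + 0.596884378×0.55 + 0.60
y2 = 1.2249214

To calculate the final result of y2, we apply the sigmoid function as

$y2_{final} = \frac{1}{1 + e^{-y2}} = \frac{1}{1 + e^{-1.2249214}} = 0.772928465$

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match our target values T1 and T2.
Now, we will find the total error, which is computed from the squared differences between the target outputs and the actual outputs. The total error is calculated as

$E_{total} = \sum \frac{1}{2}(target - output)^2$

So, the total error is

$E_{total} = \frac{1}{2}(0.01 - 0.75136507)^2 + \frac{1}{2}(0.99 - 0.772928465)^2 = 0.274811083 + 0.023560026 = 0.298371109$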
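The forward pass above can be reproduced with a few lines of Python. The variable names mirror the notes; the squared-error form of the total error is the one consistent with the value 0.298371109 quoted later in this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the worked example
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden layer: weighted sum, then sigmoid
h1 = x1 * w1 + x2 * w2 + b1
h2 = x1 * w3 + x2 * w4 + b1
h1_final, h2_final = sigmoid(h1), sigmoid(h2)

# Output layer: weighted sum of the hidden outputs, then sigmoid
y1 = h1_final * w5 + h2_final * w6 + b2
y2 = h1_final * w7 + h2_final * w8 + b2
y1_final, y2_final = sigmoid(y1), sigmoid(y2)

# Total error: half the squared difference, summed over both outputs
e_total = 0.5 * (t1 - y1_final) ** 2 + 0.5 * (t2 - y2_final) ** 2
print(h1_final, h2_final, y1_final, y2_final, e_total)
```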

Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight with the help of the total error. The error on weight w is calculated by differentiating the total error with respect to w.

We perform the backward process, so first consider the last-layer weight w5 as

From equation two, it is clear that we cannot partially differentiate it with respect to w5 because there is no w5 in it. We split equation one into multiple terms so that we can easily differentiate it with respect to w5 as

$\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1} \times \frac{\partial y1}{\partial w5}$

Now, we calculate each term one by one to differentiate Etotal with respect to w5 as
Putting the value of e^(-y) in equation (5)

So, we put the values of these terms in equation no (3) to find the final result.
Now, we will calculate the updated weight w5new with the help of the following formula

$w5_{new} = w5 - \eta \times \frac{\partial E_{total}}{\partial w5}$

where $\eta$ is the learning rate. In the same way, we calculate w6new, w7new, and w8new and this will give us the following values

w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121
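The output-layer update can be checked numerically. The sketch below evaluates the three chain-rule factors for w5 and then updates w5 to w8. The learning rate of 0.5 is an assumption: it is not stated in the notes, but it is the value that reproduces the updated weights listed above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
eta = 0.5  # assumed learning rate (inferred from the listed results)

# Forward pass (same as in the worked example)
h1_final = sigmoid(x1 * w1 + x2 * w2 + b1)
h2_final = sigmoid(x1 * w3 + x2 * w4 + b1)
y1_final = sigmoid(h1_final * w5 + h2_final * w6 + b2)
y2_final = sigmoid(h1_final * w7 + h2_final * w8 + b2)

# Chain rule for w5: dE/dw5 = dE/dy1_final * dy1_final/dy1 * dy1/dw5
dE_dy1final = y1_final - t1                # derivative of 1/2 (t1 - y1_final)^2
dy1final_dy1 = y1_final * (1 - y1_final)   # derivative of the sigmoid
dy1_dw5 = h1_final                         # y1 = h1_final*w5 + h2_final*w6 + b2
dE_dw5 = dE_dy1final * dy1final_dy1 * dy1_dw5

# The first two factors are shared by every weight feeding the same output
delta1 = dE_dy1final * dy1final_dy1
delta2 = (y2_final - t2) * y2_final * (1 - y2_final)

w5_new = w5 - eta * delta1 * h1_final
w6_new = w6 - eta * delta1 * h2_final
w7_new = w7 - eta * delta2 * h1_final
w8_new = w8 - eta * delta2 * h2_final
print(dE_dw5, w5_new, w6_new, w7_new, w8_new)
```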

Backward pass at Hidden layer

Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3, and w4 as we have done with the w5, w6, w7, and w8 weights.

We will calculate the error at w1 as

From equation (2), it is clear that we cannot partially differentiate it with respect to w1 because there is no w1 in it. We split equation (1) into multiple terms so that we can easily differentiate it with respect to w1 as

$\frac{\partial E_{total}}{\partial w1} = \frac{\partial E_{total}}{\partial H1_{final}} \times \frac{\partial H1_{final}}{\partial H1} \times \frac{\partial H1}{\partial w1}$

Now, we calculate each term one by one to differentiate Etotal with respect to w1 as
We again split this because there is no H1final term in Etotal as

$\frac{\partial E_{total}}{\partial H1_{final}} = \frac{\partial E_1}{\partial H1_{final}} + \frac{\partial E_2}{\partial H1_{final}}$

These will again split because in E1 and E2 there is no H1 term. Splitting is done as

We again split both because there is no y1 and y2 term in E1 and E2. We split it as

Now, we find the value of each term by putting values in equation (18) and (19) as

From equation (18)

From equation (8)
From equation (19)

Putting the value of e^(-y2) in equation (23)

From equation (21)

Now from equation (16) and (17)
Put the value of this term in equation (15) as

We have $\frac{\partial E_{total}}{\partial H1_{final}}$; we need to figure out $\frac{\partial H1_{final}}{\partial H1}$ as
Putting the value of e^(-H1) in equation (30)

We calculate the partial derivative of the total net input to H1 with respect to w1 the same as we did for the output neuron:

So, we put the values of these terms in equation (13) to find the final result.

Now, we will calculate the updated weight w1new with the help of the following formula

$w1_{new} = w1 - \eta \times \frac{\partial E_{total}}{\partial w1}$

In the same way, we calculate w2new, w3new, and w4new and this will give us the following values

w1new = 0.149780716
w2new = 0.19956143
w3new = 0.24975114
w4new = 0.29950229

We have updated all the weights. We found the error 0.298371109 on the network when we fed forward the 0.05 and 0.1 inputs. After the first round of Backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e., values near our targets, when we feed forward 0.05 and 0.1.
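The whole procedure (forward pass, backward pass through both layers, weight update, repeat) can be sketched in vectorised form. The matrix layout, the learning rate of 0.5 and the choice to leave the biases fixed are assumptions made here so that the numbers line up with the worked example; hidden-layer gradients use the output-layer weights from before the update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])
t = np.array([0.01, 0.99])
W_hidden = np.array([[0.15, 0.20],    # w1, w2 feed H1
                     [0.25, 0.30]])   # w3, w4 feed H2
W_output = np.array([[0.40, 0.45],    # w5, w6 feed y1
                     [0.50, 0.55]])   # w7, w8 feed y2
b1, b2 = 0.35, 0.60
eta = 0.5  # assumed learning rate

def forward(W_h, W_o):
    h = sigmoid(W_h @ x + b1)
    y = sigmoid(W_o @ h + b2)
    return h, y

def total_error(y):
    return 0.5 * np.sum((t - y) ** 2)

def train_step(W_h, W_o):
    h, y = forward(W_h, W_o)
    delta_out = (y - t) * y * (1 - y)               # error signal at y1, y2
    delta_hid = (W_o.T @ delta_out) * h * (1 - h)   # propagated back to H1, H2
    # Gradient for each weight = error signal of its target neuron * its input
    return (W_h - eta * np.outer(delta_hid, x),
            W_o - eta * np.outer(delta_out, h))

# One round of backpropagation
W_h1, W_o1 = train_step(W_hidden, W_output)
error_after_one = total_error(forward(W_h1, W_o1)[1])

# Many rounds
W_h, W_o = W_hidden, W_output
for _ in range(10000):
    W_h, W_o = train_step(W_h, W_o)
final_outputs = forward(W_h, W_o)[1]
final_error = total_error(final_outputs)
print(W_h1, error_after_one, final_outputs, final_error)
```

After one call to `train_step` the first row of `W_h1` holds w1new and w2new, and after the loop the outputs sit close to the targets 0.01 and 0.99.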

2.5.1 Difference Between a Shallow Net & Deep Learning Net:

Sl.No | Shallow Nets | Deep Learning Nets
1 | One hidden layer (or a very small number of hidden layers) | Deep nets have many hidden layers, with more neurons in each layer
2 | Takes input only as VECTORS | DL can have raw data like image, text as inputs
3 | Shallow nets need more parameters to have a better fit | DL can fit functions better with fewer parameters than a shallow network
4 | Shallow networks with one hidden layer (same no. of neurons as DL) cannot place complex functions over the input space | DL can compactly express highly complex functions over the input space
5 | The number of units in a shallow network grows exponentially with task complexity | DL doesn't need to increase its size (neurons) for complex problems
6 | A shallow network is more difficult to train with our current algorithms (e.g. it has issues of local minima etc.) | Training in DL is easy and there is no issue of local minima in DL

The Vanishing Gradient Problem

The Problem, Its Causes, Its Significance, and Its Solutions
The problem:
As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.
Why:

Certain activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will cause only a small change in the output. Hence, the derivative becomes small.

Image 1: The sigmoid function and its derivative

As an example, Image 1 is the sigmoid function and its derivative. Note how, when the inputs of the sigmoid function become larger or smaller (when |x| becomes bigger), the derivative becomes close to zero.

Why it's significant:

For a shallow network with only a few layers that use these activations, this isn't a big problem. However, when more layers are used, it can cause the gradient to be too small for training to work effectively.

Gradients of neural networks are found using backpropagation. Simply put, backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one. By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial one) to compute the derivatives of the initial layers.
However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the initial layers.

A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. Since these initial layers are often crucial to recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole network.
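The exponential decay can be made concrete. The sigmoid derivative is at most 0.25 (at x = 0), so a product of n such factors is at most 0.25^n. The sketch below looks only at the activation derivatives and ignores the weight terms that also appear in the chain rule, which is a simplifying assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks quickly as |x| grows
max_derivative = sigmoid_derivative(0.0)
print(max_derivative, sigmoid_derivative(5.0), sigmoid_derivative(10.0))

# Best case for the product of n sigmoid derivatives along a chain of n layers
for n in (1, 5, 10, 20):
    print(n, max_derivative ** n)

bound_10_layers = max_derivative ** 10
```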

Solutions:

The simplest solution is to use other activation functions, such as ReLU, which doesn't cause a small derivative (its derivative is 1 for every positive input).

Residual networks are another solution, as they provide residual connections straight to earlier layers. As seen in Image 2, the residual connection directly adds the value at the beginning of the block, x, to the end of the block (F(x)+x). This residual connection doesn't go through activation functions that "squash" the derivatives, resulting in a higher overall derivative of the block.

Image 2: A residual block
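The effect of the skip connection on the derivative can be seen with a scalar toy block. Here F is just a sigmoid standing in for the body of the block (an assumption for illustration; a real block contains weight layers), and the derivatives are estimated numerically.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def F(x):
    # Stand-in for the layers inside the block (assumed here to be one sigmoid)
    return sigmoid(x)

def residual_block(x):
    # The skip connection adds the block input to the block output
    return F(x) + x

def numerical_derivative(f, x, eps=1e-6):
    # Central difference approximation of df/dx
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 6.0  # a point where the sigmoid is saturated
grad_plain = numerical_derivative(F, x)
grad_residual = numerical_derivative(residual_block, x)
print(grad_plain, grad_residual)
```

Since d(F(x) + x)/dx = F'(x) + 1, the block's derivative stays close to 1 even where F'(x) has almost vanished.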

Finally, batch normalization layers can also resolve the issue. As stated before, the problem arises when a large input space is mapped to a small one, causing the derivatives to disappear. In Image 1, this is most clearly seen when |x| is big. Batch normalization reduces this problem by simply normalizing the input so that |x| doesn't reach the outer edges of the sigmoid function. As seen in Image 3, it normalizes the input so that most of it falls in the green region, where the derivative isn't too small.
Image 3: Sigmoid function with restricted inputs

Hyperparameters in Machine Learning

Hyperparameters in Machine learning are those parameters that are explicitly defined by the user to control the learning process. These hyperparameters are used to improve the learning of the model, and their values are set before starting the learning process of the model.

• Here the prefix "hyper" suggests that the parameters are top-level parameters that are used in controlling the learning process.
• The value of the hyperparameter is selected and set by the machine learning engineer before the learning algorithm begins training the model.
• Hence, these are external to the model, and their values are not learned or changed by the training process.

Some examples of Hyperparameters in Machine Learning
o The k in kNN or K-Nearest Neighbour algorithm
o Learning rate for training a neural network
o Train-test split ratio
o Batch Size
o Number of Epochs
o Branches in Decision Tree
o Number of clusters in Clustering Algorithm
Model Parameters:

Model parameters are configuration variables that are internal to the model, and a model learns them on its own. Examples are the weights or coefficients of independent variables in the linear regression model, the weights or coefficients of independent variables in SVM, the weights and biases of a neural network, and the cluster centroids in clustering. Some key points for model parameters are as follows:
o They are used by the model for making predictions.
o They are learned by the model from the data itself.
o These are usually not set manually.
o These are part of the model and key to a machine learning algorithm.

Model Hyperparameters:

Hyperparameters are those parameters that are explicitly defined by the user to control the learning process. Some key points for model hyperparameters are as follows:

o These are usually defined manually by the machine learning engineer.
o One cannot know the exact best value of the hyperparameters for the given problem. The best value can be determined either by rule of thumb or by trial and error.
o Some examples of hyperparameters are the learning rate for training a neural network and K in the KNN algorithm.
Categories of Hyperparameters

Broadly, hyperparameters can be divided into two categories, which are given below:

1. Hyperparameter for Optimization
2. Hyperparameter for Specific Models

Hyperparameter for Optimization

The process of selecting the best hyperparameters to use is known as hyperparameter tuning, and the tuning process is also known as hyperparameter optimization. Optimization parameters are used for optimizing the model.

Some of the popular optimization parameters are given below:
o Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls how much the model needs to change in response to the estimated error each time the model's weights are updated. It is one of the crucial parameters while building a neural network, and it also determines the frequency of cross-checking with model parameters. Selecting the optimized learning rate is a challenging task because if the learning rate is too small, it may slow down the training process. On the other hand, if the learning rate is too large, it may not optimize the model properly.

o Batch Size: To enhance the speed of the learning process, the training set is divided into different subsets, which are known as batches.

o Number of Epochs: An epoch can be defined as one complete cycle of training the machine learning model on the training set. Epochs represent an iterative learning process. The number of epochs varies from model to model, and most models are trained for more than one epoch. To determine the right number of epochs, the validation error is taken into account. The number of epochs is increased as long as the validation error keeps reducing. If there is no improvement in the validation error for consecutive epochs, it indicates that we should stop increasing the number of epochs.
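The stopping rule for the number of epochs can be sketched as a small function. The validation errors below are made-up numbers, and the `patience` parameter (how many consecutive epochs without improvement are tolerated) is a name introduced here for illustration.

```python
def epochs_to_train(val_errors, patience=2):
    # Return the epoch with the best validation error, stopping the search
    # once `patience` consecutive epochs bring no improvement
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch - patience
    return len(val_errors)

# Hypothetical validation error recorded after each epoch
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.60]
stop_epoch = epochs_to_train(val_errors)
print(stop_epoch)
```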

Hyperparameter for Specific Models

Hyperparameters that are involved in the structure of the model are known as hyperparameters for specific models. These are given below:

o A number of Hidden Units: Hidden units are part of neural networks, which refer to the components comprising the layers of processors between the input and output units in a neural network.

It is important to specify the number of hidden units hyperparameter for the neural network. As a rule of thumb, it should be between the size of the input layer and the size of the output layer. More specifically, a common heuristic is that the number of hidden units should be 2/3 of the size of the input layer, plus the size of the output layer.

For complex functions, it is necessary to specify a sufficient number of hidden units, but not so many that the model overfits.

o Number of Layers: A neural network is made up of vertically arranged components, which are called layers. There are mainly input layers, hidden layers, and output layers. A 3-layered neural network often gives a better performance than a 2-layered network. For a Convolutional Neural Network, a greater number of layers often makes a better model.
Batch Normalization:

• It is a method of adaptive reparameterization, motivated by the difficulty of training very deep models. In deep networks, the weights are updated for each layer.
• So the output will no longer be on the same scale as the input (even though the input is normalized).
• Normalization is a data pre-processing tool used to bring the numerical data to a common scale without distorting its shape.
• When we input the data to a machine or deep learning algorithm, we tend to change the values to a balanced scale so that our model can generalize appropriately. (Normalization is used to bring the input into a balanced scale/range.)

Image Source: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
• Even though the input X was normalized, the output is no longer on the same scale.
• The data passes through multiple layers of the network and (sigmoidal) activation functions are applied multiple times, which leads to an internal co-variate shift in the data.
• This motivates us to move towards Batch Normalization.
• Normalization is the process of altering the input data to have a mean of zero and a standard deviation of one.
Procedure to do Batch Normalization:
(1) Consider the batch input from layer h. For this layer we need to calculate the mean of the hidden activations. After calculating the mean, the next step is to calculate the standard deviation of the hidden activations.
(2) Now we normalize the hidden activations using these mean and standard deviation values. To do this, we subtract the mean from each input and divide the result by the sum of the standard deviation and the smoothing term (ε).
(3) As the final stage, the re-scaling and offsetting of the input is performed. Here two components of the BN algorithm are used, γ (gamma) and β (beta). These parameters are used for re-scaling (γ) and shifting (β) the vector containing the values from the previous operations.
These two parameters are learnable. Hence, during the training of the neural network, the optimal values of γ and β are obtained and used, and we get an accurate normalization of each batch.
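The three steps of the procedure can be sketched for one batch of hidden activations. The batch shape, the value of ε, and the fixed γ and β are assumptions for illustration; in training, γ and β would be learned. The division follows the description above (standard deviation plus ε); many implementations instead divide by the square root of the variance plus ε, which behaves almost identically.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    # Step 1: mean and standard deviation of each hidden unit over the batch
    mean = h.mean(axis=0)
    std = h.std(axis=0)
    # Step 2: normalise, with eps as the smoothing term
    h_norm = (h - mean) / (std + eps)
    # Step 3: re-scale with gamma and shift with beta
    return gamma * h_norm + beta

rng = np.random.default_rng(0)
# Illustrative batch: 32 samples, 4 hidden units, far from zero mean / unit scale
h = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

normalized = batch_norm(h, gamma=np.ones(4), beta=np.zeros(4))
rescaled = batch_norm(h, gamma=2.0 * np.ones(4), beta=np.ones(4))
print(normalized.mean(axis=0), normalized.std(axis=0))
```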
Regularization
Definition:-“anymodificationwemaketoalearningalgorithmthatisintended to
reduce its generalization error but not its training error.”
 In the context of deep learning, most regularization strategies are based on regularizing estimators.
 Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
 Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by $\tilde{J}$:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\,\Omega(\theta)$$
where α ∈ [0, ∞) is a hyperparameter that weights the contribution of the norm penalty term, Ω, relative to the standard objective function J. Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.
For neural networks, the parameter norm penalty Ω typically penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized.
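The effect of α on the regularized objective can be illustrated with a small sketch. All names here (`regularized_objective`, `half_squared_norm`) and the numbers are hypothetical; half the squared norm of the weights is used only as one example of a penalty Ω. The biases are simply not passed to the penalty, so they stay unregularized.

```python
def half_squared_norm(weights):
    """Example penalty Omega: half the sum of the squared weights."""
    return 0.5 * sum(w * w for w in weights)

def regularized_objective(base_loss, weights, alpha, penalty=half_squared_norm):
    """Return J~ = J + alpha * Omega(weights); only weights are penalized."""
    if alpha < 0:
        raise ValueError("alpha must lie in [0, inf)")
    return base_loss + alpha * penalty(weights)

weights = [3.0, -4.0]   # hypothetical layer weights
base_loss = 2.0         # hypothetical value of J on some batch

for alpha in (0.0, 0.1, 1.0):
    # alpha = 0 gives back J unchanged; larger alpha adds more penalty
    print(alpha, regularized_objective(base_loss, weights, alpha))
```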
L2 Regularization
One of the simplest and most common kinds of parameter norm penalty is the L2 parameter norm penalty, commonly known as weight decay. This regularization strategy drives the weights closer to the origin by adding a regularization term
$$\Omega(\theta) = \frac{1}{2}\lVert w \rVert_2^2$$
to the objective function. L2 regularization is also known as ridge regression or Tikhonov regularization.


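To simplify, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:
$$\tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y)$$
with the corresponding parameter gradient
$$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y).$$
A single gradient step with learning rate $\epsilon$ therefore updates the weights as
$$w \leftarrow w - \epsilon\,(\alpha w + \nabla_w J(w; X, y)) = (1 - \epsilon\alpha)\, w - \epsilon\, \nabla_w J(w; X, y).$$
We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. To see what happens over the entire course of training, we make a quadratic approximation $\hat{J}$ of the objective function in the neighborhood of the weights $w^*$ that minimize the unregularized training cost. The approximation $\hat{J}$ is given by
$$\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^\top H\,(w - w^*)$$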
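where $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$. The minimum of $\hat{J}$ occurs where its gradient $\nabla_w \hat{J}(w) = H(w - w^*)$ is equal to 0. To study the effect of weight decay, we add the weight decay gradient and solve for the minimum $\tilde{w}$ of the regularized version of $\hat{J}$:
$$\alpha\tilde{w} + H(\tilde{w} - w^*) = 0$$
$$(H + \alpha I)\,\tilde{w} = H w^*$$
$$\tilde{w} = (H + \alpha I)^{-1} H w^*$$
As α approaches 0, the regularized solution $\tilde{w}$ approaches $w^*$. But what happens as α grows? Because $H$ is real and symmetric, we can decompose it into a diagonal matrix $\Lambda$ and an orthonormal basis of eigenvectors, $Q$, such that $H = Q \Lambda Q^\top$. Applying the decomposition to the above equation, we obtain
$$\tilde{w} = (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^* = Q(\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*.$$
In other words, the component of $w^*$ that is aligned with the i-th eigenvector of $H$ is rescaled by a factor of $\lambda_i / (\lambda_i + \alpha)$, where $\lambda_i$ is the corresponding eigenvalue.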
Figure 2: Weight updation effect
The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L2 regularizer. At the point $\tilde{w}$, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of $J$ is small. The objective function does not increase much when moving horizontally away from $w^*$. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls $w_1$ close to zero. In the second dimension, the objective function is very sensitive to movements away from $w^*$. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of $w_2$ relatively little.
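The behaviour described for the two dimensions can be checked numerically. The sketch below is an illustration under simplifying assumptions: the Hessian is taken to be diagonal (so its eigenvectors are the coordinate axes), the eigenvalues 0.1 and 10, the learning rate, and the function name `train_with_weight_decay` are all made up for the example. It runs the weight decay update, shrinking w by (1 − learning rate × α) and then taking the usual gradient step, and compares the result with the factor λ / (λ + α) applied to each component of w*.

```python
def train_with_weight_decay(w_star, eigenvalues, alpha, lr=0.1, steps=2000):
    """Gradient descent on a quadratic objective with a diagonal Hessian,
    using the weight decay update w <- (1 - lr*alpha) * w - lr * grad J."""
    w = [0.0 for _ in w_star]
    for _ in range(steps):
        # Gradient of the unregularized quadratic: lambda_i * (w_i - w*_i)
        grad = [lam * (wi - ws) for lam, wi, ws in zip(eigenvalues, w, w_star)]
        # Shrink the weights, then apply the usual gradient update
        w = [(1 - lr * alpha) * wi - lr * g for wi, g in zip(w, grad)]
    return w

w_star = [1.0, 1.0]         # unregularized optimum
eigenvalues = [0.1, 10.0]   # small curvature along w1, large along w2
alpha = 1.0

w_tilde = train_with_weight_decay(w_star, eigenvalues, alpha)
closed_form = [lam / (lam + alpha) * ws for lam, ws in zip(eigenvalues, w_star)]
print(w_tilde)       # w1 is pulled close to zero, w2 stays near 1
print(closed_form)   # same values from lambda / (lambda + alpha)
```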

L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L1 regularization.
 L1 regularization on the model parameter w is defined as the sum of absolute values of the individual parameters:
$$\Omega(\theta) = \lVert w \rVert_1 = \sum_i |w_i|$$
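L1 weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function $\tilde{J}(w; X, y)$ is given by
$$\tilde{J}(w; X, y) = \alpha \lVert w \rVert_1 + J(w; X, y)$$
with the corresponding gradient
$$\nabla_w \tilde{J}(w; X, y) = \alpha\, \mathrm{sign}(w) + \nabla_w J(w; X, y) \qquad (1)$$
where sign(w) is the sign of w applied element-wise.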
By inspecting equation 1, we can see immediately that the effect of L1 regularization is quite different from that of L2 regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with each $w_i$; instead it is a constant factor with a sign equal to sign($w_i$).
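The difference between the two gradient contributions can be made concrete with a short sketch. The helper names and the example weights are assumptions for illustration; for a weight that is exactly zero the sketch uses 0 as the L1 contribution, which is one common convention.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def l2_penalty_grad(weights, alpha):
    """Gradient of (alpha/2) * sum(w_i^2): scales linearly with each weight."""
    return [alpha * w for w in weights]

def l1_penalty_grad(weights, alpha):
    """Gradient of alpha * sum(|w_i|): constant magnitude alpha, sign of w_i."""
    return [alpha * sign(w) for w in weights]

weights = [-2.0, -0.01, 0.0, 0.5, 4.0]   # hypothetical weights
alpha = 0.1

print(l2_penalty_grad(weights, alpha))   # small weights get a tiny push
print(l1_penalty_grad(weights, alpha))   # every nonzero weight gets a push of size alpha
```

Because the L1 push does not shrink as a weight gets small, small weights tend to be driven all the way to zero, while L2 only shrinks them proportionally.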

Difference between L1 & L2 Parameter Regularization
Difference between Normalization and Standardization

| Normalization | Standardization |
|---|---|
| This technique uses the minimum and maximum values for scaling. | This technique uses the mean and standard deviation for scaling. |
| It is helpful when features are of different scales. | It is helpful when the mean of a variable should be set to 0 and the standard deviation to 1. |
| Scaled values range between [0, 1] or [-1, 1]. | Scaled values are not restricted to a specific range. |
| It is affected by outliers. | It is comparatively less affected by outliers. |
| Scikit-Learn provides a transformer called MinMaxScaler for normalization. | Scikit-Learn provides a transformer called StandardScaler for standardization. |
| It is also called scaling normalization. | It is known as Z-score normalization. |
| It is useful when the feature distribution is unknown. | It is useful when the feature distribution is normal. |
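The two techniques in the table can be compared on a small example. This is a plain-Python sketch with made-up data and helper names (`min_max_normalize`, `standardize`); it assumes the feature is not constant, so neither denominator is zero. In practice the Scikit-Learn transformers named in the table would normally be used instead.

```python
import math

def min_max_normalize(values):
    """Rescale values to the range [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

data = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical feature; the last value is an outlier

print(min_max_normalize(data))   # bounded in [0, 1]; the outlier squeezes the rest near 0
print(standardize(data))         # unbounded, with mean 0 and standard deviation 1
```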


Dropout in Neural Networks
A Neural Network (NN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Since such a network is created artificially in machines, we refer to it as an Artificial Neural Network (ANN).
Problem: When a fully-connected layer has a large number of neurons, co-adaptation is more likely to happen. Co-adaptation refers to when multiple neurons in a layer extract the same, or very similar, hidden features from the input data. This can happen when the connection weights for two different neurons are nearly identical.

This poses two different problems to our model:
 Wastage of the machine's resources when computing the same output.
 If many neurons are extracting the same features, it adds more significance to those features for our model. This leads to overfitting if the duplicate extracted features are specific to only the training set.
Solution to the problem: As the title suggests, we use dropout while training the NN to minimize co-adaptation. In dropout, we randomly shut down some fraction of a layer's neurons at each training step by zeroing out the neuron values. The fraction of neurons to be zeroed out is known as the dropout rate. The remaining neurons have their values multiplied by 1 / (1 − dropout rate) so that the overall sum of the neuron values remains the same on average.


The two images represent dropout applied to a layer of 6 units, shown at multiple training steps. The dropout rate is 1/3, and the remaining 4 neurons at each training step have their values scaled by 1.5, that is, 1 / (1 − 1/3). Thereby, we are training a random sample of the neurons at each step rather than the whole network at once. This reduces co-adaptation, so the neurons learn the hidden features better.
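The 6-unit example can be reproduced with a short sketch. The function name `dropout_exact` and the seed are assumptions for illustration. To mirror the description above, it zeroes out exactly the stated fraction of units at each step; library implementations usually drop each unit independently with probability equal to the dropout rate, so the number of dropped units varies from step to step.

```python
import random

def dropout_exact(values, rate, rng):
    """Zero out a fraction `rate` of the units and scale the rest by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    n_drop = round(rate * len(values))
    dropped = set(rng.sample(range(len(values)), n_drop))  # indices to shut down
    scale = 1.0 / (1.0 - rate)
    return [0.0 if i in dropped else v * scale for i, v in enumerate(values)]

rng = random.Random(0)   # fixed seed so the run is repeatable
layer = [1.0] * 6        # hypothetical outputs of a 6-unit layer

for step in range(3):
    # A different pair of units is zeroed at each training step
    print(step, dropout_exact(layer, rate=1 / 3, rng=rng))
```

Dropout is applied only during training; at inference time all units are kept and no scaling is needed with this formulation.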
Why does dropout work?
By using dropout, in every iteration you work on a smaller neural network than the full one, and it therefore acts as a form of regularization.
Dropout helps in shrinking the squared norm of the weights, and this tends to reduce overfitting.
