AD3451 ML UNIT 4 NOTES

Uploaded by

thiyagaraj3561

AL3451 - MACHINE LEARNING

UNIT-4
UNIT IV NEURAL NETWORKS 9

Multilayer perceptron, activation functions, network training - gradient descent optimization - stochastic gradient descent, error backpropagation, from shallow networks to deep networks - unit saturation (aka the vanishing gradient problem) - ReLU, hyperparameter tuning, batch normalization, regularization, dropout

Multi-layer Perceptron

A multi-layer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. It is formed from multiple layers of perceptrons.

The pictorial representation of multi-layer perceptron learning is as shown below -

MLP networks are used for supervised learning. The typical learning algorithm for MLP networks is the backpropagation algorithm.

An MLP is characterized by several layers of nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. With several hidden layers, an MLP is a deep learning method.

Activation Functions in Neural Networks

Elements of a Neural Network

Downloaded from www.eduengineering.net
Input Layer: This layer accepts input features. It provides information from the outside world to the network. No computation is performed at this layer; the nodes here just pass the information (features) on to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the abstraction provided by any neural network. The hidden layer performs all sorts of computation on the features entered through the input layer and transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not. It is applied to the weighted sum of the neuron's inputs plus the bias. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Explanation: A neural network has neurons that work with weights, biases, and their respective activation functions. In a neural network, we update the weights and biases of the neurons on the basis of the error at the output. This process is known as back-propagation. Differentiable activation functions make back-propagation possible, since their gradients are supplied along with the error to update the weights and biases.
Why do we need a non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
Mathematical proof
Suppose we have a neural net like this:

Elements of the diagram are as follows:
Hidden layer, i.e. layer 1:
$z^{(1)} = W^{(1)} X + b^{(1)}$
$a^{(1)} = z^{(1)}$
Here,
• $z^{(1)}$ is the vectorized output of layer 1
• $W^{(1)}$ is the vectorized weights assigned to the neurons of the hidden layer, i.e. w1, w2, w3 and w4
• $X$ is the vectorized input features, i.e. i1 and i2
• $b^{(1)}$ is the vectorized bias assigned to the neurons in the hidden layer, i.e. b1 and b2
• $a^{(1)}$ is the vectorized form of any linear function.
(Note: We are not considering an activation function here.)

Layer 2, i.e. output layer:
Note: The input for layer 2 is the output from layer 1.
$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$
$a^{(2)} = z^{(2)}$

Calculation at the output layer
$z^{(2)} = W^{(2)} [W^{(1)} X + b^{(1)}] + b^{(2)}$
$z^{(2)} = [W^{(2)} W^{(1)}] X + [W^{(2)} b^{(1)} + b^{(2)}]$
Let
$W = W^{(2)} W^{(1)}$
$b = W^{(2)} b^{(1)} + b^{(2)}$
Final output: $z^{(2)} = W X + b$
which is again a linear function.
Even after adding a hidden layer, the output is still a linear function of the input. Hence we can conclude that no matter how many hidden layers we attach to the neural net, the whole network behaves like a single linear layer, because the composition of two linear functions is itself a linear function. A network built only from linear functions cannot learn non-linear mappings. A non-linear activation function lets it learn them from the error gradient. Hence we need an activation function.
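The collapse of two linear layers into one can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly chosen weights; the shapes (two inputs, two hidden neurons, one output) and the variable names are illustrative choices, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 2 input features (i1, i2), 2 hidden neurons, 1 output.
X = rng.normal(size=(2, 5))    # 5 samples stored as columns
W1 = rng.normal(size=(2, 2))   # hidden-layer weights
b1 = rng.normal(size=(2, 1))   # hidden-layer biases
W2 = rng.normal(size=(1, 2))   # output-layer weights
b2 = rng.normal(size=(1, 1))   # output-layer bias

# Two stacked linear layers with no activation function.
z2_two_layers = W2 @ (W1 @ X + b1) + b2

# The equivalent single linear layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2
z2_single_layer = W @ X + b

collapsed = np.allclose(z2_two_layers, z2_single_layer)
print(collapsed)
```

Both computations give the same output for every sample, which is why extra layers add no expressive power without a non-linearity between them.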
Variants of Activation Function
Linear Function
• Equation: A linear function has an equation similar to that of a straight line, i.e. y = x.
• No matter how many layers we have, if all are linear in nature, the final activation of the last layer is nothing but a linear function of the input of the first layer.
• Range: -inf to +inf
• Uses: The linear activation function is used in just one place, i.e. the output layer.
• Issues: If we differentiate a linear function, the result no longer depends on the input "x". The derivative is a constant, so it does not introduce any ground-breaking behavior to our algorithm.
For example: Calculating the price of a house is a regression problem. A house price may take any large or small value, so we can apply a linear activation at the output layer. Even in this case the neural net must have a non-linear function in the hidden layers.
Sigmoid Function

• It is a function which is plotted as an 'S'-shaped graph.
• Equation: $A = 1 / (1 + e^{-x})$
• Nature: Non-linear. Notice that for X values between -2 and 2 the curve is very steep. This means small changes in x bring about large changes in the value of Y.
• Value Range: 0 to 1
• Uses: Usually used in the output layer of a binary classifier, where the result is either 0 or 1. Since the value of the sigmoid function lies between 0 and 1, the result can be predicted to be 1 if the value is greater than 0.5 and 0 otherwise.

Tanh Function

• The activation that almost always works better than the sigmoid function is the tanh function, also known as the hyperbolic tangent function. It is a mathematically shifted and scaled version of the sigmoid function. Both are similar and can be derived from each other.
• Equation: $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1 = 2\,\mathrm{sigmoid}(2x) - 1$
• Value Range: -1 to +1
• Nature: Non-linear
• Uses: Usually used in the hidden layers of a neural network. Its values lie between -1 and 1, so the mean of the hidden layer's outputs comes out to be 0 or very close to it. This helps in centering the data, which makes learning for the next layer much easier.
ReLU Function

• It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly implemented in the hidden layers of a neural network.
• Equation: A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
• Value Range: [0, inf)
• Nature: Non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons activated by the ReLU function.
• Uses: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any time only some of the neurons are activated, which makes the network sparse and efficient to compute.
In simple words, networks using ReLU typically learn much faster than those using the sigmoid or tanh function.
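The three activation functions described so far can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function names are chosen here for readability and do not come from any particular library.

```python
import numpy as np

def sigmoid(x):
    """S-shaped curve with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent with outputs in (-1, 1), centered at 0."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: x for positive inputs, 0 otherwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))

# tanh is a shifted and scaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))
```

The last line checks numerically that tanh and sigmoid can be derived from each other.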

Softmax Function

The softmax function is a generalization of the sigmoid function that is handy when we are trying to handle multi-class classification problems.
• Nature: Non-linear
• Uses: Usually used when trying to handle multiple classes. The softmax function is commonly found in the output layer of image classification networks. It squeezes the output for each class between 0 and 1 and divides by the sum of the outputs, so that the outputs add up to 1.
• Output: The softmax function is ideally used in the output layer of a classifier, where we are trying to obtain the probabilities that define the class of each input.
• The basic rule of thumb is: if you really don't know what activation function to use, simply use ReLU, as it is a general-purpose activation function for hidden layers and is used in most cases these days.
• If your output is for binary classification, the sigmoid function is a very natural choice for the output layer.
• If your output is for multi-class classification, softmax is very useful for predicting the probabilities of each class.
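The softmax computation described above can be sketched as follows, assuming NumPy. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition of this sketch; the example scores are made up.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # stability trick; does not change the result
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])      # hypothetical scores for three classes
probs = softmax(scores)
print(probs)
print(probs.argmax())                   # index of the most probable class
```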

4.3. Network Training
• Training: It is the process in which the network is taught to change its weights and biases.
• Learning: It is the internal process of training, where the artificial neural system learns to update/adapt the weights and biases.

Different training/learning procedures available in ANN are

• Supervised learning
• Unsupervised learning
• Reinforced learning
• Hebbian learning
• Gradient descent learning
• Competitive learning
• Stochastic learning
1.4.1. Requirements of Learning Laws:
• The learning law should lead to convergence of the weights
• Learning or training time should be short for capturing the information from the training pairs
• Learning should use the local information
• The learning process should be able to capture the complex nonlinear mapping between the input and output pairs
• Learning should be able to capture as many patterns as possible
• Storage of the pattern information gathered at the time of learning should be high for the given network

Figure 3: Different Training methods of ANN

Supervised learning:

Every input pattern that is used to train the network is associated with an output pattern, which is the target or desired pattern.
A teacher is assumed to be present during the training process: the network's computed output is compared with the correct expected output to determine the error. The error can then be used to change network parameters, which results in an improvement in performance.
Unsupervised learning:

In this learning method the target output is not presented to the network. It is as if there is no teacher to present the desired patterns, and hence the system learns on its own by discovering and adapting to structural features in the input patterns.
Reinforced learning:

In this method a teacher, though available, does not present the expected answer but only indicates whether the computed output is correct or incorrect. The information provided helps the network in the learning process.
Hebbian learning:

This rule was proposed by Hebb and is based on correlative weight adjustment. This is the oldest learning mechanism inspired by biology. In this, the input-output pattern pairs $(x_i, y_i)$ are associated by the weight matrix W, known as the correlation matrix.
It is computed as

$W = \sum_{i=1}^{n} x_i y_i^T$ ------------ eq(1)

Here $y_i^T$ is the transpose of the associated output vector $y_i$. Numerous variants of the rule have been proposed.
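The correlation matrix of eq(1) can be sketched with NumPy as a sum of outer products. The two bipolar pattern pairs below are made-up examples, and the helper name `hebbian_weights` is hypothetical.

```python
import numpy as np

def hebbian_weights(inputs, outputs):
    """W = sum_i x_i y_i^T for pattern pairs stored as rows of inputs/outputs."""
    return sum(np.outer(x, y) for x, y in zip(inputs, outputs))

# Two illustrative bipolar pattern pairs (not taken from the notes).
X = np.array([[1.0, -1.0, 1.0],
              [-1.0, 1.0, 1.0]])
Y = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

W = hebbian_weights(X, Y)
print(W)

# Presenting a stored input recalls the sign pattern of its associated output.
recalled = np.sign(X[0] @ W)
print(recalled)
```

Summing the outer products is the same as computing `X.T @ Y`, which is usually how it would be written for larger pattern sets.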
Gradient descent learning:

This is based on the minimization of the error E, defined in terms of the weights and the activation function of the network. It is also required that the activation function employed by the network is differentiable, as the weight update depends on the gradient of the error E.
Thus, if $\Delta w_{ij}$ is the weight update of the link connecting the $i$-th and $j$-th neurons of two neighbouring layers, then $\Delta w_{ij}$ is defined as

$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$ ----------- eq(2)

where $\eta$ is the learning rate parameter and $\frac{\partial E}{\partial w_{ij}}$ is the error gradient with reference to the weight $w_{ij}$. The negative sign moves the weight in the direction that decreases E.

Gradient Descent:
• Gradient descent is a popular optimization technique used by learning algorithms in machine learning and deep learning.
• A gradient is the slope of a function.
• It measures the degree of change of a variable in response to changes in another variable.
• Mathematically, the gradient is the vector of partial derivatives of the cost function with respect to its parameters. When the cost function is convex, gradient descent with a suitable learning rate converges to its global minimum.
• The greater the gradient, the steeper the slope. Starting from an initial value, gradient descent is run iteratively to find the values of the parameters that give the minimum possible value of the given cost function.
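The iterative procedure described above can be sketched on a one-parameter cost function. This is a minimal illustration: the cost $(w - 3)^2$, the starting point, the learning rate and the step count are all assumptions made for the example.

```python
def gradient_descent(gradient, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

# Cost function E(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```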
Types of Gradient Descent:
Typically, there are three types of gradient descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Stochastic Gradient Descent (SGD):

• The word 'stochastic' describes a system or process that is linked with random probability.
• Hence, in stochastic gradient descent, a few samples (often just one) are selected randomly for each iteration instead of the whole dataset.
• In gradient descent, the term "batch" denotes the total number of samples from the dataset that are used for calculating the gradient in each iteration.
• In typical gradient descent optimization, such as batch gradient descent, the batch is taken to be the whole dataset.
• Using the whole dataset is useful for reaching the minimum in a less noisy and less random manner, but a problem arises when the dataset gets big.
• Suppose you have a million samples in your dataset. With a typical gradient descent optimization technique, you have to use all one million samples to complete one iteration, and this has to be done for every iteration until the minimum is reached. Hence, it becomes computationally very expensive to perform.
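The idea of using a small random batch per iteration can be sketched on a linear regression problem. This is an illustrative example assuming NumPy; the synthetic data, batch size, learning rate and the function name `sgd_linear_regression` are choices made here, not part of the notes.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=8, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ X @ w using gradients computed on small random batches."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]     # indices of one mini-batch
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the mean squared error
            w -= learning_rate * grad
    return w

# Synthetic, noise-free data so the recovered weights are easy to check.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_hat = sgd_linear_regression(X, y)
print(w_hat)
```

Setting `batch_size=1` gives pure stochastic gradient descent, and setting it to the number of samples gives batch gradient descent.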

Backpropagation

• A backpropagation network consists of an input layer of neurons, an output layer, and at least one hidden layer.
• Each neuron computes a weighted sum of its inputs, which is then passed to an activation function, commonly the sigmoid activation function.
• It makes use of supervised learning to teach the network.
• It repeatedly updates the weights of the network until the desired output is produced by the network.
• The following factors are responsible for the training and performance of the network:

o Random (initial) values of weights.
o The number of training cycles.
o The number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.

Working of Backpropagation

Consider the diagram given below.

1. The inputs X arrive through the preconnected paths.
2. The weights W are randomly selected and are used to model the input.
3. The output is then calculated for every neuron, passing from the input layer to the hidden layer and then to the output layer.
4. Next, the errors in the outputs are evaluated. ErrorB = Actual Output - Desired Output
5. The errors are sent back from the output layer to the hidden layer to adjust the weights and lessen the error.
6. Keep iterating the whole process until the desired result is achieved.

Need of Backpropagation
o It is fast and simple, and it is easy to implement.
o It has few parameters to tune beyond the number of inputs and teaching parameters such as the learning rate.
o It does not require any prior knowledge about the network, so it tends to be flexible.
o It is a standard method that generally works well.

What is a Feed Forward Network?
A feedforward neural network is an artificial neural network in which the connections between nodes never form a cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is the first and simplest type of artificial neural network.

Types of Backpropagation Networks
Two types of backpropagation networks are:

• Static back-propagation
• Recurrent backpropagation

Static back-propagation:
It is a kind of backpropagation network that produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.

Recurrent backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.

The main difference between these two methods is that the mapping is rapid in static back-propagation, while it is non-static in recurrent backpropagation.

Best practice Backpropagation
Backpropagation in a neural network can be explained with the help of the "shoe lace" analogy.

Too little tension =

• Not enough constraining and very loose

Too much tension =

• Too much constraint (overtraining)
• Taking too much time (relatively slow process)
• Higher likelihood of breaking

Pulling one lace more than the other =

• Discomfort (bias)

Disadvantages of using Backpropagation

• The actual performance of backpropagation on a specific problem depends on the input data.
• The backpropagation algorithm can be quite sensitive to noisy data.
• You need to use the matrix-based approach for backpropagation instead of mini-batch.

Backpropagation Process in Deep Neural Network

Backpropagation is one of the important concepts of a neural network. Our task is to classify our data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model, we use gradient descent to optimize the parameters. Similarly, here we also use the gradient descent algorithm, with the gradients computed by backpropagation.

For a single training example, the backpropagation algorithm calculates the gradient of the error function, which can be written as a function of the neural network's weights. Backpropagation algorithms are a set of methods used to efficiently train artificial neural networks following a gradient descent approach which exploits the chain rule.

The main feature of backpropagation is the iterative, recursive and efficient method through which it calculates the updated weights, improving the network until it is able to perform the task for which it is being trained. Backpropagation requires the derivatives of the activation functions to be known at network design time.

Now, how is the error function used in backpropagation, and how does backpropagation work? Let us start with an example and work through it mathematically to understand exactly how the weights are updated using backpropagation.
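Input values

x1 = 0.05
x2 = 0.10

Initial weights

w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55

Bias values

b1 = 0.35    b2 = 0.60

Target values

T1 = 0.01
T2 = 0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the value of H1, we first multiply the input values by the weights and add the bias:

H1 = x1×w1 + x2×w2 + b1
H1 = 0.05×0.15 + 0.10×0.20 + 0.35
H1 = 0.3775

To calculate the final result of H1, we apply the sigmoid function:

H1final = 1 / (1 + e^(-H1)) = 1 / (1 + e^(-0.3775)) = 0.593269992

We calculate the value of H2 in the same way as H1:

H2 = x1×w3 + x2×w4 + b1
H2 = 0.05×0.25 + 0.10×0.30 + 0.35
H2 = 0.3925

To calculate the final result of H2, we apply the sigmoid function:

H2final = 1 / (1 + e^(-H2)) = 1 / (1 + e^(-0.3925)) = 0.596884378

Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.

To find the value of y1, we multiply the outputs of H1 and H2 by the weights and add the bias:

y1 = H1final×w5 + H2final×w6 + b2
y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60
y1 = 1.10590597

To calculate the final result of y1, we apply the sigmoid function:

y1final = 1 / (1 + e^(-y1)) = 1 / (1 + e^(-1.10590597)) = 0.75136507

We calculate the value of y2 in the same way as y1:

y2 = H1final×w7 + H2final×w8 + b2
y2 = 0.593269992×0.50 + 0.596884378×0.55 + 0.60
y2 = 1.2249214

To calculate the final result of y2, we apply the sigmoid function:

y2final = 1 / (1 + e^(-y2)) = 1 / (1 + e^(-1.2249214)) = 0.772928465

Our target values are 0.01 and 0.99. Our y1final and y2final values do not match our target values T1 and T2.
Now, we will find the total error, which is half the sum of the squared differences between the target outputs and the actual outputs. The total error is calculated as

Etotal = E1 + E2 = ½ (T1 - y1final)² + ½ (T2 - y2final)²
Etotal = ½ (0.01 - 0.75136507)² + ½ (0.99 - 0.772928465)²
Etotal = 0.274811083 + 0.023560026

So, the total error is

Etotal = 0.298371109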
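Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer

To update a weight, we calculate the error corresponding to that weight with the help of the total error. The error on weight w is calculated by differentiating the total error with respect to w.

We perform the backward process by first considering the weight w5, i.e. we need $\partial E_{total} / \partial w_5$.

Etotal does not contain w5 explicitly, so we cannot partially differentiate it with respect to w5 directly. We split the derivative into multiple terms using the chain rule so that we can easily differentiate with respect to w5:

$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1} \times \frac{\partial y1}{\partial w_5}$

Now, we calculate each term one by one, using y1final = 0.75136507 and H1final = 0.593269992 from the forward pass:

$\partial E_{total} / \partial y1_{final}$ = y1final - T1 = 0.75136507 - 0.01 = 0.74136507
$\partial y1_{final} / \partial y1$ = y1final × (1 - y1final) = 0.75136507 × 0.24863493 = 0.186815602
$\partial y1 / \partial w_5$ = H1final = 0.593269992

So, we put these values into the chain rule expression to find the final result:

$\partial E_{total} / \partial w_5$ = 0.74136507 × 0.186815602 × 0.593269992 = 0.082167041

Now, we will calculate the updated weight w5new with the help of the following formula, where the learning rate η = 0.5 is the value that reproduces the results listed below:

w5new = w5 - η × $\partial E_{total} / \partial w_5$ = 0.40 - 0.5 × 0.082167041 = 0.35891648

In the same way, we calculate w6new, w7new and w8new, and this gives us the following values:

w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121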

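Backward pass at the hidden layer

Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3 and w4, as we have done with the w5, w6, w7 and w8 weights. The original (not yet updated) values of w5 to w8 are used in these calculations.

We will calculate the error at w1, i.e. $\partial E_{total} / \partial w_1$.

Etotal does not contain w1 explicitly, so we cannot partially differentiate it with respect to w1 directly. We split the derivative into multiple terms so that we can easily differentiate with respect to w1:

$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial H1_{final}} \times \frac{\partial H1_{final}}{\partial H1} \times \frac{\partial H1}{\partial w_1}$

Now, we calculate each term one by one. We again split the first term, because there is no H1final term in Etotal:

$\frac{\partial E_{total}}{\partial H1_{final}} = \frac{\partial E_1}{\partial H1_{final}} + \frac{\partial E_2}{\partial H1_{final}}$

Each of these is split again, because there is no H1final term in E1 and E2:

$\frac{\partial E_1}{\partial H1_{final}} = \frac{\partial E_1}{\partial y1} \times \frac{\partial y1}{\partial H1_{final}}$
$\frac{\partial E_2}{\partial H1_{final}} = \frac{\partial E_2}{\partial y2} \times \frac{\partial y2}{\partial H1_{final}}$

We again split both $\partial E_1 / \partial y1$ and $\partial E_2 / \partial y2$, because there is no y1 or y2 term in E1 and E2:

$\frac{\partial E_1}{\partial y1} = \frac{\partial E_1}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1}$
$\frac{\partial E_2}{\partial y2} = \frac{\partial E_2}{\partial y2_{final}} \times \frac{\partial y2_{final}}{\partial y2}$

Now, we find the values of $\partial E_1 / \partial y1$ and $\partial E_2 / \partial y2$ by putting in the values from the forward pass:

$\partial E_1 / \partial y1$ = 0.74136507 × 0.186815602 = 0.138498562
$\partial E_2 / \partial y2$ = (0.772928465 - 0.99) × 0.772928465 × (1 - 0.772928465) = -0.217071535 × 0.175510053 = -0.038098236

Since y1 = H1final×w5 + H2final×w6 + b2 and y2 = H1final×w7 + H2final×w8 + b2, we have $\partial y1 / \partial H1_{final}$ = w5 = 0.40 and $\partial y2 / \partial H1_{final}$ = w7 = 0.50. Therefore

$\partial E_1 / \partial H1_{final}$ = 0.138498562 × 0.40 ≈ 0.05539942
$\partial E_2 / \partial H1_{final}$ = -0.038098236 × 0.50 ≈ -0.01904912
$\partial E_{total} / \partial H1_{final}$ ≈ 0.05539942 - 0.01904912 = 0.03635031

We have $\partial E_{total} / \partial H1_{final}$; we need to figure out $\partial H1_{final} / \partial H1$ and $\partial H1 / \partial w_1$:

$\partial H1_{final} / \partial H1$ = H1final × (1 - H1final) = 0.593269992 × 0.406730008 = 0.241300709

We calculate the partial derivative of the total net input to H1 with respect to w1 the same way as we did for the output neuron:

$\partial H1 / \partial w_1$ = x1 = 0.05

So, we put these values into the chain rule expression to find the final result:

$\partial E_{total} / \partial w_1$ = 0.03635031 × 0.241300709 × 0.05 = 0.000438568

Now, we will calculate the updated weight w1new with the help of the following formula:

w1new = w1 - η × $\partial E_{total} / \partial w_1$ = 0.15 - 0.5 × 0.000438568 = 0.149780716

In the same way, we calculate w2new, w3new and w4new, and this gives us the following values:

w1new = 0.149780716
w2new = 0.19956143
w3new = 0.24975114
w4new = 0.29950229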

We have updated all the weights. We found an error of 0.298371109 on the network when we fed forward the 0.05 and 0.1 inputs. After the first round of backpropagation, the total error is down to 0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085. At this point, the output neurons generate 0.015912196 and 0.984065734, i.e. close to our target values, when we feed forward 0.05 and 0.1.
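The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the learning rate of 0.5 is inferred from the updated weights listed above, the biases are left unchanged (only w1 to w8 are listed as updated), and the function names are chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, b1, b2):
    """w holds [w1, ..., w8]; returns H1final, H2final, y1final, y2final."""
    h1 = sigmoid(x[0] * w[0] + x[1] * w[1] + b1)
    h2 = sigmoid(x[0] * w[2] + x[1] * w[3] + b1)
    y1 = sigmoid(h1 * w[4] + h2 * w[5] + b2)
    y2 = sigmoid(h1 * w[6] + h2 * w[7] + b2)
    return h1, h2, y1, y2

def total_error(y1, y2, t):
    return 0.5 * (t[0] - y1) ** 2 + 0.5 * (t[1] - y2) ** 2

def backprop_step(x, t, w, b1, b2, lr=0.5):
    """One gradient-descent update of all eight weights (biases kept fixed)."""
    h1, h2, y1, y2 = forward(x, w, b1, b2)
    d1 = (y1 - t[0]) * y1 * (1.0 - y1)    # dEtotal/dy1 (net input of output 1)
    d2 = (y2 - t[1]) * y2 * (1.0 - y2)    # dEtotal/dy2 (net input of output 2)
    grad_out = [d1 * h1, d1 * h2, d2 * h1, d2 * h2]                  # w5..w8
    dh1 = (d1 * w[4] + d2 * w[6]) * h1 * (1.0 - h1)                  # dEtotal/dH1
    dh2 = (d1 * w[5] + d2 * w[7]) * h2 * (1.0 - h2)                  # dEtotal/dH2
    grad_hidden = [dh1 * x[0], dh1 * x[1], dh2 * x[0], dh2 * x[1]]   # w1..w4
    grads = grad_hidden + grad_out
    return [wi - lr * gi for wi, gi in zip(w, grads)]

x, t = [0.05, 0.10], [0.01, 0.99]
w0 = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]
b1, b2 = 0.35, 0.60

_, _, y1, y2 = forward(x, w0, b1, b2)
initial_error = total_error(y1, y2, t)

w_new = backprop_step(x, t, w0, b1, b2)
_, _, y1, y2 = forward(x, w_new, b1, b2)
error_after_one_step = total_error(y1, y2, t)

w = w0
for _ in range(10000):
    w = backprop_step(x, t, w, b1, b2)
_, _, y1, y2 = forward(x, w, b1, b2)
final_error = total_error(y1, y2, t)

print(initial_error, error_after_one_step, final_error)
print(w_new)
```

The hidden-layer gradients are computed from the original w5 to w8, and all weights are replaced together at the end of the step, which matches the hand calculation.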

2.5.1 Difference Between a Shallow Net & Deep Learning Net:

Sl.No | Shallow Nets | Deep Learning Nets
1 | One hidden layer (or a very small number of hidden layers) | Deep nets have many hidden layers, with more neurons in each layer
2 | Take input only as vectors | DL can take raw data such as images and text as inputs
3 | Shallow nets need more parameters to get a better fit | DL can fit functions better with fewer parameters than a shallow network
4 | Shallow networks with one hidden layer (same number of neurons as DL) cannot place complex functions over the input space | DL can compactly express highly complex functions over the input space
5 | The number of units in a shallow network grows exponentially with task complexity | DL does not need to increase its size (neurons) in the same way for complex problems
6 | A shallow network is more difficult to train with our current algorithms (e.g. it has issues of local minima) | Training in DL is comparatively easier with current algorithms, and local minima are less of an issue

The Vanishing Gradient Problem

The Problem, Its Causes, Its Significance, and Its Solutions
The problem:
As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.
Why:

Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will cause only a small change in the output. Hence, the derivative becomes small.

Image 1: The sigmoid function and its derivative

As an example, Image 1 shows the sigmoid function and its derivative. Note how, when the input of the sigmoid function becomes larger or smaller (when |x| becomes bigger), the derivative becomes close to zero.

Why it's significant:

For a shallow network with only a few layers that use these activations, this isn't a big problem. However, when more layers are used, it can cause the gradient to be too small for training to work effectively.

Gradients of neural networks are found using backpropagation. Simply put, backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one. By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial) to compute the derivatives of the initial layers.
However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the initial layers.
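The exponential decay can be illustrated numerically. The sketch below assumes NumPy and looks only at the activation-derivative factor: the sigmoid derivative is at most 0.25, so a product of n such factors is at most 0.25^n. Weight matrices, which also enter the real gradient, are ignored here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

max_derivative = sigmoid_derivative(0.0)    # the derivative peaks at x = 0
saturated_derivative = sigmoid_derivative(10.0)

print(max_derivative, saturated_derivative)
for n_layers in (1, 5, 10, 20):
    # Best case for the activation factor after n sigmoid layers.
    print(n_layers, max_derivative ** n_layers)
```

Even in this best case the factor drops below one in a million after ten layers, and it is far smaller when the units are saturated.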

A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session. Since these initial layers are often crucial to recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole network.

Solutions:

The simplest solution is to use other activation functions, such as ReLU, whose derivative is 1 for every positive input and therefore does not shrink the gradient.

Residual networks are another solution, as they provide residual connections straight to earlier layers. As seen in Image 2, the residual connection directly adds the value at the beginning of the block, x, to the end of the block (F(x) + x). This residual connection doesn't go through activation functions that "squash" the derivatives, resulting in a higher overall derivative of the block.

Image 2: A residual block

Finally, batch normalization layers can also reduce the issue. As stated before, the problem arises when a large input space is mapped to a small one, causing the derivatives to disappear. In Image 1, this is most clearly seen when |x| is big. Batch normalization reduces this problem by normalizing the input so that |x| doesn't reach the outer edges of the sigmoid function. As seen in Image 3, it normalizes the input so that most of it falls in the green region, where the derivative isn't too small.
Image 3: Sigmoid function with restricted inputs

Hyperparameters in Machine Learning

Hyperparameters in machine learning are those parameters that are explicitly defined by the user to control the learning process. These hyperparameters are used to improve the learning of the model, and their values are set before starting the learning process of the model.

• Here the prefix "hyper" suggests that the parameters are top-level parameters that are used in controlling the learning process.
• The value of a hyperparameter is selected and set by the machine learning engineer before the learning algorithm begins training the model.
• Hence, these are external to the model, and their values are not learned or changed by the learning algorithm during training.

Some examples of hyperparameters in machine learning
o The k in the kNN or K-Nearest Neighbour algorithm
o Learning rate for training a neural network
o Train-test split ratio
o Batch size
o Number of epochs
o Branches in a decision tree
o Number of clusters in a clustering algorithm
Model Parameters:

Model parameters are configuration variables that are internal to the model, and a model learns them on its own. Examples are the weights or coefficients of the independent variables in a linear regression model or in an SVM, the weights and biases of a neural network, and the cluster centroids in clustering. Some key points for model parameters are as follows:
o They are used by the model for making predictions.
o They are learned by the model from the data itself.
o These are usually not set manually.
o These are part of the model and key to a machine learning algorithm.

Model Hyperparameters:

Hyperparameters are those parameters that are explicitly defined by the user to control the learning process. Some key points for model hyperparameters are as follows:

o These are usually defined manually by the machine learning engineer.
o One cannot know the exact best value of the hyperparameters for a given problem in advance. The best value can be determined either by rule of thumb or by trial and error.
o Some examples of hyperparameters are the learning rate for training a neural network and K in the KNN algorithm.
Categories of Hyperparameters

Broadly, hyperparameters can be divided into two categories, which are given below:

1. Hyperparameter for Optimization
2. Hyperparameter for Specific Models

Hyperparameter for Optimization

The process of selecting the best hyperparameters to use is known as hyperparameter tuning, and the tuning process is also known as hyperparameter optimization. Optimization parameters are used for optimizing the model.

Some of the popular optimization parameters are given below:
o Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls how much the model changes in response to the estimated error each time the model's weights are updated. It is one of the crucial parameters while building a neural network, because it determines the size of every update to the model parameters. Selecting a good learning rate is a challenging task: if the learning rate is too small, it may slow down the training process. On the other hand, if the learning rate is too large, the model may not be optimized properly.

o Batch Size: To enhance the speed of the learning process, the training set is divided into different subsets, which are known as batches.

o Number of Epochs: An epoch can be defined as one complete cycle of training the machine learning model on the training set. Epochs represent an iterative learning process. The number of epochs varies from model to model, and most models are trained for more than one epoch. To determine the right number of epochs, the validation error is taken into account. The number of epochs is increased as long as the validation error keeps reducing. If there is no improvement in the validation error for several consecutive epochs, it is a sign to stop increasing the number of epochs.
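The effect of the learning rate described above can be seen on a toy problem. This is a minimal sketch: the cost function $w^2$, the starting point, the step count and the three learning rates are illustrative assumptions.

```python
def run_gradient_descent(learning_rate, w0=1.0, steps=50):
    """Minimize the cost w**2 (minimum at w = 0) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w   # the gradient of w**2 is 2*w
    return w

too_small = run_gradient_descent(0.01)   # moves toward 0, but slowly
reasonable = run_gradient_descent(0.1)   # gets very close to 0
too_large = run_gradient_descent(1.1)    # overshoots more on every step and diverges

print(too_small, reasonable, too_large)
```

With the same number of steps, the small rate is still far from the minimum, the moderate rate has essentially converged, and the large rate has moved further away than where it started.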

Hyperparameter for Specific Models

Hyperparameters that are involved in the structure of the model are known as hyperparameters for specific models. These are given below:

o Number of Hidden Units: Hidden units are part of neural networks. They are the components that make up the layers of processors between the input and output units of a neural network.

It is important to specify the number of hidden units for the neural network. A common rule of thumb is that it should be between the size of the input layer and the size of the output layer. More specifically, the rule suggests that the number of hidden units should be 2/3 of the size of the input layer, plus the size of the output layer.

For complex functions, more hidden units are needed, but the number should not be so large that the model overfits.

o Number of Layers: A neural network is made up of vertically arranged components, which are called layers. There are mainly input layers, hidden layers, and output layers. A 3-layered neural network often gives better performance than a 2-layered network. For a convolutional neural network, a greater number of layers often makes a better model.
Batch Normalization:

• It is a method of adaptive reparameterization, motivated by the difficulty of training very deep models. In deep networks, the weights are updated for each layer.
• So the output of a layer will no longer be on the same scale as the input (even though the input is normalized).
• Normalization is a data pre-processing tool used to bring numerical data to a common scale without distorting its shape.
• When we input data to a machine learning or deep learning algorithm, we tend to change the values to a balanced scale so that our model can generalize appropriately. (Normalization is used to bring the input into a balanced scale/range.)

Image Source: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
• Even though the input X was normalized, the output is no longer on the same scale.
• The data passes through multiple layers of the network, with (sigmoidal) activation functions applied multiple times, which leads to an internal covariate shift in the data.
• This motivates us to move towards batch normalization.
• Normalization is the process of altering the input data to have a mean of zero and a standard deviation of one.
Procedure to do Batch Normalization:
(1) Consider the batch input from layer h. For this layer we first calculate the mean of the hidden activations. After calculating the mean, the next step is to calculate the standard deviation of the hidden activations.
(2) Now we normalize the hidden activations using these mean and standard deviation values. To do this, we subtract the mean from each input and divide the result by the sum of the standard deviation and the smoothing term (ε), which avoids division by zero.
(3) As the final stage, the re-scaling and offsetting of the input is performed. Here two components of the BN algorithm are used, γ (gamma) and β (beta). These parameters are used for re-scaling (γ) and shifting (β) the vector that contains the values from the previous operations.
These two parameters are learnable. Hence, during the training of the neural network, the optimal values of γ and β are obtained and used, and we get an appropriate normalization of each batch.
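The three steps above can be sketched for a batch of hidden activations, assuming NumPy. This sketch follows the common formulation that divides by sqrt(variance + ε), which differs slightly from adding ε to the standard deviation as described in step (2). It covers only the training-time forward pass; the function name and the example data are illustrative.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then re-scale and shift."""
    mean = h.mean(axis=0)                      # step 1: per-feature mean
    var = h.var(axis=0)                        # step 1: per-feature variance
    h_hat = (h - mean) / np.sqrt(var + eps)    # step 2: normalize
    return gamma * h_hat + beta                # step 3: re-scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
h = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 samples, 4 hidden units

# gamma = 1 and beta = 0 leave the normalized values unchanged.
out = batch_norm(h, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))
print(out.std(axis=0))
```

In a real network γ and β would be trained along with the weights, and running averages of the mean and variance would be kept for use at inference time.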
Regularization
Definition: “any modification we make to a learning algorithm that is
intended to reduce its generalization error but not its training error.”
 In the context of deep learning, most regularization strategies
are based on regularizing estimators.
 Regularization of an estimator works by trading increased bias
for reduced variance. An effective regularizer is one that makes a
profitable trade, reducing variance significantly while not overly
increasing the bias.
 Many regularization approaches are based on limiting the capacity of
models, such as neural networks, linear regression, or logistic
regression, by adding a parameter norm penalty Ω(θ) to the
objective function J. We denote the regularized objective function
by J̃:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)$$
where α ∈ [0, ∞) is a hyperparameter that weights the contribution of
the norm penalty term, Ω, relative to the standard objective function J.
Setting α to 0 results in no regularization. Larger values of α correspond
to more regularization.
For neural networks, we typically use a parameter norm penalty Ω that
penalizes only the weights of the affine transformation at each layer and
leaves the biases unregularized.
L2 Regularization
One of the simplest and most common kinds of parameter norm penalty is the
L2 parameter norm penalty, commonly known as weight decay. This
regularization strategy drives the weights closer to the origin by adding
a regularization term
$$\Omega(\theta) = \frac{1}{2} \lVert w \rVert_2^2$$
to the objective function.
L2 regularization is also known as ridge regression or Tikhonov regularization.

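To simplify, we assume no bias parameter, so θ is just w. Such a model has the
following total objective function:
$$\tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y)$$
with the corresponding parameter gradient
$$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$$
A single gradient step with learning rate η therefore updates the weights as
$$w \leftarrow w - \eta \left( \alpha w + \nabla_w J(w; X, y) \right) = (1 - \eta \alpha) \, w - \eta \nabla_w J(w; X, y)$$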
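A gradient step on the L2-regularized objective J(w) + (α/2) wᵀw can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name weight_decay_step, the learning rate lr and the sample numbers are assumptions made for the example. The data gradient is set to zero here so that only the shrinkage caused by the penalty is visible.

```python
def weight_decay_step(w, grad_J, alpha, lr):
    """One gradient step on J(w) + (alpha / 2) * sum(w_i ** 2)."""
    # Shrink every weight by the constant factor (1 - lr * alpha),
    # then apply the usual gradient update.
    shrink = 1.0 - lr * alpha
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad_J)]

w = [1.0, -2.0]
grad = [0.0, 0.0]   # zero data gradient isolates the effect of the penalty
for _ in range(3):
    w = weight_decay_step(w, grad, alpha=0.5, lr=0.1)
print(w)   # each weight has been multiplied by 0.95 three times
```

Because the factor (1 - lr * alpha) is the same for every weight, large and small weights shrink by the same proportion in each step.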

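We can see that the addition of the weight decay term has modified the
learning rule to multiplicatively shrink the weight vector by a constant
factor on each step, just before performing the usual gradient update. This
describes what happens in a single step.
To see what happens over the entire course of training, we make a quadratic
approximation of the objective function around w*, the value of the weights
that gives the minimal unregularized training cost. The approximation Ĵ is
given by
$$\hat{J}(w) = J(w^*) + \frac{1}{2} (w - w^*)^\top H \, (w - w^*)$$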
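where H is the Hessian matrix of J with respect to w, evaluated at w*.
The minimum of Ĵ occurs where its gradient ∇wĴ(w) = H(w − w*) is equal to 0.
To study the effect of weight decay, we add the weight decay gradient αw and
solve for the minimum w̃ of the regularized version of Ĵ:
$$\alpha \tilde{w} + H(\tilde{w} - w^*) = 0$$
$$(H + \alpha I) \, \tilde{w} = H w^*$$
$$\tilde{w} = (H + \alpha I)^{-1} H w^*$$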
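As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows?
Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an
orthonormal basis of eigenvectors, Q, such that H = QΛQᵀ. Applying the decomposition to the
expression w̃ = (H + αI)⁻¹Hw*, we obtain
$$\tilde{w} = (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*$$
So weight decay rescales w* along the axes defined by the eigenvectors of H: the component
along the i-th eigenvector is multiplied by λi / (λi + α).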
Figure 2: Weight update effect
The solid ellipses represent contours of equal value of the unregularized
objective. The dotted circles represent contours of equal value of the L2
regularizer. At the point w̃, these competing objectives reach an equilibrium.
In the first dimension, the eigenvalue of the Hessian of J is small. The
objective function does not increase much when moving horizontally away from
w*. Because the objective function does not express a strong preference along
this direction, the regularizer has a strong effect on this axis. The
regularizer pulls w1 close to zero. In the second dimension, the objective
function is very sensitive to movements away from w*. The corresponding
eigenvalue is large, indicating high curvature. As a result, weight decay
affects the position of w2 relatively little.
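The behaviour described for the two dimensions can be checked numerically. The sketch below assumes a diagonal Hessian, so that Q is the identity and each component of w* is simply rescaled by λi / (λi + α). The eigenvalues, w*, α and the learning rate are hypothetical values chosen for illustration, and the function name regularized_minimum is made up for this example.

```python
def regularized_minimum(eigenvalues, w_star, alpha):
    """Minimum of the quadratic approximation plus weight decay, diagonal H.

    With H = diag(eigenvalues), Q is the identity, so each component of
    w_star is rescaled by lambda_i / (lambda_i + alpha).
    """
    return [lam / (lam + alpha) * ws for lam, ws in zip(eigenvalues, w_star)]

eigenvalues = [0.1, 10.0]   # low curvature along w1, high curvature along w2
w_star = [1.0, 1.0]         # unregularized minimum (hypothetical values)
alpha = 1.0

w_tilde = regularized_minimum(eigenvalues, w_star, alpha)

# Cross-check with gradient descent on the regularized quadratic, whose
# gradient is alpha * w + H (w - w_star).
w = [0.0, 0.0]
lr = 0.05
for _ in range(2000):
    grad = [alpha * wi + lam * (wi - ws)
            for wi, lam, ws in zip(w, eigenvalues, w_star)]
    w = [wi - lr * gi for wi, gi in zip(w, grad)]

print("closed form:     ", [round(v, 4) for v in w_tilde])
print("gradient descent:", [round(v, 4) for v in w])
```

The first component, where the eigenvalue is small, is pulled most of the way to zero, while the second component, where the eigenvalue is large, stays close to w*.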

L1 Regularization

While L2 weight decay is the most common form of weight decay, there are
other ways to penalize the size of the model parameters. Another option is
to use L1 regularization.
 L1 regularization on the model parameter w is defined as the sum of the
absolute values of the individual parameters:
$$\Omega(\theta) = \lVert w \rVert_1 = \sum_i |w_i|$$
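L1 weight decay controls the strength of the regularization by scaling the
penalty Ω using a positive hyperparameter α. Thus, the regularized objective
function J̃(w; X, y) is given by
$$\tilde{J}(w; X, y) = \alpha \lVert w \rVert_1 + J(w; X, y)$$
with the corresponding gradient (strictly, sub-gradient)
$$\nabla_w \tilde{J}(w; X, y) = \alpha \, \mathrm{sign}(w) + \nabla_w J(w; X, y) \qquad (1)$$
where sign(w) is the sign of w applied element-wise.
By inspecting equation 1, we can see immediately that the effect of L1
regularization is quite different from that of L2 regularization. Specifically,
we can see that the regularization contribution to the gradient no longer
scales linearly with each wi; instead it is a constant factor with a sign
equal to sign(wi).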
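The difference between the two penalty gradients can be seen on a few sample weights. This is a small sketch with made-up weights and α; the helper names sign, l1_penalty_grad and l2_penalty_grad are assumptions for the example, and only the penalty part of the gradient is computed, not the data term.

```python
def sign(x):
    """Return -1.0, 0.0 or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))

def l1_penalty_grad(w, alpha):
    """Sub-gradient of alpha * sum(|w_i|): size alpha, sign of w_i."""
    return [alpha * sign(wi) for wi in w]

def l2_penalty_grad(w, alpha):
    """Gradient of (alpha / 2) * sum(w_i ** 2): proportional to w_i."""
    return [alpha * wi for wi in w]

weights = [-3.0, -0.01, 0.0, 0.5, 4.0]
alpha = 0.1
print("L1:", l1_penalty_grad(weights, alpha))
print("L2:", l2_penalty_grad(weights, alpha))
```

The L1 contribution has the same magnitude α for a weight of 0.01 as for a weight of 4, whereas the L2 contribution becomes tiny for small weights. This constant push is why L1 can drive small weights all the way to zero.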

Difference between L1 & L2 Parameter Regularization

Difference between Normalization and Standardization

Normalization | Standardization
This technique uses the minimum and maximum values for scaling. | This technique uses the mean and standard deviation for scaling.
It is helpful when features are of different scales. | It is helpful when a variable should have a mean of 0 and a standard deviation of 1.
Scaled values lie in the range [0, 1] or [-1, 1]. | Scaled values are not restricted to a specific range.
It is affected by outliers. | It is comparatively less affected by outliers.
Scikit-Learn provides a transformer called MinMaxScaler for normalization. | Scikit-Learn provides a transformer called StandardScaler for standardization.
It is also called scaling normalization. | It is also known as Z-score normalization.
It is useful when the feature distribution is unknown. | It is useful when the feature distribution is normal.
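The two techniques in the table can be sketched in plain Python. The function names and the sample feature values are made up for illustration. Scikit-Learn's MinMaxScaler (with its default range) and StandardScaler compute the same quantities per feature column, but they are not used here so that the example needs no extra packages.

```python
import math

def min_max_normalize(values):
    """Rescale values to the range [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

feature = [10.0, 12.0, 14.0, 16.0, 100.0]   # the last value is an outlier
normalized = min_max_normalize(feature)
standardized = standardize(feature)
print("min-max:", [round(v, 3) for v in normalized])
print("z-score:", [round(v, 3) for v in standardized])
```

In the min-max output the single outlier takes the value 1 and squeezes the other four values into a narrow band near 0, which illustrates the outlier row of the table.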

Dropout in Neural Networks
A Neural Network (NN) is based on a collection of connected units or nodes called
artificial neurons, which loosely model the neurons in a biological brain. Since such a
network is created artificially in machines, we refer to it as an Artificial Neural Network
(ANN).
Problem: When a fully-connected layer has a large number of neurons, co-adaptation is
more likely to happen. Co-adaptation refers to when multiple neurons in a layer extract
the same, or very similar, hidden features from the input data. This can happen when
the connection weights for two different neurons are nearly identical.

This poses two different problems to our model:
 Wastage of the machine’s resources when computing the same output.
 If many neurons are extracting the same features, it adds more significance to
those features for our model. This leads to overfitting if the duplicate extracted
features are specific to only the training set.
Solution to the problem: As the title suggests, we use dropout while training the NN to
minimize co-adaptation. In dropout, we randomly shut down some fraction of a layer’s
neurons at each training step by zeroing out the neuron values. The fraction of neurons
to be zeroed out is known as the dropout rate, r. The remaining neurons have their
values multiplied by 1/(1 − r) so that the expected overall sum of the neuron values
remains the same.

The two images represent dropout applied to a layer of 6 units, shown at multiple
training steps. The dropout rate is 1/3, and the remaining 4 neurons at each training
step have their values scaled by ×1.5. In this way, we train a random sample of the
neurons at each step rather than the whole network at once. This reduces co-adaptation
and helps the neurons learn the hidden features better.
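The dropout step described above can be sketched as follows, as a minimal illustration rather than a library implementation. The function name dropout and the seed are assumptions for the example. Each neuron is dropped independently with probability equal to the dropout rate, so the number of dropped neurons varies from step to step; the 2 out of 6 in the example above is the average case.

```python
import random

def dropout(values, rate, rng):
    """Apply dropout to one layer's activations during a training step."""
    scale = 1.0 / (1.0 - rate)
    out = []
    for v in values:
        if rng.random() < rate:
            out.append(0.0)          # neuron shut down for this step
        else:
            out.append(v * scale)    # surviving neuron is scaled up
    return out

rng = random.Random(0)               # fixed seed only for repeatability
layer = [1.0] * 6
for step in range(3):
    print(step, dropout(layer, rate=1/3, rng=rng))
```

With a rate of 1/3 the surviving values are multiplied by 1.5, matching the example. Dropout is applied only during training; at prediction time all neurons are kept and no scaling is needed, because the scaling was already done during training.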
Why does dropout work?
By using dropout, in every iteration you train a smaller neural network
than the full one, and this has a regularizing effect.
Dropout helps in shrinking the squared norm of the weights, and this tends
to reduce overfitting.
