AD3451 ML UNIT 4 NOTES
UNIT-4
UNIT IV NEURAL NETWORKS 9
Multi-layer Perceptron
The multi-layer perceptron is a more complex architecture of artificial neural networks than the
single-layer perceptron. It is formed from multiple layers of perceptrons.
The pictorial representation of multi-layer perceptron learning is as shown below -
A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of
outputs from a set of inputs. An MLP is characterized by several layers of nodes connected as a
directed graph between the input and output layers. MLP uses backpropagation for training the
network. MLP is a deep learning method.
Activation Functions in Neural Networks
Elements of a Neural Network
Downloadedfromwww.eduengineering.net
Input Layer: This layer accepts input features. It provides information from the outside world to
the network. No computation is performed at this layer; nodes here just pass on the information
(features) to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the
abstraction provided by any neural network. The hidden layer performs all sorts of computation on
the features entered through the input layer and transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network to the outer world.
What is an activation function and why use them?
The activation function decides whether a neuron should be activated or not by calculating the
weighted sum and further adding bias to it. The purpose of the activation function is to introduce
non-linearity into the output of a neuron.
Explanation: We know the neural network has neurons that work in correspondence with weight,
bias, and their respective activation function. In a neural network, we update the weights and
biases of the neurons on the basis of the error at the output. This process is known as
back-propagation. Activation functions make back-propagation possible since the gradients are
supplied along with the error to update the weights and biases.
Why do we need Non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The
activation function applies a non-linear transformation to the input, making the network capable of
learning and performing more complex tasks.
Mathematical proof
Suppose we have a Neural net like this :-
Elements of the diagram are as follows:
Hidden layer i.e. layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
z(1) is the vectorized output of layer 1
W(1) is the vectorized weights assigned to neurons of the hidden layer i.e. w1, w2, w3 and w4
X is the vectorized input features i.e. i1 and i2
b(1) is the vectorized bias assigned to neurons in the hidden layer i.e. b1 and b2
a(1) is the vectorized form of any linear function.
(Note: We are not considering activation function here)
Layer 2 i.e. output layer :-
Note: Input for layer 2 is output from layer 1
z(2) = W(2)a(1) + b(2)
a(2) = z(2)
Calculation at Output layer
z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2) * b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2) * b(1) + b(2)] = b
Final output : z(2) = W*X + b
which is again a linear function
Even after adding a hidden layer, the output is still a linear function of the input. Hence we can
conclude that no matter how many hidden layers we attach to a neural net, all layers together
behave like a single linear layer, because the composition of two linear functions is itself a
linear function. A network cannot learn non-linear mappings with only linear functions attached to
it. A non-linear activation function lets it learn such mappings from the error. Hence we need an
activation function.
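The collapse of two linear layers into one can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random weights and the layer sizes (2 inputs, 2 hidden neurons, 1 output) are illustrative choices, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 2 input features (i1, i2), 2 hidden neurons, 1 output.
X = rng.normal(size=(2, 5))      # 5 example input columns
W1 = rng.normal(size=(2, 2))     # hidden-layer weights W(1)
b1 = rng.normal(size=(2, 1))     # hidden-layer biases b(1)
W2 = rng.normal(size=(1, 2))     # output-layer weights W(2)
b2 = rng.normal(size=(1, 1))     # output-layer bias b(2)

# Two layers with no activation function: z(2) = W(2)(W(1)X + b(1)) + b(2)
z2_two_layers = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer: W = W(2)W(1), b = W(2)b(1) + b(2)
W = W2 @ W1
b = W2 @ b1 + b2
z2_one_layer = W @ X + b

print(np.allclose(z2_two_layers, z2_one_layer))
```

Both computations give the same outputs for every input column, which is why stacking linear layers adds no expressive power.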
Variants of Activation Function
Linear Function
Equation: A linear function has an equation similar to that of a straight line, i.e. y = x
No matter how many layers we have, if all are linear in nature, the final activation of the last
layer is nothing but a linear function of the input of the first layer.
Range: -inf to +inf
Uses: Linear activation function is used at just one place, i.e. the output layer.
Issues: If we differentiate a linear function, the result no longer depends on the input "x". The
derivative is a constant, so the gradient carries no information about the input and the function
cannot introduce any non-linear behavior into our algorithm.
For example: Calculation of the price of a house is a regression problem. House price may have any
big/small value, so we can apply linear activation at the output layer. Even in this case the
neural net must have a non-linear function at the hidden layers.
Sigmoid Function
It is a function which is plotted as an 'S' shaped graph.
Equation: A = 1/(1 + e^(-x))
Nature: Non-linear. Notice that for X values between -2 and 2, the curve is very steep. This means
small changes in x in that range bring about large changes in the value of Y.
Value Range: 0 to 1
Uses: Usually used in the output layer of a binary classification, where the result is either 0 or
1. As the value of the sigmoid function lies between 0 and 1 only, the result can be predicted
easily to be 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function
An activation that often works better than the sigmoid function in hidden layers is the Tanh
function, also known as the Tangent Hyperbolic function. It is actually a mathematically scaled and
shifted version of the sigmoid function. Both are similar and can be derived from each other.
Equation :- f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1, i.e. tanh(x) = 2 * sigmoid(2x) - 1
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network as its values lie between -1 and 1,
hence the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering
the data by bringing the mean close to 0, which makes learning for the next layer much easier.
RELU Function
It stands for Rectified Linear Unit. It is the most widely used activation function, chiefly
implemented in hidden layers of a neural network.
Equation :- A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the errors and have multiple layers
of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler
mathematical operations. At a time only a few neurons are activated, making the network sparse and
therefore efficient and easy to compute.
In simple words, networks using ReLU usually learn much faster than those using the sigmoid and
Tanh functions.
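The activation functions described above can be written in a few lines. This is a minimal sketch assuming NumPy; the function names and the sample inputs are illustrative.

```python
import numpy as np

def linear(x):
    """Linear activation y = x, range (-inf, +inf)."""
    return x

def sigmoid(x):
    """Sigmoid A = 1 / (1 + e^(-x)), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent, range (-1, +1)."""
    return np.tanh(x)

def relu(x):
    """ReLU A(x) = max(0, x), range [0, inf)."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))

# tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))
```

The last line checks the relationship between tanh and sigmoid numerically on the sample inputs.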
Softmax Function
4.3. Network Training
Training: It is the process in which the network is taught to change its weights and biases.
Learning: It is the internal process of training where the artificial neural system learns to
update/adapt the weights and biases.
Different Training/Learning procedures available in ANN are
Supervised learning
Unsupervised learning
Reinforced learning
Hebbian learning
Gradient descent learning
Competitive learning
Stochastic learning
1.4.1. Requirements of Learning Laws:
• The learning law should lead to convergence of weights
• Learning or training time should be short for capturing the information from the training pairs
• Learning should use the local information
• The learning process should be able to capture the complex nonlinear mapping available between
the input & output pairs
• Learning should be able to capture as many patterns as possible
• Storage of the pattern information gathered at the time of learning should be high for the given
network
Figure 3: Different Training methods of ANN
Supervised learning:
In this learning method each training input is paired with a target output. A teacher is assumed
to present the desired pattern, and the error between the computed and target outputs is used to
adjust the weights.
Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if there is no
teacher to present the desired patterns, and hence the system learns on its own by discovering and
adapting to structural features in the input patterns.
Reinforced learning:
In this method, a teacher, though available, does not present the expected answer but only
indicates if the computed output is correct or incorrect. The information provided helps the
network in the learning process.
Hebbian learning:
This rule was proposed by Hebb and is based on correlative weight adjustment. This is the oldest
learning mechanism inspired by biology. In this, the input-output pattern pairs (x_i, y_i) are
associated by the weight matrix W, known as the correlation matrix.
It is computed as
W = Σ_{i=1}^{n} x_i y_i^T ------------eq(1)
Here y_i^T is the transpose of the associated output vector y_i. Numerous variants of the rule
have been proposed.
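The correlation matrix of eq(1) is a sum of outer products. The sketch below assumes NumPy; the two bipolar pattern pairs are made-up illustrative data, not taken from these notes.

```python
import numpy as np

# Illustrative pattern pairs (n = 2): each row of X is an input x_i,
# each row of Y is the associated output y_i.
X = np.array([[1.0, -1.0, 1.0],
              [-1.0, 1.0, 1.0]])
Y = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

# eq(1): W = sum over i of x_i * y_i^T (an outer product per pair)
W = sum(np.outer(x_i, y_i) for x_i, y_i in zip(X, Y))

print(W)
# The same matrix can be computed in one step as X^T Y.
print(np.allclose(W, X.T @ Y))
```

W has one row per input component and one column per output component.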
Gradient descent learning:
This is based on the minimization of the error E defined in terms of the weights and the activation
function of the network. It is also required that the activation function employed by the network
is differentiable, as the weight update is dependent on the gradient of the error E.
Thus if ∆w_ij is the weight update of the link connecting the i-th and j-th neurons of the two
neighbouring layers, then ∆w_ij is defined as,
∆w_ij = -η ∂E/∂w_ij -----------eq(2)
where η is the learning rate parameter and ∂E/∂w_ij is the error gradient with reference to the
weight w_ij. The negative sign moves the weight against the gradient, so that E decreases.
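Gradient Descent:
Gradient Descent is a popular optimization technique in Machine Learning and Deep Learning and
can be used with most of the learning algorithms.
A gradient is the slope of a function.
It measures the degree of change of a variable in response to the changes of another variable.
Mathematically, the gradient of a cost function is the vector of its partial derivatives with
respect to its parameters. Gradient Descent is an iterative algorithm that repeatedly moves the
parameters a small step against this gradient. It is guaranteed to reach the global minimum only
when the cost function is convex.
The greater the gradient, the steeper the slope. Starting from an initial value, Gradient Descent
is run iteratively to find the optimal values of the parameters, i.e. those that give the minimum
possible value of the given cost function.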
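The iterative procedure can be sketched on a one-parameter cost function. The cost E(w) = (w - 3)^2, the starting value and the learning rate below are illustrative assumptions chosen so that the minimum (w = 3) is known in advance.

```python
def cost(w):
    """Illustrative convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative dE/dw of the cost above."""
    return 2.0 * (w - 3.0)

w = 0.0      # initial value
eta = 0.1    # learning rate (assumed value)

for step in range(100):
    # Move against the gradient: w <- w - eta * dE/dw
    w = w - eta * gradient(w)

print(w, cost(w))
```

Each step shrinks the distance to the minimum by a constant factor here, so w approaches 3 and the cost approaches 0.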
Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Stochastic Gradient Descent (SGD):
The word 'stochastic' means a system or a process that is linked with a random probability.
Hence, in Stochastic Gradient Descent, a single sample (or, in the mini-batch variant, a few
samples) is selected randomly instead of the whole data set for each iteration.
In Gradient Descent, there is a term called "batch" which denotes the total number of samples from
a dataset that is used for calculating the gradient for each iteration.
In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the
whole dataset.
Using the whole dataset is really useful for getting to the minima in a less noisy and less random
manner, but the problem arises when our dataset gets big.
Suppose you have a million samples in your dataset. If you use a typical Gradient Descent
optimization technique, you will have to use all of the one million samples for completing one
iteration while performing the Gradient Descent, and it has to be done for every iteration until
the minima is reached. Hence, it becomes computationally very expensive to perform.
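The difference between the three variants is only the batch size used per update. The following sketch assumes NumPy and fits a line y = w*x + b to synthetic data; the data, learning rate and epoch counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)   # true w = 2, b = 1

def gradient(w, b, xb, yb):
    """Gradient of the mean squared error on one batch."""
    err = w * xb + b - yb
    return 2.0 * np.mean(err * xb), 2.0 * np.mean(err)

def train(batch_size, epochs, eta=0.1):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(n)               # random sample order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            gw, gb = gradient(w, b, x[idx], y[idx])
            w, b = w - eta * gw, b - eta * gb
    return w, b

w_batch, b_batch = train(batch_size=n, epochs=200)   # batch GD: 1 update per epoch
w_mini, b_mini = train(batch_size=32, epochs=200)    # mini-batch GD
w_sgd, b_sgd = train(batch_size=1, epochs=5)         # SGD: 1 sample per update

print(w_batch, b_batch)
print(w_mini, b_mini)
print(w_sgd, b_sgd)
```

Batch gradient descent touches all 1000 samples for every single update, while the stochastic variants make many cheaper, noisier updates per pass over the data.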
Backpropagation
A backpropagation network consists of an input layer of neurons, an output layer, and at least one
hidden layer.
The neurons compute a weighted sum of their inputs, which is then used by the activation function
as an input, typically the sigmoid activation function.
It also makes use of supervised learning to teach the network.
It constantly updates the weights of the network until the desired output is met by the network.
It includes the following factors that are responsible for the training and performance of the
network:
o Random (initial) values of weights.
o A number of training cycles.
o A number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.
Working of Backpropagation
Consider the diagram given below.
1. The preconnected paths transfer the inputs X.
2. Then the weights W are randomly selected, which are used to model the input.
3. After that, the output is calculated for every individual neuron that passes from the input
layer to the hidden layer and then to the output layer.
4. Lastly, the errors are evaluated in the outputs. ErrorB = Actual Output - Desired Output
5. The errors are sent back to the hidden layer from the output layer for adjusting the weights to
lessen the error.
6. Until the desired result is achieved, keep iterating all of the processes.
Need of Backpropagation
o Since it is fast as well as simple, it is very easy to implement.
o Apart from the number of inputs, it does not have any other parameter to tune.
o As it does not necessitate any kind of prior knowledge, it tends to be more flexible.
o It is a standard method that generally works well.
What is a Feed Forward Network?
A feedforward neural network is an artificial neural network where the nodes never form a cycle.
This kind of neural network has an input layer, hidden layers, and an output layer. It is the first
and simplest type of artificial neural network.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
Recurrent Backpropagation in data mining is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between both of these methods is that the mapping is rapid in static
back-propagation while it is nonstatic in recurrent backpropagation.
Best practice Backpropagation
Backpropagation in a neural network can be explained with the help of the "Shoe Lace" analogy
Too little tension =
Not enough constraining and very loose
Too much tension =
Too much constraint (overtraining)
Taking too much time (relatively slow process)
Higher likelihood of breaking
Pulling one lace more than the other =
Discomfort (bias)
Disadvantages of using Backpropagation
The actual performance of backpropagation on a specific problem is dependent on the input data.
The backpropagation algorithm in data mining can be quite sensitive to noisy data.
You need to use the matrix-based approach for backpropagation instead of mini-batch.
Backpropagation Process in Deep Neural Network
The main feature of Backpropagation is the iterative, recursive and efficient method through which
it calculates the updated weights to improve the network until it is able to perform the task for
which it is being trained. Backpropagation requires the derivatives of the activation function to
be known at network design time.
Now, how is the error function used in Backpropagation, and how does Backpropagation work? Let us
start with an example and do it mathematically to understand exactly how the weights are updated
using Backpropagation.
Input values
x1 = 0.05
x2 = 0.10
Initial weights
w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55
Bias Values
b1 = 0.35    b2 = 0.60
Target Values
T1 = 0.01
T2 = 0.99
Now, we first calculate the values of H1 and H2 by a forward pass.
Forward Pass
To find the value of H1 we first multiply the input values by the weights as
H1 = x1×w1 + x2×w2 + b1
H1 = 0.05×0.15 + 0.10×0.20 + 0.35
H1 = 0.3775
To calculate the final result of H1, we apply the sigmoid function as
H1final = 1/(1 + e^(-H1)) = 1/(1 + e^(-0.3775)) = 0.593269992
We will calculate the value of H2 in the same way as H1
H2 = x1×w3 + x2×w4 + b1
H2 = 0.05×0.25 + 0.10×0.30 + 0.35
H2 = 0.3925
To calculate the final result of H2, we apply the sigmoid function as
H2final = 1/(1 + e^(-H2)) = 1/(1 + e^(-0.3925)) = 0.596884378
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we first multiply the input values, i.e. the outcomes H1final and
H2final, by the weights as
y1 = H1final×w5 + H2final×w6 + b2
y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60
y1 = 1.10590597
To calculate the final result of y1 we apply the sigmoid function as
y1final = 1/(1 + e^(-y1)) = 1/(1 + e^(-1.10590597)) = 0.75136507
We will calculate the value of y2 in the same way as y1
y2 = H1final×w7 + H2final×w8 + b2
y2 = 0.593269992×0.50 + 0.596884378×0.55 + 0.60
y2 = 1.2249214
To calculate the final result of y2, we apply the sigmoid function as
y2final = 1/(1 + e^(-y2)) = 1/(1 + e^(-1.2249214)) = 0.772928465
Our target values are 0.01 and 0.99. Our y1final and y2final values do not match our target values
T1 and T2.
Now, we will find the total error, which is the sum of the halved squared differences between the
target outputs and the outputs. The total error is calculated as
Etotal = Σ ½ (target - output)²
So, the total error is
Etotal = ½ (T1 - y1final)² + ½ (T2 - y2final)² = 0.274811083 + 0.023560026 = 0.298371109
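The forward pass can be reproduced with a few lines of Python. This is a minimal sketch using only the inputs, weights, biases and targets listed above; the variable names with the `_final` suffix are illustrative, and the squared-error form of the total error is the one that yields the value 0.298371109 quoted later in these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

# Hidden layer: weighted sum, then sigmoid
H1 = x1 * w1 + x2 * w2 + b1
H2 = x1 * w3 + x2 * w4 + b1
H1_final = sigmoid(H1)
H2_final = sigmoid(H2)

# Output layer: weighted sum of hidden outputs, then sigmoid
y1 = H1_final * w5 + H2_final * w6 + b2
y2 = H1_final * w7 + H2_final * w8 + b2
y1_final = sigmoid(y1)
y2_final = sigmoid(y2)

# Total error: sum of halved squared differences
E_total = 0.5 * (T1 - y1_final) ** 2 + 0.5 * (T2 - y2_final) ** 2

print(H1, H1_final, H2, H2_final)
print(y1, y1_final, y2, y2_final)
print(E_total)
```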
Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer
To update a weight, we calculate the error corresponding to that weight with the help of the total
error. The error on weight w is calculated by differentiating the total error with respect to w.
We perform the backward process, so first consider the last weight w5 as
Errorw5 = ∂Etotal/∂w5
Etotal does not contain w5 explicitly, so we cannot differentiate it with respect to w5 directly.
We split the derivative into multiple terms with the chain rule so that we can easily differentiate
with respect to w5 as
∂Etotal/∂w5 = ∂Etotal/∂y1final × ∂y1final/∂y1 × ∂y1/∂w5
Now, we calculate each term one by one to differentiate Etotal with respect to w5 as
∂Etotal/∂y1final = -(T1 - y1final)
∂y1final/∂y1 = e^(-y1)/(1 + e^(-y1))² = y1final × (1 - y1final)
∂y1/∂w5 = H1final
The updated weight is then w5new = w5 - η × ∂Etotal/∂w5, where η is the learning rate.
In the same way, we calculate w6new, w7new, and w8new and this will give us the following values
w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121
Backward pass at Hidden layer
Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3, and w4 as we
have done with the w5, w6, w7, and w8 weights.
We will calculate the error at w1 as
Errorw1 = ∂Etotal/∂w1
Etotal does not contain w1 explicitly, so we again split the derivative into multiple terms so that
we can easily differentiate with respect to w1 as
∂Etotal/∂w1 = ∂Etotal/∂H1final × ∂H1final/∂H1 × ∂H1/∂w1
Now, we calculate each term one by one to differentiate Etotal with respect to w1.
We again split the first term, because H1final affects Etotal through both outputs, as
∂Etotal/∂H1final = ∂E1/∂H1final + ∂E2/∂H1final
where E1 = ½ (T1 - y1final)² and E2 = ½ (T2 - y2final)², and
∂E1/∂H1final = ∂E1/∂y1final × ∂y1final/∂y1 × ∂y1/∂H1final = -(T1 - y1final) × y1final × (1 - y1final) × w5
∂E2/∂H1final = ∂E2/∂y2final × ∂y2final/∂y2 × ∂y2/∂H1final = -(T2 - y2final) × y2final × (1 - y2final) × w7
The second term is the derivative of the sigmoid function:
∂H1final/∂H1 = e^(-H1)/(1 + e^(-H1))² = H1final × (1 - H1final)
We calculate the partial derivative of the total net input to H1 with respect to w1 the same as we
did for the output neuron:
∂H1/∂w1 = x1
So, we put the values of these three terms into the chain-rule expression to find the final result.
Now, we will calculate the updated weight w1new with the help of the following formula
w1new = w1 - η × ∂Etotal/∂w1
In the same way, we calculate w2new, w3new, and w4new and this will give us the following values
w1new = 0.149780716
w2new = 0.19956143
w3new = 0.24975114
w4new = 0.29950229
We have updated all the weights. We found the error 0.298371109 on the network when we fed forward
the 0.05 and 0.1 inputs. After the first round of Backpropagation, the total error is down to
0.291027924. After repeating this process 10,000 times, the total error is down to 0.0000351085.
At this point, the output neurons generate 0.015912196 and 0.984065734, i.e., close to our target
values 0.01 and 0.99, when we feed forward the 0.05 and 0.1.
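One full round of backpropagation for this example can be sketched as follows. Two things are assumptions not stated in these notes: the learning rate η = 0.5 (chosen because it reproduces the updated weights listed above) and the choice to leave the biases b1 and b2 unchanged. The helper names `forward` and `backprop_step` are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = (0.05, 0.10)                  # inputs x1, x2
t = (0.01, 0.99)                  # targets T1, T2
b1, b2 = 0.35, 0.60               # biases (kept fixed: assumption)
eta = 0.5                         # learning rate (assumed, not given in the notes)
w = {1: 0.15, 2: 0.20, 3: 0.25, 4: 0.30,
     5: 0.40, 6: 0.45, 7: 0.50, 8: 0.55}

def forward(w):
    h1 = sigmoid(x[0] * w[1] + x[1] * w[2] + b1)
    h2 = sigmoid(x[0] * w[3] + x[1] * w[4] + b1)
    y1 = sigmoid(h1 * w[5] + h2 * w[6] + b2)
    y2 = sigmoid(h1 * w[7] + h2 * w[8] + b2)
    error = 0.5 * (t[0] - y1) ** 2 + 0.5 * (t[1] - y2) ** 2
    return h1, h2, y1, y2, error

def backprop_step(w):
    h1, h2, y1, y2, _ = forward(w)
    # Output layer: dE/dy_final * dy_final/dy for each output neuron
    d1 = (y1 - t[0]) * y1 * (1.0 - y1)
    d2 = (y2 - t[1]) * y2 * (1.0 - y2)
    grad = {5: d1 * h1, 6: d1 * h2, 7: d2 * h1, 8: d2 * h2}
    # Hidden layer: sum the error arriving from both outputs (old w5..w8),
    # then multiply by the sigmoid derivative of the hidden neuron.
    dh1 = (d1 * w[5] + d2 * w[7]) * h1 * (1.0 - h1)
    dh2 = (d1 * w[6] + d2 * w[8]) * h2 * (1.0 - h2)
    grad.update({1: dh1 * x[0], 2: dh1 * x[1], 3: dh2 * x[0], 4: dh2 * x[1]})
    # Gradient descent update for every weight
    return {k: w[k] - eta * grad[k] for k in w}

error_before = forward(w)[4]
w_new = backprop_step(w)
error_after = forward(w_new)[4]

for k in sorted(w_new):
    print(f"w{k}new = {w_new[k]:.9f}")
print(error_before, error_after)
```

Calling `backprop_step` repeatedly on its own result continues the training loop described in the text.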
2.5.1 Difference Between a Shallow Net & Deep Learning Net:
No. | Shallow Net | Deep Learning Net
1 | One hidden layer (or a very small number of hidden layers) | Deep nets have many hidden layers with more neurons in each layer
2 | Takes input only as VECTORS | DL can have raw data like image, text as inputs
3 | Shallow nets need more parameters to have a better fit | DL can fit functions better with fewer parameters than a shallow network
4 | Shallow networks with one hidden layer (same number of neurons as DL) cannot place complex functions over the input space | DL can compactly express highly complex functions over the input space
5 | The number of units in a shallow network grows exponentially with task complexity | DL does not need to increase its size (neurons) in the same way for complex problems
6 | A shallow network is more difficult to train with our current algorithms (e.g. it has issues of local minima etc.) | Training in DL is comparatively easy and local minima are less of an issue
The Vanishing Gradient Problem
The Problem, Its Causes, Its Significance, and Its Solutions
The problem:
As more layers using certain activation functions are added to neural networks, the gradients of
the loss function approach zero, making the network hard to train.
Why:
Certain activation functions, like the sigmoid function, squish a large input space into a small
output space between 0 and 1. Therefore, a large change in the input of the sigmoid function will
cause a small change in the output. Hence, the derivative becomes small.
Image 1: The sigmoid function and its derivative
As an example, Image 1 is the sigmoid function and its derivative. Note how when the inputs of the
sigmoid function become larger or smaller (when |x| becomes bigger), the derivative becomes close
to zero.
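The shrinking derivative can be checked numerically. This is a minimal sketch: the sigmoid derivative is at most 0.25 (at x = 0), so the product of such derivatives across n sigmoid layers is bounded by 0.25^n. The layer counts shown are illustrative, and the bound ignores the weight terms that also appear in the real gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and decays quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_derivative(x))

# Upper bound on the product of sigmoid derivatives through n layers.
upper_bound = {n: 0.25 ** n for n in (1, 5, 10, 20)}
for n, bound in upper_bound.items():
    print(n, bound)
```

Even in the best case the factor contributed by ten sigmoid layers is below one millionth, which is why the earliest layers receive very small updates.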
Why it's significant:
For a shallow network with only a few layers that use these activations, this isn't a big problem.
However, when more layers are used, it can cause the gradient to be too small for training to work
effectively.
A small gradient means that the weights and biases of the initial layers will not be updated
effectively with each training session. Since these initial layers are often crucial to recognizing
the core elements of the input data, it can lead to overall inaccuracy of the whole network.
Solutions:
The simplest solution is to use other activation functions, such as ReLU, whose derivative does not
shrink for positive inputs.
Residual networks are another solution, as they provide residual connections straight to earlier
layers. As seen in Image 2, the residual connection directly adds the value at the beginning of the
block, x, to the end of the block (F(x) + x). This residual connection doesn't go through
activation functions that "squash" the derivatives, resulting in a higher overall derivative of the
block.
Image 2: A residual block
Finally, batch normalization layers can also resolve the issue. As stated before, the problem
arises when a large input space is mapped to a small one, causing the derivatives to disappear. In
Image 1, this is most clearly seen when |x| is big. Batch normalization reduces this problem by
simply normalizing the input so |x| doesn't reach the outer edges of the sigmoid function. As seen
in Image 3, it normalizes the input so that most of it falls in the green region, where the
derivative isn't too small.
Image 3: Sigmoid function with restricted inputs
Hyperparameters in Machine Learning
Here the prefix "hyper" suggests that the parameters are top-level parameters that are used in
controlling the learning process.
The value of the Hyperparameter is selected and set by the machine learning engineer before the
learning algorithm begins training the model.
Hence, these are external to the model, and their values are not changed by the learning algorithm
during the training process.
Some examples of Hyperparameters in Machine Learning
o The k in kNN or K-Nearest Neighbour algorithm
o Learning rate for training a neural network
o Train-test split ratio
o Batch Size
o Number of Epochs
o Branches in Decision Tree
o Number of clusters in Clustering Algorithm
Model Parameters:
Model parameters are configuration variables that are internal to the model, and a model learns
them on its own. For example, the weights or coefficients of independent variables in the linear
regression model, the weights or coefficients of independent variables in SVM, the weights and
biases of a neural network, and the cluster centroids in clustering. Some key points for model
parameters are as follows:
o They are used by the model for making predictions.
o They are learned by the model from the data itself.
o These are usually not set manually.
o These are part of the model and key to a machine learning algorithm.
Model Hyperparameters:
Hyperparameters are those parameters that are explicitly defined by the user to control the
learning process. Some key points for model hyperparameters are as follows:
o These are usually defined manually by the machine learning engineer.
o One cannot know the exact best value for hyperparameters for the given problem. The best value
can be determined either by the rule of thumb or by trial and error.
o Some examples of Hyperparameters are the learning rate for training a neural network and K in the
KNN algorithm.
Categories of Hyperparameters
Broadly, hyperparameters can be divided into two categories, which are given below:
1. Hyperparameter for Optimization
2. Hyperparameter for Specific Models
The process of selecting the best hyperparameters to use is known as hyperparameter tuning, and the
tuning process is also known as hyperparameter optimization. Optimization parameters are used for
optimizing the model.
Some of the popular optimization parameters are given below:
o Learning Rate: The learning rate is the hyperparameter in optimization algorithms that controls
how much the model needs to change in response to the estimated error each time the model's
weights are updated. It is one of the crucial parameters while building a neural network, and it
also determines the frequency of cross-checking with model parameters. Selecting the optimized
learning rate is a challenging task because if the learning rate is too small, it may slow down
the training process. On the other hand, if the learning rate is too large, it may not optimize
the model properly.
o Batch Size: To enhance the speed of the learning process, the training set is divided into
different subsets, which are known as batches.
o Number of Epochs: An epoch can be defined as one complete cycle through the training data for
training the machine learning model. Epochs represent an iterative learning process. The number
of epochs varies from model to model, and most models are trained for more than one epoch. To
determine the right number of epochs, the validation error is taken into account. The number of
epochs is increased as long as the validation error keeps decreasing. If there is no further
reduction in the validation error for several consecutive epochs, it indicates that we should
stop increasing the number of epochs.
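The stopping rule for the number of epochs can be sketched as below. The function name `epochs_to_train`, the `patience` parameter (how many consecutive non-improving epochs are tolerated) and the list of validation errors are all hypothetical, used only for illustration.

```python
def epochs_to_train(validation_errors, patience=2):
    """Return the epoch with the best validation error, stopping once the
    error has failed to improve for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, error in enumerate(validation_errors, start=1):
        if error < best_error:
            best_error, best_epoch, waited = error, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # no improvement for `patience` epochs: stop
    return best_epoch

# Hypothetical validation error recorded after each epoch
errors = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.39]

print(epochs_to_train(errors, patience=2))
print(epochs_to_train(errors, patience=3))
```

A larger patience tolerates temporary plateaus at the cost of more training epochs.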
Hyperparameter for Specific Models
Hyperparameters that are involved in the structure of the model are known as hyperparameters for
specific models. These are given below:
o A number of Hidden Units: Hidden units are part of neural networks, which refer to the components
comprising the layers of processors between input and output units in a neural network.
It is important to specify the number of hidden units hyperparameter for the neural network. As a
rule of thumb, it should be between the size of the input layer and the size of the output layer.
More specifically, a common heuristic is that the number of hidden units should be 2/3 of the size
of the input layer, plus the size of the output layer.
For complex functions, it is necessary to specify a sufficient number of hidden units, but not so
many that the model overfits.
Batch normalization is a method of adaptive reparameterization, motivated by the difficulty of
training very deep models. In deep networks, the weights are updated for each layer.
So the output will no longer be on the same scale as the input (even though the input is
normalized).
Normalization is a data pre-processing tool used to bring the numerical data to a common scale
without distorting its shape.
When we input the data to a machine or deep learning algorithm, we tend to change the values to a
balanced scale so that our model can generalize appropriately. (Normalization is used to bring the
input into a balanced scale/range.)
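Bringing data to a common scale can be sketched with NumPy. The sample values are illustrative; min-max scaling is shown for normalization, and z-score scaling is shown for comparison under the usual definition of standardization, which these notes name but do not define.

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # illustrative feature values

# Normalization (min-max scaling): rescale to the range [0, 1]
normalized = (data - data.min()) / (data.max() - data.min())

# Standardization (z-score): zero mean and unit standard deviation
standardized = (data - data.mean()) / data.std()

print(normalized)
print(standardized)
```

Both transforms preserve the ordering and relative spacing of the values; only the scale changes.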
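We can see that the addition of the weight decay term has modified the learning rule to
multiplicatively shrink the weight vector by a constant factor on each step, just before performing
the usual gradient update. This describes what happens in a single step. To see the effect over
the whole of training, we approximate J by a quadratic function around w∗, the minimizer of the
unregularized objective. The approximation Ĵ is given by
Ĵ(w) = J(w∗) + ½ (w − w∗)ᵀ H (w − w∗)
where H is the Hessian matrix of J with respect to w evaluated at w∗.
The minimum of Ĵ occurs where its gradient ∇wĴ(w) = H(w − w∗) is equal to 0. To study the effect
of weight decay, we add the weight decay gradient αw and solve for the minimum w̃ of the
regularized objective:
αw̃ + H(w̃ − w∗) = 0
(H + αI)w̃ = Hw∗
w̃ = (H + αI)^(-1) H w∗
As α approaches 0, the regularized solution w̃ approaches w∗. But what happens as α grows?
Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal
basis of eigenvectors, Q, such that H = QΛQᵀ. Applying the decomposition to the above equation, we
obtain
w̃ = Q(Λ + αI)^(-1) Λ Qᵀ w∗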
Figure 2: Weight updation effect
The solid ellipses represent contours of equal value of the unregularized objective. The dotted
circles represent contours of equal value of the L2 regularizer. At the point w̃, these competing
objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is
small. The objective function does not increase much when moving horizontally away from w∗.
Because the objective function does not express a strong preference along this direction, the
regularizer has a strong effect on this axis. The regularizer pulls w1 close to zero. In the second
dimension, the objective function is very sensitive to movements away from w∗. The corresponding
eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of
w2 relatively little.
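The effect described for the two dimensions can be reproduced numerically. This sketch assumes the quadratic approximation with an added penalty (α/2) wᵀw, whose minimum solves (H + αI) w̃ = H w∗; the diagonal Hessian with one small and one large eigenvalue, the point w∗ and the value of α are illustrative assumptions.

```python
import numpy as np

# Illustrative Hessian: small curvature along w1, large curvature along w2
H = np.diag([0.1, 10.0])
w_star = np.array([1.0, 1.0])     # minimum of the unregularized objective

def regularized_minimum(alpha):
    """Solve (H + alpha * I) w = H w_star for the weight-decay solution."""
    return np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

w_tilde = regularized_minimum(alpha=1.0)
print(w_tilde)

# Per-eigenvalue shrink factor lambda / (lambda + alpha)
print(np.diag(H) / (np.diag(H) + 1.0))
```

The component along the low-curvature direction is pulled most of the way to zero, while the component along the high-curvature direction barely moves.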
L1 Regularization
While L2 weight decay is the most common form of weight decay, there are other ways to penalize the
size of the model parameters. Another option is to use L1 regularization.
L1 regularization on the model parameter w is defined as the sum of absolute values of the
individual parameters:
Ω(w) = ||w||_1 = Σ_i |w_i|
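The two penalties can be compared on a small weight vector. The weights are illustrative, and the factor ½ in the L2 penalty is a common convention assumed here.

```python
import numpy as np

w = np.array([0.5, -1.5, 0.0, 2.0])    # illustrative weight vector

# L1 penalty: sum of absolute values of the parameters
l1_penalty = np.sum(np.abs(w))

# L2 penalty (weight decay): half the sum of squared parameters
l2_penalty = 0.5 * np.sum(w ** 2)

print(l1_penalty, l2_penalty)
```

The L1 penalty grows linearly with each weight, whereas the L2 penalty grows quadratically and so punishes large weights more heavily.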
Difference between L1 & L2 Parameter Regularization
Difference between Normalization and Standardization
This poses two different problems to our model:
Wastage of the machine's resources when computing the same output.
If many neurons are extracting the same features, it adds more significance to those features for
our model. This leads to overfitting if the duplicate extracted features are specific to only the
training set.
Solution to the problem: As the title suggests, we use dropout while training the NN to minimize
co-adaptation. In dropout, we randomly shut down some fraction of a layer's neurons at each
training step by zeroing out the neuron values. The fraction of neurons to be zeroed out is known
as the dropout rate. The remaining neurons have their