
Neural Networks and Deep Learning

CHAPTER 1

Using neural nets to recognize handwritten digits

By Michael Nielsen / Jan 2017

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

[Image: a sequence of handwritten digits reading 504192]

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

[Image: a grid of 100 scanned handwritten training digits]

and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons

What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

[Diagram: a perceptron with three inputs $x_1, x_2, x_3$ and a single output]

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \qquad (1)$$

That's all there is to how a perceptron works!
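To make the rule concrete, here's a minimal sketch in Python of a function computing a perceptron's output from its inputs, weights, and threshold (the function name and the example numbers are my own, for illustration only):

import numpy as np

def perceptron_output(x, w, threshold):
    # Equation (1): output 1 if the weighted sum of the inputs
    # exceeds the threshold, and 0 otherwise.
    return 1 if np.dot(w, x) > threshold else 0

print(perceptron_output([1, 0, 1], [6, 2, 2], 5))  # prints 1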

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car).

We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
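As a quick check of those claims, here's a small sketch (the enumeration and function name are mine) that evaluates the festival decision for every combination of the three factors, under both thresholds:

from itertools import product
import numpy as np

def perceptron_output(x, w, threshold):
    return 1 if np.dot(w, x) > threshold else 0

w = [6, 2, 2]  # weights for: good weather, companion, public transit

for threshold in (5, 3):
    print("threshold =", threshold)
    for x in product([0, 1], repeat=3):
        # x = (weather good?, companion coming?, near transit?)
        print(x, "->", perceptron_output(x, w, threshold))

With a threshold of 5 the output tracks the weather alone; with a threshold of 3 it also fires when both of the other two conditions hold.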

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

[Diagram: a three-layer network of perceptrons]

In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptrons in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision-making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \text{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\text{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \qquad (2)$$

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of 3. Here's our perceptron:

[Diagram: a perceptron with two inputs, each with weight $-2$, and bias 3]

Then we see that input 00 produces output 1, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces output 0, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
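As a sanity check, here's a short sketch evaluating this perceptron (weights $-2, -2$, bias 3) on all four input pairs; it should reproduce the NAND truth table:

import numpy as np

def perceptron_output(x, w, b):
    # Equation (2): output 1 if w . x + b > 0, else 0.
    return 1 if np.dot(w, x) + b > 0 else 0

w, b = [-2, -2], 3
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron_output(x, w, b))  # prints 1, 1, 1, 0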

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to 1 when both $x_1$ and $x_2$ are 1, i.e., the carry bit is just the bitwise product $x_1 x_2$:

[Diagram: a circuit of NAND gates that adds two bits]

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of 3. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

[Diagram: the same adder built from NAND-equivalent perceptrons]

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to 3, and a single weight of $-4$, as marked:

[Diagram: the adder network with the duplicate connections merged into a single weight $-4$ connection]

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:

[Diagram: the network redrawn with an explicit input layer]

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output 1 if $b > 0$, and 0 if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

[Diagram: a small change in any weight or bias produces only a small change in the network's output]

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

[Diagram: a sigmoid neuron, drawn the same way as a perceptron]

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638... is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \qquad (3)$$

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$\frac{1}{1 + \exp\left( -\sum_j w_j x_j - b \right)}. \qquad (4)$$
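Expressed in code, a sigmoid neuron's output might be computed as in the following sketch (NumPy-based; the function names and example numbers are mine):

import numpy as np

def sigmoid(z):
    # The sigmoid function of Equation (3).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron_output(x, w, b):
    # Equation (4): the sigmoid applied to the weighted input w . x + b.
    return sigmoid(np.dot(w, x) + b)

print(sigmoid_neuron_output([0.5, 0.9], [6.0, -2.0], -1.0))  # about 0.55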

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \to \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

[Plot: the sigmoid function, rising smoothly from 0 to 1 as $z$ runs from $-4$ to $4$]

This shape is a smoothed out version of a step function:

[Plot: the step function, jumping from 0 to 1 at $z = 0$]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether $w \cdot x + b$ was positive or negative*. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by

$$\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b, \qquad (5)$$

where the sum is over all the weights, $w_j$, and $\partial\, \text{output} / \partial w_j$ and $\partial\, \text{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

*Actually, when $w \cdot x + b = 0$ the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.
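To see Equation (5) in action, here's a small numerical sketch (the numbers are arbitrary choices of mine). It uses the partial derivatives $\partial\,\text{output}/\partial w_j = \sigma'(z) x_j$ and $\partial\,\text{output}/\partial b = \sigma'(z)$, which follow from the chain rule, and compares the actual change in the output against the linear approximation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.4, 0.7])
w = np.array([1.5, -2.0])
b = 0.3
z = np.dot(w, x) + b
sigma_prime = sigmoid(z) * (1 - sigmoid(z))  # derivative of the sigmoid at z

dw = np.array([0.001, -0.002])  # small changes to the weights
db = 0.0005                     # small change to the bias

actual = sigmoid(np.dot(w + dw, x) + b + db) - sigmoid(z)
approx = np.sum(sigma_prime * x * dw) + sigma_prime * db

print(actual, approx)  # the two values agree to many decimal places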

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output 0 or 1. They can have as output any real number between 0 and 1, so values such as 0.173... and 0.689... are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \to \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

[Diagram: a network with an input layer, one hidden layer, and a single output neuron]

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

[Diagram: a four-layer network with two hidden layers]

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have 4,096 = 64 × 64 input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks.

A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

[Image: the handwritten digits 504192 as a single image]

into six separate images,

[Image: the same digits split into six individual images]

We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,

[Image: the first digit, a handwritten 5]

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

[Diagram: the three-layer digit-classification network, with 784 input neurons, a hidden layer of 15 neurons, and 10 output neurons]

The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains 784 = 28 × 28 neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons.
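In code, that decision rule is just an argmax over the ten output activations. A minimal sketch, with made-up activation values for illustration:

import numpy as np

# Hypothetical activations of the 10 output neurons for one input image.
output_activations = np.array([0.02, 0.01, 0.05, 0.03, 0.01,
                               0.04, 0.91, 0.02, 0.06, 0.03])

guess = np.argmax(output_activations)  # index of the most active neuron
print("The network's guess is:", guess)  # prints 6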

You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit (0, 1, 2, ..., 9) corresponds to the input image. A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

[Image: one fragment of a handwritten 0]

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

[Image: three further fragments of a handwritten 0]

As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier:

[Image: the complete handwritten 0]

So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0 - we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

[Diagram: the network extended with an extra output layer that converts the 10-neuron output into a binary representation]

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

[Image: a few scanned digit images from MNIST]

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For example, if a particular training image, $x$, depicts a 6, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
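As a small illustration, here's one way such a desired-output vector might be built in code (a plain one-hot encoding; the function name is my own):

import numpy as np

def vectorized_result(digit):
    # Return a 10-dimensional column vector with a 1.0 in the position
    # of the given digit and zeroes elsewhere.
    y = np.zeros((10, 1))
    y[digit] = 1.0
    return y

print(vectorized_result(6).T)  # [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]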

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function*:

$$C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \qquad (6)$$

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\|v\|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w, b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w, b)$ becomes small, i.e., $C(w, b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w, b) \approx 0$. By contrast, it's not doing so well when $C(w, b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w, b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
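As a rough sketch of Equation (6) in code: assuming the training data is a list of (x, y) pairs of NumPy vectors, and assuming some function feedforward(x) that returns the network's output vector a (both assumptions are mine, for illustration), the quadratic cost could be computed as:

import numpy as np

def quadratic_cost(training_data, feedforward):
    # Equation (6): half the mean squared distance between the desired
    # outputs y and the network outputs a, over all training inputs.
    n = len(training_data)
    total = sum(np.linalg.norm(y - feedforward(x)) ** 2
                for x, y in training_data)
    return total / (2.0 * n)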

Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.
Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

[Plot: a surface plot of $C$ as a function of $v_1$ and $v_2$, shaped like a valley]

What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \qquad (7)$$

We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \qquad (8)$$

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

$$\Delta C \approx \nabla C \cdot \Delta v. \qquad (9)$$

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

$$\Delta v = -\eta \nabla C, \qquad (10)$$

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

$$v \to v' = v - \eta \nabla C. \qquad (11)$$

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

[Plot: the valley again, with a sequence of small steps descending toward the minimum]

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
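To make the procedure concrete, here's a minimal sketch of gradient descent on a toy two-variable function, $C(v_1, v_2) = v_1^2 + v_2^2$ (my own toy example, not the network's cost), using the update rule of Equation (11):

import numpy as np

def grad_C(v):
    # Gradient of C(v1, v2) = v1^2 + v2^2, namely (2*v1, 2*v2).
    return 2 * v

eta = 0.1                   # learning rate
v = np.array([2.0, -3.0])   # arbitrary starting point
for step in range(50):
    v = v - eta * grad_C(v)  # Equation (11): v -> v' = v - eta * grad C

print(v)  # very close to the minimum at (0, 0)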

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

$$\Delta C \approx \nabla C \cdot \Delta v, \qquad (12)$$

where the gradient $\nabla C$ is the vector

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \qquad (13)$$

Just as for the two variable case, we can choose

$$\Delta v = -\eta \nabla C, \qquad (14)$$

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

$$v \to v' = v - \eta \nabla C. \qquad (15)$$

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C / \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

*Actually, more like half a trillion, since $\partial^2 C / \partial v_j \partial v_k = \partial^2 C / \partial v_k \partial v_j$. Still, you get the point.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

$$w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \qquad (16)$$
$$b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \qquad (17)$$

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x) - a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$$\frac{\sum_{j=1}^m \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \qquad (18)$$

where the second sum is over the entire set of training data. Swapping sides we get

$$\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j}, \qquad (19)$$

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

$$w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \qquad (20)$$
$$b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \qquad (21)$$

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
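Here's a schematic sketch of that training loop in Python (simplified; update_mini_batch is a stand-in for a routine applying the update rules (20) and (21), and is only an assumed placeholder here):

import random

def sgd(training_data, epochs, mini_batch_size, eta, update_mini_batch):
    # Run stochastic gradient descent: shuffle the data each epoch,
    # split it into mini-batches, and apply one update per mini-batch.
    n = len(training_data)
    for epoch in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k + mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            # update_mini_batch would apply Equations (20) and (21),
            # using gradients averaged over this mini-batch.
            update_mini_batch(mini_batch, eta)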

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \to w_k' = w_k - \eta\, \partial C_x / \partial w_k$ and $b_l \to b_l' = b_l - \eta\, \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables, all the weights and biases, and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network: things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set*.

*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
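
As a quick check on how the initialization lays things out, here's what the shapes look like for the [2, 3, 1] network above (a small illustrative snippet, not part of the network code itself):

>>> net = Network([2, 3, 1])
>>> [b.shape for b in net.biases]
[(3, 1), (1, 1)]
>>> [w.shape for w in net.weights]
[(3, 2), (1, 3)]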

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\rm th}$ neuron in the second layer, and the $j^{\rm th}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange. Surely it'd make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

$$a' = \sigma(w a + b). \qquad (22)$$

There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $wa + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

Exercise

    Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.
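
If you'd also like a numerical sanity check, the following small snippet (with made-up weights, biases, and activations) compares the vectorized form (22) against a neuron-by-neuron computation; it's purely illustrative and not part of the network code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]])  # 3x2 weight matrix
a = np.array([[0.6], [0.9]])                          # activations of previous layer
b = np.array([[0.1], [-0.2], [0.05]])                 # biases

vectorized = sigmoid(np.dot(w, a) + b)

# Neuron-by-neuron computation of the same activations.
componentwise = np.array(
    [[sigmoid(sum(w[j, k] * a[k, 0] for k in range(2)) + b[j, 0])]
     for j in range(3)])

print(np.allclose(vectorized, componentwise))  # True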

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output*. All the method does is apply Equation (22) for each layer:

*It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent.  The "training_data" is a list of tuples
    "(x, y)" representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If "test_data" is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, $\eta$. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.
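
If you want to reassure yourself that a gradient routine behaves as claimed, one standard trick is to compare it against a numerical estimate. Here's a rough sketch of such a check for a single bias, assuming the quadratic per-example cost $C_x = \frac{1}{2}\|y - a\|^2$ used in this chapter; the helper names are mine, not part of the program listed below:

import numpy as np

def cost(net, x, y):
    """Per-example quadratic cost, matching cost_derivative below."""
    return 0.5 * np.sum((net.feedforward(x) - y) ** 2)

def check_one_bias(net, x, y, layer=0, neuron=0, eps=1e-5):
    """Compare backprop's dC_x/db for one bias against a central
    finite-difference estimate.  The two numbers should nearly agree."""
    nabla_b, nabla_w = net.backprop(x, y)
    net.biases[layer][neuron, 0] += eps
    c_plus = cost(net, x, y)
    net.biases[layer][neuron, 0] -= 2 * eps
    c_minus = cost(net, x, y)
    net.biases[layer][neuron, 0] += eps   # restore the original bias
    estimate = (c_plus - c_minus) / (2 * eps)
    return nabla_b[layer][neuron, 0], estimate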

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory: all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

"""
network.py
~~~~~~~~~~

Amoduletoimplementthestochasticgradientdescentlearning
algorithmforafeedforwardneuralnetwork.Gradientsarecalculated
usingbackpropagation.NotethatIhavefocusedonmakingthecode
simple,easilyreadable,andeasilymodifiable.Itisnotoptimized,
andomitsmanydesirablefeatures.
"""

####Libraries
#Standardlibrary
importrandom

#Thirdpartylibraries
importnumpyasnp

classNetwork(object):

def__init__(self,sizes):
"""Thelist``sizes``containsthenumberofneuronsinthe
respectivelayersofthenetwork.Forexample,ifthelist
was[2,3,1]thenitwouldbeathreelayernetwork,withthe
firstlayercontaining2neurons,thesecondlayer3neurons,
andthethirdlayer1neuron.Thebiasesandweightsforthe
networkareinitializedrandomly,usingaGaussian
distributionwithmean0,andvariance1.Notethatthefirst

http://neuralnetworksanddeeplearning.com/chap1.html 37/50
03/04/2017 Neuralnetworksanddeeplearning
layerisassumedtobeaninputlayer,andbyconventionwe
won'tsetanybiasesforthoseneurons,sincebiasesareonly
everusedincomputingtheoutputsfromlaterlayers."""
self.num_layers=len(sizes)
self.sizes=sizes
self.biases=[np.random.randn(y,1)foryinsizes[1:]]
self.weights=[np.random.randn(y,x)
forx,yinzip(sizes[:1],sizes[1:])]

deffeedforward(self,a):
"""Returntheoutputofthenetworkif``a``isinput."""
forb,winzip(self.biases,self.weights):
a=sigmoid(np.dot(w,a)+b)
returna

defSGD(self,training_data,epochs,mini_batch_size,eta,
test_data=None):
"""Traintheneuralnetworkusingminibatchstochastic
gradientdescent.The``training_data``isalistoftuples
``(x,y)``representingthetraininginputsandthedesired
outputs.Theothernonoptionalparametersare
selfexplanatory.If``test_data``isprovidedthenthe
networkwillbeevaluatedagainstthetestdataaftereach
epoch,andpartialprogressprintedout.Thisisusefulfor
trackingprogress,butslowsthingsdownsubstantially."""
iftest_data:n_test=len(test_data)
n=len(training_data)
forjinxrange(epochs):
random.shuffle(training_data)
mini_batches=[
training_data[k:k+mini_batch_size]
forkinxrange(0,n,mini_batch_size)]
formini_batchinmini_batches:
self.update_mini_batch(mini_batch,eta)
iftest_data:
print"Epoch{0}:{1}/{2}".format(
j,self.evaluate(test_data),n_test)
else:
print"Epoch{0}complete".format(j)

defupdate_mini_batch(self,mini_batch,eta):
"""Updatethenetwork'sweightsandbiasesbyapplying
gradientdescentusingbackpropagationtoasingleminibatch.
The``mini_batch``isalistoftuples``(x,y)``,and``eta``
isthelearningrate."""
nabla_b=[np.zeros(b.shape)forbinself.biases]
nabla_w=[np.zeros(w.shape)forwinself.weights]
forx,yinmini_batch:
delta_nabla_b,delta_nabla_w=self.backprop(x,y)
nabla_b=[nb+dnbfornb,dnbinzip(nabla_b,delta_nabla_b)]
nabla_w=[nw+dnwfornw,dnwinzip(nabla_w,delta_nabla_w)]
self.weights=[w(eta/len(mini_batch))*nw
forw,nwinzip(self.weights,nabla_w)]
self.biases=[b(eta/len(mini_batch))*nb
forb,nbinzip(self.biases,nabla_b)]

defbackprop(self,x,y):
"""Returnatuple``(nabla_b,nabla_w)``representingthe
gradientforthecostfunctionC_x.``nabla_b``and
``nabla_w``arelayerbylayerlistsofnumpyarrays,similar
to``self.biases``and``self.weights``."""
nabla_b=[np.zeros(b.shape)forbinself.biases]
nabla_w=[np.zeros(w.shape)forwinself.weights]
#feedforward
activation=x
activations=[x]#listtostorealltheactivations,layerbylayer
zs=[]#listtostoreallthezvectors,layerbylayer
forb,winzip(self.biases,self.weights):

http://neuralnetworksanddeeplearning.com/chap1.html 38/50
03/04/2017 Neuralnetworksanddeeplearning
z=np.dot(w,activation)+b
zs.append(z)
activation=sigmoid(z)
activations.append(activation)
#backwardpass
delta=self.cost_derivative(activations[1],y)*\
sigmoid_prime(zs[1])
nabla_b[1]=delta
nabla_w[1]=np.dot(delta,activations[2].transpose())
#Notethatthevariablelintheloopbelowisusedalittle
#differentlytothenotationinChapter2ofthebook.Here,
#l=1meansthelastlayerofneurons,l=2isthe
#secondlastlayer,andsoon.It'sarenumberingofthe
#schemeinthebook,usedheretotakeadvantageofthefact
#thatPythoncanusenegativeindicesinlists.
forlinxrange(2,self.num_layers):
z=zs[l]
sp=sigmoid_prime(z)
delta=np.dot(self.weights[l+1].transpose(),delta)*sp
nabla_b[l]=delta
nabla_w[l]=np.dot(delta,activations[l1].transpose())
return(nabla_b,nabla_w)

defevaluate(self,test_data):
"""Returnthenumberoftestinputsforwhichtheneural
networkoutputsthecorrectresult.Notethattheneural
network'soutputisassumedtobetheindexofwhichever
neuroninthefinallayerhasthehighestactivation."""
test_results=[(np.argmax(self.feedforward(x)),y)
for(x,y)intest_data]
returnsum(int(x==y)for(x,y)intest_results)

defcost_derivative(self,output_activations,y):
"""Returnthevectorofpartialderivatives\partialC_x/
\partialafortheoutputactivations."""
return(output_activationsy)

####Miscellaneousfunctions
defsigmoid(z):
"""Thesigmoidfunction."""
return1.0/(1.0+np.exp(z))

defsigmoid_prime(z):
"""Derivativeofthesigmoidfunction."""
returnsigmoid(z)*(1sigmoid(z))

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Note that if you're running the code as you read along, it will take some time to execute: for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

That is, the trained network gives us a classification rate of about 95 percent (95.42 percent at its peak, "Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.


Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*.

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, $\eta$. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be $\eta = 0.001$,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to $\eta = 0.01$. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to $3.0$), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
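
If you'd rather not do that tuning entirely by hand, a coarse sweep over a few candidate rates is easy to script. Here's a sketch, assuming network.py and mnist_loader.py are importable as described above; it trains briefly at each rate and lets you read off the printed accuracies (using the validation data, for the reasons discussed earlier):

import mnist_loader
import network

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

# Try learning rates spanning several orders of magnitude.  A short
# run (here 5 epochs) is usually enough for a coarse comparison.
for eta in [0.001, 0.01, 0.1, 1.0, 3.0]:
    net = network.Network([784, 30, 10])
    net.SGD(training_data, 5, 10, eta, test_data=validation_data)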

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to $\eta = 100.0$:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Exercise

    Try creating a network with just two layers, an input and an output layer (no hidden layer), with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
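
As a starting point for this exercise, the setup is a one-line change from our earlier runs (the accuracy you get is for you to discover):

>>> net = network.Network([784, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)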

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings; it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

"""
mnist_loader
~~~~~~~~~~~~

AlibrarytoloadtheMNISTimagedata.Fordetailsofthedata
structuresthatarereturned,seethedocstringsfor``load_data``
and``load_data_wrapper``.Inpractice,``load_data_wrapper``isthe
functionusuallycalledbyourneuralnetworkcode.
"""

####Libraries
#Standardlibrary
importcPickle
importgzip

#Thirdpartylibraries
importnumpyasnp

defload_data():
"""ReturntheMNISTdataasatuplecontainingthetrainingdata,
thevalidationdata,andthetestdata.

The``training_data``isreturnedasatuplewithtwoentries.
Thefirstentrycontainstheactualtrainingimages.Thisisa
numpyndarraywith50,000entries.Eachentryis,inturn,a
numpyndarraywith784values,representingthe28*28=784
pixelsinasingleMNISTimage.

Thesecondentryinthe``training_data``tupleisanumpyndarray
containing50,000entries.Thoseentriesarejustthedigit
values(0...9)forthecorrespondingimagescontainedinthefirst
entryofthetuple.

The``validation_data``and``test_data``aresimilar,except
eachcontainsonly10,000images.

Thisisanicedataformat,butforuseinneuralnetworksit's
helpfultomodifytheformatofthe``training_data``alittle.
That'sdoneinthewrapperfunction``load_data_wrapper()``,see
below.
"""
f=gzip.open('../data/mnist.pkl.gz','rb')
training_data,validation_data,test_data=cPickle.load(f)
f.close()
return(training_data,validation_data,test_data)

defload_data_wrapper():
"""Returnatuplecontaining``(training_data,validation_data,
test_data)``.Basedon``load_data``,buttheformatismore
convenientforuseinourimplementationofneuralnetworks.

Inparticular,``training_data``isalistcontaining50,000

http://neuralnetworksanddeeplearning.com/chap1.html 43/50
03/04/2017 Neuralnetworksanddeeplearning
2tuples``(x,y)``.``x``isa784dimensionalnumpy.ndarray
containingtheinputimage.``y``isa10dimensional
numpy.ndarrayrepresentingtheunitvectorcorrespondingtothe
correctdigitfor``x``.

``validation_data``and``test_data``arelistscontaining10,000
2tuples``(x,y)``.Ineachcase,``x``isa784dimensional
numpy.ndarrycontainingtheinputimage,and``y``isthe
correspondingclassification,i.e.,thedigitvalues(integers)
correspondingto``x``.

Obviously,thismeanswe'reusingslightlydifferentformatsfor
thetrainingdataandthevalidation/testdata.Theseformats
turnouttobethemostconvenientforuseinourneuralnetwork
code."""
tr_d,va_d,te_d=load_data()
training_inputs=[np.reshape(x,(784,1))forxintr_d[0]]
training_results=[vectorized_result(y)foryintr_d[1]]
training_data=zip(training_inputs,training_results)
validation_inputs=[np.reshape(x,(784,1))forxinva_d[0]]
validation_data=zip(validation_inputs,va_d[1])
test_inputs=[np.reshape(x,(784,1))forxinte_d[0]]
test_data=zip(test_inputs,te_d[1])
return(training_data,validation_data,test_data)

defvectorized_result(j):
"""Returna10dimensionalunitvectorwitha1.0inthejth
positionandzeroeselsewhere.Thisisusedtoconvertadigit
(0...9)intoacorrespondingdesiredoutputfromtheneural
network."""
e=np.zeros((10,1))
e[j]=1.0
returne

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, $0, 1, 2, \ldots, 9$. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code; if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting $2,225$ of the $10,000$ test images correct, i.e., $22.25$ percent accuracy.
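
To make the idea concrete, here's a rough sketch of such a darkness-based classifier. It's not the repository's version; it just assumes the data format returned by mnist_loader.load_data_wrapper(), described above:

from collections import defaultdict

import numpy as np

import mnist_loader

def avg_darknesses(training_data):
    """Average total darkness (sum of pixel intensities) per digit."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in training_data:
        digit = np.argmax(y)          # training labels are one-hot vectors
        sums[digit] += np.sum(x)
        counts[digit] += 1
    return {d: sums[d] / counts[d] for d in counts}

def guess_digit(image, darknesses):
    """Guess the digit whose average darkness is closest to the image's."""
    darkness = np.sum(image)
    return min(darknesses, key=lambda d: abs(darknesses[d] - darkness))

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
darknesses = avg_darknesses(training_data)
num_correct = sum(int(guess_digit(x, darknesses) == y) for x, y in test_data)
print "Baseline accuracy: {0} / {1}".format(num_correct, len(test_data))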

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.
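
If you'd like to try this yourself, here's a sketch of what running that experiment might look like. It's not the linked code; it assumes scikit-learn is installed and uses the raw load_data format, where labels are plain integers (and note that training on all 50,000 images with default settings can take quite a while):

from sklearn import svm

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data()

# Train an SVM with scikit-learn's default settings on the raw
# 784-dimensional pixel vectors.
clf = svm.SVC()
clf.fit(training_data[0], training_data[1])

predictions = clf.predict(test_data[0])
num_correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print "{0} of {1} test images correct".format(num_correct, len(test_data[1]))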

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

    sophisticated algorithm $\leq$ simple learning algorithm + good training data.

Toward deep learning

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team. Click on the images for more details.

We could attack this problem the same way we attacked handwriting recognition: by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?" "Are there eyelashes?" "Is there an iris?" and so on. Of course, these questions should really include positional information as well ("Is the eyebrow in the top left, and above the iris?", that kind of thing), but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question (does this image show a face or not?) into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure (two or more hidden layers) are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases (and thus, the hierarchy of concepts) from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained; people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.
