Deep Nets
Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
Why Stop At One Hidden Layer?
E.g., vision hierarchy for recognizing handprinted text
Credit assignment problem
How is a neuron in layer 2 supposed to know
what it should output until all the neurons
above it do something sensible?
How is a neuron in layer 4 supposed to know
what it should output until all the neurons
below it do something sensible?
Deeper Vs. Shallower Nets
Deeper net can represent any mapping that shallower net can
Use identity mappings for the additional layers
Deeper net in principle is more likely to overfit
But in practice it often underfits on the training set
Degradation due to harder credit-assignment problem
Deeper isn't always better!
[Figures: training/validation error curves on CIFAR-10 and ImageNet; thin lines = training error, thick lines = validation error]
Vanishing gradient problem
With logistic or tanh units
$y_j = \dfrac{1}{1+\exp(-z_j)} \quad\Rightarrow\quad \dfrac{\partial y_j}{\partial z_j} = y_j\,(1-y_j)$

$y_j = \tanh(z_j) \quad\Rightarrow\quad \dfrac{\partial y_j}{\partial z_j} = (1+y_j)\,(1-y_j)$
Error gradients get squashed as they are passed back through a deep network
[Figure: gradient magnitude vs. layer, from layer n back to layer 1]
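To make the squashing concrete, here is a minimal numpy sketch (not from the slides; the depth of 20 layers and the random pre-activations are arbitrary assumptions, and weight factors are ignored) that multiplies the per-layer activation derivative across layers, contrasting logistic units with ReLUs:

```python
# Minimal sketch: product of per-layer activation derivatives across a deep stack.
import numpy as np

rng = np.random.default_rng(0)
n_layers = 20
z = rng.normal(size=n_layers)           # one hypothetical pre-activation per layer

y = 1.0 / (1.0 + np.exp(-z))            # logistic activations
logistic_grads = y * (1.0 - y)          # dy/dz for logistic units, at most 0.25
relu_grads = (z > 0).astype(float)      # dy/dz for ReLU units: exactly 0 or 1

print("logistic: product of layer gradients =", np.prod(logistic_grads))
print("ReLU:     product of layer gradients =", np.prod(relu_grads))
```

Because the logistic derivative never exceeds 0.25, the product decays geometrically with depth; the ReLU derivative is exactly 1 wherever the unit is active (though it is 0 for inactive units, as discussed below).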
Why Deeply Layered Networks Fail
[Figure: gradient magnitude vs. layer]
For a rectified-linear (ReLU) unit $y$ receiving input $x$ through weight $w_{yx}$, the gradient is

$\dfrac{\partial y}{\partial x} = \begin{cases} 0 & \text{if } y = 0 \\ w_{yx} & \text{otherwise} \end{cases}$

which can be a problem when $y = 0$: the unit passes no gradient back at all.
Hack Solutions
Using ReLUs can avoid squashing of the gradient
[Figure: gradient magnitude vs. layer, from layer n back to layer 1, with ReLU units]
Gradient clipping, for exploding gradients:

$\Delta w_{xy} \propto \max\!\left(-\Delta_0,\ \min\!\left(\Delta_0,\ \dfrac{\partial E}{\partial w_{xy}}\right)\right)$

Sign-of-gradient updates, for exploding & vanishing gradients:

$\Delta w_{xy} \propto \operatorname{sign}\!\left(\dfrac{\partial E}{\partial w_{xy}}\right)$
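A hedged sketch of the two update rules above in numpy; the names delta_0 and lr, and the inclusion of an explicit learning-rate step, are my own choices rather than anything from the slides:

```python
# Sketch of clipped and sign-only weight updates (delta_0 and lr are assumed names).
import numpy as np

def clipped_update(grad, lr=0.1, delta_0=1.0):
    """Clip each gradient component to [-delta_0, delta_0] before taking a step
    (guards against exploding gradients)."""
    return -lr * np.maximum(-delta_0, np.minimum(delta_0, grad))

def sign_update(grad, lr=0.01):
    """Step using only the sign of each gradient component
    (guards against both exploding and vanishing gradients)."""
    return -lr * np.sign(grad)

grad = np.array([1e-6, 3.0, -250.0])    # toy gradient with tiny and huge components
print(clipped_update(grad))             # huge components are capped at +/- delta_0
print(sign_update(grad))                # every component moves by the same magnitude
```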
Hack Solutions
Hard weight constraints
$w \leftarrow w \cdot \dfrac{\min(\lVert w \rVert_2,\ l)}{\lVert w \rVert_2}$
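A minimal sketch of the hard weight constraint above: after each update, a unit's incoming weight vector is rescaled so its L2 norm never exceeds the limit l (the value l = 3 below is an arbitrary assumption):

```python
# Max-norm (hard) weight constraint sketch; l = 3.0 is an arbitrary limit.
import numpy as np

def apply_max_norm(w, l=3.0):
    """Rescale w so that ||w||_2 <= l, leaving its direction unchanged
    (equivalent to w <- w * min(||w||_2, l) / ||w||_2)."""
    norm = np.linalg.norm(w)
    return w * (l / norm) if norm > l else w

w = np.array([2.0, 4.0, 4.0])           # norm 6, violates the constraint for l = 3
print(apply_max_norm(w))                # rescaled to norm 3
```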
Batch normalization
As the weights 𝑾𝟏 are learned, the distribution of activations in hidden layer 𝒉𝟏 changes
affects the ideal values of the next layer's weights 𝑾𝟐
affects the appropriate learning rate for 𝑾𝟐
Solution
ensure that the distribution of activations for each unit doesn't change over the course of learning
Hack Solutions
Batch normalization [LINK]
Normalize the activation of each unit, in any layer, across the examples of a minibatch
[Diagram: hidden layer 𝒉𝟏 between weight matrices 𝑾𝟏 and 𝑾𝟐]
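A minimal numpy sketch of the batch-norm computation for a single unit, assuming the usual learned scale and shift parameters (the names gamma, beta, and eps follow common convention, not the slides):

```python
# Batch normalization forward pass for one unit, sketched in numpy.
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x: activations of one unit, one entry per minibatch example."""
    mu = x.mean()                          # minibatch mean for this unit
    var = x.var()                          # minibatch variance for this unit
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance across the batch
    return gamma * x_hat + beta            # learned scale and shift restore flexibility

x = np.array([2.0, 4.0, 6.0, 8.0])         # toy activations for a minibatch of 4
print(batch_norm_forward(x))
```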
Unsupervised layer-by-layer pretraining
Goal is to start weights off in a sensible configuration instead of using random
initial weights
Methods of pretraining
autoencoders
restricted Boltzmann machines (Hinton’s group)
The dominant paradigm from ~ 2000-2010
Still useful today if
not much labeled data
lots of unlabeled data
Autoencoders
Self-supervised training procedure
Given a set of input vectors (no target outputs)
Map input back to itself via a hidden layer bottleneck
How to achieve bottleneck?
Fewer neurons
Sparsity constraint
Information transmission constraint (e.g., add noise to unit, or shut off randomly,
a.k.a. dropout)
Autoencoder Combines
An Encoder And A Decoder
[Diagram: encoder maps the input to the bottleneck code; decoder maps the code back to the reconstructed input]
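A minimal numpy sketch of a bottleneck autoencoder; the layer sizes, tanh encoder, and tied decoder weights are illustrative assumptions, not prescriptions from the slides:

```python
# Bottleneck autoencoder sketch: encode, decode, and measure reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 20, 5                   # hidden bottleneck narrower than the input

W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_dec = W_enc.T                          # tied weights: decoder mirrors the encoder

def encode(x):
    return np.tanh(x @ W_enc)            # map input to the bottleneck code

def decode(h):
    return h @ W_dec                     # map the code back to input space

x = rng.normal(size=(8, n_in))           # a small batch of inputs (no targets needed)
loss = np.mean((decode(encode(x)) - x) ** 2)   # reconstruction error to minimize
print(loss)
```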
Stacked Autoencoders
[Diagram: a stack of autoencoders trained one layer at a time; the learned encoder weights are copied into the corresponding layers of the deep network]
Note that decoders can be stacked to produce a generative domain model
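A sketch of the greedy layer-by-layer pretraining procedure, using simple linear autoencoders with tied weights so the whole thing stays short; the layer sizes, learning rate, and epoch count are arbitrary assumptions:

```python
# Greedy layer-wise pretraining sketch with linear, tied-weight autoencoders.
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(data, n_hidden, lr=0.01, epochs=200):
    """Fit a linear autoencoder with tied weights; return the encoder matrix W."""
    W = rng.normal(scale=0.1, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        err = data @ W @ W.T - data                    # reconstruction error
        grad = data.T @ err @ W + err.T @ data @ W     # gradient of squared error wrt W
        W -= lr * grad / len(data)
    return W

X = rng.normal(size=(100, 30))           # unlabeled data
weights, layer_input = [], X
for n_hidden in [20, 10, 5]:             # train one autoencoder per layer
    W = train_autoencoder(layer_input, n_hidden)
    weights.append(W)                    # copied in to initialize the deep network
    layer_input = layer_input @ W        # next autoencoder sees this layer's codes
```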
Restricted Boltzmann Machines (RBMs)
For today, an RBM is like an autoencoder with the output layer folded back onto the input, but with probabilistic neurons.
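A minimal sketch of one RBM reconstruction pass in the spirit of the analogy above: the same weights map visible units to hidden units and back, but the units are sampled stochastically (biases are omitted and the sizes are arbitrary assumptions):

```python
# RBM-as-folded-autoencoder sketch: one visible -> hidden -> visible pass.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 12, 6
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)   # stochastic binary units

v = sample(np.full(n_visible, 0.5))      # a random binary visible vector
h = sample(sigmoid(v @ W))               # "encode": sample hidden given visible
v_recon = sample(sigmoid(h @ W.T))       # "decode": sample visible given hidden
print(v, v_recon)                        # training pushes v_recon toward v
```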
Depth of ImageNet Challenge Winner
source: http://chatbotslife.com
Trick From Long Ago To Avoid Local Optima
Add direct connections from input to output layer
Easy bits of the mapping are learned by the direct connections
Easy bit = linear or saturating-linear functions of the input
Often captures 80-90% of the variance in the outputs
Hidden units are reserved for learning the hard bits of the mapping
They don’t need to learn to copy activations forward for the linear portion of the
mapping
Problem
Adds a lot of free parameters
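A minimal sketch of the direct-connection architecture described above: the output is a linear function of the input plus a contribution from a small hidden layer (all layer sizes are arbitrary assumptions):

```python
# Network with direct input->output connections alongside a hidden layer.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 10, 4, 3

W_direct = rng.normal(scale=0.1, size=(n_in, n_out))   # direct input -> output weights
W_ih = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_ho = rng.normal(scale=0.1, size=(n_hidden, n_out))

def forward(x):
    linear_part = x @ W_direct           # easy (linear) bit of the mapping
    hidden = np.tanh(x @ W_ih)           # hidden units handle the hard bits
    return linear_part + hidden @ W_ho

print(forward(rng.normal(size=(2, n_in))).shape)
```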
Latest Tricks
Novel architectures that
skip layers
have linear connectivity between layers
Advantage over direct-connection architectures
no/few additional free parameters
Deep Residual Networks (ResNet)
Add linear short-cut connections to architecture
Basic building block: activation flows along both an identity short cut and a transform path, $y = x + F(x)$ (see the sketch after the variations below)
Variations
allow different input and output dimensionalities by projecting the short cut: $y = F(x) + W_s x$
[Table: top-1 error % for the architecture variations]
Do proper identity mappings
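A minimal numpy sketch of the residual building block, with an optional projection W_s for the case where input and output dimensionalities differ; the two-layer form of F and the sizes are illustrative assumptions:

```python
# Residual block sketch: y = shortcut(x) + F(x).
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W1, W2, W_s=None):
    """Return shortcut(x) + F(x), where F is a small two-layer ReLU transform."""
    F = np.maximum(0.0, x @ W1) @ W2             # residual (transform) branch
    shortcut = x if W_s is None else x @ W_s     # identity or projection short cut
    return shortcut + F

d = 8
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
x = rng.normal(size=(4, d))
print(residual_block(x, W1, W2).shape)           # identity short cut: y = x + F(x)
```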
Suppose each layer of the network made a decision whether to
copy the input forward: $y = x$
or perform a nonlinear transform of the input: $y = f(x)$
with a learned weighting coefficient that blends the two options.
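A hedged sketch of the copy-versus-transform idea: a learned weighting coefficient t in (0, 1) blends copying the input with transforming it, in the style of a highway-network gate (the sigmoid gate parameterization and the sizes are my assumptions):

```python
# Gated copy-vs-transform layer sketch.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_h = rng.normal(scale=0.1, size=(d, d))          # transform weights
W_t = rng.normal(scale=0.1, size=(d, d))          # gate (weighting coefficient) weights

def gated_layer(x):
    h = np.tanh(x @ W_h)                          # nonlinear transform of the input
    t = 1.0 / (1.0 + np.exp(-(x @ W_t)))          # weighting coefficient in (0, 1)
    return t * h + (1.0 - t) * x                  # t near 0 copies, t near 1 transforms

print(gated_layer(rng.normal(size=(4, d))).shape)
```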