CSCI 5922

Neural Networks and Deep Learning

Deep Nets

Mike Mozer
Department of Computer Science and
Institute of Cognitive Science
University of Colorado at Boulder
Why Stop At One Hidden Layer?


E.g., vision hierarchy for recognizing handprinted text (sketched in code below)
 Word (output layer)
 Character (hidden layer 3)
 Stroke (hidden layer 2)
 Edge (hidden layer 1)
 Pixel (input layer)
From Ng's group
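
Below is a minimal NumPy sketch of what such a stack of hidden layers looks like computationally; the layer sizes, the ReLU nonlinearity, and the random weights are illustrative assumptions, not values from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Illustrative layer sizes: pixels -> edges -> strokes -> characters -> words.
# The real hierarchy is learned from data, not hand-wired.
sizes = [784, 256, 128, 64, 26]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Pass a batch of 'pixel' vectors through the stacked hidden layers."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
    return h  # 'word-level' output activations

x = rng.normal(size=(5, 784))   # a batch of 5 fake images
print(forward(x).shape)         # (5, 26)
```
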
Why Deeply Layered Networks Fail


Credit assignment problem
 How is a neuron in layer 2 supposed to know what it should output until all the neurons above it do something sensible?
 How is a neuron in layer 4 supposed to know what it should output until all the neurons below it do something sensible?
Deeper Vs. Shallower Nets


Deeper net can represent any mapping that shallower net can
 use identity mappings for the additional layers
 Deeper net in principle is more likely to overfit
 But in practice it often underfits on the training set
 degradation due to the harder credit-assignment problem
 deeper isn't always better!

[Figure: error curves on CIFAR-10 and ImageNet; thin lines = training, thick lines = validation]
He, Zhang, Ren, and Sun (2015)


Why Deeply Layered Networks Fail


Vanishing gradient problem
 With logistic or tanh units:
   $y_j = \dfrac{1}{1 + \exp(-z_j)}, \qquad \dfrac{\partial y_j}{\partial z_j} = y_j\,(1 - y_j)$
   $y_j = \tanh(z_j), \qquad \dfrac{\partial y_j}{\partial z_j} = (1 + y_j)\,(1 - y_j)$
 Error gradients get squashed as they are passed back through a deep network (see the code sketch below)
[Figure: gradient magnitude shrinking from layer n back through layers 3, 2, 1]
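
A small NumPy demonstration of the squashing effect described above; the depth, width, weight scale, and random inputs are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth, width = 20, 50

# Forward pass through `depth` logistic layers, remembering each layer's output.
h = rng.normal(size=width)
layers = []
for _ in range(depth):
    W = rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
    h = sigmoid(h @ W)
    layers.append((W, h))

# Backward pass: each layer multiplies the error signal by y(1 - y) and W^T,
# so the gradient norm shrinks as it is passed back toward the input.
grad = np.ones(width)
for i, (W, y) in enumerate(reversed(layers), start=1):
    grad = (grad * y * (1.0 - y)) @ W.T
    if i % 5 == 0:
        print(f"{i} layers back: gradient norm = {np.linalg.norm(grad):.3e}")
```
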
Why Deeply Layered Networks Fail

 Exploding gradient problem
 With linear or ReLU units:
   $y = \max(0, z), \qquad \dfrac{\partial y}{\partial z} = \begin{cases} 0 & z \le 0 \\ 1 & \text{otherwise} \end{cases}$
   $\dfrac{\partial y}{\partial x} = \begin{cases} 0 & y = 0 \\ w_{yx} & \text{otherwise} \end{cases}$
 Can be a problem when $|w_{yx}| > 1$: the gradient is scaled up at every layer it passes back through (see the code sketch below)
[Figure: gradient magnitude growing from layer n back through layers 3, 2, 1]
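
The complementary effect with linear/ReLU-style layers, again as a toy NumPy sketch; the deliberately large weight scale and the random active-unit mask are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
depth, width = 20, 50

# Backpropagate an error signal through ReLU-style layers: each unit passes the
# gradient back as either 0 (inactive) or w_yx (active). When the surviving
# weights effectively have magnitude > 1, the gradient grows multiplicatively.
grad = np.ones(width)
for i in range(1, depth + 1):
    W = rng.normal(0.0, 3.0 / np.sqrt(width), (width, width))  # deliberately large
    active = rng.random(width) > 0.5        # stand-in for which units had y > 0
    grad = (grad * active) @ W.T
    if i % 5 == 0:
        print(f"{i} layers back: gradient norm = {np.linalg.norm(grad):.3e}")
```
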
Hack Solutions

 Using ReLUs
   can avoid squashing of the gradient
[Figure: gradient magnitude across layers]

 Use gradient clipping for exploding gradients (sketched in code below)
    $\Delta w_{xy} \propto \max\!\left(-\Delta_0,\ \min\!\left(\Delta_0,\ \dfrac{\partial E}{\partial w_{xy}}\right)\right)$

 Use the gradient sign for exploding and vanishing gradients
    $\Delta w_{xy} \propto \operatorname{sign}\!\left(\dfrac{\partial E}{\partial w_{xy}}\right)$
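
A sketch of the two update rules above, applied elementwise to a gradient array; the threshold $\Delta_0$, the step size, and the example gradient values are placeholders.

```python
import numpy as np

def clipped_gradient(grad, delta0):
    """Gradient clipping: limit each component to the range [-delta0, delta0]."""
    return np.clip(grad, -delta0, delta0)

def sign_step(grad, step_size):
    """Sign update: keep only the direction of each gradient component."""
    return step_size * np.sign(grad)

# The actual weight change would also include a learning rate and a minus
# sign for gradient descent; here we only show the clipped/sign quantities.
grad = np.array([-250.0, 0.003, 1.7])
print(clipped_gradient(grad, delta0=1.0))   # [-1.     0.003  1.   ]
print(sign_step(grad, step_size=0.01))      # [-0.01  0.01  0.01]
```
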
Hack Solutions


Hard weight constraints (sketched in code below)
  $\mathbf{w} \leftarrow \mathbf{w}\,\dfrac{\min\!\left(\lVert\mathbf{w}\rVert_2,\ l\right)}{\lVert\mathbf{w}\rVert_2}$
 avoids sigmoid saturation, for vanishing gradients
 may prevent gradients from blowing up by limiting error gradient scaling
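
A sketch of the hard norm constraint above; applying it per weight vector (e.g., the incoming weights of one unit) after each update is one common way to use it, and the limit value here is arbitrary.

```python
import numpy as np

def max_norm(w, limit):
    """Hard constraint from the slide: rescale w so its L2 norm never exceeds `limit`."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return w
    return w * min(norm, limit) / norm

w = np.array([3.0, 4.0])            # norm 5
print(max_norm(w, limit=2.0))       # rescaled to norm 2 -> [1.2 1.6]
print(max_norm(w, limit=10.0))      # already within the limit -> unchanged
```
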
Hack Solutions


Batch normalization
 As the weights $W_1$ are learned, the distribution of the activations $h_1$ changes
   affects the ideal value of $W_2$
   affects the appropriate learning rate for $W_2$
 Solution: ensure that the distribution of activations for each unit doesn't change over the course of learning
[Figure: two-layer network with weights W1, hidden activations h1, weights W2]
Hack Solutions

Batch normalization [LINK]
 $x_i$: activation of a unit in any layer for minibatch example $i$
 normalize over the minibatch, then scale and shift (sketched in code below):
   $\hat{x}_i = \dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$
   where $\mu$ and $\sigma^2$ are the mean and variance of the activation over the minibatch
 parameters $\gamma$ and $\beta$ are trained via gradient descent
 separate $\gamma, \beta$ parameters for each neuron
[Figure: two-layer network with weights W1, hidden activations h1, weights W2]
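
A forward-pass sketch of the normalization above in NumPy; the batch size, number of units, and epsilon are illustrative, and the running statistics used at test time are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each unit over the minibatch, then scale and shift.

    x:     (batch, units) activations
    gamma: (units,) learned scale, one per neuron
    beta:  (units,) learned shift, one per neuron
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=(32, 4))   # badly scaled activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 mean, ~1 std per unit
```
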
Hack Solutions


Unsupervised layer-by-layer pretraining
 Goal is to start weights off in a sensible configuration instead of using random
initial weights
 Methods of pretraining
   autoencoders
   restricted Boltzmann machines (Hinton's group)
 The dominant paradigm from roughly 2000 to 2010
 Still useful today if
   not much labeled data
   lots of unlabeled data
Autoencoders

Self-supervised training procedure

Given a set of input vectors (no target outputs)

Map the input back to itself via a hidden-layer bottleneck (a code sketch follows below)


How to achieve bottleneck?
 Fewer neurons
 Sparsity constraint
 Information transmission constraint (e.g., add noise to unit, or shut off randomly,
a.k.a. dropout)
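
A minimal NumPy sketch of an undercomplete autoencoder trained by gradient descent on squared reconstruction error; the sizes, learning rate, number of steps, and the random stand-in data are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 20, 5                 # hidden layer smaller than input: the bottleneck
X = rng.normal(size=(200, n_in))       # stand-in data (real inputs would have structure)

W_enc = rng.normal(0.0, 0.1, (n_in, n_hidden))
W_dec = rng.normal(0.0, 0.1, (n_hidden, n_in))
lr = 0.5

for step in range(1001):
    H = np.tanh(X @ W_enc)             # encoder: input -> bottleneck code
    X_hat = H @ W_dec                  # decoder: code -> reconstruction
    err = X_hat - X                    # no target outputs: the input is its own target
    if step % 200 == 0:
        print(f"step {step}: reconstruction MSE = {np.mean(err**2):.3f}")
    # Backprop of the squared reconstruction error (averaged over the batch)
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ ((err @ W_dec.T) * (1.0 - H**2)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```
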
Autoencoder Combines
An Encoder And A Decoder

[Figure: encoder maps the input up to a hidden code; decoder maps the code back to a reconstruction of the input]
Stacked Autoencoders

[Figure: autoencoders trained one layer at a time; each trained hidden layer is copied into a progressively deeper network]


Note that the decoders can also be stacked to produce a generative model of the domain (a greedy layer-by-layer training sketch follows below)
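
A sketch of greedy layer-by-layer pretraining with autoencoders as the building block; the helper function, layer widths, and training settings are all illustrative assumptions.

```python
import numpy as np

def train_autoencoder(X, n_hidden, steps=1000, lr=0.5, seed=0):
    """Train one tanh autoencoder on X and return its encoder weights."""
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.1, (X.shape[1], n_hidden))
    W_dec = rng.normal(0.0, 0.1, (n_hidden, X.shape[1]))
    for _ in range(steps):
        H = np.tanh(X @ W_enc)
        err = H @ W_dec - X
        grad_dec = H.T @ err / len(X)
        grad_enc = X.T @ ((err @ W_dec.T) * (1.0 - H**2)) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc

# Greedy stacking: train each autoencoder on the codes produced by the layer
# below it, then copy the learned encoders into one deep network.
X = np.random.default_rng(1).normal(size=(200, 20))
encoder_stack, inputs = [], X
for n_hidden in [12, 6, 3]:                # illustrative layer widths
    W = train_autoencoder(inputs, n_hidden)
    encoder_stack.append(W)
    inputs = np.tanh(inputs @ W)           # codes become the next layer's training data

# The stacked encoders define the initial weights of a deep feedforward net.
print([W.shape for W in encoder_stack])    # [(20, 12), (12, 6), (6, 3)]
```
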
Restricted Boltzmann Machines (RBMs)

For today, an RBM is like an autoencoder with the output layer folded back onto the input.

 We’ll spend a whole class on probabilistic models like RBMs



Additional constraints

 probabilistic neurons
Depth of ImageNet Challenge Winner

[Figure: depth of the ImageNet challenge winning networks by year; source: http://chatbotslife.com]
Trick From Long Ago To Avoid Local Optima

Add direct connections from input to output layer

Easy bits of the mapping are learned by the direct connections
 Easy bit = linear or saturating-linear functions of the input
 Often captures 80-90% of the variance in the outputs

Hidden units are reserved for learning the hard bits of the mapping
 They don’t need to learn to copy activations forward for the linear portion of the
mapping

Problem
 adds a lot of free parameters (a code sketch of this architecture follows below)
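
A sketch of a net with direct input-to-output connections alongside one hidden layer; the sizes and the tanh hidden nonlinearity are illustrative assumptions.

```python
import numpy as np

def forward(x, W_direct, W_in, W_out):
    """Output = linear direct path from the input + nonlinear hidden path."""
    hidden = np.tanh(x @ W_in)          # hidden units handle the 'hard bits'
    return x @ W_direct + hidden @ W_out

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 10, 4, 3
x = rng.normal(size=(8, n_in))
y = forward(x,
            W_direct=rng.normal(0.0, 0.1, (n_in, n_out)),   # the extra free parameters
            W_in=rng.normal(0.0, 0.1, (n_in, n_hidden)),
            W_out=rng.normal(0.0, 0.1, (n_hidden, n_out)))
print(y.shape)   # (8, 3)
```
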
Latest Tricks


Novel architectures that
 skip layers
 have linear connectivity between layers

Advantage over direct-connection architectures
 no/few additional free parameters
Deep Residual Networks (ResNet)


Add linear short-cut connections to the architecture
 Basic building block activation flow (sketched in code below): $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$, where $F$ is the mapping learned by the stacked layers
 Variations
   allow different input and output dimensionalities with a linear projection on the shortcut: $\mathbf{y} = F(\mathbf{x}) + W_s\,\mathbf{x}$
   can have an arbitrary number of layers inside the block

He, Zhang, Ren, and Sun (2015)
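
A dense-layer sketch of the residual block; the paper's blocks are convolutional, so the two-layer form, the widths, and the ReLU placement here are a simplified illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2, W_proj=None):
    """y = F(x) + x, where F is two weight layers with a ReLU in between.

    If the output dimensionality differs from the input's, a linear
    projection W_proj is applied to x on the shortcut path.
    """
    f = relu(x @ W1) @ W2
    shortcut = x if W_proj is None else x @ W_proj
    return relu(f + shortcut)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

# Same input and output dimensionality: identity shortcut.
W1 = rng.normal(0.0, 0.1, (16, 16))
W2 = rng.normal(0.0, 0.1, (16, 16))
print(residual_block(x, W1, W2).shape)              # (4, 16)

# Different dimensionality: project the shortcut with W_proj.
W1b = rng.normal(0.0, 0.1, (16, 32))
W2b = rng.normal(0.0, 0.1, (32, 32))
W_proj = rng.normal(0.0, 0.1, (16, 32))
print(residual_block(x, W1b, W2b, W_proj).shape)    # (4, 32)
```
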


ResNet



He, Zhang, Ren, and Sun (2015)


ResNet

[Figure: Top-1 error (%)]

He, Zhang, Ren, and Sun (2015)


A Year Later…


Do proper identity mappings

He, Zhang, Ren, and Sun (2016)


Highway Networks


Suppose each layer of the network made a decision whether to
 copy the input forward: $\mathbf{y} = \mathbf{x}$
 perform a nonlinear transform of the input: $\mathbf{y} = H(\mathbf{x})$
 Weighting coefficient $T(\mathbf{x})$ blends the two options:
   $\mathbf{y} = T(\mathbf{x}) \odot H(\mathbf{x}) + \left(1 - T(\mathbf{x})\right) \odot \mathbf{x}$
 the coefficient is a function of the input, e.g. $T(\mathbf{x}) = \sigma(W_T\,\mathbf{x} + b_T)$ (sketched in code below)

Srivastava, Greff, & Schmidhuber (2015)
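
A NumPy sketch of a single highway layer as described above; tanh for $H$, the layer width, and the negative initial gate bias (which starts the layer close to the copy behaviour) are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = T(x) * H(x) + (1 - T(x)) * x, per-unit gating between transform and copy."""
    H = np.tanh(x @ W_H + b_H)        # nonlinear transform of the input
    T = sigmoid(x @ W_T + b_T)        # transform gate, itself a function of the input
    return T * H + (1.0 - T) * x      # T ~ 0 copies x forward, T ~ 1 transforms it

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=(4, n))
y = highway_layer(x,
                  W_H=rng.normal(0.0, 0.1, (n, n)), b_H=np.zeros(n),
                  W_T=rng.normal(0.0, 0.1, (n, n)),
                  b_T=-2.0 * np.ones(n))   # negative bias: gates start mostly closed
print(y.shape)   # (4, 8)
```
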


Highway Networks Facilitate Training Deep Networks

Srivastava, Greff, & Schmidhuber (2015)
