Architecture Design for Deep Learning
Topics
• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation
6. Historical Notes
Loss Function with Regularization
• Given training examples x_i with labels y_i, i = 1,..,N, and model prediction f(x_i, W), the training objective is the average per-example loss plus a regularizer:
    L = (1/N) Σ_{i=1..N} L_i( f(x_i, W), y_i ) + R(W)
• L_i is the per-example loss and R(W) is a regularizer (norm penalty) on the parameters W = {W^(1), W^(2),..}
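A minimal NumPy sketch of this objective, assuming a squared-error per-example loss L_i and an L2 norm penalty for R(W) (both choices are illustrative, not specified above):

```python
import numpy as np

def total_loss(f, W, X, Y, reg_strength=1e-2):
    """Average per-example loss plus a norm penalty R(W).

    f: model, f(x, W) -> prediction
    W: list of weight matrices {W(1), W(2), ...}
    X, Y: training inputs and targets, i = 1..N
    """
    N = len(X)
    # Data term: (1/N) * sum_i L_i(f(x_i, W), y_i); here L_i is squared error
    data_loss = sum(np.sum((f(x, W) - y) ** 2) for x, y in zip(X, Y)) / N
    # Regularizer R(W): an L2 norm penalty over all weight matrices
    R = reg_strength * sum(np.sum(w ** 2) for w in W)
    return data_loss + R
```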
Architecture Terminology
• The word architecture refers to the overall structure of the network:
  – How many units should it have?
  – How should the units be connected to each other?
• Most neural networks are organized into groups of units called layers
  – Most neural network architectures arrange these layers in a chain structure
  – with each layer being a function of the layer that preceded it (see the sketch below)
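A minimal sketch of such a chain, assuming ReLU activations and illustrative layer widths; each layer computes h^(l) = g(W^(l) h^(l-1) + b^(l)) from the preceding layer's output:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def chain_forward(x, weights, biases):
    """Chain-structured network: each layer consumes the previous layer's output."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)   # h^(l) = g(W^(l) h^(l-1) + b^(l))
    return h

# Example: 3-layer chain with widths 4 -> 8 -> 8 -> 2 (illustrative sizes)
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(chain_forward(rng.standard_normal(4), weights, biases))
```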
CNN Architectures
[Figure: example CNN architectures; more complex features are captured in deeper layers]
[Figure: sensory processing and environmental feedback]
Theoretical underpinnings
• Mathematical theory of Artificial Neural
Networks
– Linear versus Nonlinear Models
– Universal Approximation Theorem
• No Free Lunch Theorem
• Size of network
Implication of UA Theorem
• A feedforward network with a linear output layer and at least one hidden layer with a squashing activation function (e.g., logistic sigmoid) can approximate:
  – any Borel measurable function from one finite-dimensional space to another
    • If f: X→Y is a continuous mapping, where Y is any topological space, (X, B) is a measurable space, and f⁻¹(V) ∈ B for every open set V in Y, then f is a Borel measurable function
  – provided the network is given enough hidden units
• The derivatives of the network can also approximate the derivatives of the function well
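A minimal sketch of the theorem in action, assuming a one-hidden-layer network with sigmoid (squashing) hidden units and a linear output layer, trained by plain gradient descent to approximate the continuous target sin(x); the target, width, learning rate, and iteration count are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)                              # continuous target on a closed, bounded set

H = 30                                     # "enough" hidden units (illustrative width)
W1 = rng.standard_normal((1, H)); b1 = np.zeros(H)
W2 = rng.standard_normal((H, 1)) * 0.1; b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(5000):                      # plain gradient descent on squared error
    h = sigmoid(x @ W1 + b1)               # squashing hidden layer
    pred = h @ W2 + b2                     # linear output layer
    err = pred - y
    # Backpropagate the squared-error loss
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)
    dh = err @ W2.T * h * (1 - h)
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

print("max abs error:", np.abs(sigmoid(x @ W1 + b1) @ W2 + b2 - y).max())
```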
Applicability of Theorem
• Any continuous function on a closed and bounded subset of R^n is Borel measurable
  – Therefore it can be approximated by a neural network
• Discrete case:
  – A neural network may also approximate any function mapping from one finite-dimensional discrete space to another
• The original theorems were stated for activations that saturate for very negative/positive arguments
  – They have also been proved for a wider class of activations that includes ReLU
On Size of Network
• Universal Approximation Theorem
  – says there is a network large enough to achieve any degree of accuracy
  – but does not say how large the network will be
• Bounds on the size of the single-layer network exist for a broad class of functions
  – But the worst case requires an exponential no. of hidden units
• The no. of binary functions on vectors v ∈ {0,1}^n is 2^(2^n)
  – e.g., there are 16 functions of 2 variables
  – Selecting one such function requires 2^n bits, which will require O(2^n) degrees of freedom
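A small sketch of the counting argument, illustrating that each binary function on {0,1}^n is a truth table with 2^n entries, so there are 2^(2^n) such functions:

```python
from itertools import product

def count_binary_functions(n):
    """Each binary function on {0,1}^n is a truth table with 2^n entries,
    so there are 2 ** (2 ** n) distinct functions."""
    table_entries = 2 ** n
    return 2 ** table_entries

# e.g. 16 functions of 2 variables; the count grows doubly exponentially
for n in range(1, 5):
    print(n, count_binary_functions(n))

# Explicit enumeration for n = 2: all 16 truth tables over the 4 input points
inputs = list(product([0, 1], repeat=2))
truth_tables = list(product([0, 1], repeat=len(inputs)))
print(len(truth_tables))  # 16
```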
Summary/Implications of Theorem
• A feedforward network with a single hidden layer is sufficient to represent any function
• But the layer may be infeasibly large and may fail to generalize correctly
• Using deeper models can reduce the no. of units required and reduce generalization error
• Output of the last fully connected layer is fed to a 1000-way softmax to produce a distribution over 1000 labels
• Deep CNNs with ReLUs train several times faster than their equivalents with tanh units
  – [Figure: training curves] ReLU reaches a 25% error rate six times faster than tanh neurons
  – Not possible to experiment with large neural networks using traditional saturating neuron models
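A minimal sketch of why saturating units slow training: the ReLU gradient stays at 1 for positive inputs, while the tanh gradient shrinks toward 0 for large |z| (the sample points below are illustrative):

```python
import numpy as np

z = np.linspace(-6, 6, 13)
tanh_vals = np.tanh(z)

# Gradients: d(ReLU)/dz is 1 for z > 0 (non-saturating);
# d(tanh)/dz = 1 - tanh(z)^2 shrinks toward 0 for large |z| (saturating),
# which slows gradient-based learning.
relu_grad = (z > 0).astype(float)
tanh_grad = 1.0 - tanh_vals ** 2

for zi, rg, tg in zip(z, relu_grad, tanh_grad):
    print(f"z={zi:+.1f}  dReLU/dz={rg:.0f}  dtanh/dz={tg:.4f}")
```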
[Figure: folding the input space with absolute-value rectification]
• An absolute-value rectification unit has the same output for every pair of mirror points in its input; the mirror axis of symmetry is given by the weights and bias of the unit
• The function computed on top of that unit (green decision surface) is a mirror image of a simpler pattern across the axis of symmetry; the function can be obtained by folding the space around the axis of symmetry
• Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry, which is now repeated four times with two hidden layers
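A minimal sketch of the folding effect, with an illustrative absolute-value unit g(x) = |w·x + b|: mirror-image points across the hyperplane w·x + b = 0 map to the same output:

```python
import numpy as np

w, b = np.array([1.0, -1.0]), 0.0        # illustrative unit; mirror axis is x1 = x2

def abs_unit(x):
    """Absolute-value rectification: same output for mirror points across w.x + b = 0."""
    return abs(w @ x + b)

x = np.array([2.0, 0.5])
x_mirror = np.array([0.5, 2.0])          # reflection of x across the axis x1 = x2
print(abs_unit(x), abs_unit(x_mirror))   # identical outputs: 1.5 1.5
```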
Theorem on Depth
• The no. of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is
    O( (n/d)^(d(l-1)) n^d )
  – i.e., exponential in the depth l
• In the case of maxout networks with k filters per unit, the no. of linear regions is
    O( k^((l-1)+d) )
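A small sketch evaluating these bounds for illustrative values of d, n, l, and k, showing the exponential growth with depth:

```python
def rectifier_regions_bound(d, n, l):
    """Order-of-magnitude bound O((n/d)^(d(l-1)) * n^d) on linear regions
    of a deep rectifier net with d inputs, depth l, n units per hidden layer."""
    return (n / d) ** (d * (l - 1)) * n ** d

def maxout_regions_bound(k, d, l):
    """Bound O(k^((l-1)+d)) for maxout networks with k filters per unit."""
    return k ** ((l - 1) + d)

# Growth with depth for d=2 inputs, n=8 units per layer, k=4 filters (illustrative)
for l in range(1, 6):
    print(l, rectifier_regions_bound(2, 8, l), maxout_regions_bound(4, 2, l))
```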
Intuition on Depth
• A deep architecture expresses a belief that the function we want to learn is a computer program consisting of m steps
  – where each step uses the previous step's output
• Intermediate outputs are not necessarily factors of variation
  – but can be analogous to counters or pointers used for organizing processing
• Empirically, greater depth results in better generalization
Empirical Results
• Deeper networks perform better
Non-chain Architectures
• Not all architectures arrange layers in a single chain; some add connections that skip from earlier layers to later ones