
Architecture Design for Deep Learning
Sargur N. Srihari
[email protected]


Topics
• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation Algorithms
6. Historical Notes


Topics in Architecture Design


1. Basic design of a neural network
2. Architecture Terminology
3. Chart of 27 neural network designs (generic)
4. Specific deep learning architectures
5. Equations for Layers
6. Theoretical underpinnings
– Universal Approximation Theorem
– No Free Lunch Theorem
7. Advantages of deeper networks
8. Non-chain architecture

Neural network for supervised learning (with regularization)

• Data set: {xi, yi}, i = 1,..,N
• The network computes f(xi, W); the per-example loss is Li = L(f(xi, W), yi)
• The average data loss is
  L = (1/N) ∑i=1,..,N Li
• The full training objective adds a regularizer (norm penalty) R(W) over the weights W = {W(1), W(2), ...}:
  L + R(W)
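A minimal sketch of this objective in Python/NumPy (the squared-error loss, the L2 norm penalty, and the dictionary layout of W are illustrative assumptions, not specified on the slide):

```python
import numpy as np

def objective(f, W, xs, ys, lam=1e-2):
    """Average data loss over the N examples plus a norm penalty R(W)."""
    # Data loss: mean of L(f(xi, W), yi) over the training set (squared error here).
    data_loss = np.mean([np.sum((f(x, W) - y) ** 2) for x, y in zip(xs, ys)])
    # Regularizer R(W): L2 penalty over all weight matrices in W = {W(1), W(2), ...}.
    penalty = lam * sum(np.sum(Wk ** 2) for Wk in W.values())
    return data_loss + penalty
```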

Design with two layers

• Most networks are organized into groups of units called layers
– Layers are arranged in a chain structure
• Each layer is a function of the layer that preceded it
– First layer is given by h(1) = g(1)(W(1)T x + b(1))
– Second layer is h(2) = g(2)(W(2)T h(1) + b(2)), etc.
• Example: x = [x1, x2, x3]T, with first-layer weight vectors
  W1(1) = [W11 W12 W13]T, W2(1) = [W21 W22 W23]T, W3(1) = [W31 W32 W33]T
– The first network layer's output can then be written in matrix multiplication notation
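A minimal NumPy sketch of this two-layer chain (the tanh and identity choices for g(1) and g(2), and the layer widths, are illustrative assumptions; here W1 @ x plays the role of W(1)T x):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])                 # x = [x1, x2, x3]T

W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)  # first-layer weights W(1) and bias b(1)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # second-layer weights W(2) and bias b(2)

h1 = np.tanh(W1 @ x + b1)                      # h(1) = g(1)(W(1)T x + b(1))
h2 = W2 @ h1 + b2                              # h(2) = g(2)(W(2)T h(1) + b(2)), identity g(2)
print(h2)
```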

Architecture Terminology
• The word architecture refers to the overall structure of the network:
– How many units should it have?
– How should the units be connected to each other?
• Most neural networks are organized into groups of units called layers
– Most neural network architectures arrange these layers in a chain structure
– With each layer being a function of the layer that preceded it

Advantage of Deeper Networks

• Deeper networks often have
– Far fewer units in each layer
– Far fewer parameters
• They often generalize well to the test set
– But are often more difficult to optimize
• The ideal network architecture must be found via experimentation, guided by validation-set error (see the sketch below)
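A minimal sketch of such a search, assuming scikit-learn is available (the dataset, layer width, and candidate depths are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best = None
for depth in (1, 2, 3, 4):                              # number of hidden layers to try
    clf = MLPClassifier(hidden_layer_sizes=(16,) * depth, max_iter=2000, random_state=0)
    clf.fit(X_tr, y_tr)
    score = clf.score(X_val, y_val)                     # validation-set accuracy guides the choice
    if best is None or score > best[0]:
        best = (score, depth)
print("best depth:", best[1], "validation accuracy:", round(best[0], 3))
```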

Main Architectural Considerations


1. Choice of depth of network
2. Choice of width of each layer


Generic Neural Architectures (1-11)


Generic Neural Architectures (12-19)


Generic Neural Architectures (20-27)


CNN Architectures

• More complex features are captured in deeper layers

Architecture Blending Deep Learning and Reinforcement Learning

• Human-Level Control Through Deep Reinforcement Learning
– Sensory processing; environmental feedback

Theoretical underpinnings
• Mathematical theory of Artificial Neural
Networks
– Linear versus Nonlinear Models
– Universal Approximation Theorem
• No Free Lunch Theorem
• Size of network


Linear vs Nonlinear Models

• A linear model that maps features to outputs via matrix multiplication can only represent linear functions
– Linear models are easy to train
• Because their loss functions lead to convex optimization problems
• Unfortunately, we often want to learn nonlinear functions
– We do not need to design a special family of nonlinear functions for each task
– Feedforward networks with hidden layers provide a universal approximation framework

Universal Approximation Theorem


• A feed-forward network with a single hidden
layer containing a finite number of neurons can
approximate continuous functions on compact
subsets of R^n, under mild assumptions on the
activation function
– Simple neural networks can represent a wide
variety of interesting functions when given
appropriate parameters
– However, it does not touch upon the algorithmic
learnability of those parameters.


Formal Universal Approx. Theorem

– Let φ(⋅) be a continuous activation function that is
• non-constant, bounded, and monotonically increasing
– Let I_m denote the unit hypercube [0,1]^m (m inputs, each with values in [0,1])
– Let C(I_m) denote the space of continuous functions on I_m
• Then, given any function f ∈ C(I_m) and ε > 0, there exist
• an integer N (no. of hidden units),
• real constants v_i, b_i ∈ R (output weights, biases), and
• real vectors w_i ∈ R^m, i = 1,…,N (input weights)
• such that we may define
F(x) = ∑i=1,..,N v_i φ(w_i^T x + b_i)
as an approximate realization of f (where f is independent of φ); i.e.,
|F(x) − f(x)| < ε for all x ∈ I_m
i.e., functions of the form F(x) are dense in C(I_m)
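A minimal sketch of the approximating function F(x) in Python/NumPy, using the logistic sigmoid as an (assumed) choice of φ and arbitrary parameters purely to show the form of F:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(x, v, w, b, phi=sigmoid):
    """F(x) = sum_i v_i * phi(w_i^T x + b_i): one hidden layer with N units."""
    return float(np.sum(v * phi(w @ x + b)))

# Illustrative parameters: N = 4 hidden units, m = 2 inputs in the unit square.
rng = np.random.default_rng(0)
N, m = 4, 2
v, w, b = rng.normal(size=N), rng.normal(size=(N, m)), rng.normal(size=N)
print(F(np.array([0.3, 0.7]), v, w, b))
```

The theorem guarantees that for large enough N some choice of v, w, b brings F within ε of any target f in C(I_m); it does not say how to find those parameters.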

Implication of UA Theorem
• A feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (e.g., the logistic sigmoid) can approximate:
– Any Borel measurable function from one finite-dimensional space to another
– (If f: X→Y is a continuous mapping of X, where Y is any topological space, (X,B) is a measurable space, and f −1(V)∈B for every open set V in Y, then f is a Borel measurable function)
– Provided the network is given enough hidden units
• The derivatives of the network can also approximate the derivatives of the function well

Applicability of Theorem
• Any continuous function on a closed and bounded subset of R^n is Borel measurable
– It can therefore be approximated by a neural network
• Discrete case:
– A neural network may also approximate any function mapping from one finite-dimensional discrete space to another
• The original theorems were stated for activation functions that saturate for very negative/positive arguments
– The result has also been proved for a wider class of activations, including ReLU

Theorem and Training

• Whatever function we are trying to learn, a large MLP will be able to represent it
• However, we are not guaranteed that the training algorithm will learn this function:
1. The optimization algorithm may not find parameter values that correspond to the desired function
2. It may choose the wrong function due to over-fitting
• No Free Lunch: there is no universal procedure for examining a training set of samples and choosing a function that will generalize to points not in the training set

Feed-forward & No Free Lunch


• Feed-forward networks provide a universal
system for representing functions
– Given a function, there is a feed-forward network
that approximates the function
• There is no universal procedure for examining
a training set of specific examples and
choosing a function that will generalize to
points not in the training set


On Size of Network
• Universal Approximation Theorem
– Says there is a network large enough to achieve any degree of accuracy
– But does not say how large the network will be
• Bounds on the size of the single-layer network exist for a broad class of functions
– But the worst case is an exponential no. of hidden units
• The no. of binary functions on vectors v ∈ {0,1}^n is 2^(2^n)
– e.g., there are 16 functions of 2 variables
• Selecting one such function requires 2^n bits, which will require O(2^n) degrees of freedom
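A quick arithmetic check of this count in Python:

```python
# A binary function on {0,1}^n assigns one output bit to each of the 2^n inputs,
# so there are 2^(2^n) such functions in total.
for n in range(1, 5):
    print(n, 2 ** (2 ** n))   # n = 2 gives 16, matching the example above
```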

Summary/Implications of Theorem
• A feedforward network with a single layer is
sufficient to represent any function
• But the layer may be infeasibly large and may
fail to generalize correctly
• Using deeper models can reduce no. of units
required and reduce generalization error


Example of a difficult function

• Wolf (ochre border) versus Husky (yellow border)
– In this example, 90% accuracy was obtained
• Unfortunately, the wolves had snow backgrounds and the huskies had lawns
– Which motivates the need for explanation!
AlexNet: For ImageNet (1.2M high-res images) with 1000 classes

• Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012
• Input image → CNN (5 convolutional layers) → FCN (3 fully connected layers)
• The output of the last FC layer is fed to a 1000-way softmax to produce a distribution over the 1000 labels
• Deep CNNs with ReLUs train several times faster than their equivalents with tanh units
– A ReLU network reaches a 25% error rate six times faster than one with tanh neurons
– It would not be possible to experiment with such large neural networks using traditional saturating neuron models
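A minimal PyTorch sketch of this 5-conv + 3-FC layout (an assumed reconstruction: layer sizes follow the original AlexNet paper, and the adaptive pooling before the classifier is a convenience added here, not something stated on the slide):

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Five convolutional layers, then three fully connected layers and a 1000-way softmax."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d((6, 6))        # keeps the sketch robust to input size
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),               # logits; softmax gives the label distribution
        )

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

logits = AlexNetSketch()(torch.randn(1, 3, 224, 224))
probs = logits.softmax(dim=-1)                          # distribution over the 1000 labels
```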

Function Families and Depth

• Some families of functions can be represented efficiently when the depth is greater than some value d, but require a much larger model when the depth is restricted to be less than or equal to d
• In some cases, the no. of hidden units required by the shallow model is exponential in n
– Functions representable with a deep rectifier net can require an exponential no. of hidden units in a shallow (one hidden layer) network
• Piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a no. of regions that is exponential in the depth

Advantage of deeper networks

• Absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses. By composing these folding operations we obtain an exponentially large no. of piecewise linear regions, which can capture all kinds of repeating patterns.

– (Left) The unit has the same output for every pair of mirror points in the input. The mirror axis of symmetry is given by the weights and bias of the unit. The function computed on top of the unit (green decision surface) will be a mirror image of a simpler pattern across the axis of symmetry.
– (Center) The function can be obtained by folding the space around the axis of symmetry.
– (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry, which is now repeated four times with two hidden layers.
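A minimal NumPy illustration of the folding idea (the specific offsets and the bump placed on top are arbitrary choices): each absolute-value unit folds the input axis, and composing two folds makes a single bump repeat four times.

```python
import numpy as np

def fold(z):
    """One absolute-value rectifier unit: folds its input around zero,
    so anything computed on top of it is mirrored across that axis."""
    return np.abs(z)

x = np.linspace(-2.0, 2.0, 401)
h1 = fold(x)                                   # first hidden layer: x and -x collapse together
h2 = fold(h1 - 1.0)                            # second hidden layer: fold again around |x| = 1
pattern = np.maximum(0.0, 0.25 - np.abs(h2 - 0.5))    # a simple bump computed on top of h2

# With two folds the bump is repeated four times across the input range.
n_bumps = int(np.sum(np.diff((pattern > 0).astype(int)) == 1))
print(n_bumps)                                 # -> 4
```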

Theorem on Depth
• The no. of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is
  O( (n/d)^(d(l−1)) · n^d )
– i.e., exponential in the depth l
• In the case of maxout networks with k filters per unit, the no. of linear regions is
  O( k^((l−1)+d) )
• There is no guarantee that the kinds of functions we want to learn in AI share such a property
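A small Python helper that evaluates these two growth rates with the big-O constants dropped, just to show how quickly the region count grows with depth l (the particular d, n, k values are illustrative):

```python
def rectifier_regions(d, l, n):
    """(n/d)^(d(l-1)) * n^d: region-count growth for a deep rectifier network
    with d inputs, depth l, and n units per hidden layer (constants dropped)."""
    return (n / d) ** (d * (l - 1)) * n ** d

def maxout_regions(d, l, k):
    """k^((l-1)+d): region-count growth for a maxout network with k filters per unit."""
    return k ** ((l - 1) + d)

for l in (2, 4, 8):
    print(l, rectifier_regions(d=2, l=l, n=8), maxout_regions(d=2, l=l, k=4))
```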

Statistical Justification for Depth

• We may want to choose a deep model for statistical reasons
• Any time we choose a particular ML algorithm, we are implicitly stating a set of beliefs about what kind of function the algorithm should learn
• Choosing a deep model encodes a belief that the function we want to learn should be a composition of several simpler functions

Intuition on Depth
• A deep architecture expresses a belief that the function we want to learn is a computer program consisting of m steps
– where each step makes use of the previous step's output
• Intermediate outputs are not necessarily factors of variation
– but can be analogous to counters or pointers used for organizing processing
• Empirically, greater depth results in better generalization

Empirical Results
• Deeper networks perform better
– Test accuracy consistently increases with depth
– Increasing the number of parameters without increasing depth is not as effective
• Deep architectures indeed express a useful prior over the space of functions the model learns

Other architectural considerations


• Specialized architectures are discussed later
• Convolutional Networks
– Used for computer vision
• Recurrent Neural Networks
– Used for sequence processing
– Have their own architectural considerations


Non-chain architecture


Connecting a pair of layers
