Architecture Design for Deep Learning
Topics
• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation
6. Historical Notes
Loss Function with Regularization
• Given training examples x_i with labels y_i, i = 1,..,N, and model prediction f(x_i, W), the training objective is the average per-example loss plus a regularizer:
    L = (1/N) Σ_{i=1..N} L_i( f(x_i, W), y_i ) + R(W)
• L_i is the per-example loss and R(W) is a regularizer (norm penalty) on the parameters W = {W^(1), W^(2),..}
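A minimal NumPy sketch of this objective, assuming a squared-error per-example loss L_i and an L2 norm penalty for R(W) (both choices are illustrative, not specified above):

```python
import numpy as np

def total_loss(f, W, X, Y, reg_strength=1e-2):
    """Average per-example loss plus a norm penalty R(W).

    f: model, f(x, W) -> prediction
    W: list of weight matrices {W(1), W(2), ...}
    X, Y: training inputs and targets, i = 1..N
    """
    N = len(X)
    # Data term: (1/N) * sum_i L_i(f(x_i, W), y_i); here L_i is squared error
    data_loss = sum(np.sum((f(x, W) - y) ** 2) for x, y in zip(X, Y)) / N
    # Regularizer R(W): an L2 norm penalty over all weight matrices
    R = reg_strength * sum(np.sum(w ** 2) for w in W)
    return data_loss + R
```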
Architecture Terminology
• The word architecture refers to the overall structure of the network:
  – How many units should it have?
  – How should the units be connected to each other?
• Most neural networks are organized into groups of units called layers
  – Most neural network architectures arrange these layers in a chain structure
  – with each layer being a function of the layer that preceded it (see the sketch below)
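A minimal sketch of such a chain, assuming ReLU activations and illustrative layer widths; each layer computes h^(l) = g(W^(l) h^(l-1) + b^(l)) from the preceding layer's output:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def chain_forward(x, weights, biases):
    """Chain-structured network: each layer consumes the previous layer's output."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)   # h^(l) = g(W^(l) h^(l-1) + b^(l))
    return h

# Example: 3-layer chain with widths 4 -> 8 -> 8 -> 2 (illustrative sizes)
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(chain_forward(rng.standard_normal(4), weights, biases))
```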
CNN Architectures
[Figure: example CNN architectures; more complex features are captured in deeper layers]
[Figure: sensory processing and environmental feedback]
Theoretical underpinnings
• Mathematical theory of Artificial Neural
Networks
– Linear versus Nonlinear Models
– Universal Approximation Theorem
• No Free Lunch Theorem
• Size of network
Implication of UA Theorem
• A feedforward network with a linear output layer and at least one hidden layer with a squashing activation function (e.g., logistic sigmoid) can approximate:
  – any Borel measurable function from one finite-dimensional space to another
    • If f: X→Y is a continuous mapping, where Y is any topological space, (X, B) is a measurable space, and f⁻¹(V) ∈ B for every open set V in Y, then f is a Borel measurable function
  – provided the network is given enough hidden units
• The derivatives of the network can also approximate the derivatives of the function well
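A minimal sketch of the theorem in action, assuming a one-hidden-layer network with sigmoid (squashing) hidden units and a linear output layer, trained by plain gradient descent to approximate the continuous target sin(x); the target, width, learning rate, and iteration count are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)                              # continuous target on a closed, bounded set

H = 30                                     # "enough" hidden units (illustrative width)
W1 = rng.standard_normal((1, H)); b1 = np.zeros(H)
W2 = rng.standard_normal((H, 1)) * 0.1; b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(5000):                      # plain gradient descent on squared error
    h = sigmoid(x @ W1 + b1)               # squashing hidden layer
    pred = h @ W2 + b2                     # linear output layer
    err = pred - y
    # Backpropagate the squared-error loss
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)
    dh = err @ W2.T * h * (1 - h)
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

print("max abs error:", np.abs(sigmoid(x @ W1 + b1) @ W2 + b2 - y).max())
```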
Applicability of Theorem
• Any continuous function on a closed and bounded subset of R^n is Borel measurable
  – Therefore it can be approximated by a neural network
• Discrete case:
  – A neural network may also approximate any function mapping from one finite-dimensional discrete space to another
• The original theorems were stated for activations that saturate for very negative/positive arguments
  – They have also been proved for a wider class of activations that includes ReLU
On Size of Network
• Universal Approximation Theorem
  – says there is a network large enough to achieve any degree of accuracy
  – but does not say how large the network will be
• Bounds on the size of the single-layer network exist for a broad class of functions
  – But the worst case requires an exponential no. of hidden units
• The no. of binary functions on vectors v ∈ {0,1}^n is 2^(2^n)
  – e.g., there are 16 functions of 2 variables
  – Selecting one such function requires 2^n bits, which will require O(2^n) degrees of freedom
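A small sketch of the counting argument, illustrating that each binary function on {0,1}^n is a truth table with 2^n entries, so there are 2^(2^n) such functions:

```python
from itertools import product

def count_binary_functions(n):
    """Each binary function on {0,1}^n is a truth table with 2^n entries,
    so there are 2 ** (2 ** n) distinct functions."""
    table_entries = 2 ** n
    return 2 ** table_entries

# e.g. 16 functions of 2 variables; the count grows doubly exponentially
for n in range(1, 5):
    print(n, count_binary_functions(n))

# Explicit enumeration for n = 2: all 16 truth tables over the 4 input points
inputs = list(product([0, 1], repeat=2))
truth_tables = list(product([0, 1], repeat=len(inputs)))
print(len(truth_tables))  # 16
```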
Summary/Implications of Theorem
• A feedforward network with a single hidden layer is sufficient to represent any function
• But the layer may be infeasibly large and may fail to generalize correctly
• Using deeper models can reduce the no. of units required and reduce generalization error
• Output of the last fully connected layer is fed to a 1000-way softmax to produce a distribution over 1000 labels
• Deep CNNs with ReLUs train several times faster than their equivalents with tanh units
  – [Figure: training curves] ReLU reaches a 25% error rate six times faster than tanh neurons
  – Not possible to experiment with large neural networks using traditional saturating neuron models
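A minimal sketch of why saturating units slow training: the ReLU gradient stays at 1 for positive inputs, while the tanh gradient shrinks toward 0 for large |z| (the sample points below are illustrative):

```python
import numpy as np

z = np.linspace(-6, 6, 13)
tanh_vals = np.tanh(z)

# Gradients: d(ReLU)/dz is 1 for z > 0 (non-saturating);
# d(tanh)/dz = 1 - tanh(z)^2 shrinks toward 0 for large |z| (saturating),
# which slows gradient-based learning.
relu_grad = (z > 0).astype(float)
tanh_grad = 1.0 - tanh_vals ** 2

for zi, rg, tg in zip(z, relu_grad, tanh_grad):
    print(f"z={zi:+.1f}  dReLU/dz={rg:.0f}  dtanh/dz={tg:.4f}")
```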
[Figure: folding the input space with absolute-value rectification]
• An absolute-value rectification unit has the same output for every pair of mirror points in its input; the mirror axis of symmetry is given by the weights and bias of the unit
• The function computed on top of that unit (green decision surface) is a mirror image of a simpler pattern across the axis of symmetry; the function can be obtained by folding the space around the axis of symmetry
• Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry, which is now repeated four times with two hidden layers
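A minimal sketch of the folding effect, with an illustrative absolute-value unit g(x) = |w·x + b|: mirror-image points across the hyperplane w·x + b = 0 map to the same output:

```python
import numpy as np

w, b = np.array([1.0, -1.0]), 0.0        # illustrative unit; mirror axis is x1 = x2

def abs_unit(x):
    """Absolute-value rectification: same output for mirror points across w.x + b = 0."""
    return abs(w @ x + b)

x = np.array([2.0, 0.5])
x_mirror = np.array([0.5, 2.0])          # reflection of x across the axis x1 = x2
print(abs_unit(x), abs_unit(x_mirror))   # identical outputs: 1.5 1.5
```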
Theorem on Depth
• The no. of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is
    O( (n/d)^(d(l-1)) n^d )
  – i.e., exponential in the depth l
• In the case of maxout networks with k filters per unit, the no. of linear regions is
    O( k^((l-1)+d) )
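A small sketch evaluating these bounds for illustrative values of d, n, l, and k, showing the exponential growth with depth:

```python
def rectifier_regions_bound(d, n, l):
    """Order-of-magnitude bound O((n/d)^(d(l-1)) * n^d) on linear regions
    of a deep rectifier net with d inputs, depth l, n units per hidden layer."""
    return (n / d) ** (d * (l - 1)) * n ** d

def maxout_regions_bound(k, d, l):
    """Bound O(k^((l-1)+d)) for maxout networks with k filters per unit."""
    return k ** ((l - 1) + d)

# Growth with depth for d=2 inputs, n=8 units per layer, k=4 filters (illustrative)
for l in range(1, 6):
    print(l, rectifier_regions_bound(2, 8, l), maxout_regions_bound(4, 2, l))
```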
Intuition on Depth
• A deep architecture expresses a belief that the function we want to learn is a computer program consisting of m steps
  – where each step uses the previous step's output
• Intermediate outputs are not necessarily factors of variation
  – but can be analogous to counters or pointers used for organizing processing
• Empirically, greater depth results in better generalization
Empirical Results
• Deeper networks perform better
Non-chain Architectures
• Not all architectures arrange layers in a single chain; some add connections that skip from earlier layers to later ones