
Convolutional Neural Networks

Deep Learning Lecture 4

Samuel Cheng

School of ECE
University of Oklahoma

Spring, 2017

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 1 / 198


Table of Contents

1 Review

2 Babysitting your learning job

3 Overview and history of CNN

4 CNN basic

5 Case study

6 Some CNN tricks

7 Conclusions

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 2 / 198


Presentation order

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 3 / 198


Logistics

HW1 is due today


5% per day penalty (of HW1) starting tomorrow
Naim is the winner for the first HW with 3% overall bonus
As extra “bonus” to the winner, I would like him to present his
solution in class next Friday (10 ∼ 20 minutes), with emphasis on
surprises and lessons learned
No need to be comprehensive
HW1 won’t be accepted after his presentation

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 4 / 198


Review

Review

In the last class, we discussed


BP
Weight initialization
Batch normalization
Dropout
More optimization tricks
Nesterov accelerated gradient descent
RMSProp
Adam

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 5 / 198


Babysitting your learning job Debugging optimizer

Today

Left out from last lecture: some remarks on babysitting your training process
Convolutional neural network (CNN)

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 6 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Double check that the loss is reasonable:

crank up regularization

loss went up, good. (sanity check)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 75 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 7 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

The code on the slide:
- take the first 20 examples from CIFAR-10
- turn off regularization (reg = 0.0)
- use simple vanilla ‘sgd’

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 76 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 8 / 198
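A minimal, self-contained sketch of this sanity check. Synthetic data and a plain linear softmax classifier stand in for the CIFAR-10 loader and the two-layer net used in the lecture code, so the numbers are illustrative only:

```python
# Overfit-a-tiny-subset sanity check: 20 examples, no regularization, vanilla SGD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3072))          # stand-in for the first 20 CIFAR-10 images, flattened
y = rng.integers(0, 10, size=20)             # their labels
W = 0.001 * rng.standard_normal((3072, 10))  # small random init

for step in range(500):                      # plain vanilla SGD on the full tiny batch, reg = 0.0
    scores = X @ W
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(20), y]).mean()
    dscores = probs
    dscores[np.arange(20), y] -= 1
    W -= 1e-2 * (X.T @ dscores) / 20         # gradient step, no regularization term

acc = (np.argmax(X @ W, axis=1) == y).mean()
print(f"loss={loss:.4f}  train_acc={acc:.2f}")   # expect train accuracy 1.00 and a small, shrinking loss
```

If the model cannot drive the training accuracy to 1.00 on such a tiny set, something in the pipeline (data, loss, gradients, or updates) is broken.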


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data.

Result: very small loss, train accuracy 1.00, nice!
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 77 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 9 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 78 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 10 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

Loss barely changing

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 79 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 11 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low

Loss barely changing: the learning rate is probably too low

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 80 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 12 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low

Loss barely changing: the learning rate is probably too low
Notice train/val accuracy goes to 20% though, what's up with that? (Remember this is softmax: the class scores can shift enough to change the argmax, and hence the accuracy, while the loss barely moves.)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 81 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 13 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low

Okay, now let's try learning rate 1e6. What could possibly go wrong?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 82 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 14 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

cost: NaN almost always means high learning rate...
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 83 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 15 / 198


Babysitting your learning job Debugging optimizer

Debugging optimizer

Let's try to train now…

I like to start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

3e-3 is still too high. Cost explodes….
=> Rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5]
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 84 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 16 / 198


Babysitting your learning job Hyperparameter optimization

Hyperparameter optimization

Cross-validation strategy
I like to do coarse -> fine cross-validation in stages
First stage: only a few epochs to get rough idea of what params work
Second stage: longer running time, finer search
… (repeat as necessary)

Tip for detecting explosions in the solver:


If the cost is ever > 3 * original cost, break out early

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 86 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 17 / 198
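A hedged sketch of this staged search; `train_and_eval` is a placeholder for your actual training loop (not an API from the course code), and the learning-rate range is the one suggested on the previous slides:

```python
# Coarse-to-fine random hyperparameter search in log space, with the early-abort tip above.
import math, random

def train_and_eval(lr, reg, epochs):
    """Placeholder: return (list of per-epoch costs, validation accuracy)."""
    costs = [2.3 * math.exp(-lr * 100 * e) + reg for e in range(epochs)]
    return costs, random.random()

def sample_log_uniform(lo_exp, hi_exp):
    return 10 ** random.uniform(lo_exp, hi_exp)    # sample the exponent, not the raw value

best = None
for trial in range(20):                            # coarse stage: few epochs, wide ranges
    lr = sample_log_uniform(-5, -3)                # learning rate in [1e-5, 1e-3]
    reg = sample_log_uniform(-4, 0)
    costs, val_acc = train_and_eval(lr, reg, epochs=5)
    if any(c > 3 * costs[0] for c in costs):       # cost ever > 3x the original cost: give up early
        continue
    if best is None or val_acc > best[0]:
        best = (val_acc, lr, reg)
print("coarse-stage best (val_acc, lr, reg):", best)
# second stage: narrow the exponent ranges around the winners and train longer
```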


Babysitting your learning job Hyperparameter optimization

Hyperparameter optimization

For example: run coarse search for 5 epochs


note it’s best to optimize
in log space!

nice

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 87 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 18 / 198


Babysitting your learning job Hyperparameter optimization

Hyperparameter optimization

Now run finer search...


adjust range

53% - relatively good for a 2-layer neural net with 50 hidden neurons.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 88 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 19 / 198


Babysitting your learning job Hyperparameter optimization

Hyperparameter optimization

Now run finer search...


adjust range

53% - relatively good for a 2-layer neural net with 50 hidden neurons.

But this best cross-validation result is worrying. Why? (Hint: the best values sit near the edge of the searched range, so the true optimum may lie outside it.)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 89 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 20 / 198


Babysitting your learning job Hyperparameter optimization

Hyperparameter optimization

Random Search vs. Grid Search

Random Search for Hyper-Parameter Optimization


Bergstra and Bengio, 2012

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5 - 90 20 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 21 / 198


Babysitting your learning job Hyperparameter optimization

Conclusions of last lecture

BP is just chain rule in calculus


Use ReLU. Never use Sigmoid (use Tanh instead)
Input preprocessing is no longer very important
Do subtract mean
Whitening and normalizing are not much needed
Weight initialization on the other hand is extremely important for
deep networks
Use batch normalization if you can
Use dropout
Use Adam (or maybe RMSprop) for optimizer. If you don’t have
much data, can consider LBFGS
Need to babysit your learning for real-world problems
Never use grid search for tuning your hyperparameters

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 22 / 198


Overview and history of CNN

Convolutional Neural Networks

[LeNet-5, LeCun et al., 1998]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 65 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 23 / 198


Overview and history of CNN

CNN history

A bit of history:

Hubel & Wiesel

1959: RECEPTIVE FIELDS OF SINGLE NEURONES IN THE CAT'S STRIATE CORTEX

1962: RECEPTIVE FIELDS, BINOCULAR INTERACTION AND FUNCTIONAL ARCHITECTURE IN THE CAT'S VISUAL CORTEX

1968...
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 66 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 24 / 198


Overview and history of CNN

CNN history

A bit of history

Topographical mapping in the cortex:


nearby cells in cortex represented
nearby regions in the visual field

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 68 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 25 / 198


Overview and history of CNN

CNN history

Hierarchical organization

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 69 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 26 / 198


Overview and history of CNN

CNN history

A bit of history: Neocognitron [Fukushima 1980]

“sandwich” architecture (SCSCSC…)
simple cells: modifiable parameters
complex cells: perform pooling

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 70 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 27 / 198


Overview and history of CNN

CNN history

A bit of history:
Gradient-based learning
applied to document
recognition
[LeCun, Bottou, Bengio, Haffner
1998]

LeNet-5

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 71 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 28 / 198


Overview and history of CNN

CNN today

A bit of history:
ImageNet Classification with Deep
Convolutional Neural Networks
[Krizhevsky, Sutskever, Hinton, 2012]

“AlexNet”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 72 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 29 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere


Classification Retrieval

[Krizhevsky 2012]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 73 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 30 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere


Detection Segmentation

[Faster R-CNN: Ren, He, Girshick, Sun 2015] [Farabet et al., 2012]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 74 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 31 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere

NVIDIA Tegra X1

self-driving cars

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 75 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 32 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere


[Taigman et al. 2014]

[Goodfellow 2014]
[Simonyan et al. 2014]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 76 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 33 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere

[Toshev, Szegedy 2014]

[Mnih 2013]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 77 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 34 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere

[Ciresan et al. 2013] [Sermanet et al. 2011]


[Ciresan et al.]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 78 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 35 / 198


Overview and history of CNN

CNN today

Fast-forward to today: ConvNets are everywhere

[Denil et al. 2014]

[Turaga et al., 2010]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 79 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 36 / 198


Overview and history of CNN

CNN today

Whale recognition, Kaggle Challenge; [Mnih and Hinton, 2010]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 80 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 37 / 198


Overview and history of CNN

CNN today

Image
Captioning

[Vinyals et al., 2015]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 81 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 38 / 198


Overview and history of CNN

CNN today

reddit.com/r/deepdream

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 82 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 39 / 198


Overview and history of CNN

CNN today

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 83 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 40 / 198


Overview and history of CNN

CNN today

Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition
[Cadieu et al., 2014]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 6 - 85 25 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 41 / 198


CNN basic Convolution layer

Motivation of CNN

The same object under different viewpoints looks very different in the pixel domain
A slightly horizontally shifted image has changes imperceptible to us but can confuse a naive recognition system
Ideally, we may want to have shift-invariant features
In practice, if we have a local feature suitable for a particular region, the same feature should work well in other regions
Weight sharing across space → CNN

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 42 / 198


CNN basic Convolution layer

Convolution Layer
32x32x3 image

32 height

32 width
3 depth

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 10 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 43 / 198


CNN basic Convolution layer

Convolution Layer
32x32x3 image, 5x5x3 filter

Convolve the filter with the image,
i.e. “slide over the image spatially, computing dot products”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 11 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 44 / 198


CNN basic Convolution layer

Convolution Layer: filters always extend the full depth of the input volume.

32x32x3 image, 5x5x3 filter

Convolve the filter with the image,
i.e. “slide over the image spatially, computing dot products”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 12 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 45 / 198


CNN basic Convolution layer

Convolution Layer
32x32x3 image, 5x5x3 filter

1 number:
the result of taking a dot product between the filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 13 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 46 / 198
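In numpy, that one output number is literally a 75-dimensional dot product plus a bias (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))     # 32x32x3 image
w = rng.standard_normal((5, 5, 3))           # one 5x5x3 filter
b = 0.1                                      # its bias

patch = image[0:5, 0:5, :]                   # a small 5x5x3 chunk of the image
value = float(np.sum(patch * w) + b)         # 5*5*3 = 75-dimensional dot product + bias
print(value)
```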


CNN basic Convolution layer

Convolution Layer
Convolve (slide) the 5x5x3 filter over all spatial locations of the 32x32x3 image: the result is a 28x28x1 activation map.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 14 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 47 / 198


CNN basic Convolution layer

Convolution Layer: consider a second (green) 5x5x3 filter.

Convolving it over all spatial locations of the same 32x32x3 image gives a second 28x28x1 activation map.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 15 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 48 / 198


CNN basic Convolution layer

For example, if we had 6 5x5 filters, we’ll get 6 separate 28x28 activation maps from the 32x32x3 input.

We stack these up to get a “new image” of size 28x28x6!


Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 16 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 49 / 198
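A naive, loop-based numpy sketch of this stacking (real frameworks use optimized convolution routines; this is only to make the shapes concrete):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))      # 32x32x3 input
filters = rng.standard_normal((6, 5, 5, 3))   # six 5x5x3 filters
biases = np.zeros(6)

out = np.zeros((28, 28, 6))                   # stride 1, no padding: (32 - 5) + 1 = 28
for k in range(6):
    for i in range(28):
        for j in range(28):
            out[i, j, k] = np.sum(image[i:i+5, j:j+5, :] * filters[k]) + biases[k]
print(out.shape)                              # (28, 28, 6): the "new image" fed to the next layer
```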


CNN basic Convolution layer

Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.

32x32x3 → [CONV + ReLU, e.g. six 5x5x3 filters] → 28x28x6

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 17 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 50 / 198


CNN basic Convolution layer

Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions.

32x32x3 → [CONV + ReLU, e.g. six 5x5x3 filters] → 28x28x6 → [CONV + ReLU, e.g. ten 5x5x6 filters] → 24x24x10 → …

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 18 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 51 / 198


CNN basic Convolution layer

Preview [From recent Yann


LeCun slides]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 20 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 52 / 198


CNN basic Convolution layer

one filter => one activation map (example: 5x5 filters, 32 total)

We call the layer convolutional because it is related to convolution of two signals:
elementwise multiplication and sum of a filter and the signal (image)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 21 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 53 / 198


CNN basic Convolution layer

preview:

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 22 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 54 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:


Convolve (slide) the 5x5x3 filter over all spatial locations of the 32x32x3 image: the result is a 28x28x1 activation map.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 23 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 55 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 24 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 56 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 25 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 57 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 26 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 58 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 59 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

=> 5x5 output


7

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 28 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 60 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 29 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 61 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 30 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 62 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 31 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 63 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 32 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 64 / 198


CNN basic Notes on dimensions

A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

7 doesn’t fit!
cannot apply 3x3 filter on
7x7 input with stride 3.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 33 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 65 / 198


CNN basic Notes on dimensions

N
Output size:
(N - F) / stride + 1
F
e.g. N = 7, F = 3:
F N
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 34 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 66 / 198
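A quick check of the formula with the numbers on this slide:

```python
def conv_output_size(N, F, stride):
    span = N - F
    if span % stride != 0:
        return None                      # the filter placements don't fit cleanly
    return span // stride + 1

for s in (1, 2, 3):
    print(f"N=7, F=3, stride={s} -> {conv_output_size(7, 3, s)}")
# stride 1 -> 5, stride 2 -> 3, stride 3 -> None (doesn't fit)
```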


CNN basic Notes on dimensions

In practice: Common to zero pad the border


0 0 0 0 0 0
e.g. input 7x7
0 3x3 filter, applied with stride 1
0 pad with 1 pixel border => what is the output?
0

(recall:)
(N - F) / stride + 1

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 35 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 67 / 198


CNN basic Notes on dimensions

In practice: Common to zero pad the border


0 0 0 0 0 0
e.g. input 7x7
0 3x3 filter, applied with stride 1
0 pad with 1 pixel border => what is the output?
0
7x7 output!
0

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 36 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 68 / 198


CNN basic Notes on dimensions

In practice: Common to zero pad the border


0 0 0 0 0 0
e.g. input 7x7
0 3x3 filter, applied with stride 1
0 pad with 1 pixel border => what is the output?
0
7x7 output!
0
in general, common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with
(F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 37 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 69 / 198
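The same bookkeeping with zero padding included; choosing P = (F - 1)/2 at stride 1 preserves the spatial size, as stated above:

```python
def conv_output_size(N, F, stride=1, pad=0):
    return (N - F + 2 * pad) // stride + 1   # assumes the arithmetic divides evenly

for F in (3, 5, 7):
    P = (F - 1) // 2
    out = conv_output_size(7, F, 1, P)
    print(f"F={F}, pad={P}: 7x7 input -> {out}x{out} output")
# F=3 pad 1, F=5 pad 2, F=7 pad 3 all give a 7x7 output
```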


CNN basic Notes on dimensions

Remember back to…

E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn’t work well.

32x32x3 → [CONV + ReLU, six 5x5x3 filters] → 28x28x6 → [CONV + ReLU, ten 5x5x6 filters] → 24x24x10 → …
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 38 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 70 / 198


CNN basic Notes on dimensions

Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Output volume size: ?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 39 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 71 / 198


CNN basic Notes on dimensions

Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Output volume size:


(32+2*2-5)/1+1 = 32 spatially, so
32x32x10

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 40 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 72 / 198


CNN basic Notes on dimensions

Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 41 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 73 / 198


CNN basic Notes on dimensions

Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?


each filter has 5*5*3 + 1 = 76 params (+1 for bias)
=> 76*10 = 760

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 42 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 74 / 198
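Putting both calculations together for this example (output volume and parameter count):

```python
def conv_layer_stats(N, C_in, F, K, stride, pad):
    out = (N - F + 2 * pad) // stride + 1
    params = K * (F * F * C_in + 1)          # +1 bias per filter
    return (out, out, K), params

shape, params = conv_layer_stats(N=32, C_in=3, F=5, K=10, stride=1, pad=2)
print(shape, params)                         # (32, 32, 10) and 760 parameters
```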


CNN basic Notes on dimensions

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 43 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 75 / 198


CNN basic Notes on dimensions

Common settings:

K = (powers of 2, e.g. 32, 64, 128, 512)


- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 44 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 76 / 198


CNN basic Notes on dimensions

(btw, 1x1 convolution layers make perfect sense)

56x56x64 input → 1x1 CONV with 32 filters → 56x56x32 output
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 45 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 77 / 198


CNN basic Pooling layers and fully connected layers

Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 54 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 78 / 198


CNN basic Pooling layers and fully connected layers

MAX POOLING

Single depth slice (x, y spatial axes):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 55 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 79 / 198
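The same example in numpy (the reshape trick assumes the spatial size divides evenly into 2x2 blocks):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))   # 2x2 max pool, stride 2
print(pooled)
# [[6 8]
#  [3 4]]
```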


CNN basic Pooling layers and fully connected layers

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 56 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 80 / 198


CNN basic Pooling layers and fully connected layers

Common settings:

F = 2, S = 2
F = 3, S = 2

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 57 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 81 / 198


CNN basic Pooling layers and fully connected layers

Fully Connected Layer (FC layer)


- Contains neurons that connect to the entire input volume, as in ordinary Neural
Networks

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 58 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 82 / 198


Case study LeNet

Demo

ConvNetJS cifar10 demo

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 83 / 198


Case study LeNet

Case Study: LeNet-5


[LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1


Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 60 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 84 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 61 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 85 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]

Q: What is the total number of parameters in this layer?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 62 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 86 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]
Parameters: (11*11*3)*96 = 35K

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 63 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 87 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 64 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 88 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96

Q: what is the number of parameters in this layer?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 65 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 89 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96
Parameters: 0!

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 66 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 90 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96
After POOL1: 27x27x96
...

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 67 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 91 / 198


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 68 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 92 / 198
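A small script that re-derives the spatial sizes listed above from the filter/stride/pad settings (just a sanity check of the arithmetic, not AlexNet code):

```python
def conv(n, f, s, p): return (n - f + 2 * p) // s + 1
def pool(n, f, s):    return (n - f) // s + 1

n = 227
n = conv(n, 11, 4, 0); print("CONV1:", n)   # 55
n = pool(n, 3, 2);     print("POOL1:", n)   # 27
n = conv(n, 5, 1, 2);  print("CONV2:", n)   # 27
n = pool(n, 3, 2);     print("POOL2:", n)   # 13
n = conv(n, 3, 1, 1);  print("CONV3:", n)   # 13
n = conv(n, 3, 1, 1);  print("CONV4:", n)   # 13
n = conv(n, 3, 1, 1);  print("CONV5:", n)   # 13
n = pool(n, 3, 2);     print("POOL3:", n)   # 6
```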


Case study AlexNet

Case Study: AlexNet


[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 69 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 93 / 198


Case study ZFNet

Case Study: ZFNet [Zeiler and Fergus, 2013]

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 15.4% -> 14.8%

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 70 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 94 / 198


Case study VGGNet

Case Study: VGGNet


[Simonyan and Zisserman, 2014]

Only 3x3 CONV stride 1, pad 1


and 2x2 MAX POOL stride 2

best model

11.2% top 5 error in ILSVRC 2013


->
7.3% top 5 error

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 71 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 95 / 198


Case study VGGNet

INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)


CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 72 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 96 / 198


Case study VGGNet

INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)


CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 73 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 97 / 198
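A spot check of a few of the parameter counts in the table (biases omitted, matching the table):

```python
def conv_params(f, c_in, c_out): return f * f * c_in * c_out
def fc_params(n_in, n_out):      return n_in * n_out

print(conv_params(3, 3, 64))         # first CONV3-64:  1,728
print(conv_params(3, 512, 512))      # a CONV3-512:     2,359,296
print(fc_params(7 * 7 * 512, 4096))  # first FC:        102,760,448
```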


Case study VGGNet

INPUT: [224x224x3] memory: 224*224*3=150K params: 0 (not counting biases)


Note: most memory is in the early CONV layers; most params are in the late FC layers.

CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 74 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 98 / 198


Case study GoogLeNet

Case Study: GoogLeNet [Szegedy et al., 2014]

Inception module

ILSVRC 2014 winner (6.7% top 5 error)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 75 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 99 / 198


Case study GoogLeNet

Slides from Fisher Yu

Schematic view (naive version): the previous layer feeds parallel 1x1, 3x3, and 5x5 convolutions (each with some number of filters), and their outputs are concatenated along the filter dimension.

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 100 / 198


Case study GoogLeNet

Slides from Fisher Yu

Naive idea:

Previous layer → [1x1 convolutions | 3x3 convolutions | 5x5 convolutions] → filter concatenation

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 101 / 198


Case study GoogLeNet

Slides from Fisher Yu

Naive idea (does not work!):

Previous layer → [1x1 convolutions | 3x3 convolutions | 5x5 convolutions | 3x3 max pooling] → filter concatenation

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 102 / 198


Case study GoogLeNet

Slides from Fisher Yu

Inception module:

Previous layer → [1x1 convolutions | 1x1 convolutions → 3x3 convolutions | 1x1 convolutions → 5x5 convolutions | 3x3 max pooling → 1x1 convolutions] → filter concatenation

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 103 / 198


Case study GoogLeNet

Case Study: GoogLeNet


Fun features:

- Only 5 million params!


(Removes FC layers
completely)

Compared to AlexNet:
- 12X less params
- 2x more compute
- 6.67% (vs. 16.4%)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 76 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 104 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]


ILSVRC 2015 winner (3.6% top 5 error)

Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 77 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 105 / 198


Case study ResNet

(slide from Kaiming He’s recent presentation)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 78 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 106 / 198


Case study ResNet

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 79 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 107 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]


ILSVRC 2015 winner (3.6% top 5 error)

2-3 weeks of training


on 8 GPU machine

at runtime: faster
than a VGGNet!
(even though it has
8x more layers)

(slide from Kaiming He’s recent presentation)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 80 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 108 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]

224x224x3 input; the spatial dimension is quickly reduced to only 56x56!

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 81 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 109 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 82 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 110 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]

- Batch Normalization after every CONV layer


- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 83 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 111 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 84 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 112 / 198


Case study ResNet

Case Study: ResNet [He et al., 2015]

(this trick is also used in GoogLeNet)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 85 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 113 / 198


Case study Policy net in AlphaGo

Case Study Bonus: DeepMind’s AlphaGo

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 87 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 114 / 198


Case study Policy net in AlphaGo

policy network:
[19x19x48] Input
CONV1: 192 5x5 filters , stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 88 27 Jan 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 115 / 198


Some CNN tricks

Some CNN tricks

Data augmentation
Transfer learning
Use of small filters
Implementing CNN efficiently
Use of GPUs
About floating point precision

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 116 / 198


Some CNN tricks Data augmentation

Data Augmentation
“cat”
Load image
and label
Compute
loss
CNN

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 12 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 117 / 198


Some CNN tricks Data augmentation

Data Augmentation
“cat”
Load image
and label
Compute
loss
CNN

Transform image

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 13 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 118 / 198


Some CNN tricks Data augmentation

Data Augmentation

- Change the pixels without changing the label
- Train on transformed data
- VERY widely used
(figure: the image vs. “What the computer sees”)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 14 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 119 / 198


Some CNN tricks Data augmentation

Data Augmentation
1. Horizontal flips

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 15 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 120 / 198


Some CNN tricks Data augmentation

Data Augmentation
2. Random crops/scales
Training: sample random crops / scales

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 16 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 121 / 198


Some CNN tricks Data augmentation

Data Augmentation
2. Random crops/scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 122 / 198


Some CNN tricks Data augmentation

Data Augmentation
2. Random crops/scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Testing: average a fixed set of crops

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 18 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 123 / 198


Some CNN tricks Data augmentation

Data Augmentation
2. Random crops/scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Testing: average a fixed set of crops


ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 19 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 124 / 198
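A rough, hedged sketch of that training-time recipe using numpy and scipy's `zoom` for resizing (illustrative only, not the actual ResNet pipeline; the added horizontal flip is the companion trick from the previous slides):

```python
import numpy as np
from scipy.ndimage import zoom

def augment(img, rng):
    """img: H x W x 3 float array."""
    L = int(rng.integers(256, 481))              # 1. pick random L in [256, 480]
    h, w, _ = img.shape
    s = L / min(h, w)
    img = zoom(img, (s, s, 1), order=1)          # 2. resize so the short side ~= L
    h, w, _ = img.shape
    i = int(rng.integers(0, h - 224 + 1))        # 3. sample a random 224 x 224 patch
    j = int(rng.integers(0, w - 224 + 1))
    patch = img[i:i + 224, j:j + 224, :]
    if rng.random() < 0.5:                       # random horizontal flip
        patch = patch[:, ::-1, :]
    return patch

rng = np.random.default_rng(0)
print(augment(np.zeros((300, 400, 3)), rng).shape)   # (224, 224, 3)
```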


Some CNN tricks Data augmentation

Data Augmentation
3. Color jitter
Simple:
Randomly jitter contrast

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 20 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 125 / 198


Some CNN tricks Data augmentation

Data Augmentation
3. Color jitter

Simple: randomly jitter contrast

Complex (as seen in [Krizhevsky et al. 2012], ResNet, etc.):
1. Apply PCA to all [R, G, B] pixels in the training set
2. Sample a “color offset” along the principal component directions
3. Add the offset to all pixels of a training image

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 21 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 126 / 198
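A toy sketch of the PCA-based color jitter described above; random pixels stand in for the training set, and the 0.1 jitter scale is an assumption, not a value from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.random((10000, 3))                 # stand-in for all [R, G, B] pixels in the training set

mean = pixels.mean(axis=0)
cov = np.cov(pixels - mean, rowvar=False)       # 1. PCA over the RGB values
eigvals, eigvecs = np.linalg.eigh(cov)

alpha = rng.normal(0, 0.1, size=3)              # 2. sample a "color offset" along the principal components
offset = eigvecs @ (alpha * eigvals)

image = rng.random((32, 32, 3))
jittered = image + offset                       # 3. add the same offset to every pixel of the image
print(jittered.shape)
```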


Some CNN tricks Data augmentation

Data Augmentation
4. Get creative!

Random mix/combinations of :
- translation
- rotation
- stretching
- shearing,
- lens distortions, … (go crazy)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 22 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 127 / 198


Some CNN tricks Data augmentation

Data Augmentation: Takeaway

● Simple to implement, use it


● Especially useful for small datasets
● Fits into framework of noise / marginalization

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 24 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 128 / 198


Some CNN tricks Transfer learning

Don’t necessarily need lots of data for a CNN

Transfer Learning with CNNs


1. Train on
Imagenet

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 27 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 129 / 198


Some CNN tricks Transfer learning

Don’t necessarily need lots of data for a CNN

Transfer Learning with CNNs


1. Train on Imagenet
2. Small dataset: use as a feature extractor
   Freeze these (the earlier layers); train this (the new top classifier layer)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 28 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 130 / 198


Some CNN tricks Transfer learning

Don’t necessarily need lots of data for a CNN

Transfer Learning with CNNs


1. Train on Imagenet
2. Small dataset: feature extractor (freeze these earlier layers, train this top layer)
3. Medium dataset: finetuning
   more data = retrain more of the network (or all of it)
   (freeze the earlier layers, train the later ones)
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 29 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 131 / 198
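A hedged PyTorch sketch of these two regimes; torchvision's ResNet-18 is used purely as an example backbone, and the attribute names (`fc`, `layer4`) are torchvision-specific:

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)            # 1. a network pretrained on ImageNet

for p in model.parameters():                        # 2. small dataset: freeze these ...
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)      # ... and train only this new top classifier

for p in model.layer4.parameters():                 # 3. medium dataset: also unfreeze (finetune)
    p.requires_grad = True                          #    the last stage; more data = unfreeze more
```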


Some CNN tricks Transfer learning

CNN Features off-the-shelf: an Astounding Baseline for Recognition


[Razavian et al, 2014]

DeCAF: A Deep
Convolutional Activation
Feature for Generic Visual
Recognition
[Donahue*, Jia*, et al.,
2013]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 31 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 132 / 198


Some CNN tricks Transfer learning

(features from earlier layers: more generic; later layers: more specific)

very little data + very similar dataset: ?
very little data + very different dataset: ?
quite a lot of data + very similar dataset: ?
quite a lot of data + very different dataset: ?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 32 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 133 / 198


Some CNN tricks Transfer learning

(features from earlier layers: more generic; later layers: more specific)

very little data + very similar dataset: use a linear classifier on the top layer
very little data + very different dataset: ?
quite a lot of data + very similar dataset: finetune a few layers
quite a lot of data + very different dataset: ?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 33 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 134 / 198


Some CNN tricks Transfer learning

(features from earlier layers: more generic; later layers: more specific)

very little data + very similar dataset: use a linear classifier on the top layer
very little data + very different dataset: you’re in trouble… try a linear classifier from different stages
quite a lot of data + very similar dataset: finetune a few layers
quite a lot of data + very different dataset: finetune a larger number of layers

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 34 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 135 / 198


Some CNN tricks Transfer learning

Takeaway for your projects/beyond:


Have some dataset of interest but it has < ~1M images?

1. Find a very large dataset that has similar data, train a


big ConvNet there.
2. Transfer learn to your dataset

Caffe ConvNet library has a “Model Zoo” of pretrained models:


https://github.com/BVLC/caffe/wiki/Model-Zoo

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 38 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 136 / 198


Some CNN tricks Small filters

The power of small filters

Suppose we stack two 3x3 conv layers (stride 1)


Each neuron sees 3x3 region of previous activation map

Input First Conv Second Conv

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 41 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 137 / 198


Some CNN tricks Small filters

The power of small filters

Question: How big of a region in the input does a neuron on the


second conv layer see?

Input First Conv Second Conv

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 42 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 138 / 198


Some CNN tricks Small filters

The power of small filters

Question: How big of a region in the input does a neuron on the


second conv layer see?
Answer: 5 x 5

Input First Conv Second Conv

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 43 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 139 / 198


Some CNN tricks Small filters

The power of small filters


Question: If we stack three 3x3 conv layers, how big of an input
region does a neuron in the third layer see?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 44 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 140 / 198


Some CNN tricks Small filters

The power of small filters


Question: If we stack three 3x3 conv layers, how big of an input
region does a neuron in the third layer see?

Answer: 7 x 7

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 45 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 141 / 198


Some CNN tricks Small filters

The power of small filters


Question: If we stack three 3x3 conv layers, how big of an input
region does a neuron in the third layer see?

Answer: 7 x 7

Three 3 x 3 conv layers give similar representational power as a single
7 x 7 convolution

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 46 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 142 / 198
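A tiny helper to check this claim: with stride 1, each k x k conv layer grows the receptive field by (k - 1).

def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1            # each stride-1 k x k layer adds (k - 1)
    return rf

print(receptive_field([3, 3]))     # 5 -> two 3x3 layers see a 5x5 input region
print(receptive_field([3, 3, 3]))  # 7 -> three 3x3 layers match a single 7x7 conv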


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 47 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 143 / 198


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)
one CONV with 7 x 7 filters:
  Number of weights = ?
three CONV with 3 x 3 filters:
  Number of weights = ?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 48 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 144 / 198


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)
one CONV with 7 x 7 filters:
  Number of weights = C x (7 x 7 x C) = 49 C^2
three CONV with 3 x 3 filters:
  Number of weights = 3 x C x (3 x 3 x C) = 27 C^2

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 49 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 145 / 198


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)
one CONV with 7 x 7 filters:
  Number of weights = C x (7 x 7 x C) = 49 C^2
three CONV with 3 x 3 filters:
  Number of weights = 3 x C x (3 x 3 x C) = 27 C^2

Fewer parameters, more nonlinearity = GOOD

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 50 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 146 / 198


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)
one CONV with 7 x 7 filters:
  Number of weights = C x (7 x 7 x C) = 49 C^2
  Number of multiply-adds = ?
three CONV with 3 x 3 filters:
  Number of weights = 3 x C x (3 x 3 x C) = 27 C^2
  Number of multiply-adds = ?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 51 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 147 / 198


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)
one CONV with 7 x 7 filters:
  Number of weights = C x (7 x 7 x C) = 49 C^2
  Number of multiply-adds = (H x W x C) x (7 x 7 x C) = 49 HWC^2
three CONV with 3 x 3 filters:
  Number of weights = 3 x C x (3 x 3 x C) = 27 C^2
  Number of multiply-adds = 3 x (H x W x C) x (3 x 3 x C) = 27 HWC^2

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 52 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 148 / 198


Some CNN tricks Small filters

The power of small filters


Suppose input is H x W x C and we use convolutions with C filters
to preserve depth (stride 1, padding to preserve H, W)
one CONV with 7 x 7 filters:
  Number of weights = C x (7 x 7 x C) = 49 C^2
  Number of multiply-adds = 49 HWC^2
three CONV with 3 x 3 filters:
  Number of weights = 3 x C x (3 x 3 x C) = 27 C^2
  Number of multiply-adds = 27 HWC^2

Less compute, more nonlinearity = GOOD

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 53 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 149 / 198
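A quick check of these weight and multiply-add counts (stride 1, padding preserves H and W, biases ignored); the 56 x 56 x 64 feature-map size is just an example value, not from the slides.

H, W, C = 56, 56, 64

def conv_cost(num_layers, k, H, W, C):
    weights = num_layers * C * (k * k * C)           # C filters of size k x k x C per layer
    madds = num_layers * (H * W * C) * (k * k * C)   # one k*k*C dot product per output value
    return weights, madds

print(conv_cost(1, 7, H, W, C))   # (49 C^2, 49 HWC^2)
print(conv_cost(3, 3, H, W, C))   # (27 C^2, 27 HWC^2)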


Some CNN tricks Small filters

The power of small filters

Why stop at 3 x 3 filters? Why not try 1 x 1?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 54 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 150 / 198


Some CNN tricks Small filters

The power of small filters

Why stop at 3 x 3 filters? Why not try 1 x 1?


1. “bottleneck” 1 x 1 conv to reduce dimension:

   H x W x C
   -> Conv 1x1, C/2 filters
   -> H x W x (C/2)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 55 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 151 / 198


Some CNN tricks Small filters

The power of small filters

Why stop at 3 x 3 filters? Why not try 1 x 1?


1. “bottleneck” 1 x 1 conv to reduce dimension
2. 3 x 3 conv at reduced dimension

   H x W x C
   -> Conv 1x1, C/2 filters
   -> H x W x (C/2)
   -> Conv 3x3, C/2 filters
   -> H x W x (C/2)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 56 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 152 / 198


Some CNN tricks Small filters

The power of small filters

Why stop at 3 x 3 filters? Why not try 1 x 1?


1. “bottleneck” 1 x 1 conv to reduce dimension
2. 3 x 3 conv at reduced dimension
3. Restore dimension with another 1 x 1 conv

   H x W x C
   -> Conv 1x1, C/2 filters
   -> H x W x (C/2)
   -> Conv 3x3, C/2 filters
   -> H x W x (C/2)
   -> Conv 1x1, C filters
   -> H x W x C

[Seen in Lin et al, “Network in Network”, GoogLeNet, ResNet]
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 57 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 153 / 198


Some CNN tricks Small filters

The power of small filters

Why stop at 3 x 3 filters? Why not try 1 x 1?

Bottleneck sandwich:                 Single 3 x 3 conv:
  H x W x C                            H x W x C
  -> Conv 1x1, C/2 filters             -> Conv 3x3, C filters
  -> H x W x (C/2)                     -> H x W x C
  -> Conv 3x3, C/2 filters
  -> H x W x (C/2)
  -> Conv 1x1, C filters
  -> H x W x C
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 58 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 154 / 198


Some CNN tricks Small filters

The power of small filters

Why stop at 3 x 3 filters? Why not try 1 x 1?

More nonlinearity, fewer params, less compute!

Bottleneck sandwich (3.25 C^2 parameters):    Single 3 x 3 conv (9 C^2 parameters):
  H x W x C                                     H x W x C
  -> Conv 1x1, C/2 filters                      -> Conv 3x3, C filters
  -> H x W x (C/2)                              -> H x W x C
  -> Conv 3x3, C/2 filters
  -> H x W x (C/2)
  -> Conv 1x1, C filters
  -> H x W x C
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 59 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 155 / 198
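The 3.25 C^2 figure follows directly from counting weights in each stage; a short check (biases ignored, C = 256 is just an example value):

C = 256

bottleneck = (C * (C // 2)                   # 1x1 conv, C -> C/2
              + (C // 2) * 3 * 3 * (C // 2)  # 3x3 conv at reduced depth
              + (C // 2) * C)                # 1x1 conv, C/2 -> C
single = C * 3 * 3 * C                       # one 3x3 conv at full depth

print(bottleneck / C**2, single / C**2)      # 3.25 vs 9.0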


Some CNN tricks Small filters

The power of small filters

Still using 3 x 3 filters … can we break it up?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 60 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 156 / 198


Some CNN tricks Small filters

The power of small filters

Still using 3 x 3 filters … can we break it up?

  H x W x C
  -> Conv 1x3, C filters
  -> H x W x C
  -> Conv 3x1, C filters
  -> H x W x C

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 61 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 157 / 198


Some CNN tricks Small filters

The power of small filters

Still using 3 x 3 filters … can we break it up?

More nonlinearity, fewer params, less compute!

Factored (6 C^2 parameters):          Single 3 x 3 conv (9 C^2 parameters):
  H x W x C                             H x W x C
  -> Conv 1x3, C filters                -> Conv 3x3, C filters
  -> H x W x C                          -> H x W x C
  -> Conv 3x1, C filters
  -> H x W x C

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 62 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 158 / 198
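The same bookkeeping for the 1x3 / 3x1 factorization (biases ignored, C = 256 is again just an example):

C = 256
factored = C * (1 * 3 * C) + C * (3 * 1 * C)   # 3 C^2 + 3 C^2 = 6 C^2
full = C * (3 * 3 * C)                         # 9 C^2
print(factored / C**2, full / C**2)            # 6.0 vs 9.0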


Some CNN tricks Small filters

The power of small filters

Latest version of GoogLeNet incorporates all these ideas

Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 63 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 159 / 198


Some CNN tricks Small filters

How to stack convolutions: Recap


● Replace large convolutions (5 x 5, 7 x 7) with stacks of
3 x 3 convolutions
● 1 x 1 “bottleneck” convolutions are very efficient
● Can factor N x N convolutions into 1 x N and N x 1
● All of the above give fewer parameters, less compute,
more nonlinearity

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 64 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 160 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col

There are highly optimized matrix multiplication routines


for just about every platform

Can we turn convolution into matrix multiplication?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 66 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 161 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col


Feature map: H x W x C Conv weights: D filters, each K x K x C

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 162 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col


Feature map: H x W x C Conv weights: D filters, each K x K x C

Reshape each K x K x C receptive field to a column with K^2 C elements

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 163 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col


Feature map: H x W x C Conv weights: D filters, each K x K x C

Repeat for all N receptive field locations to get a (K^2 C) x N matrix

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 164 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col


Feature map: H x W x C Conv weights: D filters, each K x K x C

Repeat for all N receptive field locations to get a (K^2 C) x N matrix

Elements appearing in multiple receptive fields are duplicated;
this uses a lot of memory

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 165 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col


Feature map: H x W x C Conv weights: D filters, each K x K x C

Reshape each filter to a row of K^2 C elements, making a D x (K^2 C) matrix
(alongside the (K^2 C) x N data matrix)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 166 / 198


Some CNN tricks Im2col

Implementing Convolutions: im2col


Feature map: H x W x C Conv weights: D filters, each K x K x C

Matrix multiply the D x (K^2 C) filter matrix with the (K^2 C) x N data matrix:
D x N result; reshape to the output tensor

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 167 / 198
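A naive numpy sketch of the im2col idea (stride 1, no padding); real implementations are far more careful about memory layout, but the structure is the same: gather columns, then one big matrix multiply.

import numpy as np

def conv_im2col(x, w):
    # x: (H, W, C) feature map, w: (D, K, K, C) filter bank
    H, W, C = x.shape
    D, K, _, _ = w.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((K * K * C, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            # each K x K x C receptive field becomes one column of length K^2 C
            cols[:, i * out_w + j] = x[i:i+K, j:j+K, :].reshape(-1)
    W_mat = w.reshape(D, -1)              # each filter becomes one row of length K^2 C
    out = W_mat @ cols                    # single (D x K^2C) @ (K^2C x N) matrix multiply
    return out.reshape(D, out_h, out_w)   # reshape D x N result to the output tensor

y = conv_im2col(np.random.randn(8, 8, 3), np.random.randn(4, 3, 3, 3))
print(y.shape)   # (4, 6, 6)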


Some CNN tricks Im2col

Case study:
CONV forward in Caffe
library

im2col

matrix multiply: call to cuBLAS

bias offset

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 73 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 168 / 198


Some CNN tricks FFT

Implementing convolutions: FFT


Convolution Theorem: The convolution of f and g is equal to the
elementwise product of their Fourier Transforms: F(f ∗ g) = F(f) ○ F(g)

Using the Fast Fourier Transform, we can compute the


Discrete Fourier Transform of an N-dimensional vector in O(N log N) time
(also extends to 2D images)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 75 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 169 / 198


Some CNN tricks FFT

Implementing convolutions: FFT


1. Compute FFT of weights: F(W)

2. Compute FFT of image: F(X)

3. Compute elementwise product: F(W) ○ F(X)

4. Compute inverse FFT: Y = F^-1(F(W) ○ F(X))

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 76 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 170 / 198
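A single-channel numpy sketch of these four steps. The filter is zero-padded to the image size, so the elementwise product corresponds to circular convolution; a real layer would crop to the "valid" region and sum over input channels.

import numpy as np

def fft_conv2d(x, w):
    H, W = x.shape
    Fx = np.fft.fft2(x)
    Fw = np.fft.fft2(w, s=(H, W))          # zero-pad the filter to the image size
    return np.real(np.fft.ifft2(Fx * Fw))  # Y = F^-1( F(W) ○ F(X) )

y = fft_conv2d(np.random.randn(32, 32), np.random.randn(7, 7))
print(y.shape)   # (32, 32): circular convolution of image and filter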


Some CNN tricks FFT

Implementing convolutions: FFT

FFT convolutions get a big speedup for larger filters


Not much speedup for 3x3 filters =(

Vasilache et al, Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 77 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 171 / 198


Some CNN tricks “Strassen-like” algorithms

Implementing convolution: “Fast Algorithms”


Naive matrix multiplication: computing the product of two N x N matrices
takes O(N^3) operations
Strassen’s Algorithm: use clever arithmetic to reduce the complexity to
O(N^log2(7)) ≈ O(N^2.81)

From Wikipedia

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 78 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 172 / 198


Some CNN tricks “Strassen-like” algorithms

Implementing convolution: “Fast Algorithms”


Similar cleverness can be applied to convolutions

Lavin and Gray (2015) work out special cases for 3x3
convolutions:

Lavin and Gray, “Fast Algorithms for Convolutional Neural Networks”, 2015

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 79 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 173 / 198


Some CNN tricks “Strassen-like” algorithms

Implementing convolution: “Fast Algorithms”


Huge speedups on VGG for small batches:

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 80 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 174 / 198


Some CNN tricks “Strassen-like” algorithms

Computing Convolutions: Recap

● im2col: Easy to implement, but big memory overhead

● FFT: Big speedups for small kernels

● “Fast Algorithms” seem promising, not widely used yet

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 81 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 175 / 198


Some CNN tricks Use of GPUs

CEO of NVIDIA:
Jen-Hsun Huang

(Stanford EE Masters
1992)

GTC 2015:
Introduced new Titan X
GPU by bragging about
AlexNet benchmarks

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 90 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 176 / 198


Some CNN tricks Use of GPUs

CPU
Few, fast cores (1 - 16)
Good at sequential processing

GPU
Many, slower cores (thousands)
Originally for graphics
Good at parallel computation

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 91 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 177 / 198


Some CNN tricks Use of GPUs

GPUs can be programmed


● CUDA (NVIDIA only)
○ Write C code that runs directly on the GPU
○ Higher-level APIs: cuBLAS, cuFFT, cuDNN, etc
● OpenCL
○ Similar to CUDA, but runs on anything
○ Usually slower :(
● Udacity: Intro to Parallel Programming https://www.udacity.com/course/cs344
○ For deep learning just use existing libraries

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 92 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 178 / 198


Some CNN tricks Use of GPUs

GPUs are really good at matrix multiplication:

GPU: NVIDIA Tesla K40 with cuBLAS
CPU: Intel E5-2697 v2, 12 cores @ 2.7 GHz, with MKL

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 93 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 179 / 198


Some CNN tricks Use of GPUs

GPUs are really good at convolution (cuDNN):

All comparisons are against a 12-core Intel E5-2679v2 CPU @ 2.4GHz running Caffe with Intel MKL 11.1.3.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 94 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 180 / 198


Some CNN tricks Use of GPUs

Even with GPUs, training can be slow


VGG: ~2-3 weeks training with 4 GPUs
ResNet 101: 2-3 weeks with 4 GPUs

NVIDIA Titan Blacks


~$1K each

ResNet reimplemented in Torch: http://torch.ch/blog/2016/02/04/resnets.html

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 95 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 181 / 198


Some CNN tricks Use of GPUs

Multi-GPU training: More complex

Alex Krizhevsky, “One weird trick for parallelizing convolutional neural networks”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 96 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 182 / 198


Some CNN tricks Use of GPUs

Google: Distributed CPU training

Data parallelism

[Large Scale Distributed Deep Networks, Jeff Dean et al., 2013]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 97 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 183 / 198


Some CNN tricks Use of GPUs

Google: Distributed CPU training

Data parallelism
Model parallelism

[Large Scale Distributed Deep Networks, Jeff Dean et al., 2013]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 98 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 184 / 198


Some CNN tricks Use of GPUs

Google: Synchronous vs Async

Abadi et al, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 99 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 185 / 198


Some CNN tricks Use of GPUs

Bottlenecks
to be aware of

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 100 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 186 / 198


Some CNN tricks Use of GPUs

GPU - CPU communication is a bottleneck.

=> Run a CPU data prefetch + augment thread
   while the GPU performs the forward/backward pass

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 101 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 187 / 198
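A minimal sketch of this overlap in Python: a background thread keeps a small queue of preprocessed batches full while the main loop trains. The load_and_augment_batch and train_step functions here are stand-ins, not any framework's API.

import threading, queue, time

def load_and_augment_batch():      # stand-in for CPU-side disk read + augmentation
    time.sleep(0.01)
    return "batch"

def train_step(batch):             # stand-in for the GPU forward/backward pass
    time.sleep(0.01)

batch_queue = queue.Queue(maxsize=4)

def prefetch_worker():
    while True:
        batch_queue.put(load_and_augment_batch())   # runs on the CPU in the background

threading.Thread(target=prefetch_worker, daemon=True).start()

for step in range(100):
    batch = batch_queue.get()      # usually ready, so the GPU rarely waits for data
    train_step(batch)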


Some CNN tricks Use of GPUs

Moving parts lol

CPU - disk bottleneck

Hard disk is slow to read from

=> Store pre-processed images contiguously in files, read as a
   raw byte stream from an SSD

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 102 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 188 / 198


Some CNN tricks Use of GPUs

GPU memory bottleneck


Titan X: 12 GB <- currently the max
GTX 980 Ti: 6 GB

e.g.
AlexNet: ~3GB needed with batch size 256

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 103 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 189 / 198
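A back-of-the-envelope way to see where that memory goes, assuming float32 activations (4 bytes per value); the 1.5M values per image is an illustrative number, not a measured one.

def activation_memory_gb(batch_size, values_per_image, bytes_per_value=4):
    # activations only; weights, gradients, and optimizer state come on top
    return batch_size * values_per_image * bytes_per_value / 1024**3

print(activation_memory_gb(256, 1.5e6))   # ~1.4 GB for a batch of 256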


Some CNN tricks About floating point precision

Floating point precision


● 64 bit “double” precision is default
in a lot of programming

● 32 bit “single” precision is typically used for CNNs for performance

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 105 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 190 / 198


Some CNN tricks About floating point precision

Floating point precision


● 64 bit “double” precision is default
in a lot of programming

● 32 bit “single” precision is typically used for CNNs for performance
○ Including cs231n homework!

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 106 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 191 / 198


Some CNN tricks About floating point precision

Floating point precision

Benchmarks on Titan X, from https://github.com/soumith/convnet-benchmarks

Prediction: 16 bit “half” precision will be the new standard
● Already supported in cuDNN
● Nervana fp16 kernels are the fastest right now
● Hardware support in next-gen NVIDIA cards (Pascal)
● Not yet supported in torch =(

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 107 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 192 / 198


Some CNN tricks About floating point precision

Floating point precision


How low can we go?

Gupta et al, 2015:


Train with 16-bit fixed point with stochastic rounding

CNNs on MNIST
Gupta et al, “Deep Learning with Limited Numerical Precision”, ICML 2015
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 108 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 193 / 198
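A minimal numpy sketch of stochastic rounding to a fixed-point grid (the idea behind Gupta et al.; the 8 fractional bits are an arbitrary choice here): a value is rounded up with probability equal to its fractional part, so the rounding is unbiased in expectation.

import numpy as np

def stochastic_round(x, frac_bits=8):
    scale = 2 ** frac_bits
    scaled = x * scale
    floor = np.floor(scaled)
    prob_up = scaled - floor                              # fractional part in [0, 1)
    rounded = floor + (np.random.rand(*x.shape) < prob_up)  # round up with that probability
    return rounded / scale

w = np.random.randn(3, 3).astype(np.float32)
print(stochastic_round(w))    # E[stochastic_round(x)] == x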


Some CNN tricks About floating point precision

Floating point precision


How low can we go?

Courbariaux et al, 2015:


Train with 10-bit activations, 12-bit parameter updates

Courbariaux et al, “Training Deep Neural Networks with Low Precision Multiplications”, ICLR 2015

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 109 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 194 / 198


Some CNN tricks About floating point precision

Floating point precision


How low can we go?

Courbariaux and Bengio, February 9 2016:


● Train with 1-bit activations and weights!
● All activations and weights are +1 or -1
● Fast multiplication with bitwise XNOR
● (Gradients use higher precision)

Courbariaux et al, “BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”, arXiv 2016

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 110 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 195 / 198
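The forward-pass binarization itself is just a sign function, as in this small sketch (the higher-precision gradient path, typically a straight-through estimator, is not shown):

import numpy as np

def binarize(x):
    return np.where(x >= 0, 1.0, -1.0)   # every value becomes +1 or -1

W = np.random.randn(4, 4)
print(binarize(W))   # products of +/-1 values can be computed with bitwise XNOR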


Some CNN tricks About floating point precision

Implementation details: Recap


● GPUs much faster than CPUs
● Distributed training is sometimes used
○ Not needed for small problems
● Be aware of bottlenecks: CPU / GPU, CPU / disk
● Low precision makes things faster and still works
○ 32 bit is standard now, 16 bit soon
○ In the future: binary nets?

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 11 - 111 17 Feb 2016

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 196 / 198


Conclusions

Conclusions

“Classic” CNNs are composed of conv layers, pooling layers, and
fully connected layers
  Date back to LeNet-5 by Yann LeCun in the 90’s
  But have gained lots of attention since AlexNet in 2012
Widely used tricks
  Data augmentation
  Transfer learning
  Use of GPUs
Some recent trends
  Small filter decomposition
  Filter output cascading (GoogLeNet)
  Fast conv layers with “Strassen-like” algorithms
  Use of lower and lower floating-point precision formats

S. Cheng (OU-Tulsa) Convolutional Neural Networks Jan 2017 197 / 198

