ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Motivation
Classification goals:
•Make 1 guess about the label (Top-1 error)
•Make 5 guesses about the label (Top-5 error)
•No bounding box required
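As an illustration (not part of the original slides), a minimal NumPy sketch of how top-1 and top-5 error can be computed from a matrix of class scores:

```python
import numpy as np

def topk_error(scores, labels, k):
    # scores: (num_images, num_classes); labels: true class indices.
    # An image counts as correct if its true label is among the k
    # highest-scoring classes.
    topk = np.argsort(scores, axis=1)[:, -k:]
    correct = np.any(topk == labels[:, None], axis=1)
    return 1.0 - correct.mean()

scores = np.random.rand(10, 1000)             # dummy scores over 1000 classes
labels = np.random.randint(0, 1000, size=10)
print(topk_error(scores, labels, 1), topk_error(scores, labels, 5))
```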
Database: ImageNet
•15M images
•22K categories
•Images collected from the Web
•RGB images
•Variable resolution
•Human labelers (Amazon's Mechanical Turk crowd-sourcing)
Inputs (raw pixels) x1, …, xd with weights w1, …, wd and bias b
Output: f(w·x + b), where f is a nonlinear activation (e.g., the sigmoid)
reference : http://en.wikipedia.org/wiki/Sigmoid_function#mediaviewer/File:Gjl-t(x).svg
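An illustrative NumPy sketch (not from the slides) of this single-unit computation f(w·x + b) with a sigmoid nonlinearity:

```python
import numpy as np

def sigmoid(a):
    # f(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def neuron_output(x, w, b):
    # Output of a single unit: f(w . x + b)
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])    # raw pixel inputs x1..xd
w = np.array([0.1, 0.4, -0.2])    # weights w1..wd
print(neuron_output(x, w, b=0.05))
```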
Multi-Layer Neural Networks
Input layer, hidden layer, output layer
Nonlinear classifier
Learning can be done by gradient descent (Back-Propagation algorithm)
Feed Forward Operation
Input layer: d features x(1), …, x(d), plus a bias unit
Hidden layer: units connected to the inputs by weights wji
Output layer: m outputs z1, …, zm, one for each class, connected to the hidden units by weights vkj
Notation for Weights
Use wji to denote the weight between input unit i and hidden unit j:
hidden unit j receives wji x(i) and outputs yj.
Use vkj to denote the weight between hidden unit j and output unit k:
net*k = ∑_{j=1..NH} yj vkj + vk0
Network Training
1. Initialize weights wji and vkj randomly, but not to 0
2. Iterate until a stopping criterion is reached:
   choose an input sample xp and pass it through the MNN with the current weights wji and vkj to get outputs z1, …, zm
Training error: J(w, v) = (1/2) ∑_{i=1..n} ∑_{c=1..m} (tc(i) − zc(i))²
Use gradient descent: start from random v(0), w(0) and repeat until convergence:
w(t+1) = w(t) − η ∇w J(w(t))
v(t+1) = v(t) − η ∇v J(v(t))
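A minimal sketch of this gradient-descent loop, using a toy quadratic objective so the example is self-contained; in the actual network the gradients ∇w J and ∇v J come from back-propagation:

```python
import numpy as np

def grad_J(w, w_star=np.array([1.0, -2.0])):
    # Gradient of the toy objective J(w) = 0.5 * ||w - w_star||^2
    return w - w_star

eta = 0.1                     # learning rate
w = np.random.randn(2)        # random (non-zero) initialization
for t in range(200):          # iterate until a stopping criterion is met
    w = w - eta * grad_J(w)   # w(t+1) = w(t) - eta * grad J(w(t))
print(w)                      # approaches the minimizer w_star
```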
BackPropagation: Layered Model
activation at hidden unit j:   net_j = ∑_{i=1..d} x(i) wji + wj0
output at hidden unit j:       yj = f(net_j)
activation at output unit k:   net*k = ∑_{j=1..NH} yj vkj + vk0
output at output unit k:       zk = f(net*k)
objective function:            J(w, v) = (1/2) ∑_{c=1..m} (tc − zc)²
Apply the chain rule to compute ∂J/∂vkj and ∂J/∂wji.
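A NumPy sketch of this feed-forward pass (illustrative only; tanh stands in for the unspecified nonlinearity f):

```python
import numpy as np

def forward(x, W, w0, V, v0, f=np.tanh):
    net_j = W @ x + w0      # net_j = sum_i x(i) * wji + wj0
    y = f(net_j)            # yj = f(net_j)
    net_k = V @ y + v0      # net*k = sum_j yj * vkj + vk0
    z = f(net_k)            # zk = f(net*k)
    return z

d, NH, m = 4, 3, 2                               # inputs, hidden units, outputs
W, w0 = np.random.randn(NH, d), np.zeros(NH)     # input-to-hidden weights
V, v0 = np.random.randn(m, NH), np.zeros(m)      # hidden-to-output weights
print(forward(np.random.randn(d), W, w0, V, v0))
```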
BackPropagation of Errors
∂J/∂vkj = −(tk − zk) f′(net*k) yj
∂J/∂wji = −f′(net_j) x(i) ∑_{k=1..m} (tk − zk) f′(net*k) vkj
The output error (tk − zk) is propagated back from output unit k to hidden unit j and input unit i.
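The same derivatives written as code for a single training sample (illustrative sketch; tanh stands in for f, so f′(a) = 1 − f(a)²):

```python
import numpy as np

def backprop(x, t, W, w0, V, v0):
    # Forward pass
    y = np.tanh(W @ x + w0)              # hidden outputs yj
    z = np.tanh(V @ y + v0)              # network outputs zk

    # dJ/dvkj = -(tk - zk) * f'(net*k) * yj
    delta_k = (t - z) * (1.0 - z ** 2)
    grad_V = -np.outer(delta_k, y)

    # dJ/dwji = -f'(net_j) * x(i) * sum_k (tk - zk) * f'(net*k) * vkj
    delta_j = (1.0 - y ** 2) * (V.T @ delta_k)
    grad_W = -np.outer(delta_j, x)
    return grad_W, grad_V
```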
[Plot: classification error vs. training time]
This is a good time to stop training, since beyond this point we start to overfit.
The stopping criterion is part of the training phase, so the validation data is effectively part of the training data.
To assess how the network will perform on unseen examples, we still need separate test data.
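An illustrative sketch of this stopping rule; the validation-error curve here is synthetic, purely to show the bookkeeping:

```python
import numpy as np

val_error = lambda t: (t - 30) ** 2 / 1000.0 + 0.1   # falls, then rises (synthetic)
best_err, best_epoch = np.inf, 0
for epoch in range(100):
    err = val_error(epoch)                 # error on held-out validation data
    if err < best_err:
        best_err, best_epoch = err, epoch  # remember the best weights seen so far
    elif epoch - best_epoch >= 5:          # no improvement for 5 epochs: stop
        break
print("stopped at epoch", epoch, "best at epoch", best_epoch)
```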
Momentum
Gradient descent finds only a local minimum.
This is not a problem if J(w) is small at the local minimum; indeed, we do not wish to find w such that J(w) = 0, due to overfitting.
It is a problem if J(w) is large at the local minimum.
[Plots of J(w): a reasonable local minimum close to the global minimum vs. a bad local minimum far above the global minimum]
Momentum
Momentum is a popular method to avoid local minima; it also speeds up descent in plateau regions.
The weight update at time t is ∆w(t) = w(t) − w(t−1).
Add a temporal average of the direction in which the weights have been moving recently:
w(t+1) = w(t) − (1 − α) η ∂J/∂w + α ∆w(t−1)
(the first term is the steepest-descent direction, the second the previous direction)
At α = 0, this is equivalent to gradient descent.
At α = 1, the gradient is ignored and the weight update continues in the direction in which it was moving previously (momentum).
Usually α is around 0.9.
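A sketch of this momentum update on a toy gradient (illustrative only):

```python
import numpy as np

def momentum_step(w, grad, delta_w_prev, eta=0.1, alpha=0.9):
    # delta_w = -(1 - alpha) * eta * dJ/dw  +  alpha * (previous delta_w)
    delta_w = -(1.0 - alpha) * eta * grad + alpha * delta_w_prev
    return w + delta_w, delta_w

w, delta_w = np.zeros(3), np.zeros(3)
for _ in range(100):
    grad = w - np.array([1.0, 2.0, 3.0])          # toy quadratic gradient
    w, delta_w = momentum_step(w, grad, delta_w)  # alpha=0 -> plain gradient descent
print(w)                                          # moves toward [1, 2, 3]
```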
1D Convolution
Neural 1D Convolution Implementation
2D Convolution Matrix
reference : http://en.wikipedia.org/wiki/Kernel_(image_processing)
Convolutional Filter
[Figure: a convolutional filter slides over the input to produce a feature map]
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
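An illustrative NumPy sketch of how a single filter produces a feature map from a 2D input, implemented as the sliding dot product used by convolutional layers:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; each output value is the dot
    # product of the kernel with the patch under it ("valid" positions).
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # simple 3x3 vertical-edge filter
print(conv2d_valid(image, kernel))          # 3x3 feature map
```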
Architecture
•Trained with stochastic gradient descent on two NVIDIA GPUs for about a
week (5~6 days)
•650,000 neurons, 60 million parameters, 630 million connections
•The last layer contains 1,000 neurons, which produce a distribution over the 1,000 class labels.
Architecture
Response-Normalization Layer
a_{x,y}^i : the activity of a neuron computed by applying kernel i at position (x, y)
The response-normalized activity b_{x,y}^i is given by
b_{x,y}^i = a_{x,y}^i / ( k + α ∑_{j=max(0, i−n/2)}^{min(N−1, i+n/2)} (a_{x,y}^j)² )^β
where the sum runs over n adjacent kernel maps and N is the total number of kernels in the layer.
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
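An illustrative NumPy sketch of this normalization across adjacent kernel maps, using the hyper-parameter values reported in the paper (k = 2, n = 5, α = 10⁻⁴, β = 0.75):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    # a[i, x, y]: activity of kernel i at position (x, y); shape (N, H, W).
    # b[i] = a[i] / (k + alpha * sum_j a[j]^2)^beta, where j runs over the
    # n kernel maps adjacent to i (clamped to [0, N-1]).
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

print(local_response_norm(np.random.randn(8, 4, 4)).shape)   # (8, 4, 4)
```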
Architecture
First Layer Visualization
ReLU: f(x) = max(0, x)
Learning rule
Use stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005.
The update rule for weight w was
v(i+1) = 0.9 v(i) − 0.0005 ε w(i) − ε ⟨∂L/∂w⟩_{Di}  (gradient evaluated at w(i), averaged over batch Di)
w(i+1) = w(i) + v(i+1)
where v is the momentum variable and ε the learning rate.
reference : http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
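A minimal sketch of that update as code (momentum 0.9, weight decay 0.0005, initial learning rate 0.01; `grad` is the gradient averaged over the batch):

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    # v(i+1) = 0.9 * v(i) - 0.0005 * lr * w(i) - lr * grad
    # w(i+1) = w(i) + v(i+1)
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w, v = np.random.randn(5), np.zeros(5)
w, v = sgd_step(w, v, grad=np.random.randn(5))
```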
Results - Classification
ILSVRC-2010 test set