ImageNet Classification With Deep Convolutional Neural Networks (PDF)

The document summarizes a 2012 paper that achieved breakthrough results in image classification using deep convolutional neural networks. The paper introduced a network architecture with multiple convolutional and pooling layers, dropout for regularization, and ReLU activations. This network achieved top-5 test error rates of 15.3% on ImageNet classification, significantly outperforming prior methods. The network was trained on over 1 million images with 1,000 categories using stochastic gradient descent on GPUs for around a week.


ImageNet Classification
with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
Motivation
Classification goals:
•Make 1 guess about the label (Top-1 error)
•Make 5 guesses about the label (Top-5 error)

•No bounding box (classification only, not detection)

Database: ImageNet
•15M images
•22K categories
•Images collected from the Web
•RGB images
•Variable resolution
•Human labelers (Amazon’s Mechanical Turk crowd-sourcing)

ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2010)
•1K categories
•1.2M training images (~1000 per category)
•50,000 validation images
•150,000 testing images
Strategy – Deep Learning
“Shallow” vs. “deep” architectures

Learn a feature hierarchy all the way from pixels to classifier


reference : http://web.engr.illinois.edu/~slazebni/spring14/lec24_cnn.pdf
Neuron - Perceptron

Inputs x1…xd (raw pixel values) with weights w1…wd feed a single unit; the output is f(w·x + b), where f is a nonlinearity such as the sigmoid shown in the reference below.
reference : http://en.wikipedia.org/wiki/Sigmoid_function#mediaviewer/File:Gjl-t(x).svg
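As a companion to the diagram above, here is a minimal NumPy sketch (not from the slides) of the single-unit computation f(w·x + b), assuming the sigmoid from the reference as the nonlinearity f; the input and weight values are made up for illustration.

```python
import numpy as np

def sigmoid(t):
    # f(t) = 1 / (1 + e^(-t)), the nonlinearity from the referenced figure
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, b):
    # output = f(w . x + b)
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.2, 0.5, 0.1])   # raw pixel inputs x1..xd (illustrative values)
w = np.array([0.4, -0.3, 0.8])  # weights w1..wd
b = 0.1                         # bias
print(neuron(x, w, b))
```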
Multi-Layer Neural Networks
Input layer, hidden layer, output layer

 Nonlinear classifier
 Learning can be done by gradient descent
 Back-Propagation algorithm
Feed Forward Operation
Input layer: d features x(1)…x(d); hidden layer connected by weights wji; output layer: m outputs z1…zm, one for each class, connected by weights vkj; each layer also has a bias unit.
Notation for Weights
 Use wji to denote the weight between input unit i and hidden unit j: input unit i contributes wji x(i) to hidden unit j, whose output is yj
 Use vkj to denote the weight between hidden unit j and output unit k: hidden unit j contributes vkj yj to output unit k, whose output is zk
Notation for Activation
 Use net_j to denote the activation at hidden unit j (whose output is yj):
   $net_j = \sum_{i=1}^{d} x^{(i)} w_{ji} + w_{j0}$
 Use net*_k to denote the activation at output unit k (whose output is zk):
   $net_k^* = \sum_{j=1}^{N_H} y_j v_{kj} + v_{k0}$
Network Training
1. Initialize weights wji and vkj randomly but not to 0
2. Iterate until a stopping criterion is reached

   choose an input sample x_p and pass it through the MNN with the current weights wji and vkj to get the output z = (z1, …, zm)^T
   compare the output z with the desired target t; adjust wji and vkj to move z closer to t (by backpropagation)
BackPropagation
 Learn wji and vkj by minimizing the training error
 What is the training error?
 Suppose the output of the MNN for sample x is z and the target (desired output for x) is t

 Error on one sample: $J(w, v) = \frac{1}{2} \sum_{c=1}^{m} (t_c - z_c)^2$

 Training error: $J(w, v) = \frac{1}{2} \sum_{i=1}^{n} \sum_{c=1}^{m} \left(t_c^{(i)} - z_c^{(i)}\right)^2$

 Use gradient descent: initialize $v^{(0)}, w^{(0)}$ randomly, then repeat until convergence:
   $w^{(t+1)} = w^{(t)} - \eta\, \nabla_w J(w^{(t)})$
   $v^{(t+1)} = v^{(t)} - \eta\, \nabla_v J(v^{(t)})$
BackPropagation: Layered Model
 Activation at hidden unit j: $net_j = \sum_{i=1}^{d} x^{(i)} w_{ji} + w_{j0}$
 Output at hidden unit j: $y_j = f(net_j)$
 Activation at output unit k: $net_k^* = \sum_{j=1}^{N_H} y_j v_{kj} + v_{k0}$
 Output at output unit k: $z_k = f(net_k^*)$
 Objective function: $J(w, v) = \frac{1}{2} \sum_{c=1}^{m} (t_c - z_c)^2$
 The gradients $\partial J / \partial v_{kj}$ and $\partial J / \partial w_{ji}$ follow from the chain rule
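To make the layered model concrete, here is a minimal NumPy sketch of one forward pass through a single-hidden-layer network; the sigmoid is assumed for f, and the dimensions and random values are illustrative, not taken from the slides.

```python
import numpy as np

def f(t):
    # assumed nonlinearity (sigmoid)
    return 1.0 / (1.0 + np.exp(-t))

d, NH, m = 4, 3, 2                      # input dim, hidden units, output units (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(NH, d + 1))        # w_ji, last column is the bias weight w_j0
V = rng.normal(size=(m, NH + 1))        # v_kj, last column is the bias weight v_k0

x = rng.normal(size=d)                  # one input sample x(1)..x(d)

net_j = W[:, :d] @ x + W[:, d]          # net_j = sum_i x(i) w_ji + w_j0
y = f(net_j)                            # y_j = f(net_j)
net_k = V[:, :NH] @ y + V[:, NH]        # net*_k = sum_j y_j v_kj + v_k0
z = f(net_k)                            # z_k = f(net*_k)
print(z)
```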
BackPropagation of Errors
 $\dfrac{\partial J}{\partial v_{kj}} = -(t_k - z_k)\, f'(net_k^*)\, y_j$

 $\dfrac{\partial J}{\partial w_{ji}} = -f'(net_j)\, x^{(i)} \sum_{k=1}^{m} (t_k - z_k)\, f'(net_k^*)\, v_{kj}$

 The name "backpropagation" comes from the fact that during training, the errors are propagated back from the output layer to the hidden layer
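A minimal NumPy sketch of these two gradient formulas for the same single-hidden-layer network, assuming the sigmoid for f (so f'(t) = f(t)(1 - f(t))) and applying the gradient-descent update from the earlier slide; the shapes, values, and learning rate are illustrative only.

```python
import numpy as np

def f(t):
    return 1.0 / (1.0 + np.exp(-t))

def fprime(t):
    s = f(t)
    return s * (1.0 - s)              # sigmoid derivative, assumed for f'

def backprop_step(x, t, W, V, eta=0.1):
    d, NH = x.size, W.shape[0]
    # forward pass (same equations as the layered-model slide)
    net_j = W[:, :d] @ x + W[:, d]
    y = f(net_j)
    net_k = V[:, :NH] @ y + V[:, NH]
    z = f(net_k)
    # dJ/dv_kj = -(t_k - z_k) f'(net*_k) y_j   (bias column uses y_j = 1)
    delta_k = -(t - z) * fprime(net_k)
    dV = np.outer(delta_k, np.append(y, 1.0))
    # dJ/dw_ji = -f'(net_j) x(i) sum_k (t_k - z_k) f'(net*_k) v_kj
    delta_j = fprime(net_j) * (V[:, :NH].T @ delta_k)
    dW = np.outer(delta_j, np.append(x, 1.0))
    # gradient-descent update: w <- w - eta * dJ/dw
    return W - eta * dW, V - eta * dV

# example usage with the shapes from the forward-pass sketch
rng = np.random.default_rng(0)
W, V = rng.normal(size=(3, 5)), rng.normal(size=(2, 4))
W, V = backprop_step(rng.normal(size=4), np.array([1.0, 0.0]), W, V)
```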
Learning Curves

Plot: classification error (y-axis) vs. training time (x-axis) for the training and validation sets

 The point where the validation error starts to rise is a good time to stop training, since after this point the network starts to overfit
 The stopping criterion is part of the training phase, thus the validation data is part of the training data
 To assess how the network will work on unseen examples, we still need test data
Momentum
 Gradient descent finds only a local minimum
 Not a problem if J(w) is small at that local minimum; indeed, we do not wish to find w s.t. J(w) = 0, due to overfitting
 Problem if J(w) is large at the local minimum

Plots: J(w) vs. w, contrasting a reasonable local minimum (J close to its value at the global minimum) with a bad local minimum (J much larger than at the global minimum)
Momentum
 Momentum: a popular method to avoid local minima that also speeds up descent in plateau regions
 weight update at time t: $\Delta w^{(t)} = w^{(t)} - w^{(t-1)}$
 add a temporal average of the direction in which the weights have been moving recently:
   $w^{(t+1)} = w^{(t)} + (1 - \alpha)\,\eta\left(-\dfrac{\partial J}{\partial w}\right) + \alpha\, \Delta w^{(t-1)}$
   (first term: steepest descent direction; second term: previous direction)
 at α = 0, equivalent to gradient descent
 at α = 1, gradient descent is ignored and the weight update continues in the direction in which it was moving previously (momentum)
 usually, α is around 0.9
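A minimal NumPy sketch of the momentum update above, written as a generic helper; the gradient function, objective, learning rate, and iteration count are placeholders for illustration, not code from the slides.

```python
import numpy as np

def momentum_step(w, delta_w_prev, grad_J, eta=0.1, alpha=0.9):
    """One momentum update:
    w(t+1) = w(t) + (1 - alpha) * eta * (-dJ/dw) + alpha * delta_w(t-1)."""
    step = (1.0 - alpha) * eta * (-grad_J(w)) + alpha * delta_w_prev
    return w + step, step              # new weights and the new delta_w(t)

# illustrative quadratic objective J(w) = 0.5 * ||w||^2, so dJ/dw = w
w = np.array([2.0, -3.0])
delta_w = np.zeros_like(w)
for _ in range(100):
    w, delta_w = momentum_step(w, delta_w, grad_J=lambda w: w)
print(w)   # approaches the minimum at w = 0
```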
1D Convolution
Neural 1D Convolution Implementation
2D Convolution Matrix

reference : http://en.wikipedia.org/wiki/Kernel_(image_processing)
Convolutional Filter

Figure: a small filter slides across the input; its responses at each position form the output feature map
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
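A minimal NumPy sketch of the sliding-filter idea from the last two slides: a small 2D kernel is swept over the input, and each placement contributes one value to the output feature map (valid convolution, no padding or stride; the toy image and kernel are illustrative).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the (flipped) kernel over the image and
    take a weighted sum at every position to build the feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    k = np.flipud(np.fliplr(kernel))             # flip for true convolution
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)     # toy 5x5 "image"
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)       # simple horizontal-gradient filter
print(conv2d(image, edge_kernel))                    # 3x3 feature map
```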
Architecture

•Trained with stochastic gradient descent on two NVIDIA GPUs for about a week (5-6 days)
•650,000 neurons, 60 million parameters, 630 million connections
•The last layer contains 1,000 neurons, which produce a distribution over the 1,000 class labels
Architecture
Response-Normalization Layer
 $a^i_{x,y}$ : the activity of a neuron computed by applying kernel i at position (x, y)
 The response-normalized activity is given by
   $b^i_{x,y} = a^i_{x,y} \Big/ \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2\right)^{\beta}$
 N : the total # of kernels in the layer
 n : hyper-parameter, n = 5
 k : hyper-parameter, k = 2
 α : hyper-parameter, α = 10^(-4)
 β : hyper-parameter, β = 0.75
 This aids generalization even though ReLUs don't require it
 This reduces the top-1 error rate by 1.4% and the top-5 error rate by 1.2%
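A minimal NumPy sketch of the normalization formula above, with the stated hyper-parameter values as defaults; `activations` is assumed to be an array of shape (N, H, W) holding a^i_{x,y} for each of the N kernels.

```python
import numpy as np

def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """b[i, x, y] = a[i, x, y] / (k + alpha * sum_j a[j, x, y]^2) ** beta,
    where j runs over the n adjacent kernel maps centred on i."""
    N = activations.shape[0]
    out = np.empty_like(activations)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(activations[lo:hi + 1] ** 2, axis=0)) ** beta
        out[i] = activations[i] / denom
    return out

a = np.random.default_rng(0).normal(size=(96, 55, 55))  # e.g. first-layer activity maps
b = local_response_norm(a)
```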
Pooling Layer

◦ Non-overlapping / overlapping pooling regions
◦ Sum or max pooling within each region

Overlapping max pooling reduces the top-1 error rate by 0.4% and the top-5 error rate by 0.3%

reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
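A minimal NumPy sketch of max pooling over a single feature map: the pooling size and stride are parameters, so stride < size gives the overlapping variant mentioned above, while stride == size gives non-overlapping pooling; the defaults (size 3, stride 2) and the 55x55 example input are illustrative.

```python
import numpy as np

def max_pool2d(fmap, size=3, stride=2):
    """Take the maximum over each size x size window, moving by `stride`;
    stride < size -> overlapping pooling, stride == size -> non-overlapping."""
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = fmap[r * stride:r * stride + size, c * stride:c * stride + size]
            out[r, c] = window.max()        # use window.sum() for sum pooling
    return out

fmap = np.random.default_rng(0).normal(size=(55, 55))
print(max_pool2d(fmap).shape)               # (27, 27)
```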
Architecture
First Layer Visualization
ReLU: f(x) = max(0, x)
Learning rule
 Use stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005
 The update rule for weight w was
   $v_{i+1} = 0.9\, v_i - 0.0005\, \epsilon\, w_i - \epsilon\, \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$
   $w_{i+1} = w_i + v_{i+1}$
 i : the iteration index
 v : the momentum variable
 ε : the learning rate, initialized at 0.01 and reduced three times prior to termination
 $\langle \partial L / \partial w \rangle_{D_i}$ : the average over the i-th batch Di of the derivative of the objective with respect to w
 Train for 90 cycles through the training set of 1.2 million images
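A minimal NumPy sketch of the update rule above (momentum 0.9, weight decay 0.0005, learning rate 0.01); `grad` is assumed to already be the batch-averaged derivative over Di, and the parameter shapes are illustrative.

```python
import numpy as np

def sgd_update(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """v_{i+1} = 0.9 * v_i - 0.0005 * lr * w_i - lr * grad
       w_{i+1} = w_i + v_{i+1}"""
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_next, v_next

# illustrative usage: grad would come from backprop on a 128-example batch
w = np.zeros(10)
v = np.zeros_like(w)
grad = np.random.default_rng(0).normal(size=10)
w, v = sgd_update(w, v, grad)
```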
Fighting overfitting - input
 This neural net has 60M real-valued parameters and 650,000 neurons
 It overfits a lot; therefore, train on 224x224 patches extracted randomly from 256x256 images, and also on their horizontal reflections
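A minimal NumPy sketch of this augmentation: take a random 224x224 patch from a 256x256 image and mirror it horizontally half the time. The (H, W, 3) array layout and the RNG handling are assumptions for illustration.

```python
import numpy as np

def random_crop_and_flip(image, crop=224, rng=None):
    """image: (256, 256, 3) array -> random (224, 224, 3) patch, maybe mirrored."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)      # random top-left corner
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                   # horizontal reflection half the time
        patch = patch[:, ::-1]
    return patch

image = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder image
patch = random_crop_and_flip(image)
print(patch.shape)                                # (224, 224, 3)
```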
Fighting overfitting - Dropout
 Independently set each hidden unit activity to zero with 0.5 probability
 Used in the two globally-connected hidden layers at the net's output
 Doubles the number of iterations required to converge

reference : http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
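A minimal NumPy sketch of dropout as described above: at training time each hidden activity is zeroed independently with probability 0.5; the test-time halving of activities shown here is an assumption (the usual scheme for keeping expected inputs to the next layer unchanged), and the layer size is illustrative.

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    if train:
        mask = rng.random(activations.shape) >= p_drop   # keep with prob 1 - p_drop
        return activations * mask
    return activations * (1.0 - p_drop)   # test time: scale instead of dropping

h = np.random.default_rng(0).normal(size=4096)   # e.g. a globally-connected layer's output
h_train = dropout(h, train=True)
h_test = dropout(h, train=False)
```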
Results - Classification
 ILSVRC-2010 test set

 ILSVRC-2012 test set


Results - Classification
Results - Retrieval
The End
Thank you for your attention
References
 www.cs.toronto.edu/~fritz/absps/imagenet.pd
 https://prezi.com/jiilm_br8uef/imagenet-classification-with-deep-convolutional-neural-networks/
 sglab.kaist.ac.kr/~sungeui/IR/.../second/20145481오은수.pptx
 http://alex.smola.org/teaching/cmu2013-10-701/slides/14_PrincipalComp.pdf
 Hagit Hel-or (Convolution Slide)
 http://www.cs.haifa.ac.il/~rita/ml_course/lectures/NN.pdf
