
12. Object Detection-compressed


Segmentation, Localization, Detection

So far: Image Classification
Vector: 4096
Fully-Connected: 4096 to 1000
Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...
This image is CC0 public domain
Other Computer Vision Tasks
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Classification + Localization: CAT (single object)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain
Semantic Segmentation
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels
Example labels, first image: Sky, Cat, Grass
Example labels, second image: Sky, Cow, Grass
This image is CC0 public domain
Semantic Segmentation Idea: Sliding Window
Extract patch from the full image, classify center pixel with CNN (e.g. Cow, Cow, Grass)
Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Input: 3 x H x W
Convolutions (Conv, Conv, Conv, Conv): D x H x W
Scores: C x H x W
Predictions (argmax): H x W
Problem: convolutions at original image resolution will be very expensive ...
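The last step of the slide above (scores of shape C x H x W reduced to predictions of shape H x W by an argmax) can be sketched in NumPy. This is a minimal illustration only; the function name `predict_labels` and the tiny 3-class score tensor are made up for the example.

```python
import numpy as np

def predict_labels(scores):
    """scores: C x H x W array of per-class scores; returns H x W class indices."""
    return np.argmax(scores, axis=0)

# Hypothetical scores for C = 3 classes on a 2 x 2 image.
scores = np.array([
    [[0.90, 0.10], [0.20, 0.30]],  # class 0
    [[0.05, 0.80], [0.10, 0.30]],  # class 1
    [[0.05, 0.10], [0.70, 0.40]],  # class 2
])
labels = predict_labels(scores)  # one class index per pixel
```

Each pixel gets the index of its highest-scoring class, so the channel dimension disappears.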
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: Pooling, strided convolution
Upsampling: ???
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
In-Network upsampling: “Unpooling”

Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4

“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
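A minimal NumPy sketch of the two unpooling schemes on the slide, assuming a single-channel 2D input and an upsampling factor of 2. The function names are illustrative, not from any library.

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    """Copy each input value into a factor x factor block of the output."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    """Put each input value in the top-left corner of its block, zeros elsewhere."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
nn_out = nearest_neighbor_unpool(x)
bon_out = bed_of_nails_unpool(x)
```

Both functions have no learnable parameters; they only rearrange or copy the input values.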


In-Network upsampling: “Max Unpooling”

Max Pooling: remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8

Rest of the network

Max Unpooling: use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4

Corresponding pairs of downsampling and upsampling layers
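The pairing of max pooling with max unpooling can be sketched as follows. This is a small NumPy illustration under the assumption of a single-channel input and 2 x 2 windows with stride 2; `max_pool_2x2` and `max_unpool_2x2` are hypothetical helper names.

```python
import numpy as np

def max_pool_2x2(x):
    """Return pooled values plus the flat index (into x) of each window's max."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2), dtype=x.dtype)
    positions = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            k = int(np.argmax(window))           # 0..3, row-major inside the window
            r, c = 2 * i + k // 2, 2 * j + k % 2
            pooled[i, j] = x[r, c]
            positions[i, j] = r * w + c
    return pooled, positions

def max_unpool_2x2(y, positions, out_shape):
    """Place each value of y at the remembered max position; zeros elsewhere."""
    out = np.zeros(out_shape, dtype=y.dtype)
    out.flat[positions.ravel()] = y.ravel()
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, positions = max_pool_2x2(x)

y = np.array([[1, 2],
              [3, 4]])  # stands in for the output of the rest of the network
unpooled = max_unpool_2x2(y, positions, x.shape)
```

The values and positions are the ones shown on the slide, so `pooled` and `unpooled` reproduce its 2 x 2 and 4 x 4 outputs.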
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4 Output: 4 x 4
Dot product between filter and input
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output
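To make the two recalled cases concrete, here is a rough single-channel convolution in NumPy (strictly a cross-correlation, as is usual in deep learning). The all-ones filter and the `conv2d` helper are assumptions for illustration. It shows that a 4 x 4 input gives a 4 x 4 output at stride 1 and a 2 x 2 output at stride 2.

```python
import numpy as np

def conv2d(x, w, stride=1, pad=1):
    """Single-channel 2D convolution with zero padding: dot product of the
    filter with each input patch, moving `stride` input pixels per output pixel."""
    k = w.shape[0]
    xp = np.pad(x, pad)  # zero padding on all sides
    out_h = (x.shape[0] + 2 * pad - k) // stride + 1
    out_w = (x.shape[1] + 2 * pad - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))              # hypothetical filter
out_s1 = conv2d(x, w, stride=1)  # 4 x 4
out_s2 = conv2d(x, w, stride=2)  # 2 x 2
```

With the same filter, the stride-2 output is the stride-1 output sampled at every second position, which is one way to read "stride gives ratio between movement in input and output".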
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Other names:
-Deconvolution (bad)
-Upconvolution
-Fractionally strided convolution
-Backward strided convolution
Transpose Convolution: 1D Example
Input: (a, b)
Filter: (x, y, z)
Output: (ax, ay, az + bx, by, bz)
Output contains copies of the filter weighted by the input, summing where the copies overlap in the output
Need to crop one pixel from output to make output exactly 2x input
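The 1D example can be checked numerically. The sketch below assumes stride 2 and no padding, and uses made-up numbers for a, b and the filter. Which end to crop is not stated on the slide, so dropping the last element is an assumption.

```python
import numpy as np

def transpose_conv1d(x, w, stride=2):
    """Each input value scales a copy of the filter; copies are placed
    `stride` apart in the output and summed where they overlap."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    out = np.zeros(stride * (len(x) - 1) + len(w))
    for i, xi in enumerate(x):
        out[i * stride:i * stride + len(w)] += xi * w
    return out

a, b = 2.0, 3.0                # hypothetical input values
fx, fy, fz = 1.0, 10.0, 100.0  # hypothetical filter (x, y, z)
out = transpose_conv1d([a, b], [fx, fy, fz])
# out == [a*fx, a*fy, a*fz + b*fx, b*fy, b*fz]
cropped = out[:-1]             # assumption: crop the last pixel to get 2x the input length
```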
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=1, padding=1
Convolution transpose multiplies by the transpose of the same matrix:
When stride=1, convolution transpose is just a regular convolution (with different padding rules)
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=2, padding=1
Convolution transpose multiplies by the transpose of the same matrix:
When stride>1, convolution transpose is no longer a normal convolution!
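The matrices themselves did not survive extraction, so the following is a reconstruction sketch rather than the slide's exact figure. It builds the convolution matrix for a 1D kernel of size 3 with padding 1 and applies it, and its transpose, for stride 1 and stride 2. The helper `conv_matrix` and the numeric values are assumptions for illustration.

```python
import numpy as np

def conv_matrix(kernel, n_in, stride=1, pad=1):
    """Matrix M such that M @ padded_input equals the 1D convolution output.
    Each row holds the kernel, shifted by `stride` columns per row."""
    k = len(kernel)
    n_out = (n_in + 2 * pad - k) // stride + 1
    M = np.zeros((n_out, n_in + 2 * pad))
    for r in range(n_out):
        M[r, r * stride:r * stride + k] = kernel
    return M

kernel = np.array([1.0, 2.0, 3.0])        # hypothetical filter (x, y, z)
signal = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical input (a, b, c, d)
padded = np.pad(signal, 1)                # (0, a, b, c, d, 0)

M1 = conv_matrix(kernel, 4, stride=1)     # 4 x 6
M2 = conv_matrix(kernel, 4, stride=2)     # 2 x 6
conv_s1 = M1 @ padded                     # length 4
conv_s2 = M2 @ padded                     # length 2

# Transpose convolution: multiply by the transpose of the same matrix.
up = M2.T @ np.array([2.0, 3.0])          # length 6, includes the padding positions
```

For stride 2 the rows of `M2` are shifted by two columns, so `M2.T` spreads each input value over the output with gaps between copies. This is why it is no longer an ordinary convolution.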
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transpose convolution
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Classification + Localization
Treat localization as a regression problem!
Vector: 4096
Fully Connected: 4096 to 1000, giving Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...)
Softmax Loss on the class scores, with correct label: Cat
Fully Connected: 4096 to 4, giving Box Coordinates (x, y, w, h)
L2 Loss on the box coordinates, with correct box: (x’, y’, w’, h’)
Multitask Loss: Softmax Loss + L2 Loss
This image is CC0 public domain
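As a rough sketch of the multitask loss on this slide, the NumPy code below adds a softmax (cross-entropy) loss on the class scores to an L2 loss on the box coordinates. The `box_weight` factor is an assumption added for illustration; the slide simply sums the two losses. The scores and boxes are made-up values.

```python
import numpy as np

def softmax_loss(scores, correct_class):
    """Cross-entropy of the softmax over raw class scores."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - np.max(scores)  # subtract max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[correct_class])

def l2_loss(pred_box, true_box):
    """Squared L2 distance between predicted and correct (x, y, w, h)."""
    diff = np.asarray(pred_box, dtype=float) - np.asarray(true_box, dtype=float)
    return float(np.sum(diff ** 2))

def multitask_loss(scores, correct_class, pred_box, true_box, box_weight=1.0):
    # box_weight is a hypothetical hyperparameter balancing the two terms.
    return softmax_loss(scores, correct_class) + box_weight * l2_loss(pred_box, true_box)

scores = [2.0, 1.0, 0.1]           # hypothetical raw scores; index 0 = Cat
pred_box = [0.5, 0.5, 0.2, 0.3]    # predicted (x, y, w, h)
true_box = [0.5, 0.4, 0.2, 0.3]    # correct (x', y', w', h')
loss = multitask_loss(scores, 0, pred_box, true_box)
```

Both heads share the same 4096-dimensional feature vector, so the gradient of this single scalar flows back into the shared network from both tasks.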
Classification + Localization
Often pretrained on ImageNet (Transfer learning)
Aside: Human Pose Estimation
Represent pose as a set of 14 joint positions:
Left / right foot
Left / right knee
Left / right hip
Left / right shoulder
Left / right elbow
Left / right hand
Neck
Head top
This image is licensed under CC-BY 2.0.
Johnson and Everingham, "Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation", BMVC 2010
Aside: Human Pose Estimation
Vector: 4096
Left foot: (x, y), L2 loss against correct left foot: (x’, y’)
Right foot: (x, y), L2 loss
...
Head top: (x, y), L2 loss against correct head top: (x’, y’)
Loss: sum of the per-joint L2 losses
Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
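The pose loss above is the same regression idea applied to 14 joints. A minimal NumPy sketch follows, assuming predictions and targets are stored as 14 x 2 arrays of (x, y); the storage layout and function name are assumptions.

```python
import numpy as np

NUM_JOINTS = 14  # from the slide: 14 joint positions

def pose_loss(pred, target):
    """Sum of per-joint squared L2 distances between (x, y) predictions and targets."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_joint = np.sum((pred - target) ** 2, axis=1)  # one L2 loss per joint
    return float(per_joint.sum()), per_joint

pred = np.zeros((NUM_JOINTS, 2))    # hypothetical predictions
target = np.ones((NUM_JOINTS, 2))   # hypothetical correct positions
total, per_joint = pose_loss(pred, target)
```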
Object Detection

Object Detection: Impact of Deep Learning

Figure copyright Ross Girshick, 2015. Reproduced with permission.
Object Detection as Regression?
Each image needs a different number of outputs!
First image: CAT: (x, y, w, h) (4 numbers)
Second image: DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) (12 numbers)
Third image: DUCK: (x, y, w, h), DUCK: (x, y, w, h), …. (many numbers!)
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background

Dog? NO
Cat? NO
Background? YES
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the
image, CNN classifies each crop as object
or background

Dog? YES
Cat? NO
Background? NO
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background

Dog? NO
Cat? YES
Background? NO

Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
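To get a feel for the cost, the sketch below counts how many crops a plain sliding window would feed to the CNN. The image size, window sizes, and stride are made-up values chosen for illustration. A real detector would also vary aspect ratios, which multiplies the count further.

```python
def count_windows(img_h, img_w, win_sizes, stride):
    """Number of crops when sliding each (height, width) window over the image."""
    total = 0
    for win_h, win_w in win_sizes:
        if win_h > img_h or win_w > img_w:
            continue  # window does not fit in the image
        n_y = (img_h - win_h) // stride + 1
        n_x = (img_w - win_w) // stride + 1
        total += n_y * n_x
    return total

# Hypothetical setting: 480 x 640 image, three square window sizes, stride 8.
n_crops = count_windows(480, 640, [(64, 64), (128, 128), (256, 256)], stride=8)
```

Even this coarse setting needs thousands of CNN forward passes for a single image.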
Region Proposals
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
R-CNN: Problems

• Ad hoc training objectives
  • Fine-tune network with softmax classifier (log loss)
  • Train post-hoc linear SVMs (hinge loss)
  • Train post-hoc bounding-box regressions (least squares)
• Training is slow (84h), takes a lot of disk space
• Inference (detection) is slow
  • 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
  • Fixed by SPP-net [He et al. ECCV14]

Girshick et al, “Rich feature hierarchies for accurate object detection and
semantic segmentation”, CVPR 2014.
Slide copyright Ross Girshick, 2015; source. Reproduced with permission.
Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015.


Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Fast R-CNN
(Training)

Girshick, “Fast R-CNN”, ICCV 2015.


Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Faster R-CNN: RoI Pooling
Hi-res input image: 3 x 640 x 480 with region proposal
CNN
Hi-res conv features: 512 x 20 x 15; projected region proposal is e.g. 512 x 18 x 8
Project proposal onto features
Divide projected proposal into 7x7 grid, max-pool within each cell
RoI conv features: 512 x 7 x 7 for region proposal
Fully-connected layers expect low-res conv features: 512 x 7 x 7
Girshick, “Fast R-CNN”, ICCV 2015.
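A minimal NumPy sketch of the RoI pooling step, assuming the proposal has already been projected and cropped out of the feature map (the 512 x 18 x 8 example from the slide). The bin boundaries here use floor for the start and ceil for the end of each cell. That is one possible convention and not necessarily the one used in the paper's implementation.

```python
import numpy as np

def roi_max_pool(roi_features, out_h=7, out_w=7):
    """roi_features: C x h x w crop of the conv features for one proposal.
    Divide it into an out_h x out_w grid and max-pool within each cell."""
    c, h, w = roi_features.shape
    ys = np.linspace(0, h, out_h + 1)  # cell boundaries along height
    xs = np.linspace(0, w, out_w + 1)  # cell boundaries along width
    out = np.empty((c, out_h, out_w), dtype=roi_features.dtype)
    for i in range(out_h):
        y0, y1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
        for j in range(out_w):
            x0, x1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            out[:, i, j] = roi_features[:, y0:y1, x0:x1].max(axis=(1, 2))
    return out

# Projected region proposal from the slide: 512 x 18 x 8 (values are made up).
roi = np.arange(512 * 18 * 8, dtype=float).reshape(512, 18, 8)
pooled = roi_max_pool(roi)  # 512 x 7 x 7, the size the fully-connected layers expect
```

Whatever the size of the projected proposal, the output is always 512 x 7 x 7, which is what lets one set of fully-connected layers handle every region.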
R-CNN vs SPP vs Fast R-CNN

Problem: Runtime dominated by region proposals!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Faster R-CNN: Make CNN do proposals!
Insert Region Proposal Network (RPN) to predict proposals from features

Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
Detection without Proposals: YOLO / SSD
Go from input image to tensor of scores with one big convolutional network!

Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3

Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)

Output: 7 x 7 x (5 * B + C)

Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
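The output size 7 x 7 x (5 * B + C) can be unpacked as in the sketch below. The channel layout (all box numbers first, then class scores) and the class count C = 21 are assumptions for illustration; the slide only gives the total depth.

```python
import numpy as np

def split_output(output, B, C):
    """output: S x S x (5*B + C) tensor.
    Returns boxes of shape S x S x B x 5, holding (dx, dy, dh, dw, confidence),
    and class scores of shape S x S x C."""
    S = output.shape[0]
    assert output.shape == (S, S, 5 * B + C)
    boxes = output[..., :5 * B].reshape(S, S, B, 5)  # assumed layout: boxes first
    class_scores = output[..., 5 * B:]
    return boxes, class_scores

S, B = 7, 3  # 7 x 7 grid and B = 3 base boxes, as on the slide
C = 21       # hypothetical: 20 object classes plus background
output = np.zeros((S, S, 5 * B + C))  # stands in for the network output
boxes, class_scores = split_output(output, B, C)
```

With these numbers each grid cell carries 5 * 3 + 21 = 36 values, all produced in one forward pass.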
Object Detection: Lots of variables ...
Base Network: VGG16, ResNet-101, Inception V2, Inception V3, Inception ResNet, MobileNet
Object Detection architecture: Faster R-CNN, R-FCN, SSD
Image Size
# Region Proposals
Takeaways:
Faster R-CNN is slower but more accurate
SSD is much faster but not as accurate

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017
Aside: Object Detection + Captioning = Dense Captioning

Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
Instance Segmentation

Mask R-CNN
CNN
RoI Align
Classification Scores: C
Box coordinates (per class): 4 * C
Conv: 256 x 14 x 14
Conv: 256 x 14 x 14
Predict a mask for each of C classes: C x 14 x 14
He et al, “Mask R-CNN”, arXiv 2017
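One way to turn the C x 14 x 14 mask output into a single instance mask is to keep only the mask of the highest-scoring class. The sketch below assumes the mask branch outputs per-pixel logits that are passed through a sigmoid and thresholded at 0.5. Those details are assumptions, not stated on the slide, and the inputs are made up.

```python
import numpy as np

def select_mask(class_scores, mask_logits, threshold=0.5):
    """class_scores: (C,) scores; mask_logits: (C, 14, 14) per-class mask logits.
    Returns the predicted class index and its binary 14 x 14 mask."""
    k = int(np.argmax(class_scores))
    prob = 1.0 / (1.0 + np.exp(-mask_logits[k]))  # sigmoid, an assumed activation
    return k, prob > threshold

C = 3                                       # hypothetical number of classes
class_scores = np.array([0.1, 0.8, 0.1])
mask_logits = np.zeros((C, 14, 14))
mask_logits[1] = -5.0                       # class 1: background everywhere ...
mask_logits[1, :7, :7] = 5.0                # ... except the top-left 7 x 7 block
predicted_class, mask = select_mask(class_scores, mask_logits)
```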
Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
Mask R-CNN: Also does pose
CNN
RoI Align
Classification Scores: C
Box coordinates (per class): 4 * C
Joint coordinates
Conv: 256 x 14 x 14
Conv: 256 x 14 x 14
Predict a mask for each of C classes: C x 14 x 14
He et al, “Mask R-CNN”, arXiv 2017
Mask R-CNN: Also does pose

He et al, “Mask R-CNN”, arXiv 2017
Figures copyright Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, 2017. Reproduced with permission.
Recap:
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Classification + Localization: CAT (single object)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain
