Semantic Segmentation and Object Detection
[Figure: overview of tasks — no spatial extent (classification); no objects, just pixels (semantic segmentation); multiple objects (detection). Example image is CC0 public domain.]
[Figure: semantic segmentation by classifying patches — crops are extracted from the full image and the center pixel of each patch is classified (Cow, Cow, Grass).]
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Problem: very inefficient! Shared features between overlapping patches are not reused.
An intuitive idea: encode the entire image with a conv net, and do semantic segmentation on top.
Problem: classification architectures typically reduce feature spatial size to go deeper, but semantic segmentation requires the output size to match the input size.
[Figure: a fully convolutional network — input 3 x H x W, a stack of convolutions producing D x H x W features, class scores C x H x W, and per-pixel predictions H x W.]
Problem: convolutions at the original image resolution are very expensive.
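Before addressing that cost, here is a rough sketch of the naive full-resolution design in PyTorch (all channel counts and layer sizes are illustrative, not from the slides):

import torch
import torch.nn as nn

class FullResFCN(nn.Module):
    """Fully convolutional net that keeps H x W fixed at every layer.
    Channel counts (hidden=64, num_classes=21) are illustrative placeholders."""
    def __init__(self, in_ch=3, hidden=64, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # 1x1 conv maps D x H x W features to C x H x W class scores
        self.classifier = nn.Conv2d(hidden, num_classes, kernel_size=1)

    def forward(self, x):                        # x: N x 3 x H x W
        return self.classifier(self.features(x)) # N x C x H x W

x = torch.randn(1, 3, 64, 64)
scores = FullResFCN()(x)                          # 1 x 21 x 64 x 64 class scores
preds = scores.argmax(dim=1)                      # 1 x 64 x 64 per-pixel labels
targets = torch.randint(0, 21, (1, 64, 64))       # dummy ground-truth labels
loss = nn.CrossEntropyLoss()(scores, targets)     # per-pixel cross-entropy
print(scores.shape, preds.shape)

Every layer runs at the full H x W resolution, which is exactly why this design is so expensive.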
[Figure: fully convolutional network with downsampling and upsampling inside the network — input 3 x H x W; high-res features D1 x H/2 x W/2; med-res features D2 x H/4 x W/4; low-res features D3 x H/4 x W/4; then back up through med-res and high-res features to class scores C x H x W and predictions H x W.]
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
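A sketch of this downsample-then-upsample design, using strided convolutions to downsample and transposed convolutions to upsample (one possible choice; pooling/unpooling, shown next, is another). D1/D2/D3 and C are illustrative placeholders:

import torch
import torch.nn as nn

class DownUpFCN(nn.Module):
    """Downsample inside the network, then upsample back to full resolution."""
    def __init__(self, D1=32, D2=64, D3=128, C=21):
        super().__init__()
        self.down1 = nn.Conv2d(3,  D1, 3, stride=2, padding=1)            # -> D1 x H/2 x W/2
        self.down2 = nn.Conv2d(D1, D2, 3, stride=2, padding=1)            # -> D2 x H/4 x W/4
        self.mid   = nn.Conv2d(D2, D3, 3, stride=1, padding=1)            # low-res: D3 x H/4 x W/4
        self.up1   = nn.ConvTranspose2d(D3, D2, 4, stride=2, padding=1)   # -> D2 x H/2 x W/2
        self.up2   = nn.ConvTranspose2d(D2, D1, 4, stride=2, padding=1)   # -> D1 x H x W
        self.score = nn.Conv2d(D1, C, 1)                                  # -> C x H x W
        self.relu  = nn.ReLU()

    def forward(self, x):                       # N x 3 x H x W
        x = self.relu(self.down1(x))
        x = self.relu(self.down2(x))
        x = self.relu(self.mid(x))
        x = self.relu(self.up1(x))
        x = self.relu(self.up2(x))
        return self.score(x)                    # N x C x H x W

scores = DownUpFCN()(torch.randn(1, 3, 64, 64))
print(scores.shape)   # torch.Size([1, 21, 64, 64])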
[Figure: in-network upsampling with max unpooling — the max pooling layer records which position in each window held the max; after the rest of the network runs on the pooled features, the corresponding unpooling layer places each value back at its recorded position and fills the rest with zeros. Downsampling and upsampling layers come in corresponding pairs: input 4 x 4, pooled 2 x 2, unpooled output 4 x 4.]
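A minimal illustration of such a corresponding pair in PyTorch, using the 4 x 4 grid from the figure: max pooling with return_indices=True remembers where each max came from, and MaxUnpool2d puts values back at those positions (zeros elsewhere):

import torch
import torch.nn as nn

pool   = nn.MaxPool2d(2, stride=2, return_indices=True)  # remembers argmax positions
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.tensor([[[[1., 2., 6., 3.],
                    [3., 5., 2., 1.],
                    [1., 2., 2., 1.],
                    [7., 3., 4., 8.]]]])     # 1 x 1 x 4 x 4

pooled, indices = pool(x)      # 1 x 1 x 2 x 2: [[5, 6], [7, 8]]
# ... the rest of the network would run on the pooled features ...
up = unpool(pooled, indices)   # 1 x 1 x 4 x 4: maxes restored in place, zeros elsewhere
print(pooled)
print(up)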
[Figure: convolution vs. transposed convolution — a 3 x 3 convolution with stride 1, pad 1 maps a 4 x 4 input to a 4 x 4 output by taking dot products between the filter and the input; with stride 2 it maps a 4 x 4 input to a 2 x 2 output (downsampling); a transposed convolution with stride 2 maps a 2 x 2 input to a 4 x 4 output, where each input value gives the weight for a copy of the filter placed in the output (learnable upsampling).]
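The same shape relationships in code (3 x 3 filters; the exact padding choices below are one option among several):

import torch
import torch.nn as nn

x4 = torch.randn(1, 1, 4, 4)

# Normal 3x3 convolution, stride 1, pad 1: 4 x 4 -> 4 x 4
same = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
print(same(x4).shape)        # torch.Size([1, 1, 4, 4])

# Strided 3x3 convolution, stride 2, pad 1: 4 x 4 -> 2 x 2 (downsampling)
down = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
print(down(x4).shape)        # torch.Size([1, 1, 2, 2])

# Transposed 3x3 convolution, stride 2: 2 x 2 -> 4 x 4 (learnable upsampling)
x2 = torch.randn(1, 1, 2, 2)
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1, output_padding=1)
print(up(x2).shape)          # torch.Size([1, 1, 4, 4])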
[Figure: semantic segmentation examples — every pixel labeled Sky, Trees, Cat, or Grass in one image and Sky, Trees, Cow, or Grass in another.] Semantic segmentation doesn’t differentiate instances; it only cares about pixels.
Treat localization as a regression problem! The network produces a 4096-dim feature vector. One fully connected head (4096 to 1000) outputs class scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...) trained with a softmax loss; a second fully connected head (4096 to 4) outputs box coordinates (x, y, w, h) trained with an L2 loss against the correct box (x’, y’, w’, h’). (Example image is CC0 public domain.)
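A sketch of the two heads and the multi-task loss (the 4096-dim vector is assumed to come from some backbone; the class index, target box, and the relative weighting of the two losses are illustrative choices, not from the slides):

import torch
import torch.nn as nn

feat = torch.randn(1, 4096)            # feature vector from the backbone

cls_head = nn.Linear(4096, 1000)       # class scores
box_head = nn.Linear(4096, 4)          # box coordinates (x, y, w, h)

class_scores = cls_head(feat)          # 1 x 1000
box = box_head(feat)                   # 1 x 4

target_class = torch.tensor([283])                    # e.g. a "cat" class index (illustrative)
target_box = torch.tensor([[0.4, 0.3, 0.2, 0.5]])     # correct box (x', y', w', h')

softmax_loss = nn.CrossEntropyLoss()(class_scores, target_class)
l2_loss = nn.MSELoss()(box, target_box)

loss = softmax_loss + l2_loss          # multi-task loss; the relative weight is a hyperparameter
loss.backward()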
But object detection needs a different number of outputs per image: one image might contain DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) — 12 numbers — while another might contain DUCK: (x, y, w, h), DUCK: (x, y, w, h), … and many more.
Detection as classification: apply a CNN to many different crops of the image and classify each crop — Dog? Cat? Background? — so an empty crop gets Background? YES, while crops containing objects get Dog? YES or Cat? YES.
R-CNN: start from the input image; get ~2k Regions of Interest (RoI) from a proposal method; warp each image region to a fixed size (224x224 pixels); forward each warped region through a ConvNet (ImageNet-pretrained); classify the regions with SVMs.
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; reproduced with permission.
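A sketch of the per-region computation (the proposal method itself, e.g. selective search, is external and not shown; the backbone, box format, and classifier line are simplified assumptions — R-CNN proper scores the ConvNet features with per-class SVMs):

import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet18(weights=None)   # stand-in for the ImageNet-pretrained ConvNet
backbone.eval()

image = torch.randn(3, 480, 640)                        # input image, 3 x H x W
proposals = [(10, 20, 200, 180), (300, 50, 150, 220)]   # (x, y, w, h) boxes from a proposal method (~2k in practice)

with torch.no_grad():
    for (x, y, w, h) in proposals:
        region = image[:, y:y + h, x:x + w]                     # crop the region
        warped = F.interpolate(region[None], size=(224, 224),   # warp to 224 x 224
                               mode='bilinear', align_corners=False)
        feats = backbone(warped)                                # forward each region through the ConvNet
        # feats would then be scored (per-class SVMs in R-CNN) and box corrections regressed

Every proposal triggers a full forward pass, which is what makes this pipeline slow.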
“Slow” R-CNN runs the ConvNet separately on every warped region. Fast R-CNN instead runs the whole input image through the ConvNet once to get “conv5” features, then crops and processes per-region features from that shared feature map.
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; reproduced with permission.
Fast R-CNN crops region features from the image features (C x H x W, e.g. 512 x 20 x 15 for a 3 x 640 x 480 input) with RoI pooling: each proposal is projected onto the feature map and pooled down to a fixed-size grid of region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7). Region features are always the same size even if the input regions have different sizes! Each region’s features then go through a small per-region network.
Girshick, “Fast R-CNN”, ICCV 2015.
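torchvision ships an RoI pooling op that reproduces this “same-size features from different-size regions” behavior; a small sketch using the slide’s example sizes (the box coordinates are made up, and spatial_scale is assumed to be 1/32 to map 640 x 480 image coordinates onto the 20 x 15 feature map):

import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 512, 15, 20)        # image features: 512 channels over a 20 x 15 grid

# Boxes in image coordinates: (batch_index, x1, y1, x2, y2); very different sizes
boxes = torch.tensor([[0.,  20.,  30., 220., 200.],
                      [0., 300., 100., 620., 460.]])

region_feats = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1.0 / 32)
print(region_feats.shape)    # torch.Size([2, 512, 7, 7]) -- same size for every region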
Problem: the cropped region features are slightly misaligned with the original proposals, because the boxes get snapped to the coarse feature-map grid.
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
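The slides only state the problem here; as a hedged aside, the aligned variant introduced with Mask R-CNN (which appears below) avoids the snapping by bilinearly sampling at exact sub-pixel locations, and torchvision exposes it as roi_align:

import torch
from torchvision.ops import roi_align

features = torch.randn(1, 512, 15, 20)
boxes = torch.tensor([[0., 20., 30., 220., 200.]])

# Box coordinates are not snapped to the feature grid; values are bilinearly
# interpolated at the exact sampled locations.
region_feats = roi_align(features, boxes, output_size=(7, 7),
                         spatial_scale=1.0 / 32, aligned=True)
print(region_feats.shape)    # torch.Size([1, 512, 7, 7])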
Problem: Fast R-CNN’s runtime is dominated by computing the region proposals!
Faster R-CNN: make the CNN do the proposals. A Region Proposal Network (RPN) operates on the image features from the backbone CNN (e.g. 512 x 20 x 15 for a 3 x 640 x 480 input image). Imagine an anchor box at each point of the feature map: a convolutional layer predicts whether each anchor contains an object (“anchor has an object?”, 1 x 20 x 15) and box corrections from the anchor to the object (4 x 20 x 15). In practice, use K anchor boxes of different sizes and aspect ratios at each point, giving K x 20 x 15 objectness scores and 4K x 20 x 15 box transforms.
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
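A sketch of the RPN prediction head (the value of K, the channel width, and the 3 x 3 conv followed by two 1 x 1 sibling heads are the usual Faster R-CNN choices, but treat the exact sizes here as illustrative):

import torch
import torch.nn as nn

K = 9                                     # anchors per feature-map position
feat = torch.randn(1, 512, 15, 20)        # image features over the 20 x 15 grid

rpn_conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)
obj_head = nn.Conv2d(512, K, kernel_size=1)        # "anchor has an object?" scores: K x 20 x 15
box_head = nn.Conv2d(512, 4 * K, kernel_size=1)    # anchor-to-object box transforms: 4K x 20 x 15

h = torch.relu(rpn_conv(feat))
objectness = obj_head(h)      # 1 x K x 15 x 20
transforms = box_head(h)      # 1 x 4K x 15 x 20
print(objectness.shape, transforms.shape)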
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017
Mask R-CNN: in addition to boxes and class scores, predict a C x 28 x 28 mask for each region (a 28 x 28 segmentation mask per category), giving instance segmentation.
He et al, “Mask R-CNN”, arXiv 2017
Detectron2 (PyTorch)
https://github.com/facebookresearch/detectron2
Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...
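Detectron2 is the reference PyTorch implementation linked above; as a smaller, hedged alternative sketch, torchvision also bundles a pretrained Mask R-CNN that runs in a few lines (this is torchvision’s detection API, not Detectron2’s; the weights argument follows newer torchvision versions):

import torch
import torchvision

# Pretrained Mask R-CNN (ResNet-50 FPN backbone) from torchvision's detection module
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)          # RGB image with values in [0, 1]
with torch.no_grad():
    outputs = model([image])             # list with one dict per input image

out = outputs[0]
print(out["boxes"].shape)    # detected boxes, (num_detections, 4)
print(out["labels"].shape)   # class labels
print(out["scores"].shape)   # confidence scores
print(out["masks"].shape)    # per-instance masks, (num_detections, 1, 480, 640)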
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
Krishna, Zhu, Groth, Johnson, Hata, Kravitz, Chen et al, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”, IJCV 2017
Xu, Zhu, Choy, and Fei-Fei, “Scene Graph Generation by Iterative Message Passing”, CVPR 2017
Figure copyright IEEE, 2018. Reproduced for educational purposes.
3D Object Detection: predict a 3D oriented bounding box (x, y, z, w, h, l, roll, pitch, yaw) instead of a 2D box, e.g. by extending a Faster R-CNN-style detector.