Frustum PointNets for 3D Object Detection from RGB-D Data
Charles R. Qi1∗  Wei Liu2  Chenxia Wu2  Hao Su3  Leonidas J. Guibas1
1 Stanford University   2 Nuro, Inc.   3 UC San Diego
3D bounding box regression using two variants of PointNet. The segmentation network predicts the 3D mask of the object of interest (i.e. instance segmentation), and the regression network estimates the amodal 3D bounding box (covering the entire object even if only part of it is visible).

In contrast to previous work that treats RGB-D data as 2D maps for CNNs, our method is more 3D-centric as we lift depth maps to 3D point clouds and process them using 3D tools. This 3D-centric view enables new capabilities for exploring 3D data in a more effective manner. First, in our pipeline, a few transformations are applied successively on 3D coordinates, which align point clouds into a sequence of more constrained and canonical frames. These alignments factor out pose variations in the data and thus make 3D geometry patterns more evident, easing the job of the 3D learners. Second, learning in 3D space can better exploit the geometric and topological structure of 3D space. In principle, all objects live in 3D space; therefore, we believe that many geometric structures, such as repetition, planarity, and symmetry, are more naturally parameterized and captured by learners that directly operate in 3D space. The usefulness of this 3D-centric network design philosophy has been supported by much recent experimental evidence.

Our method achieves leading positions on the KITTI 3D object detection [1] and bird's eye view detection [2] benchmarks. Compared with the previous state of the art [5], our method is 8.04% better on 3D car AP with high efficiency (running at 5 fps). Our method also fits well to indoor RGB-D data, where we achieve 8.9% and 6.4% better 3D mAP than [13] and [24] on SUN-RGBD while running one to three orders of magnitude faster.

The key contributions of our work are as follows:

• We propose a novel framework for RGB-D data based 3D object detection called Frustum PointNets.

• We show how we can train 3D object detectors under our framework and achieve state-of-the-art performance on standard 3D object detection benchmarks.

• We provide extensive quantitative evaluations to validate our design choices as well as rich qualitative results for understanding the strengths and limitations of our method.

2. Related Work

3D Object Detection from RGB-D Data  Researchers have approached the 3D detection problem with various ways of representing RGB-D data.

Front view image based methods: [3, 19, 34] take monocular RGB images and use shape priors or occlusion patterns to infer 3D bounding boxes. [15, 6] represent depth data as 2D maps and apply CNNs to localize objects in the 2D image. In comparison, we represent depth as a point cloud and use advanced 3D deep networks (PointNets) that can exploit 3D geometry more effectively.

Bird's eye view based methods: MV3D [5] projects the LiDAR point cloud to a bird's eye view and trains a region proposal network (RPN [23]) for 3D bounding box proposal. However, the method lags behind in detecting small objects such as pedestrians and cyclists, and cannot easily adapt to scenes with multiple objects in the vertical direction.

3D based methods: [31, 28] train 3D object classifiers with SVMs on hand-designed geometry features extracted from the point cloud and then localize objects using sliding-window search. [7] extends [31] by replacing the SVM with a 3D CNN on voxelized 3D grids. [24] designs new geometric features for 3D object detection in a point cloud. [29, 14] convert a point cloud of the entire scene into a volumetric grid and use a 3D volumetric CNN for object proposal and classification. The computation cost of those methods is usually quite high due to the expensive 3D convolutions and the large 3D search space. Recently, [13] proposed a 2D-driven 3D object detection method that is similar to ours in spirit. However, they use hand-crafted features (based on histograms of point coordinates) with simple fully connected networks to regress 3D box location and pose, which is sub-optimal in both speed and performance. In contrast, we propose a more flexible and effective solution with deep 3D feature learning (PointNets).

Deep Learning on Point Clouds  Most existing works convert point clouds to images or volumetric forms before feature learning. [33, 18, 21] voxelize point clouds into volumetric grids and generalize image CNNs to 3D CNNs. [16, 25, 32, 7] design more efficient 3D CNN or neural network architectures that exploit sparsity in the point cloud. However, these CNN-based methods still require quantization of point clouds at a certain voxel resolution. Recently, a few works [20, 22] proposed a novel type of network architecture (PointNets) that directly consumes raw point clouds without converting them to other formats. While PointNets have been applied to single object classification and semantic segmentation, our work explores how to extend the architecture for the purpose of 3D object detection.

3. Problem Definition

Given RGB-D data as input, our goal is to classify and localize objects in 3D space. The depth data, obtained from LiDAR or indoor depth sensors, is represented as a point cloud in RGB camera coordinates. The projection matrix is also known, so we can get a 3D frustum from a 2D image region. Each object is represented by a class (one among k predefined classes) and an amodal 3D bounding box. The amodal box bounds the complete object even if part of the object is occluded or truncated.
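To make the detection target concrete, here is a minimal Python sketch of the output described above. The class and field names (Detection3D, AmodalBox3D) are ours and not from the paper, and the heading-only orientation follows the simplification the paper states later in this section.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AmodalBox3D:
    """Amodal oriented 3D bounding box (field names are assumptions)."""
    center: np.ndarray   # (cx, cy, cz) in camera coordinates, meters
    size: np.ndarray     # (h, w, l) in meters
    heading: float       # heading angle theta around the up-axis, radians

@dataclass
class Detection3D:
    """One detected object: a class among the k predefined classes and its amodal box."""
    class_id: int
    box: AmodalBox3D
```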
Figure 2. Frustum PointNets for 3D object detection. We first leverage a 2D CNN object detector to propose 2D regions and classify their content. The 2D regions are then lifted to 3D and thus become frustum proposals. Given a point cloud in a frustum (n × c with n points and c channels of XYZ, intensity, etc. for each point), the object instance is segmented by binary classification of each point. Based on the segmented object point cloud (m × c), a light-weight regression PointNet (T-Net) tries to align the points by translation such that their centroid is close to the amodal box center. Finally, the box estimation network estimates the amodal 3D bounding box for the object. The coordinate systems involved and the network inputs and outputs are further illustrated in Fig. 4 and Fig. 5.
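As a reading aid for Fig. 2, the following is a minimal Python sketch of the data flow only; every callable passed in (detect_2d, extract_frustum_points, segment_instance, tnet_center, estimate_box) is a hypothetical placeholder for the 2D detector and the three PointNets, not an implementation of them.

```python
import numpy as np

def frustum_pointnet_pipeline(rgb_image, point_cloud, calib,
                              detect_2d, extract_frustum_points,
                              segment_instance, tnet_center, estimate_box):
    """Sketch of the Fig. 2 pipeline; the five callables are placeholders
    for the 2D detector and the three PointNets (segmentation, T-Net, box)."""
    detections_3d = []
    for box2d, class_id, one_hot in detect_2d(rgb_image):
        # Lift the 2D box to a frustum and collect the n x c points inside it.
        frustum_pts = extract_frustum_points(point_cloud, box2d, calib)   # (n, c)
        # Binary per-point classification -> m x c segmented object points.
        mask = segment_instance(frustum_pts, one_hot)                     # (n,) bool
        object_pts = frustum_pts[mask]                                    # (m, c)
        # T-Net predicts a translation so the centroid moves near the box center.
        delta_center = tnet_center(object_pts, one_hot)                   # (3,)
        # Amodal box estimation in the translated (object) coordinate;
        # only the XYZ channels are shown here for brevity.
        box3d = estimate_box(object_pts[:, :3] - delta_center, one_hot)
        detections_3d.append((class_id, box3d))
    return detections_3d
```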
The 3D box is parameterized by its size (h, w, l), center (cx, cy, cz), and orientation (θ, φ, ψ) relative to a predefined canonical pose for each category. In our implementation, we only consider the heading angle θ around the up-axis for orientation.

4. 3D Detection with Frustum PointNets

As shown in Fig. 2, our system for 3D object detection consists of three modules: frustum proposal, 3D instance segmentation, and 3D amodal bounding box estimation. We introduce each module in the following subsections, focusing on the pipeline and functionality of each module, and refer readers to the supplementary for the specific architectures of the deep networks involved.

4.1. Frustum Proposal

The resolution of data produced by most 3D sensors, especially real-time depth sensors, is still lower than that of RGB images from commodity cameras. Therefore, we leverage a mature 2D object detector to propose 2D object regions in RGB images as well as to classify objects.

With a known camera projection matrix, a 2D bounding box can be lifted to a frustum (with near and far planes specified by the depth sensor range) that defines a 3D search space for the object. We then collect all points within the frustum to form a frustum point cloud. As shown in Fig. 4 (a), frustums may orient towards many different directions, which results in large variation in the placement of point clouds. We therefore normalize the frustums by rotating them toward a center view such that the center axis of the frustum is orthogonal to the image plane. This normalization helps improve the rotation-invariance of the algorithm. We call this entire procedure for extracting frustum point clouds from RGB-D data frustum proposal generation.

While our 3D detection framework is agnostic to the exact method for 2D region proposal, we adopt a FPN [17] based model. We pre-train the model weights on ImageNet classification and COCO object detection datasets and further fine-tune it on a KITTI 2D object detection dataset to classify and predict amodal 2D boxes. More details of the 2D detector training are provided in the supplementary.
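The frustum normalization just described can be sketched in a few lines of numpy (this is our illustration, not the authors' code): the 2D box center is back-projected through an assumed pinhole intrinsic matrix K to get the frustum's center viewing ray, and all points are rotated about the camera's vertical axis so that this ray maps onto the +Z axis.

```python
import numpy as np

def normalize_frustum(points_xyz, box2d, K):
    """Rotate frustum points toward the center view (Sec. 4.1 normalization).

    points_xyz : (n, 3) points in camera coordinates (x right, y down, z forward)
    box2d      : (xmin, ymin, xmax, ymax) 2D detection box in pixels
    K          : 3x3 camera intrinsic matrix (simple pinhole model assumed)
    """
    # Viewing ray through the 2D box center (depth is irrelevant for the angle).
    u = 0.5 * (box2d[0] + box2d[2])
    v = 0.5 * (box2d[1] + box2d[3])
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Angle between the ray and the camera Z axis, measured in the X-Z plane.
    angle = np.arctan2(ray[0], ray[2])
    # Rotate about the vertical (Y) axis so the frustum center axis maps to +Z.
    c, s = np.cos(-angle), np.sin(-angle)
    rot_y = np.array([[ c, 0, s],
                      [ 0, 1, 0],
                      [-s, 0, c]])
    return points_xyz @ rot_y.T
```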
4.2. 3D Instance Segmentation

Given a 2D image region (and its corresponding 3D frustum), several methods might be used to obtain the 3D location of the object. One straightforward solution is to directly regress 3D object locations (e.g., a 3D bounding box) from a depth map using 2D CNNs. However, this problem is not easy, as occluding objects and background clutter are common in natural scenes (as in Fig. 3) and may severely distract the 3D localization task. Because objects are naturally separated in physical space, segmentation in a 3D point cloud is much more natural and easier than in images, where pixels from distant objects can be near-by to each other. Having observed this fact, we propose to segment the object instance in the 3D point cloud.

Figure 3. Challenges for 3D detection in frustum point cloud. Left: RGB image with an image region proposal for a person. Right: bird's eye view of the LiDAR points in the extruded frustum from the 2D box, where we see a wide spread of points with both a foreground occluder (bikes) and background clutter (building).
[Fig. 5 diagram: 3D Instance Segmentation PointNet (set abstraction + feature propagation layers; n×c input, n×1 output), T-Net (set abstraction layers + FCs; m×c input, 3 outputs), Amodal 3D Box Estimation PointNet (set abstraction layers + FCs; m×c input, 3+4×NS+2×NH outputs).]
Figure 4. Coordinate systems for point cloud. Artificial points (black dots) are shown to illustrate (a) default camera coordinate; (b) frustum coordinate after rotating frustums to center view (Sec. 4.1); (c) mask coordinate with object points' centroid at origin (Sec. 4.2); (d) object coordinate predicted by T-Net (Sec. 4.3).

Figure 5. Basic architectures and IO for PointNets. Architecture is illustrated for PointNet++ [22] (v2) models with set abstraction layers and feature propagation layers (for segmentation). Coordinate systems involved are visualized in Fig. 4.
3D Instance Segmentation PointNet.  The network takes a point cloud in a frustum and predicts a probability score for each point that indicates how likely the point belongs to the object of interest. Note that each frustum contains exactly one object of interest. Here, those "other" points could be points of non-relevant areas (such as ground or vegetation) or other instances that occlude or are behind the object of interest. Similar to the case in 2D instance segmentation, depending on the position of the frustum, object points in one frustum may become cluttered or occlude points in another. Therefore, our segmentation PointNet is learning the occlusion and clutter patterns as well as recognizing the geometry of objects of a certain category.

In a multi-class detection case, we also leverage the semantics from the 2D detector for better instance segmentation. For example, if we know the object of interest is a pedestrian, then the segmentation network can use this prior to find geometries that look like a person. Specifically, in our architecture we encode the semantic category as a one-hot class vector (k dimensional for the k pre-defined categories) and concatenate the one-hot vector to the intermediate point cloud features. More details of the specific architectures are described in the supplementary.

After 3D instance segmentation, points that are classified as the object of interest are extracted ("masking" in Fig. 2).
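The two mechanisms described above, concatenating a k-dimensional one-hot class vector to intermediate point features and extracting the predicted object points into a centroid-centered mask coordinate, can be sketched as follows. Function names, the two-way logit layout, and the numpy setting are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def concat_class_vector(point_features, class_id, num_classes):
    """Tile a k-dim one-hot class vector onto every point's intermediate feature,
    mirroring the concatenation described for the multi-class segmentation network."""
    n = point_features.shape[0]
    one_hot = np.zeros(num_classes, dtype=point_features.dtype)
    one_hot[class_id] = 1.0
    return np.concatenate([point_features, np.tile(one_hot, (n, 1))], axis=1)

def mask_and_centralize(frustum_pts, seg_logits):
    """'Masking' step (Fig. 2) followed by the mask-coordinate transform (Fig. 4 (c)):
    keep points classified as the object and move their XYZ centroid to the origin."""
    is_object = seg_logits[:, 1] > seg_logits[:, 0]   # binary per-point classification
    object_pts = frustum_pts[is_object].copy()        # (m, c)
    centroid = object_pts[:, :3].mean(axis=0)
    object_pts[:, :3] -= centroid
    return object_pts, centroid
```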
4.3. Amodal 3D Box Estimation

Given the segmented object points (in the 3D mask coordinate), this module estimates the object's amodal oriented 3D bounding box by using a box regression PointNet together with a preprocessing transformer network.

Learning-based 3D Alignment by T-Net  Even though we have aligned segmented object points according to their centroid position, we find that the origin of the mask coordinate frame (Fig. 4 (c)) may still be quite far from the amodal box center. We therefore propose to use a light-weight regression PointNet (T-Net) to estimate the true center of the complete object and then transform the coordinates such that the predicted center becomes the origin (Fig. 4 (d)).

The architecture and training of our T-Net are similar to the T-Net in [20], which can be thought of as a special type of spatial transformer network (STN) [12]. However, different from the original STN, which has no direct supervision on the transformation, we explicitly supervise our translation network to predict center residuals from the mask coordinate origin to the real object center.
Amodal 3D Box Estimation PointNet  The box estimation network predicts amodal bounding boxes (for the entire object even if part of it is unseen) for objects, given an object point cloud in the 3D object coordinate (Fig. 4 (d)). The network architecture is similar to that for object classification [20, 22]; however, the output is no longer object class scores but parameters for a 3D bounding box.

As stated in Sec. 3, we parameterize a 3D bounding box by its center (cx, cy, cz), size (h, w, l) and heading angle θ (along the up-axis). We take a "residual" approach for box center estimation. The center residual predicted by the box estimation network is combined with the previous center residual from the T-Net and the masked points' centroid to recover an absolute center (Eq. 1). For box size and heading angle, we follow previous works [23, 19] and use a hybrid of classification and regression formulations. Specifically, we pre-define NS size templates and NH equally split angle bins. Our model will both classify size/heading (NS scores for size, NH scores for heading) into those pre-defined categories and predict residual numbers for each category (3 × NS residual dimensions for height, width and length, NH residual angles for heading). In the end the network outputs 3 + 4 × NS + 2 × NH numbers in total.

$C_{pred} = C_{mask} + \Delta C_{t\text{-}net} + \Delta C_{box\text{-}net}$   (1)

4.4. Training with Multi-task Losses

We simultaneously optimize the three nets involved (3D instance segmentation PointNet, T-Net and amodal box estimation PointNet) with multi-task losses (as in Eq. 2). $L_{c1\text{-}reg}$ is for the T-Net and $L_{c2\text{-}reg}$ is for the center regression of the box estimation network. $L_{h\text{-}cls}$ and $L_{h\text{-}reg}$ are losses for heading angle prediction, while $L_{s\text{-}cls}$ and $L_{s\text{-}reg}$ are for box size. Softmax is used for all classification tasks and a smooth-l1 (huber) loss is used for all regression cases.

$L_{multi\text{-}task} = L_{seg} + \lambda (L_{c1\text{-}reg} + L_{c2\text{-}reg} + L_{h\text{-}cls} + L_{h\text{-}reg} + L_{s\text{-}cls} + L_{s\text{-}reg} + \gamma L_{corner})$   (2)

Corner Loss for Joint Optimization of Box Parameters  While our 3D bounding box parameterization is compact and complete, learning is not optimized for final 3D box accuracy: center, size and heading have separate loss terms. Imagine cases where center and size are accurately predicted but the heading angle is off; the 3D IoU with the ground truth box will then be dominated by the angle error. Ideally all three terms (center, size, heading) should be jointly optimized for the best 3D box estimation (under the IoU metric). To resolve this problem we propose a novel regularization loss, the corner loss:

$L_{corner} = \sum_{i=1}^{NS} \sum_{j=1}^{NH} \delta_{ij} \min\Big\{ \sum_{k=1}^{8} \|P_k^{ij} - P_k^{*}\|,\ \sum_{k=1}^{8} \|P_k^{ij} - P_k^{**}\| \Big\}$   (3)

In essence, the corner loss is the sum of the distances between the eight corners of a predicted box and a ground truth box. Since corner positions are jointly determined by center, size and heading, the corner loss is able to regularize the multi-task training for those parameters.

To compute the corner loss, we first construct NS × NH "anchor" boxes from all size templates and heading angle bins. The anchor boxes are then translated to the estimated box center. We denote the anchor box corners as $P_k^{ij}$, where i, j, k are indices for the size class, heading class, and (predefined) corner order, respectively. To avoid a large penalty from flipped heading estimations, we further compute distances to corners ($P_k^{**}$) of the flipped ground truth box and use the minimum of the original and flipped cases. $\delta_{ij}$, which is one for the ground truth size/heading class and zero otherwise, is a two-dimensional mask used to select the distance term we care about.
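Below is a numpy sketch of Eq. (1) and Eq. (3) to make the corner computation concrete. The corner ordering, the up-axis convention, and all function names are our choices for illustration, not the paper's reference implementation.

```python
import numpy as np

def recover_center(mask_centroid, delta_tnet, delta_boxnet):
    """Eq. (1): C_pred = C_mask + dC_T-Net + dC_box-net."""
    return mask_centroid + delta_tnet + delta_boxnet

def box_corners(center, size, heading):
    """Eight corners of a box with size (h, w, l) and heading about the up (y) axis.
    The corner order and axis convention are assumptions of this sketch."""
    h, w, l = size
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0   # length along x
    y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0   # height along y
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0   # width along z
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[ c, 0, s],
                    [ 0, 1, 0],
                    [-s, 0, c]])
    return (rot @ np.stack([x, y, z])).T + center            # (8, 3)

def corner_loss(pred_center, size_templates, heading_bins,
                gt_size_cls, gt_head_cls, gt_center, gt_size, gt_heading):
    """Eq. (3): corner distances between NS x NH anchor boxes (translated to the
    predicted center) and the ground truth box, taking the minimum over the
    flipped-heading ground truth; delta_ij keeps only the ground-truth class term."""
    gt_corners = box_corners(gt_center, gt_size, gt_heading)                 # P*
    gt_corners_flip = box_corners(gt_center, gt_size, gt_heading + np.pi)    # P**
    loss = 0.0
    for i, size in enumerate(size_templates):           # i over NS size templates
        for j, heading in enumerate(heading_bins):       # j over NH heading bins
            if i != gt_size_cls or j != gt_head_cls:     # delta_ij = 0
                continue
            anchor = box_corners(pred_center, size, heading)                 # P^{ij}
            dist = np.linalg.norm(anchor - gt_corners, axis=1).sum()
            dist_flip = np.linalg.norm(anchor - gt_corners_flip, axis=1).sum()
            loss += min(dist, dist_flip)
    return loss
```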
5. Experiments

Experiments are divided into three parts.1 First we compare with state-of-the-art methods for 3D object detection on KITTI [8] and SUN-RGBD [27] (Sec. 5.1). Second, we provide in-depth analysis to validate our design choices (Sec. 5.2). Last, we show qualitative results and discuss the strengths and limitations of our method (Sec. 5.3).

1 Details on network architectures and training parameters, as well as more experiments, are included in the supplementary material.

5.1. Comparing with state-of-the-art Methods

We evaluate our 3D object detector on the KITTI [9] and SUN-RGBD [27] benchmarks for 3D object detection. On both tasks we have achieved significantly better results compared with state-of-the-art methods.

KITTI  Tab. 1 shows the performance of our 3D detector on the KITTI test set. We outperform previous state-of-the-art methods by a large margin. While MV3D [5] uses multi-view feature aggregation and a sophisticated multi-sensor fusion strategy, our method based on the PointNet [20] (v1) and PointNet++ [22] (v2) backbones is much cleaner in design. While out of the scope of this work, we expect that sensor fusion (especially aggregation of image features for 3D detection) could further improve our results.

We also show our method's performance on 3D object localization (bird's eye view) in Tab. 2. In the 3D localization task, bounding boxes are projected to the bird's eye view plane and IoU is evaluated on oriented 2D boxes. Again, our method significantly outperforms previous works, which include DoBEM [35] and MV3D [5] that use CNNs on projected LiDAR images, as well as 3D FCN [14] that uses 3D CNNs on voxelized point clouds.
Method          Cars (Easy / Moderate / Hard)    Pedestrians (Easy / Moderate / Hard)    Cyclists (Easy / Moderate / Hard)
DoBEM [35] 7.42 6.95 13.45 - - - - - -
MV3D [5] 71.09 62.35 55.12 - - - - - -
Ours (v1) 80.62 64.70 56.07 50.88 41.55 38.04 69.36 53.50 52.88
Ours (v2) 81.20 70.39 62.19 51.21 44.89 40.23 71.96 56.77 50.39
Table 1. 3D object detection 3D AP on KITTI test set. DoBEM [35] and MV3D [5] (previous state of the art) are based on 2D CNNs with
bird’s eye view LiDAR image. Our method, without sensor fusion or multi-view aggregation, outperforms those methods by large margins
on all categories and data subsets. 3D bounding box IoU threshold is 70% for cars and 50% for pedestrians and cyclists.
Figure 6. Visualizations of Frustum PointNet results on KITTI val set (best viewed in color with zoom in). These results are based
on PointNet++ models [22], running at 5 fps and achieving test set 3D AP of 70.39, 44.89 and 56.77 for car, pedestrian and cyclist,
respectively. 3D instance masks on point cloud are shown in color. True positive detection boxes are in green, while false positive boxes
are in red and groundtruth boxes in blue are shown for false positive and false negative cases. Digit and letter beside each box denote
instance id and semantic class, with “v” for cars, “p” for pedestrian and “c” for cyclist. See Sec. 5.3 for more discussion on the results.
Method          bathtub  bed  bookshelf  chair  desk  dresser  nightstand  sofa  table  toilet  Runtime  mAP
DSS [29] 44.2 78.8 11.9 61.2 20.5 6.4 15.4 53.5 50.3 78.9 19.55s 42.1
COG [24] 58.3 63.7 31.8 62.2 45.2 15.5 27.4 51.0 51.3 70.1 10-30min 47.6
2D-driven [13] 43.5 64.5 31.4 48.3 27.9 25.9 41.9 50.4 37.0 80.4 4.15s 45.1
Ours (v1) 43.3 81.1 33.3 64.2 24.7 32.0 58.1 61.1 51.1 90.9 0.12s 54.0
Table 6. 3D object detection AP on SUN-RGBD val set. The evaluation metric is average precision with a 3D IoU threshold of 0.25, as proposed by [27]. Note that both COG [24] and 2D-driven [13] use room layout context to boost performance while ours and DSS [29] do not. Compared with the previous state of the art, our method is 6.4% to 11.9% better in mAP as well as one to three orders of magnitude faster.
Comparing with alternative approaches for 3D detection.  In this part we evaluate a few CNN-based baseline approaches as well as ablated versions and variants of our pipeline using 2D masks. In the first two rows of Tab. 7, we show 3D box estimation results from two CNN-based networks. The baseline methods train VGG [26] models on ground truth boxes of RGB-D images and adopt the same box parameterization and loss functions as our main method. While the model in the first row directly estimates box location and parameters from a vanilla RGB-D image patch, the other one (second row) uses a FCN trained on the COCO dataset for 2D mask estimation (as in Mask-RCNN [11]) and only uses features from the masked region for prediction. The depth values are also translated by subtracting the median depth within the 2D mask. However, both CNN baselines get far worse results compared to our main method.

To understand why the CNN baselines underperform, we visualize a typical 2D mask prediction in Fig. 7. While the estimated 2D mask appears to be of high quality on the RGB image, there are still lots of clutter and foreground points in the 2D mask. In comparison, our 3D instance segmentation gets a much cleaner result, which greatly eases the next module in finer localization and bounding box regression.

In the third row of Tab. 7, we experiment with an ablated version of the frustum PointNet that has no 3D instance segmentation module. Not surprisingly, the model gets much worse results than our main method, which indicates the critical effect of our 3D instance segmentation module. In the fourth row, instead of 3D segmentation we use point clouds from 2D masked depth maps (Fig. 7) for 3D box estimation. However, since a 2D mask is not able to cleanly segment the 3D object, the performance is more than 12% worse than that with the 3D segmentation (our main method in the fifth row). On the other hand, a combined usage of 2D and 3D masks, i.e. applying 3D segmentation on the point cloud popped up from the 2D masked depth map, also shows slightly worse results than our main method, probably due to the accumulated error from inaccurate 2D mask predictions.
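The "popping up" of a 2D masked depth map into a point cloud, used by the 2D-mask variants in Tab. 7, can be sketched as below. The pinhole back-projection with an intrinsic matrix K and applying the median-depth subtraction to the z coordinate are our assumptions for illustration.

```python
import numpy as np

def popup_masked_depth(depth_map, mask_2d, K):
    """Back-project pixels inside a 2D mask into a 3D point cloud ("popping up"
    the masked depth map), assuming a simple pinhole camera with intrinsics K.
    The median-depth subtraction mirrors the normalization described for the
    2D-mask baselines; applying it to z here is an assumption of this sketch."""
    vs, us = np.nonzero(mask_2d)                    # pixel coordinates inside the mask
    zs = depth_map[vs, us]
    valid = zs > 0                                  # drop pixels with no depth
    us, vs, zs = us[valid], vs[valid], zs[valid]
    # Pinhole back-projection: X = z * K^-1 [u, v, 1]^T
    pix = np.stack([us, vs, np.ones_like(us)], axis=0).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix) * zs             # (3, n)
    pts = pts.T                                     # (n, 3) in camera coordinates
    # Translate depth by subtracting the median depth within the 2D mask.
    pts[:, 2] -= np.median(zs)
    return pts
```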
network arch.   mask    depth representation   accuracy
ConvNet         -       image                  18.3 (baseline)
ConvNet         2D      image                  27.4
PointNet        -       point cloud            33.5
PointNet        2D      point cloud            61.6
PointNet        3D      point cloud            74.3
PointNet        2D+3D   point cloud            70.0

Table 7. Comparing 2D and 3D approaches. 2D mask is from FCN on RGB image patch. 3D mask is from PointNet on frustum point cloud. 2D+3D mask is 3D mask generated by PointNet on point cloud popped up from 2D masked depth map.

Figure 7. Comparisons between 2D and 3D masks. We show a typical 2D region proposal from KITTI val set with both 2D (on RGB image) and 3D (on frustum point cloud) instance segmentation results. The red numbers denote depth ranges of points. (Panel titles from the figure: RGB; 2D mask by CNN; points from masked 2D depth map; points from our 3D instance segmentation. Depth ranges shown: 8m ~ 55m, 9m ~ 55m, 12m ~ 16m.)
frustum rot.   mask centralize   t-net   accuracy
-              -                 -       12.5
√              -                 -       48.1
-              √                 -       64.6
√              √                 -       71.5
√              √                 √       74.3

Table 8. Effects of point cloud normalization. Metric is 3D box estimation accuracy with IoU=0.7.

loss type              regularization   accuracy
regression only        -                62.9
cls-reg                -                71.8
cls-reg (normalized)   -                72.2
cls-reg (normalized)   corner loss      74.3

Table 9. Effects of 3D box loss formulations. Metric is 3D box estimation accuracy with IoU=0.7.
Effects of point cloud normalization.  As shown in Fig. 4, our frustum PointNet takes a few key coordinate transformations to canonicalize the point cloud for more effective learning. Tab. 8 shows how each normalization step helps 3D detection. We see that both frustum rotation (such that frustum points have more similar XYZ distributions) and mask centroid subtraction (such that object points have smaller and more canonical XYZ) are critical. In addition, the extra alignment of the object point cloud to the object center by T-Net also contributes significantly to the performance.

Effects of regression loss formulation and corner loss.  In Tab. 9 we compare different loss options and show that a combination of the "cls-reg" loss (the classification and residual regression approach for heading and size) and a regularizing corner loss achieves the best result.

The naive baseline using a regression loss only (first row) achieves an unsatisfactory result because the regression target has a large range (object sizes from 0.2m to 5m). In comparison, the cls-reg loss and its normalized version (residuals normalized by heading bin size or template shape size) achieve much better performance. In the last row we show that a regularizing corner loss further helps optimization.
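To illustrate what the normalized cls-reg outputs in Tab. 9 decode to, here is a small sketch. The assumption that heading bins uniformly cover 2π and that residuals are scaled by the bin size or the template dimensions follows the description above, while the bin-origin convention and the variable names are ours.

```python
import numpy as np

def decode_heading_and_size(head_scores, head_residual_norm,
                            size_scores, size_residual_norm, size_templates):
    """Decode the hybrid classification + normalized-residual outputs:
    pick the best-scoring heading bin and size template, then undo the
    normalization (bin size for heading, template dimensions for size)."""
    num_bins = head_scores.shape[0]
    bin_size = 2.0 * np.pi / num_bins
    h = int(np.argmax(head_scores))
    # Bin-origin convention (bin h starts at h * bin_size) is an assumption.
    heading = h * bin_size + head_residual_norm[h] * bin_size
    s = int(np.argmax(size_scores))
    size = np.asarray(size_templates[s]) * (1.0 + size_residual_norm[s])
    return heading, size
```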
5.3. Qualitative Results and Discussion

In Fig. 6 we visualize representative outputs of our frustum PointNet model. We see that for simple cases of non-occluded objects at reasonable distance (so we get a sufficient number of points), our model outputs remarkably accurate 3D instance segmentation masks and 3D bounding boxes. Second, we are surprised to find that our model can even predict a correctly posed amodal 3D box from partial data (e.g. parallel parked cars) with few points. Even humans find it very difficult to annotate such results with point cloud data only. Third, in some cases that seem very challenging in images, with lots of nearby or even overlapping 2D boxes, the localization becomes much easier when converted to 3D space (e.g. P11 in second row, third column).

On the other hand, we do observe several failure patterns, which indicate possible directions for future efforts. The first common mistake is due to inaccurate pose and size estimation in a sparse point cloud (sometimes less than 5 points). We think image features could greatly help, especially since we have access to a high resolution image patch even for far-away objects. The second type of challenge is when there are multiple instances from the same category in a frustum (like two persons standing by). Since our current pipeline assumes a single object of interest in each frustum, it may get confused when multiple instances appear and thus output mixed segmentation results. This problem could potentially be mitigated if we are able to propose multiple 3D bounding boxes within each frustum. Thirdly, sometimes our 2D detector misses objects due to dark lighting or strong occlusion. Since our frustum proposals are based on region proposals, no 3D object will be detected given no 2D detection. However, our 3D instance segmentation and amodal 3D box estimation PointNets are not restricted to RGB view proposals. As shown in the supplementary, the same framework can also be extended to 3D regions proposed in bird's eye view.

Acknowledgement  The authors wish to thank the support of Nuro Inc., ONR MURI grant N00014-13-1-0341, NSF grants DMS-1546206 and IIS-1528025, a Samsung GRO award, and gifts from Adobe, Amazon, and Apple.
References

[1] KITTI 3d object detection benchmark leader board. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. Accessed: 2017-11-14 12PM.
[2] KITTI bird's eye view object detection benchmark leader board. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=bev. Accessed: 2017-11-14 12PM.
[3] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2156, 2016.
[4] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015.
[5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In IEEE CVPR, 2017.
[6] Z. Deng and L. J. Latecki. Amodal detection of 3d objects: Inferring 3d bounding boxes from 2d ones in rgb-depth images. In Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[7] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1355–1361. IEEE, 2017.
[8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[9] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
[12] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[13] J. Lahoud and B. Ghanem. 2d-driven 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2017.
[14] B. Li. 3d fully convolutional network for vehicle detection in point cloud. arXiv preprint arXiv:1611.08069, 2016.
[15] B. Li, T. Zhang, and T. Xia. Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916, 2016.
[16] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas. Fpnn: Field probing neural networks for 3d data. arXiv preprint arXiv:1605.06240, 2016.
[17] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016.
[18] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2015.
[19] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka. 3d bounding box estimation using deep learning and geometry. arXiv preprint arXiv:1612.00496, 2016.
[20] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[21] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
[22] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[24] Z. Ren and E. B. Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1525–1533, 2016.
[25] G. Riegler, A. O. Ulusoys, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. arXiv preprint arXiv:1611.05009, 2016.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[28] S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In Computer Vision–ECCV 2014, pages 634–651. Springer, 2014.
[29] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
[30] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proc. ICCV, 2015.
[31] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, 1317, 2015.
[32] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
[33] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
[34] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903–1911, 2015.
[35] S.-L. Yu, T. Westfechtel, R. Hamada, K. Ohno, and S. Tadokoro. Vehicle detection and localization on birds eye view elevation images using convolutional neural network. 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR), 2017.