
3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results

Jiaojiao Fang, Lingtao Zhou, Guizhong Liu
School of Electronic and Information Engineering, Xi'an Jiaotong University, China
[email protected], [email protected], [email protected]

Abstract—3D object detection is one of the most important tasks in the 3D visual perception system of autonomous vehicles. In this paper, we propose a novel two-stage 3D object detection method that seeks the optimal object location in 3D space by regressing two additional 3D object properties with a deep convolutional neural network and combining them with cascaded geometric constraints between the 2D and 3D boxes. First, we modify an existing 3D property regression network by adding two components: viewpoint classification and regression of the center projection of the 3D bounding box's bottom face. Second, we use the predicted center projection together with a similar-triangle constraint to obtain an initial 3D bounding box in closed form. The location predicted in this step is then used as the initial value for the over-determined equations constructed from the 2D-3D box fitting constraint, with the correspondence configuration determined by the classified viewpoint. Finally, we use the physical-world information recovered by the 3D detections to filter out false detections and false alarms in the 2D detections. Comparisons with the state of the art on the KITTI dataset show that, although conceptually simple, our method outperforms more complex and computationally expensive methods, improving not only the overall precision of the 3D detections but also the orientation estimation precision. Furthermore, our method can handle truncated objects to some extent and removes false alarms and false detections in both the 2D and 3D results.

Index Terms—3D Object Detection, Autonomous Driving, Deep Learning, Viewpoints Classification, Geometry Constraints

Fig. 1. Examples of fallacious detection results for the car category in the KITTI dataset produced by the 2D detector used in our paper. There are many false alarms among the 2D detections, especially in the upper-left image. Such wrong detections are harmful for two-stage monocular 3D object detection methods, since their performance strongly depends on the 2D image understanding.

I. INTRODUCTION

3D object detection is the task of recovering the 6 Degree-of-Freedom (DoF) pose and the dimensions of objects of interest in the physical world, which are defined by 3D bounding boxes. It is one of the most important tasks in 3D scene understanding and provides indispensable information for intelligent perceptual agents, such as autonomous driving vehicles, to perceive and interact with the real world. The existing methods fall mainly into three categories according to the data they use: LiDAR, stereo images, and monocular images. Although monocular image based methods are the most disadvantaged because depth information is lost when a 3D scene is projected onto the image plane, they have still received extensive attention due to their low-cost device requirements and wide range of application scenarios.

While deep learning based 2D object detection algorithms [2, 3, 4, 16] have achieved great advances and gained robustness to challenges such as occlusion, viewpoint variance, and illumination variance, 3D object detection from purely monocular images remains an under-constrained problem of back-projecting objects on the image plane into the 3D physical world. It is nevertheless made feasible by deep learning [9] and by geometric relations between the 2D and 3D spaces [5, 8]. With data-driven CNN models, we can learn the empirical relationship between objects' appearance and their 3D properties under scenario-specific priors. In deep learning based 3D object detection, dimension and orientation estimation are relatively simple, as they are strongly related to an object's appearance and can be estimated easily. But it is impractical to regress the location directly from a single image patch, since the depth information is hard to recover from appearance alone. Existing location estimation methods therefore try to predict the object's location through the geometric constraint between the 2D and 3D bounding boxes, converting location estimation into an appearance-unrelated problem.
Fig. 2. The overall pipeline of our proposed monocular 3D object detection method, which requires only a single RGB image as input and achieves 3D perception of the objects in the scene. A 2D detector produces 2D boxes, the cropped 224*224 patches are fed to a deep CNN that regresses dimensions, orientation, viewpoint and the CBF, and the two key modules then follow: 1) cascaded geometric constraints (the similar-triangle constraint on the CBF gives an initial [X, Y, Z], and the 2D-3D box fitting constraint gives the final location), and 2) filtration of wrong 2D detections based on the 3D results.

However, this conversion has two significant drawbacks: 1) it cannot benefit from the large amount of labeled data in the training set; and 2) with the geometric constraint it is hard to select the best fitting configuration and to obtain the optimal solution of the resulting over-determined equations. Our previous work converts the problem of selecting the 2D-3D box fitting configuration into an appearance-related viewpoint classification problem, which yields a more accurate configuration selection. Moreover, all the existing methods neglect to further exploit the physical-world information recovered by the 3D detections in order to better understand the 2D image.

In this paper, we introduce a novel deep learning based, monocular, two-stage 3D object detection method that uses cascaded geometric constraints for location estimation. Firstly, in contrast to most current networks that regress only the dimensions and orientation of 3D objects, we modify the existing multi-branch deep CNN by adding two branches, viewpoint classification and regression of the center projection of the 3D bounding box's bottom face (CBF), to obtain two additional stable, appearance-related object properties for 3D location estimation. Secondly, we use the estimated CBF combined with the similar-triangle constraint between the heights of the 2D and 3D boxes to solve for an initial 3D location in closed form. If the object is truncated by the image plane, this solution is used as the final 3D location. Otherwise, we solve for the final 3D location by cascading another constraint: the solved location is used as the initial value of the over-determined equations constructed from the 2D-3D box fitting constraint, with the configuration obtained from the viewpoint output of the network.

Most of the prior knowledge about the physical world is lost when a scene is projected onto the image plane, so it is hard to identify false alarms and false detections of a 2D detector without 3D spatial information. With the help of the 3D detection results, however, we can recover the 3D spatial priors to a certain extent, which helps us identify obviously unreasonable 2D detection results, both false detections and false alarms. Thus we can improve the reliability of the 2D and 3D detections at the same time. Fig. 1 gives an illustration of the false alarms and false detections in our 2D detections.

The main contributions of this paper are three-fold:

1) We propose a novel, purely monocular image based 3D object detection method that adds two appearance-related 3D properties to the existing multi-branch CNN for more stable 3D properties and builds cascaded geometric constraints. We can thus simultaneously estimate dimensions, orientation, the CBF and the viewpoint class, and by jointly training these closely related tasks we improve their performance at the same time.

2) We use the regressed CBF with the similar-triangle constraint to solve for an initial location of the 3D object, which serves as the initial value for solving the over-determined equations obtained from the 2D-3D bounding box fitting constraints with the estimated viewpoint, yielding more accurate 3D bounding boxes. Our method can also deal with some truncated objects by using only the initial value as the 3D location.

3) We use prior information about the physical world, recovered from the 3D detections together with knowledge of the image acquisition device and the acquisition environment, to decide whether a detection is false. Thus we obtain more reliable 2D and 3D detections.

Experiments on the KITTI dataset demonstrate the improvement of our work on orientation estimation and overall detection precision compared with current state-of-the-art methods that use only a single RGB image. The qualitative results show that our method can deal with some truncated objects and filter out wrong detections in both the 2D and 3D results.

II. RELATED WORK

Recently, deep convolutional neural network (CNN) based 2D object detection has achieved remarkable performance improvements. Consequently, the more meaningful task of 3D object detection, i.e., estimating the 3D bounding boxes of objects, has drawn more and more attention.
A large number of methods have been proposed to address 3D object detection using data collected from autonomous driving scenarios, such as the KITTI dataset. The existing 3D object detection methods fall mainly into the following three categories.
A. LiDAR-based 3D Object Detection

One kind of 3D detection method uses LiDAR data as additional information to obtain more accurate 3D detection results. [11] proposes a method that uses LiDAR data to create multiple front views and a bird's eye view (BEV) of the scene and feeds them, together with the RGB image frame, into a two-stage detection network. After proposals are obtained from each view, a multi-stage feature fusion network is deployed to obtain the 3D properties and classes of the objects. In [12], a 3D anchor grid is used to generate object candidates. After the CNN features of each candidate, extracted from both the image and the LiDAR-based BEV map, are fused and scored, the top-scored proposals are classified and dimension-refined by the second stage of the network. This kind of method yields much better 3D AP scores thanks to the known depth information, but the cost in money and computing resources, as well as the power consumption of the devices capturing depth information, is not bearable in some circumstances. Moreover, these methods are unlike the way humans perceive the 3D physical world, in which no explicit depth information is acquired either.

B. Stereo Images based 3D Object Detection

Stereo vision is also used in some 3D detection algorithms because it mimics human binocular vision. Building on Mono3D, proposed by the same group, stereo images are used in [13] to obtain better 3D proposals in the physical world. The stereo-based HHA feature [14], which encodes the depth information of the scene, is also used as an additional input stream to obtain better 3D bounding box regression. [15] proposes a stereo-extended Faster R-CNN detection method in which region proposals are generated on both the left and right images of a stereo pair through an RPN and their results are associated. After keypoints, viewpoints and object dimensions are estimated from the stereo proposals, a refinement procedure based on region-level photometric alignment is applied to obtain better detections.

C. Monocular Image based 3D Object Detection

Taking advantage of prior information from the dataset, [5] employs a 3D sliding window and a sophisticated proposal method that considers image context, shape, location, and segmentation features to generate 3D proposals from scenes. A Fast R-CNN based detector is then fed with the image patches corresponding to the 3D candidates to classify them and regress the bounding boxes and orientations of the objects. [6, 7] separately detect objects that fall into different sub-categories in the physical world; the sub-categories are defined by objects' shape, viewpoint and occlusion patterns and are obtained by clustering 3D CAD object models. [10] proposes a method called Deep MANTA, which deploys a Faster R-CNN style model to detect objects and their parts in images from coarse to fine. 3D CAD models are also used to establish a library of templates that are matched against the model's output, so that the orientations and 3D positions of the objects can be inferred. [8] proposes a 2D-3D object detection framework that regresses objects' orientation and dimensions from image patches containing the objects, using the 2D bounding boxes produced by an efficient 2D detector; 2D-3D box geometric constraints are then used to calculate the 3D positions of the objects. Although the procedure of finding eligible constraint configurations can be parallelized, it still incurs unnecessary computation and time consumption, which limits the applicability of this method. [18] and [19] use a geometric constraint to obtain an initial 3D bounding box and refine it with a complex post-processing step.

III. 3D BOUNDING BOX ESTIMATION USING DEEP LEARNING AND GEOMETRY CONSTRAINT

In this paper, we propose a novel two-stage 3D object detection method based on an advanced 2D detector, reliable 3D object properties and geometric constraints between the 2D and 3D boxes to recover complete 3D bounding boxes. In contrast to existing methods that regress only the orientation and dimensions of the 3D object, our method uses a multi-branch deep CNN to regress two additional components, which are used to solve for the location of the 3D object through cascaded geometric constraints. At the 2D detection stage, an advanced 2D detector determines the sizes and positions of the objects on the image plane. The image patches cropped according to the 2D detections are then fed into a multi-branch CNN to respectively infer: 1) the dimensions; 2) the orientation, i.e., the local angle, which is regressed and used to calculate the global angle (only the yaw angle needs to be specified in the autonomous driving scenarios of the KITTI dataset); 3) the viewpoint; and 4) the center projection of the bottom face. We use the estimated dimensions and orientation, together with the other two properties and the constraints between the 2D and 3D boxes, to build two sets of linear equations that are used to recover the location of the object. Thus we can solve for a more precise 3D location by cascaded geometric constraints. Once the 3D bounding boxes are obtained, we use several physical properties to decide whether they are reasonable results or not. The overall pipeline of our method is shown in Fig. 2.
Fig. 4. Architecture of our multi-task network, which consists of five branches that respectively compute the dimension residuals, the angle residuals, the confidence of each bin, the viewpoint classification and the center projection of the bottom face.

A. Regressing Two Additional Appearance-related 3D Properties for 3D Location Estimation

In most of the existing two-stage 3D object detection methods, only the dimensions and orientation of the 3D objects are regressed, but dimensions and orientation alone are not enough to determine the location of a 3D box. There are several other 3D properties, tied strongly to the visual appearance, that can be estimated to further constrain the final 3D box estimation.
Inspired by the position estimation used in CNN based 2D detectors supervised by labelled ground truth, we add a task of center projection regression to the existing dimensions and orientation regression network. The ground-truth position of the projection is computed by projecting the 3D center of each object's bottom face onto the image plane with the camera intrinsic matrix, and this is used for supervision. The prediction can then be used in the classical pinhole camera model, which provides a similar-triangle constraint between the sizes of the 2D and 3D boxes. Given the 2D projection [u_c, v_c] of a 3D box's location [X, Y, Z] and the 2D box [u_min, u_max, v_min, v_max], we let the network regress the offset between [u_c, v_c] and [(u_min + u_max)/2, v_max] for more reliable results.
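As a concrete illustration of how such a regression target can be built from KITTI-style annotations, the following sketch (our own illustration under stated assumptions, not the authors' released code; the helper name and argument layout are hypothetical) projects the 3D bottom-face center with the camera intrinsics and expresses it as an offset from the reference point [(u_min + u_max)/2, v_max] of the 2D box:

```python
import numpy as np

def cbf_regression_target(K, center_bottom_3d, box2d):
    """Build the training target for the CBF branch (illustrative sketch).

    K                : 3x3 camera intrinsic matrix.
    center_bottom_3d : (X, Y, Z) of the 3D box's bottom-face center in camera coords.
    box2d            : (u_min, v_min, u_max, v_max) of the 2D detection box.
    Returns the offset (du, dv) that the network is asked to regress.
    """
    X, Y, Z = center_bottom_3d
    # Project the bottom-face center onto the image plane.
    p = K @ np.array([X, Y, Z])
    uc, vc = p[0] / p[2], p[1] / p[2]

    # Reference point of the 2D box: middle of its bottom edge.
    u_min, v_min, u_max, v_max = box2d
    ref_u, ref_v = (u_min + u_max) / 2.0, v_max

    # Offset target relative to that reference point.
    return np.array([uc - ref_u, vc - ref_v])
```

At test time the predicted offset would be added back to the same reference point of the detected 2D box to recover [u_c, v_c].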
reliable results. 𝐿 = 𝑤1 𝐿𝑑𝑖𝑚𝑠 + 𝑤2 𝐿𝑎𝑛𝑔 + 𝑤3 𝐿𝑐𝑜𝑛𝑓 + 𝑤4 𝐿𝑣𝑖𝑒𝑤
Another task called viewpoints classification is also added (3)
+ 𝑤5 𝐿𝑐𝑏𝑓
into this deep CNN. The 3D location estimation method in [8]
is not only concept simple but also effective by using the where 𝐿𝑑𝑖𝑚𝑠 , 𝐿𝑎𝑛𝑔 , 𝐿𝑐𝑜𝑛𝑓 and 𝐿𝑣𝑖𝑒𝑤 denote the losses for
geometric constraint between the 2D-3D bounding boxes dimension regression, angle bias regression, confidence of bins,
fitting. Our previous work using the viewpoint classification viewpoints classification and the center projection of the 3D
task to further improving its performance by roughly dividing box’s bottom face respectively. w1:5 denote the combination
64 kinds of correspondence configurations into 16 categories weighting factors of each loss.
according to which surfaces of the 3D objects are projected to B. 2D-3D Boxes Fitting Constraints with Viewpoints
image plane. We also adopt this method to determine the Classification
configuration by classifying viewpoints from where the object
The fundamental idea of computing 3D location of object by
is observed. By adding these two tasks, we can transfer the
the 2D-3D boxes fitting constraint comes from the consistency
location estimation to an appearance related problem. Thus the
of 2D and 3D bounding boxes. Specifically, the projected
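The sketch below illustrates one way to recover the global yaw from the regressed local orientation using relation (1). It is an assumption-laden illustration rather than the paper's code: the ray angle is obtained here by back-projecting the object's image position with K, so the exact sign and axis conventions are assumptions.

```python
import numpy as np

def global_orientation(theta_local, uc, vc, K):
    """Recover the global yaw from the regressed local orientation (Eq. (1)).

    theta_local : local (observation) angle predicted by the network, in radians.
    (uc, vc)    : image position of the object (e.g. the predicted CBF).
    K           : 3x3 camera intrinsic matrix.
    """
    # Direction of the camera ray through (uc, vc); its angle w.r.t. the
    # principal axis [0, 0, 1]^T gives theta_ray (axis convention assumed).
    ray = np.linalg.inv(K) @ np.array([uc, vc, 1.0])
    theta_ray = np.arctan2(ray[0], ray[2])

    # Eq. (1): theta = 2*pi - theta_ray - theta_l, wrapped into [-pi, pi).
    theta = 2.0 * np.pi - theta_ray - theta_local
    return (theta + np.pi) % (2.0 * np.pi) - np.pi
```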
A MultiBin method [8], which decomposes the continuous orientation angle into discrete bins and regresses the residual within each bin, is used in our paper. Dimension offsets (Δd_x, Δd_y, Δd_z), i.e., the residuals of an object's dimensions with respect to its category's average dimensions, are predicted to account for the different dimension distributions of different object classes. Using discrete bins and predicting offsets from the mean sizes facilitates orientation and dimension learning by restricting the values to a smaller range. To avoid regressing a periodic angle value θ_l directly, an L2-normalization layer is added at the end of the orientation branch to produce the sine and cosine of the predicted residual angle, and its loss is defined by the cosine similarity between the prediction and the true residual angle:

L_ang = 1 - (1/n_θ) Σ_i cos(θ* - c_i - Δθ_i)    (2)

where θ* denotes the ground-truth local orientation, c_i denotes the center of the i-th bin that the ground truth falls into, Δθ_i is the predicted residual angle, and n_θ is the number of overlapping bins covering the ground truth. The loss uses the cosine function to ensure that the offset Δθ_i is well regressed. Residual regression significantly reduces the range of a continuous variable and improves the estimation precision.

Thus the overall loss function L can be written as

L = w_1 L_dims + w_2 L_ang + w_3 L_conf + w_4 L_view + w_5 L_cbf    (3)

where L_dims, L_ang, L_conf, L_view and L_cbf denote the losses for dimension regression, angle residual regression, bin confidence, viewpoint classification and the center projection of the 3D box's bottom face, respectively, and w_1:5 are the weighting factors of the individual losses.
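A minimal PyTorch-style sketch of the MultiBin angle loss of Eq. (2) is given below. It is our own illustration under stated assumptions (the orientation branch outputs one L2-normalized (sin, cos) pair per bin, `bin_centers` holds the c_i, and `covered` marks the bins overlapping the ground truth), not the authors' implementation.

```python
import torch

def multibin_angle_loss(sin_cos_pred, bin_centers, theta_gt, covered):
    """Eq. (2): L_ang = 1 - (1/n_theta) * sum_i cos(theta* - c_i - delta_theta_i).

    sin_cos_pred : (B, n_bins, 2) L2-normalized (sin, cos) of the residual per bin.
    bin_centers  : (n_bins,) centers c_i of the discrete orientation bins.
    theta_gt     : (B,) ground-truth local orientation theta*.
    covered      : (B, n_bins) boolean mask of bins overlapping the ground truth.
    """
    diff = theta_gt.unsqueeze(1) - bin_centers.unsqueeze(0)            # theta* - c_i
    # cos(theta* - c_i - d_i) = cos(theta* - c_i) cos d_i + sin(theta* - c_i) sin d_i
    cos_term = torch.cos(diff) * sin_cos_pred[..., 1] \
             + torch.sin(diff) * sin_cos_pred[..., 0]
    n_theta = covered.float().sum(dim=1).clamp(min=1.0)                # overlapped bins
    per_sample = 1.0 - (cos_term * covered.float()).sum(dim=1) / n_theta
    return per_sample.mean()
```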
B. 2D-3D Boxes Fitting Constraints with Viewpoint Classification

The fundamental idea of computing the 3D location of an object from the 2D-3D box fitting constraint comes from the consistency of the 2D and 3D bounding boxes. Specifically, the projected vertices of an object's 3D box should fit tightly into its 2D detection box; in other words, four of the eight vertices of the 3D box should be projected exactly onto the four sides of the 2D box. The left part of Fig. 5 shows an example of a car's 2D-3D box fitting from the KITTI dataset, illustrating one correspondence configuration between the 2D and 3D boxes: vertices 6, 1, 5 and 3 of the 3D box are projected onto the upper, lower, left and right sides of the 2D box respectively, and the viewpoint is from the front-right.

Fig. 5. Illustration of the second geometric constraint we use. Our previous work converts the corner selection problem (left) into a viewpoint classification problem (right) according to the consistency between them. The left part is a sample from the KITTI dataset illustrating the 2D-3D bounding box fitting, where the Arabic numbers index the vertices of the 3D box. The right part shows examples of 4 of the different observational viewpoints: from top to bottom, lateral view and front view; from left to right, looking-down view and horizontal view.

In total, the sides of the 2D bounding box provide four constraint equations on the 3D bounding box. Given the camera intrinsic matrix K, the object dimensions d = [l, h, w, 1]^T, the 2D box [u_min, u_max, v_min, v_max] and the global orientation θ, the box fitting constraints of the corresponding configuration can be formulated as

u_min = π_u(K [R_θ T] S_1 d)
u_max = π_u(K [R_θ T] S_2 d)
v_min = π_v(K [R_θ T] S_3 d)
v_max = π_v(K [R_θ T] S_4 d)    (4)

where R_θ is the rotation matrix parameterized by the orientation θ, and T = [t_x, t_y, t_z]^T = [X, Y, Z]^T denotes the translation from the origin of the camera coordinates to the center of the 3D bounding box's bottom face, which is the unknown to be solved from these equations. π_u and π_v are the functions extracting the image coordinates of the object from its homogeneous coordinates P = [p_1, p_2, p_3]^T on the image plane:

π_u(P) = p_1 / p_3,   π_v(P) = p_2 / p_3.    (5)
S_1 to S_4 are the vertex selection matrices describing, with respect to the object's dimensions, the coordinates of the four selected vertices in the object coordinate frame. These matrices vary with the constraint configuration and define the relationship between the 2D and 3D bounding boxes. For the car shown in Fig. 5, the corner selection matrices for the left, right, top and bottom edges of the 2D box are

S_1 = [ [-0.5, 0, 0, 0], [0, -1, 0, 0], [0, 0, -0.5, 0], [0, 0, 0, 1] ],
S_2 = [ [-0.5, 0, 0, 0], [0,  0, 0, 0], [0, 0,  0.5, 0], [0, 0, 0, 1] ],
S_3 = [ [ 0.5, 0, 0, 0], [0, -1, 0, 0], [0, 0, -0.5, 0], [0, 0, 0, 1] ],
S_4 = [ [ 0.5, 0, 0, 0], [0,  0, 0, 0], [0, 0, -0.5, 0], [0, 0, 0, 1] ].    (6)

We set up a CNN classification task to determine which configuration should be used to calculate the object's location according to the appearance information in the image. As shown in the right part of Fig. 5, there are 16 kinds of viewpoints in total, and each viewpoint corresponds to one set of vertex selection matrices.
The viewpoint classification branch shares the backbone network with the other tasks, which also increases the model's sensitivity to orientation; we believe that adding such a closely related task facilitates the training process. Nevertheless, this fitting constraint becomes invalid once the object is truncated by the image plane.
C. 3D Box Location Estimation by Cascaded Constraints

We use the estimated dimensions, orientation, viewpoint and center projection to predict the location of the 3D object. Our 3D location estimation approach is based on the observation that the similar-triangle constraint easily yields a 3D location of an object, but only a rough approximation of the actual location because of the perspective projection and the viewpoint; the 2D-3D box fitting constraint, on the other hand, is much more reliable, but it requires solving an over-determined system of equations for which an optimal solution is hard to obtain.
According to the predicted bottom-face center projection [u_c, v_c] and the size of the 2D box, we can build a linear system from equations (7) and (8), which relate the center points of the 3D object's bottom and top faces to their projections on the image plane:

Z [u_c, v_c, 1]^T = K [R T] [X, Y, Z, 1]^T    (7)

where [X, Y, Z] is the 3D location of the object (the bottom-face center) and [u_c, v_c] is its projection on the image plane. Approximating the projection of the 3D box's top-face center by [u_c, v_min], we obtain

Z [u_c, v_min, 1]^T = K [R T] [X, Y - h, Z, 1]^T.    (8)

From these we obtain a relation among the depth Z of the object from the image plane, the focal length f, the object height h, and the projected height v_c - v_min on the image plane via the similar-triangle constraint (v_c - v_min)/h = f/Z, as shown in Fig. 6. Thus we can obtain an initial location of the 3D object. This is only a rough solution and rarely equals the true value, because of the perspective projection, the top-face approximation and the camera viewpoint, and it is more sensitive to regression errors. It does, however, serve as a reliable initial value for the over-determined system used to solve for a more precise location in 3D space.

Fig. 6. Illustration of the relationship between the 3D bounding box's height and its projection on the image plane, which is used to calculate the initial location of the 3D box.
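The closed-form initialization described above takes only a few lines. The sketch below is our own illustration (hypothetical helper; it assumes the standard pinhole model with no distortion and reads the focal length from K): the depth follows from the similar-triangle relation (v_c - v_min)/h = f/Z, and the full location from back-projecting the CBF as in Eq. (7).

```python
import numpy as np

def initial_location(K, cbf, v_min, obj_height):
    """Closed-form initial 3D location of the bottom-face center.

    K          : 3x3 camera intrinsic matrix (f is taken from K[1, 1]).
    cbf        : (uc, vc) predicted projection of the bottom-face center.
    v_min      : top edge of the 2D box (approximates the top-face center's row).
    obj_height : regressed 3D height h of the object in meters.
    """
    uc, vc = cbf
    f = K[1, 1]
    # Similar triangles between the 3D height and its projected height.
    Z = f * obj_height / max(vc - v_min, 1e-6)
    # Back-project the CBF at depth Z: [X, Y, Z]^T = Z * K^-1 [uc, vc, 1]^T.
    return Z * (np.linalg.inv(K) @ np.array([uc, vc, 1.0]))
```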
Hence we propose a cascaded geometric constraint method to solve for the location of the 3D bounding box. If the object is truncated by the image plane, the initial location is used as the final solution. Otherwise we use the Gauss-Newton method to solve the over-determined multivariate equations constructed from equations (4), (5) and (6), with the initial value given by the previous step.
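A compact sketch of this refinement step for non-truncated objects is shown below. It is our own illustration rather than the authors' implementation: SciPy's least-squares solver is used in place of a hand-written Gauss-Newton loop, the viewpoint classifier is assumed to have already selected one correspondence configuration, and that configuration is encoded as the four corner offsets S_i d of Eq. (6).

```python
import numpy as np
from scipy.optimize import least_squares

def rot_y(theta):
    """Rotation about the camera Y axis by the global yaw theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def refine_location(K, box2d, theta, corners, T0):
    """Refine the bottom-face center T = [X, Y, Z] via the fitting constraint (Eq. (4)).

    K       : 3x3 camera intrinsics.
    box2d   : (u_min, u_max, v_min, v_max) from the 2D detector.
    theta   : global orientation from Eq. (1).
    corners : four 3D corner offsets S_i d (object frame, bottom-face origin),
              chosen by the classified viewpoint, one per side of the 2D box.
    T0      : initial location from the similar-triangle step.
    """
    u_min, u_max, v_min, v_max = box2d
    R = rot_y(theta)

    def residuals(T):
        res = []
        targets = [(u_min, 0), (u_max, 0), (v_min, 1), (v_max, 1)]
        for offset, (target, axis) in zip(corners, targets):
            p = K @ (R @ offset + T)          # project the selected corner
            res.append(p[axis] / p[2] - target)
        return np.asarray(res)

    # Over-determined: 4 residuals, 3 unknowns; solved with a trust-region routine.
    sol = least_squares(residuals, np.asarray(T0, dtype=float))
    return sol.x

# Example corner offsets for the configuration of Fig. 5, with dimensions l, h, w:
# corners = [np.array([-l/2, -h, -w/2]), np.array([-l/2, 0,  w/2]),
#            np.array([ l/2, -h, -w/2]), np.array([ l/2, 0, -w/2])]
```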
D. 2D Detection Correction Based on 3D Estimation

In autonomous driving scenarios, all the objects of concern, cars, cyclists and pedestrians, are on the ground plane. The RGB images of the KITTI dataset used in this paper were collected by a VW Passat station wagon equipped with four video cameras; all the cameras are mounted 1.65 meters above the ground, and the camera locations are used to establish the world coordinate system. Taking into account these characteristics of the autonomous driving scenario, the image acquisition device, the acquisition environment and the 3D information of the objects given by the estimated 3D bounding boxes, we propose a 2D detection depuration method based on the recovered 3D physical information of the scene to filter out false detections and false alarms in the 2D detections. A wrong detection in the 2D image can lead to a 3D detection result with four kinds of unreasonable properties: the first two concern the 3D bounding box's own attributes, and the latter two concern the relationships among the objects seen in the image. We use these properties to filter out false alarms and false detections in the 2D detections.
First, the dimensions of car category instances have low variance around the average dimensions [1.52, 1.64, 3.86]. The length, width and height of an estimated 3D bounding box should therefore stay within a certain range; a wrong 2D detection tends to produce abnormal car dimensions.

Second, for the objects seen in an image, the vertical locations lie within a certain range, because the image acquisition device sits on the road surface at a fixed height of 1.65 meters, so the vertical coordinate of a car is bounded. In [5], three heights [1.3888, 1.7231, 2.0574] are set for the object proposals. We use these values as the initial centers of a k-means clustering on the training dataset and obtain three cluster centers [1.28842871, 1.72911, 2.2184]. If an object's vertical location is far from this range, we treat it as a fallacious result. In addition, the vertical locations of adjacent objects change monotonically, owing to the straight-line propagation of light and the locally planar, gradually changing character of roads.

Third, an optical imaging system has the property that the projected size of a 3D bounding box is small at a long distance (depth) and large at a short one. Generally speaking, under the pinhole camera model the projected heights of the 3D bounding boxes change monotonically with distance: for objects of the same dimensions, the projected size decreases as the distance (also called the depth) increases, which can be written as the inequality

if Z_1 > Z_2 and h_1 = h_2 + ε, then v_max1 - v_min1 < v_max2 - v_min2    (9)

where Z_1 and Z_2 are the depths of two arbitrary objects and ε denotes a small value within a certain range. In other words, the projected height of a 3D box should change monotonically with the object's distance from the image plane.

Finally, adjacent objects tend to have similar depths. We use the IoU between two 2D detection boxes to decide whether they are adjacent, considering the boxes overlapping an object on its left side and on its right side on the image plane as its adjacent objects. If the depths of both the left and the right adjacent 2D boxes are far from that of the object itself, we treat this 2D box as a false detection.

If a detected 3D bounding box obviously violates one of these properties, especially the second one, we directly remove the 2D detection and its corresponding 3D detection. For wrong detections that are harder to distinguish, we use several properties jointly to decide which detections should be removed from the results.
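A simplified sketch of these checks is given below. It is our own illustration: the average dimensions and cluster centers are the values quoted above, while the tolerance and depth-gap thresholds are illustrative assumptions, not values reported by the paper.

```python
import numpy as np

# Statistics quoted in the text for the car category (meters).
AVG_DIMS = np.array([1.52, 1.64, 3.86])            # assumed order: h, w, l
Y_CLUSTER_CENTERS = np.array([1.28842871, 1.72911, 2.2184])

def is_reasonable_detection(dims, location, proj_height, neighbors,
                            dim_tol=0.5, y_tol=0.6, depth_gap=10.0):
    """Return False if the 3D estimate violates one of the physical-world priors.

    dims        : (h, w, l) of the estimated 3D box.
    location    : (X, Y, Z) of the bottom-face center in camera coordinates.
    proj_height : projected 2D height v_max - v_min of the box, in pixels.
    neighbors   : list of (depth, proj_height) of adjacent detections (IoU > 0).
    The *_tol / depth_gap thresholds are illustrative assumptions.
    """
    X, Y, Z = location

    # 1) Car dimensions should stay close to the category average.
    if np.any(np.abs(np.asarray(dims) - AVG_DIMS) > dim_tol):
        return False

    # 2) Vertical location should lie near the clustered ground-plane heights.
    if np.min(np.abs(Y_CLUSTER_CENTERS - Y)) > y_tol:
        return False

    # 3) Projected height must decrease with depth (Eq. (9));
    #    objects of similar physical height are assumed here.
    for nb_depth, nb_proj_h in neighbors:
        if Z > nb_depth and proj_height > nb_proj_h:
            return False

    # 4) A box whose left and right neighbors are both far away in depth
    #    is treated as a false detection.
    if len(neighbors) >= 2 and all(abs(Z - d) > depth_gap for d, _ in neighbors):
        return False

    return True
```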
IV. EXPERIMENTS

Experiments evaluating our framework are conducted on a real-world dataset, the KITTI object detection benchmark [1] of driving street scenarios, which includes both a 2D Object Detection Evaluation and a 3D Object Detection Evaluation. The dataset consists of 7481 training images and 7518 testing images, and in each image every object is annotated with its observation angle (local orientation), 2D location, dimensions, 3D location, and global orientation. However, the annotated labels are only available for the training set, so our experiments are mainly conducted on the training set.

A. Implementation Details

Our 3D property estimation network was trained and tested on the KITTI object detection dataset using the split of [6]. We used the ImageNet pre-trained VGG-19 network [20] without its FC layers as the backbone and appended our 3D box property branches to complete each task. For training, the cross-entropy loss was used for the classification tasks and the smooth L1 loss for the regression tasks, since it is less sensitive to outliers than the L2 loss. During training, each ground-truth object was cropped, resized to 224x224 and fed into the 3D property network. We filtered out heavily truncated samples from the training set to avoid potential harm to the model, and randomly applied mirroring and color distortions to the training images for data augmentation, making the network more robust to viewpoint changes and occlusions. The network was then trained with SGD at a learning rate of 0.0001 for 30 epochs with a batch size of 8 to obtain the final parameters used for validation. We set the loss weighting factors to w_1:5 = [1, 4, 8, 4, 4]. Fig. 7 shows some qualitative visualizations of our results on the KITTI validation set.

TABLE I
COMPARISON OF AVERAGE ORIENTATION SIMILARITY (AOS, %), AVERAGE PRECISION (AP, %) AND ORIENTATION SCORE (OS) ON THE OFFICIAL KITTI DATASET FOR CARS.

Method          | Easy: AOS / AP / OS      | Moderate: AOS / AP / OS  | Hard: AOS / AP / OS
Mono3D [5]      | 91.01 / 92.33 / 0.9857   | 86.62 / 88.66 / 0.9769   | 76.84 / 78.96 / 0.9731
3DOP [13]       | 91.44 / 93.04 / 0.9828   | 86.10 / 88.64 / 0.9713   | 76.52 / 79.10 / 0.9673
SubCNN [7]      | 90.67 / 90.81 / 0.9984   | 88.62 / 89.04 / 0.9952   | 78.68 / 79.27 / 0.9925
Deep3DBox [8]   | 97.50 / 97.75 / 0.9974   | 96.30 / 96.80 / 0.9948   | 80.40 / 81.06 / 0.9919
Ours            | 97.57 / 97.75 / 0.9981   | 96.50 / 96.80 / 0.9970   | 80.45 / 81.06 / 0.9925

TABLE II
COMPARISON WITH THE STATE-OF-THE-ART METHODS USING THE 3D AVERAGE PRECISION METRIC (%) ON THE KITTI DATASET FOR THE CAR CATEGORY.

Method          | Type   | Time    | AP3D (IoU=0.5): Easy / Moderate / Hard | AP3D (IoU=0.7): Easy / Moderate / Hard
3DOP [13]       | stereo | 3 s     | 46.04 / 34.63 / 30.09                  |  6.55 /  5.07 /  4.10
Mono3D [5]      | mono   | 4.2 s   | 25.19 / 18.20 / 15.52                  |  2.53 /  2.31 /  2.31
Deep3DBox [8]   | mono   | -       | 27.04 / 20.55 / 15.88                  |  5.85 /  4.10 /  3.84
FQNet [18]      | mono   | 3.3 s   | 28.98 / 20.71 / 18.59                  |  5.45 /  5.11 /  4.45
GS3D [19]       | mono   | 2.3 s   | 30.60 / 26.40 / 22.89                  | 11.63 / 10.51 / 10.51
Ours            | mono   | 0.258 s | 31.45 / 24.52 / 20.48                  | 13.79 / 11.38 / 10.95

TABLE III
COMPARISON OF AOS (%), AP (%) AND OS FOR THE PEDESTRIAN AND CYCLIST CATEGORIES ON THE KITTI VALIDATION DATASET.

Method          | Easy: AOS / AP / OS      | Moderate: AOS / AP / OS  | Hard: AOS / AP / OS
3DOP [13]       | 70.13 / 78.39 / 0.8946   | 58.68 / 68.94 / 0.8511   | 52.32 / 61.37 / 0.8523
Mono3D [5]      | 65.56 / 76.04 / 0.8621   | 54.97 / 66.36 / 0.8283   | 48.77 / 58.87 / 0.8284
SubCNN [7]      | 72.00 / 79.48 / 0.9058   | 63.65 / 71.06 / 0.8957   | 56.32 / 62.68 / 0.8985
Deep3DBox [8]   | 69.16 / 83.94 / 0.8239   | 59.87 / 74.16 / 0.8037   | 52.50 / 64.84 / 0.8096
Ours            | 75.33 / 83.94 / 0.8974   | 66.08 / 74.16 / 0.8910   | 57.42 / 64.84 / 0.8856

For fair comparison, all our experiments were based on the MS-CNN [17] 2D detector: 2D boxes were produced first, and 3D boxes were then estimated from the 2D detection boxes whose scores exceeded a threshold. For the compared methods, we used the detection results reported by their authors. Since most works only released results for cars, we evaluated our model on the KITTI dataset focusing on the car category, as most previous works did. Our experiments were conducted on an i7-6700 CPU, 16 GB RAM and an NVIDIA GTX1080Ti GPU using Python and the PyTorch [21] toolbox.

TABLE V
ABLATION STUDY OF THE 3D DETECTION RESULTS FOR THE CAR CATEGORY ON THE KITTI VALIDATION DATASET.

Method              | AP3D (IoU=0.5): Easy / Moderate / Hard | AP3D (IoU=0.7): Easy / Moderate / Hard
cbf                 | 31.47 / 23.11 / 19.17                  |  7.95 /  5.66 /  5.31
vp                  | 30.24 / 19.85 / 17.11                  |  8.19 /  5.50 /  5.10
cbf/vp              | 33.03 / 25.68 / 21.48                  |  9.69 /  7.98 /  6.56
focal loss + cbf/vp | 31.45 / 24.52 / 20.48                  | 13.79 / 11.38 / 10.95

B. 3D Bounding Box Evaluation

We compared our proposed method with six recently proposed state-of-the-art 3D object detection methods on the KITTI benchmark, including 3DOP [13], Mono3D [5], SubCNN [7], Deep3DBox [8], FQNet [18] and GS3D [19], for KITTI cars. The results are evaluated at three levels of difficulty, Easy, Moderate and Hard, defined according to the minimum bounding box height, the occlusion degree and the truncation grade. We evaluated the orientation, the dimensions and the overall quality of the 3D bounding boxes.

Training data requirements. As all the 3D properties we learn are appearance related, we can overcome the downside of Deep3DBox [8], which needs to learn the parameters of its fully connected layers and requires more training data than methods that use additional information such as segmentation or depth. By adding the two appearance-related tasks, we obtain performance competitive with [8] using less training data.
Fig. 7. Visualization of the 2D detection boxes (left, ported from [8]) and the estimated 3D box projections (right) for cars on the KITTI validation dataset, produced by our cascaded geometric constraints without the 2D detection filtration stage. The black line attached to each 3D box indicates the orientation of the object (from the center of the bottom face towards the front of the object).

TABLE IV
COMPARISON OF THE AVERAGE DIMENSION ESTIMATION ERROR WITH STATE-OF-THE-ART METHODS ON THE KITTI VALIDATION DATASET (LOWER IS BETTER).

Method          | Dimension error (m)
Mono3D [5]      | 0.4251
3DOP [13]       | 0.3527
Deep3DBox [8]   | 0.1934
FQNet [18]      | 0.1698
Our Method      | 0.1663
KITTI orientation accuracy. Average Orientation Similarity (AOS) is the official 3D orientation metric of the KITTI dataset, described in [1]; it multiplies the average precision (AP) of the 2D detector by the average cosine similarity of the azimuth orientation, and thus evaluates the performance of orientation estimation. The ratio of AOS to Average Precision (AP), called OS, is defined in [8] and indicates how well each method performs on orientation estimation alone, factoring out the 2D localization performance. AOS was adopted as an assessment criterion in [8], and our method ranks first among all non-anonymous methods for car examples on the KITTI dataset. As shown in Table I, using exactly the same 2D detector, our method outperforms the baseline and the other monocular image based 3D detection methods on orientation estimation for cars. Our method even outperforms Deep3DBox [8] for the cyclist category despite having a similar 2D AP, as shown in Table III. On the KITTI detection dataset, 2 bins achieved better performance than 8 bins in our work, since fewer bins leave more training data for each bin. We also conducted experiments with different numbers of neurons in the fully connected layers (see Table VI) and found that increasing the number of neurons in the FC layers beyond 256 yielded only limited gains.
KITTI 3D bounding box metric. The orientation estimation precision evaluates only part of the 3D bounding box's parameters. To evaluate the accuracy of the rest, we introduce further metrics on which we compare our method against FQNet [18] for KITTI cars. The first is the average error in estimating the dimensions of the 3D objects, defined as E_d = (1/N) Σ_{i=1..N} sqrt(Δw_i^2 + Δh_i^2 + Δl_i^2). For each detection, the closest ground-truth object is used to compute E_d. We compare our method only with Mono3D [5], 3DOP [13], Deep3DBox [8] and FQNet [18], which have provided their experimental results. Our results are summarized in Table IV: our method has the lowest estimation error, with an average dimension estimation error of about 0.1663 meters, which demonstrates the effectiveness of our collaborative appearance-related property regression modules.
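For reference, the dimension-error metric can be computed as in the short sketch below (our own illustration; pred_dims and gt_dims are assumed to be arrays of already matched detections and ground truths):

```python
import numpy as np

def dimension_error(pred_dims, gt_dims):
    """E_d = (1/N) * sum_i sqrt(dw_i^2 + dh_i^2 + dl_i^2), dims given as (w, h, l)."""
    diff = np.asarray(pred_dims) - np.asarray(gt_dims)   # shape (N, 3)
    return float(np.mean(np.linalg.norm(diff, axis=1)))
```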
We further evaluate the 3D object detection performance with the 3D AP metric, the official KITTI metric used to assess the overall quality of 3D bounding box estimation.
Fig. 8. Examples of obviously abnormal detections that can easily be removed using a single physical-world prior.

TABLE VI
THE EFFECT OF THE NUMBER OF NEURONS IN THE FC LAYERS ON ORIENTATION ESTIMATION.

FC neurons | 64      | 128     | 256     | 512    | 1024
OS         | 0.98472 | 0.98553 | 0.99946 | 0.9996 | 0.9995
3D IoU thresholds of both 0.7 and 0.5 are used to determine whether a detection is successful. As indicated in Table II, our method performs well compared with the most closely related monocular image based methods [8, 18] and even ranks first among purely monocular methods at the IoU threshold of 0.7. From Table II we can see that our method outperforms Mono3D [5] and Deep3DBox [8] by a significant margin of about 7%, and even outperforms the stereo-based 3DOP when the 3D IoU threshold is set to 0.7. Since 3DOP [13] is a stereo-based method that can obtain depth information directly, its performance is much better than that of purely monocular methods at the IoU threshold of 0.5. The inference time is also shown in the table and demonstrates the efficiency of our method: it does not rely on computing additional features such as stereo, semantic or instance segmentation or depth estimation, and it does not need complex post-processing as in [18] and [19]. The ablation study of the contributions of vp, cbf and the cascaded constraint cbf/vp is shown in Table V; when we additionally use the focal loss for category classification, we obtain even better performance.

As image-based 3D detection methods have many drawbacks in spatial localization, the 3D AP score is relatively lower than what 2D detectors obtain on the corresponding 2D metric. This is because 3D estimation is a more challenging task, especially as the distance to the object increases. For example, if a car is 50 m away from the camera, a translation error of 2 m, about half a car length, is hard to notice on the image plane. Our method handles increasing distance well, as shown in Fig. 7. The evaluation shows that regressing the CBF and the viewpoint makes a difference in all the 3D metrics. All the quantitative results in this paper are obtained by the cascaded geometric constraints alone, without the filtering procedure, since the effect of the filtering procedure can be observed directly in the images.

Qualitative Results: We draw the projected 3D detection boxes on the 2D image plane for better visualization. Fig. 7 shows examples of qualitative detection results obtained by the cascaded geometric constraint method, without filtering wrong 2D detections, on scenes of the KITTI dataset. The results in Fig. 7 show that our approach can deal with truncated objects, detect the 3D objects well, and achieve high-precision 3D perception in autonomous driving scenarios with only one monocular image as input. If a 2D bounding box is closer than 10 pixels to the image boundary, we treat the object as truncated.

C. Filtering 2D Detections Based on 3D Box Estimation

With the high-precision 3D detections produced by our cascaded geometric constraints, we can filter out wrong 2D detections with high reliability. About 1% of the 2D detections are obviously wrong and, for the car category in autonomous driving scenarios, are easily identified by checking whether their vertical location is normal; Fig. 8 shows some examples of such easily identified false alarms.

Since the projected size decreases approximately linearly with distance and nearby objects have similar depths, these cues can be used to filter out roughly a further 5% of more complicated wrong detections. From Fig. 9 we can see that, by combining the 3D results with the physical information, more complex false alarms and false detections can also be removed.

Once the wrong 2D detections are eliminated, their corresponding 3D detections are also discarded.

Fig. 9. Examples of more complex abnormal detections that can be removed by combining several physical priors. The left two images contain a false detection in the lower left that is hard to distinguish; we jointly use the depth, location and dimensions to find it.

V. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a novel method that uses a deep CNN to regress two additional appearance-related 3D properties, the viewpoint class and the center projection of the 3D box's bottom face, and uses these two properties to construct a cascaded geometric constraint model that solves for a more precise 3D location. We then use the recovered 3D physical-world information to further depurate the 2D detections.

Experiments demonstrate that our cascaded geometric constraint method not only consumes less time and fewer computational resources than the baseline algorithm, which gives it broader applicability, but also deals well with objects truncated by the image plane. With the post-processing steps, we can remove most false alarms and false detections from the 2D and 3D results. Although our method achieves better performance, it still depends heavily on the 2D detection performance, a problem we hope to address by making the method less sensitive to the 2D detections. We also plan to extend our monocular 3D object detection method to monocular 3D object tracking in the future.

REFERENCES
[1] A. Geiger, P. Lenz and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, 2012, pp. 3354-3361.

[2] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017.

[3] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 779-788.

[4] T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár, "Focal Loss for Dense Object Detection," 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017, pp. 2999-3007.

[5] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler and R. Urtasun, "Monocular 3D Object Detection for Autonomous Driving," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 2147-2156.

[6] Y. Xiang, W. Choi, Y. Lin and S. Savarese, "Data-driven 3D Voxel Patterns for Object Category Recognition," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1903-1911.

[7] Y. Xiang, W. Choi, Y. Lin and S. Savarese, "Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection," 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, 2017, pp. 924-933.

[8] A. Mousavian, D. Anguelov, J. Flynn and J. Košecká, "3D Bounding Box Estimation Using Deep Learning and Geometry," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 5632-5640.

[9] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[10] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière and T. Chateau, "Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 1827-1836.

[11] X. Chen, H. Ma, J. Wan, B. Li and T. Xia, "Multi-view 3D Object Detection Network for Autonomous Driving," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 6526-6534.

[12] J. Ku, M. Mozifian, J. Lee, A. Harakeh and S. L. Waslander, "Joint 3D Proposal Generation and Object Detection from View Aggregation," 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018, pp. 1-8.

[13] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler and R. Urtasun, "3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 5, pp. 1259-1272, May 2018.

[14] S. Gupta, R. Girshick, P. Arbeláez and J. Malik, "Learning Rich Features from RGB-D Images for Object Detection and Segmentation," European Conference on Computer Vision (ECCV), 2014, pp. 345-360.

[15] P. Li, X. Chen and S. Shen, "Stereo R-CNN based 3D Object Detection for Autonomous Driving," arXiv preprint arXiv:1902.09738, 2019.

[16] J. Dai, Y. Li, K. He and J. Sun, "R-FCN: Object Detection via Region-based Fully Convolutional Networks," Advances in Neural Information Processing Systems (NIPS), 2016, pp. 379-387.

[17] Z. Cai, Q. Fan, R. S. Feris and N. Vasconcelos, "A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection," European Conference on Computer Vision (ECCV), 2016.

[18] L. Liu, J. Lu, C. Xu, et al., "Deep Fitting Degree Scoring Network for Monocular 3D Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1057-1066.

[19] B. Li, W. Ouyang, L. Sheng, et al., "GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1019-1028.

[20] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," International Conference on Learning Representations (ICLR), 2015.

[21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer, "Automatic differentiation in PyTorch," NIPS Workshop, 2017.
