

GLENet: Boosting 3D Object Detectors with Generative Label Uncertainty Estimation
Yifan Zhang · Qijian Zhang · Zhiyu Zhu · Junhui Hou · Yixuan Yuan


Yifan Zhang, Qijian Zhang, Zhiyu Zhu, and Junhui Hou
Department of Computer Science, City University of Hong Kong.
E-mail: {yzhang3362-c, qijizhang3-c, zhiyuzhu2-c}@my.cityu.edu.hk; [email protected]

Yixuan Yuan
Department of Electronic Engineering, The Chinese University of Hong Kong.
E-mail: [email protected]

This project was supported by the Hong Kong Research Grants Council under Grants 11202320 and 11218121. Corresponding author: Junhui Hou.

Abstract  The inherent ambiguity in ground-truth annotations of 3D bounding boxes, caused by occlusions, signal missing, or manual annotation errors, can confuse deep 3D object detectors during training, thus deteriorating detection accuracy. However, existing methods overlook such issues to some extent and treat the labels as deterministic. In this paper, we formulate the label uncertainty problem as the diversity of potentially plausible bounding boxes of objects. Then, we propose GLENet, a generative framework adapted from conditional variational autoencoders, to model the one-to-many relationship between a typical 3D object and its potential ground-truth bounding boxes with latent variables. The label uncertainty generated by GLENet is a plug-and-play module and can be conveniently integrated into existing deep 3D detectors to build probabilistic detectors and supervise the learning of the localization uncertainty. Besides, we propose an uncertainty-aware quality estimator architecture in probabilistic detectors to guide the training of the IoU-branch with predicted localization uncertainty. We incorporate the proposed methods into various popular base 3D detectors and demonstrate significant and consistent performance gains on both KITTI and Waymo benchmark datasets. Especially, the proposed GLENet-VR outperforms all published LiDAR-based approaches by a large margin and achieves the top rank among single-modal methods on the challenging KITTI test set. The source code and pre-trained models are publicly available at https://github.com/Eaphan/GLENet.

Keywords  3D object detection · label uncertainty · conditional variational autoencoders · probabilistic object detection · 3D point cloud

1 Introduction

As one of the most practical application scenarios of computer vision, 3D object detection has been attracting much academic and industrial attention in the current deep learning era with the rise of autonomous driving and the emergence of large-scale annotated datasets (e.g., KITTI (Geiger et al., 2012) and Waymo (Sun et al., 2020)).

In the current community, despite the proliferation of various deep learning-based 3D detection pipelines, it is observed that mainstream 3D object detectors are typically designed as deterministic models, without considering the critical issue of the ambiguity of annotated ground-truth labels. However, different aspects of ambiguity/inaccuracy inevitably exist in the ground-truth annotations of object-level bounding boxes, which may significantly influence the overall learning process of such deterministic detectors. For example, in the data collection phase, raw point clouds can be highly incomplete due to the intrinsic properties of LiDAR sensors as well as uncontrollable environmental occlusion. Moreover, in the data labeling phase, ambiguity naturally occurs when different human annotators subjectively estimate object shapes and locations from 2D images and partial 3D points. To facilitate intuitive understanding, we provide typical examples in Fig. 1, from which we can observe that an incomplete LiDAR observation can correspond to multiple potentially plausible labels, and objects with similar LiDAR observations can be annotated with significantly varying bounding boxes.

Fig. 1: (a) Given an object with an incomplete LiDAR observation, there may exist multiple potentially plausible ground-truth
bounding boxes with varying sizes and shapes. (b) Ambiguity and inaccuracy can be inevitable in the labeling process when
annotations are derived from 2D images and partial points. In the given cases, similar point clouds of the car category with
only the rear part can be annotated with different ground-truth boxes of varying lengths.

Fig. 2: Illustration of two different learning paradigms of probabilistic object detectors. (a) Methods that adopt probabilistic
modeling in the detection head but essentially still ignore the issue of ambiguity in ground-truth bounding boxes. (b) Methods
that explicitly estimate ground-truth bounding box distributions to be used as more reliable supervision signals.

Motivated by the aforementioned phenomena, there also exists another family of probabilistic detectors that explicitly consider the potential influence of label ambiguity. Conclusively, these methods can be categorized into two paradigms, as illustrated in Fig. 2. The first paradigm of learning frameworks (He et al., 2019; Meyer et al., 2019; Feng et al., 2018, 2019) tends to output the probabilistic distribution of bounding boxes instead of directly regressing definite box coordinates in a deterministic fashion. For example, under the pre-assumption of a Gaussian distribution, the detection head predicts the mean and variance of the distribution accordingly. To supervise such probabilistic models, these works simply treat ground-truth bounding boxes as the Dirac delta distribution, after which KL divergence is applied between the estimated distributions and ground truths. Obviously, the major limitation of these methods lies in that they fail to essentially address the problem of label ambiguity, since the ground-truth bounding boxes are still considered deterministic with zero uncertainty (i.e., modeled as a Dirac delta function). To this end, the second paradigm of learning frameworks attempts to quantify label uncertainty derived from some simple heuristics (Meyer and Thakurdesai (2020)) or Bayes (Wang et al. (2020)), such that the detectors can be supervised under a more reliable bounding box distribution. However, it is unsurprising that these approaches still cannot produce satisfactory label uncertainty estimation results due to insufficient modeling capacity. In general, this line of work is still in its initial stage with a very limited number of studies, despite its greater potential in generating higher-quality label uncertainty estimation in a data-driven manner.

Architecturally, this work follows the second type of design philosophy, where we particularly customize a powerful deep learning-based label uncertainty quantification framework to enhance the reliability of the estimated ground-truth bounding box distributions. Technically, we formulate the label uncertainty problem as the diversity of potentially plausible bounding boxes and explicitly model the one-to-many relationship between a typical 3D object and its potentially

Fig. 3: Illustration of multiple potentially plausible bounding boxes from GLENet on the KITTI dataset by sampling latent
variables multiple times. The point cloud, annotated ground-truth boxes, and predictions of GLENet are colored in black, red,
and green, respectively. GLENet produces diverse predictions for objects represented with sparse point clouds and incomplete
outlines, and consistent bounding boxes for objects with high-quality point clouds. The variance of the multiple predictions by
GLENet is used to estimate the uncertainty of the annotated ground-truth bounding boxes.

plausible ground-truth boxes in a learning-based framework. We propose GLENet, a novel deep generative network adapted from conditional variational auto-encoders (CVAE), which introduces a latent variable to capture the distribution over potentially plausible bounding boxes of point cloud objects. During inference, we sample latent variables multiple times to generate diverse bounding boxes (see Fig. 3), the variance of which is taken as label uncertainty to guide the learning of localization uncertainty estimation in the downstream detection task. Besides, based on the observation that detection results with low localization uncertainty in probabilistic detectors tend to have accurate actual localization quality (see Section 4.2), we further propose the uncertainty-aware quality estimator (UAQE), which facilitates the training of the IoU-branch with the localization uncertainty estimation.

To demonstrate our effectiveness and universality, we integrate GLENet into several popular 3D object detection frameworks to build powerful probabilistic detectors. Experiments on KITTI (Geiger et al., 2012) and Waymo (Sun et al., 2020) datasets demonstrate that our method can bring consistent performance gains and achieve the current state-of-the-art. Particularly, the proposed GLENet-VR surpasses all published single-modal detection methods by a large margin and ranks 1st among all published LiDAR-based approaches on the highly competitive KITTI 3D detection benchmark on March 29th, 2022 (www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d).

We summarize the main contributions of this paper as follows:

– We are the first to formulate the 3D label uncertainty problem as the diversity of potentially plausible bounding boxes of objects. To capture the one-to-many relationship between a typical 3D object and its potentially plausible ground-truth bounding boxes, we present a deep generative model named GLENet. Additionally, we introduce a general and unified deep learning-based paradigm, including the network structure, loss function, evaluation metric, etc.
– Inspired by the strong correlation between the localization quality and the predicted uncertainty in probabilistic detectors, we propose UAQE to facilitate the training of the IoU-branch.

The remainder of the paper is organized as follows. Section 2 reviews existing works on LiDAR-based detectors and label uncertainty estimation methods. In Section 3, we explicitly formulate the label uncertainty estimation problem from the probabilistic distribution perspective, followed by the technical implementation of GLENet. In Section 4, we introduce a unified way of integrating the label uncertainty statistics predicted by GLENet into the existing 3D object detection frameworks to build more powerful probabilistic detectors, as well as some theoretical analysis. In Section 5, we conduct experiments on the KITTI dataset and the Waymo Open dataset to demonstrate the effectiveness of our method in enhancing existing 3D detectors and the ablation study to analyze the effect of different components. Finally, Section 7 concludes this paper.

2 Related Work

2.1 LiDAR-based 3D Object Detection

Existing 3D object detectors can be classified into two categories: single-stage and two-stage. For single-stage detectors, Zhou and Tuzel (2018) proposed to convert raw point clouds to regular volumetric representations and adopted voxel-based feature encoding. Yan et al. (2018b) presented a more efficient sparse convolution. Lang et al. (2019) converted point clouds to sparse fake images using pillars. Shi and Rajkumar (2020a) aggregated point information via a graph structure. He et al. (2020) introduced point segmentation and center estimation as auxiliary tasks in the training phase to enhance model capacity. Zheng et al. (2021a) constructed an SSFA module for robust feature extraction and a multi-task head for confidence rectification, and proposed DI-NMS for post-processing. For two-stage detectors, Shi et al. (2020b) exploited a voxel-based network to learn the additional spatial relationship between intra-object parts under the supervision of 3D box annotations. Shi et al. (2019) proposed to directly generate 3D proposals from raw point clouds in a bottom-up manner, using semantic segmentation to validate points to regress detection boxes. The follow-up work (Yang et al., 2019) further proposed PointsPool to convert sparse proposal features to compact representations and used spherical anchors to generate accurate proposals. Shi et al. (2020a) utilized both point-based and voxel-based methods to fuse multi-scale voxel and point features. Deng et al. (2021) proposed voxel RoI pooling to extract RoI features from coarse voxels.

To address the boundary ambiguity problems in 3D object detection caused by occlusion and signal miss, some studies, such as SPG (Xu et al., 2021), have tried to use point cloud completion methods to restore the full shape of objects and improve the detection performance (Yan et al., 2021; Najibi et al., 2020). However, generating complete and precise shapes from incomplete point clouds remains a non-trivial task.

2.2 Probabilistic 3D Object Detector

There are two types of uncertainty in deep learning predictions. One type, called aleatoric uncertainty, is caused by the inherent noise in observational data and cannot be eliminated. The other type, called epistemic uncertainty or model uncertainty, is caused by incomplete training and can be alleviated with more training data. Most existing state-of-the-art 2D (Liu et al., 2016; Tan et al., 2020; Carion et al., 2020) and 3D (Shi et al., 2020b) object detectors produce a deterministic box with a confidence score for each detection. While the probability score represents the existence and semantic confidence, it cannot reflect the uncertainty about the predicted localization well. By contrast, probabilistic object detectors (He et al., 2019; Harakeh et al., 2020; Varamesh and Tuytelaars, 2020) estimate the probabilistic distribution of predicted bounding boxes rather than taking them as deterministic results. For example, He et al. (2019) and Choi et al. (2019) modeled the predicted boxes as Gaussian distributions, the variance of which can indicate the localization uncertainty and is predicted with additional layers in the detection head. This approach introduces the KL loss between the predicted Gaussian distribution and the ground-truth bounding boxes modeled as a Dirac delta function, so the regression branch is expected to output a larger variance and get a smaller loss for inaccurate localization estimation in cases with ambiguous boundaries. Li et al. (2021) facilitated the learning of localization quality with distribution statistics of a bounding box, such as the mean value, which inspires us to further utilize the estimated uncertainty in UAQE. Meyer et al. (2019) proposed a probabilistic 3D object detector modeling the distribution of bounding box corners as a Laplacian distribution.

However, most probabilistic detectors take the ground-truth bounding box as a deterministic Dirac delta distribution and ignore the ambiguity in the ground truth. Therefore, the localization variance is actually learned in an unsupervised manner, which may result in sub-optimal localization precision and erratic training (see our theoretical analysis in Section 4.1).

2.3 Label Uncertainty Estimation

Label noise (or uncertainty) is a common problem in real-world datasets and could seriously affect the performance of supervised learning algorithms. As neural networks are prone to overfit even complete random noise (Zhang et al. (2021)), it is important to prevent the network from overfitting noisy labels. An obvious solution is to consider the label of a misclassified sample to be uncertain and remove the samples (Delany et al., 2012). Garcia et al. (2015) used a soft voting approach to approximate a noise level for each sample based on the aggregation of the noise degree prediction calculated for a set of binary classifiers. Luengo et al. (2018) extended this work by correcting the label when most classifiers predict the same label for noisy samples. Confident Learning (Northcutt et al., 2021) estimated uncertainty in dataset labels by estimating the joint distribution of noisy labels and true labels. However, the above studies mainly focus on the image classification task.

There only exists a limited number of previous works focusing on quantifying uncertainty statistics of annotated ground-truth bounding boxes. Meyer and Thakurdesai (2020) proposed to model label uncertainty by the IoU between the label bounding box and the corresponding convex hull of the aggregated LiDAR observations. However, it is non-learning-based and thus has limited modeling capacity. Besides, it only produces uncertainty of the ground-truth box as a whole

instead of each dimension. Wang et al. (2020) proposed a Bayes method to estimate label noise by quantifying the matching degree of point clouds for the given bounding box with a Gaussian Mixture Model. However, its assumption of conditional probabilistic independence between point clouds is often untenable in practice. Differently, we formulate label uncertainty as the diversity of potentially plausible bounding boxes. There may be some objects with few points that exactly match the learned surface points of the corresponding labeled bounding box, so the label is considered by (Wang et al. (2020)) to be deterministic. But for an object with sparse point clouds, our GLENet will output different and plausible bounding boxes and further estimate high label uncertainty based on them, regardless of whether the points match the given label. In general, Wang et al. (2020) used the Bayesian paradigm to estimate the correctness of the annotated box as the label uncertainty, while our method formulates it as the diversity of potentially plausible bounding boxes and predicts it with GLENet.

2.4 Conditional Variational Auto-Encoder

The variational auto-encoder (VAE) (Kingma and Welling, 2014) has been widely used in image and shape generation tasks (Yan et al., 2016; Nash and Williams, 2017). It transforms natural samples into a distribution where latent variables can be drawn and passed to a decoder network to generate diverse samples. Sohn et al. (2015) introduced the conditional variational autoencoder (CVAE), which extends the capabilities of the traditional VAE by incorporating an additional condition during the generative process. The CVAE model consists of an encoder, a decoder, and an extra input, which is usually a label or other structured information pertinent to the generation task. This auxiliary condition enables the CVAE to generate more targeted and controlled samples compared to its unsupervised counterpart, the VAE. In the NLP field, VAE has been widely applied to many text generation tasks, such as dialogue response (Zhao et al., 2017), machine translation (Zhang et al., 2016), story generation (Wang and Wan, 2019), and poem composing (Li et al., 2018). VAE and CVAE have also been applied in computer vision tasks, like image generation (Yan et al., 2016), human pose estimation (Sharma et al., 2019), medical image segmentation (Painchaud et al., 2020), salient object detection (Li et al., 2019; Zhang et al., 2020), and modeling human motion dynamics (Yan et al., 2018a). Recently, VAE and CVAE algorithms have also been applied extensively to applications of 3D point clouds, such as generating grasp poses (Mousavian et al., 2019) and instance segmentation (Yi et al., 2019).

Inspired by CVAE for generating diverse reasonable responses in dialogue systems, we propose GLENet, adapted from CVAE, to capture the one-to-many relationship between objects with incomplete point clouds and the potentially plausible ground-truth bounding boxes. To the best of our knowledge, we are the first to employ CVAE in 3D object detection to model label uncertainty.

3 Proposed Label Uncertainty Estimation

As aforementioned, the ambiguity of annotated ground-truth labels widely exists in 3D object detection scenarios and has adverse effects on the deep model learning process, which is not well addressed or even completely ignored by previous works. To this end, we propose GLENet, a generic and unified deep learning framework that generates label uncertainty by modeling the one-to-many relationship between point cloud objects and potentially plausible bounding box labels. Then the variance of the multiple outputs of GLENet for a single object is computed as the label uncertainty, which is extended as an auxiliary regression objective to enhance the performance of the downstream 3D object detection task.

3.1 Problem Formulation

Let C = {c_i}_{i=1}^{n} be a set of n observed LiDAR points belonging to an object, where c_i ∈ R^3 is a 3D point represented with spatial coordinates. Let X be the annotated ground-truth bounding box of C parameterized by the center location (c_x, c_y, c_z), the size (length l, width w, and height h), and the orientation r, i.e., X = [c_x, c_y, c_z, w, l, h, r] ∈ R^7.

We formulate the uncertainty of the annotated ground-truth label of an object as the diversity of potentially plausible bounding boxes of the object, which could be quantitatively measured with the variance of the distribution of the potential bounding boxes. First, we model the distribution of these potential boxes conditioned on the point cloud C, denoted as p(X|C). Specifically, based on the Bayes theorem, we introduce an intermediate variable z to write the conditional distribution as

\[
p(X|C) = \int_{z} p(X|z,C)\,p(z|C)\,\mathrm{d}z. \tag{1}
\]

Then, with p(X|z,C) and p(z|C) known, we can adopt a Monte Carlo method to get multiple bounding box predictions by sampling z multiple times and approximate the variance of p(X|C) with that of the sampled predictions.

In the following, we will introduce our learning-based framework named GLENet to realize the estimation process.
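To make the Monte Carlo approximation in Eq. (1) concrete, the following minimal PyTorch-style sketch samples the latent variable several times and uses the per-dimension variance of the decoded boxes as the label uncertainty. The module names prior_net, context_enc, and pred_net are hypothetical stand-ins for GLENet's sub-networks, not the released implementation.

import torch

def estimate_label_uncertainty(prior_net, context_enc, pred_net, points, num_samples=30):
    # prior_net maps the object point cloud C to the parameters (mu, log_var) of p(z|C);
    # context_enc produces the geometric embedding f_C; pred_net regresses a 7-dim box
    # from the concatenation [f_C, z]. All three are illustrative stand-ins.
    mu, log_var = prior_net(points)
    feat = context_enc(points)
    boxes = []
    for _ in range(num_samples):
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterized sample of z
        boxes.append(pred_net(torch.cat([feat, z], dim=-1)))
    boxes = torch.stack(boxes, dim=0)        # (num_samples, 7)
    return boxes.var(dim=0, unbiased=False)  # per-dimension variance = label uncertainty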
3.2 Inference Process of GLENet

Fig. 4 (a) shows the flowchart of GLENet parameterized by neural parameters θ, which aims to predict p(z|C) and p(X|z,C).

Fig. 4: The overall workflow of GLENet. In the training phase, we learn parameters 𝜇 and 𝜎 (resp. 𝜇′ and 𝜎 ′ ) of latent
variable 𝑧 (resp. 𝑧 ′ ) through the prior network (resp. recognition network), after which a sample of 𝑧 ′ and the corresponding
geometrical embedding produced by the context encoder are jointly exploited to estimate the bounding box distribution. In the
inference phase, we sample from the distribution of 𝑧 multiple times to generate different bounding boxes, whose variance we
use as label uncertainty. Note we denote multiple sampling with black, orange, and green lines in subgraph (a).

Specifically, under the assumption that the prior distribution p(z|C) follows a multivariate Gaussian distribution parameterized by (μ_z, σ_z), denoted as N(μ_z, σ_z²), we design a prior network, which is composed of PointNet (Qi et al., 2017) and additional MLP layers, from the input point cloud C to predict the values of (μ_z, σ_z). Then, we employ a context encoder to embed the input point cloud C into a high-dimensional feature space, leading to the geometric feature representation f_C, which is concatenated with z sampled from N(μ_z, σ_z²) and fed into a prediction network composed of MLPs to regress the bounding box distribution p(X|z,C), i.e., the localization, dimension, and orientation of the bounding box.

As empirically observed in various related domains (Goyal et al., 2017), it could be difficult to make use of latent variables when the prediction network can generate a plausible output only using the sufficiently expressive features of condition C. Therefore, we utilize a simplified PointNet architecture as the backbone of the context encoder to avoid posterior collapse. We refer the readers to Section 5.1.3 for the implementation details of these modules. In the following sections, we also use p_θ(z|C), p_θ(X|z,C), and p_θ(X|C) to denote the predictions of p(z|C), p(X|z,C), and p(X|C) by GLENet, respectively.

3.3 Training Process of GLENet

3.3.1 Recognition Network

Given C and its annotated bounding box X, we assume there is a true posterior distribution q(z|X,C). Thus, during training, we construct a recognition network parameterized by network parameters φ (see Fig. 4 (b)) to learn an auxiliary posterior distribution q_φ(z′|X,C) following a Gaussian distribution, denoted as N(μ′_z, σ′_z²), to regularize p_θ(z|C), i.e., p_θ(z|C) should be close to q_φ(z′|X,C).

Specifically, for the recognition network, we adopt the same learning architecture as the prior network to generate point cloud embeddings, which are concatenated with ground-truth bounding box information and fed into the subsequent MLP layers to learn q_φ(z′|X,C). Moreover, to facilitate the learning process, we encode the information X into offsets relative to predefined anchors, and then perform normalization as:

\[
t_{c_x} = \frac{c_x^{gt}}{d^a},\quad t_{c_y} = \frac{c_y^{gt}}{d^a},\quad t_{c_z} = \frac{c_z^{gt}}{h^a},\quad
t_w = \log\frac{w^{gt}}{w^a},\quad t_l = \log\frac{l^{gt}}{l^a},\quad t_h = \log\frac{h^{gt}}{h^a},\quad
t_r = \sin(r^{gt}),
\tag{2}
\]

where (w^a, l^a, h^a) is the size of the predefined anchor located in the center of the point cloud, and d^a = \sqrt{(l^a)^2 + (w^a)^2} is the diagonal of the anchor box. We also take cos(r) as an additional input of the recognition network to handle the issue of angle periodicity.

3.3.2 Objective Function

Following CVAE (Sohn et al., 2015), we optimize GLENet by maximizing the variational lower bound of the conditional log-likelihood p_θ(X|C):

\[
\log p_\theta(X|C) \ge \mathbb{E}_{q_\phi(z'|X,C)}\big[\log p_\theta(X|z,C)\big] - KL\big(q_\phi(z'|X,C)\,\|\,p_\theta(z|C)\big),
\tag{3}
\]

where E_q[p] returns the expectation of p on the distribution of q, and KL(·) denotes the KL-divergence.

Specifically, the first term E_{q_φ(z′|X,C)}[log p_θ(X|z,C)] enforces the prediction network to be able to restore the ground-truth bounding box from latent variables. Following (Yan et al. (2018b)) and (Deng et al. (2021)), we explicitly define the bounding box reconstruction loss as

\[
L_{rec} = L_{rec}^{reg} + \lambda\, L_{rec}^{dir},
\tag{4}
\]

where L_{rec}^{reg} denotes the Huber loss imposed on the prediction and the encoded regression targets as described in Eq. (2), and L_{rec}^{dir} denotes the binary cross-entropy loss used for direction classification.

The second term KL(q_φ(z′|X,C) ∥ p_θ(z|C)) is aimed at regularizing the distribution of z by minimizing the KL-divergence between p_θ(z|C) and q_φ(z′|X,C). Since p_θ(z|C) and q_φ(z′|X,C) are re-parameterized as N(μ_z, σ_z²) and N(μ′_z, σ′_z²) through the prior network and the recognition network, respectively, we can explicitly define the regularization loss as:

\[
L_{KL}\big(q_\phi(z'|X,C)\,\|\,p_\theta(z|C)\big) = \log\frac{\sigma'_z}{\sigma_z} + \frac{\sigma_z^2}{2\sigma_z'^2} + \frac{(\mu_z - \mu'_z)^2}{2\sigma_z'^2}.
\tag{5}
\]

Thus, the overall objective function is written as

\[
L = L_{rec} + \gamma\, L_{KL},
\tag{6}
\]

where we empirically set the hyperparameter γ to 1 in all experiments.
4 Probabilistic 3D Detectors with Label Uncertainty

To reform a typical detector to be a probabilistic object detector, we can enforce the detection head to estimate a probability distribution over bounding boxes, denoted as P_Θ(y), instead of a deterministic bounding box location:

\[
P_\Theta(y) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}}\, e^{-\frac{(y-\hat{y})^2}{2\hat{\sigma}^2}},
\tag{7}
\]

where Θ indicates the learnable network weights of a typical detector, ŷ is the predicted bounding box location, and σ̂ is the predicted localization variance.

Accordingly, we also assume the ground-truth bounding box as a Gaussian distribution P_D(y) with variance σ², whose value is estimated by GLENet:

\[
P_D(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-y_g)^2}{2\sigma^2}},
\tag{8}
\]

where y_g represents the ground-truth bounding box. Therefore, we can incorporate the generated label uncertainty into the KL loss between the distribution of the prediction and that of the ground truth in the detection head:

\[
L_{reg} = D_{KL}\big(P_D(y)\,\|\,P_\Theta(y)\big) = \log\frac{\hat{\sigma}}{\sigma} + \frac{\sigma^2}{2\hat{\sigma}^2} + \frac{(y_g-\hat{y})^2}{2\hat{\sigma}^2}.
\tag{9}
\]

4.1 More Analysis of KL-Loss

When ignoring label ambiguity and formulating the ground-truth bounding box as a Dirac delta function, as done in (He et al. (2019)), the loss in Eq. (9) degenerates into

\[
L_{reg}^{prob} \propto \frac{\log(\hat{\sigma}^2)}{2} + \frac{(y_g-\hat{y})^2}{2\hat{\sigma}^2},
\tag{10}
\]

and the partial derivative of Eq. (10) with respect to the predicted variance σ̂ is:

\[
\frac{\partial L_{reg}^{prob}}{\partial\hat{\sigma}} = \frac{1}{\hat{\sigma}} - \frac{(y_g-\hat{y})^2}{\hat{\sigma}^3}.
\tag{11}
\]

When minimizing Eq. (10), a potential issue is that as |y_g − ŷ| → 0,

\[
\frac{\partial L_{reg}^{prob}}{\partial\hat{\sigma}} \to \frac{1}{\hat{\sigma}},
\tag{12}
\]

so the derivative with respect to σ̂ can explode when σ̂ → 0. Based on the property of the KL-loss, the prediction is optimal only when the estimated σ̂ = 0 and the localization error |y_g − ŷ| = 0. Therefore, the gradient explosion may result in erratic training and sub-optimal localization precision.

By contrast, after modeling the ground-truth bounding box as a Gaussian distribution, the partial derivative of Eq. (9) with respect to the prediction is:

\[
\frac{\partial L_{reg}}{\partial\hat{\sigma}} = \frac{1}{\hat{\sigma}} - \frac{\sigma^2}{\hat{\sigma}^3} - \frac{(y_g-\hat{y})^2}{\hat{\sigma}^3},
\tag{13}
\]

and

\[
\frac{\partial L_{reg}}{\partial\hat{y}} = \frac{\hat{y}-y_g}{\hat{\sigma}^2}.
\tag{14}
\]

As |y_g − ŷ| → 0 and σ̂ > 0,

\[
\frac{\partial L_{reg}}{\partial\hat{\sigma}} \to \frac{1}{\hat{\sigma}}\Big(1 - \frac{\sigma^2}{\hat{\sigma}^2}\Big),
\tag{15}
\]

and

\[
\frac{\partial L_{reg}}{\partial\hat{y}} \to 0.
\tag{16}
\]

Thus, when the predicted distribution reaches the optimal solution, i.e., the distribution of the ground truth with |y_g − ŷ| → 0 and σ̂ → σ, the derivatives with respect to both ŷ and σ̂ become zero, which is an ideal property for the loss function and avoids the aforementioned gradient explosion issue.

Fig. 5: Illustration of the KL-divergence between distributions as a function of the localization error |y_g − ŷ| and the estimated localization variance σ̂, given different label uncertainty σ: (a) L_reg^prob (σ = 0), (b) L_reg (σ = 0.2), (c) L_reg (σ = 0.5). With label uncertainty σ estimated by GLENet instead of zero, the gradient is smoother when the loss converges to the minimum. Besides, L_reg is smaller when σ is larger, which prevents the model from overfitting to uncertain annotations.

Fig. 6: (a) Illustration of the relationship between the actual localization precision (i.e., IoU between predicted and ground-truth
bounding box) and the variance predicted by a probabilistic detector. Here, we reduce the dimension of the variance with PCA
to facilitate visualization. (b) Two examples: for the sparse sample, the prediction has high uncertainty and low localization
quality, while for the dense sample, the prediction has high localization quality and low uncertainty estimation.

Fig. 5 shows the landscape of the KL-divergence loss function under different label uncertainty σ, which are markedly different in shape and property. The loss L_reg^prob approaches negative infinity and the gradient explodes as |y_g − ŷ| → 0 and σ̂ → 0. However, when we introduce the estimated label uncertainty and the predicted distribution is equal to the ground-truth distribution, the KL loss has a determined minimum value of 0.5 and the gradient is smoother.

Fig. 7: Illustration of the proposed UAQE module in the detection head, using the learned localization variance to assist the training of the localization quality (IoU) estimation branch.

4.2 Uncertainty-aware Quality Estimator

Most state-of-the-art two-stage 3D object detectors use an IoU-related confidence score as the sorting criterion in NMS (non-maximum suppression), indicating the localization quality rather than the classification score. As shown in Fig. 6, there is a strong correlation between the uncertainty and the actual localization quality for each bounding box. This observation motivates us to use uncertainty as a criterion for judging the quality of the boxes. However, the estimated uncertainty is 7-dimensional, making it infeasible to directly replace the IoU confidence score with the uncertainty.

To overcome this issue, we propose an uncertainty-aware quality estimator (UAQE) that introduces uncertainty information to facilitate the training of the IoU-branch and improve the accuracy of IoU estimation. The UAQE is shown in Fig. 7. Given the predicted uncertainty as input, we construct a lightweight sub-module consisting of two fully connected (FC) layers followed by a Sigmoid activation to generate a coefficient. The original output of the IoU-branch is then multiplied with this coefficient to obtain the final estimation. The UAQE aims to capture the uncertainty in the estimation and adjust the final output accordingly, resulting in a more accurate estimation of the IoU score.
comparisons with previous state-of-the-art approaches in
8 𝑃 ← {};
9 for 𝑖 ∈ 𝐿 ′ do Sections 5.2 and 5.3. Finally, we conduct a series of ablation
10 𝑝𝑖 = 𝑒 − (1−𝐼𝑜𝑈 (𝑏𝑖 ,𝑏) ) /𝜎𝑡 ;
2
studies to verify the necessity of different key components
′𝜃
11 if |𝑡 𝑎𝑛(𝑏𝑖 − 𝑏 ) | > 1 then
𝜃 and configurations in Section 5.4.
12 𝑝𝑖𝜃 = 0;
13 end Ð
14 𝑃 ← 𝑃 𝑝𝑖 ; 5.1 Experiment Settings
15 end Í
𝑖 ∈ 𝐿 ′ 𝑏𝑖 · 𝑝𝑖 /𝑐𝑖
16 𝑏𝑚 = Í , 𝑝𝑖 ∈ 𝑃, 𝑏𝑖 ∈ 𝐵, 𝑐𝑖 ∈ 𝐶; 5.1.1 Benchmark Datasets
Ð𝑖 ∈ 𝐿 ′ 𝑝𝑖 /𝑐𝑖
17 𝐷 ← 𝐷 𝑏𝑚 ;
18 𝐿 ← 𝐿 − 𝐿′;
The KITTI dataset contains 7481 training samples with anno-
19 end
tations in the camera field of vision and 7518 testing samples.
According to the occlusion level, visibility, and bounding
box size, the samples are further divided into three difficulty
issue, we propose an uncertainty-aware quality estimator levels: simple, moderate, and hard. Following common prac-
(UAQE) that introduces uncertainty information to facilitate tice, when performing experiments on the val set, we further
the training of the IoU-branch and improve the accuracy of split all training samples into a subset with 3712 samples for
IoU estimation. The UAQE is shown in Fig. 7. Given the training and the remaining 3769 samples for validation. We
predicted uncertainty as input, we construct a lightweight report the performance on both the val set and online test
sub-module consisting of two fully connected (FC) layers leaderboard for comparison. And we use all training data for
followed by the Sigmoid activation to generate a coefficient. the test server submission.
The original output of the IoU-branch is then multiplied with
this coefficient to obtain the final estimation. The UAQE aims The Waymo Open dataset is a large-scale autonomous driving
to capture the uncertainty in the estimation and adjust the final dataset with more diverse scenes and object annotations in full
output accordingly, resulting in a more accurate estimation 360◦ , which contains 798 sequences (158361 LiDAR frames)
of the IoU score. for training and 202 sequences (40077 LiDAR frames) for
validation. These frames are further divided into two difficulty
levels: LEVEL1 for boxes with more than five points and
4.3 3D Variance Voting LEVEL2 for boxes with at least one point. We report perfor-
mance on both LEVEL 1 and LEVEL 2 difficulty objects
Considering that in probabilistic object detectors, the learned using the recommended metrics, mean Average Precision
localization variance by the KL loss can reflect the uncertainty (mAP) and mean Average Precision weighted by heading
of the predicted bounding boxes, following (He et al., 2019), accuracy (mAPH). To conduct the experiments efficiently, we
we also propose 3D variance voting to combine neighboring created a representative training set by randomly selecting
bounding boxes to seek a more precise box representation. 20% of the frames from the original training set, which com-
Specifically, at a single iteration in the loop, box 𝑏 with the prises approximately 32,000 frames. All evaluations were
maximum score is selected and its new location is calculated performed on the complete validation set, consisting of around
according to itself and the neighboring boxes. During the 40,000 frames, using the official evaluation tool.
merging process, the neighboring boxes that are closer and
have a low variance are assigned higher weights. Note that 5.1.2 Evaluation Metric for GLENet
neighboring boxes with a large angle difference from 𝑏 do not
participate in the ensembling of angles. We refer the readers Due to the unavailability of the true distribution of a ground-
to Algorithm 1 for the details. truth bounding box, we propose to evaluate GLENet in a

Fig. 8: Illustration of the occlusion data augmentation. (a) The point cloud of the original object associated with the annotated
ground-truth bounding box. (b) A sampled dense object (red) is placed between the LiDAR sensor and the original object
(blue). (c) The projected range image from the point cloud in (b), where the convex hull (the red polygon) of the sampled
object is calculated and further jittered to increase the diversity of occluded samples. Based on the convex hull (the green
polygon) of the original point cloud, the occluded area can be obtained. The point cloud of the original object corresponding
to the occluded area is removed. (d) Final augmented object with the annotated ground-truth bounding boxes.

non-reference manner, in which the negative log-likelihood between the estimated distribution of the ground truth p_D(X|C), following a Gaussian distribution N(t̂, σ²), and p_θ(X|C) is computed:

\[
\begin{aligned}
L_{NLL}(\theta) &= -\int p_\theta(X|C)\,\log p_D(X|C)\,\mathrm{d}X \\
&\approx -\frac{1}{S}\sum_{i=1}^{S}\log p_D(X_i|C) \\
&= \frac{1}{S}\sum_{i=1}^{S}\sum_{k\in\{c_x,c_y,c_z,w,l,h,r\}}\left[\frac{(t_i^k-\hat{t}_i^k)^2}{2\sigma_k^2}+\frac{\log(\sigma_k^2)}{2}+\frac{\log(2\pi)}{2}\right],
\end{aligned}
\tag{17}
\]

where S denotes the number of inference times, X_i is the result of the i-th inference, and t̂_i^k and t_i^k represent the regression targets and the predicted offsets, respectively. We estimate the integral by randomly sampling multiple prediction results via the Monte Carlo method. Generally, the value of L_NLL is small when GLENet outputs reasonable bounding boxes, i.e., predicting diverse plausible boxes with high variance for incomplete point clouds and consistent, precise boxes with low variance for high-quality point clouds, respectively.
low variance for high-quality point cloud, respectively. truth boxes in training, especially for incomplete point clouds,
so we further propose an occlusion-driven augmentation
5.1.3 Implementation Details approach, as illustrated in Fig. 8, after which a complete
point cloud may look similar to another incomplete point
To prevent data leakage, we kept the dataset division of cloud, while the ground-truth boxes of them are completely
GLENet consistent with that of the downstream detectors. As different. To overcome posterior collapse, we also adopted
the initial input of GLENet, the point cloud of each object KL annealing (Bowman et al., 2016) to gradually increase
was uniformly pre-processed into 512 points via random the weight of the KL loss from 0 to 1. We followed k-fold
subsampling/upsampling. Then we decentralized the point cross-sampling to divide all training objects into ten mutually
cloud by subtracting the coordinates of the center point to exclusive subsets. To overcome overfitting, each time we
eliminate the local impact of translation. trained GLENet on 9 subsets and then made predictions on
Architecturally, we realized the prior network and recogni- the remaining subset to generate label uncertainty estimations
tion network with an identical PointNet structure consisting of on the whole training set. During inference, we sampled the
three FC layers of output dimensions (64, 128, 512), followed latent variable 𝑧 from the predicted prior distribution 𝑝 𝜃 (𝑧|𝑐)
by another FC layer to generate an 8-dim latent variable. To 30 times to form multiple predictions, the variance of which
avoid posterior collapse, we particularly chose a lightweight was used as the label uncertainty.
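The occlusion-driven augmentation can be approximated with a range-image occupancy test, as in the heavily simplified sketch below; the bin resolutions and the bin-occupancy criterion are assumptions that replace the convex-hull procedure described in Fig. 8.

import numpy as np

def occlusion_augment(obj_points, occluder_points, az_res=0.2, el_res=0.2):
    # Project both point sets to azimuth/elevation bins of a range image and drop
    # object points whose bins are covered by a closer occluder (a simplification).
    def to_bins(pts):
        az = np.degrees(np.arctan2(pts[:, 1], pts[:, 0]))
        el = np.degrees(np.arctan2(pts[:, 2], np.linalg.norm(pts[:, :2], axis=1)))
        return np.stack([np.floor(az / az_res), np.floor(el / el_res)], axis=1)

    occ_bins = {tuple(b) for b in to_bins(occluder_points)}
    obj_bins = to_bins(obj_points)
    obj_range = np.linalg.norm(obj_points, axis=1)
    occ_range = np.linalg.norm(occluder_points, axis=1).min()  # closest occluder point
    keep = np.array([tuple(b) not in occ_bins or r < occ_range
                     for b, r in zip(obj_bins, obj_range)])
    return obj_points[keep]   # augmented object; the ground-truth box is kept unchanged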

Table 1: Quantitative comparison with state-of-the-art methods on the KITTI test set for vehicle detection, under the evaluation
metric of 3D Average Precision (AP) of 40 sampling recall points. The best and second-best results are highlighted in bold and
underlined, respectively.

Method | Reference | Modality | 3D AP_R40: Easy / Mod. / Hard / mAP
MV3D (Chen et al., 2017) CVPR’17 RGB+LiDAR 74.97 63.63 54.00 64.20
F-PointNet (Qi et al., 2018) CVPR’18 RGB+LiDAR 82.19 69.79 60.59 70.86
MMF (Liang et al., 2019) CVPR’19 RGB+LiDAR 88.40 77.43 70.22 78.68
PointPainting (Vora et al., 2020) CVPR’20 RGB+LiDAR 82.11 71.70 67.08 73.63
CLOCs (Pang et al., 2020) IROS’20 RGB+LiDAR 88.94 80.67 77.15 82.25
EPNet (Huang et al., 2020) ECCV’20 RGB+LiDAR 89.81 79.28 74.59 81.23
3D-CVF (Yoo et al., 2020) ECCV’20 RGB+LiDAR 89.20 80.05 73.11 80.79
STD (Yang et al., 2019) ICCV’19 LiDAR 87.95 79.71 75.09 80.92
Part-A2 (Shi et al., 2020b) TPAMI’20 LiDAR 87.81 78.49 73.51 79.94
3DSSD (Yang et al., 2020) CVPR’20 LiDAR 88.36 79.57 74.55 80.83
SA-SSD (He et al., 2020) CVPR’20 LiDAR 88.80 79.52 72.30 80.21
PV-RCNN (Shi et al., 2020a) CVPR’20 LiDAR 90.25 81.43 76.82 82.83
PointGNN (Shi and Rajkumar, 2020b) CVPR’ 20 LiDAR 88.33 79.47 72.29 80.03
Voxel-RCNN (Deng et al., 2021) AAAI’21 LiDAR 90.90 81.62 77.06 83.19
SE-SSD (Zheng et al., 2021b) CVPR’21 LiDAR 91.49 82.54 77.15 83.73
VoTR (Mao et al., 2021b) ICCV’21 LiDAR 89.90 82.09 79.14 83.71
Pyramid-PV (Mao et al., 2021a) ICCV’21 LiDAR 88.39 82.08 77.49 82.65
CT3D (Sheng et al., 2021) ICCV’21 LiDAR 87.83 81.77 77.16 82.25
GLENet-VR (Ours) - LiDAR 91.67 83.23 78.43 84.44

Table 2: Quantitative comparison of different methods on the KITTI validation set for vehicle detection, under the evaluation metric of 3D Average Precision (AP) calculated with 11 sampling recall positions. The 3D APs under 40 sampling recall points are also reported for the moderate car class. The best and second-best results are highlighted in bold and underlined, respectively.

Methods | Reference | 3D AP_R11: Easy / Moderate / Hard | 3D AP_R40: Easy / Moderate / Hard
Part-𝐴2 (Shi et al., 2020b) TPAMI’20 89.47 79.47 78.54 - - -
3DSSD (Yang et al., 2020) CVPR’20 89.71 79.45 78.67 - - -
SA-SSD (He et al., 2020) CVPR’20 90.15 79.91 78.78 92.23 84.30 81.36
PV-RCNN (Shi et al., 2020a) CVPR’20 89.35 83.69 78.70 92.57 84.83 83.31
SE-SSD (Zheng et al., 2021b) CVPR’21 90.21 85.71 79.22 93.19 86.12 83.31
VoTR (Mao et al., 2021b) ICCV’21 89.04 84.04 78.68 - - -
Pyramid-PV (Mao et al., 2021a) ICCV’21 89.37 84.38 78.84 - - -
CT3D (Sheng et al., 2021) ICCV’21 89.54 86.06 78.99 92.85 85.82 83.46
SECOND (Yan et al., 2018b) Sensors’18 88.61 78.62 77.22 91.16 81.99 78.82
GLENet-S (Ours) - 88.68 82.95 78.19 91.73 84.11 81.35
CIA-SSD (Zheng et al., 2021a) AAAI’21 90.04 79.81 78.80 93.59 84.16 81.20
GLENet-C (Ours) - 89.82 84.59 78.78 93.20 85.16 81.94
Voxel R-CNN (Deng et al., 2021) AAAI’21 89.41 84.52 78.93 92.38 85.29 82.86
GLENet-VR (Ours) - 89.93 86.46 79.19 93.51 86.10 83.60

5.1.5 Base Detectors

We integrated GLENet into four popular deep 3D object detection frameworks, i.e., SECOND (Yan et al., 2018b), CIA-SSD (Zheng et al., 2021a), CenterPoint (two-stage) (Yin et al., 2021), and Voxel R-CNN (Deng et al., 2021), to construct probabilistic detectors, which are dubbed GLENet-S, GLENet-C, GLENet-CP, and GLENet-VR, respectively. Specifically, we introduced an extra FC layer on top of the detection head to estimate standard deviations along with the box locations. Meanwhile, we applied the proposed UAQE to GLENet-VR to facilitate the training of the IoU-branch. Generally, we set the value of σ_t to 0.05 and the value of μ to 0.01 on KITTI and 0.7 on the Waymo dataset in 3D variance voting. Note that for fair comparisons, we kept the network configurations of these base detectors unchanged except those related to the new submodules.
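In practice, the modification amounts to adding one linear branch to the regression head, roughly as sketched below; the channel size is an assumption and the snippet is not tied to any specific detector's released code.

import torch.nn as nn

class BoxWithVarianceHead(nn.Module):
    # Alongside the 7 box parameters, an extra fully connected layer predicts a
    # log-variance per box dimension, turning a deterministic head into a probabilistic one.
    def __init__(self, in_channels=256, box_dim=7):
        super().__init__()
        self.box_fc = nn.Linear(in_channels, box_dim)
        self.log_var_fc = nn.Linear(in_channels, box_dim)   # the added branch

    def forward(self, feat):
        return self.box_fc(feat), self.log_var_fc(feat)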
5.2 Evaluation on the KITTI Dataset

We compared GLENet-VR with state-of-the-art detectors on the KITTI test set, and Table 1 reports the AP and the mAP that averages over the APs of the easy, moderate, and hard objects. As

Table 3: Performance comparisons on the KITTI val set for the pedestrian and cyclist classes using AP_R11.

Method | Pedestrian: Easy / Moderate / Hard | Cyclist: Easy / Moderate / Hard
SECOND | 56.55 / 52.97 / 47.73 | 80.59 / 67.14 / 63.11
GLENet-S 58.22 52.39 49.53 82.67 68.29 65.62
Voxel R-CNN 66.32 60.52 55.42 86.62 70.69 66.05
GLENet-VR 66.18 62.05 56.00 87.28 74.07 70.90

of March 29th, 2022, our GLENet-VR surpasses all published single-modal detection methods by a large margin and ranks 1st among all published LiDAR-based approaches. Besides, Fig. 9 provides the detailed Precision-Recall (PR) curves of GLENet-VR on the KITTI test split.

Fig. 9: PR curves of GLENet-VR on the car class of the KITTI test set.

Table 2 lists the validation results of different detection frameworks on the KITTI dataset, from which we can observe that GLENet-S, GLENet-C, and GLENet-VR consistently outperform their corresponding baseline methods, i.e., SECOND, CIA-SSD, and Voxel R-CNN, by 4.79%, 4.78%, and 1.84% in terms of 3D R11 AP on the moderate car category. Particularly, GLENet-VR achieves 86.36% AP on the moderate car class, which surpasses all other state-of-the-art methods. Besides, as a single-stage method, GLENet-C achieves 84.59% AP for the moderate vehicle class, which is comparable to the existing two-stage approaches while achieving relatively lower inference costs. It is worth noting that our method is compatible with mainstream detectors and can be expected to achieve better performance when combined with stronger base detectors. Besides, our method also performs well on other classes. As shown in Table 3, GLENet-S outperforms SECOND by +1.8% and +2.51% on the pedestrian and cyclist classes, respectively, for 3D AP on the hard difficulty. And for the baseline Voxel R-CNN, our method improves the performance by +1.47% and +3.38% on the pedestrian and cyclist classes, respectively, on the moderate difficulty.

5.3 Evaluation on the Waymo Open Dataset

In Table 4, we present a comprehensive comparison of various state-of-the-art methods for vehicle detection on the Waymo Open Dataset, considering both LEVEL 1 and LEVEL 2 difficulty settings. The evaluation metrics used in this comparison include the 3D mean Average Precision (mAP) for different distance ranges (0-30m, 30-50m, and 50m-inf) and the overall mAP for LEVEL 1 and LEVEL 2. Specifically, our method contributes 2.44%, 1.21%, and 1.24% improvements in terms of LEVEL 1 mAP for SECOND, CenterPoint-TS, and Voxel R-CNN, respectively. The improvements observed in the table demonstrate that our method is robust and consistently enhances the performance of baseline models like SECOND and Voxel R-CNN. GLENet-VR demonstrates the best performance, with an mAP of 77.32% and 69.68% for LEVEL 1 and LEVEL 2, respectively.

Table 4: Quantitative comparison of different methods on the Waymo validation set for vehicle detection. ★: experiment results reproduced with the code of OpenPCDet^a. The best and second-best results are highlighted in bold and underlined, respectively.

Methods | LEVEL 1 3D mAP: Overall / 0-30m / 30-50m / 50m-inf | LEVEL 1 mAPH: Overall | LEVEL 2 3D mAP: Overall / 0-30m / 30-50m / 50m-inf | LEVEL 2 mAPH: Overall
PointPillar (Lang et al., 2019) 56.62 81.01 51.75 27.94 - - - - - -
MVF (Zhou et al., 2020) 62.93 86.30 60.02 36.02 - - - - - -
PV-RCNN (Shi et al., 2020a) 70.30 91.92 69.21 42.17 69.69 65.36 91.58 65.13 36.46 64.79
VoTr-TSD (Mao et al., 2021b) 74.95 92.28 73.36 51.09 74.25 65.91 - - - 65.29
Pyramid-PV (Mao et al., 2021a) 76.30 92.67 74.91 54.54 75.68 67.23 - - - 66.68
CT3D (Sheng et al., 2021) 76.30 92.51 75.07 55.36 - 69.04 91.76 68.93 42.60 -
SECOND★ (Yan et al., 2018b) 69.85 90.71 68.93 41.17 69.40 62.76 86.92 62.57 35.89 62.30
GLENet-S (Ours) 72.29 91.02 71.86 45.43 71.85 64.78 87.56 65.11 38.60 64.25
CenterPoint-TS★ (Yin et al., 2021) 75.52 92.09 74.35 54.27 75.07 67.37 90.89 68.11 42.46 66.94
GLENet-CP (Ours) 76.73 92.70 75.70 55.77 76.27 68.50 91.95 69.43 43.68 68.08
Voxel R-CNN★ (Deng et al., 2021) 76.08 92.44 74.67 54.69 75.67 68.06 91.56 69.62 42.80 67.64
GLENet-VR (Ours) 77.32 92.97 76.28 55.98 76.85 69.68 92.09 71.21 44.36 68.97
^a Reference: https://github.com/open-mmlab/OpenPCDet.

This superior performance can be attributed to the effective handling of bounding box ambiguity, especially for distant and sparse point cloud objects. In addition to the overall performance, our methods also exhibit noteworthy improvements in the 30-50m and 50m-inf distance ranges. These results indicate that our method is particularly effective in resolving ambiguity for objects that are farther away from the sensor, which has traditionally posed challenges for point cloud-based detection algorithms.

In conclusion, Table 4 highlights the superior performance of our methods in 3D detection tasks on the Waymo Open Dataset. By effectively addressing the challenges posed by distant and sparse point cloud objects, our method demonstrates significant improvements in both LEVEL 1 and LEVEL 2 difficulty settings across various distance ranges.

5.4 Ablation Study

We conducted ablative analyses to verify the effectiveness and characteristics of our processing pipeline. In this section, all the involved model variants are built upon the Voxel R-CNN baseline and evaluated on the KITTI dataset, under the evaluation metric of average precision calculated with 40 recall positions.

5.4.1 Comparison with Other Label Uncertainty Estimation

We compared GLENet with two other ways of label uncertainty estimation: 1) treating the label distribution as the deterministic Dirac delta distribution with zero uncertainty; 2) estimating the label uncertainty with simple heuristics, i.e., the number of points in the ground-truth bounding box or the IoU between the label bounding box and the convex hull of the aggregated LiDAR observations (Meyer and Thakurdesai, 2020). As shown in Table 5, our method consistently outperforms existing label uncertainty estimation paradigms. Compared with heuristic strategies, our deep generative learning paradigm can adaptively estimate label uncertainty statistics in 7 dimensions, instead of the uncertainty of the bounding box as a whole, considering the variance in each dimension could be very different.

Besides, to compare with (Wang et al., 2020), whose code is not publicly available, we evaluated our method under its experiment settings and compared results with its reported performance. As shown in Table 6, our method outperforms (Wang et al., 2020) significantly in terms of AP_BEV on both the moderate and hard levels.

Table 5: Comparison of different label uncertainty estimation approaches. "Convex hull" refers to the method in (Meyer and Thakurdesai, 2020). The best results are highlighted in bold.

Methods | 3D AP_R40: Easy / Moderate / Hard
Voxel R-CNN | 92.38 / 85.29 / 82.86
GLENet-VR w/ L_KLD (σ² = 0) | 92.48 / 85.37 / 83.05
GLENet-VR w/ L_KLD (points num) | 92.46 / 85.58 / 83.16
GLENet-VR w/ L_KLD (convex hull) | 92.33 / 85.45 / 82.81
GLENet-VR w/ L_KLD (Ours) | 93.49 / 86.10 / 83.56

Table 6: Comparison of our method with (Wang et al., 2020) on the KITTI val set. The best results are highlighted in bold.

Method | AP_BEV for Car (IoU = 0.7): Easy / Mod. / Hard
PIXOR (Yang et al., 2018) | 86.79 / 80.75 / 76.60
ProbPIXOR + L_KLD (σ = 0) | 88.60 / 80.44 / 78.74
ProbPIXOR + L_KLD (Wang et al., 2020) | 92.22 / 82.03 / 79.16
ProbPIXOR + L_KLD (Ours) | 91.50 / 84.23 / 81.85

5.4.2 Key Components of Probabilistic Detectors

We analyzed the contributions of different key components in our constructed probabilistic detectors and reported the results in Table 7. According to the second row, we can conclude that only training with the KL loss brings little performance gain. Introducing the label uncertainty generated by GLENet into the KL loss contributes 0.75%, 0.51%, and 0.3% improvements on the APs of the easy, moderate, and hard classes, respectively, which demonstrates its regularization effect on the KL-loss (Eq. 9) and its ability to estimate more reliable uncertainty statistics of bounding box labels. The proposed UAQE module in the probabilistic detection head boosts the easy, moderate, and hard APs by 0.25%, 0.19%, and 0.15%, respectively, validating its effectiveness in estimating the localization quality.

To gain a better understanding of how UAQE enhances the estimation of IoU-related confidence scores (the location quality), we analyze the error in IoU estimation for both GLENet-VR and the baseline model (w/o UAQE) over different actual IoU values between the proposals and their corresponding ground-truth boxes. Figure 10 illustrates the changes in the error distribution of IoU estimation. We can observe that the UAQE module effectively reduces the IoU estimation error across various intervals of actual IoU values, such as [0.1, 0.6). These findings demonstrate that the UAQE module not only improves the overall average precision (AP) metric but also enhances the accuracy of location quality estimation.

5.4.3 Ablation Study of GLENet

Effectiveness of Preprocessing. As mentioned previously, to eliminate the local impact of translation on the input of GLENet, the point cloud of a single object is standardized to zero mean value. However, this process might remove meaningful information contained in distances.

Table 7: Contribution of each component in our constructed GLENet-VR pipeline. "LU" denotes the label uncertainty.

KL loss | LU | var voting | UAQE | Easy / Moderate / Hard
 –  |  –  |  –  |  –  | 92.38 / 85.29 / 82.86
 ✓  |  –  |  –  |  –  | 92.45 / 85.25 / 82.99
 ✓  |  –  |  ✓  |  –  | 92.48 / 85.37 / 83.05
 ✓  |  ✓  |  –  |  –  | 93.20 / 85.76 / 83.29
 ✓  |  ✓  |  ✓  |  –  | 93.24 / 85.91 / 83.41
 ✓  |  ✓  |  ✓  |  ✓  | 93.49 / 86.10 / 83.56

Table 8: Effect of point cloud input with and without absolute coordinates in GLENet. "NC" denotes normalized coordinates of the partial point cloud, and "AC" denotes absolute coordinates. We report the L_NLL for evaluation of GLENet and the 3D average precisions of 40 sampling recall points for evaluation of downstream detectors.

NC | AC | L_NLL ↓ | Easy / Mod. / Hard / Avg
✓  |  –  |  91.50 | 93.49 / 86.10 / 83.56 / 87.72
✓  |  ✓  | 147.33 | 93.21 / 85.66 / 83.35 / 87.41
✓ ✓ 147.33 93.21 85.66 83.35 87.41

Table 9: Ablation study on occlusion augmentation techniques


and context encoder in GLENet, in which we report the 𝐿 𝑁 𝐿𝐿
for evaluation of GLENet and the 3D average precisions of 40
sampling recall points for evaluation of downstream detectors.

Setting 𝐿 𝑁 𝐿𝐿 ↓ Easy Mod. Hard Avg.


Baseline 91.50 93.49 86.10 83.56 87.72
w/o Occlusion Augmentation 230.10 92.96 85.52 83.07 87.18
w/o Context Encoder 434.93 92.65 85.31 82.59 86.85

Influence of Data Augmentation. To generate similar point


Fig. 10: Boxplots are used to display the estimated IoU error cloud shapes with diverse ground-truth bounding boxes dur-
across various intervals of true IoU values. The x-axis repre- ing training of GLENet, we proposed an occlusion data
sents the real IoU between proposals and their corresponding augmentation strategy and generated more incomplete point
GT boxes, while the y-axis represents the distribution of esti- clouds while keeping the bounding boxes unchanged (see
mation error, which is the difference between the estimated Fig. 8). As listed in Table 9, it can be seen that the occlusion
IoU score and the real IoU. The boxplot provides information data augmentation effectively enhances the performance of
about the distribution of error through five summary statistics: GLENet and the downstream detection task.
the minimum value, the maximum value, the median, the first
quartile (Q1), and the third quartile (Q3). Necessity of the Context Encoder. In addition to learning
the distribution of latent variables, the prior and recognition
networks are also capable of extracting features from point
count and low label uncertainty. For this reason, we performed clouds. To verify the necessity of the context encoder that
experiments by adding the absolute coordinates of the point is responsible for encoding contextual information from the
cloud as an extra feature in the input of GLENet. However, as input data in GLENet, we conducted an ablation experiment.
shown in Table 8, the inclusion of extra absolute coordinates As shown in Table 9, after removing the context encoder,
did not yield any significant improvement in the 𝐿 𝑁 𝐿𝐿 met- we observed a significant deterioration in both the 𝐿 𝑁 𝐿𝐿
ric or the performance of downstream detectors. We reason metric and the average precision (AP) of the downstream
these observations from two aspects. First, the additional detector. These results clearly demonstrate the necessity of
absolute coordinates may differentiate objects that are located the context encoder to extract geometric features from point
in different positions but have similar appearances. As a clouds and allow the recognition and prior networks to focus
result, there may be fewer samples with similar shapes but on capturing the underlying structure of the input data in
different bounding box labels, making it difficult for GLENet a low-dimensional space. Without the context encoder, the
to capture the one-to-many relationship between incomplete recognition and prior networks would need to learn both the
point cloud objects and potential plausible bounding boxes. geometric features and the contextual information from the
Second, the absolute distance and the point cloud density input data, which would lead to poorer performance.
are generally correlated, i.e., an object with a larger absolute
distance generally has a sparser point cloud representation, Dimension of the Latent Variable. Table. 10 shows the
and such correlation could be perceived by the network. In performance of adopting latent variables with various dimen-
other words, the absolute distance information is somewhat sions for GLENet. We can observe that the accuracy increase
redundant to the network. gradually, with the dimensions of latent variables from 2 to
8, and the setting of 32-dimensional latent variables achieve
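The preprocessing examined above (and the Table 8 variant with absolute coordinates) can be sketched as follows. This is a minimal illustration assuming each object is given as an (N, 3) array of LiDAR points; the function name and feature layout are placeholders rather than our exact data pipeline.

```python
import numpy as np


def standardize_object_points(points, use_absolute_coords=False):
    """Zero-mean standardization of a single object's partial point cloud.

    points: (N, 3) LiDAR points of one object.
    Returns (N, 3) normalized coordinates, or (N, 6) when the original absolute
    coordinates are appended as extra per-point features (the Table 8 variant).
    """
    centroid = points.mean(axis=0, keepdims=True)
    normalized = points - centroid          # removes the translation component
    if use_absolute_coords:
        return np.concatenate([normalized, points], axis=1)
    return normalized
```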

Table 10: Ablation study of the dimensions of latent variables in GLENet.

Dimensions | L_NLL ↓ |  Easy   Mod.   Hard   Avg.
     2     |  856.48 |  92.05  84.69  82.22  86.32
     4     |  605.11 |  92.25  85.11  82.24  86.53
     8     |   91.50 |  93.49  86.15  83.56  87.73
    32     |   86.16 |  93.28  85.94  83.60  87.60
    64     |  110.49 |  93.11  85.51  83.27  87.30
   128     |  105.93 |  92.74  85.82  83.10  87.22

Table 11: Ablation study of the number of sampling times used to calculate the label uncertainty in GLENet.

Times | L_NLL ↓ |  Easy   Mod.   Hard   Avg.
   4  |  608.82 |  92.54  85.11  81.21  86.29
   8  |  240.08 |  92.96  85.52  82.80  87.09
  16  |  148.21 |  92.99  85.66  83.35  87.33
  30  |   91.50 |  93.49  86.10  83.56  87.72
  64  |   86.76 |  93.37  86.16  83.42  87.65
 128  |   77.06 |  93.53  85.92  83.47  87.64

Table 12: Comparison on different occlusion levels and distance ranges (a), evaluated by the 3D Average Precision (AP) calculated with 40 sampling recall positions on the KITTI val set.

                        | Voxel R-CNN (Deng et al., 2021) | GLENet-VR (Ours) | Improvement
Occlusion (b)  0        |              92.35              |       93.51      |    +1.16
               1        |              76.91              |       78.64      |    +1.73
               2        |              54.32              |       56.93      |    +2.61
Distance       0-20m    |              96.42              |       96.69      |    +0.27
               20-40m   |              83.82              |       86.87      |    +3.05
               40m-Inf  |              38.86              |       39.82      |    +0.96

(a) The results include separate APs for objects belonging to different occlusion levels and APs for the moderate vehicle class in different distance ranges.
(b) Definition of occlusion levels: levels 0, 1, and 2 correspond to fully visible samples, partly occluded samples, and samples difficult to see, respectively.

Table 13: Inference time comparison for different baselines on the KITTI dataset.

Method                          | FPS (Hz)
SECOND (Yan et al., 2018b)      |  23.36
GLENet-S (Ours)                 |  22.80
CIA-SSD (Zheng et al., 2021a)   |  27.18
GLENet-C (Ours)                 |  28.76
Voxel R-CNN (Deng et al., 2021) |  21.08
GLENet-VR (Ours)                |  20.82

Dimension of the Latent Variable. Table 10 shows the performance of GLENet with latent variables of various dimensions. We observe that the accuracy increases gradually as the dimension of the latent variables grows from 2 to 8, and that the 32-dimensional setting achieves similar performance. The results demonstrate that a too-small latent dimension leaves GLENet unable to fully represent the underlying structure of the input data, while larger values such as 64 or 128 lead to over-fitting and slight decreases in performance: when the dimension of the latent variables is too large, the model can easily memorize the noise and details in the training data, which is not helpful for generating new and useful samples. Besides, although the 32-dimensional setting yields the lowest L_NLL, the downstream detectors perform best with the label uncertainty produced by 8-dimensional latent variables. Therefore, although the L_NLL metric can reflect the generation quality of GLENet to some extent, it is not guaranteed to be strongly correlated with the performance of downstream detectors.

Effects of the Sampling Times. In Table 11, we investigate the effect of the number of sampling times used to calculate the label uncertainty. We observe that more sampling times generally lead to a lower L_NLL and better performance of the downstream detectors, and that similar performance is observed when using more than 30 sampling times. Statistically speaking, the variance obtained after a certain number of samples tends to stabilize. Hence, to balance computation cost and performance, we empirically calculate the label uncertainty from the multiple bounding boxes predicted by sampling the latent variables 30 times.
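For completeness, the sampling-based computation of the label uncertainty ablated in Table 11 can be sketched as follows. The interface is hypothetical (`sample_box` stands in for one forward pass through GLENet's prior network and bounding-box decoder with a freshly sampled latent variable); the per-dimension variance over the sampled boxes is what we use as the label uncertainty.

```python
import torch


@torch.no_grad()
def estimate_label_uncertainty(sample_box, object_points, num_samples=30):
    """Monte-Carlo estimate of the per-dimension label uncertainty.

    sample_box:    hypothetical callable returning one plausible 7-dim box per
                   call, conditioned on the object's partial point cloud.
    object_points: the (N, 3) point cloud of the object.
    Returns a (7,) variance vector, one entry per box dimension.
    """
    boxes = torch.stack([sample_box(object_points) for _ in range(num_samples)])
    return boxes.var(dim=0, unbiased=True)
```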

5.4.4 Conditional Analysis

To figure out in which cases our method improves the base detector most, we evaluated GLENet-VR on different occlusion levels and distance ranges. As shown in Table 12, compared with the baseline, our method mainly improves on the heavily occluded and distant samples, which suffer from more serious boundary ambiguities of their ground-truth bounding boxes.

5.4.5 Inference Efficiency

We evaluated the inference speed of different baselines with a batch size of 1 on a desktop with an Intel E5-2560 CPU @ 2.10 GHz and an NVIDIA GeForce RTX 2080Ti GPU. As shown in Table 13, our approach does not significantly increase the computational overhead. In particular, GLENet-VR takes only 0.6 ms more than the base Voxel R-CNN, since the number of candidates fed to variance voting is relatively small in two-stage detectors.

5.5 Comparison of Visual Results

Fig. 11 visualizes the detection results of our GLENet-VR and the baseline Voxel R-CNN on the KITTI val set, where it can be seen that GLENet-VR obtains better detection results, with fewer false-positive bounding boxes and fewer missed heavily occluded and distant objects than Voxel R-CNN. We also compare the detection results of SECOND and GLENet-S on the Waymo validation set in Fig. 12, where it can be seen that, compared with SECOND (Yan et al., 2018b), our GLENet-S produces fewer false predictions and achieves more accurate localization.

Fig. 11: Visual comparison of the results by GLENet-VR and Voxel R-CNN on the KITTI dataset. The ground-truth, true-positive and false-positive bounding boxes are visualized in red, green and yellow, respectively, on both the point cloud and the image. Best viewed in color.

6 Discussion

In this section, we further list a few potential technical limitations of the current learning framework and promising directions for extensions.
(1) Complexity and Computational Cost. Although GLENet provides reliable label uncertainty as supervision signals for downstream probabilistic detectors, estimating the label uncertainty itself brings additional computational cost and makes the overall training process more complex. In particular, considering the risk of over-fitting, we followed k-fold cross-sampling: at each round, GLENet is trained on 9 subsets and then makes predictions on the remaining subset (a brief sketch of this cross-prediction scheme is given at the end of this section).
(2) Incomplete input information. GLENet only takes the partial point cloud of an individual object as input, so only the learned geometric information is used to estimate potential bounding boxes. Context cues such as free space and the locations of surrounding objects, which are also meaningful for determining the bounding boxes, are neglected. Therefore, the estimated label uncertainty may deviate from the true distribution. However, it is not feasible to take all points in the scene as input, since the key of GLENet lies in learning the latent distribution of bounding boxes from samples with similar point cloud shapes, and involving the whole scene would distinguish objects whose isolated shapes are similar. Incorporating such contextual information without compromising the core benefits of GLENet remains a challenge.
(3) Robustness to Annotation Errors. While GLENet aims to address the inherent ambiguity in ground-truth annotations, it may not be entirely immune to the effects of significant annotation errors. If the training data contain substantial annotation errors, the model may inadvertently learn and propagate them, leading to inaccurate estimation of the label uncertainty. For example, if an object with a high-quality point cloud is annotated with a wrong box, which in turn leads to inconsistent predictions and large label uncertainty, then objects with similar shapes will suffer from unreasonable label uncertainty supervision signals. The robustness and reliability of the proposed method under such scenarios could be a limitation.

(4) Limited Evaluation Metrics and Scenarios. Evaluating the quality and diversity of generated data in generative tasks such as GLENet is challenging. Although the proposed L_NLL assesses the closeness between the predictions of GLENet and the ground-truth annotated bounding boxes, evaluating the quality and diversity of the generated data remains an open research problem. On the other hand, while our method demonstrates performance gains on benchmark datasets such as KITTI and Waymo, it is important to consider the generalizability of the approach across various environmental conditions, object classes, and sensor modalities. The ability of GLENet to generalize to a broader range of datasets and scenarios could be a limitation.
(5) Possible Extensions. The idea of estimating label uncertainty by capturing the one-to-many relationship between an observed input and multiple plausible labels with latent variables could be extended to other subjective tasks in computer vision where labels are not deterministic. One promising task is 3D object tracking, where different opinions of annotators on the boundaries of objects lead to non-deterministic labels. Another example is image quality assessment, where the goal is to evaluate the quality of an image, often in the context of compression or transmission; the perceived quality of an image is subjective and can vary depending on the perception and expectations of the viewer.

(a) SECOND    (b) GLENet-S (Ours)
Fig. 12: Visual comparison of the results by SECOND and GLENet-S on the Waymo val set. The ground-truth, true-positive and false-positive bounding boxes are visualized in red, green and yellow, respectively. Best viewed in color; zoom in for more details. An additional NMS with a higher IoU threshold is conducted to eliminate overlapping bounding boxes for better visualization.
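As a concrete illustration of the cross-prediction scheme mentioned in limitation (1), the following is a minimal sketch assuming 10 folds (training on 9 subsets and predicting on the held-out one), with `train_glenet` and `predict_uncertainty` as placeholder callables; it illustrates the scheme only and is not our exact training code.

```python
import numpy as np
from sklearn.model_selection import KFold


def cross_fold_label_uncertainty(objects, train_glenet, predict_uncertainty,
                                 num_folds=10, seed=0):
    """Assign each training object a label uncertainty predicted by a GLENet
    that never saw that object during training, to limit over-fitting."""
    uncertainties = [None] * len(objects)
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    for train_idx, held_out_idx in kfold.split(np.arange(len(objects))):
        model = train_glenet([objects[i] for i in train_idx])
        preds = predict_uncertainty(model, [objects[i] for i in held_out_idx])
        for i, u in zip(held_out_idx, preds):
            uncertainties[i] = u
    return uncertainties
```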

7 Conclusion

We presented a general and unified deep learning-based paradigm for modeling 3D object-level label uncertainty. Technically, we proposed GLENet, adapted from the learning framework of CVAE, to capture the one-to-many relationship between incomplete point cloud objects and potentially plausible bounding boxes. As a plug-and-play component, GLENet can generate reliable label uncertainty statistics that can be conveniently integrated into various 3D detection pipelines to build powerful probabilistic detectors. We verified the effectiveness and universality of our method by incorporating the proposed GLENet into several existing deep 3D object detectors, which demonstrated consistent improvements and produced state-of-the-art performance on both the KITTI and Waymo datasets.

Data Availability Statements

The Waymo Open Dataset (Sun et al., 2020) and KITTI (Geiger et al., 2012) used in this manuscript are deposited in publicly available repositories: https://waymo.com/open/data/perception and http://www.cvlibs.net/datasets/kitti, respectively.
References

Bowman S, Vilnis L, Vinyals O, Dai A, Jozefowicz R, Bengio S (2016) Generating sentences from a continuous space. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp 10–21
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision, pp 213–229
Chen X, Ma H, Wan J, Li B, Xia T (2017) Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1907–1915
Choi J, Chun D, Kim H, Lee HJ (2019) Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving. In: Proceedings of the IEEE International Conference on Computer Vision, vol 2019-October, pp 502–511
Delany SJ, Segata N, Mac Namee B (2012) Profiling instances in noise reduction. Knowledge-Based Systems 31:28–40
Deng J, Shi S, Li P, Zhou W, Zhang Y, Li H (2021) Voxel r-cnn: Towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 1201–1209
Feng D, Rosenbaum L, Dietmayer K (2018) Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In: IEEE Conference on Intelligent Transportation Systems, pp 3266–3273
Feng D, Rosenbaum L, Timm F, Dietmayer K (2019) Leveraging heteroscedastic aleatoric uncertainties for robust real-time lidar 3d object detection. In: IEEE Intelligent Vehicles Symposium, pp 1280–1287
Garcia LP, Sáez JA, Luengo J, Lorena AC, de Carvalho AC, Herrera F (2015) Using the one-vs-one decomposition to improve the performance of class noise filters via an aggregation strategy in multi-class classification problems. Knowledge-Based Systems 90:153–164
Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3354–3361
Goyal A, Sordoni A, Côté MA, Ke N, Bengio Y (2017) Z-forcing: Training stochastic recurrent networks. In: Advances in Neural Information Processing Systems, pp 6714–6724
Harakeh A, Smart M, Waslander SL (2020) Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. In: IEEE International Conference on Robotics and Automation, IEEE, pp 87–93
He C, Zeng H, Huang J, Hua XS, Zhang L (2020) Structure aware single-stage 3d object detection from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11870–11879
He Y, Zhu C, Wang J, Savvides M, Zhang X (2019) Bounding box regression with uncertainty for accurate object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2888–2897
Huang T, Liu Z, Chen X, Bai X (2020) Epnet: Enhancing point features with image semantics for 3d object detection. In: European Conference on Computer Vision, Springer, pp 35–52
Kingma D, Ba J (2015) Adam: A method for stochastic optimization. In: International Conference on Learning Representations, pp 1–15
Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: International Conference on Learning Representations, pp 1–14
Lang A, Vora S, Caesar H, Zhou L, Yang J, Beijbom O (2019) Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12689–12697
Li B, Sun Z, Guo Y (2019) Supervae: Superpixelwise variational autoencoder for salient object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 8569–8576
Li J, Song Y, Zhang H, Chen D, Shi S, Zhao D, Yan R (2018) Generating classical chinese poems via conditional variational autoencoder and adversarial training. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 3890–3900
Li X, Wang W, Hu X, Li J, Tang J, Yang J (2021) Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11632–11641
Liang M, Yang B, Chen Y, Hu R, Urtasun R (2019) Multi-task multi-sensor fusion for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7345–7353
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: Single shot multibox detector. In: European conference on computer vision, Springer, pp 21–37
Luengo J, Shim SO, Alshomrani S, Altalhi A, Herrera F (2018) Cnc-nos: Class noise cleaning by ensemble filtering and noise scoring. Knowledge-Based Systems 140:27–49
Mao J, Niu M, Bai H, Liang X, Xu H, Xu C (2021a) Pyramid r-cnn: Towards better performance and adaptability for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2723–2732
Mao J, Xue Y, Niu M, Bai H, Feng J, Liang X, Xu H, Xu C (2021b) Voxel transformer for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 3164–3173
Meyer G, Thakurdesai N (2020) Learning an uncertainty-aware object detector for autonomous driving. In: IEEE International Conference on Intelligent Robots and Systems, pp 10521–10527
Meyer GP, Laddha A, Kee E, Vallespi-Gonzalez C, Wellington CK (2019) Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12677–12686
Mousavian A, Eppner C, Fox D (2019) 6-dof graspnet: Variational grasp generation for object manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2901–2910
Najibi M, Lai G, Kundu A, Lu Z, Rathod V, Funkhouser T, Pantofaru C, Ross D, Davis L, Fathi A (2020) Dops: Learning to detect 3d objects and predict their 3d shapes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11910–11919
Nash C, Williams C (2017) The shape variational autoencoder: A deep generative model of part-segmented 3d objects. Computer Graphics Forum 36(5):1–12
Northcutt C, Jiang L, Chuang I (2021) Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70:1373–1411
Painchaud N, Skandarani Y, Judge T, Bernard O, Lalande A, Jodoin PM (2020) Cardiac segmentation with strong anatomical guarantees. IEEE transactions on medical imaging 39(11):3703–3713
Pang S, Morris D, Radha H (2020) Clocs: Camera-lidar object candidates fusion for 3d object detection. In: IEEE International Conference on Intelligent Robots and Systems, IEEE, pp 10386–10393
Qi CR, Su H, Mo K, Guibas LJ (2017) Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 652–660
Qi CR, Liu W, Wu C, Su H, Guibas LJ (2018) Frustum pointnets for 3d object detection from rgb-d data. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 918–927
Sharma S, Varigonda PT, Bindal P, Sharma A, Jain A (2019) Monocular 3d human pose estimation by generation and ordinal ranking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2325–2334
Sheng H, Cai S, Liu Y, Deng B, Huang J, Hua XS, Zhao MJ (2021) Improving 3d object detection with channel-wise transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 2743–2752
Shi S, Wang X, Li H (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 770–779
Shi S, Guo C, Jiang L, Wang Z, Shi J, Wang X, Li H (2020a) Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10529–10538
Shi S, Wang Z, Shi J, Wang X, Li H (2020b) From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(8):2647–2664
Shi W, Rajkumar R (2020a) Point-gnn: Graph neural network for 3d object detection in a point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1708–1716
Shi W, Rajkumar R (2020b) Point-gnn: Graph neural network for 3d object detection in a point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1711–1719
Smith LN (2017) Cyclical learning rates for training neural networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, IEEE, pp 464–472
Sohn K, Lee H, Yan X (2015) Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems, pp 3483–3491
Sun P, Kretzschmar H, Dotiwalla X, Chouard A, Patnaik V, Tsui P, Guo J, Zhou Y, Chai Y, Caine B, Vasudevan V, Han W, Ngiam J, Zhao H, Timofeev A, Ettinger S, Krivokon M, Gao A, Joshi A, Zhang Y, Shlens J, Chen Z, Anguelov D (2020) Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2443–2451
Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10781–10790
Varamesh A, Tuytelaars T (2020) Mixture dense regression for object detection and human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13086–13095
Vora S, Lang AH, Helou B, Beijbom O (2020) Pointpainting: Sequential fusion for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4604–4612
Wang T, Wan X (2019) T-cvae: Transformer-based conditioned variational autoencoder for story completion. In: International Joint Conference on Artificial Intelligence, pp 5233–5239
Wang Z, Feng D, Zhou Y, Rosenbaum L, Timm F, Dietmayer K, Tomizuka M, Zhan W (2020) Inferring spatial uncertainty in object detection. In: IEEE International Conference on Intelligent Robots and Systems, IEEE, pp 5792–5799
Xu Q, Zhou Y, Wang W, Qi CR, Anguelov D (2021) Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 15446–15456
Yan X, Yang J, Sohn K, Lee H (2016) Attribute2image: Conditional image generation from visual attributes. In: European conference on computer vision, Springer, pp 776–791
Yan X, Rastogi A, Villegas R, Sunkavalli K, Shechtman E, Hadap S, Yumer E, Lee H (2018a) Mt-vae: Learning motion transformations to generate multimodal human dynamics. In: European conference on computer vision, pp 265–281
Yan X, Gao J, Li J, Zhang R, Li Z, Huang R, Cui S (2021) Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 35, pp 3101–3109
Yan Y, Mao Y, Li B (2018b) Second: Sparsely embedded convolutional detection. Sensors 18(10):3337
Yang B, Luo W, Urtasun R (2018) Pixor: Real-time 3d object detection from point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7652–7660
Yang Z, Sun Y, Liu S, Shen X, Jia J (2019) Std: Sparse-to-dense 3d object detector for point cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1951–1960
Yang Z, Sun Y, Liu S, Jia J (2020) 3dssd: Point-based 3d single stage object detector. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11040–11048
Yi L, Zhao W, Wang H, Sung M, Guibas L (2019) Gspn: Generative shape proposal network for 3d instance segmentation in point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3942–3951
Yin T, Zhou X, Krahenbuhl P (2021) Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11784–11793
Yoo JH, Kim Y, Kim J, Choi JW (2020) 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In: European Conference on Computer Vision, Springer, pp 720–736
Zhang B, Xiong D, Su J, Duan H, Zhang M (2016) Variational neural machine translation. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Association for Computational Linguistics, Austin, Texas, pp 521–530
Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64(3):107–115
Zhang J, Fan DP, Dai Y, Anwar S, Saleh FS, Zhang T, Barnes N (2020) Uc-net: Uncertainty inspired rgb-d saliency detection via conditional variational autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8582–8591
Zhao T, Zhao R, Eskenazi M (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pp 654–664
Zheng W, Tang W, Chen S, Jiang L, Fu CW (2021a) Cia-ssd: Confident iou-aware single-stage object detector from point cloud. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 3555–3562
Zheng W, Tang W, Jiang L, Fu CW (2021b) Se-ssd: Self-ensembling single-stage object detector from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14494–14503
Zhou Y, Tuzel O (2018) Voxelnet: End-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4490–4499
Zhou Y, Sun P, Zhang Y, Anguelov D, Gao J, Ouyang T, Guo J, Ngiam J, Vasudevan V (2020) End-to-end multi-view fusion for 3d object detection in lidar point clouds. In: Conference on Robot Learning, PMLR, pp 923–932
