
3D Object Detection with Pointformer

Xuran Pan1*  Zhuofan Xia1*  Shiji Song1  Li Erran Li2†  Gao Huang1‡
1 Department of Automation, Tsinghua University, Beijing, China;
  Beijing National Research Center for Information Science and Technology (BNRist)
2 Alexa AI, Amazon / Columbia University
{pxr18, xzf20}@mails.tsinghua.edu.cn, [email protected], {shijis, gaohuang}@tsinghua.edu.cn

Abstract

Feature learning for 3D object detection from point clouds is very challenging due to the irregularity of 3D point cloud data. In this paper, we propose Pointformer, a Transformer backbone designed for 3D point clouds to learn features effectively. Specifically, a Local Transformer module is employed to model interactions among points in a local region, which learns context-dependent region features at an object level. A Global Transformer is designed to learn context-aware representations at the scene level. To further capture the dependencies among multi-scale representations, we propose the Local-Global Transformer to integrate local features with global features from higher resolution. In addition, we introduce an efficient coordinate refinement module to shift down-sampled points closer to object centroids, which improves object proposal generation. We use Pointformer as the backbone for state-of-the-art object detection models and demonstrate significant improvements over the original models on both indoor and outdoor datasets.

Figure 1. Attention maps taken directly from a Pointformer block; darker blue indicates stronger attention. For the key point (star), Pointformer first focuses on the local region of the same object (the back of the chair), then spreads the attention to other regions (the legs), and finally attends to points from other objects globally (other chairs), leveraging both local and global dependencies.

1. Introduction

3D object detection in point clouds is essential for many real-world applications such as autonomous driving [10] and augmented reality [18]. Compared to images, 3D point clouds provide detailed geometry and capture the 3D structure of the scene. On the other hand, point clouds are irregular and cannot be processed directly by powerful deep learning models such as convolutional neural networks. This poses a big challenge for effective feature learning.

The common feature processing methods in 3D detection can be roughly categorized into three types, based on the form of the point cloud representation. Voxel-based approaches [28, 12, 42] gridify the irregular point clouds into regular voxels, followed by sparse 3D convolutions to learn high-dimensional features. Though effective, voxel-based approaches face a dilemma between efficiency and accuracy: using smaller voxels gains more precision but suffers from higher computational cost, while using larger voxels misses potential local details in crowded voxels.

Alternatively, point-based approaches [25], inspired by the success of PointNet [21] and its variants, consume raw points directly to learn 3D representations, which mitigates the drawback of converting point clouds to some regular structure. Leveraging learning techniques for point sets, point-based approaches avoid voxelization-induced information loss and take advantage of the sparsity in point clouds by computing only on valid data points. Nevertheless, due to the irregularity of point cloud data, point-based learning operations have to be permutation-invariant and adaptive to the input size. To achieve this, they learn simple symmetric functions (e.g. point-wise feedforward networks with pooling functions), which highly restricts their representation power.

* Equal contribution.
† Work done prior to Amazon.
‡ Corresponding author.
Figure 2. The Pointformer backbone for 3D object detection in point clouds. A basic feature learning block consists of three parts: a Local Transformer to model interactions in the local region; a Local-Global Transformer to integrate local features with global information; and a Global Transformer to capture context-aware representations at the scene level.

Hybrid approaches [41, 15, 39, 24] attempt to combine both voxel-based and point-based representations. [41, 15] leverage PointNet features at the voxel level and at the column-of-voxels (pillar) level, respectively. [39, 24] deeply integrate voxel features and PointNet features at the scene level. However, the fundamental difference between the two representations could limit the effectiveness of these approaches for 3D point cloud feature learning.

To address the above limitations, we resort to Transformer [30] models, which have achieved great success in the field of natural language processing. Transformer models [8] are very effective at learning context-dependent representations and capturing long-range dependencies in the input sequence. The Transformer and its associated self-attention mechanism not only meet the demand of permutation invariance, but are also proven to be highly expressive. Specifically, [6] proves that self-attention is at least as expressive as convolution. Recently, self-attention has been successfully applied to classification [23] and 2D object detection [2] in computer vision. However, the straightforward application of Transformers to 3D point clouds is prohibitively expensive because the computation cost grows quadratically with the input size.

To this end, we propose Pointformer, a backbone for 3D point clouds that learns features more effectively by leveraging the superiority of Transformer models on set-structured data. As shown in Figure 2, Pointformer is a U-Net structure with multi-scale Pointformer blocks. A Pointformer block consists of Transformer-based modules that are both expressive and friendly to the 3D object detection task. First, a Local Transformer (LT) module is employed to model interactions among points in the local region, which learns context-dependent region features at an object level. Second, a coordinate refinement module is proposed to adjust the centroids sampled by Furthest Point Sampling (FPS), which improves the quality of generated object proposals. Third, we propose the Local-Global Transformer (LGT) to integrate local features with global features from higher resolution. Finally, the Global Transformer (GT) module is designed to learn context-aware representations at the scene level. As illustrated in Figure 1, Pointformer can capture both local and global dependencies, thus boosting the performance of feature learning for scenes with multiple cluttered objects.

Extensive experiments have been conducted on several detection benchmarks to verify the effectiveness of our approach. We use the proposed Pointformer as the backbone for three object detection models, CBGS [42], VoteNet [19], and PointRCNN [25], and conduct experiments on three indoor and outdoor datasets, SUN RGB-D [27], KITTI [10], and nuScenes [1], respectively. We observe significant improvements over the original models in all experiment settings, which demonstrates the effectiveness of our method.

In summary, we make the following contributions:

• We propose a pure transformer model, Pointformer, which serves as a highly effective feature learning backbone for 3D point clouds. Pointformer is permutation invariant, and local and global context-aware.

• We show that Pointformer can be easily applied as a drop-in replacement backbone for state-of-the-art 3D object detectors for point clouds.

• We perform extensive experiments using Pointformer as the backbone for three state-of-the-art 3D object detectors, and show significant performance gains on several benchmarks including both indoor and outdoor datasets. This demonstrates the versatility of Pointformer, as 3D object detectors are typically designed and optimized for either indoor or outdoor scenes only.
Figure 3. Illustration of the Local Transformer. Input points are first down-sampled by FPS, and local regions are generated by ball query. The Transformer block takes point features and coordinates as input and generates aggregated features for each local region. To further adjust the centroid points, attention maps from the last Transformer layer are adopted for coordinate refinement. As a result, points are pushed closer to the object centers instead of the surfaces.

2. Related Work

Feature learning for 3D point clouds. Prior work includes feature learning on voxelized grids, direct feature learning on point clouds, and hybrids of the two. 3D sparse convolution [11] is very effective on voxel grids. For direct feature learning, PointNet [21] and PointNet++ [22] learn point-wise features and region features using feed-forward networks and simple symmetric functions (e.g. max), respectively. PCCN [33] generalizes convolution to non-grid structured data by exploiting parameterized kernel functions that span the full continuous vector space. EdgeConv [34] exchanges local neighborhood information and acts on graphs dynamically computed in each layer of the network. Hybrid methods combine both types of features at the local level [41, 15] or at the network level [39, 24].

Transformers in computer vision. Image GPT [3] is the first to adopt Transformers for the 2D image classification task via unsupervised pretraining. Further, ViT [9] extends this scheme to large scale supervised learning on images. For high level vision tasks, DETR [2] and Deformable DETR [43] leverage the advantages of Transformers in 2D object detection. Set Transformer [16] uses attention mechanisms to model interactions among elements in the input set. In the field of 3D vision, PAT [38] designs novel group shuffle attention to capture long-range dependencies in point clouds. To the best of our knowledge, we are the first to propose a pure Transformer model for 3D point cloud feature learning, with carefully designed Transformer blocks and a positional encoding module to capture geometric and rich contextual information.

3D object detection in point clouds. Detectors are designed either with point clouds as the only input [42, 41, 15, 39, 24, 25, 19, 26, 35, 40] or by fusing multiple sensor modalities such as LiDAR and camera [20, 17, 31]. Their backbones are designed with the aforementioned feature learning approaches. We focus on point-cloud-only object detection. In this category, VoxelNet [41] divides the point cloud into voxels, followed by 3D convolutions to extract features. VoteNet [19] devises a novel 3D proposal mechanism using deep Hough voting, before H3DNet [40] makes further investigations on geometric primitives. In addition, MLCVNet [35] focuses more on contextual information aggregation based on VoteNet, and PointGNN [26] exploits graph learning methods for point cloud detection. We show that our novel Transformer-based model, Pointformer, can be used as a drop-in replacement in a voxel-based detector, CBGS [42], and in point-based detectors, VoteNet [19] and PointRCNN [25].

3. Pointformer

Feature learning for 3D point clouds needs to confront their irregular and unordered nature as well as their varying size. Prior work utilizes simple symmetric functions, e.g., point-wise feedforward networks with pooling functions [21, 22], or resorts to techniques from graph neural networks by aggregating information from the local neighborhood [34]. However, the former is not effective at incorporating local context-dependent features beyond the capability of the simple symmetric functions; the latter focuses on the message passing between the center point and its neighbors while neglecting the feature correlations among the neighbor points. Additionally, global representations are also informative but rarely used in 3D object detection tasks.

In this paper, we design Transformer-based modules for point set operations which not only increase the expressiveness of extracting local features, but incorporate global information into point representations as well. As shown in Figure 2, a Pointformer block mainly consists of three parts: Local Transformer (LT), Local-Global Transformer (LGT) and Global Transformer (GT). For each block, LT first receives the output from its previous block (high resolution) and extracts features for a new set with fewer elements (low resolution). Then, LGT uses a multi-scale cross-attention mechanism to integrate features from both resolutions. Lastly, GT is adopted to capture context-aware representations. As for the up-sampling block, we follow PointNet++ and adopt the feature propagation module for its simplicity.
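To make the composition of a block concrete, the following is a minimal, runnable PyTorch sketch of the LT → LGT → GT ordering described above. It only illustrates the data flow: the down-sampling is a plain random subset and the local stage attends over the sampled keypoints directly, whereas the actual Local Transformer uses FPS with ball-query neighborhoods (Sec. 3.2); all class and argument names are assumptions, not the authors' released code.

```python
# Illustrative sketch of one Pointformer block (LT -> LGT -> GT), not the
# reference implementation. Down-sampling and the "local" stage are simplified.
import torch
import torch.nn as nn


class PointformerBlockSketch(nn.Module):
    def __init__(self, dim=64, n_keypoints=256, n_heads=4):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.local_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)   # LT stand-in
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)   # LGT
        self.global_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # GT

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) coordinates, feats: (B, N, C) features from the previous block.
        B, N, _ = xyz.shape
        idx = torch.randperm(N, device=xyz.device)[: self.n_keypoints]  # stand-in for FPS
        key_xyz = xyz[:, idx]
        key_feats = feats[:, idx] + self.pos_mlp(key_xyz)   # crude positional information

        # LT stand-in: self-attention among the sampled keypoints (low resolution).
        key_feats = key_feats + self.local_attn(key_feats, key_feats, key_feats)[0]
        # LGT: keypoints (queries) attend to the higher-resolution points (keys/values).
        hi = feats + self.pos_mlp(xyz)
        key_feats = key_feats + self.cross_attn(key_feats, hi, hi)[0]
        # GT: scene-level self-attention over all keypoints.
        key_feats = key_feats + self.global_attn(key_feats, key_feats, key_feats)[0]
        return key_xyz, key_feats


if __name__ == "__main__":
    block = PointformerBlockSketch()
    new_xyz, new_feats = block(torch.randn(2, 1024, 3), torch.randn(2, 1024, 64))
    print(new_xyz.shape, new_feats.shape)  # torch.Size([2, 256, 3]) torch.Size([2, 256, 64])
```

In the full backbone, several such blocks are stacked at decreasing resolutions and followed by PointNet++-style feature propagation for up-sampling, as described above.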
3.1. Background

We first revisit the general formulation of the Transformer model. Let F = {f_i} and X = {x_i} denote a set of input features and their positions, where f_i and x_i represent the feature and position of token i, respectively. A Transformer block then comprises a multi-head self-attention module and a feedforward network:

q_i^{(m)} = f_i W_q^{(m)}, \quad k_i^{(m)} = f_i W_k^{(m)}, \quad v_i^{(m)} = f_i W_v^{(m)},    (1)

y_i^{(m)} = \sum_j \sigma\left( q_i^{(m)} k_j^{(m)\top} / \sqrt{d} + \mathrm{PE}(x_i, x_j) \right) v_j^{(m)},    (2)

y_i = f_i + \mathrm{Concat}(y_i^{(0)}, y_i^{(1)}, \ldots, y_i^{(M-1)}),    (3)

o_i = y_i + \mathrm{FFN}(y_i),    (4)

where W_q, W_k, W_v are the projections for query, key and value, m indexes the M attention heads, and d is the feature dimension. PE(·) is the positional encoding function for the input positions, and FFN(·) denotes a position-wise feedforward network. σ(·) is a normalization function, for which SoftMax is mostly adopted.

In the following sections, for simplicity, we use

O = \mathrm{Transblock}(F, \mathrm{PE}(X)),    (5)

to represent the basic Transformer block (Eq. (1)–Eq. (4)). Readers can refer to [30] for further details.
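As a concrete reading of Eqs. (1)–(4), the sketch below implements a single-head version of the block in PyTorch, adding a pairwise positional term to the attention logits before the softmax as in Eq. (2); the form of PE(·) anticipates the relative encoding of Sec. 3.6 and is reduced to a scalar bias per pair for brevity. This is an illustrative interpretation, not the authors' code.

```python
import math
import torch
import torch.nn as nn


class TransblockSketch(nn.Module):
    """Single-head reading of Eqs. (1)-(4); PE(x_i, x_j) is a learned scalar
    bias on the attention logits (a simplification for illustration)."""

    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)   # W_q of Eq. (1)
        self.wk = nn.Linear(dim, dim, bias=False)   # W_k
        self.wv = nn.Linear(dim, dim, bias=False)   # W_v
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats, xyz):
        # feats: (B, N, C) token features F; xyz: (B, N, 3) token positions X.
        q, k, v = self.wq(feats), self.wk(feats), self.wv(feats)
        logits = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))    # (B, N, N)
        rel = xyz.unsqueeze(2) - xyz.unsqueeze(1)                 # pairwise x_i - x_j
        logits = logits + self.pos_mlp(rel).squeeze(-1)           # + PE(x_i, x_j), Eq. (2)
        y = feats + torch.softmax(logits, dim=-1) @ v             # residual of Eq. (3)
        return y + self.ffn(y)                                    # Eq. (4)
```

Stacking M heads and L such layers recovers the Transblock(F, PE(X)) operator of Eq. (5) used throughout the rest of this section.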
3.2. Local Transformer

In order to build a hierarchical representation for a point cloud scene, we follow the common methodology of building feature learning blocks at different resolutions [22]. Given an input point cloud P = {x_1, x_2, ..., x_N}, we first use furthest point sampling (FPS) to choose a subset of points {x_{c_1}, x_{c_2}, ..., x_{c_{N'}}} as a set of centroids. For each centroid, ball query is applied to gather K points in the local region within a given radius. We then group the features around each centroid and feed them as a point sequence to a Transformer layer, as shown in Figure 3. Let {x_i, f_i}_t denote the local region of the t-th centroid, where x_i ∈ R^3 and f_i ∈ R^C represent the coordinates and features of the i-th point in the group, respectively. Subsequently, a shared L-layer Transformer block is applied to all local regions, receiving the input {x_i, f_i}_t as follows:

f_i^{(0)} = \mathrm{FFN}(f_i), \quad \forall i \in \mathcal{N}(x_{c_t}),    (6)

F^{(l+1)} = \mathrm{Transblock}(F^{(l)}, \mathrm{PE}(X)), \quad l = 0, \ldots, L-1,    (7)

where F = {f_i | i ∈ N(x_{c_t})} and X = {x_i | i ∈ N(x_{c_t})} denote the set of features and coordinates in the local region with centroid x_{c_t}.
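A minimal sketch of Eqs. (6)–(7) is given below: a naive ball-query grouping (quadratic in the number of points, for clarity only) followed by one shared attention layer per local region and a max-pooling aggregation as depicted in Figure 3. It assumes the centroid indices come from FPS, omits the positional term and the L-layer stacking, and uses hypothetical helper names.

```python
import torch
import torch.nn as nn


def ball_query_group(xyz, centroids, radius=0.4, k=16):
    """Naive ball query: for each centroid, indices of up to k points within
    `radius`. xyz: (N, 3), centroids: (N', 3); returns (N', k) indices."""
    dist = torch.cdist(centroids, xyz)                        # (N', N)
    dist = dist.masked_fill(dist > radius, float("inf"))      # drop points outside the ball
    knn = dist.topk(k, largest=False).indices                 # k closest candidates
    valid = dist.gather(1, knn).isfinite()
    nearest = dist.argmin(dim=1, keepdim=True).expand(-1, k)  # pad short balls with the nearest point
    return torch.where(valid, knn, nearest)


class LocalTransformerSketch(nn.Module):
    """One shared attention layer applied to every local region, Eqs. (6)-(7)."""

    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        self.input_ffn = nn.Linear(dim, dim)                                # Eq. (6)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)   # one Transblock layer
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz, feats, centroid_idx):
        # xyz: (N, 3), feats: (N, C), centroid_idx: (N',) indices chosen by FPS.
        centroids = xyz[centroid_idx]
        group_idx = ball_query_group(xyz, centroids)      # (N', k)
        g = self.input_ffn(feats)[group_idx]              # (N', k, C) grouped features
        g = g + self.attn(g, g, g)[0]                     # shared weights across regions
        g = g + self.ffn(g)
        # Aggregate each region into a single centroid feature (max-pooling, Figure 3).
        return centroids, g.max(dim=1).values             # (N', 3), (N', C)
```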
Compared to the existing local feature extraction modules in [36, 37, 29], the proposed Local Transformer has several advantages. First, the dense self-attention operation in the Transformer block greatly enhances its expressiveness. Several graph learning based approaches can be approximated as special cases of the LT module with a carefully designed learned parameter space. For instance, a generalized graph feature learning function can be formulated as:

e_{ij} = \mathrm{FFN}(\mathrm{FFN}(x_i \oplus x_j) + \mathrm{FFN}(f_i \oplus f_j)),    (8)

f_i' = \mathcal{A}(\sigma(e_{ij}) \times \mathrm{FFN}(f_j), \ \forall j \in \mathcal{N}(x_i)),    (9)

where most models utilize summation as the aggregation function A and the operation ⊕ is chosen from {Concatenation, Plus, Inner-product}. Therefore, the edge function e_{ij} is at most a quadratic function of {x_i, x_j, f_i, f_j}. For a one-layer Transformer block, the learning module can be formulated with the inner-product self-attention mechanism as follows:

e_{ij} = \frac{f_i W_q W_k^\top f_j^\top}{\sqrt{d}} + \mathrm{PE}(x_i, x_j),    (10)

f_i' = \mathcal{A}(\sigma(e_{ij}) \times \mathrm{FFN}(f_j), \ \forall j \in \mathcal{N}(x_i)),    (11)

where d is the feature dimension of f_i and f_j. We can observe that this edge function is also a quadratic function of {x_i, x_j, f_i, f_j}. With a sufficient number of layers in the FFNs, the graph-based feature learning module has the same expressive power as a one-layer Transformer encoder. When it comes to Pointformer, as we stack more Transformer layers in the block, the expressiveness of our module is further increased and better representations can be extracted.

Moreover, feature correlations among the neighbor points are also considered, which are commonly omitted in other models. Under some circumstances, neighbor points can be even more informative than the centroid point. Therefore, by leveraging message passing among all points, features in the local region are considered equally, which makes the local feature extraction module more effective.

3.3. Coordinate Refinement

Furthest point sampling (FPS) is widely used in many point cloud frameworks, as it generates relatively uniformly sampled points while keeping the original shape, which ensures that a large fraction of the points can be covered with a limited number of centroids. However, there are two main issues with FPS: (1) it is notoriously sensitive to outlier points, leading to high instability, especially when dealing with real-world point cloud data; (2) the sampled points must be a subset of the original point cloud, which makes it challenging to infer the original geometric information when objects are partially occluded or not enough points of an object are captured. Considering that points are mostly captured on the surfaces of objects, the second issue can become more critical, as proposals are generated from the sampled points, resulting in a natural gap between the proposal and the ground truth.
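For reference, the plain greedy FPS that the discussion above refers to can be written in a few lines; this is the generic textbook form (O(N'·N), far slower than the CUDA kernels used in practice), not code from any particular detector.

```python
import torch


def furthest_point_sampling(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Greedy FPS over an (N, 3) point set; returns indices of n_samples centroids."""
    N = xyz.shape[0]
    selected = torch.zeros(n_samples, dtype=torch.long, device=xyz.device)
    min_dist = torch.full((N,), float("inf"), device=xyz.device)
    selected[0] = torch.randint(N, (1,), device=xyz.device)   # arbitrary seed point
    for i in range(1, n_samples):
        d = torch.linalg.norm(xyz - xyz[selected[i - 1]], dim=1)
        min_dist = torch.minimum(min_dist, d)                 # distance to nearest chosen centroid
        selected[i] = torch.argmax(min_dist)                  # pick the point furthest from all centroids
    return selected


centroid_idx = furthest_point_sampling(torch.randn(2048, 3), 512)
```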
To overcome the aforementioned drawbacks, we propose a point coordinate refinement module based on the self-attention maps. As shown in Figure 3, we first take the self-attention map of the last Transformer layer for each attention head. Then, we average the attention maps over the heads and use the row corresponding to the centroid point as a weight vector:

W = \frac{1}{M} \sum_{m=1}^{M} A^{(m)}_{0,:},    (12)

where M is the number of attention heads and A^{(m)} is the attention map of the m-th head. Lastly, the refined centroid coordinates are computed as the weighted average of all points in the local region:

x'_{c_t} = \sum_{k=1}^{K} w_k x_k,    (13)

where w_k is the k-th entry of W. With the proposed coordinate refinement module, centroid points adaptively move closer to object centers. Moreover, by reusing the self-attention map, the module introduces little computational cost and no additional learnable parameters, making the refinement process efficient.
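Given the per-head attention maps of the last LT layer for one local region, Eqs. (12)–(13) amount to the following small function (a sketch; it assumes row 0 of each map corresponds to the centroid token, as in the formulation above).

```python
import torch


def refine_centroid(attn_maps: torch.Tensor, group_xyz: torch.Tensor) -> torch.Tensor:
    """attn_maps: (M, K, K) softmax attention maps of the last Transformer layer
    (M heads over a K-point local region); group_xyz: (K, 3) point coordinates.
    Returns the refined centroid coordinate of Eqs. (12)-(13)."""
    w = attn_maps.mean(dim=0)[0]   # Eq. (12): average over heads, take the centroid row
    return w @ group_xyz           # Eq. (13): weighted average of the region's points


# e.g. 4 heads over a 16-point ball; rows of a softmax map already sum to one.
maps = torch.softmax(torch.randn(4, 16, 16), dim=-1)
new_centroid = refine_centroid(maps, torch.randn(16, 3))
```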
3.4. Global Transformer

Global information representing scene context and feature correlations between different objects is also valuable for detection tasks. Prior work using PointNet++ [22] or sparse 3D convolutions to extract high level features for 3D point clouds enlarges the receptive field as the depth of the network increases. However, this has limitations for modeling long-range interactions.

As a remedy, we leverage the power of Transformer modules for modeling non-local relations and propose a Global Transformer to achieve message passing through the whole point cloud. Specifically, all points are gathered into a single group P, which serves as input to a Transformer module. The formulation for GT is summarized as follows:

f_i^{(0)} = \mathrm{FFN}(f_i), \quad \forall i \in P,    (14)

F^{(l+1)} = \mathrm{Transblock}(F^{(l)}, \mathrm{PE}(X)), \quad l = 0, \ldots, L-1.    (15)

By applying the Transformer at the scene level, we can capture context-aware representations and promote message passing among different objects. Moreover, global representations can be particularly helpful for detecting objects with very few points.

3.5. Local-Global Transformer

The Local-Global Transformer is a key module that combines the local and global features extracted by the LT and GT modules. As shown in Figure 2, the LGT adopts a multi-scale cross-attention module and generates relations between the low resolution centroids and the high resolution points. Formally, we apply cross attention similar to the encoder-decoder attention used in the Transformer. The output of LT serves as the query, and the output of GT from the higher resolution is used as the key and value. With the L-layer Transformer block, the module is formulated as:

f_i^{(0)} = \mathrm{FFN}(f_i), \quad \forall i \in P^l,    (16)

f_j' = \mathrm{FFN}(f_j), \quad \forall j \in P^h,    (17)

F^{(l+1)} = \mathrm{Transblock}(F^{(l)}, F', \mathrm{PE}(X)), \quad l = 0, \ldots, L-1,    (18)

where F' = {f_j'}, and P^l (the keypoints, i.e., the output of LT in Figure 2) and P^h (the input of a Pointformer block in Figure 2) are subsamples of the point cloud P at low and high resolution, respectively. Through the Local-Global Transformer module, all centroid points integrate global information via an attention mechanism, which makes the feature learning of both more effective.

3.6. Positional Encoding

Positional encoding is an integral part of Transformer models, as it is the only mechanism that encodes position information for each token in the input sequence. When adapting Transformers to 3D point cloud data, positional encoding plays an even more critical role, as the coordinates of points are valuable features indicating the local structure. Compared to the techniques used in natural language processing, we propose a simple yet efficient approach. For all Transformer modules, the coordinates of each input point are first mapped to the feature dimension. Then, we subtract the coordinates of the query and key points and use the relative positions for encoding. The encoding function is formalized as:

\mathrm{PE}(x_i, x_j) = \mathrm{FFN}(x_i - x_j).    (19)
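Combining Eqs. (16)–(19), a single-head, single-layer sketch of the Local-Global Transformer could look like the following: low-resolution keypoints act as queries over the higher-resolution point set, with the relative positional term of Eq. (19) added to the logits (reduced here to a scalar bias per pair). This is an assumed reading for illustration, not the reference implementation.

```python
import math
import torch
import torch.nn as nn


class LocalGlobalTransformerSketch(nn.Module):
    """Single-head cross-attention reading of Eqs. (16)-(18) with the relative
    positional encoding of Eq. (19) as a scalar logit bias (illustrative only)."""

    def __init__(self, dim=64):
        super().__init__()
        self.q_ffn = nn.Linear(dim, dim)    # Eq. (16)
        self.kv_ffn = nn.Linear(dim, dim)   # Eq. (17)
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 1))  # Eq. (19)
        self.out_ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, low_xyz, low_feats, high_xyz, high_feats):
        # low_*: (B, N', ...) keypoints from LT; high_*: (B, N, ...) higher-resolution points.
        q = self.wq(self.q_ffn(low_feats))
        k = self.wk(self.kv_ffn(high_feats))
        v = self.wv(self.kv_ffn(high_feats))
        logits = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))     # (B, N', N)
        rel = low_xyz.unsqueeze(2) - high_xyz.unsqueeze(1)         # pairwise relative positions
        logits = logits + self.pos_mlp(rel).squeeze(-1)
        out = low_feats + torch.softmax(logits, dim=-1) @ v        # one layer of Eq. (18)
        return out + self.out_ffn(out)
```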
3.7. Computational Cost Reduction

Since Pointformer is a pure attention model based on Transformer blocks, it suffers from heavy computational overhead: applying a conventional Transformer to a point cloud with n points consumes O(n^2) time and memory, leading to much higher training cost.

Some recent advances in efficient Transformers have mitigated this issue [14, 13, 32, 4, 35], among which Linformer [32] reduces the complexity to O(n) by low-rank factorization of the original attention. Under the hypothesis that the self-attention mechanism is low rank, i.e., the rank of the n × n attention matrix

A = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right),    (20)

is much smaller than n, Linformer projects the n-dimensional keys and values to ones with a lower dimension k ≪ n, where k is close to the rank of A. Therefore, the i-th head of the projected multi-head self-attention is

\mathrm{head}_i = \mathrm{softmax}\left( \frac{Q (E_i K)^\top}{\sqrt{d_k}} \right) F_i V,    (21)

where E_i, F_i ∈ R^{k×n} are projection matrices, which reduces the complexity from O(n^2) to O(kn).

Compared with the Taylor expansion approximation technique used in MLCVNet [35], Linformer is easier to implement in our method. We thus adopt it to replace the Transformer layers in the vanilla Pointformer. In practice, we map the number of points n to k = n/r, where r is a factor controlling the number of projected dimensions. We apply this mapping in the Local Transformer, Global Transformer and Local-Global Transformer blocks. By setting an appropriate factor r for each block, we obtain significant savings in both time and memory with little performance decay.
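A bare-bones single-head version of the projected attention in Eqs. (20)–(21) is sketched below: learnable k × n matrices compress the keys and values before the usual attention, so the attention map has shape n × k instead of n × n. It is a generic Linformer-style layer written for illustration, not the exact layer used in Pointformer.

```python
import math
import torch
import torch.nn as nn


class LinformerHeadSketch(nn.Module):
    """One head of Eq. (21): keys and values are compressed from n tokens to
    proj_k << n tokens by learnable matrices E and F (Linformer-style)."""

    def __init__(self, dim: int, n_tokens: int, proj_k: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.E = nn.Parameter(torch.randn(proj_k, n_tokens) / math.sqrt(n_tokens))  # projects K
        self.F = nn.Parameter(torch.randn(proj_k, n_tokens) / math.sqrt(n_tokens))  # projects V

    def forward(self, feats):
        # feats: (B, n_tokens, dim)
        q, k, v = self.wq(feats), self.wk(feats), self.wv(feats)
        k_proj, v_proj = self.E @ k, self.F @ v                         # (B, proj_k, dim)
        logits = q @ k_proj.transpose(1, 2) / math.sqrt(q.size(-1))     # (B, n, proj_k)
        return torch.softmax(logits, dim=-1) @ v_proj                   # O(n*k) instead of O(n^2)


layer = LinformerHeadSketch(dim=64, n_tokens=1024, proj_k=128)   # e.g. r = 8, so k = n / r
out = layer(torch.randn(2, 1024, 64))                            # (2, 1024, 64)
```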
4. Experimental Results

In this section, we use Pointformer as the backbone for state-of-the-art object detection models and conduct experiments on several indoor and outdoor benchmarks. In Sec. 4.1, we introduce the implementation details of the experiments. In Sec. 4.2 and Sec. 4.3, we show the comparison results on outdoor and indoor datasets, respectively. In Sec. 4.4, we conduct extensive ablation studies to analyze our proposed Pointformer model. Finally, we show qualitative results in Sec. 4.5. More analysis and visualizations are provided in the appendix.

4.1. Experimental Setup

Datasets. We adopt SUN RGB-D [27] and ScanNet V2 [7] as indoor 3D detection benchmarks. SUN RGB-D has 5K training images annotated with oriented 3D bounding boxes for 37 object categories, and ScanNet V2 has 1513 labeled scenes with 40 semantic classes. We follow the same setting as VoteNet [19] and report performance on 10 classes for SUN RGB-D and 18 classes for ScanNet V2. For outdoor datasets, we choose KITTI [10] and nuScenes [1] for evaluation. KITTI contains 7,481 training samples and 7,518 test samples for autonomous driving. NuScenes contains 1k different scenes with 40K key frames, covering 23 categories and 8 attributes. We follow the evaluation protocols released along with the datasets.

Experimental setups. We use Pointformer as the backbone for three 3D detection models: VoteNet [19], PointRCNN [25] and CBGS [42]. VoteNet is a point-based approach for indoor datasets, while PointRCNN and CBGS are adopted for outdoor datasets. PointRCNN is a classic approach for autonomous driving detection, and CBGS is the champion of the nuScenes 3D detection challenge held at CVPR 2019. For a fair comparison, we adopt the same detection head, number of points at each resolution, hyperparameters and training configurations as the baseline models.

4.2. Outdoor Datasets

KITTI. We first evaluate our method against PointRCNN on KITTI's 3D detection benchmark. PointRCNN uses PointNet++ as its backbone with four set abstraction layers. We adopt the same architecture, while switching the set abstraction layers in PointNet++ to the proposed Transformer blocks. The comparison results on the KITTI test server are shown in Table 1.

                          Car (IoU=0.7)             Pedestrian (IoU=0.5)     Cyclist (IoU=0.5)
Method          Modality  Easy   Moderate  Hard     Easy   Moderate  Hard    Easy   Moderate  Hard
PointRCNN [25]  LiDAR     85.94  75.76     68.32    49.43  41.78     38.63   73.93  59.60     53.59
+ Pointformer   LiDAR     87.13  77.06     69.25    50.67  42.43     39.60   75.01  59.80     53.99

Table 1. Performance comparison of PointRCNN with and without Pointformer on the KITTI test split, obtained by submitting to the official test server. The evaluation metric is Average Precision (AP) with IoU threshold 0.7 for car and 0.5 for pedestrian/cyclist.

                 Car (IoU=0.7)
Method           Easy   Moderate  Hard
PointRCNN        88.88  78.63     77.38
+ Pointformer    90.05  79.65     78.89

Table 3. Performance comparison of PointRCNN with and without Pointformer on the car class of the KITTI val split.

For the car category, we also report the performance of 3D detection on the val split, as shown in Table 3. As we can observe, by adopting Pointformer, our model achieves consistent improvements compared to the original PointRCNN. Especially at the hard difficulty, our method shows the most promising result with a 1.5% AP improvement.
We believe the better performance on hard objects is attributable to the higher expressiveness of the Local Transformer module. For hard objects, which are often small or occluded, GT captures context-dependent region features, which contributes to the bounding box regression and classification.

Additionally, we evaluate the performance of the proposal generation network by calculating the recall of 3D bounding boxes with various numbers of proposals and 3D IoU thresholds. As shown in Table 4, our backbone module significantly enhances the performance of the proposal generation network under almost all settings. Analyzing the figures vertically, we observe that our backbone shows better performance when the number of RoIs is relatively small. As stated in Sec. 3, the GT and LGT help to capture context-aware representations and model the relations among different objects (proposals). This provides additional references for locating and reasoning about the bounding boxes. Therefore, despite the lack of RoIs, we can still improve the performance of the proposal generation module and achieve higher recall.

       Recall (IoU=0.5)               Recall (IoU=0.7)
RoIs   PointRCNN  +Pointformer        PointRCNN  +Pointformer
10     86.66      87.51               29.87      35.46
50     96.01      96.52               40.28      42.45
100    96.79      96.91               74.81      75.82
200    98.03      97.99               76.29      76.51

Table 4. Recall of the proposal generation network with different numbers of RoIs and 3D IoU thresholds for the car class on the val split at moderate difficulty.

NuScenes. We also validate the effectiveness of Pointformer on the nuScenes dataset, which greatly extends KITTI in dataset size, number of object categories and number of annotated objects. Furthermore, nuScenes suffers from severe class imbalance, making the detection task more difficult and challenging. Here we adopt CBGS, the champion of the nuScenes 3D detection challenge held at CVPR 2019, as the baseline model and show the comparison results when replacing its backbone with Pointformer. We summarize the results in Table 2. As we can observe, by utilizing Pointformer as the backbone, our model achieves 0.8 higher mAP than the baseline. For 8 of the 10 classes, our model shows better performance, which demonstrates the effectiveness of Pointformer on larger and more challenging datasets.

Method         Modality  Car   Ped   Bus   Barrier  TC    Truck  Trailer  Moto  Cons. Veh.  Bicycle  mAP
CBGS [42]      LiDAR     81.1  80.1  54.9  65.7     70.9  48.5   42.9     51.5  10.5        22.3     52.8
+ Pointformer  LiDAR     82.3  81.8  55.6  66.0     72.2  48.1   43.4     55.0  8.6         22.7     53.6

Table 2. Performance comparison of CBGS with and without Pointformer on the nuScenes benchmark.

4.3. Indoor Datasets

We evaluate our Pointformer accompanied by VoteNet [19] on SUN RGB-D and ScanNet V2. We follow the same hyperparameters for the backbone structure as VoteNet. After the Pointformer blocks, two feature propagation (FP) modules proposed in PointNet++ [22] serve as upsamplers to increase the resolution for the subsequent detection heads.

Method         bathtub  bed   bookshelf  chair  desk  dresser  nightstand  sofa  table  toilet  mAP
VoteNet [19]   74.4     83.0  28.8       75.3   22.0  29.8     62.2        64.0  47.3   90.1    57.7
VoteNet*       75.5     85.6  32.0       77.4   24.8  27.9     58.6        67.4  51.1   90.5    59.1
+ Pointformer  80.1     84.3  32.0       76.2   27.0  37.4     64.0        64.9  51.5   92.2    61.1

Table 5. Performance comparison of VoteNet with and without Pointformer on the SUN RGB-D validation set. The evaluation metric is Average Precision with 0.25 IoU threshold. * denotes the model implemented in MMDetection3D [5].

Method         cab   bed   chair  sofa  table  door  wind  bkshf  pic   cntr  desk  curt  fridg  showr  toil  sink  bath  ofurn  mAP
VoteNet [19]   36.3  87.9  88.7   89.6  58.8   47.3  38.1  44.6   7.8   56.1  71.7  47.2  45.4   57.1   94.9  54.7  92.1  37.2   58.6
VoteNet*       47.7  88.7  89.5   89.3  62.1   54.1  40.8  54.3   12.0  63.9  69.4  52.0  52.5   73.3   95.9  52.0  95.1  42.4   62.9
+ Pointformer  46.7  88.4  90.5   88.7  65.7   55.0  47.7  55.8   18.0  63.8  69.1  55.4  48.5   66.2   98.9  61.5  86.7  47.4   64.1

Table 6. Performance comparison of VoteNet with and without Pointformer on the ScanNet V2 validation set. The evaluation metric is Average Precision with 0.25 IoU threshold. * denotes the model implemented in MMDetection3D [5].

SUN RGB-D. We report the average precision (AP) over 10 common classes in SUN RGB-D, as shown in Table 5. Compared with the PointNet++ [22] backbone in VoteNet [19], our Pointformer provides a significant boost of 2% mAP over the implementation in MMDetection3D [5]. On some categories with large and complex objects such as dresser and bathtub, Pointformer shows a strong capability for extracting non-local information, with a sharp increase of over 5% AP, which we attribute to the GT module in Pointformer.

ScanNet V2. We report the average precision (AP) over 18 classes in ScanNet V2, as shown in Table 6. Compared with VoteNet, Pointformer outperforms the original version by 1.2% mAP with MMDetection3D.

4.4. Ablation Study

In this section, we conduct extensive ablation experiments to analyze the effectiveness of the different components of Pointformer.
All experiments are trained on the train split with the PointRCNN detection head and evaluated on the val split with the car class.

Effects of each component. We validate the effectiveness of each Transformer component and the coordinate refinement module, and summarize the results in Table 7. The first row corresponds to the PointRCNN baseline and the last row is the full Pointformer model. By comparing the first and second rows, we observe that easy objects benefit more from the Local Transformer, with a 0.6 AP improvement. By comparing the second and fourth rows, we see that the Global Transformer is more suitable for hard objects, with a 0.9 AP improvement. This observation is consistent with our analysis in Sec. 4.2. As for the Local-Global Transformer and coordinate refinement, the improvement is similar across the three difficulty settings.

                         Car (IoU=0.7)
   LT  GT  LGT  CoRe     Easy   Moderate  Hard
1  -   -   -    -        88.88  78.63     77.38
2  X   -   -    -        89.46  78.91     77.65
3  X   -   -    X        89.76  79.24     78.43
4  X   X   -    -        89.68  79.22     78.52
5  X   X   X    -        89.82  79.34     78.62
6  X   X   X    X        90.05  79.65     78.89

Table 7. Effects of each component on the val split of KITTI. CoRe denotes the coordinate refinement module.

Positional Encoding. Playing a critical role in Transformers, positional encoding can have a huge impact on the learned representation. As shown in Table 8, we compare the performance of Pointformer without positional encoding and with two approaches to positional encoding (adding or concatenating the positional encoding with the attention map). We observe that Pointformer without positional encoding suffers from a huge performance drop, as the coordinates of points capture the local geometric information.

                        Car (IoU=0.7)
   Positional Encoding  Easy   Moderate  Hard
1  -                    85.42  75.67     72.34
2  X                    90.05  79.65     78.89

Table 8. Effects of positional encoding on the val split of KITTI.

4.5. Qualitative Results and Discussion

Qualitative results on SUN RGB-D. Figure 4 shows representative examples of detection results on SUN RGB-D with VoteNet + Pointformer. As we can observe, our model achieves robust results despite the challenges of clutter and scanning artifacts. Additionally, our model can even recognize objects missing from the ground truth. For instance, the dresser in the left scene is only partially observed by the sensor; nevertheless, our model can still generate a precise proposal for the object with a proper bounding box size. Similar results are shown in the right scene, where the table in the front suffers from clutter because of the books on it.

Figure 4. Qualitative results of 3D object detection on SUN RGB-D. From left to right: original scene image, our model's prediction, and annotated ground truth boxes.

Inspecting Pointformer with attention maps. To validate how the modules in Pointformer affect the learned point features, we visualize the attention maps from the GT module of the second-to-last Pointformer block. We show the attention of particular points in Figure 5. The second row shows the 50, 100 and 200 points with the highest attention values towards the point marked with a star. We can observe that Pointformer first focuses on the local region of the same object, then spreads the attention to other regions, and finally attends to points from other objects globally. The overall attention map shows the average attention weights of all the points in the scene, indicating that our model mostly focuses on points on the objects. These visualization results show that Pointformer can capture local and global dependencies, and enhance message passing at both the object and scene levels.

Figure 5. Visualization results of the attention maps. In the top-k attention maps, darker color indicates larger attention weight; in the overall attention map, red indicates large values.

5. Conclusion

This paper introduces Pointformer, a highly effective feature learning backbone for 3D point clouds that is permutation invariant to points in the input and learns local and global context-aware representations. We apply Pointformer as a drop-in replacement backbone for state-of-the-art 3D object detectors and show significant performance improvements on several benchmarks including both indoor and outdoor datasets.

Compared to the classification and segmentation tasks (including part segmentation and semantic segmentation) addressed in prior work, 3D object detection typically involves more points (4×–16×) in a scene, which makes it harder for Transformer-based models. For future work, we would like to explore extensions to these two tasks and to other 3D tasks such as shape completion and normal estimation.

Acknowledgments

This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant 2018AAA0100701, the National Natural Science Foundation of China under Grants 61906106 and 62022048, the Institute for Guo Qiang of Tsinghua University, and the Beijing Academy of Artificial Intelligence.
References

[1] H. Caesar, Varun Bankiti, A. Lang, Sourabh Vora, Venice Erin Liong, Q. Xu, A. Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. CVPR, pages 11618–11628, 2020.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, May 2020.
[3] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. ICML, 2020.
[4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2020.
[5] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.
[6] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
[7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
[10] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[11] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, June 2018.
[12] Ji Hou, Angela Dai, and Matthias Niessner. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In CVPR, June 2019.
[13] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. ArXiv, abs/2006.16236, 2020.
[14] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. ICLR, 2020.
[15] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[16] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, ICML, volume 97 of Proceedings of Machine Learning Research, pages 3744–3753, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[17] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In CVPR, June 2019.
[18] Youngmin Park, Vincent Lepetit, and W. Woo. Multiple 3d object tracking for augmented reality. 2008 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 117–120, 2008.
[19] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In CVPR, pages 9277–9286, 2019.
[20] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In CVPR, 2018.
[21] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
[22] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pages 5099–5108, 2017.
[23] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, I. Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[24] Shaoshuai Shi, Chaoxu Guo, L. Jiang, Zhe Wang, Jianping Shi, X. Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. CVPR, pages 10526–10535, 2020.
[25] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, June 2019.
[26] Weijing Shi and Raj Rajkumar. Point-gnn: Graph neural network for 3d object detection in a point cloud. In CVPR, pages 1711–1719, 2020.
[27] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In CVPR, pages 567–576, 2015.
[28] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In CVPR, June 2016.
[29] H. Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, B. Marcotegui, F. Goulette, and L. Guibas. Kpconv: Flexible and deformable convolution for point clouds. ICCV, pages 6410–6419, 2019.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
[31] Sourabh Vora, Alex H. Lang, Bassam Helou, and Oscar Beijbom. Pointpainting: Sequential fusion for 3d object detection. In CVPR, June 2020.
[32] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and
Hao Ma. Linformer: Self-attention with linear complexity.
arXiv preprint arXiv:2006.04768, 2020.
[33] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei
Pokrovsky, and Raquel Urtasun. Deep parametric continu-
ous convolutional neural networks. In CVPR, June 2018.
[34] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma,
Michael M Bronstein, and Justin M Solomon. Dynamic
graph cnn for learning on point clouds. Acm Transactions
On Graphics (tog), 38(5):1–12, 2019.
[35] Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming
Zhang, Kai Xu, and Jun Wang. Mlcvnet: Multi-level con-
text votenet for 3d object detection. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, pages 10447–10456, 2020.
[36] Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang,
and U. Neumann. Grid-gcn for fast and scalable point cloud
learning. CVPR, pages 5660–5669, 2020.
[37] Xu Yan, C. Zheng, Zhuguo Li, S. Wang, and Shuguang Cui.
Pointasnl: Robust point clouds processing using nonlocal
neural networks with adaptive sampling. CVPR, pages 5588–
5597, 2020.
[38] Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li,
Jinxian Liu, Mengdie Zhou, and Qi Tian. Modeling point
clouds with self-attention and gumbel subset sampling. In
CVPR, June 2019.
[39] M. Ye, Shuangjie Xu, and Tongyi Cao. Hvnet: Hybrid voxel
network for lidar based 3d object detection. CVPR, pages
1628–1637, 2020.
[40] Zaiwei Zhang, Bo Sun, Haitao Yang, and Qixing Huang.
H3dnet: 3d object detection using hybrid geometric primi-
tives. In European Conference on Computer Vision, pages
311–329. Springer, 2020.
[41] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning
for point cloud based 3d object detection. In CVPR, June
2018.
[42] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li,
and Gang Yu. Class-balanced Grouping and Sampling for
Point Cloud 3D Object Detection. arXiv e-prints, page
arXiv:1908.09492, Aug 2019.
[43] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang,
and Jifeng Dai. Deformable detr: Deformable transformers
for end-to-end object detection, 2020.
