3D Object Detection with Pointformer
Xuran Pan1*  Zhuofan Xia1*  Shiji Song1  Li Erran Li2†  Gao Huang1‡
1 Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist), Beijing, China
2 Alexa AI, Amazon / Columbia University
{pxr18, xzf20}@mails.tsinghua.edu.cn, [email protected],
{shijis, gaohuang}@tsinghua.edu.cn
[Figure 2 diagram: a scene point cloud is processed by multi-scale Pointformer blocks (each containing multiscale cross-attention), followed by upsampling in the feature learning module and detection heads that output 3D boxes.]
Figure 2. The Pointformer backbone for 3D object detection in point clouds. A basic feature learning block consists of three parts: a Local
Transformer to model interactions in the local region; a Local-Global Transformer to integrate local features with global information; a
Global Transformer to capture context-aware representations at the scene level.
Hybrid approaches [41, 15, 39, 24] attempt to combine both voxel-based and point-based representations. [41, 15] leverage PointNet features at the voxel level and at the column-of-voxels (pillar) level, respectively. [39, 24] deeply integrate voxel features and PointNet features at the scene level. However, the fundamental difference between the two representations could pose a limit on the effectiveness of these approaches for 3D point-cloud feature learning.

To address the above limitations, we resort to Transformer [30] models, which have achieved great success in the field of natural language processing. Transformer models [8] are very effective at learning context-dependent representations and capturing long-range dependencies in the input sequence. The Transformer and its associated self-attention mechanism not only meet the demand for permutation invariance, but are also proven to be highly expressive. Specifically, [6] proves that self-attention is at least as expressive as convolution. Recently, self-attention has been successfully applied to classification [23] and 2D object detection [2] in computer vision. However, a straightforward application of the Transformer to 3D point clouds is prohibitively expensive because the computation cost grows quadratically with the input size.

To this end, we propose Pointformer, a backbone for 3D point clouds that learns features more effectively by leveraging the superiority of Transformer models on set-structured data. As shown in Figure 2, Pointformer is a U-Net structure with multi-scale Pointformer blocks. A Pointformer block consists of Transformer-based modules that are both expressive and friendly to the 3D object detection task. First, a Local Transformer (LT) module is employed to model interactions among points in the local region, which learns context-dependent region features at an object level. Second, a coordinate refinement module is proposed to adjust the centroids sampled by Furthest Point Sampling (FPS), which improves the quality of the generated object proposals. Third, we propose the Local-Global Transformer (LGT) to integrate local features with global features from the higher resolution. Finally, the Global Transformer (GT) module is designed to learn context-aware representations at the scene level. As illustrated in Figure 1, Pointformer can capture both local and global dependencies, thus boosting the performance of feature learning for scenes with multiple cluttered objects.

Extensive experiments have been conducted on several detection benchmarks to verify the effectiveness of our approach. We use the proposed Pointformer as the backbone for three object detection models, CBGS [42], VoteNet [19], and PointRCNN [25], and conduct experiments on three indoor and outdoor datasets, SUN-RGBD [27], KITTI [10], and nuScenes [1], respectively. We observe significant improvements over the original models on all experiment settings, which demonstrates the effectiveness of our method.

In summary, we make the following contributions:

• We propose a pure transformer model, Pointformer, which serves as a highly effective feature learning backbone for 3D point clouds. Pointformer is permutation invariant, and local and global context-aware.

• We show that Pointformer can be easily applied as a drop-in replacement backbone for state-of-the-art 3D object detectors for point clouds.

• We perform extensive experiments using Pointformer as the backbone for three state-of-the-art 3D object detectors, and show significant performance gains on several benchmarks including both indoor and outdoor datasets. This demonstrates the versatility of Pointformer, as 3D object detectors are typically designed and optimized for either indoor or outdoor scenes only.
[Figure 3 diagram: furthest point sampling and grouping, a Transformer block (×2) with self-attention, positional encoding, and FFNs, max-pooling of the aggregated features, and coordinate refinement driven by the attention maps.]
Figure 3. Illustration of the Local Transformer. Input points are first down-sampled by FPS, and local regions are generated by ball query. The Transformer block takes point features and coordinates as input and generates aggregated features for each local region. To further adjust the centroid points, the attention maps from the last Transformer layer are used for coordinate refinement. As a result, points are pushed closer to the object centers instead of the surfaces.
of input features and their positions, where $f_i$ and $x_i$ represent the feature and position of token $i$, respectively. Then, a Transformer block comprises a multi-head self-attention module and a feed-forward network:

$q_i^{(m)} = f_i W_q^{(m)}, \quad k_i^{(m)} = f_i W_k^{(m)}, \quad v_i^{(m)} = f_i W_v^{(m)},$  (1)

$y_i^{(m)} = \sum_j \sigma\big(q_i^{(m)} k_j^{(m)\top} / \sqrt{d} + \mathrm{PE}(x_i, x_j)\big)\, v_j^{(m)},$  (2)

$y_i = f_i + \mathrm{Concat}\big(y_i^{(0)}, y_i^{(1)}, \dots, y_i^{(M-1)}\big),$  (3)

$o_i = y_i + \mathrm{FFN}(y_i),$  (4)

where $W_q$, $W_k$, $W_v$ are the projections for query, key, and value, $m$ is the index over the $M$ attention heads, and $d$ is the feature dimension. $\mathrm{PE}(\cdot)$ is the positional encoding function for the input positions, and $\mathrm{FFN}(\cdot)$ represents a position-wise feed-forward network. $\sigma(\cdot)$ is a normalization function, for which SoftMax is mostly adopted.

In the following sections, for simplicity, we use

$O = \mathrm{Transblock}(F, \mathrm{PE}(X)),$  (5)

to represent the basic Transformer block (Eq. (1)–Eq. (4)). Readers can refer to [30] for further details.

3.2. Local Transformer

In order to build a hierarchical representation for a point cloud scene, we follow the high-level methodology of building feature learning blocks at different resolutions [22]. Given an input point cloud $P = \{x_1, x_2, \dots, x_N\}$, we first use furthest point sampling (FPS) to choose a subset of points $\{x_{c_1}, x_{c_2}, \dots, x_{c_{N'}}\}$ as a set of centroids. For each centroid, ball query is applied to gather $K$ points in the local region within a given radius. Then we group these features around the centroids and feed them as a point sequence to a Transformer layer, as shown in Figure 3. Let $\{x_i, f_i\}_t$ denote the local region of the $t$-th centroid, where $x_i \in \mathbb{R}^3$ and $f_i \in \mathbb{R}^C$ represent the coordinates and features of the $i$-th point in the group, respectively. Subsequently, a shared $L$-layer Transformer block is applied to all local regions, receiving $\{x_i, f_i\}_t$ as input:

$f_i^{(0)} = \mathrm{FFN}(f_i), \quad \forall i \in \mathcal{N}(x_{c_t}),$  (6)

$F^{(l+1)} = \mathrm{Transblock}\big(F^{(l)}, \mathrm{PE}(X)\big), \quad l = 0, \dots, L-1,$  (7)

where $F = \{f_i \mid i \in \mathcal{N}(x_{c_t})\}$ and $X = \{x_i \mid i \in \mathcal{N}(x_{c_t})\}$ denote the sets of features and coordinates in the local region with centroid $x_{c_t}$.
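Below is a minimal PyTorch sketch of the Transformer block of Eq. (1)–(4), i.e. the Transblock of Eq. (5) that Eq. (6)–(7) apply to each local group. The module name, head count, and layer widths are illustrative assumptions rather than the authors' released implementation, and the positional term of Eq. (2) is realized with the relative encoding of Eq. (19).

```python
# Sketch of the Transformer block of Eq.(1)-(4) ("Transblock" in Eq.(5)).
# Hyper-parameters and names are illustrative; PE follows Eq.(19).
import torch
import torch.nn as nn


class TransBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, ffn_dim: int = 256):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.wq = nn.Linear(dim, dim, bias=False)    # W_q in Eq.(1)
        self.wk = nn.Linear(dim, dim, bias=False)    # W_k
        self.wv = nn.Linear(dim, dim, bias=False)    # W_v
        self.pe = nn.Sequential(                     # PE(x_i, x_j) = FFN(x_i - x_j), one bias per head
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, num_heads))
        self.ffn = nn.Sequential(                    # position-wise feed-forward network in Eq.(4)
            nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, f: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # f: (B, N, C) token features, x: (B, N, 3) token coordinates
        B, N, C = f.shape
        q = self.wq(f).view(B, N, self.h, self.d).transpose(1, 2)  # (B, h, N, d)
        k = self.wk(f).view(B, N, self.h, self.d).transpose(1, 2)
        v = self.wv(f).view(B, N, self.h, self.d).transpose(1, 2)
        rel = x.unsqueeze(2) - x.unsqueeze(1)                      # (B, N, N, 3) pairwise x_i - x_j
        bias = self.pe(rel).permute(0, 3, 1, 2)                    # (B, h, N, N) positional term
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5 + bias, dim=-1)  # Eq.(2)
        y = (attn @ v).transpose(1, 2).reshape(B, N, C)            # concatenate heads, Eq.(3)
        y = f + y                                                  # residual connection
        return y + self.ffn(y)                                     # Eq.(4)


# usage: a local group of 16 points with 64-dim features and xyz coordinates
block = TransBlock(dim=64)
out = block(torch.randn(8, 16, 64), torch.rand(8, 16, 3))
print(out.shape)  # torch.Size([8, 16, 64])
```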
Compared to the existing local feature extraction modules in [36, 37, 29], the proposed Local Transformer has several advantages. First, the dense self-attention operation in the Transformer block greatly enhances its expressiveness. Several graph-learning-based approaches can be approximated as special cases of the LT module with a carefully designed, learned parameter space. For instance, a generalized graph feature learning function can be formulated as:

$e_{ij} = \mathrm{FFN}\big(\mathrm{FFN}(x_i \oplus x_j) + \mathrm{FFN}(f_i \oplus f_j)\big),$  (8)

$f_i' = \mathcal{A}\big(\sigma(e_{ij}) \times \mathrm{FFN}(f_j), \ \forall j \in \mathcal{N}(x_i)\big),$  (9)

where most of the models utilize summation as the aggregation function $\mathcal{A}$ and the operation $\oplus$ is chosen from {concatenation, plus, inner product}. Therefore, the edge function $e_{ij}$ is at most a quadratic function of $\{x_i, x_j, f_i, f_j\}$. For a one-layer Transformer block, the learning module can be formulated with the inner-product self-attention mechanism as follows:

$e_{ij} = \frac{f_i W_q W_k^\top f_j^\top}{\sqrt{d}} + \mathrm{PE}(x_i, x_j),$  (10)

$f_i' = \mathcal{A}\big(\sigma(e_{ij}) \times \mathrm{FFN}(f_j), \ \forall j \in \mathcal{N}(x_i)\big),$  (11)

where $d$ is the feature dimension of $f_i$ and $f_j$. We can observe that this edge function is also a quadratic function of $\{x_i, x_j, f_i, f_j\}$. With a sufficient number of layers in the FFNs, the graph-based feature learning module has the same expressive power as a one-layer Transformer encoder. When it comes to Pointformer, as we stack more Transformer layers in the block, the expressiveness of our module is further increased and better representations can be extracted.

Moreover, feature correlations among the neighboring points are also considered, which are commonly omitted in other models. Under some circumstances, neighboring points can be even more informative than the centroid point. Therefore, by leveraging message passing among all points, features in the local region are considered equally, which makes the local feature extraction module more effective.

3.3. Coordinate Refinement

Furthest point sampling (FPS) is widely used in many point cloud frameworks, as it generates relatively uniformly sampled points while keeping the original shape, which ensures that a large fraction of the points can be covered with a limited number of centroids. However, there are two main issues with FPS: (1) It is notoriously sensitive to outlier points, leading to high instability, especially when dealing with real-world point cloud data. (2) Sampled points from FPS must be a subset of the original point cloud, which makes it challenging to infer the original geometric information when objects are partially occluded or not enough points of an object are captured. Considering that points are mostly captured on the surfaces of objects, the second issue may become more critical, as proposals are generated from the sampled points, resulting in a natural gap between the proposal and the ground truth.
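For reference, the following is a naive pure-PyTorch sketch of the furthest point sampling and ball-query grouping that feeds the Local Transformer (Sec. 3.2); it also makes the subset property discussed above explicit, since every centroid is one of the original input points. Function names, the radius, and the group size are illustrative, and an optimized CUDA implementation would be used in practice.

```python
# Reference sketch of FPS + ball query grouping for the Local Transformer.
# Each resulting group is one token sequence for the shared Transformer block.
import torch


def furthest_point_sampling(x: torch.Tensor, n_samples: int) -> torch.Tensor:
    # x: (N, 3) -> indices of n_samples centroids, shape (n_samples,)
    N = x.shape[0]
    idx = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    farthest = torch.randint(0, N, (1,)).item()
    for i in range(n_samples):
        idx[i] = farthest
        d = ((x - x[farthest]) ** 2).sum(-1)   # squared distance to the newest centroid
        dist = torch.minimum(dist, d)          # distance to the nearest chosen centroid
        farthest = int(torch.argmax(dist))     # next centroid: farthest from all chosen ones
    return idx


def ball_query(x: torch.Tensor, centroids: torch.Tensor, radius: float, K: int) -> torch.Tensor:
    # x: (N, 3), centroids: (N', 3) -> (N', K) neighbour indices within `radius`
    d2 = torch.cdist(centroids, x) ** 2
    d2 = torch.where(d2 > radius ** 2, torch.full_like(d2, float("inf")), d2)
    knn = d2.topk(K, dim=-1, largest=False).indices
    # out-of-radius slots fall back to the nearest neighbour (the centroid itself, distance 0)
    invalid = torch.gather(d2, 1, knn).isinf()
    knn[invalid] = knn[:, :1].expand_as(knn)[invalid]
    return knn


# toy point cloud: 1024 points with 32-dim features
xyz, feats = torch.rand(1024, 3), torch.randn(1024, 32)
c_idx = furthest_point_sampling(xyz, n_samples=256)
group_idx = ball_query(xyz, xyz[c_idx], radius=0.2, K=16)      # (256, 16)
group_xyz, group_feat = xyz[group_idx], feats[group_idx]       # (256, 16, 3), (256, 16, 32)
print(group_xyz.shape, group_feat.shape)
```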
To overcome the aforementioned drawbacks, we propose a point coordinate refinement module with the help of the self-attention maps. As shown in Figure 3, we first take out the self-attention map of the last layer of the Transformer block for each attention head. Then, we compute the average of the attention maps and use the particular row corresponding to the centroid point as a weight vector:

$W = \frac{1}{M} \sum_{m=1}^{M} A^{(m)}_{0,:},$  (12)

where $M$ is the number of attention heads and $A^{(m)}$ is the attention map of the $m$-th head. Lastly, the refined centroid coordinates are computed as the weighted average of all points in the local region:

$x'_{c_t} = \sum_{k=1}^{K} w_k x_k,$  (13)

where $w_k$ is the $k$-th entry of $W$. With the proposed coordinate refinement module, centroid points adaptively move closer to object centers. Moreover, by utilizing the self-attention map, our module introduces little computational cost and no additional learnable parameters, making the refinement process more efficient.
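A minimal sketch of Eq. (12)–(13), assuming the centroid is stored as token 0 of its local group and that the last-layer attention maps are already row-normalized by SoftMax; the tensor shapes are illustrative.

```python
# Coordinate refinement of Eq.(12)-(13): head-averaged centroid attention row
# re-weights the coordinates of the K points in each local region.
import torch


def refine_centroids(attn: torch.Tensor, group_xyz: torch.Tensor) -> torch.Tensor:
    """attn: (G, M, K, K) last-layer self-attention maps for G groups and M heads;
    group_xyz: (G, K, 3) coordinates of each local region, centroid at index 0."""
    w = attn.mean(dim=1)[:, 0, :]                      # Eq.(12): average heads, take centroid row -> (G, K)
    return torch.einsum("gk,gkc->gc", w, group_xyz)    # Eq.(13): weighted average of coordinates


# toy example: 256 groups, 4 heads, 16 points per group
attn = torch.softmax(torch.randn(256, 4, 16, 16), dim=-1)
group_xyz = torch.rand(256, 16, 3)
print(refine_centroids(attn, group_xyz).shape)  # torch.Size([256, 3])
```

Because each attention row sums to one, the head-averaged centroid row is itself a valid set of weights, which is why the refinement adds no learnable parameters.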
3.4. Global Transformer

Global information representing the scene context and the feature correlations between different objects is also valuable for detection tasks. Prior work uses PointNet++ [22] or sparse 3D convolution to extract high-level features for 3D point clouds, enlarging the receptive field as the depth of the network increases. However, this has limitations for modeling long-range interactions.

As a remedy, we leverage the power of Transformer modules for modeling non-local relations and propose a Global Transformer to achieve message passing through the whole point cloud. Specifically, all points are gathered into a single group $P$, which serves as the input to a Transformer module. The formulation of GT is summarized as follows:

$f_i^{(0)} = \mathrm{FFN}(f_i), \quad \forall i \in P,$  (14)

$F^{(l+1)} = \mathrm{Transblock}\big(F^{(l)}, \mathrm{PE}(X)\big), \quad l = 0, \dots, L-1.$  (15)

By applying the Transformer at the scene level, we can capture context-aware representations and promote message passing among different objects. Moreover, the global representations can be particularly helpful for detecting objects with very few points.

3.5. Local-Global Transformer

The Local-Global Transformer is also a key module; it combines the local and global features extracted by the LT and GT modules. As shown in Figure 2, the LGT adopts a multi-scale cross-attention module and generates relations between the low-resolution centroids and the high-resolution points. Formally, we apply cross-attention similar to the encoder-decoder attention used in the Transformer. The output of LT serves as the query, and the output of GT from the higher resolution is used as key and value. With the $L$-layer Transformer block, the module is formulated as:

$f_i^{(0)} = \mathrm{FFN}(f_i), \quad \forall i \in P^l,$  (16)

$f_j' = \mathrm{FFN}(f_j), \quad \forall j \in P^h,$  (17)

$F^{(l+1)} = \mathrm{Transblock}\big(F^{(l)}, F', \mathrm{PE}(X)\big), \quad l = 0, \dots, L-1,$  (18)

where $P^l$ (the keypoints, i.e. the output of LT in Figure 2) and $P^h$ (the input of a Pointformer block in Figure 2) denote subsamples of the point cloud $P$ at low and high resolution, respectively. Through the Local-Global Transformer module, we utilize all centroid points to integrate global information via an attention mechanism, which makes the feature learning of both more effective.
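A minimal cross-attention sketch of Eq. (16)–(18): the low-resolution keypoints P^l provide the queries and the high-resolution points P^h, carrying the Global Transformer output, provide the keys and values. nn.MultiheadAttention stands in for the full Transblock here, the relative positional term is omitted for brevity, and all names and sizes are illustrative assumptions.

```python
# Cross-attention sketch for the Local-Global Transformer of Eq.(16)-(18).
import torch
import torch.nn as nn


class LocalGlobalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.pre_q = nn.Linear(dim, dim)   # f_i^(0) = FFN(f_i), i in P^l, Eq.(16)
        self.pre_kv = nn.Linear(dim, dim)  # f_j'    = FFN(f_j), j in P^h, Eq.(17)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # f_low: (B, N_l, C) keypoint features, f_high: (B, N_h, C) high-resolution features
        q, kv = self.pre_q(f_low), self.pre_kv(f_high)
        y, _ = self.attn(q, kv, kv)        # cross-attention, Eq.(18)
        y = f_low + y                      # residual on the query branch
        return y + self.ffn(y)


lgt = LocalGlobalCrossAttention(dim=64)
out = lgt(torch.randn(2, 256, 64), torch.randn(2, 1024, 64))
print(out.shape)  # torch.Size([2, 256, 64])
```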
3.6. Positional Encoding

Positional encoding is an integral part of Transformer models, as it is the only mechanism that encodes position information for each token in the input sequence. When adapting Transformers to 3D point cloud data, positional encoding plays an even more critical role, since the coordinates of the points are valuable features indicating the local structures. Compared to the techniques used in natural language processing, we propose a simple yet efficient approach. For all Transformer modules, the coordinates of each input point are first mapped to the feature dimension. Then, we subtract the coordinates of the query and key points and use the relative positions for encoding. The encoding function is formalized as:

$\mathrm{PE}(x_i, x_j) = \mathrm{FFN}(x_i - x_j).$  (19)

3.7. Computational Cost Reduction

Since Pointformer is a pure attention model based on Transformer blocks, it suffers from extremely heavy computational overhead: applying a conventional Transformer to a point cloud with $n$ points consumes $O(n^2)$ time and memory, leading to much higher training cost.

Some recent advances in efficient Transformers have mitigated this issue [14, 13, 32, 4, 35], among which Linformer [32] reduces the complexity to $O(n)$ by a low-rank factorization of the original attention. Under the hypothesis that the self-attention mechanism is low-rank, i.e. the rank of the $n \times n$ attention matrix

$A = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right),$  (20)

is much smaller than $n$, Linformer projects the $n$-dimensional keys and values to ones with a lower dimension $k \ll n$, where $k$ is close to the rank of $A$. Therefore, the $i$-th head in the projected multi-head self-attention is

$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q (E_i K)^\top}{\sqrt{d_k}}\right) F_i V.$  (21)
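A minimal single-head sketch of the Linformer-style projection of Eq. (20)–(21): the keys and values are compressed along the point axis by learned matrices E_i and F_i, so the attention map has shape n × k instead of n × n. The sizes are illustrative assumptions, not the configuration used in the paper.

```python
# Single-head sketch of the projected attention of Eq.(20)-(21).
import torch
import torch.nn as nn

n, d_k, k = 4096, 64, 128                 # points, head dimension, projected length
E = nn.Linear(n, k, bias=False)           # E_i: projects keys along the sequence axis
F = nn.Linear(n, k, bias=False)           # F_i: projects values along the sequence axis

Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)
K_proj = E(K.transpose(0, 1)).transpose(0, 1)          # (k, d_k)
V_proj = F(V.transpose(0, 1)).transpose(0, 1)          # (k, d_k)
attn = torch.softmax(Q @ K_proj.transpose(0, 1) / d_k ** 0.5, dim=-1)   # (n, k) instead of (n, n)
head = attn @ V_proj                                    # (n, d_k), Eq.(21)
print(head.shape)
```

Memory for the attention map therefore drops from O(n^2) to O(nk), which is what makes scene-level attention over thousands of points tractable.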
Method  Modality  Car (IoU=0.7): Easy / Moderate / Hard  Pedestrian (IoU=0.5): Easy / Moderate / Hard  Cyclist (IoU=0.5): Easy / Moderate / Hard
PointRCNN [25] LiDAR 85.94 75.76 68.32 49.43 41.78 38.63 73.93 59.60 53.59
+ Pointformer LiDAR 87.13 77.06 69.25 50.67 42.43 39.60 75.01 59.80 53.99
Table 1. Performance comparison of PointRCNN with and without Pointformer on the KITTI test split, obtained by submitting to the official test server. The evaluation metric is Average Precision (AP) with an IoU threshold of 0.7 for car and 0.5 for pedestrian/cyclist.
Method Modality Car Ped Bus Barrier TC Truck Trailer Moto Cons. Veh. Bicycle mAP
CBGS [42] LiDAR 81.1 80.1 54.9 65.7 70.9 48.5 42.9 51.5 10.5 22.3 52.8
+ Pointformer LiDAR 82.3 81.8 55.6 66.0 72.2 48.1 43.4 55.0 8.6 22.7 53.6
Table 2. Performance comparison of CBGS with and without Pointformer on the nuScenes benchmark.
Method bathtub bed bookshelf chair desk dresser nightstand sofa table toilet mAP
VoteNet [19] 74.4 83.0 28.8 75.3 22.0 29.8 62.2 64.0 47.3 90.1 57.7
VoteNet* 75.5 85.6 32.0 77.4 24.8 27.9 58.6 67.4 51.1 90.5 59.1
+ Pointformer 80.1 84.3 32.0 76.2 27.0 37.4 64.0 64.9 51.5 92.2 61.1
Table 5. Performance comparison of VoteNet with and without Pointformer on the SUN RGB-D validation dataset. The evaluation metric is Average Precision with a 0.25 IoU threshold. * denotes the model implemented in MMDetection3D [5].
Method cab bed chair sofa table door wind bkshf pic cntr desk curt fridg showr toil sink bath ofurn mAP
VoteNet [19] 36.3 87.9 88.7 89.6 58.8 47.3 38.1 44.6 7.8 56.1 71.7 47.2 45.4 57.1 94.9 54.7 92.1 37.2 58.6
VoteNet* 47.7 88.7 89.5 89.3 62.1 54.1 40.8 54.3 12.0 63.9 69.4 52.0 52.5 73.3 95.9 52.0 95.1 42.4 62.9
+ Pointformer 46.7 88.4 90.5 88.7 65.7 55.0 47.7 55.8 18.0 63.8 69.1 55.4 48.5 66.2 98.9 61.5 86.7 47.4 64.1
Table 6. Performance comparison of VoteNet with and without Pointformer on the ScanNetV2 validation dataset. The evaluation metric is Average Precision with a 0.25 IoU threshold. * denotes the model implemented in MMDetection3D [5].
Figure 4. Qualitative results of 3D object detection on SUN RGB-D. From left to right: original scene image, our model's prediction, and annotated ground truth boxes.
[Figure 5 panels: image of the scene, ground truth, overall attention; top-50, top-100, and top-200 attention.]
Figure 5. Visualization results of the attention maps. In top-k attention, darker color indicates larger attention weight; in the overall attention, red indicates large values.

Ablation experiments are conducted with the PointRCNN detection head and evaluated on the val split with the car class.

Effects of each component. We validate the effectiveness of each Transformer component and the coordinate refinement module, and summarize the results in Table 7. The first row corresponds to the PointRCNN baseline and the last row is the full Pointformer model. Comparing the first and second rows, we observe that easy objects benefit more from the Local Transformer, with a 0.6 AP improvement. Comparing the second and fourth rows, we see that the Global Transformer is more suitable for hard objects, with a 0.9 AP improvement. This observation is consistent with our analysis in Sec. 4.2. As for the Local-Global Transformer and coordinate refinement, the improvement is similar under the three difficulty settings.

Positional Encoding. Playing a critical role in the Transformer, positional encoding can have a huge impact on the learned representation. As shown in Table 8, we compare the performance of Pointformer without positional encoding and with two approaches to positional encoding (adding or concatenating the positional encoding with the attention map). We observe that Pointformer without positional encoding suffers a huge performance drop, as the coordinates of the points capture the local geometric information.

4.5. Qualitative Results and Discussion

Qualitative results on SUN RGB-D. Figure 4 shows representative examples of detection results on SUN RGB-D with VoteNet + Pointformer. As we can observe, our model achieves robust results despite the challenges of clutter and scanning artifacts. Additionally, our model can even recognize objects missing from the ground truth. For instance, the dresser in the left scene is only partially observed by the sensor; however, our model can still generate a precise proposal for the object with a proper bounding box size. Similar results are shown in the right scene, where the table in the front suffers from clutter because of the books on it.

Inspecting Pointformer with attention maps. To validate how the modules in Pointformer affect the learned point features, we visualize the attention maps from the GT module of the second-to-last Pointformer block. We show the attention of particular points in Figure 5. The second row shows the 50, 100, and 200 points with the highest attention values towards the points marked with a star. We can observe that Pointformer first focuses on the local region of the same object, then spreads the attention to other regions, and finally attends to points from other objects globally. The overall attention map shows the average attention weights of all the points in the scene, indicating that our model mostly focuses on points on the objects. These visualization results show that Pointformer can capture local and global dependencies, and enhance message passing at both the object and scene levels.

5. Conclusion

This paper introduces Pointformer, a highly effective feature learning backbone for 3D point clouds that is permutation invariant to the input points and learns local and global context-aware representations. We apply Pointformer as a drop-in replacement backbone for state-of-the-art 3D object detectors and show significant performance improvements on several benchmarks including both indoor and outdoor datasets.

Compared to classification and segmentation tasks, including part segmentation and semantic segmentation in prior work, 3D object detection typically involves more points (4x-16x) in a scene, which makes it harder for Transformer-based models. For future work, we would like to explore extensions to these two tasks and to other 3D tasks such as shape completion, normal estimation, etc.

Acknowledgments

This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant 2018AAA0100701, the National Natural Science Foundation of China under Grants 61906106 and 62022048, the Institute for Guo Qiang of Tsinghua University, and the Beijing Academy of Artificial Intelligence.
References

[1] H. Caesar, Varun Bankiti, A. Lang, Sourabh Vora, Venice Erin Liong, Q. Xu, A. Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In CVPR, pages 11618–11628, 2020.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[3] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
[4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2020.
[5] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.
[6] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
[7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
[10] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[11] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
[12] Ji Hou, Angela Dai, and Matthias Niessner. 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In CVPR, 2019.
[13] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. arXiv, abs/2006.16236, 2020.
[14] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
[15] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
[16] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 3744–3753. PMLR, 2019.
[17] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-task multi-sensor fusion for 3D object detection. In CVPR, 2019.
[18] Youngmin Park, Vincent Lepetit, and W. Woo. Multiple 3D object tracking for augmented reality. In 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 117–120, 2008.
[19] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep Hough voting for 3D object detection in point clouds. In CVPR, pages 9277–9286, 2019.
[20] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, 2018.
[21] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
[22] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pages 5099–5108, 2017.
[23] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, I. Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[24] Shaoshuai Shi, Chaoxu Guo, L. Jiang, Zhe Wang, Jianping Shi, X. Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In CVPR, pages 10526–10535, 2020.
[25] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR, 2019.
[26] Weijing Shi and Raj Rajkumar. Point-GNN: Graph neural network for 3D object detection in a point cloud. In CVPR, pages 1711–1719, 2020.
[27] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR, pages 567–576, 2015.
[28] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In CVPR, 2016.
[29] H. Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, B. Marcotegui, F. Goulette, and L. Guibas. KPConv: Flexible and deformable convolution for point clouds. In ICCV, pages 6410–6419, 2019.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
[31] Sourabh Vora, Alex H. Lang, Bassam Helou, and Oscar Beijbom. PointPainting: Sequential fusion for 3D object detection. In CVPR, 2020.
[32] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[33] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In CVPR, 2018.
[34] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
[35] Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, and Jun Wang. MLCVNet: Multi-level context VoteNet for 3D object detection. In CVPR, pages 10447–10456, 2020.
[36] Qiangeng Xu, Xudong Sun, Cho-Ying Wu, Panqu Wang, and U. Neumann. Grid-GCN for fast and scalable point cloud learning. In CVPR, pages 5660–5669, 2020.
[37] Xu Yan, C. Zheng, Zhuguo Li, S. Wang, and Shuguang Cui. PointASNL: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, pages 5588–5597, 2020.
[38] Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li, Jinxian Liu, Mengdie Zhou, and Qi Tian. Modeling point clouds with self-attention and Gumbel subset sampling. In CVPR, 2019.
[39] M. Ye, Shuangjie Xu, and Tongyi Cao. HVNet: Hybrid voxel network for LiDAR based 3D object detection. In CVPR, pages 1628–1637, 2020.
[40] Zaiwei Zhang, Bo Sun, Haitao Yang, and Qixing Huang. H3DNet: 3D object detection using hybrid geometric primitives. In ECCV, pages 311–329, 2020.
[41] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In CVPR, 2018.
[42] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492, 2019.
[43] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection, 2020.