Point Transformer

Abstract

Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70% mIoU threshold for the first time.

Figure 1. The Point Transformer can serve as the backbone for various 3D point cloud understanding tasks such as object classification, object part segmentation, and semantic scene segmentation.
1. Introduction

3D data arises in many application areas such as autonomous driving, augmented reality, and robotics. Unlike images, which are arranged on regular pixel grids, 3D point clouds are sets embedded in continuous space. This makes 3D point clouds structurally different from images and precludes immediate application of deep network designs that have become standard in computer vision, such as networks based on the discrete convolution operator.

A variety of approaches to deep learning on 3D point clouds have arisen in response to this challenge. Some voxelize the 3D space to enable the application of 3D discrete convolutions [23, 32]. This induces massive computational and memory costs and underutilizes the sparsity of point sets in 3D. Sparse convolutional networks relieve these limitations by operating only on voxels that are not empty [9, 3]. Other designs operate directly on points and propagate information via pooling operators [25, 27] or continuous convolutions [42, 37]. Another family of approaches connects the point set into a graph for message passing [44, 19].

In this work, we develop an approach to deep learning on point clouds that is inspired by the success of transformers in natural language processing [39, 45, 5, 4, 51] and image analysis [10, 28, 54]. The transformer family of models is particularly appropriate for point cloud processing because the self-attention operator, which is at the core of transformer networks, is in essence a set operator: it is invariant to permutation and cardinality of the input elements. The application of self-attention to 3D point clouds is therefore quite natural, since point clouds are essentially sets embedded in 3D space.

We flesh out this intuition and develop a self-attention layer for 3D point cloud processing. Based on this layer, we construct Point Transformer networks for a variety of 3D understanding tasks. We investigate the form of the self-attention operator, the application of self-attention to local neighborhoods around each point, and the encoding of positional information in the network. The resulting networks are based purely on self-attention and pointwise operations.

We show that Point Transformers are remarkably effective in 3D deep learning tasks, both at the level of detailed object analysis and large-scale parsing of massive scenes. In particular, Point Transformers set the new state of the art on large-scale semantic segmentation on the S3DIS dataset (70.4% mIoU on Area 5), shape classification on ModelNet40 (93.7% overall accuracy), and object part
segmentation on ShapeNetPart (86.6% instance mIoU). Our full implementation and trained models will be released upon acceptance. In summary, our main contributions include the following.

• We design a highly expressive Point Transformer layer for point cloud processing. The layer is invariant to permutation and cardinality and is thus inherently suited to point cloud processing.

• Based on the Point Transformer layer, we construct high-performing Point Transformer networks for classification and dense prediction on point clouds. These networks can serve as general backbones for 3D scene understanding.

• We report extensive experiments over multiple domains and datasets. We conduct controlled studies to examine specific choices in the Point Transformer design and set the new state of the art on multiple highly competitive benchmarks, outperforming long lines of prior work.

2. Related Work

For 2D image understanding, pixels are placed in regular grids and can be processed with classical convolution. In contrast, 3D point clouds are unordered and scattered in 3D space: they are essentially sets. Learning-based approaches to processing 3D point clouds can be classified into the following types: projection-based, voxel-based, and point-based networks.

Projection-based networks. For processing irregular inputs like point clouds, an intuitive approach is to transform irregular representations into regular ones. Considering the success of 2D CNNs, some approaches [34, 18, 2, 14, 16] adopt multi-view projection, where 3D point clouds are projected onto various image planes. 2D CNNs are then used to extract feature representations in these image planes, followed by multi-view feature fusion to form the final output representation. In a related approach, TangentConv [35] projects local surface geometry onto a tangent plane at every point, forming tangent images that can be processed by 2D convolution; however, this approach relies heavily on tangent estimation. In projection-based frameworks, the geometric information inside point clouds is collapsed during the projection stage. These approaches may also underutilize the sparsity of point clouds when forming dense pixel grids on projection planes. The choice of projection planes can heavily influence recognition performance, and occlusion in 3D may impede accuracy.

Voxel-based networks. An alternative approach to transforming irregular point clouds into regular representations is 3D voxelization [23, 32], followed by convolutions in 3D. Applied naively, this strategy can incur massive computation and memory costs due to the cubic growth in the number of voxels as a function of resolution. The solution is to take advantage of sparsity, as most voxels are usually unoccupied. For example, OctNet [29] uses unbalanced octrees with hierarchical partitions. Approaches based on sparse convolutions, where the convolution kernel is only evaluated at occupied voxels, can further reduce computation and memory requirements [9, 3]. These methods have demonstrated good accuracy but may still lose geometric detail due to quantization onto the voxel grid.

Point-based networks. Rather than projecting or quantizing irregular point clouds onto regular grids in 2D or 3D, researchers have designed deep network structures that ingest point clouds directly, as sets embedded in continuous space. PointNet [25] utilizes permutation-invariant operators such as pointwise MLPs and pooling layers to aggregate features across a set. PointNet++ [27] applies these ideas within a hierarchical spatial structure to increase sensitivity to local geometric layout. Such models can benefit from efficient sampling of the point set, and a variety of sampling strategies have been developed [27, 7, 46, 50, 11].

A number of approaches connect the point set into a graph and conduct message passing on this graph. DGCNN [44] performs graph convolutions on kNN graphs. PointWeb [55] densely connects local neighborhoods. ECC [31] uses dynamic edge-conditioned filters, where convolution kernels are generated based on edges inside point clouds. SPG [15] operates on a superpoint graph that represents contextual relationships. KCNet [30] utilizes kernel correlation and graph pooling. Wang et al. [40] investigate local spectral graph convolution. GACNet [41] employs graph attention convolution, and HPEIN [13] builds a hierarchical point-edge interaction architecture. DeepGCNs [19] explore the advantages of depth in graph convolutional networks for 3D scene understanding.

A number of methods are based on continuous convolutions that apply directly to the 3D point set, with no quantization. PCCN [42] represents convolutional kernels as MLPs. SpiderCNN [49] defines kernel weights as a family of polynomial functions. Spherical CNN [8] designs spherical convolution to address the problem of 3D rotation equivariance. PointConv [46] and KPConv [37] construct convolution weights based on the input coordinates. InterpCNN [22] utilizes coordinates to interpolate pointwise kernel weights. PointCNN [20] proposes to reorder the unordered input point clouds with special operators. Ummenhofer et al. [38] apply continuous convolutions to learn particle-based fluid dynamics.

Transformer and self-attention. Transformer and self-attention models have revolutionized machine translation and natural language processing [39, 45, 5, 4, 51]. This has inspired the development of self-attention networks for 2D image recognition [10, 28, 54, 6]. Hu et al. [10] and
Ramachandran et al. [28] apply scalar dot-product self-attention within local image patches. Zhao et al. [54] develop a family of vector self-attention operators. Dosovitskiy et al. [6] treat images as sequences of patches.

Our work is inspired by the findings that transformers and self-attention networks can match or even outperform convolutional networks on sequences and 2D images. Self-attention is of particular interest in our setting because it is intrinsically a set operator: positional information is provided as attributes of elements that are processed as a set [39, 54]. Since 3D point clouds are essentially sets of points with positional attributes, the self-attention mechanism seems particularly suitable for this type of data. We thus develop a Point Transformer layer that applies self-attention to 3D point clouds.

There are a number of previous works [48, 21, 50, 17] that utilize attention for point cloud analysis. They apply global attention to the whole point cloud, which introduces heavy computation and renders these approaches inapplicable to large-scale 3D scene understanding. They also utilize scalar dot-product attention, where different channels share the same aggregation weights. In contrast, we apply self-attention locally, which enables scalability to large scenes with millions of points, and we utilize vector attention, which we show to be important for achieving high accuracy. We also demonstrate the importance of appropriate position encoding in large-scale point cloud understanding, in contrast to prior approaches that omitted position information. Overall, we show that appropriately designed self-attention networks can scale to large and complex 3D scenes and substantially advance the state of the art in large-scale point cloud understanding.

3. Point Transformer

We begin by briefly revisiting the general formulation of transformers and self-attention operators. Then we present the point transformer layer for 3D point cloud processing. Lastly, we present our network architecture for 3D scene understanding.

3.1. Background

Transformers and self-attention networks have revolutionized natural language processing [39, 45, 5, 4, 51] and have demonstrated impressive results in 2D image analysis [10, 28, 54, 6]. Self-attention operators can be classified into two types: scalar attention [39] and vector attention [54].

Let $\mathcal{X} = \{\mathbf{x}_i\}_i$ be a set of feature vectors. The standard scalar dot-product attention layer can be represented as follows:

$$\mathbf{y}_i = \sum_{\mathbf{x}_j \in \mathcal{X}} \rho\big(\varphi(\mathbf{x}_i)^{\top} \psi(\mathbf{x}_j) + \delta\big)\, \alpha(\mathbf{x}_j), \tag{1}$$

where $\mathbf{y}_i$ is the output feature; $\varphi$, $\psi$, and $\alpha$ are pointwise feature transformations, such as linear projections or MLPs; $\delta$ is a position encoding function; and $\rho$ is a normalization function such as softmax. The scalar attention layer computes the scalar product between features transformed by $\varphi$ and $\psi$ and uses the output as an attention weight for aggregating features transformed by $\alpha$.

In vector attention, the computation of attention weights is different. In particular, attention weights are vectors that can modulate individual feature channels:

$$\mathbf{y}_i = \sum_{\mathbf{x}_j \in \mathcal{X}} \rho\big(\gamma(\beta(\varphi(\mathbf{x}_i), \psi(\mathbf{x}_j)) + \delta)\big) \odot \alpha(\mathbf{x}_j), \tag{2}$$

where $\beta$ is a relation function (e.g., subtraction) and $\gamma$ is a mapping function (e.g., an MLP) that produces attention vectors for feature aggregation.

Both scalar and vector self-attention are set operators. The set can be a collection of feature vectors that represent the entire signal (e.g., a sentence or image) [39, 6] or a collection of feature vectors from a local patch within the signal (e.g., an image patch) [10, 28, 54].
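To make the contrast concrete, the following is a minimal PyTorch sketch of the two operator types over a generic feature set. It is an illustrative sketch, not our released implementation: the use of plain linear layers for φ, ψ, and α, the hidden width of the MLP γ, and the omission of the position encoding δ are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class ScalarAttention(nn.Module):
    """Sketch of Eq. (1): scalar dot-product attention over n feature vectors."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)    # query transform
        self.psi = nn.Linear(dim, dim)    # key transform
        self.alpha = nn.Linear(dim, dim)  # value transform

    def forward(self, x):                 # x: (n, dim)
        # one scalar weight per (i, j) pair, shared across all channels
        w = torch.softmax(self.phi(x) @ self.psi(x).t(), dim=-1)   # (n, n)
        return w @ self.alpha(x)          # (n, dim)

class VectorAttention(nn.Module):
    """Sketch of Eq. (2): attention weights are vectors that modulate
    individual channels. Uses subtraction as the relation beta and an
    MLP as the mapping gamma."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)
        self.psi = nn.Linear(dim, dim)
        self.alpha = nn.Linear(dim, dim)
        self.gamma = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                 # x: (n, dim)
        rel = self.phi(x)[:, None, :] - self.psi(x)[None, :, :]    # (n, n, dim)
        w = torch.softmax(self.gamma(rel), dim=1)  # normalize over the set
        return (w * self.alpha(x)[None, :, :]).sum(dim=1)          # (n, dim)
```

The key difference is the shape of the attention weights: (n, n) scalars shared across channels in the scalar case, versus (n, n, dim) per-channel weight vectors in the vector case.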
3.2. Point Transformer Layer

Self-attention is a natural fit for point clouds because point clouds are essentially sets embedded irregularly in a metric space. Our point transformer layer is based on vector self-attention. We use the subtraction relation and add a position encoding $\delta$ to both the attention vector $\gamma$ and the transformed features $\alpha$:

$$\mathbf{y}_i = \sum_{\mathbf{x}_j \in \mathcal{X}(i)} \rho\big(\gamma(\varphi(\mathbf{x}_i) - \psi(\mathbf{x}_j) + \delta)\big) \odot \big(\alpha(\mathbf{x}_j) + \delta\big) \tag{3}$$

Here the subset $\mathcal{X}(i) \subseteq \mathcal{X}$ is a set of points in a local neighborhood (specifically, the k nearest neighbors) of $\mathbf{x}_i$. Thus we adopt the practice of recent self-attention networks for image analysis in applying self-attention locally, within a local neighborhood around each datapoint [10, 28, 54]. The mapping function $\gamma$ is an MLP with two linear layers and one ReLU nonlinearity. The point transformer layer is illustrated in Figure 2.

Figure 2. Point transformer layer. (The diagram maps the input (x, p) through linear layers φ, ψ, and α, a position-encoding MLP δ, and the mapping MLP γ, followed by aggregation to produce the output (y, p).)
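For concreteness, here is a minimal PyTorch sketch of the layer in Eq. (3). The brute-force kNN query, the hidden width of the position-encoding MLP δ, and the unbatched tensor shapes are simplifying assumptions made for illustration; they stand in for the optimized implementation (see Table A.2 for the heap-based kNN used at scale).

```python
import torch
import torch.nn as nn

class PointTransformerLayer(nn.Module):
    """Sketch of Eq. (3): vector self-attention over k-nearest-neighbor
    neighborhoods, with a learned relative position encoding delta."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.phi = nn.Linear(dim, dim)    # query
        self.psi = nn.Linear(dim, dim)    # key
        self.alpha = nn.Linear(dim, dim)  # value
        # delta: MLP on relative coordinates p_i - p_j (hidden width assumed = dim)
        self.delta = nn.Sequential(
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        # gamma: two linear layers with one ReLU, as stated in the text
        self.gamma = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, p):              # x: (n, dim) features, p: (n, 3) coords
        # brute-force k nearest neighbors of each point (includes the point itself)
        idx = torch.cdist(p, p).topk(self.k, largest=False).indices   # (n, k)
        x_j, p_j = x[idx], p[idx]                                     # (n, k, *)
        pos = self.delta(p[:, None, :] - p_j)                         # (n, k, dim)
        # attention vectors: gamma(phi(x_i) - psi(x_j) + delta), softmax over k
        w = torch.softmax(
            self.gamma(self.phi(x)[:, None, :] - self.psi(x_j) + pos), dim=1)
        # aggregate: sum_j w ⊙ (alpha(x_j) + delta)
        return (w * (self.alpha(x_j) + pos)).sum(dim=1)               # (n, dim)
```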
Figure 3. Point transformer networks for semantic segmentation (top) and classification (bottom). Both networks are built from point transformer, MLP, transition down, and transition up blocks. The segmentation network processes feature sets with (cardinality, dimension) of (N, 32) → (N/4, 64) → (N/16, 128) → (N/64, 256) → (N/256, 512) and decodes symmetrically back to (N, 32) before the final (N, Dout) output. The classification network shares the encoder, then applies global average pooling to obtain (1, 512) and an MLP to produce the (1, Dout) prediction (e.g., the label "chair").
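For orientation, a hedged sketch of the classification tail implied by Figure 3 (bottom): per-point features from the final encoder stage are reduced by global average pooling and classified by an MLP. The hidden width and the batch-first shapes are assumptions; only the (N/256, 512) → (1, 512) → (1, Dout) flow is taken from the figure.

```python
import torch
import torch.nn as nn

# Stage schedule read off Figure 3: each encoder stage divides the point
# set cardinality by 4 and increases the feature width.
STAGE_DIMS = [32, 64, 128, 256, 512]
STAGE_STRIDE = 4

class ClassificationHead(nn.Module):
    """Global average pooling over final-stage point features, then an MLP.
    The hidden width (256) is an assumption for illustration."""
    def __init__(self, dim=512, num_classes=40, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, x):                  # x: (batch, n_points, dim)
        return self.mlp(x.mean(dim=1))     # (batch, num_classes)
```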
Figure 6. Visualization of shape retrieval results on the ModelNet40 dataset. The leftmost column shows the input query and the other
columns show the retrieved models.
Figure 7. Visualization of object part segmentation results on the ShapeNetPart dataset. The ground truth is shown in the top row, with Point Transformer predictions in the bottom row.
#pts     k=8    k=16   k=32   k=64   k=128  k=256
10k      2      2      5      10     17     21
20k      3      5      8      23     43     49
40k      8      12     26     71     127    144
80k      23     37     82     198    356    399
100k     32     46     99     248    445    494
200k     104    125    225    545    992    1091
500k     639    695    867    1589   2865   3143
1m       2496   2648   2949   4087   6362   6878

Table A.2. Running time of our high-efficiency kNN implementation based on a heap sort algorithm. The leftmost column gives the number of points and the topmost row specifies the number of nearest neighbors k. The reported running times are in milliseconds.
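For reference, a minimal single-threaded sketch of the heap-based kNN idea behind Table A.2, using a bounded max-heap per query point. The timed implementation is a tuned, parallel variant, so this sketch only illustrates the algorithm, not the measured performance.

```python
import heapq
import numpy as np

def knn_heap(points, k):
    """For each point in a (n, 3) array, find its k nearest neighbors by
    maintaining a size-k max-heap of candidates (heapq is a min-heap, so
    negated squared distances are stored). O(n^2 log k) brute-force pass."""
    n = len(points)
    neighbors = np.empty((n, k), dtype=np.int64)
    for i in range(n):
        heap = []  # entries: (-squared_distance, index)
        for j in range(n):
            d = -np.sum((points[i] - points[j]) ** 2)
            if len(heap) < k:
                heapq.heappush(heap, (d, j))
            elif d > heap[0][0]:           # closer than the current k-th farthest
                heapq.heapreplace(heap, (d, j))
        # sort retained candidates from closest to farthest
        neighbors[i] = [j for _, j in sorted(heap, reverse=True)]
    return neighbors
```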
References

[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[25] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[26] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, 2016.
[27] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
[28] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[29] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017.
[30] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, 2018.
[31] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, 2017.
[32] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[33] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SplatNet: Sparse lattice networks for point cloud processing. In CVPR, 2018.
[34] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In ICCV, 2015.
[35] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3D. In CVPR, 2018.
[36] Lyne P. Tchapmi, Christopher B. Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. SEGCloud: Semantic segmentation of 3D point clouds. In 3DV, 2017.
[37] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J. Guibas. KPConv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
[38] Benjamin Ummenhofer, Lukas Prantl, Nils Thuerey, and Vladlen Koltun. Lagrangian fluid simulation with continuous convolutions. In ICLR, 2020.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[40] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spectral graph convolution for point set feature learning. In ECCV, 2018.
[41] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In CVPR, 2019.
[42] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In CVPR, 2018.
[43] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In CVPR, 2018.
[44] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. TOG, 2019.
[45] Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In ICLR, 2019.
[46] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In CVPR, 2019.
[47] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
[48] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional ShapeContextNet for point cloud recognition. In CVPR, 2018.
[49] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In ECCV, 2018.
[50] Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li, Jinxian Liu, Mengdie Zhou, and Qi Tian. Modeling point clouds with self-attention and Gumbel subset sampling. In CVPR, 2019.
[51] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[52] Li Yi, Vladimir G. Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3D shape collections. TOG, 2016.
[53] Zhiyuan Zhang, Binh-Son Hua, and Sai-Kit Yeung. ShellNet: Efficient point cloud convolutional neural networks using concentric shells statistics. In ICCV, 2019.
[54] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. In CVPR, 2020.
[55] Hengshuang Zhao, Li Jiang, Chi-Wing Fu, and Jiaya Jia. PointWeb: Enhancing local neighborhood features for point cloud processing. In CVPR, 2019.