IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 26, 2024

MPCT: Multiscale Point Cloud Transformer With a Residual Network

Yue Wu, Member, IEEE, Jiaming Liu, Maoguo Gong, Senior Member, IEEE, Zhixiao Liu, Qiguang Miao, Senior Member, IEEE, and Wenping Ma, Senior Member, IEEE

Abstract—The self-attention (SA) network revisits the essence of data and has achieved remarkable results in text processing and image analysis. SA is conceptualized as a set operator that is insensitive to the order and number of data, making it suitable for point sets embedded in 3D space. However, working with point clouds still poses challenges. To tackle the exponential growth in complexity and the singularity induced by the original SA network without position encoding, we modify the attention mechanism by incorporating position encoding to make it linear, thus reducing its computational cost and memory usage and making it more feasible for point clouds. This article presents a new framework called the multiscale point cloud transformer (MPCT), which improves upon prior methods in cross-domain applications. The utilization of multiple embeddings enables the complete capture of the remote and local contextual connections within point clouds, as determined by our proposed attention mechanism. Additionally, we use a residual network to facilitate the fusion of multiscale features, allowing MPCT to better comprehend the representations of point clouds at each stage of attention. Experiments conducted on several datasets demonstrate that MPCT outperforms existing methods, achieving accuracies of 94.2% and 84.9% in classification tasks on ModelNet40 and ScanObjectNN, respectively.

Index Terms—Geometric and semantic features, multiscale generation, point cloud transformer, residual network.

Manuscript received 25 October 2022; revised 17 February 2023 and 3 August 2023; accepted 1 September 2023. Date of publication 12 September 2023; date of current version 14 February 2024. This work was supported in part by the National Natural Science Foundation of China under Grants 62276200 and 62036006, in part by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2022JM-327, and in part by the CAAI-Huawei MINDSPORE Academic Open Fund. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. De-Nian Yang. (Corresponding author: Maoguo Gong.)

Yue Wu, Jiaming Liu, and Qiguang Miao are with the School of Computer Science and Technology, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]).

Maoguo Gong is with the School of Electronic Engineering, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

Zhixiao Liu is with the Yantai Research Institute, Harbin Engineering University, Yantai 264006, China (e-mail: [email protected]).

Wenping Ma is with the School of Artificial Intelligence, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

Code: https://github.com/ywu0912/TeamCode.git
Digital Object Identifier 10.1109/TMM.2023.3312855
1520-9210 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

WITH the rapid development of 3D sensing technology, 3D point cloud data are appearing in many application areas such as autonomous driving, virtual and augmented reality, and robotics. Driven by deep neural networks, recent 3D works [1], [2], [3], [4], [5], [6], [7], [8] have focused on processing point clouds with learning-based methods. However, unlike images arranged on a regular pixel grid, point clouds are sets of points embedded in three-dimensional space. This makes 3D point clouds structurally different from images and representationally different from complex 3D data (e.g., grid and voxel data); point clouds have the simplest format, but they cannot be directly fed to deep networks designed for standard computer vision tasks.

To address this challenge, many deep learning-based methods have become powerful tools for representing 3D point clouds. Guo et al. [9] classified learning-based point cloud methods into multiview methods [10], [11], voxel methods [12], [13], and point methods [14], [15]. Since Qi et al. introduced the multilayer perceptron (MLP) operation in PointNet, point-based methods have become popular, and subsequent methods [16], [17] have used graph context and kernel points to conduct further learning on the basis of MLPs. Although these methods aim to improve point cloud understanding, some issues remain to be considered. 1) In addition to 3D coordinates, can we provide more geometric information for feature learning? 2) How can the network learn better representations in the abstract high-level space while keeping its costs low and its efficiency high? 3) How can the network focus on features and connect different features to optimize the output?

Compared to point- and convolution-based learning methods [17], [18], [19], [20], [21], [22], [23], [24], transformers [25] have recently been applied to text language and image vision tasks, and their performance is impressive in numerous respects [26], [27]. Building on this, much research [28], [29], [30], [31], [32], [33] has begun to study the use of transformers in point clouds. The transformer is a classic encoder-decoder structure that contains three main modules: input (word) embedding, position (sequential) encoding, and self-attention (SA) modules. Interestingly, the series of transformer operations is especially suitable for point cloud processing, as demonstrated by the fact that they can uniquely establish remote dependencies between points. In addition, the core SA module of the network is essentially a set operator, which is independent of the order and distribution of the given data.

However, when applying a transformer to point clouds, the input sequence generated from points of the same size only retains a simple scale feature in the same layer with an MLP.
Point Transformer (PT) [28] was proposed to build local attention over neighborhoods, and Point Cloud Transformer (PCT) [30] was proposed to use a neighborhood embedding strategy to improve point embeddings. Despite the progress made with PT and PCT, their efficiency is still limited by the original SA module, which requires generating a large attention map and calculating over all points, leading to O(N²D) complexity.

On the basis of these previous works, we propose a novel architecture called the multiscale point cloud transformer (MPCT), whose key idea is to use the permutation invariance of self-attention to avoid the problems caused by the irregularity of point clouds. As shown in Fig. 1, we are the first to use the proposed novel self-attention mechanism to analyze multistage local neighborhoods and generate multiscale features. In particular, following the k-nearest neighbors (KNN) algorithm, we find the neighbors of a point p_i, denoted as ∀p_ik ∈ N(p_i). The equivariant feature f_i corresponding to point p_i has a local feature map in D-dimensional space, defined as f̃_i = θ(f_i, ∀f_ik) ∈ R^{2D}, where θ relates a point feature to its local features. The fused features are obtained by a local aggregation operation. Then, by feeding the transformed features to the self-attention module, the potential connections between the current features can be established, and new invariant features can be extracted (Q2). During each transformation, the use of residual networks enables the creation of memory-efficient deep networks. Note that we further enhance the processing of position information (Q1) for each point in the point cloud before each data input step and use the residual structure [34] to maintain network memory (Q3) after each feature generation step.

In summary, the proposed MPCT is straightforward and efficient, relying solely on self-attention and pointwise operations. Reviewing the above three questions, the corresponding contributions of this article are summarized as follows.
• We propose a point enhancement method that upgrades points to high-level geometric information to facilitate self-attention and distinguished attention.
• We design an expressive point cloud transformer layer that effectively integrates geometric and semantic features and has a low linear complexity of O(ND).
• We verify that even with a simple residual network connection strategy, effective memory-efficient networks can be built that aid in feature propagation.
• We experiment on multiple datasets in multiple domains, and MPCT achieves results comparable to state-of-the-art methods on multiple competitive benchmarks.

The rest of the article is organized as follows. In Section II, we briefly review the work related to transformers. In Section III, we introduce the preliminary techniques and the proposed MPCT architecture. Section IV evaluates our method on five datasets in comparison with several other classic methods and presents the ablation studies. Section V concludes.

Fig. 1. Multiscale feature fusion module. We present a comparison between PCT with a single scale and our proposed MPCT with multiple scales at the bottom. The top-down transformation has a sampling rate of 1/2.

II. RELATED WORK

A. Geometric Information and Local Features

Although point cloud data collection is becoming increasingly convenient, compared with regular grids or voxels, these data mainly lack high-level geometric information. In addition to 3D point coordinates, conventional methods [35], [36] estimate more geometric information about a point cloud, such as planes, normals, curvature, etc., but this is limited to the use of CNN-based methods. When designing their network structures, [37] and [38] tried to combine the relative positions and distances between the points in the encoder. We hope that the learned geometric point features can play a role in the shift from the low-level space to the high-level space.

PointNet++ [15] aggregates local information on the basis of PointNet [14] with farthest point sampling and ball query grouping; it has been shown that such local features are effective. MS-GraphSIM [39] uses a multiscale local feature fusion network to assess the quality of point clouds, and PM-BVQA [40] evaluates colored point clouds using a joint color-geometry module. Mérigot et al. [36] used the KNN algorithm to be more efficient, but such methods are strongly affected by local neighbors, especially when artificial noise is added to the training data. To overcome these problems, we apply point enhancement and multiscale local neighbors to reduce the possible deviations during feature learning.

B. Transformer in 2D Vision

The earliest transformer [25] was proposed for machine translation tasks. Later, many frameworks introduced the attention mechanism for processing visual tasks. Wu et al. [27] proposed a visual transformer based on the role of tokenized image analysis in feature mapping. Furthermore, ViT [26] also made considerable achievements; it was based completely on self-attention, without any convolution operators.

However, point clouds contain disordered and irregular data. A single point has no semantics, and it is difficult to mark a point as a patch sequence as can be done with an image. Therefore, we propose a neighbor embedding module that aggregates features from the local neighborhood of a specific point to generate relevant semantic and local information.
C. Transformers in 3D Point Clouds

Our work is inspired by a previous idea: a point cloud, as a set with complete location attributes, fits the set-operator nature of self-attention.

Recently, Zhao et al. [28] proposed the point transformer layer by applying vector self-attention. PCT [30] introduced point cloud transformation and achieved competitive results in point cloud classification and segmentation tasks. DPCT [41] aggregated both pointwise and channelwise self-attention models to semantically capture richer contextual dependencies from both the position and the channel. Moreover, PTT [29] and PTTR [31] also applied transformers to cope with object tracking. Misra et al. introduced 3DETR [42], an end-to-end transformer-based object detection model. These methods show that transformers have broad applications in point clouds.

Unlike these methods, MPCT is implemented based on multiscale local neighborhoods with a residual network, which coincides with the design idea of [43]. Our work focuses on using a transformer and incorporates other strategies as supporting modules.

III. METHOD

We first introduce the general form of transformers and then propose our MPCT framework for point clouds. Given an input consisting of N points with xyz coordinates, a backbone network is designed to extract and learn the features of multiscale point clouds, and downstream tasks are designed to validate the output. Furthermore, we design some adaptive novel detail configurations for the proposed network.

A. Preliminaries for Transformers

Generally, self-attention operators can be divided into two types: scalar attention [25] and vector attention [44]. Let F_in = {f_i}_{i=1}^N be a set of feature vectors.

In scalar attention, the standard dot-product attention layer can be expressed as follows:

Q, K, V = F_in · (W_q, W_k, W_v),   Q, K, V ∈ R^{N×D},   (1)

A = ρ(QK^T)V,   (2)

where W_q, W_k and W_v are shared learnable transformations, and Q, K, and V are the query, key and value matrices generated by the linear transformation of the input feature F_in, respectively. ρ is a normalization function (e.g., softmax), and A is the attention feature generated by the self-attention layer.

In vector attention, the attention calculation is different. In particular, attention can adjust individual feature channels:

A = ρ(γ(σ(Q, K))) ⊗ V,   (3)

where σ is the relationship function between Q and K (e.g., subtraction or multiplication), γ is a mapping function (e.g., an MLP) that generates the attention vector, and ⊗ is the Hadamard product. Scalar and vector self-attention are essentially set operators, and the complexity of self-attention is O(N²D).
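To make the distinction concrete, the following is a minimal PyTorch sketch of the two operators as written in (1)-(3). It is an illustration under our own assumptions (global rather than neighborhood-restricted attention, subtraction as σ, and a two-layer MLP as γ); it is not the implementation of any of the compared methods:

```python
import torch
import torch.nn as nn

class ScalarAttention(nn.Module):
    """Dot-product self-attention as in (1)-(2); the N x N map costs O(N^2 D)."""
    def __init__(self, d):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d, bias=False) for _ in range(3))

    def forward(self, x):                                 # x: (B, N, D)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        a = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # rho(Q K^T): (B, N, N)
        return a @ v                                      # (B, N, D)

class VectorAttention(nn.Module):
    """Vector self-attention as in (3): sigma(Q, K) = Q - K, gamma is an MLP,
    and the normalized weights modulate V channelwise (Hadamard product)."""
    def __init__(self, d):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.gamma = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x):                                 # x: (B, N, D)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        a = torch.softmax(self.gamma(q - k), dim=1)       # per-channel weights over points
        return a * v                                      # (B, N, D)
```

Note that the scalar form materializes an N × N attention map, while the vector form only manipulates N × D tensors; this is the complexity gap that motivates the design in Section III-D.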
B. Proposed MPCT Architecture

We aim to leverage the self-attention network for point cloud analysis. Note that the point transformer is an important feature aggregation operator for the entire network. We adopt point-enhanced geometric descriptor preprocessing without auxiliary branches. Based on this, we propose a powerful network architecture to analyze the potential information and capture the latent features in point clouds: the network is completely based on the point transformer layer, multiscale embedding, and pooling transformation. The proposed MPCT architecture is illustrated in Fig. 2.

Backbone: There are four stages in the feature transformation in MPCT. After the point information is enhanced, these stages operate on a point set that is gradually downsampled. The downsampling rates of the stages are [1/2, 1/2, 1/2, 1/2], so the cardinalities of the point sets generated in the different stages are [N/2, N/4, N/8, N/16], and the dimensions of the aggregated features are [2D, 4D, 8D, 16D], where N is the number of input points and D is the feature dimension. The adopted lightweight backbone structure, the number of stages, the downsampling rates, and the aggregate feature dimensions expressed in the figure can vary according to the given application. Overall, the feature embedding module maps the point-enhanced input features to the embedding space. Multiscale local embedding performs feature transformation operations on the point cloud at different degrees of sampling. The self-attention module learns refined attention features for the input features based on the local context. The output features are the residual sum of the input features and their corresponding features produced by the self-attention layer.

In contrast with PCT [30], which uses stacked attention modules, we map the given point cloud to different feature spaces after performing different sampling operations using a single block. The advantage of this massive simplification is that it avoids distractions across multiple channels. Therefore, communication between individual preprocessing blocks occurs in a single softmax layer, directly reflecting how the attention weight operator weights each block.

As a result, we attempt to illustrate the representation learning ability of transformers on two related tasks: point cloud classification and segmentation.

Classification: The details of the classification network are shown in the upper part of Fig. 2 ("Output"). To recognize the input point cloud P as an object category (such as an airplane, a chair, etc.), we feed the multiscale fusion feature F_out (composed of the attentional features generated during the four stages, connected after max pooling) to the classification decoder, which includes two cascaded LBRs. Each layer's dropout probability is 0.5, and the final linear layer then predicts the classification score C ∈ R^{Nc}: the class with the largest score is taken as the class label of the point cloud.
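As a concrete reading of this decoder, the sketch below stacks two LBR blocks with dropout 0.5 and a final linear layer. The hidden widths are our assumptions, since the text only fixes the block types and the dropout probability:

```python
import torch.nn as nn

def lbr(d_in, d_out):
    """LBR block: Linear + BatchNorm + ReLU."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU())

class ClassificationHead(nn.Module):
    """Two cascaded LBRs (dropout 0.5 each) followed by a linear layer that
    maps the fused multiscale feature F_out to Nc class scores."""
    def __init__(self, d_fused, n_classes, hidden=(512, 256)):  # widths assumed
        super().__init__()
        self.decoder = nn.Sequential(
            lbr(d_fused, hidden[0]), nn.Dropout(0.5),
            lbr(hidden[0], hidden[1]), nn.Dropout(0.5),
            nn.Linear(hidden[1], n_classes),
        )

    def forward(self, f_out):          # f_out: (B, d_fused), after max pooling
        return self.decoder(f_out)     # scores C in R^{Nc}
```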
Fig. 2. Proposed MPCT architecture. MPCT aims to exploit rich point cloud features to build a powerful perception network, including four parts: position embedding, semantic embedding, multiscale embedding and self-attention modules. The input features (blue) are generated from the enhanced point information, the semantic features (green) are generated from the input features, and the geometric features (cyan) are generated from the high-level geometric information of the initial points. The semantic features are decomposed into attention features via three linear transformations and fed forward along with the geometric features. ⊕, ⊖, ⊗ and C represent pairwise summation, subtraction, multiplication and connection operations performed via the feature channel, respectively.

Segmentation: The details of the segmentation network are shown in the lower part of Fig. 2 ("Output"). The input point cloud is divided into N parts, which are represented by different points (e.g., a fuselage or a wing), and we need to predict a label for each point to represent a certain part of the object. First, we combine the single-scale feature F_out_i (i = 1, 2, 3, 4) of each stage with the pointwise feature F_P of the input stage for feature propagation [15]. In addition, we encode the object labels, repeating this step N times after the MLP to correspond to the point labels. The segmentation network decoder is almost the same as that of the classification network, except that the two LBRs are replaced with two MLPs, and dropout is only executed on the first MLP. Then, we predict the final point segmentation score S ∈ R^{N×Ns}: the part label of a point is determined as the part with the largest score.

C. Information Enhancement Encoding

Positional encoding is crucial in attention, as it establishes local and long-range dependencies among heterogeneous input features. In point cloud networks, the geometric information of points is a natural candidate for position encoding.

Point Enhancement: Inspired by [45], we argue that the position information is very susceptible to the disturbance of local points; hence, the smaller the value of K and the closer the relationship with the central point, the better. Therefore, we propose a point enhancement method, which only needs to lift the two closest points to enhance the information of a certain point in the initial phase and convert the low-level descriptor into a high-level geometric descriptor.

Fig. 3. Point enhancement module uses the triangle relationship to explicitly enhance the low-level geometric relationships between points.

Specifically, we search for the two nearest neighbors of each point p_i ∈ R³ in the given point cloud and denote them as p_j1, p_j2 ∈ R³. By using them to form a triangle, we can estimate a geometric descriptor p̃_i, which corresponds to p_i, as illustrated in Fig. 3. Then we surpass the original point p_i by introducing the trainable and parameterized point p̃_i. We argue that this point encoding is more conducive to attention weighting and feature transformation. Therefore, the input features F_in are obtained from the trainable point encoding feature p̃_i.

Neighborhood Enhancement: Usually, to encode the features of each point p_i, the geometry relationship is enhanced via the encoding of its K nearest points:

θ(p_i, ∀p_ik) = [RP(p_i, K) − ∀p_ik, RP(p_i, K), ∀p_ik, ‖RP(p_i, K) − ∀p_ik‖],   (4)
where θ is used to relate a point to its local neighborhood, p_i is the i-th point in the point cloud, p_ik is the k-th point in the local neighborhood of p_i according to KNN, and RP(f, K) is the operator that repeats a vector f K times to form a matrix. The position encoding module thus extracts higher-level information from the input coordinates to the local context, p_i ∈ R³ → ∀l_ik ∈ R^{K×10}.

Then, l_i is locally aggregated, and the operation can be expressed as:

∀l_ik = M₁(∀l_ik | k ∈ N(p_i)),   ∀l_ik ∈ R^{K×D},   (5)

l_i = MAX(M₂(l_ik) + l_ik),   l_i ∈ R^D,   (6)

where M₁ and M₂ are MLP operators composed of a linear layer, a normalization layer and an activation layer applied to the provided local context, and MAX is the max pooling operator. D is the feature dimension, and its value may differ between equations. When aggregating the local features, we also use the residual connectivity strategy and then connect the position encoding features L = {l_i}_{i=1}^N, i.e., the geometric features.
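The sketch below is our reading of (4)-(6) in PyTorch: build the K × 10 local context from the offsets, the repeated center, the neighbors, and the offset norm, lift it with M₁ and M₂, and max-pool with a residual connection. The normalization type inside M₁/M₂ and the brute-force KNN are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

def knn_indices(xyz, k):
    """Brute-force KNN: xyz (B, N, 3) -> neighbor indices (B, N, k)."""
    return torch.cdist(xyz, xyz).topk(k, largest=False).indices

class GeometricEncoding(nn.Module):
    """Neighborhood enhancement (4) followed by local aggregation (5)-(6)."""
    def __init__(self, d, k=16):
        super().__init__()
        self.k = k
        self.m1 = nn.Sequential(nn.Linear(10, d), nn.LayerNorm(d), nn.ReLU())
        self.m2 = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d), nn.ReLU())

    def forward(self, xyz):                               # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        idx = knn_indices(xyz, self.k)                    # (B, N, k)
        nbr = torch.gather(xyz.unsqueeze(1).expand(B, N, N, 3), 2,
                           idx.unsqueeze(-1).expand(B, N, self.k, 3))
        ctr = xyz.unsqueeze(2).expand_as(nbr)             # RP(p_i, K)
        off = ctr - nbr                                   # RP(p_i, K) - p_ik
        ctx = torch.cat([off, ctr, nbr,
                         off.norm(dim=-1, keepdim=True)], dim=-1)  # (B, N, k, 10), (4)
        l = self.m1(ctx)                                  # (5): lift to (B, N, k, D)
        return (self.m2(l) + l).max(dim=2).values         # (6): residual + max pool
```

The 3 + 3 + 3 + 1 layout of the concatenation matches the K × 10 context stated after (4).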

D. Point Cloud Transformer

As a recent study, PCT [30] uses the query and key matrices to perform the dot-product operation to measure attention similarity, which requires squared complexity. Nonetheless, we believe that in point clouds, dot-product attention normalized by the softmax function [46] is not necessarily appropriate. Point Transformer [28] indicates that using elementwise subtraction to distinguish the correlations between points is more effective in terms of computational effectiveness.

In contrast, our point cloud transformer layer is based on pairwise attention with position encoding, and we adapt it to be more suitable for point clouds, with a self-attention complexity of O(ND), as shown in Fig. 4.

Fig. 4. Overview of attention for scalar attention [25], vector attention [44], PCT [30], and our MPCT. The number of points is N, and the feature dimension is D, D ≪ N.

Specifically, we first add the location encoding matrix L (the geometric feature introduced in Section III-C) to the query and key matrices separately, and then all of the attention weights are calculated by pairwise subtraction:

Ā = ∀ᾱ_{i,j} = γ₁(Q + L) − γ₂(K + L),   (7)

where the subscripts i and j are the corresponding point number and dimension, respectively, and γ₁ and γ₂ are mapping functions composed of fully connected layers without bias. Then the attention weights are normalized to obtain A = ∀α_{i,j}:

∀α_{i,j} = Softmax(∀ᾱ_{i,j}) = exp(ᾱ_{i,j}) / Σ_k exp(ᾱ_{i,k}),   (8)

where Softmax serves to obtain a row-level normalization of the attention matrix A ∈ R^{N×D}. In other words, we normalize each point in the input point cloud w.r.t. all other points to obtain a weighted aggregation of the contextual information.

The output self-attention feature F_sa is the weighted sum of the value matrix, augmented with the position encoding information, using the corresponding attention weights:

F_sa = A ⊗ (V + L).   (9)

Since the query, key and value matrices are determined by the shared linear transformation of the input feature F_in, they are all permutation-invariant. Moreover, the softmax and weighted sum operators are independent of the order. Therefore, the entire self-attention process is permutation-invariant, which is suitable for the irregular point cloud domain.

Finally, the self-attention feature F_sa and the input feature F_in are further used for residual connection purposes, and the output feature F_out of the entire self-attention layer is provided after the Linear, BatchNorm, and ReLU (LBR) network:

F_out = F_in + LBR(F_in − F_sa).   (10)
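Putting (7)-(10) together, one layer can be sketched in PyTorch as follows. This is our interpretation under stated assumptions (the softmax axis follows the row convention written in (8), and the BatchNorm of the LBR is applied over the channel dimension); it is not the released code:

```python
import torch
import torch.nn as nn

class MPCTAttentionLayer(nn.Module):
    """Pairwise subtractive attention with position encoding, (7)-(10).
    Every step is pointwise or channelwise, so the cost is O(N D)."""
    def __init__(self, d):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.g1 = nn.Linear(d, d, bias=False)   # gamma_1, no bias
        self.g2 = nn.Linear(d, d, bias=False)   # gamma_2, no bias
        self.lin = nn.Linear(d, d)              # L of the LBR
        self.bn = nn.BatchNorm1d(d)             # B of the LBR

    def forward(self, f_in, l_pos):             # f_in, l_pos: (B, N, D)
        q, k, v = self.wq(f_in), self.wk(f_in), self.wv(f_in)
        a_bar = self.g1(q + l_pos) - self.g2(k + l_pos)        # (7)
        a = torch.softmax(a_bar, dim=-1)                       # (8), row-level
        f_sa = a * (v + l_pos)                                 # (9), Hadamard product
        h = self.bn(self.lin(f_in - f_sa).transpose(1, 2))     # LBR, channels first
        return f_in + torch.relu(h).transpose(1, 2)            # (10), residual output
```

Because no N × N map is ever formed, memory and compute grow linearly in the number of points, which is the property claimed in Fig. 4(d).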
E. Multiscale Feature Fusion

Most point-based learning methods are designed to efficiently extract global features. Nonetheless, recent work has fully demonstrated that local neighborhood information is essential for point cloud feature analysis. We refer to the ideas of PointNet++ [15] and DGCNN [16] and further design a multiscale aggregation strategy that integrates different neighborhoods to enhance the network's local feature extraction capability.

The local embedding module includes four sampling-and-grouping (SG) layers and four LBR layers, which are used to extract feature information at four scales.
We hope to use the four cascaded SG layers to gradually expand the receptive field during feature aggregation so that the neural network can see more refined features. The SG layers use the Euclidean distance measure to search for each point grouped by KNN during point cloud sampling and aggregate the features derived from local neighborhoods. To facilitate the distinction, we directly refer to the features obtained via the aggregation of the point features as semantic features, whose search and connection strategies are the same as those of the position encoding features.

Corresponding to the geometric features above, θ(f_i, ∀f_ik) = [RP(f_i, K), RP(f_i, K) − ∀f_ik] is typically used to denote the local semantic context in the feature space. Nevertheless, θ(f_i, ∀f_ik) may not be sufficient for representing the neighborhood, for two reasons: 1) due to the sparse and irregular geometric structures of point clouds, the generalization capabilities of such features in high-dimensional feature spaces may be weakened, and 2) since point cloud neighbors are not unique to each other in the representation of closed regions, information redundancy may occur. To address these issues and enhance the generalization ability of the features, we transform the local points into a normal distribution while maintaining their original semantics:

∀f_ik = α × (∀f_ik − f_i)/(σ + ε) + β,   σ = sqrt((1/K) Σ_{k=1}^{K} (f_ik − f_i)²),   (11)

where α ∈ R^D and β ∈ R^D are learnable parameters and ε = 1e−8 is a small number for numerical stability.

Then, we use farthest point sampling (FPS) [15] to sample P to P_s1, letting N_P(p) be the nearest neighborhoods of each sampling point p ∈ P_s1 in P. Applying the same processing on the basis of P_s1, we set the four sampling rates and neighborhood numbers to be equal, with values of 0.5 and 16, respectively. Next, we calculate the corresponding local features:

F_s1(p) = [RP(f_s1(p), K), ∀f_ik],   F_s1(p) ∈ R^{K×2D},   (12)

F_s1(p) = MAX(M(F_s1(p)) + F_s1(p)),   F_s1(p) ∈ R^{2D},   (13)

where f_s1(p) is the feature of point p in the sampled point cloud P_s1, and F_s1(p) is the feature of sampling point p after aggregating the neighboring points, i.e., the input feature of the self-attention network module.

Next, we connect the point features of the sampled P_s1 and regard them as the research objects for generating P_s2, P_s3, and P_s4. After executing this series of transformations, we obtain the features representing the multiscale local features:

F_si = ⋃_{p ∈ P_si} F_si(p),   i = 1, 2, 3, 4.   (14)

The numbers of points and the feature dimensions at each sampling scale are different and inversely proportional, and the generated multiscale features can separately construct local dependencies between the point clouds. Nonetheless, simply connecting the point clouds at this time may ignore the remote dependencies between them, so we feed these features into the self-attention network, which can be expressed as follows:

A_s1 = A(F_s1, L_s1),   A_s1 ∈ R^{N×D},   (15)

where A_s1 and L_s1 are the attentional feature produced by the self-attention function and the geometric feature mentioned before, respectively. Similarly, we can obtain A_s2, A_s3 and A_s4.

Eventually, we use the generated attentional features as conditions for point cloud understanding to evaluate the point cloud classification and segmentation tasks.
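The sketch below strings the pieces together for one SG stage: naive farthest point sampling, KNN grouping of features as in (12), and the residual max-pool aggregation of (13). It is a simplified, assumption-laden illustration (FPS seeded at index 0, a single-layer MLP) rather than the paper's pipeline; running it four times with rate 1/2 produces the inputs of (14)-(15):

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz, m):
    """Naive FPS [15]: xyz (B, N, 3) -> m sample indices (B, m)."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    far = torch.zeros(B, dtype=torch.long, device=xyz.device)  # seed point assumed
    rows = torch.arange(B, device=xyz.device)
    for i in range(m):
        idx[:, i] = far
        d = ((xyz - xyz[rows, far].unsqueeze(1)) ** 2).sum(-1)
        dist = torch.minimum(dist, d)
        far = dist.argmax(-1)
    return idx

class SGStage(nn.Module):
    """One sampling-and-grouping stage: halve the points (rate 1/2), group the
    k = 16 nearest neighbors, concatenate [center, neighbor - center] as in (12),
    then apply the residual max-pool aggregation of (13)."""
    def __init__(self, d_in, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * d_in, 2 * d_in), nn.ReLU())

    def forward(self, xyz, feats):                 # (B, N, 3), (B, N, D)
        B, N, _ = xyz.shape
        s_idx = farthest_point_sample(xyz, N // 2)
        rows = torch.arange(B, device=xyz.device).unsqueeze(-1)
        new_xyz, new_f = xyz[rows, s_idx], feats[rows, s_idx]
        n_idx = torch.cdist(new_xyz, xyz).topk(self.k, largest=False).indices
        nbr = feats[rows.unsqueeze(-1), n_idx]     # (B, N/2, k, D)
        ctr = new_f.unsqueeze(2).expand_as(nbr)
        g = torch.cat([ctr, nbr - ctr], dim=-1)    # (B, N/2, k, 2D), cf. (12)
        out = (self.mlp(g) + g).max(dim=2).values  # (13): (B, N/2, 2D)
        return new_xyz, out
```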
and contains 12,311 CAD models. They are divided into 9,843
IV. EXPERIMENTS

In this section, we evaluate the validity of the proposed MPCT on point cloud downstream tasks on five datasets, i.e., ModelNet40 [47], ScanObjectNN [48], ShapeNetPart [49], S3DIS [50], and ScanNet-V2 [51], and we conduct multiple comprehensive comparisons with other methods.

A. Implementation Details

In general, we train MPCT using the cross-entropy loss with label smoothing [52] and evaluate MPCT using an entire scene as its input.

For shape classification, we use SGD [53] optimization for 300 epochs with a momentum of 0.9 and a weight decay of 0.0001, and cosine annealing reduces the learning rate from 0.1 to 0.001. For part and semantic segmentation, we use AdamW [54] optimization for 200 epochs with an initial learning rate lr = 0.002 and a weight decay of 10⁻⁴, with cosine decay. The training batch size is 32, and the test batch size is 16.

In addition, the training data are augmented with a random translation in the range [−0.2, 0.2] and a random scaling ratio in [2/3, 3/2]; strategies such as point resampling, additional height, random sampling, etc., are also included [55]. We use PyTorch to implement the project, and all experiments are performed on two GeForce RTX 3090 GPUs.
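For reproducibility, the stated schedules correspond roughly to the following PyTorch setup; the model placeholder and the label-smoothing value are our assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 40)  # placeholder for the MPCT network

# Shape classification: SGD [53], 300 epochs, momentum 0.9, weight decay 1e-4,
# cosine annealing of the learning rate from 0.1 down to 0.001.
opt_cls = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched_cls = torch.optim.lr_scheduler.CosineAnnealingLR(opt_cls, T_max=300, eta_min=0.001)

# Part/semantic segmentation: AdamW [54], 200 epochs, lr = 0.002,
# weight decay 1e-4, with cosine decay.
opt_seg = torch.optim.AdamW(model.parameters(), lr=0.002, weight_decay=1e-4)
sched_seg = torch.optim.lr_scheduler.CosineAnnealingLR(opt_seg, T_max=200)

# Cross-entropy with label smoothing [52]; the smoothing factor is assumed.
criterion = nn.CrossEntropyLoss(label_smoothing=0.2)
```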
jects, and it is actually more challenging due to the complexity of
The numbers of points and feature dimensions at each sam- its backgrounds and the locality of the objects. We use its most
pling scale are different and inversely proportional, and the gen- challenging version, PBT50RS, with noisy background points
erated multiscale features can separately construct local depen- for experiments. We use the mean class accuracy (mAcc) for
dencies between the point clouds. Nonetheless, simply connect- each category and the overall accuracy (OA) calculated across
ing the point clouds at this time may ignore the remote de- all categories as evaluation metrics. To facilitate the comparison
pendencies between them, so we feed these features into the experiments, we do not use a voting strategy.

Authorized licensed use limited to: Zhejiang University of Technology. Downloaded on May 19,2025 at 05:22:56 UTC from IEEE Xplore. Restrictions apply.
TABLE I. SHAPE CLASSIFICATION RESULTS (%) OBTAINED ON MODELNET40

Table II shows the results yielded by MPCT on the ScanObjectNN dataset, which consists of actual scans of real-world objects. Our method's overall accuracy of 84.9% and average accuracy of 82.9% are significantly higher than all other results and rank first in the official leaderboard, even though the PB_T50_RS variant used in the experiment is the most challenging version. Notably, we are in a leading position in many of the 15 categories, and the accuracy of normal object prediction is directly proportional to the number of models.

C. Segmentation Results

The ShapeNetPart dataset is an object-oriented dataset with 16,880 object point clouds belonging to 16 categories, and each point is labeled as one of 50 parts; its models are divided into 14,006 models for training and 2,784 models for testing.

Table III shows the results produced by MPCT on ShapeNetPart; MPCT achieves the best results, with improvements of 3.0% and 1.5% over PointNet and DGCNN in terms of overall IoU, respectively. Fig. 5 shows further segmentation examples provided by PointNet, DGCNN, and MPCT.

The S3DIS dataset is a dataset of indoor scenes for semantic segmentation of point clouds; it contains 6 areas and 271 rooms, and each point in the dataset is assigned to one of 13 categories. We use mAcc and intersection-over-union (IoU) as evaluation metrics. The IoU of a shape is calculated as the average of the IoUs of all parts of the shape, and mIoU is the mean of the IoUs of all tested shapes.

Table IV shows the results obtained by MPCT on the S3DIS dataset. Following PointNet++'s protocol, we evaluate on Area 5. Our MPCT achieves 74.6%/68.6% mAcc/mIoU. The results are close to those of the latest PT presented in [28] and are multiple percentage points higher for each semantic label than those of previous works. Fig. 6 shows the MPCT predictions, and we can see that the predictions are close to the ground truth.

D. Object Detection Results

To verify the generalization ability of MPCT, we extend its application to 3D object detection. Specifically, we use 3DETR [42] as a baseline and port MPCT as a general component into the transformer encoder in the backbone of 3DETR. Note that we change the configuration of MPCT and embed it in 3DETR as normal, while leaving the other settings unchanged.

We report our results obtained on ScanNet-V2 [51], which has 1,201 training samples, 312 validation samples, 100 testing samples, and axis-aligned bounding box labels for 18 object categories. Following the experimental protocol in [70], we report the detection performance achieved on the validation set using the mean average precision (mAP) attained at two IoU thresholds (0.25 and 0.5, denoted as AP25 and AP50, respectively).

As shown in Table V, the revised 3DETR with our MPCT yields a significant improvement (+0.9 AP25 and +2.8 AP50) over the vanilla 3DETR. This result suggests that the network's encoder can be encouraged to learn richer representations of input point clouds under comprehensive transform processing.

E. Computational Analysis

We consider the computational efficiency of MPCT and several other methods by comparing their numbers of required parameters (Params), numbers of floating point operations (FLOPs), and inference times in Table VI. Among them, the single-scale PointNet++ has the lowest memory requirement of 1.48 M parameters, and PointNet requires the lowest processor load of 0.45 G FLOPs while providing less accurate results. Overall, MPCT delivers the best performance with moderate computational and memory requirements.

F. Ablation Studies

We perform several ablation experiments in this section, choosing alternatives for the geometric point descriptor, the attention mechanism, the position encoding, and the residual structure to understand the value of our constructions. All experimental steps and evaluation metrics are implemented in the same setup as that used for the classification experiments.

Number of Local Neighborhoods: First, we study the number of neighbors k, which is used to determine the local neighborhood around each point. The results are shown in Table VII. The best performance is obtained when k is set to 16. When the number of neighbors is small (k = 8 or k = 12), the model may not have enough context to make predictions. When the number of neighbors is large (k = 24 or k = 32), each self-attention layer is given a large number of data points, many of which may be more distant and less relevant. In addition, the weak artificial noise added during training is intended to enhance the network's learning ability, but it instead reduces the model's accuracy when the number of neighbors is large.
TABLE II. SHAPE CLASSIFICATION RESULTS (%) OBTAINED ON THE SCANOBJECTNN DATASET

TABLE III. PART SEGMENTATION RESULTS (%) OBTAINED ON THE SHAPENETPART DATASET

Fig. 5. Visualization of the part segmentation results obtained on the ShapeNetPart dataset. We highlight the different predictions in red boxes.

Geometric Point Descriptor: We next confirm that the geometric point descriptors are explicitly formed in 3D space and include four low-level geometric descriptors: coordinates, edges, normals, and edge lengths. The results are shown in Table VIII. As an essential condition, the coordinates of a point can intuitively represent global information. In addition, the edges of a point can imply more local information by referring to the relative positions of its neighbors, and the lengths of the edges can partially estimate the density distribution of the neighbors of the corresponding point. Furthermore, we use the cross product of the edge vectors to compute the point normals, enhancing the robustness of the descriptor to possible deformations such as scaling, translation, rotation, etc. However, combining all of this information is not always optimal, because it may contain redundant information; for example, the raw neighbor coordinates p_j1 and p_j2 are redundant point information that does not help to describe the low-level geometric features of point p_i.
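For clarity, the sketch below assembles the four low-level descriptors for a single point from its two nearest neighbors, following the triangle construction of Fig. 3; the concatenation order is our assumption:

```python
import torch

def low_level_descriptor(p_i, p_j1, p_j2, eps=1e-8):
    """Coordinates, edges, a cross-product normal, and edge lengths for p_i,
    built from its two nearest neighbors p_j1 and p_j2 (Fig. 3)."""
    e1, e2 = p_j1 - p_i, p_j2 - p_i                   # triangle edges
    n = torch.cross(e1, e2, dim=-1)
    n = n / (n.norm(dim=-1, keepdim=True) + eps)      # unit normal
    lens = torch.stack([e1.norm(dim=-1), e2.norm(dim=-1)], dim=-1)
    return torch.cat([p_i, e1, e2, n, lens], dim=-1)  # 3 + 3 + 3 + 3 + 2 = 14 dims

# Example: descriptor for the origin with neighbors on the x and y axes.
d = low_level_descriptor(torch.zeros(3),
                         torch.tensor([1.0, 0.0, 0.0]),
                         torch.tensor([0.0, 1.0, 0.0]))
print(d.shape)  # torch.Size([14])
```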

TABLE IV. SEMANTIC SEGMENTATION RESULTS (%) OBTAINED ON THE S3DIS DATASET, EVALUATED ON AREA 5

TABLE V. OBJECT DETECTION RESULTS (%) ON THE SCANNET-V2 DATASET

TABLE VI. COMPUTATIONAL RESOURCE REQUIREMENTS AND INFERENCE TIMES OF DIFFERENT METHODS, TESTED ON A GEFORCE RTX 3090 GPU

TABLE VII. ABLATION STUDY: NUMBER OF LOCAL NEIGHBORHOODS k

TABLE VIII. ABLATION STUDY: GEOMETRIC POINT DESCRIPTOR

TABLE IX. ABLATION STUDY: THE FORMS OF THE SELF-ATTENTION OPERATORS

Fig. 6. Visualization of the semantic segmentation results obtained on the S3DIS dataset. Top: input, middle: ground truth, bottom: MPCT (ours).

Attention Operator: Next, we study the types of self-attention used in the point transformer layer. The results are shown in Table IX. We examine five cases. MLP is an attention-free baseline that replaces the point transformer layer with a pointwise MLP. Scalar attention replaces the vector attention with scalar attention, as in (2). Vector attention is the pre-improved version, as shown in (3). We use (7) as our self-attention operator. Scalar attention is more expressive than the no-attention baseline but not as good as vector attention. The performance gap between our attention mechanism and PCT attention is also pronounced: the OA/mAcc values are 94.2%/92.0% vs. 93.5%/91.2%. Our attention operator builds on vector attention, is more expressive, and is very beneficial for 3D data processing.

Position Encoding: We then study the choice of the position encoding L, as shown in Table X. Our relative position encoding, implemented according to (4), significantly improves the performance. We argue that the coordinates of a point can intuitively represent global information, while its edges can imply more local information by referring to its neighbors, and the length of an edge can estimate the density distribution of the neighbors of the corresponding point.

Residual Structure: Finally, we perform an ablation study on the residual connections used in (10) and (13).
TABLE X. ABLATION STUDY: THE POSITION ENCODING AND RESIDUAL STRUCTURE

The results are shown in Table X. We can see that the performance achieved without the residual structure on ModelNet40 is 93.3%/91.3% in terms of the OA/mAcc metrics, which is considerably lower (by 0.9%/0.7%) than the performance attained with the residual structure. This indicates that even a simple residual structure is essential in this case.

G. Limitations

Our MPCT and the previous approaches, such as PT [28], PCT [30], and PointMLP [43], contain similar parts, including a transformer, a position encoding module, and a residual network. However, we make integrative improvements.

For instance, we adopt a pairwise mechanism to calculate attention similarity with a lower, linear O(ND) complexity. Our work is novel in that we are the first to systematically apply a residual network to the high-level geometric and semantic features extracted from point clouds and to demonstrate its effectiveness in a simple and efficient manner. Additionally, our results validate the proposed method through the use of multiscale point clouds.

The differences between our approach and the main reference works are as follows.
• In PT [28], only single-scale features are used. Even though PT obtains point clouds with different resolutions through a series of downsampling operations and then aggregates them through corresponding upsampling operations, information is still inevitably lost at the different stages of the network. In contrast, our MPCT with a residual network can analyze and generalize point cloud features at different scales to effectively solve this problem.
• PCT [30] (see Fig. 1) only uses a single-scale embedding and does not consider position encoding, and it uses a more complex and computationally intensive attention mechanism (refer to Fig. 4(c)). We achieve advantages in terms of both efficiency and performance by applying a pairwise subtractive attention mechanism (see Fig. 4(d)).
• PointMLP [43] does not use a transformer and provides no explicit treatment of the geometric and semantic features of the given point cloud. Inspired by this, we embed an improved residual network in our novel point cloud transformer and achieve an effective improvement.

Looking back at the proposed MPCT itself, even though the cost of self-attention is reduced by utilizing subtractive pairwise attention, which has been proven effective for point cloud understanding, MPCT only performs the corresponding operation in the local domain and does not consider an attention mechanism over the global positions of the point cloud. Furthermore, the performance of this attention when transferred to complex point cloud scenes is unknown. This is a new challenge that we need to face in future work.

V. CONCLUSION

In this article, we propose a transformer-based architecture that is applied to point clouds. As point clouds are essentially geometric positions in a metric space, the core self-attention operator of the transformer network fits well with position operations. We further improve the self-attention network and the generated features and introduce the residual network to maintain continuity when processing the various stages of the given point cloud. Extensive experiments show that MPCT has a good feature learning capability and achieves advanced performance on several benchmarks, especially shape classification and semantic segmentation. Note that our research is more comprehensive and in-depth than previous pioneering works. In the future, we hope that our work will inspire further studies on the characteristics of point cloud transformers.

ACKNOWLEDGMENT

We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and the Ascend AI Processor used for this research.

REFERENCES

[1] T. Weng, J. Xiao, F. Yan, and H. Jiang, "Context-aware 3D point cloud semantic segmentation with plane guidance," IEEE Trans. Multimedia, early access, Oct. 10, 2022, doi: 10.1109/TMM.2022.3212914.
[2] Y. Wu et al., "Multi-view point cloud registration based on evolutionary multitasking with bi-channel knowledge sharing mechanism," IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 2, pp. 357–374, Apr. 2023.
[3] C. Sun, Z. Zheng, X. Wang, M. Xu, and Y. Yang, "Self-supervised point cloud representation learning via separating mixed shapes," IEEE Trans. Multimedia, early access, Sep. 14, 2022, doi: 10.1109/TMM.2022.3206664.
[4] Y. Wu et al., "Self-supervised intra-modal and cross-modal contrastive learning for point cloud understanding," IEEE Trans. Multimedia, early access, Jun. 09, 2023, doi: 10.1109/TMM.2023.3284591.
[5] Y. Wu et al., "RORNet: Partial-to-partial registration network with reliable overlapping representations," IEEE Trans. Neural Netw. Learn. Syst., early access, Jun. 30, 2023, doi: 10.1109/TNNLS.2023.3286943.
[6] Z. Zhang, J. Chen, X. Xu, C. Liu, and Y. Han, "Hawk-eye-inspired perception algorithm of stereo vision for obtaining orchard 3D point cloud navigation map," CAAI Trans. Intell. Technol., to be published.
[7] H. Wang, D. Huang, and Y. Wang, "GridNet: Efficiently learning deep hierarchical representation for 3D point cloud understanding," Front. Comput. Sci., vol. 16, no. 1, 2022, Art. no. 161301.
[8] J. Liu et al., "Instance-guided point cloud single object tracking with inception transformer," IEEE Trans. Instrum. Meas., to be published.
[9] Y. Guo et al., "Deep learning for 3D point clouds: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 12, pp. 4338–4364, Dec. 2021.
[10] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 945–953.
[11] A. Kanezaki, Y. Matsushita, and Y. Nishida, "RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5010–5019.
[12] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 922–928.
[13] C. Lv, W. Lin, and B. Zhao, "Voxel structure-based mesh reconstruction from a 3D point cloud," IEEE Trans. Multimedia, vol. 24, pp. 1815–1829, 2022.
Authorized licensed use limited to: Zhejiang University of Technology. Downloaded on May 19,2025 at 05:22:56 UTC from IEEE Xplore. Restrictions apply.
WU et al.: MPCT: MULTISCALE POINT CLOUD TRANSFORMER WITH A RESIDUAL NETWORK 3515

[14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on [40] W.-X. Tao, G.-Y. Jiang, Z.-D. Jiang, and M. Yu, “Point cloud projection
point sets for 3D classification and segmentation,” in Proc. IEEE Conf. and multi-scale feature fusion network based blind quality assessment for
Comput. Vis. Pattern Recognit., 2017, pp. 77–85. colored point clouds,” in Proc. 29th ACM Int. Conf. Multimedia, 2021,
[15] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical pp. 5266–5272.
feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. [41] X.-F. Han, Y.-F. Jin, H.-X. Cheng, and G.-Q. Xiao, “Dual transformer
Process. Syst., 2017, pp. 5105–5114. for point cloud analysis,” IEEE Trans. Multimedia, early access, Aug.
[16] Y. Wang et al., “Dynamic graph CNN for learning on point clouds,” ACM 11, 2022, doi: 10.1109/TMM.2022.3198318.
Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019. [42] I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer model for
[17] H. Thomas et al., “KPConv: Flexible and deformable convolution 3D object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021,
for point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 2886–2897.
pp. 6410–6419. [43] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design
[18] Y. Wu et al., “Evolutionary multiform optimization with two-stage and local geometry in point cloud: A simple residual MLP framework,” in
bidirectional knowledge transfer strategy for point cloud registra- Proc. Int. Conf. Learn. Representations, 2022.
tion,” IEEE Trans. Evol. Comput., early access, Oct. 19, 2022, [44] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recog-
doi: 10.1109/TEVC.2022.3215743. nition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020,
[19] Y. You et al., “PRIN/SPRIN: On extracting point-wise rotation invari- pp. 10073–10082.
ant features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, [45] S. Qiu, S. Anwar, and N. Barnes, “Geometric back-projection net-
pp. 9489–9502, Dec. 2022. work for point cloud classification,” IEEE Trans. Multimedia, vol. 24,
[20] Y. Wu et al., “Evolutionary multitasking with solution space cutting for pp. 1943–1955, 2022.
point cloud registration,” IEEE Trans. Emerg. Topics Comput. Intell., early [46] Q. Zhen et al., “CosFormer: Rethinking softmax in attention,” in Proc. Int.
access, Jul. 12, 2023, doi: 10.1109/TETCI.2023.3290009. Conf. Learn. Representations, 2022.
[21] W. Wu, Z. Qi, and L. Fuxin, “PointConv: Deep convolutional networks on [47] Z. Wu et al., “3D ShapeNets: A deep representation for volumetric shapes,”
3D point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
2019, pp. 9613–9622. [48] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting
[22] Y. Wu et al., “INENet: Inliers estimation network with similarity learn- point cloud classification: A new benchmark dataset and classification
ing for partial overlapping registration,” IEEE Trans. Circuits Syst. Video model on real-world data,” in Proc. IEEE/CVF Int. Conf. Comput. Vis.,
Technol., vol. 33, no. 3, pp. 1413–1426, Mar. 2023. 2019, pp. 1588–1597.
[23] X. Lin, K. Chen, and K. Jia, “Object point cloud classification via poly- [49] L. Yi et al., “A scalable active framework for region annotation in
convolutional architecture search,” in Proc. 29th ACM Int. Conf. Multime- 3D shape collections,” ACM Trans. Graph., vol. 35, no. 6, pp. 1–12,


Yue Wu (Member, IEEE) received the B.Eng. and Ph.D. degrees from Xidian University, Xi’an, China, in 2011 and 2016, respectively. Since 2016, he has been a Teacher with Xidian University, where he is currently an Associate Professor. He has authored or coauthored more than 100 papers in refereed journals and proceedings. His research interests include computational intelligence and its applications. He is the Secretary General of the Chinese Association for Artificial Intelligence-Youth Branch, was the Secretary General of CCF Xi’an during 2020–2022 and the Chair of CCF YOCSEF Xi’an during 2021–2022, and is a CCF/CAAI Senior Member. He is an Editorial Board Member for over six journals, including CAAI Transactions on Intelligence Technology, Frontiers of Computer Science, Remote Sensing, and Electronics.

Jiaming Liu received the B.S. degree in software engineering from the Jiangxi University of Finance and Economics, Nanchang, China, in 2021. He is currently working toward the master’s degree with the School of Computer Science and Technology, Xidian University, Xi’an, China. His research interests include artificial intelligence, machine learning, and computer vision.

Maoguo Gong (Senior Member, IEEE) received the B.Eng. and Ph.D. degrees from Xidian University, Xi’an, China, in 2003 and 2009, respectively. Since 2006, he has been a Teacher with Xidian University, where he was promoted to Associate Professor and Full Professor in 2008 and 2010, respectively, with exceptive admission. He has authored or coauthored more than 100 articles in journals and conferences and holds more than 20 granted patents as the first inventor. He is leading or has completed more than twenty projects as the Principal Investigator, funded by the National Natural Science Foundation of China, the National Key Research and Development Program of China, and others. His research interests include computational intelligence, with applications to optimization, learning, data mining, and image understanding. Prof. Gong is an Executive Committee Member of the Chinese Association for Artificial Intelligence and a Senior Member of the Chinese Computer Federation. He was the recipient of the prestigious National Program for Support of the Leading Innovative Talents from the Central Organization Department of China, the Leading Innovative Talent in Science and Technology from the Ministry of Science and Technology of China, the Excellent Young Scientist Foundation from the National Natural Science Foundation of China, the New Century Excellent Talent from the Ministry of Education of China, and the National Natural Science Award of China. He is an Associate Editor or an Editorial Board Member for over five journals, including IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION and IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.

Zhixiao Liu received the B.S. degree in communication engineering from Nanchang University, Nanchang, China, in 2021. He is currently working toward the master’s degree with the School of Electronic Information, Harbin Engineering University, Harbin, China. His research interests include artificial intelligence, deep learning, and computer vision.

Qiguang Miao (Senior Member, IEEE) received the M.Eng. and Ph.D. degrees in computer science from Xidian University, Xi’an, China. He is currently a Professor with the School of Computer Science and Technology, Xidian University. His research interests include intelligent image processing and multiscale geometric representations for images.

Wenping Ma (Senior Member, IEEE) received the B.S. degree in computer science and technology and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi’an, China, in 2003 and 2008, respectively. Since 2006, she has been with the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, Xidian University, where she is currently an Associate Professor. She has authored or coauthored more than 30 SCI papers in international academic journals, including IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, IEEE TRANSACTIONS ON IMAGE PROCESSING, Information Sciences, Pattern Recognition, Applied Soft Computing, Knowledge-Based Systems, Physica A: Statistical Mechanics and Its Applications, and IEEE GEOSCIENCE AND REMOTE SENSING LETTERS. Her research interests include natural computing and intelligent image processing. Dr. Ma is a member of the Chinese Institute of Electronics and the China Computer Federation.
