MPCT: Multiscale Point Cloud Transformer With a Residual Network
Abstract—The self-attention (SA) network revisits the essence of data and has achieved remarkable results in text processing and image analysis. SA is conceptualized as a set operator that is insensitive to the order and number of data, making it suitable for point sets embedded in 3D space. However, working with point clouds still poses challenges. To tackle the issue of exponential growth in complexity and singularity induced by the original SA network without position encoding, we modify the attention mechanism by incorporating position encoding to make it linear, thus reducing its computational cost and memory usage and making it more feasible for point clouds. This article presents a new framework called the multiscale point cloud transformer (MPCT), which improves upon prior methods in cross-domain applications. The utilization of multiple embeddings enables the complete capture of the remote and local contextual connections within point clouds, as determined by our proposed attention mechanism. Additionally, we use a residual network to facilitate the fusion of multiscale features, allowing MPCT to better comprehend the representations of point clouds at each stage of attention. Experiments conducted on several datasets demonstrate that MPCT outperforms the existing methods, achieving accuracies of 94.2% and 84.9% in classification tasks on ModelNet40 and ScanObjectNN, respectively.

Index Terms—Geometric and semantic features, multiscale generation, point cloud transformer, residual network.

Manuscript received 25 October 2022; revised 17 February 2023 and 3 August 2023; accepted 1 September 2023. Date of publication 12 September 2023; date of current version 14 February 2024. This work was supported in part by the National Natural Science Foundation of China under Grants 62276200 and 62036006, in part by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2022JM-327, and in part by the CAAI-Huawei MINDSPORE Academic Open Fund. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. De-Nian Yang. (Corresponding author: Maoguo Gong.)

Yue Wu, Jiaming Liu, and Qiguang Miao are with the School of Computer Science and Technology, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]).

Maoguo Gong is with the School of Electronic Engineering, Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

Zhixiao Liu is with the Yantai Research Institute, Harbin Engineering University, Yantai 264006, China (e-mail: [email protected]).

Wenping Ma is with the School of Artificial Intelligence, Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

https://github.com/ywu0912/TeamCode.git

Digital Object Identifier 10.1109/TMM.2023.3312855

I. INTRODUCTION

WITH the rapid development of 3D sensing technology, 3D point cloud data are appearing in many application areas such as autonomous driving, virtual and augmented reality, and robotics. Driven by deep neural networks, recent 3D works [1], [2], [3], [4], [5], [6], [7], [8] have focused on processing point clouds with learning-based methods. However, unlike images arranged on a regular pixel grid, point clouds are sets of points embedded in three-dimensional space. This makes 3D point clouds structurally different from images and representationally different from complex 3D data (e.g., grid and voxel data): point clouds have the simplest format but cannot be directly used with the deep networks designed for standard tasks in computer vision.

To address this challenge, many deep learning-based methods have become powerful tools for representing 3D point clouds. Guo et al. [9] classified learning-based point cloud methods into multiview methods [10], [11], voxel methods [12], [13], and point methods [14], [15]. Since Qi et al. introduced the multilayer perceptron (MLP) operation in PointNet, point-based methods have become popular, and subsequent methods [16], [17] have used graph context and kernel points to conduct further learning on the basis of MLPs. Although these methods aim to improve point cloud understanding, some issues remain to be considered. 1) In addition to 3D coordinates, can we provide more geometric information for feature learning? 2) How can the network learn better representations in the abstract high-level space while keeping its costs low and its efficiency high? 3) How can the network focus on features and connect different features to optimize the output?

Compared to point- and convolution-based learning methods [17], [18], [19], [20], [21], [22], [23], [24], transformers [25] have recently been applied to text language and image vision tasks, and their performance is impressive in numerous respects [26], [27]. Prior to this, much research [28], [29], [30], [31], [32], [33] began to study the use of transformers in point clouds. The transformer is a classic encoder-decoder structure that contains three main modules: input (word) embedding, position (sequential) encoding, and self-attention (SA) modules. Interestingly, the series of transformer operations is especially suitable for point cloud processing, as demonstrated by the fact that it can uniquely establish remote dependencies between points. In addition, the core SA module of the network is essentially a set operator, which is independent of the order and distribution of the given data.

However, when applying a transformer to point clouds, the input sequence generated from points of the same size only retains a simple scale feature in the same layer with an MLP.
the local neighborhood of a specific point to generate relevant semantic and local information.

product. Scalar and vector self-attention are essentially set operators, and the complexity of self-attention is O(N²D).
Fig. 2. Proposed MPCT architecture. MPCT aims to exploit rich point cloud features to build a powerful perception network, including four parts: position embedding, semantic embedding, multiscale embedding, and self-attention modules. The input features (blue) are generated from the enhanced point information, the semantic features (green) are generated from the input features, and the geometric features (cyan) are generated from the high-level geometric information of the initial points. The semantic features are decomposed into attention features via three linear transformations and fed forward along with the geometric features. ⊕, ⊖, ⊗, and C represent pairwise summation, subtraction, multiplication, and connection operations performed via the feature channel, respectively.
extract feature information at four scales. We hope to use four cascaded SG layers to gradually expand the receptive field during feature aggregation so that the neural network can see more refined features. The SG layers use the Euclidean distance to search for the points grouped by K-NN during point cloud sampling and aggregate the features derived from local neighborhoods. To facilitate the distinction, we directly refer to the features obtained via the aggregation of the point features as semantic features, whose search and connection strategies are the same as those of the position encoding features.

Corresponding to the geometric features above, θ(f_i, ∀f_i^k) = [RP(f_i, K), RP(f_i, K) − ∀f_i^k] is typically used to denote the local semantic context in the feature space. Nevertheless, θ(f_i, ∀f_i^k) may not be sufficient for representing the neighborhood, for two reasons: 1) due to the sparse and irregular geometric structures of point clouds, its generalization capability in high-dimensional feature spaces may be weakened, and 2) since point cloud neighbors are not unique to each other in the representation of closed regions, information redundancy may occur. To address these issues and enhance the generalization ability of the features, we transform the local points into a normal distribution while maintaining their original semantics:

∀f_i^k = α × (∀f_i^k − f_i) / (σ + ε) + β,   σ = √((1/K) Σ_{k=1}^{K} (f_i^k − f_i)²),   (11)

where α ∈ R^D and β ∈ R^D are learnable parameters and ε = 1e−8 is a small constant for numerical stability.
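For concreteness, a minimal PyTorch sketch of the normalization in (11) is given below; the module name, the (B, N, K, D) tensor layout, and the per-channel standard deviation are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class NeighborNorm(nn.Module):
    """Sketch of (11): normalize each point's K neighbor features toward a
    normal distribution with learnable affine parameters alpha and beta."""
    def __init__(self, dim, eps=1e-8):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # alpha in R^D
        self.beta = nn.Parameter(torch.zeros(dim))   # beta in R^D
        self.eps = eps

    def forward(self, neighbors, center):
        # neighbors: (B, N, K, D) features f_i^k of the K neighbors of each point
        # center:    (B, N, D)    feature f_i of each center point
        diff = neighbors - center.unsqueeze(2)                     # f_i^k - f_i
        sigma = torch.sqrt((diff ** 2).mean(dim=2, keepdim=True))  # std over K
        return self.alpha * diff / (sigma + self.eps) + self.beta
```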
Then, we use farthest point sampling (FPS) [15] to sample P down to P_s1; for each sampling point p ∈ P_s1, we let N_P(p) denote its nearest neighborhood in P. Applying the same processing on the basis of P_s1, we set the four sampling rates and neighborhood sizes to be equal, with values of 0.5 and 16, respectively. Next, we calculate the corresponding local features:

F_s1(p) = [RP(f_s1(p), K), ∀f_i^k],   F_s1(p) ∈ R^(K×2D),   (12)

F_s1(p) = MAX(M(F_s1(p)) + F_s1(p)),   F_s1(p) ∈ R^(2D),   (13)

where f_s1(p) is the feature of point p in the sampled point cloud P_s1, and F_s1(p) is the feature of sampling point p after aggregating its neighboring points, i.e., the input feature of the self-attention network module.
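The sampling-and-grouping step of (12) and (13) can be sketched as follows in PyTorch; the helper names, the way the sampled center indices are supplied (e.g., by FPS [15]), and the assumption that the shared MLP M preserves the 2D channel width are all our own illustrative choices.

```python
import torch

def knn_indices(xyz, centers, k):
    """Indices of the k nearest (Euclidean) neighbors of each center in xyz.
    xyz: (B, N, 3), centers: (B, M, 3) -> (B, M, k)"""
    dist = torch.cdist(centers, xyz)              # pairwise distances (B, M, N)
    return dist.topk(k, largest=False).indices

def sg_layer(xyz, feats, center_idx, mlp, k=16):
    """One SG step: group the K neighbor features around each sampled center,
    concatenate them with the repeated center feature as in (12), apply the
    shared MLP with a residual add, and max-pool over K as in (13)."""
    B, N, D = feats.shape
    M = center_idx.shape[1]
    centers = torch.gather(xyz, 1, center_idx.unsqueeze(-1).expand(B, M, 3))
    center_f = torch.gather(feats, 1, center_idx.unsqueeze(-1).expand(B, M, D))
    idx = knn_indices(xyz, centers, k)                             # (B, M, k)
    grouped = torch.gather(feats.unsqueeze(1).expand(B, M, N, D), 2,
                           idx.unsqueeze(-1).expand(B, M, k, D))   # (B, M, k, D)
    grouped = torch.cat([center_f.unsqueeze(2).expand(B, M, k, D),
                         grouped], dim=-1)                         # (B, M, k, 2D)
    out = mlp(grouped) + grouped   # residual add; mlp must keep width 2D
    return out.max(dim=2).values   # max over the K neighbors -> (B, M, 2D)
```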
Next, we connect the point features of the sampled P_s1 and regard them as the research objects for generating P_s2, P_s3, and P_s4. After executing a series of transformations, we obtain the representation comprising the multiscale local features:

F_si = ⋃_{p_i ∈ P_si} F_si(p_i)   s.t. i = 1, 2, 3, 4.   (14)

The numbers of points and feature dimensions at each sampling scale are different and inversely proportional, and the generated multiscale features can separately construct local dependencies between the point clouds. Nonetheless, simply connecting the point clouds at this time may ignore the remote dependencies between them, so we feed these features into the self-attention network, which can be expressed as follows:

A_s1 = A(F_s1, L_s1),   A_s1 ∈ R^(N×D),   (15)

where A_s1 and L_s1 are the attentional feature produced by the self-attention function and the geometric feature mentioned before, respectively. Similarly, we can obtain A_s2, A_s3, and A_s4.

Eventually, we use the generated attentional features as conditions for point cloud understanding to evaluate the point cloud classification and segmentation tasks.
IV. EXPERIMENTS

In this section, we evaluate the validity of the proposed MPCT for point cloud downstream tasks on five datasets, i.e., ModelNet40 [47], ScanObjectNN [48], ShapeNetPart [49], S3DIS [50], and ScanNet-V2 [51], and we conduct multiple comprehensive comparisons with other methods.

A. Implementation Details

In general, we train MPCT using the cross-entropy loss with label smoothing [52] and evaluate MPCT using an entire scene as its input.

For shape classification, we use SGD [53] optimization for 300 epochs with a momentum of 0.9 and a weight decay of 0.0001, and cosine annealing reduces the learning rate from 0.1 to 0.001. For part and semantic segmentation, we use AdamW [54] optimization for 200 epochs with an initial learning rate lr = 0.002 and a weight decay of 10⁻⁴, with cosine decay. The training batch size is 32, and the test batch size is 16.

In addition, the training data are augmented with random translations in the range [−0.2, 0.2] and random scaling in the range [2/3, 3/2], and strategies such as point resampling, additional height, random sampling, etc., are also included [55]. We use PyTorch to implement the project, and all experiments are performed on two GeForce RTX 3090 GPUs.
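This schedule maps directly onto standard PyTorch components; a minimal sketch follows, where `model` stands for the MPCT network and the label-smoothing factor of 0.1 is our assumption, since the paper does not state its value.

```python
import torch
import torch.nn as nn

# classification: SGD with momentum 0.9 and weight decay 1e-4, cosine
# annealing of the learning rate from 0.1 down to 0.001 over 300 epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=300, eta_min=0.001)

# cross-entropy with label smoothing [52]; the 0.1 factor is assumed here
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# part/semantic segmentation instead uses AdamW with lr = 0.002 and
# weight decay 1e-4 for 200 epochs, also with cosine decay
seg_optimizer = torch.optim.AdamW(model.parameters(), lr=0.002,
                                  weight_decay=1e-4)
```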
B. Classification Results

The synthetic ModelNet40 dataset has 40 object categories and contains 12,311 CAD models, divided into 9,843 models for training and 2,468 models for testing.

Table I shows the overall and average classification accuracies achieved by MPCT on ModelNet40. Specifically, we achieve an overall accuracy of 94.2% and an average classification accuracy of 92.0%, which exceeds the results obtained with similar inputs overall. Notably, our method also performs better than the other methods that use more input points or features.

The ScanObjectNN real-world dataset has 15 categories with 2,902 unique object instances. It contains more than 15,000 objects, and it is considerably more challenging due to the complexity of its backgrounds and the locality of the objects. We use its most challenging version, PBT50RS, with noisy background points for our experiments. We use the mean class accuracy (mAcc) for each category and the overall accuracy (OA) calculated across all categories as evaluation metrics. To facilitate the comparison experiments, we do not use a voting strategy.
TABLE I
SHAPE CLASSIFICATION RESULTS (%) OBTAINED ON MODELNET40

Table II shows the results yielded by MPCT on the ScanObjectNN dataset, which includes actual scan data containing real-world objects. Our method's overall accuracy of 84.9% and average accuracy of 82.9% are significantly higher than all other results and rank first in the official rankings, even though the PBT50RS variant used in the experiment is the most challenging version. Notably, we are in a leading position in many of the 15 categories, and the prediction accuracy for normal objects is directly proportional to the number of models.
C. Segmentation Results

The ShapeNetPart dataset is an object-oriented dataset with 16,880 object point clouds belonging to 16 categories, and each point is labeled as one of 50 parts; its models are divided into 14,006 models for training and 2,784 models for testing.

Table III shows the results produced by MPCT on ShapeNetPart, which achieves the best results, with improvements of 3.0% and 1.5% over PointNet and DGCNN in terms of their overall IoU results, respectively. Fig. 5 shows further segmentation examples provided by PointNet, DGCNN, and MPCT.

The S3DIS dataset is a dataset of indoor scenes for the semantic segmentation of point clouds; it contains 6 areas and 271 rooms, and each point in the dataset is assigned to one of 13 categories. We use mAcc and intersection-over-union (IoU) as evaluation metrics. The IoU of a shape is calculated as the average of the IoUs of all parts of the shape, and mIoU is the mean of the IoUs of all tested shapes.
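As a concrete reading of this metric, a small sketch of the shape-IoU computation is given below; treating a part that is absent from both the prediction and the ground truth as having IoU = 1 is a common convention that we assume here, since the paper does not state its handling.

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """IoU of one shape: the mean of per-part IoUs over the shape's parts.
    pred, gt: per-point part labels of one shape; part_ids: its valid parts."""
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        ious.append(1.0 if union == 0 else inter / union)  # absent part -> 1
    return float(np.mean(ious))

# mIoU is then the mean of shape_iou over all tested shapes
```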
Table IV shows the results obtained by MPCT on the S3DIS dataset. Following PointNet++'s protocol, we evaluate Area 5. Our MPCT achieves 74.6%/68.6% mAcc/mIoU. The results are close to those of the latest PT presented by [28] and are multiple percentage points higher for each semantic label than those of previous works. Fig. 6 shows the MPCT predictions, and we can see that the predictions are close to the real situation.

E. Computational Analysis

We consider the computational efficiency of MPCT and several other methods by comparing their numbers of required parameters (Params), numbers of floating point operations (FLOPs), and inference times in Table VI. Among them, the single-scale PointNet++ has the lowest memory requirement of 1.48 M parameters, and PointNet requires the lowest processor load of 0.45 G FLOPs while providing less accurate results. Overall, MPCT has the best performance with moderate computational and memory requirements.

F. Ablation Studies

We perform several ablation experiments in this section, choosing alternatives for the geometric point descriptor, the attention mechanisms, the positional encoding, and the residual structure to understand the value of our constructions. All experimental steps and evaluation metrics are implemented in the same setup as that used for the classification experiments.

Number of Local Neighborhoods: First, we study the number of neighbors k, which is used to determine the local neighborhood around each point. The results are shown in Table VII. The best performance is obtained when k is set to 16. When the number of neighbors is small (k = 8 or k = 12), the model may not have enough context to make predictions. When the number of neighbors is large (k = 24 or k = 32), each self-attention layer is given a large number of data points, many of which may be more distant and less relevant. In addition, the artificial weak noise added during training is intended to enhance the network's learning ability, but it instead reduces the model's accuracy when the number of neighbors is large.
TABLE II
SHAPE CLASSIFICATION RESULTS (%) OBTAINED ON THE SCANOBJECTNN DATASET
TABLE III
PART SEGMENTATION RESULTS (%) OBTAINED ON THE SHAPENETPART DATASET
Fig. 5. Visualization of the part segmentation results obtained on the ShapeNetPart dataset. We highlight the different predictions in red boxes.
Geometric Point Descriptor: We next confirm that geometric point descriptors are explicitly formed in 3D space and include four low-level geometric descriptors: coordinates, edges, normals, and edge lengths. The results are shown in Table VIII. As an essential condition, the coordinates of a point intuitively represent global information. In addition, the edges of a point imply more local information by referring to the relative positions of its neighbors, and the lengths of the edges partially estimate the density distribution of the neighbors of the corresponding point. Furthermore, we use the cross product of the edge vectors to compute the point normals to enhance the robustness of the descriptor to possible deformations such as scaling, translation, and rotation. However, combining all of this information is not always optimal because it may contain redundant information; for example, redundant point information, i.e., p_j1 and p_j2, does not help to describe the low-level geometric features of point p_i.
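A sketch of how these four descriptors could be assembled is shown below; the function name and the choice of the first two edge vectors for the cross product are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def geometric_descriptor(xyz, neighbor_idx):
    """Sketch of the four low-level descriptors: coordinates, edges,
    edge lengths, and normals from the cross product of two edge vectors.
    xyz: (N, 3) point coordinates; neighbor_idx: (N, k) neighbor indices."""
    neighbors = xyz[neighbor_idx]                  # (N, k, 3)
    edges = neighbors - xyz.unsqueeze(1)           # (N, k, 3) relative positions
    lengths = edges.norm(dim=-1, keepdim=True)     # (N, k, 1) density estimate
    # normals from the cross product of two edge vectors (here the first
    # two neighbors), normalized for numerical stability
    normals = F.normalize(torch.cross(edges[:, 0], edges[:, 1], dim=-1), dim=-1)
    return edges, lengths, normals                 # plus xyz as coordinates
```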
TABLE VIII
ABLATION STUDY: GEOMETRIC POINT DESCRIPTOR
TABLE IX
ABLATION STUDY: THE FORMS OF THE SELF-ATTENTION OPERATORS
TABLE X
ABLATION STUDY: THE POSITION ENCODING AND RESIDUAL STRUCTURE
The results are shown in Table X. We can see that the performance achieved without the residual structure on ModelNet40 is 93.3%/91.3% in terms of the OA/mAcc metrics, which is 0.9%/0.7% lower than the performance attained using the residual structure. This indicates that even a simple residual structure is essential in this case.
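As an illustration of the role the residual path plays, a minimal fusion module is sketched below, assuming the four attentional features have already been brought to a common (B, N, D) shape; it is a sketch of the idea, not the exact released layer.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Fuse the multiscale attentional features A_s1..A_s4 with a residual
    path, so the fused output keeps access to the pre-attention features."""
    def __init__(self, dim, num_scales=4):
        super().__init__()
        self.proj = nn.Linear(num_scales * dim, dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, attn_feats, skip):
        # attn_feats: list of (B, N, D) attention outputs, one per scale
        # skip:       (B, N, D) input feature reused as the identity branch
        fused = self.proj(torch.cat(attn_feats, dim=-1))
        return self.act(fused + skip)   # residual: identity + fused attention
```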
G. Limitations istics of point cloud transformers.
Our MPCT and the previous approaches, such as PT [28], PCT [30], and PointMLP [43], contain similar parts, including a transformer, a position encoding module, and a residual network. However, we make integrative improvements.

For instance, we adopt a pairwise mechanism to calculate attention similarity with a lower, linear O(ND) complexity. Our work is novel in that we are the first to systematically apply a residual network to high-level geometric and semantic features extracted from point clouds and demonstrate its effectiveness in a simple and efficient manner. Additionally, our results validate the proposed method through the use of multiscale point clouds.

The differences between our approach and the main reference works are as follows.
• In PT [28], only single-scale features are used. Even though PT obtains point clouds with different resolutions through a series of downsampling operations and then aggregates them through corresponding upsampling operations, information is still inevitably lost at different stages of the network. In contrast, our MPCT with a residual network can analyze and generalize point cloud features at different scales to effectively solve this problem.
• PCT [30] (see Fig. 1) only uses a single-scale embedding and does not consider position encoding, using a more complex and computationally intensive attention mechanism (refer to Fig. 4(c)). We achieve advantages in terms of both efficiency and performance by applying a pairwise subtractive attention mechanism (see Fig. 4(d) and the sketch after this list).
• PointMLP [43] does not use a transformer and provides no obvious judgments concerning the geometric and semantic features of the given point cloud. Inspired by this, we embed an improved residual network in our novel point cloud transformer and achieve an effective improvement.
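A sketch of such a subtraction-based pairwise attention over K local neighbors is given below, in the spirit of vector attention in PT [28]; the weight MLP, the way the position encoding enters, and the softmax normalization are our assumptions. Because attention is computed only over each point's K neighbors, the cost grows linearly with the number of points N.

```python
import torch
import torch.nn as nn

class SubtractiveAttention(nn.Module):
    """Pairwise subtractive attention over K local neighbors: the weights
    come from (q_i - k_j) plus a position encoding, so the total cost is
    O(N*K*D), i.e., linear in N rather than quadratic."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.weight_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                        nn.Linear(dim, dim))

    def forward(self, x, neighbors, pos_enc):
        # x: (B, N, D); neighbors: (B, N, K, D); pos_enc: (B, N, K, D)
        q = self.to_q(x).unsqueeze(2)          # (B, N, 1, D)
        k = self.to_k(neighbors)               # (B, N, K, D)
        v = self.to_v(neighbors) + pos_enc     # values carry position encoding
        w = self.weight_mlp(q - k + pos_enc)   # subtractive pairwise relation
        attn = torch.softmax(w, dim=2)         # normalize over the K neighbors
        return (attn * v).sum(dim=2)           # aggregated output (B, N, D)
```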
Looking back at the proposed MPCT itself, the subtractive pairwise attention reduces the cost of self-attention and has been proven effective for point cloud understanding. However, MPCT only performs this operation in the local domain and does not consider establishing the attention mechanism over the global positions of the point cloud. Furthermore, the performance of transferring this attention to complex point cloud scenes is unknown. This is a new challenge that we need to face in future work.
V. CONCLUSION

In this article, we propose a transformer-based architecture that is applied to point clouds. As point clouds are essentially geometric positions in a metric space, the core self-attention operator of the transformer network fits well with position operations. We further improve the self-attention network and the generated features and introduce the residual network to maintain continuity when processing the various stages of the given point cloud. Extensive experiments show that MPCT has a good feature learning capability and achieves advanced performance on several benchmarks, especially shape classification and semantic segmentation. Note that our research is more comprehensive and in-depth than previous pioneering works. In the future, we hope that our work will inspire further studies on the characteristics of point cloud transformers.

ACKNOWLEDGMENT

We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research.

REFERENCES
[1] T. Weng, J. Xiao, F. Yan, and H. Jiang, “Context-aware 3D point cloud semantic segmentation with plane guidance,” IEEE Trans. Multimedia, early access, Oct. 10, 2022, doi: 10.1109/TMM.2022.3212914.
[2] Y. Wu et al., “Multi-view point cloud registration based on evolutionary multitasking with bi-channel knowledge sharing mechanism,” IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 2, pp. 357–374, Apr. 2023.
[3] C. Sun, Z. Zheng, X. Wang, M. Xu, and Y. Yang, “Self-supervised point cloud representation learning via separating mixed shapes,” IEEE Trans. Multimedia, early access, Sep. 14, 2022, doi: 10.1109/TMM.2022.3206664.
[4] Y. Wu et al., “Self-supervised intra-modal and cross-modal contrastive learning for point cloud understanding,” IEEE Trans. Multimedia, early access, Jun. 09, 2023, doi: 10.1109/TMM.2023.3284591.
[5] Y. Wu et al., “RORNet: Partial-to-partial registration network with reliable overlapping representations,” IEEE Trans. Neural Netw. Learn. Syst., early access, Jun. 30, 2023, doi: 10.1109/TNNLS.2023.3286943.
[6] Z. Zhang, J. Chen, X. Xu, C. Liu, and Y. Han, “Hawk-eye-inspired perception algorithm of stereo vision for obtaining orchard 3D point cloud navigation map,” CAAI Trans. Intell. Technol., to be published.
[7] H. Wang, D. Huang, and Y. Wang, “GridNet: Efficiently learning deep hierarchical representation for 3D point cloud understanding,” Front. Comput. Sci., vol. 16, no. 1, 2022, Art. no. 161301.
[8] J. Liu et al., “Instance-guided point cloud single object tracking with inception transformer,” IEEE Trans. Instrum. Meas., to be published.
[9] Y. Guo et al., “Deep learning for 3D point clouds: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 12, pp. 4338–4364, Dec. 2021.
[10] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 945–953.
[11] A. Kanezaki, Y. Matsushita, and Y. Nishida, “RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5010–5019.
[12] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 922–928.
[13] C. Lv, W. Lin, and B. Zhao, “Voxel structure-based mesh reconstruction from a 3D point cloud,” IEEE Trans. Multimedia, vol. 24, pp. 1815–1829, 2022.
[14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 77–85.
[15] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5105–5114.
[16] Y. Wang et al., “Dynamic graph CNN for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
[17] H. Thomas et al., “KPConv: Flexible and deformable convolution for point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 6410–6419.
[18] Y. Wu et al., “Evolutionary multiform optimization with two-stage bidirectional knowledge transfer strategy for point cloud registration,” IEEE Trans. Evol. Comput., early access, Oct. 19, 2022, doi: 10.1109/TEVC.2022.3215743.
[19] Y. You et al., “PRIN/SPRIN: On extracting point-wise rotation invariant features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9489–9502, Dec. 2022.
[20] Y. Wu et al., “Evolutionary multitasking with solution space cutting for point cloud registration,” IEEE Trans. Emerg. Topics Comput. Intell., early access, Jul. 12, 2023, doi: 10.1109/TETCI.2023.3290009.
[21] W. Wu, Z. Qi, and L. Fuxin, “PointConv: Deep convolutional networks on 3D point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9613–9622.
[22] Y. Wu et al., “INENet: Inliers estimation network with similarity learning for partial overlapping registration,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1413–1426, Mar. 2023.
[23] X. Lin, K. Chen, and K. Jia, “Object point cloud classification via poly-convolutional architecture search,” in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 807–815.
[24] Y. Wu et al., “Commonality autoencoder: Learning common features for change detection from heterogeneous images,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 9, pp. 4257–4270, Sep. 2022, doi: 10.1109/TNNLS.2021.3056238.
[25] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[26] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[27] B. Wu et al., “Visual transformers: Token-based image representation and processing for computer vision,” 2020, arXiv:2006.03677.
[28] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 16239–16248.
[29] J. Shan, S. Zhou, Z. Fang, and Y. Cui, “PTT: Point-track-transformer module for 3D single object tracking in point clouds,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021, pp. 1310–1316.
[30] M.-H. Guo et al., “PCT: Point cloud transformer,” Comput. Vis. Media, vol. 7, no. 2, pp. 187–199, 2021.
[31] C. Zhou et al., “PTTR: Relational 3D point cloud object tracking with transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8521–8530.
[32] Y. Wu et al., “SACF-Net: Skip-attention based correspondence filtering network for point cloud registration,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 3585–3595, Aug. 2023.
[33] Y. Wu et al., “PANet: A point-attention based multi-scale feature fusion network for point cloud registration,” IEEE Trans. Instrum. Meas., vol. 72, 2023, Art. no. 2512913.
[34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[35] N. J. Mitra and A. Nguyen, “Estimating surface normals in noisy point cloud data,” in Proc. 19th Annu. Symp. Comput. Geometry, 2003, pp. 322–328.
[36] Q. Mérigot, M. Ovsjanikov, and L. J. Guibas, “Voronoi-based curvature and feature estimation from point clouds,” IEEE Trans. Vis. Comput. Graph., vol. 17, no. 6, pp. 743–756, Jun. 2011.
[37] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8887–8896.
[38] Q. Hu et al., “RandLA-Net: Efficient semantic segmentation of large-scale point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11105–11114.
[39] Y. Zhang, Q. Yang, and Y. Xu, “MS-GraphSim: Inferring point cloud quality via multiscale graph similarity,” in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 1230–1238.
[40] W.-X. Tao, G.-Y. Jiang, Z.-D. Jiang, and M. Yu, “Point cloud projection and multi-scale feature fusion network based blind quality assessment for colored point clouds,” in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 5266–5272.
[41] X.-F. Han, Y.-F. Jin, H.-X. Cheng, and G.-Q. Xiao, “Dual transformer for point cloud analysis,” IEEE Trans. Multimedia, early access, Aug. 11, 2022, doi: 10.1109/TMM.2022.3198318.
[42] I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer model for 3D object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2886–2897.
[43] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual MLP framework,” in Proc. Int. Conf. Learn. Representations, 2022.
[44] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10073–10082.
[45] S. Qiu, S. Anwar, and N. Barnes, “Geometric back-projection network for point cloud classification,” IEEE Trans. Multimedia, vol. 24, pp. 1943–1955, 2022.
[46] Q. Zhen et al., “CosFormer: Rethinking softmax in attention,” in Proc. Int. Conf. Learn. Representations, 2022.
[47] Z. Wu et al., “3D ShapeNets: A deep representation for volumetric shapes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
[48] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1588–1597.
[49] L. Yi et al., “A scalable active framework for region annotation in 3D shape collections,” ACM Trans. Graph., vol. 35, no. 6, pp. 1–12, 2016.
[50] I. Armeni et al., “3D semantic parsing of large-scale indoor spaces,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1534–1543.
[51] A. Dai et al., “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2432–2443.
[52] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[53] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in Proc. Int. Conf. Learn. Representations, 2016.
[54] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2017, arXiv:1711.05101.
[55] G. Qian et al., “PointNeXt: Revisiting PointNet++ with improved training and scaling strategies,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 23192–23204.
[56] H. Zhao, L. Jiang, C.-W. Fu, and J. Jia, “PointWeb: Enhancing local neighborhood features for point cloud processing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5560–5568.
[57] M. Xu, R. Ding, H. Zhao, and X. Qi, “PAConv: Position adaptive convolution with dynamic kernel assembling on point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3172–3181.
[58] C. Wang et al., “Learning discriminative features by covering local geometric space for point cloud analysis,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5703215.
[59] G. Te, W. Hu, A. Zheng, and Z. Guo, “RGCNN: Regularized graph CNN for point cloud segmentation,” in Proc. 26th ACM Int. Conf. Multimedia, 2018, pp. 746–754.
[60] H. Zhou et al., “Adaptive graph convolution for point cloud analysis,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 4945–4954.
[61] W. Jing et al., “AGNet: An attention-based graph network for point cloud classification and segmentation,” Remote Sens., vol. 14, no. 4, 2022, Art. no. 1036.
[62] T. Xiang, C. Zhang, Y. Song, J. Yu, and W. Cai, “Walk in the cloud: Learning curves for point clouds shape analysis,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 895–904.
[63] Y. Lyu, X. Huang, and Z. Zhang, “EllipsoidNet: Ellipsoid representation for point cloud classification and segmentation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2022, pp. 256–266.
[64] G. Wang, Q. Zhai, and H. Liu, “Cross self-attention network for 3D point cloud,” Knowl.-Based Syst., vol. 247, 2022, Art. no. 108769.
[65] C. Zhang, H. Wan, X. Shen, and Z. Wu, “PatchFormer: An efficient point transformer with patch attention,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11789–11798.
[66] Y. Ben-Shabat, M. Lindenbaum, and A. Fischer, “3DmFV: Three-dimensional point cloud classification in real-time using convolutional neural networks,” IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 3145–3152, Oct. 2018.
[67] Y. Li et al., “PointCNN: Convolution on X-transformed points,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 820–830.
[68] S. Qiu, S. Anwar, and N. Barnes, “Dense-resolution network for point cloud classification and segmentation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2021, pp. 3812–3821.
[69] J. Yang et al., “Modeling point clouds with self-attention and Gumbel subset sampling,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3318–3327.
[70] C. R. Qi, O. Litany, K. He, and L. J. Guibas, “Deep hough voting for 3D object detection in point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9276–9285.
Yue Wu (Member, IEEE) received the B.Eng. and Ph.D. degrees from Xidian University, Xi'an, China, in 2011 and 2016, respectively. Since 2016, he has been a Teacher with Xidian University, where he is currently an Associate Professor. He has authored or coauthored more than 100 papers in refereed journals and proceedings. His research interests include computational intelligence and its applications. He is the Secretary General of the Chinese Association for Artificial Intelligence-Youth Branch, was the Secretary General of CCF Xi'an during 2020–2022 and the Chair of CCF YOCSEF Xi'an during 2021–2022, and is a CCF/CAAI Senior Member. He is an Editorial Board Member for over six journals, including CAAI Transactions on Intelligence Technology, Frontiers of Computer Science, Remote Sensing, and Electronics.

Jiaming Liu received the B.S. degree in software engineering from the Jiangxi University of Finance and Economics, Nanchang, China, in 2021. He is currently working toward the master's degree with the School of Computer Science and Technology, Xidian University, Xi'an, China. His research interests include artificial intelligence, machine learning, and computer vision.

Zhixiao Liu received the B.S. degree in communication engineering from Nanchang University, Nanchang, China, in 2021. He is currently working toward the master's degree with the School of Electronic Information, Harbin Engineering University, Harbin, China. His research interests include artificial intelligence, deep learning, and computer vision.

Qiguang Miao (Senior Member, IEEE) received the M.Eng. and Doctor degrees in computer science from Xidian University, Xi'an, China. He is currently a Professor with the School of Computer Science and Technology, Xidian University. His research interests include intelligent image processing and multiscale geometric representations for images.

Wenping Ma (Senior Member, IEEE) received the B.S. degree in computer science and technology and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an, China, in 2003 and 2008, respectively. Since 2006, she has been with the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, Xidian University, where she is currently an Associate Professor. She has authored or coauthored more than 30 SCI papers in international academic journals, including IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, IEEE TRANSACTIONS ON IMAGE PROCESSING, Information Sciences, Pattern Recognition, Applied Soft Computing, Knowledge-Based Systems, Physica A: Statistical Mechanics and its Applications, and IEEE GEOSCIENCE AND REMOTE SENSING LETTERS. Her research interests include natural computing and intelligent image processing. Dr. Ma is a member of the Chinese Institute of Electronics and the China Computer Federation.