
Engineering with Computers
https://doi.org/10.1007/s00366-022-01648-z

ORIGINAL ARTICLE

SMA-Net: Deep learning-based identification and fitting of CAD models from point clouds

Sijie Hu¹ · Arnaud Polette¹ · Jean-Philippe Pernot¹

Received: 30 November 2021 / Accepted: 18 March 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
Identification and fitting are important tasks in reverse engineering and virtual/augmented reality. Compared to traditional approaches, carrying out such tasks with deep learning-based methods still leaves much room to be exploited. This paper presents SMA-Net (Spatial Merge Attention Network), a novel deep learning-based end-to-end bottom-up architecture, specifically focused on the fast identification and fitting of CAD models from point clouds. The network is composed of three parts whose strengths are clearly highlighted: a voxel-based multi-resolution feature extractor, a spatial merge attention mechanism and a multi-task head. It is trained with both virtually-generated point clouds and as-scanned ones created from multiple instances of CAD models, themselves obtained with randomly generated parameter values. Using this data generation pipeline, the proposed approach is validated on two different data sets that have been made publicly available: a robot data set for Industry 4.0 applications, and a furniture data set for virtual/augmented reality. Experiments show that this reconstruction strategy achieves compelling and accurate results at very high speed, and that it is very robust on real data obtained, for instance, with a laser scanner or a Kinect.

Keywords  Identification · Fitting · Deep learning · Transformer · Data generation · Virtual reality · Reverse engineering

* Jean-Philippe Pernot, [email protected] · Sijie Hu, [email protected] · Arnaud Polette, [email protected]
¹ LISPEN, Arts et Métiers Institute of Technology, 13617 Aix-en-Provence, France

1 Introduction

At a time when companies and individuals are increasingly immersed in the digital world, or even a full 3D world, solving the question of moving from the real world to the virtual one is becoming central. Thus, there is a clear need to develop fast and accurate reconstruction techniques that are also able to keep track of possible evolutions and changes. This is particularly true in the context of Industry 4.0, for which there is a large demand to model digital twins of products and systems, and to maintain the coherence between the real and virtual worlds over time [39]. With the development and spread of 3D acquisition technologies, reverse engineering (RE) has been extensively exploited in recent years to support various applications, for instance in engineering design, robotics, metrology or cultural heritage [6]. For most engineering applications, the main objective is to extract information from the acquired raw data to reconstruct parametric CAD models that best fit the current state of the object. Today, most existing RE techniques follow a time-consuming patch-by-patch reconstruction strategy, wherein users have to face cumbersome trimming and stitching issues. At the end, dead CAD models are obtained that cannot be modified later on. Moreover, in the absence of suitable approaches, very few RE techniques are able to deal directly and efficiently with the case of assemblies. Thus, the method generally consists in disassembling the various parts, which then serve as inputs of the RE process, and are most probably reassembled afterwards in the virtual world. This does not allow the study of assemblies which cannot be disassembled easily. As a consequence, these approaches do not fully address the requirements of Industry 4.0 in terms of fast reconstruction or update of virtual environments and complete systems, which is the focus of this paper.

With the rapid development of computer technologies and hardware, deep learning has received increasing attention due to its numerous applications in different areas, such as computer vision, simulation, natural language processing, and robotics. As a dominating branch of artificial intelligence, deep learning has been successfully used to solve complex problems that are sometimes difficult to solve with more traditional approaches. Specifically, being able to identify parts and assemblies, segment point clouds and retrieve CAD model parameter values from a point cloud are essential steps, which could benefit a lot from a deep learning-based approach. However, the use of deep learning to reverse engineer products or systems is a relatively poorly explored strategy, notably when it comes to the reconstruction of parametric CAD assemblies [6]. Although some 3D reconstruction methods based on deep learning have emerged in recent years [2, 3, 26, 27], these methods are all based on existing open datasets and are not specifically designed for reverse engineering, so their accuracy is far from meeting the requirements of industrial applications. The lack of proper datasets is certainly partially responsible for such low attention, as it is a bottleneck that makes this type of research very challenging. Indeed, most existing datasets are only designed for tackling part-level or instance-level 3D scene understanding, like 3D semantic segmentation and 3D object detection, or parameter-independent 3D reconstruction. Moreover, these datasets usually originate from the Internet and require a huge amount of manual annotations. The work presented in this paper not only aims to identify parts and assemblies from point clouds, but also to retrieve the parameter values of the corresponding CAD models.

To solve those issues and allow a more global approach to reverse engineering, this paper introduces a deep learning-based reconstruction technique. Figure 1 shows a reconstruction example following the proposed approach. It is based on a novel end-to-end bottom-up architecture, specially focused on fast identification and fitting of CAD models from 3D point clouds. If the point cloud embeds several semantic parts (e.g., chairs or tables), they are treated separately after a segmentation step (Fig. 1b). The core of the approach relies on the so-called SMA-Net (Spatial Merge Attention Network) composed of three main parts: a voxel-based multi-resolution feature extractor, a spatial merge attention mechanism and a multi-task head. It is trained with both virtually-generated point clouds and as-scanned ones created from multiple instances of several CAD templates, themselves obtained with randomly generated parameter values. Using this data generation pipeline, the proposed approach has been validated on two different datasets: a robot dataset for Industry 4.0 applications, and a furniture dataset for virtual/augmented reality. Once trained, SMA-Net can be used on a new point cloud to identify the corresponding CAD template (Fig. 1c) and estimate its parameter values (Fig. 1d). The CAD templates gather together prior knowledge on the parts to be reconstructed, including all the constraints defining the related B-Rep models. The geometric reconstructions are performed within a CAD software in charge of updating the CAD models using the outputs of SMA-Net. The adopted modeler also keeps track of the consistency of the generated models, including assembly constraints between parts if needed.

Experiments show that this reconstruction strategy achieves compelling and accurate results at very high speed, and that it is very robust on real datasets obtained for instance with a laser scanner or a Kinect. Thus, these results are of particular interest in the context of Industry 4.0, to maintain the coherence between real products/systems/environments and their digital twins, whose geometric models are known a priori but whose configurations are evolving.

The contribution is threefold: (i) a novel reverse engineering framework able to fit CAD models on point clouds segmented using a deep segmentation network; (ii) a virtual data generation pipeline able to generate datasets of CAD model instances together with their as-scanned point clouds and related parameter values; (iii) a light and fast bottom-up 3D classification and fitting framework, named SMA-Net, able to identify objects from point clouds and estimate the parameter values of the corresponding CAD templates in an end-to-end manner. For the first time, the datasets used to validate the proposed identification and fitting technique have been made publicly available.

Fig. 1  Fast reconstruction of a chair CAD model: a acquired chair point cloud and its environment, b segmented point cloud, c CAD template and its parameters, d fitted CAD model generated by a CAD modeler with parameter values resulting from SMA-Net

The paper is organized as follows. Section 2 reviews the works related to reverse engineering and deep learning. The overall framework is presented in Sect. 3 with the details of the three parts of SMA-Net. The experimental setup is detailed in Sect. 4, and the results are presented and discussed in Sect. 5. Section 6 ends this paper with conclusions and discussion.

2 Prior work

This section reviews previous reverse engineering and deep learning approaches for point cloud analysis. In reverse engineering, a particular attention is paid to template-based methods. Fayolle and Pasko proposed a system to reconstruct solid models from digital point clouds by means of a genetic evolutionary algorithm [18]. Robertson et al. [45] explored the extraction of parametric models of features from poor-quality 3D data using RANSAC [19], but their method can only estimate the parameters of simple primitives (e.g. cylinder, plane). As extensions of RANSAC, [31, 37, 48] try to recover multiple primitives from a point cloud. However, the performance of these RANSAC-based methods is highly dependent on the tuning of parameters based on prior knowledge of the different primitive shapes, and they are not able to reverse a full CAD model. These algorithms focus on the resolution of general reverse engineering problems; therefore, it is difficult for them to achieve accurate reconstructions of 3D targets like template-based methods do. Bey et al. [5] have introduced a template-based method to reconstruct CAD models from 3D point clouds, which is applied for civil engineering purposes. Erdös et al. have presented a CAD model matching method and have extended the type of features to tori and cuboids [16]. In these works, the authors also focus on basic shapes (e.g. cylinders and cuboids) and do not work on complete CAD models. Inspired by template-based reverse engineering approaches, Buonamici et al. [7] proposed a technique to fit a single CAD template on a point cloud using particle swarm optimization. Another fitting approach is proposed by Shah et al. [50], with a template-based CAD reconstruction process optimized part-by-part using simulated annealing. These template-based methods show good results, but suffer from several limitations. For instance, manual intervention is required when pre-aligning CAD templates onto the point clouds, and a lot of time is spent in the iterative resolution process. In recent years, some template-based point cloud fitting methods based on deep learning have emerged. Armen et al. [2] perform scan-to-CAD alignment by predicting the 9-DoF (degrees of freedom) pose of the CAD model. [3, 26, 27] use similar ways to align the CAD template with the 3D input. Then, to improve surface reconstruction accuracy, Vladislav et al. [30] proposed to deform retrieved CAD models after their alignment to scans. However, this mesh deformation-based method can only achieve rough approximations, and it results in changes in the geometric properties of the CAD model that is transformed into a mesh. Thus, it is still difficult to meet the industrial requirements in terms of parameter-driven CAD template reconstruction.

Many different ways have been proposed to process point cloud data using deep learning, such as point-wise CNNs [8, 29, 44, 64, 67], volume-wise CNNs [10–12, 24, 25, 53], multi-view CNNs [9, 35, 43, 52] and special CNNs, namely kernel-based parametric CNNs [54, 59, 62] and graph reasoning [36, 51, 57, 58]. Considering point-wise CNNs, PointNet [8] is a pioneer in this direction. However, PointNet does not capture local structures, limiting both its ability to recognize fine-grained patterns and its generalizability to complex scenes. There have been some attempts to strengthen the exploitation of local features so as to learn deep point set features more efficiently and robustly [44, 64, 67]. Instead of directly processing the irregular points, voxel-based methods project the point cloud on regular 3D grids in Euclidean space. Graham et al. [24, 25] handle voxel-based data with sparse tensors and sparse convolutions, and achieved good results for the segmentation of indoor scenes. Christopher et al. [11] introduced 4D Spatio-Temporal ConvNets that further improve the efficiency of sparse convolution through a more complex network structure. Multi-view CNNs represent the 3D point cloud with images rendered from different views and analyze the information with a 2D CNN. Recently, Chen et al. [9] proposed to encode the sparse 3D point cloud with a compact multi-view representation for the task of object detection. Abhijit et al. [35] choose multiple different virtual views from a 3D input to generate representative 2D channels for training a 2D model for the task of indoor 3D semantic segmentation. Moreover, some works design convolutional kernels in special ways to improve the parsing ability of convolutional operations. Xu et al. [65] introduced SpiderCNN, whose kernel is defined by a Taylor polynomial function with weighted neighbors, so as to extend convolutional operations from regular grids to irregular point sets. Thomas et al. [54] introduced KPConv, a new point convolution kernel which optimizes the positions of its weights dynamically. Besides, Li et al. [36] explored the improvement of 3D scene understanding performance in a non-Euclidean space.

Deep learning approaches often rely on huge datasets. ScanNet [14] and S3DIS [1] are richly annotated datasets of 3D reconstructed meshes of indoor scenes specifically designed for scene understanding. For the task of part-level understanding, ShapeNetPart [66] and PartNet [40] introduce large-scale datasets with instance-level 3D object annotations. As an outdoor scene dataset, KITTI [20] is widely used for optical flow estimation, 3D object detection and 3D tracking. However, such datasets are not yet fully available when considering point clouds for parameter value estimation. Moreover, the aforementioned deep learning methods are often used as backbones for 3D classification, 3D object detection and 3D scene segmentation tasks.

There exist few attempts to study their application in reverse engineering, especially to reconstruct parametric CAD models [34, 60]. To circumvent those limitations, this paper introduces SMA-Net, a multi-task convolutional neural network designed to perform both the classification and the parameter estimation in one forward propagation during the fitting process. Furthermore, a virtual data generation pipeline is proposed to allow the creation of large datasets ready for training, and to validate the approach.

3 RE framework and SMA-Net architecture

This section introduces the novel RE framework together with the architecture and technicalities of SMA-Net, the Spatial Merge Attention Network at the core of the proposed approach.

3.1 Reverse engineering framework

The RE framework is presented on Fig. 2. It starts with the acquisition of a raw point cloud, which then undergoes a segmentation and clustering step; this can be done by existing well-designed semantic segmentation networks [63] and clustering algorithms [47]. In our implementation, SSCNs [23] pretrained on the ScanNet dataset and DBSCAN [17] are used for segmentation and clustering respectively. The output point clouds are then analyzed by SMA-Net, which has been pre-trained upstream. Thanks to an elegant deep neural network and an end-to-end learning, SMA-Net combines object identification and parameter estimation into one module. It outputs the classes and estimated parameter values of the parts and assemblies identified in the point cloud. This information is then used as input of the reconstruction module, in charge of generating all the CAD model instances using pre-defined templates. CAD templates include built-in constraints (e.g. symmetries, joints between parts) to be satisfied by the modeler during the reconstruction step. The whole framework is an end-to-end architecture without any loop or iteration, which makes it very fast and easy to control. Here, CAD models do not need to be updated at multiple iteration steps, and they are generated only once after the fitting.

Fig. 2  Proposed RE framework exploiting the pre-trained SMA-Net for fast identification and fitting of CAD models from point clouds

3.2 Overview of the identification and fitting framework

To obtain the right class and precise parameter values, three problems have been tackled. The first is to extract features from the 3D object space in an efficient way, the second is to efficiently integrate the information from a large amount of extracted features, and the third is to build a robust loss function to guide the learning process. SMA-Net has an elegant network architecture solving these three problems. It consists of three main components depicted on Fig. 3 (a high-level sketch of the resulting forward pass is given after this list):

• Voxel-based multi-resolution feature extractor, to efficiently extract features from different spatial resolutions (Fig. 3a). The input is a set of N_p points of 3D coordinates p_i = (x_i, y_i, z_i) ∈ ℝ³, with i ∈ [1…N_p]. Firstly, the complete point cloud is voxelized at a voxel size of V_s (s refers to the size of the voxel) and then fed into a 3D CNN for feature extraction. The output is a set of feature maps generated at different resolutions and denoted as F_j, with j ∈ [1…N_l], j being the index of the extracted feature maps at different spatial resolutions.

• Spatial merge attention mechanism, for robustly integrating features of different resolutions (Fig. 3b). Each extracted feature map F_j is forwarded to the spatial attention (SA) mechanism, which emphasizes spatially meaningful features of F_j along the spatial axis. It generates a SA score M_j^s for each spatial resolution layer. Then, spatially recalibrated features F_j^s are computed by voxel-wise multiplications between M_j^s and F_j. A merge function follows, to squeeze and extract the important features from F_j^s, which are then fused to form the global feature vector V. During this step, the information related to the size of the raw point cloud is also incorporated. Specifically, the merge function is composed of a set of transformer blocks and a channel attention block combined to produce the final feature vector.

• Multi-task head, for multi-parameter learning (Fig. 3c). After obtaining the final feature vector, this highly characterized information is input to the multi-head part, which is composed of a classifier and several parameter regressors. The output of the classifier is used to retrieve the CAD template model from the database, and the parameter regressors are used for estimating the parameter values precisely.
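To make the data flow across these three components concrete, the following minimal PyTorch sketch mirrors the forward pass at a very high level. It is an illustration under simplifying assumptions, not the authors' implementation: dense 3D convolutions stand in for the sparse ones used in the paper, the spatial merge attention is reduced to pooling plus an MLP, and all names and dimensions (e.g. SMANetSkeleton, params_per_group) are hypothetical.

```python
# Minimal sketch of the three-stage SMA-Net forward pass (hedged stand-in).
import torch
import torch.nn as nn

class SMANetSkeleton(nn.Module):
    def __init__(self, num_levels=6, num_classes=10, params_per_group=(13, 13)):
        super().__init__()
        # (a) voxel-based multi-resolution feature extractor (U-Net-like).
        self.extractor = nn.ModuleList(
            [nn.Conv3d(1 if j == 0 else 8, 8, 3, stride=2, padding=1)
             for j in range(num_levels)])
        # (b) spatial merge attention is reduced here to pooling + MLP.
        self.merge = nn.Sequential(nn.Linear(8 * num_levels, 128), nn.ReLU())
        # (c) multi-task head: one classifier and one regressor per group.
        self.classifier = nn.Linear(128, num_classes)
        self.regressors = nn.ModuleList(
            [nn.Linear(128, n) for n in params_per_group])

    def forward(self, voxels):                    # voxels: (B, 1, L, W, H)
        feats, x = [], voxels
        for conv in self.extractor:               # collect F_j at every level
            x = torch.relu(conv(x))
            feats.append(x.amax(dim=(2, 3, 4)))   # squeeze each F_j to a vector
        v = self.merge(torch.cat(feats, dim=1))   # global feature vector V
        scores = self.classifier(v)               # class scores
        params = [reg(v) for reg in self.regressors]  # per-group parameters
        return scores, params
```

At inference time, the arg-max of the class scores selects the CAD template, and only the regressor of the group that class belongs to is read out, as detailed in Sect. 3.5.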

Fig. 3  Spatial Merge Attention Network (SMA-Net) highlighting three main components: a feature extractor extracting per-volume features F_j from different resolutions, b spatial merge attention mechanism aggregating multi-resolution features into a vector, c multi-task head predicting the class and its parameters

The outputs of SMA-Net then serve as inputs of the CAD reconstruction step, which extracts the identified CAD template and instantiates it with the estimated parameter values to get the final CAD model fitting the point cloud (Fig. 2).

3.3 Voxel-based multi-resolution feature extractor

Generally, any point feature extraction network can serve as the feature extractor. On the other hand, grid-based methods have proven to be computationally efficient and particularly suitable for image feature extraction due to their controllable receptive fields. Thanks to the emergence of sparse representation and sparse convolution technology [11, 25], these attributes have been well extended to the 3D field. Hence, the first component of SMA-Net exploits a voxel-based CNN for efficient feature encoding (Fig. 3a). In particular, the U-Net [46] architecture has been widely used in the 2D and 3D domains, especially for semantic segmentation, because of its efficient encoder-decoder structure and hierarchical expression. Thus, SMA-Net follows a U-Net-like architecture with sparse representation and sparse convolution technology. Unlike previous methods [11, 23], which only use the information of the last layer of the U-Net to derive the final output, SMA-Net exploits different feature extraction layers containing different structure size information. Indeed, even if their final output benefits from previous layers through skip-connections, these methods unavoidably miss useful information present at lower resolutions, which undoubtedly reduces the quality of the results. Actually, in a deep neural network, the receptive field increases with the depth of the network, and the deeper the network, the more abstract the extracted features. In other words, the shallow layers contain rich local structure size information, while the deep layers contain more overall structure size information. For the parameter estimation task, the extracted features must represent the detail information of the target, but they must also be able to express more abstract overall structural information.

To this end, the input point set is voxelized into volumes, then a series of sparse convolution operations with skip connections is performed to extract features from different spatial resolutions (Fig. 3a). Thus, the geometric and contextual features of different spatial resolutions are extracted. The output is a set of N_l feature maps generated at different spatial resolutions and denoted as F_j ∈ ℝ^(C_j×L_j×W_j×H_j), with j ∈ [1…N_l], j the index of features at different spatial resolutions, and C_j the number of feature channels at spatial resolution L_j × W_j × H_j.
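As a small illustration of the voxelization step, the sketch below quantizes a raw point set into an occupancy grid at voxel size V_s. A dense tensor is used for simplicity, whereas the paper relies on sparse tensors (MinkowskiEngine); the function name and dimensions are illustrative.

```python
# Hedged sketch of point cloud voxelization (dense stand-in for sparse tensors).
import torch

def voxelize(points: torch.Tensor, voxel_size: float) -> torch.Tensor:
    """points: (Np, 3) float tensor -> (1, 1, L, W, H) occupancy grid."""
    origin = points.min(dim=0).values
    idx = ((points - origin) / voxel_size).floor().long()   # voxel indices
    dims = idx.max(dim=0).values + 1                        # grid extent
    grid = torch.zeros(1, 1, *dims.tolist())
    grid[0, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0       # mark occupied cells
    return grid

# Example: 10,000 random points quantized at voxel size 10 (the paper's setting).
cloud = torch.rand(10_000, 3) * 500.0
occupancy = voxelize(cloud, voxel_size=10.0)
```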

3.4 Spatial merge attention mechanism

The attention mechanism has achieved great results in many visual perception tasks, such as image caption generation, multi-object detection and semantic segmentation. More recently, the attention mechanism represented by the transformer has been widely used in the image field with very impressive results [15, 33, 38]. Thus, this part of SMA-Net consists of two submodules detailed in the next subsections (Fig. 3b): (1) a voxel-based multi-resolution spatial attention (SA), and (2) a data merge (DM) consisting of two transformer blocks and a channel attention block. They have been specifically designed to enhance the representational power of the network. The output of the spatial merge attention mechanism is the feature vector V used as input of the multi-head part (Fig. 3c).

3.4.1 Multi-resolution spatial attention

To further recalibrate spatial information, the SA submodule learns a set of spatial probability maps to weight the conv-layer feature maps of the CNN. Specifically, given a feature map F_j encoded by the hierarchical U-Net structure, SA sequentially infers a 3D spatial attention map M_j^s ∈ ℝ^(1×L_j×W_j×H_j) (Fig. 4a). Next, the recalibrated feature map F_j^s ∈ ℝ^(C_j×L_j×W_j×H_j) can be computed as follows:

  M_j^s = Φ(F_j), with j ∈ [1…N_l],
  F_j^s = M_j^s ⊗ F_j,    (1)

where Φ(·) is the spatial attention function and ⊗ the element-wise multiplication. In SMA-Net, this function first aggregates the spatial information of the different feature maps by means of max-pooling and average-pooling operations. The outputs of those operations are then forwarded to a light sparse convolutional neural network. The process can be summarized as:

  Φ(F_j) = σ[SConvs(A_j)],
  A_j = ReLU(IN(SConvs(Cat(F_j^max, F_j^avg)))),    (2)

where σ denotes the sigmoid function, SConvs denotes sparse convolution, ReLU and IN refer to the Rectified Linear Unit activation function [22] and instance normalization [55] respectively, and Cat is the concatenation operation.

Fig. 4  Architecture of the attention mechanisms: a spatial attention submodule taking advantage of max-pooling and average-pooling operations forwarded to a 3D sparse CNN, b channel attention submodule combining outputs of max-pooling and average-pooling using a shared MLP network
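A dense PyTorch rendition of Eq. (1)–(2) could look as follows. It is a sketch assuming dense Conv3d layers in place of the sparse convolutions SConvs, with an illustrative hidden width.

```python
# Hedged sketch of the spatial attention submodule of Eq. (1)-(2).
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    def __init__(self, hidden=8, kernel=3):
        super().__init__()
        self.conv1 = nn.Conv3d(2, hidden, kernel, padding=kernel // 2)
        self.norm = nn.InstanceNorm3d(hidden)
        self.conv2 = nn.Conv3d(hidden, 1, kernel, padding=kernel // 2)

    def forward(self, f):                       # f: (B, C_j, L, W, H)
        f_max = f.amax(dim=1, keepdim=True)     # F_j^max: (B, 1, L, W, H)
        f_avg = f.mean(dim=1, keepdim=True)     # F_j^avg: (B, 1, L, W, H)
        a = torch.relu(self.norm(self.conv1(torch.cat([f_max, f_avg], dim=1))))
        m = torch.sigmoid(self.conv2(a))        # M_j^s: (B, 1, L, W, H)
        return m * f                            # F_j^s = M_j^s (x) F_j
```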

3.4.2 Data merge

The data merge consists of two transformer blocks and a channel attention block. Figure 5a illustrates the structure of a transformer block, which consists of layer normalization (LN) [4], multi-head self-attention (MSA) [15, 56, 61] and a feed-forward network (FFN) that can be replaced by a MLP (Sect. 4.3.1). Each transformer block inputs a sequence and outputs another sequence after a forward propagation [56], in which each sequence is composed of several tokens, each token being represented by a vector of dimension ℝ^(C_t). Formally, the output Z^l of a transformer block can be formulated as follows:

  Z^l = FFN(LN(Ẑ^l)) + Ẑ^l,
  Ẑ^l = MSA(LN(Z^(l−1))) + Z^(l−1),    (3)

where Z^(l−1) is the output of the previous block. Furthermore, MSA stands for multi-head self-attention, which is the core of the transformer block and can be formulated as follows (Fig. 5b):

  MSA(Z) = Cat(A_0, A_1, …, A_(h−1)) M^c,
  A_i = Softmax(Q_i K_i^T / √(d_i^k)) V_i,    (4)
  Q_i = Z M_i^q, K_i = Z M_i^k and V_i = Z M_i^v,

where i ∈ [0, 1, …, h−1] is the head index, h is the number of heads, d_i^k is the dimension of K_i, and M_i^q ∈ ℝ^(C_t×C_t/h), M_i^k ∈ ℝ^(C_t×C_t/h), M_i^v ∈ ℝ^(C_t×C_t/h) and M^c ∈ ℝ^(C_t×C_t) are the weight matrices of linear projection functions.

Fig. 5  Overview of the transformer block: a architecture of the transformer block, b multi-head attention in the transformer block

The self-attention operation in the transformer block makes all tokens interact with each other, which in theory can yield a global receptive field. Thus, in the field of 2D images, the input image is usually divided into a number of fixed-size patches, each patch being considered as a token, thereby converting the 2D image into a sequence [13, 15, 61].

Considering the identification and fitting tasks, a simple way to generate a token is to transform a feature map F_j into a token T_j. However, the number of tokens then depends on the depth of the U-Net pyramid structure, which is very limited. This inevitably causes a design contradiction: obtaining more tokens requires increasing the depth of the U-Net. Moreover, although this approach can fuse features from different resolutions, considering the whole feature map of a resolution as a single token loses valuable details. To overcome this problem, a simple channel reshape strategy is used. Unlike [15, 61], tokens are not generated in the spatial domain. Instead, tokens are produced along the channel axis: L tokens are generated, and each token is represented by a C_t-dimensional vector.

In practice, the data merge module takes the output of the multi-resolution spatial attention F_j^s as input. It is first squeezed into a vector along the spatial axis, so the length of the vector is the number of feature channels, V_j^s ∈ ℝ^(C_j). Then, all V_j^s, with j ∈ [1…N_l], are concatenated into one vector V_s. This process can be summarized as follows (Fig. 3):

  V_s = Cat(V_1^s, V_2^s, …, V_(N_l)^s),
  V_j^s = MaxPool(F_j^s), with j ∈ [1…N_l].    (5)

Moreover, the original size information, such as the size d of the bounding box diagonal, and the maximum (x_max, y_max, z_max) and minimum (x_min, y_min, z_min) of the raw coordinate values P of the point cloud, provides vital cues for 3D scene perception and understanding. Thus, the size information G_r of the original point cloud is also provided (Fig. 3):

  G_r = Cat(X_max, X_min, d) ∈ ℝ^7,
  with X_max = (x_max, y_max, z_max) = Max(P, axis = 1),
       X_min = (x_min, y_min, z_min) = Min(P, axis = 1),
       d = ‖X_max − X_min‖.    (6)

Then, V_s and G_r are further projected into a high-dimensional space and combined through a concatenate operation to obtain V_c:

  V_c = Cat(V'_s, G'_r),
  V'_s = V_s M_s and G'_r = G_r M_r,    (7)

where M_s ∈ ℝ^(len(V_s)×(L−1)·C_t) and M_r ∈ ℝ^(7×C_t) are weight matrices. After that, V_c is divided into L groups, resulting in L tokens:

  Tokens = Reshape(V_c) ∈ ℝ^(L·C_t),    (8)

where L·C_t is equal to the length of V_c. In this way, tokens are defined more efficiently. More explanations are unfolded in the ablation study. Next, the transformer blocks take the generated tokens as input to further aggregate the information and learn a global context. Then, all tokens are recompressed into a vector V_t.
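The following sketch ties Eq. (3)–(8) together: each recalibrated map F_j^s is max-pooled to a channel vector V_j^s (Eq. 5), the size cues G_r (Eq. 6) are projected and concatenated (Eq. 7), the result is reshaped into L tokens of dimension C_t along the channel axis (Eq. 8), and the tokens pass through pre-norm transformer blocks built here from torch.nn.MultiheadAttention as a stand-in for Eq. (3)–(4). The input dimension of proj_s and other sizes are illustrative, not the paper's exact values.

```python
# Hedged sketch of the data merge: channel-reshape tokens + transformer blocks.
import torch
import torch.nn as nn

L_TOKENS, C_T, HEADS = 32, 64, 16        # L, C_t and h as set in Sect. 4.3.1

class TransformerBlock(nn.Module):
    def __init__(self, dim=C_T, heads=HEADS):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z):                 # z: (B, L, C_t)
        h = self.ln1(z)
        z = z + self.msa(h, h, h)[0]      # Z_hat = MSA(LN(Z)) + Z   (Eq. 3)
        return z + self.ffn(self.ln2(z))  # Z^l = FFN(LN(Z_hat)) + Z_hat

def make_tokens(v_s, points, proj_s, proj_r):
    """v_s: concatenated V_j^s, (B, D); points: (B, Np, 3)."""
    x_max, x_min = points.amax(dim=1), points.amin(dim=1)
    d = (x_max - x_min).norm(dim=1, keepdim=True)
    g_r = torch.cat([x_max, x_min, d], dim=1)            # G_r in R^7  (Eq. 6)
    v_c = torch.cat([proj_s(v_s), proj_r(g_r)], dim=1)   # Eq. (7)
    return v_c.reshape(-1, L_TOKENS, C_T)                # Eq. (8)

proj_s = nn.Linear(120, (L_TOKENS - 1) * C_T)   # M_s (input width assumed)
proj_r = nn.Linear(7, C_T)                      # M_r
tokens = make_tokens(torch.rand(2, 120), torch.rand(2, 1000, 3), proj_s, proj_r)
out = TransformerBlock()(TransformerBlock()(tokens))     # two blocks, as in 3.4.2
```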

Finally, given the merged feature vector V_t, a channel attention network is applied to refine the features and produce the final feature vector used by the multi-task head (Fig. 4b):

  V = A_c ⊗ V_t,
  A_c = Softmax(V_t M_1 M_2),    (9)

where M_1 ∈ ℝ^((L·C_t)×C_h) and M_2 ∈ ℝ^(C_h×(L·C_t)) are weight matrices.

3.5 Multi-task head

Multi-group joint parameter regression can be considered as a multi-task learning problem, which is intrinsically a multi-objective optimization problem because different tasks may conflict with each other [49]. It is very intuitive that different classes which have similar shapes or sizes can contribute to each other when trained jointly, because they often share common features. During the training phase, these features can boost each other to achieve higher prediction results [68].

Thus, the last part of SMA-Net is a multi-group head network designed so that groups of similar shapes or sizes benefit from each other, while groups of different shapes or sizes stop interfering with each other. Let us consider N_c classes distributed in N_g groups. For instance, the furniture dataset, as introduced in the results section, is composed of 10 classes (4 chairs and 6 tables) distributed in 2 groups (the group of the chairs and the one of the tables). Each class c ∈ [1…N_c] belongs to a group g ∈ [1…N_g] and is characterized by N_p^c parameters p_i^c, with i ∈ [1…N_p^c]. To be able to identify the class of a given object in a point cloud, and also find out its parameter values, the multi-task head module of SMA-Net is composed of (Fig. 3c):

• One classification head, in which a multilayer perceptron (MLP) is applied to produce N_c classification scores {s_1^c, s_2^c, …, s_(N_c)^c} ∈ ℝ^(N_c) that are further regularized by a cross entropy loss function L_cls. This part of SMA-Net allows identifying the class of an object in a point cloud, i.e. the class with the maximum score.

• Several parameter regressors, one for each of the N_g groups of classes. Each regressor is able to estimate all the parameter values of all the classes belonging to the corresponding group. Thus, depending on the class of the object identified by the classification head, only the subset of parameter values corresponding to that class is to be considered, and all the other parameter values are disregarded. The regressions are also realized by a MLP, followed by a parameter residual loss guiding the parameter estimation learning. For a parameter p_i^c, the absolute deviation δp_i^c and the relative deviation Δp_i^c between the estimated value p̃_i^c and the ground truth value p̄_i^c can be computed as follows:

  Δp_i^c = δp_i^c / p̄_i^c = (p̃_i^c − p̄_i^c) / p̄_i^c, ∀i ∈ [1…N_p^c].    (10)

The deviations are collected within the parameter residual loss function of the regressor related to the g-th group, enabling SMA-Net to reduce relative errors:

  L_reg^g = Σ_(c=1)^(N_c) δ_(c,g) · Σ_(i=1)^(N_p^c) S(Δp_i^c),
  with S(Δp_i^c) = 0.5·(Δp_i^c)²  if |Δp_i^c| < 1,
       S(Δp_i^c) = |Δp_i^c| − 0.5  otherwise,    (11)

where δ_(c,g) is equal to 1 if c is a class of group g and 0 otherwise, and S(·) denotes the standard SmoothL1 function. It makes the gradients of the model smoother, thereby making the model more stable and easier to converge during the training process.

To improve the overall performance of classification and parameter estimation, SMA-Net is trained to minimize a joint loss function:

  L = L_cls + Σ_(g=1)^(N_g) λ_g · L_reg^g,    (12)

with λ_g the hyper-parameters controlling the balance between the different loss functions. Finally, even though SMA-Net is a unique identification and fitting framework, Euclidean distances between the point cloud and the CAD model are not directly minimized during the learning, and the fitting is carried out at the parameter level. Those distances are however used at the end to check the quality of the fitting.
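A minimal sketch of the joint loss of Eq. (10)–(12) is given below: cross entropy on the class scores plus, for each group, a SmoothL1 penalty on the relative deviations, masked so that only the group of the ground-truth class contributes. Tensor shapes and the group lookup are illustrative assumptions.

```python
# Hedged sketch of the joint classification + parameter residual loss.
import torch
import torch.nn.functional as F

def sma_net_loss(scores, params_per_group, gt_params_per_group,
                 labels, group_of_class, lambdas):
    """scores: (B, Nc); params/gt per group: lists of (B, Np_g) tensors;
    labels: (B,) long; group_of_class: (Nc,) long; lambdas: per-group weights."""
    loss = F.cross_entropy(scores, labels)                 # L_cls
    sample_group = group_of_class[labels]                  # delta_{c,g} mask
    for g, (est, gt) in enumerate(zip(params_per_group, gt_params_per_group)):
        rel = (est - gt) / gt                              # Delta p_i^c (Eq. 10)
        per_sample = F.smooth_l1_loss(rel, torch.zeros_like(rel),
                                      reduction="none").sum(dim=1)  # S(.) (Eq. 11)
        mask = (sample_group == g).float()                 # keep matching group
        loss = loss + lambdas[g] * (per_sample * mask).sum()
    return loss                                            # L (Eq. 12)
```

Note that F.smooth_l1_loss with its default threshold of 1 reproduces exactly the piecewise definition of S(·) in Eq. (11).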

4 Experimental setup

This section introduces the virtual data generation pipeline designed to set up the databases and validate SMA-Net on both the identification and estimation tasks. It also details the technicalities and settings of the different components of SMA-Net, including the implementation details on how two known networks have been modified to allow a comparison with SMA-Net.

4.1 Virtual data generation pipeline

To evaluate and test SMA-Net on both the identification and parameter estimation tasks, annotated 3D datasets are to be defined. Montlahuc et al. [41] have introduced an as-scanned point cloud generation strategy able to produce point clouds of CAD assembly models similar to the ones that would have been obtained with a real scanner, i.e. including artifacts resulting from real acquisitions (e.g. noise, heterogeneous density, incompleteness). However, their method produces point clouds without associating both the classes of the parts (e.g., tables and chairs) and the numerical parameter values of the CAD models, which are to be used for training a network. Here, the idea is to propose a virtual data generation pipeline able to generate as-scanned point clouds labeled for learning on both the identification and parameter estimation tasks. The complete pipeline is presented in Fig. 6. For a given CAD template, it allows the creation of multiple CAD model instances as well as multiple as-scanned point clouds. The process starts by randomly generating the parameter values used to generate the instances with a CAD modeler, i.e. the open-source parametric modeler FreeCAD in this work. The output of the CAD modeler is input to the point cloud virtual generation algorithm. If the virtual scan flag is set, a HPR-based algorithm is used for generating as-scanned point clouds [32]. At the beginning, the origin and view direction of each virtual camera are randomly defined. After that, a post-processing step is performed to insert noise and to sub-sample the point clouds. At the end of each loop, the information is collected and stored: generated CAD model instance and mesh, as-scanned point cloud, label and parameter values.

Fig. 6  Virtual data generation pipeline that uses a labelled CAD template to generate many point cloud instances with their parameter values
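The virtual scanning step can be sketched as follows, using Open3D's hidden point removal as a stand-in for the HPR operator referenced as [32]: only the points visible from a randomly placed virtual camera are kept, then Gaussian noise and sub-sampling degrade the cloud. The FreeCAD parameter-update side of the pipeline is not shown, and camera placement values are illustrative.

```python
# Hedged sketch of the as-scanned point cloud generation (HPR + degradation).
import numpy as np
import open3d as o3d

def virtual_scan(mesh_path, n_points=50_000, noise_std=1.0, keep=10_000):
    mesh = o3d.io.read_triangle_mesh(mesh_path)
    pcd = mesh.sample_points_uniformly(number_of_points=n_points)
    # Random virtual camera on a sphere around the object.
    center = pcd.get_center()
    direction = np.random.randn(3)
    direction /= np.linalg.norm(direction)
    diameter = np.linalg.norm(pcd.get_max_bound() - pcd.get_min_bound())
    camera = center + direction * diameter * 2.0
    # Keep only the points visible from the camera (HPR), then degrade.
    _, visible_idx = pcd.hidden_point_removal(camera, radius=diameter * 100.0)
    scan = np.asarray(pcd.select_by_index(visible_idx).points)
    scan += np.random.normal(scale=noise_std, size=scan.shape)  # sensor noise
    chosen = np.random.choice(len(scan), min(keep, len(scan)), replace=False)
    return scan[chosen]                                         # sub-sampled scan
```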

4.2 Robot and furniture datasets

Two datasets have been designed to validate SMA-Net: a robot dataset and a furniture dataset. Generally, from the perspective of machine learning, angle-type parameters are more difficult to understand than general distance-type parameters. Thus, the considered robot template only includes angle-type parameters. It is used to verify the angle perception ability of SMA-Net, and to study its application to assembly fitting. The furniture dataset contains common pieces of furniture, i.e. chairs and tables, and the CAD templates include both angle-type and distance-type parameters. This second database is used to test the general performance of the proposed approach. Moreover, due to the presence of occlusion and noise in real scenes, as-scanned point clouds are generated for both datasets. Thus, it is possible to evaluate SMA-Net on point clouds obtained in real-world conditions, and not only on virtually-generated data. Importantly, SMA-Net has been trained and tested on the two datasets separately, and the datasets (labeled model instances, point clouds and parameter values) are made publicly available at the following URL: https://doi.org/10.5281/zenodo.5031617.

Robot dataset. This dataset is composed of 5000 point clouds generated from a robot template (Fig. 7a). The pose of the robot is controlled by 4 angles A_1–A_4, whose values have been randomly changed in the range [0, 90]. For each pose, an instance of the CAD template is generated, together with its mesh and as-scanned point cloud. Figures 7b_1 and b_2 show the meshes corresponding to two different poses, 7c_1 and c_2 the virtually-generated point clouds, and 7d_1 and d_2 the as-scanned point clouds of those poses. The data is then split into a training sub-dataset of 4000 samples and a testing sub-dataset of 1000 samples. In addition, to evaluate the performance of the model during training, the training data is further divided into 3200 samples for training and 800 for validation. In this database, there is a single class (N_c = 1) belonging to a unique group (N_g = 1), and therefore a single regressor is used in the multi-task head. Here, the validation consists in testing the performance of SMA-Net in estimating the values of the 4 angles for an unknown pose represented by its possibly incomplete point cloud.

Fig. 7  Generation of as-scanned point clouds for the robot dataset: a robot template and its 4 control angles, b_i updated meshes of two different poses, c_i virtually-generated point clouds for the two poses, d_i as-scanned counterparts for the two poses

Furniture dataset. This dataset contains two groups (N_g = 2) for a total of 10 classes (N_c = 10). Thus, SMA-Net is here composed of two regressors in its multi-task head, one for each group. Each regressor is in charge of estimating the parameter values of all the classes of the corresponding group. The first group gathers together 4 types of chairs, and the second 6 types of tables. Figure 8 shows the elements of the database, i.e. all the CAD templates of the considered classes, and examples of point clouds produced with different parameters. The 10 CAD templates are parameterized by more than 50 control parameters. Figure 9 shows the parameterization of a chair and a table, for a total of 13 parameters. The final dataset consists of 20,000 point clouds, 2000 for each of the 10 classes. The point clouds are divided into 17,000 training and 3000 testing samples. To ensure the fairness of the test, 300 samples are uniformly and randomly selected from each class for testing, and the rest are kept for training. For the experimental studies, the original training set is split into 15,000 training samples and 2000 validation samples. In the same way, 200 samples are uniformly and randomly selected from each class for validation and the rest for training. Here, the validation consists in testing the performance of SMA-Net in identifying the class of an object and estimating its parameter values from a raw point cloud.

Fig. 8  Generation of as-scanned point clouds for the furniture dataset. The first four rows are the chairs (Chair_1 to 4), and the next six rows are the tables (Table_1 to 6). The first column shows the CAD templates, the second and third the updated meshes, the fourth and fifth the virtually-generated point clouds, and the last the as-scanned ones

Fig. 9  Examples of CAD templates and related parameters: a Chair_3 template and its parameters, b_1 Table_1 template and its parameters (b_2, b_3)

4.3 Technicalities and settings

This section details the way SMA-Net has been implemented and parameterized. It also explains how two classic networks have been tuned for comparison with SMA-Net.

4.3.1 SMA-Net

The implementation of SMA-Net is based on PyTorch [42] and MinkowskiEngine [11] for the sparse convolutions and other essential layers. The voxel size is set to 10 to balance the computational cost and the accuracy. Higher performance could be achieved by setting a smaller voxel size, but it would bring a larger footprint than expected, especially in low-latency systems. Then, the three components of SMA-Net are customized as follows:

• For the multi-resolution feature extractor, the depth of the U-Net hierarchy is set to 6 (N_l = 6). Deeper layers help extracting higher-level features, and consequently more information after combination. To balance performance and computational complexity, a slim layer design strategy has been adopted. Thus, the output channels at each level of the U-Net are set to m, 2m, 3m, 4m, 5m, 6m respectively, where m is empirically set to 24 to have a good balance between the performance of the model and the memory consumption.

• For each spatial attention module, 2 sparse convolutional layers are used with a kernel size equal to 3. Normally, convolutions with a kernel size of 3 are more efficient on GPU than other kernel sizes.

• For the data merge, L and C_t are set to 32 and 64, respectively. The number of heads in MSA is set to 16 (h = 16), while C_h is set to (L·C_t)/16 in the current implementation. Moreover, the FFN has been implemented by a two-layer MLP, where the dimension of the features in the hidden layer is equal to 4 times the dimension of the input features.

• For the multi-task head, one head is used for classification, and the others regress the parameters of the different groups. MLPs are used in both cases. Specifically, the MLP in the classifier has 2 hidden layers. The regressors have more parameters to predict and more abstract prediction goals, which is more difficult; thus, 4 hidden layers are deployed in the regressors. Empirically, one can observe that batch normalization largely hinders the regressors' learning ability, so the batch normalization of the last two layers has been removed.

During the training, data augmentation is applied on-the-fly by flipping with 50% probability, randomly rotating the objects about the up-axis, and jittering the data with Gaussian noise. This is needed to be able to identify objects in any configuration when considering real-world examples. To strengthen the identification and estimation capacities of SMA-Net in various conditions, virtually-generated point clouds are also merged with as-scanned ones. Specifically, during the training, 40% of the virtually-generated data are randomly selected and replaced by as-scanned data. For testing, only as-scanned point clouds are used, to validate the approach on real-life data.

The whole SMA-Net is trained in an end-to-end manner while minimizing the total loss function of Eq. (12) with hyper-parameters λ_g = 1.0. For the minimization, the Adam optimizer is adopted with an initial learning rate of 1 × 10⁻³; the learning rate is decayed by a factor of 0.7 every 20 epochs, and the training lasts 200 epochs. All experiments have been performed on a laptop with a NVIDIA RTX2080 GPU and an Intel(R) Core(TM) i7-9750 CPU.
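This training configuration maps directly onto standard PyTorch tooling, as in the hedged sketch below: StepLR reproduces the decay of 0.7 every 20 epochs, and the augmentation operates on raw (Np, 3) point tensors for illustration. The objects model, train_loader and compute_joint_loss are assumed to exist and are hypothetical names.

```python
# Hedged sketch of the training setup of Sect. 4.3.1.
import math
import random
import torch

def augment(points: torch.Tensor) -> torch.Tensor:
    """points: (Np, 3). Random flip, up-axis rotation, Gaussian jitter."""
    if random.random() < 0.5:
        points = points * torch.tensor([-1.0, 1.0, 1.0])   # mirror flip
    theta = random.uniform(0.0, 2.0 * math.pi)             # rotation about z
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    points = points @ rot.T
    return points + 0.01 * torch.randn_like(points)        # Gaussian jitter

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.7)
for epoch in range(200):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_joint_loss(model, batch)   # Eq. (12), see Sect. 3.5
        loss.backward()
        optimizer.step()
    scheduler.step()                              # lr *= 0.7 every 20 epochs
```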

4.3.2 Comparison setting

Two known models, namely Pointnet2† and Resnet† [28, 44], have been tuned for comparison with SMA-Net on our datasets. In particular, for Pointnet2†, the implementation of [44] has been used as a starting point. In order to make it suitable for both the classification and estimation tasks, the classification head of the original implementation has been replaced by the multi-head part proposed in this paper. For training, the original inputs have been uniformly sampled with 4096 points. For Resnet†, the 3D-Resnet [11] based on the MinkowskiEngine sparse convolutional library has been re-implemented. More concretely, sparse tensors and sparse operations replace the dense tensors and related operations of the original Resnet. Similarly to Pointnet2†, the multi-head part of SMA-Net replaces the original classification head of Resnet†. In addition, the structure of Res-34 has been used in our experiments. Finally, the same training strategy as for SMA-Net has been adopted to train both Pointnet2† and Resnet†. In this paper, † thus indicates that the multi-head is used instead of the original classification head, to make the model compatible with the fitting task.

5 Results and discussion

The proposed approach has been validated on two up-to-date challenging scenarios. The first scenario illustrates how SMA-Net can be used to follow the evolutions of a real robot and update its digital twin for Industry 4.0 applications. The second scenario shows how SMA-Net can identify objects in a real environment and estimate their parameter values, to allow the fast reconstruction of 3D environments for BIM applications.

5.1 Evaluation criteria

Several criteria are adopted here to evaluate the performances of SMA-Net on both the identification and parameter estimation tasks. For the identification task, the adopted criterion is the one of [8]. For the parameter estimation task, the accuracy of the estimations can be assessed in an absolute but also relative manner. Therefore, for a given parameter p_i^c related to a class c, the averages of the absolute and relative errors on a test dataset can be computed as follows:

  E_r^avg(c, i) = (1/N_test^c) · Σ_(k=1)^(N_test^c) ‖Δp_i^c(k)‖,
  E_a^avg(c, i) = (1/N_test^c) · Σ_(k=1)^(N_test^c) ‖δp_i^c(k)‖,    (13)

where N_test^c refers to the number of samples of class c in the test dataset, and Δp_i^c(k) and δp_i^c(k) represent respectively the relative and absolute errors of parameter p_i^c on the k-th element of class c in the test dataset. The errors are computed using Eq. (10). If angle-type and distance-type parameters appear at the same time when computing the average value, by default all parameter values are directly added to calculate the average. Further, the mean average values (E_r^m and E_a^m) are used to describe the overall errors, which are defined as follows:

  E_r^m = (1/N_c) · Σ_(c=1)^(N_c) (1/N_p^c) · Σ_(i=1)^(N_p^c) E_r^avg(c, i),
  E_a^m = (1/N_c) · Σ_(c=1)^(N_c) (1/N_p^c) · Σ_(i=1)^(N_p^c) E_a^avg(c, i),    (14)

where N_c and N_p^c refer respectively to the number of classes and the number of parameters of class c.

In addition, to assess the quality of a fitting result, Euclidean distances are computed between the point cloud and the CAD model updated with the parameters estimated by SMA-Net. This is performed in CloudCompare, and for a sample of the test dataset, the average distance, max distance, and standard deviation are provided. Finally, the efficiency of SMA-Net is evaluated in terms of the time required to perform both the identification and estimation tasks. Classically, the efficiency is expressed in Hertz (Hz) to show how many samples the network treats every second.
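The computation of Eq. (13)–(14) is direct; the following NumPy sketch shows it for completeness, with illustrative names and array layouts.

```python
# Straightforward NumPy rendition of the error metrics of Eq. (13)-(14).
import numpy as np

def class_errors(estimated: np.ndarray, ground_truth: np.ndarray):
    """estimated, ground_truth: (N_test^c, Np^c) arrays for one class c."""
    abs_err = np.abs(estimated - ground_truth)          # |delta p_i^c(k)|
    rel_err = abs_err / np.abs(ground_truth)            # |Delta p_i^c(k)|
    return rel_err.mean(axis=0), abs_err.mean(axis=0)   # E_r^avg, E_a^avg per i

def mean_errors(per_class_results):
    """per_class_results: list of (E_r^avg, E_a^avg) arrays, one pair per class."""
    e_rm = np.mean([r.mean() for r, _ in per_class_results])   # E_r^m (Eq. 14)
    e_am = np.mean([a.mean() for _, a in per_class_results])   # E_a^m (Eq. 14)
    return e_rm, e_am
```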

5.2 Quantitative analysis

Figure 10 shows the evolution of the relative and absolute errors during the training on the robot dataset, and Table 1 gathers the results of the evaluation of Pointnet2†, Resnet† and SMA-Net on the test dataset composed of as-scanned point clouds. Here, there is no identification task, as the dataset is composed of a single class, i.e. the four-axis robot. Compared with Pointnet2† and Resnet†, SMA-Net achieves the best results for the estimation of the four angles. From the results, axis-1 (A_1) and axis-2 (A_2) have smaller absolute errors, around 0.7° and 0.6° respectively. The absolute error of axis-4 (A_4) is larger, around 3.1°. These results are very intuitive: the spatial positions of axis-3 and axis-4 strongly depend on the positions of axis-1 and axis-2. SMA-Net thus gets more information from the changes of axis-1 and axis-2, whose behavior is therefore better approximated. In terms of efficiency, the training takes around 5 h, and SMA-Net is able to estimate the angles at a frequency of 30.83 Hz, which is far enough for Industry 4.0 applications, including updating the digital twin of such a system evolving in time. Moreover, the size of the model is 12M parameters, which makes it possible to consider applications on low-performance devices.

Fig. 10  Training curves with relative errors E_r^avg (dotted lines) and absolute errors E_a^avg (solid lines) for the robot dataset and its four angle values to estimate: curves on the training set (top), curves on the validation set (bottom)

Table 1  Evaluation results on the robot test dataset (N_test^Robot = 1000). Each cell gives E_r^avg(c, i) / E_a^avg(c, i) for the angle A_i

Model       A_1            A_2            A_3            A_4
Pointnet2†  0.007 / 0.92°  0.007 / 0.84°  0.010 / 1.27°  0.028 / 3.62°
ResNet†     0.006 / 0.82°  0.006 / 0.69°  0.013 / 1.62°  0.027 / 3.46°
SMA-Net     0.005 / 0.68°  0.004 / 0.57°  0.007 / 1.05°  0.025 / 3.09°

Model       E_r^m   E_a^m   Params. (M)
Pointnet2†  0.013   1.66°   4.6
ResNet†     0.013   1.65°   64.1
SMA-Net     0.011   1.35°   12.1

Table 2 shows the results of Pointnet2†, Resnet† and SMA-Net for both the identification and parameter estimation tasks on the furniture dataset. It proves that the three models can all accurately identify the CAD models from the database (Acc. = 1.0 in Table 2), but that SMA-Net far outperforms Pointnet2† and Resnet† on the estimation task. Indeed, Pointnet2† fails to estimate the parameter values accurately, and Resnet† also generates large estimation errors compared with SMA-Net. In the appendix, Tables 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 detail the results for each parameter. From the results, SMA-Net is able to manage both angle-type and distance-type parameters, to identify objects, and to perform the estimations in a highly efficient way. For example, considering Chair_3 (Fig. 9a, Table 9), the 'back_leg_angle' (p_16) relative error is 0.013, which is of the same order of magnitude as the relative errors of the other, distance-type, parameters. Table_1 (Fig. 9b_i) has 8 parameters; compared to the parameters 'length' (p_23), 'width' (p_24) and 'leg_high' (p_30), the parameters 'back_railgap' (p_25), 'side_railgap' (p_26), 'rail_high' (p_27), 'leg_radius_1' (p_28) and 'leg_radius_2' (p_29) have a very small size, which makes them more difficult to fit. Indeed, the fitting of small-size features usually consumes more iteration cycles than that of large-size ones. In terms of efficiency, the training takes around 14 h, and SMA-Net is able to identify objects and estimate parameter values in one forward propagation, at a frequency of 30.03 Hz, which is very promising for the development of AR/VR applications. Here, the model only takes 13.9M parameters.

Table 2  Results on the furniture test dataset (N_test^Furniture = 3000)

Model       E_r^m   E_a^m   Acc.   Params. (M)
Pointnet2†  0.097   20.15   1.0    6.2
ResNet†     0.033   8.26    1.0    64.7
SMA-Net     0.027   6.48    1.0    13.9

Furthermore, the sensitivity of SMA-Net to noise has been evaluated by generating a series of virtual scan data incorporating different levels of noise. Specifically, noise is added using a Gaussian distribution controlled by an amplitude factor varying from 0 to 30 mm, as shown in Table 3. For all the evaluations in Table 3, the same pre-trained model is used. One can see that when the noise increases from 0 to 5 mm, the absolute error of the parameters changes by less than 0.6°. Then, the absolute error grows slightly with a noise of 10 mm; however, when the noise increases to 30 mm, the absolute error of the model increases significantly. It can be concluded that SMA-Net is not sensitive to the presence of a reasonable noise, i.e. a level of noise that can be encountered with current acquisition devices.

Table 3  Evaluation of the absolute errors E_a^avg(c, i) with respect to the amplitude of the inserted noise (0 mm, 1 mm, 5 mm, 10 mm, 30 mm) on the robot test dataset (N_test^Robot = 1000)

Class   Parameter   0 mm    1 mm    5 mm    10 mm   30 mm
Robot   A_0         0.78°   0.68°   1.11°   2.39°   8.68°
        A_1         0.67°   0.57°   0.84°   1.62°   4.94°
        A_2         1.17°   1.05°   1.26°   1.64°   6.98°
        A_3         3.39°   3.09°   3.78°   5.72°   18.95°
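The noise protocol of Table 3 can be sketched as follows, assuming the amplitude factor acts as the standard deviation of a zero-mean Gaussian perturbation applied to the virtually scanned cloud before inference with the same pre-trained model; this interpretation and the array sizes are assumptions.

```python
# Hedged sketch of the noise-sensitivity protocol of Table 3.
import numpy as np

def add_gaussian_noise(points: np.ndarray, amplitude_mm: float) -> np.ndarray:
    """points: (Np, 3) array in mm; returns a perturbed copy."""
    return points + np.random.normal(0.0, amplitude_mm, size=points.shape)

for amplitude in (0.0, 1.0, 5.0, 10.0, 30.0):        # the levels of Table 3
    noisy = add_gaussian_noise(np.random.rand(1000, 3) * 1000.0, amplitude)
```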

5.3 Qualitative analysis

Figures 11 and 12 illustrate how the proposed RE framework can be used to efficiently fit CAD parts and assemblies on point clouds. Here, the point clouds have been obtained following the novel virtual data generation pipeline. Given an input point cloud, SMA-Net identifies the label of the object (e.g. Robot and Chair_3 in those examples) and estimates the parameter values in one forward propagation. Those values are then forwarded to the CAD modeler, which updates the corresponding CAD model and creates the mesh. It is then possible to compute the distances between the point cloud and the mesh to display a color map. With quite low mean distances compared to the size of the objects, those results clearly show the capacity of SMA-Net to identify and fit an object in an incomplete point cloud.

Fig. 11  Robot fitting results on five different poses of the test dataset: as-scanned point clouds (colored points, where red indicates larger deviations) used as inputs of SMA-Net, and meshes (gray) generated from the updated CAD models using the outputs of SMA-Net

Fig. 12  Chair_3 fitting results on samples from the test dataset: as-scanned point clouds used as inputs of SMA-Net, and meshes generated from the updated CAD models using the estimated outputs of SMA-Net

The proposed identification and fitting technique has also been validated on real scanned data. Figure 13 shows the result of a fitting on a point cloud acquired with a ROMER Absolute Arm 7520 SI laser scanner and sub-sampled to 10,000 points. Here, an ICP [21] algorithm aligns the input point cloud and the mesh resulting from the update of the CAD template with the parameter values obtained from SMA-Net. The results are good, even though a bit below the ones obtained on as-scanned point clouds. This is because the adopted CAD template does not fully match the scanned chair. Moreover, chairs are slightly deformable objects, which may also explain possible deviations between a real-world point cloud and a perfect CAD template.
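This alignment and deviation check can be sketched with Open3D as a stand-in (the paper performs the distance computation in CloudCompare): ICP registers the acquired cloud onto a cloud sampled on the fitted CAD mesh, and point-to-model distances yield the statistics reported in the qualitative figures. File names and the correspondence threshold are illustrative.

```python
# Hedged Open3D sketch of the ICP alignment + deviation map computation.
import numpy as np
import open3d as o3d

scan = o3d.io.read_point_cloud("chair_scan.ply")          # acquired cloud
mesh = o3d.io.read_triangle_mesh("fitted_chair.stl")      # updated CAD model
model_pcd = mesh.sample_points_uniformly(number_of_points=100_000)

# ICP registration of the scan onto the fitted model.
result = o3d.pipelines.registration.registration_icp(
    scan, model_pcd, max_correspondence_distance=50.0,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
scan.transform(result.transformation)

# Deviation map: distance of every scan point to the fitted model.
distances = np.asarray(scan.compute_point_cloud_distance(model_pcd))
print(f"mean={distances.mean():.2f}  max={distances.max():.2f}  "
      f"std={distances.std():.2f}")
```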

Fig. 13  Multiple views (b_1–b_3) of the fitting result of Chair_3 on the point cloud acquired with a laser scanner (a)

Figure 14a, b illustrates how the proposed RE framework can be used for the real-time reconstruction of CAD models within a virtual scene. From left to right, the raw point cloud is first acquired using a Kinect V1, and then transferred to a pre-trained semantic segmentation model [23] to label the points and extract the objects of interest by clustering [17]. Then, for each segmented point cloud, SMA-Net is used to identify the underlying object and estimate its parameters in one forward propagation. Finally, using the outputs of SMA-Net, the CAD models can be reconstructed and located in the virtual scene using ICP. Here, the measured deviations can be ascribed to the simplified nature of the underlying CAD templates, which cannot capture the full complexity of the acquired objects; more parameters would be necessary. This need is also true when considering possible object deformations due to aging, which would require additional modeling parameters. Finally, outliers and artifacts due to the acquisition with the Kinect also have an impact on the results and create a larger variance in some places.

Fig. 14  Complete reconstruction of virtual environments from point clouds captured with a Kinect V1. The whole scene is first segmented by a deep segmentation network, and a clustering algorithm is used to group points. Then, for each point cloud, SMA-Net identifies the label of the underlying object and estimates its parameters in one forward propagation. Finally, CAD models are reconstructed using the output values of SMA-Net, and they are located in the 3D scene using ICP

5.4 Ablation study

The architecture of SMA-Net has been designed around three components, namely the multi-resolution feature extractor, the spatial merge attention module and the multi-head module, all of which have a vital impact on the performances.

For fast evaluation, a sub-dataset of the furniture dataset is exploited, which consists of six classes, namely Chair_1, Chair_2, Chair_3, Table_4, Table_5, and Table_6. The configuration of the hyperparameters is the same as for the above experiments. Table 4 shows the experimental results. The five leftmost columns are used to configure the experiments. The second column indicates whether the information from multiple resolutions (MR) is combined or not during the feature extraction stage. The third one indicates the adoption or not of the spatial attention (SA) mechanism. The fourth column tells whether the data merge module (DM) is applied or not. The fifth column indicates whether multiple heads (MH) are used or not to regress the Ng groups. To highlight the contribution of each component, seven experiments were conducted.
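For reference, a minimal sketch of the two reported quantities is given below, assuming Erm is the mean of the relative errors and Eam the mean of the absolute errors in mm; the exact definitions are those given earlier in the paper.

```python
import numpy as np

def ablation_metrics(pred, true):
    """Mean relative error (Erm) and mean absolute error in mm (Eam)
    over a set of predicted/ground-truth parameter vectors; a sketch
    assuming nonzero ground-truth values expressed in mm."""
    abs_err = np.abs(pred - true)
    erm = float(np.mean(abs_err / np.abs(true)))
    eam = float(np.mean(abs_err))
    return erm, eam
```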
The first experiment can be considered as a baseline. Specifically, as each component is set to 'No', only the output of the last layer of the feature extractor is used as input of the next part, the spatial attention and merge modules are removed, and for the final parameter prediction, only one regressor is integrated. The second experiment clearly shows that multi-heads help to decouple features between different groups to a certain extent. Then, for the third experiment, the MR feature extractor is activated to integrate features from different scales. The result shows that rich multi-scale features help improve the performance of SMA-Net. In the fourth experiment, deploying the SA sub-module alone slightly hurts the performance of the model instead of improving it. This is because the features calibrated by SA are too abstract to be parsed by a simple network. Conversely, this reflects the effectiveness and importance of the transformer-based DM sub-module. Finally, in the sixth experiment, all four modules are activated, and the best results are obtained.
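To make the multi-head (MH) design concrete, the following is a minimal PyTorch sketch, not the authors' exact architecture: the hidden size of 128 and the use of one two-layer MLP per parameter group are assumptions.

```python
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    """Sketch of the multi-head idea: one small regressor per parameter
    group, decoupling groups while letting classes of the same group
    share a head. Layer sizes are assumed, not the published ones."""
    def __init__(self, feat_dim, group_sizes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n))
            for n in group_sizes)

    def forward(self, features):
        return [head(features) for head in self.heads]  # one tensor per group
```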
In addition, for the last experiment (marked '‡' in Table 4), tokens are generated by directly transforming a feature map Fj into a token Tj, instead of following the token generation process detailed in Sect. 3.4. From the results, it is clear that the proposed token generation process is more effective than simply transforming a feature map Fj into a token Tj.
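To make the '‡' baseline concrete, one plausible reading of this direct transformation is sketched below; the actual operation used in the paper may differ, and the proposed token generation of Sect. 3.4 is richer than this.

```python
import torch

def simple_token(feature_map: torch.Tensor) -> torch.Tensor:
    """One plausible reading of the '‡' baseline: a feature map Fj of
    shape (C, D, H, W) is collapsed into a single token Tj of dimension
    C by global average pooling over the spatial axes (an assumption)."""
    c = feature_map.shape[0]
    return feature_map.reshape(c, -1).mean(dim=1)  # token Tj
```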
As a conclusion, the extraction of features from multiple resolutions provides more useful information than collecting features from only one layer. Moreover, introducing the spatial merge attention module helps SMA-Net integrate more meaningful information and gives it a larger receptive field during the feature merge stage. In addition, the multi-head design makes it possible to deal with different target groups. It also reduces the coupling between different groups, while promoting mutual reinforcement between the classes which belong to the same group.

6 Conclusion and future work

This paper introduces a novel reverse engineering framework able to identify and fit CAD models from incomplete point clouds. The core of the approach is SMA-Net, a fast bottom-up 3D classification and fitting framework able to identify objects and estimate the parameter values in an end-to-end manner. The network is trained using point clouds generated following a virtual data generation pipeline able to create many as-scanned point clouds.

Differently from the traditional reverse engineering techniques, which work at the level of the patches and face numerous trimming and stitching issues, the proposed approach carries out an end-to-end identification and parameter estimation in one forward propagation, and it achieves very good results on both virtually-generated datasets and real-life acquisitions. The geometric modeling operations are left to the CAD modeler in charge of generating CAD models from CAD templates using the output values of SMA-Net. Considering possibly incomplete point clouds resulting from the occlusion phenomenon, the results have shown that SMA-Net is able to rely on other features to infer reasonable parameter values for the partially visible parts. In addition, future works should focus on integrating all modules shown in Fig. 2 into a unified reverse engineering framework to realize end-to-end reasoning from the entire scene to the CAD model. Future needs also concern the possibility to automatically adjust CAD templates to take into account local deformations and thus be able to model structural ageing.

Table 4  Ablation study evaluating the interest of each component of SMA-Net

Model      MR    SA    DM    MH    Erm     Eam (mm)
SMA-Net    No    No    No    No    0.035   13.19
SMA-Net    No    No    No    Yes   0.031   10.94
SMA-Net    Yes   No    No    Yes   0.024   8.06
SMA-Net    Yes   Yes   No    Yes   0.025   8.66
SMA-Net    Yes   No    Yes   Yes   0.023   7.23
SMA-Net    Yes   Yes   Yes   Yes   0.022   6.50
SMA-Net‡   Yes   Yes   Yes   Yes   0.025   8.29

The optimal results are those of the sixth row (Erm = 0.022, Eam = 6.50 mm)
The '‡' refers to tokens generated by directly transforming a feature map Fj into a token Tj, following the details given in Sect. 3.4


Moreover, two application scenarios have been presented with promising results to serve in the context of Industry 4.0. The one on the robot is particularly adapted to maintaining the coherence of the digital twin, i.e. to tracking the evolution of the physical twin in time. The second one involves the complete reconstruction of a virtual environment from a Kinect acquisition. In both cases, SMA-Net demonstrated good accuracy and very high speed in identification and estimation tasks, and its performances are compatible with Industry 4.0 real-time simulation requirements. Both datasets are made publicly available at the following URL: https://doi.org/10.5281/zenodo.5031617. Last but not least, the results obtained with our method can also be considered as a set of initial parameter values for a subsequent fine-tuning step based on more traditional methods. Such a good initialization allows those algorithms to converge quickly to a local minimum and thus reach a higher accuracy, which is of great research interest and practical value.

Appendix 1

See Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16.

Table 5  Parameter name and index relationship of the chairs in the furniture dataset

Class c   Name               Parameter pi
Chair_1   p_seat             p1
          u_shape_dia        p2
          u_loc              p3
          leg_loc            p4
          leg_high           p5
Chair_2   sideRails_length   p6
          seat_length        p7
          width              p8
          backlegs_high      p9
          frontLegs_high     p10
Chair_3   length             p11
          width              p12
          leg_high           p13
          total_high         p14
          back_rails_high    p15
          back_leg_angle     p16
Chair_4   width_seat         p17
          fillet_seat        p18
          rail_seat          p19
          high               p20
          u_len              p21
          u_dis              p22

Table 6  Parameter name and index relationship of the tables in the furniture dataset

Class c   Name           Parameter pi
Table_1   length         p23
          width          p24
          back_railgap   p25
          side_railgap   p26
          rail_high      p27
          leg_radius_1   p28
          leg_radius_2   p29
          leg_high       p30
Table_2   upLength       p31
          upWidth        p32
          downLength     p33
          downWidth      p34
          legHigh        p35
          legDis         p36
          legDia         p37
Table_3   upLength       p38
          downLength     p39
          legHigh        p40
          legDia         p41
Table_4   high           p42
          dia            p43
          leg_dis        p44
          support_dis    p45
          u_len          p46
          u_dia          p47
          u_loc          p48
Table_5   width          p49
          length         p50
          high           p51
          radius         p52
Table_6   high           p53
          dia            p54
          leg_angle      p55
          support_dis    p56


Table 7  Result of Chair_1 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 5. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p1            p2            p3            p4            p5
Pointnet2†   0.022  6.02   0.270 11.14   0.059  8.88   0.023  5.92   0.012  6.51
ResNet†      0.011  2.96   0.035  1.37   0.026  4.07   0.011  2.55   0.010  5.25
SMA-Net      0.008  1.89   0.028  1.12   0.014  2.16   0.007  1.56   0.004  2.01

Table 8  Result of Chair_2 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 5. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p6            p7            p8            p9            p10
Pointnet2†   0.060 27.03   0.053 26.88   0.070 33.28   0.008  9.38   0.018 14.91
ResNet†      0.012  5.66   0.010  5.78   0.014  7.72   0.006  6.90   0.008  5.98
SMA-Net      0.010  5.21   0.009  5.26   0.011  5.88   0.003  3.51   0.006  4.07

Table 9  Result of Chair_3 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 5. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p11           p12           p13           p14           p15            p16
Pointnet2†   0.094 48.48   0.084 46.71   0.039 18.61   0.026 24.69   0.571 101.89   0.029  4.37
ResNet†      0.019  9.06   0.016  7.94   0.011  5.50   0.012 11.09   0.031   4.38   0.014  2.21
SMA-Net      0.016  7.83   0.012  6.25   0.014  6.72   0.005  4.67   0.022   3.10   0.013  1.94


Table 10  Result of Chair_4 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 5. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p17           p18           p19           p20           p21           p22
Pointnet2†   0.064 34.90   0.173 10.76   0.123 24.15   0.025 11.12   0.394 21.20   0.663 45.81
ResNet†      0.050 26.42   0.117  7.77   0.083 15.03   0.033 14.86   0.099  5.01   0.250 10.99
SMA-Net      0.043 23.29   0.090  5.97   0.078 13.48   0.026 11.56   0.090  4.49   0.229  9.50

Table 11  Result of Table_1 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 6. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p23           p24           p25           p26
Pointnet2†   0.021 28.42   0.025 14.80   0.189 11.67   0.101  7.71
ResNet†      0.015 18.79   0.021 12.17   0.088  4.83   0.053  3.73
SMA-Net      0.010 12.92   0.015  8.16   0.063  3.39   0.042  3.09

Model        p27           p28           p29           p30
Pointnet2†   0.332 16.57   0.176  4.49   0.123  3.19   0.020 13.36
ResNet†      0.059  2.62   0.030  0.72   0.027  0.72   0.014  8.83
SMA-Net      0.058  2.56   0.022  0.54   0.027  0.71   0.007  4.54

Table 12  Result of Table_2 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 6. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p31           p32           p33           p34
Pointnet2†   0.019 19.94   0.029 16.54   0.058 34.93   0.061 27.07
ResNet†      0.011 11.68   0.017  9.90   0.018 10.69   0.022  9.59
SMA-Net      0.009  9.71   0.011  6.54   0.013  7.72   0.013  5.86

Model        p35           p36           p37
Pointnet2†   0.011  8.51   0.118 27.68   0.118 12.49
ResNet†      0.010  7.07   0.022  4.97   0.015  1.82
SMA-Net      0.003  2.39   0.019  4.34   0.014  1.61


Table 13  Result of Table_3 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 6. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p38           p39           p40           p41
Pointnet2†   0.014 12.90   0.019 10.16   0.012  6.15   0.047  6.03
ResNet†      0.008  6.88   0.010  4.96   0.009  4.72   0.016  2.25
SMA-Net      0.005  4.56   0.009  4.57   0.004  1.79   0.017  2.34

Table 14  Result of Table_4 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 6. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p42           p43           p44           p45
Pointnet2†   0.016 13.13   0.022 26.59   0.035 30.10   0.107 63.67
ResNet†      0.024 19.11   0.024  2.87   0.037 31.94   0.053 35.47
SMA-Net      0.014 11.01   0.020 23.37   0.033 29.37   0.043 29.65

Model        p46           p47           p48
Pointnet2†   0.280 12.48   0.421 14.38   0.092  7.80
ResNet†      0.176  7.67   0.230  6.37   0.048  4.51
SMA-Net      0.129  5.91   0.202  5.89   0.038  3.73

Table 15  Result of Table_5 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 6. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p49           p50           p51           p52
Pointnet2†   0.024 19.63   0.014 15.69   0.010  7.87   0.323 71.72
ResNet†      0.011  8.65   0.010 12.63   0.007  5.94   0.035  6.30
SMA-Net      0.009  7.14   0.008  9.28   0.003  2.12   0.029  5.29

Table 16  Result of Table_6 in the furniture test dataset (Ntest_Furniture = 3000), with the name of each parameter pi provided in Table 6. Each cell reports Er_avg(c, i) followed by Ea_avg(c, i) in mm

Model        p53           p54           p55           p56
Pointnet2†   0.011  8.50   0.017 14.50   0.028  2.59   0.083 18.06
ResNet†      0.010  8.19   0.011  9.69   0.016  1.50   0.026  5.83
SMA-Net      0.003  2.68   0.011  9.08   0.014  1.32   0.025  5.35

References

1. Armeni I, Sener O, Zamir AR, Jiang H, Brilakis I, Fischer M, Savarese S (2016) 3D semantic parsing of large-scale indoor spaces. In: IEEE Conf. on comput. vision and pattern recognition, pp 1534–1543
2. Avetisyan A, Dahnert M, Dai A, Savva M, Chang AX, Nießner M (2019) Scan2cad: learning cad model alignment in rgb-d scans. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 2614–2623
3. Avetisyan A, Dai A, Nießner M (2019) End-to-end cad model retrieval and 9dof alignment in 3d scans. In: Proceedings of the IEEE/CVF International Conference on computer vision, pp 2551–2560
4. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
5. Bey A, Chaine R, Marc R, Thibault G, Akkouche S (2011) Reconstruction of consistent 3d CAD models from point cloud data using a priori CAD models. ISPRS 3812:289–294
6. Buonamici F, Carfagni M, Furferi R, Governi L, Lapini A, Volpe Y (2018) Reverse engineering modeling methods and tools: a survey. Comput-Aided Des Appl 15(3):443–464
7. Buonamici F, Carfagni M, Furferi R, Governi L, Lapini A, Volpe Y (2018) Reverse engineering of mechanical parts: a template-based approach. J Comput Des Eng 5(2):145–159
8. Charles RQ, Su H, Kaichun M, Guibas LJ (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In: 2017 IEEE Conf. on comput. vision and pattern recognition (CVPR), pp 77–85
9. Chen X, Ma H, Wan J, Li B, Xia T (2017) Multi-view 3D object detection network for autonomous driving. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), pp 6526–6534
10. Choy C, Dong W, Koltun V (2020) Deep global registration. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 2514–2523
11. Choy C, Gwak J, Savarese S (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 3070–3079
12. Choy C, Park J, Koltun V (2019) Fully convolutional geometric features. In: Proceedings of the IEEE/CVF International Conference on computer vision (ICCV), pp 8958–8966
13. Chu X, Tian Z, Wang Y, Zhang B, Ren H, Wei X, Xia H, Shen C (2021) Twins: revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840
14. Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Nießner M (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: IEEE Conf. on comput. vision and pattern recognition, pp 2432–2443
15. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
16. Erdös G, Nakano T, Váncza J (2014) Adapting CAD models of complex engineering objects to measured point cloud data. CIRP Ann 63(1):157–160
17. Ester M, Kriegel H-P, Sander J, Xu X (1996) Density-based spatial clustering of applications with noise. In: Int. Conf. knowledge discovery and data mining, vol 240, p 6
18. Fayolle P-A, Pasko A (2015) User-assisted reverse modeling with evolutionary algorithms. In: IEEE Congress on evolutionary computation, pp 2176–2183. https://doi.org/10.1109/CEC.2015.7257153
19. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24(6):381–395
20. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The kitti vision benchmark suite. In: 2012 IEEE Conference on computer vision and pattern recognition, pp 3354–3361
21. Gelfand N, Mitra NJ, Guibas LJ, Pottmann H (2005) Robust global registration. In: Symposium on geometry processing, vol 2, no 3, p 5
22. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, pp 315–323
23. Graham B, Engelcke M, Maaten LVD (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 9224–9232
24. Graham B, Engelcke M, Van Der Maaten L (2018) 3d semantic segmentation with submanifold sparse convolutional networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition (CVPR), pp 9224–9232
25. Graham B, van der Maaten L (2017) Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307
26. Guo R, Zou C, Hoiem D (2015) Predicting complete 3d models of indoor scenes. arXiv preprint arXiv:1504.02437
27. Gupta S, Arbeláez P, Girshick R, Malik J (2015) Aligning 3d models to rgb-d images of cluttered scenes. In: 2015 IEEE Conference on computer vision and pattern recognition (CVPR), pp 4731–4740
28. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 770–778
29. Hu Q et al (2021) Learning semantic segmentation of large-scale point clouds with random sampling. IEEE Transactions on pattern analysis and machine intelligence. https://doi.org/10.1109/TPAMI.2021.3083288
30. Ishimtsev V, Bokhovkin A, Artemov A, Ignatyev S, Niessner M, Zorin D, Burnaev E (2020) Cad-deform: deformable fitting of cad models to 3d scans. arXiv preprint arXiv:2007.11965
31. Kang Z, Li Z (2015) Primitive fitting based on the efficient multi-baysac algorithm. PLoS One 10(3):e0117341
32. Katz S, Tal A (2015) On the visibility of point clouds. In: 2015 IEEE International Conference on computer vision (ICCV), pp 1350–1358
33. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169
34. Kim H, Yeo C, Lee ID, Mun D (2020) Deep-learning-based retrieval of piping component catalogs for plant 3d cad model reconstruction. Comput Ind 123:103320
35. Kundu A, Yin X, Fathi A, Ross D, Brewington B, Funkhouser T, Pantofaru C (2020) Virtual multi-view fusion for 3d semantic segmentation. In: European Conference on computer vision. Springer, pp 518–535
36. Li D, Shen X, Yu Y, Guan H, Wang H, Li D (2020) Ggm-net: graph geometric moments convolution neural network for point cloud shape classification. IEEE Access 8:124989–124998
37. Li Y, Wu X, Chrysanthou Y, Sharf A, Cohen-Or D, Mitra NJ (2011) Globfit: consistently fitting primitives by discovering global relations. ACM Trans Graph 30(4):52:1–52:12
38. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030
39. Lu Y (2017) Industry 4.0: a survey on technologies, applications and open research issues. J Ind Inf Integr 6:1–10
40. Mo K, Zhu S, Chang AX, Yi L, Tripathi S, Guibas LJ, Su H (2019) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 909–918
41. Montlahuc J, Shah GA, Polette A, Pernot J-P (2019) As-scanned point clouds generation for virtual reverse engineering of CAD assembly models. Comput-Aided Des Appl 16(6):1171–1182
42. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8026–8037
43. Qi CR, Su H, Nießner M, Dai A, Yan M, Guibas LJ (2016) Volumetric and multi-view cnns for object classification on 3d data. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 5648–5656
44. Qi CR, Yi L, Su H, Guibas LJ (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Proc. of the 31st Int. Conf. on neural information processing systems, NIPS'17, pp 5105–5114
45. Robertson C, Fisher RB, Werghi N, Ashbrook AP (2000) Fitting of constrained feature models to poor 3D data. In: Parmee IC (ed) Evolutionary design and manufacture. Springer, London, pp 149–160
46. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention—MICCAI 2015. Springer International Publishing, Cham, pp 234–241
47. Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, Lin C-T (2017) A review of clustering techniques and developments. Neurocomputing 267:664–681
48. Schnabel R, Wahl R, Klein R (2007) Efficient ransac for point-cloud shape detection. Comput Graphics Forum 26(2):214–226
49. Sener O, Koltun V (2018) Multi-task learning as multi-objective optimization. In: Advances in neural information processing systems, vol 31, pp 525–536
50. Shah GA, Polette A, Pernot JP, Giannini F, Monti M (2021) Simulated annealing-based fitting of CAD models to point clouds of mechanical parts' assemblies. Eng Comput 37(4):2891–2909
51. Shi W, Rajkumar R (2020) Point-gnn: graph neural network for 3d object detection in a point cloud. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 1711–1719
52. Su H, Maji S, Kalogerakis E, Learned-Miller E (2015) Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE International Conference on computer vision, pp 945–953
53. Tang H, Liu Z, Zhao S, Lin Y, Lin J, Wang H, Han S (2020) Searching efficient 3D architectures with sparse point-voxel convolution. In: European Conference on computer vision (ECCV), pp 685–702
54. Thomas H, Qi CR, Deschaud J, Marcotegui B, Goulette F, Guibas L (2019) KPConv: flexible and deformable convolution for point clouds. In: IEEE Int. Conf. on computer vision (ICCV), pp 6410–6419
55. Ulyanov D, Vedaldi A, Lempitsky V (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022
56. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30, pp 5998–6008
57. Wang C, Samari B, Siddiqi K (2018) Local spectral graph convolution for point set feature learning. In: Proceedings of the European Conference on computer vision (ECCV), pp 52–66
58. Wang L, Huang Y, Hou Y, Zhang S, Shan J (2019) Graph attention convolution for point cloud semantic segmentation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 10296–10305
59. Wang S, Suo S, Ma W-C, Pokrovsky A, Urtasun R (2018) Deep parametric continuous convolutional neural networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 2589–2597
60. Willis KD, Pu Y, Luo J, Chu H, Du T, Lambourne JG, Solar-Lezama A, Matusik W (2021) Fusion 360 gallery: a dataset and environment for programmatic cad construction from human design sequences. ACM Trans Graph (TOG) 40(4):1–24
61. Wu S, Wu T, Lin F, Tian S, Guo G (2021) Fully transformer networks for semantic image segmentation. arXiv preprint arXiv:2106.04108
62. Wu W, Qi Z, Fuxin L (2019) Pointconv: deep convolutional networks on 3d point clouds. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 9621–9630
63. Xie Y, Tian J, Zhu XX (2020) Linking points with labels in 3d: a review of point cloud semantic segmentation. IEEE Geosci Remote Sens Mag 8(4):38–59
64. Xu Y, Fan T, Xu M, Zeng L, Qiao Y (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In: Proceedings of the European Conference on computer vision (ECCV), pp 87–102
65. Xu Y, Fan T, Xu M, Zeng L, Qiao Y (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. In: Proceedings of the European Conference on computer vision (ECCV), pp 87–102
66. Yi L, Kim VG, Ceylan D, Shen IC, Yan M, Su H et al (2016) A scalable active framework for region annotation in 3d shape collections. ACM Trans Graph (ToG) 35(6):1–12
67. Zhao H, Jiang L, Fu C-W, Jia J (2019) Pointweb: enhancing local neighborhood features for point cloud processing. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 5565–5573
68. Zhu B, Jiang Z, Zhou X, Li Z, Yu G (2019) Class-balanced grouping and sampling for point cloud 3d object detection. arXiv preprint arXiv:1908.09492

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
