SMA-Net: Deep Learning Based Identification and Fitting of CAD Models From Point Clouds
https://doi.org/10.1007/s00366-022-01648-z
ORIGINAL ARTICLE
Abstract
Identification and fitting are important tasks in reverse engineering and virtual/augmented reality. Compared to traditional approaches, carrying out such tasks with deep learning-based methods still leaves much room for exploration. This paper presents SMA-Net (Spatial Merge Attention Network), a novel deep learning-based end-to-end bottom-up architecture, specifically focused on fast identification and fitting of CAD models from point clouds. The network is composed of three parts whose strengths are clearly highlighted: a voxel-based multi-resolution feature extractor, a spatial merge attention mechanism and a multi-task head. It is trained with both virtually-generated point clouds and as-scanned ones created from multiple instances of CAD models, themselves obtained with randomly generated parameter values. Using this data generation pipeline, the proposed approach is validated on two different data sets that have been made publicly available: a robot data set for Industry 4.0 applications, and a furniture data set for virtual/augmented reality. Experiments show that this reconstruction strategy achieves compelling and accurate results at very high speed, and that it is very robust on real data obtained, for instance, by laser scanner and Kinect.
Keywords Identification · Fitting · Deep learning · Transformer · Data generation · Virtual reality · Reverse engineering
computer vision, simulation, natural language processing, and robotics. As a dominating branch of artificial intelligence, deep learning has been successfully used to solve complex problems that are sometimes difficult to solve with more traditional approaches. Specifically, being able to identify parts and assemblies, segment point clouds and retrieve CAD model parameter values from a point cloud are essential steps, which could benefit a lot from a deep learning-based approach. However, the use of deep learning to reverse engineer products or systems is a relatively poorly explored strategy, notably when it comes to the reconstruction of parametric CAD assemblies [6]. Although some 3D reconstruction methods based on deep learning have emerged in recent years [2, 3, 26, 27], these methods are all based on existing open datasets and are not specifically designed for reverse engineering, so their accuracy is far from meeting the requirements of industrial applications. The lack of proper datasets is certainly partially responsible for such low attention, as it is a bottleneck that makes this type of research very challenging. Indeed, most existing datasets are only designed for tackling part-level or instance-level 3D scene understanding, like 3D semantic segmentation, 3D object detection or parameter-independent 3D reconstruction. Moreover, these datasets usually originate from the Internet and require a huge amount of manual annotations. The work presented in this paper not only aims to identify parts and assemblies from point clouds, but also to retrieve parameter values from the corresponding CAD models.

To solve those issues and allow a more global approach to reverse engineering, this paper introduces a deep learning-based reconstruction technique. Figure 1 shows a reconstruction example following the proposed approach. It is based on a novel end-to-end bottom-up architecture, specially focused on fast identification and fitting of CAD models from 3D point clouds. If the point cloud embeds several semantic parts (e.g., chairs or tables), they are treated separately after a segmentation step (Fig. 1b). The core of the approach relies on the so-called SMA-Net (Spatial Merge Attention Network) composed of three main parts: a voxel-based multi-resolution feature extractor, a spatial merge attention mechanism and a multi-task head. It is trained with both virtually-generated point clouds and as-scanned ones created from multiple instances of several CAD templates, themselves obtained with randomly generated parameter values. Using this data generation pipeline, the proposed approach has been validated on two different datasets: a robot dataset for Industry 4.0 applications, and a furniture dataset for virtual/augmented reality. Once trained, SMA-Net can be used on a new point cloud to identify the corresponding CAD template (Fig. 1c) and estimate its parameter values (Fig. 1d). The CAD templates gather together prior knowledge on the parts to be reconstructed, including all the constraints to define the related B-Rep models. The geometric reconstructions are performed within a CAD software in charge of updating the CAD models using the outputs of SMA-Net. The adopted modeler also keeps track of the consistency of the generated models, including assembly constraints between parts if needed.

Experiments show that this reconstruction strategy achieves compelling and accurate results at very high speed, and that it is very robust on real datasets obtained for instance by laser scanner and Kinect. Thus, these results are of particular interest in the context of Industry 4.0, to maintain the coherence between real products/systems/environments and their digital twins, whose geometric models are known a priori, but whose configurations are evolving.

The contribution is threefold: (i) a novel reverse engineering framework able to fit CAD models on point clouds segmented using a deep segmentation network; (ii) a virtual data generation pipeline able to generate datasets of CAD model instances together with their as-scanned point clouds and related parameter values; (iii) a light and fast bottom-up 3D classification and fitting framework, named SMA-Net, able to identify objects from point clouds and estimate the parameter values of the corresponding CAD templates in an end-to-end manner. For the first time, such datasets used to validate the proposed identification and fitting technique have been made publicly available.

The paper is organized as follows. Section 2 reviews the works related to reverse engineering and deep learning. The overall framework is presented in Sect. 3 with the details of the three parts of SMA-Net. The experimental setup is detailed in Sect. 4, and the results are presented and discussed in Sect. 5. Section 6 ends this paper with conclusion and discussion.
Fig. 1 Fast reconstruction of a chair CAD model: a acquired chair point cloud and its environment, b segmented point cloud, c CAD template and its parameters, d fitted CAD model generated by a CAD modeler with parameter values resulting from SMA-Net
2 Prior work

This section reviews previous reverse engineering and deep learning approaches for point cloud analysis. In reverse engineering, particular attention is paid to template-based methods. Fayolle and Pasko proposed a system to reconstruct solid models from digital point clouds by means of a genetic evolutionary algorithm [18]. Robertson et al. [45] have explored the extraction of parametric models of features from poor quality 3D data using RANSAC [19], but their approach can only estimate parameters of simple primitives (e.g. cylinders, planes). As extensions of RANSAC, the methods of [31, 37, 48] try to recover multiple primitives from a point cloud. However, the performance of these RANSAC-based methods is highly dependent on the tuning of parameters based on prior knowledge of different primitive shapes, and they are not able to reverse engineer a full CAD model. These algorithms focus on the resolution of general reverse engineering problems. Therefore, it is difficult for them to achieve accurate reconstructions of 3D targets like template-based methods do. Bey et al. [5] have introduced a template-based method to reconstruct CAD models from 3D point clouds, which is applied for civil engineering purposes. Erdös et al. have presented a CAD model matching method and they have extended the type of features to tori and cuboids [16]. In these works, the authors also focus on basic shapes (e.g. cylinders and cuboids) and do not work on complete CAD models. Inspired by template-based reverse engineering approaches, Buonamici et al. [7] proposed a technique to fit a single CAD template on a point cloud using particle swarm optimization. Another fitting approach is proposed by Shah et al. [50], with a template-based CAD reconstruction process optimized part-by-part using simulated annealing. These template-based methods show good results, but suffer from several limitations. For instance, manual intervention is required when pre-aligning CAD templates onto the point clouds, and a lot of time is spent in the iterative resolution process. In recent years, some template-based point cloud fitting methods based on deep learning have emerged. Armen et al. [2] perform scan-to-CAD alignment by predicting the 9 DoF (degrees of freedom) of the CAD model. [3, 26, 27] use a similar approach to align the CAD template with the 3D input. Then, to improve surface reconstruction accuracy, Vladislav et al. [30] proposed to deform retrieved CAD models after their alignment to scans. However, this mesh-deformation-based method can only achieve rough approximations, and it results in changes in the geometric properties of the CAD model that is transformed into a mesh. Thus, it is still difficult to meet the industrial requirements in terms of parameter-driven CAD template reconstruction.

Many different ways have been proposed to process point cloud data using deep learning, such as point-wise CNNs [8, 29, 44, 64, 67], volume-wise CNNs [10-12, 24, 25, 53], multi-view CNNs [9, 35, 43, 52] and special CNNs, namely kernel-based parametric CNNs [54, 59, 62] and graph reasoning [36, 51, 57, 58]. Considering point-wise CNNs, PointNet [8] is a pioneer in this direction. However, PointNet does not capture local structures, limiting both its ability to recognize fine-grained patterns and its generalizability to complex scenes. There have been some attempts to strengthen the ability to capture local features so as to learn deep point set features more efficiently and robustly [44, 64, 67]. Instead of directly processing the irregular points, voxel-based methods project the point cloud onto regular 3D grids in Euclidean space. Graham et al. [24, 25] handle voxel-based data with sparse tensors and sparse convolutions and achieved good results for the segmentation of indoor scenes. Christopher et al. [11] introduced 4D Spatio-Temporal ConvNets that further improve the efficiency of sparse convolution through a more complex network structure. Multi-view CNNs represent the 3D point cloud with images rendered from different views and analyze the information based on a 2D CNN. Recently, Chen et al. [9] proposed to encode the sparse 3D point cloud with a compact multi-view representation for the task of object detection. Abhijit et al. [35] choose multiple different virtual views from a 3D input to generate representative 2D channels for training a 2D model for the task of indoor 3D semantic segmentation. Moreover, some works designed convolutional kernels in special ways to improve the parsing ability of convolutional operations. Xu et al. [65] introduced SpiderCNN, whose kernel is defined by a Taylor polynomial function with weighted neighbors, so as to extend convolutional operations from regular grids to irregular point sets. Thomas et al. [54] introduced KPConv, a new point convolution kernel that dynamically optimizes the positions of its weights. Besides, Li et al. [36] explored the improvement of 3D scene understanding performance in a non-Euclidean space.

Deep learning approaches often rely on huge datasets. ScanNet [14] and S3DIS [1] are richly annotated datasets of 3D reconstructed meshes of indoor scenes specifically designed for scene understanding. For the task of part-level understanding, ShapeNetPart [66] and PartNet [40] introduce large-scale datasets with instance-level annotations of 3D objects. As an outdoor scene dataset, KITTI [20] is widely used for optical flow estimation, 3D object detection and 3D tracking. However, such datasets are not yet fully available when considering point clouds for parameter value estimation. Moreover, the aforementioned deep learning methods are often used as backbones in 3D classification, 3D object detection and 3D scene segmentation tasks.
There exist few attempts to study their application in reverse engineering, especially to reconstruct parametric CAD models [34, 60]. To circumvent those limitations, this paper introduces SMA-Net, a multi-task convolutional neural network designed to perform both the classification and the parameter estimation in one forward propagation during the fitting process. Furthermore, a virtual data generation pipeline is proposed to allow the creation of large datasets ready for training, and to validate the approach.

3 RE framework and SMA-Net architecture

This section introduces the novel RE framework together with the architecture and technicalities of SMA-Net, the Spatial Merge Attention Network at the core of the proposed approach.

3.1 Reverse engineering framework

The RE framework is presented on Fig. 2. It starts with the acquisition of a raw point cloud, which then undergoes a segmentation and clustering step, which can be done using existing well-designed semantic segmentation networks [63] and clustering algorithms [47]. In our implementation, SSCNs [23] pretrained on the ScanNet dataset and DBSCAN [17] are used for segmentation and clustering, respectively. The output point clouds are then analyzed by SMA-Net, which has been pre-trained upstream. Thanks to an elegant deep neural network and an end-to-end learning, SMA-Net combines object identification and parameter estimation in one module. It outputs the classes and estimated parameter values of the parts and assemblies identified in the point cloud. This information is then used as input of the reconstruction module in charge of generating all the CAD model instances using pre-defined templates. CAD templates include built-in constraints (e.g. symmetries, joints between parts) to be satisfied by the modeler during the reconstruction step. The whole framework is an end-to-end architecture without any loop or iteration, which makes it very fast and easy to control. Here, CAD models do not need to be updated at multiple iteration steps, and they are generated only once after the fitting.

3.2 Overview of the identification and fitting framework

To obtain the right class and precise parameter values, three problems have been tackled. The first is to extract features from the 3D object space in an efficient way, the second is to efficiently integrate information from a large amount of extracted features, and the third is to build a robust loss function to guide the learning process. SMA-Net has an elegant network architecture for solving these three problems. It consists of three main components depicted on Fig. 3:

• Voxel-based multi-resolution feature extractor to efficiently extract features from different spatial resolutions (Fig. 3a). The input is a set of N_p points of 3D coordinates p_i = (x_i, y_i, z_i) ∈ ℝ³, with i ∈ [1 … N_p]. Firstly, the complete point cloud is voxelized at a voxel size of V_s (s refers to the size of the voxel) and then fed into a 3D CNN for feature extraction. The output is a set of feature maps generated at different resolutions and denoted as F_j, with j ∈ [1 … N_l] and j the index of the extracted feature maps with different spatial resolutions.
• Spatial merge attention mechanism for robustly integrating features of different resolutions (Fig. 3b). Each extracted feature map F_j is forwarded to the spatial attention (SA) mechanism, which emphasizes spatially meaningful features of F_j along the spatial axes. It generates a SA score M_j^s for each spatial resolution layer. Then, spatially recalibrated features F_j^s are computed by voxel-wise multiplications between M_j^s and F_j. A merge function follows to squeeze and extract the important features from F_j^s, which are then fused to form the global feature vector V. During this step, the information related to the size of the raw point cloud is also incorporated. Specifically, the merge function is composed of a set of transformer blocks and a channel attention block combined to produce the final feature vector.
• Multi-task head for multi-parameter learning (Fig. 3c). After obtaining the final feature vector, this highly characterized information is input to the multi-task head. It is composed of a classifier and several parameter regressors. The output of the classifier is used to retrieve the CAD template model from the database,
Fig. 2 Proposed RE framework exploiting the pre-trained SMA-Net for fast identification and fitting of CAD models from point clouds
Fig. 3 Spatial Merge Attention Network (SMA-Net) highlighting three main components: a feature extractor extracting per-volume features F_j from different resolutions, b spatial merge attention mechanism aggregating multi-resolution features to a vector, c multi-task head predicting the class and its parameters
and the parameter regressors are used for estimating parameter values precisely.

The outputs of SMA-Net then serve as inputs of the CAD reconstruction step, which extracts the identified CAD template and instantiates it with the estimated parameter values to get the final CAD model fitting the point cloud (Fig. 2).

3.3 Voxel-based multi-resolution feature extractor

Generally, any point feature extraction network can serve as the feature extractor. On the other hand, grid-based methods have proven to be computationally efficient and are particularly suitable for image feature extraction due to their controllable receptive fields. Thanks to the emergence of sparse representation and sparse convolution technology [11, 25], these attributes have been well extended to the 3D field. Hence, the first component of SMA-Net exploits a voxel-based CNN for efficient feature encoding (Fig. 3a). In particular, the U-Net [46] architecture has been widely used in 2D and 3D domains, especially for semantic segmentation, because of its efficient encoder-decoder structure and hierarchical expression. Thus, SMA-Net follows a U-Net-like architecture with sparse representation and sparse convolution technology. Unlike previous methods [11, 23], which only use the information of the last layer of the U-Net to derive the final output, SMA-Net exploits different feature extraction layers containing different structure size information. Indeed, even if their final output benefits from previous layers through skip-connections, these methods unavoidably miss useful information present at lower resolutions, which undoubtedly reduces the quality of the results. Actually, in a deep neural network, the receptive field increases as the depth of the network increases, and the deeper the network, the more abstract the extracted features. In other words, the shallow layers contain rich local structure size information, while the deep layers contain more overall structure size information. For the parameter estimation task, extracted features can represent the detail information of the target, but they are also able to express more abstract overall structural information.

To this end, the input point set is voxelized into volumes, then a series of sparse convolution operations is performed with skip connections to extract features from different spatial resolutions (Fig. 3a). Thus, the geometric and contextual features of different spatial resolutions are extracted. The output is a set of N_l feature maps generated at different spatial resolutions and denoted as F_j ∈ ℝ^(C_j×L_j×W_j×H_j), with j ∈ [1 … N_l], j the index of features at different spatial resolutions, and C_j the number of feature channels at spatial resolution L_j × W_j × H_j.

3.4 Spatial merge attention mechanism

The attention mechanism has made great achievements in many visual perception tasks such as image caption generation, multi-object detection and semantic segmentation. More recently, the attention mechanism represented by the transformer has been widely used in the image field with very impressive results [15, 33, 38]. Thus, this part of SMA-Net consists of two submodules detailed in the next subsections (Fig. 3b): (1) voxel-based multi-resolution spatial attention (SA), (2) data merge (DM) consisting of two transformer blocks and a channel attention block. They have been specifically designed for enhancing the representational power of the network. The output of the spatial merge attention mechanism is the feature vector V used as input of the multi-task head (Fig. 3c).
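Before detailing the two attention submodules, the multi-resolution extractor of Sect. 3.3 can be made concrete with the following minimal PyTorch sketch, which collects one feature map per resolution from a stack of strided 3D convolutions. It is only an illustration: the paper relies on sparse convolutions over a sparse voxel grid (in the spirit of [11, 25]), whereas this stand-in uses dense nn.Conv3d layers, and all channel widths are assumptions.

```python
import torch
import torch.nn as nn

class MultiResolutionExtractor(nn.Module):
    """Dense stand-in for the sparse multi-resolution extractor (Fig. 3a)."""

    def __init__(self, in_channels=1, widths=(16, 32, 64, 128)):
        super().__init__()
        blocks, previous = [], in_channels
        for width in widths:
            blocks.append(nn.Sequential(
                nn.Conv3d(previous, width, kernel_size=3, stride=2, padding=1),
                nn.InstanceNorm3d(width),
                nn.ReLU(inplace=True)))
            previous = width
        self.blocks = nn.ModuleList(blocks)

    def forward(self, voxels):
        # voxels: (B, in_channels, L, W, H) occupancy grid of the point cloud
        features = []
        x = voxels
        for block in self.blocks:
            x = block(x)
            features.append(x)  # one feature map F_j per spatial resolution
        return features         # N_l maps F_1 ... F_{N_l}, coarser and coarser
```

Every intermediate map is kept and handed to the spatial merge attention described next, instead of only the last one.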
To further recalibrate spatial information, the SA submodule learns a set of spatial probability maps to weight the conv-layer feature maps of the CNN. Specifically, given a feature map F_j encoded by the hierarchical U-Net structure, SA sequentially infers a 3D spatial attention map M_j^s ∈ ℝ^(1×L_j×W_j×H_j) (Fig. 4a). Next, the recalibrated feature map F_j^s ∈ ℝ^(C_j×L_j×W_j×H_j) can be computed as follows:

M_j^s = 𝚽(F_j), with j ∈ [1 … N_l],
F_j^s = M_j^s ⊗ F_j,    (1)

where 𝚽(⋅) is the spatial attention function and ⊗ the element-wise multiplication. In SMA-Net, this function first aggregates spatial information of different feature maps by means of max-pooling and average-pooling operations. The outputs of those operations are then forwarded to a light sparse convolutional neural network. The process can be summarized as:

𝚽(F_j) = 𝝈[𝐒𝐂𝐨𝐧𝐯𝐬(A_j)],
A_j = 𝐑𝐞𝐋𝐔(𝐈𝐍(𝐒𝐂𝐨𝐧𝐯𝐬(𝐂𝐚𝐭(F_j^max, F_j^avg)))),    (2)

where 𝝈 denotes the sigmoid function, 𝐒𝐂𝐨𝐧𝐯𝐬 denotes sparse convolution, 𝐑𝐞𝐋𝐔 and 𝐈𝐍 refer to the Rectified Linear Unit activation function [22] and instance normalization [55] respectively, and 𝐂𝐚𝐭 is the concatenation operation.

The data merge consists of two transformer blocks and a channel attention block. Figure 5a illustrates the structure of a transformer block, which consists of layer normalization (𝐋𝐍) [4], multi-head self-attention (𝐌𝐒𝐀) [15, 56, 61] and a feed forward network (𝐅𝐅𝐍) that can be replaced by a MLP (Sect. 4.3.1). Each transformer block inputs a sequence, and outputs another sequence after a forward propagation [56], in which each sequence is composed of several tokens, where each token is represented by a vector of dimension ℝ^(C_t). Formally, the calculation of the output Z^l of a transformer block can be formulated as follows:

Z^l = 𝐅𝐅𝐍(𝐋𝐍(Ẑ^l)) + Ẑ^l,
Ẑ^l = 𝐌𝐒𝐀(𝐋𝐍(Z^(l−1))) + Z^(l−1),    (3)

where Z^(l−1) is the output of the previous block. Furthermore, 𝐌𝐒𝐀 stands for multi-head self-attention, which is the core of the transformer block and which can be formulated as follows (Fig. 5b):

𝐌𝐒𝐀(Z) = 𝐂𝐚𝐭(A_0, A_1, …, A_(h−1)) M^c,
A_i = 𝐒𝐨𝐟𝐭𝐦𝐚𝐱(Q_i K_i^T ∕ √(d_k^i)) V_i,    (4)
Q_i = Z M_i^q, K_i = Z M_i^k and V_i = Z M_i^v,

where M_i^q, M_i^k, M_i^v ∈ ℝ^((C_t∕h)×(C_t∕h)) and M^c ∈ ℝ^(C_t×C_t) are the weight matrices of the linear projection functions.
Fig. 4 Architecture of the attention mechanisms: a spatial attention submodule taking advantage of max-pooling and average-pooling operations forwarded to a 3D sparse CNN, b channel attention submodule combining outputs of max-pooling and average-pooling using a shared MLP network
Fig. 5 Overview of the transformer block: a architecture of the transformer block, b multi-head attention in the transformer block
The self-attention operation in the transformer block makes all tokens mutually interact, which in theory can yield a global receptive field. Thus, in the field of 2D images, the input image is usually divided into a number of fixed-size patches, and each patch is considered as a token, thereby converting the 2D image into a sequence [13, 15, 61].

Considering the identification and fitting tasks, a simple way to generate a token is to transform a feature map F_j into a token T_j. However, the number of tokens then depends on the depth of the U-Net pyramid structure, which is very limited. This inevitably causes a design contradiction: obtaining more tokens requires increasing the depth of the U-Net. Moreover, although this way can fuse features from different resolutions, considering the whole feature map of one resolution as a single token loses valuable details. To overcome this problem, a simple channel reshape strategy is used. Unlike [15, 61], tokens are not generated in the spatial domain. Instead, tokens are produced along the channel axis. L tokens are generated, and each token is represented by a C_t-dimensional vector.

In practice, the data merge module takes the outputs of the multi-resolution spatial attention F_j^s as input. Each of them is first squeezed into a vector along the spatial axes, so that the length of the vector is the number of feature channels, V_j^s ∈ ℝ^(C_j). Then, all V_j^s, with j ∈ [1 … N_l], are concatenated into one vector V_s. This process can be summarized as follows (Fig. 3):

V_s = 𝐂𝐚𝐭(V_1^s, V_2^s, …, V_(N_l)^s),
V_j^s = 𝐌𝐚𝐱𝐏𝐨𝐨𝐥(F_j^s), with j ∈ [1 … N_l].    (5)

Moreover, the original size information, such as the size d of the bounding box diagonal, and the maximum (x_max, y_max, z_max) and minimum (x_min, y_min, z_min) of the raw coordinate values P of the point cloud, provides vital cues for 3D scene perception and understanding. Thus, the size information G_r of the original point cloud is also provided (Fig. 3):

G_r = 𝐂𝐚𝐭(X_max, X_min, d) ∈ ℝ⁷,
with X_max = (x_max, y_max, z_max) = 𝐌𝐚𝐱(P, axis = 1),
     X_min = (x_min, y_min, z_min) = 𝐌𝐢𝐧(P, axis = 1),
     d = ‖X_max − X_min‖.    (6)

Then, V_s and G_r are further projected into a high-dimensional space and combined through a concatenation operation to obtain V_c:

V_c = 𝐂𝐚𝐭(V_s′, G_r′),
V_s′ = V_s M_s and G_r′ = G_r M_r,    (7)

where M_s ∈ ℝ^(len(V_s)×(L−1)⋅C_t) and M_r ∈ ℝ^(7×C_t) are weight matrices. Afterwards, V_c is divided into L groups, resulting in L tokens:

Tokens = 𝐑𝐞𝐬𝐡𝐚𝐩𝐞(V_c) ∈ ℝ^(L×C_t),    (8)

where L ⋅ C_t is equal to the length of V_c. In this way, tokens are defined more efficiently. More explanations are unfolded in the ablation study. Next, the transformer blocks take the generated tokens as input to further aggregate the information and learn a global context. Then, all tokens are recompressed into a vector V_t. Finally, given the merged feature vector V_t, a channel attention network is applied to refine the features and produce the final feature vector to be used for the multi-task head (Fig. 4b).
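The following sketch transcribes Eqs. (1), (2) and (5)-(8) in dense PyTorch, again as an illustration only: the paper's sparse convolutions are replaced by dense ones, the projection sizes are assumptions, and the channel attention refinement is omitted.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Eqs. (1)-(2): M_j^s = sigma(SConvs(ReLU(IN(SConvs(Cat(max, avg))))))."""

    def __init__(self, mid_channels=8):
        super().__init__()
        self.conv1 = nn.Conv3d(2, mid_channels, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm3d(mid_channels)
        self.conv2 = nn.Conv3d(mid_channels, 1, kernel_size=3, padding=1)

    def forward(self, f_j):  # f_j: (B, C_j, L_j, W_j, H_j)
        a_j = torch.cat([f_j.max(dim=1, keepdim=True).values,
                         f_j.mean(dim=1, keepdim=True)], dim=1)
        m_j = torch.sigmoid(self.conv2(torch.relu(self.norm(self.conv1(a_j)))))
        return m_j * f_j  # F_j^s, voxel-wise recalibration

def build_tokens(recalibrated, g_r, proj_s, proj_r, n_tokens, c_t):
    """Eqs. (5)-(8): squeeze each F_j^s, append size cues, reshape to tokens.
    proj_s and proj_r are nn.Linear layers playing the role of M_s and M_r."""
    v_s = torch.cat([f.amax(dim=(2, 3, 4)) for f in recalibrated], dim=1)  # (5)
    v_c = torch.cat([proj_s(v_s), proj_r(g_r)], dim=1)                     # (7)
    return v_c.view(v_c.shape[0], n_tokens, c_t)                           # (8)
```

With tokens of dimension C_t, the two transformer blocks of the data merge can then be approximated, for instance, with torch.nn.TransformerEncoderLayer(d_model=c_t, nhead=h, batch_first=True), whose normalization placement may differ in detail from Eq. (3).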
10 classes (4 chairs and 6 tables) distributed in 2 groups (the group of the chairs and the one of the tables). Each class c ∈ [1 … N_c] belongs to a group g ∈ [1 … N_g] and is characterized by N_p^c parameters p_i^c, with i ∈ [1 … N_p^c]. To be able to identify the class of a given object in a point cloud, and also find out its parameter values, the multi-task head module of SMA-Net is composed of (Fig. 3c):

• One classification head in which a multilayer perceptron (MLP) is applied to produce N_c classification scores {s_1^c, s_2^c, …, s_(N_c)^c} ∈ ℝ^(N_c) that are further regularized by a cross entropy loss function L_cls. This part of SMA-Net allows identifying the class of an object in a point cloud, i.e. the class with the maximum score.
• Several parameter regressors, one for each of the N_g groups of classes. Each regressor is able to estimate all the parameter values of all the classes belonging to the corresponding group. Thus, depending on the class of the object identified by the classification head, only the subset of parameter values corresponding to that class is to be considered, and all the other parameter values are disregarded. The regressions are also realized by a MLP, followed by a parameter residual loss for guiding the parameter estimation learning. For a parameter p_i^c, the absolute deviation 𝛿p_i^c and the relative deviation Δp_i^c between the estimated value p̃_i^c and the ground truth value p̄_i^c can be computed as follows:

Δp_i^c = 𝛿p_i^c ∕ p̄_i^c = (p̃_i^c − p̄_i^c) ∕ p̄_i^c, ∀i ∈ [1 … N_p^c].    (10)

with 𝜆_g the hyper-parameters controlling the balance between the different loss functions. Finally, even though SMA-Net is a unique identification and fitting framework, Euclidean distances between the point cloud and the CAD model are not directly minimized during the learning, and the fitting is carried out at the parameter level. Those distances are however used at the end to check the quality of the fitting.

4 Experimental setup

This section introduces the virtual data generation pipeline designed to set up the databases and validate SMA-Net on both identification and estimation tasks. It also details the technicalities and settings of the different components of SMA-Net, including the implementation details on how two known networks have been modified to allow a comparison with SMA-Net.

4.1 Virtual data generation pipeline

To evaluate and test SMA-Net on both identification and parameter estimation tasks, annotated 3D datasets are to be defined. Montlahuc et al. [41] have introduced an as-scanned point cloud generation strategy able to produce point clouds of CAD assembly models, similar to the ones that would have been obtained with a real scanner, i.e. including artifacts resulting from real acquisitions (e.g. noise, heterogeneous density, incompleteness).
However, their method produces point clouds without associating both the classes of the parts (e.g., tables and chairs) and the numerical parameter values of the CAD models, which are to be used for training a network. Here, the idea is to propose a virtual data generation pipeline, able to generate as-scanned point clouds, labeled for learning on both identification and parameter estimation tasks. The complete pipeline is presented in Fig. 6. For a given CAD template, it allows the creation of multiple CAD model instances as well as multiple as-scanned point clouds. The process starts by randomly generating parameter values used to generate the instances with a CAD modeler, i.e. the open-source parametric modeler FreeCAD in this work. The output of the CAD modeler is input to the point cloud virtual generation algorithm. If the virtual scan flag is set, a HPR-based algorithm is used for generating as-scanned point clouds [32]. At the beginning, the origin and view direction of each virtual camera are randomly defined. After that, a post-processing step is performed to insert noise and to sub-sample the point clouds. At the end of each loop, the information is collected and stored: generated CAD model instance and mesh, as-scanned point cloud, label and parameter values.

Fig. 6 Virtual data generation pipeline that uses a labelled CAD template to generate many point cloud instances with their parameter values

4.2 Robot and furniture datasets

Two datasets have been designed to validate SMA-Net: a robot dataset and a furniture dataset. Generally, from the perspective of machine learning, angle-type parameters are more difficult to understand than general distance-type parameters. Thus, the considered robot template only includes angle-type parameters. It is used to verify the angle perception ability of SMA-Net, and to study its application for assembly fitting. The furniture dataset contains some common pieces of furniture, i.e. chairs and tables, and the CAD templates include both angle-type and distance-type parameters. This second database is used to test the general performance of the proposed approach. Moreover, due to the presence of occlusion and noise in real scenes, as-scanned point clouds are generated for both datasets. Thus, it is possible to evaluate SMA-Net on point clouds obtained in real-world conditions, and not only using virtually-generated data. Importantly, SMA-Net has been trained and tested on the two datasets separately, and the datasets (labeled model instances, point clouds and parameter values) are made publicly available at the following URL: https://doi.org/10.5281/zenodo.5031617.

Robot dataset This dataset is composed of 5000 point clouds generated from a robot template (Fig. 7a). The pose of the robot is controlled by 4 angles A1-A4, whose values have been randomly changed in the range of [0, 90]. For each pose, an instance of the CAD template is generated, together with its mesh and as-scanned point cloud. Figures 7b1 and b2 show meshes corresponding to two different poses, 7c1 and c2 are the virtually-generated point clouds, and 7d1 and d2 are the as-scanned counterparts.

Fig. 7 Generation of as-scanned point clouds for the robot dataset: a robot template and its 4 control angles, b_i updated meshes of two different poses, c_i virtually-generated point clouds for the two poses, d_i as-scanned counterparts for the two poses
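A compact sketch of one iteration of the pipeline of Fig. 6 is given below. It is a simplified stand-in: update_template is a toy surrogate for the FreeCAD update of the parametric template, the visibility step is only a placeholder for the HPR operator of [32] (available, for instance, as hidden_point_removal in Open3D), and the ranges, point counts and noise levels are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_template(params):
    """Toy surrogate for the CAD modeler: points spread over a box whose
    dimensions play the role of the template's distance-type parameters."""
    size = np.array([params["length"], params["width"], params["high"]])
    return rng.uniform(-0.5, 0.5, size=(20_000, 3)) * size

def virtual_scan(points, n_cameras=4):
    """Placeholder visibility filter standing in for the HPR algorithm [32]:
    keeps the points closest to each randomly placed virtual camera."""
    center = points.mean(axis=0)
    radius = float(np.linalg.norm(np.ptp(points, axis=0)))
    visible = np.zeros(len(points), dtype=bool)
    for _ in range(n_cameras):
        camera = center + radius * rng.normal(size=3)
        distances = np.linalg.norm(points - camera, axis=1)
        visible |= distances < np.quantile(distances, 0.6)
    return points[visible]

def make_sample(ranges, n_points=10_000, noise=0.002):
    """One loop of Fig. 6: random parameters -> instance -> as-scanned cloud."""
    params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
    points = virtual_scan(update_template(params))
    points = points + rng.normal(scale=noise, size=points.shape)  # sensor noise
    keep = rng.choice(len(points), size=min(n_points, len(points)), replace=False)
    return points[keep], params  # labeled training sample

cloud, labels = make_sample({"length": (0.8, 1.6), "width": (0.6, 1.0), "high": (0.6, 0.9)})
```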
Fig. 9 Examples of CAD templates and related parameters: a Chair_3 template and its parameters, b1 Table_1 template and its parameters (b2, b3)
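Complementing the implementation details that follow, a hedged PyTorch sketch of the multi-task head of Fig. 3c under the settings of this section (a 2-hidden-layer MLP classifier and one 4-hidden-layer MLP regressor per group) is given below; the hidden widths are assumptions, and batch normalization is simply omitted where the paper removes it.

```python
import torch
import torch.nn as nn

def mlp(dims):
    """Plain MLP with ReLU between layers and a linear output layer."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class MultiTaskHead(nn.Module):
    """One classifier over N_c classes plus one regressor per group (Fig. 3c)."""

    def __init__(self, feature_dim, n_classes, group_param_counts):
        super().__init__()
        self.classifier = mlp([feature_dim, 256, 128, n_classes])
        self.regressors = nn.ModuleList(
            mlp([feature_dim, 512, 256, 128, 64, n]) for n in group_param_counts)

    def forward(self, v):
        scores = self.classifier(v)                    # classification scores
        params = [reg(v) for reg in self.regressors]   # one vector per group
        return scores, params
```

At inference, only the regressor of the group containing the predicted class is read out, and within it only the parameters of that class, as described in Sect. 3.2.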
tion. Moreover, 𝐅𝐅𝐍 has been implemented by a two-layer MLP, where the dimension of the features in the hidden layer is equal to 4 times the dimension of the input features.

• For the multi-task head, one head is used for classification, the others regress the parameters of the different groups. A MLP is used in both cases. Specifically, the MLP in the classifier has 2 hidden layers. Regressors have more parameters to predict and more abstract prediction goals, which is more difficult. Thus, 4 hidden layers are deployed in the regressors. Empirically, one can observe that batch normalization largely hinders the regressor's learning ability, so the batch normalization of the last two layers has been removed.

During the training, data augmentation is applied on-the-fly by flipping with 50% probability, randomly rotating the objects along the up-axis, and jittering data with Gaussian noise. This is needed to be able to identify objects in any configuration when considering real-world examples. To strengthen the identification and estimation capacities of SMA-Net in various conditions, virtually-generated point clouds are also merged with as-scanned ones. Specifically, during the training, 40% of the virtually-generated data are randomly selected and replaced by the as-scanned data. For testing, only as-scanned point clouds are used to validate the approach on real-life data.

The whole SMA-Net is trained in an end-to-end manner while minimizing the total loss function of Eq. (12) with hyper-parameters 𝜆_g = 1.0. For the minimization, the Adam optimizer is adopted with an initial learning rate of 1 × 10⁻³, the learning rate is decayed by a factor of 0.7 every 20 epochs, and training runs for 200 epochs. All experiments have been performed on a laptop with a NVIDIA RTX2080 GPU and an Intel(R) Core(R) i7-9750 CPU.

For Pointnet2†, the implementation of [44] has been used as a starting point. In order to make it suitable for both classification and estimation tasks, the classification head of the original implementation has been replaced by the multi-heads part proposed in this paper. For training, the original input has been uniformly sampled with 4096 points. For Resnet†, the 3D-Resnet [11] based on the MinkowskiEngine sparse convolutional library has been re-implemented. More concretely, sparse tensors and sparse operations replace the dense tensors and related operations of the original Resnet. Similarly to Pointnet2†, the multi-heads part of SMA-Net replaces the original classification head of Resnet†. In addition, the structure of Res-34 has been used in our experiments. Finally, the same training strategy as for SMA-Net has been adopted to train both Pointnet2† and Resnet†. In this paper, † thus indicates that the multi-head is used instead of the original classification head to make the model compatible with the fitting task.

5 Results and discussion

The proposed approach has been validated on two up-to-date challenging scenarios. The first scenario illustrates how SMA-Net can be used to follow the evolutions of a real robot and update its digital twin for Industry 4.0 applications. The second scenario shows how SMA-Net can identify objects in a real environment, and estimate their parameter values to allow fast reconstruction of 3D environments for BIM applications.

5.1 Evaluation criteria

Several criteria are here adopted to evaluate the performances of SMA-Net on both identification and parameter estimation tasks.
For the identification task, the adopted criterion is the one of [8]. For the parameter estimation task, the accuracy of the estimations can be assessed in an absolute but also in a relative manner. Therefore, for a given parameter p_i^c related to a class c, the averages of the absolute and relative errors on a test dataset can be computed as follows:

E_r^avg(c, i) = (1 ∕ N_test^c) ∑_(k=1)^(N_test^c) ‖Δp_i^c(k)‖,
E_a^avg(c, i) = (1 ∕ N_test^c) ∑_(k=1)^(N_test^c) ‖𝛿p_i^c(k)‖,    (13)

where N_test^c refers to the number of samples of class c in the test dataset, and Δp_i^c(k) and 𝛿p_i^c(k) represent respectively the relative and absolute errors of parameter p_i^c on the k-th element of class c in the test dataset. Errors are computed using Eq. (10). If angle-type and distance-type parameters appear at the same time when computing the average value, by default all parameter values are directly added to calculate the average value. Further, the mean average values (E_r^m and E_a^m) are used to describe the overall errors, which are defined as follows:

E_r^m = (1 ∕ N_c) ∑_(c=1)^(N_c) (1 ∕ N_p^c) ∑_(i=1)^(N_p^c) E_r^avg(c, i),
E_a^m = (1 ∕ N_c) ∑_(c=1)^(N_c) (1 ∕ N_p^c) ∑_(i=1)^(N_p^c) E_a^avg(c, i).    (14)

of the test dataset, average distance, max distance, and standard deviation are provided. Finally, the efficiency of SMA-Net is evaluated in terms of the time required to perform both the identification and estimation tasks. Classically, the efficiency is expressed in Hertz (Hz) to show how many samples the network treats every second.

5.2 Quantitative analysis

Figure 10 shows the evolution of the relative and absolute errors during the training on the robot dataset, and Table 1 gathers the results of the evaluation of Pointnet2†, Resnet† and SMA-Net on the test dataset composed of as-scanned point clouds. Here, there is no identification task as the dataset is composed of a single class, i.e. the four-axis robot. Compared with Pointnet2† and Resnet†, SMA-Net achieves the best results for the estimation of the four angles. From the results, axis-1 (A1) and axis-2 (A2) have smaller absolute errors, around 0.7° and 0.6° respectively. The absolute error of axis-4 (A4) is larger, around 3.1°. These results are quite intuitive: the spatial positions of axis-3 and axis-4 strongly depend on the positions of axis-1 and axis-2. SMA-Net thus gets more information from the changes of axis-1 and axis-2, whose behavior is therefore better approximated. In terms of efficiency, the training takes

Fig. 10 Training curves with relative errors E_r^avg (dotted line) and absolute errors E_a^avg (solid line) for the robot dataset and its four angle values to estimate: result curves on the training set (top), result curves on the validation set (bottom)
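The error measures of Eqs. (10), (13) and (14) translate directly into a few lines of NumPy, sketched below for reference (the array layouts are assumptions):

```python
import numpy as np

def mean_errors(estimated, ground_truth):
    """Eqs. (10), (13), (14). Both inputs map class -> (N_test_c, N_p_c) array;
    returns the overall mean relative and absolute errors E_r^m, E_a^m."""
    er_means, ea_means = [], []
    for c in estimated:
        delta = estimated[c] - ground_truth[c]                 # Eq. (10), absolute
        ea_avg = np.abs(delta).mean(axis=0)                    # E_a^avg(c, i)
        er_avg = np.abs(delta / ground_truth[c]).mean(axis=0)  # E_r^avg(c, i)
        er_means.append(er_avg.mean())                         # inner sum of Eq. (14)
        ea_means.append(ea_avg.mean())
    return float(np.mean(er_means)), float(np.mean(ea_means))
```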
following the novel virtual data generation pipeline. Given an input point cloud, SMA-Net identifies the label of the object (e.g. Robot and Chair_3 in those examples) and estimates the parameter values in one forward propagation. Those values are then forwarded to the CAD modeler that updates the corresponding CAD model and creates the mesh. Then, it is possible to compute distances between the point cloud and the mesh to display a color map. With quite low mean distances when compared to the size of the objects, those results clearly show the capacity of SMA-Net to identify and fit an object in an incomplete point cloud.

The proposed identification and fitting technique has also been validated on real scanned data. Figure 13 shows the result of a fitting on a point cloud acquired with a ROMER Absolute Arm 7520 SI laser scanner and sub-sampled to 10,000 points. Here, an ICP [21] algorithm aligns the input point cloud and the mesh resulting from the update of the CAD template with the parameter values obtained from SMA-Net. Results are good, even though a bit below the ones obtained on as-scanned point clouds. This is because the adopted CAD template does not fully match the scanned chair. Moreover, chairs are slightly deformable objects, which may also explain possible deviations between a real-world point cloud and a perfect CAD template.

Figure 14a, b illustrate how the proposed RE framework can be used for real-time reconstruction of CAD models within a virtual scene. From left to right, the raw point cloud is first acquired using a Kinect V1, and then transferred to a pre-trained semantic segmentation model [23] to label points and extract objects of interest by clustering [17]. Then, for each segmented point cloud, SMA-Net is used to identify the underlying object and estimate its parameters in one forward propagation. Finally, using the outputs of SMA-Net, CAD models can be reconstructed and located in the virtual scene using ICP. Here, measured deviations can be ascribed to the fact that the underlying CAD templates are simplified ones, which cannot capture the full complexity of the acquired objects, and more parameters would be necessary.
Fig. 11 Robot fitting results on five different poses of the test dataset: as-scanned point clouds (colored points where red indicates larger deviations) used as inputs of SMA-Net, and meshes (gray) generated from the updated CAD models using outputs of SMA-Net
Fig. 14 Complete reconstruction of virtual environments from point clouds captured with Kinect V1. The whole scene is first segmented by a deep segmentation network, and a clustering algorithm is used to group points. Then, for each point cloud, SMA-Net identifies the label of the underlying object and estimates its parameters in one forward propagation. Finally, CAD models are reconstructed using the output values of SMA-Net, and they are located in the 3D scene using ICP
This need is also true when considering possible object deformations due to aging, which would require additional modeling parameters. Finally, outliers and artifacts due to the acquisition with Kinect also have an impact on the results and create a larger variance in some places.

5.4 Ablation study

The architecture of SMA-Net has been designed around three components, namely the multi-resolution feature extractor, the spatial merge attention module and the multi-head module, all of which have a vital impact on the performances. This ablation study evaluates how each component affects the average of the relative errors.
For fast evaluation, a sub-dataset of the furniture dataset is exploited, which consists of six classes, namely Chair_1, Chair_2, Chair_3, Table_4, Table_5, and Table_6. The configuration of the hyperparameters is the same as for the above experiments. Table 4 shows the experimental results. The five leftmost columns are used to configure the experiments. Indeed, the second column represents the possibility to combine or not the information from multiple resolutions (MR) during the feature extraction stage. The third one indicates the adoption or not of the spatial attention (SA) mechanism. The fourth column tells whether the data merge module (DM) is applied or not. The fifth column characterizes whether multiple heads (MH) are used or not to regress the N_g groups. To highlight the contribution of each component, seven experiments were conducted.

The first experiment can be considered as a baseline. Specifically, as each component is set to 'No', only the output of the last layer of the feature extractor is used as input of the next part, the spatial attention and merge modules are removed, and for the final parameter prediction, only one regressor is integrated. The second experiment clearly shows that multi-heads can help to decouple features between different groups to a certain extent. Then, for the third experiment, the MR feature extractor is activated to integrate features from different scales. The result shows that rich multi-scale features help improve the performance of SMA-Net. Instead of getting better results, deploying the SA sub-module alone slightly hurts the performance of the model. This is because the features after calibration by SA are too abstract to be parsed by a simple network. This reflects, from the other side, the effectiveness and importance of the transformer-based DM sub-module. Finally, in the sixth experiment, all four modules are activated, and the best results are obtained.

In addition, for the last experiment, tokens are generated by transforming a feature map F_j into a token T_j, as detailed in Sect. 3.4. From the results, it is clear that the proposed token generation process is more efficient than simply transforming a feature map F_j into a token T_j.

As a conclusion, the extraction of features from multiple resolutions can get more useful information than just collecting features from only one layer. Moreover, introducing the spatial merge attention module can help SMA-Net integrate more meaningful information and have a larger receptive field during the feature merge stage. In addition, the multi-head makes it possible to deal with different target groups. Moreover, it can also reduce the coupling between different groups, and at the same time it helps promoting a mutual boost between the classes which belong to the same group.

6 Conclusion and future work

This paper introduces a novel reverse engineering framework able to identify and fit CAD models from incomplete point clouds. The core of the approach is SMA-Net, a fast bottom-up 3D classification and fitting framework able to identify objects and estimate the parameter values in an end-to-end manner. The network is trained using point clouds generated following a virtual data generation pipeline able to create many as-scanned point clouds.

Differently from the traditional reverse engineering techniques working at the level of the patches and facing numerous trimming and stitching issues, the proposed approach carries out an end-to-end identification and parameter estimation in one forward propagation, and it achieves very good results on both virtually-generated datasets and real-life acquisitions. The geometric modeling operations are left to the CAD modeler in charge of generating CAD models from CAD templates using the output values of SMA-Net. Considering possibly incomplete point clouds, resulting from the occlusion phenomenon, the results have shown that SMA-Net is able to rely on other features to infer reasonable parameter values for the partially visible parts. In addition, future works should focus on integrating all the modules shown in Fig. 2 into a unified reverse engineering framework to realize end-to-end reasoning from the entire scene to the CAD model. Needs also concern the possibility to automatically adjust CAD templates to take into account local deformations and thus be able to model structural ageing.
Moreover, two application scenarios have been presented with promising results to serve in the context of Industry 4.0. The one on the robot is particularly adapted to maintain the coherence of the digital twin and track the evolution of the physical twin in time. The second one involves the complete reconstruction of a virtual environment from a Kinect acquisition. In both cases, SMA-Net demonstrated good accuracy and very high speed in identification and estimation tasks, and its performances are compatible with Industry 4.0 real-time simulation requirements. Both datasets are made publicly available at the following URL: https://doi.org/10.5281/zenodo.5031617. Last but not least, the results obtained from our method can also be considered as a group of initial parameters for a next fine-tuning step based on more traditional methods. At the same time, a good initialization will allow such an algorithm to converge quickly to a local minimum and thus gain higher accuracy, which is of great research interest and practical value.

Appendix 1

See Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16.

Table 5 Parameter name and index relationship of the chairs in the furniture dataset

Class c   Name               Parameter p_i
Chair_1   p_seat             p1
          u_shape_dia        p2
          u_loc              p3
          leg_loc            p4
          leg_high           p5
Chair_2   sideRails_length   p6
          seat_length        p7
          width              p8
          backlegs_high      p9
          frontLegs_high     p10
Chair_3   length             p11
          width              p12
          leg_high           p13
          total_high         p14
          back_rails_high    p15
          back_leg_angle     p16
Chair_4   width_seat         p17
          fillet_seat        p18
          rail_seat          p19
          high               p20
          u_len              p21
          u_dis              p22

Table 6 Parameter name and index relationship of the tables in the furniture dataset

Class c   Name           Parameter p_i
Table_1   length         p23
          width          p24
          back_railgap   p25
          side_railgap   p26
          rail_high      p27
          leg_radius_1   p28
          leg_radius_2   p29
          leg_high       p30
Table_2   upLength       p31
          upWidth        p32
          downLength     p33
          downWidth      p34
          legHigh        p35
          legDis         p36
          legDia         p37
Table_3   upLength       p38
          downLength     p39
          legHigh        p40
          legDia         p41
Table_4   high           p42
          dia            p43
          leg_dis        p44
          support_dis    p45
          u_len          p46
          u_dia          p47
          u_loc          p48
Table_5   width          p49
          length         p50
          high           p51
          radius         p52
Table_6   high           p53
          dia            p54
          leg_angle      p55
          support_dis    p56
Table 13 Result of Table_3: parameter p_i and its E_r^avg(c, i) and E_a^avg(c, i) values in the furniture test dataset (N_test^Furniture = 3000), with the name of each parameter p_i provided in Table 6

Model        p38               p39               p40               p41
             E_r^avg  E_a^avg  E_r^avg  E_a^avg  E_r^avg  E_a^avg  E_r^avg  E_a^avg
Pointnet2†   0.014    12.90    0.019    10.16    0.012    6.15     0.047    6.03
ResNet†      0.008    6.88     0.010    4.96     0.009    4.72     0.016    2.25
SMA-Net      0.005    4.56     0.009    4.57     0.004    1.79     0.017    2.34
Table 15 Result of Table_5: parameter p_i and its E_r^avg(c, i) and E_a^avg(c, i) values in the furniture test dataset (N_test^Furniture = 3000), with the name of each parameter p_i provided in Table 6

Model        p49               p50               p51               p52
             E_r^avg  E_a^avg  E_r^avg  E_a^avg  E_r^avg  E_a^avg  E_r^avg  E_a^avg
Pointnet2†   0.024    19.63    0.014    15.69    0.010    7.87     0.323    71.72
ResNet†      0.011    8.65     0.010    12.63    0.007    5.94     0.035    6.30
SMA-Net      0.009    7.14     0.008    9.28     0.003    2.12     0.029    5.29
Table 16 Result of Table_6: parameter p_i and its E_r^avg(c, i) and E_a^avg(c, i) values in the furniture test dataset (N_test^Furniture = 3000), with the name of each parameter p_i provided in Table 6

Model        p53               p54               p55               p56
             E_r^avg  E_a^avg  E_r^avg  E_a^avg  E_r^avg  E_a^avg  E_r^avg  E_a^avg
Pointnet2†   0.011    8.50     0.017    14.50    0.028    2.59     0.083    18.06
ResNet†      0.010    8.19     0.011    9.69     0.016    1.50     0.026    5.83
SMA-Net      0.003    2.68     0.011    9.08     0.014    1.32     0.025    5.35
7. Buonamici F, Carfagni M, Furferi R, Governi L, Lapini A, Volpe Y (2018) Reverse engineering of mechanical parts: a template-based approach. J Comput Des Eng 5(2):145–159
8. Charles RQ, Su H, Kaichun M, Guibas LJ (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), pp 77–85
9. Chen X, Ma H, Wan J, Li B, Xia T (2017) Multi-view 3D object detection network for autonomous driving. In: 2017 IEEE Conference on computer vision and pattern recognition (CVPR), pp 6526–6534
10. Choy C, Dong W, Koltun V (2020) Deep global registration. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 2514–2523
11. Choy C, Gwak J, Savarese S (2019) 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 3070–3079
12. Choy C, Park J, Koltun V (2019) Fully convolutional geometric features. In: Proceedings of the IEEE/CVF International Conference on computer vision (ICCV), pp 8958–8966
13. Chu X, Tian Z, Wang Y, Zhang B, Ren H, Wei X, Xia H, Shen C (2021) Twins: revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840
14. Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Nießner M (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: IEEE Conference on computer vision and pattern recognition, pp 2432–2443
15. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
16. Erdös G, Nakano T, Váncza J (2014) Adapting CAD models of complex engineering objects to measured point cloud data. CIRP Ann 63(1):157–160
17. Ester M, Kriegel H-P, Sander J, Xu X (1996) Density-based spatial clustering of applications with noise. In: Int. Conf. on knowledge discovery and data mining, vol 240, p 6
18. Fayolle P-A, Pasko A (2015) User-assisted reverse modeling with evolutionary algorithms. In: IEEE Congress on evolutionary computation, pp 2176–2183. https://doi.org/10.1109/CEC.2015.7257153
19. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24(6):381–395
20. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on computer vision and pattern recognition, pp 3354–3361
21. Gelfand N, Mitra NJ, Guibas LJ, Pottmann H (2005) Robust global registration. In: Symposium on geometry processing, vol 2, no 3, p 5
22. Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, pp 315–323
23. Graham B, Engelcke M, Maaten LVD (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp 9224–9232
24. Graham B, Engelcke M, Van Der Maaten L (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition (CVPR), pp 9224–9232
25. Graham B, van der Maaten L (2017) Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307
26. Guo R, Zou C, Hoiem D (2015) Predicting complete 3D models of indoor scenes. arXiv preprint arXiv:1504.02437
27. Gupta S, Arbeláez P, Girshick R, Malik J (2015) Aligning 3D models to RGB-D images of cluttered scenes. In: 2015 IEEE Conference on computer vision and pattern recognition (CVPR), pp 4731–4740
28. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 770–778
29. Hu Q et al (2021) Learning semantic segmentation of large-scale point clouds with random sampling. In: IEEE Transactions on pattern analysis and machine intelligence. https://doi.org/10.1109/TPAMI.2021.3083288
30. Ishimtsev V, Bokhovkin A, Artemov A, Ignatyev S, Niessner M, Zorin D, Burnaev E (2020) CAD-Deform: deformable fitting of CAD models to 3D scans. arXiv preprint arXiv:2007.11965
31. Kang Z, Li Z (2015) Primitive fitting based on the efficient multiBaySAC algorithm. PLoS One 10(3):e0117341
32. Katz S, Tal A (2015) On the visibility of point clouds. In: 2015 IEEE International Conference on computer vision (ICCV), pp 1350–1358
33. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169
34. Kim H, Yeo C, Lee ID, Mun D (2020) Deep-learning-based retrieval of piping component catalogs for plant 3D CAD model reconstruction. Comput Ind 123:103320
35. Kundu A, Yin X, Fathi A, Ross D, Brewington B, Funkhouser T, Pantofaru C (2020) Virtual multi-view fusion for 3D semantic segmentation. In: European Conference on computer vision. Springer, pp 518–535
36. Li D, Shen X, Yu Y, Guan H, Wang H, Li D (2020) GGM-Net: graph geometric moments convolution neural network for point cloud shape classification. IEEE Access 8:124989–124998
37. Li Y, Wu X, Chrysanthou Y, Sharf A, Cohen-Or D, Mitra NJ (2011) GlobFit: consistently fitting primitives by discovering global relations. ACM Trans Graph 30(4):52:1–52:12
38. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030
39. Lu Y (2017) Industry 4.0: a survey on technologies, applications and open research issues. J Ind Inf Integr 6:1–10
40. Mo K, Zhu S, Chang AX, Yi L, Tripathi S, Guibas LJ, Su H (2019) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: 2019 IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 909–918
41. Montlahuc J, Shah GA, Polette A, Pernot J-P (2019) As-scanned point clouds generation for virtual reverse engineering of CAD assembly models. Comput-Aided Des Appl 16(6):1171–1182
42. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32:8026–8037
43. Qi CR, Su H, Nießner M, Dai A, Yan M, Guibas LJ (2016) Volumetric and multi-view CNNs for object classification on 3D data. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 5648–5656
44. Qi CR, Yi L, Su H, Guibas LJ (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Proc. of the 31st Int. Conf. on neural information processing systems, NIPS'17, pp 5105–5114
45. Robertson C, Fisher RB, Werghi N, Ashbrook AP (2000) Fitting of constrained feature models to poor 3D data. In: Parmee IC (ed) Evolutionary design and manufacture. Springer, London, pp 149–160
46. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention—MICCAI 2015. Springer International Publishing, Cham, pp 234–241
47. Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, Lin C-T (2017) A review of clustering techniques and developments. Neurocomputing 267:664–681
48. Schnabel R, Wahl R, Klein R (2007) Efficient RANSAC for point-cloud shape detection. Comput Graphics Forum 26(2):214–226
49. Sener O, Koltun V (2018) Multi-task learning as multi-objective optimization. In: Advances in neural information processing systems, vol 31, pp 525–536
50. Shah GA, Polette A, Pernot JP, Giannini F, Monti M (2021) Simulated annealing-based fitting of CAD models to point clouds of mechanical parts' assemblies. Eng Comput 37(4):2891–2909
51. Shi W, Rajkumar R (2020) Point-GNN: graph neural network for 3D object detection in a point cloud. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 1711–1719
52. Su H, Maji S, Kalogerakis E, Learned-Miller E (2015) Multi-view convolutional neural networks for 3D shape recognition. In: Proceedings of the IEEE International Conference on computer vision, pp 945–953
53. Tang H, Liu Z, Zhao S, Lin Y, Lin J, Wang H, Han S (2020) Searching efficient 3D architectures with sparse point-voxel convolution. In: European Conference on computer vision (ECCV), pp 685–702
54. Thomas H, Qi CR, Deschaud J, Marcotegui B, Goulette F, Guibas L (2019) KPConv: flexible and deformable convolution for point clouds. In: IEEE Int. Conf. on computer vision (ICCV), pp 6410–6419
55. Ulyanov D, Vedaldi A, Lempitsky V (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022
56. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30, pp 5998–6008
57. Wang C, Samari B, Siddiqi K (2018) Local spectral graph convolution for point set feature learning. In: Proceedings of the European Conference on computer vision (ECCV), pp 52–66
58. Wang L, Huang Y, Hou Y, Zhang S, Shan J (2019) Graph attention convolution for point cloud semantic segmentation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 10296–10305
59. Wang S, Suo S, Ma W-C, Pokrovsky A, Urtasun R (2018) Deep parametric continuous convolutional neural networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 2589–2597
60. Willis KD, Pu Y, Luo J, Chu H, Du T, Lambourne JG, Solar-Lezama A, Matusik W (2021) Fusion 360 gallery: a dataset and environment for programmatic CAD construction from human design sequences. ACM Trans Graph (TOG) 40(4):1–24
61. Wu S, Wu T, Lin F, Tian S, Guo G (2021) Fully transformer networks for semantic image segmentation. arXiv preprint arXiv:2106.04108
62. Wu W, Qi Z, Fuxin L (2019) PointConv: deep convolutional networks on 3D point clouds. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 9621–9630
63. Xie Y, Tian J, Zhu XX (2020) Linking points with labels in 3D: a review of point cloud semantic segmentation. IEEE Geosci Remote Sens Mag 8(4):38–59
64. Xu Y, Fan T, Xu M, Zeng L, Qiao Y (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In: Proceedings of the European Conference on computer vision (ECCV), pp 87–102
65. Xu Y, Fan T, Xu M, Zeng L, Qiao Y (2018) SpiderCNN: deep learning on point sets with parameterized convolutional filters. In: Proceedings of the European Conference on computer vision (ECCV), pp 87–102
66. Yi L, Kim VG, Ceylan D, Shen IC, Yan M, Su H et al (2016) A scalable active framework for region annotation in 3D shape collections. ACM Trans Graph (ToG) 35(6):1–12
67. Zhao H, Jiang L, Fu C-W, Jia J (2019) PointWeb: enhancing local neighborhood features for point cloud processing. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 5565–5573
68. Zhu B, Jiang Z, Zhou X, Li Z, Yu G (2019) Class-balanced grouping and sampling for point cloud 3D object detection. arXiv preprint arXiv:1908.09492

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.