THISNet: Tooth Instance Segmentation on 3D Dental Models via Highlighting Tooth Regions

Abstract— Automatic tooth instance segmentation on 3D dental models is crucial for digitizing dental treatments and enabling computer-assisted treatment planning. However, it is challenging due to the tight arrangement of dental structures and the consequential impact of dental ailments on their morphological characteristics. To address these challenges, we propose a novel method called THISNet. Unlike existing methods, THISNet focuses on highlighting tooth regions rather than relying on bounding box detection, leading to improved accuracy in tooth segmentation and labeling. By incorporating the highlighted tooth regions with a tooth object affinity module, our method effectively integrates global contextual information, considering the relationships between neighboring teeth and their surrounding structures. THISNet adopts an end-to-end learning approach, reducing complexity and enhancing segmentation efficiency compared to multi-stage training methods. Experimental results demonstrate the superiority of THISNet over existing approaches, highlighting its potential in various dental clinical applications.

Index Terms— Tooth segmentation, highlighting, object affinity, 3D dental models.

I. INTRODUCTION

INSTANCE segmentation [1], [2], [3] is one of the most important tasks in computer vision. It detects and delineates the instances appearing in an image and is widely applicable in medical image analysis [4], [5]. In clinical treatment, dentists and oral healthcare professionals routinely use various imaging modalities, including panoramic X-ray radiography, cone-beam computed tomography, and 3D dental models. Automatic tooth instance segmentation on 3D dental models is essential for digitizing dental treatments and enabling computer-assisted treatment planning. Accurately identifying and segmenting individual teeth allows for precise manipulation of teeth in the dental model, resulting in effective treatment strategies. Orthodontic treatments, such as braces and invisible aligners, require detailed tooth information in dental models to develop treatment plans and monitor progress, and dental restoration work, such as crowns, bridges, and dental implants, requires accurate tooth segmentation results. However, compared with general 3D instance segmentation [6], [7], [8], tooth instance segmentation faces several specific challenges. Teeth are tightly arranged, making it difficult to determine the boundaries of each tooth. Additionally, tooth wear, dental caries, and other oral diseases can significantly alter the shape of teeth, further increasing the complexity of classification and segmentation tasks. These challenges emphasize the intricate nature of tooth instance segmentation, thereby highlighting the necessity of designing powerful algorithms.
A variety of methods have been proposed for tooth instance segmentation [9], [10], [11], [12], [13]. These methods can be divided into two broad categories: grouping-based methods and detection-based methods. Fig. 1 shows the differences between these two kinds of methods. Grouping-based methods [12], [14] group points in 3D dental models using semantic segmentation and clustering techniques. Each group is assigned a unique identifier to enable the segmentation and labeling of different tooth instances. These methods typically group points in 3D dental models based on their positions and semantic similarities. For densely arranged or tightly contacting teeth, grouping-based methods may struggle to accurately separate neighboring instances due to their similar semantics, resulting in inaccurate segmentation outcomes.

In contrast, detection-based methods [9], [10], [11] offer greater intuitiveness. They first utilize an object detection network to identify distinct tooth proposals in the 3D dental model. Subsequently, the points within the detected bounding boxes are directly segmented to achieve instance-level segmentation and labeling for each tooth instance. Nevertheless,
with abnormal tooth arrangements. HDWE [29] developed a novel hierarchical deep word embedding for fine-grained recognition. MeshSegNet [27] presented a tooth segmentation method that leveraged both points and triangle mesh cells to learn multi-scale contextual and global features from raw dental models. TSGCNet [28] introduced a two-stream graph convolutional network that utilized coordinate and normal vector properties to learn multi-view geometric information and extract more effective representations for tooth segmentation. STSNet [30] presented the first attempt at unsupervised pre-training for 3D tooth segmentation. However, semantic-based methods struggle to accurately delineate individual tooth boundaries when teeth are in close proximity, which leads to neighboring teeth being incorrectly classified. In contrast, the proposed tooth object affinity-based method effectively highlights the areas corresponding to individual teeth, enabling accurate tooth segmentation and identification.

3) Instance-Based Methods: Mask-MCNet [9] achieved localization and segmentation of each tooth instance by first predicting 3D bounding boxes and subsequently segmenting the points belonging to each tooth instance. A method based on proposal generation and cluster grouping was proposed for point cloud-based tooth instance segmentation [12], which employs an attention mechanism that combines objectness and point-wise knowledge. It is effective in scenarios where teeth are undamaged or minimally worn, and where tooth boundaries are well-defined and without blurring. TSegNet [10] proposed robust tooth centroid prediction and accurate single-tooth segmentation on point cloud data. MLMSM [31] proposed a novel semi-supervised framework to train two self-supervised models and then employed a model ensemble approach to address the problem of limited data in downstream tasks. DArch [11] then introduced a semi-supervised framework that included a two-stage process for tooth centroid detection and instance segmentation. In this paper, we utilize tooth object affinity maps to directly highlight the regions corresponding to each tooth and segment them in an end-to-end fashion. By consolidating contextual information from these highlighted regions, we can obtain accurate tooth instance segmentation results.

B. Correspondence Matching

Click prediction [32] was proposed to address the text-based image search problem through multimodal hypergraph learning-based sparse coding. Bipartite matching [33] was employed to find correspondences between objects in an image and assign them to the same class or category. It is widely used in object detection and segmentation to detect and segment objects of interest in an image or a video [34], [35]. SparseInst [36] achieved real-time image segmentation by using sparse representations to highlight important regions in 2D images, and by utilizing bipartite matching to assign the correct class labels to target objects. In this paper, we explore an object-aware optimal transport assignment (OOTA) strategy using bipartite matching [33], [36] in three-dimensional (i.e., 3D mesh) space to pair the predicted tooth regions and the ground truth masks in the training stage.

C. Instance Segmentation on Point Cloud Data

3D instance segmentation methods on point cloud data can be roughly categorized into top-down and bottom-up methods. Top-down methods first identified potential object candidates using object detection and then performed instance segmentation on local regions of the 3D point cloud data. GSPN [37] proposed to employ the analysis-by-synthesis method to generate object proposals from noisy scenes. 3D-BoNet [38] used a top-down strategy to predict a fixed set of 3D proposals based on bounding box detection. 3D-SIS [39] generated boxes and segmentation masks by employing both geometric and color features. Bottom-up methods employed a grouping strategy after a discriminative feature embedding to perform segmentation. A point-to-surface representation [40] was employed to assemble local and global geometric information for 3D point cloud data. SGPN [41] employed a similarity matrix to cluster similar points into instances. VLAAD [42] enhanced the discriminative power of feature representations by adaptively assigning weights to each residue vector lying between the descriptors and the centroid of the cluster to which they belong. PointGroup [43] predicted the offset of each point relative to the center point of the instance, and fused the semantic segmentation results and offsets to generate cluster proposals. SoftGroup [44] allows each point to be associated with multiple classes to mitigate the problems stemming from semantic prediction errors. Overall, these segmentation methods, designed for large-scale scenes, may have difficulty adapting to the particularity of tightly arranged teeth.

III. METHOD

A. Overview

Fig. 2 shows the proposed network. Specifically, we first utilize a two-stream multi-scale feature encoder to extract discriminative coordinate and normal features from the 3D dental models. Then, a tooth object affinity module is employed to highlight the tooth regions of interest. Finally, in the identification head, a segmentation module leverages the highlighted regions to perform accurate tooth mask segmentation, and a labeling module performs classification and objectness identification to assign labels to each segmented tooth mask. By combining these components, our method achieves precise tooth instance segmentation and labeling in 3D dental models.

B. Two-Stream Multi-Scale Feature Embedding

Since tooth objects are tightly arranged and vary in scale and orientation, inspired by TSGCNet [28], we employ a two-stream multi-scale feature encoder to extract tooth features from a given 3D dental model. It is noted that the feature encoder is flexible and can also be replaced by existing point cloud feature encoders (e.g., PointNet [45], PointNet++ [46], and DGCNN [47]).

The input of the proposed network is a matrix of size $M \times 24$, where $M$ indicates the number of mesh cells; each mesh cell contains four 3D points (i.e., the three vertices and the centroid) and four corresponding normal vectors in 3D dental models.
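For concreteness, the following minimal NumPy sketch shows one way such an $M \times 24$ input could be assembled from a triangle mesh given as vertex and face arrays. The function name is illustrative, and repeating the face normal as a stand-in for the four per-cell normal vectors is an assumption, since the excerpt does not specify how the normals are derived.

```python
import numpy as np


def build_cell_features(vertices, faces):
    """Assemble the M x 24 input described above: for each mesh cell,
    four 3D points (three vertices plus the centroid) and four
    corresponding unit normal vectors, flattened to 24 values."""
    tri = vertices[faces]                       # (M, 3, 3) triangle vertices
    centroid = tri.mean(axis=1, keepdims=True)  # (M, 1, 3) cell centroids

    # Face normal from the cross product of two edges, normalized.
    n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-12

    points = np.concatenate([tri, centroid], axis=1)   # (M, 4, 3)
    normals = np.repeat(n[:, None, :], 4, axis=1)      # (M, 4, 3), assumed
    return np.concatenate([points, normals], axis=1).reshape(len(faces), 24)
```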
Fig. 2. The proposed network architecture contains three key parts: (1) a two-stream multi-scale feature encoder that generates basic tooth feature representations to handle tightly arranged teeth with varying shapes; (2) an object affinity-based decoder that contains a mask branch to generate tooth mask features and an instance branch that uses the tooth object affinity module to highlight potential tooth regions; (3) an identification head designed to segment and classify each tooth instance using the object-aware optimal transport assignment (OOTA) strategy.
To handle variations in orientation across different 3D dental models, we use a feature transform module (FTM) to obtain a transformation matrix that normalizes the input dental model in both the coordinate and normal vector streams. The FTM comprises three consecutive 1D convolutional layers, each followed by batch normalization and a ReLU activation. The dimensions of these layers are 64, 128, and 1024, respectively. Then, max pooling is applied for feature aggregation, and three linear layers are used for dimensionality reduction. The output dimensions of the first and second linear layers are 512 and 256, respectively. The output of the last linear layer takes the form of a transformation matrix of size $C \times C$, where $C$ matches the dimension of the input coordinates and normal vectors, set to 12. This learned transformation matrix is applied to the input coordinates and normal vectors for normalization through inner product operations.
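A PyTorch sketch of an FTM consistent with this description is given below. The layer widths come from the text; the ReLU placement inside the linear stack and the identity-biased initialization of the transform (a common T-Net convention from PointNet [45]) are assumptions.

```python
import torch
import torch.nn as nn


class FTM(nn.Module):
    """Feature transform module sketch: Conv1d(64-128-1024) with BN+ReLU,
    max pooling, linear layers (512, 256), and a final linear layer that
    emits a C x C matrix (C = 12) applied to the input by inner product."""

    def __init__(self, c: int = 12):
        super().__init__()
        self.c = c
        self.convs = nn.Sequential(
            nn.Conv1d(c, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, c * c),
        )

    def forward(self, x):                       # x: (B, C=12, M), one stream
        t = self.convs(x).max(dim=-1).values    # (B, 1024) global feature
        t = self.fc(t).view(-1, self.c, self.c)
        # Bias toward the identity so training starts near "no transform"
        # (assumed convention, not stated in the text).
        t = t + torch.eye(self.c, device=x.device)
        return torch.bmm(t, x)                  # normalized stream features
```

Each of the coordinate and normal vector streams would use its own FTM on the 12 values per cell belonging to that stream.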
The multi-scale feature embedding module is designed to obtain tooth feature representations at different scales. To learn local feature representations, we build K-nearest neighbor (KNN) graphs at multiple scales for each mesh cell. Since nodes on the edges of the mesh can be biased towards neighboring meshes, which results in discontinuities and inhomogeneities, we take the mesh cell centroids as the KNN nodes to keep the distribution even. The KNN graphs identify the neighboring nodes closest to each cell, which is essential for capturing local structure information. To construct the KNN graph for each mesh cell in the 3D dental models, we compute the Euclidean distance between each cell node and its $K$ nearest neighbors, resulting in a graph $G(V, E)$, where $V$ represents the cell node attribute sets (i.e., coordinates or normal vectors), denoted as $V = \{n_1, n_2, \ldots, n_M\}$, and $E \subseteq \{(x, y) \mid (x, y) \in V^2 \wedge x \neq y\}$ represents the edges of the graph. For each node $n_i \in V$, the KNN set is denoted as $\mathcal{K}$. The updated coordinate and normal features at the $l$-th scale, $\hat{f}_c^l$ and $\hat{f}_n^l$, can be respectively computed by combining the input feature vectors with the contextual neighbor features:

$$\hat{f}_{c_i}^{l} = f_{c_i}^{l} \oplus f_{c_{ij}}^{l}, \quad \forall c_{ij} \in \mathcal{K}, \qquad \hat{f}_{n_i}^{l} = f_{n_i}^{l} \oplus f_{n_{ij}}^{l}, \quad \forall n_{ij} \in \mathcal{K}, \tag{1}$$

where $c_i$ and $n_i$ indicate the coordinate and normal vectors of the $i$-th node, $c_{ij}$ and $n_{ij}$ represent one of the $K$ node coordinates and normal vectors adjacent to the $i$-th node, and $\oplus$ denotes concatenation.

To capture the basic topology of the tooth, we utilize a graph attention mechanism and max-pooling to focus on the coordinate and normal vector attributes based on the updated cell node features. Specifically, in the coordinate stream, the learnable connection attention weight $a_{ij}^{l}$ for each coordinate can be calculated using the neighbor features $f_{c_{ij}}$ by

$$a_{ij}^{l} = \mathrm{Conv}(\Delta f_{c_{ij}}^{l} \oplus f_{c_{ij}}^{l}), \quad \forall c_{ij} \in \mathcal{K}, \tag{2}$$

where $\Delta f_{c_{ij}}^{l} = f_{c_i}^{l} - f_{c_{ij}}^{l}$ measures the difference between coordinate $i$ and one of its $K$ nearest neighbors $j$. The final output coordinate node features can then be calculated by

$$F_c^{l+1} = \sum_{c_{ij} \in \mathcal{K}} a_{ij}^{l} \odot \hat{f}_{c_i}^{l}, \tag{3}$$

where $\odot$ denotes the element-wise product.

In the normal vector stream, the normal node features are aggregated using max-pooling:

$$F_n^{l+1} = \mathrm{maxpooling}(f_{n_{ij}}^{l}), \quad \forall n_{ij} \in \mathcal{K}. \tag{4}$$

The coordinate node features and normal node features of different scales are fused by an MLP layer:

$$F_c = \mathrm{MLP}(F_c^{1} \oplus F_c^{2} \oplus F_c^{3}), \qquad F_n = \mathrm{MLP}(F_n^{1} \oplus F_n^{2} \oplus F_n^{3}). \tag{5}$$

The final tooth feature embeddings of the coordinate and normal streams are concatenated, passed through another MLP layer, and fed into the two-branch decoder based on tooth object affinity:

$$F = \mathrm{MLP}(F_c \oplus F_n), \quad F \in \mathbb{R}^{D \times M}, \tag{6}$$

where $D$ is the dimension of the tooth feature embedding.
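A compact PyTorch sketch of one scale of this embedding is given below, assuming per-model (unbatched) tensors. The sigmoid on the attention weights and the way the lifted neighbor features enter the weighted sum of Eq. (3) are implementation assumptions that the text does not pin down.

```python
import torch
import torch.nn as nn


def knn_graph(pos, k):
    """Indices of the k nearest neighbors of each cell centroid."""
    dist = torch.cdist(pos, pos)                            # (M, M) distances
    return dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self


class CoordGraphLayer(nn.Module):
    """One scale of the coordinate stream (Eqs. 1-3): concatenate
    neighbor features, score each edge, and sum weighted features."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.edge = nn.Linear(2 * d_in, d_out)  # plays the role of Conv in Eq. 2
        self.node = nn.Linear(2 * d_in, d_out)  # lifts f-hat (Eq. 1) to d_out

    def forward(self, f, idx):                  # f: (M, D), idx: (M, K)
        fj = f[idx]                             # (M, K, D) neighbor features
        fi = f.unsqueeze(1).expand_as(fj)       # (M, K, D) repeated centers
        a = torch.sigmoid(self.edge(torch.cat([fi - fj, fj], -1)))  # Eq. 2
        h = self.node(torch.cat([fi, fj], -1))                      # Eq. 1
        return (a * h).sum(dim=1)               # Eq. 3: sum of a_ij * f-hat


def normal_stream_layer(f, idx):
    """Normal-vector stream at one scale (Eq. 4): max-pool over neighbors."""
    return f[idx].max(dim=1).values
```

Running this layer three times with different `k` and concatenating the outputs before an MLP would realize the multi-scale fusion of Eqs. (5)-(6).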
C. Tooth Object Affinity Module

To capture better tooth boundaries and contours and to leverage global contextual information, we design the object affinity-based decoder to directly highlight potential tooth regions, as shown in Fig. 2. It consists of two branches: a mask branch and an instance branch. The mask branch is responsible for generating tooth instance mask features, which distinguish teeth from the surrounding structures. To collect tooth mask features, we utilize a vanilla $1 \times 1$ convolutional layer in the mask branch.

In the instance branch, the tooth object affinity module is designed to generate $N$ tooth instance activation maps for highlighting individual tooth areas. These tooth object affinity maps are instance-level weighted maps that highlight each tooth region, automatically capturing the semantic and instance information from the final tooth feature embedding, thereby facilitating the identification and segmentation of each tooth.

Specifically, the tooth object affinity module consists of a vanilla $1 \times 1$ convolutional layer $\mathrm{Conv}(\cdot)$, a non-linear Sigmoid activation function $\sigma(\cdot)$, and a layer normalization $LN[\cdot]$. The tooth object affinity maps $F_{oa}$ are generated from the tooth feature embedding $F \in \mathbb{R}^{D \times M}$ obtained from the two-stream multi-scale feature encoder:

$$F_{oa} = LN[\sigma(\mathrm{Conv}(F))], \quad F_{oa} \in \mathbb{R}^{N \times M}, \tag{7}$$

where $N$ represents the number of tooth object affinity maps, and $M$ denotes the number of input mesh cells. $F_{oa}$ represents a sparse set of $N$ tooth instance activation maps, which are instance-aware weighted maps designed to emphasize areas of information specific to each tooth object. The tooth instance features $Z$ are obtained by aggregating rich contextual information using the tooth object affinity maps $F_{oa}$ and the tooth feature embedding $F$:

$$Z = F_{oa} \cdot F^{T}, \quad Z \in \mathbb{R}^{N \times D}, \tag{8}$$

where $z = (z_i)_N$ is the instance feature representation of the $N$ potential tooth objects in the dental model. This aggregation process combines the local tooth features with the highlighted regions to capture both local and global information.

To further improve the generalization capability and mitigate the impact of redundant information, similar to the method in [36], we reshape the tooth instance features $Z$ into multiple groups and introduce a group convolution operation:

$$\hat{Z} = \mathcal{G}(\{Z_1, Z_2, \ldots, Z_G\}; \Theta), \tag{9}$$

where $\mathcal{G}(\cdot)$ represents the group convolution operation with shared parameters $\Theta$, and $1$ to $G$ index the subsets of channels divided among the $G$ groups. By reshaping the tooth instance features into groups and applying group convolution, we can effectively extract and preserve the key features related to tooth instances.
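The module of Eqs. (7)-(9) can be sketched in PyTorch as follows. The axis over which layer normalization is applied and the number of convolution groups are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn


class ToothObjectAffinity(nn.Module):
    """Sketch of Eqs. (7)-(9): a 1x1 conv + sigmoid + layer norm yield N
    affinity maps over the M cells; instance features are gathered by
    F_oa . F^T (Eq. 8) and refined by a grouped 1x1 convolution (Eq. 9)."""

    def __init__(self, d, n, groups=4):          # groups must divide d
        super().__init__()
        self.proj = nn.Conv1d(d, n, 1)           # D -> N affinity logits
        self.group = nn.Conv1d(d, d, 1, groups=groups)

    def forward(self, feat):                     # feat F: (B, D, M)
        oa = torch.sigmoid(self.proj(feat))      # (B, N, M), Eq. (7)
        oa = Fn.layer_norm(oa, oa.shape[-1:])    # LN over cells (assumed axis)
        z = torch.bmm(oa, feat.transpose(1, 2))  # (B, N, D), Eq. (8)
        z = self.group(z.transpose(1, 2)).transpose(1, 2)  # Eq. (9) refine
        return oa, z                             # affinity maps, instance feats
```

Given the projected mask features introduced in the identification head below, the instance masks then reduce to a batched matrix product of the refined instance features with those mask features.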
D. Tooth Identification Head

In the identification head, the final $N$ tooth instance masks $\mathcal{M}$ can be obtained by element-wise multiplication of the updated tooth instance features $\hat{Z}$ with the mask features $F_m$:

$$\mathcal{M} = \hat{Z} \cdot F_m, \quad \mathcal{M} \in \mathbb{R}^{N \times M}. \tag{10}$$

The mask features $F_m$ are obtained by applying a $1 \times 1$ projection convolutional layer with $D$ channels to the tooth feature embedding $F$ in the mask branch of the decoder:

$$F_m = \mathrm{Conv}(F), \quad F_m \in \mathbb{R}^{D \times M}. \tag{11}$$

Finally, two additional $1 \times 1$ convolutional layers are applied to the grouped tooth instance features $\hat{Z}$ for tooth classification and objectness confidence prediction, respectively.

By multiplying the updated tooth instance features with the mask features, our identification head generates the final tooth instance masks. The mask features obtained through the projection convolutional layer refine and align the tooth instance masks with the underlying tooth structures. Furthermore, the tooth instance features are further processed for tooth classification and objectness confidence prediction, enabling comprehensive tooth labeling and enhancing the overall segmentation performance.

E. Objectness-Based Loss Functions

As the network generates a fixed number of $N$ tooth instances, we utilize an end-to-end training approach with bipartite matching to establish a one-to-one correspondence between the predicted tooth objects and the ground truth teeth [35], [48]. Given the imbalance between teeth and the background in the 3D dental model, the network is prone to predicting teeth as background. To mitigate this issue, we propose an object-aware optimal transport assignment (OOTA) strategy based on the Dice similarity coefficient (DSC) and tooth classification confidence. Building on SparseInst [36], our OOTA strategy aims to achieve precise one-to-one matching between the predicted tooth instances and the corresponding ground truth teeth. To achieve this, we introduce a pairwise matching score $S(i, j)$ to assess the similarity between the $i$-th tooth instance prediction and the $j$-th ground truth tooth. The matching score is calculated as

$$S(i, j) = \mathrm{DSC}(m_i, g_j)^{\alpha} \cdot p_{i, c_j}^{1-\alpha}, \tag{12}$$

where $m_i$ and $g_j$ represent the masks of the $i$-th predicted tooth instance and the $j$-th ground truth tooth, respectively. The probability $p_{i, c_j}$ denotes the likelihood of the $i$-th prediction belonging to category $c_j$. The hyper-parameter $\alpha$ controls the balance between the impact of segmentation and classification and is empirically set to 0.8 [36]. In addition, the Dice similarity coefficient is calculated as

$$\mathrm{DSC}(m, g) = \frac{2 \sum_{x,y,z} m_{xyz} \cdot g_{xyz}}{\sum_{x,y,z} m_{xyz}^{2} + \sum_{x,y,z} g_{xyz}^{2}}, \tag{13}$$

where the mesh cells at position $(x, y, z)$ in the prediction mask $m$ and ground truth mask $g$ are represented as $m_{xyz}$ and $g_{xyz}$, respectively.

The OOTA strategy leverages the Hungarian algorithm [33] to achieve optimal matching between the ground truth objects and the $N$ predictions. The complete OOTA strategy is described in Algorithm 1. Finally, the proposed training loss function is computed between the tooth instance predictions that have been matched
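Since Algorithm 1 and the remainder of the loss definition fall outside this excerpt, the following NumPy/SciPy sketch illustrates only the matching step of Eqs. (12)-(13): scoring every prediction/ground-truth pair and solving the one-to-one assignment with the Hungarian algorithm [33]. All names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def dice(m, g, eps=1e-6):
    """Soft Dice between a predicted mask m and a ground-truth mask g,
    both flattened over the M mesh cells (Eq. 13)."""
    return 2.0 * (m * g).sum() / ((m ** 2).sum() + (g ** 2).sum() + eps)


def oota_match(pred_masks, pred_probs, gt_masks, gt_labels, alpha=0.8):
    """One-to-one assignment of N predictions to T ground-truth teeth by
    maximizing S(i, j) = DSC(m_i, g_j)^alpha * p_{i,c_j}^(1 - alpha)
    (Eq. 12) with the Hungarian algorithm; unmatched predictions are
    treated as background."""
    n, t = len(pred_masks), len(gt_masks)
    score = np.zeros((n, t))
    for i in range(n):
        for j in range(t):
            score[i, j] = (dice(pred_masks[i], gt_masks[j]) ** alpha
                           * pred_probs[i, gt_labels[j]] ** (1 - alpha))
    # Hungarian solves a minimization problem, so negate the scores.
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows, cols))                 # matched (prediction, gt) pairs
```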
TABLE I
COMPARISON ON THREE-FOLD CROSS-VALIDATION WITH STATE-OF-THE-ART METHODS
where $k$ is the number of tooth object categories, $p_{ij}$ denotes the number of cells of class $i$ predicted as class $j$, and $p_{ji}$ represents the false-positive cells predicted as class $j$ while the true category is $i$.
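The mIOU formula itself falls on a page not reproduced here; the sketch below computes per-class IoU from a $k \times k$ confusion matrix using the standard definition consistent with the notation above, $\mathrm{IoU}_i = p_{ii} / (\sum_j p_{ij} + \sum_j p_{ji} - p_{ii})$.

```python
import numpy as np


def miou(conf):
    """Mean IoU from a k x k confusion matrix whose entry p_ij counts
    cells of true class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                                  # p_ii per class
    denom = conf.sum(axis=1) + conf.sum(axis=0) - tp    # union per class
    return np.nanmean(tp / np.where(denom > 0, denom, np.nan))
```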
B. Comparison With State-of-the-Art Methods

1) Competing Methods: The performance of the proposed method is compared with eight state-of-the-art methods on the benchmark 3D oral scan dataset Teeth3DS, and the OA and mIOU metrics are reported. The compared methods include:
• PointNet [45]: A pioneering network for classification and segmentation that directly operates on unordered 3D point cloud data.
• PointNet++ [46]: This method extends PointNet by employing set abstraction and grouping to learn local contextual information for 3D point cloud analysis.
• DGCNN [47]: This method learns semantic information from point cloud data by constructing local adjacency matrices and updating the graph structure dynamically across layers.
• MeshSegNet [27], [54]: A deep neural network that operates at multiple scales to learn high-level geometric features for end-to-end tooth segmentation on 3D dental models.
• TSGCNet [28], [53]: This method segments 3D dental models by adopting two graph-learning streams to extract more discriminative geometric representations from coordinates and normal vectors.
• SGPN [41]: A pioneering work in 3D instance segmentation from point clouds, using a single network to generate proposals and assign a semantic class to each object.
• PointGroup [43]: A centroid-based 3D instance segmentation method that identifies and labels individual objects in 3D point cloud data.
• 3D-BoNet [38]: A bounding box detection-based 3D instance segmentation method widely used in tooth instance segmentation and 3D point cloud segmentation.

2) Quantitative Results: The quantitative comparison results of three-fold cross-validation against state-of-the-art methods are listed in Table I. The evaluation metrics include OA, mIOU, and the IOU for each tooth category and the background. Overall, the proposed method demonstrates superior performance compared to existing methods based on semantic and instance segmentation. It outperforms PointNet++ [46] in OA by 4.49% and in mIOU by 11.41%, indicating its effectiveness in 3D tooth model segmentation tasks. This improvement can be attributed to the two-stream multi-scale feature encoder, which captures normal vectors and contextual information and effectively fuses multi-scale features, mitigating the semantic confusion caused by the similarity of tooth locations and morphologies. Compared to DGCNN [47], the proposed method achieves improvements of 1.08% in OA and 4.37% in mIOU. The separate learning of spatial coordinates and normal vectors contributes to extracting more discriminative features and enhancing the accuracy of tooth segmentation in dental models. In comparison with MeshSegNet [27], [54], the proposed method outperforms it by 0.61% in OA and 2.84% in mIOU. By constructing dynamic K-nearest neighbor (KNN) graphs, we flexibly learn global tooth features while preserving local information. Compared to TSGCNet [28], [53], the proposed method achieves an improvement of 1.59% in mIOU. This highlights the effectiveness of our tooth object affinity module and identification head in segmenting each complete tooth instance region. In comparison with the grouping-based methods [41], [43], our method outperforms PointGroup [43] by 1.44% in OA and 4.86% in mIOU. We attribute this to the difficulty grouping-based methods have in obtaining accurate semantics for adjacent teeth with similar shapes, which leads to inaccurate grouping results. On the contrary, our method models the interaction between adjacent teeth by using global contextual information in the tooth object affinity module, which is conducive to capturing the discriminative features of tooth shapes, thereby improving segmentation accuracy. Compared to the detection-based method 3D-BoNet [38], the proposed method achieves improvements of 1.7% in OA and 6.99% in mIOU. This may be attributed to the fact that detection-based methods segment objects directly within the detected bounding boxes; since teeth are tightly arranged, these boxes may cover portions of adjacent teeth, which leads to multiple teeth being segmented as a single entity. Instead, we use the tooth object affinity module to locate each tooth region, and the highlighted irregular areas conform better to the actual shapes of the teeth, hence improving segmentation accuracy.

To assess the quality (i.e., completeness) of the tooth instance segmentation results produced by different methods, we set various predefined thresholds on the IOU (Intersection over Union) between the predicted tooth instance masks and the ground truth. We consider tooth instance segmentation results to be of high quality when the IOU exceeds the threshold. Fig. 3 presents the comparison of different methods in terms of the completeness of the tooth segmentation results. We vary the IOU threshold from 80% to 90% with a step size of 2% and observe that the completeness
TABLE II
COMPUTATION OVERHEAD COMPARISON WITH STATE-OF-THE-ART METHODS
Fig. 4. Qualitative results compared with state-of-the-art methods. The first row shows segmentation results on a complete set of teeth (excluding the third molars due to the variability in tooth specificity between patients). The second row displays results on a model with one missing tooth, while the third and fourth rows show more challenging cases. Red dotted circles and arrows indicate segmentation errors of recent state-of-the-art methods, while the proposed method maintains better segmentation performance and robustness.
TABLE IV
EFFECT OF THE MASK BRANCH. 'W/O' AND 'W/' INDICATE 'WITHOUT' AND 'WITH', RESPECTIVELY
Fig. 6. Visualization of the highlighted tooth regions (mesh view) when N is set to 30. With the tooth object-aware optimal transport assignment (OOTA) strategy in the training stage, these regions are matched one-to-one with the ground truth teeth. Green boxes with tooth labels indicate tooth instance regions matched with the corresponding ground truth teeth, while gray dashed boxes indicate low-quality regions that do not match any ground truth tooth.
TABLE V
EFFECT OF THE INSTANCE BRANCH

TABLE VI
EFFECT OF THE LOSS FUNCTIONS
TABLE VII
EFFECT OF THE NUMBER OF K

TABLE VIII
EFFECT OF THE FEATURE EMBEDDING SETTINGS

Fig. 7. Comparison of the segmentation performance of different feature encoders under the proposed tooth object affinity module.

6) Effects of the Feature Embedding: Ablation experiments are conducted to assess the impact of different feature embedding settings on tooth segmentation performance, with the results summarized in Table VIII. Notably, using either the coordinate stream or the normal stream in isolation yields less favorable segmentation performance than utilizing coordinates and normal vectors concurrently. The combined use of coordinates and normal vectors proves more effective in conveying positional and structural information during tooth feature extraction. Furthermore, employing two separate streams for coordinate and normal vector feature extraction, as opposed to a single branch for both features, leads to significant improvements in both overall accuracy (OA) and mean intersection over union (mIOU), by 3.24% and 7.71%, respectively. This observation underscores the value of learning these features separately, facilitating the extraction of complementary information that better represents tooth positions and structures.

7) Visualization of the Highlighted Tooth Regions: To further explain how the instance activation maps distinguish teeth, we provide a visualization of the highlighted tooth regions from a 3D tooth model in a scenario where teeth are missing. Fig. 6 highlights the approximate regions of each tooth. We can see that the tooth instance activation maps effectively highlight teeth with different proportions and positions, and they also perform well in challenging cases such as missing teeth.

8) Generalization of the Decoder: To evaluate the generalization ability of the object affinity-based decoder, we conduct experiments using different point cloud feature encoders with it. As depicted in Fig. 7, we show the results with three encoders combined with the proposed decoder. All encoders achieve consistent improvements in OA and mIOU, which indicates the decoder's good generalization ability.

D. Discussion

This paper presents a novel method for 3D dental model segmentation based on a two-stream multi-scale feature encoder and a tooth object affinity module. The proposed method demonstrates superior performance compared to state-of-the-art methods in terms of tooth segmentation accuracy, robustness, and completeness. By leveraging multi-scale feature extraction and incorporating normal vectors, the proposed method mitigates the semantic confusion caused by the similarity of tooth locations and morphologies. This leads to improved segmentation results, especially in cases where adjacent teeth have similar characteristics.

Furthermore, the tooth object affinity module plays a crucial role in accurately identifying tooth categories and delineating tooth boundaries, even in challenging samples with missing teeth or abnormal shapes and sizes. This module effectively captures local and global tooth information and generates reliable tooth instance activation maps, which enables precise segmentation and avoids the inclusion of surrounding gum areas as part of the tooth region.

Experiments are conducted on a benchmark dental dataset, demonstrating the effectiveness and superiority of the proposed method. It outperforms existing methods based on semantic and instance segmentation, showcasing its potential for various dental applications and clinical scenarios.

Future work can focus on enhancing the proposed method by exploring advanced feature encoding techniques and refining the tooth object affinity module to achieve improved accuracy and robustness. Additionally, investigating the generalizability and transferability of the proposed approach to other medical imaging tasks would be valuable for expanding its applicability in the broader field of computer-aided diagnosis and treatment planning.

V. CONCLUSION

This paper presents a novel method for accurate 3D dental model segmentation in a highlighting-and-segmenting manner. The proposed method introduces a tooth object affinity module to highlight the discriminative regions of teeth. The final tooth instance identification and segmentation results can be directly derived using a simple yet effective identification head. Experimental results on a benchmark 3D tooth model dataset demonstrate significant improvements in segmentation performance, surpassing existing methods.

REFERENCES

[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[2] X. Zhang, H. Li, F. Meng, Z. Song, and L. Xu, "Segmenting beyond the bounding box for instance segmentation," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 2, pp. 704–714, Feb. 2022.
[3] Y. Sun, L. Su, S. Yuan, and H. Meng, "DANet: Dual-branch activation network for small object instance segmentation of ship images," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 11, pp. 6708–6720, Nov. 2023.
[4] B. Hu, S. Zhou, Z. Xiong, and F. Wu, "Cross-resolution distillation for efficient 3D medical image registration," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 10, pp. 7269–7283, Oct. 2022.
[5] R. Nie, J. Cao, D. Zhou, and W. Qian, "Multi-source information exchange encoding with PCNN for medical image fusion," IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 986–1000, Mar. 2021.
[6] L. Zhao and W. Tao, "JSNet++: Dynamic filters and pointwise correlation for 3D point cloud instance and semantic segmentation," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 4, pp. 1854–1867, Apr. 2023.
[7] T. D. Ngo, B.-S. Hua, and K. Nguyen, "ISBNet: A 3D point cloud instance segmentation network with instance-aware sampling and box-aware dynamic convolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 13550–13559.
[8] J. Hou, X. Dai, Z. He, A. Dai, and M. Nießner, "Mask3D: Pre-training 2D vision transformers by learning masked 3D priors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 13510–13519.
[9] F. G. Zanjani et al., "Mask-MCNet: Tooth instance segmentation in 3D point clouds of intra-oral scans," Neurocomputing, vol. 453, pp. 286–298, Sep. 2021.
[10] Z. Cui et al., "TSegNet: An efficient and accurate tooth segmentation network on 3D dental model," Med. Image Anal., vol. 69, Apr. 2021, Art. no. 101949.
[11] L. Qiu, C. Ye, P. Chen, Y. Liu, X. Han, and S. Cui, "DArch: Dental arch prior-assisted 3D tooth instance segmentation with weak annotations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 20720–20729.
[12] Y. Tian et al., "3D tooth instance segmentation learning objectness and affinity in point cloud," ACM Trans. Multimedia Comput., Commun., Appl., vol. 18, no. 4, pp. 1–16, Nov. 2022.
[13] Y. Zhao et al., "TSASNet: Tooth segmentation on dental panoramic X-ray images by two-stage attention segmentation network," Knowledge-Based Syst., vol. 206, Oct. 2020, Art. no. 106338.
[14] Y. Zheng, B. Chen, Y. Shen, and K. Shen, "TeethGNN: Semantic 3D teeth segmentation with graph neural networks," IEEE Trans. Vis. Comput. Graphics, vol. 29, no. 7, pp. 3158–3168, Jul. 2023.
[15] K. Wu, L. Chen, J. Li, and Y. Zhou, "Tooth segmentation on dental meshes using morphologic skeleton," Comput. Graph., vol. 38, pp. 199–211, Feb. 2014.
[16] B.-J. Zou, S.-J. Liu, S.-H. Liao, X. Ding, and Y. Liang, "Interactive tooth partition of dental mesh base on tooth-target harmonic field," Comput. Biol. Med., vol. 56, pp. 132–144, Jan. 2015.
[17] Z. Li, X. Ning, and Z. Wang, "A fast segmentation method for STL teeth model," in Proc. IEEE/ICME Int. Conf. Complex Med. Eng., May 2007, pp. 163–166.
[18] T. Kronfeld, D. Brunner, and G. Brunnett, "Snake-based segmentation of teeth from virtual dental casts," Comput.-Aided Design Appl., vol. 7, no. 2, pp. 221–233, Jan. 2010.
[19] Y. Kumar, R. Janardan, B. Larson, and J. Moon, "Improved segmentation of teeth in dental models," Comput.-Aided Design Appl., vol. 8, no. 2, pp. 211–224, Jan. 2011.
[20] C. Sinthanayothin and W. Tharanont, "Orthodontics treatment simulation by teeth segmentation and setup," in Proc. 5th Int. Conf. Electr. Eng./Electron., Comput., Telecommun. Inf. Technol., May 2008, pp. 81–84.
[21] M. Yaqi and L. Zhongke, "Computer aided orthodontics treatment by virtual segmentation and adjustment," in Proc. Int. Conf. Image Anal. Signal Process., Apr. 2010, pp. 336–339.
[22] D. Sun et al., "Automatic tooth segmentation and dense correspondence of 3D dental model," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervent. Cham, Switzerland: Springer, 2020, pp. 703–712.
[23] T. Kondo, S. H. Ong, and K. W. C. Foong, "Tooth segmentation of dental study models using range images," IEEE Trans. Med. Imag., vol. 23, no. 3, pp. 350–362, Mar. 2004.
[24] X. Xu, C. Liu, and Y. Zheng, "3D tooth segmentation and labeling using deep convolutional neural networks," IEEE Trans. Vis. Comput. Graphics, vol. 25, no. 7, pp. 2336–2348, Jul. 2019.
[25] J. Zhang, C. Li, Q. Song, L. Gao, and Y.-K. Lai, "Automatic 3D tooth segmentation using convolutional neural networks in harmonic parameter space," Graph. Models, vol. 109, May 2020, Art. no. 101071.
[26] F. G. Zanjani et al., "Deep learning approach to semantic segmentation in 3D point cloud intra-oral scans of teeth," in Proc. Int. Conf. Med. Imag. With Deep Learn., 2019, pp. 557–571.
[27] C. Lian et al., "Deep multi-scale mesh feature learning for automated labeling of raw dental surfaces from 3D intraoral scanners," IEEE Trans. Med. Imag., vol. 39, no. 7, pp. 2440–2450, Jul. 2020.
[28] L. Zhang et al., "TSGCNet: Discriminative geometric feature learning with two-stream graph convolutional network for 3D dental model segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6695–6704.
[29] J. Yu, M. Tan, H. Zhang, Y. Rui, and D. Tao, "Hierarchical deep click feature prediction for fine-grained image recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 2, pp. 563–578, Feb. 2022.
[30] Z. Liu et al., "Hierarchical self-supervised learning for 3D tooth segmentation in intra-oral mesh scans," IEEE Trans. Med. Imag., vol. 42, no. 2, pp. 467–480, Feb. 2023.
[31] J. Zhang, J. Yang, J. Yu, and J. Fan, "Semisupervised image classification by mutual learning of multiple self-supervised models," Int. J. Intell. Syst., vol. 37, no. 5, pp. 3117–3141, May 2022.
[32] J. Yu, Y. Rui, and D. Tao, "Click prediction for web image reranking using multimodal sparse coding," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2019–2032, May 2014.
[33] H. W. Kuhn, "The Hungarian method for the assignment problem," Nav. Res. Logistics (NRL), vol. 52, no. 1, pp. 7–21, Feb. 2005.
[34] J. Wang, L. Song, Z. Li, H. Sun, J. Sun, and N. Zheng, "End-to-end object detection with fully convolutional network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 15844–15853.
[35] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," 2020, arXiv:2010.04159.
[36] T. Cheng et al., "Sparse instance activation for real-time instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 4423–4432.
[37] L. Yi, W. Zhao, H. Wang, M. Sung, and L. J. Guibas, "GSPN: Generative shape proposal network for 3D instance segmentation in point cloud," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3942–3951.
[38] B. Yang et al., "Learning object bounding boxes for 3D instance segmentation on point clouds," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 6737–6746.
[39] J. Hou, A. Dai, and M. Nießner, "3D-SIS: 3D semantic instance segmentation of RGB-D scans," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4416–4425.
[40] T. Sun, G. Liu, R. Li, S. Liu, S. Zhu, and B. Zeng, "Quadratic terms based point-to-surface 3D representation for deep learning of point cloud," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 5, pp. 2705–2718, May 2022.
[41] W. Wang, R. Yu, Q. Huang, and U. Neumann, "SGPN: Similarity group proposal network for 3D point cloud instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2569–2578.
[42] J. Zhang, Y. Cao, and Q. Wu, "Vector of locally and adaptively aggregated descriptors for image feature representation," Pattern Recognit., vol. 116, Aug. 2021, Art. no. 107952.
[43] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4866–4875.
[44] T. Vu, K. Kim, T. M. Luu, T. Nguyen, and C. D. Yoo, "SoftGroup for 3D instance segmentation on point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 2698–2707.
[45] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 77–85.
[46] C. R. Qi et al., "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[47] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, "Dynamic graph CNN for learning on point clouds," ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, Oct. 2019.
[48] R. Stewart, M. Andriluka, and A. Y. Ng, "End-to-end people detection in crowded scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2325–2333.
[49] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999–3007.
[50] P. Li et al., "Semantic graph attention with explicit anatomical association modeling for tooth segmentation from CBCT images," IEEE Trans. Med. Imag., vol. 41, no. 11, pp. 3116–3127, Nov. 2022.
[51] A. Ben-Hamadou et al., "Teeth3DS: A benchmark for teeth segmentation and labeling from intra-oral 3D scans," 2022, arXiv:2210.06094.
[52] M. Corsini, P. Cignoni, and R. Scopigno, "Efficient and flexible sampling with blue noise properties of triangular meshes," IEEE Trans. Vis. Comput. Graphics, vol. 18, no. 6, pp. 914–924, Jun. 2012.
[53] Y. Zhao et al., "Two-stream graph convolutional network for intra-oral scanner image segmentation," IEEE Trans. Med. Imag., vol. 41, no. 4, pp. 826–835, Apr. 2022.
[54] C. Lian et al., "MeshSNet: Deep multi-scale mesh feature learning for end-to-end tooth labeling on 3D dental surfaces," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Shenzhen, China, Oct. 2019, pp. 837–845.

Pengcheng Li received the B.S. and M.S. degrees from the School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, China, in 2017 and 2020, respectively, where he is currently pursuing the Ph.D. degree. His research interests include medical image processing, computer vision, and deep learning.

Chenqiang Gao received the B.S. degree in computer science from the China University of Geosciences, Wuhan, China, in 2004, and the Ph.D. degree in control science and engineering from the Huazhong University of Science and Technology, Wuhan, in 2009. In August 2009, he joined the School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications (CQUPT), Chongqing, China. In September 2012, he joined the Informedia Group, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, where he was a Visiting Scholar working on multimedia event detection (MED) and surveillance event detection (SED). In April 2013, he became a Post-Doctoral Fellow and continued work on MED and SED until March 2014, when he returned to CQUPT. He is currently a Professor with CQUPT. His research interests include image processing, infrared target detection, action recognition, and event detection.

Fangcen Liu received the B.S. and M.S. degrees from the School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, China, in 2018 and 2021, respectively, where she is currently pursuing the Ph.D. degree. Her research interests include image processing, deep learning, cross-modal retrieval, and infrared small target detection.

Deyu Meng (Senior Member, IEEE) received the B.Sc., M.Sc., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar with Carnegie Mellon University, Pittsburgh, PA, USA, from 2012 to 2014. He is currently a Professor with the School of Mathematics and Statistics, Xi'an Jiaotong University, and an Adjunct Professor with the Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau, China. His research interests include model-based deep learning, variational networks, and meta learning.

Yan Yan (Senior Member, IEEE) received the Ph.D. degree in computer science from the University of Trento. He was an Assistant Professor with Texas State University and a Research Fellow with the University of Michigan and the University of Trento. He is currently a Gladwin Development Chair Assistant Professor with the Department of Computer Science, Illinois Institute of Technology. His research interests include computer vision, machine learning, and multimedia.