
Received: 21 February 2021 | Revised: 1 September 2021 | IET Image Processing

DOI: 10.1049/ipr2.12419

REVIEW

Medical image segmentation using deep learning: A survey

Risheng Wang1,2 Tao Lei1,2 Ruixia Cui3 Bingtao Zhang4 Hongying Meng5
Asoke K. Nandi5

1 Shaanxi Joint Laboratory of Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an, China
2 The School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an, China
3 The Laboratory of Hepatobiliary Surgery, First Affiliated Hospital, and National Engineering Laboratory of Big Data Algorithm and Analysis Technology Research (Xi'an Jiaotong University), Xi'an, China
4 The School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou, China
5 The Department of Electronic and Electrical Engineering, Brunel University London, UK

Correspondence
Tao Lei, Shaanxi Joint Laboratory of Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an 710021, China.
Email: [email protected]

Funding information
Natural Science Basic Research Program of Shaanxi, Grant/Award Number: 2021J-47; National Natural Science Foundation of China, Grant/Award Numbers: 61871259, 61861024; Key Research and Development Program of Shaanxi, Grant/Award Number: 2021ZDLGY08-07; Shaanxi Joint Laboratory of Artificial Intelligence, Grant/Award Number: 2020SS-03; National Natural Science Foundation of China–Royal Society, Grant/Award Number: 61811530325 (IEC\NSFC\170396, Royal Society)

Abstract
Deep learning has been widely used for medical image segmentation, and a large number of papers have been presented recording the success of deep learning in the field. This paper presents a comprehensive thematic survey on medical image segmentation using deep learning techniques, and makes two original contributions. Firstly, in contrast to traditional surveys that directly divide the literature on deep learning for medical image segmentation into many groups and introduce the papers of each group in detail, we classify the currently popular literature according to a multi-level structure from coarse to fine. Secondly, this paper focuses on supervised and weakly supervised learning approaches, without including unsupervised approaches, since they have been covered in many older surveys and are not currently popular. For supervised learning approaches, we analyse the literature in three aspects: the selection of backbone networks, the design of network blocks, and the improvement of loss functions. For weakly supervised learning approaches, we investigate the literature according to data augmentation, transfer learning, and interactive segmentation, separately. Compared to existing surveys, this survey classifies the literature very differently from before; it is more convenient for readers to understand the relevant rationale, and it will guide them towards appropriate improvements in medical image segmentation based on deep learning approaches.

1 INTRODUCTION

Medical image segmentation aims to make anatomical or pathological structure changes clearer in images; it often plays a key role in computer-aided diagnosis and smart medicine due to the great improvement in diagnostic efficiency and accuracy. Popular medical image segmentation tasks include liver and liver-tumour segmentation [1] [2], brain and brain-tumour segmentation [3] [4], optic disc segmentation [5] [6], cell segmentation [7] [8], lung and pulmonary nodule segmentation [9] [10], and cardiac image segmentation [11] [12]. With the development and popularisation of medical imaging equipment, X-ray, computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound have become four important image-assisted means that help clinicians diagnose diseases, evaluate prognosis, and plan operations in medical

institutions. In practical applications, although these imaging modalities have advantages as well as disadvantages, they are useful for the medical examination of different parts of the human body. To help clinicians make accurate diagnoses, it is necessary to segment crucial objects in medical images and to extract features from the segmented areas. Early approaches to medical image segmentation often depend on edge detection, template matching techniques, statistical shape models, active contours, machine learning, etc. Zhao et al. [13] proposed a new mathematical morphology edge detection algorithm for lung CT images. Lalonde et al. [14] applied Hausdorff-based template matching to optic disc detection, and Chen et al. [15] also employed template matching to perform ventricular segmentation in brain CT images. Tsai et al. [16] proposed a shape-based approach using level sets for 2D segmentation of cardiac MRI images and 3D segmentation of prostate MRI images. Li et al. [17] used the active contour model to segment liver tumours from abdominal CT images, while Li et al. [18] proposed a framework for medical body data segmentation by combining level sets and support vector machines (SVMs). Held et al. [19] applied Markov random fields (MRFs) to brain MRI image segmentation. Although a large number of approaches have been reported and are successful in certain circumstances, image segmentation remains one of the most challenging topics in computer vision due to the difficulty of feature representation. In particular, it is more difficult to extract discriminative features from medical images than from normal RGB images, since the former often suffer from blur, noise and low contrast. Due to the rapid development of deep learning techniques [20], medical image segmentation no longer requires hand-crafted features: convolutional neural networks (CNNs) successfully achieve hierarchical feature representation of images and have therefore become the hottest research topic in image processing and computer vision. Since CNN-based feature learning is insensitive to image noise, blur and contrast, CNNs provide excellent segmentation results for medical images.

It is worth mentioning that there are currently two categories of image segmentation tasks: semantic segmentation and instance segmentation. Image semantic segmentation is a pixel-level classification that assigns a corresponding category to each pixel in an image. Compared to semantic segmentation, instance segmentation not only needs to achieve pixel-level classification but also to distinguish individual instances within a category. In fact, there are few reports on instance segmentation in medical image segmentation, since each organ or tissue is quite distinctive. We review the advances of deep learning techniques for medical image segmentation.

According to the amount of labelled data required, machine learning is often categorised into supervised learning, weakly supervised learning, and unsupervised learning. The advantage of supervised learning is that models can be trained on carefully labelled data, but it is difficult to obtain a large amount of labelled data for medical images. On the contrary, labelled data are not required for unsupervised learning, but the difficulty of learning is increased. Weakly supervised learning lies between supervised and unsupervised learning, since it requires only a small part of the data to be labelled while most of the data remain unlabelled.

Prior to the widespread application of deep learning, researchers had presented many model-driven approaches to medical image segmentation. Masood et al. [21] made a comprehensive summary of many model-driven techniques in medical image analysis, including image clustering, region growing, and random forests. In [21], the authors summarised different segmentation approaches for medical images according to their underlying mathematical models. Recently, only a few studies based on model-driven techniques have been reported, whereas more and more data-driven studies have appeared for medical image segmentation. This paper mainly focuses on the evolution and development of deep learning models for medical image segmentation.

In [22], Shen et al. presented a special review of the application of deep learning in medical image analysis. This review summarises the progress of machine learning and deep learning in medical image registration, anatomy and cell structure detection, tissue segmentation, and computer-aided disease diagnosis and prognosis. Litjens et al. [23] reported a survey of deep learning methods covering the use of deep learning in image classification, object detection, segmentation, registration and other tasks.

More recently, Taghanaki et al. [24] discussed the development of semantic and medical image segmentation; they categorised deep learning-based image segmentation solutions into six groups, that is, deep architectural, data synthesis-based, loss function-based, sequenced models, weakly supervised, and multi-task methods. To develop a more complete survey on medical image segmentation, Seo et al. [25] reviewed classical machine learning algorithms such as Markov random fields, k-means clustering and random forests, as well as the latest deep learning architectures such as artificial neural networks (ANNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). Tajbakhsh et al. [26] reviewed solutions for medical image segmentation with imperfect data sets, covering two major data set limitations: scarce annotations and weak annotations. All these surveys play an important role in the development of medical image segmentation techniques. Hesamian et al. [27] presented a review on three aspects: approaches (network structures), training techniques, and challenges. The network structures section describes the main, popular network structures used for image segmentation; the training techniques section discusses the techniques used to train deep neural network models; and the challenges section describes the various challenges associated with medical image segmentation using deep learning. Meyer et al. [28] reviewed the advances in the application, or potential application, of deep learning to radiotherapy. Akkus et al. [29] provided an overview of current deep learning-based segmentation approaches for quantitative brain MRI. Zhou et al. [30] focused on three typical types of weak supervision: incomplete supervision, inexact supervision and inaccurate supervision. Eelbode et al. [31] focused on evaluating and summarising the optimisation methods used in medical image segmentation tasks based primarily on Dice scores or Jaccard indices.

Through studying the aforementioned surveys, researchers can learn the latest techniques of medical image segmentation and then make more significant contributions to computer-aided diagnosis and smart healthcare. However, these surveys suffer from two problems. One is that most of them chronologically summarise the development of medical image segmentation and thus ignore the technical branches of deep learning for medical image segmentation. The other is that these surveys only introduce the related technical development but do not focus on the task characteristics of medical image segmentation, such as few-shot learning and imbalanced learning, which limits task-driven improvement of medical image segmentation. To address these two problems, we present a novel survey on medical image segmentation using deep learning. In this work, we make the following contributions:

1. We summarise the technical branches of deep learning for medical image segmentation from coarse to fine, as shown in Figure 1. The summary covers two aspects: supervised learning and weakly supervised learning. The latest applications of neural architecture search (NAS), graph convolutional networks (GCN), multi-modality data fusion and medical transformers in medical image analysis are also discussed. Compared to previous surveys, our survey follows conceptual developments and is believed to be clearer.
2. On supervised learning approaches, we analyse the literature from three aspects: the selection of backbone networks, the design of network blocks, and the improvement of loss functions. This classification can help subsequent researchers understand more deeply the motivations and improvement strategies of medical image segmentation networks. For weakly supervised learning, we also review the literature from three aspects for processing few-shot data or class-imbalanced data: data augmentation, transfer learning, and interactive segmentation. This organisation is expected to be more conducive to researchers in finding innovations for improving the accuracy of medical image segmentation.
3. In addition to comprehensively reviewing the development and application of deep learning in medical image segmentation, we also collect the currently common public medical image segmentation data sets. Finally, we discuss future research trends and directions in this field.

FIGURE 1 An overview of deep learning methods on medical image segmentation

The rest of this paper is organised as follows. In Section II, we review the development and evolution of supervised learning applied to medical images, including the selection of backbone networks, the design of network blocks, and the improvement of loss functions. In Section III, we introduce the application of unsupervised or weakly supervised methods in the field of medical image segmentation and analyse the common unsupervised or weakly supervised strategies for processing few-shot data or class-imbalanced data. In Section IV, we briefly introduce some of the most advanced methods of medical image segmentation, including NAS, the application of GCN, and multi-modality data fusion. In Section V, we collect the currently available public medical image segmentation data sets, and summarise the limitations of current deep learning methods and future research directions.

2 SUPERVISED LEARNING

For medical image segmentation tasks, supervised learning is the most popular method since these tasks usually require high accuracy. In this section, we focus on the review of improvements of neural network architectures. These improvements mainly concern network backbones, network blocks and the design of loss functions. Figure 2 shows an overview of the improvement of network architectures based on supervised learning.

2.1 Backbone networks

Image semantic segmentation aims to achieve pixel classification of an image. For this goal, researchers proposed the encoder–decoder structure, one of the most popular end-to-end architectures, such as the fully convolutional network (FCN) [32], U-Net [7], and Deeplab [33]. In these structures, an encoder is often used to extract image features while a decoder is often used to restore the extracted features to the original image size and output the final segmentation results. Although the end-to-end structure is pragmatic for medical image segmentation, it reduces the interpretability of models. The first high-impact encoder–decoder structure, the U-Net proposed by Ronneberger et al. [7], has been widely used for medical image segmentation. Figure 3 shows the U-Net architecture.

U-Net: The U-Net solves problems of general CNNs used for medical image segmentation, since it adopts a perfectly symmetric structure and skip connections. Different from common image segmentation, medical images usually contain noise and show blurred boundaries. Therefore, it is very difficult to detect or recognise objects in medical images depending only on low-level image features. Meanwhile, it is also impossible to

obtain accurate boundaries depending only on image semantic features, owing to the lack of image detail information. The U-Net, however, effectively fuses low-level and high-level image features by combining low-resolution and high-resolution feature maps through skip connections, which is a perfect solution for medical image segmentation tasks. Currently, the U-Net has become the benchmark for most medical image segmentation tasks and has inspired many meaningful improvements.

FIGURE 2 An overview of network architectures based on supervised learning

FIGURE 3 The U-Net architecture [7]
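To make the encoder–decoder idea concrete, the following is a minimal U-Net-style sketch in PyTorch. It is a simplified reading of the architecture in Figure 3, not the authors' original configuration: the channel widths, the two resolution levels, and the padded (rather than cropped) skip connections are all illustrative choices.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 conv + ReLU layers, as in each U-Net stage
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)
        self.enc2 = double_conv(64, 128)
        self.bottleneck = double_conv(128, 256)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = double_conv(256, 128)   # 128 (skip) + 128 (upsampled)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)    # 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                   # high-resolution encoder features
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                # per-pixel class scores

logits = TinyUNet()(torch.randn(1, 1, 128, 128))  # -> (1, 2, 128, 128)
```

The two torch.cat calls are the skip connections that fuse high-resolution encoder detail with up-sampled decoder semantics, which is the property the survey highlights above.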
3D Net: In practice, most medical data, such as CT and MRI images, exist in the form of 3D volumes, and 3D convolution kernels can better mine the high-dimensional spatial correlations of such data. Motivated by this idea, Çiçek et al. [34] extended the U-Net architecture to 3D data and proposed 3D U-Net, which deals with 3D medical data directly. Due to the limitation of computational resources, the 3D U-Net includes only three down-samplings, which cannot effectively extract deep-layer image features, leading to limited segmentation accuracy for medical images. In addition, Milletari et al. [35] proposed a similar architecture, V-Net, as shown in Figure 4.

FIGURE 4 The V-Net architecture [35]

It is well known that residual connections can avoid vanishing gradients and accelerate network convergence, thus making it easy to design deeper network structures that provide better feature representation. Compared to 3D U-Net, V-Net employs residual connections to design a deeper network (four down-samplings) and thus achieves higher performance. Similarly, by applying residual connections to 3D networks, Yu et al. [36] presented VoxResNet, Lee et al. [37] proposed 3D RU-Net, and Xiao et al. [38] proposed Res-UNet. However, these 3D networks encounter the same problems of high computational cost and GPU memory usage caused by a very large number of parameters.

Recurrent Neural Network (RNN): RNNs were initially designed to deal with sequence problems. The long short-term memory (LSTM) network [39] is one of the most popular RNNs: it can retain the gradient flow for a long time by introducing a self-loop. For medical image segmentation, RNNs have been used to model the time dependence of image sequences. Alom et al. [40] proposed a medical image segmentation method that combines ResUNet with an RNN. The method achieves feature accumulation with recurrent residual convolutional layers, which improves feature representation for image segmentation tasks. Figure 5 shows the recurrent residual convolutional unit.

FIGURE 5 The recurrent residual convolutional unit [40]

Gao et al. [41] combined LSTM and CNN to model the temporal relationship between different brain MRI slices to improve segmentation accuracy. Bai et al. [42] combined FCN with RNN to mine the spatiotemporal information for aortic sequence segmentation. Clearly, RNNs can capture local and global spatial features of images by considering contextual relationships. However, in medical image segmentation, capturing complete and valid temporal information requires good medical image quality (e.g. smaller slice thickness and pixel spacing). Therefore, RNN designs are uncommon for improving the performance of medical image segmentation.
Skip Connection: Although the skip connection can fuse low-resolution and high-resolution information and thus improve feature representation, it suffers from a large semantic gap between low- and high-resolution features, leading to blurred feature maps. To improve the skip connection, Ibtehaz et al. [43] proposed MultiResUNet, which includes the Residual Path (ResPath): the encoder features undergo some additional convolution operations before being fused with the corresponding features in the decoder. Seo et al. [44] proposed mU-Net and Chen et al. [45] proposed FED-Net. Both mU-Net and FED-Net add convolution operations to the skip connection to improve the performance of medical image segmentation.
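The ResPath idea can be sketched in a few lines: instead of passing encoder features straight to the decoder, they first traverse a few residual convolutions. This is a hedged approximation of the block in MultiResUNet [43]; the number of steps and the channel widths are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class ResPath(nn.Module):
    """Extra residual conv steps on the skip path to narrow the
    encoder-decoder semantic gap (in the spirit of MultiResUNet [43])."""
    def __init__(self, ch, n_steps=4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=1) for _ in range(n_steps)])
        self.shortcuts = nn.ModuleList(
            [nn.Conv2d(ch, ch, 1) for _ in range(n_steps)])

    def forward(self, x):
        for conv, sc in zip(self.convs, self.shortcuts):
            x = torch.relu(conv(x) + sc(x))  # residual 3x3 plus 1x1 shortcut
        return x  # refined encoder features, later concatenated in the decoder

refined = ResPath(64)(torch.randn(1, 64, 128, 128))
```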
Cascade of 2D and 3D: For image segmentation tasks, the cascade model often trains two or more models to improve segmentation accuracy. This method is especially popular in medical image segmentation. Cascade models can be broadly divided into three types of frameworks: coarse-fine segmentation, detection segmentation, and mixed segmentation. The first type, the coarse-fine segmentation framework, uses a cascade of two 2D networks, where the first network performs coarse segmentation and a second network achieves fine segmentation based on the previous coarse result; a minimal inference sketch is given after this paragraph. Christ et al. [46] proposed a cascaded network for liver and liver-tumour segmentation. This network first uses an FCN to segment the liver, and then uses the liver segmentation result as the input of a second FCN for liver-tumour segmentation. Yuan et al. [47] first trained a simple convolutional–deconvolutional neural network (CDNN) model (a 19-layer FCN) to provide rapid but coarse liver segmentation over the entire CT volume, and then applied another CDNN (a 29-layer FCN) to the liver region for fine-grained liver segmentation. Finally, the liver segmentation region enhanced by histogram equalisation is treated as an additional input to a third CDNN (a 29-layer CNN) for liver-tumour segmentation. Other networks using the coarse-fine segmentation framework can be found in [48] [49] [50].
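The coarse-fine cascade reduces to the following inference sketch. It assumes two already-trained, fully convolutional networks that return foreground probabilities, a single-channel input, batch size 1, and a hypothetical 0.5 threshold; it is an illustration of the framework, not any of the cited implementations.

```python
import torch

def cascade_inference(coarse_net, fine_net, image):
    """Two-stage coarse-to-fine inference: stage 1 yields a rough mask whose
    bounding box crops the input of the fine stage-2 network."""
    with torch.no_grad():
        coarse = coarse_net(image) > 0.5            # stage 1: rough binary mask
        ys, xs = torch.nonzero(coarse[0, 0], as_tuple=True)
        if ys.numel() == 0:                         # nothing found: keep stage 1
            return coarse.float()
        y0, y1 = int(ys.min()), int(ys.max()) + 1   # bounding box of the organ
        x0, x1 = int(xs.min()), int(xs.max()) + 1
        fine = torch.zeros_like(image)
        # stage 2: fine probabilities inside the region of interest only
        fine[:, :, y0:y1, x0:x1] = fine_net(image[:, :, y0:y1, x0:x1])
    return fine
```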
At the same time, the detection segmentation framework is also popular: first, a network model such as R-CNN [51] or You Only Look Once (YOLO) [52] is used for target location identification, and then another network performs further detailed segmentation based on the previous coarse result. Al-Antari et al. [53] proposed such an approach for breast mass detection, segmentation and classification in mammograms. In this work, the first step uses the regional deep learning method YOLO for target detection; the second step inputs the detected targets into a newly designed full-resolution convolutional network (FrCN) for segmentation; and finally, a deep convolutional neural network identifies the masses and classifies them as benign or malignant. Similarly, Tang et al. [47] used Faster R-CNN [54] and Deeplab [55] cascades for localisation segmentation of the liver. In addition, both Salehi et al. [56] and Yan et al. [57] proposed cascade networks for whole-brain MRI and high-resolution mammogram segmentation. This kind of cascade network can extract richer multi-scale context information than normal cascade networks by using the posterior probabilities generated by the first network.

However, most medical images are 3D volume data: a 2D convolutional neural network cannot learn temporal information in the third dimension, while a 3D convolutional neural network often requires high computational cost and severe GPU memory consumption. Therefore, some pseudo-3D segmentation methods have been proposed. Oda et al. [58] proposed a three-plane method of cascading three networks to segment the abdominal artery region effectively from medical CT volumes. Vu et al. [59] applied an overlay of adjacent slices as input for the central slice prediction, and then fed the obtained 2D feature map into a standard 2D network for model training. Although these pseudo-3D approaches can segment objects from 3D volume data, they obtain only limited accuracy improvement because they use merely local temporal information.

Compared to pseudo-3D networks, hybrid cascades of 2D and 3D networks are more popular. Li et al. [60] proposed a hybrid densely connected U-Net (H-DenseUNet) for liver and liver-tumour segmentation. This method first employs a simple ResNet to obtain a rough liver segmentation result, utilises a 2D DenseUNet to extract 2D image features effectively, then uses a 3D DenseUNet to extract 3D image features, and finally designs a hybrid feature fusion layer to jointly optimise the 2D and 3D features. Although the H-DenseUNet reduces model complexity compared to an entirely 3D network, the model is still complex and suffers from a large number of parameters from the 3D convolutions. For this problem, Zhang et al. [61] proposed a lightweight hybrid convolutional network (LW-HCN) with a structure similar to the H-DenseUNet, but requiring fewer parameters and less computation thanks to the design of the depthwise and spatiotemporal separate (DSTS) block and the use of 3D depthwise separable convolution. Similarly, Dey et al. [62] also designed a cascade of 2D and 3D networks for liver and liver-tumour segmentation. Obviously, among the three types of cascade networks mentioned above, the hybrid 2D and 3D cascade network can effectively improve segmentation accuracy and reduce the learning burden.

In contrast to the above cascade networks, Valanarasu et al. [63] proposed a complete cascade network, namely KiU-Net, to perform brain dissection segmentation. The performance of the vanilla U-Net is greatly degraded when detecting smaller anatomical structures with fuzzy noise boundaries. To overcome this problem, the authors designed a novel over-complete architecture, Ki-Net, in which the spatial size of the intermediate layers is larger than that of the input data; this is achieved by

using an up-sampling layer after each convolution layer in the encoder. Thus, the proposed Ki-Net possesses stronger edge-capture capability than U-Net, and it is finally cascaded with the vanilla U-Net to improve the overall segmentation accuracy. Since the KiU-Net can exploit both the low-level fine-edge feature maps from Ki-Net and the high-level shape feature maps from U-Net, it not only improves segmentation accuracy but also achieves fast convergence for small anatomical landmarks and blurred noisy boundaries.
Others: Generative adversarial networks (GANs) [64] have been widely used in many areas of computer vision. In their infancy, GANs were often used for data augmentation by generating new samples, which will be reviewed in Section III, but researchers later discovered that the adversarial idea could be used in almost any field, and it was therefore also applied to image segmentation. Since medical images usually show low contrast and blurred boundaries between different tissues or between tissues and lesions, and labelled medical image data are sparse, U-Net-based segmentation methods that use a pixel loss to learn local and global relationships between pixels are not sufficient for medical image segmentation, and the use of generative adversarial networks has become a popular idea for improving image segmentation. Luc et al. [65] first applied the generative adversarial network to image segmentation, where the generative network serves as the segmentation model and the adversarial network is trained as a classifier. Singh et al. [66] proposed a conditional generative adversarial network (cGAN) to segment breast tumours within a region of interest (ROI) in mammograms. The generative network learns to identify tumour regions and generates segmentation results, while the adversarial network learns to distinguish ground truth from the segmentation results of the generative network, thereby enforcing the generative network to produce outputs as realistic as possible. The cGAN works well when the number of training samples is limited. Conze et al. [67] utilised cascaded pre-trained convolutional encoder–decoders as the generator of a cGAN for abdominal multi-organ segmentation, and used the adversarial network as a discriminator to enforce the model to create realistic organ delineations.
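The adversarial scheme described above can be sketched as one training step. This is an illustrative reading of the cGAN-style setup in [65, 66], not the original implementations: the 0.1 adversarial weight, the BCE objectives, and the assumption that both networks output probabilities are all hypothetical choices.

```python
import torch
import torch.nn as nn

def adversarial_step(seg_net, disc, image, mask, adv_weight=0.1):
    """One adversarial segmentation update: the discriminator judges
    (image, mask) pairs; the segmentation network is additionally
    rewarded for fooling it."""
    bce = nn.BCELoss()
    pred = seg_net(image)                                # predicted mask probs
    d_real = disc(torch.cat([image, mask], dim=1))
    d_fake = disc(torch.cat([image, pred.detach()], dim=1))
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))       # discriminator objective
    d_seg = disc(torch.cat([image, pred], dim=1))
    g_loss = bce(pred, mask) + \
             adv_weight * bce(d_seg, torch.ones_like(d_seg))  # pixel + adversarial
    return g_loss, d_loss   # backpropagate each into its own optimiser
```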
In addition, the incorporation of prior knowledge about organ shape and position may be crucial for improving medical image segmentation in cases where images are corrupted and thus contain artefacts due to limitations of imaging techniques. However, there are few works on how to incorporate prior knowledge into CNN models. As one of the earliest studies in this field, Oktay et al. [68] proposed a novel and general method to combine a priori knowledge of shape and label structure into anatomically constrained neural networks (ACNN) for medical image analysis tasks. In this way, the neural network training process can be constrained and guided to make more anatomically meaningful predictions, especially where the input image data are not sufficiently informative or consistent (e.g. missing object boundaries). Similarly, Boutillon et al. [69] incorporated anatomical priors into a conditional adversarial framework for scapula bone segmentation, combining shape priors with conditional neural networks to encourage models to follow global anatomical properties in terms of shape and position, and to make segmentation results as accurate as possible. These studies show that improved models can provide higher segmentation accuracy and are more robust, since prior knowledge constraints are employed in the training process.

After the proposal of U-Net [7], the encoder–decoder structure became the most popular structure in medical image segmentation. The design of the network backbone focuses on more efficient feature extraction in the encoder and on feature recovery and fusion in the decoder, in order to improve segmentation accuracy.

2.2 Network function block

2.2.1 Dense connection

Dense connection is often used to construct a special kind of convolutional neural network. In densely connected networks, the input of each layer comes from the outputs of all previous layers during forward propagation. Inspired by the dense connection, Guan et al. [70] proposed an improved U-Net by replacing each sub-block of U-Net with a form of dense connection, as shown in Figure 6. Although the dense connection is helpful for obtaining richer image features, it often reduces the robustness of the feature representation to a certain extent and increases the number of parameters.

FIGURE 6 Dense connection architecture [70]
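A dense block of the kind shown in Figure 6 can be written compactly: every layer consumes the concatenation of all earlier outputs, which also makes the parameter growth discussed above directly visible. The growth rate and depth below are illustrative values, not those of [70].

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenated outputs of all previous layers."""
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, 3, padding=1),
            ))
            ch += growth  # the input width grows with every layer

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

out = DenseBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64 + 4*32, 32, 32)
```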
FIGURE 7 The U-Net++ architecture [71]

Zhou et al. [71] connected all U-Net layers (from one to four) together, as shown in Figure 7. The advantage of this structure is that it allows the network to learn automatically the

importance of features at different layers. Besides, the skip connection is redesigned so that features with different semantic scales can be aggregated in the decoder, resulting in a highly flexible feature fusion scheme. The disadvantage is that the number of parameters is increased by the dense connections. Therefore, a pruning method is integrated into model optimisation to reduce the number of parameters, while deep supervision [72] is employed to balance the decline of segmentation accuracy caused by the pruning.

2.2.2 Inception

For CNNs, deep networks often give better performance than shallow ones, but they encounter new problems such as vanishing gradients, difficulty of network convergence, and large memory usage. The inception structure overcomes these problems: it gives better performance by merging convolution kernels in parallel without increasing the depth of the network. This structure is able to extract richer image features using multi-scale convolution kernels and to perform feature fusion to obtain a better feature representation. Inspired by GoogLeNet [73] [74], Gu et al. [75] proposed CE-Net by introducing the inception structure into medical image segmentation. The CE-Net adds atrous convolution to each parallel structure to extract features over a wide receptive field, and adds 1 × 1 convolutions of the feature maps; Figure 8 shows the architecture of the inception block. However, the inception structure is complex, which makes model modification difficult.

FIGURE 8 The inception architecture [75]. It contains four cascade branches with a gradual increment of the number of atrous convolutions, from 1 to 1, 3, and 5, so the receptive field of each branch is 3, 7, 9, and 19. Therefore, the network can extract features from different scales
2.2.3 Depth separability

To improve the generalisation capability of network models and to reduce memory usage, many researchers have focused on lightweight networks for complex medical 3D volume data. Howard et al. [76] proposed MobileNet, which decomposes the vanilla convolution into a depthwise convolution and a pointwise convolution. The cost of a vanilla convolution is usually DK × DK × M × N, where M is the number of input feature map channels, N is the number of output feature map channels, and DK is the size of the convolution kernel. In contrast, the cost of the depthwise (per-channel) convolution is DK × DK × 1 × M and that of the pointwise convolution is 1 × 1 × M × N. Compared to the vanilla convolution, the computational cost of the depthwise separable convolution is therefore (1/N + 1/DK²) times that of the vanilla convolution. Based on this, Sandler et al. [77] proposed MobileNet-V2, which contains a novel layer module: the inverted residual with linear bottleneck. In this module, the input is a low-dimensional compressed representation that is first expanded to a high dimension and then filtered with a lightweight depthwise convolution; features are subsequently projected back to a low-dimensional representation with a linear convolution, which significantly reduces the memory footprint needed during inference. By extending the depthwise separable convolution to the design of 3D networks, Lei et al. [78] proposed a lightweight V-Net (LV-Net) with fewer operations than V-Net for liver segmentation. Besides, Zhang et al. [61] and Huang et al. [79] also applied depthwise separable convolutions to the segmentation of 3D medical volume data. Other related works on lightweight deep networks can be found in [80] [81]. Depthwise separable convolution is an effective way to reduce the number of model parameters, but it may result in a loss of accuracy in medical image segmentation, and thus other approaches (e.g. deep supervision) [78] need to be employed to maintain segmentation accuracy.
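The parameter saving is easy to verify numerically. The sketch below builds both factorisations in PyTorch and reproduces the (1/N + 1/DK²) ratio for the illustrative values M = 64, N = 128 and DK = 3.

```python
import torch.nn as nn

M, N, K = 64, 128, 3  # input channels, output channels, kernel size

vanilla = nn.Conv2d(M, N, K, padding=1)
depthwise = nn.Conv2d(M, M, K, padding=1, groups=M)  # one filter per channel
pointwise = nn.Conv2d(M, N, 1)                       # 1x1 channel mixing

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(vanilla))                          # K*K*M*N + N biases = 73856
print(n_params(depthwise) + n_params(pointwise))  # K*K*M + M + M*N + N = 8960
# 8960 / 73856 is roughly 0.12, close to 1/N + 1/K**2, as in the MobileNet analysis
```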

2.2.4 Attention mechanism

For neural networks, an attention block can selectively change the input or assign different weights to input variables according to their importance. In recent years, most research combining deep learning with visual attention mechanisms has focused on using masks to form the attention. The principle of a mask is to design a new layer that can identify the key features of an image through training and learning, and then let the network focus only on the interesting areas of the image.

Local Spatial Attention: The spatial attention block aims to calculate the feature importance of each pixel in the spatial domain and to extract the key information of an image. Jaderberg et al. [82] early on proposed the spatial transformer network (ST-Net) for image classification, using spatial attention to transform the spatial information of an original image into another space while retaining the key information. Normal pooling is equivalent to an information merge, which easily causes the loss of key information. For this problem, a block called the spatial transformer is

designed to extract the key information of images by performing a spatial transformation. Inspired by this, Oktay et al. [83] proposed Attention U-Net. The improved U-Net uses an attention block to change the output of the encoder before fusing the features from the encoder and the corresponding decoder. The attention block outputs a gating signal to control the feature importance of pixels at different spatial positions. Figure 9 shows the architecture. This block combines ReLU and sigmoid functions via 1 × 1 convolutions to generate a weight map, which corrects the encoder features by multiplication.

FIGURE 9 The attention block in the attention U-Net [83]
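A simplified version of this attention gate is sketched below. Unlike the original block in [83], it assumes that the encoder features and the gating signal already share the same spatial size, so the resampler shown in Figure 9 is omitted; channel widths are illustrative.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Gate encoder features with a decoder signal (after [83], simplified)."""
    def __init__(self, enc_ch, gate_ch, mid_ch):
        super().__init__()
        self.w_enc = nn.Conv2d(enc_ch, mid_ch, 1)
        self.w_gate = nn.Conv2d(gate_ch, mid_ch, 1)
        self.psi = nn.Sequential(nn.Conv2d(mid_ch, 1, 1), nn.Sigmoid())

    def forward(self, enc, gate):
        a = torch.relu(self.w_enc(enc) + self.w_gate(gate))  # combine, then ReLU
        return enc * self.psi(a)   # pixel-wise weight map rescales encoder features

enc, gate = torch.randn(1, 64, 32, 32), torch.randn(1, 128, 32, 32)
out = AttentionGate(64, 128, 32)(enc, gate)
```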
Channel Attention: The channel attention block achieves feature recalibration: it utilises learned global information to selectively emphasise useful features and suppress useless ones. Hu et al. [84] proposed SE-Net, which introduced channel attention to the field of image analysis and won the ImageNet Challenge in 2017. The method implements attention weighting on channels in three steps; Figure 10 shows the architecture. The first is the squeeze operation: global average pooling is performed on the input features to obtain a 1 × 1 × Channel feature map. The second is the excitation operation: the channel features interact to reduce the number of channels, and the reduced channel features are then reconstructed back to the original number of channels. Finally, a sigmoid function generates a feature weight map in [0, 1], which is multiplied channel by channel back onto the original input features. Chen et al. [45] proposed FED-Net, which uses the SE block to achieve feature channel attention.

FIGURE 10 The channel attention in the SE-Net [84]
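The three steps of the SE block translate almost directly into code. The reduction ratio r = 16 is the common choice reported for SE-Net [84]; here it is simply an assumed default.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (after [84])."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pool -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)  # excitation: per-channel weights in [0, 1]
        return x * w                     # recalibrate the input features

y = SEBlock(64)(torch.randn(2, 64, 32, 32))
```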
Mixture Attention: Spatial and channel attention mechanisms are two popular strategies for improving feature representation. However, spatial attention ignores the differences between channels and treats each channel equally. On the contrary, channel attention pools global information directly while ignoring local information within each channel, which is a relatively rough operation. Therefore, combining the advantages of the two attention mechanisms, researchers have designed many models based on mixed-domain attention blocks. Kaul et al. [85] proposed FocusNet, using a mixture of spatial attention and channel attention for medical image segmentation, where an SE block is used for channel attention and a branch of spatial attention is designed. Besides, other related works can be found in [80] [81].

To improve the discriminant feature representation of networks, Wang et al. [86] embedded an attention block inside the central bottleneck between the contraction path and the expansion path of the U-Net, and proposed ScleraSegNet. Furthermore, they compared the performance of channel attention, spatial attention, and different combinations of the two for medical image segmentation, and concluded that channel-centric attention was the most effective in improving segmentation performance. Based on this conclusion, they won the championship of the sclera segmentation benchmarking competition (SSBC2019).
Although the attention mechanisms mentioned above improve the final segmentation performance, they only perform local convolution operations: such an operation focuses on the area of neighbouring convolution kernels but misses the global information. In addition, the down-sampling operation leads to a loss of spatial information, which is especially unfavourable for biomedical image segmentation. A basic solution is to extract long-distance information by stacking multiple layers, but this is inefficient due to a large number of parameters and high computational cost. In the decoder, the up-sampling, the deconvolution and the interpolation are also performed by way of local convolution.

Non-local Attention: Recently, Wang et al. [87] proposed a non-local U-Net to overcome the drawback of local convolution for medical image segmentation. The non-local U-Net employs a self-attention mechanism and a global aggregation block to extract full-image information during both up-sampling and down-sampling, which improves the final segmentation accuracy. Figure 11 shows the global aggregation block. The non-local block is a general-purpose block that can easily be embedded in different convolutional neural networks to improve their performance.

It can be seen that the attention mechanism is effective for improving image segmentation accuracy. In fact, spatial attention looks for interesting target regions while channel attention looks for interesting features, and a mixed attention mechanism can take advantage of both. However, compared with non-local attention, the conventional attention mechanism lacks the ability to exploit the associations between different targets and features, so CNNs based on non-local attention usually exhibit better performance than normal CNNs for image segmentation tasks.
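A minimal embedded-Gaussian non-local block, in the spirit of [87], looks as follows: every output position attends to all input positions, which is exactly the long-range dependency that stacked local convolutions struggle to provide. The halved embedding width and the residual form are conventional assumptions.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Self-attention over all spatial positions (sketch, after [87])."""
    def __init__(self, ch):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 2, 1)  # query embedding
        self.phi = nn.Conv2d(ch, ch // 2, 1)    # key embedding
        self.g = nn.Conv2d(ch, ch // 2, 1)      # value embedding
        self.out = nn.Conv2d(ch // 2, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (b, hw, c/2)
        k = self.phi(x).flatten(2)                    # (b, c/2, hw)
        v = self.g(x).flatten(2).transpose(1, 2)      # (b, hw, c/2)
        attn = torch.softmax(q @ k, dim=-1)           # pairwise affinities (b, hw, hw)
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.out(y)                        # residual connection

y = NonLocalBlock(64)(torch.randn(1, 64, 16, 16))
```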

FIGURE 11 The global aggregation block in the non-local U-Net [87]

2.2.5 Multi-scale information fusion

One of the challenges in medical image segmentation is the large range of scales among objects. For example, a tumour in the middle or late stage could be much larger than one in the early stage. The size of the receptive field roughly determines how much context information can be used, yet general convolution or pooling employs only a single kernel, for instance a 3 × 3 kernel for convolution and a 2 × 2 kernel for pooling.

Pyramid Pooling: The parallel operation of multi-scale pooling can effectively improve the context information of networks and thus extract richer semantic information. He et al. [88] first proposed spatial pyramid pooling (SPP) to achieve multi-scale feature extraction. The SPP divides an image from fine space to coarse space, then gathers local features and extracts multi-scale features. Inspired by the SPP, a multi-scale information extraction block named residual multi-kernel pooling (RMP) [75] was designed, which uses four pooling kernels of different sizes to encode global context information. However, the up-sampling operation in RMP cannot restore the loss of detail information caused by pooling, which usually enlarges the receptive field but reduces the image resolution.
Atrous Spatial Pyramid Pooling: In order to reduce the loss of detail information caused by pooling, researchers proposed atrous convolution instead of the pooling operation. Compared with the vanilla convolution, the atrous convolution can effectively enlarge the receptive field without increasing the number of parameters. Combining the advantages of atrous convolution and the SPP block, Chen et al. [55] proposed the atrous spatial pyramid pooling (ASPP) module to improve image segmentation results. The ASPP shows a strong recognition capability for the same object at different scales. Similarly, Lopez et al. [89] and Lei et al. [90] applied a superposition of multi-scale atrous convolutions to brain-tumour segmentation and liver-tumour segmentation, respectively, which achieves a clear accuracy improvement.
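An ASPP-style module reduces to a handful of parallel dilated convolutions over the same input. The rates (1, 6, 12, 18) below follow common Deeplab configurations and are illustrative rather than prescribed by [55]; the 1×1 projection after concatenation is likewise an assumed fusion step.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel atrous convolutions with different dilation rates (after [55])."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding = dilation keeps the spatial size of every 3x3 branch identical
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

y = ASPP(256, 64)(torch.randn(1, 256, 32, 32))  # -> (1, 64, 32, 32)
```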
However, the ASPP suffers from two serious problems in image segmentation. The first is the loss of local information, as shown in Figure 12, where we assume that the convolutional kernel is 3 × 3 and the dilation rate is 2 for three iterations. The second is that the information could be irrelevant across large distances. How to simultaneously handle the relationships between objects of different scales is important for designing a good atrous convolutional network.

FIGURE 12 The gridding effect (treating images as a chessboard causes the loss of information continuity)

In response to the above problems, Wang et al. [91] designed hybrid dilated convolution (HDC) networks. This structure uses a sawtooth wave-like heuristic to allocate the dilation rates, so that information from a wider pixel range can be accessed and the gridding effect is suppressed. In [91], the authors gave several atrous convolution sequences with variable dilation rates, for example [1,2,3], [3,4,5], [1,2,5], [5,9,17], and [1,2,5,9].

Non-local and ASPP: The atrous convolution can efficiently enlarge the receptive field to collect richer semantic information, but it causes a loss of detail information due to the gridding effect. Therefore, it is necessary to add constraints or establish pixel associations to improve atrous convolution performance. Recently, Yang et al. [92] proposed a combined block of ASPP and non-local operations for the segmentation of human body parts, as shown in Figure 13. ASPP uses multiple parallel atrous convolutions at different scales to capture richer information, while the non-local operation captures a wide range of dependencies. This combination possesses the advantages of both ASPP and non-local operations, and it has good application prospects for medical image segmentation.

The network function block is designed to perform more efficient feature fusion: after features are extracted by the encoder, they are fused by the function block to enhance the feature representation, usually by fusing information at different scales or by a more efficient way of feature transfer. Then

the features are passed through the decoder to obtain a better segmentation result.

FIGURE 13 The combination of ASPP and non-local architecture [92]

2.3 Loss function

In addition to improving segmentation speed and accuracy by designing the network backbone and function blocks, designing new loss functions can also improve segmentation accuracy. Therefore, a great deal of work has been reported on the design of suitable loss functions for medical image segmentation tasks.

2.3.1 Cross entropy loss

For image segmentation tasks, the cross-entropy is one of the most popular loss functions. The function compares, pixel-wise, the predicted category vector with the real segmentation result vector. For the case of binary segmentation, let P(Y = 1) = p and P(Y = 0) = 1 − p; the prediction is given by the sigmoid function, with P(Ŷ = 1) = 1/(1 + e^(−x)) = p̂ and P(Ŷ = 0) = 1 − 1/(1 + e^(−x)) = 1 − p̂, where x is the output of the neural network. The cross-entropy loss is defined as

CE(p, p̂) = −(p log(p̂) + (1 − p) log(1 − p̂)). (1)

2.3.2 Weighted cross entropy loss

The cross-entropy loss treats each pixel of an image equally and outputs an average value, which ignores class imbalance and leads to the problem that the loss depends on the class containing the largest number of pixels. Therefore, the cross-entropy loss often shows low performance for small-target segmentation. To address the problem of class imbalance, Long et al. [32] proposed the weighted cross-entropy (WCE) loss to counteract class imbalance. For the case of binary segmentation, the weighted cross-entropy loss is defined as

WCE(p, p̂) = −(𝛽 p log(p̂) + (1 − p) log(1 − p̂)), (2)

where 𝛽 is used to tune the proportion of positive and negative samples, and it is an empirical value. If 𝛽 > 1, the number of false negatives will be decreased; on the contrary, the number of false positives will be decreased when 𝛽 < 1. In fact, the cross-entropy is a special case of the weighted cross-entropy with 𝛽 = 1. To adjust the weights of positive and negative samples simultaneously, we can use the balanced cross-entropy (BCE) loss function, defined as

BCE(p, p̂) = −(𝛽 p log(p̂) + (1 − 𝛽)(1 − p) log(1 − p̂)). (3)
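Equations (2) and (3) are one-liners in practice. The sketch below implements the balanced cross-entropy of Equation (3) with an assumed foreground weight 𝛽 = 0.9; the clamp guards against log(0).

```python
import torch

def balanced_bce(p_hat, p, beta=0.9, eps=1e-7):
    """Balanced cross-entropy of Equation (3): beta up-weights the positive
    (foreground) class and 1 - beta the background; beta is empirical."""
    p_hat = p_hat.clamp(eps, 1 - eps)  # avoid log(0)
    loss = -(beta * p * torch.log(p_hat)
             + (1 - beta) * (1 - p) * torch.log(1 - p_hat))
    return loss.mean()

p = (torch.rand(2, 1, 64, 64) > 0.9).float()   # sparse foreground mask
loss = balanced_bce(torch.rand(2, 1, 64, 64), p)
```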
In [7], Ronneberger et al. proposed U-Net, in which the cross-entropy loss function is improved by adding a distance term; the improved loss improves the model's ability to learn inter-class distances. The distance function is defined as

D(x) = 𝜔_0 · exp(−(d_1(x) + d_2(x))² / (2𝜎²)), (4)

where d_1(x) and d_2(x) denote the distances between the pixel x and the boundaries of the two nearest cells. The final loss function is then defined as

L = BCE(p, p̂) + D(x). (5)

2.3.3 Dice loss

The Dice score is a popular performance metric for the evaluation of medical image segmentation. It is essentially a measure of the overlap between a segmentation result and the corresponding ground truth, and it ranges from 0 to 1, where '1' means the segmentation result completely overlaps the real segmentation. It is defined as

Dice(A, B) = (2 × |A ∩ B|) / (|A| + |B|) × 100%, (6)

where A is a predicted segmentation result and B is the real segmentation result. For 3D medical volume data segmentation, Milletari et al. [35] proposed V-Net, which employs the Dice loss

DL(p, p̂) = 1 − 2⟨p, p̂⟩ / (‖p‖_1 + ‖p̂‖_1), (7)

where ⟨p, p̂⟩ represents the dot product of the ground truth and the prediction matrix for each channel.
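A soft binary Dice loss in the sense of Equations (6) and (7) can be implemented as follows. The small 𝜀 is an assumed smoothing constant that also keeps empty masks well-defined; pred is expected to hold probabilities, for example after a sigmoid.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for binary segmentation, per sample then averaged."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (denom + eps)   # overlap score in [0, 1]
    return 1 - dice.mean()                     # loss: lower is better

pred = torch.rand(4, 1, 64, 64)
target = (torch.rand(4, 1, 64, 64) > 0.5).float()
loss = dice_loss(pred, target)
```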
It is worth noting that the Dice loss is suitable for uneven samples. However, the Dice loss easily disturbs back-propagation and can make training difficult. Besides, Dice-loss-based training shows low robustness on metrics such

as the mean surface distance or the Hausdorff surface distance, owing to unstable gradient values. For example, the gradient of the softmax function can be simplified to (p − t), where t is the target value and p is the predicted value, whereas the gradient of the Dice loss is 2t²/(p + t)². If the values of p and t are too small, the gradient will change drastically, leading to training difficulty.
2.3.4 Tversky loss

Salehi et al. [93] proposed the Tversky loss (TL), a regularised version of the Dice loss that controls the contributions of both false positives and false negatives to the loss function. The TL is defined as

TL(p, p̂) = ⟨p, p̂⟩ / (⟨p, p̂⟩ + 𝛽⟨1 − p, p̂⟩ + (1 − 𝛽)⟨p, 1 − p̂⟩), (8)

where p ∈ {0, 1} and 0 ≤ p̂ ≤ 1 are the ground truth and the predicted segmentation, respectively. The TL is equivalent to (7) when 𝛽 = 0.5.
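Equation (8) maps directly onto true-positive, false-positive and false-negative terms. The sketch below uses an assumed 𝛽 = 0.7 (weighting false positives, with 1 − 𝛽 on false negatives, as in Equation (8)) and returns 1 − TL so that lower is better.

```python
import torch

def tversky_loss(pred, target, beta=0.7, eps=1e-6):
    """Tversky loss built from Equation (8); beta = 0.5 recovers the Dice form."""
    tp = (pred * target).sum(dim=(1, 2, 3))
    fp = (pred * (1 - target)).sum(dim=(1, 2, 3))   # <1 - p, p_hat> term
    fn = ((1 - pred) * target).sum(dim=(1, 2, 3))   # <p, 1 - p_hat> term
    tl = (tp + eps) / (tp + beta * fp + (1 - beta) * fn + eps)
    return 1 - tl.mean()
```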
2.3.5 Generalised dice loss

Although the Dice loss can solve the problem of class imbalance to a certain extent, it does not work for serious class imbalance. For instance, small targets suffer from prediction errors of some pixels, which easily causes a large change of Dice values. Sudre et al. [94] proposed a Generalised Dice Loss (GDL), which is defined as

GDL(p, p̂) = 1 − 2 Σ_{j=1}^{m} ω_j Σ_{i=1}^{n} p_{ij} p̂_{ij} ∕ Σ_{j=1}^{m} ω_j Σ_{i=1}^{n} (p_{ij} + p̂_{ij}), (9)

where the weight ω = [ω₁, ω₂, …, ω_m] is assigned to each class, and ω_j = 1 ∕ (Σ_{i=1}^{n} p_{ij})². The GDL is superior to the Dice loss since different areas make similar contributions to the loss, and the GDL is more stable and robust during the training process.
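The per-class weighting of Equation (9) can be sketched as follows, assuming one-hot targets and per-class probability maps; this is a hedged illustration rather than the reference implementation of [94]:

```python
import torch

def generalised_dice_loss(pred: torch.Tensor, target: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Generalised Dice loss of Equation (9).

    pred:   per-class probabilities, shape (N, C, ...)
    target: one-hot ground truth, same shape
    """
    pred = pred.flatten(2)                             # (N, C, voxels)
    target = target.flatten(2)
    w = 1.0 / (target.sum(dim=2) ** 2 + eps)           # omega_j per class
    intersection = (w * (pred * target).sum(dim=2)).sum(dim=1)
    denominator = (w * (pred + target).sum(dim=2)).sum(dim=1)
    return (1.0 - 2.0 * intersection / (denominator + eps)).mean()
```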
2.3.6 Boundary loss

To solve the problem of class imbalance, Kervadec et al. [95] proposed a new boundary loss used for brain lesion segmentation. This loss function aims to minimise the distance between segmented boundaries and labelled boundaries. The authors conducted experiments on two imbalanced data sets with labels. The results show that the combination of the Dice loss and the boundary loss is superior to a single one. The composite loss is defined as

L = αL_GD(θ) + (1 − α)L_B(θ), (10)

where the first part is a regularised Dice loss that is defined as

L_GD(θ) = 1 − 2(ω_G Σ_{p∈Ω} g(p)s_θ(p) + ω_B Σ_{p∈Ω} (1 − g(p))(1 − s_θ(p))) ∕ (ω_G Σ_{p∈Ω} [g(p) + s_θ(p)] + ω_B Σ_{p∈Ω} (2 − g(p) − s_θ(p))), (11)

and the second part is the boundary loss that is defined as

L_B(θ) = Σ_{p∈Ω} φ_G(p)s_θ(p), (12)

where φ_G(p) = −‖p − z_∂G(p)‖ if p ∈ G, otherwise φ_G(p) = ‖p − z_∂G(p)‖. Besides, Σ_Ω g(p)f(s_θ(p)) is used for the foreground and Σ_Ω (1 − g(p))(1 − f(s_θ(p))) is used for the background. The weights of L_GD(θ) are ω_G = 1 ∕ (Σ_{p∈Ω} g(p))² and ω_B = 1 ∕ (Σ_{p∈Ω} (1 − g(p)))². Ω represents the pixel set in the entire spatial domain.

2.3.7 Exponential logarithmic loss

In (9), the weighted Dice loss is actually the obtained Dice value divided by the sum of each label, which achieves a balance for objects with different scales. Therefore, by combining the focal loss [96] and the Dice loss, Wong et al. [97] proposed the exponential logarithmic loss (EXP loss) used for brain segmentation to solve the problem of serious class imbalance. With the introduction of the exponential form, the non-linearity of the loss functions can be further controlled to improve the segmentation accuracy. The EXP loss function is defined as

L_EXP = ω_dice × L_dice + ω_cross × L_cross, (13)

where two new parameter weights are denoted by ω_dice and ω_cross, respectively. The L_dice is an exponential log Dice loss, and the L_cross is a cross-entropy loss:

L_dice = E[(−ln(Dice_i))^γ_Dice], (14)

L_cross = E[ω_l(−ln(p_l(x)))^γ_cross], (15)

and

Dice_i = (2 Σ_x σ_il(x)p_i(x) + ε) ∕ (Σ_x (σ_il(x) + p_i(x)) + ε), (16)

ω_l = (Σ_k f_k ∕ f_l)^0.5, (17)

where x is the pixel position, i is the label and l is the ground-truth value at the position x. The p_i(x) is the probability value outputted from the softmax.
In (17), f_k is the frequency of occurrence of the label k; this parameter can reduce the influence of more frequently seen labels. Both γ_Dice and γ_cross are used to enhance the non-linearity of the loss function.
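Putting (13)-(17) together, a simplified sketch of the EXP loss reads as follows; the default weights and exponents are illustrative choices, and this is a hedged reading of [97] rather than the authors' implementation:

```python
import torch

def exp_log_loss(probs, target_onehot, label_freq, w_dice=0.8, w_cross=0.2,
                 gamma_dice=0.3, gamma_cross=0.3, eps=1e-6):
    """Exponential logarithmic loss of Equations (13)-(17) (simplified sketch).

    probs:         softmax outputs, shape (N, C, ...)
    target_onehot: one-hot labels, same shape
    label_freq:    per-class frequency f_k, shape (C,)
    """
    p = probs.flatten(2)
    g = target_onehot.flatten(2)
    dice = (2 * (p * g).sum(2) + eps) / ((p + g).sum(2) + eps)          # Eq. (16)
    l_dice = ((-torch.log(dice.clamp_min(eps))) ** gamma_dice).mean()   # Eq. (14)

    w_l = (label_freq.sum() / label_freq).sqrt()                        # Eq. (17)
    pix_w = (w_l.view(1, -1, 1) * g).sum(1)      # weight of each pixel's label
    p_true = (p * g).sum(1).clamp_min(eps)       # p_l(x): prob of the true label
    l_cross = (pix_w * (-torch.log(p_true)) ** gamma_cross).mean()      # Eq. (15)

    return w_dice * l_dice + w_cross * l_cross                          # Eq. (13)
```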
2.3.8 Loss improvements

For medical image segmentation, the improvement of losses mainly focuses on the problem of segmenting small objects in a large background (the problem of class imbalance). Chen et al. [98] proposed a new loss function by applying traditional active contour energy minimisation to convolutional neural networks; Li et al. [99] proposed a new regularisation term to improve the cross-entropy loss function; and Karimi et al. [100] proposed a loss function based on the Hausdorff distance (HD). Besides, there are still a lot of works [101] [102] trying to deal with this problem by adding penalties to loss functions or changing the optimisation strategy according to specific tasks.
In many medical image segmentation tasks, there are often only one or two targets in an image, and the pixel ratio of targets is sometimes small, which makes network training difficult. Therefore, to improve network training and segmentation accuracy, it is easier to focus on smaller targets by changing loss functions than by changing the network structure. However, the design of loss functions is highly task-specific, so we need to analyse task requirements carefully, and then design reasonable and available loss functions.

2.3.9 Deep supervision

In general, the increase of network depth can improve the feature representation of networks to some extent, but it simultaneously causes new problems such as vanishing gradients and gradient explosion. In order to train deep networks effectively, Lee et al. [72] proposed deeply supervised nets (DSNs) by adding some auxiliary branching classifiers to some layers of the neural network. Dou et al. [103] proposed a 3D DSN for heart and liver segmentation, which incorporates a 3D deep supervision mechanism into a 3D fully convolutional network for volume-to-volume learning and inference, eliminating redundant computation and reducing the risk of over-fitting in the case of limited training data. Similarly, Dou et al. [104] presented a method for fetal brain MRI cortical plate segmentation using a fully convolutional neural network architecture with deep supervision and residual connections, and obtained high segmentation accuracy. In fact, deep supervision not only constrains the discrimination and robustness of learned features at all stages, but also improves network training efficiency.
FIGURE 14 The weakly supervised learning methods for medical image segmentation (data augmentation: traditional and synthetic augmentation; transfer learning: pre-trained model and domain adaptation; interactive segmentation: DeepIGeoS, BIFSeg and GM interacting)

3 WEAKLY SUPERVISED LEARNING

Although convolutional neural networks show strong adaptability for medical image segmentation, segmentation results seriously depend on high-quality labels. In fact, it is rare to build many data sets with high-quality labels, especially in the field of medical image analysis, since data acquisition and labelling often incur high costs. Therefore, a lot of studies on incomplete or imperfect data sets have been reported. We summarise these studies as weakly supervised learning, as shown in Figure 14.

3.1 Data augmentation

In the absence of largely labelled data sets, data augmentation is an effective solution to this problem. However, general data expansion methods produce images that are highly correlated with the original images. Compared to common data augmentation approaches, the GAN proposed by Goodfellow [64] is currently a popular strategy for data augmentation since the GAN overcomes the problem of reliance on original data.
Traditional Methods: General data augmentation methods include the improvement of image quality such as noise suppression, the change of image intensity such as brightness, saturation and contrast, and the change of image layout such as rotation, distortion and scaling. Sirinukunwattana et al. [105] utilised the Gaussian blur to achieve data enhancement, which is helpful for performing gland segmentation tasks in colon tissue images. Dong et al. [106] randomly used the brightness enhancement function in 3D MR images to enrich training data for brain tumour segmentation. Contrast enhancement is usually helpful when an image shows uneven intensity. Furthermore, Ronneberger et al. [7] used random elastic deformation to perform data expansion on the original data set. In fact, the most commonly used method for traditional data augmentation is parametric transformation (rotation, translation, shear, shift, flip, …). Since this kind of transformation comes at virtually no computational cost and the annotation of medical images is difficult, it is always performed before each training session.
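For instance, with torchvision (one possible toolkit; the particular transforms and their parameters are illustrative choices), such a traditional pipeline can be composed as follows. For segmentation, the same spatial transforms must of course be applied jointly to the image and its mask:

```python
from torchvision import transforms

# Traditional, label-preserving augmentations: intensity jitter, blur, and
# parametric spatial transforms (rotation/translation/shear/flip).
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, saturation=0.2, contrast=0.2),
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), shear=5),
    transforms.RandomHorizontalFlip(p=0.5),
])

# augmented = augment(image)
```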

FIGURE 15 The cGAN architecture [108]

Conditional Generative Adversarial Nets (cGAN): In contrast to the use of cGAN for supervised learning introduced in Section 2, this section focuses on the use of cGAN for data augmentation. An original GAN generator denoted by G can learn the data distribution, but the generated pictures are random, which means that the generation process of G is unguided. In contrast, cGAN adds a condition to the original GAN in order to guide the generation process of G. Figure 15 shows the architecture of cGAN. Guibas et al. [107] proposed a network architecture composed of a GAN [64] and a cGAN [108]. Random variables are input into the GAN, leading to the generation of a synthetic fundus blood vessel label image; the generated label map is then input into the conditional GAN to generate a realistic retinal fundus image. Finally, the authors verified the authenticity of synthesised images by checking whether a classifier can distinguish a synthesised image from a real one. Mahapatra et al. [109] used a cGAN to synthesise X-ray images with required abnormalities; this model considers abnormal X-ray images and lung segmentation labels as inputs, and then generates synthetic X-ray images with the same diseases as the input X-ray images. At the same time, the segmented label is obtained. In addition, there are also some other works [110] [111] using GAN or cGAN to generate images to achieve data enhancement. Although the images generated by cGAN have many defects, such as blurred boundaries and low resolution, the cGAN provides a basic idea for the later CycleGAN [112] and StarGAN [113] used for the conversion of image styles.

3.2 Transfer learning

By utilising the trained parameters of a model to initialise a new model, transfer learning can achieve fast model training for data with limited labels. One approach is to fine-tune a model pre-trained on ImageNet for the target medical image analysis task, while the other is to transfer the training across domains.
Pre-trained Model: Transfer learning is often used to solve the problem of limited labelled data in medical image analysis, and some researchers found that using networks pre-trained on natural images such as ImageNet as an encoder within a U-Net-like network and then performing fine-tuning on medical data can further improve the segmentation effect for medical images. Kalinin et al. [114] considered the VGG-11, VGG-16, and ResNet-34 networks pre-trained on ImageNet as encoders of the U-shaped network to perform semantic segmentation of robotic instruments from wireless capsule endoscopic videos of vascular proliferative lesions and surgical procedures. Similarly, Conze et al. [115] used VGG-11 pre-trained on ImageNet as the encoder of a segmentation network to perform shoulder muscle MRI segmentation. Experiments demonstrate that the pre-trained network is useful for improving segmentation accuracy. It can be concluded that a model pre-trained on ImageNet can learn some common underlying features that are required for both medical and natural images; thus a complete retraining is unnecessary, while performing fine-tuning is useful for training models. However, domain adaptation may be a problem when applying pre-trained models of natural scene images to medical image analysis tasks. Besides, popular transfer learning methods are hardly applicable to 3D medical image analysis because pre-trained models often rely on 2D image data sets. If the number of medical data sets with annotations is large enough, it is possible that the effect of pre-training is weak for improving model performance. In fact, the effect of a pre-trained model is unstable and depends on the segmentation data sets and tasks. Empirically, we can try to use a pre-trained model if it can improve segmentation accuracy; otherwise we need to consider designing new models.
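As a sketch of the pre-trained-encoder idea (simplified, assuming the torchvision ≥ 0.13 weights API; the stage split shown is one reasonable choice, not that of [114], and a full U-Net would add a decoder with skip connections):

```python
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained ResNet-34 as the encoder of a U-shaped network.
backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

encoder_stages = nn.ModuleList([
    nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu),  # 1/2 resolution
    nn.Sequential(backbone.maxpool, backbone.layer1),            # 1/4
    backbone.layer2,                                             # 1/8
    backbone.layer3,                                             # 1/16
    backbone.layer4,                                             # 1/32
])

# During fine-tuning, the decoder is trained from scratch while the encoder
# is updated with a smaller learning rate (or frozen at first).
```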
FIGURE 16 The CycleGAN architecture [112]

Domain Adaptation: If the labels from the target domain are not available and we can only access labels in other domains, then popular methods transfer the classifier trained on the source domain to the target domain without labelled data. CycleGAN is a cyclic structure, mainly composed of two generators and two discriminators. Figure 16 shows the architecture of CycleGAN. First, an image in the X domain is transferred to the Y domain by a generator G, and then the output from G is reconstructed back to the original image in the X domain by the generator F. Conversely, an image in the Y domain is transferred to the X domain by the generator F, and then the output from F is reconstructed back to the original image in the Y domain by the generator G. Both discriminators G and F play discriminating roles ensuring the style transfer of images. Huo et al. [116] proposed a jointly optimised image synthesis and segmentation framework for the task of spleen segmentation in CT images using CycleGAN [112]. The framework achieves an image conversion from the labelled source domain to the synthesised target domain. During training, the synthesised target images are used to train the segmentation network. During the test process, a real image from the target domain is directly input into the trained segmentation network to obtain the desired segmentation results.
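The cycle-consistency idea can be summarised in a few lines. A minimal sketch, assuming `G` and `F` are the two generator networks and `d_x`, `d_y` are discriminators producing scores in [0, 1] (all names and the least-squares formulation are illustrative):

```python
import torch
import torch.nn.functional as F_nn

def cycle_gan_generator_loss(G, F, d_x, d_y, real_x, real_y, lam=10.0):
    """Adversarial + cycle-consistency objective for the two generators."""
    fake_y = G(real_x)                     # X -> Y
    fake_x = F(real_y)                     # Y -> X
    # least-squares adversarial terms: try to fool both discriminators
    pred_y, pred_x = d_y(fake_y), d_x(fake_x)
    adv = F_nn.mse_loss(pred_y, torch.ones_like(pred_y)) + \
          F_nn.mse_loss(pred_x, torch.ones_like(pred_x))
    # cycle consistency: X -> Y -> X and Y -> X -> Y should reconstruct inputs
    cyc = F_nn.l1_loss(F(fake_y), real_x) + F_nn.l1_loss(G(fake_x), real_y)
    return adv + lam * cyc
```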

Chen et al. [117] also adopted a similar method using the segmentation labels of MR images to achieve the task of cardiac CT segmentation.
Chartsias et al. [118] used the CycleGAN to generate corresponding MR images and labels from CT slices and myocardial segmentation labels, and then used synthetic MR and real MR images to train the myocardial segmentation model. This model obtains a 15% improvement over the myocardial segmentation model trained on real MR images. Similarly, there are some other works that realise the image conversion between different domains through the CycleGAN and improve the performance of medical image segmentation [119] [120].

3.3 Interactive segmentation

Manually drawing medical image segmentation labels is usually tedious and time-consuming, especially for the drawing of 3D volume data. Interactive segmentation allows clinicians to interactively correct the initial segmented image generated by a model to obtain a more accurate segmentation. The key to effective interactive segmentation is that clinicians can use interactive methods such as mouse clicks and outline boxes to improve an initial segmentation result from a model. Then the model can update its parameters and generate new segmentation images to obtain new feedback from the clinicians.
Wang et al. [121] proposed DeepIGeoS using a cascade of two CNNs for the interactive segmentation of 2D and 3D medical images. The first CNN called P-Net outputs a coarse segmentation result. Based on this, users provide interactive points or short lines to mark wrong segmentation areas, which are then used as the input of the second CNN called R-Net to obtain corrected results. Experiments were conducted on two-dimensional foetal MRI images and three-dimensional brain tumour images, and the experimental results showed that compared with traditional interactive segmentation methods such as GraphCuts, RandomWalks and ITK-Snap, DeepIGeoS greatly reduces the requirement for user interaction and reduces user time.
Wang et al. [122] proposed BIFSeg, which is similar to the principle of GrabCut [123] [124]. Users first draw a bounding box, and the area inside the bounding box is considered as the input of a CNN to obtain an initial result. After that, users perform an image-specific fine-tuning to make the CNN provide better segmentation results. GrabCut achieves image segmentation by learning a Gaussian mixture model (GMM) from images, while BIFSeg learns a CNN from images. Usually CNN-based segmentation methods can only deal with objects that have appeared in the training set, which limits the flexibility of these methods, but BIFSeg attempts to use a CNN to segment objects that have not been seen during the training process. The process is equivalent to making BIFSeg learn to extract the foreground part of an object from a bounding box. During the test, the CNN can better use the information in the specific image through an adaptive fine-tuning.
Rupprecht et al. [125] proposed a new interactive segmentation method named GM interacting that updates image segmentation results according to input text from users. This method changes the output of the network by interactively modifying the feature maps between an encoder and a decoder. The category of areas is first set according to the response of users, then some guiding parameters including multiplication and offset coefficients are updated through back propagation, and the feature map is finally changed, resulting in updated segmentation results.
The interactive image segmentation based on deep learning can reduce the number of user interactions and the user time, which shows broader application prospects.

3.4 Other works

Semi-supervised learning can use a small amount of labelled data and any number of unlabelled data to train a model, and its loss function often consists of the sum of two loss functions. The first is a supervised loss function that is only related to labelled data. The second is an unsupervised loss function or regularisation term that is related to both labelled and unlabelled data.
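Schematically, this two-part objective looks as follows; the consistency term is only one possible choice of unsupervised regulariser, standing in for whatever a given method uses:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_labelled, y, x_unlabelled, lam=0.1):
    """Supervised term on labelled data plus an unsupervised
    regularisation term on unlabelled data."""
    sup = F.cross_entropy(model(x_labelled), y)
    # example unsupervised term: prediction consistency under perturbation
    noisy = x_unlabelled + 0.1 * torch.randn_like(x_unlabelled)
    p1 = torch.softmax(model(x_unlabelled), dim=1)
    p2 = torch.softmax(model(noisy), dim=1)
    unsup = F.mse_loss(p2, p1.detach())
    return sup + lam * unsup
```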
Based on the idea of GAN, Zhang et al. [126] proposed a semi-supervised learning framework based on the adversarial game between a segmentation network and an evaluation network. An image is fed into U-Net to generate a segmentation map, which is then stacked with the original image and presented to the evaluation network to obtain a segmentation score. During the training process, the segmentation network is optimised in two aspects: one is to minimise the segmentation loss of labelled images, and the other is to make the evaluation network give high scores to unlabelled images. Besides, the evaluation network is updated to assign low scores to unmarked images but high scores to marked images. Due to this adversarial learning, the segmentation network obtains supervised signals from both labelled and unlabelled images. Thus, the semi-supervised learning framework achieves a better segmentation effect in the gland segmentation task for histopathology images. Similarly, some other semi-supervised frameworks [127] [128] [99] [129] have also been proposed to optimise medical image segmentation.
Accurate and robust segmentation of organs or lesions from medical images plays a vital role in many clinical applications, such as diagnosis and treatment planning. However, it is difficult to acquire annotated data for medical images, as generating accurate annotations requires expertise and time. Weakly supervised segmentation methods learn image segmentation from border or image-level labels or from a small amount of annotated image data, rather than using a large number of pixel-level annotations, to obtain high-quality segmentation results. In fact, a small amount of annotated data and a large amount of unannotated data are more compatible with the real clinical situation.

However, in practice, weakly supervised learning rarely provides acceptable results for medical image segmentation tasks, especially for 3D medical images. Therefore, this is a direction worth exploring in the future.

4 CURRENTLY POPULAR DIRECTIONS

4.1 Network architecture search

Recently, the performance of convolutional neural network models has been continuously improved. Researchers have designed a large number of popular network architectures for specific tasks such as image classification, segmentation and reconstruction. These architectures are often designed by industry experts or academics over months or even years, since the design of network architectures with excellent performance usually requires a great deal of domain knowledge. Therefore, the design process is time-consuming and laborious for researchers without domain knowledge. So far, NAS [130] has made significant progress in improving the accuracy of image classification. NAS can be deemed a subdomain of automatic machine learning (AutoML) [131] and has a strong overlap with hyperparameter optimisation [132] and meta learning [133]. Current research on NAS focuses on three aspects: search space, search strategy and performance estimation. The search space is the candidate collection of network structures to be searched. It is divided into a global search space that represents the search for the entire network structure, and a cell-based search space that searches only a few small structures that are assembled into a complete large network by stacking and stitching. The search strategy aims to find the optimal network structure as fast as possible in the search space. Popular search strategies are often grouped into three categories: reinforcement learning, evolutionary algorithms, and gradient-based methods. The performance estimation strategy is the process of assessing how well a network structure performs on target data sets. For NAS techniques, researchers pay more attention to the improvement of search strategies since search space and performance estimation methods are usually rarely changed. Some improved CNN models based on NAS [134] [135] have been proposed and applied to image segmentation.
Most current studies on deep learning in medical image segmentation depend on U-Net networks and make some changes to the network structure according to different tasks, but in reality non-network-structure factors may also be important for improving the segmentation effect. Isensee et al. [136] argued that too much manual adjustment of the network structure could lead to over-fitting for a given data set, and therefore proposed a medical image segmentation framework no-new-UNet (nnU-Net) that adapts itself to any new data set. The nnU-Net automatically adjusts all hyperparameters according to the properties of the given data set without manual intervention. Therefore, the nnU-Net only relies on vanilla 2D UNet, 3D UNet, UNet cascade and a robust training scheme. It focuses on the stages of pre-processing (resampling and normalisation), training (loss, optimiser settings, data augmentation), inference (e.g. patch-based strategies, test-time augmentations, model integration), and post-processing (e.g. enhancing single connected domains). In practical applications, the improvements of network structure design usually depend on experience without adequate interpretability theory support. Moreover, more complex network models indicate a higher risk of over-fitting.
Weng et al. [137] first proposed NAS-UNet for medical image segmentation. The NAS-UNet contains two cell architectures, DownSC and UpSC, which are obtained by performing a search on the U-shaped backbone. The NAS-UNet outperforms U-Net and its variants, and its training time is close to that of U-Net, but with only 6% of the number of parameters.
To perform real-time image segmentation for high-resolution 2D images (e.g. CT, MRI and histopathology images), the study of compressed neural network models has become a popular direction in medical image segmentation. The application of NAS can effectively reduce the number of model parameters and achieve high segmentation performance. Although the performance of NAS is stunning, why particular architectures perform well cannot be explained. Therefore, it is also important for future research to better understand the mechanisms which have a significant impact on performance and to explore whether these properties can be generalised to different tasks.

4.2 Graph convolutional neural network

The GCN [138] is one of the powerful tools for the study of non-Euclidean domains. A graph is a data structure consisting of nodes and edges. Early graph neural networks (GNNs) [139] mainly addressed strictly graphical problems such as the classification of molecular structures. In practice, data in Euclidean spaces (e.g. images) or sequences (e.g. text), as well as many common scenes, can be converted into graphs that can be modelled by GCN techniques.
Gao et al. [140] designed new graph pooling (gPool) and graph unpooling (gUnpool) operations based on GCN and proposed an encoder–decoder model, namely graph U-Net. The graph U-Net achieves better performance than popular U-Nets by adding a small number of parameters. In contrast to traditional convolutional neural networks where deeper is better, the performance of the graph U-Net cannot be improved by increasing the depth of networks when the depth exceeds 4. However, the graph U-Net shows a stronger capability of feature encoding than popular U-Nets when the depth is smaller than or equal to 4. Yang et al. [141] proposed the end-to-end conditional partial residual graph convolutional network (CPR-GCN) for the automatic anatomical labelling of coronary arteries. The authors showed that the GCN-based approach provided better performance and stronger robustness than traditional and recent deep learning based approaches. Results from these GCNs in medical image segmentation are promising, as the graph structure has high data representation efficiency and a strong capability of feature encoding.
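As a reference point, a single GCN layer in the widely used Kipf–Welling formulation propagates node features as H' = σ(D̂^(−1/2) Â D̂^(−1/2) H W), where Â is the adjacency matrix with self-loops. A minimal NumPy sketch (illustrative only, not the implementation of [138] or [140]):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    H: node features, shape (num_nodes, in_dim)
    A: adjacency matrix, shape (num_nodes, num_nodes)
    W: trainable weights, shape (in_dim, out_dim)
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # symmetric normalisation
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
```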

4.3 Interpretable shape attentive neural network

Currently, many deep learning algorithms tend to make judgments by using 'memorised' models that approximately fit the input data. As a result, these algorithms cannot be explained sufficiently or give convincing evidence for each specific prediction. Therefore, the study of the interpretability of deep neural networks is a hot topic at present.
Sun et al. [142] proposed the SAU-Net that focuses on the interpretability and robustness of models. The proposed architecture attempts to address the problem of poor edge segmentation accuracy in medical images by using a secondary shape stream. Specifically, the shape stream and the regular texture stream can capture rich shape-dependent information in parallel. Furthermore, both spatial and channel attention mechanisms are used in the decoder to explain the learning capability of models at each resolution of the U-Net. Finally, by extracting the learned shape and spatial attention maps, we can interpret the highly activated regions of each decoder block. The learned shape maps can be used to infer the correct shapes of the categories of interest learned by the model. The SAU-Net is able to learn robust shape features of objects via the gated shape stream, and is also more interpretable than previous works via built-in saliency maps using attention.
Wickstrøm et al. [143] explored the uncertainty and interpretability of the semantic segmentation of colorectal polyps with convolutional neural networks, and the authors developed the central idea of guided back propagation [144] for the interpretation of network gradients. By using back propagation, the gradient corresponding to each pixel in the input is obtained so that the features considered by the network can be visualised. In the process of back propagation, pixels with large and positive gradient values in an image should be paid more attention due to their high importance, while pixels with large and negative gradient values should be suppressed. If these negative gradients are included in the visualisation of important pixels, they may result in noisy visualisations of descriptive features. To avoid creating noisy visualisations, the guided back propagation process changes the back propagation of the neural network so that the negative gradients are set to zero at each layer, thereby allowing only positive gradients to flow backwards through the network and highlight these pixels.
Medical image analysis is an aid to clinical diagnosis; clinicians wonder not only where the lesion is located, but also about the interpretability of the results given by networks. Currently, the interpretation of medical image analysis is dominated by visualisation methods such as attention and the class-activation map (CAM). Therefore, research on the interpretability of deep learning for medical image segmentation [145] [146] [147] [148] will be a popular direction in the future.

4.4 Multi-modality data fusion

Multi-modality data fusion has been widely used in medical image analysis because it can provide richer object features that are helpful for improving object detection and segmentation results. Dou et al. [149] proposed a novel multi-modal learning scheme for the accurate segmentation of anatomical structures from unpaired CT and MRI images, and designed a new loss function using knowledge distillation to improve model training efficiency [150]. More specifically, the normalisation layers used for different modalities (i.e. CT and MRI) are implemented with separate variables, whereas the convolutional layers are constructed with shared variables. In each training iteration, samples of each modality are loaded separately and then forwarded to the shared convolutional and independent normalisation layers, and finally the logits that can be used to calculate the knowledge distillation losses are obtained. Moeskops et al. [151] investigated the question whether it is possible to train a single CNN to perform the same segmentation tasks on different-modality data. It is well known that CNNs show excellent performance for image feature encoding, and based on this, the experiments in [151] furthermore demonstrate that CNNs are also excellent for the feature encoding of multi-modality data when they are used for the same tasks. Therefore, a single system can be used in clinical practice to automatically execute segmentation tasks on various modality data without extra task-specific training.
More relevant literature can be found in the review on multi-modal fusion for medical image segmentation using deep learning [152]. In this review, the authors classified fusion strategies into three categories: input-level fusion, layer-level fusion, and decision-level fusion. Although it is known that multi-modal fusion networks usually show better performance for segmentation tasks than unimodal networks, multi-modal fusion causes some new problems, such as how to design multi-modal networks to efficiently combine different modalities, how to exploit the potential relationships between different modalities, and how to integrate multiple information into segmentation networks to improve segmentation performance. In addition, the integration of multi-modal data fusion into an effective single-parameter network can help simplify deployment and improve the usability of models in clinical practice.

5 DISCUSSION AND OUTLOOK

5.1 Medical image segmentation data sets

In order to help clinicians make accurate diagnoses, it is necessary to segment important organs, tissues or lesions from medical images with the aid of a computer and extract features from the segmented objects. As a result, various medical image data sets and corresponding competitions have been launched to promote the development of computer-aided diagnosis techniques.

FIGURE 17 Some images of benchmark data sets

In recent years, there has been a growing interest in developing more comprehensive computational anatomical models with the development of deep learning techniques, which has facilitated the development of multi-organ analysis models. Multi-organ segmentation approaches are different from traditional organ-specific strategies in that they incorporate the relationships between different organs into models to represent the complex human anatomy more accurately. In the context of multi-organ analysis, the brain and abdomen are the most popular in medical image analysis. Thus there are many data sets on the brain and abdomen such as BRATS [3] [153] [154], ISLES [155], KITS [156], LITS [157], and CHAOS [158]. There are two reasons for the emergence of large data sets: on the one hand, with the rapid development of imaging techniques, increasingly higher resolution shows more detailed anatomical tissue, which provides a better reference for clinicians; on the other hand, with the development of deep learning techniques, a large number of training samples are necessary, so many research teams have collected and annotated many samples to form data sets in order to train network models easily. In addition, stable organ structures in the abdomen (e.g. the liver, spleen, and kidneys) can provide constraints and contextual information for creating computational anatomical models of the abdomen. There are also a small number of public data sets on the hippocampus and pelvic organs (e.g. Colon [159] and prostate [160]). Indeed, the construction of more holistic and global anatomical models remains one of the greatest challenges and opportunities in the future due to the lack of large data sets to characterise the complexity of the human anatomy. More discussions on multi-organ analysis and computational anatomical methods can be found in [161]. The review proposed by Cerrolaza et al. [161] follows a methodology-based classification of the different techniques that are available for the analysis of multi-organ and multi-anatomical structures, from techniques using point distribution models to the latest deep learning-based approaches.
There are many publicly available data sets for medical image segmentation; Table 1 provides a brief description and list of each data set. As shown in Figure 17, we also provide some images of benchmark data sets. In fact, there are more public data sets used for medical image segmentation than in the list of Table 1.

5.2 Popular evaluation metrics

In order to measure the performance of a medical image segmentation model effectively, a large number of metrics have been proposed for evaluating the segmentation effectiveness. The evaluation of image segmentation performance relies on pixel quality, region quality and surface distance quality. In this section, we give some popular metrics for evaluating the performance of medical image segmentation. Pixel quality metrics include pixel accuracy (PA). Region quality metrics include the Dice score, volume overlap error (VOE) and relative volume difference (RVD). Surface distance quality metrics include the average symmetric surface distance (ASD) and maximum symmetric surface distance (MSD).
PA: Pixel accuracy simply finds the ratio of pixels properly classified, divided by the total number of pixels. For K + 1 classes (K foreground classes and the background), pixel accuracy is defined as

PA = Σ_{i=0}^{K} p_ii ∕ Σ_{i=0}^{K} Σ_{j=0}^{K} p_ij, (18)

where p_ij is the number of pixels of class i predicted as belonging to class j.
Dice score: It is a popular metric for image segmentation (and is more commonly used in medical image analysis), which can be defined as twice the overlap area of the predicted and ground-truth maps, divided by the total number of pixels in both images. The Dice score is defined as

Dice = 2|A ∩ B| ∕ (|A| + |B|), (19)

TABLE 1 Public data sets for medical segmentation

Objects Data set URL

Liver LiTS [157] https://competitions.codalab.org/competitions/17094


Sliver07 [162] http://www.sliver07.org/
3Dircadb [163] https://www.ircad.fr/research/3dircadb/
Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
CHAOS [165] https://chaos.grand-challenge.org
Pancreas Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
NIH Pancreas [166] http://academictorrents.com/details/80ecfefcabede760cdbdf63e38986501f7becd49
Colon COLONOGRAPHY [159] https://wiki.cancerimagingarchive.net/display/Public/CT+COLONOGRAPHY#dc149b9170f54aa29e88f1119e25ba3e
Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Heart AMRG Cardiac Atlas [167] http://www.cardiacatlas.org/studies/amrg-cardiac-atlas/
Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Lung LIDC-IDRI [168] https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI
VESSEL12 [169] https://vessel12.grand-challenge.org/
Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Prostate PROMISE12 [160] https://promise12.grand-challenge.org/
Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Brain OASIS [170] http://www.oasis-brains.org/
BRATS [3] [153] [154] https://www.med.upenn.edu/sbia/brats2018/registration.html
ISLES [155] http://www.isles-challenge.org/
mTOP [171] https://www.smir.ch/MTOP/Start2016
Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Kidney KITS [156] https://kits19.grand-challenge.org
CHAOS [165] https://chaos.grand-challenge.org
Spleen Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
CHAOS [165] https://chaos.grand-challenge.org
Hippocampus Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Hepatic Vessel Medical Segmentation Decathlon (MSD) [164] http://medicaldecathlon.com/index.html
Skin lesion ISIC [172] https://challenge.isic-archive.com/data
STARE STARE [173] https://cecas.clemson.edu/~ahoover/stare/
Thyroid TNSCUI [174] https://tn-scui2020.grand-challenge.org/

where A and B denote the ground truth and the predicted segmentation maps, respectively.
VOE: It is the complement of the Jaccard index, and it is defined as

VOE(A, B) = 1 − |A ∩ B| ∕ |A ∪ B|. (20)

RVD: It is an asymmetric measure defined as

RVD(A, B) = (|B| − |A|) ∕ |A|. (21)

Surface distance metrics are a set of correlated measures of the distance between the surfaces of a reference and a predicted lesion. Let S(A) denote the set of surface voxels of A. The shortest distance of an arbitrary voxel v to S(A) is defined as

d(v, S(A)) = min_{s_A ∈ S(A)} ‖v − s_A‖, (22)

where ‖·‖ denotes the Euclidean distance.
ASD: It is defined as

ASD(A, B) = 1 ∕ (|S(A)| + |S(B)|) (Σ_{s_A ∈ S(A)} d(s_A, S(B)) + Σ_{s_B ∈ S(B)} d(s_B, S(A))). (23)

MSD: It is also known as the symmetric Hausdorff distance, and is similar to the ASD except that the maximum distance is taken instead of the average:

MSD(A, B) = max{ max_{s_A ∈ S(A)} d(s_A, S(B)), max_{s_B ∈ S(B)} d(s_B, S(A)) }. (24)
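The region metrics are straightforward to compute from binary masks. A NumPy sketch of Equations (19)-(21), assuming a non-empty ground-truth mask (the surface metrics (22)-(24) additionally require extracting S(A) and S(B) and are omitted here):

```python
import numpy as np

def region_metrics(a: np.ndarray, b: np.ndarray):
    """Dice, VOE and RVD for binary masks a (ground truth) and b (prediction)."""
    a = a.astype(bool)
    b = b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    dice = 2.0 * inter / (a.sum() + b.sum())   # Eq. (19)
    voe = 1.0 - inter / union                  # Eq. (20)
    rvd = (b.sum() - a.sum()) / a.sum()        # Eq. (21)
    return dice, voe, rvd
```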
5.3 Challenges and future scope

It has been proved that the fully automated segmentation of medical images based on deep neural networks is very valuable. By reviewing the progress of deep learning in medical image segmentation, we have identified potential difficulties. Researchers have successfully employed a variety of means to improve the accuracy of medical image segmentation. However, the improvement of accuracy alone cannot account for the performance of algorithms, especially in the field of medical image analysis, where the problems of class imbalance, noise interference and the serious consequences of missed detections must be considered. In the following subsections, we analyse potential future research directions for medical image segmentation.

5.3.1 Design of network architecture

In studies of medical image segmentation, the innovation of network structure design is most popular, as the improvement of network structure design shows a clear effect and is easily transferred to other tasks. Through reviewing classical models in recent years, we find that the basic framework of encoder–decoder U-shaped networks with long and short skip connections has been widely used for medical image segmentation. The residual network (ResNet) and the densely connected network (DenseNet) have demonstrated the effect of deepening network depth and the effectiveness of the residual structure on gradient propagation, respectively. Skip connections in deep networks can facilitate gradient propagation and thus reduce the risk of gradient dispersion, leading to improved segmentation performance. Furthermore, the optimisation of skip connections will allow the model to extract richer features.
In addition, the design of network modules is worth exploring. Recently, spatial pyramid modules have been widely used in the field of semantic segmentation. The atrous convolution with fewer parameters allows for wider receptive fields, and the feature pyramid allows features with different scales to be acquired. The development of spatial and channel attention modules makes the process of neural network feature extraction more targeted, so the design of task-specific feature extraction network modules is also well worth investigating.
The manual design of model structures requires rich experience, so it is inevitable that NAS will gradually replace manual design. However, it is difficult to directly search a large network due to memory and GPU limitations. Therefore, the future trend should be the combination of manual design and the use of NAS technology. First, a backbone network is designed manually, and then small network modules are searched by NAS before training.
The design of different convolution operations is also a meaningful research direction, such as atrous convolution, deformable convolution, and depthwise separable convolution. Although these convolutions are all excellent for improving the performance of models, they still belong to traditional convolutional categories. As a convolutional method for processing non-Euclidean data, the graph convolution goes beyond the traditional convolution and is valuable for medical data because the graph structure is more efficient and has a strong semantic feature encoding capability.

5.3.2 Design of loss function

In many medical image segmentation tasks, there are often only one or two targets in an image, and the pixel ratio of targets is sometimes small, which makes network training difficult. For this problem, it is easier to focus on smaller targets by changing loss functions than by changing the network structure. However, the design of loss functions is highly task-specific, so we need to analyse task requirements carefully, and then design reasonable and available loss functions.
In specific tasks of medical image segmentation, the use of classical cross-entropy loss functions combined with a specific regularisation term or a specific loss function has become a popular trend. In addition, the use of domain knowledge or a priori knowledge as regularisation terms or the design of specific loss functions can yield better task-specific segmentation results for medical images. Another avenue is an automatic loss function (or regularisation term) search based on NAS techniques.

5.3.3 Transfer learning

Medical imaging is usually accompanied by severe noise interference. Moreover, the data annotation of medical images is often more expensive than that of natural images. Therefore, medical image segmentation based on deep learning models pre-trained on natural images is a worthy direction for future research.
In addition, transfer learning is an important way to achieve weakly supervised medical image segmentation. In fact, transfer learning is the use of existing knowledge to learn new knowledge, and it focuses on finding similarities between existing knowledge and new knowledge. Since most data or tasks are correlated, transfer learning allows us to share the model parameters (or knowledge learned by the model) with a new model in a way that speeds up the efficiency of model learning. Thus, transfer learning can solve the problem of insufficient labelled data.

5.3.4 Interactive segmentation

Although deep learning has achieved good results in many image segmentation tasks, the vast majority of related works have dealt with automatic segmentation methods. Many cases still require interactive segmentation methods, such as the annotation of radiotherapy targets, or when user correction is required because the automatic segmentation results are not good enough. In addition, training deep learning models often requires a large number of labelled images as training data sets, which can be produced more efficiently with an interactive segmentation tool.
Due to the superior performance of deep learning, interactive image segmentation [126] based on deep learning can reduce the number of user interactions and the user time, which shows broader application prospects.

5.3.5 Graph convolutional neural network

In general, convolution-based deep neural networks with translation invariance, rotation invariance, scale invariance, shared convolution kernels and fast automatic feature extraction have yielded remarkable results in the field of medical images. However, convolutional neural networks also have many limitations: they rely heavily on geometric priors, and it is difficult to capture the intrinsic relationships between different objects using extracted local features. GNN provides a powerful and intuitive modelling approach [175] to the problem of modelling non-Euclidean spaces. Taking the studied objects as nodes and the correlation or similarity between objects as edges, GNN is able to integrate non-Euclidean data and extract invisible relationships between objects by exploiting their intrinsic relationships, and it has been widely used in brain segmentation [176], vessel segmentation [177], prostate segmentation [178], coronary artery segmentation [141], etc.

5.3.6 Medical transformer

In recent years, deep neural networks based on U-shaped structures and skip connections have been widely used in various medical imaging tasks. However, despite the excellent performance achieved by CNNs, they are unable to learn global and long-range semantic information interactions well due to the limitations of convolutional operations. Recently, transformer-based architectures that replace the convolutional operator and use self-attention modules to compose entire encoder–decoder structures capable of encoding long-range dependencies have become very popular. The transformer has been a great success in the field of natural language processing.
Dosovitskiy et al. [179] proposed the Vision Transformer (ViT) that is able to classify images directly using the transformer. Recently, a large number of works [180] [181] [182] [183] have applied the transformer to medical image segmentation. CNNs have a comparative advantage in extracting the underlying features. These low-level features form the key points, lines, and some basic image structures at the patch level. However, when we detect these basic visual elements, the higher-level visual semantic information is often more concerned with how these elements relate to each other to form an object, and how the spatial locations of objects relate to each other to form the scene. At present, the transformer is more natural and effective in dealing with the relationships between these elements. However, if all the convolutional operators in CV tasks are replaced by the transformer, it may suffer from many problems, such as high computational cost and memory usage. From existing research, the combination of transformers and CNNs may lead to better results.

ACKNOWLEDGEMENTS

This work was supported in part by the Natural Science Basic Research Program of Shaanxi (Program No. 2021JC-47), in part by the National Natural Science Foundation of China under Grant 61871259 and Grant 61861024, the National Natural Science Foundation of China-Royal Society: Grant 61811530325 (IECnNSFCn170396, Royal Society, UK), in part by the Key Research and Development Program of Shaanxi (Program No. 2021ZDLGY08-07), and in part by the Shaanxi Joint Laboratory of Artificial Intelligence (Program No. 2020SS-03).

CONFLICT OF INTEREST

The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID

Tao Lei https://orcid.org/0000-0002-2104-9298
Hongying Meng https://orcid.org/0000-0002-8836-1382
Asoke K. Nandi https://orcid.org/0000-0001-6248-2875

REFERENCES
1. Li, W.: Automatic segmentation of liver tumor in ct images with deep convolutional neural networks. J. Comput. Commun. 3(11), 146–151 (2015)
2. Vivanti, R., Ephrat, A., Joskowicz, L., Karaaslan, O., Lev Cohain, N., Sosna, J.: Automatic liver tumor segmentation in follow-up CT studies using convolutional neural networks. Sci. Rep. 2, 15497 (2015)
3. Menze, B.H., Jakab, A., Bauer, S., Kalpathy Cramer, J., Farahani, K., Kirby, J., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE Trans. Med. Image. 34(10), 1993–2024 (2014)
4. Cherukuri, V., Ssenyonga, P., Warf, B.C., Kulkarni, A.V., Monga, V., Schiff, S.J.: Learning based segmentation of ct brain images: application to postoperative hydrocephalic scans. IEEE Trans. Bio-Med. Eng. 65(8), 1871–1884 (2017)
5. Cheng, J., Liu, J., Xu, Y., Yin, F., Wong, D.W.K., Tan, N.M., et al.: Superpixel classification based optic disc and optic cup segmentation for glaucoma screening. IEEE Trans. Med. Image 32(6), 1019–1032 (2013)
6. Fu, H., Cheng, J., Xu, Y., Wong, D.W.K., Liu, J., Cao, X.: Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Trans. Med. Image 37(7), 1597–1605 (2018)
7. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241 (2015)

6. Fu, H., Cheng, J., Xu, Y., Wong, D.W.K., Liu, J., Cao, X.: Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Trans. Med. Imaging 37(7), 1597–1605 (2018)
7. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241 (2015)
8. Song, T.H., Sanchez, V., El-Daly, H., Rajpoot, N.M.: Dual-channel active contour model for megakaryocytic cell segmentation in bone marrow trephine histology images. IEEE Trans. Bio-Med. Eng. 64(12), 2913–2923 (2017)
9. Wang, S., Zhou, M., Liu, Z., Liu, Z., Gu, D., Zang, Y., et al.: Central focused convolutional neural networks: developing a data-driven model for lung nodule segmentation. Med. Image Anal. 40, 172–183 (2017)
10. Onishi, Y., Teramoto, A., Tsujimoto, M., Tsukamoto, T., Saito, K., Toyama, H., et al.: Multiplanar analysis for pulmonary nodule classification in CT images using deep convolutional neural network and generative adversarial networks. Int. J. Comput. Assist. Radiol. Surg. 15(1), 173–178 (2020)
11. Wu, F., Zhuang, X.: CF distance: a new domain discrepancy metric and application to explicit domain adaptation for cross-modality cardiac image segmentation. IEEE Trans. Med. Imaging 39(12), 4274–4285 (2020)
12. Chen, C., Qin, C., Qiu, H., Tarroni, G., Duan, J., Bai, W., et al.: Deep learning for cardiac image segmentation: a review. Front. Cardiovasc. Med. 7, 25 (2020)
13. Yu Qian, Z., Wei Hua, G., Zhen Cheng, C., Jing Tian, T., Ling Yun, L.: Medical images edge detection based on mathematical morphology. Proc. IEEE Eng. Med. Biol. Soc. 2005, 6492–6495 (2006)
14. Lalonde, M., Beaulieu, M., Gagnon, L.: Fast and robust optic disc detection using pyramidal decomposition and Hausdorff-based template matching. IEEE Trans. Med. Imaging 20(11), 1193–1200 (2001)
15. Chen, W., Smith, R., Ji, S.Y., Ward, K.R., Najarian, K.: Automated ventricular systems segmentation in brain CT images by combining low-level segmentation and high-level template matching. BMC Med. Inform. Decis. Mak. 9(S1), S4 (2009)
16. Tsai, A., Yezzi, A., Wells, W., Tempany, C., Tucker, D., Fan, A., et al.: A shape-based approach to the segmentation of medical imagery using level sets. IEEE Trans. Med. Imaging 22(2), 137–154 (2003)
17. Li, C., Wang, X., Eberl, S., Fulham, M., Yin, Y., Chen, J., et al.: A likelihood and local constraint level set model for liver tumor segmentation from CT volumes. IEEE Trans. Biomed. Eng. 60(10), 2967–2977 (2013)
18. Li, S., Fevens, T., Krzyżak, A.: A SVM-based framework for autonomous volumetric medical image segmentation using hierarchical and coupled level sets. Int. Congr. Ser. 1268, 207–212 (2004)
19. Held, K., Kops, E.R., Krause, B.J., Wells, W.M., Kikinis, R., Muller-Gartner, H.W.: Markov random field segmentation of brain MR images. IEEE Trans. Med. Imaging 16(6), 878–886 (1997)
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
21. Masood, S., Sharif, M., Masood, A., Yasmin, M., Raza, M.: A survey on medical image segmentation. Curr. Med. Imaging Rev. 11(1), 3–14 (2015)
22. Shen, D., Wu, G., Suk, H.I.: Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017)
23. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., et al.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017)
24. Taghanaki, S.A., Abhishek, K., Cohen, J.P., Cohen-Adad, J., Hamarneh, G.: Deep semantic segmentation of natural and medical images: a review. Artif. Intell. Rev. 54, 137–178 (2021)
25. Seo, H., Badiei Khuzani, M., Vasudevan, V., Huang, C., Ren, H., Xiao, R., et al.: Machine learning techniques for biomedical image segmentation: an overview of technical aspects and introduction to state-of-art applications. Med. Phys. 47(5), e148–e167 (2020)
26. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J.N., Wu, Z., Ding, X.: Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. Med. Image Anal. 63, 101693 (2020)
27. Hesamian, M.H., Jia, W., He, X., Kennedy, P.: Deep learning techniques for medical image segmentation: achievements and challenges. J. Digit. Imaging 32(4), 582–596 (2019)
28. Meyer, P., Noblet, V., Mazzara, C., Lallement, A.: Survey on deep learning for radiotherapy. Comput. Biol. Med. 98, 126–146 (2018)
29. Akkus, Z., Galimzianova, A., Hoogi, A., Rubin, D.L., Erickson, B.J.: Deep learning for brain MRI segmentation: state of the art and future directions. J. Digit. Imaging 30(4), 449–459 (2017)
30. Zhou, Z.H.: A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5(1), 44–53 (2018)
31. Eelbode, T., Bertels, J., Berman, M., Vandermeulen, D., Maes, F., Bisschops, R., et al.: Optimization for medical image segmentation: theory and practice when evaluating with dice score or jaccard index. IEEE Trans. Med. Imaging 39(11), 3679–3690 (2020)
32. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. IEEE, Piscataway, NJ (2015)
33. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. Preprint, arXiv:1706.05587 (2017)
34. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D u-net: learning dense volumetric segmentation from sparse annotation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 424–432 (2016)
35. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: Conference on 3D Vision (3DV), pp. 565–571 (2016)
36. Chen, H., Dou, Q., Yu, L., Heng, P.A.: Voxresnet: deep voxelwise residual networks for volumetric brain segmentation. Preprint, arXiv:1608.05895 (2016)
37. Lee, K., Zung, J., Li, P., Jain, V., Seung, H.S.: Superhuman accuracy on the snemi3d connectomics challenge. Preprint, arXiv:1706.00120 (2017)
38. Xiao, X., Lian, S., Luo, Z., Li, S.: Weighted res-unet for high-quality retina vessel segmentation. In: Conference on Information Technology in Medicine and Education (ITME), pp. 327–331 (2018)
39. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
40. Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K.: Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. Preprint, arXiv:1802.06955 (2018)
41. Gao, Y., Phillips, J.M., Zheng, Y., Min, R., Fletcher, P.T., Gerig, G.: Fully convolutional structured LSTM networks for joint 4D medical image segmentation. In: Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1104–1108. IEEE, Piscataway, NJ (2018)
42. Bai, W., Suzuki, H., Qin, C., Tarroni, G., Oktay, O., Matthews, P.M., et al.: Recurrent neural networks for aortic image sequence segmentation with sparse annotations. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 586–594 (2018)
43. Ibtehaz, N., Rahman, M.S.: Multiresunet: rethinking the u-net architecture for multimodal biomedical image segmentation. Neural Netw. 121, 74–87 (2020)
44. Seo, H., Huang, C., Bassenne, M., Xiao, R., Xing, L.: Modified u-net (mu-net) with incorporation of object-dependent high level features for improved liver and liver-tumor segmentation in CT images. IEEE Trans. Med. Imaging 39(5), 1316–1325 (2019)
45. Chen, X., Zhang, R., Yan, P.: Feature fusion encoder decoder network for automatic liver lesion segmentation. In: Proceedings of the IEEE 16th International Symposium on Biomedical Imaging (ISBI), pp. 430–433. IEEE, Piscataway, NJ (2019)
46. Christ, P.F., Elshaer, M.E.A., Ettlinger, F., Tatavarty, S., Bickel, M., Bilic, P., et al.: Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 415–423 (2016)
47. Tang, W., Zou, D., Yang, S., Shi, J.: DSL: automatic liver segmentation with faster R-CNN and DeepLab. In: Proceedings of the International Conference on Artificial Neural Networks, pp. 137–147 (2018)
48. Kaluva, K.C., Khened, M., Kori, A., Krishnamurthi, G.: 2D-densely connected convolution neural networks for automatic liver and tumor segmentation. Preprint, arXiv:1802.02182 (2018)
49. Feng, X., Wang, C., Cheng, S., Guo, L.: Automatic liver and tumor segmentation of CT based on cascaded U-net. In: Proceedings of the Chinese Intelligent Systems Conference, pp. 155–164 (2019)
50. Albishri, A.A., Shah, S.J.H., Lee, Y.: Cu-net: cascaded U-net model for automated liver and lesion segmentation and summarization. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1416–1423. IEEE, Piscataway, NJ (2019)
51. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE Conference on Computer Vision (ICCV), pp. 2961–2969. IEEE, Piscataway, NJ (2017)
52. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: optimal speed and accuracy of object detection. Preprint, arXiv:2004.10934 (2020)
53. Al-Antari, M.A., Al-Masni, M.A., Choi, M.T., Han, S.M., Kim, T.S.: A fully integrated computer-aided diagnosis system for digital x-ray mammograms via deep learning detection, segmentation, and classification. Int. J. Med. Inform. 117, 44–54 (2018)
54. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016)
55. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
56. Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Auto-context convolutional neural network (auto-net) for brain extraction in magnetic resonance imaging. IEEE Trans. Med. Imaging 36(11), 2319–2330 (2017)
57. Yan, Y., Conze, P.H., Decencière, E., Lamard, M., Quellec, G., Cochener, B., et al.: Cascaded multi-scale convolutional encoder-decoders for breast mass segmentation in high-resolution mammograms. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 6738–6741 (2019)
58. Oda, M., Roth, H.R., Kitasaka, T., Misawa, K., Fujiwara, M., Mori, K.: Abdominal artery segmentation method from CT volumes using fully convolutional neural network. Int. J. Comput. Assist. Radiol. Surg. 14(12), 2069–2081 (2019)
59. Vu, M.H., Grimbergen, G., Nyholm, T., Löfstedt, T.: Evaluation of multi-slice inputs to convolutional neural networks for medical image segmentation. Preprint, arXiv:1912.09287 (2019)
60. Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A.: H-denseunet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Trans. Med. Imaging 37(12), 2663–2674 (2018)
61. Zhang, J., Xie, Y., Zhang, P., Chen, H., Xia, Y., Shen, C.: Light-weight hybrid convolutional network for liver tumor segmentation. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 4271–4277 (2019)
62. Dey, R., Hong, Y.: Hybrid cascaded neural network for liver lesion segmentation. In: Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1173–1177. IEEE, Piscataway, NJ (2020)
63. Valanarasu, J.M.J., Sindagi, V.A., Hacihaliloglu, I., Patel, V.M.: Kiu-net: towards accurate segmentation of biomedical images using over-complete representations. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 363–373 (2020)
64. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al.: Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680 (2014)
65. Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. Preprint, arXiv:1611.08408 (2016)
66. Singh, V.K., Rashwan, H.A., Romani, S., Akram, F., Pandey, N., Sarker, M.M.K., et al.: Breast tumor segmentation and shape classification in mammograms using generative adversarial and convolutional neural network. Expert Syst. Appl. 139, 112855 (2020)
67. Conze, P.H., Kavur, A.E., Cornec-Le Gall, E., Gezer, N.S., Le Meur, Y., Selver, M.A., Rousseau, F.: Abdominal multi-organ segmentation with cascaded convolutional and adversarial deep networks. Artif. Intell. Med. 117, 102109 (2021)
68. Oktay, O., Ferrante, E., Kamnitsas, K., Heinrich, M., Bai, W., Caballero, J., et al.: Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation. IEEE Trans. Med. Imaging 37(2), 384–395 (2017)
69. Boutillon, A., Borotikar, B., Burdin, V., Conze, P.H.: Combining shape priors with conditional adversarial networks for improved scapula segmentation in MR images. In: Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), pp. 1164–1167 (2020)
70. Guan, S., Khan, A.A., Sikdar, S., Chitnis, P.V.: Fully dense unet for 2-D sparse photoacoustic tomography artifact removal. IEEE J. Biomed. Health Inform. 24(2), 568–576 (2019)
71. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: Unet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 39(6), 1856–1867 (2019)
72. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. Artif. Intell. Statist. 38, 562–570 (2015)
73. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. IEEE, Piscataway, NJ (2015)
74. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. IEEE, Piscataway, NJ (2016)
75. Gu, Z., Cheng, J., Fu, H., Zhou, K., Hao, H., Zhao, Y., et al.: Ce-net: context encoder network for 2D medical image segmentation. IEEE Trans. Med. Imaging 38(10), 2281–2292 (2019)
76. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. Preprint, arXiv:1704.04861 (2017)
77. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. IEEE, Piscataway, NJ (2018)
78. Lei, T., Zhou, W., Zhang, Y., Wang, R., Meng, H., Nandi, A.K.: Lightweight V-net for liver segmentation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1379–1383. IEEE, Piscataway, NJ (2020)
79. Huang, C., Han, H., Yao, Q., Zhu, S., Zhou, S.K.: 3D U2-net: a 3D universal U-net for multi-domain medical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 291–299 (2019)
80. Paschali, M., Gasperini, S., Roy, A.G., Fang, M.Y.S., Navab, N.: 3DQ: compact quantized neural networks for volumetric whole brain segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 438–446 (2019)
81. Xu, X., Lu, Q., Yang, L., Hu, S., Chen, D., Hu, Y., et al.: Quantization of fully convolutional networks for accurate biomedical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8300–8308. IEEE, Piscataway, NJ (2018)
82. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
83. Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., et al.: Attention u-net: learning where to look for the pancreas. Preprint, arXiv:1804.03999 (2018)
84. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. IEEE, Piscataway, NJ (2018)
85. Kaul, C., Manandhar, S., Pears, N.: Focusnet: an attention-based fully convolutional network for medical image segmentation. In: Proceedings of the IEEE 16th International Symposium on Biomedical Imaging (ISBI), pp. 455–458. IEEE, Piscataway, NJ (2019)
86. Wang, C., He, Y., Liu, Y., He, Z., He, R., Sun, Z.: Sclerasegnet: an improved u-net model with attention for accurate sclera segmentation. In: Proceedings of the IAPR International Conference on Biometrics, pp. 1–8 (2019)
87. Wang, Z., Zou, N., Shen, D., Ji, S.: Non-local u-nets for biomedical image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6315–6322 (2020)
88. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015)
89. Lopez, M.M., Ventura, J.: Dilated convolutions for brain tumor segmentation in MRI scans. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) Workshop, pp. 253–262 (2017)
90. Lei, T., Wang, R., Zhang, Y., Wan, Y., Liu, C., Nandi, A.K.: Defed-net: deformable encoder-decoder network for liver and liver tumor segmentation. IEEE Trans. Radiat. Plasma Med. Sci. (2021)
91. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., et al.: Understanding convolution for semantic segmentation. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460. IEEE, Piscataway, NJ (2018)
92. Yang, L., Song, Q., Wang, Z., Jiang, M.: Parsing r-cnn for instance-level human analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 364–373 (2019)
93. Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Tversky loss function for image segmentation using 3D fully convolutional deep networks. In: International Workshop on Machine Learning in Medical Imaging, pp. 379–387 (2017)
94. Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Cardoso, M.J.: Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Cardoso, M.J., Arbel, T., Carneiro, G., Syeda-Mahmood, T., Tavares, J.M., Moradi, M., Bradley, A., Greenspan, H., Papa, J.P., Madabhushi, A., Nascimento, J.C., Cardoso, J.S., Belagiannis, V., Lu, Z. (eds.) Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248. Springer, Berlin (2017)
95. Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, E., Dolz, J., Ayed, I.B.: Boundary loss for highly unbalanced segmentation. In: Proceedings of the 2nd International Conference on Medical Imaging with Deep Learning (PMLR), pp. 285–296 (2019)
96. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
97. Wong, K.C., Moradi, M., Tang, H., Syeda-Mahmood, T.: 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 612–619 (2018)
98. Chen, X., Williams, B.M., Vallabhaneni, S.R., Czanner, G., Williams, R., Zheng, Y.: Learning active contour models for medical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11632–11640. IEEE, Piscataway, NJ (2019)
99. Li, X., Yu, L., Chen, H., Fu, C.W., Xing, L., Heng, P.A.: Transformation-consistent self-ensembling model for semisupervised medical image segmentation. IEEE Trans. Neural Netw. Learn. Syst. 32(2), 523–534 (2021)
100. Karimi, D., Salcudean, S.E.: Reducing the Hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Trans. Med. Imaging 39(2), 499–513 (2019)
101. Taghanaki, S.A., Zheng, Y., Zhou, S.K., Georgescu, B., Sharma, P., Xu, D., et al.: Combo loss: handling input and output imbalance in multi-organ segmentation. Comput. Med. Imaging Graph. 75, 24–33 (2019)
102. Caliva, F., Iriondo, C., Martinez, A.M., Majumdar, S., Pedoia, V.: Distance map loss penalty term for semantic segmentation. In: Proceedings of the International Conference on Medical Imaging with Deep Learning - Extended Abstract Track (2019)
103. Dou, Q., Yu, L., Chen, H., Jin, Y., Yang, X., Qin, J., et al.: 3D deeply supervised network for automated segmentation of volumetric medical images. Med. Image Anal. 41, 40–54 (2017)
104. Dou, H., Karimi, D., Rollins, C.K., Ortinau, C.M., Vasung, L., Velasco-Annis, C., et al.: A deep attentive convolutional neural network for automatic cortical plate segmentation in fetal MRI. IEEE Trans. Med. Imaging 40(4), 1123–1133 (2021)
105. Sirinukunwattana, K., Pluim, J.P., Chen, H., Qi, X., Heng, P.A., Guo, Y.B., et al.: Gland segmentation in colon histology images: the glas challenge contest. Med. Image Anal. 35, 489–502 (2017)
106. Dong, H., Yang, G., Liu, F., Mo, Y., Guo, Y.: Automatic brain tumor detection and segmentation using U-Net based fully convolutional networks. In: Proceedings of the Annual Conference on Medical Image Understanding and Analysis (MIUA), pp. 506–517 (2017)
107. Guibas, J.T., Virdi, T.S., Li, P.S.: Synthetic medical images from dual generative adversarial networks. Preprint, arXiv:1709.01872 (2017)
108. Mirza, M., Osindero, S.: Conditional generative adversarial nets. Preprint, arXiv:1411.1784 (2014)
109. Mahapatra, D., Bozorgtabar, B., Thiran, J.P., Reyes, M.: Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 580–588 (2018)
110. Shin, H.C., Tenenholtz, N.A., Rogers, J.K., Schwarz, C.G., Senjem, M.L., Gunter, J.L., et al.: Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 1–11 (2018)
111. Jin, D., Xu, Z., Tang, Y., Harrison, A.P., Mollura, D.J.: CT-realistic lung nodule simulation from 3D conditional generative adversarial networks for robust lung segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 732–740 (2018)
112. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision (ICCV), pp. 2223–2232. IEEE, Piscataway, NJ (2017)
113. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8789–8797. IEEE, Piscataway, NJ (2018)
114. Kalinin, A.A., Iglovikov, V.I., Rakhlin, A., Shvets, A.A.: Medical image segmentation using deep neural networks with pre-trained encoders. In: Wani, M., Kantardzic, M., Sayed-Mouchaweh, M. (eds.) Deep Learning Applications, pp. 39–52. Springer, Berlin (2020)
115. Conze, P.H., Brochard, S., Burdin, V., Sheehan, F.T., Pons, C.: Healthy versus pathological learning transferability in shoulder muscle MRI segmentation using deep convolutional encoder-decoders. Comput. Med. Imaging Graph. 83, 101733 (2020)
116. Huo, Y., Xu, Z., Bao, S., Assad, A., Abramson, R.G., Landman, B.A.: Adversarial synthesis learning enables segmentation without target modality ground truth. In: Proceedings of the IEEE 15th International Symposium on Biomedical Imaging (ISBI), pp. 1217–1220. IEEE, Piscataway, NJ (2018)
117. Chen, C., Dou, Q., Chen, H., Qin, J., Heng, P.A.: Synergistic image and feature adaptation: towards cross-modality domain adaptation for medical image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 865–872 (2019)
118. Chartsias, A., Joyce, T., Dharmakumar, R., Tsaftaris, S.A.: Adversarial image synthesis for unpaired multi-modal cardiac data. In: International Workshop on Simulation and Synthesis in Medical Imaging, pp. 3–13 (2017)
119. Zhao, C., Carass, A., Lee, J., He, Y., Prince, J.L.: Whole brain segmentation and labeling from CT using synthetic MR images. In: International Workshop on Machine Learning in Medical Imaging, pp. 291–298 (2017)
120. Valindria, V.V., Pawlowski, N., Rajchl, M., Lavdas, I., Aboagye, E.O., Rockall, A.G., et al.: Multi-modal learning from unpaired images: application to multi-organ segmentation in CT and MRI. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 547–556. IEEE, Piscataway, NJ (2018)
121. Wang, G., Zuluaga, M.A., Li, W., Pratt, R., Patel, P.A., Aertsen, M., et al.: Deepigeos: a deep interactive geodesic framework for medical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1559–1572 (2018)
122. Wang, G., Li, W., Zuluaga, M.A., Pratt, R., Patel, P.A., Aertsen, M., et al.: Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Trans. Med. Imaging 37(7), 1562–1573 (2018)
123. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in ND images. Proc. IEEE Int. Conf. Comput. Vis. 1, 105–112 (2001)
124. Rother, C., Kolmogorov, V., Blake, A.: "Grabcut": interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3), 309–314 (2004)
125. Rupprecht, C., Laina, I., Navab, N., Hager, G.D., Tombari, F.: Guide me: interacting with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8551–8561 (2018)
126. Zhang, Z., Yang, L., Zheng, Y.: Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9242–9251. IEEE, Piscataway, NJ (2018)
127. Baur, C., Albarqouni, S., Navab, N.: Semi-supervised deep learning for fully convolutional networks. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 311–319 (2017)
128. Chartsias, A., Joyce, T., Papanastasiou, G., Semple, S., Williams, M., Newby, D., et al.: Factorised spatial representation learning: application in semi-supervised myocardial segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 490–498 (2018)
129. Zhao, A., Balakrishnan, G., Durand, F., Guttag, J.V., Dalca, A.V.: Data augmentation using learned transformations for one-shot medical image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8543–8553. IEEE, Piscataway, NJ (2019)
130. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. Preprint, arXiv:1808.05377 (2018)
131. He, X., Zhao, K., Chu, X.: AutoML: a survey of the state-of-the-art. Knowl.-Based Syst. 212, 106622 (2020)
132. Ha, H., Rana, S., Gupta, S., Nguyen, T., Venkatesh, S., et al.: Bayesian optimization with unknown search space. In: NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 11795–11804 (2019)
133. Vanschoren, J.: Meta-learning: a survey. Preprint, arXiv:1810.03548 (2018)
134. Chen, L.C., Collins, M., Zhu, Y., Papandreou, G., Zoph, B., Schroff, F., et al.: Searching for efficient multi-scale architectures for dense image prediction. Neural Inf. Process. Syst. 31, 8699–8710 (2018)
135. Liu, C., Chen, L.C., Schroff, F., Adam, H., Hua, W., Yuille, A.L., et al.: Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 82–92. IEEE, Piscataway, NJ (2019)
136. Isensee, F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18(2), 203–211 (2021)
137. Weng, Y., Zhou, T., Li, Y., Qiu, X.: Nas-unet: neural architecture search for medical image segmentation. IEEE Access 7, 44247–44257 (2019)
138. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. (2020)
139. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2008)
140. Gao, H., Ji, S.: Graph u-nets. In: Proceedings of the International Conference on Machine Learning (ICML) (2019)
141. Yang, H., Zhen, X., Chi, Y., Zhang, L., Hua, X.S.: Cpr-gcn: conditional partial-residual graph convolutional network in automated anatomical labeling of coronary arteries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3803–3811. IEEE, Piscataway, NJ (2020)
142. Sun, J., Darbeha, F., Zaidi, M., Wang, B.: Saunet: shape attentive u-net for interpretable medical image segmentation. Preprint, arXiv:2001.07645 (2020)
143. Wickstrøm, K., Kampffmeyer, M., Jenssen, R.: Uncertainty and interpretability in convolutional neural networks for semantic segmentation of colorectal polyps. Med. Image Anal. 60, 101619 (2020)
144. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. Preprint, arXiv:1412.6806 (2014)
145. Guan, Q., Huang, Y., Zhong, Z., Zheng, Z., Zheng, L., Yang, Y.: Diagnose like a radiologist: attention guided convolutional neural network for thorax disease classification. Preprint, arXiv:1801.09927 (2018)
146. Tang, Z., Chuang, K.V., DeCarli, C., Jin, L.W., Beckett, L., Keiser, M.J., et al.: Interpretable classification of alzheimer's disease pathologies with a convolutional neural network pipeline. Nat. Commun. 10(1), 1–14 (2019)
147. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE Conference on Computer Vision (ICCV), pp. 618–626. IEEE, Piscataway, NJ (2017)
148. Zhang, Z., Chen, P., McGough, M., Xing, F., Wang, C., Bui, M., et al.: Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nat. Mach. Intell. 1(5), 236–245 (2019)
149. Dou, Q., Liu, Q., Heng, P.A., Glocker, B.: Unpaired multi-modal segmentation via knowledge distillation. IEEE Trans. Med. Imaging (2020)
150. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. Preprint, arXiv:1503.02531 (2015)
151. Moeskops, P., Wolterink, J.M., van der Velden, B.H., Gilhuijs, K.G., Leiner, T., Viergever, M.A., et al.: Deep learning for multi-task medical image segmentation in multiple modalities. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 478–486 (2016)
152. Zhou, T., Ruan, S., Canu, S.: A review: deep learning for medical image segmentation using multi-modality fusion. Array 3, 100004 (2019)
153. Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., et al.: Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017)
154. Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., et al.: Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. Preprint, arXiv:1811.02629 (2018)
155. Maier, O., Menze, B.H., von der Gablentz, J., Häni, L., Heinrich, M.P., Liebrand, M., et al.: ISLES 2015 - a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral MRI. Med. Image Anal. 35, 250–269 (2017)
156. Heller, N., Sathianathen, N., Kalapara, A., Walczak, E., Moore, K., Kaluzniak, H., et al.: The kits19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. Preprint, arXiv:1904.00445 (2019)
157. Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., et al.: The liver tumor segmentation benchmark (LITS). Preprint, arXiv:1901.04056 (2019)
158. Kavur, A.E., Selver, M.A., Dicle, O., Barış, M., Gezer, N.S.: CHAOS - combined (CT-MR) healthy abdominal organ segmentation challenge data. Med. Image Anal. 69, 101950 (2019)
159. Smith, K., Clark, W.B.T.N.J.K.M.W.K.: Data from CT_COLONOGRAPHY. https://doi.org/10.7937/K9/TCIA.2015.NWTESAY1 (2015)
160. Litjens, G., Toth, R., van de Ven, W., Hoeks, C., Kerkstra, S., van Ginneken, B., et al.: Evaluation of prostate segmentation algorithms for MRI: the promise12 challenge. Med. Image Anal. 18(2), 359–373 (2014)
161. Cerrolaza, J.J., Picazo, M.L., Humbert, L., Sato, Y., Rueckert, D., Ballester, M.Á.G., et al.: Computational anatomy for multi-organ analysis in medical imaging: a review. Med. Image Anal. 56, 44–67 (2019)
162. Tan, M., Wu, F., Kong, D., Mao, X.: Automatic liver segmentation using 3D convolutional neural networks with a hybrid loss function. Med. Phys. 48(4), 1707–1719 (2021)
163. IRCAD France: 3D-IRCADb: 3D image reconstruction for comparison of algorithm database (2016)
164. Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., et al.: A large annotated medical image dataset for the development and evaluation of segmentation algorithms. Preprint, arXiv:1902.09063 (2019)
165. Kavur, A.E., Gezer, N.S., Barış, M., Conze, P.H., Groza, V., Pham, D.D., et al.: Chaos challenge - combined (CT-MR) healthy abdominal organ segmentation. Preprint, arXiv:2001.06535 (2020)
166. Roth, H.R., Farag, E.B.T.L.L.J.L.R.M.S.A.: NIH pancreas-CT dataset. https://doi.org/10.7937/K9/TCIA.2016.tNB1kqBU (2015)
167. Suinesiaputra, A., Medrano-Gracia, P., Cowan, B.R., Young, A.A.: Big heart data: advancing health informatics through data sharing in cardiovascular imaging. IEEE J. Biomed. Health Inform. 19(4), 1283–1290 (2014)
168. Armato, S.G., McLennan, G., Bidaut, L., McNitt-Gray, M.F., Meyer, C.R., Reeves, A.P., et al.: The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915–931 (2011)
169. Topkaya, I.S., Erdogan, H., Porikli, F.: Counting people by clustering person detector outputs. In: 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 313–318. IEEE, Piscataway, NJ (2014)
170. LaMontagne, P.J., Benzinger, T.L., Morris, J.C., Keefe, S., Hornbeck, R., Xiong, C., et al.: Oasis-3: longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and alzheimer disease. medRxiv (2019)
171. Kistler, M., Bonaretti, S., Pfahrer, M., Niklaus, R., Büchler, P.: The virtual skeleton database: an open access repository for biomedical research and collaboration. J. Med. Internet Res. 15(11), e245 (2013)
172. Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., et al.: Skin lesion analysis toward melanoma detection: a challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 168–172 (2018)
173. Hoover, A., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Trans. Med. Imaging 19(3), 203–210 (2000)
174. Zhang, Y., Lai, H., Yang, W.: Cascade UNet and CH-UNet for thyroid nodule segmentation and benign and malignant classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 129–134. Springer, Berlin (2020)
175. Zhang, B., Xiao, J., Jiao, J., Wei, Y., Zhao, Y.: Affinity attention graph neural network for weakly supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
176. Yang, B., Pan, H., Yu, J., Han, K., Wang, Y.: Classification of medical images with synergic graph convolutional networks. In: 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), pp. 253–258. IEEE, Piscataway, NJ (2019)
177. Shin, S.Y., Lee, S., Yun, I.D., Lee, K.M.: Deep vessel segmentation by learning graphical connectivity. Med. Image Anal. 58, 101556 (2019)
178. Tian, Z., Li, X., Zheng, Y., Chen, Z., Shi, Z., Liu, L., et al.: Graph-convolutional-network-based interactive prostate segmentation in MR images. Med. Phys. 47(9), 4164–4176 (2020)
179. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representations (2021)
180. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., et al.: Transunet: transformers make strong encoders for medical image segmentation. Preprint, arXiv:2102.04306 (2021)
181. Gao, Y., Zhou, M., Metaxas, D.: Utnet: a hybrid transformer architecture for medical image segmentation. In: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 61–71 (2021)
182. Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M.: Medical transformer: gated axial-attention for medical image segmentation. Preprint, arXiv:2102.10662 (2021)
183. Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., et al.: Swin-unet: Unet-like pure transformer for medical image segmentation. Preprint, arXiv:2105.05537 (2021)

How to cite this article: Wang, R., Lei, T., Cui, R., Zhang, B., Meng, H., Nandi, A.K.: Medical image segmentation using deep learning: A survey. IET Image Process. 16, 1243–1267 (2022). https://doi.org/10.1049/ipr2.12419