Neurocomputing

Guotai Wang, Wenqi Li, Michael Aertsen, Jan Deprest, Sébastien Ourselin, Tom Vercauteren

journal homepage: www.elsevier.com/locate/neucom

Article history: Received 2 August 2018; Revised 26 January 2019; Accepted 27 January 2019; Available online 7 February 2019. Communicated by Pingkun Yan.

Keywords: Uncertainty estimation; Convolutional neural networks; Medical image segmentation; Data augmentation

Abstract

Despite the state-of-the-art performance for medical image segmentation, deep convolutional neural networks (CNNs) have rarely provided uncertainty estimations regarding their segmentation outputs, e.g., model (epistemic) and image-based (aleatoric) uncertainties. In this work, we analyze these different types of uncertainties for CNN-based 2D and 3D medical image segmentation tasks at both pixel level and structure level. We additionally propose a test-time augmentation-based aleatoric uncertainty to analyze the effect of different transformations of the input image on the segmentation output. Test-time augmentation has been previously used to improve segmentation accuracy, yet has not been formulated in a consistent mathematical framework. Hence, we also propose a theoretical formulation of test-time augmentation, where a distribution of the prediction is estimated by Monte Carlo simulation with prior distributions of parameters in an image acquisition model that involves image transformations and noise. We compare and combine our proposed aleatoric uncertainty with model uncertainty. Experiments with segmentation of fetal brains and brain tumors from 2D and 3D Magnetic Resonance Images (MRI) showed that 1) the test-time augmentation-based aleatoric uncertainty provides a better uncertainty estimation than calculating the test-time dropout-based model uncertainty alone and helps to reduce overconfident incorrect predictions, and 2) our test-time augmentation outperforms a single-prediction baseline and dropout-based multiple predictions.

© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction

Segmentation of medical images is an essential task for many applications such as anatomical structure modeling, tumor growth measurement, surgical planning and treatment assessment [1]. Despite the breadth and depth of current research, it is very challenging to achieve accurate and reliable segmentation results for many targets [2]. This is often due to poor image quality, inhomogeneous appearances brought by pathology, various imaging protocols and large variations of the segmentation target among patients. Therefore, uncertainty estimation of segmentation results is critical for understanding how reliable the segmentations are. For example, for many images, the segmentation results of pixels near the boundary are likely to be uncertain because of the low contrast between the segmentation target and surrounding tissues; here, uncertainty information of the segmentation can be used to indicate potential mis-segmented regions or guide user interactions for refinement [3,4].

In recent years, deep learning with convolutional neural networks (CNNs) has achieved the state-of-the-art performance for many medical image segmentation tasks [5–7]. Despite their impressive performance and the ability of automatic feature learning, these approaches do not by default provide uncertainty estimation for their segmentation results. In addition, having access to a large training set plays an important role for deep CNNs to achieve human-level performance [8,9]. However, for medical image segmentation tasks, collecting a very large dataset with pixel-wise annotations for training is usually difficult and time-consuming.

∗ Corresponding author at: Wellcome / EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK.
E-mail addresses: [email protected], [email protected] (G. Wang).
https://doi.org/10.1016/j.neucom.2019.01.103
0925-2312/© 2019 The Authors. Published by Elsevier B.V.
G. Wang, W. Li and M. Aertsen et al. / Neurocomputing 338 (2019) 34–45 35
As a result, current medical image segmentation methods based on deep CNNs use relatively small datasets compared with those for natural image recognition [10]. This is likely to introduce more uncertain predictions for the segmentation results, and also leads to uncertainty in downstream analysis, such as volumetric measurement of the target. Therefore, uncertainty estimation is highly desired for deep CNN-based medical image segmentation methods.

Several works have investigated uncertainty estimation for deep neural networks [11–14]. They focused mainly on image classification or regression tasks, where the prediction outputs are high-level image labels or bounding box parameters; therefore uncertainty estimation is usually only given for the high-level predictions. In contrast, pixel-wise predictions are involved in segmentation tasks, therefore pixel-wise uncertainty estimation is highly desirable. In addition, in most interactive segmentation cases, pixel-wise uncertainty information is more helpful for intelligently guiding the user to give interactions. However, previous works have rarely demonstrated uncertainty estimation for deep CNN-based medical image segmentation. As suggested by Kendall and Gal [11], there are two major types of predictive uncertainties for deep CNNs: epistemic uncertainty and aleatoric uncertainty. Epistemic uncertainty is also known as model uncertainty and can be explained away given enough training data, while aleatoric uncertainty depends on noise or randomness in the input testing image.

In contrast to previous works focusing mainly on classification or regression-related uncertainty estimation, and the recent works of Nair et al. [15] and Roy et al. [16] investigating only test-time dropout-based (epistemic) uncertainty for segmentation, we extensively investigate different kinds of uncertainties for CNN-based medical image segmentation, including not only epistemic but also aleatoric uncertainties for this task. We also propose a more general estimation of aleatoric uncertainty that is related to not only image noise but also spatial transformations of the input, considering different possible poses of the object during image acquisition. To obtain the transformation-related uncertainty, we augment the input image at test time, and obtain an estimation of the distribution of the prediction based on test-time augmentation. Test-time augmentation (e.g., rotation, scaling, flipping) has recently been used to improve performance of image classification [17] and nodule detection [18]. Ayhan and Berens [14] also showed its utility for uncertainty estimation in a fundus image classification task. However, these previous works have not provided a mathematical or theoretical formulation for it. Motivated by these observations, we propose a mathematical formulation for test-time augmentation, and analyze its performance for general aleatoric uncertainty estimation in medical image segmentation tasks. In the proposed formulation, we represent an image as the result of an acquisition process which involves geometric transformations and image noise. We model the hidden parameters of the image acquisition process with prior distributions, and predict the distribution of the output segmentation for a test image with a Monte Carlo sampling process. With the samples from the distribution of the predictive output based on the same pre-trained CNN, the variance/entropy can be calculated for these samples, which provides an estimation of the aleatoric uncertainty for the segmentation.

The contribution of this work is three-fold. First, we propose a theoretical formulation of test-time augmentation for deep learning. Test-time augmentation has not been mathematically formulated by existing works, and our proposed mathematical formulation is general for image recognition tasks. Second, with the proposed formulation of test-time augmentation, we propose a general aleatoric uncertainty estimation for medical image segmentation, where the uncertainty comes from not only image noise but also spatial transformations. Third, we analyze different types of uncertainty estimation for deep CNN-based segmentation, and validate the superiority of the proposed general aleatoric uncertainty with both 2D and 3D segmentation tasks.

2. Related works

2.1. Segmentation uncertainty

Uncertainty estimation has been widely investigated for many existing medical image segmentation tasks. By way of example, Saad et al. [19] used shape and appearance prior information to estimate the uncertainty for probabilistic segmentation of medical imaging. Shi et al. [20] estimated the uncertainty of graph cut-based cardiac image segmentation, which was used to improve the robustness of the system. Sankaran et al. [21] estimated lumen segmentation uncertainty for realistic patient-specific blood flow modeling. Parisot et al. [22] used segmentation uncertainty to guide content-driven adaptive sampling for concurrent brain tumor segmentation and registration. Prassni et al. [3] visualized the uncertainty of a random walker-based segmentation to guide volume segmentation of brain Magnetic Resonance Images (MRI) and abdominal Computed Tomography (CT) images. Top et al. [23] combined uncertainty estimation with active learning to reduce user time for interactive 3D image segmentation.

2.2. Uncertainty estimation for deep CNNs

For deep CNNs, both epistemic and aleatoric uncertainties have been investigated in recent years. For model (epistemic) uncertainty, exact Bayesian networks offer a mathematically grounded method, but they are hard to implement and computationally expensive. Alternatively, it has been shown that dropout at test time can be cast as a Bayesian approximation to represent model uncertainty [24,25]. Zhu and Zabaras [13] used Stochastic Variational Gradient Descent (SVGD) to perform approximate Bayesian inference on uncertain CNN parameters. A variety of other approximation methods such as Markov chain Monte Carlo (MCMC) [26], Monte Carlo Batch Normalization (MCBN) [27] and variational Bayesian methods [28,29] have also been developed. Lakshminarayanan et al. [12] proposed ensembles of multiple models for uncertainty estimation, which is simple and scalable to implement. For test image-based (aleatoric) uncertainty, Kendall and Gal [11] proposed a unified Bayesian deep learning framework to learn mappings from input data to aleatoric uncertainty and composed them with epistemic uncertainty, where the aleatoric uncertainty was modeled as learned loss attenuation and further categorized into homoscedastic uncertainty and heteroscedastic uncertainty. Ayhan and Berens [14] used test-time augmentation for aleatoric uncertainty estimation, which is an efficient and effective way to explore the locality of a testing sample. However, its utility for medical image segmentation has not been demonstrated.

2.3. Test-time augmentation

Data augmentation was originally proposed for the training of deep neural networks. It was employed to enlarge a relatively small dataset by applying transformations to its samples to create new ones for training [30]. The transformations for augmentation typically include flipping, cropping, rotating, and scaling training images. Abdulkadir et al. [6] and Ronneberger et al. [31] also used elastic deformations for biomedical image segmentation. Several studies have empirically found that combining predictions of multiple transformed versions of a test image helps to improve performance. For example, Matsunaga et al. [17] geometrically transformed test images for skin lesion classification. [32] used a single model to predict multiple transformed copies of unlabeled
images for data distillation. Jin et al. [18] tested on samples extended by rotation and translation for pulmonary nodule detection. However, all these works used test-time augmentation as an ad hoc method, without detailed formulation or theoretical explanation, and did not apply it to uncertainty estimation for segmentation tasks.

3. Methods

The proposed general aleatoric uncertainty estimation is formulated in a consistent mathematical framework including two parts. The first part is a mathematical representation of ensembles of predictions of multiple transformed versions of the input. We represent an image as a result of an image acquisition model with hidden parameters in Section 3.1. Then we formulate test-time augmentation as inference with hidden parameters following given prior distributions in Section 3.2. The second part calculates the diversity of the prediction results of an augmented test image, and it is used to estimate the aleatoric uncertainty related to image transformations and noise. This is detailed in Section 3.3. Our proposed aleatoric uncertainty is compared and combined with epistemic uncertainty, which is described in Section 3.4. Finally, we apply our proposed method to structure-wise uncertainty estimation in Section 3.5.

3.1. Image acquisition model

The image acquisition model describes the process by which the observed images have been obtained. This process is confronted with many factors that can be related or unrelated to the imaged object, such as blurring, down-sampling, spatial transformation, and system noise. While blurring and down-sampling are commonly considered for image super-resolution [33], in the context of image recognition they have a relatively lower impact. Therefore, we focus on the spatial transformation and noise, and highlight that adding more complex intensity changes or other forms of data augmentation such as elastic deformations is a straightforward extension. The image acquisition model is:

X = T_β(X_0) + e    (1)

where X_0 is an underlying image in a certain position and orientation, i.e., a hidden variable. T is a transformation operator that is applied to X_0, β is the set of parameters of the transformation, and e represents the noise that is added to the transformed image. X denotes the observed image that is used for inference at test time. Though the transformations can be in spatial, intensity or feature space, in this work we only study the impact of reversible spatial transformations (e.g., flipping, scaling, rotation and translation), which are the most common types of transformations occurring during image acquisition and used for data augmentation purposes. Let T_β^{-1} denote the inverse transformation of T_β; then we have:

X_0 = T_β^{-1}(X − e)    (2)

Similarly to data augmentation at training time, we assume that the distribution of X covers the distribution of X_0. In a given application, this assumption leads to some prior distributions of the transformation parameters and noise. For example, in a 2D slice of fetal brain MRI, the orientation of the fetal brain can span all possible directions in a 2D plane, therefore the rotation angle r can be modeled with a uniform prior distribution r ∼ U(0, 2π). The image noise is commonly modeled as a Gaussian distribution, i.e., e ∼ N(μ, σ), where μ and σ are the mean and standard deviation respectively. Let p(β) and p(e) represent the prior distributions of β and e respectively; therefore we have β ∼ p(β) and e ∼ p(e).

3.2. Inference with hidden variables

Let Y and Y_0 be the labels related to X and X_0 respectively. For image classification, Y and Y_0 are categorical variables, and they should be invariant with regard to transformations and noise, therefore Y = Y_0. For image segmentation, Y and Y_0 are discretized label maps, and they are equivariant with the spatial transformation, i.e., Y = T_β(Y_0).

In the context of deep learning, let f(·) be the function represented by a neural network, and θ represent the parameters learned from a set of training images with their corresponding annotations. In a standard formulation, the prediction Y of a test image X is inferred by:

Y = f(θ, X)    (3)

For regression problems, Y refers to continuous values. For segmentation or classification problems, Y refers to discretized labels obtained by an argmax operation in the last layer of the network. Since X is only one of many possible observations of the underlying image X_0, direct inference with X may lead to a biased result affected by the specific transformation and noise associated with X. To address this problem, we aim at inferring it with the help of the latent X_0 instead:

Y = T_β(Y_0) = T_β(f(θ, X_0)) = T_β(f(θ, T_β^{-1}(X − e)))    (4)

where the exact values of β and e for X are unknown. Instead of finding a deterministic prediction of X, we alternatively consider the distribution of Y for a robust inference given the distributions of β and e:

p(Y|X) = p(T_β(f(θ, T_β^{-1}(X − e)))), where β ∼ p(β), e ∼ p(e)    (5)

For regression problems, we obtain the final prediction for X by calculating the expectation of Y using the distribution p(Y|X):

E(Y|X) = ∫ y p(y|X) dy = ∫_{β∼p(β), e∼p(e)} T_β(f(θ, T_β^{-1}(X − e))) p(β) p(e) dβ de    (6)

Calculating E(Y|X) with Eq. (6) is computationally expensive, as β and e may take continuous values and p(β) is a complex joint distribution of different types of transformations. Alternatively, we estimate E(Y|X) by using Monte Carlo simulation. Let N represent the total number of simulation runs. In the nth simulation run, the prediction is:

y_n = T_{β_n}(f(θ, T_{β_n}^{-1}(X − e_n)))    (7)

where β_n ∼ p(β), e_n ∼ p(e). To obtain y_n, we first randomly sample β_n and e_n from the prior distributions p(β) and p(e), respectively. Then we obtain one possible hidden image with β_n and e_n based on Eq. (2), and feed it into the trained network to get its prediction, which is transformed with β_n to obtain y_n according to Eq. (4). With the set 𝒴 = {y_1, y_2, ..., y_N} sampled from p(Y|X), E(Y|X) is estimated as the average of 𝒴 and we use it as the final prediction Ŷ for X:

Ŷ = E(Y|X) ≈ (1/N) Σ_{n=1}^{N} y_n    (8)

For classification or segmentation problems, p(Y|X) is a discretized distribution. We obtain the final prediction for X by maximum likelihood estimation:

Ŷ = argmax_y p(y|X) ≈ Mode(𝒴)    (9)
where Mode(𝒴) is the most frequent element in 𝒴. This corresponds to majority voting of multiple predictions.

3.3. Aleatoric uncertainty estimation with test-time augmentation

The uncertainty is estimated by measuring how diverse the predictions for a given image are. Both the variance and the entropy of the distribution p(Y|X) can be used to estimate uncertainty. However, variance is not sufficiently representative in the context of multi-modal distributions. In this paper we use entropy for uncertainty estimation:

H(Y|X) = − ∫ p(y|X) ln(p(y|X)) dy    (10)

With the Monte Carlo simulation in Section 3.2, we can approximate H(Y|X) from the simulation results 𝒴 = {y_1, y_2, ..., y_N}. Suppose there are M unique values in 𝒴. For classification tasks, this typically refers to M labels. Assume the frequency of the mth unique value is p̂_m; then H(Y|X) is approximated as:

H(Y|X) ≈ − Σ_{m=1}^{M} p̂_m ln(p̂_m)    (11)

For segmentation tasks, pixel-wise uncertainty estimation is desirable. Let Y_i denote the predicted label for the ith pixel. With the Monte Carlo simulation, a set of values for Y_i is obtained: 𝒴_i = {y_i^1, y_i^2, ..., y_i^N}. The entropy of the distribution of Y_i is therefore approximated as:

H(Y_i|X) ≈ − Σ_{m=1}^{M} p̂_i^m ln(p̂_i^m)    (12)

where p̂_i^m is the frequency of the mth unique value in 𝒴_i.

3.4. Epistemic uncertainty estimation

To obtain a model (epistemic) uncertainty estimation, we follow the test-time dropout method proposed by [24]. In this method, let q(θ) be an approximating distribution over the set of network parameters θ with its elements randomly set to zero according to Bernoulli random variables. q(θ) can be achieved by minimizing the Kullback–Leibler divergence between q(θ) and the posterior distribution of θ given a training set. After training, the predictive distribution of a test image X can be expressed as:

p(Y|X) = ∫ p(Y|X, ω) q(ω) dω    (13)

The distribution of the prediction can be sampled based on Monte Carlo samples of the trained network (i.e., MC dropout): y_n = f(θ_n, X), where θ_n is a Monte Carlo sample from q(θ). Assume the number of samples is N, and the sampled set of the distribution of Y is 𝒴 = {y_1, y_2, ..., y_N}. The final prediction for X can be estimated by Eq. (8) for regression problems or Eq. (9) for classification/segmentation problems. The epistemic uncertainty estimation can therefore be calculated based on the variance or entropy of the sampled N predictions. To keep consistent with our aleatoric uncertainty, we use entropy for this purpose, similarly to Eq. (12). Test-time dropout may be interpreted as a way of using ensembles of networks for testing. In the work of Lakshminarayanan et al. [12], ensembles of neural networks were explicitly proposed as an alternative to test-time dropout for estimating epistemic uncertainty.

3.5. Structure-wise uncertainty estimation

Nair et al. [15] and Roy et al. [16] used Monte Carlo samples generated by test-time dropout for structure/lesion-wise uncertainty estimation. Following these works, we extend the structure-wise uncertainty estimation method by using Monte Carlo samples generated by not only test-time dropout, but also test-time augmentation described in Section 3.2. For N samples from the Monte Carlo simulation, let V = {v_1, v_2, ..., v_N} denote the set of volumes of the segmented structure, where v_i is the volume of the segmented structure in the ith simulation. Let μ_V and σ_V denote the mean value and standard deviation of V respectively. We use the volume variation coefficient (VVC) to estimate the structure-wise uncertainty:

VVC = σ_V / μ_V    (14)

where VVC is agnostic to the size of the segmented structure.

4. Experiments and results

We validated our proposed testing and uncertainty estimation method with two segmentation tasks: 2D fetal brain segmentation from MRI slices and 3D brain tumor segmentation from multi-modal MRI volumes. The implementation details for 2D and 3D segmentation are described in Sections 4.1 and 4.2 respectively. In both tasks, we compared different types of uncertainties for the segmentation results: 1) the proposed aleatoric uncertainty based on our formulated test-time augmentation (TTA), 2) the epistemic uncertainty based on test-time dropout (TTD) described in Section 3.4, and 3) a hybrid uncertainty that combines the aleatoric and epistemic uncertainties based on TTA + TTD. For each of these three methods, the uncertainty was obtained by Eq. (12) with N predictions. For TTD and TTA + TTD, the dropout probability was set as a typical value of 0.5 [24].

We also evaluated the segmentation accuracy of these different prediction methods: TTA, TTD, TTA + TTD and the baseline that uses a single prediction without TTA and TTD. For a given training set, all these methods used the same model that was trained with data augmentation and dropout at training time. The augmentation during training followed the same formulation as in Section 3.1. We investigated the relationship between each type of uncertainty and segmentation error in order to know which uncertainty has a better ability to indicate potential mis-segmentations. Quantitative evaluations of segmentation accuracy are based on the Dice score and the Average Symmetric Surface Distance (ASSD):

Dice = 2·TP / (2·TP + FN + FP)    (15)

where TP, FP and FN are the numbers of true positives, false positives and false negatives respectively. The definition of ASSD is:

ASSD = (1 / (|S| + |G|)) · (Σ_{s∈S} d(s, G) + Σ_{g∈G} d(g, S))    (16)

where S and G denote the sets of surface points of a segmentation result and the ground truth respectively, and d(s, G) is the shortest Euclidean distance between a point s ∈ S and all the points in G.

4.1. 2D fetal brain segmentation from MRI

Fetal MRI has been increasingly used for the study of the developing fetus as it provides a better soft tissue contrast than the widely used prenatal sonography. The most commonly used imaging protocol for fetal MRI is Single-Shot Fast Spin Echo (SSFSE), which acquires images at a fast speed and mitigates the effect of fetal motion, leading to stacks of thick 2D slices. Segmentation is a fundamental step for fetal brain studies, e.g., it plays an important role in inter-slice motion correction and high-resolution volume reconstruction [34,35]. Recently, CNNs have achieved the state-of-the-art performance for 2D fetal brain segmentation [36–38]. In this experiment, we segment the 2D fetal brain using deep CNNs with uncertainty estimation.
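The Monte Carlo TTA inference of Eqs. (7) and (9), the pixel-wise entropy of Eq. (12), and the VVC of Eq. (14) can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming, not the authors' released code; `predict`, `sample_beta` and `sample_noise` are hypothetical stand-ins for the trained network and the prior samplers.

```python
import numpy as np

def tta_predict(x, predict, sample_beta, sample_noise, n=20):
    """Monte Carlo test-time augmentation (sketch of Eqs. (7) and (9)).

    predict      -- trained network mapping an image to an integer label map
    sample_beta  -- draws a pair (transform, inverse_transform) from p(beta)
    sample_noise -- draws a noise field e from p(e) for a given shape
    All three are hypothetical stand-ins, not part of any released code.
    """
    samples = []
    for _ in range(n):
        t, t_inv = sample_beta()           # beta_n ~ p(beta)
        e = sample_noise(x.shape)          # e_n ~ p(e)
        x0 = t_inv(x - e)                  # one possible hidden image, Eq. (2)
        samples.append(t(predict(x0)))     # y_n = T_beta(f(theta, X0)), Eq. (7)
    y = np.stack(samples).astype(int)      # N samples from p(Y|X)
    # majority voting per pixel approximates the mode of p(Y|X), Eq. (9)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, y)

def pixelwise_entropy(samples):
    """Pixel-wise entropy from N label maps of shape (N, H, W), Eq. (12)."""
    n = samples.shape[0]
    h = np.zeros(samples.shape[1:])
    for m in np.unique(samples):
        p = (samples == m).sum(axis=0) / n       # frequency of label m per pixel
        # the 0 * ln(0) terms are taken as 0
        h -= np.where(p > 0, p * np.log(np.clip(p, 1e-12, None)), 0.0)
    return h

def vvc(samples):
    """Structure-wise volume variation coefficient over N segmentations, Eq. (14)."""
    volumes = samples.reshape(samples.shape[0], -1).sum(axis=1)
    return volumes.std() / volumes.mean()
```

With identity transforms and zero noise, `tta_predict` reduces to the single-prediction baseline, which is a useful sanity check when wiring up real samplers.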
Fig. 1. Visual comparison of different types of uncertainties and their corresponding segmentations for fetal brain. The uncertainty maps in odd columns are based on Monte Carlo simulation with N = 20 and encoded by the color bar in the upper left corner (low uncertainty shown in purple and high uncertainty shown in yellow). The white arrows in (a) show the aleatoric and hybrid uncertainties in a challenging area, and the white arrows in (b) and (c) show mis-segmented regions with very low epistemic uncertainty. TTD: test-time dropout, TTA: test-time augmentation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
4.1.1. Data and implementation

We collected clinical T2-weighted MRI scans of 60 fetuses in the second trimester with SSFSE on a 1.5 Tesla MR system (Aera, Siemens, Erlangen, Germany). The data for each fetus contained three stacks of 2D slices acquired in axial, sagittal and coronal views respectively, with pixel size 0.63–1.58 mm and slice thickness 3–6 mm. The gestational age ranged from 19 weeks to 33 weeks. We used 2640 slices from 120 stacks of 40 patients for training, 278 slices from 12 stacks of 4 patients for validation and 1180 slices from 48 stacks of 16 patients for testing. Two radiologists manually segmented the brain region for all the stacks slice-by-slice, where one radiologist gave a segmentation first and then the second, senior radiologist refined the segmentation if disagreement existed; the output was used as the ground truth. We used this dataset for two reasons. First, our dataset fits with a typical medical image segmentation application where the number of annotated images is limited. This makes uncertainty information of high interest for robust prediction and for downstream tasks such as fetal brain reconstruction and volume measurement. Second, the position and orientation of the fetal brain have large variations, which is suitable for investigating the effect of data augmentation. For preprocessing, we normalized each stack by its intensity mean and standard deviation, and resampled each slice with pixel size 1.0 mm.

We used the 2D networks Fully Convolutional Network (FCN) [39], U-Net [31] and P-Net [4]. The networks were implemented in TensorFlow¹ [40] using NiftyNet² [25,41]. During training, we used Adaptive Moment Estimation (Adam) to adjust the learning rate, which was initialized as 10⁻³, with batch size 5, weight decay 10⁻⁷ and iteration number 10k. We represented the transformation parameter β in the proposed augmentation framework as a combination of fl, r and s, where fl is a random variable for flipping along each 2D axis, r is the rotation angle in 2D, and s is a scaling factor. The prior distributions of these transformation parameters and the random intensity noise were modeled as fl ∼ Bern(μf), r ∼ U(r0, r1), s ∼ U(s0, s1) and e ∼ N(μe, σe). The hyper-parameters for our fetal brain segmentation task were set as μf = 0.5, r0 = 0, r1 = 2π, s0 = 0.8 and s1 = 1.2. For the random noise, we set μe = 0 and σe = 0.05, as a median-filter smoothed version of a normalized image in our dataset has a standard deviation around 0.95. We augmented the training data with this formulation, and during test time, TTA used the same prior distributions of augmentation parameters as used for training.

4.1.2. Segmentation results with uncertainty

Fig. 1 shows a visual comparison of different types of uncertainties for segmentation of three fetal brain images in coronal, sagittal and axial view respectively. The results were based on the same trained model of U-Net with train-time augmentation, and the Monte Carlo simulation number N was 20 for TTD, TTA, and TTA + TTD to obtain epistemic, aleatoric and hybrid uncertainties

¹ https://www.tensorflow.org.
² http://www.niftynet.io.
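One draw of the augmentation parameters under the priors above (fl ∼ Bern(0.5), r ∼ U(0, 2π), s ∼ U(0.8, 1.2), e ∼ N(0, 0.05)) can be sampled as in the sketch below. The function names are ours for illustration, not NiftyNet APIs.

```python
import numpy as np

def sample_transform_params(rng):
    """Draw one set of augmentation parameters beta = (fl, r, s) from the
    priors stated for the fetal brain task (illustrative sketch)."""
    return {
        "flip": [bool(rng.random() < 0.5) for _ in range(2)],  # fl ~ Bern(0.5), one per 2D axis
        "rotation": rng.uniform(0.0, 2.0 * np.pi),             # r ~ U(0, 2*pi), in radians
        "scale": rng.uniform(0.8, 1.2),                        # s ~ U(0.8, 1.2)
    }

def add_noise(image, rng, mu=0.0, sigma=0.05):
    """Additive intensity noise e ~ N(mu, sigma), as in Eq. (1)."""
    return image + rng.normal(mu, sigma, image.shape)
```

The same sampler can drive both training-time augmentation and TTA, which is exactly the symmetry the formulation above requires.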
Fig. 2. Dice of 2D fetal brain segmentation with different numbers of Monte Carlo simulation runs N.
Table 1
Dice (%) and ASSD (mm) evaluation of 2D fetal brain segmentation with different training and testing methods, for three networks (FCN, U-Net and P-Net). Tr − Aug: training without data augmentation. Tr + Aug: training with data augmentation. ∗ denotes significant improvement from the baseline of single prediction in Tr − Aug and Tr + Aug respectively (p-value < 0.05). † denotes significant improvement from Tr − Aug with TTA + TTD (p-value < 0.05).

Training | Testing   | Dice: FCN      | Dice: U-Net    | Dice: P-Net    | ASSD: FCN    | ASSD: U-Net   | ASSD: P-Net
Tr − Aug | Baseline  | 91.05 ± 3.82   | 90.26 ± 4.77   | 90.65 ± 4.29   | 2.68 ± 2.93  | 3.11 ± 3.34   | 2.83 ± 3.07
Tr − Aug | TTD       | 91.13 ± 3.60   | 90.38 ± 4.30   | 90.93 ± 4.04   | 2.61 ± 2.85  | 3.04 ± 2.29   | 2.69 ± 2.90
Tr − Aug | TTA       | 91.99 ± 3.48∗  | 91.64 ± 4.11∗  | 92.02 ± 3.85∗  | 2.26 ± 2.56∗ | 2.51 ± 3.23∗  | 2.28 ± 2.61∗
Tr − Aug | TTA + TTD | 92.05 ± 3.58∗  | 91.88 ± 3.61∗  | 92.17 ± 3.68∗  | 2.19 ± 2.67∗ | 2.40 ± 2.71∗  | 2.13 ± 2.42∗
Tr + Aug | Baseline  | 92.03 ± 3.44   | 91.93 ± 3.21   | 91.98 ± 3.92   | 2.21 ± 2.52  | 2.12 ± 2.23   | 2.32 ± 2.71
Tr + Aug | TTD       | 92.08 ± 3.41   | 92.00 ± 3.22   | 92.01 ± 3.89   | 2.17 ± 2.52  | 2.03 ± 2.13   | 2.15 ± 2.58
Tr + Aug | TTA       | 92.79 ± 3.34∗  | 92.88 ± 3.15∗  | 93.05 ± 2.96∗  | 1.88 ± 2.08  | 1.70 ± 1.75   | 1.62 ± 1.77∗
Tr + Aug | TTA + TTD | 92.85 ± 3.15∗† | 92.90 ± 3.16∗† | 93.14 ± 2.93∗† | 1.84 ± 1.92  | 1.67 ± 1.76∗† | 1.48 ± 1.63∗†
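The Dice and ASSD values reported in Table 1 follow Eqs. (15) and (16). A minimal sketch of both metrics (our own illustration, not the authors' evaluation code) is:

```python
import numpy as np

def dice(seg, gt):
    """Dice score (Eq. (15)) for binary masks of the same shape."""
    tp = np.logical_and(seg, gt).sum()          # true positives
    # seg.sum() + gt.sum() == 2*TP + FP + FN
    return 2.0 * tp / (seg.sum() + gt.sum())

def assd(surf_s, surf_g):
    """Average Symmetric Surface Distance (Eq. (16)).

    surf_s, surf_g -- arrays of surface point coordinates, shape (n, d)."""
    d_sg = [np.min(np.linalg.norm(surf_g - p, axis=1)) for p in surf_s]
    d_gs = [np.min(np.linalg.norm(surf_s - p, axis=1)) for p in surf_g]
    return (sum(d_sg) + sum(d_gs)) / (len(surf_s) + len(surf_g))
```

In practice the surface point sets S and G would be extracted from the binary masks (e.g., via morphological erosion), a step omitted here for brevity.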
respectively. In each subfigure, the first row presents the input and the segmentation obtained by the single-prediction baseline. The other rows show these three types of uncertainties and their corresponding segmentation results respectively. The uncertainty maps in odd columns are represented by the pixel-wise entropy of N predictions and encoded by the color bar in the top left corner: purple pixels have low uncertainty values and yellow pixels have high uncertainty values. Fig. 1(a) shows a fetal brain in coronal view. In this case, the baseline method achieved a good segmentation result. It can be observed that for the epistemic uncertainty calculated by TTD, most of the uncertain segmentations are located near the border of the segmented foreground, while pixels at a larger distance from the border have very high confidence (i.e., low uncertainty). In addition, the epistemic uncertainty map contains some random noise in the brain region. In contrast, the aleatoric uncertainty obtained by TTA contains less random noise, and it shows uncertain segmentations not only on the border but also in some challenging areas in the lower right corner, as highlighted by the white arrows. In that region, the result obtained by TTA has an over-segmentation, which corresponds to the high values in the same region of the aleatoric uncertainty map. The hybrid uncertainty calculated by TTA + TTD is a mixture of epistemic and aleatoric uncertainty. As shown in the last row of Fig. 1(a), it looks similar to the aleatoric uncertainty map except for some random noise.

Fig. 1(b) and (c) show two other cases where the single-prediction baseline obtained an over-segmentation and an under-segmentation respectively. It can be observed that the epistemic uncertainty map shows a high confidence (low uncertainty) in these mis-segmented regions. This leads to many overconfident incorrect segmentations, as highlighted by the white arrows in Fig. 1(b) and (c). In comparison, the aleatoric uncertainty map obtained by TTA shows a larger uncertain area that mainly corresponds to the mis-segmented regions of the baseline. In these two cases, the hybrid uncertainty also looks similar to the aleatoric uncertainty map. This comparison indicates that the aleatoric uncertainty has a better ability than the epistemic uncertainty to indicate mis-segmentations of non-border pixels. For these pixels, the segmentation output is more affected by different transformations of the input (aleatoric) than by variations of model parameters (epistemic).

Fig. 1(b) and (c) also show that TTD with different model parameters obtained very little improvement over the baseline. In comparison, TTA with different input transformations corrected the large mis-segmentations and achieved a more noticeable improvement over the baseline. It can also be observed that the results obtained by TTA + TTD are very similar to those obtained by TTA, which shows that TTA is more suitable than TTD for improving the segmentation.

4.1.3. Quantitative evaluation

To quantitatively evaluate the segmentation results, we measured the Dice score and ASSD of predictions by different testing methods with three network structures: FCN [39], U-Net [31] and P-Net [4]. For all of these CNNs, we used data augmentation at training time to enlarge the training set. At inference time, we compared the baseline testing method (without Monte Carlo simulation) with TTD, TTA and TTA + TTD. We first investigated how the segmentation accuracy changes with the number of Monte Carlo simulation runs N. The results measured with all the testing images are shown in Fig. 2. We found that for all three networks, the segmentation accuracy of TTD remains close to that of the single-prediction baseline. For TTA and TTA + TTD, an improvement of segmentation accuracy can be observed when N increases from 1 to 10. When N is larger than 20, the segmentation accuracy of these two methods reaches a plateau.

In addition to the previous scenario using augmentation at both training and test time, we also evaluated the performance of TTD and TTA when data augmentation was not used for training. The quantitative evaluations of combinations of different training and testing methods (N = 20) are shown in Table 1. It can be observed that for training both with and without data augmentation, TTA has a better ability to improve the segmentation accuracy than TTD. Combining TTA and TTD can further improve the segmentation accuracy, but it does not significantly outperform TTA (p-value > 0.05).

Fig. 3 shows Dice distributions of five example stacks of fetal brain MRI.
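The pixel-wise entropy over the N Monte Carlo predictions used for the uncertainty maps above can be sketched as follows. This is a minimal numpy illustration (not the authors' implementation), assuming the N predictions are integer label maps:

```python
import numpy as np

def entropy_uncertainty(label_maps, num_classes):
    """Pixel-wise entropy of N Monte Carlo label predictions.

    label_maps: array of shape (N, H, W) with integer class labels.
    Returns an (H, W) map; 0 where all runs agree, high where they disagree.
    """
    # Empirical class frequency at each pixel across the N runs.
    freq = np.stack([(label_maps == c).mean(axis=0) for c in range(num_classes)])
    # Entropy H = -sum_c p_c * log(p_c), with 0 * log(0) treated as 0.
    logp = np.log(np.where(freq > 0, freq, 1.0))
    return -(freq * logp).sum(axis=0)
```

A pixel where the N runs split evenly between two classes attains the maximum two-class entropy log 2, while a pixel where all runs agree has entropy 0.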
40 G. Wang, W. Li and M. Aertsen et al. / Neurocomputing 338 (2019) 34–45
Fig. 3. Dice distributions of segmentation results with different testing methods for five example stacks of 2D slices of fetal brain MRI. Note TTA’s higher mean value and
variance compared with TTD.
Fig. 4. Normalized joint histogram of prediction uncertainty and error rate for 2D fetal brain segmentation. The average error rates at different uncertainty levels are depicted
by the red curves. The dashed ellipses show that TTA leads to a lower occurrence of overconfident incorrect predictions than TTD. (For interpretation of the references to
color in this figure legend, the reader is referred to the web version of this article.)
The results were based on the same trained model of U-Net with train-time augmentation. Note that the baseline had only one prediction for each image, and the Monte Carlo simulation number N was 20 for TTD, TTA and TTA + TTD. It can be observed that for each case, the Dice of TTD is distributed closely around that of the baseline. In comparison, the Dice distribution of TTA has a higher average than that of TTD, indicating TTA's better ability to improve segmentation accuracy. The results of TTA also have a larger variance than those of TTD, which shows that TTA can provide more structure-wise uncertainty information. Fig. 3 also shows that the performance of TTA + TTD is close to that of TTA.

4.1.4. Correlation between uncertainty and segmentation error

To investigate how our uncertainty estimation methods can indicate incorrect segmentation, we measured the uncertainty and segmentation error at both the pixel level and the structure level. For pixel-level evaluation, we measured the joint histogram of pixel-wise uncertainty and error rate for TTD, TTA, and TTA + TTD respectively. The histogram was obtained by calculating the error rate of pixels at different pixel-wise uncertainty levels in each slice. The results based on U-Net with N = 20 are shown in Fig. 4, where the joint histograms have been normalized by the total number of pixels in the testing images for visualization. For each type of pixel-wise uncertainty, we calculated the average error rate at each pixel-wise uncertainty level, leading to a curve of error rate as a function of pixel-wise uncertainty, i.e., the red curves in Fig. 4. This figure shows that the majority of pixels have a low uncertainty with a small error rate. When the uncertainty increases, the error rate gradually becomes higher. Fig. 4(a) shows the TTD-based uncertainty (epistemic). It can be observed that when the prediction uncertainty is low, the error rate increases steeply. In contrast, for the TTA-based uncertainty (aleatoric), the increase of the error rate is slower, as shown in Fig. 4(b). This demonstrates that TTA has fewer overconfident incorrect predictions than TTD. The dashed ellipses in Fig. 4 also show the different levels of overconfident incorrect predictions for the different testing methods.

For structure-wise evaluation, we used VVC to represent structure-wise uncertainty and 1−Dice to represent structure-wise segmentation error. Fig. 5 shows the joint distribution of VVC and 1−Dice for different testing methods using U-Net trained with data augmentation and N = 20 for inference. The results of TTD, TTA, and TTA + TTD are shown in Fig. 5(a)–(c) respectively. It can be observed that for all three testing methods, the VVC tends to become larger when 1−Dice grows. However, the slope in Fig. 5(a) is smaller than those in Fig. 5(b) and (c). This comparison shows that TTA-based structure-wise uncertainty estimation is highly related to segmentation error, and TTA leads to a larger scale of VVC than TTD. Combining TTA and TTD leads to results similar to those of TTA.

4.2. 3D brain tumor segmentation from multi-modal MRI

MRI has become the most commonly used imaging method for brain tumors. Different MR sequences, such as T1-weighted (T1w), contrast-enhanced T1-weighted (T1wce), T2-weighted (T2w) and Fluid Attenuation Inversion Recovery (FLAIR) images, can provide complementary information for analyzing multiple subregions of brain tumors. Automatic brain tumor segmentation from multi-modal MRI has potential for better diagnosis, surgical planning and treatment assessment [42]. Deep neural networks have achieved state-of-the-art performance on this task [7,43]. In this experiment, we analyze the uncertainty of deep CNN-based brain tumor segmentation and show the effect of our proposed test-time augmentation.
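The error-rate-versus-uncertainty curves described for Fig. 4 can be computed by binning pixels by uncertainty level and averaging the error indicator within each bin. A minimal numpy sketch (the helper name and bin count are illustrative, not from the paper):

```python
import numpy as np

def error_rate_by_uncertainty(uncertainty, errors, num_bins=10):
    """Average pixel error rate at discretized uncertainty levels.

    uncertainty: pixel-wise uncertainty values (any shape).
    errors: same-shape binary array, 1 where the prediction is wrong.
    Returns (bin_centers, mean_error_rate_per_bin); empty bins give nan.
    """
    u = np.asarray(uncertainty, float).ravel()
    e = np.asarray(errors, float).ravel()
    edges = np.linspace(u.min(), u.max(), num_bins + 1)
    # Map each pixel to a bin index in [0, num_bins - 1].
    idx = np.clip(np.digitize(u, edges) - 1, 0, num_bins - 1)
    rates = np.array([e[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(num_bins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, rates
```

A well-behaved uncertainty estimate yields a monotonically increasing curve: low-uncertainty bins should contain few errors, which is exactly the property the dashed ellipses in Fig. 4 probe.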
Fig. 5. Structure-wise uncertainty in terms of volume variation coefficient (VVC) vs 1−Dice for different testing methods in 2D fetal brain segmentation.
Fig. 6. Visual comparison of different testing methods for 3D brain tumor segmentation. The uncertainty maps in odd columns are based on Monte Carlo simulation with
N = 40 and encoded by the color bar in the top left corner (low uncertainty shown in purple and high uncertainty shown in yellow). TTD: test-time dropout, TTA: test-time
augmentation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
4.2.1. Data and implementation

We used the BraTS 2017 training dataset [44] (http://www.med.upenn.edu/sbia/brats2017.html), which consisted of volumetric images from 285 studies, with ground truth provided by the organizers. We randomly selected 20 studies for validation and 50 studies for testing, and used the remaining studies for training. For each study, there were four scans of T1w, T1wce, T2w and FLAIR images, and they had been co-registered. All the images were skull-stripped and re-sampled to an isotropic 1 mm³ resolution. As a first demonstration of uncertainty estimation for deep learning-based brain tumor segmentation, we investigate segmentation of the whole tumor from these multi-modal images (Fig. 6).

We used 3D U-Net [6], V-Net [5] and W-Net [43] implemented with NiftyNet [41], and employed Adam during training with an initial learning rate of 10⁻³, batch size 2, weight decay 10⁻⁷ and 20k iterations. W-Net is a 2.5D network, and we compared using W-Net only in the axial view with a fusion of the axial, sagittal and coronal views; these two implementations are referred to as W-Net(A) and W-Net(ASC) respectively. The transformation parameter β in the proposed augmentation framework consisted of fl, r, s and e, where fl is a random variable for flipping along each 3D axis, r is the rotation angle along each 3D axis, s is a scaling factor and e is intensity noise. The prior distributions were: fl ∼ Bern(0.5), r ∼ U(0, 2π), s ∼ U(0.8, 1.2) and e ∼ N(0, 0.05), according to the reduced standard deviation of a median-filtered version of a normalized image. We used this formulated augmentation during training, and also employed it to obtain TTA-based results at test time.
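Under priors like those above, one Monte Carlo TTA run samples a transform, predicts on the transformed image, and maps the prediction back before aggregating. The following 2D sketch is a simplification of the paper's framework: `predict_fn` is a placeholder for the trained network, and rotation is restricted to multiples of 90° so the spatial transform is exactly invertible without resampling (the paper also uses arbitrary rotations and scalings).

```python
import numpy as np

def tta_predict(image, predict_fn, n_runs=20, noise_std=0.05, rng=None):
    """Monte Carlo test-time augmentation (simplified: flips, 90-degree
    rotations and additive Gaussian noise).

    predict_fn: maps an (H, W) image to an (H, W) foreground-probability map.
    Returns the mean probability map and its pixel-wise variance.
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = []
    for _ in range(n_runs):
        fl = rng.random() < 0.5           # fl ~ Bern(0.5): horizontal flip
        k = rng.integers(0, 4)            # rotation by k * 90 degrees
        aug = np.rot90(np.flip(image, axis=1) if fl else image, k)
        aug = aug + rng.normal(0.0, noise_std, aug.shape)  # intensity noise e
        p = predict_fn(aug)
        # Invert the spatial transform so all predictions are aligned.
        p = np.rot90(p, -k)
        if fl:
            p = np.flip(p, axis=1)
        probs.append(p)
    probs = np.stack(probs)
    return probs.mean(axis=0), probs.var(axis=0)
```

The mean map gives the TTA segmentation (after thresholding or argmax), and the spread across runs is the source of the aleatoric uncertainty maps.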
Table 2
Dice (%) and ASSD (mm) evaluation of 3D brain tumor segmentation with different training and testing methods. Tr − Aug: Training without data augmentation. Tr + Aug:
Training with data augmentation. W-Net is a 2.5D network and W-Net (ASC) denotes the fusion of axial, sagittal and coronal views according to [43]. ∗ denotes significant
improvement from the baseline of single prediction in Tr − Aug and Tr + Aug respectively (p-value < 0.05). † denotes significant improvement from Tr − Aug with TTA + TTD
(p-value < 0.05).
| Training | Testing | Dice (%) | Dice (%) | Dice (%) | ASSD (mm) | ASSD (mm) | ASSD (mm) |
|---|---|---|---|---|---|---|---|
| Tr − Aug | Baseline | 87.81 ± 7.27 | 87.26 ± 7.73 | 86.84 ± 8.38 | 2.04 ± 1.27 | 2.62 ± 1.48 | 2.86 ± 1.79 |
| | TTD | 88.14 ± 7.02 | 87.55 ± 7.33 | 87.13 ± 8.14 | 1.95 ± 1.20 | 2.55 ± 1.41 | 2.82 ± 1.75 |
| | TTA | 89.16 ± 6.48∗ | 88.58 ± 6.50∗ | 87.86 ± 6.97∗ | 1.42 ± 0.93∗ | 1.79 ± 1.16∗ | 1.97 ± 1.40∗ |
| | TTA + TTD | 89.43 ± 6.14∗ | 88.75 ± 6.34∗ | 88.03 ± 6.56∗ | 1.37 ± 0.89∗ | 1.72 ± 1.23∗ | 1.95 ± 1.31∗ |
| Tr + Aug | Baseline | 88.76 ± 5.76 | 88.43 ± 6.67 | 87.44 ± 7.84 | 1.61 ± 1.12 | 1.82 ± 1.17 | 2.07 ± 1.46 |
| | TTD | 88.92 ± 5.73 | 88.52 ± 6.66 | 87.56 ± 7.78 | 1.57 ± 1.06 | 1.76 ± 1.14 | 1.99 ± 1.33 |
| | TTA | 90.07 ± 5.69∗ | 89.41 ± 6.05∗ | 88.38 ± 6.74∗ | 1.13 ± 0.54∗ | 1.45 ± 0.81 | 1.67 ± 0.98∗ |
| | TTA + TTD | 90.35 ± 5.64∗† | 89.60 ± 5.95∗† | 88.57 ± 6.32∗† | 1.10 ± 0.49∗ | 1.39 ± 0.76∗† | 1.62 ± 0.95∗† |
Fig. 7. Normalized joint histogram of prediction uncertainty and error rate for 3D brain tumor segmentation. The average error rates at different uncertainty levels are
depicted by the red curves. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 8. Structure-wise uncertainty in terms of volume variation coefficient (VVC) vs 1−Dice for different testing methods in 3D brain tumor segmentation.
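The volume variation coefficient used as structure-wise uncertainty in Figs. 5 and 8 can be sketched as the coefficient of variation of the N foreground volumes. This is a minimal numpy illustration, assuming the σ/μ (standard deviation over mean) definition of VVC:

```python
import numpy as np

def vvc(segmentations):
    """Volume variation coefficient of N Monte Carlo segmentations.

    segmentations: (N, ...) stack of binary masks, one per Monte Carlo run.
    Returns sigma_V / mu_V, the ratio of the standard deviation to the
    mean of the N foreground volumes.
    """
    volumes = segmentations.reshape(segmentations.shape[0], -1).sum(axis=1)
    return volumes.std() / volumes.mean()
```

A VVC of 0 means every Monte Carlo run segmented a structure of the same volume; larger values indicate stronger structure-level disagreement.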
TTA-based aleatoric uncertainty leads to fewer over-confident mis-segmentations than the TTD-based epistemic uncertainty.

For structure-level evaluation, we also studied the relationship between structure-level uncertainty represented by VVC and structure-level error represented by 1−Dice. Fig. 8 shows their joint distributions with three different testing methods using 3D U-Net. The network was trained with data augmentation, and N was set to 40 for inference. Fig. 8 shows that TTA-based VVC increases when 1−Dice grows, and the slope is larger than that of TTD-based VVC. The results of TTA and TTA + TTD are similar, as shown in Fig. 8(b) and (c). This comparison shows that TTA-based structure-wise uncertainty can better indicate segmentation error than TTD-based structure-wise uncertainty.

5. Discussion and conclusion

In our experiments, the number of training images was relatively small compared with many datasets of natural images such as PASCAL VOC, COCO and ImageNet. For medical images, it is typically very difficult to collect a very large dataset for segmentation, as pixel-wise annotations are not only time-consuming to collect but also require the expertise of radiologists. Therefore, for most existing medical image segmentation datasets, such as those in Grand Challenge (https://grand-challenge.org/challenges), the image numbers are quite small. Investigating the segmentation performance of CNNs with limited training data is therefore of high interest for the medical image computing community. In addition, our dataset is not very large, so it is suitable for data augmentation, which fits well with our motivation of using data augmentation at training and test time. The need for uncertainty estimation is also stronger in cases where datasets are smaller.

In our mathematical formulation of test-time augmentation based on an image acquisition model, we explicitly modeled spatial transformations and image noise. However, it can easily be extended to include more general transformations such as elastic deformations [6] or a simulated bias field for MRI. In addition to the variation of possible values of model parameters, the prediction result also depends on the input data, e.g., image noise and transformations related to the object. Therefore, a good uncertainty estimation should take these factors into consideration. Figs. 1 and 6 show that model uncertainty alone is likely to yield overconfident incorrect predictions, and TTA plays an important role in reducing such predictions. In Fig. 3 we show five example cases, where each subfigure shows the results for one patient. Table 1 shows the statistical results based on all the testing images. We found that for a few testing images TTA + TTD failed to obtain higher Dice scores than TTA, but over all testing images, the average Dice of TTA + TTD is slightly larger than that of TTA. This leads to the conclusion that TTA + TTD does not always perform better than TTA, and the average performance of TTA + TTD is close to that of TTA, which is also demonstrated in Figs. 1 and 6.

We have demonstrated TTA based on the image acquisition model for image segmentation tasks, but it is general for different image recognition tasks, such as image classification, object detection, and regression. For regression tasks where the outputs are not discretized category labels, the variation of the output distribution might be more suitable than entropy for uncertainty estimation. Table 2 shows the superiority of test-time augmentation for better segmentation accuracy, and it also demonstrates that the combination of W-Net in different views helps to improve the performance. This is an ensemble of three networks, and such an ensemble may be used as an alternative for epistemic uncertainty estimation, as demonstrated by [12].

We found that for our tested CNNs and applications, the proper value of the Monte Carlo sample number N that leads the segmentation accuracy to a plateau was around 20–40, so an empirical value of N = 40 is large enough for our datasets. However, the optimal setting of the hyper-parameter N may change for different datasets. Fixing N = 40 for new applications where the optimal value of N is smaller would lead to unnecessary computation and reduced efficiency, while in applications where the object has more spatial variations, the optimal N may be larger than 40. Therefore, for a new application, we suggest that the optimal N be determined by the performance plateau on the validation set.

In conclusion, we analyzed different types of uncertainties for CNN-based medical image segmentation by comparing and combining model (epistemic) and input-based (aleatoric) uncertainties. We formulated a test-time augmentation-based aleatoric uncertainty estimation for medical images that considers the effect of both image noise and spatial transformations. We also proposed a theoretical formulation of test-time augmentation, where we obtain a distribution of the prediction using Monte Carlo simulation and modeling prior distributions of parameters in an image acquisition model. Experiments with 2D and 3D medical image segmentation tasks showed that uncertainty estimation with our formulated TTA helps to reduce the overconfident incorrect predictions encountered by model-based uncertainty estimation, and that TTA leads to higher segmentation accuracy than a single-prediction baseline and multiple predictions using test-time dropout.

Acknowledgments

This work was supported by the Wellcome/EPSRC Centre for Medical Engineering [WT 203148/Z/16/Z], an Innovative Engineering for Health award by the Wellcome Trust (WT101957); the Engineering and Physical Sciences Research Council (EPSRC) (NS/A000027/1, EP/H046410/1, EP/J020990/1, EP/K005278), Wellcome/EPSRC [203145Z/16/Z], the National Institute for Health
Research University College London Hospitals Biomedical Research Centre (NIHR BRC UCLH/UCL), the Royal Society [RG160569], and hardware donated by NVIDIA.

Supplementary material

Supplementary material associated with this article can be found in the online version, at doi:10.1016/j.neucom.2019.01.103.

References

[1] N. Sharma, L.M. Aggarwal, Automated medical image segmentation techniques, J. Med. Phys. 35 (1) (2010) 3–14, doi:10.4103/0971-6203.58777.
[2] D. Withey, Z. Koles, Medical image segmentation: methods and software, in: Noninvasive Functional Source Imaging of the Brain and Heart and the International Conference on Functional Biomedical Imaging, 2007, pp. 140–143.
[3] J.S. Prassni, T. Ropinski, K. Hinrichs, Uncertainty-aware guided volume segmentation, IEEE Trans. Vis. Comput. Graph. 16 (6) (2010) 1358–1365, doi:10.1109/TVCG.2010.208.
[4] G. Wang, W. Li, M.A. Zuluaga, R. Pratt, P.A. Patel, M. Aertsen, T. Doel, A.L. David, J. Deprest, S. Ourselin, T. Vercauteren, Interactive medical image segmentation using deep learning with image-specific fine-tuning, IEEE Trans. Med. Imaging 37 (7) (2018) 1562–1573.
[5] F. Milletari, N. Navab, S.-A. Ahmadi, V-Net: fully convolutional neural networks for volumetric medical image segmentation, in: International Conference on 3D Vision, 2016, pp. 565–571.
[6] A. Abdulkadir, S.S. Lienkamp, T. Brox, O. Ronneberger, 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 424–432.
[7] K. Kamnitsas, C. Ledig, V.F.J. Newcombe, J.P. Simpson, A.D. Kane, D.K. Menon, D. Rueckert, B. Glocker, Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation, Med. Image Anal. 36 (2017) 61–78.
[8] A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (7639) (2017) 115–118, doi:10.1038/nature21056.
[9] P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, B. Yang, K. Zhu, D. Laird, R.L. Ball, C. Langlotz, K. Shpanskaya, M.P. Lungren, A. Ng, MURA dataset: towards radiologist-level abnormality detection in musculoskeletal radiographs, arXiv:1712.06957 (2017).
[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252, doi:10.1007/s11263-015-0816-y.
[11] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision? in: Advances in Neural Information Processing Systems, 2017, pp. 5580–5590.
[12] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Information Processing Systems, 2017, pp. 6405–6416.
[13] Y. Zhu, N. Zabaras, Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification, arXiv:1801.06879 (2018).
[14] M.S. Ayhan, P. Berens, Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks, in: Medical Imaging with Deep Learning, 2018, pp. 1–9.
[15] T. Nair, D. Precup, D.L. Arnold, T. Arbel, Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018, pp. 655–663.
[16] A.G. Roy, S. Conjeti, N. Navab, C. Wachinger, Inherent brain segmentation quality control from fully convnet Monte Carlo sampling, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2018, pp. 664–672.
[17] K. Matsunaga, A. Hamada, A. Minagawa, H. Koga, Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble, arXiv:1703.03108 (2017).
[18] H. Jin, Z. Li, R. Tong, L. Lin, A deep 3D residual CNN for false positive reduction in pulmonary nodule detection, Med. Phys. 45 (5) (2018) 2097–2107, doi:10.1002/mp.128.
[19] A. Saad, G. Hamarneh, T. Möller, Exploration and visualization of segmentation uncertainty using shape and appearance prior information, IEEE Trans. Vis. Comput. Graph. 16 (6) (2010) 1366–1375, doi:10.1109/TVCG.2010.152.
[20] W. Shi, X. Zhuang, R. Wolz, D. Simon, K. Tung, H. Wang, S. Ourselin, P. Edwards, R. Razavi, D. Rueckert, A multi-image graph cut approach for cardiac image segmentation and uncertainty estimation, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Vol. 7085, Springer, Berlin Heidelberg, 2011, pp. 178–187, doi:10.1007/978-3-642-28326-0_18.
[21] S. Sankaran, L. Grady, C.A. Taylor, Fast computation of hemodynamic sensitivity to lumen segmentation uncertainty, IEEE Trans. Med. Imaging 34 (12) (2015) 2562–2571, doi:10.1109/TMI.2015.2445777.
[22] S. Parisot, W. Wells, S. Chemouny, H. Duffau, N. Paragios, Concurrent tumor segmentation and registration with uncertainty-based sparse non-uniform graphs, Med. Image Anal. 18 (4) (2014) 647–659, doi:10.1016/j.media.2014.02.006.
[23] A. Top, G. Hamarneh, R. Abugharbieh, Active learning for interactive 3D image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2011, pp. 603–610.
[24] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning, in: International Conference on Machine Learning, 2016, pp. 1050–1059.
[25] W. Li, G. Wang, L. Fidon, S. Ourselin, M.J. Cardoso, T. Vercauteren, On the compactness, efficiency, and representation of 3D convolutional networks: brain parcellation as a pretext task, in: International Conference on Information Processing in Medical Imaging, 2017, pp. 348–360.
[26] R.M. Neal, Bayesian Learning for Neural Networks, Springer Science & Business Media, 2012.
[27] M. Teye, H. Azizpour, K. Smith, Bayesian uncertainty estimation for batch normalized deep networks, arXiv:1802.06455 (2018).
[28] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, 2011, pp. 1–9.
[29] C. Louizos, M. Welling, Structured and efficient variational deep learning with matrix Gaussian posteriors, in: International Conference on Machine Learning, 2016, pp. 1708–1716.
[30] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[31] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[32] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, K. He, Data distillation: towards omni-supervised learning, arXiv:1712.04440 (2017).
[33] L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, L. Zhang, Image super-resolution: the techniques, applications, and future, Signal Process. 128 (2016) 389–408, doi:10.1016/j.sigpro.2016.05.002.
[34] S. Tourbier, C. Velasco-Annis, V. Taimouri, P. Hagmann, R. Meuli, S.K. Warfield, M. Bach Cuadra, A. Gholipour, Automated template-based brain localization and extraction for fetal brain MRI reconstruction, NeuroImage 155 (2017) 460–472.
[35] M. Ebner, G. Wang, W. Li, M. Aertsen, P.A. Patel, R. Aughwane, A. Melbourne, T. Doel, A.L. David, J. Deprest, S. Ourselin, T. Vercauteren, An automated localization, segmentation and reconstruction framework for fetal brain MRI, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Vol. 10433, 2018, pp. 313–320.
[36] M. Rajchl, M.C.H. Lee, F. Schrans, A. Davidson, J. Passerat-Palmbach, G. Tarroni, A. Alansary, O. Oktay, B. Kainz, D. Rueckert, Learning under distributed weak supervision, arXiv:1606.01100 (2016).
[37] S.S.M. Salehi, D. Erdogmus, A. Gholipour, Auto-context convolutional neural network (auto-net) for brain extraction in magnetic resonance imaging, IEEE Trans. Med. Imaging 36 (11) (2017) 2319–2330.
[38] S.S.M. Salehi, S.R. Hashemi, C. Velasco-Annis, A. Ouaalam, J.A. Estroff, D. Erdogmus, S.K. Warfield, A. Gholipour, Real-time automatic fetal brain extraction in fetal MRI by deep learning, in: IEEE International Symposium on Biomedical Imaging, 2018, pp. 720–724.
[39] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440, doi:10.1109/CVPR.2015.7298965.
[40] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D.G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, TensorFlow: a system for large-scale machine learning, in: USENIX Symposium on Operating Systems Design and Implementation, 2016, pp. 265–284.
[41] E. Gibson, W. Li, C. Sudre, L. Fidon, D.I. Shakir, G. Wang, Z. Eaton-Rosen, R. Gray, T. Doel, Y. Hu, T. Whyntie, P. Nachev, M. Modat, D.C. Barratt, S. Ourselin, M.J. Cardoso, T. Vercauteren, NiftyNet: a deep-learning platform for medical imaging, Comput. Methods Programs Biomed. 158 (2018) 113–122.
[42] B.H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, L. Lanczi, E. Gerstner, M.A. Weber, T. Arbel, B.B. Avants, N. Ayache, P. Buendia, D.L. Collins, N. Cordier, J.J. Corso, A. Criminisi, T. Das, H. Delingette, Ç. Demiralp, C.R. Durst, M. Dojat, S. Doyle, J. Festa, F. Forbes, E. Geremia, B. Glocker, P. Golland, X. Guo, A. Hamamci, K.M. Iftekharuddin, R. Jena, N.M. John, E. Konukoglu, D. Lashkari, J.A. Mariz, R. Meier, S. Pereira, D. Precup, S.J. Price, T.R. Raviv, S.M. Reza, M. Ryan, D. Sarikaya, L. Schwartz, H.C. Shin, J. Shotton, C.A. Silva, N. Sousa, N.K. Subbanna, G. Szekely, T.J. Taylor, O.M. Thomas, N.J. Tustison, G. Unal, F. Vasseur, M. Wintermark, D.H. Ye, L. Zhao, B. Zhao, D. Zikic, M. Prastawa, M. Reyes, K. Van Leemput, The multimodal brain tumor image segmentation benchmark (BRATS), IEEE Trans. Med. Imaging 34 (10) (2015) 1993–2024, doi:10.1109/TMI.2014.2377694.
[43] G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using cascaded anisotropic convolutional neural networks, in: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Springer International Publishing, 2018, pp. 178–190.
[44] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J.S. Kirby, J.B. Freymann, K. Farahani, C. Davatzikos, Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features, Nat. Sci. Data (2017) 170117.
Guotai Wang obtained his Bachelor and Master degrees in Biomedical Engineering from Shanghai Jiao Tong University in 2011 and 2014 respectively. He then obtained his PhD degree in Medical and Biomedical Imaging from University College London in 2018. His research interests include image segmentation, computer vision and deep learning.

Wenqi Li is a Research Associate in the Guided Instrumentation for Fetal Therapy and Surgery (GIFT-Surg) project. His main research interests are in anatomy detection and segmentation for presurgical evaluation and surgical planning. He obtained a BSc degree in Computer Science from the University of Science and Technology Beijing in 2010, and then an MSc degree in Computing with Vision and Imaging from the University of Dundee in 2011. In 2015, he completed his PhD in the Computer Vision and Image Processing group at the University of Dundee.

Jan Deprest is a Professor of Obstetrics and Gynaecology at the Katholieke Universiteit Leuven and Consultant Obstetrician Gynaecologist at the University Hospitals Leuven (Belgium). He is currently the academic chair of his department and the director of the Centre for Surgical Technologies at the Faculty of Medicine. He established the Eurofoetus consortium, which is dedicated to the development of instruments and techniques for minimally invasive fetal and placental surgery.

Sébastien Ourselin is Head of the School of Biomedical Engineering & Imaging Sciences and Professor of Healthcare Engineering at King's College London. His core skills are in medical image analysis, software engineering, and translational medicine. He is best known for his work on image registration and segmentation, its exploitation for robust image-based biomarkers in neurological conditions, as well as for his development of image-guided surgery systems.