W-Net: A Deep Model for Fully Unsupervised Image Segmentation
Abstract
[...] normalized cut loss function on the encoding layer. In order to achieve state-of-the-art results, we further postprocess this initial segmentation in two steps: we first apply a fully connected conditional random field (CRF) [20, 6] to smooth the outputted segments, and we second apply the hierarchical merging method of [2] to obtain a final segmentation, as shown in Figure 1.

We test our method on the Berkeley Segmentation Data set benchmark. We follow standard benchmarking practices and compute the segmentation covering (SC), probabilistic Rand index (PRI), and variation of information (VI) metrics of our segmentations as well as of the segmentations of several existing algorithms. We compare favorably with a number of classical and recent segmentation approaches, and even approach human-level performance in some cases: for example, our algorithm achieves 0.86 PRI versus 0.87 by humans, in the optimal image scale setting. We further show several examples of segments produced by our algorithm as well as by some competing methods.

The rest of this paper is organized as follows. We first review related work in Section 2. The architecture of our network is described in Section 3. Section 4 presents the detailed procedure of the post-processing method, and experimental results are demonstrated in Section 5. Section 6 discusses the conclusions that have been drawn.
2. Related Work

We briefly discuss related work on segmentation, convolutional networks, and autoencoders.

2.1. Unsupervised Segmentation

Most approaches to unsupervised image segmentation involve utilizing features such as color, brightness, or texture over local patches, and then clustering pixels based on these features. Among these schemes, the three most widely used methods are Felzenszwalb and Huttenlocher's graph-based method [14], Shi and Malik's Normalized Cuts [9, 29], and Comaniciu and Meer's Mean Shift [8]. Arbelaez et al. [1, 2] proposed a method based on edge detection that has been shown to outperform the classical methods. More recently, [26] proposed a unified approach for bottom-up multi-scale hierarchical image segmentation. In this paper, we adopt the hierarchical grouping algorithm described in [2] for postprocessing after we get an initial segmentation prediction from the W-Net encoder.

2.2. Deep CNNs in Semantic Segmentation

Deep neural networks have emerged as a key component in many visual recognition problems, including supervised learning for semantic image segmentation. [22, 13, 10, 15, 16] all make pixel-wise annotations for segmentation based on supervised classification using deep networks. Fully convolutional networks (FCNs) [21] have emerged as one of the most effective models for the semantic segmentation problem. In an FCN, the fully connected layers of a standard convolutional neural network (CNN) are transformed into convolution layers with kernels that cover the entire input region. By utilizing fully convolutional layers, the network can take an input of arbitrary size and produce a correspondingly-sized output map; for example, one can produce a pixelwise prediction for images of arbitrary size. Recently a number of variants of FCN that perform semantic segmentation have been proposed and studied [24, 3, 5, 25, 27, 20, 32]. In [20], a conditional random field (CRF) is applied to the output map to fine-tune the segmentation. [32] formulates a mean-field approximate inference for the CRF as a Recurrent Neural Network (CRF-RNN) and then jointly optimizes both the CRF energy and the supervised loss. [27] presents a U-shaped architecture consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization. In this paper, we modify and extend the architecture described in [27] to a W-shaped network such that it reconstructs the original input images and also predicts a segmentation map without any labeling information.

2.3. Encoder-decoders

Encoder-decoders are one of the most widely known and used methods in unsupervised feature learning [17, 18]. The encoder Enc maps the input (e.g., an image patch) to a compact feature representation, and the decoder Dec then reproduces the input from this lower-dimensional representation. In this paper, we design an encoder such that the input is mapped to a dense pixelwise segmentation layer with the same spatial size, rather than to a low-dimensional space. The decoder then performs a reconstruction from this dense prediction layer.

3. Network Architecture

The network architecture is illustrated in Figure 2. It is divided into a UEnc (left side) and a corresponding UDec (right side); in particular, we modify and extend the typical U-shaped architecture of the U-Net described in [27] to a W-shaped architecture such that it reconstructs original input images as well as predicts the segmentation maps without any labeling information. The W-Net architecture has 46 convolutional layers which are structured into 18 modules marked with the red rectangles. Each module consists of two 3 × 3 convolutional layers, each followed by a ReLU [23] non-linearity and batch normalization [19]. The first nine modules form the dense prediction base of the network and the second nine correspond to the reconstruction decoder.
Figure 2. W-Net architecture. The W-Net architecture consists of a UEnc (left side) and a corresponding UDec (right side). It has 46 convolutional layers which are structured into 18 modules marked with the red rectangles. Each module consists of two 3 × 3 convolutional layers. The first nine modules form the dense prediction base of the network and the second nine correspond to the reconstruction decoder.
The UEnc consists of a contracting path (the first half) to capture context and a corresponding expansive path (the second half) that enables precise localization, as in the original U-Net architecture. The contracting path starts with an initial module which performs convolution on input images. In the figure, the output sizes are reported for an example input image resolution of 224 × 224. Modules are connected via 2 × 2 max-pooling layers, and we double the number of feature channels at each downsampling step. In the expansive path, modules are connected via transposed 2D convolution layers. We halve the number of feature channels at each upsampling step. As in the U-Net model, the input of each module in the contracting path is also bypassed to the output of its corresponding module in the expansive path, to recover spatial information lost to downsampling. The final convolutional layer of the UEnc is a 1 × 1 convolution followed by a softmax layer. The 1 × 1 convolution maps each 64-component feature vector to the desired number of classes K, and the softmax layer then rescales the result so that the elements of the K-dimensional output lie in the range (0, 1) and sum to 1. The architecture of the UDec is similar to that of the UEnc, except that it reads the output of the UEnc, which has size 224 × 224 × K. The final convolutional layer of the UDec is a 1 × 1 convolution that maps each 64-component feature vector back to a reconstruction of the original input.

One important modification in our architecture is that all of the modules use the depthwise separable convolution layers introduced in [7], except modules 1, 9, 10, and 18. A depthwise separable convolution operation consists of a depthwise convolution and a pointwise convolution. The idea behind such an operation is to examine spatial correlations and cross-channel correlations independently: a depthwise convolution performs spatial convolutions independently over each channel, and a pointwise convolution then projects the feature channels produced by the depthwise convolution onto a new channel space. As a consequence, the network gains performance more efficiently for the same number of parameters. In Figure 2, blue arrows represent convolution layers and red arrows indicate depthwise separable convolutions. The network does not include any fully connected layers, which allows it to process arbitrarily large images and make a segmentation prediction of the corresponding size.
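To make these building blocks concrete, the following is a minimal sketch in PyTorch (our framework choice for illustration; the paper does not prescribe an implementation, and the class names and the example value of K below are ours, not the authors'):

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution [7]: a 3x3 spatial convolution
    applied independently to each channel (groups=in_ch), followed by
    a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class WNetModule(nn.Module):
    """One of the 18 modules: two 3x3 convolutions, each followed by a
    ReLU [23] and batch normalization [19]. Modules 1, 9, 10, and 18
    use plain convolutions (separable=False); the rest are separable."""
    def __init__(self, in_ch, out_ch, separable=True):
        super().__init__()
        conv = SeparableConv2d if separable else \
            (lambda i, o: nn.Conv2d(i, o, 3, padding=1))
        self.body = nn.Sequential(
            conv(in_ch, out_ch), nn.ReLU(inplace=True), nn.BatchNorm2d(out_ch),
            conv(out_ch, out_ch), nn.ReLU(inplace=True), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.body(x)

K = 20  # number of segmentation classes; an arbitrary example value
# Readout of UEnc: a 1x1 convolution to K classes, then a pixelwise softmax
head = nn.Sequential(nn.Conv2d(64, K, 1), nn.Softmax(dim=1))
```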
3.1. Soft Normalized Cut Loss

The output of the UEnc is a normalized 224 × 224 × K dense prediction. By taking the argmax, we can obtain a K-class prediction for each pixel. In this paper, we compute
the normalized cut (Ncut) [29] as a global criterion for the segmentation:

$$ \mathrm{Ncut}_K(V) = \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, V - A_k)}{\mathrm{assoc}(A_k, V)} = \sum_{k=1}^{K} \frac{\sum_{u \in A_k,\, v \in V - A_k} w(u, v)}{\sum_{u \in A_k,\, t \in V} w(u, t)}, \tag{1} $$

where $A_k$ is the set of pixels in segment k, V is the set of all pixels, and w measures the weight between two pixels.

However, since the argmax function is non-differentiable, it is impossible to calculate the corresponding gradient during backpropagation. Instead, we define a soft version of the Ncut loss which is differentiable, so that we can update gradients during backpropagation:

$$ \begin{aligned} J_{\mathrm{soft\text{-}Ncut}}(V, K) &= \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, V - A_k)}{\mathrm{assoc}(A_k, V)} = K - \sum_{k=1}^{K} \frac{\mathrm{assoc}(A_k, A_k)}{\mathrm{assoc}(A_k, V)} \\ &= K - \sum_{k=1}^{K} \frac{\sum_{u \in V,\, v \in V} w(u, v)\, p(u = A_k)\, p(v = A_k)}{\sum_{u \in V,\, t \in V} w(u, t)\, p(u = A_k)} \\ &= K - \sum_{k=1}^{K} \frac{\sum_{u \in V} p(u = A_k) \sum_{v \in V} w(u, v)\, p(v = A_k)}{\sum_{u \in V} p(u = A_k) \sum_{t \in V} w(u, t)}, \end{aligned} \tag{2} $$

where $p(u = A_k)$ measures the probability of node u belonging to class $A_k$, which is directly computed by the encoder. By training UEnc to minimize the J_soft-Ncut loss, we can simultaneously minimize the total normalized disassociation between the groups and maximize the total normalized association within the groups.
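For concreteness, Eq. (2) can be computed in a few lines. The sketch below (in PyTorch, our illustrative choice, not the authors' code) assumes the affinity matrix w(u, v) is given; a construction of it is sketched later, where σ_I, σ_X, and the radius r are specified. The matrix is dense here for readability, though a sparse matrix is needed at full resolution:

```python
import torch

def soft_ncut_loss(probs: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Soft N-cut loss of Eq. (2).

    probs:   (N, K) softmax output of UEnc, one row p(u = A_k) per pixel u.
    weights: (N, N) affinity matrix w(u, v) between pixels.
    """
    K = probs.shape[1]
    degree = weights.sum(dim=1)                     # d(u) = sum_t w(u, t)
    assoc = (probs * (weights @ probs)).sum(dim=0)  # sum_{u,v} w(u,v) p(u=A_k) p(v=A_k)
    denom = probs.t() @ degree                      # sum_u p(u = A_k) d(u)
    return K - (assoc / denom).sum()                # Eq. (2)
```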
3.2. Reconstruction Loss

As in the classical encoder-decoder architecture, we also train the W-Net to minimize the reconstruction loss, to enforce that the encoded representations contain as much information of the original inputs as possible. In this paper, by minimizing the reconstruction loss, we can make the segmentation prediction align better with the input images. The reconstruction loss is given by

$$ J_{\mathrm{reconstr}} = \left\| X - U_{Dec}\big(U_{Enc}(X; W_{Enc}); W_{Dec}\big) \right\|_2^2, \tag{3} $$

where $W_{Enc}$ denotes the parameters of the encoder, $W_{Dec}$ denotes the parameters of the decoder, and X is the input image. We train W-Net to minimize J_reconstr between the reconstructed images and the original inputs. We simultaneously train UEnc to minimize J_soft-Ncut in order to maximize the association within segments and minimize the disassociation between the segments in the encoding layer. The procedure is formally presented in Algorithm 1. By iteratively applying J_reconstr and J_soft-Ncut, the network balances the trade-off between the accuracy of reconstruction and the consistency in the encoded representation layer.

Algorithm 1 Minibatch stochastic gradient descent training of W-Net
1: procedure W-NET(X; UEnc, UDec)
2:   for number of training iterations do
3:     Sample a minibatch of new input images x
4:     Update UEnc by minimizing J_soft-Ncut    ▷ Only update UEnc
5:     Update the whole W-Net by minimizing J_reconstr    ▷ Update both UEnc and UDec
6:   return UEnc
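A sketch of Algorithm 1 in the same illustrative PyTorch style follows; it reuses soft_ncut_loss from the sketch above, assumes u_enc, u_dec, loader, and weights_fn exist (weights_fn(x) returning the affinity matrix for a single image), and the optimizer settings are placeholders rather than the paper's hyperparameters:

```python
import torch

def train_wnet(u_enc, u_dec, loader, weights_fn, lr=0.01):
    """Minibatch SGD training of W-Net, alternating the two updates of
    Algorithm 1 on every minibatch."""
    opt_enc = torch.optim.SGD(u_enc.parameters(), lr=lr)
    opt_all = torch.optim.SGD(
        list(u_enc.parameters()) + list(u_dec.parameters()), lr=lr)
    for x in loader:                                 # minibatch of input images
        # Step 1: update UEnc alone by minimizing J_soft-Ncut (Eq. 2)
        probs = u_enc(x)                             # (B, K, H, W) softmax output
        b, k = probs.shape[:2]
        p = probs.reshape(b, k, -1).transpose(1, 2)  # (B, N, K), one row per pixel
        j_ncut = sum(soft_ncut_loss(p[i], weights_fn(x[i])) for i in range(b))
        opt_enc.zero_grad(); j_ncut.backward(); opt_enc.step()

        # Step 2: update the whole W-Net by minimizing J_reconstr (Eq. 3)
        recon = u_dec(u_enc(x))
        j_rec = ((x - recon) ** 2).sum()
        opt_all.zero_grad(); j_rec.backward(); opt_all.step()
    return u_enc
```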
4. Postprocessing

After obtaining an initial segmentation from the encoder, we perform two postprocessing steps in order to obtain our final result. Below we describe these steps, namely CRF smoothing and hierarchical merging.

4.1. Fully-Connected Conditional Random Fields for Accurate Edge Recovery

While deep CNNs with max-pooling layers have proven successful at capturing high-level feature information of inputs, the increased invariance and large receptive fields can reduce localization accuracy. A lack of smoothness constraints can result in poor object delineation, especially in pixel-level labeling tasks. Although the soft normalized cut loss and the skip layers in the W-Net already help to improve the localization of object boundaries, we find that combining the responses at the final UEnc layer with a fully connected Conditional Random Field (CRF) model [6] yields segmentations with finer-grained boundaries. The fully connected CRF model employs the energy function

$$ E(X) = \sum_{u} \Phi(u) + \sum_{u, v} \Psi(u, v), \tag{4} $$

where u, v are pixels on the input data X. The unary potential is $\Phi(u) = -\log p(u)$, where p(u) is the label annotation probability computed by the softmax layer in UEnc. The pairwise potential $\Psi(u, v)$ measures the weighted penalties when two pixels are assigned different labels, using two Gaussian kernels in different feature spaces.

Figure 3 presents an example of the prediction before and after the fully connected CRF model. The output of the softmax layer in the fully convolutional encoder UEnc [...]
Figure 3. Belief map (output of softmax function) before and
after a fully connected CRF model. (a) Original image, (b) the
responses at the final UEnc layer, (c) the output of the fully con-
nected CRF.
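One way to realize this refinement step is with the publicly available pydensecrf implementation of [20]; the sketch below is ours, and the kernel parameters are illustrative defaults rather than the values used in the paper:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=10):
    """Fully connected CRF smoothing of the UEnc prediction.

    image: (H, W, 3) uint8 RGB input.
    probs: (K, H, W) float32 softmax output of UEnc.
    """
    k, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, k)
    d.setUnaryEnergy(unary_from_softmax(probs))  # Phi(u) = -log p(u)
    # Two Gaussian pairwise kernels: one over positions only, and a
    # bilateral one over positions and RGB values (the Psi(u, v) of Eq. 4)
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = np.array(d.inference(n_iters)).reshape(k, h, w)
    return q.argmax(axis=0)  # refined pixel-wise label map
```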
4.2. Hierarchical Merging

[...]

$$ gPb(x, y, \theta) = \sum_{s} \sum_{i} \beta_{i,s}\, G_{i, \sigma(s)}(x, y, \theta) + \gamma \cdot sPb(x, y, \theta), \tag{5} $$

where s indexes scales, i indexes feature channels including brightness, color, and texture, and $G_{i,\sigma(s)}(x, y, \theta)$ measures the dissimilarity between the two halves of a disc of radius σ(s) centered at (x, y) in channel i at angle θ. The mPb signal measures all the edges in the image and the sPb signal captures the most salient curves in the image. Figure 4 (c) shows the corresponding weighted boundary maps of the initial boundaries produced by W-Net-CRF in Figure 4 (b).
We then build a hierarchical segmentation from these weighted boundaries by using the contours2ucm stage described in [1, 2]. This algorithm constructs a hierarchy of segments from contour detections. It has two steps: an Oriented Watershed Transform (OWT) to build initial over-segmented regions, and an Ultrametric Contour Map (UCM), which is a greedy graph-based region merging algorithm.

Figure 4. The initial partitions for the hierarchical merging and the corresponding weighted boundary maps. (a) Original inputs. (b) The output (argmax) of the fully connected CRF with boundaries shown in red lines. (c) The corresponding weighted boundary maps of the red lines in (b).

5. Experiments

We train our proposed W-Net on the PASCAL VOC2012 dataset [12] and then evaluate the trained network using the Berkeley Segmentation Database (BSDS300 and BSDS500). The PASCAL VOC2012 dataset is a large visual object classes challenge which contains 11,530 images and 6,929 segmentations. BSDS300 and BSDS500 [...]
Algorithm 2 Post-processing
1: procedure POSTPROCESSING(x; UEnc, CRF, Pb)
2:   x′ = UEnc(x)    ▷ Get the hidden representation of x
3:   x″ = CRF(x′)    ▷ Fine-grained boundaries with a fully connected CRF
4:   x‴ = Pb(x″)    ▷ Compute the probability of boundary only on the edges detected in x″
5:   S = contours2ucm(x‴)    ▷ Hierarchical segmentation
6:   return S

Figure 5. J_reconstr and J_soft-Ncut losses in the training phase. Left: Reconstruction losses during training (red: training without J_soft-Ncut; blue: training with J_soft-Ncut). Right: Soft N-cut loss during training.
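Glued together in Python, the pipeline of Algorithm 2 looks as follows; gpb and contours2ucm are hypothetical stand-ins for the multiscale boundary detector of Eq. (5) and the OWT-UCM stage of [1, 2], not calls into a real library:

```python
def postprocess(x, u_enc, crf_refine, gpb, contours2ucm):
    """Sketch of Algorithm 2 (gpb and contours2ucm are placeholders)."""
    x1 = u_enc(x)            # hidden K-way representation of x
    x2 = crf_refine(x, x1)   # fine-grained boundaries with a fully connected CRF
    x3 = gpb(x2)             # probability of boundary, only on edges detected in x2
    return contours2ucm(x3)  # hierarchical segmentation
```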
[...] Following [29], the weights between pixels are given by

$$ w_{ij} = e^{-\frac{\| F(i) - F(j) \|_2^2}{\sigma_I^2}} \times \begin{cases} e^{-\frac{\| X(i) - X(j) \|_2^2}{\sigma_X^2}} & \text{if } \| X(i) - X(j) \|_2 < r \\ 0 & \text{otherwise,} \end{cases} $$

where X(i) and F(i) are the spatial location and pixel value of node i, respectively. We set σ_I = 10, σ_X = 4, and r = 5.
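A sparse construction of these weights, here for a single-channel image (an assumption of ours for brevity; F(i) can equally be a color vector), might look like:

```python
import numpy as np
from scipy import sparse

def ncut_weights(image, sigma_i=10.0, sigma_x=4.0, r=5):
    """Pixel affinities w_ij above: a Gaussian in pixel value times a
    Gaussian in position, truncated to zero beyond radius r."""
    h, w = image.shape
    flat = image.reshape(-1).astype(np.float64)
    rows, cols, vals = [], [], []
    for dy in range(-r + 1, r):
        for dx in range(-r + 1, r):
            d2 = dx * dx + dy * dy
            if d2 == 0 or d2 >= r * r:  # skip self-pairs and pairs beyond r
                continue
            # all pixels (y, x) whose neighbor (y+dy, x+dx) is inside the image
            ys, xs = np.mgrid[max(0, -dy):min(h, h - dy),
                              max(0, -dx):min(w, w - dx)]
            u = (ys * w + xs).ravel()
            v = ((ys + dy) * w + (xs + dx)).ravel()
            feat = np.exp(-(flat[u] - flat[v]) ** 2 / sigma_i ** 2)
            vals.append(feat * np.exp(-d2 / sigma_x ** 2))
            rows.append(u); cols.append(v)
    return sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(h * w, h * w)).tocsr()
```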
The plots of the J_reconstr and J_soft-Ncut losses during training are shown in Figure 5. We examine the J_reconstr loss with and without considering J_soft-Ncut during training. From Figure 5, we can see that J_reconstr converges faster when J_soft-Ncut is not considered. When we add J_soft-Ncut during training, J_reconstr decreases more slowly and less stably. At convergence of J_reconstr, the blue line (training with J_soft-Ncut) is still higher than the red one (training without); this is because the hidden representation space is forced to be more consistent with a good segmentation of the image when the J_soft-Ncut loss is introduced, so its ability to reconstruct the original images is weakened. Finally, both J_reconstr and J_soft-Ncut converge, which means our approach balances the trade-off between minimizing the reconstruction loss in the last layer and maximizing the total association within the groups in the hidden layer.

Figure 6. A comparison with and without considering the J_soft-Ncut loss during back-propagation. (a) Original image. (b) Visualization of the output of the final softmax layer in UEnc without adding the J_soft-Ncut loss during back-propagation. (c) Visualization of the output of the final softmax layer in UEnc when adding the J_soft-Ncut loss during back-propagation. (d) The corresponding reconstructed image of (b). (e) The corresponding reconstructed image of (c).

Figure 6 illustrates the comparison with and without considering the J_soft-Ncut loss during back-propagation. In order to better visualize the output of the softmax layer in UEnc, we take the argmax over the prediction and use different colors to visualize the different pixel-wise labels. We can see that the pixel-wise prediction is smoothed when we consider J_soft-Ncut during back-propagation. When we remove the J_soft-Ncut loss from the W-Net, the model becomes a regular fully convolutional encoder-decoder which makes a high-quality reconstruction; however, the output of the softmax layer is noisier and more discrete. On the other hand, by adding the J_soft-Ncut [...]
Figure 7. Results of hierarchical segmentation using the output of W-Net with CRF smoothing as initial boundaries, on the BSDS500. From top to bottom: Original image, the initial over-segmented partitions shown in red lines obtained by the fully connected CRF, segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).

Figure 8. Results of hierarchical segmentation using the combination of the output of W-Net with CRF smoothing and UCM as the initial boundaries, on the BSDS500. From top to bottom: Original image, the initial over-segmented partitions shown in red lines, segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Method               SC            PRI           VI
                     ODS   OIS     ODS   OIS     ODS   OIS
Quad Tree            0.33  0.39    0.71  0.75    2.34  2.22
Chan Vese [4]        0.49  -       0.75  -       2.54  -
NCuts [9]            0.44  0.53    0.75  0.79    2.18  1.84
SWA [28]             0.47  0.55    0.75  0.80    2.06  1.75
Canny-owt-ucm [2]    0.48  0.56    0.77  0.82    2.11  1.81
Felz-Hutt [14]       0.51  0.58    0.77  0.82    2.15  1.79
Mean Shift [8]       0.54  0.58    0.78  0.80    1.83  1.63
Taylor [30]          0.56  0.62    0.79  0.84    1.74  1.63
W-Net (ours)         0.58  0.62    0.81  0.84    1.71  1.53
gPb-owt-ucm [2]      0.59  0.65    0.81  0.85    1.65  1.47
W-Net+ucm (ours)     0.60  0.65    0.82  0.86    1.63  1.45
Human                0.73  0.73    0.87  0.87    1.16  1.16

Table 1. Results on BSDS300. The values are reproduced from the tables in [30].

Method               SC            PRI           VI
                     ODS   OIS     ODS   OIS     ODS   OIS
NCuts [9]            0.45  0.53    0.78  0.80    2.23  1.89
Canny-owt-ucm [2]    0.49  0.55    0.79  0.83    2.19  1.89
Felz-Hutt [14]       0.52  0.57    0.80  0.82    2.21  1.87
Mean Shift [8]       0.54  0.58    0.79  0.81    1.85  1.64
Taylor [30]          0.56  0.62    0.81  0.85    1.78  1.56
W-Net (ours)         0.57  0.62    0.81  0.84    1.76  1.60
gPb-owt-ucm [2]      0.59  0.65    0.83  0.86    1.69  1.48
DC-Seg-full [11]     0.59  0.64    0.82  0.85    1.68  1.54
W-Net+ucm (ours)     0.59  0.64    0.82  0.85    1.67  1.47
Human                0.72  0.72    0.88  0.88    1.17  1.17

Table 2. Results on BSDS500. The values are reproduced from the tables in [30] and [11].

Table 1 and Table 2 summarize the performance of the proposed method, noted as W-Net, on BSDS300 and BSDS500 respectively. Since the UEnc encoder followed by a fully connected CRF provides an initial boundary detection, we compute the multi-scale local cues only on the detected edges instead of on the whole input image. As can be seen, our proposed approach has competitive performance compared to the computationally demanding gPb-owt-ucm method. We further consider combining the boundaries produced by our W-Net model after CRF smoothing with the ultrametric contour map produced by the gPb-owt-ucm method before applying the final postprocessing step; we denote this variant as W-Net+ucm. We can see that with this variant, our results outperform the other methods.

Figure 7 illustrates results of running the proposed method W-Net on images from the BSDS500. The first row shows the original inputs; the second row shows the initial boundary detections produced by the UEnc encoder followed by a fully connected CRF. The third and fourth rows show the ultrametric contour maps produced by the contours2ucm stage at the ODS and OIS respectively. Figure 8 illustrates results of running W-Net+ucm on images from the BSDS500.
6. Conclusion

[...]
7. Appendix

We show additional results of running the proposed method W-Net on images from the BSDS500 in Figure 9 and Figure 10. Further, Figure 11 and Figure 12 illustrate more results of running W-Net+ucm on images from the BSDS500.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2294–2301. IEEE, 2009.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[4] L. Bertelli, B. Sumengen, B. Manjunath, and F. Gibou. A variational framework for multiregion pairwise-similarity-based image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1400–1414, 2008.
[5] A. Chaurasia and E. Culurciello. LinkNet: Exploiting encoder representations for efficient semantic segmentation. arXiv preprint arXiv:1707.03718, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[9] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graph decomposition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1124–1131. IEEE, 2005.
[10] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3992–4000, 2015.
[11] M. Donoser and D. Schmalstieg. Discrete-continuous gradient orientation estimation for faster image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3158–3165, 2014.
[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[13] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
[14] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision (ECCV), pages 297–312. Springer, 2014.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 447–456, 2015.
[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[18] F. J. Huang, Y.-L. Boureau, Y. LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. pages 1–8, 2007.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
[20] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
[22] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3376–3385, 2015.
[23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
[24] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015.
[25] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
[26] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
[27] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
[28] E. Sharon, M. Galun, D. Sharon, R. Basri, and A. Brandt. Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104):810–813, 2006.
[29] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[30] C. J. Taylor. Towards fast and accurate segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1916–1922, 2013.
[31] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.
[32] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In IEEE International Conference on Computer Vision (ICCV), pages 1529–1537, 2015.
Figure 9. Results of hierarchical segmentation using the output of W-Net with CRF smoothing as initial boundaries, on the BSDS500. From top to bottom: Original image, the initial over-segmented partitions shown in red lines obtained by the fully connected CRF, segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Figure 10. Results of hierarchical segmentation using the output of W-Net with CRF smoothing as initial boundaries, on the BSDS500. From top to bottom: Original image, the initial over-segmented partitions shown in red lines obtained by the fully connected CRF, segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Figure 11. Results of hierarchical segmentation using the combination of the output of W-Net with CRF smoothing and UCM as the initial boundaries, on the BSDS500. From top to bottom: Original image, the initial over-segmented partitions shown in red lines, segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Figure 12. Results of hierarchical segmentation using the combination of the output of W-Net with CRF smoothing and UCM as the initial boundaries, on the BSDS500. From top to bottom: Original image, the initial over-segmented partitions shown in red lines, segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).