W-Net: A Deep Model for Fully Unsupervised Image Segmentation

Xide Xia, Boston University, [email protected]
Brian Kulis, Boston University, [email protected]
arXiv:1711.08506v1 [cs.CV] 22 Nov 2017

Abstract

While significant attention has been recently focused on designing supervised deep semantic segmentation algorithms for vision tasks, there are many domains in which sufficient supervised pixel-level labels are difficult to obtain. In this paper, we revisit the problem of purely unsupervised image segmentation and propose a novel deep architecture for this problem. We borrow recent ideas from supervised semantic segmentation methods, in particular by concatenating two fully convolutional networks together into an autoencoder—one for encoding and one for decoding. The encoding layer produces a k-way pixelwise prediction, and both the reconstruction error of the autoencoder as well as the normalized cut produced by the encoder are jointly minimized during training. When combined with suitable postprocessing involving conditional random field smoothing and hierarchical segmentation, our resulting algorithm achieves impressive results on the benchmark Berkeley Segmentation Data Set, outperforming a number of competing methods.

Figure 1. Overview of our approach. A fully convolutional network encoder produces a segmentation. This segmentation is fed into a fully convolutional network decoder to produce a reconstruction, and training jointly minimizes the normalized cut of the encoded segmentation and the reconstruction error of the image. The encoded image is then post-processed to produce the final segmentation.

1. Introduction

The image segmentation problem is a core vision problem with a longstanding history of research. Historically, this problem has been studied in the unsupervised setting as a clustering problem: given an image, produce a pixelwise prediction that segments the image into coherent clusters corresponding to objects in the image. In classical computer vision, there are a number of well-known techniques for this problem, including normalized cuts [9, 29], Markov random field-based methods [31], mean shift [8], hierarchical methods [2], and many others.

Given the recent success of deep learning within the computer vision field, there has been a resurgence in interest in the image segmentation problem. The vast majority of recent work in this area has been focused on the problem of semantic segmentation [3, 5, 25, 27, 20, 32], a supervised variant of the image segmentation problem. Typically, these methods are trained using models such as fully convolutional networks to produce a pixelwise prediction, and supervised training methods can then be employed to learn filters that produce segments on novel images. One such popular recent approach is the U-Net architecture [27], a fully convolutional network that has been used to achieve impressive results in the biomedical image domain. Unfortunately, existing semantic segmentation methods require a significant amount of pixelwise labeled training data, which can be difficult to collect on novel domains.

Given the importance of the segmentation problem in many domains, and due to the lack of supervised data for many problems, we revisit the problem of unsupervised image segmentation, utilizing recent ideas from semantic segmentation. In particular, we design a new architecture, which we call W-Net, that ties two fully convolutional network (FCN) architectures (each similar to the U-Net architecture) together into a single autoencoder. The first FCN encodes an input image, using fully convolutional layers, into a k-way soft segmentation. The second FCN reverses this process, going from the segmentation layer back to a reconstructed image. We jointly minimize both the reconstruction error of the autoencoder as well as a "soft" normalized cut loss function on the encoding layer.
In order to achieve state-of-the-art results, we further postprocess this initial segmentation in two steps: we first apply fully connected conditional random field (CRF) [20, 6] smoothing to the outputted segments, and we second apply the hierarchical merging method of [2] to obtain a final segmentation, as shown in Figure 1.

We test our method on the Berkeley Segmentation Data Set benchmark. We follow standard benchmarking practices and compute the segmentation covering (SC), probabilistic Rand index (PRI), and variation of information (VI) metrics of our segmentations, as well as of the segmentations of several existing algorithms. We compare favorably with a number of classical and recent segmentation approaches, and even approach human-level performance in some cases—for example, our algorithm achieves 0.86 PRI versus 0.87 by humans, in the optimal image scale setting. We further show several examples of segments produced by our algorithm as well as by some competing methods.

The rest of this paper is organized as follows. We first review related work in Section 2. The architecture of our network is described in Section 3. Section 4 presents the detailed procedure of the post-processing method, and experimental results are demonstrated in Section 5. Section 6 discusses the conclusions that have been drawn.

2. Related Work

We briefly discuss related work on segmentation, convolutional networks, and autoencoders.

2.1. Unsupervised Segmentation

Most approaches to unsupervised image segmentation involve computing features such as color, brightness, or texture over local patches, and then clustering pixels based on these features. Among these schemes, the three most widely used methods are Felzenszwalb and Huttenlocher's graph-based method [14], Shi and Malik's Normalized Cuts [9, 29], and Comaniciu and Meer's Mean Shift [8]. Arbelaez et al. [1, 2] proposed a method based on edge detection that has been shown to outperform the classical methods. More recently, [26] proposed a unified approach for bottom-up multi-scale hierarchical image segmentation. In this paper, we adopt the hierarchical grouping algorithm described in [2] for postprocessing after we obtain an initial segmentation prediction from the W-Net encoder.

2.2. Deep CNNs in Semantic Segmentation

Deep neural networks have emerged as a key component in many visual recognition problems, including supervised learning for semantic image segmentation. [22, 13, 10, 15, 16] all make pixel-wise annotations for segmentation based on supervised classification using deep networks.

Fully convolutional networks (FCNs) [21] have emerged as one of the most effective models for the semantic segmentation problem. In an FCN, the fully connected layers of standard convolutional neural networks (CNNs) are transformed into convolution layers with kernels that cover the entire input region. By using only convolutional layers, the network can take an input of arbitrary size and produce a correspondingly-sized output map; for example, one can produce a pixelwise prediction for images of arbitrary size. Recently a number of variants of FCN have been proposed and studied that perform semantic segmentation [24, 3, 5, 25, 27, 20, 32]. In [20], a conditional random field (CRF) is applied to the output map to fine-tune the segmentation. [32] formulates mean-field approximate inference for the CRF as a recurrent neural network (CRF-RNN), and then jointly optimizes both the CRF energy and the supervised loss. [27] presents a U-shaped architecture consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization. In this paper, we modify and extend the architecture described in [27] to a W-shaped network such that it reconstructs the original input images and also predicts a segmentation map without any labeling information.

2.3. Encoder-decoders

Encoder-decoders are among the most widely known and used methods in unsupervised feature learning [17, 18]. The encoder Enc maps the input (e.g., an image patch) to a compact feature representation, and the decoder Dec then reproduces the input from its lower-dimensional representation. In this paper, we design an encoder such that the input is mapped to a dense pixelwise segmentation layer of the same spatial size, rather than to a low-dimensional space. The decoder then performs a reconstruction from this dense prediction layer.

3. Network Architecture

The network architecture is illustrated in Figure 2. It is divided into a UEnc (left side) and a corresponding UDec (right side); in particular, we modify and extend the typical U-shaped architecture of the U-Net described in [27] to a W-shaped architecture such that it reconstructs the original input images as well as predicting segmentation maps without any labeling information. The W-Net architecture has 46 convolutional layers, structured into 18 modules marked with red rectangles. Each module consists of two 3 × 3 convolutional layers, each followed by a ReLU [23] non-linearity and batch normalization [19]. The first nine modules form the dense prediction base of the network, and the second nine correspond to the reconstruction decoder.

Figure 2. W-Net architecture. The W-Net architecture consists of a UEnc (left side) and a corresponding UDec (right side). It has 46 convolutional layers, structured into 18 modules marked with red rectangles. Each module consists of two 3 × 3 convolutional layers. The first nine modules form the dense prediction base of the network and the second nine correspond to the reconstruction decoder.

The UEnc consists of a contracting path (the first half) to capture context and a corresponding expansive path (the second half) that enables precise localization, as in the original U-Net architecture.

The contracting path starts with an initial module which performs convolution on the input images. In the figure, the output sizes are reported for an example input image resolution of 224 × 224. Modules are connected via 2 × 2 max-pooling layers, and we double the number of feature channels at each downsampling step. In the expansive path, modules are connected via transposed 2D convolution layers, and we halve the number of feature channels at each upsampling step. As in the U-Net model, the input of each module in the contracting path is also bypassed to the output of its corresponding module in the expansive path, to recover spatial information lost to downsampling. The final convolutional layer of the UEnc is a 1 × 1 convolution followed by a softmax layer. The 1 × 1 convolution maps each 64-component feature vector to the desired number of classes K, and the softmax layer then rescales the elements of the K-dimensional output so that they lie in the range (0, 1) and sum to 1. The architecture of the UDec is similar to that of the UEnc, except that it reads the output of the UEnc, which has size 224 × 224 × K. The final convolutional layer of the UDec is a 1 × 1 convolution that maps each 64-component feature vector back to a reconstruction of the original input.

One important modification in our architecture is that all of the modules use the depthwise separable convolution layers introduced in [7], except modules 1, 9, 10, and 18. A depthwise separable convolution operation consists of a depthwise convolution and a pointwise convolution. The idea behind such an operation is to examine spatial correlations and cross-channel correlations independently: a depthwise convolution performs spatial convolutions independently over each channel, and a pointwise convolution then projects the feature channels produced by the depthwise convolution onto a new channel space. As a consequence, the network gains performance more efficiently for the same number of parameters. In Figure 2, blue arrows represent convolution layers and red arrows indicate depthwise separable convolutions. The network does not include any fully connected layers, which allows it to process arbitrarily large images and make a segmentation prediction of the corresponding size.
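To make the module structure concrete, the following is a minimal PyTorch sketch of one such depthwise separable module; the class name and exact layer composition are our own illustration based on the description above, not the authors' released code.

```python
import torch.nn as nn

class SepConvModule(nn.Module):
    """One W-Net module (sketch): two 3x3 depthwise separable
    convolutions, each followed by ReLU and batch normalization."""

    def __init__(self, in_ch, out_ch):
        super().__init__()

        def sep_conv(cin, cout):
            return nn.Sequential(
                # depthwise: one 3x3 filter per input channel (spatial correlations)
                nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),
                # pointwise: 1x1 projection onto a new channel space
                nn.Conv2d(cin, cout, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(cout),
            )

        self.block = nn.Sequential(sep_conv(in_ch, out_ch),
                                   sep_conv(out_ch, out_ch))

    def forward(self, x):
        return self.block(x)
```

Stacking nine such modules connected by 2 × 2 max-pooling, followed by nine mirror modules connected by transposed convolutions, would reproduce the overall shape of the network; per the description above, modules 1, 9, 10, and 18 would instead use ordinary convolutions.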
3.1. Soft Normalized Cut Loss

The output of the UEnc is a normalized 224 × 224 × K dense prediction. By taking the argmax, we can obtain a K-class prediction for each pixel. In this paper, we compute the normalized cut (Ncut) [29] as a global criterion for the segmentation:

\[
\text{Ncut}_K(V) = \sum_{k=1}^{K} \frac{\text{cut}(A_k, V - A_k)}{\text{assoc}(A_k, V)}
= \sum_{k=1}^{K} \frac{\sum_{u \in A_k,\, v \in V - A_k} w(u, v)}{\sum_{u \in A_k,\, t \in V} w(u, t)}, \tag{1}
\]

where A_k is the set of pixels in segment k, V is the set of all pixels, and w measures the weight between two pixels.

However, since the argmax function is non-differentiable, it is impossible to calculate the corresponding gradient during backpropagation. Instead, we define a soft version of the Ncut loss which is differentiable, so that gradients can be computed during backpropagation:

\[
\begin{aligned}
J_{\text{soft-Ncut}}(V, K) &= \sum_{k=1}^{K} \frac{\text{cut}(A_k, V - A_k)}{\text{assoc}(A_k, V)}
= K - \sum_{k=1}^{K} \frac{\text{assoc}(A_k, A_k)}{\text{assoc}(A_k, V)} \\
&= K - \sum_{k=1}^{K} \frac{\sum_{u \in V,\, v \in V} w(u, v)\, p(u = A_k)\, p(v = A_k)}{\sum_{u \in V,\, t \in V} w(u, t)\, p(u = A_k)} \\
&= K - \sum_{k=1}^{K} \frac{\sum_{u \in V} p(u = A_k) \sum_{v \in V} w(u, v)\, p(v = A_k)}{\sum_{u \in V} p(u = A_k) \sum_{t \in V} w(u, t)},
\end{aligned} \tag{2}
\]

where p(u = A_k) measures the probability of node u belonging to class A_k, and is directly computed by the encoder. By training UEnc to minimize the J_{soft-Ncut} loss, we simultaneously minimize the total normalized disassociation between the groups and maximize the total normalized association within the groups.
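For concreteness, the last line of Eq. (2) can be computed directly from the flattened softmax output and a pixel-affinity matrix. Below is a minimal PyTorch sketch; the function name is ours, and the dense matrix W is a simplification, since in practice the weights of Eq. (6) vanish beyond a small radius and W should be stored sparsely.

```python
import torch

def soft_ncut_loss(probs, W):
    """Soft normalized cut of Eq. (2) (sketch).

    probs: (P, K) tensor of class probabilities p(u = A_k) for P pixels,
           i.e. the flattened softmax output of UEnc.
    W:     (P, P) pixel-affinity matrix, dense here for clarity.
    Returns K - sum_k assoc(A_k, A_k) / assoc(A_k, V).
    """
    K = probs.shape[1]
    # assoc(A_k, A_k) = p_k^T W p_k for each class k
    assoc_kk = torch.einsum('uk,uv,vk->k', probs, W, probs)
    # assoc(A_k, V) = p_k^T d, where d(u) = sum_t w(u, t) is the degree
    degree = W.sum(dim=1)          # (P,)
    assoc_kv = probs.t() @ degree  # (K,)
    return K - (assoc_kk / assoc_kv).sum()
```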
3.2. Reconstruction Loss

As in the classical encoder-decoder architecture, we also train the W-Net to minimize the reconstruction loss, to enforce that the encoded representations contain as much information of the original inputs as possible. In this paper, minimizing the reconstruction loss also makes the segmentation prediction align better with the input images. The reconstruction loss is given by

\[
J_{\text{reconstr}} = \left\| X - U_{\text{Dec}}(U_{\text{Enc}}(X; W_{\text{Enc}}); W_{\text{Dec}}) \right\|_2^2, \tag{3}
\]

where W_{Enc} denotes the parameters of the encoder, W_{Dec} denotes the parameters of the decoder, and X is the input image. We train W-Net to minimize J_{reconstr} between the reconstructed images and the original inputs. We simultaneously train UEnc to minimize J_{soft-Ncut}, in order to maximize the association within segments and minimize the disassociation between segments in the encoding layer. The procedure is formally presented in Algorithm 1. By iteratively applying J_{reconstr} and J_{soft-Ncut}, the network balances the trade-off between the accuracy of reconstruction and the consistency of the encoded representation layer.

Algorithm 1 Minibatch stochastic gradient descent training of W-Net.
1: procedure W-NET(X; UEnc, UDec)
2:   for number of training iterations do
3:     Sample a minibatch of new input images x
4:     Update UEnc by minimizing J_{soft-Ncut}   ▷ Only update UEnc
5:     Update the whole W-Net by minimizing J_{reconstr}   ▷ Update both UEnc and UDec
6:   return UEnc
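A compact PyTorch-style rendering of Algorithm 1 might look as follows; u_enc, u_dec, affinity, and the soft_ncut_loss sketched above are hypothetical names, and one image per step is used for clarity rather than the minibatches of the paper.

```python
import torch

def train_wnet(u_enc, u_dec, loader, affinity, lr=0.003):
    """Alternating minimization of Algorithm 1 (sketch)."""
    opt_enc = torch.optim.SGD(u_enc.parameters(), lr=lr)
    opt_all = torch.optim.SGD(
        list(u_enc.parameters()) + list(u_dec.parameters()), lr=lr)
    for x in loader:                      # x: (1, C, H, W)
        # Step 1: update UEnc alone on the soft normalized cut, Eq. (2)
        probs = u_enc(x)[0]               # (K, H, W) softmax output
        flat = probs.reshape(probs.shape[0], -1).t()   # (P, K)
        loss_ncut = soft_ncut_loss(flat, affinity(x))
        opt_enc.zero_grad(); loss_ncut.backward(); opt_enc.step()
        # Step 2: update the whole W-Net on the reconstruction loss, Eq. (3)
        recon = u_dec(u_enc(x))
        loss_rec = ((x - recon) ** 2).sum()
        opt_all.zero_grad(); loss_rec.backward(); opt_all.step()
    return u_enc
```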
4. Postprocessing

After obtaining an initial segmentation from the encoder, we perform two postprocessing steps in order to obtain our final result. Below we describe these steps, namely CRF smoothing and hierarchical merging.

4.1. Fully-Connected Conditional Random Fields for Accurate Edge Recovery

While deep CNNs with max-pooling layers have proven their success in capturing high-level feature information of inputs, the increased invariance and large receptive fields can reduce localization accuracy. A lack of smoothness constraints can result in poor object delineation, especially in pixel-level labeling tasks. To address this issue, although the soft normalized cut loss and the skip layers in the W-Net already help to improve the localization of object boundaries, we find that segmentations with fine-grained boundaries improve further when the responses at the final UEnc layer are combined with a fully connected Conditional Random Field (CRF) model [6]. The fully connected CRF model employs the energy function

\[
E(X) = \sum_{u} \Phi(u) + \sum_{u,v} \Psi(u, v), \tag{4}
\]

where u, v are pixels on the input data X. The unary potential is \Phi(u) = -\log p(u), where p(u) is the label annotation probability computed by the softmax layer in UEnc. The pairwise potential \Psi(u, v) measures the weighted penalties when two pixels are assigned different labels, using two Gaussian kernels in different feature spaces.

Figure 3. Belief map (output of the softmax function) before and after a fully connected CRF model. (a) Original image, (b) the responses at the final UEnc layer, (c) the output of the fully connected CRF.

Figure 3 presents an example of the prediction before and after the fully connected CRF model. The output of the softmax layer in the fully convolutional encoder UEnc predicts the rough position of objects in the inputs, with coarse boundaries. After the fully connected CRF model, the boundaries are sharper and small spurious regions have been smoothed out or removed.
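As a concrete sketch of this step, a fully connected CRF with Gaussian edge potentials [20, 6] is available in the pydensecrf package; the kernel parameters below are illustrative placeholders, not values from the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iter=10):
    """Smooth UEnc's soft prediction with a fully connected CRF (sketch).

    image: (H, W, 3) uint8 RGB array.
    probs: (K, H, W) float32 softmax output of UEnc.
    """
    K, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, K)
    d.setUnaryEnergy(unary_from_softmax(probs))  # Phi(u) = -log p(u), Eq. (4)
    # the two Gaussian pairwise kernels: spatial, and joint spatial-color
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=10,
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = np.array(d.inference(n_iter))            # (K, H*W) marginals
    return q.reshape(K, H, W).argmax(axis=0)     # hard labels after smoothing
```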

4.2. Hierarchical Segmentation

After taking the argmax on the output of the fully connected CRF, we still typically obtain an over-segmented partition of the input image. Our final step is to merge segments appropriately to form the final image segments. Figure 4 shows examples of such initial regions, with boundaries in red lines, for the original input images in (a). In this section, we discuss an efficient hierarchical segmentation that first converts the over-segmented partitions into weighted boundary maps and then iteratively merges the most similar regions.

We measure the "importance" of each pixel on the initial over-segmented partition boundaries by computing a weighted combination of multi-scale local cues and global boundary measurements based on spectral clustering [1, 2]:

\[
gPb(x, y, \theta) = \sum_{s} \sum_{i} \beta_{i,s}\, G_{i,\sigma(s)}(x, y, \theta) + \gamma \cdot sPb(x, y, \theta), \tag{5}
\]

where s indexes scales, i indexes feature channels including brightness, color, and texture, and G_{i,σ(s)}(x, y, θ) measures the dissimilarity between the two halves of a disc of radius σ(s) centered at (x, y) in channel i at angle θ. The mPb signal measures all the edges in the image, and the sPb signal captures the most salient curves in the image. Figure 4 (c) shows the corresponding weighted boundary maps of the initial boundaries produced by W-Net with CRF smoothing in Figure 4 (b).

Figure 4. The initial partitions for the hierarchical merging and the corresponding weighted boundary maps. (a) Original inputs. (b) The output (argmax) of the fully connected CRF with boundaries shown in red lines. (c) The corresponding weighted boundary maps of the red lines in (b).

We then build a hierarchical segmentation from these weighted boundaries using the contours2ucm stage described in [1, 2]. This algorithm constructs a hierarchy of segments from contour detections. It has two steps: an Oriented Watershed Transform (OWT) to build an initial over-segmented region, and an Ultrametric Contour Map (UCM), which is a greedy graph-based region merging algorithm.
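Assuming the local cue responses and the spectral signal have already been computed (the names below are illustrative), the combination in Eq. (5) reduces to a weighted sum:

```python
import numpy as np

def gpb(local_cues, betas, spb, gamma):
    """Weighted combination of Eq. (5) (sketch).

    local_cues: dict mapping (scale s, channel i) to an array
                G_{i,sigma(s)} of shape (H, W, n_orientations).
    betas:      dict of matching scalar weights beta_{i,s}.
    spb:        spectral boundary signal sPb, same shape as each cue.
    gamma:      scalar weight on the spectral term.
    """
    out = gamma * spb
    for key, G in local_cues.items():
        out = out + betas[key] * G  # accumulate beta_{i,s} * G_{i,sigma(s)}
    return out
```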
Algorithm 2 Postprocessing
1: procedure POSTPROCESSING(x; UEnc, CRF, Pb)
2:   x′ = UEnc(x)   ▷ Get the hidden representation of x
3:   x″ = CRF(x′)   ▷ Fine-grained boundaries with a fully connected CRF
4:   x‴ = Pb(x″)   ▷ Compute the probability of boundary only on the edges detected in x″
5:   S = contours2ucm(x‴)   ▷ Hierarchical segmentation
6:   return S

5. Experiments

We train our proposed W-Net on the PASCAL VOC2012 dataset [12] and then evaluate the trained network using the Berkeley Segmentation Database (BSDS300 and BSDS500). The PASCAL VOC2012 dataset is a large visual object classes challenge which contains 11,530 images and 6,929 segmentations. BSDS300 and BSDS500 have 300 and 500 images, respectively. For each image, the BSDS dataset provides human-annotated segmentations as ground truth. Since our proposed method is designed for unsupervised image segmentation, we do not use any ground truth labels in the training phase; we use the ground truth only to evaluate the quality of our segmentations.

We resize the input images to 224 × 224 during training; the architecture of the trained network is shown in Figure 2. We train the networks from scratch using minibatches of 10 images, with an initial learning rate of 0.003. The learning rate is divided by ten after every 1,000 iterations. The training is stopped after 50,000 iterations. Dropout of 0.65 was added to prevent overfitting during training. We construct the weight matrix W for J_{soft-Ncut} as

\[
w_{ij} = \exp\!\left(-\frac{\|F(i) - F(j)\|_2^2}{\sigma_I^2}\right) \cdot
\begin{cases}
\exp\!\left(-\frac{\|X(i) - X(j)\|_2^2}{\sigma_X^2}\right) & \text{if } \|X(i) - X(j)\|_2 < r \\
0 & \text{otherwise,}
\end{cases} \tag{6}
\]

where X(i) and F(i) are the spatial location and pixel value of node i, respectively; σ_I = 10, σ_X = 4, and r = 5.
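As a brute-force NumPy/SciPy sketch of Eq. (6) for a single-channel image (the function name is ours, and a vectorized implementation would be needed for 224 × 224 inputs), one can loop over the radius-r neighborhood of each pixel:

```python
import numpy as np
from scipy.sparse import lil_matrix

def ncut_weights(image, sigma_I=10.0, sigma_X=4.0, r=5):
    """Sparse pixel-affinity matrix W of Eq. (6) (sketch).

    image: (H, W) array; F(i) is the pixel value and X(i) the
    (row, col) location of pixel i.
    """
    H, W_ = image.shape
    P = H * W_
    W = lil_matrix((P, P))
    for i in range(P):
        yi, xi = divmod(i, W_)
        # only pixels within spatial distance r get a nonzero weight
        for yj in range(max(0, yi - r), min(H, yi + r + 1)):
            for xj in range(max(0, xi - r), min(W_, xi + r + 1)):
                d2 = (yi - yj) ** 2 + (xi - xj) ** 2
                if d2 >= r * r:
                    continue
                j = yj * W_ + xj
                f2 = (float(image[yi, xi]) - float(image[yj, xj])) ** 2
                W[i, j] = np.exp(-f2 / sigma_I ** 2) * np.exp(-d2 / sigma_X ** 2)
    return W.tocsr()
```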
Figure 5. J_{reconstr} and J_{soft-Ncut} losses in the training phase. Left: reconstruction losses during training (red: training without J_{soft-Ncut}; blue: training with J_{soft-Ncut}). Right: soft-Ncut loss during training.

The plots of the J_{reconstr} and J_{soft-Ncut} losses during training are shown in Figure 5. We examine the J_{reconstr} loss with and without the J_{soft-Ncut} term during training. From Figure 5, we can see that J_{reconstr} converges faster when J_{soft-Ncut} is not considered. When we add J_{soft-Ncut} during training, J_{reconstr} decreases more slowly and less stably. At convergence of J_{reconstr}, the blue line (training with J_{soft-Ncut}) is still higher than the red one (training without); this is because the hidden representation space is forced to be more consistent with a good segmentation of the image when the J_{soft-Ncut} loss is introduced, so its ability to reconstruct the original images is weakened. Finally, both J_{reconstr} and J_{soft-Ncut} converge, which means our approach balances the trade-off between minimizing the reconstruction loss in the last layer and maximizing the total association within the groups in the hidden layer.

Figure 6. A comparison with and without the J_{soft-Ncut} loss during back-propagation. (a) Original image. (b) Visualization of the output of the final softmax layer in UEnc without the J_{soft-Ncut} loss during back-propagation. (c) Visualization of the output of the final softmax layer in UEnc with the J_{soft-Ncut} loss during back-propagation. (d) The corresponding reconstructed image for (b). (e) The corresponding reconstructed image for (c).

Figure 6 illustrates the comparison with and without the J_{soft-Ncut} loss during back-propagation. To better visualize the output of the softmax layer in UEnc, we take the argmax of the prediction and use different colors to visualize the different pixel-wise labels. We can see that the pixel-wise prediction is smoothed when we consider J_{soft-Ncut} during back-propagation. When we remove the J_{soft-Ncut} loss from the W-Net, the model becomes a regular fully convolutional encoder-decoder which makes a high-quality reconstruction; however, the output of the softmax layer is noisier and more discrete. On the other hand, by adding the J_{soft-Ncut} loss, we get a more consistent hidden representation, shown in (c), although the reconstruction is not as good as the one from a classical encoder-decoder architecture. From this comparison, we can see the trade-off between consistency in the hidden representation and the quality of reconstruction, and it justifies our use of a soft normalized cut loss during training.
Figure 7. Results of hierarchical segmentation using the output of W-Net with CRF smoothing as initial boundaries, on the BSDS500. From top to bottom: original image, the initial over-segmented partitions (shown as red lines) obtained by the fully connected CRF, and segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).

Figure 8. Results of hierarchical segmentation using the combination of the output of W-Net with CRF smoothing and UCM as the initial boundaries, on the BSDS500. From top to bottom: original image, the initial over-segmented partitions (shown as red lines), and segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
5.1. Segmentation Benchmarks

To compare the performance of W-Net with existing unsupervised image segmentation methods, we compare with the following: DC-Seg-full [11], gPb-owt-ucm [2], Taylor [30], Felzenszwalb and Huttenlocher (Felz-Hutt) [14], Mean Shift [8], Canny-owt-ucm [2], SWA [28], Chan Vese [4], Multiscale Normalized Cuts (NCuts) [9], and Quad-Tree. As has become standard, we evaluate performance on three different metrics: Variation of Information (VI), Probabilistic Rand Index (PRI), and Segmentation Covering (SC). For SC and PRI, higher scores are better; for VI, a lower score is better. We also report human performance on this data set. For a set of hierarchical segmentations S_i corresponding to different scales, we report the result at the Optimal Dataset Scale (ODS) and the Optimal Image Scale (OIS).

Table 1. Results on BSDS300 (SC, PRI, and VI at ODS/OIS). The values are reproduced from the tables in [30].

Method              SC ODS/OIS    PRI ODS/OIS   VI ODS/OIS
Quad Tree           0.33 / 0.39   0.71 / 0.75   2.34 / 2.22
Chan Vese [4]       0.49 / -      0.75 / -      2.54 / -
NCuts [9]           0.44 / 0.53   0.75 / 0.79   2.18 / 1.84
SWA [28]            0.47 / 0.55   0.75 / 0.80   2.06 / 1.75
Canny-owt-ucm [2]   0.48 / 0.56   0.77 / 0.82   2.11 / 1.81
Felz-Hutt [14]      0.51 / 0.58   0.77 / 0.82   2.15 / 1.79
Mean Shift [8]      0.54 / 0.58   0.78 / 0.80   1.83 / 1.63
Taylor [30]         0.56 / 0.62   0.79 / 0.84   1.74 / 1.63
W-Net (ours)        0.58 / 0.62   0.81 / 0.84   1.71 / 1.53
gPb-owt-ucm [2]     0.59 / 0.65   0.81 / 0.85   1.65 / 1.47
W-Net+ucm (ours)    0.60 / 0.65   0.82 / 0.86   1.63 / 1.45
Human               0.73 / 0.73   0.87 / 0.87   1.16 / 1.16

Table 2. Results on BSDS500 (SC, PRI, and VI at ODS/OIS). The values are reproduced from the tables in [30] and [11].

Method              SC ODS/OIS    PRI ODS/OIS   VI ODS/OIS
NCuts [9]           0.45 / 0.53   0.78 / 0.80   2.23 / 1.89
Canny-owt-ucm [2]   0.49 / 0.55   0.79 / 0.83   2.19 / 1.89
Felz-Hutt [14]      0.52 / 0.57   0.80 / 0.82   2.21 / 1.87
Mean Shift [8]      0.54 / 0.58   0.79 / 0.81   1.85 / 1.64
Taylor [30]         0.56 / 0.62   0.81 / 0.85   1.78 / 1.56
W-Net (ours)        0.57 / 0.62   0.81 / 0.84   1.76 / 1.60
gPb-owt-ucm [2]     0.59 / 0.65   0.83 / 0.86   1.69 / 1.48
DC-Seg-full [11]    0.59 / 0.64   0.82 / 0.85   1.68 / 1.54
W-Net+ucm (ours)    0.59 / 0.64   0.82 / 0.85   1.67 / 1.47
Human               0.72 / 0.72   0.88 / 0.88   1.17 / 1.17

Table 1 and Table 2 summarize the performance of the proposed method, denoted W-Net, on BSDS300 and BSDS500 respectively. Since the UEnc encoder followed by a fully connected CRF provides an initial boundary detection, we compute the multi-scale local cues only on the detected edges instead of on the whole input image. As can be seen, our proposed approach has competitive performance compared to the computationally demanding gPb-owt-ucm method. We further consider combining the boundaries produced by our W-Net model after CRF smoothing with the ultrametric contour map produced by the gPb-owt-ucm method before applying the final postprocessing step; we denote this variant as W-Net+ucm. We can see that with this variant, our results outperform the other methods.

Figure 7 illustrates results of running the proposed W-Net on images from the BSDS500. The first row shows the original inputs; the second row shows the results of the initial boundary detection produced by the UEnc encoder followed by a fully connected CRF. The third and fourth rows show the ultrametric contour maps produced by the contours2ucm stage at the ODS and OIS respectively. Figure 8 illustrates results of running W-Net+ucm on images from the BSDS500.
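Of the three metrics, variation of information has the simplest closed form, VI(A, B) = H(A | B) + H(B | A); a small NumPy sketch (ours, for illustration, not the benchmark code) computes it from the joint label histogram of two segmentations:

```python
import numpy as np

def variation_of_information(seg_a, seg_b):
    """VI between two integer label maps of the same shape (sketch).

    Lower is better; identical partitions give 0.
    """
    a, b = seg_a.ravel(), seg_b.ravel()
    n = a.size
    joint = {}                      # joint counts over label pairs
    for la, lb in zip(a, b):
        joint[(la, lb)] = joint.get((la, lb), 0) + 1
    pa, pb = {}, {}                 # marginal counts per labeling
    for (la, lb), c in joint.items():
        pa[la] = pa.get(la, 0) + c
        pb[lb] = pb.get(lb, 0) + c
    vi = 0.0
    for (la, lb), c in joint.items():
        p = c / n
        # -p [ log p/p(a) + log p/p(b) ] summed gives H(A|B) + H(B|A)
        vi -= p * (np.log(p / (pa[la] / n)) + np.log(p / (pb[lb] / n)))
    return vi
```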
6. Conclusion

In this paper we introduced a deep learning-based approach for fully unsupervised image segmentation. Our proposed algorithm is based on concatenating two fully convolutional networks together into an encoder-decoder framework, where each of the FCNs is a variant of the U-Net architecture. Training is performed by iteratively minimizing the reconstruction error of the decoder along with a soft normalized cut of the encoder layer. As the resulting segmentations are typically coarse and over-segmented, we apply CRF smoothing and hierarchical merging to produce the final outputted segments. On the Berkeley Segmentation Data Set, we outperform a number of existing classical and recent techniques, achieving performance near human level by some metrics.

We believe our method will be useful in cases where it is difficult to obtain labeled pixelwise supervision, for instance in domains such as biomedical image analysis where new data sets may require significant re-labeling for semantic segmentation methods to work well. Further, our approach may be refined in the future by utilizing different loss functions or postprocessing steps. Ideally, we would like to design an architecture where additional postprocessing is not necessary. Finally, designing variants of our architecture for the case when a small amount of supervision is available would also be useful in many domains.
7. Appendix

We show additional results of running the proposed method W-Net on images from the BSDS500 in Figures 9 and 10. Further, Figures 11 and 12 illustrate more results of running W-Net+ucm on images from the BSDS500.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. From contours to regions: An empirical evaluation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2294–2301. IEEE, 2009.
[2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[4] L. Bertelli, B. Sumengen, B. Manjunath, and F. Gibou. A variational framework for multiregion pairwise-similarity-based image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1400–1414, 2008.
[5] A. Chaurasia and E. Culurciello. LinkNet: Exploiting encoder representations for efficient semantic segmentation. arXiv preprint arXiv:1707.03718, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[8] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 2002.
[9] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graph decomposition. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 1124–1131. IEEE, 2005.
[10] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3992–4000, 2015.
[11] M. Donoser and D. Schmalstieg. Discrete-continuous gradient orientation estimation for faster image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3158–3165, 2014.
[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[13] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
[14] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision, pages 297–312. Springer, 2014.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[18] F. J. Huang, Y.-L. Boureau, Y. LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[20] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[22] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3376–3385, 2015.
[23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[24] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[25] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
[26] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
[27] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[28] E. Sharon, M. Galun, D. Sharon, R. Basri, and A. Brandt. Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104):810–813, 2006.
[29] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[30] C. J. Taylor. Towards fast and accurate segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1916–1922, 2013.
[31] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE Transactions on Medical Imaging, 20(1):45–57, 2001.
[32] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
Figure 9. Results of hierarchical segmentation using the output of W-Net with CRF smoothing as initial boundaries, on the BSDS500. From top to bottom: original image, the initial over-segmented partitions (shown as red lines) obtained by the fully connected CRF, and segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Figure 10. Results of hierarchical segmentation using the output of W-Net with CRF smoothing as initial boundaries, on the BSDS500. From top to bottom: original image, the initial over-segmented partitions (shown as red lines) obtained by the fully connected CRF, and segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Figure 11. Results of hierarchical segmentation using the combination of the output of W-Net with CRF smoothing and UCM as the initial boundaries, on the BSDS500. From top to bottom: original image, the initial over-segmented partitions (shown as red lines), and segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
Figure 12. Results of hierarchical segmentation using the combination of the output of W-Net with CRF smoothing and UCM as the initial boundaries, on the BSDS500. From top to bottom: original image, the initial over-segmented partitions (shown as red lines), and segmentations obtained by thresholding at the optimal dataset scale (ODS) and optimal image scale (OIS).
