DeepCut: Unsupervised Segmentation Using Graph Neural Networks Clustering
Figure 1: DeepCut: We use graph neural networks with unsupervised losses from classical graph theory to solve various
image segmentation tasks. Specifically, we employ deep features obtained from a pre-trained vision transformer to construct
a graph representation for each image, and subsequently partition this graph to generate a segmentation.
adding priors such as boundaries or scribbles [21], semi-supervised learning [3, 20], weakly-supervised learning [24], and more. However, these approaches have limitations since they still rely on some form of annotations or prior knowledge about the image structure.

An alternative approach to tackle this problem is to explore unsupervised methods. Recent research on unsupervised Deep Neural Networks (DNNs) has yielded promising outcomes. For instance, self-DIstillation with NO labels (DINO [9]) was employed to train Vision Transformers (ViTs), and the generated attention maps corresponded to semantic segments in the input image. The deep features extracted from these trained transformers demonstrated significant semantic meaning, facilitating their utilization in various visual tasks, such as object localization, segmentation, and semantic segmentation [2, 23, 34].

Some recent work in this area has demonstrated promising results, especially using unsupervised techniques that combine deep features with classical graph theory for object localization and segmentation tasks [23, 34]. These methods are lightweight and rely on pre-trained unsupervised networks, unlike end-to-end approaches that require significant time and resources for training from scratch.

Our proposed method, called DeepCut, introduces an innovative approach using Graph Neural Networks (GNNs) combined with classical graph clustering objectives as loss functions. GNNs are specialized neural networks designed to process graph-structured data, and they have achieved remarkable results in various domains, including protein folding with AlphaFold [19], drug discovery [37], and traffic prediction [18]. This work employs GNNs for computer vision tasks such as object localization, segmentation, and semantic part segmentation.

A key advantage of our method is the direct utilization of deep features within the clustering process, in contrast to previous approaches [23, 34] that discarded this valuable information and relied solely on correlations between features. This leads to improved object segmentation performance and enables semantic segmentation across multiple graphs, each corresponding to a separate image. To further enhance segmentation, we adopt a two-stage approach that overcomes shortcomings of the objective functions overlooked by previous methods: we first separate foreground and background and then perform segmentation individually for each part, leading to more accurate results. Moreover, we introduce a method to perform "k-less clustering," allowing data clustering without the need to pre-define the number of clusters k, enhancing flexibility and adaptability.

Our contributions are:

• Utilizing a lightweight GNN with classical clustering objectives as unsupervised loss functions for image segmentation, surpassing state-of-the-art performance at various unsupervised segmentation tasks (speed and accuracy).

• Optimizing the correlation clustering objective for deep-feature clustering, which is not feasible with classical methods, thus achieving k-less clustering.

• Performing semantic part segmentation on multiple images using test-time optimization. By applying the method to each image separately, we eliminate the need for any post-processing steps required by previous methods.

2. Background

In the era prior to the deep-learning surge, numerous classical approaches adopted quantitative criteria for segmentation based on principles from graph theory. These methods involved representing affinities between image regions as a graph and associating various image partitions with cuts in that graph (e.g., [26, 5, 4]). The quality of image segments was determined by the "optimality" of these cuts. Remarkably, these techniques operated without any supervision, relying solely on the provided affinities between image regions.

Different methods have defined optimal cuts in various ways, each presenting advantages. We provide a brief overview of the relevant approaches.

2.1. Graph Clustering

We leverage two graph clustering functionals from classical graph theory in our work: normalized cut [26] and correlation clustering [5].

Notations Let G = (V, E) be an undirected graph induced by an image. Each node represents an image region, and the weights w_{ij} represent the affinity between image regions i and j, i, j = 1 \dots n. Let W be an n \times n matrix whose entries are w_{ij}. Our goal is to partition this graph into k disjoint sets A_1, A_2, \dots, A_k such that \cup_i A_i = V and A_i \cap A_j = \emptyset for all i \neq j. This partition can be expressed as a binary matrix S \in \{0, 1\}^{n \times k}, where S_{ic} = 1 iff i \in A_c.

Normalized Cut (N-cut) [26] A good partition is defined as one that maximizes the number of within-group connections and minimizes the number of between-group connections. The number of between-group connections can be computed as the total weight of edges removed, described in graph theory as a cut:

cut(A, B) = \sum_{u \in A, v \in B} w(u, v),    (1)

where A, B are the parts of a two-way partition of G. This objective is formulated by the normalized cut (N-cut) functional:

Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)},    (2)

where assoc(A, V) = \sum_{i \in A, j \in V} w_{ij} is the total affinity connecting nodes of A to all nodes in the graph. The N-cut formulation can be easily extended to K > 2 segments.
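To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch that evaluates the cut and N-cut values of a hard two-way partition given an affinity matrix W; the helper names and the toy graph are ours, not from the paper.

```python
import numpy as np

def cut_value(W, mask):
    """Eq. (1): total weight of edges crossing the partition A = mask, B = ~mask."""
    return W[mask][:, ~mask].sum()

def assoc(W, mask):
    """assoc(A, V): total affinity connecting nodes of A to all nodes."""
    return W[mask].sum()

def ncut_value(W, mask):
    """Eq. (2): cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)."""
    c = cut_value(W, mask)
    return c / assoc(W, mask) + c / assoc(W, ~mask)

# Toy graph: nodes {0,1} and {2,3} form two weakly connected clusters.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
A = np.array([True, True, False, False])
print(ncut_value(W, A))   # ~0.18, a low N-cut for a good partition
```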
Shi and Malik [26] also suggested an approximate solution to Eq. (2) using spectral methods, known as spectral clustering, where the spectrum (eigenvalues) of a graph Laplacian matrix is used to obtain an approximate solution to the N-cut problem.

Correlation Clustering (CC) [5] When the graph, derived from an image, contains both negative and positive affinities, N-cut is no longer applicable. In this case, a good image segment may be one that maximizes the positive affinities inside the segment and the negative ones across segments [5]. This objective can be formulated by the correlation clustering (CC) functional. Given a matrix W, an optimal partition S minimizes:

CC(S) = -\sum_{ij} \sum_{c} W_{ij} S_{ic} S_{jc}.    (3)

Correlation clustering utilizes intra-cluster disagreement (repulsion) to automatically deduce the number of clusters k [5], so it does not need to know k in advance. Further theoretical analysis of this property can be found in [4], along with classical optimization algorithms applying CC to image segmentation.

2.2. Graph Neural Networks

A GNN (Graph Neural Network) is a category of neural networks designed to process graph-structured data directly. A GNN layer comprises two fundamental operations: message passing and aggregation. Message passing involves gathering information from the neighbors of each node and is applied to all nodes in the graph. The exchanged information can include a combination of node features, edge features, or any other data embedded in the graph. Aggregation refers to fusing all the messages obtained during the message-passing phase into one message that updates the current node's state.

In our approach, we utilize a specific type of GNN called the Graph Convolutional Network (GCN) [35]. In a GCN layer \ell, each node h_v is processed as follows:

h_v^{(\ell+1)} = \Theta \sum_{u \in N(v)} \frac{h_u^{(\ell)}}{|N(v)|},    (4)

where \Theta is a matrix of trainable parameters. N(v) denotes the neighbours of a node h_v, and |N(v)| is the number of neighbours. The GCN aggregation is performed through a summation operator, with the term inside the summation representing the message-passing operation. The objective during training is to optimize the network parameters \Theta in a way that generates meaningful messages to be passed among nodes, optimizing the loss function.

3. Method

In our approach, we process each image through the pre-trained network to extract deep features, which are then used for clustering. Subsequently, we construct a weighted graph based on these features. Employing a lightweight Graph Neural Network (GNN), we optimize our unsupervised graph partitioning loss functions (either L_{NCut} or L_{CC}) separately for each image graph. We adopt this methodology to achieve unsupervised object localization, segmentation, and semantic part segmentation.

3.1. From Deep Features to Graphs

As shown in Figure 2, given an image M of size m \times n with d channels, we pass it through a transformer T. The transformer divides the image into mn/p^2 patches, where p is the patch size of transformer T. To extract the transformer's internal representation for each patch, we utilize the key token from the last layer, as it has demonstrated superior performance across various tasks [2, 23]. The output is a feature matrix f \in \mathbb{R}^{(mn/p^2) \times c}, where c represents the token embedding dimension, containing all the extracted features from the different patches.

Consider a weighted graph G = (V, E) with W as the weight matrix. We construct a patch-wise correlation matrix from the features obtained by the Vision Transformer:

W = f f^T \in \mathbb{R}^{(mn/p^2) \times (mn/p^2)}.    (5)

In correlation clustering, the negative weights (representing repulsion) carry valuable information utilized by the clustering objective. However, the normalized cut objective only accepts positive weights. As a result, we threshold at zero:

W = f f^T \cdot (f f^T > 0) \in \mathbb{R}^{(mn/p^2) \times (mn/p^2)}.    (6)

We introduce a hyper-parameter called k-sensitivity, denoted by \alpha, to adapt the cluster-choosing process in correlation clustering. Since the number of clusters cannot be directly chosen in correlation clustering, this parameter allows us to control the sensitivity of the process, where a higher value of \alpha corresponds to more clusters; the adjustment is defined in Eq. (7).
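As an illustration of Sec. 3.1 and the objectives of Sec. 2.1, here is a PyTorch sketch that builds the affinity matrix of Eqs. (5)-(6) from a patch-feature matrix f and evaluates the CC functional of Eq. (3) on a soft assignment matrix S. The N-cut term uses one standard differentiable relaxation; that relaxation and all function names are our assumptions, not necessarily the paper's exact formulation.

```python
import torch

def build_affinity(f: torch.Tensor, for_ncut: bool) -> torch.Tensor:
    W = f @ f.T                # Eq. (5): patch-wise correlation matrix
    if for_ncut:
        W = W * (W > 0)        # Eq. (6): keep positive affinities only
    return W

def cc_loss(S: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # Eq. (3) evaluated on a soft assignment S (n x k); W may be signed.
    return -(W * (S @ S.T)).sum()

def ncut_loss(S: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # One common relaxation of Eq. (2): maximize sum_c (S_c^T W S_c)/(S_c^T d),
    # with d the node degrees; W must be non-negative here.
    d = W.sum(dim=1)
    cut = torch.einsum('ic,ij,jc->c', S, W, S)
    assoc = torch.einsum('ic,i->c', S, d)
    return -(cut / (assoc + 1e-8)).sum()
```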
Figure 2: Method overview: After extracting deep features from a pretrained ViT model, we construct a similarity matrix
based on the patch-wise feature similarities, which becomes our adjacency matrix. We build a graph using this adjacency
matrix and the deep features as node features. Next, we train a lightweight GNN using unsupervised graph partitioning loss
functions (Sec. 2.1) to partition the graph into k distinct clusters, which can be used for various downstream tasks.
Figure 3: Proposed two-stage clustering: First, we cluster the image into two disjoint sets, then apply clustering to the foreground and background separately.

Table 1: Object localization results. CorLoc metric (percentage of images with IoU > 0.5). CC and N-cut denote correlation clustering and normalized cut, respectively.
Our approach is versatile and applicable to various image partitioning-related tasks, including:

Object Localization Object localization involves identifying the primary object in an image and enclosing it with a bounding box. To perform localization, we follow the steps below: (1) Use our GCN clustering method with k = 2. (2) Examine the edges of the clustered image and identify the cluster that appears on more than two edges as the background, while the other cluster becomes our main object. (3) Apply a bounding box around the identified main object (see the sketch below).

Object Segmentation Object segmentation involves the separation of the foreground object in an image, commonly known as foreground-background segmentation. The method employed for object segmentation is identical to object localization, excluding stage (3), where the bounding box is applied.
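A minimal NumPy sketch of the localization recipe above, assuming `labels` is the (h x w) patch-label map produced by the k = 2 clustering; we approximate step (2) by a majority vote over border patches, and the function name is ours.

```python
import numpy as np

def localize(labels: np.ndarray):
    """labels: (h, w) integer patch-label map from the k = 2 clustering.
    Returns (row0, col0, row1, col1), a box around the foreground patches."""
    # Step (2): the cluster covering most border patches is the background.
    border = np.concatenate([labels[0], labels[-1], labels[:, 0], labels[:, -1]])
    background = np.bincount(border, minlength=2).argmax()
    rows, cols = np.where(labels != background)
    # Step (3): bounding box around the remaining (foreground) patches.
    return rows.min(), cols.min(), rows.max() + 1, cols.max() + 1
```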
Semantic Part Segmentation DeepCut achieves semantic part segmentation through a test-time optimization paradigm, where the model is sequentially exposed to each image, optimizing the model weights based on the previous image. This means the model does not require training on all images beforehand, eliminating the need for co-segmentation or post-processing steps. This advantage is derived from our approach of learning deep features with a GNN, which differs from previous methods [8, 34] that rely solely on correlations and discard the high-dimensional data of deep features. By leveraging the intricate semantic information embedded within deep features, our model implicitly performs classification of the clusters between different image graphs, enabling semantic part segmentation without the need for explicit post-processing or additional training steps.

Our segmentation process involves two steps, as shown in Figure 3: foreground-background segmentation (k = 2) followed by semantic part segmentation on the foreground object (k = 4). This two-stage process addresses the bias of the clustering functions towards larger clusters (e.g., background vs. foreground), which limits the level of detail in foreground object segmentation. The same process can also be applied to improve background segmentation, as depicted in Figure 1 and Figure 5.

4. Training and performance

For all experiments, we use a DINO [9]-trained ViT-S/8 transformer for feature extraction, with pre-trained weights from the DINO paper authors (trained on ImageNet [25]). No training is conducted on the tested datasets. We employ a test-time training paradigm for all experiments, where each image is trained using the proposed graph neural network (GNN) segmentation approach for ten epochs. Since the proposed losses are unsupervised and involve solving an optimization problem, generalizing the model to the entire dataset does not improve accuracy. This training method achieves more than twice the throughput (speed) of the current state-of-the-art TokenCut [34] on the same hardware. DeepCut's efficiency stems from its lightweight architecture, consisting of only 30k trainable parameters. Implementation details and performance analyses are provided in the supplementary material.

5. Results

Our method is evaluated on three unsupervised tasks: single object localization, single object segmentation, and semantic part segmentation. We compare our approach with other unsupervised methods published on these tasks using widely used benchmarks. The results are presented for both the normalized-cut and correlation clustering GNN objectives.
Method                  CUB    DUTS   ECSSD
OneGAN [7]              55.5   -      -
Voynov et al. [31]      68.3   49.8   -
Spectral Methods [23]   76.9   51.4   73.3
TokenCut [34]           -      57.6   71.2
DeepCut: CC loss        77.7   56.0   73.4
DeepCut: N-cut loss     78.2   59.5   74.6

Table 2: Single object segmentation results, mIoU (mean intersection-over-union). CC denotes correlation clustering and N-cut normalized cut.

Method                  NMI    ARI
SCOPS [17] (model)      24.4   7.1
Huang and Li [16]       26.1   13.2
Choudhury et al. [11]   43.5   19.6
DFF [12]                25.9   12.4
Amir et al. [2]         38.9   16.1
DeepCut: N-cut loss     43.9   20.2

Table 3: Semantic part segmentation results. ARI and NMI over the entire CUB-200 dataset are used to evaluate cluster quality. Predictions were performed with k = 4 with our method using the N-cut objective. The first three methods use ground-truth foreground masks as supervision.
5.1. Object Localization use ground truth foreground masks as supervision.
Table 5: k-less clustering. We apply DeepCut with correlation clustering loss on the Fashion Product Images Dataset [1], using a subset of images from 5 classes: Top-wear, Shoes, Bags, Eye-wear, and Belts. We employ our k-less method and report the results as classification purity. The experiments are conducted with k-sensitivity α = 3, as defined in Eq. (7). The presented results are the mean and standard deviation (std) obtained from multiple experiments.
Figure 8: Segmentation refinement. Single object segmentation with our method and a bilateral solver [6] as a complementary step to refine the obtained segmentation.

Algorithm 1 DeepCut
1: x ← Input image
2: x ← ViT(x) ▷ Deep features from ViT
3: G ← Build graph(x)
4: for each training epoch do
5:   s ← GCN(G) ▷ Single layer of GCN
6:   s ← ELU(s)
7:   s ← MLP(s) ▷ Two-layer MLP
8:   s ← softmax(s)
9:   Loss ← L_NCut or L_CC
10: end for
11: Output segmentation ← argmax(s)
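The network in Algorithm 1 can be sketched in PyTorch as follows, using the component sizes listed in the implementation details below (single GCN layer, hidden size 64, two-layer MLP with ELU and 0.25 dropout). The use of PyTorch Geometric's GCNConv and all names here are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv   # assumption: PyTorch Geometric backend

class DeepCutHead(nn.Module):
    """GCN -> ELU -> two-layer MLP -> softmax, mirroring Algorithm 1.
    Sizes follow the implementation details below; names are ours."""
    def __init__(self, in_dim: int, k: int, hidden: int = 64):
        super().__init__()
        self.gcn = GCNConv(in_dim, hidden)           # step 5: single GCN layer
        self.mlp = nn.Sequential(                    # step 7: two-layer MLP
            nn.Linear(hidden, hidden // 2),
            nn.ELU(),
            nn.Dropout(0.25),
            nn.Linear(hidden // 2, k),
        )

    def forward(self, x, edge_index, edge_weight=None):
        s = F.elu(self.gcn(x, edge_index, edge_weight))   # steps 5-6
        return torch.softmax(self.mlp(s), dim=-1)         # step 8: soft S (n x k)
```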
Figure 9: Segmentation refinement. Resolution manipulation using a smaller stride. The ViT input resolution is 280 × 280 for both images.

1. Resolution Manipulation

In this work, we utilize deep features extracted from ViTs. Those features represent the image patches corresponding to the ViT patch size p × p. We perform our segmentation patch-wise; thus, for an image of size (m × n), our segmentation resolution is (m/p × n/p). For higher-resolution segmentation, we change the ViT stride value to p/2 instead of p, as this doubles the number of patches the ViT uses, doubling the resolution of our segmentation to (2m/p × 2n/p). We found this method to yield better results than changing the input image resolution, as the transformer used in this paper wasn't trained on high-resolution images. An example is shown in Fig. 9.

To improve the segmentation resolution further, a bilateral solver [6] can be added as a complementary step to refine the boundaries of the obtained segmentation; see Figure 8. All results reported in the paper are without any post-processing methods.

2. Implementation details

For all experiments, we use a DINO [9]-trained ViT-S/8 transformer for feature extraction; specifically, we use the key features from the last layer of the DINO-trained "student" transformer. We use pre-trained weights provided by the DINO paper authors (trained on ImageNet [25]). We do not conduct any training of the transformer on any of the tested datasets. Input images are resized to a resolution of 280 × 280.

ViT Evaluation mode, frozen weights.

Build graph Create a graph from deep features as depicted in Sec. 3.1.

GCN Graph Convolutional Network [35]. Input size = deep features size, hidden size = 64.

ELU The Exponential Linear Unit activation function.

MLP Consists of 2 linear layers; layer 1: from the GCN hidden size to hidden size/2; layer 2: from hidden size/2 to k, the number of desired clusters. For k-less usage with correlation clustering, the output size is set to the maximum number of desired clusters. Between the layers, there is an ELU activation function and 0.25 dropout.

Loss We suggest two loss functions derived from classical graph theory: N-cut and CC.

Output segmentation At step 8, in order to obtain the final segmentation, the vector s is extracted. Each entry in this vector corresponds to a patch of the image and contains a probability vector that describes the likelihood of the patch belonging to a specific cluster. We choose the most likely cluster assignment for each patch and then unflatten the result to get a segmentation map.
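A small sketch of the "Output segmentation" step, assuming s holds the per-patch soft assignments; the nearest-neighbour upsampling back to pixel resolution is our choice for illustration.

```python
import torch
import torch.nn.functional as F

def to_segmentation(s: torch.Tensor, h_patches: int, w_patches: int,
                    image_hw: tuple) -> torch.Tensor:
    """s: (num_patches, k) soft assignments from step 8. Returns (H, W) labels."""
    labels = s.argmax(dim=-1)                      # most likely cluster per patch
    grid = labels.reshape(h_patches, w_patches)    # unflatten to the patch grid
    up = F.interpolate(grid[None, None].float(),   # back to pixel resolution
                       size=image_hw, mode="nearest")
    return up[0, 0].long()
```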
2.1. Two-stage segmentation

The clustering functionals in this paper are biased towards larger clusters (e.g., background vs. foreground), resulting in a tendency to underperform on finer details by merging them together or, in some cases, failing and introducing a significant amount of noise to the segmentation process. To address this issue, we propose a two-stage segmentation approach, in which clustering is applied to the background and foreground segments separately to avoid the aforementioned biases and limitations. An example can be seen in Fig. 12, and a sketch follows below.
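A sketch of this two-stage procedure, where `cluster_patches` and `identify_fg` are hypothetical helpers wrapping the GNN clustering of Sec. 3 and the border heuristic of the localization step:

```python
import numpy as np

def two_stage_segmentation(features, cluster_patches, identify_fg):
    """features: (n_patches, c) deep features. cluster_patches(features, k)
    returns per-patch labels; identify_fg(labels) returns a boolean mask."""
    # Stage 1: foreground/background split (k = 2).
    fg = identify_fg(cluster_patches(features, k=2))
    # Stage 2: re-cluster the foreground alone, so small parts are not
    # swallowed by the dominant background cluster (labels 1..4; bg stays 0).
    out = np.zeros(len(features), dtype=int)
    out[fg] = cluster_patches(features[fg], k=4) + 1
    return out
```

The same second stage can be repeated on the background patches when a finer background decomposition is desired.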
2.2. Training

To optimize for the object localization and object segmentation tasks, we perform individual optimization for each image for a duration of 10 epochs, with model weights being reset between images. For semantic part segmentation, we carry out separate optimization for each image over a span of 100 epochs, without resetting the model weights between them.
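A sketch of this per-image, test-time optimization loop; the optimizer (Adam), the learning rate, and the `graph` container bundling node features, connectivity, and the affinity matrix W are our assumptions:

```python
import torch

def fit_single_image(model, graph, loss_fn, epochs=10, lr=1e-3):
    """Test-time optimization on one image graph. For localization and object
    segmentation a fresh model is used per image (weights reset); for semantic
    part segmentation, reuse the same model across images for ~100 epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        S = model(graph.x, graph.edge_index, graph.edge_weight)
        loss = loss_fn(S, graph.W)    # L_NCut or L_CC (Sec. 2.1)
        loss.backward()
        opt.step()
    return model
```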
2.3. Performance
All of the results presented in the paper demonstrate
DeepCut without the utilization of any post-processing (e.g.
bilateral solver). All experiments were conducted using the
same hardware: Tesla V100 GPU and an Intel Xeon 32 core
CPU.
Model                           DUTS [mIoU]   ECSSD [mIoU]   Throughput [img/sec]
TokenCut + Bilateral Solver     62.4          77.2           0.5
TokenCut w/o Bilateral Solver   57.6          71.2           1
Ours                            59.5          74.6           5
Figure 10: Method example: Object localization using DeepCut(NCut).
Figure 11: Method example: Random foreground-background segmentation samples using DeepCut(NCut) on VOC07.
Figure 12: Method example: Two-stage segmentation using DeepCut(NCut).
Figure 13: Method example: Random foreground-background segmentation samples using DeepCut(NCut/CC) on CUB-
200. DeepCut segments the birds accurately without including other objects such as branches and leaves (which is a common
failure point of previous methods).