DeepCut: Unsupervised Segmentation using Graph Neural Networks Clustering

Amit Aflalo, Shai Bagon, Tamar Kashti, and Yonina Eldar


Faculty of Mathematics and Computer Science, Weizmann Institute of Science
arXiv:2212.05853v3 [cs.CV] 21 Aug 2023

Figure 1: DeepCut: We use graph neural networks with unsupervised losses from classical graph theory to solve various
image segmentation tasks. Specifically, we employ deep features obtained from a pre-trained vision transformer to construct
a graph representation for each image, and subsequently partition this graph to generate a segmentation.

Abstract

Image segmentation is a fundamental task in computer vision. Data annotation for training supervised methods can be labor-intensive, motivating unsupervised methods. Current approaches often rely on extracting deep features from pre-trained networks to construct a graph, and classical clustering methods like k-means and normalized-cuts are then applied as a post-processing step. However, this approach reduces the high-dimensional information encoded in the features to pair-wise scalar affinities. To address this limitation, this study introduces a lightweight Graph Neural Network (GNN) to replace classical clustering methods while optimizing for the same clustering objective function. Unlike existing methods, our GNN takes both the pair-wise affinities between local image features and the raw features as input. This direct connection between the raw features and the clustering objective enables us to implicitly perform classification of the clusters between different graphs, resulting in part semantic segmentation without the need for additional post-processing steps. We demonstrate how classical clustering objectives can be formulated as self-supervised loss functions for training an image segmentation GNN. Furthermore, we employ the Correlation-Clustering (CC) objective to perform clustering without defining the number of clusters, allowing for k-less clustering. We apply the proposed method for object localization, segmentation, and semantic part segmentation tasks, surpassing state-of-the-art performance on multiple benchmarks¹.

¹ Project page: https://sampl-weizmann.github.io/DeepCut/

1. Introduction

Object localization and segmentation play crucial roles in various real-world applications, such as autonomous cars, robotics, and medical diagnosis. These tasks have been longstanding challenges in computer vision, and significant efforts have been invested in improving their accuracy. Presently, state-of-the-art performance in these tasks is achieved using supervised Deep Neural Networks (DNNs). However, the limited availability of annotated data restricts the applicability of these methods. Data annotation is labor-intensive and costly, particularly in specialized fields like medical imaging, where domain experts are required for accurate annotations. Several solutions have been proposed to address this challenge, including leveraging color data, adding priors such as boundaries or scribbles [21], semi-supervised learning [3, 20], weakly-supervised learning [24], and more. However, these approaches have limitations since they still rely on some form of annotations or prior knowledge about the image structure.

An alternative approach to tackle this problem is to explore unsupervised methods. Recent research on unsupervised Deep Neural Networks (DNNs) has yielded promising outcomes. For instance, self-DIstillation with No labels (DINO [9]) was employed to train Vision Transformers (ViTs), and the generated attention maps corresponded to semantic segments in the input image. The deep features extracted from these trained transformers demonstrated significant semantic meaning, facilitating their utilization in various visual tasks, such as object localization, segmentation, and semantic segmentation [2, 23, 34].

Some recent work in this area has demonstrated promising results, especially using unsupervised techniques that combine deep features with classical graph theory for object localization and segmentation tasks [23, 34]. These methods are lightweight and rely on pre-trained unsupervised networks, unlike end-to-end approaches that require significant time and resources for training from scratch.

Our proposed method, called DeepCut, introduces an innovative approach using Graph Neural Networks (GNNs) combined with classical graph clustering objectives as a loss function. GNNs are specialized neural networks designed to process graph-structured data, and they have achieved remarkable results in various domains, including protein folding with AlphaFold [19], drug discovery [37] and traffic prediction [18]. This work employs GNNs for computer vision tasks such as object localization, segmentation, and semantic part segmentation.

A key advantage of our method is the direct utilization of deep features within the clustering process, in contrast to previous approaches [23, 34] that discarded this valuable information and relied solely on correlations between features. This approach leads to improved object segmentation performance and enables semantic segmentation across multiple graphs, each corresponding to a separate image. To further enhance segmentation, we adopt a two-stage approach that overcomes shortcomings of the objective functions overlooked by previous methods. We first separate foreground and background and then perform segmentation individually for each part, leading to more accurate results. Moreover, we introduce a method to perform "k-less clustering," allowing data clustering without the need to predefine the number of clusters k, enhancing flexibility and adaptability.

Our contributions are:

• Utilizing a lightweight GNN with classical clustering objectives as unsupervised loss functions for image segmentation, surpassing state-of-the-art performance at various unsupervised segmentation tasks (speed and accuracy).

• Optimizing the correlation clustering objective for deep features clustering, which is not feasible with classical methods, thus achieving k-less clustering.

• Performing semantic part segmentation on multiple images using test-time optimization. By applying the method to each image separately, we eliminate the need for any post-processing steps required by previous methods.

2. Background

In the era prior to the deep-learning surge, numerous classical approaches adopted quantitative criteria for segmentation based on principles from graph theory. These methods involved representing affinities between image regions as a graph and associating various image partitions with cuts in that graph (e.g., [26, 5, 4]). The quality of image segments was determined by the "optimality" of these cuts. Remarkably, these techniques operated without any supervision, relying solely on the provided affinities between image regions.

Different methods have defined optimal cuts in various ways, each presenting advantages. We provide a brief overview of the relevant approaches.

2.1. Graph Clustering

We leverage two graph clustering functionals from classical graph theory in our work: normalized cut [26] and correlation clustering [5].

Notations. Let $G = (V, E)$ be an undirected graph induced by an image. Each node represents an image region, and the weights $w_{ij}$ represent the affinity between image regions $i$ and $j$, $i, j = 1 \ldots n$. Let $W$ be an $n \times n$ matrix whose entries are $w_{ij}$.

Our goal is to partition this graph into $k$ disjoint sets $A_1, A_2, \ldots, A_k$ such that $\cup_i A_i = V$ and $\forall j \neq i:\ A_i \cap A_j = \emptyset$. This partition can be expressed as a binary matrix $S \in \{0, 1\}^{n \times k}$ where $S_{ic} = 1$ iff $i \in A_c$.

Normalized Cut (N-cut) [26]. A good partition is defined as one that maximizes the number of within-group connections, and minimizes the number of between-group connections. The number of between-group connections can be computed as the total weight of edges removed and described in graph theory as a cut:

$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v). \tag{1}$$
Here, A and B are the two parts of a two-way partition of G. This objective is formulated by the normalized cut (N-cut) functional:

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}, \tag{2}$$

where $\mathrm{assoc}(A, V) = \sum_{i \in A,\, j \in V} w_{ij}$ is the total affinity connecting nodes of A to all nodes in the graph. The N-cut formulation can be easily extended to K > 2 segments. Shi and Malik [26] also suggested an approximate solution to Eq. (2) using spectral methods, known as spectral clustering, where the spectrum (eigenvalues) of a graph Laplacian matrix is used to obtain an approximate solution to the N-cut problem.

Correlation Clustering (CC) [5]. When the graph, derived from an image, contains both negative and positive affinities, N-cut is no longer applicable. In this case, a good image segment may be one that maximizes the positive affinities inside the segment and the negative ones across segments [5]. This objective can be formulated by the correlation clustering (CC) functional. Given a matrix $W$, an optimal partition $S$ minimizes:

$$CC(S) = -\sum_{ij} W_{ij} \sum_{c} S_{ic} S_{jc}. \tag{3}$$

Correlation clustering utilizes intra-cluster disagreement (repulsion) to automatically deduce the number of clusters k [5] so that it does not need to know k in advance. Further theoretical analysis on this property can be found in [4], along with classical optimization algorithms applying CC to image segmentation.

2.2. Graph Neural Networks

GNN (Graph Neural Network) is a category of neural networks designed to process graph-structured data directly. A GNN layer comprises two fundamental operations: message passing and aggregation. Message passing involves gathering information from the neighbors of each node and is applied to all nodes in the graph. The exchanged information can include a combination of node features, edge features, or any other data embedded in the graph. Aggregation refers to fusing all the messages obtained during the message passing phase into one message that updates the current node's state.

In our approach, we utilize a specific type of GNN called the Graph Convolutional Network (GCN) [35]. In a GCN layer $\ell$, each node $h_v$ is processed as follows:

$$h_v^{(\ell+1)} = \Theta \sum_{u \in N(v)} \frac{h_u^{(\ell)}}{|N(v)|}, \tag{4}$$

where $\Theta$ is a matrix of trainable parameters, $N(v)$ denotes the neighbours of a node $h_v$, and $|N(v)|$ is the number of neighbours. The GCN aggregation is performed through a summation operator, with the term inside the summation representing the message-passing operation. The objective during training is to optimize the network parameters $\Theta$ in a way that generates meaningful messages to be passed among nodes, optimizing the loss function.

3. Method

In our approach, we process each image through the pre-trained network to extract deep features, which are then used for clustering. Subsequently, we construct a weighted graph based on these features. Employing a lightweight Graph Neural Network (GNN), we optimize our unsupervised graph partitioning loss functions (either $\mathcal{L}_{NCut}$ or $\mathcal{L}_{CC}$) separately for each image graph.

We adopt this methodology to achieve unsupervised object localization, segmentation, and semantic part segmentation.

3.1. From Deep Features to Graphs

As shown in Figure 2, given an image M of size $m \times n$ and $d$ channels, we pass it through a transformer T. The transformer divides the image into $\frac{mn}{p^2}$ patches, where $p$ is the patch size of transformer T. To extract the transformer's internal representation for each patch, we utilize the key token from the last layer, as it has demonstrated superior performance across various tasks [2, 23]. The output is a feature vector $f_{(\frac{mn}{p^2} \times c)}$, where $c$ represents the token embedding dimension, containing all the extracted features from the different patches.

Consider a weighted graph $G = (V, E)$ with $W$ as the weight matrix. We construct a patch-wise correlation matrix from the features obtained by the Vision Transformer:

$$W = f f^T \in \mathbb{R}^{\frac{mn}{p^2} \times \frac{mn}{p^2}}. \tag{5}$$

In correlation clustering, the negative weights (representing repulsions) carry valuable information utilized by the clustering objective. However, the normalized cut objective only accepts positive weights. As a result, we threshold at zero:

$$W = f f^T \cdot (f f^T > 0) \in \mathbb{R}^{\frac{mn}{p^2} \times \frac{mn}{p^2}}. \tag{6}$$

We introduce a hyper-parameter called k sensitivity, denoted by $\alpha$, to adapt the cluster choosing process in correlation clustering. Since the number of clusters cannot be directly chosen in correlation clustering, this parameter allows us to control the sensitivity of the process, where a lower value of $\alpha$ corresponds to more clusters. We enforce this adjustment in the following manner:
Figure 2: Method overview: After extracting deep features from a pretrained ViT model, we construct a similarity matrix
based on the patch-wise feature similarities, which becomes our adjacency matrix. We build a graph using this adjacency
matrix and the deep features as node features. Next, we train a lightweight GNN using unsupervised graph partitioning loss
functions (Sec. 2.1) to partition the graph into k distinct clusters, which can be used for various downstream tasks.

$$W = f f^T - \frac{\max(f f^T)}{\alpha}, \tag{7}$$

where $\alpha \in [1, \infty)$. A lower $\alpha$ value corresponds to higher repulsion forces between nodes (negative weights in $W$) and thus a higher cluster count, as correlation clustering maximizes negative affinities between segments in the graph.

3.2. Graph Neural Network Clustering

Let $\hat{N}$ be a node feature matrix obtained by applying one or more layers of GNN convolution on a graph G with an adjacency matrix $W$. In our case, we use a one-layer GCN and construct the graph using the patch-wise correlation matrix from ViT obtained features (Eq. (5), Eq. (6), Eq. (7)). Let $S$ be the output of a Multi-Layer Perceptron (MLP) with a softmax function applied on $\hat{N}$:

$$\hat{N} = \mathrm{GNN}(N, W; \Theta_{GNN}), \qquad S = \mathrm{MLP}(\hat{N}; \Theta_{MLP}), \tag{8}$$

where $\Theta_{MLP}$ and $\Theta_{GNN}$ are trainable parameters. The GNN output $S$ is the cluster assignment matrix containing vectors representing the node's probability of belonging to a particular cluster.

The GNN is optimized using either the normalized-cut relaxation proposed in [8] or our newly proposed method with the correlation clustering objective as the loss function. The loss function in the case of normalized cut is:

$$\mathcal{L}_{NCut} = -\frac{Tr(S^T W S)}{Tr(S^T D S)} + \left\| \frac{S^T S}{\| S^T S \|_F} - \frac{I_K}{\sqrt{K}} \right\|_F, \tag{9}$$

where $D = \mathrm{diag}(\sum_j W_{i,j})$ is the row-wise sum diagonal matrix of $W$. $K$ denotes the number of disjoint sets we aim to partition the graph into, and $I_K$ is the identity matrix. The first term of the objective function promotes the clustering of strongly connected components together, while the second term encourages the cluster assignments to be orthogonal and have similar sizes.

The loss function for correlation clustering [5] is:

$$\mathcal{L}_{CC} = -Tr(W S S^T). \tag{10}$$

This term promotes intra-cluster agreement while encouraging repulsion (negative affinities) between clusters. $W$ is defined by Eq. (6) and Eq. (7) for the N-cut and CC losses, respectively. To control the sensitivity of the clustering process for CC, we utilize a k-sensitivity value, as explained in Sec. 7.

3.3. Graph Neural Network Segmentation

In Sec. 3.1, we propose constructing a graph from deep features extracted from unsupervised trained ViTs, where each node represents a patch of the original image. The nodes in the graph are then clustered into disjoint sets, representing different image segments. For this clustering process, we employ graph neural network clustering as described in Sec. 3.2, utilizing either the CC or N-cut losses. However, unlike previous approaches that used the N-cut loss defined strictly for positive weights [8, 23], we advocate using the correlation clustering functional as a loss. Correlation clustering enables us to utilize negative weights for graph building (see Eq. (5) and Eq. (7)), facilitating clustering without predefining the number of clusters.

Additionally, we incorporate deep features as node features for graph building, which is absent in previous methods that solely used the correlation matrix of the deep features [8, 23, 34]. As a result, our method allows for feature classification while clustering and implicitly facilitates semantic part segmentation without necessitating post-processing steps, as seen in previous methods [23].
Figure 3: Proposed two-stage clustering: First, we cluster the image into two disjoint sets, then apply clustering to the foreground and background separately.

| Method | VOC-07 | VOC-12 | COCO-20k |
|---|---|---|---|
| Selective Search [28] | 18.8 | 20.9 | 6.0 |
| EdgeBoxes [38] | 31.1 | 31.6 | 28.8 |
| DINO-[CLS] [9] | 45.8 | 46.2 | 42.1 |
| LOST [27] | 61.9 | 64.0 | 50.7 |
| Spectral Methods [23] | 62.7 | 66.4 | 52.2 |
| TokenCut [34] | 68.8 | 72.1 | 58.8 |
| DeepCut: CC loss | 68.8 | 67.9 | 57.6 |
| DeepCut: N-cut loss | 69.8 | 72.2 | 61.6 |

Table 1: Object localization results. CorLoc metric (percentage of images with IOU > 0.5). CC and N-cut denote correlation clustering and Normalized Cut, respectively.
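The CorLoc metric reported in Table 1 can be sketched in a few lines of plain Python. This is an illustrative implementation under simplifying assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples and each image has exactly one ground-truth box (benchmarks with several annotated objects per image typically count a hit against any of them). The function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of images whose predicted box has IoU > threshold with the ground truth."""
    hits = sum(box_iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(pred_boxes)


# Hypothetical predictions for two images: one exact match, one poor overlap.
predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
ground_truth = [(0, 0, 2, 2), (1, 0, 3, 2)]
print(corloc(predictions, ground_truth))
```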

Our approach is versatile and applicable to various image partitioning-related tasks, including:

Object Localization. Object localization involves identifying the primary object in an image and enclosing it with a bounding box. To perform localization, we follow the steps below: (1) Use our GCN clustering method with k = 2. (2) Examine the edges of the clustered image and identify the cluster that appears on more than two edges as the background, while the other cluster becomes our main object. (3) Apply a bounding box around the identified main object.

Object Segmentation. Object segmentation involves the separation of the foreground object in an image, commonly known as foreground-background segmentation. The method employed for object segmentation is identical to object localization, with the exclusion of stage (3), where the bounding box is applied.

Semantic Part Segmentation. DeepCut achieves semantic part segmentation through a test-time optimization paradigm, where the model is sequentially exposed to each image, optimizing the model weights based on the previous image. This means the model does not require training on all images beforehand, eliminating the need for co-segmentation or post-processing steps.

This advantage is derived from our approach to learning deep features with GNN, which differs from previous methods [8, 34] that solely rely on correlations and discard the high-dimensional data of deep features. By leveraging the intricate semantic information embedded within deep features, our model implicitly performs the classification of the clusters between different image graphs, enabling semantic part segmentation without the need for explicit post-processing or additional training steps.

Our segmentation process involves two steps, as shown in Figure 3: foreground-background segmentation (k=2) followed by semantic part segmentation on the foreground object (k=4). This two-stage process addresses the bias of the clustering functions towards larger clusters (e.g., background-foreground), which limits the level of detail in foreground object segmentation. The exact process can also be applied to improve background segmentation, as depicted in Figure 1 and Figure 5.

4. Training and performance

For all experiments, we use a DINO [9] trained ViT-S/8 transformer for feature extraction, with pre-trained weights from the DINO paper authors (trained on ImageNet [25]). No training is conducted on the tested datasets. We employ a test-time training paradigm for all experiments, where the proposed graph neural network (GNN) segmentation model is trained on each image for ten epochs. Since the proposed losses are unsupervised and involve solving an optimization problem, generalizing the model to the entire dataset does not improve accuracy. This training method runs more than twice as fast as the current state-of-the-art TokenCut [34] on the same hardware. DeepCut's efficiency stems from its lightweight architecture, consisting of only 30k trainable parameters. Implementation details and performance analyses are provided in the supplementary material.

5. Results

Our method is evaluated on three unsupervised tasks: single object localization, single object segmentation, and semantic part segmentation. We compare our approach with other unsupervised methods published on these tasks using widely used benchmarks. The results are presented for both the normalized-cut and correlation clustering GNN objectives.
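Steps (2) and (3) of the object localization procedure in Sec. 3.3 can be sketched as follows, assuming the k = 2 clustering has already produced a 2-D array of cluster ids (0 or 1) per patch. The function name, the `patch` argument used to convert patch indices to pixels, and the tie-breaking rule (the cluster touching more image edges is taken as background) are assumptions made for illustration, not details given in the text.

```python
import numpy as np


def localize(labels, patch=8):
    """Pick the background cluster from the image borders and box the other cluster.

    labels: 2-D integer array of cluster ids (0 or 1), one entry per patch.
    patch:  patch size in pixels, used to express the box in pixel coordinates.
    """
    borders = [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]
    # Step (2): count on how many of the four image edges each cluster appears.
    edges_touched = [sum(bool((b == c).any()) for b in borders) for c in (0, 1)]
    # Assumed tie-break: the cluster present on more edges is the background.
    background = int(np.argmax(edges_touched))
    foreground = labels != background

    rows, cols = np.nonzero(foreground)
    if rows.size == 0:
        return foreground, None         # no foreground patches found
    # Step (3): tight bounding box (x_min, y_min, x_max, y_max) in pixels.
    box = (int(cols.min()) * patch, int(rows.min()) * patch,
           (int(cols.max()) + 1) * patch, (int(rows.max()) + 1) * patch)
    return foreground, box


# Hypothetical 4 x 4 patch grid with a 2 x 2 object in the centre.
labels = np.zeros((4, 4), dtype=int)
labels[1:3, 1:3] = 1
mask, box = localize(labels, patch=8)
print(box)
```

The returned `mask` corresponds to the object segmentation output, and `box` to the localization output.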
| Method | CUB | DUTS | ECSSD |
|---|---|---|---|
| OneGAN [7] | 55.5 | - | - |
| Voynov et al. [31] | 68.3 | 49.8 | - |
| Spectral Methods [23] | 76.9 | 51.4 | 73.3 |
| TokenCut [34] | - | 57.6 | 71.2 |
| DeepCut: CC loss | 77.7 | 56.0 | 73.4 |
| DeepCut: N-cut loss | 78.2 | 59.5 | 74.6 |

Table 2: Single object segmentation results. mIOU (mean intersection-over-union). CC denotes correlation clustering and N-cut Normalized Cut.

| Method | NMI | ARI |
|---|---|---|
| SCOPS [17] (model) | 24.4 | 7.1 |
| Huang and Li [16] | 26.1 | 13.2 |
| Choudhury et al. [11] | 43.5 | 19.6 |
| DFF [12] | 25.9 | 12.4 |
| Amir et al. [2] | 38.9 | 16.1 |
| DeepCut: N-cut loss | 43.9 | 20.2 |

Table 3: Semantic part segmentation results. ARI and NMI over the entire CUB-200 dataset are used to evaluate cluster quality. Predictions were performed with k = 4 with our method using the N-cut objective. First three methods use ground truth foreground masks as supervision.

5.1. Object Localization

In Table 1, we evaluate our unsupervised object localization performance on three datasets: PASCAL VOC 2007 [13], PASCAL VOC 2012 [14], and COCO20K (20k images chosen from the MS-COCO dataset [22], introduced in previous work [29]). We compare our unsupervised approach to state-of-the-art unsupervised methods. We report our results in the Correct Localization metric (CorLoc), defined as the percentage of images whose intersection-over-union with the ground truth label is greater than 50%. DeepCut with N-cut loss gives the best results on all datasets, and DeepCut with correlation clustering loss surpasses all previous methods except TokenCut [34].

5.2. Single Object Segmentation

In Table 2, we evaluate our unsupervised single object segmentation performance on three datasets: CUB (a widely-used dataset of birds for the fine-grained visual categorization task) [32], DUTS (the largest saliency detection benchmark) [33] and ECSSD (Extended Complex Scene Saliency Dataset) [36]. We report our results in mean intersection-over-union (mIoU). DeepCut with N-cut loss achieves the best results on all datasets. A visual comparison of segmentation with our two losses is presented in Figure 4; note how the CC loss segments the foreground better than the N-cut loss for specific images whose background includes different objects and shadows. As this is not the case in most images, the N-cut loss usually outperforms the CC loss.

Figure 4: DeepCut segmentation: N-cut vs. CC losses. Top: original image, middle: DeepCut with N-cut loss, bottom: DeepCut with correlation clustering loss. Note that for images with complex background DeepCut with CC loss outperforms DeepCut with N-cut loss.

Figure 5: Semantic part segmentation: N-cut vs. CC losses. Top: original image, middle: normalized-cut loss with our proposed two-step clustering; bottom: correlation clustering with one-step k-less clustering.

5.3. Semantic Part Segmentation

In Table 3, we evaluate our approach on the CUB dataset [32] and report results using the Normalized Mutual Info score (NMI) and Adjusted Rand Index score (ARI) on the entire test set. A comparison between our technique and the deep spectral method that uses classical graph theory for deep features-based segmentation [8] is presented in Figure 6. As seen in Table 3, DeepCut with normalized-cut loss surpasses all other methods, including the top three methods [17, 16, 11] that use ground-truth foreground
masks as supervision.

| Pretraining | DeepCut N-cut | DeepCut CC | Negative percentage |
|---|---|---|---|
| DINO [9] | 59.4 | 59.8 | 33.6 |
| MoCo-v3 [10] | 62.3 | 43.1 | 20 |
| MAE [15] | 47.2 | 31.2 | 5.4 |

Table 4: Pretraining. Object-localization performance on PASCAL VOC07 [13] with different unsupervised training methods for acquiring deep features. All methods use the ViT-B-16 architecture, and were trained on ImageNet [25]. We provide mIOU (mean Intersection-Over-Union) for the correlation clustering (CC) and normalized-cut (N-cut) objectives. We also provide the mean percentage (across the whole dataset) of negative weights in the corresponding affinity matrix. Note the correlation: the higher the percentage of negative weights of the transformer, the better the mIOU of DeepCut with CC loss; for DINO, DeepCut with CC loss outperforms N-cut loss on this dataset.

Figure 6: Semantic part segmentation. Deep spectral method vs. DeepCut with N-cut loss. Top: original image; middle: deep spectral method [23]; bottom: our segmentation with N-cut loss. The deep spectral method failed to perform semantic segmentation across all three images.

6. Deep Features Selection

Our work and previous unsupervised segmentation methods heavily rely on the correlation between deep features from different patches. However, in this study, we proposed a novel approach by utilizing the negative correlations (correlation clustering loss) that were discarded in previous works. By doing so, we leverage all available information to enhance performance in unsupervised segmentation. It is crucial to recognize that the performance of unsupervised segmentation methods that rely on deep features depends on the quality of these features.

Furthermore, no universal solution fits all scenarios, as different methods require specific types of information to achieve optimal results. For instance, our paper presents two techniques, one based on correlation clustering and the other on normalized cut. Correlation clustering depends on both positive and negative correlations between features, whereas normalized cut only accepts positive correlations.

In Table 4, we compare three unsupervised ViT training approaches: DINO [9], MoCo [10], and MAE [15]. We observe that correlation clustering performs best with a method that supplies deep features with the largest amount of negative correlation between features (DINO 33.6%). We also observe that even a small amount of negative correlation can be counterproductive, even with normalized cut (MAE 5.4%) that relies solely on positive correlation.

Considering these factors, it becomes essential to identify the best combinations of unsupervised training methods and segmentation techniques. This insight enables us to improve unsupervised training methods, allowing us to extract more valuable information for segmentation purposes.

7. k-less Clustering

In this section, we explore clustering, using image classification as an example. Traditional clustering methods usually require a predefined number of clusters (or classes) k for classifying data, which can be a disadvantage as it demands prior knowledge about the data. Our DeepCut method introduces a GNN model with correlation clustering loss, enabling k-less clustering, where the number of clusters is derived from the data. To demonstrate this valuable property, we use the Fashion Product Images Dataset [1], sampling a subset of 500 images from 5 classes: Top-wear, Shoes, Bags, Eye-wear, and Belts.

We conducted three classification experiments, each repeated ten times: one with three classes, one with four, and one with all classes. We employ our GNN approach with correlation clustering loss in each experiment to cluster the images into different groups. We set the parameter α (k sensitivity defined in Eq. (7)) to 3, using class tokens extracted from the unsupervised trained DINO [9] ViT base transformer with a patch size of 16 as our features. We report the clustering result as classification purity in Tab. 5. To validate the algorithm's robustness, we apply class permutation and random subset sampling of each class.

The versatility of correlation clustering's k-less nature extends to segmentation tasks, as illustrated in Figure 5. We compare our proposed DeepCut method with two baselines: connected components and spectral clustering. For spectral clustering, we use eigen-gap [30] to select the number of clusters. The connected components method proves ineffective, clustering all images into the same group for all experiments, resulting in zero standard deviation (STD). On the other hand, spectral clustering performs better but exhibits high variance. In contrast, our DeepCut with correlation clustering loss achieves the highest accuracy and lowest variance. This powerful k-less property of correlation clustering is demonstrated through an image clustering example in terms of quantitative results. Furthermore, it can be effectively applied to image segmentation, as depicted in Fig. 7.

| Method | 3 classes | 4 classes | 5 classes |
|---|---|---|---|
| Connected Components | 33.3 ±0 | 25.0 ±0 | 20.0 ±0 |
| Spectral Clustering | 85.5 ±14.1 | 77.9 ±6.6 | 82.8 ±6.9 |
| DeepCut: CC loss | 98.3 ±0.9 | 97.1 ±1.7 | 99.27 ±0.6 |

Table 5: k-less clustering. We apply DeepCut with correlation clustering loss on the Fashion Product Images Dataset [1], using a subset of images from 5 classes: Top-wear, Shoes, Bags, Eye-wear, and Belts. We employ our k-less method and report the results as classification purity. The experiments are conducted with k sensitivity = 3 as defined in Eq. (7). The presented results are the mean and standard deviation (std) obtained from multiple experiments.

Figure 7: k-sensitivity. The CC segmentation results demonstrate that similar objects, such as people and bikes, cluster together consistently across different α values. Smaller α values lead to a finer partition of objects, but even then, CC still assigns similar objects to the same clusters.

Choosing α: The CC functional determines the number of clusters by considering repulsion forces between clusters, as explained in Sec. 3. These forces are calculated based on the amount of negative affinities in the adjacency matrix defined in Eq. (7). The affinities can vary when using differently trained ViT models (as observed in Tab. 4) or different datasets. To address this issue, we propose introducing a k-sensitivity variable α (as defined in Eq. (7)) to artificially control the amount of repulsion forces and thereby regulate the number of clusters. The choice of α should be tailored to the specific task at hand. As shown in Fig. 7, different α values yield different solutions for various tasks.

8. Conclusion

The study presents DeepCut, an innovative unsupervised segmentation technique that combines classical graph theory, Graph Neural Networks (GNNs), and self-supervised pre-trained networks. DeepCut's effectiveness is demonstrated by achieving superior performance in object localization, object segmentation, and semantic part segmentation tasks, surpassing existing state-of-the-art methods in accuracy and speed.

The proposed graph structure incorporates node features, allowing more information to be utilized during the clustering process. Consequently, implicit semantic part segmentation is achievable, eliminating the need for post-processing methods. The versatility of this methodology extends to various downstream tasks, such as video segmentation and image matting. Additionally, the study emphasizes the importance of selecting an appropriate combination of unsupervised training method and clustering method, as discussed in Sec. 6.

We demonstrate that our method is capable of optimizing various loss functions derived from classical graph theory, including those that are challenging to optimize using conventional tools (e.g. correlation clustering).

Acknowledgements: The authors would like to thank Shir Amir for her useful comments. This research was supported by the ISRAEL SCIENCE FOUNDATION (grant No. 3805/21), within the Israel Precision Medicine Partnership program, and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101000967). This project also received funding from the Carolito Stiftung. Amit Aflalo was supported by the Young Weizmann Scholars Diversity and Excellence program. Dr. Bagon is a Robin Chemers Neustein AI Fellow.

References

[1] Param Aggarwal. Fashion Product image dataset. https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-dataset. Accessed: 2020-27-10.
[2] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep ViT features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2021.
[3] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8443–8452, 2021. 2
[4] Shai Bagon and Meirav Galun. Large scale correlation clustering optimization. arXiv preprint arXiv:1112.2903, 2011. 2, 3
[5] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine learning, 56(1):89–113, 2004. 2, 3, 4
[6] Jonathan T Barron and Ben Poole. The fast bilateral solver. In European conference on computer vision, pages 617–632. Springer, 2016. 11
[7] Yaniv Benny and Lior Wolf. Onegan: Simultaneous unsupervised learning of conditional image generation, foreground segmentation, and fine-grained clustering. In European Conference on Computer Vision, pages 514–530. Springer, 2020. 6
[8] Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. Spectral clustering with graph neural networks for graph pooling. In International Conference on Machine Learning, pages 874–883. PMLR, 2020. 4, 5, 6
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021. 2, 5, 7, 11
[10] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021. 7
[11] Subhabrata Choudhury, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Unsupervised part discovery from contrastive reconstruction. Advances in Neural Information Processing Systems, 34:28104–28118, 2021. 6
[12] Edo Collins, Radhakrishna Achanta, and Sabine Susstrunk. Deep feature factorization for concept discovery. In Proceedings of the European Conference on Computer Vision (ECCV), pages 336–352, 2018. 6
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html. 6, 7
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. 6
[15] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022. 7
[16] Zixuan Huang and Yin Li. Interpretable and accurate fine-grained recognition via region grouping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8662–8672, 2020. 6
[17] Wei-Chih Hung, Varun Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, and Jan Kautz. Scops: Self-supervised co-part segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 869–878, 2019. 6
[18] Weiwei Jiang and Jiayun Luo. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications, page 117921, 2022. 2
[19] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021. 2
[20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023. 2
[21] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3159–3167, 2016. 2
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 6
[23] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8364–8375, 2022. 2, 3, 4, 5, 6, 7
[24] Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G Schwing, and Jan Kautz. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10598–10607, 2020. 2
[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. 5, 7, 11
[26] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000. 2, 3
[27] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. In BMVC-British Machine Vision Conference, 2021. 5
[28] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013. 5
[29] Huy V Vo, Patrick Pérez, and Jean Ponce. Toward unsu-
pervised, multi-object discovery in large-scale image collec-
tions. In European Conference on Computer Vision, pages
779–795. Springer, 2020. 6
[30] Ulrike Von Luxburg. A tutorial on spectral clustering. Statis-
tics and computing, 17(4):395–416, 2007. 7
[31] Andrey Voynov, Stanislav Morozov, and Artem Babenko.
Object segmentation without labels with large-scale genera-
tive models. In International Conference on Machine Learn-
ing, pages 10596–10606. PMLR, 2021. 6
[32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie.
Cub-dataset. Technical Report CNS-TR-2011-001, Califor-
nia Institute of Technology, 2011. 6
[33] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng,
Dong Wang, Baocai Yin, and Xiang Ruan. Learning to de-
tect salient objects with image-level supervision. In CVPR,
2017. 6
[34] Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L
Crowley, and Dominique Vaufreydaz. Self-supervised trans-
formers for unsupervised object discovery using normalized
cut. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 14543–14553,
2022. 2, 4, 5, 6
[35] Max Welling and Thomas N Kipf. Semi-supervised clas-
sification with graph convolutional networks. In J. In-
ternational Conference on Learning Representations (ICLR
2017), 2016. 3, 11
[36] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. Hierarchical
saliency detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1155–1162,
2013. 6
[37] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and
Jure Leskovec. Graph convolutional policy network for goal-
directed molecular graph generation. Advances in neural in-
formation processing systems, 31, 2018. 2
[38] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locat-
ing object proposals from edges. In European conference on
computer vision, pages 391–405. Springer, 2014. 5
Supplementary Material

Figure 8: Segmentation refinement. Single object segmentation with our method and bilateral solver [6] as a complementary step to refine the obtained segmentation.

Figure 9: Segmentation refinement. Resolution manipulation using smaller size stride. ViT image input resolution is 280 × 280 for both images.

1. Resolution Manipulation
In this work, we utilize deep features extracted from ViTs. Those features represent the image patches corresponding to the ViT patch size p × p. We perform our segmentation patch-wise; thus, for image size (m × n) our segmentation resolution is (m/p × n/p). For higher resolution segmentation, we change the ViT stride value to p/2 instead of p, as it doubles the number of patches the ViT uses along each axis, doubling the resolution of our segmentation to (2m/p × 2n/p). We found this method to yield better results than changing input image resolutions, as the transformer in use in this paper wasn't trained on high resolution images. An example is shown in Fig. 9.
In order to improve the segmentation resolution further, a bilateral solver [6] can be added as a complementary step to refine the boundaries of the obtained segmentation; see Figure 8. All results reported in the paper are without any post-processing methods.

2. Implementation details
For all experiments, we use a DINO [9] trained ViT-S/8 transformer for feature extraction; specifically, we use the keys features from the last layer of the DINO trained "student" transformer. We use pre-trained weights provided by the DINO paper authors (trained on ImageNet [25]). We do not conduct any training of the transformer on any of the tested datasets. Input images are resized to a resolution of 280 × 280 using Lanczos interpolation. All experiments were conducted using a Tesla V100 GPU.

Algorithm 1 DeepCut
1: x ← Input image
2: x ← ViT(x) ▷ Deep features from ViT
3: G ← Build graph(x)
4: for each training epoch do
5:   s ← GCN(G) ▷ Single layer of GCN
6:   s ← ELU(s)
7:   s ← MLP(s) ▷ Two layer MLP
8:   s ← softmax(s)
9:   Loss ← L_NCut or L_CC
10: end for
11: Output segmentation ← argmax(s)

ViT: Evaluation mode, frozen weights.

Build graph: Create a graph from deep features as described in Sec. 3.1.

GCN: Graph Convolutional Network [35]. Input size = deep features size, hidden size = 64.

ELU: The Exponential Linear Unit activation function.

MLP: Consists of 2 linear layers. Layer 1: from GCN hidden size to (hidden size)/2. Layer 2: from (hidden size)/2 to k, the number of desired clusters. For k-less usage with correlation clustering, the output size is set to the maximum number of desired clusters. Between the layers, there is an ELU activation function and 0.25 dropout.

Loss: We suggest two loss functions derived from classical graph theory: NCut and CC.

Output segmentation: To obtain the final segmentation, we take the vector s produced at step 8. Each entry in this vector corresponds to a patch of the image and contains a probability vector that describes the likelihood of the patch belonging to each cluster. We choose the most likely cluster assignment for each patch and then unflatten the result to get a segmentation map.
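Steps 5 to 8 and step 11 of Algorithm 1 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes non-negative affinities, the symmetric normalization of [35] for the GCN layer, no bias terms, no dropout, and random untrained weights. The names `forward`, `normalize_adjacency`, and `params`, as well as the toy graph, are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def elu(m):
    """Element-wise Exponential Linear Unit (alpha = 1)."""
    return [[v if v > 0 else math.exp(v) - 1.0 for v in row] for row in m]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def normalize_adjacency(w):
    """D^-1/2 (W + I) D^-1/2; assumes non-negative affinities in w."""
    n = len(w)
    a = [[w[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def random_matrix(rows, cols, rng):
    """Random weights in [-0.5, 0.5]; a stand-in for a real initializer."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(features, w, params):
    """GCN -> ELU -> two-layer MLP -> softmax (steps 5 to 8, dropout omitted)."""
    a_hat = normalize_adjacency(w)
    s = matmul(matmul(a_hat, features), params["gcn"])  # step 5
    s = elu(s)                                          # step 6
    s = elu(matmul(s, params["mlp1"]))                  # step 7: layer 1 + ELU
    s = matmul(s, params["mlp2"])                       # step 7: layer 2
    return softmax_rows(s)                              # step 8


# Toy graph (assumed data): 4 patches forming two disconnected pairs.
rng = random.Random(0)
n_feat, hidden, k = 3, 64, 2
features = random_matrix(4, n_feat, rng)
w = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
params = {
    "gcn": random_matrix(n_feat, hidden, rng),
    "mlp1": random_matrix(hidden, hidden // 2, rng),
    "mlp2": random_matrix(hidden // 2, k, rng),
}
s = forward(features, w, params)
labels = [row.index(max(row)) for row in s]  # step 11: argmax per patch
```

In this toy graph the two patches of each connected pair have the same normalized neighborhood, so they receive identical rows of s and therefore the same label, even before any training.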
2.1. Two-stage segmentation
The clustering functionals in this paper are biased towards larger clusters (e.g. background-foreground), resulting in a tendency to underperform on finer details by merging them together or, in some cases, to fail and introduce a significant amount of noise into the segmentation. To address this issue, we propose a two-stage segmentation approach, in which the method is applied separately to the background and foreground segments to avoid the aforementioned biases and limitations. An example can be seen in Fig. 12.
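One possible reading of the two-stage idea is sketched below: a coarse two-way split is computed first, and each side is then clustered on its own. The sketch makes several assumptions that are not in the paper: `toy_cluster` is a hypothetical stand-in for a full DeepCut run (it only bins scalar features), label 1 of the coarse split is treated as foreground, and `k_fg` and `k_bg` are illustrative parameter names.

```python
def toy_cluster(values, k):
    """Hypothetical stand-in for a DeepCut run: k equal-width bins over scalars."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid division by zero when all values match
    return [min(int((v - lo) / width), k - 1) for v in values]


def two_stage(values, cluster_fn, k_fg, k_bg):
    """Stage 1: foreground/background split. Stage 2: cluster each side separately."""
    coarse = cluster_fn(values, 2)
    bg = [i for i, c in enumerate(coarse) if c == 0]
    fg = [i for i, c in enumerate(coarse) if c == 1]  # assumption: 1 = foreground
    labels = [0] * len(values)
    if bg:
        for i, c in zip(bg, cluster_fn([values[i] for i in bg], k_bg)):
            labels[i] = c
    if fg:
        for i, c in zip(fg, cluster_fn([values[i] for i in fg], k_fg)):
            labels[i] = k_bg + c  # offset so foreground labels stay distinct
    return labels


patch_values = [0.0, 0.1, 0.9, 1.0, 5.0, 5.2, 9.0, 9.1]
two_stage_labels = two_stage(patch_values, toy_cluster, k_fg=2, k_bg=2)
print(two_stage_labels)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the second stage only sees one segment at a time, small differences inside the background (here 0.1 versus 0.9) are no longer dominated by the much larger foreground-background gap.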
2.2. Training
For the object localization and object segmentation tasks, we optimize each image individually for 10 epochs, resetting the model weights between images. For part semantic segmentation, we optimize each image separately for 100 epochs, without resetting the model weights between images.
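The two schedules differ only in the epoch count and in whether the weights are re-initialized for each image. A minimal sketch follows, in which `init_params` and `train_step` are hypothetical placeholders for the model initialization and for one epoch of loss minimization; here the "weights" are just a counter of the updates applied so far, which makes the effect of resetting visible.

```python
def optimize(images, init_params, train_step, epochs, reset_between_images):
    """Per-image optimization; returns the parameters used for each image."""
    params = init_params()
    per_image = []
    for image in images:
        if reset_between_images:
            params = init_params()  # start every image from fresh weights
        for _ in range(epochs):
            params = train_step(params, image)
        per_image.append(params)
    return per_image


def init_params():
    return 0  # toy "weights": number of updates applied so far


def train_step(params, image):
    return params + 1  # placeholder for one epoch on a single image


images = ["img_a", "img_b", "img_c"]
localization = optimize(images, init_params, train_step, 10, True)
part_segmentation = optimize(images, init_params, train_step, 100, False)
print(localization)       # [10, 10, 10]
print(part_segmentation)  # [100, 200, 300]
```

Without the reset, later images start from weights already shaped by earlier ones, which is one way cluster assignments could stay consistent across images in part segmentation.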
2.3. Performance
All of the results presented in the paper demonstrate
DeepCut without the utilization of any post-processing (e.g.
bilateral solver). All experiments were conducted using the
same hardware: Tesla V100 GPU and an Intel Xeon 32 core
CPU.
Model                          DUTS [mIoU]   ECSSD [mIoU]   Throughput [img/sec]
TokenCut + Bilateral Solver    62.4          77.2           0.5
TokenCut w/o Bilateral Solver  57.6          71.2           1
Ours                           59.5          74.6           5
Figure 10: Method example: Object localization using DeepCut(NCut).

Figure 11: Method example: Random foreground-background segmentation samples using DeepCut(NCut) on VOC07.
Figure 12: Method example: Two-stage segmentation using DeepCut(NCut).

Figure 13: Method example: Random foreground-background segmentation samples using DeepCut(NCut/CC) on CUB-
200. DeepCut segments the birds accurately without including other objects such as branches and leaves (which is a common
failure point of previous methods).
