0% found this document useful (0 votes)

23 views14 pages

1ds20ai009 Research Paper

This document proposes a system called CLIPSeg that can perform image segmentation based on arbitrary text or image prompts, without retraining. CLIPSeg builds on CLIP to generate binary segmentation masks from prompts. It is trained once and can then perform three segmentation tasks: referring expression segmentation, zero-shot segmentation, and one-shot segmentation. The system extends CLIP with a transformer decoder to relate CLIP activations to segmentation outputs. It achieves competitive performance across low-shot segmentation tasks and can generalize to new classes and expressions without additional training.

Uploaded by

Anukalp Jain

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views14 pages

1ds20ai009 Research Paper

Uploaded by

Anukalp Jain

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Image Segmentation Using Text and Image Prompts

Timo Lüddecke1,§ and Alexander Ecker1,2

1 2
Institute of Computer Science and CIDAS, University of Göttingen MPI for Dynamics and Self-Organization, Göttingen

Abstract
arXiv:2112.10003v2 [cs.CV] 30 Mar 2022

Image segmentation is usually addressed by training a

model for a fixed set of object classes. Incorporating ad-
ditional classes or more complex queries later is expensive
as it requires re-training the model on a dataset that en-
compasses these expressions. Here we propose a system
that can generate image segmentations based on arbitrary
prompts at test time. A prompt can be either a text or an Figure 1: Our key idea is to use CLIP to build a flexible zero/one-
image. This approach enables us to create a unified model shot segmentation system that addresses multiple tasks at once.
(trained once) for three common segmentation tasks, which
come with distinct challenges: referring expression segmen- have been trained on. Different approaches have emerged
tation, zero-shot segmentation and one-shot segmentation. that extend this fairly constrained setting (see Tab. 1):
We build upon the CLIP model as a backbone which we ex-
tend with a transformer-based decoder that enables dense • In generalized zero-shot segmentation, seen as well as
prediction. After training on an extended version of the unseen categories needs to be segmented by putting un-
PhraseCut dataset, our system generates a binary segmen- seen categories in relation to seen ones, e.g. through
tation map for an image based on a free-text prompt or on word embeddings [1] or WordNet [2].
an additional image expressing the query. We analyze dif- • In one-shot segmentation, the desired class is provided
ferent variants of the latter image-based prompts in detail. in form of an image (and often an associated mask) in
This novel hybrid input allows for dynamic adaptation not addition to the query image to be segmented.
only to the three segmentation tasks mentioned above, but
• In referring expression segmentation, a model is
to any binary segmentation task where a text or image query
trained on complex text queries but sees all classes dur-
can be formulated. Finally, we find our system to adapt well
ing training (i.e. no generalization to unseen classes).
to generalized queries involving affordances or properties.
Code is available at https://eckerlab.org/code/ To this work, we introduce the CLIPSeg model (Fig. 1),
clipseg. which is capable of segmenting based on an arbitrary text
query or an example image. CLIPSeg can address all three
tasks named above. This multi-modal input format goes
1 Introduction beyond existing multi-task benchmarks such as Visual De-
cathlon [3] where input is always provided in form of im-
The ability to generalize to unseen data is a fundamental ages. To realize this system, we employ the pre-trained
problem relevant for a broad range of applications in artifi- CLIP model as a backbone and train a thin conditional seg-
cial intelligence. For instance, it is crucial that a household mentation layer (decoder) on top. We use the joint text-
robot understands the prompt of its user, which might in- visual embedding space of CLIP for conditioning our model,
volve an unseen object type or an uncommon expression for which enables us to process prompts in text form as well as
an object. While humans excel at this task, this form of in- images. Our idea is to teach the decoder to relate activations
ference is challenging for computer vision systems. inside CLIP with an output segmentation, while permitting
Image segmentation requires a model to output a predic- as little dataset bias as possible and maintaining the excel-
tion for each pixel. Compared to whole-image classifica- lent and broad predictive capabilities of CLIP.
tion, segmentation requires not only predicting what can be We employ a generic binary prediction setting, where a
seen but also where it can be found. Classical semantic seg- foreground that matches the prompt has to be differentiated
mentation models are limited to segment the categories they from background. This binary setting can be adapted to
multi-label predictions which is needed by Pascal zero-shot
§
[email protected] segmentation. Although the focus of our work is on building

1
unseen free form no fixed negative either a projection of the patch embeddings or mask trans-
classes prompt targets samples former are proposed. Our CLIPSeg model extends CLIP
Our setting ✓ ✓ ✓ ✓ with a transformer-based decoder, i.e. we do not rely on
Classic - - - ✓ convolutional layers.
Referring Expression - ✓ ✓ -
Zero-shot ✓ - ✓ ✓
Referring Expression Segmentation In referring expres-
One-shot ✓ - ✓ -
sion segmentation a target is specified in a natural language
Table 1: Comparison of different segmentation tasks. Negative phrase. The goal is to segment all pixels that match this
means samples that do not contain the target (or one of the targets phrase. Early approaches used recurrent networks in combi-
in multi-label segmentation). All approaches except classic seg- nation with CNNs to address this problem [15, 16, 17, 18].
mentation adapt to new targets dynamically at inference time. The CMSA module, which is central to the approach of Ye
et al. [19], models long-term dependencies between text and
image using attention. The more recent HULANet method
a versatile model, we find that CLIPSeg achieves competi- [20] consists of Mask-RCNN backbone and specific mod-
tive performance across three low-shot segmentation tasks. ules processing categories, attributes and relations, which
Moreover, it is able to generalize to classes and expressions are merged to generate a segmentation mask. MDETR [21]
for which it has never seen a segmentation. is an adaptation of the detection method DETR [22] to nat-
ural language phrase input. It consists of a CNN which
Contributions Our main technical contribution is the extracts features and a transformer which predicts bound-
CLIPSeg model, which extends the well-known CLIP trans- ing boxes for a set of query prompts. Note that referring
former for zero-shot and one-shot segmentation tasks by a expression segmentation does not require generalization to
proposing a lightweight transformer-based decoder. A key unseen object categories or understanding of visual support
novelty of this model is that the segmentation target can be images. Several benchmarks [23, 24, 20] were proposed to
specified by different modalities: through text or an image. track progress in referring expression segmentation. We opt
This allows us to train a unified model for several bench- for the PhraseCut dataset [20] which is substantially larger
marks. For text-based queries, unlike networks trained on in terms of images and classes than other datasets. It con-
PhraseCut, our model is able to generalize to new queries in- tains structured text queries involving objects, attributes and
volving unseen words. For image-based queries, we explore relationships. A query can match multiple object instances.
various forms of visual prompt engineering – analogously
to text prompt engineering in language modeling. Further- Zero-Shot Segmentation In zero-shot segmentation the
more, we evaluate how our model generalizes to novel forms goal is to segment objects of categories that have not been
of prompts involving affordances. seen during training. Normally, multiple classes need to be
segmented in an image at the same time. In the general-
ized setting, both seen and unseen categories may occur. A
2 Related Work key problem in zero-shot segmentation addressed by sev-
eral methods is the bias which favors seen classes. Bucher
Foundation Models and Segmentation Instead of learn- et al. [25] train a DeepLabV3-based network to synthesize
ing from scratch, modern vision systems are commonly pre- artificial, pixel-wise features for unseen classes based on
trained on a large-scale dataset (either supervised [4] or self- word2vec label embeddings. These features are used to
supervised [5, 6]) and use weight transfer. The term founda- learn a classifier. Follow-up work explicitly models the re-
tion model has been coined for very large pre-training mod- lation between seen and unseen classes [26]. Others add se-
els that are applicable to multiple downstream tasks [7]. One mantic class information into dense prediction models [27].
of these models is CLIP [8], which has demonstrated ex- More recent approaches use a joint space for image features
cellent performance on several image classification tasks. and class prototypes [28], employ a probabilistic formula-
In contrast to previous models which rely on ResNet [9] tion to account for uncertainty [29] or model the detection
backbones, the best-performing CLIP model uses a novel of unseen objects explicitly [30].
visual transformer [10] architecture. Analogously to im-
age classification, there have been efforts to make use of One-Shot Semantic Segmentation In one-shot seman-
transformers for segmentation: TransUNet [11] and SETR tic segmentation, the model is provided at test time with
[12] employ a hybrid architecture which combine a visual a single example of a certain class, usually as an image
transformer for encoding with a CNN-based decoder. Seg- with a corresponding mask. One-shot semantic segmenta-
former [13] combines a transformer encoder with an MLP- tion is a comparably new task, with the pioneering work
based decoder. The Segmentor model [14] pursues a purely being published in 2017 by Shaban et al. [31], which in-
transformer-based approach. To generate a segmentation, troduced the Pascal-5i dataset based on Pascal images and

2
labels. Their simple model extracts VGG16-features [32] layers S are read out and projected to the token embedding
from a masked support image to generate regression pa- size D of our decoder. Then, these extracted activations (in-
rameters that are applied per-location on the output of a cluding CLS token) are added to the internal activations of
FCN [33] to yield a segmentation. Later works introduce our decoder before each transformer block. The decoder has
more complex mechanisms to handle one-shot segmenta- as many transformer blocks as extracted CLIP activations (in
tion: The pyramid graph network (PGNet) [34] generates our case 3). The decoder generates the binary segmentation
a set of differently-shaped feature maps obtained through by applying a linear projection on the tokens of its trans-
W H
adaptive pooling and processes them by individual graph at- former (last layer) R(1+ P × P )×D 7→ RW ×H , where P is
tention units and passed through an atrous spatial pyramid the token patch size of CLIP. In order to inform the decoder
pooling (ASPP) block [35]. The CANet network [36] first about the segmentation target, we modulate the decoder’s
extracts images using a shared encoder. Then predictions are input activation by a conditional vector using FiLM [48].
iteratively refined through a sequence of convolutions and This conditional vector can be obtained in two ways: (1)
ASPP blocks. Several approaches focus on the modeling of Using the CLIP text-transformer embedding of a text query
prototypes [37, 38, 39]. PFENet [40] uses a prior computed and (2) using the CLIP visual transformer on a feature en-
on high-level CNN-features to provide an auxiliary segmen- gineered prompt image. CLIP itself is not trained, but only
tation that helps further processing. A weakly-supervised used as a frozen feature extractor. Due to the compact de-
variant as introduced by Rakelly et al. [41] requires only coder, CLIPSeg has only 1,122,305 trainable parameters for
sparse annotations in form of a set of points. In one-shot D = 64.
instance segmentation [42], instead of a binary match/non- The original CLIP is constrained to a fixed image size
match prediction, individual object instances are segmented. due to the learned positional embedding. We enable dif-
ferent image sizes (including larger ones) by interpolating
CLIP Extensions Despite CLIP [8] being fairly new, the positional embeddings. To validate the viability of this
multiple derivative works across different sub-fields have approach, we compare prediction quality for different im-
emerged. CLIP was combined with a GAN to modify image sizes and find that for ViT-B/16 performance only de-
ages based on a text prompt [43] and in robotics to gener- creases for images larger than 350 pixels (see supplemen-
alize to unseen objects in manipulations tasks [44]. Other tary for details). In our experiments we use CLIP ViT-B/16
work focused on understanding CLIP in more detail. In with a patch size P of 16 and use a projection dimension of
the original CLIP paper [8], it was found that the design of D = 64 if not indicated otherwise. We extract CLIP activa-
prompts matters for downstream tasks, i.e. instead of using tions at layers S = [3, 7, 9], consequently our decoder has
an object name alone as a prompt, adding the prefix “a photo only three layers.
of" increases performance. Zhou et al. [45] propose context
optimization (CoOp) which automatically learns tokens that Image-Text Interpolation Our model receives informa-
perform well for given downstream tasks. Other approaches tion about the segmentation target (“what to segment?”)
rely on CLIP for open-set object detection [46, 47]. through a conditional vector. This can be provided either by
text or an image (through visual prompt engineering). Since
CLIP uses a shared embedding space for images and text
3 CLIPSeg Method captions, we can interpolate between both in the embedding
space and condition on the interpolated vector. Formally, let
We use the visual transformer-based (ViT-B/16) CLIP [8] si be the embedding of the support image and ti the text em-
model as a backbone (Fig. 2) and extend it with a small, bedding of a sample i, we obtain a conditional vector xi by a
parameter-efficient transformer decoder. The decoder is linear interpolation xi = asi +(1 − a)xi , where a is sampled
trained on custom datasets to carry out segmentation, while uniformly from [0, 1]. We use this randomized interpolation
the CLIP encoder remains frozen. A key challenge is to as a data augmentation strategy during training.
avoid imposing strong biases on predictions during segmen-
tation training and maintaining the versatility of CLIP. We
do not use the larger ViT-L/14@336px CLIP variant as its 3.1 PhraseCut + Visual prompts (PC+)
weights were not publicly released as of writing this work. We use the PhraseCut dataset [20], which encompasses over
340,000 phrases with corresponding image segmentations.
Decoder Architecture Considering these demands, we Originally, this dataset does not contain visual support but
propose CLIPSeg: A simple, purely-transformer based de- only phrases and for every phrase a corresponding object
coder, which has U-Net-inspired skip connections to the exists. We extend this dataset in two ways: visual support
CLIP encoder that allow the decoder to be compact (in terms samples and negative samples. To add visual support im-
of parameters). While the query image (RW ×H ×3 ) is passed ages for a prompt p, we randomly draw from the set of all
through the CLIP visual transformer, activations at certain samples Sp , which share the prompt p. In case the prompt

3
Figure 2: Architecture of CLIPSeg: We extend a frozen CLIP model (red and blue) with a transformer that segments the query image based
on either a support image or a support prompt. N CLIP activations are extracted after blocks defined by S . The segmentation transformer
and the projections (both green) are trained on PhraseCut or PhraseCut+.

nique to compute a prototype vector for conditioning. The

provided support mask is downsampled and multiplied with
a late feature map from the CNN along the spatial dimen-
sions and then pooled along the spatial dimensions. This
nothing

nothing

nothing
animal

animal

way, only features that pertain to the support object are con-
sieve

sieve

sieve
knife

knife

knife
car

car

car
jug

jug

sidered in the prototype vector. This method cannot be

applied directly to transformer-based architectures, as se-
mantic information is also accumulated in the CLS token
throughout the hierarchy and not only in the feature maps.
Circumventing the CLS token and deriving the conditional
trash bin

trash bin

trash bin
nothing

nothing

nothing
window

window

window
house

house

house
bike

bike

bike
car

car

vector directly from masked pooling of the feature maps is

Figure 3: Different forms of combining an image with the associ- not possible either, since it would break the compatibility
ated object mask to build a visual prompt have a strong effect on between text embeddings and visual embeddings of CLIP.
CLIP predictions (bar charts). We use the difference in the prob- To learn more about how target information can be incor-
ability of the target object (orange) in the original image (left col- porated into CLIP, we compare several variants in a simple
umn) and the masking methods for our systematic analysis. experiment without segmentation and its confounding ef-
fects. We consider the cosine distance (alignment) between
visual and text-based embedding and use the original CLIP
is unique (|Sp | = 1), we rely only on the text prompt. Ad-
weights without any additional training.
ditionally, we introduce negative samples to the dataset, i.e.
samples in which no object matches the prompt. To this end, Specifically, we use CLIP to compute the text embeddings
the sample’s phrase is replaced by a different phrase with a ti which correspond to object names in the image. We then
probability qneg . Phrases are augmented randomly using a compare those to (1) the visual embedding of the original
set of fixed prefixes (as suggested by the CLIP authors). On image without modifications, so and (2) the visual embed-
the images we apply random cropping under consideration ding sh highlighting the target object using a modified RGB
of object locations, making sure the object remains at least image or attention mask (both techniques are described in
partially visible. In the remainder of this paper, we call this detail below). By softmax-normalizing the vector of align-
extended dataset PhraseCut+ (abbreviated by PC+). In con- ments [sh t0 , sh t1 , . . . ] for different highlighting techniques
trast to the original PhraseCut dataset, which uses only text and images, we obtain the distributions shown in Fig. 3. For
to specify the target, PC+ supports training using image-text quantitative scores, we consider only the target object name
interpolation. This way, we can train a joint model that op- embedding t0 , which we expect to have a stronger align-
erates on text and visual input. ment with the highlighted image embedding sh than with
the original image embedding s0 (Fig. 3). This means, if a
highlighting technique improves the alignment, the increase
4 Visual Prompt Engineering in object probability ∆P(object) = sh t0 − so t0 should be
large. We base this analysis on the LVIS dataset [49] since
In conventional, CNN-based one-shot semantic segmenta- its images contain multiple objects and a rich set of cate-
tion, masked pooling [31] has emerged as a standard tech- gories. We sample 1,600 images and mask one target object

4
CLIP modification & extras ∆P(object) background modific. ∆P(object) cropping & combinations ∆P(object)
CLIP masking CLS in layer 11 1.34 BG intensity 50% 3.08 crop large context 6.27
CLIP masking CLS in all layers 1.71 BG intensity 10% 13.85 crop 13.60
CLIP masking all in all layers -14.44 BG intensity 0% 23.40 crop & BG blur 15.34
dye object red in grays. image 1.21 BG blur 13.15 crop & BG intensity 10% 21.73
add red object outline 2.29 + intensity 10% 21.73 + BG blur 23.50

Table 2: Visual prompt engineering: Average improvement of object probability for different forms of combining image and mask over
1,600 samples. Cropping means cutting the image according to the regions specified by the mask, “BG” means background.

out of all objects present in this image. Metrics Compared to approaches in zero-shot and one-
shot segmentation (e.g. [25, 26]), the vocabulary we use
CLIP-Based Masking The straightforward equivalent to is open, i.e. the set of classes or expressions is not fixed.
masked pooling in a visual transformer is to apply the mask Therefore, throughout the experiments, our models are
on the tokens. Normally, a visual transformer consists of a trained to generate binary predictions that indicate where
fixed set of tokens which can interact at every layer through objects matching the query are located. If necessary, this
multi-head attention: A CLS token used for read-out and binary setting can be transformed into a multi-label setting
image-region-related tokens which were originally obtained (as we do in Section 5.2).
from image patches. Now, the mask can be incorporated by In segmentation, intersection over union (IoU, also Jac-
constraining the interaction at one (e.g. the last layer 11) or card score) is a common metric to compare predictions with
more transformer layers to within-mask patch tokens as well ground truth. Due to the diversity of the tasks, we employ
as the CLS token only. Our evaluation (Tab. 2, left) suggests different forms of IoU: Foreground IoU (IoUFG ) which com-
that this form of introducing the mask does not work well. putes IoU on foreground pixels only, mean IoU, which com-
By constraining the interactions with the CLS token (Tab. 2, putes the average over foreground IoUs of different classes
left, top two rows) only a small improvement is achieved (in and binary IoU (IoUBIN ) which averages over foreground
last layer or in all layers) while constraining all interactions IoU and background IoU. In binary segmentation, IoU re-
decreases performance dramatically. From this we conclude quires a threshold t to be specified. While most of the time
that more complex strategies are necessary to combine im- the natural choice of 0.5 is used, the optimal values can
age and mask internally. strongly deviate from 0.5 if the probability that an object
matching the query differs between training and inference
(the a-priori probability of a query matching one or more
Visual Prompt Engineering Instead of applying the mask objects in the scene depends highly on context and dataset).
inside the model, we can also combine mask and image to Therefore, we report performance of one-shot segmentation
a new image, which can then processed by the visual trans- using thresholds t optimized per task and model. Addition-
former. Analogous to prompt engineering in NLP (e.g. in ally, we adopt the average precision metric (AP) in all our
GPT-3 [50]), we call this procedure visual prompt engineer- experiments. Average precision measures the area under the
ing. Since this form of prompt design is novel and strategies recall-precision curve. It measures how well the system can
which perform best in this context are unknown, we conduct discriminate matches from non-matches, independent of the
an extensive evaluation of different variants of designing vi- choice of threshold.
sual prompts (Tab. 2). We find that the exact form of how the
mask and image are combined matters a lot. Generally, we
identify three image operations that improve the alignment Models and Baselines In our experiments we differenti-
between the object text prompts and the images: decreasing ate two variants of CLIPSeg: One trained on the original
the background brightness, blurring the background (using PhraseCut dataset (PC) and one trained on the extended ver-
a Gaussian filter) and cropping to the object. The combina- sion of PhraseCut which uses 20% negative samples, con-
tion of all three performs best (Tab. 2, last row). We will use tains visual samples (PC+) and uses image-text interpola-
this variant in the remainder. tion (Sec. 3). The robust latter version we call the universal
model. To put the performance of our models into perspec-
tive, we provide two baselines:
5 Experiments
• CLIP-Deconv encompasses CLIP but uses a very basic
We first evaluate our model on three established segmen- decoder, consisting only of the basic parts: FiLM con-
tation benchmarks before demonstrating the main contribu- ditioning [48], a linear projection and a deconvolution.
tion of our work: flexible few-shot segmentation that can be This helps us to estimate to which degree CLIP-alone
based on either text or image prompts. is responsible for the results.

5
t mIoU IoUFG AP unseen-10 unseen-4
pre-train. mIoUS mIoUU mIoUS mIoUU
CLIPSeg (PC+) 0.3 43.4 54.7 76.7
CLIPSeg (PC, D = 128) 0.3 48.2 56.5 78.2 CLIPSeg (PC+) CLIP 35.7 43.1 20.8 47.3
CLIPSeg (PC) 0.3 46.1 56.2 78.2 CLIP-Deconv (PC+) CLIP 25.1 36.7 25.9 41.9
CLIP-Deconv 0.3 37.7 49.5 71.2 ViTSeg (PC+) IN 4.2 19.0 6.0 24.8
ViTSeg (PC+) 0.1 28.4 35.4 58.3
ViTSeg (PC) 0.3 38.9 51.2 74.4 SPNet [27] IN 59.0 18.1 67.3 21.8
ZS3Net [25] IN-seen 33.9 18.1 66.4 23.2
MDETR [21] 53.7 - - CSRL [53] IN-seen 59.2 21.0 69.8 31.7
HulaNet [20] 41.3 50.8 - CaGNet [54] IN - - 69.5 40.2
Mask-RCNN top [20] 39.4 47.4 - OSR [30] IN-seen 72.1 33.9 75.0 44.1
RMI [20] 21.1 42.5 - JoEm [28] IN-seen 63.4 22.5 67.0 33.4

Table 3: Referring Expression Segmentation performance on Table 4: Zero-shot segmentation performance on Pascal-VOC with
PhraseCut (t refers to the binary threshold). 10 unseen classes. mIoUS and mIoUU indicate performance on
seen and unseen classes, respectively. Our model is trained on
PhraseCut with the Pascal classes being removed but uses a pre-
• ViTSeg shares the architecture of CLIPSeg, but uses trained CLIP backbone. IN-seen indicates ImageNet pre-training
an ImageNet-trained visual transformer as a backbone with unseen classes being removed.
[51]. For encoding text, we use the same text trans-
former of CLIP. This way we learn to which degree
the specific CLIP weights are crucial for good perfor- in a multi-label setting. Therefore, we employ a simple
mance. adaptation: Our model predicts a binary map independently
We rely on PyTorch [52] for training and use an image size for each of the 20 Pascal classes. Across all 20 predictions
of 352 × 352 pixels throughout our experiments (for details we determine the class with the highest probability for each
see appendix). pixel.
We train on PhraseCut+ but remove the unseen Pascal
classes from the dataset. This is carried out by assigning
5.1 Referring Expression Segmentation the Pascal classes to WordNet synsets [2] and generating a
We evaluate referring expression segmentation performance set of invalid words by traversing hyponyms (e.g. different
(Tab. 3) on the original PhraseCut dataset and compare to dog breeds for dog). Prompts that contain such a word are
scores reported by Wu et al. [20] as well as the concurrently removed from the dataset.
developed transformer-based MDETR method [21]. For this The idea of conducting this experiment is to provide a ref-
experiment we trained a version of CLIPSeg on the original erence for the zero-shot performance of our universal model.
PhraseCut dataset (CLIPSeg [PC]) using only text labels in It should not considered as competing in this benchmark as
addition to the universal variant which also includes visual we use a different training (CLIP pre-training, binary seg-
samples (CLIPSeg [PC+]). mentation on PhraseCut). The results (Tab. 4) indicate a ma-
Our approaches outperform the two-stage HULANet ap- jor gap between seen and unseen classes in models trained
proach by Wu et al. [20]. Especially, a high capacity de- on Pascal-VOC, while our models tend to be more balanced.
coder (D = 128) seems to be beneficial for PhraseCut. This is due to other models being trained exclusively on the
However, the performance worse than MDETR [21], which 10 or 16 seen Pascal classes in contrast to CLIPSeg which
operates at full image resolution and received two rounds can differentiate many more classes (or phrases). In fact, our
of fine-tuning on PhraseCut. Notably, the ViTSeg baseline model performs better on unseen classes than on seen ones.
performs generally worse than CLIPSeg, which shows that This difference is likely because the seen classes are gener-
CLIP pre-training is helpful. ally harder to segment: For the unseen-4 setting, the unseen
classes are “airplane”, “cow”, “motorbike” and “sofa”. All
of them are large and comparatively distinct objects.
5.2 Generalized Zero-Shot Segmentation
In generalized zero-shot segmentation, test images contain 5.3 One-Shot Semantic Segmentation
categories that have never been seen before in addition to
known categories. We evaluate the model’s zero-shot seg- In one-shot semantic segmentation, a single example image
mentation performance using the established Pascal-VOC along with a mask is presented to the network. Regions that
benchmark (Tab. 4). It contains five splits involving 2 to pertain to the class highlighted in the example image must
10 unseen classes (we report only 4 and 10 unseen classes). be found in a query image. Compared to previous tasks,
The latter is the most challenging setting as the set of un- we cannot rely on a text label but must understand the pro-
seen classes is large. Since our model was trained on fore- vided support image. Above (Sec. 4) we identified the best
ground/background segmentation we cannot directly use it method for visual prompt design, which we use here: crop-

6
t vis. backb. mIoU IoUBIN AP Pascal-5i t vis. backb. mIoU IoUBIN AP
CLIPSeg (PC+) 0.3 ViT (CLIP) 59.5 75.0 82.3 CLIPSeg (PC+) 0.3 ViT (CLIP) 72.4 83.1 93.5
CLIPSeg (PC) 0.3 ViT (CLIP) 52.3 69.5 72.4 CLIPSeg (PC) 0.3 ViT (CLIP) 70.3 81.6 84.8
CLIP-Deconv (PC+) 0.2 ViT (CLIP) 48.0 65.8 68.0 CLIP-Deconv (PC+) 0.3 ViT (CLIP) 63.2 77.3 85.3
ViTSeg (PC+) 0.2 ViT (IN) 39.0 59.0 62.4 ViTSeg (PC+) 0.2 ViT (IN) 39.0 59.0 62.4
PPNet [39] RN50 52.8 69.2 - LSeg [58] ViT (CLIP) 52.3 67.0 -
RePRI [57] RN50 59.7 - - PFENet [40] VGG16 54.2 - -
PFENet [40] RN50 60.2 73.3 -
HSNet [55] RN50 64.0 76.7 -
Table 7: Zero-shot performance on Pascal-5i. The scores were
PPNet [39] RN101 55.2 70.9 - obtained by following the evaluation protocol of one-shot segmen-
RePRI [57] RN101 59.4 - - tation but using text input.
PFENet [40] RN101 59.6 72.9 -
HSNet [55] RN101 66.2 77.6 -

Table 5: One-shot performance on Pascal-5i (CLIPSeg and ViTSeg 5.4 One Model For All: Generalized Prompts
trained on PhraseCut+).
We have shown that CLIPSeg performs well on a variety of
academic segmentation benchmarks. Next, we evaluate its
t vis. backb. mIoU IoUBIN AP performance “in the wild" in unseen situations.
CLIPSeg (COCO) 0.1 ViT (CLIP) 33.2 58.4 40.5
CLIPSeg (COCO+N) 0.1 ViT (CLIP) 33.3 59.1 41.7
CLIP-Deconv (COCO+N) 0.1 ViT (CLIP) 29.8 56.8 40.8 Qualitative Results In Fig. 4 we show qualitative results
ViTSeg (COCO) 0.1 ViT (IN) 14.4 46.1 15.7 divided into two groups: (1, left) Affordance-like [59, 60]
PPNet [39] RN50 29.0 - - (“generalized”) prompts that are different from the descrip-
RePRI [57] RN50 34.0 - - tive prompts of PhraseCut and (2, right) prompts that were
PFENet [40] RN50 35.8 - - taken from the PhraseCut test set. For the latter we add
HSNet [55] RN50 39.2 68.2 -
challenging extra prompts involving an existing object but
HSNet [55] RN101 41.2 69.1 - the wrong color (indicated in orange). Generalized prompts,
which deviate from the PhraseCut training set by referring
Table 6: One-shot performance on COCO-20i (CLIPSeg trained to actions (“‘something to ...”) or rare object classes (“‘cut-
on PhraseCut), +N indicates 10% negative samples.
lery”) work surprisingly well given that the model was not
trained on such cases. It has learned an intuition of stuff that
can be stored away in cupboards, where sitting is possible
ping out the target object while blurring and darkening the and what “living creature” means. Rarely, false positives
background. To remove classes that overlap with the re- are generated (the bug in the salad is not a cow). Details in
spective subset of Pascal during training, we use the same the prompt are reflected by the segmentation (blue boxes)
method as in the previous section (Sec. 5.2). Other than and information about the color influences predicted object
in zero-shot segmentation, in one-shot segmentation, Ima- probabilities strongly (orange box).
geNet pre-trained backbones are common [40, 37]. PFENet
particularly leverages pre-training by using high-level fea-
ture similarity as a prior. Similarly, HSNet [55] processes Systematic Analysis To quantitatively assess the perfor-
correlated activations of query and support image using 4D- mance for generalized queries, we construct subsets of the
convolutions at multiple levels. LVIS test datasets containing only images of classes that
On Pascal-5i we find our universal model CLIPSeg (PC+) correspond to affordances or attributes. Then we ask our
to achieve competitive performance (Tab. 5) among state-of- model to segment with these affordances or attributes as
the-art methods, with only the very recent HSNet perform- prompts. For instance, we compute the foreground inter-
ing better. The results on COCO-20i (Tab. 6) show that section of union between armchair, sofa and loveseat ob-
CLIPSeg also works well when trained on other datasets jects when “sit on” is used as prompt. A complete list
than PhraseCut(+). Again HSNet performs better. To put of which affordances or attributes are mapped onto which
this in perspective, it should be considered that HSNet (and objects can be found in the appendix. We find (Tab. 8)
PFENet) are explicitly designed for one-shot segmentation, that the CLIPSeg version trained on PC+ performs better
rely on pre-trained CNN activations and cannot handle text than the CLIP-Deconv baseline and the version trained on
by default: Tian et al. [40] extended PFENet to zero-shot LVIS, which contains only object labels instead of complex
segmentation (but used the one-shot protocol) by replacing phrases. This result suggests that both dataset variability and
the visual sample with word vectors [1, 56] of text labels. model complexity are necessary for generalization. ViTSeg
In that case, CLIPSeg outperforms their scores by a large performs worse, which is expected as it misses the strong
margin (Tab. 7). CLIP backbone, known for its generalization capabilities.

7
Figure 4: Qualitative predictions of CLIPSeg (PC+) for various prompts, darkness indicates prediction strength. The generalized prompts
(left) deviate from the PhraseCut prompts as they involve action-related properties or new object names.

Affordances Attributes Meronymy Text-based Visual-based

mIoU AP mIoU AP mIoU AP mIoU AP mIoU AP
CLIPSeg (PC+) 36.9 50.5 26.6 43.0 25.7 29.0 CLIPSeg (PC+) 43.6 76.7 25.4 55.6
CLIPSeg (LVIS) 37.7 44.6 18.4 16.6 18.9 13.8 no CLIP pre-training 13.1 12.6 12.7 -
CLIP-Deconv 32.2 43.7 23.1 35.6 21.1 27.1 no visual 46.4 77.8 14.4 31.0
VITSeg (PC+) 19.2 23.5 26.8 28.0 18.4 15.9 D = 16 37.4 71.5 24.7 51.2
only layer 3 31.9 64.9 21.5 48.6
highlight mask 43.4 75.4 23.3 43.8
Table 8: Performance for generalized prompts. While the PC+-
model has seen prompts during training (colliding prompts with
test set were removed), the LVIS version was trained on object Table 9: Ablation study conducted on PhraseCut, involving text
classes only and is able to generalize due to the CLIP backbone. (left) and visual prompts (right) at test time. We use the best thresh-
We use the best threshold t for each model. old t for each model.

perfectly. The gap in text-based performance to the hybrid

5.5 Ablation Study version (PC+) is negligible.
In order to identify crucial factors for the performance
of CLIPSeg, we conduct an ablation study on PhraseCut 6 Conclusion
(Tab. 9). We evaluate text-based and visual prompt-based
performance (obtained using our modifications on Phrase- We presented the CLIPSeg image segmentation approach
Cut) separately for a complete picture. Both text-based and that can be adapted to new tasks by text or image prompts
visual performance drops when random weights instead of at inference time instead of expensive training on new data.
CLIP weights are used (“no CLIP pre-training”). When the Specifically, we investigated the novel visual prompt engi-
number of parameters is reduced to 16 (“D = 16”) per- neering in detail and demonstrated competitive performance
formance decreases substantially, which indicates the im- on referring expression, zero-shot and one-shot image seg-
portance of the information processing in the decoder. Us- mentation tasks. Beyond that, we showed – both quali-
ing an unfavourable visual prompting technique (“highlight tatively and quantitatively – that our model generalizes to
mask”) degrades performance on visual input, which sup- novel prompts involving affordances and properties. We
ports our findings from Sec. 4. Using only early activations expect our method to be useful, especially for inexperi-
from layer 3 decreases performance (“only layer 3”), from enced users for building a segmentation model by specify-
which we conclude that higher level features of CLIP are ing prompts and in robotic setups when interaction with hu-
useful for segmentation. Training without visual samples mans is desired. We believe that tackling multiple tasks is a
(“no visual”) decreases the performance on visual samples, promising direction for future research toward more generic
which is expected as visual and text vectors do not align and real-world compatible vision systems. In a wider con-

8
text, our experiments, in particular the comparison to the [8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
ImageNet-based ViTSeg baseline, highlight the power of Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
foundation models like CLIP for solving several tasks at Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
once. ing transferable visual models from natural language super-
vision. arXiv preprint arXiv:2103.00020, 2021.

Limitations Our experiments are limited to only a small [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
number of benchmarks, in future work more modalities such Deep residual learning for image recognition. In Conference
as sound and touch could be incorporated. We depend on a on Computer Vision and Pattern Recognition (CVPR), 2016.
large-scale dataset (CLIP) for pre-training. Note, we do not
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
use the best-performing CLIP model ViT-L/14@336px due
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
to weight availability. Furthermore, our model focuses on Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
images, an application to video might suffer from missing vain Gelly, et al. An image is worth 16x16 words: Transform-
temporal consistency. Image size may vary but only within ers for image recognition at scale. International Conference
certain limits (for details see supplementary). on Learning Representations (ICLR), 2021.

[11] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan
Broader Impact There is a chance that the model repli-
Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou.
cates dataset biases from PhraseCut but especially from the Transunet: Transformers make strong encoders for medical
unpublished CLIP training dataset. Provided models should image segmentation. arXiv preprint arXiv:2102.04306, 2021.
be used carefully and not in tasks depicting humans. Our
approach enables adaptation to new tasks without energy- [12] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu,
intensive training. Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xi-
ang, Philip HS Torr, et al. Rethinking semantic segmentation
from a sequence-to-sequence perspective with transformers.
References In Conference on Computer Vision and Pattern Recognition,
2021.
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado,
and Jeff Dean. Distributed representations of words and [13] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
phrases and their compositionality. In Advances in Neural Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
Information Processing Systems (NIPS), 2013. ficient design for semantic segmentation with transformers.
arXiv preprint arXiv:2105.15203, 2021.
[2] George A Miller. Wordnet: a lexical database for english.
Communications of the ACM, 38(11):39–41, 1995. [14] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia
Schmid. Segmenter: Transformer for semantic segmentation.
[3] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. arXiv preprint arXiv:2105.05633, 2021.
Learning multiple visual domains with residual adapters. In
Advances in Neural Information Processing Systems (NIPS), [15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Seg-
2017. mentation from natural language expressions. In European
Conference on Computer Vision (ECCV), 2016.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li
Fei-Fei. Imagenet: A large-scale hierarchical image database. [16] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and
In Conference on Computer Vision and Pattern Recognition Alan Yuille. Recurrent multimodal interaction for referring
(CVPR). IEEE, 2009. image segmentation. In International Conference on Com-
puter Vision (ICCV), 2017.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge-
offrey Hinton. A simple framework for contrastive learning [17] Hengcan Shi, Hongliang Li, Fanman Meng, and Q. Wu. Key-
of visual representations. In International Conference on Ma- word-aware network for referring expression image segmen-
chine Learning (ICML), Proceedings of Machine Learning tation. In European Conference on Computer Vision (ECCV),
Research, 13–18 Jul 2020. 2018.

[6] Xinlei Chen and Kaiming He. Exploring simple siamese rep- [18] Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan
resentation learning. In Conference on Computer Vision and Qi, Xiaoyong Shen, and Jiaya Jia. Referring image segmen-
Pattern Recognition (CVPR), 2021. tation via recurrent refinement networks. In Conference on
Computer Vision and Pattern Recognition (CVPR), 2018.
[7] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Alt-
man, Simran Arora, Sydney von Arx, Michael S Bernstein, [19] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang.
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. Cross-modal self-attention network for referring image seg-
On the opportunities and risks of foundation models. arXiv mentation. Conference on Computer Vision and Pattern
preprint arXiv:2108.07258, 2021. Recognition (CVPR), 2019.

9
[20] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and [33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
Subhransu Maji. Phrasecut: Language-based image segmen- convolutional networks for semantic segmentation. In
tation in the wild. In Conference on Computer Vision and Conference on Computer Vision and Pattern Recognition
Pattern Recognition (CVPR), 2020. (CVPR), 2015.

[21] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan [34] Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo,
Misra, Gabriel Synnaeve, and Nicolas Carion. Mdetr - mod- Qingyao Wu, and Rui Yao. Pyramid graph networks with
ulated detection for end-to-end multi-modal understanding. connection attentions for region-based one-shot semantic
ArXiv, 2021. segmentation. In International Conference on Computer Vi-
sion (ICCV), 2019.
[22] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- [35] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
end object detection with transformers. In European Con- Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image
ference on Computer Vision (ECCV), Cham, 2020. Springer segmentation with deep convolutional nets, atrous convolu-
International Publishing. ISBN 978-3-030-58452-8. tion, and fully connected crfs. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 2018.
[23] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg,
and Tamara L Berg. Modeling context in referring expres- [36] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua
sions. In European Conference on Computer Vision (ECCV), Shen. Canet: Class-agnostic segmentation networks with it-
2016. erative refinement and attentive few-shot learning. In Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
[24] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Cam-
June 2019.
buru, Alan L Yuille, and Kevin Murphy. Generation and com-
prehension of unambiguous object descriptions. In Confer- [37] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou,
ence on Computer Vision and Pattern Recognition (CVPR), and Jiashi Feng. Panet: Few-shot image semantic segmenta-
2016. tion with prototype alignment. In International Conference
on Computer Vision (ICCV), 2019.
[25] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick
Pérez. Zero-shot semantic segmentation. Advances in Neural [38] Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang
Information Processing Systems (NeurIPS), 2019. Ye. Prototype mixture models for few-shot semantic segmen-
tation. In European Conference on Computer Vision (ECCV),
[26] Peike Li, Yunchao Wei, and Yi Yang. Consistent structural
2020.
relation learning for zero-shot segmentation. In Advances in
Neural Information Processing Systems (NeurIPS), 2020.
[39] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xum-
[27] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt ing He. Part-aware prototype network for few-shot semantic
Schiele, and Zeynep Akata. Semantic projection network for segmentation. In European Conference on Computer Vision
zero- and few-label semantic segmentation. In Conference (ECCV), 2020.
on Computer Vision and Pattern Recognition (CVPR), June
[40] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng
2019.
Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrich-
[28] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Exploit- ment network for few-shot segmentation. IEEE Transactions
ing a joint embedding space for generalized zero-shot seman- on Pattern Analysis and Machine Intelligence (TPAMI), Au-
tic segmentation. In International Conference on Computer gust 2020. ISSN 0162-8828.
Vision (ICCV), 2021.
[41] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alexei A.
[29] Ping Hu, Stan Sclaroff, and Kate Saenko. Uncertainty-aware Efros, and Sergey Levine. Few-shot segmentation propaga-
learning for zero-shot semantic segmentation. In Advances in tion with guided networks. arXiv preprint arXiv:1806.07373,
Neural Information Processing Systems (NeurIPS), 2020. 2018.

[30] Hui Zhang and Henghui Ding. Prototypical matching and [42] Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge,
open set rejection for zero-shot semantic segmentation. In and Alexander S. Ecker. One-shot instance segmentation.
International Conference on Computer Vision (ICCV), Octo- arXiv, 2018.
ber 2021.
[43] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or,
[31] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and and Dani Lischinski. Styleclip: Text-driven manipulation of
Byron Boots. One-shot learning for semantic segmentation. stylegan imagery. In International Conference on Computer
British Machine Vision Conference (BMVC), 2017. Vision (ICCV), 2021.

[32] Karen Simonyan and Andrew Zisserman. Very deep convo- [44] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport:
lutional networks for large-scale image recognition. arXiv What and where pathways for robotic manipulation. arXiv
preprint arXiv:1409.1556, 2014. preprint arXiv:2109.12098, 2021.

10
[45] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei [58] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen
Liu. Learning to prompt for vision-language models. arXiv Koltun, and Rene Ranftl. Language-driven semantic seg-
preprint arXiv:2109.01134, 2021. mentation. In International Conference on Learning Rep-
resentations, 2022. URL https://openreview.net/
[46] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero- forum?id=RriDjddCLN.
shot detection via vision and language knowledge distillation.
arXiv preprint arXiv:2104.13921, 2021. [59] James Jerome Gibson. The Senses Considered as Perceptual
Systems. Houghton Mifflin, 1966.
[47] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei Shu.
Zero-shot open set detection by extending clip. arXiv preprint [60] James J. Gibson. The Ecological Approach to Visual Percep-
arXiv:2109.02748, 2021. tion. Houghton Mifflin, 1979.

[48] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian

Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio.
Feature-wise transformations. Distill, 2018.

[49] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A

dataset for large vocabulary instance segmentation. In
Conference on Computer Vision and Pattern Recognition
(CVPR), 2019.

[50] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub-

biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan-
tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand-
hini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J.
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish, Alec Rad-
ford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. ArXiv, 2020.

[51] Ross Wightman. Pytorch image mod-

els. https://github.com/rwightman/
pytorch-image-models, 2019.

[52] Adam Paszke, Sam Gross, Soumith Chintala, Gregory

Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban
Desmaison, Luca Antiga, and Adam Lerer. Automatic dif-
ferentiation in pytorch. In Advances in Neural Information
Processing Systems Workshops, 2017.

[53] Peike Li, Yunchao Wei, and Yi Yang. Consistent structural

relation learning for zero-shot segmentation. Advances in
Neural Information Processing Systems (NeurIPS), 2020.

[54] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and Liqing
Zhang. Context-aware feature generation for zero-shot se-
mantic segmentation. In Proceedings of the 28th ACM Inter-
national Conference on Multimedia, pages 1921–1929, 2020.

[55] Juhong Min, Dahyun Kang, and Minsu Cho. Hypercorre-

lation squeeze for few-shot segmentation. In International
Conference on Computer Vision (ICCV), 2021.

[56] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian

Puhrsch, and Armand Joulin. Advances in pre-training dis-
tributed word representations. ArXiv, 2018.

[57] Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Pi-
antanida, Ismail Ben Ayed, and Jose Dolz. Few-shot segmen-
tation without meta-learning: A good transductive inference
is all you need? arXiv preprint arXiv:2012.06166, 2020.

11
Appendix chair, highchair, deck chair, folding chair, chair, recliner,
wheelchair
Experimental Setup drink from: bottle, beer bottle, water bottle, wine bottle,
thermos bottle
Throughout our experiments we use PyTorch [52] with ride on: horse, pony, motorcycle
CLIP ViT-B/16 [8]. We train on PhraseCut [20] for 20,000
iterations on batches of size 64 with an initial learning rate
Attributes:
of 0.001 (for VitSeg 0.0001) which decays following a co-
can fly: eagle, jet plane, airplane, fighter jet, bird, duck,
sine learning rate schedule to 0.0001 (without warmup). We
gull, owl, seabird, pigeon, goose, parakeet
use automatic mixed precision and binary cross entropy as
can be driven: minivan, bus (vehicle), cab (taxi), jeep,
the only loss function.
ambulance, car (automobile)
can swim: duck, duckling, water scooter, penguin, boat,
Image-size Dependency of CLIP kayak, canoe
Since multi-head attention does not require a fixed number
of tokens, the visual transformer of CLIP can handle in- Meronymy (part-of relations):
puts of arbitrary size. However, the publicly available CLIP has wheels: dirt bike, car (automobile), wheelchair, motor-
models (ViT-B/16 and ViT-B/32) were trained on 224 × 224 cycle, bicycle, cab (taxi), minivan, bus (vehicle), cab (taxi),
pixel images. In this experiment we investigate how CLIP jeep, ambulance
performance relates to the input image size – measured in has legs: armchair, sofa, loveseat, deck chair, rocking
a classification task. To this end, we extract the CLS token chair, highchair, deck chair, folding chair, chair, recliner,
vector in the last layer from both CLIP models. Using this wheelchair, horse, pony, eagle, bird, duck, gull, owl,
feature vector as an input, we train a logistic regression clas- seabird, pigeon, goose, parakeet, dog, cat, flamingo, pen-
sifier on a subset of ImageNet [4] classes differentiating 67 guin, cow, puppy, sheep, black sheep, ostrich, ram (animal),
classes of vehicles (Fig. 5). Our results indicate that CLIP chicken (animal), person
generally handles large image sizes well, with the 16-px-
patch version (ViT-B/16) showing a slightly better perfor-
mance at an optimal image size of around 350 × 350 pixels.
Average Precision Computation
The average precision metric has the advantage of not de-
0.7 pending on a fixed threshold. This is particularly useful
when new classes occur which lead to uncalibrated predic-
0.6 tions. Instead of operating on bounding boxes as in detec-
accuracy

0.5 tion, we compute average precision at the pixel-level. This

makes the computation challenging, since AP is normally
0.4 computed by sorting all predictions (hence all pixels) ac-
0.3 ViT/16 cording their likelihood, which requires keeping them in the
0.2 ViT/32 working memory. For pixels, this is not possible. To cir-
cumvent this, we define a fixed set of thresholds and aggre-
100 200 300 400 500 gate statistics (true-positives, etc.) in each image. Finally,
image size
we sum up the statistics per threshold level and compute the
precision-recall curve. Average precision, which is the area
Figure 5: Image classification performance of CLIP over different under the precision-recall curve is computed using Simpson
image sizes.
integration.

Object-mapping for Affordances and At- Qualitative Predictions

tributes
In Fig. 6 we show predictions of ViTSeg (PC), analogous
For our systematic analysis on generalization (Section 5.5 to Fig. 4 of the main paper. In fact, ViTSeg trained with
in the main paper), we generate samples by replacing the visual samples (PC+) shows worse performance. The pre-
following object categories by affordances (bold). dictions clearly indicate the deficits of an ImageNet-trained
ViT backbone compared to CLIP: Details in the prompt are
Affordances: not reflected by the segmentation and a large number of false
sit on: armchair, sofa, loveseat, deck chair, rocking positives occur.

12
a surface somewhere something something
to place to store to sit something something that has a living yellow
a surface. items. away things on. to drink. to eat. legs. cutlery. creature. chair. table. drawers. grass. cow. cow. green drawers.

Figure 6: Qualitative predictions of CLIPSeg (PC+) (top, same as Fig. 4 of main paper for reference) and ViTSeg (PC) (bottom).

13
0.25

a bright photo of a <label>

performance [AP]
0.20

a photo of a <label>

an image of <label>
a photo of <label>
0.15
0.10

<label>
0.05
0.00
text prompts

Figure 7: Effect of different text prompts on performance.

0.250 0.25
0.225
performance [AP]

0.20
performance [AP]

0.200
0.175 0.15
0.150
0.125 0.10
vehicles
animals
person
0.100
0.05
stuff
]
.1]
.2]

.3]

.5]
.05

,0
1,0

2,0

3,0

0.00
0
[0,

05
[0.

[0.

object categories
[0.

object fraction interval

Figure 8: Effect of object size and class on performance.

Text prompts, object sizes and classes

To develop a better understanding of when our model per-
forms well, we compare different text prompts (Fig. 7), ob-
ject sizes (Fig. 8, left) and object classes (Fig. 8, right). This
evaluation is conducted on a pre-trained CLIPSeg (PC+). In
all cases we randomly sample different prompt forms during
training. Here we assess the performance on 5,000 samples
of the PhraseCut test set.
We see a small effect on performance for alternative
prompt forms. In terms of object size there is a clear trend
towards better performance on larger objects. Performance
over different classes is fairly balanced.