Task                  | unseen classes | free-form prompt | no fixed targets | negative samples
Our setting           |       ✓        |        ✓         |        ✓         |        ✓
Classic               |       -        |        -         |        -         |        ✓
Referring Expression  |       -        |        ✓         |        ✓         |        -
Zero-shot             |       ✓        |        -         |        ✓         |        ✓
One-shot              |       ✓        |        -         |        ✓         |        -

Table 1: Comparison of different segmentation tasks. Negative means samples that do not contain the target (or one of the targets in multi-label segmentation). All approaches except classic segmentation adapt to new targets dynamically at inference time.

a versatile model, we find that CLIPSeg achieves competitive performance across three low-shot segmentation tasks. Moreover, it is able to generalize to classes and expressions for which it has never seen a segmentation.

Contributions  Our main technical contribution is the CLIPSeg model, which extends the well-known CLIP transformer for zero-shot and one-shot segmentation tasks by proposing a lightweight transformer-based decoder. A key novelty of this model is that the segmentation target can be specified by different modalities: through text or an image. This allows us to train a unified model for several benchmarks. For text-based queries, unlike networks trained on PhraseCut, our model is able to generalize to new queries involving unseen words. For image-based queries, we explore various forms of visual prompt engineering – analogously to text prompt engineering in language modeling. Furthermore, we evaluate how our model generalizes to novel forms of prompts involving affordances.

2 Related Work

Foundation Models and Segmentation  Instead of learning from scratch, modern vision systems are commonly pre-trained on a large-scale dataset (either supervised [4] or self-supervised [5, 6]) and use weight transfer. The term foundation model has been coined for very large pre-training models that are applicable to multiple downstream tasks [7]. One of these models is CLIP [8], which has demonstrated excellent performance on several image classification tasks. In contrast to previous models which rely on ResNet [9] backbones, the best-performing CLIP model uses a novel visual transformer [10] architecture. Analogously to image classification, there have been efforts to make use of transformers for segmentation: TransUNet [11] and SETR [12] employ a hybrid architecture which combines a visual transformer for encoding with a CNN-based decoder. SegFormer [13] combines a transformer encoder with an MLP-based decoder. The Segmenter model [14] pursues a purely transformer-based approach: to generate a segmentation, either a projection of the patch embeddings or a mask transformer are proposed. Our CLIPSeg model extends CLIP with a transformer-based decoder, i.e. we do not rely on convolutional layers.

Referring Expression Segmentation  In referring expression segmentation a target is specified in a natural language phrase. The goal is to segment all pixels that match this phrase. Early approaches used recurrent networks in combination with CNNs to address this problem [15, 16, 17, 18]. The CMSA module, which is central to the approach of Ye et al. [19], models long-term dependencies between text and image using attention. The more recent HULANet method [20] consists of a Mask-RCNN backbone and specific modules processing categories, attributes and relations, which are merged to generate a segmentation mask. MDETR [21] is an adaptation of the detection method DETR [22] to natural language phrase input. It consists of a CNN which extracts features and a transformer which predicts bounding boxes for a set of query prompts. Note that referring expression segmentation does not require generalization to unseen object categories or understanding of visual support images. Several benchmarks [23, 24, 20] were proposed to track progress in referring expression segmentation. We opt for the PhraseCut dataset [20], which is substantially larger in terms of images and classes than other datasets. It contains structured text queries involving objects, attributes and relationships. A query can match multiple object instances.

Zero-Shot Segmentation  In zero-shot segmentation the goal is to segment objects of categories that have not been seen during training. Normally, multiple classes need to be segmented in an image at the same time. In the generalized setting, both seen and unseen categories may occur. A key problem in zero-shot segmentation addressed by several methods is the bias which favors seen classes. Bucher et al. [25] train a DeepLabV3-based network to synthesize artificial, pixel-wise features for unseen classes based on word2vec label embeddings. These features are used to learn a classifier. Follow-up work explicitly models the relation between seen and unseen classes [26]. Others add semantic class information into dense prediction models [27]. More recent approaches use a joint space for image features and class prototypes [28], employ a probabilistic formulation to account for uncertainty [29] or model the detection of unseen objects explicitly [30].
One-Shot Semantic Segmentation  In one-shot semantic segmentation, the model is provided at test time with a single example of a certain class, usually as an image with a corresponding mask. One-shot semantic segmentation is a comparably new task; the pioneering work was published in 2017 by Shaban et al. [31], who introduced the Pascal-5i dataset based on Pascal images and labels. Their simple model extracts VGG16 features [32] from a masked support image to generate regression parameters that are applied per-location on the output of an FCN [33] to yield a segmentation. Later works introduce more complex mechanisms to handle one-shot segmentation: The pyramid graph network (PGNet) [34] generates a set of differently-shaped feature maps obtained through adaptive pooling, processes them by individual graph attention units, and passes them through an atrous spatial pyramid pooling (ASPP) block [35]. The CANet network [36] first extracts features from the images using a shared encoder; predictions are then iteratively refined through a sequence of convolutions and ASPP blocks. Several approaches focus on the modeling of prototypes [37, 38, 39]. PFENet [40] uses a prior computed on high-level CNN features to provide an auxiliary segmentation that helps further processing. A weakly-supervised variant introduced by Rakelly et al. [41] requires only sparse annotations in the form of a set of points. In one-shot instance segmentation [42], instead of a binary match/non-match prediction, individual object instances are segmented.

CLIP Extensions  Despite CLIP [8] being fairly new, multiple derivative works across different sub-fields have emerged. CLIP was combined with a GAN to modify images based on a text prompt [43] and used in robotics to generalize to unseen objects in manipulation tasks [44]. Other work focused on understanding CLIP in more detail. In the original CLIP paper [8], it was found that the design of prompts matters for downstream tasks, i.e. instead of using an object name alone as a prompt, adding the prefix "a photo of" increases performance. Zhou et al. [45] propose context optimization (CoOp), which automatically learns tokens that perform well for given downstream tasks. Other approaches rely on CLIP for open-set object detection [46, 47].

3 CLIPSeg Method

We use the visual transformer-based (ViT-B/16) CLIP [8] model as a backbone (Fig. 2) and extend it with a small, parameter-efficient transformer decoder. The decoder is trained on custom datasets to carry out segmentation, while the CLIP encoder remains frozen. A key challenge is to avoid imposing strong biases on predictions during segmentation training while maintaining the versatility of CLIP. We do not use the larger ViT-L/14@336px CLIP variant as its weights were not publicly released as of writing this work.

Decoder Architecture  Considering these demands, we propose CLIPSeg: a simple, purely transformer-based decoder, which has U-Net-inspired skip connections to the CLIP encoder that allow the decoder to be compact (in terms of parameters). While the query image (R^(W×H×3)) is passed through the CLIP visual transformer, activations at certain layers S are read out and projected to the token embedding size D of our decoder. Then, these extracted activations (including the CLS token) are added to the internal activations of our decoder before each transformer block. The decoder has as many transformer blocks as extracted CLIP activations (in our case 3). The decoder generates the binary segmentation by applying a linear projection on the tokens of its transformer's last layer, R^((1 + W/P · H/P) × D) → R^(W×H), where P is the token patch size of CLIP. In order to inform the decoder about the segmentation target, we modulate the decoder's input activation by a conditional vector using FiLM [48]. This conditional vector can be obtained in two ways: (1) using the CLIP text-transformer embedding of a text query and (2) using the CLIP visual transformer on a feature-engineered prompt image. CLIP itself is not trained but only used as a frozen feature extractor. Due to the compact decoder, CLIPSeg has only 1,122,305 trainable parameters for D = 64.
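The decoder described above can be illustrated with a minimal PyTorch sketch. This is not the released implementation: the frozen CLIP token activations are simply passed in as a list, the CLIP token width (768 for ViT-B/16), the conditional-vector width (512) and the number of attention heads are assumptions, and the FiLM placement and read-out are simplified.

```python
import torch
import torch.nn as nn

class CLIPSegDecoderSketch(nn.Module):
    """Sketch of the CLIPSeg decoder: one transformer block per extracted CLIP
    activation, FiLM conditioning at the input, and a per-token linear read-out
    that is rearranged into a W x H logit map."""

    def __init__(self, clip_dim=768, cond_dim=512, d=64, n_layers=3, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        # project each extracted CLIP activation to the decoder token size D
        self.reduces = nn.ModuleList([nn.Linear(clip_dim, d) for _ in range(n_layers)])
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
            for _ in range(n_layers)])
        # FiLM: conditional vector -> per-channel scale and shift
        self.film_mul = nn.Linear(cond_dim, d)
        self.film_add = nn.Linear(cond_dim, d)
        # linear projection from each token to a P x P patch of logits
        self.head = nn.Linear(d, patch_size * patch_size)

    def forward(self, clip_activations, cond, image_size):
        # clip_activations: list of (B, 1 + HW/P^2, clip_dim) token activations at layers S
        # cond: (B, cond_dim) CLIP text or visual embedding of the prompt
        h, w = image_size
        a = None
        for i, (act, reduce, block) in enumerate(zip(clip_activations, self.reduces, self.blocks)):
            a = reduce(act) if a is None else a + reduce(act)   # skip connections from CLIP
            if i == 0:                                          # FiLM on the input activation
                a = self.film_mul(cond).unsqueeze(1) * a + self.film_add(cond).unsqueeze(1)
            a = block(a)
        tok = self.head(a[:, 1:, :])                            # drop CLS token: (B, N, P*P)
        b, p = tok.shape[0], self.patch_size
        tok = tok.reshape(b, h // p, w // p, p, p).permute(0, 1, 3, 2, 4)
        return tok.reshape(b, 1, h, w)                          # binary segmentation logits
```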
The original CLIP is constrained to a fixed image size due to the learned positional embedding. We enable different image sizes (including larger ones) by interpolating the positional embeddings. To validate the viability of this approach, we compare prediction quality for different image sizes and find that for ViT-B/16 performance only decreases for images larger than 350 pixels (see supplementary for details). In our experiments we use CLIP ViT-B/16 with a patch size P of 16 and use a projection dimension of D = 64 if not indicated otherwise. We extract CLIP activations at layers S = [3, 7, 9], consequently our decoder has only three layers.
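A sketch of how learned positional embeddings can be interpolated to a different input resolution is given below; it assumes a ViT-style embedding of shape (1, 1 + g·g, D) with one CLS position followed by a square patch grid, and bicubic resampling is a choice made here rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Interpolate a learned ViT positional embedding (1, 1 + g*g, D) to a
    new square patch grid of size new_grid x new_grid."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    g = int(patch_pos.shape[1] ** 0.5)
    d = patch_pos.shape[2]
    patch_pos = patch_pos.reshape(1, g, g, d).permute(0, 3, 1, 2)        # (1, D, g, g)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, patch_pos], dim=1)

# ViT-B/16 trained at 224 x 224 (14 x 14 patches) applied to 352 x 352 inputs (22 x 22 patches):
# new_embed = resize_pos_embed(old_embed, new_grid=352 // 16)
```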
of" increases performance. Zhou et al. [45] propose context
optimization (CoOp) which automatically learns tokens that Image-Text Interpolation Our model receives informa-
perform well for given downstream tasks. Other approaches tion about the segmentation target (“what to segment?”)
rely on CLIP for open-set object detection [46, 47]. through a conditional vector. This can be provided either by
text or an image (through visual prompt engineering). Since
CLIP uses a shared embedding space for images and text
3 CLIPSeg Method captions, we can interpolate between both in the embedding
space and condition on the interpolated vector. Formally, let
We use the visual transformer-based (ViT-B/16) CLIP [8] si be the embedding of the support image and ti the text em-
model as a backbone (Fig. 2) and extend it with a small, bedding of a sample i, we obtain a conditional vector xi by a
parameter-efficient transformer decoder. The decoder is linear interpolation xi = asi +(1 − a)xi , where a is sampled
trained on custom datasets to carry out segmentation, while uniformly from [0, 1]. We use this randomized interpolation
the CLIP encoder remains frozen. A key challenge is to as a data augmentation strategy during training.
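A minimal sketch of this augmentation; it only assumes that the CLIP image and text embeddings of a training sample are already available as tensors of identical shape.

```python
import torch

def interpolate_conditionals(s_i: torch.Tensor, t_i: torch.Tensor) -> torch.Tensor:
    """Randomized image-text interpolation: x_i = a * s_i + (1 - a) * t_i
    with a drawn uniformly from [0, 1], one value per batch element."""
    a = torch.rand(s_i.shape[0], 1, device=s_i.device)
    return a * s_i + (1 - a) * t_i
```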
3.1 PhraseCut + Visual prompts (PC+)

We use the PhraseCut dataset [20], which encompasses over 340,000 phrases with corresponding image segmentations. Originally, this dataset does not contain visual support but only phrases, and for every phrase a corresponding object exists. We extend this dataset in two ways: visual support samples and negative samples. To add visual support images for a prompt p, we randomly draw from the set of all samples S_p which share the prompt p. In case the prompt
Figure 2: Architecture of CLIPSeg: We extend a frozen CLIP model (red and blue) with a transformer that segments the query image based on either a support image or a support prompt. N CLIP activations are extracted after blocks defined by S. The segmentation transformer and the projections (both green) are trained on PhraseCut or PhraseCut+.
[Figure 3: panel labels recovered from the figure — nothing, animal, sieve, knife, car, jug, trash bin, window, house, bike.]
CLIP modification & extras          | ΔP(object)
CLIP masking CLS in layer 11        |    1.34
CLIP masking CLS in all layers      |    1.71
CLIP masking all in all layers      |  -14.44
dye object red in grayscale image   |    1.21
add red object outline              |    2.29

Background modification             | ΔP(object)
BG intensity 50%                    |    3.08
BG intensity 10%                    |   13.85
BG intensity 0%                     |   23.40
BG blur                             |   13.15
BG blur + intensity 10%             |   21.73

Cropping & combinations             | ΔP(object)
crop large context                  |    6.27
crop                                |   13.60
crop & BG blur                      |   15.34
crop & BG intensity 10%             |   21.73
crop & BG intensity 10% & BG blur   |   23.50

Table 2: Visual prompt engineering: average improvement of object probability (ΔP(object)) for different forms of combining image and mask over 1,600 samples. Cropping means cutting the image according to the regions specified by the mask; "BG" means background.
4 Visual Prompt Engineering

out of all objects present in this image.

CLIP-Based Masking  The straightforward equivalent to masked pooling in a visual transformer is to apply the mask on the tokens. Normally, a visual transformer consists of a fixed set of tokens which can interact at every layer through multi-head attention: a CLS token used for read-out and image-region-related tokens which were originally obtained from image patches. Now, the mask can be incorporated by constraining the interaction at one (e.g. the last layer 11) or more transformer layers to within-mask patch tokens as well as the CLS token only. Our evaluation (Tab. 2, left) suggests that this form of introducing the mask does not work well. By constraining only the interactions with the CLS token (Tab. 2, left, top two rows), a small improvement is achieved (in the last layer or in all layers), while constraining all interactions decreases performance dramatically. From this we conclude that more complex strategies are necessary to combine image and mask internally.

Visual Prompt Engineering  Instead of applying the mask inside the model, we can also combine mask and image into a new image, which can then be processed by the visual transformer. Analogous to prompt engineering in NLP (e.g. in GPT-3 [50]), we call this procedure visual prompt engineering. Since this form of prompt design is novel and the strategies which perform best in this context are unknown, we conduct an extensive evaluation of different variants of designing visual prompts (Tab. 2). We find that the exact form of how mask and image are combined matters a lot. Generally, we identify three image operations that improve the alignment between the object text prompts and the images: decreasing the background brightness, blurring the background (using a Gaussian filter) and cropping to the object. The combination of all three performs best (Tab. 2, last row). We will use this variant in the remainder.
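The best-performing combination (darkening and blurring the background, then cropping to the object with some context) can be sketched as follows; the kernel size, darkening factor and context margin are illustrative values, not the ones used for Tab. 2.

```python
import torch
import torchvision.transforms.functional as TF

def engineer_visual_prompt(image: torch.Tensor, mask: torch.Tensor,
                           bg_intensity: float = 0.1, blur_kernel: int = 21,
                           context: float = 0.1) -> torch.Tensor:
    """Combine a support image (3, H, W, float) and a binary object mask (H, W)
    into one prompt image: blur and darken the background, then crop to the object."""
    bg = bg_intensity * TF.gaussian_blur(image, kernel_size=blur_kernel)
    prompt = torch.where(mask.bool().unsqueeze(0), image, bg)

    ys, xs = torch.nonzero(mask, as_tuple=True)          # bounding box of the object
    h, w = mask.shape
    pad_y, pad_x = int(context * h), int(context * w)
    y0 = int((ys.min() - pad_y).clamp(min=0))
    y1 = int((ys.max() + pad_y).clamp(max=h - 1))
    x0 = int((xs.min() - pad_x).clamp(min=0))
    x1 = int((xs.max() + pad_x).clamp(max=w - 1))
    return prompt[:, y0:y1 + 1, x0:x1 + 1]
```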
5 Experiments

We first evaluate our model on three established segmentation benchmarks before demonstrating the main contribution of our work: flexible few-shot segmentation that can be based on either text or image prompts.

Metrics  Compared to approaches in zero-shot and one-shot segmentation (e.g. [25, 26]), the vocabulary we use is open, i.e. the set of classes or expressions is not fixed. Therefore, throughout the experiments, our models are trained to generate binary predictions that indicate where objects matching the query are located. If necessary, this binary setting can be transformed into a multi-label setting (as we do in Section 5.2).

In segmentation, intersection over union (IoU, also Jaccard score) is a common metric to compare predictions with ground truth. Due to the diversity of the tasks, we employ different forms of IoU: foreground IoU (IoU_FG), which computes IoU on foreground pixels only; mean IoU, which computes the average over foreground IoUs of different classes; and binary IoU (IoU_BIN), which averages over foreground IoU and background IoU. In binary segmentation, IoU requires a threshold t to be specified. While most of the time the natural choice of 0.5 is used, the optimal value can deviate strongly from 0.5 if the probability of an object matching the query differs between training and inference (the a-priori probability of a query matching one or more objects in the scene depends highly on context and dataset). Therefore, we report the performance of one-shot segmentation using thresholds t optimized per task and model. Additionally, we adopt the average precision metric (AP) in all our experiments. Average precision measures the area under the recall-precision curve; it measures how well the system can discriminate matches from non-matches, independent of the choice of threshold.
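A sketch of these metrics for a single query, operating on flat arrays of per-pixel probabilities and binary ground truth; scikit-learn is assumed for the average precision computation, and mean IoU (averaging foreground IoU over classes) is omitted.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else float("nan")

def binary_metrics(prob: np.ndarray, gt: np.ndarray, t: float = 0.5) -> dict:
    """Foreground IoU, binary IoU (mean of foreground and background IoU) and
    threshold-free average precision; prob holds probabilities, gt is boolean."""
    pred = prob >= t
    return {"IoU_FG": iou(pred, gt),
            "IoU_BIN": 0.5 * (iou(pred, gt) + iou(~pred, ~gt)),
            "AP": average_precision_score(gt.astype(int).ravel(), prob.ravel())}
```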
Models and Baselines  In our experiments we differentiate two variants of CLIPSeg: one trained on the original PhraseCut dataset (PC) and one trained on the extended version of PhraseCut which uses 20% negative samples, contains visual samples (PC+) and uses image-text interpolation (Sec. 3). This latter, more robust version we call the universal model. To put the performance of our models into perspective, we provide two baselines:

• CLIP-Deconv encompasses CLIP but uses a very basic decoder, consisting only of the essential parts: FiLM conditioning [48], a linear projection and a deconvolution. This helps us to estimate to which degree CLIP alone is responsible for the results.
• ViTSeg shares the architecture of CLIPSeg but uses an ImageNet-trained visual transformer as a backbone [51]. For encoding text, we use the same CLIP text transformer. This way we learn to which degree the specific CLIP weights are crucial for good performance.

We rely on PyTorch [52] for training and use an image size of 352 × 352 pixels throughout our experiments (for details see appendix).

5.1 Referring Expression Segmentation

We evaluate referring expression segmentation performance (Tab. 3) on the original PhraseCut dataset and compare to scores reported by Wu et al. [20] as well as to the concurrently developed transformer-based MDETR method [21]. For this experiment we trained a version of CLIPSeg on the original PhraseCut dataset (CLIPSeg [PC]) using only text labels, in addition to the universal variant which also includes visual samples (CLIPSeg [PC+]).

Our approaches outperform the two-stage HULANet approach by Wu et al. [20]. In particular, a high-capacity decoder (D = 128) seems to be beneficial for PhraseCut. However, performance is worse than that of MDETR [21], which operates at full image resolution and received two rounds of fine-tuning on PhraseCut. Notably, the ViTSeg baseline performs generally worse than CLIPSeg, which shows that CLIP pre-training is helpful.

                      |  t  | mIoU | IoU_FG |  AP
CLIPSeg (PC+)         | 0.3 | 43.4 |  54.7  | 76.7
CLIPSeg (PC, D = 128) | 0.3 | 48.2 |  56.5  | 78.2
CLIPSeg (PC)          | 0.3 | 46.1 |  56.2  | 78.2
CLIP-Deconv           | 0.3 | 37.7 |  49.5  | 71.2
ViTSeg (PC+)          | 0.1 | 28.4 |  35.4  | 58.3
ViTSeg (PC)           | 0.3 | 38.9 |  51.2  | 74.4
MDETR [21]            |  -  | 53.7 |   -    |  -
HulaNet [20]          |  -  | 41.3 |  50.8  |  -
Mask-RCNN top [20]    |  -  | 39.4 |  47.4  |  -
RMI [20]              |  -  | 21.1 |  42.5  |  -

Table 3: Referring expression segmentation performance on PhraseCut (t refers to the binary threshold).

5.2 Generalized Zero-Shot Segmentation

In generalized zero-shot segmentation, test images contain categories that have never been seen before in addition to known categories. We evaluate the model's zero-shot segmentation performance using the established Pascal-VOC benchmark (Tab. 4). It contains five splits involving 2 to 10 unseen classes (we report only 4 and 10 unseen classes). The latter is the most challenging setting as the set of unseen classes is large. Since our model was trained on foreground/background segmentation, we cannot directly use it in a multi-label setting. Therefore, we employ a simple adaptation: our model predicts a binary map independently for each of the 20 Pascal classes. Across all 20 predictions we determine the class with the highest probability for each pixel.
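A sketch of this adaptation; `predict` is a hypothetical helper that returns an (H, W) foreground probability map for an image and a class-name prompt, and the standard 20 Pascal-VOC class names are used as prompts.

```python
import torch

PASCAL_CLASSES = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car",
                  "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
                  "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

def multi_label_prediction(predict, image) -> torch.Tensor:
    """Run one binary prediction per Pascal class and assign each pixel to the
    class with the highest probability."""
    probs = torch.stack([predict(image, name) for name in PASCAL_CLASSES])  # (20, H, W)
    return probs.argmax(dim=0)                                              # (H, W) class index map
```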
We train on PhraseCut+ but remove the unseen Pascal classes from the dataset. This is carried out by assigning the Pascal classes to WordNet synsets [2] and generating a set of invalid words by traversing hyponyms (e.g. different dog breeds for dog). Prompts that contain such a word are removed from the dataset.
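A sketch of this filtering with NLTK's WordNet interface; the synset assignments shown for the unseen-4 classes are illustrative and may differ from the ones actually used.

```python
from nltk.corpus import wordnet as wn   # requires the WordNet corpus: nltk.download("wordnet")

def invalid_words(synset_names):
    """Collect the lemma names of the given synsets and of all their hyponyms,
    e.g. different dog breeds for dog."""
    words = set()
    for name in synset_names:
        synset = wn.synset(name)
        for syn in [synset] + list(synset.closure(lambda s: s.hyponyms())):
            words.update(lemma.replace("_", " ") for lemma in syn.lemma_names())
    return words

banned = invalid_words(["airplane.n.01", "cow.n.01", "motorcycle.n.01", "sofa.n.01"])
# drop every PhraseCut prompt that contains one of these words:
# prompts = [p for p in prompts if not any(w in p.lower() for w in banned)]
```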
The idea of conducting this experiment is to provide a reference for the zero-shot performance of our universal model. It should not be considered as competing in this benchmark, as we use a different training setup (CLIP pre-training, binary segmentation on PhraseCut). The results (Tab. 4) indicate a major gap between seen and unseen classes in models trained on Pascal-VOC, while our models tend to be more balanced. This is due to other models being trained exclusively on the 10 or 16 seen Pascal classes, in contrast to CLIPSeg, which can differentiate many more classes (or phrases). In fact, our model performs better on unseen classes than on seen ones. This difference is likely because the seen classes are generally harder to segment: for the unseen-4 setting, the unseen classes are "airplane", "cow", "motorbike" and "sofa", all of which are large and comparatively distinct objects.

                    | pre-train. | unseen-10 mIoU_S | unseen-10 mIoU_U | unseen-4 mIoU_S | unseen-4 mIoU_U
CLIPSeg (PC+)       | CLIP       |      35.7        |      43.1        |      20.8       |      47.3
CLIP-Deconv (PC+)   | CLIP       |      25.1        |      36.7        |      25.9       |      41.9
ViTSeg (PC+)        | IN         |       4.2        |      19.0        |       6.0       |      24.8
SPNet [27]          | IN         |      59.0        |      18.1        |      67.3       |      21.8
ZS3Net [25]         | IN-seen    |      33.9        |      18.1        |      66.4       |      23.2
CSRL [53]           | IN-seen    |      59.2        |      21.0        |      69.8       |      31.7
CaGNet [54]         | IN         |       -          |       -          |      69.5       |      40.2
OSR [30]            | IN-seen    |      72.1        |      33.9        |      75.0       |      44.1
JoEm [28]           | IN-seen    |      63.4        |      22.5        |      67.0       |      33.4

Table 4: Zero-shot segmentation performance on Pascal-VOC with 10 and 4 unseen classes. mIoU_S and mIoU_U indicate performance on seen and unseen classes, respectively. Our model is trained on PhraseCut with the Pascal classes being removed but uses a pre-trained CLIP backbone. IN-seen indicates ImageNet pre-training with unseen classes being removed.

5.3 One-Shot Semantic Segmentation

In one-shot semantic segmentation, a single example image along with a mask is presented to the network. Regions that pertain to the class highlighted in the example image must be found in a query image. Compared to the previous tasks, we cannot rely on a text label but must understand the provided support image. Above (Sec. 4) we identified the best method for visual prompt design, which we use here:
cropping out the target object while blurring and darkening the background. To remove classes that overlap with the respective subset of Pascal during training, we use the same method as in the previous section (Sec. 5.2). Unlike in zero-shot segmentation, ImageNet pre-trained backbones are common in one-shot segmentation [40, 37]. PFENet particularly leverages pre-training by using high-level feature similarity as a prior. Similarly, HSNet [55] processes correlated activations of query and support image using 4D convolutions at multiple levels.

On Pascal-5i we find that our universal model CLIPSeg (PC+) achieves competitive performance (Tab. 5) among state-of-the-art methods, with only the very recent HSNet performing better. The results on COCO-20i (Tab. 6) show that CLIPSeg also works well when trained on datasets other than PhraseCut(+). Again HSNet performs better. To put this in perspective, it should be considered that HSNet (and PFENet) are explicitly designed for one-shot segmentation, rely on pre-trained CNN activations and cannot handle text by default: Tian et al. [40] extended PFENet to zero-shot segmentation (but used the one-shot protocol) by replacing the visual sample with word vectors [1, 56] of text labels. In that case, CLIPSeg outperforms their scores by a large margin (Tab. 7).

                      |  t  | vis. backb. | mIoU | IoU_BIN |  AP
CLIPSeg (PC+)         | 0.3 | ViT (CLIP)  | 59.5 |  75.0   | 82.3
CLIPSeg (PC)          | 0.3 | ViT (CLIP)  | 52.3 |  69.5   | 72.4
CLIP-Deconv (PC+)     | 0.2 | ViT (CLIP)  | 48.0 |  65.8   | 68.0
ViTSeg (PC+)          | 0.2 | ViT (IN)    | 39.0 |  59.0   | 62.4
PPNet [39]            |  -  | RN50        | 52.8 |  69.2   |  -
RePRI [57]            |  -  | RN50        | 59.7 |   -     |  -
PFENet [40]           |  -  | RN50        | 60.2 |  73.3   |  -
HSNet [55]            |  -  | RN50        | 64.0 |  76.7   |  -
PPNet [39]            |  -  | RN101       | 55.2 |  70.9   |  -
RePRI [57]            |  -  | RN101       | 59.4 |   -     |  -
PFENet [40]           |  -  | RN101       | 59.6 |  72.9   |  -
HSNet [55]            |  -  | RN101       | 66.2 |  77.6   |  -

Table 5: One-shot performance on Pascal-5i (CLIPSeg and ViTSeg trained on PhraseCut+).

                      |  t  | vis. backb. | mIoU | IoU_BIN |  AP
CLIPSeg (COCO)        | 0.1 | ViT (CLIP)  | 33.2 |  58.4   | 40.5
CLIPSeg (COCO+N)      | 0.1 | ViT (CLIP)  | 33.3 |  59.1   | 41.7
CLIP-Deconv (COCO+N)  | 0.1 | ViT (CLIP)  | 29.8 |  56.8   | 40.8
ViTSeg (COCO)         | 0.1 | ViT (IN)    | 14.4 |  46.1   | 15.7
PPNet [39]            |  -  | RN50        | 29.0 |   -     |  -
RePRI [57]            |  -  | RN50        | 34.0 |   -     |  -
PFENet [40]           |  -  | RN50        | 35.8 |   -     |  -
HSNet [55]            |  -  | RN50        | 39.2 |  68.2   |  -
HSNet [55]            |  -  | RN101       | 41.2 |  69.1   |  -

Table 6: One-shot performance on COCO-20i (CLIPSeg trained on PhraseCut); +N indicates 10% negative samples.

Pascal-5i             |  t  | vis. backb. | mIoU | IoU_BIN |  AP
CLIPSeg (PC+)         | 0.3 | ViT (CLIP)  | 72.4 |  83.1   | 93.5
CLIPSeg (PC)          | 0.3 | ViT (CLIP)  | 70.3 |  81.6   | 84.8
CLIP-Deconv (PC+)     | 0.3 | ViT (CLIP)  | 63.2 |  77.3   | 85.3
ViTSeg (PC+)          | 0.2 | ViT (IN)    | 39.0 |  59.0   | 62.4
LSeg [58]             |  -  | ViT (CLIP)  | 52.3 |  67.0   |  -
PFENet [40]           |  -  | VGG16       | 54.2 |   -     |  -

Table 7: Zero-shot performance on Pascal-5i. The scores were obtained by following the evaluation protocol of one-shot segmentation but using text input.

5.4 One Model For All: Generalized Prompts

We have shown that CLIPSeg performs well on a variety of academic segmentation benchmarks. Next, we evaluate its performance "in the wild", in unseen situations.

Qualitative Results  In Fig. 4 we show qualitative results divided into two groups: (1, left) affordance-like [59, 60] ("generalized") prompts that are different from the descriptive prompts of PhraseCut and (2, right) prompts that were taken from the PhraseCut test set. For the latter we add challenging extra prompts involving an existing object but the wrong color (indicated in orange). Generalized prompts, which deviate from the PhraseCut training set by referring to actions ("something to ...") or rare object classes ("cutlery"), work surprisingly well given that the model was not trained on such cases. It has learned an intuition of stuff that can be stored away in cupboards, of where sitting is possible, and of what "living creature" means. Rarely, false positives are generated (the bug in the salad is not a cow). Details in the prompt are reflected by the segmentation (blue boxes) and information about the color strongly influences predicted object probabilities (orange box).
Systematic Analysis  To quantitatively assess the performance for generalized queries, we construct subsets of the LVIS test dataset containing only images of classes that correspond to affordances or attributes. Then we ask our model to segment with these affordances or attributes as prompts. For instance, we compute the foreground intersection over union between armchair, sofa and loveseat objects when "sit on" is used as the prompt. A complete list of which affordances or attributes are mapped onto which objects can be found in the appendix. We find (Tab. 8) that the CLIPSeg version trained on PC+ performs better than the CLIP-Deconv baseline and than the version trained on LVIS, which contains only object labels instead of complex phrases. This result suggests that both dataset variability and model complexity are necessary for generalization. ViTSeg performs worse, which is expected as it misses the strong CLIP backbone, known for its generalization capabilities.
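For a single affordance this evaluation can be sketched as follows; `predict` is a hypothetical helper returning a binary mask for an image and prompt, and the samples are assumed to come with per-class ground-truth masks for the mapped LVIS categories (the full mapping is listed in the appendix).

```python
import numpy as np

SIT_ON_CLASSES = ["armchair", "sofa", "loveseat"]      # classes mapped to the prompt "sit on"

def affordance_iou_fg(predict, samples) -> float:
    """Foreground IoU between predictions for "sit on" and the union of the
    ground-truth masks of the mapped object classes."""
    scores = []
    for image, gt_masks in samples:                    # gt_masks: dict class -> (H, W) bool mask
        gt = np.zeros_like(next(iter(gt_masks.values())), dtype=bool)
        for cls in SIT_ON_CLASSES:
            if cls in gt_masks:
                gt |= gt_masks[cls]
        pred = predict(image, "sit on")                # binary (H, W) prediction
        union = np.logical_or(pred, gt).sum()
        if union > 0:
            scores.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(scores))
```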
Figure 4: Qualitative predictions of CLIPSeg (PC+) for various prompts, darkness indicates prediction strength. The generalized prompts
(left) deviate from the PhraseCut prompts as they involve action-related properties or new object names.
text, our experiments, in particular the comparison to the ImageNet-based ViTSeg baseline, highlight the power of foundation models like CLIP for solving several tasks at once.

Limitations  Our experiments are limited to only a small number of benchmarks; in future work, more modalities such as sound and touch could be incorporated. We depend on a large-scale dataset (CLIP) for pre-training. Note that we do not use the best-performing CLIP model ViT-L/14@336px due to weight availability. Furthermore, our model focuses on images; an application to video might suffer from missing temporal consistency. Image size may vary, but only within certain limits (for details see supplementary).

Broader Impact  There is a chance that the model replicates dataset biases from PhraseCut, but especially from the unpublished CLIP training dataset. Provided models should be used carefully and not in tasks depicting humans. Our approach enables adaptation to new tasks without energy-intensive training.

References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.

[2] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[3] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems (NIPS), 2017.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020.

[6] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[7] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

[8] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.

[11] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.

[12] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[13] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.

[14] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. arXiv preprint arXiv:2105.05633, 2021.

[15] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision (ECCV), 2016.

[16] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In International Conference on Computer Vision (ICCV), 2017.

[17] Hengcan Shi, Hongliang Li, Fanman Meng, and Q. Wu. Key-word-aware network for referring expression image segmentation. In European Conference on Computer Vision (ECCV), 2018.

[18] Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image segmentation via recurrent refinement networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[19] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[20] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. PhraseCut: Language-based image segmentation in the wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[21] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. MDETR - modulated detection for end-to-end multi-modal understanding. arXiv, 2021.

[22] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), 2020.

[23] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision (ECCV), 2016.

[24] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[25] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. Zero-shot semantic segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[26] Peike Li, Yunchao Wei, and Yi Yang. Consistent structural relation learning for zero-shot segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[27] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[28] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In International Conference on Computer Vision (ICCV), 2021.

[29] Ping Hu, Stan Sclaroff, and Kate Saenko. Uncertainty-aware learning for zero-shot semantic segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[30] Hui Zhang and Henghui Ding. Prototypical matching and open set rejection for zero-shot semantic segmentation. In International Conference on Computer Vision (ICCV), 2021.

[31] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. In British Machine Vision Conference (BMVC), 2017.

[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[34] Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In International Conference on Computer Vision (ICCV), 2019.

[35] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.

[36] Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[37] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In International Conference on Computer Vision (ICCV), 2019.

[38] Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In European Conference on Computer Vision (ECCV), 2020.

[39] Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision (ECCV), 2020.

[40] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.

[41] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alexei A. Efros, and Sergey Levine. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373, 2018.

[42] Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge, and Alexander S. Ecker. One-shot instance segmentation. arXiv, 2018.

[43] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In International Conference on Computer Vision (ICCV), 2021.

[44] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. arXiv preprint arXiv:2109.12098, 2021.
[45] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.

[46] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.

[47] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei Shu. Zero-shot open set detection by extending CLIP. arXiv preprint arXiv:2109.02748, 2021.

[54] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and Liqing Zhang. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1921–1929, 2020.

[57] Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, and Jose Dolz. Few-shot segmentation without meta-learning: A good transductive inference is all you need? arXiv preprint arXiv:2012.06166, 2020.

[58] Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=RriDjddCLN.

[59] James Jerome Gibson. The Senses Considered as Perceptual Systems. Houghton Mifflin, 1966.

[60] James J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
Appendix

Experimental Setup

Throughout our experiments we use PyTorch [52] with CLIP ViT-B/16 [8]. We train on PhraseCut [20] for 20,000 iterations on batches of size 64 with an initial learning rate of 0.001 (for ViTSeg 0.0001), which decays following a cosine learning rate schedule to 0.0001 (without warmup). We use automatic mixed precision and binary cross entropy as the only loss function.
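The setup above corresponds roughly to the following skeleton; the optimizer choice (Adam) and the data loading are assumptions, only the iteration count, learning rates, schedule, mixed precision and loss follow the text.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train(model, loader, iterations=20_000, lr=1e-3, lr_min=1e-4, device="cuda"):
    """20k iterations on batches of size 64 (set in the loader), cosine decay of the
    learning rate to 1e-4, automatic mixed precision, binary cross entropy loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=iterations, eta_min=lr_min)
    scaler = GradScaler()
    data = iter(loader)
    for _ in range(iterations):
        try:
            images, cond, targets = next(data)   # query image, conditional vector, float mask
        except StopIteration:
            data = iter(loader)
            images, cond, targets = next(data)
        opt.zero_grad()
        with autocast():
            logits = model(images.to(device), cond.to(device))
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, targets.to(device))
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        sched.step()
    return model
```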
Image-size Dependency of CLIP

Since multi-head attention does not require a fixed number of tokens, the visual transformer of CLIP can handle inputs of arbitrary size. However, the publicly available CLIP models (ViT-B/16 and ViT-B/32) were trained on 224 × 224 pixel images. In this experiment we investigate how CLIP performance relates to the input image size – measured in a classification task. To this end, we extract the CLS token vector in the last layer from both CLIP models. Using this feature vector as an input, we train a logistic regression classifier on a subset of ImageNet [4] classes differentiating 67 classes of vehicles (Fig. 5). Our results indicate that CLIP generally handles large image sizes well, with the 16-px-patch version (ViT-B/16) showing a slightly better performance at an optimal image size of around 350 × 350 pixels.
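A sketch of this probe; `clip_cls_features` is a hypothetical helper that returns the last-layer CLS token vectors of the CLIP visual transformer for images resized to the probed size, while the classifier itself is plain scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_image_size(clip_cls_features, train_imgs, train_labels,
                     test_imgs, test_labels, image_size: int) -> float:
    """Train a logistic regression classifier on frozen CLIP CLS features extracted
    at a given input image size and return its test accuracy."""
    x_train = clip_cls_features(train_imgs, image_size)   # (N, D) array
    x_test = clip_cls_features(test_imgs, image_size)
    clf = LogisticRegression(max_iter=1000).fit(x_train, train_labels)
    return float(np.mean(clf.predict(x_test) == np.asarray(test_labels)))

# accuracy_by_size = {s: probe_image_size(clip_cls_features, Xtr, ytr, Xte, yte, s)
#                     for s in (224, 288, 352, 416)}
```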
[Figure 5: classification accuracy as a function of input image size for CLIP ViT-B/16 and ViT-B/32.]

Affordances:
sit on: armchair, sofa, loveseat, deck chair, rocking chair, highchair, deck chair, folding chair, chair, recliner, wheelchair
drink from: bottle, beer bottle, water bottle, wine bottle, thermos bottle
ride on: horse, pony, motorcycle

Attributes:
can fly: eagle, jet plane, airplane, fighter jet, bird, duck, gull, owl, seabird, pigeon, goose, parakeet
can be driven: minivan, bus (vehicle), cab (taxi), jeep, ambulance, car (automobile)
can swim: duck, duckling, water scooter, penguin, boat, kayak, canoe

Meronymy (part-of relations):
has wheels: dirt bike, car (automobile), wheelchair, motorcycle, bicycle, cab (taxi), minivan, bus (vehicle), cab (taxi), jeep, ambulance
has legs: armchair, sofa, loveseat, deck chair, rocking chair, highchair, deck chair, folding chair, chair, recliner, wheelchair, horse, pony, eagle, bird, duck, gull, owl, seabird, pigeon, goose, parakeet, dog, cat, flamingo, penguin, cow, puppy, sheep, black sheep, ostrich, ram (animal), chicken (animal), person

Average Precision Computation

The average precision metric has the advantage of not depending on a fixed threshold. This is particularly useful when new classes occur which lead to uncalibrated predictions. Instead of operating on bounding boxes as in detection,
Figure 6: Qualitative predictions of CLIPSeg (PC+) (top, same as Fig. 4 of main paper for reference) and ViTSeg (PC) (bottom).
[Figure: performance [AP] for different text prompt forms — "a photo of a <label>", "an image of <label>", "a photo of <label>", "<label>".]
[Figure: performance [AP] across object categories (vehicles, animals, person, stuff) and binned intervals ([0, 0.05] … [0.3, 0.5]).]