Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation (CVPR 2024)
Abstract
CLIP has demonstrated marked progress in visual recognition due to its powerful pre-training on large-scale image-text pairs. However, a critical challenge remains: how to transfer image-level knowledge into pixel-level understanding tasks such as semantic segmentation. In this paper, to address this challenge, we analyze the gap between the capability of the CLIP model and the requirements of the zero-shot semantic segmentation task. Based on our analysis and observations, we propose a novel method for zero-shot semantic segmentation, dubbed CLIP-RC (CLIP with Regional Clues), which brings two main insights. On the one hand, a region-level bridge is necessary to provide fine-grained semantics. On the other hand, overfitting should be mitigated during the training stage. Benefiting from the above discoveries, CLIP-RC achieves state-of-the-art performance on various zero-shot semantic segmentation benchmarks, including PASCAL VOC, PASCAL Context, and COCO-Stuff 164K. Code will be available at https://github.com/Jittor/JSeg.

Figure 1. By employing a region-level bridge, CLIP-RC extends zero-shot capabilities from the image level to the pixel level, thus bridging the gap between image-level recognition and pixel-level semantic segmentation.

1. Introduction

As a foundation task in computer vision, semantic segmentation [4, 8, 11–13, 26, 36, 40, 44, 50, 51] aims to assign each pixel a semantic class. Limited by technical methods and labeling costs, traditional segmenters can only process scenarios with a limited number of classes; in other words, they can only handle the seen classes in the training set. When it comes to unseen classes, traditional methods are powerless. However, in practical situations, encountering previously unseen classes is unavoidable, which poses challenges to the segmenter. To solve this problem, researchers have proposed a new research paradigm, called zero-shot semantic segmentation (ZS3) [1, 2, 43], which requires a model trained on seen classes to generalize well to unseen classes. In this paper, we focus on the ZS3 setting.

The rapid development of ZS3 tasks benefits from the progress of foundation models, especially the CLIP [34] model. The CLIP model is trained on large-scale image-text pairs and shows powerful zero-shot image recognition capability, which is the foundation of ZS3 tasks. In addition to zero-shot recognition, the ability to localize at the pixel level is also crucial for accomplishing ZS3 tasks, and this is an attribute that CLIP lacks. To make up for the shortcomings of CLIP in localization, researchers have proposed two streams of approaches: one-stage methods [47, 55] and two-stage methods [7, 14, 33, 46, 52].

Two-stage methods first generate initial mask proposals using class-agnostic mask generators and then assign them semantic information using the CLIP model. Due to the introduction of an additional class-agnostic segmentation model, these methods have a heavy computation cost. In this paper, we focus on one-stage methods, which avoid the extra computing overhead and fine-tune the CLIP model for ZS3 directly. There are two critical factors during the fine-tuning process. First, transferring image-level understanding features to pixel-level understanding features is the key to finishing the segmentation task. Second, during fine-tuning, models tend to recognize only the classes seen in the tuning process (a.k.a. catastrophic forgetting), which undermines the model's zero-shot recognition capability. The differences highlighted above are the gaps between the capabilities of the CLIP model and the requirements of ZS3 tasks. Aiming to bridge these gaps, we present a new method, CLIP-RC.

* Work conducted during an internship at Tsinghua University.
[Figure 2: region-level classification with CLIP. Each panel shows the class of the entire image (e.g., 'kite', 'sand', 'bus') together with the class CLIP predicts for each cell of a 4 × 4 grid of regions (e.g., sky-other, clouds, sand, playingfield, person).]

CLIP-RC is motivated by an important observation. As demonstrated in Fig. 2, we simply test CLIP's capacity for region-level classification and find that CLIP performs strongly in regional recognition. Region-level recognition is a finer-grained capability than image-level recognition and is closer to the pixel-level segmentation task. Thus, we believe it can serve as a suitable bridge connecting CLIP and ZS3 tasks. Building on this, we introduce a new approach for ZS3 named CLIP-RC (CLIP with Regional Clues). This method leverages regional clues, bridging the gap between image-level and pixel-level understanding.
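To make the observation in Fig. 2 concrete, the sketch below shows one simple way to probe CLIP's region-level recognition: split an image into a grid, crop each cell, and classify it against a set of candidate class names. This is only an illustrative probe using the OpenAI `clip` package, not the exact evaluation protocol of the paper; the class list, grid size, and prompt template are assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Candidate classes are illustrative only.
class_names = ["kite", "sand", "clouds", "sky", "sea", "person"]
text = clip.tokenize([f"A photo of a {c}." for c in class_names]).to(device)

@torch.no_grad()
def classify_regions(image_path, grid=4):
    """Split the image into a grid x grid layout and classify each cell with CLIP."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    labels = []
    for row in range(grid):
        for col in range(grid):
            box = (col * w // grid, row * h // grid,
                   (col + 1) * w // grid, (row + 1) * h // grid)
            crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
            img_feat = model.encode_image(crop)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            sim = (img_feat @ text_feat.T).squeeze(0)   # cosine similarity per class
            labels.append(class_names[sim.argmax().item()])
    return labels  # grid*grid predicted region labels, row-major order

# Example usage: print(classify_regions("example.jpg"))
```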
[Figure 3(a)-(c) diagram: a frozen CLIP text encoder embeds prompts of the form "A photo of a {CLASS}." into the class embeddings T; the image encoder (frozen backbone with learnable prompts) yields the global feature G, the image features I, and the region features R; the Region Alignment Module (upsample, concat, linear) and a decoder of stacked multi-head cross-attention blocks (×3) produce the segmentation output.]
Figure 3. The framework of our proposed CLIP-RC. (a) The image is input into the image encoder, which then yields the image feature I,
the region category information feature R extracted via region-level bridging, and the global feature G. Subsequently, these three feature
sets are aligned and combined with the text embedding T to generate the regional relationship descriptors. Finally, a decoder for semantic segmentation utilizes these features to infer and generate a segmentation map. To prevent overfitting during training, the recovery
decoder and recovery loss are employed. (b) The region alignment module. (c) Detail of the decoder architecture.
The first stage generates mask proposals with a class-agnostic mask generator, and the second stage employs the CLIP model for zero-shot classification of the masked image sections. MaskCLIP+ [52] improves upon previous ZS3 methods significantly by incorporating pseudo-labeling and self-training. However, these two-stage methods can be computationally expensive. To avoid this, ZegCLIP [55] innovates further by combining visual prompt tuning with relationship descriptors for a one-stage ZS3 inference process. In addition, the task of open-vocabulary semantic segmentation, which is similar to the ZS3 task, has also been explored through various methods [3, 21, 24, 25, 41, 42, 45, 47]. It is worth mentioning that CLIPSelf [42] has a similar motivation to ours; the difference is that it transfers regional features to the student model through knowledge distillation. In this paper, we explore the ZS3 approach from two distinct perspectives, aiming to identify and bridge the existing gap between current single-stage methods and ZS3.

3. Method

3.1. Method Overview

As illustrated in Fig. 3(a), CLIP-RC has three key components: the Region-Level Bridge (RLB), the Region Alignment Module (RAM), and the Recovery Decoder with Recovery Loss (RDL). First, as detailed in Sec. 3.2, the image is fed into the encoder to obtain global features, image features, and region category features extracted by the RLB. In Sec. 3.3, we use the RAM to align these feature sets and create image features with finer-grained categorical features. The features are also merged with the text embedding to obtain regional relationship descriptors. Following this, a decoder for semantic segmentation uses these features to predict the segmentation map. To minimize overfitting during training, the RDL, discussed in Sec. 3.4, is used to ensure a balance between learning task-specific features and general knowledge.

3.2. Region-Level Bridge

In our approach, the Region-Level Bridge (RLB) is the key element connecting image-level and pixel-level representations. Specifically, as illustrated in Fig. 4, the RLB captures the regional category features of the image and facilitates classification at a regional granularity.

Formally, we construct the input for the first ViT layer of CLIP's visual encoder as shown in Eq. (1), where each element represents a distinct aspect of the input data:

\mathbf{X}^0 = [\mathbf{G}^0, \mathbf{P}^0, \mathbf{I}^0, \mathbf{R}^0], \quad (1)
Figure 4. Illustration of the RLB within the image encoder layers. The input consists of a [CLS] token, deep prompt tokens, image features, and the RLB (from top to bottom). Each token in the RLB is responsible for a distinct image region, thereby yielding finer-grained categorical features about image regions. Upon completion of the inference process, the image features are obtained, along with the [CLS] token that contains global image features, and the regional category features.

Figure 5. Visualization of the attention mask in CLIP. (a) Each token in the RLB is responsible for a 2 × 2 region within the 4 × 4 image features. (b) The [CLS] token and deep prompt tuning tokens each interact with all image features, including interactions among themselves. (c) Visualization of the attention mask during self-attention, where white blocks indicate masking and no interaction.
where G^0 ∈ R^{1×D} represents the [CLS] token of the visual encoder, which is designed to capture the global image feature, and D is the feature dimension. P^0 ∈ R^{K×D} represents the K deep prompt tuning tokens [20] for the first layer, selected from the deep prompt tuning token set P ∈ R^{L×K×D}, which are task-specific learnable parameters added to the input space. I^0 ∈ R^{N×D} denotes the image features with added positional encoding. Lastly, R^0 ∈ R^{M×D} denotes the initial RLB with M tokens. Each token in R^0 adopts the weights of G^0 as its initial weights, serving as a bridge between image-level and pixel-level features.
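As a minimal sketch of Eq. (1), the snippet below assembles the input sequence of the first ViT layer from the [CLS] token G^0, the K deep prompt tokens P^0, the N image tokens I^0, and M RLB tokens R^0, with each RLB token initialized from the [CLS] token as described above. The function is a simplified stand-in, not the released implementation, and the example dimensions (D, K, N, M) are assumptions.

```python
import torch

def build_first_layer_input(cls_token, prompt_tokens, image_tokens, num_regions):
    """
    cls_token:     (1, D)  - G^0, the [CLS] token
    prompt_tokens: (K, D)  - P^0, learnable deep prompt tokens for layer 0
    image_tokens:  (N, D)  - I^0, patch embeddings with positional encoding
    num_regions:   M       - number of RLB tokens
    Returns X^0 of shape (1 + K + N + M, D).
    """
    # Each RLB token starts from the [CLS] weights, so it inherits
    # CLIP's image-level classification behaviour before fine-tuning.
    rlb_tokens = cls_token.expand(num_regions, -1).clone()   # R^0, (M, D)
    return torch.cat([cls_token, prompt_tokens, image_tokens, rlb_tokens], dim=0)

# Example with D=768, K=10, N=1024 (32x32 patches), M=16 region tokens:
D, K, N, M = 768, 10, 1024, 16
x0 = build_first_layer_input(torch.zeros(1, D),
                             torch.nn.Parameter(torch.zeros(K, D)),
                             torch.randn(N, D), M)
print(x0.shape)  # torch.Size([1051, 768])
```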
Following the construction of the input, we proceed with the extraction of image features. For a sequence of image features of length N, the original size is √N × √N × D. We utilize masked attention to extract the regional category features of image regions through the RLB. As illustrated in Fig. 5, the attention mask Mask ∈ R^{E×E} defines the computational direction of the RLB, where E = 1 + K + N + M. Each token in the RLB is responsible for extracting features from (√N/√M) × (√N/√M) patches. In this way, each token in the RLB is responsible for one region of the original whole feature map, which is an intermediate granularity between the image level and the pixel level. This feature extraction method is almost the same as the feature extraction CLIP performs when classifying images. After obtaining the attention mask, the masked attention operations within the visual encoder can be formulated as:

\mathbf{X}^{l+1} = \mathcal{V}^{l+1}_{\mathrm{MHSA}}(\mathbf{X}^l) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_v}} + \mathbf{Mask}\right)V, \quad (2)

where X^l and X^{l+1} are the input and output of the (l+1)-th transformer layer V^{l+1}_{MHSA}, respectively. It is important to note that, to ensure the generalizability of the RLB to unseen classes, its gradient is not updated.
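The additive attention mask of Eq. (2) can be sketched as follows: allowed interactions get 0 and blocked interactions get negative infinity before the softmax. Only the RLB-to-region constraint is taken from the text and Fig. 5; how the [CLS], prompt, and image rows are masked in the released code is an assumption here.

```python
import math
import torch

def build_rlb_attention_mask(K, N, M, neg=float("-inf")):
    """
    Build the (E, E) additive attention mask of Eq. (2), with E = 1 + K + N + M.
    Token order matches Eq. (1): [CLS] | K prompts | N image patches | M RLB tokens.
    0 = interaction allowed, -inf = masked out before the softmax.
    """
    E = 1 + K + N + M
    mask = torch.zeros(E, E)

    # Assumption: [CLS], prompt, and image tokens keep ordinary full attention among
    # themselves but do not attend to the RLB tokens, so the RLB does not perturb them.
    mask[: 1 + K + N, 1 + K + N:] = neg

    # Each RLB token only attends to the patches of its own region
    # ((sqrt(N)/sqrt(M)) x (sqrt(N)/sqrt(M)) patches), plus itself.
    side, r_side = math.isqrt(N), math.isqrt(M)
    patch_per_region = side // r_side
    mask[1 + K + N:, :] = neg                      # block everything first
    for m in range(M):
        rr, rc = divmod(m, r_side)                 # region row/col in the r_side x r_side grid
        for pr in range(rr * patch_per_region, (rr + 1) * patch_per_region):
            start = 1 + K + pr * side + rc * patch_per_region
            mask[1 + K + N + m, start:start + patch_per_region] = 0.0
        mask[1 + K + N + m, 1 + K + N + m] = 0.0   # let the RLB token attend to itself
    return mask

# Example matching Fig. 5: a 4x4 patch grid (N=16), 2x2 regions (M=4), and K=2 prompts.
print(build_rlb_attention_mask(K=2, N=16, M=4).shape)  # torch.Size([23, 23])
```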
The final outputs of the visual encoder, denoted as X^L, are given by:

\mathbf{X}^L = [\mathbf{G}, \_, \mathbf{I}, \mathbf{R}], \quad (3)

where G signifies the global category feature of the image given by the [CLS] token, I denotes the image features extracted by the visual encoder, and R contains the region category features produced by the RLB.

3.3. Region Alignment Module

Building upon the outputs of the CLIP visual encoder, we introduce the Region Alignment Module (RAM), as illustrated in Fig. 3(b). This module is specifically designed to further align the different feature sets. By aligning G, which gathers the global image context, with the region-specific R and the image features I, the multi-scale spatial features and fine-grained category features present in the input data can be fully utilized.

To align the feature sets, we reshape R back to a dimension of √M × √M × D, representing the category features of regions in the image. We then upsample both G, containing the global image category features, and R to the size of the image features, and concatenate them with I:

\mathbf{\hat{I}} = \operatorname{Concat}(\operatorname{Upsample}(\mathbf{G}), \operatorname{Upsample}(\mathbf{R}), \mathbf{I}). \quad (4)

Further, to generate text embeddings with generalized capabilities for unseen classes, we employ the Relationship Descriptor [55]. This involves integrating the priors of the RLB and the global [CLS] token into the text embeddings, resulting in multiple robust text embeddings with different regional priors, i.e., regional relationship descriptors.
We first fuse the RLB with the global [CLS] token:

\mathbf{R}_a = \operatorname{Upsample}(\mathbf{G}) + \mathbf{R}, \quad (5)

where R_a is of shape M × D, and then:

\mathbf{\hat{R}} = \operatorname{Concat}[\mathbf{R}_t, \mathbf{T}] = \operatorname{Concat}[\mathbf{T} \odot \mathbf{R}_a, \mathbf{T}], \quad (6)

where T ∈ R^{C×D} represents the original text embeddings extracted via CLIP's text encoder, and C is the number of classes. The resulting R̂ ∈ R^{M×C×2×D} are the regional relationship descriptors.
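A compact sketch of Eqs. (4)-(6) is given below: G and the reshaped R are resized to the spatial size of the image features and concatenated with I, and the regional relationship descriptors are formed from T and R_a. The shapes follow the notation above, but the interpolation mode and the broadcasting of the single global vector G are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def region_alignment(G, R, I):
    """
    G: (D,)    global [CLS] feature;  R: (M, D) RLB region features;  I: (N, D) image features.
    Returns I_hat of shape (H, W, 3*D) as in Eq. (4) and R_a of shape (M, D) as in Eq. (5).
    """
    D = G.shape[-1]
    H = W = int(I.shape[0] ** 0.5)
    m = int(R.shape[0] ** 0.5)
    I_sp, R_sp = I.view(H, W, D), R.view(m, m, D)

    def upsample(x, h, w):                       # bilinear resize of an (h0, w0, D) map
        x = x.permute(2, 0, 1).unsqueeze(0)      # (1, D, h0, w0)
        x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
        return x.squeeze(0).permute(1, 2, 0)     # (h, w, D)

    G_up = G.view(1, 1, D).expand(H, W, D)                          # broadcast the global feature
    I_hat = torch.cat([G_up, upsample(R_sp, H, W), I_sp], dim=-1)   # Eq. (4)
    R_a = G.unsqueeze(0) + R                                        # Eq. (5): fuse [CLS] into each region
    return I_hat, R_a

def regional_relationship_descriptors(T, R_a):
    """Eq. (6): R_hat has shape (M, C, 2, D) = [T * R_a, T]."""
    prod = T.unsqueeze(0) * R_a.unsqueeze(1)                        # (M, C, D)
    rep = T.unsqueeze(0).expand_as(prod)                            # (M, C, D)
    return torch.stack([prod, rep], dim=2)                          # (M, C, 2, D)

# Example: D=512, N=1024 patches, M=16 regions, C=20 classes.
I_hat, R_a = region_alignment(torch.randn(512), torch.randn(16, 512), torch.randn(1024, 512))
R_hat = regional_relationship_descriptors(torch.randn(20, 512), R_a)
print(I_hat.shape, R_hat.shape)   # (32, 32, 1536) (16, 20, 2, 512)
```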
3.4. Recovery Decoder With Recovery Loss

To mitigate overfitting in zero-shot learning with a CLIP-based model, we introduce the training-only Recovery Decoder with Recovery Loss (RDL). The RDL focuses on balancing task-specific adaptation with the retention of general knowledge by adding extra constraints during the tuning process.

Initially, we deploy a decoder for semantic segmentation, as shown in Fig. 3(c). The aligned image features Î and the region-specific text queries R̂, derived from Sec. 3.3, pass through a linear layer that aligns them in dimension D. They are then processed as follows:

\mathbf{\hat{I}}_d, \mathbf{\hat{R}}_d = \mathcal{D}_{\mathrm{MHCA}}(\mathbf{\hat{I}}, \mathbf{\hat{R}}), \; \mathcal{D}'_{\mathrm{MHCA}}(\mathbf{\hat{R}}, \mathbf{\hat{I}}), \quad (7)

where D_MHCA and D'_MHCA denote the semantic segmentation decoder with multi-head cross-attention, and Î_d ∈ R^{N×D} and R̂_d ∈ R^{M×C×D} are the image features and region-specific text queries, respectively, used for segmentation. The segmentation map Output ∈ R^{C×N} is obtained by averaging the outputs:

\mathbf{Output}_i = \mathbf{\hat{R}}_d \mathbf{\hat{I}}_d^{T} \quad \text{for all } i \in \{1, \ldots, M\}. \quad (8)

Then, during training, a recovery decoder recovers the features extracted by the decoder into features with strong generalization. The network architecture of the recovery decoder is identical to that of the semantic segmentation decoder. The features are recovered as follows:

\mathbf{\hat{I}}_r, \mathbf{\hat{R}}_r = \mathcal{RD}_{\mathrm{MHCA}}(\mathbf{\hat{I}}_d, \mathbf{\hat{R}}_d), \; \mathcal{RD}'_{\mathrm{MHCA}}(\mathbf{\hat{R}}_d, \mathbf{\hat{I}}_d), \quad (9)

where Î_r ∈ R^{N×D} and R̂_r ∈ R^{M×C×D} represent the features after recovery of I and R̂, respectively. To ensure that the outputs of the recovery decoder, composed of RD_MHCA and RD'_MHCA, align well with the original features from the backbone network, we propose a recovery loss for use with the recovery decoder. This recovery loss is designed to help the decoder for semantic segmentation strike a balance between learning the specifics of the task at hand and maintaining a broad base of general knowledge, thereby alleviating the problem of overfitting. The equation for this loss is:

\mathcal{L}_{\text{recovery}} = \sum_{i=1}^{n} \left|\mathbf{\hat{I}}_{\mathbf{r}i} - \mathbf{I}_i\right| + \sum_{i=1}^{n} \left|\mathbf{\hat{R}}_{\mathbf{r}i} - \mathbf{\hat{R}}_i\right|. \quad (10)
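The recovery branch of Eqs. (9) and (10) can be sketched as follows: a second decoder maps the task-adapted features back toward the frozen backbone features, and an L1 penalty keeps the two close. The decoder below is a plain multi-head cross-attention stand-in with toy shapes, not the multi-layer module of Fig. 3(c).

```python
import torch
import torch.nn as nn

class CrossAttnDecoder(nn.Module):
    """Minimal stand-in for D_MHCA / RD_MHCA: queries attend to keys/values."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, context):
        out, _ = self.attn(query, context, context)
        return out

def recovery_loss(I_r, I, R_r, R_hat):
    """Eq. (10): L1 distance between recovered features and the original ones."""
    return (I_r - I).abs().sum() + (R_r - R_hat).abs().sum()

# Toy shapes: N=1024 image tokens, 320 flattened text queries, D=512.
D = 512
I_d    = torch.randn(1, 1024, D)    # segmentation image features (after the main decoder)
R_d    = torch.randn(1, 320, D)     # region-specific text queries (after the main decoder)
I_orig = torch.randn(1, 1024, D)    # frozen backbone image features I
R_orig = torch.randn(1, 320, D)     # original regional relationship descriptors R_hat

recover_img  = CrossAttnDecoder(D)  # stand-in for RD_MHCA
recover_text = CrossAttnDecoder(D)  # stand-in for RD'_MHCA
I_r = recover_img(I_d, R_d)         # Eq. (9), image branch
R_r = recover_text(R_d, I_d)        # Eq. (9), text-query branch
print(recovery_loss(I_r, I_orig, R_r, R_orig))
```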
3.5. Loss Function

To further reduce overfitting during training, we use a method called Non-mutually Exclusive Loss (NEL) [55]. This method combines Sigmoid activation with Binary Cross-Entropy (BCE) loss, allowing the probabilities of different classes to be predicted independently.

Additionally, we incorporate the recovery loss discussed in Sec. 3.4. The total loss our model minimizes is a combination of these two losses:

\mathcal{L} = \alpha \cdot \mathcal{L}_{\text{NEL}} + \beta \cdot \mathcal{L}_{\text{recovery}}, \quad (11)

where α and β are weights that balance the contributions of the NEL and the recovery loss, respectively.
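A short sketch of the training objective: per-class sigmoid plus BCE for the NEL term, so class probabilities are predicted independently, combined with the weighted recovery term of Eq. (11). The exact NEL formulation follows [55]; the alpha and beta values here are placeholders, not values reported in this section.

```python
import torch
import torch.nn.functional as F

def nel_loss(logits, target):
    """
    Non-mutually Exclusive Loss sketch: sigmoid + binary cross-entropy per class.
    logits: (B, C, H, W) segmentation logits; target: (B, C, H, W) binary masks.
    """
    return F.binary_cross_entropy_with_logits(logits, target.float())

def total_loss(logits, target, l_recovery, alpha=1.0, beta=1.0):
    """Eq. (11): L = alpha * L_NEL + beta * L_recovery (alpha, beta assumed here)."""
    return alpha * nel_loss(logits, target) + beta * l_recovery

# Example
logits = torch.randn(2, 20, 64, 64)
target = (torch.rand(2, 20, 64, 64) > 0.9)
print(total_loss(logits, target, l_recovery=torch.tensor(0.5)))
```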
4. Experiments

4.1. Dataset

PASCAL VOC 2012 provides an augmented training set of 10,582 images and a validation set of 1,449 images. In our work, we exclude the background class and divide the 20 available classes into 15 seen classes and 5 unseen classes.

COCO-Stuff 164K covers 80 thing classes, 91 stuff classes, and a single class designated for unlabeled elements. It comprises a training subset of 118,287 images and a validation subset of 5,000 images. The dataset is divided into 156 seen classes and 15 unseen classes.

PASCAL Context contains 59 foreground classes and a "background" class. The training and validation sets contain 4,996 and 5,104 images, respectively. The dataset is divided into 50 seen classes (including "background") and 10 unseen classes.

4.2. Implementation Details

Our experiments were conducted using the Jittor [18] and PyTorch [32] frameworks, with code based on the MMSegmentation library [6]. We used the ViT-B/16 model from CLIP and trained on 8 NVIDIA RTX 3090 GPUs. The batch size was set to 16 for all datasets, with an input image size of 512 × 512. In the inductive setting, we trained on the PASCAL VOC 2012, PASCAL Context, and COCO-Stuff 164K datasets for 40K, 40K, and 80K iterations, respectively. For the transductive setting, we loaded the checkpoint from the middle of the inductive training for each dataset and then trained for 20K, 20K, and 40K iterations, respectively.
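For convenience, the settings of Sec. 4.2 can be summarized as a configuration sketch; the dictionary below only restates the stated hyperparameters, and the field names are illustrative, not the keys of the released MMSegmentation config.

```python
# Training configuration restated from Sec. 4.2 (field names are illustrative only).
CLIP_RC_CONFIG = {
    "backbone": "CLIP ViT-B/16",
    "frameworks": ["Jittor", "PyTorch"],     # code built on MMSegmentation
    "gpus": "8x NVIDIA RTX 3090",
    "batch_size": 16,
    "crop_size": (512, 512),
    "inductive_iters": {"pascal_voc": 40_000, "pascal_context": 40_000, "coco_stuff_164k": 80_000},
    # Transductive runs start from the mid-point inductive checkpoint, then continue training.
    "transductive_iters": {"pascal_voc": 20_000, "pascal_context": 20_000, "coco_stuff_164k": 40_000},
}
```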
Table 1. Comparison with the SOTA methods in the inductive setting.

Methods          | PASCAL VOC 2012            | PASCAL Context             | COCO-Stuff 164K
                 | pAcc  hIoU  mIoU(S) mIoU(U)| pAcc  hIoU  mIoU(S) mIoU(U)| pAcc  hIoU  mIoU(S) mIoU(U)
SPNet [43]       |  -    26.1   78.0    15.6  |  -     -      -       -    |  -    14.0   35.2    8.7
ZS3 [2]          |  -    28.7   77.3    17.7  | 52.8  15.8   20.8    12.7  |  -    15.0   34.7    9.5
CaGNet [10]      | 80.7  39.7   78.4    26.6  |  -    21.2   24.1    18.5  | 65.6  18.2   33.5   12.2
SIGN [5]         |  -    41.7   75.4    28.9  |  -     -      -       -    |  -    20.9   32.3   15.5
Joint [1]        |  -    45.9   77.7    32.5  |  -    20.5   33.0    14.9  |  -     -      -      -
ZegFormer [7]    |  -    73.3   86.4    63.6  |  -     -      -       -    |  -    34.8   36.6   33.2
zsseg [46]       | 90.0  77.5   83.5    72.5  |  -     -      -       -    | 60.3  37.8   39.3   36.3
DeOP [14]        |  -    80.8   88.2    74.6  |  -     -      -       -    |  -    38.2   38.0   38.4
ZegCLIP [55]     | 94.6  84.3   91.9    77.8  | 76.2  49.9   46.0    54.6  | 62.0  40.8   40.2   41.4
CLIP-RC (Ours)   | 95.8  88.4   92.8    84.4  | 76.2  51.9   47.5    57.3  | 63.1  41.2   40.9   41.6
4.3. Evaluation Protocol

Following the established methodology for ZS3 [43, 52, 55], all classes C of a dataset are divided into a group of seen classes C_S and a group of unseen classes C_U, with C_S ∩ C_U = ∅. During training, only the seen classes C_S have labels. Furthermore, in the inductive setting of ZS3, the model is trained without any knowledge of the unseen classes C_U, including their labels and names. This setting closely mirrors practical inference scenarios, where the model may be tested on classes not seen during training. In contrast, in the transductive setting of ZS3, the names of the unseen classes are known before testing. This setting can enhance performance on the unseen classes and reduce the dependence on data annotation in practical scenarios.

For the evaluation metrics, we measure our model's performance using standard segmentation metrics: mean Intersection-over-Union (mIoU) and pixel-wise classification accuracy (pAcc). mIoU is reported separately for seen (mIoU(S)) and unseen (mIoU(U)) classes. Additionally, the harmonic mean IoU (hIoU) provides a balanced evaluation of the model's performance on both seen and unseen classes, computed as:

\mathrm{hIoU} = \frac{2 \times \mathrm{mIoU(S)} \times \mathrm{mIoU(U)}}{\mathrm{mIoU(S)} + \mathrm{mIoU(U)}}. \quad (12)
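For reference, the harmonic mean of Eq. (12) is computed as below; a quick check with the CLIP-RC numbers from Tab. 1 (mIoU(S) = 92.8, mIoU(U) = 84.4 on PASCAL VOC 2012) gives approximately 88.4.

```python
def hiou(miou_seen: float, miou_unseen: float) -> float:
    """Harmonic mean IoU of Eq. (12), balancing seen and unseen performance."""
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)

print(round(hiou(92.8, 84.4), 1))  # 88.4, matching the PASCAL VOC entry in Tab. 1
```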
4.4. Comparison with the State-of-the-Art

Our proposed method, CLIP-RC, has been extensively evaluated on various benchmarks, displaying outstanding performance in both the inductive and transductive settings. Its ability to effectively transfer segmentation capabilities to unseen classes proves its effectiveness for ZS3 tasks.

In the inductive setting, detailed in Tab. 1, CLIP-RC outperforms the current SOTA model, ZegCLIP [55], by a significant margin. This is particularly true in handling unseen classes on the PASCAL VOC 2012 dataset, where our hIoU reaches 88.4%, a notable 4.1% improvement. The enhancement is most apparent in the recognition of unseen classes, underlining the improved segmentation capability of our method. Similar superiority is observed on the PASCAL Context and COCO-Stuff 164K datasets. We also showcase visual results from the COCO-Stuff 164K dataset in Fig. 6, where CLIP-RC accurately distinguishes between various unseen classes, such as 'playing field', 'cloud', and 'cardboard', among others.

In the transductive setting, as detailed in Tab. 2, CLIP-RC uses self-training to achieve groundbreaking results. It attains an hIoU of 93.0% on the PASCAL VOC 2012 dataset, surpassing the previous best model by 1.9%. This strong performance is also evident on the PASCAL Context and COCO-Stuff 164K datasets. Notably, in the transductive setting, CLIP-RC outperforms the SOTA trained in fully supervised environments through self-training. This highlights the effectiveness of our method in accurately identifying seen classes and successfully generalizing to unseen classes.

To demonstrate the upper limit of our model's capabilities ...
Figure 6. Visualization results on the COCO-Stuff 164K dataset. Columns from left to right: (1) the test images, (2) results using the current SOTA method [55], (3) results using CLIP-RC (Ours), and (4) ground truth. The red tags indicate unseen classes.
Table 5. Ablation studies of key components on PASCAL VOC 2012: RLB signifies the Region-Level Bridge, RAM denotes the Region Alignment Module, and RDL represents the Recovery Decoder with Recovery Loss.

RLB | RAM | RDL | pAcc | hIoU | mIoU(S) | mIoU(U)
 ✗  |  ✗  |  ✗  | 94.6 | 84.3 |  91.9   |  77.8
 ✓  |  ✗  |  ✗  | 95.2 | 85.4 |  91.5   |  80.0
 ✓  |  ✓  |  ✗  | 95.4 | 87.4 |  92.5   |  82.9
 ✓  |  ✓  |  ✓  | 95.8 | 88.4 |  92.8   |  84.4

Table 6. Ablation studies of the region-level bridge on PASCAL VOC 2012 (number of RLB tokens and whether their gradients are updated).

Number of Tokens | Update | GFLOPs | pAcc | hIoU | mIoU(S) | mIoU(U)
       16        |   ✓    | 126.7  | 95.8 | 87.9 |  93.3   |  83.1
        1        |   ✗    | 119.8  | 95.3 | 86.9 |  92.5   |  82.0
        4        |   ✗    | 121.2  | 95.6 | 87.3 |  92.6   |  82.6
       16        |   ✗    | 126.7  | 95.8 | 88.4 |  92.8   |  84.4
       64        |   ✗    | 148.9  | 96.0 | 89.1 |  93.0   |  85.6
      256        |   ✗    | 238.2  | 96.0 | 89.2 |  92.7   |  86.0
     1024        |   ✗    | 609.2  |            OOM

Table 7. Ablation studies of the region alignment module on PASCAL VOC 2012.

[CLS] token | RLB | pAcc | hIoU | mIoU(S) | mIoU(U)
     ✗      |  ✗  | 94.9 | 85.6 |  91.5   |  80.5
     ✓      |  ✗  | 95.4 | 86.6 |  92.1   |  82.0
     ✗      |  ✓  | 95.7 | 87.0 |  92.9   |  81.8
     ✓      |  ✓  | 95.8 | 88.4 |  92.8   |  84.4

Table 8. Ablation studies of the recovery decoder and recovery loss on PASCAL VOC 2012.

Loss Type | Number of Layers | pAcc | hIoU | mIoU(S) | mIoU(U)
 L1 Loss  |        1         | 95.8 | 88.4 |  92.8   |  84.4
 L2 Loss  |        1         | 95.2 | 86.7 |  91.8   |  81.9
 KD Loss  |        1         | 95.4 | 86.4 |  92.2   |  81.2
 L1 Loss  |        0         | 95.0 | 86.6 |  91.3   |  82.1
 L1 Loss  |        2         | 95.4 | 87.1 |  92.4   |  82.4
 L1 Loss  |        3         | 95.3 | 87.0 |  91.6   |  82.8
In Tab. 5, we benchmarked our model against the previous SOTA [55] to measure the impact of each module on our model's overall effectiveness. The results demonstrate that each component significantly enhances our model's performance on ZS3 tasks.

Region-Level Bridge (RLB). Our ablation study in Tab. 6 examines whether updating the RLB improves performance and what the ideal number of RLB tokens is. We found that updating the RLB can lead to overfitting on seen classes, negatively affecting performance on unseen classes. The study also investigated how changing the number of tokens affects the RLB: increasing the number of tokens generally enhances performance up to a certain point, beyond which issues such as out-of-memory (OOM) errors occur, as we observed with 1024 tokens. To balance effectiveness and computational efficiency, we use 16 tokens.

Region Alignment Module (RAM). Our investigation of the RAM examines which features should be preserved during the fusion specified in Eq. (4). The results in Tab. 7 indicate that aligning and fusing both the [CLS] token and the RLB with the image's original features is most effective; the combined use of these tokens yields the best results.

Recovery Decoder with Recovery Loss (RDL). In our exploration of the recovery decoder, we evaluated the impact of different recovery loss types and of the number of decoder layers on model performance in Tab. 8. Among the loss functions considered, L1 loss emerged as the most effective. As for the number of layers in the recovery decoder, increasing the depth does not necessarily improve results; a balance between task-specific knowledge and general knowledge must be reached. Additionally, we considered removing the recovery decoder entirely and computing the L1 loss directly between the features used for segmentation and the original features. However, this direct application impaired the model's fitting ability and led to decreased performance.

5. Conclusion

In this work, we presented CLIP-RC, a novel one-stage method for ZS3. Our approach bridges the gap between image-level classification and pixel-level segmentation by introducing regional clues, demonstrating the potential of leveraging the finer-grained recognition capabilities of CLIP for ZS3 tasks. By integrating a recovery decoder and recovery loss, we addressed the issue of overfitting, striking a balance between maintaining the inherent generalization abilities of the CLIP model and adapting it to the ZS3 task. Our experimental results show that CLIP-RC not only performs robustly on seen classes but also exhibits remarkable performance on unseen classes. This dual capability is crucial for real-world applications where encountering novel classes is common. We hope this work paves the way for more efficient and effective solutions in the rapidly growing field of ZS3.

Limitations. CLIP-RC can refine the granularity of regions, as shown in Tab. 6. Finer granularity can improve IoU to some extent, but it also increases the computational load. Furthermore, since this work is based on CLIP, the performance of the method also depends on the effectiveness of CLIP's pre-training for vision-language alignment.

Acknowledgement. This work was supported by the National Key Research and Development Program of China (project No. 2021ZD0112902), the National Natural Science Foundation of China (project Nos. 62220106003 and 62372025), the Research Grant of the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology, and the Fundamental Research Funds for the Central Universities.
References

[1] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In ICCV, pages 9516–9525, 2021.
[2] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. Zero-shot semantic segmentation. In NeurIPS, pages 466–477, 2019.
[3] Junbum Cha, Jonghwan Mun, and Byungseok Roh. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In CVPR, pages 11165–11174, 2023.
[4] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 833–851, 2018.
[5] Jiaxin Cheng, Soumyaroop Nandi, Prem Natarajan, and Wael Abd-Almageed. SIGN: Spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In ICCV, pages 9536–9546, 2021.
[6] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[7] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In CVPR, pages 11573–11582, 2022.
[8] Lixue Gong, Yiqun Zhang, Yunke Zhang, Yin Yang, and Weiwei Xu. Erroneous pixel prediction for semantic image segmentation. Computational Visual Media, 8(1):165–175, 2022.
[9] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
[10] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and Liqing Zhang. Context-aware feature generation for zero-shot semantic segmentation. In ACM MM, pages 1921–1929, 2020.
[11] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. SegNeXt: Rethinking convolutional attention design for semantic segmentation. In NeurIPS, 2022.
[12] Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, and Shi-Min Hu. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell., 45(5):5436–5447, 2023.
[13] Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, and Shi-Min Hu. Visual attention network. Computational Visual Media, 9(4):733–752, 2023.
[14] Cong Han, Yujie Zhong, Dengjie Li, Kai Han, and Lin Ma. Open-vocabulary semantic segmentation with decoupled one-pass network. In ICCV, pages 1086–1096, 2023.
[15] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. ACM Trans. Graph., 41(4):161:1–161:19, 2022.
[16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799, 2019.
[17] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[18] Shi-Min Hu, Dun Liang, Guo-Ye Yang, Guo-Wei Yang, and Wen-Yang Zhou. Jittor: A novel deep learning framework with meta-operators and unified graph execution. Sci. China Inf. Sci., 63(12), 2020.
[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
[20] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, pages 709–727, 2022.
[21] Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Diffusion models for zero-shot open-vocabulary segmentation. arXiv preprint arXiv:2306.09316, 2023.
[22] Naoki Kato, Toshihiko Yamasaki, and Kiyoharu Aizawa. Zero-shot semantic segmentation via variational mapping. In ICCV Workshops, pages 1363–1370, 2019.
[23] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, pages 15144–15154, 2023.
[24] Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Open-vocabulary object segmentation with diffusion models. In ICCV, pages 7633–7642, 2023.
[25] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In CVPR, pages 7061–7070, 2023.
[26] Zheng Lin, Zhao Zhang, Ziyue Zhu, Deng-Ping Fan, and Xialei Liu. Sequential interactive image segmentation. Computational Visual Media, 9(4):753–765, 2023.
[27] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9):195:1–195:35, 2023.
[28] Xianglong Liu, Shihao Bai, Shan An, Shuo Wang, Wei Liu, Xiaowei Zhao, and Yuqing Ma. A meaningful learning method for zero-shot semantic segmentation. Sci. China Inf. Sci., 66(11), 2023.
[29] Timo Lüddecke and Alexander S. Ecker. Image segmentation using text and image prompts. In CVPR, pages 7076–7086, 2022.
[30] Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2Mesh: Text-driven neural stylization for meshes. In CVPR, pages 13482–13492, 2022.
[31] Giuseppe Pastore, Fabio Cermelli, Yongqin Xian, Massimiliano Mancini, Zeynep Akata, and Barbara Caputo. A closer look at self-training for zero-label semantic segmentation. In CVPR Workshops, pages 2693–2702, 2021.
[32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
[33] Jie Qin, Jie Wu, Pengxiang Yan, Ming Li, Yuxi Ren, Xuefeng Xiao, Yitong Wang, Rui Wang, Shilei Wen, Xin Pan, and Xingang Wang. FreeSeg: Unified, universal and open-vocabulary image segmentation. In CVPR, pages 19446–19455, 2023.
[34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[35] Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. CLIP-Forge: Towards zero-shot text-to-shape generation. In CVPR, pages 18582–18592, 2022.
[36] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
[37] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3D Gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482, 2023.
[38] Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. ReCLIP: A strong zero-shot baseline for referring expression comprehension. In ACL, pages 5198–5215, 2022.
[39] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields. In CVPR, pages 3825–3834, 2022.
[40] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022.
[41] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, et al. Towards open vocabulary learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 2024.
[42] Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. CLIPSelf: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403, 2023.
[43] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt Schiele, and Zeynep Akata. Semantic projection network for zero- and few-label semantic segmentation. In CVPR, pages 8256–8265, 2019.
[44] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, José M. Álvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, pages 12077–12090, 2021.
[45] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas M. Breuel, Jan Kautz, and Xiaolong Wang. GroupViT: Semantic segmentation emerges from text supervision. In CVPR, pages 18113–18123, 2022.
[46] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In ECCV, pages 736–753, 2022.
[47] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. SAN: Side adapter network for open-vocabulary semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 45(12):15546–15561, 2023.
[48] Hui Zhang and Henghui Ding. Prototypical matching and open set rejection for zero-shot semantic segmentation. In ICCV, pages 6954–6963, 2021.
[49] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
[50] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 6230–6239, 2017.
[51] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, and Li Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, pages 6881–6890, 2021.
[52] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from CLIP. In ECCV, pages 696–712, 2022.
[53] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16795–16804, 2022.
[54] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. Int. J. Comput. Vis., 130(9):2337–2348, 2022.
[55] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation. In CVPR, pages 11175–11185, 2023.