
Evaluating MaPLe for image recognition

Mohamed Rissal Hedna


Intelligent Adaptive Systems, University of Hamburg

Abstract
Vision-language models have outperformed traditional language models on a multitude of metrics. Their success stems from coupling the text and image modalities, which yields strong performance across a variety of datasets and mimics the way humans learn from their environment. These models also generalize well across different tasks, but they remain sensitive to the choice of prompts they are given and limited in the quality of the responses they produce. Several methods have been proposed to address these shortcomings, but none of them targeted both modalities at the same time. The Multi-modal Prompt Learning approach (MaPLe) [1] tackles this specific issue: it proposes a tight coupling between the language and vision branches through learnable coupling functions that are trained end-to-end on top of a frozen CLIP model.

Introduction

Vision-language models have been able to outperform regular language models mainly by incorporating additional information gained from coupling the vision and language modalities and projecting them into a shared space, where the text information closely supervises the image embedding. This alignment between the two modalities enables the model to learn more advanced concepts. Models such as CLIP [3] (Contrastive Language-Image Pre-training) have shown a great ability to generalize over large datasets. Language inputs of the form "a photo of <category>" are passed through a language encoder to obtain a language embedding, which is matched against the visual embedding produced by an image encoder and then used to predict the target class. However, these models require the tedious task of hand-crafting prompts for each image, and they cannot be fine-tuned for downstream tasks without forgetting the information gained during pre-training.

Current research proposes prompt-learning techniques that eliminate the need for manual prompt writing and offer ways to adapt V-L models while keeping their weights frozen, so as to guard previously learned knowledge. Nevertheless, this method is incomplete, since it only targets the language encoder and ignores the visual one. Given the nature of CLIP, a multimodal approach had to be taken, and this is at the core of MaPLe's (Multi-modal Prompt Learning) architecture, which fine-tunes the CLIP model to create maximal alignment between the text and image embeddings. This is achieved by prompting both the text and image encoders and introducing a coupling function that allows the two to interact, while the rest of the weights in the model remain frozen. The approach was clearly shown to be more successful than uni-modal ones, outperforming them on a number of metrics. It surpassed the previous state of the art, Co-CoOp [2], on a variety of generalization tasks and demonstrated better robustness when evaluated on different datasets.

We evaluated the model using the base-to-novel generalization benchmark in a zero-shot setting by dividing the Imagenet dataset into base and novel classes. The MaPLe model was trained on the base classes and evaluated on both the base and novel classes.
Figure 1: Comparison of MaPLe with standard prompt learning methods. (a) Existing methods adopt uni-modal prompting techniques to fine-tune CLIP representations, as prompts are learned only in a single branch of CLIP (language or vision). (b) MaPLe introduces branch-aware hierarchical prompts that adapt both the language and vision branches simultaneously for improved generalization. (c) MaPLe surpasses state-of-the-art methods on 11 diverse image recognition datasets on the novel class generalization task.

Related work

Visual-language models are revolutionary compared to traditional classifiers because of their ability to combine the image and text modalities and exploit the information contained in both to perform well on few-shot and zero-shot classification. Successful examples of such V-L models are CLIP [3], ALIGN [4], LiT [5], FILIP [6] and Florence [7], but they all suffer from the difficulty of fine-tuning them for downstream tasks. These models are trained on large amounts of available web data, ranging from 400 million to 1 billion samples. However, fully retraining them on specialized data would make them forget previously acquired information, leading to decreased performance and worse generalization.

Alternative solutions prompted the text encoder with additional hand-crafted information about the images. Since this is a tedious task, the NLP approach of prompt learning was adopted by later research to automate the process during fine-tuning, by allowing the model to learn those prompts through additional training. But to date there were no works that combined both modalities in the process. For instance, CoOp [8] optimizes a set of prompts only in the language branch, and CoCoOp [9] improved on this idea by conditioning the prompts on image content, allowing it to generalize better. These approaches, however, were implemented in either the language or the visual branch independently, while MaPLe takes both into consideration and couples the branches to create a shared prompt learning process, by which a tighter representation of the downstream task is shared between the text and image encoders, while keeping the rest of the model frozen to preserve previous knowledge.

Method

The main question addressed in this paper is: "Given the multimodal nature of CLIP, is complete prompting better suited to adapt CLIP?". The authors attempt to answer it by proposing an architecture that allows CLIP to be fine-tuned for downstream tasks without retraining the whole model. This is done through context optimization with prompting. Learnable prompts are appended to the text encoder and then conditioned on the image encoder through a coupling function, which is also learnable. This scheme is repeated independently for multiple transformer blocks, and the prompts and coupling functions are the only learnable components of the architecture (Fig. 2).

The approach for building MaPLe is mainly based on CLIP as a V-L encoder. CLIP encodes an image of dimensions (H x W x C), with C being the number of channels (3 in their case), through an image encoder, accompanied by a text description fed in turn to a text encoder.
Traditional CLIP components

The image encoder divides the image into M-sized patches, which are projected to patch embeddings Ei and then passed, together with a learnable class token Ci, as inputs to the transformer blocks Vi+1, being processed sequentially by K transformer blocks. The final image representation is obtained by projecting the output of the Kth transformer block VK into the latent V-L space using ImageProj.

The language encoder divides the text into tokens and projects them into text embeddings Wi = [w0, w1, ..., wn]i. At each step, each embedding is passed into the (i+1)th transformer block. The final representation is built by projecting the output of the last encoder block into the V-L latent space using TextProj.

Zero-shot classification is then performed by manually adding class labels to text prompts (e.g., 'a photo of a <category>'). The predicted class ŷ for an image embedding x is the one whose text embedding z yields the highest cosine similarity score, with a temperature τ, according to the following equation:

p(ŷ | x) = exp(sim(x, z_ŷ)/τ) / Σ_{i=1}^{C} exp(sim(x, z_i)/τ)
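To make this concrete, the following is a minimal sketch of temperature-scaled cosine-similarity classification, assuming the image and text features have already been projected into the shared V-L space; tensor names and shapes are illustrative, not taken from the MaPLe codebase.

import torch
import torch.nn.functional as F

def zero_shot_probs(image_feat: torch.Tensor,
                    text_feats: torch.Tensor,
                    tau: float = 0.01) -> torch.Tensor:
    """Compute p(y | x) for one image against C class prompts.

    image_feat: (d,)   projected image embedding x
    text_feats: (C, d) projected text embeddings z_i, one per class prompt
    tau:        softmax temperature
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    x = F.normalize(image_feat, dim=-1)
    z = F.normalize(text_feats, dim=-1)
    sims = z @ x                          # (C,) values of sim(x, z_i)
    return torch.softmax(sims / tau, dim=-1)

# Example: 512-d embeddings, 1000 Imagenet classes (dimensions assumed).
probs = zero_shot_probs(torch.randn(512), torch.randn(1000, 512))
pred = probs.argmax().item()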
The MaPLe approach combines the previous components to leverage the power of multimodal prompt learning, which, as shown in Fig. ??, allows the classes to be more separable than with other approaches. The method also adds prompt-learning blocks in the deeper layers of the CLIP model for both the vision and language branches, while keeping each prompt layer independent from the others, allowing the overall structure to benefit from the information contained in the model.

Deep prompting

For the text encoder, MaPLe introduces additional learnable prompt tokens prepended to the text embeddings, resulting in the vector [P1, P2, ..., Pb, W0], where the Pi are the learnable tokens and W0 is the original text input, W0 = [w1, w2, ..., wn]. This structure is added at each language encoder layer up to a depth J, with J < K, K being the number of transformer blocks. It is worth noting that when J = 1, MaPLe is equivalent to CoOp.

For the vision branch, the same deep prompting technique is used by introducing a number of learnable tokens alongside the input embeddings. One crucial finding is that introducing shared prompts in the deeper layers works better than independent prompts across transformer blocks, because it allows the model to learn shared features.
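A minimal sketch of how deep prompting of the language branch could be wired is shown below, assuming a stack of frozen transformer blocks and b learnable prompt tokens per layer up to depth J; module and variable names are illustrative, not from the official MaPLe implementation.

import torch
import torch.nn as nn

class DeepPromptedTextEncoder(nn.Module):
    def __init__(self, blocks: nn.ModuleList, d_l: int, b: int, J: int):
        super().__init__()
        self.blocks = blocks            # K transformer blocks, assumed frozen
        self.J = J                      # prompting depth, J < K
        # One set of b learnable prompt tokens per prompted layer.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(b, d_l) * 0.02) for _ in range(J)]
        )

    def forward(self, w0: torch.Tensor) -> torch.Tensor:
        # w0: (n, d_l) token embeddings of the original text input
        x = w0
        b = self.prompts[0].shape[0]
        for i, block in enumerate(self.blocks):
            if i < self.J:
                # Replace the previous layer's prompt tokens with new
                # learnable ones: [P1, ..., Pb, W_i].
                x = torch.cat([self.prompts[i], x[b:] if i > 0 else x], dim=0)
            x = block(x)
        return x

# Example with placeholder blocks (real CLIP blocks would be frozen layers).
enc = DeepPromptedTextEncoder(
    nn.ModuleList([nn.Identity() for _ in range(12)]), d_l=512, b=2, J=9)
out = enc(torch.randn(77, 512))   # (2 + 77, 512) after prompting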

Figure 2: Overview of the proposed MaPLe (Multi-modal Prompt Learning) framework for prompt learning
in V-L models. MaPLe tunes both vision and language branches where only the context prompts are learned,
while the rest of the model is frozen. MaPLe conditions the vision prompts on language prompts via a V-L
coupling function F to induce mutual synergy between the two modalities. Our framework uses deep contextual
prompting where separate context prompts are learned across multiple transformer blocks.

The main goal of using prompt tuning in V-L models is to achieve completeness by fine-tuning both the vision and the language branches simultaneously with their respective prompts. One simple way to do this would be to train each set of prompts independently, which would be technically complete but would not allow the branches to interact with each other, leading to sub-optimal performance because the prompts learned in the deeper transformer blocks would be less correlated. The paper calls this method "Independent V-L Prompting". MaPLe introduces better synergy between the vision and language branches by adding prompt-learning blocks to the language transformer blocks up to depth J, then projecting the prompts to the vision branch through a coupling function Fk such that

P̃k = Fk(Pk)

The coupling function is a linear layer that maps from the dimension dl of the language branch to the dimension dv of the vision branch. This is the main component that shares the gradients across both modalities during learning, and it improves on the naive "Independent V-L Prompting" scheme.
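A minimal sketch of such a coupling step is given below, assuming language prompts of dimension d_l are mapped to vision prompts of dimension d_v by one linear map per prompted layer; names and dimensions are illustrative, not taken from the official implementation.

import torch
import torch.nn as nn

class VLCoupling(nn.Module):
    """Per-layer linear coupling F_k: language prompts -> vision prompts."""

    def __init__(self, d_l: int, d_v: int, depth: int):
        super().__init__()
        self.F = nn.ModuleList([nn.Linear(d_l, d_v) for _ in range(depth)])

    def forward(self, language_prompts: list[torch.Tensor]) -> list[torch.Tensor]:
        # language_prompts[k]: (b, d_l) learnable tokens at language layer k
        # Returns the vision prompts P~_k = F_k(P_k), one tensor per layer.
        return [f(p) for f, p in zip(self.F, language_prompts)]

# Example: assumed ViT-B/16-like dimensions, 2 prompt tokens, depth J = 9.
coupler = VLCoupling(d_l=512, d_v=768, depth=9)
lang_prompts = [torch.randn(2, 512) for _ in range(9)]
vis_prompts = coupler(lang_prompts)   # each element has shape (2, 768)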
Benchmarking

We evaluated the MaPLe method to test its generalization ability on unseen data. We divide a dataset into base and novel categories, train the model on the base classes, and evaluate on both base and novel classes using a few-shot setting.
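As an illustration, the base-to-novel protocol can be set up by splitting the class set into two halves, for example as follows; this is a sketch under the assumption of a random equal split of class names, not the exact split used by the authors.

import random

def base_novel_split(class_names: list[str], seed: int = 0):
    """Split classes into base (used for training) and novel (held out)."""
    rng = random.Random(seed)
    shuffled = class_names[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (base, novel)

# Example with a handful of Imagenet-style class names.
base, novel = base_novel_split(["goldfish", "tabby cat", "beagle", "minivan"])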
Dataset

We used the Imagenet [] dataset to evaluate the model. Specifically, we used mini-Imagenet with 1000 classes and divided it into train and test sets with equal splits.

Imagenet is a large-scale image database organized according to the WordNet hierarchy [], which groups words into synsets, i.e., groups of words with similar meanings, and classifies the images according to those synsets. The original database contains 1.2 million samples, but our subsample contains a significantly smaller total of 10,000 images.
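For concreteness, a subsample with an equal train/test split could be constructed roughly as follows; this is a sketch only, and the sampling procedure and per-class balancing are assumptions rather than details taken from the paper.

import random
from collections import defaultdict

def subsample_equal_split(samples, total=10_000, seed=0):
    """samples: list of (image_path, class_name). Returns (train, test)."""
    rng = random.Random(seed)
    picked = rng.sample(samples, min(total, len(samples)))
    # Group by class so each class is split roughly in half.
    per_class = defaultdict(list)
    for item in picked:
        per_class[item[1]].append(item)
    train, test = [], []
    for items in per_class.values():
        rng.shuffle(items)
        half = len(items) // 2
        train.extend(items[:half])
        test.extend(items[half:])
    return train, test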

Results
