
Evaluating MaPLe for image recognition

Mohamed Rissal Hedna


Intelligent Adaptive Systems, University of Hamburg

Abstract
Vision-language models have outperformed traditional language models on a multitude of metrics. Their success stems from coupling the text and image modalities, which yields strong performance across a variety of datasets and mimics the way humans learn from their environment. These models also generalize well across different tasks, but they remain sensitive to the choice of prompts they are given and limited in the quality of the responses they produce. Several methods have been proposed to address these shortcomings, but none of them targeted both modalities at the same time. The Multi-modal Prompt Learning approach (MaPLe) [1] tackles this specific issue: it proposes a tight coupling between the language and vision branches through learnable coupling functions that are trained end-to-end on top of a frozen CLIP model.

Introduction

Vision-language models have been able to outperform regular language models mainly by incorporating additional information gained from coupling the vision and language modalities and projecting them into a shared space, where the text information closely supervises the image embedding. This alignment between the two modalities enables the model to learn more advanced concepts. Models such as CLIP [3] (Contrastive Language-Image Pre-training) have shown a great ability to generalize over large datasets. Language inputs of the form "a photo of <category>" are passed through a language encoder to obtain a language embedding, which is matched against the visual embedding produced by an image encoder and then used to predict the target class. However, these models require the tedious task of hand-crafting prompts for each image, and they cannot be fine-tuned for downstream tasks without forgetting the information gained during pre-training.

Current research proposes prompt-learning techniques that eliminate the need for manual prompt writing and offer ways to adapt V-L models while keeping their weights frozen, so as to guard previously learned knowledge. Nevertheless, this method is incomplete, since it only targets the language encoder and ignores the visual one. Given the nature of CLIP, a multimodal approach had to be taken, and this is at the core of MaPLe's (Multi-modal Prompt Learning) architecture, which fine-tunes the CLIP model to create maximal alignment between the text and image embeddings. This is achieved by prompting both the text and image encoders and introducing a coupling function that allows the two to interact, while the rest of the weights in the model remain frozen. The approach was clearly shown to be more successful than uni-modal ones, outperforming them on a number of metrics. It surpassed the previous state of the art, Co-CoOp [2], on a variety of generalization tasks and demonstrated better robustness when evaluated on different datasets.

We evaluated the model using the base-to-novel generalization benchmark in a zero-shot setting by dividing the Imagenet dataset into base and novel classes. The MaPLe model was trained on the base classes and evaluated on both the base and novel classes.
Figure 1: Comparison of MaPLe with standard prompt learning methods. (a) Existing methods adopt uni-modal prompting techniques to fine-tune CLIP representations, as prompts are learned only in a single branch of CLIP (language or vision). (b) MaPLe introduces branch-aware hierarchical prompts that adapt both the language and vision branches simultaneously for improved generalization. (c) MaPLe surpasses state-of-the-art methods on 11 diverse image recognition datasets on the novel class generalization task.

Related work

Visual-language models are revolutionary compared to traditional classifiers because of their ability to combine the image and text modalities and exploit the information contained in both to perform well on few-shot and zero-shot classification. Successful examples of such V-L models are CLIP [3], ALIGN [4], LiT [5], FILIP [6] and Florence [7], but they all suffer from the difficulty of fine-tuning them for downstream tasks. These models are trained on large amounts of available web data, ranging from 400 million to 1 billion samples. However, fully retraining them on specialized data would make them forget previously acquired information, leading to decreased performance and worse generalization.

Alternative solutions prompted the text encoder with additional hand-crafted information about the images. Since this is a tedious task, the NLP approach of prompt learning was adopted by later research to automate the process during fine-tuning, by allowing the model to learn those prompts through additional training. But to date there were no works that combined both modalities in the process. For instance, CoOp [8] optimizes a set of prompts only in the language branch, and CoCoOp [9] improved on this idea by conditioning the prompts on image content, allowing it to generalize better. These approaches, however, were implemented in either the language or the visual branch independently, while MaPLe takes both into consideration and couples the branches to create a shared prompt learning process, by which a tighter representation of the downstream task is shared between the text and image encoders, while keeping the rest of the model frozen to preserve previous knowledge.

Method

The main question addressed in this paper is: "Given the multimodal nature of CLIP, is complete prompting better suited to adapt CLIP?". The authors attempt to answer it by proposing an architecture that allows CLIP to be fine-tuned for downstream tasks without retraining the whole model. This is done through context optimization with prompting. Learnable prompts are appended to the text encoder and then conditioned on the image encoder through a coupling function, which is also learnable. This scheme is repeated independently for multiple transformer blocks, and the prompts and coupling functions are the only learnable components of the architecture (Fig. 2).

The approach for building MaPLe is mainly based on CLIP as a V-L encoder. CLIP encodes an image of dimensions (H x W x C), with C being the number of channels (3 in their case), through an image encoder, accompanied by a text description fed in turn to a text encoder.
Traditional CLIP components

The image encoder divides the image into M-sized patches, which are projected to patch embeddings Ei and then passed, together with a learnable class token Ci, as inputs to the transformer blocks Vi+1, being processed sequentially by K transformer blocks. The final image representation is obtained by projecting the output of the Kth transformer block VK into the latent V-L space using ImageProj.

The language encoder divides the text into tokens and projects them into text embeddings Wi = [w0, w1, ..., wn]i. At each step, each embedding is passed into the (i+1)th transformer block. The final representation is built by projecting the output of the last encoder block into the V-L latent space using TextProj.

Zero-shot classification is then performed by manually adding class labels to text prompts (e.g., 'a photo of a <category>'). The predicted class ŷ for an image embedding x is the one whose text embedding z yields the highest cosine similarity score, with a temperature τ, according to the following equation:

p(ŷ | x) = exp(sim(x, z_ŷ)/τ) / Σ_{i=1}^{C} exp(sim(x, z_i)/τ)
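To make this concrete, the following is a minimal sketch of temperature-scaled cosine-similarity classification, assuming the image and text features have already been projected into the shared V-L space; tensor names and shapes are illustrative, not taken from the MaPLe codebase.

import torch
import torch.nn.functional as F

def zero_shot_probs(image_feat: torch.Tensor,
                    text_feats: torch.Tensor,
                    tau: float = 0.01) -> torch.Tensor:
    """Compute p(y | x) for one image against C class prompts.

    image_feat: (d,)   projected image embedding x
    text_feats: (C, d) projected text embeddings z_i, one per class prompt
    tau:        softmax temperature
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    x = F.normalize(image_feat, dim=-1)
    z = F.normalize(text_feats, dim=-1)
    sims = z @ x                          # (C,) values of sim(x, z_i)
    return torch.softmax(sims / tau, dim=-1)

# Example: 512-d embeddings, 1000 Imagenet classes (dimensions assumed).
probs = zero_shot_probs(torch.randn(512), torch.randn(1000, 512))
pred = probs.argmax().item()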
The MaPLe approach combines the previous components to leverage the power of multimodal prompt learning, which, as shown in Fig. ??, allows the classes to be more separable than with other approaches. The method also adds prompt-learning blocks in the deeper layers of the CLIP model for both the vision and language branches, while keeping each prompt layer independent from the others, allowing the overall structure to benefit from the information contained in the model.

Deep prompting

For the text encoder, MaPLe introduces additional learnable prompt tokens prepended to the text embeddings, resulting in the vector [P1, P2, ..., Pb, W0], where the Pi are the learnable tokens and W0 is the original text input, W0 = [w1, w2, ..., wn]. This structure is added at each language encoder layer up to a depth J, with J < K, K being the number of transformer blocks. It is worth noting that when J = 1, MaPLe is equivalent to CoOp.

For the vision branch, the same deep prompting technique is used by introducing a number of learnable tokens alongside the input embeddings. One crucial finding is that introducing shared prompts in the deeper layers works better than independent prompts across transformer blocks, because it allows the model to learn shared features.
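A minimal sketch of how deep prompting of the language branch could be wired is shown below, assuming a stack of frozen transformer blocks and b learnable prompt tokens per layer up to depth J; module and variable names are illustrative, not from the official MaPLe implementation.

import torch
import torch.nn as nn

class DeepPromptedTextEncoder(nn.Module):
    def __init__(self, blocks: nn.ModuleList, d_l: int, b: int, J: int):
        super().__init__()
        self.blocks = blocks            # K transformer blocks, assumed frozen
        self.J = J                      # prompting depth, J < K
        # One set of b learnable prompt tokens per prompted layer.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(b, d_l) * 0.02) for _ in range(J)]
        )

    def forward(self, w0: torch.Tensor) -> torch.Tensor:
        # w0: (n, d_l) token embeddings of the original text input
        x = w0
        b = self.prompts[0].shape[0]
        for i, block in enumerate(self.blocks):
            if i < self.J:
                # Replace the previous layer's prompt tokens with new
                # learnable ones: [P1, ..., Pb, W_i].
                x = torch.cat([self.prompts[i], x[b:] if i > 0 else x], dim=0)
            x = block(x)
        return x

# Example with placeholder blocks (real CLIP blocks would be frozen layers).
enc = DeepPromptedTextEncoder(
    nn.ModuleList([nn.Identity() for _ in range(12)]), d_l=512, b=2, J=9)
out = enc(torch.randn(77, 512))   # (2 + 77, 512) after prompting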

Figure 2: Overview of the proposed MaPLe (Multi-modal Prompt Learning) framework for prompt learning
in V-L models. MaPLe tunes both vision and language branches where only the context prompts are learned,
while the rest of the model is frozen. MaPLe conditions the vision prompts on language prompts via a V-L
coupling function F to induce mutual synergy between the two modalities. Our framework uses deep contextual
prompting where separate context prompts are learned across multiple transformer blocks.

The main goal of using prompt tuning in V-L models is to achieve completeness by fine-tuning both the vision and the language branches simultaneously with their respective prompts. One simple way to do this would be to train each set of prompts independently, which would be technically complete but would not allow the branches to interact with each other, leading to sub-optimal performance because the prompts learned in the deeper transformer blocks would be less correlated. The paper calls this method "Independent V-L Prompting". MaPLe introduces better synergy between the vision and language branches by adding prompt-learning blocks to the language transformer blocks up to depth J, then projecting the prompts to the vision branch through a coupling function Fk such that

P̃k = Fk(Pk)

The coupling function is a linear layer that maps from the dimension dl of the language branch to the dimension dv of the vision branch. This is the main component that shares the gradients across both modalities during learning, and it improves on the naive "Independent V-L Prompting" scheme.
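A minimal sketch of such a coupling step is given below, assuming language prompts of dimension d_l are mapped to vision prompts of dimension d_v by one linear map per prompted layer; names and dimensions are illustrative, not taken from the official implementation.

import torch
import torch.nn as nn

class VLCoupling(nn.Module):
    """Per-layer linear coupling F_k: language prompts -> vision prompts."""

    def __init__(self, d_l: int, d_v: int, depth: int):
        super().__init__()
        self.F = nn.ModuleList([nn.Linear(d_l, d_v) for _ in range(depth)])

    def forward(self, language_prompts: list[torch.Tensor]) -> list[torch.Tensor]:
        # language_prompts[k]: (b, d_l) learnable tokens at language layer k
        # Returns the vision prompts P~_k = F_k(P_k), one tensor per layer.
        return [f(p) for f, p in zip(self.F, language_prompts)]

# Example: assumed ViT-B/16-like dimensions, 2 prompt tokens, depth J = 9.
coupler = VLCoupling(d_l=512, d_v=768, depth=9)
lang_prompts = [torch.randn(2, 512) for _ in range(9)]
vis_prompts = coupler(lang_prompts)   # each element has shape (2, 768)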
Benchmarking

We evaluated the MaPLe method to test its generalization ability on unseen data. We divide a dataset into base and novel categories, train the model on the base classes, and evaluate on both base and novel classes using a few-shot setting.
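As an illustration, the base-to-novel protocol can be set up by splitting the class set into two halves, for example as follows; this is a sketch under the assumption of a random equal split of class names, not the exact split used by the authors.

import random

def base_novel_split(class_names: list[str], seed: int = 0):
    """Split classes into base (used for training) and novel (held out)."""
    rng = random.Random(seed)
    shuffled = class_names[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (base, novel)

# Example with a handful of Imagenet-style class names.
base, novel = base_novel_split(["goldfish", "tabby cat", "beagle", "minivan"])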
Dataset

We used the Imagenet [] dataset to evaluate the model. Specifically, we used mini-Imagenet with 1000 classes and divided it into train and test sets with equal splits.

Imagenet is a large-scale image database organized according to the WordNet hierarchy [], which groups words into synsets, i.e., groups of words with similar meanings, and classifies the images according to those synsets. The original database contains 1.2 million samples, but our subsample contains a significantly smaller total of 10,000 images.
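For concreteness, a subsample with an equal train/test split could be constructed roughly as follows; this is a sketch only, and the sampling procedure and per-class balancing are assumptions rather than details taken from the paper.

import random
from collections import defaultdict

def subsample_equal_split(samples, total=10_000, seed=0):
    """samples: list of (image_path, class_name). Returns (train, test)."""
    rng = random.Random(seed)
    picked = rng.sample(samples, min(total, len(samples)))
    # Group by class so each class is split roughly in half.
    per_class = defaultdict(list)
    for item in picked:
        per_class[item[1]].append(item)
    train, test = [], []
    for items in per_class.values():
        rng.shuffle(items)
        half = len(items) // 2
        train.extend(items[:half])
        test.extend(items[half:])
    return train, test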

Results
