
Text Data-Centric Image Captioning with Interactive Prompts

Yiyu Wang, Hao Luo, Jungang Xu∗, Yingfei Sun, Fan Wang

arXiv:2403.19193v1 [cs.CV] 28 Mar 2024

Abstract—Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance in various tasks, which also provide some new solutions for image captioning with web paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space with the assistance of consistent representations between image-text pairs from the CLIP model. However, the current methods still face several challenges in adapting to the diversity of data configurations in a unified solution, accurately estimating the image-text embedding bias, and correcting unsatisfactory prediction results in the inference stage. This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap. 1) We consider four different settings which gradually reduce the dependence on paired data. 2) We construct a mapping module driven by a multivariate Gaussian distribution to mitigate the modality gap, which is applicable to the above four different settings. 3) We propose a prompt interaction module that can incorporate optional prompt information before generating captions. Extensive experiments show that our TIPCap outperforms other weakly or unsupervised image captioning methods and achieves a new state-of-the-art performance on two widely used datasets, i.e., MS-COCO and Flickr30K.

Index Terms—image captioning, weakly or unsupervised approaches, modality gap, interactive prompt.

I. INTRODUCTION

Image captioning is a typical vision-language task, which aims to automatically generate textual descriptions for given images. In the past few years, although image captioning has made great progress, the models [1]–[11] are generally trained on human-annotated image-text data like MS-COCO [12] and Flickr30K [13], which are challenging to collect. Some unsupervised works [14], [15] try to mitigate this issue using unpaired image-text data, but they still need complex pseudo-training or adversarial training to ensure the semantic alignment between decoder and image. Recently, large foundation models such as BERT [16], GPT-2 [17], T5 [18], CLIP [19], ALIGN [20], and BLIP [21] have provided some new solutions for vision-language tasks, including image captioning. Based on their strong generalization ability and image-text representation alignment, some methods [22]–[27] use low-cost paired image-text data collected from the web (abbreviated as web data in this paper) or even text-only data to train models, which can even exhibit zero-shot caption capability.

Fig. 1. Comparison of different methods; E_I, E_T, and D_T indicate image encoder, text encoder, and text decoder respectively. (a) supervised method. (b) ZeroCap and MAGIC. (c) DeCap. (d) CapDec and CLOSE. (e) our approach.

Specifically, there are two main research lines showing promising potential in image captioning. First, some foundation models (e.g., BLIP [21], BLIP2 [28], OFA [29], SEEM [30], etc.) trained on large-scale web data successfully unify multiple vision-language understanding and generation tasks, allowing them to directly predict captions for images. However, due to the low-quality labels and significant noise in web data, these models still need to add high-quality image-text datasets such as MS-COCO to the training/fine-tuning data to achieve comparable performance. Second, another line is to leverage the image-text representation alignment capability of the CLIP model to reduce the cost of acquiring training data. It is generally easier to obtain a textual corpus than to obtain high-quality image-text data. Since CLIP can provide consistent feature representations for image-text pairs, some works train caption models with text-only data or little additional paired data.

This paper focuses on the second research line, because it is a more caption-specific and resource-friendly solution. Additionally, aligning image-text representations rather than directly predicting captions reduces the requirements of the foundation model on the label quality of image-text data. Among the related works shown in Fig. 1, ZeroCap [23] and MAGIC [24] generate multiple texts and then determine the predicted caption according to the feature similarity between image and text calculated by the CLIP model. However, the untrained generation process and the frequent CLIP text encoder forward passes limit both the model performance and efficiency. DeCap [26] proposes another efficient and text-only method, which builds a support memory using all text features. In the inference stage, the image embedding can be projected into the text embedding space. However, the support memory restricts its ability to scale up to extensive data. The latest CapDec [25] and CLOSE [27], the closest paradigm to ours, make the assumption that the feature bias between an image-text pair in the CLIP embedding space can be estimated with a Gaussian distribution $\mathcal{N}(0, \sigma^2)$. The value of $\sigma$ is estimated from a few paired MS-COCO data in CapDec, but it is set as a hyper-parameter in CLOSE. In this way, they need only text data for training because the text embedding can be projected into the image embedding space.
Although the above methods propose some ingenious solutions, we still point out some crucial issues here. 1) The above methods are usually only compatible with one or two specific data configurations. However, in real-world applications, users have very different data configurations. For example, apart from the text corpus, some web data, a few high-quality image-text data like MS-COCO, or some web image data can also be provided. How can a unified solution be proposed to deal with different data configurations? 2) The popular assumption that the image-text feature bias follows an independent Gaussian distribution may be sub-optimal because correlations exist between different feature dimensions. How can a better approximation be proposed? 3) These methods inevitably output unsatisfactory results. Can we allow the model to handle user-provided prompts (such as the objects in the image) to improve predictions?

Based on the above motivations, we propose a new approach TIPCap that is text data-centric with interactive prompts for image captioning, as shown in Fig. 1 (e). Specifically, our TIPCap combines CLIP and GPT-2 to fully leverage the advantages of pre-trained models and contains three extra key modules: a mapping module, a reverse mapping module, and a prompt interaction module. Firstly, we take four different data settings into account, which almost cover the vast majority of data configuration scenarios. A unified solution is proposed to estimate text-to-image embedding maps. Secondly, taking into account the correlation between feature dimensions, the mapping module is driven by a multivariate Gaussian distribution instead of an independent Gaussian distribution, which aims to mitigate the modality gap by performing a simple projection from the CLIP text embedding space to the CLIP image embedding space. The reverse mapping module performs a weak projection from the CLIP image embedding space back to the CLIP text embedding space for stronger robustness. During inference, our TIPCap no longer needs the mapping module but directly inputs the CLIP image embedding into the reverse mapping module and follow-up modules to generate captions. Thirdly, the prompt interaction module endows TIPCap with the ability to fuse additional prompt information to generate higher-quality descriptions. With these modules, TIPCap can be trained on text-centric data and predict captions which can be further improved with manual prompts for a given image.

To evaluate our approach, we conduct extensive experiments on two commonly used datasets: MS-COCO [12] and Flickr30K [13]. The results demonstrate that our approach significantly outperforms existing weakly or unsupervised approaches, and achieves a new state-of-the-art performance.

Our major contributions can be summarized as follows:
(1) We propose a new approach TIPCap for image captioning, which provides a unified solution for four settings with different data configurations;
(2) The mapping module utilizes a multivariate Gaussian distribution to mitigate the modality gap effectively and outperforms the independent Gaussian distribution; our model is able to handle prompt information, which further enhances its flexibility;
(3) Extensive experiments demonstrate the effectiveness of TIPCap and achieve a new state-of-the-art performance.

II. RELATED WORK

A. Image Captioning

1) Supervised Approaches: Inspired by the development of deep learning methods in machine translation [31], [32], most existing models utilize the encoder-decoder framework. Earlier works [1], [2], [33], [34] adopt a CNN to extract image features and decode them into a sentence by LSTM [35]. Xu et al. [2] introduce an attention mechanism which can dynamically focus on salient regions of the given image. After that, Anderson et al. [3] propose to use Faster R-CNN [36] as the encoder and achieve significant improvement. Some subsequent works [4], [5], [8], [37], [38] follow this paradigm. Recently, transformer-based models have demonstrated excellent performance in the image captioning task [5], [7], [39]–[41]. Although supervised methods have achieved impressive results, high-quality human-annotated paired image-text data is essential.

2) Zero-shot Approaches: Zero-shot image captioning aims to generate descriptions without human-annotated data. ZeroCap [23] and MAGIC [24] realize zero-shot image captioning by combining CLIP and GPT-2, and both of them introduce weak visual control cues through the cosine similarity between the generated text and the given image. Specifically, ZeroCap relies on gradient descent to update the context cache of GPT-2 to make the output match the given image, and MAGIC proposes a new decoding strategy to regularize the generated word to be close to the given image and the previously generated context. However, frequent CLIP text encoder forward passes slow the inference speed significantly. Wang et al. [42] argue that the above methods are prone to falling into a harmful contextual language prior and ignoring the visual information of the given image. DeCap [26] contains a frozen CLIP and a lightweight text decoder. During training, DeCap takes the CLIP text embedding as input to reconstruct its textual sequence, and stores all training CLIP text embeddings as a support memory. During inference, it projects the image embedding into the CLIP text embedding space by calculating a weighted sum of all embeddings in the support memory based on cosine similarity. CapDec [25] and CLOSE [27] also perform zero-shot image captioning using text-only data and have a similar paradigm, estimating the modality gap between image and text by an independent Gaussian distribution $\mathcal{N}(0, \sigma^2)$.

B. Vision-Language Models

Inspired by BERT [16] and its task-agnostic pre-training paradigm, a line of works [43]–[46] have extended it to Vision-Language (VL) for learning joint representations of image content and natural language. These models directly rely on a pre-trained object detector to extract image region features and employ a multimodal encoder to fuse multi-modal features by solving tasks such as masked language modeling (MLM) and image-text matching (ITM). Another line of works (e.g., CLIP [19] and ALIGN [20]) construct unimodal encoders for image and text separately without dense cross-domain connection, and perform pre-training from scratch on web-scale noisy datasets using only a contrastive loss. ALBEF [47] believes that the learning of image-text interactions and alignment are both important, so it introduces a contrastive loss to align image and text before the multimodal encoder. The above models are all encoder-based and are not easy to directly transfer to text generation tasks. Considering this issue, BLIP [21] proposes a unified model for VL understanding and generation. Specifically, BLIP constructs unimodal encoders and a text decoder for different VL tasks and applies parameter sharing between the text encoder and decoder except for the self-attention layers.
Fig. 2. The overall framework of our approach. Our approach TIPCap is based on a pre-trained CLIP model and a pre-trained GPT-2 model. During training, we first exploit CLIP to extract CLIP text embedding and project it into CLIP image embedding space by a mapping module; then we reconstruct text embedding by a reverse mapping module and inject optional prompt information; finally, GPT-2 generates description. In the inference stage, we no longer need the mapping module but directly feed CLIP image embedding into reverse mapping module and follow-up modules to generate captions.

III. METHOD

The overall framework of our approach, called TIPCap, is shown in Fig. 2. We first introduce the model architecture and the four settings with different data configurations, then the details of interactive prompts, and finally our training objectives.

A. Model Architecture

As shown in Fig. 2, we utilize the frozen CLIP [19] and GPT-2 [17] following existing methods. In addition, our TIPCap contains a mapping module, a reverse mapping module, and a prefix projector.

Given a text sequence $T$, we first extract the CLIP text embedding $T_e$ and project it into the CLIP image embedding space by a mapping module:

$T_e = \mathrm{CLIP}_{text}(T)$  (1)

$T_e^I = \mathrm{Mapping}(T_e, \mathcal{N}(\vec{\mu}, \Sigma))$  (2)

where $\mathcal{N}(\vec{\mu}, \Sigma)$ indicates a multivariate Gaussian distribution with mean $\vec{\mu}$ and covariance $\Sigma$. Briefly, the mapping module performs a simple projection by adding a noise $\epsilon$ obeying $\mathcal{N}(\vec{\mu}, \Sigma)$ to $T_e$.

Then, a reverse mapping module re-projects $T_e^I$ back into the CLIP text embedding space:

$T_e^T = \text{Reverse-Mapping}(T_e^I)$  (3)

The prefix projector module projects the CLIP embedding from the CLIP dimension to the GPT-2 dimension. Finally, $P_e$, together with the optional prompt embedding $P_{te}$, is incorporated as the input of GPT-2 to generate captions:

$P_e = \text{Prefix-Projector}(T_e^T)$  (4)

$T' = \text{GPT-2}(P_e, P_{te})$  (5)

The inference forward is slightly different from the training forward described above: we no longer need the mapping module but directly extract the CLIP image embedding and feed it into the reverse mapping module and follow-up modules.
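To make the data flow above concrete, the following is a minimal PyTorch sketch of the training-time forward pass in Eqs. (1)-(5). It is an illustration under stated assumptions rather than the released implementation: CLIP text embeddings are assumed precomputed (640-d for RN50x4), the prefix projector is reduced to a single linear layer (the paper follows ClipCap with a transformer-based projector), and the residual add plus L2 re-normalization after each projection is an assumption.

```python
# Minimal PyTorch sketch of the training-time forward pass in Eqs. (1)-(5).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TIPCapSketch(nn.Module):
    def __init__(self, clip_dim=640, gpt_dim=768, prefix_len=4):
        super().__init__()
        # Parameters of N(mu, Sigma) with Sigma = L L^T: estimated from data in
        # Settings 1/2, trainable in Settings 3/4.
        self.mu = nn.Parameter(torch.zeros(clip_dim))
        self.L = nn.Parameter(torch.eye(clip_dim))
        # Reverse mapping: a feed-forward layer back to the text embedding space.
        self.reverse_mapping = nn.Sequential(
            nn.Linear(clip_dim, clip_dim), nn.ReLU(), nn.Linear(clip_dim, clip_dim)
        )
        # Prefix projector: CLIP dimension -> prefix_len GPT-2 token embeddings.
        self.prefix_projector = nn.Linear(clip_dim, gpt_dim * prefix_len)
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim

    def mapping(self, text_emb):
        # Eq. (2): add eps ~ N(mu, Sigma) via eps = L eps' + mu, eps' ~ N(0, I).
        eps = torch.randn_like(text_emb) @ torch.tril(self.L).T + self.mu
        return F.normalize(text_emb + eps, dim=-1)   # assumed L2 re-normalization

    def forward(self, clip_text_emb):
        t_i = self.mapping(clip_text_emb)                            # Eq. (2)
        t_t = F.normalize(t_i + self.reverse_mapping(t_i), dim=-1)   # Eq. (3)
        prefix = self.prefix_projector(t_t)                          # Eq. (4)
        return t_i, t_t, prefix.view(-1, self.prefix_len, self.gpt_dim)


# Eq. (5): the prefix embeddings (plus optional prompt token embeddings) are
# prepended to the caption embeddings and fed to GPT-2 via `inputs_embeds`.
# At inference, `mapping` is skipped and the CLIP *image* embedding is fed
# directly to `reverse_mapping` and the follow-up modules.
```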
B. Estimation of $\vec{\mu}$ and $\Sigma$

The mapping module is driven by a multivariate Gaussian distribution $\mathcal{N}(\vec{\mu}, \Sigma)$ to mitigate the modality gap, which brings up a crucial issue: how to obtain the mean $\vec{\mu}$ and covariance $\Sigma$ that can effectively characterize the true modality bias $\delta$ between the CLIP image embedding and text embedding? In this paper, we consider four settings with different data configurations, which can cover the majority of real-world scenarios (the text corpus $T_{corpus}$ is available in each setting):

(1) Setting 1: A few human-annotated high-quality paired data $\langle I_{human}, T_{human}\rangle$ are available, where $T_{human}$ is homologous to $T_{corpus}$;
(2) Setting 2: Low-quality paired web data $\langle I_{web}, T_{web}\rangle$ are available, where $T_{web}$ is heterologous to $T_{corpus}$;
(3) Setting 3: No paired data, but a few source-agnostic image data $I_{any}$ are available;
(4) Setting 4: No extra data; only $T_{corpus}$ is available.

TABLE I. Taking the experiments on MS-COCO as an example, a comparison of different data settings.

| | base data | external data |
|---|---|---|
| Setting 1 | COCO text | 1% COCO paired data |
| Setting 2 | COCO text | 100K YFCC15M paired data |
| Setting 3 | COCO text | 5K YFCC15M image data |
| Setting 4 | COCO text | None |

Setting 1. We first calculate the embedding difference of human-annotated paired data:

$\delta \simeq I_{e,human} - T_{e,human}$  (6)

where $I_{e,human}$ and $T_{e,human}$ indicate the CLIP image embedding and CLIP text embedding respectively.

As mentioned before, $\mathcal{N}(\vec{\mu}, \Sigma)$ aims to characterize the modality bias. Here $T_{human}$ is paired with $I_{human}$, and also homologous to $T_{corpus}$, so we can directly adopt the mean $\vec{\mu}_\delta$ and covariance $\Sigma_\delta$ of $\delta$ as a tight estimation of $\vec{\mu}$ and $\Sigma$.

Setting 2. Similar to setting 1, we can calculate the embedding difference of web data, and use the mean and covariance of the embedding difference as an estimation. But this performs worse, because $T_{web}$ and $T_{corpus}$ are heterologous. We propose to apply a simple correction to alleviate this issue:

$\delta \simeq I_{e,web} - \mathrm{Correct}(T_{e,web}) = I_{e,web} - (T_{e,web} + \delta_{w\to c}) = \delta_{web} - \delta_{w\to c}$  (7)

where $\delta_{web} \sim \mathcal{N}(\vec{\mu}_{web}, \Sigma_{web})$ and $\delta_{w\to c} \sim \mathcal{N}(\vec{\mu}_{w\to c}, \Sigma_{w\to c})$; the subscript $w\to c$ indicates "web data to corpus data", which aims to achieve a domain alignment from $T_{web}$ to $T_{corpus}$. Since there is no pairwise relationship between $T_{corpus}$ and $T_{web}$, $\mathrm{Correct}(\cdot)$ is just a global and rough estimation of the domain alignment.
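As a concrete illustration of Eqs. (6)-(7), the sketch below estimates $\vec{\mu}$ and $\Sigma$ from precomputed CLIP embeddings. Realizing $\mathrm{Correct}(\cdot)$ as a global mean shift from web text to corpus text is an assumption consistent with the "global and rough estimation" described above, not the paper's exact recipe.

```python
# Estimating the Gaussian parameters for Settings 1 and 2 (Eqs. (6)-(7)) from
# precomputed, L2-normalized CLIP embeddings of shape (N, D).
import torch


def estimate_gaussian(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    """Setting 1 (Eq. (6)): mean/covariance of delta = I_e - T_e for paired data."""
    delta = img_emb - txt_emb                 # (N, D)
    mu = delta.mean(dim=0)                    # (D,)
    sigma = torch.cov(delta.T)                # (D, D); torch.cov expects variables in rows
    return mu, sigma


def estimate_gaussian_web(img_web, txt_web, txt_corpus):
    """Setting 2 (Eq. (7)): correct heterologous web text before differencing."""
    # delta_{w->c}: rough web-to-corpus domain alignment (no pairwise relation),
    # taken here as the difference of the global means (an assumption).
    shift = txt_corpus.mean(dim=0) - txt_web.mean(dim=0)
    delta = img_web - (txt_web + shift)       # = delta_web - delta_{w->c}
    return delta.mean(dim=0), torch.cov(delta.T)
```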
Setting 3. In this setting, we extend the mapping module to be trainable instead of using pre-defined parameters. Specifically, the covariance can be denoted as $\Sigma = LL^\top$ by Cholesky decomposition, where $L$ is a lower triangular matrix with the same size as $\Sigma$. Through the reparameterization trick [48], the noise $\epsilon \sim \mathcal{N}(\vec{\mu}, \Sigma)$ can be re-formulated as follows:

$\epsilon = L\epsilon' + \vec{\mu}, \quad \epsilon' \sim \mathcal{N}(\vec{0}, I)$  (8)

Therefore, we can treat the mean $\vec{\mu}$ and the matrix $L$ as trainable parameters.

The difficulty lies in how to drive the training of $\vec{\mu}$ and $L$ toward the correct direction. For this purpose, we introduce a few source-agnostic image data $I_{any}$, and calculate the embedding difference:

$\delta \simeq \delta_{any} = I_{e,any} - T_{e,corpus}$  (9)

where $\delta_{any} \sim \mathcal{N}(\vec{\mu}, \Sigma)$ if $I_{any}$ is paired with $T_{corpus}$. Here we relax this requirement and assume that $\delta_{any} \sim \mathcal{N}(\vec{\mu}, \Sigma)$ is also roughly satisfied even though image and text are unpaired, to derive our training objective.

Again we apply the reparameterization trick, and have:

$\delta_{any} \sim \mathcal{N}(\vec{\mu}, \Sigma) \simeq L\epsilon + \vec{\mu}, \quad \epsilon \sim \mathcal{N}(\vec{0}, I)$  (10)

Then we apply a simple transformation to $\delta_{any}$, resulting in:

$\epsilon = L^{-1}(\delta_{any} - \vec{\mu}) \sim \mathcal{N}(\vec{0}, I)$  (11)

From the above, our training goal is to make $\epsilon$ closer to a standard Gaussian distribution $\mathcal{N}(\vec{0}, I)$:

$\mathcal{L}_{Map} = \mathrm{KL}\big(L^{-1}(\delta_{any} - \vec{\mu}) \,\|\, \mathcal{N}(\vec{0}, I)\big)$  (12)
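The following sketch shows one plausible way to realize the Setting-3 objective of Eq. (12) in PyTorch, treating $\vec{\mu}$ and the Cholesky factor $L$ as trainable tensors. Evaluating the KL term in closed form between a Gaussian fitted to the whitened residuals and $\mathcal{N}(\vec{0}, I)$ is an assumption; the paper does not spell out the estimator.

```python
# One plausible realization of the Setting-3 objective (Eqs. (9)-(12)).
import torch


def l_map(delta_any: torch.Tensor, mu: torch.Tensor, L: torch.Tensor) -> torch.Tensor:
    """delta_any: (N, D) differences I_{e,any} - T_{e,corpus}; mu: (D,); L: (D, D)."""
    L_tri = torch.tril(L)                                  # keep L lower triangular
    # Eq. (11): eps = L^{-1}(delta - mu); solve a triangular system instead of inverting.
    eps = torch.linalg.solve_triangular(L_tri, (delta_any - mu).T, upper=False).T
    d = eps.shape[1]
    m = eps.mean(dim=0)                                    # empirical mean of the residuals
    S = torch.cov(eps.T) + 1e-6 * torch.eye(d, device=eps.device)  # jitter keeps S invertible
    # Closed form: KL( N(m, S) || N(0, I) ) = 0.5 * (tr(S) + m^T m - d - log det S).
    return 0.5 * (torch.trace(S) + m @ m - d - torch.logdet(S))
```

In practice $\vec{\mu}$ and $L$ would be registered as `nn.Parameter` tensors and optimized jointly with the rest of the model; keeping $L$ lower triangular preserves $\Sigma = LL^\top$.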
Setting 4. Setting 4 explores a more extreme data configuration, where only text data is available. We follow a paradigm similar to setting 3 and apply $\mathcal{L}_{Map}$ to optimize the trainable parameters $\vec{\mu}$ and $L$. The calculation of $\mathcal{L}_{Map}$ relies on the modality bias between the CLIP image embedding and the CLIP text embedding. However, if we use the bias between $T_e^I$ and $T_e$ to optimize the loss $\mathcal{L}_{Map}$, the model is prone to falling into trivial solutions (collapse), because the modality bias between the output and input of the mapping module, i.e., $T_e^I$ and $T_e$, is directly sampled from $\mathcal{N}(\vec{\mu}, \Sigma)$. To address this issue, we propose several specific designs.

Firstly, we follow other works and apply several asymmetric designs to enhance robustness and avoid mode collapse. Different from the mapping module, which consists of an addition operation between the input and a sampled multivariate Gaussian noise, the reverse mapping module is designed as a FeedForward layer aimed at re-projecting the output of the mapping module back into the CLIP text embedding space. Then, we do not use the original text embedding $T_e$ but the reconstructed one $T_e^T$ to optimize $\mathcal{L}_{Map}$, which can be calculated as follows:

$\mathcal{L}_{Map} = \mathrm{KL}\big(L^{-1}(\delta_{pseudo} - \vec{\mu}) \,\|\, \mathcal{N}(\vec{0}, I)\big)$  (13)

$\delta_{pseudo} = T_e^I - T_e^T$  (14)

where $T_e^I$ and $T_e^T$ indicate the outputs of the mapping module and reverse mapping module respectively. Since the reverse mapping module does not perform a strict reconstruction of the CLIP text embedding, it also introduces asymmetry to obtain a more robust performance. In order to ensure the unity of our TIPCap, we also apply the reverse mapping module in the other settings, not only Setting 4.

Secondly, due to the lack of real image data to introduce the corresponding latent prior information, we employ a relational knowledge distillation loss:

$\mathcal{L}_{Disti} = \mathrm{KL}\big(S_{T_e^I} \,\|\, S_{T_e}\big)$  (15)

where $S_{T_e^I}$ and $S_{T_e}$ indicate the internal cosine similarity matrices of $T_e^I$ and $T_e$ respectively. Employing $\mathcal{L}_{Disti}$ aims to encourage $T_e^I$ to have an internal cosine similarity similar to that of $T_e$, which provides a constraint to ensure that $T_e^I$ and $T_e$ are semantically related and avoids mode collapse.
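A matching sketch of the Setting-4 terms (Eqs. (13)-(15)) is given below; it reuses the `l_map` helper from the previous sketch for Eq. (13), and realizes the KL between similarity matrices with a row-wise softmax at temperature τ, which is an assumed normalization (the paper only specifies a KL between the internal cosine-similarity structures).

```python
# Sketch of the Setting-4 losses (Eqs. (13)-(15)).
import torch
import torch.nn.functional as F


def l_map_setting4(t_i: torch.Tensor, t_t: torch.Tensor,
                   mu: torch.Tensor, L: torch.Tensor) -> torch.Tensor:
    # Eq. (14): pseudo modality bias between the mapping output and its reconstruction.
    delta_pseudo = t_i - t_t
    # Eq. (13): same KL objective as Setting 3, reusing the l_map helper above.
    return l_map(delta_pseudo, mu, L)


def l_disti(t_i: torch.Tensor, t_e: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Relational distillation KL(S_{T_e^I} || S_{T_e}) of Eq. (15)."""
    s_i = F.normalize(t_i, dim=-1) @ F.normalize(t_i, dim=-1).T   # internal similarities of T_e^I
    s_e = F.normalize(t_e, dim=-1) @ F.normalize(t_e, dim=-1).T   # internal similarities of T_e
    p = F.softmax(s_i / tau, dim=-1)          # rows of S_{T_e^I} as distributions (assumption)
    log_q = F.log_softmax(s_e / tau, dim=-1)  # rows of S_{T_e} as log-distributions
    # F.kl_div(input, target) computes KL(target || input-dist), so this is KL(P || Q).
    return F.kl_div(log_q, p, reduction="batchmean")
```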
C. Interactive Prompts

Inevitably, image captioning models output unsatisfactory sentences, sometimes with factual errors or missing objects. Based on this issue, we hope to endow our model with the ability to deal with additional prompt information to generate information-enhanced captions.

Inspired by the supervised fine-tuning of InstructGPT [49], we construct prompts as textual sentences and serve them as a part of the input of the GPT-2 model, as shown in Fig. 2. One full prompt contains two parts: a rough caption predicted by the model and a user-specified prompt sentence correcting the caption. The main difficulty lies in collecting user-specified prompt sentences during training, which should contain information ignored in rough captions but existing in ground truth captions, to introduce positive information for training guidance.

We divide the training into two stages: 1) We perform the first stage of training without introducing prompts and get a base captioning model $model_{base}$, which aims to endow our model with the ability to generate rough captions for the second stage of training. 2) Then, we initialize $model_{prompt}$ with the parameters of $model_{base}$ to perform the second stage of training with prompts introduced. To avoid complex and expensive manual annotation, we explore a simple strategy to generate user-specified prompts. At each training step, $model_{base}$ is frozen to generate rough captions $\langle c_r\rangle$. For the user-specified prompt, we extract a set of nouns or noun phrases $\{p_i\}_{i=1}^{N}$ by part-of-speech tagging from the ground truth captions $\langle c_{gt}\rangle$. Furthermore, aiming to preserve positive prompt information, we remove nouns or noun phrases that appear in $\langle c_r\rangle$ from $\{p_i\}_{i=1}^{N}$ to get the filtered set $\{p'_i\}_{i=1}^{N'}$.

Taking Fig. 2 as an example, for the round $t$ generation, we have $\langle c_r\rangle$ = "A man is walking along a road.", $\{p'_i\}_{i=1}^{N'}$ = {"motorcycle"}, and $\langle c_{gt}\rangle$ = "A man riding on the back of a motorcycle down a road." The corresponding full prompt sentence $P_t$ is constructed by the prompt constructor as follows:

Reference: A man is walking along a road.
Prompt: An image contains motorcycle.
Prediction: A man riding on the back of a motorcycle down a road.

Note that we hope our TIPCap keeps the ability to perform the general captioning task (i.e., generate captions without "reference" and "prompt"), thus prompt information is NOT always essential. Furthermore, when the generated caption is good and descriptive enough, it is also not necessary to introduce prompts and perform the second inference.

Based on the above considerations, we replace the prompt sentences with padding tokens with a probability of p = 0.1 when performing the stage 2 training (refer to "Implementation details" in Section IV-A); in addition, we also replace the prompts with padding when the generated caption is the same as the ground truth caption. Examples of constructed full prompt sentences are shown in Fig. 3.

Fig. 3. Examples of constructed full prompt sentences during stage 2 training.
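The prompt-construction strategy above can be sketched as follows. NLTK's part-of-speech tagger is used purely as an example (any noun extractor would do), the "<-pad->" placeholder follows the examples in Fig. 3, and the exact filtering of uninformative nouns is an assumption.

```python
# Illustrative sketch of the stage-2 prompt construction.
import random
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed


def extract_nouns(sentence: str) -> set:
    tokens = nltk.word_tokenize(sentence.lower())
    return {w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}


def build_full_prompt(rough_caption: str, gt_caption: str, p_drop: float = 0.1) -> str:
    # With probability p_drop, or when the rough caption already equals the GT,
    # the prompt is replaced by padding so the model keeps its prompt-free ability.
    kept = extract_nouns(gt_caption) - extract_nouns(rough_caption)
    if random.random() < p_drop or rough_caption.strip() == gt_caption.strip() or not kept:
        return "Reference: <-pad->\nPrompt: <-pad->\nPrediction: " + gt_caption
    prompt = "An image contains " + ", ".join(sorted(kept)) + "."
    return f"Reference: {rough_caption}\nPrompt: {prompt}\nPrediction: {gt_caption}"


# For the Fig. 2 example, the ground-truth nouns missing from the rough caption
# are {"back", "motorcycle"}; additional filtering of uninformative nouns is
# presumably applied to arrive at the single prompt word "motorcycle".
```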
D. Objectives

Language Modeling Loss. For a mini-batch of $n$ texts $\{T^i\}_{i=1}^{n}$, where $T^i = \{T_1^i, ..., T_L^i\}$, we optimize our model by applying Maximum Likelihood Estimation (MLE):

$\mathcal{L}_{MLE} = -\frac{1}{n \times L}\sum_{i=1}^{n}\sum_{j=1}^{L} \log p_\theta\big(T_j^i \mid T_{1:j-1}^i\big)$  (16)

where $\theta$ denotes the parameters of our model.

Reverse Mapping Reconstruction Loss. For our reverse mapping module, we define $\mathcal{L}_{Recons}$ to constrain the reconstruction relationship of $T_e^T$ to $T_e$:

$\mathcal{L}_{Recons} = \mathcal{L}_{Cosine} + \mathcal{L}_{CL}$  (17)

where $\mathcal{L}_{Cosine}$ and $\mathcal{L}_{CL}$ denote the cosine embedding loss and the contrastive loss [19] respectively, and are defined as follows:

$\mathcal{L}_{Cosine} = \frac{1}{n}\sum_{i=1}^{n}(1 - S_{i,i}), \quad \mathcal{L}_{CL} = -\frac{1}{2}\big(\mathcal{L}_{S}^{CL} + \mathcal{L}_{S^\top}^{CL}\big)$  (18)

$\mathcal{L}_{S}^{CL} = \frac{1}{n}\sum_{i=1}^{n} \log \frac{\exp(S_{i,i}/\tau)}{\sum_{j=1}^{n}\exp(S_{i,j}/\tau)}$  (19)

where $\tau = 0.1$ and $S \in \mathbb{R}^{n\times n}$ indicates the cosine similarity matrix between $T_e^T$ and $T_e$. Intuitively, $\mathcal{L}_{Cosine}$ is introduced for hard reconstruction, and $\mathcal{L}_{CL}$ aims to relax the reconstruction constraint, because we hope that $T_e^T$ has similar semantics to $T_e$ without perfect reconstruction.
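For reference, the reconstruction objective of Eqs. (17)-(19) can be written compactly in PyTorch; the cross-entropy formulation below is algebraically equivalent to the negated log-softmax terms of Eq. (19), with the transposed term corresponding to $\mathcal{L}_{S^\top}^{CL}$.

```python
# Compact PyTorch form of the reconstruction objective, Eqs. (17)-(19).
import torch
import torch.nn.functional as F


def l_recons(t_t: torch.Tensor, t_e: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """t_t: reconstructed embeddings T_e^T; t_e: original CLIP text embeddings T_e; both (n, D)."""
    s = F.normalize(t_t, dim=-1) @ F.normalize(t_e, dim=-1).T        # cosine similarity matrix S
    l_cosine = (1.0 - s.diagonal()).mean()                           # first term of Eq. (18)
    labels = torch.arange(s.shape[0], device=s.device)               # matching pairs on the diagonal
    l_cl = 0.5 * (F.cross_entropy(s / tau, labels) +                 # -L^{CL}_S
                  F.cross_entropy(s.T / tau, labels))                # -L^{CL}_{S^T}
    return l_cosine + l_cl                                           # Eq. (17)
```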
IV. EXPERIMENTS

A. Experimental Setting

1) Datasets: For fair performance comparison, we conduct experiments on both the MS-COCO [12] and Flickr30K [13] datasets, and follow the "Karpathy" split [50]. Furthermore, YFCC15M [51], which has been used to train CLIP, is adopted as low-quality paired web data for settings 2 and 3.

MS-COCO is a widely used dataset in the image captioning task, which consists of 123,287 images, each paired with 5 human-annotated captions. Flickr30K is another human-annotated dataset similar to MS-COCO, where each image is also paired with 5 reference captions, but it only contains 31,000 images. YFCC15M is a subset of the large-scale web-collected dataset YFCC100M, where each image is annotated with a weakly paired alt-text.

Taking the experiments on MS-COCO as examples, TABLE I shows a comparison of the available data under the four different settings, which cover the majority of real-world scenarios. Specifically, we sample 1% paired MS-COCO data for setting 1, about 100K paired YFCC data for setting 2, and 5K YFCC image data for setting 3.
TABLE II. In-domain captioning results on MS-COCO and Flickr30K. The superscript ∗ indicates that results are from MAGIC [24]. Weakly or unsupervised approaches use CLIP with different backbones as the encoder. DeCap [26] constructs a lightweight Transformer Decoder (T.D.) instead of pre-trained GPT-2. CLOSE [27] uses the T5 model [18]. Settings 1-4 are abbreviated as S1-4. The first six metric columns report MS-COCO, the last six Flickr30K.

| Method | Encoder | Decoder | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fully Supervised: | | | | | | | | | | | | | | |
| BUTD | | | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 | - | 27.3 | 21.7 | - | 56.6 | 16.0 |
| UniVLP | | | - | 36.5 | 28.4 | - | 116.9 | 21.2 | - | 30.1 | 23.0 | - | 67.4 | 17.0 |
| ClipCap | | | - | 33.5 | 27.5 | - | 113.1 | 21.1 | - | - | - | - | - | - |
| Oscar | | | - | 36.5 | 30.3 | - | 123.7 | 23.1 | - | - | - | - | - | - |
| Weakly or Unsupervised: | | | | | | | | | | | | | | |
| ZeroCap∗ | ViT-B/32 | GPT-2 | 49.8 | 7.0 | 15.4 | 31.8 | 34.5 | 9.2 | 44.7 | 5.4 | 11.8 | 27.3 | 16.8 | 6.2 |
| MAGIC | ViT-B/32 | GPT-2 | 56.8 | 12.9 | 17.4 | 39.9 | 49.3 | 11.3 | 44.5 | 6.4 | 13.1 | 31.6 | 20.4 | 7.1 |
| DeCap | ViT-B/32 | T.D. | - | 24.7 | 25.0 | - | 91.2 | 18.7 | - | 21.2 | 21.8 | - | 56.7 | 15.2 |
| CapDec | RN50x4 | GPT-2 | 69.2 | 26.4 | 25.1 | 51.8 | 91.8 | - | 55.5 | 17.7 | 20.0 | 43.9 | 39.1 | - |
| CLOSE | ViT-L/14 | T5 | - | 22.1 | 23.7 | - | 81.2 | 17.7 | - | - | - | - | - | - |
| CLOSE w/Tuned Noise | ViT-L/14 | T5 | - | 28.6 | 25.2 | - | 95.4 | 18.1 | - | - | - | - | - | - |
| TIPCap (S1) | RN50x4 | GPT-2 | 74.6 | 30.7 | 26.7 | 54.2 | 106.7 | 20.3 | 71.1 | 25.6 | 22.5 | 49.1 | 63.7 | 16.2 |
| TIPCap (S2) | RN50x4 | GPT-2 | 72.7 | 28.6 | 25.6 | 52.5 | 100.6 | 19.6 | 68.0 | 23.7 | 21.3 | 47.3 | 57.8 | 15.2 |
| TIPCap (S3) | RN50x4 | GPT-2 | 73.0 | 30.4 | 26.5 | 53.8 | 104.5 | 20.0 | 68.4 | 24.2 | 21.9 | 48.1 | 61.2 | 16.1 |
| TIPCap (S4) | RN50x4 | GPT-2 | 71.3 | 29.8 | 26.2 | 53.4 | 102.1 | 19.4 | 67.5 | 24.0 | 21.7 | 47.7 | 59.4 | 15.9 |
| TIPCap (S4) | ViT-L/14 | GPT-2 | 73.3 | 31.4 | 26.9 | 54.2 | 106.6 | 20.2 | 69.6 | 26.1 | 23.0 | 49.3 | 65.7 | 17.0 |
TABLE III. Cross-domain captioning results. X ⟹ Y means that the model is trained on dataset X but evaluated on dataset Y. The first six metric columns report Flickr30K ⟹ MS-COCO, the last six MS-COCO ⟹ Flickr30K.

| Method | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAGIC | 41.4 | 5.2 | 12.5 | 30.7 | 18.3 | 5.7 | 46.4 | 6.2 | 12.2 | 31.3 | 17.5 | 5.9 |
| CapDec | 43.3 | 9.2 | 16.3 | 36.7 | 27.3 | - | 60.2 | 17.3 | 18.6 | 42.7 | 35.7 | - |
| DeCap | - | 12.1 | 18.0 | - | 44.4 | 10.9 | - | 16.3 | 17.9 | - | 35.7 | 11.1 |
| TIPCap (S1) | 59.8 | 16.7 | 19.4 | 42.3 | 56.0 | 12.5 | 66.9 | 19.8 | 19.8 | 45.3 | 48.2 | 13.7 |
| TIPCap (S2) | 55.9 | 14.5 | 17.8 | 40.4 | 47.8 | 11.4 | 63.5 | 17.3 | 18.4 | 43.2 | 41.6 | 12.3 |
| TIPCap (S3) | 55.9 | 14.2 | 18.4 | 40.6 | 48.7 | 11.9 | 63.7 | 18.6 | 19.2 | 44.0 | 42.8 | 13.0 |
| TIPCap (S4) | 55.8 | 14.3 | 18.4 | 40.5 | 48.5 | 11.9 | 63.8 | 18.7 | 19.2 | 44.1 | 42.4 | 12.9 |

2) Evaluation metrics: Following the common paradigm, we adopt five widely used metrics for evaluation, including BLEU-N [31], METEOR [52], ROUGE-L [53], CIDEr [54], and SPICE [55], which are denoted as B-N, M, R, C, and S for simplicity.

3) Implementation details: Our TIPCap model is implemented in PyTorch [56] and trained on 4 Nvidia Tesla V100 (32GB) GPUs. We first train our model without introducing interactive prompts for 10 epochs, where the learning rate is warmed up to 5e-4 during 1250 steps and decayed linearly; then we freeze the parameters of the mapping module and reverse mapping module and train our model with interactive prompts for another 5 epochs with a learning rate of 1e-6 in this training stage. For optimization, we set the batch size on each GPU to 32 and adopt the AdamW optimizer [57] with a weight decay of 0.1 in both of the above stages, and the beam size is set to 5 during inference. Trained models and source code will be released.

B. Comparison with Existing Models

1) In-domain captioning: TABLE II shows the in-domain captioning results on MS-COCO and Flickr30K. We compare our proposed TIPCap with several approaches at different supervision levels: 1) fully supervised approaches, which rely on human-annotated paired data for model training, including BUTD [3], UniVLP [58], ClipCap [59], and Oscar [46]; and 2) weakly or unsupervised approaches, which adopt pre-trained foundation models (e.g., CLIP [19], GPT-2 [17] and T5 [18]) to perform image captioning on unpaired or text-only data, including ZeroCap [23], MAGIC [24], DeCap [26], CapDec [25] and CLOSE [27]. Our proposed TIPCap belongs to the second category, weakly or unsupervised approaches.

As shown in TABLE II, we report the performance results of the four different data settings. As expected, TIPCap (Setting 1) performs better than the other three settings, as it estimates the distribution parameters of the mapping module from high-quality paired data, which introduces strong and credible prior information of the modality bias and can be regarded as an upper bound of our proposed approach. TIPCap (Setting 2) uses weakly paired web data but performs worse, because the alt-text is low-quality and has a large margin with the training text corpus. TIPCap (Setting 3) and TIPCap (Setting 4) have less data available but also achieve superior performance that significantly outperforms other state-of-the-art models, which shows the effectiveness of our trainable mapping module. Especially, our TIPCap uses CLIP with the RN50x4 backbone and the GPT-2 model but outperforms the strong competitor CLOSE, which adopts CLIP with the ViT-L/14 backbone and the T5 model.

Overall, our proposed TIPCap achieves state-of-the-art performance and shows substantial performance improvement compared with recent weakly or unsupervised methods, demonstrating its effectiveness and advantages.
Fig. 4. Examples of captions generated by TIPCap with simulated interactive prompts; images come from the MS-COCO Karpathy test split. "Reference" indicates the generated caption without prompt information; "Prompt" indicates the simulated user-specified prompt information; "Prediction" shows the newly generated caption with prompt information.

TABLE IV. Performance comparison with simulated interactive prompt information in TIPCap.

| Encoder | Setting | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| CLIP RN50x4 | S1 | 77.27±.15 | 32.73±.06 | 27.57±.06 | 55.53±.12 | 113.17±.35 | 21.97±.06 |
| CLIP RN50x4 | S2 | 75.93±.06 | 30.77±.12 | 26.70±.10 | 54.03±.06 | 107.93±.06 | 21.60±.00 |
| CLIP RN50x4 | S3 | 78.23±.06 | 33.90±.10 | 28.50±.00 | 56.33±.06 | 116.27±.06 | 24.27±.06 |
| CLIP RN50x4 | S4 | 77.80±.10 | 33.60±.17 | 28.30±.00 | 56.20±.00 | 115.33±.06 | 23.73±.06 |
| CLIP ViT-L/14 | S1 | 77.53±.06 | 33.03±.12 | 28.00±.10 | 55.80±.00 | 115.87±.23 | 22.40±.00 |
| CLIP ViT-L/14 | S2 | 76.30±.10 | 31.30±.20 | 27.03±.06 | 54.23±.06 | 110.23±.42 | 22.03±.06 |
| CLIP ViT-L/14 | S3 | 78.57±.15 | 34.40±.20 | 28.40±.00 | 56.27±.06 | 116.87±.21 | 24.10±.00 |
| CLIP ViT-L/14 | S4 | 78.27±.15 | 33.90±.17 | 28.37±.06 | 56.07±.06 | 116.57±.42 | 24.13±.06 |

2) Cross-domain captioning: As shown in TABLE III, we conduct cross-domain experiments to further explore the generalization ability of our approach. Specifically, we train our TIPCap on a source dataset (e.g., MS-COCO), but perform inference on a different target dataset (e.g., Flickr30K). We compare TIPCap with several text-only methods, including MAGIC [24], CapDec [25], and DeCap [26]. Our TIPCap still outperforms all compared methods, demonstrating the superiority of our approach in generalization ability.

C. Ablation Studies

1) Impact of Interactive Prompts: Due to the dynamic nature of the prompt information and the lack of relevant benchmarks, it is not easy to give accurate quantitative results when introducing interactive prompts.

To verify the effectiveness of interactive prompts, we perform the evaluation on MS-COCO with simulated user-specified prompt information. Specifically, we first apply TIPCap to generate a caption without prompt information as the rough caption; then we do a second-time inference to introduce the prompt information, in which the simulated user-specified prompt information is sampled from the nouns or noun key phrases (ignored by the rough caption) extracted from the corresponding ground truth caption. We perform the evaluation three times in each setting and report the average performance with the standard deviation, as shown in TABLE IV. The results demonstrate that the introduction of prompt information brings positive influences on performance.
Fig. 5. Histogram visualization of the CLIP image and text embedding difference of the MS-COCO training set, where orange and green indicate the histogram statistics on all dimensions (global) and specific dimensions (local; e.g., dimension = 1, 27, 338, 500) separately.

Fig. 6. t-SNE visualization of CLIP image and text embeddings from the MS-COCO training set, where blue and yellow points indicate image and text embeddings, respectively. (a) without mapping module; (b) mapping module driven by a univariate Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu \simeq 0.0009$ and $\sigma \simeq 0.0440$ are estimated from MS-COCO paired data; (c) mapping module driven by $\mathcal{N}(\mu, \sigma^2)$, where $\mu = 0$ and $\sigma = 0.016$, which is adopted in CapDec [25]; (d) mapping module driven by $\mathcal{N}(\mu, \sigma^2)$, where $\mu = 0$ and $\sigma = 0.08$, which is adopted in CLOSE [27]; (e) mapping module driven by a multivariate Gaussian distribution $\mathcal{N}(\vec{\mu}, \Sigma)$, where $\vec{\mu}$ and $\Sigma$ are estimated from COCO paired data.

TABLE V. Ablation study of the mapping module and reverse mapping module. (A) Our full TIPCap model; (B) TIPCap with the mapping module driven by an independent Gaussian distribution; (C) TIPCap model with the reverse mapping module removed. "N/A" refers to "Not Applicable". The first six metric columns report CLIP RN50x4, the last six CLIP ViT-L/14.

| | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (A) TIPCap, N(μ⃗, Σ) | | | | | | | | | | | | |
| S1 | 74.6 | 30.7 | 26.7 | 54.2 | 106.7 | 20.3 | 75.4 | 31.2 | 27.2 | 54.5 | 109.7 | 20.9 |
| S2 | 72.7 | 28.6 | 25.6 | 52.5 | 100.6 | 19.6 | 73.4 | 29.1 | 26.0 | 52.7 | 103.5 | 20.2 |
| S3 | 73.0 | 30.4 | 26.5 | 53.8 | 104.5 | 20.0 | 73.7 | 31.2 | 26.7 | 53.9 | 107.5 | 20.5 |
| S4 | 71.3 | 29.8 | 26.2 | 53.4 | 102.1 | 19.4 | 73.3 | 31.4 | 26.9 | 54.2 | 106.6 | 20.2 |
| (B) TIPCap, N(μ, σ²) | | | | | | | | | | | | |
| S1 | 68.5 | 24.8 | 24.1 | 49.7 | 89.5 | 18.6 | 66.7 | 22.5 | 23.2 | 46.8 | 87.3 | 18.1 |
| S2 | 69.4 | 25.5 | 24.4 | 50.2 | 91.6 | 18.7 | 68.0 | 23.5 | 23.5 | 47.7 | 90.1 | 18.3 |
| S3 | 69.9 | 26.2 | 24.6 | 50.7 | 92.9 | 18.9 | 70.3 | 25.3 | 24.7 | 50.2 | 94.1 | 19.0 |
| S4 | 70.0 | 25.9 | 24.1 | 50.5 | 91.5 | 18.5 | 69.9 | 25.3 | 24.4 | 49.9 | 93.8 | 18.8 |
| (C) TIPCap, N(μ⃗, Σ), w/o reverse mapping module | | | | | | | | | | | | |
| S1 | 74.1 | 30.2 | 26.2 | 53.7 | 105.0 | 19.8 | 75.3 | 31.3 | 26.9 | 54.4 | 109.2 | 20.8 |
| S2 | 72.4 | 28.3 | 25.1 | 52.0 | 99.2 | 19.1 | 73.7 | 29.2 | 25.9 | 52.8 | 103.1 | 19.8 |
| S3 | 71.4 | 30.0 | 26.3 | 53.4 | 102.4 | 19.4 | 73.2 | 30.8 | 26.6 | 53.9 | 105.5 | 20.1 |
| S4 | N/A | | | | | | N/A | | | | | |
Furthermore, in order to make a more intuitive qualitative comparison, we give some examples of captions generated by TIPCap when introducing simulated interactive prompts, as shown in Fig. 4, in which the "Reference" caption is generated by the same TIPCap model without prompt information (i.e., the rough caption), "Prompt" indicates the simulated user-specified prompt information constructed by sampling from the ignored ground truth nouns or phrases, and "Prediction" gives the new caption generated by our TIPCap.

2) Impact of Mapping Module: This paper makes the assumption that the feature bias between image-text paired CLIP embeddings can be estimated as a multivariate Gaussian distribution, which is a natural extension of the assumption adopted in CapDec [25] and CLOSE [27] that uses an independent Gaussian distribution.

In this part, we conduct experiments to investigate the effect of the above two strategies. The results are reported in TABLE V (A) and (B); we can see that the performances are better when using the multivariate Gaussian distribution. The results of setting 1 when using $\mathcal{N}(\mu, \sigma^2)$ perform worst and intuitively show that the independent Gaussian distribution is a suboptimal strategy, which cannot effectively characterize the feature bias. Moreover, the results of setting 3 and setting 4 in TABLE V (B) demonstrate that our TIPCap is also effective when using an independent Gaussian distribution and can achieve competitive performance.

Fig. 5 shows a simple histogram visualization of the CLIP image and text embedding difference of the MS-COCO training set; it can be seen that the distributions of the embedding differences in different dimensions are not consistent. This is why we apply a multivariate Gaussian distribution instead of an independent Gaussian distribution.

As shown in Fig. 6, we randomly sample 500 paired image-text data from the MS-COCO training set and visualize the CLIP image and text embeddings. Fig. 6 (a) shows the clear modality gap between CLIP image embeddings and CLIP text embeddings. Fig. 6 (b), (c) and (d) show the influence of the mapping module driven by a univariate Gaussian distribution. Fig. 6 (b) indicates that the modality gap still exists after applying the mapping module driven by $\mathcal{N}(\mu, \sigma^2)$, even if the mean $\mu$ and standard deviation $\sigma$ are estimated from MS-COCO paired data. CapDec [25] and CLOSE [27] also use $\mathcal{N}(\mu, \sigma^2)$ but with a larger standard deviation, as shown in Fig. 6 (c) and (d), which mitigates the modality gap but not significantly. Fig. 6 (e) shows that our mapping module driven by the multivariate Gaussian distribution can effectively reduce the modality gap.
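The observation behind Fig. 5 and Fig. 6 can be checked directly on any set of precomputed paired CLIP embeddings with a few lines of PyTorch. This diagnostic is not part of the paper; it is simply a way to see why a shared scalar σ (a diagonal, tied covariance) underfits the embedding difference.

```python
# Small diagnostic matching the Fig. 5 observation: per-dimension statistics of
# the CLIP image-text embedding difference differ across dimensions and the
# dimensions are correlated, motivating a full covariance N(mu, Sigma).
import torch


def modality_gap_stats(img_emb: torch.Tensor, txt_emb: torch.Tensor):
    delta = img_emb - txt_emb                          # (N, D), precomputed embeddings assumed
    per_dim_mean, per_dim_std = delta.mean(dim=0), delta.std(dim=0)
    # A wide spread of per-dimension standard deviations means a single shared
    # sigma (independent Gaussian) is a poor fit.
    print("std range across dimensions:",
          per_dim_std.min().item(), "to", per_dim_std.max().item())
    # Non-zero off-diagonal correlations indicate a non-diagonal Sigma.
    corr = torch.corrcoef(delta.T)
    off_diag = corr - torch.diag(torch.diag(corr))
    print("mean |off-diagonal correlation|:", off_diag.abs().mean().item())
    return per_dim_mean, per_dim_std
```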
TABLE VI. Performance comparison with different prefix lengths L and different layer numbers N in the prefix projector module. Experiments are conducted under setting 4. The first six metric columns report CLIP RN50x4, the last six CLIP ViT-L/14.

| | | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L | 2 | 71.1 | 29.3 | 26.0 | 53.2 | 101.1 | 19.2 | 74.0 | 31.6 | 26.5 | 54.2 | 106.7 | 20.0 |
| L | 4 | 71.3 | 29.8 | 26.2 | 53.4 | 102.1 | 19.4 | 73.3 | 31.4 | 26.9 | 54.2 | 106.6 | 20.2 |
| L | 5 | 71.1 | 29.5 | 26.3 | 53.3 | 101.7 | 19.5 | 73.3 | 31.3 | 26.8 | 54.2 | 106.7 | 20.3 |
| L | 10 | 71.0 | 29.3 | 26.0 | 53.1 | 100.5 | 19.2 | 72.0 | 30.7 | 27.3 | 54.4 | 106.5 | 20.7 |
| L | 20 | 71.1 | 29.5 | 26.1 | 53.1 | 101.5 | 19.4 | 73.4 | 30.9 | 26.7 | 53.9 | 106.7 | 20.2 |
| L | 40 | 70.1 | 29.2 | 26.2 | 53.0 | 99.9 | 19.1 | 72.3 | 30.4 | 27.1 | 54.3 | 106.3 | 20.5 |
| N | 1 | 71.2 | 29.9 | 26.1 | 53.2 | 101.5 | 19.1 | 72.9 | 31.1 | 26.5 | 53.7 | 105.2 | 19.9 |
| N | 2 | 71.7 | 30.1 | 26.3 | 53.5 | 101.9 | 19.4 | 73.2 | 31.2 | 26.7 | 54.1 | 106.0 | 20.1 |
| N | 3 | 71.3 | 29.8 | 26.2 | 53.4 | 102.1 | 19.4 | 73.3 | 31.4 | 26.9 | 54.2 | 106.6 | 20.2 |
| N | 4 | 71.6 | 29.9 | 26.2 | 53.4 | 102.1 | 19.3 | 73.8 | 31.1 | 26.7 | 54.1 | 107.0 | 20.5 |
| N | 5 | 71.6 | 30.0 | 26.2 | 53.5 | 102.3 | 19.5 | 74.1 | 31.4 | 26.8 | 54.1 | 108.2 | 20.6 |
| N | 6 | 71.2 | 29.8 | 26.3 | 53.5 | 102.4 | 19.6 | 73.7 | 31.5 | 27.0 | 54.1 | 108.5 | 20.8 |

TABLE VII. Performance comparison with different ratios of paired data (i.e., MS-COCO) used in setting 1. Columns grouped as in TABLE VI.

| r | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2% | 73.3 | 29.4 | 26.1 | 53.4 | 103.0 | 19.7 | 74.7 | 30.3 | 26.7 | 53.9 | 107.5 | 20.6 |
| 0.5% | 74.2 | 30.5 | 26.6 | 54.1 | 105.8 | 20.1 | 75.3 | 31.2 | 27.1 | 54.7 | 109.2 | 20.9 |
| 1% | 74.6 | 30.7 | 26.7 | 54.2 | 106.7 | 20.3 | 75.4 | 31.2 | 27.2 | 54.5 | 109.7 | 20.9 |
| 5% | 74.4 | 30.7 | 26.7 | 54.3 | 106.5 | 20.3 | 75.5 | 31.5 | 27.3 | 54.8 | 109.9 | 20.9 |
| 20% | 74.5 | 31.0 | 26.7 | 54.4 | 107.1 | 20.3 | 75.7 | 31.9 | 27.3 | 55.0 | 110.5 | 21.0 |
| 50% | 74.6 | 30.9 | 26.8 | 54.4 | 107.3 | 20.3 | 75.6 | 31.8 | 27.5 | 55.0 | 111.0 | 21.1 |
| 100% | 74.4 | 30.8 | 26.7 | 54.4 | 106.6 | 20.3 | 75.8 | 31.7 | 27.3 | 54.9 | 110.6 | 21.0 |

TABLE VIII. Performance comparison with different numbers of images used in setting 3. All images are randomly sampled from YFCC15M. Columns grouped as in TABLE VI.

| #img | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 73.0 | 30.3 | 26.3 | 53.8 | 104.0 | 19.9 | 73.8 | 31.1 | 26.7 | 54.0 | 107.3 | 20.4 |
| 1K | 72.9 | 30.3 | 26.3 | 53.8 | 104.2 | 19.9 | 73.9 | 31.2 | 26.7 | 54.0 | 107.3 | 20.4 |
| 5K | 73.0 | 30.4 | 26.5 | 53.8 | 104.5 | 20.0 | 73.7 | 31.2 | 26.7 | 53.9 | 107.5 | 20.5 |
| 10K | 72.9 | 30.4 | 26.4 | 53.7 | 104.0 | 19.9 | 73.8 | 31.2 | 26.7 | 53.9 | 107.3 | 20.4 |

TABLE IX. Performance comparison of three different sampled data under setting 1. Experiments are conducted on MS-COCO. Columns grouped as in TABLE VI.

| | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sample 1 | 74.6 | 30.7 | 26.7 | 54.2 | 106.7 | 20.3 | 75.4 | 31.2 | 27.2 | 54.5 | 109.7 | 20.9 |
| sample 2 | 74.8 | 31.0 | 26.7 | 54.3 | 107.2 | 20.5 | 75.3 | 31.4 | 27.1 | 54.5 | 109.9 | 20.8 |
| sample 3 | 74.7 | 30.9 | 26.7 | 54.3 | 107.2 | 20.3 | 75.4 | 31.6 | 27.3 | 54.7 | 110.7 | 21.0 |

TABLE X. Performance comparison of different image sources ("YFCC15M" vs. "MS-COCO") under setting 3. Experiments are conducted on MS-COCO. Columns grouped as in TABLE VI.

| | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YFCC15M | 73.0 | 30.4 | 26.5 | 53.8 | 104.5 | 20.0 | 73.7 | 31.2 | 26.7 | 53.9 | 107.5 | 20.5 |
| MS-COCO | 72.9 | 30.6 | 26.5 | 53.8 | 104.9 | 20.1 | 74.1 | 31.5 | 26.7 | 54.1 | 107.9 | 20.4 |
3) Impact of Reverse Mapping Module: To explore the effectiveness of our proposed reverse mapping module, we report the performance of our TIPCap with the reverse mapping module removed, as shown in TABLE V (C). Comparing TABLE V (A) and (C), it can be seen that the reverse mapping module brings a slight performance improvement over all metrics. More importantly, the reverse mapping module is essential for our approach in setting 4, which makes our proposed TIPCap still applicable when only text data is available.

4) Impact of prefix projector: The prefix projector aims to convert the CLIP embedding to prefix embeddings as the input of the GPT-2 model. Following ClipCap [59] and CapDec [25], we use a transformer-based architecture as the prefix projector. To explore its effect, we perform ablation studies on the prefix length L and the layer number N respectively, as shown in TABLE VI. Our default implementation uses L = 4 and N = 3. From the results, TIPCap shows good robustness when increasing L or N. This brings the advantage that we do not need a heavy module to perform the indispensable projection from the CLIP embedding space to the GPT-2 embedding space.

5) Impact of available external data: In setting 1, the parameters $\vec{\mu}$ and $\Sigma$ are estimated from human-annotated high-quality paired data. We perform the estimation using different ratios of paired data to study its effect, as shown in TABLE VII. Theoretically, the estimated parameters can more accurately characterize the modality bias when more paired data is available. From the results, we can see that the performance tends to stabilize when more than 1% paired data is available, which indicates that we can obtain sufficiently accurate estimated parameters without all the data.

In setting 3, we use a few source-agnostic image data to introduce the latent prior information of the CLIP image embedding space. To explore its effect, we sample different numbers of image data from the YFCC15M dataset for training. The results are reported in TABLE VIII. When using only 500 images, the performance is still advantageous enough, which indicates the data efficiency of our TIPCap.

6) Impact of different sampled data under setting 1: Under setting 1, the parameters of our mapping module (i.e., mean and covariance) are estimated from sampled paired data (e.g., MS-COCO or Flickr30K), which raises a question: whether different sampled data will influence the performance. To verify the influence of different sampling data, we conduct experiments on the MS-COCO dataset. We sample 1% paired MS-COCO data for the estimation of the mean and covariance, and perform the sampling three times, resulting in three different sets of parameters. Then we train our TIPCap with these three different estimations of $\mathcal{N}(\vec{\mu}, \Sigma)$. The results are reported in TABLE IX and show that the performance is robust to different sampling data, which further demonstrates the effectiveness and robustness of applying a multivariate Gaussian distribution to mitigate the modality gap.

7) Impact of image source under setting 3: As described in Section IV-A, we use 5K YFCC images for our TIPCap training under setting 3, which aims to prove that heterologous images are also applicable and effective. We also conduct experiments using homologous images on the MS-COCO dataset; the results are shown in TABLE X. Specifically, we adopt the 5K images of the MS-COCO Karpathy validation split, which are homologous to the training corpus but are not paired and do not affect the performance evaluation on the Karpathy test split. From the results, both YFCC images and MS-COCO images achieve comparable performance under setting 3, which indicates that the requirement for image data is source-agnostic. Combining the results shown in TABLE VIII, this indicates that although the available images are few and heterologous with the training text corpus, TIPCap can still achieve competitive performance under setting 3. Compared with setting 4, TIPCap under setting 3 can achieve better performance because we introduce prior information of real images for training. Although it is difficult to collect high-quality paired image data for a text corpus in real-world scenarios, the image data-efficiency under setting 3 makes the collection cost of image data affordable.
Fig. 7. Examples of captions generated by TIPCap. The first two images come from the MS-COCO dataset and the second two images come from the Flickr30K dataset. "SS1M to MS-COCO / Flickr30K" indicates that the captions are generated by TIPCap trained on SS1M. "MS-COCO" and "Flickr30K" denote TIPCap trained on MS-COCO and Flickr30K, respectively.

D. More Generalization Analysis

To further quantitatively evaluate the generalization performance of our proposed TIPCap, we conduct experiments on two new datasets: NoCaps [60] and SS1M [14].

1) Performance on NoCaps dataset: MS-COCO is limited to 80 classes; NoCaps [60] provides a benchmark to measure the open-set capability for novel objects (classes unseen in the MS-COCO dataset). The NoCaps dataset contains only validation and test sets, and is divided into three parts: in-domain contains images portraying only MS-COCO classes; near-domain contains both MS-COCO and novel classes; out-of-domain contains only novel classes. Following DeCap [26], we evaluate our TIPCap using the validation set on the official evaluation server (https://eval.ai/web/challenges/challenge-page/355/overview). TABLE XII shows the performance comparison on the MS-COCO Karpathy test split and the NoCaps validation split. Compared to DeCap, TIPCap achieves significant improvement on all metrics.

2) Performance using SS1M dataset: SS1M [14] is a web-collected text corpus, which uses the names of the 80 MS-COCO classes as keywords to crawl image descriptions from Shutterstock (https://www.shutterstock.com), resulting in 2,322,628 distinct image descriptions in total, which is larger than MS-COCO and Flickr30K. TABLE XI gives a comparison of some text examples between SS1M and MS-COCO.

We explore using the SS1M corpus to train our TIPCap under setting 3, where the image data is sampled from YFCC15M, identical to the experiments conducted on MS-COCO. After training, we evaluate TIPCap on the MS-COCO and Flickr30K test datasets as shown in TABLE XIII (middle 2 rows); the results indicate that TIPCap can be easily extended to larger datasets and also shows good zero-shot performance. Fig. 7 shows some examples of captions generated by TIPCap trained on SS1M under setting 3 (see "SS1M to MS-COCO / Flickr30K"), which can correctly describe images, even with more descriptive details.

TABLE XI. Comparison of text between SS1M and MS-COCO.

SS1M text examples:
1. Bus Lane Sign in Urban Setting in Black and White Sepia Tone
2. France, Paris, 04/04/2015, parc de bercy, a pathway with luscious green vegetation and a park bench
3. Cat looking at sea in Santorini, Greece
4. Yellow bus handles inside the vehicle with blur background
5. Many people in the street on the bicycles, holiday
6. Concept fighting, friendship, promise, success, Two people put their hands together and raised
MS-COCO text examples:
1. Modern living room interior with many green plants
2. A couple of chairs that are at a table
3. A flat screen TV mounted to a wall.
4. A mid sized bathroom with toilette, shower and vanity mirror above a sink.
5. A dock area with various toilets and a television on it.
6. There is a room with distinctive things in the picture.

TABLE XII. Results on the MS-COCO Karpathy test split and NoCaps validation split, where "in", "near" and "out" indicate "in-domain", "near-domain" and "out-of-domain" respectively. All models are trained on the MS-COCO text corpus. Columns B-1 to S report MS-COCO; columns in/near/out/overall report NoCaps validation CIDEr.

| Method | B-1 | B-4 | M | R | C | S | in | near | out | overall |
|---|---|---|---|---|---|---|---|---|---|---|
| DeCap | - | 24.7 | 25.0 | - | 91.2 | 18.7 | 65.2 | 47.8 | 25.8 | 45.9 |
| CLIP RN50x4: | | | | | | | | | | |
| TIPCap (S1) | 74.6 | 30.7 | 26.7 | 54.2 | 106.7 | 20.3 | 77.6 | 63.3 | 44.8 | 61.6 |
| TIPCap (S2) | 72.7 | 28.6 | 25.6 | 52.5 | 100.6 | 19.6 | 71.1 | 56.6 | 39.0 | 55.1 |
| TIPCap (S3) | 73.0 | 30.4 | 26.5 | 53.8 | 104.5 | 20.0 | 74.6 | 59.2 | 37.1 | 56.9 |
| TIPCap (S4) | 71.3 | 29.8 | 26.2 | 53.4 | 102.1 | 19.4 | 74.3 | 59.2 | 36.4 | 56.7 |
| CLIP ViT-L/14: | | | | | | | | | | |
| TIPCap (S1) | 75.4 | 31.2 | 27.2 | 54.5 | 109.7 | 20.9 | 81.7 | 67.5 | 49.8 | 65.9 |
| TIPCap (S2) | 73.4 | 29.1 | 26.0 | 52.7 | 103.5 | 20.2 | 76.1 | 61.3 | 44.5 | 60.0 |
| TIPCap (S3) | 73.7 | 31.2 | 26.7 | 53.9 | 107.5 | 20.5 | 77.3 | 62.7 | 40.6 | 60.9 |
| TIPCap (S4) | 73.3 | 31.4 | 26.9 | 54.2 | 106.6 | 20.2 | 80.2 | 62.3 | 39.6 | 60.3 |

TABLE XIII. Cross-domain image captioning performance. We train our TIPCap using the SS1M dataset under setting 3 and setting 1, and perform evaluation on the MS-COCO and Flickr30K datasets. The first six metric columns report SS1M ⟹ MS-COCO, the last six SS1M ⟹ Flickr30K.

| Method | B-1 | B-4 | M | R | C | S | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeCap | - | 8.9 | 17.5 | - | 50.6 | 13.1 | - | - | - | - | - | - |
| TIPCap (S3, CLIP RN50x4) | 54.7 | 11.5 | 18.0 | 35.7 | 53.0 | 13.9 | 52.2 | 9.1 | 15.4 | 32.7 | 32.2 | 11.0 |
| TIPCap (S3, CLIP ViT-L/14) | 56.5 | 12.3 | 18.6 | 37.2 | 57.3 | 14.5 | 56.6 | 10.7 | 16.2 | 35.3 | 36.3 | 11.3 |
| TIPCap (S1, CLIP RN50x4) | 62.9 | 16.5 | 20.6 | 41.3 | 69.2 | 16.3 | 59.2 | 13.0 | 17.1 | 37.2 | 38.7 | 11.8 |
| TIPCap (S1, CLIP ViT-L/14) | 65.1 | 18.3 | 21.5 | 43.2 | 73.9 | 17.0 | 62.6 | 14.4 | 18.1 | 39.3 | 44.0 | 12.5 |

Other interesting results, shown in the bottom two rows of Table XIII, come from training TIPCap under setting 1, where the required parameters µ⃗ and Σ are estimated directly from 1% of the MS-COCO paired data. This paradigm achieves better performance even though the distribution parameters are estimated from a different dataset. We attribute this to two reasons: 1) as shown in Table XI, the SS1M text corpus has good fluency and language accuracy; 2) the text styles of SS1M and MS-COCO are not exactly the same but are similar, with both corpora describing a main object plus additional descriptive attributes such as state and position.
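To make this estimation concrete, the following is a minimal sketch (our own illustration, not the official TIPCap code) of how µ⃗ and Σ could be estimated from a small set of paired CLIP embeddings and then used to sample a modality-gap offset for a text embedding; the function names and the diagonal jitter term are assumptions made for the example.

# Minimal sketch (illustrative only): estimate the modality-gap Gaussian from
# ~1% paired CLIP embeddings, then sample an offset that moves a text
# embedding toward the image embedding space.
import torch

@torch.no_grad()
def estimate_gap_gaussian(image_feats: torch.Tensor, text_feats: torch.Tensor):
    """image_feats, text_feats: (N, D) CLIP embeddings of N image-text pairs."""
    # Work in the L2-normalized CLIP space, as both encoders are typically used.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    gap = image_feats - text_feats              # per-pair modality gap, (N, D)
    mu = gap.mean(dim=0)                        # mean vector, (D,)
    sigma = torch.cov(gap.T)                    # covariance matrix, (D, D)
    # Small diagonal jitter keeps sigma positive definite when N is small.
    sigma = sigma + 1e-4 * torch.eye(sigma.shape[0])
    return mu, sigma

def shift_text_embedding(text_feat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor):
    """Add a sampled gap so the text embedding better mimics an image embedding."""
    dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=sigma)
    return text_feat + dist.sample()

In fully text-only settings the distribution parameters would instead come from a prior or be refined during training, so this sketch only covers the setting-1 case discussed above.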
V. CONCLUSION

In this paper, we propose TIPCap, a unified solution for image captioning with text-centric training data, which covers the vast majority of data configurations in real-world scenarios. In TIPCap, a mapping module driven by a multivariate Gaussian distribution mitigates the modality gap between image and text embeddings. Additionally, TIPCap can incorporate optional prompt information to further improve the generated captions. Extensive experiments demonstrate the effectiveness of TIPCap. We believe this study offers a new paradigm and will benefit the image captioning community.
REFERENCES

[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3156–3164.
[2] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2048–2057.
[3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6077–6086.
[4] L. Huang, W. Wang, J. Chen, and X. Wei, “Attention on attention for image captioning,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 4633–4642.
[5] Y. Pan, T. Yao, Y. Li, and T. Mei, “X-linear attention networks for image captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10968–10977.
[6] Y. Luo, J. Ji, X. Sun, L. Cao, Y. Wu, F. Huang, C. Lin, and R. Ji, “Dual-level collaborative transformer for image captioning,” in Proc. AAAI Conf. Artif. Intell., 2021, pp. 2286–2293.
[7] Y. Wang, J. Xu, and Y. Sun, “End-to-end transformer based model for image captioning,” in Proc. AAAI Conf. Artif. Intell., 2022, pp. 2585–2594.
[8] L. Wu, M. Xu, L. Sang, T. Yao, and T. Mei, “Noise augmented double-stream graph convolutional networks for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 8, pp. 3118–3127, 2021. [Online]. Available: https://doi.org/10.1109/TCSVT.2020.3036860
[9] W. Jiang, W. Zhou, and H. Hu, “Double-stream position learning transformer network for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 11, pp. 7706–7718, 2022. [Online]. Available: https://doi.org/10.1109/TCSVT.2022.3181490
[10] S. Cao, G. An, Z. Zheng, and Z. Wang, “Vision-enhanced and consensus-aware transformer for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 10, pp. 7005–7018, 2022. [Online]. Available: https://doi.org/10.1109/TCSVT.2022.3178844
[11] J. Zhang, Y. Xie, W. Ding, and Z. Wang, “Cross on cross attention: Deep fusion transformer for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 4257–4268, 2023. [Online]. Available: https://doi.org/10.1109/TCSVT.2023.3243725
[12] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[13] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Trans. Assoc. Comput. Linguistics, vol. 2, pp. 67–78, 2014.
[14] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4125–4134.
[15] I. Laina, C. Rupprecht, and N. Navab, “Towards unsupervised image captioning with shared multimodal embeddings,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7413–7423.
[16] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. N. Am. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2019, pp. 4171–4186.
[17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
[18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.
[19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[20] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 4904–4916.
[21] J. Li, D. Li, C. Xiong, and S. C. H. Hoi, “BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 12888–12900.
[22] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3558–3568.
[23] Y. Tewel, Y. Shalev, I. Schwartz, and L. Wolf, “Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 17897–17907.
[24] Y. Su, T. Lan, Y. Liu, F. Liu, D. Yogatama, Y. Wang, L. Kong, and N. Collier, “Language models can see: Plugging visual controls in text generation,” ArXiv, vol. abs/2205.02655, 2022.
[25] D. Nukrai, R. Mokady, and A. Globerson, “Text-only training for image captioning using noise-injected CLIP,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2022, pp. 4055–4063.
[26] W. Li, L. Zhu, L. Wen, and Y. Yang, “Decap: Decoding clip latents for zero-shot captioning via text-only training,” in Proc. Int. Conf. Learn. Representations, 2023.
[27] S. Gu, C. Clark, and A. Kembhavi, “I can’t believe there’s no images! learning visual tasks using only language data,” ArXiv, vol. abs/2211.09778, 2022.
[28] J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” ArXiv, vol. abs/2301.12597, 2023.
[29] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, “OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 23318–23340.
[30] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Gao, and Y. J. Lee, “Segment everything everywhere all at once,” ArXiv, vol. abs/2304.06718, 2023.
[31] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proc. Annu. Meet. Assoc. Comput. Linguist., 2002, pp. 311–318.
[32] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, pp. 1724–1734.
[33] H. Chen, G. Ding, Z. Lin, S. Zhao, and J. Han, “Show, observe and tell: Attribute-driven attention model for image captioning,” in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 606–612.
[34] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4651–4659.
[35] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[36] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
[37] X. Dong, C. Long, W. Xu, and C. Xiao, “Dual graph convolutional networks with transformer and curriculum learning for image captioning,” in Proc. ACM Int. Conf. Multimed., 2021, pp. 2615–2624.
[38] W. Nie, J. Li, N. Xu, A. Liu, X. Li, and Y. Zhang, “Triangle-reward reinforcement learning: A visual-linguistic semantic alignment for image captioning,” in Proc. ACM Int. Conf. Multimed., 2021, pp. 4510–4518.
[39] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image captioning: Transforming objects into words,” in Proc. Adv. neural inf. proces. syst., 2019, pp. 11135–11145.
[40] J. Yu, J. Li, Z. Yu, and Q. Huang, “Multimodal transformer with multi-view visual representation for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4467–4480, 2020. [Online]. Available: https://doi.org/10.1109/TCSVT.2019.2947482
[41] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10575–10584.
[42] J. Wang, Y. Zhang, M. Yan, J. Zhang, and J. Sang, “Zero-shot image captioning by anchor-augmented vision-language space alignment,” ArXiv, vol. abs/2211.07275, 2022.
[43] H. Tan and M. Bansal, “LXMERT: learning cross-modality encoder representations from transformers,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2019, pp. 5099–5110.
[44] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Proc. Adv. neural inf. proces. syst., 2019, pp. 13–23.
[45] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “UNITER: universal image-text representation learning,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 104–120.
[46] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao, “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in Proc. Eur. Conf. Comput. Vis., ser. Lecture Notes in Computer Science, vol. 12375, 2020, pp. 121–137.
[47] J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Proc. Adv. neural inf. proces. syst., 2021, pp. 9694–9705.
[48] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proc. Int. Conf. Learn. Representations, 2014.
[49] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Proc. Adv. neural inf. proces. syst., 2022.
[50] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3128–3137.
[51] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li, “YFCC100M: the new data in multimedia research,” Commun. ACM, vol. 59, no. 2, pp. 64–73, 2016.
[52] M. J. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in Proc. ACL Workshop Statistical Machine Translation, 2014, pp. 376–380.
[53] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. ACL Workshop Text Summarization Branches Out, 2004, pp. 74–81.
[54] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4566–4575.
[55] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in Proc. Eur. Conf. Comput. Vis., vol. 9909, 2016, pp. 382–398.
[56] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Proc. Adv. neural inf. proces. syst., 2019, pp. 8024–8035.
[57] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. Int. Conf. Learn. Representations, 2019.
[58] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao, “Unified vision-language pre-training for image captioning and VQA,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 13041–13049.
[59] R. Mokady, A. Hertz, and A. H. Bermano, “Clipcap: CLIP prefix for image captioning,” ArXiv, vol. abs/2111.09734, 2021.
[60] H. Agrawal, P. Anderson, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, and S. Lee, “nocaps: novel object captioning at scale,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 8947–8956.
