from a few paired MS-COCO data in CapDec, but it is set as a hyper-parameter in CLOSE. In this way, they need text-only training data because the text embedding can be projected into the image embedding space.

Although the above methods propose some ingenious solutions, we still point out several crucial issues here. 1) The above methods are usually only compatible with one or two specific data configurations. However, in real-world applications, users have very different data configurations. For example, apart from the text corpus, some web data, a few high-quality image-text pairs like MS-COCO, or some web image data may also be available. How can we design a unified solution that handles different data configurations? 2) The popular assumption that the image-text feature bias follows an independent Gaussian distribution may be sub-optimal, because correlations exist between different feature dimensions. How can we obtain a better approximation? 3) These methods inevitably output unsatisfactory results. Can we allow the model to handle user-provided prompts (such as objects in the image) to improve its predictions?

Based on the above motivations, we propose a new approach, TIPCap, which is text data-centric with interactive prompts for image captioning, as shown in Fig. 1 (e). Specifically, TIPCap combines CLIP and GPT-2 to fully leverage the advantages of pre-trained models and contains three extra key modules: a mapping module, a reverse mapping module, and a prompt interaction module. Firstly, we take four different data settings into account, which cover the vast majority of data configuration scenarios, and propose a unified solution to estimate text-to-image embedding maps. Secondly, considering the correlation between feature dimensions, the mapping module is driven by a multivariate Gaussian distribution instead of an independent Gaussian distribution; it aims to mitigate the modality gap by performing a simple projection from the CLIP text embedding space to the CLIP image embedding space. The reverse mapping module performs a weak projection from the CLIP image embedding space back to the CLIP text embedding space for stronger robustness. During inference, TIPCap no longer needs the mapping module but directly feeds the CLIP image embedding into the reverse mapping module and follow-up modules to generate captions. Thirdly, the prompt interaction module endows TIPCap with the ability to fuse additional prompt information to generate higher-quality descriptions. With these modules, TIPCap can be trained on text-centric data and predict captions that can be further improved with manual prompts for a given image.

To evaluate our approach, we conduct extensive experiments on two commonly used datasets: MS-COCO [12] and Flickr30K [13]. The results demonstrate that our approach significantly outperforms existing weakly or unsupervised approaches and achieves a new state-of-the-art performance.

Our major contributions can be summarized as follows:
(1) We propose a new approach, TIPCap, for image captioning, which provides a unified solution for four settings with different data configurations;
(2) The mapping module utilizes a multivariate Gaussian distribution to mitigate the modality gap effectively and outperforms the independent Gaussian distribution; our model is able to handle prompt information, which further enhances its flexibility;
(3) Extensive experiments demonstrate the effectiveness of TIPCap and achieve a new state-of-the-art performance.

II. RELATED WORK

A. Image Captioning

1) Supervised Approaches: Inspired by the development of deep learning methods in machine translation [31], [32], most existing models utilize an encoder-decoder framework. Earlier works [1], [2], [33], [34] adopt a CNN to extract image features and decode them into a sentence with an LSTM [35]. Xu et al. [2] introduce an attention mechanism which can dynamically focus on salient regions of the given image. After that, Anderson et al. [3] propose to use Faster R-CNN [36] as the encoder and achieve significant improvement. Several subsequent works [4], [5], [8], [37], [38] follow this paradigm. Recently, transformer-based models have demonstrated excellent performance in the image captioning task [5], [7], [39]–[41]. Although supervised methods have achieved impressive results, high-quality human-annotated paired image-text data is essential for them.

2) Zero-shot Approaches: Zero-shot image captioning aims to generate descriptions without human-annotated data. ZeroCap [23] and MAGIC [24] realize zero-shot image captioning by combining CLIP and GPT-2, and both introduce weak visual control cues through the cosine similarity between the generated text and the given image. Specifically, ZeroCap relies on gradient descent to update the context cache of GPT-2 so that the output matches the given image, while MAGIC proposes a new decoding strategy to regularize the generated word to be close to the given image and the previously generated context. However, the frequent forward passes of the CLIP text encoder slow down the inference speed significantly. Wang et al. [42] argue that the above methods are prone to fall into a harmful contextual language prior and ignore the visual information of the given image. DeCap [26] contains a frozen CLIP and a lightweight text decoder. During training, DeCap takes the CLIP text embedding as input to reconstruct its textual sequence and stores all training CLIP text embeddings as a support memory. During inference, it projects the image embedding into the CLIP text embedding space by calculating a weighted sum of all embeddings in the support memory based on cosine similarity. CapDec [25] and CLOSE [27] also perform zero-shot image captioning using text-only data and follow a similar paradigm, estimating the modality gap between image and text by an independent Gaussian distribution N(0, σ²).

B. Vision-Language Models

Inspired by BERT [16] and its task-agnostic pre-training paradigm, a line of works [43]–[46] has extended it to Vision-Language (VL) pre-training for learning joint representations of image content and natural language. These models directly rely on a pre-trained object detector to extract image region features and employ a multimodal encoder to fuse multi-modal features by solving tasks such as masked language modeling (MLM) and image-text matching (ITM). Another line of works (e.g. CLIP [19] and ALIGN [20]) constructs unimodal encoders
Fig. 2. The overall framework of our approach. Our approach TIPCap is based on a pre-trained CLIP model and a pre-trained GPT-2 model. During training, we first exploit CLIP to extract the CLIP text embedding and project it into the CLIP image embedding space by a mapping module; then we reconstruct the text embedding by a reverse mapping module and inject optional prompt information; finally, GPT-2 generates the description. In the inference stage, we no longer need the mapping module but directly feed the CLIP image embedding into the reverse mapping module and follow-up modules to generate captions.
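To make the data flow in Fig. 2 concrete, the following is a minimal PyTorch-style sketch of the training and inference paths. It is an illustrative outline only: the injected modules and the wrapper interfaces (`mapping`, `reverse_mapping`, `prefix_projector`, and the `gpt2(...)` call signature) are assumptions for this sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class TIPCapSketch(nn.Module):
    """Illustrative outline of the TIPCap pipeline; not the official code."""

    def __init__(self, clip_model, gpt2, mapping, reverse_mapping, prefix_projector):
        super().__init__()
        self.clip = clip_model                    # frozen CLIP encoders
        self.gpt2 = gpt2                          # pre-trained GPT-2 decoder (wrapped)
        self.mapping = mapping                    # text -> image embedding space (training only)
        self.reverse_mapping = reverse_mapping    # back towards the text embedding space
        self.prefix_projector = prefix_projector  # embedding -> GPT-2 prefix

    def forward_train(self, text_tokens, prompt_tokens=None):
        t_e = self.clip.encode_text(text_tokens)   # CLIP text embedding
        t_e_i = self.mapping(t_e)                  # project into image embedding space
        t_e_t = self.reverse_mapping(t_e_i)        # reconstruct a text-space embedding
        prefix = self.prefix_projector(t_e_t)      # prefix for the language model
        # optional prompt information is injected alongside the prefix
        return self.gpt2(prefix=prefix, prompt=prompt_tokens, labels=text_tokens)

    @torch.no_grad()
    def generate(self, image, prompt_tokens=None):
        # inference skips the mapping module: the CLIP image embedding goes
        # directly into the reverse mapping module and follow-up modules
        i_e = self.clip.encode_image(image)
        prefix = self.prefix_projector(self.reverse_mapping(i_e))
        return self.gpt2.generate(prefix=prefix, prompt=prompt_tokens)
```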
Setting 1. We first calculate the embedding difference of human-annotated paired data:

δ ≃ I_e,human − T_e,human    (6)

where I_e,human and T_e,human indicate the CLIP image embedding and the CLIP text embedding, respectively.

As mentioned before, N(μ⃗, Σ) aims to characterize the modality bias. Here T_human is paired with I_human and is also homologous to T_corpus, so we can directly adopt the mean μ⃗_δ and covariance Σ_δ of δ as a tight estimation of μ⃗ and Σ.

Setting 2. Similar to setting 1, we can calculate the embedding difference of web data and use its mean and covariance as an estimation. However, this performs worse because T_web and T_corpus are heterologous. We propose a simple correction to alleviate this issue:

δ ≃ I_e,web − Correct(T_e,web)
  = I_e,web − (T_e,web + δ_w→c)    (7)
  = δ_web − δ_w→c

where δ_web ∼ N(μ⃗_web, Σ_web) and δ_w→c ∼ N(μ⃗_w→c, Σ_w→c); the subscript w→c indicates "web data to corpus data" and aims to achieve a domain alignment from T_web to T_corpus. Since there is no pairwise relationship between T_corpus and T_web, Correct(·) is just a global and rough estimation of the domain alignment.

Setting 3. In this setting, we extend the mapping module to be trainable instead of using pre-defined parameters. Specifically, the covariance can be written as Σ = LL⊤ by Cholesky decomposition, where L is a lower triangular matrix with the same size as Σ. Through the reparameterization trick [48], the noise ϵ ∼ N(μ⃗, Σ) can be re-formulated as follows:

ϵ = Lϵ′ + μ⃗,   ϵ′ ∼ N(0⃗, I)    (8)

Therefore, we can treat the mean μ⃗ and the matrix L as trainable parameters.

The difficulty lies in how to drive the training of μ⃗ and L in the correct direction. For this purpose, we introduce a few source-agnostic image data I_any and calculate the embedding difference:

similar to setting 3, and apply L_Map to optimize the trainable parameters μ⃗ and L. The calculation of L_Map relies on the modality bias between the CLIP image embedding and the CLIP text embedding. However, if we use the bias between T_e^I and T_e to optimize L_Map, the model is prone to falling into trivial solutions (collapse), because the modality bias between the output and the input of the mapping module, i.e. T_e^I and T_e, is also directly sampled from N(μ⃗, Σ). To address this issue, we propose several specific designs.

Firstly, we follow other works and apply several asymmetric designs to enhance robustness and avoid mode collapse. Different from the mapping module, which consists of an addition operation between the input and sampled multivariate Gaussian noise, the reverse mapping module is designed as a FeedForward layer and aims to re-project the output of the mapping module back into the CLIP text embedding space. We then do not use the original text embedding T_e but the reconstructed one T_e^T to optimize L_Map, which can be calculated as follows:

L_Map = KL( L⁻¹(δ_pseudo − μ⃗) ∥ N(0⃗, I) )    (13)
δ_pseudo = T_e^I − T_e^T    (14)

where T_e^I and T_e^T indicate the output of the mapping module and the reverse mapping module, respectively. The reverse mapping module does not perform a strict reconstruction of the CLIP text embedding, which also introduces asymmetry and leads to more robust performance. To ensure the unity of our TIPCap, we also apply the reverse mapping module in the other settings except Setting 4.

Secondly, due to the lack of real image data to introduce the corresponding latent prior information, we employ a relational knowledge distillation loss:

L_Disti = KL( S_{T_e^I} ∥ S_{T_e} )    (15)

where S_{T_e^I} and S_{T_e} indicate the internal cosine similarity matrices of T_e^I and T_e, respectively. L_Disti encourages T_e^I to have an internal cosine similarity structure similar to that of T_e, which provides a constraint ensuring that T_e^I and T_e are semantically related and avoids mode collapse.
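As a concrete illustration of the trainable mapping module and the two losses above, the following is a minimal PyTorch sketch. It assumes a batch-level realization of the KL term in Eq. (13) (fitting a diagonal Gaussian to the whitened pseudo-offsets) and a row-softmax normalization of the similarity matrices in Eq. (15); both choices, like the variable names, are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableMapping(nn.Module):
    """Sketch of the trainable mapping module: adds noise drawn from N(mu, Sigma),
    with Sigma = L L^T parameterized by a lower-triangular L (Eq. 8)."""

    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        # a real implementation would keep the diagonal strictly positive (e.g. softplus)
        self.l_raw = nn.Parameter(0.05 * torch.eye(dim))

    @property
    def L(self):
        return torch.tril(self.l_raw)

    def forward(self, t_e):
        eps = torch.randn_like(t_e)              # eps' ~ N(0, I)
        noise = eps @ self.L.T + self.mu         # eps = L eps' + mu
        return t_e + noise                       # pseudo image embedding T_e^I


def mapping_loss(mapping, t_e_i, t_e_t):
    """One possible realization of L_Map (Eqs. 13-14): whiten the pseudo offsets with
    L^{-1} and push their batch statistics toward N(0, I) via a closed-form Gaussian KL."""
    delta = t_e_i - t_e_t                        # delta_pseudo = T_e^I - T_e^T
    white = torch.linalg.solve_triangular(
        mapping.L, (delta - mapping.mu).T, upper=False).T
    mean = white.mean(dim=0)
    var = white.var(dim=0, unbiased=False) + 1e-6
    return 0.5 * (var + mean ** 2 - 1.0 - var.log()).sum()


def distillation_loss(t_e_i, t_e):
    """Relational distillation L_Disti (Eq. 15): KL between the internal cosine-similarity
    structures of T_e^I and T_e (row-softmax normalization is an assumption)."""
    s_i = F.normalize(t_e_i, dim=-1) @ F.normalize(t_e_i, dim=-1).T
    s_t = F.normalize(t_e, dim=-1) @ F.normalize(t_e, dim=-1).T
    # KL(S_{T_e^I} || S_{T_e}) with rows treated as distributions
    return F.kl_div(F.log_softmax(s_t, dim=-1),
                    F.softmax(s_i, dim=-1), reduction="batchmean")
```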
base captioning model model_base, which aims to endow our model with the ability to generate rough captions for the second-stage training. 2) Then, we initialize model_prompt with the parameters of model_base to perform the second stage of training, which introduces prompts. To avoid complex and expensive manual annotation, we explore a simple strategy to generate user-specified prompts. At each training step, model_base is frozen to generate rough captions ⟨c_r⟩. For the user-specified prompt, we extract a set of nouns or noun phrases {p_i}_{i=1}^N by part-of-speech tagging from the ground-truth captions ⟨c_gt⟩. Furthermore, to preserve positive prompt information, we remove nouns or noun phrases that appear in ⟨c_r⟩ from {p_i}_{i=1}^N to get the filtered set {p′_i}_{i=1}^{N′}.

Taking Fig. 2 as an example, for the round-t generation we have ⟨c_r⟩ = "A man is walking along a road.", {p′_i}_{i=1}^{N′} = {"motorcycle"}, and ⟨c_gt⟩ = "A man riding on the back of a motorcycle down a road." The corresponding full prompt sentence P^t is constructed by the prompt constructor as follows:

Reference: A man is walking along a road.
Prompt: An image contains motorcycle.
Prediction: A man riding on the back of a motorcycle down a road.

Note that we hope our TIPCap keeps the ability to perform the general captioning task (i.e. generate captions without "reference" and "prompt"); thus prompt information is NOT always essential. Furthermore, when the generated caption is good and descriptive enough, it is also not necessary to introduce prompts and perform the second inference.

Based on the above considerations, we replace the prompt sentences with padding tokens with a probability of p = 0.1 when performing the stage-2 training (refer to "Implementation details" in Section IV-A); in addition, we also replace the prompts with padding when the generated caption is the same as the ground-truth caption. Examples of constructed full prompt sentences are shown in Fig. 3.

Fig. 3. Examples of constructed full prompt sentences during stage 2 training.

D. Objectives

Language Modeling Loss. For a mini-batch of n texts {T^i}_{i=1}^n, where T^i = {T^i_1, ..., T^i_L}, we optimize our model by applying Maximum Likelihood Estimation (MLE):

L_MLE = − (1 / (n × L)) Σ_{i=1}^{n} Σ_{j=1}^{L} log p_θ( T^i_j | T^i_{1:j−1} )    (16)

where θ denotes the parameters of our model.

Reverse Mapping Reconstruction Loss. For our reverse mapping module, we define L_Recons to constrain the reconstruction relationship of T_e^T to T_e:

L_Recons = L_Cosine + L_CL    (17)

where L_Cosine and L_CL denote the cosine embedding loss and the contrastive loss [19], respectively, and are defined as follows:

L_Cosine = (1/n) Σ_{i=1}^{n} (1 − S_{i,i}),    L_CL = − (1/2) (L_CL^S + L_CL^{S⊤})    (18)

L_CL^S = (1/n) Σ_{i=1}^{n} log [ exp(S_{i,i}/τ) / Σ_{j=1}^{n} exp(S_{i,j}/τ) ]    (19)

where τ = 0.1 and S ∈ R^{n×n} indicates the cosine similarity matrix between T_e^T and T_e. Intuitively, L_Cosine is introduced for hard reconstruction, while L_CL aims to relax the reconstruction constraint, because we hope that T_e^T has similar semantics to T_e without requiring perfect reconstruction (see the code sketch below).

IV. EXPERIMENTS

A. Experimental Setting

1) Datasets: For a fair performance comparison, we conduct experiments on both the MS-COCO [12] and Flickr30K [13] datasets and follow the "Karpathy" split [50]. Furthermore, YFCC15M [51], which has been used to train CLIP, is adopted as low-quality paired web data for settings 2 and 3.

MS-COCO is a widely used dataset in the image captioning task; it consists of 123,287 images, each paired with 5 human-annotated captions. Flickr30K is another human-annotated dataset similar to MS-COCO, where each image is also paired with 5 reference captions, but it contains only 31,000 images. YFCC15M is a subset of the large-scale web-collected dataset YFCC100M, where each image is annotated with a weakly paired alt-text.

Taking the experiments on MS-COCO as an example, TABLE I shows a comparison of the available data under the four different settings, which cover the majority of real-world scenarios. Specifically, we sample 1% of the paired MS-COCO data for setting 1, about 100K paired YFCC data for setting 2, and 5K YFCC images for setting 3.
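The reconstruction objective of Eqs. (17)-(19) can be written compactly; the sketch below is a minimal PyTorch realization that assumes S is the cosine-similarity matrix between the reconstructed embeddings T_e^T (rows) and the original text embeddings T_e (columns).

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(t_e_t, t_e, tau=0.1):
    """Sketch of L_Recons = L_Cosine + L_CL (Eqs. 17-19) between the reconstructed
    embeddings T_e^T and the original CLIP text embeddings T_e."""
    # pairwise cosine similarity matrix S (n x n)
    s = F.normalize(t_e_t, dim=-1) @ F.normalize(t_e, dim=-1).T

    # L_Cosine: hard reconstruction term, 1 - diagonal similarity (Eq. 18, left)
    l_cosine = (1.0 - s.diag()).mean()

    # L_CL^S as in Eq. 19: row-wise log-softmax evaluated on the diagonal
    def info_nce(sim):
        return torch.diag(F.log_softmax(sim / tau, dim=-1)).mean()

    # L_CL = -1/2 (L_CL^S + L_CL^{S^T}) (Eq. 18, right): symmetric contrastive term
    l_cl = -0.5 * (info_nce(s) + info_nce(s.T))
    return l_cosine + l_cl
```

In stage-1 training this term would presumably be combined with L_MLE (and, for the trainable settings, with L_Map and L_Disti); the exact weighting is not specified here.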
TABLE II
In-domain captioning results on MS-COCO and Flickr30K. The superscript ∗ indicates that results are taken from MAGIC [24]. Weakly or unsupervised approaches use CLIP with different backbones as the encoder. DeCap [26] constructs a lightweight Transformer Decoder (T.D.) instead of a pre-trained GPT-2. CLOSE [27] uses the T5 model [18]. Settings 1-4 are abbreviated as S1-4.
Method | Encoder | Decoder | MS-COCO: B-1 B-4 M R C S | Flickr30K: B-1 B-4 M R C S
Fully Supervised:
BUTD 77.2 36.2 27.0 56.4 113.5 20.3 - 27.3 21.7 - 56.6 16.0
UniVLP - 36.5 28.4 - 116.9 21.2 - 30.1 23.0 - 67.4 17.0
ClipCap - 33.5 27.5 - 113.1 21.1 - - - - - -
Oscar - 36.5 30.3 - 123.7 23.1 - - - - - -
Weakly or Unsupervised:
ZeroCap∗ ViT-B/32 GPT-2 49.8 7.0 15.4 31.8 34.5 9.2 44.7 5.4 11.8 27.3 16.8 6.2
MAGIC ViT-B/32 GPT-2 56.8 12.9 17.4 39.9 49.3 11.3 44.5 6.4 13.1 31.6 20.4 7.1
DeCap ViT-B/32 T.D. - 24.7 25.0 - 91.2 18.7 - 21.2 21.8 - 56.7 15.2
CapDec RN50x4 GPT-2 69.2 26.4 25.1 51.8 91.8 - 55.5 17.7 20.0 43.9 39.1 -
CLOSE ViT-L/14 T5 - 22.1 23.7 - 81.2 17.7 - - - - - -
CLOSE w/Tuned Noise ViT-L/14 T5 - 28.6 25.2 - 95.4 18.1 - - - - - -
TIPCap (S1) RN50x4 GPT-2 74.6 30.7 26.7 54.2 106.7 20.3 71.1 25.6 22.5 49.1 63.7 16.2
TIPCap (S2) RN50x4 GPT-2 72.7 28.6 25.6 52.5 100.6 19.6 68.0 23.7 21.3 47.3 57.8 15.2
TIPCap (S3) RN50x4 GPT-2 73.0 30.4 26.5 53.8 104.5 20.0 68.4 24.2 21.9 48.1 61.2 16.1
TIPCap (S4) RN50x4 GPT-2 71.3 29.8 26.2 53.4 102.1 19.4 67.5 24.0 21.7 47.7 59.4 15.9
TIPCap (S4) ViT-L/14 GPT-2 73.3 31.4 26.9 54.2 106.6 20.2 69.6 26.1 23.0 49.3 65.7 17.0
TABLE III
Cross-domain captioning results. X ⇒ Y means that the model is trained on dataset X but evaluated on dataset Y.
2) Evaluation metrics: Following the common paradigm, we adopt five widely used metrics for evaluation, including BLEU-N [31], METEOR [52], ROUGE-L [53], CIDEr [54], and SPICE [55], which are denoted as B-N, M, R, C, and S for simplicity.

3) Implementation details: Our TIPCap model is implemented in PyTorch [56] and trained on 4 Nvidia Tesla V100 (32GB) GPUs. We first train the model without interactive prompts for 10 epochs, where the learning rate is warmed up to 5e−4 over 1250 steps and then decayed linearly; we then freeze the parameters of the mapping module and the reverse mapping module and train the model with interactive prompts for another 5 epochs with a learning rate of 1e−6 (a sketch of this schedule is given below). For optimization, we set the batch size on each GPU to 32 and adopt the AdamW optimizer [57] with a weight decay of 0.1 in both stages; the beam size is set to 5 during inference. Trained models and source code will be released.

B. Comparison with Existing Models

1) In-domain captioning: TABLE II shows the in-domain captioning results on MS-COCO and Flickr30K. We compare our proposed TIPCap with several approaches at different supervision levels: 1) fully supervised approaches, which rely on human-annotated paired data for model training, including BUTD [3], UniVLP [58], ClipCap [59], and Oscar [46]; and 2) weakly or unsupervised approaches, which adopt pre-trained foundation models (e.g. CLIP [19], GPT-2 [17] and T5 [18]) to perform image captioning on unpaired or text-only data, including ZeroCap [23], MAGIC [24], DeCap [26], CapDec [25] and CLOSE [27]. Our proposed TIPCap belongs to the second category, weakly or unsupervised approaches.

As shown in TABLE II, we report the performance of the four different data settings. As expected, TIPCap (Setting 1) performs better than the other three settings, as it estimates the distribution parameters of the mapping module from high-quality paired data, which introduces strong and credible prior information about the modality bias and can be regarded as an upper bound of our proposed approach. TIPCap (Setting 2) uses weakly paired web data but performs worse, because the alt-text is low-quality and has a large margin with the training text corpus. TIPCap (Setting 3) and TIPCap (Setting 4) have less data available but still achieve superior performance that significantly outperforms other state-of-the-art models, which shows the effectiveness of our trainable mapping module. Notably, our TIPCap uses CLIP with an RN50x4 backbone and a GPT-2 model but outperforms the strong competitor CLOSE, which adopts CLIP with a ViT-L/14 backbone and the T5 model.

Overall, our proposed TIPCap achieves state-of-the-art performance and shows substantial improvement compared with recent weakly or unsupervised methods, demonstrating its effectiveness and advantages.
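For reference, the two-stage optimization described in the implementation details could be set up as in the following sketch. The exact decay endpoint and the set of parameters kept trainable in stage 2 are assumptions; only the quantities stated above (AdamW, weight decay 0.1, 1250 warm-up steps, peak 5e−4, stage-2 learning rate 1e−6) come from the text.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_stage1_optimizer(model, total_steps, warmup_steps=1250, peak_lr=5e-4):
    """Stage 1: AdamW (weight decay 0.1), linear warm-up then linear decay (sketch)."""
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                 # linear warm-up
        remaining = max(0, total_steps - step)
        return remaining / max(1, total_steps - warmup_steps)  # linear decay (assumed to 0)

    return optimizer, LambdaLR(optimizer, lr_lambda)


def build_stage2_optimizer(model, frozen_modules):
    """Stage 2: freeze the mapping and reverse mapping modules, fine-tune the rest
    with a constant learning rate of 1e-6 (sketch)."""
    for module in frozen_modules:
        for p in module.parameters():
            p.requires_grad_(False)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=1e-6, weight_decay=0.1)
```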
Fig. 4. Examples of captions generated by TIPCap with simulated interactive prompts; images come from the MS-COCO Karpathy test split. "Reference" indicates the caption generated without prompt information; "Prompt" indicates the simulated user-specified prompt information; "Prediction" shows the new caption generated with prompt information.
TABLE IV
Performance comparison with simulated interactive prompt information in TIPCap.
Backbone       Setting  B-1        B-4        M          R          C           S
CLIP RN50x4    S1       77.27±.15  32.73±.06  27.57±.06  55.53±.12  113.17±.35  21.97±.06
CLIP RN50x4    S2       75.93±.06  30.77±.12  26.70±.10  54.03±.06  107.93±.06  21.60±.00
CLIP RN50x4    S3       78.23±.06  33.90±.10  28.50±.00  56.33±.06  116.27±.06  24.27±.06
CLIP RN50x4    S4       77.80±.10  33.60±.17  28.30±.00  56.20±.00  115.33±.06  23.73±.06
CLIP ViT-L/14  S1       77.53±.06  33.03±.12  28.00±.10  55.80±.00  115.87±.23  22.40±.00
CLIP ViT-L/14  S2       76.30±.10  31.30±.20  27.03±.06  54.23±.06  110.23±.42  22.03±.06
CLIP ViT-L/14  S3       78.57±.15  34.40±.20  28.40±.00  56.27±.06  116.87±.21  24.10±.00
CLIP ViT-L/14  S4       78.27±.15  33.90±.17  28.37±.06  56.07±.06  116.57±.42  24.13±.06
2) Cross-domain captioning: As shown in TABLE III, we conduct cross-domain experiments to further explore the generalization ability of our approach. Specifically, we train TIPCap on a source dataset (e.g. MS-COCO) but perform inference on a different target dataset (e.g. Flickr30K). We compare TIPCap with several text-only methods, including MAGIC [24], CapDec [25], and DeCap [26]. Our TIPCap still outperforms all compared methods, demonstrating the superiority of our approach in generalization ability.

C. Ablation Studies

1) Impact of Interactive Prompts: Due to the dynamic nature of the prompt information and the lack of relevant benchmarks, it is not easy to give accurate quantitative results when introducing interactive prompts.

To verify the effectiveness of interactive prompts, we perform the evaluation on MS-COCO with simulated user-specified prompt information. Specifically, we first apply TIPCap to generate a caption without prompt information as the rough caption; we then perform a second inference to introduce the prompt information, in which the simulated user-specified prompt is sampled from the nouns or key noun phrases (ignored by the rough caption) extracted from the corresponding ground-truth caption. We perform three evaluations in each setting and report the average performance with the standard deviation, as shown in TABLE IV.
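A minimal sketch of how such simulated prompts can be constructed is given below. It uses NLTK for part-of-speech tagging as one possible tool (the paper does not prescribe a tagger), works on single nouns rather than full noun phrases, and reuses the "An image contains ..." template; the drop probability mirrors the p = 0.1 padding used during stage-2 training.

```python
import random
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def simulate_prompt(rough_caption, gt_caption, drop_prob=0.1, pad="<-pad->"):
    """Sketch: keep ground-truth nouns that the rough caption missed and wrap them
    into a prompt sentence; occasionally return padding instead of a prompt."""
    def nouns(text):
        tokens = nltk.word_tokenize(text.lower())
        return {w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}

    missing = nouns(gt_caption) - nouns(rough_caption)
    # drop the prompt with a small probability so the model keeps its ability
    # to caption without any prompt information
    if not missing or random.random() < drop_prob:
        return pad
    return "An image contains " + ", ".join(sorted(missing)) + "."

# e.g. simulate_prompt("A man is walking along a road.",
#                      "A man riding on the back of a motorcycle down a road.",
#                      drop_prob=0.0)
# -> "An image contains back, motorcycle."   (plain-noun filtering is approximate)
```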
Fig. 5. Histogram visualization of the CLIP image and text embedding difference on the MS-COCO training set, where orange and green indicate the histogram statistics over all dimensions (global) and specific dimensions (local), respectively.
Fig. 6. t-SNE visualization of CLIP image and text embeddings from the MS-COCO training set, where blue and yellow points indicate image and text embeddings, respectively. (a) without the mapping module; (b) mapping module driven by a univariate Gaussian distribution N(µ, σ²), where µ ≃ 0.0009 and σ ≃ 0.0440 are estimated from MS-COCO paired data; (c) mapping module driven by N(µ, σ²), where µ = 0 and σ = 0.016, as adopted in CapDec [25]; (d) mapping module driven by N(µ, σ²), where µ = 0 and σ = 0.08, as adopted in CLOSE [27]; (e) mapping module driven by a multivariate Gaussian distribution N(µ⃗, Σ), where µ⃗ and Σ are estimated from MS-COCO paired data.
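To make the comparison visualized in Fig. 5 and Fig. 6 concrete, a short sketch for estimating the modality-gap statistics from paired CLIP embeddings and drawing the two kinds of noise is shown below. The helper names are illustrative, and a small ridge is added because the empirical covariance may not be positive definite for small sample sizes.

```python
import torch

def estimate_gap_statistics(image_emb, text_emb, ridge=1e-5):
    """Fit mean and full covariance of paired CLIP embedding differences (Setting 1)."""
    delta = image_emb - text_emb                        # (N, d) differences I_e - T_e
    mu = delta.mean(dim=0)
    sigma = torch.cov(delta.T)                          # (d, d) full covariance
    sigma = sigma + ridge * torch.eye(sigma.size(0))    # keep it positive definite
    return mu, sigma

def sample_multivariate(mu, sigma, n):
    # noise from N(mu, Sigma), as used by the proposed mapping module (Fig. 6(e))
    return torch.distributions.MultivariateNormal(mu, covariance_matrix=sigma).sample((n,))

def sample_independent(sigma_scalar, dim, n):
    # independent Gaussian N(0, sigma^2 I), as in CapDec / CLOSE (Fig. 6(c)/(d))
    return sigma_scalar * torch.randn(n, dim)
```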
Fig. 7. Examples of captions generated by TIPCap. The first two images come from the MS-COCO dataset and the last two images come from the Flickr30K dataset. "SS1M to MS-COCO / Flickr30K" indicates that captions are generated by TIPCap trained on SS1M. "MS-COCO" and "Flickr30K" denote TIPCap trained on MS-COCO and Flickr30K, respectively.
The results demonstrate that the introduction of prompt information is applicable and effective. We also conduct experiments using homologous images on the MS-COCO dataset; the results are shown in Table X. Specifically, we adopt the 5K images of the MS-COCO Karpathy validation split, which are homologous to the training corpus but are not paired and do not affect the performance evaluation on the Karpathy test split.

From the results, both YFCC images and MS-COCO images achieve comparable performance under setting 3, which indicates that the requirement for image data is source-agnostic. Combined with the results shown in Table VIII, this indicates that although the available images are few and heterologous with the training text corpus, TIPCap can still achieve competitive performance under setting 3. Compared with setting 4, TIPCap under setting 3 achieves better performance because we introduce prior information from real images for training. Although it is difficult to collect high-quality paired image data for a text corpus in real-world scenarios, the image data-efficiency under setting 3 makes the collection cost of image data affordable.

D. More Generalization Analysis

To further quantitatively evaluate the generalization performance of our proposed TIPCap, we conduct experiments on two new datasets: NoCaps [60] and SS1M [14].

1) Performance on NoCaps dataset: MS-COCO is limited to 80 classes; NoCaps [60] provides a benchmark to measure the open-set capability for novel objects (unseen classes). It provides validation and test sets and is divided into three parts: in-domain contains images portraying only MS-COCO classes; near-domain contains both MS-COCO and novel classes; out-of-domain contains only novel classes. Following DeCap [26], we evaluate our TIPCap using the validation set on the official evaluation server¹. Table XII shows the performance comparison on the MS-COCO Karpathy test split and the NoCaps validation split. Compared to DeCap, TIPCap achieves significant improvement on all metrics.

2) Performance using SS1M dataset: SS1M [14] is a web-collected text corpus, which uses the names of the 80 MS-COCO classes as keywords to crawl image descriptions from Shutterstock², resulting in 2,322,628 distinct image descriptions in total, which is larger than MS-COCO and Flickr30K. Table XI gives a comparison of some text examples between SS1M and MS-COCO.

We explore using the SS1M corpus to train our TIPCap under setting 3, where the image data is sampled from YFCC15M, identical to the experiments conducted on MS-COCO. After training, we evaluate TIPCap on the MS-COCO and Flickr30K test sets; as shown in Table XIII (middle 2 rows), the results indicate that TIPCap can be easily extended to larger datasets and also achieves good zero-shot performance. Fig. 7 shows some examples of captions generated by TIPCap trained on SS1M under setting 3 (see "SS1M to MS-COCO / Flickr30K"), which can correctly describe images, even with more descriptive details.

¹ https://eval.ai/web/challenges/challenge-page/355/overview
TABLE XI
Comparison of text between SS1M and MS-COCO.

TABLE XII
Results on the MS-COCO Karpathy test split and the NoCaps validation split, where "in", "near" and "out" indicate "in-domain", "near-domain" and "out-of-domain", respectively. All models are trained on the MS-COCO text corpus.

TABLE XIII
Cross-domain image captioning performance. We train our TIPCap using the SS1M dataset under setting 3 and setting 1, and perform evaluation on the MS-COCO and Flickr30K datasets.