Cross-Modal Retrieval For Knowledge-Based Visual Question Answering

Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
[email protected]
1 Introduction
The work we present in this article takes place in the context of Multimodal
Information Retrieval, a field at the intersection between Information Retrieval
(IR), Computer Vision, and Machine Learning. More precisely, we focus on
Knowledge-based Visual Question Answering about named Entities (KVQAE),
which has two specificities with regard to multimodal modeling [6,57,4,20]: (i) im-
ages represent named entities; (ii) multimodal interactions are complex and may
be combined as both questions and retrieved passages are (text, image) pairs.
Indeed, KVQAE consists in answering questions about named entities grounded
⋆ We thank the anonymous reviewers for their helpful comments, as well as Antoine Chaffin for fruitful discussions about CLIP and cross-modal retrieval. Paul Lerner did this work during his PhD at LISN. This work was supported by the ANR-19-CE23-0028 MEERQAT project. This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012846 made by GENCI.
[Figure 1: the visual question "How many avenues radiate from this building?" paired with the relevant passage "The Arc de Triomphe is located on the right bank of the Seine at the centre of a dodecagonal configuration of twelve radiating avenues.", annotated with the TqTp, TqIp, IqTp, and IqIp interactions.]
Fig. 1. Two visual questions from the ViQuAE dataset along with relevant visual
passages from its Knowledge Base. The different types of mono- and cross-modal
interactions studied are also shown for the second question. The acronyms of the
interactions are composed of the letters T (Text), I (Image), q (question) and p (passage).
2 Related Work
In this section, we present a review of datasets and methods for KVQAE.
Datasets. KVQA was the first KVQAE dataset, proposed in [43]. Despite its
apparent large size, it has several limitations as pointed out by [31]: (i) only
one entity type is considered, namely person; (ii) it is generated automatically,
and thus has a limited diversity of topics, lexicon, and syntax. Another key
difference from the other datasets is that KVQA was designed for structured KBs,
in particular Wikidata, from which it was generated, rather than for an unstructured
KB as in the following works. To address the limitations of KVQA, ViQuAE was
introduced in [31]. It has fewer visual questions, but they are manually annotated,
and it covers a broad range of topics, lexicon, and syntax, as shown in Table 1.
Above all, ViQuAE comprises a large number of different entity types, including
for example landmarks and organizations in addition to persons. Recently, two
other datasets were proposed, aiming at a larger size than ViQuAE and with less
textual bias: InfoSeek [8] and Encyclopedic-VQA (EVQA [38]). InfoSeek is split
into two subsets according to the annotation method: manual (ISM) or automatic
(ISA). Unfortunately, since neither ISM nor the test set of ISA is available at the
time of writing, we can evaluate our model only on the validation set of ISA. As
its annotation is automatic, it shares some of the caveats of KVQA but covers
more diverse entity types. EVQA alleviates these issues by using more sophisticated
question generation techniques than templates. However, it is sometimes biased
towards text, with questions such as “Which republic celebrated the vendémiaire in the month that the growing season ends for this tree?”, a type of overspecified question that was typically filtered out by the manual annotation in ViQuAE [31].

Table 1. Key features of different KVQAE datasets: ViQuAE [31], InfoSeek [8], Encyclopedic-VQA (EVQA [38]), and KVQA [43]. InfoSeek is split into two subsets according to the annotation method: manual (ISM) or automatic (ISA). *Computed on a subset of 500 questions by [8].
Some key features of these datasets are summarized in Table 1. Question length
is expressed as the number of words produced by spaCy’s English tokenizer. The answer
prior is computed as the most likely answer in the training set, independently of
the question. All datasets are limited to the English language.
3.1 Method
Before being able to extract the answer to the question from a visual passage,
or even retrieve such a passage, we focus here on Entity Retrieval, given the
image of the question iq and a collection of entities (tp , ip ), where tp denotes the
name of the entity and ip its reference image. To do so, we define the following
similarity function, which combines mono- and cross-modal similarities:

    s(iq, tp, ip) = αI sI(iq, ip) + αC sC(iq, tp)    (1)

where the parameters α{I,C} weigh each similarity. We focus on CLIP, a multi-
modal dual encoder, to implement sI (iq , ip ) and sC (iq , tp ), which models the
IqIp and IqTp interactions, respectively (see Figure 1). The objective is thus
to bring the image of the question closer to the image of this entity in the KB
(mono-modal training), or to its name (cross-modal training), or both jointly.
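For illustration, both similarities can be computed as cosine similarities between CLIP embeddings. The sketch below uses the Hugging Face CLIP implementation; the checkpoint name and the example inputs are placeholders rather than the exact setup of this work.

```python
# Sketch: mono-modal similarity sI(iq, ip) and cross-modal similarity sC(iq, tp)
# with an off-the-shelf CLIP dual encoder (placeholder checkpoint and inputs).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

question_image = Image.open("question_image.jpg")   # iq: image of the question
entity_image = Image.open("entity_image.jpg")       # ip: reference image of the entity
entity_name = "Arc de Triomphe"                      # tp: name of the entity

with torch.no_grad():
    image_inputs = processor(images=[question_image, entity_image], return_tensors="pt")
    image_embeds = F.normalize(model.get_image_features(**image_inputs), dim=-1)
    text_inputs = processor(text=[entity_name], return_tensors="pt", padding=True)
    text_embeds = F.normalize(model.get_text_features(**text_inputs), dim=-1)

s_I = (image_embeds[0] @ image_embeds[1]).item()  # IqIp: image-image similarity
s_C = (image_embeds[0] @ text_embeds[0]).item()   # IqTp: image-text similarity
```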
More formally, the objective underlying our IR model is to maximize s(iq, tp, ip)
if the two images iq and ip(+) depict the same entity, named with the textual
form tp(+), and to minimize it otherwise. In such a contrastive approach, the
other entities of a batch, whose textual and visual representations are
respectively noted tp(j) and ip(j), are used as negatives. To implement this
approach, we jointly train sI(iq, ip) and sC(iq, tp) for each image iq of the batch.
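As an illustration, the joint objective with in-batch negatives could be sketched as follows in PyTorch. This reflects our reading of the description above; the function name, the sum of the two losses, and the fixed temperature of 0.01 (i.e., τ′ = 1/100, see footnote 4) are assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def joint_contrastive_loss(q_img, ent_img, ent_txt, temperature=0.01):
    """In-batch contrastive loss over a batch of (iq, ip(+), tp(+)) triples.

    q_img:   (B, d) embeddings of the question images iq
    ent_img: (B, d) embeddings of the reference entity images ip
    ent_txt: (B, d) embeddings of the entity names tp
    The i-th entity is the positive for the i-th question image; the other
    entities of the batch act as negatives.
    """
    q_img = F.normalize(q_img, dim=-1)
    ent_img = F.normalize(ent_img, dim=-1)
    ent_txt = F.normalize(ent_txt, dim=-1)

    logits_mono = q_img @ ent_img.T / temperature    # IqIp similarities
    logits_cross = q_img @ ent_txt.T / temperature   # IqTp similarities
    targets = torch.arange(q_img.size(0), device=q_img.device)

    # Joint training: sum of the mono-modal and cross-modal InfoNCE losses
    return F.cross_entropy(logits_mono, targets) + F.cross_entropy(logits_cross, targets)
```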
3.2 Data
As mentioned in the introduction, our evaluations are performed on the ViQuAE,
ISA, and EVQA datasets. For ViQuAE and ISA, we use the KB proposed in
[31], which consists of 1.5 million Wikipedia articles and images of corresponding
Wikidata entities. Unfortunately, the KB proposed by [8] has yet to be made
available, so our results on ISA are not directly comparable to theirs: 11.5%
of ISA entities are missing from our KB, which reduces the training set
by 28%. In contrast, only a few entities from ViQuAE are missing from the
KB. For EVQA, we use the corresponding KB of [38], which consists of 2 million
Wikipedia articles and corresponding images in WIT [45].
ViQuAE contains 3,700 visual questions about 2,400 different entities, ran-
domly divided into equal-sized sets for training, validation, and testing, with no
overlap between images. As a result, the overlap between entities in the training
and test sets is quite small, only 18%. Likewise, the entity overlap in ISA is
only 20%. Our models must therefore learn to generalize not only to new images but
also to new entities. In contrast, the entity overlap in EVQA is 82%.
3.3 Hyperparameters
We use the ViT-B/32 version of CLIP unless otherwise mentioned. To take full
advantage of the entities associated with the other images in the batch, tp(j)
and ip(j), we use the largest possible batch size, here 3,072 (iq, tp(+), ip(+))
triples, i.e., more than the whole training set of ViQuAE. We use a single NVIDIA
V100 GPU with 32 GB of memory. The large batch size is partly enabled by gradient
checkpointing.
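With the Hugging Face implementation, gradient checkpointing can be enabled as in the minimal sketch below (an illustration, not the full training script); activations are then recomputed during the backward pass, trading compute for memory.

```python
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
# Recompute activations in the backward pass instead of storing them,
# which helps fit a very large batch on a single 32 GB GPU.
model.gradient_checkpointing_enable()
```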
Because the training set of ViQuAE is so small, training is very cheap: our
best model converges, i.e., starts to overfit, after 11 steps/epochs, in less than
3.4 Results
For hybrid retrieval, the weights α{I,C} are set through a grid search over the
validation set to maximize the mean reciprocal rank while constraining their
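As an illustration, the grid search could be sketched as follows. The step size, the sum-to-1 constraint on the weights, and the hypothetical rank_fn helper (assumed to return, for each validation question, the 1-based rank of the relevant entity under the fused score αI sI + αC sC, or None if it is not retrieved) are our assumptions, not the authors' implementation.

```python
import numpy as np

def mrr(ranks):
    """Mean reciprocal rank; `ranks` holds, for each query, the 1-based rank
    of the first relevant result, or None when nothing relevant is retrieved."""
    return float(np.mean([0.0 if r is None else 1.0 / r for r in ranks]))

def grid_search_alphas(rank_fn, validation_queries, step=0.1):
    """Exhaustive search of (alpha_I, alpha_C) over the grid alpha_I + alpha_C = 1."""
    best_alphas, best_mrr = None, -1.0
    for alpha_i in np.arange(0.0, 1.0 + 1e-9, step):
        alphas = (alpha_i, 1.0 - alpha_i)
        score = mrr(rank_fn(alphas, validation_queries))
        if score > best_mrr:
            best_alphas, best_mrr = alphas, score
    return best_alphas, best_mrr
```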
4 We kept the formulation of [41], but the temperature is usually expressed as 1/τ′ and not e^τ, which would be equivalent to τ′ = 1/100 here.
5 https://www.pytorchlightning.ai/
6 The results are consistent with precision and recall at higher cutoffs, which we omit for the sake of space.
[Figure 2 examples: (i) the question "This mountain is part of which European mountain range?", with the passage "Cairn Gorm is a mountain in the Scottish Highlands. It is part of the Cairngorms range and wider Grampian Mountains." and the entity Pilot Rock (Oregon); (ii) the question "In what country is this skyscraper?", with the passage "Jeddah Tower is a skyscraper construction project which is currently on hold. Located on the north side of Jeddah, Saudi Arabia [...]" and the entity Nakheel Tower.]
Fig. 2. Strengths and weaknesses of mono- and cross-modal retrieval exemplified through
CLIP results (not fine-tuned) on ViQuAE’s validation set.
Table 2. Entity Retrieval with a multimodal dual encoder, CLIP, on the validation
subsets of ViQuAE, InfoSeek-Automatic (ISA), and EVQA (single-hop). Mono- and
cross-modal retrieval model the IqIp and IqTp interactions, respectively. The best
results are marked in bold for each type of retrieval. Hybrid retrieval with disjoint
training combines mono-modal retrieval trained mono-modally with cross-modal
retrieval trained cross-modally.
white photography, subject pose...). Here, the two photographs at the top of
two mountains, showing the horizon, are judged to be similar even though they
are different mountains. In contrast, the mono-modal retrieval is more effective in
the second example, where the two photographs of the Jeddah Tower are taken
from similar vantage points. These qualitative results support our hypothesis
that cross-modal retrieval might help address the heterogeneity of visual
representations of named entities.
Why choose? We show that mono- and cross-modal retrievals are complemen-
tary: their results can be simply combined at the score level (as in Equation 1).
Thus, without fine-tuning (first lines of each block in Table 2), fusing the two
retrievals brings a relative improvement of 32% in P@1 for ViQuAE (and 15%
for ISA, 23% for EVQA) compared to the best single retrieval (significant with
Fisher’s p ≤ 0.01). It would be interesting to study whether these results gen-
eralize to other tasks. For example, this method could benefit Content-based
Image Retrieval in a Web browsing context. Overall, hybrid retrieval gives the
best performance on all three datasets.
4.1 Methods

For passage retrieval, we extend the similarity function of Equation 1 with a textual similarity:

    s(tq, iq, tp, ip) = αT sT(tq, tp) + αI sI(iq, ip) + αC sC(iq, tp)    (4)

where sT(tq, tp) models the TqTp interaction between the text of the question
and of the passage, and is implemented with DPR. We denote this model DPRV+T,
as it combines DPR, CLIPV, and CLIPT, or DPRV+T (in bold font) when
CLIP is fine-tuned.11 The weights α{T,I,C} are set through a grid search on the
validation set, as in the previous section (see Figure 3 for an illustration of the
impact of these hyperparameters on MRR). DPR is a dual encoder model that
combines two BERT encoders, one for the question and one for the passage [28].
Answers are then extracted from these passages using Multi-passage BERT
[49], which also models the TqTp interaction.
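For illustration, the TqTp relevance score with DPR's dual encoder could be computed as in the sketch below. The NQ checkpoints are placeholders (the models used in this work are pre-trained on a filtered TriviaQA, as described below), and the example question and passage are taken from Figure 1.

```python
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
p_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = "How many avenues radiate from this building?"  # tq
passage = ("The Arc de Triomphe is located on the right bank of the Seine at "
           "the centre of a dodecagonal configuration of twelve radiating avenues.")  # tp

with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
    p_emb = p_encoder(**p_tokenizer(passage, return_tensors="pt")).pooler_output

s_T = (q_emb @ p_emb.T).item()  # TqTp: dot-product relevance score
```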
The 1.5 (resp. 2) million articles of the KB of ViQuAE [31] (resp. EVQA [38]) are
divided into 12 (resp. 27) million 100-word passages, while preserving sentence
boundaries, as in [31].
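A possible sketch of this splitting is given below; the spaCy model and the greedy sentence-packing strategy are assumptions and may differ from the exact preprocessing of [31].

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with sentence segmentation

def split_into_passages(article_text, max_words=100):
    """Greedily pack whole sentences into passages of at most ~100 words,
    so that sentence boundaries are preserved (an overlong sentence forms
    its own passage)."""
    passages, current, n_words = [], [], 0
    for sent in nlp(article_text).sents:
        sent_words = len(sent.text.split())
        if current and n_words + sent_words > max_words:
            passages.append(" ".join(current))
            current, n_words = [], 0
        current.append(sent.text.strip())
        n_words += sent_words
    if current:
        passages.append(" ".join(current))
    return passages
```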
Both DPR and Multi-passage BERT are pre-trained on TriviaQA, from which all
questions used in [31] to generate ViQuAE are filtered out,12 before being fine-tuned
on the downstream KVQAE dataset, following [31]. Both models are built upon
the uncased version of BERT-base [13]. We refer the reader to [31] for further
implementation details.
4.3 Baselines
We compare our approach to the DPRV+R+A model of [31], which combines
DPR, CLIPV, ArcFace, and an ImageNet-trained ResNet model. The results of
the four models are combined in the same way as in Equation 4, where DPR
11 DPR is always fine-tuned as described in the next section.
12 https://huggingface.co/datasets/PaulLerner/triviaqa_for_viquae
[Figure 3: MRR (ranging from about 0.15 to 0.35) plotted as a function of the weights αT, αI, and αC, each varying between 0 and 1.]
Fig. 3. Passage-level MRR on the validation set of ViQuAE depending on the α{T,I,C}
hyperparameters.
Note that [31,30] use CLIPV with the ResNet architecture while we use ViT [14]
in most of our experiments (but compare the two in the next section and find no
significant difference).
Moving away from the Retrieval+Extraction framework of [31], we compare
our results to [8,38], who both use the PaLM LLM [9], either as is or augmented
with the image caption and in-context learning examples (denoted PromptCap
[23]). [8] also experiment with FiD [25], augmented with CLIP retrieval results.
4.4 Results
Metrics. Extracted answers are evaluated using Exact Match (EM) and token-
level F1 score on ViQuAE following [31], using the soft matching score defined
by [8] on ISA (see Section 3.4), and using both F1 and BEM [7] on EVQA. The
results for these three benchmarks are reported in Table 3.
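For reference, the sketch below shows the usual SQuAD-style computation of EM and token-level F1; the normalization details are the standard ones and may differ slightly from the exact evaluation scripts used for each benchmark.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```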
Table 3. Reading Comprehension results on the test set of ViQuAE, the validation
set of ISA, and the single-hop test questions of EVQA. As in [28], the reader takes
as input the top-24 passages retrieved by the different IR systems listed in the “Method” column (except for
the methods of [8,38]). The results of [8], in gray, are provided as a reference but use a
different, not yet available, smaller KB, which perfectly covers ISA. *CLIP is based on
ViT’s architecture instead of ResNet. †Our re-implementation of the reader, which fixes
the loss function.
5 Conclusion
This paper studies cross-modal retrieval and its combination with mono-modal
retrieval for Knowledge-based Visual Question Answering about named Entities
(KVQAE). Retrieval is carried out with a multimodal dual encoder, namely CLIP.
Our results demonstrate the superiority of cross-modal retrieval over mono-modal
retrieval, but also the complementarity of the two, which can be easily combined.
We argue that cross-modal retrieval may help address the heterogeneity of
visual representations of named entities, in line with prior work. It would be
interesting to study whether these results generalize to other tasks. For example,
this method could benefit Content-based Image Retrieval, in a Web browsing
context.
Although the abundance of cross-modal data is what enabled CLIP’s training
in the first place, which would have been difficult with mono-modal annotation,
it also limits our results: it is difficult to control such a large
amount of data and thus to estimate CLIP’s generalization capabilities. We
hypothesize that mono-modal retrieval is better suited to generalize to new
entities.
We show that the effectiveness of cross-modal retrieval leads to more accurate
answers, on all three studied datasets. Therefore, our method outperforms our
baseline (mono-modal retrieval) but also the methods of [31,30], while being
conceptually simpler and computationally cheaper. Furthermore, it is competitive
with billion-parameter models on ISA and EVQA. As such, this is the
first comparative study of the recently introduced ViQuAE, ISA, and EVQA
datasets. We find that ISA is more challenging as it is less biased towards text,
but advocate for further studies on all three datasets — which all have their pros
and cons — with diverse methods.
Consistent with [31,8], we find a large gap between our best retrieval model
and oracle retrieval, showing that entity retrieval is the main bottleneck of
KVQAE. For future work, we plan to combine our unstructured KB with a
structured one, such as Wikidata, to enable the modeling of links between the
entities [54,40,52,2], which would further address the heterogeneity of their visual
representations. From a more IR-oriented perspective, KVQAE could be cast as a
query expansion problem, with an initial ambiguous textual query that would
benefit from pseudo-relevance feedback [55].
References
1. Adjali, O., Grimal, P., Ferret, O., Ghannay, S., Le Borgne, H.: Explicit knowledge
integration for knowledge-aware visual question answering about named entities. In:
Proceedings of the 2023 ACM International Conference on Multimedia Retrieval.
p. 29–38. ICMR ’23, Association for Computing Machinery, New York, NY, USA
(2023). https://doi.org/10.1145/3591106.3592227, https://doi.org/10.1145/
3591106.3592227
2. Alberts, H., Huang, N., Deshpande, Y., Liu, Y., Cho, K., Vania, C., Cal-
ixto, I.: VisualSem: a high-quality knowledge graph for vision and language.
In: Proceedings of the 1st Workshop on Multilingual Representation Learn-
ing. pp. 138–152. Association for Computational Linguistics, Punta Cana, Do-
minican Republic (Nov 2021). https://doi.org/10.18653/v1/2021.mrl-1.13,
https://aclanthology.org/2021.mrl-1.13
3. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L.,
Parikh, D.: VQA: Visual Question Answering. In: 2015 IEEE Interna-
tional Conference on Computer Vision (ICCV). pp. 2425–2433. IEEE, San-
tiago, Chile (Dec 2015). https://doi.org/10.1109/ICCV.2015.279, http://
ieeexplore.ieee.org/document/7410636/
4. Baltrušaitis, T., Ahuja, C., Morency, L.P.: Multimodal Machine Learning: A Survey
and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence
41(2), 423–443 (Feb 2019). https://doi.org/10.1109/TPAMI.2018.2798607, con-
ference Name: IEEE Transactions on Pattern Analysis and Machine Intelligence
5. Bassani, E.: ranx: A Blazing-Fast Python Library for Ranking Evaluation and Com-
parison. In: Hagen, M., Verberne, S., Macdonald, C., Seifert, C., Balog, K., Nørvåg,
K., Setty, V. (eds.) Advances in Information Retrieval. pp. 259–264. Lecture
Notes in Computer Science, Springer International Publishing, Cham (2022).
https://doi.org/10.1007/978-3-030-99739-7_30
6. Bokhari, M.U., Hasan, F.: Multimodal information retrieval: Challenges and future
trends. International Journal of Computer Applications 74(14) (2013), publisher:
Foundation of Computer Science
7. Bulian, J., Buck, C., Gajewski, W., Börschinger, B., Schuster, T.: Tomayto, tom-
ahto. beyond token-level answer equivalence for question answering evaluation. In:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Pro-
cessing. pp. 291–305. Association for Computational Linguistics, Abu Dhabi, United
Arab Emirates (Dec 2022). https://doi.org/10.18653/v1/2022.emnlp-main.20,
https://aclanthology.org/2022.emnlp-main.20
8. Chen, Y., Hu, H., Luan, Y., Sun, H., Changpinyo, S., Ritter, A., Chang, M.W.:
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking
Questions? (Feb 2023). https://doi.org/10.48550/arXiv.2302.11713, http://
arxiv.org/abs/2302.11713, arXiv:2302.11713 [cs]
9. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham,
P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling
with pathways. Journal of Machine Learning Research 24(240), 1–113 (2023)
10. Couairon, G., Douze, M., Cord, M., Schwenk, H.: Embedding arithmetic of multi-
modal queries for image retrieval. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 4950–4958
(June 2022)
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale
hierarchical image database. In: 2009 IEEE Conference on Computer Vision and
In: Proceedings of the 60th Annual Meeting of the Association for Computa-
tional Linguistics (Volume 1: Long Papers). pp. 373–390. Association for Compu-
tational Linguistics, Dublin, Ireland (May 2022). https://doi.org/10.18653/v1/
2022.acl-long.29, https://aclanthology.org/2022.acl-long.29
23. Hu, Y., Hua, H., Yang, Z., Shi, W., Smith, N.A., Luo, J.: Promptcap: Prompt-guided
task-aware image captioning (2023)
24. Hu, Z., Iscen, A., Sun, C., Chang, K.W., Sun, Y., Ross, D.A., Schmid, C., Fathi,
A.: AVIS: Autonomous Visual Information Seeking with Large Language Models
(Jun 2023), http://arxiv.org/abs/2306.08129, arXiv:2306.08129 [cs]
25. Izacard, G., Grave, E.: Leveraging Passage Retrieval with Generative Mod-
els for Open Domain Question Answering. In: Proceedings of the 16th Con-
ference of the European Chapter of the Association for Computational Lin-
guistics: Main Volume. pp. 874–880. Association for Computational Linguistics,
Online (Apr 2021). https://doi.org/10.18653/v1/2021.eacl-main.74, https:
//aclanthology.org/2021.eacl-main.74
26. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y.J., Madotto, A.,
Fung, P.: Survey of Hallucination in Natural Language Generation. ACM Computing
Surveys 55(12), 248:1–248:38 (Mar 2023). https://doi.org/10.1145/3571730,
https://dl.acm.org/doi/10.1145/3571730
27. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE
Transactions on Big Data 7(3), 535–547 (2019). https://doi.org/10.1109/TBDATA.
2019.2921572
28. Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih,
W.t.: Dense passage retrieval for open-domain question answering. In: Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP). pp. 6769–6781. Association for Computational Linguistics, Online (Nov
2020), https://www.aclweb.org/anthology/2020.emnlp-main.550
29. Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers
in vision: A survey. ACM Comput. Surv. 54(10s) (sep 2022). https://doi.org/10.
1145/3505244, https://doi.org/10.1145/3505244
30. Lerner, P., Ferret, O., Guinaudeau, C.: Multimodal inverse cloze task for knowledge-
based visual question answering. In: Advances in Information Retrieval (ECIR
2023). pp. 569–587. Springer Nature Switzerland, Cham (2023). https://doi.org/
10.1007/978-3-031-28244-7_36
31. Lerner, P., Ferret, O., Guinaudeau, C., Le Borgne, H., Besançon, R., Moreno, J.G.,
Lovón Melgarejo, J.: ViQuAE, a dataset for knowledge-based visual question answer-
ing about named entities. In: Proceedings of The 45th International ACM SIGIR
Conference on Research and Development in Information Retrieval. SIGIR ’22, Asso-
ciation for Computing Machinery, New York, NY, USA (2022). https://doi.org/
10.1145/3477495.3531753, https://hal.archives-ouvertes.fr/hal-03650618
32. Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P.,
Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M.,
Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N.,
McMillan-Major, A., Schmid, P., Gugger, S., Delangue, C., Matussière, T., Debut,
L., Bekman, S., Cistac, P., Goehringer, T., Mustar, V., Lagunas, F., Rush, A.,
Wolf, T.: Datasets: A Community Library for Natural Language Processing. In:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations. pp. 175–184. Association for Computational
Linguistics, Online and Punta Cana, Dominican Republic (Nov 2021), https:
//aclanthology.org/2021.emnlp-demo.21
33. Li, L., Yin, Y., Li, S., Chen, L., Wang, P., Ren, S., Li, M., Yang, Y., Xu, J., Sun, X.,
Kong, L., Liu, Q.: M3 IT: A Large-Scale Dataset towards Multi-Modal Multilingual
Instruction Tuning (Jun 2023). https://doi.org/10.48550/arXiv.2306.04387,
http://arxiv.org/abs/2306.04387, arXiv:2306.04387 [cs]
34. Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text
summarization branches out. pp. 74–81 (2004)
35. Liu, Z., Xiong, C., Lv, Y., Liu, Z., Yu, G.: Universal vision-language dense retrieval:
Learning a unified representation space for multi-modal retrieval. In: The Eleventh
International Conference on Learning Representations (2023), https://openreview.
net/forum?id=PQOlkgsBsik
36. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International
Conference on Learning Representations (2019), https://openreview.net/forum?
id=Bkg6RiCqY7
37. Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: A visual question
answering benchmark requiring external knowledge. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 3195–3204 (2019),
https://ieeexplore.ieee.org/document/8953725/
38. Mensink, T., Uijlings, J., Castrejon, L., Goel, A., Cadar, F., Zhou, H., Sha, F.,
Araujo, A., Ferrari, V.: Encyclopedic vqa: Visual questions about detailed prop-
erties of fine-grained categories. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV). pp. 3113–3124 (October 2023)
39. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T.,
Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z.,
Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.:
PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances
in Neural Information Processing Systems 32 (2019), https://papers.nips.cc/
paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
40. Pezeshkpour, P., Chen, L., Singh, S.: Embedding Multimodal Relational Data for
Knowledge Base Completion. In: Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing. pp. 3208–3218 (2018)
41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International Conference on Machine Learning.
pp. 8748–8763. PMLR (2021)
42. Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-okvqa: A
benchmark for visual question answering using world knowledge. In: European
Conference on Computer Vision. pp. 146–162. Springer (2022)
43. Shah, S., Mishra, A., Yadati, N., Talukdar, P.P.: KVQA: Knowledge-Aware Visual
Question Answering. In: Proceedings of the AAAI Conference on Artificial Intel-
ligence. vol. 33, pp. 8876–8884 (2019), https://144.208.67.177/ojs/index.php/
AAAI/article/view/4915
44. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance
tests for information retrieval evaluation. In: Proceedings of the sixteenth ACM
conference on Conference on information and knowledge management. pp. 623–
632. CIKM ’07, Association for Computing Machinery, New York, NY, USA (Nov
2007). https://doi.org/10.1145/1321440.1321528, https://doi.org/10.1145/
1321440.1321528
45. Srinivasan, K., Raman, K., Chen, J., Bendersky, M., Najork, M.: Wit: Wikipedia-
based image text dataset for multimodal multilingual machine learning. In: Proceed-
ings of the 44th International ACM SIGIR Conference on Research and Development