Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

Chang Che, Qunwei Lin, Xinyu Zhao, Jiaxin Huang, Liqiang Yu

arXiv:2401.06167v1 [cs.CV], 2 Jan 2024



Chang Che, The George Washington University, Atlanta, Georgia, USA
Qunwei Lin, Trine University, Phoenix, Arizona, USA
Xinyu Zhao, Trine University, Phoenix, Arizona, USA
Jiaxin Huang, Trine University, Phoenix, Arizona, USA
Liqiang Yu, The University of Chicago, Irvine, California, USA

ABSTRACT

The process of transforming input images into corresponding textual explanations stands as a crucial and complex endeavor within the domains of computer vision and natural language processing. In this paper, we propose an innovative ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining (CLIP) models. Our ensemble framework encompasses two significant variations of the CLIP model, each designed to cater to specific nuances within the image-to-text transformation landscape. The first model introduces an elaborated architecture, featuring multiple layers with distinct learning rates, thereby amplifying its adeptness at capturing intricate relationships between images and text. The second model strategically exploits CLIP's inherent zero-shot learning potential to generate image-text embeddings, which are subsequently harnessed by a K-Nearest Neighbors (KNN) model. Through this KNN-based paradigm, the model facilitates image-to-text transformation by identifying closely related embeddings within the embedding space. Our ensemble approach is rigorously evaluated using the cosine similarity metric to gauge the alignment between model-generated embeddings and ground-truth representations. Comparative experiments highlight the superiority of our ensemble strategy over standalone CLIP models. This study not only advances the state of the art in image-to-text transformation but also underscores the promising trajectory of ensemble learning in effectively addressing intricate multimodal tasks.

CCS CONCEPTS

• Computing methodologies → Information extraction; Ensemble methods; Computer vision representations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICBDT 2023, September 22–24, 2023, Qingdao, China
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/3627377.3627442

KEYWORDS

Image-to-Text Transformation, CLIP, Ensemble Learning, Elaborated Architecture, Zero-Shot Learning, K-Nearest Neighbors, Multimodal Alignment

ACM Reference Format:
Chang Che, Qunwei Lin, Xinyu Zhao, Jiaxin Huang, and Liqiang Yu. 2023. Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation. In 2023 6th International Conference on Big Data Technologies (ICBDT 2023), September 22–24, 2023, Qingdao, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3627377.3627442

1 INTRODUCTION

In the dynamic landscape of computer vision (CV) and natural language processing (NLP), the conversion of visual input into coherent textual descriptions has emerged as a fundamental yet intricate challenge. Bridging the gap between these two modalities holds profound potential across diverse domains, from assisting visually impaired individuals to enhancing the autonomy of machines. This article addresses the complex task of image-to-text transformation by introducing a novel ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining (CLIP) models.

While CLIP models have demonstrated exceptional prowess in aligning text and images, this work propels the field forward by presenting an ensemble strategy that leverages their collective strengths. We introduce two distinct adaptations of the CLIP model, each tailored to a different facet of the image-to-text transformation endeavor. The first adaptation introduces a multi-layered architecture with differential learning rates, amplifying the model's power to capture intricate and nuanced image-text relationships. Complementing this, the second adaptation exploits CLIP's inherent zero-shot learning capacity to generate contextual embeddings for both images and text. These embeddings feed into a K-Nearest Neighbors (KNN) [9] model, a strategic choice that shapes the entire image-to-text transformation process.

The significance of this study extends across numerous real-world scenarios, ranging from generating image captions to facilitating content-based image retrieval. By bridging the semantic gap between images and text, our ensemble approach contributes to the interpretability and accessibility of visual data.

In subsequent sections, we examine the mechanics of our ensemble approach: we describe its architecture, explain the data preprocessing methodology, define the evaluation metrics, and present experimental results. Through evaluation and comparative analysis, we substantiate the advantages of our approach over both standalone CLIP models and traditional methods. As we navigate the complex landscape of image-to-text transformation, our study not only advances the boundaries of current capabilities but also underscores the potential of ensemble learning to unravel the complexities inherent in multimodal tasks.

2 RELATED WORK

Within the domain of converting images to text, extensive research has been undertaken to improve the caliber and pertinence of generated captions. This section provides an overview of key advancements and approaches that have contributed to the development of our proposed ensemble model for image captioning.

J. Devlin et al. [3] introduce BERT, a transformer-based language model pre-trained on extensive text data. BERT's bidirectional context comprehension and contextual embeddings have propelled it to state-of-the-art performance across diverse NLP tasks, including image captioning, significantly enhancing language understanding for such tasks. A. Karpathy et al. [7] propose an attention-based model that aligns visual and semantic spaces, producing detailed image descriptions by focusing on relevant regions of the image. By allowing the model to attend to specific image regions while generating each word, the approach improves the relevance and contextual understanding of generated captions. Q. Wu et al. (2016) explore the role of high-level concepts in bridging vision and language, specifically their impact on image captioning. By incorporating high-level semantic concepts, the model gains a better understanding of image content, leading to caption quality that is more aligned with human perception. J. Gu et al. [5] propose a stack-captioning model that employs a coarse-to-fine approach to generate captions, demonstrating improved performance by iteratively refining the captioning process. Utilizing a hierarchy of captioning modules, this approach enables the model to capture both global and local details, yielding captions that are more contextually enriched and coherent. S. J. Rennie et al. [12] introduce self-critical sequence training, a reinforcement learning approach for image captioning that improves caption quality by directly optimizing caption-level evaluation metrics. By using a reinforcement learning framework to fine-tune the captioning model, this approach encourages the generation of captions that score higher according to the chosen evaluation metric. P. Sharma et al. [13] present the Conceptual Captions dataset, enhancing image caption quality and model training through improved annotations and hypernymic captions. J. Johnson et al. [6] introduce DenseCap, a model merging convolutional and recurrent networks for dense image captioning. It generates captions for multiple image regions, enhancing description detail by covering various objects and regions within images. D. Elliott et al. [4] introduce the Multi30K dataset, a multilingual extension of the English-German image description dataset, facilitating cross-lingual research in image captioning. This dataset enables the evaluation and development of captioning models for multiple languages, contributing to the advancement of multilingual captioning. R. Vedantam et al. [14] propose the CIDEr metric, a consensus-based evaluation measure for image captioning that considers consensus among multiple reference captions, addressing limitations of previous metrics. CIDEr accounts for multiple valid caption variations and provides a more robust and comprehensive evaluation of caption quality. Utilizing the Twins-PCPVT model, Weinan Dai et al. [1] convert fundus images into embeddings, enhancing the efficiency and accuracy of diabetic retinopathy detection. A. Radford et al. [11] showed that Generative Adversarial Networks (GANs) have potential for generating images from text descriptions; however, the unidirectional nature of the approach restricted its applicability to image-to-text tasks. Saad M. Darwish et al. [2] investigate the use of Type-2 Fuzzy Logic as a means to bridge the semantic gap in Content-Based Image Retrieval (CBIR) systems. Their findings highlight the potential of this approach to enhance image retrieval accuracy by addressing inherent uncertainties in semantic interpretation. J. Lu et al. [8] showcased advancements in vision-and-language tasks by jointly pretraining on multiple vision and language tasks; however, the focus was broader, and fine-grained image-to-text transformation remained a challenge.

In the subsequent sections, we present our ensemble approach, covering its architecture, data preprocessing strategy, evaluation metrics, and empirical findings. Through comprehensive analysis, we highlight the distinct advantages of our approach, thereby contributing to the broader landscape of image-to-text transformation.

3 ALGORITHM AND MODEL

Our ensemble learning approach harnesses two intricately designed models, both built upon the foundation of the CLIP framework. Together, they amplify the image-to-text transformation process by drawing on CLIP's exceptional capabilities.

3.1 CLIP Model

The CLIP model, which incorporates elements of the Vision Transformer (ViT), forms the foundation of our approach, providing a powerful framework for processing both images and text and generating embeddings in a shared semantic space. The architecture of the CLIP model, which includes ViT, is illustrated in Figure 1 [10].

Figure 1: Architecture of the CLIP Model

The CLIP model pairs a vision encoder and a language encoder that project into a shared embedding space. Leveraging the capabilities of ViT, the model employs self-attention mechanisms to capture intricate details within images. This integration allows CLIP to effectively process and represent visual information. Given an input image and text, the model projects them into a common space where the similarity between their embeddings reflects their semantic correspondence. This inherent ability of CLIP, enhanced by ViT, forms the basis for our subsequent enhancements.

The Vision Transformer (ViT) [16] is a crucial component integrated into the CLIP model. ViT is an image processing model that utilizes self-attention mechanisms to analyze and capture the relationships between different parts of an image. By incorporating ViT, the CLIP model gains the ability to understand images at a more granular level, enabling it to extract meaningful visual features and representations. This integration not only enhances the model's ability to process images but also contributes to the generation of semantically meaningful embeddings.

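As an illustration of this shared embedding space, the sketch below encodes one image and a few candidate captions with a pre-trained CLIP checkpoint and ranks the captions by cosine similarity. It is a minimal example assuming the Hugging Face transformers implementation and the openai/clip-vit-base-patch32 checkpoint; the paper does not specify which checkpoint or library was used.

```python
# Minimal sketch: CLIP image/text embeddings in a shared space, ranked by cosine similarity.
# Library and checkpoint are assumptions, not the authors' exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = ["a dog playing in the park", "a plate of food", "a city skyline at night"]

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)   # shape (1, 512)
    text_emb = model.get_text_features(**text_inputs)      # shape (3, 512)

# Normalize so the dot product equals cosine similarity.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = image_emb @ text_emb.T          # (1, 3) cosine similarities
best_caption = captions[similarity.argmax().item()]
print(similarity, best_caption)
```
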
3.2 Model A - Enhanced Image-to-Text Transformation

To optimize the image-to-text transformation process, we introduce Model A, which extends the CLIP model with additional neural network layers. Model A is designed to refine the visual embedding of an input image and generate an enriched text embedding. Computation within Model A begins by passing the input image through the pre-trained CLIP vision model, producing a high-dimensional visual embedding denoted Image_Emb. This visual representation captures the salient features of the image in the context of semantic understanding.

Figure 2: Architecture of Model A - Enhanced Image-to-Text Transformation

To enhance the visual embedding, Model A introduces two additional fully connected layers, FC1 and FC2, followed by layer normalization operations Norm1 and Norm2. These layers iteratively refine the feature representation, extracting intricate relationships inherent in the image data. The output of FC1 is computed as

FC1_Out = FC1(Image_Emb)    (1)

and is then normalized using layer normalization to yield

Norm1_Out = Norm1(FC1_Out)    (2)

Similarly, the FC2 and Norm2 operations further refine the features, resulting in Norm2_Out. To produce the enriched text embedding, Model A employs a weighted combination of Norm1_Out and Norm2_Out with a weighting coefficient α:

Final_Text_Emb = α · Norm1_Out + (1 − α) · Norm2_Out    (3)

This fusion mechanism lets Model A adaptively integrate the refined visual features, yielding an enriched text embedding that encapsulates both the original visual content and its semantic nuances.

To facilitate effective training, we employ a differential learning rate strategy. The pre-trained CLIP vision model is fine-tuned with a relatively small learning rate lr_vision, while the newly introduced fully connected layers FC1 and FC2, along with the layer normalization operations Norm1 and Norm2, use a larger learning rate lr_fc. This keeps parameter updates balanced between the existing pre-trained model and the freshly incorporated layers: the pre-trained CLIP model, having been optimized on an extensive dataset, demands only nuanced adjustments, while the newly added layers require more substantial updates owing to their random initialization.

3.3 Model B - Zero-Shot Learning and KNN-based Fusion

Model B capitalizes on the inherent zero-shot learning [15] capabilities of the CLIP model, extending its applicability to image-to-text transformation tasks. The architecture of Model B integrates image embeddings with their corresponding text embeddings through a K-Nearest Neighbors (KNN)-based fusion approach. This enables seamless cross-modal interaction, leveraging the rich semantic understanding of CLIP for enhanced image-to-text transformation.

The first step of Model B [10] uses the CLIP model to generate both image embeddings and text embeddings for a diverse set of images and their corresponding textual descriptions. This forms the basis for zero-shot learning, enabling Model B to generalize to unseen image-text pairs at test time.

Figure 3: Zero-Shot Learning Architecture in Model B

For each image embedding, Model B employs a KNN model to retrieve a set of nearest-neighbor text embeddings. The KNN-based approach introduces a distance-based weighting mechanism that captures the relevance and contextual significance of each neighbor. The weight of the i-th neighbor is calculated as

Weight(i) = (1 / (Distance(i)^Distance_Dim + ε)) × Coef    (4)

where Distance(i) is the Euclidean distance between the query image embedding and the i-th neighbor text embedding, Distance_Dim controls the influence of distance, ε is a small positive constant that ensures numerical stability, and Coef is a coefficient that prevents overflow.

The weighted contributions from all neighbors are then aggregated to form the final text embedding. Given an image embedding Image_Emb, the KNN-based fusion mechanism generates the text embedding Text_Emb as

Text_Emb = (1/K) · Σ_{i=1..K} Weight(i) × KNN_Text_Emb_i    (5)

where K is the number of nearest neighbors and KNN_Text_Emb_i is the i-th nearest-neighbor text embedding.

In summary, Model B extends the zero-shot learning capabilities of the CLIP model to image-to-text transformation. The KNN-based fusion approach combines image and text embeddings, with distance-based weighting to capture contextual relevance, resulting in an enriched text representation that reflects the underlying semantics.

Figure 4: K-Nearest Neighbors Fusion in Model B

3.4 Model Ensemble

The ensemble leverages the strengths of each constituent model, yielding a powerful image-to-text transformation framework. The integration process merges the semantically rich embeddings from the trained CLIP model (Model A) with the contextual embeddings produced by the CLIP KNN-regression model (Model B). This fusion combines CLIP's interpretive capabilities with the nuanced contextual understanding of the KNN regression, creating a unified ensemble embedding. The ensemble embedding is calculated as

Ens_Emb = β × A_Emb + (1 − β) × B_Emb    (6)

where β is the adjusted weighting coefficient.

The model ensemble embodies the symbiotic relationship between Model A and Model B, resulting in a dynamic and versatile image-to-text transformation framework. The subsequent sections present the empirical evaluation and findings, substantiating the effectiveness and potential of our ensemble approach.

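A minimal sketch of the blend in Eq. (6) follows. It assumes the two component embeddings are first L2-normalized so the weighting acts on comparable scales, and β = 0.6 is an illustrative value rather than the paper's tuned coefficient.

```python
# Sketch of the ensemble fusion of Eq. (6); normalization and beta are assumptions.
import numpy as np

def ensemble_embedding(a_emb, b_emb, beta=0.6):
    a = a_emb / np.linalg.norm(a_emb)   # Model A embedding, normalized
    b = b_emb / np.linalg.norm(b_emb)   # Model B embedding, normalized
    return beta * a + (1 - beta) * b    # Ens_Emb
```
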
3.5 Data Preprocessing

Effective data preprocessing is pivotal to the success of our ensemble learning approach for image-to-text transformation. We apply a series of steps to enhance the quality and suitability of the input data.

3.5.1 Data Augmentation. Data augmentation is employed to enhance the diversity and robustness of the training dataset. For images, we apply random transformations such as rotations, flips, and crops to generate augmented versions of the original images. For text, we perform synonym replacement and random word shuffling to create variations of the input text descriptions.

3.5.2 Cosine Similarity Filtering. To streamline the dataset and enhance model performance, we transform textual prompts into embeddings. These embeddings encapsulate intricate semantic information within the embedding space. To quantify the resemblance between two embedding vectors, we use the cosine similarity formula

Cosine Similarity(E_i, E_j) = (E_i · E_j) / (‖E_i‖ ‖E_j‖)    (7)

where E_i and E_j denote the embedding vectors of two distinct textual prompts. For each textual prompt in the training dataset, we compute the cosine similarity with the embedding vectors of the other prompts. Applying a designated similarity threshold (e.g., 0.85), whenever the similarity between a prompt and any other prompt exceeds this threshold, we remove one of the two samples to keep the dataset composition diverse. This filtering step not only improves model performance but also improves training efficiency: reducing the dataset from 90K to 60K instances improved the model's metric while cutting training time per epoch from 6 hours to 4 hours. These improvements collectively contribute to the effectiveness and efficiency of our ensemble learning approach for image-to-text transformation tasks.

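The filtering rule can be sketched as a greedy pass over pairwise cosine similarities of the prompt embeddings: keep a prompt only if it is not within the threshold of an already-kept prompt. This is one plausible reading of the procedure, since the paper does not specify the exact scan order; the 0.85 threshold matches the example above.

```python
# Sketch of cosine-similarity filtering (Section 3.5.2): greedily drop prompts whose
# embedding is too similar (>= threshold) to a prompt that has already been kept.
import numpy as np

def filter_prompts(embeddings, threshold=0.85):
    """embeddings: (n, d) array of prompt embeddings. Returns indices of kept prompts."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        # Cosine similarity against everything kept so far (Eq. (7) on normalized vectors).
        if kept and np.max(normed[kept] @ vec) >= threshold:
            continue  # a near-duplicate is already in the kept set
        kept.append(i)
    return kept

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 512))
keep_idx = filter_prompts(emb)
print(len(keep_idx), "of", len(emb), "prompts kept")
```
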
3.6 Performance Measurement

To gauge model performance, we use the following metric:

Avg-Cos = (1/N) · Σ_{i=1..N} CosSim(GT-Embed_i, Pred-Embed_i)    (8)

where Avg-Cos denotes the average cosine similarity, N is the total number of image-text pairs, and CosSim(GT-Embed_i, Pred-Embed_i) is the cosine similarity between the ground-truth text embedding and the predicted embedding for the i-th pair. This metric quantifies the alignment and semantic coherence between images and their textual descriptions. Higher Avg-CosSim values indicate more robust semantic alignment and enhanced image-to-text transformation, showcasing the effectiveness of our ensemble learning approach.

Table 1: Average Cosine Similarity Results

Model                     Avg-CosSim
CLIP                      0.5642
CLIP with data filter1    0.5684
CLIP with data filter2    0.5721
Model A                   0.5753
Model B                   0.5531
Ensemble Model            0.5961

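Eq. (8) amounts to averaging the row-wise cosine similarity between matched ground-truth and predicted embedding matrices; a small sketch of that computation follows, with stand-in arrays in place of real model outputs.

```python
# Sketch of the Avg-Cos metric of Eq. (8): mean cosine similarity over N matched pairs.
import numpy as np

def avg_cosine_similarity(gt_embeds, pred_embeds):
    """gt_embeds, pred_embeds: (N, d) arrays of matched ground-truth and predicted embeddings."""
    gt = gt_embeds / np.linalg.norm(gt_embeds, axis=1, keepdims=True)
    pred = pred_embeds / np.linalg.norm(pred_embeds, axis=1, keepdims=True)
    return float(np.mean(np.sum(gt * pred, axis=1)))   # (1/N) * sum_i CosSim(GT_i, Pred_i)
```
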
3.7 Experiment Results

We conducted comprehensive experiments to evaluate the performance of our proposed ensemble approach for image-to-text transformation, employing the Average Cosine Similarity (Avg-CosSim) as the evaluation metric to assess the correspondence between the embeddings generated by each model and the ground-truth representations.

The outcomes presented in Table 1 showcase the efficacy of our ensemble approach. The Ensemble Model achieves the highest Avg-CosSim, indicating superior alignment between the generated embeddings and the ground-truth representations compared to the individual models. Model A, with its elaborated architecture, also outperforms the standalone CLIP models and Model B. These findings underscore the value of our ensemble strategy in enhancing image-to-text transformation. The comparative experiments validate the promising trajectory of ensemble learning in addressing intricate multimodal tasks and further contribute to advancing the state of the art in image-to-text transformation.

REFERENCES

[1] Weinan Dai, Chengjie Mou, Jun Wu, and Xuesong Ye. 2023. Diabetic Retinopathy Detection with Enhanced Vision Transformers: The Twins-PCPVT Solution. In 2023 IEEE 3rd International Conference on Electronic Technology, Communication and Information (ICETCI). IEEE, 403–407.
[2] Saad M Darwish and Raad A Ali. 2015. Observations on using type-2 fuzzy logic for reducing semantic gap in content-based image retrieval system. International Journal of Computer Theory and Engineering 7, 1 (2015), 1.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[4] Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. arXiv preprint arXiv:1605.00459 (2016).
[5] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. 2018. Stack-captioning: Coarse-to-fine learning for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[6] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4565–4574.
[7] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
[8] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019).
[9] Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883.
[10] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[11] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[12] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7008–7024.
[13] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2556–2565.
[14] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.
[15] Yongqin Xian, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning - the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4582–4591.
[16] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 558–567.
