Transformers for Vision
Abstract—Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks, and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to
provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to
fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature
encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification,
object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual
reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image
super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We
compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental
value. Finally, we provide an analysis of open research directions and possible future work. We hope this effort will ignite further
interest in the community to solve current challenges towards the application of transformer models in computer vision.
Index Terms—Self-attention, transformers, bidirectional encoders, deep neural networks, convolutional networks, self-supervision.
1 INTRODUCTION
Fig. 1: Statistics on the number of times keywords such as BERT, Self-Attention, and Transformers appear in the titles of peer-reviewed and arXiv papers over the past few years (in Computer Vision and Machine Learning). The plots show consistent growth in recent literature. This survey covers recent progress on Transformers in the computer vision domain.
Fig. 3: Architecture of the Transformer Model [1]. The model was first developed for the language translation task, where an input sequence in one language is converted to an output sequence in another language. The Transformer encoder (shown in the middle row) operates on the input language sequence and converts it to an embedding before passing it on to the encoder blocks. The Transformer decoder (shown in the bottom row) operates on the previously generated outputs in the translated language and on the encoded input sequence from the middle branch to output the next word in the output sequence. The sequence of previous outputs (used as input to the decoder) is obtained by shifting the output sentence to the right by one position and appending a start-of-sentence token at the beginning. This shifting prevents the model from learning to simply copy the decoder input to the output. The ground-truth used to train the model is simply the output language sequence (without any right shift) appended with an end-of-sentence token. The blocks consisting of multi-head attention (shown in the top-most row) and feed-forward layers are repeated N× in both the encoder and decoder, as indicated in the top right corner.
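To make the right-shift described in the caption concrete, the short Python sketch below shows how decoder inputs and training targets are typically built from a target sentence; the token strings and variable names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the shifted decoder input described above (illustrative tokens).
SOS, EOS = "<sos>", "<eos>"
target = ["le", "chat", "est", "noir"]          # ground-truth output sentence

decoder_input = [SOS] + target                   # shifted right by one position
ground_truth = target + [EOS]                    # what the model must predict

# At step t the decoder sees decoder_input[: t + 1] and is trained to predict
# ground_truth[t]; this prevents it from simply copying its input to the output.
for t, label in enumerate(ground_truth):
    print(t, decoder_input[: t + 1], "->", label)
```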
the self-attention blocks used in the decoder are masked to prevent attending to the subsequent future entities. This is simply done by an element-wise multiplication with a mask M ∈ R^(n×n), where M is an upper-triangular matrix. The masked self-attention is defined by

softmax( (Q K^T / √d_q) ∘ M ),

where ∘ denotes the Hadamard product. Basically, while predicting an entity in the sequence, the attention scores of the future entities are set to zero in masked self-attention.

Multi-Head Attention: In order to encapsulate multiple complex relationships amongst different elements in the sequence, multi-head attention comprises multiple self-attention blocks (h = 8 in the original Transformer model [1]). Each block has its own set of learnable weight matrices {W^Q_i, W^K_i, W^V_i}, where i = 0 · · · (h−1). For an input X, the output of the h self-attention blocks in multi-head attention is then concatenated into a single matrix [Z_0, Z_1, · · ·, Z_(h−1)] ∈ R^(n×h·d_v) and projected onto a weight matrix W ∈ R^(h·d_v×d) (Fig. 3, top row).

The main difference of self-attention from the convolution operation is that its weights are dynamically calculated, instead of the static weights of convolution that stay the same for any input. Further, self-attention is invariant to permutations and to changes in the number of input points. As a result, it can easily operate on irregular inputs, as opposed to standard convolution which requires a grid structure.

2.2 (Un)Supervised Pre-training

Self-attention based Transformer models generally operate in a two-stage training mechanism. First, pre-training is performed on a large-scale dataset (and sometimes a combination of several available datasets [22], [35]) in either a supervised [11] or an unsupervised manner [3], [36], [37]. Later, the pre-trained weights are adapted to the downstream tasks using small- to mid-scale datasets. Examples of downstream tasks include image classification [38], object detection [13], zero-shot learning [20], question-answering [10] and action recognition [18]. The effectiveness of pre-training for large-scale Transformers has been advocated in both the language and vision domains. For example, the Vision Transformer model (ViT-L) [11] experiences an absolute 13% drop in accuracy on the ImageNet test set when trained only on the ImageNet train set, as compared to the case when pretrained on the JFT dataset [39] with 300 million images.

Since acquiring manual labels at a massive scale is cumbersome, self-supervised learning has been very effectively used in the pre-training stage. The self-supervision based pre-training stage has played a crucial role in unleashing the scalability and generalization of Transformer networks, enabling the training of networks with even above a trillion parameters (e.g., the latest Switch Transformer [10] from Google). An extensive survey on SSL can be found in [40], [41]. As nicely summarized by Y. LeCun [42], the basic idea of SSL is to fill in the blanks, i.e., try to predict the occluded data in images, future or past frames in temporal video sequences, or to solve a pretext task, e.g., predicting the amount of rotation applied to inputs, the permutation applied to image patches, or the color of a gray-scale image.
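As an illustration of the masked multi-head self-attention of Sec. 2.1, the NumPy sketch below blocks future positions before the softmax (the standard way of realizing the zeroed future attention weights that the Hadamard-mask notation above expresses); shapes and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """Single-head causal (masked) self-attention.

    X: (n, d) input sequence; Wq, Wk, Wv: (d, dq) learnable projections.
    Token i is only allowed to attend to tokens <= i.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, dq = Q.shape
    scores = (Q @ K.T) / np.sqrt(dq)
    allowed = np.tril(np.ones((n, n), dtype=bool))   # lower-triangular = visible past
    scores = np.where(allowed, scores, -np.inf)      # block future positions
    A = softmax(scores, axis=-1)                     # future weights become exactly 0
    return A @ V

# Multi-head attention: run h independent heads, concatenate, project back to d.
rng = np.random.default_rng(0)
n, d, dq, h = 6, 512, 64, 8
X = rng.normal(size=(n, d))
heads = [masked_self_attention(X,
                               rng.normal(size=(d, dq)) / np.sqrt(d),
                               rng.normal(size=(d, dq)) / np.sqrt(d),
                               rng.normal(size=(d, dq)) / np.sqrt(d))
         for _ in range(h)]
Z = np.concatenate(heads, axis=-1) @ (rng.normal(size=(h * dq, d)) / np.sqrt(h * dq))
print(Z.shape)  # (6, 512): one d-dimensional output per input token
```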
Another effective way to impose self-supervised constraints is via contrastive learning. In this case, nuisance transformations are used to create two types of modified versions of the same image, i.e., without changing the underlying class semantics (e.g., image stylizing, cropping) and with semantic changes (e.g., replacing an object with another in the same scene, or changing the class with minor adversarial changes to the image). Subsequently, the model is trained to be invariant to the nuisance transformations and to emphasize modeling the minor changes that can alter semantic labels.

Self-supervised learning provides a promising learning paradigm since it enables learning from a vast amount of readily available non-annotated data. In the SSL based pre-training stage, a model is trained to learn a meaningful representation of the underlying data by solving a pretext task. The pseudo-labels for the pretext task are automatically generated (without requiring any expensive manual annotations) based on data attributes and the task definition. Therefore, the pretext task definition is a critical choice in SSL. We can broadly categorize existing SSL methods based upon their pretext tasks into (a) generative approaches which synthesize images or videos (given conditional inputs), (b) context-based methods which exploit the relationships between image patches or video frames, and (c) cross-modal methods which leverage multiple data modalities. Examples of generative approaches include conditional generation tasks such as masked image modeling [35] and image colorization [43], image super-resolution [44], image in-painting [45], and GAN-based methods [46], [47]. The context-based pretext methods solve problems such as a jigsaw puzzle on image patches [48]–[50], masked object classification [22], predicting geometric transformations such as rotation [38], [51], or verifying the temporal sequence of video frames [52]–[54]. Cross-modal pretext methods verify the correspondence of two input modalities, e.g., text & image [55], audio & video [56], [57] or RGB & flow [58].

2.3 Transformer Model

The architecture of the Transformer model proposed in [1] is shown in Fig. 3. It has an encoder-decoder structure. The encoder (middle row) consists of six identical blocks (i.e., N = 6 in Fig. 3), with each block having two sub-layers: a multi-head self-attention network, and a simple position-wise fully connected feed-forward network. Residual connections [59] alongside layer normalization [60] are employed after each block as in Fig. 3. Note that, different from regular convolutional networks where feature aggregation and feature transformation are simultaneously performed (e.g., with a convolution layer followed by a non-linearity), these two steps are decoupled in the Transformer model, i.e., the self-attention layer only performs aggregation while the feed-forward layer performs transformation. Similar to the encoder, the decoder (bottom row) in the Transformer model comprises six identical blocks. Each decoder block has three sub-layers: the first two (multi-head self-attention, and feed-forward) are similar to the encoder, while the third sub-layer performs multi-head attention on the outputs of the corresponding encoder block, as shown in Fig. 3.

The original Transformer model in [1] was trained for the Machine Translation task. The input to the encoder is a sequence of words (sentence) in one language. Positional encodings are added to the input sequence to capture the relative position of each word in the sequence. Positional encodings have the same dimensions as the input, d = 512, and can be learned or pre-defined, e.g., by sine or cosine functions. Being an auto-regressive model, the decoder of the Transformer [1] uses previous predictions to output the next word in the sequence. The decoder, therefore, takes inputs from the encoder as well as the previous outputs to predict the next word of the sentence in the translated language. To facilitate residual connections, the output dimensions of all layers are kept the same, i.e., d = 512. The dimensions of the query, key and value weight matrices in multi-head attention are set to dq = 64, dk = 64, dv = 64.

2.4 Bidirectional Representations

The training strategy of the original Transformer model [1] could only attend to the context on the left of a given word in the sentence. This is limiting, since for most language tasks contextual information from both the left and right sides is important. Bidirectional Encoder Representations from Transformers (BERT) [3] proposed to jointly encode the right and left context of a word in a sentence, thus improving the learned feature representations for textual data in an unsupervised manner. To enable bidirectional training, [3] basically introduced two pretext tasks: Masked Language Model and Next Sentence Prediction. The model pre-trained on these pretext tasks in an unsupervised manner was then fine-tuned for the downstream task. For this purpose, a task-specific additional output module is appended to the pre-trained model, and the full model is fine-tuned end-to-end.

The network architecture of the base BERT [3] model is based upon the original Transformer model proposed in [1] and is similar to GPT [4]. The main architectural difference compared to [1] is that the BERT model only uses the Transformer encoder (similar to the middle row, Fig. 3) while GPT [4] only uses the Transformer decoder (similar to the bottom row, Fig. 3). The core contribution of BERT [3] is the pretext task definition, which enables bidirectional feature encoding in an unsupervised manner. To this end, BERT [3] proposed two strategies: (1) Masked Language Model (MLM) - A fixed percentage (15%) of words in a sentence are randomly masked and the model is trained to predict these masked words using a cross-entropy loss. In predicting the masked words, the model learns to incorporate the bidirectional context. (2) Next Sentence Prediction (NSP) - Given a pair of sentences, the model predicts a binary label, i.e., whether the pair is valid from the original document or not. The training data for this can easily be generated from any monolingual text corpus. A pair of sentences A and B is formed, such that B is the actual sentence (next to A) 50% of the time, and B is a random sentence for the other 50% of the time. NSP enables the model to capture sentence-to-sentence relationships which are crucial in many language modeling tasks such as Question Answering and Natural Language Inference.

3 TRANSFORMERS & SELF-ATTENTION IN VISION

We provide an overview of the main themes followed in Transformers designed for vision applications in Fig. 4. Exist-
tokens similar to [82]. The authors introduce different training techniques including data augmentation, training with an auxiliary task, and injecting locality into self-attention to scale up their model for high-quality image synthesis [99]. The TransGAN model achieves state-of-the-art results in terms of Inception Score and Fréchet Inception Distance (FID) on STL-10 and performs favorably compared with its CNN-based GAN counterparts on other datasets.

3.4.5 SceneFormer

In the previous works on image generation [97]–[99], image outputs are generally predicted directly by the model. In contrast, [23] learns to generate parameters of 3D objects to be placed in a given scene. Specifically, SceneFormer [23] studies the 3D room-layout conditioned scene generation task. Given the empty room shape, this approach can propose new object configurations in the room while maintaining realism. Remarkably, the model does not use any appearance information and only learns to generate new scenes by modeling the inter-object relationships using self-attention in Transformers. Similar to how a Transformer operates on a sentence, it is applied to a sequence of objects to predict the next suitable object in a scene. Specifically, the size, pose, location, and category of the next object are predicted by the Transformer model. A start token indicates the initiation of inference and the number of output tokens indicates the objects generated by the model in a sequence. The authors also explore generating new scenes given a textual description of the room layout. The independence from appearance makes the approach efficient, enabling interactive scene generation.

3.5 Transformers for Text-to-Image Synthesis

The task of generating realistic images from text is interesting and practically valuable (e.g., for artistic content creation), but at the same time highly challenging. Prior text-to-image synthesis approaches [113]–[116] are mostly based on GANs [46]. Although these methods produce moderate results, they are far from being photo-realistic. To this end, Ramesh et al. [20] propose DALL·E, a Transformer model capable of generating high-fidelity images from a given text description. The model is named DALL·E as a portmanteau of the Spanish artist Salvador Dalí and Pixar's blockbuster movie WALL·E.

The DALL·E model has 12 billion parameters and is trained on a large set of text-image pairs taken from the internet. Before training, images are first resized to 256×256 resolution, and subsequently compressed to a 32×32 grid of latent codes using a pre-trained discrete variational autoencoder [117], [118]. DALL·E takes as input a single stream of 1280 tokens (256 for the text and 1024 for the image), and is trained to generate all other tokens autoregressively (one after another). It provides the flexibility to generate images either from scratch (Fig. 14a) or by extending existing images (Fig. 14b), while staying faithful to the text caption.

The authors demonstrate the effectiveness of DALL·E by creating images from text describing a wide variety of real and fictional things. While generating images purely from textual captions, DALL·E shows impressive performance at controlling multiple objects and their attributes (Fig. 14c), rendering certain viewpoints (Fig. 14d), capturing an object's internal structure (Fig. 14e), and combining unrelated objects (Fig. 14f). Furthermore, DALL·E can perform image-to-image translation (Fig. 14g) guided by the input text.

3.6 Transformers for Low-level Vision

Transformer models have also been proposed for low-level vision tasks including image super-resolution, denoising, deraining, and colorization. Specifically, the Transformer network for super-resolution [16] uses attention mechanisms to search relevant textures from reference images and transfer them to low-resolution images to generate super-resolved outputs. Similarly, the work of [19] shows how to exploit the potential of pre-training and transfer learning with a shared Transformer-based backbone to address multiple image restoration tasks (e.g., denoising, deraining, and super-resolution) with dedicated task-heads. The colorization transformer [24] proposes a progressive design for image colorization to achieve high-resolution outputs. Next, we provide details of these image restoration Transformer models.

3.6.1 Transformers for Super-Resolution

Image super-resolution (SR) aims to generate a high-resolution (HR) image from its low-resolution (LR) version. Recent years have seen major performance breakthroughs for SR due to convolutional neural networks (CNNs). Principally, the quality of super-resolved images generated by CNNs is dependent on the choice of optimization objective. On one hand, SR methods [119]–[123] that are based on pixel-wise loss functions (e.g., L1, MSE, etc.) yield impressive results in terms of image fidelity metrics such as PSNR and SSIM. However, they struggle to recover fine texture details and often produce images that are overly smooth and perceptually less pleasant. On the other hand,
The IPT model is optimized with an L1 loss. Experimental results show that the pre-trained IPT model, when fine-tuned for a specific low-level vision task, can provide significant performance gains over the state-of-the-art methods [122], [129], [130].

3.6.3 Colorization Transformer

Given a grayscale image, colorization seeks to produce the corresponding colorized sample. It is a one-to-many task, as for a given grayscale input there exist many possibilities in the colorized output space. The challenging nature of this task requires probabilistic models capable of producing multiple colorized output samples. The Colorization Transformer [24] is a probabilistic model based on a conditional attention mechanism [131]. It divides the image colorization task into three sub-problems (Fig. 17) and proposes to solve each task sequentially by a different Transformer network. The authors first train a Transformer network to map a low-resolution grey-scale image to a 3-bit low-resolution colored image. Low-resolution images in turn allow training of larger models. The 3-bit low-resolution colored image is then upsampled to an 8-bit RGB sample by another Transformer network in the second stage of training. Finally, a third-stage Transformer is trained to increase the spatial resolution of the 8-bit RGB sample produced by the second-stage Transformer. Self-attention used in the Colorization Transformer is based on the row/column attention layers introduced in [131]. These layers capture the interaction between the pixels of an input image while being computationally less costly. The row-wise attention layer applies self-attention to all pixels in a given row, while the column-wise attention layer considers pixels only in a given column of an image. This work [24] is the first successful application of Transformers trained to colorize grey-scale images at high (256×256) resolution.

3.7 Transformers for Multi-Modal Tasks

Transformer models have also been extensively used for vision-language tasks such as visual question answering (VQA) [135], visual commonsense reasoning (VCR) [136], cross-modal retrieval [137] and image captioning [138]. Several works in this direction target effective vision-language

3.7.1 ViLBERT: Vision and Language BERT

Vision and Language BERT was the first extension of the BERT model to the multi-modal domain. The goal was to learn representations that can jointly model images and natural language. For this purpose, ViLBERT developed a two-stream architecture where each stream is dedicated to modeling the vision or language inputs (Fig. 18-h). The architecture of both parallel streams is a series of Transformer blocks similar to the BERT model. Subsequently, co-attentional Transformer layers are applied to learn cross-modal relationships. The co-attentional framework is very simple: query, key, and value matrices are computed for each modality in the standard way [1], and then the key-value pairs of one modality are passed on to the other modality's attention head.

ViLBERT applies VLP on a set of proxy tasks defined on the Conceptual Captions dataset (with 3.3M images with weak captions) and later fine-tunes the model on downstream tasks such as VQA. The pre-training phase operates in a self-supervised manner, i.e., pretext tasks are created without manual labeling on the large-scale unlabelled dataset. These pretext tasks include predicting whether the text and image inputs are related and predicting the semantics of masked image regions and textual inputs (e.g., similar to reconstructing masked words in text in the BERT model [3]). This way, the model learns the inherent structure in the data during pre-training and also models cross-domain associations. With evaluations on several tasks, [17] demonstrated that a two-stream model can perform better than a single-stream model that uses shared parameters to model both the language and vision domains [17].

3.7.2 LXMERT

Similar to ViLBERT [133], Learning Cross-Modality Encoder Representations from Transformers (LXMERT) [21] also uses a two-stream architecture based on the BERT framework. The main difference lies in the object-relationship encoder that is used to model the visual features instead of the simple image-level features used in ViLBERT. The information in the two streams is then fused across modalities using cross-attention blocks similar to [133].
Fig. 18: An overview of Transformer models used for multi-modal tasks in computer vision. The Transformer designs in this
category can be grouped into single-stream (UNITER [35], OSCAR [36], VideoBERT [17], Unicoder-VL [132], VisualBERT [55] and
VL-BERT [22]) and dual-stream architectures (LXMERT [21], ViLBERT [133] and PEMT [134]). A key distinction between models
is the choice of loss functions. While most of the multi-modal methods are focused on images as visual data, VideoBERT [17] and
PEMT [134] are designed to work on video streams and leverage unique modalities e.g., audio signals in videos [134].
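As a concrete illustration of the co-attentional (cross-attention) layers used by the dual-stream models above (ViLBERT, LXMERT), the sketch below exchanges keys and values between a language stream and a vision stream; the shapes, weight initialization and function names are illustrative assumptions, not taken from those papers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_stream, kv_stream, Wq, Wk, Wv):
    """Queries come from one modality, keys/values from the other modality."""
    Q = query_stream @ Wq
    K, V = kv_stream @ Wk, kv_stream @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

rng = np.random.default_rng(0)
d = 768
text = rng.normal(size=(12, d))      # 12 word tokens
vision = rng.normal(size=(36, d))    # 36 region-of-interest features

W = lambda: rng.normal(size=(d, 64)) / np.sqrt(d)
# Language stream attends over visual keys/values, and vice versa.
text_attended = cross_attention(text, vision, W(), W(), W())      # (12, 64)
vision_attended = cross_attention(vision, text, W(), W(), W())    # (36, 64)
print(text_attended.shape, vision_attended.shape)
```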
Compared to the two pretext tasks used for VLP in [133], LXMERT uses five pre-training tasks including masked object and language prediction, cross-modality matching, and visual question answering (Fig. 18-g). The pre-trained model is fine-tuned on the VQA task; however, a high similarity between the pre-training and fine-tuning tasks raises questions about the generalizability of the learned representations to new tasks. To this end, the authors conducted generalization experiments on the Visual Reasoning for Real (NLVR) task [139], demonstrating impressive improvements on novel tasks.

3.7.3 VisualBERT

Different from two-stream networks like ViLBERT [133] and LXMERT [21], VisualBERT [55] uses a single stack of Transformers to model both domains (images and text). The input sequence of text (e.g., a caption) and the visual features corresponding to the object proposals are fed to the Transformer that automatically discovers relations between the two domains. Notably, the VisualBERT architecture is somewhat similar to VideoBERT [17] (explained in Sec. 3.8), but instead of only focusing on cooking videos, VisualBERT evaluates on various visual-linguistic tasks (e.g., VCR, NLVR, VQA, and visual grounding).

The VisualBERT model first applies task-agnostic pre-training using two objectives (Fig. 18-e). The first objective simply attempts to predict missing text tokens using the image features and the remaining textual tokens. The second objective attempts to differentiate between the true and false caption of a given image. After task-agnostic pre-training, the authors propose to perform task-specific pre-training to bridge the domain gap before the final fine-tuning to the downstream task.

3.7.4 VL-BERT

Su et al. [22] propose a multi-modal pre-training approach to learn features that are generalizable to multi-modal downstream tasks such as Visual Commonsense Reasoning and Visual Question Answering. This endeavor requires adequately aligning the visual and linguistic cues so that an effective composite representation is learned. To this end, [22] builds on the BERT model and inputs both the visual and language features. The language features correspond to the tokens in the input sentence and the visual features correspond to the regions of interest (RoI) from the input image (obtained via a standard Faster R-CNN). Specifically, the model is pre-trained on both a visual-lingual dataset (Conceptual Captions [140]) as well as language-only datasets (e.g., Wikipedia). The loss function is identical to
BERT, where the model is trained to predict the masked-out words or visual RoIs (Fig. 18-f). In contrast to other works such as UNITER [35], VL-BERT claims that the visual-linguistic matching tasks are not useful during pre-training, which is in contrast to evidence from later efforts [132]. Their results on several multi-modal tasks show their benefit over language-only pre-training (e.g., in BERT).

3.7.5 Unicoder-VL

Universal Encoder for Vision and Language (Unicoder-VL) [132] learns multi-modal representations using large-scale image-caption datasets. The language and image inputs are fed to a single Transformer model (with multiple successive encoders) to learn joint embeddings. To this end, it uses masked word prediction, masked object classification, and visual-linguistic matching as self-supervision tasks during pre-training (Fig. 18-d). Notably, the visual-linguistic matching is carried out only at the global level (i.e., image-sentence alignment). The model is evaluated on the downstream tasks of image-text retrieval, zero-shot learning, and visual commonsense reasoning, where it performs better than previous models such as ViLBERT [133] and VisualBERT [55]. This shows the significance of rich self-supervised tasks and advocates for a unified Transformer architecture to learn multi-modal feature representations in a common framework.

3.7.6 Unified VLP

The Unified Vision-Language Pre-training (VLP) [141] model uses a single Transformer network for both the encoding and decoding stages. This stands in contrast to BERT-inspired VLP models [17], [22], [55], [142] which use independent encoder and decoder networks. Joint modeling of the encoding and decoding stages allows the Unified VLP model to perform well on both image captioning and visual question answering tasks, when fine-tuned on these individual tasks. The intuition for shared modeling of the encoding and decoding stages stems from the need to better share cross-task information during pre-training. The unified model consists of a stack of 12 Transformer blocks, each with a self-attention layer followed by a feed-forward module. The self-supervised objectives used for pre-training include masked vision-language predictions. Here, the authors explore two variants, i.e., bidirectional and sequence-to-sequence prediction of masked words, where different context encodings are used for the two types of objectives. The proposed approach is evaluated on COCO Captions, Flickr30K Captions and VQA 2.0 and obtains encouraging results compared to previous methods on image captioning and VQA [143].

3.7.7 UNITER

Universal image-text representation (UNITER) [35] is also a multi-modal feature learning approach, pre-trained on four large-scale visual-linguistic datasets (MS-COCO [70], Visual Genome [144], Conceptual Captions [140] and SBU Captions [145]). The learned representations have been shown to transfer well to downstream tasks such as VQA, multi-modal retrieval, visual commonsense reasoning, and NLVR. In order to emphasize learning the relationships between the visual and language domains, they specifically design pre-training tasks to predict a masked visual or text region conditioned on the input from the other domain, and to align language and visual inputs at both the global (image-text) and local (word-region) levels (Fig. 18-a). These tasks are beside the conventional masked language modeling task used in BERT, and explicitly include fine-grained word-region alignment alongside conditional masking of inputs that were not considered in earlier works such as VL-BERT [22], VisualBERT [55], ViLBERT [133] and Unicoder-VL [132]. Common to the other approaches, they adopt the Transformer architecture proposed in BERT that operates on both the visual and language embeddings. In contrast to applying independent Transformers to the language and visual inputs (as in ViLBERT [133] and LXMERT [21]), UNITER adopts a single Transformer applied to the textual and image inputs like [22], [55], [132].

3.7.8 Oscar: Object-Semantics Aligned Pre-Training

The VisualBERT [55], UNITER [35], VL-BERT [22], ViLBERT [133], and Unicoder-VL [132] models for VLP concatenate image and text features and leave it to the self-attention to automatically discover cross-modal relationships. This can complicate the visual grounding of semantic concepts in an image. To address this problem, Oscar [36] first uses an object detector to obtain object tags (labels), subsequently using these tags as a mechanism to align relevant visual features with the semantic domain information (Fig. 18-b). The motivation is that the textual content generally pertains to the major objects in the image; therefore, by explicitly adding those image labels to the input, visual features can be better attended. Similar to BERT [3], Oscar uses a Masked Token Loss for VLP. Specifically, different tokens in the textual input and image tags are randomly masked and the model's job is to predict the missing token. This forces it to learn the relationship of the missing token with the contextual information given as visual and semantic features. Further, it also uses a contrastive loss that discriminates between original and noisy/fake image-tag pairs. The representations thus learned are fine-tuned on VQA, cross-modality retrieval, natural language reasoning, and image captioning tasks to obtain better performance compared to VLP methods that do not use object tags.

3.7.9 Vokenization

Tan and Bansal [146] introduce the concept of 'vokens' (images related to language tokens extracted from sentences). The vokens (visualized tokens) provide visual supervision to the language model to learn better features. The motivation is that humans learn languages by correlating visual information with semantic concepts. In a similar spirit to other self-supervised language representation learning methods [3], [133], they learn representations by defining an auxiliary voken-prediction task.

Since the existing datasets encode limited visually grounded tokens, they propose a vokenization method to map language tokens to visual vokens, as illustrated in Fig. 19. The approach uses language-based retrieval for such a mapping and transfers a model trained on a small labeled dataset (MS-COCO) to a large dataset (Wikipedia). Furthermore, it was ensured that the sentence-wide context
Fig. 21: Spatial/Temporal Attention for Skeleton Data Representations: (a) Spatial Self-Attention, (b) Temporal Self-Attention. Relationships between body-joints and inter-frame dependencies are modeled using two dedicated self-attention modules. Figure is from [164].

are matched with the ground-truth using bipartite matching. Similar to Mask R-CNN [85], a separate head is used to predict the instance mask based on self-attention and 3D convolutions. The overall results are competitive among the single-model approaches on the YouTube-VIS dataset [162], but it performs somewhat lower compared to more complex CNN-based models such as MaskProp [163].

3.8.7 Skeleton-Based Action Recognition

Human action recognition based on skeleton representations requires models that can understand relationships between different joints of a body in a given frame as well as between different frames of a video. Plizzari et al. [164] proposed a two-stream Transformer network to model such relationships. They introduced spatial self-attention (SSA) for relation modeling between different body-joints (Fig. 21a), and temporal self-attention (TSA) to capture long-range inter-frame dependencies (Fig. 21b). They first used a small residual network to extract features from the skeleton data and then used the SSA and TSA modules to process those feature maps. SSA models the relations between different body parts by finding the correlation between each pair of joints independently, while TSA focuses on how the features of a certain joint change between frames along the temporal dimension. Joints can be thought of as a bag-of-words, and the purpose of SSA is to discover relationships among the surrounding joints in the same way the Transformer relates different words in a phrase. On the other hand, TSA finds long-range relations between frames, similarly to how relations among phrases are built in NLP. The two-stream spatial-temporal Transformer network achieves state-of-the-art results on the NTU-RGB+D 60 [165] and NTU-RGB+D 120 [166] datasets.

3.9 Transformers in Low-shot Learning

In the few-shot learning setting, a support set is provided at inference to adapt to a novel set of categories. Transformer models have been used to learn set-to-set mappings on this support set [26] or to learn the spatial relationships between a given input query and the support set images [25]. In terms of absolute performance, the patch-wise spatial self-attention between query and support set images excels compared to the image-level association learned in [26]. However, the patch-wise attention computation is computationally expensive. We elaborate on these approaches below.

3.9.1 Cross-Transformer

Doersch et al. [25] explore the utility of self-supervision and Transformer architectures for cases where a distribution mismatch exists between the training and evaluation phases. They specifically consider the few-shot fine-grained classification problem, where a model is first trained on a set of base classes and later, during evaluation, it must adapt to novel classes using their few labeled examples (support set). Cross-Transformer is evaluated on Meta-dataset [167], which is a huge dataset comprising 10 distinct datasets (including ImageNet, MS-COCO, etc.). The dataset encapsulates the challenging scenario where a learner must adapt to new classes and novel domains during evaluation. The Transformer architecture in this case is used to relate a given query image with the few examples available in the support set. To this end, the Transformer finds spatially similar regions in the query and support set images, and the corresponding features are then used to obtain class decisions for the query. The queries in the Transformer architecture are derived from the grid features obtained from the query image. Similarly, grid features from the support images are used to construct keys and values, which are in turn used to derive attended outputs. This approach, besides a contrastive self-supervision based training mechanism, leads to the best performance on the challenging Meta-dataset.

3.9.2 FEAT: Few-Shot Embedding Adaptation

Ye et al. [26] propose to adapt the few-shot embeddings learned on the base classes to the few-shot target classes during inference using a Transformer module. This leads to task-specific embeddings that perform better on discriminative tasks such as few-shot classification. While many other set-to-set functions are also evaluated, such as Graph Convolutional Networks [168], Bidirectional LSTMs [29] and DeepSets [169], the best performance is achieved with the Transformer-based mapping. This is attributed to the better contextualization, task interpolation and extrapolation capability of Transformers and their permutation invariance, while maintaining a relatively lower parameter complexity. The Transformer architecture used in this work follows the standard approach [1]. The embeddings are adapted using a contrastive loss function for preserving discriminative properties (Fig. 22). The resulting model achieves strong performance on inductive, transductive, and generalized FSL tasks.

3.10 Transformers for Clustering

Clustering is a fundamental operation in unsupervised learning that aims to discover structure in the data by grouping similar data points together. It has numerous applications such as data visualization and interpretation, anomaly detection, and open-set categorization. Neural networks have been developed for set prediction problems [169], [170]; however, the set points are processed individually, which can lose information about inter-point relationships. Recent works employ Transformers that operate on set inputs, called Set Transformers (ST) [171], for amortized clustering. Amortized clustering is a challenging problem that seeks to learn a parametric function that can map an
Fig. 24: Mesh Transformer architecture. The joint and vertex queries are appended with positional embeddings and passed
through multiple self-attention layers to jointly regress 3D coordinates of joints and mesh vertices. Figure is from [37].
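The sketch below illustrates the query construction and masked vertex modeling (MVM) summarized in the caption: joint and vertex queries carry 3D coordinates as positional information, and a random percentage of input queries is masked (here simply zeroed). The shapes, mask ratio and array names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_joints, num_vertices, feat_dim = 14, 431, 2048

# A global image feature (e.g., from a CNN backbone) broadcast to every query.
img_feat = rng.normal(size=(feat_dim,))

# Template 3D coordinates act as positional information for each query.
joint_xyz = rng.normal(size=(num_joints, 3))
vertex_xyz = rng.normal(size=(num_vertices, 3))
coords = np.concatenate([joint_xyz, vertex_xyz], axis=0)             # (445, 3)

# Each query = [3D coordinate positional encoding | image feature].
queries = np.concatenate(
    [coords, np.tile(img_feat, (coords.shape[0], 1))], axis=1)        # (445, 2051)

# Masked vertex modeling: randomly mask a percentage of input queries; the
# Transformer must still regress the 3D positions of *all* joints and vertices.
mask_ratio = 0.3
masked = rng.random(coords.shape[0]) < mask_ratio
queries[masked] = 0.0
print(queries.shape, int(masked.sum()), "queries masked")
```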
is used to jointly model the vertex-to-vertex relationships in a mesh as well as the vertex-to-body-joint relationships. The self-attention mechanism can attend to any combination of vertices in the mesh, thereby encoding non-local relationships.

The multi-layer Transformer architecture sequentially performs dimensionality reduction to map the 2D image to the 3D mesh. Position encoding is performed using the 3D coordinates (x, y, z) of each vertex and each body-joint. Similar to masked language modeling in NLP, METRO uses masked vertex modeling (MVM), which randomly masks some percentage of the input queries (see Fig. 24). The Transformer is tasked with regressing all the joints and vertices, which helps encode the inter-dependencies between them. METRO obtains state-of-the-art results on human mesh reconstruction on two publicly available datasets (Human3.6M [177] and 3DPW [178]). Since the approach does not depend on a parametric mesh model, it generalizes well to other reconstruction tasks such as 3D hand reconstruction [179]. Overall, this is the first effort to employ Transformers for 3D human reconstruction tasks and it leads to fairly good results.

4 OPEN PROBLEMS & FUTURE DIRECTIONS

Despite excellent performance from Transformer models and their interesting salient features (Table 1), there exist several challenges associated with their applicability to practical settings (Table 2). The most important bottlenecks include the requirement for large amounts of training data and the associated high computational costs. There have also been some challenges in visualizing and interpreting Transformer models. In this section, we provide an overview of these challenges, mention some of the recent efforts to address those limitations, and highlight the open research questions.

4.1 High Computational Cost

As discussed in Sec. 1, Transformer models have high parametric complexity. This results in high training and inference cost, both in terms of computational time and the resources required for processing. As an example, the BERT [3] basic model (with 109 million parameters) took around 1.89 peta-flop days¹ for training, while the latest GPT-3 [6] model (175 billion parameters) took around 3640 peta-flop days for training (a staggering ∼1925× increase). This comes with a huge price tag, e.g., according to one estimate [180], GPT-3 training might have cost OpenAI around 4.6 million USD. Additionally, these large-scale models require aggressive compression techniques (e.g., distillation) to make their inference feasible for real-world settings.

In the language domain, recent works focus on reducing the high complexity of Transformer models (basically arising from the self-attention mechanism [1], where a token's representation is updated by considering all tokens from the previous layer). For example, [161], [185] explore selective or sparse attention to previous-layer tokens while updating each next-layer token. Linformer [33] reduces the complexity of the standard self-attention operation from O(n^2) to O(n) (both in time and memory requirements). The main idea is to show that a low-rank matrix is sufficient to model the self-attention mechanism. The Reformer model [186] employed locality-sensitive hashing (LSH) to minimize the complexity of self-attention from O(n^2) to O(n log n). In a similar pursuit, the recent Lambda Networks propose to model context as a linear function, which helps reduce the complexity of self-attention [187].

Vyas et al. [188] developed an efficient cluster attention approach to deal with large input sequences that approximates the original self-attention. They propose a cluster attention approach that groups queries into clusters and then computes attention between cluster centers (instead of attention between all the queries, which leads to quadratic complexity). The main idea is that queries close in Euclidean space should have similar attention distributions. With a fixed number of clusters, this intuition helps reduce the quadratic complexity to a linear complexity of O(nc) with respect to the input sequence length n (where c is the number of clusters). We refer readers to [31] for a nice

1. A peta-flop day is a measure of computation and equals performing 10^15 neural net operations per second for one complete day.
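A small sketch of the clustered-attention idea just described is given below: queries are grouped into c clusters, one attention distribution is computed per cluster centroid, and the result is broadcast back to the member queries, giving O(nc) score computations instead of O(n^2). The plain k-means step, shapes and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clustered_attention(Q, K, V, c=16, iters=5, seed=0):
    """Approximate attention with c query clusters: O(n*c) instead of O(n^2)."""
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    centroids = Q[rng.choice(n, size=c, replace=False)]
    for _ in range(iters):                                   # plain k-means on queries
        dist = ((Q[:, None, :] - centroids[None]) ** 2).sum(-1)   # (n, c)
        assign = dist.argmin(axis=1)
        for j in range(c):
            members = Q[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # One attention distribution per centroid, shared by all of its member queries.
    A_c = softmax(centroids @ K.T / np.sqrt(d), axis=-1)     # (c, n)
    out_c = A_c @ V                                          # (c, d_v)
    return out_c[assign]                                     # broadcast back to (n, d_v)

rng = np.random.default_rng(1)
n, d = 1024, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(clustered_attention(Q, K, V).shape)   # (1024, 64)
```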
| Task | Method | Design Highlights (focus on differences with the standard form) | Input Data Type | Label Type | Loss |
| --- | --- | --- | --- | --- | --- |
| Image Classification | ViT [11] | Directly adopts the NLP Transformer encoder for images; mechanism to linearly embed image patches with positional embeddings suitable for the encoder. | 2D Image | Class labels | Cross-entropy |
| Image Classification | DeiT [12] | Transformer as a student while a CNN acts as the teacher; distillation tokens to produce estimated labels from the teacher; attention between class and distillation tokens. | 2D Image | Class labels | Cross-entropy, distillation loss based on KL-divergence |
| Image Classification | CLIP [81] | Jointly trains image and text encoders on image-text pairs to maximize the similarity of valid pairs and minimize it otherwise. | 2D Images & text | Image-text pairs | Symmetric cross-entropy |
| Object Detection | DETR [13] | Linear projection layer to reduce CNN feature dimension; spatial positional embedding added to each multi-head self-attention layer of both encoder and decoder; object queries (output positional encoding) added to each multi-head self-attention layer of the decoder. | 2D Image | Class labels | Hungarian loss based on bipartite matching between predictions and ground truths |
| Object Detection | D-DETR [14] | Deformable Transformer consisting of deformable attention layers to introduce sparse priors in Transformers; multi-scale attention module. | 2D Image | Class labels | Hungarian loss |
| Low-Shot Learning | CT [25] | Self-supervised pretraining; query-aligned class prototypes that provide spatial correspondence between the support-set images and the query image. | 2D Image | Pretraining without labels, few-shot learning with class labels | Normalized cross-entropy |
| Image Colorization | ColTran [24] | Conditional row/column multi-head attention layers; progressive multi-scale colorization scheme. | 2D Image | 2D Image | Negative log-likelihood of the images |
| Action Recognition | ST-TR [164] | Spatial and temporal self-attention that operate on graph data such as joints in skeletons. | Skeleton | Action classes | Cross-entropy |
| Super-Resolution | TTSR [16] | Texture-enhancing Transformer module; relevance embeddings to compute the relevance between the low-resolution and reference images. | 2D Image | 2D Image | Reconstruction loss, perceptual loss defined on pretrained VGG19 features |
| Multi-Modal Learning | Oscar [36] | Transformer layer to jointly process the triplet representation of image-text [words, tags, features]; masked tokens to represent text data. | 2D Image | Captions, class labels, object tags | Negative log-likelihood of masked tokens, contrastive binary cross-entropy |
| 3D Classification/Segmentation | PT [173] | Point Transformer block; transition-down block to reduce cardinality of the point set; transition-up for dense prediction tasks. | CAD models, 3D object part segmentation | Object and shape categories | Cross-entropy |
| 3D Mesh Reconstruction | METRO [37] | Progressive dimensionality reduction across Transformer layers; positional encoding with 3D joint and 3D vertex coordinates; masked vertex/joint modeling. | 2D Image | 3D mesh + human pose | L1 loss on mesh vertices and joints in 3D and 2D projection |
| Vision and Language Navigation | Chen et al. [149] | Uni-modal encoders on language and map inputs followed by a cross-modal transformer; trajectory position encodings in the map encoder. | Instruction text + RGBD panorama + topological environment map | Navigation plan | Cross-entropy over nodes and the [stop] action |
| Referring Image Segmentation | CMSA [15] | Multimodal features; cross-modal self-attention on multiple levels and their fusion using learned gates. | 2D Image + language expression | Segmentation mask | Binary cross-entropy loss |
| Video Classification | Lee et al. [134] | Operates on real-valued audio-visual signals instead of tokens; contrastive learning for pre-training; end-to-end multimodal transformer learning. | Audio-visual | Activity labels | Contrastive InfoNCE loss and binary cross-entropy |

TABLE 1: A summary of key design choices adopted in different variants of transformers for a representative set of computer vision applications. The main changes relate to specific loss function choices, architectural modifications, different position embeddings and variations in input data modalities.
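Table 1 lists DETR's Hungarian loss based on bipartite matching between predictions and ground truths. The sketch below shows only the matching step, with a toy cost built from class-probability and box-L1 terms; the cost weighting and shapes are illustrative assumptions (DETR's full cost also includes additional box terms).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
num_queries, num_classes, num_gt = 5, 3, 2

pred_logits = rng.normal(size=(num_queries, num_classes))
pred_boxes = rng.random(size=(num_queries, 4))            # (cx, cy, w, h) in [0, 1]
gt_labels = np.array([0, 2])
gt_boxes = rng.random(size=(num_gt, 4))

probs = np.exp(pred_logits) / np.exp(pred_logits).sum(-1, keepdims=True)
cost_class = -probs[:, gt_labels]                          # higher probability -> lower cost
cost_box = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # L1 box distance
cost = cost_class + 5.0 * cost_box                         # illustrative weighting

row, col = linear_sum_assignment(cost)   # optimal one-to-one prediction/GT matching
print(list(zip(row, col)))               # matched (prediction, ground-truth) pairs
```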
| Task | Method | Metric | Dataset | Score | Advantages | Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Image Classification | ViT [11] (ICLR'21) | Top-1 Acc. | ImageNet | 88.55 | a) First application of a Transformer (global self-attention) directly on image patches; b) convolution-free network architecture; c) outperforms CNN models such as ResNet. | a) Requires training on large-scale data, e.g., 300 million images; b) requires careful transfer learning to the new task; c) requires a large model with 632 million parameters to achieve SOTA results. |
| Image Classification | DeiT [12] (arXiv'20) | Top-1 Acc. | ImageNet | 83.10 | a) Successfully trains a Transformer on ImageNet only; b) introduces an attention-based distillation method; c) produces competitive performance with small (86-million-parameter) Transformers. | Requires access to a pretrained CNN-based teacher model, so performance depends on the quality of the teacher model. |
| Low-Shot Learning | CT [25] (NeurIPS'20) | Top-1 Acc. | ImageNet / COCO | 62.25 / 60.35 | a) Self-supervised pre-training mechanism that does not need manual labels; b) dynamic inference using the Transformer, achieving state-of-the-art results. | The proposed algorithm is limited in its capacity to perform on datasets that lack spatial details such as texture. |
| Object Detection | DETR [13] (ECCV'20) | AP | COCO | 44.9 | a) Use of the Transformer allows an end-to-end training pipeline for object detection; b) removes the need for hand-crafted post-processing steps. | a) Performs poorly on small objects; b) requires a long training time to converge. |
| Object Detection | D-DETR [14] (ICLR'21) | AP | COCO | 43.8 | a) Achieves better performance on small objects than DETR [13]; b) faster convergence than DETR [13]. | Obtains SOTA results of 52.3 AP but with a two-stage detector design and test-time augmentations. |
| Image Colorization | ColTran [24] (ICLR'21) | FID | ImageNet | 19.71 | a) First successful application of a Transformer to image colorization; b) achieves SOTA FID score. | a) Lacks end-to-end training; b) limited to images of size 256×256. |
| Action Recognition | ST-TR [164] (arXiv'20) | Top-1 Acc. | NTU 60/120 | 94.0 / 84.7 | a) Successfully applies Transformers to model relations between body joints in both the spatial and temporal domains; b) achieves SOTA results. | The proposed Transformers do not process joints directly but operate on features extracted by a CNN, so the overall model relies on a hand-crafted design. |
| Super-Resolution | TTSR [16] (CVPR'20) | PSNR / SSIM | CUFED5; Sun80; Urban100; Manga109 | 27.1/0.8; 30.0/0.81; 25.9/0.78; 30.1/0.91 | a) Achieves state-of-the-art super-resolution by using attention; b) novel Transformer-inspired architectures that can process multi-scale features. | a) The proposed Transformer does not process images directly but features extracted by a convolution-based network; b) model with a large number of trainable parameters; c) compute intensive. |
| Multi-Modal Learning | ViLBERT [133] (NeurIPS'19) | Acc. / Retrieval mAP (R@1) | VQA [135] / Retrieval [181] | 70.6 / 58.2 | a) The proposed Transformer architecture can combine text and visual information to understand inter-task dependencies; b) achieves pre-training on unlabelled data. | a) Requires a large amount of data for pre-training; b) requires fine-tuning to the new task. |
| Multi-Modal Learning | Oscar [36] (ECCV'20) | Acc. / mAP (R@1) | VQA [182] / COCO | 80.37 / 57.5 | a) Exploits a novel supervisory signal via object tags to achieve text and image alignment; b) achieves state-of-the-art results. | Requires extra supervision through pre-trained object detectors, so performance depends on the quality of the object detectors. |
| Multi-Modal Learning | UNITER [35] (ECCV'20) | Acc. / Avg. (R@1/5/10) | VQA [135] / Flickr30K [183] | 72.47 / 83.72 | Learns fine-grained relation alignment between text and images. | Requires large multi-task datasets for Transformer training, which leads to high computational cost. |
| 3D Analysis | Point Transformer [173] (arXiv'20) | Top-1 Acc. / IoU | ModelNet40 [175] | 92.8 / 85.9 | a) Transformer-based attention capable of processing unordered and unstructured point sets; b) permutation-invariant architecture. | a) Only moderate improvements over previous SOTA; b) large number of trainable parameters, around 6× higher than PointNet++ [184]. |
| 3D Analysis | METRO [37] (arXiv'20) | MPJPE / PA-MPJPE / MPVE | 3DPW [178] | 77.1 / 47.9 / 88.2 | a) Does not depend on parametric mesh models, so it is easily extendable to different objects; b) achieves SOTA results using Transformers. | Dependent on a hand-crafted network design. |

TABLE 2: A summary of advantages and limitations of different Transformer-based methods on different tasks. (CT: Cross-Transformers, AP: Average Precision, mAP: mean AP, IoU: Intersection over Union, FID: Fréchet Inception Distance, MPJPE: Mean Per Joint Position Error, MPVE: Mean Per Vertex Error).
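Two of the metrics reported in Table 2 can be written in a few lines; the arrays below are random stand-ins for real predictions, and the helper names are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance between joints."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

rng = np.random.default_rng(0)
img, ref = rng.random((64, 64, 3)), rng.random((64, 64, 3))
joints_pred, joints_gt = rng.normal(size=(14, 3)), rng.normal(size=(14, 3))
print(f"PSNR: {psnr(img, ref):.2f} dB, MPJPE: {mpjpe(joints_pred, joints_gt):.2f}")
```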
literature survey on efficient Transformers in NLP.

Similar to the NLP domain, computer vision models also suffer from the high computational cost of Transformer models. For example, image generators that are based on sequence-based Transformers (e.g., iGPT) have a high compute cost, limiting their applicability to high-resolution inputs. In the future, it is interesting to explore how such models can be extended to high-dimensional cases, e.g., using a multi-scale transformer design with somewhat local context modeling. By inducing inductive biases based on our understanding of the visual learning tasks (e.g., spatial relationships in the local neighbourhood), the high computational cost can be reduced. Similarly, using sparse attention maps modeled with low-rank factorization of the matrices can also help towards reducing the computational cost [160].

4.2 Large Data Requirements

Since Transformer architectures do not inherently encode inductive biases (prior knowledge) to deal with visual data, they typically require large amounts of training data during pre-training to figure out the underlying modality-specific rules. For example, a CNN has inbuilt translation invariance, weight sharing, and partial scale invariance due to pooling operations or multi-scale processing blocks. However, a Transformer network needs to figure out these image-specific properties on its own by looking at a large number of examples. Similarly, relationships between video frames need to be discovered automatically by the self-attention mechanism by looking at a large database of video sequences. This results in longer training times, a significant increase in computational requirements, and large datasets for processing. For example, the ViT [11] model requires hundreds of millions of image examples to obtain decent performance on the ImageNet benchmark dataset. The question of learning a Transformer in a data-efficient manner is an open research problem, and recent works report encouraging steps towards its resolution. For example, DeiT [12] uses a distillation approach to achieve data efficiency, while T2T (Tokens-to-Token) ViT [189] models local structure by combining spatially close tokens together, thus leading to competitive performance when trained only on ImageNet from scratch (without pre-training).

a parallel branch in the Transformers to improve person re-identification. A recent work [189] rearranges spatially close tokens to better model relationships in spatially proximal locations. One may argue that architectures like Transformer models should remain generic so as to be directly applicable across domains; still, we notice that the high computational and time cost of pre-training such models demands novel design strategies to make their training more affordable on vision problems.

4.4 Interpretability of Transformers

Given the strong performance of Transformer architectures, it is interesting and critical to interpret their decisions, e.g., by visualizing the relevant regions in an image for a given classification decision. The main challenge is that the attention originating in each layer gets inter-mixed in the subsequent layers in a complex manner, making it difficult to visualize the relative contribution of input tokens towards final predictions. This is an open problem; however, some recent works [191]–[193] target enhanced interpretability of Transformers and report encouraging results. Attention rollout and attention flow methods were proposed in [192] to estimate accurate attentions. However, this method functions in an ad-hoc manner and makes simplistic assumptions, e.g., that input tokens are linearly combined using attention weights across the layers. Chefer et al. [193] note that the attention scores obtained directly via the self-attention process (encoding relationships between tokens) or the reassignments in [192] do not provide an optimal solution. As an alternative, they propose to assign and propagate relevancy scores in the Transformer network such that the sum of relevancy is constant throughout the network. Their design can handle both the positive and negative attributions experienced in the self-attention layer. The proposed framework has the added advantage of being able to provide class-specific visualizations. Despite these seminal works, visualizing and interpreting Transformers is an unsolved problem, and methods are needed to obtain spatially precise activation-specific visualizations. Further progress in this direction can help in better understanding the Transformer models and in diagnosing any erroneous behaviors and biases in the decision process. It can also help us design novel architectures that avoid such biases.
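A compact sketch of the attention-rollout idea referenced above [192] is shown below: per-layer attention maps are averaged over heads, mixed with the identity to account for residual connections, and multiplied across layers. The layer and token counts are arbitrary, and this is only the simplified linear-combination view that the text mentions, not the full method.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer attention tensors, each of shape (heads, n, n)."""
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                 # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)               # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)       # re-normalize rows
        rollout = a @ rollout                       # accumulate across layers
    return rollout                                  # (n, n) token-to-token relevance

rng = np.random.default_rng(0)
layers, heads, n = 12, 8, 197                       # e.g., ViT-style: 196 patches + [CLS]
attn = [rng.dirichlet(np.ones(n), size=(heads, n)) for _ in range(layers)]
relevance = attention_rollout(attn)
print(relevance[0].argsort()[-5:])                  # tokens most relevant to token 0 ([CLS])
```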
In the NLP domain, neural architecture search strategies have been used to design Hardware-Aware Transformers (HAT) [196]–[198]. Specifically, a SuperTransformer model is first trained for performance approximation, i.e., it can estimate a model's performance without fully training it. This model comprises the largest possible model in the search space while sharing weights between common parts. Eventually, an evolutionary search is performed under hardware latency constraints to find a suitable SubTransformer model for a target hardware platform (e.g., an IoT device, GPU, or CPU). However, such hardware-efficient designs are currently lacking for vision Transformers, and they would be needed to enable seamless deployment on resource-constrained devices. Further, the search cost of the evolutionary algorithms remains significant, with the associated impact of CO2 emissions on the environment.
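The toy sketch below mimics the latency-constrained evolutionary search described above. The SubTransformer search space, the latency_of and accuracy_of stand-in functions, and all numeric values are hypothetical and do not correspond to the predictors used in [196]–[198].

```python
import random

# Hypothetical SubTransformer search space (illustrative only).
SEARCH_SPACE = {
    "embed_dim":  [192, 256, 384],
    "num_layers": [4, 6, 8, 12],
    "num_heads":  [3, 6, 12],
    "mlp_ratio":  [2, 3, 4],
}

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def mutate(cfg, prob=0.3):
    return {k: (random.choice(SEARCH_SPACE[k]) if random.random() < prob else v)
            for k, v in cfg.items()}

# Stand-ins for the real predictors: these closed-form formulas only make the
# example runnable and carry no meaning beyond that.
def latency_of(cfg):
    return 0.02 * cfg["embed_dim"] * cfg["num_layers"] * cfg["mlp_ratio"] / cfg["num_heads"]

def accuracy_of(cfg):
    return cfg["embed_dim"] * cfg["num_layers"] * cfg["mlp_ratio"] / 100.0

def evolutionary_search(latency_budget_ms, population=20, generations=10):
    pop = [sample_config() for _ in range(population)]
    best = None
    for _ in range(generations):
        feasible = sorted((c for c in pop if latency_of(c) <= latency_budget_ms),
                          key=accuracy_of, reverse=True)
        if feasible and (best is None or accuracy_of(feasible[0]) > accuracy_of(best)):
            best = feasible[0]
        parents = feasible[:max(2, len(feasible) // 4)] or [sample_config(), sample_config()]
        # next generation: mutated copies of the best feasible configurations
        pop = [mutate(random.choice(parents)) for _ in range(population)]
    return best

print(evolutionary_search(latency_budget_ms=40.0))
```

In a HAT-style setup, accuracy_of would be replaced by evaluating the sampled SubTransformer with weights inherited from the shared SuperTransformer, and latency_of by a predictor fitted to measurements on the target device.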
4.6 Leveraging Rich Multi-modal Annotations

In cases where training data is available with dense labels in multiple domains (e.g., language and vision [17], [199]), an interesting question to consider is whether a pre-training process that leverages rich labels on a small dataset can speed up its learning. This question has been explored in VirTex [200], a model that seeks to learn strong visual representations using dense textual annotations (e.g., image captions). Since the captions encode information about the objects present in an image, their relationships, actions, and attributes, they can provide better supervision for learning more generalizable and transferable representations. In particular, the authors showed that a model consisting of a visual backbone followed by a bidirectional language model (forward and backward Transformers) [3] trained to predict captions can learn strong features on the MS-COCO dataset in an unsupervised manner. When these features are transferred to ImageNet classification, they perform better than, or on par with, the unsupervised/supervised features learned directly on the ImageNet dataset. Since Transformer models can process multiple modalities in a unified architecture, it will be interesting to explore how densely annotated datasets can reduce the data requirement of Transformers, and whether dense annotations allow transferring well to novel unseen conditions in a particular modality at inference.
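A minimal sketch of this style of caption-supervised pretraining is given below. It is a hypothetical illustration rather than the VirTex [200] architecture: the tiny CNN backbone, vocabulary size, and hyper-parameters are assumptions, and only the forward (left-to-right) captioning direction is shown, whereas the setup described above uses both forward and backward Transformers.

```python
import torch
import torch.nn as nn

class CaptionPretrainer(nn.Module):
    """Toy caption-supervised pretraining model: CNN backbone + Transformer decoder.

    The decoder attends to the grid of visual features while predicting the next
    caption token; the visual backbone is the part that would later be transferred.
    """
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a ResNet-style backbone
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images).flatten(2).transpose(1, 2)   # (B, 49, d_model)
        tgt = self.embed(captions)                                 # (B, T, d_model)
        T = captions.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, feats, tgt_mask=causal)            # forward captioning
        return self.head(out)                                      # (B, T, vocab_size)

# usage: next-token prediction on (image, caption) pairs with teacher forcing
model = CaptionPretrainer()
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 1000, (4, 12))
logits = model(images, captions[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), captions[:, 1:].reshape(-1))
loss.backward()   # the backbone receives gradient from the caption supervision
```

After such pretraining, only the visual backbone would be retained and transferred, e.g., to ImageNet classification as discussed above.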
5 CONCLUSION

Attention has played a key role in delivering efficient and accurate computer vision systems, while simultaneously providing insights into the function of deep neural networks. This survey reviews self-attention approaches and specifically focuses on the Transformer and bidirectional encoding architectures that are built on the principle of self-attention. We first cover fundamental concepts pertaining to self-attention architectures and later provide an in-depth analysis of competing approaches for a broad range of computer vision applications. Specifically, we include state-of-the-art self-attention models for image recognition, object detection, semantic and instance segmentation, video analysis and classification, visual question answering, visual commonsense reasoning, image captioning, vision-language navigation, clustering, few-shot learning, and 3D data analysis. We systematically highlight the key strengths and limitations of the existing methods and particularly elaborate on the important future research directions. With its specific focus on computer vision tasks, this survey provides a unique view of the recent progress in self-attention and Transformer-based methods. We hope this effort will drive further interest in the vision community to leverage the potential of Transformer models and improve on their current limitations, e.g., reducing their carbon footprint.

ACKNOWLEDGMENTS

The authors would like to thank Tim Prangemeier (TU Darmstadt), Luowei Zhou (Microsoft Research), Jason Corso (University of Michigan), Pichao Wang (Alibaba Group), Yuqing Wang (Meituan), and Manoj Kumar (Google Brain) for their helpful feedback on the survey. We would also like to thank Mohamed Afham for his help with a figure.

REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[2] M. Ott, S. Edunov, D. Grangier, and M. Auli, "Scaling neural machine translation," in WMT, 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," tech. rep., OpenAI, 2018.
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," tech. rep., OpenAI, 2019.
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[8] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," arXiv preprint arXiv:1910.10683, 2019.
[9] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, "GShard: Scaling giant models with conditional computation and automatic sharding," arXiv preprint arXiv:2006.16668, 2020.
[10] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," arXiv preprint arXiv:2101.03961, 2021.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," arXiv preprint arXiv:2012.12877, 2020.
[13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," arXiv preprint arXiv:2005.12872, 2020.
[14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
[15] L. Ye, M. Rochan, Z. Liu, and Y. Wang, "Cross-modal self-attention network for referring image segmentation," in CVPR, 2019.
[16] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, "Learning texture transformer network for image super-resolution," in CVPR, 2020.
[17] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in ICCV, 2019.
[18] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, "Video action transformer network," in CVPR, 2019.
[19] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, "Pre-trained image processing transformer," arXiv preprint arXiv:2012.00364, 2020.
[20] A. Ramesh, M. Pavlov, G. Goh, and S. Gray, "DALL·E: Creating images from text," tech. rep., OpenAI, 2021.
[21] H. Tan and M. Bansal, "LXMERT: Learning cross-modality encoder representations from transformers," in EMNLP-IJCNLP, 2019.
[22] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, "VL-BERT: Pre-training of generic visual-linguistic representations," arXiv preprint arXiv:1908.08530, 2019.
[23] X. Wang, C. Yeshwanth, and M. Nießner, "SceneFormer: Indoor scene generation with transformers," arXiv preprint arXiv:2012.09793, 2020.
[24] M. Kumar, D. Weissenborn, and N. Kalchbrenner, "Colorization transformer," in ICLR, 2021.
[25] C. Doersch, A. Gupta, and A. Zisserman, "CrossTransformers: Spatially-aware few-shot transfer," in NeurIPS, 2020.
[26] H.-J. Ye, H. Hu, D.-C. Zhan, and F. Sha, "Few-shot learning via embedding adaptation with set-to-set functions," in CVPR, 2020.
[27] Y. Bengio, I. Goodfellow, and A. Courville, Deep Learning. MIT Press, 2017.
[28] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, 2015.
[29] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
[30] D. Hu, "An introductory survey on attention mechanisms in NLP problems," in IntelliSys, 2019.
[31] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, "Efficient transformers: A survey," arXiv preprint arXiv:2009.06732, 2020.
[32] S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal, "An attentive survey of attention models," arXiv preprint arXiv:1904.02874, 2019.
[33] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, "Linformer: Self-attention with linear complexity," arXiv preprint arXiv:2006.04768, 2020.
[34] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," in ICML, 2019.
[35] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, "UNITER: Universal image-text representation learning," in ECCV, 2020.
[36] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., "Oscar: Object-semantics aligned pre-training for vision-language tasks," in ECCV, 2020.
[37] K. Lin, L. Wang, and Z. Liu, "End-to-end human pose and mesh reconstruction with transformers," arXiv preprint arXiv:2012.09760, 2020.
[38] S. Gidaris, P. Singh, and N. Komodakis, "Unsupervised representation learning by predicting image rotations," arXiv preprint arXiv:1803.07728, 2018.
[39] "Revisiting the unreasonable effectiveness of data." https://ai.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html. Accessed: 2020-12-31.
[40] L. Jing and Y. Tian, "Self-supervised visual feature learning with deep neural networks: A survey," TPAMI, 2020.
[41] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang, "Self-supervised learning: Generative or contrastive," arXiv preprint arXiv:2006.08218, 2020.
[42] "AAAI 2020 keynotes Turing award winners event." https://www.youtube.com/watch?v=UX8OubxsY8w. Accessed: 2020-12-31.
[43] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in ECCV, 2016.
[44] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," in CVPR, 2017.
[45] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in CVPR, 2016.
[46] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NeurIPS, 2014.
[47] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun, "MARTA GANs: Unsupervised representation learning for remote sensing image classification," GRSL, 2017.
[48] U. Ahsan, R. Madhok, and I. Essa, "Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition," in WACV, 2019.
[49] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in ECCV, 2016.
[50] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, "Learning image representations by completing damaged jigsaw puzzles," in WACV, 2018.
[51] L. Jing, X. Yang, J. Liu, and Y. Tian, "Self-supervised spatiotemporal feature learning via video rotation prediction," arXiv preprint arXiv:1811.11387, 2018.
[52] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, "Unsupervised representation learning by sorting sequences," in ICCV, 2017.
[53] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in ECCV, 2016.
[54] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, "Learning and using the arrow of time," in CVPR, 2018.
[55] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "VisualBERT: A simple and performant baseline for vision and language," arXiv preprint arXiv:1908.03557, 2019.
[56] B. Korbar, D. Tran, and L. Torresani, "Cooperative learning of audio and video models from self-supervised synchronization," in NeurIPS, 2018.
[57] R. Arandjelovic and A. Zisserman, "Look, listen and learn," in ICCV, 2017.
[58] N. Sayed, B. Brattoli, and B. Ommer, "Cross and learn: Cross-modal self-supervision," in GCPR, 2018.
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[60] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[61] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[62] H. Hu, Z. Zhang, Z. Xie, and S. Lin, "Local relation networks for image recognition," in ICCV, 2019.
[63] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-cross attention for semantic segmentation," in ICCV, 2019.
[64] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, "Stand-alone self-attention in vision models," in NeurIPS, 2019.
[65] A. Buades, B. Coll, and J.-M. Morel, "A non-local algorithm for image denoising," in CVPR, 2005.
[66] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in CVPR, 2018.
[67] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[68] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, 2016.
[69] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," in CVPR, 2017.
[70] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[71] X. Liang, K. Gong, X. Shen, and L. Lin, "Look into person: Joint body parsing & pose estimation network and a new benchmark," TPAMI, 2018.
[72] G. J. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," Pattern Recognition Letters, 2009.
[73] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in ICCV, 2019.
[74] P. Shaw, J. Uszkoreit, and A. Vaswani, "Self-attention with relative position representations," arXiv preprint arXiv:1803.02155, 2018.
[75] H. Zhao, J. Jia, and V. Koltun, "Exploring self-attention for image recognition," in CVPR, 2020.
[76] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[77] M. M. Naseer, S. H. Khan, M. H. Khan, F. S. Khan, and F. Porikli, "Cross-domain transferability of adversarial perturbations," in NeurIPS, 2019.
[78] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[79] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, "Designing network design spaces," in CVPR, 2020.
[80] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[81] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," 2021.
[82] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2020.
[83] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," TPAMI, 2016.
[84] R. Girshick, "Fast R-CNN," in ICCV, 2015.
[85] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in ICCV, 2017.
[86] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016.
[87] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in ECCV, 2016.
[88] T. Prangemeier, C. Reich, and H. Koeppl, "Attention-based transformers for instance segmentation of cells in microstructures," in IEEE BIBM, 2020.
[89] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in CVPR, 2017.
[90] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable convolutional networks," in ICCV, 2017.
[91] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation," arXiv preprint arXiv:2003.07853, 2020.
[92] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in CVPR, 2019.
[93] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, "The Mapillary Vistas dataset for semantic understanding of street scenes," in ICCV, 2017.
[94] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, "Modeling context in referring expressions," in ECCV, 2016.
[95] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions," in CVPR, 2016.
[96] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, "ReferItGame: Referring to objects in photographs of natural scenes," in EMNLP, 2014.
[97] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran, "Image transformer," in ICML, 2018.
[98] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, "Generative pretraining from pixels," in ICML, 2020.
[99] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," arXiv preprint arXiv:2012.09841, 2020.
[100] Y. Jiang, S. Chang, and Z. Wang, "TransGAN: Two transformers can make one strong GAN," 2021.
[101] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., "Conditional image generation with PixelCNN decoders," in NeurIPS, 2016.
[102] A. Krizhevsky, "Learning multiple layers of features from tiny images," tech. rep., 2009.
[103] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in AISTATS, 2011.
[104] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709, 2020.
[105] P. Bachman, R. D. Hjelm, and W. Buchwalter, "Learning representations by maximizing mutual information across views," in NeurIPS, 2019.
[106] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, "Data-efficient image recognition with contrastive predictive coding," arXiv preprint arXiv:1905.09272, 2019.
[107] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in CVPR, 2020.
[108] Y. Tian, D. Krishnan, and P. Isola, "Contrastive multiview coding," arXiv preprint arXiv:1906.05849, 2019.
[109] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, "A guide to convolutional neural networks for computer vision," Synthesis Lectures on Computer Vision, 2018.
[110] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[111] C. Gao, Y. Chen, S. Liu, Z. Tan, and S. Yan, "AdversarialNAS: Adversarial neural architecture search for GANs," in CVPR, 2020.
[112] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in CVPR, 2020.
[113] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in ICML, 2016.
[114] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in ICCV, 2017.
[115] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," TPAMI, 2018.
[116] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in CVPR, 2018.
[117] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[118] A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with VQ-VAE-2," in NeurIPS, 2019.
[119] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in CVPRW, 2017.
[120] Y. Tai, J. Yang, and X. Liu, "Image super-resolution via deep recursive residual network," in CVPR, 2017.
[121] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S. Huang, "Image super-resolution via dual-state recurrent networks," in CVPR, 2018.
[122] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in ECCV, 2018.
[123] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image restoration," TPAMI, 2020.
[124] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, "ESRGAN: Enhanced super-resolution generative adversarial networks," in ECCVW, 2018.
[125] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, "SRFEAT: Single image super-resolution with feature discrimination," in ECCV, 2018.
[126] M. S. Sajjadi, B. Scholkopf, and M. Hirsch, "EnhanceNet: Single image super-resolution through automated texture synthesis," in ICCV, 2017.
[127] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," in CVPR, 2017.
[128] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in ECCV, 2016.
[129] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, "Second-order attention network for single image super-resolution," in CVPR, 2019.
[130] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang, X. Cao, and H. Shen, "Single image super-resolution via a holistic attention network," in ECCV, 2020.
[131] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans, "Axial attention in multidimensional transformers," arXiv preprint arXiv:1912.12180, 2019.
[132] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, "Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training," in AAAI, 2020.
[133] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in NeurIPS, 2019.
[134] S. Lee, Y. Yu, G. Kim, T. Breuel, J. Kautz, and Y. Song, "Parameter efficient multimodal transformers for video representation learning," arXiv preprint arXiv:2012.04124, 2020.
[135] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, "VQA: Visual question answering," in ICCV, 2015.
[136] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, "From recognition to cognition: Visual commonsense reasoning," in CVPR, 2019.
[137] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, "Stacked cross attention for image-text matching," in ECCV, 2018.
[138] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[139] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi, "A corpus for reasoning about natural language grounded in photographs," arXiv preprint arXiv:1811.00491, 2018.
[140] P. Sharma, N. Ding, S. Goodman, and R. Soricut, "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning," in ACL, 2018.
[141] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, "Unified vision-language pre-training for image captioning and VQA," in AAAI, 2020.
[142] C. Sun, F. Baradel, K. Murphy, and C. Schmid, "Learning video representations using contrastive bidirectional transformer," arXiv preprint arXiv:1906.05743, 2019.
[143] C. Alberti, J. Ling, M. Collins, and D. Reitter, "Fusion of detected objects in text for visual question answering," arXiv preprint arXiv:1908.05054, 2019.
[144] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," IJCV, 2017.
[145] V. Ordonez, G. Kulkarni, and T. L. Berg, "Im2Text: Describing images using 1 million captioned photographs," in NeurIPS, 2011.
[146] H. Tan and M. Bansal, "Vokenization: Improving language understanding with contextualized, visual-grounded supervision," arXiv preprint arXiv:2010.06775, 2020.
[147] W. Hao, C. Li, X. Li, L. Carin, and J. Gao, "Towards learning a generic agent for vision-and-language navigation via pre-training," in CVPR, 2020.
[148] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra, "Improving vision-and-language navigation with image-text pairs from the web," arXiv preprint arXiv:2004.14973, 2020.
[149] K. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, "Topological planning with transformers for vision-and-language navigation," arXiv preprint arXiv:2012.05292, 2020.
[150] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, "A short note on the Kinetics-700 human action dataset," arXiv preprint arXiv:1907.06987, 2019.
[151] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, "COOT: Cooperative hierarchical transformer for video-text representation learning," arXiv preprint arXiv:2011.00597, 2020.
[152] H. Seong, J. Hyun, and E. Kim, "Video multitask transformer network," in ICCVW, 2019.
[153] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, "End-to-end video instance segmentation with transformers," arXiv preprint arXiv:2011.14503, 2020.
[154] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in CVPR, 2018.
[155] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, "Dense-captioning events in videos," in ICCV, 2017.
[156] L. Zhou, C. Xu, and J. Corso, "Towards automatic learning of procedures from web instructional videos," in AAAI, 2018.
[157] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[158] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP, 2017.
[159] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," in ECCV, 2016.
[160] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, "Video transformer network," arXiv preprint arXiv:2102.00719, 2021.
[161] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv preprint arXiv:2004.05150, 2020.
[162] L. Yang, Y. Fan, and N. Xu, "Video instance segmentation," in ICCV, 2019.
[163] G. Bertasius and L. Torresani, "Classifying, segmenting, and tracking object instances in video with mask propagation," in CVPR, 2020.
[164] C. Plizzari, M. Cannici, and M. Matteucci, "Spatial temporal transformer network for skeleton-based action recognition," arXiv preprint arXiv:2008.07404, 2020.
[165] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in CVPR, 2016.
[166] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," TPAMI, 2019.
[167] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al., "Meta-Dataset: A dataset of datasets for learning to learn from few examples," arXiv preprint arXiv:1903.03096, 2019.
[168] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[169] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, "Deep sets," in NeurIPS, 2017.
[170] H. Edwards and A. Storkey, "Towards a neural statistician," arXiv preprint arXiv:1606.02185, 2016.
[171] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, "Set transformer: A framework for attention-based permutation-invariant neural networks," in ICML, 2019.
[172] J. Lee, Y. Lee, and Y. W. Teh, "Deep amortized clustering," arXiv preprint arXiv:1909.13433, 2019.
[173] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, "Point transformer," arXiv preprint arXiv:2012.09164, 2020.
[174] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, "PCT: Point cloud transformer," arXiv preprint arXiv:2012.09688, 2020.
[175] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in CVPR, 2015.
[176] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, "ShapeNet: An information-rich 3D model repository," arXiv preprint arXiv:1512.03012, 2015.
[177] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," TPAMI, 2013.
[178] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, "Recovering accurate 3D human pose in the wild using IMUs and a moving camera," in ECCV, 2018.