0% found this document useful (0 votes)
98 views28 pages

Transformers For Vision

This document provides an overview of recent applications of transformer models in computer vision. It discusses how transformer models can process visual data without strong assumptions about spatial or temporal structure, enabling their use in tasks like image recognition, object detection, segmentation, video understanding and more. The survey covers popular transformer architectures and their advantages for vision like self-attention, pre-training on large datasets, and scalability. It aims to serve as a comprehensive reference on the growing field of transformers in computer vision.

Uploaded by

Ali Haider
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
98 views28 pages

Transformers For Vision

This document provides an overview of recent applications of transformer models in computer vision. It discusses how transformer models can process visual data without strong assumptions about spatial or temporal structure, enabling their use in tasks like image recognition, object detection, segmentation, video understanding and more. The survey covers popular transformer architectures and their advantages for vision like self-attention, pre-training on large datasets, and scalability. It aims to serve as a comprehensive reference on the growing field of transformers in computer vision.

Uploaded by

Ali Haider
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

1

Transformers in Vision: A Survey


Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir,
Fahad Shahbaz Khan, and Mubarak Shah

Abstract—Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their
application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input
sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory
(LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited
as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos,
text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge
datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to
arXiv:2101.01169v2 [cs.CV] 22 Feb 2021

provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to
fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature
encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification,
object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual
reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image
super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We
compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental
value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further
interest in the community to solve current challenges towards the application of transformer models in computer vision.

Index Terms—Self-attention, transformers, bidirectional encoders, deep neural networks, convolutional networks, self-supervision.

1 I NTRODUCTION

T RANSFORMER models [1] have recently demonstrated


exemplary performance on a broad range of language
tasks e.g., text classification, machine translation [2] and
However, visual data follows a typical structure (e.g., spatial
and temporal coherence), thus demanding novel network
designs and training schemes. As a result, Transformer mod-
question answering. Among these models, the most popular els and their variants have been successfully used for image
ones include BERT (Bidirectional Encoder Representations recognition [11], [12], object detection [13], [14], segmenta-
from Transformers) [3], GPT (Generative Pre-trained Trans- tion [15], image super-resolution [16], video understanding
former) v1-3 [4]–[6], RoBERTa (Robustly Optimized BERT [17], [18], image generation [19], text-image synthesis [20]
Pre-training) [7] and T5 (Text-to-Text Transfer Transformer) and visual question answering [21], [22], among several
[8]. The profound impact of Transformer models has become other use cases [23]–[26]. This survey aims to cover such
more clear with their scalability to very large capacity mod- recent and exciting efforts in the computer vision domain,
els [9], [10]. For example, the BERT-large [3] model with providing a comprehensive reference to interested readers.
340 million parameters was significantly outperformed by Transformer architectures are based on a self-attention
the GPT-3 [6] model with 175 billion parameters while the mechanism that learns the relationships between elements
latest mixture-of-experts Switch transformer [10] scales up of a sequence. As opposed to recurrent networks that pro-
to a whopping 1.6 trillion parameters! cess sequence elements recursively and can only attend to
The breakthroughs from Transformer networks in Nat- short-term context, transformer architectures can attend to
ural Language Processing (NLP) domain has sparked great complete sequences thereby learning long-range relation-
interest in the computer vision community to adapt these ships and can be easily parallelized. An important feature
models for vision and multi-modal learning tasks (Fig. 1). of these models is their scalability to very high-complexity
models and large-scale datasets. Since Transformers as-
sume minimal prior knowledge about the structure of the
• S. Khan, M. Naseer and F. S. Khan are with the MBZ University of
Artificial Intelligence, Abu Dhabi, UAE. problem as compared to their convolutional and recurrent
E-mail: [email protected] counterparts in deep learning [27]–[29], they are typically
• M. Hayat is with the Faculty of IT, Monash University, Clayton VIC pre-trained using pretext tasks on large-scale (unlabelled)
3800, Australia.
• S. W. Zamir is with the Inception Institute of Artificial Intelligence, Abu
datasets [1], [3]. Such a pre-training avoids costly man-
Dhabi, UAE. ual annotations, thereby encoding highly expressive and
• S. Khan and M. Naseer are also with the CECS, Australian National generalizable representations that model rich relationships
University, Canberra ACT 0200, Australia. between the entities present in a given dataset. The learned
• F. S. Khan is also with the Computer Vision Laboratory, Linköping
University, Sweden. representations are then fine-tuned on the downstream tasks
• M. Shah is with the Center for Research in Computer Vision, University in a supervised manner to obtain favorable results.
of Central Florida, Orlando, FL 32816, United States. This paper provides a holistic overview of the trans-
Manuscript received March, 2021. former models developed for computer vision applications.
2

Fig. 1: Statistics on the number of times keywords such as BERT, Self-Attention, and Transformers appear in the titles of Peer-
reviewed and arXiv papers over the past few years (in Computer Vision and Machine Learning). The plots show consistent growth
in recent literature. This survey covers recent progress on Transformers in the computer vision domain.

We develop a taxonomy of the network design space and


highlight the major strengths and shortcomings of the ex-
isting methods. Other literature reviews mainly focus on
the NLP domain [30], [31] or cover generic attention-based
approaches [30], [32]. By focusing on the newly emerging
area of visual transformers, we comprehensively organize
the recent approaches according to the intrinsic features of
self-attention and the investigated task. We first provide an
introduction to the salient concepts underlying Transformer
Fig. 2: An example self-attention block used in the vision do-
networks and then elaborate on the specifics of recent vision
main [34]. Given the input sequence of convolutional features
transformers. Where ever possible, we draw parallels be- from an image, the triplet of (key, query, value) is calculated
tween the Transformers used in the NLP domain [1] and the followed by attention calculation and applying it to reweight
ones developed for vision problems to flash major novelties the values. Note that a single head is shown here and an output
and interesting domain-specific insights. Recent approaches projection (W) is finally applied to obtain output features with
show that convolution operations can be fully replaced the same dimension as the input. Figure adapted from [34].
with attention-based transformer modules and have also
been used jointly in a single design to encourage symbiosis
between the two complementary set of operations. This sur- all entities of a sequence for structured prediction tasks.
vey finally details open research questions with an outlook Basically, a self-attention layer updates each component
towards the possible future work. of a sequence by aggregating global information from the
complete input sequence.
2 F OUNDATIONS Lets denote a sequence of n entities (x1 , x2 , · · · xn ) by
X ∈ Rn×d , where d is the embedding dimension to repre-
There exist two key ideas that have contributed towards the sent each entity. The goal of self-attention is to capture the
development of transformer models. (a) The first one is self- interaction amongst all n entities by encoding each entity
attention, which allows capturing ‘long-term’ information in terms of the global contextual information. This is done
and dependencies between sequence elements as compared by defining three learnable weight matrices to transform
to conventional recurrent models that find it challenging Queries (WQ ∈ Rn×dq ), Keys (WK ∈ Rn×dk ) and Values
to encode such relationships. (b) The second key idea is (WV ∈ Rn×dv ). The input sequence X is first projected onto
that of pre-training on a large (un)labelled corpus in a these weight matrices to get Q = XWQ , K = XWK and
(un)supervised manner, and subsequently fine-tuning to the V = XWV . The output Z ∈ Rn×dv of the self attention
target task with a small labeled dataset [3], [7], [33]. Below, layer is then given by,
we provide a brief tutorial on these two ideas (Sec. 2.2 !
and 2.1), along with a summary of seminal Transformer QKT
networks (Sec. 2.3 and 2.4) where these ideas have been Z = softmax p V.
dq
applied. This background will help us better understand
the forthcoming Transformer based models used in the For a given entity in the sequence, the self-attention basi-
computer vision domain (Sec. 3). cally computes the dot-product of the query with all keys,
which is then normalized using softmax operator to get the
2.1 Self-Attention attention scores. Each entity then becomes the weighted sum
Given a sequence of items, self-attention mechanism esti- of all entities in the sequence, where weights are given by
mates the relevance of one item to other items (e.g., which the attention scores (Fig. 2 and Fig. 3, top row-left block).
words are likely to come together in a sentence). The self- Masked Self-Attention: The standard self-attention
attention mechanism is an integral component of Trans- layer attends to all entities. For the Transformer model [1]
formers, which explicitly models the interactions between which is trained to predict the next entity of the sequence,
3

Fig. 3: Architecture of the Transformer Model [1]. The model was first developed for the language translation task where an input
sequence in one language is required to be converted to the output sequence in another language. The Transformer encoder
(shown in the middle row) operates on the input language sequence and converts it to an embedding before passing it on to the
encoder blocks. The Transformer decoder (shown in the bottom row) operates on the previously generated outputs in the translated
language and the encoded input sequence from the middle branch to output the next word in the output sequence. The sequence
of previous outputs (used as input to the decoder) is obtained by shifting the output sentence to the right by one position and
appending start-of-sentence token at the beginning. This shifting avoids the model to learn to simply copy the decoder input to
the output. The ground-truth used to train the model is simply the output language sequence (without any right shift) appended
with an end-of-sentence token. The blocks consisting of multi-head attention (shown in the top-most row) and feed-forward layers
are repeated N × in both the encoder and decoder as indicated on the top right corner.

the self-attention blocks used in the decoder are masked to 2.2 (Un)Supervised Pre-training
prevent attending to the subsequent future entities. This is
simply done by an element-wise multiplication operation Self-attention based Transformer models generally operate
with a mask M ∈ Rn×n , where M is an upper-triangular in a two-stage training mechanism. First, pre-training is
matrix. The masked self-attention is defined by, performed on a large-scale dataset (and sometimes a com-
bination of several available datasets [22], [35]) in either a
supervised [11] or an unsupervised manner [3], [36], [37].
Later, the pre-trained weights are adapted to the down-
!
QKT
softmax p ◦M , stream tasks using small-mid scale datasets. Examples of
dq
downstream tasks include image classification [38], object
detection [13], zero-shot learning [20], question-answering
[10] and action recognition [18]. The effectiveness of pre-
where ◦ denotes Hadamard product. Basically, while pre-
training for large-scale Transformers has been advocated in
dicting an entity in the sequence, the attention scores of the
both the language and vision domains. For example, Vision
future entities are set to zero in masked self-attention.
Transformer model (ViT-L) [11] experiences an absolute 13%
Multi-Head Attention: In order to encapsulate multiple drop in accuracy on ImageNet test set when trained only on
complex relationships amongst different elements in the ImageNet train set as compared to the case when pretrained
sequence, the multi-head attention comprises multiple self- on JFT dataset [39] with 300 million images.
attention blocks (h = 8 in the original Transformer model Since acquiring manual labels at a massive scale is cum-
[1]). Each block has its own set of learnable weight ma- bersome, self-supervised learning has been very effectively
trices {WQi , WKi , WVi }, where i = 0 · · · (h−1). For an used in the pre-training stage. The self-supervision based
input X, the output of the h self-attention blocks in multi- pre-training stage training has played a crucial role in un-
head attention is then concatenated into a single matrix leashing the scalability and generalization of Transformer
[Z0 , Z1 , · · · Zh−1 ] ∈ Rn×h·dv and projected onto a weight networks, enabling training even above a trillion parame-
matrix W ∈ Rh·dv ×d (Fig. 3, top row). ter networks (e.g., the latest Switch Transformer [10] from
The main difference of self-attention with convolution Google). An extensive survey on SSL can be found in [40],
operation is that the weights are dynamically calculated [41]. As nicely summarized by Y. LeCun [42], the basic
instead of static weights (that stay the same for any input) as idea of SSL is to fill in the blanks, i.e., try to predict the
in the case of convolution. Further, self-attention is invariant occluded data in images, future or past frames in temporal
to permutations and changes in the number of input points. video sequences or predict a pretext task e.g., the amount
As a result, it can easily operate on irregular inputs as op- of rotation applied to inputs, the permutation applied to
posed to standard convolution that requires grid structure. image patches or the color of a gray-scale image. Another
4

effective way to impose self-supervised constraints is via a sequence of words (sentence) in one language. Positional
contrastive learning. In this case, nuisance transformations encodings are added to the input sequence to capture the
are used to create two types of modified versions of the same relative position of each word in the sequence. Positional
image i.e., without changing the underlying class semantics encodings have the same dimensions as the input d = 512,
(e.g., image stylizing, cropping) and with semantic changes and can be learned or pre-defined e.g., by sine or cosine
(e.g., replacing an object with another in the same scene, or functions. Being an auto-regressive model, the decoder of
changing the class with minor adversarial changes to the the Transformer [1] uses previous predictions to output the
image). Subsequently, the model is trained to be invariant to next word in the sequence. The decoder, therefore, takes
the nuisance transformations and emphasize on modeling inputs from the encoder as well as the previous outputs
minor changes that can alter semantic labels. to predict the next word of the sentence in the translated
Self-supervised learning provides a promising learning language. To facilitate residual connections the output di-
paradigm since it enables learning from a vast amount of mensions of all layers are kept the same i.e., d = 512.
readily available non-annotated data. In the SSL based pre- The dimensions of query, key and value weight matrices
training stage, a model is trained to learn a meaningful in multi-head attention are set to dq = 64, dk = 64, dv = 64.
representation of the underlying data by solving a pretext
task. The pseudo-labels for the pretext task are automati- 2.4 Bidirectional Representations
cally generated (without requiring any expensive manual The training strategy of the original Transformer model [1]
annotations) based on data attributes and task definition. could only attend to the context on the left of a given word
Therefore, the pretext task definition is a critical choice in in the sentence. This is limiting, since for most language
SSL. We can broadly categorize existing SSL methods based tasks, contextual information from both left and right sides
upon their pretext tasks into (a) generative approaches which is important. Bidirectional Encoder Representations from
synthesize images or videos (given conditional inputs), (b) Transformers (BERT) [3] proposed to jointly encode the right
context-based methods which exploit the relationships be- and left context of a word in a sentence, thus improving
tween image patches or video frames, and (c) cross-modal the learned feature representations for textual data in an
methods which leverage from multiple data modalities. unsupervised manner. To enable bidirectional training, [3]
Examples of generative approaches include conditional gen- basically introduced two pretext tasks: Masked Language
eration tasks such as masked image modeling [35] and Model and Next Sentence Prediction. The model pre-trained
image colorization [43], image super-resolution [44], image on these pretext tasks in an unsupervised manner was then
in-painting [45], and GANs based methods [46], [47]. The fine-tuned for the downstream task. For this purpose, task-
context-based pretext methods solve problems such as a specific additional output module is appended to the pre-
jigsaw puzzle on image patches [48]–[50], masked object trained model, and the full model is fine-tuned end-to-end.
classification [22], predict geometric transformation such as The network architecture of the base BERT [3] model
rotation [38], [51], or verify temporal sequence of video is based upon the original Transformer model proposed in
frames [52]–[54]. Cross-modal pretext methods verify the [1] and is similar to the GPT [4]. The main architectural
correspondence of two input modalities e.g., text & image difference compared to [1] is that the BERT model only
[55], audio & video [56], [57] or RGB & flow [58]. uses the Transformer encoder (similar to the middle row,
Fig. 3) while the GPT [4] only uses the Transformer decoder
2.3 Transformer Model (similar to the bottom row, Fig. 3). The core contribution
of BERT [3] is the pretext task definition, which enables
The architecture of the Transformer model proposed in [1]
bidirectional feature encoding in an unsupervised manner.
is shown in Fig. 3. It has an encoder-decoder structure. The
To this end, BERT [3] proposed two strategies: (1) Masked
encoder (middle row) consists of six identical blocks (i.e.,
Language Model (MLM) - A fixed percentage (15%) of
N =6 in Fig. 3), with each block having two sub-layers: a
words in a sentence are randomly masked and the model is
multi-head self-attention network, and a simple position-
trained to predict these masked words using cross-entropy
wise fully connected feed-forward network. Residual con-
loss. In predicting the masked words, the model learns
nections [59] alongside layer normalization [60] are em-
to incorporate the bidirectional context. (2) Next Sentence
ployed after each block as in Fig. 3. Note that, different from
Prediction (NSP) - Given a pair of sentences, the model
regular convolutional networks where feature aggregation
predicts a binary label i.e., whether the pair is valid from
and feature transformation are simultaneously performed
the original document or not. The training data for this can
(e.g., with a convolution layer followed by a non-linearity),
easily be generated from any monolingual text corpus. A
these two steps are decoupled in the Transformer model
pair of sentences A and B is formed, such that B is the actual
i.e., self-attention layer only performs aggregation while the
sentence (next to A) 50% of the time, and B is a random
feed-forward layer performs transformation. Similar to the
sentence for other 50% of the time. NSP enables the model to
encoder, the decoder (bottom row) in the Transformer model
capture sentence-to-sentence relationships which are crucial
comprises six identical blocks. Each decoder block has three
in many language modeling tasks such as Question Answer-
sub-layers, first two (multi-head self-attention, and feed-
ing and Natural Language Inference.
forward) are similar to the encoder, while the third sub-
layer performs multi-head attention on the outputs of the
corresponding encoder block, as shown in Fig. 3. 3 T RANSFORMERS & S ELF -ATTENTION IN V ISION
The original Transformer model in [1] was trained for We provide an overview of main themes followed in Trans-
the Machine Translation task. The input to the encoder is formers designed for vision applications in Fig. 4. Exist-
5

Recently, global self-attention has been successfully ap-


Local Relation
Efficient Networks
Stand-Alone
plied by using NLP Transformer encoder directly on image
Transformer Hu et al. 2019
Attention
Attention Ramachandran et al. 2019
patches [11], removing the need for handcrafted network
Color
design. Transformer is data-hungry in nature e.g., a large-
Criss-Cross Attention-Augmented
Attention
Transformer
2020 ViT Convolution scale dataset like ImageNet is not enough to train Vision
Bello et al. 2019
Haung et al. 2019 Dosovitskiv et al.
2020 Cross Transformer from scratch so [12] proposes to distill knowl-
Global-Contex D-DETR Transformer
Networks Zhu et al. Doersch et al.
2020
Local edge from a CNN teacher to a student Vision Transformer
2020
Coe et al. 2020 Point
DETR
Attention which allowed Transformer training on only ImageNet
Tranformer
Non-Local
Zho et al. DeiT Carion et al.
SA-TA
Networks
2020 Tourvon et al.
2020
2020
Tourvon et al.
without any additional data. Here, we describe key insights
2020
Buades et al. 2018
Global from different methods based on local/global self-attention
Attenion including Transformers specifically designed to solve the
Pair/Patch image recognition task.
Attention
Zhao et al. 2020
CNN
Vectorized 3.1.1 Non-Local Neural Networks
Attention This approach is inspired by non-local means operation
[65] which was mainly designed for image denoising. This
Fig. 4: A taxonomy of self-attention design space. Existing works operation modifies a given pixel in a patch with a weighted
explore local vs. global self-attention both with and without sum of other pixel values in an image. However, instead
transformer-based encoding/decoding stages. Sparse attention, of considering a fixed-sized window around a pixel, it
vector/scalar attention and low-rank matrix factorizations have
selects distant pixels to contribute to the filter response
been explored to study the trade-off between computational
efficiency and expressivity of the developed models. Interesting based on the similarity between the patches. By design, the
efforts have also been made to utilize knowledge from convo- non-local operation models long-range dependencies in the
lution based architectures that already encode priors suitable image space. Motivated by this, Wang et al. [66] proposed a
for visual data to improve data efficiency. differentiable non-local operation for deep neural networks
to capture long-range dependencies both in both space
and time in a feed-forward fashion. Given a feature map,
ing frameworks generally apply global or local attention, their proposed operator [66] computes the response at a
leverage CNN representations or utilize matrix factorization position as a weighted sum of the features at all positions
to enhance design efficiency and use vectorized attention in the feature map. This way, the non-local operation is
models. We explain these research directions below in the able to capture interactions between any two positions in
form of task-specific groups of approaches. the feature map regardless of the distance between them.
Videos classification is an example of a task where long-
range interactions between pixels exist both in space and
3.1 Transformers for Image Recognition
time. Equipped with the capability to model long-range
Convolution operation is the work-horse of the conven- interactions, [66] demonstrated the superiority of non-local
tional deep neural networks used in computer vision and deep neural networks for more accurate video classification
it brought breakthroughs such as solving complex image on Kinetics dataset [67].
recognition tasks on high dimensional datasets like Im-
ageNet [61]. However, convolution also comes with its 3.1.2 Criss-Cross Attention
shortcomings e.g., it operates on a fixed-sized window thus Although the self-attention mechanism allows us to model
unable to capture long-range dependencies such as arbitrary full-image contextual information, it is both memory and
relations between pixels in both spatial and time domains compute intensive procedure. As shown in Fig. 5(a), in
in a given video. Furthermore, convolution filter weights order to encode global context for a given pixel location,
remain fixed after training so the operation cannot adapt non-local block [66] computes a dense attention map (in
dynamically to any variation to the input. In this section, we green). The non-local block [66] has a high complexity
review methods that alleviate the above-mentioned issues in of O(N 2 ), where N denotes the number of input feature
conventional deep neural networks by using Self-attention maps. To reduce this computational burden, Huang et al.
operations and Transformer networks (a specific form of [63] propose the criss-cross attention module that for each
self-attention). There are two main design approaches to pixel position generates a sparse attention map only on the
self-attention. (a) Global self-attention which is not restricted criss-cross path, as illustrated in Fig. 5(b). Further, by ap-
by the size of input features e.g., [62] introduces a layer plying criss-cross attention recurrently, each pixel position
inspired from non-local means that applies attention to the can capture context from all other pixels. Compared to non-
whole feature map while [63] reduces the computational √uses 11× lesser GPU memory, and
local block, the criss-cross
complexity of non-local operation [62] by designing sparse has a complexity of O(2 N ). State-of-the-art results are re-
attention maps. (b) Local self-attention tries to model re- ported [63] for the semantic and instance segmentation tasks
lation within a given neighborhood e.g., [64] proposed to on several benchmark datasets including Cityscapes [68],
restrict the attention within a specific window around a ADE20K [69], COCO [70], LIP [71] and CamVid [72].
given pixel position to reduce the computational overhead.
Similarly, [62] further improved local self-attention such 3.1.3 Stand-Alone Self-Attention
that it can dynamically adapt its weight aggregation to As discussed above, convolutional layers possess transla-
variations in the input data/features. tion equivariance but can not scale with a large receptive
6

Fig. 6: Local Relation Layer. It adapts weights based on the


relationships between features in a local window. Given an
input feature tensor and a predefined geometric prior (which
define composability of pixel pairs based on their relative
position), self attention is computed to adapt weights according
to incoming features. The function Φ simply defines the squared
Fig. 5: The figure compares two different self-attention ap- difference between key and query values. Figure is from [62].
proaches. (a) Non-local block [66], and (b) Criss-cross attention
module [63]. Figure is from [63].
3.1.5 Attention Augmented Convolutional Networks
Bello et al. [73] explore the possibility of employing self-
field, therefore can not capture long-range interactions [64].
attention as an alternative to convolutional operators. They
On the other hand, global attention [1] which attend to
propose to use the relative position encoding [74] in two
all spatial locations of the input can be computationally
dimensions to develop a new self-attention mechanism that
intensive and is preferred on down-sampled small images,
maintains translation equivariance, which is a desirable
image patches [11] or augmenting the convolutional features
property for handling images. Although this self-attention
space [73]. Ramachandran et al. [64] proposed to replace
provides competitive results as a stand-alone computational
convolutional layers in deep neural networks with a local
primitive, the best performance is obtained when used in
self-attention layer which can be applied to small or large
combination with the convolutional operations. Extensive
inputs without increasing the computational cost. At a basic
experiments show that attention augmentation leads to sys-
level, the proposed self-attention layer [64] considers all
tematic performance improvements in image classification
pixel positions in a specific window size around a given
and object detection for a variety of existing architectures.
pixel, compute queries, keys and value vectors for these
pixels, and then aggregates the spatial information within
this window. The value vectors are aggregated after pro- 3.1.6 Vectorized Self-Attention
jecting the softmax score of queries and keys. This process Zhao et al. [75] note that a traditional convolution operator
is repeated for all given pixels and the response is concate- performs feature aggregation and transformation jointly
nated to produce the output pixel. ResNet models with local (by applying a filter and then passing it through a non-
self-attention layer can solve ImageNet and COCO object linearity). In contrast, they propose to perform feature ag-
detection with fewer parameters as compared to ResNet gregation separately with self-attention followed by trans-
models based on convolutional layers [64]. formation using an element-wise perceptron layer (Fig. 7).
To this end, they propose two alternate strategies for feature
aggregation: (a) pairwise self-attention and (b) patch-wise
3.1.4 Local Relation Networks
self-attention. The pairwise self-attention is permutation
Another shortcoming of the convolutional operator comes and cardinality invariant operation, while the patch-wise
from the fact that after training, it applies fixed weights self-attention does not have such invariance properties (sim-
regardless of any changes to the visual input. Hu et al. [62] ilar to the convolution operator).
proposed to adaptively compose pixels in a local window. Both pairwise and patch-wise self-attentions are imple-
They introduced a new differentiable layer (Fig. 6) that mented as a vector attention [75] that learns weights for both
adapts its weight aggregation based on the compositional the spatial and channel dimensions. This provides an alter-
relations (similarity) between pixels/features within a lo- nate approach for attention that is conventionally performed
cal window. Such adaptive weight aggregation introduces using scalar weights (by taking a dot-product). The pair-
geometric priors into the network which are important for wise self-attention is a set operator that computes a vector
the recognition tasks [62]. Convolution is considered to be attention keeping in view the relationships of a particular
a top-down operator as it remains fixed across positions feature with its neighbors in a given local neighborhood. In
while a non-local operation such as introduced in [65] is contrast, patch-wise self-attention is a generalization of the
a bottom-up method as it aggregates input features over the convolution operator (not a set operator) and looks at all the
full image. The local relation layer belongs to the category feature vectors in the local neighbourhood when deriving
of bottom-up methods but it is restricted to a fixed window the attention vectors. Authors show that with considerably
size e.g., 7x7 neighborhood. fewer parameters, self-attention networks (SAN) can beat
7

Fig. 8: An overview of Vision Transformer (on the left) and the


details of Transformer encoder (on the right). The architecture
Fig. 7: Vectorized self-attention block in SAN. The vector-based resembles Transformers used in the NLP domain and the image
self-attention can be implemented as a pairwise or a patch- patches are simply fed to the model after flattening. After
wise operation. The input and output of the block is a feature training, the feature obtained from the first token position is
tensor with C channels. The left branch calculates the attention used for classification. Image obtained from [11].
weights α = γ(δ(x)) while the right branch transforms features
using a linear mapping β . r1 and r2 denote the factors by
which both branches reduce channel dimension for efficient
processing. Figure is from [75]. of the input to the Transformer. The output representation
corresponding to the first position is then used as the global
image representation for the image classification task.
comparable baselines from ResNet family on the ImageNet
dataset. One key property of their approach is its robustness
against adversarial perturbations [76], [77] and generaliza-
tion to unseen transformations in the data. This behaviour is 3.1.8 Data-Efficient Image Transformers
due to the dynamic nature of attention that makes it difficult
for the adversary to calculate useful fooling directions. The Data-efficient image Transformer (DeiT) [12] is the first
result in large-scale image classification, without utilizing
3.1.7 Vision Transformer any external large-scale dataset (e.g., JFT in [11]), which
Vision Transformer (ViT) [11] is the first work to show- demonstrates the potential of Transformers compared to
case how Transformers can ‘altogether’ replace standard carefully tuned CNN designs. Since the Transformer archi-
convolutions in deep neural networks on large-scale com- tecture does not assume prior knowledge about the image
puter vision datasets. They applied the original Transformer structure as opposed to CNN design, it typically leads
model (with minimal changes compared to the version to longer training times, and larger datasets are required
used for NLP tasks) on a sequence of image ’patches’. The to train Transformer models. However, DeiT demonstrates
Transformer model was pre-trained on a large propriety how Transformers can be learned on mid-sized datasets
dataset of images collected by Google and later fine-tuned (e.g., 1.2 million examples compared to hundreds of millions
to downstream recognition benchmarks e.g., ImageNet. This used in ViT [11]) in relatively shorter training episodes.
is an important step since pre-training on a medium-range Besides using augmentation and regularization procedures
dataset would not give state-of-the-art results with a ViT. common in CNNs, the main contribution is a novel native
This is because the CNNs encode prior knowledge about distillation approach for Transformers.
the image domain (inductive biases e.g., translation equiv- The distillation process [78] uses a CNN as a teacher
ariance) that reduces the need of data as compared to model (RegNetY-16GF [79]) whose outputs are used to train
Transformers which must discover such knowledge rules the Transformer model. The outputs from the CNN aids
from very large-scale datasets. To this end, a 300 million the Transformer in efficiently figuring out useful representa-
image JFT dataset [39] was used for pre-training that helped tions for input images. A distillation token is appended with
boost the performance to the level of state of the art CNN the input patch embeddings and the class token. The self-
models. Notably, compared to the iGPT [19] model that attention layers operate on these tokens to learn their inter-
also applied Transformers to full-sized images but performs dependencies and output the learned class, patch, and dis-
training as a generative task, ViT pre-trains the model with tillation tokens. The network is trained with a cross-entropy
a supervised classification task (although a self-supervision loss defined on the output class token and a distillation loss
variant is also explored which results in a less performance). to match the distillation token with the teacher output. Both
The main architecture of the model (Fig. 8) is very similar soft and hard label choices were explored for distillation,
to language Transformers. Instead of a 1D sequence of where the hard distillation was found to perform better.
language embeddings, 2D images patches are flattened in a Interestingly, the learned class and distillation tokens do
vector form and fed to the Transformer as a sequence. These not exhibit a high correlation indicating their complemen-
vectorized patches are then projected to a patch embedding tary nature. The learned representations compare favorably
using a linear layer and position embedding is attached well against top-performing CNN architectures such as
with it to encode location information. Importantly, a [cls] EfficientNet [80] and also generalize well for a number of
token (stands for classification) is appended at the beginning downstream recognition tasks.
8

3.1.9 CLIP: Contrastive Language–Image Pre-training


CLIP [81] is a contrastive approach to learn image represen-
tations from text, with a learning objective which maximizes
similarity of correct text-image pairs embeddings in a large
batch size. Specifically, given a batch of N image-text pairs,
Fig. 9: An overview of Detection Transformer (DETR) [13].
CLIP learns a multi-modal embedding space, by jointly
DETR treats the object detection task as a set prediction problem
training an image-encoder and a text-encoder, such that and uses the Transformer network to encode relationships be-
the cosine similarity of the valid N image-text pairs is tween set elements. Then a bipartite set loss is used to uniquely
maximized, while the remaining N 2 −N pairs is minimized. match the box predictions with the ground-truth boxes (shown
The authors consider ResNet-50 [59] and Vision Transformer on the right two columns). In case of no match, a ’no object’ class
(ViT) [82] for encoding images. The modified Transformer prediction is selected. Its simple design with minimal problem-
model [1] as in [5] is employed for encoding text. CLIP is specific modifications can beat a carefully built and popular
Faster R-CNN model. Figure from [13].
trained on a large corpus of 400 million image-text pairs and
demonstrates excellent zero-shot transfer capabilities. At
inference, the names of classes are used as input to the text- boxes are predicted in parallel while [1] uses an RNN to
encoder, and similarity of the encoded image is computed predict sequence elements one by one. Since the encoder
with all encoded texts (classes) to find the image-text pair and decoder are permutation invariant, learned positional
with highest match. The CLIP achieves an astounding zero- encodings are used as the object queries by the decoder to
shot classification accuracy of 75% on ImageNet, without us- generate different boxes. Note that the spatial structure in
ing an supervision from ImageNet training set. The authors a CNN detector (e.g., Faster R-CNN) automatically encodes
further demonstrate zero-shot transfer capabilities of the the positional information. DETR obtains performance com-
CLIP model on 30 different computer vision benchmarks. parable to the popular Faster R-CNN model [83] which is an
Note that CLIP with ResNet took 18 days to train on 592 impressive feat given its simple design. The DETR has also
V100 GPUs while CLIP with ViT took 12 days on 256 V100 been extended to interesting applications in other domains,
GPUs. This highlights the computational cost of CLIP. e.g., Cell-DETR [88] extends it for instance segmentation of
biological cells. A dedicated attention branch is added to
3.2 Transformers for Object Detection obtain instance-wise segmentations in addition box predic-
Similar to image classification, Transformer models are ap- tions that are enhanced with a CNN decoder to generate
plied to a set of image features obtained from a backbone accurate instance masks.
CNN model to predict precise object bounding boxes and
their corresponding class labels. Below, the first approach 3.2.2 Deformable - DETR
[13] tackles the detection problem, for the first time, using The above-mentioned DETR [13] successfully combines con-
Transformer networks and the second approach [14] mainly volutional networks with Transformers [1] to remove hand-
extends [13] to a multi-scale architecture and focuses on crafted design requirements and achieves an end-to-end
improving computational efficiency. trainable object detection pipeline. However, it struggles
to detect small objects and suffers from slow convergence
3.2.1 Detection Transformer - DETR and a relatively high computational cost [14]. DETR maps
In order to apply Transformer model, DETR [13] treats images to features space before using the Transformer for
object detection as a set prediction problem and proposes the relation modeling. Thus, the computational cost of self-
a set loss function. This means that given a set of image attention grows quadratically with the spatial size of the
features, predict the set of object bounding boxes. The first feature map i.e., O(H 2 W 2 C), where H and W represent the
contribution (the Transformer model) enables the prediction height and width of the feature map. This inherently puts
of a set of objects (in a single shot) and allows modeling a limitation on the use of multi-scale hierarchical features
their relationships. The second contribution (the set loss) [89] in DETR training framework which is ultimately impor-
allows bipartite matching between predictions and ground- tant to detect small objects. Furthermore, at the beginning
truth boxes. The main advantage of DETR is that it removes of training, the attention module simply projects uniform
the dependence on hand-crafted modules and operations, attention to all the locations of the feature map and requires
such as the RPN (region proposal network) and NMS (non- a large number of training epochs to tune attention weights
maximal suppression) commonly used in object detection to converge to meaningfully sparse locations. This approach
[83]–[87]. In this manner, the dependence on prior knowl- contributes to a slow convergence rate of DETR. To mitigate
edge and careful engineering design is relaxed for complex the above-mentioned issues, [14] proposed a deformable
structured tasks like object detection. attention module to process the feature maps. Inspired
Given spatial feature maps from the CNN backbone, the from deformable convolutions [90], deformable attention
encoder first flattens the spatial dimensions into a single module [14] only attends to sparse set of elements from
dimension, as illustrated in Fig. 9. This gives a sequence the whole feature map regardless of its spatial size. This
of features d × n, where d is the feature dimension and further allows cross-scale aggregation of feature maps with
n = h×w with h, w being the height and width of the spatial the help of multi-scale attention modules without increasing
feature maps. These features are then encoded and decoded the computational cost significantly. Deformable DETR not
using multi-head self-attention modules as proposed in only performs better but its training time also remains 10×
[1]. The main difference in the decoding stage is that all lower than the original DETR model [14].
9

Fig. 10: Axial attention module [91] that sequentially applies


multi-head axial attention operations along height and width
axes. Image from [91].

3.3 Transformers for Segmentation


A dense prediction task like image segmentation into se- Fig. 11: Cross-Modal Self-Attention: Red arrows show the self-
mantic labels and object instances requires modeling rich attention over words, the green arrows show self-attention over
interactions between pixels. Here, we explain an axial self- image regions (a H×W grid) and the cross-modal attention is
attention operation [91] that seeks to reduce the complexity shown in blue. As shown in the right box, the overall frame-
work can find segmentation masks corresponding to words
of self-attention and a cross-modal approach [15] that can referred in the given text description. Figure based on [15].
segment regions corresponding to a given language expres-
sion.
via information exchange across multi-resolution features.
3.3.1 Axial-Attention for Panoptic Segmentation A binary CE loss is used to train the overall model that
Panoptic segmentation [92] aims at jointly solving the oth- achieves good improvements on UNC [94], G-Ref [95] and
erwise distinct tasks of semantic segmentation and instance ReferIt [96] datasets.
segmentation by assigning each pixel of the image a se-
mantic label and an instance id. Global context can provide
3.4 Transformers for Image Generation
useful cues to deal with such a complex visual understand-
ing task. Self-attention is effective at modeling long-range Image generation tasks are interesting from the perspec-
contextual information, albeit applying it to large inputs tive of generative modeling and because the representa-
for a dense prediction task like panoptic segmentation is tions learned in an unsupervised manner can later be
prohibitively expensive. A naive solution is to apply self- used for down-stream tasks. Here, we summarize different
attention either to downsampled inputs or to limited re- Transformer-based architectures for image generation tasks
gions around each pixel [64]. Even after introducing these [97]–[100]. We also cover a structured generation task where
constraints, the self-attention still has quadratic complexity scene objects are populated given a room layout [23].
and sacrifices the global context, respectively.
To mitigate the aforementioned issues, Wang et al. [91] 3.4.1 Image Transformer
propose the position-sensitive axial-attention where the 2D Parmar et al. [97] develop an image generation model that
self-attention mechanism is reformulated as two 1D axial- can sequentially predict each pixel of an output image given
attention layers that are applied to height-axis and width- its previously generated pixels. Their approach models the
axis sequentially (see Fig. 10). The axial-attention is highly joint distribution of the image pixels by factorizing it as a
compute efficient and enables models to capture the full- product of pixel-wise conditional distributions. Previously
image context. The effectiveness of axial-attention is demon- developed auto-regressive models for this task, such as the
strated by achieving state-of-the-art performance for the PixelCNN [101], suffer from a limited receptive field which
panoptic segmentation task on COCO [70], Mapillary Vis- hinders in modeling long term relationships in an image e.g.,
tas [93], and Cityscapes [68] benchmarks and for the image part relationships or occlusions. Using self-attention phe-
classification problem on ImageNet dataset [61]. nomenon, Image Transformers [97] enhance the receptive
field of neural networks without incurring a high compu-
3.3.2 CMSA: Cross-Modal Self-Attention tational cost (e.g., effective receptive field up to 256 pixels
Cross-modal Self-attention (CMSA) [15] encodes long-range was demonstrated to be achieved as compared to 25 pixels
multi-modal dependencies between linguistic and visual of PixelCNN [101]). The generative pipeline was also tested
domain features for referring image segmentation task. The on conditional generation tasks e.g., image super-resolution,
referring image segmentation problem aims to segment image completion, and denoising.
entities in an image that are referred to by a language The core methodology has two main highlights (see
expression, as shown in Fig. 11. To this end, a set of cross- Fig. 12), (a) the way key, query, and value triplets are used
modal features is obtained by concatenating image features in images, and (b) the use of self-attention with a rela-
with each word embedding and the spatial coordinate fea- tively high number of positions as compared to sentences
tures. The self-attention operates on this rich feature and in the language (where self-attention has been previously
generates attention over the image corresponding to each demonstrated to work successfully). For the first part, the
word in the sentence. The segmentation network performs feature representations of previously generated pixels were
self-attention at multiple spatial levels and uses a gated used to generate ‘value’ and ‘key’ embeddings, while the
multi-level fusion module to refine segmentation masks current pixel’s feature embedding was used as a ‘query’.
10

Fig. 13: The architecture of generator and discriminator of


TransGAN model, built using the encoder of the transformer
model [1]. Figure is from [100].

astounding result, since the iGPT architecture is exactly the


same as used for language modeling tasks, and therefore it
does not incorporate any prior domain-specific knowledge.
Notably, the competing unsupervised CNN based solutions
widely adopt such priors in the form of architectural design,
Fig. 12: (a) Self-attention block in Image Transformer [97]. attention mechanisms, loss functions, and regularization
Given one channel for a pixel q , the block attends to the mem- [105]–[109]. However, on the downside, iGPT has a high
ory of previous synthesized pixels (mi ), followed by a feed- compute cost e.g., iGPT-L version has roughly 36× high
forward sub-network. Positional encodings pi are added in the training cost compared to MoCo [107] which is a state of
first layer. (b) The operation performed in Local Self-Attention the art self-supervised feature learning approach. For this
(example of a 2D case is shown). The image is partitioned into
reason, the training was generally limited to low-resolution
a grid of spatial blocks known as query blocks. In the self-
attention operation, each pixel in a query block attends to all of ≤ 64 × 64, while convolutional architectures can effec-
pixels in the memory block (shown in cyan rectangle). White tively learn from high-resolution inputs.
grid locations show masked inputs that have zero-contribution
towards the self-attention. 3.4.3 High-Resolution Image Synthesis
Transformers typically incur a high computational cost
when applied on high-dimensional sequences. To overcome
Positional embeddings were used in the first layer to encode this limitation, Esser et al. [99] proposed to include in-
location information. To solve the second problem, local ductive biases (commonly used in the CNNs) alongside
attention (1D and 2D variants) was used only in the local Transformers to improve their efficiency. Specifically, local
neighborhood around the query position. For practical rea- connectivity and spatial invariance biases inbuilt in the
sons, a fixed memory block was defined for each respective CNN structure are leveraged by learning a rich dictionary of
query, instead of dynamically calculating a different mem- visual patterns. The dictionary is learned using a Generative
ory neighborhood for each pixel. A maximum likelihood Adversarial approach [46] that seeks to encode perceptually
loss was applied for training the generative model. sound image patches. A Transformer is then used to learn
the long-range interactions between the dictionary items to
3.4.2 Image GPT generate the outputs. In turn, they develop a conditional
Motivated by the success of Transformer models in the image generation model capable of producing very high-
language domain, image GPT (iGPT) [98] demonstrated resolution images (up to megapixel range) using Transform-
that such models can be directly used for image generation ers. This is the first work that demonstrates the application
tasks, and to learn strong features for downstream vision of Transformers to generate such high-resolution images.
tasks (e.g., image classification). Specifically, iGPT trains
GPT v2 model [5] on flattened image sequences (1D pixel 3.4.4 TransGAN: Transformers based GAN model
arrays) and shows that it can generate plausible image Generative Adverserial Networks (GANs) [46] with CNNs
outputs without any external supervision. The generated as default backbone have been very successful for visu-
samples depict the model’s ability to understand spatial ally appealing image synthesis [110]–[112]. TransGAN [100]
relationships between pixels and high-level attributes such builds a strong GAN model, free of any convolution oper-
as object classes, texture, and scale. Notably, the design does ation, with both generator and discriminator based upon
not use any image-specific knowledge in the design (e.g., the the Transformer model [1]. As shown in Fig. 13, the the
2D position embeddings used in Image Transformer [97]). architecture of both generator and discriminator is based
The features learned with iGPT’s unsupervised training upon the encoder in original Transformer model [1]. For
mechanism compete impressively against other unsuper- memory efficiency, the generator contains multiple stages,
vised approaches, achieving state-of-the-art performance with up-sampling modules in-between, which gradually
on CIFAR-10/100 [102] and STL [103] datasets while per- increase the resolution of feature maps (input sequence
forming close to the best results of SimCLR (a contrastive length) while reducing the embedding dimension. The dis-
learning approach) [104] on ImageNet dataset. This is an criminator of TransGAN takes flattened image-patches as
11

(a) (b) (c) (d) (e) (f) (g)


Fig. 14: Images generated by DALL·E [20] from the following text prompts. (a) An armchair in the shape of an avocado. (b) A photo
of San Francisco’s golden gate bridge. Given a part of the image (in green box), DALL·E performs the image completion. (c) An emoji
of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants. (d) An extreme close-up view of a capybara sitting in a field.
(e) A cross-section view of a pomegranate. (f) A penguin made of watermelon. (g) The exact same cat on the top as a sketch on the bottom.

tokens similar to [82]. Authors introduce different training latent codes using a pre-trained discrete variational autoen-
techniques including data augmentation, training with an coder [117], [118]. DALL·E takes as input a single stream of
auxiliary task and injecting locality to self-attention to scale- 1280 tokens (256 for the text and 1024 for the image), and
up their model for high quality image synthesis [99]. The trained to generate all other tokens autoregressively (one
TransGAN model achieves state-of-the-art results in terms after another). It provides flexibility to generate images ei-
of Inception Score and Fréchet Inception Distance (FID) on ther from scratch (Fig. 14a) or by extending existing images
STL-10 and performs favorably compared with their CNN- (Fig. 14b), while staying faithful to the text caption.
based GAN counterparts on other datasets. The authors demonstrate the effectiveness of DALL·E by
creating images from text describing a wide variety of real
3.4.5 SceneFormer and fictional things. While generating images purely from
In the previous works on image generation [97]–[99], image textural captions, DALL·E shows impressive performance
outputs are generally predicted directly by the model. In at controlling multiple objects and their attributes (Fig. 14c),
contrast, [23] learns to generate parameters of 3D objects rendering certain viewpoint (Fig. 14d), capturing object’s
to be placed in a given scene. Specifically, SceneFormer internal structure (Fig. 14e), and combining unrelated ob-
[23] studies the 3D room layout conditioned scene gen- jects (Fig. 14f). Furthermore, DALL·E can perform image-to-
eration task. Given the empty room shape, this approach image translation (Fig. 14g) guided by the input text.
can propose new object configurations in the room while
maintaining realism. Remarkably, the model does not use
any appearance information and only learns to generate 3.6 Transformers for Low-level Vision
new scenes by modeling the inter-object relationships using Transformer models have also been proposed for low-level
self-attention in Transformers. Similar to how a Transformer vision tasks including image super-resolution, denoising,
operates on a sentence, it is applied to a sequence of objects deraining, and colorization. Specifically, Transformer net-
to predict the next suitable object in a scene. Specifically, work for super-resolution [16] uses attention mechanisms to
the size, pose, location, and category of the next object is search relevant textures from reference images and transfer
predicted by the Transformer model. A start token indicates them to low-resolution images to generate super-resolved
the initiation of inference and the number of output token outputs. Similarly, the work of [19] shows how to exploit
indicate the objects generated by the model in a sequence. the potential of pre-training and transfer learning with
The authors also explore generating new scenes given a a shared Transformer based backbone to address multi-
textual description of the room layout. The independence ple image restoration tasks (e.g., denoising, deraining, and
from the appearance makes the approach efficient, enabling super-resolution) with dedicated task-heads. The coloriza-
interactive scene generation. tion transformer [24] proposes a progressive design for
image colorization to achieve high-resolution outputs. Next,
3.5 Transformers for Text-to-Image Synthesis we provide details of these aforementioned image restora-
The task of generating realistic images from text is in- tion Transformer models.
teresting and practically valuable (e.g., for artistic content
creation), but at the same time highly challenging. Prior text- 3.6.1 Transformers for Super-Resolution
to-image synthesis approaches [113]–[116] are mostly based Image super-resolution (SR) aims to generate a high-
on GANs [46]. Although these methods produce moderate resolution (HR) image from its low-resolution (LR) version.
results, they are far from being photo-realistic. To this end, Recent years have seen major performance breakthroughs
Ramesh et al. [20] propose DALL·E which is a Transformer for SR due to convolutional neural networks (CNNs). Prin-
model capable of generating high-fidelity images from a cipally, the quality of super-resolved images generated by
given text description. The model is named DALL·E using CNNs is dependent on the choice of optimization objec-
a portmanteau of the Spanish artist Salvador Dalı́ and the tive. On one hand, SR methods [119]–[123] that are based
Pixar’s blockbuster movie WALL·E. on pixel-wise loss functions (e.g., L1, MSE, etc.) yield im-
DALL·E model has 12 billion parameters and it is trained pressive results in terms of image fidelity metrics such as
on a large set of text-image pairs taken from the internet. PSNR and SSIM. However, they struggle to recover fine
Before training, images are first resized to 256×256 reso- texture details and often produce images that are overly-
lution, and subsequently compressed to a 32×32 grid of smooth and perceptually less pleasant. On the other hand,
12

Fig. 16: The overall architecture of Image Processing Trans-


former (IPT) used for denoising, deraining and super-resolution
tasks. IPT employs multiple heads and tails to address each
image restoration task separately, and a shared Transformer
encoder-decoder. The multi-head extract visual features from
the corrupted input images, that are then divided into patches
and linearly flatten before being passed to the Transformer
encoder. Next, the encoder outputs along with the task-specific
Fig. 15: Diagram of the texture Transformer module. Q (query), embeddings (indicating the desired task to be learned) are
K (key) and V (value) represent texture features extracted from processed by the Transformer decoder followed by the multi-
a (bicubic upsampled) low-resolution image, a sequentially tails that yield the restored images. Figure is from [19].
down/upsampled reference image, and an original reference
image, respectively. The relevance embedding aims to estimate
similarity between low-resolution and reference images. H 3.6.2 Transformers for Image Processing Tasks
and S respectively denote hard and soft attentions computed
from relevance embedding. T indicates high-resolution texture State-of-the-art algorithms developed for high-level com-
features that are then transferred to the features F of low- puter vision tasks such as object detection and semantic
resolution image. Figure is from [16]. segmentation often employ backbone networks that are pre-
trained on large-scale datasets e.g., ImageNet. In contrast,
perceptual SR approaches [44], [124]–[127], in addition to algorithms for low-level vision tasks such as image denois-
per-pixel loss, employ adversarial loss [46] and perceptual ing, super-resolution, and deraining are directly trained on
loss [128] based on deep features extracted from pre-trained task-specific data, thereby suffer from the following limita-
CNNs. While these methods generate images that are sharp, tions. First, the number of images available in task-specific
visually pleasant, and perceptually plausible, they show a datasets is small (e.g., the commonly used DIV2K dataset for
substantial decrease in reconstruction accuracy measured in image super-resolution contains only 2000 images). Second,
PSNR/SSIM. Moreover, the perceptual SR algorithms have a the model trained for one image processing task does not
tendency to hallucinate fake textures and cause artifacts. The adapt well to other related tasks.
above mentioned SR approaches follow two distinct (but Chen et al. [19] proposed a pre-trained model based
conflicting) research directions: one maximizing the recon- on Transformer architecture, named as Image Processing
struction accuracy and the other maximizing the perceptual Transformer (IPT). It is capable of performing various image
quality, but never both. restoration tasks such as super-resolution, denoising, and
In order to alleviate the trade-off between perceptual deraining. As shown in Fig. 16, the overall architecture of
reproduction and accurate reproduction, Yang et al. [16] IPT consists of multi-heads and multi-tails to deal with
propose a Transformer network (TTSR) for super-resolution. different tasks separately, and a shared encoder-decoder
During training, TTSR uses paired LR-HR images, as well Transformer body. Since exploiting Transformers at full po-
as reference (Ref) images with similar content as of LR tential requires training on large-scale data, Chen et al. [19]
images. TTSR learns to search relevant regions in the Ref take the clean (ground-truth) images from the ImageNet
image and transfers rich textures to help super-resolving benchmark and synthesize their degraded versions for dif-
the input LR image. The texture Transformer module of ferent tasks. For example, bicubic interpolation is used for
TTSR method, shown in Fig. 15, consists of four core com- generating low-resolution images, additive white Gaussian
ponents: (1) Learnable texture extractor takes as input LR↑, noise is added to prepare noisy data, and hand-crafted
Ref↓↑, and Ref images, and generates texture features query rain streaks are applied to obtain rainy images. In total, 10
(Q), key (K), and value (V), respectively. Here, ↑ denotes million images are used to pre-train the IPT model. During
bicubic upsampling operation, and ↓↑ represents bicubic training, each task-specific head takes as input a degraded
down-sampling followed by an upsampling operation. (2) image and generates visual features. These feature maps are
Relevance embedding first unfolds Q and K into patches and divided into small crops and subsequently flattened before
then computes the similarity of each patch in Q with each feeding them to the Transformer encoder. The architecture
patch in K in order to generate hard and soft attention maps. of the encoder is the same as that of the original Transformer
(3) Hard-attention transfers HR texture features from V to (LR model [1]. The outputs of the encoder along with the task-
features) Q using the hard attention map. (4) Soft-attention specific embeddings are given as input to the Transformer
further enhances relevant features while suppressing less decoder. The features from the decoder output are reshaped
relevant ones by using the soft-attention map. and passed to the multi-tail that yields restored images. The
13

pre-training (VLP) on large-scale multi-modal datasets to


learn generic representations that effectively encode cross-
modality relationships (e.g., grounding semantic attributes
of a person in a given image). These representations can then
be transferred to downstream tasks, often obtaining state
of the art results. Such models generally apply the vanilla
multi-layer Transformer [1] with multi-modal inputs and
don’t introduce fundamental changes to the core attention
block. However, their main distinction is in the configura-
tion of Transformers and the loss functions, based on which
we categorize them into two categories: single-stream and
multi-stream Transformers. The single-stream designs feed
Fig. 17: Colorization Transformer is a probabilistic model that the multi-modal inputs to a single Transformer while the
breaks down the image colorization problem into three sub- multi-stream designs first use independent Transformers for
tasks and trains separate Transformer models for each. Figure each modality and later learn cross-modal representations
is from [24]. using another Transformer (see Fig. 18). We explain seminal
multi-modal Transformers below.

IPT model is optimized with L1 loss. Experimental results 3.7.1 ViLBERT: Vision and Language BERT
show that the pre-trained IPT model, when fine-tuned for a Vision and Language BERT was the first extension of the
specific low-level vision task, can provide significant perfor- BERT model to the multi-modal domain. The goal was to
mance gains over the state-of-the-art methods [122], [129], learn representations that can jointly model images and
[130]. natural language. For this purpose, ViLBERT developed
a two-stream architecture where each stream is dedicated
3.6.3 Colorization Transformer to model the vision or language inputs (Fig. 18-h). The
Given a grayscale image, colorization seeks to produce the architecture of both parallel streams is a series of Trans-
corresponding colorized sample. It is a one-to-many task as former blocks similar to the BERT model. Subsequently, co-
for a given grayscale input, there exist many possibilities attentional Transformer layers are applied to learn cross-
in the colorized output space. The challenging nature of modal relationships. The co-attentional framework is very
this task requires probabilistic models capable of produc- simple. Query, key, and value matrices are computed for
ing multiple colorized output samples. Colorization Trans- each modality in the standard way [1] and then key-value
former [24] is a probabilistic model based on conditional pairs for one modality are passed on to the other modality’s
attention mechanism [131]. It divides the image colorization attention head.
task into three sub-problems (Fig. 17) and proposes to solve ViLBERT applies VLP on a set of proxy tasks defined on
each task sequentially by a different Transformer network. the Conceptual Concepts dataset (with 3.3M images with
The authors first train a Transformer network to map a weak captions) and later fine-tune the model on down-
low-resolution grey-scale image to a 3-bit low-resolution stream tasks such as VQA. The pre-training phase oper-
colored image. Low-resolution images in turn allow training ates in a self-supervised manner, i.e., pretext tasks are cre-
of larger models. The 3-bit low-resolution colored image ated without manual labeling on the large-scale unlabelled
is then upsampled to an 8-bit RGB sample by another dataset. These pretext tasks include predicting whether the
Transformer network in the second stage of training. Finally, text and image inputs are related and predicting the seman-
a third stage Transformer is trained to increase the spatial tics of masked image regions and textual inputs (e.g., similar
resolution of the 8-bit RGB sample produced by the second- to reconstructing masked words in text in the BERT model
stage Transformer. Self-attention used in the colorization [3]). This way, the model learns the inherent structure in
Transformer is based on row/column attention layers intro- the data during pre-training and also models cross-domain
duced in [131]. These layers capture the interaction between associations. With evaluations on several tasks, [17] demon-
each pixel of an input image while being computation- strated that a two-stream model can perform better than a
ally less costly. The row-wise attention layer applies self- single-stream model that uses shared parameters to model
attention to all pixels in a given row, while the column-wise both language and vision domains [17].
attention layer considers pixels only in a given column of
an image. This work [24] is the first successful application 3.7.2 LXMERT
of Transformers trained to colorize grey-scale images at high Similar to ViLBERT [133], Learning Cross-Modality Encoder
(256×256) resolution. Representations from Transformers (LXMERT) [21] also uses
a two-stream architecture based on BERT framework. The
main difference lies in the object-relationship encoder that is
3.7 Transformers for Multi-Modal Tasks used to model the visual features instead of simple image-
Transformer models have also been extensively used for level features used in ViLBERT. The information in two
vision-language tasks such as visual question answering streams is then fused across modalities using cross-attention
(VQA) [135], visual commonsense reasoning (VSR) [136], blocks similar to [133].
cross-modal retrieval [137] and image captioning [138]. Sev- Compared to two pre-texts tasks used for VLP in [133],
eral works in this direction target effective vision-language LXMERT uses five pre-training tasks including masked ob-
14

Fig. 18: An overview of Transformer models used for multi-modal tasks in computer vision. The Transformer designs in this
category can be grouped into single-stream (UNITER [35], OSCAR [36], VideoBERT [17], Unicoder-VL [132], VisualBERT [55] and
VL-BERT [22]) and dual-stream architectures (LXMERT [21], ViLBERT [133] and PEMT [134]). A key distinction between models
is the choice of loss functions. While most of the multi-modal methods are focused on images as visual data, VideoBERT [17] and
PEMT [134] are designed to work on video streams and leverage unique modalities e.g., audio signals in videos [134].

ject and language prediction, cross-modality matching, and simply attempts to predict missing text tokens using the
visual question answering (Fig. 18-g). The pre-trained model image features and remaining textual tokens. The second
is fine-tuned on the VQA task, however, a high similarity objective attempts to differentiate between the true and false
between pre-training and fine-tuned tasks raises questions caption of a given image. After task-agnostic pre-training,
on the generalizability of the learned representations to new the authors propose to perform task-specific pre-training to
tasks. To this end, the authors conducted generalization bridge the domain gap before the final fine-tuning to the
experiments on Visual Reasoning for Real (NLVR) task [139] downstream task.
demonstrating impressive improvements on novel tasks.
3.7.4 VL-BERT
3.7.3 VisualBERT Su et al. [22] propose a multi-modal pre-training approach to
Different from two-stream networks like ViLBERT [133] learn features that are generalizable to multi-modal down-
and LXMERT [21], VisualBERT [55] uses a single stack stream tasks such as Visual Commonsense Reasoning and
of Transformers to model both the domains (images and Visual Question Answering. This endeavor requires ade-
text). The input sequence of text (e.g., caption) and the quately aligning the visual and linguistic cues so that an
visual features corresponding to the object proposals are effective composite representation is learned. To the end,
fed to the Transformer that automatically discovers relations [22] builds on the BERT model and inputs both the visual
between the two domains. Notably, VisualBERT architec- and language features. The language features correspond
ture is somewhat similar to VideoBERT [17] (explained in to the token in the input sentence and the visual features
Sec. 3.8), but instead of only focusing on cooking videos, correspond to the region of interest (RoI) from the input
VisualBERT evaluates on various visual-linguistic tasks (e.g., image (obtained via a standard Faster R-CNN). Specifically,
VCR, NLVR, VQA, and visual grounding). the model is pre-trained on both the visual-lingual dataset
The VisualBERT model first applies task-agnostic pre- (Conceptual Captions [140]) as well as the language-only
training using two objectives (Fig. 18-e). The first objective datasets (e.g., Wikipedia). The loss function is identical to
15

BERT, where the model is trained to predict the masked design pre-training tasks to predict masked visual or text
out words or visual ROIs (Fig. 18-f). In contrary to other region conditioned on the other domain input, and align
works such as UNITER [35], VL-BERT claims that the visual- language and visual inputs on both the global (image-text)
linguistic matching tasks are not useful during pre-training, and local (word-region) levels (Fig. 18-a). These tasks are
which is in contrast to evidence from later efforts [132]. Their beside the conventional masked language modeling task
results on several multi-modal tasks show their benefit over used in BERT and explicitly include fine-grained word-
the language-only pre-training (e.g., in BERT). region alignment alongside conditional masking of inputs
that were not considered in the earlier works such as VL-
3.7.5 Unicoder-VL BERT [22], Visual-BERT [55], Vilbert [133] and Unicoder-
Universal Encoder for Vision and Language (Unicoder-VL) VL [132]. Common to the other approaches, they adopt the
[132] learns multi-modal representations using large-scale Transformer architecture proposed in BERT that operates
image-caption datasets. The language and image inputs on both the visual and language embeddings. In contrast
are fed to a single Transformer model (with multiple suc- to applying independent Transformers to the language and
cessive encoders) to learn joint embeddings. To this end, visual inputs (as in ViLBERT [133] and LXMERT [21]),
it uses masked word prediction, masked object classifi- UNITER adopts a single Transformer applied to the textual
cation, and visual-linguistic matching as self-supervision and image inputs like [22], [55], [132].
tasks during pre-training (Fig. 18-d). Notably, the visual-
linguistic matching is carried out only at the global level 3.7.8 Oscar: Object-Semantics Aligned Pre-Training
(i.e., image-sentence alignment). The model is evaluated on VisualBert [55], Uniter [35], VL-BERT [22], VilBERT [133],
downstream tasks of image-text retrieval, zero-shot learn- Unicoder-VL [132] models for VLP concatenate image and
ing, and visual commonsense reasoning where it performs text features and leave it on to the self-attention to automat-
better than the previous models such as ViLBERT [133] and ically discover cross-modal relationships. This can compli-
VisualBERT [55]. This shows the significance of rich self- cate the visual grounding of semantic concepts in an image.
supervised tasks and advocates for a unified Transformer To address this problem, Oscar [36] first uses an object de-
architecture to learn multi-modal feature representations in tector to obtain object tags (labels), subsequently using these
a common framework. tags as a mechanism to align relevant visual features with
the semantic domain information (Fig. 18-b). The motivation
3.7.6 Unified VLP is that the textual content generally pertains to major objects
The Unified Vision-Language Pre-training (VLP) [141] in the image, therefore by explicitly adding those image
model uses a single Transformer network for both encod- labels in the input, visual features can be better attended.
ing and decoding stages. This stands in contrast to BERT Similar to BERT [3], Oscar uses a Masked Token Loss for
inspired VLP models [17], [22], [55], [142] which use in- VLP. Specifically, different tokens in the textual input and
dependent encoder and decoder networks. Joint modeling image tags are randomly masked and the model’s job is to
of encoding and decoding stages allows the Unified VLP predict the missing token. This forces it to learn the relation-
model to perform well for both image captioning and visual- ship of the missing token with the contextual information
question answering tasks, when fine-tuned on these individ- given as visual and semantic features. Further, it also uses
ual tasks. The intuition for shared modeling of encoding and a contrastive loss that discriminates between the original
decoding stage stems from the need to better share cross- and noisy/fake image-tag pairs. The representations thus
task information during pre-training. The unified model learned are fine-tuned on VQA, cross-modality retrieval,
consists of a stack of 12 Transformer blocks, each with a self- natural language reasoning, and image captioning tasks to
attention layer followed by a feed-forward module. The self- obtain better performances compared to VLP methods that
supervised objectives used for pre-training include masked do not use object tags.
vision-language predictions. Here, the authors explore two
variants i.e., bidirectional and sequence-to-sequence predic- 3.7.9 Vokenization
tion of masked works where different context encodings are Tan and Bansal [146] introduce the concept of ‘vokens’ (im-
used for both types of objectives. The proposed approach is ages related to language tokens extracted from sentences).
evaluated on COCO Captions, Flick 30K Captions and VQA The vokens (visualized tokens) provide visual supervision
2.0 and obtains encouraging results compared to previous to the language model to learn better features. The motiva-
methods on image captioning and VQA [143]. tion is that humans learn languages by correlating visual in-
formation with semantic concepts. In a similar spirit to other
3.7.7 UNITER self-supervised language representation learning methods
Universal image-text representation (UNITER) [35] is also [3], [133], they learn representations by defining an auxiliary
a multi-modal feature learning approach via pre-training task of voken-prediction task.
on four large-scale visual-linguistic datasets (MS-COCO Since the existing datasets encode limited visually
[70], Visual Genome [144], Conceptual Captions [140] and grounded tokens, they propose a vokenization method to
SBU Captions [145]). The learned representations have been map language tokens to visual vokens, as illustrated in
shown to transfer well on downstream tasks such as VQA, Fig. 19. The approach uses language-based retrieval for
Multi-modal retrieval, Visual Commonsense reasoning, and such a mapping and transfers a model trained on a small
NLVR. In order to emphasize on learning the relationships labeled dataset (MS-COCO) to a large dataset (Wikipedia).
between visual and language domains, they specifically Furthermore, it was ensured that the sentence-wide context
16

is desirable in various uni-modal and multi-modal learning


tasks such as activity recognition [67], [150]–[153]. Below, we
explain recent approaches that seek to resolve this challenge
using the expressivity of Transformer networks.

3.8.1 VideoBERT: Joint Video and Language Modeling


The VideoBERT [17] model leverages Transformer networks
and the strength of self-supervised learning to learn effec-
tive multi-modal representations. Specifically, VideoBERT
uses the prediction of masked visual and linguistic tokens
as a pretext task in self-supervised learning (Fig. 18-c).
This allows modeling high-level semantics and long-range
temporal dependencies, important for video understanding
Fig. 19: Visualized tokens (Vokens) [146]: A language model tasks. Given a video, they convert speech to text using
is visually supervised using closely related images that leads off-the-shelf speech recognition systems and apply vector
to better feature representations from the pretrained model.
Figure from [146]. quantization (clustering) to obtain visual features from pre-
trained video classification models. The BERT model is
then directly applied to these concatenated sequences of
is considered to obtain the token-voken mapping. The re- language and visual tokens to learn their joint distribution.
sulting model trained using generated tokens outperform The model can be trained with only-text, video-only, and
the state of the art BERT model on a diverse set of NLP tasks. video+text domains. The resulting model showcases inter-
In this sense, the proposed model does not evaluate vision esting capabilities for cross-modality predictions such as
tasks, however, uses vision as a useful grounding cue to video generation from a given textual input (e.g., captions or
train the language model, hence we include it in the multi- cooking recipe) and (video-based) future forecasting given
modal representation learning group. a video token. The video+text model uses a visual-linguistic
alignment task to learn cross-modality relationships. The
3.7.10 Vision-and-Language Navigation definition of this pre-text task is simple, given the latent
state of the [cls] token, the task is to predict whether the
This task aims to predict a navigation plan on a map based
sentence is temporally aligned with the sequence of visual
on the vision and language inputs. Self-attention based
tokens. Further, the learned representations are shown to be
Transformer networks were used earlier in [147], [148] for
very useful for downstream tasks such as action classifica-
the visual and language navigation (VLN). These works first
tion, zero-shot classification, and video captioning.
pre-trained a cross-modal Transformer network using self-
supervision on vision and language pairs and subsequently
fine-tune on the specific VLN tasks. While these works 3.8.2 Masked Transformer
learn attention between image region and language, Chen Zhou et al. [154] study the dense video captioning problem
et al. [149] propose to learn cross-modal attention between using Transformers. This problem setting requires gener-
language inputs and spatial topological maps. The topo- ating language descriptions for all events occurring in a
logical maps represent an agent’s environment as a graph video. The previous works on this problem generally op-
whose nodes denote places and the edges denote their con- erate sequentially i.e., first detect events and then generate
nectivity. Given the topological map and natural language captions in separate sub-blocks. The proposed unified Trans-
inputs, a VLN task using the Transformer model bears former network learns a single model to tackle both tasks
resemblance to sequence prediction in NLP. Specifically, at jointly, thereby seamlessly integrating the multi-modal tasks
each time instance, the cross-modal Transformer predicts of event detection and captioning. First, a video encoder
a single node of the topological map in the navigation is used to obtain frame-wise representations followed by
plan. The individual language and map encodings are first two decoder blocks focused on proposing the video events
processed using uni-modal encoders and later a cross-modal and the captions. Since untrimmed videos are considered,
encoder (similar to LXMERT [21]) is applied to aggregate a masking network is used in the captioning decoder to
information across modalities. To denote positions in the focus on describing a single event proposal. Remarkably,
map, a learned trajectory position encoding is appended [154] was the first approach to target dense video captioning
with the map features. Based on this Transformer setup, using non-recurrent models and used self-attention in the
[149] reports a full navigation system that can freely explore encoder(applied on CNN derived features) to model broad
the environment and intelligently plan its actions. range context between video frames. Experiments on Activ-
ityNet Captions [155] and YouCookII [156] datasets showed
3.8 Video Understanding good improvements over previous recurrent network and
two-stage based approaches.
Audio-visual data in the form of videos is abundantly avail-
able. However, the prevalent approaches generally learn
representations on short-length videos (up to a few sec- 3.8.3 Parameter Efficient Multi-Modal Transformers
onds long), that allow them to encode only short-range Lee et al. [134] note that the multi-modal representation
dependencies [1], [29]. Long-range dependency modeling learning approaches like VideoBERT [17] and ViLBERT [133]
17

generally keep the language processing part fixed to a pre-


trained model (e.g., BERT [3]) to reduce training complex-
ity. Alternatively, for the first time in the literature, they
propose to learn an end-to-end multi-modal bidirectional
Transformer model called PEMT on audio-visual data from
unlabeled videos. First, short-term (e.g., 1-3 seconds) video
dynamics are encoded using CNNs, followed by a modality-
specific Transformer (audio/visual) that can model long-
term dependencies (e.g., 30 seconds). A multi-modal Trans-
former is then applied to the modality-specific Transformer
outputs to exchange information across visual-linguistic
domains. However, learning such a model in a naive form
would incur huge memory requirements. To reduce para-
metric complexity, the parameters are shared across layers
within each Transformer based on a low-rank approxima-
tion that leads to as high as 80% parameter reduction.
The Transformer is trained using a contrastive learn-
ing approach based on a content-aware negative sampling Fig. 20: Video Transformer Network (VTN) [160]: Features are
(Fig. 18-i). Specifically, the model uses the features obtained extracted from the individual frames using a 2D CNN (f )
from CNNs learned during the training phase to select and fed to a Transformer encoder with positional embeddings
negative samples that are visually similar to the positive (PE). An MLP is used to classify videos using the output
instances. This work also compares various fusion strategies classification token embedding. Image courtesy [160].
adopted in earlier works such as early (VideoBERT [17]
and VL-BERT [22]), mid-level (ViL-BERT [133] and LXMERT
that demand precise delineation (e.g., action localization and
[21]) and late fusion mechanisms and shows that the mid-
segmentation).
level fusion is the optimal choice. The proposed model
is pre-trained on Kinetics-700 [150] dataset and later fine-
3.8.5 Video Transformer Network
tuned on downstream video classification tasks such as
short video classification on UCF101 [157], audio classifi- The traditional CNN based methods in video classification
cation on ESC50 [158] and long-term action recognition on generally perform 3D spatio-temporal processing over lim-
Charades [159] and Kinetics-Sounds [57] datasets. ited intervals to understand videos. Neimark et al. [160]
propose Video Transformer Network (VTN) that first obtain
frame-wise features using 2D CNN and apply a Transformer
3.8.4 Video Action Transformer encoder (Longformer [161]) on top to learn temporal re-
Girdhar et al. [18] use a variant of Transformer architec- lationships (Fig. 20). Longformer is an attractive choice to
ture to aggregate contextual cues in a video relevant to process long sequences (with an arbitrary length n) due
a particular person. They demonstrate the usefulness of to its O(n) complexity. The classification token [CLS] is
such contextual information for action classification and passed through a fully connected layer to recognize actions
localization. Initially, the model uses a Faster-RCNN [83] or events. The advantage of using Transformer encoder on
style processing where a backbone model generate features top of spatial features is two fold: (a) it allows processing
that are forwarded to the Region Proposal Network to a complete video in a single pass, and (b) considerably
obtain object proposals. Then RoI pooling is applied to improves training and inference efficiency by avoiding the
generate object-specific features. Multi-head self-attention expensive 3D convolutions. This makes VTN particularly
[1] is then applied on top of the object features as a cascade suitable for modeling long videos where interactions be-
of self-attention layers. In each Transformer unit, a partic- tween entities are spread throughout the video length.
ular person feature is treated as the ‘query’ (Q), while the Their experiments on Kinetics-400 dataset [67] with various
features from the neighboring video clip are used as ‘key’ backbones (ResNet [59], ViT [11] and DeiT [12]) shows
(K) and ‘value’ (V). The location information is explicitly competitive performance.
encoded in the input feature map from which K, V and Q
are derived, thus incorporating the positional information 3.8.6 Video Instance Segmentation Transformer
in the self-attention. For a given 400×400×64 video clip, The Video Instance Segmentation Transformer (VisTR) [153]
the key and value tensors are of size 16×25×25×128, while extends DETR [13] for video object instance segmentation
the query is 128 dimensional vector. Although this work (VIS) task. Local features are obtained using a backbone
uses only RGB stream, the use of additional modalities CNN on a collection of video frames. An encoder and a
like optical flow and audio signal (as in competing video decoder Transformer is used similar to DETR to frame the
analysis works) would further increase the computational instance segmentation problem as a sequence to sequence
complexity. Further, the Transformer model was found to prediction task. The input frame-level features are concate-
be sub-optimal for action localization, perhaps due to its nated to form clip representations and the Transformer out-
tendency to incorporate global information. Therefore, an puts instance predictions in a order that is consistent across
important research question is how to achieve the right frames. This integrates the object detection and tracking
trade-off between the global and local context for problems with in a single unified architecture. The predicted outputs
18

3.9.1 Cross-Transformer
Doersch et al. [25] explore the utility of self-supervision and
Transformer architectures for cases where distribution mis-
match exists between training and evaluation phases. They
specifically consider the few-shot fine-grained classification
problem, where a model is first trained on a set of base
classes and later during the evaluation, it must adapt to
(a) Spatial Self-Attention (b) Temporal Self-Attention novel classes using their few labeled examples (support set).
Cross-Transformer is evaluated on Meta-dataset [167],
Fig. 21: Spatial/Temporal Attention for Skeleton Data Repre- which is a huge dataset comprising of 10 distinct datasets
sentations. Relationships between body-joints and inter-frame
dependencies are modeled using two dedicated self-attention (including ImageNet, MS-COCO, etc.). The dataset encap-
modules. Figure is from [164]. sulates the challenging scenario where a learner must adapt
to new classes and novel domains during evaluation. The
Transformer architecture in this case is used to relate a
are matched with the ground-truth using bipartitie match- given query image with the few-examples available in the
ing. Similar to Mask R-CNN [85], a separate head is used support set. To this end, the Transformer finds spatially
to predict the instance mask based on self-attention and similar regions in the query and support set images, and
3D convolutions. The overall results are competitive among the corresponding features are then used to obtain class
the single model approaches on YouTube VIS dataset [162], decisions for the query. The queries in the Transformer
but performs somewhat lower compared to more complex architecture are derived from the grid features obtained
CNN-based models such as MaskProp [163]. using the query image. Similarly, grid features from the
support images are used to construct keys and values which
3.8.7 Skeleton-Based Action Recognition are in turn used to derive attended outputs. This approach,
besides a contrastive self-supervision based training mech-
Human action recognition based on skeleton representation anism, leads to the best performance on the challenging
requires models that can understand relationships between Meta-dataset.
different joints of a body in a given frame as well as between
different frames of a video. Plizzari et al. [164] proposed 3.9.2 FEAT: Few-Shot Embedding Adaptation
a two-stream Transformer network to model such rela-
Ye et al. [26] propose to adapt the few-shot embeddings
tionships. They introduced spatial self-attention (SSA) for
learned on the base classes to the few-shot target classes
relation modeling between different body-joints (Fig. 21a),
during inference using a Transformer module. This leads to
while temporal self-attention (TSA) to capture long-range
task-specific embeddings that perform better on the discrim-
inter-frame dependencies (Fig. 21b). They first used a small
inative tasks such as few-shot classification. While many
residual network to extract features from skeleton data and
other set-to-set functions are also evaluated, such as Graph
then used SSA and TSA modules to process those feature
convolutional networks [168], Bidirectional LSTMs [29] and
maps. SSA models the relations between different body
DeepSets [169], the best performance is achieved with the
parts by finding the correlation between each pair of joints
Transformer-based mapping. This is attributed to the better
independently, while TSA focuses on how features of a
contextualization, task interpolation and extrapolation ca-
certain joint change between frames along the temporal
pability of Transformers and their permutation invariance
dimension. Joints can be thought of as bag-of-words and
while maintaining a relatively lower parameter complexity.
the purpose of SSA is to discover relationships among the
The Transformer architecture used in this work follows the
surrounding joints in the same way as the Transformer
standard approach [1]. The embeddings are adapted using
relates different words in a phrase. On the other hand, TSA
a contrastive loss function for preserving discriminative
finds long-range relations between frames, similarly to how
properties (Fig. 22). The resulting model achieves strong
relations among phrases are built in NLP. The two streamed
performance on inductive, transductive, and generalized
spatial-temporal Transformer network achieve state-of-the-
FSL tasks.
art results on NTU-RGB+D 60 [165] and NTU-RGB+D 120
[166] datasets.
3.10 Transformers for Clustering
Clustering is a fundamental operation in unsupervised
3.9 Transformers in Low-shot Learning
learning that aims to discover structure in the data by
In the few-shot learning settings, a support set is provided grouping similar data points together. It has numerous
at the inference to adapt to a novel set of categories. Trans- applications such as data visualization and interpretation,
former models have been used to learn set-to-set mappings anomaly detection, and open-set categorization. Neural net-
on this support set [26] or learn the spatial relationships works have been developed for set prediction problems
between a given input query and support set images [25]. [169], [170], however, the setpoints are processed individ-
In terms of absolute performance, the patch-wise spatial ually which can lose information about inter-point relation-
self-attention between query and support set images excels ships. Recent works employ Transformers that operate on
compared to an image level association learned in [26]. set inputs called the Set Transformers (ST) [171] for amortized
However, the patch-wise attention computation is computa- clustering. Amortized clustering is a challenging problem
tionally expensive. We elaborate on these approaches below. that seeks to learn a parametric function that can map an
19

three problems in the 3D domain namely, object classifica-


tion, semantic segmentation, and object part segmentation.
The main contribution is a point Transformer layer that
applies self-attention in the local neighborhood of 3D points.
The proposed point Transformer layer builds on vec-
torized self-attention network (SAN) [75] where attention
weights are represented with vectors. Furthermore, a posi-
tional encoding δ is added both to the attention vector and
transformed features (value vectors) to represent location
information. The point Transformer layer is sandwiched be-
Fig. 22: An overview of Few-shot Embedding Adaptation tween two linear layers to create a point Transformer block
with Transformer (FEAT [26]). Compared to the conventional that is stacked multiple times in the developed network
instance embedding methods in FSL that keep the embedding architecture. Their design also included transition down/up
function same for all tasks (a), FEAT uses a set-to-set function to blocks to reduce/increase the number of points in the input
adapt the embedding function to each FSL task (b). It evaluates (in a typical encoding-decoding pipeline style). The result-
several set-to-set functions and found the Transformer module ing architecture delivers state-of-the-art performance on the
to be the most suitable choice for FSL. Figure from [26].
3D classification and segmentation tasks.

input set of points to their corresponding cluster centers.


Lee et al. [171] propose to learn such a mapping function
using a Transformer architecture comprising of multi-head
self-attention blocks [1].
The Transformer model is permutation invariant by de-
sign and allows encoding both pair-wise and higher-order
relationships between the input points. However, a full
Transformer would lead to a high computational cost of
O(n2 ) in each self-attention layer, where n is the number of
points in the set. ST reduces this cost to O(mn) by using an
Induced Self-Attention Block that uses a low-rank projection
Fig. 23: Point Transformer layer [173] based on vectorized self-
(H ∈ Rm ) to allow operating on large sets. The model attention [75]. δ denotes a position encoding, ψ, φ, α are point-
was trained to learn optimal parameters that maximize the wise transformations and γ is a mapping function. Figure is
likelihood of a mixture of Gaussians (MoGs). Thus MoG pa- from [173].
rameters are estimated by the ST given a set of data points.
Beyond amortized clustering, ST was also evaluated on
related set-transformation tasks including counting unique 3.11.2 Point-Cloud Transformer
elements in an input set, set anomaly detection, and point- The Point Cloud Transformer (PCT) [174] is a parallel work
cloud classification. More recently, [172] improves [171] by to [173] and motivated by the permutation invariance prop-
taking a sequential approach to cluster generation, thereby erty of Transformers. However, compared to [173], it is more
allowing assignment to a variable number of clusters. directly based on the conventional Transformer architecture
[1] and does not involve vector attention. The key modi-
3.11 Transformers for 3D Analysis fications include a 3D coordinate-based position encoding,
Given the irregular (variable number of points) and permu- an offset attention module, and a neighbor embedding that
tation invariant nature of 3D point cloud representations, encodes local 3D structure in point-clouds. Specifically, the
Transformers provide a nice mechanism to encode rich offset attention layer calculates the difference between the
relationships between the individual data points. To this end self-attended features and the input features using element-
[173], [174] are motivated by the capability of Transformers wise subtraction. The local neighbor embedding simply
to learn set-functions. Specifically, [173] introduced a Point finds self-attention relationships among a group of points
Transformer which uses vector attention that learns weights instead of individual 3D points. Explicitly incorporating
for each channel, while [174] suggest an alternate design local neighbourhood information makes this a more effi-
where local 3D structure is explicitly encoded. The non- cient architecture compared to [173]. The experiments are
local nature of Transformers is exploited in [37] towards an reported on 3D shape classification, normal estimation and
accurate human pose and mesh reconstruction algorithm. segmentation tasks on ModelNet40 [175] and ShapeNet
We discuss these approaches below. [176] datasets.

3.11.1 Point Transformer 3.11.3 Pose and Mesh Reconstruction


Zhao et al. [173] study the self-attention based Transformer The Mesh Transformer (METRO) [37] model targets 3D hu-
architecture for 3D point cloud processing. Self-attention man pose and mesh reconstruction from a single 2D image.
being a set-operator is ideally suited for processing point A key challenge here is to faithfully learn the non-local
clouds, a 3D data representation that demands invariance to interactions between body-joints and mesh vertices (e.g.,
number of points and their permutations. The authors study hand and foot). The expressivity of Transformer network
20

Fig. 24: Mesh Transformer architecture. The joint and vertex queries are appended with positional embeddings and passed
through multiple self-attention layers to jointly regress 3D coordinates of joints and mesh vertices. Figure is from [37].

is used to jointly model vertex to vertex relationships in a [3] basic model (with 109 million parameters) took around
mesh as well as the vertex to body-joint relationships. The 1.89 peta-flop days1 for training, while the latest GPT3 [6]
self-attention mechanism can attend to any combination of model (175 billion parameters) took around 3640 peta-flop
vertices in the mesh, thereby encoding non-local relation- days for training (a staggering ∼1925x increase). This comes
ships. with a huge price tag, e.g., according to one estimate [180],
The multi-layer Transformer architecture sequentially GPT3 training might have cost OpenAI around 4.6 million
performs dimensionality reduction to map the 2D image to USD. Additionally, these large-scale models require aggres-
3D mesh. Position encoding is performed using the 3D coor- sive compression techniques (e.g., distillation) to make their
dinates (x,y ,z ) of each vertex and each body-joint. Similar to inference feasible for real-world settings.
masked language modeling in NLP, METRO uses masked In the language domain, recent works focus on reducing
vertex modeling (MVM) which randomly masks some per- the high complexity of Transformer models (basically aris-
centage of input queries (see Fig. 24). The Transformer is ing from the self-attention mechanism [1] where a token’s
tasked with regressing all the joints and vertices which representation is updated by considering all tokens from the
helps encode inter-dependencies between them. METRO previous layer). For example, [161], [185] explore selective
obtains state-of-the-art results on human mesh reconstruc- or sparse attention to previous layer tokens which updating
tion on two publicly available datasets (Human3.6M [177] each next layer token. Linformer [33] reduces complexity of
and 3DPW [178]). Since the approach does not depends standard self-attention operation from O(n2 ) to O(n) (both
on a parametric mesh model, it generalizes well to other in time and memory requirements). The main idea is to
reconstruction tasks such as 3D hand reconstruction [179]. show that a low-rank matrix is sufficient to model the self-
Overall, this is the first effort to employ Transformers for 3D attention mechanism. The Reformer model [186] employed
human reconstruction tasks and leads to fairly good results. locally-sensitive hashing (LSH) to minimize the complexity
of self-attention from O(n2 ) to O(nlog(n)). In similar pur-
4 O PEN P ROBLEMS & F UTURE D IRECTIONS suit, the recent Lambda Networks propose to model context
as a linear function which helps reduce complexity of self-
Despite excellent performance from Transformer models
attention [187].
and their interesting salient features (Table 1), there ex-
Vyas et al. [188] developed an efficient cluster attention
ist several challenges associated with their applicability to
approach to deal with large input sequences that approx-
practical settings (Table 2). The most important bottlenecks
imates the original self-attention. They propose a cluster
include requirement for large-amounts of training data and
attention approach that groups queries into clusters and
associated high computational costs. There have also been
then computes attention between cluster centers (instead
some challenges to visualize and interpret Transformer
of attention between all the queries that leads to quadratic
models. In this section, we provide an overview of these
complexity). The main idea is that the queries close in the
challenges, mention some of the recent efforts to address
Euclidean space should have similar attention distributions.
those limitations and highlight the open research questions.
With a fixed number of clusters, this intuition helps reduce
the quadratic complexity to linear complexity of O(nc)
4.1 High Computational Cost with respect to the input sequence length n (where c is
As discussed in Sec. 1, Transformer models have high the number of clusters). We refer readers to [31] for a nice
parametric complexity. This results in high training and
inference cost, both in terms of computational time and 1. A peta-flop day is measure of computation and equals to perform-
resources required for processing. As an example, the BERT ing 1015 neural net operations per second for one complete day.
21

Task Method Design Highlights (focus on differences Input Data Type Label Type Loss
with the standard form)
Image ViT [11] Directly adopted NLP Transformer En- 2D Image Class labels Cross-entropy
Classification coder for images, Mechanism to linearly
embed image patches with positional
embedding suitable for the Encoder.
DeiT [12] Transformer as s student while CNN as 2D Image Class labels Cross-entropy,
a teacher, Distillation tokens to produce Distillation loss
estimated labels from teacher, Attention based on
between class and distillation tokens. KL-divergence
CLIP [81] Jointly train image and text encoders on 2D Images & texts Image-text Symmetric
image-text pairs, to maximize similarity pairs cross-entropy
of valid pairs and minimize otherwise
Object Detection DETR [13] Linear projection layer to reduce CNN 2D Image Class labels Hungarian loss
feature dimension, Spatial positional based on
embedding added to each multi-head bipartite
self-attention layer of both encoder and matching
decoder. Object queries (output posi- between
tional encoding) added to each multi- predicted and
head self-attention layer of decoder. ground truths
D-DETR [14] Deformable Transformer consists of de- 2D Image Class labels Hungarian loss
formable attention layers to introduce
sparse priors in Transformers, Multi-
scale attention module.
Low Shot CT [25] Self-supervised pretraining, Query- 2D Image Pretraining Normalized
Learning aligned class prototypes that provide without labels Cross-entropy
spatial correspondence between the and few-shot
support-set images and query image. learning with
Class labels
Image ColTran [24] Conditional Row/column multi-head 2D Image 2D Image Negative
Colorization attention layers, Progressive multi-scale log-likelihood
colorization scheme. of the images
Action ST-TR [164] Spatial and Temporal self-attention to Skeleton Action Classes Cross-entropy
Recognition operates on graph data such as joints in
skeletons.
Super-resolution TTSR [16] Texture enhancing Transformer module, 2D Image 2D Image Reconstruction
Relevance embeddings to compute the loss, Perceptual
relevance between the low-resolution loss defined on
and reference image. pretrained
VGG19
features.
Multi-Model Oscar [36] Transformer layer to jointly process 2D Image Captions, Class Negative
Learning triplet representation of image-text labels, Object log-likelihood
[words, tags, features], Masked tokens tags of masked
to represent text data. tokens,
Contrastive
binary
cross-entropy
3D Classifica- PT [173] Point Transformer block, Transition CAD models, 3D object Object and Cross-entropy
tion/Segmentation down block to reduce cardinality of the part segmentation shape
point set, Transition up for dense pre- categories
diction tasks.
3D Mesh METRO [37] Progressive dimensionality reduction 2D Image 3D Mesh + L1 loss on
Reconstruction across Transformer layers, Positional Human Pose mesh vertices
Encoding with 3D joint and 3D vertex and joints in 3D
coordinates, Masked vertex/joint mod- and 2D
eling. projection.
Vision and Chen et al. [149] Uni-modal encoders on language and Instruction text + Navigation Cross-entropy
Language map inputs followed by a cross-modal RGBD panorama + Plan over nodes and
Navigation transformer, Trajectory position encod- Topological [stop] action
ings in the map encoder. Environment Map
Referring Image CMSA [15] Multimodal feature, Cross-modal self- 2D Image + Language Segmentation Binary
Segmentation attention on multiple levels and their expression mask cross-entropy
fusion using learned gates. loss
Video Lee et al. [134] Operates on real-valued audio-visual Audio-Visual Activity labels Contrastive
Classification signals instead of tokens, Contrastive InfoNCE loss
learning for pre-training, End-to-end and Binary
multimodal transformer learning. cross-entropy

TABLE 1: A summary of key design choices adopted in different variants of transformers for a representative set of
computer vision applications. The main changes relate to specific loss function choices, architectural modifications, different
position embeddings and variations in input data modalities.
22

Task Method Metric Dataset Performance Highlights Limitations

Image ViT [11] Top-1 Acc. ImageNet 88.55 a) First application of Transformer a) Requires training on large-scale
Classifica- ICLR’21 (global self-attention) directly on data e.g., 300-Million images, b)
tion image patches, b) Convolution-free Requires careful transfer learning
network architecture, c) Outper- to the new task, c) Requires large
forms CNN models such as ResNet. model with 632-Million parameters
to achieve SOTA results.
DeiT [12] Top-1 Acc. ImageNet 83.10 a) Successfully trains Transformer a) Requires access to pretrained
arXiv’20 on ImageNet only, b) Introduces CNN based teacher model thus per-
attention-based distillation method. formance depends on the quality of
c) Produces competitive perfor- the teacher model.
mance with small (86-Million pa-
rameters) Transformers.

Low-Shot CT [25] Top-1 Acc. ImageNet 62.25 a) Self-supervised pre-training Proposed algorithm is limited in its
Learning NeurIPS’20 COCO 60.35 mechanism that does not need capacity to perform on datasets that
manual labels, b) Dynamic lack spatial details such as texture.
inference using Transformer
achieving stat-of-the-art results.

Object DETR [13] AP COCO 44.9 a) Use of Transformer allows end- a) Performs poorly on small objects,
Detection ECCV’20 to-end training pipeline for object b) Requires long training time to
detection, b) Removes the need for converge.
hand-crafted post-processing steps.
D-DETR [14] AP COCO 43.8 a) Achieves better performance on Obtain SOTA results with 52.3 AP
ICLR’21 small objects than DETR [13], b) but with two stage detector design
Faster convergence than DETR [13] and test time augmentations.

Image ColTran [24] FID ImageNet 19.71 a) First successful application of a) Lacks end-to-end training, b)
Coloriza- ICLR’21 Transformer to image colorization, limited to images of size 256×256.
tion b) Achieves SOTA FID score.

Action ST-TR [164] Top-1 Acc. NTU 94.0/84.7 a) Successfully applies Transformer Proposed Transformers do not pro-
Recogni- arXiv’20 60/120 to model relations between body cess joints directly rather operate on
tion joints both in spatial and temporal features extracted by a CNN, thus
domain, b) Achieves SOTA results. the overall model is based on hand-
crafted design.
CUFED5 27.1 / 0.8 a) Achieves state-of-the-art super- a) Proposed Transformer does not
TTSR [16] PSNR/ Sun80 30.0 / 0.81 resolution by using attention, b) process images directly but features
Super-
CVPR’20 SSIM Urban100 25.9 / 0.78 Novel Transformer inspired archi- extracted by a convolution based
Resolution
Manga109 30.1 / 0.91 tectures that can process multi-scale network, b) Model with large num-
features. ber of trainable parameters, and c)
Compute intensive.
ViLBERT VQA [135]/ a) Proposed Transformer architec- a) Requires large amount of data
Acc./
Multi- [133] Retrieval 70.6/ 58.2 ture can combine text and visual for pre-training, b) Requires fine
mAP (R@1)
Model NeurIPS’19 [181] information to understand inter- tuning to the new task.
Learning task dependencies, b) Achieves pre-
training on unlabelled dataset.
Oscar [36] Acc./ VQA [182]/
80.37/57.5 a) Exploit novel supervisory signal Requires extra supervision through
ECCV’20 mAP (R@1) COCO
via object tags to achieve text and pre-trained object detectors thus
image alignment, b) Achieves state- performance is dependent on the
of-the-art results. quality of object detectors.
Acc./ VQA [135]/ Learns fine-grained relation align- Requires large multi-task datasets
UNITER [35] Avg. Flickr30K 72.47/83.72 ment between text and images
ECCV’20 for Transformer training which lead
(R@1/5/10) [183] to high computational cost.
Point Trans- ModelNet40 92.8 a) Transformer based attention ca- a) Only moderate improvements
3D Top-1 Acc.
former [173] [175] pable to process unordered and un- over previous SOTA, b) Large num-
Analysis IoU 85.9
arXiv’20 structured point sets, b) Permuta- ber of trainable parameters around
tion invariant architecture. 6× higher than PointNet++ [184].
MPJPE 77.1 a) Does not depend on parametric Dependent on hand-crafted net-
METRO [37] 3DPW
PA-MPJPE 47.9 mesh models so easily extendable work design.
arXiv’20 [178]
MPVE 88.2 to different objects, b) Achieves
SOTA results using Transformers.

TABLE 2: A summary of advantages and limitations of different Transformers based methods in different Tasks. (CT: Cross
Transformers, AP: Average Precision, mAP: mean AP, IoU: Intersection over Union, FID: Fréchet inception distance, MPJPE:
Mean Per Joint Position Error, MPVE: Mean Per Vertex Error).
23

literature survey on efficient Transformers in NLP. a parallel branch in the Transformers to improve person
Similar to the NLP domain, computer vision models re-identification. A recent work [189] rearranges the spa-
also suffer from the high computational cost of Trans- tially close tokens to better model relationships in spatially
former models. For example, image generators that are proximal locations. One may argue that the architectures
based on sequence-based Transformers (e.g., iGPT) have like Transformer models should remain generic to be di-
a high compute cost limiting their applicability to high- rectly applicable across domains, we notice that the high
resolution inputs. In future, it is interesting to explore how computational and time cost for pre-training such models
such models can be extended to high-dimensional cases demands novel design strategies to make their training
e.g., using a multi-scale transformer design with a somewhat more affordable on vision problems.
local context modeling. By inducing inductive biases based
on our understanding of the visual learning tasks (e.g.,
spatial relationships in the local neighbourhood), the high 4.4 Interpretability of Transformers
computational cost can be reduced. Similarly, using sparse Given the strong performance of Transformer architectures,
attention maps modeled with low-rank factorization in the it is interesting and critical to interpret their decisions, e.g.,
matrices can also help towards reducing the computational by visualizing relevant regions in an image for a given
cost [160]. classification decision. The main challenge is that the at-
tention originating in each layer, gets inter-mixed in the
4.2 Large Data Requirements subsequent layers in a complex manner, making it difficult
to visualize the relative contribution of input tokens towards
Since Transformer architectures do not inherently encode
final predictions. This is an open problem, however, some
inductive biases (prior knowledge) to deal with visual
recent works [191]–[193] target enhanced interpretability of
data, they typically require large amounts of training data
Transformers and report encouraging results. Attention roll-
during pre-training to figure out the underlying modality-
out and attention flow methods were proposed in [192]
specific rules. For example, a CNN has inbuilt translation
to estimate the accurate attentions. However, this method
invariance, weight sharing, and partial scale invariance
functions in an ad-hoc manner and makes simplistic as-
due to pooling operations or multi-scale processing blocks.
sumptions e.g., input tokens are linearly combined using
However, a Transformer network needs to figure out these
attention weights across the layers. Chefer et al. [193] note
image-specific properties on its own by looking at a large
that the attention scores obtained directly via the self-
number of examples. Similarly, relationships between video
attention process (encoding relationships between tokens)
frames need to be discovered automatically by the self-
or reassignments in [192] do not provide an optimal solu-
attention mechanism by looking at a large database of video
tion. As an alternative, they propose to assign and propagate
sequences. This results in longer training times, a significant
relevancy scores in the Transformer network such that the
increase in computational requirements, and large datasets
sum of relevancy is constant throughout the network. Their
for processing. For example, the ViT [11] model requires
design can handle both the positive and negative attribu-
hundreds of millions of image examples to obtain a decent
tions experienced in the self-attention layer. The proposed
performance on the ImageNet benchmark dataset. The ques-
framework has an added advantage of being able to provide
tion of learning a Transformer in a data-efficient manner is
class-specific visualizations. Despite these seminal works,
an open research problem and recent works report encour-
visualizing and interpreting Transformers is an unsolved
aging steps towards its resolution. For example, DeiT [12]
problem and methods are needed to obtain spatially precise
uses a distillation approach to achieve data efficiency while
activation-specific visualizations. Further progress in this
T2T (Tokens-to-Token) ViT [189] models local structure by
direction can help in better understanding the Transformer
combining spatially close tokens together, thus leading to
models, diagnosing any erroneous behaviors and biases
competitive performance when trained only on ImageNet
in the decision process. It can also help us design novel
from scratch (without pre-training).
architectures that can help us avoid any biases.

4.3 Vision Tailored Transformer Designs


We note that most of the existing works focused on vision 4.5 Hardware Efficient Designs
tasks tend to directly apply Transformer models on com- Large-scale Transformer networks can have intensive power
puter vision problems. These include architectures designed and computation requirements, hindering their deployment
for image recognition [11], video understanding [17] and es- on edge devices and resource-constrained environments
pecially multi-modal processing [133]. Although the initial such as internet-of-things (IoT) platforms. Some recent ef-
results from these simple applications are quite encouraging forts have been reported to compress and accelerate NLP
and motivate us to look further into the strengths of self- models on embedded systems such as FPGAs [194]. Li et
attention and self-supervised learning, current architectures al. [194] used an enhanced block-circulant matrix-based rep-
may still remain better tailored for language problems (with resentation to compress NLP models and proposed a new
a sequence structure) and need further intuitions to make Field Programmable Gate Array (FPGA) architecture design
them more efficient for visual inputs. For example, vector to efficiently manage resources for high throughput and low
attention from [75] is a nice work in this direction which latency. They could achieve 27x, 3x and 81x improvements
attempts to specifically tailor self-attention operation for in performance (throughput measured in FPS), reduced
visual inputs via learning channel-wise attentions. Similarly, power consumption, and energy efficiency relative a CPU
[190] uses a Jigsaw puzzle based self-supervision loss as for RoBERTa model [7]. Towards this goal, [195] proposed
24

to design Hardware-Aware Transformers (HAT) using neu- and limitations of the existing methods and particularly
ral architecture search strategies [196]–[198]. Specifically, a elaborate on the important future research directions. With
SuperTransformer model is first trained for performance its specific focus on computer vision tasks, this survey pro-
approximation which can estimate a model’s performance vides a unique view of the recent progress in self-attention
without fully training it. This model comprises the largest and Transformer-based methods. We hope this effort will
possible model in the search space while sharing weights drive further interest in the vision community to leverage
between common parts. Eventually, an evolutionary search the potential of Transformer models and improve on their
is performed considering the hardware latency constraints current limitations e.g., reducing their carbon footprint.
to find a suitable SubTransformer model for a target hard-
ware platform (e.g., IoT device, GPU, CPU). However, such
hardware efficient designs are currently lacking for the ACKNOWLEDGMENTS
vision Transformers to enable their seamless deployment
The authors would like to thank Tim Prangemeier (TU Darmstadt), Lu-
in resource-constrained devices. Further, the search cost of
owei Zhou (Microsoft Research), Jason Corso (University of Michigan),
the evolutionary algorithms remains significant with the
Pichao Wang (Alibaba Group), Yuqing Wang (Meituan) and Manoj
associated impact of CO2 emissions on the environment.
Kumar (Google Brain) for their helpful feedback on the survey. We
would also like to thank Mohamed Afham for his help with a figure.
4.6 Leveraging Rich Multi-modal Annotations
In cases, where training data is available with dense labels in
multiple domains (e.g., language and vision [17], [199]), an
R EFERENCES
interesting question to consider is whether the pre-training [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
process leveraging rich labels on a small dataset speedup Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
in NeurIPS, 2017. 1, 2, 3, 4, 6, 8, 10, 12, 13, 16, 17, 18, 19, 20
its learning. This question has been explored in Virtex [200],
[2] M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural
a model that seeks to learn strong visual representations machine translation,” in WMT, 2018. 1
using dense textual annotations (e.g., image captions). Since, [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
the captions encode information about objects present in training of deep bidirectional transformers for language under-
standing,” arXiv preprint arXiv:1810.04805, 2018. 1, 2, 3, 4, 13, 15,
an image, their relationships, actions and attributes, they 17, 20, 24
can provide better supervision to learn more generalizable [4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Im-
and transferable representations. Particularly, they showed proving language understanding by generative pre-training,”
that a model trained with a visual backbone followed by a tech. rep., OpenAI, 2018. 1, 4
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
bidirectional language model (forward and backward Trans- “Language models are unsupervised multitask learners,” tech.
formers) [3] to predict captions, can learn strong features rep., OpenAI, 2019. 1, 8, 10
on MS-COCO dataset in an unsupervised manner. When [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell,
these features are transferred to the ImageNet model, they
et al., “Language models are few-shot learners,” arXiv preprint
perform better or equally-well compared to the unsuper- arXiv:2005.14165, 2020. 1, 20
vised/supervised features learned directly on the ImageNet [7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy,
dataset. Since, Transformer models can process multiple M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A ro-
bustly optimized bert pretraining approach,” arXiv preprint
modalities in a unified architecture, it will be interesting arXiv:1907.11692, 2019. 1, 2, 23
to explore how densely annotated datasets can reduce the [8] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
data requirement of Transformers and if dense-annotations Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer
allow transferring well to novel unseen conditions in one learning with a unified text-to-text transformer,” arXiv preprint
arXiv:1910.10683, 2019. 1
particular modality at inference. [9] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang,
M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant
models with conditional computation and automatic sharding,”
5 C ONCLUSION arXiv preprint arXiv:2006.16668, 2020. 1
[10] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling
Attention has played a key role in delivering efficient to trillion parameter models with simple and efficient sparsity,”
and accurate computer vision systems, while simultane- arXiv preprint arXiv:2101.03961. 1, 3
ously providing insights into the function of deep neu- [11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
ral networks. This survey reviews the self-attention ap- et al., “An image is worth 16x16 words: Transformers for image
proaches and specifically focuses on the Transformer and bi- recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. 1, 3,
directional encoding architectures that are built on the prin- 5, 6, 7, 17, 21, 22, 23
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and
ciple of self-attention. We first cover fundamental concepts H. Jégou, “Training data-efficient image transformers & distilla-
pertaining to self-attention architectures and later provide tion through attention,” arXiv preprint arXiv:2012.12877, 2020. 1,
an in-depth analysis of competing approaches for a broad 5, 7, 17, 21, 22, 23
range of computer vision applications. Specifically, we in- [13] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
S. Zagoruyko, “End-to-end object detection with transformers,”
clude state of the art self-attention models for image recog- arXiv preprint arXiv:2005.12872, 2020. 1, 3, 8, 17, 21, 22
nition, object detection, semantic and instance segmentation, [14] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable
video analysis and classification, visual question answering, DETR: Deformable transformers for end-to-end object detection,”
arXiv preprint arXiv:2010.04159, 2020. 1, 8, 21, 22
visual commonsense reasoning, image captioning, vision-
[15] L. Ye, M. Rochan, Z. Liu, and Y. Wang, “Cross-modal self-
language navigation, clustering, few-shot learning, and 3D attention network for referring image segmentation,” in CVPR,
data analysis. We systematically highlight the key strengths 2019. 1, 9, 21
25

[16] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, “Learning texture [45] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros,
transformer network for image super-resolution,” in CVPR, 2020. “Context encoders: Feature learning by inpainting,” in CVPR,
1, 11, 12, 21, 22 2016. 4
[17] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, [46] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
“VideoBERT: A joint model for video and language represen- Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver-
tation learning,” in ICCV, 2019. 1, 13, 14, 15, 16, 17, 23, 24 sarial nets,” in NeurIPS, 2014. 4, 10, 11, 12
[18] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video [47] D. Lin, K. Fu, Y. Wang, G. Xu, and X. Sun, “MARTA GANs:
action transformer network,” in CVPR, 2019. 1, 3, 17 Unsupervised representation learning for remote sensing image
[19] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, classification,” GRSL, 2017. 4
C. Xu, and W. Gao, “Pre-trained image processing transformer,” [48] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised
arXiv preprint arXiv:2012.00364, 2020. 1, 7, 11, 12 learning of spatiotemporal context for video action recognition,”
[20] A. Ramesh, M. Pavlov, G. Goh, and S. Gray, “DALL·E: Creating in WACV, 2019. 4
images from text,” tech. rep., OpenAI, 2021. 1, 3, 11 [49] M. Noroozi and P. Favaro, “Unsupervised learning of visual
[21] H. Tan and M. Bansal, “LXMERT: Learning cross-modality en- representations by solving jigsaw puzzles,” in ECCV, 2016. 4
coder representations from transformers,” in EMNLP-IJCNLP, [50] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Learning image
2019. 1, 13, 14, 15, 16, 17 representations by completing damaged jigsaw puzzles,” WACV,
[22] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VL-BERT: 2018. 4
Pre-training of generic visual-linguistic representations,” arXiv [51] L. Jing, X. Yang, J. Liu, and Y. Tian, “Self-supervised spatiotempo-
preprint arXiv:1908.08530, 2019. 1, 3, 4, 14, 15, 17 ral feature learning via video rotation prediction,” arXiv preprint
[23] X. Wang, C. Yeshwanth, and M. Nießner, “SceneFormer: arXiv:1811.11387, 2018. 4
Indoor scene generation with transformers,” arXiv preprint [52] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervised
arXiv:2012.09793, 2020. 1, 9, 11 representation learning by sorting sequences,” in ICCV, 2017. 4
[24] M. Kumar, D. Weissenborn, and N. Kalchbrenner, “Colorization [53] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsu-
transformer,” in ICLR, 2021. 1, 11, 13, 21, 22 pervised learning using temporal order verification,” in ECCV,
[25] C. Doersch, A. Gupta, and A. Zisserman, “CrossTransformers: 2016. 4
spatially-aware few-shot transfer,” NeurIPS, 2020. 1, 18, 21, 22 [54] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning
[26] H.-J. Ye, H. Hu, D.-C. Zhan, and F. Sha, “Few-shot learning via and using the arrow of time,” in CVPR, 2018. 4
embedding adaptation with set-to-set functions,” in CVPR, 2020. [55] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang,
1, 18, 19 “VisualBERT: A simple and performant baseline for vision and
[27] Y. Bengio, I. Goodfellow, and A. Courville, Deep learning. MIT language,” in Arxiv preprint arXiv:1908.03557, 2019. 4, 14, 15
press, 2017. 1 [56] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of
[28] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, audio and video models from self-supervised synchronization,”
2015. 1 in NeurIPS, 2018. 4
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” [57] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in
Neural computation, 1997. 1, 16, 18 ICCV, 2017. 4, 17
[30] D. Hu, “An introductory survey on attention mechanisms in nlp [58] N. Sayed, B. Brattoli, and B. Ommer, “Cross and learn: Cross-
problems,” in IntelliSys, 2019. 2 modal self-supervision,” in GCPR, 2018. 4
[31] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient trans- [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
formers: A survey,” arXiv preprint arXiv:2009.06732, 2020. 2, 20 image recognition,” in CVPR, 2016. 4, 8, 17
[32] S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal, [60] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,”
“An attentive survey of attention models,” arXiv preprint arXiv preprint arXiv:1607.06450, 2016. 4
arXiv:1904.02874, 2019. 2 [61] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
[33] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self- “ImageNet: A large-scale hierarchical image database,” in CVPR,
attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2009. 5, 9
2020. 2, 20 [62] H. Hu, Z. Zhang, Z. Xie, and S. Lin, “Local relation networks for
[34] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self- image recognition,” in ICCV, 2019. 5, 6
attention generative adversarial networks,” in International con- [63] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu,
ference on machine learning, pp. 7354–7363, PMLR, 2019. 2 “CCNet: Criss-cross attention for semantic segmentation,” in
[35] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, ICCV, 2019. 5, 6
and J. Liu, “UNITER: Universal image-text representation learn- [64] N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya,
ing,” in ECCV, 2020. 3, 4, 14, 15, 22 and J. Shlens, “Stand-alone self-attention in vision models,” in
[36] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, NeurIPS, 2019. 5, 6, 9
L. Dong, F. Wei, et al., “Oscar: Object-semantics aligned pre- [65] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for
training for vision-language tasks,” in ECCV, 2020. 3, 14, 15, image denoising,” in CVPR, 2005. 5, 6
21, 22 [66] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural
[37] K. Lin, L. Wang, and Z. Liu, “End-to-end human pose networks,” in CVPR, 2018. 5, 6
and mesh reconstruction with transformers,” arXiv preprint [67] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vi-
arXiv:2012.09760, 2020. 3, 19, 20, 21, 22 jayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al.,
[38] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised repre- “The kinetics human action video dataset,” arXiv preprint
sentation learning by predicting image rotations,” arXiv preprint arXiv:1705.06950, 2017. 5, 16, 17
arXiv:1803.07728, 2018. 3, 4 [68] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
[39] “Revisiting the unreasonable effectiveness of data.” https://ai. R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes
googleblog.com/2017/07/revisiting-unreasonable-effectiveness. dataset for semantic urban scene understanding,” in CVPR, 2016.
html. Accessed: 2020-12-31. 3, 7 5, 9
[40] L. Jing and Y. Tian, “Self-supervised visual feature learning with [69] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba,
deep neural networks: A survey,” TPAMI, 2020. 3 “Scene parsing through ade20k dataset,” in CVPR, 2017. 5
[41] X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and [70] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
J. Tang, “Self-supervised learning: Generative or contrastive,” P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects
arXiv preprint arXiv:2006.08218, 2020. 3 in context,” in ECCV, 2014. 5, 9, 15
[42] “Aaai 2020 keynotes turing award winners event.” https://www. [71] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint
youtube.com/watch?v=UX8OubxsY8w. Accessed: 2020-12-31. 3 body parsing & pose estimation network and a new benchmark,”
[43] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” TPAMI, 2018. 5
in ECCV, 2016. 4 [72] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object
[44] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, classes in video: A high-definition ground truth database,” Pat-
A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo- tern Recognition Letters, 2009. 5
realistic single image super-resolution using a generative adver- [73] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention
sarial network,” in CVPR, 2017. 4, 12 augmented convolutional networks,” in ICCV, 2019. 6
26

[74] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with rel- [103] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer
ative position representations,” arXiv preprint arXiv:1803.02155, networks in unsupervised feature learning,” in AISTATS, 2011.
2018. 6 10
[75] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image [104] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple
recognition,” in CVPR, 2020. 6, 7, 19, 23 framework for contrastive learning of visual representations,”
[76] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Good- arXiv preprint arXiv:2002.05709, 2020. 10
fellow, and R. Fergus, “Intriguing properties of neural networks,” [105] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning repre-
arXiv preprint arXiv:1312.6199, 2013. 7 sentations by maximizing mutual information across views,” in
[77] M. M. Naseer, S. H. Khan, M. H. Khan, F. S. Khan, and F. Porikli, NeurIPS, 2019. 10
“Cross-domain transferability of adversarial perturbations,” in [106] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Es-
NeurIPS, 2019. 7 lami, and A. v. d. Oord, “Data-efficient image recognition with
[78] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in contrastive predictive coding,” arXiv preprint arXiv:1905.09272,
a neural network,” arXiv preprint arXiv:1503.02531, 2015. 7 2019. 10
[79] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, [107] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum con-
“Designing network design spaces,” in CVPR, 2020. 7 trast for unsupervised visual representation learning,” in CVPR,
[80] M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for 2020. 10
convolutional neural networks,” arXiv preprint arXiv:1905.11946, [108] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview cod-
2019. 7 ing,” arXiv preprint arXiv:1906.05849, 2019. 10
[81] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- [109] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, “A
wal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning guide to convolutional neural networks for computer vision,”
transferable visual models from natural language supervision,” Synthesis Lectures on Computer Vision, 2018. 10
Image, vol. 2, p. T2, 2021. 8, 21 [110] A. Radford, L. Metz, and S. Chintala, “Unsupervised represen-
[82] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, tation learning with deep convolutional generative adversarial
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, networks,” arXiv preprint arXiv:1511.06434, 2015. 10
J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: [111] C. Gao, Y. Chen, S. Liu, Z. Tan, and S. Yan, “Adversarialnas: Ad-
Transformers for image recognition at scale,” 2020. 8, 11 versarial neural architecture search for gans,” in Proceedings of the
[83] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: To- IEEE/CVF Conference on Computer Vision and Pattern Recognition,
wards real-time object detection with region proposal networks,” pp. 5680–5689, 2020. 10
TPAMI, 2016. 8, 17 [112] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
[84] R. Girshick, “Fast R-CNN,” in ICCV, 2015. 8 “Analyzing and improving the image quality of stylegan,” in
[85] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE/CVF Conference on Computer Vision and
ICCV, 2017. 8, 18 Pattern Recognition, pp. 8110–8119, 2020.
[86] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only [113] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
look once: Unified, real-time object detection,” in CVPR, 2016. 8 “Generative adversarial text to image synthesis,” in ICML, 2016.
[87] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and 11
A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, 2016. [114] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.
8 Metaxas, “StackGAN: Text to photo-realistic image synthesis
[88] T. Prangemeier, C. Reich, and H. Koeppl, “Attention-based trans- with stacked generative adversarial networks,” in ICCV, 2017.
formers for instance segmentation of cells in microstructures,” in 11
2020 IEEE International Conference on Bioinformatics and Biomedicine [115] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.
(BIBM), pp. 700–707, IEEE, 2020. 8 Metaxas, “StackGAN++: Realistic image synthesis with stacked
[89] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and generative adversarial networks,” TPAMI, 2018. 11
S. Belongie, “Feature pyramid networks for object detection,” in [116] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and
CVPR, 2017. 8 X. He, “AttnGAN: Fine-grained text to image generation with
[90] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, attentional generative adversarial networks,” in CVPR, 2018. 11
“Deformable convolutional networks,” in ICCV, 2017. 8 [117] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”
[91] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. arXiv preprint arXiv:1312.6114, 2013. 11
Chen, “Axial-DeepLab: Stand-alone axial-attention for panoptic [118] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse
segmentation,” arXiv preprint arXiv:2003.07853, 2020. 9 high-fidelity images with vq-vae-2,” in NeurISP, 2019. 11
[92] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic [119] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep
segmentation,” in CVPR, 2019. 9 residual networks for single image super-resolution,” in CVPRW,
[93] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, “The 2017. 11
mapillary vistas dataset for semantic understanding of street [120] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep
scenes,” in ICCV, 2017. 9 recursive residual network,” in CVPR, 2017. 11
[94] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling [121] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S. Huang,
context in referring expressions,” in ECCV, 2016. 9 “Image super-resolution via dual-state recurrent networks,” in
[95] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and CVPR, 2018. 11
K. Murphy, “Generation and comprehension of unambiguous [122] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image
object descriptions,” in CVPR, 2016. 9 super-resolution using very deep residual channel attention net-
[96] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Refer- works,” in ECCV, 2018. 11, 13
itgame: Referring to objects in photographs of natural scenes,” [123] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense
in EMNLP, 2014. 9 network for image restoration,” TPAMI, 2020. 11
[97] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, [124] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and
and D. Tran, “Image transformer,” in ICML, 2018. 9, 10, 11 C. Change Loy, “ESRGAN: enhanced super-resolution generative
[98] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and adversarial networks,” in ECCVW, 2018. 12
I. Sutskever, “Generative pretraining from pixels,” in ICML, 2020. [125] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, “SRFEAT: Single
9, 10, 11 image super-resolution with feature discrimination,” in ECCV,
[99] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for 2018. 12
high-resolution image synthesis,” arXiv preprint arXiv:2012.09841, [126] M. S. Sajjadi, B. Scholkopf, and M. Hirsch, “EnhanceNet: Single
2020. 9, 10, 11 image super-resolution through automated texture synthesis,” in
[100] Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two transformers ICCV, 2017. 12
can make one strong gan,” 2021. 9, 10 [127] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham,
[101] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., “Photo-
A. Graves, et al., “Conditional image generation with pixelcnn realistic single image super-resolution using a generative adver-
decoders,” in NeurIPS, 2016. 9 sarial network,” in CVPR, 2017. 12
[102] A. Krizhevsky, “Learning multiple layers of features from tiny [128] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-
images,” tech. rep., 2009. 10 time style transfer and super-resolution,” in ECCV, 2016. 12
27

[129] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order [154] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end
attention network for single image super-resolution,” in CVPR, dense video captioning with masked transformer,” in Proceedings
2019. 13 of the IEEE Conference on Computer Vision and Pattern Recognition,
[130] B. Niu, W. Wen, W. Ren, X. Zhang, L. Yang, S. Wang, K. Zhang, pp. 8739–8748, 2018. 16
X. Cao, and H. Shen, “Single image super-resolution via a holistic [155] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles,
attention network,” in ECCV, 2020. 13 “Dense-captioning events in videos,” in Proceedings of the IEEE
[131] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans, “Ax- international conference on computer vision, pp. 706–715, 2017. 16
ial attention in multidimensional transformers,” arXiv preprint [156] L. Zhou, C. Xu, and J. Corso, “Towards automatic learning of
arXiv:1912.12180, 2019. 13 procedures from web instructional videos,” in Proceedings of the
[132] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, AAAI Conference on Artificial Intelligence, vol. 32, 2018. 16
“Unicoder-VL: A universal encoder for vision and language by [157] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101
cross-modal pre-training.,” in AAAI, 2020. 14, 15 human actions classes from videos in the wild,” arXiv preprint
[133] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task- arXiv:1212.0402, 2012. 17
agnostic visiolinguistic representations for vision-and-language [158] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence,
tasks,” in NeurIPS, 2019. 13, 14, 15, 16, 17, 22, 23 R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology
[134] S. Lee, Y. Yu, G. Kim, T. Breuel, J. Kautz, and Y. Song, “Param- and human-labeled dataset for audio events,” in ICASSP, 2017.
eter efficient multimodal transformers for video representation 17
learning,” arXiv preprint arXiv:2012.04124, 2020. 14, 16, 21 [159] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and
[135] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, A. Gupta, “Hollywood in homes: Crowdsourcing data collection
C. Lawrence Zitnick, and D. Parikh, “VQA: Visual question for activity understanding,” in ECCV, 2016. 17
answering,” in ICCV, 2015. 13, 22 [160] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, “Video trans-
[136] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, “From recognition to former network,” arXiv preprint arXiv:2102.00719, 2021. 17, 23
cognition: Visual commonsense reasoning,” in CVPR, 2019. 13 [161] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-
[137] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross document transformer,” arXiv preprint arXiv:2004.05150, 2020. 17,
attention for image-text matching,” in ECCV, 2018. 13 20
[138] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A [162] L. Yang, Y. Fan, and N. Xu, “Video instance segmentation,” in
neural image caption generator,” in CVPR, 2015. 13 Proceedings of the IEEE/CVF International Conference on Computer
[139] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi, Vision, pp. 5188–5197, 2019. 18
“A corpus for reasoning about natural language grounded in [163] G. Bertasius and L. Torresani, “Classifying, segmenting, and
photographs,” arXiv preprint arXiv:1811.00491, 2018. 14 tracking object instances in video with mask propagation,” in
[140] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual Proceedings of the IEEE/CVF Conference on Computer Vision and
captions: A cleaned, hypernymed, image alt-text dataset for Pattern Recognition, pp. 9739–9748, 2020. 18
automatic image captioning,” in ACL, 2018. 14, 15 [164] C. Plizzari, M. Cannici, and M. Matteucci, “Spatial tempo-
[141] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, ral transformer network for skeleton-based action recognition,”
“Unified vision-language pre-training for image captioning and arXiv preprint arXiv:2008.07404, 2020. 18, 21, 22
vqa,” in Proceedings of the AAAI Conference on Artificial Intelligence, [165] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A
vol. 34, pp. 13041–13049, 2020. 15 large scale dataset for 3d human activity analysis,” in CVPR,
[142] C. Sun, F. Baradel, K. Murphy, and C. Schmid, “Learning video 2016. 18
representations using contrastive bidirectional transformer,” [166] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C.
arXiv preprint arXiv:1906.05743, 2019. 15 Kot, “NTU RGB+D 120: A large-scale benchmark for 3d human
[143] C. Alberti, J. Ling, M. Collins, and D. Reitter, “Fusion of detected activity understanding,” TPAMI, 2019. 18
objects in text for visual question answering,” arXiv preprint
[167] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu,
arXiv:1908.05054, 2019. 15
R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al., “Meta-
[144] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, dataset: A dataset of datasets for learning to learn from few
S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., “Visual examples,” arXiv preprint arXiv:1903.03096, 2019. 18
genome: Connecting language and vision using crowdsourced
[168] T. N. Kipf and M. Welling, “Semi-supervised classification with
dense image annotations,” IJCV, 2017. 15
graph convolutional networks,” arXiv preprint arXiv:1609.02907,
[145] V. Ordonez, G. Kulkarni, and T. L. Berg, “Im2text: Describing 2016. 18
images using 1 million captioned photographs,” in NeurIPS, 2011.
[169] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhut-
15
dinov, and A. J. Smola, “Deep sets,” in NeurIPS, 2017. 18
[146] H. Tan and M. Bansal, “Vokenization: Improving language un-
derstanding with contextualized, visual-grounded supervision,” [170] H. Edwards and A. Storkey, “Towards a neural statistician,” arXiv
arXiv preprint arXiv:2010.06775, 2020. 15, 16 preprint arXiv:1606.02185, 2016. 18
[147] W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards learning [171] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh,
a generic agent for vision-and-language navigation via pre- “Set transformer: A framework for attention-based permutation-
training,” in CVPR, 2020. 16 invariant neural networks,” in ICML, 2019. 18, 19
[148] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, [172] J. Lee, Y. Lee, and Y. W. Teh, “Deep amortized clustering,” arXiv
and D. Batra, “Improving vision-and-language navigation with preprint arXiv:1909.13433, 2019. 19
image-text pairs from the web,” arXiv preprint arXiv:2004.14973, [173] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point trans-
2020. 16 former,” arXiv preprint arXiv:2012.09164, 2020. 19, 21, 22
[149] K. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, [174] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin,
“Topological planning with transformers for vision-and- and S.-M. Hu, “PCT: Point cloud transformer,” arXiv preprint
language navigation,” arXiv preprint arXiv:2012.05292, 2020. 16, arXiv:2012.09688, 2020. 19
21 [175] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao,
[150] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A short “3D ShapeNets: A deep representation for volumetric shapes,” in
note on the kinetics-700 human action dataset,” arXiv preprint CVPR, 2015. 19, 22
arXiv:1907.06987, 2019. 16, 17 [176] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang,
[151] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, “COOT: Co- Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and
operative hierarchical transformer for video-text representation F. Yu, “ShapeNet: An information-rich 3d model repository,”
learning,” arXiv preprint arXiv:2011.00597, 2020. 16 arXiv preprint arXiv:1512.03012, 2015. 19
[152] H. Seong, J. Hyun, and E. Kim, “Video multitask transformer [177] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Hu-
network,” in Proceedings of the IEEE/CVF International Conference man3.6M: Large scale datasets and predictive methods for 3D
on Computer Vision Workshops, pp. 0–0, 2019. 16 human sensing in natural environments,” TPAMI, 2013. 20
[153] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia, [178] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and
“End-to-end video instance segmentation with transformers,” G. Pons-Moll, “Recovering accurate 3d human pose in the wild
arXiv preprint arXiv:2011.14503, 2020. 16, 17 using imus and a moving camera,” in ECCV, 2018. 20, 22
28

[179] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and


T. Brox, “FreiHAND: A dataset for markerless capture of hand
pose and shape from single rgb images,” in ICCV, 2019. 20
[180] “OpenAI’s GPT-3 language model: A technical overview.” https:
//lambdalabs.com/blog/demystifying-gpt-3/. Accessed: 2020-
12-31. 20
[181] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image
descriptions to visual denotations: New similarity metrics for
semantic inference over event descriptions,” TACL, 2014. 22
[182] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh,
“Making the v in vqa matter: Elevating the role of image un-
derstanding in visual question answering,” in CVPR, 2017. 22
[183] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hock-
enmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-
to-phrase correspondences for richer image-to-sentence models,”
in ICCV, 2015. 22
[184] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep
hierarchical feature learning on point sets in a metric space,”
NeurIPS, 2017. 22
[185] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generat-
ing long sequences with sparse transformers,” arXiv preprint
arXiv:1904.10509, 2019. 20
[186] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient
transformer,” in ICLR, 2020. 20
[187] I. Bello, “Lambdanetworks: Modeling long-range interactions
without attention,” in International Conference on Learning Repre-
sentations, 2021. 20
[188] A. Vyas, A. Katharopoulos, and F. Fleuret, “Fast transformers
with clustered attention,” NeurIPS, 2020. 20
[189] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F. E. Tay, J. Feng, and
S. Yan, “Tokens-to-token vit: Training vision transformers from
scratch on imagenet,” arXiv preprint arXiv:2101.11986, 2021. 23
[190] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “Tran-
sreid: Transformer-based object re-identification,” arXiv preprint
arXiv:2102.04378, 2021. 23
[191] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Ana-
lyzing multi-head self-attention: Specialized heads do the heavy
lifting, the rest can be pruned,” arXiv preprint arXiv:1905.09418,
2019. 23
[192] S. Abnar and W. Zuidema, “Quantifying attention flow in trans-
formers,” arXiv preprint arXiv:2005.00928, 2020. 23
[193] H. Chefer, S. Gur, and L. Wolf, “Transformer interpretability
beyond attention visualization,” arXiv preprint arXiv:2012.09838,
2020. 23
[194] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan,
H. Liu, and C. Ding, “FTRANS: energy-efficient acceleration of
transformers using fpga,” in ISLPED, 2020. 23
[195] H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han,
“HAT: Hardware-aware transformers for efficient natural lan-
guage processing,” arXiv preprint arXiv:2005.14187, 2020. 23
[196] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le,
“Understanding and simplifying one-shot architecture search,”
in ICML, 2018. 24
[197] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun,
“Single path one-shot neural architecture search with uniform
sampling,” arXiv preprint arXiv:1904.00420, 2019. 24
[198] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient
neural architecture search via parameter sharing,” arXiv preprint
arXiv:1802.03268, 2018. 24
[199] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach,
“Grounded video description,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 6578–
6587, 2019. 24
[200] K. Desai and J. Johnson, “VirTex: Learning visual representations
from textual annotations,” arXiv preprint arXiv:2006.06666, 2020.
24

You might also like