Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Bichen Wu1, Chenfeng Xu3, Xiaoliang Dai1, Alvin Wan3, Peizhao Zhang1,
Zhicheng Yan2, Masayoshi Tomizuka3, Joseph Gonzalez3, Kurt Keutzer3, Peter Vajda1
1 Facebook Reality Labs, 2 Facebook AI, 3 UC Berkeley

arXiv:2006.03677v4 [cs.CV] 20 Nov 2020
Figure 1: Diagram of a Visual Transformer (VT). For a given image, we first apply convolutional layers to extract low-level features. The output feature map is then fed to the VT: First, apply a tokenizer, grouping pixels into a small number of visual tokens, each representing a semantic concept in the image. Second, apply transformers to model relationships between tokens. Third, visual tokens are directly used for image classification or projected back to the feature map for semantic segmentation.
putation by attending to important regions, instead of treating all pixels equally; 2) encoding semantic concepts in a few visual tokens relevant to the image, instead of modeling all concepts across all images; and 3) relating spatially-distant concepts through self-attention in token-space.

To validate the effectiveness of VT and understand its key components, we run controlled experiments by using VTs to replace convolutions in ResNet, a common test bed for new building blocks, for image classification. We also use VTs to re-design feature-pyramid networks (FPN), a strong baseline for semantic segmentation. Our experiments show that VTs achieve higher accuracy with lower computational cost in both tasks. For the ImageNet [11] benchmark, we replace the last stage of ResNet [14] with VTs, reducing FLOPs of the stage by 6.9x and improving top-1 accuracy by 4.6 to 7 points. For semantic segmentation on COCO-Stuff [2] and Look-Into-Person [25], VT-based FPN achieves 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.4x.

2. Relationship to previous work

Transformers in vision models: A notable recent and relevant trend is the adoption of transformers in vision models. Dosovitskiy et al. propose a Vision Transformer (ViT) [12], dividing an image into 16 × 16 patches and feeding these patches (i.e., tokens) into a standard transformer. Although simple, this requires transformers to learn dense, repeatable patterns (e.g., textures), which convolutions are drastically more efficient at learning. The simplicity incurs an extremely high computational price: ViT requires up to 7 GPU-years and 300M images from the JFT dataset to outperform competing convolutional variants. By contrast, we leverage the respective strengths of each operation, using convolutions for extracting low-level features and transformers for relating high-level concepts. We further use spatial attention to focus on important regions, instead of treating each image patch equally. This yields strong performance despite orders-of-magnitude less data and training time.

Another relevant work, DETR [3], adopts transformers to simplify the hand-crafted anchor matching procedure in object detection training. Although both adopt transformers, DETR is not directly comparable to our VT given their orthogonal use cases, i.e., insights from both works could be used together in one model for compounded benefit.

Graph convolutions in vision models: Our work is also related to previous efforts such as GloRe [6], LatentGNN [47], and [26] that densely relate concepts in latent space using graph convolutions. To augment convolutions, [26, 6, 47] adopt a procedure similar to ours: (1) extracting latent variables to represent in graph nodes (analogous to our visual tokens), (2) applying graph convolution to capture node interactions (analogous to our transformer), and (3) projecting the nodes back to the feature map. Although these approaches avoid spatial redundancy, they are susceptible to concept redundancy: the second limitation listed in the introduction. In particular, by using fixed weights that are not content-aware, the graph convolution expects a fixed semantic concept in each node, regardless of whether the concept exists in the image. By contrast, a transformer uses content-aware weights, allowing visual tokens to represent varying concepts. As a result, while graph convolutions require hundreds of nodes (128 nodes in [4], 340 in [25], 150 in [48]) to encode potential semantic concepts, our VT uses just 16 visual tokens and attains higher accuracy. Furthermore, while modules from [26, 6, 47] can only be added to a pretrained network to augment convolutions, VTs can replace convolution layers to save FLOPs and parameters, and support training from scratch.

Attention in vision models: In addition to being used in transformers, attention is also widely used in different forms in computer vision models [21, 20, 41, 44, 46, 40, 28, 18, 19, 1, 49, 30].
Attention was first used to modulate the feature map: attention values are computed from the input and multiplied with the feature map, as in [41, 21, 20, 44]. Later work [46, 33, 39] interprets this "modulation" as a way to make convolution spatially adaptive and content-aware. In [40], Wang et al. introduce non-local operators, equivalent to self-attention, to video understanding to capture long-range interactions. However, the computational complexity of self-attention grows quadratically with the number of pixels. [1] uses self-attention to augment convolutions and reduces the compute cost by using small channel sizes for attention. [30, 28, 7, 49, 19], on the other hand, restrict the receptive field of self-attention and use it in a convolutional manner. Starting from [30], self-attention is used as a stand-alone building block for vision models. Our work differs from all of the above, since we propose a novel token-transformer paradigm to replace the inefficient pixel-convolution paradigm and achieve superior performance.

Efficient vision models: Many recent research efforts focus on building vision models that achieve better performance with lower computational cost. Early work in this direction includes [23, 31, 13, 17, 32, 16, 48, 27, 43]. Recently, neural architecture search [42, 10, 38, 9, 36, 35] has been used to optimize a network's performance within a search space that consists of existing convolution operators. The efforts above all seek to make the common convolutional neural net more computationally efficient. In contrast, we propose a new building block that naturally eliminates the redundant computations of the pixel-convolution paradigm.

3. Visual Transformer

We illustrate the overall diagram of a Visual Transformer (VT) based model in Figure 1. First, we process the input image with several convolution blocks, then feed the output feature map to VTs. Our insight is to leverage the strengths of both convolutions and VTs: (1) early in the network, use convolutions to learn densely-distributed, low-level patterns, and (2) later in the network, use VTs to learn and relate more sparsely-distributed, higher-order semantic concepts. At the end of the network, use visual tokens for image-level prediction tasks and use the augmented feature map for pixel-level prediction tasks.

A VT module involves three steps: First, group pixels into semantic concepts to produce a compact set of visual tokens. Second, to model relationships between semantic concepts, apply a transformer [37] to these visual tokens. Third, project these visual tokens back to pixel-space to obtain an augmented feature map. Similar paradigms can be found in [6, 47, 26], but with one critical difference: previous methods use hundreds of semantic concepts (termed "nodes"), whereas our VT uses as few as 16 visual tokens to achieve superior performance.

3.1. Tokenizer

Our intuition is that an image can be summarized by a few handfuls of words, or visual tokens. This contrasts with convolutions, which use hundreds of filters, and graph convolutions, which use hundreds of "latent nodes", to detect all possible concepts regardless of image content. To leverage this intuition, we introduce a tokenizer module to convert feature maps into compact sets of visual tokens. Formally, we denote the input feature map by X ∈ R^{HW×C} (height H, width W, channels C) and the visual tokens by T ∈ R^{L×C} s.t. L ≪ HW (L is the number of tokens).

3.1.1 Filter-based Tokenizer

A filter-based tokenizer, also adopted by [47, 6, 26], utilizes convolutions to extract visual tokens. For a feature map X, we map each pixel X_p ∈ R^C to one of L semantic groups using point-wise convolutions. Then, within each group, we spatially pool pixels to obtain tokens T. Formally,

    T = SOFTMAX_HW(X W_A)^T X,    (1)

where A = SOFTMAX_HW(X W_A) ∈ R^{HW×L} is a spatial attention. Here, W_A ∈ R^{C×L} forms semantic groups from X, and SOFTMAX_HW(·) translates these activations into the spatial attention A. Finally, A is multiplied with X to compute weighted averages of the pixels in X, producing L visual tokens. However, many high-level semantic concepts are sparse and may each appear in only a few images. As a result, the fixed set of learned weights W_A potentially wastes computation by modeling all such high-level concepts at once. We call this a "filter-based" tokenizer, since it uses convolutional filters W_A to extract visual tokens.

Figure 2: Filter-based tokenizer that groups pixels using a fixed convolution filter W_A.
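To make Equation (1) concrete, here is a minimal PyTorch sketch of a filter-based tokenizer. It is our illustration rather than the paper's released code; the class and parameter names (FilterTokenizer, w_a) are ours, and the input is assumed to be a flattened feature map of shape (N, HW, C).

import torch
import torch.nn as nn

class FilterTokenizer(nn.Module):
    """Minimal sketch of Eq. (1): T = softmax_HW(X W_A)^T X."""
    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        # W_A in R^{C x L}: a point-wise (1x1) projection implemented as a linear layer.
        self.w_a = nn.Linear(channels, num_tokens, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, HW, C) flattened feature map.
        attn = self.w_a(x)                 # (N, HW, L) group logits
        attn = attn.softmax(dim=1)         # spatial attention A: softmax over the HW dimension
        return attn.transpose(1, 2) @ x    # (N, L, C): A^T X, weighted averages of pixels

# Usage: a 14x14 feature map with 256 channels -> 16 visual tokens of size 256.
x = torch.randn(2, 14 * 14, 256)
tokens = FilterTokenizer(channels=256, num_tokens=16)(x)
print(tokens.shape)  # torch.Size([2, 16, 256])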
3.1.2 Recurrent Tokenizer

To remedy the limitation of filter-based tokenizers, we propose a recurrent tokenizer with weights that are dependent on the previous layer's visual tokens.
The intuition is to let the previous layer's tokens T_in guide the extraction of new tokens for the current layer. The name "recurrent tokenizer" comes from the fact that the current tokens are computed dependent on previous ones. Formally, we define

    W_R = T_in W_{T→R},
    T = SOFTMAX_HW(X W_R)^T X,    (2)

where W_{T→R} ∈ R^{C×C}. In this way, the VT can incrementally refine the set of visual tokens, conditioned on previously-processed concepts. In practice, we apply recurrent tokenizers starting from the second VT module, since it requires tokens from a previous VT.

Figure 3: Recurrent tokenizer that uses previous tokens to guide the token extraction in the current VT module.
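A matching sketch of the recurrent tokenizer in Equation (2), again with hypothetical names (RecurrentTokenizer, w_t_to_r) and the same flattened-feature-map assumption:

import torch
import torch.nn as nn

class RecurrentTokenizer(nn.Module):
    """Minimal sketch of Eq. (2): W_R = T_in W_{T->R}; T = softmax_HW(X W_R)^T X."""
    def __init__(self, channels: int):
        super().__init__()
        # W_{T->R} in R^{C x C}, applied to the previous layer's tokens.
        self.w_t_to_r = nn.Linear(channels, channels, bias=False)

    def forward(self, x: torch.Tensor, tokens_in: torch.Tensor) -> torch.Tensor:
        # x: (N, HW, C) feature map, tokens_in: (N, L, C) tokens from the previous VT.
        w_r = self.w_t_to_r(tokens_in)                     # (N, L, C) content-aware grouping weights
        attn = (x @ w_r.transpose(1, 2)).softmax(dim=1)    # (N, HW, L) spatial attention
        return attn.transpose(1, 2) @ x                    # (N, L, C) new visual tokens

x = torch.randn(2, 14 * 14, 256)
t_in = torch.randn(2, 16, 256)
t_out = RecurrentTokenizer(256)(x, t_in)  # (2, 16, 256)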
3.2. Transformer

After tokenization, we need to model interactions between these visual tokens. Previous works [6, 47, 26] use graph convolutions to relate concepts. However, these operations use fixed weights during inference, meaning each token (or "node") is bound to a specific concept; therefore graph convolutions waste computation by modeling all high-level concepts, even those that appear only in a few images. To address this, we adopt transformers [37], which use input-dependent weights by design. Because of this, transformers support visual tokens with variable meaning, covering more possible concepts with fewer tokens.

We employ a standard transformer with minor changes:

    T'_out = T_in + SOFTMAX_L((T_in K)(T_in Q)^T) T_in,    (3)
    T_out = T'_out + σ(T'_out F_1) F_2,    (4)

where T_in, T'_out, T_out ∈ R^{L×C} are the visual tokens. Different from graph convolution, in a transformer the weights between tokens are input-dependent and computed as a key-query product: (T_in K)(T_in Q)^T ∈ R^{L×L}. This allows us to use as few as 16 visual tokens, in contrast to hundreds of analogous nodes for graph-convolution approaches [6, 47, 26]. After the self-attention, we use a non-linearity and two pointwise convolutions in Equation (4), where F_1, F_2 ∈ R^{C×C} are weights and σ(·) is the ReLU function.
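Equations (3)-(4) can be sketched as a small token-space transformer. The module below is illustrative: K, Q, F_1, and F_2 are modeled as bias-free linear layers, and the softmax dimension is one reasonable reading of SOFTMAX_L.

import torch
import torch.nn as nn

class TokenTransformer(nn.Module):
    """Sketch of Eqs. (3)-(4): self-attention over L tokens plus a residual pointwise MLP."""
    def __init__(self, channels: int):
        super().__init__()
        self.key = nn.Linear(channels, channels, bias=False)    # K
        self.query = nn.Linear(channels, channels, bias=False)  # Q
        self.f1 = nn.Linear(channels, channels, bias=False)     # F_1
        self.f2 = nn.Linear(channels, channels, bias=False)     # F_2

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, L, C)
        k, q = self.key(tokens), self.query(tokens)
        attn = (k @ q.transpose(1, 2)).softmax(dim=-1)  # (N, L, L) token-to-token weights
        t = tokens + attn @ tokens                      # Eq. (3): residual self-attention
        return t + self.f2(torch.relu(self.f1(t)))      # Eq. (4): residual pointwise MLP

tokens = torch.randn(2, 16, 1024)
out = TokenTransformer(1024)(tokens)  # (2, 16, 1024)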
3.3. Projector

Many vision tasks require pixel-level details, but such details are not preserved in visual tokens. Therefore, we fuse the transformer's output with the feature map to refine the feature map's pixel-array representation:

    X_out = X_in + SOFTMAX_L((X_in W_Q)(T W_K)^T) T,    (5)

where X_in, X_out ∈ R^{HW×C} are the input and output feature maps. (X_in W_Q) ∈ R^{HW×C} is the query computed from the input feature map X_in; (X_in W_Q)_p ∈ R^C encodes the information pixel p requires from the visual tokens. (T W_K) ∈ R^{L×C} is the key computed from the tokens T; (T W_K)_l ∈ R^C represents the information the l-th token encodes. The key-query product determines how to project information encoded in the visual tokens T back to the original feature map. W_Q ∈ R^{C×C} and W_K ∈ R^{C×C} are learnable weights used to compute queries and keys.
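A sketch of the projector in Equation (5). For simplicity it assumes the tokens and the feature map share the same channel size, which is not required in the paper's configurations.

import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of Eq. (5): project token information back onto the pixel array."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Linear(channels, channels, bias=False)  # pixel queries
        self.w_k = nn.Linear(channels, channels, bias=False)  # token keys

    def forward(self, x: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # x: (N, HW, C) feature map, tokens: (N, L, C)
        q = self.w_q(x)                                 # (N, HW, C)
        k = self.w_k(tokens)                            # (N, L, C)
        attn = (q @ k.transpose(1, 2)).softmax(dim=-1)  # (N, HW, L)
        return x + attn @ tokens                        # residual fusion with the feature map

x = torch.randn(2, 14 * 14, 256)
tokens = torch.randn(2, 16, 256)
x_out = Projector(256)(x, tokens)  # (2, 196, 256)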
4. Using Visual Transformers in vision models

In this section, we discuss how to use VTs as building blocks in vision models. We define three hyper-parameters for each VT: the channel size of the feature map, the channel size of the visual tokens, and the number of visual tokens.

Image classification model: For image classification, following the convention of previous work, we build our networks with backbones inherited from ResNet [14]. Based on ResNet-{18, 34, 50, 101}, we build corresponding visual-transformer-ResNets (VT-ResNets) by replacing the last stage of convolutions with VT modules. The last stage of ResNet-{18, 34, 50, 101} contains 2 basic blocks, 3 basic blocks, 3 bottleneck blocks, and 3 bottleneck blocks, respectively. We replace them with the same number (2, 3, 3, 3) of VT modules. At the end of stage-4 (before stage-5 max pooling), ResNet-{18, 34} generate feature maps with the shape of 14^2 × 256, and ResNet-{50, 101} generate feature maps with the shape of 14^2 × 1024. We set the VT feature map channel size to 256, 256, 1024, and 1024 for ResNet-{18, 34, 50, 101}, respectively. We adopt 16 visual tokens with a channel size of 1024 for all models. At the end of the network, we output 16 visual tokens to the classification head, which applies average pooling over the tokens and uses a fully-connected layer to predict the class probabilities. A table summarizing the stage-wise description of the model is provided in Appendix A. Since VTs only operate on 16 visual tokens, we reduce the last stage's FLOPs by up to 6.9x, as shown in Table 1.
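As a rough, hypothetical illustration of this replacement (not the released implementation), one could drop ResNet-18's last stage and attach a stub VT stage plus a token-averaging head. The VTStage below only stands in for the tokenizer/transformer/projector stack sketched in Section 3.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class VTStage(nn.Module):
    """Stub: maps a (N, C, H, W) stage-4 feature map to 16 visual tokens of size 1024."""
    def __init__(self, in_channels=256, token_channels=1024, num_tokens=16):
        super().__init__()
        self.proj = nn.Linear(in_channels, token_channels, bias=False)
        self.tokenizer = nn.Linear(token_channels, num_tokens, bias=False)  # stand-in for Eq. (1)

    def forward(self, x):
        x = self.proj(x.flatten(2).transpose(1, 2))  # (N, HW, token_channels)
        attn = self.tokenizer(x).softmax(dim=1)      # (N, HW, L)
        return attn.transpose(1, 2) @ x              # (N, L, token_channels)

backbone = resnet18()
stem = nn.Sequential(*list(backbone.children())[:-3])  # keep the stem through stage-4 (layer3)
model = nn.Sequential(stem, VTStage(in_channels=256))
head = nn.Linear(1024, 1000)

x = torch.randn(1, 3, 224, 224)
tokens = model(x)                    # (1, 16, 1024)
logits = head(tokens.mean(dim=1))    # average-pool tokens, then a fully-connected layer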
Semantic segmentation: We show that using VTs for semantic segmentation can tackle several challenges of the pixel-convolution paradigm. First, the computational complexity of convolution grows with the image resolution. Second, convolutions struggle to capture long-range interactions between pixels. VTs, on the other hand, operate on a small number of visual tokens regardless of the image resolution, and since they model concept interactions in token-space, they bypass the "long-range" challenge of pixel-arrays.

To validate our hypothesis, we use panoptic feature pyramid networks (FPN) [24] as a baseline and use VTs to improve the network. Panoptic FPNs use ResNet as a backbone to extract feature maps from different stages with various resolutions. These feature maps are then fused by a feature pyramid network in a top-down manner to generate a multi-scale, detail-preserving feature map with rich semantics for segmentation (Figure 4, left). FPN is computationally expensive since it heavily relies on spatial convolutions operating on high-resolution feature maps with large channel sizes. We use VTs to replace the convolutions in FPN and name the new module VT-FPN (Figure 4, right). From each resolution's feature map, VT-FPN extracts 8 visual tokens with a channel size of 1024. The visual tokens are combined and fed into one transformer to compute interactions between visual tokens across resolutions. The output tokens are then projected back to the original feature maps, which are then used to perform pixel-level prediction. Compared with the original FPN, the computational cost of VT-FPN is much smaller since we only operate on a very small number of visual tokens rather than all the pixels. Our experiments show that VT-FPN uses 6.4x fewer FLOPs than FPN while preserving or surpassing its performance (Tables 9 & 10).

Figure 4: Feature Pyramid Networks (FPN) (left) vs. visual-transformer-FPN (VT-FPN) (right) for semantic segmentation. FPN uses convolution and interpolation to merge feature maps with different resolutions. VT-FPN extracts visual tokens from all feature maps, merges them with one transformer, and projects them back to the original feature maps.
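The VT-FPN idea can be sketched as follows. This is our own simplification: tokens are extracted from each (flattened) pyramid level, a standard multi-head attention layer stands in for the token transformer of Section 3.2, and a shared projection maps tokens back to every level.

import torch
import torch.nn as nn

class VTFPN(nn.Module):
    """Sketch: tokens from every pyramid level share one transformer, then are projected back."""
    def __init__(self, channels_per_level, token_channels=1024, tokens_per_level=8):
        super().__init__()
        self.tokenizers = nn.ModuleList(
            nn.Linear(c, tokens_per_level, bias=False) for c in channels_per_level)
        self.to_tok = nn.ModuleList(
            nn.Linear(c, token_channels, bias=False) for c in channels_per_level)
        self.attn = nn.MultiheadAttention(token_channels, num_heads=1, batch_first=True)
        self.back = nn.ModuleList(
            nn.Linear(token_channels, c, bias=False) for c in channels_per_level)

    def forward(self, feats):
        # feats: list of (N, HW_i, C_i) flattened feature maps at different resolutions.
        tokens = []
        for x, tok, proj in zip(feats, self.tokenizers, self.to_tok):
            a = tok(x).softmax(dim=1)                    # (N, HW_i, 8) spatial attention
            tokens.append(a.transpose(1, 2) @ proj(x))   # (N, 8, token_channels)
        t = torch.cat(tokens, dim=1)                     # all levels share one transformer
        t, _ = self.attn(t, t, t)
        outs = []
        for x, back in zip(feats, self.back):
            k = back(t)                                           # tokens mapped to this level's channels
            attn = (x @ k.transpose(1, 2)).softmax(dim=-1)        # (N, HW_i, L_total)
            outs.append(x + attn @ k)                             # project tokens back to pixels
        return outs

feats = [torch.randn(1, 56 * 56, 256), torch.randn(1, 28 * 28, 256)]
outs = VTFPN([256, 256])(feats)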
Table 1: FLOPs and parameter size reduction of VTs on ResNets, obtained by replacing the last stage of convolution modules with VT modules.

                 R18    R34    R50    R101
FLOPs   Total    1.14x  1.16x  1.20x  1.09x
        Stage-5  2.4x   5.0x   6.1x   6.9x
Params  Total    0.91x  1.21x  1.19x  1.19x
        Stage-5  0.9x   1.5x   1.26x  1.26x

5. Experiments

We conduct experiments with VTs on image classification and semantic segmentation to (a) understand the key components of VTs and (b) validate their effectiveness.

5.1. Visual Transformer for Classification

We conduct experiments on the ImageNet dataset [11], with around 1.3 million images in the training set and 50 thousand images in the validation set. We implement VT models in PyTorch [29]. We use the stochastic gradient descent (SGD) optimizer with Nesterov momentum [34], an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 4e-5. We train the model for 90 epochs and decay the learning rate by 10x every 30 epochs. We use a batch size of 256 and 8 V100 GPUs for training.

VT vs. ResNet with default training recipe: In Table 2, we first compare VT-ResNets and vanilla ResNets under the same training recipe. VT-ResNets in this experiment use a filter-based tokenizer for the first VT module and recurrent tokenizers in later modules. We can see that after replacing the last stage of convolutions in ResNet18 and ResNet34, VT-based ResNets use many fewer FLOPs: 244M fewer FLOPs for ResNet18 and 384M fewer for ResNet34. Meanwhile, VT-ResNets achieve much higher top-1 validation accuracy than the corresponding ResNets, by up to 2.2 points. This confirms the effectiveness of VTs. Also note that the training accuracy achieved by VT-ResNets is much higher than that of the baseline ResNets: VT-R18 is 7.9 points higher and VT-R34 is 6.9 points higher. This indicates that VT-ResNets overfit more heavily than regular ResNets. We hypothesize this is because VT-ResNets have much larger capacity, and we need stronger regularization (e.g., data augmentation) to fully utilize the model capacity. We address this in Section 5.2 and Table 8.

Table 2: VT-ResNet vs. baseline ResNets on the ImageNet dataset. By replacing the last stage of the ResNets, VT-ResNets use 244M and 384M fewer FLOPs than the baseline ResNets while achieving 1.7 and 2.2 points higher validation accuracy. Note that the training accuracy of VT-ResNets is much higher, indicating that VT-ResNets have higher model capacity and require stronger regularization (e.g., data augmentation) to fully utilize it. See Table 8.

          Top-1 Acc (%)  Top-1 Acc (%)  FLOPs  Params
          (Val)          (Train)        (M)    (M)
R18       69.9           68.6           1814   11.7
VT-R18    72.1           76.5           1570   11.7
R34       73.3           73.9           3664   21.8
VT-R34    75.0           80.8           3280   21.9

Tokenizer ablation studies: In Table 3, we compare different types of tokenizers used by VTs. We consider a pooling-based tokenizer, a clustering-based tokenizer, and a filter-based tokenizer (Section 3.1.1). We use the candidate tokenizer in the first VT module and recurrent tokenizers in later modules. As a baseline, we implement a pooling-based tokenizer, which spatially downsamples the feature map X to reduce its spatial dimensions from HW = 196 to L = 16, instead of grouping pixels by their semantics. As a more advanced baseline, we consider a clustering-based tokenizer, described in Appendix C, which applies K-Means clustering in the semantic space to group pixels into visual tokens. As can be seen from Table 3, filter-based and clustering-based tokenizers perform significantly better than the pooling-based baseline, validating our hypothesis that feature maps contain redundancies, which can be addressed by grouping pixels in semantic space. The difference between filter-based and clustering-based tokenizers is small and varies between R18 and R34. We hypothesize this is because both tokenizers have their own drawbacks. The filter-based tokenizer relies on fixed convolution filters to detect and assign pixels to semantic groups, and is limited by the capacity of the convolution filters to deal with diverse and sparse high-level semantic concepts. On the other hand, the clustering-based tokenizer extracts semantic concepts that exist in the image, but it is not designed to capture the most essential semantic concepts.

Table 3: VT-ResNets with different types of tokenizers. Pooling-based tokenizers spatially downsample a feature map to obtain visual tokens. The clustering-based tokenizer (Appendix C) groups pixels in the semantic space. Filter-based tokenizers (Section 3.1.1) use convolution filters to group pixels. Both filter-based and clustering-based tokenizers work much better than pooling-based tokenizers, validating the importance of grouping pixels by their semantics.

     Tokenizer         Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  Pooling-based     70.5           1549       11.0
     Clustering-based  71.8           1579       11.6
     Filter-based      72.1           1580       11.7
R34  Pooling-based     73.6           3246       20.6
     Clustering-based  75.2           3299       21.8
     Filter-based      74.9           3280       21.9
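For reference, the pooling-based baseline in Table 3 can be approximated in a few lines (our reading of the baseline, not the exact implementation): spatially downsample the 14×14 feature map to a 4×4 grid, yielding 16 tokens that ignore semantics.

import torch
import torch.nn.functional as F

def pooling_tokenizer(x: torch.Tensor, num_tokens: int = 16) -> torch.Tensor:
    # x: (N, C, H, W) feature map; returns (N, num_tokens, C) "tokens" by spatial pooling.
    side = int(num_tokens ** 0.5)                     # 16 tokens -> a 4x4 grid
    pooled = F.adaptive_avg_pool2d(x, (side, side))   # (N, C, 4, 4)
    return pooled.flatten(2).transpose(1, 2)          # (N, 16, C)

tokens = pooling_tokenizer(torch.randn(2, 256, 14, 14))  # (2, 16, 256)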
In Table 4, we validate the recurrent tokenizer's effectiveness. We use a filter-based tokenizer in the first VT module and recurrent tokenizers in subsequent modules. Experiments show that using recurrent tokenizers leads to higher accuracy.

Table 4: VT-ResNets that use recurrent tokenizers achieve better performance, since recurrent tokenizers are content-aware. RT denotes recurrent tokenizer.

             Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  w/ RT   72.0           1569       11.7
     w/o RT  71.2           1586       11.1
R34  w/ RT   74.9           3246       21.9
     w/o RT  74.5           3335       20.9

Modeling token relationships: In Table 5, we compare different methods of capturing token relationships. As a baseline, we do not compute any interactions between tokens; this leads to the worst performance among all variations, validating the necessity of capturing relationships between different semantic concepts. Another alternative is to use graph convolutions, similar to [6, 26, 47], but their performance is worse than that of transformers. This is likely because graph convolutions bind each visual token to a fixed semantic concept. In comparison, using transformers allows each visual token to encode any semantic concept, as long as the concept appears in the image. This allows the model to fully utilize its capacity.

Table 5: VT-ResNets using different modules to model token relationships. Models using transformers perform better than those using graph convolutions or no token-space operations. This validates that it is important to model relationships between visual tokens (semantic concepts) and that transformers work better than graph convolutions in relating tokens.

     Module       Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  None         68.7           1528       8.5
     GraphConv    69.3           1528       8.5
     Transformer  71.5           1580       11.7
R34  None         73.3           3222       17.1
     GraphConv    73.7           3223       17.1
     Transformer  75.2           3299       21.8

Token efficiency ablation: In Table 6, we test varying numbers of visual tokens, only to find negligible or no increase in accuracy. This agrees with our hypothesis that VTs already capture a wide variety of concepts with just a few handfuls of tokens; additional tokens are not needed, as the space of possible high-level concepts is already covered.

Projection ablation: In Table 7, we study whether we need to project visual tokens back to the feature map. We hypothesize that projecting the visual tokens is an important step, since in vision understanding pixel-level semantics are very important, and visual tokens are representations in the semantic space that do not encode any spatial information. As validated by Table 7, projecting visual tokens back to the feature map leads to higher performance, even for image classification tasks.
Table 6: Using more visual tokens does not improve the accuracy of VTs, which agrees with our hypothesis that images can be described by a compact set of visual tokens.

     No. Tokens  Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  16          71.8           1579       11.6
     32          71.9           1711       11.6
     64          72.1           1979       11.6
R34  16          75.1           3299       21.8
     32          75.0           3514       21.8
     64          75.0           3952       21.8

Table 7: VTs that project tokens back to the feature map perform better. This may be because feature maps still encode important spatial information.

                    Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  w/ projector   72.0           1569       11.7
     w/o projector  71.0           1498       9.4
R34  w/ projector   74.8           3280       21.9
     w/o projector  73.9           3159       17.4

5.2. Training with Advanced Recipe

In Table 2, we show that under the regular training recipe, VT-ResNets experience serious overfitting: despite their accuracy improvement on the validation set, their training accuracy improves by a significantly larger margin. We hypothesize that this is because VT-based models have much higher model capacity. To fully unleash the potential of VTs, we use a more advanced training recipe. To prevent overfitting, we train with more epochs, stronger data augmentation, stronger regularization, and distillation. Specifically, we train VT-ResNet models for 400 epochs with RMSProp. We use an initial learning rate of 0.01, increase it to 0.16 over 5 warmup epochs, then reduce the learning rate by a factor of 0.9875 per epoch. We use synchronized batch normalization and distributed training with a batch size of 2048. We use label smoothing and AutoAugment [8], and we set the stochastic depth survival probability [22] and dropout ratio to 0.9 and 0.2, respectively. We apply an exponential moving average (EMA) to the model weights with a decay of 0.99985. We use knowledge distillation [15] in the training recipe, where the teacher model is FBNetV3-G [9]. The total loss is a weighted sum of the distillation loss (×0.8) and the cross-entropy loss (×0.2).
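A sketch of how such a combined objective is typically assembled. The 0.8/0.2 weights and the FBNetV3-G teacher come from the recipe above; the helper name, the KL-divergence form of the distillation term, and the 0.1 label-smoothing value are our assumptions.

import torch
import torch.nn.functional as F

def vt_training_loss(student_logits, teacher_logits, targets,
                     distill_weight=0.8, ce_weight=0.2, label_smoothing=0.1):
    # Cross-entropy with label smoothing against the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets, label_smoothing=label_smoothing)
    # Distillation: KL divergence between student and (frozen) teacher distributions.
    distill = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return distill_weight * distill + ce_weight * ce

student_logits = torch.randn(4, 1000, requires_grad=True)
teacher_logits = torch.randn(4, 1000)   # e.g., produced by an FBNetV3-G teacher
targets = torch.randint(0, 1000, (4,))
loss = vt_training_loss(student_logits, teacher_logits, targets)
loss.backward()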
Our results are reported in Table 8. Compared with the baseline ResNet models, VT-ResNet models achieve 4.6 to 7 points higher accuracy and surpass all other related work that adopts attention in different forms on top of ResNets [21, 41, 1, 5, 19, 30, 49, 6]. This validates that our advanced training recipe better utilizes the model capacity of VT-ResNet models to outperform all previous baselines. Note that in addition to the architecture differences, previous works also used their own training recipes, and it is infeasible for us to try these recipes one by one. So, to understand the source of the accuracy gain, we use the same training recipe to train the baseline ResNet18 and ResNet34 and also observe a significant accuracy improvement on the baseline ResNets. But note that under the advanced training recipe, the accuracy gap between VT-ResNets and the baselines increases from 1.7 and 2.2 points to 2.2 and 3.0 points. This further validates that a stronger training recipe can better utilize the model capacity of VTs. While achieving higher accuracy than previous work, VT-ResNets also use much fewer FLOPs and parameters, even though we only replace the last stage of the baseline model. If we consider the reduction over the original stage, we observe FLOP reductions of up to 6.9x, as shown in Table 1.

Table 8: Comparing VT-ResNets with other attention-augmented ResNets on ImageNet. *The baseline ResNet FLOPs reported in [41] are lower than our baseline.

Models              Top-1 Acc (%)  FLOPs (G)  Params (M)
R18 [14]            69.8           1.814      11.7
R18+SE [21, 41]     70.6           1.814      11.8
R18+CBAM [41]       70.7           1.815      11.8
LR-R18 [19]         74.6           2.5        14.4
R18 [14] (ours)     73.8           1.814      11.7
VT-R18 (ours)       76.8           1.569      11.7
R34 [14]            73.3           3.664      21.8
R34+SE [21, 41]     73.9           3.664      22.0
R34+CBAM [41]       74.0           3.664      22.9
AA-R34 [1]          74.7           3.55       20.7
R34 [14] (ours)     77.7           3.664      21.8
VT-R34 (ours)       79.9           3.236      19.2
R50 [14]            76.0           4.089      25.5
R50+SE [21, 41]     76.9           3.860*     28.1
R50+CBAM [41]       77.3           3.864*     28.1
LR-R50 [19]         77.3           4.3        23.3
Stand-Alone [30]    77.6           3.6        18.0
AA-R50 [1]          77.7           4.1        25.6
A2-R50 [5]          77.0           -          -
SAN19 [49]          78.2           3.3        20.5
GloRe-R50 [6]       78.4           5.2        30.5
VT-R50 (ours)       80.6           3.412      21.4
R101 [14]           77.4           7.802      44.4
R101+SE [21, 41]    77.7           7.575*     49.3
R101+CBAM [41]      78.5           7.581*     49.3
LR-R101 [19]        78.5           7.79       42.0
AA-R101 [1]         78.7           8.05       45.4
GloRe-R200 [6]      79.9           16.9       70.6
VT-R101 (ours)      82.3           7.129      41.5

5.3. Visual Transformer for Semantic Segmentation

We conduct experiments to test the effectiveness of VTs for semantic segmentation on the COCO-Stuff dataset [2] and the LIP dataset [25]. The COCO-Stuff dataset contains annotations for 91 stuff classes, with 118K training images and 5K validation images. The LIP dataset is a collection of images annotated for human parsing.
Figure 5: Visualization of the spatial attention generated by a filter-based tokenizer on images from the LIP dataset. Red denotes higher attention values and blue denotes lower. Without any supervision, visual tokens automatically focus on different areas of the image that correspond to different semantic concepts, such as sheep, ground, clothes, and woods.

Table 9: Semantic segmentation results on the COCO-Stuff validation set. FLOPs are calculated with a typical input resolution of 800×1216.

               mIoU (%)  Total FLOPs (G)  FPN FLOPs (G)
R-50   FPN     40.78     159              55.1
       VT-FPN  41.00     113 (1.41x)      8.5 (6.48x)
R-101  FPN     41.51     231              55.1
       VT-FPN  41.50     185 (1.25x)      8.5 (6.48x)

We visualize the spatial attention A ∈ R^{HW×L} generated by filter-based tokenizers, where each A_{:,l} ∈ R^{HW} is an attention map showing how each pixel of the image contributes to token l. We plot the attention maps in Figure 5, and we can see that, without any supervision, different visual tokens attend to different semantic concepts in the image, corresponding to parts of the background or foreground objects. More visualization results are provided in Appendix B.
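A small sketch of how such attention maps can be rendered from the tokenizer output (illustrative only; A is the spatial attention from Equation (1) for a single image):

import torch
import torch.nn.functional as F

def attention_heatmaps(spatial_attention: torch.Tensor, h: int, w: int, image_hw=(224, 224)):
    # spatial_attention: (HW, L) matrix A from Eq. (1) for one image.
    num_tokens = spatial_attention.shape[1]
    maps = spatial_attention.T.reshape(num_tokens, 1, h, w)  # one attention map per token
    # Upsample each token's map to the input image resolution for overlay.
    return F.interpolate(maps, size=image_hw, mode="bilinear", align_corners=False)

A = torch.rand(14 * 14, 16).softmax(dim=0)    # dummy attention over 196 pixels, 16 tokens
heatmaps = attention_heatmaps(A, h=14, w=14)  # (16, 1, 224, 224)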
References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3286–3295, 2019.
[2] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[4] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.
[5] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-Nets: Double attention networks. In Advances in Neural Information Processing Systems, pages 352–361, 2018.
[6] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
[7] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
[8] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[9] Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. FBNetV3: Joint architecture-recipe search using neural acquisition function. arXiv preprint arXiv:2006.02049, 2020.
[10] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. ChamNet: Towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11398–11407, 2019.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. SqueezeNext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1638–1647, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
[17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[18] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
[19] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3464–3473, 2019.
[20] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-Excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9401–9411, 2018.
[21] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[22] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[23] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[24] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
[25] Xiaodan Liang, Ke Gong, Xiaohui Shen, and Liang Lin. Look into Person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):871–885, 2018.
[26] Xiaodan Liang, Zhiting Hu, Hao Zhang, Liang Lin, and Eric P Xing. Symbolic graph reasoning meets convolutions. In Advances in Neural Information Processing Systems, pages 1853–1863, 2018.
[27] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[28] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[30] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[31] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[32] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[33] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[34] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[35] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
[36] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[38] Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, et al. FBNetV2: Differentiable neural architecture search for spatial and channel dimensions. arXiv preprint arXiv:2004.05565, 2020.
[39] Weiyue Wang and Ulrich Neumann. Depth-aware CNN for RGB-D segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–150, 2018.
[40] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[41] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[42] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
[43] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9127–9135, 2018.
[44] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In ICRA, 2019.
[45] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[46] Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. SqueezeSegV3: Spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803, 2020.
[47] Songyang Zhang, Xuming He, and Shipeng Yan. LatentGNN: Learning efficient non-local relations for visual recognition. In International Conference on Machine Learning, pages 7374–7383, 2019.
[48] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[49] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. arXiv preprint arXiv:2004.13621, 2020.
Figure 6: Clustering-based tokenizer that groups pixels using the K-Means centroids of the pixels in the semantic space.

Listing 1: Pseudo-code of K-Means (Lloyd's algorithm) implemented in PyTorch.

import torch
import torch.nn.functional as F

def kmeans_centroids(X_nchw, U_ncl, niter):
    # X_nchw: (N, C, H, W) feature map; U_ncl: (N, C, L) initial centroids.
    N, C, H, W = X_nchw.shape
    X_ncp = X_nchw.view(N, C, H * W)  # p = h*w
    # Normalize to unit vectors.
    U_ncl = F.normalize(U_ncl, dim=1)
    X_ncp = F.normalize(X_ncp, dim=1)
    for _ in range(niter):  # Lloyd's algorithm
        # Distance from every pixel to every centroid: (N, P, L).
        dist_npl = (X_ncp[..., None] - U_ncl[:, :, None, :]).norm(dim=1)
        # Hard-assign each pixel to its nearest centroid.
        mask_npl = (dist_npl == dist_npl.min(dim=2, keepdim=True)[0]).float()
        # Recompute each centroid as the mean of its assigned pixels.
        U_ncl = X_ncp.matmul(mask_npl) / mask_npl.sum(dim=1, keepdim=True).clamp(min=1)
        U_ncl = F.normalize(U_ncl, dim=1)
    return U_ncl  # centroids
A. Stage-wise model description of VT-ResNet

In this section, we provide a stage-wise description of the model configurations for VT-based ResNets (VT-ResNets). We use three hyper-parameters to control a VT module: the channel size of the output feature map C, the channel size of the visual tokens CT, and the number of visual tokens L. The model configurations are described in Table 11.

C. Clustering-based tokenizer

To address this limitation of filter-based tokenizers, we devise a content-aware variant W_K of W_A to form semantic groups from X. Our insight is to extract concepts present in the current image by clustering pixels, instead of applying the same filters regardless of the image content. First, we treat each pixel as a sample {X_p}_{p=1}^{HW} and apply K-Means to find L centroids, which are stacked to form W_K ∈ R^{C×L}. Each centroid represents one semantic concept in the image. Second, we replace W_A in Equation (1) with W_K to form L semantic groups of channels:

    W_K = KMEANS(X),
    T = SOFTMAX_HW(X W_K)^T X.    (6)

Although this tokenizer efficiently models only the concepts in the current image, its drawback is that it is not designed to choose the most discriminative concepts.
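Putting Listing 1 and Equation (6) together, a clustering-based tokenizer might be wired up as follows. This is a sketch: the random-pixel centroid initialization is our assumption, and kmeans_centroids refers to the function in Listing 1.

import torch

def clustering_tokenizer(x_nchw: torch.Tensor, num_tokens: int = 16, niter: int = 5):
    # x_nchw: (N, C, H, W) feature map.
    n, c, h, w = x_nchw.shape
    x_ncp = x_nchw.view(n, c, h * w)
    # Initialize centroids from randomly chosen pixels (assumption, not from the paper).
    idx = torch.randperm(h * w)[:num_tokens]
    w_k = kmeans_centroids(x_nchw, x_ncp[:, :, idx], niter)            # (N, C, L), see Listing 1
    # Eq. (6): use the centroids as content-aware grouping weights.
    attn = torch.einsum("ncp,ncl->npl", x_ncp, w_k).softmax(dim=1)     # (N, HW, L)
    return torch.einsum("npl,ncp->nlc", attn, x_ncp)                   # (N, L, C) visual tokens

tokens = clustering_tokenizer(torch.randn(2, 256, 14, 14))             # (2, 16, 256)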
Table 11: Model descriptions for VT-ResNets. VT-R18 denotes visual-transformer-ResNet-18. "conv7×7-C64-S2" denotes a 7×7 convolution with an output channel size of 64 and a stride of 2. "BB-C64 ×2" denotes a basic block [14] with an output channel size of 64, repeated twice. "BN-C256 ×3" denotes a bottleneck block [14] with an output channel size of 256, repeated three times. "VT-C512-L16-CT1024 ×2" denotes a VT block with a feature map channel size of 512, a visual token channel size of 1024, and 16 tokens, repeated twice.

Stage  Resolution  VT-R18                  VT-R34                  VT-R50                   VT-R101
1      56×56       conv7×7-C64-S2 → maxpool3×3-S2
2      56×56       BB-C64 ×2               BB-C64 ×3               BN-C256 ×3               BN-C256 ×3
3      28×28       BB-C128 ×2              BB-C128 ×4              BN-C512 ×4               BN-C512 ×4
4      14×14       BB-C256 ×2              BB-C256 ×6              BN-C1024 ×6              BN-C1024 ×23
5      16 tokens   VT-C512-L16-CT1024 ×2   VT-C512-L16-CT1024 ×3   VT-C1024-L16-CT1024 ×3   VT-C1024-L16-CT1024 ×3
head   1           avgpool-fc1000

Figure 7: Visualization of the spatial attention generated by a filter-based tokenizer. Red denotes higher attention values and blue denotes lower. Without any supervision, visual tokens automatically focus on different areas of the image that correspond to different semantic concepts.