Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Bichen Wu1, Chenfeng Xu3, Xiaoliang Dai1, Alvin Wan3, Peizhao Zhang1,
Zhicheng Yan2, Masayoshi Tomizuka3, Joseph Gonzalez3, Kurt Keutzer3, Peter Vajda1
1 Facebook Reality Labs, 2 Facebook AI, 3 UC Berkeley

arXiv:2006.03677v4 [cs.CV] 20 Nov 2020
Figure 1: Diagram of a Visual Transformer (VT). For a given image, we first apply convolutional layers to extract low-level features. The output feature map is then fed to the VT: First, apply a tokenizer, grouping pixels into a small number of visual tokens, each representing a semantic concept in the image. Second, apply transformers to model relationships between tokens. Third, visual tokens are directly used for image classification or projected back to the feature map for semantic segmentation.
putation by attending to important regions, instead of treating all pixels equally; 2) encoding semantic concepts in a few visual tokens relevant to the image, instead of modeling all concepts across all images; and 3) relating spatially-distant concepts through self-attention in token-space.

To validate the effectiveness of VT and understand its key components, we run controlled experiments by using VTs to replace convolutions in ResNet, a common test bed for new building blocks, for image classification. We also use VTs to re-design feature-pyramid networks (FPN), a strong baseline for semantic segmentation. Our experiments show that VTs achieve higher accuracy with lower computational cost in both tasks. For the ImageNet [11] benchmark, we replace the last stage of ResNet [14] with VTs, reducing FLOPs of the stage by 6.9x and improving top-1 accuracy by 4.6 to 7 points. For semantic segmentation on COCO-Stuff [2] and Look-Into-Person [25], VT-based FPN achieves 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.4x.

2. Relationship to previous work

Transformers in vision models: A notable recent and relevant trend is the adoption of transformers in vision models. Dosovitskiy et al. propose a Vision Transformer (ViT) [12], dividing an image into 16 × 16 patches and feeding these patches (i.e., tokens) into a standard transformer. Although simple, this requires transformers to learn dense, repeatable patterns (e.g., textures), which convolutions are drastically more efficient at learning. The simplicity incurs an extremely high computational price: ViT requires up to 7 GPU-years and 300M images from the JFT dataset to outperform competing convolutional variants. By contrast, we leverage the respective strengths of each operation, using convolutions for extracting low-level features and transformers for relating high-level concepts. We further use spatial attention to focus on important regions, instead of treating each image patch equally. This yields strong performance despite orders-of-magnitude less data and training time.

Another relevant work, DETR [3], adopts transformers to simplify the hand-crafted anchor matching procedure in object detection training. Although both adopt transformers, DETR is not directly comparable to our VT given their orthogonal use cases, i.e., insights from both works could be used together in one model for compounded benefit.

Graph convolutions in vision models: Our work is also related to previous efforts such as GloRe [6], LatentGNN [47], and [26] that densely relate concepts in latent space using graph convolutions. To augment convolutions, [26, 6, 47] adopt a procedure similar to ours: (1) extracting latent variables to represent in graph nodes (analogous to our visual tokens), (2) applying graph convolution to capture node interactions (analogous to our transformer), and (3) projecting the nodes back to the feature map. Although these approaches avoid spatial redundancy, they are susceptible to concept redundancy: the second limitation listed in the introduction. In particular, by using fixed weights that are not content-aware, the graph convolution expects a fixed semantic concept in each node, regardless of whether the concept exists in the image. By contrast, a transformer uses content-aware weights, allowing visual tokens to represent varying concepts. As a result, while graph convolutions require hundreds of nodes (128 nodes in [4], 340 in [25], 150 in [48]) to encode potential semantic concepts, our VT uses just 16 visual tokens and attains higher accuracy. Furthermore, while modules from [26, 6, 47] can only be added to a pretrained network to augment convolutions, VTs can replace convolution layers to save FLOPs and parameters, and support training from scratch.

Attention in vision models: In addition to being used in transformers, attention is also widely used in different forms in computer vision models [21, 20, 41, 44, 46, 40, 28, 18, 19, 1, 49, 30].
Attention was first used to modulate the feature map: attention values are computed from the input and multiplied with the feature map, as in [41, 21, 20, 44]. Later work [46, 33, 39] interprets this "modulation" as a way to make convolution spatially adaptive and content-aware. In [40], Wang et al. introduce non-local operators, equivalent to self-attention, to video understanding to capture long-range interactions. However, the computational complexity of self-attention grows quadratically with the number of pixels. [1] uses self-attention to augment convolutions and reduces the compute cost by using small channel sizes for attention. [30, 28, 7, 49, 19], on the other hand, restrict the receptive field of self-attention and use it in a convolutional manner. Starting from [30], self-attention is used as a stand-alone building block for vision models. Our work differs from all of the above, since we propose a novel token-transformer paradigm to replace the inefficient pixel-convolution paradigm and achieve superior performance.

Efficient vision models: Many recent research efforts focus on building vision models that achieve better performance with lower computational cost. Early work in this direction includes [23, 31, 13, 17, 32, 16, 48, 27, 43]. Recently, neural architecture search [42, 10, 38, 9, 36, 35] has been used to optimize a network's performance within a search space that consists of existing convolution operators. The efforts above all seek to make the common convolutional neural net more computationally efficient. In contrast, we propose a new building block that naturally eliminates the redundant computations of the pixel-convolution paradigm.

3. Visual Transformer

We illustrate the overall diagram of a Visual Transformer (VT) based model in Figure 1. First, we process the input image with several convolution blocks, then feed the output feature map to VTs. Our insight is to leverage the strengths of both convolutions and VTs: (1) early in the network, use convolutions to learn densely-distributed, low-level patterns, and (2) later in the network, use VTs to learn and relate more sparsely-distributed, higher-order semantic concepts. At the end of the network, use visual tokens for image-level prediction tasks and use the augmented feature map for pixel-level prediction tasks.

A VT module involves three steps: First, group pixels into semantic concepts to produce a compact set of visual tokens. Second, to model relationships between semantic concepts, apply a transformer [37] to these visual tokens. Third, project these visual tokens back to pixel-space to obtain an augmented feature map. Similar paradigms can be found in [6, 47, 26], but with one critical difference: previous methods use hundreds of semantic concepts (termed "nodes"), whereas our VT uses as few as 16 visual tokens to achieve superior performance.

3.1. Tokenizer

Our intuition is that an image can be summarized by a few handfuls of words, or visual tokens. This contrasts with convolutions, which use hundreds of filters, and graph convolutions, which use hundreds of "latent nodes", to detect all possible concepts regardless of image content. To leverage this intuition, we introduce a tokenizer module to convert feature maps into compact sets of visual tokens. Formally, we denote the input feature map by X ∈ R^{HW×C} (height H, width W, channels C) and the visual tokens by T ∈ R^{L×C} s.t. L ≪ HW (L is the number of tokens).

3.1.1 Filter-based Tokenizer

A filter-based tokenizer, also adopted by [47, 6, 26], utilizes convolutions to extract visual tokens. For a feature map X, we map each pixel X_p ∈ R^C to one of L semantic groups using point-wise convolutions. Then, within each group, we spatially pool pixels to obtain tokens T. Formally,

    T = SOFTMAX_HW(X W_A)^T X,    (1)

where A = SOFTMAX_HW(X W_A) ∈ R^{HW×L} is a spatial attention. Here, W_A ∈ R^{C×L} forms semantic groups from X, and SOFTMAX_HW(·) translates these activations into the spatial attention A. Finally, A is multiplied with X to compute weighted averages of the pixels in X, producing L visual tokens. However, many high-level semantic concepts are sparse and may each appear in only a few images. As a result, the fixed set of learned weights W_A potentially wastes computation by modeling all such high-level concepts at once. We call this a "filter-based" tokenizer, since it uses convolutional filters W_A to extract visual tokens.

Figure 2: Filter-based tokenizer that groups pixels using a fixed convolution filter W_A.
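To make Equation (1) concrete, here is a minimal PyTorch sketch of a filter-based tokenizer. It is our illustration rather than the paper's released code; the class and parameter names (FilterTokenizer, w_a) are ours, and the input is assumed to be a flattened feature map of shape (N, HW, C).

import torch
import torch.nn as nn

class FilterTokenizer(nn.Module):
    """Minimal sketch of Eq. (1): T = softmax_HW(X W_A)^T X."""
    def __init__(self, channels: int, num_tokens: int):
        super().__init__()
        # W_A in R^{C x L}: a point-wise (1x1) projection implemented as a linear layer.
        self.w_a = nn.Linear(channels, num_tokens, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, HW, C) flattened feature map.
        attn = self.w_a(x)                 # (N, HW, L) group logits
        attn = attn.softmax(dim=1)         # spatial attention A: softmax over the HW dimension
        return attn.transpose(1, 2) @ x    # (N, L, C): A^T X, weighted averages of pixels

# Usage: a 14x14 feature map with 256 channels -> 16 visual tokens of size 256.
x = torch.randn(2, 14 * 14, 256)
tokens = FilterTokenizer(channels=256, num_tokens=16)(x)
print(tokens.shape)  # torch.Size([2, 16, 256])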
3.1.2 Recurrent Tokenizer

To remedy the limitation of filter-based tokenizers, we propose a recurrent tokenizer with weights that are dependent on the previous layer's visual tokens.
The intuition is to let the previous layer's tokens T_in guide the extraction of new tokens for the current layer. The name "recurrent tokenizer" comes from the fact that the current tokens are computed dependent on previous ones. Formally, we define

    W_R = T_in W_{T→R},
    T = SOFTMAX_HW(X W_R)^T X,    (2)

where W_{T→R} ∈ R^{C×C}. In this way, the VT can incrementally refine the set of visual tokens, conditioned on previously-processed concepts. In practice, we apply recurrent tokenizers starting from the second VT module, since it requires tokens from a previous VT.

Figure 3: Recurrent tokenizer that uses previous tokens to guide the token extraction in the current VT module.
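A matching sketch of the recurrent tokenizer in Equation (2), again with hypothetical names (RecurrentTokenizer, w_t_to_r) and the same flattened-feature-map assumption:

import torch
import torch.nn as nn

class RecurrentTokenizer(nn.Module):
    """Minimal sketch of Eq. (2): W_R = T_in W_{T->R}; T = softmax_HW(X W_R)^T X."""
    def __init__(self, channels: int):
        super().__init__()
        # W_{T->R} in R^{C x C}, applied to the previous layer's tokens.
        self.w_t_to_r = nn.Linear(channels, channels, bias=False)

    def forward(self, x: torch.Tensor, tokens_in: torch.Tensor) -> torch.Tensor:
        # x: (N, HW, C) feature map, tokens_in: (N, L, C) tokens from the previous VT.
        w_r = self.w_t_to_r(tokens_in)                     # (N, L, C) content-aware grouping weights
        attn = (x @ w_r.transpose(1, 2)).softmax(dim=1)    # (N, HW, L) spatial attention
        return attn.transpose(1, 2) @ x                    # (N, L, C) new visual tokens

x = torch.randn(2, 14 * 14, 256)
t_in = torch.randn(2, 16, 256)
t_out = RecurrentTokenizer(256)(x, t_in)  # (2, 16, 256)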
3.2. Transformer

After tokenization, we need to model interactions between these visual tokens. Previous works [6, 47, 26] use graph convolutions to relate concepts. However, these operations use fixed weights during inference, meaning each token (or "node") is bound to a specific concept; therefore graph convolutions waste computation by modeling all high-level concepts, even those that appear only in a few images. To address this, we adopt transformers [37], which use input-dependent weights by design. Because of this, transformers support visual tokens with variable meaning, covering more possible concepts with fewer tokens.

We employ a standard transformer with minor changes:

    T'_out = T_in + SOFTMAX_L((T_in K)(T_in Q)^T) T_in,    (3)
    T_out = T'_out + σ(T'_out F_1) F_2,    (4)

where T_in, T'_out, T_out ∈ R^{L×C} are the visual tokens. Different from graph convolution, in a transformer the weights between tokens are input-dependent and computed as a key-query product: (T_in K)(T_in Q)^T ∈ R^{L×L}. This allows us to use as few as 16 visual tokens, in contrast to hundreds of analogous nodes for graph-convolution approaches [6, 47, 26]. After the self-attention, we use a non-linearity and two pointwise convolutions in Equation (4), where F_1, F_2 ∈ R^{C×C} are weights and σ(·) is the ReLU function.
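Equations (3)-(4) can be sketched as a small token-space transformer. The module below is illustrative: K, Q, F_1, and F_2 are modeled as bias-free linear layers, and the softmax dimension is one reasonable reading of SOFTMAX_L.

import torch
import torch.nn as nn

class TokenTransformer(nn.Module):
    """Sketch of Eqs. (3)-(4): self-attention over L tokens plus a residual pointwise MLP."""
    def __init__(self, channels: int):
        super().__init__()
        self.key = nn.Linear(channels, channels, bias=False)    # K
        self.query = nn.Linear(channels, channels, bias=False)  # Q
        self.f1 = nn.Linear(channels, channels, bias=False)     # F_1
        self.f2 = nn.Linear(channels, channels, bias=False)     # F_2

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, L, C)
        k, q = self.key(tokens), self.query(tokens)
        attn = (k @ q.transpose(1, 2)).softmax(dim=-1)  # (N, L, L) token-to-token weights
        t = tokens + attn @ tokens                      # Eq. (3): residual self-attention
        return t + self.f2(torch.relu(self.f1(t)))      # Eq. (4): residual pointwise MLP

tokens = torch.randn(2, 16, 1024)
out = TokenTransformer(1024)(tokens)  # (2, 16, 1024)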
3.3. Projector

Many vision tasks require pixel-level details, but such details are not preserved in visual tokens. Therefore, we fuse the transformer's output with the feature map to refine the feature map's pixel-array representation:

    X_out = X_in + SOFTMAX_L((X_in W_Q)(T W_K)^T) T,    (5)

where X_in, X_out ∈ R^{HW×C} are the input and output feature maps. (X_in W_Q) ∈ R^{HW×C} is the query computed from the input feature map X_in; (X_in W_Q)_p ∈ R^C encodes the information pixel p requires from the visual tokens. (T W_K) ∈ R^{L×C} is the key computed from the tokens T; (T W_K)_l ∈ R^C represents the information the l-th token encodes. The key-query product determines how to project information encoded in the visual tokens T back to the original feature map. W_Q ∈ R^{C×C} and W_K ∈ R^{C×C} are learnable weights used to compute queries and keys.
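A sketch of the projector in Equation (5). For simplicity it assumes the tokens and the feature map share the same channel size, which is not required in the paper's configurations.

import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of Eq. (5): project token information back onto the pixel array."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Linear(channels, channels, bias=False)  # pixel queries
        self.w_k = nn.Linear(channels, channels, bias=False)  # token keys

    def forward(self, x: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # x: (N, HW, C) feature map, tokens: (N, L, C)
        q = self.w_q(x)                                 # (N, HW, C)
        k = self.w_k(tokens)                            # (N, L, C)
        attn = (q @ k.transpose(1, 2)).softmax(dim=-1)  # (N, HW, L)
        return x + attn @ tokens                        # residual fusion with the feature map

x = torch.randn(2, 14 * 14, 256)
tokens = torch.randn(2, 16, 256)
x_out = Projector(256)(x, tokens)  # (2, 196, 256)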
4. Using Visual Transformers in vision models

In this section, we discuss how to use VTs as building blocks in vision models. We define three hyper-parameters for each VT: the channel size of the feature map, the channel size of the visual tokens, and the number of visual tokens.

Image classification model: For image classification, following the convention of previous work, we build our networks with backbones inherited from ResNet [14]. Based on ResNet-{18, 34, 50, 101}, we build corresponding visual-transformer-ResNets (VT-ResNets) by replacing the last stage of convolutions with VT modules. The last stage of ResNet-{18, 34, 50, 101} contains 2 basic blocks, 3 basic blocks, 3 bottleneck blocks, and 3 bottleneck blocks, respectively. We replace them with the same number (2, 3, 3, 3) of VT modules. At the end of stage-4 (before stage-5 max pooling), ResNet-{18, 34} generate feature maps with the shape of 14^2 × 256, and ResNet-{50, 101} generate feature maps with the shape of 14^2 × 1024. We set the VT feature map channel size to 256, 256, 1024, and 1024 for ResNet-{18, 34, 50, 101}, respectively. We adopt 16 visual tokens with a channel size of 1024 for all models. At the end of the network, we output 16 visual tokens to the classification head, which applies average pooling over the tokens and uses a fully-connected layer to predict the class probabilities. A table summarizing the stage-wise description of the model is provided in Appendix A. Since VTs only operate on 16 visual tokens, we reduce the last stage's FLOPs by up to 6.9x, as shown in Table 1.
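As a rough, hypothetical illustration of this replacement (not the released implementation), one could drop ResNet-18's last stage and attach a stub VT stage plus a token-averaging head. The VTStage below only stands in for the tokenizer/transformer/projector stack sketched in Section 3.

import torch
import torch.nn as nn
from torchvision.models import resnet18

class VTStage(nn.Module):
    """Stub: maps a (N, C, H, W) stage-4 feature map to 16 visual tokens of size 1024."""
    def __init__(self, in_channels=256, token_channels=1024, num_tokens=16):
        super().__init__()
        self.proj = nn.Linear(in_channels, token_channels, bias=False)
        self.tokenizer = nn.Linear(token_channels, num_tokens, bias=False)  # stand-in for Eq. (1)

    def forward(self, x):
        x = self.proj(x.flatten(2).transpose(1, 2))  # (N, HW, token_channels)
        attn = self.tokenizer(x).softmax(dim=1)      # (N, HW, L)
        return attn.transpose(1, 2) @ x              # (N, L, token_channels)

backbone = resnet18()
stem = nn.Sequential(*list(backbone.children())[:-3])  # keep the stem through stage-4 (layer3)
model = nn.Sequential(stem, VTStage(in_channels=256))
head = nn.Linear(1024, 1000)

x = torch.randn(1, 3, 224, 224)
tokens = model(x)                    # (1, 16, 1024)
logits = head(tokens.mean(dim=1))    # average-pool tokens, then a fully-connected layer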
Semantic segmentation: We show that using VTs for semantic segmentation can tackle several challenges of the pixel-convolution paradigm. First, the computational complexity of convolution grows with the image resolution. Second, convolutions struggle to capture long-range interactions between pixels. VTs, on the other hand, operate on a small number of visual tokens regardless of the image resolution, and since they model concept interactions in token-space, they bypass the "long-range" challenge of pixel-arrays.

To validate our hypothesis, we use panoptic feature pyramid networks (FPN) [24] as a baseline and use VTs to improve the network. Panoptic FPNs use ResNet as a backbone to extract feature maps from different stages with various resolutions. These feature maps are then fused by a feature pyramid network in a top-down manner to generate a multi-scale, detail-preserving feature map with rich semantics for segmentation (Figure 4, left). FPN is computationally expensive since it heavily relies on spatial convolutions operating on high-resolution feature maps with large channel sizes. We use VTs to replace the convolutions in FPN and name the new module VT-FPN (Figure 4, right). From each resolution's feature map, VT-FPN extracts 8 visual tokens with a channel size of 1024. The visual tokens are combined and fed into one transformer to compute interactions between visual tokens across resolutions. The output tokens are then projected back to the original feature maps, which are then used to perform pixel-level prediction. Compared with the original FPN, the computational cost of VT-FPN is much smaller since we only operate on a very small number of visual tokens rather than all the pixels. Our experiments show that VT-FPN uses 6.4x fewer FLOPs than FPN while preserving or surpassing its performance (Tables 9 & 10).

Figure 4: Feature Pyramid Networks (FPN) (left) vs. visual-transformer-FPN (VT-FPN) (right) for semantic segmentation. FPN uses convolution and interpolation to merge feature maps with different resolutions. VT-FPN extracts visual tokens from all feature maps, merges them with one transformer, and projects them back to the original feature maps.
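The VT-FPN idea can be sketched as follows. This is our own simplification: tokens are extracted from each (flattened) pyramid level, a standard multi-head attention layer stands in for the token transformer of Section 3.2, and a shared projection maps tokens back to every level.

import torch
import torch.nn as nn

class VTFPN(nn.Module):
    """Sketch: tokens from every pyramid level share one transformer, then are projected back."""
    def __init__(self, channels_per_level, token_channels=1024, tokens_per_level=8):
        super().__init__()
        self.tokenizers = nn.ModuleList(
            nn.Linear(c, tokens_per_level, bias=False) for c in channels_per_level)
        self.to_tok = nn.ModuleList(
            nn.Linear(c, token_channels, bias=False) for c in channels_per_level)
        self.attn = nn.MultiheadAttention(token_channels, num_heads=1, batch_first=True)
        self.back = nn.ModuleList(
            nn.Linear(token_channels, c, bias=False) for c in channels_per_level)

    def forward(self, feats):
        # feats: list of (N, HW_i, C_i) flattened feature maps at different resolutions.
        tokens = []
        for x, tok, proj in zip(feats, self.tokenizers, self.to_tok):
            a = tok(x).softmax(dim=1)                    # (N, HW_i, 8) spatial attention
            tokens.append(a.transpose(1, 2) @ proj(x))   # (N, 8, token_channels)
        t = torch.cat(tokens, dim=1)                     # all levels share one transformer
        t, _ = self.attn(t, t, t)
        outs = []
        for x, back in zip(feats, self.back):
            k = back(t)                                           # tokens mapped to this level's channels
            attn = (x @ k.transpose(1, 2)).softmax(dim=-1)        # (N, HW_i, L_total)
            outs.append(x + attn @ k)                             # project tokens back to pixels
        return outs

feats = [torch.randn(1, 56 * 56, 256), torch.randn(1, 28 * 28, 256)]
outs = VTFPN([256, 256])(feats)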
Table 1: FLOPs and parameter size reduction of VTs on ResNets, obtained by replacing the last stage of convolution modules with VT modules.

                 R18    R34    R50    R101
FLOPs   Total    1.14x  1.16x  1.20x  1.09x
        Stage-5  2.4x   5.0x   6.1x   6.9x
Params  Total    0.91x  1.21x  1.19x  1.19x
        Stage-5  0.9x   1.5x   1.26x  1.26x

5. Experiments

We conduct experiments with VTs on image classification and semantic segmentation to (a) understand the key components of VTs and (b) validate their effectiveness.

5.1. Visual Transformer for Classification

We conduct experiments on the ImageNet dataset [11], with around 1.3 million images in the training set and 50 thousand images in the validation set. We implement VT models in PyTorch [29]. We use the stochastic gradient descent (SGD) optimizer with Nesterov momentum [34], an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 4e-5. We train the model for 90 epochs and decay the learning rate by 10x every 30 epochs. We use a batch size of 256 and 8 V100 GPUs for training.

VT vs. ResNet with default training recipe: In Table 2, we first compare VT-ResNets and vanilla ResNets under the same training recipe. VT-ResNets in this experiment use a filter-based tokenizer for the first VT module and recurrent tokenizers in later modules. We can see that after replacing the last stage of convolutions in ResNet18 and ResNet34, VT-based ResNets use many fewer FLOPs: 244M fewer FLOPs for ResNet18 and 384M fewer for ResNet34. Meanwhile, VT-ResNets achieve much higher top-1 validation accuracy than the corresponding ResNets, by up to 2.2 points. This confirms the effectiveness of VTs. Also note that the training accuracy achieved by VT-ResNets is much higher than that of the baseline ResNets: VT-R18 is 7.9 points higher and VT-R34 is 6.9 points higher. This indicates that VT-ResNets overfit more heavily than regular ResNets. We hypothesize this is because VT-ResNets have much larger capacity, and we need stronger regularization (e.g., data augmentation) to fully utilize the model capacity. We address this in Section 5.2 and Table 8.

Table 2: VT-ResNet vs. baseline ResNets on the ImageNet dataset. By replacing the last stage of the ResNets, VT-ResNets use 244M and 384M fewer FLOPs than the baseline ResNets while achieving 1.7 and 2.2 points higher validation accuracy. Note that the training accuracy of VT-ResNets is much higher, indicating that VT-ResNets have higher model capacity and require stronger regularization (e.g., data augmentation) to fully utilize it. See Table 8.

          Top-1 Acc (%)  Top-1 Acc (%)  FLOPs  Params
          (Val)          (Train)        (M)    (M)
R18       69.9           68.6           1814   11.7
VT-R18    72.1           76.5           1570   11.7
R34       73.3           73.9           3664   21.8
VT-R34    75.0           80.8           3280   21.9

Tokenizer ablation studies: In Table 3, we compare different types of tokenizers used by VTs. We consider a pooling-based tokenizer, a clustering-based tokenizer, and a filter-based tokenizer (Section 3.1.1). We use the candidate tokenizer in the first VT module and recurrent tokenizers in later modules. As a baseline, we implement a pooling-based tokenizer, which spatially downsamples the feature map X to reduce its spatial dimensions from HW = 196 to L = 16, instead of grouping pixels by their semantics. As a more advanced baseline, we consider a clustering-based tokenizer, described in Appendix C, which applies K-Means clustering in the semantic space to group pixels into visual tokens. As can be seen from Table 3, filter-based and clustering-based tokenizers perform significantly better than the pooling-based baseline, validating our hypothesis that feature maps contain redundancies, which can be addressed by grouping pixels in semantic space. The difference between filter-based and clustering-based tokenizers is small and varies between R18 and R34. We hypothesize this is because both tokenizers have their own drawbacks. The filter-based tokenizer relies on fixed convolution filters to detect and assign pixels to semantic groups, and is limited by the capacity of the convolution filters to deal with diverse and sparse high-level semantic concepts. On the other hand, the clustering-based tokenizer extracts semantic concepts that exist in the image, but it is not designed to capture the most essential semantic concepts.

Table 3: VT-ResNets with different types of tokenizers. Pooling-based tokenizers spatially downsample a feature map to obtain visual tokens. The clustering-based tokenizer (Appendix C) groups pixels in the semantic space. Filter-based tokenizers (Section 3.1.1) use convolution filters to group pixels. Both filter-based and clustering-based tokenizers work much better than pooling-based tokenizers, validating the importance of grouping pixels by their semantics.

     Tokenizer         Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  Pooling-based     70.5           1549       11.0
     Clustering-based  71.8           1579       11.6
     Filter-based      72.1           1580       11.7
R34  Pooling-based     73.6           3246       20.6
     Clustering-based  75.2           3299       21.8
     Filter-based      74.9           3280       21.9
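For reference, the pooling-based baseline in Table 3 can be approximated in a few lines (our reading of the baseline, not the exact implementation): spatially downsample the 14×14 feature map to a 4×4 grid, yielding 16 tokens that ignore semantics.

import torch
import torch.nn.functional as F

def pooling_tokenizer(x: torch.Tensor, num_tokens: int = 16) -> torch.Tensor:
    # x: (N, C, H, W) feature map; returns (N, num_tokens, C) "tokens" by spatial pooling.
    side = int(num_tokens ** 0.5)                     # 16 tokens -> a 4x4 grid
    pooled = F.adaptive_avg_pool2d(x, (side, side))   # (N, C, 4, 4)
    return pooled.flatten(2).transpose(1, 2)          # (N, 16, C)

tokens = pooling_tokenizer(torch.randn(2, 256, 14, 14))  # (2, 16, 256)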
In Table 4, we validate the recurrent tokenizer's effectiveness. We use a filter-based tokenizer in the first VT module and recurrent tokenizers in subsequent modules. Experiments show that using recurrent tokenizers leads to higher accuracy.

Table 4: VT-ResNets that use recurrent tokenizers achieve better performance, since recurrent tokenizers are content-aware. RT denotes recurrent tokenizer.

             Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  w/ RT   72.0           1569       11.7
     w/o RT  71.2           1586       11.1
R34  w/ RT   74.9           3246       21.9
     w/o RT  74.5           3335       20.9

Modeling token relationships: In Table 5, we compare different methods of capturing token relationships. As a baseline, we do not compute any interactions between tokens; this leads to the worst performance among all variations, validating the necessity of capturing relationships between different semantic concepts. Another alternative is to use graph convolutions, similar to [6, 26, 47], but their performance is worse than that of transformers. This is likely because graph convolutions bind each visual token to a fixed semantic concept. In comparison, using transformers allows each visual token to encode any semantic concept, as long as the concept appears in the image. This allows the model to fully utilize its capacity.

Table 5: VT-ResNets using different modules to model token relationships. Models using transformers perform better than those using graph convolutions or no token-space operations. This validates that it is important to model relationships between visual tokens (semantic concepts) and that transformers work better than graph convolutions in relating tokens.

     Module       Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  None         68.7           1528       8.5
     GraphConv    69.3           1528       8.5
     Transformer  71.5           1580       11.7
R34  None         73.3           3222       17.1
     GraphConv    73.7           3223       17.1
     Transformer  75.2           3299       21.8

Token efficiency ablation: In Table 6, we test varying numbers of visual tokens, only to find negligible or no increase in accuracy. This agrees with our hypothesis that VTs already capture a wide variety of concepts with just a few handfuls of tokens; additional tokens are not needed, as the space of possible high-level concepts is already covered.

Projection ablation: In Table 7, we study whether we need to project visual tokens back to the feature map. We hypothesize that projecting the visual tokens is an important step, since in vision understanding pixel-level semantics are very important, and visual tokens are representations in the semantic space that do not encode any spatial information. As validated by Table 7, projecting visual tokens back to the feature map leads to higher performance, even for image classification tasks.
Table 6: Using more visual tokens does not improve the accuracy of VTs, which agrees with our hypothesis that images can be described by a compact set of visual tokens.

     No. Tokens  Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  16          71.8           1579       11.6
     32          71.9           1711       11.6
     64          72.1           1979       11.6
R34  16          75.1           3299       21.8
     32          75.0           3514       21.8
     64          75.0           3952       21.8

Table 7: VTs that project tokens back to the feature map perform better. This may be because feature maps still encode important spatial information.

                    Top-1 Acc (%)  FLOPs (M)  Params (M)
R18  w/ projector   72.0           1569       11.7
     w/o projector  71.0           1498       9.4
R34  w/ projector   74.8           3280       21.9
     w/o projector  73.9           3159       17.4

5.2. Training with Advanced Recipe

In Table 2, we show that under the regular training recipe, VT-ResNets experience serious overfitting: despite their accuracy improvement on the validation set, their training accuracy improves by a significantly larger margin. We hypothesize that this is because VT-based models have much higher model capacity. To fully unleash the potential of VTs, we use a more advanced training recipe. To prevent overfitting, we train with more epochs, stronger data augmentation, stronger regularization, and distillation. Specifically, we train VT-ResNet models for 400 epochs with RMSProp. We use an initial learning rate of 0.01, increase it to 0.16 over 5 warmup epochs, then reduce the learning rate by a factor of 0.9875 per epoch. We use synchronized batch normalization and distributed training with a batch size of 2048. We use label smoothing and AutoAugment [8], and we set the stochastic depth survival probability [22] and dropout ratio to 0.9 and 0.2, respectively. We apply an exponential moving average (EMA) to the model weights with a decay of 0.99985. We use knowledge distillation [15] in the training recipe, where the teacher model is FBNetV3-G [9]. The total loss is a weighted sum of the distillation loss (×0.8) and the cross-entropy loss (×0.2).
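A sketch of how such a combined objective is typically assembled. The 0.8/0.2 weights and the FBNetV3-G teacher come from the recipe above; the helper name, the KL-divergence form of the distillation term, and the 0.1 label-smoothing value are our assumptions.

import torch
import torch.nn.functional as F

def vt_training_loss(student_logits, teacher_logits, targets,
                     distill_weight=0.8, ce_weight=0.2, label_smoothing=0.1):
    # Cross-entropy with label smoothing against the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets, label_smoothing=label_smoothing)
    # Distillation: KL divergence between student and (frozen) teacher distributions.
    distill = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return distill_weight * distill + ce_weight * ce

student_logits = torch.randn(4, 1000, requires_grad=True)
teacher_logits = torch.randn(4, 1000)   # e.g., produced by an FBNetV3-G teacher
targets = torch.randint(0, 1000, (4,))
loss = vt_training_loss(student_logits, teacher_logits, targets)
loss.backward()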
Our results are reported in Table 8. Compared with the baseline ResNet models, VT-ResNet models achieve 4.6 to 7 points higher accuracy and surpass all other related work that adopts attention in different forms on top of ResNets [21, 41, 1, 5, 19, 30, 49, 6]. This validates that our advanced training recipe better utilizes the model capacity of VT-ResNet models to outperform all previous baselines. Note that in addition to the architecture differences, previous works also used their own training recipes, and it is infeasible for us to try these recipes one by one. So, to understand the source of the accuracy gain, we use the same training recipe to train the baseline ResNet18 and ResNet34 and also observe a significant accuracy improvement on the baseline ResNets. But note that under the advanced training recipe, the accuracy gap between VT-ResNets and the baselines increases from 1.7 and 2.2 points to 2.2 and 3.0 points. This further validates that a stronger training recipe can better utilize the model capacity of VTs. While achieving higher accuracy than previous work, VT-ResNets also use much fewer FLOPs and parameters, even though we only replace the last stage of the baseline model. If we consider the reduction over the original stage, we observe FLOP reductions of up to 6.9x, as shown in Table 1.

Table 8: Comparing VT-ResNets with other attention-augmented ResNets on ImageNet. *The baseline ResNet FLOPs reported in [41] are lower than our baseline.

Models              Top-1 Acc (%)  FLOPs (G)  Params (M)
R18 [14]            69.8           1.814      11.7
R18+SE [21, 41]     70.6           1.814      11.8
R18+CBAM [41]       70.7           1.815      11.8
LR-R18 [19]         74.6           2.5        14.4
R18 [14] (ours)     73.8           1.814      11.7
VT-R18 (ours)       76.8           1.569      11.7
R34 [14]            73.3           3.664      21.8
R34+SE [21, 41]     73.9           3.664      22.0
R34+CBAM [41]       74.0           3.664      22.9
AA-R34 [1]          74.7           3.55       20.7
R34 [14] (ours)     77.7           3.664      21.8
VT-R34 (ours)       79.9           3.236      19.2
R50 [14]            76.0           4.089      25.5
R50+SE [21, 41]     76.9           3.860*     28.1
R50+CBAM [41]       77.3           3.864*     28.1
LR-R50 [19]         77.3           4.3        23.3
Stand-Alone [30]    77.6           3.6        18.0
AA-R50 [1]          77.7           4.1        25.6
A2-R50 [5]          77.0           -          -
SAN19 [49]          78.2           3.3        20.5
GloRe-R50 [6]       78.4           5.2        30.5
VT-R50 (ours)       80.6           3.412      21.4
R101 [14]           77.4           7.802      44.4
R101+SE [21, 41]    77.7           7.575*     49.3
R101+CBAM [41]      78.5           7.581*     49.3
LR-R101 [19]        78.5           7.79       42.0
AA-R101 [1]         78.7           8.05       45.4
GloRe-R200 [6]      79.9           16.9       70.6
VT-R101 (ours)      82.3           7.129      41.5

5.3. Visual Transformer for Semantic Segmentation

We conduct experiments to test the effectiveness of VTs for semantic segmentation on the COCO-Stuff dataset [2] and the LIP dataset [25]. The COCO-Stuff dataset contains annotations for 91 stuff classes, with 118K training images and 5K validation images. The LIP dataset is a collection of images annotated for human parsing.
Figure 5: Visualization of the spatial attention generated by a filter-based tokenizer on images from the LIP dataset. Red denotes higher attention values and blue denotes lower. Without any supervision, visual tokens automatically focus on different areas of the image that correspond to different semantic concepts, such as sheep, ground, clothes, and woods.

Table 9: Semantic segmentation results on the COCO-Stuff validation set. FLOPs are calculated with a typical input resolution of 800×1216.

               mIoU (%)  Total FLOPs (G)  FPN FLOPs (G)
R-50   FPN     40.78     159              55.1
       VT-FPN  41.00     113 (1.41x)      8.5 (6.48x)
R-101  FPN     41.51     231              55.1
       VT-FPN  41.50     185 (1.25x)      8.5 (6.48x)

We visualize the spatial attention A ∈ R^{HW×L} generated by filter-based tokenizers, where each A_{:,l} ∈ R^{HW} is an attention map showing how each pixel of the image contributes to token l. We plot the attention maps in Figure 5, and we can see that, without any supervision, different visual tokens attend to different semantic concepts in the image, corresponding to parts of the background or foreground objects. More visualization results are provided in Appendix B.
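A small sketch of how such attention maps can be rendered from the tokenizer output (illustrative only; A is the spatial attention from Equation (1) for a single image):

import torch
import torch.nn.functional as F

def attention_heatmaps(spatial_attention: torch.Tensor, h: int, w: int, image_hw=(224, 224)):
    # spatial_attention: (HW, L) matrix A from Eq. (1) for one image.
    num_tokens = spatial_attention.shape[1]
    maps = spatial_attention.T.reshape(num_tokens, 1, h, w)  # one attention map per token
    # Upsample each token's map to the input image resolution for overlay.
    return F.interpolate(maps, size=image_hw, mode="bilinear", align_corners=False)

A = torch.rand(14 * 14, 16).softmax(dim=0)    # dummy attention over 196 pixels, 16 tokens
heatmaps = attention_heatmaps(A, h=14, w=14)  # (16, 1, 224, 224)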
References

[1] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3286–3295, 2019.
[2] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
[4] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640–3649, 2016.
[5] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-Nets: Double attention networks. In Advances in Neural Information Processing Systems, pages 352–361, 2018.
[6] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 433–442, 2019.
[7] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
[8] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[9] Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, et al. FBNetV3: Joint architecture-recipe search using neural acquisition function. arXiv preprint arXiv:2006.02049, 2020.
[10] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. ChamNet: Towards efficient network design through platform-aware model adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11398–11407, 2019.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. SqueezeNext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1638–1647, 2018.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
[17] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[18] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
[19] Han Hu, Zheng Zhang, Zhenda Xie, and Stephen Lin. Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3464–3473, 2019.
[20] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-Excite: Exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pages 9401–9411, 2018.
[21] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[22] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[23] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[24] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019.
[25] Xiaodan Liang, Ke Gong, Xiaohui Shen, and Liang Lin. Look into Person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):871–885, 2018.
[26] Xiaodan Liang, Zhiting Hu, Hao Zhang, Liang Lin, and Eric P Xing. Symbolic graph reasoning meets convolutions. In Advances in Neural Information Processing Systems, pages 1853–1863, 2018.
[27] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[28] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[30] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[31] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[32] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[33] Hang Su, Varun Jampani, Deqing Sun, Orazio Gallo, Erik Learned-Miller, and Jan Kautz. Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[34] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[35] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2820–2828, 2019.
[36] Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[38] Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, et al. FBNetV2: Differentiable neural architecture search for spatial and channel dimensions. arXiv preprint arXiv:2004.05565, 2020.
[39] Weiyue Wang and Ulrich Neumann. Depth-aware CNN for RGB-D segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–150, 2018.
[40] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[41] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[42] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10734–10742, 2019.
[43] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9127–9135, 2018.
[44] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In ICRA, 2019.
[45] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[46] Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. SqueezeSegV3: Spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803, 2020.
[47] Songyang Zhang, Xuming He, and Shipeng Yan. LatentGNN: Learning efficient non-local relations for visual recognition. In International Conference on Machine Learning, pages 7374–7383, 2019.
[48] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
[49] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring self-attention for image recognition. arXiv preprint arXiv:2004.13621, 2020.
Figure 6: Clustering-based tokenizer that groups pixels using the K-Means centroids of the pixels in the semantic space.

Listing 1: Pseudo-code of K-Means (Lloyd's algorithm) implemented in PyTorch.

import torch
import torch.nn.functional as F

def kmeans_centroids(X_nchw, U_ncl, niter):
    # X_nchw: (N, C, H, W) feature map; U_ncl: (N, C, L) initial centroids.
    N, C, H, W = X_nchw.shape
    X_ncp = X_nchw.view(N, C, H * W)  # p = h*w
    # Normalize to unit vectors.
    U_ncl = F.normalize(U_ncl, dim=1)
    X_ncp = F.normalize(X_ncp, dim=1)
    for _ in range(niter):  # Lloyd's algorithm
        # Distance from every pixel to every centroid: (N, P, L).
        dist_npl = (X_ncp[..., None] - U_ncl[:, :, None, :]).norm(dim=1)
        # Hard-assign each pixel to its nearest centroid.
        mask_npl = (dist_npl == dist_npl.min(dim=2, keepdim=True)[0]).float()
        # Recompute each centroid as the mean of its assigned pixels.
        U_ncl = X_ncp.matmul(mask_npl) / mask_npl.sum(dim=1, keepdim=True).clamp(min=1)
        U_ncl = F.normalize(U_ncl, dim=1)
    return U_ncl  # centroids
A. Stage-wise model description of VT-ResNet

In this section, we provide a stage-wise description of the model configurations for VT-based ResNets (VT-ResNets). We use three hyper-parameters to control a VT module: the channel size of the output feature map C, the channel size of the visual tokens CT, and the number of visual tokens L. The model configurations are described in Table 11.

C. Clustering-based tokenizer

To address this limitation of filter-based tokenizers, we devise a content-aware variant W_K of W_A to form semantic groups from X. Our insight is to extract concepts present in the current image by clustering pixels, instead of applying the same filters regardless of the image content. First, we treat each pixel as a sample {X_p}_{p=1}^{HW} and apply K-Means to find L centroids, which are stacked to form W_K ∈ R^{C×L}. Each centroid represents one semantic concept in the image. Second, we replace W_A in Equation (1) with W_K to form L semantic groups of channels:

    W_K = KMEANS(X),
    T = SOFTMAX_HW(X W_K)^T X.    (6)

Although this tokenizer efficiently models only the concepts in the current image, its drawback is that it is not designed to choose the most discriminative concepts.
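Putting Listing 1 and Equation (6) together, a clustering-based tokenizer might be wired up as follows. This is a sketch: the random-pixel centroid initialization is our assumption, and kmeans_centroids refers to the function in Listing 1.

import torch

def clustering_tokenizer(x_nchw: torch.Tensor, num_tokens: int = 16, niter: int = 5):
    # x_nchw: (N, C, H, W) feature map.
    n, c, h, w = x_nchw.shape
    x_ncp = x_nchw.view(n, c, h * w)
    # Initialize centroids from randomly chosen pixels (assumption, not from the paper).
    idx = torch.randperm(h * w)[:num_tokens]
    w_k = kmeans_centroids(x_nchw, x_ncp[:, :, idx], niter)            # (N, C, L), see Listing 1
    # Eq. (6): use the centroids as content-aware grouping weights.
    attn = torch.einsum("ncp,ncl->npl", x_ncp, w_k).softmax(dim=1)     # (N, HW, L)
    return torch.einsum("npl,ncp->nlc", attn, x_ncp)                   # (N, L, C) visual tokens

tokens = clustering_tokenizer(torch.randn(2, 256, 14, 14))             # (2, 16, 256)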
Table 11: Model descriptions for VT-ResNets. VT-R18 denotes visual-transformer-ResNet-18. "conv7×7-C64-S2" denotes a 7×7 convolution with an output channel size of 64 and a stride of 2. "BB-C64 ×2" denotes a basic block [14] with an output channel size of 64, repeated twice. "BN-C256 ×3" denotes a bottleneck block [14] with an output channel size of 256, repeated three times. "VT-C512-L16-CT1024 ×2" denotes a VT block with a feature map channel size of 512, a visual token channel size of 1024, and 16 tokens, repeated twice.

Stage  Resolution  VT-R18                  VT-R34                  VT-R50                   VT-R101
1      56×56       conv7×7-C64-S2 → maxpool3×3-S2
2      56×56       BB-C64 ×2               BB-C64 ×3               BN-C256 ×3               BN-C256 ×3
3      28×28       BB-C128 ×2              BB-C128 ×4              BN-C512 ×4               BN-C512 ×4
4      14×14       BB-C256 ×2              BB-C256 ×6              BN-C1024 ×6              BN-C1024 ×23
5      16 tokens   VT-C512-L16-CT1024 ×2   VT-C512-L16-CT1024 ×3   VT-C1024-L16-CT1024 ×3   VT-C1024-L16-CT1024 ×3
head   1           avgpool-fc1000

Figure 7: Visualization of the spatial attention generated by a filter-based tokenizer. Red denotes higher attention values and blue denotes lower. Without any supervision, visual tokens automatically focus on different areas of the image that correspond to different semantic concepts.