ViViT: A Video Vision Transformer
Anurag Arnab*  Mostafa Dehghani*  Georg Heigold  Chen Sun  Mario Lučić†  Cordelia Schmid†
Google Research
{aarnab, dehghani, heigold, chensun, lucic, cordelias}@google.com
* Equal contribution
† Equal advising

Abstract

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic.

1. Introduction

Approaches based on deep convolutional neural networks have advanced the state-of-the-art across many standard datasets for vision problems since AlexNet [38]. At the same time, the most prominent architecture of choice in sequence-to-sequence modelling (e.g. in natural language processing) is the transformer [68], which does not use convolutions, but is based on multi-headed self-attention. This operation is particularly effective at modelling long-range dependencies and allows the model to attend over all elements in the input sequence. This is in stark contrast to convolutions, where the corresponding “receptive field” is limited and grows linearly with the depth of the network.

The success of attention-based models in NLP has recently inspired approaches in computer vision to integrate transformers into CNNs [75, 7], as well as some attempts to replace convolutions completely [49, 3, 53]. However, it is only very recently, with the Vision Transformer (ViT) [18], that a pure-transformer based architecture has outperformed its convolutional counterparts in image classification. Dosovitskiy et al. [18] closely followed the original transformer architecture of [68], and noticed that its main benefits were observed at large scale – as transformers lack some of the inductive biases of convolutions (such as translational equivariance), they seem to require more data [18] or stronger regularisation [64].

Inspired by ViT, and the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, we develop several transformer-based models for video classification. Currently, the most performant models are based on deep 3D convolutional architectures [8, 20, 21], which were a natural extension of image classification CNNs [27, 60]. Recently, these models were augmented by incorporating self-attention into their later layers to better capture long-range dependencies [75, 23, 79, 1].

As shown in Fig. 1, we propose pure-transformer models for video classification. The main operation performed in this architecture is self-attention, and it is computed on a sequence of spatio-temporal tokens that we extract from the input video. To effectively process the large number of spatio-temporal tokens that may be encountered in video, we present several methods of factorising our model along spatial and temporal dimensions to increase efficiency and scalability. Furthermore, to train our model effectively on smaller datasets, we show how to regularise our model during training and leverage pretrained image models.

We also note that convolutional models have been developed by the community for several years, and there are thus many “best practices” associated with such models. As pure-transformer models present different characteristics, we need to determine the best design choices for such architectures. We conduct a thorough ablation analysis of tokenisation strategies, model architecture and regularisation methods. Informed by this analysis, we achieve state-of-the-art results on multiple standard video classification benchmarks, including Kinetics 400 and 600 [35], Epic Kitchens 100 [13], Something-Something v2 [26] and Moments in Time [45].
Figure 1: We propose a pure-transformer architecture for video classification, inspired by the recent success of such models for images [18].
To effectively process a large number of spatio-temporal tokens, we develop several model variants which factorise different components
of the transformer encoder over the spatial- and temporal-dimensions. As shown on the right, these factorisations correspond to different
attention patterns over space and time.
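As an illustrative aside, the token-extraction step shown in Fig. 1 can be sketched as the “tubelet embedding” tokeniser discussed later in the paper (Sec. 3.2): non-overlapping spatio-temporal patches of the input clip are flattened and linearly projected to token vectors. The following is a minimal, framework-agnostic NumPy sketch under assumed shapes and names; it is not the released implementation (which is part of the JAX-based Scenic library [15]).

```python
import numpy as np

def tubelet_tokens(video, E, tubelet=(2, 16, 16)):
    """Extract non-overlapping spatio-temporal tubelets and project them to tokens.

    video: array of shape (T, H, W, C).
    E:     projection matrix of shape (t*h*w*C, d), i.e. a flattened embedding filter.
    Returns an array of shape (nt*nh*nw, d), one token per tubelet.
    """
    t, h, w = tubelet
    T, H, W, C = video.shape
    nt, nh, nw = T // t, H // h, W // w
    # Split each axis into (number of tubelets, tubelet size), then group tubelet dims.
    x = video.reshape(nt, t, nh, h, nw, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(nt * nh * nw, t * h * w * C)
    return x @ E  # linear projection to d-dimensional tokens

# Example: a 32-frame 224x224 RGB clip, 16x16x2 tubelets, 768-dimensional tokens.
rng = np.random.default_rng(0)
video = rng.standard_normal((32, 224, 224, 3))
E = rng.standard_normal((2 * 16 * 16 * 3, 768)) * 0.02
print(tubelet_tokens(video, E).shape)  # (16*14*14, 768)
```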
2. Related Work

Architectures for video understanding have mirrored advances in image recognition. Early video research used hand-crafted features to encode appearance and motion information [41, 69]. The success of AlexNet on ImageNet [38, 16] initially led to the repurposing of 2D image convolutional networks (CNNs) for video as “two-stream” networks [34, 56, 47]. These models processed RGB frames and optical flow images independently before fusing them at the end. Availability of larger video classification datasets such as Kinetics [35] subsequently facilitated the training of spatio-temporal 3D CNNs [8, 22, 65], which have significantly more parameters and thus require larger training datasets. As 3D convolutional networks require significantly more computation than their image counterparts, many architectures factorise convolutions across spatial and temporal dimensions and/or use grouped convolutions [59, 66, 67, 81, 20]. We also leverage factorisation of the spatial and temporal dimensions of videos to increase efficiency, but in the context of transformer-based models.

Concurrently, in natural language processing (NLP), Vaswani et al. [68] achieved state-of-the-art results by replacing convolutions and recurrent networks with the transformer network that consisted only of self-attention, layer normalisation and multilayer perceptron (MLP) operations. Current state-of-the-art architectures in NLP [17, 52] remain transformer-based, and have been scaled to web-scale datasets [5]. Many variants of the transformer have also been proposed to reduce the computational cost of self-attention when processing longer sequences [10, 11, 37, 62, 63, 73] and to improve parameter efficiency [40, 14]. Although self-attention has been employed extensively in computer vision, it has, in contrast, been typically incorporated as a layer at the end or in the later stages of the network [75, 7, 32, 77, 83], or to augment residual blocks.

Although previous works attempted to replace convolutions in vision architectures [49, 53, 55], it is only very recently that Dosovitskiy et al. [18] showed with their ViT architecture that pure-transformer networks, similar to those employed in NLP, can achieve state-of-the-art results for image classification too. The authors showed that such models are only effective at large scale, as transformers lack some of the inductive biases of convolutional networks (such as translational equivariance), and thus require datasets larger than the common ImageNet ILSVRC dataset [16] to train. ViT has inspired a large amount of follow-up work in the community, and we note that there are a number of concurrent approaches on extending it to other tasks in computer vision [71, 74, 84, 85] and improving its data-efficiency [64, 48]. In particular, [4, 46] have also proposed transformer-based models for video.

In this paper, we develop pure-transformer architectures for video classification. We propose several variants of our model, including those that are more efficient by factorising the spatial and temporal dimensions of the input video. We also show how additional regularisation and pretrained models can be used to combat the fact that video datasets are not as large as their image counterparts that ViT was originally trained on. Furthermore, we outperform the state-of-the-art across five popular datasets.

3. Video Vision Transformers

We start by summarising the recently proposed Vision Transformer [18] in Sec. 3.1, and then discuss two approaches for extracting tokens from video in Sec. 3.2. Finally, we develop several transformer-based architectures for video classification in Sec. 3.3 and 3.4.
3.1. Overview of Vision Transformers (ViT)
Model 4: Factorised dot-product attention The factorisation here is similar in spirit to Model 3, but we factorise the multi-head dot-product attention operation instead (Fig. 6). Concretely, we compute attention weights for each token separately over the spatial- and temporal-dimensions using different heads. First, we note that the attention operation for each head is defined as

$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}. \quad (7)$

In self-attention, the queries $\mathbf{Q} = \mathbf{X}\mathbf{W}_q$, keys $\mathbf{K} = \mathbf{X}\mathbf{W}_k$, and values $\mathbf{V} = \mathbf{X}\mathbf{W}_v$ are linear projections of the input $\mathbf{X}$ with $\mathbf{X}, \mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times d}$. Note that in the unfactorised case (Model 1), the spatial and temporal dimensions are merged as $N = n_t \cdot n_h \cdot n_w$.

The main idea here is to modify the keys and values for each query to only attend over tokens from the same spatial- and temporal index by constructing $\mathbf{K}_s, \mathbf{V}_s \in \mathbb{R}^{n_h \cdot n_w \times d}$ and $\mathbf{K}_t, \mathbf{V}_t \in \mathbb{R}^{n_t \times d}$, namely the keys and values corresponding to these dimensions. Then, for half of the attention heads, we attend over tokens from the spatial dimension by computing $\mathbf{Y}_s = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}_s, \mathbf{V}_s)$, and for the rest we attend over the temporal dimension by computing $\mathbf{Y}_t = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}_t, \mathbf{V}_t)$. Given that we are only changing the attention neighbourhood for each query, the attention operation has the same dimension as in the unfactorised case, namely $\mathbf{Y}_s, \mathbf{Y}_t \in \mathbb{R}^{N \times d}$. We then combine the outputs of multiple heads by concatenating them and using a linear projection [68], $\mathbf{Y} = \mathrm{Concat}(\mathbf{Y}_s, \mathbf{Y}_t)\mathbf{W}_O$.

Figure 6: Factorised dot-product attention (Model 4). For half of the heads, we compute dot-product attention over only the spatial axes, and for the other half, over only the temporal axis.

Positional embeddings Our video models have $n_t$ times more tokens than the pretrained image model. As a result, we initialise the positional embeddings by “repeating” them temporally from $\mathbb{R}^{n_w \cdot n_h \times d}$ to $\mathbb{R}^{n_t \cdot n_h \cdot n_w \times d}$. Therefore, at initialisation, all tokens with the same spatial index have the same embedding, which is then fine-tuned.

Embedding weights, E When using the “tubelet embedding” tokenisation method (Sec. 3.2), the embedding filter $\mathbf{E}$ is a 3D tensor, compared to the 2D tensor in the pretrained model, $\mathbf{E}_{\mathrm{image}}$. A common approach for initialising 3D convolutional filters from 2D filters for video classification is to “inflate” them by replicating the filters along the temporal dimension and averaging them [8, 22] as

$\mathbf{E} = \frac{1}{t}\left[\mathbf{E}_{\mathrm{image}}, \ldots, \mathbf{E}_{\mathrm{image}}, \ldots, \mathbf{E}_{\mathrm{image}}\right]. \quad (8)$

We consider an additional strategy, which we denote as “central frame initialisation”, where $\mathbf{E}$ is initialised with zeroes along all temporal positions, except at the centre $\lfloor t/2 \rfloor$,

$\mathbf{E} = \left[\mathbf{0}, \ldots, \mathbf{E}_{\mathrm{image}}, \ldots, \mathbf{0}\right]. \quad (9)$

Therefore, the 3D convolutional filter effectively behaves like “Uniform frame sampling” (Sec. 3.2) at initialisation, while also enabling the model to learn to aggregate temporal information from multiple frames as training progresses.

Transformer weights for Model 3 The transformer block in Model 3 (Fig. 5) differs from the pretrained ViT model [18], in that it contains two multi-headed self-attention (MSA) modules. In this case, we initialise the spatial MSA module from the pretrained module, and initialise all weights of the temporal MSA with zeroes, such that Eq. 5 behaves as a residual connection [27] at initialisation.
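To make the factorised dot-product attention above concrete, the following is a minimal, framework-agnostic NumPy sketch of Model 4 for a single example (the released implementation is part of the JAX-based Scenic library [15]). It assumes tokens are arranged temporal-major as $\mathbf{X} \in \mathbb{R}^{N \times d}$ with $N = n_t \cdot n_h \cdot n_w$, uses one shared projection per query/key/value rather than per-head projections, and splits the heads evenly into a spatial and a temporal half; all function and variable names are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def factorised_dot_product_attention(X, Wq, Wk, Wv, Wo, nt, nh, nw, num_heads):
    """Model-4-style attention: half of the heads attend only over the spatial axes
    (tokens in the same frame), the other half only over the temporal axis
    (tokens at the same spatial location). X has shape (N, d), N = nt*nh*nw."""
    N, d = X.shape
    ns = nh * nw
    dh = d // num_heads
    half = num_heads // 2

    def project(W):  # (N, d) -> (nt, ns, num_heads, dh)
        return (X @ W).reshape(nt, ns, num_heads, dh)
    Q, K, V = project(Wq), project(Wk), project(Wv)

    # Spatial heads: each query attends over the ns tokens sharing its temporal index.
    q, k, v = Q[..., :half, :], K[..., :half, :], V[..., :half, :]
    attn = softmax(np.einsum('tqhd,tkhd->thqk', q, k) / np.sqrt(dh))
    Ys = np.einsum('thqk,tkhd->tqhd', attn, v)            # (nt, ns, half, dh)

    # Temporal heads: each query attends over the nt tokens sharing its spatial index.
    q, k, v = Q[..., half:, :], K[..., half:, :], V[..., half:, :]
    attn = softmax(np.einsum('qshd,kshd->shqk', q, k) / np.sqrt(dh))
    Yt = np.einsum('shqk,kshd->qshd', attn, v)            # (nt, ns, num_heads - half, dh)

    # Concatenate the two groups of heads and apply the output projection W_O.
    Y = np.concatenate([Ys, Yt], axis=2).reshape(N, d)
    return Y @ Wo

# Example: nt=8 temporal x 14x14 spatial tokens, d=768, 12 heads.
rng = np.random.default_rng(0)
N, d = 8 * 14 * 14, 768
X = rng.standard_normal((N, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.02 for _ in range(4))
print(factorised_dot_product_attention(X, Wq, Wk, Wv, Wo, 8, 14, 14, 12).shape)  # (1568, 768)
```

The initialisation strategies of Eqs. 8 and 9, and the temporal repetition of the positional embeddings, can be sketched in the same spirit; shapes are again illustrative assumptions.

```python
import numpy as np

def inflate_filter(E_image, t):
    """Eq. 8: replicate the pretrained 2D filter along time and average.
    E_image: (h, w, c_in, d) -> returns (t, h, w, c_in, d)."""
    return np.stack([E_image / t] * t, axis=0)

def central_frame_init(E_image, t):
    """Eq. 9: zeros at all temporal positions except the centre, floor(t/2)."""
    E = np.zeros((t,) + E_image.shape, dtype=E_image.dtype)
    E[t // 2] = E_image
    return E

def repeat_positional_embeddings(p_image, nt):
    """Repeat image positional embeddings temporally: (nh*nw, d) -> (nt*nh*nw, d),
    so all tokens with the same spatial index start from the same embedding."""
    return np.tile(p_image, (nt, 1))
```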
Each row of Tab. 4 includes all the methods from the rows above it, and we observe progressive improvements from adding each regulariser. Overall, we obtain a substantial overall improvement of 5.3% on Epic Kitchens. We also achieve a similar improvement of 5% on SSv2 by using all the regularisation in Tab. 4. Note that the Kinetics-pretrained models that we initialise from are from Tab. 2, and that all Epic Kitchens models in Tab. 2 were trained with all the regularisers in Tab. 4. For larger datasets like Kinetics and Moments in Time, we do not use these additional regularisers (we use only the first row of Tab. 4).

Figure 7: The effect of the backbone architecture on (a) accuracy and (b) computation on Kinetics 400, for the spatio-temporal attention model (Model 1). Both panels are plotted against the input tubelet size (16x8, 16x4, 16x2).
Table 6: Comparison to the state-of-the-art across the five datasets.

(a) Kinetics 400
Method                         Top 1   Top 5   Views    TFLOPs
blVNet [19]                    73.5    91.2    –        –
STM [33]                       73.7    91.6    –        –
TEA [42]                       76.1    92.5    10 × 3   2.10
TSM-ResNeXt-101 [43]           76.3    –       –        –
I3D NL [75]                    77.7    93.3    10 × 3   10.77
CorrNet-101 [70]               79.2    –       10 × 3   6.72
ip-CSN-152 [66]                79.2    93.8    10 × 3   3.27
LGD-3D R101 [51]               79.4    94.4    –        –
SlowFast R101-NL [21]          79.8    93.9    10 × 3   7.02
X3D-XXL [20]                   80.4    94.6    10 × 3   5.82
TimeSformer-L [4]              80.7    94.7    1 × 3    7.14
ViViT-L/16x2 FE                80.6    92.7    1 × 1    3.98
ViViT-L/16x2 FE                81.7    93.8    1 × 3    11.94
Methods with large-scale pretraining
ip-CSN-152 [66] (IG [44])      82.5    95.3    10 × 3   3.27
ViViT-L/16x2 FE (JFT)          83.5    94.3    1 × 3    11.94
ViViT-H/14x2 (JFT)             84.9    95.8    4 × 3    47.77

(b) Kinetics 600
Method                  Top 1   Top 5
AttentionNAS [76]       79.8    94.4
LGD-3D R101 [51]        81.5    95.6
SlowFast R101-NL [21]   81.8    95.1
X3D-XL [20]             81.9    95.5
TimeSformer-L [4]       82.2    95.6
ViViT-L/16x2 FE         82.9    94.6
ViViT-L/16x2 FE (JFT)   84.3    94.9
ViViT-H/14x2 (JFT)      85.8    96.5

(c) Moments in Time
Method                 Top 1   Top 5
TSN [72]               25.3    50.1
TRN [86]               28.3    53.4
I3D [8]                29.5    56.1
blVNet [19]            31.4    59.3
AssembleNet-101 [54]   34.3    62.7
ViViT-L/16x2 FE        38.5    64.1

(d) Epic Kitchens 100 (Top-1 accuracy)
Method            Action   Verb   Noun
TSN [72]          33.2     60.2   46.0
TRN [86]          35.3     65.9   45.4
TBN [36]          36.7     66.0   47.2
TSM [43]          38.3     67.9   49.0
SlowFast [21]     38.5     65.6   50.0
ViViT-L/16x2 FE   44.0     66.4   56.8

(e) Something-Something v2
Method               Top 1   Top 5
TRN [86]             48.8    77.6
SlowFast [20, 80]    61.7    –
TimeSformer-HR [4]   62.5    –
TSM [43]             63.4    88.5
STM [33]             64.2    89.8
TEA [42]             65.1    –
blVNet [19]          65.2    90.3
ViViT-L/16x2 FE      65.9    89.9
Figure 9 shows that as we increase the number of frames input to the network, the accuracy from processing a single view increases, since the network incorporates longer temporal context. However, common practice on datasets such as Kinetics [21, 75, 42] is to average results over multiple, shorter “views” of the same video clip. Figure 9 also shows that the accuracy saturates once the number of views is sufficient to cover the whole video. As a Kinetics video consists of 250 frames, and we sample frames with a stride of 2, our model which processes 128 frames requires just a single view to “see” the whole video and achieve its maximum accuracy.

Note that we used ViViT-L/16x2 Factorised Encoder (Model 2) here. As this model is more efficient, it can process more tokens, compared to the unfactorised Model 1 which runs out of memory after 48 frames using tubelet length t = 2 and a “Large” backbone. Models processing more frames (and thus more tokens) consistently achieve higher single- and multi-view accuracy, in line with our observations in previous experiments (Tab. 5, Fig. 8).

4.3. Comparison to state-of-the-art

Based on our ablation studies in the previous section, we compare to the current state-of-the-art using two of our model variants. We primarily use our Factorised Encoder model (Model 2), as it can process more tokens than Model 1 to achieve higher accuracy.

Kinetics Tables 6a and 6b show that our spatio-temporal attention models outperform the state-of-the-art on Kinetics 400 and 600 respectively. Following standard practice, we take 3 spatial crops (left, centre and right) [21, 20, 66, 75] for each temporal view, and notably, we require significantly fewer views than previous CNN-based methods. We surpass the previous CNN-based state-of-the-art using ViViT-L/16x2 Factorised Encoder (FE) pretrained on ImageNet, and also outperform [4] who concurrently proposed a pure-transformer architecture. Moreover, by initialising our backbones from models pretrained on the larger JFT dataset [58], we obtain further improvements. Although these models are not directly comparable to previous work, we do also outperform [66] who pretrained on the large-scale Instagram dataset [44]. Our best model uses a ViViT-H backbone pretrained on JFT and significantly advances the best reported results on Kinetics 400 and 600 to 84.9% and 85.8%, respectively.

Moments in Time We surpass the state-of-the-art by a significant margin as shown in Tab. 6c. We note that the videos in this dataset are diverse and contain significant label noise, making this task challenging and leading to lower accuracies than on other datasets.

Epic Kitchens 100 Table 6d shows that our Factorised Encoder model outperforms previous methods by a significant margin. In addition, our model obtains substantial improvements for Top-1 accuracy of “noun” classes, and the only method which achieves higher “verb” accuracy used optical flow as an additional input modality [43, 50]. Furthermore, all variants of our model presented in Tab. 2 outperformed the existing state-of-the-art on action accuracy. We note that we use the same model to predict verbs and nouns using two separate “heads”, and for simplicity, we do not use separate loss weights for each head.

Something-Something v2 (SSv2) Finally, Tab. 6e shows that we achieve state-of-the-art Top-1 accuracy with our Factorised encoder model (Model 2), albeit with a smaller margin compared to previous methods. Notably, our Factorised encoder model significantly outperforms the concurrent TimeSformer [4] method by 2.9%, which also proposes a pure-transformer model, but does not consider our Factorised encoder variant or our additional regularisation. SSv2 differs from other datasets in that the backgrounds and objects are quite similar across different classes, meaning that recognising fine-grained motion patterns is necessary to distinguish classes from each other. Our results suggest that capturing these fine-grained motions is an area of improvement and future work for our model. We also note an inverse correlation between the relative performance of previous methods on SSv2 (Tab. 6e) and Kinetics (Tab. 6a), suggesting that these two datasets evaluate complementary characteristics of a model.

5. Conclusion and Future Work

We have presented four pure-transformer models for video classification, with different accuracy and efficiency profiles, achieving state-of-the-art results across five popular datasets. Furthermore, we have shown how to effectively regularise such high-capacity models for training on smaller datasets and thoroughly ablated our main design choices. Future work is to remove our dependence on image-pretrained models. Finally, going beyond video classification towards more complex tasks is a clear next step.

References

[1] Anurag Arnab, Chen Sun, and Cordelia Schmid. Unified graph structured models for video understanding. In ICCV, 2021.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In arXiv preprint arXiv:1607.06450, 2016.
[3] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In ICCV, 2019.
[4] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In arXiv preprint arXiv:2102.05095, 2021.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In NeurIPS, 2020.
[6] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In CVPR Workshops, 2019.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
[9] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-nets: Double attention networks. In NeurIPS, 2018.
[10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv preprint arXiv:1904.10509, 2019.
[11] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In ICLR, 2021.
[12] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In NeurIPS, 2020.
[13] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision. In arXiv preprint arXiv:2006.13256, 2020.
[14] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In ICLR, 2019.
[15] Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, and Yi Tay. Scenic: A JAX library for computer vision research and beyond. In arXiv preprint arXiv:2110.11403, 2021.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[19] Quanfu Fan, Chun-Fu Chen, Hilde Kuehne, Marco Pistoia, and David Cox. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In NeurIPS, 2019.
[20] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In CVPR, 2020.
[21] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019.
[22] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NeurIPS, 2016.
[23] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, 2019.
[24] Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In NeurIPS, 2017.
[25] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[26] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[28] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). In arXiv preprint arXiv:1606.08415, 2016.
[29] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. In arXiv preprint arXiv:1912.12180, 2019.
[30] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[31] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[32] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[33] Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, and Junjie Yan. Stm: Spatiotemporal and motion encoding for action recognition. In ICCV, 2019.
[34] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[35] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. In arXiv preprint arXiv:1705.06950, 2017.
[36] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV, 2019.
[37] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
[38] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, volume 25, 2012.
[39] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[40] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020.
[41] Ivan Laptev. On space-time interest points. IJCV, 64(2-3), 2005.
[42] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. Tea: Temporal excitation and aggregation for action recognition. In CVPR, 2020.
[43] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In ICCV, 2019.
[44] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[45] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. PAMI, 42(2):502–508, 2019.
[46] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In arXiv preprint arXiv:2102.00719, 2021.
[47] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[48] Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, and Jianfei Cai. Scalable visual transformers with hierarchical pooling. In arXiv preprint arXiv:2103.10619, 2021.
[49] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
[50] Will Price and Dima Damen. An evaluation of action recognition models on epic-kitchens. In arXiv preprint arXiv:1908.00867, 2019.
[51] Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.
[52] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[53] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[54] Michael S Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. Assemblenet: Searching for multi-stream neural connectivity in video architectures. In ICLR, 2020.
[55] Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, and Ching-Hui Chen. Global self-attention networks for image recognition. In arXiv preprint arXiv:2010.03019, 2021.
[56] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[57] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In CVPR, 2021.
[58] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[59] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
[60] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[61] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[62] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In arXiv preprint arXiv:2011.04006, 2020.
[63] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. In arXiv preprint arXiv:2009.06732, 2020.
[64] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In arXiv preprint arXiv:2012.12877, 2020.
[65] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
[66] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
[67] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[69] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1), 2013.
[70] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. Video modeling with correlation networks. In CVPR, 2020.
[71] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In arXiv preprint arXiv:2012.00759, 2020.
[72] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[73] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. In arXiv preprint arXiv:2006.04768, 2020.
[74] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In arXiv preprint arXiv:2102.12122, 2021.
[75] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[76] Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S Ryoo, Anelia Angelova, Kris M Kitani, and Wei Hua. Attentionnas: Spatiotemporal attention cell search for video classification. In ECCV, 2020.
[77] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In arXiv preprint arXiv:2011.14503, 2020.
[78] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020.
[79] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
[80] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. A multigrid method for efficiently training video models. In CVPR, 2020.
[81] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
[82] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[83] Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing networks. In CVPR, 2020.
[84] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. In arXiv preprint arXiv:2012.09164, 2020.
[85] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In arXiv preprint arXiv:2012.15840, 2020.
[86] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.
Appendix

A. Additional experimental details

In this appendix, we provide additional experimental details. Section A.1 provides additional details about the regularisers we used and Sec. A.2 details the training hyperparameters used for our experiments.

A.1. Further details about regularisers

In this section, we provide additional details and list the hyperparameters of the additional regularisers that we employed in Tab. 4. Hyperparameter values for all our experiments are listed in Tab. 7.

Stochastic depth Stochastic depth regularisation was originally proposed for training very deep residual networks [31]. Intuitively, the outputs of a layer, $\ell$, are “dropped out” with probability $p_{\mathrm{drop}}(\ell)$ during training, by setting the output of the layer to be equal to its input. Following [31], we linearly increase the probability of dropping a layer according to its depth within the network,

$p_{\mathrm{drop}}(\ell) = \frac{\ell}{L}\, p_{\mathrm{drop}}, \quad (10)$

where $\ell$ is the index of the layer in the network, and $L$ is the total number of layers.

Mixup Mixup [82] constructs virtual training examples which are a convex combination of pairs of training examples and their labels. Concretely, given $(x_i, y_i)$ and $(x_j, y_j)$, where $x_i$ denotes an input vector and $y_i$ a one-hot input label, mixup constructs the virtual training example

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \quad (12)$

where $\lambda \in [0, 1]$ is sampled from a Beta distribution, $\mathrm{Beta}(\alpha, \alpha)$. Our choice of the hyperparameter $\alpha$ is detailed in Tab. 7.

A.2. Training hyperparameters

Table 7 details the hyperparameters for all of our experiments. We use synchronous SGD with momentum, a cosine learning rate schedule with linear warmup, and a batch size of 64 for all experiments. As aforementioned, we only employed additional regularisation when training on the smaller Epic Kitchens and Something-Something v2 datasets.
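The two regularisers detailed in Sec. A.1 can be sketched directly from Eq. 10 and Eq. 12 as follows. This is a minimal NumPy sketch; the function names and the way the layer function is passed in are illustrative assumptions rather than the released training code.

```python
import numpy as np

def drop_probability(layer_index, num_layers, p_drop):
    """Eq. 10: linearly increasing probability of dropping a given layer."""
    return (layer_index / num_layers) * p_drop

def stochastic_depth(layer_fn, x, layer_index, num_layers, p_drop, rng):
    """Skip the layer (output = input) with probability p_drop(layer_index) during training."""
    if rng.random() < drop_probability(layer_index, num_layers, p_drop):
        return x
    return layer_fn(x)

def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Eq. 12: convex combination of two training examples and their one-hot labels."""
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j
```

A common batched implementation of mixup draws one λ per batch and mixes the batch with a shuffled copy of itself; the α and p_drop values used for each dataset are listed in Tab. 7.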
Table 7: Training hyperparameters for experiments in the main paper. “–” indicates that the regularisation method was not used at all. Values which are constant across all columns are listed once. Datasets are denoted as follows: K400: Kinetics 400. K600: Kinetics 600. MiT: Moments in Time. EK: Epic Kitchens. SSv2: Something-Something v2.

                                          K400    K600    MiT     EK      SSv2
Optimisation
  Optimiser                               Synchronous SGD
  Momentum                                0.9
  Batch size                              64
  Learning rate schedule                  cosine with linear warmup
  Linear warmup epochs                    2.5
  Base learning rate                      0.1     0.1     0.25    0.5     0.5
  Epochs                                  30      30      10      50      35
Data augmentation
  Random crop probability                 1.0
  Random flip probability                 0.5
  Scale jitter probability                1.0
  Maximum scale                           1.33
  Minimum scale                           0.9
  Colour jitter probability               0.8     0.8     0.8     –       –
  Rand augment number of layers [12]      –       –       –       2       2
  Rand augment magnitude [12]             –       –       –       15      20
Other regularisation
  Stochastic droplayer rate, p_drop [31]  –       –       –       0.2     0.3
  Label smoothing λ [61]                  –       –       –       0.2     0.3
  Mixup α [82]                            –       –       –       0.1     0.3
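For reference, the cosine learning-rate schedule with linear warmup described in Sec. A.2 can be written as the short sketch below. The default values mirror the Kinetics 400 column of Tab. 7 (base learning rate 0.1, 2.5 warmup epochs, 30 epochs); the function itself is an illustrative reconstruction, not the exact released schedule.

```python
import math

def learning_rate(epoch, base_lr=0.1, warmup_epochs=2.5, total_epochs=30):
    """Cosine learning-rate decay with linear warmup (epoch may be fractional)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

This schedule is combined with synchronous SGD, momentum 0.9 and a batch size of 64, as listed under “Optimisation” in Tab. 7.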