ViViT: A Video Vision Transformer

Anurag Arnab* Mostafa Dehghani* Georg Heigold Chen Sun Mario Lučić† Cordelia Schmid†
Google Research
{aarnab, dehghani, heigold, chensun, lucic, cordelias}@google.com

* Equal contribution
† Equal advising

arXiv:2103.15691v2 [cs.CV] 1 Nov 2021

Abstract

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic.

1. Introduction

Approaches based on deep convolutional neural networks have advanced the state-of-the-art across many standard datasets for vision problems since AlexNet [38]. At the same time, the most prominent architecture of choice in sequence-to-sequence modelling (e.g. in natural language processing) is the transformer [68], which does not use convolutions, but is based on multi-headed self-attention. This operation is particularly effective at modelling long-range dependencies and allows the model to attend over all elements in the input sequence. This is in stark contrast to convolutions, where the corresponding “receptive field” is limited and grows linearly with the depth of the network.

The success of attention-based models in NLP has recently inspired approaches in computer vision to integrate transformers into CNNs [75, 7], as well as some attempts to replace convolutions completely [49, 3, 53]. However, it is only very recently, with the Vision Transformer (ViT) [18], that a pure-transformer based architecture has outperformed its convolutional counterparts in image classification. Dosovitskiy et al. [18] closely followed the original transformer architecture of [68], and noticed that its main benefits were observed at large scale: as transformers lack some of the inductive biases of convolutions (such as translational equivariance), they seem to require more data [18] or stronger regularisation [64].

Inspired by ViT, and by the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, we develop several transformer-based models for video classification. Currently, the most performant models are based on deep 3D convolutional architectures [8, 20, 21], which were a natural extension of image classification CNNs [27, 60]. Recently, these models have been augmented by incorporating self-attention into their later layers to better capture long-range dependencies [75, 23, 79, 1].

As shown in Fig. 1, we propose pure-transformer models for video classification. The main operation performed in this architecture is self-attention, and it is computed on a sequence of spatio-temporal tokens that we extract from the input video. To effectively process the large number of spatio-temporal tokens that may be encountered in video, we present several methods of factorising our model along spatial and temporal dimensions to increase efficiency and scalability. Furthermore, to train our model effectively on smaller datasets, we show how to regularise our model during training and leverage pretrained image models.

We also note that convolutional models have been developed by the community for several years, and there are thus many “best practices” associated with such models. As pure-transformer models present different characteristics, we need to determine the best design choices for such architectures. We conduct a thorough ablation analysis of tokenisation strategies, model architecture and regularisation methods. Informed by this analysis, we achieve state-of-the-art results on multiple standard video classification benchmarks, including Kinetics 400 and 600 [35], Epic Kitchens 100 [13], Something-Something v2 [26] and Moments in Time [45].
Figure 1: We propose a pure-transformer architecture for video classification, inspired by the recent success of such models for images [18].
To effectively process a large number of spatio-temporal tokens, we develop several model variants which factorise different components
of the transformer encoder over the spatial- and temporal-dimensions. As shown on the right, these factorisations correspond to different
attention patterns over space and time.

2. Related Work

Architectures for video understanding have mirrored advances in image recognition. Early video research used hand-crafted features to encode appearance and motion information [41, 69]. The success of AlexNet on ImageNet [38, 16] initially led to the repurposing of 2D image convolutional networks (CNNs) for video as “two-stream” networks [34, 56, 47]. These models processed RGB frames and optical flow images independently before fusing them at the end. The availability of larger video classification datasets such as Kinetics [35] subsequently facilitated the training of spatio-temporal 3D CNNs [8, 22, 65], which have significantly more parameters and thus require larger training datasets. As 3D convolutional networks require significantly more computation than their image counterparts, many architectures factorise convolutions across spatial and temporal dimensions and/or use grouped convolutions [59, 66, 67, 81, 20]. We also leverage factorisation of the spatial and temporal dimensions of videos to increase efficiency, but in the context of transformer-based models.

Concurrently, in natural language processing (NLP), Vaswani et al. [68] achieved state-of-the-art results by replacing convolutions and recurrent networks with the transformer network that consisted only of self-attention, layer normalisation and multilayer perceptron (MLP) operations. Current state-of-the-art architectures in NLP [17, 52] remain transformer-based, and have been scaled to web-scale datasets [5]. Many variants of the transformer have also been proposed to reduce the computational cost of self-attention when processing longer sequences [10, 11, 37, 62, 63, 73] and to improve parameter efficiency [40, 14]. Although self-attention has been employed extensively in computer vision, it has, in contrast, been typically incorporated as a layer at the end or in the later stages of the network [75, 7, 32, 77, 83] or to augment residual blocks [30, 6, 9, 57] within a ResNet architecture [27].

Although previous works attempted to replace convolutions in vision architectures [49, 53, 55], it is only very recently that Dosovitskiy et al. [18] showed with their ViT architecture that pure-transformer networks, similar to those employed in NLP, can achieve state-of-the-art results for image classification too. The authors showed that such models are only effective at large scale, as transformers lack some of the inductive biases of convolutional networks (such as translational equivariance), and thus require datasets larger than the common ImageNet ILSVRC dataset [16] to train. ViT has inspired a large amount of follow-up work in the community, and we note that there are a number of concurrent approaches on extending it to other tasks in computer vision [71, 74, 84, 85] and improving its data-efficiency [64, 48]. In particular, [4, 46] have also proposed transformer-based models for video.

In this paper, we develop pure-transformer architectures for video classification. We propose several variants of our model, including those that are more efficient by factorising the spatial and temporal dimensions of the input video. We also show how additional regularisation and pretrained models can be used to combat the fact that video datasets are not as large as the image counterparts that ViT was originally trained on. Furthermore, we outperform the state-of-the-art across five popular datasets.

3. Video Vision Transformers

We start by summarising the recently proposed Vision Transformer [18] in Sec. 3.1, and then discuss two approaches for extracting tokens from video in Sec. 3.2. Finally, we develop several transformer-based architectures for video classification in Sec. 3.3 and 3.4.
3.1. Overview of Vision Transformers (ViT)

Vision Transformer (ViT) [18] adapts the transformer architecture of [68] to process 2D images with minimal changes. In particular, ViT extracts N non-overlapping image patches, x_i ∈ R^{h×w}, performs a linear projection and then rasterises them into 1D tokens z_i ∈ R^d. The sequence of tokens input to the following transformer encoder is

z = [z_cls, E x_1, E x_2, . . . , E x_N] + p,    (1)

where the projection by E is equivalent to a 2D convolution. As shown in Fig. 1, an optional learned classification token z_cls is prepended to this sequence, and its representation at the final layer of the encoder serves as the final representation used by the classification layer [17]. In addition, a learned positional embedding, p ∈ R^{N×d}, is added to the tokens to retain positional information, as the subsequent self-attention operations in the transformer are permutation invariant. The tokens are then passed through an encoder consisting of a sequence of L transformer layers. Each layer ℓ comprises Multi-Headed Self-Attention [68], layer normalisation (LN) [2], and MLP blocks as follows:

y^ℓ = MSA(LN(z^ℓ)) + z^ℓ    (2)
z^{ℓ+1} = MLP(LN(y^ℓ)) + y^ℓ.    (3)

The MLP consists of two linear projections separated by a GELU non-linearity [28], and the token dimensionality, d, remains fixed throughout all layers. Finally, a linear classifier is used to classify the encoded input based on z^L_cls ∈ R^d, if it was prepended to the input, or a global average pooling of all the tokens, z^L, otherwise.

As the transformer [68], which forms the basis of ViT [18], is a flexible architecture that can operate on any sequence of input tokens z ∈ R^{N×d}, we describe strategies for tokenising videos next.
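To make Eqs. 1-3 concrete, the following is a minimal NumPy sketch of one pre-norm transformer layer of the kind ViT stacks; the single attention head, the absence of learned LayerNorm parameters, and the helper names (layer_norm, self_attention, transformer_layer) are simplifying assumptions for illustration, not the released implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each token over its feature dimension (learned scale/shift omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head self-attention over all N tokens (scaled dot-product, cf. Eq. 7).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v

def transformer_layer(z, params):
    # Pre-norm residual blocks of Eqs. 2 and 3.
    y = self_attention(layer_norm(z), params["Wq"], params["Wk"], params["Wv"]) + z  # Eq. 2
    mlp = np.maximum(layer_norm(y) @ params["W1"], 0.0) @ params["W2"]               # ReLU stands in for GELU
    return mlp + y                                                                   # Eq. 3

# Toy usage: N tokens of dimensionality d, including a prepended CLS token (Eq. 1).
N, d = 17, 64
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.02, size=(d, d)) for k in ["Wq", "Wk", "Wv"]}
params["W1"] = rng.normal(scale=0.02, size=(d, 4 * d))
params["W2"] = rng.normal(scale=0.02, size=(4 * d, d))
z = rng.normal(size=(N, d))   # stands in for patch tokens plus positional embedding (Eq. 1)
print(transformer_layer(z, params).shape)   # (17, 64)
```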
3.2. Embedding video clips

We consider two simple methods for mapping a video V ∈ R^{T×H×W×C} to a sequence of tokens z̃ ∈ R^{n_t×n_h×n_w×d}. We then add the positional embedding and reshape into R^{N×d} to obtain z, the input to the transformer.

Uniform frame sampling. As illustrated in Fig. 2, a straightforward method of tokenising the input video is to uniformly sample n_t frames from the input video clip, embed each 2D frame independently using the same method as ViT [18], and concatenate all these tokens together. Concretely, if n_h · n_w non-overlapping image patches are extracted from each frame, as in [18], then a total of n_t · n_h · n_w tokens will be forwarded through the transformer encoder. Intuitively, this process may be seen as simply constructing a large 2D image to be tokenised following ViT. We note that this is the input embedding method employed by the concurrent work of [4].

Figure 2: Uniform frame sampling: We simply sample n_t frames, and embed each 2D frame independently following ViT [18].

Tubelet embedding. An alternate method, as shown in Fig. 3, is to extract non-overlapping, spatio-temporal “tubes” from the input volume, and to linearly project these to R^d. This method is an extension of ViT's embedding to 3D, and corresponds to a 3D convolution. For a tubelet of dimension t × h × w, we obtain n_t = ⌊T/t⌋, n_h = ⌊H/h⌋ and n_w = ⌊W/w⌋ tokens from the temporal, height, and width dimensions respectively. Smaller tubelet dimensions thus result in more tokens, which increases the computation. Intuitively, this method fuses spatio-temporal information during tokenisation, in contrast to “uniform frame sampling”, where temporal information from different frames is fused by the transformer.

Figure 3: Tubelet embedding. We extract and linearly embed non-overlapping tubelets that span the spatio-temporal input volume.
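As a rough illustration of the two tokenisation schemes, the sketch below (a simplified assumption, not the released code) builds both token grids with NumPy. The tubelet case is written as a reshape followed by a linear projection, which is equivalent to a non-overlapping 3D convolution; the tensor sizes correspond to the ViViT-B/16x2 setting used later.

```python
import numpy as np

T, H, W, C = 32, 224, 224, 3          # input clip
t, h, w, d = 2, 16, 16, 768           # tubelet size and token dimensionality
video = np.random.rand(T, H, W, C).astype(np.float32)

# Uniform frame sampling: keep n_t frames, patchify each frame independently.
n_t, n_h, n_w = T // t, H // h, W // w
frames = video[::t][:n_t]                                    # (n_t, H, W, C)
patches = frames.reshape(n_t, n_h, h, n_w, w, C)             # cut each frame into h x w patches
patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(n_t * n_h * n_w, h * w * C)
E_2d = np.random.randn(h * w * C, d).astype(np.float32) * 0.02
tokens_2d = patches @ E_2d                                   # (n_t * n_h * n_w, d)

# Tubelet embedding: cut the whole volume into t x h x w tubes and project each to R^d.
tubes = video.reshape(n_t, t, n_h, h, n_w, w, C)
tubes = tubes.transpose(0, 2, 4, 1, 3, 5, 6).reshape(n_t * n_h * n_w, t * h * w * C)
E_3d = np.random.randn(t * h * w * C, d).astype(np.float32) * 0.02
tokens_3d = tubes @ E_3d                                     # same token count, but time is fused at tokenisation

print(tokens_2d.shape, tokens_3d.shape)                      # (3136, 768) (3136, 768)
```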
3.3. Transformer Models for Video

As illustrated in Fig. 1, we propose multiple transformer-based architectures. We begin with a straightforward extension of ViT [18] that models pairwise interactions between all spatio-temporal tokens, and then develop more efficient variants which factorise the spatial and temporal dimensions of the input video at various levels of the transformer architecture.

Model 1: Spatio-temporal attention. This model simply forwards all spatio-temporal tokens extracted from the video, z^0, through the transformer encoder. We note that this has also been explored concurrently by [4] in their “Joint Space-Time” model. In contrast to CNN architectures, where the receptive field grows linearly with the number of layers, each transformer layer models all pairwise interactions between all spatio-temporal tokens, and it thus models long-range interactions across the video from the first layer. However, as it models all pairwise interactions, Multi-Headed Self-Attention (MSA) [68] has quadratic complexity with respect to the number of tokens. This complexity is pertinent for video, as the number of tokens increases linearly with the number of input frames, and motivates the development of more efficient architectures next.

Model 2: Factorised encoder. As shown in Fig. 4, this model consists of two separate transformer encoders. The first, spatial encoder only models interactions between tokens extracted from the same temporal index. A representation for each temporal index, h_i ∈ R^d, is obtained after L_s layers: this is the encoded classification token, z^{L_s}_cls, if it was prepended to the input (Eq. 1), or a global average pooling from the tokens output by the spatial encoder, z^{L_s}, otherwise. The frame-level representations, h_i, are concatenated into H ∈ R^{n_t×d}, and then forwarded through a temporal encoder consisting of L_t transformer layers to model interactions between tokens from different temporal indices. The output token of this encoder is then finally classified.

Figure 4: Factorised encoder (Model 2). This model consists of two transformer encoders in series: the first models interactions between tokens extracted from the same temporal index to produce a latent representation per time-index. The second transformer models interactions between time steps. It thus corresponds to a “late fusion” of spatial and temporal information.

This architecture corresponds to a “late fusion” [34, 56, 72, 46] of temporal information, and the initial spatial encoder is identical to the one used for image classification. It is thus analogous to CNN architectures such as [24, 34, 72, 86] which first extract per-frame features, and then aggregate them into a final representation before classifying them. Although this model has more transformer layers than Model 1 (and thus more parameters), it requires fewer floating point operations (FLOPs), as the two separate transformer blocks have a complexity of O((n_h · n_w)^2 + n_t^2), compared to O((n_t · n_h · n_w)^2) for Model 1.
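The gap between these two complexities is easy to quantify. The short calculation below is our own illustration, with token counts taken from the ViViT-B/16x2 setting used later in the paper; it compares the number of pairwise attention interactions per layer.

```python
# Pairwise attention interactions per layer for a 32-frame clip at 224x224
# with 16x16x2 tubelets, i.e. n_t = 16 and n_h = n_w = 14.
n_t, n_h, n_w = 16, 14, 14
spatial_tokens = n_h * n_w                      # 196 tokens per temporal index

joint = (n_t * spatial_tokens) ** 2             # Model 1: O((n_t * n_h * n_w)^2)
factorised = spatial_tokens ** 2 + n_t ** 2     # Model 2: O((n_h * n_w)^2 + n_t^2)

print(joint, factorised, joint / factorised)    # 9834496 38672 -> roughly 254x fewer interactions
```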
Model 3: Factorised self-attention. This model, in contrast, contains the same number of transformer layers as Model 1. However, instead of computing multi-headed self-attention across all pairs of tokens, z^ℓ, at layer ℓ, we factorise the operation to first only compute self-attention spatially (among all tokens extracted from the same temporal index), and then temporally (among all tokens extracted from the same spatial index), as shown in Fig. 5. Each self-attention block in the transformer thus models spatio-temporal interactions, but does so more efficiently than Model 1 by factorising the operation over two smaller sets of elements, thus achieving the same computational complexity as Model 2. We note that factorising attention over input dimensions has also been explored in [29, 78], and concurrently in the context of video by [4] in their “Divided Space-Time” model.

Figure 5: Factorised self-attention (Model 3). Within each transformer block, the multi-headed self-attention operation is factorised into two operations (indicated by striped boxes) that first only compute self-attention spatially, and then temporally.

This operation can be performed efficiently by reshaping the tokens z from R^{1×n_t·n_h·n_w·d} to R^{n_t×n_h·n_w·d} (denoted by z_s) to compute spatial self-attention. Similarly, the input to temporal self-attention, z_t, is reshaped to R^{n_h·n_w×n_t·d}. Here we assume the leading dimension is the “batch dimension”. Our factorised self-attention is defined as

y^ℓ_s = MSA(LN(z^ℓ_s)) + z^ℓ_s    (4)
y^ℓ_t = MSA(LN(y^ℓ_s)) + y^ℓ_s    (5)
z^{ℓ+1} = MLP(LN(y^ℓ_t)) + y^ℓ_t.    (6)

We observed that the order of spatial-then-temporal self-attention or temporal-then-spatial self-attention does not make a difference, provided that the model parameters are initialised as described in Sec. 3.4. Note that the number of parameters, however, increases compared to Model 1, as there is an additional self-attention layer (cf. Eq. 7). We do not use a classification token in this model, to avoid ambiguities when reshaping the input tokens between spatial and temporal dimensions.
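A minimal sketch of the reshaping described above (our own NumPy illustration; the helper `attend` stands in for a full MSA block and uses identity projections purely to show the tensor shapes):

```python
import numpy as np

def attend(x):
    # Stand-in for a multi-headed self-attention block: attention over the tokens on axis -2,
    # with the leading axis acting as the batch dimension.
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

n_t, n_h, n_w, d = 16, 14, 14, 768
z = np.random.rand(n_t * n_h * n_w, d).astype(np.float32)

# Spatial self-attention (Eq. 4): n_t is the batch dimension, attention over n_h*n_w tokens.
z_s = z.reshape(n_t, n_h * n_w, d)
y_s = attend(z_s) + z_s

# Temporal self-attention (Eq. 5): n_h*n_w is the batch dimension, attention over n_t tokens.
z_t = y_s.transpose(1, 0, 2)                                  # (n_h*n_w, n_t, d)
y_t = attend(z_t) + z_t

z_next = y_t.transpose(1, 0, 2).reshape(n_t * n_h * n_w, d)   # back to a flat token sequence
print(z_next.shape)                                           # (3136, 768)
```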
Model 4: Factorised dot-product attention. Finally, we develop a model which has the same computational complexity as Models 2 and 3, while retaining the same number of parameters as the unfactorised Model 1. The factorisation of spatial and temporal dimensions is similar in spirit to Model 3, but we factorise the multi-head dot-product attention operation instead (Fig. 6). Concretely, we compute attention weights for each token separately over the spatial and temporal dimensions using different heads. First, we note that the attention operation for each head is defined as

Attention(Q, K, V) = Softmax(QK^T / √d_k) V.    (7)

In self-attention, the queries Q = XW_q, keys K = XW_k, and values V = XW_v are linear projections of the input X, with X, Q, K, V ∈ R^{N×d}. Note that in the unfactorised case (Model 1), the spatial and temporal dimensions are merged as N = n_t · n_h · n_w.

The main idea here is to modify the keys and values for each query to only attend over tokens from the same spatial or temporal index by constructing K_s, V_s ∈ R^{n_h·n_w×d} and K_t, V_t ∈ R^{n_t×d}, namely the keys and values corresponding to these dimensions. Then, for half of the attention heads, we attend over tokens from the spatial dimension by computing Y_s = Attention(Q, K_s, V_s), and for the rest we attend over the temporal dimension by computing Y_t = Attention(Q, K_t, V_t). Given that we are only changing the attention neighbourhood for each query, the attention operation has the same dimension as in the unfactorised case, namely Y_s, Y_t ∈ R^{N×d}. We then combine the outputs of multiple heads by concatenating them and using a linear projection [68], Y = Concat(Y_s, Y_t) W_O.

Figure 6: Factorised dot-product attention (Model 4). For half of the heads, we compute dot-product attention over only the spatial axes, and for the other half, over only the temporal axis.
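The sketch below illustrates this head-splitting with one “spatial” and one “temporal” head in NumPy. It is our own toy illustration under stated assumptions: Q, K and V are random rather than projections of real tokens, and the output projection W_O is an arbitrary placeholder.

```python
import numpy as np

def attention(q, k, v):
    # Eq. 7, batched over a leading axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

n_t, n_h, n_w, d_head = 16, 14, 14, 64
n_s = n_h * n_w
N = n_t * n_s
rng = np.random.default_rng(0)

# One spatial head and one temporal head; Q, K, V would normally be projections of the input.
Q, K, V = (rng.standard_normal((N, d_head)) for _ in range(3))

# Spatial head: each query attends only to the n_h*n_w tokens sharing its temporal index.
Ys = attention(Q.reshape(n_t, n_s, d_head),
               K.reshape(n_t, n_s, d_head),
               V.reshape(n_t, n_s, d_head)).reshape(N, d_head)

# Temporal head: each query attends only to the n_t tokens sharing its spatial index.
to_t = lambda x: x.reshape(n_t, n_s, d_head).transpose(1, 0, 2)    # (n_s, n_t, d_head)
Yt = attention(to_t(Q), to_t(K), to_t(V)).transpose(1, 0, 2).reshape(N, d_head)

# Concatenate head outputs and project: Y = Concat(Ys, Yt) W_O.
W_O = rng.standard_normal((2 * d_head, 2 * d_head)) * 0.02
Y = np.concatenate([Ys, Yt], axis=-1) @ W_O
print(Y.shape)   # (3136, 128)
```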

3.4. Initialisation by leveraging pretrained models

ViT [18] has been shown to only be effective when trained on large-scale datasets, as transformers lack some of the inductive biases of convolutional networks [18]. However, even the largest video datasets, such as Kinetics [35], have several orders of magnitude fewer labelled examples than their image counterparts [16, 39, 58]. As a result, training large models from scratch to high accuracy is extremely challenging. To sidestep this issue, and to enable more efficient training, we initialise our video models from pretrained image models. However, this raises several practical questions, specifically on how to initialise parameters not present in or incompatible with image models. We now discuss several effective strategies to initialise these large-scale video classification models.

Positional embeddings. A positional embedding p is added to each input token (Eq. 1). However, our video models have n_t times more tokens than the pretrained image model. As a result, we initialise the positional embeddings by “repeating” them temporally from R^{n_w·n_h×d} to R^{n_t·n_h·n_w×d}. Therefore, at initialisation, all tokens with the same spatial index have the same embedding, which is then fine-tuned.

Embedding weights, E. When using the “tubelet embedding” tokenisation method (Sec. 3.2), the embedding filter E is a 3D tensor, compared to the 2D tensor E_image in the pretrained model. A common approach for initialising 3D convolutional filters from 2D filters for video classification is to “inflate” them by replicating the filters along the temporal dimension and averaging them [8, 22], as

E = (1/t) [E_image, . . . , E_image, . . . , E_image].    (8)

We consider an additional strategy, which we denote as “central frame initialisation”, where E is initialised with zeroes along all temporal positions, except at the centre ⌊t/2⌋,

E = [0, . . . , E_image, . . . , 0].    (9)

Therefore, the 3D convolutional filter effectively behaves like “uniform frame sampling” (Sec. 3.2) at initialisation, while also enabling the model to learn to aggregate temporal information from multiple frames as training progresses.

Transformer weights for Model 3. The transformer block in Model 3 (Fig. 5) differs from the pretrained ViT model [18] in that it contains two multi-headed self-attention (MSA) modules. In this case, we initialise the spatial MSA module from the pretrained module, and initialise all weights of the temporal MSA with zeroes, such that Eq. 5 behaves as a residual connection [27] at initialisation.
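The filter initialisations of Eqs. 8 and 9 and the temporal repetition of the positional embeddings amount to a few lines of array manipulation. The sketch below is our own NumPy illustration of these rules, not code from the released models; the tensor shapes are assumptions chosen to match the ViViT-B/16x2 setting.

```python
import numpy as np

t, h, w, c, d = 2, 16, 16, 3, 768
n_t, n_spatial = 16, 14 * 14

# Pretrained 2D patch-embedding filter and positional embedding from an image model.
E_image = np.random.randn(h, w, c, d).astype(np.float32) * 0.02
p_image = np.random.randn(n_spatial, d).astype(np.float32) * 0.02

# "Filter inflation" (Eq. 8): replicate along time and average.
E_inflate = np.repeat(E_image[None], t, axis=0) / t            # (t, h, w, c, d)

# "Central frame initialisation" (Eq. 9): zeros everywhere except the centre frame.
E_central = np.zeros((t, h, w, c, d), dtype=np.float32)
E_central[t // 2] = E_image

# Positional embeddings: repeat temporally so that, at initialisation,
# all tokens sharing a spatial index share the same embedding.
p_video = np.tile(p_image, (n_t, 1))                           # (n_t * n_spatial, d)

print(E_inflate.shape, E_central.shape, p_video.shape)
```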
4. Empirical evaluation

We first present our experimental setup and implementation details in Sec. 4.1, before ablating various components of our model in Sec. 4.2. We then present state-of-the-art results on five datasets in Sec. 4.3.

4.1. Experimental Setup

Network architecture and training. Our backbone architecture follows that of ViT [18] and BERT [17]. We consider ViT-Base (ViT-B, L=12, N_H=12, d=768), ViT-Large (ViT-L, L=24, N_H=16, d=1024), and ViT-Huge (ViT-H, L=32, N_H=16, d=1280), where L is the number of transformer layers, each with a self-attention block of N_H heads and hidden dimension d. We also apply the same naming scheme to our models (e.g., ViViT-B/16x2 denotes a ViT-Base backbone with a tubelet size of h×w×t = 16×16×2). In all experiments, the tubelet height and width are equal. Note that smaller tubelet sizes correspond to more tokens at the input, and thus more computation.

We train our models using synchronous SGD and momentum, a cosine learning rate schedule and TPU-v3 accelerators. We initialise our models from a ViT image model trained either on ImageNet-21K [16] (unless otherwise specified) or the larger JFT [58] dataset. We implement our method using the Scenic library [15] and have released our code and models.

Table 1: Comparison of input encoding methods using ViViT-B and spatio-temporal attention on Kinetics. Further details in text.

Method                          Top-1 accuracy
Uniform frame sampling          78.5
Tubelet embedding
  Random initialisation [25]    73.2
  Filter inflation [8]          77.6
  Central frame                 79.2

Table 2: Comparison of model architectures using ViViT-B as the backbone, and tubelet size of 16 × 2. We report Top-1 accuracy on Kinetics 400 (K400) and action accuracy on Epic Kitchens (EK). Runtime is during inference on a TPU-v3.

Model                           K400   EK     FLOPs (×10^9)   Params (×10^6)   Runtime (ms)
Model 1: Spatio-temporal        80.0   43.1   455.2           88.9             58.9
Model 2: Fact. encoder          78.8   43.7   284.4           115.1            17.4
Model 3: Fact. self-attention   77.4   39.1   372.3           117.3            31.7
Model 4: Fact. dot product      76.3   39.5   277.1           88.9             22.9
Model 2: Ave. pool baseline     75.8   38.8   283.9           86.7             17.3

Table 3: The effect of varying the number of temporal transformers, L_t, in the Factorised encoder model (Model 2). We report the Top-1 accuracy on Kinetics 400. Note that L_t = 0 corresponds to the “average pooling baseline”.

L_t     0      1      4      8      12
Top-1   75.8   78.6   78.8   78.8   78.9

Datasets. We evaluate the performance of our proposed models on a diverse set of video classification datasets:

Kinetics [35] consists of 10-second videos sampled at 25fps from YouTube. We evaluate on both Kinetics 400 and 600, containing 400 and 600 classes respectively. As these are dynamic datasets (videos may be removed from YouTube), we note that our dataset sizes are approximately 267 000 and 446 000 videos respectively.

Epic Kitchens-100 consists of egocentric videos capturing daily kitchen activities, spanning 100 hours and 90 000 clips [13]. We report results following the standard “action recognition” protocol. Here, each video is labelled with a “verb” and a “noun”, and we therefore predict both categories using a single network with two “heads”. The top-scoring verb and noun pair predicted by the network forms an “action”, and action accuracy is the primary metric.

Moments in Time [45] consists of 800 000, 3-second YouTube clips that capture the gist of a dynamic scene involving animals, objects, people, or natural phenomena.

Something-Something v2 (SSv2) [26] contains 220 000 videos, with durations ranging from 2 to 6 seconds. In contrast to the other datasets, the objects and backgrounds in the videos are consistent across different action classes, and this dataset thus places more emphasis on a model's ability to recognise fine-grained motion cues.

Inference. The input to our network is a video clip of 32 frames using a stride of 2, unless otherwise mentioned, similar to [21, 20]. Following common practice, at inference time, we process multiple views of a longer video and average per-view logits to obtain the final result. Unless otherwise specified, we use a total of 4 views per video (as this is sufficient to “see” the entire video clip across the various datasets), and ablate these and other design choices next.
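The multi-view evaluation protocol described above is simple to sketch. In the illustration below, `predict_clip` is a hypothetical stand-in for the model forward pass (it returns random logits), and the view-sampling logic follows the description of equidistant temporal views with a frame stride of 2; it is not the evaluation code used in the paper.

```python
import numpy as np

def predict_clip(clip):
    # Hypothetical stand-in for the ViViT forward pass: per-class logits for one 32-frame view.
    return np.random.randn(400)

def multi_view_inference(video, num_views=4, frames_per_view=32, stride=2):
    # Sample equidistant temporal views (each covering frames_per_view frames taken with a stride)
    # and average the per-view logits.
    span = frames_per_view * stride
    starts = np.linspace(0, max(len(video) - span, 0), num_views).astype(int)
    logits = [predict_clip(video[s:s + span:stride]) for s in starts]
    return np.mean(logits, axis=0)

video = np.random.rand(250, 224, 224, 3)        # a 10-second Kinetics clip at 25 fps
print(multi_view_inference(video).argmax())     # predicted class after averaging 4 views
```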
4.2. Ablation study

Input encoding. We first consider the effect of different input encoding methods (Sec. 3.2) using our unfactorised model (Model 1) and ViViT-B on Kinetics 400. As we pass 32-frame inputs to the network, sampling 8 frames and extracting tubelets of length t = 4 correspond to the same number of tokens in both cases. Table 1 shows that tubelet embedding initialised using the “central frame” method (Eq. 9) performs well, outperforming the commonly-used “filter inflation” initialisation method [8, 22] by 1.6%, and “uniform frame sampling” by 0.7%. We therefore use this encoding method for all subsequent experiments.

Model variants. We compare our proposed model variants (Sec. 3.3) across the Kinetics 400 and Epic Kitchens datasets, both in terms of accuracy and efficiency, in Tab. 2. In all cases, we use the “Base” backbone and a tubelet size of 16 × 2. Model 2 (“Factorised encoder”) has an additional hyperparameter, the number of temporal transformers, L_t. We set L_t = 4 for all experiments and show in Tab. 3 that the model is not sensitive to this choice.

The unfactorised model (Model 1) performs the best on Kinetics 400. However, it can also overfit on smaller datasets such as Epic Kitchens, where we find our “Factorised encoder” (Model 2) to perform the best. We also consider an additional baseline (last row), based on Model 2, where we do not use any temporal transformer, and simply average pool the frame-level representations from the spatial encoder before classifying. This average pooling baseline performs the worst, and has a larger accuracy drop on Epic Kitchens, suggesting that this dataset requires more detailed modelling of temporal relations.

As described in Sec. 3.3, all factorised variants of our model use significantly fewer FLOPs than the unfactorised Model 1, as the attention is computed separately over the spatial and temporal dimensions. Model 4 adds no additional parameters to the unfactorised Model 1, and uses the least compute. The temporal transformer encoder in Model 2 operates on only n_t tokens, which is why there is barely a change in compute and runtime over the average pooling baseline, even though it improves the accuracy substantially (3% on Kinetics and 4.9% on Epic Kitchens). Finally, Model 3 requires more compute and parameters than the other factorised models, as its additional self-attention block means that it performs another query-, key-, value- and output-projection in each transformer layer [68].
Table 4: The effect of progressively adding regularisation (each row includes all methods above it) on Top-1 action accuracy on Epic Kitchens. We use a Factorised encoder model with tubelet size 16 × 2.

Regularisation                      Top-1 accuracy
Random crop, flip, colour jitter    38.4
+ Kinetics 400 initialisation       39.6
+ Stochastic depth [31]             40.2
+ Random augment [12]               41.1
+ Label smoothing [61]              43.1
+ Mixup [82]                        43.7

Table 5: The effect of spatial resolution on the performance of ViViT-L/16x2 and spatio-temporal attention on Kinetics 400.

Crop size   224    288    320
Accuracy    80.3   80.7   81.0
GFLOPs      1446   2919   3992
Runtime     58.9   147.6  238.8

Figure 7: The effect of the backbone architecture on (a) accuracy and (b) computation on Kinetics 400, for the spatio-temporal attention model (Model 1). [Plots compare ViViT-B and ViViT-L in Top-1 accuracy and TFLOPs across input tubelet sizes 16x8, 16x4 and 16x2.]

Figure 8: The effect of varying the number of temporal tokens on (a) accuracy and (b) computation on Kinetics 400, for different variants of our model with a ViViT-B backbone. [Plots compare the spatio-temporal, factorised encoder, factorised self-attention and factorised dot-product variants in Top-1 accuracy and TFLOPs across input tubelet sizes 16x8, 16x4 and 16x2.]

Model regularisation. Pure-transformer architectures such as ViT [18] are known to require large training datasets, and we observed overfitting on smaller datasets like Epic Kitchens and SSv2, even when using an ImageNet-pretrained model. In order to effectively train our models on such datasets, we employed several regularisation strategies that we ablate using our “Factorised encoder” model in Tab. 4. We note that these regularisers were originally proposed for training CNNs, and that [64] have recently explored them for training ViT for image classification.

Each row of Tab. 4 includes all the methods from the rows above it, and we observe progressive improvements from adding each regulariser. Overall, we obtain a substantial improvement of 5.3% on Epic Kitchens. We also achieve a similar improvement of 5% on SSv2 by using all the regularisation in Tab. 4. Note that the Kinetics-pretrained models that we initialise from are from Tab. 2, and that all Epic Kitchens models in Tab. 2 were trained with all the regularisers in Tab. 4. For larger datasets like Kinetics and Moments in Time, we do not use these additional regularisers (we use only the first row of Tab. 4), as we obtain state-of-the-art results without them. The appendix contains hyperparameter values and additional details for all regularisers.

Varying the backbone. Figure 7 compares the ViViT-B and ViViT-L backbones for the unfactorised spatio-temporal model. We observe consistent improvements in accuracy as the backbone capacity increases. As expected, the compute also grows as a function of the backbone size.

Varying the number of tokens. We first analyse the performance as a function of the number of tokens along the temporal dimension in Fig. 8. We observe that using smaller input tubelet sizes (and therefore more tokens) leads to consistent accuracy improvements across all of our model architectures. At the same time, computation in terms of FLOPs increases accordingly, and the unfactorised model (Model 1) is impacted the most.

We then vary the number of tokens fed into the model by increasing the spatial crop size from the default of 224 to 320 in Tab. 5. As expected, there is a consistent increase in both accuracy and computation. We note that when comparing to prior work we consistently obtain state-of-the-art results (Sec. 4.3) using a spatial resolution of 224, but we also highlight that further improvements can be obtained at higher spatial resolutions.

Varying the number of input frames. In our experiments so far, we have kept the number of input frames fixed at 32. We now increase the number of frames input to the model, thereby increasing the number of tokens proportionally.
Table 6: Comparisons to state-of-the-art across multiple datasets. For “views”, x × y denotes x temporal crops and y spatial crops. We report the TFLOPs to process all spatio-temporal views. “FE” denotes our Factorised Encoder model.

(a) Kinetics 400

Method                       Top 1   Top 5   Views    TFLOPs
blVNet [19]                  73.5    91.2    –        –
STM [33]                     73.7    91.6    –        –
TEA [42]                     76.1    92.5    10 × 3   2.10
TSM-ResNeXt-101 [43]         76.3    –       –        –
I3D NL [75]                  77.7    93.3    10 × 3   10.77
CorrNet-101 [70]             79.2    –       10 × 3   6.72
ip-CSN-152 [66]              79.2    93.8    10 × 3   3.27
LGD-3D R101 [51]             79.4    94.4    –        –
SlowFast R101-NL [21]        79.8    93.9    10 × 3   7.02
X3D-XXL [20]                 80.4    94.6    10 × 3   5.82
TimeSformer-L [4]            80.7    94.7    1 × 3    7.14
ViViT-L/16x2 FE              80.6    92.7    1 × 1    3.98
ViViT-L/16x2 FE              81.7    93.8    1 × 3    11.94
Methods with large-scale pretraining
ip-CSN-152 [66] (IG [44])    82.5    95.3    10 × 3   3.27
ViViT-L/16x2 FE (JFT)        83.5    94.3    1 × 3    11.94
ViViT-H/14x2 (JFT)           84.9    95.8    4 × 3    47.77

(b) Kinetics 600

Method                  Top 1   Top 5
AttentionNAS [76]       79.8    94.4
LGD-3D R101 [51]        81.5    95.6
SlowFast R101-NL [21]   81.8    95.1
X3D-XL [20]             81.9    95.5
TimeSformer-L [4]       82.2    95.6
ViViT-L/16x2 FE         82.9    94.6
ViViT-L/16x2 FE (JFT)   84.3    94.9
ViViT-H/14x2 (JFT)      85.8    96.5

(c) Moments in Time

Method                 Top 1   Top 5
TSN [72]               25.3    50.1
TRN [86]               28.3    53.4
I3D [8]                29.5    56.1
blVNet [19]            31.4    59.3
AssembleNet-101 [54]   34.3    62.7
ViViT-L/16x2 FE        38.5    64.1

(d) Epic Kitchens 100 (Top-1 accuracy)

Method            Action   Verb   Noun
TSN [72]          33.2     60.2   46.0
TRN [86]          35.3     65.9   45.4
TBN [36]          36.7     66.0   47.2
TSM [43]          38.3     67.9   49.0
SlowFast [21]     38.5     65.6   50.0
ViViT-L/16x2 FE   44.0     66.4   56.8

(e) Something-Something v2

Method               Top 1   Top 5
TRN [86]             48.8    77.6
SlowFast [20, 80]    61.7    –
TimeSformer-HR [4]   62.5    –
TSM [43]             63.4    88.5
STM [33]             64.2    89.8
TEA [42]             65.1    –
blVNet [19]          65.2    90.3
ViViT-L/16x2 FE      65.9    89.9

Figure 9: The effect of varying the number of frames input to the network and increasing the number of tokens proportionally. We use ViViT-L/16x2 Factorised Encoder on Kinetics 400. A Kinetics video contains 250 frames (10 seconds sampled at 25 fps), and the accuracy for each model saturates once the number of equidistant temporal views is sufficient to “see” the whole video clip. Observe how models processing more frames (and thus more tokens) achieve higher single- and multi-view accuracy. [The plot shows Top-1 accuracy against the number of views for models processing 32, 64 and 128 frames, each with a frame stride of 2.]

Figure 9 shows that as we increase the number of frames input to the network, the accuracy from processing a single view increases, since the network incorporates longer temporal context. However, common practice on datasets such as Kinetics [21, 75, 42] is to average results over multiple, shorter “views” of the same video clip. Figure 9 also shows that the accuracy saturates once the number of views is sufficient to cover the whole video. As a Kinetics video consists of 250 frames, and we sample frames with a stride of 2, our model which processes 128 frames requires just a single view to “see” the whole video and achieve its maximum accuracy.

Note that we used ViViT-L/16x2 Factorised Encoder (Model 2) here. As this model is more efficient, it can process more tokens, compared to the unfactorised Model 1, which runs out of memory after 48 frames using tubelet length t = 2 and a “Large” backbone. Models processing more frames (and thus more tokens) consistently achieve higher single- and multi-view accuracy, in line with our observations in previous experiments (Tab. 5, Fig. 8). Moreover, observe that by processing more frames (and thus more tokens) with Model 2, we are able to achieve higher accuracy than Model 1 (with fewer total FLOPs as well).

Finally, we observed that for Model 2, the number of FLOPs effectively increases linearly with the number of input frames, as the overall computation is dominated by the initial spatial transformer. As a result, the total number of FLOPs for the number of temporal views required to achieve maximum accuracy is constant across the models. In other words, ViViT-L/16x2 FE with 32 frames requires 995.3 GFLOPs per view, and 4 views to saturate multi-view accuracy. The 128-frame model requires 3980.4 GFLOPs but only a single view. As shown by Fig. 9, the latter model achieves the highest accuracy.

4.3. Comparison to state-of-the-art

Based on our ablation studies in the previous section, we compare to the current state-of-the-art using two of our model variants. We primarily use our Factorised Encoder model (Model 2), as it can process more tokens than Model 1 to achieve higher accuracy.

Kinetics. Tables 6a and 6b show that our spatio-temporal attention models outperform the state-of-the-art on Kinetics 400 and 600 respectively. Following standard practice, we take 3 spatial crops (left, centre and right) [21, 20, 66, 75] for each temporal view, and notably, we require significantly fewer views than previous CNN-based methods.

We surpass the previous CNN-based state-of-the-art using ViViT-L/16x2 Factorised Encoder (FE) pretrained on ImageNet, and also outperform [4], who concurrently proposed a pure-transformer architecture.
Moreover, by initialising our backbones from models pretrained on the larger JFT dataset [58], we obtain further improvements. Although these models are not directly comparable to previous work, we do also outperform [66], who pretrained on the large-scale Instagram dataset [44]. Our best model uses a ViViT-H backbone pretrained on JFT and significantly advances the best reported results on Kinetics 400 and 600 to 84.9% and 85.8%, respectively.

Moments in Time. We surpass the state-of-the-art by a significant margin as shown in Tab. 6c. We note that the videos in this dataset are diverse and contain significant label noise, making this task challenging and leading to lower accuracies than on other datasets.

Epic Kitchens 100. Table 6d shows that our Factorised Encoder model outperforms previous methods by a significant margin. In addition, our model obtains substantial improvements for Top-1 accuracy of “noun” classes, and the only method which achieves higher “verb” accuracy used optical flow as an additional input modality [43, 50]. Furthermore, all variants of our model presented in Tab. 2 outperformed the existing state-of-the-art on action accuracy. We note that we use the same model to predict verbs and nouns using two separate “heads”, and for simplicity, we do not use separate loss weights for each head.

Something-Something v2 (SSv2). Finally, Tab. 6e shows that we achieve state-of-the-art Top-1 accuracy with our Factorised encoder model (Model 2), albeit with a smaller margin compared to previous methods. Notably, our Factorised encoder model significantly outperforms the concurrent TimeSformer [4] method by 2.9%, which also proposes a pure-transformer model, but does not consider our Factorised encoder variant or our additional regularisation.

SSv2 differs from other datasets in that the backgrounds and objects are quite similar across different classes, meaning that recognising fine-grained motion patterns is necessary to distinguish classes from each other. Our results suggest that capturing these fine-grained motions is an area of improvement and future work for our model. We also note an inverse correlation between the relative performance of previous methods on SSv2 (Tab. 6e) and Kinetics (Tab. 6a), suggesting that these two datasets evaluate complementary characteristics of a model.

5. Conclusion and Future Work

We have presented four pure-transformer models for video classification, with different accuracy and efficiency profiles, achieving state-of-the-art results across five popular datasets. Furthermore, we have shown how to effectively regularise such high-capacity models for training on smaller datasets and thoroughly ablated our main design choices. Future work is to remove our dependence on image-pretrained models. Finally, going beyond video classification towards more complex tasks is a clear next step.

References

[1] Anurag Arnab, Chen Sun, and Cordelia Schmid. Unified graph structured models for video understanding. In ICCV, 2021.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In arXiv preprint arXiv:1607.06450, 2016.
[3] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In ICCV, 2019.
[4] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In arXiv preprint arXiv:2102.05095, 2021.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, et al. Language models are few-shot learners. In NeurIPS, 2020.
[6] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In CVPR Workshops, 2019.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 2017.
[9] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A2-nets: Double attention networks. In NeurIPS, 2018.
[10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. In arXiv preprint arXiv:1904.10509, 2019.
[11] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In ICLR, 2021.
[12] Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In NeurIPS, 2020.
[13] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision. In arXiv preprint arXiv:2006.13256, 2020.
[14] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In ICLR, 2019.
[15] Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, and Yi Tay. Scenic: A JAX library for computer vision research and beyond. In arXiv preprint arXiv:2110.11403, 2021.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[18] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[19] Quanfu Fan, Chun-Fu Chen, Hilde Kuehne, Marco Pistoia, and David Cox. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In NeurIPS, 2019.
[20] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In CVPR, 2020.
[21] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019.
[22] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NeurIPS, 2016.
[23] Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, 2019.
[24] Rohit Girdhar and Deva Ramanan. Attentional pooling for action recognition. In NeurIPS, 2017.
[25] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[26] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[28] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). In arXiv preprint arXiv:1606.08415, 2016.
[29] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. In arXiv preprint arXiv:1912.12180, 2019.
[30] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[31] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[32] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[33] Boyuan Jiang, Mengmeng Wang, Weihao Gan, Wei Wu, and Junjie Yan. Stm: Spatiotemporal and motion encoding for action recognition. In ICCV, 2019.
[34] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[35] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. In arXiv preprint arXiv:1705.06950, 2017.
[36] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV, 2019.
[37] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In ICLR, 2020.
[38] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, volume 25, 2012.
[39] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[40] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020.
[41] Ivan Laptev. On space-time interest points. IJCV, 64(2-3), 2005.
[42] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. Tea: Temporal excitation and aggregation for action recognition. In CVPR, 2020.
[43] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In ICCV, 2019.
[44] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[45] Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. PAMI, 42(2):502–508, 2019.
[46] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In arXiv preprint arXiv:2102.00719, 2021.
[47] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[48] Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, and Jianfei Cai. Scalable visual transformers with hierarchical pooling. In arXiv preprint arXiv:2103.10619, 2021.
[49] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
[50] Will Price and Dima Damen. An evaluation of action recognition models on epic-kitchens. In arXiv preprint arXiv:1908.00867, 2019.
[51] Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, and Tao Mei. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.
[52] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[53] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. In NeurIPS, 2019.
[54] Michael S Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. Assemblenet: Searching for multi-stream neural connectivity in video architectures. In ICLR, 2020.
[55] Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, and Ching-Hui Chen. Global self-attention networks for image recognition. In arXiv preprint arXiv:2010.03019, 2021.
[56] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NeurIPS, 2014.
[57] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In CVPR, 2021.
[58] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[59] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, 2015.
[60] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[61] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[62] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In arXiv preprint arXiv:2011.04006, 2020.
[63] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. In arXiv preprint arXiv:2009.06732, 2020.
[64] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In arXiv preprint arXiv:2012.12877, 2020.
[65] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
[66] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019.
[67] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[69] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1), 2013.
[70] Heng Wang, Du Tran, Lorenzo Torresani, and Matt Feiszli. Video modeling with correlation networks. In CVPR, 2020.
[71] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In arXiv preprint arXiv:2012.00759, 2020.
[72] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[73] Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. In arXiv preprint arXiv:2006.04768, 2020.
[74] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In arXiv preprint arXiv:2102.12122, 2021.
[75] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[76] Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S Ryoo, Anelia Angelova, Kris M Kitani, and Wei Hua. Attentionnas: Spatiotemporal attention cell search for video classification. In ECCV, 2020.
[77] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In arXiv preprint arXiv:2011.14503, 2020.
[78] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In ICLR, 2020.
[79] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In CVPR, 2019.
[80] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. A multigrid method for efficiently training video models. In CVPR, 2020.
[81] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
[82] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[83] Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing networks. In CVPR, 2020.
[84] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. In arXiv preprint arXiv:2012.09164, 2020.
[85] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In arXiv preprint arXiv:2012.15840, 2020.
[86] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, 2018.
Appendix

A. Additional experimental details

In this appendix, we provide additional experimental details. Section A.1 provides additional details about the regularisers we used, and Sec. A.2 details the training hyperparameters used for our experiments.

A.1. Further details about regularisers

In this section, we provide additional details and list the hyperparameters of the additional regularisers that we employed in Tab. 4. Hyperparameter values for all our experiments are listed in Tab. 7.

Stochastic depth. Stochastic depth regularisation was originally proposed for training very deep residual networks [31]. Intuitively, the outputs of a layer, ℓ, are “dropped out” with probability p_drop(ℓ) during training, by setting the output of the layer to be equal to its input. Following [31], we linearly increase the probability of dropping a layer according to its depth within the network,

p_drop(ℓ) = (ℓ/L) p_drop,    (10)

where ℓ is the index of the layer in the network, and L is the total number of layers.

Random augment. Random augment [12] randomly applies data augmentation transformations sequentially to an input example. We follow the public implementation (https://github.com/tensorflow/models/blob/master/official/vision/beta/ops/augment.py), but modify the data augmentation operations to be temporally consistent throughout the video (in other words, the same transformation is applied to each frame of the video).

The authors define two hyperparameters for Random augment: “number of layers”, the number of augmentation transformations to apply sequentially to a video, and “magnitude”, the strength of the transformation that is shared across all augmentation operations. Our values for these parameters are shown in Tab. 7.

Label smoothing. Label smoothing was proposed by [61] originally to regularise training Inception-v3. Concretely, the label distribution used during training, ỹ, is a mixture of the one-hot ground-truth label, y, and a uniform distribution, u, to encourage the network to produce less confident predictions during training:

ỹ = (1 − λ)y + λu.    (11)

There is therefore one scalar hyperparameter, λ ∈ [0, 1].

Mixup. Mixup [82] constructs virtual training examples which are a convex combination of pairs of training examples and their labels. Concretely, given (x_i, y_i) and (x_j, y_j), where x_i denotes an input vector and y_i a one-hot input label, mixup constructs the virtual training example

x̃ = λx_i + (1 − λ)x_j
ỹ = λy_i + (1 − λ)y_j.    (12)

Here λ ∈ [0, 1] is sampled from a Beta distribution, Beta(α, α). Our choice of the hyperparameter α is detailed in Tab. 7.

A.2. Training hyperparameters

Table 7 details the hyperparameters for all of our experiments. We use synchronous SGD with momentum, a cosine learning rate schedule with linear warmup, and a batch size of 64 for all experiments. As aforementioned, we only employed additional regularisation when training on the smaller Epic Kitchens and Something-Something v2 datasets.
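The three regularisers with closed-form definitions above (Eqs. 10-12) can be sketched in a few lines; the snippet below is our own NumPy illustration of the formulas, not the training code.

```python
import numpy as np

def droplayer_rate(layer_index, num_layers, p_drop):
    # Eq. 10: the probability of dropping a layer grows linearly with its depth.
    return (layer_index / num_layers) * p_drop

def smooth_labels(y_onehot, lam):
    # Eq. 11: mix the one-hot label with a uniform distribution over the classes.
    num_classes = y_onehot.shape[-1]
    return (1.0 - lam) * y_onehot + lam / num_classes

def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    # Eq. 12: convex combination of two examples and their labels, with lambda ~ Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1.0 - lam) * x_j, lam * y_i + (1.0 - lam) * y_j

rng = np.random.default_rng(0)
print(droplayer_rate(6, 12, 0.2))                    # 0.1
print(smooth_labels(np.eye(400)[3], 0.2).max())      # 0.8005
x_mix, y_mix = mixup(np.ones(8), np.eye(400)[3], np.zeros(8), np.eye(400)[7], 0.1, rng)
print(x_mix.shape, y_mix.sum())                      # (8,) 1.0
```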
Table 7: Training hyperparameters for experiments in the main paper. “–” indicates that the regularisation method was not used at all. Values which are constant across all columns are listed once. Datasets are denoted as follows: K400: Kinetics 400. K600: Kinetics 600. MiT: Moments in Time. EK: Epic Kitchens. SSv2: Something-Something v2.
K400 K600 MiT EK SSv2
Optimisation
Optimiser Synchronous SGD
Momentum 0.9
Batch size 64
Learning rate schedule cosine with linear warmup
Linear warmup epochs 2.5
Base learning rate 0.1 0.1 0.25 0.5 0.5
Epochs 30 30 10 50 35
Data augmentation
Random crop probability 1.0
Random flip probability 0.5
Scale jitter probability 1.0
Maximum scale 1.33
Minimum scale 0.9
Colour jitter probability 0.8 0.8 0.8 – –
Rand augment number of layers [12] – – – 2 2
Rand augment magnitude [12] – – – 15 20
Other regularisation
Stochastic droplayer rate, p_drop [31] – – – 0.2 0.3
Label smoothing λ [61] – – – 0.2 0.3
Mixup α [82] – – – 0.1 0.3
