

Streaming Video Model

Yucheng Zhao1*, Chong Luo2, Chuanxin Tang2, Dongdong Chen3, Noel Codella3, Zheng-Jun Zha1†
1 University of Science and Technology of China   2 Microsoft Research Asia   3 Microsoft Cloud + AI
{lnc@mail., zhazj}@ustc.edu.cn   {cluo, chutan, dochen, ncodella}@microsoft.com

Figure 1. Illustration of the proposed streaming video model with a comparison to the conventional frame-based and clip-based architectures. (a) The two-stage streaming video model gracefully serves different types of video tasks through a unified architecture. The output of the temporal-aware (T-aware) spatial encoder serves frame-based tasks, such as MOT, while the output of the temporal decoder serves sequence-based tasks, such as action recognition. (b) The frame-based architecture, which uses a single image model to independently extract spatial features for each frame, is widely used in frame-based video tasks. (c) The clip-based architecture, which uses a video model to produce spatiotemporal features for an entire clip, is widely used in sequence-based video tasks.

Abstract

Video understanding tasks have traditionally been modeled by two separate architectures, specially tailored for two distinct tasks. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on a single fixed-image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as the Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve the frame-based video tasks. Then the frame features are input into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by the state-of-the-art accuracy on the sequence-based action recognition task and the competitive advantage over conventional architecture on the frame-based MOT task. We believe that the concept of the streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at https://github.com/yuzhms/Streaming-Video-Model.

* This work was done during the internship of Yucheng at MSRA.
† Corresponding author.

1. Introduction

As a fundamental research topic in computer vision, video understanding mainly deals with two types of tasks. The sequence-based tasks [9, 56] aim to understand what is happening in a period of time.

For example, the action recognition task classifies the object action in a video sequence into a set of predefined categories. The frame-based tasks [11, 31, 72], on the other hand, aim to look for key information at a certain point of time in a video. For example, the multiple object tracking (MOT) task predicts the bounding boxes of objects in each video frame. Although both types of tasks take a video as input, they are handled very differently in computer vision research.

Figure 2. Comparison of video modeling paradigms on both the sequence-based action recognition task and the frame-based multiple object tracking task. The proposed streaming model achieves higher performance than the frame-based model on both tasks while showing no loss compared to the clip-based model on the sequence-based task. The clip-based model cannot be directly used in frame-based tasks.

The different treatment of these two types of tasks is mainly reflected in the type of backbone network used. The action recognition task is usually handled by a clip-based architecture, where a video model [1], which takes a video clip as input and outputs spatiotemporal features, is used. In the video object segmentation (VOS), video object detection (VOD), and multiple object tracking (MOT) tasks, however, a frame-based architecture [14, 21] is often adopted. The frame-based architecture employs an image backbone to generate independent spatial features for each frame. In most tracking-by-detection MOT solutions, these features are directly used as the input to the object detector.

Both types of treatment have their respective drawbacks. On the one hand, the clip-based architecture processes a group of video frames at one time, which puts great pressure on the processor's memory space and processing power. As a result, it is difficult to handle long videos or long actions effectively. In addition, the summarized spatiotemporal features extracted by a video backbone usually lack sufficient spatial resolution to be used for dense prediction tasks. On the other hand, the frame-based architecture does not consider surrounding frames in the process of spatial feature extraction. As a result, the features do not contain any temporal information, or an out-of-band mechanism is needed to gather additional temporal information. We believe that a video frame should be treated differently from a single image and that temporal-aware spatial features are more powerful for solving frame-based video understanding tasks.

In this paper, we propose a unified architecture to handle both types of video tasks. The proposed streaming video model, as shown in Fig.1, circumvents the drawbacks of the conventional treatment by a two-stage design. Specifically, it is composed of a temporal-aware spatial encoder, which extracts a temporal-aware spatial feature for each video frame, and a task-related temporal decoder, which transfers frame-level features to task-specific outputs for sequence-based tasks. When compared with the frame-based architecture, the temporal-aware spatial encoder in the streaming video model leverages additional information from past frames, so that it has the potential to obtain more powerful and robust features. When compared with the clip-based architecture, our model disentangles frame-level feature extraction and clip-level feature fusion, so as to alleviate the computation pressure while enabling more flexible use scenarios, such as long-term video inference or online video inference.

We instantiate such a streaming video model by building the streaming video Transformer (S-ViT) based on the vision Transformer [14]. S-ViT features self-attention within a frame to extract spatial information and cross-attention across frames to make the fused feature temporal-aware. Specifically, for the first frame of a video, S-ViT extracts exactly the same spatial feature as a standard image ViT, but it stores the keys and values of every Transformer layer in a memory. For subsequent frames in a video, both intra-frame self-attention and inter-frame cross-attention [54] with the stored memory are calculated. S-ViT borrows ideas from triple 2D (T2D) decomposition [74] and limits the cross-attention region to patches with the same horizontal or vertical positions. This decomposition reduces the computational cost and allows S-ViT to handle long histories. The output of this stage can directly be used by frame-based video tasks. For sequence-based tasks, an additional temporal decoder, implemented by a temporal Transformer, is used to gather information from multiple frames.

We evaluate our S-ViT model on two downstream tasks. The first task is sequence-based action recognition. We get 84.7% top-1 accuracy on the Kinetics-400 [23] dataset and 69.3% top-1 accuracy on the Something-Something v2 [20] dataset, which is on par with the state-of-the-art, but at a reduced computation expenditure. The second task is MOT, which operates on video frames in a widely adopted tracking-by-detection framework. We show that introducing the temporal-aware spatial encoder creates a comparative advantage over a frame-based architecture under a fair setting on the MOT17 [40] benchmark.
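To make the contrast between the two inference regimes concrete, the sketch below (our illustration, not code from the paper; the encoder and decoder callables are hypothetical stand-ins) shows how a streaming model consumes frames one at a time, exposing per-frame features for frame-based heads while optionally pooling them for sequence-based heads.

```python
import torch

def streaming_inference(frames, spatial_encoder, temporal_decoder=None):
    """Process a video frame by frame. Each per-frame feature can feed a
    frame-based head (e.g. a detector for MOT) immediately; the collected
    features can optionally be pooled by a temporal decoder for
    sequence-based tasks such as action recognition."""
    per_frame_feats = []
    for frame in frames:                       # frame: (C, H, W)
        feat = spatial_encoder(frame)          # temporal-aware spatial feature
        per_frame_feats.append(feat)           # -> frame-based task head
    if temporal_decoder is None:
        return per_frame_feats
    return temporal_decoder(torch.stack(per_frame_feats))  # -> sequence head

# toy usage with stand-in modules:
frames = [torch.randn(3, 224, 224) for _ in range(8)]
out = streaming_inference(frames,
                          spatial_encoder=lambda f: f.mean(dim=(1, 2)),
                          temporal_decoder=lambda seq: seq.mean(dim=0))
```

A clip-based model, by contrast, would need the whole stack of frames in memory before producing any output.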

We summarize the contributions as follows. First, we propose a unified architecture, named the streaming video model, for both frame-based and sequence-based video understanding tasks. Second, we implement a T2D-based streaming video Transformer and demonstrate how it can be used to serve different types of video tasks. Third, experiments on action recognition and MOT tasks show that our unified model can achieve state-of-the-art results on both types of tasks. We believe that the work presented in this paper is a solid step towards a universal video processing architecture.

2. Related Works

Video models and video tasks. Video understanding is a fundamental research topic in computer vision. There are mainly two kinds of tasks: one, named sequence-based tasks [9, 56], aims to understand what is happening over a period of time, and the other, named frame-based tasks [11, 31, 72], aims at capturing detailed information at a certain point of time. Because the inputs are quite different for these two types of tasks, different families of models have been developed independently.

For sequence-based tasks, clip-based models with 3D (width, height, and time) video input are used. 3D convolutional neural networks (CNNs) [9, 17, 45, 50-52, 64] were popular in the past decade, and video vision Transformers [1, 3, 6, 16, 36, 68, 74] are emerging models in recent years. Thanks to the attention mechanism in the Transformer [54], video vision Transformers have a better capability to model long-range spatiotemporal correlations and thus achieve higher performance than CNN-based methods. For frame-based tasks, frame-based models with 2D (width and height) image input are used. Common models include ResNet [21], CSPNet [55], and Swin Transformer [35]. Such models are adopted in the same way they are for images, and they do not encode any temporal-related information. In this paper, we propose a unified architecture to handle both types of tasks.

Long-term video models and online video models. As clip-based models require all frames as input at once, they have difficulty with long videos. A series of works termed long-term video models [60, 61] has been proposed to handle long videos. Building on top of clip-based models, some memory designs are used to extend the temporal coverage. Long-term feature banks [60] augment 3D CNNs with auxiliary supporting information extracted over the entire video. MeMViT [61] augments a multi-scale vision Transformer with cached memories using attention-based designs. There is also a series of methods termed online video models [24, 30, 78]. The temporal shift module (TSM) [30] proposes to shift part of the channels along the temporal dimension to exchange temporal information, resulting in an efficient and online video model. MoViNets [24] leverage the neural architecture search (NAS) technique and causal convolution [53] to build an efficient and causal video model for mobile devices.

Our streaming video model does not fall into these two model families, as we target unifying frame-based and sequence-based tasks, so it is versatile for any kind of video input. In contrast, long-term video models and online video models are still clip-based models, where the former aim at extending the temporal context and the latter aim at efficient and causal video model inference.

Vision Transformer. Motivated by the success in NLP [54], Vision Transformers (ViTs) [14] have made great progress in computer vision. Different from previously dominant CNN architectures, ViTs treat an image as a set of visual words and model their correlation with the attention operation. ViTs have already led to a paradigm shift in various vision tasks, including image recognition [35], object detection [8], semantic segmentation [10], action recognition [3], etc. In this work, we build our streaming video Transformer based on the vanilla vision Transformer [14] and a corresponding video adaptation mechanism, triple 2D (T2D) decomposition [74].

Multiple object tracking. Tracking by detection [4, 73] is one of the dominant paradigms in multiple object tracking (MOT). These methods first utilize powerful detectors to obtain detection results in each single frame and then associate detections over time to construct tracking trajectories. The association can be done by using location, motion, and appearance cues, or directly solved using a transformer architecture as set prediction [70]. We follow a simple yet effective association method called ByteTrack [72] in this paper and use a ViT-based detector to produce detection results. The key feature of the proposed method is the incorporation of a temporal-aware mechanism during the detection feature extraction stage. While some prior works investigate the utilization of temporal information in MOT [12, 65, 77], infusing it at the early feature extraction stage is infrequent.

3. Method

We build a streaming video model, named S-ViT, based on the vision Transformer (ViT) [14]. In this section, we first introduce the background of each S-ViT component. Then, we describe our architecture and model in detail. Finally, we provide the implementation details.

3.1. Background

Let us first review the vision Transformer and its extension to frame-based video tasks and sequence-based video tasks.

Vision Transformer. The Vision Transformer (ViT) is first proposed to process image inputs X ∈ R^{H×W×3}, where H and W denote the height and width, and 3 is the number of RGB channels. ViT first embeds an image into N non-overlapping patches X_p ∈ R^{N×C}, where C is the number of channels. Then, a positional embedding is added to obtain the input Z^0 to the first Transformer layer:

Z^0 = X_p + e,   (1)

where e ∈ R^{N×C} is the learnable positional embedding. The key components in ViT are L Transformer layers, each composed of a multi-head self-attention (MSA) block, layer normalization (LN) layers, and a multi-layer perceptron (MLP) block, as shown in Fig.3-(b). Denoting Z^{l-1} and Z^l as the input and output of the l-th Transformer layer, the computation implemented by this layer can be written as

Y^l = MSA(LN(Z^{l-1})) + Z^{l-1},   (2)
Z^l = MLP(LN(Y^l)) + Y^l.   (3)

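For readers who prefer code, the following is a minimal PyTorch sketch of the pre-norm Transformer layer in Eqs. (2)-(3); it is our illustration with default ViT-B sizes assumed, not the authors' implementation.

```python
# Minimal pre-norm Transformer layer matching Eqs. (2)-(3).
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (B, N, C) patch tokens
        h = self.ln1(z)
        y = z + self.attn(h, h, h, need_weights=False)[0]  # Eq. (2): Y = MSA(LN(Z)) + Z
        return y + self.mlp(self.ln2(y))                   # Eq. (3): Z = MLP(LN(Y)) + Y

# usage: ViTBlock()(torch.randn(2, 196, 768)) has shape (2, 196, 768)
```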
Figure 3. Illustration of streaming video Transformer. (a) The architecture of temporal-aware spatial encoder. (b) The scheme of a
Transformer layer. (c) Detailed structure of streaming T2D attention.

ViT for frame-based video tasks. Most frame-based video tasks, such as VOS, VOD, and MOT, need multi-scale feature maps. ViT is a non-hierarchical architecture that only maintains a single-scale feature map, which makes it difficult to plug into existing frameworks. For example, most detection frameworks utilize a ResNet-style multi-stage architecture that has feature maps of stride 4, 8, 16, and 32, but the plain ViT only has a feature map of stride 16. To solve this resolution misalignment problem, we develop a simple resolution adaptor (RA) to transfer the single-scale feature to multi-scale features, as shown in Fig.3-(a). The RA is implemented by a set of up-sample and down-sample (de-)convolutions. We found that such direct adaptation works well in our video dense prediction tasks. Our implementation is similar to the prior work ViTDet [28], which built a simple feature pyramid network (FPN) for image object detection. The difference is that our resolution adaptor does not replace the original sophisticated feature pyramid network (e.g., the PAN [34] in YOLOX [19]) but serves as a plug-in module on top of the backbone to bridge the resolution mismatch. Besides the multi-scale architecture, plain ViT also has a high computation cost due to the quadratic complexity of self-attention [54]. We solve this issue by using windowed self-attention [35] and convolutional cross-window propagation blocks [21], which are the same as in ViTDet.

ViT for sequence-based video tasks. Classical clip-based video models need to model the spatiotemporal feature jointly. Although our model does not follow the clip-based paradigm, the spatiotemporal feature learning mechanisms in existing works are still valuable for the streaming model's design. From this perspective, we build our streaming video model from a SOTA clip-based video model named T2D-ViT [74]. T2D-ViT extends ViT from an image model to a clip-based video model by introducing temporal attention. Concretely, given an input video tensor Z ∈ R^{N_h×N_w×N_t×C}, besides calculating the XY attention inside each frame, T2D-ViT also calculates the XT temporal attention within the same y ∈ {1, 2, ..., N_h} index and the TY temporal attention within the same x ∈ {1, 2, ..., N_w} index.

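The axis-restricted attention of T2D can be written compactly with tensor reshapes. The sketch below is our single-head illustration (projections and multi-head handling omitted) of XT attention, where tokens sharing a row attend over the X-T plane, and TY attention, where tokens sharing a column attend over the Y-T plane; it assumes PyTorch 2.x for scaled_dot_product_attention and is not the released T2D code.

```python
import torch
import torch.nn.functional as F

def xt_attention(q, k, v):
    """q, k, v: (B, Nh, Nw, Nt, C). Tokens with the same row index y attend
    to each other across the X and T axes."""
    B, H, W, T, C = q.shape
    flat = lambda x: x.reshape(B * H, W * T, C)             # one sequence per row
    out = F.scaled_dot_product_attention(flat(q), flat(k), flat(v))
    return out.reshape(B, H, W, T, C)

def ty_attention(q, k, v):
    """Tokens with the same column index x attend across the Y and T axes."""
    qt, kt, vt = (x.permute(0, 2, 1, 3, 4) for x in (q, k, v))  # (B, Nw, Nh, Nt, C)
    B, W, H, T, C = qt.shape
    flat = lambda x: x.reshape(B * W, H * T, C)             # one sequence per column
    out = F.scaled_dot_product_attention(flat(qt), flat(kt), flat(vt))
    return out.reshape(B, W, H, T, C).permute(0, 2, 1, 3, 4)

# usage: z = torch.randn(2, 14, 14, 8, 64); xt_attention(z, z, z).shape == z.shape
```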
The central idea of T2D-ViT is the decomposition of static appearance and dynamic motion. Therefore, it shares the same spirit as our streaming video model in that the spatial modeling in the current frame and the temporal modeling among nearby frames are disentangled. Due to the efficiency and effectiveness of T2D-ViT, we adopt similar XT and TY temporal attention in our S-ViT model, which will be introduced in the next section.

3.2. Streaming Video Model

Fig.1-(a) gives an overview of the proposed streaming video Transformer. Given an input sequence, at each timestamp, the temporal-aware spatial encoder module first encodes the spatial information within the current frame; then it fuses information from previous timestamps. The output of this module is frame-level features, which can be utilized for frame-based tasks like multiple object tracking. On top of the temporal-aware spatial encoder, an optional temporal decoder is appended to generate video-level features. Such video-level features are used for sequence-based tasks like action recognition.

The core design in our streaming video Transformer is the temporal-aware spatial encoder with streaming T2D attention. The architecture of the temporal-aware spatial encoder is shown in Fig.3-(a); it is composed of multiple Transformer layers, optional ResNet blocks, and the resolution adaptor. The ResNet blocks and the resolution adaptor are used for frame-based video tasks, which need multi-scale feature maps. The Transformer layer is composed of an attention layer and an MLP block with skip connections and layer normalization, as shown in Fig.3-(b). We use the streaming T2D attention, which produces temporal-aware spatial features by leveraging memorized histories.

Fig.3-(c) illustrates the implementation of the streaming T2D attention. First, we compute the spatial self-attention from the input x^t:

q^t = x^t W_q;  k^t = x^t W_k;  v^t = x^t W_v,   (4)
o^t = Attention(q^t, k^t, v^t),   (5)

where W_q, W_k, and W_v are projection matrices for queries, keys, and values, respectively. Then, we maintain a memory pool to store the historical information. During each frame's forward pass, we put the keys and values of the self-attention into the memory pool. Concretely, in the forward pass of the first frame, the memory pool only contains the keys and values of the first frame itself, and in the forward pass of the t-th frame, the memory pool contains all keys and values from the past timestamps. Formally, the memory used for frame t is

k̃^t = [sg(k^1), sg(k^2), ..., sg(k^{t-1}), sg(k^t)],   (6)
ṽ^t = [sg(v^1), sg(v^2), ..., sg(v^{t-1}), sg(v^t)],   (7)

where sg stands for the stop-gradient operation. We generate another temporal query q̃^t from the output of the spatial self-attention o^t using a separate transformation matrix W̃_q and then compute the cross-attention on q̃^t, k̃^t, and ṽ^t:

õ^t = Attention(q̃^t, k̃^t, ṽ^t).   (8)

Notice that the cross-attention here is calculated within the XT and TY data planes to improve efficiency and effectiveness. Take the TY attention as an example: given inputs q̃^t ∈ R^{N_w×N_h×C} and k̃^t, ṽ^t ∈ R^{T×N_w×N_h×C}, we split them along the horizontal axis to get {q̃^t_1, q̃^t_2, ..., q̃^t_{N_w}}, {k̃^t_1, k̃^t_2, ..., k̃^t_{N_w}}, and {ṽ^t_1, ṽ^t_2, ..., ṽ^t_{N_w}}. The attention is then calculated among queries, keys, and values with the same horizontal index. Similarly, the XT attention is calculated among queries, keys, and values with the same vertical index. The outputs of the XT and TY attention are fused into o^t with learnable per-channel weights initialized to 1e-4. The introduction of T2D attention decreases the computational complexity of the cross-attention part from O(N_w^2 N_h^2 T) to O(N_w^2 N_h T + N_w N_h^2 T), which makes our temporal attention module lightweight and therefore applicable to long histories.

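A compact sketch of the streaming attention in Eqs. (4)-(8) follows: the keys and values of each frame are detached (mirroring the stop-gradient sg) and cached, and a separate temporal query attends to the whole memory, with the result fused through per-channel weights initialized to 1e-4. For brevity this sketch uses a single head and full-plane cross-attention instead of the XT/TY restriction shown earlier; it is our illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamingMemoryAttention(nn.Module):
    """Per-frame spatial self-attention plus cross-attention to a memory of
    detached keys/values from past frames (a simplified view of Eqs. 4-8)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.wq_tilde = nn.Linear(dim, dim)                  # separate temporal query (Eq. 8)
        self.gate = nn.Parameter(torch.full((dim,), 1e-4))   # per-channel fusion weights
        self.mem_k, self.mem_v = [], []                      # memory pool, Eqs. (6)-(7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch tokens of the current frame t
        q, k, v = self.wq(x), self.wk(x), self.wv(x)                  # Eq. (4)
        o = F.scaled_dot_product_attention(q, k, v)                   # Eq. (5)
        self.mem_k.append(k.detach())                                 # sg(k^t)
        self.mem_v.append(v.detach())                                 # sg(v^t)
        k_tilde = torch.cat(self.mem_k, dim=1)                        # Eq. (6)
        v_tilde = torch.cat(self.mem_v, dim=1)                        # Eq. (7)
        q_tilde = self.wq_tilde(o)
        o_tilde = F.scaled_dot_product_attention(q_tilde, k_tilde, v_tilde)  # Eq. (8)
        return o + self.gate * o_tilde                                # gated fusion

    def reset(self):
        self.mem_k, self.mem_v = [], []
```

In the full model the cache is kept per Transformer layer and the cross-attention is restricted to the XT and TY planes as described above.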
3.3. Implementation Details

We implement our S-ViT based on the ViT-B [14] model, which has 12 Transformer layers. To support multi-scale features, we manually split the network into 4 stages, with 3 layers per stage. For the frame-based video tasks, we use windowed attention with a window size of 14 × 14 to reduce the heavy computation cost of global self-attention. Four ResNet blocks are appended, one at the end of each stage, for cross-window feature propagation. The CLIP [46] pre-trained weights are used as the initialization. Parameters that do not exist in the ViT-B model are randomly initialized.

For the action recognition task, we use four temporal Transformer layers as the temporal decoder. A text-generated classifier [63] is applied for the Kinetics-400 [23] dataset and a learnable linear classifier for the Something-Something v2 [20] dataset, following T2D-ViT. For the multiple object tracking task, we use the YOLOX-style [19] detection head and the ByteTrack [72] tracker.

4. Experiments

4.1. Experimental Setup

We evaluate our method on two video tasks, namely video action recognition and multiple object tracking. For video action recognition, we conduct experiments on two widely used benchmarks, i.e., Kinetics-400 [23] and Something-Something v2 [20]. For multiple object tracking, we use the MOT17 [40] dataset for evaluation, with additional data sources MOTSynth [15] and CrowdHuman [48].

Kinetics-400 (K400) [23] is a large-scale video action recognition dataset collected from YouTube. It contains 234,584 training videos and 19,760 validation videos. The videos in K400 are trimmed to around 10 seconds. We use sparse sampling [41] and randomly resized cropping to sample 16 frames at 224 × 224 resolution to form a video clip. We use the same data augmentation and regularization as in X-CLIP [41], including random horizontal flip, color jitter, random grayscale, label smoothing, Mixup [71], and CutMix [69]. In the inference phase, we adopt multi-view testing with four temporal clips and three spatial crops. The top-1 and top-5 classification accuracy on the validation set are reported as evaluation metrics.

Something-Something V2 (SSv2) [20] is another large-scale action recognition dataset, which focuses more on temporal modeling. The labels are of the form "Pulling something from left to right," so it is crucial to learn motion information. The training set contains 168.9K videos and the validation set contains 24.7K videos. We use segment-based sampling from [30] to sample 32 frames at 224 × 224 resolution. The augmentation and regularization in SSv2 include random augmentation [13], repeated augmentation [22], random erasing [75], Mixup [71], and CutMix [69], which follow the practice in MViT [16].

MOT17 [40] is a multiple object tracking dataset that contains 7 training sequences and 7 test sequences. The total number of frames is only 11k, which is not enough to train our S-ViT model. We use the CrowdHuman [48] dataset and the MOTSynth [15] dataset to expand the training data. CrowdHuman contains 19.4k images of crowded human scenarios, and MOTSynth contains 764 synthetic video sequences with 1.3M frames generated from Grand Theft Auto V. We conduct our experiments with combinations of different data sources and discuss the influence in Sec.4.2. The data augmentation and regularization include Mosaic [5] and Mixup [71], which follow the practice in ByteTrack. The input image size is 1440 × 800, with the shortest side ranging from 576 to 1024 during multi-scale training. We use the CLEAR [2] metrics for evaluation, including multiple object tracking accuracy (MOTA), higher order tracking accuracy (HOTA) [38], and IDF1, to evaluate different aspects of tracking and detection performance. We also report raw statistics such as FP, FN, and IDs. As there are no labels for the test set of MOT17, we split the training set by using the first half of each video for training and the last half for validation in our ablation studies, following [76]. We report test results when compared with other methods.

Training configurations. We train our S-ViT model using the AdamW [37] optimizer. The number of training epochs for action recognition on K400 and SSv2 is set to 30, with 5 epochs of warmup. A cosine learning rate schedule with maximum learning rates of 1e-5 and 5e-5 is used for K400 and SSv2, respectively. The number of training epochs for multiple object tracking is set to 10, with 1 epoch of warmup. The learning rate is set to 2.5e-4 with a cosine annealing schedule. More details can be found in the supplementary material.

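A minimal sketch of the optimization recipe described above (AdamW with linear warmup followed by cosine decay) is shown below. The scheduler composition and the weight-decay value are our assumptions; the paper defers such details to the supplementary material.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer(model: torch.nn.Module, base_lr: float = 1e-5,
                    epochs: int = 30, warmup_epochs: int = 5,
                    weight_decay: float = 0.05):
    # Defaults mirror the K400 setting (30 epochs, 5 warmup, lr 1e-5);
    # SSv2 uses lr 5e-5 and MOT uses 10 epochs / 1 warmup / lr 2.5e-4.
    # weight_decay=0.05 is an assumption, not a value stated in the paper.
    opt = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    sched = SequentialLR(
        opt,
        schedulers=[
            LinearLR(opt, start_factor=0.01, total_iters=warmup_epochs),  # warmup
            CosineAnnealingLR(opt, T_max=epochs - warmup_epochs),         # cosine decay
        ],
        milestones=[warmup_epochs],
    )
    return opt, sched
```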
4.2. Results on Multiple Object Tracking

The most important advantage of our S-ViT model for frame-based video tasks is its ability to extract temporal-aware spatial features. We design controlled experiments on the MOT17 dataset to demonstrate the effectiveness of the streaming video model and also ablate the influence of some newly introduced factors.

Effectiveness of the streaming video model. Tab.1 shows the comparison between our streaming video model and the frame-based video model. Our streaming video model outperforms the frame-based video model by 0.6 MOTA, 2.5 IDF1, and 1.3 HOTA, which clearly demonstrates the effectiveness of temporal-aware spatial features.

Table 1. Comparison of the frame-based video model and the streaming video model on the MOT17 half-validation set.

Method MOTA ↑ IDF1 ↑ HOTA ↑ FP ↓ FN ↓ IDs ↓
frame-based 79.0 78.4 67.0 10248 23058 564
streaming 79.6 80.9 68.3 9507 22956 453

Influence of test-time memory length. One flexibility of our streaming video Transformer is that we can use an arbitrary memory length in the test phase without model retraining. Intuitively, using a longer history helps our model extract more robust features. As shown in Fig.4, a longer test-time memory length indeed improves the tracking performance. Specifically, the 32-frame model gets 1.3 higher IDF1 and 0.7 higher HOTA than the 2-frame model.

Figure 4. Comparison of the performance of S-ViT with different test-time memory lengths.
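Because the memory is just a list of cached per-frame keys and values, the test-time memory length studied above can be changed without retraining by truncating that list before each forward pass, as in the toy helper below (attribute names follow the earlier sketch and are ours, not the released implementation).

```python
def truncate_memory(model, max_frames: int = 32):
    """Keep only the most recent `max_frames` frames in every attention
    module's memory pool before processing the next frame."""
    for module in model.modules():
        if hasattr(module, "mem_k"):
            module.mem_k = module.mem_k[-max_frames:]
            module.mem_v = module.mem_v[-max_frames:]
```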
Ablation study on training datasets. The paradigm switch from a frame-based video model to a streaming video model also involves a transition of training datasets. In multiple object tracking, it is common practice to involve additional data sources, as MOT17 only has seven training sequences. An image pedestrian detection dataset called CrowdHuman is used in many prior works. However, as the CrowdHuman dataset only contains still images, it cannot provide useful temporal information, which is needed by our streaming video model. We thus introduce another synthetic video dataset, MOTSynth, to train our streaming video model. The drawback of using MOTSynth is that it has a domain gap with real images because it is generated from a video game. We evaluate the different combinations of these data sources and present the results in Tab.2. The first row shows the results of using MOT17 alone. The HOTA of this model is only 61.6% and we observe severe over-fitting during training. The second row and the third row show the results of adding CrowdHuman and MOTSynth, respectively. In order to use CrowdHuman in our streaming video model, we duplicate frames to form a video. It is clear that both additional data sources help our model achieve higher performance on MOT17. Finally, in the last row, we use all three data sources and achieve the highest performance of 68.3 HOTA, showing the importance of using both the video data sources and the real-world data sources.

Table 2. Comparison of training datasets for streaming video model training on the MOT17 half-validation set. MOT17 is a video dataset. MOTS is short for MOTSynth, which is a synthetic video dataset. CH is short for CrowdHuman, which is an image dataset.

Dataset MOTA ↑ IDF1 ↑ HOTA ↑ FP ↓ FN ↓ IDs ↓
MOT17 69.9 73.6 61.6 18837 29016 750
+CH 78.0 78.0 65.7 9966 25065 549
+MOTS 77.4 78.1 66.2 11742 24279 555
+MOTS +CH 79.6 80.9 68.3 9507 22956 453

4.3. Results on Action Recognition

We present the action recognition results of S-ViT in Tab.3 with a comparison to the frame-based model and the clip-based model. The frame-based model here is implemented with a spatial encoder and a temporal decoder, and the streaming model further upgrades the spatial encoder's temporal awareness; the only difference between them is whether temporal awareness is used in the spatial encoder. Our streaming video model achieves a 0.5% top-1 accuracy gain and a 1.0% top-1 accuracy gain over the frame-based model on K400 and SSv2, respectively, thanks to the temporal-aware spatial encoder. It is also surprising to see that our streaming video model achieves similar top-1 and top-5 accuracy on K400 when compared with the clip-based model while reducing the GFLOPs by 14%. The streaming video model only uses the history information to compute the cross-attention, while the clip-based video model uses both the history and the future. The same performance of these two models indicates that future information may not be necessary for sequence-based video tasks, and we have the opportunity to build a causal video model without sacrificing performance on certain kinds of datasets. On SSv2, we observe a notable performance loss that may be related to the fine-grained category definitions in SSv2. For example, knowing future information may help to distinguish the class "opening something" from the class "pretending to open something without actually opening it."

Table 3. Comparison of the frame-based video model, clip-based video model, and streaming video model on K400 and SSv2.

Method GFLOPs K400 Top-1 K400 Top-5 SSv2 Top-1 SSv2 Top-5
frame-based 282 84.2 96.7 68.3 91.6
clip-based 397 84.7 96.7 70.5 92.6
streaming 340 84.7 96.8 69.3 92.1

Table 4. Comparison to the state-of-the-art on Kinetics-400. #Frames denotes the total number of frames used during inference, which is #frames per clip × #spatial crops × #temporal clips.

Method #Frames GFLOPs Top-1 Top-5
Methods with CNN
R(2+1)D [52] 16x1x10 75 72.0 90.0
SlowFast + NL [18] 16x3x10 234 79.8 93.9
X3D-XXL [17] 16x3x10 144 80.4 94.6
Methods with Transformer
TokenLearner [47] 64x3x4 4076 85.4 96.3
ViViT-L FE [1] 32x3x1 3980 83.5 94.3
MViTv2-L (312↑) [29] 40x3x5 2828 86.1 97.0
TimeSformer-L [3] 96x3x1 2380 80.7 94.7
Video-Swin-L (384↑) [36] 32x5x10 2107 84.9 96.7
MTV-L [68] 32x3x4 1504 84.3 96.3
MTV-B [68] 32x3x4 399 81.8 95.0
Uniformer-B [27] 32x3x4 259 83.0 95.4
MViTv2-B [29] 32x1x5 225 82.9 95.7
Video-Swin-S [36] 32x3x4 166 80.6 94.5
Methods with CLIP-B pre-trained ViT
ActionCLIP-B/16 [58] 32x3x10 563 83.8 96.2
EVL ViT-B/16 [33] 32x3x1 592 84.2 -
X-CLIP-B/16 [41] 16x3x4 287 84.7 96.8
ViT-B w/ ST-Adapter [42] 32x3x1 607 82.7 96.2
Text4Vis-B/16 [63] 16x3x4 - 83.6 96.4
T2D-B [74] 16x3x4 395 84.7 96.7
Streaming Video Model
S-ViT (Ours) 16x3x4 340 84.7 96.8

4.4. Benchmark Evaluation

In this section, we compare the performance of S-ViT with state-of-the-art methods on both the action recognition task and the multiple object tracking task. Results of action recognition on K400 and SSv2 are shown in Tab.4 and Tab.5, respectively. Results of multiple object tracking on the MOT17 test set are shown in Tab.6.

Kinetics-400. In Tab.4, we report the comparison of our streaming video model and previous clip-based models on K400. Among all the compared models, our S-ViT achieves competitive performance with relatively low GFLOPs. Specifically, we get a 2.9% top-1 accuracy gain over MTV-B [68] and a 0.8% top-1 accuracy gain over EVL ViT-B/16 [33] with lower GFLOPs. Even compared with the state-of-the-art models X-CLIP-B/16 [41] and T2D-B [74], our S-ViT still achieves competitive performance. It is worth noting that our model is a streaming video model that extracts features frame by frame and does not use future information in the temporal-aware spatial encoder. So it is quite a success that we do not lag behind clip-based video models, showing the opportunity of using streaming video models on sequence-based tasks.

Something-Something V2. Tab.5 presents results of S-ViT compared to SOTA methods on SSv2. Consistent with our findings on K400, our streaming video model showcases considerable proficiency on this motion-focused dataset, demonstrating its potential to operate as a general video action recognition model for diverse datasets.

MOT17. We report multiple object tracking results on MOT17, as shown in Tab.6. Among all the compared methods, our S-ViT attains top performance and only underperforms ByteTrack, which utilizes the strong YOLOX detector with COCO [32] pre-training. S-ViT uses a pure ViT backbone and does not use any detection pre-training. Further tuning of the ViT-based detection architecture may improve the performance of our method, but it is beyond the scope of this paper. Nevertheless, our S-ViT achieves the highest performance among all Transformer-based methods, outperforming TransMOT [12] by 1.4 MOTA, 0.8 IDF1, and 0.3 HOTA.

Table 5. Comparison to the state-of-the-art on SSv2.

Method #Frames GFLOPs Top-1 Top-5
Methods with CNN
TSM [30] 16x1x1 66 63.3 88.5
MSNet [25] 16x1x1 67 64.7 89.4
SELFYNet [26] 16x1x1 67 65.7 89.8
TDN [57] 16x1x1 132 66.9 90.9
Methods with hierarchical Transformer
Video-Swin-B [36] 32x3x1 321 69.6 92.7
UniFormer-B [27] 32x3x1 259 71.2 92.8
MViT-B-24 [16] 32x3x1 236 68.7 91.5
MViTv2-S [29] 32x3x1 65 68.2 91.4
MViTv2-B [29] 32x3x1 225 72.1 93.4
Methods with cylindrical Transformer
TimeSformer-HR [3] 16x3x1 1703 62.5 -
ViViT-L [1] 16x3x4 903 65.4 89.8
MTV-B (320p) [68] 16x3x4 930 68.5 90.4
Mformer-L [44] 32x3x1 1185 68.1 91.2
EVL ViT-B/16 [33] 32x3x1 682 62.4 -
ViT-B w/ ST-Adapter [42] 32x3x1 652 69.5 92.6
T2D-B [74] 32x3x2 397 70.5 92.6
Streaming Video Model
S-ViT (Ours) 32x3x2 340 69.3 92.1

Table 6. Comparison to the state-of-the-art on the MOT17 test set.

Method MOTA ↑ IDF1 ↑ HOTA ↑ FP ↓ FN ↓ IDs ↓
Methods with CNN
CenterTrack [76] 67.8 64.7 52.2 18,498 160,332 3,039
QDTrack [43] 68.7 66.3 53.9 26,589 146,643 3,378
TraDeS [62] 69.1 63.9 52.7 20,892 150,060 3,555
FairMOT [73] 73.7 72.3 59.3 27,507 117,477 3,303
CorrTracker [59] 76.5 73.6 60.7 29,808 99,510 3,369
Unicorn [67] 77.2 75.5 61.7 50,087 73,349 5,379
ByteTrack [72] 80.3 77.3 63.1 25,491 83,721 2,196
Methods with Transformer
MeMOT [7] 72.5 69.0 56.9 37,221 115,248 2,724
TransCenter [66] 73.2 62.2 54.5 23,112 123,738 4,614
MOTR [70] 73.4 68.6 57.8 - - 2,439
Trackformer [39] 74.1 68.0 - 34,602 108,777 2,829
TransTrack [49] 75.2 63.5 54.1 50,157 86,442 3,603
GTR [77] 75.3 71.5 59.1 26,793 109,854 2,859
TransMOT [12] 76.7 75.1 61.7 36,231 93,150 2,346
S-ViT (Ours) 78.1 75.9 62.0 39,063 82,704 1,983

5. Conclusion

In this work, we propose the idea of streaming video models, which aim to unify the treatment of frame-based and sequence-based video understanding tasks that in the past were handled by separate models. We present an implementation named the streaming video Transformer and conduct comprehensive experiments on multiple benchmarks. Our model achieves competitive performance on sequence-based action recognition datasets compared to existing clip-based methods. Our model also achieves a significant performance gain on the frame-based multiple object tracking task compared to the previous practice of frame-based models. To the best of our knowledge, this is the first deep learning architecture that unifies video understanding tasks.

In the future, we will apply S-ViT to more video tasks, including single object tracking, video object detection, and long-term video localization. Besides, we will continue to improve S-ViT by upgrading its components, such as the detection head.

Acknowledgement. This work was partially supported by the National Key R&D Program of China under Grant 2020AAA0105702, the National Natural Science Foundation of China (NSFC) under Grants 62225207 and U19B2038, and the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-025.

References

[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836-6846, October 2021.
[2] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process., 2008, 2008.
[3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, volume 139 of Proceedings of Machine Learning Research, pages 813-824. PMLR, 2021.
[4] Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Tozeto Ramos, and Ben Upcroft. Simple online and realtime tracking. In ICIP, pages 3464-3468. IEEE, 2016.
[5] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. CoRR, abs/2004.10934, 2020.
[6] Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martínez, and Georgios Tzimiropoulos. Space-time mixing attention for video transformer. CoRR, abs/2106.05968, 2021.
[7] Jiarui Cai, Mingze Xu, Wei Li, Yuanjun Xiong, Wei Xia, Zhuowen Tu, and Stefano Soatto. Memot: Multi-object tracking with memory. In CVPR, pages 8080-8090. IEEE, 2022.
[8] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV (1), volume 12346 of Lecture Notes in Computer Science, pages 213-229. Springer, 2020.
[9] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, pages 4724-4733. IEEE Computer Society, 2017.
[10] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, pages 17864-17875, 2021.
[11] Ho Kei Cheng and Alexander G. Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In ECCV (28), volume 13688 of Lecture Notes in Computer Science, pages 640-658. Springer, 2022.
[12] Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, and Zicheng Liu. Transmot: Spatial-temporal graph transformer for multiple object tracking. CoRR, abs/2104.00194, 2021.
[13] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. Randaugment: Practical automated data augmentation with a reduced search space. In NeurIPS, 2020.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. OpenReview.net, 2021.
[15] Matteo Fabbri, Guillem Brasó, Gianluca Maugeri, Orcun Cetintas, Riccardo Gasparini, Aljosa Osep, Simone Calderara, Laura Leal-Taixé, and Rita Cucchiara. Motsynth: How can synthetic data help pedestrian detection and tracking? In ICCV, pages 10829-10839. IEEE, 2021.
[16] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824-6835, October 2021.
[17] Christoph Feichtenhofer. X3D: expanding architectures for efficient video recognition. In CVPR, pages 200-210. Computer Vision Foundation / IEEE, 2020.
[18] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, pages 6201-6210. IEEE, 2019.
[19] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: exceeding YOLO series in 2021. CoRR, abs/2107.08430, 2021.
[20] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fründ, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense. In ICCV, pages 5843-5851. IEEE Computer Society, 2017.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778. IEEE Computer Society, 2016.
[22] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Improving generalization through instance repetition. In CVPR, pages 8126-8135. Computer Vision Foundation / IEEE, 2020.
[23] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[24] Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recognition. In CVPR, pages 16020-16030. Computer Vision Foundation / IEEE, 2021.
[25] Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. Motionsqueeze: Neural motion feature learning for video understanding. In ECCV (16), volume 12361 of Lecture Notes in Computer Science, pages 345-362. Springer, 2020.
[26] Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. Learning self-similarity in space and time as generalized motion for video action recognition. In ICCV, 2021.
[27] Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unified transformer for efficient spatial-temporal representation learning. In ICLR. OpenReview.net, 2022.
[28] Yanghao Li, Hanzi Mao, Ross B. Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV (9), volume 13669 of Lecture Notes in Computer Science, pages 280-296. Springer, 2022.
[29] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Improved multiscale vision transformers for classification and detection. CoRR, abs/2112.01526, 2021.
[30] Ji Lin, Chuang Gan, and Song Han. TSM: temporal shift module for efficient video understanding. In ICCV, pages 7082-7092. IEEE, 2019.
[31] Liting Lin, Heng Fan, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline for transformer tracking. CoRR, abs/2112.00995, 2021.
[32] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV (5), volume 8693 of Lecture Notes in Computer Science, pages 740-755. Springer, 2014.
[33] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen CLIP models are efficient video learners. CoRR, abs/2208.03550, 2022.
[34] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, pages 8759-8768. Computer Vision Foundation / IEEE Computer Society, 2018.
[35] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012-10022, October 2021.
[36] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. CoRR, abs/2106.13230, 2021.
[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR (Poster). OpenReview.net, 2019.
[38] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip H. S. Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. HOTA: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis., 129(2):548-578, 2021.
[39] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixé, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. In CVPR, pages 8834-8844. IEEE, 2022.
[40] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831 [cs], Mar. 2016.
[41] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. CoRR, abs/2208.02816, 2022.
[42] Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, and Hongsheng Li. St-adapter: Parameter-efficient image-to-video transfer learning for action recognition. CoRR, abs/2206.13559, 2022.
[43] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In CVPR, pages 164-173. Computer Vision Foundation / IEEE, 2021.
[44] Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and João F. Henriques. Keeping your eye on the ball: Trajectory attention in video transformers. In NeurIPS, pages 12493-12506, 2021.
[45] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, pages 5534-5542. IEEE Computer Society, 2017.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8748-8763. PMLR, 2021.
[47] Michael S. Ryoo, A. J. Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. Tokenlearner: Adaptive space-time tokenization for videos. In NeurIPS, pages 12786-12797, 2021.
[48] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. CoRR, abs/1805.00123, 2018.
[49] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple-object tracking with transformer. CoRR, abs/2012.15460, 2020.
[50] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489-4497. IEEE Computer Society, 2015.
[51] Du Tran, Heng Wang, Matt Feiszli, and Lorenzo Torresani. Video classification with channel-separated convolutional networks. In ICCV, pages 5551-5560. IEEE, 2019.
[52] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450-6459. Computer Vision Foundation / IEEE Computer Society, 2018.
[53] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, page 125. ISCA, 2016.
[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998-6008, 2017.
[55] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can enhance learning capability of CNN. In CVPR Workshops, pages 1571-1580. Computer Vision Foundation / IEEE, 2020.
[56] Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, and Lu Yuan. Omnivl: One foundation model for image-language and video-language tasks. CoRR, abs/2209.07526, 2022.
[57] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. TDN: temporal difference networks for efficient action recognition. In CVPR, pages 1895-1904. Computer Vision Foundation / IEEE, 2021.
[58] Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition. CoRR, abs/2109.08472, 2021.
[59] Qiang Wang, Yun Zheng, Pan Pan, and Yinghui Xu. Multiple object tracking with correlation learning. In CVPR, pages 3876-3886. Computer Vision Foundation / IEEE, 2021.
[60] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross B. Girshick. Long-term feature banks for detailed video understanding. In CVPR, pages 284-293. Computer Vision Foundation / IEEE, 2019.
[61] Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In CVPR, pages 13577-13587. IEEE, 2022.
[62] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In CVPR, pages 12352-12361. Computer Vision Foundation / IEEE, 2021.
[63] Wenhao Wu, Zhun Sun, and Wanli Ouyang. Transferring textual knowledge for visual recognition. CoRR, abs/2207.01297, 2022.
[64] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV (15), volume 11219 of Lecture Notes in Computer Science, pages 318-335. Springer, 2018.
[65] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu. Spatial-temporal relation networks for multi-object tracking. In ICCV, pages 3987-3997. IEEE, 2019.
[66] Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense queries for multiple-object tracking. CoRR, abs/2103.15145, 2021.
[67] Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Towards grand unification of object tracking. In ECCV (21), volume 13681 of Lecture Notes in Computer Science, pages 733-751. Springer, 2022.
[68] Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, and Cordelia Schmid. Multiview transformers for video recognition. CoRR, abs/2201.04288, 2022.
[69] Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pages 6022-6031. IEEE, 2019.
[70] Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. MOTR: end-to-end multiple-object tracking with transformer. In ECCV (27), volume 13687 of Lecture Notes in Computer Science, pages 659-675. Springer, 2022.
[71] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR (Poster). OpenReview.net, 2018.
[72] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. In ECCV (22), volume 13682 of Lecture Notes in Computer Science, pages 1-21. Springer, 2022.
[73] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis., 129(11):3069-3087, 2021.
[74] Yucheng Zhao, Chong Luo, Chuanxin Tang, Dongdong Chen, Noel C Codella, Lu Yuan, and Zheng-Jun Zha. T2d: Spatiotemporal feature learning based on triple 2d decomposition, 2023.
[75] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, pages 13001-13008. AAAI Press, 2020.
[76] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In ECCV (4), volume 12349 of Lecture Notes in Computer Science, pages 474-490. Springer, 2020.
[77] Xingyi Zhou, Tianwei Yin, Vladlen Koltun, and Philipp Krähenbühl. Global tracking transformers. In CVPR, pages 8761-8770. IEEE, 2022.
[78] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: efficient convolutional network for online video understanding. In ECCV (2), volume 11206 of Lecture Notes in Computer Science, pages 713-730. Springer, 2018.