Prompt Learning For Action Recognition
Abstract
We present a new general learning approach for action recognition, Prompt Learn-
ing for Action Recognition (PLAR), which leverages the strengths of prompt
learning to guide the learning process. Our approach is designed to predict the
action label by helping the models focus on the descriptions or instructions as-
sociated with actions in the input videos. Our formulation uses various prompts,
including optical flow, large vision models, and learnable prompts to improve
the recognition performance. Moreover, we propose a learnable prompt method
that learns to dynamically generate prompts from a pool of prompt experts under
different inputs. By sharing the same objective, our proposed PLAR can optimize
prompts that guide the model’s predictions while explicitly learning input-invariant
(prompt experts pool) and input-specific (data-dependent) prompt knowledge. We
evaluate our approach on datasets consisting of both ground camera videos and
aerial videos, and on scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset, Okutama, and a 0.8-2.6% improvement on the ground camera single-agent dataset, Something Something V2. We plan to release our code on the WWW.
1 Introduction
Action recognition, the task of understanding human activities from video sequences, is a fundamental
problem in computer vision. This problem arises in many applications: video surveillance, human-
computer interaction, sports analysis, human-robot interaction, etc. There has been considerable
work on these problems in recent years, driven by the availability of large-scale video datasets and
advancements in deep learning techniques including two-stream Convolutional Neural Network
(CNN) (Simonyan and Zisserman, 2014), Recurrent Neural Network (RNN) (Sun et al., 2017), and
Transformer-based methods (Vaswani et al., 2017; Li et al., 2022b). These methods have achieved
considerable success in extracting discriminative features from video sequences, leading to significant
improvements in action recognition accuracy for ground videos and aerial videos.
Despite the recent progress in action recognition algorithms, the success of most existing approaches
relies on extensive labeled training data followed by a purely supervised learning paradigm that
mainly focuses on a backbone architecture design. In this paper, our goal is to design new methods for
video action recognition using prompt learning. Prompt-based techniques (Liu et al., 2023) have been
proposed for natural language processing tasks to circumvent the issue of lack of labeled data. These
learning methods use language models that estimate the probability of text and use this probability to predict the label, thereby reducing or obviating the need for large labeled datasets. In the context of
action recognition, prompt learning offers the potential to design better optimization strategies by
providing high-level texture descriptions or instructions associated with actions. These prompts can
guide the learning process, and enable the model to capture discriminative spatio-temporal patterns
effectively, resulting in better performance.
* These authors contributed equally to this work.
Figure 1: Task: We use prompt learning for action recognition. Our method leverages the strengths
of prompt learning to guide the learning process by helping models better focus on the descriptions or
instructions associated with actions in the input videos. We explore various prompts, including optical
flow, large vision models, and proposed learnable prompts to improve recognition performance. The
recognition models can be CNNs or Transformers.
Many prompt learning-based techniques have been proposed for few-shot action recognition (Shi
et al., 2022), zero-shot action recognition (Sato et al., 2023; Wang et al., 2021a), and ordinal action
understanding. (Shi et al., 2022) proposes knowledge prompting, which leverages commonsense
knowledge of actions from external resources to prompt a powerful pre-trained vision-language
model for few-shot classification. (Sato et al., 2023) presents a unified, user prompt-guided zero-shot
learning framework using a target domain-independent skeleton feature extractor, which is pretrained
on a large-scale action recognition dataset. Bridge-Prompt (Li et al., 2022a) proposes a prompt-
based framework to model the semantics across adjacent actions from a series of ordinal actions in
instructional videos. Our goal is to apply these techniques to video action recognition.
Main Contributions: We present a general prompt learning approach that alleviates the burden of
objective optimization by integrating prompt-based learning into the action recognition pipeline. Our
approach is designed to enhance the model’s ability to process customized inputs by utilizing prompt
tokens. These prompt tokens can be either predefined templates or learnable tokens that include
information specific to video action recognition. Our formulation leverages prompts, which make it easier for the model to focus on targets of interest and enable the learning of complex visual concepts.
In our prompt learning paradigm, we explore and discuss various types of prompts, including optical flow and large vision models. In addition, we present a learnable prompt, which dynamically generates
prompts from a pool of prompt experts under different inputs. Our goal is to optimize prompts
that guide the model’s predictions while explicitly learning input-invariant (prompt experts) and
input-specific (data-dependent) prompt knowledge. We validate this generalization by performing
extensive evaluations on datasets comprising ground camera videos and aerial videos, on scenarios
involving single-agent and multi-agent actions. We demonstrate that our technique can improve
the performance and enhance the generalization capabilities of video action recognition models in
different scenarios. The novel components of our work include:
1. We present a general learning approach that uses prompt learning and auto-regressive techniques for action recognition.
2. We propose a new learnable prompt method, which guides the model’s predictions while explicitly learning input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge.
3. To the best of our knowledge, ours is the first approach to explore the possibility of using large vision models as prompts to instruct the action recognition task.
4. Through empirical evaluations, we demonstrate the potential and effectiveness of prompt
learning techniques for action recognition tasks. Specifically, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset, Okutama. Moreover, we observe a
0.8-2.6% accuracy improvement on the ground camera single-agent video dataset, Something
Something V2 (Goyal et al., 2017).
2 Related Works
Human action recognition, i.e., recognizing and understanding human actions, is crucial for a number
of real-world applications. Recently, many deep learning architectures have been proposed to improve
the performance. At a broad level, they can be classified into three categories:
Two-stream 2D Convolutional Neural Network: The use of two-stream 2D Convolutional Neural
Networks (CNNs) has been widely explored in the field of action recognition (Simonyan and
Zisserman, 2014; Karpathy et al., 2014; Wang et al., 2015; Sánchez et al., 2013; Chéron et al., 2015;
Girdhar et al., 2017). They capture spatial and temporal information separately to process video data.
(Zong et al., 2021) extended the two-stream CNN to a three-stream CNN by incorporating a motion
saliency stream to enhance the representation of salient motion information. (Piergiovanni and Ryoo,
2019) proposed a trainable flow layer that eliminates the need for optical flow computation while
capturing motion information.
3D CNN-based methods: Several methods in the literature have utilized 3D Convolutional Neural
Networks (CNNs) for action recognition (Tran et al., 2015; Ji et al., 2012; Zhang et al., 2020; Li et al.,
2020a; Carreira and Zisserman, 2017). These approaches jointly leverage the spatio-temporal characteristics of video data through 3D convolutions. (Zhou et al., 2020) analyzed
the spatio-temporal fusion in 3D CNN from a probabilistic view. (Yang et al., 2020) introduced a
generic Temporal Pyramid Network (TPN) to effectively model speed variations in actions at the
feature level. (Piergiovanni and Ryoo, 2021) investigated the impact of viewpoint variations on
recognition performance. (Hussein et al., 2019) proposed multi-scale temporal-only convolutions to
handle large variations in temporal extents within complex actions. (Fayyaz et al., 2021) addressed
the computational cost by dynamically adapting the temporal feature resolution within 3D CNNs.
Transformer-based approaches: Many techniques have been proposed for transformer-based action recognition (Arnab et al., 2021; Bertasius et al., 2021; Wang et al., 2021b). (Tong et al., 2022)
introduced a Video Masked Autoencoder (Video MAE) with an attention mechanism to address
the challenges of information redundancy and temporal correlation in video data, improving the
efficiency of action recognition. (Yang et al., 2022) proposed an attention gate that facilitates
interactions between frame inputs and hidden states, enabling the aggregation of global inter-frame
features across the temporal domain through recurrent execution. (Mazzia et al., 2022) leveraged 2D
pose representations within a fully self-attentional architecture over short temporal periods, enabling
low latency and high throughput in recognition tasks. (Li et al., 2022b) implemented the seminal idea
of multiscale feature hierarchies with transformer models for video and image recognition.
Although these methods have had good success on ground data and YouTube videos, they cannot achieve a similar level of accuracy on videos captured using Unmanned Aerial Vehicles (UAVs) (Wang et al., 2023; Xian et al., 2023a). Compared to ground or YouTube videos, UAV videos have unique characteristics such as low resolution, scale and size variations, and moving cameras.
(Xian et al., 2023b) proposed a mutual information-based feature alignment and sampling method
to extract spatial-temporal features corresponding to human actors for better recognition accuracy.
(Kothandaraman et al., 2022) introduced Fourier transformation into attention modules to aggregate
the motion salience.
The concept of prompt learning was introduced by (Petroni et al., 2019) and has since gained
significant attention in the field of Natural Language Processing (NLP) (Brown et al., 2020; Jiang
et al., 2020; Li and Liang, 2021; Liu et al., 2023; Tian et al., 2020). The fundamental idea behind
prompt learning is to treat pre-trained language models like BERT or GPT as knowledge bases,
enabling their application in downstream tasks. Early studies, such as those by (Petroni et al., 2019;
Poerner et al., 2019), focused on designing manually crafted prompts to enhance the performance
of language models. Subsequently, researchers like (Shin et al., 2020; Jiang et al., 2020) aimed to
automate this process using cost-effective, data-driven approaches. More recently, some works (Han
et al., 2022; Lester et al., 2021; Zhong et al., 2021) have attempted to learn continuous prompts
instead of searching for discrete prompts.
While prompt learning has garnered significant attention in the field of Natural Language Processing
(NLP), its application in computer vision is still a relatively new and emerging research direction.
Only recently have researchers started exploring prompt learning techniques in computer vision
tasks (Rao et al., 2022; Ju et al., 2022; Zhou et al., 2021). Pretrained VLMs (Jia et al., 2021; Radford
et al., 2021) utilize manual prompts for zero-shot inference on the downstream tasks. (Zhou et al.,
2021) proposed Context Optimization (CoOp) to extend continuous prompt representations to the
vision domain, enabling the automatic learning of task-relevant prompts. (Wang et al., 2022) focused
on prompt optimization for continual learning, explicitly managing task-specific knowledge while
maintaining plasticity. (Zhou et al., 2022) introduced conditional CoOp, which involves learning a
lightweight neural network to generate input-conditional tokens for each image. (Lu et al., 2022)
explored the generalization capabilities of prompt learning by learning the distribution of diverse
prompts. (Ge et al., 2022) addressed the challenge of distribution shifts in unsupervised domain
adaptation by learning both domain-agnostic and domain-specific prompts. However, the majority of
these works have primarily focused on image-level tasks, leaving the exploration of prompt learning
for video tasks relatively unexplored. In this paper, we propose a general learning paradigm to
investigate the effectiveness of prompt learning specifically for video understanding, with a specific
focus on action recognition in both ground/YouTube and aerial videos. Our goal is to bridge the gap
and extend the benefits of prompt learning to the domain of video understanding tasks.
3 Our Approach
The problem of video action recognition can be broadly classified into single-agent and multi-agent action recognition. Depending on the data type, it can also be divided into aerial video action recognition and ground camera action recognition. Typically, all of these involve several steps. Taking transformer-based methods as an example, the input video or image sequence is first processed to extract relevant features, such as movement patterns, appearance, or spatial-temporal information. These features are then fed into a reasoning model to infer the action label. Prompt learning can help the first step by improving feature extraction.
We denote the input as $X_i = \{x_1, x_2, ..., x_m\}, i \in [1, N]$, where $x_j$ is the $j$-th frame in the $i$-th video, $m$ is the total number of frames, and $N$ is the total number of videos. The overall approach predicts the action categories using a model $f(X_i)$, which can be a CNN or a Transformer. As shown in Figure 2, taking transformer-based methods as an example, we follow the same scheme to extract the features and then use a reasoning process to predict the action labels. We also present a prompt-learning-based encoder to better extract the features, and propose an auto-regressive temporal reasoning algorithm to enhance the recognition model's inference ability.
Specifically, given an action model:
$f(X_i) = f_a(f_e(X_i, P)),$  (1)
where $f_e$ is the prompt-learning-based input encoder, $P$ is the prompt, and $f_a$ is the auto-regressive temporal reasoning model, which operates along the temporal dimension.
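To make the decomposition in Eq. (1) concrete, below is a minimal PyTorch sketch of the two-stage pipeline; the module interfaces, tensor shapes, and the element-wise fusion of frames and prompts are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class PLARSketch(nn.Module):
    """Sketch of Eq. (1): f(X) = f_a(f_e(X, P)).

    `encoder` plays the role of the prompt-learning-based encoder f_e and
    `reasoner` the auto-regressive temporal reasoning model f_a; both are
    generic placeholders, not the actual backbones used in the experiments.
    """
    def __init__(self, encoder: nn.Module, reasoner: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.reasoner = reasoner

    def forward(self, frames: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # frames:  (B, T, C, H, W) video clip
        # prompts: (B, T, 1, H, W), e.g. optical-flow magnitude or SAM masks
        fused = frames * prompts          # simple element-wise fusion of [X, P]
        feats = self.encoder(fused)       # f_e: prompt-conditioned features
        return self.reasoner(feats)       # f_a: temporal reasoning -> class logits

# Toy stand-ins so the sketch runs end to end.
encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.LazyLinear(128))
reasoner = nn.Linear(128, 12)             # e.g. 12 action classes as in Okutama
model = PLARSketch(encoder, reasoner)
logits = model(torch.randn(2, 8, 3, 32, 32), torch.rand(2, 8, 1, 32, 32))
print(logits.shape)                       # torch.Size([2, 12])
```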
For the first part, the input encoder, we take inspiration from prompt-based techniques in NLP and present a new general prompt-learning-based input encoder for action recognition. Our formulation leverages the strengths of prompt learning to guide the learning process by providing high-level texture descriptions or instructions associated with actions in the inputs. We use this to alleviate the burden of model optimization by helping models better focus on the active region.
Prompts can enhance the model’s ability to process customized inputs by utilizing prompt tokens. Prompts make it easier for models to focus on targets of interest, and prompt learning enables the model to learn complex visual concepts and capture discriminative spatio-temporal patterns
effectively. Specifically, our prompts can be either predefined templates (non-learnable prompt:
optical flow, large vision models) or learnable tokens (learnable prompt) that include task-specific
information. They can be used either alone or in combination.
Figure 2: Overview: input features and prompt-derived masks (from a learnable prompt, optical flow, or a large model) are fed together with queries into a Transformer, and an auto-regressive predictor produces per-frame predictions.
Optical Flow Prompt Optical flow is a fundamental concept in computer vision that involves
estimating the motion of objects within a video sequence. It represents the apparent motion of pixels
between consecutive frames, providing valuable information about the movement of objects and their
relative velocities.
For frame $x_i$ and frame $x_j$, the optical flow is:
$o_{ij} = O(x_i, x_j), \quad x_i \in \mathrm{clip}_i, \; x_j \in \mathrm{clip}_j,$  (2)
where $\mathrm{clip}_i$ and $\mathrm{clip}_j$ are two adjacent video clips, and each clip contains several frames. This formulation is more efficient because it avoids computing optical flow for every individual frame pair. Therefore, the input with the optical flow prompt becomes:
$[X, P] = \{x_i * o_{ij} \mid i, j \in [1, m]\}.$  (3)
We use $[X, P]$ to replace the original $X$ in video action recognition.
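As an illustration of Eqs. (2)-(3), the sketch below estimates flow between one representative frame per adjacent clip and uses its magnitude to modulate the frames; OpenCV's Farneback estimator stands in for the unspecified flow operator $O$, and the choice of representative frames is an assumption.

```python
import cv2
import numpy as np

def clip_flow_prompt(clip_i: np.ndarray, clip_j: np.ndarray) -> np.ndarray:
    """Eq. (2): o_ij = O(x_i, x_j) with x_i, x_j drawn from adjacent clips.

    clip_i, clip_j: (T, H, W, 3) uint8 frames. Returns an (H, W) flow-magnitude
    map in [0, 1]; computing flow once per clip pair avoids per-frame flow cost.
    """
    g_i = cv2.cvtColor(clip_i[-1], cv2.COLOR_BGR2GRAY)  # last frame of clip_i
    g_j = cv2.cvtColor(clip_j[0], cv2.COLOR_BGR2GRAY)   # first frame of clip_j
    flow = cv2.calcOpticalFlowFarneback(g_i, g_j, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)                 # (H, W) motion magnitude
    return mag / (mag.max() + 1e-6)

def apply_flow_prompt(clip: np.ndarray, prompt: np.ndarray) -> np.ndarray:
    """Eq. (3): x_i * o_ij, modulating every frame in the clip by the flow prompt."""
    return clip.astype(np.float32) * prompt[None, :, :, None]
```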
Large Vision Model Prompt Recently, large models have been attracting increasing attention in NLP and other applications. These large models are considered powerful since they are trained on huge amounts of data and do not need to be finetuned on new tasks; instead, they can be guided by an auxiliary input (i.e., a prompt). Our goal is to use these large models to generate prompts (e.g., masks, bounding boxes) for video action recognition.
One popular work is the Segment Anything Model (SAM (Kirillov et al., 2023)), which can segment
any object in an image given only some prompts like a single click or box. SAM is trained on a
dataset of 11 million images and 1.1 billion masks. SAM can segment objects with high accuracy,
even when they are new or have been modified from the training data. SAM generalizes to new
objects and images without the need for additional training, so we don’t need to finetune the model
on our dataset. For some frames in a video clip, we generate a segmentation mask using a large vision model, SAM (Kirillov et al., 2023). Next, these masks are used as prompts and fused with the input frames to optimize the recognition model. Specifically, for frame $x_i$, the output from SAM is:
$p_i = \mathrm{SAM}(x_i, \mathrm{box/point}), \quad x_i \in \mathrm{clip}_i,$  (4)
where $\mathrm{clip}_i$ is a video clip containing a few frames. The input then becomes:
$[X, P] = \{x_i * p_i \mid i \in [1, m]\}.$  (5)
We use $[X, P]$ to replace the original $X$ in video action recognition.
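A minimal sketch of Eqs. (4)-(5) using the public segment_anything package; the checkpoint filename and the bounding-box prompt below are placeholders, and the element-wise fusion mirrors the formulation above rather than a prescribed pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def sam_mask_prompt(frame: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Eq. (4): p_i = SAM(x_i, box). frame: (H, W, 3) RGB uint8, box: [x0, y0, x1, y1]."""
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0].astype(np.float32)            # (H, W) binary actor mask

def apply_mask_prompt(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Eq. (5): x_i * p_i, emphasizing the segmented actor in the input frame."""
    return frame.astype(np.float32) * mask[:, :, None]
```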
3.1.2 Learnable Prompt
To better adapt to the input data, we also propose a learnable prompt, which learns to dynamically
generate prompts from a pool of prompt experts under different inputs. Prompt experts are learnable
parameters that can be updated from the training process. As shown in Figure 3, in our design, we
use input-invariant (prompt experts) and input-specific (data dependent) prompts. The input-invariant
prompts contain task information, and we use a dynamic mechanism to generate input-specific prompts for different inputs.
Figure 3: Learnable prompt: learning input-invariant (prompt experts) and input-specific (data
dependent) prompt knowledge. The input-invariant prompts will be updated from all the inputs,
which contain task information, and we use a dynamic mechanism to generate input-specific prompts
for different inputs. Add/Mul means element-wise operations.
There are different actions and domains (different video sources) across videos, so it is challenging to learn a single general prompt for all videos. Therefore, we design an input-invariant prompt experts pool, which contains $l$ learnable prompts:
$P = \{P_1, ..., P_l\},$  (6)
which is learnable and will be updated from all the inputs. For a specific input $X^*$,
$P^* = \mathrm{Matmul}(\sigma(\mathrm{FC}(X^*)), P).$  (7)
We first use an FC layer and sigmoid function to get dynamic weights. Then we apply these dynamic
weights to the input-invariant prompt pool to get a customized prompt $P^*$ for $X^*$.
$x^p_i = f_e([x_i, p_i]), \quad x_i \in X^*, \; p_i \in P^*,$  (8)
where $x^p_i$ is the prompt-based feature.
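A PyTorch sketch of Eqs. (6)-(8): a pool of $l$ learnable prompt experts, an FC-plus-sigmoid gate producing dynamic weights, and a matrix product yielding the customized prompt $P^*$; the feature dimensions and the way $P^*$ is fused with the input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnablePromptPool(nn.Module):
    """Eqs. (6)-(7): input-invariant prompt experts with input-specific gating."""
    def __init__(self, num_experts: int = 8, prompt_dim: int = 256, feat_dim: int = 256):
        super().__init__()
        # Eq. (6): P = {P_1, ..., P_l}, shared across all inputs (input-invariant).
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_dim) * 0.02)
        self.gate = nn.Linear(feat_dim, num_experts)  # FC layer producing expert weights

    def forward(self, x_feat: torch.Tensor) -> torch.Tensor:
        # x_feat: (B, feat_dim) pooled representation of the input X*.
        weights = torch.sigmoid(self.gate(x_feat))    # (B, l) dynamic, input-specific weights
        # Eq. (7): P* = Matmul(sigmoid(FC(X*)), P)
        return weights @ self.experts                 # (B, prompt_dim) customized prompt P*

# Eq. (8): the customized prompt is then fused with the frame features, e.g. by concatenation.
pool = LearnablePromptPool()
x_feat = torch.randn(4, 256)                          # batch of pooled input features
p_star = pool(x_feat)                                 # (4, 256) input-specific prompts P*
```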
Method Frame size Accuracy
AARN (Yang et al., 2019; Algamdi et al., 2020) crops 33.75%
Lite ECO (Zolfaghari et al., 2018; Algamdi et al., 2020) crops 36.25%
I3D(RGB)(Carreira and Zisserman, 2017; Algamdi et al., 2020) crops 38.12%
3DCapsNet-DR(ZHang et al., 2020; Algamdi et al., 2020) crops 39.37%
3DCapsNet-EM(ZHang et al., 2020; Algamdi et al., 2020) crops 41.87%
DroneCaps(Algamdi et al., 2020) crops 47.50%
DroneAttention without bbox(Yadav et al., 2023) 720x420 61.34%
PLAR without bbox (Ours) 224x224 71.54%
DroneAttention with bbox (Yadav et al., 2023) 720x420 72.76%
PLAR with bbox (Ours) 224x224 75.93%
Table 1: Comparison with the state-of-the-art results on the Okutama dataset. Without bbox information, we achieve a 10.20% improvement over the SOTA method. With bbox information, we outperform the SOTA by 3.17%. crops: from detection.
The supervision formats used for single-agent and multi-agent action recognition are different. As a result, we choose different loss functions. In particular, we use the classical cross-entropy loss for single-agent action recognition,
$L_n = - \sum_{c=1}^{C} y_{n,c} \log \frac{\exp \hat{x}^p_{n,c}}{\sum_{i=1}^{C} \exp \hat{x}^p_{n,i}},$  (10)
where $C$ is the number of classes, $n$ is the video index, $\hat{x}^p_{n,c}$ is PLAR's output feature, and $y$ is the label. For multi-agent action recognition on Okutama, we use BCEWithLogitsLoss,
$L_{n,c} = - \left[ y_{n,c} \cdot \log \sigma(\hat{x}^p_{n,c}) + (1 - y_{n,c}) \cdot \log\left(1 - \sigma(\hat{x}^p_{n,c})\right) \right],$  (11)
where $\hat{x}^p_{n,c}$ is PLAR's output feature and $\sigma$ is the sigmoid function. This loss combines a sigmoid function with BCELoss and is more numerically stable than applying a plain sigmoid followed by BCELoss, because fusing the operations into one layer takes advantage of the log-sum-exp trick. For both single-agent and multi-agent videos, by sharing the same objective,
our learning approach can optimize prompts that guide the model’s predictions while explicitly
learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge.
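In PyTorch terms, the two objectives correspond to nn.CrossEntropyLoss for the single-agent case (Eq. 10) and nn.BCEWithLogitsLoss for the multi-label, multi-agent case (Eq. 11); the batch size and class counts below are illustrative.

```python
import torch
import torch.nn as nn

num_classes = 12
logits = torch.randn(8, num_classes)              # PLAR output features for 8 clips

# Single-agent (e.g. Something Something V2): one class index per clip, Eq. (10).
single_agent_targets = torch.randint(0, num_classes, (8,))
loss_single = nn.CrossEntropyLoss()(logits, single_agent_targets)

# Multi-agent (Okutama): multi-hot targets, one sigmoid per class, Eq. (11).
# BCEWithLogitsLoss fuses the sigmoid with the binary cross-entropy using the
# log-sum-exp trick, which is more stable than sigmoid followed by BCELoss.
multi_agent_targets = torch.randint(0, 2, (8, num_classes)).float()
loss_multi = nn.BCEWithLogitsLoss()(logits, multi_agent_targets)
```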
4.1 Datasets
Okutama (Barekatain et al., 2017) The dataset contains diverse lighting, weather, and background
conditions, resembling real-world situations. It consists of 43 minute-long sequences with 12 action
classes, providing a challenge with dynamic action transitions, changing scales and aspect ratios,
camera movement, and multi-labeled actors. It features human-to-human interactions (hugging,
handshaking), human-to-object interactions (reading, drinking, pushing, pulling, carrying, calling),
and non-interactions (running, walking, lying, sitting, standing). The dataset showcases up to 9
actors sequentially performing a wide range of actions, making it highly challenging and engaging.
Something-something v2 (SSV2 (Goyal et al., 2017)) The SSV2 dataset is regarded as a sub-
stantial and comprehensive benchmark for action recognition, encompassing a vast collection of
220k action clips. This dataset places particular emphasis on action classes that revolve around the
dynamics of motion, showcasing various scenarios like "putting something into something, covering
something with something." In this context, the ability to capture and comprehend fine-grained
motion details becomes important, as it plays a vital role in attaining superior performance results.
Method pretrain Top-1 Acc. Top-5 Acc.
TEA (Li et al., 2020b) ImageNet 1k 65.1% 89.9%
MoViNet-A3 (Kondratyuk et al., 2021) N/A 64.1% 88.8%
ViT-B-TimeSformer (Bertasius et al., 2021) ImageNet 21k 62.5% /
SlowFast R101, 8×8 (Feichtenhofer et al., 2019) Kinetics400 63.1% 87.6%
MViTv1-B, 16×4 (Fan et al., 2021) Kinetics400 64.7% 89.2%
MViTv2-S, 16×4 (Li et al., 2022b) (train by us) Kinetics400 66.5% 90.6%
PLAR (Ours) Kinetics400 67.3% 91.0%
Table 2: Comparison with the state-of-the-art results on the Something Something V2. Our PLAR
improves 2.6% over MViTv1 and 0.8% over strong SOTA MViTv2.
4.2 Settings
All experiments are conducted on a desktop equipped with 8 Nvidia A5000 GPUs.
Okutama: For the multi-agent experiments, all the frames extracted from the video datasets were scaled to 224 × 224. The backbone is Swin-T (Liu et al., 2021). Following (Yadav et al., 2023), the obtained feature maps were processed with ROIAlign (crop size of 5 × 5) to get the desired ROIs. Then we used a fully connected layer to encode those features into classifier features. The one-hot encoded targets and classifier features were fed into the binary cross-entropy with logits loss. Other training settings follow (Liu et al., 2021).
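A hedged sketch of the ROI step above using torchvision's roi_align; the backbone stride (and hence spatial_scale), channel count, and box coordinates are assumptions for illustration.

```python
import torch
from torchvision.ops import roi_align

# One frame's backbone feature map; 7x7 assumes a stride-32 backbone on 224x224 input.
feats = torch.randn(1, 768, 7, 7)

# Actor boxes in input-image coordinates, format [batch_idx, x0, y0, x1, y1].
boxes = torch.tensor([[0.0, 10.0, 20.0, 120.0, 200.0],
                      [0.0, 90.0, 30.0, 180.0, 210.0]])

# Crop a 5x5 ROI per actor; spatial_scale maps image coordinates onto the feature map.
rois = roi_align(feats, boxes, output_size=(5, 5), spatial_scale=7 / 224)
print(rois.shape)  # torch.Size([2, 768, 5, 5]), then encoded by a fully connected layer
```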
Something-something v2: Following (Li et al., 2022b), we fine-tune the pre-trained Kinetics models.
Specifically, we train for 100 epochs using 8 GPUs with a batch size of 64 and a base learning rate
of 5e-5 with a cosine learning rate schedule. We use AdamW with a weight decay of 1e-4 and a drop path rate of 0.4. For other training and testing settings, we follow (Li et al., 2022b). The backbone is MViTv2-S (Li et al., 2022b).
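The fine-tuning recipe above maps onto a standard AdamW plus cosine schedule; a minimal sketch with dummy stand-ins for the model and data loader (in practice these are the prompt-augmented MViTv2-S and the SSV2 loader).

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs; in practice `model` is the prompt-augmented
# MViTv2-S backbone and `train_loader` yields Something-Something V2 clips.
model = nn.Linear(512, 174)                  # SSV2 has 174 action classes
train_loader = [(torch.randn(64, 512), torch.randint(0, 174, (64,))) for _ in range(2)]

epochs, base_lr, weight_decay = 100, 5e-5, 1e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(2):                       # shortened from the 100-epoch schedule
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                         # cosine learning-rate decay per epoch
```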
Okutama is an aerial multi-agent action recognition dataset in which multiple actors sequentially
perform a diverse set of actions, which makes it very challenging. In the real world, it’s difficult
to ensure that only a single agent is in the scene for action recognition. Therefore, multi-agent
action recognition is a very practical and important research direction. We compare our PLAR with
state-of-the-art (SOTA) works.
As shown in Table 1, without bbox information, we achieve a 10.20% improvement over the SOTA method. With bbox information, we outperform the SOTA by 3.17%. This demonstrates
the effectiveness of our method.
Something-something V2 is a challenging ground camera dataset for visual common sense because it
requires models to understand the relationships between objects and actions. For example, to predict
the category of a video, a model must understand that "something bounces a ball" is different from
"something rolls a ball". In addition, the model must simultaneously pay attention to temporal model-
ing. We evaluate our PLAR's reasoning and temporal modeling ability on Something-something V2.
As shown in Table 2, our PLAR improves 2.6% over MViTv1 and 0.8% over MViTv2, which
illustrates the effectiveness of our proposed prompt learning and Auto-regressive temporal modeling.
Figure 4: Large Vision Model: prompts from the large vision model, with no supervision needed. We visualize the outputs for different prompts (panels: Frames, Box prompt, 1 point, 2 points, Line prompt). The bbox and line prompts produce more stable outputs, which means better prompts result in better outputs.
First, we conducted ablation studies on various prompts, including optical flow, large vision models, and learnable prompts, to verify their effectiveness. We then visualize the output of the large vision model for different inputs to verify whether it focuses on the right target.
Different Prompts To evaluate the effectiveness of different prompts, various prompts, including optical flow, the large vision model (SAM (Kirillov et al., 2023)), and learnable prompts, are examined in this work. As shown in Table 3, the large vision model and the learnable prompt achieve better accuracy.
Visualization For the large vision model, we visualize the outputs in terms of different prompts,
including bbox, line, and points. As shown in Figure 4, bbox and line have more stable outputs,
which means better prompts result in better outputs.
We present a general prompt learning approach that uses various prompts, including optical flow,
a large vision model (SAM), and learnable prompts. Prompt learning alleviates the optimization
burden by providing high-level texture descriptions or instructions associated with actions. Our
proposed learnable prompt learns to dynamically generate prompts from a pool of prompt experts
under different inputs. Our objective is to optimize prompts that guide the model’s predictions
while explicitly learning input-invariant (prompt experts) and input-specific (data-dependent) prompt
knowledge. We observe good accuracy improvements on some challenging datasets. Overall, ours is among the first methods to explore the use of large vision models as prompts to instruct the action recognition task. Our results are very promising, and there is room to further improve the performance. Our approach has some limitations. The formulation does not use a unified prompt form for both CNNs and Transformers. We have only considered a few prompt formats, and the overall performance depends on this choice. In the future, we would like to design a unified prompt form for different architectures and integrate text prompts with our approach. We also need to evaluate the performance on more datasets.
References
Algamdi, A. M., Sanchez, V., and Li, C.-T. (2020). Dronecaps: Recognition of human actions in
drone videos using capsule networks with binary volume comparisons. In 2020 IEEE International
Conference on Image Processing (ICIP), pages 3174–3178. IEEE.
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. (2021). Vivit: A video
vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision,
pages 6836–6846.
Barekatain, M., Martí, M., Shih, H.-F., Murray, S., Nakayama, K., Matsuo, Y., and Prendinger, H.
(2017). Okutama-action: An aerial view video dataset for concurrent human action detection. In
Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages
28–35.
Bertasius, G., Wang, H., and Torresani, L. (2021). Is space-time attention all you need for video
understanding? In ICML, volume 2, page 4.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural
information processing systems, 33:1877–1901.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics
dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 6299–6308.
Chéron, G., Laptev, I., and Schmid, C. (2015). P-cnn: Pose-based cnn features for action recognition.
In Proceedings of the IEEE international conference on computer vision, pages 3218–3226.
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., and Feichtenhofer, C. (2021). Multiscale
vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 6824–6835.
Fayyaz, M., Bahrami, E., Diba, A., Noroozi, M., Adeli, E., Van Gool, L., and Gall, J. (2021). 3d
cnns with adaptive temporal feature resolutions. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 4731–4740.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019). Slowfast networks for video recognition. In
Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211.
Ge, C., Huang, R., Xie, M., Lai, Z., Song, S., Li, S., and Huang, G. (2022). Domain adaptation via
prompt learning. arXiv preprint arXiv:2202.06687.
Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., and Russell, B. (2017). Actionvlad: Learning spatio-
temporal aggregation for action classification. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 971–980.
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel,
V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. (2017). The "something something"
video database for learning and evaluating visual common sense. In Proceedings of the IEEE
international conference on computer vision, pages 5842–5850.
Han, X., Zhao, W., Ding, N., Liu, Z., and Sun, M. (2022). Ptr: Prompt tuning with rules for text
classification. AI Open, 3:182–192.
Hussein, N., Gavves, E., and Smeulders, A. W. (2019). Timeception for complex action recognition.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
254–263.
Ji, S., Xu, W., Yang, M., and Yu, K. (2012). 3d convolutional neural networks for human action
recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231.
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T.
(2021). Scaling up visual and vision-language representation learning with noisy text supervision.
In International Conference on Machine Learning, pages 4904–4916. PMLR.
Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. (2020). How can we know what language models
know? Transactions of the Association for Computational Linguistics, 8:423–438.
Ju, C., Han, T., Zheng, K., Zhang, Y., and Xie, W. (2022). Prompting visual-language models for
efficient video understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 105–124. Springer.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale
video classification with convolutional neural networks. In Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, pages 1725–1732.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg,
A. C., Lo, W.-Y., et al. (2023). Segment anything. arXiv preprint arXiv:2304.02643.
Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M., Brown, M., and Gong, B. (2021). Movinets:
Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 16020–16030.
Kothandaraman, D., Guan, T., Wang, X., Hu, S., Lin, M.-S., and Manocha, D. (2022). Far: Fourier
aerial video recognition. In European Conference on Computer Vision.
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt
tuning. arXiv preprint arXiv:2104.08691.
Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., and Lu, J. (2022a). Bridge-prompt: Towards
ordinal action understanding in instructional videos. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 19880–19889.
Li, X., Shuai, B., and Tighe, J. (2020a). Directional temporal modeling for action recognition. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part VI 16, pages 275–291. Springer.
Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
abs/2101.00190.
Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L. (2020b). Tea: Temporal excitation and
aggregation for action recognition. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 909–918.
Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., and Feichtenhofer, C. (2022b).
Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict:
A systematic survey of prompting methods in natural language processing. ACM Computing
Surveys, 55(9):1–35.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin trans-
former: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 10012–10022.
Lu, Y., Liu, J., Zhang, Y., Liu, Y., and Tian, X. (2022). Prompt distribution learning. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215.
Mazzia, V., Angarano, S., Salvetti, F., Angelini, F., and Chiaberge, M. (2022). Action transformer:
A self-attention model for short-time pose-based human action recognition. Pattern Recognition,
124:108487.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. (2019).
Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
Piergiovanni, A. and Ryoo, M. S. (2019). Representation flow for action recognition. In Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition, pages 9945–9953.
Piergiovanni, A. and Ryoo, M. S. (2021). Recognizing actions in videos from unseen viewpoints. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
4124–4132.
Poerner, N., Waltinger, U., and Schütze, H. (2019). E-bert: Efficient-yet-effective entity embeddings
for bert. In Findings.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A.,
Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models
from natural language supervision. In International Conference on Machine Learning.
Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., and Lu, J. (2022). Denseclip:
Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 18082–18091.
Sánchez, J., Perronnin, F., Mensink, T., and Verbeek, J. (2013). Image classification with the fisher
vector: Theory and practice. International journal of computer vision, 105:222–245.
Sato, F., Hachiuma, R., and Sekii, T. (2023). Prompt-guided zero-shot anomaly action recognition
using pretrained deep skeleton features. arXiv preprint arXiv:2303.15167.
Shi, Y., Wu, X., and Lin, H. (2022). Knowledge prompting for few-shot action recognition. arXiv
preprint arXiv:2211.12030.
Shin, T., Razeghi, Y., IV, R. L. L., Wallace, E., and Singh, S. (2020). Eliciting knowledge from
language models using automatically generated prompts. ArXiv, abs/2010.15980.
Simonyan, K. and Zisserman, A. (2014). Two-stream convolutional networks for action recognition
in videos. Advances in neural information processing systems, 27.
Sun, L., Jia, K., Chen, K., Yeung, D.-Y., Shi, B. E., and Savarese, S. (2017). Lattice long short-term
memory for human action recognition. In Proceedings of the IEEE international conference on
computer vision, pages 2147–2156.
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and Isola, P. (2020). Rethinking few-shot
image classification: a good embedding is all you need? In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages
266–282. Springer.
Tong, Z., Song, Y., Wang, J., and Wang, L. (2022). Videomae: Masked autoencoders are data-efficient
learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal
features with 3d convolutional networks. In Proceedings of the IEEE international conference on
computer vision, pages 4489–4497.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Wang, L., Qiao, Y., and Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional
descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4305–4314.
Wang, M., Xing, J., and Liu, Y. (2021a). Actionclip: A new paradigm for video action recognition.
arXiv preprint arXiv:2109.08472.
Wang, X., Xian, R., Guan, T., de Melo, C. M., Nogar, S. M., Bera, A., and Manocha, D. (2023).
Aztr: Aerial video action recognition with auto zoom and temporal reasoning. arXiv preprint
arXiv:2303.01589.
Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C., and Sang, N. (2021b). Oadtr: Online
action detection with transformers. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7565–7575.
Wang, Z., Zhang, Z., Lee, C.-Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., and Pfister, T.
(2022). Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 139–149.
Xian, R., Wang, X., Kothandaraman, D., and Manocha, D. (2023a). Pmi sampler: Patch similarity
guided frame selection for aerial action recognition. arXiv preprint arXiv:2304.06866.
Xian, R., Wang, X., and Manocha, D. (2023b). Mitfas: Mutual information based temporal feature
alignment and sampling for aerial video action recognition. arXiv preprint arXiv:2303.02575.
Yadav, S. K., Luthra, A., Pahwa, E., Tiwari, K., Rathore, H., Pandey, H. M., and Corcoran, P. (2023).
Droneattention: Sparse weighted temporal attention for drone-camera based activity recognition.
Neural Networks, 159:57–69.
Yang, C., Xu, Y., Shi, J., Dai, B., and Zhou, B. (2020). Temporal pyramid network for action recog-
nition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 591–600.
Yang, F., Sakti, S., Wu, Y., and Nakamura, S. (2019). A framework for knowing who is doing what
in aerial surveillance videos. IEEE Access, 7:93315–93325.
Yang, J., Dong, X., Liu, L., Zhang, C., Shen, J., and Yu, D. (2022). Recurring the transformer for
video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 14063–14073.
Zhang, H., Zhang, L., Qi, X., Li, H., Torr, P. H., and Koniusz, P. (2020). Few-shot action recognition
with permutation-invariant attention. In Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 525–542. Springer.
ZHang, P., Wei, P., and Han, S. (2020). Capsnets algorithm. In Journal of Physics: Conference
Series, volume 1544, page 012030. IOP Publishing.
Zhong, Z., Friedman, D., and Chen, D. (2021). Factual probing is [mask]: Learning vs. learning to
recall. arXiv preprint arXiv:2104.05240.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2021). Learning to prompt for vision-language models.
International Journal of Computer Vision, 130:2337 – 2348.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022). Conditional prompt learning for vision-language
models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 16816–16825.
Zhou, Y., Sun, X., Luo, C., Zha, Z.-J., and Zeng, W. (2020). Spatiotemporal fusion in 3d cnns: A
probabilistic view. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 9829–9838.
Zolfaghari, M., Singh, K., and Brox, T. (2018). Eco: Efficient convolutional network for online video
understanding. In Proceedings of the European conference on computer vision (ECCV), pages
695–712.
Zong, M., Wang, R., Chen, X., Chen, Z., and Gong, Y. (2021). Motion saliency based multi-stream
multiplier resnets for action recognition. Image and Vision Computing, 107:104108.