Prompt Learning For Action Recognition
Abstract
We present a new general learning approach for action recognition, Prompt Learn-
ing for Action Recognition (PLAR), which leverages the strengths of prompt
learning to guide the learning process. Our approach is designed to predict the
action label by helping the models focus on the descriptions or instructions as-
sociated with actions in the input videos. Our formulation uses various prompts,
including optical flow, large vision models, and learnable prompts to improve
the recognition performance. Moreover, we propose a learnable prompt method
that learns to dynamically generate prompts from a pool of prompt experts under
different inputs. By sharing the same objective, our proposed PLAR can optimize
prompts that guide the model’s predictions while explicitly learning input-invariant
(prompt experts pool) and input-specific (data-dependent) prompt knowledge. We
evaluate our approach on datasets consisting of both ground camera videos and
aerial videos, and on scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset, Okutama, and a 0.8-2.6% improvement on the ground camera single-agent dataset, Something Something V2. We plan to release our code on the WWW.
1 Introduction
Action recognition, the task of understanding human activities from video sequences, is a fundamental
problem in computer vision. This problem arises in many applications: video surveillance, human-
computer interaction, sports analysis, human-robot interaction, etc. There has been considerable
work on these problems in recent years, driven by the availability of large-scale video datasets and
advancements in deep learning techniques including two-stream Convolutional Neural Network
(CNN) (Simonyan and Zisserman, 2014), Recurrent Neural Network (RNN) (Sun et al., 2017), and
Transformer-based methods (Vaswani et al., 2017; Li et al., 2022b). These methods have achieved
considerable success in extracting discriminative features from video sequences, leading to significant
improvements in action recognition accuracy for ground videos and aerial videos.
Despite the recent progress in action recognition algorithms, the success of most existing approaches
relies on extensive labeled training data followed by a purely supervised learning paradigm that
mainly focuses on a backbone architecture design. In this paper, our goal is to design new methods for
video action recognition using prompt learning. Prompt-based techniques (Liu et al., 2023) have been
proposed for natural language processing tasks to circumvent the issue of lack of labeled data. These
learning methods use language models that estimate the probability of text and use this probability to predict the label, thereby reducing or obviating the need for large labeled datasets. In the context of
action recognition, prompt learning offers the potential to design better optimization strategies by
providing high-level texture descriptions or instructions associated with actions. These prompts can
guide the learning process, and enable the model to capture discriminative spatio-temporal patterns
effectively, resulting in better performance.
* These authors contributed equally to this work.
Figure 1: Task: We use prompt learning for action recognition. Our method leverages the strengths
of prompt learning to guide the learning process by helping models better focus on the descriptions or
instructions associated with actions in the input videos. We explore various prompts, including optical
flow, large vision models, and proposed learnable prompts to improve recognition performance. The
recognition models can be CNNs or Transformers.
Many prompt learning-based techniques have been proposed for few-shot action recognition (Shi
et al., 2022), zero-shot action recognition (Sato et al., 2023; Wang et al., 2021a), and ordinal action
understanding. (Shi et al., 2022) proposes knowledge prompting, which leverages commonsense
knowledge of actions from external resources to prompt a powerful pre-trained vision-language
model for few-shot classification. (Sato et al., 2023) presents a unified, user prompt-guided zero-shot
learning framework using a target domain-independent skeleton feature extractor, which is pretrained
on a large-scale action recognition dataset. Bridge-Prompt (Li et al., 2022a) proposes a prompt-
based framework to model the semantics across adjacent actions from a series of ordinal actions in
instructional videos. Our goal is to apply these techniques to video action recognition.
Main Contributions: We present a general prompt learning approach that alleviates the burden of
objective optimization by integrating prompt-based learning into the action recognition pipeline. Our
approach is designed to enhance the model’s ability to process customized inputs by utilizing prompt
tokens. These prompt tokens can be either predefined templates or learnable tokens that include
information specific to video action recognition. Our formulation leverages prompts, which make it easier for the model to focus on targets of interest and enable the learning of complex visual concepts.
In our prompt learning paradigm, we explore and discuss various types of prompts, including optical flow and large vision models. In addition, we present a learnable prompt, which dynamically generates
prompts from a pool of prompt experts under different inputs. Our goal is to optimize prompts
that guide the model’s predictions while explicitly learning input-invariant (prompt experts) and
input-specific (data-dependent) prompt knowledge. We validate this generalization by performing
extensive evaluations on datasets comprising ground camera videos and aerial videos, on scenarios
involving single-agent and multi-agent actions. We demonstrate that our technique can improve
the performance and enhance the generalization capabilities of video action recognition models in
different scenarios. The novel components of our work include:
1. We present a general learning approach that uses prompt learning and auto-regressive techniques for action recognition.
2. We propose a new learnable prompt method, which guides the model’s predictions while explicitly learning input-invariant (prompt experts) and input-specific (data-dependent) prompt knowledge.
3. To the best of our knowledge, ours is the first approach to explore the possibility of using large vision models as prompts to instruct the action recognition task.
4. Through empirical evaluations, we demonstrate the potential and effectiveness of prompt
learning techniques for action recognition tasks. Specifically, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset, Okutama. Moreover, we observe a
0.8-2.6% accuracy improvement on the ground camera single-agent video dataset, Something
Something V2 (Goyal et al., 2017).
2 Related Works
Human action recognition, i.e., recognizing and understanding human actions, is crucial for a number
of real-world applications. Recently, many deep learning architectures have been proposed to improve
the performance. At a broad level, they can be classified into three categories:
Two-stream 2D Convolutional Neural Network: The use of two-stream 2D Convolutional Neural
Networks (CNNs) has been widely explored in the field of action recognition (Simonyan and
Zisserman, 2014; Karpathy et al., 2014; Wang et al., 2015; Sánchez et al., 2013; Chéron et al., 2015;
Girdhar et al., 2017). They capture spatial and temporal information separately to process video data.
(Zong et al., 2021) extended the two-stream CNN to a three-stream CNN by incorporating a motion
saliency stream to enhance the representation of salient motion information. (Piergiovanni and Ryoo,
2019) proposed a trainable flow layer that eliminates the need for optical flow computation while
capturing motion information.
3D CNN-based methods: Several methods in the literature have utilized 3D Convolutional Neural
Networks (CNNs) for action recognition (Tran et al., 2015; Ji et al., 2012; Zhang et al., 2020; Li et al.,
2020a; Carreira and Zisserman, 2017). These approaches jointly leverage the spatio-temporal characteristics of video data through 3D convolutions. (Zhou et al., 2020) analyzed
the spatio-temporal fusion in 3D CNN from a probabilistic view. (Yang et al., 2020) introduced a
generic Temporal Pyramid Network (TPN) to effectively model speed variations in actions at the
feature level. (Piergiovanni and Ryoo, 2021) investigated the impact of viewpoint variations on
recognition performance. (Hussein et al., 2019) proposed multi-scale temporal-only convolutions to
handle large variations in temporal extents within complex actions. (Fayyaz et al., 2021) addressed
the computational cost by dynamically adapting the temporal feature resolution within 3D CNNs.
Transformer-based approaches: Many techniques have been proposed for transformer-based action recognition (Arnab et al., 2021; Bertasius et al., 2021; Wang et al., 2021b). (Tong et al., 2022)
introduced a Video Masked Autoencoder (Video MAE) with an attention mechanism to address
the challenges of information redundancy and temporal correlation in video data, improving the
efficiency of action recognition. (Yang et al., 2022) proposed an attention gate that facilitates
interactions between frame inputs and hidden states, enabling the aggregation of global inter-frame
features across the temporal domain through recurrent execution. (Mazzia et al., 2022) leveraged 2D
pose representations within a fully self-attentional architecture over short temporal periods, enabling
low latency and high throughput in recognition tasks. (Li et al., 2022b) implemented the seminal idea
of multiscale feature hierarchies with transformer models for video and image recognition.
Although these methods have had good success on ground data and YouTube videos, they cannot achieve a similar level of accuracy on videos captured using Unmanned Aerial Vehicles (UAVs) (Wang et al., 2023; Xian et al., 2023a). Compared to ground or YouTube videos, UAV videos have unique characteristics such as low resolution, scale and size variations, and moving cameras.
(Xian et al., 2023b) proposed a mutual information-based feature alignment and sampling method
to extract spatial-temporal features corresponding to human actors for better recognition accuracy.
(Kothandaraman et al., 2022) introduced Fourier transformation into attention modules to aggregate
the motion salience.
The concept of prompt learning was introduced by (Petroni et al., 2019) and has since gained
significant attention in the field of Natural Language Processing (NLP) (Brown et al., 2020; Jiang
et al., 2020; Li and Liang, 2021; Liu et al., 2023; Tian et al., 2020). The fundamental idea behind
prompt learning is to treat pre-trained language models like BERT or GPT as knowledge bases,
enabling their application in downstream tasks. Early studies, such as those by (Petroni et al., 2019;
Poerner et al., 2019), focused on designing manually crafted prompts to enhance the performance
of language models. Subsequently, researchers like (Shin et al., 2020; Jiang et al., 2020) aimed to
automate this process using cost-effective, data-driven approaches. More recently, some works (Han
et al., 2022; Lester et al., 2021; Zhong et al., 2021) have attempted to learn continuous prompts
instead of searching for discrete prompts.
While prompt learning has garnered significant attention in the field of Natural Language Processing
(NLP), its application in computer vision is still a relatively new and emerging research direction.
Only recently have researchers started exploring prompt learning techniques in computer vision
tasks (Rao et al., 2022; Ju et al., 2022; Zhou et al., 2021). Pretrained VLMs (Jia et al., 2021; Radford
et al., 2021) utilize manual prompts for zero-shot inference on the downstream tasks. (Zhou et al.,
2021) proposed Context Optimization (CoOp) to extend continuous prompt representations to the
vision domain, enabling the automatic learning of task-relevant prompts. (Wang et al., 2022) focused
on prompt optimization for continual learning, explicitly managing task-specific knowledge while
maintaining plasticity. (Zhou et al., 2022) introduced conditional CoOp, which involves learning a
lightweight neural network to generate input-conditional tokens for each image. (Lu et al., 2022)
explored the generalization capabilities of prompt learning by learning the distribution of diverse
prompts. (Ge et al., 2022) addressed the challenge of distribution shifts in unsupervised domain
adaptation by learning both domain-agnostic and domain-specific prompts. However, the majority of
these works have primarily focused on image-level tasks, leaving the exploration of prompt learning
for video tasks relatively unexplored. In this paper, we propose a general learning paradigm to
investigate the effectiveness of prompt learning specifically for video understanding, with a specific
focus on action recognition in both ground/YouTube and aerial videos. Our goal is to bridge the gap
and extend the benefits of prompt learning to the domain of video understanding tasks.
3 Our Approach
The problem of video action recognition can be broadly classified into single-agent and multi-agent action recognition. Depending on the data type, it can also be divided into aerial video action recognition and ground camera action recognition. Typically, all of these involve several steps. Taking transformer-based methods as an example, the input video or image sequence is first processed to extract relevant features, such as movement patterns, appearance, or spatial-temporal information. These features are then fed into a reasoning model to infer the action label. Prompt learning can help the first step by improving feature extraction.
We denote the input as $X_i = \{x_1, x_2, ..., x_m\}, i \in [1, N]$, where $x_j$ is the $j$-th frame in the $i$-th video, $m$ is the total number of frames, and $N$ is the total number of videos. The overall approach predicts the action categories using a model $f(X_i)$, which can be a CNN or a Transformer. As shown in Figure 2, taking transformer-based methods as an example, we follow the same scheme to extract the features and then use a reasoning process to predict the action labels. We also present a prompt-learning-based encoder to better extract the features, and propose an auto-regressive temporal reasoning algorithm to enhance the recognition model's inference ability.
Specifically, given an action model:
$f(X_i) = f_a(f_e(X_i, P)),$  (1)
where $f_e$ is the prompt-learning-based input encoder, $P$ is the prompt, and $f_a$ is the auto-regressive temporal reasoning model, which operates along the temporal dimension.
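To make the decomposition in Eq. (1) concrete, below is a minimal PyTorch sketch of the two-stage pipeline; the module interfaces, tensor shapes, and the element-wise fusion of frames and prompts are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class PLARSketch(nn.Module):
    """Sketch of Eq. (1): f(X) = f_a(f_e(X, P)).

    `encoder` plays the role of the prompt-learning-based encoder f_e and
    `reasoner` the auto-regressive temporal reasoning model f_a; both are
    generic placeholders, not the actual backbones used in the experiments.
    """
    def __init__(self, encoder: nn.Module, reasoner: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.reasoner = reasoner

    def forward(self, frames: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # frames:  (B, T, C, H, W) video clip
        # prompts: (B, T, 1, H, W), e.g. optical-flow magnitude or SAM masks
        fused = frames * prompts          # simple element-wise fusion of [X, P]
        feats = self.encoder(fused)       # f_e: prompt-conditioned features
        return self.reasoner(feats)       # f_a: temporal reasoning -> class logits

# Toy stand-ins so the sketch runs end to end.
encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.LazyLinear(128))
reasoner = nn.Linear(128, 12)             # e.g. 12 action classes as in Okutama
model = PLARSketch(encoder, reasoner)
logits = model(torch.randn(2, 8, 3, 32, 32), torch.rand(2, 8, 1, 32, 32))
print(logits.shape)                       # torch.Size([2, 12])
```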
For the first part, the input encoder, we take inspiration from prompt-based techniques in NLP and present a new general prompt-learning-based input encoder for action recognition. Our formulation leverages the strengths of prompt learning to guide the learning process by providing high-level texture descriptions or instructions associated with actions in the inputs. We use this to alleviate the burden of model optimization by helping models better focus on the active region.
Prompts can enhance the model’s ability to process customized inputs by utilizing prompt tokens. Prompts make it easier for models to focus on targets of interest, and prompt learning enables the model to learn complex visual concepts and capture discriminative spatio-temporal patterns
effectively. Specifically, our prompts can be either predefined templates (non-learnable prompt:
optical flow, large vision models) or learnable tokens (learnable prompt) that include task-specific
information. They can be used either alone or in combination.
Figure 2: Overview: input features and prompt-derived masks (from a learnable prompt, optical flow, or a large model) are fed together with queries into a Transformer, and an auto-regressive predictor produces per-frame predictions.
Optical Flow Prompt Optical flow is a fundamental concept in computer vision that involves
estimating the motion of objects within a video sequence. It represents the apparent motion of pixels
between consecutive frames, providing valuable information about the movement of objects and their
relative velocities.
For frame $x_i$ and frame $x_j$, the optical flow is:
$o_{ij} = O(x_i, x_j), \quad x_i \in \mathrm{clip}_i, \; x_j \in \mathrm{clip}_j,$  (2)
where $\mathrm{clip}_i$ and $\mathrm{clip}_j$ are two adjacent video clips, and each clip contains several frames. This formulation is more efficient because it avoids computing optical flow for every individual frame pair. Therefore, the input with the optical flow prompt becomes:
$[X, P] = \{x_i * o_{ij} \mid i, j \in [1, m]\}.$  (3)
We use $[X, P]$ to replace the original $X$ in video action recognition.
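As an illustration of Eqs. (2)-(3), the sketch below estimates flow between one representative frame per adjacent clip and uses its magnitude to modulate the frames; OpenCV's Farneback estimator stands in for the unspecified flow operator $O$, and the choice of representative frames is an assumption.

```python
import cv2
import numpy as np

def clip_flow_prompt(clip_i: np.ndarray, clip_j: np.ndarray) -> np.ndarray:
    """Eq. (2): o_ij = O(x_i, x_j) with x_i, x_j drawn from adjacent clips.

    clip_i, clip_j: (T, H, W, 3) uint8 frames. Returns an (H, W) flow-magnitude
    map in [0, 1]; computing flow once per clip pair avoids per-frame flow cost.
    """
    g_i = cv2.cvtColor(clip_i[-1], cv2.COLOR_BGR2GRAY)  # last frame of clip_i
    g_j = cv2.cvtColor(clip_j[0], cv2.COLOR_BGR2GRAY)   # first frame of clip_j
    flow = cv2.calcOpticalFlowFarneback(g_i, g_j, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)                 # (H, W) motion magnitude
    return mag / (mag.max() + 1e-6)

def apply_flow_prompt(clip: np.ndarray, prompt: np.ndarray) -> np.ndarray:
    """Eq. (3): x_i * o_ij, modulating every frame in the clip by the flow prompt."""
    return clip.astype(np.float32) * prompt[None, :, :, None]
```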
Large Vision Model Prompt Recently, large models have been attracting increasing attention in NLP and other applications. These large models are considered powerful since they are trained on huge amounts of data and do not need to be finetuned on new tasks; instead, they can be guided by an auxiliary input (i.e., a prompt). Our goal is to use these large models to generate prompts (e.g., masks, bounding boxes) for video action recognition.
One popular work is the Segment Anything Model (SAM (Kirillov et al., 2023)), which can segment
any object in an image given only some prompts like a single click or box. SAM is trained on a
dataset of 11 million images and 1.1 billion masks. SAM can segment objects with high accuracy,
even when they are new or have been modified from the training data. SAM generalizes to new
objects and images without the need for additional training, so we don’t need to finetune the model
on our dataset. For some frames in a video clip, we generate a segmentation mask using a large vision model, SAM (Kirillov et al., 2023). Next, these masks are used as prompts and fused with the input frames to optimize the recognition model. Specifically, for frame $x_i$, the output from SAM is:
$p_i = \mathrm{SAM}(x_i, \mathrm{box/point}), \quad x_i \in \mathrm{clip}_i,$  (4)
where $\mathrm{clip}_i$ is a video clip containing a few frames. The input then becomes:
$[X, P] = \{x_i * p_i \mid i \in [1, m]\}.$  (5)
We use $[X, P]$ to replace the original $X$ in video action recognition.
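A minimal sketch of Eqs. (4)-(5) using the public segment_anything package; the checkpoint filename and the bounding-box prompt below are placeholders, and the element-wise fusion mirrors the formulation above rather than a prescribed pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def sam_mask_prompt(frame: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Eq. (4): p_i = SAM(x_i, box). frame: (H, W, 3) RGB uint8, box: [x0, y0, x1, y1]."""
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0].astype(np.float32)            # (H, W) binary actor mask

def apply_mask_prompt(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Eq. (5): x_i * p_i, emphasizing the segmented actor in the input frame."""
    return frame.astype(np.float32) * mask[:, :, None]
```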
3.1.2 Learnable Prompt
To better adapt to the input data, we also propose a learnable prompt, which learns to dynamically
generate prompts from a pool of prompt experts under different inputs. Prompt experts are learnable
parameters that can be updated from the training process. As shown in Figure 3, in our design, we
use input-invariant (prompt experts) and input-specific (data dependent) prompts. The input-invariant
prompts contain task information, and we use a dynamic mechanism to generate input-specific prompts for different inputs.
Figure 3: Learnable prompt: learning input-invariant (prompt experts) and input-specific (data
dependent) prompt knowledge. The input-invariant prompts will be updated from all the inputs,
which contain task information, and we use a dynamic mechanism to generate input-specific prompts
for different inputs. Add/Mul means element-wise operations.
There are different actions and domains (different video sources) across videos, so it is challenging to learn a single general prompt for all videos. Therefore, we design an input-invariant prompt experts pool, which contains $l$ learnable prompts:
$P = \{P_1, ..., P_l\},$  (6)
which is learnable and will be updated from all the inputs. For a specific input $X^*$,
$P^* = \mathrm{Matmul}(\sigma(\mathrm{FC}(X^*)), P).$  (7)
We first use an FC layer and sigmoid function to get dynamic weights. Then we apply these dynamic
weights to the input-invariant prompt pool to get a customized prompt $P^*$ for $X^*$.
$x^p_i = f_e([x_i, p_i]), \quad x_i \in X^*, \; p_i \in P^*,$  (8)
where $x^p_i$ is the prompt-based feature.
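A PyTorch sketch of Eqs. (6)-(8): a pool of $l$ learnable prompt experts, an FC-plus-sigmoid gate producing dynamic weights, and a matrix product yielding the customized prompt $P^*$; the feature dimensions and the way $P^*$ is fused with the input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnablePromptPool(nn.Module):
    """Eqs. (6)-(7): input-invariant prompt experts with input-specific gating."""
    def __init__(self, num_experts: int = 8, prompt_dim: int = 256, feat_dim: int = 256):
        super().__init__()
        # Eq. (6): P = {P_1, ..., P_l}, shared across all inputs (input-invariant).
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_dim) * 0.02)
        self.gate = nn.Linear(feat_dim, num_experts)  # FC layer producing expert weights

    def forward(self, x_feat: torch.Tensor) -> torch.Tensor:
        # x_feat: (B, feat_dim) pooled representation of the input X*.
        weights = torch.sigmoid(self.gate(x_feat))    # (B, l) dynamic, input-specific weights
        # Eq. (7): P* = Matmul(sigmoid(FC(X*)), P)
        return weights @ self.experts                 # (B, prompt_dim) customized prompt P*

# Eq. (8): the customized prompt is then fused with the frame features, e.g. by concatenation.
pool = LearnablePromptPool()
x_feat = torch.randn(4, 256)                          # batch of pooled input features
p_star = pool(x_feat)                                 # (4, 256) input-specific prompts P*
```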
Method Frame size Accuracy
AARN (Yang et al., 2019; Algamdi et al., 2020) crops 33.75%
Lite ECO (Zolfaghari et al., 2018; Algamdi et al., 2020) crops 36.25%
I3D(RGB)(Carreira and Zisserman, 2017; Algamdi et al., 2020) crops 38.12%
3DCapsNet-DR(ZHang et al., 2020; Algamdi et al., 2020) crops 39.37%
3DCapsNet-EM(ZHang et al., 2020; Algamdi et al., 2020) crops 41.87%
DroneCaps(Algamdi et al., 2020) crops 47.50%
DroneAttention without bbox(Yadav et al., 2023) 720x420 61.34%
PLAR without bbox (Ours) 224x224 71.54%
DroneAttention with bbox (Yadav et al., 2023) 720x420 72.76%
PLAR with bbox (Ours) 224x224 75.93%
Table 1: Comparison with the state-of-the-art results on the Okutama dataset. Without bbox information, we achieve a 10.20% improvement over the SOTA method. With bbox information, we outperform the SOTA by 3.17%. crops: from detection.
The supervision formats used for single-agent and multi-agent action recognition are different. As a result, we choose different loss functions. In particular, we use the classical cross-entropy loss for single-agent action recognition,
$L_n = - \sum_{c=1}^{C} y_{n,c} \log \frac{\exp \hat{x}^p_{n,c}}{\sum_{i=1}^{C} \exp \hat{x}^p_{n,i}},$  (10)
where $C$ is the number of classes, $n$ is the video index, $\hat{x}^p_{n,c}$ is PLAR's output feature, and $y$ is the label. For multi-agent action recognition on Okutama, we use BCEWithLogitsLoss,
$L_{n,c} = - \left[ y_{n,c} \cdot \log \sigma(\hat{x}^p_{n,c}) + (1 - y_{n,c}) \cdot \log\left(1 - \sigma(\hat{x}^p_{n,c})\right) \right],$  (11)
where $\hat{x}^p_{n,c}$ is PLAR's output feature and $\sigma$ is the sigmoid function. This loss combines a sigmoid function with BCELoss and is more numerically stable than applying a plain sigmoid followed by BCELoss, because fusing the operations into one layer takes advantage of the log-sum-exp trick. For both single-agent and multi-agent videos, by sharing the same objective,
our learning approach can optimize prompts that guide the model’s predictions while explicitly
learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge.
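In PyTorch terms, the two objectives correspond to nn.CrossEntropyLoss for the single-agent case (Eq. 10) and nn.BCEWithLogitsLoss for the multi-label, multi-agent case (Eq. 11); the batch size and class counts below are illustrative.

```python
import torch
import torch.nn as nn

num_classes = 12
logits = torch.randn(8, num_classes)              # PLAR output features for 8 clips

# Single-agent (e.g. Something Something V2): one class index per clip, Eq. (10).
single_agent_targets = torch.randint(0, num_classes, (8,))
loss_single = nn.CrossEntropyLoss()(logits, single_agent_targets)

# Multi-agent (Okutama): multi-hot targets, one sigmoid per class, Eq. (11).
# BCEWithLogitsLoss fuses the sigmoid with the binary cross-entropy using the
# log-sum-exp trick, which is more stable than sigmoid followed by BCELoss.
multi_agent_targets = torch.randint(0, 2, (8, num_classes)).float()
loss_multi = nn.BCEWithLogitsLoss()(logits, multi_agent_targets)
```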
4.1 Datasets
Okutama (Barekatain et al., 2017) The dataset contains diverse lighting, weather, and background
conditions, resembling real-world situations. It consists of 43 minute-long sequences with 12 action
classes, providing a challenge with dynamic action transitions, changing scales and aspect ratios,
camera movement, and multi-labeled actors. It features human-to-human interactions (hugging,
handshaking), human-to-object interactions (reading, drinking, pushing, pulling, carrying, calling),
and non-interactions (running, walking, lying, sitting, standing). The dataset showcases up to 9
actors sequentially performing a wide range of actions, making it highly challenging and engaging.
Something-something v2 (SSV2 (Goyal et al., 2017)) The SSV2 dataset is regarded as a sub-
stantial and comprehensive benchmark for action recognition, encompassing a vast collection of
220k action clips. This dataset places particular emphasis on action classes that revolve around the
dynamics of motion, showcasing various scenarios like "putting something into something, covering
something with something." In this context, the ability to capture and comprehend fine-grained
motion details becomes important, as it plays a vital role in attaining superior performance results.
Method pretrain Top-1 Acc. Top-5 Acc.
TEA (Li et al., 2020b) ImageNet 1k 65.1% 89.9%
MoViNet-A3 (Kondratyuk et al., 2021) N/A 64.1% 88.8%
ViT-B-TimeSformer (Bertasius et al., 2021) ImageNet 21k 62.5% /
SlowFast R101, 8×8 (Feichtenhofer et al., 2019) Kinetics400 63.1% 87.6%
MViTv1-B, 16×4 (Fan et al., 2021) Kinetics400 64.7% 89.2%
MViTv2-S, 16×4 (Li et al., 2022b) (train by us) Kinetics400 66.5% 90.6%
PLAR (Ours) Kinetics400 67.3% 91.0%
Table 2: Comparison with the state-of-the-art results on the Something Something V2. Our PLAR
improves 2.6% over MViTv1 and 0.8% over strong SOTA MViTv2.
4.2 Settings
All experiments are conducted on a desktop equipped with 8 Nvidia A5000 GPUs.
Okutama: For the multi-agent experiments, all the frames extracted from the video datasets were scaled to 224 × 224. The backbone is Swin-T (Liu et al., 2021). Following (Yadav et al., 2023), the obtained feature maps were processed with ROIAlign (crop size of 5 × 5) to get the desired ROIs. Then we used a fully connected layer to encode those features into classifier features. The one-hot encoded targets and classifier features were fed into the binary cross-entropy with logits loss. Other training settings follow (Liu et al., 2021).
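A hedged sketch of the ROI step above using torchvision's roi_align; the backbone stride (and hence spatial_scale), channel count, and box coordinates are assumptions for illustration.

```python
import torch
from torchvision.ops import roi_align

# One frame's backbone feature map; 7x7 assumes a stride-32 backbone on 224x224 input.
feats = torch.randn(1, 768, 7, 7)

# Actor boxes in input-image coordinates, format [batch_idx, x0, y0, x1, y1].
boxes = torch.tensor([[0.0, 10.0, 20.0, 120.0, 200.0],
                      [0.0, 90.0, 30.0, 180.0, 210.0]])

# Crop a 5x5 ROI per actor; spatial_scale maps image coordinates onto the feature map.
rois = roi_align(feats, boxes, output_size=(5, 5), spatial_scale=7 / 224)
print(rois.shape)  # torch.Size([2, 768, 5, 5]), then encoded by a fully connected layer
```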
Something-something v2: Following (Li et al., 2022b), we fine-tune the pre-trained Kinetics models.
Specifically, we train for 100 epochs using 8 GPUs with a batch size of 64 and a base learning rate
of 5e-5 with a cosine learning rate schedule. We use AdamW with a weight decay of 1e-4 and a drop path rate of 0.4. For other training and testing settings, we follow (Li et al., 2022b). The backbone is MViTv2-S (Li et al., 2022b).
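The fine-tuning recipe above maps onto a standard AdamW plus cosine schedule; a minimal sketch with dummy stand-ins for the model and data loader (in practice these are the prompt-augmented MViTv2-S and the SSV2 loader).

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs; in practice `model` is the prompt-augmented
# MViTv2-S backbone and `train_loader` yields Something-Something V2 clips.
model = nn.Linear(512, 174)                  # SSV2 has 174 action classes
train_loader = [(torch.randn(64, 512), torch.randint(0, 174, (64,))) for _ in range(2)]

epochs, base_lr, weight_decay = 100, 5e-5, 1e-4
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(2):                       # shortened from the 100-epoch schedule
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                         # cosine learning-rate decay per epoch
```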
Okutama is an aerial multi-agent action recognition dataset in which multiple actors sequentially
perform a diverse set of actions, which makes it very challenging. In the real world, it’s difficult
to ensure that only a single agent is in the scene for action recognition. Therefore, multi-agent
action recognition is a very practical and important research direction. We compare our PLAR with
state-of-the-art (SOTA) works.
As shown in Table 1, without bbox information, we achieve a 10.20% improvement over the SOTA method. With bbox information, we outperform the SOTA by 3.17%. This demonstrates
the effectiveness of our method.
Something-something V2 is a challenging ground camera dataset for visual common sense because it
requires models to understand the relationships between objects and actions. For example, to predict
the category of a video, a model must understand that "something bounces a ball" is different from
"something rolls a ball". In addition, the model must simultaneously pay attention to temporal model-
ing. We evaluate our PLAR's reasoning and temporal modeling ability on Something-something V2.
As shown in Table 2, our PLAR improves 2.6% over MViTv1 and 0.8% over MViTv2, which
illustrates the effectiveness of our proposed prompt learning and Auto-regressive temporal modeling.
Figure 4: Large Vision Model: prompts from the large vision model, with no supervision needed. We visualize the outputs for different prompts (panels: Frames, Box prompt, 1 point, 2 points, Line prompt). The bbox and line prompts produce more stable outputs, which means better prompts result in better outputs.
First, we conducted ablation studies on various prompts, including optical flow, large vision models, and learnable prompts, to verify their effectiveness. We then visualize the output of the large vision model for different inputs to verify whether it focuses on the right target.
Different Prompts To evaluate the effectiveness of different prompts, various prompts, including optical flow, the large vision model (SAM (Kirillov et al., 2023)), and learnable prompts, are examined in this work. As shown in Table 3, the large vision model and the learnable prompt achieve better accuracy.
Visualization For the large vision model, we visualize the outputs in terms of different prompts,
including bbox, line, and points. As shown in Figure 4, bbox and line have more stable outputs,
which means better prompts result in better outputs.
We present a general prompt learning approach that uses various prompts, including optical flow,
a large vision model (SAM), and learnable prompts. Prompt learning alleviates the optimization
burden by providing high-level texture descriptions or instructions associated with actions. Our
proposed learnable prompt learns to dynamically generate prompts from a pool of prompt experts
under different inputs. Our objective is to optimize prompts that guide the model’s predictions
while explicitly learning input-invariant (prompt experts) and input-specific (data-dependent) prompt
knowledge. We observe good accuracy improvements on some challenging datasets. Overall, ours is among the first methods to explore the use of large vision models as prompts to instruct the action recognition task. Our results are very promising, and there is room to further improve the performance. Our approach has some limitations. The formulation does not use a unified prompt form for both CNNs and Transformers. We have only considered a few prompt formats, and the overall performance depends on this choice. In the future, we would like to design a unified prompt form for different architectures and integrate text prompts with our approach. We also need to evaluate the performance on more datasets.
References
Algamdi, A. M., Sanchez, V., and Li, C.-T. (2020). Dronecaps: Recognition of human actions in
drone videos using capsule networks with binary volume comparisons. In 2020 IEEE International
Conference on Image Processing (ICIP), pages 3174–3178. IEEE.
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., and Schmid, C. (2021). Vivit: A video
vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision,
pages 6836–6846.
Barekatain, M., Martí, M., Shih, H.-F., Murray, S., Nakayama, K., Matsuo, Y., and Prendinger, H.
(2017). Okutama-action: An aerial view video dataset for concurrent human action detection. In
Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages
28–35.
Bertasius, G., Wang, H., and Torresani, L. (2021). Is space-time attention all you need for video
understanding? In ICML, volume 2, page 4.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural
information processing systems, 33:1877–1901.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics
dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 6299–6308.
Chéron, G., Laptev, I., and Schmid, C. (2015). P-cnn: Pose-based cnn features for action recognition.
In Proceedings of the IEEE international conference on computer vision, pages 3218–3226.
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., and Feichtenhofer, C. (2021). Multiscale
vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 6824–6835.
Fayyaz, M., Bahrami, E., Diba, A., Noroozi, M., Adeli, E., Van Gool, L., and Gall, J. (2021). 3d
cnns with adaptive temporal feature resolutions. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 4731–4740.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019). Slowfast networks for video recognition. In
Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211.
Ge, C., Huang, R., Xie, M., Lai, Z., Song, S., Li, S., and Huang, G. (2022). Domain adaptation via
prompt learning. arXiv preprint arXiv:2202.06687.
Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., and Russell, B. (2017). Actionvlad: Learning spatio-
temporal aggregation for action classification. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 971–980.
Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel,
V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. (2017). The "something something"
video database for learning and evaluating visual common sense. In Proceedings of the IEEE
international conference on computer vision, pages 5842–5850.
Han, X., Zhao, W., Ding, N., Liu, Z., and Sun, M. (2022). Ptr: Prompt tuning with rules for text
classification. AI Open, 3:182–192.
Hussein, N., Gavves, E., and Smeulders, A. W. (2019). Timeception for complex action recognition.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
254–263.
Ji, S., Xu, W., Yang, M., and Yu, K. (2012). 3d convolutional neural networks for human action
recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231.
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T.
(2021). Scaling up visual and vision-language representation learning with noisy text supervision.
In International Conference on Machine Learning, pages 4904–4916. PMLR.
Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. (2020). How can we know what language models
know? Transactions of the Association for Computational Linguistics, 8:423–438.
Ju, C., Han, T., Zheng, K., Zhang, Y., and Xie, W. (2022). Prompting visual-language models for
efficient video understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pages 105–124. Springer.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale
video classification with convolutional neural networks. In Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, pages 1725–1732.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg,
A. C., Lo, W.-Y., et al. (2023). Segment anything. arXiv preprint arXiv:2304.02643.
Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M., Brown, M., and Gong, B. (2021). Movinets:
Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 16020–16030.
Kothandaraman, D., Guan, T., Wang, X., Hu, S., Lin, M.-S., and Manocha, D. (2022). Far: Fourier
aerial video recognition. In European Conference on Computer Vision.
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt
tuning. arXiv preprint arXiv:2104.08691.
Li, M., Chen, L., Duan, Y., Hu, Z., Feng, J., Zhou, J., and Lu, J. (2022a). Bridge-prompt: Towards
ordinal action understanding in instructional videos. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 19880–19889.
Li, X., Shuai, B., and Tighe, J. (2020a). Directional temporal modeling for action recognition. In
Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part VI 16, pages 275–291. Springer.
Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
abs/2101.00190.
Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L. (2020b). Tea: Temporal excitation and
aggregation for action recognition. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 909–918.
Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., and Feichtenhofer, C. (2022b).
Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict:
A systematic survey of prompting methods in natural language processing. ACM Computing
Surveys, 55(9):1–35.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin trans-
former: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 10012–10022.
Lu, Y., Liu, J., Zhang, Y., Liu, Y., and Tian, X. (2022). Prompt distribution learning. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215.
Mazzia, V., Angarano, S., Salvetti, F., Angelini, F., and Chiaberge, M. (2022). Action transformer:
A self-attention model for short-time pose-based human action recognition. Pattern Recognition,
124:108487.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. (2019).
Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
Piergiovanni, A. and Ryoo, M. S. (2019). Representation flow for action recognition. In Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition, pages 9945–9953.
Piergiovanni, A. and Ryoo, M. S. (2021). Recognizing actions in videos from unseen viewpoints. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
4124–4132.
Poerner, N., Waltinger, U., and Schütze, H. (2019). E-bert: Efficient-yet-effective entity embeddings
for bert. In Findings.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A.,
Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models
from natural language supervision. In International Conference on Machine Learning.
Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., and Lu, J. (2022). Denseclip:
Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 18082–18091.
Sánchez, J., Perronnin, F., Mensink, T., and Verbeek, J. (2013). Image classification with the fisher
vector: Theory and practice. International journal of computer vision, 105:222–245.
Sato, F., Hachiuma, R., and Sekii, T. (2023). Prompt-guided zero-shot anomaly action recognition
using pretrained deep skeleton features. arXiv preprint arXiv:2303.15167.
Shi, Y., Wu, X., and Lin, H. (2022). Knowledge prompting for few-shot action recognition. arXiv
preprint arXiv:2211.12030.
Shin, T., Razeghi, Y., IV, R. L. L., Wallace, E., and Singh, S. (2020). Eliciting knowledge from
language models using automatically generated prompts. ArXiv, abs/2010.15980.
Simonyan, K. and Zisserman, A. (2014). Two-stream convolutional networks for action recognition
in videos. Advances in neural information processing systems, 27.
Sun, L., Jia, K., Chen, K., Yeung, D.-Y., Shi, B. E., and Savarese, S. (2017). Lattice long short-term
memory for human action recognition. In Proceedings of the IEEE international conference on
computer vision, pages 2147–2156.
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and Isola, P. (2020). Rethinking few-shot
image classification: a good embedding is all you need? In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages
266–282. Springer.
Tong, Z., Song, Y., Wang, J., and Wang, L. (2022). Videomae: Masked autoencoders are data-efficient
learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal
features with 3d convolutional networks. In Proceedings of the IEEE international conference on
computer vision, pages 4489–4497.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Wang, L., Qiao, Y., and Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional
descriptors. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4305–4314.
Wang, M., Xing, J., and Liu, Y. (2021a). Actionclip: A new paradigm for video action recognition.
arXiv preprint arXiv:2109.08472.
Wang, X., Xian, R., Guan, T., de Melo, C. M., Nogar, S. M., Bera, A., and Manocha, D. (2023).
Aztr: Aerial video action recognition with auto zoom and temporal reasoning. arXiv preprint
arXiv:2303.01589.
Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C., and Sang, N. (2021b). Oadtr: Online
action detection with transformers. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 7565–7575.
Wang, Z., Zhang, Z., Lee, C.-Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., and Pfister, T.
(2022). Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 139–149.
Xian, R., Wang, X., Kothandaraman, D., and Manocha, D. (2023a). Pmi sampler: Patch similarity
guided frame selection for aerial action recognition. arXiv preprint arXiv:2304.06866.
Xian, R., Wang, X., and Manocha, D. (2023b). Mitfas: Mutual information based temporal feature
alignment and sampling for aerial video action recognition. arXiv preprint arXiv:2303.02575.
Yadav, S. K., Luthra, A., Pahwa, E., Tiwari, K., Rathore, H., Pandey, H. M., and Corcoran, P. (2023).
Droneattention: Sparse weighted temporal attention for drone-camera based activity recognition.
Neural Networks, 159:57–69.
Yang, C., Xu, Y., Shi, J., Dai, B., and Zhou, B. (2020). Temporal pyramid network for action recog-
nition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 591–600.
Yang, F., Sakti, S., Wu, Y., and Nakamura, S. (2019). A framework for knowing who is doing what
in aerial surveillance videos. IEEE Access, 7:93315–93325.
Yang, J., Dong, X., Liu, L., Zhang, C., Shen, J., and Yu, D. (2022). Recurring the transformer for
video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 14063–14073.
Zhang, H., Zhang, L., Qi, X., Li, H., Torr, P. H., and Koniusz, P. (2020). Few-shot action recognition
with permutation-invariant attention. In Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 525–542. Springer.
ZHang, P., Wei, P., and Han, S. (2020). Capsnets algorithm. In Journal of Physics: Conference
Series, volume 1544, page 012030. IOP Publishing.
Zhong, Z., Friedman, D., and Chen, D. (2021). Factual probing is [mask]: Learning vs. learning to
recall. arXiv preprint arXiv:2104.05240.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2021). Learning to prompt for vision-language models.
International Journal of Computer Vision, 130:2337 – 2348.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. (2022). Conditional prompt learning for vision-language
models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 16816–16825.
Zhou, Y., Sun, X., Luo, C., Zha, Z.-J., and Zeng, W. (2020). Spatiotemporal fusion in 3d cnns: A
probabilistic view. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 9829–9838.
Zolfaghari, M., Singh, K., and Brox, T. (2018). Eco: Efficient convolutional network for online video
understanding. In Proceedings of the European conference on computer vision (ECCV), pages
695–712.
Zong, M., Wang, R., Chen, X., Chen, Z., and Gong, Y. (2021). Motion saliency based multi-stream
multiplier resnets for action recognition. Image and Vision Computing, 107:104108.