License: CC BY-NC-ND 4.0
arXiv:2604.21011v1 [cs.CV] 22 Apr 2026

Micro-DualNet: Dual-Path Spatio–Temporal Network for
Micro-Action Recognition

Naga VS Raviteja Chappa1, Evangelos Sariyanidi1, Lisa Yankowitz1, Gokul Nair1, Casey J. Zampella1,2, Robert T. Schultz1,2 and Birkan Tunç1,2 {1The Children’s Hospital of Philadelphia, USA and 2 University of Pennsylvania, USA}
https://compsygroup.github.io/micro-dual-net/
This work is partially supported by the Office of the Director (OD), National Institute of Child Health and Human Development (NICHD), and National Institute of Mental Health (NIMH) of US, under grants R01MH122599, R01MH118327, P50HD105354 and R21HD102078; and the IDDRC at CHOP/Penn.
Abstract

Micro-actions are subtle, localized movements lasting 1-3 seconds, such as scratching one’s head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations (e.g., “covering face”) while others manifest through temporal dynamics (e.g., “leg shaking”). Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose Micro-DualNet, a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by a Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on the MA-52 dataset (65.10% Top-1, 68.72% F1$_{\text{mean}}$) and state-of-the-art results on the iMiGUE dataset (76.88% Top-1). Ablations confirm that position-based actions benefit from ST processing while motion-based actions favor TS processing, validating that micro-actions require flexible complementary decomposition. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.
Clinical validation on an in-house dataset of 290 individuals demonstrates that Micro-DualNet-detected micro-actions reveal statistically significant behavioral differences between children with autism spectrum disorder, other psychiatric conditions, or typical development, suggesting potential for automated behavioral assessment.

I INTRODUCTION

Despite their subtlety, the brief movements we unconsciously perform, such as a head scratch, a finger tap, or adjusting glasses, encode substantial behavioral and psychological information. These micro-actions, subtle localized movements lasting 1-3 seconds, are common in natural and spontaneous social interactions and hence critical for social communication. Unlike gross motor actions [36, 18, 17] that involve easily discernible full-body motions, micro-actions manifest as brief, small-scale movements of specific body parts. These movements carry significant behavioral and psychological cues critical for applications in behavioral assessment, human-computer interaction, and healthcare monitoring. For instance, subtle differences in motor patterns such as stereotypies may help distinguish autism spectrum disorder (ASD) from other conditions [9], yet manual behavioral coding is prohibitively time-intensive for clinical workflows [11]. Despite recent advances in action recognition [36, 17, 41, 29, 27], current methods [14] achieve only 61% accuracy on micro-action benchmarks, revealing a major challenge for fine-grained video understanding and for any application that relies on capturing human behavior from videos.

The core challenge lies in the heterogeneous spatio-temporal characteristics of micro-actions. Consider “covering face” versus “stretching arms”: the former is characterized by its final spatial configuration while the latter manifests through repetitive temporal patterns where rhythm, not pose or location, carries discriminative information. As illustrated in Fig. 1, this heterogeneity means no single spatio-temporal decomposition captures all micro-actions optimally. Spatial-Temporal (ST) processing, which prioritizes spatial configuration before temporal dynamics, may excel for position-defined actions. Conversely, Temporal-Spatial (TS) processing, which models temporal dynamics before spatial relationships, may better capture motion-defined actions. Current architectures that commit to a single processing order cannot reconcile these opposing requirements.

Figure 1: Micro-action recognition requires flexible spatio-temporal processing. (Top) Representative frame sequences showing the distinct characteristics of motion-based vs. position-based micro-actions. “Stretching arms” is defined by repetitive temporal patterns, while “Covering face” is characterized by its final spatial configuration. (Bottom) Empirical validation on the MA-52 dataset: motion-based actions are better modelled by the TS path (left), while position-based actions achieve higher accuracy with the ST path (right). The dual-path architecture (green) consistently outperforms single paths, demonstrating the necessity of bidirectional processing for comprehensive micro-action understanding. Best viewed in color.

Micro-action recognition presents unique challenges beyond traditional action recognition. While traditionally studied actions involve coordinated full-body movements with clearly discernible actions, micro-actions operate within constrained kinematic spaces as they are subtle movements that concentrate discriminative signals in small spatio-temporal regions. These regions shift dynamically with body pose and viewpoint, creating a challenge: the spatial constraint increases recognition difficulty rather than simplifying it. This challenge is manifested in the performance of the current approaches [14, 26] that achieve accuracies of 61% on MA-52 and 71% on iMiGUE.

The challenge of micro-action recognition is reflected not just in the performance gap, but also in the failure modes of existing methods: fixed spatial regions misalign under viewpoint changes, while single processing orders cannot accommodate both spatially-defined and temporally-defined micro-actions. Analysis of existing methods reveals complementary failure modes that inform our approach. Convolutional Neural Network (CNN)-based methods [24, 40, 33] learn appearance features but lack structural priors: when “touching face” occurs at varying scales or angles, learned spatial filters fail to generalize. Skeleton-based approaches [6, 42, 5, 35] encode anatomical structure but discard appearance cues, losing critical information like hand configurations and surface contacts that distinguish “rubbing eyes” from “touching nose.” Recent hybrid methods [7, 12] attempt multi-modal fusion but remain committed to a single processing order, missing a key insight: optimal spatio-temporal decomposition varies by action type.

We propose Micro-DualNet, a keypoint-guided dual-path network that adapts to micro-action heterogeneity through complementary spatio-temporal processing. Our approach leverages anatomical keypoints to define six adaptive spatial entities, namely head, face, left hand, right hand, torso, and lower body, via our spatial entity module (SEM) [section III-B], then processes these through parallel spatial-temporal (ST) and temporal-spatial (TS) pathways [section III-C]. The ST path captures spatial entity configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. This dual decomposition, combined with a gating/routing mechanism, enables automatic selection of optimal processing strategies per action type.

Rather than fixed fusion, we introduce entity-level adaptive routing [section III-C4] that allows each body part to learn its optimal blend of ST and TS processing based on its spatio-temporal characteristics. To ensure complementary learning without redundancy, we further propose Mutual Action Consistency (MAC) loss [section III-D] that enforces cross-path coherence while preserving specialized representations.

We validate our design through extensive experiments on standard benchmarks [section IV]. Experiments on the MA-52 [14] and iMiGUE [26] datasets demonstrate the effectiveness of our approach, with ablations confirming that different body parts benefit from different processing orders. Empirical analysis confirms our hypothesis: position-based actions consistently achieve higher accuracy through ST processing, while motion-defined actions benefit from TS processing. Systematic ablations [section IV-D] demonstrate that each component contributes meaningfully: keypoint-guided entities provide robust spatial grounding (+3.8% over fixed regions), dual paths capture complementary patterns (+9.99% over single path), and MAC loss ensures effective cooperation (+2.96%). These results validate that micro-actions indeed require flexible spatio-temporal decomposition, confirming our architectural principles and paving the way for future studies that can achieve higher performance and greater real-life impact. Finally, we provide initial clinical validation demonstrating that Micro-DualNet-detected micro-actions reflect meaningful behavioral differences across diagnostic groups, bridging the gap between benchmark performance and real-world utility.

II RELATED WORKS

Micro-action Recognition. Micro-actions are subtle, short-duration movements concentrated on specific body parts [26, 4]. Early methods directly applied standard action recognizers [25, 21] but struggled with the spatially-localized and temporally-brief nature of discriminative signals. Guo et al. [14] introduced the MA-52 benchmark and MANet, which combines Temporal Shift Module (TSM) [24] with spatial entity aggregation using predefined body regions. While this establishes useful spatial priors, fixed regions misalign under viewpoint changes. Recent work addresses these limitations through diverse strategies: Motion-Modulated Network (MMN) [12] introduces motion-aware channel modulation for skeleton-based recognition, achieving strong results on MA-52 but lacking appearance information. MM-Gesture [13] explores multimodal fusion for micro-gestures, while Online Micro-Gesture [25] addresses streaming scenarios. However, these methods commit to a single processing order (spatial-then-temporal), limiting their ability to capture both configuration-centric and rhythm-centric patterns inherent in micro-actions.

Video CNNs and Transformers. Temporal modeling has evolved from sparse sampling (TSN [40], TSM [24], TS-LSTM [30]) to 3D convolutions (I3D [3], SlowFast [8]) and Video Transformers [1, 28]. These architectures use global pooling that dilutes localized signals critical for micro-actions. We build on TSM but replace global pooling with keypoint-guided entity extraction and constrain transformer attention to anatomically-grounded regions.

Skeleton-based Methods. Graph Convolutional Networks (GCNs) [42, 34, 6] and skeleton-specific architectures [7] excel when pose estimation is reliable but cannot leverage appearance information. This limitation proves critical for micro-actions where hand configuration and contact surfaces carry semantic meaning. While MMN [12] advances skeleton-based micro-action recognition through motion-guided modulation, it cannot distinguish visually-similar actions with different surface interactions. CTR-GCN [5] improves topology modeling but remains appearance-agnostic.

Part-based Representations. Keypoint-guided pooling localizes discriminative regions, with OpenPose [2] providing reliable body part detection. MANet [14] employs fixed spatial entities that struggle with viewpoint variations—when a person rotates, predefined regions no longer align with semantic body parts. Pure pose methods [12, 26] sacrifice appearance information entirely. Our approach addresses both limitations: adaptive keypoint-guided entities maintain semantic alignment across viewpoints while preserving appearance features, and dual-path processing with MAC regularization captures both spatial configurations and temporal dynamics. Unlike complex ensemble solutions [23, 10], we provide architectural innovations that improve single-model understanding of micro-action structure.

Challenge Solutions and Complex Architectures. Recent ACM Multimedia Grand Challenge 2024 submissions [23, 10] achieve high accuracy through sophisticated ensembles. These solutions combine multiple backbones (Swin-L [28], VideoMAE-v2 [39]), extensive data augmentation, and model ensembling. While demonstrating performance upper bounds, their computational requirements (10-100× our approach) and architectural complexity limit practical deployment. The winning solution [23] employs five models with test-time augmentation, requiring over 500 GFLOPs per prediction. In contrast, we focus on architectural insights that improve single-model performance while maintaining efficiency comparable to MANet [14].


Figure 2: Overview of the proposed Micro-DualNet framework. Given input video frames with corresponding body joints, our framework extracts CNN features ($\mathbf{f}_{\text{CNN}}$) and decomposes them into anatomically-grounded entity features ($\mathbf{X}$) via the Spatial Entity Module (SEM) [section III-B]. These entity features are processed through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways [section III-C]: the ST path applies spatial modeling before temporal modeling, while the TS path inverts this order. Entity-level adaptive routing [section III-C4] then learns to blend path outputs per body part based on their spatio-temporal characteristics. The routed features are combined with global CNN features and optimized via Cross-Entropy (CE) loss and Mutual Action Consistency (MAC) loss [section III-D], which enforces cross-path coherence on raw path outputs before routing. Best viewed in color.

III METHODOLOGY

III-A Overview

As shown in Fig. 2, given an input video $\mathbf{V}\in\mathbb{R}^{T\times H\times W\times 3}$ with $T$ frames, height $H$, width $W$, and 3 color channels, along with corresponding body joints, our framework processes micro-actions through four components: (1) a Spatial Entity Module (SEM) that extracts anatomically-grounded entity representations from Convolutional Neural Network (CNN) features, (2) dual Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways that capture complementary spatio-temporal patterns, (3) entity-level adaptive routing that learns per-entity processing preferences, and (4) Mutual Action Consistency (MAC) loss that enforces cross-path coherence while preserving specialized processing.

III-B Spatial Entity Module (SEM)

For each frame $t\in\{1,\dots,T\}$, we extract features $\mathbf{F}_{t}\in\mathbb{R}^{C\times H^{\prime}\times W^{\prime}}$ from the penultimate layer of ResNet-101 with Temporal Shift Module (TSM) [24], where $C=2048$, $H^{\prime}=H/32$, and $W^{\prime}=W/32$. We also use 25 keypoints (joint coordinates) represented as $\{(\mathbf{k}_{j},c_{j})\}_{j=1}^{25}$, where $\mathbf{k}_{j}\in\mathbb{R}^{2}$ denotes pixel coordinates and $c_{j}\in[0,1]$ indicates detection confidence. Keypoints can be generated by any human pose detection architecture; in the current study, we use OpenPose [2].

We define $K$ anatomical groupings using keypoints. For MA-52, we use $K=6$ entities (head, face, left_hand, right_hand, torso, lower_body) to capture whole-body micro-actions. For iMiGUE, which focuses on upper-body micro-gestures, we use $K=5$ entities, excluding lower_body as the dataset contains seated subjects. Please see Sec. A in Supplementary Material (Supp.) for detailed body joint-to-entity mappings.

Entity bounding boxes are computed dynamically as:

$$\mathbf{B}_{i,t}=\begin{cases}\text{BBox}(\{\mathbf{k}_{j}\mid j\in\mathcal{J}_{i},\,c_{j}>\theta\})&\text{if }|\mathcal{V}_{i}|\geq 2\\ \mathbf{B}_{i,t-1}&\text{otherwise}\end{cases} \tag{1}$$

where $\mathcal{J}_{i}$ denotes keypoint indices for entity $i$, $\mathcal{V}_{i}=\{j\in\mathcal{J}_{i}\mid c_{j}>\theta\}$ are visible keypoints with confidence threshold $\theta=0.3$, and $\text{BBox}(\cdot)$ computes the minimum enclosing rectangle with 10% padding.
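The box rule in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `entity_bbox`, the `(x_min, y_min, x_max, y_max)` box layout, and the reading of "10% padding" as 10% of the box width and height added on each side are all assumptions.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed layout


def entity_bbox(
    keypoints: Sequence[Tuple[float, float]],
    confidences: Sequence[float],
    joint_ids: Sequence[int],
    prev_box: Optional[Box],
    theta: float = 0.3,
    pad: float = 0.10,
) -> Optional[Box]:
    """Eq. (1): enclose the visible joints of one entity, else reuse the previous box."""
    visible = [keypoints[j] for j in joint_ids if confidences[j] > theta]
    if len(visible) < 2:
        return prev_box  # fall back to B_{i,t-1}
    xs = [p[0] for p in visible]
    ys = [p[1] for p in visible]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    # Assumption: padding is 10% of the box width/height on each side.
    dx, dy = pad * (x1 - x0), pad * (y1 - y0)
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)


# Toy frame: the third joint falls below the confidence threshold and is ignored.
kps = [(100.0, 50.0), (120.0, 90.0), (300.0, 300.0)]
conf = [0.9, 0.8, 0.1]
box = entity_bbox(kps, conf, joint_ids=[0, 1, 2], prev_box=None)

# With a single visible joint, the previous box is carried forward.
carried = entity_bbox(kps, [0.9, 0.1, 0.1], joint_ids=[0, 1, 2], prev_box=box)
```

Carrying the previous box forward keeps the entity stream defined through short occlusions, at the cost of a stale region while the joints remain undetected.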

Entity features are extracted via ROIAlign [15] followed by entity-specific refinement:

$$\mathbf{x}_{i,t}=\mathcal{E}_{i}(\text{ROIAlign}(\mathbf{F}_{t},\mathbf{B}_{i,t}))+\mathbf{p}_{i} \tag{2}$$

where ROIAlign extracts fixed-size features from arbitrary bounding box regions $\mathbf{B}_{i,t}$ using bilinear interpolation, avoiding the quantization artifacts of ROI pooling; $\mathcal{E}_{i}$ consists of depthwise separable convolutions projecting to $D=256$ dimensions, and $\mathbf{p}_{i}\in\mathbb{R}^{D}$ is a learnable position embedding encoding entity identity.

III-C Dual-Path Spatio-Temporal Modeling

Given entity features $\mathbf{X}\in\mathbb{R}^{B\times T\times K\times D}$, where $B$ is the batch size, $T$ is the number of frames, $K$ is the number of spatial entities, and $D=256$ is the feature dimension, we construct dual paths to capture complementary spatio-temporal patterns in micro-actions.

III-C1 Spatial Entity Transformer

To model spatial relations among body joints within each frame, we design a spatial entity transformer (Spatial-T). For frame $t$, we denote $\mathbf{X}_{t}\in\mathbb{R}^{B\times K\times D}$ as the features of the $K$ entities. These features are processed by Spatial-T as:

$$\mathbf{X}^{\prime}_{t}=\text{SPE}(\mathbf{X}_{t})+\mathbf{X}_{t} \tag{3}$$
$$\mathbf{X}^{\prime\prime}_{t}=\text{LN}(\mathbf{X}^{\prime}_{t}+\text{MHSA}_{s}(\mathbf{X}^{\prime}_{t})) \tag{4}$$
$$\hat{\mathbf{X}}_{t}=\text{LN}(\mathbf{X}^{\prime\prime}_{t}+\text{FFN}(\mathbf{X}^{\prime\prime}_{t})) \tag{5}$$

where SPE (Spatial Position Encoding) encodes relative spatial positions of entities based on their anatomical hierarchy (head→torso→limbs), $\text{MHSA}_{s}$ (Multi-Head Self-Attention) performs attention across entities to capture inter-entity dependencies crucial for micro-actions, LN denotes Layer Normalization, and FFN (Feed-Forward Network) is a two-layer Multi-Layer Perceptron (MLP) with GELU [16] activation. Please see Sec. D in Supp. for more details on the design choices.

III-C2 Temporal Transformer

To capture the temporal dynamics of each entity across frames, we employ a temporal transformer (Temporal-T). For entity $i$, we denote $\mathbf{X}_{i}\in\mathbb{R}^{B\times T\times D}$ as its features across $T$ frames. These features are processed by Temporal-T as:

$$\mathbf{X}^{\prime}_{i}=\text{TPE}(\mathbf{X}_{i})+\mathbf{X}_{i} \tag{6}$$
$$\mathbf{X}^{\prime\prime}_{i}=\text{LN}(\mathbf{X}^{\prime}_{i}+\text{MHSA}_{t}(\mathbf{X}^{\prime}_{i})) \tag{7}$$
$$\hat{\mathbf{X}}_{i}=\text{LN}(\mathbf{X}^{\prime\prime}_{i}+\text{FFN}(\mathbf{X}^{\prime\prime}_{i})) \tag{8}$$

where TPE (Temporal Position Encoding) encodes temporal positions using sinusoidal embeddings, and $\text{MHSA}_{t}$ captures motion patterns essential for distinguishing subtle micro-actions.
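As a rough illustration of the additive encoding in Eq. (6), the sketch below assumes the widely used sinusoidal form (sine on even channels, cosine on odd channels, base 10000). The text only states that sinusoidal embeddings are used, so the exact frequencies and the helper names `sinusoidal_encoding` and `add_temporal_encoding` are assumptions.

```python
import math
from typing import List


def sinusoidal_encoding(num_frames: int, dim: int) -> List[List[float]]:
    """Standard sinusoidal table: pe[t][d] for frame t and channel d (assumed form)."""
    table = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            # Channels 2i and 2i+1 share the frequency 1 / 10000^(2i / dim).
            angle = t / (10000 ** ((d - d % 2) / dim))
            row.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_temporal_encoding(track: List[List[float]]) -> List[List[float]]:
    """Eq. (6): X'_i = TPE(X_i) + X_i for one entity track of shape T x D."""
    pe = sinusoidal_encoding(len(track), len(track[0]))
    return [[x + p for x, p in zip(feat, enc)] for feat, enc in zip(track, pe)]


pe = sinusoidal_encoding(num_frames=8, dim=4)
encoded = add_temporal_encoding([[0.0] * 4 for _ in range(8)])
```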

III-C3 Bidirectional Processing Paths

Micro-actions exhibit diverse spatio-temporal characteristics—some defined by spatial configurations (e.g., “touch face”), others by temporal patterns (e.g., “leg shaking”). We arrange transformers in two complementary orders:

ST Path: First captures spatial entity arrangements, then models their temporal evolution:

$$\mathbf{X}_{\text{spatial}}^{ST}=\text{Spatial-T}(\mathbf{X})+\text{MLP}_{s}(\mathbf{X}) \tag{9}$$
$$\mathbf{X}^{ST}=\text{Temporal-T}(\mathbf{X}_{\text{spatial}}^{ST}) \tag{10}$$

TS Path: First extracts temporal patterns per entity, then models their spatial relationships:

$$\mathbf{X}_{\text{temporal}}^{TS}=\text{Temporal-T}(\mathbf{X})+\text{MLP}_{t}(\mathbf{X}) \tag{11}$$
$$\mathbf{X}^{TS}=\text{Spatial-T}(\mathbf{X}_{\text{temporal}}^{TS}) \tag{12}$$

The MLPs preserve original entity information through residual connections, enabling adaptive feature combination. Spatial-T and Temporal-T process their respective dimensions efficiently through batch-wise operations: Spatial-T operates on each frame independently (processing $K$ entities per frame), while Temporal-T operates on each entity independently (processing $T$ frames per entity).
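The difference between the two orders in Eqs. (9)-(12) can be made concrete with a toy sketch. The transformers are replaced by a hypothetical `toy_mixer` (a residual plus a nonlinear token average), features are scalars ($D=1$), and the $\text{MLP}_{s}$ and $\text{MLP}_{t}$ residual branches are omitted; only the axis along which each operator mixes tokens, and the order of composition, follow the text.

```python
import math
from typing import Callable, List

Clip = List[List[float]]  # clip[t][k]: scalar feature of entity k at frame t
Mixer = Callable[[List[float]], List[float]]


def spatial_op(clip: Clip, mix: Mixer) -> Clip:
    """Mix the K entities of each frame independently (the role of Spatial-T)."""
    return [mix(frame) for frame in clip]


def temporal_op(clip: Clip, mix: Mixer) -> Clip:
    """Mix the T frames of each entity independently (the role of Temporal-T)."""
    num_frames, num_entities = len(clip), len(clip[0])
    tracks = [mix([clip[t][k] for t in range(num_frames)]) for k in range(num_entities)]
    return [[tracks[k][t] for k in range(num_entities)] for t in range(num_frames)]


def toy_mixer(tokens: List[float]) -> List[float]:
    """Hypothetical stand-in for a transformer block, not the actual architecture."""
    mean = sum(tokens) / len(tokens)
    return [x + math.tanh(mean) for x in tokens]


def st_path(clip: Clip) -> Clip:
    """Spatial mixing first, then temporal mixing (order of Eqs. 9-10)."""
    return temporal_op(spatial_op(clip, toy_mixer), toy_mixer)


def ts_path(clip: Clip) -> Clip:
    """Temporal mixing first, then spatial mixing (order of Eqs. 11-12)."""
    return spatial_op(temporal_op(clip, toy_mixer), toy_mixer)


clip = [[1.0, -2.0], [0.5, 3.0]]  # T = 2 frames, K = 2 entities
x_st = st_path(clip)
x_ts = ts_path(clip)
```

Because the mixer is nonlinear, the two compositions do not commute, so `x_st` and `x_ts` differ even though both see the same input; this is the property the dual-path design relies on.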

III-C4 Entity-Level Adaptive Routing

While the dual-path architecture captures complementary spatio-temporal patterns, a key question remains: how should ST and TS representations be combined? Simple concatenation or addition treats all entities uniformly, ignoring that different anatomical parts exhibit fundamentally distinct characteristics. Hands performing gestures are motion-dominant and benefit from temporal-first processing, while torso postures are configuration-dominant and favor spatial-first processing.

We introduce entity-level adaptive routing that allows each body part to learn its optimal blend of ST and TS representations. For each entity $i\in\{1,\dots,K\}$ at each temporal position $t$, given path outputs $\mathbf{X}_{i,t}^{ST},\mathbf{X}_{i,t}^{TS}\in\mathbb{R}^{D}$, we concatenate them and compute routing scores through a lightweight entity-specific network:

$$\mathbf{r}_{i,t}=\mathcal{R}_{i}([\mathbf{X}_{i,t}^{ST};\mathbf{X}_{i,t}^{TS}])+\mathbf{b}_{i} \tag{13}$$

where $[\cdot;\cdot]$ denotes concatenation, $\mathcal{R}_{i}:\mathbb{R}^{2D}\rightarrow\mathbb{R}^{2}$ is a two-layer network (Linear-LayerNorm-ReLU-Dropout-Linear), and $\mathbf{b}_{i}\in\mathbb{R}^{2}$ is a learnable entity-type prior encoding anatomical biases. The routing scores are converted to normalized weights via temperature-scaled softmax:

$$[\alpha_{i,t}^{ST},\alpha_{i,t}^{TS}]=\text{softmax}(\mathbf{r}_{i,t}/\tau_{r}) \tag{14}$$

where $\tau_{r}=0.7$ controls routing sharpness. The fused entity representation combines both paths according to learned preferences:

$$\mathbf{X}_{i,t}^{\text{fused}}=\alpha_{i,t}^{ST}\cdot\mathbf{X}_{i,t}^{ST}+\alpha_{i,t}^{TS}\cdot\mathbf{X}_{i,t}^{TS} \tag{15}$$
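Eqs. (14) and (15) amount to a two-way softmax followed by a convex combination. The sketch below takes the routing scores $\mathbf{r}_{i,t}$ as given (the routing network $\mathcal{R}_{i}$ is not modeled), and the function names and example values are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def routing_weights(scores: Sequence[float], tau_r: float = 0.7) -> Tuple[float, float]:
    """Eq. (14): temperature-scaled softmax over the (ST, TS) routing scores."""
    scaled = [s / tau_r for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return exps[0] / total, exps[1] / total


def fuse(
    x_st: Sequence[float],
    x_ts: Sequence[float],
    scores: Sequence[float],
    tau_r: float = 0.7,
) -> List[float]:
    """Eq. (15): blend the two path outputs for one entity at one frame."""
    a_st, a_ts = routing_weights(scores, tau_r)
    return [a_st * s + a_ts * t for s, t in zip(x_st, x_ts)]


# Equal scores give an even blend; a lower temperature sharpens the preference.
even = fuse([1.0, 1.0], [3.0, 3.0], scores=[0.0, 0.0])
sharp = routing_weights([1.0, 0.0], tau_r=0.7)
soft = routing_weights([1.0, 0.0], tau_r=1.0)
```

Since the weights are non-negative and sum to one, the fused feature always lies between the two path outputs, so routing can interpolate between paths but never extrapolate beyond them.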

III-D Mutual Action Consistency Learning

To ensure the dual paths learn consistent representations while maintaining their complementary strengths, we employ entity-aware contrastive learning between the ST and TS pathways.

For each entity $i$ and temporal position $t$, we enforce that representations from both paths align for the same spatio-temporal location while contrasting with different temporal positions:

$$\mathcal{L}_{\text{MAC}}^{i,t}=-\log\frac{\exp(\text{sim}(\mathbf{z}_{i,t}^{ST},\mathbf{z}_{i,t}^{TS})/\tau)}{\sum_{j=1}^{T}\exp(\text{sim}(\mathbf{z}_{i,t}^{ST},\mathbf{z}_{i,j}^{TS})/\tau)} \tag{16}$$

where $\mathbf{z}_{i,t}^{ST},\mathbf{z}_{i,t}^{TS}$ are $\ell_{2}$-normalized features for entity $i$ at time $t$ from the respective paths, and $\tau=0.07$ is the temperature parameter. This formulation ensures that while both paths process the same entity differently (spatial-first vs. temporal-first), they maintain agreement about which temporal segments are most relevant for each entity. The total MAC loss aggregates across all visible entities and frames:

$$\mathcal{L}_{\text{MAC}}=\frac{\sum_{i=1}^{K}\sum_{t=1}^{T}c_{i}^{t}\cdot\mathcal{L}_{\text{MAC}}^{i,t}}{\sum_{i,t}c_{i}^{t}} \tag{17}$$

where $c_{i}^{t}$ represents the keypoint confidence for entity $i$ at time $t$, naturally down-weighting occluded or uncertain entities. Note that MAC loss operates on the raw path outputs $\mathbf{X}^{ST}$ and $\mathbf{X}^{TS}$ before adaptive routing (section III-C4), providing a training signal that encourages temporal coherence between paths while allowing the routing module to independently learn entity-specific fusion strategies.
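A minimal single-sample sketch of Eqs. (16) and (17) follows. It assumes that $\text{sim}(\cdot,\cdot)$ is the dot product of $\ell_{2}$-normalized features (cosine similarity) and that the per-entity confidences $c_{i}^{t}$ are supplied directly; how they are aggregated from joint confidences is not specified here, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mac_loss(z_st, z_ts, conf, tau: float = 0.07) -> float:
    """Eqs. (16)-(17). z_st[i][t] and z_ts[i][t] are feature vectors, conf[i][t] is a weight."""
    weighted_sum, weight_total = 0.0, 0.0
    for i in range(len(z_st)):
        st = [l2_normalize(v) for v in z_st[i]]
        ts = [l2_normalize(v) for v in z_ts[i]]
        for t in range(len(st)):
            # Similarity of the ST feature at frame t to the TS feature at every frame j.
            logits = [sum(a * b for a, b in zip(st[t], ts[j])) / tau for j in range(len(ts))]
            peak = max(logits)
            log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
            loss_it = log_denominator - logits[t]  # Eq. (16): the positive is frame t itself
            weighted_sum += conf[i][t] * loss_it
            weight_total += conf[i][t]
    return weighted_sum / weight_total if weight_total > 0 else 0.0


frames = [[1.0, 0.0], [0.0, 1.0]]   # one entity, T = 2, D = 2
swapped = [[0.0, 1.0], [1.0, 0.0]]  # same features in the wrong temporal order
aligned_loss = mac_loss([frames], [frames], [[1.0, 1.0]])
swapped_loss = mac_loss([frames], [swapped], [[1.0, 1.0]])
# A second, misaligned entity with zero confidence does not affect the loss.
masked_loss = mac_loss([frames, frames], [frames, swapped], [[1.0, 1.0], [0.0, 0.0]])
```

In this toy case the loss is close to zero when the two paths agree frame by frame and large when the TS features are temporally permuted.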

III-E Training Objectives

Given the dual-path outputs $\mathbf{X}^{ST},\mathbf{X}^{TS}\in\mathbb{R}^{B\times T\times K\times D}$, we apply entity-level adaptive routing to obtain fused representations where each entity $i$ is combined according to its learned preferences:

$$\mathbf{X}^{\text{fused}}=\{\mathbf{X}_{i}^{\text{fused}}\}_{i=1}^{K} \tag{18}$$

where $\mathbf{X}_{i}^{\text{fused}}$ is computed via Eqs. (13)-(15). The video-level entity representation is obtained by averaging across temporal and entity dimensions:

$$\mathbf{f}_{\text{entity}}=\frac{1}{T\cdot K}\sum_{t=1}^{T}\sum_{i=1}^{K}\mathbf{X}_{i,t}^{\text{fused}} \tag{19}$$

This is concatenated with global appearance features for final classification:

$$\mathbf{f}_{\text{final}}=[\mathbf{f}_{\text{CNN}};\mathbf{f}_{\text{entity}}] \tag{20}$$

where $\mathbf{f}_{\text{CNN}}\in\mathbb{R}^{C}$ represents global context obtained via Global Average Pooling (GAP) over the CNN feature maps, capturing scene-level information that complements localized entity features. The model is trained with classification and consistency objectives:

$$\mathcal{L}=\mathcal{L}_{\text{CE}}(\mathcal{C}(\mathbf{f}_{\text{final}}),y)+\lambda\mathcal{L}_{\text{MAC}} \tag{21}$$

where $\mathcal{C}$ is a two-layer MLP classifier, $\mathcal{L}_{\text{CE}}$ is cross-entropy loss, $y$ denotes micro-action labels, and $\lambda=0.1$ balances the objectives. Importantly, $\mathcal{L}_{\text{MAC}}$ is computed on raw path outputs before routing, ensuring both paths receive gradient signals regardless of learned routing preferences. This design separates representation learning (via MAC) from adaptive fusion (via routing), allowing each component to fulfill its distinct role.
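The pooling, concatenation, and combined objective of Eqs. (19)-(21) can be traced on toy values as follows. The classifier $\mathcal{C}$ is not modeled: its logits are passed in directly, and all numbers and function names are made up for illustration.

```python
import math
from typing import List, Sequence


def pool_entities(x_fused: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Eq. (19): average x_fused[t][k] (each a length-D vector) over frames and entities."""
    num_frames, num_entities = len(x_fused), len(x_fused[0])
    dim = len(x_fused[0][0])
    pooled = [0.0] * dim
    for frame in x_fused:
        for feat in frame:
            for d in range(dim):
                pooled[d] += feat[d]
    return [v / (num_frames * num_entities) for v in pooled]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[label]


def total_loss(logits: Sequence[float], label: int, mac: float, lam: float = 0.1) -> float:
    """Eq. (21): classification loss plus the weighted consistency term."""
    return cross_entropy(logits, label) + lam * mac


# Toy example: T = 2 frames, K = 2 entities, D = 2.
x_fused = [[[1.0, 0.0], [3.0, 2.0]], [[5.0, 4.0], [7.0, 6.0]]]
f_entity = pool_entities(x_fused)   # Eq. (19)
f_cnn = [0.5, -0.5, 1.0]            # stand-in for the pooled CNN feature
f_final = f_cnn + f_entity          # Eq. (20): list concatenation
loss = total_loss([0.0, 0.0, 0.0, 0.0], label=2, mac=1.5)
```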

IV EXPERIMENTAL RESULTS

TABLE I: Performance comparison on MA-52 [14] and iMiGUE [26] datasets. Columns 2-9 are MA-52 results and columns 10-11 are iMiGUE results; all values are in %.

| Method | Acc. Body Top-1 | Acc. Action Top-1 | Acc. Action Top-5 | F1 Body Macro | F1 Body Micro | F1 Action Macro | F1 Action Micro | F1$_{\text{mean}}$ Overall | iMiGUE Acc. Top-1 | iMiGUE Acc. Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| TSN [40] | 59.22 | 34.46 | 73.34 | 52.50 | 59.22 | 28.52 | 34.46 | 43.67 | 51.54 | 85.42 |
| TIN [33] | 73.26 | 52.81 | 85.37 | 66.99 | 73.26 | 39.82 | 52.81 | 58.22 | 52.38 | 86.15 |
| TSM [24] | 77.64 | 56.75 | 87.47 | 70.98 | 77.64 | 40.19 | 56.75 | 61.39 | 61.10 | 91.24 |
| MANet [14] | 78.95 | 61.33 | 88.83 | 72.87 | 78.95 | 49.22 | 61.33 | 65.59 | 62.54 | 92.18 |
| C3D [37] | 74.04 | 52.22 | 86.97 | 66.60 | 74.04 | 40.86 | 52.22 | 58.43 | 20.32 | 55.31 |
| I3D [3] | 78.16 | 57.07 | 88.67 | 71.56 | 78.16 | 39.84 | 57.07 | 61.66 | 34.96 | 63.69 |
| SlowFast [8] | 77.18 | 59.60 | 88.54 | 70.61 | 77.18 | 44.96 | 59.60 | 63.09 | 58.73 | 89.41 |
| VideoSwin-T [28] | 77.95 | 57.23 | 87.99 | 71.25 | 77.95 | 38.53 | 57.23 | 61.24 | 55.82 | 88.67 |
| TimesFormer [1] | 69.17 | 40.67 | 82.67 | 61.90 | 69.17 | 34.38 | 40.67 | 51.53 | 48.15 | 82.34 |
| UniFormer [22] | 79.03 | 58.89 | 87.29 | 71.80 | 79.03 | 48.01 | 58.89 | 64.43 | 57.29 | 89.95 |
| ST-GCN [42] | 69.87 | 49.61 | 79.54 | 61.53 | 69.87 | 34.64 | 49.61 | 53.91 | 46.97 | 84.09 |
| 2s-AGCN [34] | 70.07 | 49.48 | 78.27 | 61.30 | 70.07 | 34.64 | 49.48 | 53.87 | 47.78 | 88.43 |
| Shift-GCN [6] | 71.23 | 51.85 | 80.16 | 62.48 | 71.23 | 36.92 | 51.85 | 55.62 | 51.51 | 88.18 |
| CTR-GCN [5] | 72.06 | 52.61 | 81.22 | 63.46 | 72.06 | 37.79 | 52.61 | 56.48 | 52.94 | 89.76 |
| PoseConv3D [7] | 80.95 | 63.52 | 90.23 | 74.96 | 80.95 | 47.20 | 63.52 | 66.66 | 64.38 | 93.52 |
| PCAN [20] | 82.30 | 66.74 | 91.75 | 77.02 | 82.30 | 53.83 | 66.74 | 69.97 | - | - |
| Ours (Pose Only) | 79.64 | 61.25 | 89.42 | 73.18 | 79.64 | 46.73 | 61.25 | 65.20 | 68.92 | 94.35 |
| Ours (RGB Only) | 81.18 | 62.87 | 90.68 | 75.42 | 81.18 | 48.95 | 62.87 | 67.11 | 71.54 | 95.18 |
| Ours (Pose + RGB) | 83.50 | 65.10 | 92.27 | 78.31 | 83.50 | 54.18 | 65.10 | 68.72 | 76.88 | 96.72 |

IV-A Datasets and Evaluation Metrics

MA-52 Dataset [14] is a large-scale micro-action dataset collected through psychological interviews capturing unconscious human micro-behaviors. The dataset contains 22,422 samples annotated hierarchically at two levels: 7 body-level and 52 action-level categories. Following standard splits defined in [14], we use 11,250, 5,586, and 5,586 samples for training, validation, and testing respectively. The dataset provides both RGB frames and OpenPose body joints, enabling multi-modal analysis. Actions span 1-3 seconds and include subtle movements like “touching face,” “leg shaking,” and “arms crossing”.

iMiGUE Dataset [26] focuses on upper-body micro-gestures collected from sports interviews. The dataset contains 32 micro-gesture categories with 12,899, 777, and 4,562 samples for training, validation, and testing respectively. While conceptually similar to micro-actions, these micro-gestures are restricted to upper limbs, making skeleton data particularly relevant. We evaluate on iMiGUE to demonstrate our method’s generalization beyond full-body movements.

Evaluation Metrics. Following standard practice in micro-action recognition [14, 26], we adopt Top-1/Top-5 accuracy and micro and macro F1 scores as evaluation metrics. While accuracy provides direct classification performance, the F1 score better handles the class imbalance inherent in micro-action datasets. We compute F1$_{\text{mean}}$ by averaging macro and micro F1 scores across both hierarchical levels (body-level and action-level), providing a balanced assessment across different granularities and class frequencies.
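As a quick worked example of this averaging, the snippet below applies it to the four F1 values reported for MANet [14] in Table I. The function name `f1_mean` is illustrative, and an unweighted mean of the four scores is assumed.

```python
def f1_mean(body_macro: float, body_micro: float, action_macro: float, action_micro: float) -> float:
    """Unweighted mean of macro and micro F1 at the body and action levels."""
    return (body_macro + body_micro + action_macro + action_micro) / 4.0


# MANet row of Table I: body macro/micro 72.87/78.95, action macro/micro 49.22/61.33.
manet = f1_mean(72.87, 78.95, 49.22, 61.33)
```

The result is 65.5925, which agrees with the 65.59 reported for MANet in the F1$_{\text{mean}}$ column.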

IV-B Implementation Details

Architecture Configuration. We employ ResNet-101 with TSM [24] as our backbone, pretrained on Kinetics-400 [17]. SEM extracts $K=6$ entities for MA-52 and $K=5$ entities for iMiGUE, each with dimension $D=256$. Both ST and TS paths use 3-layer transformers with 8 attention heads, hidden dimension 1024, and dropout 0.1. For entity-level adaptive routing, each of the $K$ entities has a dedicated routing network $\mathcal{R}_{i}$ consisting of two linear layers ($2D\rightarrow D/2\rightarrow 2$, i.e., $512\rightarrow 128\rightarrow 2$) with LayerNorm, ReLU activation, and dropout 0.1. The learnable entity-type bias $\mathbf{b}_{i}$ is initialized to zero. The routing temperature $\tau=1.0$ provides soft routing that allows continuous blending between paths. This module adds approximately 0.5M parameters ($\sim$3% overhead). The final classifier uses a 2-layer MLP ($512\rightarrow 256\rightarrow C$) with GELU [16] activation and dropout 0.5, where $C$ is the number of action classes (52 for MA-52, 32 for iMiGUE as shown in Fig. 4). For MAC loss computation, we use temperature $\tau=0.07$ and $\lambda_{MAC}=0.1$.

Training Details. We sample 8 frames with temporal stride 8 for the MA-52 dataset and 16 frames with stride 4 for the iMiGUE dataset. Data augmentation includes random cropping to 224×224, horizontal flipping (p=0.5), and temporal jittering. We train for 120 epochs with the SGD [32] optimizer (momentum 0.9), an initial learning rate of 0.01 with a cosine annealing schedule, a batch size of 12, and a weight decay of $5\times 10^{-4}$. The first 10 epochs use linear warmup from 0.001 to 0.01. All training was done on a single GPU (Nvidia RTX 3090).
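One plausible reading of this schedule is sketched below. Several details are assumptions rather than stated facts: the rate is updated once per epoch, the cosine phase starts after warmup and decays to zero at epoch 120, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(
    epoch: int,
    total_epochs: int = 120,
    warmup_epochs: int = 10,
    warmup_start: float = 0.001,
    base_lr: float = 0.01,
    min_lr: float = 0.0,  # assumed floor; not specified in the text
) -> float:
    """Linear warmup followed by cosine annealing, evaluated per epoch."""
    if epoch < warmup_epochs:
        return warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(121)]
```

Under these assumptions the rate rises from 0.001 to 0.01 over the first 10 epochs, then decreases monotonically toward zero.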

IV-C Comparison with State-of-the-Art Methods

Table I compares our method with state-of-the-art approaches. On MA-52, we achieve 68.72% F1$_{\text{mean}}$, competitive with PCAN [20] (69.97%) while using simple end-to-end training instead of PCAN’s complex 3-stage pipeline. On iMiGUE, we achieve state-of-the-art 76.88% Top-1 accuracy, surpassing PoseConv3D [7] by 12.50 percentage points.

Our adaptive keypoint-guided entities outperform MANet’s fixed regions [14] by 3.13% F1$_{\text{mean}}$. While 3D CNNs (SlowFast [8]: 63.09%) and Transformers (UniFormer [22]: 64.43%) achieve reasonable performance, they require substantially more computation. Pure skeleton methods struggle: CTR-GCN [5] achieves only 56.48%, confirming that appearance cues are indispensable for micro-actions.

The bottom rows of Table I show modality ablations: pose-only (65.20%) surpasses all skeleton baselines, RGB-only (67.11%) demonstrates the benefit of entity-aware processing, and their fusion (68.72%) confirms complementarity. Even single-modality variants outperform heavier baselines, validating our dual-path design and MAC regularization. The 12.50% improvement on iMiGUE versus 3.13% on MA-52 suggests that adaptive entity extraction particularly excels for concentrated upper-body micro-gestures, where our keypoint guidance maintains semantic alignment across viewpoints while fixed regions fail.

IV-D Ablation Studies

Contribution of each component. Table II presents a systematic analysis of each component’s contribution. Starting from the TSM baseline (52.15% on MA-52), adding a single TS path improves accuracy to 55.96% (+3.81%) by capturing temporal dynamics. The Spatial Entity Module provides substantial gains (+3.25% MA-52, +5.39% iMiGUE), confirming that keypoint-guided entity extraction outperforms global features. Adding the ST path for dual-path processing further improves accuracy to 62.14% (+2.93%), validating that ST and TS paths capture complementary patterns. Analyzing the fusion components separately: MAC loss alone provides +2.26% on MA-52 and +2.87% on iMiGUE by enforcing cross-path temporal coherence, while entity routing alone contributes +0.95% and +0.83% respectively. Crucially, combining both achieves +2.96% and +4.23%; on iMiGUE this exceeds the sum of the two individual gains (+3.70%), suggesting synergistic effects where MAC ensures well-formed path representations while routing learns optimal entity-specific combinations. Overall, our full model improves 12.95% on MA-52 and 18.15% on iMiGUE over the baseline.

TABLE II: Ablation study of Micro-DualNet components on MA-52 and iMiGUE datasets.
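| Configuration | SEM | Dual-path | MAC | Routing | MA-52 Top-1 (%) | iMiGUE Top-1 (%) |
|---|---|---|---|---|---|---|
| Baseline (TSM) | | | | | 52.15 | 58.73 |
| + TS Only | | | | | 55.96 | 63.48 |
| + SEM | ✓ | | | | 59.21 | 68.87 |
| + Dual Path | ✓ | ✓ | | | 62.14 | 72.65 |
| + MAC Loss | ✓ | ✓ | ✓ | | 64.40 | 75.52 |
| + Routing - MAC | ✓ | ✓ | | ✓ | 63.09 | 73.48 |
| Full Model | ✓ | ✓ | ✓ | ✓ | 65.10 | 76.88 |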
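
The gains quoted in the text can be re-derived from the Top-1 numbers in Table II. The short check below is only an arithmetic aid; the dictionary keys and the `gain` helper are hypothetical names, and the values are copied from the table.

```python
# Top-1 accuracies (%) copied from Table II as (MA-52, iMiGUE).
TABLE_II = {
    "baseline": (52.15, 58.73),
    "ts_only": (55.96, 63.48),
    "sem": (59.21, 68.87),
    "dual_path": (62.14, 72.65),
    "mac_only": (64.40, 75.52),
    "routing_only": (63.09, 73.48),
    "full": (65.10, 76.88),
}


def gain(after, before):
    """Per-dataset difference in Top-1 between two configurations."""
    return tuple(round(a - b, 2) for a, b in zip(TABLE_II[after], TABLE_II[before]))


mac_gain = gain("mac_only", "dual_path")          # MAC loss on top of dual path
routing_gain = gain("routing_only", "dual_path")  # routing on top of dual path
combined_gain = gain("full", "dual_path")         # MAC + routing together
additive = tuple(round(m + r, 2) for m, r in zip(mac_gain, routing_gain))
total_gain = gain("full", "baseline")

print("MAC only:", mac_gain)
print("Routing only:", routing_gain)
print("Combined:", combined_gain, "vs. sum of parts:", additive)
print("Full vs. baseline:", total_gain)
```

On iMiGUE the combined gain (4.23) is larger than the sum of the separate gains (3.70), whereas on MA-52 it is slightly smaller (2.96 versus 3.21), so the synergy reading applies most clearly to iMiGUE.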

Impact of Spatial Entity Extraction Methods. Table III compares different spatial entity extraction strategies. Center crop (58.3% Acc.) extracts only the central region of each frame, losing peripheral information crucial for micro-actions involving limbs or off-center movements. Fixed body regions, similar to MANet [14], achieve 61.3% accuracy but fail under pose variations. Our keypoint-guided approach achieves 65.1% accuracy by dynamically adapting entity boundaries based on detected joints. Removing confidence weighting drops performance to 63.1%, as unreliable keypoints corrupt entity features. Using only 4 entities (excluding lower body) reduces accuracy to 61.4%, confirming the importance of full-body modeling for MA-52. The 3.8% improvement over fixed regions validates adaptive entity extraction’s importance for handling pose variations and subtle movements.

TABLE III: Ablation study of spatial entity extraction methods on MA-52 [14] dataset.
| Method | Top-1 (%) | F1-mean |
|---|---|---|
| Center Crop | 58.3 | 0.594 |
| Fixed Body Regions | 61.3 | 0.656 |
| Keypoint-Guided (Ours) | 65.1 | 0.687 |
| w/o confidence weighting | 63.1 | 0.657 |
| w/ 4 entities (no lower body) | 61.4 | 0.641 |
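
To illustrate why confidence weighting matters, the sketch below compares a confidence-weighted entity center with an unweighted one when a single keypoint is unreliable. This is a hypothetical simplification: the actual SEM applies confidence weighting to entity features as defined in the method section, whereas this example only weights 2D keypoint coordinates. The function name `entity_center` and the keypoint values are invented for illustration.

```python
def entity_center(keypoints, use_confidence=True):
    """Center of one entity from (x, y, confidence) keypoints.

    With use_confidence=False every keypoint counts equally, which mimics
    the "w/o confidence weighting" setting in a simplified way.
    """
    weights = [c if use_confidence else 1.0 for _, _, c in keypoints]
    total = sum(weights)
    if total == 0:
        return None  # no usable keypoints for this entity
    cx = sum(w * x for w, (x, _, _) in zip(weights, keypoints)) / total
    cy = sum(w * y for w, (_, y, _) in zip(weights, keypoints)) / total
    return (cx, cy)


# Hypothetical hand keypoints: two reliable detections and one stray,
# low-confidence detection far from the hand.
hand = [(100.0, 200.0, 0.9), (110.0, 210.0, 0.8), (300.0, 50.0, 0.05)]

weighted = entity_center(hand, use_confidence=True)
unweighted = entity_center(hand, use_confidence=False)
```

In this toy case the weighted center stays near the two reliable keypoints, while the unweighted center is dragged roughly 60 pixels toward the stray detection.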

Impact of temporal modeling. Table IV investigates temporal design choices. Frame sampling shows gains from 4 frames (59.8% Top-1) to 16 frames (65.1%), while 32 frames (64.7%) slightly degrades performance, suggesting temporal redundancy beyond 16 frames for 1-3 second micro-actions. For temporal aggregation, simple pooling (avg: 61.4%, max: 60.2%) discards temporal ordering. LSTM (62.8%) and Temporal Transformer (63.5%) preserve structure but process entities independently, missing critical inter-entity coordination. Our dual-path design (65.1%) captures synchronized movements essential for actions like “rubbing hands” where hand coordination defines the action. MAC loss granularity affects consistency: frame-level (62.7%) over-constrains at every timestamp, while video-level (65.1%) strikes a better balance by enforcing entity-wise consistency with temporal flexibility. Please see Supp. for additional ablation experiments.

TABLE IV: Impact of temporal modeling and frame sampling on MA-52 [14] dataset.
| Method | Top-1 (%) | F1-mean |
|---|---|---|
| Frame Sampling | | |
| 4 frames | 59.8 | 0.638 |
| 8 frames | 63.2 | 0.663 |
| 16 frames | 65.1 | 0.687 |
| 32 frames | 64.7 | 0.672 |
| Temporal Aggregation | | |
| Average Pooling | 61.4 | 0.657 |
| Max Pooling | 60.2 | 0.641 |
| LSTM | 62.8 | 0.659 |
| Temporal Transformer | 63.5 | 0.668 |
| Dual-Path (Ours) | 65.1 | 0.687 |
| Entity Temporal Granularity | | |
| Frame-level MAC | 62.7 | 0.657 |
| Video-level MAC (Ours) | 65.1 | 0.687 |
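
The claim that simple pooling discards temporal ordering can be checked directly: average and max pooling give identical outputs for a sequence and its time-reversed copy. The sketch below uses toy per-frame features (hypothetical values) and a deliberately simple order-aware summary, `mean_delta`, which is not part of the paper's model and only serves as a contrast.

```python
def average_pool(frames):
    """Average each feature dimension over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def max_pool(frames):
    """Take the maximum of each feature dimension over time."""
    return [max(col) for col in zip(*frames)]


def mean_delta(frames):
    """Order-aware toy summary: mean frame-to-frame change per dimension."""
    steps = len(frames) - 1
    return [
        sum(nxt[d] - cur[d] for cur, nxt in zip(frames, frames[1:])) / steps
        for d in range(len(frames[0]))
    ]


# Toy sequence: the first feature rises over time, the second stays constant.
rising = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
falling = list(reversed(rising))  # same frames, opposite temporal order

same_avg = average_pool(rising) == average_pool(falling)
same_max = max_pool(rising) == max_pool(falling)
delta_rising = mean_delta(rising)
delta_falling = mean_delta(falling)
```

Both pooled vectors are identical for the two sequences, while the order-aware summary flips sign, which is the kind of information that order-preserving temporal models can exploit.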

IV-E Clinical Validation Study

To evaluate potential clinical utility, we applied Micro-DualNet to an in-house dataset of 290 individuals (ages 5–52) recorded during 2–3 minute conversations with a research staff member [The study that includes this dataset was reviewed and approved by the Institutional Review Board at CHOP.]. Participants received licensed psychologist-supervised diagnostic evaluations and were classified into three groups: ASD (autism spectrum disorder, $n=120$), PSY (non-autistic psychiatric conditions, $n=46$), and TDC (typically developing, $n=124$). For the ten most frequent micro-actions, we conducted pairwise group comparisons using a two-part (hurdle) analysis: (a) probability of engagement via logistic GLM (Prob.), and (b) intensity among engagers via fractional logit (Int.). Table V summarizes significant group differences. Notably, the PSY group showed elevated “retracting feet” intensity compared to both ASD ($p<0.001$) and TDC ($p<0.01$), while “turning head” intensity was lower in PSY than both ASD and TDC ($p<0.05$). Fig. 3 illustrates intensity distributions for these two micro-actions; see Supp. for all comparisons. Although analyses controlling for demographics are needed for definitive interpretation, these results provide initial evidence that micro-action detection can be used to identify behavioral differences in psychiatric conditions as part of broader computational behavior analysis research.
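
The two-part (hurdle) structure can be illustrated descriptively. The sketch below splits per-participant engagement fractions into (a) whether the participant engaged at all and (b) the mean fraction among those who did, then compares two groups by an odds ratio and by a difference on the logit scale. It does not fit the GLMs used in the study; with a single group indicator and no covariates these descriptive quantities coincide with what such models would estimate, but that simplification is an assumption of this example. All data values and function names are synthetic and hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def hurdle_summary(fractions):
    """Return (engagement probability, mean fraction among engagers)."""
    engaged = [f for f in fractions if f > 0.0]
    prob = len(engaged) / len(fractions)
    intensity = sum(engaged) / len(engaged) if engaged else None
    return prob, intensity


def odds_ratio(p_a, p_b):
    """Odds of engagement in group A relative to group B."""
    return (p_a / (1.0 - p_a)) / (p_b / (1.0 - p_b))


# Synthetic fractions of time engaged in one micro-action (not study data).
group_a = [0.0, 0.0, 0.10, 0.20, 0.05]
group_b = [0.0, 0.30, 0.40, 0.25, 0.50]

prob_a, intensity_a = hurdle_summary(group_a)
prob_b, intensity_b = hurdle_summary(group_b)

# Part (a): engagement contrast as an odds ratio.
engagement_or = odds_ratio(prob_a, prob_b)
# Part (b): intensity contrast among engagers on the logit scale.
intensity_logit_diff = logit(intensity_a) - logit(intensity_b)
```

Separating the two parts keeps participants who never perform an action from diluting the intensity comparison, which is the motivation for a hurdle design.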

Figure 3: Violin plots of percent time engaged in “retracting feet” and “turning head” by diagnostic group. Each dot represents one participant’s mean percent time engaged. Dashed lines indicate group medians; solid lines indicate group means. Brackets with asterisks denote statistically significant between-group differences in intensity among participants with nonzero engagement: $^{*}p<0.05$, $^{**}p<0.01$, $^{***}p<0.001$. PSY shows significantly elevated “retracting feet” intensity compared to both ASD and TDC.
TABLE V: Clinical validation: pairwise group differences ($p<0.05$) in micro-action engagement. Prob.: engagement probability; %: percentage of time engaged among participants with nonzero engagement. $p_{\text{adj}}$: adjusted $p$-value correcting for multiple comparisons.
| Action | Contrast | Type | Effect | $p$ | $p_{\text{adj}}$ |
|---|---|---|---|---|---|
| retracting feet | ASD < PSY | % | 0.37 | <.001 | .004 |
| shaking legs | ASD > PSY | Prob. | 3.00 | .002 | .054 |
| shaking head | ASD < PSY | Prob. | 0.27 | .004 | .054 |
| retracting feet | PSY > TDC | % | 1.96 | .007 | .106 |
| head up | ASD > TDC | Prob. | 2.37 | .007 | .058 |
| stretching feet | ASD < TDC | Prob. | 0.50 | .008 | .058 |
| stretching feet | ASD < PSY | % | 0.48 | .017 | .172 |
| tilting head | ASD > PSY | Prob. | 2.31 | .019 | .102 |
| nodding | ASD < PSY | Prob. | 0.23 | .020 | .102 |
| turning head | ASD > PSY | % | 2.71 | .036 | .251 |
Figure 4: t-SNE [38] visualizations of learned representations on MA-52 [14] (top) and iMiGUE [26] (bottom) datasets. The baseline (left) shows heavily overlapping classes, single ST/TS paths (middle) yield partial clustering improvements, and Micro-DualNet (right) shows improved, though not complete, class grouping, with several action categories forming more coherent clusters. The remaining overlap reflects the inherent difficulty of fine-grained micro-action discrimination. Best viewed in zoom and color.
Figure 5: (Left) Class-wise accuracy heatmap for Dual, ST, and TS paths on the MA-52 [14] dataset. (Right) Model performance by class difficulty: the dual-path model achieves its largest gains on hard categories (+31%), supporting its effectiveness for ambiguous actions. Best viewed in zoom and color.

IV-F Qualitative Results

t-SNE visualizations (Fig. 4) show that single-path models yield complementary patterns: ST groups position-based actions while TS separates motion-based ones, and Micro-DualNet combines these strengths with improved overall clustering. This aligns with Fig. 5: Micro-DualNet shows modest 3% gains on easy actions but a 31% improvement on hard actions, suggesting that complementary entity-centric processing most benefits challenging micro-actions where single paths struggle.

V DISCUSSION

Our results reveal three key insights. First, keypoint-guided entities show improved performance over fixed regions in our experiments, though further validation is needed. Second, the contrasting ST/TS performance patterns (ST excelling on position-defined actions, TS on motion-based ones) support the view that micro-actions require flexible processing. Third, larger gains on iMiGUE (12.5%) versus MA-52 (3.1%) suggest our approach particularly benefits concentrated micro-gestures.

Clinical Implications. Automatically detected micro-actions differ significantly across diagnostic groups. Elevated “retracting feet” in PSY and increased “shaking legs” in ASD align with established phenotypes [31, 19]. These findings suggest Micro-DualNet could support scalable behavioral assessment.

VI CONCLUSIONS

We presented Micro-DualNet, a keypoint-guided dual-path framework for micro-action recognition. By processing anatomically-grounded entities through parallel ST and TS pathways with entity-level adaptive routing and MAC regularization, we achieve competitive performance on commonly used datasets. Beyond benchmarks, clinical validation suggests that detected micro-actions may reveal significant behavioral differences across ASD, psychiatric, and typically developing groups, providing initial evidence for real-world clinical utility. Our key contribution is demonstrating that micro-actions require flexible entity-level spatio-temporal processing, combined with interpretable routing that could inform automated behavioral assessment in healthcare settings.

Limitations. Our method depends on an external keypoint detector, making it vulnerable to pose estimation failures under severe occlusions. The dual-path architecture increases cost ($\sim$1.9$\times$ single path), and learned routing patterns may not transfer across datasets without fine-tuning. Fixed entity definitions may not optimally capture all micro-action types. Our clinical validation requires confirmation with demographic controls. Future work should explore learnable entity discovery, cross-dataset transfer, and expanded clinical evaluation.

References

  • [1] G. Bertasius, H. Wang, and L. Torresani (2021) Is space-time attention all you need for video understanding?. In Icml, Vol. 2, pp. 4. Cited by: §II, TABLE I.
  • [2] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh (2019) Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (1), pp. 172–186. Cited by: §II, §III-B.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §II, TABLE I.
  • [4] H. Chen, H. Shi, X. Liu, X. Li, and G. Zhao (2023) SMG: a micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision 131 (6), pp. 1346–1366. Cited by: §II.
  • [5] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13359–13368. Cited by: §I, §II, §IV-C, TABLE I.
  • [6] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu (2020) Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 183–192. Cited by: §I, §II, TABLE I.
  • [7] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai (2022) Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978. Cited by: §I, §II, §IV-C, TABLE I.
  • [8] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211. Cited by: §II, §IV-C, TABLE I.
  • [9] S. Goldman and P. E. Greene (2013) Stereotypies in autism: a video demonstration of their clinical variability. Frontiers in integrative neuroscience 6, pp. 121. Cited by: §I.
  • [10] F. Gong, J. Chen, J. Zhu, Q. Bao, F. Gao, R. Gu, and G. Xu (2024) Micro-action recognition via hierarchical fusion and inference. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, New York, NY, USA, pp. 11327–11332. External Links: ISBN 9798400706868, Link, Document Cited by: §II, §II.
  • [11] R. Grzadzinski, T. Carr, C. Colombi, K. McGuire, S. Dufek, A. Pickles, and C. Lord (2016) Measuring changes in social communication behaviors: preliminary development of the brief observation of social communication change (boscc). Journal of autism and developmental disorders 46 (7), pp. 2464–2479. Cited by: §I.
  • [12] J. Gu, K. Li, F. Wang, Y. Wei, Z. Wu, H. Fan, and M. Wang (2025) Motion matters: motion-guided modulation network for skeleton-based micro-action recognition. In Proceedings of the 33rd ACM International Conference on Multimedia, Cited by: §I, §II, §II, §II.
  • [13] J. Gu, F. Wang, K. Li, Y. Wei, Z. Wu, and D. Guo (2025) MM-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344. Cited by: §II.
  • [14] D. Guo, K. Li, B. Hu, Y. Zhang, and M. Wang (2024) Benchmarking micro-action recognition: dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology 34 (7), pp. 6238–6252. Cited by: §I, §I, §I, §II, §II, §II, Figure 4, Figure 5, §IV-A, §IV-A, §IV-C, §IV-D, TABLE I, TABLE I, TABLE III, TABLE IV.
  • [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §III-B.
  • [16] D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §III-C1, §IV-B.
  • [17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §I, §IV-B.
  • [18] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556–2563. Cited by: §I.
  • [19] S. R. Leekam, M. R. Prior, and M. Uljarevic (2011) Restricted and repetitive behaviors in autism spectrum disorders: a review of research in the last decade.. Psychological bulletin 137 (4), pp. 562. Cited by: §V.
  • [20] K. Li, D. Guo, G. Chen, C. Fan, J. Xu, Z. Wu, H. Fan, and M. Wang (2025) Prototypical calibrating ambiguous samples for micro-action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 4815–4823. Cited by: §IV-C, TABLE I.
  • [21] K. Li, D. Guo, G. Chen, X. Peng, and M. Wang (2023) Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624. Cited by: §II.
  • [22] K. Li, Y. Wang, G. Peng, G. Song, Y. Liu, H. Li, and Y. Qiao (2022) UniFormer: unified transformer for efficient spatial-temporal representation learning. In International Conference on Learning Representations, External Links: Link Cited by: §IV-C, TABLE I.
  • [23] Q. Li, X. Huang, H. Chen, F. He, Q. Chen, and Z. Wang (2024) Advancing micro-action recognition with multi-auxiliary heads and hybrid loss optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, New York, NY, USA, pp. 11313–11319. External Links: ISBN 9798400706868, Link, Document Cited by: §II, §II.
  • [24] J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 7083–7093. Cited by: §I, §II, §II, §III-B, §IV-B, TABLE I.
  • [25] P. Liu, F. Wang, K. Li, G. Chen, Y. Wei, S. Tang, Z. Wu, and D. Guo (2024) Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490. Cited by: §II.
  • [26] X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, and G. Zhao (2021) IMiGUE: an identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10631–10642. Cited by: §I, §I, §II, §II, Figure 4, §IV-A, §IV-A, TABLE I.
  • [27] Y. Liu, L. Wang, Y. Wang, X. Ma, and Y. Qiao (2022) Fineaction: a fine-grained video dataset for temporal action localization. IEEE transactions on image processing 31, pp. 6937–6950. Cited by: §I.
  • [28] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu (2022) Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3202–3211. Cited by: §II, §II, TABLE I.
  • [29] H. Luo, G. Lin, Y. Yao, Z. Tang, Q. Wu, and X. Hua (2021) Dense semantics-assisted networks for video action recognition. IEEE Transactions on Circuits and Systems for Video Technology 32 (5), pp. 3073–3084. Cited by: §I.
  • [30] C. Ma, M. Chen, Z. Kira, and G. AlRegib (2019) TS-lstm and temporal-inception: exploiting spatiotemporal dynamics for activity recognition. Signal Processing: Image Communication 71, pp. 76–87. Cited by: §II.
  • [31] C. C. Nuckols and C. C. Nuckols (2013) The diagnostic and statistical manual of mental disorders,(dsm-5). Philadelphia: American Psychiatric Association. Cited by: §V.
  • [32] S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §IV-B.
  • [33] H. Shao, S. Qian, and Y. Liu (2020) Temporal interlacing network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11966–11973. Cited by: §I, TABLE I.
  • [34] L. Shi, Y. Zhang, J. Cheng, and H. Lu (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12026–12035. Cited by: §II, TABLE I.
  • [35] L. Shi, Y. Zhang, J. Cheng, and H. Lu (2020) Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing 29, pp. 9532–9545. Cited by: §I.
  • [36] K. Soomro, A. R. Zamir, and M. Shah (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §I.
  • [37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: TABLE I.
  • [38] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: Figure 4.
  • [39] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao (2023) Videomae v2: scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14549–14560. Cited by: §II.
  • [40] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2018) Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence 41 (11), pp. 2740–2755. Cited by: §I, §II, TABLE I.
  • [41] H. Wu, X. Ma, and Y. Li (2021) Spatiotemporal multimodal learning with 3d cnns for video action recognition. IEEE Transactions on Circuits and Systems for Video Technology 32 (3), pp. 1250–1261. Cited by: §I.
  • [42] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §I, §II, TABLE I.