Micro-DualNet: Dual-Path Spatio–Temporal Network for Micro-Action Recognition
Abstract
Micro-actions are subtle, localized movements lasting 1-3 seconds, such as scratching one’s head or tapping fingers. Such subtle actions are essential for social communication, ubiquitous in natural interactions, and thus critical for fine-grained video understanding, yet they remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics, where some are defined by spatial configurations (e.g., “covering face”) while others manifest through temporal dynamics (e.g., “leg shaking”). Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose Micro-DualNet, a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by a Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on the MA-52 dataset (65.10% Top-1, 68.72% F1) and state-of-the-art results on the iMiGUE dataset (76.88% Top-1). Ablations confirm that position-based actions benefit from ST processing while motion-based actions favor TS processing, validating that micro-actions require flexible complementary decomposition. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.
Clinical validation on an in-house dataset of 290 individuals demonstrates that micro-actions detected by Micro-DualNet reveal statistically significant behavioral differences among individuals with autism spectrum disorder, other psychiatric conditions, or typical development, suggesting potential for automated behavioral assessment.
I INTRODUCTION
Despite their subtlety, the brief movements we unconsciously perform, like a head scratch, a finger tap, or adjusting glasses, encode substantial behavioral and psychological information. These micro-actions, subtle localized movements lasting 1-3 seconds, are commonly used in natural and spontaneous social interactions and are hence critical for social communication. Unlike gross motor actions [36, 18, 17] that involve easily discernible full-body motions, micro-actions manifest as brief, small-scale movements of specific body parts. These movements carry significant behavioral and psychological cues critical for applications in behavioral assessment, human-computer interaction, and healthcare monitoring. For instance, subtle differences in motor patterns such as stereotypies may help distinguish autism spectrum disorder (ASD) from other conditions [9], but manual behavioral coding is prohibitively time-intensive for clinical workflows [11]. Despite recent advances in action recognition [36, 17, 41, 29, 27], current methods [14] achieve only 61% accuracy on micro-action benchmarks, revealing a substantial challenge for fine-grained video understanding and for any application that relies on capturing human behavior from videos.
The core challenge lies in the heterogeneous spatio-temporal characteristics of micro-actions. Consider “covering face” versus “stretching arms”: the former is characterized by its final spatial configuration while the latter manifests through repetitive temporal patterns where rhythm, not pose or location, carries discriminative information. As illustrated in Fig. 1, this heterogeneity means no single spatio-temporal decomposition captures all micro-actions optimally. Spatial-Temporal (ST) processing, which prioritizes spatial configuration before temporal dynamics, may excel for position-defined actions. Conversely, Temporal-Spatial (TS) processing, which models temporal dynamics before spatial relationships, may better capture motion-defined actions. Current architectures that commit to a single processing order cannot reconcile these opposing requirements.
Micro-action recognition presents unique challenges beyond traditional action recognition. While traditionally studied actions involve coordinated full-body movements with clearly discernible actions, micro-actions operate within constrained kinematic spaces as they are subtle movements that concentrate discriminative signals in small spatio-temporal regions. These regions shift dynamically with body pose and viewpoint, creating a challenge: the spatial constraint increases recognition difficulty rather than simplifying it. This challenge is manifested in the performance of the current approaches [14, 26] that achieve accuracies of 61% on MA-52 and 71% on iMiGUE.
The challenge of micro-action recognition is reflected not just in the performance gap but also in the failure modes of existing methods: fixed spatial regions misalign under viewpoint changes, while single processing orders cannot accommodate both spatially-defined and temporally-defined micro-actions. Analysis of existing methods reveals complementary failure modes that inform our approach. Convolutional Neural Network (CNN)-based methods [24, 40, 33] learn appearance features but lack structural priors: when “touching face” occurs at varying scales or angles, learned spatial filters fail to generalize. Skeleton-based approaches [6, 42, 5, 35] encode anatomical structure but discard appearance cues, losing critical information like hand configurations and surface contacts that distinguish “rubbing eyes” from “touching nose.” Recent hybrid methods [7, 12] attempt multi-modal fusion but remain committed to a single processing order, missing a key insight: optimal spatio-temporal decomposition varies by action type.
We propose Micro-DualNet, a keypoint-guided dual-path network that adapts to micro-action heterogeneity through complementary spatio-temporal processing. Our approach leverages anatomical keypoints to define six adaptive spatial entities, namely head, face, left hand, right hand, torso, and lower body, via our spatial entity module (SEM) [section III-B], then processes these through parallel spatial-temporal (ST) and temporal-spatial (TS) pathways [section III-C]. The ST path captures spatial entity configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. This dual decomposition, combined with a gating/routing mechanism, enables automatic selection of optimal processing strategies per action type.
Rather than fixed fusion, we introduce entity-level adaptive routing [section III-C4] that allows each body part to learn its optimal blend of ST and TS processing based on its spatio-temporal characteristics. To ensure complementary learning without redundancy, we further propose Mutual Action Consistency (MAC) loss [section III-D] that enforces cross-path coherence while preserving specialized representations.
We validate our design through extensive experiments on standard benchmarks [section IV]. Experiments on the MA-52 [14] and iMiGUE [26] datasets demonstrate the effectiveness of our approach, with ablations confirming that different body parts benefit from different processing orders. Empirical analysis confirms our hypothesis: position-based actions consistently achieve higher accuracy through ST processing, while motion-defined actions benefit from TS processing. Systematic ablations [section IV-D] demonstrate that each component contributes meaningfully: keypoint-guided entities provide robust spatial grounding (+3.8% over fixed regions), dual paths capture complementary patterns (+9.99% cumulative gain over the TSM baseline), and MAC loss with routing ensures effective cooperation (+2.96%). These results validate that micro-actions indeed require flexible spatio-temporal decomposition, confirming our architectural principles and paving the way for future studies that can achieve higher performance and greater real-life impact. Finally, we provide initial clinical validation demonstrating that micro-actions detected by Micro-DualNet reflect meaningful behavioral differences across diagnostic groups, bridging the gap between benchmark performance and real-world utility.
II RELATED WORKS
Micro-action Recognition. Micro-actions are subtle, short-duration movements concentrated on specific body parts [26, 4]. Early methods directly applied standard action recognizers [25, 21] but struggled with the spatially-localized and temporally-brief nature of discriminative signals. Guo et al. [14] introduced the MA-52 benchmark and MANet, which combines Temporal Shift Module (TSM) [24] with spatial entity aggregation using predefined body regions. While this establishes useful spatial priors, fixed regions misalign under viewpoint changes. Recent work addresses these limitations through diverse strategies: Motion-Modulated Network (MMN) [12] introduces motion-aware channel modulation for skeleton-based recognition, achieving strong results on MA-52 but lacking appearance information. MM-Gesture [13] explores multimodal fusion for micro-gestures, while Online Micro-Gesture [25] addresses streaming scenarios. However, these methods commit to a single processing order (spatial-then-temporal), limiting their ability to capture both configuration-centric and rhythm-centric patterns inherent in micro-actions.
Video CNNs and Transformers. Temporal modeling has evolved from sparse sampling (TSN [40], TSM [24], TS-LSTM [30]) to 3D convolutions (I3D [3], SlowFast [8]) and Video Transformers [1, 28]. These architectures use global pooling that dilutes localized signals critical for micro-actions. We build on TSM but replace global pooling with keypoint-guided entity extraction and constrain transformer attention to anatomically-grounded regions.
Skeleton-based Methods. Graph Convolutional Networks (GCNs) [42, 34, 6] and skeleton-specific architectures [7] excel when pose estimation is reliable but cannot leverage appearance information. This limitation proves critical for micro-actions where hand configuration and contact surfaces carry semantic meaning. While MMN [12] advances skeleton-based micro-action recognition through motion-guided modulation, it cannot distinguish visually-similar actions with different surface interactions. CTR-GCN [5] improves topology modeling but remains appearance-agnostic.
Part-based Representations. Keypoint-guided pooling localizes discriminative regions, with OpenPose [2] providing reliable body part detection. MANet [14] employs fixed spatial entities that struggle with viewpoint variations—when a person rotates, predefined regions no longer align with semantic body parts. Pure pose methods [12, 26] sacrifice appearance information entirely. Our approach addresses both limitations: adaptive keypoint-guided entities maintain semantic alignment across viewpoints while preserving appearance features, and dual-path processing with MAC regularization captures both spatial configurations and temporal dynamics. Unlike complex ensemble solutions [23, 10], we provide architectural innovations that improve single-model understanding of micro-action structure.
Challenge Solutions and Complex Architectures. Recent ACM Multimedia Grand Challenge 2024 submissions [23, 10] achieve high accuracy through sophisticated ensembles. These solutions combine multiple backbones (Swin-L [28], VideoMAE-v2 [39]), extensive data augmentation, and model ensembling. While demonstrating performance upper bounds, their computational requirements (10-100× our approach) and architectural complexity limit practical deployment. The winning solution [23] employs five models with test-time augmentation, requiring over 500 GFLOPs per prediction. In contrast, we focus on architectural insights that improve single-model performance while maintaining efficiency comparable to MANet [14].

III METHODOLOGY
III-A Overview
As shown in Fig. 2, given an input video $X \in \mathbb{R}^{T \times H \times W \times 3}$ with $T$ frames, height $H$, width $W$, and 3 color channels, along with the corresponding body joints, our framework processes micro-actions through four components: (1) a Spatial Entity Module (SEM) that extracts anatomically-grounded entity representations from Convolutional Neural Network (CNN) features, (2) dual Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways that capture complementary spatio-temporal patterns, (3) entity-level adaptive routing that learns per-entity processing preferences, and (4) Mutual Action Consistency (MAC) loss that enforces cross-path coherence while preserving specialized processing.
III-B Spatial Entity Module (SEM)
For each frame $t$, we extract features $F_t \in \mathbb{R}^{C \times H' \times W'}$ from the penultimate layer of ResNet-101 with Temporal Shift Module (TSM) [24], where $C$ is the number of channels and $H' \times W'$ is the spatial resolution of the feature map. We also use 25 keypoints (joint coordinates) represented as $P_t = \{(x_j, y_j, c_j)\}_{j=1}^{25}$, where $(x_j, y_j)$ denotes pixel coordinates and $c_j$ indicates detection confidence. Keypoints can be generated by any human pose estimation architecture; in the current study, we use OpenPose [2].
We define anatomical groupings using keypoints. For MA-52, we use $K = 6$ entities (head, face, left_hand, right_hand, torso, lower_body) to capture whole-body micro-actions. For iMiGUE, which focuses on upper-body micro-gestures, we exclude lower_body because the dataset contains seated subjects. Please see Sec. A in Supplementary Material (Supp.) for detailed body joint-to-entity mappings.
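Entity bounding boxes are computed dynamically as:
$$ b_t^k = \mathrm{Rect}\big(\mathcal{P}_t^k\big), \qquad \mathcal{P}_t^k = \{(x_j, y_j) : j \in \mathcal{J}_k,\; c_j > \theta\} \tag{1} $$
where $\mathcal{J}_k$ denotes the keypoint indices for entity $k$, $\mathcal{P}_t^k$ are the visible keypoints in frame $t$ (those whose detection confidence $c_j$ exceeds the threshold $\theta$), and $\mathrm{Rect}(\cdot)$ computes the minimum enclosing rectangle with 10% padding.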
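The box computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `entity_bbox`, the threshold value 0.3, and the choice to pad each side by 10% of the box width and height are all hypothetical, since the text gives neither the threshold nor the exact padding convention.

```python
def entity_bbox(keypoints, joint_ids, conf_thresh=0.3, pad=0.10):
    """Padded minimum enclosing rectangle of an entity's visible keypoints.

    keypoints  : list of (x, y, confidence) tuples, one per joint
    joint_ids  : indices of the joints that belong to this entity
    conf_thresh: hypothetical visibility threshold (value not given in the text)
    pad        : fractional padding per side (assumed reading of "10% padding")
    Returns (x_min, y_min, x_max, y_max), or None if no joint is visible.
    """
    visible = [(keypoints[j][0], keypoints[j][1])
               for j in joint_ids if keypoints[j][2] > conf_thresh]
    if not visible:
        return None
    xs = [x for x, _ in visible]
    ys = [y for _, y in visible]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    dx, dy = pad * (x1 - x0), pad * (y1 - y0)
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)


# Toy frame: three confident joints and one low-confidence outlier.
frame = [(100.0, 50.0, 0.9), (140.0, 50.0, 0.8), (120.0, 90.0, 0.7), (400.0, 400.0, 0.1)]
box = entity_bbox(frame, joint_ids=[0, 1, 2, 3])
```

In this toy frame the low-confidence joint at (400, 400) is ignored, so the box stays tight around the three reliable joints instead of stretching toward the outlier.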
Entity features are extracted via ROIAlign [15] followed by entity-specific refinement:
$$ e_t^k = \phi_k\big(\mathrm{ROIAlign}(F_t, b_t^k)\big) + p_k \tag{2} $$
where $F_t$ is the feature map of frame $t$ and $b_t^k$ is the bounding box of entity $k$; ROIAlign extracts fixed-size features from arbitrary bounding box regions using bilinear interpolation, avoiding the quantization artifacts of ROI pooling; $\phi_k$ consists of depthwise separable convolutions projecting to $D$ dimensions, and $p_k$ is a learnable position embedding encoding entity identity.
III-C Dual-Path Spatio-Temporal Modeling
Given entity features $E \in \mathbb{R}^{B \times T \times K \times D}$, where $B$ is the batch size, $T$ is the number of frames, $K$ is the number of spatial entities, and $D$ is the feature dimension, we construct dual paths to capture complementary spatio-temporal patterns in micro-actions.
III-C1 Spatial Entity Transformer
To model spatial relations among body parts within each frame, we design a spatial entity transformer (Spatial-T). For frame $t$, we denote $E_t \in \mathbb{R}^{K \times D}$ as the features of the $K$ entities. These features are processed by Spatial-T as:
$$ Z_t^{(0)} = E_t + \mathrm{SPE} \tag{3} $$
$$ \hat{Z}_t = \mathrm{MHSA}_s\big(\mathrm{LN}(Z_t^{(0)})\big) + Z_t^{(0)} \tag{4} $$
$$ Z_t = \mathrm{FFN}\big(\mathrm{LN}(\hat{Z}_t)\big) + \hat{Z}_t \tag{5} $$
where SPE (Spatial Position Encoding) encodes relative spatial positions of entities based on their anatomical hierarchy (head→torso→limbs), $\mathrm{MHSA}_s$ (Multi-Head Self-Attention) performs attention across entities to capture inter-entity dependencies crucial for micro-actions, LN denotes Layer Normalization, and FFN (Feed-Forward Network) is a two-layer Multi-Layer Perceptron (MLP) with GELU [16] activation. Please see Sec. D in Supp. for more details on the design choices.
III-C2 Temporal Transformer
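To capture temporal dynamics of each entity across frames, we employ a temporal transformer (Temporal-T). For entity $k$, we denote $E^k \in \mathbb{R}^{T \times D}$ as its features across the $T$ frames. These features are processed by Temporal-T as:
$$ U_k^{(0)} = E^k + \mathrm{TPE} \tag{6} $$
$$ \hat{U}_k = \mathrm{MHSA}_t\big(\mathrm{LN}(U_k^{(0)})\big) + U_k^{(0)} \tag{7} $$
$$ U_k = \mathrm{FFN}\big(\mathrm{LN}(\hat{U}_k)\big) + \hat{U}_k \tag{8} $$
where TPE (Temporal Position Encoding) encodes temporal positions using sinusoidal embeddings, and $\mathrm{MHSA}_t$ captures motion patterns essential for distinguishing subtle micro-actions.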
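For readers unfamiliar with sinusoidal position encodings, the following is a minimal sketch of one common formulation (the sine/cosine scheme with base 10000 from the original Transformer). The text only states that the temporal encoding is sinusoidal, so the base, the interleaving of sine and cosine, and the function name `sinusoidal_encoding` are assumptions.

```python
import math


def sinusoidal_encoding(num_frames, dim):
    """Return a num_frames x dim table of sinusoidal position encodings."""
    table = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            # Dimension pairs (2m, 2m+1) share one frequency.
            freq = 1.0 / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(t * freq) if i % 2 == 0 else math.cos(t * freq))
        table.append(row)
    return table


# Encodings for 8 frames with a toy feature dimension of 4.
tpe = sinusoidal_encoding(num_frames=8, dim=4)
```

Each row would be added to the entity features of the corresponding frame before temporal attention, so that otherwise order-agnostic attention can distinguish early frames from late ones.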
III-C3 Bidirectional Processing Paths
Micro-actions exhibit diverse spatio-temporal characteristics—some defined by spatial configurations (e.g., “touch face”), others by temporal patterns (e.g., “leg shaking”). We arrange transformers in two complementary orders:
ST Path: First captures spatial entity arrangements, then models their temporal evolution:
| (9) | ||||
| (10) |
TS Path: First extracts temporal patterns per entity, then models their spatial relationships:
| (11) | ||||
| (12) |
The MLPs preserve original entity information through residual connections, enabling adaptive feature combination. Spatial-T and Temporal-T process their respective dimensions efficiently through batch-wise operations: Spatial-T operates on each frame independently (processing entities per frame), while Temporal-T operates on each entity independently (processing frames per entity).
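To make the difference between the two processing orders concrete, here is a deliberately simplified sketch. It uses a single-head residual self-attention with identity projections in place of the full transformers, and it omits the position encodings, feed-forward layers, and residual MLPs described above; those simplifications and all function names are assumptions made for illustration only. Because attention weights depend on the data, applying the spatial step before the temporal step yields different features than the reverse order.

```python
import math
import random


def self_attention(tokens):
    """Residual single-head self-attention with identity projections.

    tokens: list of N vectors (each a list of D floats). Returns N vectors.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        mixed = [sum(w * v[i] for w, v in zip(weights, tokens)) / total for i in range(d)]
        out.append([x + y for x, y in zip(q, mixed)])
    return out


def spatial_t(feats):
    """Attend across the K entities of each frame independently (feats[t][k] is a vector)."""
    return [self_attention(frame) for frame in feats]


def temporal_t(feats):
    """Attend across the T frames of each entity independently."""
    T, K = len(feats), len(feats[0])
    per_entity = [self_attention([feats[t][k] for t in range(T)]) for k in range(K)]
    return [[per_entity[k][t] for k in range(K)] for t in range(T)]


rng = random.Random(0)
T, K, D = 4, 3, 2
E = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)] for _ in range(T)]

st_out = temporal_t(spatial_t(E))   # ST path: spatial first, then temporal
ts_out = spatial_t(temporal_t(E))   # TS path: temporal first, then spatial

# Largest element-wise difference between the two path outputs.
gap = max(abs(a - b)
          for fs, ft in zip(st_out, ts_out)
          for vs, vt in zip(fs, ft)
          for a, b in zip(vs, vt))
```

Both outputs keep the same frame-by-entity-by-feature layout, which is what allows the two paths to be compared and blended entity by entity downstream.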
III-C4 Entity-Level Adaptive Routing
While the dual-path architecture captures complementary spatio-temporal patterns, a key question remains: how should ST and TS representations be combined? Simple concatenation or addition treats all entities uniformly, ignoring that different anatomical parts exhibit fundamentally distinct characteristics. Hands performing gestures are motion-dominant and benefit from temporal-first processing, while torso postures are configuration-dominant and favor spatial-first processing.
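We introduce entity-level adaptive routing that allows each body part to learn its optimal blend of ST and TS representations. For each entity $k$ at each temporal position $t$, given path outputs $f_{t,k}^{ST}, f_{t,k}^{TS} \in \mathbb{R}^{D}$, we concatenate them and compute routing scores through a lightweight entity-specific network:
$$ s_{t,k} = g_k\big([\,f_{t,k}^{ST} \,\Vert\, f_{t,k}^{TS}\,]\big) + \beta_k \tag{13} $$
where $[\cdot \,\Vert\, \cdot]$ denotes concatenation, $g_k$ is a two-layer network (Linear-LayerNorm-ReLU-Dropout-Linear), and $\beta_k$ is a learnable entity-type prior encoding anatomical biases. The routing scores $s_{t,k} \in \mathbb{R}^{2}$ are converted to normalized weights via temperature-scaled softmax:
$$ [\alpha_{t,k}^{ST},\; \alpha_{t,k}^{TS}] = \mathrm{softmax}\big(s_{t,k} / \tau_r\big) \tag{14} $$
where $\tau_r$ controls routing sharpness. The fused entity representation combines both paths according to learned preferences:
$$ \tilde{f}_{t,k} = \alpha_{t,k}^{ST}\, f_{t,k}^{ST} + \alpha_{t,k}^{TS}\, f_{t,k}^{TS} \tag{15} $$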
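The routing step can be illustrated with a short sketch. The routing scores are passed in directly rather than produced by a network, and the function name `route` and the temperature values are hypothetical; the point is only to show how a temperature-scaled softmax over two scores turns into a convex blend of the two path outputs.

```python
import math


def route(f_st, f_ts, scores, temperature=1.0):
    """Blend two path outputs with a temperature-scaled softmax over two scores.

    f_st, f_ts : feature vectors (lists of floats) from the ST and TS paths
    scores     : (s_st, s_ts) routing scores, e.g. network output plus entity prior
    temperature: hypothetical value; lower values give sharper routing
    Returns (fused_vector, (w_st, w_ts)).
    """
    scaled = [s / temperature for s in scores]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    w_st, w_ts = (e / sum(exps) for e in exps)
    fused = [w_st * a + w_ts * b for a, b in zip(f_st, f_ts)]
    return fused, (w_st, w_ts)


f_st, f_ts = [1.0, 0.0], [0.0, 1.0]
even, w_even = route(f_st, f_ts, scores=(0.0, 0.0))
soft, w_soft = route(f_st, f_ts, scores=(1.0, 0.0), temperature=1.0)
sharp, w_sharp = route(f_st, f_ts, scores=(1.0, 0.0), temperature=0.1)
```

With equal scores the two paths are blended evenly; with the same score gap, lowering the temperature pushes the weights toward a hard selection of the preferred path.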
III-D Mutual Action Consistency Learning
To ensure the dual paths learn consistent representations while maintaining their complementary strengths, we employ entity-aware contrastive learning between the ST and TS pathways.
For each entity $k$ and temporal position $t$, we enforce that representations from both paths align for the same spatio-temporal location while contrasting with different temporal positions:
$$ \mathcal{L}_{t,k} = -\log \frac{\exp\big(\langle z_{t,k}^{ST},\, z_{t,k}^{TS} \rangle / \tau\big)}{\sum_{t'=1}^{T} \exp\big(\langle z_{t,k}^{ST},\, z_{t',k}^{TS} \rangle / \tau\big)} \tag{16} $$
where $z_{t,k}^{ST}, z_{t,k}^{TS}$ are $\ell_2$-normalized features for entity $k$ at time $t$ from the respective paths, and $\tau$ is the temperature parameter. This formulation ensures that while both paths process the same entity differently (spatial-first vs temporal-first), they maintain agreement about which temporal segments are most relevant for each entity. The total MAC loss aggregates across all visible entities and frames:
$$ \mathcal{L}_{MAC} = \frac{1}{\sum_{t,k} w_{t,k}} \sum_{t=1}^{T} \sum_{k=1}^{K} w_{t,k}\, \mathcal{L}_{t,k} \tag{17} $$
where $w_{t,k}$ represents the keypoint confidence for entity $k$ at time $t$, naturally down-weighting occluded or uncertain entities. Note that MAC loss operates on the raw path outputs of the ST and TS paths before adaptive routing (section III-C4), providing a training signal that encourages temporal coherence between paths while allowing the routing module to independently learn entity-specific fusion strategies.
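A cross-path contrastive loss of this kind can be sketched as follows. This is an InfoNCE-style illustration, not necessarily the exact loss used: the temperature 0.1, the one-directional form (ST features as queries, TS features as keys), and the normalization by the total confidence weight are assumptions, and `mac_loss` is a hypothetical name.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mac_loss(st, ts, conf, tau=0.1):
    """Confidence-weighted contrastive loss between two paths.

    st, ts: features indexed [t][k] -> vector, from the ST and TS paths
    conf  : keypoint confidences indexed [t][k], used as weights
    tau   : hypothetical temperature (the value is not recoverable from the text)
    Positive pair: same entity, same frame. Negatives: same entity, other frames.
    """
    T, K = len(st), len(st[0])
    weighted_sum, weight_total = 0.0, 0.0
    for k in range(K):
        z_st = [l2_normalize(st[t][k]) for t in range(T)]
        z_ts = [l2_normalize(ts[t][k]) for t in range(T)]
        for t in range(T):
            logits = [sum(a * b for a, b in zip(z_st[t], z_ts[u])) / tau for u in range(T)]
            m = max(logits)
            log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
            weighted_sum += conf[t][k] * (log_denominator - logits[t])
            weight_total += conf[t][k]
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# One entity, three frames whose features point in distinct directions.
path_a = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
shifted = [path_a[1], path_a[2], path_a[0]]   # same features, wrong frame order
conf = [[1.0], [1.0], [1.0]]

aligned_loss = mac_loss(path_a, path_a, conf)
misaligned_loss = mac_loss(path_a, shifted, conf)
```

When the two paths agree frame by frame the loss is close to zero, and it grows sharply when the same features appear at the wrong temporal positions, which is the behavior the consistency term is meant to penalize.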
III-E Training Objectives
Given the dual-path outputs $F^{ST}, F^{TS} \in \mathbb{R}^{B \times T \times K \times D}$, with per-entity slices $f_{t,k}^{ST}$ and $f_{t,k}^{TS}$, we apply entity-level adaptive routing to obtain fused representations where each entity is combined according to its learned preferences:
$$ \tilde{f}_{t,k} = \alpha_{t,k}^{ST}\, f_{t,k}^{ST} + \alpha_{t,k}^{TS}\, f_{t,k}^{TS} \tag{18} $$
where $\tilde{f}_{t,k}$ is computed via Eqs. (13)–(15). The video-level entity representation is obtained by averaging across temporal and entity dimensions:
$$ v = \frac{1}{TK} \sum_{t=1}^{T} \sum_{k=1}^{K} \tilde{f}_{t,k} \tag{19} $$
This is concatenated with global appearance features for final classification:
$$ h = [\,v \,\Vert\, g\,] \tag{20} $$
where $g$ represents global context obtained via Global Average Pooling (GAP) over the CNN feature maps, capturing scene-level information that complements localized entity features. The model is trained with classification and consistency objectives:
$$ \mathcal{L} = \mathcal{L}_{CE}\big(\psi(h),\, y\big) + \lambda\, \mathcal{L}_{MAC} \tag{21} $$
where $\psi$ is a two-layer MLP classifier, $\mathcal{L}_{CE}$ is cross-entropy loss, $y$ denotes micro-action labels, and $\lambda$ balances the objectives. Importantly, $\mathcal{L}_{MAC}$ is computed on raw path outputs before routing, ensuring both paths receive gradient signals regardless of learned routing preferences. This design separates representation learning (via MAC) from adaptive fusion (via routing), allowing each component to fulfill its distinct role.
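The pooling, concatenation, and loss combination described above can be traced on toy numbers. In this sketch the classifier is skipped and logits are supplied directly; the weight `lam=0.5`, the stand-in global feature, and all function names are hypothetical.

```python
import math


def mean_pool(fused):
    """Average fused features indexed [t][k] -> vector over frames and entities."""
    vectors = [v for frame in fused for v in frame]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def total_loss(logits, label, mac, lam=0.5):
    """Classification loss plus a weighted consistency term; lam is a hypothetical weight."""
    return cross_entropy(logits, label) + lam * mac


fused = [[[1.0, 2.0], [3.0, 4.0]],      # frame 0: two entities
         [[5.0, 6.0], [7.0, 8.0]]]      # frame 1: two entities
video_repr = mean_pool(fused)           # averaged over T and K
global_ctx = [0.5]                      # stand-in for GAP over CNN feature maps
classifier_input = video_repr + global_ctx   # list concatenation

loss = total_loss(logits=[0.0, 0.0], label=0, mac=0.2, lam=0.5)
```

Because the consistency term enters only as an additive penalty, setting its weight to zero recovers plain cross-entropy training on the fused features.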
IV EXPERIMENTAL RESULTS
| Method | MA-52 Body Top-1 Acc. (%) | MA-52 Action Top-1 Acc. (%) | MA-52 Action Top-5 Acc. (%) | MA-52 Body F1 Macro (%) | MA-52 Body F1 Micro (%) | MA-52 Action F1 Macro (%) | MA-52 Action F1 Micro (%) | MA-52 F1 Overall (%) | iMiGUE Top-1 Acc. (%) | iMiGUE Top-5 Acc. (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| TSN [40] | 59.22 | 34.46 | 73.34 | 52.50 | 59.22 | 28.52 | 34.46 | 43.67 | 51.54 | 85.42 |
| TIN [33] | 73.26 | 52.81 | 85.37 | 66.99 | 73.26 | 39.82 | 52.81 | 58.22 | 52.38 | 86.15 |
| TSM [24] | 77.64 | 56.75 | 87.47 | 70.98 | 77.64 | 40.19 | 56.75 | 61.39 | 61.10 | 91.24 |
| MANet [14] | 78.95 | 61.33 | 88.83 | 72.87 | 78.95 | 49.22 | 61.33 | 65.59 | 62.54 | 92.18 |
| C3D [37] | 74.04 | 52.22 | 86.97 | 66.60 | 74.04 | 40.86 | 52.22 | 58.43 | 20.32 | 55.31 |
| I3D [3] | 78.16 | 57.07 | 88.67 | 71.56 | 78.16 | 39.84 | 57.07 | 61.66 | 34.96 | 63.69 |
| SlowFast [8] | 77.18 | 59.60 | 88.54 | 70.61 | 77.18 | 44.96 | 59.60 | 63.09 | 58.73 | 89.41 |
| VideoSwin-T [28] | 77.95 | 57.23 | 87.99 | 71.25 | 77.95 | 38.53 | 57.23 | 61.24 | 55.82 | 88.67 |
| TimesFormer [1] | 69.17 | 40.67 | 82.67 | 61.90 | 69.17 | 34.38 | 40.67 | 51.53 | 48.15 | 82.34 |
| UniFormer [22] | 79.03 | 58.89 | 87.29 | 71.80 | 79.03 | 48.01 | 58.89 | 64.43 | 57.29 | 89.95 |
| ST-GCN [42] | 69.87 | 49.61 | 79.54 | 61.53 | 69.87 | 34.64 | 49.61 | 53.91 | 46.97 | 84.09 |
| 2s-AGCN [34] | 70.07 | 49.48 | 78.27 | 61.30 | 70.07 | 34.64 | 49.48 | 53.87 | 47.78 | 88.43 |
| Shift-GCN [6] | 71.23 | 51.85 | 80.16 | 62.48 | 71.23 | 36.92 | 51.85 | 55.62 | 51.51 | 88.18 |
| CTR-GCN [5] | 72.06 | 52.61 | 81.22 | 63.46 | 72.06 | 37.79 | 52.61 | 56.48 | 52.94 | 89.76 |
| PoseConv3D [7] | 80.95 | 63.52 | 90.23 | 74.96 | 80.95 | 47.20 | 63.52 | 66.66 | 64.38 | 93.52 |
| PCAN [20] | 82.30 | 66.74 | 91.75 | 77.02 | 82.30 | 53.83 | 66.74 | 69.97 | – | – |
| Ours (Pose Only) | 79.64 | 61.25 | 89.42 | 73.18 | 79.64 | 46.73 | 61.25 | 65.20 | 68.92 | 94.35 |
| Ours (RGB Only) | 81.18 | 62.87 | 90.68 | 75.42 | 81.18 | 48.95 | 62.87 | 67.11 | 71.54 | 95.18 |
| Ours (Pose + RGB) | 83.50 | 65.10 | 92.27 | 78.31 | 83.50 | 54.18 | 65.10 | 68.72 | 76.88 | 96.72 |
IV-A Datasets and Evaluation Metrics
MA-52 Dataset [14] is a large-scale micro-action dataset collected through psychological interviews capturing unconscious human micro-behaviors. The dataset contains 22,422 samples annotated hierarchically at two levels: 7 body-level and 52 action-level categories. Following standard splits defined in [14], we use 11,250, 5,586, and 5,586 samples for training, validation, and testing respectively. The dataset provides both RGB frames and OpenPose body joints, enabling multi-modal analysis. Actions span 1-3 seconds and include subtle movements like “touching face,” “leg shaking,” and “arms crossing”.
iMiGUE Dataset [26] focuses on upper-body micro-gestures collected from sports interviews. The dataset contains 32 micro-gesture categories with 12,899, 777, and 4,562 samples for training, validation, and testing respectively. While conceptually similar to micro-actions, these micro-gestures are restricted to upper limbs, making skeleton data particularly relevant. We evaluate on iMiGUE to demonstrate our method’s generalization beyond full-body movements.
Evaluation Metrics. Following standard practice in micro-action recognition [14, 26], we adopt Top-1/Top-5 accuracy, micro and macro F1 scores as evaluation metrics. While accuracy provides direct classification performance, F1 score better handles class imbalance inherent in micro-action datasets. We compute F1 by averaging macro and micro F1 scores across both hierarchical levels (body-level and action-level), providing a balanced assessment across different granularities and class frequencies.
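As a small worked example of these metrics, the sketch below computes macro and micro F1 for single-label predictions and then averages four F1 values across the two hierarchy levels. The helper names are hypothetical and the toy labels are invented for illustration; only the MANet numbers are taken from Table I. For single-label classification, micro F1 coincides with accuracy, which is why the micro F1 columns in Table I match the corresponding Top-1 accuracies.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class) / len(per_class)
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    return macro, micro


def mean_f1(body_macro, body_micro, action_macro, action_micro):
    """Average the four F1 values across both hierarchy levels."""
    return (body_macro + body_micro + action_macro + action_micro) / 4


# Invented toy labels: one class-0 sample is misclassified as class 1.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2]
macro, micro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# MANet row of Table I: body macro/micro and action macro/micro F1.
manet_mean = mean_f1(72.87, 78.95, 49.22, 61.33)
```

Rounded to two decimals, `manet_mean` reproduces the 65.59 reported for MANet in the overall F1 column.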
IV-B Implementation Details
Architecture Configuration. We employ ResNet-101 with TSM [24] as our backbone, pretrained on Kinetics-400 [17]. SEM extracts entities for MA-52 and entities for iMiGUE datasets with dimension each. Both ST and TS paths use 3-layer transformers with 8 attention heads, hidden dimension 1024, and dropout 0.1. For entity-level adaptive routing, each of the entities has a dedicated routing network consisting of two linear layers (, i.e., ) with LayerNorm, ReLU activation, and dropout 0.1. The learnable entity-type bias is initialized to zero. The routing temperature provides soft routing that allows continuous blending between paths. This module adds approximately 0.5M parameters (3% overhead). The final classifier uses a 2-layer MLP (512→256→) with GeLU [16] activation and dropout 0.5, where is the number of action classes (52 for MA-52, 32 for iMiGUE as shown in Fig. 4). For MAC loss computation, we use temperature and .
Training Details. We sample 8 frames with temporal stride 8 for MA-52 dataset and 16 frames with stride 4 for iMiGUE dataset. Data augmentation includes random cropping to 224×224, horizontal flipping (p=0.5), and temporal jittering. We train with SGD [32] optimizer (momentum 0.9), initial learning rate 0.01 with cosine annealing schedule, batch size of 12, and weight decay of for 120 epochs. The first 10 epochs use linear warmup from 0.001 to 0.01. Implementation was done on a single GPU (Nvidia RTX 3090).
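The learning-rate schedule described above (linear warmup followed by cosine annealing) can be written as a small function. The final learning rate of zero and the per-epoch granularity are assumptions, since the text specifies only the warmup range, the base rate, and the number of epochs; `learning_rate` is a hypothetical name.

```python
import math


def learning_rate(epoch, total_epochs=120, warmup_epochs=10,
                  warmup_start=0.001, base_lr=0.01, min_lr=0.0):
    """Linear warmup followed by cosine annealing.

    min_lr and the exact interpolation are assumptions; the text gives only the
    warmup range (0.001 to 0.01 over 10 epochs), the base rate, and 120 epochs.
    """
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start + frac * (base_lr - warmup_start)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each epoch, plus the final value at epoch 120.
schedule = [learning_rate(e) for e in range(121)]
```

Under these assumptions the rate rises linearly to 0.01 at epoch 10, passes through 0.005 at the midpoint of the cosine phase, and decays to zero at epoch 120.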
IV-C Comparison with State-of-the-Art Methods
Table I compares our method with state-of-the-art approaches. On MA-52, we achieve 68.72% F1, competitive with PCAN [20] (69.97%) while using simple end-to-end training instead of PCAN’s complex 3-stage pipeline. On iMiGUE, we achieve state-of-the-art 76.88% Top-1 accuracy, surpassing PoseConv3D [7] by 12.50%.
Our adaptive keypoint-guided entities outperform MANet’s fixed regions [14] by 3.13% F1. While 3D CNNs (SlowFast [8]: 63.09%) and Transformers (UniFormer [22]: 64.43%) achieve reasonable performance, they require substantially more computation. Pure skeleton methods struggle—CTR-GCN [5] achieves only 56.48%, confirming that appearance cues are indispensable for micro-actions.
The bottom rows of Table I show modality ablations: pose-only (65.20%) surpasses all skeleton baselines, RGB-only (67.11%) demonstrates entity-aware processing benefits, and their fusion (68.72%) confirms complementarity. Even single-modality variants outperform heavier baselines, validating our dual-path design and MAC regularization. The 12.50% improvement on iMiGUE versus 3.13% on MA-52 reveals that adaptive entity extraction particularly excels for concentrated upper-body micro-gestures, where our keypoint guidance maintains semantic alignment across viewpoints while fixed regions fail.
IV-D Ablation Studies
Contribution of each component. Table II presents a systematic analysis of each component’s contribution. Starting from the TSM baseline (52.15% on MA-52), adding a single TS path improves accuracy to 55.96% (+3.81%) by capturing temporal dynamics. The Spatial Entity Module provides substantial gains (+3.25% MA-52, +5.39% iMiGUE), confirming that keypoint-guided entity extraction outperforms global features. Adding the ST path for dual-path processing further improves accuracy to 62.14% (+2.93%), validating that ST and TS paths capture complementary patterns. Analyzing the fusion components separately: MAC loss alone provides +2.26% on MA-52 and +2.87% on iMiGUE by enforcing cross-path temporal coherence, while entity routing alone contributes +0.95% and +0.83% respectively. Crucially, combining both achieves +2.96% and +4.23%; on iMiGUE this exceeds the sum of the individual gains (+3.70%), demonstrating synergistic effects where MAC ensures well-formed path representations while routing learns optimal entity-specific combinations. Overall, our full model improves by 12.95% on MA-52 and 18.15% on iMiGUE over the baseline.
| Configuration | SEM | Dual-path | MAC | Routing | MA-52 Top-1 (%) | iMiGUE Top-1 (%) |
|---|---|---|---|---|---|---|
| Baseline (TSM) | | | | | 52.15 | 58.73 |
| + TS Only | | | | | 55.96 | 63.48 |
| + SEM | ✓ | | | | 59.21 | 68.87 |
| + Dual Path | ✓ | ✓ | | | 62.14 | 72.65 |
| + MAC Loss | ✓ | ✓ | ✓ | | 64.40 | 75.52 |
| + Routing (w/o MAC) | ✓ | ✓ | | ✓ | 63.09 | 73.48 |
| Full Model | ✓ | ✓ | ✓ | ✓ | 65.10 | 76.88 |
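The component gains quoted for Table II can be re-derived directly from the table entries. The short script below does so; the dictionary keys and the helper name `gain` are hypothetical labels chosen for this sketch, while the numbers are copied from the table.

```python
# Top-1 accuracies (MA-52, iMiGUE) copied from Table II.
results = {
    "baseline": (52.15, 58.73),
    "dual_path": (62.14, 72.65),
    "mac_only": (64.40, 75.52),
    "routing_only": (63.09, 73.48),
    "full": (65.10, 76.88),
}


def gain(config, reference):
    """Per-dataset accuracy gain of `config` over `reference`, in points."""
    return tuple(round(a - b, 2) for a, b in zip(results[config], results[reference]))


mac_gain = gain("mac_only", "dual_path")          # MAC loss alone
routing_gain = gain("routing_only", "dual_path")  # routing alone
joint_gain = gain("full", "dual_path")            # MAC and routing together
summed = tuple(round(m + r, 2) for m, r in zip(mac_gain, routing_gain))
total_gain = gain("full", "baseline")
```

On iMiGUE the joint gain (4.23) is larger than the sum of the separate gains (3.70), whereas on MA-52 it is smaller (2.96 versus 3.21).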
Impact of Spatial Entity Extraction Methods. Table III compares different spatial entity extraction strategies. Center crop (58.3% Acc.) extracts only the central region of each frame, losing peripheral information crucial for micro-actions involving limbs or off-center movements. Fixed body regions, similar to MANet [14], achieve 61.3% accuracy but fail under pose variations. Our keypoint-guided approach achieves 65.1% accuracy by dynamically adapting entity boundaries based on detected joints. Removing confidence weighting drops performance to 63.1%, as unreliable keypoints corrupt entity features. Using only 4 entities (excluding lower body) reduces accuracy to 61.4%, confirming the importance of full-body modeling for MA-52. The 3.8% improvement over fixed regions validates adaptive entity extraction’s importance for handling pose variations and subtle movements.
| Method | Top-1 (%) | F1-mean |
|---|---|---|
| Center Crop | 58.3 | 0.594 |
| Fixed Body Regions | 61.3 | 0.656 |
| Keypoint-Guided (Ours) | 65.1 | 0.687 |
| w/o confidence weighting | 63.1 | 0.657 |
| w/ 4 entities (no lower body) | 61.4 | 0.641 |
Impact of temporal modeling. Table IV investigates temporal design choices. Frame sampling shows gains from 4 frames (59.8% Top-1) to 16 frames (65.1%), while 32 frames (64.7%) slightly degrades performance, suggesting temporal redundancy beyond 16 frames for 1-3 second micro-actions. For temporal aggregation, simple pooling (avg: 61.4%, max: 60.2%) discards temporal ordering. LSTM (62.8%) and Temporal Transformer (63.5%) preserve structure but process entities independently, missing critical inter-entity coordination. Our dual-path design (65.1%) captures synchronized movements essential for actions like “rubbing hands” where hand coordination defines the action. MAC loss granularity affects consistency: frame-level (62.7%) over-constrains at every timestamp, while video-level (65.1%) optimally balances by enforcing entity-wise consistency with temporal flexibility. Please see Supp. for additional ablation experiments.
| Method | Top-1 (%) | F1-mean |
|---|---|---|
| Frame Sampling | ||
| 4 frames | 59.8 | 0.638 |
| 8 frames | 63.2 | 0.663 |
| 16 frames | 65.1 | 0.687 |
| 32 frames | 64.7 | 0.672 |
| Temporal Aggregation | ||
| Average Pooling | 61.4 | 0.657 |
| Max Pooling | 60.2 | 0.641 |
| LSTM | 62.8 | 0.659 |
| Temporal Transformer | 63.5 | 0.668 |
| Dual-Path (Ours) | 65.1 | 0.687 |
| Entity Temporal Granularity | ||
| Frame-level MAC | 62.7 | 0.657 |
| Video-level MAC (Ours) | 65.1 | 0.687 |
IV-E Clinical Validation Study
To evaluate the potential clinical utility, we applied Micro-DualNet to an in-house dataset of 290 individuals (ages 5–52) recorded during 2–3 minute conversations with a research staff member [The study which includes this dataset was reviewed and approved by the Institutional Review Board at CHOP.]. Participants received licensed psychologist-supervised diagnostic evaluations and were classified into three groups: ASD (autism spectrum disorder, n = 120), PSY (non-autistic psychiatric conditions, n = 46), and TDC (typically developing, n = 124). For the ten most frequent micro-actions, we conducted pairwise group comparisons using two-part (hurdle) analysis: (a) probability of engagement via logistic GLM (Prob.), and (b) intensity among engagers via fractional logit (Int.). Table V summarizes significant group differences. Notably, the PSY group showed elevated “retracting feet” intensity compared to both ASD and TDC, while “turning head” intensity was lower in PSY than in both ASD and TDC. Fig. 3 illustrates intensity distributions for these two micro-actions; see Supp. for all comparisons. Although analyses controlling for demographics are needed for definitive interpretation, these results provide initial evidence that micro-action detection can be used to identify behavioral differences in psychiatric conditions as part of broader computational behavior analysis research.
| Action | Contrast | Type | Effect | p | |
|---|---|---|---|---|---|
| retracting feet | ASD vs. PSY | % | 0.37 | .001 | .004 |
| shaking legs | ASD vs. PSY | Prob. | 3.00 | .002 | .054 |
| shaking head | ASD vs. PSY | Prob. | 0.27 | .004 | .054 |
| retracting feet | PSY vs. TDC | % | 1.96 | .007 | .106 |
| head up | ASD vs. TDC | Prob. | 2.37 | .007 | .058 |
| stretching feet | ASD vs. TDC | Prob. | 0.50 | .008 | .058 |
| stretching feet | ASD vs. PSY | % | 0.48 | .017 | .172 |
| tilting head | ASD vs. PSY | Prob. | 2.31 | .019 | .102 |
| nodding | ASD vs. PSY | Prob. | 0.23 | .020 | .102 |
| turning head | ASD vs. PSY | % | 2.71 | .036 | .251 |
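The two-part (hurdle) analysis used for the group comparisons can be sketched in a few lines for the simplest case of a single binary group covariate. In that case both the logistic GLM and the fractional logit are saturated, so their fitted means equal the group means and the group coefficient reduces to a log odds ratio of those means. The sketch below relies on that closed form. It assumes that "intensity" is a per-person fraction in [0, 1), that effects are reported as odds ratios, and it uses made-up toy values. It does not compute p-values or adjust for covariates, and the function names are hypothetical.

```python
from statistics import mean

def odds(p):
    """Convert a proportion strictly between 0 and 1 into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("odds are undefined for proportions of 0 or 1")
    return p / (1.0 - p)

def hurdle_contrast(group_a, group_b):
    """Return (engagement odds ratio, intensity odds ratio) for group A vs. B.

    Each input is a list of per-person action fractions in [0, 1), where 0
    means the person never performed the action during the recording.
    """
    # Part (a): probability of engagement (did the action occur at all?).
    engaged_a = [1.0 if x > 0 else 0.0 for x in group_a]
    engaged_b = [1.0 if x > 0 else 0.0 for x in group_b]
    prob_ratio = odds(mean(engaged_a)) / odds(mean(engaged_b))

    # Part (b): intensity among engagers only (zeros are excluded).
    intensity_a = [x for x in group_a if x > 0]
    intensity_b = [x for x in group_b if x > 0]
    int_ratio = odds(mean(intensity_a)) / odds(mean(intensity_b))
    return prob_ratio, int_ratio

# Toy, made-up fractions for two hypothetical groups of four people each.
toy_a = [0.0, 0.0, 0.10, 0.30]
toy_b = [0.0, 0.05, 0.05, 0.20]
prob_ratio, int_ratio = hurdle_contrast(toy_a, toy_b)
print(f"engagement OR: {prob_ratio:.3f}, intensity OR: {int_ratio:.3f}")
```

Separating the two parts lets a group differ in how often an action occurs at all without that difference being confounded with how much of the recording the action occupies once it does occur. A full analysis with covariates and inference would use a GLM library rather than this closed form.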
IV-F Qualitative Results
t-SNE visualizations (Fig. 4) show that single-path models yield complementary patterns: ST groups position-based actions while TS separates motion-based ones, and Micro-DualNet combines these strengths with improved overall clustering. This aligns with Fig. 5: Micro-DualNet shows modest 3% gains on easy actions but a 31% improvement on hard actions, suggesting that complementary entity-centric processing most benefits challenging micro-actions where single paths struggle.
V DISCUSSION
Our results reveal three key insights. First, keypoint-guided entities show improved performance over fixed regions in our experiments, though further validation is needed. Second, the contrasting ST/TS performance patterns (ST excelling on position-defined actions, TS on motion-based ones) support the claim that micro-actions require flexible processing. Third, larger gains on iMiGUE (12.5%) than on MA-52 (3.1%) suggest that our approach particularly benefits concentrated micro-gestures.
Clinical Implications. Automatically detected micro-actions differ significantly across diagnostic groups. Elevated “retracting feet” in PSY and increased “shaking legs” in ASD align with established phenotypes [31, 19]. These findings suggest that Micro-DualNet could support scalable behavioral assessment.
VI CONCLUSIONS
We presented Micro-DualNet, a keypoint-guided dual-path framework for micro-action recognition. By processing anatomically-grounded entities through parallel ST and TS pathways with entity-level adaptive routing and MAC regularization, we achieve competitive performance on commonly used datasets. Beyond benchmarks, clinical validation indicates that detected micro-actions can reveal significant behavioral differences across ASD, psychiatric, and typically developing groups, providing initial evidence for real-world clinical utility. Our key contribution is demonstrating that micro-actions require flexible entity-level spatio-temporal processing, combined with interpretable routing that could inform automated behavioral assessment in healthcare settings.
Limitations. Our method depends on an external keypoint detector, making it vulnerable to pose estimation failures under severe occlusions. The dual-path architecture increases computational cost (1.9× that of a single path), and learned routing patterns may not transfer across datasets without fine-tuning. Fixed entity definitions may not optimally capture all micro-action types. Our clinical validation requires confirmation with demographic controls. Future work should explore learnable entity discovery, cross-dataset transfer, and expanded clinical evaluation.
References
- [1] (2021) Is space-time attention all you need for video understanding? In ICML, Vol. 2, pp. 4. Cited by: §II, TABLE I.
- [2] (2019) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (1), pp. 172–186. Cited by: §II, §III-B.
- [3] (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §II, TABLE I.
- [4] (2023) SMG: a micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis. International Journal of Computer Vision 131 (6), pp. 1346–1366. Cited by: §II.
- [5] (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13359–13368. Cited by: §I, §II, §IV-C, TABLE I.
- [6] (2020) Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 183–192. Cited by: §I, §II, TABLE I.
- [7] (2022) Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978. Cited by: §I, §II, §IV-C, TABLE I.
- [8] (2019) SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211. Cited by: §II, §IV-C, TABLE I.
- [9] (2013) Stereotypies in autism: a video demonstration of their clinical variability. Frontiers in integrative neuroscience 6, pp. 121. Cited by: §I.
- [10] (2024) Micro-action recognition via hierarchical fusion and inference. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, New York, NY, USA, pp. 11327–11332. ISBN 9798400706868. Cited by: §II, §II.
- [11] (2016) Measuring changes in social communication behaviors: preliminary development of the brief observation of social communication change (BOSCC). Journal of autism and developmental disorders 46 (7), pp. 2464–2479. Cited by: §I.
- [12] (2025) Motion matters: motion-guided modulation network for skeleton-based micro-action recognition. In Proceedings of the 33rd ACM International Conference on Multimedia. Cited by: §I, §II, §II, §II.
- [13] (2025) MM-gesture: towards precise micro-gesture recognition through multimodal fusion. arXiv preprint arXiv:2507.08344. Cited by: §II.
- [14] (2024) Benchmarking micro-action recognition: dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology 34 (7), pp. 6238–6252. Cited by: §I, §I, §I, §II, §II, §II, Figure 4, Figure 5, §IV-A, §IV-A, §IV-C, §IV-D, TABLE I, TABLE I, TABLE III, TABLE IV.
- [15] (2017) Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §III-B.
- [16] (2016) Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. Cited by: §III-C1, §IV-B.
- [17] (2017) The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §I, §IV-B.
- [18] (2011) HMDB: a large video database for human motion recognition. In 2011 International conference on computer vision, pp. 2556–2563. Cited by: §I.
- [19] (2011) Restricted and repetitive behaviors in autism spectrum disorders: a review of research in the last decade. Psychological bulletin 137 (4), pp. 562. Cited by: §V.
- [20] (2025) Prototypical calibrating ambiguous samples for micro-action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 4815–4823. Cited by: §IV-C, TABLE I.
- [21] (2023) Joint skeletal and semantic embedding loss for micro-gesture classification. arXiv preprint arXiv:2307.10624. Cited by: §II.
- [22] (2022) UniFormer: unified transformer for efficient spatial-temporal representation learning. In International Conference on Learning Representations. Cited by: §IV-C, TABLE I.
- [23] (2024) Advancing micro-action recognition with multi-auxiliary heads and hybrid loss optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, New York, NY, USA, pp. 11313–11319. ISBN 9798400706868. Cited by: §II, §II.
- [24] (2019) TSM: temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 7083–7093. Cited by: §I, §II, §II, §III-B, §IV-B, TABLE I.
- [25] (2024) Micro-gesture online recognition using learnable query points. arXiv preprint arXiv:2407.04490. Cited by: §II.
- [26] (2021) iMiGUE: an identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10631–10642. Cited by: §I, §I, §II, §II, Figure 4, §IV-A, §IV-A, TABLE I.
- [27] (2022) FineAction: a fine-grained video dataset for temporal action localization. IEEE transactions on image processing 31, pp. 6937–6950. Cited by: §I.
- [28] (2022) Video Swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3202–3211. Cited by: §II, §II, TABLE I.
- [29] (2021) Dense semantics-assisted networks for video action recognition. IEEE Transactions on Circuits and Systems for Video Technology 32 (5), pp. 3073–3084. Cited by: §I.
- [30] (2019) TS-LSTM and Temporal-Inception: exploiting spatiotemporal dynamics for activity recognition. Signal Processing: Image Communication 71, pp. 76–87. Cited by: §II.
- [31] (2013) The diagnostic and statistical manual of mental disorders (DSM-5). Philadelphia: American Psychiatric Association. Cited by: §V.
- [32] (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §IV-B.
- [33] (2020) Temporal interlacing network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 11966–11973. Cited by: §I, TABLE I.
- [34] (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12026–12035. Cited by: §II, TABLE I.
- [35] (2020) Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Transactions on Image Processing 29, pp. 9532–9545. Cited by: §I.
- [36] (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §I.
- [37] (2015) Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: TABLE I.
- [38] (2008) Visualizing data using t-SNE. Journal of machine learning research 9 (11). Cited by: Figure 4.
- [39] (2023) VideoMAE V2: scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14549–14560. Cited by: §II.
- [40] (2018) Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence 41 (11), pp. 2740–2755. Cited by: §I, §II, TABLE I.
- [41] (2021) Spatiotemporal multimodal learning with 3D CNNs for video action recognition. IEEE Transactions on Circuits and Systems for Video Technology 32 (3), pp. 1250–1261. Cited by: §I.
- [42] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §I, §II, TABLE I.