
Deep Learning for Human Pose Analysis

This survey paper reviews the advancements in human pose estimation, tracking, and action recognition through deep learning techniques. It highlights the interconnected nature of these tasks and discusses various methodologies, strengths, and limitations, while also providing insights into benchmark datasets and future research directions. The survey aims to unify these tasks within a deep learning framework to enhance practical applications across multiple domains.

Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey

Lijuan Zhou1, Xiang Meng1†, Zhihuan Liu1†, Mengqi Wu1†, Zhimin Gao1*, Pichao Wang2

1 School of Computer and Artificial Intelligence, Zhengzhou University, China.
2 Amazon Prime Video, USA.

arXiv:2310.13039v1 [[Link]] 19 Oct 2023

*Corresponding author(s). E-mail(s): iegaozhimin@[Link]
Contributing authors: ieljzhou@[Link]; mengxiangzzu@[Link]; liuzhihuanzzu@[Link]; mengqiwuzzu@[Link]; pichaowang@[Link]
†These authors contributed equally to this work.

Abstract

Human pose analysis has garnered significant attention within both the research community and practical applications, owing to its expanding array of uses, including gaming, video surveillance, sports performance analysis, and human-computer interaction, among others. The advent of deep learning has significantly improved the accuracy of pose capture, making pose-based applications increasingly practical. This paper presents a comprehensive survey of pose-based applications utilizing deep learning, encompassing pose estimation, pose tracking, and action recognition. Pose estimation involves the determination of human joint positions from images or image sequences. Pose tracking is an emerging research direction aimed at generating consistent human pose trajectories over time. Action recognition, on the other hand, targets the identification of action types using pose estimation or tracking data. These three tasks are intricately interconnected, with the latter often reliant on the former. In this survey, we comprehensively review related works, spanning from single-person to multi-person pose estimation, from 2D to 3D pose estimation, from single images to video, from gradually mining temporal context to pose tracking, and lastly from tracking to pose-based action recognition. As a survey centered on the application of deep learning to pose analysis, we explicitly discuss both the strengths and limitations of existing techniques. Notably, we emphasize methodologies for integrating these three tasks into a unified framework within video sequences. Additionally, we explore the challenges involved and outline potential directions for future research.

Keywords: Pose Estimation, Pose Tracking, Action Recognition, Deep Learning, Survey

1 Introduction

Human pose estimation, tracking, and pose-based action recognition represent three fundamental research directions within the field of computer vision. These areas have a broad spectrum of applications, spanning video surveillance, human-computer interaction, gaming, sports analysis, intelligent driving, and the emerging landscape of new retail stores. Articulated human pose estimation involves estimating the configuration of the human body in a given image or video. Human pose tracking aims to generate consistent pose trajectories over time, which are usually used to analyze the motion properties of humans. Human pose-based or skeleton-based action recognition recognizes the types of actions based on pose estimation or tracking data. Although these three tasks fall within the domain of human motion analysis, they are typically treated as distinct entities in the existing literature.

Human motion analysis is a long-standing research topic, and there is a vast body of work and several surveys on this task (Gavrila, 1999; Aggarwal and Cai, 1999; Moeslund and Granum, 2001; Wang et al., 2003; Moeslund et al., 2006; Poppe, 2007; Sminchisescu, 2008; Ji and Liu, 2009; Moeslund et al., 2011). In these surveys, human detection, tracking, pose estimation and motion recognition are usually reviewed together. Several survey papers have summarized the research on human pose estimation (Liu et al., 2015; Sarafianos et al., 2016), tracking (Yilmaz et al., 2006; Watada et al., 2010; Salti et al., 2012; Smeulders et al., 2013; Wu et al.,
2015), and action recognition (Cedras and Shah, 1995; Turaga et al., 2008; Poppe, 2010; Guo and Lai, 2014). With the development of deep learning, the three tasks have achieved significant improvements compared to the hand-crafted feature era (Zhu et al., 2016; Wang et al., 2018). The previous surveys either reviewed the whole vision-based human motion domain (Gavrila, 1999; Aggarwal and Cai, 1999; Moeslund and Granum, 2001; Wang et al., 2003; Moeslund et al., 2006; Poppe, 2007; Sminchisescu, 2008; Ji and Liu, 2009), or focused on specific tasks (Liu et al., 2015; Sarafianos et al., 2016; Wang et al., 2018; Chen et al., 2020; Liu et al., 2022; Sun et al., 2022; Zheng et al., 2023; Xin et al., 2023). However, no survey paper simultaneously reviews pose estimation, pose tracking, and pose-based action recognition. Inspired by the Lagrangian viewpoint of motion analysis (Rajasegaran et al., 2023), pose information and tracking are beneficial for action recognition; therefore, these three tasks are closely related to each other. Reviewing the methods that link the three tasks together is significantly useful, providing a deep understanding of the separate solutions for each task and further exploration toward a unified solution of the joint tasks.

In this paper, we conduct a comprehensive review of previous works using deep learning approaches on these three tasks individually, and discuss the strengths and weaknesses of previous research. Furthermore, we elucidate the inherent connections that bind these three tasks together, while championing the adoption of a deep learning-based framework that seamlessly integrates them. Specifically, we review previous deep learning works from 2D to 3D pose estimation, from single images to videos, from gradually mining temporal contexts to pose tracking, and lastly from tracking to pose-based action recognition. According to the number of persons for pose estimation, 2D/3D pose estimation can be divided into single-person and multi-person pose estimation. Depending on the input to the networks, each category can be further divided into image- and video-based single-person/multi-person pose estimation. To link the poses across frames, pose tracking can be divided into post-processing and integrated methods for single-person pose tracking, and top-down and bottom-up approaches for multi-person pose tracking. After obtaining the trajectory of poses in the videos, pose-based action recognition can be naturally conducted, which can be divided into estimated-pose and skeleton-based action recognition. The former takes RGB videos as the input and jointly conducts pose estimation, tracking, and action recognition. The latter extracts skeleton sequences captured by sensors such as motion capture, time-of-flight, and structured-light cameras for action recognition. For skeleton-based action recognition, four categories are identified: Convolutional Neural Network (CNN)-, Recurrent Neural Network (RNN)-, Graph Convolutional Network (GCN)- and Transformer-based approaches. Fig. 1 illustrates the taxonomy of this survey.

The key novelty of this survey is the focus on three closely related tasks that use deep learning approaches, which has never been done in previous surveys. In reviewing the various methods, consideration has been given to the connections between the three tasks; hence, this survey tends to discuss the advantages and limitations of the reviewed methods from the viewpoint of assembling them into more practical applications. This is the first survey to put them together and analyze their inner connections in the deep learning era. Besides, this survey distinguishes itself from other surveys through the following contributions:
• A thorough and all-encompassing coverage of the most advanced deep learning-based methodologies developed since 2014. This extensive coverage affords readers a comprehensive overview of the latest research methodologies and their outcomes.
• An insightful categorization and analysis of methods on the three tasks, and highlights of the pros and cons, promoting potential exploration of better solutions.
• An extensive review of the most commonly used benchmark datasets for these three tasks, and the state-of-the-art results on these benchmarks.
• An earnest discussion of the challenges of the three tasks and potential research directions through limitation analysis of available methods.

Subsequent sections of this survey are organized as follows. Sections 2 through 4 delve into the methods of pose estimation, pose tracking, and action recognition, respectively. Commonly used benchmark datasets and the performance comparison for the three tasks are described in Section 5. Challenges of these three tasks and pointers to future directions are presented in Section 6. The survey provides concluding remarks in Section 7.

2 Pose estimation

Human representation can be approached through three distinct models: the kinematic model, the planar model, and the volumetric model. The kinematic model employs a combination of joint positions and limb orientations to faithfully depict the human body's structure. In contrast, the planar model utilizes rectangles to represent both body shape and appearance, while the volumetric model leverages mesh data to capture the intricacies of the human body's shape. It is essential to underscore that this paper exclusively focuses on the kinematic model-based human representation.

Pose estimation, pose tracking and action recognition are three intimately interrelated tasks. Fig. 2 shows the relationship among the three tasks. Pose estimation aims to estimate joint coordinates from
Fig. 1 The taxonomy of this survey.

an image or a video. Pose tracking is an extension of pose estimation in the context of videos, which associates each estimated pose with its corresponding identity over time. It is interesting to note that a recent work (Choudhury et al., 2023) tends to estimate poses after tracking volumes of persons, which implies a two-way relationship between pose estimation and tracking. Pose-based action recognition aims to give the tracked pose with an identity the corresponding action label.

For pose estimation, we generally classify the reviewed methods into two categories, 2D pose estimation and 3D pose estimation. 2D pose estimation is to estimate the 2D (x, y) coordinates of each joint from an RGB image or video, while 3D pose estimation is to estimate the 3D (x, y, z) coordinates.

2.1 2D pose estimation

For 2D pose estimation, two sub-divisions are identified: single-person pose estimation and multi-person pose estimation. Depending on the input to the networks, single (multi-) person pose estimation can be further divided into image-based and video-based single (multi-) person pose estimation.

2.1.1 Image-based single-person pose estimation

For image-based Single-Person Pose Estimation (SPPE), the task involves providing the position and a rough scale of a person or their bounding box as a precursor to the estimation process. Early works adopt the pictorial structures framework, which represents an object by a collection of parts arranged in a deformable configuration, and a part
Fig. 2 The relationship among the three tasks.

in the collection is an appearance template matched in an image. Different from early works, the deep learning-based methods target locating keypoints of human parts. Two typical frameworks, namely direct regression and heatmap-based approaches, are available for image-based single-person pose estimation. In the direct regression-based approach, keypoints are directly predicted from the image features, whereas the heatmap-based approach initially generates heatmaps and subsequently infers keypoint locations based on these heatmaps. Fig. 3 provides an illustrative overview of the general framework for image-based 2D SPPE, showcasing the two predominant approaches.

(1) Regression-based approach

The pioneering work DeepPose (Toshev and Szegedy, 2014) formulates pose estimation as a convolutional neural network (CNN)-based regression task towards body joints. A cascade of regressors is adopted to refine the pose estimates, as shown in Fig. 4. This work can reason about pose in a holistic fashion in occlusion situations. Carreira et al. (Carreira et al., 2016) introduced the Iterative Error Feedback approach, wherein prediction errors are recursively fed back into the input space, resulting in progressively improved estimations. Sun et al. (Sun et al., 2017) presented a reparameterized pose representation using bones instead of joints. This method defines a compositional loss function that captures the long-range interactions within the pose by exploiting the joint connection structure. In more recent developments, (Luvizon et al., 2019) introduced a novel approach that employs softmax functions to convert heatmaps into coordinates in a fully differentiable manner. This innovative technique was coupled with a keypoint error distance-based loss function and context-based structures.

Subsequently, researchers (Mao et al., 2021; Li et al., 2021; Mao et al., 2022; Panteleris and Argyros, 2022) began exploring pose estimation methods based on transformer architectures. The attention modules in transformers offer the ability to capture long-range dependencies and global evidence crucial for accurate pose estimation. For example, TFPose (Mao et al., 2021) first introduced the Transformer to the pose estimation framework in a regression-based manner. PRTR (Li et al., 2021) introduced a two-stage, end-to-end regression-based framework that employs cascading Transformers, achieving state-of-the-art performance among regression-based methods. Mao et al. (Mao et al., 2022) framed pose estimation as a sequence prediction task, which they addressed with the Poseur model.

However, it is worth noting that these direct regression methods sometimes struggle in high-precision scenarios. This limitation may stem from the intricate mapping of RGB images to (x, y) locations, adding unnecessary complexity to the learning process and hampering generalization. For instance, direct regression may encounter challenges when handling multi-modal outputs, where a valid joint appears in two distinct spatial locations. The constraint of producing a single output for a given regression input can limit the network's ability to represent small errors, potentially leading to over-training.

(2) Heatmap-based approach

Heatmaps have gained substantial attention due to their ability to provide comprehensive spatial information, making them invaluable for training Convolutional Neural Networks (CNNs). This has spurred a surge of interest in the development of CNN architectures for pose estimation. Jain et al. (Jain et al., 2014) pioneered an approach where multiple CNNs were trained for independent binary body-part classification, with each network dedicated to a specific feature. This strategy effectively constrained the network's outputs to a much smaller class of valid configurations, enhancing overall performance. Recognizing the importance of structural domain constraints, such
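As a concrete illustration of the differentiable heatmap-to-coordinate conversion discussed under the regression-based approach above, the following is a minimal NumPy sketch of soft-argmax. It is our own simplification, not the authors' code; the temperature `beta` and map size are illustrative choices.

```python
import numpy as np

def soft_argmax(heatmap, beta=10.0):
    """Differentiable decoding of a 2D heatmap into an (x, y) coordinate.

    Instead of a hard argmax, the map is converted to a probability
    distribution with a softmax and the expected coordinate is returned,
    so gradients can flow back through the decoding step (the idea used
    by Luvizon et al., 2019). `beta` is an illustrative sharpening
    temperature, not a value from the paper.
    """
    h, w = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))  # numerically stable softmax
    p /= p.sum()
    x = (p.sum(axis=0) * np.arange(w)).sum()      # expected x under the map
    y = (p.sum(axis=1) * np.arange(h)).sum()      # expected y under the map
    return x, y

# A map sharply peaked at column 12, row 7 decodes to roughly (12, 7).
hm = np.zeros((64, 48), dtype=np.float32)
hm[7, 12] = 5.0
print(soft_argmax(hm))
```

Because the output is an expectation rather than an index, the same operation also softens the multi-modality issue noted above: two equally strong peaks decode to a point between them rather than an arbitrary one of the two.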
Fig. 3 The framework of two approaches for image-based 2D SPPE.

Fig. 4 The DeepPose architecture (Toshev and Szegedy, 2014).

as the geometric relationships between body joint locations, Tompson et al. (Tompson et al., 2014) pursued a joint training approach, simultaneously training CNNs and graphical models for human pose estimation. Similarly, Chen and Yuille (Chen and Yuille, 2014) adopt ConvNets to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. To address the limitations of pooling techniques in (Tompson et al., 2014) for improving spatial locality precision, Tompson et al. (Tompson et al., 2015) proposed a position refinement model (namely, a multi-resolution ConvNet) that is trained to predict the joint offset location within a localized region of the image. The works of (Tompson et al., 2014), (Chen and Yuille, 2014) and (Tompson et al., 2015) sought to merge the representational flexibility inherent in graphical models with the efficiency and statistical power offered by CNNs. To avoid using graphical models, Wei et al. (Wei et al., 2016) introduced the Convolutional Pose Machines to learn long-range spatial relationships without explicitly adopting graphical models. Hu and Ramanan (Hu and Ramanan, 2016) proposed an architecture that can be used for multiple stages of prediction, and ties weights in the bottom-up and top-down portions of computation as well as across iterations. Similarly, Newell et al. (Newell et al., 2016) proposed the Stacked Hourglass Network (SHN) for single-person pose estimation. The SHN leverages a series of successive pooling and upsampling steps to generate a final set of predictions, showcasing its efficacy. In addressing challenging scenarios characterized by severe part occlusions, Bulat and Tzimiropoulos (Bulat and Tzimiropoulos, 2016) presented a detection-followed-by-regression CNN cascade. This robust approach adeptly infers poses, even in the presence of significant occlusions. Lifshitz et al. (Lifshitz et al., 2016) introduced a novel voting scheme that harnesses information from the entire image, allowing for the aggregation of numerous votes to yield highly accurate keypoint detections. Chu et al. (Chu et al., 2017) incorporated CNNs into their approach, enhancing it with a multi-context attention mechanism for pose estimation. This dynamic mechanism autonomously learns and infers contextual representations, directing the model's focus toward regions of interest. Furthermore, Yang et al. (Yang et al., 2017) devised a Pyramid Residual Module (PRM) to bolster the scale invariance of CNNs. PRMs effectively learn feature pyramids, which prove instrumental in precise pose estimation.

With the development of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), Chen et al. (Chen et al., 2017) designed discriminators to distinguish real poses from fake ones, incorporating priors about the structure of human bodies. Ning et al. (Ning et al., 2017) proposed to explore external knowledge to guide the network training process using learned projections that impose a proper prior. Sun et al. (Sun et al., 2017) presented a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation. Marras et al. (Marras et al., 2017) introduced a Markov Random Field (MRF)-based spatial model network between the coarse and
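The heatmap-based methods reviewed above all need per-joint target maps to supervise the network. A common recipe, used with variations across the cited works, places an unnormalized 2D Gaussian at each annotated joint; the sizes and sigma below are illustrative, not taken from any particular paper.

```python
import numpy as np

def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Ground-truth target for one joint: an unnormalized 2D Gaussian
    centered on the annotated keypoint (cx, cy). Heatmap-based methods
    typically regress such maps with an L2 loss; sigma controls how much
    spatial tolerance the supervision gives the network.
    """
    xs = np.arange(width)[None, :]   # shape (1, W)
    ys = np.arange(height)[:, None]  # shape (H, 1), broadcasts to (H, W)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# One (H, W) map per joint; here a 16-joint skeleton on a 64x48 grid.
joints = [(12, 7), (30, 40)] + [(24, 32)] * 14   # (x, y) per joint
target = np.stack([gaussian_heatmap(48, 64, x, y) for x, y in joints])
print(target.shape)
```

The stacked `(J, H, W)` tensor is what a network head is trained against, one channel per joint.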
the refinement model, which introduces geometric constraints on the relative locations of the body joints. To deal with the pose annotation problem, Liu and Ferrari (Liu and Ferrari, 2017) presented an active learning framework for pose estimation. Ke et al. (Ke et al., 2018) proposed a multi-scale structure-aware network for human pose estimation. Peng et al. (Peng et al., 2018) proposed adversarial data augmentation to jointly optimize data augmentation and network training. The main idea is to design an augmentation network (generator) that competes against a target network (discriminator) by generating "hard" augmentation operations online. Tang et al. (Tang et al., 2018) introduced a Deeply Learned Compositional Model for pose estimation by exploiting deep neural networks to learn compositions of the human body. Nie et al. (Nie et al., 2018a) proposed the parsing-induced learner, including a parsing encoder and a pose model parameter adapter, which estimates dynamic parameters in the pose model through joint learning to extract complementary useful features for more accurate pose estimation. Nie et al. (Nie et al., 2018b) proposed to jointly conduct human parsing and pose estimation in one framework by incorporating information from their counterparts, giving more robust and accurate results. Tang and Wu (Tang and Wu, 2019) proposed a data-driven approach to group related parts based on how much information they share, and then a part-based branching network (PBN) is introduced to learn representations specific to each part group. To speed up pose estimation, Zhang et al. (Zhang et al., 2019) presented a Fast Pose Distillation (FPD) model that trains a lightweight pose neural network architecture capable of executing rapidly with low computational cost, by effectively transferring pose structure knowledge from a robust teacher network.

In summary, regression-based methods have advantages in speed but disadvantages in accuracy on the pose estimation task. Heatmap-based methods can explicitly learn spatial information by estimating heatmap likelihoods, resulting in high accuracy. However, heatmap-based methods seriously suffer from a long-standing challenge known as the quantization error problem, which is caused by mapping continuous coordinate values into discretized, downscaled heatmaps. To address this problem, Li et al. (Li et al., 2022) proposed a Simple Coordinate Classification (SimCC) method, which formulates pose estimation as two classification tasks for the horizontal and vertical coordinates. Despite the improvement in quantization error, the estimation of heatmaps requires exceptionally high computational cost, resulting in slow preprocessing operations. Therefore, how to take advantage of both heatmap-based and regression-based methods remains a challenging problem. Some works (Li et al., 2021; Ye et al., 2023) tend to solve the above problem by transferring knowledge from heatmap-based to regression-based models. However, due to the different output spaces of regression models and heatmap models, directly transferring knowledge between heatmaps and vectors may result in information loss. To this end, DistilPose (Ye et al., 2023) (as shown in Fig. 5) is proposed to transfer heatmap-based knowledge from a teacher model to a regression-based student model through a token-distilling encoder and simulated heatmaps.

2.1.2 Image-based multi-person pose estimation

Compared with single-person pose estimation (SPPE), multi-person pose estimation (MPPE) is more difficult. First, the number and positions of the persons are not given, and a pose can occur at any position or scale; second, interactions between people induce complex spatial interference, due to contact, occlusion, and limb articulations, making the association of parts difficult; third, runtime complexity tends to grow with the number of people in the image, making real-time performance a challenge. MPPE must address both global (human-level) and local (keypoint-level) dependencies (as depicted in Fig. 6), which involve different levels of semantic granularity. Mainstream solutions are normally two-stage approaches, which divide the problem into two separate subproblems: global human detection and local keypoint regression. Typically, two primary frameworks have been proposed to tackle these subproblems, known as the top-down and bottom-up approaches. Inspired by the success of end-to-end object detection, another viable solution is the one-stage approach, which aims to develop a fully end-to-end trainable method capable of unifying the two disassembled subproblems.

(1) Top-down approach

Top-down approaches in multi-person pose estimation begin by detecting all individuals within a given image, as shown in Fig. 7, and subsequently employ single-person pose estimation techniques within each detected bounding box.

A group of methods (Papandreou et al., 2017; He et al., 2017; Xiao et al., 2018; Moon et al., 2019; Sun et al., 2019; Cai et al., 2020; Huang et al., 2020; Zhang et al., 2020; Wang et al., 2020; Xu et al., 2022; Jiang et al., 2023; Gu et al., 2023) aims at designing and improving modules within pose estimation networks. Papandreou et al. (Papandreou et al., 2017) adopt Faster R-CNN (Ren et al., 2015) for person detection and keypoint estimation within the bounding box. They introduce an aggregation procedure to obtain highly localized keypoint predictions, along with a keypoint-based Non-Maximum Suppression (NMS) to prevent duplicate pose detection. Sun et al. (Sun et al., 2019) proposed a novel High-Resolution Network (HRNet) to learn such representations. To address systematic errors in standard data transformation and encoding-decoding structures that degrade top-down pipeline performance, Huang et al. (Huang et al., 2020) proposed
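The quantization error of downscaled heatmaps and the SimCC-style remedy described above can be made concrete with a small numeric sketch. The stride, image width, and splitting factor below are illustrative values, not the settings of any cited paper.

```python
import numpy as np

# Quantization error: a heatmap predicted at 1/4 of the input resolution
# can only place a keypoint on the coarse grid, so argmax decoding is
# off by up to roughly half a stride.
stride = 4
true_x = 101.0                      # ground-truth x in input pixels
grid_x = round(true_x / stride)     # nearest cell the heatmap can express
decoded_x = grid_x * stride         # argmax decoded back to input scale
quant_error = abs(decoded_x - true_x)

# SimCC-style remedy (Li et al., 2022): classify each axis over sub-pixel
# bins. The splitting factor k is an illustrative choice.
k = 2                               # bins per input pixel
width = 192
logits_x = np.zeros(width * k)      # stand-in for the network's x-axis head
logits_x[int(true_x * k)] = 10.0    # pretend the network is confident here
pred_x = logits_x.argmax() / k      # classify, then map bin back to pixels
print(quant_error, pred_x)
```

Because the bins are finer than the input pixel grid, the classification view removes the downscaling error without ever materializing a 2D heatmap.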
Fig. 5 The DistilPose framework (Ye et al., 2023).

Fig. 6 Perception of multi-person pose estimation task (Yang et al., 2023).

solutions to correct common biased data processing in human pose estimation.

Human detectors may fail in the first step of the top-down pipeline due to occlusion caused by overlapping limbs. Another group of works (Iqbal and Gall, 2016; Fang et al., 2017; Chen et al., 2018; Su et al., 2019; Qiu et al., 2020) aims to address this issue. Fang et al. (Fang et al., 2017) proposed a novel Regional Multi-person Pose Estimation (RMPE) framework to facilitate pose estimation even when inaccurate human bounding boxes exist. Chen et al. (Chen et al., 2018) designed a Cascaded Pyramid Network (CPN) that contains a GlobalNet and a RefineNet for localizing simple and hard (occluded) keypoints, respectively. Su et al. (Su et al., 2019) proposed two novel modules to enhance the information for multi-person pose estimation under occluded scenes, namely the Channel Shuffle Module (CSM) and the Spatial, Channel-wise Attention Residual Bottleneck (SCARB), where CSM promotes cross-channel information communication among the pyramid feature maps and SCARB highlights the information of feature maps in both the spatial and channel-wise contexts. An occluded pose estimation and correction module (Qiu et al., 2020) is proposed to solve the occlusion problem in crowded pose estimation.

Much like single-person pose estimation, multi-person pose estimation has also undergone rapid advancements, transitioning from CNNs to vision transformer networks. Some recent works tend to treat the transformer as a better decoder. TransPose (Yang et al., 2021) processes the features extracted by CNNs to model the global relationship. Zhou et al. (Zhou et al., 2023) proposed a Bottom-Up Conditioned Top-Down pose estimation (BUCTD) method, which modifies TransPose to accept conditions as side-information generated by CTD. Different from other top-down methods, BUCTD applies a bottom-up model as the person detector. TokenPose (Li et al., 2021) proposes a token-based representation to estimate the locations of occluded keypoints and model the relationship among different keypoints. HRFormer (Yuan et al., 2021) proposes to fuse multi-resolution features with a transformer module. The above works either require CNNs for feature extraction or careful designs of transformer structures. In contrast, a simple yet effective baseline model, ViTPose (Xu et al., 2022), is proposed based on plain vision transformers.

(2) Bottom-up approach

In contrast to the top-down approach, the bottom-up approach initially detects all individual body parts or keypoints and subsequently associates them with the corresponding subjects using part association strategies. The seminal work of Pishchulin et al. (Pishchulin et al., 2016) proposed a bottom-up approach that jointly labels part detection candidates and associates them to individual people. However, solving the integer linear programming problem over a fully connected graph is NP-hard, and the average processing time is on the order of
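The top-down pipeline discussed in this subsection reduces to simple glue code around two models. The following is a minimal sketch with placeholder callables: the detector and single-person estimator stand in for real networks such as Faster R-CNN and HRNet, and the toy stand-ins at the bottom exist only to show the coordinate bookkeeping.

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 in image pixels

def top_down_mppe(image: np.ndarray,
                  detect_people: Callable[[np.ndarray], List[Box]],
                  estimate_pose: Callable[[np.ndarray], np.ndarray]):
    """Skeleton of the top-down pipeline: detect persons, crop each box,
    run a single-person estimator inside it, and shift the joints back
    to image coordinates. Runtime grows linearly with the number of
    detected people, which is the cost discussed in the text.
    """
    poses = []
    for (x0, y0, x1, y1) in detect_people(image):
        crop = image[y0:y1, x0:x1]
        joints = estimate_pose(crop)             # (J, 2) in crop coordinates
        poses.append(joints + np.array([x0, y0]))  # back to image coordinates
    return poses

# Toy stand-ins: two fixed boxes, and an "estimator" that puts every
# joint at the crop centre.
img = np.zeros((240, 320, 3))
dets = lambda im: [(10, 20, 110, 220), (150, 30, 250, 230)]
spe = lambda crop: np.tile([crop.shape[1] / 2, crop.shape[0] / 2], (17, 1))
print([p[0] for p in top_down_mppe(img, dets, spe)])
```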
Fig. 7 The framework of two approaches for image-based 2D MPPE. Part of the figure is from (Zheng et al., 2020).

hours. In the work by Insafutdinov et al. (Insafut- approach. Newell et al. (Newell et al., 2017) simulta-
dinov et al., 2016), a more robust part detector and neously produced score maps and pixel-wise embed-
innovative image-conditioned pairwise terms were ding to group the candidate keypoints among differ-
proposed to enhance runtime efficiency. Neverthe- ent people to get final multi-person pose estimation.
less, this work encountered challenges in precisely regressing the pairwise representations, and a separate logistic regression is required. Iqbal and Gall (Iqbal and Gall, 2016) considered multi-person pose estimation as a joint-to-person association problem: they construct a fully connected graph from the set of detected joint candidates in an image and resolve the joint-to-person association and outlier detection using integer linear programming. OpenPose (Cao et al., 2017a,b) proposed the first bottom-up representation of association scores via Part Affinity Fields (PAFs), a set of 2D vector fields that encode the location and orientation of limbs over the image domain. Kreiss et al. (Kreiss et al., 2019) proposed to use a Part Intensity Field (PIF) for body part localization and a PAF for body part association to form full human poses. To handle missed small-scale persons, Cheng et al. (Cheng et al., 2023) proposed multi-scale training and dual anatomical centers to enhance the network. The above methods mainly locate keypoints through heatmap prediction trained with an overall L2 loss. However, minimizing the L2 loss cannot always locate all keypoints, since each heatmap often includes multiple body joints. To solve this problem, Qu et al. (Qu et al., 2023) proposed to optimize heatmap prediction by minimizing the distance between the characteristic functions of the predicted and ground-truth heatmaps.

Different from the above two-stage bottom-up approach, some works focus on joint detection and grouping, which belong to the single-stage bottom-up approach. Kocabas et al. (Kocabas et al., 2018) designed MultiPoseNet, which jointly handles person detection, person segmentation, and pose estimation through a Pose Residual Network (PRN) that receives keypoint and person detections and produces accurate poses by assigning keypoints to person instances. To deal with crowded scenes, Li et al. (Li et al., 2019) built a new benchmark called CrowdPose and proposed two components, namely joint-candidate single-person pose estimation and global maximum joints association, for crowded pose estimation. Jin et al. (Jin et al., 2020) proposed a differentiable hierarchical graph grouping method to learn human part grouping. Cheng et al. (Cheng et al., 2020) extended HRNet into a higher-resolution network (HigherHRNet) that deconvolves the high-resolution heatmaps generated by HRNet to address the scale-variation challenge. Besides the above bottom-up methods, some methods directly regress a set of pose candidates from image pixels, where the keypoints in each candidate might be from the same person; a post-processing step is then required to generate the final, more spatially accurate poses. For instance, the Single-stage multi-person Pose Machine (SPM) (Nie et al., 2019) applies a hierarchical structured 2D/3D pose representation to assist the long-range regression. Its keypoints are predicted from person-agnostic heatmaps, so grouping post-processing is required to assemble keypoints into full-body poses.
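The PAF association mentioned above can be made concrete: a candidate limb between two detected keypoints is scored by sampling the 2D vector field along the connecting segment and averaging the dot product with the limb's direction. A minimal NumPy sketch (illustrative only; OpenPose's actual implementation adds score calibration and bipartite matching across all candidates):

```python
import numpy as np

def paf_limb_score(paf, p1, p2, n_samples=10):
    """Score a candidate limb between keypoints p1 and p2 (x, y) by sampling
    the 2-channel Part Affinity Field along the segment and averaging the
    dot product with the limb's unit direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    u = v / norm                       # unit vector along the candidate limb
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (p1 + t * v).round().astype(int)
        score += paf[y, x] @ u         # field vector dotted with direction
    return score / n_samples

# Toy field: every pixel points along +x, so a horizontal limb scores ~1.
paf = np.zeros((32, 32, 2))
paf[..., 0] = 1.0
print(round(paf_limb_score(paf, (2, 10), (20, 10)), 3))  # 1.0
```

High scores mean the sampled field vectors consistently point from one keypoint toward the other, which is exactly the evidence the grouping step needs.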
Disentangled Keypoint Regression (DEKR) (Geng et al., 2021) regresses pose candidates by learning representations that focus on keypoint regions; the pose candidates are then scored and ranked to generate the final poses based on keypoint and center heatmap estimation losses. PolarPose (Li et al., 2023) simplifies the 2D regression to a classification task by performing it in polar coordinates.

(3) One-stage approach

The one-stage approach aims to learn an end-to-end network for MPPE without person detection and grouping post-processing. Tian et al. (Tian et al., 2019) first proposed a one-stage method, DirectPose, to directly predict instance-aware keypoints for all persons from an image. To boost both accuracy and speed, Mao et al. (Mao et al., 2021) later presented a Fully Convolutional Pose (FCPose) estimation framework that builds dynamic filters in compact keypoint heads. Meanwhile, Shi et al. (Shi et al., 2021) designed InsPose, which adaptively adjusts the network parameters for each instance. To reduce the effect of false-positive poses on the regression loss, the Single-stage Multi-person Pose Regression (SMPR) network (Miao et al., 2023) adapts three positive-pose identification strategies for initial pose regression, final pose regression, and the Non-Maximum Suppression (NMS) step. These methods avoid the heuristic grouping of bottom-up methods and the bounding-box detection and region-of-interest (RoI) cropping of top-down ones. However, they still require hand-crafted operations, like NMS, to remove duplicates in the post-processing stage. To further remove NMS, the multi-person Pose Estimation framework with TRansformers (PETR) (Shi et al., 2022) regards pose estimation as a set prediction problem and is the first fully end-to-end framework without any post-processing. The above one-stage methods adopt a pose decoder with randomly initialized pose queries, which makes keypoint matching across persons ambiguous and training convergence slow. To this end, Yang et al. (Yang et al., 2023) proposed an Explicit box Detection process for pose estimation (ED-Pose), realizing each box detection with a decoder and cascading the decoders into an end-to-end framework, making the model fast in convergence, precise, and scalable.

Although the above end-to-end methods have achieved promising performance, they rely on complex decoders. For instance, ED-Pose includes a human detection decoder and a human-to-keypoint detection decoder to detect human and keypoint boxes, while PETR includes a pose decoder and a joint decoder. In contrast, Group Pose (Liu et al., 2023) only uses a simple transformer decoder in pursuit of efficiency.

In summary, top-down approaches directly leverage existing techniques for single-person pose estimation, but suffer from early commitment: if the person detector fails, as it is prone to do when people are in close proximity, there is no recourse to recovery. Furthermore, the runtime of these top-down approaches is proportional to the number of people: for each detection, a single-person pose estimator is run, so the more people there are, the greater the computational cost. In contrast, bottom-up approaches are attractive due to their robustness to early commitment and their potential to decouple runtime complexity from the number of people in the image. Yet, bottom-up approaches do not directly leverage global contextual cues from other body parts and individuals. One-stage methods eliminate intermediate operations like grouping, RoI cropping, bounding-box detection, and NMS, and bypass the major shortcomings of both top-down and bottom-up methods.

2.1.3 Video-based single-person pose estimation

Video-based pose estimation aims to estimate single or multiple poses in each video frame. Compared with image-based pose estimation, it is more challenging due to high variation in human pose and foreground appearance (such as clothing) and to self-occlusion. For video-based pose estimation, human tracking is not considered. Similar to image-based SPPE, both direct regression and heatmap-based approaches are available for video-based SPPE. Differently, however, video-based pose estimation has the advantage of temporal information, which can enhance accuracy but can also introduce additional computational overhead due to temporal redundancy. Therefore, achieving a balance between accuracy and efficiency is paramount for video-based pose estimation. Based on how they handle efficiency, video-based SPPE approaches are categorized into frame-by-frame approaches and sample frames-based ones. Fig. 8 illustrates the general framework of the two approaches for video-based SPPE.

(1) Frame-by-frame approach

The frame-by-frame approach, illustrated in Fig. 8, estimates poses individually for each frame in the video sequence. Building on the success of image-based pose estimation, this category of methods mainly applies image-based pose estimation methods to each video frame while incorporating temporal information to keep geometric consistency across frames. The temporal information is normally captured by fusing concatenated consecutive frames, applying 3D temporal convolution, using dense optical flow, or propagating poses.

In the early stages of this approach, Pfister et al. (Pfister et al., 2014) proposed to use deep ConvNets for estimating human pose in videos. They designed a regression layer to predict the locations of upper-body joints while considering temporal information through the direct processing of consecutive frames concatenated along the channel axis. Grinciunaite et al. (Grinciunaite et al., 2016) extended 2D convolution into 3D convolution, so that temporal information can be efficiently represented in the third dimension of the 3D convolution for video-based human pose estimation.
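The two temporal-fusion inputs just described differ only in tensor layout: concatenating frames along the channel axis feeds an ordinary 2D ConvNet, while keeping a separate time axis enables 3D convolution. A shape-level NumPy sketch (frame sizes here are arbitrary):

```python
import numpy as np

T, C, H, W = 5, 3, 64, 48               # 5 consecutive RGB frames
frames = np.zeros((T, C, H, W), dtype=np.float32)

# Pfister et al. (2014)-style input: frames stacked along the channel
# axis, so a 2D ConvNet sees one (T*C)-channel image.
stacked = frames.reshape(T * C, H, W)

# Grinciunaite et al. (2016)-style input: time kept as its own axis,
# so a 3D convolution can slide over (time, height, width).
volume = frames.transpose(1, 0, 2, 3)   # (C, T, H, W)

print(stacked.shape, volume.shape)      # (15, 64, 48) (3, 5, 64, 48)
```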
Fig. 8 The framework of two approaches for video-based 2D SPPE.

Some works use optical flow to produce smooth movement. Pfister et al. (Pfister et al., 2015) used dense optical flow to predict joint positions for all neighboring frames and designed spatial fusion layers to learn dependencies between the locations of human parts. Song et al. (Song et al., 2017) also utilized optical-flow warping to capture high temporal consistency and proposed a spatio-temporal message passing layer to incorporate domain-specific knowledge into deep networks. Jain et al. (Jain et al., 2014) used Local Contrast Normalization and Local Motion Normalization to process the RGB image and optical-flow features respectively, and then combined them as input to a Part-Detector network. These methods have high complexity due to dense flow computation, making them inapplicable in real-time applications.

Subsequently, some works (Gkioxari et al., 2016; Charles et al., 2016; Luo et al., 2018; Nie et al., 2019; Li et al., 2019a,b; Xu et al., 2021; Dang et al., 2022; Jin et al., 2023) apply pose propagation, which transfers features from previous frames to the current frame in an online fashion. For example, Charles et al. (Charles et al., 2016) proposed a personalized ConvNet that estimates human pose in four stages: initial annotation, spatial matching, temporal propagation, and self-evaluation. In the initial annotation stage, high-precision pose estimates are obtained using flowing ConvNets. Image patches from new frames without annotations are then matched to image patches of body joints in annotated frames by a spatial matching process. Dense optical flow is used for temporal propagation. Finally, the quality of the spatial-temporally propagated annotations is automatically evaluated to optimize the model. Luo et al. (Luo et al., 2018) proposed Long Short-Term Memory (LSTM) pose machines, combining the Convolutional Pose Machine (CPM) (Wei et al., 2016) with an LSTM network that learns the temporal dependency among video frames to effectively capture the geometric relationships of joints in space and time. Nie et al. (Nie et al., 2019) designed a Dynamic Kernel Distillation (DKD) model, which introduces a pose kernel distillator and transmits pose knowledge over time. Xu et al. (Xu et al., 2021) proposed a neural architecture search to select the most effective temporal feature fusion for optimizing accuracy and speed across video frames. Dang et al. (Dang et al., 2022) proposed a Relation-based Pose Semantics Transfer Network (RPSTN) with a joint relation-guided pose semantics propagator to learn the temporal semantic continuity of poses. Although various strategies are applied to reduce computation cost, this category of methods still yields sub-optimal efficiency improvement because estimation is performed frame by frame.

(2) Sample frames-based approach

This category of approaches aims to recover all poses based on the poses estimated from selected frames. As shown in Fig. 8, the general workflow includes sample pose estimation and all-pose recovery. One line of work generates sample poses by selecting keyframes and estimating the poses of those keyframes. For example, Zhang et al. (Zhang et al., 2020) introduced a Key-Frame Proposal Network (K-FPN) to select informative frames, and a human pose interpolation module to generate all poses from the keyframe poses based on human pose dynamics. Pose dynamics-based dictionary formulation may become challenging when the pose sequence to be interpolated becomes complex. Therefore, to effectively exploit the dynamic information, the REinforced MOtion Transformation nEtwork (REMOTE) (Ma et al., 2022) includes a motion transformer to conduct cross-frame reconstruction. Although the computational efficiency of the above works is improved by using keyframes, they still spend effort on keyframe selection, making it hard to further reduce the complexity.
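The sample-then-recover idea can be illustrated with plain linear interpolation between poses estimated at sampled frames. The cited methods learn this recovery (pose-dynamics dictionaries in K-FPN, transformers in REMOTE); the sketch below is only a toy baseline:

```python
import numpy as np

def recover_poses(sample_idx, sample_poses, n_frames):
    """Linearly interpolate (J, 2) poses estimated at sampled frame indices
    to all n_frames frames. A toy stand-in for learned recovery modules."""
    sample_poses = np.asarray(sample_poses, dtype=float)
    out = np.empty((n_frames,) + sample_poses.shape[1:])
    for j in range(sample_poses.shape[1]):       # each joint
        for d in range(sample_poses.shape[2]):   # each coordinate
            out[:, j, d] = np.interp(np.arange(n_frames),
                                     sample_idx, sample_poses[:, j, d])
    return out

# One joint moving linearly; estimate at frames 0, 5, 9 and recover all 10.
idx = [0, 5, 9]
poses = [[[0.0, 0.0]], [[5.0, 10.0]], [[9.0, 18.0]]]
full = recover_poses(idx, poses, 10)
print(full[3, 0])  # [3. 6.]
```

For genuinely linear motion this recovery is exact; the learned modules exist precisely because real joint trajectories are not.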
To solve this problem, Zeng et al. (Zeng et al., 2022) proposed a Sample-Denoise-Recover pipeline (DeciWatch) that uniformly samples less than 10% of the video frames for estimation. The poses estimated on the sampled frames are denoised with a Transformer architecture, and the remaining poses are recovered by another Transformer network. DeciWatch can be used for both 2D and 3D pose estimation from videos, and it can maintain or even improve the accuracy of previous methods at a small computational cost. Although uniform sampling removes the cost of selecting keyframes, a refinement module is added to clean noisy poses. In contrast, MixSynthFormer (Sun et al., 2023) removes the refinement module by combining a transformer encoder with MLP-based mixed synthetic attention, thus pursuing highly efficient 2D/3D video-based pose estimation.

Overall, frame-by-frame approaches benefit from image-based pose estimation but suffer from its computational complexity. Sample frames-based approaches offer a way to improve efficiency but raise questions about how to obtain the sample frames and how to recover the remaining poses. Existing work employs uniform sampling; however, considering the significant variations in joint movements under different actions, an adaptive sampling strategy might be more suitable for further enhancing efficiency. Additionally, the design of dynamic recovery methods should be explored to handle non-uniform sampling effectively.

2.1.4 Video-based multi-person pose estimation

Given the video-based SPPE just introduced, it is natural to extend it to handle multiple individuals. Following the taxonomy of video-based SPPE, most video-based MPPE approaches fall into the frame-by-frame category: they can be realized by employing image-based MPPE frame by frame. Accordingly, the approaches to video-based MPPE can be categorized into top-down and bottom-up approaches.

(1) Top-down approach

Top-down approaches mainly estimate poses by first detecting all persons in all frames and then conducting image-based single-person pose estimation frame by frame. Xiao et al. (Xiao et al., 2018) proposed a simple ResNet-based baseline to estimate poses in each frame; the estimated poses were then tracked based on optical flow. Xiu et al. (Xiu et al., 2018) estimated multiple poses for each frame based on the RMPE method, which can be replaced by other top-down methods for image-based MPPE. With the estimated poses in each frame, a Pose Flow Builder (PF-Builder) builds the association of cross-frame poses by maximizing overall confidence along the temporal sequence (as shown in Fig. 9), and a Pose Flow Non-Maximum Suppression (PF-NMS) robustly reduces redundant pose flows and re-links temporally disjoint ones. Girdhar et al. (Girdhar et al., 2018) estimated poses for each frame based on Mask R-CNN and then linked the generated keypoint predictions over the video by lightweight tracking. Wang et al. (Wang et al., 2020) proposed a clip tracking network to perform pose estimation and tracking simultaneously; to construct it, a 3D HRNet that incorporates the temporal dimension into the original HRNet is proposed for estimating poses. AlphaPose (Fang et al., 2022) is also proposed for joint pose estimation and tracking. In particular, all persons in each frame are first detected using off-the-shelf object detectors like YOLOv3 or EfficientDet. To solve the quantization error, a symmetric integral keypoint regression method is then proposed to localize keypoints at different scales accurately. After removing redundant poses with NMS, a pose-guided alignment module is applied to the predicted human re-id features to obtain pose-aligned re-id features. At last, a pose-aware identity embedding is presented to produce the tracking identity. Estimating poses frame by frame ignores motion dynamics, which are fundamentally important for accurate pose estimation from videos. A recent method (Feng et al., 2023) presents Temporal Difference Learning based on Mutual Information (TDMI) for pose estimation: a multi-stage temporal difference encoder learns informative motion representations, and a representation disentanglement module distills task-relevant motion features to enhance frame representations for pose estimation. The temporal difference features can also be applied to pose tracking by measuring the similarity of motions for data association. Gai et al. (Gai et al., 2023) proposed a Spatiotemporal Learning Transformer for video-based Pose estimation (SLT-Pose) to capture shallow feature information. With the introduction of diffusion models in computer vision tasks (e.g., image segmentation (Amit et al., 2021) and object detection (Chen et al., 2023)), DiffPose (Feng et al., 2023) is the first diffusion model in this area, formulating video-based pose estimation as a conditional heatmap generation problem.

(2) Bottom-up approach

Bottom-up approaches estimate poses by applying body part detection and grouping frame by frame. For example, one of the commonly used image-based MPPE methods, OpenPose (Cao et al., 2017b), can also be applied to MPPE from video by directly estimating poses frame by frame. Jin et al. (Jin et al., 2019) proposed a Pose-Guided Grouping (PGG) network for joint pose estimation and tracking. PGG consists of two components, SpatialNet and TemporalNet: SpatialNet tackles multi-person pose estimation by body part detection and part-level spatial grouping for each frame, while TemporalNet extends SpatialNet to deal with online human-level temporal grouping.
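Cross-frame identity association in these tracking pipelines ultimately reduces to matching poses between adjacent frames. A toy greedy matcher based on mean per-joint distance (an illustrative baseline only; none of the methods above uses exactly this similarity):

```python
import numpy as np

def match_poses(prev, curr):
    """Greedily assign each current-frame pose (J, 2) to the closest
    unmatched previous-frame pose by mean per-joint distance."""
    prev, curr = np.asarray(prev, float), np.asarray(curr, float)
    # cost[p, c] = mean joint distance between prev pose p and curr pose c
    cost = np.linalg.norm(prev[:, None] - curr[None, :], axis=-1).mean(-1)
    matches, used = {}, set()
    for c in range(curr.shape[0]):
        for p in np.argsort(cost[:, c]):
            if int(p) not in used:
                matches[c] = int(p)
                used.add(int(p))
                break
    return matches

prev = [[[0, 0], [0, 10]], [[50, 50], [50, 60]]]   # two people, two joints
curr = [[[51, 50], [51, 60]], [[1, 0], [1, 10]]]   # same people, swapped order
print(match_poses(prev, curr))  # {0: 1, 1: 0}
```

Real trackers replace the distance with flow-warped overlap, re-id similarity, or motion features, and replace the greedy loop with global (e.g., Hungarian) assignment.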
Fig. 9 The Pose Flow framework (Xiu et al., 2018).

Overall, 2D HPE has been significantly improved with the development of deep learning techniques. For image-based SPPE, heatmap-based approaches generally outperform regression-based ones in accuracy but may be challenged by the quantization error problem. When extending SPPE to MPPE, both top-down and bottom-up approaches have their advantages and disadvantages. Moreover, both approaches struggle to reliably detect individual persons under significant occlusion: the person detector in top-down approaches may fail to identify the boundaries of overlapping human bodies, and body part association in bottom-up approaches may fail in occluded scenes. One-stage approaches bypass the shortcomings of both top-down and bottom-up ones, yet they are still less frequently used. With the advancement of image-based pose estimation, it is natural to extend it to videos by directly applying off-the-shelf image-based pose estimation methods frame by frame or by incorporating a temporal network. Sample frames-based methods are preferred for pose estimation from videos since they can largely improve efficiency without looking at all frames, yet they have been used less for video-based MPPE. Considering the benefits of one-stage approaches for image-based MPPE, more effort is required to explore one-stage approaches for video-based MPPE.

2.2 3D pose estimation

Generally speaking, recovering 3D pose is considered more difficult than 2D pose estimation, due to the larger 3D pose space and more ambiguities. An algorithm has to be invariant to several factors, including background scenes, lighting, clothing shape and texture, skin color, and image imperfections, among others.

2.2.1 Image-based single-person pose estimation

Image-based single-person 3D human pose estimation (HPE) can be classified into skeleton-based and mesh-based approaches. The former estimates 3D human joints as the final output, while the latter reconstructs a 3D human mesh representation. Since this paper focuses only on the kinematic model-based human representation, we review only skeleton-based approaches, which can be further categorized into one-stage pose estimation and two-stage pose estimation (recovering 3D pose from 2D pose). Fig. 10 shows the general framework of the two approaches for image-based 3D SPPE.

(1) One-stage approach

This category of approaches directly infers 3D pose from images without estimating a 2D pose representation. Li and Chan (Li and Chan, 2014) first proposed to estimate 3D poses from monocular images using ConvNets. The framework consists of two types of tasks: joint point regression and joint point detection. Both tasks take bounding-box images containing human subjects as input. The regression task estimates the positions of joint points relative to the root joint position, while each detection task classifies whether one specific joint is present in the local window or not.

This multi-task learning framework is the first to show that deep neural networks can be applied to 3D human pose estimation from single images.
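Root-relative regression targets of the kind used by Li and Chan above are simply joint coordinates with the root subtracted; a tiny NumPy sketch (coordinates and joint order here are made up):

```python
import numpy as np

# Hypothetical (J, 3) 3D joint positions in meters; joint 0 is the root (pelvis).
joints = np.array([[0.1, 0.9, 4.0],   # root
                   [0.1, 1.4, 4.5],   # e.g. neck
                   [0.3, 0.9, 4.0]])  # e.g. hip

# Root-relative regression target: positions relative to the root joint.
root_relative = joints - joints[0]
print(root_relative[1])  # [0.  0.5 0.5]
```

The network then only has to predict the (easier) body-internal configuration, leaving the absolute root location to a separate localization step.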
Fig. 10 The framework of two approaches for image-based 3D SPPE.

However, one drawback of these regression-based methods is their limitation of predicting only one pose for a given image. This may cause difficulties in images where the pose is ambiguous due to partial self-occlusion, and hence several poses might be valid. In contrast, Li et al. (Li et al., 2015) proposed a unified framework for maximum-margin structured learning with a deep neural network for 3D human pose estimation, which can jointly learn the image and pose feature representations and the score function. Tekin et al. (Tekin et al., 2016) introduced an architecture relying on an overcomplete auto-encoder to learn a high-dimensional latent pose representation for joint dependencies. Zhou et al. (Zhou et al., 2016) proposed a method that directly embeds a kinematic object model into the deep neural network, where the kinematic function is defined on the appropriately parameterized object motion variables. Mehta et al. (Mehta et al., 2017) explored transfer learning to leverage the highly relevant middle- and high-level features from 2D pose datasets in conjunction with the existing annotated 3D pose datasets. Similarly, Zhou et al. (Zhou et al., 2017) introduced a Weakly-supervised Transfer Learning (WTL) method that employs mixed 2D and 3D labels in a unified deep neural network, which is end-to-end and fully exploits the correlation between the 2D pose and depth estimation sub-tasks. Since they regress directly from image space, one-stage methods often require a high computation cost.

(2) Two-stage approach

This category of approaches infers 3D pose from an intermediately estimated 2D pose. They are conducted in two steps: 1) estimating the 2D pose with image-based single-person 2D pose estimation methods; 2) lifting the 2D pose to a 3D pose through a simple regressor. For instance, Martinez et al. (Martinez et al., 2017) proposed a simple baseline based on a fully connected residual network to regress 3D poses from 2D poses. This baseline achieved good results at the time; however, it could fail due to the reconstruction ambiguity caused by over-reliance on the 2D pose detector. To overcome this problem, several techniques have been applied, such as replacing 2D poses with heatmaps for estimating 3D poses (Tekin et al., 2017; Zhou et al., 2019), regressing 3D poses from 2D poses and depth information (Wang et al., 2018; Carbonera Luvizon et al., 2023), and selecting the best 3D poses from 3D pose hypotheses using ranking networks (Jahangiri and Yuille, 2017; Sharma et al., 2019; Li and Lee, 2019).

With the introduction of Graph Convolutional Network (GCN)-based representations for human joints, some methods (Ci et al., 2019; Zhao et al., 2019; Choi et al., 2020; Zeng et al., 2020; Liu et al., 2020; Zou and Tang, 2021; Xu and Takano, 2021; Shengping et al., 2023; Hassan and Ben Hamza, 2023) apply GCNs for lifting 2D poses to 3D. To overcome the limitations of shared weights in GCNs, a Locally Connected Network (LCN) (Ci et al., 2019) was proposed, which leverages a fully connected network and a GCN to encode the relationships among joints. Similarly, Zhao et al. (Zhao et al., 2019) proposed a semantic GCN to learn channel-wise weights for edges. Pose2Mesh (Choi et al., 2020), based on GCNs, refines the intermediate 3D pose from its PoseNet. Xu and Takano (Xu and Takano, 2021) proposed a Graph Stacked Hourglass (GraphSH) network, which consists of repeated encoder-decoders representing three different scales of human skeletons. To overcome the loss of joint interactions in current GCN methods, Zhai et al. (Zhai et al., 2023) proposed the Hop-wise GraphFormer with Intragroup Joint Refinement (HopFIR) for lifting 3D poses.

Inspired by the recent success in the natural language field, there is growing interest in exploring the Transformer architecture for vision tasks. Lin et al. (Lin et al., 2021) first applied the Transformer to 3D pose estimation, proposing a multi-layer Transformer with progressive dimensionality reduction to regress the 3D coordinates of joints. However, the standard Transformer ignores the interaction of adjacent nodes. To overcome this problem, Zhao et al. (Zhao et al., 2022) proposed a graph-oriented Transformer, which enlarges the receptive field through self-attention and models the graph structure with a GCN to improve performance on 3D pose estimation.
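The lifting step in these two-stage methods is, at its core, a learned map from a (J, 2) array of 2D keypoints to a (J, 3) array of 3D joints. A shape-level sketch with an untrained two-layer MLP (random weights; this illustrates the simple-baseline idea of Martinez et al., not their exact architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
J = 17                                  # e.g. a Human3.6M-style joint count
pose_2d = rng.standard_normal((J, 2))   # detected 2D keypoints

# Two-layer MLP: flattened 2D pose -> hidden -> flattened 3D pose.
W1 = rng.standard_normal((2 * J, 1024))
W2 = rng.standard_normal((1024, 3 * J))
hidden = np.maximum(pose_2d.reshape(-1) @ W1, 0.0)   # ReLU
pose_3d = (hidden @ W2).reshape(J, 3)

print(pose_3d.shape)  # (17, 3)
```

The GCN and Transformer lifters surveyed above keep exactly this input/output contract and differ only in how they mix information across joints.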
For in-the-wild data, it is difficult to obtain accurate 3D pose annotations. To deal with the lack of 3D pose annotations, weakly supervised, self-supervised, or unsupervised methods (Zhou et al., 2017; Yang et al., 2018; Habibie et al., 2019; Chen et al., 2019; Wandt and Rosenhahn, 2019; Iqbal et al., 2020; Kundu et al., 2020; Schmidtke et al., 2021; Yu et al., 2021; Gong et al., 2022; Chai et al., 2023) were proposed for estimating 3D poses from in-the-wild images without 3D pose annotations. A weakly supervised transfer learning method (Zhou et al., 2017) was proposed to transfer the knowledge from 3D annotations of indoor images to in-the-wild images, applying a 3D bone-length constraint-induced loss in the weakly supervised learning. Habibie et al. (Habibie et al., 2019) applied a projection loss to refine 3D poses without annotation. A lifting network (Chen et al., 2019) was proposed to recover 3D poses in a self-supervised mode by introducing a geometric consistency loss based on the closure and invariance properties of lifting. These self-supervised methods have largely relied on weak supervision, such as consistency losses, to guide the learning, which inevitably leads to inferior results in real-world scenarios with unseen poses. Comparatively, Gong et al. (Gong et al., 2022) proposed PoseTriplet, which explicitly generates 2D-3D pose pairs to augment supervision through a self-enhancing dual-loop learning framework. Benefiting from reliable 2D pose detection, two-stage approaches generally outperform one-stage ones.

2.2.2 Image-based multi-person pose estimation

Similar to 2D multi-person pose estimation, 3D multi-person pose estimation for images can be divided into top-down, bottom-up, and one-stage approaches. Top-down and bottom-up approaches involve two stages for pose estimation. Fig. 11 illustrates the general framework of the two approaches for image-based 3D MPPE.

(1) Top-down approach

Top-down approaches first detect each person using human detection networks and then generate 3D poses with single-person estimation approaches. The Localization Classification-Regression Network (LCR-Net) (Rogez et al., 2017, 2019) proposes a pose proposal network to generate human bounding boxes and a series of human pose hypotheses; the pose hypotheses are refined based on the cropped RoI features to generate 3D poses. Moon et al. (Moon et al., 2019) proposed a camera distance-aware method for estimating camera-centric human poses, which consists of human detection, absolute 3D human root localization, and root-relative 3D single-person pose estimation modules. Here, the root-relative poses ignore the absolute location of each pose. Comparatively, Lin and Lee (Lin and Lee, 2020) proposed the Human Depth Estimation Network (HDNet) for absolute root joint localization in the camera coordinate space. HDNet can estimate human depth with considerably high performance based on prior knowledge of the typical size of the human pose and body joints. The top-down methods mostly estimate poses within each bounding box, which raises the doubt that top-down models cannot understand multi-person relationships or handle complex scenes. To address this limitation, Wang et al. (Wang et al., 2020) proposed hierarchical multi-person ordinal relations (HMOR) to leverage the relationships among multiple persons for pose estimation. HMOR encodes the interaction information as ordinal relations, supervising the networks to output 3D poses in the correct order. Cha et al. (Cha et al., 2022) designed a transformer-based relation-aware refinement to capture the intra- and inter-person relationships. Although top-down approaches achieve high accuracy, they suffer from high computation costs as the number of persons increases. Meanwhile, these methods may neglect global information (inter-person relationships) in the scene, since poses are estimated individually.

(2) Bottom-up approach

Bottom-up approaches first produce all body joint locations and then associate the joints to each person according to root depth and part-relative depth. Zanfir et al. (Zanfir et al., 2018) proposed MubyNet to group human joints according to body part scores based on integrated 2D and 3D information. One group of bottom-up approaches aims to group the body joints belonging to each person. The Learning on Compressed Output (LoCO) method (Fabbri et al., 2020) first applies volumetric heatmaps to produce joint locations with an encoder-decoder network for feature compression; a distance-based heuristic is then applied to link joints and retrieve the 3D pose for each person. The previous methods are trained in a fully supervised fashion, which requires 3D pose annotations, while Kundu et al. (Kundu et al., 2020) proposed an unsupervised method for 3D pose estimation: without paired 2D images and 3D pose annotations, a frozen network is applied to exploit the shared latent space between the two different modalities based on cross-modal alignment.

Another group of bottom-up approaches focuses on occlusion. Mehta et al. (Mehta et al., 2018) combined joint location maps and occlusion-robust pose-maps to infer 3D poses, applying joint location redundancy to infer occluded joints. XNect (Mehta et al., 2020) encodes the immediate local context of joints in the kinematic tree to address occlusion. Zhen et al. (Zhen et al., 2020) developed a 3D part affinity field for depth-aware part association by reasoning about inter-person occlusion, and utilized a refinement network to refine the 3D pose given the predicted 2D and 3D joint coordinates. All of these methods handle occlusion from the perspective of a single person and require initially grouping joints into individuals, which results in error-prone estimates in multi-person scenarios.
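Whether obtained top-down or bottom-up, the absolute (camera-centric) root localization discussed above ultimately rests on pinhole back-projection: given the root's pixel location, an estimated depth, and the camera intrinsics, its 3D position follows in closed form. A small sketch (the intrinsic values below are made up):

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Pinhole back-projection of a pixel (u, v) at a given depth (meters)
    into camera-centric 3D coordinates (X, Y, Z)."""
    u, v = uv
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Made-up intrinsics: focal length 1000 px, principal point (500, 400).
root_3d = backproject((700.0, 400.0), 5.0, 1000.0, 1000.0, 500.0, 400.0)
print(root_3d)  # [1. 0. 5.]
```

This is why methods like the camera distance-aware pipeline and HDNet concentrate on estimating the depth term: the rest of the localization is deterministic geometry.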
Fig. 11 The framework of two approaches for image-based 3D MPPE. Part of the figure is from (Wang et al., 2022).

proposed an occluded keypoints reasoning module based on a deeply supervised encoder distillation network to reason about the invisible information from the visible ones. Chen et al. (Chen et al., 2023) presented Articulation-aware Knowledge Exploration (AKE), which associates keypoints with a progressive scheme in occlusion situations. In comparison to top-down approaches, bottom-up approaches offer the advantage of not requiring repeated single-person pose estimation, and they enjoy linear computation. However, bottom-up approaches require a second association stage for joint grouping. Furthermore, since all persons are processed at the same scale, these methods are inevitably sensitive to human scale variations, which limits their applicability in wild videos.

(3) One-stage approach

One-stage approaches treat pose estimation as a parallel human-center localization and center-to-joint regression problem. Instead of separating joint localization and grouping as in the two-stage approaches, these approaches predict each of the joint offsets from the detected center points, where the center is usually set as the root joint of the human. Since the joint offsets are directly correlated with the estimated center points, this strategy avoids the manually designed grouping post-processing and is end-to-end trainable. Zhou et al. (Zhou et al., 2019) modeled an object as a single point and regressed joints from image features at the human center. Wei et al. (Wei et al., 2020) proposed to regress joints from point-set anchors which serve as priors of basic human poses. Wang et al. (Wang et al., 2022) reconstructed joints from 2.5D human centers and 3D center-relative joint offsets. Jin et al. (Jin et al., 2022) proposed a Decoupled Regression Model (DRM) by solving 2D pose regression and depth regression. Recently, Qiu et al. (Qiu et al., 2023) estimated 3D poses directly by fine-tuning a Weakly-Supervised Pre-training (WSP) network on 3D pose datasets.

2.2.3 Video-based single-person pose estimation

Instead of estimating 3D poses from single images, videos can provide temporal information to improve the accuracy and robustness of pose estimation. Similar to image-based 3D HPE, video-based 3D HPE can also be categorized into one-stage and two-stage approaches.

(1) One-stage approach

Few works belong to this category of approaches. Tekin et al. (Tekin et al., 2016) proposed a regression function to directly predict the 3D pose in a given frame of a sequence from a spatio-temporal volume centered around it. This volume comprises bounding boxes surrounding the person in the consecutive frames coming before and after the central one. Mehta et al. (Mehta et al., 2017) proposed VNect, which is capable of obtaining a temporally consistent, full 3D skeletal pose of a human from a monocular RGB camera by ConvNet regression and kinematic skeleton fitting. VNect could regress 2D and 3D joint locations simultaneously. Dabral et al. (Dabral et al., 2018) proposed two structure-aware

loss functions, an illegal angle loss and a left-right symmetry loss, to directly predict 3D body poses from the video sequence. The illegal angle loss distinguishes the internal and external angles of a 3D joint, and the symmetry loss is defined as the difference in lengths of left/right bone pairs. Qiu et al. (Qiu et al., 2022) proposed an end-to-end framework based on an Instance-guided Video Transformer (IVT) to predict 3D single and multiple poses directly from videos. An unsupervised feature extraction method (Honari et al., 2023) based on Contrastive Self-Supervised (CSS) learning was presented to capture rich temporal features for pose estimation. Time-variant and time-invariant latent features are learned using CSS by reconstructing the input video frames, and the time-variant features are then applied to predicting 3D poses.

(2) Two-stage approach

Similar to two-step 3D pose estimation from images, two-step video-based 3D HPE involves two stages: estimating 2D poses and lifting 3D poses from 2D poses. However, the difference is that a sequence of 2D poses is applied for lifting a sequence of 3D poses in video-based 3D HPE. Based on the different lifting methods, this category of approaches can be summarized into Seq2frame- and Seq2seq-based methods.

Seq2frame-based methods focus on predicting the central frame of the input video to produce a robust prediction with less sensitivity to noise. Pavllo et al. (Pavllo et al., 2019) presented a Temporal Convolutional Network (TCN) on 2D keypoint trajectories with a semi-supervised training method. In the network, 1D convolutions are used to capture temporal information with fewer parameters. In semi-supervised training, the 3D pose estimator is used as the encoder and the decoder maps the predicted pose back to the 2D space. Some following works improved the performance of the TCN by solving the occlusion problem (Cheng et al., 2019), utilizing attention (Liu et al., 2020), or decomposing the pose estimation task into bone length and bone direction prediction (Chen et al., 2021). Besides TCNs, Cai et al. (Cai et al., 2019) employed a GCN to model temporal information, learning multi-scale features for 3D human body estimation from a short sequence of 2D joint detections. Without convolutional architectures involved, Zheng et al. (Zheng et al., 2021) proposed PoseFormer, based on a spatial-temporal transformer, for estimating the 3D pose of the center frame. To overcome the huge computational cost of PoseFormer when increasing the frame number for better performance, PoseFormerV2 (Zhao et al., 2023) applies a frequency-domain representation of 2D pose sequences for lifting 3D poses. Similarly, Li et al. (Li et al., 2022a) proposed a strided transformer encoder to reconstruct the 3D pose of the center frame while reducing sequence redundancy and computation cost. Li et al. (Li et al., 2022b) further designed a Multi-Hypothesis transFormer (MHFormer) to exploit spatial-temporal representations of multiple pose hypotheses. Based on MHFormer, MHFormer++ (Li et al., 2023) was proposed to further model local joint information by incorporating a graph Transformer encoder and to effectively aggregate multi-hypothesis features by adding a fusion block. With the similar idea of pose hypotheses (Li et al., 2022b, 2023), DiffPose (Holmquist and Wandt, 2023) and Diffusion-based 3D Pose (D3DP) (Shan et al., 2023) apply a diffusion model to predict multiple adjustable hypotheses for a given 2D pose, owing to its ability to generate high-fidelity samples. The aforementioned Transformer-based methods (Zheng et al., 2021; Zhao et al., 2023; Li et al., 2022a, 2023) mainly model spatial and temporal information sequentially in different stages of the network, thus resulting in insufficient learning of motion patterns. Therefore, Tang et al. (Tang et al., 2023) proposed the Spatio-Temporal Criss-cross Transformer (STCFormer), which stacks multiple STC attention blocks to model spatial and temporal information in parallel with a two-pathway network.

Seq2seq-based methods reconstruct all frames of the input sequence at once to improve the coherence and efficiency of 3D pose estimation. The earlier methods apply a recurrent neural network (RNN) or long short-term memory (LSTM) as the Seq2seq network. Lin et al. (Lin et al., 2017) designed a Recurrent 3D Pose Sequence Machine (RPSM) for estimating 3D human poses from a sequence of images. The RPSM consists of three modules: a 2D pose module, a 3D pose recurrent module, and a feature adaption module for transforming the pose representations from the 2D to the 3D domain. Hossain et al. (Rayat Imtiaz Hossain and Little, 2018) presented a sequence-to-sequence network using LSTM units and residual connections on the decoder side. The sequence of 2D joint locations is taken as input to the sequence-to-sequence network to predict a temporally coherent sequence of 3D poses. Lee et al. (Lee et al., 2018) proposed propagating long short-term memory networks (p-LSTMs) to estimate depth information from 2D joint locations by learning the intrinsic joint interdependency. Katircioglu et al. (Katircioglu et al., 2018) proposed a deep learning regression architecture that learns a high-dimensional latent pose representation with an autoencoder, together with a Long Short-Term Memory network to enforce temporal consistency on the 3D pose predictions. Yeh et al. (Yeh et al., 2019) proposed Chirality Nets, in which fully connected layers, convolutional layers, batch normalization, and LSTM/GRU cells can be chiral. Exploiting this kind of symmetry, the model naturally estimates 3D poses by leveraging the left/right mirroring of the human body. Later, some methods (Wang et al., 2020; Yu et al., 2023; Zhang et al., 2022; Chen et al., 2023; Shuai et al., 2023; Zhu et al., 2022) apply GCNs or transformers for Seq2seq learning. Wang et al. (Wang et al., 2020) exploited a GCN-based method combined with a corresponding loss to model motion in both short temporal intervals and long temporal ranges.
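At the interface level, the Seq2seq-based methods above share one contract: a (T, J, 2) sequence of 2D joints goes in and a (T, J, 3) sequence of 3D joints comes out in a single pass. A deliberately minimal NumPy sketch of that contract, with random weights and a moving average standing in for the learned temporal model (purely illustrative, not any cited architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def lift_sequence(poses_2d, weights, kernel=3):
    """Seq2seq-style lifting sketch: map a whole 2D pose sequence to a
    whole 3D pose sequence in one pass (weights here are random, so the
    output is only shape-correct, not a trained prediction).

    poses_2d: (T, J, 2) sequence of 2D joints.
    weights:  (2*J, 3*J) linear lifting matrix, shared across frames.
    Returns a (T, J, 3) sequence of 3D joints.
    """
    T, J, _ = poses_2d.shape
    flat = poses_2d.reshape(T, 2 * J)              # per-frame 2D vector
    lifted = flat @ weights                        # per-frame linear lift
    # Temporal smoothing: average each frame with its neighbours, a stand-in
    # for the recurrent/attention layers that real Seq2seq models use.
    pad = kernel // 2
    padded = np.pad(lifted, ((pad, pad), (0, 0)), mode="edge")
    smoothed = np.stack([padded[t:t + kernel].mean(axis=0) for t in range(T)])
    return smoothed.reshape(T, J, 3)

T, J = 27, 17                                      # e.g. 27 frames, 17 joints
poses_2d = rng.standard_normal((T, J, 2))
W = rng.standard_normal((2 * J, 3 * J)) * 0.1
poses_3d = lift_sequence(poses_2d, W)
print(poses_3d.shape)  # (27, 17, 3): all frames reconstructed at once
```

Trained models differ in how `lift_sequence` mixes information over time (RNNs, GCNs, attention), not in this input/output shape.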

Zhang et al. (Zhang et al., 2022) proposed a mixed spatio-temporal encoder (MixSTE), which includes a temporal transformer to model the temporal motion of each joint and a spatial transformer to learn inter-joint spatial correlations. MixSTE directly reconstructs the entire set of frames to improve the coherence between input and output sequences. Chen et al. (Chen et al., 2023) proposed the High-order Directed Transformer (HDFormer) to reconstruct 3D pose sequences from 2D pose sequences by incorporating self-attention and high-order attention to model joint-joint, bone-joint, and hyperbone-joint interactions.

2.2.4 Video-based multi-person pose estimation

Different from image-based multi-person pose estimation, video-based multi-person pose estimation often suffers from fast motion, large variability in appearance and clothing, and person-to-person occlusion. A successful approach in this context must be capable of accurately identifying the number of individuals present in each video frame, as well as determining the precise joint locations for each person and effectively associating these joints over time.

With the improvement of video-based single-person 3D HPE, one line of video-based multi-person 3D HPE is the two-step method, which first detects each person with a human detection network and then generates 3D poses with video-based single-person 3D HPE methods. Cheng et al. (Cheng et al., 2021a) proposed a novel framework integrating a graph convolutional network (GCN) and a temporal convolutional network (TCN) to estimate multi-person 3D poses. In particular, bounding boxes are firstly detected to represent humans, and 2D poses are then estimated based on the bounding boxes. The 3D poses for each frame are estimated by feeding the 2D poses into joint- and bone-GCNs. The 3D pose sequence is finally fed into a temporal TCN to enforce temporal and human-dynamic constraints. This category of methods applies the top-down technique to estimate 3D poses, which relies on detecting each person independently. Therefore, it is likely to suffer from inter-person occlusion and close interactions. To overcome this problem, the same authors (Cheng et al., 2021b) later proposed a Multi-person Pose Estimation Integration (MPEI) network by adding a bottom-up branch for capturing globally aware poses to the same top-down branch as in (Cheng et al., 2021a). The final 3D poses are estimated by matching the estimated 3D poses from both the bottom-up and top-down branches. An interaction-aware discriminator was applied to enforce the natural interaction of two persons. To overcome the occlusion problem, Park et al. (Park et al., 2023) presented POTR-3D to lift 3D pose sequences by directly processing 2D pose sequences rather than a single frame at a time, and devised a data augmentation strategy to generate occlusion-aware data with diverse views. Capturing long-range temporal information normally requires computing on more frames, which results in high computational cost. To cope with this problem, a recent work, the TEMporal POse estimation method (TEMPO) (Choudhury et al., 2023), learns a spatio-temporal representation with a recurrent architecture to speed up inference time while preserving estimation accuracy. To be specific, persons are firstly detected and represented by feature volumes. A spatio-temporal pose representation is then learned by recurrently combining features from the current and previous timesteps. It is finally decoded into an estimation of the current pose and poses at future timesteps. Note that the poses are estimated based on the tracking results of the feature volumes, which hints that pose estimation performance can be improved by pose tracking. Moreover, TEMPO also provides a solution for action prediction.

In the above two-step-based methods, the result of the latter step depends on that of the former step. Therefore, one-step pose estimation has recently been proposed based on end-to-end networks. IVT (Qiu et al., 2022) can also be used to predict multiple poses directly from videos. The instance-guided tokens include deep features and instance 2D offsets (from body center to keypoints), which are sent into a video transformer to capture the contextual depth information between multi-person joints in the spatial and temporal dimensions. A cross-scale instance-guided attention mechanism is introduced to handle the varying scales among multiple persons.

In summary, 3D HPE has made significant advancements in recent years. Due to the progress in 2D HPE, a large number of 3D image/video-based single-person HPE methods apply the 2D-to-3D lifting strategy. When extending single-person to multi-person in 3D image/video-based HPE, two-step (top-down and bottom-up) and one-step methods are always applied. Although top-down methods can achieve promising results with state-of-the-art person detection and single-person methods, they suffer from high computation cost as the number of persons increases and from the missing inter-person relationship measurement. The bottom-up methods enjoy linear computation; however, they are sensitive to human scale variations. Therefore, one-step-based methods are preferable for 3D image/video-based multi-person HPE. When extending image-based 3D single/multi-person HPE to video-based HPE, temporal information is measured for learning joint associations across frames. Similar to image-based methods, two-step-based methods are commonly used due to the success of the 2D-to-3D lifting strategy. Among them, Seq2seq-based methods are preferable, as they contribute to enhancing the coherence and efficiency of 3D pose estimation. To capture the temporal information, TCNs (Temporal Convolutional Networks), RNN (Recurrent Neural Network)-related architectures, and Transformers are the commonly used networks.
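The one-step, center-plus-offset decoding favored in this summary can be made concrete in a few lines. A hedged NumPy sketch (toy arrays; the helper name is hypothetical, and real systems add heatmap non-maximum suppression and learned refinement):

```python
import numpy as np

def decode_one_stage(center_heatmap, offset_maps, k=2):
    """Decode poses from a one-stage (center + offset) head: pick the k
    strongest person centers, then read the per-joint offsets stored at
    each center location.

    center_heatmap: (H, W) person-center confidence map.
    offset_maps:    (J, 2, H, W) x/y offsets from center to each joint.
    Returns a (k, J, 2) array of joint coordinates.
    """
    H, W = center_heatmap.shape
    # Top-k peaks; a real decoder would suppress neighbouring pixels first.
    flat = center_heatmap.ravel().argsort()[::-1][:k]
    ys, xs = np.unravel_index(flat, (H, W))
    poses = []
    for y, x in zip(ys, xs):
        joints = np.stack([x + offset_maps[:, 0, y, x],
                           y + offset_maps[:, 1, y, x]], axis=-1)
        poses.append(joints)
    return np.stack(poses)

# Toy example: 2 persons, 3 joints, on an 8x8 map (values are made up).
H = W = 8
heat = np.zeros((H, W))
heat[2, 2] = 1.0
heat[6, 5] = 0.9
offsets = np.ones((3, 2, H, W))          # every joint sits at center + (1, 1)
poses = decode_one_stage(heat, offsets, k=2)
print(poses.shape)  # (2, 3, 2)
```

Because the offsets are indexed at the detected centers, no separate grouping step is needed, which is exactly why this decoding is end-to-end trainable.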

3 Pose tracking

Pose tracking aims to estimate human poses from videos and link the poses across frames to obtain a number of tracks. It is related to video-based pose estimation, but it requires capturing the association of estimated poses across frames, which is different from video-based pose estimation. With the pose estimation methods reviewed in Section 2, the main task of pose tracking becomes pose linking. The fundamental problem of pose linking is to measure the similarity between pairs of poses in adjacent frames. The pose similarity is normally measured based on temporal information (e.g., optical flow, temporal smoothness priors) and appearance information from images. Following the taxonomy of the two kinds of estimated poses, we divide pose tracking methods into two categories: 2D pose tracking and 3D pose tracking.

3.1 2D pose tracking

According to the number of persons to track, 2D pose tracking can be divided into single-person and multi-person pose tracking. Fewer methods address single-person pose tracking, since they actually aim to update the estimated poses to obtain more accurate poses with temporal consistency. Therefore, pose tracking mainly solves the tracking problem of multiple persons. Nevertheless, we will review both categories of methods, covering single-person and multi-person pose tracking.

3.1.1 Single-person pose tracking

Based on the core idea of updating the estimated poses by tracking, this category of approaches can usually be divided into two types: post-processing and integrated approaches. The post-processing approaches estimate the pose of each frame individually, and then correlation analysis is conducted on the estimated poses across different frames to reduce inconsistencies and generate a smooth result. The integrated approaches unite pose estimation and visual tracking within a single framework. Visual tracking ensures the temporal consistency of the poses, while pose estimation enhances the accuracy of the tracked body parts. By combining the strengths of both visual tracking and pose estimation, the integrated approaches achieve improved results in pose tracking. Fig. 12 illustrates the general framework of the two approaches for single-person pose tracking.

(1) Post-processing approach

Zhao et al. (Zhao et al., 2015) proposed to track human body pose by adopting a max-margin Markov model. They proposed a spatio-temporal model composed of two sub-models for spatial parsing and temporal parsing, respectively. Spatial parsing is used to estimate candidate human poses in a frame, while temporal parsing determines the most probable pose part locations over time. An inference iteration over the sub-models is conducted to obtain the final result. Samanta et al. (Samanta and Chanda, 2016) proposed a data-driven method for human body pose tracking in video data. They initially estimated the pose in the first frame of the video, and employed local object tracking to maintain spatial relationships between body parts across different frames.

(2) Integrated approach

Zhao et al. (Zhao et al., 2015) proposed a two-step iterative method that combines pose estimation and visual tracking into a unified framework so that they compensate for each other: pose estimation improves the accuracy of visual tracking, and the result of visual tracking facilitates pose estimation. The two steps are performed iteratively to get the final pose. In addition, they designed a reinitialization mechanism to prevent pose tracking failures. Previous methods required future frames or entire sequences to refine the current pose and were difficult to run online. Ma et al. (Ma et al., 2016) solved the problem of online tracking of human joint motion in dynamic environments. They proposed a coupled-layer framework composed of a global layer for pose tracking and a local layer for pose estimation. The core idea is to decompose the global pose candidate in any particular frame into several local part candidates and then recombine selected local parts to obtain an accurate pose for the frame.

Post-processing approaches first obtain a set of plausible pose hypotheses from the video and then stitch together compatible detections over time to form pose tracks. However, due to the multiplicative cost of using global information, models in this category can usually only include local spatio-temporal trajectories (evidence). These local spatio-temporal trajectories may be ambiguous, which is a disadvantage of such objective models. Furthermore, post-processing methods are difficult to run online, whereas integrated approaches allow for a more robust and accurate representation of the poses over time, ensuring that the tracked body retains its appropriate configuration throughout the tracking process.

3.1.2 Multi-person pose tracking

Unlike single-person pose tracking, multi-person pose tracking involves measuring human interactions, which can introduce challenges to the tracking process. The number of tracked people is unknown, and human interaction may cause occlusion and overlap. Similar to multi-person pose estimation, existing methods can be divided into two categories: top-down and bottom-up approaches.

(1) Top-down approach

Top-down approaches (Wang et al., 2020; Fang et al., 2022) start by detecting the overall location and bounding box of the human body in frames and then estimate the keypoints of each person. Finally, the estimated human poses are associated according to the similarity between poses in different

Fig. 12 The framework of two approaches for 2D single-person pose tracking.
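The post-processing branch of Fig. 12 — estimate a pose per frame, then smooth the estimates across frames — can be sketched with an exponential moving average (an illustrative stand-in for the correlation analysis the text describes; the function name is hypothetical):

```python
import numpy as np

def smooth_pose_track(poses, alpha=0.5):
    """Post-processing single-person tracking sketch: poses are estimated
    per frame first, then linked by an exponential moving average that
    suppresses frame-to-frame jitter.

    poses: (T, J, 2) per-frame pose estimates.
    alpha: smoothing factor in (0, 1]; 1.0 keeps the raw estimates.
    Returns a (T, J, 2) temporally smoothed track.
    """
    out = np.empty_like(poses)
    out[0] = poses[0]
    for t in range(1, len(poses)):
        out[t] = alpha * poses[t] + (1 - alpha) * out[t - 1]
    return out

# A static joint corrupted by alternating +/-1 pixel jitter around (10, 10).
T, J = 6, 1
jitter = np.array([1, -1, 1, -1, 1, -1], dtype=float).reshape(T, 1, 1)
noisy = np.full((T, J, 2), 10.0) + jitter
smoothed = smooth_pose_track(noisy, alpha=0.5)
print(float(np.abs(smoothed[1:] - 10).max()) < 1.0)  # True: jitter reduced
```

Integrated approaches avoid this purely after-the-fact correction by letting the tracker and the estimator inform each other frame by frame.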

frames. Girdhar et al. (Girdhar et al., 2018) proposed a two-stage method for estimating and tracking human keypoints in complex multi-person videos. The method utilizes Mask R-CNN to perform frame-level pose estimation, which detects person tubes and estimates keypoints in the predicted tubes, and then performs person-level tracking by using lightweight optimization to connect the estimated keypoints over time. However, this method does not consider motion and pose information, which causes difficulty in tracking occasionally truncated humans. To address this issue, Xiu et al. (Xiu et al., 2018) employed the pose flow as a unit and proposed a new pose flow generator which consists of a Pose Flow Builder and Pose Flow NMS. They initially estimated multi-person poses by employing an improved RMPE, and then maximized the overall confidence to construct pose flows. Finally, pose flows were purified by applying Pose Flow NMS to obtain reasonable multi-pose trajectories. To reduce the complexity of the method, Xiao et al. (Xiao et al., 2018) proposed a simple but effective method for pose estimation and tracking. They adopted pose propagation and similarity measurement based on optical flow to improve the greedy matching method for pose tracking. Zhang et al. (Zhang et al., 2019) solved articulated multi-person pose estimation and real-time velocity tracking. An end-to-end multi-task network (MTN) was designed for simultaneously performing human detection, pose estimation, and person re-identification (Re-ID). Given the detection box, keypoints and Re-ID features provided by the MTN, an occlusion-aware strategy is applied for pose tracking. Ning et al. (Ning et al., 2020) proposed a top-down approach that combines single-person pose tracking (SPT) and visual object tracking (VOT) into a unified online functional entity that can be easily implemented with a replaceable single-person pose estimator. They processed each human candidate separately and associated the lost tracked candidates to the targets from the previous frames through pose matching. The human pose matching can be achieved by applying a Siamese Graph Convolution Network as the Re-ID module. Umer et al. (Rafi et al., 2020) proposed a method that relies on the correspondence relationships of keypoints to associate the figures in the video. It is trained on large image datasets with self-supervision for body pose estimation. In combination with the top-down human pose estimation framework, keypoint correspondence is used to recover lost pose detections based on the temporal context and to associate detected and recovered poses for pose tracking.

The methods discussed in this section typically begin by detecting the human body boundary, which can make them susceptible to challenges like occlusion and truncation. Moreover, most methods first estimate poses in each frame and then perform data association and refinement. This strategy essentially relies heavily on non-existent visual evidence in the case of occlusion, so detections are inevitably easy to miss. To this end, Yang et al. (Yang et al., 2021) derived dynamic predictions through a GNN that explicitly takes into account spatio-temporal and visual information. It leverages historical pose tracklets as input and predicts the corresponding poses in the following frames for each tracklet. The predicted poses are then aggregated with the detected poses, so as to recover occluded joints that may have been missed by the estimator, significantly improving the robustness of the method.

The methods mentioned above primarily emphasize pose-based similarities for matching, which usually struggle to re-identify tracks that have been occluded for extended periods or undergone significant pose deformations. In light of this, Doering et al. (Doering and Gall, 2023) proposed a novel gated attention approach which utilizes a duplicate-aware association and automatically adapts the impact of pose-based and appearance-based similarities according to the attention probabilities associated with each similarity metric.

(2) Bottom-up approach

In contrast, bottom-up approaches first detect keypoints of the human body and then group the keypoints into individuals. The grouped keypoints are then connected and associated across frames to generate the complete pose. Iqbal et al. (Iqbal et al., 2017) proposed a novel method which jointly models

multi-person pose estimation and tracking in a single formulation. They represented the detected body joints in the video by a spatio-temporal graph which can be divided into sub-graphs corresponding to the possible trajectories of each human body pose by solving an integer linear program. Raaj et al. (Raaj et al., 2019) proposed Spatio-Temporal Affinity Fields (STAF) across a video sequence for online pose tracking. The connections across keypoints in each frame are represented by Part Affinity Fields (PAFs), and the connections between keypoints across frames are represented by Temporal Affinity Fields. Jin et al. (Jin et al., 2019) viewed pose tracking as a hierarchical detection and grouping problem. They proposed a unified framework consisting of a SpatialNet and a TemporalNet. The SpatialNet implements single-frame body part detection and part-level data association, and the TemporalNet groups human instances in continuous frames into trajectories. The grouping process is modeled by a differentiable Pose-Guided Grouping (PGG) module to make the entire part detection and grouping pipeline fully end-to-end trainable.

The bottom-up approach relates joints spatially and temporally without detecting bounding boxes. Therefore, the computational cost of these methods is almost unaffected by changes in the number of human candidates. However, they require significant computational resources and often suffer from ambiguous keypoint assignment without the global pose view. The top-down approach enhances single-frame pose estimation by incorporating temporal context information to correlate estimated poses across different frames. It simplifies the complex task and improves the keypoint assignment accuracy, although it may increase the computation cost in the case of a large number of human candidates. In summary, the top-down approach outperforms the bottom-up approach both in accuracy and tracking speed, so most of the state-of-the-art methods follow the top-down approach.

3.2 3D pose tracking

With the advancement of 3D pose estimation, pose tracking can be naturally extended into 3D space. Given that current methods primarily focus on multi-person scenarios, we categorize them into two groups without specifying single- or multi-person tracking: multi-stage and one-stage approaches.

(1) Multi-stage approach

The multi-stage approaches generally track poses through several steps such as 2D/3D pose estimation, lifting 2D to 3D poses, and 3D pose linking. These tasks are treated as independent sub-tasks. For example, Bridgeman et al. (Bridgeman et al., 2019) performed independent 2D pose detection per frame and associated the 2D pose detections between different camera views through a fast greedy algorithm. The associated poses are then used to generate and track 3D poses. Zanfir et al. (Zanfir et al., 2018) first applied a single-person feedforward-feedback model to compute 2D and 3D poses, and then performed joint multiple-person optimization under constraints to reconstruct and track multiple-person 3D poses. Mehta et al. (Mehta et al., 2020) estimated 2D and 3D pose features and employed a fully-connected neural network to decode the features into complete 3D poses, followed by a space-time skeletal model fitting.

The above works first estimate poses and then link the poses across frames, in which the concept of tracking is to associate joints of the same person together over time, using joints localized independently in each frame. By contrast, Sun et al. (Sun et al., 2019) improved joint localization based on information from other frames. They proposed to first learn the spatio-temporal joint relationships and then formulated pose tracking as a simple linear optimization problem.

(2) One-stage approach

One-stage approaches (Reddy et al., 2021; Zhang et al., 2022; Choudhury et al., 2023; Zou et al., 2023) aim to train a single end-to-end framework for jointly estimating and linking 3D poses, which can propagate the errors of the sub-tasks in the multi-stage approaches back to the input image pixels of videos. For instance, Reddy et al. (Reddy et al., 2021) introduced TesseTrack to jointly infer 3D pose reconstructions and associations in space and time in a single end-to-end learnable framework. TesseTrack consists of three key components: person detection, pose tracking and pose estimation. With the detected persons, a spatial-temporal person-specific representation is learned for measuring similarity to link poses by solving an assignment problem based on bipartite graph matching. All matched representations are then merged into a single representation which is deconvolved into a 3D pose and taken as the estimated pose. To handle occlusions, VoxelTrack (Zhang et al., 2022) introduces an occlusion-aware multi-view feature fusion strategy for linking poses. Specifically, it jointly estimates and tracks 3D poses from a 3D voxel-based representation constructed from multi-view images. Poses are linked over time by bipartite graph matching based on the representation fused from the different unoccluded views. PHALP (Rajasegaran et al., 2022) accumulates 3D representations over time for better tracking. It relies on a backbone for estimating 3D representations for each human detection, aggregating representations over time and forecasting future states, and eventually associating tracklets with detections using the predicted representations in a probabilistic framework. Snipper (Zou et al., 2023) applies a deformable attention mechanism to aggregate spatio-temporal information for multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single shot. Similar to Snipper, TEMPO (Choudhury et al., 2023) uses a recurrent architecture to fuse both spatial and temporal information into a single representation, which

enables pose estimation, tracking, and forecasting from multi-view information without sacrificing efficiency.

Although both approaches have achieved good performance on 3D multi-person pose tracking, for the first approach, solving each sub-problem independently leads to performance degradation: 1) 2D pose estimation easily suffers from noise, especially in the presence of occlusion; 2) the accuracy of the 3D estimation depends on the 2D estimates and associations across all views; and 3) occlusion-induced unreliable appearance features impact the accuracy of 3D pose tracking. As a result, the second approach has gained prominence in recent years in 3D multi-person pose tracking.

4 Action Recognition

Action recognition aims to identify the class labels of human actions in the input images or videos. For the connection with pose estimation and tracking, this paper only reviews the action recognition methods based on poses. Pose-based action recognition can be categorized into two approaches: estimated pose-based and skeleton-based. Estimated pose-based action recognition approaches take RGB videos as the input and classify actions using poses estimated from the RGB videos. On the other hand, skeleton-based action recognition methods utilize skeletons as their input, which can be obtained through various sensors, including motion capture devices, time-of-flight cameras, and structured light cameras. Fig. 13 illustrates the prevailing frameworks of these two categories of approaches for pose-based action recognition.

Fig. 13 Two categories of approaches for action recognition.

4.1 Estimated pose-based action recognition

Pose features have been shown to perform much better than low/mid-level features and to act as discriminative cues for action recognition (Jhuang et al.,

methods follow a two-stage strategy which first applies existing pose estimation methods to generate poses from videos and then conducts action recognition using the pose features. Chéron et al. (Chéron et al., 2015) proposed P-CNN to extract appearance and flow features conditioned on estimated human poses for action recognition. Zolfaghari et al. (Zolfaghari et al., 2017) designed a body part segmentation network to generate poses and then applied a multi-stream 3D-CNN to integrate poses, optical flow and RGB visual information for action recognition. After generating joint heatmaps with a pose estimator, Choutas et al. (Choutas et al., 2018) proposed a Pose moTion (PoTion) representation by temporally aggregating the heatmaps for action recognition. To avoid relying on inaccurate poses from pose estimation maps, Liu et al. (Liu and Yuan, 2018) aggregated pose estimation maps to form poses and heatmaps, and then evolved them for action recognition. Moon et al. (Moon et al., 2021) proposed a pose-driven approach to integrate appearance and pre-estimated pose information for action recognition. Shah et al. (Shah et al., 2022) designed a Joint-Motion Reasoning Network (JMRN) to better capture the inter-joint dependencies of the poses generated by running a pose detector on each video frame. This line of methods considers pose estimation and action recognition as two separate tasks, so that action recognition performance may be affected by inaccurate pose estimation. Duan et al. (Duan et al., 2022) proposed PoseConv3D, which forms a 3D heatmap volume by estimating 2D poses with an existing pose estimator and stacking the 2D heatmaps along the temporal dimension, and classifies actions with a 3D CNN on top of the volume. Sato et al. (Sato et al., 2023) presented a user prompt-guided zero-shot learning method based on target domain-independent joint features, where the joints are pre-extracted by an existing multi-person pose estimation technique. Rajasegaran et al. (Rajasegaran et al., 2023) proposed a Lagrangian Action Recognition with Tracking (LART) method that applies tracking results to predicting actions. Pose and appearance features are first obtained by the PHALP tracking algorithm (Rajasegaran et al., 2022), and then fused as the input of a transformer network to predict actions. Hachiuma et al. (Hachiuma et al., 2023) introduced a unified framework based on structured keypoint pooling for enhancing the adaptability and scalability of skeleton-based action recognition. Human keypoints and object contour points are initially obtained through multi-person pose estimation and object detection. A structured keypoint pooling is then applied to aggregate keypoint features to overcome skeleton detection and tracking errors. Additionally, non-human object keypoints serve as additional input for eliminating the variety restrictions of targeted actions. Finally, a pooling-switch trick is proposed for weakly supervised spatio-temporal
2013). With the success of pose estimation, some

21
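The 3D heatmap-volume construction used by PoseConv3D-style pipelines can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the heatmap resolution (56×56), the Gaussian width `sigma`, and the (K, T, H, W) layout are readability assumptions.

```python
import numpy as np

def render_heatmap(joints, H=56, W=56, sigma=2.0):
    """Render one Gaussian heatmap per joint for a single frame.
    joints: (K, 2) array of (x, y) pixel coordinates."""
    K = joints.shape[0]
    ys, xs = np.mgrid[0:H, 0:W]
    maps = np.zeros((K, H, W), dtype=np.float32)
    for k, (x, y) in enumerate(joints):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

def heatmap_volume(pose_sequence, H=56, W=56, sigma=2.0):
    """Stack per-frame heatmaps along time: (T, K, 2) -> (K, T, H, W),
    the 3D volume that a 3D CNN classifier would consume."""
    vol = np.stack([render_heatmap(p, H, W, sigma) for p in pose_sequence])  # (T, K, H, W)
    return vol.transpose(1, 0, 2, 3)  # (K, T, H, W)

# toy sequence: 4 frames, 17 joints, coordinates inside a 56x56 map
seq = np.random.rand(4, 17, 2) * 56
vol = heatmap_volume(seq)
print(vol.shape)  # (17, 4, 56, 56)
```

A 3D CNN then treats `vol` as a K-channel spatio-temporal clip, sidestepping coordinate noise by keeping the full per-pixel confidence.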
Fig. 14 Four approaches for skeleton-based action recognition. (1) RNN example (Wang and Wang, 2017). (2) CNN example (Caetano et al., 2019). (3) GCN example (Yan et al., 2018). (4) Transformer example (Plizzari et al., 2021).

Another line of methods jointly solves the pose estimation and action recognition tasks. Luvizon et al. (Luvizon et al., 2018) proposed a multi-task CNN for joint pose estimation from still images and action recognition from video sequences based on appearance and pose features. Due to the different output formats of the pose estimation and action recognition tasks, Foo et al. (Foo et al., 2023) designed a Unified Pose Sequence (UPS) multi-task model, which unifies text-based action labels and coordinate-based poses into a heterogeneous output format for simultaneously processing the two tasks.

4.2 Skeleton-based Action Recognition

Skeleton data is one form of 3D data commonly used for action recognition. It consists of a sequence of skeletons, representing a schematic model of the locations of the trunk, head, and limbs of the human body. Compared with the other two commonly used data modalities, RGB and depth, skeleton data is robust to illumination change and invariant to camera location and subject appearance. With the development of deep learning techniques, skeleton-based action recognition has transitioned from hand-crafted features to deep learning-based features. This survey mainly reviews the recent methods based on different deep learning networks, which can be categorized into CNN-based, RNN-based, GCN-based, and Transformer-based methods, as shown in Fig. 14.

4.2.1 CNN-based approach

Convolutional Neural Networks (CNN), widely employed in the realm of computer vision, possess a natural advantage in image feature extraction due to their exceptional local perception and weight-sharing capabilities. Owing to this success in image processing, CNNs can also capture the spatial information in skeleton sequences well. CNN-based methods for skeleton-based action recognition can be categorized into 2D and 3D CNN-based approaches, depending on the type of neural network utilized.

Most of the 2D CNN-based methods (Du et al., 2015; Wang et al., 2016; Hou et al., 2016; Li et al., 2017; Liu et al., 2017; Ke et al., 2017; Caetano et al., 2019; Li et al., 2019) first convert the skeleton sequence into a pseudo-image, in which the spatial-temporal information of the skeleton sequence is embedded in the colors and textures. Du et al. (Du et al., 2015) mapped the Cartesian coordinates of the joints to RGB coordinates and then quantized the skeleton sequences into an image for feature extraction and action recognition. To reduce the inter-articular occlusion due to perspective transformations, some works (Wang et al., 2016; Hou et al., 2016) proposed to encode the spatial-temporal information of skeleton sequences into three orthogonal color texture images.
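The joints-to-colors pseudo-image encoding described above, in the spirit of Du et al. (Du et al., 2015), can be sketched as follows; the min-max normalization and the joints-by-frames layout are illustrative assumptions rather than the exact mapping of any cited paper.

```python
import numpy as np

def skeleton_to_pseudo_image(seq):
    """Encode a skeleton sequence (T, K, 3) as a K x T RGB pseudo-image:
    rows index joints, columns index frames, and the (x, y, z) coordinates
    are min-max normalized into the three color channels."""
    T, K, _ = seq.shape
    img = np.zeros((K, T, 3), dtype=np.uint8)
    for c in range(3):
        ch = seq[:, :, c]                       # (T, K) values of one coordinate
        lo, hi = ch.min(), ch.max()
        img[:, :, c] = np.round(255 * (ch - lo) / (hi - lo + 1e-8)).T
    return img  # feed to any ordinary 2D CNN image classifier

seq = np.random.randn(30, 25, 3)                # 30 frames, 25 joints
img = skeleton_to_pseudo_image(seq)
print(img.shape, img.dtype)  # (25, 30, 3) uint8
```

Once the sequence is an image, any off-the-shelf 2D CNN can classify it, which is exactly why this family of methods excels at spatial patterns but compresses away fine temporal detail.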
The pair-wise distances between joints on single or multiple skeleton sequences are represented by the Joint Distance Map (JDM) (Li et al., 2017), which is encoded as a color change in the texture image. To explore better spatial feature representations, Ding et al. (Ding et al., 2017) encoded the distance, direction and angle of the joints as spatial features into the texture color images. Ke et al. (Ke et al., 2017) proposed to represent segments of skeleton sequences by images and classified actions using a multi-task learning network based on CNN. Similarly, Liang et al. (Liang et al., 2019) applied multi-task learning based on a three-stream CNN to encode skeletal fragment features, position and motion information.

When compressing skeleton sequences into images by 2D CNN, it is unavoidable to lose some temporal information. By contrast, 3D CNN-based methods (Liu et al., 2017; Hernandez Ruiz et al., 2017) are better at learning spatio-temporal features. Hernandez et al. (Hernandez Ruiz et al., 2017) encoded skeleton sequences as stacked Euclidean Distance Matrices (EDM) computed over joints and then performed convolution along the time dimension to learn the spatio-temporal dynamics of the data.

4.2.2 RNN-based approach

RNN-related networks are often used for processing time-series data, effectively capturing the temporal information within skeleton sequences. Besides temporal information, spatial information is another important cue for action recognition, which may be ignored by RNN-related networks. Some methods focus on solving this problem through spatial division of the human body. For example, Du et al. (Du et al., 2015, 2016) proposed a hierarchical RNN that processes skeleton sequences of five body parts for action recognition. Shahroudy et al. (Shahroudy et al., 2016) proposed a Part-aware LSTM (P-LSTM) that separately models skeleton sequences of body parts and classified actions based on the concatenation of memory cells.

To better focus on the key spatial information in the skeleton data, some methods incorporate attention mechanisms. Song et al. (Song et al., 2017) proposed a spatio-temporal attention model using LSTM, which includes a spatial attention module to adaptively select key joints in each frame and a temporal attention module to select keyframes in skeleton sequences. Similarly, Liu et al. (Liu et al., 2017) proposed a cyclic attention mechanism to iteratively enhance the performance of attention for focusing on key joints. The subsequent improvement by Song et al. (Song et al., 2018) used spatio-temporal regularization to encourage the exploration of relationships among all nodes rather than overemphasizing certain nodes, and avoided an unbounded increase in temporal attention. Zhang et al. (Zhang et al., 2019) proposed a simple, effective, and generalized Element Attention Gate (EleAttG) to enhance the attentional ability of RNN neurons. Si et al. (Si et al., 2019) proposed an Attention-enhanced Graph Convolutional LSTM (AGC-LSTM) to enhance the feature representations of key nodes.

To simultaneously exploit the temporal and spatial features of skeleton sequences, some methods aim to design spatial and/or temporal networks. Wang et al. (Wang and Wang, 2017) proposed a two-stream RNN for simultaneously learning spatial and temporal relationships of skeleton sequences, enhancing the generalization ability of the model through a skeleton data augmentation technique with 3D transformations. Liu et al. (Liu et al., 2016) proposed a spatial-temporal LSTM network, extending traditional LSTM-based learning into the temporal and spatial domains. Considering the importance of the relationships between non-neighboring joints in the skeleton data, Zhang et al. (Zhang et al., 2017) designed eight geometric relational features to model the spatial information and evaluated them in a three-layer LSTM network. Si et al. (Si et al., 2018) proposed a novel Spatial Reasoning and Temporal Stack Learning (SR-TSL) model to capture high-level spatial structural information within each frame and to model the detailed dynamic information by combining multiple skip-clip LSTMs.

4.2.3 GCN-based approach

GCN is a recently popular network for skeleton-based action recognition, because the human skeleton is a natural graph structure. Compared with CNN- and RNN-based methods, GCN-based methods can better capture the relationships between joints in the skeleton sequence. According to whether the topology (namely, the vertex connection relationships) is dynamically adjusted during inference, GCN-based methods can be classified into static methods (Yan et al., 2018; Huang et al., 2020; Liu et al., 2020; Zhang et al., 2020) and dynamic methods (Li et al., 2019; Shi et al., 2019; Cheng et al., 2020; Korban and Li, 2020; Chen et al., 2021; Chi et al., 2022; Duan et al., 2022; Wang et al., 2022; Wen et al., 2023; Lin et al., 2023; Li et al., 2022; Dai et al., 2023; Zhu et al., 2023; Shu et al., 2023; Wu et al., 2023).

For static methods, the topologies of GCNs remain fixed during inference. For instance, an early application of graph convolutions, the spatial-temporal GCN (ST-GCN) (Yan et al., 2018), applies a predefined and fixed topology based on the human body structure. Liu et al. (Liu et al., 2020) introduced a multi-scale graph topology into GCNs for modeling multi-range joint relationships.

For dynamic methods, the topologies of GCNs are dynamically inferred during inference. The action structure graph convolution network (AS-GCN) (Li et al., 2019) applies an A-link inference module to capture action-specific correlations. The two-stream adaptive GCN (2s-AGCN) (Shi et al., 2019) and the semantics-guided network (SGN) (Zhang et al., 2020) enhanced topology learning with self-attention mechanisms for modeling correlations between two joints.
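The fixed-topology spatial graph convolution underlying ST-GCN-style models can be sketched with the symmetrically normalized adjacency A_hat = D^{-1/2}(A + I)D^{-1/2}. This is a minimal sketch: the neighbor partitioning and temporal convolutions of the full ST-GCN are omitted, and the edge list and dimensions are toy assumptions.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops,
    A_hat = D^{-1/2} (A + I) D^{-1/2}, fixed by the body structure."""
    A = np.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

def spatial_graph_conv(X, A_hat, W):
    """One spatial graph-convolution step applied per frame.
    X: (T, K, C_in) joint features, W: (C_in, C_out).
    Returns (T, K, C_out) = A_hat X W for every frame."""
    return np.einsum('kj,tjc,cd->tkd', A_hat, X, W)

edges = [(0, 1), (1, 2), (2, 3)]          # a tiny 4-joint kinematic chain
A_hat = normalized_adjacency(edges, 4)
X = np.random.randn(10, 4, 8)             # 10 frames, 4 joints, 8 channels
W = np.random.randn(8, 16)
Y = spatial_graph_conv(X, A_hat, W)
print(Y.shape)  # (10, 4, 16)
```

Static methods keep `A_hat` frozen as above; dynamic methods replace or refine it with a matrix predicted from the input features at inference time.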
Although dynamic topology modeling is beneficial for inferring the intrinsic relations of joints, it may be difficult to encode the context of an action, since the captured topologies are independent of a pose. Therefore, some methods focus on context-dependent intrinsic topology modeling. In Dynamic GCN (Ye et al., 2020), contextual features of all joints are incorporated to learn the relations of joints. The channel topology refinement GCN (CTR-GCN) (Chen et al., 2021) focuses on embedding joint topology in different channels, while InfoGCN (Chi et al., 2022) introduces attention-based graph convolution to capture the context-dependent topology based on the latent representation learned by an information bottleneck. The Multi-Level Spatial-Temporal excited Graph Network (ML-STGNet) (Zhu et al., 2023) introduces a Transformer-based, spatial data-driven excitation module to learn the joint relations of different samples in a data-dependent way. The Multi-View Interactional Graph Network (MV-IGNet) (Wang et al., 2023) designs a global context adaptation module for adaptively learning topology structures on multi-level spatial skeleton contexts. The Spatial Graph Diffusion Convolutional (S-GDC) network (Li et al., 2023) learns new graphs by graph diffusion to capture the connections of distant joints on the same body and on two interacting bodies. In the above dynamic methods, the topology modeling is based only on joint information. By contrast, the language model knowledge-assisted GCN (LA-GCN) (Xu et al., 2023) applies a large-scale language model to incorporate action-related prior information for learning the topology for action recognition.

Whether static or dynamic, these methods aim to construct different GCNs for modeling the spatial and temporal features of actions. In contrast, some papers work on strategies to assist the abilities of different GCNs. For instance, Wang et al. (Wang et al., 2023) proposed neural Koopman pooling to replace temporal average/max pooling for aggregating spatial-temporal features; the Koopman pooling learns class-wise dynamics for better classification. Zhou et al. (Zhou et al., 2023) presented a Feature Refinement head (FR Head) based on contrastive learning to improve the discriminative power on ambiguous actions. With the FR Head, the performance of some existing methods (e.g., 2s-AGCN (Shi et al., 2019), CTR-GCN (Chen et al., 2021)) can be improved by about 1%.

In summary, GCN-based methods can effectively utilize and handle the joint relations through topological networks but are generally limited to local spatial-temporal neighborhoods. Compared with static methods, dynamic methods have stronger generalization capabilities due to their dynamic topologies.

4.2.4 Transformer-based approach

Transformer was originally designed for machine translation tasks in natural language processing. Vision Transformer (ViT) (Dosovitskiy et al., 2020) is the first work to use a Transformer encoder to extract image features in computer vision. When introducing Transformer to skeleton-based action recognition, the core question is how to design a better encoder for modeling the spatial and temporal information of skeleton sequences. Compared with GCN-based methods, Transformer-based methods can quickly obtain global topology information and enhance the correlation of non-physically-connected joints. There are mainly three categories of methods: pure Transformer, hybrid Transformer and unsupervised Transformer.

The first category of methods applies the standard Transformer for learning spatial and temporal features. A spatial Transformer and a temporal Transformer are often applied alternately or together in a one-stream (Shi et al., 2020; Wang et al., 2021; Ijaz et al., 2022) or two-stream (Zhang et al., 2021; Shi et al., 2021; Gedamu et al., 2023) network. Shi et al. (Shi et al., 2020) proposed to decouple the data into spatial and temporal dimensions, where the spatial and temporal streams respectively include motion-irrelevant and motion-relevant features. A Decoupled Spatial-Temporal Attention Network (DSTA-Net) was proposed to encode the two streams sequentially based on the attention module; it allows modeling spatial-temporal dependencies between joints without information about their positions or mutual connections. Ijaz et al. (Ijaz et al., 2022) proposed a multi-modal Transformer-based network for nursing activity recognition, which fuses the encoding results of a spatial-temporal skeleton model and an acceleration model. The spatial-temporal skeleton model comprises spatial and temporal Transformer encoders in sequential processing, which compute spatial and temporal features from joints; the acceleration model has one Transformer block, which computes correlations across acceleration data points for a given action sample. Zhang et al. (Zhang et al., 2021) proposed a Spatial-Temporal Special Transformer (STST) to capture skeleton sequences in the temporal and spatial dimensions separately. STST is a two-stream structure including a spatial transformer block and a directional temporal transformer block. The Relation-mining Self-Attention Network (RSA-Net) (Gedamu et al., 2023) applies seven RSA blocks in the spatial and temporal domains for learning intra-frame and inter-frame action features. Such a two-stream structure extends the feature dimension and lets the network capture richer information, but at the same time increases the computational cost. To reduce the computational cost, Shi et al. (Shi et al., 2021) proposed a Sparse Transformer-based Action Recognition (ST-AR) model. ST-AR consists of a sparse self-attention module based on sparse matrix multiplications for capturing spatial correlations, and a segmented linear self-attention module that processes variable-length sequences for capturing temporal correlations, further reducing the computation and memory cost.
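The decoupled spatial/temporal attention scheme discussed above can be illustrated with a minimal single-head sketch: spatial attention mixes joints within each frame, then temporal attention mixes frames per joint. Real models such as DSTA-Net add learned projections, multiple heads, and positional encodings, all omitted here; the unprojected attention is an illustrative simplification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head scaled dot-product attention over the second-to-last axis.
    X: (..., N, C) -> (..., N, C); Q = K = V = X for brevity."""
    C = X.shape[-1]
    scores = X @ np.swapaxes(X, -1, -2) / np.sqrt(C)
    return softmax(scores, -1) @ X

def decoupled_st_attention(X):
    """X: (T, K, C). Attend over the K joints within each frame,
    then over the T frames for each joint (a sketch, not DSTA-Net itself)."""
    Xs = self_attention(X)                      # spatial: mixes joints
    Xt = self_attention(Xs.transpose(1, 0, 2))  # temporal: mixes frames
    return Xt.transpose(1, 0, 2)

X = np.random.randn(16, 25, 64)  # 16 frames, 25 joints, 64 channels
Y = decoupled_st_attention(X)
print(Y.shape)  # (16, 25, 64)
```

Because the attention weights are computed from the data rather than a fixed graph, every joint can attend to every other joint, which is the global-topology advantage of Transformer-based methods noted above.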
Since Transformer is weak at extracting discriminative information from local features and short-term temporal information, the second category of methods (Plizzari et al., 2021; Zhou et al., 2022; Qiu et al., 2022; Kong et al., 2022; Zhang et al., 2022; Gao et al., 2022; Liu et al., 2022; Pang et al., 2022; Wang et al., 2023; Duan et al., 2023) integrates Transformer with GCN and CNN for better feature extraction, which is beneficial for exploiting the advantages of different networks. Plizzari et al. (Plizzari et al., 2021) proposed a two-stream Spatial-Temporal TRansformer network (ST-TR) by integrating spatial and temporal Transformers with a Temporal Convolution Network (TCN) and GCN. Qiu et al. (Qiu et al., 2022) proposed a Spatio-Temporal Tuples Transformer (STTFormer), which includes a spatio-temporal tuples self-attention module for capturing joint relationships in consecutive frames and an Inter-Frame Feature Aggregation (IFFA) module for enhancing the ability to distinguish similar actions. Similar to ST-TR, the IFFA module applies a TCN to aggregate features of sub-actions. Yang et al. (Zhang et al., 2022) presented Zoom-Former for extending single-person action recognition to multi-person group activities. Zoom-Former improves the traditional GCN by designing a Relation-aware Attention mechanism, which comprehensively leverages the prior knowledge of body structure and the global characteristics of human motion to exploit multi-level features. With this improvement, Zoom-Former can hierarchically extract the low-level motion information of a single person and the high-level interaction information of multiple people. To effectively capture the relationship between key local joints and global contextual information in the spatial and temporal dimensions, Gao et al. (Gao et al., 2022) proposed an end-to-end Focal and Global Spatial-Temporal transFormer (FG-STForm) by integrating temporal convolutions into a global self-attention mechanism. Liu et al. (Liu et al., 2022) proposed a Kernel Attention Adaptive Graph Transformer Network that uses a graph transformer operator for modeling higher-order spatial dependencies between joints. Wang et al. (Wang et al., 2023) proposed a Multi-order Multi-mode Transformer (3Mformer) that applies a higher-order Transformer to process hypergraphs of skeleton data for better capturing higher-order motion patterns between body joints. SkeleTR (Duan et al., 2023) initially employs a GCN to capture intra-person dynamic information and then applies a stacked Transformer encoder to model person interactions. It can handle different tasks including video-level action recognition, instance-level action detection and group activity recognition.

To improve the generalization ability of features, the third category of methods (Kim et al., 2022; Dong et al., 2023; Shah et al., 2023; Cheng et al., 2021; Wu et al., 2023; Hua et al., 2023) focuses on unsupervised or self-supervised action recognition based on Transformer, which has demonstrated excellent performance in capturing global context and local joint dynamics. These methods normally apply contrastive learning or an encoder-decoder architecture for learning a better representation of actions. Kim et al. (Kim et al., 2022) proposed GL-Transformer, which designs a global and local attention mechanism to learn the local joint motion changes and the global contextual information of skeleton sequences. With the motion sequence representation, actions are classified based on average pooling along the temporal axis. Shah et al. (Shah et al., 2023) designed the HaLP module, which generates hallucinated latent positive samples for self-supervised learning based on contrastive learning. This module can explore the latent space of human poses in appropriate directions to generate new positive samples, and it optimizes the solution efficiency with a new approximation function.

In summary, research on skeleton-based action recognition has made great progress in recent years. CNN-based methods mainly convert skeleton sequences into images, excelling at capturing the spatial information of actions but potentially losing temporal information. With the help of RNNs for representing temporal information, RNN-based methods focus on representing spatial information based on the spatial division of the human body combined with attention mechanisms. Compared with CNN- and RNN-based methods, GCN- and Transformer-based methods have greater advantages and have become the mainstream. GCN-based methods are beneficial for representing joint relations through topological networks, in which dynamic topology-based methods have stronger generalization ability than static ones; however, they are mostly confined to local spatial-temporal neighborhoods. Transformer-based methods can quickly obtain global topology information and enhance the correlation of non-physically-connected joints. Combining Transformers with CNN and GCN represents a promising approach for extracting both local and global features, enhancing action recognition performance.

5 Benchmark datasets

This section reviews the commonly used datasets for the three tasks and also compares the performance of different methods on some popular datasets.

5.1 Pose estimation

The datasets are reviewed for the 2D and 3D pose estimation tasks, with details summarized in Table 1 and Table 2. Due to the page limit, we mainly review some popular and large-scale pose datasets in the following sections.
Table 1 Datasets for 2D HPE. PCP: Percentage of Correct Localized Parts, PCPm: Mean Percentage of Correctly Localized Parts, PCK: Percentage of Correct Keypoints, PCKh: Percentage of Correct Keypoints with a specified head size, AP: Average Precision, mAP: mean Average Precision. IB: image-based, VB: video-based. SP: single person, MP: multi-person. Train, Val and Test are frame numbers, except for Penn Action and PoseTrack, where they are video numbers.
Dataset Year Citation #Poses #Joints Train Val Test SP/MP Actions Metrics
Image-based (IB):
LSP (Johnson and Everingham, 2010) 2010 971 2,000 14 1k - 1k SP × PCP/PCK
LSPET (Johnson and Everingham, 2011) 2011 509 10,000 14 10k - - SP × PCP
FLIC (Sapp and Taskar, 2013) 2013 537 5,003 10 4k - 1k SP × PCK/PCP
MPII (Andriluka et al., 2014) 2014 2583 26,429 16 29k - 12k SP ✓ PCPm/PCKh
MPII multi-person (Andriluka et al., 2014) 2014 2583 14,993 16 3.8k - 1.7k MP ✓ mAP
MSCOCO16 (Lin et al., 2014) 2014 37862 105,698 17 45k 22k 80k MP × AP
MSCOCO17 (Lin et al., 2014) 2014 37862 - 17 64k 2.7k 40k MP × AP
LIP (Gong et al., 2017) 2017 482 50,462 16 30k 10k 10k SP × PCK
CrowdPose (Li et al., 2019) 2019 423 80,000 14 10k 2k 8k MP × mAP
Video-based (VB):
J-HMDB (Jhuang et al., 2013) 2013 849 31,838 15 2.4k - 0.8k SP ✓ PCK
Penn Action (Zhang et al., 2013) 2013 367 159,633 13 1k - 1k SP ✓ PCK
PoseTrack17 (Andriluka et al., 2018) 2017 420 153,615 15 292 50 208 MP ✓ mAP
PoseTrack18 (Andriluka et al., 2018) 2018 420 - 15 593 170 375 MP ✓ mAP
PoseTrack21 (Doering et al., 2022) 2022 15 - 15 593 170 - MP ✓ mAP
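As a concrete reading of the PCK metric listed in Table 1: a predicted joint counts as correct when its distance to the ground truth falls within a fraction alpha of a reference scale (head-segment size for PCKh, torso size for classic PCK). A minimal sketch with toy numbers (the alpha value and shapes are illustrative assumptions):

```python
import numpy as np

def pck(pred, gt, scale, alpha=0.5):
    """Percentage of Correct Keypoints.
    pred, gt: (N, K, 2) 2D joint positions; scale: (N,) per-sample
    reference lengths; a joint is correct if dist <= alpha * scale."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) distances
    correct = dist <= alpha * scale[:, None]
    return correct.mean()

gt = np.zeros((2, 3, 2))                             # 2 samples, 3 joints
pred = np.array([[[0, 0], [0, 4], [0, 20]],
                 [[3, 0], [0, 0], [0, 0]]], dtype=float)
print(pck(pred, gt, scale=np.array([10.0, 10.0])))   # 0.8333333333333334
```

With alpha = 0.5 and scale = 10, the threshold is 5 pixels, so 5 of the 6 toy joints count as correct.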

Table 2 Datasets for 3D HPE. MPJPE: Mean Per Joint Position Error, PA-MPJPE: Procrustes Analysis Mean Per Joint Position Error, MPJAE: Mean Per Joint Angular Error, 3DPCK: 3D Percentage of Correct Keypoints, AP: Average Precision. All datasets are video-based (VB).
Dataset Year Citation #Joints #Frames SP/MP Actions Metrics
HumanEva-I (Sigal et al., 2010) 2010 1678 15 37.6k SP ✓ MPJPE/PA-MPJPE
Human3.6M (Ionescu et al., 2013) 2014 2677 17 3.6M SP ✓ MPJPE
MPI-INF-3DHP (Mehta et al., 2017) 2017 851 15 1.3M SP ✓ 3DPCK
CMU Panoptic (Joo et al., 2017) 2017 680 15 1.5M MP ✓ 3DPCK/MPJPE
3DPW (von Marcard et al., 2018) 2018 674 18 51k MP × MPJPE/MPJAE/PA-MPJPE
MuPoTs-3D (Mehta et al., 2018) 2018 346 15 8k MP × 3DPCK
MuCo-3DHP (Mehta et al., 2018) 2018 346 - - MP × 3DPCK
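Similarly, the MPJPE metric of Table 2 is simply the mean Euclidean distance between predicted and ground-truth 3D joints, averaged over joints and samples; a minimal sketch:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance
    (in mm when the inputs are in mm) over all joints and samples.
    pred, gt: (N, K, 3) 3D joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((1, 2, 3))
pred = np.array([[[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]])
print(mpjpe(pred, gt))  # (5 + 0) / 2 = 2.5
```

PA-MPJPE applies a Procrustes alignment (rigid rotation, translation and scale) between `pred` and `gt` before computing the same quantity, which removes global pose differences.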

5.1.1 Datasets for 2D pose estimation

For image-based 2D pose estimation, the Microsoft Common Objects in Context (COCO) (Lin et al., 2014) and Max Planck Institute for Informatics (MPII) (Andriluka et al., 2014) datasets are popular. The Joint-annotated HMDB (J-HMDB) (Jhuang et al., 2013) and Penn Action (Zhang et al., 2013) datasets are often used for 2D video-based single-person pose estimation (SPPE), while PoseTrack (Andriluka et al., 2018) is often used for video-based multi-person pose estimation (MPPE).

The COCO dataset (Lin et al., 2014) is the most widely used large-scale dataset for pose estimation. It was created by extracting everyday scene images with common objects and labeling the objects using per-instance segmentation. This dataset consists of more than 330,000 images and 200,000 labeled persons, and each person is labeled with 17 keypoints. It has two versions for pose estimation, COCO2016 and COCO2017, which differ in the number of images for training, testing and validation, as shown in Table 1. Besides pose estimation, this dataset is also suitable for object detection, image segmentation and captioning.

The MPII dataset (Andriluka et al., 2014) was collected from 3,913 YouTube videos by the Max Planck Institute for Informatics. It consists of 24,920 images including over 40,000 individuals with 16 annotated body joints. These images were collected by a two-level hierarchical method to capture everyday human activities. The dataset involves 491 activity samples in 21 classes, and all the images are labeled. Besides joints, rich annotations including body occlusion and 3D torso and head orientations are also labeled via Amazon Mechanical Turk. The MPII dataset serves as a valuable resource for both 2D single-person and multi-person pose estimation.

The J-HMDB dataset (Jhuang et al., 2013) was created by annotating human joints of the HMDB51 action dataset. From HMDB51, 928 videos covering 21 single-person actions were extracted, and the human joints in each were annotated using a 2D articulated human puppet model. Each video consists of 15-40 frames; in total, there are 31,838 annotated frames. This dataset can serve as a benchmark for human detection, pose estimation, pose tracking and action recognition. It also presents a new challenge for video-based pose estimation and tracking, since it includes more variations in camera motion, motion blur and partial or full-body visibility. The Sub-J-HMDB dataset (Jhuang et al., 2013) is a subset of J-HMDB and contains 316 videos with a total of 11,200 frames.

The Penn Action dataset (Zhang et al., 2013) is also an annotated sports action dataset, collected by the University of Pennsylvania. It consists of 2,326 videos with 15 actions, and each frame was annotated with 13 keypoints per person. The dataset can be used for the tasks of pose estimation, action detection and recognition.

The PoseTrack dataset (Andriluka et al., 2018) was collected from raw videos of the MPII Pose Dataset. For each frame in MPII, 41-298 neighboring frames with crowded scenes and multiple individuals were selected for the PoseTrack dataset. The selected videos were annotated with person locations, identities, body pose and ignore regions.
videos were annotated with person locations, identi- 5.1.3 Performance comparison
ties, body pose and ignore regions. According to dif-
In Table 3, we present a comparison of different meth-
ferent number of videos, this dataset currently exists
ods for 2D image-based SPPE and MPPE on the
in three versions: PoseTrack2017, PoseTrack2018,
COCO dataset. For the SPPE task, the performance
and PoseTrack2021. In total, PoseTrack2017 contains
of heatmap-based methods generally outperforms the
292 videos for training, and 50 videos for valida-
regression-based methods. This superiority can be
tion and 208 videos for testing. Among them, 23,000
attributed to the richer spatial information provided
frames are labeled with a very lager number (i.e.
by heatmaps, where the probabilistic prediction of
153,615) of annotated poses. PoseTrack2018 increases
each pixel enhances the accuracy of keypoint local-
the number of the video and contains 593 videos for
ization. However, heatmap-based methods (Ye et al.,
training, 170 videos for validation, and 315 videos
2023) suffer seriously from the quantization error
for testing, and consists of 46,933 labeled frames.
problem and high-computational cost using high res-
PoseTrack2021 is an extension of PoseTrack2018 with
olution heatmaps. For the MPPE task, the top-down
more annotations (eg. bounding box of small per-
methods overall outperform the bottom-up methods
sons, joint occlusions). With the person identities,
by the success of existing SPPE techniques after
this dataset has been widely used as a benchmark to
detecting individuals. However, they suffer from early
evaluate multi-person pose estimation and tracking
commitment and have greater computational costs
algorithms.
than bottom-up methods. One-stage methods speed
5.1.2 Datasets for 3D pose estimation up the process by eliminating the intermediate oper-
Compared with the 2D datasets, acquiring high- ations (eg., grouping, ROI, NMS) introduced by
quality annotation for 3D poses is more challenging and requires motion capture systems (e.g., MoCap, wearable IMUs). Therefore, 3D pose datasets are normally built in constrained environments. Currently, Human3.6M and MPI-INF-3DHP are widely used for the SPPE task, and MuPoTS-3D is often used for the MPPE task.

The Human3.6M dataset (Ionescu et al., 2013) is the largest and most representative indoor dataset for 3D single-person pose estimation. It was collected by recording videos of 11 human subjects performing 17 activities from 4 camera views, with poses captured by a marker-based MoCap system. In total, this dataset consists of 3.6 million poses, one pose per frame, and is suitable for the HPE task from both images and videos. For video-based HPE, a sequence of frames within a suitable receptive field is taken as the input. Protocol 1 is the most common protocol, which uses the frames of 5 subjects (S1, S5, S6, S7, S8) for training and the frames of 2 subjects (S9, S11) for testing.

The MPI-INF-3DHP dataset (Mehta et al., 2017) is a large 3D single-person pose dataset covering both indoor and outdoor environments. It was captured by a marker-less MoCap system in a multi-camera studio, with 8 subjects performing 8 activities from 14 camera views. This dataset provides 1.3 million frames and more diverse motions than Human3.6M. Like Human3.6M, it is suitable for the HPE task from images or videos. The test set includes the frames of 6 subjects in different scenes.

The MuPoTS-3D dataset (Mehta et al., 2018) is a multi-person 3D pose dataset covering both indoor and outdoor environments. Like MPI-INF-3DHP, it was captured by a multi-view marker-less MoCap system. Over 8,000 frames were collected in 20 videos of 8 subjects. Some frames are challenging due to occlusions, drastic illumination changes, and lens flares in outdoor scenes.

top-down and bottom-up methods, while their performance (Liu et al., 2023) is still lower (by about 9% AP in the best case) than that of top-down methods (Xu et al., 2022). Moreover, it is observed that the backbone and the input image size are two important factors for the results. Commonly used backbones include ResNet, HRNet and Hourglass. Recent Transformer-based networks (e.g., ViTAE-G, Swin-L) can also serve as the backbone, and the method (Xu et al., 2022) based on the ViTAE-G network achieves the best performance. When the same backbone is used for the same category of methods (Zhang et al., 2020; Yang et al., 2021), the larger the input image size, the better the performance.

Table 4 and Table 5 compare different methods for 2D video-based SPPE and MPPE. Overall, the two categories of methods for video-based SPPE achieve comparable results on the two datasets, yet sample-frame-based methods (Zeng et al., 2022) are generally faster than frame-by-frame ones since they avoid processing every frame. Similar to image-based MPPE, the top-down methods achieve better performance than the bottom-up methods for video-based MPPE.

For 3D pose estimation, taking the Human3.6M, MPI-INF-3DHP and MuPoTS-3D datasets as examples, Table 6 and Table 7 respectively show the comparisons for SPPE and MPPE from images or videos. The comparison for video-based MPPE was not conducted because only a few such methods exist. For the SPPE task, two-stage methods, which normally lift 3D poses from estimated 2D poses, generally outperform one-stage methods thanks to the success of 2D pose estimation techniques. It is also noted that a recent one-stage method based on a Transformer network (Qiu et al., 2022) achieves quite good results. Comparing the same category of methods between images and videos, the performance based on videos is better than that based on images. This demonstrates that the temporal information of videos
Table 3 Performance comparison for 2D image-based pose estimation on COCO dataset.
Category Year Method Backbone Input size AP AP.5 AP.75 APM APL
2021 TFPose (Mao et al., 2021) ResNet-50 384×288 72.2 90.9 80.1 69.1 78.8
Regression-based 2021 PRTR (Li et al., 2021) HRNet-W32 512×384 72.1 90.4 79.6 68.1 79.4
2022 Panteleris et al. (Panteleris and Argyros, 2022) - 384×288 72.6 - - - -
SP 2021 Li et al. (Li et al., 2021) HRNet-W48 - 75.7 92.3 82.9 72.3 81.3
Heatmap-based 2022 Li et al. (Li et al., 2022) HRNet-W48 384×288 76.0 92.4 83.5 72.5 81.9
2023 DistilPose (Ye et al., 2023) HRNet-W48-stage3 256×192 73.7 91.6 81.1 70.2 79.6
2017 Papandreou et al. (Papandreou et al., 2017) ResNet-101 353×257 68.5 87.1 75.5 65.8 73.3
2017 RMPE (Fang et al., 2017) Hourglass - 61.8 83.7 69.8 58.6 67.6
2018 Xiao et al. (Xiao et al., 2018) ResNet-152 384×288 73.7 91.9 81.1 70.3 80.0
2018 CPN (Chen et al., 2018) ResNet 384×288 73.0 91.7 80.9 69.5 78.1
2019 Posefix (Moon et al., 2019) ResNet-152 384×288 73.6 90.8 81.0 70.3 79.8
2019 Sun et al. (Sun et al., 2019) HRNet-W48 384×288 77.0 92.7 84.5 73.4 83.1
2019 Su et al. (Su et al., 2019) ResNet-152 384×288 74.6 91.8 82.1 70.9 80.6
2020 Cai et al. (Cai et al., 2020) 4×RSN-50 384×288 78.6 94.3 86.6 75.5 83.3
2020 Huang et al. (Huang et al., 2020) HRNet 384×288 77.5 92.7 84.0 73.0 82.4
Top-down 2020 Zhang et al. (Zhang et al., 2020) HRNet-W48 384×288 77.4 92.6 84.6 73.6 83.7
2020 Graphpcnn (Wang et al., 2020) HR48 384×288 76.8 92.6 84.3 73.3 82.7
2020 Qiu et al. (Qiu et al., 2020) - 384×288 74.1 91.9 82.2 - -
2021 TransPose (Yang et al., 2021) HRNet-W48 256×192 75.0 92.2 82.3 71.3 81.1
2021 TokenPose (Li et al., 2021) - 384×288 75.9 92.3 83.4 72.2 82.1
2021 HRFormer (Yuan et al., 2021) - 384×288 76.2 92.7 83.8 72.5 82.3
2022 ViTPose (Xu et al., 2022) ViTAE-G 576×432 81.1 95.0 88.2 77.8 86.0
MP 2022 Xu et al. (Xu et al., 2022) HR48 384×288 76.6 92.4 84.3 73.2 82.5
2023 PGA-Net (Jiang et al., 2023) HRNet-W48 384×288 76.0 92.5 83.5 72.4 82.1
2023 BCIR (Gu et al., 2023) HRNet-W48 384×288 76.1 - - - -
2017 Associative embedding (Newell et al., 2017) Hourglass 512×512 65.5 86.8 72.3 60.6 72.6
2018 Multiposenet (Kocabas et al., 2018) ResNet50 480×480 69.6 86.3 76.6 65.0 76.3
2018 OpenPose (Cao et al., 2017b) - - 61.8 84.9 67.5 57.1 68.2
2019 Pifpaf (Kreiss et al., 2019) ResNet50 - 55.0 76.0 57.9 39.4 76.4
2020 Jin et al. (Jin et al., 2020) Hourglass 512×512 67.6 85.1 73.7 62.7 74.6
Bottom-up
2020 Higherhrnet (Cheng et al., 2020) HrHRNet-W48 640×640 72.3 91.5 79.8 67.9 78.2
2021 DEKR (Geng et al., 2021) HRNet-W48 640×640 71.0 89.2 78.0 67.1 76.9
2023 HOP (Qu et al., 2023) HRNet-W48 640×640 70.5 89.3 77.2 66.6 75.8
2023 Cheng et al. (Cheng et al., 2023) HRNet-W48 640×640 71.5 89.1 78.5 67.2 78.1
2023 PolarPose (Li et al., 2023) HRNet-W48 640×640 70.2 89.5 77.5 66.1 76.4
2019 Directpose (Tian et al., 2019) ResNet-101 800×800 64.8 87.8 71.1 60.4 71.5
2021 FCPose (Mao et al., 2021) DLA-60 736 × 512 65.9 89.1 72.6 60.9 74.1
2021 InsPose (Shi et al., 2021) HRNet-w32 - 71.0 91.3 78.0 67.5 76.5
One-stage 2022 PETR (Shi et al., 2022) Swin-L - 71.2 91.4 79.6 66.9 78.0
2023 ED-pose (Yang et al., 2023) Swin-L - 72.7 92.3 80.9 67.6 80.0
2023 GroupPose (Liu et al., 2023) Swin-L - 72.8 92.5 81.0 67.7 80.3
2023 SMPR (Miao et al., 2023) HRNet-w32 800×800 70.2 89.7 77.5 65.9 77.2
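The AP columns in Table 3 are computed from Object Keypoint Similarity (OKS) at multiple thresholds (AP.5 is the AP at OKS = 0.5, and so on). The sketch below is a simplified, hypothetical version of the metric: COCO actually uses a per-keypoint constant and only visibly annotated keypoints, whereas here a single constant k and all joints are assumed.

```python
import math

def oks(pred, gt, area, k=0.05):
    """Simplified Object Keypoint Similarity: per-joint Gaussian score
    exp(-d^2 / (2 * s^2 * k^2)) averaged over all joints, where s^2 is the
    object area and k a single spread constant (an assumption here; the
    real metric uses one constant per keypoint type)."""
    scores = [math.exp(-math.dist(p, g) ** 2 / (2 * area * k ** 2))
              for p, g in zip(pred, gt)]
    return sum(scores) / len(scores)

gt = [(50.0, 50.0), (60.0, 60.0), (70.0, 55.0)]
print(oks(gt, gt, area=1000.0))  # identical poses -> 1.0
```

A predicted pose then counts as a true positive at threshold t when its OKS with a matched ground-truth pose exceeds t, and AP averages the resulting precision over recall levels.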
Table 4 Performance comparison for 2D video-based SPPE on Penn Action dataset and JHMDB dataset. FF: frame-by-frame; SF: sample frame-based.
Category Year Method Penn PCK JHMDB PCK
FF 2016 Gkioxari et al. (Gkioxari et al., 2016) 91.8 -
FF 2017 Song et al. (Song et al., 2017) 96.4 92.1
FF 2018 LSTM (Luo et al., 2018) 97.7 93.6
FF 2019 DKD (Nie et al., 2019) 97.8 94.0
FF 2019 Li et al. (Li et al., 2019a) - 94.8
FF 2022 RPSTN (Dang et al., 2022) 98.7 97.7
FF 2023 HANet (Jin et al., 2023) - 99.6
SF 2020 K-FPN (Zhang et al., 2020) 98.0 94.7
SF 2022 REMOTE (Ma et al., 2022) 98.6 95.9
SF 2022 DeciWatch (Zeng et al., 2022) - 98.9
SF 2023 MixSynthFormer (Sun et al., 2023) - 99.3

Table 5 Performance comparison for 2D video-based MPPE on PoseTrack2017 dataset.
Category Year Method Val mAP Test mAP
Top-down 2018 Xiao et al. (Xiao et al., 2018) 76.7 73.9
Top-down 2018 Pose Flow (Xiu et al., 2018) 66.5 63.0
Top-down 2018 Detect-Track (Girdhar et al., 2018) - 64.1
Top-down 2020 Wang et al. (Wang et al., 2020) 81.5 73.5
Top-down 2022 AlphaPose (Fang et al., 2022) 74.7 -
Top-down 2023 SLT-Pose (Gai et al., 2023) 81.5 -
Top-down 2023 DiffPose (Feng et al., 2023) 83.0 -
Top-down 2023 TDMI (Feng et al., 2023) 83.6 -
Bottom-up 2019 PGG (Jin et al., 2019) 77.0 -

is beneficial for estimating more accurate poses. From Table 7, good progress has been made in recent years for the MPPE task. Specifically, one-stage methods generally perform better than most top-down and bottom-up methods, which further implies that end-to-end training can reduce intermediate errors such as those from human detection and joint grouping.

5.2 Pose tracking
This section reviews the datasets for pose tracking and compares different methods on some of them.

5.2.1 Datasets
Table 8 summarizes the datasets, with a focus on the Campus, CMU Panoptic, and PoseTrack datasets, which are highly cited and frequently used for evaluating multi-person pose tracking. These datasets are preferred because multi-person poses are more representative of real-world scenarios. In the earlier stage, VideoPose2.0 was often applied for single-person pose tracking. The PoseTrack dataset has been discussed in Section 5.1.1. In the following, we only review the other three datasets.

The VideoPose2.0 dataset (Sapp et al., 2011) is a video dataset for tracking the poses of upper and lower arms. The videos were collected from the TV shows "Friends" and "Lost" and normally contain a single actor and a variety of movements. This
Table 6 Performance comparison for 3D SPPE on Human3.6M and MPI-INF-3DHP dataset. IB: Image-based, VB: Video-based.
Category Year Method MPJPE↓ PMPJPE↓ (Human3.6M) PCK AUC (MPI-INF-3DHP)
2015 Li et al. (Li et al., 2015) 122.0 - - -
2016 Zhou et al. (Zhou et al., 2016) 107.3 - - -
One-stage
2017 Mehta et al. (Mehta et al., 2017) 74.1 - 57.3 28.0
2017 WTL (Zhou et al., 2017) 64.9 - 69.2 32.5
2017 Martinez et al. (Martinez et al., 2017) 62.9 47.7 - -
2017 Tekin et al. (Tekin et al., 2017) 69.7 - - -
2017 Jahangiri et al. (Jahangiri and Yuille, 2017) - 68.0 - -
2018 Drpose3d (Wang et al., 2018) 57.8 42.9 - -
2018 Yang et al. (Yang et al., 2018) 58.6 37.7 80.1 45.8
2019 Habibie et al. (Habibie et al., 2019) 49.2 - 82.9 45.4
2019 Chen et al. (Chen et al., 2019) - 68.0 71.1 36.3
2019 RepNet (Wandt and Rosenhahn, 2019) 80.9 65.1 82.5 58.5
2019 Hemlets pose (Zhou et al., 2019) - - 75.3 38.0
2019 Sharma et al. (Sharma et al., 2019) 58.0 40.9 - -
IB 2019 Li and Lee (Li and Lee, 2019) 52.7 42.6 67.9 -
2019 LCN (Ci et al., 2019) 52.7 42.2 74.0 36.7
Two-stage 2019 semantic-GCN (Zhao et al., 2019) - 57.6 - -
2020 Iqbal et al. (Iqbal et al., 2020) 67.4 54.5 79.5 -
2020 Pose2mesh (Choi et al., 2020) 64.9 48.0 - -
2020 Srnet (Zeng et al., 2020) 44.8 - 77.6 43.8
2020 Liu et al. (Liu et al., 2020) 52.4 41.2 - -
2021 Zou et al. (Zou and Tang, 2021) 49.4 39.1 86.1 53.7
2021 GraphSH (Xu and Takano, 2021) 51.9 - 80.1 45.8
2021 Lin et al. (Lin et al., 2021) 54.0 36.7 - -
2021 Yu et al. (Yu et al., 2021) 92.4 52.3 86.2 51.7
2022 Graformer (Zhao et al., 2022) 51.8 - - -
2022 PoseTriplet (Gong et al., 2022) 78.0 51.8 89.1 53.1
2023 HopFIR (Zhai et al., 2023) 48.5 - 87.2 57.0
2023 SSP-Net (Carbonera Luvizon et al., 2023) 51.6 - 83.2 44.3
2023 PHGANet (Shengping et al., 2023) 49.1 - 86.9 55.0
2023 RS-Net (Hassan and Ben Hamza, 2023) 47.0 38.6 85.6 53.2
2016 Tekin et al. (Tekin et al., 2016) 125.0 - - -
2017 Vnect (Mehta et al., 2017) 80.5 - 79.4 41.6
One-stage 2018 Dabral et al. (Dabral et al., 2018) 52.1 36.3 76.7 39.1
2022 IVT (Qiu et al., 2022) 40.2 28.5 - -
2023 CSS (Honari et al., 2023) 60.1 46.0 - -
2017 RPSM (Lin et al., 2017) 73.1 - - -
2018 Rayat et al. (Rayat Imtiaz Hossain and Little, 2018) 51.9 42.0 - -
2018 p-LSTMs (Lee et al., 2018) 55.8 46.2 - -
2018 Katircioglu et al. (Katircioglu et al., 2018) 67.3 - - -
2019 Cheng et al. (Cheng et al., 2019) 42.9 32.8 - -
2019 Cai et al. (Cai et al., 2019) 48.8 39.0 - -
2019 TCN (Pavllo et al., 2019) 46.8 36.5 - -
2019 Chirality Nets (Yeh et al., 2019) 46.7 - - -
2020 UGCN (Wang et al., 2020) 42.6 32.7 86.9 62.1
VB 2020 GAST-Net (Liu et al., 2020) 44.9 35.2 - -
2021 Chen et al. (Chen et al., 2021) 44.1 35.0 87.9 54.0
Two-stage
2021 PoseFormer (Zheng et al., 2021) 44.3 34.6 88.6 56.4
2022 Strided (Li et al., 2022a) 43.7 35.2 - -
2022 Mhformer (Li et al., 2022b) 43.0 - 93.8 63.3
2022 MixSTE (Zhang et al., 2022) 39.8 30.6 94.4 66.5
2022 UPS (Foo et al., 2023) 40.8 32.5 - -
2023 DSTFormer (Zhu et al., 2022) 37.5 - - -
2023 GLA-GCN (Yu et al., 2023) 44.4 34.8 98.5 79.1
2023 D3DP (Shan et al., 2023) 35.4 - 98.0 79.1
2023 DiffPose (Holmquist and Wandt, 2023) 43.3 32.0 84.9 -
2023 STCFormer (Tang et al., 2023) 40.5 31.8 98.7 83.9
2023 PoseFormerV2 (Zhao et al., 2023) 45.2 35.6 97.9 78.8
2023 MTF-Transformer (Shuai et al., 2023) 26.2 - - -
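The Human3.6M columns of Table 6 report MPJPE, the mean Euclidean distance between estimated and ground-truth 3D joints in millimetres (lower is better); PMPJPE is the same error after a rigid Procrustes alignment of the prediction to the ground truth. A minimal sketch of MPJPE:

```python
import math

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints (same unit as the input, e.g. mm)."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

gt = [(0.0, 0.0, 0.0)] * 17
pred = [(10.0, 0.0, 0.0)] * 17  # every joint off by 10 mm along x
print(mpjpe(pred, gt))  # 10.0
```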
dataset includes 44 videos, each lasting 2-3 seconds and totaling 1,286 frames. Each frame is hand-annotated with joint locations. This dataset is an extension of the VideoPose dataset (Weiss et al., 2010), but more challenging since about 30% of the lower arms are significantly foreshortened.

The CMU Panoptic dataset (Joo et al., 2017) was created by capturing subjects engaged in social interactions using a camera system with 480 views. Subjects were engaged in different games: Ultimatum (3 subjects), Prisoner's dilemma (8 subjects), Mafia (8 subjects), Haggling (3 subjects), and the 007-bang game (5 subjects); the number of subjects per game thus varies from three to eight. In total, this dataset consists of 65 videos and 1.5 million 3D poses estimated using Kinects.
Table 7 Performance comparison for 3D Image-based MPPE on MuPoTS-3D dataset.
Category Year Method PCKrel PCKabs (All people) PCKrel PCKabs PCKroot AUCrel (Matched people)
2019 LCR-Net (Rogez et al., 2019) 70.6 - 74.0 - - -
2019 Moon et al. (Moon et al., 2019) 81.8 31.5 82.5 31.8 31.0 40.9
Top-down 2020 HDNet (Lin and Lee, 2020) - - 83.7 35.2 - -
2020 HMOR (Wang et al., 2020) - - 82.0 43.8 - -
2022 Cha et al. (Cha et al., 2022) 89.9 - 91.7 - - -
2018 Mehta et al. (Mehta et al., 2018) 65.0 - 69.8 - - -
2020 Kundu et al. (Kundu et al., 2020) 74.0 28.1 75.8 - - -
2020 XNect (Mehta et al., 2020) 70.4 - 75.8 - - -
Bottom-up
2020 Smap (Zhen et al., 2020) 73.5 35.4 80.5 38.7 45.5 42.7
2022 Liu et al. (Liu et al., 2022) 79.4 36.5 86.5 39.3 - -
2023 AKE (Chen et al., 2023) 74.7 37.2 81.1 40.1 - -
2022 Wang et al. (Wang et al., 2022) 82.7 39.2 - - - -
One-stage 2022 DRM (Jin et al., 2022) 80.9 39.3 85.1 41.0 45.6 45.4
2023 WSP (Qiu et al., 2023) 82.4 - 83.2 - - -
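The PCKrel and PCKabs columns of Table 7 are 3D PCK variants: both count the fraction of joints lying within a distance threshold of the ground truth (150 mm is the common MuPoTS-3D convention), with PCKrel evaluated after translating both poses so the root joint sits at the origin, and PCKabs evaluated in absolute camera coordinates. A hedged sketch; the threshold value and root joint index are assumptions:

```python
import math

def pck3d(pred, gt, threshold=150.0, root_relative=True, root=0):
    """3D PCK: percentage of joints within `threshold` (mm) of ground truth.
    With root_relative=True (PCKrel-style) both poses are first centred on
    the root joint; otherwise absolute coordinates are used (PCKabs-style).
    pred, gt: lists of (x, y, z) in mm."""
    def centre(pose):
        rx, ry, rz = pose[root]
        return [(x - rx, y - ry, z - rz) for x, y, z in pose]
    p, g = (centre(pred), centre(gt)) if root_relative else (pred, gt)
    hits = sum(math.dist(a, b) <= threshold for a, b in zip(p, g))
    return 100.0 * hits / len(g)

gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(500.0, 0.0, 0.0), (600.0, 0.0, 0.0)]  # right shape, wrong location
print(pck3d(pred, gt))                       # 100.0 (root-relative)
print(pck3d(pred, gt, root_relative=False))  # 0.0 (absolute)
```

The example illustrates why the two numbers in Table 7 can diverge so strongly: a pose with the correct articulation but the wrong absolute depth scores perfectly under PCKrel and zero under PCKabs.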
Table 8 Datasets for Pose tracking. MOTA: Multiple Object Tracking Accuracy, PCP: Percentage of Correct Parts, KLE:
Keypoint Localization Error.
Dataset Year Citation #Joints Size 2D/3D Metrics
VideoPose2.0 (Sapp et al., 2011) 2011 198 - 44 videos 2D AP
Multi-Person PoseTrack (Iqbal et al., 2017) 2017 238 14 16 subjects, 60 videos 2D MOTA
PoseTrack17 (Andriluka et al., 2018) 2018 420 15 40 subjects, 550 videos 2D MOTA
PoseTrack18 (Andriluka et al., 2018) 2018 420 15 1138 videos 2D MOTA
ICDPose (Girdhar et al., 2018) 2018 250 14 60 videos 2D MOTA
Campus dataset (Berclaz et al., 2011) 2011 1253 - 3 subjects, 3 views, 6k frames 3D PCP
Outdoor Pose (Ramakrishna et al., 2013) 2013 61 14 4 subjects, 828 frames 3D PCP/KLE
CMU Panoptic (Joo et al., 2017) 2017 680 15 8 subjects, 480 views, 65 videos 3D MOTA
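MOTA, the tracking metric listed in Table 8, aggregates three error types over all frames. A minimal sketch (note that MOTA can go negative when the total errors exceed the number of ground-truth poses):

```python
def mota(misses, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 minus the total number of
    misses, false positives and identity switches, normalised by the
    number of ground-truth objects summed over all frames."""
    return 1.0 - (misses + false_positives + id_switches) / num_gt

# e.g. 10 misses, 5 false positives and 2 identity switches over 100 GT poses
print(round(mota(10, 5, 2, 100), 2))  # 0.83
```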
Table 9 Performance comparison for 2D single-person pose tracking on VideoPose2.0.
Method Category Year AP
Zhao et al. (Zhao et al., 2015) Post-processing 2015 85.0
Samanta et al. (Samanta and Chanda, 2016) Post-processing 2016 89.9
Zhao et al. (Zhao et al., 2015) Integrated 2015 80.0
Ma et al. (Ma et al., 2016) Integrated 2016 95.0

It is often used for evaluating multi-person 3D pose estimation and pose tracking methods.

The Campus dataset (Belagiannis et al., 2014) was collected by capturing interactions among three individuals in an outdoor environment using 3 cameras. It contains 6,000 frames across 3 views, with each view providing 2,000 frames. It is widely used for 3D multi-person pose estimation and tracking. Due to the small number of cameras and the wide-baseline views, it is challenging for pose tracking.

5.2.2 Performance comparison
Table 9 and Table 10 respectively show the comparisons of 2D pose tracking methods. For 2D single-person pose tracking, integrated methods jointly optimize pose estimation and pose tracking within a unified framework, leveraging the benefits of each to achieve better results. From Table 9, it can be observed that one of the integrated methods (Ma et al., 2016) exhibits state-of-the-art performance. For 2D multi-person pose tracking, most methods follow the top-down strategy, building on the well-estimated poses of single-person estimation techniques. Unsurprisingly, these methods outperform bottom-up ones by about 2-15% in MOTA on the PoseTrack2017 and PoseTrack2018 datasets. Regarding 3D multi-person pose tracking, there are currently fewer existing works. Among them, one-stage methods perform better than multi-stage methods, as shown in Table 11, and Voxeltrack (Zhang et al., 2022) achieves the best results. This is because one-stage methods jointly estimate and link 3D poses, which allows the errors of the sub-tasks found in multi-stage methods to be propagated back to the input video pixels.

5.3 Action recognition
This section reviews the datasets that are more commonly used for pose-based action recognition and compares different categories of methods.

5.3.1 Datasets
In Section 4, we have reviewed the pose-based action recognition methods, which can be divided into estimated pose-based and skeleton-based action recognition. The former applies RGB data and the latter directly uses skeleton data as the input. Table 12 summarizes the large-scale datasets that are prevalent in deep learning-based action recognition.

The NTU RGB+D dataset (Shahroudy et al., 2016) was constructed by Nanyang Technological University, Singapore. Four modalities were collected using the Microsoft Kinect v2 sensor, including RGB,
Table 10 Performance comparison for 2D multi-person pose tracking on PoseTrack2017 and PoseTrack2018.
Method Category Year 2017 Test MOTA 2017 Val MOTA 2018 Test MOTA 2018 Val MOTA
Detect-and-Track (Girdhar et al., 2018) Top-down 2018 51.8 55.2 - -
Pose Flow (Xiu et al., 2018) Top-down 2018 51.0 58.3 - -
Flow Track (Xiao et al., 2018) Top-down 2018 57.8 65.4 - -
Fastpose (Zhang et al., 2019) Top-down 2019 57.4 63.2 - -
LightTrack (Ning et al., 2020) Top-down 2020 58.0 - - 64.6
Umer et al. (Rafi et al., 2020) Top-down 2020 60.0 68.3 60.7 69.1
Clip Tracking (Wang et al., 2020) Top-down 2020 64.1 71.6 64.3 68.7
Yang et al. (Yang et al., 2021) Top-down 2021 - 73.4 - 69.2
AlphaPose (Fang et al., 2022) Top-down 2022 - 65.7 - 64.7
GatedTrack (Doering and Gall, 2023) Top-down 2023 - - - 64.5
Posetrack (Iqbal et al., 2017) Bottom-up 2017 48.4 - - -
Raaj et al. (Raaj et al., 2019) Bottom-up 2019 53.8 62.7 - 60.9
Jin et al. (Jin et al., 2019) Bottom-up 2019 - 71.8 - -
Table 11 Performance comparison for 3D multi-person pose tracking on CMU Panoptic and Campus dataset.
Method Category Year MOTA (CMU Panoptic) PCP (Campus)
Bridgeman et al. (Bridgeman et al., 2019) Multi-stage 2019 - 92.6
Tessetrack (Reddy et al., 2021) One-stage 2021 94.1 97.4
Voxeltrack (Zhang et al., 2022) One-stage 2022 98.5 96.7
Snipper (Zou et al., 2023) One-stage 2023 93.4 -
TEMPO (Choudhury et al., 2023) One-stage 2023 98.4 -

depth maps, skeletons and infrared frames. The dataset consists of 60 actions performed by 40 subjects. The actions can be divided into three groups: 40 daily actions, 9 health-related actions and 11 person-person interaction actions. The subjects range in age from 10 to 35 years, and each subject performs each action several times. In total, there are 56,880 samples captured from 80 distinct camera views. The large variation in subjects and views makes extensive cross-subject and cross-view evaluations of action recognition methods possible.

The NTU RGB+D 120 dataset (Liu et al., 2019) is an extension of the NTU RGB+D dataset (Shahroudy et al., 2016): an additional 60 action categories performed by another 66 subjects, comprising 57,600 samples, were added. This dataset also provides four modalities: RGB, depth maps, skeletons and infrared frames. The larger numbers of actions, subjects and samples make it more challenging than the NTU RGB+D dataset for action recognition.

The PKU-MMD dataset (Chunhui et al., 2017) is a large-scale multi-modality dataset for action detection and recognition tasks. Four modalities, including RGB, depth maps, skeletons and infrared frames, were captured by the Microsoft Kinect v2 sensor. This dataset consists of 1,076 videos composed of 51 actions performed by 66 subjects in 3 views. The action classes cover 41 daily actions and 10 person-person interaction actions. Each video contains more than twenty action samples; in total, the dataset includes 3,000 minutes and 5,400,000 frames. The large number of actions in a single untrimmed video challenges the robustness of action detection methods.

The Kinetics-Skeleton dataset (Kay et al., 2017) is an extra-large-scale action dataset built by collecting RGB videos from YouTube and generating skeletons with OpenPose. It has 400 actions, with 400-1150 clips per action, each from a unique YouTube video. Each clip lasts around 10 seconds, and the total number of video samples is 306,245. The action classes include person actions, person-person actions and person-object actions. Since the videos come from YouTube, they are not as professional as those recorded in experimental settings; the dataset therefore has considerable camera motion, illumination variations, shadows, background clutter and a large variety of subjects.

5.3.2 Performance comparison
In Table 14, we compare the results of different action recognition methods on two prominent datasets. Estimated pose-based methods apply RGB data as the
Table 12 A review of human action recognition datasets. C: Colour, D: Depth, S: Skeleton, I: Infrared frame; LOSubO: Leave One Subject Out, CS: Cross Subject, CV: Cross Validation; tr: training, va: validation, te: test.
Dataset Year Citation Modality Sensors #Actions #Subjects #Samples Protocol
HDM05 (Müller et al., 2007) 2007 503 C,D,S RRM 130 5 2317 10-fold CV
MSR-Action3D (Li et al., 2010) 2010 1736 D,S Kinect 20 10 557 CS(1/3 tr; 2/3 tr; half tr, half te)
MSRC-12 (Fothergill et al., 2012) 2012 494 S Kinect 12 30 6244 LOSubO
G3D (Bloom et al., 2012) 2012 262 C,D,S Kinect 20 10 659 CS(4 tr, 1 va, 5 te)
SBU Kinect (Yun et al., 2012) 2012 575 C,D,S Kinect 8 7 300 5-fold CV
UTKinect-Action3D (Xia et al., 2012) 2012 1716 C,D,S Kinect 10 10 200 LOSubO
Northwestern-UCLA (Wang et al., 2014) 2014 497 C,D,S Kinect 10 10 1494 LOSubO; cross view(2 tr, 1 te)
UTD-MHAD (Chen et al., 2015) 2015 706 C,D,S,I Kinect 27 8 861 CS(odd tr, even te)
SYSU (Hu et al., 2015) 2015 594 C,D,S Kinect 12 40 480 CS(half tr, half te)
NTU-RGB+D (Shahroudy et al., 2016) 2016 2452 C,D,S,I Kinect 60 40 56880 CS(half tr, half te); cross view(half tr, half te)
PKU-MMD (Chunhui et al., 2017) 2017 195 C,D,S,I Kinect 51 66 1076 CS(57 tr, 9 te); cross view(2 tr, 1 te)
Kinetics (Kay et al., 2017) 2017 3402 C,S YouTube 400 - 306245 CV(250-1000 tr, 50 va, 100 te per action)
NTU RGB+D 120 (Liu et al., 2019) 2019 907 C,D,S,I Kinect 120 106 114480 CS(half tr, half te); cross view(half tr, half te)
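The LOSubO protocol listed in Table 12 can be sketched as follows (illustrative only; how each dataset stores its subject IDs is assumed):

```python
def leave_one_subject_out(samples):
    """Generate Leave-One-Subject-Out folds: each subject in turn forms
    the test set while all remaining subjects form the training set.
    samples: list of (subject_id, clip) pairs."""
    for held_out in sorted({subject for subject, _ in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

samples = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
folds = list(leave_one_subject_out(samples))
print(len(folds))  # 3 folds, one per subject
```

The cross-subject (CS) protocols in the table differ only in fixing a single train/test partition of the subject IDs instead of iterating over all of them.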
Table 13 Performance of estimated pose-based action recognition methods on three datasets, showing the benefits of pose estimation or tracking for recognition. GT: ground-truth.
Dataset Method Highlights Accuracy
JHMDB PoTion (Choutas et al., 2018) estimated poses 58.5±1.5
JHMDB PoTion (Choutas et al., 2018) GT poses 62.1±1.1
JHMDB PoTion (Choutas et al., 2018) GT poses + crop 67.9±2.4
AVA LART (Rajasegaran et al., 2023) -poses-tracking 40.2
AVA LART (Rajasegaran et al., 2023) -poses 41.4
AVA LART (Rajasegaran et al., 2023) full model 42.3
NTU60 UPS (Foo et al., 2023) separate training 89.6
NTU60 UPS (Foo et al., 2023) joint training 92.6

input, and their best performance (Duan et al., 2022; Foo et al., 2023) is lower than that of the method (Wang et al., 2023) using skeletons as the input on the two datasets (especially the larger one). This is reasonable because some factors (e.g., illumination, background) can affect the performance when using RGB. In particular, methods based on the one-stage strategy jointly address pose estimation and action recognition, thus reducing the errors of intermediate steps and generally achieving better results than methods based on the two-stage strategy. Moreover, Table 13 illustrates the effects of pose estimation (PE) and tracking on action recognition (AR). It can be easily seen that pose estimation and tracking results can improve the performance of action recognition, which further underscores the relationship among these three tasks.

For the skeleton-based methods, recent methods mainly apply GCNs and Transformers, consistently outperforming CNN- and RNN-based methods. This improvement demonstrates the benefit of the local and global feature learning of GCNs and Transformers for action recognition. Specifically, dynamic GCN-based methods generally perform better than static GCN-based ones due to stronger generalization capabilities. Hybrid Transformer-based methods outperform pure Transformer-based ones on large datasets, since integrating the Transformer with a GCN or CNN can better learn both local and global features. In particular, the method (Wang et al., 2023) applying a Transformer encoder on a hypergraph achieved the best performance on the two datasets, which suggests representing actions with hypergraphs for classification. It is also worth noting that the method (Xu et al., 2023) based on the guidance of natural language achieves quite good performance on both datasets, which implies the advantage of incorporating linguistic context for action recognition.

6 Challenges and Future Directions
This paper has reviewed recent deep learning-based approaches for pose estimation, tracking and action recognition. It also includes a discussion of commonly used datasets and a comparative analysis of various methods. Despite the remarkable successes in these domains, there are still some challenges and corresponding research directions to promote advances for the three tasks.

6.1 Pose estimation
There are five main challenges for the pose estimation task, as follows.

(1) Occlusion
Although the current methods have achieved outstanding performance on public datasets, they still suffer from the occlusion problem. Occlusion results in unreliable human detection and degraded pose estimation performance. Person detectors in top-down approaches may fail to identify the boundaries of overlapped human bodies, and body-part association may fail in bottom-up approaches for occluded scenes. Mutual occlusion in crowded scenarios causes a large performance decline for current 3D HPE methods.
Table 14 Performance comparison of action recognition methods on NTU RGB+D and NTU RGB+D 120 datasets.
Method Category Sub-category Year C-Sub C-Set (NTU RGB+D 60) C-Sub C-Set (NTU RGB+D 120)
Zolfaghari et al. (Zolfaghari et al., 2017) Estimated Pose-based two-stage strategy 2017 80.8 - - -
Liu et al. (Liu and Yuan, 2018) Estimated Pose-based two-stage strategy 2018 91.7 95.3 - -
IntegralAction (Moon et al., 2021) Estimated Pose-based two-stage strategy 2021 91.7 - - -
PoseConv3D (Duan et al., 2022) Estimated Pose-based two-stage strategy 2021 94.1 97.1 86.9 90.3
Luvizonet al. (Luvizon et al., 2018) Estimated Pose-based one-stage strategy 2018 85.5 - - -
UPS (Foo et al., 2023) Estimated Pose-based one-stage strategy 2023 92.6 97.0 89.3 91.1
2 Layere P-LSTM (Shahroudy et al., 2016) RNN-based spatial division of human body 2016 62.9 70.3 - -
Trust Gate ST-LSTM (Liu et al., 2016) RNN-based spatial and/or temporal networks 2016 69.2 77.7 - -
Two-stream RNN (Wang and Wang, 2017) RNN-based spatial and/or temporal networks 2017 71.3 79.5 - -
Zhang et al. (Zhang et al., 2017) RNN-based spatial and/or temporal networks 2017 70.3 82.4 - -
SR-TSL (Si et al., 2018) RNN-based spatial and/or temporal networks 2018 84.8 92.4 - -
GCA-LSTM (Liu et al., 2017) RNN-based attention mechanism 2017 74.4 82.8 58.3 59.2
STA-LSTM (Song et al., 2018) RNN-based attention mechanism 2018 73.4 81.2 - -
EleAtt-GRU (Zhang et al., 2019) RNN-based attention mechanism 2019 80.7 88.4 - -
2s AGC-LSTM (Si et al., 2019) RNN-based attention mechanism 2019 89.2 95.0 - -
JTM (Wang et al., 2016) CNN-based 2D CNN 2017 73.4 75.2 - -
JDM (Li et al., 2017) CNN-based 2D CNN 2017 76.2 82.3 - -
Liu et al. (Liu et al., 2017) CNN-based 2D CNN 2017 80.0 87.2 60.3 63.2
SkeletonNet (Ke et al., 2017) CNN-based 2D CNN 2017 75.9 81.2 - -
Ke et al. (Ke et al., 2017) CNN-based 2D CNN 2017 79.6 86.8 - -
Li et al. (Li et al., 2017) CNN-based 2D CNN 2017 85.0 92.3 - -
Ding et al. (Ding et al., 2017) CNN-based 2D CNN 2017 - 82.3 - -
Li et al. (Li et al., 2019) CNN-based 2D CNN 2017 82.8 90.1 - -
TSRJI (Caetano et al., 2019) CNN-based 2D CNN 2019 73.3 80.3 65.5 59.7
SkeletonMotion (Caetano et al., 2019) CNN-based 2D CNN 2019 76.5 84.7 67.7 66.9
3SCNN (Liang et al., 2019) CNN-based 2D CNN 2019 88.6 93.7 - -
DM-3DCNN (Hernandez Ruiz et al., 2017) CNN-based 3D CNN 2017 82.0 89.5 - -
ST-GCN (Yan et al., 2018) GCN-based static method 2018 81.5 88.3 - -
STIGCN (Huang et al., 2020) GCN-based static method 2020 90.1 96.1 - -
MS-G3D (Liu et al., 2020) GCN-based static method 2020 91.5 96.2 86.9 88.4
CA-GCN (Zhang et al., 2020) GCN-based static method 2020 83.5 91.4 - -
AS-GCN (Li et al., 2019) GCN-based dynamic method 2018 86.8 94.2 - -
2s-AGCN (Shi et al., 2019) GCN-based dynamic method 2020 88.5 95.1 - -
SGN (Zhang et al., 2020) GCN-based dynamic method 2020 89.0 94.5 79.2 81.5
4s Shift-GCN (Cheng et al., 2020) GCN-based dynamic method 2020 90.7 96.5 85.9 87.6
DC-GCN+ADC (Cheng et al., 2020) GCN-based dynamic method 2020 90.8 96.6 86.5 88.1
DDGCN (Korban and Li, 2020) GCN-based dynamic method 2020 91.1 97.1 - -
Dynamic GCN (Ye et al., 2020) GCN-based dynamic method 2020 91.5 96.0 87.3 88.6
CTR-GCN (Chen et al., 2021) GCN-based dynamic method 2021 92.4 96.8 88.9 90.6
InfoGCN (Chi et al., 2022) GCN-based dynamic method 2021 93.0 97.1 89.8 91.2
DG-STGCN (Duan et al., 2022) GCN-based dynamic method 2022 93.2 97.5 89.6 91.3
TCA-GCN (Wang et al., 2022) GCN-based dynamic method 2022 92.8 97.0 89.4 90.8
ML-STGNet (Zhu et al., 2023) GCN-based dynamic method 2023 91.9 96.2 88.6 90.0
MV-IGNet (Wang et al., 2023) GCN-based dynamic method 2023 89.2 96.3 83.9 85.6
S-GDC (Li et al., 2023) GCN-based dynamic method 2023 88.6 94.9 85.2 86.1
Motif-GCN+TBs (Wen et al., 2023) GCN-based dynamic method 2023 90.5 96.1 87.1 87.7
3s-ActCLR (Lin et al., 2023) GCN-based dynamic method 2023 84.3 88.8 74.3 75.7
GSTLN (Dai et al., 2023) GCN-based dynamic method 2023 91.9 96.6 88.1 89.3
4s STF-Net (Wu et al., 2023) GCN-based dynamic method 2023 91.1 96.5 86.5 88.2
LA-GCN (Xu et al., 2023) GCN-based dynamic method 2023 93.5 97.2 90.7 91.8
DSTA-Net (Shi et al., 2020) Transformer-based pure Transformer 2020 91.5 96.4 86.6 89.0
STAR (Shi et al., 2021) Transformer-based pure Transformer 2021 83.4 89.0 78.3 80.2
STST (Zhang et al., 2021) Transformer-based pure Transformer 2021 91.9 96.8 - -
IIP-Former (Wang et al., 2021) Transformer-based pure Transformer 2022 92.3 96.4 88.4 89.7
RSA-Net (Gedamu et al., 2023) Transformer-based pure Transformer 2023 91.8 96.8 88.4 89.7
ST-TR (Plizzari et al., 2021) Transformer-based hybrid Transformer 2021 89.9 96.1 81.9 84.1
Zoom Transformer (Zhang et al., 2022) Transformer-based hybrid Transformer 2022 90.1 95.3 84.8 86.5
KA-AGTN (Liu et al., 2022) Transformer-based hybrid Transformer 2022 90.4 96.1 86.1 88.0
STTFormer (Qiu et al., 2022) Transformer-based hybrid Transformer 2022 92.3 96.5 88.3 89.2
FG-STFormer (Gao et al., 2022) Transformer-based hybrid Transformer 2022 92.6 96.7 89.0 90.6
GSTN (Jiang et al., 2022) Transformer-based hybrid Transformer 2022 91.3 96.6 86.4 88.7
IGFormer (Pang et al., 2022) Transformer-based hybrid Transformer 2022 93.6 96.5 85.4 86.5
3Mformer (Wang et al., 2023) Transformer-based hybrid Transformer 2023 94.8 98.7 92.0 93.8
SkeleTR (Duan et al., 2023) Transformer-based hybrid Transformer 2023 94.8 97.7 87.8 88.3
GL-Transformer (Kim et al., 2022) Transformer-based unsupervised Transformer 2022 76.3 83.8 66.0 68.7
HiCo-LSTM (Dong et al., 2023) Transformer-based unsupervised Transformer 2023 81.4 88.8 73.7 74.5
HaLP+CMD (Shah et al., 2023) Transformer-based self-supervised Transformer 2023 82.1 88.6 72.6 73.1
SkeAttnCLR (Hua et al., 2023) Transformer-based self-supervised Transformer 2023 82.0 86.5 77.1 80.0
SkeletonMAE (Wu et al., 2023) Transformer-based self-supervised Transformer 2023 86.6 92.9 76.8 79.1
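The C-Sub and C-Set numbers in Table 14 are top-1 classification accuracies under the cross-subject and cross-setup splits. A minimal sketch of the metric itself:

```python
def top1_accuracy(predicted_labels, true_labels):
    """Top-1 accuracy (%): fraction of clips whose highest-scoring class
    matches the ground-truth action label."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)

print(top1_accuracy([3, 1, 2, 0], [3, 1, 0, 0]))  # 75.0
```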
To overcome this problem, some methods (Dong et al., 2019; Tu et al., 2020; Zhang et al., 2021) have been proposed based on multi-view learning, because a part occluded in one view may become visible in other views. However, these methods often need large memory and incur expensive computation costs, especially for 3D MPPE under multiple views. Moreover, some methods based on multi-modal learning have also demonstrated robustness to occlusion, as they can extract rich features from different sensing modalities such as depth (Shah et al., 2019) and wearable inertial measurement units (Zhang et al., 2020). Pose estimation from different modalities, however, faces another problem: few datasets with multiple modalities are available. With the development of vision-language models, text can provide semantics for pose estimation and can also be easily generated by GPT, making it a promising additional modality; based on pose semantics, occluded parts can be inferred. Regarding semantics, human-scene relationships can also provide semantic cues, such as that a person cannot

be simultaneously present in the locations of other objects in the scene.
(2) Low resolution
In real-world applications, low-resolution images or videos are often captured due to wide-view cameras, long-distance capturing devices and so on. Persons may also be obscured by environmental shadows. Current methods are usually trained on high-resolution input, which may cause low accuracy when they are applied to low-resolution input. One solution for estimating poses from low-resolution input is to recover image resolution by applying super-resolution methods as image pre-processing. However, the optimization of super-resolution does not contribute to high-level human pose analysis. Wang et al. (Wang et al., 2022a) observed that low resolution exaggerates the degree of quantization error, so offset modeling may be helpful for pose estimation with low-resolution input.
(3) Computation complexity
As reviewed in Section 2, many methods have been proposed for reducing computation complexity. For example, one-stage methods for image-based MPPE are proposed to save the time consumption caused by intermediate steps, and sample-frames-based methods for video-based pose estimation are proposed to reduce the complexity of processing each frame. However, one-stage methods may sacrifice accuracy when improving efficiency (e.g., the recent ED-Pose network (Yang et al., 2023) takes the shortest time but sacrifices about 4% AP on the COCO val2017 dataset). Therefore, more effort on one-stage methods for MPPE is required to achieve computationally efficient pose estimation while maintaining high accuracy. Sample-frames-based methods (Zeng et al., 2022) estimate poses in three steps, which still results in more time consumption. Hence, an end-to-end network that incorporates sample-frames-based estimation is preferred for video-based pose estimation.
Transformer-based architectures for video-based 3D pose estimation inevitably incur high computational costs, because they typically regard each video frame as a pose token and take extremely long frame sequences as input to achieve advanced performance. For instance, Strided (Li et al., 2022a) and MHFormer (Li et al., 2022b) require 351 frames, and MixSTE (Li et al., 2022b) and DSTformer (Zhu et al., 2022) require 243 frames. Self-attention complexity increases quadratically with the number of tokens. Although directly reducing the frame number can reduce the cost, it may result in lower performance due to a small temporal receptive field. Therefore, it is preferable to design an efficient architecture that maintains a large temporal receptive field for accurate estimation. Considering that similar tokens may exist in deep transformer blocks (Wang et al., 2022b), one potential solution is to prune pose tokens to improve efficiency.
(4) Limited data for uncommon poses
The current public datasets have limited training data for uncommon poses (e.g., falling), which results in model bias and low accuracy on such poses. Data augmentation (Jiang et al., 2022; Zhang et al., 2023) for uncommon poses is a common way of generating new samples with more diversity. Optimization-based methods (Jiang et al., 2023) can mitigate the impact of domain gaps by estimating poses case by case rather than by learning. Therefore, deep-learning-based methods combined with optimization techniques might be helpful for uncommon pose estimation. Moreover, open-vocabulary learning can also be applied to estimating uncommon poses through the semantic relationships between these poses and other, common poses.
(5) High uncertainty of 3D poses
Predicting 3D poses from 2D poses requires handling uncertainty and indeterminacy due to depth ambiguity and potential occlusion. However, most existing methods (Shan et al., 2023) are deterministic: they aim to construct a single, definite 3D pose from images. Therefore, how to handle the uncertainty and indeterminacy of poses remains an open question. Inspired by the strong capability of diffusion models to generate samples with high uncertainty, applying diffusion models is a promising direction for pose estimation. A few methods (Gong et al., 2023; Holmquist and Wandt, 2023; Feng et al., 2023) have recently been proposed that formulate 3D pose estimation as a reverse diffusion process.

6.2 Pose tracking
Most pose tracking methods follow a pose-estimation-and-linking strategy, so pose tracking performance highly depends on the results of pose estimation. Therefore, some challenges of pose estimation, such as occlusion, also exist in pose tracking. Multi-view feature fusion (Zhang et al., 2022) is one way of eliminating appearances made unreliable by occlusion, thereby improving the results of pose linking. Linking every detection box rather than only high-score detection boxes (Zhang et al., 2022) is another way to recover true poses that occlusion would otherwise cause to be missed. In the following, we present some further challenges for pose tracking.
(1) Multi-person pose tracking under multiple cameras
The main challenge is how to fuse the scenes of different views. Although VoxelTrack (Zhang et al., 2022) tends to fuse multi-view features, this direction deserves more research. If scenes from non-overlapping cameras are fused and projected into a virtual world, poses can be tracked continuously over a long area.
(2) Similar appearance and diverse motion
To link poses across frames, the general solution is to measure the similarity between every pair of poses in neighboring frames based on appearance and

motion. Persons sometimes have uniform appearance and diverse motions at the same time, such as group dancers and sports players. They are highly similar and almost indistinguishable in appearance because of uniform clothes, and they exhibit complicated motion and interaction patterns. In this case, measuring the similarity is challenging. However, such poses with similar appearance can be easily distinguished by textual semantics. One possible solution is to incorporate multi-modality pre-training models, such as Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), for measuring similarity based on their semantic representations.
(3) Fast camera motion
Existing methods mainly address pose tracking by assuming slow camera motion. However, fast camera motion with ego-camera capturing is very common in real-world applications. How to address egocentric pose tracking with fast camera motion is a challenging problem. Khirodkar et al. (Khirodkar et al., 2023) proposed a new benchmark (EgoHumans) for egocentric pose estimation and tracking, and designed a multi-stream transformer to track multiple persons. Experiments have shown that there is still a gap between the performance of static and dynamic capture systems due to camera synchronization and calibration. More effort can be made to bridge this gap.

6.3 Action recognition
With the rapid advancement of deep learning techniques, promising results have been achieved on large-scale action datasets. There are still some open questions, as follows.
(1) Computation complexity
According to the performance comparison (Table 14) of different methods, integrating transformers with GCNs achieves the best accuracy. However, as mentioned before, the computation and memory required by a transformer increase quadratically with the number of tokens (Ulhaq et al., 2022). Therefore, how to select significant tokens from video frames or skeletons is an open question for efficient transformer-based action recognition. As in transformer-based pose estimation, pruning tokens or discarding input patches (Qing et al., 2023) tends to reduce the cost. Moreover, integrating lightweight GCNs (Kang et al., 2023) can further benefit efficiency.
(2) Zero-shot learning on skeletons
Annotating and labeling large amounts of data is expensive, so zero-shot learning is desirable in real-world applications. Existing zero-shot action recognition methods mainly take RGB data as the input. However, skeleton data has become a promising alternative to RGB data due to its robustness to variations in appearance and background. Therefore, zero-shot skeleton-based action recognition is more desirable. A few methods (Gupta et al., 2021; Zhou et al., 2023) were proposed to learn a mapping between skeletons and word embeddings of class labels. Class labels may carry less semantics than textual descriptions, which are natural-language accounts of how an action is performed. In the future, new methods can be pursued based on textual descriptions for zero-shot skeleton-based action recognition.
(3) Multi-modality fusion
Estimated pose-based methods take RGB data as the input and recognize actions based on RGB and estimated skeletons. Moreover, text data, another modality for action recognition, can guide improvements on visually similar actions and in zero-shot learning. Due to the heterogeneity of different modalities, how to fully utilize them deserves further exploration. Although some methods (Duan et al., 2022) tend to propose a particular model for fusing different modalities, such a model lacks generalization. In the future, a universal fusion method that works regardless of the model is a better option.

6.4 Unified models
As reviewed in Section 4.1, some methods conduct action recognition based on the results of pose estimation or tracking. Table 13 further demonstrates that pose estimation and tracking can improve action recognition performance. These observations emphasize that the three tasks are closely related, which provides a direction for designing unified models that solve all three. Recently, a unified model (UPS (Foo et al., 2023)) has been proposed for 3D video-based pose estimation and estimated pose-based action recognition; however, its performance is largely lower than that of separate models. Hence, more unified models are preferable for jointly solving these three tasks.

7 Conclusion
This survey has presented a systematic overview of recent works on human pose-based estimation, tracking and action recognition with deep learning. We have reviewed pose estimation approaches from 2D to 3D, from single-person to multi-person, and from images to videos. After estimating poses, we summarized the methods of linking poses across frames for tracking. Pose-based action recognition approaches have also been reviewed, taken as the application of pose estimation and tracking. For each task, we have reviewed different categories of methods and discussed their advantages and disadvantages. Meanwhile, end-to-end methods were highlighted for jointly conducting pose estimation, tracking and action recognition in the category of estimated pose-based action recognition. Commonly used datasets have been reviewed, and performance comparisons of different methods have been covered to further demonstrate the benefits of some methods.

Based on the strengths and weaknesses of the existing works, we point out a few promising future directions. For pose estimation, more effort can be made on pose estimation with occlusion, low resolution, limited data for uncommon poses, and balancing performance with computation complexity. Multi-person pose tracking can be further addressed under multiple cameras, similar appearance, diverse motions and fast camera motion. Zero-shot learning on skeletons and multi-modality fusion can also be further explored for action recognition.

Acknowledgements. This work is supported by the National Natural Science Foundation of China (Grant No. 62006211, 61502491) and China Postdoctoral Science Foundation (Grant No. 2019TQ0286, 2020M682349).

References
Gavrila, D.M.: The visual analysis of human movement: A survey. CVIU 73(1), 82–98 (1999)
Aggarwal, J.K., Cai, Q.: Human motion analysis: A review. CVIU 73(3), 428–440 (1999)
Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. CVIU 81(3), 231–268 (2001)
Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. PR 36(3), 585–601 (2003)
Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. CVIU 104(2-3), 90–126 (2006)
Poppe, R.: Vision-based human motion analysis: An overview. CVIU 108(1-2), 4–18 (2007)
Sminchisescu, C.: 3d human motion analysis in monocular video: techniques and challenges. Human Motion: Understanding, Modelling, Capture, and Animation, 185–211 (2008)
Ji, X., Liu, H.: Advances in view-invariant human motion analysis: a review. IEEE Transactions on Systems, Man, and Cybernetics 40(1), 13–24 (2009)
Moeslund, T.B., Hilton, A., Krüger, V., Sigal, L.: Visual Analysis of Humans. Springer (2011)
Liu, Z., Zhu, J., Bu, J., Chen, C.: A survey of human pose estimation: The body parts parsing based methods. JVCIR 32, 10–19 (2015)
Sarafianos, N., Boteanu, B., Ionescu, B., Kakadiaris, I.A.: 3d human pose estimation: A review of the literature and analysis of covariates. CVIU 152, 1–20 (2016)
Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM CSUR 38(4), 13 (2006)
Watada, J., Musa, Z., Jain, L.C., Fulcher, J.: Human tracking: A state-of-art survey. In: International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 454–463 (2010)
Salti, S., Cavallaro, A., Di Stefano, L.: Adaptive appearance modeling for video tracking: Survey and evaluation. TIP 21(10), 4334–4348 (2012)
Smeulders, A.W., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: An experimental survey. TPAMI 36(7), 1442–1468 (2013)
Wu, Y., Lim, J., Yang, M.-H.: Object tracking benchmark. TPAMI 37(9), 1834–1848 (2015)
Cedras, C., Shah, M.: Motion-based recognition: a survey. IVT 13(2), 129–155 (1995)
Turaga, P., Chellappa, R., Subrahmanian, V.S., Udrea, O.: Machine recognition of human activities: A survey. TCSVT 18(11), 1473 (2008)
Poppe, R.: A survey on vision-based human action recognition. IVT 28(6), 976–990 (2010)
Guo, G., Lai, A.: A survey on still image based human action recognition. PR 47(10), 3343–3361 (2014)
Zhu, F., Shao, L., Xie, J., Fang, Y.: From handcrafted to learned representations for human action recognition: a survey. IVT 55, 42–52 (2016)
Wang, P., Li, W., Ogunbona, P., Wan, J., Escalera, S.: Rgb-d-based human motion recognition with deep learning: A survey. CVIU 171, 118–139 (2018)
Chen, Y., Tian, Y., He, M.: Monocular human pose estimation: A survey of deep learning-based methods. CVIU 192, 102897 (2020)
Liu, W., Bao, Q., Sun, Y., Mei, T.: Recent advances of monocular 2d and 3d human pose estimation: A deep learning perspective. ACM Computing Surveys 55(4), 1–41 (2022)
Sun, Z., Ke, Q., Rahmani, H., Bennamoun, M., Wang, G., Liu, J.: Human action recognition from various data modalities: A review. TPAMI (2022)
Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., Shah, M.: Deep learning-based human pose estimation: A survey. ACM Computing Surveys 56(1), 1–37 (2023)
Xin, W., Liu, R., Liu, Y., Chen, Y., Yu, W., Miao, Q.: Transformer for skeleton-based action recognition: A review of recent advances. Neurocomputing (2023)
Rajasegaran, J., Pavlakos, G., Kanazawa, A., Feichtenhofer, C., Malik, J.: On the benefits of 3d pose and tracking for human action recognition. In: CVPR, pp. 640–649 (2023)
Choudhury, R., Kitani, K., Jeni, L.A.: TEMPO: Efficient multi-view pose estimation, tracking, and forecasting. In: ICCV, pp. 14750–14760 (2023)
Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: CVPR, pp. 1653–1660 (2014)
Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: CVPR, pp. 4733–4742 (2016)
Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: ICCV, pp. 2602–2611 (2017)
Luvizon, D.C., Tabia, H., Picard, D.: Human pose regression by combining indirect part detection and contextual information. Computers & Graphics 85,

15–22 (2019)
Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z.: Tfpose: Direct human pose estimation with transformers. arXiv preprint arXiv:2103.15320 (2021)
Li, K., Wang, S., Zhang, X., Xu, Y., Xu, W., Tu, Z.: Pose recognition with cascade transformers. In: CVPR, pp. 1944–1953 (2021)
Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z., Hengel, A.v.: Poseur: Direct human pose regression with transformers. In: ECCV, pp. 72–88 (2022)
Panteleris, P., Argyros, A.: Pe-former: Pose estimation transformer. In: ICPRAI, pp. 3–14 (2022)
Jain, A., Tompson, J., Andriluka, M., Taylor, G.W., Bregler, C.: Learning human pose estimation features with convolutional networks. In: ICLR (2014)
Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS, pp. 1799–1807 (2014)
Chen, X., Yuille, A.L.: Articulated pose estimation by a graphical model with image dependent pairwise relations. In: NIPS, pp. 1736–1744 (2014)
Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: CVPR, pp. 648–656 (2015)
Wei, S.-E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: CVPR, pp. 4724–4732 (2016)
Hu, P., Ramanan, D.: Bottom-up and top-down reasoning with hierarchical rectified gaussians. In: CVPR, pp. 5600–5609 (2016)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV, pp. 483–499 (2016)
Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. In: ECCV, pp. 717–732 (2016)
Lifshitz, I., Fetaya, E., Ullman, S.: Human pose estimation using deep consensus voting. In: ECCV, pp. 246–260 (2016)
Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-context attention for human pose estimation. In: CVPR, pp. 1831–1840 (2017)
Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. In: ICCV, pp. 1281–1290 (2017)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)
Chen, Y., Shen, C., Wei, X.-S., Liu, L., Yang, J.: Adversarial posenet: A structure-aware convolutional network for human pose estimation. In: ICCV, pp. 1212–1221 (2017)
Ning, G., Zhang, Z., He, Z.: Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Transactions on Multimedia 20(5), 1246–1259 (2017)
Sun, K., Lan, C., Xing, J., Zeng, W., Liu, D., Wang, J.: Human pose estimation using global and local normalization. In: ICCV, pp. 5599–5607 (2017)
Marras, I., Palasek, P., Patras, I.: Deep globally constrained mrfs for human pose estimation. In: ICCV, pp. 3466–3475 (2017)
Liu, B., Ferrari, V.: Active learning for human pose estimation. In: ICCV, pp. 4363–4372 (2017)
Ke, L., Chang, M.-C., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. In: ECCV, pp. 713–728 (2018)
Peng, X., Tang, Z., Yang, F., Feris, R.S., Metaxas, D.: Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In: CVPR, pp. 2226–2234 (2018)
Tang, W., Yu, P., Wu, Y.: Deeply learned compositional models for human pose estimation. In: ECCV, pp. 190–206 (2018)
Nie, X., Feng, J., Zuo, Y., Yan, S.: Human pose estimation with parsing induced learner. In: CVPR, pp. 2100–2108 (2018)
Nie, X., Feng, J., Yan, S.: Mutual learning to adapt for joint human parsing and pose estimation. In: ECCV, pp. 502–517 (2018)
Tang, W., Wu, Y.: Does learning specific features for related parts help human pose estimation? In: CVPR, pp. 1107–1116 (2019)
Zhang, F., Zhu, X., Ye, M.: Fast human pose estimation. In: CVPR, pp. 3517–3526 (2019)
Li, Y., Yang, S., Liu, P., Zhang, S., Wang, Y., Wang, Z., Yang, W., Xia, S.-T.: SimCC: A simple coordinate classification perspective for human pose estimation. In: ECCV, pp. 89–106 (2022)
Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: ICCV, pp. 11025–11034 (2021)
Ye, S., Zhang, Y., Hu, J., Cao, L., Zhang, S., Shen, L., Wang, J., Ding, S., Ji, R.: Distilpose: Tokenized pose regression with heatmap distillation. In: CVPR, pp. 2163–2172 (2023)
Yang, J., Zeng, A., Liu, S., Li, F., Zhang, R., Zhang, L.: Explicit box detection unifies end-to-end multi-person pose estimation. In: ICLR (2023)
Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., Kehtarnavaz, N., Shah, M.: Deep learning-based human pose estimation: A survey. ACM Computing Surveys (2020)
Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. In: CVPR, pp. 4903–4911 (2017)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV, pp. 2961–2969 (2017)
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: ECCV, pp. 466–481 (2018)
Moon, G., Chang, J.Y., Lee, K.M.: Posefix: Model-agnostic general human pose refinement network.

In: CVPR, pp. 7773–7781 (2019)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR, pp. 5693–5703 (2019)
Cai, Y., Wang, Z., Luo, Z., Yin, B., Du, A., Wang, H., Zhang, X., Zhou, X., Zhou, E., Sun, J.: Learning delicate local representations for multi-person pose estimation. In: ECCV, pp. 455–472 (2020)
Huang, J., Zhu, Z., Guo, F., Huang, G.: The devil is in the details: Delving into unbiased data processing for human pose estimation. In: CVPR, pp. 5700–5709 (2020)
Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: CVPR, pp. 7093–7102 (2020)
Wang, J., Long, X., Gao, Y., Ding, E., Wen, S.: Graph-pcnn: Two stage human pose estimation with graph pose refinement. In: ECCV, pp. 492–508 (2020)
Xu, X., Zou, Q., Lin, X.: Adaptive hypergraph neural network for multi-person pose estimation. In: AAAI, pp. 2955–2963 (2022)
Jiang, C., Huang, K., Zhang, S., Wang, X., Xiao, J., Goulermas, Y.: Aggregated pyramid gating network for human pose estimation without pre-training. PR 138, 109429 (2023)
Gu, K., Yang, L., Mi, M.B., Yao, A.: Bias-compensated integral regression for human pose estimation. TPAMI 45(9), 10687–10702 (2023)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015)
Iqbal, U., Gall, J.: Multi-person pose estimation with local joint-to-person associations. In: ECCV, pp. 627–642 (2016)
Fang, H.-S., Xie, S., Tai, Y.-W., Lu, C.: Rmpe: Regional multi-person pose estimation. In: ICCV, pp. 2334–2343 (2017)
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: CVPR, pp. 7103–7112 (2018)
Su, K., Yu, D., Xu, Z., Geng, X., Wang, C.: Multi-person pose estimation with enhanced channel-wise and spatial information. In: CVPR, pp. 5674–5682 (2019)
Qiu, L., Zhang, X., Li, Y., Li, G., Wu, X., Xiong, Z., Han, X., Cui, S.: Peeking into occluded joints: A novel framework for crowd pose estimation. In: ECCV, pp. 488–504 (2020)
Yang, S., Quan, Z., Nie, M., Yang, W.: Transpose: Keypoint localization via transformer. In: ICCV, pp. 11802–11812 (2021)
Zhou, M., Stoffl, L., Mathis, M., Mathis, A.: Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity. In: ICCV (2023)
Li, Y., Zhang, S., Wang, Z., Yang, S., Yang, W., Xia, S.-T., Zhou, E.: Tokenpose: Learning keypoint tokens for human pose estimation. In: ICCV, pp. 11313–11322 (2021)
Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., Wang, J.: Hrformer: High-resolution transformer for dense prediction. In: NIPS (2021)
Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines for human pose estimation. In: NIPS, vol. 35, pp. 38571–38584 (2022)
Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. In: CVPR, pp. 4929–4937 (2016)
Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In: ECCV, pp. 34–50 (2016)
Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR, pp. 7291–7299 (2017)
Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. In: CVPR, pp. 7291–7299 (2017)
Kreiss, S., Bertoni, L., Alahi, A.: Pifpaf: Composite fields for human pose estimation. In: CVPR, pp. 11977–11986 (2019)
Cheng, Y., Ai, Y., Wang, B., Wang, X., Tan, R.T.: Bottom-up 2d pose estimation via dual anatomical centers for small-scale persons. PR 139, 109403 (2023)
Qu, H., Cai, Y., Foo, L.G., Kumar, A., Liu, J.: A characteristic function-based method for bottom-up human pose estimation. In: CVPR, pp. 13009–13018 (2023)
Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: NIPS, pp. 2277–2287 (2017)
Kocabas, M., Karagoz, S., Akbas, E.: Multiposenet: Fast multi-person pose estimation using pose residual network. In: ECCV, pp. 417–433 (2018)
Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.-S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In: CVPR, pp. 10863–10872 (2019)
Jin, S., Liu, W., Xie, E., Wang, W., Qian, C., Ouyang, W., Luo, P.: Differentiable hierarchical graph grouping for multi-person pose estimation. In: ECCV, pp. 718–734 (2020)
Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In: CVPR, pp. 5386–5395 (2020)
Nie, X., Feng, J., Zhang, J., Yan, S.: Single-stage multi-person pose machines. In: ICCV, pp. 6951–6960 (2019)
Geng, Z., Sun, K., Xiao, B., Zhang, Z., Wang, J.: Bottom-up human pose estimation via disentangled keypoint regression. In: CVPR, pp. 14676–14686 (2021)
Li, J., Wang, Y., Zhang, S.: PolarPose: Single-stage

multi-person pose estimation in polar coordinates. IEEE Transactions on Image Processing 32, 1108–1119 (2023)
Tian, Z., Chen, H., Shen, C.: Directpose: Direct end-to-end multi-person pose estimation. arXiv preprint arXiv:1911.07451 (2019)
Mao, W., Tian, Z., Wang, X., Shen, C.: Fcpose: Fully convolutional multi-person pose estimation with dynamic instance-aware convolutions. In: CVPR, pp. 9034–9043 (2021)
Shi, D., Wei, X., Yu, X., Tan, W., Ren, Y., Pu, S.: Inspose: instance-aware networks for single-stage multi-person pose estimation. In: ACMMM, pp. 3079–3087 (2021)
Miao, H., Lin, J., Cao, J., He, X., Su, Z., Liu, R.: Smpr: Single-stage multi-person pose regression. PR 143, 109743 (2023)
Shi, D., Wei, X., Li, L., Ren, Y., Tan, W.: End-to-end multi-person pose estimation with transformers. In: CVPR, pp. 11069–11078 (2022)
Liu, H., Chen, Q., Tan, Z., Liu, J.-J., Wang, J., Su, X., Li, X., Yao, K., Han, J., Ding, E., Zhao, Y., Wang, J.: Group pose: A simple baseline for end-to-end multi-person pose estimation. In: ICCV, pp. 15029–15038 (2023)
Pfister, T., Simonyan, K., Charles, J., Zisserman, A.: Deep convolutional neural networks for efficient pose estimation in gesture videos. In: ACCV, pp. 538–552 (2014)
Grinciunaite, A., Gudi, A., Tasli, E., Den Uyl, M.: Human pose estimation in space and time using 3d cnn. In: ECCV, pp. 32–39 (2016)
Pfister, T., Charles, J., Zisserman, A.: Flowing convnets for human pose estimation in videos. In: ICCV, pp. 1913–1921 (2015)
Song, J., Wang, L., Van Gool, L., Hilliges, O.: Thin-slicing network: A deep structured model for pose estimation in videos. In: CVPR, pp. 4220–4229 (2017)
Jain, A., Tompson, J., LeCun, Y., Bregler, C.: Modeep: A deep learning framework using motion features for human pose estimation. In: ACCV, pp. 302–315 (2014)
Gkioxari, G., Toshev, A., Jaitly, N.: Chained predictions using convolutional neural networks. In: ECCV, pp. 728–743 (2016)
Charles, J., Pfister, T., Magee, D., Hogg, D., Zisserman, A.: Personalizing human video pose estimation. In: CVPR, pp. 3063–3072 (2016)
Luo, Y., Ren, J., Wang, Z., Sun, W., Pan, J., Liu, J., Pang, J., Lin, L.: Lstm pose machines. In: CVPR, pp. 5207–5215 (2018)
Nie, X., Li, Y., Luo, L., Zhang, N., Feng, J.: Dynamic kernel distillation for efficient pose estimation in videos. In: ICCV, pp. 6942–6950 (2019)
Li, H., Yang, W., Liao, Q.: Temporal feature enhancing network for human pose estimation in videos. In: ICIP, pp. 579–583 (2019)
Li, W., Xu, X., Zhang, Y.-J.: Temporal feature correlation for human pose estimation in videos. In: ICIP, pp. 599–603 (2019)
Xu, L., Guan, Y., Jin, S., Liu, W., Qian, C., Luo, P., Ouyang, W., Wang, X.: Vipnas: Efficient video pose estimation via neural architecture search. In: CVPR, pp. 16072–16081 (2021)
Dang, Y., Yin, J., Zhang, S.: Relation-based associative joint location for human pose estimation in videos. IEEE Transactions on Image Processing 31, 3973–3986 (2022)
Jin, K.-M., Lim, B.-S., Lee, G.-H., Kang, T.-K., Lee, S.-W.: Kinematic-aware hierarchical attention network for human pose estimation in videos. In: WACV, pp. 5725–5734 (2023)
Zhang, Y., Wang, Y., Camps, O., Sznaier, M.: Key frame proposal network for efficient pose estimation in videos. In: ECCV, pp. 609–625 (2020)
Ma, X., Rahmani, H., Fan, Z., Yang, B., Chen, J., Liu, J.: Remote: Reinforced motion transformation network for semi-supervised 2d pose estimation in videos. In: AAAI, pp. 1944–1952 (2022)
Zeng, A., Ju, X., Yang, L., Gao, R., Zhu, X., Dai, B., Xu, Q.: Deciwatch: A simple baseline for 10× efficient 2d and 3d pose estimation. In: ECCV, pp. 607–624 (2022)
Sun, Y., Dougherty, A.W., Zhang, Z., Choi, Y.K., Wu, C.: Mixsynthformer: A transformer encoder-like structure with mixed synthetic self-attention for efficient human pose estimation. In: ICCV, pp. 14884–14893 (2023)
Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose flow: Efficient online pose tracking. In: ECCV (2018)
Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M., Tran, D.: Detect-and-track: Efficient pose estimation in videos. In: CVPR, pp. 350–359 (2018)
Wang, M., Tighe, J., Modolo, D.: Combining detection and tracking for human pose estimation in videos. In: CVPR, pp. 11088–11096 (2020)
Fang, H.-S., Li, J., Tang, H., Xu, C., Zhu, H., Xiu, Y., Li, Y.-L., Lu, C.: Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. TPAMI (2022)
Feng, R., Gao, Y., Ma, X., Tse, T.H.E., Chang, H.J.: Mutual information-based temporal difference learning for human pose estimation in video. In: CVPR, pp. 17131–17141 (2023)
Gai, D., Feng, R., Min, W., Yang, X., Su, P., Wang, Q., Han, Q.: Spatiotemporal learning transformer for video-based human pose estimation. TCSVT (2023)
Amit, T., Shaharbany, T., Nachmani, E., Wolf, L.: Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)
Chen, S., Sun, P., Song, Y., Luo, P.: Diffusiondet: Diffusion model for object detection. ICCV, 19830–19843 (2023)
Feng, R., Gao, Y., Tse, T.H.E., Ma, X., Chang, H.J.: Diffpose: Spatiotemporal diffusion model for video-based human pose estimation. In: ICCV, pp. 14861–14872 (2023)

Jin, S., Liu, W., Ouyang, W., Qian, C.: Multi-person articulated tracking with spatial and temporal embeddings. In: CVPR, pp. 5664–5673 (2019)
Li, S., Chan, A.B.: 3d human pose estimation from monocular images with deep convolutional neural network. In: ACCV, pp. 332–347 (2014)
Li, S., Zhang, W., Chan, A.B.: Maximum-margin structured learning with deep networks for 3d human pose estimation. In: ICCV, pp. 2848–2856 (2015)
Tekin, B., Katircioglu, I., Salzmann, M., Lepetit, V., Fua, P.: Structured prediction of 3d human pose with deep neural networks. In: BMVC, pp. 1–11 (2016)
Zhou, X., Sun, X., Zhang, W., Liang, S., Wei, Y.: Deep kinematic pose regression. In: ECCV, pp. 186–201 (2016)
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3d human pose estimation in the wild using improved cnn supervision. In: 3DV, pp. 506–516 (2017)
Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3d human pose estimation in the wild: a weakly-supervised approach. In: ICCV, pp. 398–407 (2017)
Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3d human pose
3d human pose regression. In: CVPR, pp. 3425–3435 (2019)
Choi, H., Moon, G., Lee, K.M.: Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In: ECCV, pp. 769–787 (2020)
Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., Lin, S.: Srnet: Improving generalization in 3d human pose estimation with a split-and-recombine approach. In: ECCV, pp. 507–523 (2020)
Liu, K., Ding, R., Zou, Z., Wang, L., Tang, W.: A comprehensive study of weight sharing in graph networks for 3d human pose estimation. In: ECCV, pp. 318–334 (2020)
Zou, Z., Tang, W.: Modulated graph convolutional network for 3d human pose estimation. In: ICCV, pp. 11477–11487 (2021)
Xu, T., Takano, W.: Graph stacked hourglass networks for 3d human pose estimation. In: CVPR, pp. 16105–16114 (2021)
Shengping, Z., Chenyang, W., Liqiang, N., Hongxun, Y., Qingming, H., Qi, T.: Learning enriched hop-aware correlation for robust 3d human pose estimation. IJCV (6), 1566–1583 (2023)
Hassan, M.T., Ben Hamza, A.: Regular splitting
estimation. In: ICCV, pp. 2640–2649 (2017) graph network for 3d human pose estimation. IEEE
Tekin, B., Márquez-Neila, P., Salzmann, M., Fua, P.: Transactions on Image Processing 32, 4212–4222
Learning to fuse 2d and 3d image cues for monocu- (2023)
lar body pose estimation. In: ICCV, pp. 3941–3950 Zhai, K., Nie, Q., Ouyang, B., Li, X., Yang, S.: Hop-
(2017) fir: Hop-wise graphformer with intragroup joint
Zhou, K., Han, X., Jiang, N., Jia, K., Lu, J.: Hem- refinement for 3d human pose estimation. In: ICCV
lets pose: Learning part-centric heatmap triplets (2023)
for accurate 3d human pose estimation. In: ICCV, Lin, K., Wang, L., Liu, Z.: End-to-end human pose
pp. 2344–2353 (2019) and mesh reconstruction with transformers. In:
Wang, M., Chen, X., Liu, W., Qian, C., Lin, L., CVPR, pp. 1954–1963 (2021)
Ma, L.: Drpose3d: Depth ranking in 3d human Zhao, W., Wang, W., Tian, Y.: Graformer: Graph-
pose estimation. arXiv preprint arXiv:1805.08973 oriented transformer for 3d pose estimation. In:
(2018) CVPR, pp. 20438–20447 (2022)
Carbonera Luvizon, D., Tabia, H., Picard, D.: SSP- Yang, W., Ouyang, W., Wang, X., Ren, J., Li, H.,
Net: Scalable sequential pyramid networks for real- Wang, X.: 3d human pose estimation in the wild
time 3d human pose regression. PR 142, 109714 by adversarial learning. In: CVPR, pp. 5255–5264
(2023) (2018)
Jahangiri, E., Yuille, A.L.: Generating multiple Habibie, I., Xu, W., Mehta, D., Pons-Moll, G.,
diverse hypotheses for human 3d pose consistent Theobalt, C.: In the wild human pose estima-
with 2d joint detections. In: ICCV, pp. 805–814 tion using explicit 2d features and intermediate
(2017) 3d representations. In: CVPR, pp. 10905–10914
Sharma, S., Varigonda, P.T., Bindal, P., Sharma, A., (2019)
Jain, A.: Monocular 3d human pose estimation Chen, C.-H., Tyagi, A., Agrawal, A., Drover, D.,
by generation and ordinal ranking. In: ICCV, pp. Mv, R., Stojanov, S., Rehg, J.M.: Unsupervised 3d
2325–2334 (2019) pose estimation with geometric self-supervision. In:
Li, C., Lee, G.H.: Generating multiple hypotheses for CVPR, pp. 5714–5724 (2019)
3d human pose estimation with mixture density Wandt, B., Rosenhahn, B.: Repnet: Weakly super-
network. In: CVPR, pp. 9887–9895 (2019) vised training of an adversarial reprojection net-
Ci, H., Wang, C., Ma, X., Wang, Y.: Optimizing net- work for 3d human pose estimation. In: CVPR, pp.
work structure for 3d human pose estimation. In: 7782–7791 (2019)
ICCV, pp. 2262–2271 (2019) Iqbal, U., Molchanov, P., Kautz, J.: Weakly-
Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, supervised 3d human pose learning via multi-view
D.N.: Semantic graph convolutional networks for images in the wild. In: CVPR, pp. 5243–5252

40
(2020) Sridhar, S., Pons-Moll, G., Theobalt, C.: Single-
Kundu, J.N., Seth, S., Jampani, V., Rakesh, M., shot multi-person 3d pose estimation from monoc-
Babu, R.V., Chakraborty, A.: Self-supervised 3d ular rgb. In: 3DV, pp. 120–130 (2018)
human pose estimation via part guided novel image Mehta, D., Sotnychenko, O., Mueller, F., Xu, W.,
synthesis. In: CVPR, pp. 6152–6162 (2020) Elgharib, M., Fua, P., Seidel, H.-P., Rhodin, H.,
Schmidtke, L., Vlontzos, A., Ellershaw, S., Lukens, Pons-Moll, G., Theobalt, C.: Xnect: Real-time
A., Arichi, T., Kainz, B.: Unsupervised human pose multi-person 3d motion capture with a single rgb
estimation through transforming shape templates. camera. TOG 39(4), 82–1 (2020)
In: CVPR, pp. 2484–2494 (2021) Zhen, J., Fang, Q., Sun, J., Liu, W., Jiang, W., Bao,
Yu, Z., Ni, B., Xu, J., Wang, J., Zhao, C., Zhang, H., Zhou, X.: Smap: Single-shot multi-person abso-
W.: Towards alleviating the modeling ambiguity lute 3d pose estimation. In: ECCV, pp. 550–566
of unsupervised monocular 3d human pose estima- (2020)
tion. In: ICCV, pp. 8651–8660 (2021) Liu, Q., Zhang, Y., Bai, S., Yuille, A.: Explicit occlu-
Gong, K., Li, B., Zhang, J., Wang, T., Huang, J., Mi, sion reasoning for multi-person 3d human pose
M.B., Feng, J., Wang, X.: Posetriplet: co-evolving estimation. In: ECCV, pp. 497–517 (2022)
3d human pose estimation, imitation, and hal- Chen, X., Zhang, J., Wang, K., Wei, P., Lin, L.:
lucination under self-supervision. In: CVPR, pp. Multi-person 3d pose esitmation with occlusion
11017–11027 (2022) reasoning. TMM, 1–13 (2023)
Chai, W., Jiang, Z., Hwang, J.-N., Wang, G.: Global Zhou, X., Wang, D., Krähenbühl, P.: Objects as
adaptation meets local generalization: Unsuper- points. arXiv preprint arXiv:1904.07850 (2019)
vised domain adaptation for 3d human pose esti- Wei, F., Sun, X., Li, H., Wang, J., Lin, S.: Point-set
mation. In: ICCV (2023) anchors for object detection, instance segmenta-
Wang, Z., Nie, X., Qu, X., Chen, Y., Liu, S.: tion and pose estimation. In: ECCV, pp. 527–544
Distribution-aware single-stage models for multi- (2020)
person 3d pose estimation. In: CVPR, pp. 13096– Jin, L., Xu, C., Wang, X., Xiao, Y., Guo, Y., Nie,
13105 (2022) X., Zhao, J.: Single-stage is enough: Multi-person
Rogez, G., Weinzaepfel, P., Schmid, C.: Lcr-net: absolute 3d pose estimation. In: CVPR, pp. 13086–
Localization-classification-regression for human 13095 (2022)
pose. In: CVPR, pp. 3433–3441 (2017) Qiu, Z., Qiu, K., Fu, J., Fu, D.: Weakly-supervised
Rogez, G., Weinzaepfel, P., Schmid, C.: Lcr-net++: pre-training for 3d human pose estimation via
Multi-person 2d and 3d pose detection in natural perspective knowledge. PR 139, 109497 (2023)
images. TPAMI 42(5), 1146–1161 (2019) Tekin, B., Rozantsev, A., Lepetit, V., Fua, P.: Direct
Moon, G., Chang, J.Y., Lee, K.M.: Camera distance- prediction of 3d body poses from motion compen-
aware top-down approach for 3d multi-person pose sated sequences. In: CVPR, pp. 991–1000 (2016)
estimation from a single rgb image. In: ICCV, pp. Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin,
10133–10142 (2019) H., Shafiei, M., Seidel, H.-P., Xu, W., Casas, D.,
Lin, J., Lee, G.H.: Hdnet: Human depth estima- Theobalt, C.: Vnect: Real-time 3d human pose
tion for multi-person camera-space localization. In: estimation with a single rgb camera. ACM Trans-
ECCV, pp. 633–648 (2020) actions on Graphics (TOG) 36(4), 44 (2017)
Wang, C., Li, J., Liu, W., Qian, C., Lu, C.: Dabral, R., Mundhada, A., Kusupati, U., Afaque,
Hmor: Hierarchical multi-person ordinal relations S., Sharma, A., Jain, A.: Learning 3d human pose
for monocular multi-person 3d pose estimation. In: from structure and motion. In: ECCV, pp. 668–683
ECCV, pp. 242–259 (2020) (2018)
Cha, J., Saqlain, M., Kim, G., Shin, M., Baek, S.: Qiu, Z., Yang, Q., Wang, J., Fu, D.: Ivt: An end-to-
Multi-person 3d pose and shape estimation via end instance-guided video transformer for 3d pose
inverse kinematics and refinement. In: ECCV, pp. estimation. In: ACM MM, pp. 6174–6182 (2022)
660–677 (2022) Honari, S., Constantin, V., Rhodin, H., Salzmann,
Zanfir, A., Marinoiu, E., Zanfir, M., Popa, A.-I., M., Fua, P.: Temporal representation learning on
Sminchisescu, C.: Deep network for the integrated monocular videos for 3d human pose estimation.
3d sensing of multiple people in natural images. TPAMI 45(5), 6415–6427 (2023)
NIPS 31 (2018) Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.:
Fabbri, M., Lanzi, F., Calderara, S., Alletto, S., Cuc- 3d human pose estimation in video with tempo-
chiara, R.: Compressed volumetric heatmaps for ral convolutions and semi-supervised training. In:
multi-person 3d pose estimation. In: CVPR, pp. CVPR, pp. 7753–7762 (2019)
7204–7213 (2020) Cheng, Y., Yang, B., Wang, B., Yan, W., Tan,
Kundu, J.N., Revanur, A., Waghmare, G.V., R.T.: Occlusion-aware networks for 3d human pose
Venkatesh, R.M., Babu, R.V.: Unsupervised cross- estimation in video. In: CVPR, pp. 723–732 (2019)
modal alignment for multi-person 3d pose estima- Liu, J., Guang, Y., Rojas, J.: Gast-net: Graph atten-
tion. In: ECCV, pp. 35–52 (2020) tion spatio-temporal convolutional networks for 3d
Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., human pose estimation in video. arXiv preprint

41
arXiv:2003.14179 (2020) graph convolutional network for 3d human pose
Chen, T., Fang, C., Shen, X., Zhu, Y., Chen, Z., estimation from monocular video. In: ICCV, pp.
Luo, J.: Anatomy-aware 3d human pose estima- 8818–8829 (2023)
tion with bone-based pose decomposition. TCSVT Zhang, J., Tu, Z., Yang, J., Chen, Y., Yuan, J.:
32(1), 198–209 (2021) Mixste: Seq2seq mixed spatio-temporal encoder for
Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.-J., Yuan, 3d human pose estimation in video. In: CVPR, pp.
J., Thalmann, N.M.: Exploiting spatial-temporal 13232–13242 (2022)
relationships for 3d pose estimation via graph Chen, H., He, J.-Y., Xiang, W., Cheng, Z.-Q., Liu,
convolutional networks. In: ICCV, pp. 2272–2281 W., Liu, H., Luo, B., Geng, Y., Xie, X.: Hdformer:
(2019) High-order directed transformer for 3d human pose
Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., estimation. In: IJCAI, pp. 581–589 (2023)
Ding, Z.: 3d human pose estimation with spatial Shuai, H., Wu, L., Liu, Q.: Adaptive multi-view and
and temporal transformers. In: ICCV, pp. 11656– temporal fusing transformer for 3d human pose
11665 (2021) estimation. TPAMI 45(4), 4122–4135 (2023)
Zhao, Q., Zheng, C., Liu, M., Wang, P., Chen, C.: Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.:
Poseformerv2: Exploring frequency domain for effi- Motionbert: Unified pretraining for human motion
cient and robust 3d human pose estimation. In: analysis. arXiv preprint arXiv:2210.06551 (2022)
CVPR, pp. 8877–8886 (2023) Cheng, Y., Wang, B., Yang, B., Tan, R.T.: Graph
Li, W., Liu, H., Ding, R., Liu, M., Wang, P., Yang, and temporal convolutional networks for 3d multi-
W.: Exploiting temporal contexts with strided person pose estimation in monocular videos. In:
transformer for 3d human pose estimation. IEEE AAAI, pp. 1157–1165 (2021)
Transactions on Multimedia (2022) Cheng, Y., Wang, B., Yang, B., Tan, R.T.: Monocu-
Li, W., Liu, H., Tang, H., Wang, P., Van Gool, lar 3d multi-person pose estimation by integrating
L.: Mhformer: Multi-hypothesis transformer for 3d top-down and bottom-up networks. In: CVPR, pp.
human pose estimation. In: CVPR, pp. 13147– 7649–7659 (2021)
13156 (2022) Park, S., You, E., Lee, I., Lee, J.: Towards robust
Li, W., Liu, H., Tang, H., Wang, P.: Multi-hypothesis and smooth 3d multi-person pose estimation from
representation learning for transformer-based 3d monocular videos in the wild. In: ICCV, pp. 14772–
human pose estimation. PR 141, 109631 (2023) 14782 (2023)
Holmquist, K., Wandt, B.: Diffpose: Multi-hypothesis Zhao, L., Gao, X., Tao, D., Li, X.: Tracking human
human pose estimation using diffusion models. In: pose using max-margin markov models. IEEE
ICCV, pp. 15977–15987 (2023) Transactions on Image Processing 24(12), 5274–
Shan, W., Liu, Z., Zhang, X., Wang, Z., Han, 5287 (2015)
K., Wang, S., Ma, S., Gao, W.: Diffusion-based Samanta, S., Chanda, B.: A data-driven approach for
3d human pose estimation with multi-hypothesis human pose tracking based on spatio-temporal pic-
aggregation. In: ICCV (2023) torial structure. arXiv preprint arXiv:1608.00199
Tang, Z., Qiu, Z., Hao, Y., Hong, R., Yao, T.: 3d (2016)
human pose estimation with spatio-temporal criss- Zhao, L., Gao, X., Tao, D., Li, X.: Learning a track-
cross attention. In: CVPR, pp. 4790–4799 (2023) ing and estimation integrated graphical model for
Lin, M., Lin, L., Liang, X., Wang, K., Cheng, H.: human pose tracking. IEEE transactions on neural
Recurrent 3d pose sequence machines. In: CVPR, networks and learning systems 26(12), 3176–3186
pp. 810–819 (2017) (2015)
Rayat Imtiaz Hossain, M., Little, J.J.: Exploiting Ma, M., Marturi, N., Li, Y., Stolkin, R., Leonardis,
temporal information for 3d human pose estima- A.: A local-global coupled-layer puppet model for
tion. In: ECCV, pp. 68–84 (2018) robust online human pose tracking. Computer
Lee, K., Lee, I., Lee, S.: Propagating lstm: 3d pose Vision and Image Understanding 153, 163–178
estimation based on joint interdependency. In: (2016)
ECCV, pp. 119–135 (2018) Zhang, J., Zhu, Z., Zou, W., Li, P., Li, Y., Su,
Katircioglu, I., Tekin, B., Salzmann, M., Lepetit, H., Huang, G.: Fastpose: Towards real-time pose
V., Fua, P.: Learning latent representations of estimation and tracking via scale-normalized multi-
3d human pose with deep neural networks. IJCV task networks. arXiv preprint arXiv:1908.05593
126(12), 1326–1341 (2018) (2019)
Yeh, R., Hu, Y.-T., Schwing, A.: Chirality nets for Ning, G., Pei, J., Huang, H.: Lighttrack: A generic
human pose regression. In: NIPS, pp. 8161–8171 framework for online top-down human pose track-
(2019) ing. In: CVPRW, pp. 1034–1035 (2020)
Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided Rafi, U., Doering, A., Leibe, B., Gall, J.: Self-
3d pose estimation from videos. In: ECCV, pp. supervised keypoint correspondences for multi-
764–780 (2020) person pose estimation and tracking in videos. In:
Yu, B.X., Zhang, Z., Liu, Y., Zhong, S.-h., Liu, ECCV, pp. 36–52 (2020)
Y., Chen, C.W.: Gla-gcn: Global-local adaptive Yang, Y., Ren, Z., Li, H., Zhou, C., Wang, X.,

42
Hua, G.: Learning dynamics via graph neural net- Chained multi-stream networks exploiting pose,
works for human pose estimation and tracking. In: motion, and appearance for action classification
CVPR, pp. 8074–8084 (2021) and detection. In: ICCV, pp. 2904–2913 (2017)
Doering, A., Gall, J.: A gated attention transformer Choutas, V., Weinzaepfel, P., Revaud, J., Schmid,
for multi-person pose tracking. In: ICCV, pp. 3189– C.: Potion: Pose motion representation for action
3198 (2023) recognition. In: CVPR, pp. 7024–7033 (2018)
Iqbal, U., Milan, A., Gall, J.: Posetrack: Joint multi- Liu, M., Yuan, J.: Recognizing human actions as the
person pose estimation and tracking. In: CVPR, evolution of pose estimation maps. In: CVPR, pp.
pp. 2011–2020 (2017) 1159–1168 (2018)
Raaj, Y., Idrees, H., Hidalgo, G., Sheikh, Y.: Efficient Moon, G., Kwon, H., Lee, K.M., Cho, M.: Integralac-
online multi-person 2d pose tracking with recur- tion: Pose-driven feature integration for robust
rent spatio-temporal affinity fields. In: CVPR, pp. human action recognition in videos. In: CVPR, pp.
4620–4628 (2019) 3339–3348 (2021)
Bridgeman, L., Volino, M., Guillemaut, J.-Y., Hilton, Shah, A., Mishra, S., Bansal, A., Chen, J.-C., Chel-
A.: Multi-person 3d pose estimation and tracking lappa, R., Shrivastava, A.: Pose and joint-aware
in sports. In: CVPRW, pp. 0–0 (2019) action recognition. In: WACV, pp. 3850–3860
Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular (2022)
3d pose and shape estimation of multiple people Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revis-
in natural scenes-the importance of multiple scene iting skeleton-based action recognition. In: CVPR,
constraints. In: CVPR, pp. 2148–2157 (2018) pp. 2969–2978 (2022)
Sun, X., Li, C., Lin, S.: Explicit spatiotemporal Sato, F., Hachiuma, R., Sekii, T.: Prompt-guided
joint relation learning for tracking human pose. In: zero-shot anomaly action recognition using pre-
ICCV (2019) trained deep skeleton features. In: CVPR, pp.
Reddy, N.D., Guigues, L., Pishchulin, L., Eledath, J., 6471–6480 (2023)
Narasimhan, S.G.: Tessetrack: End-to-end learn- Hachiuma, R., Sato, F., Sekii, T.: Unified keypoint-
able multi-person articulated 3d pose tracking. In: based action recognition framework via structured
CVPR, pp. 15190–15200 (2021) keypoint pooling. In: CVPR, pp. 22962–22971
Zhang, Y., Wang, C., Wang, X., Liu, W., Zeng, (2023)
W.: Voxeltrack: Multi-person 3d human pose esti- Luvizon, D.C., Picard, D., Tabia, H.: 2d/3d pose esti-
mation and tracking in the wild. TPAMI 45(2), mation and action recognition using multitask deep
2613–2626 (2022) learning. In: CVPR, pp. 5137–5146 (2018)
Zou, S., Xu, Y., Li, C., Ma, L., Cheng, L., Vo, M.: Foo, L.G., Li, T., Rahmani, H., Ke, Q., Liu, J.:
Snipper: A spatiotemporal transformer for simul- Unified pose sequence modeling. In: CVPR, pp.
taneous multi-person 3d pose estimation tracking 13019–13030 (2023)
and forecasting on a video snippet. TCSVT (2023) Du, Y., Fu, Y., Wang, L.: Skeleton based action
Rajasegaran, J., Pavlakos, G., Kanazawa, A., Malik, recognition with convolutional neural network. In:
J.: Tracking people by predicting 3d appearance, ACPR, pp. 579–583 (2015)
location and pose. In: CVPR, pp. 2740–2749 Wang, P., Li, Z., Hou, Y., Li, W.: Action recognition
(2022) based on joint trajectory maps using convolutional
Wang, H., Wang, L.: Modeling temporal dynamics neural networks. In: ACMMM, pp. 102–106 (2016)
and spatial configurations of actions using two- Hou, Y., Li, Z., Wang, P., Li, W.: Skeleton optical
stream recurrent neural networks. In: CVPR, pp. spectra-based action recognition using convolu-
499–508 (2017) tional neural networks. TCSVT 28(3), 807–811
Caetano, C., Brémond, F., Schwartz, W.R.: Skele- (2016)
ton image representation for 3d action recognition Li, C., Hou, Y., Wang, P., Li, W.: Joint distance maps
based on tree structure and reference joints. In: based action recognition with convolutional neural
SIBGRAPI, pp. 16–23 (2019) networks. IEEE Signal Processing Letters 24(5),
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph 624–628 (2017)
convolutional networks for skeleton-based action Liu, M., Liu, H., Chen, C.: Enhanced skeleton visu-
recognition. In: AAAI, pp. 7444–7452 (2018) alization for view invariant human action recogni-
Plizzari, C., Cannici, M., Matteucci, M.: Spatial tion. PR 68, 346–362 (2017)
temporal transformer network for skeleton-based Ke, Q., An, S., Bennamoun, M., Sohel, F., Boussaid,
action recognition. In: ICPRW, pp. 694–701 (2021) F.: Skeletonnet: Mining deep part features for 3d
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, action recognition. IEEE Signal Processing Letters
M.J.: Towards understanding action recognition. (2017)
In: ICCV, pp. 3192–3199 (2013) Li, Y., Xia, R., Liu, X., Huang, Q.: Learning shape-
Chéron, G., Laptev, I., Schmid, C.: P-cnn: Pose-based motion representations from geometric algebra
cnn features for action recognition. In: ICCV, pp. spatio-temporal model for skeleton-based action
3218–3226 (2015) recognition. In: ICME, pp. 1066–1071 (2019)
Zolfaghari, M., Oliveira, G.L., Sedaghat, N., Brox, T.:

43
Ding, Z., Wang, P., Ogunbona, P.O., Li, W.: Investi- convolutional networks for skeleton-based action
gation of different skeleton features for cnn-based recognition. In: ACM MM, pp. 2122–2130 (2020)
3d action recognition. In: ICMEW, pp. 617–622 Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.:
(2017) Disentangling and unifying graph convolutions for
Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, skeleton-based action recognition. In: CVPR, pp.
F.: A new representation of skeleton sequences for 143–152 (2020)
3d action recognition. In: CVPR (2017) Zhang, X., Xu, C., Tao, D.: Context aware graph con-
Liang, D., Fan, G., Lin, G., Chen, W., Pan, X., volution for skeleton-based action recognition. In:
Zhu, H.: Three-stream convolutional neural net- CVPR, pp. 14333–14342 (2020)
work with multi-task and ensemble learning for 3d Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y.,
action recognition. In: CVPRW, pp. 0–0 (2019) Tian, Q.: Actional-structural graph convolutional
Liu, H., Tu, J., Liu, M.: Two-stream 3d convolutional networks for skeleton-based action recognition. In:
neural network for skeleton-based action recogni- CVPR, pp. 3595–3603 (2019)
tion. arXiv preprint arXiv:1705.08106 (2017) Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-
Hernandez Ruiz, A., Porzi, L., Rota Bulò, S., Moreno- stream adaptive graph convolutional networks for
Noguer, F.: 3d cnns on distance matrices for human skeleton-based action recognition. In: CVPR, pp.
action recognition. In: ACM MM, pp. 1087–1095 12026–12035 (2019)
(2017) Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J.,
Du, Y., Wang, W., Wang, L.: Hierarchical recurrent Lu, H.: Skeleton-based action recognition with shift
neural network for skeleton based action recogni- graph convolutional network. In: CVPR, pp. 183–
tion. In: CVPR, pp. 1110–1118 (2015) 192 (2020)
Du, Y., Fu, Y., Wang, L.: Representation learning of Korban, M., Li, X.: Ddgcn: A dynamic directed graph
temporal dynamics for skeleton-based action recog- convolutional network for action recognition. In:
nition. IEEE Transactions on Image Processing ECCV, pp. 761–776 (2020)
25(7), 3010–3022 (2016) Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., Hu,
Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: NTU W.: Channel-wise topology refinement graph con-
RGB+ D: A large scale dataset for 3D human volution for skeleton-based action recognition. In:
activity analysis. In: CVPR, pp. 1010–1019 (2016) ICCV, pp. 13359–13368 (2021)
Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: An end- Chi, H.-g., Ha, M.H., Chi, S., Lee, S.W., Huang,
to-end spatio-temporal attention model for human Q., Ramani, K.: Infogcn: Representation learning
action recognition from skeleton data. In: AAAI, for human skeleton-based action recognition. In:
pp. 4263–4270 (2017) CVPR, pp. 20186–20196 (2022)
Liu, J., Wang, G., Hu, P., Duan, L.-Y., Kot, A.C.: Duan, H., Wang, J., Chen, K., Lin, D.: Dg-
Global context-aware attention lstm networks for stgcn: dynamic spatial-temporal modeling for
3d action recognition. In: CVPR, pp. 1647–1656 skeleton-based action recognition. arXiv preprint
(2017) arXiv:2210.05895 (2022)
Song, S., Lan, C., Xing, J., Zeng, W., Liu, J.: Wang, S., Zhang, Y., Wei, F., Wang, K., Zhao,
Spatio-temporal attention-based lstm networks for M., Jiang, Y.: Skeleton-based action recognition
3d action recognition and detection. TIP 27(7), via temporal-channel aggregation. arXiv preprint
3459–3471 (2018) arXiv:2205.15936 (2022)
Zhang, P., Xue, J., Lan, C., Zeng, W., Gao, Z., Zheng, Wen, Y.-H., Gao, L., Fu, H., Zhang, F.-L., Xia, S.,
N.: Eleatt-rnn: Adding attentiveness to neurons in Liu, Y.-J.: Motif-gcns with local and non-local tem-
recurrent neural networks. IEEE Transactions on poral blocks for skeleton-based action recognition.
Image Processing 29, 1061–1073 (2019) TPAMI 45(2), 2009–2023 (2023)
Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: Lin, L., Zhang, J., Liu, J.: Actionlet-dependent con-
An attention enhanced graph convolutional lstm trastive learning for unsupervised skeleton-based
network for skeleton-based action recognition. In: action recognition. In: CVPR, pp. 2363–2372
CVPR, pp. 1227–1236 (2019) (2023)
Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio- Li, Z., Gong, X., Song, R., Duan, P., Liu, J., Zhang,
temporal LSTM with trust gates for 3D human W.: SMAM: Self and mutual adaptive matching
action recognition. In: ECCV, pp. 816–833 (2016) for skeleton-based few-shot action recognition. TIP
Zhang, S., Liu, X., Xiao, J.: On geometric features for 32, 392–402 (2022)
skeleton-based action recognition using multilayer Dai, M., Sun, Z., Wang, T., Feng, J., Jia, K.: Global
lstm networks. In: WACV, pp. 148–157 (2017) spatio-temporal synergistic topology learning for
Si, C., Jing, Y., Wang, W., Wang, L., Tan, T.: skeleton-based action recognition. PR 140, 109540
Skeleton-based action recognition with spatial rea- (2023)
soning and temporal stack learning. In: ECCV, pp. Zhu, Y., Shuai, H., Liu, G., Liu, Q.: Multilevel spa-
103–118 (2018) tial–temporal excited graph network for skeleton-
Huang, Z., Shen, X., Tian, X., Li, H., Huang, based action recognition. TIP 32, 496–508 (2023)
J., Hua, X.-S.: Spatio-temporal inception graph

44
Shu, X., Xu, B., Zhang, L., Tang, J.: Multi- 109455 (2023)
granularity anchor-contrastive representation Zhou, Y., Li, C., Cheng, Z.-Q., Geng, Y., Xie,
learning for semi-supervised skeleton-based action X., Keuper, M.: Hypergraph transformer for
recognition. TPAMI 45(6), 7559–7576 (2023) skeleton-based action recognition. arXiv preprint
Wu, L., Zhang, C., Zou, Y.: Spatiotemporal focus for arXiv:2211.09590 (2022)
skeleton-based action recognition. PR 136, 109231 Qiu, H., Hou, B., Ren, B., Zhang, X.: Spatio-temporal
(2023) tuples transformer for skeleton-based action recog-
Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., nition. arXiv preprint arXiv:2201.02849 (2022)
Zheng, N.: Semantics-guided neural networks for Kong, J., Bian, Y., Jiang, M.: Mtt: Multi-scale tem-
efficient skeleton-based human action recognition. poral transformer for skeleton-based action recog-
In: CVPR, pp. 1112–1121 (2020) nition. IEEE Signal Processing Letters 29, 528–532
Ye, F., Pu, S., Zhong, Q., Li, C., Xie, D., Tang, (2022)
H.: Dynamic gcn: Context-enriched topology learn- Zhang, J., Jia, Y., Xie, W., Tu, Z.: Zoom trans-
ing for skeleton-based action recognition. In: ACM former for skeleton-based group activity recogni-
MM, pp. 55–63 (2020) tion. TCSVT 32(12), 8646–8659 (2022)
Wang, M., Ni, B., Yang, X.: Learning multi-view Gao, Z., Wang, P., Lv, P., Jiang, X., Liu, Q., Wang,
interactional skeleton graph for action recognition. P., Xu, M., Li, W.: Focal and global spatial-
TPAMI 45(6), 6940–6954 (2023) temporal transformer for skeleton-based action
Li, S., He, X., Song, W., Hao, A., Qin, H.: recognition. In: ACCV, pp. 382–398 (2022)
Graph diffusion convolutional network for skeleton Liu, Y., Zhang, H., Xu, D., He, K.: Graph trans-
based semantic recognition of two-person actions. former network with temporal kernel attention
TPAMI 45(7), 8477–8493 (2023) for skeleton-based action recognition. Knowledge-
Xu, H., Gao, Y., Hui, Z., Li, J., Gao, X.: Lan- Based Systems 240, 108146 (2022)
guage knowledge-assisted representation learn- Pang, Y., Ke, Q., Rahmani, H., Bailey, J., Liu,
ing for skeleton-based action recognition. arXiv J.: Igformer: Interaction graph transformer for
preprint arXiv:2305.12398 (2023) skeleton-based human interaction recognition. In:
Wang, X., Xu, X., Mu, Y.: Neural koopman pooling: ECCV, pp. 605–622 (2022)
Control-inspired temporal dynamics encoding for Duan, H., Xu, M., Shuai, B., Modolo, D., Tu, Z.,
skeleton-based action recognition. In: CVPR, pp. Tighe, J., Bergamo, A.: Skeletr: Towards skeleton-
10597–10607 (2023) based action recognition in the wild. In: ICCV, pp.
Zhou, H., Liu, Q., Wang, Y.: Learning discrimi- 13634–13644 (2023)
native representations for skeleton based action Kim, B., Chang, H.J., Kim, J., Choi, J.Y.:
recognition. In: CVPR, pp. 10608–10617 (2023) Global-local motion transformer for unsupervised
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weis- skeleton-based action learning. In: ECCV, pp. 209–
senborn, D., Zhai, X., Unterthiner, T., Dehghani, 225 (2022)
M., Minderer, M., Heigold, G., Gelly, S., et al.: Dong, J., Sun, S., Liu, Z., Chen, S., Liu, B., Wang,
An image is worth 16x16 words: Transformers X.: Hierarchical contrast for unsupervised skeleton-
for image recognition at scale. arXiv preprint based action representation learning. In: AAAI, pp.
arXiv:2010.11929 (2020) 525–533 (2023)
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Decoupled Shah, A., Roy, A., Shah, K., Mishra, S., Jacobs, D.,
spatial-temporal attention network for skeleton- Cherian, A., Chellappa, R.: Halp: Hallucinating
based action-gesture recognition. In: ACCV (2020) latent positives for skeleton-based self-supervised
Wang, Q., Peng, J., Shi, S., Liu, T., He, J., Weng, R.: learning of actions. In: CVPR, pp. 18846–18856
Iip-transformer: Intra-inter-part transformer for (2023)
skeleton-based action recognition. arXiv preprint Cheng, Y.-B., Chen, X., Zhang, D., Lin, L.:
arXiv:2110.13385 (2021) Motion-transformer: Self-supervised pre-training
Ijaz, M., Diaz, R., Chen, C.: Multimodal transformer for skeleton-based action recognition. In: ACM
for nursing activity recognition. In: CVPR, pp. MM, pp. 1–6 (2021)
2065–2074 (2022) Wu, W., Hua, Y., Zheng, C., Wu, S., Chen, C., Lu,
Zhang, Y., Wu, B., Li, W., Duan, L., Gan, C.: A.: Skeletonmae: Spatial-temporal masked autoen-
Stst: Spatial-temporal specialized transformer for coders for self-supervised skeleton action recogni-
skeleton-based action recognition. In: ACMMM, tion. In: ICMEW, pp. 224–229 (2023)
pp. 3229–3237 (2021) Hua, Y., Wu, W., Zheng, C., Lu, A., Liu, M., Chen,
Shi, F., Lee, C., Qiu, L., Zhao, Y., Shen, T., Muralidhar, S., Han, T., Zhu, S.-C., Narayanan, V.: Star: Sparse transformer-based action recognition. arXiv preprint arXiv:2107.07089 (2021)
Gedamu, K., Ji, Y., Gao, L., Yang, Y., Shen, H.T.: Relation-mining self-attention network for skeleton-based human action recognition. PR 139,
C., Wu, S.: Part aware contrastive learning for self-supervised action recognition. In: IJCAI, pp. 855–863 (2023)
Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: BMVC, p. 5 (2010)
Johnson, S., Everingham, M.: Learning effective human pose estimation from inaccurate annotation. In: CVPR, pp. 1465–1472 (2011)
Sapp, B., Taskar, B.: Modec: Multimodal decomposable models for human pose estimation. In: CVPR, pp. 3674–3681 (2013)
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: CVPR, pp. 3686–3693 (2014)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV, pp. 740–755 (2014)
Gong, K., Liang, X., Zhang, D., Shen, X., Lin, L.: Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In: CVPR, pp. 932–940 (2017)
Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.-S., Lu, C.: Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In: CVPR, pp. 10863–10872 (2019)
Zhang, W., Zhu, M., Derpanis, K.G.: From actemes to action: A strongly-supervised representation for detailed action understanding. In: ICCV, pp. 2248–2255 (2013)
Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J., Schiele, B.: Posetrack: A benchmark for human pose estimation and tracking. In: CVPR, pp. 5167–5176 (2018)
Doering, A., Chen, D., Zhang, S., Schiele, B., Gall, J.: Posetrack21: A dataset for person search, multi-object tracking and multi-person pose tracking. In: CVPR, pp. 20963–20972 (2022)
Sigal, L., Balan, A.O., Black, M.J.: Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV 87(1-2), 4–27 (2010)
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI 36(7), 1325–1339 (2013)
Joo, H., Simon, T., Li, X., Liu, H., Tan, L., Gui, L., Banerjee, S., Godisart, T., Nabbe, B., Matthews, I., et al.: Panoptic studio: A massively multiview system for social interaction capture. TPAMI 41(1), 190–204 (2017)
Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using imus and a moving camera. In: ECCV, pp. 601–617 (2018)
Sapp, B., Weiss, D., Taskar, B.: Parsing human motion with stretchable models. In: CVPR, pp. 1281–1288 (2011)
Berclaz, J., Fleuret, F., Turetken, E., Fua, P.: Multiple object tracking using k-shortest paths optimization. TPAMI 33(9), 1806–1819 (2011)
Ramakrishna, V., Kanade, T., Sheikh, Y.: Tracking human pose by tracking symmetric parts. In: CVPR, pp. 3728–3735 (2013)
Weiss, D., Sapp, B., Taskar, B.: Sidestepping intractable inference with structured ensemble cascades. NIPS 23 (2010)
Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., Ilic, S.: 3d pictorial structures for multiple human pose estimation. In: CVPR, pp. 1669–1676 (2014)
Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., Weber, A.: Mocap database hdm05. Institut für Informatik II, Universität Bonn 2(7) (2007)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: CVPRW, pp. 9–14 (2010)
Fothergill, S., Mentis, H., Kohli, P., Nowozin, S.: Instructing people for training gestural interactive systems. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1737–1746 (2012)
Bloom, V., Makris, D., Argyriou, V.: G3d: A gaming action dataset and real time action recognition evaluation framework. In: CVPRW, pp. 7–12 (2012)
Yun, K., Honorio, J., Chattopadhyay, D., Berg, T.L., Samaras, D.: Two-person interaction detection using body-pose features and multiple instance learning. In: CVPRW, pp. 28–35 (2012)
Xia, L., Chen, C.-C., Aggarwal, J.: View invariant human action recognition using histograms of 3D joints. In: CVPRW, pp. 20–27 (2012)
Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.-C.: Cross-view action modeling, learning and recognition. In: CVPR, pp. 2649–2656 (2014)
Chen, C., Jafari, R., Kehtarnavaz, N.: Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: ICIP, pp. 168–172 (2015)
Hu, J.-F., Zheng, W.-S., Lai, J., Zhang, J.: Jointly learning heterogeneous features for RGB-D activity recognition. In: CVPR (2015)
Chunhui, L., Yueyu, H., Yanghao, L., Sijie, S., Jiaying, L.: Pku-mmd: A large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475 (2017)
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. TPAMI 42(10), 2684–2701 (2019)
Li, B., Dai, Y., Cheng, X., Chen, H., Lin, Y., He, M.: Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn. In: ICMEW, pp. 601–604 (2017)
Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J., Lu, H.: Decoupling gcn with dropgraph module for skeleton-based action recognition. In: ECCV, pp. 1–18 (2020)
Jiang, Y., Sun, Z., Yu, S., Wang, S., Song, Y.: A graph skeleton transformer network for action recognition. Symmetry 14(8), 1547 (2022)
Dong, J., Jiang, W., Huang, Q., Bao, H., Zhou, X.: Fast and robust multi-person 3d pose estimation from multiple views. In: CVPR, pp. 7792–7801 (2019)
Tu, H., Wang, C., Zeng, W.: Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In: ECCV, pp. 197–212 (2020)
Zhang, J., Cai, Y., Yan, S., Feng, J., et al.: Direct multi-view multi-person 3d pose estimation. NIPS 34, 13153–13164 (2021)
Shah, S., Jain, N., Sharma, A., Jain, A.: On the robustness of human pose estimation. In: CVPRW (2019)
Zhang, Z., Wang, C., Qin, W., Zeng, W.: Fusing wearable imus with multi-view images for human pose estimation: A geometric approach. In: CVPR, pp. 2200–2209 (2020)
Wang, C., Zhang, F., Zhu, X., Ge, S.S.: Low-resolution human pose estimation. PR 126, 108579 (2022)
Wang, Z., Luo, H., Wang, P., Ding, F., Wang, F., Li, H.: Vtc-lfc: Vision transformer compression with low-frequency components. NIPS 35, 13974–13988 (2022)
Jiang, W., Jin, S., Liu, W., Qian, C., Luo, P., Liu, S.: Posetrans: A simple yet effective pose transformation augmentation for human pose estimation. In: ECCV, pp. 643–659 (2022)
Zhang, J., Gong, K., Wang, X., Feng, J.: Learning to augment poses for 3d human pose estimation in images and videos. TPAMI 45(8), 10012–10026 (2023)
Jiang, Z., Zhou, Z., Li, L., Chai, W., Yang, C.-Y., Hwang, J.-N.: Back to optimization: Diffusion-based zero-shot 3d human pose estimation. arXiv preprint arXiv:2307.03833 (2023)
Gong, J., Foo, L.G., Fan, Z., Ke, Q., Rahmani, H., Liu, J.: Diffpose: Toward more reliable 3d pose estimation. In: CVPR (2023)
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: ECCV, pp. 1–21 (2022)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Khirodkar, R., Bansal, A., Ma, L., Newcombe, R., Vo, M., Kitani, K.: Egohumans: An egocentric 3d multi-human benchmark. arXiv preprint arXiv:2305.16487 (2023)
Ulhaq, A., Akhtar, N., Pogrebna, G., Mian, A.: Vision transformers for action recognition: A survey. arXiv preprint arXiv:2209.05700 (2022)
Qing, Z., Zhang, S., Huang, Z., Wang, X., Wang, Y., Lv, Y., Gao, C., Sang, N.: Mar: Masked autoencoders for efficient action recognition. IEEE Transactions on Multimedia (2023)
Kang, M.-S., Kang, D., Kim, H.: Efficient skeleton-based action recognition via joint-mapping strategies. In: WACV, pp. 3403–3412 (2023)
Gupta, P., Sharma, D., Sarvadevabhatla, R.K.: Syntactically guided generative embeddings for zero-shot skeleton action recognition. In: ICIP, pp. 439–443 (2021)
Zhou, Y., Qiang, W., Rao, A., Lin, N., Su, B., Wang, J.: Zero-shot skeleton-based action recognition via mutual information estimation and maximization. arXiv preprint arXiv:2308.03950 (2023)