
© 2021 IJRAR October 2021, Volume 8, Issue 4 [Link] (E-ISSN 2348-1269, P-ISSN 2349-5138)

Deep Learning based Human Pose Estimation using OpenCV

Goli Shiva Chaithanya, Computer Science and Engineering, Koneru Laxmaiah University
Pulipati Krishna Chaithanya, Mechanical Engineering, Sardar Vallabhbhai National Institute of Technology
Jupalle Hruthika, Electronics Engineering, Sardar Vallabhbhai National Institute of Technology

Abstract:
In vision-based human activity analysis, human pose estimation is an important area of study. Its goal is to estimate the positions of the human articulation joints in 2D/3D space from images or videos. Because of the complexity of real-world settings and the wide range of human poses, vision-based human pose estimation is a difficult task. The rapid advancement of deep learning, which simulates the processing and reasoning capacities of the human brain and its visual system, has recently attracted a lot of attention, so it is important to continue applying deep learning techniques to image-based human pose estimation. We present a video-based 2D pose estimation approach that incorporates a multi-scale TCE module into an encoder-decoder network design to explicitly explore temporal consistency in videos. At the feature level, the TCE module uses a learnable offset field to capture the geometric transitions between neighbouring frames. We further investigate multi-scale geometric changes at the feature level by incorporating a spatial pyramid into the TCE module, which yields further performance gains.

Keywords: Human Pose Estimation, Deep Learning, Computer Vision

I. INTRODUCTION

Human pose estimation (HPE), which has received a lot of attention in the computer vision field, entails predicting the configuration of human body parts from sensor input data such as images and videos. HPE provides geometry and motion information about the human body, which has been used in a variety of applications (for example, human-computer interaction, motion analysis, augmented reality (AR), virtual reality (VR), and healthcare). Thanks to the rapid growth of deep learning in recent years, deep learning solutions have been demonstrated to outperform traditional computer vision approaches in a variety of tasks, including image classification. However, obstacles such as occlusion, insufficient training data, and depth ambiguity still need to be solved. 2D HPE from images and videos with 2D pose annotations is comparatively straightforward, and good performance has been achieved for single-person pose estimation using deep learning algorithms. In controlled laboratory settings, motion capture devices can acquire 3D pose annotations; nevertheless, they have limitations in real-world situations. The fundamental issue in 3D HPE from monocular RGB images and videos is depth ambiguity, while the key issue in multi-view setups is viewpoint association. Some studies have used sensors such as depth sensors, inertial measurement units (IMUs), and radio-frequency devices, although these methods are usually expensive and require
specialised gear. Human pose estimation is also utilised in video surveillance, human-computer interaction, sports analysis, virtual reality, animation production, and other domains. In interactive gaming, for example, human pose estimation can be used to track a player's movement: Microsoft's Kinect popularised the use of 3D pose estimation to track the motion of the human player and render the activity of a virtual character. In sports analysis, human pose estimation can reconstruct an athlete's motion from daily training recordings. CGI applications can also benefit: once a person's pose is computed, graphics, styles, fancy effects, equipment, and artwork can be superimposed on them, and the rendered images can naturally follow the person as he or she moves by tracking the variations of the pose.

Human pose estimation falls into two categories: 2D human pose estimation and 3D human pose estimation. Both are challenging tasks due to the complexity of the real world and the diversity of human poses. Deep learning has been extensively applied to human pose estimation in recent years, thanks to the rapid growth of Convolutional Neural Networks (CNNs). Although deep learning-based algorithms have made great progress, they still face several obstacles. The majority of available approaches for 2D human pose estimation focus on designing novel network topologies for image-based 2D pose estimation. Although these methods can be applied directly to video data, they frequently produce poor results because image-based methods cannot take advantage of the rich temporal information contained in video data. A typical neural network model for 3D human pose estimation requires a substantial amount of training data, yet annotating 3D human joint locations is a time-consuming operation. Furthermore, there are well-established geometric principles governing the projection of 3D skeletons onto 2D images; using a neural network to approximate this projection alone could cause the network to overfit the training data.
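
As a concrete illustration of the projection geometry just mentioned, the following is a minimal sketch of the pinhole camera model that maps camera-space 3D joints to 2D pixels; the intrinsic values are illustrative assumptions, not parameters taken from this paper.

```python
# A minimal sketch of the pinhole projection that maps a 3D skeleton to
# 2D image coordinates: x = f_x * X/Z + c_x, y = f_y * Y/Z + c_y.
# The intrinsics below are illustrative values, not from this paper.
import numpy as np

K = np.array([[1145.0,    0.0, 512.0],    # f_x, skew, c_x
              [   0.0, 1145.0, 512.0],    # f_y, c_y
              [   0.0,    0.0,   1.0]])

def project(joints_3d, K):
    """joints_3d: (J, 3) camera-space joints in mm -> (J, 2) pixels."""
    uvw = joints_3d @ K.T           # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:]  # divide by depth

# Example: one joint 4 m in front of the camera, 100 mm to the right.
print(project(np.array([[100.0, 0.0, 4000.0]]), K))  # ~[[540.6, 512.0]]
```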

II. BACKGROUND INFORMATION

This section reviews key related work. The goal of human pose estimation, sometimes called human keypoint estimation, is to locate the anatomical keypoints of the human body. It is a crucial task in the field of computer vision, with substantial theoretical implications and widespread applications in disciplines like human activity recognition, sports analysis, and human-computer interaction. Many world-class research teams and institutes have recently committed significant resources to studying this problem.


Fig. 1 A hierarchy of the literature survey


The Robotics Institute at Carnegie Mellon University, for example, has developed Panoptic Studio, a large-scale multi-view human motion capture system. Common Objects in Context (COCO) is a large-scale 2D human pose dataset created by Microsoft Research that collects photos of complicated everyday scenes incorporating common objects in their natural context. Simultaneously, they held competitions and workshops that aided the development of 2D pose estimation technology. The MPII and MPI-INF-3DHP datasets, which are extensively used 2D and 3D human pose datasets, were proposed by the Max Planck Institute for Informatics. In top computer vision conferences and journals, such as CVPR (IEEE Conference on Computer Vision and Pattern Recognition), ICCV (International Conference on Computer Vision), ECCV (European Conference on Computer Vision), TPAMI (IEEE Transactions on Pattern Analysis and Machine Intelligence), TIP (IEEE Transactions on Image Processing), and IJCV (International Journal of Computer Vision), human pose
estimation has already become one of the hottest topics. Human pose estimation is separated into two types: 2D pose estimation and 3D pose estimation, which estimate the positions of human joints in two-dimensional and three-dimensional space, respectively. The majority of traditional image-based 2D pose estimation algorithms are bottom-up and part-based. These methods treat the human pose as a collection of human body parts and use a deformable model to characterise the spatial connections between body components. The pictorial structure model for visual object representation was proposed by Fischler et al. [25] in 1973. Following that, Felzenszwalb et al. developed efficient matching algorithms for pictorial structures.

The poselet prior was utilised by Pishchulin et al. to improve the pictorial structure model. These methods rely on handcrafted features, such as the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT), to detect human body components, and then use a dynamic programming algorithm to find the best human pose configuration. However, for photos of complicated everyday scenes with truncated or badly occluded human joints, these techniques lack generalisation ability. After seeing the success of deep learning in object classification and detection, researchers attempted to apply Convolutional Neural Networks (CNNs) to human pose estimation. Meanwhile, large-scale human pose datasets such as FLIC [73], MPII [1], and Microsoft COCO [53] became available, allowing deep networks to be trained. Because stacked convolution and pooling layers allow CNNs to learn high-level visual features, these approaches can directly predict human joint locations from input images. Recent research aims to improve video-based pose estimation performance by incorporating temporal information into sophisticated deep models. The most frequent methods [67, 11, 78, 97] use optical flow to explore temporal context.
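
To make the heatmap-based prediction pipeline concrete (and the OpenCV angle of this paper's title), below is a hedged sketch that runs a pre-trained pose network through OpenCV's DNN module and reads each joint off as the peak of its heatmap. The model files, input size, and joint count are assumptions; any OpenPose-style Caffe model with per-joint heatmap outputs would fit this shape.

```python
import cv2
import numpy as np

PROTO = "pose_deploy.prototxt"            # assumed network definition
WEIGHTS = "pose_iter_440000.caffemodel"   # assumed pre-trained weights
N_JOINTS = 18                             # depends on the chosen model

net = cv2.dnn.readNetFromCaffe(PROTO, WEIGHTS)
frame = cv2.imread("person.jpg")
h, w = frame.shape[:2]

# Resize and normalize the image into the 4D blob the network expects.
blob = cv2.dnn.blobFromImage(frame, scalefactor=1.0 / 255, size=(368, 368),
                             mean=(0, 0, 0), swapRB=False, crop=False)
net.setInput(blob)
heatmaps = net.forward()                  # shape: (1, channels, H', W')

joints = []
for j in range(N_JOINTS):
    hm = heatmaps[0, j]
    _, conf, _, (x, y) = cv2.minMaxLoc(hm)  # location of the heatmap peak
    # Map the peak back into original-image coordinates.
    joints.append((int(x * w / hm.shape[1]), int(y * h / hm.shape[0]), conf))

for x, y, conf in joints:
    if conf > 0.1:                        # keep only confident joints
        cv2.circle(frame, (x, y), 4, (0, 255, 0), -1)
cv2.imwrite("pose_out.jpg", frame)
```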

Because optical flow describes the distribution of apparent motion velocities, it can help refine the predicted heatmaps by capturing geometric transitions between frames. Song et al. [78], for example, employed optical flow to take advantage of visual evidence from neighbouring frames. To improve the performance of video pose estimation, Pfister et al. [67] used optical flow to align output heatmaps from surrounding frames. Other approaches [28, 54] use Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), to capture temporal dependencies. RNNs have become dominant tools for sequence tasks due to their power in long-range temporal representation. Luo et al., for example, proposed a recurrent model with LSTM to incorporate temporal information for video pose estimation. Gkioxari et al. proposed a CNN-based chained model in which the pose prediction is based on both the input and the output of the preceding frame. There are also approaches that learn representations of video clips using 3D convolutions: Girdhar et al. [27] inflated Mask R-CNN's 2D convolutions to 3D, allowing it to use temporal information from video clips to generate more accurate pose predictions in videos.
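
As a sketch of the flow-based alignment idea attributed to Pfister et al. above (not their actual code), the snippet below warps a neighbouring frame's joint heatmap into the current frame using dense Farneback optical flow, after which the two heatmaps can simply be averaged.

```python
import cv2
import numpy as np

def warp_heatmap(heatmap_prev, gray_prev, gray_cur):
    """Warp a previous-frame joint heatmap (float32, HxW) into the
    current frame's coordinates using dense optical flow."""
    # Backward flow: for each current-frame pixel, where it came from
    # in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(gray_cur, gray_prev, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = heatmap_prev.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(heatmap_prev, map_x, map_y, cv2.INTER_LINEAR)

# Usage: fuse the warped neighbouring heatmap with the current prediction.
# fused = 0.5 * heatmap_cur + 0.5 * warp_heatmap(heatmap_prev, g0, g1)
```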


Fig. 2 Types of human body models

III. MONOCULAR 3D POSE ESTIMATION

The huge potential of 3D human pose estimation in diverse applications such as human-computer interaction, virtual reality, and action recognition has attracted considerable interest. Following the success of deep learning, many researchers [55, 65] have used neural networks to predict 3D human poses from monocular images. There are two key issues in employing neural networks to estimate 3D poses. First, a conventional neural network model requires a substantial amount of training data, and 3D pose annotations are collected with a marker-based motion capture (MoCap) system, which is a time-consuming procedure. In this research, we propose a novel self-supervised strategy for training a 3D pose estimation model that takes advantage of geometric priors. We formulate 3D pose estimation as a combination of 2D keypoint estimation and 2D-to-3D pose lifting. Our work focuses on training the 2D-to-3D lifting network without using any additional 3D ground-truth data, and the first stage is compatible with any state-of-the-art 2D keypoint detector. In particular, we construct the transform re-projection loss to tackle the depth ambiguity problem.
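
The precise form of the transform re-projection loss is not spelled out in the text, so the following PyTorch sketch shows only one plausible instantiation under stated assumptions: lift 2D keypoints to 3D, apply a random rigid rotation, re-project orthographically to a synthetic view, lift again, and penalise disagreement between the two 3D estimates. The network shape and all names are illustrative.

```python
import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Tiny MLP lifting J 2D keypoints to J 3D joints (illustrative)."""
    def __init__(self, n_joints=17, hidden=1024):
        super().__init__()
        self.n_joints = n_joints
        self.net = nn.Sequential(
            nn.Linear(n_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3))

    def forward(self, kp2d):                       # kp2d: (B, J, 2)
        return self.net(kp2d.flatten(1)).view(-1, self.n_joints, 3)

def random_y_rotation(batch, device):
    """Random rotations about the vertical axis, shape (B, 3, 3)."""
    theta = torch.rand(batch, device=device) * 2 * torch.pi
    c, s = torch.cos(theta), torch.sin(theta)
    R = torch.zeros(batch, 3, 3, device=device)
    R[:, 0, 0], R[:, 0, 2] = c, s
    R[:, 1, 1] = 1.0
    R[:, 2, 0], R[:, 2, 2] = -s, c
    return R

def reprojection_loss(lifter, kp2d):
    pose3d = lifter(kp2d)                              # lift to 3D
    R = random_y_rotation(kp2d.shape[0], kp2d.device)
    rotated = torch.einsum('bij,bkj->bki', R, pose3d)  # synthetic new view
    kp2d_new = rotated[..., :2]                        # orthographic projection
    pose3d_cycle = lifter(kp2d_new)                    # lift the new view
    # Cycle consistency: the re-lifted pose should match the rotated pose.
    return nn.functional.mse_loss(pose3d_cycle, rotated)

# Usage with dummy 2D keypoints:
loss = reprojection_loss(Lifter(), torch.rand(8, 17, 2))
```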

IV. RESULTS AND ANALYSIS

The training procedure has two stages so that the proposed two-branch network converges without explicit 3D pose supervision. First, we pre-train the network with the L_pre-train loss, training for 20 epochs with a learning rate of 0.001 using Adam as the optimizer. The network is then trained for 300 epochs using the L_T loss; the learning rate starts at 0.01 and is multiplied by 0.1 every 100 epochs. During evaluation, to maintain compatibility with other works, we use only the 2D-to-3D lifting branch to predict relative 3D poses in camera space, rather than the root position branch. Our technique is implemented in PyTorch, a deep learning toolbox. To assess the effectiveness of the proposed transform re-projection loss, we compare it against an existing popular technique, the adversarial loss. On the H36M dataset, we create several variants and compare the outcomes under Protocol #1 (MPJPE) and Protocol #2 (P-MPJPE). All variants take as input the 2D poses extracted by the CPN network. The quantitative results are presented in the table, and the figure depicts the outcomes of several variants on hard samples, such as those with severe self-occlusion or far from the camera.
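
For reference, the two-stage schedule above maps directly onto standard PyTorch optimizer and scheduler APIs. The sketch below uses dummy data and a placeholder loss, since only the optimizer, learning rates, epoch counts, and step decay are given in the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Placeholder lifting model and dummy batches: 17 joints, 2D in / 3D out.
model = nn.Sequential(nn.Linear(34, 1024), nn.ReLU(), nn.Linear(1024, 51))
batches = [(torch.randn(32, 34), torch.randn(32, 51)) for _ in range(4)]

def loss_fn(pred, target):
    # Placeholder standing in for the paper's L_pre-train / L_T terms.
    return nn.functional.mse_loss(pred, target)

# Stage 1: pre-train for 20 epochs with Adam at lr = 0.001.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(20):
    for x, t in batches:
        opt.zero_grad()
        loss_fn(model(x), t).backward()
        opt.step()

# Stage 2: 300 epochs; lr starts at 0.01 and is multiplied by 0.1
# every 100 epochs, matching the step decay described in the text.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.1)
for epoch in range(300):
    for x, t in batches:
        opt.zero_grad()
        loss_fn(model(x), t).backward()
        opt.step()
    sched.step()  # one decay step per epoch
```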

 Analysis of Network Pre-training:

This part details the pre-training on the dataset, which is very relevant to our analysis.

Fig. 3 Training dataset of 3DHP

 Analysis of the consistent factorization framework

In this section, we examine how effective the proposed consistent factorization loss is. All variants are trained on the Human3.6M training set, and Table 4.1 shows their per-action P-MPJPE on the Human3.6M test set. Clearly, our strategy performs best among all the alternatives. The pre-trained hierarchical dictionary helps obtain superior outcomes compared to the baseline. However, if only the hierarchical dictionary is used, the benefit is minimal.
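
The consistent factorization loss itself is defined with multi-view information (see the conclusion); as a toy illustration only, a consistency-style constraint can be as simple as penalising disagreement between the canonical 3D poses recovered from two views of the same instant. All names here are illustrative.

```python
import torch
import torch.nn.functional as F

def factorization_consistency_loss(canon_a, canon_b):
    """canon_a, canon_b: (B, J, 3) canonical poses predicted from two
    different camera views of the same moment; they should coincide if
    shape and viewpoint are truly disentangled."""
    return F.mse_loss(canon_a, canon_b)

# Usage with dummy predictions from two views:
loss = factorization_consistency_loss(torch.rand(8, 17, 3),
                                      torch.rand(8, 17, 3))
```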


Fig. 4 Comparison with baseline

 Analysis of the Hierarchical Dictionary

The efficacy of the hierarchical dictionary is examined here, with comparisons against modern dictionary-based approaches on Human3.6M. AIGN [89] uses PCA to learn a 3D pose dictionary and adds an adversarial loss as a constraint. C3DPO [60] employs a single-level dictionary that is learned jointly with a 3D pose estimation network. Distill [91] is a weakly-supervised technique that learns a 3D pose estimation network from a dictionary obtained via NRSfM. Our strategy, as indicated in Table 4.2, produces the best results of all. For additional comparison, we build a variant (Ours-SD) that replaces the hierarchical dictionary with a single-level dictionary, comparable to C3DPO. With the consistent factorization constraint, Ours-SD outperforms C3DPO, achieving 85.8 vs. 95.6 mm MPJPE. Furthermore, the hierarchical dictionary helps achieve better outcomes than the single-level dictionary, with MPJPE and P-MPJPE decreasing by 3.9 and 5.2 mm, respectively.
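
For clarity, the two metrics quoted above can be computed as follows; this is the standard definition of MPJPE and Procrustes-aligned P-MPJPE, sketched independently rather than taken from the paper's code.

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (J, 3) arrays in mm; mean per-joint Euclidean error."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def p_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    X, Y = pred - mu_p, gt - mu_g
    # Optimal rotation and scale via SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    if np.linalg.det(R) < 0:   # avoid reflections
        U[:, -1] *= -1
        S[-1] *= -1
        R = U @ Vt
    scale = S.sum() / (X ** 2).sum()
    aligned = scale * X @ R + mu_g
    return mpjpe(aligned, gt)

# Usage with dummy poses (millimetres):
pred = np.random.randn(17, 3) * 50
gt = np.random.randn(17, 3) * 50
print(mpjpe(pred, gt), p_mpjpe(pred, gt))
```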


Fig. 5 Visualization analysis of results


 Comparison of results with other researchers
The following table gives a comparative description of this investigation against other methods.

Table 1: Comparison of results


 Analysis of Generalization Ability

To test the proposed model's generalisation ability, we trained it on the Human3.6M dataset and tested it on the MPI-INF-3DHP dataset, which contains complex outdoor scenes. The figure illustrates some visualisation results demonstrating that our technique can successfully recover 3D poses on datasets it has never been trained on. Furthermore, in this setting, our approach achieves 70.6 percent PCK3D and 36.6 percent AUC.
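
The PCK3D and AUC figures follow the usual MPI-INF-3DHP protocol; the sketch below assumes the conventional 150 mm PCK3D threshold and a 0-150 mm sweep for the AUC, which are assumptions about the protocol rather than details stated in the text.

```python
import numpy as np

def pck3d(errors_mm, threshold=150.0):
    """errors_mm: array of per-joint 3D errors in millimetres."""
    return (errors_mm < threshold).mean()

def auc(errors_mm, thresholds=np.linspace(0.0, 150.0, 31)):
    """Area under the PCK3D curve, averaged over a threshold sweep."""
    return float(np.mean([pck3d(errors_mm, t) for t in thresholds]))

# Usage with dummy errors:
print(pck3d(np.random.rand(17) * 300), auc(np.random.rand(17) * 300))
```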

CONCLUSION AND FUTURE SCOPE

Conclusion

In the field of computer vision, human pose estimation is an active area of research. This work investigated deep learning-based 2D and 3D human pose estimation and proposed a number of models, ranging from video-based 2D pose estimation to self-supervised 3D pose estimation. The key contributions are summarized below:

 To explicitly explore temporal consistency in videos, we propose the multi-scale TCE module and embed it into the encoder-decoder network architecture. At the feature level, the TCE module uses a learnable offset field to capture the geometric transition between neighbouring frames. Unlike existing model-based techniques, it can explicitly represent temporal consistency information in an end-to-end network.

 The module is more computationally efficient than existing post-enhancement approaches since it does not involve additional optical flow computations. We further investigate multi-scale geometric changes at the feature level by incorporating the spatial pyramid into the TCE module, which yields further performance gains.

 A root position regression branch is also introduced to recover the global 3D poses during training. In this fashion, the network can preserve the scale information of re-projected 2D poses, which improves the accuracy of the predicted 3D poses. Furthermore, this method uses only geometric information during training, resulting in improved generalisation ability.

 To solve the projection ambiguity problem, we propose the consistent factorization network, which fully disentangles the 3D human shape and the camera viewpoint. To this end, we design a simple and effective loss function that constrains the canonical 3D human pose using multi-view information. Furthermore, we characterise a 3D human pose as a combination of entries in a dictionary of 3D pose bases, and use geometric information from 3D human poses to learn a hierarchical dictionary from 2D human poses by solving the NRSfM problem.

 Compared to the single-level dictionary, the hierarchical dictionary can be learned without 3D human pose annotations and has greater expressive ability.

Future Scope of Work

 To improve the performance of 2D pose estimation in multi-person videos, we will aim to create a unified framework integrating the multi-scale TCE module with multi-person tracking techniques.

 For 3D pose estimation, we will investigate depth maps and point cloud data in future work. The cost of obtaining depth maps and point clouds will decrease as depth cameras and radar sensors become more widely available on mobile devices. Absolute depth information can be obtained from the depth map and point cloud, effectively resolving the projection ambiguity problem.

References

 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012.
 J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR,
2015.
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE TPAMI, 2016.
 T. B. Moeslund and E. Granum, “A survey of computer vision-based human motion capture,” CVIU, 2001.
 T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion capture and analysis,” CVIU, 2006.
 R. Poppe, “Vision-based human motion analysis: An overview,” CVIU, 2007.

 X. Ji and H. Liu, “Advances in view-invariant human motion analysis: a review,” IEEE TSMC, 2009.
 M. B. Holte, C. Tran, M. M. Trivedi, and T. B. Moeslund, “Human pose estimation and activity recognition
from multi-view videos: Comparative explorations of recent developments,” IEEE Journal of Selected Topics
in Signal Processing, 2012.
 Z. Liu, J. Zhu, J. Bu, and C. Chen, “A survey of human pose estimation: the body parts parsing based methods,” JVCIR, 2015.
 W. Gong, X. Zhang, J. Gonzàlez, A. Sobral, T. Bouwmans, C. Tu, and E.-h. Zahzah, “Human pose estimation from monocular images: A comprehensive survey,” Sensors, 2016.

 N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris, “3d human pose estimation: A review of the
literature and analysis of covariates,” CVIU, 2016.
 Y. Chen, Y. Tian, and M. He, “Monocular human pose estimation: A survey of deep learning-based
methods,” CVIU, 2020.
 T. L. Munea, Y. Z. Jembre, H. T. Weldegebriel, L. Chen, C. Huang, and C. Yang, “The progress of human
pose estimation: A survey and taxonomy of models applied in 2d human pose estimation,” IEEE Access,
2020.

 E. Marinoiu, D. Papava, and C. Sminchisescu, “Pictorial human spaces: How well do humans perceive a 3d
articulated pose?” in ICCV, 2013.
 S. Zuffi, O. Freifeld, and M. J. Black, “From pictorial structures to deformable structures,” in CVPR, 2012.
 S. Johnson and M. Everingham, “Clustered pose and nonlinear appearance models for human pose
estimation.” in BMVC, 2010.
 Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity
fields,” in CVPR, 2017.
 X. Chen and A. L. Yuille, “Parsing occluded people by flexible compositions,” in CVPR, 2015.
 D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt, “Vnect: Real-time 3d human pose estimation with a single rgb camera,” ACM TOG, 2017.
 S. X. Ju, M. J. Black, and Y. Yacoob, “Cardboard people: A parameterized model of articulated image motion,” in FG, 1996.
 M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and
state of the art analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2014, pp. 3686–3693.
 M. Andriluka, S. Roth, and B. Schiele, “Pictorial structures revisited: People detection and articulated pose
estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp.
1014–1021.
 A. Arnab, C. Doersch, and A. Zisserman, “Exploiting temporal context for 3d human pose estimation in the
wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3395–
3404.
 S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation
methodology for optical flow,” International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.
 V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic, “3d pictorial structures for multiple human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1669–1676.
 V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” in Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, 2017, pp. 468–475.
 E. Brau and H. Jiang, “3d human pose estimation via deep learning from 2d annotations,” in Proceedings
of the International Conference on 3D Vision, 2016, pp. 582–591.
 T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion
estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 500–513,
2010.
 J. Charles, T. Pfister, D. Magee, D. Hogg, and A. Zisserman, “Personalizing human video pose estimation,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3063–3072.
 C.-H. Chen and D. Ramanan, “3d human pose estimation = 2d pose estimation + matching,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7035–7043.
 L. Chen, H. Ai, R. Chen, Z. Zhuang, and S. Liu, “Cross-view tracking for multi-human 3d pose estimation at
over 100 fps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
 X. Chen, K.-Y. Lin, W. Liu, C. Qian, and L. Lin, “Weakly-supervised discovery of geometry-aware representation for 3d human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10895–10904.
 Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose
estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.
7103–7112.
 A. Cherian, J. Mairal, K. Alahari, and C. Schmid, “Mixing body-part sequences for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2361–2368.

