See discussions, stats, and author profiles for this publication at: [Link]

Action Classification Based on 2D Coordinates Obtained by Real-time Pose Estimation

Conference Paper · February 2019
Citations: 3 · Reads: 3,834
4 authors, including: Muthu Subash Kavitha (Nagasaki University) and Takio Kurita (Hiroshima University)

All content following this page was uploaded by Muthu Subash Kavitha on 01 May 2019.


Action Classification Based on 2D Coordinates Obtained by Real-time Pose Estimation

Siyi Shuai, Muthusubash Kavitha, Junichi Miyao, and Takio Kurita
Department of Information Engineering, Hiroshima University
1-4-1 Kagamiyama, Higashi-Hiroshima, 739-8527, Japan
{m174409, kavitha, miyao, tkurita}@[Link]

Abstract—Human action classification is a significant problem in the field of computer vision. To retrieve essential information from a large number of videos, understanding their content is very important. In this study, we propose an approach that classifies human actions based on the coordinate information of body parts. The key coordinate points extracted from each frame by a real-time pose estimation algorithm are accumulated into a matrix. These accumulated coordinates are then fed to a convolutional neural network (CNN) to classify human actions, which is the main contribution of this study. The approach is designed to ignore the background and to consider only the movement information of the joints of the extracted poses. The CNN consists of three convolutional layers, a pooling layer, and a linear layer, which extract the most relevant features for classifying human actions. We use two benchmark datasets to validate the performance of the proposed approach. On the KTH dataset, the proposed approach achieves very high accuracy (100%) over six different types of actions, which is higher than other competitive approaches.

Index Terms—action classification, pose estimation, 2D coordinates, CNN

I. INTRODUCTION

Human action recognition, classification, and understanding in videos has been a significant research domain in computer vision. Recently, large-scale online video resources (e.g., YouTube) and datasets (e.g., ActivityNet) have become easily available. In order to control this information explosion, it is necessary to recognize and analyze video content for various purposes, such as search and recommendation. Many applications involve video understanding, intelligent search and retrieval, surveillance, and human-computer interaction. However, human activity recognition and classification in videos is considered one of the most challenging visual tasks.

Fig. 1: Results of human actions obtained using the real-time pose estimation algorithm. Different frames show different postures and joint positions: (a) boxing action and joint movement, (b) pull-up action and joint movement.

Human activity, no matter how common, is done for some purpose. For example, in order to accomplish a physical action, a person interacts with and gives feedback to the environment using the head, hands, arms, legs, body, etc. [1]. Different joint positions express different actions; hence, the movements of several joints of the human body can be considered as a human activity. An image-based human action is an immobile human posture, whereas a video-based human action consists of a sequence of such postures, as shown in figure 1. The size of the person appearing in different video frames is not always the same; if the person appears smaller, the range of joint movement is also smaller, which makes it hard to learn across different person sizes. Nevertheless, the joint position information in each video frame is significant for action classification, and there has been important progress on the related tasks of real-time human pose estimation and CNN-based human activity classification.

For real-time human pose estimation, progress has been achieved by using deep
neural networks. There are two different approaches to analyzing the human pose: top-down approaches and bottom-up approaches. Each has its own characteristics. In our research, we utilize the real-time human pose estimation method of [2]. In this study, we use the 2D coordinates obtained by this real-time pose estimation approach [2] as the cues for action classification. In the proposed approach, the 2D coordinates of a person in each frame of the video clips are extracted and accumulated into a matrix. These accumulated coordinates are then fed to a neural network to classify human activities. The performance of the proposed approach is evaluated on two different benchmark datasets.

This paper is organized as follows. Related work is briefly reviewed in section II. Section III explains the proposed approach. Experimental results and conclusions are presented in sections IV and V, respectively.

II. RELATED WORKS

It is worthwhile to study the problem of recognition and understanding of human activities in computer vision. We investigate and summarize the historical evolution of approaches related to action recognition [3], [1] and human pose estimation [4]. The review mainly focuses on recent and the most relevant methods.

A. Action recognition methods

Traditional action recognition methods largely pay attention to global video representations and achieved good results on datasets such as KTH [5], HMDB51, and UCF101. These approaches focused on exploiting local appearance and motion information such as histograms of oriented gradients (HOG), histograms of optical flow (HOF), motion boundary histograms (MBH), or dense trajectories. Furthermore, in order to aggregate and encode this information and produce a global video-level representation, they used spatio-temporal pyramids with bag-of-words (BoW) or Fisher vector based encoding. Finally, human action classification was achieved with traditional SVMs.

Among neural networks, CNNs in particular have been shown to reach great performance in action recognition [1]. A. Karpathy et al. [6] explored multiple approaches for frame-level fusion and utilized local spatio-temporal information through the connectivity of the CNN in the time domain. K. Simonyan et al. [7] proposed a two-stream CNN approach comprising spatial and temporal networks for action recognition; a CNN trained on multi-frame dense optical flow achieved great results. By using a separate CNN stream to learn frame-level spatial information, the combined two-stream CNN model showed better performance than the traditional methods. However, the aforementioned methods require computing optical flow separately before optimizing the parameters. In order to solve this issue, a 3D CNN was proposed by S. Ji et al. [8] for action recognition. The 3D CNN model extracts features along the spatio-temporal dimensions, and thus motion information can be captured across multiple adjacent frames. Recently, D. Tran et al. [9] proposed a method that can be applied to various video understanding tasks involving objects, actions, and scenes.

Recently, some research has also approached the action recognition task from the perspective of human skeletons. V. Raviteja et al. [11] proposed a new skeletal description using rotations and translations in 3D space, so that 3D geometric relationships between various body parts can be modeled. D. Yong et al. [12] divided the human skeleton into five parts on the basis of human physical structure and proposed an end-to-end hierarchical RNN structure constructed from five sub-nets.

Additionally, Long Short-Term Memory (LSTM) models have been used to store, modify, and access the internal state of memory cells, which allows them to discover long-range temporal relationships. Hence LSTMs achieve state-of-the-art results in various applications, such as handwriting recognition, segmentation of events, emotion detection, and speech recognition [10]. A complete overview of LSTMs is beyond the scope of this study, and hence we do not introduce them in detail.

B. Human pose estimation methods

Human pose estimation mainly focuses on finding and localizing the key points of an individual and describing human skeleton information using "parts" of the body [13]. Traditional human pose estimation methods are generally based on the idea of template matching using a geometrical prior model; the key points describe how to use the template [14]. It expresses the whole human body structure, including the key points of the body, the limb structures, and the relationships between the different limb structures. P. F. Felzenszwalb et al. [15] proposed a classic approach based on a pictorial structure model of the spatial correlations between the parts of the body. However, these methods [16] achieve good results only when the limbs of the person are visible in the images; hence they are easily influenced by errors such as double counting.

Recently, there has been huge interest in models that employ deep neural networks for the task of articulated pose estimation. These can be divided into two research directions: top-down and bottom-up methods. Top-down approaches first perform person detection and then apply pose estimation. W. Shih-En et al. [17] proposed a multi-stage CNN architecture that provides additional information regarding the co-occurrence, interdependence, and context of body parts in each stage of the network; thus the network can implicitly learn image-dependent spatial relationships between the body parts. Y. Chen et al. [18] addressed the difficulty of detecting different key points by handling simple and difficult key points separately. Common top-down approaches are prone to problems such as wrongly detected person positions and repeatedly detected persons; these issues cause key point detection errors or generate different key points for the same person. F. Hao-shu et al. [19] proposed a method that solves these problems using bottom-up methods; it mainly focuses on key point detection and clustering, and full poses are assembled after all key points are detected.
X. Fangting et al. [20] proposed a method that divides the human body into different parts; key points located at specific positions of the segmented areas are used to model the relationships between the key points of the divided body parts. Instead of using segmented areas to perform pose estimation, Zhe Cao et al. [2] map the relationships of key points into part affinity fields (PAFs), learning to associate different body parts with individuals in the image. The architecture jointly learns part locations and their associations through two branches of the same sequential prediction process. Furthermore, it maintains high accuracy while achieving real-time performance even with a large number of people in the image, reaching 8.8 fps on a video with 19 people. On the COCO 2016 key point challenge dataset, this architecture set the state of the art, and its results significantly exceed the previous state-of-the-art methods on the MPII multi-person dataset. We take advantage of this method to obtain the coordinate information of the body parts for classifying human actions.

III. ACTION CLASSIFICATION FRAMEWORK

A. Outline of the proposed method

The human pose estimation approach provides 2D information of body key points, i.e., the 2D joint positions in each frame. This 2D information expresses the status of each person in each frame, so a sequence of frames conveys the action information of each person. Different positions of the human joints lead to different actions: across different actions the joint positions are very different, whereas within the same type of action the joint information is very similar. For example, in actions such as "standing" and "riding", the positions of the arms and legs are quite different, whereas actions such as "walking" and "running" appear similar in real scenes from some camera angles, as shown in figure 2.

Fig. 2: Coordinate points of human joints: (a) standing, (b) riding, (c) walking, (d) running.

We consider the 2D coordinates of body key points as important features that can help to classify actions. Therefore, we take advantage of the results of the real-time pose estimation approach [2] to obtain 2D coordinate information. Even for the same action, the joint positions differ with the size of the person, so it is hard to use the raw position information directly. In order to overcome the different sizes of the person in each frame, we normalize all coordinates to a 1 × 1 grid and classify the human actions with a convolutional neural network model. The outline of the proposed approach is shown in figure 3.

B. Real-time pose estimation

The real-time pose estimation method [2] is used to obtain the 2D coordinate information of body key points. We use color video clips as inputs and produce the joint positions of a person as outputs. Each joint position of a person can be represented by (x_i, y_i). We extract 14 body key points, including the nose, neck, shoulders, elbows, wrists, hips, knees, and ankles; hence we set P equal to 14. The position of a person is expressed using all of the joint positions, and thus the feature vector for one frame is defined as

    x = (x_1, y_1, x_2, y_2, ..., x_P, y_P)^T    (1)

Consider a video clip with F frames, expressed as an F × 2P matrix, as shown in figure 4. The action over all frames of the video is accumulated in a matrix, which is expressed as

    X = [x_1, x_2, ..., x_F]^T    (2)

so that the f-th row of X (f = 1, ..., F) is the feature vector x_f^T of frame f. The accumulated matrix X generated from all frames is used as input to the convolutional neural network architecture, which is the main contribution of our study. Unlike approaches that use images as input to classify actions, we ignore the background and consider only the movement information of the joints of the extracted poses. The length of the feature vector in each frame is determined by the number of joint positions to be extracted. The matrix we designed makes it easy to follow the movement of all extracted coordinates in each frame, and it is suitable for describing the changes in each coordinate value over several consecutive frames.

C. Action classification using neural networks

We design a five-layer neural network to classify the actions. The proposed architecture includes three convolutional layers, max pooling, and a fully connected layer, as shown in figure 5. The convolutional layers extract the features. The information of all joint positions is important for action recognition, and hence the size of one side of the convolution filter is set to the length of the feature matrix that was accumulated, as shown in figure 4.
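As a concrete illustration of the accumulation step described above, the following Python sketch builds the F × 2P matrix X of equation (2) from hypothetical per-frame keypoints. The min-max style of the 1 × 1 normalization and the helper names are our assumptions; the paper does not specify the exact normalization.

```python
import numpy as np

P = 14  # number of body key points per frame

def normalize_frame(keypoints):
    """Scale one frame's (P, 2) keypoint array into a 1 x 1 grid
    (assumed min-max person-size normalization, Section III-A)."""
    kp = np.asarray(keypoints, dtype=float)
    mins = kp.min(axis=0)
    span = kp.max(axis=0) - mins
    span[span == 0] = 1.0          # avoid division by zero for degenerate poses
    return (kp - mins) / span      # every coordinate now lies in [0, 1]

def accumulate(frames):
    """Stack per-frame feature vectors x_f = (x_1, y_1, ..., x_P, y_P)
    into the F x 2P matrix X of equation (2)."""
    return np.stack([normalize_frame(f).reshape(-1) for f in frames])

# Hypothetical input: 60 frames of 14 (x, y) joint positions in pixels.
rng = np.random.default_rng(0)
frames = rng.uniform(0, 160, size=(60, P, 2))
X = accumulate(frames)
print(X.shape)  # (60, 28)
```

Each row of X is one frame's normalized pose, so adjacent rows capture the short-time joint movement that the convolution filters operate on.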
Fig. 3: Outline of the proposed human action classification approach.

Fig. 4: Matrix obtained from all extracted coordinates of all frames.

Fig. 5: Proposed convolutional neural network model.

The convolutional layers we designed can extract information about the shape of the pose as well as its short-time movement. The final fully connected layer classifies the actions. The convolutional layers with ReLU activation function are defined as

    x_j^l = f( Σ_{i ∈ X_j} x_i^{l-1} * w_{ij}^l + b_j^l )    (3)

where X_j is the set of indices of the input features, l denotes the current layer, and x_j^l is the feature map output of layer l. Also, x_i^0 represents the input matrix accumulating all joint positions of all frames. Each output feature of each layer is given an additive bias b. The activation function f is defined as

    f(x) = max(0, x)    (4)

In the last layer, the softmax function is used to classify the actions.

Let {(X_i, t_i) | i = 1, ..., N} be the set of training samples, where X_i is the matrix obtained from the extracted joint positions of the i-th video and t_i is the teacher signal represented as a one-hot vector. To train the parameters of the network, we use the standard softmax cross-entropy loss, defined as

    L = − Σ_{i=1}^{N} Σ_{j=1}^{K} t_{ij} log y_{ij}    (5)

where y_{ij} is the output of the network for the j-th class on the i-th sample.

IV. EXPERIMENTS

A. Dataset

In the field of action recognition and video understanding, KTH, UCF101, Hollywood, Sports-1M, and HMDB51 are commonly used datasets for confirming the performance of a developed approach. Among these, we use the KTH and UCF50 datasets to test our proposed model. The KTH video dataset contains six types of human actions: running, walking, boxing, jogging, hand waving, and hand clapping. The spatial resolution of each video is 160×120 pixels, with an average clip length of four seconds. It includes 600 video clips in total, each with a frame rate of 25 fps.

The UCF50 dataset contains 50 types of human actions. The proposed approach mainly focuses on classifying single-person actions in video clips; hence we use the videos from UCF50 that come under this criterion.
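The building blocks of equations (4) and (5) can be sketched in NumPy as follows. The batch values are illustrative, and the max-subtraction in the softmax is a standard numerical-stability device not stated in the paper.

```python
import numpy as np

def relu(x):
    """Activation of equation (4): f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def softmax(z):
    """Row-wise softmax producing class probabilities y_ij."""
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, targets):
    """Softmax cross-entropy loss of equation (5):
    L = - sum_i sum_j t_ij * log(y_ij), with t the one-hot teacher signal."""
    y = softmax(logits)
    return -np.sum(targets * np.log(y + 1e-12))

# Hypothetical batch: N = 4 samples, K = 6 KTH classes, all labelled class 0.
logits = np.array([[5.0, 0, 0, 0, 0, 0]] * 4)
targets = np.eye(6)[[0, 0, 0, 0]]
loss = cross_entropy(logits, targets)
print(relu(np.array([-1.0, 2.0])))  # [0. 2.]
```

Because the logits strongly favor the correct class, the loss is small; a misclassified batch would yield a much larger value.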
We used 11 human actions: golf swing, playing violin, pommel horse, pull up, push ups, lunges, nun chucks, rope climbing, rock climbing indoor, tai chi, and jumping jack. This subset consists of 913 video clips in total, and each action includes 83 video clips.

B. Feature extraction

As explained in section III, the coordinates are extracted by applying the real-time pose estimation algorithm. The algorithm detects 18 key points in total from the body parts. The eyes and ears are not necessary, so they are not included in the feature vectors; finally, a single person is represented by a feature vector of 14 coordinates. The ksize and stride are set according to how many coordinates are to be extracted: for example, in the first CNN layer, the ksize and stride are set to [14, 2] and [2, 1], respectively, as shown in figure 6, where p and f are the numbers of coordinates and frames to be extracted.

Fig. 6: The ksize and stride setting of the first CNN layer as an example.

C. Classification results

We select one frame from every five frames and use it as input to the model. The input is an accumulated matrix consisting of 60 frames, and each frame is represented by one vector comprising 14 coordinates; therefore, the input size for one video clip is 60 × 28. We randomly divide the dataset into 70% for training and 30% for testing. The mini-batch size is set to 100. We use SGD as the optimizer and set the learning rate to 0.001. In order to prevent overfitting, dropout and weight decay are added before the linear layer; the dropout rate is 0.5 and the weight decay is 0.01.

In table I, we show the performance of our proposed method on the different datasets in terms of accuracy and loss values. The human action classification results of our proposed approach on the KTH dataset show very high performance, as shown in figure 7. Figure 8 shows the action classification results for eleven types of human actions from the UCF50 dataset; the precision, recall, and F-measure of our proposed method on these eleven action types are presented in table II. Table III compares the action classification accuracy on the KTH dataset with other competitive approaches; our proposed approach achieves the best accuracy and outperforms the previous approaches.

Fig. 7: Human action classification on the KTH dataset for a total of 5000 iterations: (a) accuracy on the train and test sets, (b) loss values on the train and test sets.

Fig. 8: Human action classification of eleven types of actions from the UCF50 dataset for a total of 5000 iterations: (a) accuracy on the train and test sets, (b) loss values on the train and test sets.
TABLE I: Performance of the proposed approach in terms of accuracy and loss values.

Measures         KTH     UCF50
Train accuracy   1.0     0.964
Train loss       0.065   0.118
Test accuracy    1.0     0.803
Test loss        0.085   0.667

TABLE II: Performance of the proposed approach on different types of human actions from the UCF50 dataset in terms of precision, recall, and F-measure.

Types of actions        Precision   Recall   F-measure
Pommel horse            0.91        0.91     0.91
Pull up                 0.67        0.77     0.71
Push ups                0.83        1.00     0.91
Golf swing              0.78        0.78     0.78
Playing violin          0.89        0.89     0.89
Nun chucks              0.85        0.85     0.85
Lunges                  0.86        0.86     0.86
Rope climbing           0.70        0.54     0.61
Rock climbing indoor    0.71        0.63     0.67
Tai chi                 0.86        0.86     0.86
Jumping jack            0.77        0.76     0.79

TABLE III: Comparison of the human action classification accuracy of our proposed method with state-of-the-art approaches on the KTH dataset.

Approach                   KTH
R. Mahdyar et al. [21]     95.6%
A. Fadwa et al. [22]       98.90%
Our proposed method        100%

V. CONCLUSION

In this study, we proposed a human action classification approach based on fourteen 2D coordinate points obtained with a real-time pose estimation algorithm. The designed multi-frame matrix composed of the extracted 2D coordinates of the human body is used as input to train the convolutional neural network model, which is the main contribution of this study. The matrix we designed makes it easy to follow the movement of all coordinates in each frame, and it is suitable for describing the changes in each coordinate value over several consecutive frames. The human action classification results of our proposed approach on six different types of human actions from the KTH dataset show very good performance, higher than the other competitive approaches. The human action classification performance on eleven different types of human actions from the UCF50 dataset yields very high to moderate results in terms of precision, recall, and F-measure values.

In the present study, we considered single-person action classification. In the future, we plan to extend the proposed classification approach to multi-person actions. Furthermore, we consider extending our approach by modeling the relationships of more joint points with other neural network architectures, such as graph convolutional neural networks, to generalize the performance of the results.

ACKNOWLEDGEMENT

This work was partly supported by JSPS KAKENHI Grant Number 16K00239.

REFERENCES

[1] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. 2018.
[2] Zhe Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[3] F. Negin and F. Bremond. Human action recognition in videos: A survey. 2016.
[4] T. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. In CVIU, 2006.
[5] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[6] A. Karpathy et al. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725-1732, 2014.
[7] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568-576, 2014.
[8] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231, 2013.
[9] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489-4497, 2015.
[10] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694-4702, 2015.
[11] V. Raviteja, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In CVPR, 2014.
[12] D. Yong, Yun Fu, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, 2015.
[13] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
[14] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. In TPAMI, 2013.
[15] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. In IJCV, 2005.
[16] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.
[17] W. Shih-En, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
[18] C. Yilun, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. 2017.
[19] F. Hao-Shu, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
[20] X. Fangting, Peng Wang, Xianjie Chen, and Alan Yuille. Joint multi-person pose estimation and semantic part segmentation. In ICCV, 2017.
[21] R. Mahdyar Ravanbakhsh, Hossein Mousavi, Mohammad Rastegari, Vittorio Murino, and Larry S. Davis. Action recognition with image based CNN features. 2015.
[22] A. Fadwa, Chunbo Bao, Arwa Mohammed Taqi, Mariofanna Milanova, and Nabeel Ghassan. Human actions recognition based on 3D deep neural network. In NTICT, 2017.
