Wang - Robust Multi-Feature Learning For Skeleton-Based Action Recognition
ABSTRACT Skeleton-based action recognition has advanced significantly in the past decade. Among
deep learning-based action recognition methods, one of the most commonly used structures is a two-
stream network. This type of network extracts high-level spatial and temporal features from skeleton
coordinates and optical flows, respectively. However, other features, such as the structure of the skeleton or
the relations of specific joint pairs, are sometimes ignored, even though using these features can also improve
action recognition performance. To robustly learn more low-level skeleton features, this paper introduces
an efficient fully convolutional network to process multiple input features. The network has multiple
streams, each of which has the same encoder-decoder structure. A temporal convolutional network and
a co-occurrence convolutional network encode the local and global features, and a convolutional classifier
decodes high-level features to classify the action. Moreover, a novel fusion strategy is proposed to combine
independent feature learning and dependent feature relating. Detailed ablation studies are performed to
confirm the network’s robustness to all feature inputs. If more features are combined and the number
of streams increases, performance can be further improved. The proposed network is evaluated on three
skeleton datasets: NTU-RGB+D, Kinetics, and UTKinect. The experimental results show its effectiveness
and performance superiority over state-of-the-art methods.
stream network is a structure commonly used to extract spatial and temporal features [24]–[28]. This method originates from video-based action recognition [24] and significantly improves action recognition performance. However, using only two low-level features does not suffice in some cases. Fig. 1 presents an example of the action selfie in the NTU-RGB+D [11] dataset. The same individual performs the same action, but the patterns of the spatial and temporal features differ significantly. Hence, it is difficult to recognize this action from these two features alone. However, if we also consider the relation between the head and a hand, recognition becomes an easy task. This is why we need a new framework that combines more low-level skeleton features.

To move beyond the limitation of input features and further improve the performance of skeleton-based action recognition, we propose a robust multi-feature network (MF-Net). The main network includes multiple streams with the same fully convolutional structure. The combination of a temporal convolutional network and a co-occurrence convolutional network encodes the low-level features, and the classifier decodes the high-level feature maps into the recognition result. Both local features across the skeleton sequences and global features across channels are extracted. Moreover, strided convolution is used instead of the traditional pooling layer for the discontinuous dimension. A new fusion strategy is proposed to balance independent feature learning and dependent feature relating. Considering the influence of motion speed, we also propose a simple data augmentation method that improves efficiency and reduces overfitting.

The main contributions of our work can be summarized as follows:
• We design the multi-feature network (MF-Net), a robust

A. MULTI-STREAM NEURAL NETWORK
The multi-stream neural network is a form of multi-task learning. Simonyan et al. [24] were the first to study the application of the two-stream network to video-based action recognition, and optical flow was confirmed to be an effective representation of the temporal feature [17]. Among CNN-based models, skeleton coordinates and optical flows are the two common feature vectors used as the input of the spatial and temporal streams [24]–[26]. As for RNN- or LSTM-based models, both geometric descriptors and time durations are commonly used network inputs [12], [13], [27], [31], [32]. Moreover, the fusion strategies of multi-stream networks also exhibit some differences. A controversy between early and late fusion emerged in the early stage of research in this domain [33], [34]. Average and maximum are two common score-fusion methods, and multiply fusion is considered to have better accuracy [16], [17], [28], [35]. Concatenation of high-level feature maps has recently been used to improve the correlation between features. The widely applied average fusion strategy presented in [24] is considered effective in alleviating overfitting.

Compared to video data, analyzing a skeleton sequence is a more natural approach to extracting internal structural features through the skeleton joints. However, such features are ignored by networks that consider only the spatial and temporal domains. For an ideal fusion strategy of a multi-stream network, both independent learning and dependent relating are necessary to learn high-level features robustly.
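To make the score-fusion strategies discussed above concrete, the following is a minimal sketch of late fusion over per-stream class scores. It is not code from any of the cited works; the tensor shapes and the optional per-stream weights are assumptions made for illustration.

```python
# Minimal sketch of late score fusion for a multi-stream network.
# Assumptions: `stream_logits` is a list of per-stream class scores with shape
# (batch, num_classes); `weights` is an optional list of per-stream weights.
import torch

def fuse_scores(stream_logits, weights=None, mode="average"):
    probs = torch.stack([torch.softmax(s, dim=1) for s in stream_logits])  # (S, B, K)
    if mode == "average":
        if weights is None:
            return probs.mean(dim=0)
        w = torch.tensor(weights, dtype=probs.dtype, device=probs.device)
        return (w.view(-1, 1, 1) * probs).sum(dim=0) / w.sum()
    if mode == "max":                      # element-wise maximum over streams
        return probs.max(dim=0).values
    if mode == "multiply":                 # element-wise product over streams
        return probs.prod(dim=0)
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: three streams, a batch of 8 samples, 60 classes, weighted average fusion.
scores = [torch.randn(8, 60) for _ in range(3)]
fused = fuse_scores(scores, weights=[1.0, 1.0, 0.5])
prediction = fused.argmax(dim=1)
```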
B. CNN FOR SKELETON-BASED ACTION RECOGNITION
Unlike RGB and depth images, skeleton data contain the positions of human joints, which are considered relatively high-level features in action recognition [29]. Owing to their ability to extract high-level features, CNN-based models have performed remarkably well on benchmarks of skeleton-based action recognition [14], [15], [25], [26]. In such models, the temporal convolutional network has become a universal structure because of its effectiveness in extracting temporal information; it is also a foundational component in GCN-based models [15], [22]. To address the problem that a CNN is not robust to action duration, Ke et al. [14] transformed skeleton sequences into three clips to model multi-task learning. Li et al. used co-occurrence feature learning [36] in a two-stream convolutional network [25] to achieve state-of-the-art performance [26].

To take further advantage of CNNs, we design a fully convolutional network consisting of a temporal convolutional network and a co-occurrence convolutional network. The former extracts local temporal information among neighboring skeleton frames, and the latter learns global information across the channel dimension. To deal with the discontinuous dimension, we use strided convolution instead of the traditional pooling layer.

III. MULTI-FEATURE NETWORK
Existing methods are mature enough to process two low-level features, but using so few features does not suffice for complex actions. Generally, more input features require higher network capacity, which motivates us to propose MF-Net. In this section, the proposed framework and its components are introduced in detail.

A. PIPELINE OVERVIEW
MF-Net is a highly modular network with fully convolutional operations. Its structure is presented in Fig. 2. Several low-level geometric features are generated from the skeleton sequences and input into the network. Additionally, we perform novel data augmentation to increase the randomness of motion speed. Each stream has the same independent structure, which includes an encoder and a decoder. The encoder of each stream consists of a temporal convolutional network and a co-occurrence convolutional network, and each subnetwork consists of several convolutional modules and pooling layers. The classifier consists of two convolutional modules and a global average pooling layer. To further correlate the high-level feature maps of each stream, we propose a novel fusion strategy that forms a fusion stream. All the streams are trained together in an end-to-end manner with backpropagation. The network components are introduced in detail in the following subsections.

FIGURE 2. Architecture of our MF-Net: the network input, consisting of several low-level features, is generated from the skeleton sequences. The network includes several streams with the same structure. Each stream consists of a temporal convolutional network, a co-occurrence convolutional network, and a classifier. The feature maps before the classifiers are concatenated to form the fusion stream. For each stream, an independent classification score is obtained, and the final result is the weighted average of these scores.

B. INPUT FEATURES
To test the performance of our network, several low-level skeleton features are used as the network input. Considering the effect of feature fusion, the input features should differ in some aspects. We design four features for different dimensions. Fig. 3 shows an example of the four input features based on the Kinect V2 skeleton.

1) Spatial Feature
Using the original 3D skeleton joints is the simplest way to represent the spatial information. Hence, we extract the joint coordinates from the sequences to form the feature vector. The spatial feature is calculated as

$f_t^{\mathrm{spa}}(i) = (x_t^i, y_t^i, z_t^i), \quad \forall i \in V,\ t \in T,$   (1)
where $(x_t^i, y_t^i, z_t^i) \in \mathbb{R}^3$ is the coordinate of the $i$th joint in the $t$th skeleton frame.

2) Temporal Feature
Optical flow has been proven effective in extracting temporal information for action recognition [37]. For skeleton sequences, the optical flows are a set of displacement vectors between consecutive frames. To represent the full motion, we calculate the optical flow for all the skeleton joints and construct the temporal feature as follows:

$f_t^{\mathrm{tem}}(i) = f_{t+1}^{\mathrm{spa}}(i) - f_t^{\mathrm{spa}}(i), \quad \forall i \in V,\ t \in T,$   (2)

where $V$ is the joint set and the feature vector $f_t^{\mathrm{tem}}(i)$ encodes the motion at joint $i$ between a pair of consecutive frames $t$ and $t+1$.

3) Structural Feature
The relationship among skeleton joints, often ignored in some two-stream networks, is nonetheless significant. To extract high-level skeleton structural features, the adjacency matrix of the skeleton is used as an input feature. Specifically, each skeleton frame can be converted to a graph $G = (V, E)$ representing intra-body connections, where $V$ is the joint set representing spatial features and $E$ is the set of edges between joints representing structural features. Hence, the structural feature can be written as

$f_t^{\mathrm{stru}}(i, j) = f_t^{\mathrm{spa}}(i) - f_t^{\mathrm{spa}}(j), \quad \forall t \in T,\ (i, j) \in E.$   (3)

4) Actional Feature
Most actions have a very significant impact on specific joint relations. For example, when a person is drinking or answering the phone, the distance between the head and a hand becomes smaller; walking or running entails the two feet periodically moving farther apart. Hence, it is reasonable to select such pairs of joints to represent actional features. To avoid overlap with the structural features, we subjectively select some long "bones" that are highly correlated with common actions as the actional feature; the red lines in Fig. 3(d) show the joint pairs we actually use for the Kinect V2 skeleton. The difference between the joints of each pair is calculated to form the actional feature vector, following the calculation in Equation (3).

FIGURE 3. Example of the four input features based on the Kinect V2 skeleton: a) spatial feature, b) temporal feature, c) structural feature, d) actional feature. The focus of each feature is shown in red.
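As a concrete illustration of Equations (1)–(3) and the actional feature, the sketch below computes the four inputs from raw joint coordinates. This is not the authors' code: the (T, V, 3) array layout and the `edges`/`pairs` index lists are assumptions made for the example.

```python
# Minimal sketch of the four low-level input features.
# Assumptions: `joints` is a NumPy array of shape (T, V, 3) holding 3D joint
# coordinates; `edges` and `pairs` are lists of (i, j) joint-index tuples that
# stand in for the skeleton bones and the hand-picked joint pairs.
import numpy as np

def spatial_feature(joints):
    # Eq. (1): raw 3D coordinates per joint and frame.
    return joints.copy()                                    # (T, V, 3)

def temporal_feature(joints):
    # Eq. (2): displacement between consecutive frames ("skeleton optical flow").
    flow = joints[1:] - joints[:-1]                         # (T-1, V, 3)
    return np.concatenate([flow, np.zeros_like(joints[:1])], axis=0)  # pad last frame

def structural_feature(joints, edges):
    # Eq. (3): coordinate differences along skeleton edges (bones).
    return np.stack([joints[:, i] - joints[:, j] for i, j in edges], axis=1)

def actional_feature(joints, pairs):
    # Same form as Eq. (3), but over hand-picked long-range joint pairs.
    return np.stack([joints[:, i] - joints[:, j] for i, j in pairs], axis=1)
```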
C. MODEL ARCHITECTURE
To robustly learn high-level representations from these low-level inputs, we propose MF-Net to comprehensively extract high-level features. The overall pipeline is presented in Fig. 2, and the detailed network architecture is shown in Fig. 4. MF-Net is a fully convolutional network with multiple streams; each stream includes a temporal convolutional network, a co-occurrence convolutional network, and a convolutional classifier. The input of each stream is a four-dimensional tensor $[N \times M, C, T, V]$, where $N$ is the batch size, $M$ is the number of people, $C$ is the number of channels, $T$ is the sequence length, and $V$ is the dimension of the feature vectors calculated in Section III-B. The network and its components are described in detail below.

The basic convolution module integrates the standard ReLU activation function and 2D batch normalization. As in the typical ResNet structure [38], a shortcut connection is used to increase the feature representation ability in deeper layers. This module is used for all convolution operations except the last convolution layer in the classifier.

The temporal convolutional network is the first half of the encoder. As the skeleton sequence is interpretable along dimension $T$, it is reasonable to use a convolution operation to extract high-level temporal information. Appropriate downsampling helps reduce the amount of computation and alleviate overfitting. Compared to the average pooling layer, the max-pooling layer retains more texture information.

The co-occurrence convolutional network is used to aggregate global features across the channels $C$. First, a transposition operation transforms the tensor from $[N \times M, C, T, V]$ to $[N \times M, V, T, C]$. Next, strided convolution is performed on dimensions $C$ and $T$ with a $3 \times 3$ convolution kernel, where the convolution stride applies only to dimension $C$. The same max-pooling layer as in the temporal convolutional network is placed after each strided convolution module.

The pooling layer is commonly used to achieve downsampling. However, it can be regarded as a strict equivalent
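Putting the modules above together, one stream's encoder could be sketched as follows. This is a simplified PyTorch illustration under assumptions (layer widths loosely follow the parameters later listed in Table 8, while the pooling placement and padding are guesses), not the released implementation.

```python
# Minimal sketch of a single MF-Net stream encoder: ConvModules with shortcut
# connections along T, then a C/V transpose and strided 3x3 convolutions.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Conv2d + BatchNorm2d + ReLU with a shortcut connection (no downsampling)."""
    def __init__(self, in_ch, out_ch, kernel, padding):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, padding=padding)
        self.bn = nn.BatchNorm2d(out_ch)
        self.proj = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)) + self.proj(x))

class StreamEncoder(nn.Module):
    """Temporal convolutions along T, then a transpose and strided convolutions along C."""
    def __init__(self, in_channels=3, n_joint=25):
        super().__init__()
        self.temporal = nn.Sequential(
            ConvModule(in_channels, 128, (3, 1), (1, 0)),
            ConvModule(128, 128, (3, 1), (1, 0)),
            nn.MaxPool2d((2, 1)),                      # downsample T only
            ConvModule(128, 256, (3, 1), (1, 0)),
            ConvModule(256, 256, (3, 1), (1, 0)),
            nn.MaxPool2d((2, 1)),
        )
        # After the transpose, the joint dimension V becomes the channel axis and the
        # stride of the 3x3 convolutions applies to the former feature-channel axis.
        self.cooccurrence = nn.Sequential(
            nn.Conv2d(n_joint, 128, (3, 3), stride=(1, 4), padding=(1, 1)),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, (3, 3), stride=(1, 4), padding=(1, 1)),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )

    def forward(self, x):                       # x: (N*M, C, T, V)
        x = self.temporal(x)                    # (N*M, 256, T/4, V)
        x = x.permute(0, 3, 2, 1).contiguous()  # (N*M, V, T/4, 256)
        return self.cooccurrence(x)

# Example: a batch of 8 sequences, 2 people, 3 channels, 64 frames, 25 joints.
x = torch.randn(8 * 2, 3, 64, 25)
features = StreamEncoder(in_channels=3, n_joint=25)(x)   # (16, 256, 16, 16)
```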
FIGURE 4. Detailed architecture of the network components: a) the ConvModule (Conv2d 3×1 with batch normalization and a shortcut connection); the strided convolution block (StridedConv2d 3×3, MaxPool2d 2×1, and a C/V transpose); and the per-stream classifiers and fusion classifier, whose feature maps are concatenated and whose classification scores are combined.
extended. In contrast, learning-based approaches are easier to implement. For example, Kendall et al. [40] use uncertainty to learn the weight of each loss function. However, such methods are prone to overfitting. Therefore, we choose a direct summation of the cross-entropy losses as the target loss function of the global optimizer.

IV. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of our MF-Net. The experiments are performed on two large-scale action recognition datasets, NTU-RGB+D and Kinetics, and a small dataset called UTKinect. To verify the network's robustness to various features, all the low-level features of Section III-B and their combinations are tested as the network input. Elaborate ablation studies on the NTU-RGB+D dataset are performed to evaluate the contributions of our proposed framework. Comparisons between MF-Net and other state-of-the-art methods on these three datasets show the effectiveness of our method.

A. DATASETS

FIGURE 6. Joint labels of the three experimental datasets: a) NTU RGB+D, b) Kinetics, c) UTKinect.

1) NTU-RGB+D
NTU-RGB+D is the most commonly used large-scale skeleton-based action recognition dataset. It contains 56,880 skeleton sequences covering 60 action categories, performed by 40 volunteers and captured from three horizontal view angles: 45°, 0°, and −45°. The longest sequence lasts 10 seconds, and the frame rate is 30 fps. Twelve attributes per joint, including the 3D coordinates, are collected by the Microsoft Kinect V2 depth sensor. The skeleton joint labels are shown in Fig. 6(a). The dataset includes two recommended benchmarks: the cross-subject and cross-view settings. We use the partition suggested by Yan et al. [22]: 40,091 sequences for training and 16,487 sequences for evaluation in the cross-subject setting, and 37,646 sequences for training and 18,932 sequences for evaluation in the cross-view setting.

2) Kinetics
Kinetics is one of the largest human action recognition datasets, containing 300,000 videos in 400 classes [43]. The original dataset consists of raw videos captured from YouTube. Yan et al. [22] processed this dataset with the OpenPose toolbox [3] to obtain skeleton sequences. Each skeleton frame has 18 joints, as shown in Fig. 6(b). The available dataset is divided into a training set with 240,000 sequences and a test set with 20,000 sequences. The raw data contain at most five people per clip, and two of them are selected for multi-person action recognition.

3) UTKinect
Xia et al. [44] provide another widely used small skeleton-based action recognition dataset. Its videos were captured with a single stationary Kinect device. There are 10 action types in the dataset: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands. The dataset contains 10 subjects, and each subject performs each action twice, for a total of 199 sequences. Each skeleton has 20 joints, and the joint labels are shown in Fig. 6(c). Both evaluation protocols provided in [44], leave-one-sequence-out and cross-subject, are used in our analysis.

B. ABLATION STUDY
1) MULTI-STREAM NETWORK
Network robustness to various low-level features is the focus of our work. To verify the independent learning ability of each feature and the improvement resulting from feature combination, the four low-level features described in Section III-B are combined to evaluate the network performance. All 15 combinations are tested as network input on the two benchmarks of the NTU-RGB+D dataset. The results are summarized in Table 1.

TABLE 1. Accuracy of single features and their combinations on the two NTU-RGB+D benchmarks.

Method X-Sub X-View
Spa 84.2% 91.3%
Tem 83.3% 89.0%
Stru 85.7% 91.2%
Act 84.3% 91.6%
Spa+Tem 87.9% 94.3%
Spa+Stru 87.7% 93.4%
Spa+Act 87.5% 93.8%
Tem+Stru 88.8% 94.2%
Tem+Act 88.4% 94.4%
Stru+Act 87.3% 93.3%
Spa+Tem+Stru 89.6% 94.9%
Spa+Tem+Act 89.5% 94.9%
Spa+Stru+Act 88.4% 94.0%
Tem+Stru+Act 89.6% 94.9%
Spa+Tem+Stru+Act 90.0% 95.4%

Considering Table 1, we observe that all combinations attain considerable performance improvements, which verifies the effectiveness of our model. If an independent low-level feature is used as the network input, we observe that the lowest precision values reach 83.3%/89.0%
in the cross-subject/cross-view evaluations. These results show that using a single low-level feature already yields effective action recognition with the proposed network. If one additional feature is combined with the original single feature, the minimum precision improvement is 1.6%/1.7%, and the maximum reaches 5.5%/5.4%. Moreover, three-stream and four-stream networks further improve recognition performance. Therefore, we can conclude that any additional low-level feature has the potential to improve network performance.

Each single feature makes a different contribution to the final result. The spatial feature is the basis of the other low-level features; thus, any additional feature combined with the spatial feature results in a steady improvement. The temporal feature alone gives the worst result; however, combinations of the temporal feature with other features achieve the best accuracy and significantly improve performance. The structural and actional features are similar in some aspects, so the improvement obtained by combining them is relatively small. Even so, this small improvement still reveals the robustness of our network.

FIGURE 7. Improvement of the four-stream network over the traditional spatial-temporal network. The horizontal axis represents the various actions of the NTU-RGB+D cross-subject protocol. Warmer colors represent better performance.

Considering the various action types, we compare the four-stream network with the traditional spatial-temporal network; the result is visualized in Fig. 7. Improvements are observed for most action types, while the others exhibit slightly lower performance. Furthermore, we observe significant improvements for harder actions, e.g., reading and writing. These two actions are the most difficult to distinguish in this dataset; they improve by 6.8%/8.0% with the four-stream network. On the other hand, some easier actions perform better with the two-stream network. These actions easily reach close to 100% accuracy with the two-stream network but exhibit overfitting as the number of network parameters increases. Such results demonstrate that a structure with more features has better overall performance and is superior in processing more complex actions.

In conclusion, the MF-Net structure is robust enough to process various low-level features and their combinations. With our proposed network, single-feature learning already attains significant accuracy, and increasing the number of streams further improves performance. Note that the improvements slow down as more input streams are used, and both the number of parameters and the training time increase significantly.

2) STRIDED CONVOLUTION
The co-occurrence convolutional network is the foundation of our proposed network, designed to extract global features across the channel dimension. Its commonly used downsampling function is a max-pooling layer [26]. Considering the discontinuity of the channel dimension, we propose using strided convolution as a substitute for the combination of a convolution layer and a pooling layer. To assess the necessity of downsampling, we also evaluate the network without downsampling and with convolution layers of various stride sizes. Table 2 presents a comparison of the various downsampling methods.

From Table 2, the effectiveness of the strided convolution layer can be verified. The co-occurrence convolutional
TABLE 4. Action recognition performance on the NTU-RGB+D dataset.

TABLE 5. Action recognition performance on the Kinetics dataset. The methods listed at the top are video-based.
FIGURE 9. Confusion matrix on the NTU-RGB+D dataset under the cross-subject protocol. As the overall confusion matrix is oversized, we divide the action types into four categories of 15 actions each, based on accuracy. The four confusion matrices for the respective categories are presented in this figure: the upper-left matrix corresponds to the worst-performing category and the bottom-right matrix to the best. All nonzero entries are shown in color, and entries with values greater than 0.01 are annotated with their specific probabilities. Darker colors represent more significant misclassification.
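The grouping described in this caption can be reproduced with a short script. The sketch below is not the authors' plotting code; it assumes integer label arrays `y_true` and `y_pred` over the 60 classes and shows one plausible way to split the classes by per-class accuracy.

```python
# Minimal sketch: split the 60 classes into four 15-class categories by
# per-class accuracy and draw one row-normalized confusion matrix per category.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def grouped_confusion_matrices(y_true, y_pred, n_classes=60, n_groups=4):
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(n_classes)).astype(float)
    cm /= np.maximum(cm.sum(axis=1, keepdims=True), 1)   # rows become probabilities
    order = np.argsort(np.diag(cm))                      # sort classes by accuracy
    groups = np.array_split(order, n_groups)             # four categories of 15 classes
    fig, axes = plt.subplots(1, n_groups, figsize=(4 * n_groups, 4))
    for ax, idx in zip(axes, groups):
        ax.imshow(cm[np.ix_(idx, idx)], cmap="Reds", vmin=0.0, vmax=1.0)
        ax.set_xticks([]); ax.set_yticks([])
    return fig
```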
[32] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, "A multi-stream bi-directional recurrent neural network for fine-grained action detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] C. G. Snoek, M. Worring, and A. W. Smeulders, "Early versus late fusion in semantic video analysis," in ACM International Conference on Multimedia, 2005.
[34] H. Gunes and M. Piccardi, "Affect recognition from face and body: Early fusion vs. late fusion," in IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2005.
[35] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-D action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95–104, 2018.
[36] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in AAAI Conference on Artificial Intelligence, 2016.
[37] H. Wang and C. Schmid, "Action recognition with improved trajectories," in IEEE International Conference on Computer Vision (ICCV), 2013.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[40] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[41] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, "GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks," in International Conference on Machine Learning (ICML), 2018.
[42] S. Liu, Y. Liang, and A. Gitter, "Loss-balanced task weighting to reduce negative transfer in multi-task learning," in AAAI Conference on Artificial Intelligence (Student Abstract), 2019.
[43] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[44] L. Xia, C.-C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012.
[45] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, "View adaptive recurrent neural networks for high performance human action recognition from skeleton data," in IEEE International Conference on Computer Vision (ICCV), 2017.
[46] Y. Tang, Y. Tian, J. Lu, P. Li, and J. Zhou, "Deep progressive reinforcement learning for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[47] R. Cui, G. Hua, A. Zhu, J. Wu, and H. Liu, "Hard sample mining and learning for skeleton-based human action recognition and identification," IEEE Access, vol. 7, pp. 8245–8257, 2018.
[48] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[49] J. Weng, C. Weng, and J. Yuan, "Spatio-temporal Naive-Bayes nearest-neighbor (ST-NBNN) for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[50] M. Liu, Q. He, and H. Liu, "Fusing shape and motion matrices for view invariant action recognition using 3D skeletons," in IEEE International Conference on Image Processing (ICIP), 2017.
[51] M. Jain, H. Jégou, and P. Bouthemy, "Better exploiting motion for better action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

APPENDIX A DATA AUGMENTATION
The duration of an action and the speed of motion largely depend on the performer [51]. Thus, the network is prone to overfitting to individual subjects. Random sampling is the most straightforward data augmentation method for increasing the randomness of motion speed; however, it loses action continuity. To effectively balance randomness and continuity, we propose a simple K-partition random sampling method, as shown in Fig. 11.

FIGURE 11. Visualization of our K-partition random sampling.

A skeleton sequence $A \in \mathbb{R}^{C \times T \times V}$ can be represented by the $C$-dimensional locations of $V$ feature vectors in a video with $T$ frames. We divide the sequence into $K$ parts $[A_1, A_2, \cdots, A_K]$ of equal length, with $A_i \in \mathbb{R}^{C \times (T/K) \times V}$. Afterwards, each part is randomly sampled to obtain a new sequence part $A'_i \in \mathbb{R}^{C \times (T'/K) \times V}$. Finally, $[A'_1, A'_2, \cdots, A'_K]$ are concatenated, with the sampled frames sorted in temporal order, to form a new sequence $A' \in \mathbb{R}^{C \times T' \times V}$. In the case of $T < T'$, repeated sampling is allowed.

To verify the effectiveness of our data augmentation method, we compare the proposed K-partition method with other schemes on the NTU-RGB+D dataset. Crop-resize is an operation that randomly cuts out the boundary of a sequence and downsamples it to a fixed length. Plain random sampling and no augmentation are considered as two alternative schemes. The influence of the parameter K is also studied. The experimental results are presented in Table 7.

TABLE 7. Comparison of the accuracy of K-partition data augmentation and other schemes.

Method X-Sub X-View
Without augmentation 89.0% 95.0%
Crop-resize 89.4% 94.9%
Random sampling 89.6% 95.1%
K-partition (k=2) 89.6% 95.2%
K-partition (k=4) 90.0% 95.4%
K-partition (k=8) 89.7% 95.3%

Table 7 shows that crop-resize and plain random sampling cannot improve action recognition performance under the cross-view evaluation. Among these methods, our K-partition random sampling achieves the best performance, improving accuracy by 1.0%/0.4% under the two protocols. As for the parameter K, the model achieves the best performance with K=4.
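The K-partition sampling above can be sketched in a few lines. This is an illustrative implementation under assumptions (the (C, T, V) array layout and a target length divisible by K), not the authors' code.

```python
# Minimal sketch of K-partition random sampling: split the frame axis into K
# parts, sample frames inside each part, and keep the temporal order.
import numpy as np

def k_partition_sample(A, T_out, k=4, rng=None):
    """A: array of shape (C, T, V); T_out: output length (assumed divisible by k)."""
    rng = np.random.default_rng() if rng is None else rng
    C, T, V = A.shape
    parts = np.array_split(np.arange(T), k)     # K parts of (nearly) equal length
    per_part = T_out // k
    picked = []
    for idx in parts:
        # Repeated sampling is allowed when a part is shorter than needed.
        replace = len(idx) < per_part
        picked.append(np.sort(rng.choice(idx, size=per_part, replace=replace)))
    frames = np.concatenate(picked)             # sorted within parts, parts in order
    return A[:, frames, :]                      # (C, T_out, V)
```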
TABLE 8. Parameters of the specific convolution layers. The layers are listed in order, and their positions in the entire network are shown in Fig. 2 and Fig. 4. The parameter n_joint represents the number of joints of the skeleton, n_stream denotes the number of streams in the network, and n_class represents the number of action types.

Layer Channels Kernel Stride
Conv1 3 × 128 3×1 1×1
Conv2 128 × 128 3×1 1×1
Conv3 128 × 256 3×1 1×1
Conv4 256 × 256 3×1 1×1
StridedConv1 n_joint × 128 3×3 1×4
StridedConv2 128 × 256 3×3 1×4
StridedConv3 256 × 256 3×3 1×2
Conv5 (n_stream × 256) × (n_stream × 256) 3×3 1×1
Conv6 (n_stream × 256) × n_class 1×1 1×1

TABLE 9. Evaluation time on the NTU-RGB+D dataset for various models.

Model Time Model Time
ST-GCN [22] 74s MF-Net 31s
Nstride=1 52s Nstride=2 41s
No sampling 39s Random sampling 40s
Nstream=2 20s Nstream=3 27s

are the other reasons a large stride is effective in reducing the necessary time. Random sampling entails almost no additional time consumption; however, the interpolation in our K-partition random sampling increases the reading time, so the time reduction resulting from augmentation is limited. Moreover, considering more feature streams increases the required time, which grows at a rate between O(n) and O(n²), where n is the number of streams.
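To make Table 8 concrete, the sketch below builds the listed convolution layers with the stated channels, kernels, and strides. It is illustrative only: the padding scheme and the container type are assumptions, and n_joint, n_stream, and n_class are the placeholders defined in the table caption.

```python
# Minimal sketch: instantiate the per-stream convolution layers of Table 8.
import torch.nn as nn

def build_stream_layers(n_joint, n_stream, n_class):
    spec = [
        # (name, in_channels, out_channels, kernel, stride)
        ("Conv1", 3, 128, (3, 1), (1, 1)),
        ("Conv2", 128, 128, (3, 1), (1, 1)),
        ("Conv3", 128, 256, (3, 1), (1, 1)),
        ("Conv4", 256, 256, (3, 1), (1, 1)),
        ("StridedConv1", n_joint, 128, (3, 3), (1, 4)),
        ("StridedConv2", 128, 256, (3, 3), (1, 4)),
        ("StridedConv3", 256, 256, (3, 3), (1, 2)),
        ("Conv5", n_stream * 256, n_stream * 256, (3, 3), (1, 1)),
        ("Conv6", n_stream * 256, n_class, (1, 1), (1, 1)),
    ]
    # "Same"-style padding is an assumption; Table 8 does not specify it.
    return nn.ModuleDict({
        name: nn.Conv2d(i, o, kernel_size=k, stride=s, padding=(k[0] // 2, k[1] // 2))
        for name, i, o, k, s in spec
    })
```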