

Robust Multi-Feature Learning for Skeleton-Based Action Recognition

YINGFU WANG1, ZHEYUAN XU1, LI LI1, AND YAO JIAN1 (Member, IEEE)
1 School of Remote Sensing and Information Engineering, Wuhan University, 430079, China
Corresponding author: Jian Yao (E-mail: [email protected]; URL: http://cvrs.whu.edu.cn/)
This work was supported in part by the National Key R&D Program of China under grant No. 2017YFB1302400, the National Natural
Science Foundation of China under grant No. 41571436, and the Hubei Province Science and Technology Support Program of China under
grant No. 2015BAA027.

ABSTRACT Skeleton-based action recognition has advanced significantly in the past decade. Among
deep learning-based action recognition methods, one of the most commonly used structures is a two-
stream network. This type of network extracts high-level spatial and temporal features from skeleton
coordinates and optical flows, respectively. However, other features, such as the structure of the skeleton or
the relations of specific joint pairs, are sometimes ignored, even though using these features can also improve
action recognition performance. To robustly learn more low-level skeleton features, this paper introduces
an efficient fully convolutional network to process multiple input features. The network has multiple
streams, each of which has the same encoder-decoder structure. A temporal convolutional network and
a co-occurrence convolutional network encode the local and global features, and a convolutional classifier
decodes high-level features to classify the action. Moreover, a novel fusion strategy is proposed to combine
independent feature learning and dependent feature relating. Detailed ablation studies are performed to
confirm the network’s robustness to all feature inputs. If more features are combined and the number
of streams increases, performance can be further improved. The proposed network is evaluated on three
skeleton datasets: NTU-RGB+D, Kinetics, and UTKinect. The experimental results show its effectiveness
and performance superiority over state-of-the-art methods.

INDEX TERMS Action recognition, skeleton, multi-feature learning, CNN, robustness

I. INTRODUCTION
Human action recognition is one of the most challenging tasks in the field of computer vision and video understanding. In the past decade, action recognition has undergone rapid development and has been widely applied to human-computer interaction, visual surveillance, video indexing, virtual reality, etc. [1], [2]. Previous studies focused on RGB videos because of the convenience of capturing such data. Recently, the appearance of large-scale 3D skeleton datasets has drawn increasing attention to skeleton-based action recognition. Besides depth sensing, pose estimation algorithms provide another approach to obtaining skeleton data from video [3], [4]. Compared to video-based models, skeleton-based models have several merits. First, they are robust to body scale, motion speed and variations of viewpoints [5], [6]. Second, skeleton data can ignore the surrounding distractions. Third, lower hardware requirements facilitate the practical use of skeleton-based models.

Generally, the approaches to skeleton-based action recognition can be categorized into handcrafted feature-based approaches and deep learning-based approaches [7]. The former encode all the body joints' coordinates into feature vectors for pattern learning. Such methods often focus on either position relations or dynamic trajectories and, as a result, miss information of other dimensions [8], [9]. The information contained in low-level feature vectors is limited; however, deep learning can solve this problem. Skeleton data naturally have the form of a time series; thus, it is reasonable to apply recurrent neural network (RNN) or long short-term memory (LSTM) approaches to model temporal information [10]–[13]. In the past two years, methods based on the convolutional neural network (CNN) have demonstrated impressive performance, benefiting from the ability to extract high-level features [14]–[18]. The graph convolutional network (GCN) extends the CNN to non-Euclidean domains [19]–[21] and has been proved effective in learning structural features for modeling skeleton sequences [22], [23].


Among these deep learning-based methods, the two-stream network is a structure commonly used to extract spatial and temporal features [24]–[28]. This method originates from video-based action recognition [24] and significantly improves action recognition performance. However, using only two low-level features does not suffice in some cases. Fig.1 presents an example of the action of taking a selfie in the NTU-RGB+D [11] dataset. The same individual performs the same action, but the patterns of the spatial and temporal features are significantly different. Hence, it is difficult to recognize this action through these two features alone. However, if we consider the relation between the head and a hand, recognition becomes an easy task. This is why we need a new framework that combines more low-level skeleton features.

FIGURE 1. Example selected from the NTU-RGB+D dataset. Images (a) and (b) are two RGB images of the action of taking a selfie. Images (c) and (d) show the corresponding spatial and temporal features of these two skeleton sequences.

To move beyond the limitation of input features and further improve the performance of skeleton-based action recognition, we propose a robust multi-feature network (MF-Net). The main network includes multiple streams with the same fully convolutional structure. The combination of a temporal convolutional network and a co-occurrence convolutional network encodes the low-level features, and a classifier decodes the high-level feature maps into the recognition result. Both local features across the skeleton sequences and global features across channels are extracted. Moreover, strided convolution is used instead of the traditional pooling layer for the discontinuous dimension. A new fusion strategy is proposed to balance independent feature learning and dependent feature relating. Considering the influence of motion speed, we propose a simple data augmentation method that is effective in improving efficiency and reducing overfitting.

The main contributions of our work can be summarized as follows:
• We design the multi-feature network (MF-Net), a robust and efficient network for learning various skeleton features, which is the first network that combines more than three geometric descriptors.
• We propose a novel fusion strategy to correlate high-level feature maps from multiple streams. Independent learning in the feature streams and adequate correlation in the fusion stream are balanced to ensure fusion performance.
• We confirm that using more input features is effective in improving action recognition performance. Whenever any additional feature is combined, the result improves considerably.

Extensive ablation studies are performed to evaluate the robustness of our network. All network proposals are demonstrated to be effective in improving recognition performance. The experimental results on three skeleton datasets show that our approach outperforms other state-of-the-art methods.

II. RELATED WORK
In this section, we briefly review the related studies, including those using multi-stream neural networks and CNNs for skeleton-based action recognition. More detailed analyses of related studies are available in surveys [29], [30].

A. MULTI-STREAM NEURAL NETWORK
The multi-stream neural network is a form of multi-task learning. Simonyan et al. [24] were the first to study the application of the two-stream network to video-based action recognition, and optical flow was confirmed to be an effective representation of the temporal feature [17]. Among CNN-based models, skeleton coordinates and optical flows are the two common feature vectors for the input of the spatial and temporal streams [24]–[26]. As for RNN- or LSTM-based models, both geometric descriptors and time durations are commonly used network inputs [12], [13], [27], [31], [32]. Moreover, the fusion strategies of multi-stream networks also exhibit some differences. A controversy between early and late fusion emerged in the early stage of research in this domain [33], [34]. Average and maximum are two common score fusion methods, and multiply fusion is considered to have better accuracy [16], [17], [28], [35]. Concatenation of high-level feature maps has recently been often used to improve the features' correlation. The widely applied average fusion strategy presented in [24] is considered effective in alleviating overfitting.

Compared to video data, analyzing a skeleton sequence is a more natural approach to extracting internal structural features through skeleton joints. However, such features are ignored by networks that only consider the spatial and temporal domains. For an ideal fusion strategy of a multi-stream network, both independent learning and dependent relating are necessary to learn high-level features robustly.

B. CNN FOR SKELETON-BASED ACTION RECOGNITION

FIGURE 2. Architecture of our MF-Net: the network input, consisting of several low-level features, is generated from the skeleton sequences. The network includes several streams with the same structure. Each stream consists of a temporal convolutional network, a co-occurrence convolutional network, and a classifier. Feature maps before the classifiers are concatenated to form the fusion stream. For each stream, an independent classification score is obtained, and the final result is the weighted average of these scores.

Unlike RGB and depth images, skeleton data contains the positions of human joints, considered to be relatively high-level features in action recognition [29]. Due to the ability to extract high-level features, CNN-based models have performed remarkably well on benchmarks of skeleton-based action recognition [14], [15], [25], [26]. In such models, the temporal convolutional network has become a universal structure due to its effectiveness in extracting temporal information; it is also a foundational component in GCN-based models [15], [22]. To solve the problem that a CNN is not robust to action duration, Ke et al. [14] transformed skeleton sequences into three clips to model multi-task learning. Li et al. used co-occurrence feature learning [36] in a two-stream convolutional network [25] to achieve state-of-the-art performance [26].

To further take advantage of CNNs, we design a fully convolutional network consisting of a temporal convolutional network and a co-occurrence convolutional network. The former is used to extract local temporal information among neighboring skeleton frames, and the latter is used to learn global information across the channel dimension. To deal with the discontinuous dimension, we use strided convolution instead of the traditional pooling layer.

III. MULTI-FEATURE NETWORK
Existing methods are mature enough to process two low-level features, but using so few features does not suffice for complex actions. Generally, more input features require a more capable network, which motivates us to propose MF-Net. In this section, the proposed framework and its components are introduced in detail.

A. PIPELINE OVERVIEW
MF-Net is a highly modular network with fully convolutional operations. Its structure is presented in Fig.2. Several low-level geometric features are generated from the skeleton sequences and input into the network. Additionally, we perform novel data augmentation to increase the randomness of motion speed. Each stream has the same independent structure that includes an encoder and a decoder. The encoder of each stream consists of a temporal convolutional network and a co-occurrence convolutional network, and each subnetwork consists of several convolutional modules and pooling layers. The classifier consists of two convolutional modules and a global average pooling layer. To further correlate the high-level feature maps of each stream, we propose a novel fusion strategy to form a fusion stream. All the streams are trained together in an end-to-end manner with backpropagation. The network components are introduced in detail in the following subsections.

B. INPUT FEATURES
To test the performance of our network, several low-level skeleton features are used as the network input. Considering the effect of feature fusion, the input features should differ in some aspects. We design four features for different dimensions. Fig.3 shows an example of the four input features based on the Kinect V2 skeleton.

FIGURE 3. Example of four input features based on the Kinect V2 skeleton. The concentration of each feature is shown in red.

1) Spatial Feature
Using the original 3D skeleton joints is the simplest approach to representing the spatial information. Hence, we extract the joint coordinates from the sequences to form the feature vector. The spatial feature is calculated by

f_t^spa(i) = (x_t^i, y_t^i, z_t^i), ∀i ∈ V, t ∈ T, (1)

where (x_t^i, y_t^i, z_t^i) ∈ R^3 is the coordinate of the i-th joint in the t-th skeleton frame.

2) Temporal Feature
Optical flow has been proven to be effective in extracting temporal information for action recognition [37]. For skeleton sequences, optical flows are a set of displacement vectors between consecutive frames. To represent the full motion, we calculate the optical flow for all the skeleton joints and construct the temporal feature as follows:

f_t^tem(i) = f_{t+1}^spa(i) − f_t^spa(i), ∀i ∈ V, t ∈ T, (2)

where V is the joint set and the feature vector f_t^tem(i) encodes the motion of joint i between a pair of consecutive frames t and t+1.

3) Structural Feature
The relationship among skeleton joints, often ignored in some two-stream networks, is nonetheless significant. To extract high-level skeleton structural features, the adjacency matrix of the skeleton is used as an input feature. Specifically, each skeleton frame can be converted to a graph G = (V, E) to represent intra-body connections. V is the joint set used to represent spatial features, and E is the set of edges between joints, used to represent structural features. Hence, the structural feature can be written as

f_t^stru(i, j) = f_t^spa(i) − f_t^spa(j), ∀t ∈ T, (i, j) ∈ E, (3)

where the feature vector f_t^stru(i, j) represents the connection between joints i and j in frame t if edge (i, j) is in set E.

4) Actional Feature
Most actions have a very significant impact on specific joint relations. For example, when a person is drinking or answering the phone, the distance between the head and a hand becomes smaller; walking or running entails the two feet periodically moving farther apart. Hence, it is reasonable to select such pairs of joints to represent actional features. To avoid overlap with the structural features, we subjectively select some long bones that are highly correlated with common actions as the actional feature. For example, the red lines in Fig.3(d) are the pairs actually used for the Kinect V2 skeleton. The difference between the joint pairs is calculated to form the actional feature vector according to the calculation in equation (3).
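As a concrete illustration, the following sketch derives the four low-level features from a raw joint-coordinate array following Eqs. (1)–(3); the edge list and actional joint pairs shown here are hypothetical placeholders, since the actual sets depend on the skeleton (e.g., Kinect V2 or OpenPose) and on the subjective selection described above.

```python
import numpy as np

# Hypothetical connectivity for illustration only; the real edge set E and the
# selected actional joint pairs depend on the skeleton layout of the dataset.
EDGES = [(0, 1), (1, 2), (2, 3)]          # structural feature: bones (i, j) in E
ACTIONAL_PAIRS = [(3, 7), (3, 11)]        # e.g., head-hand style long-range pairs

def build_features(joints):
    """joints: array of shape (C=3, T, V) holding x, y, z per joint per frame."""
    # Spatial feature, Eq. (1): the raw 3D coordinates themselves.
    f_spa = joints

    # Temporal feature, Eq. (2): displacement between consecutive frames
    # (the last frame is zero-padded to keep the length T).
    f_tem = np.zeros_like(joints)
    f_tem[:, :-1, :] = joints[:, 1:, :] - joints[:, :-1, :]

    # Structural feature, Eq. (3): coordinate differences along skeleton edges.
    f_stru = np.stack([joints[:, :, i] - joints[:, :, j] for i, j in EDGES], axis=-1)

    # Actional feature: the same difference, applied to the selected joint pairs.
    f_act = np.stack([joints[:, :, i] - joints[:, :, j] for i, j in ACTIONAL_PAIRS], axis=-1)

    return f_spa, f_tem, f_stru, f_act
```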

C. MODEL ARCHITECTURE
To robustly learn features beyond the low-level ones, we propose the novel MF-Net to comprehensively extract high-level features. The overall pipeline is presented in Fig.2, and the detailed network architecture is shown in Fig.4. Our MF-Net is a fully convolutional network with multiple streams; each stream includes a temporal convolutional network, a co-occurrence convolutional network, and a convolutional classifier. The input for each stream is a four-dimensional tensor [N × M, C, T, V], where N is the batch size, M is the number of people, C is the number of channels, T is the sequence length, and V is the feature vector calculated in section III-B. The network and its components are described in detail below.

The basic convolution module integrates the standard ReLU activation function and 2D batch normalization. As in the typical ResNet structure [38], a shortcut connection is used to increase the feature representation ability in deeper layers. This module is used for all convolution operations except the last convolution layer in the classifier.

A temporal convolutional network is the first half of the encoder. As the skeleton sequence is interpretable along dimension T, it is reasonable to use a convolution operation to extract high-level temporal information. Appropriate downsampling helps reduce the number of calculations and alleviate overfitting. Compared to the average pooling layer, the max-pooling layer can retain more texture information.

The co-occurrence convolutional network is used to aggregate global features across the channels C. First, a transposition operation transforms the tensor from [N × M, C, T, V] to [N × M, V, T, C]. Next, strided convolution is performed on dimensions C and T with a 3 × 3 convolution kernel, and the convolution stride applies only to dimension C. The same max-pooling layer as in the temporal convolutional network is placed after each strided convolution module.

FIGURE 4. Detailed structure of the network shown in Fig.2. (a) ConvModule is the basic module for all convolution operations except the last convolution layer. (b) The temporal convolutional network extracts temporal information between consecutive frames. (c) The co-occurrence convolutional network generates global high-level features through co-occurrence feature learning. The transpose operation exchanges the dimension positions of the channels and skeleton joints. (d) The classifier provides classification scores for all streams.

The pooling layer is commonly used to achieve downsampling. However, it can be regarded as a strictly equivalent convolution and has a stringent requirement of data continuity, yet the skeleton sequence is discontinuous along the channel dimension. Springenberg et al. [39] were the first to use strided convolution, a more general downsampling operation, as a substitute for a max-pooling layer. For most classification tasks in the Euclidean domain, this structure is less accurate than the combination of a convolution layer and a max-pooling layer. However, for a skeleton sequence in a non-Euclidean domain, strided convolution can alleviate the influence of channel discontinuity. Furthermore, we use a large stride to rapidly reduce the channel size, which also increases the network's receptive field in deeper layers. Due to the smaller matrix size, the computational complexity of the operations performed by the classifiers is significantly reduced.

The classifier is the last module of each stream and is used to generate classification scores. Two convolution layers decode the high-level feature maps, and global average pooling reduces the matrix dimension. Compared to a multi-layer perceptron (MLP), a convolutional decoder has a stronger decoding ability while using fewer parameters. Global average pooling is performed on dimensions M, T and C and aligns the classification matrix with the number of action categories.
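To make the stream structure concrete, the following PyTorch-style sketch shows one possible reading of a single stream: a residual ConvModule, a temporal block with max pooling along T, a co-occurrence block that transposes C and V and applies a 3×3 convolution whose stride acts only on the original channel axis, and a convolutional classifier ending in global average pooling. Channel widths, depths, and the handling of the person dimension M are illustrative assumptions, not the exact configuration of MF-Net.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Conv2d + BN + ReLU with a residual shortcut (cf. Fig. 4a)."""
    def __init__(self, in_ch, out_ch, kernel, stride=(1, 1), padding=(0, 0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=padding)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection keeps the shortcut shape-compatible with the main branch.
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)) + self.proj(x))

class Stream(nn.Module):
    """One MF-Net-style stream: temporal encoder, co-occurrence encoder, classifier."""
    def __init__(self, in_channels, num_joints, num_classes, stride_c=4):
        super().__init__()
        # Temporal block: convolve along T (kernel 3x1), then pool T by 2.
        self.temporal = nn.Sequential(
            ConvModule(in_channels, 64, (3, 1), padding=(1, 0)),
            ConvModule(64, 64, (3, 1), padding=(1, 0)),
            nn.MaxPool2d((2, 1)),
        )
        # Co-occurrence block operates on [N*M, V, T, C]: the joints become the
        # convolution channels, and the stride acts only on the original C axis.
        self.cooccurrence = nn.Sequential(
            ConvModule(num_joints, 128, (3, 3), stride=(1, stride_c), padding=(1, 1)),
            nn.MaxPool2d((2, 1)),
        )
        # Convolutional classifier followed by global average pooling.
        self.classifier = nn.Sequential(
            ConvModule(128, 256, (3, 3), padding=(1, 1)),
            nn.Conv2d(256, num_classes, 1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):              # x: [N*M, C, T, V]
        h = self.temporal(x)           # [N*M, 64, T/2, V]
        h = h.permute(0, 3, 2, 1)      # transpose C and V -> [N*M, V, T/2, 64]
        h = self.cooccurrence(h)       # stride shrinks the original-channel axis
        return self.classifier(h).flatten(1)   # [N*M, num_classes]
```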
FIGURE 5. Example of our fusion strategy.

D. FEATURE FUSION
It has been proven that feature fusion strategies can significantly affect the result of action recognition [23], [25]. Simonyan et al. [24] adopt the strategy whereby each stream performs learning independently and the final result is the average of the classification scores. Recently, late fusion has often been applied in multi-stream networks: the encoded feature maps are concatenated at a deeper position, and the high-level feature maps are subsequently decoded. To ensure both robust feature learning and effective feature fusion, a novel feature fusion strategy is proposed. Fig.5 shows an example of applying our strategy to a four-stream network. The details are described below.

To implement independent learning for each stream, a classifier at the end of the stream is necessary to decode its feature maps. At the same position, the high-level feature maps are concatenated. The classifier of the fusion stream is similar to the feature stream classifiers, except for its larger number of channels. All classifiers output classification scores for their respective streams. Considering the importance of feature correlation, the fusion stream should have a larger weight than the feature streams. Therefore, we use a weighted average to calculate the scores,

y = y_1 + y_2 + ··· + y_n + w · y_f, (4)

where n is the number of network streams, [y_1, y_2, ···, y_n] represents the scores of the feature streams, and y_f is the score of the fusion stream. The weight w is determined by grid search, and our recommended initial value for this weight is n.

The loss function of multi-task learning has been studied increasingly often [40]–[42]. A straightforward summation of the loss functions is a practical approach. Moreover, grid search and learning-based approaches have been proposed to determine better loss weights. Regarding grid search, it is difficult to determine an appropriate weight for multiple streams, and this approach cannot be readily applied and extended. In contrast, learning-based approaches are easier to implement. For example, Kendall et al. [40] use uncertainty to learn the weight of each loss function. However, such methods are prone to overfitting. Therefore, we choose a direct summation of cross-entropy losses as the target loss function of the global optimizer.
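The concatenation-based fusion stream, the score fusion of Eq. (4), and the summed cross-entropy loss can be sketched as follows; the channel sizes and the classifier depth of the fusion head are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Classifier for the fusion stream: concatenated feature maps -> scores."""
    def __init__(self, per_stream_ch, n_streams, num_classes):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(per_stream_ch * n_streams, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, feature_maps):          # list of [N, per_stream_ch, T', V']
        return self.decode(torch.cat(feature_maps, dim=1)).flatten(1)

def fuse_scores_and_loss(stream_scores, fusion_score, labels, w):
    # Weighted score fusion, Eq. (4): y = y_1 + ... + y_n + w * y_f.
    y = torch.stack(stream_scores).sum(dim=0) + w * fusion_score
    # Training target: direct summation of the cross-entropy losses of all streams.
    loss = sum(F.cross_entropy(s, labels) for s in stream_scores)
    loss = loss + F.cross_entropy(fusion_score, labels)
    return y, loss
```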

FIGURE 6. Joint labels of the three experimental datasets: a) NTU-RGB+D, b) Kinetics, c) UTKinect.

TABLE 1. Comparisons of various low-level input features. Both cross-subject (X-Sub) and cross-view (X-View) benchmarks on the NTU-RGB+D dataset are reported. Spa, Tem, Stru, and Act represent spatial, temporal, structural and actional features, respectively.

Method X-Sub X-View
Spa 84.2% 91.3%
Tem 83.3% 89.0%
Stru 85.7% 91.2%
Act 84.3% 91.6%
Spa+Tem 87.9% 94.3%
Spa+Stru 87.7% 93.4%
Spa+Act 87.5% 93.8%
Tem+Stru 88.8% 94.2%
Tem+Act 88.4% 94.4%
Stru+Act 87.3% 93.3%
Spa+Tem+Stru 89.6% 94.9%
Spa+Tem+Act 89.5% 94.9%
Spa+Stru+Act 88.4% 94.0%
Tem+Stru+Act 89.6% 94.9%
Spa+Tem+Stru+Act 90.0% 95.4%

IV. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of our MF-Net. The experiments are performed on two large-scale action recognition datasets, NTU-RGB+D and Kinetics, and a small dataset called UTKinect. To verify the network's robustness to various features, all the low-level features in section III-B and their combinations are tested as the network input. Elaborate ablation studies on the NTU-RGB+D dataset are performed to evaluate the contributions of our proposed framework. The comparisons on these three datasets between MF-Net and other state-of-the-art methods show the effectiveness of our method.

A. DATASETS
1) NTU-RGB+D
NTU-RGB+D is the most commonly used large-scale skeleton-based action recognition dataset. It contains 56880 skeleton sequences with 60 action categories; the data were collected from 40 volunteers and captured from three horizontal view angles: 45°, 0°, and −45°. The longest duration of a sequence is 10 seconds, and the frame rate is 30 fps. Twelve joint attributes, including the 3D coordinates, are collected by the Microsoft Kinect V2 depth sensor. The skeleton joint labels are shown in Fig.6(a). The dataset includes two recommended benchmarks: cross-subject and cross-view settings. We use the partition settings suggested by Yan et al. [22]: 40091 sequences are used for training and 16487 sequences are reserved for evaluation in the cross-subject setting, whereas 37646 sequences are used for training and 18932 sequences are reserved for evaluation in the cross-view setting.

2) Kinetics
Kinetics is one of the largest human action recognition datasets, containing 300000 videos in 400 classes [43]. The original dataset contains raw video data captured from YouTube. Yan et al. [22] have processed this dataset with the OpenPose toolbox [3] to obtain skeleton sequences. Each skeleton frame has 18 joints, as shown in Fig.6(b). The available dataset is divided into a training set with 240000 sequences and a test set with 20000 sequences. Five is the maximum number of people in the raw data, and two of the five are selected for multi-person action recognition.

3) UTKinect
Xia et al. [44] provide another widely used small skeleton-based action recognition dataset. The videos in it were captured using a single stationary Kinect device. There are 10 action types in the dataset: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands. The dataset contains 10 subjects, and each subject performs each action twice. The total number of sequences is 199. Each skeleton has 20 joints, and the joint labels are shown in Fig.6(c). Both evaluation protocols provided in [44], leave-one-sequence-out and cross-subject, are used in our analysis.

B. ABLATION STUDY
1) MULTI-STREAM NETWORK
Network robustness to various low-level features is the focus of our work. To verify the independent learning ability of each feature and the improvement resulting from feature combination, the four low-level features mentioned in section III-B are combined to evaluate the network performance. All 15 combinations are tested as network input on the two benchmarks of the NTU-RGB+D dataset. The results are summarized in Table 1.
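For reference, the 15 tested inputs are simply all non-empty subsets of the four low-level features; a short sketch of how such an ablation grid could be enumerated:

```python
from itertools import combinations

FEATURES = ["Spa", "Tem", "Stru", "Act"]

# Every non-empty subset of the four features: 2^4 - 1 = 15 input combinations.
ablation_grid = [list(c) for r in range(1, len(FEATURES) + 1)
                 for c in combinations(FEATURES, r)]

for combo in ablation_grid:
    print("+".join(combo))   # e.g. "Spa", "Spa+Tem", ..., "Spa+Tem+Stru+Act"
```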

FIGURE 7. Improvement of the four-stream network over the traditional spatial-temporal network. The horizontal axis represents the various actions from the NTU-RGB+D cross-subject protocol. Warmer colors represent better performance.

Considering Table 1, we observe that all combinations attain considerable performance improvements, which verifies the effectiveness of our model. If an independent low-level feature is used as the input of the network, we observe that the lowest precision values reach 83.3%/89.0% in the cross-subject/cross-view evaluations. These results ensure that a single low-level feature can already yield effective action recognition with the proposed network. If one additional feature is combined with the original single feature, the minimum precision improvement is 1.6%/1.7%, and the maximum reaches 5.5%/5.4%. Moreover, three-stream and four-stream networks can further improve recognition performance compared to the above. Therefore, we can conclude that any additional low-level feature has the potential to improve network performance.

For a specific single feature, each feature makes a different contribution to the final result. The spatial feature is the basis of the other low-level features; thus, any additional feature combined with the spatial feature results in a steady improvement. The temporal feature alone gives the worst result; however, combinations of the temporal feature with other features result in the best accuracy and significantly improve performance. The structural and actional features are similar in some aspects, so the improvement resulting from combining them is relatively small. Even so, that small improvement still reveals the robustness of our network.

Considering the various action types, we compare the four-stream network with the traditional spatial-temporal network; the result is visualized in Fig.7. Improvements are observed for most action types, while the others exhibit slightly lower performance. Furthermore, we observe significant improvements for harder actions, e.g., reading and writing. These two actions are the most difficult to distinguish in this dataset; they exhibit improvements of 6.8%/8.0% with the four-stream network. On the other hand, some easier actions exhibit better performance with the two-stream network. These actions easily reach close to 100% accuracy with the two-stream network but exhibit overfitting as the number of network parameters increases. Such results demonstrate that a structure with more features has a better overall performance and is superior in processing more complex actions.

In conclusion, the MF-Net structure is robust enough to process various low-level features and their combinations. Using our proposed network, essential single-feature learning can attain significant accuracy, and increasing the number of streams further improves performance. Note that the improvements slow down as more input streams are used, and both the number of parameters and the training time increase significantly.

2) STRIDED CONVOLUTION
The co-occurrence convolutional network is the foundation of our proposed network, designed to extract global features across the channel dimension. Its commonly used downsampling function is a max-pooling layer [26]. Considering the discontinuity of the channel dimension, we propose using strided convolution as a substitute for the combination of a convolution layer and a pooling layer. To assess the necessity of downsampling, we also evaluate the network without downsampling and with convolution layers of various stride sizes. Table 2 presents a comparison of the various downsampling methods.

TABLE 2. Comparison of the accuracy of various downsampling methods used in the co-occurrence convolutional network.

Method X-Sub X-View
Without downsampling 89.4% 94.7%
Max-pooling 89.4% 95.3%
Strided Convolution (stride=2) 89.5% 94.7%
Strided Convolution (stride=4) 90.0% 95.4%

TABLE 3. Comparisons of various fusion strategies. The classification result of each stream in MF-Net is also presented.

Method X-Sub X-View
Early fusion 86.5% 93.4%
Late fusion 71.3% 85.2%
Average fusion 89.6% 94.7%
Spatial stream 83.4% 90.8%
Temporal stream 82.2% 90.4%
Structural stream 85.0% 91.0%
Action stream 82.3% 90.2%
Fusion stream 89.8% 95.3%
MF-Net 90.0% 95.4%

FIGURE 8. Structures of three common fusion strategies: a) average fusion, b) early fusion, c) late fusion.

Using Table 2, the effectiveness of the strided convolution layer can be verified. The co-occurrence convolutional network without downsampling has lower precision and requires a much longer training time. Compared to the max-pooling layer, strided convolution is more robust to the channel order and attains improvements of 0.6%/0.1% in the cross-subject/cross-view protocols. Considering the stride size, we observe that a network with a larger stride clearly performs better. This result reveals that large-stride downsampling is effective in increasing the receptive field and alleviating overfitting on the skeleton sequence.

3) FUSION STRATEGY
Feature fusion is a key problem in multi-feature learning, and the fusion performance largely depends on the fusion strategy. To analyze whether our fusion strategy is effective for multiple features, common fusion strategies including early fusion, late fusion, and average fusion are compared. The detailed structures of these three fusion strategies are shown in Fig.8. In Table 3, we show the individual accuracy values of the feature streams and the fusion stream in our MF-Net. The following observations can be made based on these results:
• Our fusion strategy outperforms all of the conventional fusion strategies. In comparison with the late fusion strategy, we attain large improvements of 18.7%/10.2% in the cross-subject/cross-view evaluations. In comparison with the average fusion strategy, smaller improvements of 0.4%/0.7% are observed.
• Among the common fusion strategies, average fusion outperforms early fusion and late fusion. In comparison with the maximum accuracy of single-feature learning, shown in Table 1, early fusion attains improvements of 0.8%/1.8%, and average fusion results in improvements of 3.9%/3.1%. In contrast, late fusion is particularly unsuitable for our MF-Net. The reason is that late fusion focuses on fusion performance but ignores the original features; hence, its overfitting is the most pronounced. These results fully explain the significance of independent feature learning.
• In comparison with the corresponding results presented in Table 1, the precision values of the feature streams are slightly lower when joint training is performed. As the loss function is the sum of the feature stream losses and the fusion stream loss, such a decline can be accepted. As a result, the precision of the fusion stream improves considerably.
• Although the fusion stream significantly outperforms the feature streams, using a weighted average of the fusion stream and the feature streams can further improve precision. These results demonstrate that overfitting of the feature fusion still occurs.

C. COMPARISON TO OTHER STATE-OF-THE-ART METHODS
To evaluate the performance of MF-Net under various conditions, we perform experiments on the NTU-RGB+D [11], Kinetics [43], and UTKinect [44] datasets. We compare our model with other state-of-the-art methods and present several analyses.

1) NTU-RGB+D
Following the evaluation protocols described in [11], [22], we compare our MF-Net model with other state-of-the-art methods on the NTU-RGB+D dataset. Both RNN- and CNN-based methods are compared, including PA-LSTM [11], ST-LSTM+TG [13], Temporal Conv [15], C-CNN+MTLN [14], VA-LSTM [45], ST-GCN [22], Multi-stream LSTM [31], DPRL+GCNN [46], HCN [26], SR-TSL [23] and hard sample mining [47]. Lie groups [48] are considered as a representative traditional method using handcrafted features.

TABLE 4. Action recognition performance on the NTU-RGB+D dataset.

Method X-Sub X-View
Lie Group [48] 50.1% 52.8%
PA-LSTM [11] 60.7% 67.3%
ST-LSTM+TG [13] 69.2% 77.7%
Temporal Conv [15] 74.3% 83.1%
C-CNN+MTLN [14] 79.6% 84.8%
VA-LSTM [45] 79.4% 87.6%
ST-GCN [22] 81.5% 88.3%
Multi-stream LSTM [31] 80.9% 89.6%
DPRL+GCNN [46] 83.5% 89.8%
HCN [26] 86.5% 91.1%
SR-TSL [23] 84.8% 92.4%
Hard Sample Mining [47] 86.6% 92.9%
MF-Net 90.0% 95.4%

TABLE 5. Action recognition performance on the Kinetics dataset. The methods listed at the top are video-based.

Method Top-1 Top-5
RGB [43] 57.0% 77.3%
Optical flow [43] 49.5% 71.9%
Feature Enc [9] 14.9% 25.8%
Deep LSTM [11] 16.4% 35.3%
Temporal Conv [15] 20.3% 40.0%
ST-GCN [22] 30.7% 52.8%
MF-Net 33.2% 55.5%

TABLE 6. Action recognition performance on the UTKinect dataset.

Method X-Sub Leave-one
Histogram [44] 90.9% N/A
Lie Group SE [48] 97.1% N/A
ST-LSTM+TG [13] 95.0% 97.0%
TS-LSTM [8] 97.0% N/A
ST-NBNN [49] N/A 98.0%
DPRL+GCNN [46] N/A 98.5%
SM+MM [50] 92.6% 98.5%
MF-Net 97.9% 98.5%

The results are shown in Table 4. Our MF-Net benefits from the network's robustness to multiple low-level features, significantly outperforms the other state-of-the-art methods, and achieves the best performance. Compared to the best accuracy values of the existing methods, we attain very large improvements of 3.4%/2.5% in the cross-subject and cross-view settings, respectively.

To analyze the classification result for each action type, four confusion matrices of the NTU-RGB+D cross-subject evaluation are shown in Fig.9. Based on these confusion matrices, several observations can be made as follows:
• The overall performance is quite good. The maximum true positive rate reaches 100%, and the minimum is 58%. The figures for more than half of the actions exceed 90%.
• Most cells of the grids have a white background, which represents no misclassification between the two respective action types. Few false positive and false negative rates exceed 1%. The upper left confusion matrix shows that actions with worse accuracy values are more prone to be confused with each other.
• Although multi-feature learning improves performance on more difficult actions, reading and writing still have the largest chances of misclassification. Eating a meal, playing with a phone and typing on a keyboard are the other actions with poor performance. The commonality among these actions is that their motions predominantly involve a hand.

2) Kinetics
Kinetics is a more challenging dataset than NTU-RGB+D. The joint coordinates are recovered using an estimation algorithm, so their confidence is limited compared to that of a skeleton determined by depth sensors. Another challenge is the number of action categories. Yan et al. [22] make this dataset available and provide experimental results for Feature Enc [9], Deep LSTM [11], Temporal Conv [15] and ST-GCN [22]. The two video frame-based methods, RGB and Optical flow [43], are also presented in Table 5.

Compared to the state-of-the-art skeleton-based method ST-GCN [22], our MF-Net improves accuracy by 2.5%/2.7% according to the Top-1/Top-5 metrics, which represents the best performance among skeleton-based methods. However, the accuracy values are significantly lower than those of video-based methods. Considering the histogram presented in Fig.10, several problems can be observed. There are more than 50 action types in the 0%-5% accuracy range, which is also the range with the most action types. An action's accuracy being in this range means that the model cannot classify such an action type. As the midpoint of an accuracy range increases, the respective frequency declines. Precision above 70% can only be attained for a few actions. These statistics ultimately result in a poor overall accuracy on the Kinetics dataset.

In conclusion, although our MF-Net attains improvements compared to other state-of-the-art skeleton-based methods, the model remains far from being applicable in practice and still has numerous problems to be solved.

3) UTKinect
UTKinect is a smaller skeleton dataset containing about 200 sequences, which requires strong feature extraction and overfitting resistance capabilities. Following the cross-subject and leave-one protocols, we compare our model with the Histogram [44], Lie Group SE [48], ST-LSTM+TG [13], TS-LSTM [8], ST-NBNN [49], DPRL+GCNN [46] and SM+MM [50] methods. The experimental results presented in Table 6 show that our MF-Net also achieves the best performance on the UTKinect dataset. Considering the existence of data errors, the model is close to classifying all samples correctly.

FIGURE 9. Confusion matrix on the NTU-RGB+D dataset under the cross-subject protocol. As the overall confusion matrix is oversized, we divide the action types
into four categories based on accuracy; each category has 15 actions. The four confusion matrices for the respective categories are presented in this figure. The
upper left matrix has the worst performance, and the bottom right matrix is the best. All positive conditions are shown in color, and conditions with values greater
than 0.01 are annotated with specific probability figures. Darker colors represent more significant misclassification.

V. DISCUSSION
The experiments described in section IV allow the effectiveness and robustness of our proposed MF-Net to be fully evaluated. On the three skeleton-based action recognition datasets of different sizes, our model outperforms other state-of-the-art methods. The most significant advantage of our approach is that the network is robust to various low-level features. Each low-level feature can be independently learned, and feature fusion further extracts the correlation among the high-level features. Generally, the used features overlap in some aspects, e.g., the structural and actional features are similar, but their fusion can still improve the ultimate performance. These results show that any feature combination has the potential to further improve performance. The computation time is another priority of our analysis. Powerful data augmentation, the use of more pooling layers, and a large-stride convolution layer all lead to a smaller matrix size. As a result, our model is clearly more efficient than other open-source methods.

Although our proposed network achieves state-of-the-art performance, there are also several problems that merit discussion.
• While considering more low-level features increases the network's robustness, it also raises the parameter count. A larger parameter count reduces the network's efficiency and causes the model to overfit. Although the overall accuracy clearly improves, the results for some easier actions worsen.
• Overfitting is the most severe problem in our work. Despite performing several operations, including data augmentation, rapid downsampling, and L2 normalization with a large weight, overfitting still occurs. For example, the training accuracy on the NTU-RGB+D dataset reaches 100% in some cases.
• If actions with finer movements, e.g., those mostly involving a hand, are analyzed, recognition performance clearly declines. Using too many pooling layers results in some action details being blurred. The ability to process fine actions should be explored in our future research.

FIGURE 10. Histogram for the Kinetics skeleton dataset. The horizontal axis represents recognition accuracy, and the vertical axis represents the number of action types with accuracy in the corresponding range.

VI. CONCLUSION
In this paper, we propose a fully convolutional network for skeleton-based action recognition that can robustly learn and fuse multiple low-level skeleton features. The combination of the temporal convolutional network and the co-occurrence convolutional network ensures the effectiveness of local and global learning, and the use of a convolutional classifier improves feature fusion. Our fusion strategy balances independent learning of the feature streams and effective correlation in the fusion stream. Detailed ablation studies confirm that our network is robust to its input features and that multi-feature learning is effective in improving recognition accuracy. Although some problems, including overfitting, remain to be solved, this multi-feature structure provides a straightforward yet reliable method for further improving the ultimate recognition performance.

REFERENCES
[1] R. Poppe, “A survey on vision-based human action recognition,” Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.
[2] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Computer Vision and Image Understanding, vol. 115, no. 2, pp. 224–241, 2011.
[3] B. D. Lucas, T. Kanade, et al., “An iterative image registration technique with an application to stereo vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] J. K. Aggarwal and L. Xia, “Human activity recognition from 3d data: A review,” Pattern Recognition Letters, vol. 48, pp. 70–80, 2014.
[6] L. L. Presti and M. La Cascia, “3d skeleton-based human action classification: A survey,” Pattern Recognition, vol. 53, pp. 130–147, 2016.
[7] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang, “Rgb-d-based action recognition datasets: A survey,” Pattern Recognition, vol. 60, pp. 86–105, 2016.
[8] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[9] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[10] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[11] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in European Conference on Computer Vision (ECCV), 2016.
[13] I. Lee, D. Kim, S. Kang, and S. Lee, “Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks,” in IEEE International Conference on Computer Vision (ICCV), 2017.
[14] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] T. S. Kim and A. Reiter, “Interpretable 3d human action analysis with temporal convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] Y. Hou, S. Wang, P. Wang, Z. Gao, and W. Li, “Spatially and temporally structured global to local aggregation of dynamic depth information for action recognition,” IEEE Access, vol. 6, pp. 2206–2219, 2017.
[17] P. Wang, Z. Li, Y. Hou, and W. Li, “Action recognition based on joint trajectory maps using convolutional neural networks,” in Proceedings of the 24th ACM International Conference on Multimedia, pp. 102–106, ACM, 2016.
[18] J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, and A. K. Chichung, “Skeleton-based online action prediction using scale selection network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[19] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[20] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems (NIPS), 2016.
[21] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[22] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in AAAI Conference on Artificial Intelligence, 2018.
[23] C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan, “Skeleton-based action recognition with spatial reasoning and temporal stack learning,” in European Conference on Computer Vision (ECCV), 2018.
[24] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems (NIPS), 2014.
[25] C. Li, Q. Zhong, D. Xie, and S. Pu, “Skeleton-based action recognition with convolutional neural networks,” in IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2017.
[26] C. Li, Q. Zhong, D. Xie, and S. Pu, “Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018.
[27] S. Zhang, Y. Yang, J. Xiao, X. Liu, Y. Yang, D. Xie, and Y. Zhuang, “Fusing geometric features for skeleton-based action recognition using multilayer lstm networks,” IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2330–2343, 2018.
[28] Z. Ding, P. Wang, P. O. Ogunbona, and W. Li, “Investigation of different skeleton features for cnn-based 3d action recognition,” in IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 617–622, IEEE, 2017.
[29] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, “Rgb-d-based human motion recognition with deep learning: A survey,” Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.
[30] M. Fu, N. Chen, Z. Huang, K. Ni, Y. Liu, S. Sun, and X. Ma, “Human action recognition: A survey,” in International Conference on Signal and Information Processing, Networking and Computers, pp. 69–77, Springer, 2018.
[31] L. Wang, X. Zhao, and Y. Liu, “Skeleton feature fusion based on multi-stream lstm for action recognition,” IEEE Access, vol. 6, pp. 50788–50800, 2018.


[32] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, "A multi-stream bi-directional recurrent neural network for fine-grained action detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] C. G. Snoek, M. Worring, and A. W. Smeulders, "Early versus late fusion in semantic video analysis," in ACM International Conference on Multimedia, 2005.
[34] H. Gunes and M. Piccardi, "Affect recognition from face and body: early fusion vs. late fusion," in IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2005.
[35] C. Li, Y. Hou, P. Wang, and W. Li, "Multiview-based 3-d action recognition using deep networks," IEEE Transactions on Human-Machine Systems, vol. 49, no. 1, pp. 95–104, 2018.
[36] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks," in AAAI Conference on Artificial Intelligence, 2016.
[37] H. Wang and C. Schmid, "Action recognition with improved trajectories," in IEEE International Conference on Computer Vision (ICCV), 2013.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
[40] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[41] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, "Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks," in International Conference on Machine Learning (ICML), 2018.
[42] S. Liu, Y. Liang, and A. Gitter, "Loss-balanced task weighting to reduce negative transfer in multi-task learning," in AAAI Conference on Artificial Intelligence (Student Abstract), 2019.
[43] W. Kay et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[44] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., "View invariant human action recognition using histograms of 3d joints," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012.
[45] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, "View adaptive recurrent neural networks for high performance human action recognition from skeleton data," in IEEE International Conference on Computer Vision (ICCV), 2017.
[46] Y. Tang, Y. Tian, J. Lu, P. Li, and J. Zhou, "Deep progressive reinforcement learning for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[47] R. Cui, G. Hua, A. Zhu, J. Wu, and H. Liu, "Hard sample mining and learning for skeleton-based human action recognition and identification," IEEE Access, vol. 7, pp. 8245–8257, 2018.
[48] R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3d skeletons as points in a lie group," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[49] J. Weng, C. Weng, and J. Yuan, "Spatio-temporal naive-bayes nearest-neighbor (st-nbnn) for skeleton-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[50] M. Liu, Q. He, and H. Liu, "Fusing shape and motion matrices for view invariant action recognition using 3d skeletons," in IEEE International Conference on Image Processing (ICIP), 2017.
[51] M. Jain, H. Jegou, and P. Bouthemy, "Better exploiting motion for better action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

APPENDIX A DATA AUGMENTATION
The duration of an action and the speed of its motion largely depend on the performer [51]. Thus, the network is prone to overfitting to individual subjects. Random sampling is the most straightforward data augmentation method used to increase the randomness of motion speed; however, it loses action continuity. To effectively balance randomness and continuity, we propose a simple K-partition random sampling method, as shown in Fig. 11.

FIGURE 11. Visualization of our K-partition random sampling.

A skeleton sequence $A \in \mathbb{R}^{C \times T \times V}$ can be represented by the C-dimensional locations of V feature vectors in a video with T frames. We divide the sequence into K parts $[A_1, A_2, \dots, A_K]$ of the same length, with $A_i \in \mathbb{R}^{C \times (T/K) \times V}$. Afterwards, each part is randomly sampled to obtain a new sequence part $A'_i \in \mathbb{R}^{C \times (T'/K) \times V}$. Finally, $[A'_1, A'_2, \dots, A'_K]$ are concatenated, with the frame order inside each part preserved by sorting, to form a new sequence $A' \in \mathbb{R}^{C \times T' \times V}$. In the case of $T < T'$, repeated sampling is allowed.

To verify the effectiveness of our data augmentation method, we compare the proposed K-partition method with other schemes on the NTU-RGB+D dataset. Crop-resize is an operation that randomly cuts away the boundary of a sequence and then downsamples it to a fixed length. Straightforward random sampling and no augmentation are considered as two alternative schemes. The influence of the parameter K is also studied. The experimental results are presented in Table 7.

TABLE 7. Comparison of accuracy of K-partition data augmentation and other schemes.

Method                  X-Sub    X-View
Without augmentation    89.0%    95.0%
Crop-resize             89.4%    94.9%
Random sampling         89.6%    95.1%
K-partition (k=2)       89.6%    95.2%
K-partition (k=4)       90.0%    95.4%
K-partition (k=8)       89.7%    95.3%

Table 7 shows that crop-resize and plain random sampling bring little or no improvement under the cross-view evaluation. Among these methods, our K-partition random sampling achieves the best performance, improving accuracy by 1.0%/0.4% under the two protocols. As for the parameter K, the model achieves its best performance when K = 4.
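To make the sampling procedure concrete, the following is a minimal NumPy sketch of K-partition random sampling. It is an illustration under simplifying assumptions (T and the target length are divisible by K, and partitioning is done purely by frame index, without the interpolation used when reading raw data); the function and variable names are our own, not taken from the paper's code.

```python
import numpy as np

def k_partition_sample(seq, k=4, target_len=128):
    """Resample a skeleton sequence of shape (C, T, V) to (C, target_len, V).

    The T frames are split into k equal parts; target_len // k frames are
    drawn at random from each part (with replacement, so T < target_len is
    allowed) and kept in temporal order, preserving coarse action continuity.
    """
    C, T, V = seq.shape
    part_len = T // k                   # frames per input part
    out_per_part = target_len // k      # frames drawn from each part
    pieces = []
    for i in range(k):
        start = i * part_len
        # sorted random indices inside this part keep the frames in temporal order
        idx = np.sort(np.random.choice(part_len, size=out_per_part, replace=True))
        pieces.append(seq[:, start + idx, :])
    return np.concatenate(pieces, axis=1)   # (C, target_len, V)
```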
TABLE 8. Specific convolution layers' parameters. The layers are listed in order, and their positions in the entire network are shown in Fig. 2 and Fig. 4. The parameter n_joint represents the number of skeleton joints, n_stream denotes the number of streams in the network, and n_class represents the number of action types.

Layer          Channels                              Kernel    Stride
Conv1          3 × 128                               3 × 1     1 × 1
Conv2          128 × 128                             3 × 1     1 × 1
Conv3          128 × 256                             3 × 1     1 × 1
Conv4          256 × 256                             3 × 1     1 × 1
StridedConv1   n_joint × 128                         3 × 3     1 × 4
StridedConv2   128 × 256                             3 × 3     1 × 4
StridedConv3   256 × 256                             3 × 3     1 × 2
Conv5          (n_stream × 256) × (n_stream × 256)   3 × 3     1 × 1
Conv6          (n_stream × 256) × n_class            1 × 1     1 × 1

TABLE 9. Evaluation time on the NTU-RGB+D dataset for various models.

Model          Time    Model              Time
ST-GCN [22]    74s     MF-Net             31s
Nstride=1      52s     Nstride=2          41s
No Sampling    39s     Random Sampling    40s
Nstream=2      20s     Nstream=3          27s

APPENDIX B EXPERIMENTAL SETTINGS AND MODEL ARCHITECTURE
We use PyTorch 1.0 and train the model for 50 epochs on two GTX-1080Ti GPUs. An SGD optimizer with L2 regularization (weight decay) is used to train the model. Because convergence behavior differs across the three datasets, the weight decay value ranges from 0.001 to 0.01. The training batch size is 64. A stepped learning rate schedule is used to accelerate network convergence: the initial learning rate is 0.01, and the rate is multiplied by 0.1 every 10 epochs.
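As a concrete illustration, the optimizer and schedule described above can be configured in PyTorch roughly as follows. This is a sketch under stated assumptions: `model`, `criterion`, and `train_loader` are placeholders, the momentum value is not specified in the paper, and the weight decay must be picked from the 0.001–0.01 range per dataset.

```python
import torch

# Assumed placeholders: model, criterion, train_loader (batch size 64).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9,        # momentum not stated in the text; 0.9 assumed
                            weight_decay=0.001)  # chosen from the 0.001-0.01 range per dataset
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(50):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # multiply the learning rate by 0.1 every 10 epochs
```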
To ensure the robustness of our approach, the same model architecture is used on all three experimental datasets. The specific parameters of the convolution layers are presented in Table 8. The input sequence length is resampled to 128 frames using our proposed K-partition data augmentation method. All max-pooling layers use a (2, 1) kernel. To handle multi-person actions, we merge the person dimension into the batch dimension at the beginning of the network, and the per-person scores are averaged at the global average pooling layer to compute the final classification scores. In addition, we perform preprocessing for the NTU-RGB+D cross-view setting to normalize the skeletons.
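The multi-person handling described above can be sketched as follows. This is a simplified illustration: the (batch, person, channel, frame, joint) input layout, the `backbone`/`classifier` names, and averaging the final scores (rather than averaging exactly at the global average pooling layer) are our assumptions for brevity.

```python
import torch

def forward_multi_person(x, backbone, classifier):
    # x: (N, M, C, T, V) = batch, persons, channels, frames, joints (assumed layout)
    N, M, C, T, V = x.shape
    x = x.reshape(N * M, C, T, V)                 # merge the person dim into the batch dim
    feat = backbone(x)                            # per-person feature maps
    scores = classifier(feat)                     # (N * M, n_class) class scores
    return scores.reshape(N, M, -1).mean(dim=1)   # average the persons' scores per sample
```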

APPENDIX C NETWORK EFFICIENCY


Network efficiency is another focus of our work. Combining more features to compose the multistream network usually implies an increase in the required time. However, several operations in MF-Net can offset this shortcoming. Using the settings in Appendix B, we evaluate the influence of these operations on the NTU-RGB+D cross-subject protocol, using our MF-Net and ST-GCN [22] as baselines. Table 9 reports the total time required by the various models to analyze the 16,487 test sequences.

Table 9 shows that our MF-Net is much faster than ST-GCN in the same environment. Using a large-stride co-occurrence convolutional network substantially reduces the feature matrix size; the resulting smaller model size and larger batch size are the other reasons a large stride is effective in reducing the required time. Random sampling entails almost no additional time consumption; however, the interpolation in our K-partition random sampling increases the data-reading time, so the time reduction obtained from augmentation is limited. Moreover, considering more feature streams increases the required time, which grows at a rate between O(n) and O(n²), where n is the number of streams.