Dynamic gesture recognition based on deep learning in human-to-computer interfaces
Abstract
Gesture recognition provides a fast, simple, convenient, effective and natural way for human-computer interaction and has attracted wide attention. It plays an important role in real life. Manual feature extraction in traditional gesture recognition methods is time-consuming and laborious. Moreover, improving recognition accuracy requires a large number of high-quality features, which is a bottleneck for traditional methods. Therefore, we propose a deep learning method for dynamic gesture recognition in human-to-computer interfaces. An improved inverted residual network architecture is used as the base of the SSD (Single Shot MultiBox Detector) network for feature extraction. The prediction convolutions of the auxiliary layers are built from the inverted residual structure combined with dilated convolution. This uses multi-scale information and reduces the amount of calculation and the number of parameters. Transfer learning is used to optimize the trained network model so as to reduce the training time and help the model converge. Experimental results show that the proposed method can recognize different gestures quickly and effectively.
Key Words: Gesture Recognition, Deep Learning, Human-to-Computer Interfaces, Feature Extraction
to the templates using a Sequential Monte Carlo inference technique. Many other approaches have also been proposed to detect gestures. However, problems such as low efficiency and long processing time remain.
A deep learning model is a complex, multi-layer artificial neural network. Such models have strong nonlinear modeling ability and use a general learning process to learn features from data. Compared with traditional hand-designed features, a deep learning model can express higher-level and more abstract internal features [7-9].
The deep convolutional neural network (CNN) is an effective method for image feature extraction. Because of its invariance to translation and rotation of image information, it has become a popular method in image processing and target recognition. At present, most research on gesture recognition focuses on a single hand. In gesture interaction, however, two-handed operation and the hands of other people often occur. For gesture recognition of multiple hands, this paper proposes a dynamic gesture recognition method based on a deep convolutional neural network. Our contributions are as follows:
1. Features are extracted by an improved inverted residual network architecture used as the base of SSD.
2. The prediction convolutions of the auxiliary layers use the inverted residual structure combined with dilated convolution and multi-scale information, which reduces the amount of calculation and the number of parameters.
3. Transfer learning is used to optimize the trained network model so as to reduce the training time and help the model converge.
4. Experimental results show that the proposed method can recognize different gestures quickly and effectively.
The rest of this paper is organized as follows. The next section introduces the proposed SSD-based method for gesture recognition in detail. Section 3 presents experiments and analysis. Section 4 concludes the paper.
2. Gesture Recognition Model in Deep Learning
The main methods of object recognition based on deep convolutional neural networks are RCNN, Fast RCNN, Faster RCNN and SSD, etc. [10-13]. On the PASCAL VOC data set, the object recognition rate of Faster RCNN was 73.2% at 7 frames per second. The recognition rate of SSD was 72.1% at 58 frames per second. The recognition rate of Faster RCNN was thus slightly higher than that of SSD, but SSD was about eight times faster. The recognition rate of YOLO was 63.4% at 45 frames per second; its speed was similar to that of SSD, but its recognition rate was significantly lower. In this paper, a modified SSD (MSSD) model is selected as the recognition model.
2.1 SSD Network Structure
The SSD target detection model does not require time-consuming region generation and feature re-sampling steps. By directly convolving the whole image and predicting the category and corresponding coordinates of the objects contained in the image, it greatly improves the detection speed. Meanwhile, the accuracy of target detection is greatly improved by using small convolution kernels and multi-scale prediction.
The SSD network structure is divided into a base network and an auxiliary network. The base network is a network with high accuracy in image classification, with its classification layer removed. The auxiliary network is a convolutional structure added on top of the base network for target detection. The size of these layers gradually decreases so that multi-scale prediction can be made. Each added auxiliary layer produces a fixed set of predictions through a series of convolution kernels. For an m × n × p feature layer (p is the number of channels, m and n are the spatial size), each auxiliary layer uses a 3 × 3 × p convolution kernel to predict the score for one class, and it predicts the corresponding values at all m × n positions.
The SSD model predicts k bounding boxes at each position of the feature map. At the same time, it predicts the score of an object appearing at this position and the offset of the object position relative to the bounding box. Thus, c × k scores (c is the number of categories) and 4k position offsets are predicted at each position of the feature map. For a feature map
with m × n size, it predicts (c + 4) × k × m × n outputs. Finally, non-maximum suppression is applied to obtain the final predicted object categories and positions in the image.
2.2 Modified SSD Network Structure
The SSD model uses the VGG network as its base network. However, the VGG model has a large number of parameters and occupies most of the running time in feature extraction. Moreover, in forward propagation, nonlinear transformations always cause information loss.
Shen [14] put forward the nonlinear activation function ReLU based on manifold learning theory. In a high-dimensional space ReLU retains information well, whereas in a low-dimensional space it causes greater loss of information. Therefore, the input layer should increase the feature dimension before the nonlinear transformation, and the output layer should use a linear activation function when reducing the feature dimension, so as to reduce the loss of information. This leads to the inverted residual block.
The down-sampling operation in the inverted residual structure loses feature information while increasing the receptive field of the convolution kernel. Therefore, we abandon the down-sampling operation in the convolution structure and introduce dilated convolution to solve this problem. Dilated convolution adds a dilation parameter to the original convolution operation. It expands the convolution kernel to the corresponding scale and fills the unused positions of the expanded kernel with zeros. Dilated convolution can thus increase the receptive field of the convolution kernel without down-sampling. However, it makes the data sampled by the convolution kernel discontinuous, so small objects are not identified well. This paper uses hierarchical feature fusion to solve the problems caused by dilated convolution.
Hierarchical feature fusion sums the outputs of the convolution units in the dilated convolution layer, and the result of each sum is concatenated to obtain the final output.
The inverted residual structure adopts ReLU6 as the activation function, and its output is
$Y = \min(\max(X, 0), 6)$ (1)
where Y is the output of the ReLU6 activation function and X is the input feature value.
Compared with ReLU, ReLU6 is more robust in low-precision computation. In addition, 3 × 3 convolution kernels are used. Dropout and batch normalization are used during training to reduce overfitting. The improved inverted residual structure is shown in Figure 1, where Dilated denotes the dilated convolution, Linear is the linear activation function, HFF represents hierarchical feature fusion, and Dwise represents a depthwise separable convolution structure.
Combining the improved inverted residual structure, we modify the base layer and the auxiliary layers of the SSD model. (1) The original SSD uses the VGG network as the base layer for feature extraction, but the VGG model is not suitable for deployment on mobile devices. We therefore use MobileNetV2, which is built on the inverted residual structure and has fewer parameters, a smaller footprint and a faster running speed, as the SSD feature extraction network to reduce the model size and the amount of calculation. (2) The original SSD uses traditional convolutional structures in its auxiliary layers, which leads to a large number of parameters and a large amount of calculation. Using the improved inverted residual structure as the basic structure of the auxiliary layers reduces the information loss caused by nonlinear transformations in the learning process and gives the convolution kernels a multi-scale receptive field.
2.3 Loss Function in MSSD Network Structure
Generating recognition boxes in the MSSD model is a regression process, and judging the category within a recognition box is a classification process. The total objective loss function is the weighted sum of the position loss (loc) and the confidence loss (conf):
$L(c, l, g) = \frac{1}{N}\left(L_{conf}(c) + \alpha L_{loc}(l, g)\right)$ (2)
where N is the number of default boxes matched to the ground-truth boxes, and $\alpha = 1$ is the weight term, set according to the actual experiments. $L_{conf}(c)$ is the Softmax cross-entropy classification loss, and c is the confidence of each category. In $L_{loc}(l, g)$, $l = (l_x, l_y, l_w, l_h)$ denotes the predicted box center (x, y), width (w) and height (h), and $g = (g_x, g_y, g_w, g_h)$ represents the ground-truth center position (x, y), width (w) and height (h).
[Equation (3)]
where
[Equation (4)]
3. Experiments on Gesture Recognition
3.1 Data Set Analysis
To train the MSSD model, a gesture image data set taken from the first-person perspective is used. The experiment adopts the EgoHands gesture data set created by Indiana University [15]. The EgoHands images were shot with the wearable device Google Glass. Two people interact with each other in the first-person perspective. The data set contains 4800 images, and each image contains 4 categories: the wearer's own left hand (owlh), the wearer's own right hand (owrh), the opposite person's left hand (oplh) and the opposite person's right hand (oprh). Each image is labelled with the gesture region positions of the 4 categories, as shown in Figure 2.
The training set, verification set and test set used to train the MSSD model are shown in Table 1.
3.2 Evaluation Index
We adopt the following evaluation indexes to analyze the effectiveness of the proposed model.
1. IoU (intersection over union) is defined as the ratio of the intersection to the union of the areas occupied by two boxes [16]:
$IoU = \frac{area(P \cap GT)}{area(P \cup GT)}$ (5)
where P is the predicted box and GT is the ground truth.
2. Precision and recall are two well-known quantitative indexes. The gesture recognition model classifies the contents of the identified boxes, predicts the probability of each of the four gesture categories, and takes the most likely one as the classification result.
$Precision = \frac{TP}{TP + FP}$ (6)
$Recall = \frac{TP}{TP + FN}$ (7)
$F\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (8)
where TP is the number of correctly detected gestures. FP is
the number of other postures detected as the gesture. FN is the number of missed gestures. F-score balances Precision and Recall; the closer it is to 1, the better the model.
3. mAP (mean Average Precision) is an index that reflects the global performance.
[Equation (9)]
3.3 Fine-tuning Network and Transfer Learning
We first verify the effect of IoU on the recognition accuracy of the proposed method. In Figure 3, the blue bar is the accuracy rate of recognition and the red bar is the error rate. When IoU = 0.3, the recognition rate is high, but the error rate is high too. When IoU = 0.6, the result is similar to that for IoU = 0.9, but IoU = 0.9 needs more time to process one image. Therefore, we choose IoU = 0.6 in this paper.
For the gesture recognition problem in gesture interaction, the parameters of the MSSD model are changed as follows. The VGG-16 recognition model trained on the PASCAL VOC dataset is used to initialize the parameters of the base network in the MSSD model. The first two layers are fixed and do not participate in back propagation. The targets to be identified are divided into four categories plus one background category, so the total number of categories is set to 5. The maximum number of recognition results per frame is set to 4, and the maximum number of recognition results per class is set to 1. This setting shows only the most likely recognition result in each gesture class, which greatly reduces false recognition in each class. Training and testing of the MSSD model use the Caffe deep learning framework, and the graphics card is an NVIDIA GTX 1060. The original image size of the EgoHands dataset is 1240×720 pixels, which is adjusted to 600×600 during training. The training strategy is shown in Table 2. In this paper, fine-tuning and transfer learning are used to improve the MSSD model network.
The size of the input image and the size of the feature map with ground-truth boxes affect the recognition accuracy of the MSSD model [17]. The added BN layer also affects the recognition accuracy of the deep learning model. This experiment fine-tunes the MSSD model structure.
In the experiment, the input image is resized from 1240×720 to 600×600 and 300×300, and the trained models are denoted MSSD6 and MSSD3, respectively. In addition, boxes are added at each pixel of the Conv3×3 layer extracted from the VGG-16 base network. The Conv3×3 layer is also included in the calculation of the loss function and in the back propagation of box recognition, and the resulting model is denoted MSSD+Conv3. The results are shown in Table 3 and Figure 4.
Transfer learning means that a learning algorithm can use the commonalities among different learning tasks to share statistical strength and transfer knowledge between tasks. Transfer learning can shorten the training time and improve the recognition rate of the model.
Bambach [18] proposed a model for EgoHands gesture recognition based on the Caffenet network. In the experiment, the base network of the MSSD model was changed appropriately, and then the parameters of the Caffenet model and of the residual network model (Resnet) were transferred to the MSSD model for training.
In the experiment, the MSSD model is adjusted by replacing the VGG base network with the top-5 layer network of the Caffenet model. Then, the parameters of the Caffenet model in [18] are transferred to the basic net-
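The evaluation indexes of Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the box format (x_min, y_min, x_max, y_max), the function names, and the reading of the F-score as the harmonic mean of precision and recall are assumptions made here for illustration. Only the IoU definition, the TP/FP/FN counts and the IoU threshold of 0.6 come from the text.

```python
def iou(p, gt):
    """Intersection over union of two boxes.

    Assumed box format (not specified in the paper):
    (x_min, y_min, x_max, y_max).
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(p[0], gt[0]), max(p[1], gt[1])
    ix2, iy2 = min(p[2], gt[2]), min(p[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score from detection counts.

    The F-score is taken here as the harmonic mean of precision
    and recall (an assumption; the paper only names the index).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


def is_match(p, gt, threshold=0.6):
    """A prediction counts as a hit when its IoU reaches the threshold
    (0.6 is the value chosen in Section 3.3)."""
    return iou(p, gt) >= threshold
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` gives an overlap of 1 over a union of 7, so this pair would not count as a match at the 0.6 threshold.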
for interactive systems, ACM Transactions on Interactive Intelligent Systems 4(4), 1-34. doi: 10.1145/2643204
[7] Gao, J., P. Li, and Z. K. Chen (2019) A canonical polyadic deep convolutional computation model for big data feature learning in Internet of Things, Future Generation Computer Systems. doi: 10.1016/j.future.2019.04.048
[8] Lin, T., H. Li, and S. L. Yin (2018) Modified pyramid dual tree direction filter-based image de-noising via curvature scale and non-local mean multi-grade remnant filter, International Journal of Communication Systems 31(16). doi: 10.1002/dac.3486
[9] Yin, S. L., and J. Bi (2019) Medical image annotation based on deep transfer learning, Journal of Applied Science and Engineering 22(2), 385-390. doi: 10.6180/jase.201906_22(2).0020
[10] Yin, S. L., Y. Zhang, and S. Karim (2018) Large scale remote sensing image segmentation based on fuzzy region competition and Gaussian mixture model, IEEE Access 6, 26069-26080. doi: 10.1109/ACCESS.2018.2834960
[11] Yin, S. L., Y. Zhang, and S. Karim (2019) Region search based on hybrid CNN in optical remote sensing images under cloud computing environment, International Journal of Distributed Sensor Networks 15(5). doi: 10.1177/1550147719852036
[12] Ren, S., K. He, R. Girshick, et al. (2017) Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis & Machine Intelligence 39(6), 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[13] Li, J., H. C. Wong, S. L. Lo, et al. (2018) Multiple object detection by deformable part-based model and R-CNN, IEEE Signal Processing Letters PP(99):1-1. doi: 10.1109/LSP.2017.2789325
[14] Shen, J., J. Bu, B. Ju, et al. (2012) Refining Gaussian mixture model based on enhanced manifold learning, Neurocomputing 87(1), 19-25. doi: 10.1016/j.neucom.2012.01.029
[15] Bambach, S., S. Lee, D. J. Crandall, et al. (2015) Lending A hand: detecting hands and recognizing activities in complex egocentric interactions, 2015 IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society. doi: 10.1109/ICCV.2015.226
[16] Lepetit-Aimon, G., R. Duval, and F. Cheriet (2018) Large receptive field fully convolutional network for semantic segmentation of retinal vasculature in fundus images, International Workshop on Computational Pathology 201-209. doi: 10.1007/978-3-030-00949-6_24
[17] Liu, W., D. Anguelov, D. Erhan, et al. (2016) SSD: single shot MultiBox detector, European Conference on Computer Vision. ECCV, 21-37. doi: 10.1007/978-3-319-46448-0_2
[18] Bambach, S., S. Lee, D. J. Crandall, et al. (2015) Lending A hand: detecting hands and recognizing activities in complex egocentric interactions, 2015 IEEE International Conference on Computer Vision (ICCV). IEEE Computer Society. doi: 10.1109/ICCV.2015.226
[19] Zhou, Z., Z. Cao, and Y. Pi (2018) Dynamic gesture recognition with a Terahertz Radar based on range profile sequences and Doppler signatures, Sensors 18(1), 10. doi: 10.3390/s18010010
[20] Verma, B., and A. Choudhary (2018) Framework for dynamic hand gesture recognition using Grassmann manifold for intelligent vehicles, IET Intelligent Transport Systems 12(7), 721-729. doi: 10.1049/iet-its.2017.0331
[21] Zhang, Z., Z. Tian, and Z. Mu (2018) Latern: dynamic continuous hand gesture recognition using FMCW radar sensor, IEEE Sensors Journal 18(8), 1-1. doi: 10.1109/JSEN.2018.2808688
[22] Nguyen, X. S., L. Brun, O. Lezoray, et al. (2019) Skeleton-based hand gesture recognition by learning SPD matrices with neural networks, IEEE International Conference on Automatic Face & Gesture Recognition (FG). IEEE. doi: 10.1109/FG.2019.8756512
Manuscript Received: Jul. 22, 2019
Accepted: Oct. 19, 2019