
Development of a hand pose recognition system on an embedded computer using Artificial Intelligence


Dennis Núñez Fernández
Universidad Nacional de Ingeniería
Lima, Peru
[email protected]

Abstract—The recognition of hand gestures is a very interesting research topic due to the growing demand in recent years in robotics, virtual reality, autonomous driving systems, human-machine interfaces and other new technologies. Despite several approaches to building a robust recognition system, gesture recognition based on visual perception has many advantages over devices such as sensors or electronic gloves. This paper describes the implementation of a vision-based recognition system on an embedded computer for the recognition of 10 hand poses. Hand detection is achieved using a tracking algorithm, and classification by a light convolutional neural network. Results show an accuracy of 94.50%, low power consumption and a near real-time response. Thereby, the proposed system could be applied in a large range of applications, from robotics to entertainment.

Index Terms—Gesture Recognition, Human-Machine Interaction, Recognition System, Hand Poses, Embedded Computer.

I. INTRODUCTION

Hand gesture recognition is one obvious strategy for building user-friendly interfaces between machines and users. In the near future, hand posture recognition technology would allow for the operation of complex machines and smart devices through only a series of hand postures and finger and hand movements, eliminating the need for physical contact between man and machine. Hand gesture recognition on images from a common single camera is a difficult problem because of occlusions, variations of posture appearance, differences in hand anatomy, etc. Despite these difficulties, several approaches to gesture recognition on color images have been proposed during the last decade [1].

In recent years, Convolutional Neural Networks (CNNs) have become the state of the art for object recognition in computer vision [2]. In spite of the high potential of CNNs in object detection [3] [4] and image segmentation [2] tasks, only a few papers report successful results (a recent survey on hand gesture recognition [1] reports only one important work [5]). Some obstacles to wider use of CNNs are high computational costs, the lack of sufficiently large datasets, as well as the lack of hand detectors appropriate for CNN-based classifiers. In [6], a CNN was used for the classification of six hand poses to control robots using colored gloves. In more recent work [7], a CNN was implemented on the Nao robot. In [8], a CNN was trained on one million hand images; however, only a portion of the dataset, with 3361 manually labeled frames in 45 classes of sign language, is publicly available.

In this work we developed a system for hand pose recognition that works on embedded computers with limited computational resources and low power consumption. In order to accomplish these targets, we employ low-processing algorithms and trained a light CNN, which was optimized to balance high accuracy, fast time response, low power consumption and low computational costs.

II. METHODOLOGY

A. Overview

The proposed system works with images captured from a standard CMOS camera and is executed on embedded computers with low computational resources, without GPU support, such as the Raspberry Pi, BeagleBone Board, Banana Pi, Intel Galileo Board, and others. Therefore, the main objectives of the proposed system are as follows: a high accuracy rate, fast time response, low power consumption and low computational costs.

The system is composed of three main steps: hand detection, hand region tracking and hand gesture recognition. In the first step, the Haar cascades classifier detects a basic hand shape in order to obtain a good initial hand detection. Then, this hand region is tracked using the MIL (Multiple Instance Learning) tracking algorithm. Finally, hand gesture recognition is performed by a trained Convolutional Neural Network. Since the steps described above are designed to consume few computational resources, the whole system can be implemented on a personal computer as well as on a Raspberry Pi board. Fig. 1 shows the steps mentioned above.

Fig. 1: Diagram for the proposed system

978-1-7281-3646-2/19/$31.00 ©2019 IEEE
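The detect-track-classify pipeline described above can be sketched as a simple per-frame loop. The three components below are hypothetical stubs standing in for the Haar cascade detector, the MIL tracker and the trained CNN; this is a structural sketch only, not the paper's C++ implementation.

```python
# Sketch of the three-step loop: detect the hand once, then track the
# region and classify the pose on every frame. All three components are
# placeholder stubs, not the real detector/tracker/CNN.

def detect_hand(frame):
    """Haar-cascade stand-in: return a bounding box (x, y, w, h) or None."""
    return (10, 10, 48, 48) if frame is not None else None

def update_tracker(frame, box):
    """MIL-tracker stand-in: follow the box into the new frame."""
    x, y, w, h = box
    return (x, y, w, h)  # a real tracker would shift the box here

def classify_pose(box):
    """CNN stand-in: return one of the 10 pose labels."""
    return "pose_0"

def recognize(frames):
    """Run the full pipeline over a sequence of frames."""
    box, labels = None, []
    for frame in frames:
        # Step 1: (re)detect the hand until a region is found
        if box is None:
            box = detect_hand(frame)
            if box is None:
                continue
        # Step 2: track the wrist region into the current frame
        box = update_tracker(frame, box)
        # Step 3: classify the tracked region with the CNN
        labels.append(classify_pose(box))
    return labels

print(recognize(["frame1", "frame2"]))  # -> ['pose_0', 'pose_0']
```

Detection runs only until a region is locked, after which the cheaper tracker keeps the region up to date; this is what gives the system its reduced per-frame cost.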


B. Haar Cascades Classifier

The Viola-Jones object detection algorithm was the first object detection algorithm to provide competitive object detection rates in real time. Although it can be trained to detect a variety of object classes, here it is applied to the problem of hand detection.

This approach to hand detection combines four key concepts: Haar-like features (simple rectangular features), the integral image (for rapid feature computation), AdaBoost (a machine-learning method) and a cascade classifier (to combine many features efficiently); see Fig. 2.

Fig. 2: Haar cascade classifier

C. Hand Tracking

The Haar cascade classifier performs best on objects with static features such as balloons, boxes, faces, eyes, mouths, noses, etc. A hand in motion, however, has few static features because its shape, fingers and orientation can all change. Consequently, the Haar cascade classifier allows detection of only basic hand poses, which is not sufficient to recognize a hand in motion across a large number of different poses.

Since hand detection using Haar cascades is not a robust method, this deficiency is compensated with a hand tracker based on the wrist region. The wrist region is proposed for tracking because it remains invariant and keeps static features when the hand changes to different poses, shapes and orientations.

In addition, hand tracking reduces the processing time, since tracking requires fewer computational resources than hand detection (whole-image evaluation versus local evaluation). Fig. 3 shows the different hand regions used for detection and tracking; as the image shows, the hand region for tracking (blue box) encloses the hand in its different shapes and poses. Therefore, the hand region inside the blue box will be used by the CNN to perform the hand gesture recognition.

In this project, the MIL (Multiple Instance Learning) algorithm is used for hand tracking. The MIL algorithm trains a classifier in an online manner to separate the object from the background. Multiple Instance Learning avoids the drift problem, resulting in a more robust and stable tracker; the implementation is based on [9]. In addition, this tracking algorithm consumes less memory and fewer computational resources than the Haar cascade classifier.

Fig. 3: Wrist region for detection (red box) and hand region for tracking (blue box)

D. Skin Detection

Skin color is a powerful feature for fast hand detection. Essentially, all skin color-based methodologies try to learn a skin color distribution and then use it to extract the hand region. In this work the hand region has been obtained on the basis of statistical color models [10]. A model in the RGB-H-CbCr color spaces has been constructed on the basis of a training dataset. Then the hand probability image has been thresholded. Finally, after morphological closing, connected components labeling has been executed to extract the gravity center of the region, the coordinates of the topmost pixel and the coordinates of the leftmost pixel of the hand region.

E. Hand Poses Dataset

The dataset for hand gesture classification was provided by an open database from AGH University of Science and Technology [11]. It is composed of 73,124 grayscale images of size 48x48 pixels divided into ten different hand gestures, captured from ten persons of different nationalities. However, for the purposes of this project, only binary images were selected. From this subset, 80% (42,027 images) were used for training and 20% (14,667 images) for testing. Fig. 4 depicts samples of each hand gesture, also called class. The principal advantage of this dataset is that the hands were aligned in such a way that characteristic hand features (e.g. the wrist) are approximately located at pre-defined positions in the image. This means that in each class the wrists are approximately located at the same position. Furthermore, thanks to such an alignment, the recognition of hand poses at acceptable frame rates can be achieved with a simple convolutional neural network and at a lower computational cost.

Fig. 4: Sample images of hand gestures used in the training and testing steps
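The skin-thresholding step described above can be illustrated with a minimal per-pixel classifier. The actual model of [10] is learned from a training dataset over RGB-H-CbCr; the fixed Cb/Cr bounds used below are illustrative assumptions only, not the paper's trained thresholds.

```python
# Minimal sketch of skin-color thresholding in the CbCr plane.
# The conversion follows the standard BT.601 full-range approximation;
# the Cb/Cr bounds are illustrative, not the trained model of [10].

def rgb_to_cbcr(r, g, b):
    """Convert one RGB pixel to its Cb and Cr chroma components."""
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr

def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Classify one pixel as skin/non-skin with fixed illustrative bounds."""
    cb, cr = rgb_to_cbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]

def skin_mask(image):
    """Apply the pixel test over an image given as rows of (r, g, b) tuples."""
    return [[is_skin(*px) for px in row] for row in image]

print(skin_mask([[(200, 120, 100), (0, 255, 0)]]))  # -> [[True, False]]
```

In the full system the resulting binary mask would still be cleaned by morphological closing and connected-components labeling before extracting the gravity center and the topmost/leftmost pixels of the hand region.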
F. Convolutional Neural Networks

Since each hand pose is composed of strongly different strokes, recognition does not need large images and complex CNNs to extract useful features. Accordingly, we use only binary images of 48x48 pixels and a small CNN with few layers and parameters.

The proposed CNN is formed by two convolutional layers with kernels of 5x5 and 3x3 size, each followed by a ReLU non-linearity and a max-pooling layer, and two fully-connected (FC) layers of 120 neurons each, followed by a final 10-way softmax; see Fig. 5. Furthermore, this CNN comprises only 60K learnable parameters. This number of parameters is significantly smaller than that of the AlexNet network (60M learnable parameters and 650,000 neurons) [2] and GoogLeNet (6.8M learnable parameters) [12].

Fig. 5: Architecture of the CNN for hand pose recognition.

III. EXPERIMENTAL RESULTS

A. Experimental Results of the Model

The performance of the CNN for hand pose classification was evaluated using different metrics, such as the confusion matrix and accuracy. The confusion matrix presents a visualization of the misclassified classes and helps to decide where to add more training images in order to improve the model. The confusion matrix of our model is shown in Fig. 6 and discloses which classes are misclassified. These errors happen because of similarities between the classes. Furthermore, our architecture shows an outstanding accuracy of 94.50%.

Fig. 6: Confusion matrix

B. Experimental Results of Inference

The implementation of the proposed recognition system on a personal computer presents no issues due to its high computational resources. However, when the recognition system is implemented on embedded computers like the Raspberry Pi 3, two major obstacles work against us: limited RAM (only 1 GB) and restricted processor speed (four ARM Cortex-A53 cores @1.2 GHz). In order to obtain better computational performance, the system was implemented in the C++ language.

In spite of the processing and memory limitations mentioned above, our real-time recognition system shows promising results during the evaluation step. Fig. 7 depicts its performance in real environments on images of 640x480 pixels. As can be observed, the system correctly recognizes different hand poses, despite some shape distortions, low-light conditions and different hand sizes. In addition, we obtain a fast response time of about 351.2 ms (average of 100 iterations) to detect and classify a single hand pose.

Fig. 7: Results of hand pose recognition on a Raspberry Pi 3: (a) hand detection, (b) hand tracking and pose recognition, (c) hand tracking and pose recognition
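The order of magnitude of the 60K-parameter figure can be sanity-checked with a small counting helper. The text fixes the kernel sizes (5x5 and 3x3), the two 120-neuron FC layers and the 10-way softmax, but not the number of filters per convolutional layer or the flattened feature-map size; the values 8, 16 and 5x5 below are assumptions for illustration only.

```python
# Parameter counting for a small CNN. Only the counting formulas are
# fixed; the filter counts (8, 16) and the flattened 5x5x16 feature map
# are assumed, since the paper does not state them.

def conv_params(k_h, k_w, c_in, c_out):
    """Learnable parameters of a conv layer: kernel weights plus biases."""
    return k_h * k_w * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Learnable parameters of a fully-connected layer: weights plus biases."""
    return n_in * n_out + n_out

total = (conv_params(5, 5, 1, 8)       # conv1: 5x5 on a 1-channel input
         + conv_params(3, 3, 8, 16)    # conv2: 3x3
         + fc_params(16 * 5 * 5, 120)  # FC1 on an assumed 5x5x16 feature map
         + fc_params(120, 120)         # FC2
         + fc_params(120, 10))         # final 10-way softmax
print(total)  # -> 65226
```

With these assumed sizes the count lands in the mid tens of thousands, the same order as the 60K reported; note that almost all of the parameters sit in the first FC layer, which is why keeping the feature map small matters on an embedded target.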
Table I shows some details of the CNNs tested on the Raspberry Pi 3 platform. As can be seen, the proposed CNN achieves the fastest response time compared with the other architectures, while drawing the lowest power, because of its simple and efficient design.
TABLE I: Response time and power consumption for evaluation of different CNNs on a Raspberry Pi 3 using Caffe

Model     | Proposed CNN | VGG F [13] | NiN [14] | AlexNet [2] | GoogLeNet [12]
Layers    | 9            | 13         | 16       | 11          | 27
Power (W) | 0.690        | 0.760      | 0.840    | 0.750       | 0.790
Time (s)  | 0.351        | 0.857      | 0.553    | 1.803       | 1.175
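The response times above are averages over repeated runs; the measurement procedure can be sketched as below, with a dummy workload standing in for one detect-track-classify pass (the hypothetical `average_latency` helper is not from the paper).

```python
import time

def average_latency(fn, iterations=100):
    """Average wall-clock latency of fn over repeated runs, mirroring the
    paper's average-of-100-iterations timing methodology."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations

# Dummy workload standing in for one detect + classify pass on a frame.
latency = average_latency(lambda: sum(x * x for x in range(1000)))
print(f"{latency * 1000:.3f} ms per iteration")
```

Averaging over many iterations smooths out scheduler jitter, which is significant on a single-board computer where background processes compete for the four cores.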

IV. CONCLUSION

In this paper, we introduced the implementation of a hand pose recognition system on a regular embedded computer. We demonstrated that our system is capable of recognizing 10 hand gestures with an accuracy of 94.50% on images captured from a single RGB camera, while drawing low power, about 0.690 W. In addition, the average time to process each 640x480 image on a Raspberry Pi 3 board is 351.2 ms. The results demonstrate that our recognition system is suitable for embedded applications in robotics, virtual reality, autonomous driving systems, human-machine interfaces and others.
REFERENCES
[1] Oyedotun, O., Khashman, A.: Deep learning in vision-based static hand gesture recognition. Neural Computing and Applications (2016) 1–11
[2] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012) 1097–1105
[3] Kwolek, B.: Face detection using convolutional neural networks and Gabor filters. In: Int. Conf. on Artificial Neural Networks, LNCS, vol. 3696, Springer (2005) 551–556
[4] Arel, I., Rose, D., Karnowski, T.: Research frontier: Deep machine learning, a new frontier in artificial intelligence research. IEEE Comp. Intell. Mag. 5(4) (2010) 13–18
[5] Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. 33(5) (2014)
[6] Nagi, J., Ducatelle, F.: Max-pooling convolutional neural networks for vision-based hand gesture recognition. In: IEEE ICSIP (2011) 342–347
[7] Barros, P., Magg, S., Weber, C., Wermter, S.: A multichannel convolutional neural network for hand posture recognition. In: 24th Int. Conf. on Artificial Neural Networks (ICANN), Cham, Springer (2014) 403–410
[8] Koller, O., Ney, H., Bowden, R.: Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In: IEEE Conf. on Computer Vision and Pattern Recognition (2016) 3793–3802
[9] Babenko, B., Yang, M.-H., Belongie, S.: Visual tracking with online multiple instance learning. In: IEEE CVPR (2009)
[10] Jones, M.J., Rehg, J.M.: Statistical color models with application to skin detection. Int. J. Comput. Vision 46(1) (2002) 81–96
[11] Núñez Fernández, D., Kwolek, B.: Hand posture recognition using convolutional neural network. In: Mendoza, M., Velastin, S. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2017. Lecture Notes in Computer Science, vol. 10657. Springer, Cham (2017)
[12] Szegedy, C., et al.: Going deeper with convolutions. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Boston, MA (2015)
[13] Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional nets. In: Proc. British Machine Vision Conference (2014) 6.1–6.12
[14] Lin, M., Chen, Q., Yan, S.: Network in network. In: Int. Conf. on Learning Representations (ICLR) (2014)
