
SPECIAL SECTION ON SMART HEALTH SENSING AND COMPUTATIONAL INTELLIGENCE: FROM BIG DATA TO BIG IMPACTS

Received August 29, 2019, accepted September 10, 2019, date of publication September 16, 2019, date of current version September 27, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2941836

Design and Implementation of a Convolutional Neural Network on an Edge Computing Smartphone for Human Activity Recognition

TAHMINA ZEBIN 1, PATRICIA J. SCULLY 2 (Member, IEEE), NIELS PEEK 3, ALEXANDER J. CASSON 4 (Senior Member, IEEE), AND KRIKOR B. OZANYAN 4 (Senior Member, IEEE)

1 School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, U.K.
2 School of Physics, NUI Galway, Galway H91 TK33, Ireland
3 Health eResearch Center, The University of Manchester, Manchester M13 9PL, U.K.
4 Department of Electrical and Electronic Engineering, The University of Manchester, Manchester M13 9PL, U.K.

Corresponding author: Tahmina Zebin (t.zebin@uea.ac.uk)

This work was supported by the U.K. Engineering and Physical Sciences Research Council under Grant EP/P010148/1.

ABSTRACT Edge computing aims to integrate computing into everyday settings, enabling the system to be context-aware and private to the user. With the increasing success and popularity of deep learning methods, there is an increased demand to leverage these techniques in mobile and wearable computing scenarios. In this paper, we present an assessment of a deep human activity recognition system's memory and execution time requirements when implemented on mid-range smartphone-class hardware, along with the memory implications for embedded hardware. The paper presents the design of a convolutional neural network (CNN) in the context of a human activity recognition scenario. Here, the layers of the CNN automate feature learning, and we examine the influence of hyper-parameters such as the number of filters and the filter size on CNN performance. The proposed CNN showed increased robustness, with a better capability of detecting activities with temporal dependence, compared to models using statistical machine learning techniques. The model obtained an accuracy of 96.4% in a five-class static and dynamic activity recognition scenario. We calculated the proposed model's memory consumption and execution time requirements for use on a mid-range smartphone. Per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision post-training produced classification accuracy within 2% of the floating-point networks for the dense and convolutional neural network architectures. Almost all of the size and execution time reduction in the optimized model was achieved through weight quantization. We achieved a more than four-fold reduction in model size when optimizing to 8 bits, yielding a compact model capable of fast on-device inference.

INDEX TERMS Convolutional neural networks, edge computing, TensorFlow Lite, activity recognition, deep learning.

I. INTRODUCTION

Deep learning techniques have been applied to a variety of fields and proved their usefulness in many applications such as speech recognition, language modelling and video processing. Models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) employ a data-driven approach to learning discriminating features from raw sensor data to infer complex, sequential, and contextual information in a hierarchical manner [1]. They are highly suited for exploiting temporal correlations in data sets, which makes them suitable for applications such as human activity recognition (HAR) classification, where potentially a large amount of data is available; human movements are encoded in a sequence of successive samples in time; and the current activity is not defined by one small window of data alone.

In recent years machine learning methods used in the literature for HAR have effectively analyzed human activities for domains such as ambient assisted living (AAL), elder care support, smart rehabilitation in sports, and cognitive disorder recognition systems in smart healthcare [1], [2]. Despite significant research efforts over the past few decades, activity recognition remains a challenging problem. Sensor embodiment and low accuracy of activity recognition are among the challenges that affect the adoption of these systems in clinical settings [3]. In addition, many of the state-of-the-art learning methods (e.g. support vector machines (SVM), decision trees, k-nearest neighbour algorithms, and advanced ensemble classifiers) need extensive pre-processing and domain knowledge to handcraft and calculate discriminating features to be used by a classifier [4]. However, deep learning remains under-explored as a research field in terms of raw time-series processing of inertial sensor data for activity recognition [5].

In this paper, we devise a feature-less activity recognition system with a novel multi-channel 1-D convolutional neural network architecture, substituting the manually designed feature extraction procedure in HAR with an automated feature learning engine. Compared to deep architectures such as recurrent neural networks and long short term memory networks, CNNs have the most straightforward training process. In addition, some current edge development boards, such as the SparkFun Apollo3 [6] chip, have support for CNN acceleration for edge use. For our design, we exploit the fact that a CNN can discover intricacies in the data characteristics with its convolution operation (which computes a mixture of nearby sensor readings) and its pooling operation (which makes the representation invariant to small translations of the input). The implemented architecture in this paper is novel in its use for time series data analysis and its use of batch normalization for HAR with the CNN architecture when compared to ones reported in the literature. We then extend the trained model's implementation to a smartphone to provide a proof of concept of transferring the capability of deep learning models to edge devices. We also consider the memory footprint optimization for the network to run on a mobile and embedded wearable device.

The remaining sections of this paper are organized as follows. In Section II we review current sensor-based activity recognition using deep learning methods. Design insights are derived from the review of the related work, and we provide a description of the dataset for the implemented network in this section. Section III gives details on the proposed CNN algorithm and discusses the necessary background concepts for understanding the design of the CNN model. The influence of important hyper-parameters such as the number of convolutional layers, filter size and number of filters is explored in Section IV using a grid search method. Based on the performance of the proposed CNN model, the optimal parameters for the final design are selected. Section V presents the challenges, the model graph analysis for memory requirements, and the quantization approach applied to the trained deep model for its implementation on an edge device. Finally, the quantized model performance is evaluated as a smartphone app implementation for activity recognition in Section VI.

(The associate editor coordinating the review of this manuscript and approving it for publication was Vincenzo Piuri.)

II. BACKGROUND

In traditional machine learning models based on static and shallow features, several authors [7], [8] provided a broad summary of HAR, highlighting the capabilities and limitations of a number of statistical machine learning models such as Support Vector Machines (SVMs), Gaussian Mixture Models (GMMs) or Hidden Markov Models (HMMs) [4], [9]. One of the main shortcomings the developers faced was that they had to decide adequate features for the task by trial and error. To overcome this handcrafting, researchers have moved towards deep learning models where the feature extraction process is included in the modelling itself.

A. RELATED DEEP LEARNING WORKS

Convolutional networks, comprised of one or more convolutional and pooling layers followed by one or more fully-connected layers, have gained popularity due to their ability to learn unique representations from images or speech, capturing local dependency and distortion invariance [10]. CNNs have recently been applied to the problem of activity recognition in a number of research papers. In order to provide an automated activity recognition system with a novel multi-channel 1-D convolutional neural network architecture with high accuracy, this section reviews a number of recent studies employing deep learning models such as convolutional and recurrent neural networks to classify activities using data from one or more wearable inertial sensors.

Ordonez and Roggen [9] proposed an activity recognition classifier using a deep CNN and long short-term memory (LSTM), and two open datasets collected by seven inertial measurement units and 12 triaxial accelerometer wearable sensors. The authors classified 27 hand gestures (opening door, washing dishes, cleaning table etc.) and five movements (standing, walking, sitting, lying, and nulling) using a combination of CNN and LSTM after converting the sensory data into sensor signal graphs. Simulation results showed that F1 scores of 0.93 and 0.958 were achieved. The F1 score is the harmonic mean of precision and recall performance (formulation provided in equation (4)). Ronao and Cho [10] first attempted the design of a CNN with convolutional layers applied along the time axis and over all the sensors simultaneously, called temporal convolution layers. They used two or three of these layers followed by a pooling layer and a softmax classifier.

We have previously developed a context-aware algorithm using multi-layer LSTM [11] which achieved an overall accuracy of 92% for static and dynamic daily life activities; however, for edge implementation, the network has quite high computational cost and memory requirements. Though the use of batch normalization accelerated the computation to an extent, it would require further acceleration before it could be used for real-time predictions.
Rueda et al. [12] compared several concurrent methods that implemented a CNN based architecture, showing better performance compared to shallower or handcrafted methods using Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbor (KNN), and SVMs. A detailed review of a few other deep learning methods is available in [1], [13], [14]. It is noted that the evaluation of most of the architectures described so far is performed on publicly available datasets such as the Opportunity [15], Pamap2 [16], and UCI HAR dataset [17]. One major issue is that some of these datasets contain too many sensors combined with overtly variant activities, and lack a balanced number of support instances in each category to be efficiently detected by any deep learning algorithm. Hammerla et al. [18] used CNNs to classify activities using data from multiple inertial sensors on the body. This performed well, and was optimized for low-power devices, but reintroduced the extraction of handcrafted features by using a spectrogram of the input data, and it requires multiple sensor devices (whereas in a real-life scenario, we base our work on the assumption that people would wear just one unit at any time).

In this paper, we aim to derive a sensor independent deep learning method that is highly accurate and fast in its decision making. For that, we designed and discussed the methodological steps for implementing a robust model on a balanced dataset, and then moved towards the edge implementation.

B. DATASET DESCRIPTION

TABLE 1. Division of the dataset.

As the dataset, we processed the time series data from a waist mounted inertial sensor containing both accelerometer and gyroscope measurements. Data from 20 subjects were used; further details on this can be found in ref [4]. The dataset is grouped together to classify five everyday activities: 1: walk on a level surface; 2: walk upstairs; 3: walk downstairs; 4: sedentary (stand + sit); 5: sleep (lying). We used a custom wearable setup with the MPU-9150 sensor [19] and captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. We sampled the data from 14 volunteers, with approximately 7500 labelled activities as training data, and data from an additional 6 volunteers, 2500 labelled activities, as a test dataset (summary presented in Table 1). The test set was separated entirely from the training dataset during our experiments. In addition, to avoid over-fitting the model, 20% of the training dataset was held back for validation.

C. DATA PRE-PROCESSING STAGES

With a wearable inertial sensor being our data source, the raw dataset contains values with different units and ranges, with accelerometer data recorded in m/s² and gyroscope data recorded in rad/s. The sensor signals (accelerometer and gyroscope) were pre-processed by applying a Butterworth low-pass filter with 0.3 Hz cutoff frequency to remove the low frequency gravity component from the acceleration. We performed the following pre-processing on our training and test datasets:

1) SCALING AND NORMALIZATION

To avoid any kind of training bias due to the direct use of large values from any of the six channels, we applied scaling across the channels. This converted all the channel values to a range between 0 and 1, by application of a min-max normalization function from the python sklearn library. To be noted, as we maintained a consistent location for sensor placement during our data collection, the model provides accurate prediction for that sensor location (e.g. wrist or pelvis) only. However, the model is adaptable to a reasonable amount of displacement, since the scaling and normalization stage takes care of the amplitude variation. Additionally, the algorithm can deal with any change in sensor orientation due to the raw time series processing of each channel.

2) SEGMENTATION

Once the scaling on the raw data is performed, the six-channel input time series is segmented into 1 × 128 windows so that the convolutional filters can explore any temporal relationship between the samples within an activity. Our choice of the optimum window size was made in an adaptive and empirical manner [20], [21] to produce good segmentation for all the activities under consideration.

3) CLASS RELABELING AND ONE-HOT ENCODING

We also converted the output activity labels to one-hot encoded labels. We encoded our activity windows to five unique labels as discussed previously in the dataset description section. These pre-processing stages are also summarized in the preprocessing stage of Fig. 1.
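The stages above map directly onto a few library calls. The following is a minimal sketch of the pipeline (our illustration, not the authors' released code: the array names and the make_windows helper are hypothetical, while the [0, 1] min-max scaling, the 1 × 128 window length, the six channels and the five classes come from the description above):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow.keras.utils import to_categorical

    def make_windows(signals, labels, window=128):
        # Slice a (T, 6) signal array into non-overlapping 1 x 128 windows,
        # assuming windows do not straddle an activity boundary.
        n = (len(signals) // window) * window
        X = signals[:n].reshape(-1, window, 6)      # (N, 128, 6)
        y = labels[:n].reshape(-1, window)[:, 0]    # one label per window
        return X, y

    # signals: (T, 6) filtered accelerometer + gyroscope samples at 50 Hz
    # labels:  (T,) integer activity labels in {0..4}; random placeholders here
    signals = np.random.rand(12800, 6)
    labels = np.repeat(np.random.randint(0, 5, 100), 128)

    signals = MinMaxScaler().fit_transform(signals)  # scale each channel to [0, 1]
    X, y = make_windows(signals, labels)
    y_onehot = to_categorical(y, num_classes=5)      # one-hot encode the 5 classes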
III. THE PROPOSED HAR CLASSIFICATION MODEL BASED ON CNN ARCHITECTURE

A schematic diagram of our four-layer stacked CNN architecture used for multi-class HAR classification is presented in Fig. 1. We selected the CNN architecture due to its capability to process raw time-series data, error handling with backpropagation, and high model accuracy.

A. MODEL IMPLEMENTATION

We stacked a total of four convolution and pooling layers in the proposed model, to obtain more detailed features than those obtained by a previous layer. In our design, we doubled the number of filters after each convolution and pooling layer.


FIGURE 1. Layer-wise specification of the CNN architecture used for multi-class HAR classification.

The feature map extracted by the filters of the final CNN layer was then flattened out to be used as an automatically extracted feature set by any other classifier. The six-channel input time-series (segmented as 128 samples/window for each channel) from the 3-axis accelerometer and 3-axis gyroscope was processed by a number of convolution filters. These filters create a non-linear, distributed representation of the input and are of variable filter size. They are then applied over the entire input time series with a specific stride length. A max-pooling layer is then used to down-sample the temporal features (such as slopes/changes in the time series signal) that the convolution layer has just extracted. For a given training dataset, our objective is to find the optimal parameters to minimize the difference between input and reconstructed output over the whole training set. To apply a CNN to human activity recognition, several design adjustments are needed to adapt it to 1-D processing of the sensor data, such as input adaptation, pooling, and weight-sharing. The subsections below discuss several adaptations and processing stages of the proposed convolutional neural network for the activity classification task at hand.

B. CNN FEATURE EXTRACTION

In the CNN architecture, the internal representation of the input is implicitly learned by the convolutional kernels.


FIGURE 2. Elaboration of convolution and sub-sampling (max-pooling) for time series data in a typical CNN layer.

The convolution and sub-sampling (or pooling) operations inherently enable the CNN to perform as a good feature predictor. As mentioned previously in Section III-A, we have used four stacked convolution and pooling layers for extracting features from the raw wearable sensor data. In Fig. 2, an elaboration of the convolution and subsampling (i.e. max-pooling) operation on the time series data is illustrated. In the convolution layer, multiple non-linear transformations (n transformations for n filters) are applied to the input, each one creating a different output map. In a sub-sampling layer with a specific pool size (e.g. 1 × 2), each of the outputs of the previous layer is down-sampled by a procedure of striding over non-overlapping regions of the input. This procedure is usually an averaging or maxing of each region that creates a single output. Since each point of the output is created by a region of the input and these regions don't overlap, a down-sampling occurs.

Every layer gets an input (1-D array), and all operations are performed in segments and windows of this input. This procedure exploits the prior idea that data points that carry related or similar information will be grouped by a specific filter or kernel. There are shared weights in each layer that are identically applied to all parts of the input. This defines the functionality of the CNN, since the 1-D input is convolved with a small matrix of weights to produce the output of the layer, acting as a filter. In this context, convolution extracts features, while pooling recombines the extracted information in a more meaningful way. Together they extract specific features over the whole input window. This operation exploits the assumption that a filter that is useful in one part of the signal is probably useful in other parts of it. This design choice vastly decreases the number of trainable parameters and contributes to the high accuracy during the training of the network. Because of the consecutive pooling procedures in each layer, the output becomes increasingly insensitive to small variations of the input, which eventually helps identify unique time domain events for activity classification. A further level of abstraction can be achieved by stacking up a couple of these layers. To be noted, the model was implemented using the Keras open source library in Python [22] with the TensorFlow back-end. We utilized the sequential model and the dense, conv1D, maxpooling1D, dropout, and batch normalization layers for our implementation, as sketched below.
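To make the layer stack concrete, the following is a minimal Keras sketch of such a four-layer model. This is our reading of the description in Sections III-A and III-B, not the authors' released code: the input shape (128 samples × 6 channels), the doubling filter counts, the 1 × 2 pool size, the 1 × 12 kernel found optimal in Section IV, and the layer types are taken from the text, while the starting filter count, dropout rate and optimizer are illustrative assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Conv1D, MaxPooling1D,
                                         BatchNormalization, Dropout,
                                         Flatten, Dense)

    model = Sequential()
    filters = 16                                   # doubled after each block
    model.add(Conv1D(filters, kernel_size=12, activation='relu',
                     padding='same', input_shape=(128, 6)))
    for _ in range(3):                             # four conv+pool blocks in total
        model.add(BatchNormalization())
        model.add(MaxPooling1D(pool_size=2))
        model.add(Dropout(0.2))
        filters *= 2
        model.add(Conv1D(filters, kernel_size=12,
                         activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())                           # automatically extracted feature set
    model.add(Dense(5, activation='softmax'))      # five activity classes

    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # 20% of the training data held back for validation, as in the paper
    model.fit(X, y_onehot, epochs=15, batch_size=64, validation_split=0.2)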
TABLE 2. Effect of increasing number of convolution and pooling layers.

IV. ABLATION STUDY: HYPER-PARAMETER TUNING AND CLASSIFICATION PERFORMANCE

We present the influence of important hyper-parameters such as the number of convolutional layers, the number of convolutional kernels (i.e. filters), and the filter size in this section. We evaluated the model performance by varying a number of model parameters and observed their effect on the detection speed and accuracy. Based on this ablation study on the proposed architecture, the optimal parameters for the final design are selected.

A. HYPER-PARAMETER TUNING

The hyper-parameters of a convolutional layer are its filter size, depth, stride, and padding. These parameters must be chosen carefully in order to generate a highly accurate and fast output. The hyper-parameters of the pooling layer are its stride and pooling size. Since they have to be chosen in accordance with each other, we systematically present the effect of these hyper-parameters in the following sections.

1) EFFECT OF NUMBER OF LAYERS

In a CNN, the successive layers of convolution and sub-sampling (e.g. max-pooling) functionality aim to learn progressively more complex and specific features, with the last layer representing the output activity classes. An increased number of layers contributes to the depth of the network. We varied the number of layers in our proposed model from 1 to 5. With an increasing number of layers, the accuracy of the model increases, beginning with 82% accuracy for a single layered network and going up to 94.9% with five layers (shown in Table 2). However, as can be seen in Fig. 3 and Table 2, the complexity and execution time of the network increase because of the growing number of parameters as the number of layers goes higher.

2) EFFECT OF NUMBER OF FILTERS AND KERNEL SIZE OF THE FILTER

The number of filters (i.e. convolutional kernels, n in Fig. 2) of each successive layer increases due to down-sampling. The deeper the layer, the bigger the segment or window of the original input that affects the value of a data point in that layer. Table 3 summarizes the model constants with a varying number of filters, and lists the time needed to train and the model's performance accuracy.


FIGURE 3. Change in the number of model parameters and training time with increased number of layers in the CNN model.

TABLE 3. Effect of increasing number of filters on convolution layer.

FIGURE 4. Effect of convolutional kernel size in CNN activity classification.

An increased number of convolutional filters manifested higher accuracy due to the added temporal scaling in the features obtained by the additional filters. The size of the filter (expressed as 1 × m in Fig. 2) captures the temporal correspondence between neighbouring data points within the filter. The effect of filter size on the model's accuracy and execution time is summarized in Fig. 4, with the model constants provided in the first column. As can be seen from Fig. 4, our experiments indicate higher model accuracy with increasing filter size until the filter size reaches (1 × 12) in a (1 × 128) activity window. The model accuracy deteriorates with wider filter sizes such as (1 × 15) and (1 × 18), since it starts to lose the temporal context of the activity. To be noted, all the time measurements were performed on a conventional computer with an Intel Core i7 CPU (2.4 GHz) and 3 GB memory.

3) EFFECT OF DROPOUT AND BATCH NORMALIZATION

TABLE 4. Quantitative comparison of SVM, fully connected Dense Neural Network and versions of CNN for HAR classification.

FIGURE 5. Average training set accuracy over 150 epochs for the proposed model. The batch normalized CNN is more stable in terms of accuracy than the generic CNN.

Table 4 shows the effect of Dropout and Batch Normalization (BN) on the model's overall performance accuracy. The batch normalized CNN (CNN + Dropout + BN) consistently outperforms the generic CNN model, with an increase of 4% over the generic CNN without BN. In addition, for an increased number of training epochs, the CNN + Dropout + BN achieves 95% training set accuracy four times faster (using 15 epochs) than the generic CNN (50 epochs). This is also true for the validation data set. The reduction in training epochs required is substantial and will be of vital importance while dealing with bigger datasets such as [23]. Fig. 5 shows the training set accuracy over 150 epochs for the proposed model; the batch normalized CNN (seen in purple) is more stable in terms of accuracy than the generic CNN. A search over these hyper-parameter settings can be scripted as sketched below.
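The grid search mentioned in Section I reduces to a few nested loops over these hyper-parameters. A minimal sketch follows (our illustration: build_model is a hypothetical helper that assembles the Keras stack of Section III for a given depth, filter count and kernel size; X, y_onehot, X_test and y_test_onehot are assumed from the earlier sketches; the value grids are chosen to mirror the ranges reported in Tables 2-3 and Fig. 4):

    import itertools
    import time

    layer_grid  = [1, 2, 3, 4, 5]             # Table 2 varies depth from 1 to 5
    filter_grid = [8, 16, 32]                 # illustrative filter counts (Table 3)
    kernel_grid = [3, 6, 9, 12, 15, 18]       # kernel widths explored in Fig. 4

    results = []
    for depth, base_filters, k in itertools.product(layer_grid, filter_grid,
                                                    kernel_grid):
        model = build_model(depth, base_filters, kernel_size=k)  # hypothetical helper
        start = time.time()
        model.fit(X, y_onehot, epochs=15, batch_size=64,
                  validation_split=0.2, verbose=0)
        _, acc = model.evaluate(X_test, y_test_onehot, verbose=0)
        results.append((depth, base_filters, k, acc, time.time() - start))

    best = max(results, key=lambda r: r[3])   # configuration with best test accuracy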
B. TEST SET CLASSIFICATION PERFORMANCE

To evaluate the performance of the model, we input the data from 6 volunteers as the test dataset. The test set was separated entirely from the training dataset. In addition, to avoid over-fitting the model to the training data, 20% of the training dataset was held back as a validation set in our experiments.

To put our model performance in context, we present the confusion matrix plot in Fig. 6 for predictions on the test dataset. The rows in the confusion matrix correspond to the predicted class (Output Class) and the columns correspond to the true class (Target Class). The rightmost column of the confusion matrix presented in Fig. 6 corresponds to the class-wise precision performance, and the bottom row corresponds to the recall performance.


FIGURE 6. Confusion matrix for the proposed CNN on the test dataset.

The diagonal cells in the confusion matrix correspond to observations that are correctly classified (TPs and TNs). For our test dataset, there are 324 instances correctly classified as the walk level activity. The off-diagonal cells correspond to incorrectly classified observations (FPs and FNs). For instances where the model predicted incorrectly, in the first column of the confusion matrix, there were 9 instances where the model predicted the walk level activity to be walk upstairs, for 9 instances the predicted class was sedentary, and for 21 instances it was classified as the sleep category, which contributed 10.7% false negative instances for this class. Similarly, going across the row, there are some instances where the model flagged up false positives (4.1%) for this category. We have also shown in each cell the number of observations for an activity category, and the percentage in terms of the total observations available in the full test set. From the confusion matrix, the overall accuracy, precision and recall performance of the model can be calculated using the following equations:

Overall Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision = TP / (TP + FP)    (2)

Recall or True Positive Rate = TP / (TP + FN)    (3)

When the trained model was exposed to a test set containing approximately 2500 new activities, it achieved 96.4% overall accuracy. To show class-wise precision performance, we present the ratio of correctly predicted positive observations to the total predicted positive observations, as high precision relates to a low false positive rate. In our case, we achieved 95.9%, 96.9% and 98.5% precision performance for the dynamic ambulatory activities walk level, walk upstairs and walk downstairs respectively. For the sedentary (stand and sit) and sleep activities the precision performance was found to be 97.3% and 90.4% respectively.

For the recall performance, the ratio of correctly predicted positive observations to all observations in the actual/true class is calculated; the class-wise recall performances for the five-class scenario were 89.3%, 95.2%, 98.8%, 97.3% and 97.5% respectively. The highest precision and recall scores were obtained for the walk downstairs category, where a higher data ratio provided higher confidence in its decision making. In cases of uneven recall and precision performance, which is the case for some of our classes such as the walk level and sleep activities, the F1 score is usually a more useful indicator of performance than accuracy, as this score takes both false positives and false negatives into account and is defined as follows:

F1 Score = (2 × Precision × Recall) / (Precision + Recall)    (4)

Our reported F1 scores for the walk level, walk upstairs, walk downstairs, sedentary, and sleep activities with the proposed model are 92.47%, 96.04%, 98.64%, 97.3%, and 93.5% respectively.
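These class-wise metrics can be reproduced directly from the predictions with scikit-learn. A minimal sketch (our illustration, not the authors' evaluation script; model, X_test and the one-hot y_test_onehot are assumed from the sketches above):

    import numpy as np
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score, f1_score)

    y_pred = np.argmax(model.predict(X_test), axis=1)  # softmax scores -> class ids
    y_true = np.argmax(y_test_onehot, axis=1)          # undo the one-hot encoding

    # Note: scikit-learn puts true classes on rows and predictions on columns,
    # the transpose of the convention used in Fig. 6.
    print(confusion_matrix(y_true, y_pred))
    print('accuracy :', accuracy_score(y_true, y_pred))                 # eq. (1)
    print('precision:', precision_score(y_true, y_pred, average=None))  # eq. (2)
    print('recall   :', recall_score(y_true, y_pred, average=None))     # eq. (3)
    print('F1 score :', f1_score(y_true, y_pred, average=None))         # eq. (4)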
V. PRE-TRAINED MODEL TRANSFER ON EDGE DEVICE

The implementation of pre-trained deep learning models for mobile computing is of special interest and has enabled numerous applications such as smart activity tracking, intelligent personal assistants, and real-time language translation on smartphones and smartwatches. Recent implementations and approaches have used point-wise group convolution and channel shuffle to run effectively on resource constrained devices [24], and several authors have proposed and implemented deep learning models on smartphones for facial expression detection using camera data [25] or human activity detection using accelerometers built in to smartphones [10], [26]. Commercial human activity monitoring devices relying on inertial sensors, such as Fitbit, have gained popularity for daily activity tracking. However, the accuracy of these devices is not satisfactory for applications such as gait analysis or step counting for slow walking [27]. The type of the activity might affect the accuracy as well [28], [29]. Along with the accuracy reduction issue, model size is another factor to be considered seriously, since very large neural networks can be hundreds of megabytes and would be difficult to store on device. Thus we need to minimize the model before considering an edge implementation. In the following subsections, we discuss the challenges of implementing deep models on edge devices and present our analysis of the neural network after applying quantization to the trained model to fit it efficiently on a smartphone.

A. CHALLENGES ON EDGE DEVICES

Most wearables have low computation power compared to a typical smartphone due to their small form factors and heat dissipation restrictions. Wearables also have very limited energy budgets due to their limited battery capacity.


FIGURE 7. Steps for embedded adaptation of TensorFlow based deep learning models.

On the other hand, deep learning models usually require heavy computation, regardless of their type (dense, convolutional, or recurrent neural networks). As a neural network model can consist of hundreds of connected layers, each of which can have a collection of processing elements (i.e., neurons) executing a non-trivial function, it requires considerable computation and energy resources, in particular during streamed data processing (e.g., continuous speech translation or inertial sensor data processing), and the implementation steps need to be re-visited in order to load such models onto edge devices [30]. In this section, we aim to find out the computational requirements for using these resource-heavy modeling techniques on smartphone or smartwatch level hardware.

B. TOOLBOX REQUIREMENT FOR PROOF OF CONCEPT

In our work, we have successfully enabled the inference stage of a four layer CNN model using pre-trained and optimized deep learning models on mobile devices using the TensorFlow Lite (TFLite) [31] library. TFLite is an advanced and lightweight version of the TensorFlow library, including the basic operations and functionality necessary in layer-wise neural network based computation, and it is used here as the base deep learning library. Currently, TFLite supports a set of core operations in both floating-point (float32) and quantized (uint8) precision, of which the latter has been tuned for mobile platforms. Since the set of operations supported in TFLite is limited, not every model that is trained using TensorFlow is convertible to TFLite [32]. TFLite works on flat buffers, whereas TensorFlow uses protocol buffers, which are efficient as a cross-platform serialization library but not suitable for embedded implementations [33], [34].

We present the required stages of the workflow in Fig. 7, where the first block corresponds to the training stage of the deep learning model on TensorFlow and the second block encompasses the steps required to get the graph representation inference-ready on an embedded platform. The remaining stages are required for optimization before the model can be deployed on an edge device.

VI. MODEL ANALYSIS FOR OPTIMIZATION

A. MODEL SIZE AND SPEED OF OPERATIONS

TABLE 5. Memory consumption and execution time summary by node type.

For edge computing, only the inference stage of a complex deep learning model needs to be performed on the device, with its limited on-board memory, computational power and battery run time. For this reason, we conducted an investigation of whether our models could be contained in an edge computing device's memory, and of the total number of floating point operations required to execute the frozen graph of the model. Table 5 lists the memory allocation analysis based on the number of node types available in the implemented model. The CNN model consumed about 16 MB of memory when transformed and saved in TensorFlow's protocol buffer (.pb) format. The memory consumption analysis presented in Table 5 indicates the number of operations (ops) and the time consumed to complete these ops. These values provide a decision point at which we can optimize the architecture or select alternative models to reduce these numbers according to the resource availability on an edge device. The trained models were then optimized and made graph-ready to convert to TFLite. We used TensorFlow's graph_transforms and summarize_graph tools to freeze the model and conduct the analysis of the model's profile and the types of operation necessary for inference. Additionally, we utilized TensorBoard's visualization to evaluate the graph after each step and to identify any unsupported layer that needs custom implementation. A conversion sketch is given below.
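For a Keras-built model, the freeze-and-convert step of Fig. 7 reduces to a few converter calls. A minimal sketch using the TensorFlow 2.x TFLite API (our illustration of the standard flow, not the authors' exact scripts; the file name is an arbitrary choice):

    import tensorflow as tf

    # Convert the trained Keras model to a TFLite flat buffer (float32 baseline).
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    with open('har_cnn.tflite', 'wb') as f:
        f.write(tflite_model)
    print('size on disk: %.1f KB' % (len(tflite_model) / 1024))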
B. MODEL TESTING ON SMARTPHONE

Fig. 8(a) shows a screenshot of the implemented Android app driven by the optimized deep model on the back-end.
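The inference step the app performs on each incoming window can be exercised from Python with the same flat buffer. A minimal sketch using the TFLite Interpreter API (our illustration, not the app's Java/Kotlin source; the zero-filled window stands in for a real 1 × 128 × 6 sensor segment):

    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path='har_cnn.tflite')
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    window = np.zeros((1, 128, 6), dtype=np.float32)   # one activity window
    interpreter.set_tensor(inp['index'], window)
    interpreter.invoke()
    probs = interpreter.get_tensor(out['index'])       # scores for the 5 classes
    print('predicted class:', int(np.argmax(probs)))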


FIGURE 8. (a) Screenshot and (b) power profile of the HAR Android app running on a Samsung A5 smartphone.

Fig. 8(b) shows the absolute power consumption of the implementation. We used the Trepn Profiler [35] for these measurements. The average absolute power consumption when the app is running in inference mode is approximately 40 mW. In comparison, the same value measured for YouTube is around 116 mW, which suggests that our HAR Android app is relatively low power and computationally effective. For implementing the model inference stage on device, we further reduced the model size by compressing weights and/or quantizing both weights and activations for faster inference, without re-training the model.

C. OPTIMIZATION THROUGH QUANTIZATION

FIGURE 9. (a) Weight and activation quantization scheme, (b) memory footprint of various deep learning models in terms of weights and activations.

In this section, multiple approaches for model quantization are discussed to demonstrate the performance impact of each of these approaches. All these experimental scenarios were simulated on a conventional computer with a 2.4 GHz CPU and 32 GB memory, and the quantized models were tested further on a Samsung A5 2017 smartphone (1.2 GHz quad-core CPU with 3 GB memory) for their functionality in an activity recognition app using the TFLite library. In a conventional neural network layer implemented in TensorFlow or Keras with a floating-point representation, there are a number of weight tensors that are constant, and variable input tensors stored as floating point numbers. The forward pass function operates on the weights and inputs using floating point arithmetic, storing the output in output tensors as floats. Post-training quantization techniques are simpler to use and allow for quantization with limited data [32], [36]. In this work, we explore the impact of quantizing the weights and activation functions separately. The results of the weight and activation quantization experiments are shown in Fig. 9(b) and Table 6. Moving from 32-bit float to fixed-point 8-bit leads to a 4× reduction in memory. Fig. 9(b) shows the actual memory footprint of various learning models, such as the dense neural network (DNN), the CNN, and an LSTM model built using the same dataset. As can be seen from the bar diagram in Fig. 9(a) and from the memory footprint column in Table 6, the CNN model was 7.62 times smaller in size when post-training quantization is applied to the weights and activations of the model. This version would also be more suitable for low-end microprocessors that do not support floating-point arithmetic. The conversion is sketched below.
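In the TensorFlow 2.x TFLite API, post-training weight quantization is a one-line change to the converter, and activation quantization additionally requires a handful of calibration samples. A minimal sketch (our illustration of the standard post-training flow, not the authors' exact scripts; representative_data is a hypothetical generator drawing from the training windows X):

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # 8-bit weight quantization

    # Supplying calibration data lets the converter fix activation ranges too.
    def representative_data():
        for window in X[:100]:
            yield [window[None, ...].astype('float32')]
    converter.representative_dataset = representative_data

    quant_model = converter.convert()
    with open('har_cnn_quant.tflite', 'wb') as f:
        f.write(quant_model)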


TABLE 6. Change of model accuracy due to quantization: 32-bit floating point model accuracy and 8-bit quantized model accuracy.

Our experiments suggested that we can variably quantize (i.e. discretize) the range, keeping in float32 only those values that have a more significant contribution to the model accuracy and rounding off the rest to uint8 values, to still take advantage of the optimized model. In Table 6, we also observed a slight increase in prediction accuracy on the test dataset for the DNN and LSTM models, because of the salient and less noisy weight values available in the saved model for activity prediction. The inference time for the optimized and fine-tuned CNN was 0.23 seconds on average for the detection of a typical activity window. The table also suggests that the minimization of model parameters may not necessarily lead to the optimal network in terms of performance. This means a selective quantization approach (e.g. using k-means clustering) can make the process more efficient than straight linear quantization.

VII. CONCLUSION

We have presented a deep convolutional neural network model for the classification of five daily-life activities using the raw accelerometer and gyroscope data of a wearable sensor as the input. Our experimental results demonstrate how these characteristics can be efficiently extracted by the automated feature engine in CNNs. The presented model obtained an accuracy of 96.4% in a five-class static and dynamic activity recognition scenario with a 20 volunteer custom dataset available at the GitHub repository for this research [37]. The proposed model showed increased robustness and has a better capability of detecting activities with temporal dependence compared to models using statistical machine learning techniques. Additionally, the batch normalized implementation made the network achieve stable training performance in almost four times fewer iterations. The proposed model has further been empirically analyzed for the memory requirements and execution time of its operations, with the objective of deploying the model on edge devices such as smartphones and wearables. We observed that most of the size and execution time reduction in the optimized model is due to weight quantization, potentially allowing the weights to be quantized differently from the activations and allowing further optimizations. In future, we would like to amend and develop time-series counterparts of models with new and efficient architectures similar to ShuffleNet [24], facilitating pointwise group convolution and channel shuffle, to further reduce computation cost while maintaining accuracy. The proposed model has been validated and successfully implemented on a smartphone. The implementation on the smartphone utilizes real-time sensor data to predict the activity; a current limitation of this pre-trained model is that the classification accuracy decreases during activity transitions and in case of sensor displacement. In the future, we will drive this implementation on programmable devices such as ARM Cortex M-series [38] systems and SparkFun Apollo3 edge platforms [6], with further development of the C++ API in the TensorFlow Lite framework [31]. The smartphone implementation presented in this study could be useful for making smart wearables and devices that are stand-alone from the cloud, potentially improving user privacy.

ACKNOWLEDGMENT

This work was carried out at the University of Manchester. Data and code supporting this publication can be obtained from T. Zebin's GitHub [37]. The steps to deploy the trained model within the Android application are explained in the video abstract accompanying this article.

REFERENCES

[1] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, ''Deep learning for sensor-based activity recognition: A survey,'' 2017, arXiv:1707.03502. [Online]. Available: https://arxiv.org/abs/1707.03502
[2] F. Gu, K. Khoshelham, S. Valaee, J. Shang, and R. Zhang, ''Locomotion activity recognition using stacked denoising autoencoders,'' IEEE Internet Things J., vol. 5, no. 3, pp. 2085–2093, Jun. 2018.
[3] C. A. Ronao and S.-B. Cho, ''Deep convolutional neural networks for human activity recognition with smartphone sensors,'' in Proc. Int. Conf. Neural Inf. Process. Cham, Switzerland: Springer, 2015, pp. 46–53. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-26561-2_6
[4] T. Zebin, P. J. Scully, and K. B. Ozanyan, ''Evaluation of supervised classification algorithms for human activity recognition with inertial sensors,'' in Proc. IEEE SENSORS Conf., Glasgow, U.K., Nov. 2017, pp. 1–3.
[5] N. Twomey, T. Diethe, I. Craddock, and P. Flach, ''Unsupervised learning of sensor topologies for improving activity recognition in smart environments,'' Neurocomputing, vol. 234, pp. 93–106, Apr. 2017.
[6] (2019). SparkFun Edge Development Board—Apollo3 Blue. [Online]. Available: https://learn.sparkfun.com/tutorials/using-sparkfun-edge-board-with-ambiq-apollo3-sdk/all
[7] M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang, ''Convolutional neural networks for human activity recognition using mobile sensors,'' in Proc. 6th Int. Conf. Mobile Comput., Appl. Services, Nov. 2014, pp. 197–205.
[8] A. Bulling, U. Blanke, and B. Schiele, ''A tutorial on human activity recognition using body-worn inertial sensors,'' ACM Comput. Surv., vol. 46, no. 3, pp. 1–33, 2014.
[9] F. J. Ordóñez and D. Roggen, ''Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,'' Sensors, vol. 16, no. 1, p. 115, 2016.
[10] C. A. Ronao and S.-B. Cho, ''Human activity recognition with smartphone sensors using deep learning neural networks,'' Expert Syst. Appl., vol. 59, pp. 235–244, Oct. 2016.
[11] T. Zebin, M. Sperrin, N. Peek, and A. J. Casson, ''Human activity recognition from inertial sensor time-series using batch normalized deep LSTM recurrent networks,'' in Proc. 40th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., Jul. 2018, pp. 1–4.
[12] F. M. Rueda, R. Grzeszick, G. A. Fink, S. Feldhorst, and M. T. Hompel, ''Convolutional neural networks for human activity recognition using body-worn sensors,'' Informatics, vol. 5, no. 2, p. 26, 2018.
[13] M. Gochoo, T.-H. Tan, S.-H. Liu, F.-R. Jean, F. Alnajjar, and S.-C. Huang, ''Unobtrusive activity recognition of elderly people living alone using anonymous binary sensors and DCNN,'' IEEE J. Biomed. Health Inform., vol. 23, no. 2, pp. 693–702, Mar. 2018.
[14] D. Ravi, C. Wong, B. Lo, and G. Z. Yang, ''Deep learning for human activity recognition: A resource efficient implementation on low-power devices,'' in Proc. IEEE 13th Int. Conf. Wearable Implant. Body Sensor Netw., Jun. 2016, pp. 71–76.
[15] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. del R. Millán, and D. Roggen, ''The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition,'' Pattern Recognit. Lett., vol. 34, no. 15, pp. 2033–2042, Jan. 2009.
[16] A. Reiss and D. Stricker, ''Introducing a new benchmarked dataset for activity monitoring,'' in Proc. 16th Int. Symp. Wearable Comput., Jun. 2012, pp. 108–109.
[17] M. Lichman. (2013). UCI HAR Machine Learning Repository. [Online]. Available: https://archive.ics.uci.edu/ml/machine-learning-databases/00240/
[18] N. Y. Hammerla, S. Halloran, and T. Plötz, ''Deep, convolutional, and recurrent models for human activity recognition using wearables,'' in Proc. 25th Int. Joint Conf. Artif. Intell., New York, NY, USA, Jul. 2016, pp. 1533–1540.
[19] MPU-9150 Product Specification, InvenSense, San Jose, CA, USA, 2012.
[20] O. Banos, J.-M. Galvez, M. Damas, H. Pomares, and I. Rojas, ''Window size impact in human activity recognition,'' Sensors, vol. 14, no. 4, pp. 6474–6499, Apr. 2014.
[21] J. Liono, A. K. Qin, and F. D. Salim, ''Optimal time window for temporal segmentation of sensor streams in multi-activity recognition,'' in Proc. 13th Int. Conf. Mobile Ubiquitous Syst., Comput., Netw. Services, 2016, pp. 10–19.
[22] F. Chollet. (2013). Keras: The Python Deep Learning Library. [Online]. Available: https://keras.io/
[23] A. Doherty, D. Jackson, N. Hammerla, T. Plötz, P. Olivier, M. H. Granat, T. White, V. T. van Hees, M. I. Trenell, C. G. Owen, S. J. Preece, R. Gillions, S. Sheard, and N. J. Wareham, ''Large scale population assessment of physical activity using wrist worn accelerometers: The UK Biobank study,'' PLoS ONE, vol. 12, no. 2, 2017, Art. no. e0169649.
[24] X. Zhang, X. Zhou, M. Lin, and J. Sun, ''ShuffleNet: An extremely efficient convolutional neural network for mobile devices,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848–6856.
[25] I. Song, H.-J. Kim, and P. B. Jeon, ''Deep learning for real-time robust facial expression recognition on a smartphone,'' in Proc. IEEE Int. Conf. Consum. Electron., Jan. 2014, pp. 564–567.
[26] I. Andrey, ''Real-time human activity recognition from accelerometer data using convolutional neural networks,'' Appl. Soft Comput., vol. 62, pp. 915–922, Jan. 2017.
[27] C. K. Wong, H. M. Mentis, and R. Kuber, ''The bit doesn't fit: Evaluation of a commercial activity-tracker at slower walking speeds,'' Gait Posture, vol. 59, pp. 177–181, Jan. 2018.
[28] Y. Huang, J. Xu, B. Yu, and P. B. Shull, ''Validity of FitBit, Jawbone UP, Nike+ and other wearable devices for level and stair walking,'' Gait Posture, vol. 48, pp. 36–41, Jul. 2016.
[29] J. Huang, S. Lin, N. Wang, G. Dai, Y. Xie, and J. Zhou, ''TSE-CNN: A two-stage end-to-end CNN for human activity recognition,'' IEEE J. Biomed. Health Informat., to be published.
[30] E. Grolman, A. Finkelshtein, R. Puzis, A. Shabtai, G. Celniker, Z. Katzir, and L. Rosenfeld, ''Transfer learning for user action identification in mobile apps via encrypted traffic analysis,'' IEEE Intell. Syst., vol. 33, no. 2, pp. 40–53, Mar./Apr. 2018.
[31] (2019). TensorFlow Lite for Mobile and Embedded Learning. [Online]. Available: https://www.tensorflow.org/lite/microcontrollers/overview
[32] R. Krishnamoorthi, ''Quantizing deep convolutional networks for efficient inference: A whitepaper,'' Jun. 2018, arXiv:1806.08342. [Online]. Available: https://arxiv.org/abs/1806.08342
[33] J. Hanhirova, T. Kämäräinen, S. Seppälä, M. Siekkinen, V. Hirvisalo, and A. Ylä-Jääski, ''Latency and throughput characterization of convolutional neural networks for mobile computer vision,'' Mar. 2018, arXiv:1803.09492. [Online]. Available: https://arxiv.org/abs/1803.09492
[34] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, ''AI benchmark: Running deep neural networks on Android smartphones,'' in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 288–314.
[35] (2019). Trepn Power Profiler—Qualcomm Developer Network. [Online]. Available: https://developer.qualcomm.com/software/trepn-power-profiler
[36] B. Moons, K. Goetschalckx, N. Van Berckelaer, and M. Verhelst, ''Minimum energy quantized neural networks,'' in Proc. 51st Asilomar Conf. Signals, Syst., Comput., Oct./Nov. 2017, pp. 1921–1925.
[37] T. Zebin. (2018). Deep Learning Demo. [Online]. Available: https://github.com/TZebin/Thesis-Supporting-Files/tree/master/Deep Learning Demo
[38] (2019). ARM Cortex-M Series Processors. [Online]. Available: https://developer.arm.com/ip-products/processors/cortex-m

TAHMINA ZEBIN received the undergraduate and M.S. degrees in applied physics, electronics and communication engineering from the University of Dhaka, Bangladesh, and the M.Sc. degree in digital image and signal processing from the University of Manchester, in 2012. Before joining as a Lecturer with the University of East Anglia, she was a Postdoctoral Research Associate in the EPSRC funded project Wearable Clinic: Self, Help and Care with the University of Manchester, and was a Research Fellow in health innovation ecosystem with the University of Westminster. Her current research interests include advanced image and signal processing, human activity recognition, and risk prediction modeling from electronic health records using various statistical and deep learning techniques. She was a recipient of the President's Doctoral Scholarship, from 2013 to 2016, for conducting her Ph.D. in electrical and electronic engineering.

PATRICIA J. SCULLY received the Ph.D. degree in engineering from the University of Liverpool, Liverpool, U.K., in 1992, and held a Reader position with Liverpool John Moores University, in 2000. She was a Senior Lecturer/Associate Professor in sensor instrumentation with the University of Manchester, in 2002, before moving to NUI Galway, Ireland, in 2018. She is experienced in leading industrial and research council/government funded research projects at national and international levels and has research interests in sensors and monitoring for industrial processes, including optical fiber technology and photonic materials for sensors and devices, ranging from functional chemically sensitive optical coatings to laser-inscribed photonic and conducting structures in transparent materials that affect the properties of light.

NIELS PEEK received the M.Sc. degree in computer science and artificial intelligence, in 1994, and the Ph.D. degree in computer science, in 2000, from Utrecht University. From 2013 to 2017, he was the President of the Society for Artificial Intelligence in Medicine. He is currently a Professor of health informatics with the University of Manchester. He has coauthored more than 200 peer-reviewed scientific publications. His research interests include data-driven methods for health research, healthcare quality improvement, and computerized decision support. In 2018, he was an Elected Fellow of the American College of Medical Informatics and a Fellow of the Alan Turing Institute, the U.K.'s national institute for data science and artificial intelligence.


ALEXANDER J. CASSON received the master's degree in engineering science from the University of Oxford, Oxford, U.K., in 2006, and the Ph.D. degree in electronic engineering from Imperial College London, London, U.K., in 2010. Since 2013, he has been a Faculty Member with The University of Manchester, Manchester, U.K., where he leads a research team focusing on next generation wearable devices and their integration and use in the healthcare system. He has published over 100 articles on these topics. He is the Vice-Chair of the IET Healthcare Technologies Network and a Lead of the Manchester Bioelectronics Network.

KRIKOR B. OZANYAN received the M.Sc. degree in engineering physics (semiconductors) and the Ph.D. degree in solid-state physics, in 1980 and 1989, respectively. He is currently the Director of Research with the School of EEE, The University of Manchester, U.K. He has more than 300 publications in the areas of devices, materials, and systems for sensing and imaging. He is a Fellow of the Institute of Engineering and Technology, U.K., and the Institute of Physics, U.K. He was a Distinguished Lecturer of the IEEE Sensors Council and was the Editor-in-Chief of the IEEE SENSORS JOURNAL and the General Co-Chair of the IEEE Sensors Conferences in the last few years.
