Human Activity Classification
Aristos Athens, Zachary Blum, Navjot Singh
I. INTRODUCTION

Activity recognition is an important task in several healthcare applications. By continuously monitoring and analyzing user activity, it is possible to provide automated recommendations to both patients and doctors [2, 6]. There are also applications in consumer products, such as data logging for smart-watch health apps [3]. Common consumer devices such as smartphones and smart watches generally ship with IMUs (Inertial Measurement Units), which are packaged accelerometer and gyroscope sensors. Using the information provided by these IMUs, machine learning techniques can be used to train activity classifiers, giving users, doctors, and app developers access to an individual's lifestyle and activity choices. In this paper we examine two questions in parallel: what is the "best" classification technique, and how well can it perform with fewer features? To compare classification techniques we use a variety of metrics, including classification accuracy, required dataset size, and prediction speed. In particular, we examine the use of logistic regression, support vector machines (SVMs), various decision tree techniques, and neural networks. We then examine whether these techniques can perform just as well with reduced sensor channels, for example an IMU from a single body location vs. multiple IMUs placed across the body.
II. RELATED WORK

Because of its many applications, supervised human activity classification using sensor data is a relatively popular research area. Through our research, we have found that related articles and their approaches can generally be divided into three categories: Naive Bayes classification, SVMs/decision trees, and neural networks.

We consider the use of Naive Bayes as a classifier for human activity clever and interesting, since it is usually used for text classification. One such article that uses the Naive Bayes classifier is Long, Yin, and Aarts, 2009 [9]. This article was similar to our paper in that it included sensor data from multiple subjects for the purpose of multi-class activity classification, and it evaluated its models using cross-validation (and compared its results to those of other methods, including decision trees). One way in which this paper differs from our work (and is a strength of the paper) is its use of Principal Component Analysis before conducting Naive Bayes, to reduce the feature space and the correlation between the features. This classification model, however, only had an accuracy of around 80 percent.

One study that examined decision trees was [Parkka, 2006] [5]. Similar to our paper, the study used an ordinary decision tree grown via cross-validation, using the Gini loss for each split. One strength of the article is that, in addition to this "automatically generated decision tree," the researchers also created a "custom decision tree" using their expert knowledge and analysis of the sensor output. However, the classes in the article are improperly balanced (with one class accounting for between 50 and 60 percent of all of the data), which could affect the true test performance (assuming that the test data also suffers from the same problem).

The paper by Youngwook Kim and Hao Ling [7] discusses an interesting approach to human activity recognition using data obtained from a Doppler radar. Similar to our approach, this paper incorporates SVM models that are tuned through cross-validation over a range of hyperparameter values to pick an optimal one. However, the features and characteristics of the data are intrinsically different from the IMU and heart rate data that our paper deals with. For example, we do not have to take into consideration the angle of the subject with respect to the radar and apply a different model based on such edge cases. These variations in the data lead the authors to create a classification model consisting of both decision trees and SVMs. We believe our data is easier to classify and as such can achieve a higher classification rate than the 90% achieved in this paper.

We found that the state-of-the-art approach to supervised human activity classification generally involves neural networks, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs). A number of studies [8] [12] use multiple datasets to classify similar activities. Both [Hamerla, 2016] and [Xue] incorporate CNNs and KNNs with moving windows in order to look at the activities of a user over a period of time, and they achieve accuracy results above 95%. Given the limited scope of CS 229, we did not try to emulate these techniques, but we believe they can produce more robust models, especially for similar activities like playing soccer or running, where it may be important to look over a history rather than a particular time-stamp.

III. DATASET AND FEATURES

We used the PAMAP2 dataset from the UCI repository of machine learning datasets [16, 15]. This dataset includes raw 9-axis IMU data streams (from the hand, chest, and ankle) as well as heart rate data from nine different subjects performing various activities. Each IMU provides temperature, 3-axis acceleration, 3-axis angular velocity, and 3-axis magnetometer data at a rate of 100 Hz. In total there are 1.9 million data points, each containing 52 features. In the dataset, each time-step is labeled with an activity ID, one of 12 different activities that the subjects were engaged in. The 12 activities are the following: ironing, walking, lying, standing, sitting, Nordic walking, vacuum cleaning, cycling, ascending stairs, descending stairs, running, and rope jumping.

We had to clean the data for a few reasons. The frequency of the time-stamps matches the highest-frequency sensors, the IMUs, which read every 0.01 seconds (100 Hz). However, the heart-rate monitor had a frequency of 9 Hz, meaning approximately 90% of the heart-rate column consisted of NaN values. We filled in missing heart-rate values by linearly interpolating between the nearest valid readings.
For the logistic regression, SVM, and neural network models, an additional preprocessing step of subtracting the training mean and dividing by the training standard deviation was added. We combined each subject's data into a single matrix and divided that into a training and a test set, using 85% of our data for training and 15% for testing. To avoid over-fitting on only a certain subset of classes, we randomly split the data between training and testing. To train our models, we performed 5-fold cross-validation on our training set to tune the hyperparameters for the given model. We then used this optimized model and analyzed its performance on the test set.
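A sketch of this preprocessing and splitting protocol using scikit-learn (the array names X and y, and the use of StandardScaler/train_test_split, are illustrative assumptions rather than our exact code):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) sensor matrix; y: activity IDs (assumed names).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, shuffle=True, random_state=0
)

# Fit the scaler on the training set only, so the test set is normalized
# with the training mean and training standard deviation.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```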
IV. METHODS

A. Regression

As a baseline measure, we incorporated a standard logistic regression model with a loss function of the form shown in (1). To make the model more robust against overfitting, L2 regularization was incorporated, with the C parameter inversely related to the strength of regularization. In order to pick the optimal value for C, the model was independently trained over a range of C values. The C value that produced the highest accuracy on the validation set was selected and evaluated on the test set.

J(\theta) = \sum_{i=0}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{1}{C} \|\theta\|_2^2    (1)

For each training phase, 5-fold cross-validation was incorporated in order to reduce the variance of the trained model. Stochastic Average Gradient (SAG) descent was selected as the solver for training, as it generally provides fast convergence for large feature-sets such as ours [10]. The SAG solver is only guaranteed to converge quickly if all features are on approximately the same scale; thus, the normalization preprocessing step described earlier is necessary.
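The corresponding scikit-learn call is sketched below; the C grid and variable names are illustrative assumptions rather than our exact configuration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L2-regularized logistic regression with the SAG solver; C is the
# inverse regularization strength, tuned by 5-fold cross-validation.
param_grid = {"C": [0.001, 0.01, 0.1, 1.0, 10.0]}  # assumed grid
search = GridSearchCV(
    LogisticRegression(solver="sag", penalty="l2", max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```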
B. Support Vector Machine

We also wanted to create a non-linear classifier, in the hope that it could outperform the linear logistic regression model. An SVM was a natural choice, as it is fairly easy to implement with few parameters to tune. We created our SVMs by solving the following primal problem:

\min_{\gamma, w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \zeta_i
\quad \text{s.t.} \; y^{(i)} (w^T x^{(i)} + b) \geq 1 - \zeta_i, \; i = 1, \dots, m; \quad \zeta_i \geq 0, \; i = 1, \dots, m    (2)

and the decision function is defined as:

\mathrm{sgn}\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + \rho \right)    (3)

Through training, an SVM model can create multiple hyperplanes to split the training set into its labeled categories. This is done through the use of kernels that transform the input data into a higher dimension, so that the data can then be linearly separated. Part of the advantage of SVMs is that only a fraction of the original training set (the n support vectors in eqn. (3)) has to be retained for evaluating the decision function during predictions. Another advantage is that various kernel functions can easily be tested, and the one that best fits a particular application selected. Thus, to select the optimal kernel function along with the C parameter specifying the soft-margin size, a grid search was performed over three standard kernel functions (polynomial, RBF, and linear) with a range of C values for each.
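A sketch of this kernel/C grid search with scikit-learn's SVC (the grid values are assumed for illustration):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search over the three kernels and a range of soft-margin C values.
param_grid = {
    "kernel": ["poly", "rbf", "linear"],
    "C": [0.1, 1.0, 10.0, 100.0],  # assumed range
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```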
C. Deep Learning

Based on our literature review, we believed that deep learning techniques should work well for this classification problem. We therefore implemented a multilayer perceptron (MLP) architecture for multi-class classification. A multilayer perceptron is a fully connected feedforward neural network with one input layer, one or more hidden layers, and one output layer. Formally, the MLP can be considered a function f: R^n → R^k, where n is the number of input features and k is the number of classes. Each hidden layer can be formalized as f: R^a → R^b, where a is the input size and b is the output size. In matrix notation:

f(x) = A_c (W_c x + b_c)    (4)

where x is the input vector, W_c is the weight matrix associated with layer c, b_c is the bias vector associated with layer c, and A_c is the activation function associated with layer c.

We use softmax as the final activation, so each prediction has size (k × 1), where k is the number of classes. The classifier predicts a score for each class, instead of simply producing a single class label; this can give a sense of how close we are to the correct label. We converted the true labels to one-hot encoding; that is, we took an (m × 1) array and made it (m × k), where m is the number of datapoints. We initially found that making the model deeper produced worse results, but increasing the number of neurons per layer improved the accuracy. Our final architecture is Layer1 → ReLU → Layer2 → ReLU → Layer3 → Softmax. The layers have weight sizes of Layer1(n, 512), Layer2(512, k). In order to generalize better we apply dropout after each hidden layer; this helps combat overfitting by suppressing each node with 50% probability. We tried different gradient descent rules, with the best results coming from the SGD optimizer. To evaluate loss we use categorical cross-entropy, which is as follows:

CE(y') = - \sum_{j=1}^{k} y_j \log(y'_j)    (5)

We use this in conjunction with softmax activation for the final layer, which gives us a probability, or confidence, for each prediction. The softmax output for the i-th element is as follows:

SM(y', i) = \frac{e^{y'_i}}{\sum_{j=1}^{k} e^{y'_j}}    (6)

This is advantageous because we have multiple classes, instead of a simple binary classification. We want the loss to give a sense of the degree of error for each category, instead of a simple yes or no. For example, assume we have 3 classes, and the first class is the correct label for this time step. Consider two example outputs: (0.98, 0.01, 0.01) and (0.51, 0.48, 0.01). The first output is clearly superior to the second, because it has a higher confidence for the correct class; however, both will "predict" the first class. If we used something like simple error rate for our loss, we would not be able to capture the confidence of our predictions. Using softmax activation with cross-entropy loss helps us capture this difference. Our final architecture is shown in Figure 1. This neural network was implemented using TensorFlow's Keras [1].
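A minimal Keras sketch of this architecture, following the two listed weight sizes Layer1(n, 512) and Layer2(512, k); the class count, learning rate, and training schedule are assumptions where the text leaves them unstated:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

n_features = 52  # full feature-set (smaller for the reduced set)
n_classes = 12   # assumed from the activity labels described above

# Fully connected MLP: Dense+ReLU hidden layer with 50% dropout,
# softmax output, categorical cross-entropy loss, SGD optimizer.
model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # assumed LR
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# y_train_onehot: (m, k) labels, e.g. via tf.keras.utils.to_categorical.
# model.fit(X_train, y_train_onehot, epochs=20, batch_size=128)  # assumed
```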
Fig. 1. We achieved our best results with this architecture. Our architecture had size Layer1(n, 512), Layer2(512, k), where n is the number of data features, ranging from 12 to 52, and k = 18 is the number of class labels. We use ReLU activation, softmax final activation, cross-entropy loss, and applied dropout to avoid overfitting.
D. Trees

Decision trees are a useful method for multi-class classification on nonlinear feature sets. Decision trees perform greedy "splits" on each feature of the data at a specific threshold. In order to choose a split, a decision tree seeks to maximize the difference between the loss of the parent node and the sum of the losses of the child nodes. The specific loss function we used was the Gini loss shown below, where p_{mk} is the proportion of examples of class k present in region R_m, and q_m is the proportion of examples in R_m from tree T with |T| different R_m regions [13]:

\sum_{m=1}^{|T|} q_m \sum_{k=1}^{K} p_{mk} (1 - p_{mk})    (7)

There are multiple methods of regularizing, or preventing overfitting in, decision trees, including setting a minimum size for leaf (terminal) nodes, setting a maximum tree depth, and setting a maximum number of nodes [11]. For this paper, we chose to regularize using the maximum tree depth. All of these decision tree classifiers were implemented using the scikit-learn library [14].

1) Ordinary Decision Trees: The ordinary decision tree classifier uses the above algorithm to create one tree, and we test the resulting tree on a test set. While this generally performs well (depending on the dataset), there are a few reasons to "ensemble," or combine, multiple decision trees, primarily the fact that individual decision trees have high variance, which can lead to low prediction accuracy [11].

2) Boosted Decision Trees: One form of ensembling is "boosting," which combines many "weak learners" (simple decision trees) in order to reduce bias in the model (at the expense of increasing variance). Boosting these weak decision trees is done with the aim of improving accuracy, though it may be prone to overfitting [11]. One specific boosting algorithm, used in this paper, is AdaBoost, as described in [4].

3) Random Forest: Another ensemble method for decision trees, aimed at improving prediction accuracy, is the random forest. Random forest is a form of bagging (bootstrap aggregation), which involves sampling with replacement from the original population for the purpose of reducing variance (at the expense of an increase in bias, increased computational cost, and decreased interpretability of the trees). For a random forest, a large number of decision trees are generated, and the variance is further reduced (by decorrelating the trees) by only considering a subset of the total number of features at each split in the decision tree [11].
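These three classifiers map directly onto scikit-learn estimators. A sketch follows, in which the depths and estimator counts are assumed example values, not our tuned hyperparameters:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Single tree, regularized by maximum depth; Gini impurity is the
# default split criterion in scikit-learn.
tree = DecisionTreeClassifier(criterion="gini", max_depth=10)  # assumed depth

# AdaBoost; scikit-learn's default weak learner is a depth-1 tree (a "stump").
boosted = AdaBoostClassifier(n_estimators=100)  # assumed count

# Random forest: bagged trees, each split considering a random feature subset.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")

for clf in (tree, boosted, forest):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```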
V. RESULTS AND DISCUSSION

As discussed previously, we were interested in classification performance on both the full set of features and on a reduced set of features. We tried several feature combinations and discovered that we could get decent performance using just the hand IMU plus the heart rate sensor; the hand IMU performed better than any other individual IMU. This is a positive result, as we are particularly interested in applications where a user is holding a phone or wearing a smart watch. We will refer to the hand IMU plus heart rate data as the "reduced" or "limited" feature-set.

Our primary evaluation metric for all models was classification accuracy. This is simply the count of correctly classified data points divided by the count of classifications attempted. It is as follows:

f(y') = \frac{1}{m} \sum_{i=1}^{m} 1\{y'_i == y_i\}    (8)

where y' is our set of predicted labels and y is the set of true labels. We used various loss functions, as described in the subsection for each technique. The full results are shown in Figure 2 and explained in detail in the remainder of this section.

Fig. 2. Training and Test Results for all Methods, for both Full and Limited Feature-sets.

A. Logistic Regression

Through 5-fold cross-validation, in Fig. 3 we see that the optimal C value for both feature-sets occurs around 0.01. The SAG algorithm calculates the learning rate automatically, so one did not need to be selected beforehand [10]. With a chosen C value of 0.01, the limited feature-set model achieved 63.9% accuracy on the test set and the full feature-set model achieved 81.5% accuracy, both of which are very similar to the validation-set results. From the confusion matrix (not shown due to space constraints), the main sources of error for both feature-set models are activities like vacuum cleaning and ironing, Nordic walking and walking, or lying and sitting. As expected, these activities have similar body movements, so from the point of view of a sensor reading they may be hard to distinguish for a linear classifier.
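The per-activity error analysis above comes from a confusion matrix; a sketch of how such a matrix is produced with scikit-learn (clf stands in for any fitted classifier from the previous sections):

```python
from sklearn.metrics import confusion_matrix

# Rows are true activities, columns are predicted activities; large
# off-diagonal entries flag confusable pairs such as walking vs.
# Nordic walking or lying vs. sitting.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
```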
C. Decision Trees

1) Ordinary Decision Trees: In order to tune the maximum-depth hyperparameter of the decision tree, we used scikit-learn's validation curve function to perform 5-fold cross-validation. The training and cross-validation curves for the limited feature-set are shown on the left side of Fig. 4 below as a function of maximum tree depth, and similar curves were produced for the full feature-set. It is

Fig. 4. Left: Training and Validation Curves for Limited Feature-set using Ordinary Decision Trees. Right: Training loss and accuracy for the Multilayer Perceptron neural net when trained on the reduced feature-set.
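A sketch of this max-depth sweep with scikit-learn's validation_curve (the depth range is an assumed example):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

depths = np.arange(1, 21)  # assumed sweep range
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(criterion="gini"),
    X_train, y_train,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)
# One curve per array: mean accuracy across the 5 folds at each depth.
print(train_scores.mean(axis=1), val_scores.mean(axis=1))
```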