
Fall Detection2018

1) The document proposes a deep learning-based approach using LSTM neural networks for human fall detection in smart homes from video data. 2) Existing methods rely on hand-crafted features which constrain performance, while deep learning can automatically learn discriminative features. 3) Transfer learning is employed to overcome the need for huge training datasets, utilizing a large public action recognition dataset including falls.

Video-based Human Fall Detection in Smart Homes Using Deep Learning

Anahita Shojaei-Hashemi1, Panos Nasiopoulos1, James J. Little2
1 Department of Electrical & Computer Engineering
2 Department of Computer Science
University of British Columbia
Vancouver, Canada

Mahsa T. Pourazad
TELUS Communications Inc.
Vancouver, Canada

Abstract: Automatic human fall detection is a challenging healthcare task in smart homes, and video cameras have proved to be effective in addressing this problem. Although existing methods perform relatively well, they are all built upon “hand-crafted” features, thus constraining the performance of the model to some presumed conditions and scenarios, and making it vulnerable to any deviation from the assumed settings. In this paper, we propose a deep-learning-based approach for human fall detection, using a long short-term memory neural network. Our model is not restricted to any specific circumstances, and performance evaluations show that it outperforms all the existing methods.

Keywords: human fall detection; deep learning; long short-term memory (LSTM); smart home; depth camera

I. INTRODUCTION

The concept of a "smart home" is a major step towards wellness and improved quality of life and a hot interdisciplinary research topic bringing together artificial intelligence, cloud computing, communications and networks, psychology and healthcare [1]. Monitoring the well-being of the residents is an expected service to be provided by a smart home. Although older adults, patients newly released from hospital and chronically ill people are more vulnerable to falling, everyone is prone to it. It can be especially dangerous for people living alone, as the incident might take hours or even days to be discovered with the person remaining injured or unconscious, while it is very important, if not vital, for the fallen person to be immediately taken care of and possibly transported to the hospital. Hence, there is a need for a monitoring system, in a smart residence, which can automatically detect fall incidents and send alarms to the emergency medical center and/or family and friends.

Devices used for fall detection are divided into three categories: wearable devices, ambient sensors, and video cameras, each having its own pros and cons, which makes them complementary to each other. Wearable devices are relatively inexpensive and can directly measure kinematic quantities. However, they cannot recognize complex physical motions and can be considered intrusive [2]. Ambient sensors, on the other hand, are versatile in application and can be installed anywhere or embedded in any object without interfering with the resident’s daily life. Nevertheless, limited information can be obtained out of the data collected by each sensor, so a large number of sensors must be installed, which is costly and difficult. Also, the data has a low signal-to-noise ratio. Video cameras can be conveniently mounted in various places of the residence, and they collect data with rich information content that can be used for several tasks. The downside, however, is that cameras are considered intrusive and introduce privacy issues if video content is captured and stored. Nowadays, inexpensive depth cameras, such as the Microsoft Kinect, can address some of the privacy issues and under proper implementation conditions could be a promising and feasible option for human fall detection in the context of smart homes.

Limited research has been done so far on depth-camera-based fall detection. In [3], Marzahl et al. apply a manual pre-segmentation to exclude areas where a fall cannot occur, e.g., cupboards. Then, a decision tree detects any person lying on the floor by exploiting spatial characteristics of segmented objects. Out of the 55 fall samples collected in a laboratory setting, 93% have been classified correctly. In [4], Mastorakis and Makris use the 3D bounding box of the subject, extracted by the OpenNI framework [5], and employ thresholding on the height, width, and depth of the bounding box as well as their first derivatives to detect falling incidents. All falls have been detected with no false alarms for a dataset of 48 fall samples and 112 non-fall samples, collected in a laboratory setting. Bian et al. in [6] extract the key joints of the human body using their own proposed algorithm, and apply a support vector machine (SVM) on the 3D trajectory of the head joint to detect fall incidents. In evaluation on a dataset of 380 samples, collected in a laboratory setting and containing an equal number of fall and non-fall samples, all falls have been detected and 9 false alarms have been generated. In [7], Rougier et al. detect the ground plane of the room, segment the person from the background, localize and track the 3D centroid of the person, and detect falls by thresholding the centroid height relative to the ground and the centroid velocity. Only one fall has not been detected out of 25 fall samples, and no false alarms have been reported on 54 non-fall samples. All the samples have been collected in a laboratory setting. In [8], Planinc and Kampel employ the skeletal data extracted by the Microsoft Software Development Kit (SDK) for Kinect to calculate the orientation of the person’s major axis and the height of the spine relative to the ground. These features are then used for fall detection.

978-1-5386-4881-0/18/$31.00 ©2018 IEEE


The method has been evaluated on a data set consisting of 40 fall samples and 32 non-fall samples collected in a laboratory setting. 37 falls have been detected with five false alarms.

Although the existing methods perform relatively well, in all of them fall detection is based on video features which are manually chosen by the user. Designing these “hand-crafted” features inevitably involves making several assumptions about the problem. This not only requires background knowledge on the application field, but also constrains the performance of the model to some presumed conditions and scenarios, and makes the trained model sensitive to any deviation from the training data. Furthermore, expanding the model or modifying it for different though similar tasks is difficult and usually requires redesigning the features.

In this paper, we propose a deep-learning-based approach for human fall detection, using a long short-term memory (LSTM) neural network. In contrast to classic methods whose performance is limited by the presumed conditions, the deep neural network takes care of feature design itself and extracts the most discriminative features based on the training data, and hence, covers more real-life scenarios. To overcome the inherent requirement of deep learning approaches for huge training data sets, we employ transfer learning. We trained and tested our proposed approach on the “NTU RGB+D Action Recognition Dataset” [9], which is a recently released public dataset consisting of about 50000 samples on human actions including falling. The results show that our deep-learning-based model has great performance in detecting falls, and it outperforms all the existing methods which are based on hand-crafted features.

The rest of this paper is organized as follows: Section II provides a background overview on LSTM neural networks and transfer learning. Section III explains the proposed method in detail. Section IV discusses the experimental results, and Section V concludes the paper.

II. BACKGROUND OVERVIEW

A. Long Short-Term Memory (LSTM) Neural Networks

In machine learning, there are many scenarios where the output at each point (of time, space, etc.) depends not only on the input at the same point, but also on the inputs at other points. Such data is referred to as sequential data, and classifying such data is called sequence labeling. Data can be sequential along any continuous or discrete dimension, such as time and space. For the sake of simplicity and without loss of generality, in the rest of this section we assume the data is sequential in time. In the context of deep learning, the recurrent neural network (RNN) is a prevalent option for sequence labeling [10], [11], [12]. The difference between a regular neural network, which is also called a feed-forward network, and an RNN is the presence of a feedback loop in the RNN. When unfolded, this loop produces a recurrent connection in the network, which models the correlation existing among the elements of the sequential data.

In order to have an RNN with more than one hidden layer, at each time step, the hidden state of each layer should be fed as the input to the next hidden layer of the same time step, in addition to getting passed to the hidden layer of the next time step at the same level. An RNN can be deep either layer-wise or time-wise, or both. Nevertheless, a regular RNN does not train well if it is deep in time, because of a phenomenon called “vanishing gradient”. To overcome this issue, so that the neural network can also handle long sequences, the long short-term memory (LSTM) neural network was introduced as a modified version of RNN [13].

B. Transfer Learning

Most machine learning methods work well only when the test data is drawn from the same feature space and has the same feature distribution as the training data. Thus, for a new application where the data has a different feature space or different feature distribution, the model should be rebuilt from scratch using a new set of training data. Nevertheless, this is not always feasible in practice, since it is usually expensive, if at all possible, to collect the required training data, and it is time and resource consuming to retrain the model. Transfer learning, or knowledge transfer, tries to address this issue by permitting the feature distribution of the training and the test data to be different. Instead of learning the new task from scratch, transfer learning tries to transfer the knowledge learnt in the previous task to the new one. In other words, transfer learning aims to extract knowledge from one or more source tasks and to apply it to a target task [14].

Transfer learning can be of great benefit to deep models, as they need abundant data to get their plentiful parameters trained on, while the amount of data is limited in many applications. Transfer learning in deep neural networks working with visual data is based on the interesting fact that regardless of the dataset and the objective function, such networks in their first layer tend to identify basic features such as lines and corners [15]. Therefore, features generated by their first layer can be considered as “general” features. Ascending through the layers, the learnt filters become more dependent on the dataset and the objective function, so the extracted features get more “specific”. Based on this fact, transfer learning for deep neural networks takes the following path: First, the source network is trained on the source dataset and task. Next, the weights of the first few layers are copied to the target network to transfer the learnt general features. The rest of the layers of the target network are then initialized and trained on the target data, as usual. Finally, the transferred layers are either fine-tuned or left unaltered. To avoid overfitting, the choice depends on the size of the target dataset and the number of parameters in the transferred layers.

III. PROPOSED METHOD

As previously mentioned, our goal is to design a video-based deep model for human fall detection in indoor environments to be used in a smart home setting. In order for the model to be robust against changes in lighting conditions, e.g., to perform even in darkness, and for preserving the privacy of the residents, we opt for depth cameras, which produce depth maps instead of regular images. Considering the fact that the 3D locations of major body joints carry most of the body kinematic information required for discriminating different actions, keeping track of the body joints, as shown by our evaluations, proves to be sufficient for action recognition and fall detection, while it is computationally much cheaper. Since the existing algorithms to extract skeletons from depth maps, such as the one provided by the Microsoft Software Development Kit (SDK), process the video sequence frame by frame, the use of body skeleton information by our model can be implemented in real-time.

There is correlation among the states of the body skeleton at different time steps. To exploit this sequential information intrinsically embedded in the input sequence, we choose RNN as our deep learning model. As actions usually take place within a long sequence of frames, vanilla RNN encounters the vanishing gradient issue, so we specifically take LSTM, which can go deep in time. As a deep neural network, LSTM requires abundant training data. However, the number of fall incident samples is limited, as falls happen only occasionally in real life and are difficult to simulate in laboratory settings. On the other hand, a large amount of data is available on ordinary actions, such as walking, drinking, picking up, etc. Thus, we propose a way to make use of this data to compensate for the lack of fall samples, by employing transfer learning in training the LSTM.

As illustrated in Fig. 1, we first train a multi-class LSTM on the abundant samples of human regular actions. Then, we transfer all the learnt weights, except for those of the last layer, to a two-class LSTM that is designed for fall detection. Finally, we train the last layer of the two-class LSTM on scarce human fall samples in combination with part of the regular action samples. The structures of the two LSTMs are the same. They only differ in the last layer, where the number of units equals the number of classes. To prevent the LSTMs from getting biased to training data, i.e. overfitting, we apply “dropout” regularization [16].

Fig. 1. The overall structure of our proposed method. (1) Multiclass LSTM, trained on abundant regular human actions: which non-fall action is the input? (2) Two-class LSTM, trained on scarce fall samples: is the input a fall or a non-fall action? Both networks take a sequence of body skeletons at the input layer; the general features are transferred with their weights kept fixed, and only the output layers (classes) differ.

There are several parameters of the designed model which need to be set. These include the depth of the LSTM in time, the number of the layers, the number of the hidden units in each layer, and the dropout ratio. They can be selected either heuristically or through random search optimization.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

We evaluated the performance of our proposed method on the “NTU RGB+D Action Recognition Dataset”. This dataset contains 56,880 video samples, each consisting of one action. The samples encompass 60 different actions, including falling, and are performed by 40 subjects, aged 10 to 35. Eleven of the actions involve two persons, and the rest are solo. The videos have been captured through three synchronous Microsoft Kinect v2 cameras, which were installed at the same height with three different horizontal angles: −45°, 0°, +45°. Each sample is available in 4 modalities: RGB video, depth map sequence, skeleton sequence, and infrared (IR) video. All the modalities are recorded at 30 fps. The resolutions of the depth map frames, RGB frames, and IR frames are 512 × 424, 1920 × 1080, and 512 × 424, respectively. The skeletons are composed of the 3D coordinates of 25 major body joints, and each frame contains up to two skeletons. The pixel corresponding to each joint is also provided for RGB frames and depth maps [9]. As explained before, the input of our proposed model is a skeleton sequence, so we only used the skeleton modality of the NTU dataset. Some of the skeleton sequences in this dataset have missing frames, skeletons, or joints. We removed those sequences as well as the actions involving more than one person. Consequently, the total number of samples we used was 44372, out of which 890 were falling samples. We used 75% of the samples for training, and the rest for testing. 20% of the training data was used for cross-validation. The maximum number of training iterations is 100000, while the training is halted if the value of the loss function for the validation set does not decrease for five consecutive epochs.

To design the deep neural network, considering that the lengths of the sequences were not much different, we set the depth of the LSTM in time as the average length of the training samples, which was 90, and trimmed the longer sequences and zero-padded the shorter ones. Rectified linear unit (ReLU) and Softmax were chosen as the activation function and the loss function, respectively. The other parameters of the model were set so that the performance of the method is optimized. The performance was evaluated based on the area under the curve (AUC) metric, calculated for the receiver operating characteristic (ROC) curve. To select the number of layers, we started from a single-layer LSTM and increased the number of layers up to three. As shown in Fig. 2, the performance of the two-layer LSTM was good enough and adding the third layer did not improve it much, so we did not go above three layers, and set the number of the layers at two. The dropout ratio and the number of the hidden units were selected through random search. According to Fig. 3, the performance of the model is the best at the dropout ratio of 0.5. Fig. 4 shows that the best performance is obtained with 20 hidden units for both layers, though the combination of 70 units for the first layer and 30 units for the second layer is also close to optimum.

Fig. 2. The ROC curves and their corresponding AUCs for different numbers of the layers

Fig. 3. The ROC curves and their corresponding AUCs for different dropout ratios

Fig. 4. The ROC curves and their corresponding AUCs for different numbers of the hidden units in both layers

In order to compare our proposed model with the existing methods, we had to use a metric other than AUC, since the other papers do not provide AUC for their methods. This metric, which is not as comprehensive as AUC in evaluating the performance, consists of precision and recall values at the fixed threshold of 0.5. Fig. 5 presents the precision-recall curve of our optimized model, showing the precision and recall values at the threshold of 0.5. Table I compares the performance of our model with that of the best depth-map-based fall detection methods, which are all built using hand-crafted features. With 93% precision and 96% recall, our deep model outperforms Rougier’s [7] and Planinc’s [8] in precision by considerable margins; in recall it exceeds Planinc’s, while Rougier’s is higher.

Fig. 5. The precision-recall curve for the optimal parameters

V. CONCLUSION

In this work, we proposed a deep learning model for human fall detection in videos captured by depth cameras, with potential application in smart homes. Our approach takes depth map sequences and determines if a falling incident has happened, so that an alarm can be generated and sent to family/friends or medical service staff. This is the first deep-learning-based method in the field of human fall detection. It has an area under the ROC curve of 0.99, and it outperforms all the existing fall detection algorithms, which are based on hand-crafted features.
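As an illustration of the two evaluation metrics used in Section IV, the following minimal Python sketch computes precision and recall at a fixed threshold and the area under the ROC curve from per-sample fall scores. It is not the authors' evaluation code: the function names `precision_recall` and `roc_auc`, the convention that label 1 means "fall", and the toy scores are all assumptions made for this example. The AUC is computed through its rank interpretation (the probability that a randomly chosen fall sample scores higher than a randomly chosen non-fall sample), which is equivalent to integrating the ROC curve.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Precision and recall of the fall class (label 1, an assumed convention)
    when a sample is flagged as a fall if its score is >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of fall and
    non-fall scores; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores, invented for illustration only (not results from the paper).
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(precision_recall(toy_scores, toy_labels))
print(roc_auc(toy_scores, toy_labels))
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; a sort-based rank computation would be the usual choice for a test set of the size used here.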

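The sequence-length normalization described in Section IV (trimming longer skeleton sequences and zero-padding shorter ones to 90 time steps) can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' preprocessing: the paper does not say which end is trimmed or padded, so keeping the first frames and appending zero frames at the end is an assumption, as are the helper name `fix_length` and the flat per-frame layout of 25 joints × 3 coordinates.

```python
from typing import List

NUM_JOINTS = 25              # joints per skeleton, as stated in Section IV
FRAME_DIM = NUM_JOINTS * 3   # x, y, z per joint (assumed flat layout)
TARGET_LEN = 90              # LSTM depth in time reported in Section IV

Frame = List[float]


def fix_length(sequence: List[Frame],
               target_len: int = TARGET_LEN,
               frame_dim: int = FRAME_DIM) -> List[Frame]:
    """Return a copy of `sequence` with exactly `target_len` frames.

    Assumption: longer sequences keep their first `target_len` frames and
    shorter ones get all-zero frames appended at the end.
    """
    for frame in sequence:
        if len(frame) != frame_dim:
            raise ValueError(f"expected {frame_dim} values per frame, "
                             f"got {len(frame)}")
    trimmed = [list(frame) for frame in sequence[:target_len]]
    padding = [[0.0] * frame_dim for _ in range(target_len - len(trimmed))]
    return trimmed + padding


# Usage: a 60-frame sequence is padded and a 120-frame sequence is trimmed.
short_seq = [[1.0] * FRAME_DIM for _ in range(60)]
long_seq = [[2.0] * FRAME_DIM for _ in range(120)]
print(len(fix_length(short_seq)), len(fix_length(long_seq)))
```

Fixing the length up front lets every sample share one unrolled LSTM depth; a masking-based approach that ignores padded steps is a common alternative, but the paper does not state whether one was used.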
TABLE I. THE PERFORMANCE OF OUR PROPOSED MODEL COMPARED TO THE STATE-OF-THE-ART METHODS

Method           Precision   Recall
Rougier’s [7]    0.6861      0.9894
Planinc’s [8]    0.8177      0.9210
Ours             0.9323      0.9612

REFERENCES

[1] R. Harper, Inside the Smart Home, Springer Science & Business Media, 2006.
[2] L. Chen and I. Khalil, "Activity recognition: Approaches, practices and trends," Activity Recognition in Pervasive Intelligent Environments, pp. 1-31, 2011.



[3] C. Marzahl, P. Penndorf, I. Bruder and M. Staemmler, "Unobtrusive fall detection using 3D images of a gaming console: Concept and first results," Ambient Assisted Living, pp. 135-146, 2012.
[4] G. Mastorakis and D. Makris, "Fall detection system using Kinect’s infrared sensor," Journal of Real-Time Image Processing, vol. 9, no. 4, pp. 635-646, 2014.
[5] "OpenNI 2 Downloads and Documentation | The Structure Sensor," 2017. [Online]. Available: https://structure.io/openni.
[6] Z.-P. Bian, J. Hou, L.-P. Chau and N. Magnenat-Thalmann, "Fall detection based on body part tracking using a depth camera," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 2, pp. 430-439, 2015.
[7] C. Rougier, E. Auvinet, J. Rousseau, M. Mignotte and J. Meunier, "Fall detection from depth map video sequences," Toward Useful Services for Elderly and People with Disabilities, pp. 121-128, 2011.
[8] R. Planinc and M. Kampel, "Introducing the use of depth data for fall detection," Personal and Ubiquitous Computing, vol. 17, no. 6, pp. 1063-1072, 2013.
[9] A. Shahroudy, J. Liu, T.-T. Ng and G. Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[11] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[12] P. J. Werbos, "Generalization of backpropagation with application to a recurrent gas market model," Neural Networks, vol. 1, no. 4, pp. 339-356, 1988.
[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[14] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[15] J. Yosinski, J. Clune, Y. Bengio and H. Lipson, "How transferable are features in deep neural networks?," in Advances in Neural Information Processing Systems, 2014.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
