0% found this document useful (0 votes)
52 views

Building Occupancy Detection and Localisation Using CCTV Camera and Deep Learning

This document discusses a deep learning approach for occupancy detection and localization using CCTV cameras. It begins by explaining the importance of occupancy information for building energy management. It then reviews limitations of existing sensor-based occupancy detection methods. The document proposes using CCTV cameras and deep learning as a more accurate solution. Specifically, it describes a model with two main modules: 1) a convolutional neural network that extracts features from videos and 2) a three-stage detector that identifies the number and locations of occupants with increasing accuracy.

Uploaded by

Esma Özel
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
52 views

Building Occupancy Detection and Localisation Using CCTV Camera and Deep Learning

This document discusses a deep learning approach for occupancy detection and localization using CCTV cameras. It begins by explaining the importance of occupancy information for building energy management. It then reviews limitations of existing sensor-based occupancy detection methods. The document proposes using CCTV cameras and deep learning as a more accurate solution. Specifically, it describes a model with two main modules: 1) a convolutional neural network that extracts features from videos and 2) a three-stage detector that identifies the number and locations of occupants with increasing accuracy.

Uploaded by

Esma Özel
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

This article has been accepted for publication in IEEE Internet of Things Journal.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 1

Building Occupancy Detection and Localisation


using CCTV Camera and Deep Learning
Shushan Hu, Peng Wang, Cathal Hoare, and James O’Donnell

Abstract—Occupancy information plays a key role in analysing peak demand periods of buildings and smoothing energy
and improving building energy performance. The advances of requirement peaks of buildings in smart grid [6].
Internet of Things (IoT) technologies have engendered a shift in A number of developed IoT sensors offer exciting opportu-
measuring building occupancy with IoT sensors, in which cam-
eras in Closed-Circuit Television (CCTV) systems can provide nities for measuring occupancy status in buildings, but some of
richer measurements. However, existing camera-based occupancy these sensors might be limited by inherent characteristics [7].
detection approaches cannot function well when scanning videos The approaches based on Passive Infrared (PIR) sensors can
with a number of occupants and determining occupants’ loca- detect infrared light radiating from occupants which leads to
tions. This paper aims to develop a novel deep learning based low-accuracy performance for multiple and static occupants.
approach for better building occupancy detection based on CCTV
cameras. To doing so, this research proposes a deep learning Environmental and electricity sensors are able to measure
model to detect the number of occupants and determine their indoor environmental variations related to the presence of
locations in videos. This model consists of two main modules occupants, resulting in time delay and poor detection perfor-
namely feature extraction and three-stage occupancy detection. mance in complex environments (e.g. opening rooms). Radio
The first module presents a deep convolutional neural network Frequency Identification (RFID) and WiFi sensors require
to perform residual and multi-branch convolutional calculation
to extract shallow and semantic features, and constructs feature additional devices to receive signal from users and can be
pyramids through a bi-directional feature network. The second significantly affected by noise from obstacles.
module performs a three-stage detection procedure with three An accurate and feasible solution for occupancy detection
sequential and homogeneous detectors which have increasing in buildings lies in leveraging CCTV cameras to capture
Intersection over Union (IoU) thresholds. Empirical experiments optical streams of the indoor environment. Device availability
evaluate the detection performance of the approach with CCTV
videos from a university building. Experimental results show can improve the applicability and cost-efficiency due to the
that the approach achieves superior detection performance when widespread deployment of CCTV cameras in many buildings
compared with baseline models. for safety and security issues. A key challenge in camera based
Index Terms—Building occupancy detection, deep learning, occupancy detection is eliminating interference caused by
IoT sensor. background and object obfuscation. Some powerful techniques
from the computer vision domain have opened up potential
of obtaining occupancy information from CCTV videos [8].
I. I NTRODUCTION Some early efforts applied pattern recognition technologies
(e.g., filtering algorithm, classification, and clustering meth-
I NTERNET of Things (IoT) technologies have advanced
the research of smart buildings in recent years [1]. As
the single largest portion of global energy consumption (i.e.,
ods) to subtract background information from videos, but these
background subtraction based approaches can fail if occupants
around 40%) and greenhouse gas emission (i.e., around 35%) remain static for extended periods.
[2], energy efficiency in smart buildings has attracted great Recent research has intended to apply machine learning to
attention [2]. Against this backdrop, occupancy detection plays approximate the complex relationships between sensor mea-
an essential role in achieving more efficient building energy surements and occupancy information [9]. Machine learning
management [3]. First of all, occupancy information can assist performs classification and regression operations to filter im-
building managers in comprehensively assessing building en- age patches with learnable weights and biases. Shallow learn-
ergy performance and identifying building operation faults [4]. ing algorithms (e.g., Support Vector Machine (SVM)) provide
In addition, this information can help to optimise operations limited capacity to finish occupant localisation. Deep learn-
of equipment (e.g., Heating, Ventilation, and Air Conditioning ing significantly improves the learning performance through
(HVAC) systems) while maintaining a comfortable indoor deeper and more complex neural network architectures and
environment [5]. Moreover, long-term occupancy information has been applied to a wild range of fields. For example, a
can accomplish better demand responses by identifying energy significant body of research aimed to develop deep-learning
based image segmentation solutions that play a key role in
This work is supported in part by Natural Science Foundation of China medical image analysis, robotic perception and augmented
under Grant 62106069. (Corresponding author: Shushan Hu)
S. Hu and P. Wang are with School of Computer Science and Information reality [10]. Deep learning has engendered a shift for the
Engineering, Hubei University, Wuhan, China ([email protected]; communication and networking field such as jointly optimising
[email protected]). transmitter and receiver components for the physical layer
C. Hoare and J. O’Donnell are with School of Mechanical and Materials
Engineering and UCD Energy Institute, University College Dublin, Belfield, [11], and accurately classifying mobile encrypted traffic to
Dublin 4, Ireland ([email protected]; [email protected]) obtain highly-valuable profiling information [12]. Advances

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 2

in deep learning also give rise to the proliferation of in- II. R ELATED W ORKS
dustrial IoT with potential for various applications such as A. Occupancy Measurement with Sensors
smart assembling and smart manufacturing [13]. However,
The advances of IoT technologies have promoted a wide
existing deep learning based occupancy detection approaches
range of sensors available to measure building occupancy
cannot function well for complex videos with overlaps among
status (Table I). To be specific, PIR sensors can detect the
occupants [14].
motion of occupants by measuring infrared light radiating from
This research intends to develop a novel deep learning objects. Yun et al. demonstrated a quantitative performance
based approach to obtain accurate occupancy information for analysis on human movement direction detection based on
building energy management. The approach needs to eliminate PIR sensors [15], but PIR sensors may fail to measure static
interference from overlaps among occupants and complex occupants. Environmental sensors aim to detect occupancy
background in CCTV videos. Several technical challenges through environmental variations caused by the presence of
inhibit the pathway to achieve this goal. Firstly, to inference occupants. Zimmermann et al. presented an approach to detect
multiple and varisized occupants leads to issues associated occupants using a fusion of environmental sensors from an
with relevant feature extraction from videos. Secondly, to indoor air quality measurement system [16]. Two limitations
accurately localise occupants in videos results in a complex of environmental sensors are a time delay and low detection
regression problem adjusting the position and size of predicted accuracy in open cases.
boundaries of occupants. To address the above challenges, this RFID sensors are capable of identifying and tracking tags
research proposes to design a novel deep learning model to attached to objects. Li et al. developed an RFID based system
scan videos for high-quality occupancy information including to track stationary and mobile occupants for demand-driven
the number of occupants and their locations. In doing so, this operations [17]. RFID sensors may fall short in requiring tag
work refines the detection problem into two main procedures: attachments on objects and reader deployments. WiFi becomes
1) extracting fine-grained features related to occupants from a primary signal for occupancy detection with channel state
videos, 2) detecting the number of occupants and calculating information [18]. Sheng et al. presented a deep learning frame-
their locations. work for action recognition by integrating spatial features from
CNN into a temporal bidirectional Long Short Term Memory
The main contributions are summarised as follows: (LSTM) model [19]. Nonetheless, WiFi approaches need addi-
tional signal receivers and have low performance in complex
• We propose a deep learning model for building occupancy indoor environment. Smartphones provide an alternative for
detection and localisation based on CCTV cameras. This occupancy detection with embedded sensors. Chen et al.
model is capable of delivering better detection perfor- leveraged smartphones to recognise human activities through
mance when facing videos with many occupants. extreme learning machine (ELM), coordinate transformation
• We design a deep neural architecture to construct feature and principal component analysis (CT-PCA), and online SVM
pyramids from CCTV videos. The architecture uses a [20]. The placing aside of smartphones significantly reduces
deep convolutional neural network (DCNN) with resid- the detection performance.
ual and multi-branch convolutional calculation to extract Cameras provide a straightforward optical instrument to
shallow and semantic features, and constructs feature capture visible images for occupancy detection. Zulcaffle et al.
pyramids through a top-down and a bottom-up pathway designed a four-part method for frontal view gait recognition
integration operations. based on Time-of-Flight cameras [21]. However, limitations
• We perform a three-stage detection procedure to calculate include computation complexity and probability with occlu-
the number of occupants and regress their locations. This sion. Electricity meters are used to detect interactions between
procedure employs three sequential and homogeneous de- occupants and appliances [22]. Feng et al. tried to detect
tectors to conduct iterative detection on feature pyramids. building occupancy from Advanced Metering Infrastructure
Increasing IoU thresholds are applied to these detectors (AMI) data with a CNN and a LSTM network [8]. Jin et al.
so as to improve detection performance. presented another smart meter based solution which applies a
• We conduct empirical experiments to evaluate the de- multi-view based iteration and surrogate loss to learn refined
tection performance. Experimental results show that the results[23]. The interactions between occupancy and power are
approach can achieve superior performance on detecting not straightforward, and sometimes difficult to model. Among
the number of occupants and determining the locations these sensors, cameras can provide richer measurements on
of these occupants. occupancy status than other sensors. Growing deployments of
CCTV cameras also promoted this relevant topic for building
occupancy detection. However, a major barrier obstructing
The remainder is organised as follows: Section II presents
camera based detection approaches is eliminating interference
the case of measuring occupancy status and filtering occupancy
caused by background and object obfuscation.
information. A detailed description of the approach is illus-
trated in Section III. Section IV demonstrates the effectiveness
and superior performance with empirical experiments. Section B. Occupancy Detection with Algorithms
V details the conclusions from the development and testing of The development of computer vision technologies has or-
this approach. chestrated a paradigm shift in the way that obtains fine-

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 3

TABLE I TABLE II
C OMPARATIVE ANALYSIS OF SENSORS FOR OCCUPANCY DETECTION . C OMPARISON OF CAMERA BASED OCCUPANCY DETECTION ALGORITHMS .

Sensors Strength Weakness Algorithm Characteristic Weakness


Low accuracy for multiple Background Using Gaussian model Significant affection from
Less privacy concerns
PIR[24] occupants and failures for subtraction[27] without training data. lumination variations
and localisation of occupants
static occupants. Regressing clear margin Low performance in
SVM [30]
Environmental Non-intrusive detection Time delay and low in high dimensional spaces. noisy data set.
sensors [25] at no additional cost. accuracy in opening rooms. Learning fine-grained features falling short in
CNN [31]
Hard to model interactions from image pixels. occupant localisation.
Electricity Non-intrusive detection and
between occupancy and Faster Enabling region proposals Unbalanced detection with
meters [8] no additional installation cost.
power. R-CNN[34] and sharing features. single IoU threshold
Occupancy localisation and Attachments of tags and Performing one stage Falling short in
RFID [17] YOLO [35]
multiple occupant detection. deployments of readers. regression for detection. complex scenarios.
Need for signal receivers
Passive activity recognition
WiFi [26] and low accuracy in
and wide availability of WiFi.
complex environment.
Smartphone Passive activity recognition Low performance when
[20] with no additional cost. placing smartphones aside.
High detection accuracy and Computation complexity
Camera [21]
multiple occupant detection. and occlusion problems.

grained occupancy information from videos. An early study is


background subtraction using a gaussian mixture model. Tam- Fig. 1. The system architecture of occupancy detection with CCTV video.
gade et al. developed an optical flow motion vector estima-
tion through iterative Lucas-Kanade pyramidal implementation
[27]. Histogram of Oriented Gradient (HOG) provides another of background subtraction. SVM may demonstrate a lower
solution by counting occurrences of gradient orientation in performance when more noise exists in data sets or the feature
localised portions of an image. Cao et al. leveraged an artificial number exceeds the sample number. CNN based approaches
neural network to classify occupancy information with HOG fall short in occupant localisation. Faster R-CNN and YOLO
features [28]. use a single IoU threshold which may result in an unbalanced
The technology boom of machine learning enriched occu- issue for the detection performance. Low thresholds usually
pant detection within videos by learning abundant hidden rules produce noisy results while higher thresholds tend to degrade
between inputs to outputs based on a set of training data. With the performance.
a clear margin in high dimensional spaces, Shih et al. proposed
an SVM-based system with an image-based depth sensor and III. M ETHODOLOGY
a pan-tilt-zoom camera [29]. Yang et al. used SVM to analyse
This research intends to develop a novel deep learning based
whether a person is entering or exiting rooms based on the
approach to obtain better building occupancy detection. In
difference between two pixel coordinates [30].
doing so, the approach applies CCTV cameras as sensors to
CNN emerged as a deep learning algorithm to analyse visual
measure occupants’ status and designs a deep learning model
imagery through assigning learnable weights and biases to
to calculate the number of occupants and their locations as
various objects in an image. Fine-grained features learned
output (Fig. 1). Measurements from cameras are stored as
from CNN provide confident and relevant context for occupant
video files and sequential frames are extracted as input images
detection. Thys et al. presented a CNN based approach to
for the deep learning model.
generate adversarial patches of occupants with lots of intra-
class variety [31]. Zou et al. developed a CNN approach to
detect people head windows with high recall and precision A. Problem Formulation and Model
[32]. The core problem of the detection approach is designing
Faster R-CNN introduces a novel detection idea by en- a deep learning model to eliminate abundant and complex
abling nearly cost-free region proposals and sharing convo- interference in images for enhanced detection performance. In
lutional features with the detection network [33]. Ahmed et accordance with the principle of deep learning, we formulate
al. demonstrated the applications and effectiveness of Faster the detection problem into two main procedures called feature
R-CNN through facilitating video analysis for overhead view extraction and occupancy detection. Firstly, a convolutional
detection and segmentation [34]. YOLO presents an alternative calculation is performed on all pixels of input images to obtain
solution that aims for high speed and adequate accuracy by shallow and semantic features:
performing one-stage regression analysis for object detection
m,n
with spatially separated bounding boxes and associated class X
probabilities. Kajabad et al. presented a YOLO based approach Ω = f( (wij ∗ xij ) + b) (1)
i=1,j=1
to detect people in a closed space and identify density areas
from CCTV cameras in a museum [35]. where Ω,xij , wij , b stand for the output features, input value at
Several weaknesses arise from these algorithms. To be spe- position (i, j) in images, weight value for position (i, j) , and
cific, lumination variations significantly affect the performance bias respectively. m and n indicate the size of input images.

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 4

used to conduct backpropagation for optimisation of model


parameters.

Algorithm 1 The pseudo code of the occupancy detection


model
# construct feature pyramid from images
feature extraction function:
extract feature as dcnn(img)
obtain feature pyramids as
bifpn downchannel(feature), bifpn convup(up feature)
return feature pyramid
end function

# perform three-stage detection


three-stage detection function:
obtain region proposals as
rois = RPN(feature pyramid)
for each stage i do
extract ROI feature as roi pooling(rois)
map ROI feature to sample space as
fc forward(pool result)
perform classification as fc classifier(fc)
perform regression as fc bbox reg(fc)
obtain classification loss as
Fig. 2. The structure of the deep learning model for occupancy detection and loss cls = cross entropy(cls score, label)
localisation. obtain regression loss as
loss reg = L2 loss(bbox pred, gt bbox)
obtain final loss as
The size of features Ω ∈ RH×W ×C consists of a height value loss = loss cls + loss reg
H, a width value W , and a channel number C. optimise model by backprogration
Secondly, the occupancy detection procedure is split into end for
a classification sub-problem and a regression sub-problem. A return objects and bounding boxes
classification function O = fcl (ΩH×W ×C ) is used to identify end function
if the potential object is a occupant or not (O ∈ {1, 0}). A
regression function B = fre (ΩH×W ×C ) is used to determine
the closest bounding boxes for target occupants. B represents a
predicted bounding box with four parameters {bx , by , bw , bh }, B. Feature Extraction
where bx and bx stand for the position of the left corner of This approach designs a DCNN as the backbone network
the box, bw and bh indicate the width and height of the box. to learn multi-scale and high-dimensional features from input
In practice, there are multiple occupants contained in CCTV images. The architecture of a building block in the DCNN
videos. The occupancy detection is extended as a multi- follows a strategy of splitting, transforming and aggregating
dimensional solution. Let N be the number of occupants and [36] (Fig. 3 (a)). This strategy divides one single training path
the output Φ indicates the N-dimensional occupancy vector into a group of convolution paths. Features from these paths
Φ = (O1 , B1 ), (O2 , B2 ), . . . , (ON , BN ). are depth aggregated to final output.
Following the problem formulation, this research designs a The splitting phase firstly leverages a base layer to separate
novel deep learning model to solve the detection problem (Fig. the input into two parts (Part 1 and Part 2) with channel data
2). Algorithm 1 illustrates a pseudo code of the deep learning (xl = [x0l , x00l ]). x00l will go through a dense convolutional block
model. Based on input images, this model leverages two main and a transition layer. x0l is then combined with transmitted
modules, namely feature extraction and three-stage occupancy features to the next stage. The dense block is defined as a
detection, to finish the occupancy detection mission. The first homogeneous and multi-branch building block with a set of
module learns multi-scale features from input images with transformations with the same topology. A dimension called
a deep convolutional neural network and constructs feature “cardinality” is defined to represent the size of the set of
pyramids through a bi-directional feature pyramid network transformations. Cardinality is an essential factor in addition
consisting of a top-down and a bottom-up feature integration to the dimensions of depth and width. A more effective way to
pathway. The second module performs a three-stage detec- improve learning capacity lies in increasing cardinality, rather
tion with sequential and homogeneous detectors which have than constructing a deeper or wider architecture.
increasing IoU thresholds. Two loss functions, including a In the transforming phase, each convolution path consists
classification loss and a bounding box regression loss, are of three convolution layers: 1×1 4-d layer, 3×3 4-d layer and

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 5

1×1 256-d layer. A novel transformation function (2) replaces


the elementary transformation in a simple neuron.
C
X
F(x) = Ti (x) (2)
i=1

where Ti (x) and C stand for an arbitrary function and the size
of the set of transformations (i.e., cardinality) respectively.
Ti (x) projects the input vector x into an embedding code
and then transforms the code. C is a newly introduced hyper-
parameter for a new pathway of adjusting the model capacity.
The aggregating phase deeply integrates intermediate trans-
formed features into final features. An aggregated transforma-
tion function (y = Ti (x) + x) is used to construct the residual
function for the building block. y is the output of a block. This
structure reformulates the network layers with reference to
the layer inputs instead of learning unreferenced functions. A
residual learning reformulation solves the degradation problem Fig. 3. (a) a building block of DCNN with cardinality = 32; (b) a building
as the network depth increases. The reformulation approxi- block of DCNN with grouped convolutions.
mates the network as a residual function F(x) := H(x) − x
and defines the building block as residual learning in the
network layers:

xT = F(x00l , Wl ) + Wl0 x00l (3)


where xT , x00l , F, Wl and Wl0 stand for the output and input
vectors of the network layers, the residual mapping function,
and weight vectors respectively. F(x00l , Wl ) needs to be learned
during the training process. If the dimensions of F and x00l are
not equal, the model conducts a linear projection Wl0 on x00l
to match the dimensions.
In addition, a partial transition layer is designed to generate
the output feature xl+1 by maximising the difference of
gradient combination:
Fig. 4. The feature network of constructing feature pyramids with bidirec-
xl+1 = Wl00 ∗ [x0l , xT ] (4) tional cross-scale connections.
The partial transition layer conducts a hierarchical feature
fusion strategy by truncating the gradient flow to prevent
distinct layers from learning duplicate gradient information. of DCNN, the features from shallow layers (e.g., C2 , C3 ,) con-
This truncation combines the output from the dense block tain high-level resolutions but low-level semantic information.
with the features coming from part 1. The strategy is able The feature from deep layers (e.g., C4 , C5 ,) have high-level
to significantly reduce the computation complexity since the semantic information but low-level resolutions.
gradient flow is truncated and the gradient information will This feature network integrates both the bidirectional cross-
not be reused. scale connections and the fast normalized fusion. In detail, the
In order to simplify the structure of the dense block and network consists of a bottom-up pathway, a top-down pathway
improve training efficiency, an equivalent structure is reformu- and lateral connections to integrate features with high-level
lated (Fig. 3 (b)) for the dense block shown in Fig. 3 (a). This resolutions and semantic information without much calculation
reformulation replaces the low-dimensional embedding layers load [37].
(i.e., the first 1 × 1 4-d layers in Fig. 3 (a)) with a single and With hierarchical features (C2 , C3 , C4 , C5 ) from a typical
wider layer (i.e., the first 1 × 1 128-d layers in Fig. 3 (b)). feed-forward convolutional calculation, the top-down pathway
The grouped convolution layer divides its input channels into constructs upsampled features by generating higher resolution
32 groups of convolutions whose input and output channels features through an upsampling calculation. These upsampled
are 4-dimensional. The third layer leverages a 1 × 1 filter to features are enriched through lateral connections sourced from
match the input dimensions to output vectors. features at the same level. The fastPnormalised fusion is used to
wi
In order to enable multi-scale occupancy detection with integrate different features: O = i +P · Ii , where wi is
j wj
proportionally sized features, the approach applies a feature weight vector for input feature Ii ,  = 0.0001 is a small value
network to construct feature pyramids using intermediate fea- to avoid numerical instability. For example, the upsampled
tures from the DCNN (Fig. 4). With the pyramidal hierarchy feature at level 4 is calculated as:

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 6

w1 · C4 + w2 · Resise(C5 )
P4td = Conv( ) (5)
w1 + w2 + 
where the Resise function is an upsampling with the nearest
neighbour strategy on the spatial information by a factor of 2
×. A lateral connection is used to enhance the unsampled
results with previous features at the same level (C4 ). The
connection indicates an element wise addition operation and
a 1 × 1 conv for a reduction of channel dimensions.
To obtain fine-grained features, the bottom-up pathway
performs another integration upon upsampled features of the
top-down pathway and features of feed-forward computation.
This pathway applies subsampling functions to construct an
opposite integration direction compared to the top-down path-
way. For example, the final features at level 4 is calculated
as:

w10 · C4 + w20 · P4td + w30 · Resise(P3 )


P4 = Conv( ) (6)
w10 + w20 + w30 + 
where the Resise function is a subsampling with the nearest
neighbour strategy on the spatial information by a factor of 2×. Fig. 5. The procedure of three-stage detection with three sequential and
Beside subsampled features from higher resolution features homogeneous detectors.
(P3 ), two lateral connections integrate two previous features
(C4 and P td ) for the final integration.

1
Lcls (pi , Pi∗ )
P
C. Three-stage Occupancy Detection L({pi }, {ti }) = Ncls i
(7)
+λ Nreg i p∗i Lreg (ti , t∗i )
1
P
With feature pyramids, the approach performs a three-stage
detection to obtain high quality occupancy information (Fig.
where i, pi and p∗i stand for the index of an anchor, the prob-
5). Three homogeneous detectors (i.e., detector1, detector2,
ability of a predicted anchor being an object and the ground-
detector3) perform a sub-training operation in a sequential
truth label valuing 1 with a positive anchor respectively.
order and have the same topology, which can generate high
With initial region proposals, the three detectors perform an
quality hypotheses at inference time and achieve better match
iterative detection with increasing IoU thresholds. The detector
results with increasing quality of detectors. Increasing IoU
integrates feature pyramids and region proposals from RPN (or
thresholds are used for these detectors in order to smooth
detectors of previous stages) as input. An RoI pooling layer
the unbalanced detection arising from a single IoU threshold,
is used to resample the input features as fixed size. Following
as low thresholds usually produce noisy results while higher
a retraining process of two fully connected layers, a classifier
thresholds tend to degrade the performance due to overfitting
checks if the anchors contain occupants or not. A bounding
and mismatch problems [38]. The three-stage detection is
box regressor tries to precisely localise bounding box coordi-
represented as an iterative detection process to filter occupancy
nates by a mount of translating and scaling operations.
information from images. A Region Proposal Network (RPN)
The classifier is defined as a function h(x) aiming to cate-
generates initial region proposals for the detector of stage 1
gorise candidate objects into one of 2 classes. Class 0 indicates
(detector1), the output of which provides additional and better
image background and class 1 is the occupant category. The
region proposals to train the detector of stage 2 (detector2).
function h(x) applies the posterior distribution upon an patch
The subsequent detector at stage 3 (detector3) applies the out-
x of images and predicts a class label y for objects in this
put of detector2 as region proposals to improve the detection
patch (hk (x) = p(y = k|x)). A risk function assesses the
performance.
classification performance of the function and improves the
RPN conducts a resampling operation on feature pyramids
performance through a training process.
and predicts region proposals for objects at each pixel of the
input images. The resampling applies a convolution layer (3×3 X
conv) to transform the input features as 256-d. Two parallel Rcls [h] = Lcls (h(xi ), yi ) (8)
i=1
convolution layers (1 × 1 conv) are used to generate enhanced
features for subsequent classification and regression missions. where Lcls stands for a cross-entropy loss function.
A classifier and a bounding box regressor are responsible for The bounding box regressor represents another function
predicting k region proposals, which are parameterised relative ft (x, p) to precisely confirm bounding box coordinates for
to k reference boxes (i.e., anchors). Finally, a loss function (7) occupants. The function regresses a predicted box p in an
minimises the difference between the reference and ground image patch x to a ground-truth box g in the stage t.
truth boxes. The bounding box p is represented with four parameters

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 7

(px , py , pw , ph ), where px and py are coordinates of the top TABLE III


left corner, pw and ph stand for the width and height of T HE DETECTION PERFORMANCE UNDER DIFFERENT MODEL PARAMETERS .
the box. The ground-truth box g is also represented with Anchor area Anchor ratio Stage IoU AP50 MAE MSE
four parameters g = (gx , gy , gw , gh ). The regressor tries to 32,64,128,256,512 0.5,1.0,2.0 0.5,0.6,0.7 83.1% 6.13 7.71
minimise the bounding box risk by an indicator: 16,32,64,128,128 0.5,1.0,2.0 0.4,0.5,0.6 83.8% 3.45 3.95
16,32,64,128,128 0.75,1.0,1.5,2.0,3.0 0.5,0.6,0.7 84.4% 1.04 1.61
X 16,32,64,128,128 0.75,1.0,1.5,2.0,3.0 0.4,0.5,0.6 85.1% 0.94 1.28
Rloc [f ] = Lloc (f (xi , pi ), gi ) (9)
i

where Lloc stand for the loss function calculating a distance images of the dataset to train the occupancy detection algo-
vector ∆ with four variables (δx , δy , δw , δh ) from: rithms, while 2,000 images are used to test the performance
of these algorithms.
δx = (gx − px )/pw , δy = (gy − py )/ph This research implements the proposed deep learning model
(10)
δw = log(gw /pw ), δh = log(gh /ph ) as an executable program for the performance test upon the
This approach applies a cascade regression strategy by dataset. The implementation uses the Python programming
adopting a resampling procedure to adjust the distribution language for coding and applies Pytorch as the basic platform.
of hypotheses for different stages. The strategy defines an The program is run on a server with the following specifica-
overall regression function f (x, p) as a cascade of specialised tions: AMD 3960X CPU at 4.0GHz 24 cores, 128 GB DDR4
regressors ft (x, p) for the three detection stages: RAM, 2 × GeForce 2080Ti GPU, 512GB SSD, Ubuntu ×64
OS.
f (x, p) = f1 (x, p) ◦ f2 (x, p) ◦ f3 (x, p) (11)
B. Deep Learning Training
where f1 , f2 and f3 stand for the regressor for stage 1, 2, and
3 respectively. The operator ◦ is cascaded linkage meaning This study conducts a training process to improve the
each regressor ft is opitmised for the box pt generated by the detection performance of the model. This training process
previous regressor ft−1 . starts from a learning rate of 0.0025 × 0.001 and follows
A loss function is defined to optimise the detection perfor- the warmup strategy which keeps linear growth until the 10th
mance of each stage t with an IoU threshold ut . epoch. We try to optimise the model parameters using the
Stochastic Gradient Descent (SGD) optimiser, which aims to
L(xt , g) = Lcls (ht (xt ), y t ) minimise the cross entropy and smooth L1 objective functions
(12)
+λ[y t ≥ 1]Lloc (ft (xt , pt ), g) following the opposite direction of gradient descent. In order
to reach the convergence point, we train the developed model
where pt , λ and y t stand for the regressor of stage t − 1 (i.e.,
by increasing the epoch value from 1 to 30 (Fig. 6). Four
ft−1 (xt−1 , pt−1 )), the trade-off coefficient, and the label of
metrics called Accuracy, Mean Absolute Error (MAE), Mean
the patch xt under the threshold ut respectively.
Square Error (MSE) and mean Average Precision (mAP) are
used to evaluate the detection performance. It can be found
IV. P ERFORMANCE E VALUATION that the model tends to convergence with the epoch value of
In order to demonstrate the effectiveness and superior per- 12. In particular, the accuracy value reaches the peak point
formance of the approach, this work documents a case study at the 12th and the 21st epoch (Fig. 6 (a)). The MAE results
of occupancy detection when applying some other popular show three troughs when the epoch values are 12, 17, and 21
solutions as the baseline. (Fig. 6 (b)). The MSE results contain four minimal values at
the 10th, 12th, 14th, and 21st epoch. The mAP curve reaches
A. Experimental Setup the peak point with the epoch value of 12 (Fig. 6 (c)).
Regarding the three-stage detection architecture, we im-
The case study begins by preparing a dataset with labelled
prove the detection performance by analysing three main pa-
images to enable verification of the approach. With videos
rameters called anchor area, anchor ratio and stage IoU (Table
(resolution: 1920 × 1080) collected from CCTV cameras in
III) with MAE, MSE, and mAP metrics. The model obtains
a university building over a period of 5 weeks in 2019, this
relatively good performance with initial parameters (1st row).
study conducts a pre-processing procedure with several steps
We quickly confirm the anchor area to be [16,32,64,128,128]
to obtain labelled images. The procedure begins by trimming
by analysing the area of ground truth boxes. Another stage IoU
video segments recorded when the room is not occupied
([0.4,0.5,0.6]) is tested to reach the first milestone (2nd row).
according to the teaching schedule of the target classrooms.
We identify that the ratio of [0.75,1.0,1.5,2.0,3.0] can achieve
The second step takes images from the remaining videos
better detection performance (3rd row). The integration of the
with an interval of 8 seconds during lectures and the interval
anchor ratio and the stage IoU assists the model to achieve
becomes 5 seconds during breaks between lectures. The reason
the best performance (4th row).
is occupants normally have more movements during breaks
leading to frequent variations in images. Finally, the authors
manually draw bounding boxes for occupants to generate C. Baseline Models
labelled images. In summary, this study prepares nearly 12,000 For the purpose of evaluating the performance of the ap-
labelled images for the dataset. The demonstration uses 10,000 proach, the case study selects five state-of-the-art detection

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 8

Fig. 7. Visual results of occupancy detection and localisation in images.

Fig. 8. Visual results of failed occupancy detection.

as heuristic-guided feature selection and overlap based anchor


sampling.

D. Experimental Results
1) Detection Results Visualisation: This study represents
an example (Fig. 7) that visualises the occupancy detection
Fig. 6. Convergence curves for the deep learning model training (a) accuracy
results, (b) MAE and MSE results, and (c) mAP results. results of the approach. There are 32 occupants in this image
where nearly one quarter of occupants are standing, while
other occupants are sitting in fixed seats. Some overlaps exist
models to establish a baseline for comparison. Faster R-CNN among occupants due to the layout of seats and shielding
uses RPN to achieve good detection performance by predicting from standing occupants. The approach successfully detects
object bounding boxes and objective scores at each location. all 32 occupants contained in the image. Regarding occupancy
YOLOv4 presents a one-stage object detection solution with localisation, it is obvious that the predicted bounding boxes
high speed and adequate accuracy, and performs regression for occupants (i.e., red box) have high-degree overlaps with
analysis for object detection with spatially separated bounding bounding boxes of ground truth (i.e., green box).
boxes and associated class probabilities. EfficientDet follows This section shows some failure cases generated by the
the one-stage paradigm for object detection by integrating proposed model (Fig. 8). The first two cases are caused
EfficientNet with a set of fixed scaling coefficients and a by interference from irrelevant objects. Results in Fig 8 (a)
bi-directional feature pyramid network. RetinaNet is another present a failure when detecting the occupant behind the
popular one-stage detector, which applies a feature pyramid microphone and a schoolbag is recognised as an occupant in
network for detecting objects at multiple scales and defines Fig 8 (b). The second two cases are due to occlusions from
the focal loss function to facilitate the extreme foreground- other occupants in crowded scenarios. One occupant is missed
background class imbalance. An anchor-free extension of Reti- in Fig 8 (c) due to occlusion from another occupant in front.
naNet (AF RetinaNet) defines a simple and effective building Two occupants in Fig 8 (d) are detected as only one occupant
block for single-shot object detectors by addressing issues such due to a vertically overlapping layout.

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 9

Fig. 10. Detection performance of all detection models in F1 score.

Fig. 9. The precision performance evaluating the number of occupants being


correctly detected.

2) Occupancy Detection Performance: This study evaluates


the performance when detecting the number of occupants. Two
metrics called Precision and Recall are used to assess the
detection accuracy (Fig. 9).

TP TP
P recision = T P +F P , Recall = T P +F N (13)

where T P , F P and F N stand for the true positive number,


the false negative number, and the false positive number Fig. 11. Detection performance in terms of MAE and MSE, evaluating
respectively. differences between predicted and observed values.
All detection models obtain relatively good performance
upon detecting the number of occupants. To be specific,
our model can detect most occupants with a Precision value due to the one-stage architecture that is a bit weak for detecting
of 89.7% and a Recall value of 89.2%, while RetinaNet multiple occupants from complex images.
has the worst performance with values (83.1%, 82.9%). The This assessment applies MAE and MSE metrics to evaluate
anchor free structure leads to a slight increase in the detection the detection performance (Fig. 11). Our model has the lowest
performance when compared with RetinaNet. Faster R-CNN value (MAE:0.94, MSE:1.35) while YOLOv4 has the highest
has the second best performance with values (87.3%, 87.2%). MAE value (2.89) and RetinaNet has the potential for the
YOLOv4 presents good accuracy results with the improvement greatest improvement in terms of MSE (4.26). The reason
based on bag of freebies and bag of specials. In general, two- is YOLOv4, RetinaNet and EfficientDet perform one-stage
stage detection models have better performance than one-stage detection which predicts a fixed number of possible objects
models since RPN generates initial and valuable proposals to on a grid. AF RetinaNet presents good performance on both
improve the detection performance. metrics (i.e., MAE:1.31, MSE:2.48) since it applies a feature
The case study uses another metric called F1 score to selective anchor-free module to enhance limitations brought
represent the harmonic mean of Precision and Recall values up by the conventional anchor-based detection.
when evaluating the prediction performance of models. 3) Occupancy Localisation Performance: This research
leverages two popular metrics called Average Precision (AP)
2P recision × Recall and mean Average Precision (mAP) to evaluate the localisation
F1 = (14)
P recision + Recall performance (Table IV). AP is a metric for calculating the
The F1 scores of all six models are shown in Fig. 10. overlap ratio between predicted and true boxes. mAP evaluates
These models have values between 83% to 90%, in which the overall performance by computing a mean value for all of
our model has the highest value (89.5%) and RetinaNet has the AP values at different IoUs.
the lowest value (83%). Faster R-CNN has the second best mAP results of all models are illustrated in Fig. 12. These
performance (87.2%). The difference between Faster R-CNN models obtain values of more than 40% in mAP indicat-
and our model is due to a single IoU threshold can not balance ing good performance when predicting bounding boxes for
noisy results and good performance. YOLOv4 and EfficientDet occupants. Our approach achieves a significant improvement
have relatively good results (86.0% and 85.2% respectively) in mAP (about 2%) when compared with other models.

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 10

Fig. 13. Detection performance for images with different scenarios.

TABLE V
C OMPARISON ON THE COMPUTATION PERFORMANCE .
Fig. 12. Detection performance of all detection models in mAP.
Trainable Memory Inference
Detection model parameters requirement time
TABLE IV (Million) (MegaByte) (Second)
P ERFORMANCE COMPARISON OF DIFFERENT DETECTION MODELS . Faster R-CNN 4.1 1,540 0.058
YOLOv4 2.8 1,390 0.039
Detection model AP50 AP60 AP70 AP80 RetinaNet 3.6 1,450 0.051
Faster R-CNN 82.4% 75.1% 55.4% 18.7% AF RetinaNet 3.6 1,450 0.051
YOLOv4 82.9% 77.1% 55.3% 18.0% EfficientDet 2.3 1,330 0.045
RetinaNet 81.3% 73.5% 52.4% 16.2% Our model 4.8 1610 0.063
AF RetinaNet 81.4% 74.6% 54.2% 18.6%
EfficientDet 81.5% 73.4% 54.1% 20.5%
Our model 85.1% 77.8% 58.3% 20.8%
4) Computation Performance: This section documents the
computation performance evaluation for all models (Table V).
This performance evaluation selects three indicators, called
RetinaNet has the worst performance with a mAP value of trainable parameters, memory requirements and inference
40.9%. YOLOV4 achieves better performance than Faster R- time, to assess the computation and storage cost of these
CNN. This is because mAP only relates to bounding boxes models. The inference time is the most important indicator
of correctly predicted occupants and YOLOV4 has fewer cor- representing the time of scanning one image. It is obvious that
rectly predicted occupants than Faster R-CNN. The undetected all models have super performance with a time less than 0.1
occupants are often surrounded by complex background and second. Although our model needs a bit more inference time
it is difficult to calculate their bounding boxes. due to a complex architecture for better occupancy detection,
Regarding the AP metric, in accordance with the mAP it is efficient to meet time requirements for future applications
metric, the proposed approach has the best performance for all deployed in buildings. Regarding the other two metrics, though
IoU thresholds from AP50 to AP80 . RetinaNet has the lowest more trainable parameters in our model lead to more memory
values at different IoU thresholds. It is worth noting that these requirement, it is worth noting that memory requirements of
models exhibit decreasing performance when increasing IoU all models (from 1,390 to 1,610 MegaByte) only take up a
thresholds. The reason is overlaps among occupants in images small proportion of the total memory of popular GPUs, which
may cause detectors to treat crowded occupants as a single normally is larger than 10,000 MegaByte.
detection, or to mistakenly shift the target bounding box to
another person with higher IoU thresholds. E. Discussion
Furthermore, this experiment finds that considerable differ- This work enhances the performance evaluation by dis-
ences exist in occupancy detection for different scenarios in cussing the performance of our approach and some existing
images (Fig. 13). This research defines a complex scenario occupancy detection solutions. Due to the inaccessible source
where an image contains more than 30 occupants and more code or dataset, this discussion is finished with results from
than 10 occupants are standing, with more overlaps between references (Table VI). A head detection solution [32] applies
occupants. The regular scenario indicates images outside the HOG features and a 7 layer CNN to obtain good performance,
complex range. Although our approach achieves the best per- but a low occupancy number (0-3) significantly reduces the
formance on both scenarios, a significant difference of nearly detection difficulty when compared with our model. Two so-
5% shows in mAP when evaluating detection performance lutions [35], [31] applies the YOLO model to finish occupancy
for the two scenarios. The biggest difference (around 10% in detection missions. It can be concluded that our model is better
mAP) belongs to AF RetinaNet. This arises due to additional than them, since one baseline model (YOLOv4) has better
overlaps caused by more occupants and increased standing performance than YOLOv3 and YOLOv2 [39]. Although an
in the complex scenario, as such this poses a more difficult occupancy presence detection solution [28] achieves a high
scenario for occupancy localisation. accuracy value, our solution can have 100% accuracy when

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 11

TABLE VI energy performance by using deep reinforcement learning to


A NALYSIS WITH EXISTING OCCUPANCY DETECTION SOLUTIONS . enable better control of devices [40].
Occupant Performance
No Solution Core model
range result R EFERENCES
Head HOG & Precision:
1 0-3 [1] L. Yu, S. Qin, M. Zhang, C. Shen, T. Jiang, and X. Guan, “A review
detection [32] CNN 92.7%
Occupancy Visuable of deep reinforcement learning for smart building energy management,”
2 YOLOv3 0-10 IEEE Internet Things J., vol. 8, no. 15, pp. 12 046–12 063, 2021.
localisation [35] boxes
Occupancy [2] S. J. Smith, “Cleaning cars, grid and air,” Nature Energy, vol. 6, no. 1,
3 YOLOv2 0-2 PR curve pp. 19–20, 2021.
localisation [31]
Occupancy HOG Accuracy: [3] J. Jiang, C. Wang, T. Roth, C. Nguyen, P. Kamongi, H. Lee, and
4 0-1 Y. Liu, “Residential house occupancy detection: Trust-based scheme
presence [28] & ANN 96%
Overhead Faster using economic and privacy-aware sensors,” IEEE Internet Things J.,
5 0-10 TPR: 94% vol. 9, no. 3, pp. 1938–1950, 2021.
detection [34] R-CNN
DCNN & Precision: 89.7% [4] H. Saberi, C. Zhang, and Z. Y. Dong, “Data-driven distributionally robust
6 Our model 0-80 hierarchical coordination for home energy management,” IEEE Trans.
Detector TPR:89.2%
Smart Grid, vol. 12, no. 5, pp. 4090–4101, 2021.
[5] L. Yu, Y. Sun, Z. Xu, C. Shen, D. Yue, T. Jiang, and X. Guan, “Multi-
agent deep reinforcement learning for HVAC control in commercial
finishing presence detection. An overhead detection approach buildings,” IEEE Trans. Smart Grid, vol. 12, no. 1, pp. 407–419, 2020.
[6] K. Baek, E. Lee, and J. Kim, “Resident behavior detection model for
[34] leverages Faster R-CNN to detect occupants from an environment responsive demand response,” IEEE Trans. Smart Grid,
overhead angle of view and has a good TPR (True Positive vol. 12, no. 5, pp. 3980–3989, 2021.
Rate) value (94%). Though our model has a lower TPR value [7] Y. Zhu, H. Luo, F. Zhao, and R. Chen, “Indoor/outdoor switching
detection using multisensor densenet and lstm,” IEEE Internet Things
(89.2%), it can be inferred that our approach has superior J., vol. 8, no. 3, pp. 1544–1556, 2021.
performance than Faster R-CNN based analysis in Section [8] C. Feng, A. Mehmani, and J. Zhang, “Deep learning-based real-time
IV-D2. building occupancy detection using AMI data,” IEEE Trans. Smart Grid,
vol. 11, no. 5, pp. 4490–4501, 2020.
[9] K. Sun, Q. Zhao, and J. Zou, “A review of building occupancy
V. C ONCLUSIONS measurement systems,” Energy Build., vol. 216, p. 109965, 2020.
[10] S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and
This research proposes a novel approach for building oc- D. Terzopoulos, “Image segmentation using deep learning: A survey,”
IEEE Trans. Pattern Anal. Mach. Intell. (Early Access), 2021.
cupancy detection and localisation using CCTV cameras and [11] T. O’shea and J. Hoydis, “An introduction to deep learning for the
deep learning. In doing so, the approach designs a deep learn- physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp.
ing model consisting of two main modules to achieve better 563–575, 2017.
[12] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “Mobile en-
occupancy detection. The first module leverages a DCNN to crypted traffic classification using deep learning: Experimental evalua-
learn shallow and semantic features from input images and tion, lessons learned, and challenges,” IEEE Trans. Netw. Serv., vol. 16,
constructs feature pyramids through a two-pathway integration no. 2, pp. 445–458, 2019.
[13] R. A. Khalil, N. Saeed, M. Masood, Y. M. Fard, M.-S. Alouini, and
and lateral connections. The second module represents a T. Y. Al-Naffouri, “Deep learning in the industrial internet of things:
three-stage detection procedure that uses three sequential and Potentials, challenges, and emerging applications,” IEEE Internet Things
homogeneous detectors to conduct iterative detection with J., vol. 8, no. 14, pp. 11 016–11 040, 2021.
[14] L. Rueda, K. Agbossou, A. Cardenas, N. Henao, and S. Kelouwani, “A
increasing IoU thresholds. Experimental results show that comprehensive review of approaches to building occupancy detection,”
the proposed model can achieve superior performance on Build. Environ., vol. 180, p. 106966, 2020.
occupancy detection and localisation when compared with [15] J. Yun and J. Woo, “A comparative analysis of deep learning and
machine learning on detecting movement directions using pir sensors,”
baseline models, including 0.94 in MAE and 1.35 in MSE IEEE Internet Things J., vol. 7, no. 4, pp. 2855–2868, 2020.
for the detection performance, and 44.5% in mAP for the [16] L. Zimmermann, R. Weigel, and G. Fischer, “Fusion of nonintrusive
localisation performance. environmental sensors for occupancy detection in smart homes,” IEEE
Internet Things J., vol. 5, no. 4, pp. 2343–2352, 2018.
With this method, building managers and engineers can [17] N. Li, G. Calis, and B. Becerik-Gerber, “Measuring and monitoring
obtain the number of occupants and their locations in videos. occupancy with an RFID based system for demand-driven HVAC
Then these managers have reliable information when formu- operations,” Automat. Constr., vol. 24, pp. 89–99, 2012.
[18] X. Ma, Y. Zhao, L. Zhang, Q. Gao, M. Pan, and J. Wang, “Practical
lating energy efficient action plans with other stakeholders. device-free gesture recognition using WiFi signals based on metalearn-
Better energy-related strategies can be decided to improve ing,” IEEE Trans. Ind. Informat., vol. 16, no. 1, pp. 228–237, 2020.
building energy efficiency while maintaining environmental [19] B. Sheng, F. Xiao, L. Sha, and L. Sun, “Deep spatial–temporal model
based cross-scene action recognition using commodity wifi,” IEEE
comfort for occupants. In addition, engineers are able to Internet Things J., vol. 7, no. 4, pp. 3592–3601, 2020.
develop accurate building performance simulation and smooth [20] Z. Chen, C. Jiang, and L. Xie, “A novel ensemble ELM for human activ-
energy consumption peaks for buildings in smart grids. ity recognition using smartphone sensors,” IEEE Trans. Ind. Informat.,
vol. 15, no. 5, pp. 2691–2699, 2018.
Future research will further investigate the benefits of [21] T. M. A. Zulcaffle, F. Kurugollu, D. Crookes, A. Bouridane, and
occupancy detection and deep learning for building energy M. Farid, “Frontal view gait recognition with fusion of depth features
management. In particular, we plan to improve occupancy from a time of flight camera,” IEEE Trans. Inf. Forensics Security,
vol. 14, no. 4, pp. 1067–1082, 2018.
detection performance in the complex scenario by updating [22] D. H. Green, S. R. Shaw, P. Lindahl, T. J. Kane, J. S. Donnal, and S. B.
the deep learning model with possible improvements on the Leeb, “A multiscale framework for nonintrusive load identification,”
RPN architecture and the non-maximum suppression mech- IEEE Trans. Ind. Informat., vol. 16, no. 2, pp. 992–1002, 2020.
[23] M. Jin, R. Jia, and C. J. Spanos, “Virtual occupancy sensing: Using
anism. With precise and reliable occupancy information, a smart meters to indicate your presence,” IEEE Trans. Mobile Comput.,
potential research opportunity lies with improving building vol. 16, no. 11, pp. 3264–3277, 2017.

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877

IEEE INTERNET OF THINGS 12

[24] C. Duarte, K. Van Den Wymelenberg, and C. Rieger, “Revealing


occupancy patterns in an office building through the use of occupancy
sensor data,” Energy Build., vol. 67, pp. 587–595, 2013.
[25] C. Jiang, Z. Chen, R. Su, M. K. Masood, and Y. C. Soh, “Bayesian
filtering for building occupancy estimation from carbon dioxide con-
centration,” Energy Build., vol. 206, p. 109566, 2020.
[26] J. Wang, N. C. F. Tse, and J. Y. C. Chan, “Wi-Fi based occupancy
detection in a complex indoor space under discontinuous wireless
communication: A robust filtering based on event-triggered updating,”
Build. Environ., vol. 151, pp. 228–239, 2019.
[27] S. N. Tamgade and V. R. Bora, “Motion vector estimation of video
image by pyramidal implementation of Lucas Kanade optical flow,” in
Conf. Emerg. Trend. Eng. Technol. IEEE, 2009, pp. 914–917.
[28] N. Cao, J. Ting, S. Sen, and A. Raychowdhury, “Smart sensing for
HVAC control: Collaborative intelligence in optical and IR cameras,”
IEEE Trans. Ind. Electron., vol. 65, no. 12, pp. 9785–9794, 2018.
[29] H.-C. Shih, “A robust occupancy detection and tracking algorithm for the
automatic monitoring and commissioning of a building,” Energy Build.,
vol. 77, pp. 270–280, 2015.
[30] J. Yang, A. Pantazaras, K. A. Chaturvedi, A. K. Chandran, M. Santa-
mouris, S. E. Lee, and K. W. Tham, “Comparison of different occupancy
counting methods for single system-single zone applications,” Energy
Build., vol. 172, pp. 221–234, 2018.
[31] S. Thys, W. Van Ranst, and T. Goedeme, “Fooling automated surveil-
lance cameras: Adversarial patches to attack person detection,” in
Proceedings of CVPR Workshops, June 2019.
[32] J. Zou, Q. Zhao, W. Yang, and F. Wang, “Occupancy detection in the
office by analyzing surveillance videos and its application to building
energy conservation,” Energy Build., vol. 152, pp. 385–398, 2017.
[33] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
time object detection with region proposal networks,” in Advan. Neural
Inf. Process. Syst., 2015, pp. 91–99.
[34] I. Ahmed, S. Din, G. Jeon, and F. Piccialli, “Exploring deep learning
models for overhead view multiple object detection,” IEEE Internet
Things J., vol. 7, no. 7, pp. 5737–5744, 2020.
[35] E. N. Kajabad and S. V. Ivanov, “People detection and finding attractive
areas by the use of movement detection analysis and deep learning
approach,” Procedia Computer Science, vol. 156, pp. 327–337, 2019.
[36] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual
transformations for deep neural networks,” in Proceedings of CVPR,
2017, pp. 1492–1500.
[37] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in Proceedings of
CVPR, 2017, pp. 2117–2125.
[38] Z. Cai and N. Vasconcelos, “Cascade R-CNN: high quality object
detection and instance segmentation,” IEEE Trans. Pattern Anal. Mach.
Intell., pp. 1–1, 2020.
[39] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Op-
timal speed and accuracy of object detection,” arXiv preprint
arXiv:2004.10934, 2020.
[40] L. Yu, W. Xie, D. Xie, Y. Zou, D. Zhang, Z. Sun, L. Zhang, Y. Zhang,
and T. Jiang, “Deep reinforcement learning for smart home energy
management,” IEEE Internet Things J., vol. 7, no. 4, pp. 2751–2762,
2020.

© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.

You might also like