Building Occupancy Detection and Localisation Using CCTV Camera and Deep Learning
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877
Abstract—Occupancy information plays a key role in analysing and improving building energy
performance. The advances of Internet of Things (IoT) technologies have engendered a shift in
measuring building occupancy with IoT sensors, in which cameras in Closed-Circuit Television
(CCTV) systems can provide richer measurements. However, existing camera-based occupancy
detection approaches cannot function well when scanning videos with a number of occupants and
determining occupants’ locations. This paper aims to develop a novel deep learning based
approach for better building occupancy detection based on CCTV cameras. To do so, this research
proposes a deep learning model to detect the number of occupants and determine their locations
in videos. This model consists of two main modules, namely feature extraction and three-stage
occupancy detection. The first module presents a deep convolutional neural network to perform
residual and multi-branch convolutional calculation to extract shallow and semantic features,
and constructs feature pyramids through a bi-directional feature network. The second module
performs a three-stage detection procedure with three sequential and homogeneous detectors
which have increasing Intersection over Union (IoU) thresholds. Empirical experiments evaluate
the detection performance of the approach with CCTV videos from a university building.
Experimental results show that the approach achieves superior detection performance when
compared with baseline models.
Index Terms—Building occupancy detection, deep learning, IoT sensor.

This work is supported in part by Natural Science Foundation of China under Grant 62106069.
(Corresponding author: Shushan Hu)
S. Hu and P. Wang are with School of Computer Science and Information Engineering, Hubei
University, Wuhan, China ([email protected]; [email protected]).
C. Hoare and J. O’Donnell are with School of Mechanical and Materials Engineering and UCD
Energy Institute, University College Dublin, Belfield, Dublin 4, Ireland ([email protected];
[email protected])

I. INTRODUCTION
Internet of Things (IoT) technologies have advanced the research of smart buildings in recent
years [1]. As buildings account for the single largest portion of global energy consumption
(i.e., around 40%) and greenhouse gas emission (i.e., around 35%) [2], energy efficiency in
smart buildings has attracted great attention. Against this backdrop, occupancy detection plays
an essential role in achieving more efficient building energy management [3]. First of all,
occupancy information can assist building managers in comprehensively assessing building energy
performance and identifying building operation faults [4]. In addition, this information can
help to optimise operations of equipment (e.g., Heating, Ventilation, and Air Conditioning
(HVAC) systems) while maintaining a comfortable indoor environment [5]. Moreover, long-term
occupancy information can accomplish better demand responses by identifying energy peak demand
periods of buildings and smoothing energy requirement peaks of buildings in smart grid [6].
A number of developed IoT sensors offer exciting opportunities for measuring occupancy status
in buildings, but some of these sensors might be limited by inherent characteristics [7].
Approaches based on Passive Infrared (PIR) sensors detect infrared light radiating from
occupants, which leads to low accuracy for multiple and static occupants. Environmental and
electricity sensors are able to measure indoor environmental variations related to the presence
of occupants, but suffer from time delay and poor detection performance in complex environments
(e.g., open rooms). Radio Frequency Identification (RFID) and WiFi sensors require additional
devices to receive signals from users and can be significantly affected by noise from obstacles.
An accurate and feasible solution for occupancy detection in buildings lies in leveraging CCTV
cameras to capture optical streams of the indoor environment. Device availability can improve
the applicability and cost-efficiency due to the widespread deployment of CCTV cameras in many
buildings for safety and security purposes. A key challenge in camera based occupancy detection
is eliminating interference caused by background and object obfuscation. Some powerful
techniques from the computer vision domain have opened up the potential of obtaining occupancy
information from CCTV videos [8]. Some early efforts applied pattern recognition technologies
(e.g., filtering algorithm, classification, and clustering methods) to subtract background
information from videos, but these background subtraction based approaches can fail if occupants
remain static for extended periods.
Recent research has intended to apply machine learning to approximate the complex relationships
between sensor measurements and occupancy information [9]. Machine learning performs
classification and regression operations to filter image patches with learnable weights and
biases. Shallow learning algorithms (e.g., Support Vector Machine (SVM)) provide limited
capacity to finish occupant localisation. Deep learning significantly improves the learning
performance through deeper and more complex neural network architectures and has been applied
to a wide range of fields. For example, a significant body of research aimed to develop
deep-learning based image segmentation solutions that play a key role in medical image analysis,
robotic perception and augmented reality [10]. Deep learning has engendered a shift for the
communication and networking field such as jointly optimising transmitter and receiver
components for the physical layer [11], and accurately classifying mobile encrypted traffic to
obtain highly-valuable profiling information [12]. Advances
© 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ULAKBIM UASL - SELCUK UNIVERSITESI. Downloaded on November 03,2022 at 06:40:04 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3201877
in deep learning also give rise to the proliferation of industrial IoT with potential for
various applications such as smart assembling and smart manufacturing [13]. However, existing
deep learning based occupancy detection approaches cannot function well for complex videos with
overlaps among occupants [14].
This research intends to develop a novel deep learning based approach to obtain accurate
occupancy information for building energy management. The approach needs to eliminate
interference from overlaps among occupants and complex background in CCTV videos. Several
technical challenges inhibit the pathway to achieve this goal. Firstly, inferring multiple
occupants of varying sizes raises issues associated with relevant feature extraction from
videos. Secondly, accurately localising occupants in videos results in a complex regression
problem of adjusting the position and size of predicted boundaries of occupants. To address the
above challenges, this research proposes to design a novel deep learning model to scan videos
for high-quality occupancy information including the number of occupants and their locations.
In doing so, this work refines the detection problem into two main procedures: 1) extracting
fine-grained features related to occupants from videos, 2) detecting the number of occupants
and calculating their locations.
The main contributions are summarised as follows:
• We propose a deep learning model for building occupancy detection and localisation based on
  CCTV cameras. This model is capable of delivering better detection performance when facing
  videos with many occupants.
• We design a deep neural architecture to construct feature pyramids from CCTV videos. The
  architecture uses a deep convolutional neural network (DCNN) with residual and multi-branch
  convolutional calculation to extract shallow and semantic features, and constructs feature
  pyramids by integrating a top-down and a bottom-up pathway.
• We perform a three-stage detection procedure to calculate the number of occupants and regress
  their locations. This procedure employs three sequential and homogeneous detectors to conduct
  iterative detection on feature pyramids. Increasing IoU thresholds are applied to these
  detectors so as to improve detection performance.
• We conduct empirical experiments to evaluate the detection performance. Experimental results
  show that the approach can achieve superior performance on detecting the number of occupants
  and determining the locations of these occupants.
The remainder is organised as follows: Section II presents the case of measuring occupancy
status and filtering occupancy information. A detailed description of the approach is
illustrated in Section III. Section IV demonstrates the effectiveness and superior performance
with empirical experiments. Section V details the conclusions from the development and testing
of this approach.

II. RELATED WORKS
A. Occupancy Measurement with Sensors
The advances of IoT technologies have made a wide range of sensors available to measure
building occupancy status (Table I). To be specific, PIR sensors can detect the motion of
occupants by measuring infrared light radiating from objects. Yun et al. demonstrated a
quantitative performance analysis on human movement direction detection based on PIR sensors
[15], but PIR sensors may fail to measure static occupants. Environmental sensors aim to detect
occupancy through environmental variations caused by the presence of occupants. Zimmermann et
al. presented an approach to detect occupants using a fusion of environmental sensors from an
indoor air quality measurement system [16]. Two limitations of environmental sensors are a time
delay and low detection accuracy in open cases.
RFID sensors are capable of identifying and tracking tags attached to objects. Li et al.
developed an RFID based system to track stationary and mobile occupants for demand-driven
operations [17]. RFID sensors fall short in requiring tag attachments on objects and reader
deployments. WiFi becomes a primary signal for occupancy detection with channel state
information [18]. Sheng et al. presented a deep learning framework for action recognition by
integrating spatial features from CNN into a temporal bidirectional Long Short Term Memory
(LSTM) model [19]. Nonetheless, WiFi approaches need additional signal receivers and have low
performance in complex indoor environments. Smartphones provide an alternative for occupancy
detection with embedded sensors. Chen et al. leveraged smartphones to recognise human
activities through extreme learning machine (ELM), coordinate transformation and principal
component analysis (CT-PCA), and online SVM [20]. Detection performance drops significantly
when occupants put their smartphones aside.
Cameras provide a straightforward optical instrument to capture visible images for occupancy
detection. Zulcaffle et al. designed a four-part method for frontal view gait recognition based
on Time-of-Flight cameras [21]. However, limitations include computational complexity and the
possibility of occlusion. Electricity meters are used to detect interactions between occupants
and appliances [22]. Feng et al. tried to detect building occupancy from Advanced Metering
Infrastructure (AMI) data with a CNN and an LSTM network [8]. Jin et al. presented another
smart meter based solution which applies a multi-view based iteration and surrogate loss to
learn refined results [23]. The interactions between occupancy and power are not
straightforward, and sometimes difficult to model. Among these sensors, cameras can provide
richer measurements on occupancy status than other sensors. Growing deployments of CCTV cameras
have also promoted this topic for building occupancy detection. However, a major barrier
obstructing camera based detection approaches is eliminating interference caused by background
and object obfuscation.

B. Occupancy Detection with Algorithms
The development of computer vision technologies has orchestrated a paradigm shift in the way
that obtains fine-
TABLE I
COMPARATIVE ANALYSIS OF SENSORS FOR OCCUPANCY DETECTION.
TABLE II
COMPARISON OF CAMERA BASED OCCUPANCY DETECTION ALGORITHMS.
where $T_i(x)$ and $C$ stand for an arbitrary function and the size of the set of
transformations (i.e., cardinality) respectively. $T_i(x)$ projects the input vector $x$ into
an embedding code and then transforms the code. $C$ is a newly introduced hyper-parameter that
offers a new pathway of adjusting the model capacity.
The aggregating phase deeply integrates intermediate transformed features into final features.
An aggregated transformation function ($y = T_i(x) + x$) is used to construct the residual
function for the building block, where $y$ is the output of a block. This structure reformulates
the network layers with reference to the layer inputs instead of learning unreferenced
functions. A residual learning reformulation solves the degradation problem as the network
depth increases. The reformulation approximates the network as a residual function
$F(x) := H(x) - x$ and defines the building block as residual learning in the network layers:

Fig. 3. (a) a building block of DCNN with cardinality = 32; (b) a building block of DCNN with
grouped convolutions.
$$P_4^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot C_4 + w_2 \cdot \mathrm{Resise}(C_5)}{w_1 + w_2 + \epsilon}\right) \qquad (5)$$
where the Resise function upsamples the spatial information by a factor of 2× with the nearest
neighbour strategy. A lateral connection is used to enhance the upsampled results with previous
features at the same level ($C_4$). The connection indicates an element-wise addition operation
and a 1 × 1 conv for a reduction of channel dimensions.
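The weighted fusion in (5) can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, feature maps are plain nested lists with a single channel, the small constant `eps` in the denominator and its value are assumptions, and the trailing Conv operation is omitted for brevity.

```python
def resize_nearest_2x(grid):
    """Upsample a 2-D feature map by a factor of 2 using nearest-neighbour copying."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                             # repeat each row twice
    return out


def fuse_top_down(c4, c5, w1, w2, eps=1e-4):
    """Normalised weighted sum of level-4 features and upsampled level-5 features.

    The convolution applied to the fused map in (5) is left out of this sketch.
    eps is an assumed small constant that keeps the denominator non-zero.
    """
    upsampled = resize_nearest_2x(c5)
    norm = w1 + w2 + eps
    return [
        [(w1 * a + w2 * b) / norm for a, b in zip(row_c4, row_up)]
        for row_c4, row_up in zip(c4, upsampled)
    ]


# Toy example: a 2x2 level-5 map fused into a 4x4 level-4 map of zeros.
c5 = [[1.0, 2.0], [3.0, 4.0]]
c4 = [[0.0] * 4 for _ in range(4)]
fused = fuse_top_down(c4, c5, w1=1.0, w2=1.0)
```

With equal weights, each fused value is roughly the average of the two inputs, so the level-5 information is spread over the finer level-4 grid.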
To obtain fine-grained features, the bottom-up pathway performs another integration upon
upsampled features of the top-down pathway and features of feed-forward computation. This
pathway applies subsampling functions to construct an opposite integration direction compared
to the top-down pathway. For example, the final features at level 4 are calculated as:

C. Three-stage Occupancy Detection
With feature pyramids, the approach performs a three-stage detection to obtain high quality
occupancy information (Fig. 5). Three homogeneous detectors (i.e., detector1, detector2,
detector3) perform a sub-training operation in a sequential order and have the same topology,
which can generate high quality hypotheses at inference time and achieve better match results
with increasing quality of detectors. Increasing IoU thresholds are used for these detectors in
order to smooth the unbalanced detection arising from a single IoU threshold, as low thresholds
usually produce noisy results while higher thresholds tend to degrade the performance due to
overfitting and mismatch problems [38]. The three-stage detection is represented as an
iterative detection process to filter occupancy information from images. A Region Proposal
Network (RPN) generates initial region proposals for the detector of stage 1 (detector1), the
output of which provides additional and better region proposals to train the detector of stage
2 (detector2). The subsequent detector at stage 3 (detector3) applies the output of detector2
as region proposals to improve the detection performance.
RPN conducts a resampling operation on feature pyramids and predicts region proposals for
objects at each pixel of the input images. The resampling applies a convolution layer (3 × 3
conv) to transform the input features as 256-d. Two parallel convolution layers (1 × 1 conv)
are used to generate enhanced features for subsequent classification and regression tasks. A
classifier and a bounding box regressor are responsible for predicting k region proposals,
which are parameterised relative to k reference boxes (i.e., anchors). Finally, a loss function
(7) minimises the difference between the reference and ground truth boxes.
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \qquad (7)$$
where $i$, $p_i$ and $p_i^*$ stand for the index of an anchor, the probability of a predicted
anchor being an object and the ground-truth label, which equals 1 for a positive anchor,
respectively.
With initial region proposals, the three detectors perform an iterative detection with
increasing IoU thresholds. Each detector takes feature pyramids and region proposals from RPN
(or detectors of previous stages) as input. An RoI pooling layer is used to resample the input
features to a fixed size. Following a retraining process of two fully connected layers, a
classifier checks if the anchors contain occupants or not. A bounding box regressor tries to
precisely localise bounding box coordinates through a number of translating and scaling
operations.
The classifier is defined as a function $h(x)$ aiming to categorise candidate objects into one
of 2 classes. Class 0 indicates image background and class 1 is the occupant category. The
function $h(x)$ applies the posterior distribution upon a patch $x$ of images and predicts a
class label $y$ for objects in this patch ($h_k(x) = p(y = k \mid x)$). A risk function assesses
the classification performance of the function and improves the performance through a training
process.
$$R_{cls}[h] = \sum_{i=1} L_{cls}(h(x_i), y_i) \qquad (8)$$
where $L_{cls}$ stands for a cross-entropy loss function.
The bounding box regressor represents another function $f_t(x, p)$ to precisely confirm
bounding box coordinates for occupants. The function regresses a predicted box $p$ in an image
patch $x$ to a ground-truth box $g$ in the stage $t$. The bounding box $p$ is represented with
four parameters
where $L_{loc}$ stands for the loss function calculating a distance vector $\Delta$ with four
variables ($\delta_x, \delta_y, \delta_w, \delta_h$) from:
$$\delta_x = (g_x - p_x)/p_w, \quad \delta_y = (g_y - p_y)/p_h, \quad \delta_w = \log(g_w/p_w), \quad \delta_h = \log(g_h/p_h) \qquad (10)$$
This approach applies a cascade regression strategy by adopting a resampling procedure to
adjust the distribution of hypotheses for different stages. The strategy defines an overall
regression function $f(x, p)$ as a cascade of specialised regressors $f_t(x, p)$ for the three
detection stages:
$$f(x, p) = f_1(x, p) \circ f_2(x, p) \circ f_3(x, p) \qquad (11)$$
where $f_1$, $f_2$ and $f_3$ stand for the regressor for stage 1, 2, and 3 respectively. The
operator $\circ$ is cascaded linkage, meaning each regressor $f_t$ is optimised for the box
$p^t$ generated by the previous regressor $f_{t-1}$.
A loss function is defined to optimise the detection performance of each stage $t$ with an IoU
threshold $u^t$.
$$L(x^t, g) = L_{cls}(h_t(x^t), y^t) + \lambda [y^t \geq 1] L_{loc}(f_t(x^t, p^t), g) \qquad (12)$$
where $p^t$, $\lambda$ and $y^t$ stand for the output of the regressor of stage $t - 1$ (i.e.,
$f_{t-1}(x^{t-1}, p^{t-1})$), the trade-off coefficient, and the label of the patch $x^t$ under
the threshold $u^t$ respectively.

IV. PERFORMANCE EVALUATION
In order to demonstrate the effectiveness and superior performance of the approach, this work
documents a case study of occupancy detection when applying some other popular solutions as the
baseline.

A. Experimental Setup
The case study begins by preparing a dataset with labelled images to enable verification of the
approach. With videos (resolution: 1920 × 1080) collected from CCTV cameras in a university
building over a period of 5 weeks in 2019, this study conducts a pre-processing procedure with
several steps to obtain labelled images. The procedure begins by trimming video segments
recorded when the room is not occupied according to the teaching schedule of the target
classrooms. The second step takes images from the remaining videos with an interval of 8
seconds during lectures and an interval of 5 seconds during breaks between lectures. The reason
is that occupants normally move more during breaks, leading to frequent variations in images.
Finally, the authors manually draw bounding boxes for occupants to generate labelled images. In
summary, this study prepares nearly 12,000 labelled images for the dataset. The demonstration
uses 10,000 images of the dataset to train the occupancy detection algorithms, while 2,000
images are used to test the performance of these algorithms.
This research implements the proposed deep learning model as an executable program for the
performance test upon the dataset. The implementation uses the Python programming language for
coding and applies PyTorch as the basic platform. The program is run on a server with the
following specifications: AMD 3960X CPU at 4.0GHz 24 cores, 128 GB DDR4 RAM, 2 × GeForce 2080Ti
GPU, 512GB SSD, Ubuntu x64 OS.

B. Deep Learning Training
This study conducts a training process to improve the detection performance of the model. This
training process starts from a learning rate of 0.0025 × 0.001 and follows the warmup strategy
which keeps linear growth until the 10th epoch. We try to optimise the model parameters using
the Stochastic Gradient Descent (SGD) optimiser, which aims to minimise the cross entropy and
smooth L1 objective functions by moving in the direction opposite to the gradient. In order to
reach the convergence point, we train the developed model by increasing the epoch value from 1
to 30 (Fig. 6). Four metrics called Accuracy, Mean Absolute Error (MAE), Mean Square Error
(MSE) and mean Average Precision (mAP) are used to evaluate the detection performance. It can
be found that the model tends to converge at an epoch value of 12. In particular, the accuracy
value reaches the peak point at the 12th and the 21st epoch (Fig. 6 (a)). The MAE results show
three troughs when the epoch values are 12, 17, and 21 (Fig. 6 (b)). The MSE results contain
four minimal values at the 10th, 12th, 14th, and 21st epoch. The mAP curve reaches the peak
point with the epoch value of 12 (Fig. 6 (c)).
Regarding the three-stage detection architecture, we improve the detection performance by
analysing three main parameters called anchor area, anchor ratio and stage IoU (Table III) with
MAE, MSE, and mAP metrics. The model obtains relatively good performance with initial
parameters (1st row). We quickly confirm the anchor area to be [16,32,64,128,128] by analysing
the area of ground truth boxes. Another stage IoU ([0.4,0.5,0.6]) is tested to reach the first
milestone (2nd row). We identify that the ratio of [0.75,1.0,1.5,2.0,3.0] can achieve better
detection performance (3rd row). The integration of the anchor ratio and the stage IoU assists
the model to achieve the best performance (4th row).

C. Baseline Models
For the purpose of evaluating the performance of the approach, the case study selects five
state-of-the-art detection
models to establish a baseline for comparison. Faster R-CNN uses RPN to achieve good detection
performance by predicting object bounding boxes and objectness scores at each location. YOLOv4
presents a one-stage object detection solution with high speed and adequate accuracy, and
performs regression analysis for object detection with spatially separated bounding boxes and
associated class probabilities. EfficientDet follows the one-stage paradigm for object
detection by integrating EfficientNet with a set of fixed scaling coefficients and a
bi-directional feature pyramid network. RetinaNet is another popular one-stage detector, which
applies a feature pyramid network for detecting objects at multiple scales and defines the
focal loss function to address the extreme foreground-background class imbalance. An
anchor-free extension of RetinaNet (AF RetinaNet) defines a simple and effective building block
for single-shot object detectors by addressing issues such

Fig. 6. Convergence curves for the deep learning model training: (a) accuracy results, (b) MAE
and MSE results, and (c) mAP results.

D. Experimental Results
1) Detection Results Visualisation: This study presents an example (Fig. 7) that visualises the
occupancy detection results of the approach. There are 32 occupants in this image where nearly
one quarter of occupants are standing, while other occupants are sitting in fixed seats. Some
overlaps exist among occupants due to the layout of seats and shielding from standing
occupants. The approach successfully detects all 32 occupants contained in the image. Regarding
occupancy localisation, the predicted bounding boxes for occupants (i.e., red box) overlap to a
high degree with the ground-truth bounding boxes (i.e., green box).
This section shows some failure cases generated by the proposed model (Fig. 8). The first two
cases are caused by interference from irrelevant objects. Results in Fig. 8 (a) present a
failure when detecting the occupant behind the microphone and a schoolbag is recognised as an
occupant in Fig. 8 (b). The second two cases are due to occlusions from other occupants in
crowded scenarios. One occupant is missed in Fig. 8 (c) due to occlusion from another occupant
in front. Two occupants in Fig. 8 (d) are detected as only one occupant due to a vertically
overlapping layout.
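The degree of overlap between a predicted box and a ground-truth box is conventionally measured with IoU, which is also the quantity thresholded by the three detection stages. The following is a minimal sketch rather than the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` corner tuples, the function names are hypothetical, and the thresholds `(0.4, 0.5, 0.6)` are used only as an illustrative set of increasing stage values.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def stages_passed(pred_box, truth_box, thresholds=(0.4, 0.5, 0.6)):
    """For each stage threshold, report whether the prediction counts as a positive match."""
    value = iou(pred_box, truth_box)
    return [value >= u for u in thresholds]


# Toy example: the prediction covers exactly the top half of the ground-truth box.
truth = (0.0, 0.0, 10.0, 10.0)
pred = (0.0, 0.0, 10.0, 5.0)
overlap = iou(pred, truth)
matches = stages_passed(pred, truth)
```

In this toy case the prediction satisfies the two lower thresholds but not the highest one, which illustrates why later stages demand better-localised boxes.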
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN} \qquad (13)$$
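As a worked instance of (13), the sketch below computes both quantities from raw counts, reading TP, FP and FN in the usual way as true positives, false positives and false negatives. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    # Guard against empty denominators, e.g. an image with no detections at all.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 30 occupants found correctly, 2 false alarms, 2 occupants missed.
precision, recall = precision_recall(tp=30, fp=2, fn=2)
```

Precision falls when irrelevant objects are reported as occupants, whereas recall falls when occluded occupants are missed.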
Fig. 12. Detection performance of all detection models in mAP.

TABLE IV
PERFORMANCE COMPARISON OF DIFFERENT DETECTION MODELS.
Detection model   AP50    AP60    AP70    AP80
Faster R-CNN      82.4%   75.1%   55.4%   18.7%
YOLOv4            82.9%   77.1%   55.3%   18.0%
RetinaNet         81.3%   73.5%   52.4%   16.2%
AF RetinaNet      81.4%   74.6%   54.2%   18.6%
EfficientDet      81.5%   73.4%   54.1%   20.5%
Our model         85.1%   77.8%   58.3%   20.8%

TABLE V
COMPARISON ON THE COMPUTATION PERFORMANCE.
Detection model   Trainable parameters (Million)   Memory requirement (MegaByte)   Inference time (Second)
Faster R-CNN      4.1                              1,540                           0.058
YOLOv4            2.8                              1,390                           0.039
RetinaNet         3.6                              1,450                           0.051
AF RetinaNet      3.6                              1,450                           0.051
EfficientDet      2.3                              1,330                           0.045
Our model         4.8                              1,610                           0.063
RetinaNet has the worst performance with a mAP value of 40.9%. YOLOv4 achieves better performance than Faster R-CNN. This is because mAP only relates to the bounding boxes of correctly predicted occupants, and YOLOv4 has fewer correctly predicted occupants than Faster R-CNN. The undetected occupants are often surrounded by complex backgrounds, which makes their bounding boxes difficult to calculate.

Regarding the AP metric, in line with the mAP results, the proposed approach has the best performance for all IoU thresholds from AP50 to AP80 (Table IV). RetinaNet has the lowest values at most IoU thresholds; the exception is AP60, where EfficientDet is marginally lower (73.4% against 73.5%). It is worth noting that all models exhibit decreasing performance as the IoU threshold increases. The reason is that overlaps among occupants in images may cause detectors to treat crowded occupants as a single detection, or to mistakenly shift the target bounding box to another person, and such errors are penalised more heavily at higher IoU thresholds.

Furthermore, this experiment finds that considerable differences exist in occupancy detection for different scenarios in images (Fig. 13). This research defines a complex scenario as one where an image contains more than 30 occupants and more than 10 occupants are standing, with more overlaps between occupants. The regular scenario covers images outside the complex range. Although our approach achieves the best performance in both scenarios, its mAP differs by nearly 5% between the two scenarios. The biggest difference (around 10% in mAP) belongs to AF RetinaNet. This arises from the additional overlaps caused by more occupants and more standing occupants in the complex scenario, which make occupancy localisation more difficult.

4) Computation Performance: This section documents the computation performance evaluation for all models (Table V). This evaluation selects three indicators, namely trainable parameters, memory requirement and inference time, to assess the computation and storage cost of these models. The inference time, which is the time needed to scan one image, is the most important indicator. All models are fast, with an inference time of less than 0.1 second. Although our model needs slightly more inference time (0.063 second) due to a more complex architecture for better occupancy detection, it is efficient enough to meet the time requirements of future applications deployed in buildings. Regarding the other two metrics, though more trainable parameters in our model lead to a larger memory requirement, it is worth noting that the memory requirements of all models (from 1,330 to 1,610 MegaByte) only take up a small proportion of the total memory of popular GPUs, which is normally larger than 10,000 MegaByte.

E. Discussion

This work enhances the performance evaluation by discussing the performance of our approach against some existing occupancy detection solutions. Because their source code or datasets are inaccessible, this discussion relies on the results reported in the references (Table VI). A head detection solution [32] applies HOG features and a 7-layer CNN to obtain good performance, but its low occupancy number (0-3) significantly reduces the detection difficulty when compared with our model. Two solutions [35], [31] apply the YOLO model to perform occupancy detection. It can be concluded that our model is better than them, since one baseline model (YOLOv4) has better performance than YOLOv3 and YOLOv2 [39]. Although an occupancy presence detection solution [28] achieves a high accuracy value, our solution can have 100% accuracy when