

A Deep Learning Based Person Detection and Heatmap Generation Technique with a Multi-Camera System

Md Shakhrul Iman Siam1, Subrata Biswas2


Department of Electrical and Electronic Engineering
Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
Email: 1 [email protected], 2 [email protected]

Abstract—This paper outlines a technical method for video analysis that can identify persons in footage from several CCTV cameras and produce a heatmap of that information for a given floor layout. The automatic creation of people density maps can aid the analysis of customer and employee behavior in retail and office settings, as well as motion tracking and advertising effectiveness research. Density maps were created from video recordings made by common video surveillance cameras; in our case, CCTV cameras dispersed across a retail establishment. We chose the Yolov5 object detection algorithm for human detection because it produces results quickly, and its short inference time makes it appropriate for real-time applications.

Index Terms—Object Detection, YOLO, Heatmap, KDE, Homography Transform.

I. INTRODUCTION

Real-time human detection is one of the most fundamental tasks in computer vision, and it has become one of the most popular research topics in different fields over the last few years since it has numerous commercial applications [1]. Over the past few decades, human identification, tracking, and segmentation have been the subject of substantial research. Although several algorithms have been put forward, there are still open issues in the discipline, and the object class of humans poses additional obstacles for detection and tracking. First, because the human body can move freely at its numerous joints, the way people look can change depending on the angle from which they are viewed as well as how their body parts are positioned. People also wear a range of clothing and accessories, which in combination can create hundreds of different combinations of hues, textures, materials, and fashions.

Detection of humans from surveillance cameras can be done by various techniques [2], including motion-based detection and deep learning based detection. Deep learning person detection algorithms have advanced quickly in recent years, considerably enhancing both detection speed and accuracy. Deep learning-based computer vision technology outperforms conventional image processing and recognition techniques in terms of detection speed, algorithm robustness, and feature extraction without manual design [3]. Because deep learning techniques are data-hungry, a specific image dataset for the construction industry is needed in order to apply object detection on building sites; owing to the complexity and dynamic nature of construction activities, it is difficult to gather and annotate images, which is why there are few well-annotated image datasets created specifically for the construction industry in widely used open databases.

A heatmap is a graphical representation of data where values are depicted by color: the more congested the data is at a particular location, the hotter the color used to represent it. From the heatmap we can easily find the areas that are more attractive to customers or visitors, see which places of a retail store are crowded and which are less crowded, and analyze this over time, for example at which time of the day or on which day of the week people visit the most. This information helps businesses with further analysis [4]. In this work, we present a scalable solution for real-time human detection and heatmap generation on a floor layout in order to generate useful business insights such as customer behavior and shopping patterns.

II. RELATED WORKS

In the past, person detection in surveillance videos was done manually. The task of identifying people in images has gained significant attention due to the increasing importance of biometrics and surveillance. Deepak et al. [5] developed an algorithm based on the background subtraction method for real-time object detection using Faster-R-CNN. Kajabad et al. [6] describe a people detection approach using a deep learning method; they also proposed an algorithm to find the hot zones of people's movement in the image. Parzych et al. [7] explained how to create a density map of people's movement from video footage analysis in a salesroom; however, their detection method is based on people's movement activity and can be applied only when there is continuous movement. Punn et al. [8] used Yolov3 with the Deepsort tracking technique to detect people in order to monitor social distancing. Khan et al. [9] used Yolo, Faster-R-CNN, and SSD for identifying hotspots of people to mitigate the transmission of the coronavirus.

However, none of these works address the question of how to locate the congested regions on a 2-dimensional floor layout. The existing literature only describes how to create a heatmap of persons in single-camera images, making it impossible to combine data obtained from multiple cameras. To the best of our knowledge, this is the first paper that describes a full pipeline for detecting people in camera images and mapping the information from multiple cameras onto a floorplan to find the areas that are frequently visited by people.


III. METHODOLOGY

A. Cameras and Floor Layout

The proposed system is based on a multi-camera setup that covers the entire floor region. The floor layout and the positions of the surveillance cameras of a retail store are shown in Fig. 1. The position and coverage area of each camera are known.

Fig. 1: FloorPlan and Camera positions of a Retail Outlet

B. Person Detection

Images are captured from the video footage of the CCTV cameras. For detecting humans in the camera images, we use the Yolov5 object detection algorithm [10].

1) Dataset: To get the best result from the object detection model, we train it on the CrowdHuman dataset [11]. The dataset contains 15000, 4370, and 5000 images for training, validation, and testing, respectively. There are a total of 340k person instances in the training set, and each human instance is annotated with a head bounding box and a full-body bounding box. Sample images from the CrowdHuman dataset are shown in Fig. 2.

Fig. 2: CrowdHuman Dataset
2) Model Architecture: The proposed system uses the Yolov5-m architecture for the object detection task. Yolov5-m is a medium-sized model with 21.2 million parameters which consists of:
• Backbone (CSP-Darknet53): CSP-Darknet53 [12] with a Spatial Pyramid Pooling (SPP) layer is used as the backbone of Yolov5 and acts as the feature extractor. It is a convolutional neural network (CNN) built on DarkNet-53.
• Neck (PANet): The neck is a feature aggregator that collects feature maps from different stages of the backbone and creates a connection between the backbone and the head. We use the Path Aggregation Network (PANet) [13] to build the neck.
• Head (YOLO Layer): As the acronym YOLO ("You Only Look Once") suggests, it is a one-stage detector that makes the predictions for object localization and classification at the same time. The one-stage head [14] for dense prediction provides information about where each object is present.
First, the input image is fed into CSP-Darknet53 for feature extraction, then into PANet for feature fusion, and finally the outputs (class, location, confidence score) are obtained from the YOLO layer. Fig. 3 shows an overview of the Yolov5 architecture.

Fig. 3: YOLOv5 Model Architecture
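As a point of reference (not part of the original paper), a Yolov5-m model of roughly this size can be instantiated through the Ultralytics torch.hub interface; the snippet below is a minimal sketch that loads the pretrained medium variant and reports its parameter count. The custom checkpoint path shown in the comment is a hypothetical placeholder.

```python
import torch

# Load the medium variant (Yolov5-m) from the Ultralytics repository.
# A CrowdHuman-trained checkpoint could be loaded instead with
#   torch.hub.load('ultralytics/yolov5', 'custom', path='crowdhuman_yolov5m.pt')
model = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)

# Roughly 21 million parameters, matching the size quoted above.
n_params = sum(p.numel() for p in model.parameters())
print(f"Yolov5-m parameters: {n_params / 1e6:.1f}M")
```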
3) Training: We train the model with the SGD optimizer for 200 epochs on the CrowdHuman dataset. The learning rate and the momentum are 0.01 and 0.973, respectively, with a weight decay of 0.0005. We use horizontal flipping and the Mosaic augmentation technique to make the model more robust. Three types of losses are calculated: bounding box regression loss (Mean Squared Error), objectness loss (Binary Cross-Entropy), and classification loss (Cross-Entropy). These loss curves over the training epochs are shown in Fig. 4.
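The training setup described above can be summarized in a short PyTorch sketch. This is our simplified illustration, not the ultralytics/yolov5 training script: the helper names are ours, and the three loss functions are plain stand-ins for the loss terms named in the text (Yolov5's internal loss computation is more involved).

```python
import torch
import torch.nn as nn

# Hyperparameters reported in the paper: SGD, lr = 0.01,
# momentum = 0.973, weight decay = 0.0005, trained for 200 epochs.
def make_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    return torch.optim.SGD(model.parameters(), lr=0.01,
                           momentum=0.973, weight_decay=0.0005)

# Simplified stand-ins for the three losses named in the text.
box_loss = nn.MSELoss()             # bounding box regression loss
obj_loss = nn.BCEWithLogitsLoss()   # objectness loss
cls_loss = nn.CrossEntropyLoss()    # classification loss

def total_loss(pred_box, true_box, pred_obj, true_obj, pred_cls, true_cls):
    # Horizontal flipping and Mosaic augmentation act on the data side
    # and therefore do not appear in the loss computation.
    return (box_loss(pred_box, true_box)
            + obj_loss(pred_obj, true_obj)
            + cls_loss(pred_cls, true_cls))
```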
4) Performance Evaluation: We use the CrowdHuman test set, which consists of 5000 images, to evaluate our model. We use mean Average Precision (mAP) as the evaluation metric, which scores a detector by comparing the predicted bounding boxes to the ground-truth bounding boxes. We compute mAP using the Intersection over Union (IoU) value at a given IoU threshold and obtain 0.85 mAP(0.5) on the test dataset.
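For context, the IoU value that drives the mAP(0.5) metric can be computed as follows; this small function is our illustration and is not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# At the mAP(0.5) setting, a prediction counts as a true positive
# when its IoU with a ground-truth box is at least 0.5.
print(iou((10, 10, 110, 210), (20, 20, 120, 220)) >= 0.5)  # True (IoU ~ 0.75)
```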
C. Coordinate Transformation

To visualize the heatmap on a 2D floorplan, a coordinate transformation of each person's position from the camera image to the floorplan is required; it is much easier to interpret movement patterns on a 2D floorplan than on raw CCTV footage. For this purpose, we use the Homography Transformation [15]. A homography is a mapping between two planar projections of an image and is represented by a 3x3 transformation matrix H in homogeneous coordinates. Mathematically, the homography is written as

\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}   (1)

where (x, y) is the coordinate of a person's position in the camera image and (x', y') is the transformed coordinate on the floorplan (Fig. 5). To calculate the matrix H, we need at least 4 point pairs from the two images (camera image and floorplan); the more point pairs we provide, the better the estimate of H will be.
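A minimal sketch of this step with OpenCV is shown below. The correspondence points are hypothetical placeholders (the paper does not publish its calibration points); with four or more pairs, cv2.findHomography estimates the matrix H of Eq. (1), and cv2.perspectiveTransform applies it to detected positions.

```python
import cv2
import numpy as np

# At least four corresponding point pairs (camera image -> floorplan),
# e.g. floor corners or other landmarks; the coordinates below are
# illustrative placeholders, not values from the paper.
camera_pts = np.float32([[120, 480], [620, 470], [580, 260], [180, 255]])
floor_pts  = np.float32([[ 40, 300], [240, 300], [240, 100], [ 40, 100]])

# Estimate the 3x3 homography H of Eq. (1); with more than four pairs,
# cv2.RANSAC can be passed as the method to reject poor correspondences.
H, _ = cv2.findHomography(camera_pts, floor_pts)

# Map detected positions from camera coordinates to floorplan coordinates.
pts = np.float32([[[300, 460]], [[500, 430]]])        # shape (N, 1, 2)
floor_xy = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
print(floor_xy)
```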


Fig. 4: Training Losses. (a) Bounding box loss, (b) Objectness loss, (c) Classification loss, (d) mAP(0.5); training and validation curves plotted over 200 epochs

Fig. 5: Coordinate Transformation. Points (x1, y1) and (x2, y2) in the camera images (Image 1, Image 2) are mapped by the homography H to (x1', y1') and (x2', y2') on the Floor Plan

Fig. 6: Heatmap Generation. (a) Kernel Density Estimator, (b) Colourmap

D. Heatmap Generation

A heatmap is a visual representation of data in which the values are shown by colors. We use a Gaussian kernel and Kernel Density Estimation (KDE) [16] for heatmap generation. With a suitable kernel, kernel density estimates can be equipped with properties such as smoothness or continuity. A normal kernel with appropriate variance (red dashed lines in Fig. 6a) is fitted on each of the data points to measure the kernel density; where Gaussian kernels overlap, their values are accumulated (solid blue curve in Fig. 6a). The heatmap is generated from the normalized Gaussian kernels, which are mapped to the continuous range [0, 255]. As the value goes from 0 to 255, the heatmap color goes from blue to red (Fig. 6b). The workflow of the proposed system is shown in Fig. 7.
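The following sketch illustrates this accumulation with NumPy and OpenCV. It is our simplified version of the idea: the kernel bandwidth (sigma) and the example positions are assumed values, and the JET colourmap is used here to obtain the blue-to-red scale described above.

```python
import numpy as np
import cv2

def heatmap_from_positions(positions, floor_w, floor_h, sigma=25.0):
    """Accumulate an isotropic Gaussian kernel at every floorplan position
    and map the normalized result to a blue-to-red colour scale."""
    yy, xx = np.mgrid[0:floor_h, 0:floor_w]
    density = np.zeros((floor_h, floor_w), dtype=np.float64)
    for px, py in positions:   # overlapping kernels add up
        density += np.exp(-((xx - px) ** 2 + (yy - py) ** 2) / (2 * sigma ** 2))
    density = (255 * density / density.max()).astype(np.uint8)   # scale to [0, 255]
    return cv2.applyColorMap(density, cv2.COLORMAP_JET)          # 0 -> blue, 255 -> red

# Example: three mapped person positions on a 400x300 floorplan grid.
overlay = heatmap_from_positions([(120, 80), (130, 90), (300, 200)], 400, 300)
cv2.imwrite('floor_heatmap.png', overlay)
```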
IV. EXPERIMENTAL RESULTS

We implement the system in Python 3.6 on an Intel(R) Core(TM) i7-1165G7 2.80 GHz processor with 8 GB of RAM and a 2 GB NVIDIA GeForce MX330 GPU. The program can run on both the CPU and the GPU, but processing is much faster on the GPU. We process the videos at 8 FPS on our system, which is suitable for real-time analysis.

Our proposed system consists of two stages. The first is to capture images from the CCTV videos and detect the persons in them.

Fig. 7: Workflow of the proposed method (Video Data → Capture Image Frames → Person Detection → Coordinate Transformation → Generate Heatmap → Show Output)


Videos from the CCTV cameras can be processed either online, using an RTSP streaming link, or offline, using video files. After a frame is captured from a camera, it is fed to the Yolov5 model to detect all the persons in that image. We use a confidence threshold of 0.3 for detecting objects, followed by non-max suppression with a threshold of 0.4, to keep only the bounding boxes with the higher class probabilities. Fig. 8 shows the detected persons for each camera.

Fig. 8: Detected Persons in multiple cameras. (a) camera 1, (b) camera 2, (c) camera 3, (d) camera 4
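A condensed sketch of this first stage is given below, assuming the torch.hub interface of the ultralytics/yolov5 repository. The RTSP URL is a hypothetical placeholder, the 0.3 confidence and 0.4 NMS thresholds are the values quoted above, and taking the bottom centre of each bounding box as the person's floor position is our assumption rather than a detail stated in the paper.

```python
import cv2
import torch

# Source: RTSP stream (online) or a recorded video file (offline).
source = 'rtsp://user:pass@camera-1/stream'   # hypothetical URL, or e.g. 'camera1.mp4'

model = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)
model.conf = 0.3      # detection confidence threshold
model.iou = 0.4       # non-max-suppression IoU threshold
model.classes = [0]   # keep only the 'person' class

cap = cv2.VideoCapture(source)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        foot = ((x1 + x2) / 2.0, y2)   # bottom centre, later mapped through H
cap.release()
```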
Our second stage involves creating a heatmap on the floorplan to show how active people are in different areas. This is achieved with the homography transform, which converts the coordinates of all detected persons in the camera images onto the floorplan. Heatmap generation on the floorplan is the second part of our proposed algorithm, as depicted in Fig. 9, where the black dots represent people's positions over the entire floorplan and the red areas indicate people's engagement in the corresponding areas.

Fig. 9: Generated Heatmap on Floormap

V. CONCLUSIONS

In this paper, we present a method to detect people and find the hot zones on the floorplan in real time from a CCTV camera network. By spotting people in a retail shop area and determining what kinds of shops, brands, and products are more intriguing to customers, the suggested strategies can also be useful for controlling customer behavior in shopping centers. Thus, company management can quickly adjust the way the sales area operates by analyzing the behavior of the consumers. In the future, we hope to evaluate and separate the number of visitors to various locations as well as identify groups of people.

VI. ACKNOWLEDGEMENT

This research is part of our work at Advanced Chemical Industries (ACI) Limited. The video data, floor plan, and all technical support were provided by ACI Logistics Limited. We ensured that no one's privacy was violated during the video processing.

REFERENCES

[1] M. Paul, S. M. E. Haque, and S. Chakraborty, "Human detection in surveillance videos and its applications - a review," EURASIP J. Adv. Signal Process., vol. 2013, pp. 1–16, Dec. 2013.
[2] M. A. Ansari and D. K. Singh, "Human detection techniques for real time surveillance: a comprehensive survey," Multimed. Tools Appl., vol. 80, pp. 8759–8808, Mar. 2021.
[3] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object Detection With Deep Learning: A Review," IEEE Trans. Neural Networks Learn. Syst., vol. 30, pp. 3212–3232, Jan. 2019.
[4] A. J. Newman, D. K. C. Yu, and D. P. Oulton, "New insights into retail space and format planning from customer-tracking data," Journal of Retailing and Consumer Services, vol. 9, pp. 253–258, Sept. 2002.
[5] B. Deepak, "Real-time object detection and tracking using color feature and motion," Apr. 2015.
[6] E. N. Kajabad and S. V. Ivanov, "People detection and finding attractive areas by the use of movement detection analysis and deep learning approach," Procedia Computer Science, vol. 156, pp. 327–337, 2019. 8th International Young Scientists Conference on Computational Science (YSC 2019), 24–28 June 2019, Heraklion, Greece.
[7] M. Parzych, A. Chmielewska, T. Marciniak, A. Dabrowski, A. Chrostowska, and M. Klincewicz, "Automatic people density maps generation with use of movement detection analysis," in 2013 6th International Conference on Human System Interactions (HSI), pp. 26–31, 2013.
[8] N. S. Punn, S. K. Sonbhadra, S. Agarwal, and G. Rai, "Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and Deepsort techniques," arXiv, May 2020.
[9] M. Z. Khan, M. U. G. Khan, T. Saba, I. Razzak, A. Rehman, and S. A. Bahaj, "Hot-Spot Zone Detection to Tackle Covid19 Spread by Fusing the Traditional Machine Learning and Deep Learning Approaches of Computer Vision," IEEE Access, vol. 9, pp. 100040–100049, July 2021.
[10] G. Jocher, A. Stoken, J. Borovec, NanoCode012, ChristopherSTAN, L. Changyu, Laughing, tkianai, A. Hogan, lorenzomammana, yxNONG, AlexWang1900, L. Diaconu, PetrDvoracek, P. Rai, et al., "ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements," Oct. 2020.
[11] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun, "CrowdHuman: A benchmark for detecting human in a crowd," arXiv preprint arXiv:1805.00123, 2018.
[12] C.-Y. Wang, H.-Y. M. Liao, I.-H. Yeh, Y.-H. Wu, P.-Y. Chen, and J.-W. Hsieh, "CSPNet: A New Backbone that can Enhance Learning Capability of CNN," arXiv, Nov. 2019.
[13] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path Aggregation Network for Instance Segmentation," arXiv, Mar. 2018.
[14] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv, Apr. 2018.
[15] F. Rovira-Más, J. Reid, and Q. Zhang, "Stereovision data processing with 3D density maps for agricultural vehicles," Transactions of the ASABE, vol. 49, July 2006.
[16] Y.-C. Chen, "A tutorial on kernel density estimation and recent advances," 2017.

