Real-Time Deep Learning Approach For Pedestrian Detection and Suspicious Activity Recognition
Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 00 (2022) 000–000
www.elsevier.com/locate/procedia
Abstract

Pedestrian detection, tracking, and suspicious activity recognition have grown increasingly significant in computer vision applications in recent years as security threats have increased. Continuous monitoring of private and public areas in high-density areas is very difficult, so active video surveillance that can track pedestrian behavior in real time is required. This paper presents an innovative and robust deep learning system as well as a unique pedestrian data set that includes student behavior such as test cheating, laboratory equipment theft, student disputes, and danger situations in institutions. It is the first of its kind to provide pedestrians with a unified and stable ID annotation. In addition, a comparative analysis is presented of results achieved by recent deep learning approaches to pedestrian detection, tracking, and suspicious activity recognition on a recent benchmark dataset. Finally, the paper concludes with an investigation of new research directions in vision-based surveillance for practitioners and research scholars.
© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the International Conference on Machine Learning and Data Engineering
Keywords: Pedestrian detection; Video surveillance; Tracking; Suspicious activity.
1. Introduction
Video surveillance is now installed everywhere to track and monitor pedestrians or criminals in streets, airports, banks, prisons, laboratories, shopping centers, etc. [1]. The surveillance system is based on a closed-circuit television (CCTV) system. Recently, Pan-Tilt-Zoom (PTZ) cameras have offered many advantages over traditional CCTV cameras. The main advantage of a PTZ camera is that it allows users to view more content than a fixed camera. The features of the PTZ camera include: 1) the user can pan left and right and tilt up and down to obtain a complete 180° view, whether left or right or up and down. If installed and positioned correctly, advanced PTZ cameras can provide a complete 360° field of view. Therefore, a single pan/tilt camera can replace two or even three fixed-view cameras, which is very suitable and can eliminate almost all of the blind spots of cameras with deviated fixed-view angles. A PTZ camera is programmed to rotate automatically in multiple directions to capture different views of an area. Researchers' main focus is
1877-0509 © 2023 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2023.01.219
Ujwalla Gawande et al. / Procedia Computer Science 218 (2023) 2438–2447 2439
2 U. H. Gawande, K. O. Hajari, Y. G. Golhar / Procedia Computer Science 00 (2022) 000–000
to develop a video surveillance system that can assess pedestrian behavior in real time [2]. The challenge of identifying pedestrians in crowded environments becomes extreme in real time when low-resolution images, motion blur, contrast illumination, changes in pedestrian scale or size, and entirely or partially obscured outlines are present. Fig. 1 describes the motivation of the proposed approach. In public pedestrian datasets such as INRIA [2], Caltech [1], MS COCO [3], KITTI [5], and ETH [4], pedestrian instances are typically modest in size. Due to restrictions such as 1) cloudy presentation, 2) confused and imprecise boundaries, 3) duplicated pedestrian occurrences, and 4) tiny and large instances with distinctive properties, localizing these small instances in the presence of illumination change and occlusion is a vital operation. Advanced research on pedestrian analysis is conducted on publicly available benchmark datasets.
Fig. 1. Issues and challenges of the ETH [2] and Caltech [3] datasets. (a) Significant change in pedestrian appearance as the illumination changes. (b) Pedestrian scale or size changes significantly across images. (c) Pedestrian occlusion affects the detection and tracking results. (d) Pedestrian occlusion with other road objects affects detection accuracy. (e) Pedestrian clothing variation affects the detection algorithm accuracy. (f) Multi-camera capture directions represent different visual appearances.
These datasets have several limitations: 1) a limited variety of pedestrian instances captured in a supervised pattern; 2) the size of the datasets is small and covers few scenarios; 3) the environment is limited to urban roads and cities only, and no student behavior dataset is available. This motivates a robust and novel deep learning model together with a student academic-environment dataset. For each sequence of frames in the video, human experts annotated the behavior of student pedestrians. The data are provided in three categories: 1) bounding boxes to locate pedestrians; 2) full labels; 3) unique IDs used as class categories for annotated pedestrians.
The proposed contributions are as follows:
1. To solve existing state-of-the-art dataset concerns such as size and illumination variance in pedestrian images, a unique enhanced Mask R-CNN deep learning architecture is presented.
2. Students' normal and suspicious activities are recorded in the proposed dataset.
3. Within the framework of the proposed pedestrian dataset for academic settings, a comprehensive review of previous work is given and existing techniques are compared.
The remainder of the paper is organized as follows: the most significant contribution of the new pedestrian dataset, as well as concerns and challenges in the academic context, is discussed in Section 2. The deep learning architecture is described in Section 3. The outcomes of the empirical examination are discussed in Section 4. Finally, Section 5 concludes the paper and outlines research directions.
2. Related Work
This section describes the most relevant and recent pedestrian datasets. In addition, it discusses the advanced deep
learning approaches to pedestrian detection, tracking, and suspicious activity recognition, along with their limitations.
2.1. State-of-the-art Pedestrian Datasets
Several pedestrian datasets are currently used by researchers for pedestrian detection, tracking, and suspicious activity recognition. First, the Caltech dataset contains 2,300 unique pedestrians and 350,000 annotated bounding boxes to
represent these pedestrians. This dataset was created on city roads using a camera mounted on a vehicle [3]. Second, the MIT dataset is a well-known pedestrian dataset consisting of high-quality pedestrian sample images. It contains 709 unique pedestrians. The range of poses, whether in front view or back view, in images taken on city streets [4] is relatively limited. Third, the Daimler dataset captures people walking on the street through cameras
installed on vehicles in an urban environment during the day. The data set includes pedestrian tracking attributes,
annotated labelled bounding boxes, ground truth images, and floating disparity map files.
The training set contains 15,560 pedestrian images and 6,744 annotated pedestrian images. The test set contains 21,790
pedestrian images and 56,492 annotated images [5]. The ATCI dataset is a pedestrian database acquired by a normal car's rear-view camera, and it is used to test pedestrian recognition in parking lots, urban environments, city streets, and private as well as public lanes. The data set contains 250 video clips, each of 76 minutes, and 200,000 marked pedestrian bounding boxes, captured in daylight scenes with contrasting weather scenarios [6]. The ETH dataset is
used to observe the traffic scene from inside a vehicle. The behavior of pedestrians is recorded by cameras placed on vehicles. In an urban setting, the dataset can be used for pedestrian recognition and tracking via mobile platforms.
Road vehicles and pedestrians are included in the dataset [7]. The TUD-Brussels dataset was created using a mobile
platform in an urban environment.
Crowded urban street behavior was recorded with vehicle-embedded cameras. It can be used in car safety scenarios in urban
environments [8]. One of the most abstract pedestrian detection data sets is the INRIA dataset. It incorporates human
behavior, as well as a mobile camera and complex background scenes, with various variations in posture, appearance,
dress, background, lighting, contrast, etc. [9]. The PASCAL Visual Object Classes (VOC) 2012 and 2007 collections contain static objects in an urban setting with various viewpoints and positions. This dataset was created with the goal of recognizing visual object classes in real-world scenarios. Animals, trees, road signs, vehicles, and people are among the 20 different categories in this collection [10]. The Common Objects in Context (COCO) dataset was constructed as MS COCO 2018 [11]. The 2018 dataset was recently utilized to recognize distinct objects in context while focusing on salient object detection.
The annotations include different instances of objects belonging to 80 object categories and 91 stuff segmentation categories. For pedestrian instances, there are keypoint annotations and five image labels per sample image. The COCO 2018 dataset challenges include: (1) real-scene object detection with segmentation masks, (2) panoptic semantic segmentation, (3) pedestrian keypoint detection and evaluation, and (4) dense pose estimation in congested scenes [12].
For street picture segmentation, the Mapillary Vistas Research dataset is employed [13]. Pedestrians and other non-living categories are handled using panoptic segmentation, which merges the concepts of semantic and instance segmentation. A comparison of pedestrian databases and their video surveillance purposes is shown in Table 1. In addition, the proposed dataset, which will be introduced in the next section, is included. The comparison is made based on each dataset's use, size, environment, labels, and annotation.
3. Proposed Deep Learning Architecture and Academic Environment Pedestrian Dataset
This section presents the proposed framework from different perspectives, as captured by a high-quality DSLR camera. The proposed video acquisition framework records video at 30 f/s with 3840x2160 resolution. The size of the dataset is 100 GB. The student behavior frames are shown in Fig. 2. The orientation of the camera is in the range of 45° to 90°. The dataset records the academic activity behavior of students of Yeshwantrao Chavan College of Engineering (YCCE), Nagpur. The students' ages are between 22 and 27, including both male and female; 65% are male
and 35% are female. The academic environment dataset consists of different behaviors such as lab activities, exam hall and classroom behavior, student cheating, disputes, and theft of mobile phones and lab electronic devices [34].
At the frame level, domain experts annotate the pedestrian video sequence. The labelling stage contains three phases: 1) human identification, 2) tracking, and 3) detection of suspicious activities. First, the Mask R-CNN [12] method was used to determine the location of each pedestrian in the frame, followed by manual validation and correction of the data. Next, a Deep SORT [14] model was used for extracting tracking information. Finally, these two basic operations yield a rectangular bounding box around each pedestrian that defines the ROI for each human. The last stage of the updating process is performed manually, with human expert knowledge of the academic environment. Height, age, enclosing box, unique ID, feet, frame, body size, hairstyle, hair color, head attachments, clothing, facial hair, activities, and accessories are all given for each human instance in the frame on the label.
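The three-phase labelling pipeline above (detect, track, then expert activity labels) can be sketched as follows. This is an illustrative outline only: `detect_pedestrians` and `track` are hypothetical stand-ins for the Mask R-CNN detector and Deep SORT tracker, not APIs from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class FrameAnnotation:
    frame_idx: int
    boxes: List[Box] = field(default_factory=list)            # phase 1: detection
    track_ids: List[int] = field(default_factory=list)        # phase 2: tracking
    activities: Dict[int, str] = field(default_factory=dict)  # phase 3: expert labels

def annotate_video(frames, detect_pedestrians: Callable, track: Callable):
    """Run detection then tracking per frame; activity labels are filled in
    manually by human experts afterwards, as described in the text."""
    annotations = []
    for idx, frame in enumerate(frames):
        boxes = detect_pedestrians(frame)   # e.g. a Mask R-CNN detector
        ids = track(frame, boxes)           # e.g. a Deep SORT tracker
        annotations.append(FrameAnnotation(idx, boxes, ids))
    return annotations
```

The `activities` field stays empty after the automatic passes, mirroring the paper's manual final phase.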
Fig. 2: An example of the designed database. A lab fight between two girls is shown in the 1st row. The scenario of snatching a phone is depicted in the 2nd row. A scenario of a student threatening is depicted in the 3rd row. The 4th row describes the same critical situation. The 5th row depicts a situation in which students steal lab material. The 6th row depicts an exam cheating scenario in the examination hall.
The current deep learning-based pedestrian detection, tracking, and suspicious activity recognition systems are not as accurate and fast as human vision [2]. Pedestrian detection, tracking, and activity recognition methods fall into two categories: conventional handcrafted approaches and deep learning. The Viola–Jones [3] approach, originally developed for face detection, was used for pedestrian detection. Again, the HOG [5] and DPM [4] conventional approaches are used for pedestrian detection. These procedures are computationally intensive and time-consuming, and they necessitate the participation of humans. CNN-based deep learning techniques have grown in prominence as a result of their accuracy in pedestrian identification [7,8]. R-CNN [9] is the first deep learning model for object detection. Multi-stage convolutional networks include Mask R-CNN and other R-CNN family variants such as the Fast and Faster R-CNN models [9][10][11][12]. Single-stage CNN models such as You Only Look Once (YOLO) [14] and SSD [15] are other examples of deep learning approaches. Multi-stage detectors, however, are unsuitable for real-time pedestrian detection. As a result, Redmon et al. [15] introduced YOLO, an object regression architecture, to increase detection speed and accuracy. The proposed improved YOLOv5 method effectively and consistently detects small pedestrians.
3.1. YOLOv5 Deep Learning Architecture
The YOLOv5 detector has only one stage. The YOLOv5 architecture contains three sections: 1) the backbone, 2) the neck, and 3) the output head. The input picture features are extracted first by the backbone portion. For scale-invariant feature extraction, CNN and max-pooling backbone networks are used [29]. The feature map development process is divided into four tiers in the backbone network. The layers generate feature maps with resolutions of 152x152, 76x76, 38x38, and 19x19 pixels. The neck network integrates feature maps of several levels to capture additional contextual information and prevent information loss. For multi-scale features, a recursive neural network is created, and pixel grouping backbone networks are employed for feature engineering. In a top-down manner, semantic features are provided via a feature pyramid network. The bottom-up path uses a pixel aggregation network for object localization. In the neck network, there are three feature fusion components of different scales with sizes of 76x76x255, 38x38x255, and 19x19x255, where 255 corresponds to 3 anchor boxes times (80 classes + 5 box and objectness values) per cell. The CSP network aims to improve inference speed. In the neck, the CSP network replaces the residual units with CBL modules. The SPP module combines the benefits of max pooling with the flexibility of varying kernel sizes. The feature map in the input is largely compressed; this reduces feature extraction time considerably while retaining the most important features. The following subsection goes through the details of the upgraded architecture.
To detect pedestrians at several scales, the suggested Improved Mask R-CNN addresses the difficulties of existing approaches by leveraging scale-independent convolutional feature construction. Fig. 3 depicts the basic concept of scale-independent feature map generation. A unique Improved R-CNN framework based on the Faster R-CNN pipeline [12] is proposed as a result. The proposed Improved R-CNN is a unified architecture that combines a scale-independent feature map with a two-stage backbone network.
Fig. 3. The basic concept of scale-independent feature map generation.
The Improved R-CNN, as illustrated in Fig. 4, takes an input image and runs it through the common convolutional layers to obtain its whole set of local features. The scale-independent feature map and multiple backbone structure, which adapt to the scales present in the current input, consistently help to improve the decisive outcomes. As a result, Improved R-CNN can outperform traditional R-CNN detection across a large number of input scales. Improved R-CNN is also particularly efficient in training and testing time because it shares convolutional features for the input image. The traditional Mask R-CNN uses a binary convolution mask for every candidate region, which limits the generality of object detection accuracy. The RoIs must be generated using the proposal region generation technique, which takes time. Since the exponential function for the items is unavailable, the tiniest objects are not categorized efficiently. The practical utility of Mask R-CNN for pedestrian monitoring is limited due to these two flaws. In the traditional Mask R-CNN, human examples that differ in scale are not recognized adequately. As a result, this problem must be addressed by creating scale-independent extracted features for the various scales of human instances.
To construct the probability value, multi-resolution images are combined with anchor boxes of various sizes. This score is then combined with the feature map created previously in the process. Because of the scale-independent feature map, the region proposal accuracy improves. The steps in creating a scale-independent map for object detection are discussed briefly as follows:
Algorithm: Feature extraction for scale-independent object detection
Input: MultiScaleImage, the multi-scale image; MultiScaleFeature, the scale-invariant mask
Output: scaleFeatureMatrix, the scale-independent matrix
Read the image at multiple scales.
for i ← 1 to K do
    Multi-scale image convolution process
    for j ← 1 to T do
        Generate and convolve the multi-scale invariant mask
        CScore ← Conv(MultiScaleImage, MultiScaleFeature)
        if CScore = null then
            exit the iteration
        else
            CuScore ← CuScore + CScore
        end
    end
    scaleFeatureMatrix[i] ← CuScore
end
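A minimal, runnable sketch of the algorithm above follows. It uses a naive valid-mode convolution and hypothetical names; it illustrates the loop structure (outer loop over K scales, inner loop over T masks, accumulating CuScore) rather than the authors' actual implementation.

```python
def conv_response(img, mask):
    """Valid 2-D cross-correlation of mask over img (naive, for illustration)."""
    mh, mw = len(mask), len(mask[0])
    h, w = len(img) - mh + 1, len(img[0]) - mw + 1
    if h <= 0 or w <= 0:
        return None  # no response at this scale (the algorithm's exit condition)
    return [[sum(img[y + i][x + j] * mask[i][j]
                 for i in range(mh) for j in range(mw))
             for x in range(w)] for y in range(h)]

def scale_independent_matrix(image, masks, steps=(1, 2, 4)):
    """Outer loop over K scales (subsampling steps), inner loop over T masks;
    responses are accumulated into CuScore as in the algorithm above."""
    rows = []
    for step in steps:
        # crude nearest-neighbour downscale, standing in for a proper resize
        scaled = [row[::step] for row in image[::step]]
        cu_score = 0.0
        for mask in masks:
            resp = conv_response(scaled, mask)
            if resp is None:
                break                         # exit the iteration
            cu_score += sum(map(sum, resp))   # CuScore <- CuScore + CScore
        rows.append(cu_score)
    return rows  # scaleFeatureMatrix
```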
To identify pedestrians of varied scale efficiently, scale-independent local features are combined with other convolution layers. Following that, the RoIAlign procedure is used to align each region proposal. The remaining Mask R-CNN processes are used in subsequent processing. The Mask R-CNN architecture for pedestrian detection has been discussed. Finally, every detected pedestrian is represented by a bounding box and a segmented mask, and the results carry a class annotation for each object in the scene.
4. Experimental Results

This section presents the results of three tasks performed using state-of-the-art methods for pedestrian detection, tracking, and suspicious activity recognition. It presents the results acquired by these strategies on the academic environment database, along with baseline findings using the same techniques on well-known datasets. Pedestrian detection is the first step. The R-FCN [15], RetinaNet [16], and Mask R-CNN [17] deep learning frameworks provide the benchmark performance for pedestrian detection, given that all performed very well in the PASCAL VOC [17] challenges, particularly in the pedestrian detection problems. The predicted dataset performance of these methods is compared with the results seen on the PASCAL VOC 2007 and 2012 datasets. RetinaNet leverages the Feature Pyramid Network (FPN) on top of a ResNet as its support system. Variations in position are encoded using a particular convolutional layer by R-FCN [15]; instead of a fully connected layer, it makes use of RoI max pooling. Mask R-CNN is also tried on the suggested dataset. The dataset is divided into three categories: training (60%), validation (20%), and testing (20%) for real-time queries. The experimental findings for the proposed and current methodologies are presented in Table 2. AP IoU=0.5
represents the Average Precision (AP) at an Intersection over Union (IoU) threshold of 0.5, a common evaluation measure.
Table 2. Comparative analysis of the proposed method and existing methods on the proposed dataset and PASCAL VOC 2007.
Methodology Backbone Structure PASCAL VOC Dataset Proposed Dataset
R-FCN CNN Net [15] ResNet-101 Network 84.43 ± 1.85 59.29 ± 1.31
RetinaNet CNN Net [16] ResNet-50 Network 86.44 ± 1.03 63.10 ± 1.64
Proposed Mask R-CNN ResNet-101 Network 87.41 ± 1.02 65.10 ± 1.44
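The AP@IoU=0.5 measure used in Table 2 rests on the standard IoU overlap between a detection and a ground-truth box; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive at AP@IoU=0.5 when it overlaps a
# ground-truth box with IoU >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / (100 + 100 - 50) ≈ 0.333
```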
The Tracktor CV [2] and V-IOU [18] techniques give the state of the art, for two reasons: 1) they are the best performers in the MOT challenge; and 2) they are pre-built open-source frameworks. The two phases of the TracktorCV approach are as follows: 1) a regression component that uses the output of the detector stage to modify the bounding box's current location; and 2) a detection component that maintains the set of detections for the subsequent frames. For both methodologies, there is a positive association between crowd-related failures and two worrying kinds of instances: 1) scenarios where trajectories of individuals intersect constantly due to dense pedestrian congestion; and 2) when crucial deformations of the person silhouettes occur. The proposed database comprises more intricate pictures with dense backdrops, including several complex situations. An overview of the findings is given in Table 3.
Table 3. Performance comparison of the two cutting-edge tracking methods using the suggested dataset and MOT datasets.
Methodology Backbone MOTA Measure MOTP Measure F1-Measure
TracktorCv [2] MOT-17 65.20 ± 9.60 62.30 ± 11.00 89.60 ± 2.80
Proposed dataset 56.00 ± 3.70 55.90 ± 2.60 87.40 ± 2.00
V-IOU [18] MOT-17 52.50 ± 8.80 57.50 ± 9.50 86.50 ± 1.90
Proposed dataset 47.90 ± 5.10 51.10 ± 5.80 83.30 ± 8.40
Proposed Mask R-CNN MOT-17 67.50 ± 9.80 59.50 ± 10.50 89.50 ± 2.87
Proposed dataset 70.90 ± 6.12 57.10 ± 15.80 83.10 ± 18.39
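The MOTA measure reported in Table 3 combines misses, false positives, and identity switches into one score; a minimal sketch of the standard definition (the counts below are illustrative, not from the paper):

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT,
    where GT is the total number of ground-truth objects over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

print(mota(120, 80, 10, 1000))  # 0.79
```

MOTP, by contrast, averages the positional overlap of matched boxes, which is why a tracker can score high on one measure and low on the other.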
In surveillance videos, abnormal activities are detected from varying object behaviors in scenes with varying appearances, scales, lighting conditions, and occluded trajectories. In a crowded area, detecting individual pedestrians is an important process. The aforementioned techniques are not suitable in these circumstances. The use of motion has been the focus of other recent research groups [20][21]. The extraction of local spatiotemporal cuboids from
optical flow or gradient patterns has also been attempted [23]. Challa S.K. et al. [24] presented a multi-branch CNN-
BiLSTM model for human activity recognition using wearable sensor data. An activity recognition algorithm based
on CNN filters is used in this approach. Next, Jain R. et al. [25] proposed a deep ensemble learning approach for lower
extremity activities recognition using wearable sensors. Semwal V. B. and colleagues [26] proposed an improved
method for the selection of features for the recognition of human walking activities using bio-geography optimization.
An ensemble learning approach was used by Semwal et al. [27] in order to create an optimized hybrid deep learning
model for recognizing human walking activities. Other authors considered handcrafted features. A pose-invariant gait recognition-based person identification method was presented by Semwal et al. [28]. According to
Bijalwan et al. [29], multi-sensor based biomechanical gait analysis can be carried out using wearable sensors
combined with vision. Semwal et al. [30] presented a pattern identification of different human joints for different
human walking styles using an inertial measurement unit (IMU) sensor. Dua N. et al. [31] proposed a multi-input
CNN-GRU based human activity recognition method using wearable sensors. Bijalwan V. et al. [32] proposed a heterogeneous computing model for post-injury walking pattern restoration and postural stability rehabilitation exercise recognition. Again, although the above methods have proved their effectiveness in experiments, most of them only
cover the detection of abnormal activities in local or global areas. As shown in Table 4, joint deliberation of motion
2446 Ujwalla Gawande et al. / Procedia Computer Science 218 (2023) 2438–2447
U. H. Gawande, K. O. Hajari, Y. G. Golhar / Procedia Computer Science 00 (2022) 000–000 9
flows pattern, varying size of objects, and interactions between adjacent objects can be used to represent pedestrian
activities in a high-density scene and enhance the performance of unusual activity detection.
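The local motion cues discussed above can be illustrated with a minimal sketch. This is a simplified stand-in, not the method of any cited work: plain frame differencing replaces optical flow, per-block averaging mimics the local spatiotemporal (cuboid) aggregation, and `block_motion_scores` / `flag_abnormal` are hypothetical helper names.

```python
import numpy as np

def block_motion_scores(frames, block=8):
    """Per-frame, per-block mean motion energy from simple frame
    differencing (a crude stand-in for optical-flow features)."""
    frames = np.asarray(frames, dtype=float)    # (T, H, W) grayscale clip
    diffs = np.abs(np.diff(frames, axis=0))     # (T-1, H, W) motion energy
    t, h, w = diffs.shape
    hb, wb = h // block, w // block
    cropped = diffs[:, :hb * block, :wb * block]
    # Average within non-overlapping block x block spatial cells
    return cropped.reshape(t, hb, block, wb, block).mean(axis=(2, 4))

def flag_abnormal(block_scores, k=1.0):
    """Flag frames whose peak block energy exceeds mean + k*std of the clip."""
    peak = block_scores.max(axis=(1, 2))        # strongest local motion per frame
    return peak > peak.mean() + k * peak.std()
```

For example, an otherwise static clip in which a bright patch suddenly appears produces two high-energy difference frames, which the threshold flags as abnormal.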
Table 4. Comparative analysis of frame-level suspicious activity recognition using Equal Error Rate (EER).
Methodology/Technique Pedestrian 1 (%) Pedestrian 2 (%) Average (%)
Social Force Map [22] 36.5 35.0 35.7
MDT-spatial [23] 32.0 38.0 34.0
Multi-branch CNN-BiLSTM model [24] 44.0 26.0 32.0
Lower extremity activities recognition [25] 45.0 27.0 35.0
Optimized feature selection [26] 46.0 29.0 38.0
Hybrid deep learning model [27] 48.0 25.0 34.0
Pose Invariant Gait [28] 45.2 29.0 33.0
Multi-sensor Gait [29] 45.0 31.2 39.0
Human joints using IMU [30] 41.0 27.0 35.0
Multi-input CNN-GRU [31] 38.0 37.0 37.0
Heterogeneous Computing Model [32] 47.0 31.0 40.0
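Table 4 reports frame-level Equal Error Rate (EER), i.e., the operating point at which the false positive rate equals the false negative (miss) rate as a detection threshold is swept over per-frame anomaly scores. As a minimal sketch of how such a value can be computed (pure NumPy; `equal_error_rate` is a hypothetical helper, not the evaluation code used in this paper):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER over per-frame anomaly scores; labels: 1 = abnormal, 0 = normal.
    Sweeps every candidate threshold and returns the error rate at the
    point where false-positive and false-negative rates are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        pred = scores >= t                      # frames flagged abnormal
        fp = np.sum(pred & (labels == 0))       # false alarms
        fn = np.sum(~pred & (labels == 1))      # misses
        fpr = fp / max(np.sum(labels == 0), 1)
        fnr = fn / max(np.sum(labels == 1), 1)
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2.0
    return eer
```

A lower EER indicates better frame-level separation of abnormal from normal activity.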
5. Conclusion
In this paper, an academic environment database is proposed, comprising video sequences of pedestrians in indoor academic environments annotated at the frame level. The database captures the behavior of students in the institution. It is the first dataset of its kind to provide unified and stable pedestrian ID annotations, making it suitable for pedestrian detection, tracking, and behavior recognition. A scale-invariant Mask R-CNN model is also proposed for robust and efficient pedestrian detection. The proposed framework is further shown to be useful for suspicious activity recognition on recent benchmark databases. The organized comparison presented here helps to identify problems and open challenges in this domain. In the future, more experimentation is required for pose estimation and for pedestrian trajectory identification and detection.
References
[1] Ahmed M., Jahangir M., Afzal H. (2015) “Using Crowd-source based features from social media and Conventional features to predict the
movies popularity”, IEEE International Conference on Smart Cities, Social Community and Sustained Community, China, pp. 273–278.
[2] Bergmann P., Meinhardt T., Leal-Taixé L. (2019) “Tracking without bells and whistles”, IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, pp. 1-16.
[3] Dollar P., Wojek C., Schiele B., and Perona P. (2012) “Pedestrian Detection: An Evaluation of the State of the Art”, IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI), 34 (4): 743-761.
[4] Samsi S., Weiss M., Bestor D., Li D., Jones M., Reuther A., Edelman D., Arcand W. and Byun C. (2021) “The MIT Supercloud Dataset”, arXiv preprint arXiv:2108.02037, pp. 1-10.
[5] Silberstein S., Levi D., Kogan V. and Gazit R. (2014) “Vision-based pedestrian detection for rear-view cameras”, IEEE Intelligent Vehicles
Symposium, pp. 853-860, Dearborn, MI, USA.
[6] Alom M. and Taha T. (2017) “Robust multi-view pedestrian tracking using neural networks”, IEEE National Conference on Aerospace and
Electronics, pp. 17-22, Dayton, OH, USA.
[7] Zhang X., Park S., Beeler T., Bradley D., Tang S., Hilliges O. (2020) “ETH-XGaze: A Large-Scale Dataset for Gaze Estimation Under Extreme
Head Pose and Gaze Variation” European Conference on Computer Vision (ECCV), Springer. Lecture Notes in Computer Science, pp. 1-10.
[8] Wojek C., Walk S., Schiele B., (2009) “Multiresolution model for Object Detection”, IEEE Computer Vision and Pattern Recognition (CVPR),
June 20-25, 2009, Miami, Florida, USA.
[9] Nguyen T., Soo K., (2013). “Fast Pedestrian Detection Using Histogram of Oriented Gradients and Principal Components Analysis”.
International Journal of Contents, 2013 (1): 1-20.
[10] Everingham M., Van Gool L., Williams C. (2010) “The Pascal Visual Object Classes (VOC) Challenge”, International Journal of Computer
Vision, Springer, 88 (1): 303–338.
[11] Lin T. (2014), “Microsoft COCO: Common Objects in Context. ECCV 2014”, Lecture Notes in Computer Science, Springer, 8693 (1): 740-
755.
[12] Wojke N., Bewley A., Paulus D. (2017) “Simple online and real-time tracking with a deep association metric”, IEEE International
Conference on Image Processing (ICIP), pp. 3645–3649.
[13] Jifeng D., Yi L., Kaiming H., Sun J., (2016) “R-FCN: Object detection via region-based fully convolutional networks”, IEEE Computer
Vision and Pattern Recognition (CVPR), pp. 1-11.
[14] Everingham M., Eslami S. M. A., Van Gool L., Williams C., Winn J. (2015) “The PASCAL VOC Challenge: A Retrospective”, International Journal of
Computer Vision (IJCV), Springer, 111 (1): 98-136.
[15] Kaiming H., Georgia G., Dollar P. and Girshick R. (2020) “Mask R-CNN”, IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 42 (2): pp. 386-397.
[16] Viola P. and Jones M. (2001) “Rapid object detection using a boosted cascade of simple features,” IEEE Computer Vision and Pattern
Recognition (CVPR), USA, pp. I-I.
[17] Felzenszwalb P., Girshick R., McAllester D. and Ramanan D. (2010) “Object detection with discriminatively trained part- based models”,
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32 (9): 1627–1645.
[18] Dalal N. and Triggs B. (2005) “Histograms of oriented gradients for human detection”, IEEE Computer Vision and Pattern Recognition
(CVPR), San Diego, CA, USA, pp. 886–893.
[19] Muhammad N., Hussain M., Muhammad G., Bebis G., (2011), “Copy-move forgery detection using dyadic wavelet transform,” International
Conference on Computer Graphics, Singapore, pp. 103–108.
[20] Muhammad G., Hossain M., Kumar N., (2021) “EEG-based pathology detection for home health monitoring”, IEEE Journal on Selected
Areas in Communications, 39 (2): 603–610.
[21] Muhammad G., Alhamid M., and Long X., (2019) “Computing and processing on the edge: Smart pathology detection for connected
healthcare”, IEEE Network, 33 (1): 44–49.
[22] Girshick R., Donahue J., Darrell T., and Malik J., (2014), “Rich feature hierarchies for accurate object detection and semantic segmentation”
IEEE Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, pp. 580–587.
[23] He K., Zhang X., Ren S., and Sun J., (2015) “Spatial pyramid pooling in deep convolutional networks for visual recognition”, IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37 (9): 1904–1916.
[24] Girshick R., (2015) “Fast R-CNN”, IEEE International Conference on Computer Vision, Santiago, Chile, pp. 1440–1448.
[25] Ren S., He K., Girshick R. and Sun J., (2016) “Faster R-CNN: Towards real-time object detection with region proposal networks”, IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39 (6): 1137–1149.
[26] He K., Gkioxari G., Dollar P., and Girshick R., (2017) “Mask R-CNN”, IEEE International Conference on Computer Vision (ICCV), Venice,
Italy, pp. 2980–2988.
[27] Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., (2016) “SSD: Single shot multi-box detector” European Conference on Computer
Vision (ECCV), Cham, Springer, pp. 21–37.
[28] Redmon J., Girshick R. and Farhadi A., (2016) “You only look once: unified, real-time object detection”, IEEE Computer Vision and Pattern
Recognition (CVPR), Las Vegas, NV, USA, pp. 779–788.
[29] Senst T. and Sikora T, (2018) “Extending IOU based multi-object tracking by visual information”, IEEE International Conference on
Advanced Video and Signal Based Surveillance, Auckland, New Zealand, pp. 1-7.
[30] Bernardin K. and Stiefelhagen R. (2008) “Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics”, EURASIP Journal
on Image and Video Processing, Springer, 2008 (1): Article 246309.
[31] Kratz L. and Nishino K. (2012) “Tracking Pedestrians Using Local Spatio-Temporal Motion Patterns in Extremely Crowded Scenes” IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34 (5): 987-1002.
[32] Wang S. and Miao Z. (2010) “Anomaly Detection in Crowd Scene”, IEEE International Conference on Signal Processing, pp. 1220-1223,
Oct. 24-28, Beijing, China.
[33] Wang S., Miao Z. (2010) “Anomaly Detection in Crowd Scene Using Historical Information”, IEEE Intelligence Signal Processing and
Communication System, pp. 1-4.