Crowd Detection and Analysis for Surveillance Videos Using Deep Learning

Student, Department of Computer Science and Engineering, G H Raisoni College of Engineering, Nagpur, India
[email protected]

Student, Department of Electronics and Telecommunication Engineering, G H Raisoni College of Engineering, Nagpur, India
[email protected]
Abstract—Crowd identification and analysis have drawn a lot of attention recently, owing to a wide variety of video surveillance applications. We present a detailed review of crowd analysis and management, focusing on state-of-the-art methods for both controlled and unconstrained conditions. The paper illustrates both the advantages and the disadvantages of these methods. Mass or crowd gatherings can be seen at many places such as airports, sports stadiums, and various religious, educational, and entertainment-related events. When tens of thousands of people gather in a limited space, a tragedy is likely to happen. Automated video surveillance has become the need of the day and supports the analysis and management of data on a massive scale. It is very important to identify the presence of a crowd and detect the number of people in the gathering. This can prove very useful for the detection of sudden crowd build-up to avoid riots. Moreover, it can also be very useful in the COVID-19 pandemic situation to avoid people gathering at a place. This paper presents a system to detect the presence of a crowd by counting unique people and then performing crowd analysis. The crowd is analyzed by detecting the gender and age of people in the crowd.

Keywords—Deep Learning, Crowd Density Estimation, CNN, MobileNets, Neural Networks

I. INTRODUCTION

With the expanding population and several problems arising due to crowded scenarios in cities, there is a necessity for crowd detection. Automatic crowd detection involves assessing the number of individuals in a video or an image. Further, the estimation of such crowd density can be done from images of the crowded scene extracted from the surveillance video. Crowd Density Estimation (CDE) is a challenging problem that can assist in solving various real-life problems. CDE is essential for disaster management and for maintaining the maximum people count during the COVID-19 pandemic.

Moreover, in the recent past, the analysis of facial attributes has attained much credit in the field of computer vision. Various features of the human face, such as emotions, age, gender, and ethnicity, can be used for categorization. Some of these features can be used for security and video monitoring; electronic customer relationship management, biometrics, cosmetology, and forensic art are only a few of the real-world applications where age and gender classification is essentially helpful. However, the results obtained are not up to the mark, and various age and gender classification problems remain persistent complications. Even with the advances made in the computer vision community through the continuous progress of modern techniques that improve the state of the art, age and gender predictions on raw, real-life faces are ineffective in meeting the demands of commercial and real-world applications. Over many years, a lot of effort has gone into solving the classification problem. Many of these custom methods perform poorly when it comes to determining the age and gender of unconstrained, in-the-wild pictures. These traditional tailor-made methods rely on discrepancies in facial-feature attributes and face descriptors, which are unable to cope with the unpredictable variations encountered in such difficult, unconstrained imaging conditions. Variations such as noise, posture, and lighting can all affect the ability of these manually developed computer vision methods to accurately classify the age and gender of the images.

Deep learning-based approaches have recently shown promising results in the age and gender classification of unfiltered face images.
Building on existing work in age and gender prediction, and on the advances in deep learning and CNNs, we propose a VGG-16 based deep learning model that predicts the age group and gender of unfiltered, in-the-wild face images generated using a crowd detection model based on an object tracking algorithm. The tracker works by calculating the Euclidean distance between existing object centroids and new object centroids in subsequent frames of a video. We build an object tracker for each detected object to track it as it moves around the frame. We monitor the tracked objects until we reach the Nth frame and then rerun our object detector; this complete process then repeats.

The remaining part of the paper is organized as follows. We present related work on crowd identification and age and gender classification in Section 2. The background of the models used in the method is presented in Section 3. The proposed method is described in Section 4. Section 5 presents the datasets used for training and experimentation. In Section 6, the results are reported, accompanied by conclusions in Section 7.

II. LITERATURE REVIEW

In this section, we present a review of various crowd detection approaches followed by a review of gender and age prediction methods.

A. Crowd Detection

There are various approaches for person detection. Most crowd detection methods proceed by finding individual persons and counting them, which involves detecting, recognizing, and tracking people that are clearly visible. Clustering-based methods, regression-based methods, and detection-based methods are the three main types of methods. Clustering methods [1-2] detect different objects, and their trajectories are clustered to count the objects. Regression-based methods [3-4] first find low-level information such as foreground features and edge and texture features. Scene-level information is extracted from local and global properties such as the Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), etc. Finally, a regression function is exploited for counting. Detection-based methods involve person detection, target localization, tracking, and trajectory classification. A comprehensive study on person detection can be found in [5]. In this section we briefly review the detection-based methods, as we utilize a detection-based approach.

The early methods are based on low-level features. In [6], both appearance and motion features are utilized for the detection of pedestrians. A Haar filter, an absolute-difference Haar filter, and a shifted-difference filter are used to detect objects with motion, and eight pedestrian detectors are trained using the AdaBoost algorithm. Salim et al. [7] presented a method that can detect all the people passing through the field of view of the camera, with an average efficiency of 83.14%. It uses the Kalman filter [8] for predicting the tracked person's location in each frame; the PETS2009 dataset was utilized for experimentation. Frame differencing is used in [9] to segment the crowd into people before counting; features are described for individual patterns, and counting is performed. Various methods [10, 11] are also proposed that utilize a Kinect camera to capture depth information along with the low-level features. The most recent methods rely on deep learning. A crowd counting model is presented in [12] that uses a compact convolutional neural network to save computational resources while achieving great real-time speed, which is superior to existing lightweight models. [13] proposes a deep model based on Convolutional Neural Networks (CNN) and Spatio-Temporal Context (STC): the CNN model is used to detect people, and STC is used to monitor moving people's heads. [14] describes another supervised approach that uses spatio-temporal features and their fusion.

B. Gender and Age Prediction

Face images are often used in gender and age detection methods. After extracting facial features, the images are classified into age and gender groups using classification and regression methods. The classification methods in [15-16] used Support Vector Machine based approaches. Various classification methods like Support Vector Machines, Radial Basis Function Networks, and classical discriminant methods were compared in [17], in which SVMs achieved an acceptable error rate while storing only 20 per cent of the training set. In [18], two competing HyperBF networks, one for male and one for female, were trained on geometrical shapes. Standard regression methods for age and gender classification include linear regression [19], Support Vector Regression (SVR) [20], and Partial Least Squares (PLS) [21].

In [22], a gender and age prediction system based on face images is proposed. First, the quality of the face image is improved using a histogram equalization method called Brightness Preserving Dynamic Fuzzy Histogram Equalization (BPDFHE). For detection of a face in the given image, image segmentation and image filling are applied, and Eigenfaces are used for age estimation. More recently, the Weber Local Descriptor (WLD) was used in [23] for gender recognition, demonstrating outstanding performance on the FERET benchmark [24]; the best results in [23] were obtained with a block size of 12x12 and T, M, and S values of 8, 4, and 4, respectively. In [25], features like intensity, shape, and texture were used with mutual information, which again resulted in almost perfect results on the FERET benchmark.

In recent history, numerous methods have been introduced to solve classification problems leveraging deep learning techniques. However, the early neural network methods utilized small datasets. In [26], a neural network was trained on a minimal collection of 90 images (45 male and 45 female), with an error rate of 8.1%. Ranjan et al. [27] presented a model utilizing a CNN for gender recognition and age estimation; it is an end-to-end network that shares the parameters of the CNN's lower layers. In [28], a robust estimation solution (CNN2ELM) combining a CNN with an Extreme Learning Machine (ELM) is proposed.

For age classification, the authors in [29] brought out the importance of deep neural networks and how adding or subtracting a layer can change the output of the model. In [30], measurements of the face were utilized for age detection. It was shown that using readily available dense building blocks to approximate the expected optimal sparse structure can be a viable method for improving neural networks [31].
III. BACKGROUND
B. VGG-16 Model

VGG-16 [35] is a pre-trained CNN proposed by Simonyan and Zisserman. VGG Net is trained on the ImageNet dataset and is capable of performing the classification task with good accuracy. In the ImageNet dataset, images from 1000 classes are divided into three sets: 1.3 million training images, 100,000 testing images, and 50,000 validation images. VGG-16 has 16 weight layers: 13 convolutional layers and 3 fully connected layers.
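For illustration only (this sketch is not part of the original paper), a pre-trained VGG-16 can be loaded and applied to a single image as follows, assuming TensorFlow/Keras is available; the input file name is a placeholder.

# Minimal sketch: load ImageNet-pretrained VGG-16 and classify one image
# into the 1000 ImageNet classes (assumes TensorFlow/Keras).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")                 # 16 weight layers, pre-trained on ImageNet

img = image.load_img("sample.jpg", target_size=(224, 224))   # hypothetical input file
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])        # top-3 ImageNet labels with scores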
IV. PROPOSED METHOD

A. Crowd Detection and Crowd Density Estimation (CDE)

For detection of the crowd, person detection is performed and the detected persons are tracked. The presence of humans is found using MobileNet SSDs, tracking is performed using the Unique Person Detection Algorithm, and the count of detected people is displayed.

For efficient detection of humans, we combine MobileNets and Single Shot Detectors (SSDs). Specifically, MobileNet + SSD is used along with OpenCV 3.3's DNN (Deep Neural Network) module to detect humans in images. The centroid tracking algorithm works as follows.

The Crowd Density Estimation (CDE) algorithm used in this paper is inspired by the centroid tracking algorithm [13]. Figure 4 shows the diagrammatic explanation of the algorithm, and the algorithm is presented as Algorithm 1. First, the bounding box with coordinates (x, y) for each detected human in each frame is found. The bounding boxes can be generated using common object detectors such as Haar cascades, R-CNN, color thresholding, etc. After detection of the bounding boxes, the centroid is computed for each bounding box utilizing its (x, y) coordinates, and each bounding box is allotted a unique ID.

However, if a new unique ID were allotted to the objects in every incoming frame, it would defeat the purpose of object tracking. To alleviate this, we relate the centroid of each new object to that of an already existing human proposal and calculate the distance between them.

A list of tracked humans (TH) is maintained to detect the presence of unique people. To detect whether any new humans are present in the new frame compared to the previous frame, the number of objects is counted. If the count of human proposals in the new frame is greater than in the previous frame, the newly detected objects are added to the list TH and a unique ID is allotted to each. Again, the bounding box and centroid of the new proposal are computed. Following that, the path of the human proposal is determined by calculating the minimum distance using the Euclidean distance formula.

Furthermore, for any given video it is important to consider the fact that a human will move out of the field of view after some time. To address this issue, we deregister the object by removing its unique ID. When there is no match of a human proposal with an existing object for a certain number of frames, say N frames, we consider that the human proposal is lost and has moved out of the view, and hence it is deregistered. This assists in counting unique people and in developing the crowd counter.

Algorithm 1: Crowd Density Estimation (CDE) Algorithm. Input: Video Frames.
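As an illustrative sketch only (the paper itself does not include code), person detection with a MobileNet-SSD Caffe model through OpenCV's DNN module could look as follows; the deploy/weights file names, confidence threshold, and input parameters are assumptions based on the publicly available MobileNet-SSD model, not taken from the paper.

# Hedged sketch: MobileNet-SSD person detection via OpenCV's DNN module.
import cv2

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",     # assumed file names
                               "MobileNetSSD_deploy.caffemodel")
PERSON_CLASS_ID = 15   # index of "person" in the standard MobileNet-SSD (VOC) class list

def detect_people(frame, conf_threshold=0.5):
    """Return bounding boxes (startX, startY, endX, endY) of detected persons."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)       # standard scale/mean
    net.setInput(blob)
    detections = net.forward()
    boxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        class_id = int(detections[0, 0, i, 1])
        if class_id == PERSON_CLASS_ID and confidence > conf_threshold:
            box = detections[0, 0, i, 3:7] * [w, h, w, h]            # scale to frame size
            boxes.append(box.astype(int))
    return boxes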
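A minimal sketch of the centroid-based unique-person counting described above, with Euclidean matching and deregistration after N missed frames; this is illustrative rather than the authors' exact implementation, and the distance threshold and N are assumptions.

# Hedged sketch of centroid-based tracking / unique-person counting (CDE).
import math

class CentroidTracker:
    def __init__(self, max_missed_frames=40):
        self.next_id = 0
        self.tracked = {}        # id -> centroid (the list "TH" of tracked humans)
        self.missed = {}         # id -> consecutive frames without a match
        self.total_unique = 0    # crowd counter: every ID ever registered
        self.max_missed = max_missed_frames

    def _register(self, centroid):
        oid = self.next_id
        self.tracked[oid] = centroid
        self.missed[oid] = 0
        self.total_unique += 1
        self.next_id += 1
        return oid

    def update(self, boxes):
        centroids = [((x1 + x2) // 2, (y1 + y2) // 2) for (x1, y1, x2, y2) in boxes]
        matched_ids = set()
        for c in centroids:
            # match to the nearest existing centroid (minimum Euclidean distance)
            best_id, best_dist = None, float("inf")
            for oid, oc in self.tracked.items():
                d = math.dist(c, oc)
                if d < best_dist and oid not in matched_ids:
                    best_id, best_dist = oid, d
            if best_id is not None and best_dist < 50:   # assumed distance threshold (pixels)
                self.tracked[best_id] = c
                self.missed[best_id] = 0
                matched_ids.add(best_id)
            else:
                matched_ids.add(self._register(c))        # new unique person
        # deregister proposals that stay unmatched for more than N frames
        for oid in list(self.tracked):
            if oid not in matched_ids:
                self.missed[oid] += 1
                if self.missed[oid] > self.max_missed:
                    del self.tracked[oid], self.missed[oid]
        return self.total_unique

In a full pipeline of this kind, the detector would be rerun on every Nth frame and the tracker updated on every frame, so the returned total gives the unique-people count.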
The detected face images are fed to the pre-trained VGG-16 [35], and then classification is performed. We classify the humans into three age groups: (1-30), (30-60), and 60+. The age prediction model has 4 dense layers after feature extraction from the VGG-16; the final softmax layer has 3 nodes, as we have three classes. The model summary is shown in Figure 5. The test accuracy of the age prediction model is 69%.

V. DATASETS

The age dataset is summarized in Table 1; it was divided into three broad age-group categories, (1-30), (30-60), and 60+, to obtain better accuracy. The dataset for gender prediction, shown in Table 2, was taken from Kaggle; it was divided into 80% training and 20% validation, and testing was done manually. The dataset contained JPG images of men and women, equally divided into directories, and the data was cleaned manually. The total number of images in the dataset and the distribution of images into labels can be seen in Table 2.

VI. RESULTS

Figure 8 shows the result obtained for a sample video where the people are clearly visible and the video quality is good, whereas Figure 9 shows the results obtained on a night video. Although human proposals were detected correctly, the analysis was not completely correct: the age and gender were not correctly classified. Based on the experiments performed, it was observed that images obtained from high-resolution videos gave accurate results for both age and gender, whereas images obtained from low-resolution videos, or videos in which a person's face was not clearly visible, did not give accurate results.

Figure 9. Results obtained on a sample input video captured during nighttime.
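For illustration (not taken from the paper), an age-prediction head of the kind described above — a frozen VGG-16 feature extractor followed by 4 dense layers ending in a 3-way softmax — might be built as follows with TensorFlow/Keras; the hidden-layer widths and optimizer are assumptions.

# Hedged sketch: VGG-16 backbone + dense layers + 3-class softmax for age groups.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                        # use VGG-16 purely as a feature extractor

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),     # dense layers after VGG-16 feature extraction
    layers.Dense(128, activation="relu"),     # (widths assumed for illustration)
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),    # three age groups: (1-30), (30-60), 60+
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

A gender head would be analogous, with a two-node softmax (or single sigmoid) output trained on the 80% training / 20% validation split described above.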
VII. CONCLUSION AND FUTURE WORK

The proposed system is capable of detecting the presence of a huge number of people in surveillance videos. It successfully detects the gender and age of people and gives a summary of the people belonging to different age groups, along with their gender, for further analysis by the authorities. The system accepts input in the form of a video, which is divided into frames; people are detected in each frame using MobileNet SSD. The person detected in the frame is tracked, and the images of the detected faces are passed on for age and gender classification.

References

[1] G. Antonini and J. P. Thiran, "Counting pedestrians in video sequences using trajectory clustering," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 8, pp. 1008–1020, 2006.
[2] I. S. Topkaya, H. Erdogan, and F. Porikli, "Counting people by clustering person detector outputs," in 11th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS 2014), 2014, pp. 313–318.
[3] S. Gould, T. Gao, and D. Koller, "Region-based segmentation and object detection," in Advances in Neural Information Processing Systems 22, 2009, pp. 655–663.
[4] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2547–2554.
[5] C. Raghavachari, V. Aparna, S. Chithira, and V. Balasubramanian, "A comparative study of vision based human detection techniques in people counting applications," Procedia Computer Science, vol. 58, pp. 461–469, 2015.
[6] M. J. Jones and D. Snow, "Pedestrian detection using boosted features over many frames," in 19th International Conference on Pattern Recognition (ICPR 2008), IEEE, 2008, doi: 10.1109/ICPR.2008.4761703.
[7] S. Salim et al., "Crowd detection and tracking in surveillance video sequences," in 2019 IEEE International Conference on Smart Instrumentation, Measurement and Application (ICSIMA), IEEE, 2019.
[8] Q. Wan and Y. Wang, "Multiple moving objects tracking under complex scenes," in The Sixth World Congress on Intelligent Control and Automation, Proc. IEEE, vol. 2, pp. 9871–9875, 2006.
[9] C. Chen, T. Chen, D. Wang, and T. Chen, "A cost-effective people-counter for a crowd of moving people based on two-stage segmentation," J. Inform. Hiding Multimedia ..., vol. 3, no. 1, pp. 12–23, 2012.
[10] L. Del Pizzo, P. Foggia, A. Greco, G. Percannella, and M. Vento, "Counting people by RGB or depth overhead cameras," Pattern Recognition Letters, vol. 81, pp. 41–50, 2016.
[11] L. Del Pizzo, P. Foggia, A. Greco, G. Percannella, and M. Vento, "A versatile and effective method for counting people on either RGB or depth overhead cameras," in 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2015, pp. 1–6.
[12] J. C. Nascimento, A. J. Abrantes, and J. S. Marques, "An algorithm for centroid-based tracking of moving objects," in 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 99), vol. 6, pp. 3305–3308, IEEE Computer Society, 1999.
[13] G. Liu, Z. Yin, Y. Jia, and Y. Xie, "Passenger flow estimation based on convolutional neural network in public transportation system," Knowledge-Based Systems, vol. 123, pp. 102–115, 2017.
[14] X. Wei, J. Du, M. Liang, and L. Ye, "Boosting deep attribute learning via support vector regression for fast moving crowd counting," Pattern Recognition Letters, vol. 47, pp. 178–193, 2017.
[15] E. Eidinger, R. Enbar, and T. Hassner, "Age and gender estimation of unfiltered faces," IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, 2014.
[16] M. A. Beheshti-nia and Z. Mousavi, "A new classification method based on pairwise support vector machine (SVM) for facial age estimation," Journal of Industrial and Systems Engineering, vol. 10, no. 1, pp. 91–107, 2017.
[17] B. Moghaddam and M.-H. Yang, "Learning gender with support faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 707–711, 2002.
[18] R. Brunelli and T. Poggio, "HyberBF networks for gender classification," 1992.
[19] A. Demontis, B. Biggio, G. Fumera, and F. Roli, "Super-sparse regression for fast age estimation from faces at test time," in Image Analysis and Processing—ICIAP, Springer, Berlin, Germany, 2015.
[20] G. Guo, Y. Fu, C. R. Dyer, and T. S. Huang, "Image-based human age estimation by manifold learning and locally adjusted robust regression," IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1178–1188, 2008.
[21] G. Guo and G. Mu, "Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression," in Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, June 2011, pp. 657–664.
[22] S. Kumar, S. Singh, and J. Kumar, "Gender classification using machine learning with multi-feature method," in 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), IEEE, January 2019, pp. 0648–0653.
[23] I. Ullah, M. Hussain, G. Muhammad, and A. Mirza, "Gender recognition from face images with spatial WLD descriptor."
[24] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss, "The FERET database and evaluation procedure for face-recognition algorithms," Image and Vision Computing, vol. 16, no. 5, pp. 295–306, 1998.
[25] C. Perez, J. Tapia, P. Estévez, and C. Held, "Gender classification from face images using mutual information and feature fusion," International Journal of Optomechatronics, vol. 6, no. 1, pp. 92–119, 2012.
[26] B. A. Golomb, D. T. Lawrence, and T. J. Sejnowski, "SexNet: A neural network identifies sex from human faces," in NIPS, vol. 1, p. 2, 1990.
[27] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, "An all-in-one convolutional neural network for face analysis," in Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, June 2017, pp. 17–24.
[28] M. Duan, K. Li, and K. Li, "An ensemble CNN2ELM for age estimation," IEEE Transactions on Information Forensics and Security, vol. 13, no. 3, pp. 758–772, 2018.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[30] X. Geng, Z.-H. Zhou, and K. Smith-Miles, "Automatic age estimation based on facial aging patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 2234–2240, 2007.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[32] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[34] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," arXiv preprint arXiv:1701.06659, 2017.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[36] S. Shakya, "Collaboration of smart city services with appropriate resource management and privacy protection," Journal of Ubiquitous Computing and Communication Technologies (UCCT), vol. 3, no. 01, 2021.
[37] G. Ranganathan, "Real life human movement realization in multimodal group communication using depth map information and machine learning," Journal of Innovative Image Processing (JIIP), vol. 2, no. 02, pp. 93–101, 2020.