International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 08 Issue: 08 | Aug 2021 p-ISSN: 2395-0072

Crowd Detection Using Deep Learning


Sayali Bodake1, Dr. S.S. Kadam2
1Student, Dept. of Technology, SPPU, Maharashtra, India
2ESEG Group, Centre for Development of Advanced Computing, Panchawati Rd., Panchawati, Pashan, Pune, Maharashtra, India
------------------------------------------------------------------------***-------------------------------------------------------------------------
Abstract - This way of life makes life easier for people and increases the use of public services in cities. We present a CNN-MRF-based method for counting people in still images from various scenes. Crowd density is well represented by the features derived from a CNN model trained for other computer vision tasks. When the image is divided into overlapping patches, neighboring local counts are strongly correlated. The Markov random field (MRF) can use this correlation to smooth adjacent local counts for a more accurate overall count. We divide the dense crowd image into overlapping patches, extract features from each patch using a deep convolutional neural network, and then use a fully connected neural network to regress the local patch crowd count. Since the local patches overlap, there is a strong correlation between the crowd counts of neighboring patches. We smooth the local patch counts using this correlation and the Markov random field.

Key Words: Convolutional Neural Network (CNN), Image processing, Feature extraction, Feature selection, Detection, Classification.

1. INTRODUCTION

There are two major groups of existing models for estimating crowd density and counting the crowd: direct and indirect approaches. The direct approach (also known as object-detection based) detects and segments each person in a crowd scene to get a total count, while the indirect approach (also known as feature based) takes the picture as a whole and extracts features before computing the final count. Due to variations in perspective and scene, the distribution of crowd density in crowded images is seldom uniform. Figure 3 shows several examples of photographs. As a result, counting the crowd by looking at the entire picture is unreliable. We therefore adopted the divide-count-sum approach in our system. After dividing the images into patches, a regression model maps each image patch to a local count. Finally, the sum of these patch counts gives the global image count. Segmenting the image into patches has two benefits. First, the crowd density in small image patches has a fairly uniform distribution. Second, image segmentation increases the amount of training data available to the regression model. Because of these benefits, we can train a more robust regression model.

Signal processing, image processing, and computer vision are only a few of the applications where CNNs come in handy. Several CNN-based crowd-counting (CNN-CC) algorithms have been proposed to deal with major issues such as occlusion, low visibility, inter- and intra-object variance, and scale variation due to different viewpoints. Figure 1 depicts a typical CNN-CC flow diagram with two paths. The first, on the left, computes the ground-truth density (GTD); the second, on the right, computes the estimated density (ED) and the crowd count. The last two blocks are used for comparison and error computation. The description of each block is as follows.

Density estimation: It is a method for estimating the probability density function of a random variable using observed (ground-truth) data. The ED of a crowd can be obtained in a variety of ways, including clustering, detection, and regression. Detection-based techniques perform well in sparse crowds, while regression-based approaches perform well in dense crowds but overestimate the count in sparse patches. A combination of detection and regression can improve results in both sparse and dense cases.

Counting: It is a method for counting the number of objects (people, cells, vehicles, and so on) in an image or video, applied after a density map has been computed. Image density affects how well various well-known handcrafted techniques work. Counting by detection, for example, works better in sparse-density images since there are fewer overlapping objects, whereas CNN-based methods work well in images with a wide density range. Complex network architecture, a large number of parameters, high computational cost, and real-time deployment are some of the challenges faced by CNN-CC algorithms. Traditional handcrafted crowd-counting algorithms can be used for real-time monitoring, but they have lower precision and produce a low-resolution

© 2021, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 1294
density map. Under high occlusion, a wide density range, and varying scale, these techniques often struggle to produce the desired results. CNN-CC algorithms, on the other hand, achieve better prediction accuracy and resolution, although the computational cost is lower for traditional handcrafted methods. Most applications strive for a high level of prediction accuracy, and many researchers have succeeded in reducing uncertainty. Given the growing popularity of CNN-CC techniques, we decided to review and evaluate the most recent and well-known research papers on the most difficult datasets.

2. THEORY AND LITERATURE SURVEY

2.1 Algorithm Design

Input: Training dataset TrainData[], various activation functions[], Threshold Th

Output: Extracted features Feature_set[] for the completed trained module.

Step 1: Set input block of data d[], activation function, epoch size
Step 2: [Link] ← ExtractFeatures(d[])
Step 3: Feature_set[] ← optimized([Link])
Step 4: Return Feature_set[]

Test Module

Input: Train_Feature set [] // Set of training dataset
Test_Feature set [] // Set of test dataset
Threshold denominator Th
Collection List cL

Output: All instances classified with the desired weight.

Step 1: Read all features from the testing dataset using the function below
∑( )
Step 2: Read all features from the training dataset using the function below
∑( )
Step 3: Read all features from Trainset using below
Step 4: Generate the weight of both feature sets
( )
Step 5: Verify with Th
Selected_Instance = result = Weight > Th ? 1 : 0;
Add each selected_instance into cL, when n = null
Step 6: Return cL

2.2 Literature survey

Crowd safety in public places has always been a serious but difficult issue, especially in high-density gathering areas. The higher the crowd level, the easier it is to lose control [1], which can result in severe casualties. To aid mitigation and decision-making, it is important to find an intelligent form of crowd analysis for public areas. Crowd counting and density estimation are valuable components of crowd analysis [2], since they can help measure the importance of activities and provide the appropriate staff with information to aid decision-making. As a result, crowd counting and density estimation have become hot topics in the security sector, with applications ranging from video surveillance and traffic control to public safety and urban planning [3]. Crowd monitoring systems are in very high demand these days. However, current crowd monitoring products have a number of flaws, such as being constrained by application scenes or having low precision. In particular, there is a lack of research on tracking the number of pedestrians in a large-scale crowded area [4]. Crowd counting methods fall into two types: detection-based and regression-based. Detection-based crowd counting methods typically employ a sliding window to detect each pedestrian in the scene, calculate the pedestrian's approximate location, and then count the number of pedestrians [5–7]. For low-density crowd scenes, detection-based methods may produce decent results, but they are severely limited for high-density crowd scenes. The early regression-based methods [8–10] attempt to learn a direct mapping between low-level features derived from local image blocks and the head count. Direct regression-based approaches like these only count the number of

pedestrians while missing essential spatial information. Learning the linear or non-linear mapping between local block features and their corresponding target density maps, as in references [11,12], integrates spatial information into the learning process. Inspired by the performance of the Convolutional Neural Network (CNN) in many computer vision tasks, researchers began to use CNNs to learn nonlinear functions from crowd images to density maps or counts. In 2015, Wang et al. [13] used the AlexNet network structure [14] to apply CNNs to the crowd counting task. To count the number of pedestrians in the crowd picture, the fully connected layer with 4096 neurons was replaced by a layer with only one neuron.

In the same year, Zhang et al. [4] discovered that when existing approaches were applied to new scenes that differed from the training dataset, their performance dropped significantly. To address this problem, a data-driven approach was proposed for fine-tuning the pre-trained CNN model with training samples that were close to the density level in the new scene, allowing it to adapt to unknown application scenes. This approach eliminates the need for retraining when the model is transferred to a new scene, but it still requires a large amount of training data, and it is difficult to predict the density level of the new scene in practice. In 2016, Zhang et al. [16] proposed a multi-column convolutional neural network (MCNN) architecture, building on the success of multi-column networks [15] in image recognition. The network consists of three columns of filters corresponding to receptive fields of different sizes (large, medium, small) to adapt to changes in head size due to perspective effects or image resolution. Each column of the MCNN is pre-trained on all image blocks, and then the three networks are combined for fine-tuning. The training process is complicated because there is a lot of redundancy in the structure.

In 2017, Sam et al. [17] proposed the switching convolutional neural network for crowd counting (Switching CNN), which trains regressors on specific sets of training patches based on the different crowd densities in the picture. The network is made up of multiple independent CNN regressors, similar to a multi-column network, with the addition of a Switch classifier based on the VGG-16 [18] architecture that picks the best regressor for each input block. The Switch classifier and the independent regressors are trained alternately. Switching CNN, on the other hand, switches between regressors using the Switch classifier, which is very costly and often unreliable. Similar to Refs. [16,17], Kumagai et al. [19] proposed in 2017 a hybrid neural network, the Mixture of CNNs (MoCNN), believing that a single predictor is insufficient to accurately predict the number of pedestrians across varied scene environments. The model consists of a combination of expert CNNs and a gating CNN. The appropriate expert CNN is adaptively selected on the basis of the context of the input picture. In prediction, the expert CNNs estimate the image's head count, while the gating CNN estimates the appropriate probability for each expert CNN. These probabilities are then used as weighting factors in calculating a weighted average of all expert CNNs' head counts. By training the gating CNN, MoCNN not only trains numerous expert CNNs but also learns the probability associated with each expert CNN's head-count estimate. However, it can only be used for crowd count estimation and provides no information on the crowd density distribution. Tang et al. [20] proposed a low-rank and sparse-based deep-fusion convolutional neural network for crowd counting (LFCNN) that improved the accuracy of the projection from the density map to the global count by using a regression approach based on low-rank and sparse penalty.

3. METHODOLOGY

Due to variations in perspective and scene, the distribution of crowd density in crowded images is seldom uniform. Figure 3 shows several examples of photographs. As a result, counting the crowd by looking at the entire picture is unreliable. We therefore adopted the divide-count-sum approach in our system. After dividing the images into patches, a regression model maps each image patch to a local count. Finally, the sum of these patch counts gives the global image count. Segmenting the image into patches has two benefits. First, the crowd density in small image patches has a fairly uniform distribution. Second, image segmentation increases the amount of training data available to the regression model. Because of these benefits, we can train a more robust regression model. Although the crowd density is not uniform, the overall density distribution is continuous. This means that neighboring image patches should have similar densities. We also divide the image into overlapping patches, which strengthens the relation between neighboring patches. To compensate for potential patch estimation errors and to bring the overall result closer to the true density distribution, the

Markov random field is used to smooth the estimated counts between overlapping image patches.
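
The overlapping division and the smoothing step can be illustrated with a short sketch. The paper does not state the MRF energy it uses, so the quadratic (Gaussian) pairwise term, the 4-neighbourhood, the Jacobi update, and the names `patch_origins`, `smooth_counts`, `weight`, and `iterations` are all assumptions made here for illustration only. The sketch minimizes E(c) = Σ_i (c_i − y_i)² + λ Σ_{i~j} (c_i − c_j)², where y_i are the raw patch counts produced by the regressor and λ is `weight`.

```python
def patch_origins(length, patch, stride):
    """Offsets of overlapping patches along one image axis (stride < patch overlaps)."""
    if patch > length or stride <= 0:
        raise ValueError("patch must fit in the image and stride must be positive")
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # last patch sits flush with the border
    return origins


def smooth_counts(raw, weight=0.5, iterations=50):
    """Smooth a grid of local counts with an assumed quadratic pairwise MRF.

    Each Jacobi step sets a cell to the value that minimizes the energy with
    its neighbours held fixed:
        c_i = (y_i + weight * sum(neighbours)) / (1 + weight * n_neighbours)
    """
    rows, cols = len(raw), len(raw[0])
    current = [row[:] for row in raw]
    for _ in range(iterations):
        updated = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                neighbours = [
                    current[nr][nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= nr < rows and 0 <= nc < cols
                ]
                updated[r][c] = (raw[r][c] + weight * sum(neighbours)) / (
                    1.0 + weight * len(neighbours)
                )
        current = updated
    return current


if __name__ == "__main__":
    print(patch_origins(10, 4, 3))
    noisy = [[10.0, 10.0, 10.0], [10.0, 40.0, 10.0], [10.0, 10.0, 10.0]]
    for row in smooth_counts(noisy):
        print([round(v, 2) for v in row])
```

A larger `weight` pulls neighbouring counts closer together, while `weight = 0` returns the raw regressor outputs unchanged. How the smoothed local counts are combined into a global count when patches overlap is not specified in the text and is left out of the sketch.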

We use a pre-trained deep residual network to extract features from image patches, and a fully connected neural network to learn a mapping from these features to the local count. Deep convolutional network features have been used in a variety of computer vision tasks, including image recognition, object detection, and image segmentation. This suggests that the features learned by deep convolutional networks are applicable to a wide range of computer vision tasks. The representation ability of the learned features improves as the number of network layers increases. A deeper model, on the other hand, requires more training data. Current datasets for crowd counting are insufficient to train a very deep convolutional neural network from scratch. To extract features from an image patch, we therefore use a pre-trained deep residual network. Instead of learning unreferenced functions, the residual network resolves the degradation problem by reformulating the layers as learning residual functions with reference to the layer inputs. To extract deep features that reflect the density of the crowd, we use the residual network trained on the ImageNet dataset for image classification. This pre-trained network forms a residual unit from every three convolution layers and has 152 layers in total. To get the 1000-dimensional features, we resize the image patches to 224 by 224 pixels as the model's input and extract the output of the fc1000 layer.

Following that, the features are used to train a five-layer fully connected neural network. The input to the network is 1000-dimensional, and the numbers of neurons in its layers are 100-100-50-50-1. The network's output is the local crowd count. The learning objective of the fully connected network is to minimize the mean squared error over the training patches.

Fig -1: Proposed System View

Generate Train set and Test set: In this phase we first create the training and testing datasets for the proposed system. The basic objective of this module is to generate the ground-truth values for both datasets. Three attributes are extracted from each image: height, width, and channel. The module reads the actual pixel values of each image during data creation. The outcome of this process is a pair of .csv files, one for training and one for testing.

Pre-processing and Normalization: Image acquisition and image resizing are used to produce a fixed size for each object, while a Gaussian filter is used to remove noise from the object.

CNN (Training and Testing): The architecture of a CNN is quite different from that of a conventional neural network. In a conventional neural network, input values are transformed by passing through a series of hidden layers. Every layer is made up of a set of neurons, and each neuron is fully connected to all neurons in the previous layer. CNNs perform better because they capture the inherent properties of images. This significant feature of CNNs gave us the confidence to use them in the analysis of our proposed dataset.
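
The five-layer fully connected regressor described in this section (1000-dimensional input, 100-100-50-50-1 neurons, mean-squared-error objective) can be sketched in plain Python as below. This is only an illustrative forward pass under stated assumptions: the text does not name the hidden activation or the weight initialization, so the ReLU activations, the He-style scaling, and the helper names `init_network`, `forward`, `count_parameters`, and `mean_squared_error` are hypothetical choices, and no training loop is shown.

```python
import random

# Input width followed by the layer widths given in the text (100-100-50-50-1).
LAYER_SIZES = [1000, 100, 100, 50, 50, 1]


def init_network(sizes, seed=0):
    """Return one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # He-style scaling: an assumption, not from the paper
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, features):
    """Map one 1000-dimensional feature vector to a scalar local count."""
    activation = features
    for index, (weights, biases) in enumerate(layers):
        pre = [
            sum(w * a for w, a in zip(row, activation)) + b
            for row, b in zip(weights, biases)
        ]
        is_output = index == len(layers) - 1
        # ReLU on hidden layers (assumed); the output neuron stays linear for regression.
        activation = pre if is_output else [max(0.0, z) for z in pre]
    return activation[0]


def count_parameters(layers):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


def mean_squared_error(predictions, targets):
    """Training objective: mean squared error over the patches."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


if __name__ == "__main__":
    network = init_network(LAYER_SIZES)
    rng = random.Random(1)
    patch_features = [rng.random() for _ in range(1000)]
    print("trainable parameters:", count_parameters(network))
    print("untrained local count:", forward(network, patch_features))
```

With these layer widths the regressor has 117,851 trainable parameters, which is small next to the 152-layer feature extractor and is consistent with the remark that current crowd-counting datasets cannot train a very deep network from scratch.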

TensorFlow Library Module: This module implements the access interfaces to the deep learning library, TensorFlow, and must be customized for each deep learning tool. These APIs need to be compatible with the application's source code.

Optimization: Adam is a method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps

an exponentially decaying average of past gradients $m_t$, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

Here $g_t$ is the gradient at time step $t$. $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method. As $m_t$ and $v_t$ are initialized as vectors of zeros, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small.

Mapping with ground truth: This is the analysis module, which validates the system's efficiency by comparing the actual class label with the predicted class label. In the first module we generate the actual count and density map of each image, while in testing the CNN predicts the count for the respective input image. The ground-truth score and the predicted score give the accuracy of the system using the formula below.

The average accuracy of the system is around 85% under various cross-validation settings.

3.1 S/W and H/W Requirement

1. H/W Requirement

• Processor: CPU

• RAM: 2GB

• Hard Disk Space: 20GB

• Core: TensorFlow 1.2

2. S/W Requirement

• Language: Python

• IDE: Python 3.6 and Matlab 16

• Database: MYSQL

• Platform: Microsoft Windows 10

3.2 Datasets Used

• Input dataset: Images that contain a large crowd.
• Outcomes: The number of objects in the entire image, computed using the proposed CNN approach.
• Dataset: We use a dataset of real-time crowd images. We also use videos of around 5 to 10 MB that contain a large crowd (such as cricket audience video).

4. RESULTS AND DISCUSSION

This system's performance can only be measured by comparing it with other systems that attempt to solve a similar end-user problem. Figure 2 shows the efficiency of the proposed system compared with other recent approaches in the literature.

Fig -2: Experiment 1
In Experiment 1 the ground-truth count is 651, the estimated count is 589, the accuracy is 90.47% and the error rate is 9.53%.

Fig -3: Experiment 2
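
The accuracy and error-rate values reported with each experiment in this section appear consistent with a relative counting error, error = |ground truth − estimate| / ground truth, with accuracy = 100% − error. The accuracy formula itself is not reproduced in the text, so this relationship and the helper name `counting_accuracy` are assumptions inferred from the reported numbers, offered as a minimal sketch.

```python
def counting_accuracy(ground_truth, estimated):
    """Return (accuracy %, error rate %) from the relative counting error.

    Assumed definition: error = |ground_truth - estimated| / ground_truth.
    """
    if ground_truth <= 0:
        raise ValueError("ground-truth count must be positive")
    error_rate = abs(ground_truth - estimated) / ground_truth * 100.0
    return 100.0 - error_rate, error_rate


if __name__ == "__main__":
    # Ground-truth and estimated counts as reported for the four experiments.
    for truth, estimate in [(651, 589), (505, 475), (653, 585), (670, 529)]:
        accuracy, error = counting_accuracy(truth, estimate)
        print(f"{truth} vs {estimate}: accuracy {accuracy:.2f}%, error {error:.2f}%")
```

For a ground truth of 651 and an estimate of 589 this gives about 90.48% accuracy, which agrees with the reported 90.47% up to rounding.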

In Experiment 2 the ground-truth count is 505, the estimated count is 475, the accuracy is 94.05% and the error rate is 5.95%.

Fig -4: Experiment 3
In Experiment 3 the ground-truth count is 653, the estimated count is 585, the accuracy is 89.58% and the error rate is 10.42%.

Fig -5: Experiment 4
In Experiment 4 the ground-truth count is 670, the estimated count is 529, the accuracy is 78.95% and the error rate is 21.05%.

5. CONCLUSIONS

We present a CNN-based method for counting people in still images from various scenes. Crowd density is well represented by the features derived from a CNN model trained for other computer vision tasks. When the image is divided into overlapping patches, neighboring local counts are strongly correlated. The Markov random field can use this correlation to smooth adjacent local counts for a more accurate overall count. Experimental findings show that the proposed method outperforms other recent related methods.

• The system will provide better accuracy for crowd detection from heterogeneous images.
• This approach is able to work on both image and video datasets.
• Various feature extraction and selection techniques provide good detection accuracy.
• The system uses ResNet, a deep convolutional network that provides up to 152 hidden layers.

We can extend the system with multiple convolutional layers and an ensemble deep learning model for higher accuracy.

REFERENCES

[1] Fruin, J.J. Pedestrian Planning and Design; Metropolitan Association of Urban Designers and Environmental Planners: New York, NY, USA, 1971.
[2] Zhan, B.; Monekosso, D.; Remagnino, P.; Velastin, S.; Xu, L.-Q. Crowd analysis: A survey. Mach. Vis. Appl. 2008, 19, 345–357.
[3] Zeng, L.; Xu, X.; Cai, B.; Qiu, S.; Zhang, T. Multi-scale convolutional neural networks for crowd counting. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 465–469.
[4] Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 833–841.
[5] Leibe, B.; Seemann, E.; Schiele, B. Pedestrian detection in crowded scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; pp. 878–885.
[6] Zhao, T.; Nevatia, R.; Wu, B. Segmentation and tracking of multiple humans in crowded environments. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1198–1211.
[7] Ge, W.; Collins, R.T. Marked point processes for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2913–2920.
[8] Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 24–26 June 2008; pp. 1–7.
[9] Ryan, D.; Denman, S.; Fookes, C.; Sridharan, S. Crowd counting using multiple local features. In Proceedings of the Digital Image Computing: Techniques and Applications, Melbourne, Australia, 1–3 December 2009; pp. 81–88.
[10] Chan, A.B.; Vasconcelos, N. Bayesian Poisson regression for crowd counting. In Proceedings of the IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 545–551.
[11] Lempitsky, V.; Zisserman, A. Learning to count objects in images. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–9 December 2010; pp. 1324–1332.
[12] Pham, V.Q.; Kozakaya, T.; Yamaguchi, O.; Okada, R. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE International Conference on

Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3253–3261.
[13] Wang, C.; Zhang, H.; Yang, L.; Liu, S.; Cao, X. Deep people counting in extremely dense crowds. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 1299–1302.
[14] Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
[15] Schmidhuber, J.; Meier, U.; Ciresan, D. Multi-column deep neural networks for image classification. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3642–3649.
[16] Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597.
[17] Sam, D.B.; Surya, S.; Babu, R.V. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5744–5752.
[18] Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
[19] Kumagai, S.; Hotta, K.; Kurita, T. Mixture of Counting CNNs: Adaptive Integration of CNNs Specialized to Specific Appearance for Crowd Counting. arXiv 2017, arXiv:1703.09393.
[20] Zhang, L.; Shi, M.; Chen, Q. Crowd counting via scale-adaptive convolutional neural network. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–14 March 2018; pp. 1113–1121.
