An Effective Approach For Violence Detection Using Deep Learning and Natural Language Processing
Abstract—An effective tool for violence detection is in high demand given the rise in crime rates today. Artificial Intelligence can play a significant role in violence detection and monitoring to address security and safety concerns. This research proposes strategies that combine Deep Learning and Natural Language Processing (NLP) to simultaneously detect anomalous objects and scenarios in video using TensorFlow, and aggressive, offensive, and hate speech in the audio channel of surveillance cameras. The research aims to detect violence automatically and in real time from surveillance footage by using TensorFlow custom object detection to identify firearms, robbery, fistfights, sexual harassment, and fire in successive images from the video feed. In addition, the audio channel of such surveillance cameras can be useful for detecting hate speech, verbal sexual abuse, and profanity. The proposed system includes an alert mechanism that, on detecting any type of violence, automatically notifies the security administrator, enabling timely intervention to prevent potential harm to society. The developed models can be deployed on any existing surveillance system with next to negligible additional hardware and software resource requirements, making the system an efficient, fast, accurate, and economical solution. To train the models, custom datasets were designed for 6 categories in images and 2 categories in speech. The accuracy of the developed system was found to be 84%, with adequate performance under various luminance conditions, including night vision images.

Index Terms—Violence Detection, Object Detection, Deep Learning, Natural Language Processing, Artificial Intelligence, TensorFlow, Smart Surveillance.

I. INTRODUCTION
Security, women’s safety, and law enforcement are among the most important factors in the development and prosperity of a country. This underlines the significance of surveillance systems and violence detection. Moreover, violence involving pistols, fire, snatching, sexual harassment, and fist fights can occur at any time, and detecting it has traditionally required human monitoring, which is an inefficient way to catch such unusual and unpredictable events. Surveillance technology has been modernized through algorithms from Artificial Intelligence, Machine Learning, and Natural Language Processing. These smart systems have replaced continuous monitoring by humans and reduced the occurrence of errors in violence detection. Machine learning and NLP produce results with greater accuracy when models are trained on adequate datasets. In this research, the primary focus is on violence detection through Artificial Intelligence, which is divided into two portions. The first portion is object-based detection, which involves identifying acts of violence such as fire, snatching, fist fights, and the use of weapons like pistols, as well as instances of sexual harassment. The second portion is speech-based detection, which focuses specifically on detecting instances of abuse by analyzing spoken language. The developed models can be deployed on any existing surveillance system with next to negligible additional hardware and software resource requirements, making the system an efficient, fast, accurate, and economical solution.

The paper is structured as follows. Work related to violence detection by object or speech using various algorithms is discussed in Section II. Section III provides information on the equipment, algorithm, experimental setup, and implementation methodology of the developed violence detector. Section IV evaluates the performance of the developed system and presents the results, and Section V concludes the paper.

II. LITERATURE REVIEW
Many researchers have proposed techniques to detect violence through computer vision or deep learning. Violence can occur at any time and in several ways, such as the use of a pistol in a crowd, fire, fist fights, handbag snatching, and sexual harassment. These categories of violence are common today, and detecting them automatically, in place of continuous human monitoring, has become a significant research topic.
Authorized licensed use limited to: Chaitanya Bharathi Institute of Tech - HYDERABAD. Downloaded on November 06,2023 at 04:03:22 UTC from IEEE Xplore. Restrictions apply.
For violence detection, several studies use both object and text detection. The surveys in [1] and [2] provide a comprehensive analysis of the current state and emerging trends in violence detection research, including the categorization of methods, open challenges, and datasets for testing. [3] introduces BrutNet, a hybrid model combining DCNN and GRU architectures, for automatic violence detection and classification in videos. The work in [4] uses Convolutional Neural Networks (CNN) to detect weapons in video surveillance. [5] proposes using YOLO-V3 for real-time automated visual surveillance to detect handguns. The detection system in [6] uses a pre-trained deep learning model, MobileNet V3-SSD Lite. [7] proposes a deep learning-based method for predicting abnormal events in daycare environments using networked surveillance systems and IoT devices; it reports superior performance compared to previous methods by utilizing multi-classifiers, deep neural networks, and kernel density functions for dynamic activity prediction. Different algorithms have been used to detect fire and smoke in different areas in order to improve system efficiency and speed [8] [9] [10]. [11] proposes an intelligent surveillance system that automatically detects multiple anomalous activities in videos using moving object detection, object tracking, and behavior understanding, with a detection accuracy of up to 90% based on experimental results, addressing the need for efficient monitoring of surveillance videos in public places.

Sexual, abusive, and hate speech detection is useful to prevent bullying and harassment, as these crimes are rising rapidly worldwide. Different approaches have been used to build models for this task. The approaches in [12] and [13] used ML algorithms, e.g., Random Forest, Multinomial Naïve Bayes, and SVM with linear and Radial Basis Function kernels, and compared them using Count Vectorizer and Tfidf Vectorizer features, while [14] evaluates system performance using ML algorithms and reports an accuracy of 0.97.

In the past, many researchers worked on the individual detection of unusual objects and events, but none has worked on a combined model able to detect both objects and text. Text classification has been done in previous work, but none of these studies used audio and video input from surveillance cameras to simultaneously detect objects and text using two different approaches. In this work, custom object detection trained on five categories using TensorFlow runs in parallel with text classification over two categories using NLP. The proposed model can be deployed in a real-time surveillance system with no additional tools and achieves greater accuracy using the ML technique.

III. IMPLEMENTATION METHODOLOGY
This section describes in detail the approaches used to build the system. The proposed system has been developed using two algorithms. The first is TensorFlow custom object detection, used to detect violence, in particular pistol, fire, fist fight, snatching, and sexual harassment, within the scope of the surveillance camera. The second is sexual abuse and hate speech text detection using the Natural Language Toolkit (NLTK) library for NLP. This section also provides details of the hardware and software used in the development of the proposed model. Both approaches are explained separately. The block diagram of the proposed system is given in Fig. 1. Initially, datasets pertaining to the 5 image categories were acquired from existing datasets available online, and custom additions were made to them. This is further discussed in section A. Similarly, the text dataset was also acquired online and merged with the created dataset (as discussed in section B) for binary classification of speech from microphones after converting it to text. After the dataset acquisition phase, two separate models were trained for images and text and deployed using parallel processing in Python. Upon detection of any anomaly by either one or both models, a warning signal is generated, along with the location of the violent activity, so that the security administration can respond quickly and avoid any mishap.

Fig. 1: Development Process Block Diagram

The recommended hardware and software requirements are given in TABLE I and TABLE II respectively. However, it is feasible to reproduce the system without these requirements, although the accuracy, training time, and real-time performance may differ.

A. Object Detection
Object detection can serve as an efficient tool for detecting a set of objects in an image or a video feed, along with the location of the given object(s) in the scene. The object detection model is deployed to locate the 5 categories of objects/scenarios for violence detection. Training and testing were carried out with images of different resolutions, captured at different distances and under various lighting conditions. The details of the dataset and TensorFlow model used are given in this section.

1) Dataset Acquisition and Training: To build a dataset to achieve the goals of this research, images were collected from various sources including Mehran University IICT building security cameras, an Android phone IP camera, datasets available online, and downloads from Google Images. This enabled
the model to learn and generalize well, enabling accurate performance in various real-world situations and making it more robust and adaptable. The resolution of the images in the custom dataset is limited, as indicated in TABLE III, ranging from minimum resolution to maximum resolution. Although higher-resolution images in a real-time environment could potentially yield better accuracy, the accuracy described in this paper is achieved within the range from minimum to maximum resolution.

TABLE I: Recommended hardware requirement

Hardware
Workstation
  Model         Lenovo Legion Y545
  CPU           Intel i7-9750H 2.6 GHz (9th Generation)
  GPU           Nvidia GeForce GTX 1660Ti 6 GB
  RAM           16 GB
  Storage       1 TB HDD + 512 GB SSD
  OS            Windows 10 Home
Smart Phone (For IP Webcam)
  Device name   OPPO A53
  Model         CPH2127
  CPU           Qualcomm SM4250 Octa Core
  RAM           4.00 GB
  Storage       64.0 GB
  OS            Android 10
words with larger sizes. It provides a quick and easy way to
identify the most prominent words in a dataset.
continuously monitors the environment and looks for objects and scenarios such as fire, pistols, snatching, fist fights, and sexual harassment, as well as abusive speech. The Python interface supports parallel processing, so the two tasks (object detection and text detection) run at the same time. On detection of any abnormal scenario or abusive speech, an alarm/notification system is activated to inform the security administration. The flowchart of the proposed system is given in Fig. 7.

Fig. 8: Pistol Detection
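The parallel monitoring described above can be sketched as follows. This is a minimal illustration only, not the deployed implementation: the text does not state which Python mechanism is used for parallel processing, so threads and a shared queue are assumed here, and `object_detector`, `speech_detector`, and `run_monitor` are hypothetical stand-ins that consume already-labelled frames and already-scored transcripts instead of calling the trained models.

```python
import queue
import threading

# Class names follow the five image categories named in the paper.
VIOLENT_LABELS = {"pistol", "fire", "fist fight", "snatching", "sexual harassment"}


def object_detector(frames, alerts):
    """Hypothetical stand-in for the object detection model.

    `frames` is an iterable of (frame_id, predicted_label) pairs.
    """
    for frame_id, label in frames:
        if label in VIOLENT_LABELS:
            alerts.put(("object", frame_id, label))


def speech_detector(transcripts, alerts, threshold=0.5):
    """Hypothetical stand-in for the text classifier.

    `transcripts` is an iterable of (clip_id, abusive_score) pairs.
    """
    for clip_id, score in transcripts:
        if score >= threshold:
            alerts.put(("speech", clip_id, score))


def run_monitor(frames, transcripts):
    """Run both detectors in parallel and return the collected alerts."""
    alerts = queue.Queue()  # thread-safe channel shared by both detectors
    workers = [
        threading.Thread(target=object_detector, args=(frames, alerts)),
        threading.Thread(target=speech_detector, args=(transcripts, alerts)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    collected = []
    while not alerts.empty():
        collected.append(alerts.get())
    # Sort by (source, id) so the result does not depend on thread timing.
    return sorted(collected, key=lambda alert: (alert[0], alert[1]))


if __name__ == "__main__":
    demo_frames = [(0, "person"), (1, "pistol"), (2, "fire")]
    demo_transcripts = [(0, 0.02), (1, 0.97)]
    for alert in run_monitor(demo_frames, demo_transcripts):
        print("ALERT:", alert)
```

In a real deployment each alert would be forwarded to the notification system rather than printed, and a process-based design could be substituted if the two models compete for the interpreter.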
TABLE VI: Performance of object detector with different light intensities

S. No.  Object Type         LUX (Light Intensity)  False Positives  Number of Detections
1       Fire                117                    No               1
2       Fire                31                     No               1
3       Fighting            120                    Yes              2
4       Fighting            20                     No               1
5       Fighting            215                    No               1
6       Sexual Harassment   120                    Yes              2
7       Sexual Harassment   117                    No               1
8       Sexual Harassment   136                    Yes              2
9       Handbag Snatching   118                    No               1
10      Handbag Snatching   23                     No               1
11      Pistol              117                    No               1
12      Pistol              25                     No               1
13      Pistol              20                     No               1

dataset shows a further decrease to 84%. The accuracy and loss metrics have been evaluated on the training, validation, and testing datasets, providing a comprehensive overview of the model’s performance.

TABLE VIII: Accuracy and Loss

Dataset     Accuracy  Loss
Training    0.91      0.021
Validation  0.86      0.0103
Testing     0.84      0.0320

Fig. 14 illustrates the concept of Intersection over Union (IoU) as applied to a violence detection model. IoU serves as a performance and efficiency metric by quantifying the overlap between the predicted bounding box (detection) and the ground truth bounding box of an object. IoU has been calculated using Eq. (3). As shown in Fig. 14, the IoU is observed to be greater than 0.5, indicating that the detections are classified as true positives.
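As an illustration of the IoU criterion, the following sketch computes the standard ratio of intersection area to union area for two axis-aligned boxes and applies the 0.5 threshold mentioned above. Eq. (3) is not reproduced in this section, so the standard IoU definition is assumed; the `(x_min, y_min, x_max, y_max)` box format and the function names are likewise assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted_box, ground_truth_box, threshold=0.5):
    """A detection counts as a true positive when its IoU exceeds the threshold."""
    return iou(predicted_box, ground_truth_box) > threshold


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    ground_truth = (2, 0, 12, 10)
    print(round(iou(predicted, ground_truth), 3))
    print(is_true_positive(predicted, ground_truth))
```

For the two boxes in the usage example the overlap is 8 x 10 = 80 and the union is 100 + 100 - 80 = 120, giving an IoU of about 0.667, which is above the 0.5 threshold.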
[Figure panels: (a) Image Acquisition Time; (b) Model Inference Time]

a test sample with a score less than 0.5 is classified as normal speech, while a score greater than or equal to 0.5 indicates abusive speech. The table further details the occurrences of true positives and false positives, providing insights into the performance and efficiency of the model.

TABLE IX: Text Classification Results

S. No.  Test Samples                      Abusive Speech  Normal Speech  True Positives  False Positives
1       You bitch stay away from me       Yes (0.97)      No             Yes             No
2       Good Morning                      No (0.02)       Yes            Yes             No
3       You bloody, Stay away from me,    Yes (0.94)      No             Yes             No
        I gonna kill you
4       I really curse those people who   Yes (0.53)      No             Yes             No
        speak bad and abusive language
        in society
5       She was threatened by her         Yes (0.99)      No             No              Yes
        neighbor
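The 0.5 decision rule applied to the scores in TABLE IX can be sketched as follows. This is an illustration under stated assumptions: the scores are copied from the table, while the reference labels are inferred from its True Positives and False Positives columns rather than given explicitly, and the names `classify` and `SAMPLES` are hypothetical.

```python
ABUSIVE_THRESHOLD = 0.5


def classify(score, threshold=ABUSIVE_THRESHOLD):
    """Scores at or above the threshold are abusive; lower scores are normal."""
    return "abusive" if score >= threshold else "normal"


# (sample number, model score, reference label inferred from TABLE IX)
SAMPLES = [
    (1, 0.97, "abusive"),
    (2, 0.02, "normal"),
    (3, 0.94, "abusive"),
    (4, 0.53, "abusive"),
    (5, 0.99, "normal"),  # flagged as abusive by the model: a false positive
]

predictions = {number: classify(score) for number, score, _ in SAMPLES}

# A false positive is a sample predicted abusive whose reference label is normal.
false_positives = [
    number
    for number, _, reference in SAMPLES
    if predictions[number] == "abusive" and reference == "normal"
]

if __name__ == "__main__":
    print(predictions)
    print("False positives:", false_positives)
```

Sample 4, with a score of 0.53, shows how close to the boundary a correct decision can fall, which is why the choice of threshold matters for borderline sentences.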
training, validation, and testing datasets. TABLE XI details the processing time taken by the system with and without the Google API. Fig. 18 shows the system’s performance when using the Online Google API and the Offline Google API. The results indicate that the Online Google API takes longer to process than the Offline Google API. However, using the Offline Google API may result in higher memory consumption.

Fig. 18: Overall Processing with and without Google API

V. CONCLUSION
This model proved to be an efficient, fast, accurate, and economical solution for violence detection with no additional hardware and software requirements. The system is able to detect violence through surveillance cameras in environments with low light intensity. During the model implementation in the AV room of the IICT building, it was observed that pistol and fire detection show greater accuracy than the other three classes. The Sexual Harassment and Fist Fight classes produced false positives and therefore show lower accuracy. This system can be implemented in universities, hospitals, banks, and similar settings for public safety, since it can detect speech and objects simultaneously. The three categories of Sexual Harassment, Handbag Snatching, and Fist Fight have higher false positives among them because they look similar. In the future, this can be improved either by increasing the dataset or by merging these three categories into one (physical abnormal activity), in which case the accuracy is expected to increase substantially. The system can achieve much better accuracy if the dataset is increased for both object detection and text classification. This can be done in future work to remove false positives and raise the accuracy level.

ACKNOWLEDGMENT
This research work has been carried out in Research Lab-1 and the Audio Video Conference Room, IICT Building, Mehran University of Engineering and Technology, Jamshoro. The authors are thankful to the Dean FEECE, Professor Dr. Mukhtiar Ali Unar, for granting permission to use the IICT building CCTV surveillance cameras to build the dataset for violence detection. We are also thankful to Engr. Ghulam Mustafa Baloch, the officer in charge of the Video Conferencing System, IICT, Mehran UET Jamshoro, for facilitating this research by providing the CCTV recordings for building the custom dataset.

REFERENCES
[1] F. U. M. Ullah, M. S. Obaidat, A. Ullah, K. Muhammad, M. Hijji, and S. W. Baik, “A comprehensive review on vision-based violence detection in surveillance videos,” ACM Computing Surveys, vol. 55, no. 10, pp. 1–44, Feb. 2023.
[2] B. Omarov, S. Narynov, Z. Zhumanov, A. Gumar, and M. Khassanova, “State-of-the-art violence detection techniques in video surveillance security systems: a systematic review,” PeerJ Computer Science, vol. 8, p. e920, Apr. 2022.
[3] M. Haque, S. Afsha, and H. Nyeem, “An efficient deep learning model for violence detection,” 2023.
[4] H. Jain, A. Vikram, Mohana, A. Kashyap, and A. Jain, “Weapon detection using artificial intelligence and deep learning for security applications,” in 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC). IEEE, Jul. 2020.
[5] A. Warsi, M. Abdullah, M. N. Husen, and M. Yahya, “Automatic handgun and knife detection algorithms: A review,” in 2020 14th International Conference on Ubiquitous Information Management and Communication (IMCOM). IEEE, Jan. 2020.
[6] M. Ghazal, N. Waisi, and N. Abdullah, “The detection of handguns from live-video in real-time based on deep learning,” TELKOMNIKA (Telecommunication Computing Electronics and Control), vol. 18, no. 6, p. 3026, Dec. 2020.
[7] G. Vallathan, A. John, C. Thirumalai, S. Mohan, G. Srivastava, and J. C.-W. Lin, “Suspicious activity detection using deep learning in secure assisted living IoT environments,” The Journal of Supercomputing, vol. 77, no. 4, pp. 3242–3260, Jul. 2020.
[8] M. Grega, A. Matiolański, P. Guzik, and M. Leszczuk, “Automated detection of firearms and knives in a CCTV image,” Sensors, vol. 16, no. 1, p. 47, Jan. 2016.
[9] M. S. Allauddin, G. S. Kiran, G. R. Kiran, G. Srinivas, G. U. R. Mouli, and P. V. Prasad, “Development of a surveillance system for forest fire detection and monitoring using drones,” in IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, Jul. 2019.
[10] A. Namozov and Y. I. Cho, “An efficient deep learning algorithm for fire and smoke detection with limited data,” Advances in Electrical and Computer Engineering, vol. 18, no. 4, pp. 121–128, 2018.
[11] S. Chaudhary, M. A. Khan, and C. Bhatnagar, “Multiple anomalous activity detection in videos,” Procedia Computer Science, vol. 125, pp. 336–345, 2018.
[12] T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” Proceedings of the International AAAI Conference on Web and Social Media, vol. 11, no. 1, pp. 512–515, May 2017.
[13] G. M. Barrientos, R. Alaiz-Rodríguez, V. González-Castro, and A. C. Parnell, “Machine learning techniques for the detection of inappropriate erotic content in text,” International Journal of Computational Intelligence Systems, vol. 13, no. 1, p. 591, 2020.
[14] F. Husain, “Arabic offensive language detection using machine learning and ensemble machine learning approaches,” 2020.