Violence Detection From Industrial Surveillance Videos Using Deep Learning
ABSTRACT The integration of Internet of Things (IoT) technology in industrial surveillance and the proliferation of surveillance cameras in smart cities have enabled the development of real-time activity recognition and violence detection systems. These systems are crucial in enhancing safety measures, improving operational efficiency, reducing accident risks, and providing automatic monitoring in dynamic environments. In this paper, we propose a three-stage deep learning-based end-to-end framework for violence detection. First, a lightweight convolutional neural network (CNN) model identifies individuals in the video stream to minimize the processing of irrelevant frames. Subsequently, a sequence of 50 frames containing the identified persons is directed to a 3D-CNN model, where the spatiotemporal features of the sequence are extracted and passed to the classifier. Unlike traditional methods that process all frames indiscriminately, this targeted filtering mechanism allows computational resources to be allocated more effectively. Next, a SoftMax classifier processes the extracted features to categorize frame sequences as violent or non-violent. The classifier's predictions trigger real-time alerts, enabling rapid intervention. The modularity of this stage supports adaptability to new datasets, as it can leverage transfer learning to generalize across diverse surveillance contexts. Unlike traditional systems constrained by hand-crafted features, this design learns directly from data, reducing reliance on prior domain knowledge and improving generalizability. We conducted violence detection experiments on four datasets, comparing the performance of our model with convolutional LSTM (ConvLSTM) models. A computation time analysis revealed that our lightweight model requires significantly less computation time, demonstrating its efficiency. We also conducted cross-dataset experiments to assess the model's capacity to perform consistently across various datasets. Experiments show that our proposed model outperforms the methods reported in the existing literature, while also indicating that cross-dataset adaptability and robustness still leave room for improvement.

INDEX TERMS Activity detection, industrial surveillance, violence detection, computer vision, deep learning.
I. INTRODUCTION
Deep learning approaches have increasingly been adopted for violence detection, outperforming traditional methods. Deep learning models can automatically extract representative features from raw data, offering a major advantage over traditional machine learning approaches. For instance, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been widely used to identify spatial and temporal features in video data, respectively.

In recent studies, models such as 3D Convolutional Neural Networks (3D-CNNs) and Long Short-Term Memory (LSTM) networks have demonstrated promising results in violence detection. 3D-CNNs are a variation of CNNs designed to handle three-dimensional data such as videos. Unlike traditional 2D-CNNs, which operate on static images and capture only height and width, 3D-CNNs also consider the temporal dimension (time) in videos. This is accomplished by performing convolutions in three dimensions (height, width, and time), enabling 3D-CNNs to capture both spatial features (such as objects and their shapes) and temporal features (such as the motion of these objects over time). This makes 3D-CNNs especially useful for video analysis tasks like action recognition and violence detection, since they can interpret a video as a continuous sequence of events rather than a collection of individual frames. For instance, a study by Muhammad et al. [18] employed a 3D-CNN model on a large video dataset and achieved an accuracy of 74.2%.
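To make the height-width-time convolution concrete, the following minimal TensorFlow sketch (our illustration; the clip shape and filter count are arbitrary and not taken from any model discussed in this paper) applies a single 3D convolution to a short clip:

import numpy as np
import tensorflow as tf

# A dummy clip: (batch, time, height, width, RGB channels).
clip = np.random.rand(1, 16, 112, 112, 3).astype("float32")

# A 3 x 3 x 3 kernel slides over time, height, and width simultaneously,
# so each output activation summarizes a small spatiotemporal volume.
conv3d = tf.keras.layers.Conv3D(filters=32, kernel_size=(3, 3, 3),
                                padding="same", activation="relu")
features = conv3d(clip)
print(features.shape)  # (1, 16, 112, 112, 32): time and space are both preserved

A 2D convolution applied frame by frame would produce the same spatial maps but could never relate one frame to the next; the third kernel dimension is what lets the network see motion.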
Existing neural network models are highly promising in identifying human actions in a variety of applications, from healthcare to entertainment. However, when it comes to detecting violent activities in complex real-world situations, these models face significant challenges. The accuracy of these models is often compromised by factors such as background noise, visual distractions from moving objects, and varying lighting conditions, which result in false positives and false negatives.

Another challenge arises from the processing of many irrelevant frames. Traditional models often analyze every frame in the video sequence, which causes unnecessary computational overhead. This inefficiency is particularly problematic in real-time surveillance applications where rapid and accurate decision-making is crucial. The need for a more focused and computationally efficient approach is evident, especially in real-time applications like violence detection in public spaces or industrial settings.

We propose a three-stage deep learning-based end-to-end framework for accurate and efficient violence detection. A lightweight CNN is used as an initial screening mechanism in the first stage. The lightweight CNN model is optimized for computational efficiency, allowing for fast scanning of incoming video streams. Its main purpose is to recognize and separate frames containing human beings, reducing the requirement to analyze unnecessary frames. Selective frame processing considerably minimizes computing cost and accelerates detection. The second stage employs a more detailed 3D-CNN model. This model extracts spatiotemporal information from 50 frames containing the people identified in the first stage. Training the 3D-CNN to identify complicated sequential patterns associated with violent behaviors makes it very effective. A classifier receives the extracted features in the third stage and classifies frame sequences as violent or non-violent. If the classifier identifies violence, it generates alerts. The system has a real-time warning function that alerts nearby security agencies or police stations of any detected violence. This allows for quick response and intervention, bringing practicality to the system.
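As a sketch of the stage-1 screening idea, frames can be gated on person presence with a pretrained lightweight detector before any spatiotemporal analysis takes place. The snippet below uses OpenCV's DNN module with a MobileNet-SSD model; the file names, source path, and confidence threshold are placeholder assumptions, and the code is our illustration rather than the exact implementation evaluated in this paper.

import cv2

PERSON_CLASS_ID = 15  # person class index in the VOC-trained MobileNet-SSD

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def contains_person(frame, conf_threshold=0.5):
    # Return True if the detector finds at least one person in the frame.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 scalefactor=0.007843, size=(300, 300),
                                 mean=127.5)
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, N, 7)
    for i in range(detections.shape[2]):
        class_id = int(detections[0, 0, i, 1])
        confidence = float(detections[0, 0, i, 2])
        if class_id == PERSON_CLASS_ID and confidence > conf_threshold:
            return True
    return False

cap = cv2.VideoCapture("stream.mp4")  # placeholder video source
buffer = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if contains_person(frame):
        buffer.append(frame)   # keep only person-containing frames
    if len(buffer) == 50:      # hand a 50-frame sequence to stage 2
        # classify_sequence(buffer)  # stage-2 3D-CNN, defined elsewhere
        buffer = []

Only the buffered, person-containing sequences reach the expensive second stage, which is where the computational savings reported later come from.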
We used four different datasets to analyze the performance of the proposed model on violence detection: RWF-2000, the hockey fight dataset, the surveillance fight dataset, and the Industrial Surveillance dataset [22]. The proposed 3D-CNN method demonstrates superior performance compared to other machine learning models, including ConvLSTM, across all tested datasets. This highlights its effectiveness for real-world violence detection applications.

We also conducted cross-dataset testing to evaluate the ability of the proposed model to generalize across various datasets. The primary objective is to assess whether the model has a sufficiently comprehensive understanding of violence to identify violent actions across diverse datasets, which is crucial for practical applications in a range of scenarios with varied data. These tests offer insights into the model's adaptability and robustness, confirming its potential for widespread use in different monitoring situations. Additionally, a computation time analysis was performed, highlighting the efficiency of our approach to processing video data.

The rest of the paper is organized as follows: Section II covers the background of human action recognition and violence detection. Section III discusses violence detection using deep learning methods. The proposed methodology is presented in Section IV, and Section V covers results and analysis. Section VI concludes the paper and discusses future work. Section VII contains the acknowledgment.

II. BACKGROUND
This section reviews previous work in human action recognition, the application of deep learning in human action recognition, and violence detection in video data.

A. HUMAN ACTION RECOGNITION
Automated methods of video sequence analysis and decision-making regarding the behaviors shown in videos are used in human activity detection for video surveillance systems. In 1999, Gavrila established the research field of 2D and 3D approaches [1] for the development of human action recognition (HAR) systems. While the early models were highly reliant on feature extraction from single images or sets of images, they laid the groundwork for more complex systems that integrated spatial and temporal data over time. A different team of researchers, led by J. K. Aggarwal and Q. Cai, developed a new taxonomy centered on the study of human motion, monitoring from different types of camera views, and human activity detection [2]. Two methods exist for HAR: using still images and using video data.
Video-based algorithms outperform those based on still images, as they convey both spatial and temporal information. Videos capture continuous movements and interactions, which are essential for distinguishing between different types of actions, such as walking, running, or violent movements. This temporal data provides context that static images cannot, making it indispensable for effective HAR systems.

However, relying solely on video-based approaches also presents challenges. These algorithms need to handle large amounts of data efficiently, as processing continuous streams of video can be computationally expensive. In addition, video data is often noisy, making it essential to clean the input before analysis. The process typically starts with noise reduction to eliminate irrelevant information from the video frames. Techniques such as background subtraction are commonly used to isolate the human figure from its surroundings.

Once noise is reduced, the form of a human is extracted from the backdrop images by analyzing sequences of video frames and observing changes in position over time. Human shapes are usually identified through a combination of feature extraction techniques and tracking algorithms. Classification of objects, including humans, is done by evaluating their movement characteristics and shape, using methods such as optical flow, which analyzes the pattern of motion between frames, or shape-based descriptors that focus on the contour and skeletal structure of the human body.
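These classical preprocessing steps are straightforward to reproduce. The short sketch below (our illustration, with a placeholder video path) isolates the moving foreground with background subtraction and estimates inter-frame motion with dense optical flow, using standard OpenCV routines:

import cv2

cap = cv2.VideoCapture("surveillance.mp4")  # placeholder path
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Foreground mask: pixels that deviate from the learned background model.
    fg_mask = subtractor.apply(frame)

    # Farneback dense optical flow: per-pixel (dx, dy) motion between frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    prev_gray = gray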
Nonetheless, even with these advancements, traditional HAR methods remain limited by their inability to effectively handle complex scenes with multiple interacting people, poor lighting conditions, and varying camera angles. Additionally, most of these methods rely on handcrafted features, which require domain knowledge and limit their adaptability to new environments or unseen data.

Figure 1 showcases a range of different methods for Human Action Recognition, highlighting the diversity of approaches used in this field [41].

FIGURE 1. Methods of HAR [41].

The field has now shifted towards deep learning-based approaches, which have significantly improved HAR systems by automating feature extraction and learning complex spatiotemporal patterns from raw data. Deep learning models, such as CNNs and RNNs, have demonstrated superior performance in recognizing human actions in videos, especially when integrated with large-scale datasets. For instance, CNNs have been widely used to extract spatial features from video data [17]. In contrast, LSTMs and their variants, such as deep bidirectional LSTMs (BiLSTMs), are useful for capturing temporal dependencies in video sequences, making them crucial for action recognition tasks [22]. Recent studies such as Zhang et al. [5] have shown how deep learning models with attention mechanisms can significantly improve action recognition by focusing on the most relevant spatiotemporal features in video data.

B. STATE OF THE ART APPROACH TO HAR
Wang et al. [17] present an LSTM mechanism that simulates the cognitive memory processes of the human brain for visual monitoring in IoT-assisted smart cities. The primary objective is to enable timely detection of violent actions and prevent false tracking in clear environments. The proposed model utilizes a unique props function within its storage mechanism to perform real-time processing when environmental conditions change and conventional algorithms become ineffective. However, the props function may also introduce limitations in scalability, as it may require frequent updates to account for environmental variations, which could increase computational overhead. Additionally, the model's performance in more complex real-world scenarios, such as crowded industrial environments or highly dynamic urban spaces, remains unexplored. This raises concerns about its robustness in diverse surveillance contexts, particularly where lighting conditions, background clutter, or overlapping objects may impede accurate detection.

Muhammad et al. [18] introduced a HAR method that employs an attention-based LSTM network combined with dilated CNN features. The authors developed a BiLSTM network with an attention mechanism to effectively learn the spatiotemporal properties present in sequential data. This network design enables the model to adjust attention weights, thus allowing it to easily recognize and focus on learned global features for detecting human actions in video data. The use of attention-based mechanisms significantly improves the model's ability to focus on critical spatiotemporal features, making it more efficient in detecting complex actions. The BiLSTM's ability to process data in both directions (forward and backward) provides a more comprehensive understanding of the action sequences, particularly for recognizing subtle behaviors. However, this method also introduces considerable computational complexity, which may make real-time deployment difficult, especially in resource-constrained environments like IoT-based surveillance. Furthermore, the model's reliance on global feature learning could result in missed fine-grained details, particularly in dense environments where actions may
occur in close proximity or in highly occluded scenes. The dilated CNN features, while useful for capturing contextual information, may also cause the model to lose sensitivity to smaller, rapid movements that are crucial for violence detection.

Zhang et al. [5] delved into the challenge of few-shot activity recognition using a cross-modal memory network. A cross-modal memory network stores information from multiple modalities (e.g., video, audio) and enables the model to utilize these diverse inputs to improve learning in complex environments, particularly when training data is limited. The proposed model is designed to recognize new videos with a limited number of labeled samples by leveraging visual contextual embedding for few-shot classification. Few-shot learning models are crucial in situations where collecting large, annotated datasets is impractical or costly. By leveraging visual contextual embedding, this approach allows the model to generalize well to unseen activities with minimal supervision. However, the success of few-shot learning heavily depends on the quality and diversity of the training samples. In surveillance contexts, where lighting, camera angles, and occlusions can vary greatly, the limited training data used in few-shot learning may not sufficiently capture these complexities. Moreover, few-shot models tend to struggle with generalization when faced with highly variable or noisy environments. The use of cross-modal memory networks [5] helps in cross-referencing data from multiple modalities, which improves recognition capabilities, but this also increases the demand for high-quality, synchronized data from different modalities, which may not always be available in practical settings.

Haroon et al. [6] put forward a multi-stream framework for human interaction recognition, which aims to capture and analyze complex human interactions in various scenarios. The proposed approach combines a 1D-CNN with a BiLSTM stream to learn human interactions based on key features extracted using a pose estimation algorithm, and a 3D-CNN model to learn temporal information from video sequences. The proposed multi-stream framework is advantageous for analyzing multi-person interactions or complex human activities. By combining 1D-CNNs with BiLSTM and 3D-CNN models, the framework can capture both spatial and temporal information, making it more robust in understanding complex human interactions in diverse environments. However, the reliance on pose estimation algorithms has limitations, especially in environments where occlusions, low-resolution video, or poor lighting hinder accurate pose detection. In such cases, misestimations in pose can lead to inaccurate action recognition. Additionally, the computational overhead of running multiple streams (1D-CNN, BiLSTM, 3D-CNN) could limit the framework's scalability, particularly in real-time applications or on low-power edge devices. The fusion of pose estimation and deep learning models is promising but requires further optimization to handle real-world surveillance challenges.

Ullah et al. [22] proposed an AI-assisted edge vision approach for violence detection in IoT-based industrial surveillance networks. The framework comprises five main steps: training a lightweight CNN for efficient edge processing; acquiring data using resource-constrained vision sensors; detecting suspicious humans or objects and generating alerts; sending relevant frames to a more powerful backend for deeper investigation; and finally, using the backend system for accurate violence detection. The edge-based architecture proposed by Ullah et al. [22] addresses one of the main challenges of real-time violence detection in industrial environments: limited computational resources. By utilizing a lightweight CNN for initial processing, the framework minimizes the amount of data that needs to be sent to the backend, which reduces bandwidth usage and latency. This makes the approach particularly well-suited for IoT-based systems where energy and computational resources are constrained. However, the reliance on a more powerful backend for deeper investigation introduces latency that could impact real-time performance in time-critical situations. Additionally, the effectiveness of the lightweight CNN in accurately detecting violence, especially in complex scenes with occlusions or overlapping objects, remains a concern. Another potential limitation is the framework's adaptability to different industrial contexts, where sensor configurations and environmental factors vary significantly.

Chen et al. [8] introduced a spatiotemporal graph convolutional network (ST-GCN) for skeleton-based HAR in surveillance environments. The proposed model captures both spatial and temporal information about human actions by incorporating graph convolutional layers, which effectively model the relationships between different body joints in the skeleton data. The use of ST-GCNs for skeleton-based action recognition offers a highly structured approach to modeling human actions, as it focuses on the movement and relationships of body joints. This approach is particularly useful in environments where clear skeletal data can be extracted, such as sports events or controlled laboratory conditions. However, in real-world surveillance environments, extracting high-quality skeleton data is challenging due to occlusions, varying camera angles, and environmental noise. Furthermore, ST-GCNs rely heavily on accurate skeleton detection, which may not be feasible in industrial or crowded public spaces where body movements are obscured or erratic. Additionally, the performance of this method in recognizing subtle or non-standard movements, such as those seen in violence detection scenarios, may be limited, as the skeletal structure alone may not capture the full context of the action.

Kim and Lee [9] proposed a multi-modal fusion approach for anomaly detection in video streams, integrating both visual and auditory data to improve the overall performance of the system. The authors developed a deep neural network architecture that combines a 3D-CNN for visual feature extraction and a 1D-CNN for auditory feature extraction, followed by a fusion layer that effectively combines the
extracted features for better anomaly detection. The integration of auditory data alongside visual data enhances the detection of anomalies, as certain violent actions may be accompanied by distinct sounds. However, the reliance on auditory features introduces additional challenges, especially in noisy environments like factories or urban areas, where background noise may interfere with the detection process. The 1D-CNN for auditory feature extraction, while useful, may require substantial filtering and preprocessing to handle the variability of audio data in these environments. Furthermore, the fusion of multiple modalities increases the computational complexity of the system, which could hinder real-time performance, particularly in resource-constrained settings like edge devices or low-power surveillance systems. Ensuring synchronization between the audio and video streams is another challenge that must be addressed to avoid inconsistencies in anomaly detection.

III. VIOLENCE DETECTION USING DEEP LEARNING METHODS
Deep learning forms the backbone of all the methods reviewed in Section II. In the context of violence detection, 3D CNN-based algorithms have been widely used due to their ability to model both spatial and temporal features simultaneously. The 3D-CNN algorithm is designed to capture the spatiotemporal patterns present in video sequences, making it particularly well-suited for tasks like violence detection, where the timing and sequence of actions are critical. However, the development of a 3D-CNN model often requires complex, hand-crafted algorithms to fine-tune the model for specific applications. These models are computationally intensive, making them less feasible for deployment in real-time or resource-constrained environments. As a result, there is an ongoing effort to develop more efficient 3D-CNN architectures that can deliver high accuracy while maintaining computational efficiency.

By contrast, more powerful models based on newer architectures, such as Transformers or spatiotemporal attention networks, have been developed to automatically extract features and understand complex interactions without the need for extensive manual tuning. These models are capable of analyzing both short- and long-range dependencies in video sequences, making them highly effective for HAR and violence detection tasks. However, these advanced models are still in the experimental stages and face challenges related to computational cost and data requirements.

Table 1 outlines some of the approaches used in violence detection, showcasing the diversity of deep learning models in terms of both accuracy and computational efficiency. For instance, the Visual Geometry Group (VGG-f) model [21], which utilizes ImageNet-based object detection, is well-suited for real-time detection tasks. Achieving an accuracy range of 91%-94%, this model is effective for detecting objects in crowded environments. However, as a purely spatial method, it lacks the ability to capture the temporal progression of actions, which is critical for understanding violent behavior in video sequences.

On the other hand, the 3D-CNN approach developed by Ding et al. [26] extends the capability of standard CNNs by incorporating temporal information through 3D convolutions. This method achieves an accuracy of 91% and is particularly effective in crowded environments where both spatial and temporal patterns must be considered. However, its reliance on the backpropagation method for training introduces significant computational overhead, which may limit its deployment in real-time scenarios.

The VGG vector of locally aggregated descriptors (VLAD) model for image retrieval, presented by Zhou et al. [36], also uses a backpropagation method and achieves an approximate accuracy of 90% in crowded environments. This method, while powerful for place recognition tasks, is less specialized for violence detection, as it does not consider the temporal dynamics essential for action recognition in video sequences. Similarly, Karpathy et al. [11] employed a multi-modal approach, combining CNNs with Mel Filter Bank (MFB) audio features, achieving an approximate accuracy of 90%. By integrating audio cues like shouting or alarms, this method enhances violence detection capabilities in crowded environments. However, the need to process both visual and auditory streams introduces additional complexity, particularly when dealing with noisy or cluttered environments.

One of the most promising approaches in the table is the use of ConvLSTM networks for violence detection. Muhammad et al. [18] developed a model combining CNNs with ConvLSTM, achieving approximately 97% accuracy. ConvLSTM is designed to handle both spatial and temporal dependencies, making it highly effective in recognizing violent actions in crowded settings. By leveraging the strengths of both CNNs and LSTMs, this model excels at detecting violent behaviors that unfold over time, which are difficult to capture with purely spatial models.
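For reference, a CNN + ConvLSTM classifier of the kind used as a baseline later in this paper can be assembled in a few Keras lines. This is a generic sketch with arbitrary layer sizes, not the exact configuration of the cited work:

import tensorflow as tf
from tensorflow.keras import layers, models

baseline = models.Sequential([
    # Per-frame spatial features, shared across a 16-frame clip.
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu"),
                           input_shape=(16, 112, 112, 3)),
    layers.TimeDistributed(layers.MaxPooling2D()),
    # Recurrence over time with convolutional state updates.
    layers.ConvLSTM2D(64, kernel_size=3, return_sequences=False),
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),  # violent / non-violent
])

The ConvLSTM2D layer carries a convolutional hidden state from frame to frame, which is what lets this family of models integrate evidence of violent motion over time.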
The highest accuracy in the table is reported by Haroon et al. [6], whose deep CNN combined with optical flow analysis reached 98% accuracy. This method tracks motion trajectories, making it particularly adept at identifying violent actions by analyzing the movement patterns of individuals. While this approach provides exceptional accuracy, the computational demands of combining deep CNNs with optical flow analysis may present challenges for real-time deployment, especially in resource-constrained environments.

IV. THE PROPOSED METHODOLOGY
In this section, we discuss the proposed three-stage end-to-end framework in detail. Four datasets, the surveillance fight dataset [27], the RWF-2000 dataset [28], the hockey fight dataset [29], and the Industrial Surveillance dataset [22], were used in the experiments.
A. THE PROPOSED THREE-STAGE VIOLENCE DETECTION FRAMEWORK
Figure 2 shows the proposed three-stage violence detection framework. In the first stage, a lightweight CNN is used to scan the video stream rapidly, since it is computationally efficient. Its main purpose is to segregate frames containing human beings, reducing the requirement to filter unnecessary frames. Selective frame processing considerably minimizes computational overhead and accelerates detection. A more complex 3D-CNN model is used in the second stage. This model extracts spatiotemporal information from 50 frames containing the people identified in the first stage. Training the 3D-CNN to recognize complicated sequential patterns associated with violent behaviors makes it effective. A SoftMax classifier receives the extracted features in the third stage and classifies frame sequences as violent or non-violent. If the classifier identifies violence, it generates alerts. The system automatically alerts the nearest security agency or police station if violence is detected. This allows immediate response and intervention, making the system useful for real-world situations.

Given the challenges associated with limited training data, the model also employs transfer learning techniques. This allows the model to generalize better across different scenarios and increases its overall performance.
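A minimal sketch of this transfer-learning idea follows. The checkpoint file name is a placeholder and the head sizes are our assumptions, not the paper's stated configuration: a backbone pretrained on a larger action-recognition corpus is frozen, and only a small classification head is retrained on the violence data.

import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder checkpoint; any pretrained spatiotemporal backbone could be used.
backbone = tf.keras.models.load_model("pretrained_3dcnn_backbone.h5")
backbone.trainable = False  # freeze the generic spatiotemporal features

finetuned = models.Sequential([
    backbone,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # violent / non-violent
])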
By employing this multi-stage approach, the proposed model aims to combine computational efficiency with high accuracy, making it well-suited for practical, real-world applications. Edge computing is used to run the model directly on IoT devices, enabling real-time processing and immediate response to violent acts.

The following provides a detailed explanation of the model's workflow:
• Video Capture: The initial stage involves capturing real-time video data, which is sourced from various types of video capturing devices, including but not limited to surveillance cameras. This raw video data is critical, as it serves as the foundational input for the entire model. The quality, frame rate, audio, and resolution of this video data are crucial factors that can significantly influence the model's performance in subsequent stages.
• Person Detection using MobileNet CNN [9]: The first objective in the model's workflow after capturing a video is to identify the persons in the sequences. This crucial step prepares the ground for the later identification of violence. We use the MobileNet Single Shot MultiBox Detector (MobileNet-SSD) architecture, a CNN that is known for its fast processing speed and minimal computing requirements, to achieve this. The use of depth-wise separable convolutional layers rather than conventional convolutional layers distinguishes this design from others. The network is divided into 28 layers; except for the last, fully connected layer, each layer is followed by a batch normalization procedure and a ReLU activation. Figure 3 shows detections on the hockey fight dataset [29]. The initial convolutional layer in the architecture operates with a two-step stride, employing a 3 × 3 × 3 × 32 filter, and takes an input with dimensions of 224 × 224 × 3. The next step is a depth-wise convolutional layer that works with a single-step stride, a 3 × 3 × 32 filter, and an input of 112 × 112 × 32 dimensions. While MobileNet is generally used for classification tasks, in this context its SSD extension is vital for pinpointing the location of objects in the video frames. This SSD extension is integrated at the end of the MobileNet architecture and executes feed-forward convolutions to yield a predetermined set of bounding boxes. These boxes are examined to confirm the presence or absence of human figures, based on the extracted feature maps and applied convolutional
filters. Each bounding box includes a set of class predictions with corresponding probabilities, and the class with the maximum probability is chosen. A zero probability indicates a lack of any object of interest.
• Learning with 3D-CNNs: The core of our violence detection model lies in its ability to extract spatiotemporal features, which is accomplished through a 3D-CNN. This network is specifically designed to handle sequences of 50 frames containing the detected person from the previous MobileNet-SSD model. Unlike 2D-CNNs, which only capture spatial information, 3D-CNNs are adept at preserving both spatial and temporal information due to their 3D convolution and pooling operations.
For our model, we have fine-tuned a 3D-CNN architecture inspired by the C3D model [9], initially developed using a version of Caffe. This architecture is particularly effective for video-based tasks and has been validated in multiple studies. The C3D model is composed of eight convolutional layers, five max-pooling layers, and two fully connected layers, culminating in a SoftMax output layer. Each convolutional layer uses 3 × 3 × 3 kernels with a stride of one. The max-pooling layers predominantly employ a
2 × 2 × 2 kernel size, except for the first pooling layer, which uses a 1 × 2 × 2 kernel with a stride of two to preserve temporal information. The convolutional layers are structured with a varying number of filters: 64 in the first layer, 128 in the second, and 256 in the third. These layers also feature kernels with a defined temporal depth, denoted by size D. The convolutional operations are performed with a kernel size of 3 and padding of 1. The fully connected layers, labeled fc6 and fc7, contain 4096 neurons each. The SoftMax layer's output is tailored to the dataset's classes, which, in this case, are limited to two: violent and non-violent. To address the issue of overfitting and to enhance the model's learning capabilities, we employ random crops of size 3 × 16 × 128 × 128 from the original 50-frame input sequence during training. This architectural design allows the network to act as a hierarchical feature extractor: lower layers focus on basic patterns like corners and edges, while higher layers capture more complex, global features. An illustrative representation of the C3D architecture is provided in Figure 4 below, and a code-level sketch of this architecture is given after this list.
• Activity Classification using SoftMax Classifier: The features extracted by the 3D-CNN serve as the input for a SoftMax classifier. The SoftMax function is usually used in the last layer of a neural network-based classifier to ensure that the output probabilities are normalized and add up to one. This makes it possible to effectively label each frame sequence as either violent or non-violent. The resulting classification directly influences the subsequent alert mechanism.
• Alert Generation: If the model predicts violence, an alarm is triggered, notifying the nearest security department. This immediate alert system allows for prompt action to be taken in response to the detected violent event, potentially averting dangerous situations and ensuring safety.
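The following Keras sketch reconstructs a C3D-style network from the description above. The filter counts beyond the third block (256 and 512 in the deeper layers) follow the original C3D design and are an assumption on our part, as is reading the 3 × 16 × 128 × 128 crop as a 16-frame, 128 × 128, 3-channel input:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu",
                        input_shape=(16, 128, 128, 3)))
model.add(layers.MaxPooling3D((1, 2, 2)))   # pool space only: keep early temporal detail
model.add(layers.Conv3D(128, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.MaxPooling3D((2, 2, 2)))
model.add(layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.Conv3D(256, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.MaxPooling3D((2, 2, 2)))
model.add(layers.Conv3D(512, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.Conv3D(512, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.MaxPooling3D((2, 2, 2)))
model.add(layers.Conv3D(512, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.Conv3D(512, (3, 3, 3), padding="same", activation="relu"))
model.add(layers.MaxPooling3D((2, 2, 2)))   # 8 conv and 5 pooling layers in total
model.add(layers.Flatten())
model.add(layers.Dense(4096, activation="relu"))   # fc6
model.add(layers.Dense(4096, activation="relu"))   # fc7
model.add(layers.Dense(2, activation="softmax"))   # violent / non-violent

The 1 × 2 × 2 first pooling layer downsamples only spatially, so early temporal detail survives into the deeper layers, mirroring the design rationale given above.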
B. DATASETS
We have used the most widely used benchmark datasets: the hockey fight dataset, the surveillance fight dataset, and the RWF-2000 dataset. We also used the industrial surveillance dataset collected by Ullah et al. [22]. These datasets are well balanced and labeled, and an 80%/20% split is used for training and testing. Together, they cover mainly indoor scenes, outdoor scenes, and a few weather conditions.

1) EXISTING BENCHMARK DATASETS
The surveillance fight dataset [27] includes indoor, outdoor, night, and daytime videos from real-world surveillance and YouTube. This 300-video dataset has equal numbers of violent and nonviolent acts. The RWF-2000 dataset [28] includes factory, workplace, and other indoor, outdoor, day, and night videos. This dataset contains only surveillance videos without multimedia editing. This 2000-video collection has equal numbers of violent and nonviolent acts. The hockey fight dataset [29] was gathered at National Hockey League (NHL) hockey grounds. It contains 1000 NHL hockey game videos with equal amounts of violent and nonviolent activities, with two players often in close bodily contact.

2) THE INDUSTRIAL SURVEILLANCE DATASET
Ullah et al. [22] collected the industrial surveillance dataset from different sources and search engines such as YouTube and Google by inserting diverse queries, such as violence scenes in industrial surveillance, in factories, and in steel mills. The videos obtained from different sources have distinct video resolutions and frame rates. The length of the retrieved videos ranges from 7 to 12 minutes; the videos are trimmed into 5-second violent and nonviolent clips for each class. The authors arranged the dataset in the same standard format as the surveillance fight dataset. The industrial surveillance dataset consists of varied scenes such as industries, stores, offices, and petrol pumps. Compared to existing datasets, the industrial surveillance dataset is more challenging because most of the actions occur away from the center of the camera view, and the frame rate (fps) varies, unlike other surveillance datasets. Several samples of each dataset are given in Figure 5.

C. MODEL DEVELOPMENT AND TRAINING
The model is developed using TensorFlow as the primary framework for implementing the deep learning algorithms. The model is trained on an NVIDIA GeForce RTX 3080 laptop GPU, providing the computational power needed for efficient training and model optimization. The Adam optimizer is used to minimize the binary cross-entropy loss function. Early stopping is applied based on the validation loss to prevent over-fitting. For the hockey fight and RWF-2000 datasets, the learning rate and batch size are set at 0.001 and 32, respectively. For the surveillance fight and industrial surveillance datasets, the learning rate and batch size are set at 0.0001 and 16, respectively. The model is trained for 50 epochs. After each epoch, the model is evaluated on a testing set to monitor its performance. The evaluation metrics include binary accuracy and binary cross-entropy loss.
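The training setup described above can be expressed as the following sketch, reusing the model object from the previous listing. Dummy tensors stand in for the real video pipeline, and the early-stopping patience is our assumption, since the paper does not state it:

import numpy as np
import tensorflow as tf

x = np.random.rand(8, 16, 128, 128, 3).astype("float32")          # dummy clips
y = tf.keras.utils.to_categorical(np.random.randint(0, 2, 8), 2)  # one-hot labels

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # 1e-4 for the smaller datasets
    loss="binary_crossentropy",
    metrics=["binary_accuracy"],
)
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)  # patience assumed

model.fit(x, y, validation_split=0.2, epochs=50,
          batch_size=32,  # 16 for the surveillance fight and industrial datasets
          callbacks=[early_stop])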
D. MODEL EVALUATION
The model's performance is evaluated using a comprehensive set of metrics: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). These metrics provide a holistic view of the model's effectiveness in classifying violent and non-violent activities. A simple train-test split is used for validation. This approach allows for a straightforward yet effective way to assess the model's performance on unseen data.
FIGURE 5. Samples from both classes of each violence detection dataset, with violent frames in the first three columns and non-violent frames in the following three columns. (a) Sample frames from the surveillance fight dataset. (b) Violent and nonviolent frames from the hockey fight dataset. (c) Sample frames from the RWF-2000 collection of real-world surveillance footage. (d) Sample frames from both groups of the industrial surveillance dataset.
The model's performance is compared against existing models and techniques in the field of violence detection. This benchmarking helps to position the proposed model within the broader landscape of violence detection solutions. The model is tested across different datasets to assess its generalizability. This is crucial for ensuring that the model performs well not just on the data it was trained on but also on new, unseen data. The time taken for the model to make a prediction is also measured. This is particularly important for real-time applications where quick decision-making is essential.
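For completeness, the reported metrics can be derived from model predictions as in the sketch below. The labels and scores are illustrative, and scikit-learn is our tooling choice rather than necessarily the authors':

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = violent
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])   # model scores
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))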
V. RESULTS AND ANALYSIS
A. MODEL PERFORMANCE COMPARISON
Table 2 presents a comparative evaluation of two state-of-the-art deep learning models employed for violence detection: ConvLSTM and the proposed 3D-CNN. Figure 6 shows the confusion matrices of the two models for the four datasets. The proposed 3D-CNN method consistently demonstrates superior or equivalent performance compared to the ConvLSTM method across all the datasets. For instance, on the RWF-2000 dataset, the proposed 3D-CNN model achieved an accuracy of 92.5%, significantly improving upon the 85.3% accuracy recorded for the ConvLSTM method. On this dataset, the 3D-CNN model correctly classified 185 violent incidents (TP) and 185 non-violent incidents (TN). However, it also produced 16 false positives (misclassifying non-violent acts as violent) and 14 false negatives (failing to detect actual violent incidents), highlighting the challenges posed by the diverse and complex scenes in this dataset.

On the hockey fight dataset, the proposed 3D-CNN model exhibited exceptional performance, achieving an accuracy of 97.2% and surpassing the 94% accuracy of the ConvLSTM method. With 98 TP and 96 TN, the model accurately identified most violent and non-violent events. The minor discrepancies, seen in 3 FP and 3 FN, can be attributed to the dynamic and aggressive nature of hockey, where fast-paced non-violent actions are often mistaken for violent ones.

For the surveillance fight dataset, the proposed 3D-CNN model demonstrated a notable improvement in performance. It correctly produced 28 TP and 26 TN, with only 3 FP and 3 FN, leading to an accuracy of 89.7%, compared to the 62% accuracy of the ConvLSTM method. The false positives are likely caused by the model misclassifying non-violent actions due to occlusions and congested camera views.

On the industrial surveillance dataset, both models show comparable performance, but the proposed 3D-CNN model still outperforms the ConvLSTM. The proposed model achieved an accuracy of 88.89%, compared to 73% for ConvLSTM. The proposed 3D-CNN model produced 26 TP and 27 TN, with 4 FP and 3 FN. The false positives in this dataset likely stem from the model misinterpreting normal industrial activities as violent due to the complexity of the environment, while the false negatives suggest some violent events may have been obscured or subtle.

Overall, the proposed 3D-CNN method emerges as a highly promising technique for violence detection. Its robust performance, particularly on the RWF-2000, industrial surveillance, and hockey fight datasets, further solidifies its standing as a reliable and effective method for violence detection across various settings. However, while the model outperforms ConvLSTM in most metrics, addressing the remaining false positives and false negatives will be crucial for future improvements, especially in complex or fast-paced environments.

As demonstrated in Table 3, our proposed 3D-CNN method exhibits superior performance, achieving an accuracy of 97.2% with a standard deviation of 1.55 on the hockey fight dataset; this surpasses most state-of-the-art methods on that dataset. While methods like ViF [31], OViF [32], DiMOLIF [33], and HOMO [34] consider both the orientation and magnitude changes of the optical flow, they still fall short of the performance achieved by the proposed 3D-CNN model.
TABLE 2. Detailed evaluation results of the ConvLSTM and the proposed 3D-CNN based on accuracy, AUC, precision, recall, and F1-score.
FIGURE 6. Visual representation of the confusion matrix for the proposed 3D-CNN. (a) Surveillance fight dataset. (b) RWF-2000 dataset. (c) Hockey fight
dataset. (d) Industrial surveillance dataset.
B. CROSS-DATASET EXPERIMENTS
Despite the complexity of the proposed model, it achieved its lowest accuracy (59.85%) when trained on RWF-2000 and tested on the industrial surveillance dataset, highlighting the challenges of generalizing across datasets with different characteristics. These results, summarized in Table 4, emphasize that while the model can generalize, its performance depends heavily on the diversity and complexity of the datasets used during training. The cross-dataset experimentation not only validates the model's effectiveness but also underscores its potential for scalable deployment in diverse surveillance settings, where variations in scene types, camera angles, and environmental conditions pose challenges.

The overall reduction in accuracy across cross-dataset experiments underscores the inherent challenges in generalizing between datasets with different characteristics. These challenges arise due to variations in factors such as video quality, scene diversity, camera angles, and the nature of violent actions across datasets. Video quality differences, including resolution and noise levels, affect the model's ability to adapt to new environments. Scene diversity and varying camera angles, such as those found in fixed industrial surveillance versus dynamic sports footage, further complicate generalization. Additionally, the specific contexts in which violent actions occur can differ greatly across datasets, making it difficult for models to identify consistent patterns.
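The cross-dataset protocol itself is simple to express: train on one dataset and test on each of the others. In the sketch below, stub loaders with placeholder names and random tensors stand in for the real clips, and the compiled model from Section IV is reused:

import numpy as np
import tensorflow as tf

def load_dataset(name, split):
    # Stub loader: replace with real clip/label loading for each dataset.
    x = np.random.rand(8, 16, 128, 128, 3).astype("float32")
    y = tf.keras.utils.to_categorical(np.random.randint(0, 2, 8), 2)
    return x, y

DATASETS = ["hockey_fight", "rwf_2000", "surveillance_fight", "industrial"]
for train_name in DATASETS:
    x_tr, y_tr = load_dataset(train_name, "train")
    model.fit(x_tr, y_tr, epochs=1, verbose=0)   # compiled model from Section IV
    for test_name in DATASETS:
        if test_name == train_name:
            continue
        x_te, y_te = load_dataset(test_name, "test")
        _, acc = model.evaluate(x_te, y_te, verbose=0)
        print(f"train={train_name} test={test_name} acc={acc:.2%}")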
C. TIME COMPLEXITY ANALYSIS
Processing time is a critical consideration for video data in Industrial Internet of Things (IIoT)-aided surveillance systems. In this study, the ConvLSTM model and the proposed 3D-CNN model were evaluated for their efficiency in handling video frames. The ConvLSTM model processed 28 frames per second, while the proposed model handled 72 frames per second. This means the proposed model processes one frame in approximately 0.01389 seconds, making it 2.57 times faster than the ConvLSTM model. This improved processing speed underscores the efficiency of the proposed approach. In terms of model complexity, the ConvLSTM model consists of 18,976,770 parameters, while the proposed model significantly reduces this to 4,470,298 parameters. This substantial reduction in parameters highlights the optimization and efficiency achieved by the proposed model, both in terms of processing speed and model size.
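The throughput figures above are straightforward to check empirically. The sketch below times the model from the Section IV listing on dummy inputs and prints its parameter count; the exact numbers will differ from the paper's, since our sketch is not the authors' trained network:

import time
import numpy as np

clips = np.random.rand(32, 16, 128, 128, 3).astype("float32")  # dummy inputs

start = time.perf_counter()
model.predict(clips, verbose=0)
elapsed = time.perf_counter() - start
print(f"{len(clips) / elapsed:.1f} inputs/s ({elapsed / len(clips):.5f} s each)")

# Sanity checks on the reported figures:
print(f"1 / 72 fps = {1 / 72:.5f} s per frame")  # ~0.01389 s
print(f"72 / 28 = {72 / 28:.2f}x speed-up")      # ~2.57x
print("parameters:", model.count_params())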
The models were trained and tested on a high-performance system featuring a 64-bit operating system and an x64-based Intel(R) Core(TM) i7-10870H CPU, clocked at 2.20GHz with a turbo boost up to 2.21GHz. The system was also equipped with 64.0 GB of RAM, which provided ample memory to handle the intensive computational tasks involved in processing video data for surveillance purposes. This hardware setup facilitated not only fast training and testing but also ensured that the models could efficiently handle large batches of video frames, which is crucial for real-time surveillance applications. The proposed model is designed to be efficient enough to run on edge devices, such as IoT-based surveillance systems, which often have limited computational resources compared to high-end GPUs. This is achieved by using a lightweight initial stage (a CNN for human detection) to reduce the number of frames that need to be processed by the more computationally intensive 3D-CNN. The ability to run models like the proposed 3D-CNN in such an environment is a strong indicator that these models are scalable and could be deployed in real-world IIoT-aided surveillance systems, where quick response times and the ability to process high-resolution video data in real time are paramount. This combination of hardware capability and optimized model design further emphasizes the practicality and applicability of the proposed model in demanding IIoT scenarios.

VI. CONCLUSION AND FUTURE WORK
In this paper, a three-stage end-to-end framework is proposed for violence detection in a surveillance video stream. In the first stage, humans are detected using an efficient CNN model to remove unwanted frames, which reduces the overall processing time. Next, frame sequences
with persons are fed into a 3D-CNN model trained on three benchmark datasets, where the spatiotemporal features are extracted and forwarded to the SoftMax classifier for final predictions. Experimental results over various benchmark datasets confirm that our method is a good fit for violence detection in surveillance and achieves better accuracy than several other existing techniques. We aim to deploy our methodology on devices with limited resources; our framework implements edge intelligence to identify instances of violence on IoT devices. Our future research aims to improve violence detection by leveraging multiview data for thorough analysis. Additionally, we plan to enhance the model's adaptability to varying environmental conditions by incorporating sound sensor data. This approach is intended to be particularly useful in challenging lighting conditions, where visual data alone may be insufficient. By integrating auditory inputs, we aim to improve the robustness and overall performance of the model in diverse real-world scenarios. This is necessary since existing algorithms depend on single cameras, which are unable to capture the complete picture; multiview data enables thorough surveillance of activity from any perspective.

VII. ACKNOWLEDGMENT
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF.

REFERENCES
[1] D. M. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82-98, Jan. 1999.
[2] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," Comput. Vis. Image Understand., vol. 73, no. 3, pp. 428-440, Mar. 1999.
[3] Y. Wang et al., "Violence detection in surveillance environments using LSTM," IEEE Trans. Emerg. Topics Comput., vol. 9, no. 1, pp. 1-13, Jan. 2021.
[4] G. Muhammad, M. A. Hossain, and G. Bebis, "Attention-based LSTM network for human action recognition," Pattern Recognit. Lett., vol. 138, pp. 120-127, May 2021.
[5] L. Zhang, X. Chang, J. Liu, M. Luo, M. Prakash, and A. G. Hauptmann, "Few-shot learning for human action recognition using cross-modal memory networks," IEEE Trans. Image Process., vol. 30, pp. 2301-2315, 2021.
[6] U. Haroon, A. Ullah, T. Hussain, W. Ullah, M. Sajjad, and K. Muhammad, "Multi-stream deep learning model for human interaction recognition," J. Vis. Commun. Image Represent., vol. 73, Jun. 2021, Art. no. 102981.
[7] F. U. M. Ullah, K. Muhammad, I. Haq, N. Khan, A. A. Heidari, and S. A. Baik, "Edge computing-based AI-assisted violence detection in IoT-based surveillance networks," IEEE Internet Things J., vol. 8, no. 9, pp. 7601-7611, Sep. 2021.
[8] J. Chen, X. Wang, and Y. Zhang, "Skeleton-based human action recognition using graph convolutional networks," IEEE Access, vol. 9, pp. 69989-70001, 2021.
[9] S. Kim and J. Lee, "Multi-modal anomaly detection in video streams using audio-visual fusion," Pattern Recognit. Lett., vol. 145, pp. 12-21, Jun. 2021.
[10] J. Chen, C. Ma, and J. Wang, "Graph convolutional networks for skeleton-based action recognition," IEEE Trans. Image Process., vol. 28, no. 3, pp. 2739-2753, 2019.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1725-1732.
[12] M. Ding et al., "3D-CNN-based action recognition," IEEE Trans. Image Process., vol. 26, no. 3, pp. 1252-1265, Mar. 2017.
[13] J. Y. Lee, K. D. Kim, and K. Kim, "A study on improving the location of CCTV cameras for crime prevention through an analysis of population movement patterns using mobile big data," KSCE J. Civil Eng., vol. 23, no. 1, pp. 376-387, Jan. 2019.
[14] S. Khan, K. Muhammad, S. Mumtaz, S. W. Baik, and V. H. C. de Albuquerque, "Energy-efficient deep CNN for smoke detection in foggy IoT environment," IEEE Internet Things J., vol. 6, no. 6, pp. 9237-9245, Dec. 2019.
[15] M. Sajjad, S. Khan, K. Muhammad, W. Wu, A. Ullah, and S. W. Baik, "Multi-grade brain tumor classification using deep CNN with extensive data augmentation," J. Comput. Sci., vol. 30, pp. 174-182, Jan. 2019.
[16] M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, C. Esposito, and S. W. Baik, "CNN-based anti-spoofing two-tier multi-factor authentication system," Pattern Recognit. Lett., vol. 126, pp. 123-131, Sep. 2019.
[17] S. Wang, X. Liu, S. Liu, K. Muhammad, A. A. Heidari, J. D. Ser, and V. H. C. de Albuquerque, "Human short long-term cognitive memory mechanism for visual monitoring in IoT-assisted smart cities," IEEE Internet Things J., vol. 9, no. 10, pp. 7128-7139, May 2022.
[18] K. Muhammad, Mustaqeem, A. Ullah, A. S. Imran, M. Sajjad, M. S. Kiran, G. Sannino, and V. H. C. de Albuquerque, "Human action recognition using attention based LSTM network with dilated CNN features," Future Gener. Comput. Syst., vol. 125, pp. 820-830, Dec. 2021.
[19] L. Zhang, X. Chang, J. Liu, M. Luo, M. Prakash, and A. G. Hauptmann, "Few-shot activity recognition with cross-modal memory network," Pattern Recognit., vol. 108, Dec. 2020, Art. no. 107348.
[20] U. Haroon, A. Ullah, T. Hussain, W. Ullah, M. Sajjad, K. Muhammad, M. Y. Lee, and S. W. Baik, "A multi-stream sequence learning framework for human interaction recognition," IEEE Trans. Hum.-Mach. Syst., vol. 52, no. 3, pp. 435-444, Jun. 2022.
[21] K. Lloyd, P. L. Rosin, D. Marshall, and S. C. Moore, "Detecting violent and abnormal crowd activity using temporal analysis of grey level co-occurrence matrix (GLCM)-based texture measures," Mach. Vis. Appl., vol. 28, nos. 3-4, pp. 361-371, May 2017.
[22] F. U. M. Ullah, K. Muhammad, I. U. Haq, N. Khan, A. A. Heidari, S. W. Baik, and V. H. C. de Albuquerque, "AI-assisted edge vision for violence detection in IoT-based industrial surveillance networks," IEEE Trans. Ind. Informat., vol. 18, no. 8, pp. 5359-5370, Aug. 2022.
[23] J. Chen, Y. Wang, J. Wang, X. Gao, and L. Nie, "Spatiotemporal graph convolutional networks for skeleton-based human action recognition in surveillance environments," IEEE Trans. Multimedia, vol. 24, pp. 1235-1248, 2022.
[24] S. Kim, H. Lee, J. Park, and J. Choi, "Anomaly detection in video streams using multi-modal fusion and deep neural networks," Pattern Recognit. Lett., vol. 145, pp. 39-47, May 2023.
[25] G. Batchuluun, J. H. Kim, H. G. Hong, J. K. Kang, and K. R. Park, "Fuzzy system based human behavior recognition by combining behavior prediction and recognition," Expert Syst. Appl., vol. 81, pp. 108-133, Sep. 2017.
[26] C. Ding, S. Fan, M. Zhu, W. Feng, and B. Jia, "Violence detection in video by using 3D convolutional neural networks," presented at the Int. Symp. Vis. Comput., Jan. 2014.
[27] S. Akti, G. A. Tataroglu, and H. K. Ekenel, "Vision-based fight detection from surveillance cameras," in Proc. 9th Int. Conf. Image Process. Theory, Tools Appl. (IPTA), Nov. 2019, pp. 1-6.
[28] M. Cheng, K. Cai, and M. Li, "RWF-2000: An open large scale video database for violence detection," in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 4183-4190.
[29] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar, "Violence detection in video using computer vision techniques," in Proc. Int. Conf. Comput. Anal. Images Patterns. Berlin, Germany: Springer, 2011, pp. 332-339.
[30] H. Ullah, K. Muhammad, A. Ullah, T. Saba, and A. Rehman, "Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues," IEEE Access, vol. 6, pp. 48250-48261, 2018.
[31] T. Hassner, Y. Itcher, and O. Kliper-Gross, "Violent flows: Real-time detection of violent crowd behavior," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2012, pp. 1-6.
[32] Y. Gao, H. Liu, X. Sun, C. Wang, and Y. Liu, "Violence detection using oriented VIolent flows," Image Vis. Comput., vols. 48-49, pp. 37-41, Apr. 2016.
[33] A. Ben Mabrouk and E. Zagrouba, "Spatio-temporal feature using optical flow based distribution for violence detection," Pattern Recognit. Lett., vol. 92, pp. 62-67, Jun. 2017.
[34] J. Mahmoodi and A. Salajeghe, "A classification method based on optical flow for violence detection," Expert Syst. Appl., vol. 127, pp. 121-127, Aug. 2019.
[35] S. Sudhakaran and O. Lanz, "Learning to detect violent videos using convolutional long short-term memory," in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Aug. 2017, pp. 1-6.
[36] P. Zhou, Q. Ding, H. Luo, and X. Hou, "Violence detection in surveillance video using low-level features," PLoS ONE, vol. 13, no. 10, Oct. 2018, Art. no. e0203668.
[37] E. Fenil, G. Manogaran, G. Vivekananda, T. Thanjaivadivel, S. Jeeva, and A. Ahilan, "Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM," Comput. Netw., vol. 151, pp. 191-200, Mar. 2019.
[38] I. Serrano Gracia, O. Deniz Suarez, G. Bueno Garcia, and T.-K. Kim, "Fast fight detection," PLoS ONE, vol. 10, no. 4, Apr. 2015, Art. no. e0120448.
[39] Y. Shi, Y. Tian, Y. Wang, and T. Huang, "Sequential deep trajectory descriptor for action recognition with three-stream CNN," IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1510-1520, Jul. 2017.
[40] H. Khan, "Violence detection from industrial surveillance videos using deep learning," ProQuest One Academic, 2023.
[41] V. Sharma, M. Gupta, A. K. Pandey, D. Mishra, and A. Kumar, "A review of deep learning-based human activity recognition on benchmark video datasets," Appl. Artif. Intell., vol. 36, no. 1, pp. 1-22, Dec. 2022.

XIAOHONG YUAN is currently working as a Professor with the Department of Computer Science, North Carolina Agricultural and Technical State University. Her research has been funded by the National Security Agency, the National Centers of Academic Excellence in Cybersecurity (NCAE-C), the National Science Foundation, the Department of Energy, and the Department of Education. Her research interests include AI and machine learning, anomaly detection, software security, cyber identity, and cyber security education. She has served on the editorial board for several journals on cybersecurity.

LETU QINGGE received the Ph.D. degree in computer science from Montana State University, Bozeman, MT, USA. He is currently working as an Assistant Professor with the Department of Computer Science, North Carolina Agricultural and Technical State University, USA. His research has been funded by NSF and NIH. His research interests include algorithms, deep learning, bioinformatics, and computer vision.