Deep Learning-Based Real-Time Weapon Detection System
ISSN (2210-142X)
Int. J. Com. Dig. Sys. 14, No. 1 (Aug-2023)
http://dx.doi.org/10.12785/ijcds/140141
Received 4 Feb. 2023, Revised 18 Mar. 2023, Accepted 06 May. 2023, Published 01 Aug. 2023
Abstract: In recent years, the rate of gun violence has risen at a rapid pace. Most current security systems rely on human personnel
to monitor lobbies and halls constantly. With the advancement of machine learning and, specifically, deep learning techniques, future
closed-circuit TV (CCTV) and security systems should be able to detect threats and act upon this detection when needed.
This paper presents a security system architecture that uses deep learning and image-processing techniques for real-time weapon detection. The system detects people carrying different types of weapons by periodically capturing images from a video feed. These images are fed to a convolutional neural network (CNN), which then decides whether the image contains a threat. If a threat is detected, the system alerts the security guards through a mobile application and sends them an image of the situation. The system was tested and achieved a testing accuracy of 92.5%, and it was able to complete detection in as little as 1.6 seconds.
monitor people entering a building lobby. The camera(s) should be installed facing the entrance area, providing real-time images. Once the system detects a threat, it automatically notifies the security guards in the building on their mobile phones. The security guards receive an image of what is happening in that area of the building and respond accordingly. It is worth noting that this work does not extend to concealed weapons [15], nor to the detection of suspects based on worrying expressions or unnatural behavior [16]. It is also not concerned with predicting crimes or robberies before they happen [17].

To summarize, the objectives of the paper are as follows:

• Establish the ability of CNNs to detect handheld weapons using CCTV feeds.

• Identify the proper image pre-processing steps needed to facilitate the detection process.

• Achieve reasonable performance metrics using the proposed architecture.

The rest of the paper is organized as follows: Section 2 provides an overview of the literature and related work. Section 3 details the method used to solve the threat detection problem, including the details of the DNN model used. The results of the system testing are discussed in Section 4. Section 5 presents the proposed system's limitations and practical considerations. Finally, Section 6 presents the work's conclusions and future research directions.

2. Related Work and Literature Review
This section summarizes a few papers that used different approaches to automatic weapon detection using deep learning and image processing.

The first approach uses background reduction to remove static objects: a reference image of the place is compared to a present image of the same place to eliminate similarities [18]. Using Canny edge detection, the obtained image becomes a silhouette. After that, the model uses a sliding window, MPEG-7 feature extraction, and a support vector machine (SVM) to decide whether to send an alert. The sliding window size is determined by trial and error after installing the system, which means the system needs to be tuned with different sliding window sizes depending on the installation.
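To make the idea concrete, the following Python/OpenCV sketch illustrates only the background-reduction and edge-detection stages of such a pipeline; the MPEG-7 feature extraction and SVM stages of [18] are omitted, and the file names and threshold values are assumptions:

    import cv2

    # Minimal sketch of the background-reduction idea: compare a
    # reference image of the empty scene with the current frame, keep
    # only the differences, then extract a silhouette. File names and
    # threshold values are assumptions for this sketch.
    reference = cv2.imread("empty_lobby.jpg", cv2.IMREAD_GRAYSCALE)
    current = cv2.imread("current_frame.jpg", cv2.IMREAD_GRAYSCALE)

    # The absolute difference removes static objects present in both images.
    diff = cv2.absdiff(current, reference)

    # Threshold the difference to keep only the significant changes.
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

    # Canny edge detection turns the remaining foreground into a silhouette.
    silhouette = cv2.Canny(mask, 100, 200)
    cv2.imwrite("silhouette.png", silhouette)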
The second approach uses deep learning for automatic handgun detection [19]. It compares sliding-window results with Region-based Convolutional Neural Networks (R-CNN). Using the Histogram of Oriented Gradients (HOG) descriptor for feature extraction together with a sliding window, the processing speed for detecting pedestrians is 14 s/image, which is not practical for real-time applications. This is in addition to the large computational power needed for the sizeable neural network. The real challenge is to create a method that dynamically optimizes all of the CNN's parameters simultaneously [20].
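For illustration, OpenCV ships a HOG descriptor with a pre-trained sliding-window pedestrian detector; the following minimal sketch shows that combination (the image path and window stride are assumptions, not the configuration of [19]):

    import cv2

    # Minimal sketch of HOG feature extraction with a sliding-window
    # pedestrian detector, using OpenCV's built-in person detector.
    image = cv2.imread("lobby_frame.jpg")

    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    # detectMultiScale slides the detection window over the image at
    # several scales and returns bounding boxes around detected people.
    boxes, weights = hog.detectMultiScale(image, winStride=(8, 8))
    for (x, y, w, h) in boxes:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("pedestrians.png", image)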
The third approach, by Verma and Dhillon, uses Faster R-CNN and deep learning [21]. A VGG-16-based classification model (16 convolutional layers) is used, which focuses on minimizing the prediction loss. Its methods are the same as those of the previous approaches, but with the minor difference of having a fixed input size of 224 x 224 RGB images.

The fourth approach [22] uses the same basics mentioned above (sliding window, HOG feature extraction, SIFT, and Harris key-point detection) with the unipolar sigmoid and bipolar sigmoid as activation functions. The idea is to detect the humans first and then check whether there is a weapon in the picture. Notice that detecting humans using background reduction is faster than HOG, which is essential in real-time applications. However, for background reduction to be efficient, the camera must be installed indoors, because the background subtraction method is not flexible and can be affected by slight changes in light intensity or by object occlusions.

The fifth approach, by Lai and Maples, tried various pre-trained models such as the VGG-16 mentioned earlier, OverFeat-1, OverFeat-2, and OverFeat-3 [23]. The problem that paper addresses is mainly real-time detection, and despite the significant accuracy of GoogleNet/OverFeat, it needed 14 s per image, which is impractical. That paper's final results of reasonably high accuracy and a satisfactory classification time of 1.3 seconds were achieved using OverFeat-3 with tuned hyper-parameters.

The sixth approach proposes a detection model (ResNet50V2) trained on the Open Images V6 dataset to detect the visual relationships of "holds" and "wears" between people and objects [24].

The seventh approach utilizes the transfer-learning concept to train the VGGNet architecture based on VGGNet-16 weights [25]. While it achieved high accuracy, the dataset used contains stand-alone weapons only. Another approach based on the VGGNet architecture classified seven classes of weapons and achieved an accuracy of 98.4%, which was higher than other basic models such as VGG-16 and ResNet. The training was done on a small dataset from the internet containing just over five thousand images. However, the training images contained only the weapon, with no other objects, and there was no mention of the time required to process each image [26].

In [27], the main focus was cold steel weapons. It tackled the challenge posed by light reflections on this type of weapon by utilizing different region-proposal search algorithms, and it used DaCoLT darkening-and-contrast preprocessing to handle different brightness and lighting conditions.

Based on this summary of the previous work, it is clear that there is a gap in finding a method that is:

• Able to detect weapons in near real-time.

• Computationally light.

• Tested on more than one type of weapon.
times removed most of the image's noise. The kernel size selected is (5, 5), and the number of iterations is two. Figure 5 shows the results of using too many iterations on a small image: Figure 5a shows the output after two iterations, while Figure 5b shows the output after six. Figure 5b shows less noise; however, the critical information, which is the firearm held by the person, is cropped because of the excessive number of iterations. Processing images in grayscale instead of color increases the speed, so the image is converted to grayscale, and another binary threshold is applied. The most effective way to apply the mask is using bitwise logical operations. A bitwise XOR is used, which inverts the foreground; then a bitwise AND applies the mask to the original image. Figure 6 shows the final output after performing the image-processing steps. Figure 6a has too many features, which might confuse the model, increase the time needed for model training, and negatively affect the accuracy. In contrast, in Figure 6b, with the background eliminated and only the human (with or without a weapon) visible, the desired features are more evident than those in Figure 6a, and the system's accuracy would be higher.
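The following Python/OpenCV sketch mirrors the masking steps just described, assuming a background-subtracted input image; the file names, threshold value, and mask polarity are assumptions:

    import cv2
    import numpy as np

    # Sketch of the masking steps described above. The (5, 5) kernel and
    # the two opening iterations follow the text; file names, the
    # threshold value, and the mask polarity are assumptions.
    frame = cv2.imread("current_frame.jpg")
    foreground = cv2.imread("subtracted.png")  # background-subtracted image (assumed input)

    # Morphological opening with a (5, 5) kernel and two iterations removes noise.
    kernel = np.ones((5, 5), np.uint8)
    opened = cv2.morphologyEx(foreground, cv2.MORPH_OPEN, kernel, iterations=2)

    # Convert to grayscale for speed and apply another binary threshold.
    gray = cv2.cvtColor(opened, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

    # A bitwise XOR with an all-ones image inverts the foreground, and a
    # bitwise AND then applies the mask to the original image.
    inverted = cv2.bitwise_xor(mask, np.full_like(mask, 255))
    result = cv2.bitwise_and(frame, frame, mask=inverted)
    cv2.imwrite("masked.png", result)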
C. The CNN Architecture
The input layer is the input interface of the neural network, where input images are loaded. Because of the GPU memory limit and the relatively small dataset size, a resolution of 640p x 360p is used for the input images.
Figure 4. Applying blur. (a) Original image. (b) Subtracted without blurring. (c) Gaussian blur. (d) Median blur.
Figure 5. Applying opening. (a) Using two iterations. (b) Using six iterations.
Figure 8. Training accuracy and validation accuracy for the baseline architecture

B. Dropout Architecture
To reduce the gap between the training and validation accuracies, dropout is added after each of the hidden layers. Dropout 'drops out' random neurons at a specified rate as a regularization technique to reduce overfitting. As a result, more training epochs are required to reach the same training accuracy, because some neurons are dropped during training, but each epoch takes a shorter time. The difference between the validation and training accuracies decreased during the first few epochs, as shown in Figure 9. However, the results did not improve, as the validation accuracy did not exceed 91.9%. One might notice that stopping training at the fifth epoch would be better than continuing until the twelfth epoch, as more overfitting occurred without a significant increase in validation accuracy. This technique, one of the regularization techniques used in machine learning, is called 'early stopping'.
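A minimal Keras sketch of these two ideas, dropout after each hidden layer and early stopping, is shown below; the layer sizes, dropout rate, and input shape are assumptions rather than the paper's exact values:

    from tensorflow.keras import layers, models, callbacks

    # Minimal sketch of dropout after each hidden layer plus early
    # stopping. Layer sizes, the dropout rate, and the input shape are
    # assumptions, not the paper's exact values.
    model = models.Sequential([
        layers.Input(shape=(360, 640, 1)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),                    # randomly drop neurons during training
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # threat / no-threat output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Early stopping halts training once the validation metric stops
    # improving, e.g. near the fifth epoch instead of the twelfth.
    stopper = callbacks.EarlyStopping(monitor="val_accuracy", patience=3,
                                      restore_best_weights=True)
    # model.fit(train_images, train_labels, epochs=12,
    #           validation_data=(val_images, val_labels), callbacks=[stopper])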
Metric                Value
-------------------   --------------
Validation Accuracy   92.4%
True Negatives        187 out of 200
True Positives        183 out of 200
Accuracy              92.5%
Precision             93.4%
Recall                91.5%
F1-score              92.4%
Pre-Processing Time   0.3 s
Classification Time   2.0 s
Precision = True Positives / (True Positives + False Positives)    (2)

Recall = True Positives / (True Positives + False Negatives)    (3)

F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (4)
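As a sanity check, the reported values follow from the confusion-matrix counts in the table above: with 200 positive and 200 negative test samples, there are 200 - 187 = 13 false positives and 200 - 183 = 17 false negatives.

    # Recompute the reported metrics from the confusion-matrix counts
    # in the table above (200 positive and 200 negative test samples).
    tp, tn = 183, 187
    fp, fn = 200 - tn, 200 - tp   # 13 false positives, 17 false negatives

    accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.925
    precision = tp / (tp + fp)                          # ~0.934
    recall = tp / (tp + fn)                             # 0.915
    f1 = 2 * precision * recall / (precision + recall)  # ~0.924

    print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} f1={f1:.3f}")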
Lower-resolution images (640p x 360p) drastically decreased the time required to pre-process and classify each image compared to the full-size images. The time decreased from 9.5 seconds to 2.3 seconds, divided as follows: the average pre-processing time is 0.3 seconds, and the average classification time is 2 seconds. This also decreased the time required to send the image, as the time needed to encode, transmit, and decode it was drastically reduced.

Figure 9. Training accuracy and validation accuracy for the dropout architecture

C. Max-pooling Inputs Architecture
A max-pooling function with a 2x2 window size is applied after the input layer and before the hidden layers to highlight the desired features as a pre-processing step before training. This enabled increasing the batch size to 64 and showed significantly faster learning and validation performance. The increase in the batch size can be justified by the resolution reduction (the image entering the CNN now has a quarter of the number of pixels of the original image), which significantly reduced the trainable parameters from 139 million to 76 million. It also gave similar validation and testing results: a validation accuracy of 91.8% and a test accuracy of 92.5%, as shown in Table IV. Although this model has a slightly lower validation accuracy, it is highly preferred over the previous two architectures because of its faster classification response. The average classification time is now 1.3 seconds, down from 2.0 seconds in the baseline architecture. These times were measured by embedding timing code around the pre-processing and classification steps.
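A minimal Keras sketch of this arrangement, with a 2x2 max-pooling layer placed directly after the input, is given below; the input shape and hidden-layer sizes are assumptions:

    from tensorflow.keras import layers, models

    # Sketch of the max-pooling-inputs idea: a 2x2 max-pooling layer
    # right after the input quarters the pixel count before the hidden
    # layers. The input shape and hidden-layer sizes are assumptions.
    model = models.Sequential([
        layers.Input(shape=(360, 640, 1)),
        layers.MaxPooling2D(pool_size=(2, 2)),  # 360x640 -> 180x320
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()  # shows the reduced number of trainable parameters

Quartering the pixel count at the input is what permits the larger batch size and the smaller parameter count reported above.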
Future Transportation, vol. 3, no. 1, pp. 189–209, 2023. [Online]. Available: https://www.mdpi.com/2673-7590/3/1/12

[13] Z. Ullah, F. Al-Turjman, L. Mostarda, and R. Gagliardi, "Applications of artificial intelligence and machine learning in smart cities," Computer Communications, vol. 154, pp. 313–323, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0140366419320821

[14] Z. Sabeur, C. M. Angelopoulos, L. Collick, N. Chechina, D. Cetinkaya, and A. Bruno, "Advanced cyber and physical situation awareness in urban smart spaces," in Advances in Neuroergonomics and Cognitive Engineering, H. Ayaz, U. Asgher, and L. Paletta, Eds. Cham: Springer International Publishing, 2021, pp. 428–441.

[15] M. Parande and S. Soma, "Concealed weapon detection in a human body by infrared imaging," International Journal of Science and Research (IJSR), vol. 4, pp. 182–188, 2015.

[16] H. Bouma, J. van Rest, K. van Buul-Besseling, J. de Jong, and A. Havekes, "Integrated roadmap for the rapid finding and tracking of people at large airports," International Journal of Critical Infrastructure Protection, vol. 12, pp. 61–74, 2016.

[17] M. P. de la Cruz López, J. J. Cartelle Barros, A. del Caño Gochi, M. C. Garaboa Fernández, and J. Blanco Leis, "Assessing the risk of robbery in bank branches to reduce impact on personnel," Risk Analysis, vol. n/a, no. n/a, 2021.

vol. 11, no. 16, 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/16/7535

[27] A. Castillo, S. Tabik, F. Pérez, R. Olmos, and F. Herrera, "Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning," Neurocomputing, vol. 330, pp. 151–161, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231218313365

[28] S. Karsoliya, "Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture," International Journal of Engineering Trends and Technology, vol. 3, 2012.

[29] M. Madhiarasan and S. N. Deepa, "Comparative analysis on hidden neurons estimation in multi layer perceptron neural networks for wind speed forecasting," Artificial Intelligence Review, vol. 48, no. 4, pp. 449–471, Dec. 2017.

[30] M. Basavarajaiah, "Maxpooling vs minpooling vs average pooling," 2021. [Online]. Available: https://medium.com/@bdhuma/95fb03f45a9

[31] S. Ruder, "An overview of gradient descent optimization algorithms," 2017.

[32] Y. Bengio, Practical Recommendations for Gradient-Based Training of Deep Architectures. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 437–478.
Omar Z. Alzaibaq received his B.Sc. degree in Computer Engineering from Princess Sumaya University for Technology (PSUT), Jordan, in 2019. He is currently working as a systems and software engineer at Iotistic Solutions. His main fields include development, Linux-based scripting, fleet management systems, and artificial intelligence.

Yazan Abu Hashyeh received his B.Sc. degree in Electronics Engineering from Princess Sumaya University for Technology (PSUT), Jordan, in 2019. Since October 2019, he has worked as an artificial intelligence engineer at Iotistic Solutions. He also works as a part-time lecturer with Pioneers Academy, teaching Python and artificial intelligence. His main fields of work are image processing and software development.