Chapter 4
Automatic Detection of Knives in Complex Scenes
Maira Moran
IC - UFF, 24210-310 Niterói (Rio de Janeiro), Brazil, e-mail: [email protected]
Aura Conci
IC - UFF, 24210-310 Niterói (Rio de Janeiro), Brazil, e-mail: [email protected]
Ángel Sánchez
ETSII - URJC, 28933 Móstoles (Madrid), Spain, e-mail: [email protected]
1 Introduction
New Smart City (SC) technologies are helping cities to maximize their resources
and increase efficiencies in all facets of urban life. A SC consists of an urban space
where Information and Communication Technologies (ICT) are used to improve the
quality and performance of urban services such as transportation, energy, water,
infrastructure and other services, in order to reduce resource and energy consumption, waste and overall costs [10].
One of the relevant areas in a SC is guaranteeing the security of its citizens. Video surveillance (CCTV) cameras, which are commonly used by urban police departments, can be part of these “smart” technologies in combination with video analytics software. Video recordings contain a wealth of valuable information that can be automatically analyzed to detect anomalous (and even dangerous) events from multiple cameras. Commonly, security centers employ human operators who are in charge of a large number of CCTV cameras capturing multiple city views in real time. Since it is difficult for humans to keep their attention for several hours in front of many cameras (usually, more than 16), it is desirable that the video surveillance system be able to automatically recognize potentially critical security events in specific video frames and cameras. In such cases, the system can raise an alert so that the human operators focus their attention on a concrete camera. Image-content analytics technology can help solve the event detection problem by processing video frames and identifying, classifying and indexing some types of target objects (e.g., cars, motorcycles, persons or animals) [18]. Driven by Artificial Intelligence techniques, surveillance software can also make these images (or frames) in videos searchable, actionable and quantifiable.
In this context, this work presents a study of applying deep networks to the problem of automatically detecting knives (and related objects) in images. This is a challenging problem due to the multiple variabilities of these targets when appearing in scenes: the changing shapes of knives, their relatively small sizes in images, the possibility of being partially occluded, the possibility of being carried by a person or lying free in a location, or changing illumination conditions in scenes, among other difficulties. All these characteristics (which can also appear combined) can have a negative impact on the performance of the detection algorithms.
This paper describes research on combining super-resolution techniques with deep neural networks to effectively handle the knife detection problem in complex images. Our results show that the proposed methodology produces accurate results when detecting this particular type of object.
This paper is organized as follows. Section 2 summarizes the related work on the
considered knife detection problem. The aspects of small-object detection (and, in
particular, knives), as well as the description of the YOLOv4 model used in this work,
are described in Section 3. In Sections 4 and 5, we respectively describe the dataset
used in the experiment and some related pre-processing on it. The experiments
carried out and their analysis appear in Section 6. Finally, Section 7 concludes this
work.
2 Related work
The problem of small-sized object detection in labeled datasets is still far from being solved [15]. In this problem, very few image pixels represent the whole object of interest, which makes it difficult to detect and classify. The use of super-resolution to increase the object size, in order to compensate for the loss of object information, can help the detection task [16].
One specific use case of small-sized object detection consists in the detection of knives. As with other types of weapons, carrying knives in public is either forbidden or restricted in many countries. Since knives are both widely available and can be used as weapons, their detection is of high importance for security personnel [7].
One of the first works on automatic detection of knives was presented by Kmiec
and Glowacz in 2011 [11]. These authors compute a set of image descriptors using
Histograms of Oriented Gradients (HOG). These descriptors, which are invariant to geometric and photometric transformations, are used with an SVM for the detection task.
Glowacz and collaborators [7] propose an Active Appearance Model (AAM) to
detect knives in images. As the knife blade usually has a uniform texture, using an AAM could contribute to improving detections, since the model would not converge to other objects having a similar shape.
In 2016, Grega et al. [9] published a highly-cited work on the detection of firearms and
knives from CCTV images. Their goal is to reduce the number of false alarms in
detections. These authors use a modified sliding window technique to determine the
approximate position of the knife in an image. Then, they extract edge histograms
and texture descriptors to create feature vectors for training a SVM able to classify
the detected objects as knives.
Buckchash and Raman [2] proposed in 2017 a method to detect visual knives in images. Their approach has three stages: foreground segmentation, feature extraction using the FAST (Features from Accelerated Segment Test) corner detector, and Multi-Resolution Analysis (MRA) for classification and target confirmation.
More recent works make use of deep networks. Castillo et al. [3] presented a system to locate cold steel weapons (such as knives) in images. These weapons have a reflecting surface that, under different light conditions, can distort and blur their shape in the frames. To solve this problem, the authors propose combining a contrast-enhancement, brightness-guided preprocessing procedure with the use of different types of Convolutional Neural Networks (CNN).
Other authors have experimented with infrared (IR) images to detect non-visible (i.e., hidden) knives [17]. A type of deep neural network (GoogLeNet) that was trained on natural images was fine-tuned to classify the IR images as people or as people carrying a hidden knife.
A very comprehensive survey on the progress of Computer Vision-based concepts, methodologies, analysis and applications for automatic knife detection has been published recently, showing the state of the art of vision-based detection systems [4]. The authors define a taxonomy based on the state-of-the-art methods for knife detection. They analyze several image features used in the considered works for this task. The challenges regarding weapon detection and new frontiers in weapon detection are included as well. This survey references more than 80 works and concludes by pointing out some possible research gaps in this problem and related ones.
Another brief review of state-of-the-art approaches to knife identification and classification was presented very recently [5]. Although this article is not a review paper, it presents a broad revision of recent works using Convolutional Neural Networks (CNN), Region-based Convolutional Neural Networks (R-CNN), Faster R-CNN, and the OverFeat network, that is, most of the deep learning methods used up to now for the considered problem.
3 Object detection and the YOLOv4 model

This section describes the object detection problem (particularized for the case of knife detection) and the features of the YOLO architectures (particularized for the YOLOv4 model used in our experiments).
3.1 The object detection problem

Object detection is a challenging task in Computer Vision that has received wide attention in recent years, especially with the development of Deep Learning [18] [15]. It has many applications related to video surveillance, automated vehicle systems, robot vision or machine inspection, among many others. The problem consists in recognizing and localizing some classes of objects present in a static image or in a video. Recognizing (or classifying) means determining the categories (from a given set of classes) of all object instances present in the scene, together with the respective network confidence values for these detections. Localizing consists in returning the coordinates of each bounding box containing any considered object instance in the scene. The detection problem is different from (semantic) instance segmentation, where the goal is to identify, for each pixel of the image, the object instance (for every considered type of object) to which the pixel belongs. Some difficulties in the object detection problem include geometrical variations such as scale changes (e.g., a small size ratio between the object and the image containing it) and rotations of the objects (e.g., due to scene perspective the objects may not appear frontally); partial occlusion of objects by other elements in the scene; and illumination conditions (i.e., changes due to weather conditions, natural or artificial light), among others. Note that some images may contain several combined variabilities (e.g., small, rotated and partially occluded objects). In addition to detection accuracy, another important aspect to consider is how to speed up the detection task.
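To make the two sub-tasks concrete, a detection result can be represented by a class label, a confidence value and a bounding box. The following minimal Python sketch (the class and field names are ours, purely for illustration) shows such a record:

from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object instance in an image."""
    label: str         # predicted class, e.g. "knife"
    confidence: float  # network confidence in [0, 1]
    x_min: int         # bounding-box top-left corner (pixels)
    y_min: int
    x_max: int         # bounding-box bottom-right corner (pixels)
    y_max: int

# Example: a small, low-confidence detection near the image border
det = Detection(label="knife", confidence=0.31, x_min=12, y_min=40, x_max=58, y_max=71)
print(det)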
Detecting knives in images (and also in videos) is a challenging problem. The images where these objects appear can present several extrinsic and intrinsic variabilities due to the size of the target object (in general, its size ratio is very small when compared to the image size), the possibility of the weapon being carried by a person or appearing freely placed in a location, and the illumination conditions of the scene (which could produce a very low contrast between the knife and the surrounding background), among other real difficulties.
3.2 YOLOv4
Redmon and collaborators proposed in 2016 a new object detector model called YOLO (acronym of "You Only Look Once") [14], which handles object detection as a one-stage regression problem, taking an input image and learning simultaneously the class probabilities and the bounding-box object coordinates. This first version of YOLO was also called YOLOv1, and since then the successive improved versions of this architecture (YOLOv2, YOLOv3, YOLOv4 and YOLOv5, respectively) have gained much popularity within the Computer Vision community.
Unlike previous two-stage detection networks, such as R-CNN and Faster R-CNN, the YOLO model uses only one-stage detection. That is, it can make predictions with only one "pass" through the network. This feature makes the YOLO architecture extremely fast, at least 1000 times faster than R-CNN and 100 times faster than Fast R-CNN.
The architectures of all YOLO models share some similar components, which are summarized next (a minimal structural sketch follows the list):
• Backbone: A convolutional neural network that accumulates and produces visual
features with different shapes and sizes. Classification models like ResNet, VGG,
and EfficientNet are used as feature extractors.
• Neck: This component consists of a set of layers that receive the output features extracted by the Backbone (at different resolutions), and integrate and blend these characteristics before passing them on to the prediction layer. For example, models like Feature Pyramid Networks (FPN) or Path Aggregation Networks (PAN) have been used for this purpose.
• Head: This component takes in the features from the Neck along with the bounding-box predictions. It performs classification along with regression on the features and produces the bounding-box coordinates to complete the detection process. Generally, it produces four output values per detection: the 𝑥 and 𝑦 center coordinates, and the width and height of the detected object, respectively.
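As a rough structural illustration only (not the actual YOLOv4 implementation), the three components can be composed into a single forward pass; the layer choices and sizes below are placeholders of our own:

import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Toy one-stage detector illustrating the backbone/neck/head split."""
    def __init__(self, num_classes: int = 1):
        super().__init__()
        # Backbone: extracts visual features (stands in for Darknet53, ResNet, ...)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Neck: blends backbone features before prediction (stands in for FPN/PAN)
        self.neck = nn.Sequential(nn.Conv2d(32, 64, 1), nn.ReLU())
        # Head: per-cell predictions -> 4 box values + 1 objectness + class scores
        self.head = nn.Conv2d(64, 4 + 1 + num_classes, 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

preds = TinyDetector()(torch.randn(1, 3, 416, 416))
print(preds.shape)  # torch.Size([1, 6, 104, 104]) for a 416x416 input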
Next, we summarize the main specific features of the YOLOv4 architecture used in our experiments. YOLOv4 was released by Alexey Bochkovskiy et al. in their 2020 paper “YOLOv4: Optimal Speed and Accuracy of Object Detection” [1]. This model outperforms other convolutional detection models like EfficientNet and ResNeXt50. Like YOLOv3, it has the Darknet53 model as its Backbone component. It achieves a speed of 62 frames per second with an mAP of 43.5 percent on the COCO dataset.
4 Datasets
The success of the proposed method is highly related to the quality of the data used to train the supervised algorithm. Even considering that one of the main applications of the proposed approach is its inclusion in surveillance systems, to the best of our knowledge there are no publicly-available CCTV datasets for this problem. The datasets used in similar works consist of images captured by the authors, many of them obtained from the Internet. In this section, we present the two main datasets in this scope, which are also used in our work for training and testing the algorithms.
The DaSCI knives dataset [13] is a subset of a more general weapon detection dataset. It was created at the University of Granada as an open data repository and designed for the object detection task. The annotation files describe the image region where each knife is located by defining a corresponding bounding box. It is composed of 2,078 images, each of them containing at least one knife, resulting in 2,155 objects in total. The dataset was formed in a diverse way, i.e., the images were selected in order to provide samples with very different visual features, resulting in a robust and challenging dataset. Some considered visual features of knives are: types, shapes, colors, sizes, materials, locations, positions in relation to other scene objects, indoor/outdoor scenarios, and so on. The images were mostly extracted from the Internet; the main sources were free image stocks and YouTube videos, from which frames were extracted considering the criteria previously mentioned.
The dataset is divided into 15 subsets (referred to as DS1-DS15) according to their image sources. They are composed of 8, 130, 16, 12, 188, 242, 11, 36, 49, 130, 603, 29, 143, 108, and 83 images, respectively. Table 1 summarizes the information about these subsets. Figure 2 shows some examples of images extracted from some of these sources.
Table 1: Characteristics of the DaSCI subsets.

Source type           Video frames           DS1-DS9, DS12-DS15
                      Internet images        DS11
                      Captured by authors    DS10
Objects per image     One                    DS1-DS8
                      Multiple               DS9-DS15
Multiple scenarios    Yes                    DS1-DS9
                      No                     DS10-DS15
As previously mentioned, the size, position and location of the objects vary in this dataset. As a result, the area that each knife covers in the image also differs (although it is often small). Figure 3 shows histograms of these proportions. Even considering that the dataset was designed to present a high heterogeneity in this aspect, it can be observed that the majority of the objects (i.e., around 50%) cover only between 1% and 20% of the image area. The remaining objects are more equally distributed, occupying different portions of their respective images.
The fact that the knives in this dataset tend to occupy a small area of the images (and consequently present a low spatial resolution) is a challenging issue for the detection task, which must be taken into account in the pipeline of possible solutions to be developed.
Fig. 3: Histograms of object sizes for the knife samples in the DaSCI dataset: (a) relative object-to-image size ratio; (b) absolute object size (spatial resolution).
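The relative size plotted in Fig. 3(a) is simply the bounding-box area divided by the image area; a minimal sketch of this computation (variable names are ours):

def relative_object_size(xmin, ymin, xmax, ymax, img_w, img_h):
    """Fraction of the image area covered by the object's bounding box."""
    return ((xmax - xmin) * (ymax - ymin)) / float(img_w * img_h)

# Example: a 60x40-pixel knife in a 640x480 image covers ~0.78% of the area
print(relative_object_size(100, 200, 160, 240, 640, 480))  # 0.0078125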
It is important to mention that the annotations are not completely uniform, in the sense that in some cases the knife area described in the annotation file covers the whole knife, both blade and handle, while in other cases the described knife area covers only the blade.
The annotation files describe each image and the positions of the associated objects. Firstly, the image information is detailed, including its file name, path and dimensions (width, height and depth, the last one corresponding to the number of channels, mostly 3 since color images are defined by 3 RGB channels). Then, the information of each object is listed (always "knife" in this work), together with its respective region, which is described as a bounding box denoted by the coordinates of its top left (xmin, ymin) and bottom right (xmax, ymax) corners.

Fig. 4: Examples of images that compose the DaSCI dataset and their respective annotations.
The MS COCO (Microsoft Common Objects in Context) dataset [6] is widely used in the Computer Vision literature for object detection and segmentation tasks. Since the appearance of its first version, other upgraded versions of this dataset have been published. In this work, we consider the COCO 2017 dataset. It is a very large and complete dataset, composed of 330,000 images with 1.5 million objects. This dataset considers 80 different classes, and the class 'knife' is one of them, containing 7,770 labeled objects from 4,326 images. Since the COCO dataset was initially designed to encompass objects of 80 different classes, the images selected to compose it mostly portray scenes crowded with different objects, and knives are usually not the main object of interest in the scene. This can also be considered a challenging issue for the problem assessed in this study. Figure 5 shows some samples of the COCO dataset.
Fig. 5: Examples of images that compose the COCO dataset and their respective annotations.
Similarly to DaSCI, in this dataset the knives mostly present a very low spatial resolution, which is another aspect to be handled in this study. Figure 6 presents a histogram of the object area/total image area ratio for the samples in the COCO dataset.
The object bounding boxes in the MS COCO annotations are described by the 𝑥 and 𝑦 coordinates of the top left corner, and the object width and height, respectively.
Fig. 6: Histogram of object sizes for the knife samples in the MS COCO dataset.
The knife detection task has been previously assessed in other works in the literature (see the survey [4]). However, the number of publicly available datasets is still very limited. Regarding datasets that include knives in images, there are some available options that were initially proposed for classification tasks. Although their annotations would have to be expanded in order to be employed in a detection task, it is important to consider that such datasets are also available.
There is another dataset provided by DaSCI that could be employed for the knife classification task, composed of 10,039 images extracted from the Internet. The annotations cover 100 object classes, knife being the target one, with 635 images. Among the other classes are: 'car', 'plant', 'pen', 'smartphone', 'cigar', etc.
Grega et al. [8] also proposed a method for knife classification. Their dataset consists of 12,899 images at 100 × 100 pixels resolution, of which 9,340 are negative samples and 3,559 are positive samples. The positive samples consist of a scene with a knife held in a hand, and the negative samples consist of scenes with no knife. Concerning the environment, the scenes in the images can be indoor or outdoor.
5 Pre-processing of the datasets
The YOLOv4 algorithm expects each annotation file to present the following structure: object class, object coordinates (𝑥 and 𝑦), width and height, separated by a single space:
0 x y width height
In YOLOv4 annotation files, each line corresponds to an object. An example of an annotation in this format is shown next:
0 25 40 100 120
0 30 15 80 50
Note that each annotation file refers to an image that contains one or more objects. In the example above, the first line describes the first object, which is an object of class "0" (in this work we only consider the class 'knife', denoted by "0"). Also, the upper left corner of this first object's bounding box is at the position 𝑥 = 25 and 𝑦 = 40. Finally, this first object has a width of 100 and a height of 120. Similarly, the second object in the example annotation is an object of category "0" (knife), the bounding box that defines its area has its upper left corner positioned at 𝑥 = 30, 𝑦 = 15, and the object's width and height are 80 and 50, respectively.
As previously mentioned, the object regions in the DaSCI annotations are described as bounding boxes defined by the coordinates of the top left (xmin, ymin) and bottom right (xmax, ymax) corners. In this way, the values that compose the YOLOv4 annotations can be easily calculated from the DaSCI annotations:
x = xmin
y = ymin
width = xmax - xmin
height = ymax - ymin
This way, the YOLOv4 annotation obtained from the DaSCI XML annotation is composed of:
0 xmin ymin xmax-xmin ymax-ymin
As described in Section 4, the object's bounding box in the MS COCO annotations is also defined by the 𝑥 and 𝑦 coordinates of the upper left corner and the object's width and height, just as in the YOLOv4 annotation format. Thus, the information needed to compose the annotation is directly transcribed from the MS COCO JSON annotation file. Note that, in this structure, each object annotation refers to an object, not to an image.
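Both conversions can be written in a few lines. The sketch below is ours, assuming the DaSCI XML files follow the usual Pascal-VOC-style tag layout (object/bndbox/xmin/...); it emits one line per object in the format described above:

import xml.etree.ElementTree as ET

def dasci_xml_to_yolo(xml_path):
    """Convert a DaSCI (VOC-style) XML annotation to '0 x y width height' lines."""
    root = ET.parse(xml_path).getroot()
    lines = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin, ymin = int(box.find("xmin").text), int(box.find("ymin").text)
        xmax, ymax = int(box.find("xmax").text), int(box.find("ymax").text)
        # x = xmin, y = ymin, width = xmax - xmin, height = ymax - ymin
        lines.append(f"0 {xmin} {ymin} {xmax - xmin} {ymax - ymin}")
    return lines

def coco_object_to_yolo(coco_bbox):
    """MS COCO bboxes are already [x, y, width, height]; just prepend the class id."""
    x, y, w, h = coco_bbox
    return f"0 {int(x)} {int(y)} {int(w)} {int(h)}"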
The images to be used as input to the YOLOv4 algorithm must present a spatial resolution of 416 × 416 pixels. Hence, the images of both the MS COCO and DaSCI datasets must be resized to meet this condition. As previously mentioned, both datasets are composed of images with different sizes (spatial resolutions), so for some images the rescaling results in a decrease of the image size, whereas for others it enlarges the original image. Increasing the image size can be especially critical, since the methods commonly used for this task consist of interpolations that frequently lead to effects like blur, aliasing, etc., degrading the quality of the resulting image.
In order to observe the impact of the resizing part of the preprocessing, two alternative resizing operations were performed. The first one is bilinear interpolation, commonly used as a "black box" operation in most machine learning libraries, including the PyTorch Python library used in this work. The second one is SRGAN (generative adversarial network for single image super-resolution) [12], which is a supervised machine learning algorithm. SRGAN, or more specifically one of its variations, is currently the state of the art for some widely known challenges. SRGAN uses a generative network 𝐺 to create high-resolution images which are so similar to the original ones that they can mislead the differentiable discriminator 𝐷, which is trained to distinguish the generated images from real high-resolution ones. In this process, the 𝐷 network demands an evolution of 𝐺 during the training process, leading to perceptually superior solutions [12]. To obtain the model used in this work, the SRGAN training was performed using the ImageNet dataset.
On the other hand, bilinear interpolation calculates the values of the new interpolated points based on a weighted mean of their surrounding points (a four-point neighborhood) in the original image. The weight assigned to each neighboring point is based on its distance to the new point. Consequently, the value of the new point is mostly defined by the value of its closest neighbor, but it is also influenced by the values of the other neighbors.
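For reference, the bilinear branch of this preprocessing can be reproduced with a single PyTorch call, as sketched below; the SRGAN branch would instead pass the image through a pretrained generator G (not implemented here). This is our own illustration, not the authors' code:

import torch
import torch.nn.functional as F

def resize_bilinear(image: torch.Tensor, size: int = 416) -> torch.Tensor:
    """Resize a (C, H, W) image tensor to size x size with bilinear interpolation."""
    return F.interpolate(image.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)

# Example: upscaling a small 3-channel image to the 416x416 YOLOv4 input size
small = torch.rand(3, 200, 150)
print(resize_bilinear(small).shape)  # torch.Size([3, 416, 416])

# SRGAN alternative (sketch only): upscaled = srgan_generator(small.unsqueeze(0)),
# where `srgan_generator` stands for a pretrained generator G as in [12]; its output
# would then be resized/cropped to 416 x 416 if the scale factor does not match.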
In this experiment, we analyze the impact of using super-resolution as a pre-processing step of the object detection algorithm. For this purpose, we adopted a cross-dataset evaluation approach. Evaluations configured in an in-domain setting, defined as using samples from the same dataset for training and testing the algorithms, tend to introduce bias and negatively affect the assessment of the generalization of machine learning algorithms. Moreover, the transfer learning technique was also assessed, as described in Section 5.3.
As previously mentioned, several factors can affect the performance of the proposed algorithms, such as the lighting conditions, object size, perspective, visibility, etc. Accordingly, we created subsets of interest from the original test set. Each of these test sets presents a special condition, so one can observe how a particular condition affects the results of the models. The subsets are listed next:
1. Outdoor: it covers all the images that depict outdoor scenes, mostly related to a higher luminosity;
2. Indoor: composed of images that denote indoor scenes, mostly presenting a lower
luminosity;
3. Occluded: composed of images in which the knives are being handled by a person,
remaining partially occluded;
4. Not occluded: the object is lying on a surface and it is not held by anyone.
These subsets are not exclusive, i.e., the same image can belong to more than one subset, except when the defining conditions are mutually exclusive (e.g., subsets 1 and 2), as sketched below.
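A minimal way to build such subsets, assuming each test image carries metadata flags for scene type and occlusion (an assumption of ours; the chapter does not describe how the splits were annotated):

# Hypothetical per-image metadata; field names are ours, for illustration only.
test_images = [
    {"file": "img_001.jpg", "outdoor": True,  "occluded": False},
    {"file": "img_002.jpg", "outdoor": False, "occluded": True},
]

subsets = {
    "outdoor":      [m for m in test_images if m["outdoor"]],
    "indoor":       [m for m in test_images if not m["outdoor"]],
    "occluded":     [m for m in test_images if m["occluded"]],
    "not_occluded": [m for m in test_images if not m["occluded"]],
}
print({name: len(imgs) for name, imgs in subsets.items()})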
Also, the ratio between object size and image size is a factor that can affect the models' results, especially considering that the use of super-resolution as a pre-processing step may influence the performance for small objects. As presented in the histogram of Section 4 (Figure 3), most of the objects that compose the DaSCI dataset, which is used as the test set in our experiments, cover around 20% of the area of the corresponding images.
6 Experimental results
In this section, we present the results of the performed experiments, which analyze the impact of using transfer learning and super-resolution techniques in the training process of the object detection network (YOLOv4). In Subsection 6.1, we summarize the metrics used in this analysis. Then, Subsection 6.2 presents the results of the experiments, comparing the results obtained by each of the mentioned techniques, both in general and associated with different aspects of the test dataset such as object size, visibility and illumination.
The evaluation is based on true positives TP (i.e., regions correctly detected as regions containing knives); false negatives FN (i.e., non-detected regions containing knives); and false positives FP (i.e., regions incorrectly detected as regions containing knives). From these counts, metrics such as Precision (Prec), Recall (Rec) and F1-score can be calculated using the following equations:

Prec = TP / (TP + FP)    Rec = TP / (TP + FN)    F1 = 2 · Prec · Rec / (Prec + Rec)    (1)
The Jaccard index, or Intersection over Union (IoU), is also used in this analysis. This metric compares the areas of the bounding boxes denoting the detected knives and the corresponding ground truths. The optimal and maximal value for IoU is 1, which denotes that the area of the intersection of the two bounding boxes (obtained by the algorithm and the ground truth) is identical to the area defined by their union (i.e., the areas are equal).
Figure 8 exemplifies the mentioned IoU areas for several test images. The area in blue represents the bounding box obtained by one of the proposed algorithms, and the area in violet shows the bounding box defined by the ground truth.

Fig. 8: Examples of bounding boxes: ground truth (violet) and algorithm results (blue).
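The metrics in Eq. (1) and the IoU can be computed directly from the detection counts and box coordinates. The sketch below is ours, with boxes given as (xmin, ymin, xmax, ymax):

def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def precision_recall_f1(tp, fp, fn):
    """Eq. (1): Precision, Recall and F1-score from TP, FP and FN counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))       # 25 / 175 ~ 0.143
print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, 0.667, 0.727)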
Table 2: Evaluated models, according to their training process (transfer learning and pre-processing).

Model    Transfer Learning    Pre-processing
M1       No                   Bilinear interpolation
M2       Yes                  Bilinear interpolation
M3       No                   SRGAN
M4       Yes                  SRGAN
As described in section 5.1.1, we used the cross-dataset approach to train and test
all the models. The test dataset (DaSCI) is composed of 2,078 images, which cover
2,155 objects. The main hyper-parameters used in the training process are: confidence
prediction threshold = 0.25, IoU threshold = 0.5, and batch size = 1.
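With these thresholds (confidence 0.25, IoU 0.5), detections can be matched to ground-truth boxes to obtain TP, FP and FN per image. The chapter does not spell out the matching rule, so the greedy scheme below is only one common choice (it reuses the iou function from the previous sketch):

def count_tp_fp_fn(detections, ground_truths, conf_thr=0.25, iou_thr=0.5):
    """Greedy matching: each ground-truth box is matched to at most one detection.

    detections:    list of (confidence, (xmin, ymin, xmax, ymax)) for one image
    ground_truths: list of (xmin, ymin, xmax, ymax) for the same image
    """
    kept = sorted([d for d in detections if d[0] >= conf_thr],
                  key=lambda d: d[0], reverse=True)
    matched, tp = set(), 0
    for _, box in kept:
        # best still-unmatched ground truth for this detection
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i not in matched and iou(box, gt) > best_iou:
                best_iou, best_gt = iou(box, gt), i
        if best_gt is not None and best_iou >= iou_thr:
            matched.add(best_gt)
            tp += 1
    fp = len(kept) - tp
    fn = len(ground_truths) - tp
    return tp, fp, fn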
Table 3 shows the values obtained for the selected metrics considering the whole dataset. It is possible to observe that not all models detected most of the objects. The best overall performance was achieved by M3. The use of the proposed transfer learning technique seems to promote a worse overall performance for models M2 and M4 compared with M1 and M3. Also, the results suggest that using the super-resolution pre-processing affects the models' performance in different ways depending on whether it is combined with transfer learning or not.
For the models not trained with transfer learning (M1 and M3), the SRGAN subtly improved the results, increasing the number of TP by 7 cases and reducing the number of FN by 7 cases. The number of FP was substantially reduced (-111 cases). On the other hand, for the models trained with transfer learning (M2 and M4), the results using SRGAN were substantially worse. The difference is -344 (-32.12%) in TP, +437 (+56.46%) in FP, and +344 (+31.73%) in FN.
Concerning the other performance metrics, the M1 model presented the best average IoU values, and the M3 model presented the best Precision, Recall and F1-score values. In general, the use of the super-resolution pre-processing had a negative impact on both metrics. On the other hand, the use of the transfer learning technique promoted worse average results for M2 and M4, compared with M1 and M3.
The differences in the IoU values achieved by each model can also be observed in the histograms presented in Figure 9, where it is possible to observe that models M1 and M3 achieved IoU values that lie mostly in the 70%-100% interval. On the other hand, the IoU values achieved by models M2 and M4 lie mostly in the 1%-20% interval.
Fig. 9: Histograms of the IoU values achieved by each model: (a) M1, (b) M2, (c) M3, (d) M4.

The plots presented in Figure 10 show the performance variations associated with the relative object size in the respective images. As mentioned in Section 4, most objects in the test set are very small in relation to their respective images (around 50% cover 1%-10% of the image area). In this way, the performance of the models for relatively small objects represents a large part of the overall results. Also, it is expected that in real-world detection applications, such as surveillance videos, the objects would cover a very small portion of the images. Therefore, the results for these cases are especially important in our assessment.
In Figure 10, it is possible to observe that the performance of models M1 and M3 remains similar for most relative object sizes. On the other hand, models M2 and M4 present a better performance for objects having a relative size of less than around 30% of the image.
Another factor that may affect the detection performance is occlusion, which in the considered context is defined by the object being handled by a person, whose hand consequently occludes the knife blade. Table 4 compares the results between the portion of the dataset in which the objects are partially occluded as described, and the case in which the objects are completely visible, i.e., placed on some flat surface. Observe that the results differ significantly, especially for the M2 model, which suggests that this aspect clearly affects the models' performance. Overall, all models presented better results in cases in which the object was occluded. Similar to the trend pointed out for the general results, the models trained without transfer learning achieved better results, M1 being the best model for occluded objects and M3 the best model for non-occluded objects.
Fig. 10: IoU variations associated with relative object sizes for each model: (a) M1, (b) M2, (c) M3, (d) M4.
Finally, another factor considered in our evaluation is the natural illumination of the scene in each image. More specifically, we compare the models' results for indoor and outdoor scenes, since this change of natural illumination may have some impact on the detection performance. Table 5 summarizes these results.
According to the results, the natural illumination does not seem to be a particularly challenging factor for the detection models, since the results of all models tend to be similar for both indoor and outdoor scenes. It is possible to observe that the models achieved slightly better results with outdoor scenes. In contrast with the occlusion factor, the natural illumination variation is more equally represented in the test dataset (i.e., the numbers of images with indoor and outdoor scenes are relatively close).
7 Conclusion
In this work, we evaluated the performance of the YOLOv4 algorithm for detecting knives in natural images. In the performed experiments, two other conditions were assessed: the use of a super-resolution algorithm as a pre-processing step, and the transfer learning technique. The evaluation of results considers not only the whole test dataset but also specific subsets, in order to evaluate whether there are specific conditions that can affect the results, such as object sizes, natural illumination and partial occlusions. The results showed that the use of a super-resolution pre-processing algorithm only promotes better results when it is not combined with the transfer learning technique. Moreover, the use of the proposed transfer learning reduced the overall performance of our YOLOv4 models.
In future work, we aim to evaluate other object detection algorithms and other
pre-processing techniques. Finally, we will also explore the classification aspect of
object detection algorithms, proposing different classes of knives, considering their
specific features.
References
1. Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection, 2020.
2. Himanshu Buckchash and Balasubramanian Raman. A robust object detector: Application
to detection of visual knives. In 2017 IEEE International Conference on Multimedia Expo
Workshops (ICMEW), pages 633–638, 2017.
3. Alberto Castillo, Siham Tabik, Francisco Pérez, Roberto Olmos, and Francisco Herrera. Bright-
ness guided preprocessing for automatic cold steel weapon detection in surveillance videos
with deep learning. Neurocomputing, 330:151–161, 2019.
4. Rajib Debnath and Mrinal Kanti Bhowmik. A comprehensive survey on computer vision based
concepts, methodologies, analysis and applications for automatic gun/knife detection. Journal
of Visual Communication and Image Representation, 79, 2021.
5. Neelam Dwivedi, Dushyant Kumar Singh, and Dharmender Singh Kushwaha. Employing data
generation for visual weapon identification using convolutional neural networks. Multimedia
Systems, 28(10):347–360, 2022.
6. Tsung-Yi Lin et al. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer International Publishing, 2014.
7. Andrzej Glowacz, Marcin Kmieć, and Andrzej Dziech. Visual detection of knives in se-
curity applications using active appearance models. Multimedia Tools and Applications,
74(12):56416–56429, 2015.
8. Michał Grega, Andrzej Matiolański, Piotr Guzik, and Mikołaj Leszczuk. Automated detection of firearms and knives in a CCTV image. Sensors, 16(1):47, 2016.
9. Michał Grega, Andrzej Matiolański, Piotr Guzik, and Mikołaj Leszczuk. Automated detection of firearms and knives in a CCTV image. Sensors, 16(1), 2016.
10. Rida Khatoun and Sherali Zeadally. Smart cities: concepts, architectures, research opportuni-
ties. Communications of the ACM, 59(8):46–57, 2016.
11. Marcin Kmiec and Andrzej Glowacz. An approach to robust visual knife detection. Machine
Graphics Vision International Journal, 20(2):215–227, 2011.
12. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro
Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-
realistic single image super-resolution using a generative adversarial network, 2017.
13. Roberto Olmos, Siham Tabik, and Francisco Herrera. Automatic handgun detection alarm in
videos using deep learning. Neurocomputing, 275:66–72, 2018.
14. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified,
real-time object detection, 2016.
15. Kang Tong, Yiquan Wu, and Fei Zhou. Recent advances in small object detection based on
deep learning: A review. Image and Vision Computing, 97:103910, 03 2020.
16. Zhuang-Zhuang Wang, Kai Xie, Xin-Yu Zhang, Hua-Quan Chen, Chang Wen, and Jian-Biao He. Small-object detection based on YOLO and dense block via image super-resolution. IEEE Access, 9:56416–56429, 2021.
17. Sumeth Yuenyong, Narit Hnoohom, and Konlakorn Wongpatikaseree. Automatic detection of
knives in infrared images. pages 65–68, 02 2018.
18. Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A
survey. arXiv preprint arXiv:1905.05055, 2019.