Deep learning based efficient ship detection from drone-captured images for maritime surveillance

Shuxiao Cheng a, Yishuang Zhu a,b, Shaohua Wu a,b,∗

a Guangdong Provincial Key Laboratory of Aerospace Communication and Networking Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
b Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen 518055, China
Keywords: Ship detection; Drone-captured images; Maritime surveillance; Convolutional Neural Network (CNN); YOLOv5

Abstract: The use of drones to observe ships is an effective means of maritime surveillance. However, the object scale in drone-captured images changes dramatically, presenting a significant challenge for ship detection. Additionally, the limited computing resources on drones make it difficult to achieve fast detection speed. To address these issues, we propose an efficient deep learning based network, namely YOLOv5-ODConvNeXt, for ship detection from drone-captured images. YOLOv5-ODConvNeXt is a more accurate and faster network designed to improve the efficiency of maritime surveillance. Based on YOLOv5, we implement Omni-dimensional Dynamic Convolution (ODConv) in the YOLOv5 backbone to boost the accuracy without increasing the network width and depth. We also replace the original C3 block with a ConvNeXt block in the YOLOv5 backbone to accelerate detection speed with only a slight decline in accuracy. We test our model on a self-constructed ship detection dataset containing 3200 images captured by drones or with a drone view. The experimental results show that our model achieves 48.0% AP, exceeding the accuracy of YOLOv5s by 1.2% AP. The detection speed of our network is 8.3 ms per image on an NVIDIA RTX3090 GPU, exceeding the detection speed of YOLOv5s by 13.3%. Our code is available at https://github.com/chengshuxiao/YOLOv5-ODConvNeXt.
∗ Corresponding author at: Guangdong Provincial Key Laboratory of Aerospace Communication and Networking Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China.
E-mail addresses: [email protected] (S. Cheng), [email protected] (Y. Zhu), [email protected] (S. Wu).
1. Introduction

Ship detection aims to recognize ships and their coarse borders from an image. In recent years, deep neural networks (DNNs) have been frequently used for this purpose. However, applying ship detection on drones brings two main difficulties. First, the size of ships in drone-captured images varies with the height and attitude of the drone. In addition, drone-captured images contain abundant and complex background information, making ship detection more challenging. Second, insufficient computing resources on drones can slow down detection speed and result in poor real-time performance. Therefore, there is an urgent need for a fast yet accurate ship detector.

The YOLO series (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018; Bochkovskiy et al., 2020; Jocher et al., 2022), which plays an essential role in object detection tasks, is widely used in ship detection. In this paper, we propose an improved version of the YOLOv5s (Jocher et al., 2022) model called YOLOv5-ODConvNeXt that achieves higher accuracy and faster detection speed than the original YOLOv5s model, making it ideal for ship detection from drone-captured images. The overall structure of this work is shown in Fig. 1. We use Omni-dimensional Dynamic Convolution (ODConv) (Li et al., 2022b) to replace the traditional downsampling convolutional layer, which enhances the backbone's capability to generate more features. Then we use the ConvNeXt block (Liu et al., 2022) to replace the original C3 block, which increases the detection speed. Considering that there are no publicly available ship datasets for drone-captured scenarios, we constructed a single-class ship dataset with 3200 images captured by drones or with a drone view. We conducted extensive experiments on our dataset and achieved remarkable results that demonstrate our model's ability to balance accuracy and detection speed. Our proposed YOLOv5-ODConvNeXt achieves 48.0% AP, with a detection speed of 8.3 ms per image on an NVIDIA RTX3090 GPU.

To achieve an efficient detector for ship detection on drone-captured images for maritime surveillance, we made the following contributions:

• We constructed a ship detection dataset containing 3200 annotated images of ships, all of which are captured by drones or have the perspective of drones.
• We studied the problem of fusing ODConv into different layers of the YOLOv5s model. Based on the experimental results, we found that deploying ODConv in shallower layers leads to greater accuracy gains and less increment of parameters. Therefore, we applied ODConv in layer 1 of YOLOv5s to boost model accuracy without increasing the network width and depth.
• We studied the problem of fusing the ConvNeXt block into different layers of the YOLOv5s model. Based on the experimental results, we chose to replace the original C3 block with a ConvNeXt block in layer 6 to improve the detection speed with a slight drop in model accuracy.
• Based on the above improvements, we proposed YOLOv5-ODConvNeXt, a more efficient ship detection model. Compared to the original YOLOv5s network, the accuracy has been improved by 1.2% AP on our ship dataset and the detection speed has been improved by 13.3%.

2. Related works

2.1. Deep learning based object detection

Deep learning based detectors can be roughly divided into two types: transformer (Vaswani et al., 2017) based detectors and Convolutional Neural Network (CNN) based detectors. Transformer models were first applied to a broad range of natural language processing (NLP) tasks. Their outstanding performance has generated great interest in applying Transformers to computer vision tasks. A Transformer has an encoder–decoder structure, using the self-attention mechanism to establish relationships among elements in a sequence. An image can also be regarded as a sequence by dividing it into N × N image patches, which is a basic concept for Transformers in vision tasks. Detection Transformer (DETR) (Carion et al., 2020) is the first end-to-end transformer based object detector and treats object detection as a set prediction problem. DETR uses the Transformer to process the features generated by a CNN, eliminating the requirement for a series of hand-crafted components such as the non-maximum suppression (NMS) strategy and anchor boxes. Although DETR achieves high performance on the MS-COCO (Lin et al., 2014) dataset, it faces various obstacles including missed detection on small objects and a long training duration. Inspired by Dai et al. (2017), Zhu et al. (2021b) proposed Deformable DETR, which aims to accelerate the convergence speed and promote the accuracy of DETR on detecting small objects. Unlike the multi-head attention mechanism in the Transformer, the deformable attention module focuses on a small set of important points around a reference point of the feature maps. Liu et al. (2021a) proposed the Swin Transformer, a pure transformer based backbone for image classification, object detection, and segmentation. The Swin Transformer is a hierarchical transformer using shifted windows, which improves computing efficiency by using non-overlapping windows for local self-attention. Although these
transformer based detectors (Carion et al., 2020; Zhu et al., 2021b; Dai et al., 2021; Liu et al., 2021a; Dong et al., 2022) have shown great potential in replacing traditional CNNs on object detection tasks, they still suffer from high computational cost and a large number of parameters, making them unsuitable for real-time ship detection on drone-captured images in maritime surveillance systems.

CNN-based detectors can be separated into two-stage detectors and one-stage detectors. Two-stage detectors follow a "coarse-to-fine" manner, which first generates various regions of interest (RoIs) from input images and then performs classification and regression on a series of RoIs. One-stage detectors directly obtain results from input images. R-CNN (Girshick et al., 2014) is the earliest two-stage detector that significantly accelerated the development of object detection technology in the deep learning area and is the first paper in the R-CNN series (Girshick et al., 2014; Girshick, 2015; Ren et al., 2017). R-CNN uses selective search (Uijlings et al., 2013) to extract nearly 2000 region proposals from the original input image. These separate regions are resized to a fixed scale and fed into a CNN trained on ImageNet (Deng et al., 2009) to obtain the output features, and then predictions are obtained from each region through support vector machines (SVMs). R-CNN provides an excellent framework for object detection, but overlapping region proposals lead to redundant computations that are inefficient. To deal with this problem, Girshick (2015) proposed Fast R-CNN. It uses RoI pooling, a variant of spatial pyramid pooling (SPP) (He et al., 2015), to obtain a fixed-size feature map from each RoI so that RoIs can share computations among overlapping areas and all layers can be updated during training. However, the high computational cost of selective search still slows down the detection speed of Fast R-CNN. Ren et al. (2017) proposed Faster R-CNN, which replaced selective search with a region proposal network (RPN). Compared to selective search, the RPN greatly decreases the computational complexity, enabling Faster R-CNN to be the first object detector to approach real-time speed. Although Faster R-CNN breaks through the speed bottleneck of Fast R-CNN, there is still computational redundancy in the following detection stage. Some scholars have proposed a variety of improvement schemes, including R-FCN (Dai et al., 2016), Light-Head R-CNN (Li et al., 2017), and Mask R-CNN (He et al., 2017). Considering the defects of two-stage detectors in detection speed and their lack of global information, we chose to study one-stage detectors for efficient ship detection from drone-captured images.

You Only Look Once (YOLO) (Redmon et al., 2016) is the first one-stage CNN-based detector. Two-stage detectors make predictions on various RoIs, which ignores the global information of the whole input image. To address this problem, YOLO redefined object detection as a single regression problem. The neural network can directly convert image pixels to bounding boxes and class probabilities for each area, resulting in faster detection speed compared to two-stage detectors. The network divides the input image into a series of grids, with each grid responsible for detecting objects in that region of the image. Each grid can predict multiple categories for bounding boxes, with Non-Maximum Suppression (NMS) eliminating duplicate detections of the same object. Liu et al. (2016) proposed the single shot multibox detector (SSD), which improves detection speed and accuracy by utilizing predefined anchor boxes and multiscale detection technology. Redmon and Farhadi (2017) proposed YOLOv2, an improved version of YOLO (Redmon et al., 2016), which implements various techniques including batch normalization (Ioffe and Szegedy, 2015), a high-resolution classifier, and anchor boxes generated by k-means clustering to achieve state-of-the-art real-time detection. Lin et al. (2017) figured out the imbalance of foreground and background classes during the training process of one-stage detectors and proposed RetinaNet. A novel focal loss function is designed to address this problem so that the network can focus on the difficult, misclassified samples. Redmon and Farhadi (2018) further promoted the performance of YOLOv2 and proposed YOLOv3 by combining techniques such as data augmentation, multi-scale training, and independent logistic classifiers. EfficientDet (Tan et al., 2020) focuses on boosting the efficiency of CNNs. A BiFPN structure is proposed to fuse multiscale features with learnable weights. Furthermore, EfficientDet introduces a model scaling strategy which jointly scales different parts of the network and the input resolution. YOLOv4 (Bochkovskiy et al., 2020), an upgraded version of YOLOv3, can be trained on a single GPU such as a 1080Ti. It uses a "bag of freebies" that does not increase the inference cost, such as data augmentation, label smoothing, and CIoU loss (Zheng et al., 2020), as well as a "bag of specials" that obviously promotes the accuracy with little increment of inference cost, such as SPP (He et al., 2015), CSPNet (Wang et al., 2020), and PANet (Liu et al., 2018). After YOLOv4, there have been many improved versions of the YOLO series like Scaled-YOLOv4 (Wang et al., 2021), YOLOv5 (Jocher et al., 2022), YOLOF (Chen et al., 2021) and YOLOX (Ge et al., 2021).

2.2. Ship detection from visual images

Ship detection from visual images has received widespread attention for its application in maritime surveillance. Drone-captured images are a type of visual image. Modern ship detection algorithms are mostly constructed based on deep neural networks, which do not require hand-crafted features and have good robustness.

Shao et al. (2019) were the first to apply CNNs for ship detection in surveillance video. They proposed a CNN framework for saliency prediction based on the YOLOv2 (Redmon and Farhadi, 2017) model. The CNN was utilized for rough prediction first, and subsequently saliency detection was employed to refine it. They also presented a coastline segmentation method that reduces the detection range and increases detection efficiency. Chen et al. (2020a) used the combination of an improved YOLOv2 (Redmon and Farhadi, 2017) and a modified WGAN (Arjovsky et al., 2017) to deal with small ship detection. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used to generate anchor boxes instead of k-means clustering, and a Gaussian Mixture WGAN with Gradient Penalty is used for data augmentation. However, the detection speed of these YOLOv2 based methods (Shao et al., 2019; Chen et al., 2020a) is slow and their accuracy is low. Liu et al. (2021b) proposed an improved version of YOLOv3 (Redmon and Farhadi, 2018) for ship detection under complex weather conditions. They use redesigned anchor boxes, soft NMS, a reconstructed loss function and data augmentation to realize a more reliable and robust detector. However, its detection speed on an NVIDIA 1080Ti GPU is 30 frames per second (FPS) for an input resolution of 608 × 608, which is not fast enough. ShipYOLO (Han et al., 2021) is an enhanced model based on YOLOv4 (Bochkovskiy et al., 2020), which is also designed for ship detection in surveillance video. ShipYOLO has three main improvements, which include structural re-parameterization in the backbone, an attention mechanism for multi-scale feature fusion, and the use of dilated convolution in SPP (He et al., 2015). ShipYOLO achieves a detection speed of 47 frames per second (FPS) on an NVIDIA 1080Ti GPU for an input size of 512 × 512, which is still not fast enough for devices with insufficient computing resources like drones. Zhang et al. (2022) proposed YOLOv5-DN, an improved version of YOLOv5 (Jocher et al., 2022) for maritime ship detection and classification. YOLOv5-DN is realized by introducing the CSP-DenseNet structure (Wang et al., 2020) into the YOLOv5 model, aiming to optimize the detection accuracy. However, it does not take detection speed into account, which leads to inefficiency.

In conclusion, current ship detection techniques still suffer from inefficiency. Therefore, achieving a good balance between detection accuracy and speed is of vital importance for maritime surveillance. In this work, we extend the YOLOv5-based ship detection model in the following respects, i.e., using dynamic convolution to replace the traditional downsampling convolutional layer and exploiting the ConvNeXt module in the backbone, leading to faster and more accurate ship detection on drone-captured images.
Fig. 2. The illustration of (a) two-dimensional convolution and (b) depthwise convolution.
Fig. 4. The illustration of how four different sorts of weights work in ODConv.
nonlinear activation. Specifically, α_fi, α_ci and α_si are generated by a Sigmoid function, i.e.,

\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}},   (4)

which maps the input value to the interval (0, 1). α_wi is generated by a Softmax function, i.e.,

\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)},   (5)

where x_i denotes the i-th element of the input vector. Softmax produces a set of probabilities, which adds the constraint \sum_i \alpha_{wi} = 1 and simplifies the learning of α_wi. The final convolutional filters, generated by the weighted sum of the n groups of filters, are utilized to produce the output features. Therefore, the convolution kernel changes with different input features. In this work, we utilize ODConv to pursue the goal of realizing a more accurate model without the increment of network width and depth. The position of the ODConv used in YOLOv5-ODConvNeXt is shown in Fig. 1.
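The sketch below is an illustrative PyTorch re-implementation of the dynamic-convolution idea described above, not the authors' released module: the class name SimpleODConv2d, the SE-style attention branch, and the defaults (n = 4 candidate kernels, reduction ratio 16) are our own assumptions. It shows how the Sigmoid attentions of Eq. (4) modulate each candidate kernel along the spatial, input-channel and output-channel dimensions, while the Softmax attention of Eq. (5) combines the n kernels into one input-dependent filter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleODConv2d(nn.Module):
    """Illustrative sketch of omni-dimensional dynamic convolution (Li et al., 2022b).

    n candidate kernels are modulated by Sigmoid attentions along the spatial,
    input-channel and output-channel (filter) dimensions (Eq. (4)) and combined
    by a Softmax kernel-wise attention (Eq. (5)), so the effective convolution
    kernel depends on the input feature map.
    """

    def __init__(self, c_in, c_out, k=3, stride=1, n=4, reduction=16):
        super().__init__()
        self.k, self.stride, self.n = k, stride, n
        # n candidate kernels, each of shape (c_out, c_in, k, k).
        self.weight = nn.Parameter(0.02 * torch.randn(n, c_out, c_in, k, k))
        hidden = max(c_in // reduction, 4)
        self.gap = nn.AdaptiveAvgPool2d(1)                                 # squeeze
        self.fc = nn.Sequential(nn.Conv2d(c_in, hidden, 1), nn.ReLU(inplace=True))
        self.to_spatial = nn.Conv2d(hidden, k * k, 1)    # -> alpha_s
        self.to_cin = nn.Conv2d(hidden, c_in, 1)         # -> alpha_c
        self.to_cout = nn.Conv2d(hidden, c_out, 1)       # -> alpha_f
        self.to_kernel = nn.Conv2d(hidden, n, 1)         # -> alpha_w

    def forward(self, x):
        b, c_in, h, w = x.shape
        z = self.fc(self.gap(x))                         # (b, hidden, 1, 1)
        a_s = torch.sigmoid(self.to_spatial(z)).view(b, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.to_cin(z)).view(b, 1, 1, c_in, 1, 1)
        a_f = torch.sigmoid(self.to_cout(z)).view(b, 1, -1, 1, 1, 1)
        a_w = torch.softmax(self.to_kernel(z).view(b, self.n), dim=1).view(b, self.n, 1, 1, 1, 1)
        # Weighted sum of the n modulated kernels -> one kernel per sample.
        weight = (a_w * a_s * a_c * a_f * self.weight.unsqueeze(0)).sum(dim=1)
        # Grouped-convolution trick: one group per sample, so each image is
        # convolved with its own dynamically generated kernel.
        x = x.view(1, b * c_in, h, w)
        weight = weight.reshape(b * weight.shape[1], c_in, self.k, self.k)
        y = F.conv2d(x, weight, stride=self.stride, padding=self.k // 2, groups=b)
        return y.view(b, -1, y.shape[-2], y.shape[-1])
```

With stride=2, such a layer can stand in for a plain downsampling convolution, which is how ODConv is used in layer 1 of the YOLOv5-ODConvNeXt backbone.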
3.2. ConvNeXt block

For the past three years, we have witnessed a burst of applications of Transformers in vision tasks. However, due to the global attention mechanism of Transformers, their computational complexity is quadratic in the input size. This burden is particularly felt in tasks such as object detection with higher-resolution inputs. Therefore, ConvNeXt (Liu et al., 2022) was born to bring the advantages of Transformers back to CNNs. The architecture of the ConvNeXt block is shown in Fig. 5(a).

ConvNeXt is a pure CNN which refers to the macro/micro design and the training process of a Swin Transformer (Liu et al., 2021a). A ConvNeXt block uses a depthwise convolution with a larger kernel size of 7 × 7 followed by the simpler Layer Normalization (Ba et al., 2016). A 1 × 1 convolutional layer and a GeLU (Hendrycks and Gimpel, 2016) activation function are used to lift the number of channels from C to 4 × C, where the GeLU activation function is

\mathrm{GELU}(x) = 0.5x\left(1 + \mathrm{Tanh}\left(\sqrt{2/\pi}\,\left(x + 0.044715x^{3}\right)\right)\right).   (6)

Experimental environment:
Operating System: Ubuntu 20.04.4
CPU: Intel Xeon Gold 6230
GPU: NVIDIA RTX3090
CUDA Version: 11.4
Python Version: 3.7.13
Pytorch Version: 1.11.0
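Returning to the ConvNeXt block of Section 3.2, the following is a minimal PyTorch sketch of the structure described there: 7 × 7 depthwise convolution, Layer Normalization, 1 × 1 expansion from C to 4C, GELU (Eq. (6)), and a 1 × 1 projection back to C with a residual connection, following the standard ConvNeXt design (Liu et al., 2022). It is illustrative only; layer scale and stochastic depth are omitted, and the class name and defaults are our own rather than the paper's code.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Illustrative ConvNeXt block (Liu et al., 2022), without layer scale or
    stochastic depth: depthwise conv -> LayerNorm -> 1x1 expansion to 4C ->
    GELU -> 1x1 projection back to C, plus a residual connection."""

    def __init__(self, dim, kernel_size=7):
        super().__init__()
        # Large-kernel depthwise convolution; padding keeps the spatial size.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)             # normalizes over the channel dim
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # 1x1 conv implemented as Linear
        self.act = nn.GELU()                      # Eq. (6) is the tanh approximation of this
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                         # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # (B, H, W, C) for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                 # back to (B, C, H, W)
        return shortcut + x
```

The kernel_size argument, with padding = kernel_size // 2 so that the feature-map size is preserved, is the quantity varied from 3 × 3 to 11 × 11 in the kernel-size experiment reported in Table 6.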
Fig. 8. The variation of (a) model accuracy, (b) inference speed and (c) the number of parameters after applying ODConv at different locations in YOLOv5s.
Table 5
Detection results of replacing C3 block with ConvNeXt block in different locations.
Model Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M)
YOLOv5s(baseline) 640 46.8% 9.4 7.01
YOLOv5s-ConvNeXt(1) 640 46.3% 8.3 7.03
YOLOv5s-ConvNeXt(2) 640 46.6% 8.3 7.04
YOLOv5s-ConvNeXt(3) 640 46.7% 8.1 6.93
YOLOv5s-ConvNeXt(4) 640 46.8% 8.5 7.96
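For reference, the AP50:95 values reported in Tables 5–8 follow the common MS-COCO convention: average precision averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05,

AP_{50:95} = \frac{1}{10}\sum_{t \in \{0.50, 0.55, \ldots, 0.95\}} AP_t .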
Table 6
Detection results of using different kernel sizes of ConvNeXt block in YOLOv5s.
Model Kernel size Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M)
YOLOv5s-ConvNeXt(3) 3 640 45.8% 7.9 6.92
YOLOv5s-ConvNeXt(3) 5 640 46.3% 8.0 6.92
YOLOv5s-ConvNeXt(3) 7 640 46.7% 8.1 6.93
YOLOv5s-ConvNeXt(3) 9 640 46.8% 8.3 6.94
YOLOv5s-ConvNeXt(3) 11 640 46.8% 8.4 6.95
Based on the results in Table 5, we choose to implement a ConvNeXt block at position (3) in the following experiments.

One of the characteristics of the ConvNeXt block is that it uses a larger kernel size instead of the most commonly used 3 × 3 convolution. Therefore, we tested several kernel sizes in the ConvNeXt block, including 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11, to determine the optimal kernel size for the ConvNeXt block in YOLOv5s. The results in Table 6 suggest that model accuracy reaches a saturation point at a kernel size of 9 × 9. However, a larger kernel size in the ConvNeXt block yields slower detection speed. In order to balance accuracy and detection speed, we choose to retain the original kernel size of 7 × 7 in the ConvNeXt block.

4.5. Comparisons with the state-of-the-art

In this experiment, we evaluate the performance of the proposed model on our ship dataset and compare it with nine different state-of-the-art detectors. Table 7 lists the scores of our model and the other detectors, including YOLOv5s (Jocher et al., 2022), TPH-YOLOv5 (Zhu et al., 2021a), Scaled-YOLOv4 (Wang et al., 2021), YOLOv6-tiny (Li et al., 2022a), YOLOv7 (Wang et al., 2022), YOLO-Fastestv2 (dog-qiuqiu, 2021), NanoDet-Plus-m (RangiLyu, 2021), EfficientDet-d0 (Tan et al., 2020) and Faster R-CNN (Ren et al., 2017) with a ResNet18 (He et al., 2016) backbone. Considering that TPH-YOLOv5 is designed for high-performance devices with huge computational cost, we scaled this model to a smaller version to fit our task. Our model achieves 48.0% AP with 8.3 ms inference time per image, which exceeds the baseline model YOLOv5s by 1.2% AP and 1.1 ms inference time per image without increasing the number of parameters. Fig. 10 provides a more intuitive comparison of our proposed YOLOv5-ODConvNeXt model with other state-of-the-art detectors. It is clear that our proposed model outperforms YOLOv5s, YOLOv6-tiny, TPH-YOLOv5, YOLO-Fastestv2, NanoDet-Plus-m, EfficientDet-d0 and Faster R-CNN in both speed and accuracy on our ship dataset. Compared to Scaled-YOLOv4, our model is less accurate by 0.4% AP but 48.1% faster. Compared to YOLOv7, our model is less accurate by 4.5% AP but 83.1% faster. Furthermore, the number of parameters of the proposed model is only 19.1% of that in YOLOv7. In terms of model size, our model is relatively lightweight at 13.7 MB, which is smaller than most of the compared CNNs including TPH-YOLOv5, Scaled-YOLOv4, YOLOv6-tiny, YOLOv7, NanoDet-Plus-m, EfficientDet-d0, and Faster R-CNN. Although YOLO-Fastestv2 has a smaller model size, its accuracy is much lower at only 20.0% AP, which does not meet the requirements for accurate ship detection. Therefore, our model achieves a better trade-off between accuracy and detection speed and is more suitable for real-time ship detection on drone-captured images in maritime surveillance systems. Fig. 11 shows the visualization of the detection results of our model.

To better evaluate the performance of our proposed model on different ship datasets, we conduct further experiments on the LEVIR-Ship dataset (Chen et al., 2022). The experimental results are shown in Table 8. Our proposed YOLOv5-ODConvNeXt still demonstrates great performance in terms of speed and accuracy, achieving 23.1% AP with 8.3 ms batch-1 inference time without increasing the number of parameters on the LEVIR-Ship dataset.
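For clarity, the relative-speed percentages quoted above (and the 13.3% speed-up over YOLOv5s quoted in the abstract and conclusions) follow directly from the batch-1 latencies in Table 7, computed relative to the proposed model's 8.3 ms latency, e.g.,

\frac{9.4 - 8.3}{8.3} \approx 13.3\% \ (\text{vs. YOLOv5s}), \qquad \frac{15.2 - 8.3}{8.3} \approx 83.1\% \ (\text{vs. YOLOv7}).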
Table 7
The detection results of the proposed model and other state-of-the-art detectors on self-constructed dataset.
Model Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M) Model size (MB)
YOLOv5s(baseline) 640 46.8% 9.4 7.01 13.7
YOLOv5-ODConvNeXt 640 48.0% 8.3 6.99 13.7
TPH-YOLOv5 640 46.0% 18.9 9.16 18.9
Scaled-YOLOv4 640 48.4% 12.3 9.11 17.8
YOLOv6-tiny 640 46.5% 9.0 14.94 31.4
YOLOv7 640 52.5% 15.2 36.48 71.3
YOLO-Fastestv2 352 20.0% 8.9 0.24 1.08
NanoDet-Plus-m 416 34.5% 12.6 2.44 29.9
EfficientDet-d0 512 34.1% 16.2 3.82 15.0
Faster-RCNN 640 47.2% 25.6 28.684 227.1
Table 8
The detection results of the proposed model and other state-of-the-art detectors on LEVIR-Ship dataset.
Model Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M) Model size (MB)
YOLOv5s(baseline) 640 23.1% 9.4 7.01 13.7
YOLOv5-ODConvNeXt 640 23.1% 8.3 6.99 13.7
TPH-YOLOv5 640 20.6% 18.9 9.16 18.9
Scaled-YOLOv4 640 25.2% 12.3 9.11 17.8
YOLOv6-tiny 640 19.0% 9.0 14.94 31.4
YOLOv7 640 21.3% 15.2 36.48 71.3
YOLO-Fastestv2 352 12.5% 8.9 0.24 1.08
NanoDet-Plus-m 416 23.7% 12.6 2.44 29.9
EfficientDet-d0 512 15.1% 16.2 3.82 15.0
Faster-RCNN 640 20.2% 25.6 28.684 227.1
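The Speed (ms) columns in Tables 7 and 8 are batch-1 GPU inference times. As an illustration of how such latencies are commonly measured in PyTorch — not necessarily the authors' exact protocol — a minimal timing loop with warm-up and explicit GPU synchronization looks like this (the function name and iteration counts are our own choices):

```python
import time
import torch

def measure_latency_ms(model, img_size=640, warmup=50, iters=300, device="cuda"):
    """Rough batch-1 inference latency in milliseconds (assumes a CUDA device)."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):        # warm-up: cuDNN autotuning, lazy allocations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()       # wait for all queued GPU work to finish
    return (time.perf_counter() - start) * 1000.0 / iters
```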
5. Conclusions

Fast and accurate ship detection is of vital importance for maritime surveillance, which can be widely applied in preventing ship accidents, illegal fishing, smuggling and so on. In this work, we focus on addressing the challenges and requirements of ship detection from drone-captured images, such as complex backgrounds and real-time performance. We proposed an enhanced deep learning model, namely YOLOv5-ODConvNeXt, based on YOLOv5s, aiming to improve the accuracy and detection speed of the original network for efficient ship detection. We constructed a ship dataset with 3200 images captured by drones or with a drone view. To improve the accuracy without increasing the network width and depth, we implemented ODConv in layer 1 to replace the traditional convolution layer. We found that applying ODConv in shallow layers leads to higher model accuracy with less increment of parameters due to the limited number of channels in shallow layers. Additionally, we exploited a ConvNeXt block in layer 6 instead of the original C3 block with three bottlenecks. The simpler structure of the ConvNeXt module leads to faster detection speed with only a slight decline in accuracy. Our proposed YOLOv5-ODConvNeXt achieves 48.0% AP with 8.3 ms inference time per image on an NVIDIA RTX3090 GPU. Compared to the original YOLOv5s, our proposed model improves the detection accuracy by 1.2% AP with a significantly faster detection speed, exceeding YOLOv5s by 13.3%. As a result, the introduction of ODConv and the ConvNeXt block makes YOLOv5s faster and more accurate, significantly improving the performance of ship detection on drone-captured images for maritime surveillance. Experimental results have demonstrated the superior performance of our proposed model in terms of accuracy and detection speed compared to other state-of-the-art models. However, there is still room for improvement in our work due to some inherent limitations of object detectors. In future work, it is possible to develop weakly supervised ship detection techniques to reduce the model's dependence on large amounts of annotated training data. Additionally, we will consider improving the performance of small ship detection and evaluating the performance of our model on different ship datasets. We believe that our work can benefit practical applications in maritime surveillance.
CRediT authorship contribution statement

Shuxiao Cheng: Conceptualization, Methodology, Software, Formal analysis, Writing – original draft. Yishuang Zhu: Visualization, Data curation, Writing – review & editing. Shaohua Wu: Supervision, Funding acquisition, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

I have shared the link to my code in the abstract of the article.

Acknowledgments

This work has been supported in part by the National Key Research and Development Program of China under Grant no. 2020YFB1806403, in part by the GuangDong Basic and Applied Basic Research Foundation under Grant no. 2022B1515120002, and in part by the National Natural Science Foundation of China under Grant no. 62201307.

References

Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks. In: International Conference on Machine Learning. PMLR, pp. 214–223.
Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers. In: European Conference on Computer Vision. Springer, pp. 213–229.
Chen, J., Chen, K., Chen, H., Zou, Z., Shi, Z., 2022. A degraded reconstruction enhancement-based method for tiny ship detection in remote sensing images with a new large-scale dataset. IEEE Trans. Geosci. Remote Sens. 60, 1–14. http://dx.doi.org/10.1109/TGRS.2022.3180894.
Chen, Z., Chen, D., Zhang, Y., Cheng, X., Zhang, M., Wu, C., 2020a. Deep learning for autonomous ship-oriented small ship detection. Saf. Sci. 130, 104812.
Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z., 2020b. Dynamic convolution: Attention over convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11030–11039.
Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., Sun, J., 2021. You only look one-level feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13039–13048.
Dai, Z., Cai, B., Lin, Y., Chen, J., 2021. UP-DETR: Unsupervised pre-training for object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1601–1610.
Dai, J., Li, Y., He, K., Sun, J., 2016. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 29.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y., 2017. Deformable convolutional networks. In: 2017 IEEE International Conference on Computer Vision. ICCV, pp. 764–773. http://dx.doi.org/10.1109/ICCV.2017.89.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255.
dog-qiuqiu, 2021. dog-qiuqiu/Yolo-FastestV2: V0.2. Zenodo, http://dx.doi.org/10.5281/zenodo.5181503, URL: https://doi.org/10.5281/zenodo.5181503.
Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B., 2022. CSWin transformer: A general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12124–12134.
Everingham, M., Eslami, S., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2015. The Pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 111 (1), 98–136.
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88 (2), 303–338.
Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J., 2021. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430.
Girshick, R., 2015. Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision. ICCV, pp. 1440–1448. http://dx.doi.org/10.1109/ICCV.2015.169.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587.
Han, X., Zhao, L., Ning, Y., Hu, J., 2021. ShipYOLO: An enhanced model for ship detection. J. Adv. Transp. 2021.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), 1904–1916.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.
Hendrycks, D., Gimpel, K., 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp. 448–456.
Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., NanoCode012, Kwon, Y., TaoXie, Fang, J., imyhxy, Michael, K., Lorna, V, A., Montes, D., Nadar, J., Laughing, tkianai, yxNONG, Skalski, P., Wang, Z., Hogan, A., Fati, C., Mammana, L., AlexWang1900, Patel, D., Yiwei, D., You, F., Hajek, J., Diaconu, L., Minh, M.T., 2022. ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference. Zenodo, http://dx.doi.org/10.5281/zenodo.6222936.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, Vol. 25. Curran Associates, Inc., URL: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
Li, H., Deng, L., Yang, C., Liu, J., Gu, Z., 2021. Enhanced YOLO v3 tiny network for real-time ship detection from visual image. IEEE Access 9, 16692–16706.
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., et al., 2022a. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976.
Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J., 2017. Light-Head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264.
Li, C., Zhou, A., Yao, A., 2022b. Omni-dimensional dynamic convolution. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=DmpCfq6Mg39.
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988.
Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. SSD: Single shot multibox detector. In: European Conference on Computer Vision. Springer, pp. 21–37.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021a. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022.
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S., 2022. A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR, pp. 11976–11986.
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018. Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8759–8768.
Liu, R.W., Yuan, W., Chen, X., Lu, Y., 2021b. An enhanced CNN-enabled learning method for promoting ship detection in maritime surveillance system. Ocean Eng. 235, 109435.
RangiLyu, 2021. NanoDet-Plus: Super fast and high accuracy lightweight anchor-free object detection model. https://github.com/RangiLyu/nanodet.
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788.
Redmon, J., Farhadi, A., 2017. YOLO9000: Better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7263–7271.
Redmon, J., Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., Sun, J., 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), 1137–1149. http://dx.doi.org/10.1109/TPAMI.2016.2577031.
Shao, Z., Wang, L., Wang, Z., Du, W., Wu, W., 2019. Saliency-aware convolution neural network for ship detection in surveillance video. IEEE Trans. Circuits Syst. Video Technol. 30 (3), 781–794.
Tan, M., Pang, R., Le, Q.V., 2020. EfficientDet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10781–10790.
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W., 2013. Selective search for object recognition. Int. J. Comput. Vis. 104 (2), 154–171.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M., 2021. Scaled-YOLOv4: Scaling cross stage partial network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13029–13038.
Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M., 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
Wang, C.Y., Liao, H.Y.M., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.-H., 2020. CSPNet: A new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 390–391.
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500.
Yang, B., Bender, G., Le, Q.V., Ngiam, J., 2019. CondConv: Conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 32.
Zhang, Y., Li, Q.Z., Zang, F.N., 2017. Ship detection for visual maritime surveillance from non-stationary platforms. Ocean Eng. 141, 53–63. http://dx.doi.org/10.1016/j.oceaneng.2017.06.022, URL: https://www.sciencedirect.com/science/article/pii/S0029801817303190.
Zhang, X., Yan, M., Zhu, D., Guan, Y., 2022. Marine ship detection and classification based on YOLOv5 model. J. Phys. Conf. Ser. 2181 (1), 012025.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D., 2020. Distance-IoU loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 07. pp. 12993–13000.
Zhu, X., Lyu, S., Wang, X., Zhao, Q., 2021a. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2778–2788.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2021b. Deformable DETR: Deformable transformers for end-to-end object detection. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=gZ9hCDWe6ke.