
Ocean Engineering 285 (2023) 115440


Deep learning based efficient ship detection from drone-captured images for
maritime surveillance
Shuxiao Cheng a, Yishuang Zhu a,b, Shaohua Wu a,b,∗

a Guangdong Provincial Key Laboratory of Aerospace Communication and Networking Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
b Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen 518055, China

ARTICLE INFO

Keywords:
Ship detection
Drone-captured images
Maritime surveillance
Convolutional Neural Network (CNN)
YOLOv5

ABSTRACT

The use of drones to observe ships is an effective means of maritime surveillance. However, the object scale from drone-captured images changes dramatically, presenting a significant challenge for ship detection. Additionally, the limited computing resources on drones make it difficult to achieve fast detection speed. To address these issues, we propose an efficient deep learning based network, namely the YOLOv5-ODConvNeXt, for ship detection from drone-captured images. YOLOv5-ODConvNeXt is a more accurate and faster network designed to improve the efficiency of maritime surveillance. Based on YOLOv5, we implement Omni-dimensional Convolution (ODConv) in the YOLOv5 backbone to boost the accuracy without increasing the network width and depth. We also replace the original C3 block with a ConvNeXt block in the YOLOv5 backbone to accelerate detection speed with only a slight decline in accuracy. We test our model on a self-constructed ship detection dataset containing 3200 images captured by drones or with a drone view. The experimental results show that our model achieves 48.0% AP, exceeding the accuracy of YOLOv5s by 1.2% AP. The detection speed of our network is 8.3 ms per image on an NVIDIA RTX3090 GPU, exceeding the detection speed of YOLOv5s by 13.3%. Our code is available at https://github.com/chengshuxiao/YOLOv5-ODConvNeXt.

1. Introduction

Maritime management and waterway management have long been plagued by numerous issues, including ship accidents, illegal fishing and smuggling. These problems stem from inadequate real-time monitoring of waterways. Thus, real-time ship detection is crucial for improving maritime surveillance.

Three primary categories of ship detection exist based on data sources: Synthetic Aperture Radar (SAR) images, optical remote sensing (ORS) images and visual images. SAR images are widely used in ship detection due to their wide field of view and all-day/all-season performance. However, they have limitations such as long data update periods that limit real-time performance in ship detection. Additionally, most SAR images have low resolution, which can lead to missed detection of small ships (Li et al., 2021). Furthermore, SAR images are grayscale and lack color information. ORS images, mostly captured by satellite cameras, have the advantage of high resolution. However, ORS images cannot be obtained at night and are greatly affected by meteorological conditions such as clouds, rain and fog. Moreover, such high-resolution images also bring a certain burden to data preprocessing. In contrast to SAR or ORS imaging methods, visual images captured by optical cameras contain abundant color and texture information that allows viewers to comprehend scenes more easily. Additionally, visible-light cameras offer the benefits of inexpensive cost, simple installation and low power consumption. A more significant use for such a camera is military security, since the passive imaging modality conceals the location of the monitoring device (Zhang et al., 2017). For many scenarios such as port monitoring and cross-border ship detection, ship detection from visual images can meet the need for accurate and real-time ship detection in maritime surveillance systems.

With the rapid development of computer vision technology and drone technology, the use of high-definition cameras on drones for ship detection has gradually become an effective means of maritime surveillance. Compared to fixed coastal surveillance cameras, drones offer higher flexibility and wider field-of-view capabilities, resulting in lower monitoring costs per unit water area. In this work, we focus on improving the performance of ship detection from drone-captured images for maritime surveillance.

∗ Corresponding author at: Guangdong Provincial Key Laboratory of Aerospace Communication and Networking Technology, Harbin Institute of Technology
(Shenzhen), Shenzhen 518055, China.
E-mail addresses: [email protected] (S. Cheng), [email protected] (Y. Zhu), [email protected] (S. Wu).

https://doi.org/10.1016/j.oceaneng.2023.115440
Received 2 February 2023; Received in revised form 27 June 2023; Accepted 22 July 2023
Available online 2 August 2023
0029-8018/© 2023 Elsevier Ltd. All rights reserved.

Fig. 1. The overview of our ship detection framework.

Ship detection aims to recognize ships and their borders in an image. In recent years, deep neural networks (DNNs) have been frequently used for this purpose. However, applying ship detection on drones brings two main difficulties. First, the size of ships in drone-captured images varies with the height and attitude of the drone. In addition, drone-captured images contain abundant and complex background information, making ship detection more challenging. Second, insufficient computing resources on drones can slow down detection speed and result in poor real-time performance. Therefore, there is an urgent need for a fast yet accurate ship detector.

The YOLO series (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018; Bochkovskiy et al., 2020; Jocher et al., 2022), which plays an essential role in object detection tasks, is widely used in ship detection. In this paper, we propose an improved version of the YOLOv5s (Jocher et al., 2022) model called YOLOv5-ODConvNeXt that achieves higher accuracy and faster detection speed than the original YOLOv5s model, making it ideal for ship detection from drone-captured images. The overview structure of this work is shown in Fig. 1. We use Omni-dimensional Dynamic Convolution (ODConv) (Li et al., 2022b) to replace the traditional downsampling convolutional layer, which enhances the backbone's capability to generate more features. Then we use the ConvNeXt block (Liu et al., 2022) to replace the original C3 block, which increases the detection speed. Considering that there are no publicly available ship datasets for drone-captured scenarios, we constructed a single-class ship dataset with 3200 images captured by drones or with a drone view. We conducted extensive experiments on our dataset and achieved remarkable results that demonstrate our model's ability to balance accuracy and detection speed. Our proposed YOLOv5-ODConvNeXt achieves 48.0% AP, with a detection speed of 8.3 ms per image on an NVIDIA RTX3090 GPU.

To achieve an efficient detector for ship detection on drone-captured images for maritime surveillance, we made the following contributions:

• We constructed a ship detection dataset containing 3200 annotated images of ships, all of which are captured by drones or have the perspective of drones.
• We studied the problem of fusing ODConv into different layers of the YOLOv5s model. Based on the experimental results, we found that deploying ODConv in shallower layers leads to greater accuracy gains and a smaller increment of parameters. Therefore, we applied ODConv in layer 1 of YOLOv5s to boost model accuracy without increasing the network width and depth.
• We studied the problem of fusing the ConvNeXt block into different layers of the YOLOv5s model. Based on the experimental results, we chose to replace the original C3 block with a ConvNeXt block in layer 6 to improve the detection speed with a slight drop in model accuracy.
• Based on the above improvements, we proposed YOLOv5-ODConvNeXt, a more efficient ship detection model. Compared to the original YOLOv5s network, the accuracy has been improved by 1.2% AP on our ship dataset and the detection speed has been improved by 13.3%.

2. Related works

2.1. Deep learning based object detection

Deep learning based detectors can be roughly divided into two types: transformer (Vaswani et al., 2017) based detectors and Convolutional Neural Network (CNN) based detectors. Transformer models were first applied to natural language processing (NLP) tasks. Their outstanding performance has generated great interest in applying Transformers to computer vision tasks. A Transformer has an encoder–decoder structure, using the self-attention mechanism to establish relationships among elements in a sequence. An image can also be regarded as a sequence by dividing it into N × N image patches, which is the basic concept behind Transformers in vision tasks. Detection Transformer (DETR) (Carion et al., 2020) is the first end-to-end transformer based object detector and treats object detection as a set prediction problem. DETR uses the Transformer to process the features generated by a CNN, eliminating the requirement for a series of hand-crafted components such as the non-maximum suppression (NMS) strategy and anchor boxes. Although DETR achieves high performance on the MS-COCO (Lin et al., 2014) dataset, it faces various obstacles including missed detection of small objects and a long training duration. Inspired by Dai et al. (2017), Zhu et al. (2021b) proposed Deformable DETR, which aims to accelerate the convergence speed of DETR and promote its accuracy on small objects. Unlike the multi-head attention mechanism in the Transformer, the deformable attention module focuses on a small set of important points around a reference point of the feature maps. Liu et al. (2021a) proposed Swin Transformer, a pure transformer based backbone for image classification, object detection, and segmentation. Swin Transformer is a hierarchical transformer using shifted windows, which improves computing efficiency by using non-overlapping windows for local self-attention.


Although these transformer based detectors (Carion et al., 2020; Zhu et al., 2021b; Dai et al., 2021; Liu et al., 2021a; Dong et al., 2022) have shown great potential in replacing traditional CNNs on object detection tasks, they still suffer from high computational cost and a large number of parameters, making them unsuitable for real-time ship detection on drone-captured images in maritime surveillance systems.

CNN-based detectors can be separated into two-stage detectors and one-stage detectors. Two-stage detectors follow a "coarse-to-fine" manner, which first generates various regions of interest (RoIs) from input images and then performs classification and regression on each RoI. One-stage detectors directly obtain results from input images. R-CNN (Girshick et al., 2014) is the earliest two-stage detector that significantly accelerated the development of object detection technology in the deep learning area and is the first paper in the R-CNN series (Girshick et al., 2014; Girshick, 2015; Ren et al., 2017). R-CNN uses selective search (Uijlings et al., 2013) to extract nearly 2000 region proposals from the original input image. These separate regions are resized to a fixed scale and fed into a CNN trained on ImageNet (Deng et al., 2009) to obtain the output features, and then predictions are obtained for each region through support vector machines (SVMs). R-CNN provides an excellent framework for object detection, but overlapping region proposals lead to redundant computations that are inefficient. To deal with this problem, Girshick (2015) proposed Fast R-CNN. It uses RoI pooling, a variant of spatial pyramid pooling (SPP) (He et al., 2015), to obtain a fixed-size feature map from each RoI, so that RoIs can share computations among overlapping areas and all layers can be updated during training. However, the high computational cost of selective search still slows down the detection speed of Fast R-CNN. Ren et al. (2017) proposed Faster R-CNN, which replaced selective search with a region proposal network (RPN). Compared to selective search, RPN greatly decreases the computational complexity, enabling Faster R-CNN to be the first object detector to approach real-time. Although Faster R-CNN breaks through the speed bottleneck of Fast R-CNN, there is still computational redundancy in the following detection stage. Some scholars have proposed a variety of improvement schemes, including R-FCN (Dai et al., 2016), Light Head R-CNN (Li et al., 2017), and Mask R-CNN (He et al., 2017). Considering the defects of two-stage detectors in detection speed and lack of global information, we chose to study one-stage detectors for efficient ship detection from drone-captured images.

You Only Look Once (YOLO) (Redmon et al., 2016) is the first one-stage CNN-based detector. Two-stage detectors make predictions on various RoIs, which ignores the global information of the whole input image. To address this problem, YOLO redefined object detection as a single regression problem. The neural network can directly convert image pixels to bounding boxes and probabilities for each area, resulting in faster detection speed compared to two-stage detectors. The network divides the input image into a series of grids, with each grid responsible for detecting objects in that region of the image. Each grid can predict multiple categories for bounding boxes, with Non-Maximum Suppression (NMS) eliminating duplicate detections of the same object. Liu et al. (2016) proposed the single shot multibox detector (SSD), which improves detection speed and accuracy by utilizing predefined anchor boxes and multiscale detection technology. Redmon and Farhadi (2017) proposed YOLOv2, an improved version of YOLO (Redmon et al., 2016), which implements various techniques including batch normalization (Ioffe and Szegedy, 2015), a high-resolution classifier, and anchor boxes generated by k-means clustering to achieve state-of-the-art real-time detection. Lin et al. (2017) identified the imbalance of foreground and background classes during the training process of one-stage detectors and proposed RetinaNet. A novel focal loss function is designed to address this problem so that the network can focus on difficult, misclassified samples. Redmon and Farhadi (2018) further promoted the performance of YOLOv2 and proposed YOLOv3 by combining techniques such as data augmentation, multi-scale training, and independent logistic classifiers. EfficientDet (Tan et al., 2020) focuses on boosting the efficiency of CNNs. A BiFPN structure is proposed to fuse multiscale features with learnable weights. Furthermore, EfficientDet introduces a model scaling strategy which jointly scales different parts of the network and the input resolution. YOLOv4 (Bochkovskiy et al., 2020), an upgraded version of YOLOv3, can be trained on a single GPU such as a 1080Ti. It uses a "bag of freebies" that does not increase the inference cost, such as data augmentation, label smoothing, and CIoU loss (Zheng et al., 2020). There is also a "bag of specials" that noticeably promotes the accuracy with little increment of inference cost, such as SPP (He et al., 2015), CSPNet (Wang et al., 2020), and PANet (Liu et al., 2018). After YOLOv4, there have been many improved versions of the YOLO series, such as Scaled-YOLOv4 (Wang et al., 2021), YOLOv5 (Jocher et al., 2022), YOLOF (Chen et al., 2021) and YOLOX (Ge et al., 2021).

2.2. Ship detection from visual images

Ship detection from visual images has received widespread attention for its application in maritime surveillance. Drone-captured images are a type of visual image. Modern ship detection algorithms are mostly constructed based on deep neural networks that do not require hand-crafted features and have good robustness.

Shao et al. (2019) were the first to apply a CNN for ship detection in surveillance video. They proposed a CNN framework for saliency prediction based on the YOLOv2 (Redmon and Farhadi, 2017) model. A CNN was utilized for rough prediction first, and subsequently saliency detection was employed to refine it. They also presented a coastline segmentation method that reduces the detection range and increases detection efficiency. Chen et al. (2020a) used the combination of an improved YOLOv2 (Redmon and Farhadi, 2017) and a modified WGAN (Arjovsky et al., 2017) to deal with small ship detection. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used to generate anchor boxes instead of k-means clustering, and a Gaussian Mixture WGAN with Gradient Penalty is used for data augmentation. But the detection speed of these YOLOv2 based methods (Shao et al., 2019; Chen et al., 2020a) is slow, with low accuracy. Liu et al. (2021b) proposed an improved version of YOLOv3 (Redmon and Farhadi, 2018) for ship detection under complex weather conditions. They use redesigned anchor boxes, soft NMS, a reconstructed loss function and data augmentation to realize a more reliable and robust detector. However, its detection speed on an NVIDIA 1080Ti GPU is 30 frames per second (FPS) for an input resolution of 608×608, which is not fast enough. ShipYOLO (Han et al., 2021) is an enhanced model based on YOLOv4 (Bochkovskiy et al., 2020), which is also designed for ship detection in surveillance video. ShipYOLO has three main improvements, which include structural re-parameterization in the backbone, an attention mechanism for multi-scale feature fusion and the use of dilated convolution in SPP (He et al., 2015). ShipYOLO achieves a detection speed of 47 frames per second (FPS) on an NVIDIA 1080Ti GPU for an input size of 512×512, which is still not fast enough for devices with insufficient computing resources like drones. Zhang et al. (2022) proposed YOLOv5-DN, an improved version of YOLOv5 (Jocher et al., 2022) for maritime ship detection and classification. YOLOv5-DN is realized by introducing the CSP-DenseNet structure (Wang et al., 2020) into the YOLOv5 model, aiming to optimize the detection accuracy. However, it does not take into account detection speed, which leads to inefficiency.

In conclusion, current ship detection techniques still suffer from inefficiency. Therefore, achieving a good balance between detection accuracy and speed is of vital importance for maritime surveillance. In this work, we extend the YOLOv5-based ship detection model in the following ways: using dynamic convolution to replace the traditional downsampling convolutional layer and exploiting the ConvNeXt module in the backbone, leading to faster and more accurate ship detection on drone-captured images.


Fig. 2. The illustration of (a) two-dimensional convolution and (b) depthwise convolution.

2.3. Review of YOLOv5

In this work, we focus on studying YOLOv5 (Jocher et al., 2022) for fast and accurate ship detection on drone-captured images. YOLOv5 is a fast and easy-to-use object detection model that achieves state-of-the-art results in object detection tasks. It is built on the PyTorch framework, making it easy to use and extend. YOLOv5 has five different versions, including YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x, according to the number of parameters. YOLOv5s has relatively few parameters while maintaining high detection accuracy, which makes it well suited for real-time ship detection. Therefore, we choose YOLOv5s as the baseline for this work.

The architecture of YOLOv5 is composed of three parts, i.e., backbone, neck and head. The backbone uses CSPNet (Wang et al., 2020) with an SPPF (a fast version of SPP) layer to extract features from the input image. The neck uses PANet (Liu et al., 2018) for multi-scale feature fusion. And the YOLO detection head (Redmon et al., 2016) is used for classification and bounding box regression. It is worth noting that the head detects objects at three different dimensions of 80×80, 40×40 and 20×20, which correspond to small, medium and large objects (see Fig. 3).

3. Proposed network

In this study, we propose YOLOv5-ODConvNeXt to enhance the efficiency of YOLOv5 (Jocher et al., 2022) for ship detection from drone-captured images. The architecture of YOLOv5-ODConvNeXt is shown in Fig. 1. According to extensive experimental results, we finally choose to replace the convolution in layer 1 of the network with Omni-dimensional Dynamic Convolution (Li et al., 2022b) and the C3 module in layer 6 of the network with a ConvNeXt module. Table 1 shows the differences between the architectures of YOLOv5s and YOLOv5-ODConvNeXt.

Table 1
Structural comparison of YOLOv5s and YOLOv5-ODConvNeXt.

#Layer  YOLOv5s   YOLOv5-ODConvNeXt
0       Conv      Conv
1       Conv      ODConv
2       C3(n=1)   C3(n=1)
3       Conv      Conv
4       C3(n=2)   C3(n=2)
5       Conv      Conv
6       C3(n=3)   ConvNeXt
7       Conv      Conv
8       C3(n=1)   C3(n=1)
9       SPPF      SPPF
...     ...       ...

3.1. Omni-dimensional dynamic convolution

YOLOv5 uses traditional two-dimensional convolution to generate features. Fig. 2(a) shows how two-dimensional convolution works in a convolutional layer with four filters. The number of output channels of a convolutional layer equals the number of filters, and the dimension of each filter depends on the dimension of the input features. The traditional two-dimensional convolution can be described as:

Output(x) = W ∗ x,  (1)

where x denotes the input features, W denotes the convolutional layer, and ∗ denotes the convolution operation. It is obvious that the convolutional kernels of each filter do not change for different inputs. Therefore, increasing the number of filters is often necessary to acquire more features, which is less efficient. To address this issue, dynamic convolution is exploited in our network, which can boost the accuracy of lightweight CNNs while maintaining efficient inference.

Omni-dimensional Dynamic Convolution (ODConv) (Li et al., 2022b) is an extended version of CondConv (Yang et al., 2019) and DyConv (Chen et al., 2020b). To better understand ODConv, we first introduce the basic concept of dynamic convolution. Dynamic convolution can be regarded as the linear combination of n groups of two-dimensional convolution, i.e.,

Output(x) = (α_1 W_1 + ⋯ + α_n W_n) ∗ x,  (2)

where W_i denotes the i-th group of convolutional filters and α_i is the weighting scalar for W_i. α_i is calculated by an attention function conditioned on the input features. Both CondConv and DyConv use a modified Squeeze-and-Excitation (Hu et al., 2018) structure as their attention function to generate α_i. However, the attention value α_i is shared across various dimensions, including the spatial dimension, the input dimension and the output dimension. To this end, ODConv expands the structure of CondConv and DyConv by introducing three complementary weights to realize a more general dynamic convolution. Following the notation of Eq. (2), ODConv can be defined as

Output(x) = (α_{w1} ⊙ α_{f1} ⊙ α_{c1} ⊙ α_{s1} ⊙ W_1 + ⋯ + α_{wn} ⊙ α_{fn} ⊙ α_{cn} ⊙ α_{sn} ⊙ W_n) ∗ x,  (3)

where α_{wi} is equivalent to α_i in Eq. (2). The newly introduced α_{fi}, α_{ci} and α_{si} denote the learnable weights for the output dimension, the input dimension and the spatial dimension, respectively. The symbol ⊙ denotes the weighting operation applied in the different dimensions of the convolutional filters. The illustration of the ⊙ operation is shown in Fig. 4.

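To make Eqs. (2) and (3) concrete, the following is a simplified PyTorch sketch of an ODConv-style dynamic convolution. It is an illustrative re-implementation, not the code used in this paper or the official ODConv release: for brevity it keeps only the kernel-wise (Softmax) and input-channel (Sigmoid) attention branches, whereas the full ODConv also weights the spatial and output-filter dimensions. The class name SimpleODConv and all hyperparameters are our own choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleODConv(nn.Module):
    """Toy omni-dimensional dynamic 3x3 convolution (cf. Eq. (3))."""
    def __init__(self, in_ch, out_ch, n_kernels=2, reduction=4):
        super().__init__()
        self.n_kernels = n_kernels
        # n candidate kernel groups W_1..W_n, combined per input sample
        self.weight = nn.Parameter(
            torch.randn(n_kernels, out_ch, in_ch, 3, 3) * 0.02)
        hidden = max(in_ch // reduction, 4)
        # GAP -> FC -> ReLU, as in the attention head described in Fig. 3
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        self.fc_kernel = nn.Linear(hidden, n_kernels)   # alpha_w (Softmax)
        self.fc_channel = nn.Linear(hidden, in_ch)      # alpha_c (Sigmoid)

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = self.attn(x)
        alpha_w = F.softmax(self.fc_kernel(ctx), dim=1)   # (b, n), sums to 1
        alpha_c = torch.sigmoid(self.fc_channel(ctx))     # (b, c), in (0, 1)
        # Per-sample weighted sum of the n candidate kernel groups.
        kernels = torch.einsum('bn,noihw->boihw', alpha_w, self.weight)
        # Modulate the input-channel dimension of each sample's kernel.
        kernels = kernels * alpha_c.view(b, 1, c, 1, 1)
        # Grouped-conv trick: fold the batch into the channel dimension so
        # each sample is convolved with its own dynamically generated kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernels.reshape(-1, c, 3, 3),
                       padding=1, groups=b)
        return out.reshape(b, -1, h, w)

# quick shape check
if __name__ == "__main__":
    layer = SimpleODConv(16, 32, n_kernels=2)
    print(layer(torch.randn(4, 16, 64, 64)).shape)  # torch.Size([4, 32, 64, 64])

With n_kernels = 2 this mirrors the n = 2 setting adopted later in Section 4.3; the extra cost over a plain convolution is the small attention head plus one additional kernel group.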

Fig. 3. An ODConv layer.

Fig. 4. The illustration of how four different sorts of weights work in ODConv.

Fig. 3 shows the structure of ODConv. The input features are fed into a Fully Connected (FC) layer and a ReLU (Krizhevsky et al., 2012) activation function after being squeezed by a Global Average Pooling (GAP) layer. Then, four weights are generated by an FC layer and a nonlinear activation. Specifically, α_{fi}, α_{ci} and α_{si} are generated by a Sigmoid function, i.e.,

Sigmoid(x) = 1 / (1 + e^{-x}),  (4)

which maps the input value to the interval (0, 1). α_{wi} is generated by a Softmax function, i.e.,

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j),  (5)

where x_i denotes the i-th element of the input vector. Softmax produces a set of probabilities, which adds the constraint Σ_i α_{wi} = 1 and simplifies the learning of α_{wi}. The final convolutional filters, generated as the weighted sum of the n groups of filters, are utilized to produce the output features. Therefore, the convolution kernel changes with different input features. In this work, we utilize ODConv to pursue the goal of realizing a more accurate model without increasing the network width and depth. The position of the ODConv used in YOLOv5-ODConvNeXt is shown in Fig. 1.

3.2. ConvNeXt block

For the past three years, we have witnessed a burst of applications of Transformers in vision tasks. However, due to the global attention mechanism of Transformers, their computational complexity is quadratic in the input size. This burden is particularly felt in tasks such as object detection with higher-resolution inputs. Therefore, ConvNeXt (Liu et al., 2022) was born to bring the advantages of Transformers back to CNNs. The architecture of the ConvNeXt block is shown in Fig. 5(a).

ConvNeXt is a pure CNN which refers to the macro/micro design and the training process of a Swin Transformer (Liu et al., 2021a). A ConvNeXt block uses depthwise convolution with a larger kernel size of 7 × 7 followed by the simpler Layer Normalization (Ba et al., 2016). The illustration of depthwise convolution is shown in Fig. 2(b). In a depthwise convolution layer, each filter performs the convolution operation on only one input channel. Subsequently, a 1 × 1 convolution layer and a GeLU (Hendrycks and Gimpel, 2016) activation function are used to lift the number of channels from C to 4 × C, where the GeLU activation function is

GELU(x) = 0.5x(1 + Tanh(√(2/π)(x + 0.044715x³))).  (6)

Finally, a 1 × 1 convolution layer is used to squeeze the channel dimension from 4 × C back to C. Compared to the scheme of the C3 block shown in Fig. 5(b), the ConvNeXt block is simpler, with fewer activation and normalization layers. The ConvNeXt block uses a bigger kernel size, which gives a broader receptive field, while the C3 block uses kernel sizes of 3 × 3 and 1 × 1. A larger kernel size can benefit the network by capturing long-range dependencies. The normalization layer of the ConvNeXt block is Layer Normalization, while C3 blocks adopt Batch Normalization. Additionally, the ConvNeXt block employs the inverted bottleneck design, while the C3 block contains repetitions of bottlenecks. All these designs of the ConvNeXt block come from the structure of Transformers. Furthermore, the ConvNeXt block follows the concept of ResNeXt (Xie et al., 2017), adopting depthwise convolution and pointwise convolution to realize a split-transform-merge strategy for efficient inference.

In this work, we replace the original C3 block in the backbone of YOLOv5 with a ConvNeXt block, which leads to faster inference speed with a slight decline in accuracy.

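For reference, the block just described can be written in a few lines of PyTorch. The sketch below mirrors the structure of Fig. 5(a): a 7 × 7 depthwise convolution, Layer Normalization, a 1 × 1 expansion from C to 4C with GELU, and a 1 × 1 reduction back to C, with a residual connection. It is an illustrative re-implementation assuming the standard public ConvNeXt design; it omits details such as layer scale and stochastic depth and is not the exact code integrated into YOLOv5-ODConvNeXt.

import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 expand (4C) -> GELU -> 1x1 squeeze."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)           # normalizes the channel dim
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as a Linear on (B,H,W,C)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x

if __name__ == "__main__":
    block = ConvNeXtBlock(256)
    print(block(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])

Note how the single normalization and single activation per block contrast with the repeated Conv-BN-activation pattern of the C3 bottlenecks, which is the source of the inference-speed advantage discussed above.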

Fig. 5. A schematic comparison of (a) ConvNeXt block and (b) C3 block.

4. Experiment

To demonstrate the effectiveness of the proposed network, extensive experiments are conducted on our ship dataset. In this section, we first introduce the experimental environment, the dataset, and the evaluation criterion of this work. Three different experiments are conducted, including fusing ODConv with YOLOv5s, fusing the ConvNeXt block with YOLOv5s, and comparisons with the state-of-the-art methods. We present both subjective and objective results.

4.1. Experimental environment and dataset

All the experiments are conducted with the deep learning framework PyTorch on an Ubuntu 20.04.4 system with an Intel Xeon Gold 6230 CPU and an NVIDIA RTX3090 GPU. The Python version and PyTorch version are 3.7.13 and 1.11.0, respectively. The specific configuration is shown in Table 2. We only use one GPU for training and inference, and all networks are trained with a batch size of 32. Other hyperparameters, including learning rate, weight decay and data augmentation parameters, are set to the defaults provided by YOLOv5.

Table 2
The configuration of the experimental environment.

Parameter          Configuration
Operating System   Ubuntu 20.04.4
CPU                Intel Xeon Gold 6230
GPU                NVIDIA RTX3090
CUDA Version       11.4
Python Version     3.7.13
PyTorch Version    1.11.0

Since it is hard to obtain publicly accessible datasets for ship detection on drone-captured images, we constructed a single-category ship dataset which contains 3200 ship images captured by drones or with a drone view. The images in our dataset have three sources: the MS-COCO dataset (Lin et al., 2014), the Pascal VOC dataset (Everingham et al., 2010, 2015) and images captured by our own drone. We used a Python script to extract the images that contain ship instances in the MS-COCO and Pascal VOC datasets. Then all the data was converted to YOLO format. After that, we manually filtered the data and reserved the images captured by drones or with a drone view. Finally, several labeled images captured by our own drone were added to the dataset. We divided our dataset into a training set, a validation set and a test set according to the ratio of 7:1:2.

Fig. 6 shows the visualization of the statistical information of our ship dataset. As Fig. 6(a) suggests, there are more than 8000 ship instances in our dataset.

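The extraction-and-conversion step described above can be sketched as follows. This is a hypothetical script, not the authors' actual one: it assumes COCO-style annotation files, uses the COCO class name 'boat' as the ship-like category, writes single-class YOLO labels (class, normalized x-center, y-center, width, height), and splits the images 7:1:2. All paths and function names are illustrative.

import random, shutil
from pathlib import Path
from pycocotools.coco import COCO

def coco_boats_to_yolo(ann_file, img_dir, out_dir):
    """Copy images containing the COCO 'boat' class and write YOLO labels."""
    coco = COCO(ann_file)
    cat_id = coco.getCatIds(catNms=['boat'])[0]
    out = Path(out_dir)
    (out / 'images').mkdir(parents=True, exist_ok=True)
    (out / 'labels').mkdir(exist_ok=True)
    for img_id in coco.getImgIds(catIds=[cat_id]):
        info = coco.loadImgs(img_id)[0]
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id, catIds=[cat_id],
                                            iscrowd=False))
        lines = []
        for a in anns:
            x, y, w, h = a['bbox']   # COCO boxes: top-left x, y, width, height (pixels)
            lines.append('0 {:.6f} {:.6f} {:.6f} {:.6f}'.format(
                (x + w / 2) / info['width'], (y + h / 2) / info['height'],
                w / info['width'], h / info['height']))
        if lines:
            shutil.copy(Path(img_dir) / info['file_name'],
                        out / 'images' / info['file_name'])
            (out / 'labels' / (Path(info['file_name']).stem + '.txt')
             ).write_text('\n'.join(lines))

def split_dataset(out_dir, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle the images and write train/val/test file lists at the 7:1:2 ratio."""
    imgs = sorted((Path(out_dir) / 'images').glob('*'))
    random.Random(seed).shuffle(imgs)
    n_train, n_val = int(ratios[0] * len(imgs)), int(ratios[1] * len(imgs))
    splits = {'train': imgs[:n_train],
              'val': imgs[n_train:n_train + n_val],
              'test': imgs[n_train + n_val:]}
    for name, files in splits.items():
        (Path(out_dir) / f'{name}.txt').write_text('\n'.join(str(p) for p in files))

# usage (paths are placeholders):
# coco_boats_to_yolo('instances_train2017.json', 'train2017/', 'ship_dataset/')
# split_dataset('ship_dataset/')

The resulting images, labels and split lists follow the YOLO convention and can be filtered manually afterwards, as done in this work, before being used for training.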

Fig. 6. Statistical visualization of our ship dataset. (a) The number of ship instances, (b) the distribution of ground truth, (c) the distribution of instance center points, (d) the distribution of instance sizes. The darker the color of the point, the larger the number of instances in (c) and (d).

Fig. 7. Different positions we chose to implement ODConv in the backbone of YOLOv5s.

Table 3
Detection results of applying ODConv in different layers of the backbone of YOLOv5s.

Model                     Size (pixels)  AP50:95  Speed (ms)  Params (M)
YOLOv5s (baseline)        640            46.8%    9.4         7.01
YOLOv5s-ODConv[N=2](1)    640            47.6%    9.5         7.03
YOLOv5s-ODConv[N=2](2)    640            47.1%    9.5         7.09
YOLOv5s-ODConv[N=2](3)    640            46.7%    9.6         7.32
YOLOv5s-ODConv[N=2](4)    640            46.2%    10.2        8.21

Table 4
Detection results of applying ODConv in layer 1 with different n values.

Model                     Size (pixels)  AP50:95  Speed (ms)  Params (M)
YOLOv5s (baseline)        640            46.8%    9.4         7.01
YOLOv5s-ODConv[N=2](1)    640            47.6%    9.5         7.03
YOLOv5s-ODConv[N=3](1)    640            47.1%    9.8         7.05
YOLOv5s-ODConv[N=4](1)    640            47.1%    10.2        7.07

Fig. 6(b) shows the ground truth of all instances. Fig. 6(c) reveals the position distribution of instances, and we can see that most of the instances lie in the center of an image. Fig. 6(d) evaluates the number of instances of different sizes, and it is obvious that small objects occupy the vast majority of our dataset, which fits the characteristics of drone-captured images.

4.2. Metrics

In order to validate the effectiveness of the proposed network, we compare the performance of the new network and the baseline network (YOLOv5s). Therefore, a complete evaluation system needs to be established. There are two important metrics in our experiments, one for detection speed and one for accuracy. We choose the batch-1 inference time as the metric of detection speed and average precision (AP) as the metric of accuracy. The calculation of AP is introduced as follows.

For the same category, if the intersection over union (IoU) of a predicted bounding box and the ground truth exceeds the given threshold, it is defined as a correct detection. The number of correctly detected instances is marked as true positive (TP); the number of wrongly detected instances is marked as false positive (FP); and the number of missed instances is marked as false negative (FN). Then we can introduce two indicators named precision and recall. The precision is defined as

Precision = TP / (TP + FP),  (7)

which measures the accuracy of all predictions. The recall is defined as

Recall = TP / (TP + FN),  (8)

which describes the percentage of all ground truths that are correctly predicted by the network. A P-R curve can be plotted based on the values of precision and recall. AP is defined as the area between the P-R curve and the coordinate axis, i.e.,

AP = ∫₀¹ P(R) dR,  (9)

whose value ranges from 0 to 1. In this work, we choose AP50:95, which is the average of a series of AP values with the IoU threshold ranging from 0.5 to 0.95 in steps of 0.05.

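The following NumPy sketch ties Eqs. (7)–(9) together: given the confidence score and true/false-positive flag of every detection at one IoU threshold, it accumulates precision and recall over the score-sorted detections and integrates the P-R curve; AP50:95 is then simply the mean over the ten IoU thresholds. It is a simplified illustration of the COCO-style computation, not the exact evaluator used for the numbers reported in this paper.

import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve (Eq. (9)) for one IoU threshold."""
    order = np.argsort(-np.asarray(scores))          # sort detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)                            # Eq. (8)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)      # Eq. (7)
    # make precision monotonically decreasing, then integrate over recall
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0] if len(precision) else 0.0], precision))
    return float(np.sum(np.diff(recall) * precision[1:]))

def ap_50_95(per_threshold_results):
    """Mean AP over the 10 IoU thresholds 0.50, 0.55, ..., 0.95.

    `per_threshold_results` is a list of (scores, is_tp, num_gt) tuples,
    one per IoU threshold."""
    return float(np.mean([average_precision(*r) for r in per_threshold_results]))

# toy example: 4 detections, 3 ground-truth ships, single IoU threshold
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=3))  # ~0.83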

Fig. 8. The variation of (a) model accuracy, (b) inference speed and (c) the number of parameters after applying ODConv at different locations in YOLOv5s.

4.3. Fusing ODConv with YOLOv5s

In this experiment, we analyze the performance of applying ODConv in different layers of YOLOv5s. There are five downsampling convolution layers in the backbone of YOLOv5. The original convolution layer is replaced by an ODConv layer at each of these positions in turn, except for the first layer, as shown in Fig. 7. We set n = 2 for ODConv, which means two groups of convolution filters are linearly combined to generate the convolution layer. The input size is set to 640 × 640. All models are trained for 500 epochs on the training set and the detection results are obtained on the test set, as shown in Table 3.

We observed that the best results are obtained when ODConv is used in position (1). Compared to the baseline model, applying ODConv in position (1) increases the AP of YOLOv5s by 0.8% with a slight increment of inference time. Fig. 8 shows the variation of detection accuracy, inference speed and parameters after applying ODConv in different layers of YOLOv5s. The results in Fig. 8(a) illustrate that applying ODConv to shallower layers yields greater accuracy gains, which indicates that richer features are needed in the shallow layers of the network. Fewer channels are contained in low-level features, while for high-level features the number of channels has reached 512, which provides an abundant representation of features. Therefore, implementing ODConv in deep layers does not achieve higher accuracy. Fig. 8(b) and 8(c) illustrate that applying ODConv in the deeper layers of the network slows down the inference speed as well as introducing more parameters. There are fewer channels in the shallow layers of the network, which leads to a slight increment of parameters and inference time when exploiting ODConv. However, in deeper layers, the number of channels increases exponentially. Applying ODConv in layers with more channels introduces more parameters, which burdens the computational cost of the network and slows down the inference speed.

It can be concluded that applying ODConv in the shallow layers of YOLOv5s brings an improvement in accuracy with a small increment of parameters. As ODConv is applied to deeper layers of YOLOv5s, the accuracy gains become smaller and even reverse, while the increment of parameters becomes larger, which leads to a decrease in detection speed. Therefore, we finally choose position (1) to exploit ODConv in our proposed network.

We also test the performance of setting different n values for ODConv. ODConv is applied in position (1) with n ranging from 2 to 4. Table 4 lists the results of the different models. The best results are obtained with n = 2, which indicates that an ODConv consisting of two groups of filters enables the best accuracy of YOLOv5s, and a larger n does not lead to higher accuracy. The performance of the network reaches a saturation point at n = 2. Therefore, we finally set n = 2 in the following experiments.

Table 5
Detection results of replacing the C3 block with a ConvNeXt block in different locations.

Model                  Size (pixels)  AP50:95  Speed (ms)  Params (M)
YOLOv5s (baseline)     640            46.8%    9.4         7.01
YOLOv5s-ConvNeXt(1)    640            46.3%    8.3         7.03
YOLOv5s-ConvNeXt(2)    640            46.6%    8.3         7.04
YOLOv5s-ConvNeXt(3)    640            46.7%    8.1         6.93
YOLOv5s-ConvNeXt(4)    640            46.8%    8.5         7.96

Fig. 9. Different positions we chose to replace the C3 block with a ConvNeXt block in the backbone of YOLOv5s.

4.4. Fusing ConvNeXt block with YOLOv5s

In this experiment, we evaluated the performance of replacing different C3 blocks with ConvNeXt blocks in the YOLOv5s backbone. We also tested different kernel sizes of the ConvNeXt block to determine its optimal configuration. The YOLOv5s backbone consists of four C3 blocks which are connected by the downsampling convolution layers and an SPPF layer. Each C3 block contains n repetitions of bottleneck blocks with three 1 × 1 convolutions, as shown in Fig. 5(b). The values of n in the C3 blocks are 1, 2, 3 and 1 from bottom to top. Fig. 9 shows the specific locations of the ConvNeXt blocks we exploited in the YOLOv5s backbone.

Table 5 lists the scores of the original YOLOv5s and the four revised models. The best results were obtained when the C3 block at position (3) was replaced by a ConvNeXt block, which leads to a 1.3 ms decrement of inference time with only a 0.1% AP drop in accuracy. Additionally, the number of parameters is slightly reduced.


Table 6
Detection results of using different kernel sizes of ConvNeXt block in YOLOv5s.
Model Kernel size Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M)
YOLOv5s-ConvNeXt(3) 3 640 45.8% 7.9 6.92
YOLOv5s-ConvNeXt(3) 5 640 46.3% 8.0 6.92
YOLOv5s-ConvNeXt(3) 7 640 46.7% 8.1 6.93
YOLOv5s-ConvNeXt(3) 9 640 46.8% 8.3 6.94
YOLOv5s-ConvNeXt(3) 11 640 46.8% 8.4 6.95

Fig. 10. Comparisons of the performance on our ship dataset.

Therefore, we choose to implement a ConvNeXt block at position (3) in the following experiments.

One of the characteristics of the ConvNeXt block is that it uses a larger kernel size instead of the most commonly used 3 × 3 convolution. Therefore, we tested several kernel sizes for the ConvNeXt block, including 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11, to determine the optimal kernel size for the ConvNeXt block in YOLOv5s. The results in Table 6 suggest that model accuracy reaches a saturation point at a kernel size of 9 × 9. However, a larger kernel size in the ConvNeXt block yields slower detection speed. In order to balance accuracy and detection speed, we choose to retain the original kernel size of 7 × 7 in the ConvNeXt block.

4.5. Comparisons with the state-of-the-art

In this experiment, we evaluate the performance of the proposed model on our ship dataset and compare it with nine different state-of-the-art detectors. Table 7 lists the scores of our model and the other detectors, including YOLOv5s (Jocher et al., 2022), TPH-YOLOv5 (Zhu et al., 2021a), Scaled-YOLOv4 (Wang et al., 2021), YOLOv6-tiny (Li et al., 2022a), YOLOv7 (Wang et al., 2022), YOLO-Fastestv2 (dog-qiuqiu, 2021), NanoDet-Plus-m (RangiLyu, 2021), EfficientDet-d0 (Tan et al., 2020) and Faster R-CNN (Ren et al., 2017) with a ResNet18 (He et al., 2016) backbone. Considering that TPH-YOLOv5 is designed for high-performance devices with huge computational cost, we scaled this model to a smaller version to fit our task. Our model achieves 48.0% AP with 8.3 ms inference time per image, which exceeds the baseline model YOLOv5s by 1.2% AP and 1.1 ms inference time per image without increasing the number of parameters. Fig. 10 provides a more intuitive comparison of our proposed YOLOv5-ODConvNeXt model with other state-of-the-art detectors. It is clear that our proposed model outperforms YOLOv5s, YOLOv6-tiny, TPH-YOLOv5, YOLO-Fastestv2, NanoDet-Plus-m, EfficientDet-d0 and Faster R-CNN in both speed and accuracy on our ship dataset. Compared to Scaled-YOLOv4, our model is less accurate by 0.4% AP but 48.1% faster. Compared to YOLOv7, our model is less accurate by 4.5% AP but 83.1% faster. Furthermore, the number of parameters of the proposed model is only 19.1% of that of YOLOv7. In terms of model size, our model is relatively lightweight at 13.7 MB, which is smaller than most of the CNNs, including TPH-YOLOv5, Scaled-YOLOv4, YOLOv6-tiny, YOLOv7, NanoDet-Plus-m, EfficientDet-d0, and Faster R-CNN. Although YOLO-Fastestv2 has a smaller model size, its accuracy is much lower at only 20.0% AP, which does not meet the requirements for accurate ship detection. Therefore, our model achieves a better trade-off between accuracy and detection speed and is more suitable for real-time ship detection on drone-captured images in maritime surveillance systems. Fig. 11 shows the visualization of the detection results of our model.

To better evaluate the performance of our proposed model on different ship datasets, we conduct further experiments on the LEVIR-Ship dataset (Chen et al., 2022). The experimental results are shown in Table 8. Our proposed YOLOv5-ODConvNeXt still demonstrates great performance in terms of speed and accuracy, achieving 23.1% AP with 8.3 ms batch-1 inference time without increasing the number of parameters on the LEVIR-Ship dataset.

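For context on the batch-1 inference times reported throughout Section 4, the snippet below shows one common way to measure batch-1 GPU latency in PyTorch. It is a generic sketch on a stand-in torchvision model and assumes a CUDA device; it is not the authors' benchmarking script, and absolute numbers depend on warm-up, input size and hardware.

import time
import torch

def batch1_latency(model, img_size=640, warmup=50, iters=300, device='cuda'):
    """Rough batch-1 GPU latency in milliseconds per image."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / cuDNN autotuning
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()         # wait for all queued GPU work to finish
    return (time.perf_counter() - start) * 1000.0 / iters

# example with an arbitrary torchvision backbone as a stand-in model
if __name__ == "__main__":
    from torchvision.models import resnet18
    print(f"{batch1_latency(resnet18()):.1f} ms per image")

The explicit torch.cuda.synchronize calls matter here: without them the timer would only measure kernel launches rather than the actual GPU execution time.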

Fig. 11. Visualization of the detection results from proposed model.

Table 7
The detection results of the proposed model and other state-of-the-art detectors on self-constructed dataset.
Model Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M) Model size (MB)
YOLOv5s(baseline) 640 46.8% 9.4 7.01 13.7
YOLOv5-ODConvNeXt 640 48.0% 8.3 6.99 13.7
TPH-YOLOv5 640 46.0% 18.9 9.16 18.9
Scaled-YOLOv4 640 48.4% 12.3 9.11 17.8
YOLOv6-tiny 640 46.5% 9.0 14.94 31.4
YOLOv7 640 52.5% 15.2 36.48 71.3
YOLO-Fastestv2 352 20.0% 8.9 0.24 1.08
NanoDet-Plus-m 416 34.5% 12.6 2.44 29.9
EfficientDet-d0 512 34.1% 16.2 3.82 15.0
Faster-RCNN 640 47.2% 25.6 28.684 227.1

Table 8
The detection results of the proposed model and other state-of-the-art detectors on LEVIR-Ship dataset.
Model Size (pixels) 𝐴𝑃50∶95 Speed (ms) Params (M) Model size (MB)
YOLOv5s(baseline) 640 23.1% 9.4 7.01 13.7
YOLOv5-ODConvNeXt 640 23.1% 8.3 6.99 13.7
TPH-YOLOv5 640 20.6% 18.9 9.16 18.9
Scaled-YOLOv4 640 25.2% 12.3 9.11 17.8
YOLOv6-tiny 640 19.0% 9.0 14.94 31.4
YOLOv7 640 21.3% 15.2 36.48 71.3
YOLO-Fastestv2 352 12.5% 8.9 0.24 1.08
NanoDet-Plus-m 416 23.7% 12.6 2.44 29.9
EfficientDet-d0 512 15.1% 16.2 3.82 15.0
Faster-RCNN 640 20.2% 25.6 28.684 227.1

5. Conclusions

Fast and accurate ship detection is of vital importance for maritime surveillance, which can be widely applied in preventing ship accidents, illegal fishing, smuggling and so on. In this work, we focus on addressing the challenges and requirements of ship detection from drone-captured images, such as complex backgrounds and real-time performance. We proposed an enhanced deep learning model, namely YOLOv5-ODConvNeXt, based on YOLOv5s, aiming to improve the accuracy and detection speed of the original network for efficient ship detection. We constructed a ship dataset with 3200 images captured by drones or with a drone view. To improve the accuracy without increasing the network width and depth, we implemented ODConv in layer 1 to replace the traditional convolution layer. We found that applying ODConv in shallow layers leads to higher model accuracy with a smaller increment of parameters, due to the limited number of channels in shallow layers. Additionally, we exploited a ConvNeXt block in layer 6 instead of the original C3 block with three bottlenecks. The simpler structure of the ConvNeXt module leads to faster detection speed with only a slight decline in accuracy. Our proposed YOLOv5-ODConvNeXt achieves 48.0% AP with 8.3 ms inference time per image on an NVIDIA RTX3090 GPU. Compared to the original YOLOv5s, our proposed model improves the detection accuracy by 1.2% AP with a significantly faster detection speed, exceeding YOLOv5s by 13.3%. As a result, the introduction of ODConv and the ConvNeXt block makes YOLOv5s faster and more accurate, significantly improving the performance of ship detection on drone-captured images for maritime surveillance. Experimental results have demonstrated the superior performance of our proposed model in terms of accuracy and detection speed compared to other state-of-the-art models. However, there is still room for improvement in our work due to some inherent limitations of object detectors. In future work, it is possible to develop weakly supervised ship detection techniques to reduce the model's dependence on large amounts of annotated training data. Additionally, we will consider improving the performance of small ship detection and evaluating the performance of our model on more ship datasets. We believe that our work can benefit practical applications in maritime surveillance.


CRediT authorship contribution statement

Shuxiao Cheng: Conceptualization, Methodology, Software, Formal analysis, Writing – original draft. Yishuang Zhu: Visualization, Data curation, Writing – review & editing. Shaohua Wu: Supervision, Funding acquisition, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

I have shared the link to my code in the abstract of the article.

Acknowledgments

This work has been supported in part by the National Key Research and Development Program of China under Grant no. 2020YFB1806403, in part by the GuangDong Basic and Applied Basic Research Foundation under Grant no. 2022B1515120002, and in part by the National Natural Science Foundation of China under Grant no. 62201307.

References

Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks. In: International Conference on Machine Learning. PMLR, pp. 214–223.
Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M., 2020. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers. In: European Conference on Computer Vision. Springer, pp. 213–229.
Chen, J., Chen, K., Chen, H., Zou, Z., Shi, Z., 2022. A degraded reconstruction enhancement-based method for tiny ship detection in remote sensing images with a new large-scale dataset. IEEE Trans. Geosci. Remote Sens. 60, 1–14. http://dx.doi.org/10.1109/TGRS.2022.3180894.
Chen, Z., Chen, D., Zhang, Y., Cheng, X., Zhang, M., Wu, C., 2020a. Deep learning for autonomous ship-oriented small ship detection. Saf. Sci. 130, 104812.
Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z., 2020b. Dynamic convolution: Attention over convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11030–11039.
Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., Sun, J., 2021. You only look one-level feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13039–13048.
Dai, Z., Cai, B., Lin, Y., Chen, J., 2021. UP-DETR: Unsupervised pre-training for object detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1601–1610.
Dai, J., Li, Y., He, K., Sun, J., 2016. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 29.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y., 2017. Deformable convolutional networks. In: 2017 IEEE International Conference on Computer Vision. ICCV, pp. 764–773. http://dx.doi.org/10.1109/ICCV.2017.89.
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255.
dog-qiuqiu, 2021. dog-qiuqiu/Yolo-FastestV2: V0.2. Zenodo, http://dx.doi.org/10.5281/zenodo.5181503.
Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B., 2022. CSWin transformer: A general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12124–12134.
Everingham, M., Eslami, S., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2015. The Pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 111 (1), 98–136.
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2010. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88 (2), 303–338.
Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J., 2021. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430.
Girshick, R., 2015. Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision. ICCV, pp. 1440–1448. http://dx.doi.org/10.1109/ICCV.2015.169.
Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587.
Han, X., Zhao, L., Ning, Y., Hu, J., 2021. ShipYolo: An enhanced model for ship detection. J. Adv. Transp. 2021.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), 1904–1916.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778.
Hendrycks, D., Gimpel, K., 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7132–7141.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. PMLR, pp. 448–456.
Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., NanoCode012, Kwon, Y., TaoXie, Fang, J., imyhxy, Michael, K., Lorna, V, A., Montes, D., Nadar, J., Laughing, tkianai, yxNONG, Skalski, P., Wang, Z., Hogan, A., Fati, C., Mammana, L., AlexWang1900, Patel, D., Yiwei, D., You, F., Hajek, J., Diaconu, L., Minh, M.T., 2022. ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference. Zenodo, http://dx.doi.org/10.5281/zenodo.6222936.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems, Vol. 25. Curran Associates, Inc., URL: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
Li, H., Deng, L., Yang, C., Liu, J., Gu, Z., 2021. Enhanced YOLO v3 tiny network for real-time ship detection from visual image. IEEE Access 9, 16692–16706.
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., et al., 2022a. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976.
Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J., 2017. Light-head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264.
Li, C., Zhou, A., Yao, A., 2022b. Omni-dimensional dynamic convolution. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=DmpCfq6Mg39.
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988.
Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C., 2016. SSD: Single shot multibox detector. In: European Conference on Computer Vision. Springer, pp. 21–37.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021a. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022.
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S., 2022. A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR, pp. 11976–11986.
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018. Path aggregation network for instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8759–8768.
Liu, R.W., Yuan, W., Chen, X., Lu, Y., 2021b. An enhanced CNN-enabled learning method for promoting ship detection in maritime surveillance system. Ocean Eng. 235, 109435.
RangiLyu, 2021. NanoDet-Plus: Super fast and high accuracy lightweight anchor-free object detection model. https://github.com/RangiLyu/nanodet.
Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2016. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788.
Redmon, J., Farhadi, A., 2017. YOLO9000: Better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7263–7271.
Redmon, J., Farhadi, A., 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., Sun, J., 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), 1137–1149. http://dx.doi.org/10.1109/TPAMI.2016.2577031.
Shao, Z., Wang, L., Wang, Z., Du, W., Wu, W., 2019. Saliency-aware convolution neural network for ship detection in surveillance video. IEEE Trans. Circuits Syst. Video Technol. 30 (3), 781–794.
Tan, M., Pang, R., Le, Q.V., 2020. EfficientDet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10781–10790.
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W., 2013. Selective search for object recognition. Int. J. Comput. Vis. 104 (2), 154–171.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M., 2021. Scaled-YOLOv4: Scaling cross stage partial network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13029–13038.
Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M., 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
Wang, C.Y., Liao, H.Y.M., Wu, Y.H., Chen, P.Y., Hsieh, J.W., Yeh, I.-H., 2020. CSPNet: A new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 390–391.
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500.
Yang, B., Bender, G., Le, Q.V., Ngiam, J., 2019. CondConv: Conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 32.
Zhang, Y., Li, Q.Z., Zang, F.N., 2017. Ship detection for visual maritime surveillance from non-stationary platforms. Ocean Eng. 141, 53–63. http://dx.doi.org/10.1016/j.oceaneng.2017.06.022.
Zhang, X., Yan, M., Zhu, D., Guan, Y., 2022. Marine ship detection and classification based on YOLOv5 model. J. Phys. Conf. Ser. 2181 (1), 012025.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D., 2020. Distance-IoU loss: Faster and better learning for bounding box regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 07. pp. 12993–13000.
Zhu, X., Lyu, S., Wang, X., Zhao, Q., 2021a. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2778–2788.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J., 2021b. Deformable DETR: Deformable transformers for end-to-end object detection. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=gZ9hCDWe6ke.