
agronomy

Article
Apple Detection in Complex Scene Using the Improved
YOLOv4 Model
Lin Wu, Jie Ma *, Yuehua Zhao and Hong Liu

School of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401, China;
[email protected] (L.W.); [email protected] (Y.Z.); [email protected] (H.L.)
* Correspondence: [email protected]; Tel.: +86-139-2038-1537

Abstract: To enable the apple picking robot to quickly and accurately detect apples under the complex
background in orchards, we propose an improved You Only Look Once version 4 (YOLOv4) model
and data augmentation methods. Firstly, the crawler technology is utilized to collect pertinent apple
images from the Internet for labeling. For the problem of insufficient image data caused by the
random occlusion between leaves, in addition to traditional data augmentation techniques, a leaf
illustration data augmentation method is proposed in this paper to accomplish data augmentation.
Secondly, due to the large size and calculation of the YOLOv4 model, the backbone network Cross
Stage Partial Darknet53 (CSPDarknet53) of the YOLOv4 model is replaced by EfficientNet, and
convolution layer (Conv2D) is added to the three outputs to further adjust and extract the features,
which make the model lighter and reduce the computational complexity. Finally, the apple detection
experiment is performed on 2670 expanded samples. The test results show that the EfficientNet-B0-
YOLOv4 model proposed in this paper has better detection performance than YOLOv3, YOLOv4,
and Faster R-CNN with ResNet, which are state-of-the-art apple detection models. The average values of Recall, Precision, and F1 reach 97.43%, 95.52%, and 96.54%, respectively, and the average detection time per frame of the model is 0.338 s, which proves that the proposed method can be well applied in the vision system of picking robots in the apple industry.

Keywords: apple detection; YOLOv4; EfficientNet; picking robot; data augmentation

1. Introduction and Related Works

Apple is one of the most popular fruits, and its output is also among the top three in global fruit sales. According to incomplete statistics, there are more than 7500 types of known apples [1] in the world. However, experienced farmers are still the main force of agricultural production. Manual work consumes time and increases production costs, and workers who lack knowledge and experience will make unnecessary mistakes. With the continuous progress of precision agriculture technology, fruit picking robots have been widely used in agriculture. A picking system mainly contains two subsystems: the vision system and the manipulator system [2]. The vision system detects and localizes fruits and guides the manipulator to detach fruits from trees. Therefore, a robust and efficient vision system is the key to the success of the picking robot, but due to the complex background in orchards, there are still many challenges in this research.

For the complex background in orchards, the dense occlusion between leaves is one of the biggest interference factors in apple detection, which will cause false detection or missed detection of apples. Therefore, to make the model learn features better, the training data should contain more comprehensive scenes. However, due to the huge number of apples and the complex background, apple labeling is a very time-consuming and labor-intensive task. As a result, most datasets range from dozens to thousands of images [3–7] and cover a single scene. Data for occlusion scenes are even scarcer, which is not conducive to enhancing the detection ability of the model. To overcome this deficiency, we propose a leaf illustration data augmentation method to expand the
dataset. To further expand the number of the dataset and enrich the complexity of the
scene, common data augmentation methods such as mirror, crop, brightness, blur, dropout,
rotation, scale, and translation are also utilized in this paper. The experimental results
show that the model trained by traditional augmentation techniques and an illustration
augmentation technique proposed in this paper can well detect apples under complex
scenes in orchards.
In recent years, the research on apple detection under complex scenes in orchards
has also made some progress. Tian Y et al. [8] proposed an improved YOLOv3 model to
detect apples in different growth periods in orchards, with the F1 score of 0.817. Kang H
et al. [9] proposed a new LedNet model and an automatic labeling tool, with the Recall
and the accuracy at 0.821 and 0.853, respectively. Mazzia V et al. [10] used the YOLOv3-
tiny model to match the embedded device, which achieved the detection speed of 30 fps
without affecting the mean Average Precision (mAP) (83.64%). Kuznetsova A et al. [11]
proposed pre-processing and post-processing operations adapted to the YOLOv3 model; the detection results showed that the average detection time was 19 ms, while 7.8% of the detected objects were mistaken for apples and 9.2% of apples were not recognized. Gao F et al. [12] used Faster Regions with Convolutional Neural Networks (Faster R-CNN) to detect apples in dense-foliage fruiting-wall trees; the mAP was 0.879 and the average detection time was 0.241 s, which effectively detected apples under various occlusion conditions. Liu X et al. [13] proposed an apple detection method based on color and shape features; the Recall, Precision, and F1 score reached 89.80%, 95.12%, and 92.38%, respectively. Jia W et al. [14] combined ResNet
and DenseNet to improve Mask R-CNN, which reduced the input parameters, with the
Precision of 97.31% and the Recall of 95.70%.
For picking robots, the model should have fast and accurate detection performance.
The YOLO [15–18] models unify object classification and object detection into a regression
problem. The YOLO models do not use the area proposal process but directly use regression
to detect objects. Therefore, the detection process is effectively accelerated. Compared
with the YOLOv3 model, the latest YOLOv4 model achieves better accuracy while maintaining the same speed. However, the YOLOv4 model has not been widely used
for fruit detection. Due to the large size and computational complexity of the YOLOv4
model, it is a huge burden for low-performance devices. EfficientNet [19] uses a compound
coefficient to balance the three dimensions (depth, width, and resolution) of the model on
limited resources, which can maximize the accuracy of the model. Therefore, we utilize
EfficientNet to replace the backbone network CSPDarknet53 of the YOLOv4 model, and Conv2D is added to the three outputs to further extract and adjust the features, which makes the improved model lighter and improves its detection performance. The experimental
results show that the improved model can be well applied to the vision system of the
picking robot.
The rest of this paper is organized as follows. Section 2 introduces the dataset collec-
tion, common data augmentation methods, and the proposed illustration data augmenta-
tion method. Section 3 introduces YOLOv4, EfficientNet, and the improved EfficientNet-
B0-YOLOv4 model. Section 4 is experimental configuration, experimental results, and
discussion. Finally, the conclusions and prospects of this paper are described.

2. Dataset and Data Augmentation


2.1. Dataset
In this paper, we choose Red Fuji apple as the experimental object. Since there are
a large number of apple-related images on the Internet, we use the Python language to
develop an image crawler to download these images in batches, which reduces the cost of
data collection and improves the efficiency of data collection.
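As a rough illustration of this step, the batch download and the minimum-resolution filter described in the next paragraph can be sketched as follows. This is a minimal sketch assuming the requests and Pillow libraries and an already-collected list of image URLs; the search-engine crawling itself is not shown and the function name is hypothetical.

```python
# Hypothetical sketch of the batch-download step; assumes a list of image URLs
# has already been gathered by the crawler. Uses requests and Pillow only.
import io
import os
import requests
from PIL import Image

MIN_SIDE = 500  # keep only images whose width or height exceeds 500 px (Section 2.1)

def download_images(urls, out_dir="raw_apples"):
    os.makedirs(out_dir, exist_ok=True)
    kept = 0
    for i, url in enumerate(urls):
        try:
            resp = requests.get(url, timeout=10)
            img = Image.open(io.BytesIO(resp.content)).convert("RGB")
        except Exception:
            continue  # skip broken links or unreadable files
        if max(img.size) <= MIN_SIDE:
            continue  # enforce the minimum-resolution rule
        img.save(os.path.join(out_dir, f"apple_{i:05d}.jpg"), quality=95)
        kept += 1
    return kept
```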
The main sources of images are Baidu and Google. The search keywords are Red Fuji Apple, Apple Tree, Apple, etc. Firstly, to ensure image quality, the width or height of the crawled images is required to be greater than 500 pixels. Secondly, after manual screening, repetitive, fuzzy, and inconsistent images are removed. Finally, 267 high-quality images are obtained, of which 35 images contain only a single apple, 54 images contain multiple apples without overlapping, and 178 images contain multiple overlapping apples.

Then, these 267 images are expanded to 2670 images by using data augmentation methods. 1922 images are randomly assigned to the training set to train the detection model, 214 images to the validation set to adjust the model parameters, and 534 images to the test set to verify the detection performance. To better compare the performance of different models, images in the training set are converted to PASCAL VOC format. The completed dataset is shown in Table 1.

Table 1. The number of apple images generated by data augmentation methods.

Original  Mirror  Crop  Brightness  Blur  Dropout  Rotation  Scale  Translation  Illustration  Total
267       267     267   267         267   267      267       267    267          267           2670

2.2. Common Data Augmentation
In this paper, we use 8 common data augmentation methods to expand the dataset: mirror, crop, brightness, blur, dropout, rotation, scale, and translation. These operations are used to further simulate the complex scenes of apple detection in orchards. Figure 1b–i shows the effects of these common data augmentation operations.

Figure 1. Common data augmentation methods: (a) original image; (b) horizontal mirror; (c) crop processing; (d) brightness transformation; (e) gaussian blur processing; (f) dropout processing; (g) rotation processing; (h) scale processing; (i) translation processing.

2.2.1. Image Mirror
In orchards, the positions and directions of the apples are various. Therefore, we apply horizontal mirroring with 50% probability and vertical mirroring with 50% probability to the original image. Both can be used alone or in combination.

2.2.2. Image Crop
When many apples are stacked together, there will be various occlusion problems, and some apples will be blocked a little or more. Therefore, we randomly cut off 20% of the original image edges to simulate this scene.

2.2.3. Image Brightness


When the illumination is strong or weak, it will lead to apple color changes, which
cause huge interference for the detection. Therefore, to enhance the robustness of the
model, we randomly multiply the image with a brightness factor between 0.5 and 1.5.

2.2.4. Image Blur


Sometimes the image captured by the picking robot may be unclear or blurred, which
can also cause interference with the apple detection. Therefore, we use the gaussian blur
with a mean value of 2.0 and a standard deviation of 8.0 to augment the dataset.

2.2.5. Image Dropout


Apples often encounter the problem of diseases and insect pests, typically leaving
traces of numerous spots. Therefore, we randomly dropout the grid points between 0.01
and 0.1 on the original image, and the grid points are filled with black.

2.2.6. Image Rotation


Similar to the mirror method, rotation is to further increase the image viewing angles.
Therefore, we randomly rotate the original image by an angle between −30° and 30° to augment the dataset, and the space vacated by the rotation is filled with black.

2.2.7. Image Scale


Due to the different positions of the apples in orchards, there will be apples of different
sizes when capturing images. Therefore, to simulate this scene, we randomly multiply the
original image with a scaling factor between 0.5 and 1.5.

2.2.8. Image Translation


Similar to the crop method, translation is to further solve the occlusion problem of the
apple. Therefore, we randomly translate 20% of the edges of the original image, and the
space after translation is filled with black.
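For reference, the eight operations of Sections 2.2.1–2.2.8 can be collected into a single script. The following is a minimal sketch that assumes the imgaug library (the paper does not state which implementation was used); each entry reproduces one column of Table 1 with the parameters stated above, and the blur parameters are interpreted as a sigma range.

```python
# Illustrative sketch (not the authors' released code), assuming imgaug.
import imgaug.augmenters as iaa

AUGMENTERS = {
    "mirror":      iaa.OneOf([iaa.Fliplr(1.0), iaa.Flipud(1.0)]),        # 2.2.1
    "crop":        iaa.Crop(percent=(0, 0.2)),                           # 2.2.2: up to 20% of edges
    "brightness":  iaa.Multiply((0.5, 1.5)),                             # 2.2.3: factor 0.5-1.5
    "blur":        iaa.GaussianBlur(sigma=(0.0, 2.0)),                   # 2.2.4: sigma range assumed
    "dropout":     iaa.CoarseDropout((0.01, 0.1)),                       # 2.2.5: 1-10% grid points, black
    "rotation":    iaa.Affine(rotate=(-30, 30), cval=0),                 # 2.2.6: +/-30 deg, black fill
    "scale":       iaa.Affine(scale=(0.5, 1.5)),                         # 2.2.7: factor 0.5-1.5
    "translation": iaa.Affine(translate_percent=(-0.2, 0.2), cval=0),    # 2.2.8: 20% shift, black fill
}

def expand_dataset(images):
    """images: list of HxWx3 uint8 arrays; returns one augmented copy per method."""
    out = {}
    for name, aug in AUGMENTERS.items():
        out[name] = aug(images=images)  # bounding boxes can be augmented alongside via imgaug
    return out
```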

2.3. Illustration Data Augmentation


To enrich the background and texture of training images, in Mixup [20], CutMix [21],
Mosaic [18], several original images can be regularly mixed or superimposed to form a new
image. For example, four original images are evenly spread into four-square-grid images,
which will not increase the cost of training and inference but enhance the localization
ability of the model. For apple detection in orchards, the biggest interference to the apple
detection is the dense occlusion between leaves, and the complexity and randomness of
the background, which makes feature extraction difficult and leads to false detection or
missed detection of apples.
Therefore, to simulate the complexity of the scene and enrich the background of the
object, we propose a leaf illustration data augmentation method, which uses some leaf
illustrations to randomly insert on the original image. Firstly, collect 5 kinds of apple leaf
illustrations, as shown in Figure 2. The format of the illustration is PNG, only contains the
object itself, and the background is transparent, which helps protect the original image
after insertion and avoid adding the invalid background. Secondly, the illustration size
is 1/8 to 1/4 of the average value of all the ground-truth in the current image, and the
number of insertions is 5 to 15 times. Finally, the original dataset is expanded in batches by
using the illustration data augmentation method. Figure 3 shows the augmentation effects
of different leaf illustrations.
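A minimal sketch of this leaf illustration augmentation is given below. It is an assumed implementation, not the authors' code: it pastes 5 to 15 transparent leaf PNGs, each scaled to 1/8–1/4 of the mean ground-truth box size of the current image, at random positions using the alpha channel as the paste mask.

```python
# Assumed implementation of the leaf illustration augmentation (Section 2.3).
import random
from PIL import Image

def leaf_illustration_augment(image_path, gt_boxes, leaf_paths):
    """gt_boxes: non-empty list of (xmin, ymin, xmax, ymax) apple annotations."""
    img = Image.open(image_path).convert("RGBA")
    # mean side length of the ground-truth apple boxes in this image
    mean_side = sum(((x2 - x1) + (y2 - y1)) / 2 for x1, y1, x2, y2 in gt_boxes) / len(gt_boxes)
    for _ in range(random.randint(5, 15)):                     # 5-15 insertions
        leaf = Image.open(random.choice(leaf_paths)).convert("RGBA")
        side = max(int(mean_side * random.uniform(1 / 8, 1 / 4)), 1)  # 1/8-1/4 of mean GT size
        leaf = leaf.resize((side, side))
        x = random.randint(0, max(img.width - side, 0))
        y = random.randint(0, max(img.height - side, 0))
        img.paste(leaf, (x, y), mask=leaf)                     # alpha mask avoids invalid background
    return img.convert("RGB")
```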
Figure 2. Five kinds of apple leaves.

Figure 3. (a) Original image; (b) leaf data augmentation.
3. Methodologies
3.1. YOLOv4
YOLOv4 is a state-of-the-art real-time detection model that further improves on the YOLOv3 model. As a result, on the MS COCO dataset, without a drop in frames per second (FPS), the mean Average Precision (mAP) is increased to 44%, and the overall performance is significantly improved. There are three major improvements in the network structure: (1) CSPNet [22] is used to modify Darknet53 into CSPDarknet53, which further promotes the fusion of low-level information and achieves stronger feature extraction capabilities. As shown in Figure 4b, the original residual module is divided into left and right parts; the right part maintains the original residual stack, while the left part uses a large residual edge to fuse the low-level information with the high-level information extracted from the residual block. (2) Spatial Pyramid Pooling (SPP) [23] is used to add 4 max-pooling operations at the last output to further extract and fuse features; the pooling kernel sizes are (1 × 1), (5 × 5), (9 × 9), and (13 × 13), as shown in Figure 5. (3) The Feature Pyramid Networks (FPN) [24] structure is modified into the Path Aggregation Network (PANet) [25], that is, a bottom-up path is added to the top-down structure of FPN to further extract and merge feature information, as shown in Figure 6b.

Figure 4. Comparison between Darknet53 and CSPDarknet53. (a) Darknet53, (b) CSPDarknet53.

Figure 5. SPP.

Figure 6. Comparison between FPN and PANet. (a) FPN, (b) PANet.

3.2. EfficientNet
In recent years, the rapid development of deep learning has spawned various excellent convolutional neural networks. From the initial simple networks [26–28] to the current complex networks [29–32], the performance of the models has been getting better in all aspects. EfficientNet combines the advantages of previous networks and summarizes the improvement of network performance into three dimensions: (1) Deepen the network, that is, use skip connections to increase the depth of the neural network and achieve feature extraction through deeper layers; (2) Widen the network, that is, increase the number of convolution kernels to extract richer features; (3) Increase the input image resolution, so that the network can learn and express more, which is beneficial to improving accuracy. Then, a compound coefficient φ is used to uniformly scale and balance the depth, width, and resolution of the network and to maximize the network accuracy on limited resources. The calculation of the compound coefficient is shown in Equation (1):

depth: $d = \alpha^{\phi}$, width: $w = \beta^{\phi}$, resolution: $r = \gamma^{\phi}$,
s.t. $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$, $\alpha \geq 1$, $\beta \geq 1$, $\gamma \geq 1$, (1)

where d, w, and r are the coefficients used to scale the depth, width, and resolution of the network, and α, β, and γ determine how the limited resources are allocated to network depth, width, and resolution, respectively. According to the research of Tan M [19], the network parameters of EfficientNet-B0 are shown in Table 2, and the optimal coefficients of the network are α = 1.2, β = 1.1, and γ = 1.15. EfficientNet is mainly made up of Stem, 16 Blocks, Conv2D, GlobalAveragePooling2D, and Dense layers. The design of the Blocks is mainly based on the residual structure and the attention mechanism, and the other structures are similar to conventional convolutional neural networks. Figure 7 shows EfficientNet-B0, which is the baseline network of EfficientNet.

Table 2. EfficientNet-B0 network parameters.

Stage i   Operator                   Resolution   #Channels   #Layers
1         Conv3 × 3                  224 × 224    32          1
2         MBConv1, k3 × 3            112 × 112    16          1
3         MBConv6, k3 × 3            112 × 112    24          2
4         MBConv6, k5 × 5            56 × 56      40          2
5         MBConv6, k3 × 3            28 × 28      80          3
6         MBConv6, k5 × 5            14 × 14      112         3
7         MBConv6, k5 × 5            14 × 14      192         4
8         MBConv6, k3 × 3            7 × 7        320         1
9         Conv1 × 1 & Pooling & FC   7 × 7        1280        1

Figure 7. EfficientNet-B0.

3.3. EfficientNet-B0-YOLOv4
There are 8 versions of EfficientNet (B0–B7). As the version increases, the performance of the model gradually improves, but the corresponding model size and calculation amount also gradually increase. Although the original YOLOv4 model has excellent performance, its size and calculation amount are large, which is not suitable for application on some low-performance devices. To further improve the accuracy and efficiency of the YOLOv4 model while considering the model size, we replace the backbone network CSPDarknet53 of the YOLOv4 model with EfficientNet-B0 and choose P3, P5, and P7 as three different feature layers. Since the output sizes of the three feature layers of CSPDarknet53 are (256 × 256), (512 × 512), and (1024 × 1024), respectively, and the corresponding P3 is (40 × 40), P5 is (112 × 112), and P7 is (320 × 320), Conv2D is added to adjust the three output features in order to match the sizes and further extract the features. Figure 8 shows the network structure of EfficientNet-YOLOv4.

Figure 8. EfficientNet-YOLOv4.

The loss function remains the same as that of the YOLOv4 model and consists of three parts: classification loss, regression loss, and confidence loss. The classification loss and confidence loss remain the same as in the YOLOv3 model, but Complete Intersection over Union (CIoU) [33] is used to replace the mean squared error (MSE) to optimize the regression loss. The CIoU loss function is as follows:

$LOSS_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$, (2)

where $\rho(b, b^{gt})$ represents the Euclidean distance between the center points of the prediction box and the ground truth, and c represents the diagonal distance of the smallest closed area that can simultaneously contain the prediction box and the ground truth. Figure 9 shows the structure of CIoU.

Figure 9. CIoU.

The formulas of α and υ are as follows:

$\alpha = \frac{\upsilon}{1 - IoU + \upsilon}$, (3)

$\upsilon = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$. (4)
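For clarity, Equations (2)–(4) can be written directly as a regression loss term. The sketch below is a PyTorch-style illustration under the assumption that boxes are given in (x1, y1, x2, y2) format; it mirrors the formulas rather than the authors' implementation.

```python
# Hedged sketch of the CIoU regression term in Equations (2)-(4).
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) tensors of corner-format boxes."""
    # Intersection over Union
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between centers; c^2: squared diagonal of the
    # smallest box enclosing both boxes (Equation (2))
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term (Equations (3) and (4))
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    upsilon = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = upsilon / (1 - iou + upsilon + eps)

    return 1 - iou + rho2 / c2 + alpha * upsilon   # Equation (2)
```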

The total loss function of the YOLOv4 model is:

$LOSS = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha\upsilon$
$- \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{C}_{i} \log(C_{i}) + (1 - \hat{C}_{i}) \log(1 - C_{i}) \right]$
$- \lambda_{noobj} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} I_{ij}^{noobj} \left[ \hat{C}_{i} \log(C_{i}) + (1 - \hat{C}_{i}) \log(1 - C_{i}) \right]$
$- \sum_{i=0}^{S^{2}} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{p}_{i}(c) \log(p_{i}(c)) + (1 - \hat{p}_{i}(c)) \log(1 - p_{i}(c)) \right]$, (5)
where S² represents the S × S grids, each grid generates B candidate boxes, and each candidate
box gets corresponding bounding boxes through the network, finally, S × S × B bounding
boxes are formed. If there is no object (noobj) in the box, only the confidence loss of the
box is calculated. The confidence loss function uses cross entropy error and is divided
into two parts: there is the object (obj) and noobj. The loss of noobj increases the weight
coefficient λ, which is to reduce the contribution weight of the noobj calculation part. The
classification loss function also uses cross entropy error. When the j-th anchor box of the
i-th grid is responsible for certain ground truth, then the bounding box generated by this
anchor box will calculate the classification loss function.
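As an illustration of the backbone swap described in Section 3.3, the sketch below builds an EfficientNet-B0 feature extractor that returns P3, P5, and P7 and adjusts their channels with Conv2D. It assumes the timm library and 1 × 1 adjustment convolutions mapping to the 256/512/1024 channels expected by the YOLOv4 neck; neither choice is specified in the paper, so this is a sketch rather than the authors' implementation.

```python
# Hedged sketch of the EfficientNet-B0 backbone with Conv2D channel adjustment.
import timm
import torch
import torch.nn as nn

class EfficientNetB0Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature maps with 40, 112, and 320 channels, matching P3, P5, and P7
        # in Section 3.3.
        self.body = timm.create_model(
            "efficientnet_b0", features_only=True, out_indices=(2, 3, 4))
        # Conv2D layers adjust the channel counts to those produced by the
        # original CSPDarknet53 outputs (assumed 1x1 kernels).
        self.adjust = nn.ModuleList([
            nn.Conv2d(40, 256, kernel_size=1),
            nn.Conv2d(112, 512, kernel_size=1),
            nn.Conv2d(320, 1024, kernel_size=1),
        ])

    def forward(self, x):
        feats = self.body(x)                        # [P3, P5, P7]
        return [conv(f) for conv, f in zip(self.adjust, feats)]

# Example: x = torch.randn(1, 3, 416, 416)  # input size used in Section 4.1.1
# p3, p5, p7 = EfficientNetB0Backbone()(x)
```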

4. Experiments and Discussion


4.1. Experimental Details
4.1.1. Simulation Setup
The experimental environment of this paper is under Ubuntu 18.04 system, GPU is
Tesla K80 (12 GB), CPU is Intel XEON, and the models are all written with PyTorch. General
settings: the training epoch is 100, the learning rate for the first 50 epochs is 1 × 10−3 and
the batch size is 16, the learning rate for the next 50 epochs is 1 × 10−3 and the batch size is
8. Due to the relatively small RAM, the input image size of the YOLOv4 model is changed
from 608 × 608 to 416 × 416, which is the same as the original YOLOv3 model. Table 3
shows the basic configuration of the local computer.

Table 3. The basic configuration of the local computer.

Computer Configuration Specific Parameters


CPU Intel XEON
GPU Tesla K80
Operating system Ubuntu18.04
Random Access Memory 12 GB
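The training schedule of Section 4.1.1 can be summarized as a simple two-phase loop. The following sketch uses only the stated epoch counts, learning rate, batch sizes, and input size; the optimizer type (Adam) and the helpers make_loader and train_one_epoch are assumptions for illustration.

```python
# Hypothetical training schedule mirroring Section 4.1.1.
import torch

INPUT_SIZE = 416          # 416 x 416 input, as in the original YOLOv3 model
PHASES = [
    {"epochs": 50, "lr": 1e-3, "batch_size": 16},   # first 50 epochs
    {"epochs": 50, "lr": 1e-3, "batch_size": 8},    # next 50 epochs
]

def train(model, make_loader, train_one_epoch):
    # make_loader and train_one_epoch are assumed project-specific helpers.
    for phase in PHASES:
        optimizer = torch.optim.Adam(model.parameters(), lr=phase["lr"])
        loader = make_loader(batch_size=phase["batch_size"], image_size=INPUT_SIZE)
        for _ in range(phase["epochs"]):
            train_one_epoch(model, loader, optimizer)
```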

4.1.2. Evaluation Index


In the binary classification problem, according to the combination of the sample’s true
class and the model’s prediction class, it can be divided into 4 types: TP, FP, TN, and FN.
TP means true positive, that is, the actual is positive and the prediction is also positive;
FP means false positive, that is, the actual is negative but the prediction is positive; TN is
true negative, that is, the actual is negative and the prediction is also negative; FN means
false negative, that is, the actual is positive but the prediction is negative. The Precision
represents the proportion of samples that are true positive among all samples predicted
to be positive. The Recall represents the proportion of samples predicted to be positive
among the samples that are true positive. The AP value of each class is the area under the
P-R curve formed by the Precision and the Recall. The mAP value is the average of the AP
values of all classes. The F1 score is based on the harmonic average of the Precision and the
Recall. The formulas are defined as follows:

Precision:
$P = \frac{TP}{TP + FP}$. (6)

Recall:
$R = \frac{TP}{TP + FN}$. (7)

Average Precision:
$AP = \int_{0}^{1} p(r)\,dr$. (8)

Mean Average Precision:
$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i}$. (9)

F1 score:
$F1 = 2 \times \frac{P \times R}{P + R}$. (10)

4.2. Experimental Results
4.2.1. Influence of Data Augmentation Methods
To obtain better detection results, traditional data augmentation techniques and the proposed leaf illustration augmentation technique are adopted to expand the dataset. To evaluate the influence of the augmentation techniques on the EfficientNet-B0-YOLOv4 model, a control variate technique is adopted to get rid of one data augmentation approach at a time and obtain the F1 indicator in the absence of this method, as shown in Table 4. The false and missed detections are consistent with these methods, which shows the effectiveness and feasibility of our proposed method.
To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained by the illustration augmented images are compared with those of the model trained by the real images. From the detection results, it can be seen that the models trained by the illustration augmented images and by the real images can both accurately detect apples under no occlusion between leaves. However, dense occlusion between leaves has the greatest impact on the detection performance of the model: under dense occlusion, the model trained by the real images hardly detects apples, while the model trained by the illustration data
However,
no
leaves,
model
images model
detectaccurately
real
leaves.
between
apples,
bytheboth
betweenand
occlusion
under
detects
augmentation by
images
occlusion
the images
apples
trained
the
real
data hardly
model can
the
trained
whilethe
under
dense
However,
no
leaves,
model
images model
accurately
real
leaves.
between
apples,
bytheboth
and
between
detectsby
occlusion
under images
occlusion
the
trained
augmentationthe
real
hardly
model can
the
trained
whilethe
under
dense
However,
no
leaves,
model
images model
accurately
real
leaves. both
between
occlusion
apples,
by by
images
occlusion
the
the
detects can
trained
under
densethe
However,
trained
the
real
hardly
model
while accurately
real
leaves.
model
images both
between
apples,
by by
images
occlusion
the
the
detects can
the
under
dense
However,
trained
real
hardly
model
while accurately
real
leaves.
images
apples,
by both images
occlusion
the
the
detects
real can
under
dense
However,
hardly
model
while
imagesaccuratel
both
occlusio
the
apples,
the
detects can
und
dens
hardl
mode
whila
on method
the improved
on method
the model,
improvedon the
as shown
model,
improved inas Figure
shown
model, 10,
inasthe
Figure
shown
detection
10,in the
Figure
results
detection10, of
thetheresults
detection
model of the
results
model of the model
nd
mage.
en
o
by trained
the
real
apples From
augmented
tration
detect
leaves,
further
illustration
image.
under
between
by
the
apples
the
verify
trained
the
real
augmented
images
tration
no
detect
leaves,
model
illustration
detection
From augmented
image.
and
occlusion
under
between
the
by
the
apples
trained
the
trained
the
results,
real
augmented
images
the
tration
no
detect
model
illustration
augmented
detection
From
model
between
leaves,
image
image.
and
occlusion
under
between
influence by
by
the
it can
apples
trained
the
real
of
the
augmented
images
the
no
are
trained
tration
detect
illustration
2.05%,
results,
real
leaves.
leaves,
model
images
augmented
detection
From
be
modelcompared
between
image
image.
seen
andthe
it
occlusion
under
between
the by by
augmented
images
applesthe
However,
trained
the
real
hardly
proposed no
are
indicating
can
results,
that
detection
From
trained
the be
tration
detect
leaves.
leaves,
model
images
augmented
between
between
by
with
compared
the
model
real image
seen
and the
it
occlusion
under by
apples
detects
the
that
model
can
under
However,
trained
the
real
hardly
illustration no
are
results,
that
detect
model
the
detection
augmented
images
images
the
trained
the be
leaves.
leaves,
model
images between
between
with
compared
theimage
image
trained
model
realseen
both
and
occlusion
apples,
byunder
the
detects
itmodel
by can
apples
under
dense
trained
the
However,
trained
the
real
data hardly
whileno
are
results,
imagesthat
images
can
the bybe
trained
the model
data
leaves.
leaves,
model
images
the
between
with
compared
trained
the
model seen
occlusion
under
between
apples,
by
by
it
accurately
real
both
and by
occlusion
the
the
detects
augmentation
trained
the
augmented
model
illus-
canthat
images
can
the
under
dense by
However,
trained
the
real
hardly
model
whileno be
trained
the model
leaves.
leaves,
model
images
with
the
between
occlusion
apples,
by
by
trained
the
modelseen
accurately
real
bothby
occlusion
the
the
detects
trained
the
real
by
hardly
model
trained
the
model
illus- the
that
images
can by
trained
whilethe
under
dense
However,
model
images
model
the by
illustration
trained
the
accurately
real
leaves.
apples,
bytheboth
between
detects bymodel
images
occlusion
the
trained
real
hardly
model
trained
the
illus-
can
under
dense
whilethe
However,
images real
leaves.
apples,
by both
the
detects
by
data
byaccurately
trained
the images
occlusion
thereal
hardly
model
the
augmentation
illus-
can
under
dense
However,
while
byaccurately
images
apples,
the
both illus-
occlusion
the
the
detects can
under
dense
hardly
model
while accuratel
occlusio
the
apples,
the
detectsdens
mode
whilaa
d
d d
mage.by
on
by trained
illustration
method
the
trained
the
real
From by
improved
on
illustration
image.by
the trained
illustration
augmented
the
trained
the
real model, by
improved
illustration
detection
From augmented
image. by
the trained
illustration
augmented
as
the images
shown
results,
real model, by
illustration
augmented
detection
From image
image.
the
it cancan
trained
illustration
in augmented
as images
are
results,detect
Figure
shown by
augmented
detection
From
be
method compared
image
seen
the
it cancan
apples
10,trained
illustration
in augmented
images
the
are
results,
that
has detect
Figure
detection
be well
with by
detection
compared
theimage
seenit
aocclusion
greater can
apples
10,trained
illustration
the
model
can and
augmented
images
the
are
results,
that bedetect
improve
model well
results
with by
detection
compared
the
trained
seen
contribution itmodel
cancan
apples
trained
illustration
and
augmented
ofimages
trained
the
that
bybedetect
the
to improve
well
detec-
results
model
thewith
trained
the
seen by
model
by can
apples
modeltrained
illustration
illus-
enriching and
augmented
of images
trained
the
that
by detect
the improve
model
the well
detec-
by
model
by
trained
the
the can
apples
illustration
and
augmented
images
trained
the
model
illus- detect
theimprove
well
byaccurately
diversity detec-
by
trained
the of can
apples
and
the
illus-
the augmented
byimages
detect
theimprove
thewell
training detec-
can
apples
illus-and
images
set. detect
theimprove
well
detec-
Compared can
apples
and detect
the
improv
well
detec
ap
nen augmented
tration
apples
detect
leaves, under
between theaugmented
images
applestration
no
detect
leaves,
model
between and
occlusion
under augmented
images
trained
the the
applestration
no
leaves,
model model
detect
between
between
by and
occlusion
under augmented
images
apples
trained
the
real the
trained
tration
no
leaves,
model
images model
detect
leaves.
between
by and
between
occlusion
under by
augmented
images
apples
trained
the
real
hardlythe
trained
the
However,
no
detect
leaves,
model
images
between
bymodel
real
leaves.
betweenand
under
detects by
images
trained
the
real images
applesthe
trained
hardlythe
under
However,
no
leaves,
model
images
betweenmodel
apples,
byreal
leaves. both
betweenand
occlusion
under
the
detects by
images
trained
the
real
hardlycan
the
trained
under
dense
whilethe
However,
no
leaves,
model
images model
accurately
real
leaves.
apples,
bytheboth
between
occlusion
detects by
images
occlusion
the can
trained
the
under
dense
However,
trained
the
real
hardly
model
while accurately
real
leaves.
model
images both
between
apples,
by byimages
occlusion
the
the
detects can
the
under
dense
However,
trained
real
hardly
model
while real
leaves.
images
apples,
by both images
occlusion
the
the
detects
real can
under
dense
However,
hardly
model
while
images accurately
apples, both
occlusion
the
the
detects can
under
dense
hardly
model
while accurately
apples,occlusion
the
the
detects dense
model
while occlusio
apples,
the mode
whil
d d by
sults.trained
illustration
ontrained
the
tion
It can by
improved
results.
be trained
illustration
augmented
seen
tion model,
It can by
results.
that be trained
illustration
the augmented
as
seen
tionimages
shown by
Itillustration
can
results.
that be can
thetrained
illustration
in augmented
seen
tion
Itimages
detect
Figure by
illustration
augmentation
can
results.
that be can
apples
10,
thetrained
illustration
augmented
seen
tion
Itimages
the detectwell
illustration by
detection
augmentation
can
results.
that method
be can
apples
thetrained
illustration
and
augmented
seen
tion images
detect
improve
well
results
Itillustrationby
augmentation
can
enriches
results.
that method
be can
apples
thetrained
illustration
and
augmented
of
seen
tionimages
detect
the improve
well
detec-
Itillustration
the by
model
augmentation
can
enriches
results.
that
leaf
method
be can
apples
theillustration
and
oc-augmented
seen
tionimages
detect
theimprove
well
detec-
Itillustration
the
augmentation
can
enriches
results.
that
leaf
method
be can
apples
the and
oc-augmented
seen images
detect
theimprove
well
detec-
Itillustration
the
augmentation
can
enriches
that
leaf
method
be can
apples
the and
oc-
seen images
detect
theimprove
well
detec-
illustration
the
augmentation
enriches
that
leaf
methodcan
apples
the and
oc- detect
theimprove
well
detec-
illustration
the
augmentation
enriches
leaf
methodapples
and
oc-thethe
improv
well
detec
augmenta
enriches
leaf
methooca
nd
mage.
en
by the
real
apples From
augmented
tration
detect
leaves,
illustration
image.
under
between
by
the
apples
the
the
real
augmented
images
tration
no
detect
leaves,
model
illustration
detection
From augmented
image.
and
occlusion
under
between
the results,
augmented
images
the
apples
trained
the tration
no
detect
augmented
detection
From
model
between
leaves,
model
image
andthe
it
occlusion
byunder
between
can
augmented
images
apples
trained
the
real the
no
are
results,
detection
be
trainedmodel
with
detect
leaves.
leaves,
model
images
compared
between
image
seen
andit
occlusion
under
between
by by can
images
common
applesthe
However,
trained
the
real
hardly no
are
results,
that
trained
the be
leaves.
leaves,
model
imagesbetween
between
by
with
compared
the
seen
model
realand
datait
occlusion
under
detects by the
model
can
images
underthat
the
trained
the
However,
trained
the
real
hardlyno bemodel
model
real
augmentation
leaves.
leaves,
model
images between
with
the
trained
seen
both
occlusion
apples,
by the
detects
model
by
images
under
dense
trained
the
that
can
However,
trained
the
real
hardly
while
by
trained
the model
leaves.
model
images
the
trained
the
between
apples,
by
by
accurately
real
both
methods, by
occlusion
the
the
detects
trained
the
model
illus-
images
can
the
under
dense by
However,
trained
real
hardly
model
while leaves.
images
apples,
by
by
trained
the
accurately
real
both
occlusion
the
the
detects
real
the
illus-
images
illustration can
under
dense by
However,
hardly
model
while
images
apples,
the
accurately
both
data
occlusion
the
the
detects
illus-
can
under
dense
hardly
model
while accurately
augmentation
apples,occlusion
the
the
detects dense
model
while will
apples, generate
occlusion
the model
while the mode
d by
nsults. trained
illustration
tion
Itincan by
results.
be trained
illustration
augmented
seen
tion
Itincan by
results.
that be trained
illustration
the augmented
seen
tionimages by
Itillustration
can
results.
that be can
thetrained
illustration
augmented
seen
tion
Itimages
detect by
illustration
augmentation
can
results.
that be can
apples
thetrained
illustration
augmented
seen
tion images
detectwell
Itillustrationby
augmentation
can
results.
that method
be can
apples
thetrained
illustration
and
augmented
seen
tion images
detect
improve
well
Itillustrationby
augmentation
can
enriches
results.
that method
be can
apples
theillustration
and
augmented
seen
tionimages
detect
the improve
well
detec-
Itillustration
the
augmentation
can
enriches
results.
that
leaf
method
be can
apples
the and
oc-augmented
seen images
detect
theimprove
well
detec-
Itillustration
the
augmentation
can
enriches
that
leaf
method
be can
apples
the and
oc-
seen images
detect
theimprove
well
detec-
illustration
the
augmentation
enriches
that
leaf
method can
apples
the and
oc- detect
theimprove
well
detec-
ofillustration
the
augmentation
enriches
leaf
methodapples
and
oc- the
the improve
well
detec-
augmentation
enriches
leaf
method and
oc- the
the improv
detec
enriches
leaf
methooc
nd
mage.
en
by real
the
scene
clusion
apples From
augmented
tration
detect
leaves,
illustration
the
image.
under
between
scene
thetraining
apples
the
clusion
detection
no From
augmented
images
tration
detect
leaves,
model
between
augmented
the
images,
and
occlusion
under scene
thetraining
clusion
augmented
images
trained
the the
apples
no
leaves,
model
in
results,
detection
model
detect
between
between
by
image
provides
the
images,
anditscene
occlusion
under can
images
apples
trained
the
real the are
training
clusion
richer
trained
no
leaves,
model
images
in
results,
be
model
leaves.
between
by
compared
provides
the
images,
seen
and
between
new it
occlusion
under scene
byfeatures
training
can clusion
that
the
trained
the
However,
no
background
trained
the
real
hardly
leaves,
model
images
richer
in
be
by
with
provides
the
the
model
real
leaves.
between for
images,
seen
occlusion
detects
scene
by the
features
training
the
model
images
and
trained
the
real
clusion
that
trained
hardlythe
under
However,
model
model
richer
learning
inprovides
real
leaves.
texture
images
apples,
by
the
the
both
between
the
detects
for
images,
trainedscene
model
by
images
trained
real for
hardly
trained
features
training
the
clusion
can
under
dense
while
of
the
However,by
the
images
richer
learning
in
the
provides
the
trained
the by
for
images,
model,
scene
accurately
real
leaves.
apples,
bytheboth
image,
detects
real
the
features
training
the
images
occlusion
the hardly
model
clusion
illus-
can
under
dense of
by
However,
while
richer
learning
in
the
which
images
provides
the
the for
images,
model,
scene
accurately
apples,
theboth features
occlusion
the
detects is training
the
illus-
can
under
dense
of
hardly
model
while
of richer
learning
in
the
great provides
the
for
images,
model,
accurately
apples, occlusion
the
the help
detects
features
training
the
dense
model
while to richer
learning
the
provides
for
images,
model,
features
occlusion
enhance
apples,
the model
the
while
of richer
learning
the the
provides
for
model,
features
robustness
the model
the
ofricher
learning
the for
mode
feat
theo
d by
sults.trained
illustration
tion
Itincan by
results.
be trained
illustration
augmented
seen
tion
Itincan by
results.
that be trained
illustration
the augmented
seen
tionimages by
Itillustration
can
results.
that be can
thetrained
illustration
augmented
seen
tionimages
detect by
Itillustration
augmentation
can
results.
that be can
apples
thetrained
illustration
augmented
seen
tion images
detectwell
Itillustrationby
augmentation
can
results.
that method
be can
apples
the illustration
and
augmented
seen
tion images
detect
improve
well
Itillustration
augmentation
can
enriches
results.
that method
be can
apples
the and
augmented
seen images
detect
the improve
well
detec-
Itillustration
the
augmentation
can
enriches
that
leaf
method
be can
apples
the and
oc-
seen images
detect
theimprove
well
detec-
illustration
the
augmentation
enriches
that
leaf
method can
apples
the and
oc- detect
theimprove
well
detec-
illustration
the
augmentation
enriches
leaf
method apples
and
oc- the
the improve
well
detec-
augmentation
enriches
leaf
method and
oc- the
the improve
detec-
enriches
leaf
methodoc- the
the detec
enriches
leaf oc
nn
mage.
us
en
scene
clusion
helps
andFrom
augmented
tration
apples
detect
leaves,
the
thus
to
under
between the
scene
thetraining
clusion
detection
improve
helps
and
augmented
images
applesno
detect
leaves,
model
between
the
thus
to images,
and
occlusion
under scene
training
clusion
images
trained
the the
apples
no
leaves,
model
in
results,
improve
learning
helps
and provides
model
betweenthe
thus
to images,
andit
occlusion
under
between
by
scene
trained
the
real
training
can
improve
ability clusion
learning
helps
and
the richer
trained
no in
be provides
and
model
leaves.
of
leaves,
model
images
the
between images,
seen
thus
to
occlusion
by model
scene
byfeatures
training
improve
ability clusion
that
detection
learning
helps
and
trained
the
However,
trained
the
real
hardly
richer
inprovides
real
leaves.
model
imagesbetween
detection,
by
the
the
and
thus
to for
images,
detects
scene
byfeatures
training
the
model
results
improve
ability
imagesclusion
detection
learning
helps
and
the
under
However,
trained
real
hardly
richer
learning
inprovides
of
real
leaves.
especially
images
apples,
by
the
both
the
detects
for
images,
trained
and
thus
to the
real
scene
features
training
the
results
improve
ability
images clusion
can
under
dense of
detection
However,
under
hardly
while
by
learning
helps
model.
and
images
richer
learning
in
the
apples,
provides
of
and the
the
thus
to for
images,
model,
thescene
features
training
the
illus-
results
improve
ability of
detection
accurately
both
occlusion
the
the the
detects learning
helps
model.
under
denseand
can richer
learning
in
the
provides
of
and
interference
hardly
model
while apples,
the
thus
to for
images,
model,
the features
training
the
results
improve
ability of
detection
accurately
occlusion
the
the
detects learning
helps
model.
dense
model of
while
richer
learning
the
provides
of
and
dense
apples, to for
images,
themodel,
the features
results
improvethe
ability of
detection
occlusion
the learning
model.
leaves.
model
while
richer
learning
the
ofprovides
and for
model,
the
Due
the the features
the
results
ability of
detection richer
learning
learning
model.
tooc-
model
the
of
and
thethe
for
model,
the features
the
results
abilityof
detection
model.
complexity
learning
the
of
and for
mode
the the
result
detec
mo o
d by
sults.trained
illustration
tion
Itincan by
results.
be trained
illustration
augmented
seen
tion
Itincan by
results.
that be trained
illustration
the augmented
seen
tionimages by
Itillustration
can
results.
that be can
thetrained
illustration
augmented
seen
tionimages
detect by
Itillustration
augmentation
can
results.
that be can
apples
theillustration
augmented
seen
tion images
detectwell
Itillustration
augmentation
can
results.
that method
be can
apples
the and
augmented
seen images
detect
improve
well
Itillustration
augmentation
can
enriches
that method
be can
apples
the and
seen images
detect
the improve
well
detec-
illustration
the
augmentation
enriches
that
leaf
method can
apples
the and
oc- detect
theimprove
well
detec-
illustration
the
augmentation
enriches
leaf
method apples
and
oc- thetheimprove
well
detec-
augmentation
enriches
leaf
method and
oc- the
the improve
detec-
enriches
leaf
method the detec-
enriches
leaf oc- the leaf oc
nn us
en
scene
clusion
helps
and
augmented
apples
detect
leaves,
the
thus
to
under
between
scene
training
clusion
improve
helps
and
images
apples
the no
leaves,
model
the
thus
to images,
and
occlusion
under
between
scene
training
trained
thebe
clusion
improve
learning
helps
and
the
no inprovides
model
between
leaves,
model
the
thus
to images,
occlusion
by
scene
trained
the
real
training
improve
ability clusion
learning
helps
and richer
trainedinprovides
and
leaves.
model
images
the
thus
to
between
by
and
images,
scene
byfeatures
training
improve
ability clusion
detection
learning
helps
and
the
However,
trained
real
hardly
richer
inprovides
and
real
leaves.
images
diversity by
the
thus
to for
images,
detects
ofreal
scene
features
training
the
results
improve
ability
images
underclusion
detection
learning
helps
and
However,
hardly
images
leaves,
richer
learning
in
apples,
provides
of
and the
thus
to
both
the
detects
the
for
images,
the scene
features
training
the
results
improve
ability of
detection
learning
helps
model.
under
dense
hardly
while and
can richer
learning
in
the
provides
of
and
apples,
proposed
the
thus
to for
images,
model,
the features
training
the
results
improve
ability of
detection
accurately
occlusion
the
the
detects learning
helps
model.
dense
model
while
richer
learning
the
provides
of
and
apples,
illustration
to for
images,
model,
the
the features
the
results
improve
ability of
detection
occlusion
the learning
modelmodel.
while
data
richer
learning
the
provides
of
and
the
for
themodel,
the features
resultsthe
ability of
detection
learning
model
augmentation
model. richer
learning
the
of
and for
model,
the features
the
results
ability of
detection
model.
method
learning
the
of
and for
model,
the the
resultsof
detection
combined
model.learning
the
of mode
the
result
mo o
d
nus by
sults.trained
sceneillustration
tion
Itincan
clusion the by
results.
be
scenetrained
illustration
augmented
seen
tion
Itincan
training
clusion the by
results.
that
images,
scenetrained
illustration
the augmented
seen
tionimages
training
clusion
inprovides
the by
Itillustration
can
results.
that be
images,
scenecan
theillustration
augmented
seen
tion
trainingimages
detect
Itillustration
augmentation
can
clusion results.
that
richer
inprovides
the be
images,
scenecan
apples
the augmented
seen
features
training images
clusion detect
richer
in well
Itillustration
augmentation
can
that method
provides
the
forbe
images,
scenecan
apples
the and
seen
features
training
the images
detect
richerimprove
learning
in well
illustration
augmentation
enriches
that method
provides
thefor
images, can
apples
the and
features
training
theof detect
the
richer improve
learning
the well
providesdetec-
illustration
the
augmentation
enriches
leaf
method
for
images,
model, apples
and
oc-
features
theof thetheimprove
well
detec-
augmentation
enriches
richer
learning
the leaf
method
provides
for
model, and
oc-
features
theof thetheimprove
detec-
enriches
richer
learning
the leaf
method
for
model, oc-
features
the the
the detec-
enriches
oflearning
the leaf
for
model, oc-
the the
oflearning
the leaf
model, oc-
of the mode
F1
en helps
and
apples
comparison
Table
leaves,
between thus
to
under
4. F1improve
thebe helps
and
no
comparison
Table
between
leaves,
model thus
to
occlusion
4. F1 improve
learning
helps
and
different
comparison
Table
between
trained
thebe model thus
to
between
bydata
4. F1 improve
ability
learning
helps
different
trained
real and and
leaves.
augmentation
comparison
Table
between
imagesby thus
to
data
4. F1 improve
ability
realdetection
learning
helps
and
However,
different
augmentation
comparison
Table
between
hardly methods.
images and
thus
to
data
4. F1 results
detectsimprove
ability
detection
learning
helps
under
different and
augmentation
comparison
Table
between
hardlymethods.
apples,of
and
thus
to
the
data
4. the
F1
detects results
improve
ability
detection
learning
helps
model.
dense
different
augmentation
comparison
Table
between
while of
and
methods.
apples, tothe
the
results
improve
ability
detection
occlusion
data
4.
the F1 learning
model.
different
augmentation
comparison
Table
between
model
while of
and
methods.
data
4.
the the
the
F1 results
ability
detection
learning
model.
different
augmentation
comparison
between
model of
and
methods.
data the
results
ability
detection
model.
different of
augmentation
between and
methods.
datathe
results
detection
model.
different of
augmentation
methods.datathe
results
model. of
augmentation
methods. the mo
me
d by trained
illustration by trained
illustration
augmented by illustration
augmented
images canaugmented
images
withdetectcommon can
applesimages
detectwell
data can
apples
and detect
augmentation improve
well apples
and the improve
methods well
detec- and
to theimprove
detec-
generate the
ofapple detec-
images can make
F1
sults.
nus
en
tion
scene
helps
andItincan
clusion
comparison
Table
leaves,
results.
the
4.thus
toF1
the
sceneseen
tion
improve
helps
andItincan
training
clusion
comparison
Table
between
model
results.
that
the
4.thus
toF1
the
images,
sceneseen
tion
improve
different
Itillustration
training
learning
helps
and
comparison
Table
between
trained by
can
clusion
in results.
that
provides
data
4. the
thus
toF1
be the
images,
scene
abilityseen
Itillustration
training
improve
learning
helps
different
real and
augmentation
comparison
Table
between
images
augmentation
can
clusion
andthat
richer
inprovides
the
thus
to
data
4. F1
be the
images,
scene
improve
abilityseen
features
training
detection
learning
helps
different and illustration
augmentation
comparison
Table
between
hardly methods.
augmentation
andthat
richer
in thus
to
data
4. F1
method
provides
the
for the
images,
features
training
the
results
improve
ability
detection
learning
helps
different
detects
illustration
augmentation
comparison
Table
between
methods.
apples,
augmentation
ofenriches
richer
learning
and
tothe
data
4. the
F1
method
provides
for
images,
features
the
results
improve
ability of
detection
learning
model.
different
the
augmentation
comparison
Table
between
while
augmentation
of
and
methods.
enriches
richer
learning
the
data
4.
the
leaf
themethod
provides
for
model,
the
F1 results
abilityoc-
features
theof
detection
learning
model.
different
the
augmentation
comparison
between
model ofenriches
richer
learning
the
and
methods.
data
leaf
method
for
model,
the
results
abilityoc-
features
the
detection
model.
different
the
augmentation
between ofenriches
learning
the
and
methods.
data
leaf
for
model,
the
results oc-
the
detection
model.
different
the
oflearning
the
of
augmentation
methods.
data
leaf
model,
the
resultsoc-
model. ofup
of the
augmentation
methods. the for
model, the
model.
methods.
d
n by
sults.trained
sceneillustration
tion
Itincan
clusion the
Data by
results.
be
scene illustration
augmented
seen
tion
Itincan
training
clusion results.
Augmentationthat
Datathe be the
images,
scene augmented
seenimages
Itillustration
training can
clusion
in that
provides
Augmentationthe be
images,
Methods
Data scenecan
the
seen
trainingimages
detect
illustration
lackaugmentation
that
richer
in of
provides
Augmentation the
images,
Methods
Data can
apples
features
training detect
the illustration
training richer well
augmentation
method
images,
provides
Augmentation for
Methods
Data images,apples
and
features
the richerimprove
learning
Augmentation well
augmentation
enriches
greatly method
provides
Methods
Data for and
features
theof the
reduce the
richer improve
learning
the
Augmentation
Methods
Data detec-
enriches
leaf
method
the
for
model, oc-
features
theof thethe
workload
learning
the
Augmentation detec-
enriches
leaf
for
model,
Methods
Data oc-
of
theof the
labeling,
learning
the
Augmentation leaf
Methods
Data model, oc-
and
of theachieve
Augmentation model,
Methods better
Methods results
F1us helps
and
comparison
Table thus
to improve
helps
and
4. F1 comparison
Table
between thus
to
4. F1 improve
learning
helps
and
different
comparison
Table
between thus
to
data
4. F1 improve
ability
learning
helps
different and
augmentation
comparison
Table
between and
thus
to
data
4. F1 improve
ability
detection
learning
helps
different
augmentation
comparison
Table
between
methods.and
tothe
data
4. F1 results
improve
ability
detection
learning
different
augmentation
comparison
Table
between
methods. of
and
datathe
theresults
ability
detection
4. improve
F1 learning
model.
different
augmentation
comparison
between of
and
methods.
data the
results
ability
detection
model.
different
augmentation
between of
and
methods.
data the
results
detection
model.
different
augmentationof
methods. the
results
model.
data augmentation of the
methods. methods. model.
d
nus by
sults.
sceneillustration
tion
Itincan
clusion results.
the be
scene augmented
seen
Itincan
training
clusion that
the be the
images,
sceneseenimages
illustration
trainingin that
provides
the
images, can
the
training detect
illustration
augmentation apples well
augmentation
method and enriches
method thethe detec-
enriches
leaf oc- the leaf oc-
F1 helps

and

comparison
Table
ustration Data
thus
4.to Augmentation
F1improve
helps

and

comparison
Table
between
Illustration Data
thus
4.toF1  Augmentation
improve
learning
helps


and
different
comparison
Table
between
Illustration Methods
Data
thus
to
data
4. F1 improve
abilityinricher
learning
helps
augmentation
 

different
comparison
Table
between
Illustration provides
model
Augmentation
and
data
4. F1images,
Methods
Data
tothe  features
improve
ability
detection
learning


different richer
augmentation
comparison
Table
between
Illustrationmethods. provides
detection.
Augmentation
and
data
4. F1for
Methods
Data
the features
the
results


different richer
ability
detection
learning learning
Augmentation
augmentation
comparison
between
Illustrationmethods. Methods
Data
of
and
data for
the  features
 
the
results
ability

differentoflearning
detection
model. the
Augmentation
augmentation
between Methods
methods.Data
of
and
data for
model,
the 
the
results

different oflearning
detection
model. the
Augmentation
augmentationof
methods.
data model,
Methods
Datathe
results of the
Augmentation
model.
augmentation

 Methods
of
methods. model,
the model. Methods
methods.   
sults.
nus scene
helps
andItincan
clusion the
Data
thus
to be
sceneseen
training
improve
helps
and in
Augmentationthat
Datathe
thus
to the illustration
images,
training
improve
learning
helps provides
Augmentation
to images,
Methods
Data
theimprove
ability
learning augmentation
richer
provides
Augmentation
Methods
Data
and the features richer
Augmentation
ability
detection
learning Methods
Data
and method
for features
the
Augmentation
results
ability
detection enriches
learning
Methods
Data
of
and for
the theof the
learning
the
Augmentation
results
detection
model. Methods
Data
of leaf
model,
the oc-
of the
Augmentation
results
model. model,
Methods
of the model. Methods
F1 

comparison
Table
ustration


anslation 4. F1 

comparison
Table
between
Illustration
 Translation 
 4. F1  


different
comparison
Table
between
Illustration
 Translation  data
4. F1   

different
Illustration
 Translation  
augmentation
comparison
Table
between
 data
4. F1 
  

different
augmentation
comparison
between
Illustration
 

Translationmethods.
 data  

different
 
augmentation
between

Translation
methods.
 data   

different
augmentation
 
methods.
 data 
  

augmentation
 
methods. 
 
 

 methods. 
 
  
 
 

nus
F1 scene
helps

and
 in
comparison
Table
ustration 4. the
Data
thus
toF1 training
Augmentation
improve
helps


comparison
Table
between
Illustration Data
4.toF1images,
theimprove
learning



different
comparison
Table
between
Illustration provides
Augmentation
Methods
Data
data
4. the
F1   
Table
ability
learning

different richer
Augmentation
augmentation
comparison
between
Illustration and
data F1
 features
 comparison
4.different
Methods
Data Augmentation
ability
detection


augmentation
between
methods.and
data for
Methods
Data  the
between
results
detection


different learning
Augmentation
augmentation
methods. Methods
Data
of
data thedifferent
  
 of the
Augmentation
results
model.
augmentation
methods. data
Methods
of model,
the augmentation
model.
 Methods
methods.   methods.    


anslation


Scale 


Translation


Scale
 
 

Translation





Scale 
 
 

Translation


Scale

 
 
 

Translation


Scale

 

 




Scale
 
 
 



 
 
 


 
 
 

 
 
  
 
 

F1us helps


comparison
Table
ustration

 Data
4. Augmentation
toF1improve


comparison
Table
between
Illustration

 Data
4. the
F1  Augmentation
learning



different
comparison
between
Illustration
 

 Methods
Data
data   Augmentation
ability


different
 
augmentation
between
 
 Methods
Data
and
data   Augmentation
detection


different
augmentation
 
methods.
 Methods
Data
data 
 Augmentation
results

augmentation
 
methods. Methods
of the 
 
 model.


 Methods
methods. 
 
  
  
 
 

anslation




Rotation
Scale
 4. Translation



Scale


Rotation
 Data Translation

 




Scale

 Methods
Rotation Translation
  

Scale
augmentation
Augmentation


Rotation Methods   

Scale
augmentation
Augmentation


Rotation Methods 






Data
Rotation 
 Methods
Augmentation 




 Methods  
 

  
 
  
 
 

F1 

comparison
Table
ustration

 DataF1 Augmentation


comparison
between
Illustration

 
 Augmentation





different
between

 Data
data   

different
  
 Data
data 
 
 


methods. 
 methods. 
   
  
 
 

anslation



Rotation
Scale
 Translation



Scale


Rotation
 Translation

 




Scale


Rotation 
 
 

Scale


Rotation

 
 
 



Rotation

 

 



 
 
 


 
 
 

 
 
  
  

Dropout
F1 

comparison
ustration

 Data Augmentation

Dropout

Illustration
between

 Data 
Dropout
Augmentation


different
 

 Methods
Data
data 
 
Dropout

Augmentation
augmentation
 
 Methods 
 
Dropout
 

methods.Methods 
Dropout
   
   
   
    
anslation


Rotation
Dropout


Scale

 Data

Translation


Scale


Rotation

Dropout



 Data
Augmentation
Translation  


Scale
Augmentation



Rotation
 

Dropout








Rotation
Methods 
Dropout






 Methods 

 

Dropout




















 

















 









 







 






Blur
anslation


Scale

 

Blur


Scale



 
 

Blur
 






 

 
 

Blur
 



 

 

 
Blur



 


Blur

 
 

 
  

  


Rotation
Dropout
 DataRotation
Dropout
 Brightness 
Scale

Augmentation
 BrightnessRotation
 

Dropout



 Methods  
Dropout




  



 


  

  
 


Blur


rightness
Scale

 


Blur




 

 
Blur



 Brightness


 

Blur

 



 Brightness 

Blur

 


 Brightness 


 
 

  

  
Rotation
Dropout

 Rotation

Dropout

Rotation

  

Dropout



   




   



  


   




Blur


rightness

 


Blur


Brightness


 

 
Blur


Brightness


 
 
Blur
 

Brightness

 
  
Brightness

 
 


 
 
 
Crop
Rotation
Dropout



Blur


 Brightness
Crop

Dropout
Dropout




Blur



 


Crop




Blur







 
Crop









 
Crop






Crop






rightness


Crop



Mirror 

Crop

Mirror


 Brightness
 
 


Crop
 

Mirror


 
 
Brightness

Crop

Mirror


 
 
Crop

Mirror
 

Mirror
 
Dropout

Blur



Blur


Blur


 




 


 

 
rightness


Crop



Mirror 
Brightness

Crop

Mirror




Brightness 
Brightness
 
Crop
 

Mirror

 
 
Crop

Mirror
 
 
Mirror 
 


Blur


95.00%
95.62%
96.54%
F1
rightness 

94.87%
95.00%


95.62%
96.54%
F1  94.43%

94.87%
F1

95.00%
95.62%
96.54%

 94.28%
94.43%
94.87%

95.00%
95.62%
96.54%
F1
 94.01%
94.28%
94.43% 
94.87%
95.00%
95.62%
96.54%
F1
 94.01%
93.52%
94.28%
94.43%
94.87%
95.00%
95.62%
96.54%
F1 92.81%
94.01%
93.52%
94.28%
94.43%
94.87%
95.00%
95.62%
96.54% 92.81%
90.76%
94.01%
93.52%
94.28%
94.43%
94.87%
95.00%
95.62% 92.81% 90.76%
94.01%
93.52%
94.28%
94.43%
94.87%
95.00% 92.81% 90.76%
94.01%
93.52%
94.28%
94.43%
94.87% 92.81% 90.76%
94.01%
93.52%
94.28%
94.43% 92.81% 90.76%
94.01%
93.52%
94.28% 92.81%
90.76%
94.01%
93.52%


Crop



 Brightness

Mirror

95.00%
95.62%
96.54%
F1 
Crop

Mirror
Crop





94.87%
95.00%


95.62%
96.54%
F1 
Crop
 
Mirror
94.43%

94.87%


 94.28%
95.00%
95.62%
96.54%
F1  
Mirror
94.43%
94.87%
95.00%

95.62%

 94.01%
96.54%
F1 94.28%
94.43%
94.87%
95.00%

95.62% 94.01%
96.54%
F1 93.52%
94.28%
94.43%
94.87%
95.00%
95.62%
96.54% 92.81%
94.01%
93.52%
94.28%
94.43%
94.87%
95.00%
95.62% 92.81%
90.76%
94.01%
93.52%
94.28%
94.43%
94.87%
95.00% 92.81%
90.76%
94.01%
93.52%
94.28%
94.43%
94.87% 92.81%
90.76%
94.01%
93.52%
94.28%
94.43% 92.81%
90.76%
94.01%
93.52%
94.28% 92.81%
90.76%
94.01%
93.52% 92.81%
90.76%
93.52%


Crop
rightness

 Crop



 


 94.28% 

Mirror


95.00%
95.62%
96.54%
F1

 
Mirror
Mirror

94.87%
95.00%
95.62%
96.54%
F1
 
Mirror
94.43%
94.87%
95.00%
95.62%
96.54%
F1
 94.43%F1 94.01%
94.87%
95.00%
95.62%
96.54% 94.28%
94.43%
94.87%
95.00%
95.62%
96.54% 94.01% 93.52%
94.28%
94.43%
94.87%
95.00%
95.62% 92.81% 94.01%
93.52%
94.28%
94.43%
94.87%
95.00% 92.81% 90.76%
94.01%
93.52%
94.28%
94.43%
94.87% 92.81% 90.76%
94.01%
93.52%
94.28%
94.43% 92.81% 90.76%
94.01%
93.52%
94.28% 92.81%
90.76%
94.01%
93.52% 92.81%
90.76%
93.52% 92.81%
90.76%

Crop
 
 
Mirror

95.00%
95.62%
96.54%
F1


Mirror
F1
94.87%
95.00%
95.62%
96.54%
F1 F1 94.28%
94.43%
94.87%
95.00%
95.62%
96.54%
96.54% 94.43%
94.87%
95.00%
95.62%
96.54% 94.01%
95.62% 94.28%
94.43%
94.87%
95.00%
95.62% 94.01%
95.00% 93.52%
94.28%
94.43%
94.87%
95.00% 92.81%
94.87% 94.01%
93.52%
94.28%
94.43%
94.87% 92.81%
94.43% 90.76%
94.01%
93.52%
94.28%
94.43%
94.28% 92.81%
90.76%
94.01%
93.52%
94.28%
94.01% 92.81%
90.76%
94.01%
93.52%
93.52% 92.81%
90.76%
93.52%
92.81% 92.81%
90.76%
90.76% 90.76%


Mirror 
95.00%
95.62%
96.54%
F1
 94.87%
95.00%
95.62%
96.54%
F1 94.43%
94.87%
95.00%
95.62%
96.54% 94.28% 94.43%
94.87%
95.00%
95.62% 94.01% 94.28%
94.43%
94.87%
95.00% 94.01% 93.52%
94.28%
94.43%
94.87% 92.81% 94.01%
93.52%
94.28%
94.43% 92.81% 90.76%
94.01%
93.52%
94.28% 92.81%
90.76%
94.01%
93.52% 92.81%
90.76%
93.52% 92.81%
90.76% 90.76%

F1
95.00%
95.62%
96.54% 94.87%
95.00%
95.62%
96.54% 94.43% 94.87%
95.00%
95.62% 94.28% 94.43%
94.87%
95.00% 94.01% 94.28%
94.43%
94.87% 94.01% 93.52%
94.28%
94.43% 92.81% 94.01%
93.52%
94.28% 92.81%
90.76%
94.01%
93.52% 92.81%
90.76%
93.52% 92.81%
90.76% 90.76%
95.00%
95.62%
96.54% 94.87% 95.00%
95.62% 94.43% 94.87%
95.00% 94.28% 94.43%
94.87% 94.01% To better assess
94.28%
94.43% 94.28%the performance
94.01%
93.52% 92.81%
94.01%
93.52% of the improved
92.81%
90.76%
93.52% 90.76%model,
92.81% 90.76% we count the detection results
95.00%
95.62% 94.87% 95.00% 94.43% 94.87% 94.28% 94.43%of original 94.28% images
94.01% 94.01%
93.52% and augmented 93.52% images.
92.81% 90.76% The90.76%
92.81% test results are shown in Table 5. It can be
95.00% 94.87% 94.43% 94.28% seen94.01% 93.52% 92.81% 90.76%
that the EfficinetNet-B0-YOLOV4 model in this paper can achieve desirable detection
(a) (a) results (a)for the apple (a) images(a) augmented (a)by using (a) data augmentation (a) (a)
methods. Compared
(a) (a) with the (a) original(a) (a) (a) (a)
image, the methods of mirror, crop, rotation, scale, and translation are (a)
(a) (a) (a) (a) (a) (a) (a)
(a) (a) mainly (a)based on(a) the change (a)of image(a) position or angle, which hardly adds new texture
(a) (a) (a) (a) (a)
(a) (a) (a) (a)
(a) (a) (a)
(a) (a)
(b)
(a) (b) (b) (b) (b) (b) (b) (b) (b)
information, so the detection results are similar to those of the original images. The
methods of brightness, blur, dropout, and illustration bring new texture information to
the image. Although these methods cause more false detections, they keep the number of missed detections similar to that of the original images and produce more object detections, which shows that the rich background will enhance the learning ability of the model. Compared with the detection results of the traditional augmentation techniques, the proposed illustration augmentation technique leads to more false detections, but the detection quantity and missed detections are consistent with these methods, which shows the effectiveness and feasibility of our proposed method.

To further verify the influence of the proposed illustration data augmentation method on the improved model, as shown in Figure 10, the detection results of the model trained by the illustration augmented images are compared with those of the model trained by the real images. From the detection results, it can be seen that the model trained by the illustration augmented images and the model trained by the real images can both accurately detect apples under no occlusion between leaves. However, under dense occlusion between leaves, the model trained by the real images hardly detects apples, while the model trained by the illustration augmented images can detect apples well and improve the detection results. It can be seen that the illustration augmentation method enriches the leaf occlusion scene in the training images, provides richer features for the learning of the model, and thus helps to improve the learning ability and detection results of the model.

Table 5. Detection results of original and generated images.

Apple Images                 Original  Mirror  Crop  Brightness  Blur  Dropout  Rotation  Scale  Translation  Illustration
Number of detected objects   263       270     266   281         276   274      268       266   265          283
Number of missed objects     3         8       8     6           8     8        9         7     4            7
Number of false objects      32        44      40    53          50    48       43        39    35           56

Table 4. F1 comparison between different data augmentation methods.

Data Augmentation Methods
Illustration   √       √       √       √       √       √       √       √       √
Translation    √       √       √       √       √       √       √       √
Scale          √       √       √       √       √       √       √
Rotation       √       √       √       √       √       √
Dropout        √       √       √       √       √
Blur           √       √       √       √
Brightness     √       √       √
Crop           √       √
Mirror         √
F1             96.54%  95.62%  95.00%  94.87%  94.43%  94.28%  94.01%  93.52%  92.81%  90.76%

Figure 10. Detection results between illustration augmented image and real image: (a) real image; (b) no illustration augmented image; (c) illustration augmented image.
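For readers who wish to reproduce a comparable effect, the sketch below shows one way a leaf-illustration overlay of this kind could be implemented with Pillow; it assumes an RGBA leaf cut-out, and the file names and paste parameters are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch of a leaf-illustration overlay augmentation (assumed pipeline,
# not the authors' exact code). It pastes semi-transparent leaf cut-outs onto an
# orchard image to simulate occlusion between leaves.
import random
from PIL import Image

def overlay_leaves(image_path, leaf_path, n_leaves=3, scale_range=(0.1, 0.3)):
    """Paste n_leaves randomly scaled and rotated leaf cut-outs onto the image."""
    img = Image.open(image_path).convert("RGBA")
    leaf = Image.open(leaf_path).convert("RGBA")   # leaf illustration with alpha channel
    w, h = img.size
    for _ in range(n_leaves):
        scale = random.uniform(*scale_range)
        lw = max(int(w * scale), 1)
        lh = max(int(w * scale * leaf.height / leaf.width), 1)
        patch = leaf.resize((lw, lh)).rotate(random.uniform(0, 360), expand=True)
        # Random position; the alpha channel of the patch acts as the paste mask.
        x = random.randint(0, max(w - patch.width, 0))
        y = random.randint(0, max(h - patch.height, 0))
        img.paste(patch, (x, y), patch)
    return img.convert("RGB")

# Hypothetical usage; the file names are placeholders.
# overlay_leaves("apple_orchard.jpg", "leaf_cutout.png").save("apple_augmented.jpg")
```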

4.2.2. Comparison of Different Models


To verify the superiority of the proposed EfficientNet-B0-YOLOv4 model in this paper,
we compare it with YOLOv3, YOLOv4, and Faster R-CNN with ResNet, which are the
state-of-the-art apple detection models. Table 6 shows the comparison of F1, mAP, Precision
and Recall of different models. Table 7 shows the comparison of the average detection time
per frame, weight size, parameter amount and calculation amount (FLOPs) of different
models. Table 8 shows the detection results of different models in the test set. Figure 11
shows the comparison of the P-R curve of different models. Figure 12 shows the detection
results of different models, where the green ring represents the missed detection and the
blue ring represents the false detection.
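For reference, the Precision, Recall, and F1 values compared in this subsection follow the standard detection definitions; a minimal sketch with hypothetical counts (not taken from the paper's tables) is:

```python
# Standard detection metrics from true positives (tp), false positives (fp),
# and false negatives (fn). The example counts below are hypothetical.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

if __name__ == "__main__":
    tp, fp, fn = 950, 45, 25          # hypothetical counts
    p, r = precision(tp, fp), recall(tp, fn)
    print(f"Precision={p:.4f}  Recall={r:.4f}  F1={f1_score(p, r):.4f}")
```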

Table 6. Performance comparison between different models.

Different Models            F1        mAP       Precision   Recall
YOLOv3                      91.76%    93.98%    95.81%      88.03%
YOLOv4                      96.36%    96.85%    98.22%      94.57%
Faster R-CNN with ResNet    66.96%    82.69%    52.07%      93.76%
EfficientNet-B0-YOLOv4      96.54%    98.15%    95.52%      97.43%

Table 7. Weight comparison between different models.

Different Models            Time/s    Size/MB   Parameter      FLOPs
YOLOv3                      0.405     235       6.15 × 10^7    3.28 × 10^10
YOLOv4                      0.410     244       6.40 × 10^7    2.99 × 10^10
Faster R-CNN with ResNet    6.167     108       2.83 × 10^7    1.79 × 10^11
EfficientNet-B0-YOLOv4      0.338     158       3.78 × 10^7    1.27 × 10^10

Table 8. Detection results of different models.

Different Models             Ground-Truth   Faster R-CNN with ResNet   YOLOv3   YOLOv4   EfficientNet-B0-YOLOv4
Number of detected objects   2340           4214                       3745     3004     2712
Number of missed objects     0              147                        112      45       68
Number of wrong objects      0              2021                       1517     709      440

Figure 11. P-R curves of different detection models.

Figure 12. Detection results of different models: (a) YOLOv3; (b) YOLOv4; (c) Faster R-CNN with ResNet; (d) EfficientNet-B0-YOLOv4.

Generally, to make robots pick more real apples in orchards, more attention should be paid to the improvement of the Recall. It can be seen from Tables 6 and 7 that the Faster R-CNN with ResNet model has a better Recall (93.76%), but its other performance and detection results are the worst. Although its weight (108 MB) and parameter amount (2.83 × 10^7) are lower than those of the YOLO models, its two-stage pipeline is complex, so its calculation amount (1.79 × 10^11) and average detection time per frame (6.167 s) greatly exceed those of the YOLO models. The YOLOv3 and YOLOv4 models both maintain real-time detection and their weight indicators are close, but the detection performance of the YOLOv4 model is better than that of the YOLOv3 model: the F1 is 4.60% higher, mAP is 2.87% higher, Precision is 2.41% higher, and Recall, in particular, is 6.54% higher. The EfficientNet-B0-YOLOv4 model proposed in this paper is slightly better than the YOLOv4 model in detection performance, with F1 0.18% higher, mAP 1.30% higher, Precision 2.70% lower, and Recall 2.86% higher. In terms of weight indicators, however, the improved model is much better than the YOLOv4 model: the average detection time per frame is reduced by 0.072 s, the weight size is reduced by 86 MB, the parameter amount is reduced by 2.62 × 10^7, and the calculation amount is reduced by 1.72 × 10^10.

It can be seen from Figure 11 that the area under the P-R curve of the EfficientNet-B0-YOLOv4 model proposed in this paper is larger, which shows that it has better performance. It can be seen from Table 8 and Figure 12 that for large objects every model can accurately detect apples, but for small objects, and especially under occlusion, the Faster R-CNN with ResNet model produces more missed and false detections, which leads to its low Precision (52.07%). The YOLO models produce fewer missed and false detections; the detection result of the YOLOv3 model is relatively poor, while the detection results of the YOLOv4 model and the proposed EfficientNet-B0-YOLOv4 model are close to the same.

Based on the above analysis, the overall results of the EfficientNet-B0-YOLOv4 model proposed in this paper are better than those of the current popular apple detection models: it achieves high-recall, real-time detection while reducing the weight size and computational complexity. The experimental results show that the proposed method is well suited to the vision system of an apple picking robot.
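The absolute reductions quoted above, and the percentage reductions reported in the Conclusions (35.25%, 40.94%, and 57.53%), follow directly from the Table 7 entries; a quick check:

```python
# Quick check of the YOLOv4 -> EfficientNet-B0-YOLOv4 reductions from Table 7.
yolov4   = {"time_s": 0.410, "size_mb": 244, "params": 6.40e7, "flops": 2.99e10}
improved = {"time_s": 0.338, "size_mb": 158, "params": 3.78e7, "flops": 1.27e10}

for key in yolov4:
    absolute = yolov4[key] - improved[key]
    relative = absolute / yolov4[key] * 100
    print(f"{key}: -{absolute:g} ({relative:.2f}% lower)")
# Expected: time -0.072 s, size -86 MB (35.25%), params -2.62e7 (40.94%),
# FLOPs -1.72e10 (57.53%).
```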



5. Conclusions
To simulate the possible complex scenes of apple detection in orchards and improve
the apple dataset, an illustration data augmentation method is proposed and 8 common
data augmentation methods are utilized to expand the dataset. On the expanded 2670 sam-
ples, the F1 of using the illustration data augmentation method has increased the most.
Given the large size and computational complexity of the YOLOv4 model, EfficientNet is
utilized to replace its backbone network CSPDarknet53. The improved EfficientNet-B0-
YOLOv4 model has the F1 of 96.54%, the mAP of 98.15%, the Recall of 97.43%, and the
average calculation time per frame of 0.338 s, which are better than the current popular
YOLOv3 model, YOLOv4 model, and Faster R-CNN with ResNet model. Comparing the
proposed EfficientNet-B0-YOLOv4 model with the original YOLOv4 model, the weight
size is reduced by 35.25%, the parameter amount is reduced by 40.94%, and the calculation
amount is reduced by 57.53%. In future work, we hope to add more apple classes for
detection, and conduct level evaluation for each class after picking. For example, each class
is divided into three levels: good, medium, and bad, thus forming a complete set of the
apple detection system. Furthermore, we will continue to consolidate the illustration data
augmentation method to improve the dataset.
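As a rough sketch of the backbone replacement summarized above (not the authors' code), three intermediate EfficientNet-B0 feature maps can be exposed and given additional Conv2D layers with tf.keras; the tapped layer names, input size, and channel counts below are assumptions made for illustration only.

```python
# Hedged sketch: EfficientNet-B0 as a three-scale feature extractor with an extra
# Conv2D layer on each output, in the spirit of the backbone replacement described
# above. Layer names and channel counts are assumptions, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers, Model

def efficientnet_b0_feature_extractor(input_shape=(416, 416, 3)):
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights=None, input_shape=input_shape)
    # Three intermediate activations at strides 8, 16, and 32 (assumed choice).
    taps = ["block4a_expand_activation",
            "block6a_expand_activation",
            "top_activation"]
    feats = [base.get_layer(name).output for name in taps]
    # Extra 3x3 Conv2D on each scale to further adjust and extract the features.
    outputs = [layers.Conv2D(256, 3, padding="same", activation="relu",
                             name=f"head_conv_{i}")(f)
               for i, f in enumerate(feats)]
    return Model(inputs=base.input, outputs=outputs)

model = efficientnet_b0_feature_extractor()
model.summary()
```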

Author Contributions: Conceptualization, L.W. and J.M.; Funding acquisition, J.M. and Y.Z.; In-
vestigation, J.M. and H.L.; Supervision, L.W., J.M., Y.Z. and H.L.; Writing—original draft, L.W.;
Writing—review & editing, L.W., J.M. and Y.Z. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the Hebei Natural Science Foundation (No. F2020202045) and the Hebei Postgraduate Innovation Funding Project (No. CXZZBS2020026).
Institutional Review Board Statement: This study did not involve humans or animals.
Informed Consent Statement: All authors have read and agreed to the published version of
the manuscript.
Data Availability Statement: The raw data required to reproduce these findings cannot be shared at
this time as the data also forms part of an ongoing study.
Acknowledgments: This work has been supported by Hebei Natural Science Foundation (Grant No.
F2020202045) and Hebei Postgraduate Innovation Funding Project (Grant No. CXZZBS2020026).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Zhu, G.; Tian, C. Determining sugar content and firmness of ‘Fuji’ apples by using portable near-infrared spectrometer and
diffuse transmittance spectroscopy. J. Food Process Eng. 2018, 41, e12810. [CrossRef]
2. Lehnert, C.; Sa, I.; McCool, C.; Upcroft, B.; Perez, T. Sweet pepper pose detection and grasping for automated crop harvesting. In
Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA’16), Stockholm, Sweden, 16–21 May
2016; pp. 2428–2434.
3. Valdez, P. Apple Defect Detection Using Deep Learning Based Object Detection for Better Post Harvest Handling. arXiv 2020,
arXiv:2005.06089.
4. Cao, Y.; Qi, W.; Li, X.; Li, Z. Research progress and prospect on non-destructive detection and quality grading technology of
apple. Smart Agric. 2019, 1, 29–45.
5. Zhang, J.; Karkee, M.; Zhang, Q.; Zhang, X.; Majeed, Y.; Fu, L.; Wang, S. Multi-class object detection using faster R-CNN and
estimation of shaking locations for automated shake-and-catch apple harvesting. Comput. Electron. Agric. 2020, 173, 105384.
[CrossRef]
6. Tian, Y.; Yang, G.; Wang, Z.; Li, E.; Liang, Z. Detection of apple lesions in orchards based on deep learning methods of cyclegan
and yolov3-dense. J. Sens. 2019, 2019, 7630926. [CrossRef]
7. Mureşan, H.; Oltean, M. Fruit recognition from images using deep learning. Acta Univ. Sapientiae Inform. 2018, 10, 26–42.
[CrossRef]
8. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the
improved YOLO-V3 network. Comput. Electron. Agric. 2019, 157, 417–426. [CrossRef]
9. Kang, H.; Chen, C. Fast implementation of real-time fruit detection in apple orchards using deep learning. Comput. Electron.
Agric. 2020, 168, 105108. [CrossRef]
10. Mazzia, V.; Khaliq, A.; Salvetti, F.; Chiaberge, M. Real-Time Apple Detection System Using Embedded Systems with Hardware
Accelerators: An Edge AI Application. IEEE Access 2020, 8, 9102–9114. [CrossRef]
11. Kuznetsova, A.; Maleva, T.; Soloviev, V. Using YOLOv3 Algorithm with Pre- and Post-Processing for Apple Detection in
Fruit-Harvesting Robot. Agronomy 2020, 10, 1016. [CrossRef]
12. Gao, F.; Fu, L.; Zhang, X.; Majeed, Y.; Li, R.; Karkee, M.; Zhang, Q. Multi-class fruit-on-plant detection for apple in SNAP system
using Faster R-CNN. Comput. Electron. Agric. 2020, 176, 105634. [CrossRef]
13. Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Sun, Y. A detection method for apple fruits based on color and shape features. IEEE Access 2019, 7,
67923–67933. [CrossRef]
14. Jia, W.; Tian, Y.; Luo, R.; Zhang, Z.; Lian, J.; Zheng, Y. Detection and segmentation of overlapped fruits based on optimized mask
R-CNN application in apple harvesting robot. Comput. Electron. Agric. 2020, 172, 105380. [CrossRef]
15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27 June 2016; pp. 779–788.
16. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
17. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
18. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
19. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946.
20. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
21. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable
features. In Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV’19), Seoul, Korea, 27 October–3
November 2019; pp. 6023–6032.
22. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning
capability of cnn. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20),
Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef]
24. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017;
pp. 2117–2125.
25. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
26. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86,
2278–2324. [CrossRef]
27. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27 June 2016; pp. 770–778.
30. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA,
USA, 7–12 June 2015; pp. 1–9.
31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
33. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In
Proceedings of the 2020 AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000.
