Classic object detection papers: YOLOv4: Optimal Speed and Accuracy of Object Detection

License: CC BY-SA 4.0

YOLOv4: Optimal Speed and Accuracy of Object Detection

Alexey Bochkovskiy*

Chien-Yao Wang*

Institute of Information Science

Academia Sinica, Taiwan

Hong-Yuan Mark Liao

Institute of Information Science Academia Sinica, Taiwan

Abstract

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a real-time speed of ∼65 FPS on Tesla V100. Source code is at https://github.com/AlexeyAB/darknet.

1. Introduction

The majority of CNN-based object detectors are largely applicable only for recommendation systems. For example, searching for free parking spaces via urban video cameras is executed by slow accurate models, whereas car collision warning is related to fast inaccurate models. Improving the accuracy of real-time object detectors enables using them not only for hint-generating recommendation systems, but also for stand-alone process management and human input reduction. Real-time object detector operation on conventional Graphics Processing Units (GPUs) allows their mass usage at an affordable price. The most accurate modern neural networks do not operate in real time and require a large number of GPUs for training with a large mini-batch size. We address such problems by creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.

The main goal of this work is to design an object detector with a fast operating speed in production systems and to optimize it for parallel computations, rather than to minimize the theoretical low-computation-volume indicator (BFLOP). We hope that the designed detector can be easily trained and used. For example, anyone who uses a conventional GPU to train and test can achieve real-time, high-quality, and convincing object detection results, as the YOLOv4 results in Figure 1 show. Our contributions are summarized as follows:

1. We develop an efficient and powerful object detection model. It lets anyone use a 1080 Ti or 2080 Ti GPU to train a super fast and accurate object detector.

2. We verify the influence of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods of object detection during the detector training.

3. We modify state-of-the-art methods and make them more efficient and suitable for single GPU training, including CBN [89], PAN [49], SAM [85], etc.

Figure 1: Comparison of the proposed YOLOv4 and other state-of-the-art object detectors. YOLOv4 runs twice as fast as EfficientDet with comparable performance. It improves YOLOv3’s AP and FPS by 10% and 12%, respectively.

2. Related work

2.1. Object detection models

A modern detector is usually composed of two parts, a backbone which is pre-trained on ImageNet and a head which is used to predict classes and bounding boxes of objects. For those detectors running on GPU platform, their backbone could be VGG [68], ResNet [26], ResNeXt [86], or DenseNet [30]. For those detectors running on CPU platform, their backbone could be SqueezeNet [31], MobileNet [28, 66, 27, 74], or ShuffleNet [97, 53]. As to the head part, it is usually categorized into two kinds, i.e., one-stage object detector and two-stage object detector. The most representative two-stage object detector is the R-CNN [19] series, including Fast R-CNN [18], Faster R-CNN [64], R-FCN [9], and Libra R-CNN [58]. It is also possible to make a two-stage object detector an anchor-free object detector, such as RepPoints [87]. As for one-stage object detectors, the most representative models are YOLO [61, 62, 63], SSD [50], and RetinaNet [45]. In recent years, anchor-free one-stage object detectors have been developed. The detectors of this sort are CenterNet [13], CornerNet [37, 38], FCOS [78], etc. Object detectors developed in recent years often insert some layers between backbone and head, and these layers are usually used to collect feature maps from different stages. We can call it the neck of an object detector. Usually, a neck is composed of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], and NAS-FPN [17]. In addition to the above models, some researchers put their emphasis on directly building a new backbone (DetNet [43], DetNAS [7]) or a new whole model (SpineNet [12], HitDetector [20]) for object detection.

To sum up, an ordinary object detector is composed of several parts:

Input: Image, Patches, Image Pyramid
Backbones: VGG16 [68], ResNet-50 [26], SpineNet [12], EfficientNet-B0/B7 [75], CSPResNeXt50 [81], CSPDarknet53 [81]
Neck:
• Additional blocks: SPP [25], ASPP [5], RFB [47], SAM [85]
• Path-aggregation blocks: FPN [44], PAN [49], NAS-FPN [17], Fully-connected FPN, BiFPN [77], ASFF [48], SFAM [98]
Heads:
• Dense Prediction (one-stage):
  - RPN [64], SSD [50], YOLO [61], RetinaNet [45] (anchor based)
  - CornerNet [37], CenterNet [13], MatrixNet [60], FCOS [78] (anchor free)
• Sparse Prediction (two-stage):
  - Faster R-CNN [64], R-FCN [9], Mask R-CNN [23] (anchor based)
  - RepPoints [87] (anchor free)

Figure 2: Object detector.
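
To make the backbone, neck, and head decomposition concrete, here is a minimal Python sketch. It is not taken from the paper or from the darknet source: the `Detector` class and the `toy_backbone`, `toy_neck`, and `toy_head` functions are hypothetical stand-ins, with strided slicing in place of convolutional stages and a single top-down path in place of a real FPN or PAN.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

FeatureMaps = List[np.ndarray]


@dataclass
class Detector:
    """Hypothetical container: image -> backbone -> neck -> head."""
    backbone: Callable[[np.ndarray], FeatureMaps]
    neck: Callable[[FeatureMaps], FeatureMaps]
    head: Callable[[FeatureMaps], FeatureMaps]

    def __call__(self, image: np.ndarray) -> FeatureMaps:
        return self.head(self.neck(self.backbone(image)))


def toy_backbone(image: np.ndarray) -> FeatureMaps:
    # Strided slicing stands in for feature stages at strides 8, 16, and 32.
    return [image[::8, ::8], image[::16, ::16], image[::32, ::32]]


def toy_neck(features: FeatureMaps) -> FeatureMaps:
    # Single top-down path: upsample the coarser map by 2 and add it to the finer one.
    merged = [features[-1]]
    for finer in reversed(features[:-1]):
        up = merged[0].repeat(2, axis=0).repeat(2, axis=1)
        merged.insert(0, finer + up[: finer.shape[0], : finer.shape[1]])
    return merged


def toy_head(features: FeatureMaps) -> FeatureMaps:
    # Dense prediction: one placeholder score per grid cell at every scale.
    return [f.mean(axis=-1) for f in features]


detector = Detector(toy_backbone, toy_neck, toy_head)
outputs = detector(np.ones((64, 64, 3)))
```

Swapping any of the three callables changes one row of the summary above without touching the others, which is the point of describing a detector in these parts.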

2.2. Bag of freebies

Usually, a conventional object detector is trained offline. Therefore, researchers always like to take this advantage and develop better training methods which can make the object detector achieve better accuracy without increasing the inference cost. We call these methods that only change the training strategy or only increase the training cost “bag of freebies.” What is often adopted by object detection methods and meets the definition of bag of freebies is data augmentation. The purpose of data augmentation is to increase the variability of the input images, so that the designed object detection model has higher robustness to the images obtained from different environments. For example, photometric distortions and geometric distortions are two commonly used data augmentation methods and they definitely benefit the object detection task. In dealing with photometric distortion, we adjust the brightness, contrast, hue, saturation, and noise of an image. For geometric distortion, we add random scaling, cropping, flipping, and rotating.
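
The practical difference between the two families is that a photometric distortion leaves the bounding boxes unchanged, while a geometric one must transform the boxes together with the pixels. The NumPy sketch below illustrates this with one example of each. It is an illustration under stated assumptions, not the paper's pipeline: the function names are hypothetical, images are assumed to be float arrays in [0, 1] with shape H×W×C, boxes are assumed to be `[x_min, y_min, x_max, y_max]` in pixels, and hue, saturation, and noise are omitted.

```python
import numpy as np


def photometric_distort(image, brightness=0.0, contrast=1.0):
    """Scale contrast around the image mean, then shift brightness.

    Pixel positions do not move, so bounding boxes stay valid as they are.
    """
    mean = image.mean()
    out = (image - mean) * contrast + mean + brightness
    return np.clip(out, 0.0, 1.0)


def horizontal_flip(image, boxes):
    """Flip the image left-right and mirror the boxes to match.

    boxes: array of shape (N, 4) holding [x_min, y_min, x_max, y_max].
    """
    width = image.shape[1]
    flipped = image[:, ::-1, :].copy()
    new_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    new_boxes[:, 0] = width - boxes[:, 2]
    new_boxes[:, 2] = width - boxes[:, 0]
    return flipped, new_boxes


image = np.full((4, 6, 3), 0.5)
boxes = np.array([[1.0, 0.0, 3.0, 2.0]])
brighter = photometric_distort(image, brightness=0.2)
flipped, flipped_boxes = horizontal_flip(image, boxes)
```

In training, the parameters (`brightness`, `contrast`, whether to flip) would be drawn at random per sample; they are fixed here only to keep the example deterministic.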

The data augmentation methods mentioned above are all pixel-wise adjustments, and all original pixel information in the adjusted area is retained. In addition, some researchers engaged in data augmentation put their emphasis on simulating object occlusion issues. They have achieved good results in image classification and object detection. For example, random erase [100] and CutOut [11] randomly select a rectangular region in an image and fill it with random values or with zeros. As for hide-and-seek [69] and grid mask [6], they randomly or evenly select multiple rectangular regions in an image and replace them with all zeros. If similar concepts are applied to feature maps, there are the DropOut [71], DropConnect [80], and DropBlock [16] methods. In addition, some researchers have proposed methods that use multiple images together to perform data augmentation. For example, MixUp [92] multiplies two images by different coefficient ratios and superimposes them, and then adjusts the label with these superimposed ratios. As for CutMix [91], it pastes a cropped patch onto a rectangular region of another image, and adjusts the label according to the size of the mixed area. In addition to the above-mentioned methods, style transfer GAN [15] is also used for data augmentation, and such usage can effectively reduce the texture bias learned by CNN.
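
The occlusion and multi-image ideas above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the cited papers' reference code: the function names and arguments are hypothetical, the region and mixing weight are passed in explicitly instead of being sampled, labels are assumed to be one-hot classification vectors, and the patch origin is assumed to lie inside the image.

```python
import numpy as np


def cutout(image, top, left, size):
    """Occlude a square region by filling it with zeros."""
    out = image.copy()
    out[top:top + size, left:left + size] = 0.0
    return out


def mixup(img_a, label_a, img_b, label_b, lam):
    """Blend two images with weights lam and 1 - lam; blend labels the same way."""
    image = lam * img_a + (1.0 - lam) * img_b
    label = lam * label_a + (1.0 - lam) * label_b
    return image, label


def cutmix(img_a, label_a, img_b, label_b, top, left, height, width):
    """Paste a rectangle of img_b onto img_a; weight the labels by area."""
    h, w = img_a.shape[:2]
    bottom, right = min(top + height, h), min(left + width, w)
    out = img_a.copy()
    out[top:bottom, left:right] = img_b[top:bottom, left:right]
    # Fraction of the result that still comes from img_a.
    lam = 1.0 - ((bottom - top) * (right - left)) / (h * w)
    label = lam * label_a + (1.0 - lam) * label_b
    return out, label


img_a, label_a = np.zeros((4, 4, 3)), np.array([1.0, 0.0])
img_b, label_b = np.ones((4, 4, 3)), np.array([0.0, 1.0])

occluded = cutout(img_b, top=1, left=1, size=2)
mixed, mixed_label = mixup(img_a, label_a, img_b, label_b, lam=0.3)
pasted, pasted_label = cutmix(img_a, label_a, img_b, label_b, 0, 0, 2, 2)
```

In the last call a 2×2 patch covers a quarter of the 4×4 image, so the label becomes 0.75 for the first class and 0.25 for the second. For detection, the boxes of both source images would also have to be clipped or merged, which this classification-style sketch leaves out.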

Different from the various approaches proposed above, some other bag of freebies methods are dedicated to solving the problem that the semantic distribution in the dataset may have bias. In dealing with the problem of semantic distribution bias, a very important issue is the data imbalance between different classes, and this problem is often solved by hard negative example mining [72] or online hard example mining [67] in two-stage object detectors. But the example mining method is not applicable to one-stage object detectors, because this kind of detector belongs to the dense prediction architecture. Therefore Lin et al. [45] proposed focal loss to deal with the problem of data imbalance existing between various classes. Another very important issue is that it is difficult to express the degree of association between different categories with the one-hot hard representation. This representation scheme is often used when executing labeling. The label smoothing proposed in [73] converts hard labels into soft labels for training, which can make the model more robust. In order to obtain a better soft label, Islam et al. [33] introduced the concept of knowledge distillation to design the label refinement network.
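
Both ideas have compact standard forms. Focal loss scales the cross-entropy term by a factor that shrinks for well-classified examples, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, and label smoothing mixes the one-hot vector with a uniform distribution over the $K$ classes, $y' = (1 - \epsilon)\,y + \epsilon / K$. The NumPy sketch below implements the binary form of the first and the uniform form of the second. The function names are hypothetical, and the defaults `gamma=2.0`, `alpha=0.25`, and `epsilon=0.1` are commonly used values assumed here, not settings stated in this text.

```python
import numpy as np


def focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Element-wise binary focal loss.

    prob: predicted probability of the positive class.
    target: 1 for positive examples, 0 for negatives.
    """
    prob = np.clip(prob, 1e-7, 1.0 - 1e-7)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(target == 1, prob, 1.0 - prob)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    # (1 - p_t) ** gamma down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


def smooth_labels(one_hot, epsilon=0.1):
    """Move epsilon of the probability mass to a uniform distribution."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes


probs = np.array([0.9, 0.1])      # an easy and a hard positive example
targets = np.array([1, 1])
losses = focal_loss(probs, targets)
soft = smooth_labels(np.array([0.0, 1.0, 0.0, 0.0]))
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a convenient sanity check. The smoothed label above is `[0.025, 0.925, 0.025, 0.025]`, which still sums to 1.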

The last bag of freebies is the objective function of Bounding Box (BBox) regression. The traditional object detector usually uses Mean Square Error (MSE) to directly perform regression on the center point coordinates and height and width of the BBox, i.e., {xcenter, ycenter, w, h} , or the upper left point and the lower right point, i.e., {xtop_left, ytop_left, xbottom_right, ybottom_