Res2Net: A New Multi-Scale Backbone Architecture
Abstract—Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone
convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent
performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layer-wise
manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like
connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range
of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models,
e.g., ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains
over baseline models on widely-used datasets, e.g., CIFAR-100 and ImageNet. Further ablation studies and experimental results
on representative computer vision tasks, i.e., object detection, class activation mapping, and salient object detection, further verify
the superiority of the Res2Net over the state-of-the-art baseline methods. The source code and trained models are available on
https://mmcheng.net/res2net/.
1 INTRODUCTION
receptive fields at a more granular level. To achieve this goal, we replace the 3×3 filters¹ of n channels with a set of smaller filter groups, each with w channels (without loss of generality we use n = s × w). As shown in Fig. 2, these smaller filter groups are connected in a hierarchical residual-like style to increase the number of scales that the output features can represent. Specifically, we divide input feature maps into several groups. A group of filters first extracts features from a group of input feature maps. Output features of the previous group are then sent to the next group of filters along with another group of input feature maps. This process repeats several times until all input feature maps are processed. Finally, feature maps from all groups are concatenated and sent to another group of 1×1 filters to fuse information altogether. Along with any possible path in which input features are transformed to output features, the equivalent receptive field increases whenever it passes a 3×3 filter, resulting in many equivalent feature scales due to combination effects.
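As a back-of-the-envelope illustration of this combination effect (our own sketch, not code from the paper), the snippet below enumerates the equivalent receptive-field sizes that can reach each output split of a block with a given scale s, using the hierarchical connection rule formalized later in Eq. (1) of Section 3.1:

```python
def equivalent_receptive_fields(scale: int) -> dict:
    """Equivalent receptive-field sizes (within the block, relative to the
    split feature maps) that can reach each output split y_i for a given scale."""
    rfs = {1: {1}}                      # y_1 is an identity mapping of x_1
    for i in range(2, scale + 1):
        # y_i sees x_i through one fresh 3x3 filter; for i > 2 it also sees every
        # path that already reached y_{i-1}, extended by one more 3x3 filter (+2).
        prev = rfs[i - 1] if i > 2 else set()
        rfs[i] = {3} | {rf + 2 for rf in prev}
    return rfs


print(equivalent_receptive_fields(4))
# {1: {1}, 2: {3}, 3: {3, 5}, 4: {3, 5, 7}}  -- several equivalent scales per block
```

With scale s = 4, the four output splits together already carry receptive fields of 1, 3, 5, and 7, which is the granular multi-scale behaviour argued for above.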
The Res2Net strategy exposes a new dimension, namely scale (the number of feature groups in the Res2Net block), as an essential factor in addition to existing dimensions of depth [47], width,² and cardinality [56]. We state in Section 4.4 that increasing scale is more effective than increasing other dimensions.

Note that the proposed approach exploits the multi-scale potential at a more granular level, which is orthogonal to existing methods that utilize layer-wise operations. Thus, the proposed building block, namely the Res2Net module, can be easily plugged into many existing CNN architectures. Extensive experimental results show that the Res2Net module can further improve the performance of state-of-the-art CNNs, e.g., ResNet [23], ResNeXt [56], and DLA [60].

1. Convolutional operators and filters are used interchangeably.
2. Width refers to the number of channels in a layer as in [61].

2 RELATED WORK

2.1 Backbone Networks
Recent years have witnessed numerous backbone networks [15], [23], [26], [28], [47], [51], [56], [60], achieving state-of-the-art performance in various vision tasks with stronger multi-scale representations. As designed, CNNs are equipped with basic multi-scale feature representation ability since the input information follows a fine-to-coarse fashion. The AlexNet [28] stacks filters sequentially and achieves significant performance gain over traditional methods for visual recognition. However, due to the limited network depth and kernel size of filters, the AlexNet has only a relatively small receptive field. The VGGNet [47] increases the network depth and uses filters with smaller kernel size. A deeper structure can expand the receptive fields, which is useful for extracting features from a larger scale. It is more efficient to enlarge the receptive field by stacking more layers than using large kernels. As such, the VGGNet provides a stronger multi-scale representation model than AlexNet, with fewer parameters. However, both AlexNet and VGGNet stack filters directly, which means each feature layer has a relatively fixed receptive field.

Network in Network (NIN) [31] inserts multi-layer perceptrons as micro-networks into the large network to enhance model discriminability for local patches within the receptive field. The 1×1 convolution introduced in NIN has been a popular module to fuse features. The GoogLeNet [51] utilizes parallel filters with different kernel sizes to enhance the multi-scale representation capability. However, such capability is often limited by the computational constraints due to its limited parameter efficiency. The Inception Nets [50], [52] stack more filters in each path of the parallel paths in the GoogLeNet to further expand the receptive field. On the other hand, the ResNet [23] introduces short connections to neural networks, thereby alleviating the gradient vanishing problem while obtaining much deeper network structures. During the feature extraction procedure, short connections allow different combinations of convolutional operators, resulting in a large number of equivalent feature scales. Similarly, densely connected layers in the DenseNet [26] enable the network to process objects in a very wide range of scales. DPN [10] combines the ResNet with DenseNet to enable the feature re-usage ability of ResNet and the feature exploration ability of DenseNet. The recently proposed DLA [60] method combines layers in a tree structure. The hierarchical tree structure enables the network to obtain even stronger layer-wise multi-scale representation capability.
2.2 Multi-Scale Representations for Vision Tasks
Multi-scale feature representations of CNNs are of great importance to a number of vision tasks including object detection [43], face analysis [4], [41], edge detection [37], semantic segmentation [6], salient object detection [34], [65], and skeleton detection [67], boosting the model performance of those fields.

2.2.1 Object Detection
Effective CNN models need to locate objects of different scales in a scene. Earlier works such as the R-CNN [18] mainly rely on the backbone network, i.e., VGGNet [47], to extract features of multiple scales. He et al. propose an SPP-Net approach [22] that utilizes spatial pyramid pooling after the backbone network to enhance the multi-scale ability. The Faster R-CNN method [43] further proposes region proposal networks to generate bounding boxes with various scales. Based on the Faster R-CNN, the FPN [32] approach introduces a feature pyramid to extract features with different scales from a single image. The SSD method [36] utilizes feature maps from different stages to process visual information at different scales.

2.2.2 Semantic Segmentation
Extracting essential contextual information of objects requires CNN models to process features at various scales for effective semantic segmentation. Long et al. [38] propose one of the earliest methods that enables multi-scale representations of the fully convolutional network (FCN) for the semantic segmentation task. In DeepLab, Chen et al. [6], [7] introduce a cascaded atrous convolutional module to expand the receptive field further while preserving spatial resolutions. More recently, global context information is aggregated from region-based features via the pyramid pooling scheme in the PSPNet [64].

2.2.3 Salient Object Detection
Precisely locating the salient object regions in an image requires an understanding of both large-scale context information for the determination of object saliency, and small-scale features to localize object boundaries accurately [66]. Early approaches [3] utilize handcrafted representations of global contrast [13] or multi-scale region features [53]. Li et al. [29] propose one of the earliest methods that enables multi-scale deep features for salient object detection. Later, multi-context deep learning [68] and multi-level convolutional features [62] are proposed for improving salient object detection. More recently, Hou et al. [24] introduce dense short connections among stages to provide rich multi-scale feature maps at each layer for salient object detection.

2.3 Concurrent Works
Recently, there are some concurrent works aiming at improving the performance by utilizing multi-scale features [5], [9], [11], [49]. Big-Little Net [5] is a multi-branch network composed of branches with different computational complexity. Octave Conv [9] decomposes the standard convolution into two resolutions to process features at different frequencies. MSNet [11] utilizes a high-resolution network to learn high-frequency residuals by using the up-sampled low-resolution features learned by a low-resolution network. Other than the low-resolution representations in current works, the HRNet [48], [49] introduces high-resolution representations in the network and repeatedly performs multi-scale fusions to strengthen high-resolution representations. One common operation in [5], [9], [11], [48], [49] is that they all use pooling or up-sampling to re-size the feature map to 2^n times of the original scale to save the computational budget while maintaining or even improving performance. In the Res2Net block, by contrast, the hierarchical residual-like connections within a single residual block enable the variation of receptive fields at a more granular level to capture details and global features. Experimental results show that the Res2Net module can be integrated with those novel network designs to further boost the performance.

3 RES2NET

3.1 Res2Net Module
The bottleneck structure shown in Fig. 2a is a basic building block in many modern backbone CNN architectures, e.g., ResNet [23], ResNeXt [56], and DLA [60]. Instead of extracting features using a group of 3×3 filters as in the bottleneck block, we seek alternative architectures with stronger multi-scale feature extraction ability, while maintaining a similar computational load. Specifically, we replace a group of 3×3 filters with smaller groups of filters, while connecting different filter groups in a hierarchical residual-like style. Since our proposed neural network module involves residual-like connections within a single residual block, we name it Res2Net.

Fig. 2 shows the differences between the bottleneck block and the proposed Res2Net module. After the 1×1 convolution, we evenly split the feature maps into s feature map subsets, denoted by x_i, where i ∈ {1, 2, ..., s}. Each feature subset x_i has the same spatial size but 1/s the number of channels compared with the input feature map. Except for x_1, each x_i has a corresponding 3×3 convolution, denoted by K_i(·). We denote by y_i the output of K_i(·). The feature subset x_i is added with the output of K_{i-1}(·), and then fed into K_i(·). To reduce parameters while increasing s, we omit the 3×3 convolution for x_1. Thus, y_i can be written as

  y_i = \begin{cases}
          x_i,                & i = 1; \\
          K_i(x_i),           & i = 2; \\
          K_i(x_i + y_{i-1}), & 2 < i \le s.
        \end{cases}                                   (1)

Notice that each 3×3 convolutional operator K_i(·) could potentially receive feature information from all feature splits {x_j, j ≤ i}. Each time a feature split x_j goes through a 3×3 convolutional operator, the output result can have a larger receptive field than x_j. Due to the combinatorial explosion effect, the output of the Res2Net module contains a different number and different combinations of receptive field sizes/scales.

In the Res2Net module, splits are processed in a multi-scale fashion, which is conducive to the extraction of both global and local information. To better fuse information at different scales, we concatenate all splits and pass them through a 1×1 convolution. The split and concatenation strategy can enforce convolutions to process features more effectively. To reduce the number of parameters, we omit the convolution for the first split.
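For concreteness, the following is a minimal PyTorch sketch of a Res2Net module implementing Eq. (1) followed by the concatenation and 1×1 fusion described above. The layer names, constructor arguments, and BatchNorm/ReLU placement are our assumptions for illustration, not the authors' reference implementation (which is released at https://mmcheng.net/res2net/).

```python
import torch
import torch.nn as nn


class Res2NetModule(nn.Module):
    """Hierarchical residual-like connections within one block, following Eq. (1)."""

    def __init__(self, in_channels: int, out_channels: int, width: int = 26, scale: int = 4):
        super().__init__()
        self.scale = scale
        mid_channels = width * scale          # n = s * w channels after the first 1x1 conv
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        # One 3x3 convolution K_i per split, except for x_1 which is passed through unchanged.
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False) for _ in range(scale - 1)]
        )
        self.bns = nn.ModuleList([nn.BatchNorm2d(width) for _ in range(scale - 1)])
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        xs = torch.chunk(out, self.scale, dim=1)     # evenly split into x_1 ... x_s
        ys = [xs[0]]                                 # y_1 = x_1 (no 3x3 convolution)
        for i in range(1, self.scale):
            # y_2 = K_2(x_2); y_i = K_i(x_i + y_{i-1}) for 2 < i <= s
            inp = xs[i] if i == 1 else xs[i] + ys[-1]
            ys.append(self.relu(self.bns[i - 1](self.convs[i - 1](inp))))
        out = torch.cat(ys, dim=1)                   # concatenate all splits ...
        return self.relu(self.bn3(self.conv3(out)))  # ... and fuse them with a 1x1 convolution


# Example: a "26w4s" module (width=26, scale=4) on a 256-channel feature map.
module = Res2NetModule(256, 256, width=26, scale=4)
y = module(torch.randn(1, 256, 56, 56))   # -> torch.Size([1, 256, 56, 56])
```

The identity shortcut and residual addition of the enclosing bottleneck block are omitted here for brevity; in the experiment tables that follow, a setting such as 26w4s corresponds to width = 26 and scale = 4.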
TABLE 1
Top-1 and Top-5 Test Error on the ImageNet Dataset

                   top-1 err. (%)   top-5 err. (%)
ResNet-50 [23]     23.85            7.13
Res2Net-50         22.01            6.15
InceptionV3 [52]   22.55            6.44
Res2Net-50-299     21.41            5.88

TABLE 2
Top-1 and Top-5 Test Error (%) of Deeper Networks on the ImageNet Dataset

                    top-1 err.   top-5 err.
DenseNet-161 [26]   22.35        6.20
ResNet-101 [23]     22.63        6.44
Res2Net-101         20.81        5.57
TABLE 4
Top-1 Test Error (%) and Model Size on the CIFAR-100 Dataset
Fig. 4. Visualization of class activation mapping [45], using ResNet-50 and Res2Net-50 as backbone networks.
TABLE 5
Object Detection Results on the PASCAL VOC07 and COCO Datasets, Measured Using AP (%) and AP@IoU = 0.5 (%)

Dataset   Backbone     AP     AP@IoU = 0.5
VOC07     ResNet-50    72.1   -
          Res2Net-50   74.4   -
COCO      ResNet-50    31.1   51.4
          Res2Net-50   33.7   53.6

The Res2Net has similar complexity compared with its counterparts.

TABLE 7
Performance of Semantic Segmentation on PASCAL VOC12 Val Set Using Res2Net-50 with Different Scales

Backbone      Setting   Mean IoU (%)
ResNet-50     64w       77.7
Res2Net-50    48w2s     78.2
Res2Net-50    26w4s     79.2
Res2Net-50    18w6s     79.1
Res2Net-50    14w8s     79.0
ResNet-101    64w       79.0
Res2Net-101   26w4s     80.2

The Res2Net has similar complexity compared with its counterparts.

Both methods have similar activation maps on middle-size objects, such as 'ice cream'. Due to its stronger multi-scale ability, the Res2Net has activation maps that tend to cover the whole object on big objects such as 'bulbul', 'mountain dog', 'ballpoint', and 'mosque', while activation maps of ResNet only cover parts of objects. Such ability to precisely localize the CAM region makes the Res2Net potentially valuable for object region mining in weakly supervised semantic segmentation tasks [54].

4.6 Object Detection
For the object detection task, we validate the Res2Net on the PASCAL VOC07 [17] and MS COCO [33] datasets, using Faster R-CNN [43] as the baseline method. We use the backbone network of ResNet-50 versus Res2Net-50, and follow all other implementation details of [43] for a fair comparison. Table 5 shows the object detection results. On the PASCAL VOC07 dataset, the Res2Net-50 based model outperforms its counterpart by 2.3 percent on average precision (AP). On the COCO dataset, the Res2Net-50 based model outperforms its counterpart by 2.6 percent on AP, and 2.2 percent on AP@IoU = 0.5.

We further test the AP and average recall (AR) scores for objects of different sizes as shown in Table 6. Objects are divided into three categories based on size, according to [33]. The Res2Net based model has a large margin of improvement over its counterpart of 0.5, 2.9, and 4.9 percent on AP for small, medium, and large objects, respectively. The improvements of AR for small, medium, and large objects are 1.4, 2.5, and 3.7 percent, respectively. Due to the strong multi-scale ability, the Res2Net based models can cover a large range of receptive fields, boosting the performance on objects of different sizes.

TABLE 6
Average Precision (AP) and Average Recall (AR) of Object Detection with Different Sizes on the COCO Dataset

                          Object size
                      Small   Medium   Large   All
AP (%)  ResNet-50     13.5    35.4     46.2    31.1
        Res2Net-50    14.0    38.3     51.1    33.7
        Improve.      +0.5    +2.9     +4.9    +2.6
AR (%)  ResNet-50     21.8    48.6     61.6    42.8
        Res2Net-50    23.2    51.1     65.3    45.0
        Improve.      +1.4    +2.5     +3.7    +2.2

4.7 Semantic Segmentation
Semantic segmentation requires a strong multi-scale ability of CNNs to extract essential contextual information of objects. We thus evaluate the multi-scale ability of Res2Net on the semantic segmentation task using the PASCAL VOC12 dataset [16]. We follow the previous work and use the augmented PASCAL VOC12 dataset [20], which contains 10,582 training images and 1,449 val images. We use Deeplab v3+ [8] as our segmentation method. All implementations remain the same as Deeplab v3+ [8], except that the backbone network is replaced with ResNet and our proposed Res2Net. The output strides used in training and evaluation are both 16. As shown in Table 7, the Res2Net-50 based method outperforms its counterpart by 1.5 percent on mean IoU, and the Res2Net-101 based method outperforms its counterpart by 1.2 percent on mean IoU. Visual comparisons of semantic segmentation results on challenging examples are illustrated in Fig. 6. The Res2Net based method tends to segment all parts of objects regardless of object size.

Fig. 6. Visualization of semantic segmentation results [8], using ResNet-101 and Res2Net-101 as backbone networks.

4.8 Instance Segmentation
Instance segmentation is the combination of object detection and semantic segmentation. It requires not only the correct detection of objects with various sizes in an image but also the precise segmentation of each object. As mentioned in Sections 4.6 and 4.7, both object detection and semantic segmentation require a strong multi-scale ability of CNNs. Thus, the multi-scale representation is quite beneficial to instance segmentation. We use the Mask R-CNN [21] as the instance segmentation method, and replace the backbone network of ResNet-50 with our proposed Res2Net-50. The performance of instance segmentation on the MS COCO [33] dataset is shown in Table 8. The Res2Net-26w4s based method outperforms its counterpart by 1.7 percent on AP and 2.4 percent on AP50. The performance gains on objects with different sizes are also demonstrated: the improvements of AP for small, medium, and large objects are 0.9, 1.9, and 2.8 percent, respectively. Table 8 also shows the performance comparisons of Res2Net under the same complexity with different scales. The performance shows an overall upward trend with the increase of scale. Note that compared with the Res2Net-50-48w2s, the Res2Net-50-26w4s has an improvement of 2.8 percent on APL, while the Res2Net-50-48w2s has the same APL
compared with ResNet-50. We assume that the performance gain on large objects benefits from the extra scales. When the scale is relatively larger, the performance gain is not obvious. The Res2Net module is capable of learning a suitable range of receptive fields. The performance gain is limited when the scale of objects in the image is already covered by the available receptive fields in the Res2Net module. With fixed complexity, the increased scale results in fewer channels for each receptive field, which may reduce the ability to process features of a particular scale.

4.9 Salient Object Detection
Pixel-level tasks such as salient object detection also require the strong multi-scale ability of CNNs to locate both the holistic objects as well as their region details. Here we use the latest method DSS [24] as our baseline. For a fair comparison, we only replace the backbone with ResNet-50 and our proposed Res2Net-50, while keeping other configurations unchanged. Following [24], we train those two models using the MSRA-B dataset [35], and evaluate results on the ECSSD [58], PASCAL-S [30], HKU-IS [29], and DUT-OMRON [59] datasets. The F-measure and Mean Absolute Error (MAE) are used for evaluation. As shown in Table 9, the Res2Net based model has a consistent improvement compared with its counterparts on all datasets. On the DUT-OMRON dataset (containing 5,168 images), the Res2Net based model has a 5.2 percent improvement on F-measure and a 2.1 percent improvement on MAE, compared with the ResNet based model. The Res2Net based approach achieves the greatest performance gain on the DUT-OMRON dataset, since this dataset contains the most significant object size variation compared with the other three datasets. Some visual comparisons of salient object detection results on challenging examples are illustrated in Fig. 7.

4.10 Key-Points Estimation
Human parts are of different sizes, which requires the key-points estimation method to locate human key-points with different scales. To verify whether the multi-scale representation ability of Res2Net can benefit the task of key-points estimation, we use the SimpleBaseline [55] as the key-points estimation method and only replace the backbone with the proposed Res2Net. All implementations including the training and testing strategies remain the same as the SimpleBaseline [55]. We train the model using the COCO key-point detection dataset [33], and evaluate the model using the COCO validation set. Following common settings, we use the same person detectors as in SimpleBaseline [55] for evaluation. Table 10 shows the performance of key-points estimation on the COCO validation set using Res2Net. The Res2Net-50 and Res2Net-101 based models outperform baselines on AP by 3.3 and 3.0 percent, respectively. Also, Res2Net based models have considerable performance gains on humans with different scales compared with baselines.

TABLE 8
Performance of Instance Segmentation on the COCO Dataset Using Res2Net-50 with Different Scales

Backbone     Setting   AP     AP50   AP75   APS    APM    APL
ResNet-50    64w       33.9   55.2   36.0   14.8   36.0   50.9
Res2Net-50   48w2s     34.2   55.6   36.3   14.9   36.8   50.9
Res2Net-50   26w4s     35.6   57.6   37.6   15.7   37.9   53.7
Res2Net-50   18w6s     35.7   57.5   38.1   15.4   38.1   53.7
Res2Net-50   14w8s     35.3   57.0   37.5   15.6   37.5   53.4

The Res2Net has similar complexity compared with its counterparts.

TABLE 9
Salient Object Detection Results on Different Datasets, Measured Using F-Measure and Mean Absolute Error (MAE)

Dataset      Backbone     F-measure   MAE
PASCAL-S     ResNet-50    0.823       0.105
             Res2Net-50   0.841       0.099
HKU-IS       ResNet-50    0.894       0.058
             Res2Net-50   0.905       0.050
DUT-OMRON    ResNet-50    0.748       0.092
             Res2Net-50   0.800       0.071

The Res2Net has similar complexity compared with its counterparts.
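As a rough sketch of the backbone-swap protocol used throughout Sections 4.6, 4.8, 4.9, and 4.10 (our own hedged example built on torchvision's Faster R-CNN, not the authors' training code), replacing the feature extractor is essentially the only change; here res2net_backbone() is a placeholder for any Res2Net-50 feature extractor, e.g. a stack of the Res2Net modules sketched in Section 3.1:

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator


def res2net_backbone() -> torch.nn.Module:
    # Placeholder backbone: any module mapping an image batch to a single feature
    # map works. A truncated ResNet-50 is used as a stand-in; swapping in a
    # Res2Net-50 feature extractor is the only change the experiments make.
    resnet = torchvision.models.resnet50(weights=None)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
    backbone.out_channels = 2048        # FasterRCNN requires this attribute
    return backbone


anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),)
)
roi_pooler = torchvision.ops.MultiScaleRoIAlign(
    featmap_names=["0"], output_size=7, sampling_ratio=2
)
model = FasterRCNN(
    res2net_backbone(),
    num_classes=91,                     # COCO categories incl. background (torchvision convention)
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=roi_pooler,
)

model.eval()
with torch.no_grad():
    detections = model([torch.randn(3, 512, 512)])  # list of dicts: boxes, labels, scores
```

The same pattern applies to Mask R-CNN, DSS, and SimpleBaseline in the sections above: the detection, segmentation, or key-point head and the training recipe stay fixed, and only the backbone changes.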
Fig. 7. Examples of salient object detection [24] results, using ResNet-50 and Res2Net-50 as backbone networks, respectively.

TABLE 10
Performance of Key-Points Estimation on the COCO Validation Set

Backbone      AP     AP50   AP75   APM    APL
ResNet-50     70.4   88.6   78.3   67.1   77.2
Res2Net-50    73.7   92.5   81.4   70.8   78.2
ResNet-101    71.4   89.3   79.3   68.1   78.1
Res2Net-101   74.4   92.6   82.6   72.0   78.5

The Res2Net has similar complexity compared with its counterparts.

5 CONCLUSION AND FUTURE WORK
We present a simple yet efficient block, namely Res2Net, to further explore the multi-scale ability of CNNs at a more granular level. The Res2Net exposes a new dimension, namely "scale", which is an essential and more effective factor in addition to existing dimensions of depth, width, and cardinality. Our Res2Net module can be integrated with existing state-of-the-art methods with no effort. Image classification results on CIFAR-100 and ImageNet benchmarks suggested that our new backbone network consistently performs favourably against its state-of-the-art competitors, including ResNet, ResNeXt, DLA, etc.

Although the superiority of the proposed backbone model has been demonstrated in the context of several representative computer vision tasks, including class activation mapping, object detection, and salient object detection, we believe multi-scale representation is essential for a much wider range of application areas. To encourage future works to leverage the strong multi-scale ability of the Res2Net, the source code is available on https://mmcheng.net/res2net/.

ACKNOWLEDGMENTS
This research was supported by NSFC (NO. 61620106008, 61572264), the national youth talent support program, and Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110). Shang-Hua Gao and Ming-Ming Cheng contributed equally to this work.

REFERENCES
[1] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[2] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, "Salient object detection: A survey," Comput. Visual Media, vol. 5, no. 2, pp. 117–150, 2019.
[3] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, "Salient object detection: A benchmark," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, Dec. 2015.
[4] A. Bulat and G. Tzimiropoulos, "How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks)," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1021–1030.
[5] C.-F. R. Chen, Q. Fan, N. Mallinar, T. Sercu, and R. Feris, "Big-little net: An efficient multi-scale feature representation for visual and speech recognition," in Proc. Int. Conf. Learn. Representations, 2019.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[7] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," CoRR, abs/1706.05587, 2017.
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 833–851.
[9] Y. Chen, H. Fang, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng, "Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution," in Proc. IEEE Int. Conf. Comput. Vis., 2019.
[10] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual path networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 4467–4475.
[11] B. Cheng, R. Xiao, J. Wang, T. Huang, and L. Zhang, "High frequency residual learning for multi-scale image classification," in Proc. Brit. Mach. Vis. Conf., 2019.
[12] M.-M. Cheng, Y. Liu, W.-Y. Lin, Z. Zhang, P. L. Rosin, and P. H. S. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," Comput. Visual Media, vol. 5, no. 1, pp. 3–20, Mar. 2019.
[13] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.
[14] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," CoRR, arXiv: 1710.09282, 2017.
[15] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1800–1807.
[16] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes challenge: A retrospective," Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, 2015.
[17] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.
[19] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 1135–1143.
[20] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 991–998.
[21] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[24] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, "Deeply supervised salient object detection with short connections," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 815–828, Apr. 2019.
[25] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[26] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2261–2269.
[27] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep. TR-2009, University of Toronto, Toronto, 2009.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[29] G. Li and Y. Yu, "Visual saliency based on multiscale deep features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5455–5463.
[30] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The secrets of salient object segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 280–287.
[31] M. Lin, Q. Chen, and S. Yan, "Network in network," in Proc. Int. Conf. Learn. Representations, 2013.
[32] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 936–944.
[33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[34] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, "A simple pooling-based design for real-time salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3917–3926.
[35] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.
[36] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[37] Y. Liu, M.-M. Cheng, X. Hu, J.-W. Bian, L. Zhang, X. Bai, and J. Tang, "Richer convolutional features for edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1939–1946, Aug. 2019.
[38] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431–3440.
[39] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[40] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 122–138.
[41] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, "SSH: Single stage headless face detector," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4875–4884.
[42] G.-Y. Nie, M.-M. Cheng, Y. Liu, Z. Liang, D.-P. Fan, Y. Liu, and Y. Wang, "Multi-level context ultra-aggregation for stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3283–3291.
[43] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.
[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[45] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al., "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[46] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 568–576.
[47] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2014.
[48] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5693–5703.
[49] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang, "High-resolution representations for labeling pixels and regions," CoRR, abs/1904.04514, 2019.
[50] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proc. Nat. Conf. Artif. Intell., 2017, vol. 4, Art. no. 12.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[52] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
[53] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng, "Salient object detection: A discriminative regional feature integration approach," Int. J. Comput. Vis., vol. 123, no. 2, pp. 251–268, 2017.
[54] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan, "Object region mining with adversarial erasing: A simple classification to semantic segmentation approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6488–6496.
[55] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 472–487.
[56] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5987–5995.
[57] S. Xie and Z. Tu, "Holistically-nested edge detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1395–1403.
[58] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1155–1162.
[59] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3166–3173.
[60] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, "Deep layer aggregation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2403–2412.
[61] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proc. Brit. Mach. Vis. Conf., 2016, pp. 87.1–87.12.
[62] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, "Amulet: Aggregating multi-level convolutional features for salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 202–211.
[63] T. Zhang, C. Xu, and M.-H. Yang, "Multi-task correlation particle filter for robust object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4819–4827.
[64] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6230–6239.
[65] J. Zhao, Y. Cao, D.-P. Fan, X.-Y. Li, L. Zhang, and M.-M. Cheng, "Contrast prior and fluid pyramid integration for RGBD salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
[66] K. Zhao, S. Gao, W. Wang, and M.-M. Cheng, "Optimizing the F-measure for threshold-free salient object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2019.
[67] K. Zhao, W. Shen, S. Gao, D. Li, and M.-M. Cheng, "Hi-Fi: Hierarchical feature integration for skeleton detection," in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 1191–1197.
[68] R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1265–1274.

Shang-Hua Gao is working toward the master's degree in the Media Computing Lab, Nankai University. He is supervised by Prof. Ming-Ming Cheng. His research interests include computer vision, machine learning, and radio vortex wireless communications.
Ming-Ming Cheng received the PhD degree from Tsinghua University, in 2012, and then worked with Prof. Philip Torr in Oxford for 2 years. He is now a professor at Nankai University, leading the Media Computing Lab. His research interests include computer vision and computer graphics. He received awards including the ACM China Rising Star Award, IBM Global SUR Award, etc. He is a senior member of the IEEE and on the editorial boards of IEEE TIP.

Kai Zhao is currently working toward the PhD degree with the College of Computer Science, Nankai University, under the supervision of Prof. Ming-Ming Cheng. His research interests mainly focus on statistical learning and computer vision.

Xin-Yu Zhang is working toward the graduate degree in the School of Mathematical Sciences, Nankai University. His research interests include computer vision and deep learning.

Ming-Hsuan Yang received the PhD degree in computer science from the University of Illinois at Urbana-Champaign, in 2000. He is a professor in electrical engineering and computer science at the University of California, Merced. He has served as an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence, the International Journal of Computer Vision, Computer Vision and Image Understanding, etc. He received the NSF CAREER award, in 2012, and the Google Faculty Award, in 2009.

Philip Torr received the PhD degree from Oxford University. After working for another three years at Oxford, he worked for six years for Microsoft Research, first in Redmond, then in Cambridge, founding the vision side of the Machine Learning and Perception Group. He is now a professor at Oxford University. He has won awards from top vision conferences, including ICCV, CVPR, ECCV, NIPS and BMVC. He is a senior member of the IEEE and a Royal Society Wolfson Research Merit Award holder.