Detect Faces Efficiently: A Survey and Evaluations
Yuantao Feng, Shiqi Yu, Hanyang Peng, Yan-Ran Li, Jianguo Zhang
Abstract—Face detection searches all possible regions of an image for faces and locates any that are present. Many applications, including face recognition, facial expression recognition, face tracking and head-pose estimation, assume that both the location and the size of faces in the image are known. In recent decades, researchers have created many typical and efficient face detectors, from the Viola-Jones face detector to current CNN-based ones. However, with the tremendous increase in images and videos containing variations in face scale, appearance, expression, occlusion and pose, traditional face detectors are challenged to detect various "in the wild" faces. The emergence of deep learning techniques brought remarkable breakthroughs to face detection, at the price of a considerable increase in computation. This paper introduces representative deep learning-based methods and presents a deep and thorough analysis in terms of accuracy and efficiency. We further compare and discuss the popular and challenging datasets and their evaluation metrics. A comprehensive comparison of several successful deep learning-based face detectors is conducted to uncover their efficiency using two metrics: FLOPs and latency. This paper can guide readers to choose appropriate face detectors for different applications, and also to develop more efficient and accurate detectors.
1 INTRODUCTION
such as S3FD [17], PyramidBox [18], SRN [19], DSFD [20], and RetinaFace [21].

Face detection is sometimes considered a solved problem, because the average precision (AP) on many face detection datasets, such as PASCAL Face [22], AFW [23] and FDDB [24], has reached or exceeded 0.990 since 2017¹. On the most popular and challenging WIDER Face dataset [3], the AP has reached 0.921 even on the hard test set.

Fig. 2: The best AP on the easy, medium and hard subsets of the WIDER Face [3] test set in recent years.

TABLE 1: Different models adopt different ranges and different presets of test scales. '0.25x' denotes shrinking the width and height by 0.25, and the others follow. Specifically, 'Sx' and 'Ex' are shrinking and enlarging images accordingly, while 'Fx' is enlarging the image to a fixed size. Test image sizes stand for re-scaling the smaller side of the image to the given value, while the other side follows the same ratio.

Model                   Test image scales
HR, 2017 [13]           0.25x, 0.5x, 1x, 2x
S3FD, 2017 [17]         0.5x, 1x, Sx, Ex
SRN, 2019 [19]          0.5x, 1x, 1.5x, 2.25x, Fx
DSFD, 2019 [20]         0.5x, 1x, 1.25x, 1.75x, 2.25x, Sx, Ex
CSP, 2019 [25]          0.25x, 0.5x, 0.75x, 1x, 1.25x, 1.5x, 1.75x, 2x

Model                   Test image sizes
SSH, 2017 [14]          500, 800, 1200, 1600
SFA, 2019 [26]          500, 600, 700, 800, 900, 1000, 1100, 1200, 1600
SHF, 2020 [27]          100, 300, 600, 1000, 1400
RetinaFace, 2020 [21]   500, 800, 1100, 1400, 1700
But face detection is not a solved problem. If we observe the best results of each year in Fig. 2, we can find that the AP is still improving, but slowly, in the recent 3 years. Therefore, with such near-to-saturated performance improvement, one question would be asked: if a tiny improvement is achieved by a much heavier deep model with a great computational cost, will we consider the model a good one? If we look slightly deeper into the implementation of some recent models, we can find that multi-scale testing is heavily used in the evaluations on the WIDER Face benchmark. If we resize the input image to many different scales, such as 1/4, 1/2, 1, 3/2, 2, 4 and more, and feed all those resized images into a detector, the combined results will have a better AP. In other words, the gain is achieved by assembling and suppressing (NMS) the multi-scale outputs, and is independent of the backbone of the underlying face detector. We list the scales used by some models in Table 1. None of them tested an image using only one scale, and the trend is toward using more scales. There is a risk that multiple scales with a heavy computational cost are employed and outstanding accuracy is claimed, which overshadows the performance gain from the detector itself, while the computational cost of such a multi-scale operation remains unknown. It is also worth noting that most benchmarks do not evaluate the computational cost. Most often, it is difficult for users to know how the improvement is achieved: by a better backbone, or by the follow-up computation-intensive multi-scale ensemble strategy? A minimal sketch of this multi-scale ensemble is given below.
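To make the cost of this practice concrete, below is a minimal Python sketch of multi-scale testing with NMS fusion. It is illustrative only: detect_single_scale is a hypothetical stand-in for any single-scale face detector returning boxes and scores, and the greedy NMS here is a simple reference implementation, not the code of any model discussed in this survey.

import numpy as np
import cv2

def nms(boxes, scores, iou_thr=0.3):
    # Greedy NMS over [x1, y1, x2, y2] boxes; returns indices of kept boxes.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]
    return keep

def multi_scale_detect(img, detect_single_scale, scales=(0.25, 0.5, 1.0, 1.5, 2.0)):
    # Run the detector once per scale and fuse all results with NMS.
    # Every extra scale adds a full forward pass, so the total FLOPs
    # grow roughly with the square of each scale factor.
    all_boxes, all_scores = [], []
    for s in scales:
        resized = cv2.resize(img, None, fx=s, fy=s)
        boxes, scores = detect_single_scale(resized)  # boxes in resized coords
        all_boxes.append(boxes / s)                   # map back to original image
        all_scores.append(scores)
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]

The fused AP thus partly reflects the ensembling itself, which is exactly why combined results should be reported together with the extra computational cost.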
We do expect a perfect face detector that is robust and accurate even for faces in extremely difficult conditions, while being extremely fast with a low computational cost. However, we all know the no free lunch theorem. Therefore, in this survey, we investigate the recent deep learning-based face detection methods and evaluate them in terms of accuracy and computational cost. The main contributions are as follows.

1) Different from previous face detection surveys [28], [29], [30], [31], [32], in which the content is mainly built on reviewing traditional methods, our survey focuses on deep learning-based face detectors. We have noted the existence of surveys [33], [34], [35] on deep learning; however, they focus on generic object detection, not specifically on face detection. In this paper, we provide a clear view of the path by which deep learning-based face detection has evolved in recent years.

2) Accuracy and efficiency are both studied and analyzed in the paper. In addition to detailed introductions to deep learning-based face detectors, some experiments are carried out to analyze different deep face detectors using different metrics. Some tricks to improve accuracy are also introduced, so the paper can help readers better understand how good accuracy and efficiency can be achieved.

3) With a focus on the efficiency of face detectors, comprehensive experiments are carried out to evaluate the accuracy and particularly the efficiency of different face detectors. In addition to latency, we also propose an accurate metric for the computational cost of a CNN model: FLoating point OPerations (FLOPs) under certain rules. FLOPs is more neutral than latency, which heavily depends on hardware and the deep network structure. The code to compute the FLOPs has been released at https://github.com/fengyuentau/PyTorch-FLOPs.git.

The rest of the paper is organized as follows. Some key challenges in face detection are summarized in Section 2. In Section 3, we provide a roadmap to describe the development of deep learning-based face detection with detailed reviews. In Section 4, we review several fundamental subproblems including backbones, context modeling, the handling of face scale variations and proposal generation. Popular datasets for face detection and state-of-the-art performances are presented in Section 5. Section 6 reveals the relation between computational cost and AP by conducting extensive experiments on several open-source one-stage face detectors. In addition, speed-focusing face detectors collected from GitHub are reviewed in Section 7. Finally, we conclude the paper with a discussion on future challenges in face detection in Section 8.

1. State-of-the-art AP can be found on the official result pages of the datasets, and at https://paperswithcode.com/task/face-detection, which also collects results from published papers.
2 MAIN CHALLENGES

Most face-related applications need clear frontal faces. Detecting a clear frontal face is a relatively easy task. Some may argue that faces which are tiny or occluded are useless for the next step, such as face recognition; but that is not so. Effectively detecting any faces in extremely difficult conditions can greatly improve the perception capability of a computer, but it is still a challenging task. If a face is detected and evaluated as a bad-quality sample, the subject can be asked to move closer to the camera, or the camera can adjust automatically for a better image. Face detection is still a problem far from being well solved. Many challenges do still exist.

Accuracy-related challenges come from face appearance and imaging conditions. In real-world scenes there are many different kinds of face appearance, varying in skin color, makeup, expression, the wearing of glasses or a mask, and so on. In unconstrained environments, imaging a face can be impacted by various lighting, viewing angles and distances, backgrounds, and weather conditions. The face images will vary in illumination, pose, scale, occlusion, blur and distortion. Face samples in difficult conditions can be found in Fig. 1. There have been several datasets and competitions featuring face detection in unconstrained conditions, such as FDDB [24], WIDER Face [3] and the WIDER Face Challenge 2019². More than 45% of the faces in WIDER Face are smaller than 20 × 20 pixels. In most face-related applications, we seldom need small faces whose sizes are less than 20 pixels. However, if we can detect small or even tiny faces, we can resize the original large images to smaller ones and send them to a face detector. Then the computational cost can be greatly reduced, since we only need to detect faces in smaller images. Therefore, better accuracy sometimes also means higher efficiency.

Masked face detection is becoming more important, since people are wearing and will continue to wear masks to prevent COVID-19 in the next few years. Face-related applications did not consider this situation in the past. Wearing masks obviously reduces detection accuracy. Some masks are even printed with logos or cartoon figures. All of those can disrupt face detection. If a face has a mask and sunglasses at the same time, face detection will be even more difficult. Therefore, in the next few years, masked face detection should be explored and studied.

Efficiency-related challenges are brought by the great demands on edge devices. With the increasing demands on edge devices, such as smartphones and intelligent CCTV cameras, a massive amount of data is generated per day. We frequently take selfies and photos of others, hold long video meetings, etc. Modern CCTV cameras record 1080P videos constantly at 30 FPS. These result in a great demand for facial data analysis, and the amount of data is considerable. In contrast, edge devices have limited computational capability, storage and battery life for running advanced deep learning-based algorithms. In this case, efficient face detection is essential for face applications on edge devices.

2. https://competitions.codalab.org/competitions/20146

3 FACE DETECTION FRAMEWORKS

Before deep learning was used for face detection, cascaded AdaBoost-based classifiers were the most popular classifiers for face detection. The features used in AdaBoost were designed specifically for faces, not generic objects. For example, the Haar-like [2] feature can describe facial patterns of the eyes, mouth and others. In recent years, facial features can be automatically learnt from data via deep learning techniques. Therefore, many deep learning-based face detectors are inspired by modern network architectures designed for object detection. Following the popular manner of organizing object detection frameworks, we organize deep learning-based face detectors into three main categories:

• Multi-stage face detection frameworks. This category is inspired by cascaded classifiers in face detection and is an early exploration of applying deep learning techniques to face detection.

• Two-stage face detection frameworks. The first stage generates some proposals, and the proposals are confirmed in the second stage. The efficiency should be better than that of multi-stage ones.

• One-stage face detection frameworks. Feature extraction and proposal generation are performed in a single unified network. These frameworks can be further categorized into anchor-based methods and anchor-free methods.

To show how deep learning-based face detection has evolved, milestone face detectors and some important object detectors are plotted in Fig. 3. The two-stage and multi-stage face detectors are on the top branch, and the single-stage ones are on the bottom branch. The generic object detectors are in the middle branch and in blue. A more detailed introduction of those detectors is provided in the following subsections.
Fig. 3: Timeline of milestone face detectors [10], [12], [13], [14], [17], [18], [19], [20], [21], [25], [36], [37], [38], [39], [40], [41], [42], and remarkable works from object recognition [43], [44] and object detection [6], [8], [11], [15], [16], [45] (marked in blue, attached to the middle branch). Since the proposal of AlexNet [46], various face detection works inspired by deep learning techniques from object recognition and object detection have been published in the post-2012 deep learning-based face detection era. The top branch holds two/multi-stage face detectors, while the bottom branch holds one-stage detectors, which have become the most popular network design adopted by researchers.
3.1 Multi-stage and Two-Stage Face Detectors

In the early era when deep learning techniques entered face detection, face detectors were designed to have multiple stages, also known as the cascade structure, which had been widely used in most early face detectors. With the remarkable breakthrough brought by Faster R-CNN [6], some researchers turned to improving Faster R-CNN based on face data.

In the cascade structure, features are usually extracted and refined one or multiple times before being fed into classifiers and regressors, so as to reject most of the sliding windows and improve efficiency. As shown on the result page³ of FDDB [24], Li et al. made an early attempt and proposed their CNN-based face detector, named CascadeCNN [36]. CascadeCNN consists of 3 stages of CNNs, as shown in Fig. 4. Sliding windows are first resized to 12 × 12 pixels and fed into the shallow 12-net to reduce candidate windows by 90%. The remaining windows are then processed by the 12-calibration-net to refine their sizes for face localization. Retained windows are then resized to 24 × 24 as the input for the combination of the 24-net and 24-calibration-net, and so on for the next CNN combination. CascadeCNN achieved state-of-the-art performance on AFW [23] and FDDB, while reaching a compelling speed of 14 FPS for typical 640 × 480 VGA images on a 2.0 GHz CPU. Another attempt at cascaded CNNs for face detection is the well-known MTCNN [12] proposed by Zhang et al. MTCNN is composed of 3 subnetworks: P-Net for obtaining candidate facial windows, R-Net for rejecting false candidates and refining the remaining candidates, and O-Net for producing the final output with both face bounding boxes and landmarks in a multi-task manner. P-Net is a shallow fully convolutional network with 6 CONV layers, which can take images of any size as input. MTCNN was a great success with large, state-of-the-art advantages on WIDER Face [3], FDDB and AFW, while reaching 16 FPS on a 2.6 GHz CPU.

In the object-detection-fashion two-stage network architectures, a region proposal network (RPN) [6] is required to generate object proposals. The RPN can be considered a straightforward classification CNN, which generates proposals based on preset anchors over CNN features, filters out non-objects and refines object proposals. However, as the CNNs shrink the image to extract features, the corresponding output features for tiny faces can be less than 1 pixel, which is insufficient to encode rich information. To address this problem, Zhu et al. proposed CMS-RCNN [38], which is equipped with a contextual multi-scale design for both the RPN and the final detection. As shown in Fig. 4, multi-scale features from conv3, conv4 and conv5 are concatenated, by shrinking them into the same shape as conv5, as the input for the RPN, so as to collect more information for tiny faces and also improve the localization capability from low-level layers. CMS-RCNN achieved APs of 0.899, 0.874 and 0.624 on the easy, medium and hard sets of the WIDER Face dataset respectively, outperforming MTCNN by 0.051 (Easy), 0.049 (Medium) and 0.016 (Hard).

In addition to CMS-RCNN, there are other works making improvements based on Faster R-CNN. Bootstrapping Faster R-CNN [47] builds a training dataset by iteratively adding false positives from a model's output to optimize Faster R-CNN. Face R-CNN [9] adopts the same architecture as Faster R-CNN with center loss, online hard example mining and a multi-scale training strategy. FDNet [48] exploits multi-scale training and testing and a vote-based NMS strategy on top of Faster R-CNN with a light-head design. Position-sensitive average pooling was proposed in Face R-FCN [10] to assign different weights to different parts of the face, based on R-FCN [11]. With improvements considering the special patterns of face data, these methods achieved better performance than their original versions on the same WIDER Face dataset. A schematic sketch of the cascade filtering idea is given below.

3. http://vis-www.cs.umass.edu/fddb/results.html
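The cascade logic above can be summarized in a few lines. This is a schematic sketch in the spirit of CascadeCNN, not its actual implementation: net12, net24, net48 and the thresholds are hypothetical stand-ins for the stage classifiers, and the calibration nets that refine window positions between stages are omitted for brevity.

import numpy as np
import cv2

def run_cascade(img, windows, net12, net24, net48, thresholds=(0.5, 0.7, 0.9)):
    # windows: (N, 4) array of [x1, y1, x2, y2] sliding windows.
    # Each stage scores the surviving windows at a higher resolution;
    # the cheap early stages reject most windows to save computation.
    stages = ((net12, 12), (net24, 24), (net48, 48))
    for (net, size), thr in zip(stages, thresholds):
        crops = np.stack([cv2.resize(img[y1:y2, x1:x2], (size, size))
                          for x1, y1, x2, y2 in windows.astype(int)])
        scores = net(crops)              # face probability per window
        windows = windows[scores > thr]  # keep only confident windows
        if len(windows) == 0:
            break
    return windows

Because the later, more expensive stages only see the survivors, the total cost of such a pipeline depends directly on how many face candidates an image contains.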
Fig. 4: Diagrams of milestone multi/two-stage face detectors [12], [36], [38]. Other detectors share architectures similar to these three.
Whether in the cascaded multi-stage or the two-stage network design, the computation depends heavily on the number of faces in the image; as the number of faces grows, more proposals are passed to the next stage in the interior of the network. Notably, the multi-scale test metric, which usually enlarges the images multiple times to make tiny faces detectable, can dramatically increase the computational cost on this basis. Considering that the number of faces in an image from an actual scene varies from one face in a selfie to many faces in a large group photo, we should take into account the runtime robustness of cascade or two-stage networks.

3.2 One-Stage Face Detectors

In real-time face-related applications, face detection must be performed in real time. If the system is deployed on edge devices, the computing power is low. In those kinds of situations, one-stage face detectors are more suitable, since their processing time is stable regardless of how many faces there are in the images. Different from the multi/two-stage detectors, one-stage face detectors perform feature extraction, proposal generation and face detection in a single unified convolutional neural network, whose runtime efficiency is independent of the number of faces. Dense anchors are designed to replace the proposals of two-stage detectors [14]. Starting from CornerNet [45], an increasing number of works use the anchor-free mechanism in their frameworks.

HR [13], proposed by Hu et al., is one of the first to perform anchor-based face detection in a unified convolutional neural network. The backbone of HR is ResNet-101 [44] with layers truncated after conv4_5. Early feature fusion on layers conv3_4 and conv4_5 is performed to encode context, since high-resolution features are beneficial for small face detection. Through experiments on faces clustered into 25 scales, 25 anchors are defined for the 2X, 1X and 0.5X inputs, to achieve the best performance over the three input scales. HR outperformed CMS-RCNN [38] by 0.199 on the WIDER Face validation hard set; more importantly, the runtime of HR is independent of the number of faces in the image, while CMS-RCNN's scales up linearly with the number of faces.

Different from HR, SSH [14] attempts to detect faces at different scales on different levels of features, as shown in Fig. 5. Taking VGG-16 [43] as the backbone, SSH detects faces on the enhanced features from conv4_3, conv5_3 and pool5 for small, medium and large faces respectively. SSH introduces a module (the SSH module) that greatly enriches the receptive fields to better model the context of faces. The SSH module is widely adopted by later works [18], [20], [21], [40], and turns out to be efficient for performance boosting.
6
conv7 pool5
Head SSH Block Head
conv6 Head
conv5 SSH Block Head
Head
conv_fc
res4 Fuse Head Head conv4 Fuse SSH Block Head
res3 conv5 L2 Norm Head
Head conv3
conv4
res2 res3 res4
conv3 conv2 (N, C, H, W)
res1
Conv DeConv conv2 conv1 1/4C-
CONV
Elt-Add
conv1 1/4C-
CONV
Concat
Conv
LFPN DSS SSH Head Center
LFPN Dilated SSH Head DeConv
conv5 Head conv5 Heatmap
LFPN DSS SSH Head LFPN Dilated SSH Head
C5
Concat
DeConv
Conv
conv4 LFPN DSS SSH Head
Head conv4 LFPN Dilated SSH Head C4 DeConv
Head
Conv
conv3 conv3 Scale
C3 DeConv
Heatmap
Current Current
conv2 Up feature map feature map conv2 Up feature map feature map C2
conv1 Conv Conv conv1 Conv Conv
Fig. 5: Diagrams of milestone one-stage face detectors [13], [14], [17], [18], [20], [25].
Since S3FD [17], many one-stage face detectors [18], [19], [20], [21], [25], [40], [41], [42] have fully utilized multi-scale features, attempting to achieve scale-invariant face detection. S3FD extends the headless VGG-16 [43] with more convolutional layers, whose strides gradually double from 4 to 128 pixels, so as to cover a larger range of face scales. PyramidBox [18] adopts the same backbone as S3FD, integrates an FPN [15] to fuse adjacent-level features for semantic enhancement, and improves the SSH module with wider and deeper convolutional layers inspired by Inception-ResNet [49] and DSSD [50]. DSFD [20] also inherits the backbone from S3FD, but enhances the multi-scale features with a Feature Enhance Module (FEM), so that detection can be made in two shots - one from the non-enhanced multi-scale features, and the other from the enhanced features. The same-scale features from the second shot have larger RFs than those from the first shot, but also smaller RFs than the next-level features from the first shot, indicating that the face scales are split in a more refined way across these multi-scale detection layers. Similarly, SRN [19] has a dual-shot network but is trained differently on multi-scale features: low-level features need two-step classification for refinement, since they have higher resolution and contribute the vast majority of anchors and also negative samples; additionally, high-level features have lower resolution, which makes it worth applying two-step regression using Cascade R-CNN [51] to obtain more accurate bounding boxes.

There are also some significant anchor-based methods using the FPN [15] as the backbone. RetinaFace adds one more pyramid layer on top of the FPN and replaces CONV layers with deformable convolution networks (DCN) [52], [53] within the FPN's lateral connections and context module. RetinaFace models a face in three ways: a 3D mesh (1k points), a 5-landmark mask (5 points), and a bounding box (2 points). Cascade regression [51] is employed with a multi-task loss in RetinaFace to achieve better localization. Instead of using handcrafted structures, Liu et al. proposed BFBox, which explores face-appropriate FPN architectures using the successful Neural Architecture Search (NAS). Liu decouples the FPN into the backbone and the FPN connections, the former of which can be replaced by VGG [43], ResNet [44] or a backbone from NAS, and the latter of which can be top-down, bottom-up or cross-level fusion from NAS.

Since the proposal of CornerNet [45] back in 2018, which directly predicts the top-left and bottom-right points of bounding boxes instead of relying on prior anchors, many explorations [54], [55], [56], [57] have been made to remodel object detection more semantically using the anchor-free design. CSP models a face bounding box as a center point and the scale of the box, as shown in Fig. 5. CSP takes multi-scale features from a modified ResNet-50 [44] and concatenates them using transposed convolution layers, to take advantage of rich global and local information for the detection heads. In particular, the anchor-free detection head can also be an enhancement module for anchor-based heads. ProgressFace [42] appends an anchor-free module to provide more positive anchors for the highest-resolution feature maps in the FPN, so as to reduce the imbalance of positive and negative samples for small faces. A sketch of this center-and-scale decoding is given at the end of this section.

One-stage frameworks have been popular in face detection in recent years for the following three reasons. (a) The runtime of one-stage face detectors is independent of the number of faces in an image by design. Therefore, it enhances the robustness of the runtime efficiency. (b) It is computationally efficient and straightforward for one-stage detectors to reach near scale invariance by contextual modeling and multi-scale feature sampling. (c) Face detection is a relatively less complex task than general object detection. This means that innovations and advanced network designs in object detection can be quickly adjusted to face detection by considering the special patterns of faces.
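For concreteness, the following is a sketch of how a CSP-style anchor-free head can be decoded: a center heatmap provides face centers and a scale map provides the box size at each center. The stride, score threshold and the fixed width-to-height ratio are illustrative assumptions, not values taken from the CSP paper.

import numpy as np

def decode_center_scale(center_heatmap, scale_map, stride=4, score_thr=0.5, ratio=0.8):
    # center_heatmap: (H, W) face-center probabilities at 1/stride resolution.
    # scale_map: (H, W) predicted log(face height) at each location.
    ys, xs = np.where(center_heatmap > score_thr)
    boxes, scores = [], []
    for y, x in zip(ys, xs):
        h = np.exp(scale_map[y, x])          # decode log-scale back to pixels
        w = ratio * h                        # width derived from a fixed aspect ratio
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        scores.append(center_heatmap[y, x])
    return np.array(boxes), np.array(scores)

NMS is still applied afterwards, since neighboring heatmap locations can fire on the same face.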
4 FACE REPRESENTATION

The key idea of face detection has never changed, whether in the traditional era or the deep learning era: it finds the common patterns of all faces in the dataset. In the traditional era, many handcrafted features, such as SIFT [58], Haar [2]
and HOG [59], were employed to extract local features from the image, which were aggregated by approaches such as AdaBoost into a higher-level representation of faces.

Different from traditional methods, which require rich prior knowledge to design handcrafted features, deep convolutional neural networks can directly learn even more powerful features from face images. A deep learning-based face detection model can be considered as two parts: a CNN backbone and several detection branches. Starting from some popular CNN backbones, we introduce the feature extraction methods that can handle face scale invariance, as well as several strategies to generate proposals for face detection.

4.1 Popular CNN Backbones

In most deep face detectors there is a CNN backbone for feature extraction. Some popular backbone networks are listed in Table 2. They are VGG-16 from the VGGNet [43] series, ResNet-50/101/152 from the ResNet [44] series, and MobileNet [60]. These models are powerful and can achieve good accuracy on face detection, but they are a little heavy.

VGG-16 is the first choice for the baseline backbone of many face detectors, such as SSH [14], S3FD [17] and PyramidBox [18]. Performance improvements can easily be obtained by simply swapping the backbone from VGG-16 to ResNet-50/101/152 [44], as shown in [20]. Since the state of the art has achieved AP > 0.900 even on the WIDER Face hard sets, it is common for recent face detectors [20], [42], [62] to be equipped with a deeper and wider backbone for a higher AP, such as ResNet-152 and ResNets with FPN [15] connections. Liu et al. employ Neural Architecture Search (NAS) to search for face-appropriate backbones and FPN connections.

One of the most inexpensive choices is ResNet-50, which is listed in Table 2; it has fewer parameters and fewer FLOPs, while achieving very similar performance compared to deeper nets. Another choice for state-of-the-art face detectors to reach real-time speed is to change the backbone to MobileNet [60], which has performance similar to VGG-16 but one order of magnitude less in '#Params' and FLOPs. A quick way to compare backbone sizes is sketched below.
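As a rough way to reproduce such a '#Params' comparison, the backbones can be instantiated from torchvision and their parameters counted. This is a sketch assuming a recent torchvision (with the weights= API); note that torchvision ships MobileNetV2 rather than the original MobileNet, so mobilenet_v2 serves as a stand-in here, and detection heads would add further parameters on top of these backbones.

from torchvision import models

for name, ctor in [("vgg16", models.vgg16),
                   ("resnet50", models.resnet50),
                   ("resnet101", models.resnet101),
                   ("mobilenet_v2", models.mobilenet_v2)]:
    net = ctor(weights=None)  # architecture only, no pretrained weights
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")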
One of the major challenges for face detection is the large span of face scales. As the statistics in Fig. 6 show, there are 157,025 and 39,123 face bounding boxes in the train and validation sets, respectively.

(Fig. 6: percentage histograms of face bounding box scales on the WIDER Face train (train_bboxes) and validation (val_bboxes) sets.)
Faces at scales outside the detectable range cannot match enough anchors, which will result in a low recall rate.

A simple solution for a trained face detector is to perform a multi-scale test on an image pyramid, which is built by progressively resizing the original image. This is equal to re-scaling the faces, and hopefully brings out-of-range faces back into the detectable range of scales. This solution does not require re-training the detector, but it may come with a sharp increase in redundant computation, since there is no certain answer to how deep a pyramid we should build to match the extent of scale invariance of a trained CNN.

Another, better solution to face scale invariance is to make full use of the feature maps produced in CNNs. One can easily observe that the layers of standard CNN backbones gradually decrease in size. The subsampling of these layers naturally builds up a pyramid with different strides and receptive fields (RFs), producing multi-scale feature maps. In general, high-level feature maps, produced by later layers with large RFs, encode strong semantic information, which leads to robustness to variations such as illumination, rotation and occlusion. Low-level feature maps, produced by early layers with small RFs, are less sensitive to semantics but have high resolution and rich details, which are beneficial for localization. To take advantage of both, a number of methods have been proposed, which can be categorized into modeling context, detecting on a feature pyramid, and predicting face scales.

Modeling context: Additional context is essential for detecting faces, especially small ones. HR [13] shows that context modeling by fusing feature maps of different scales can dramatically improve the accuracy of detecting small faces. Following a similar fusion strategy as HR, [27] detects on three different dilated CONV branches, aiming to enlarge the RF without too much increase in computation. [38] downsamples feature maps of strides 4 and 8 to concatenate with those of stride 16, so as to improve the capability of the RPN to produce proposals for faces at different scales. SSH [14] exploits an approach similar to Inception [63], which concatenates the outputs from three CONV branches that have 3 × 3, 5 × 5 and 7 × 7 filters respectively. PyramidBox [18] first adopts an FPN [15] module to build up context, further enhanced by deeper and wider SSH modules. [20] improves the SSH module by replacing CONV layers with dilated CONV layers. [25] upsamples feature maps of strides 8 and 16 to concatenate with those of stride 4, which are fed to an FCN to produce center, scale and offset heatmaps. The fusion of feature maps encodes rich semantics from high-level feature maps together with rich geometric information from low-level feature maps, based on which the detectors can improve their localization and classification capability towards face scale invariance. Meanwhile, the fusion of feature maps also introduces more layers, such as CONV and POOL, to adjust scales and channels, which creates additional computational overhead. A minimal sketch of such a fusion block is given at the end of this subsection.

Detecting on a feature pyramid: Inspired by SSD [8], a majority of recent approaches, such as [14], [17], [18], [19], [20], [21], detect on multiple feature maps of different scales respectively, and combine the detection results. This is considered an effective method for weighing between speed and accuracy. SSD [8] puts default boxes on each pixel of the feature maps from 6 detection layers that have strides of 8, 16, 32, 64 and 128. Sharing a similar CNN backbone with SSD, [17], [18] detect on a wider range of layers, whose strides gradually double from 4 to 128 pixels. SRN [19] and DSFD [20] introduce the two-stream mechanism, which detects on both the detection layers from the backbone and extra layers applied on the detection layers for feature enhancement. Different from subsampling on more layers, [14], [21], [26] detect only on the last three levels of feature maps, which are enhanced by their context modeling methods. By detecting on a feature pyramid, detection layers are implicitly trained to be sensitive to different scales, while it also leads to an increase in model size and redundant computation, since the dense sampling may cause duplicate results from adjacent-level layers.

Predicting face scales: To eliminate the redundancy of pyramids, several approaches [64], [65], [66] predict the face scales before making a detection. [64] first generates a global face scale histogram from the input image with a Scale Proposal Network (SPN), which is trained with image-level ground-truth histogram vectors and without face location information. A sparse image pyramid is built according to the output histogram, so as to have faces rescaled to the detectable range of the later single-scale RPN. Similarly, [65] detects on a feature pyramid without unnecessary scales, which is built by feeding the scale histogram to sequential ResNet [44] blocks that downsample feature maps recursively. [66] predicts not only face scales but also face locations by a shallow ResNet18 [44] with scale attention and spatial attention attached, named S2AP. S2AP generates a 60-channel feature map, meaning that face scales are mapped to 60 bins, each of which is a spatial heatmap that has a high response to its responsible face scale. With the 60-channel feature maps, it is possible to reduce unnecessary computation in the low-response channels and low-response spatial areas by masked convolution.
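Below is a minimal PyTorch sketch of the high-to-low fusion pattern described above (an FPN-style lateral connection); the channel sizes are illustrative, not those of any specific detector.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    # Fuse a semantically strong high-level map into a high-resolution
    # low-level map: 1x1 convs align channels, the high-level map is
    # upsampled, and a 3x3 conv smooths the element-wise sum.
    def __init__(self, c_low, c_high, c_out=256):
        super().__init__()
        self.lat = nn.Conv2d(c_low, c_out, kernel_size=1)
        self.top = nn.Conv2d(c_high, c_out, kernel_size=1)
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, low, high):
        high = F.interpolate(self.top(high), size=low.shape[-2:], mode="nearest")
        return self.smooth(self.lat(low) + high)

# Example: fuse a stride-16 map (1024 channels) into a stride-8 map (512 channels).
fuse = FuseBlock(c_low=512, c_high=1024)
out = fuse(torch.randn(1, 512, 80, 80), torch.randn(1, 1024, 40, 40))  # (1, 256, 80, 80)

The extra 1x1 and 3x3 convolutions here are exactly the additional overhead mentioned above.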
4.3 Proposal Generation

Faces in the wild can appear at any possible location and scale in an image. The general pipeline of most early successful face detectors is to first generate proposals in a sliding-window manner, extract features from the windows using handcrafted descriptors [2], [23], [67], [68] or CNNs [12], [36], and finally apply face classifiers. However, inspired by the RPN [6] and SSD [8], modern anchor-based face detectors generate proposals by applying k anchor boxes on each pixel of the extracted CNN features. Specifically, 3 scales and 3 aspect ratios are used in Faster R-CNN [6], yielding k = 9 anchors on each pixel of the feature maps. Moreover, the detection layer takes the same feature maps as input, yielding 4k outputs encoding the coordinates of the k anchor boxes from the regressor and 2k outputs for face scores from the classifier.

Considering that most face boxes are near square, modern face detectors tend to set the aspect ratio of anchors to 1, while the scales vary. HR [13] defines 25 scales so as to match the clustering results on the WIDER Face [3] training set. S3FD assigns an anchor scale of 4 times the stride of the current layer, to keep anchor sizes smaller than the effective receptive fields [69] and to ensure the same density of different-scale anchors on the image. PyramidBox [18] introduces PyramidAnchors, which generate a group of anchors with larger regions corresponding to a face, such as head and body boxes, to provide more context to help detect faces. In [70], extra shifted anchors are added to increase the anchor sample density, which significantly increases the average IoU between anchors and small faces. GroupSampling [71] assigns anchors of different scales only on the bottom pyramid layer of the FPN [15], but it groups all training samples according to the anchor scales and randomly samples from the groups, to ensure that the positive and negative sample ratios between groups are the same. A sketch of dense anchor generation is given below.
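The anchor mechanics described above fit in a few lines. Below is a sketch that tiles one square anchor per location following the S3FD rule of thumb (anchor size = 4 x stride); k grows with the number of scales per location, and the heads then output 4k box coordinates and 2k scores accordingly.

import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(4,)):
    # Returns (feat_h * feat_w * len(scales), 4) anchors as [x1, y1, x2, y2],
    # square (aspect ratio 1) anchors centered on each feature-map pixel.
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                half = s * stride / 2.0
                anchors.append([cx - half, cy - half, cx + half, cy + half])
    return np.array(anchors)

# A stride-8 layer on a 640x640 input gives an 80x80 map and, with k = 1,
# 6,400 anchors of size 32x32.
print(make_anchors(80, 80, stride=8).shape)  # (6400, 4)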
5 DATASETS AND EVALUATION

To evaluate different face detection algorithms, datasets are needed. There have been several public datasets: FDDB [24], AFW [23], PASCAL Face [22], MALF [74], WIDER Face [3], MAFA [75], 4K-Face [79], UFDD [80] and DARK Face [81]. These datasets all consist of colored images from real-life scenes. Different datasets may utilize different evaluation criteria. In Section 5.1, we present overviews of the different datasets, covering statistics such as the number of images and faces, the source of images, the rules of labeling and the challenges brought by each dataset. A detailed analysis of the face detection evaluation criteria is included in Section 5.2. Detection results on the datasets are provided and analyzed in Section 5.3.
5.1 Datasets

Some essential statistics of the currently accessible datasets are summarized in Table 3, including the total numbers of images and faces, the number of faces per image, how the data was split into different sets, etc. More details are introduced in the following part.

FDDB⁴ [24] is short for Face Detection Dataset and Benchmark, which has been one of the most popular datasets for face detector evaluation since its publication in 2010. The images of FDDB were collected from Yahoo! News, 2,845 of which were selected after filtering out duplicate data. Faces were excluded for the following factors: (a) height or width less than 20 pixels; (b) the two eyes being non-visible; (c) the angle between the nose and the ray from the camera to the head being less than 90 degrees; (d) failure to estimate the position, size or orientation of the face by a human. This left 5,171 faces, which were annotated by drawing elliptical face regions covering from the forehead to the chin vertically, and from the left cheek to the right cheek horizontally. FDDB helped advance unconstrained face detection in terms of robustness to expression, pose, scale and occlusion. However, its images can be heavily biased toward celebrity faces, since they were collected from the news. It is also worth noting that although the elliptical style of face label adopted by FDDB is closer to human cognition, it was not adopted by later datasets and deep learning-based face detectors, which favor the bounding box style with its relatively easier method for defining positive/negative samples by calculating the Intersection over Union (IoU).

Zhu et al. built the Annotated Faces in-the-Wild (AFW⁵) dataset [23] by randomly sampling images with at least one large face from Flickr. 468 faces were annotated in 205 images, each labeled with a bounding box and 6 landmarks. PASCAL Face⁶ [22] was constructed by selecting 851 images from the PASCAL VOC [1] test set, with 1,335 faces annotated. Since the two datasets were built to help evaluate the face detectors proposed by [23] and [1], they only contain a few hundred images, resulting in limited variations in face appearance and background.

Yang et al. created the Multi-Attribute Labelled Faces [74] (MALF⁷) dataset for fine-grained evaluation of face detection in the wild. The MALF dataset contains 5,250 images from Flickr and Baidu Search with 11,931 faces labeled, which is an evidently larger dataset than FDDB, AFW and PASCAL Face. The faces in MALF were annotated by drawing axis-aligned square bounding boxes, attempting to contain a complete face with the nose in the center of the bounding box. This may introduce noise for training face detectors, since a square bounding box containing a 90-degree side face can have over half of its content being cluttered background. In addition to labeling faces, some attributes were also annotated, such as gender, pose and occlusion.

In 2016, WIDER Face⁸ [3] was released, and it has since been the most popular and widely used face detection benchmark. The images in WIDER Face were collected from popular search engines for predefined event categories following LSCOM [82], and examined manually to filter out similar images and images without faces, resulting in 32,203 images in total for 61 event categories, which were split into 3 subsets for training, validation and testing. To keep large variations in scale, occlusion and pose, the annotation was performed following two main policies: (a) a bounding box should tightly contain the forehead, chin and cheeks, and is drawn for each recognizable face; and (b) an estimated bounding box should be drawn for an occluded face. This produced 393,703 annotated faces in total. The number of faces per image reaches 12.2, and 50% of the faces have a height between 10 and 50 pixels. WIDER Face outnumbers the other datasets in Table 3 by a large margin. This means WIDER Face pays never-seen-before attention to small face detection by providing a large number of images with the densest small faces for training, validation and testing. Furthermore, the authors of WIDER Face defined 'easy', 'medium' and 'hard' levels for the validation and test sets based on the detection rate of EdgeBox [83]. This offers a much more detailed and fine-grained evaluation for face detectors. Hence, the WIDER Face dataset has greatly advanced the research on CNN-based face detectors, especially multi-scale CNN designs and the utilization of context.

The last four datasets listed in Table 3 are less generic than those reviewed above, and focus on face detection in specified and different aspects. The MAFA⁹ [75] dataset focuses on masked face detection, containing 30,811 images with 39,485 masked faces labeled. In addition to the locations of eyes and masks, the orientation of the face, the occlusion degree and the mask type were also annotated for each face.

4. http://vis-www.cs.umass.edu/fddb/
5. http://www.cs.cmu.edu/~deva/papers/face/index.html
6. http://host.robots.ox.ac.uk/pascal/VOC/
7. http://www.cbsr.ia.ac.cn/faceevaluation/
8. http://shuoyang1213.me/WIDERFACE/
9. http://www.escience.cn/people/geshiming/mafa.html
TABLE 3: Comparison of currently accessible face detection datasets, listed in the order of publication or starting year. Note that UCCS [72] and WILDEST Face [73] are not included because their data is not currently available. The 'Variations' marks cover blur, appearance, illumination, occlusion and pose ('Blur', 'App.', 'Ill.', 'Occ.', 'Pose').

Dataset            #Images   #Faces    #Faces/Image   AVG Resolution (W × H)   Train/Val/Test   Variations
FDDB [24]          2,845     5,171     1.8            377 × 399                - / - / 100%     X X X
AFW [23]           205       468       2.3            1491 × 1235              - / - / 100%     X X
PASCAL Face [22]   851       1,335     1.5            -                        - / - / 100%
MALF [74]          5,250     11,931    2.2            -                        - / - / 100%     X X X
WIDER Face [3]     32,203    393,703   12.2           1024 × 888               40% / 10% / 50%  X X X X X
MAFA [75]          30,811    39,485    1.2            516 × 512                85% / - / 15%    X X
IJB-A [76]         48,378    497,819   10.2           1796 × 1474              50% / - / 50%    X X X X
IJB-B [77]         76,824    135,518   1.7            894 × 599                - / - / 100%     X X X X
IJB-C [78]         138,836   272,335   1.9            1010 × 671               - / - / 100%     X X X X
4K-Face [79]       5,102     35,217    6.9            3840 × 2160              - / - / 100%
UFDD [80]          6,425     10,897    1.6            1024 × 774               - / - / 100%     X X
DARK Face [81]     6,000     43,849    7.3            1080 × 720               100% / - / -     X
The IJB series¹⁰ [76], [77], [78] was collected for multiple tasks, including face detection, verification, identification and identity clustering. IJB-C is the combination of IJB-A and IJB-B with some new face data. 4K-Face¹¹ [79] was built for the evaluation of large-face detection, and contains 5,102 4K-resolution images with 35,217 large faces (>512 pixels). UFDD¹² [80] provides a test set with 6,425 images and 10,897 faces under variations of weather conditions and degradations such as lens impediments. DARK Face¹³ [81] concentrates on face detection in low-light conditions, and provides 6,000 low-light images for training dark face detectors. Since the images are captured in real-world nighttime scenes such as streets, each image in DARK Face contains 7.3 faces on average, which is relatively dense.

10. https://www.nist.gov/programs-projects/face-challenges
11. https://github.com/Megvii-BaseDetection/4K-Face
12. https://ufdd.info
13. https://flyywh.github.io/CVPRW2019LowLight/

5.2 Accuracy Evaluation Criterion

There are mainly two accuracy evaluation criteria adopted by the datasets reviewed above. One is the receiver operating characteristic (ROC) curve, obtained by plotting the true positive rate (TPR) against the false positives, as adopted by FDDB [24], MALF [74], UCCS [72] and IJB [78]. The other is the most popular evaluation criterion from PASCAL VOC [1], which plots the precision against the recall while calculating the average precision (AP), as adopted by AFW [23], PASCAL Face [22], WIDER Face [3], MAFA [75], 4K-Face [79], UFDD [80], DARK Face [81] and Wildest Face [73]. Since these two criteria are different methods for revealing the performance of detectors under the same calculation of the confusion matrix¹⁴, we choose the most popular criterion, AP calculated from the precision-against-recall curve, in this paper.

To get a precision-against-recall curve, the confusion matrix, which defines the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) from the detections and ground truths, should first be calculated. A true positive is a detection result matched with a ground truth; otherwise, it is a false positive. The unmatched ground truths are defined as the false negatives. True negatives are not applied here, since the background can be a large part of the image. To define whether two regions are matched or not, the commonly used intersection over union (IoU), also known as the Jaccard overlap, is applied:

    IoU = area(P ∩ GT) / area(P ∪ GT),    (1)

where P is the predicted region and GT is the ground-truth region. In a widely used setting, the IoU threshold is set to 0.5, meaning that if the IoU of a predicted region and a ground-truth region is greater than or equal to 0.5, the predicted region is marked as matched and is thus a true positive; otherwise, it is a false positive.

After determining the true or false positives for each detection, the next step is to calculate the precision and recall from the detection result list sorted by score in descending order, in order to plot the precision-against-recall curve. A granular confidence gap can be defined to sample more precision and recall values, but for a simple explanation, we define the gap as one detection result. In the n-th sampling, we calculate the precision and recall from the top-n detection results:

    Precision_n = TP_n / (TP_n + FP_n),    (2)

    Recall_n = TP_n / (TP_n + FN_n),    (3)

where TP_n, FP_n and FN_n are the true positives, false positives and false negatives from the top-n results respectively. Let us say we have 1,000 detection results; then we have 1,000 pairs of (recall_i, precision_i), which are enough for plotting the curve.

We can compute the area under the precision-against-recall curve, which is the AP, to represent the overall performance of a face detector. Under the single IoU threshold setting of 0.5 in the WIDER Face evaluation, the top AP on the hard test subset of WIDER Face has reached 0.924. In the WIDER Face Challenge 2019, which uses the same data as the WIDER Face dataset but evaluates face detectors at 10 IoU thresholds of 0.50:0.05:0.95, the top average AP reaches 0.5756. A sketch of this matching and AP computation is given below.

14. https://en.wikipedia.org/wiki/Confusion_matrix
11. https://github.com/Megvii-BaseDetection/4K-Face 5.3 Results on Accuracy
12. https://ufdd.info
13. https://flyywh.github.io/CVPRW2019LowLight/ To understand the progress in recent years on face detection,
14. https://en.wikipedia.org/wiki/Confusion matrix the results of different datasets are collected from their
Fig. 8: The results on the FDDB dataset, from the FDDB result page at http://vis-www.cs.umass.edu/fddb/results.html.

Fig. 9: The results on the WIDER Face validation and test sets. The figures are from the WIDER Face homepage at http://shuoyang1213.me/WIDERFACE/.
Because of space limitations, only the results from the two most popular datasets are listed: Fig. 8 for FDDB [24] and Fig. 9 for WIDER Face [3]. The FDDB results since 2004 are listed. The current ROC curves are much better than those in the past, which means that the detection accuracy is much higher than before; the true positive rate is reaching 1.0. If you look into the samples in FDDB, you can find some tiny and blurred faces in the ground-truth data. Sometimes it is hard to decide whether they should be counted as faces, even for humans. Therefore, we can say that the current detectors achieve near-perfect accuracy on FDDB, and almost all faces can be detected.

WIDER Face is newer, larger and more challenging than FDDB, and most recent face detectors have been tested on it. From Fig. 9, it can be found that the accuracy is also very high, even on the hard set. The improvement in mAP is no longer obvious; the mAP is almost saturated, similar to FDDB.

We must note that the current benchmarks, whether FDDB, WIDER Face or others, only evaluate the accuracy of detection and do not evaluate efficiency. If two detectors achieve a similar mAP, but the computational cost of one is just half that of the other, surely we will consider the detector with half the computational cost the better one. Since the accuracy metric is almost saturated, it is time to include efficiency in the evaluation.
TABLE 5: Test scales used by open-source one-stage face detectors [13], [14], [17], [18], [19], [20], [25]. '(+flip)' denotes adding a vertically flipped copy in addition to the image at the current scale. SSH shrinks and enlarges images to several preset fixed sizes. Since S3FD, two adaptive test scales have been used to save GPU memory: 'S' for adaptive shrinking, and 'E' for recursively adaptive enlarging. Scale 'F' denotes enlarging the image to the preset largest size.

Model        Publication   Test scales (ratio)
HR           CVPR'17       0.25, 0.5, 1, 2.0
S3FD         ICCV'17       0.5, 1, S, E
PyramidBox   ECCV'18       0.25, 0.75, 1 (+flip), 1.25, 1.5, 1.75, S, E
SRN          AAAI'19       0.5, 1 (+flip), 1.5, 2.25, F
DSFD         CVPR'19       0.5, 1 (+flip), 1.25, 1.75, 2.25, S, E
CSP          CVPR'19       0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2.0, 2.25 (all +flip)

Model        Publication   Test scales (resize longer side)
SSH          ICCV'17       500, 800, 1200, 1600
SHF          WACV'20       100, 300, 600, 1000, 1400
RetinaFace   CVPR'20       500, 800, 1100, 1400, 1700
TABLE 6: How different scales impact the AP of PyramidBox [18]. We use scale = 1 as the baseline, and then try adding the other scales one by one to test how the AP is impacted by each.

Test scales    APeasy           APmedium         APhard           TFLOPs
1 (baseline)   0.947            0.936            0.875            1.37
0.25 + 1       0.954 (+0.007)   0.939 (+0.003)   0.872 (-0.003)   1.45 (+0.08)
0.75 + 1       0.952 (+0.005)   0.940 (+0.004)   0.874 (-0.001)   2.14 (+0.77)
1 + 1.25       0.948 (+0.001)   0.938 (+0.002)   0.884 (+0.009)   2.72 (+1.35)
1 + 1.5        0.947 (+0.000)   0.937 (+0.001)   0.881 (+0.006)   2.46 (+1.09)
1 + 1.75       0.946 (-0.001)   0.936 (+0.000)   0.874 (-0.001)   1.63 (+0.26)
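The TFLOPs increments in Table 6 can be sanity-checked with a simple back-of-the-envelope model (an observation of ours, not a claim from the measurements themselves): for a fully convolutional detector, the cost of one forward pass grows roughly with the number of input pixels, i.e.

    FLOPs(s) ≈ s² × FLOPs(1).

With the 1.37 TFLOPs baseline at scale 1, this predicts:

    0.25² × 1.37 ≈ 0.09 TFLOPs   (observed increment: +0.08)
    0.75² × 1.37 ≈ 0.77 TFLOPs   (observed increment: +0.77)
    1.25² × 1.37 ≈ 2.14 TFLOPs   (observed increment: +1.35)

The first two match closely; the shortfall at 1.25 and above may come from the adaptive shrinking of overly large test images ('S' in Table 5), which caps the effective input size, so the quadratic estimate is an upper bound there.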
TABLE 7: How much will the AP and FLOPs decrease if one scale is removed? The detector PyramidBox is employed; the full set is {0.25, 0.75, 1, 1.25, 1.5, 1.75}.

Test scales      APeasy           APmedium         APhard           TFLOPs
All six scales   0.957            0.945            0.886            4.94
w/o 0.25         0.949 (-0.008)   0.940 (-0.005)   0.884 (-0.002)   4.85 (-0.09)
w/o 0.75         0.954 (-0.003)   0.942 (-0.003)   0.885 (-0.001)   4.16 (-0.78)
w/o 1            0.955 (-0.002)   0.940 (-0.005)   0.850 (-0.013)   3.58 (-1.36)
w/o 1.25         0.957 (+0.000)   0.944 (-0.001)   0.880 (-0.006)   3.58 (-1.36)
w/o 1.5          0.958 (+0.001)   0.945 (+0.000)   0.884 (-0.002)   3.84 (-1.10)
w/o 1.75         0.957 (+0.000)   0.945 (+0.000)   0.886 (+0.000)   4.67 (-0.27)
Fig. 10: FLOPs vs. multi-scale AP on the WIDER Face validation set. 7 models from the WIDER Face result page are listed: HR [13], SSH [14], S3FD [17], PyramidBox [18], SRN [19], DSFD [20] and CSP [25]. (The TFLOPs of some speed-focusing face detectors are listed in Table 10, because their TFLOPs are on a much smaller scale and cannot fit in this figure.)
Fig. 11: FLOPs vs. multi-scale test AP on the WIDER Face test set. 7 models from the WIDER Face result page are listed: HR [13], SSH [14], S3FD [17], PyramidBox [18], SRN [19], DSFD [20] and CSP [25].
on scale 1.75. This is because the PyramidBox pretrained model is mainly trained on scale 1.

Tables 6 and 7 imply that APeasy is the most sensitive to scale 0.25, APmedium is the most sensitive to scales 0.25 and 1, and APhard is the most sensitive to scale 1. Note that this is highly related to the training scale: if the model is trained differently, the conclusion may change accordingly.

Single-scale test on multiple models. Table 8 shows the AP and FLOPs of different models on scale 1. The large overall leap is brought by PyramidBox [18], which mainly introduces the FPN [15] module to fuse features from two adjacent scales, together with the context-enhancing module from SSH [14]. The computational cost of PyramidBox is 2X that of SSH but less than 1/2 that of DSFD; however, the APs achieved by PyramidBox and DSFD are comparable.

TABLE 8: AP and FLOPs of different models on scale 1.

Model        APeasy   APmedium   APhard   TFLOPs
RetinaFace   0.952    0.942      0.776    0.198
S3FD         0.924    0.906      0.816    0.571
CSP          0.948    0.942      0.774    0.571
SSH          0.925    0.909      0.731    0.587
PyramidBox   0.947    0.936      0.875    1.387
DSFD         0.949    0.936      0.845    1.532

If benchmarks could evaluate FLOPs or some other similar efficiency measurement, different face detectors could be compared more fairly. It would also promote face detection research to a better stage.
6.4 FLOPs vs Latency FDDB [24] and 20 FPS on an INTEL E5-2660v3 CPU at 2.60
GHz.
To compare the two measurements, we convert existing
YuFaceDetectNet [89] adopts a light MobileNet [60] as
models to the Open Neural Network Exchange (ONNX)
the backbone. Compared to FaceBoxes, YuFaceDetectNet
format and run them using the ONNXRUNTIME17 in this
has more convolution layers on each stride to have fine-
comparison for fair comparison. Note that due to the dif-
grained features, and detects on the extra layer of stride
ferent supports to ONNX converting of different DL frame-
16, which improves the recall of small faces. The evaluation
works, we managed to convert RetinaFace [21], SRN [19],
results of the model on the WIDER Face [3] validation set
DSFD [20] and CSP [25] to ONNX format. The results are
are 0.856 (Easy), 0.842 (Medium) and 0.727 (Hard). The main
in Table 9. These models are evaluated using an NVIDIA
and well-known repository, libfacedetection [91], takes Yu-
QUADRO RTX 6000 with CUDA 10.2, and an INTEL Xeon
FaceDetectNet as the detection model and offers pure C++
Gold 6132 CPU @ 2.60 GHz. The powerful GPU contains
implementation without dependence on DL frameworks,
4,609 CUDA parallel-processing cores and 24GB memory.
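The timing protocol behind Table 9 can be reproduced with a short script. Below is a minimal sketch, assuming the detector has already been exported to "model.onnx" with a single image input; the file name and the 640 x 640 input shape are placeholders rather than the exact settings used here, and the warm-up loop is our addition to avoid counting one-time initialization.

    import time
    import numpy as np
    import onnxruntime as ort

    # Average the forward pass over 100 runs, as in Table 9.
    sess = ort.InferenceSession("model.onnx",
                                providers=["CUDAExecutionProvider",
                                           "CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    x = np.random.rand(1, 3, 640, 640).astype(np.float32)

    for _ in range(10):                  # warm-up runs, excluded from timing
        sess.run(None, {name: x})

    start = time.perf_counter()
    for _ in range(100):
        sess.run(None, {name: x})
    avg_ms = (time.perf_counter() - start) / 100 * 1000
    print("AVG forward latency: %.2f ms" % avg_ms)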
We can observe that both FLOPs and forward latency increase from RetinaFace [21] to DSFD [20]. Note that although the average FLOPs of RetinaFace are only about one-fifth of SRN's, the forward latency of RetinaFace is nearly half of SRN's, implying that FLOPs are not linearly correlated with latency because of differences in implementation, hardware settings, memory efficiency and so on. The post-processing latency of DSFD and CSP increases sharply because they do not use GPU-accelerated NMS as the other models do.
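To make this non-linearity concrete, one can divide each model's average FLOPs by its measured GPU forward latency from Table 9; the implied throughput differs by more than a factor of four across models, which would be impossible if latency scaled linearly with FLOPs:

    # Effective throughput implied by Table 9 (AVG TFLOPs / GPU forward latency).
    table9 = {                # model: (AVG TFLOPs, forward latency on GPU in ms)
        "RetinaFace": (0.201, 131.60),
        "CSP":        (0.579, 154.55),
        "SRN":        (1.138, 204.77),
        "DSFD":       (1.559, 219.63),
    }
    for model, (tflops, ms) in table9.items():
        print(f"{model}: {tflops / (ms / 1000.0):.2f} TFLOPs/s")
    # RetinaFace ~1.5 TFLOPs/s vs. DSFD ~7.1 TFLOPs/s on the same GPU.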
7 SPEED-FOCUSING FACE DETECTORS

For the face detectors introduced in the previous sections, the main target is to reach a better AP, so their computational costs are heavy, normally on the order of TFLOPs, and it is unrealistic to deploy such heavy models in a face-related system. There are other open-source face detectors whose target is to make face detection run in real time for practical applications. Their computational costs are on the order of GFLOPs or tens of GFLOPs, much less than the costs above. Here we group them as speed-focusing face detectors. We collect the most popular ones from github.com and review them in terms of network architecture, AP, FLOPs and efficiency.

FaceBoxes [87] is one of the first one-stage deep learning-based models to achieve real-time face detection. FaceBoxes rapidly downsamples feature maps to stride 32 using two convolution layers with large kernels. Inception blocks [63] are introduced to enhance the feature maps at stride 32. Following the multi-scale mechanism of SSD [8], FaceBoxes detects faces at different scales on the layers inception3, conv3_2 and conv4_2, resulting in an AP of 0.960 on FDDB [24] and 20 FPS on an INTEL E5-2660v3 CPU at 2.60 GHz.

YuFaceDetectNet [89] adopts a light MobileNet [60] as the backbone. Compared with FaceBoxes, YuFaceDetectNet has more convolution layers at each stride for fine-grained features, and it additionally detects on a layer at stride 16, which improves the recall of small faces. The evaluation results of the model on the WIDER Face [3] validation set are 0.856 (Easy), 0.842 (Medium) and 0.727 (Hard). The main and well-known repository, libfacedetection [91], takes YuFaceDetectNet as the detection model and offers a pure C++ implementation without dependence on DL frameworks, running at 77.34 FPS for 640 × 480 images and up to 2,027.74 FPS for 128 × 96 images on an INTEL i7-1065G7 CPU at 1.3 GHz.

LFFD [90] introduces residual blocks for feature extraction and proposes treating receptive fields as natural anchors. Its faster version, LFFD-v2, achieves 0.875 (Easy), 0.863 (Medium) and 0.754 (Hard) on the WIDER Face validation set while running at 472 FPS using CUDA 10.0 and an NVIDIA RTX 2080Ti GPU. ULFG [88] adds even more convolution layers at each stride and takes advantage of depth-wise convolution, which is friendly to edge devices in terms of FLOPs and forward latency. As reported, the slim version of ULFG reaches an AP of 0.770 (Easy), 0.671 (Medium) and 0.395 (Hard) on the WIDER Face validation set and can run at 105 FPS with an input resolution of 320 × 240 on an ARM A72 at 1.5 GHz.
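The efficiency edge of depth-wise convolution, which ULFG relies on, is easy to quantify. The sketch below compares the multiply-accumulate (MAC) counts of a standard 3 × 3 convolution and a depth-wise separable one (a depth-wise 3 × 3 followed by a point-wise 1 × 1) in the style of MobileNets [60]; the layer sizes are illustrative, not taken from any particular detector.

    def standard_conv_macs(h, w, c_in, c_out, k=3):
        # One k x k filter per output channel, spanning all input channels.
        return h * w * c_out * k * k * c_in

    def separable_conv_macs(h, w, c_in, c_out, k=3):
        depthwise = h * w * c_in * k * k   # one k x k filter per input channel
        pointwise = h * w * c_out * c_in   # 1 x 1 convolution mixes channels
        return depthwise + pointwise

    h = w = 40
    c_in = c_out = 64
    print(separable_conv_macs(h, w, c_in, c_out)
          / standard_conv_macs(h, w, c_in, c_out))
    # ~0.127, i.e. 1/c_out + 1/k^2: roughly an 8x saving at the same resolution.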
These lightweight models are developed using various frameworks and tested on different hardware. For a fair comparison, we export them from their original frameworks to ONNX and test them with ONNXRUNTIME on an INTEL i7-5930K CPU at 3.50 GHz. The results are shown in Table 10.

TABLE 10: Popular and active open-source face detectors on GitHub. Note that 'AVG GFLOPs' is computed on the WIDER Face validation set with a single-scale test (scale = 1.0 only). Also note that latency is measured on the CPU.

Model                  #CONV    #Params   AVG      APeasy   APmedium   APhard   Forward   Post-Proc
                       Layers   (×10^6)   GFLOPs                                (ms)      (ms)
FaceBoxes [87]         33       1.013     1.541    0.845    0.777      0.404    16.52     7.16
ULFG-slim-320 [88]     42       0.390     2.000    0.652    0.646      0.520    19.03     2.37
ULFG-slim-640 [88]     42       0.390     2.000    0.810    0.794      0.630    19.03     2.37
ULFG-RFB-320 [88]      52       0.401     2.426    0.683    0.678      0.571    21.27     1.90
ULFG-RFB-640 [88]      52       0.401     2.426    0.816    0.802      0.663    21.27     1.90
YuFaceDetectNet [89]   43       0.085     2.549    0.856    0.842      0.727    23.47     32.81
LFFD-v2 [90]           45       1.520     37.805   0.875    0.863      0.752    178.47    6.70
LFFD-v1 [90]           65       2.282     55.555   0.910    0.880      0.778    229.35    10.08

We can observe that more CONV layers do not necessarily mean more parameters (compare FaceBoxes with the ULFG series) or more FLOPs (compare YuFaceDetectNet with the ULFG series), mainly because of the extensive use of depth-wise convolution in ULFG. Additionally, more FLOPs do not necessarily lead to more forward latency, again owing to depth-wise convolution. The post-processing latency across the detectors appears inconsistent with the forward latency; we verified that this is caused by the different numbers of bounding boxes sent to NMS and by the different implementations of NMS (Python-based or Cython-based).
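For reference, the greedy NMS at the heart of this post-processing stage fits in a dozen lines. The pure Python/NumPy version below is exactly the kind of implementation that Cython or GPU variants accelerate, and its cost grows with the number of candidate boxes, consistent with the observations above; this is a generic sketch, not the post-processing code of any particular detector.

    import numpy as np

    def nms(boxes, scores, iou_thr=0.4):
        # boxes: (N, 4) array of x1, y1, x2, y2; returns indices of kept boxes.
        x1, y1, x2, y2 = boxes.T
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]         # process highest score first
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # IoU of the current box with all lower-scored remaining boxes
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[1:][iou <= iou_thr]  # drop heavily overlapping boxes
        return keep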
8 CONCLUSIONS AND DISCUSSIONS

Face detection is one of the most important and popular topics in computer vision, yet it is still challenging. Deep learning has brought remarkable breakthroughs to face detectors, and face detection is now robust and accurate even in unconstrained real-world environments. In this paper, recent deep learning-based face detectors and benchmarks are introduced. From the evaluations of accuracy and efficiency on different deep face detectors, we find that a very high accuracy can be reached if the computational cost is not a concern. However, there should be a simple and elegant solution for face detection, since the problem is simpler than generic object detection. Future research on face detection can focus on the following topics.

Superfast Face Detection. There is no formal definition of superfast face detection. Ideally, a superfast face detector should run in real time on low-cost edge devices even when the input image is 1080P. Empirically speaking, we would expect it to cost less than 100M FLOPs on a 1080P input image. For real-world applications, efficiency is one of the key issues: efficient face detectors save energy and hardware cost and improve the responsiveness of edge devices such as CCTV cameras and mobile phones.
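To put the suggested budget in perspective, a simple back-of-the-envelope calculation (ours, not from any benchmark) shows how tight it is:

    # A 1080P frame has 1920 x 1080 ~ 2.07M pixels, so a 100M-FLOPs budget
    # leaves only ~48 FLOPs per input pixel for the whole detector.
    budget_flops = 100e6
    pixels = 1920 * 1080
    print(budget_flops / pixels)   # ~48.2 FLOPs per pixel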
Detecting Faces in the Long-tailed Distribution. Face samples can be regarded as following a long-tailed distribution, and most face detectors are trained on the dominant part of that distribution. The WIDER Face dataset already provides enough samples of faces with variations in illumination, pose, scale, occlusion, blur and distortion. But what about other faces, such as old and damaged ones? As people get old, many wrinkles appear on their faces; and people who suffer from illnesses or accidents may have damaged faces, such as faces with burn scars. Face detection is not only a technical problem but also a humanitarian one, meaning that the technology should serve all people, not only the dominant part of the population. Ideally, face detectors should be able to detect all kinds of faces; however, in most face datasets and benchmarks, the majority of faces are from young people.

The final goal of face detection is to detect faces with very high accuracy and high efficiency, so that the algorithms can be deployed to many kinds of edge devices and centralized servers to improve the perception capability of computers. Currently, there is still a considerable gap: face detectors can achieve good accuracy but still require considerable computation. Improving efficiency should be the next step.

REFERENCES

[1] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[2] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
[3] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "WIDER Face: A face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[4] J. Li and Y. Zhang, "Learning SURF cascade for fast and accurate object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
[5] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, "Face detection based on multi-block LBP representation," in International Conference on Biometrics, 2007.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS), 2015.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[9] H. Wang, Z. Li, X. Ji, and Y. Wang, "Face R-CNN," arXiv preprint arXiv:1706.01061, 2017.
[10] Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li, "Detecting faces using region-based fully convolutional networks," arXiv preprint arXiv:1709.05256, 2017.
[11] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems (NIPS), 2016.
[12] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[13] P. Hu and D. Ramanan, "Finding tiny faces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, "SSH: Single stage headless face detector," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[15] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[17] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "S3FD: Single shot scale-invariant face detector," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[18] X. Tang, D. K. Du, Z. He, and J. Liu, "PyramidBox: A context-assisted single shot face detector," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[19] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou, "Selective refinement network for high performance face detection," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019.
[20] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang, "DSFD: Dual shot face detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[21] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-shot multi-level face localisation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[22] J. Yan, X. Zhang, Z. Lei, and S. Z. Li, "Face detection by structural models," Image and Vision Computing, vol. 32, no. 10, pp. 790–799, 2014.
[23] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[24] V. Jain and E. Learned-Miller, "FDDB: A benchmark for face detection in unconstrained settings," Tech. Rep. UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[25] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu, "High-level semantic feature detection: A new perspective for pedestrian detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[26] S. Luo, X. Li, R. Zhu, and X. Zhang, "SFA: Small faces attention face detector," IEEE Access, vol. 7, pp. 171609–171620, 2019.
[27] Z. Zhang, W. Shen, S. Qiao, Y. Wang, B. Wang, and A. Yuille, "Robust face detection via learning small faces on hard images," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
[28] A. Kumar, A. Kaur, and M. Kumar, "Face detection techniques: A review," Artificial Intelligence Review, vol. 52, no. 2, pp. 927–948, 2019.
[29] S. Zafeiriou, C. Zhang, and Z. Zhang, "A survey on face detection in the wild: Past, present and future," Computer Vision and Image Understanding, vol. 138, pp. 1–24, 2015.
[30] C. Zhang and Z. Zhang, "A survey of recent advances in face detection," Microsoft Research, Tech. Rep., 2010.
[31] M.-H. Yang, D. J. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.
[32] E. Hjelmås and B. K. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, 2001.
[33] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[34] Z. Zou, Z. Shi, Y. Guo, and J. Ye, "Object detection in 20 years: A survey," arXiv preprint arXiv:1905.05055, 2019.
[35] X. Wu, D. Sahoo, and S. C. Hoi, "Recent advances in deep learning for object detection," Neurocomputing, vol. 396, pp. 39–64, 2020.
[36] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[37] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Aggregate channel features for multi-view face detection," in IEEE International Joint Conference on Biometrics (IJCB), 2014.
[38] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, "CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection," in Deep Learning for Biometrics, 2017, pp. 57–79.
[39] M. Najibi, B. Singh, and L. S. Davis, "FA-RPN: Floating region proposals for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[40] Y. Liu, X. Tang, J. Han, J. Liu, D. Rui, and X. Wu, "HAMBox: Delving into mining high-quality anchors on face detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[41] Y. Liu and X. Tang, "BFBox: Searching face-appropriate backbone and feature pyramid network for face detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[42] J. Zhu, D. Li, T. Han, L. Tian, and Y. Shan, "ProgressFace: Scale-aware progressive learning for face detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[43] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[44] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[45] H. Law and J. Deng, "CornerNet: Detecting objects as paired keypoints," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[47] S. Wan, Z. Chen, T. Zhang, B. Zhang, and K.-K. Wong, "Bootstrapping face detection with hard negative examples," arXiv preprint arXiv:1608.02236, 2016.
[48] C. Zhang, X. Xu, and D. Tu, "Face detection using improved Faster RCNN," arXiv preprint arXiv:1802.02142, 2018.
[49] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017.
[50] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," arXiv preprint arXiv:1701.06659, 2017.
[51] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[52] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[53] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More deformable, better results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[54] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[55] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[56] X. Zhou, J. Zhuo, and P. Krähenbühl, "Bottom-up object detection by grouping extreme and center points," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[57] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, "RepPoints: Point set representation for object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[58] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1999.
[59] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[60] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[61] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[62] Y. Zhu, H. Cai, S. Zhang, C. Wang, and Y. Xiong, "TinaFace: Strong but simple baseline for face detection," arXiv preprint arXiv:2011.13183, 2020.
[63] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[64] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu, "Scale-aware face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[65] Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, and X. Tang, "Recurrent scale approximation for object detection in CNN," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[66] G. Song, Y. Liu, M. Jiang, Y. Wang, J. Yan, and B. Leng, "Beyond trade-off: Accelerate FCN-based face detector with higher accuracy," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[67] J. Li and Y. Zhang, "Learning SURF cascade for fast and accurate object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[68] H. Jin, Q. Liu, H. Lu, and X. Tong, "Face detection using improved LBP under Bayesian framework," in International Conference on Image and Graphics (ICIG), 2004.
[69] W. Luo, Y. Li, R. Urtasun, and R. Zemel, "Understanding the effective receptive field in deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2016.
[70] C. Zhu, R. Tao, K. Luu, and M. Savvides, "Seeing small faces from robust anchor's perspective," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[71] X. Ming, F. Wei, T. Zhang, D. Chen, and F. Wen, "Group sampling for scale invariant face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[72] E. T. Boult, M. Gunther, and R. A. Dhamija, "2nd unconstrained face detection and open set recognition challenge," https://vast.uccs.edu/Opensetface/, accessed 2019-10-31.
[73] M. K. Yucel, Y. C. Bilge, O. Oguz, N. Ikizler-Cinbis, P. Duygulu, and R. G. Cinbis, "Wildest faces: Face detection and recognition in violent settings," arXiv preprint arXiv:1805.07566, 2018.
[74] B. Yang, J. Yan, Z. Lei, and S. Z. Li, "Fine-grained evaluation on face detection in the wild," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2015.
[75] S. Ge, J. Li, Q. Ye, and Z. Luo, "Detecting masked faces in the wild with LLE-CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[76] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[77] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, J. Cheney, and P. Grother, "IARPA Janus Benchmark-B face dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[78] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, and P. Grother, "IARPA Janus Benchmark-C: Face dataset and protocol," in International Conference on Biometrics (ICB), 2018.
[79] J. Wang, Y. Yuan, B. Li, G. Yu, and S. Jian, "SFace: An efficient network for face detection in large scale variations," arXiv preprint arXiv:1804.06559, 2018.
[80] H. Nada, V. A. Sindagi, H. Zhang, and V. M. Patel, "Pushing the limits of unconstrained face detection: A challenge dataset and baseline results," in IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS), 2018.
[81] C. Wei, W. Wang, W. Yang, and J. Liu, "Deep retinex decomposition for low-light enhancement," in British Machine Vision Conference (BMVC), 2018.
[82] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE MultiMedia, vol. 13, no. 3, pp. 86–91, 2006.
[83] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[84] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient inference," in International Conference on Learning Representations (ICLR), 2017.
[85] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.
[86] W. Liu, A. Rabinovich, and A. C. Berg, "ParseNet: Looking wider to see better," arXiv preprint arXiv:1506.04579, 2015.
[87] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "FaceBoxes: A CPU real-time face detector with high accuracy," in Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), 2017.
[88] Linzaer, "Ultra-Light-Fast-Generic-Face-Detector-1MB," https://github.com/Linzaer/Ultra-Light-Fast-Generic-Face-Detector-1MB, 2020.
[89] S. Yu et al., "libfacedetection.train," https://github.com/ShiqiYu/libfacedetection.train, 2021.
[90] Y. He, D. Xu, L. Wu, M. Jian, S. Xiang, and C. Pan, "LFFD: A light and fast face detector for edge devices," arXiv preprint arXiv:1904.10633, 2019.
[91] S. Yu et al., "libfacedetection," https://github.com/ShiqiYu/libfacedetection, 2021.

Yuantao Feng is currently a Research Assistant in the Department of Computer Science and Engineering, Southern University of Science and Technology, China. He received his B.E. and M.E. degrees in computer science and technology from the College of Computer and Software Engineering, Shenzhen University, in 2018 and 2021, respectively. His research interests include object detection and computer vision.

Shiqi Yu is currently an Associate Professor in the Department of Computer Science and Engineering, Southern University of Science and Technology, China. He received his B.E. degree in computer science and engineering from the Chu Kochen Honors College, Zhejiang University, in 2002, and his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2007. He worked as an Assistant Professor and then an Associate Professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, from 2007 to 2010, and as an Associate Professor at Shenzhen University from 2010 to 2019. His research interests include gait recognition, face detection and computer vision.

Hanyang Peng received his B.S. degree in measurement and control technology from Northeastern University, Shenyang, China, in 2008, his M.E. degree in detection technology and automatic equipment from Tianjin University, Tianjin, China, in 2010, and his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2017. He is currently with Southern University of Science and Technology, Shenzhen, China. His current research interests include computer vision, machine learning, deep learning and optimization.