to reduce the search space of the classifier, and the two-step regression is performed at high-level
layers to obtain accurate locations. Later on, VIM-FD [337], ISRN [326], AInnoFace [313] and
RefineFace [323] improved SRN with several effective techniques, such as training data augmentation,
improved feature extractor and training supervision, anchor assignment and matching strategy,
multi-scale test strategy, etc.
Most of the aforementioned methods need preset anchors for face detection, while some representative
single-stage detectors, such as DenseBox [90], UnitBox [298] and CenterFace [280], perform
detection without preset anchors. We will present them as the anchor-free type in the next subsection.
3.1.3 Anchor-based and anchor-free methods. As shown in Table 2, most current face detectors are
anchor-based due to the long-time development and superior performance. Generally, we preset
the dense anchors on the feature maps, then perform the classification and bounding box regression
on these anchors one or more times, and finally output the accepted ones as the detection results.
Therefore, the anchor allocation and matching strategy is crucial to the detection accuracy. For
example, the scale compensation for anchor matching, proposed by S³FD [328], can effectively
improve the recall of tiny and outer faces. Besides, S³FD utilized a max-out label mechanism to
reduce the large number of negatives, which is also a frequent issue in anchor-based mechanisms.
Zhu et al. [356] introduced an expected max overlapping score (EMO) to evaluate the quality
of matched anchors, and proposed several techniques to encourage the true positives to achieve
high EMO scores. Since the scale distribution of faces is imbalanced in the training dataset, Group
Sampling [164] sorts the anchor boxes by their scales and maintains the same number of samples
for each group during training. More recently, HAMBox [151] proposed an online anchor
compensation strategy to help the detection of outer faces, taking advantage of unmatched
anchors that nonetheless provide favorable regression.
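To make the anchor-based pipeline concrete, the following minimal Python sketch tiles square anchors over a single feature map and assigns each anchor to a face by IoU. The function names (`tile_anchors`, `iou`, `match_anchors`), the single anchor scale per cell, and the 0.5 matching threshold are illustrative assumptions, not the settings of any detector cited above.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def tile_anchors(feat_h: int, feat_w: int, stride: int, scale: float) -> List[Box]:
    """Place one square anchor of side `scale` at the center of every feature-map cell."""
    half = scale / 2.0
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            anchors.append((cx - half, cy - half, cx + half, cy + half))
    return anchors


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors: Sequence[Box], faces: Sequence[Box],
                  pos_thresh: float = 0.5) -> List[int]:
    """For each anchor, return the index of its best-overlapping face, or -1 if negative."""
    labels = []
    for anchor in anchors:
        best_idx, best_iou = -1, 0.0
        for i, face in enumerate(faces):
            overlap = iou(anchor, face)
            if overlap > best_iou:
                best_idx, best_iou = i, overlap
        labels.append(best_idx if best_iou >= pos_thresh else -1)
    return labels


# Hypothetical 4x4 feature map with stride 16 and 32-pixel anchors.
anchors = tile_anchors(4, 4, stride=16, scale=32.0)
large_face_labels = match_anchors(anchors, [(8.0, 8.0, 40.0, 40.0)])
tiny_face_labels = match_anchors(anchors, [(20.0, 20.0, 30.0, 30.0)])
```

In this toy setting the 32×32 face is matched by exactly one anchor, whereas the 10×10 face is matched by none, which illustrates why tiny faces motivate the compensation strategies discussed above.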
Anchor-based methods have dominated the state of the art in face detection, but they have several
weaknesses. The hyperparameters (e.g., scale, stride, ratio, number) of preset anchors need
to be carefully tuned for each particular dataset, which limits the generalization ability of detectors.
Besides, the dense anchors increase the computational cost and bring the imbalance problem of
positive and negative anchors.
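The scale of the imbalance can be illustrated with simple arithmetic. The sketch below counts the anchors tiled over a hypothetical 640×640 input with six detection layers (strides 4 to 128, one anchor per cell) and compares the total with an assumed number of positive anchors; all of these numbers are illustrative assumptions rather than values reported in the text.

```python
from typing import Sequence


def count_anchors(image_size: int, strides: Sequence[int],
                  anchors_per_cell: int = 1) -> int:
    """Total number of anchors tiled on a square image across all detection layers."""
    return sum((image_size // s) ** 2 * anchors_per_cell for s in strides)


def negative_to_positive_ratio(total_anchors: int, positive_anchors: int) -> float:
    """Number of negative anchors per positive anchor."""
    return (total_anchors - positive_anchors) / positive_anchors


total = count_anchors(640, [4, 8, 16, 32, 64, 128])             # 34125
ratio = negative_to_positive_ratio(total, positive_anchors=50)  # 681.5
```

Even with 50 matched anchors in the image, negatives outnumber positives by several hundred to one in this setting, and every anchor still has to be classified and regressed.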
Anchor-free methods [120, 226, 355] attract growing attention in general object detection. As
for face detection, certain pioneering works have emerged in recent years. DenseBox [90] and
UnitBox [298] attempt to predict the pixel-wise bounding box and the confidence score. Besides,
CenterFace [280] regards face detection as a generalized task of keypoint estimation, which predicts
the facial center point and the size of the bounding box in the feature map. In brief, the anchor-free
detectors get rid of the preset anchors and achieve better generalization capacity. As for
detection accuracy, further exploration is needed to improve robustness to false positives and
stability in the training process.
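As a rough illustration of the center-point formulation, the sketch below decodes boxes from a center heatmap and per-cell width and height maps. It is only loosely in the spirit of center-based detectors and is not the implementation of any cited method: the function name `decode_centers`, the 3×3 local-maximum test, the stride of 4, and the omission of any sub-cell offset regression are all assumptions.

```python
from typing import List, Sequence, Tuple

Detection = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)


def decode_centers(heatmap: Sequence[Sequence[float]],
                   widths: Sequence[Sequence[float]],
                   heights: Sequence[Sequence[float]],
                   stride: int = 4,
                   score_thresh: float = 0.5) -> List[Detection]:
    """Turn heatmap peaks into boxes using the predicted size at each peak."""
    rows, cols = len(heatmap), len(heatmap[0])
    detections = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < score_thresh:
                continue
            # Keep the cell only if it is the maximum of its 3x3 neighborhood.
            neighborhood = [heatmap[rr][cc]
                            for rr in range(max(0, r - 1), min(rows, r + 2))
                            for cc in range(max(0, c - 1), min(cols, c + 2))]
            if score < max(neighborhood):
                continue
            # Map the cell center back to image coordinates.
            cx, cy = (c + 0.5) * stride, (r + 0.5) * stride
            w, h = widths[r][c], heights[r][c]
            detections.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return detections


# Toy 3x3 prediction: one strong peak and a weaker neighbor that is suppressed.
heatmap = [[0.1, 0.1, 0.1],
           [0.1, 0.9, 0.6],
           [0.1, 0.1, 0.1]]
widths = [[8.0] * 3 for _ in range(3)]
heights = [[10.0] * 3 for _ in range(3)]
boxes = decode_centers(heatmap, widths, heights)
```

No anchor scales or ratios appear anywhere in this decoding step; the only hyperparameters are the output stride and the score threshold.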
3.1.4 Multi-task learning methods. Multi-task learning has been widely studied in the computer vision
community. Generally, multi-task learning based approaches are designed for solving a problem
together with other related tasks by sharing the visual representation. Here, we introduce the
multi-task learning methods that train the face detector with associated facial tasks or auxiliary
supervision branches to enrich the feature representation and detection robustness.
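A common way to realize such joint training is to optimize a weighted sum of the per-task losses computed on top of the shared representation. The sketch below shows only this combination step; the function name `multi_task_loss`, the task names, and the weights are hypothetical, and individual methods may balance their tasks differently.

```python
from typing import Dict


def multi_task_loss(task_losses: Dict[str, float],
                    task_weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses computed on a shared representation."""
    missing = set(task_losses) - set(task_weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Hypothetical loss values for one training step.
losses = {"detection": 0.5, "landmarks": 0.8}
weights = {"detection": 1.0, "landmarks": 0.25}
total = multi_task_loss(losses, weights)  # 1.0 * 0.5 + 0.25 * 0.8
```

Because every term is back-propagated through the same backbone, the auxiliary task acts as extra supervision for the features used by the detector.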
Many multi-task learning methods [26, 90, 128, 280, 310, 319, 367] have explored the joint
learning of face detection and facial landmark localization. Among them, MTCNN [319] is the
most representative one, which exploits the inherent correlation between facial bounding boxes
and landmarks by a three-stage cascaded network. Subsequently, HyperFace [180] fused the low-level
features as well as the high-level features to simultaneously conduct four tasks, including
face detection, facial landmark localization, gender classification and pose estimation. Based on