module SEAM to enhance the learning of face features.
3. To address the imbalance between hard and easy samples, we weight the easy and hard
samples according to their IoU. To reduce hyperparameter tuning, we set the mean IoU of
all candidate positive samples with the ground truth as the dividing line between positive
and negative samples. We also design a weighting function named Slide that assigns higher
weights to hard samples, which helps the model learn more difficult features. The details of
this function are presented in Section 3.5.
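The idea above can be sketched in a few lines. The snippet below is an illustrative, hypothetical implementation of a Slide-style weighting (not the paper's exact formula): the mean IoU of the candidate positive samples serves as the dividing line, easy negatives keep weight 1, and samples at or above the boundary receive an exponentially boosted weight that decreases as the IoU grows, so harder positives contribute more to the loss.

```python
import numpy as np

def slide_weight(iou, mu):
    """Sketch of a Slide-style sample weighting.

    iou : array of IoU values between candidate samples and ground truth.
    mu  : dividing line between positives and negatives (mean IoU of all
          candidate positive samples, as described in the text).
    """
    iou = np.asarray(iou, dtype=float)
    w = np.ones_like(iou)                  # easy negatives keep weight 1
    band = (iou > mu - 0.1) & (iou < mu)   # narrow band around the boundary
    w[band] = np.exp(1.0 - mu)             # constant boost near the boundary
    hard = iou >= mu
    w[hard] = np.exp(1.0 - iou[hard])      # positives: lower IoU -> higher weight
    return w

# The dividing line is the mean IoU of all candidates with the ground truth.
ious = np.array([0.2, 0.45, 0.5, 0.6, 0.8])
mu = ious.mean()                           # 0.51 for this toy example
weights = slide_weight(ious, mu)
```

Note that a positive sample with IoU 0.6 receives a larger weight than one with IoU 0.8, which is exactly the "emphasize hard samples" behavior the text describes.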
The rest of the paper is organized as follows: in Section 2 we review the related literature
in this area; in Section 3 we describe the model structure in detail and the main improvements,
including the receptive field enhancement module, the attention module, the adaptive sample
weighting function, the anchor design, the Repulsion Loss, and the Normalized Gaussian
Wasserstein Distance (NWD) loss; in Section 4 we describe the experiments and the
corresponding analysis of the results, including ablation experiments and comparisons with
other models; and in Section 5 we summarize our work and suggest directions for future
research.
2 Related Works
Face Detection. Face detection has been a hot research area in computer vision for decades.
In the early years of deep learning, face detection algorithms usually used neural networks to
extract image features automatically for classification. CascadeCNN [1] proposes a cascaded
structure of three carefully designed deep convolutional networks that predict face and
landmark locations in a coarse-to-fine manner. MTCNN [2] develops a similar cascade
architecture to jointly align face landmarks and detect face locations. PCN [3] uses an angle
prediction network to correct rotated faces and improve face detection accuracy. However,
these early deep-learning-based face detection algorithms suffer from drawbacks such as
tedious training, local optima, slow detection speed, and low detection accuracy.
Current face detection algorithms mainly inherit the advantages of generic object detection
algorithms such as SSD [4], Faster R-CNN [5], and RetinaNet [6]. CMS-RCNN [34] uses
Faster R-CNN as its backbone and introduces contextual information and multi-scale features
to detect faces. Zhang et al. [25] design a lightweight network based on the SSD structure,
named FaceBoxes, which quickly shrinks the feature size by 32× down-sampling and uses a
multi-scale network module to enhance features in both the width and depth dimensions of
the network. SRN [35], which builds on the generic object detectors RefineDet [36] and
RetinaNet [6], achieves high performance by introducing two-stage classification and
regression and designing a multi-branch module to enhance the effect of the receptive fields.
Scale-invariance. As one of the most challenging problems in face detection, large variations
of face scale in complex scenes have an important impact on detector accuracy. The
multi-scale detection capability mainly depends on scale-invariant features, and many works
address this problem by extracting features more accurately and effectively [13, 24, 37, 38].
For small-object detection, using fewer down-sampling layers and dilated convolutions can
significantly improve detection performance [39, 40]. Another way to address this problem
is to use more anchors. Anchors provide good prior information, so denser anchors and
corresponding matching strategies can effectively improve the quality of object proposals
[24, 25, 37, 40]. Multi-scale training helps construct image pyramids and increases sample
diversity, which is a simple but effective way to improve the performance of multi-scale
object detection. On the other hand, as the receptive fields grow, the semantic information
becomes richer, but spatial information may be lost accordingly. A natural idea is to fuse
deep semantic information with shallow features,