Deep-Learning-Driven End-to-End Face Recognition: A 2020 Survey of Recent Advances

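This 2020 survey paper on end-to-end face recognition examines recent advances in face recognition in the deep learning era. Face recognition is a long-studied core problem in computer vision, but the rise of deep convolutional neural networks (DCNNs) and the accumulation of large-scale face datasets have markedly improved its performance and made it a foundation for many real-world applications.

An end-to-end deep face recognition system consists of three key components: face detection, face preprocessing, and face representation. Face detection locates face regions in an input image or video frame; it is the basis of the whole process, since every later step operates on the detected regions. Face preprocessing normalizes each detected face: it corrects the pose toward a canonical view, crops the face region, and rescales it to a fixed pixel size. This stage aims to reduce the effect of illumination, angle, and pose variation on the recognition result and so improve robustness. Face representation is the core of the pipeline: a deep network extracts abstract, highly discriminative features that capture identity while resisting noise, occlusion, and expression changes. The DCNNs used are typically multi-layer, with successive layers capturing features from low-level edges and textures up to high-level facial structure and identity.

Recent research focuses on making these three components work together better and on improving accuracy and speed. Some work improves face detection so that it is more precise and runs in real time; other work studies more efficient models, such as lightweight feature extraction networks, to reduce computational cost. Still other work designs deeper network architectures or combines multiple cues (such as facial expression, texture, and 3D structure) to improve recognition. The survey reviews these results in detail, offering a reference for researchers and guidance for real-time, high-accuracy face recognition in practice. Looking ahead, end-to-end face recognition systems are expected to play a larger role in areas such as security verification, social media analysis, and virtual reality.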

Fig. 5. The illustration of single-stage and multi-stage face detectors. The single-stage detector directly accomplishes the face detection from the entire feature maps, whereas the multi-stage detector adopts a proposal stage to generate candidates and one or more stages to refine these candidates.

Apart from the modeling, how to train the multi-stage detector is another interesting topic. Multi-stage detectors are commonly trained stage by stage, since each stage is supervised by its own objective. This may lead to inferior optimization. To handle this issue, a joint training strategy [178] was designed for both Cascaded CNN [123] and Faster R-CNN to achieve end-to-end optimization and better performance on face detection.

3.1.2 Single-stage methods. The single-stage methods accomplish the candidate classification and bounding box regression from the entire feature maps directly, without involving the proposal stage.

A classic single-stage structure comes from a general object detector named Single Shot MultiBox Detector (SSD) [142]. Similar to RPN, SSD presets dense anchor boxes over different ratios and scales on the feature maps. SSD is a prevailing framework in object detection because it runs much faster than Faster R-CNN while maintaining comparable accuracy, so many developers have employed SSD for face detection in applications. However, SSD is not robust enough to large scale variation, especially to small faces. Afterward, many methods [224, 327–329] modified SSD for face detection. For example, Zhang et al. [328] designed a scale-equitable version to obtain adequate features from faces of different scales.
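
To make the idea of dense preset anchors concrete, the following is a minimal sketch, not the SSD implementation. The function name `make_anchors`, the square feature map, and the particular stride, scales and ratios are assumptions chosen for illustration only.

```python
from itertools import product
from math import sqrt


def make_anchors(feature_size, stride, scales, ratios):
    """Tile (cx, cy, w, h) anchors over a square feature map.

    Hypothetical helper: one anchor per (cell, scale, ratio) combination,
    with coordinates expressed in input-image pixels.
    """
    anchors = []
    for row, col in product(range(feature_size), repeat=2):
        # Center of the feature-map cell, mapped back to the input image.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # ratio = width / height; the anchor area stays scale ** 2.
            w = scale * sqrt(ratio)
            h = scale / sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors


anchors = make_anchors(feature_size=4, stride=8, scales=[16, 32], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 4 * 4 cells * 2 scales * 3 ratios = 96
```

Even this toy 4 x 4 map yields 96 anchors, which hints at why anchor counts grow quickly on real feature maps and why most anchors end up as negatives.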

Many state-of-the-art face detectors resort to the feature pyramid network (FPN) [132], which consists of a top-down architecture with skip connections and merges the high-level and low-level features for detection. The high-level feature maps have more semantic information, while the low-level layers have smaller receptive fields but more detailed local information. The feature fusion preserves the advantages of both sides and brings great progress in detecting objects with a wide range of scales. Therefore, many single-stage face detectors [124, 130, 168, 224, 225, 244, 318, 326] are developed with the advantage of FPN. These methods not only handle the scale issue in face detection via FPN, but also attempt to solve the inherent shortcomings of the original FPN, such as the conflict of receptive fields. The special feature fusion operation [124, 130, 224] is also suitable for tackling the hard cases of face detection, such as blurred and occluded faces.
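
The top-down fusion described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: feature maps are single-channel 2-D lists, each level is exactly half the resolution of the previous one, and the learned lateral and smoothing convolutions of a real FPN are omitted. The names `upsample2x` and `top_down_merge` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2-D list of numbers."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(pyramid):
    """Fuse feature maps ordered from low-level (fine) to high-level (coarse).

    Each coarser map is upsampled and added element-wise to the next finer
    map, which plays the role of the skip connection.
    """
    merged = [pyramid[-1]]  # the coarsest level is passed through unchanged
    for lateral in reversed(pyramid[:-1]):
        up = upsample2x(merged[0])
        fused = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]
        merged.insert(0, fused)
    return merged


c3 = [[1] * 4 for _ in range(4)]    # low-level: fine resolution, local detail
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]                        # high-level: coarse, more semantic
p3, p4, p5 = top_down_merge([c3, c4, c5])
print(p3[0][0], p4[0][0], p5[0][0])  # 111 110 100
```

The finest output `p3` carries contributions from every level, which is the property that helps with faces of widely varying scale.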

Although the single-stage methods have the advantage of high efficiency, their detection accuracy is below that of the two-stage methods. This is partially because of the imbalance between positives and negatives brought by the dense anchors, whereas the proposal-to-refine scheme is able to alleviate this issue. Accordingly, RefineDet [325] set up an anchor refinement module in its network to remove a large number of negatives. Inspired by RefineDet, SRN [ ] presented a selective two-step classification and regression method; the two-step classification is performed at the low-level layers to reduce the search space of the classifier, and the two-step regression is performed at the high-level layers to obtain accurate locations. Later on, VIM-FD [337], ISRN [326], AInnoFace [313] and RefineFace [323] improved SRN with several effective techniques, such as training data augmentation, improved feature extractor and training supervision, anchor assignment and matching strategy, multi-scale test strategy, etc.
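
The effect of a first-step filter on the positive/negative imbalance can be illustrated with a toy example. This is only a sketch in the spirit of an anchor refinement step, not the RefineDet or SRN code: the function name `filter_easy_negatives`, the threshold value and the scores are all assumptions made for illustration.

```python
def filter_easy_negatives(background_probs, threshold=0.99):
    """Return indices of anchors the first step is NOT confident are background.

    Anchors whose background probability reaches the (assumed) threshold are
    treated as easy negatives and dropped before the second step.
    """
    return [i for i, p_bg in enumerate(background_probs) if p_bg < threshold]


# Toy first-step outputs for six anchors; only anchor 1 overlaps a face.
background_probs = [0.999, 0.2, 0.995, 0.6, 0.9999, 0.97]
labels = [0, 1, 0, 0, 0, 0]

kept = filter_easy_negatives(background_probs)
negatives_before = labels.count(0)
negatives_after = sum(1 for i in kept if labels[i] == 0)
print(kept, negatives_before, negatives_after)  # [1, 3, 5] 5 2
```

In this toy case the second step sees two negatives per positive instead of five, while the positive anchor survives the filter.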

Most of the aforementioned methods need to preset anchors for face detection, while some representative single-stage detectors, such as DenseBox [ ], UnitBox [298] and CenterFace [280], fulfill the detection without preset anchors. We will present them as the anchor-free type in the next subsection.

3.1.3 Anchor-based and anchor-free methods. As shown in Table 2, most current face detectors are anchor-based due to the long-time development and superior performance. Generally, we preset the dense anchors on the feature maps, then fulfill the classification and bounding box regression on these anchors one or more times, and finally output the accepted ones as the detection results. Therefore, the anchor allocation and matching strategy is crucial to the detection accuracy. For example, the scale compensation for anchor matching, proposed by S³FD [328], can effectively improve the recall of tiny and outer faces. Besides, S³FD utilized a max-out label mechanism to reduce the large number of negatives, which is also a frequent issue in anchor-based mechanisms. Zhu et al. [356] introduced an expected max overlapping score (EMO) to evaluate the quality of matched anchors, and proposed several techniques to encourage the true positives to achieve high EMO scores. Since the scale distribution of faces is imbalanced in the training dataset, Group Sampling [164] sorts the anchor boxes by their scales and maintains the same number of samples for each group during training. More recently, HAMBox [151] proposed an online anchor compensation strategy to help the detection of outer faces, taking advantage of unmatched anchors that nonetheless provide favorable regression.
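
The role of the matching strategy can be seen in a small sketch of IoU-based anchor matching with a fallback for faces that match too few anchors. It is loosely inspired by the compensation idea discussed above and is not the procedure of any cited detector; the function names, both thresholds and `top_n` are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(face, anchors, threshold=0.5, fallback_threshold=0.1, top_n=2):
    """Return indices of anchors assigned to one ground-truth face."""
    overlaps = sorted(((iou(face, a), i) for i, a in enumerate(anchors)), reverse=True)
    matched = [i for score, i in overlaps if score >= threshold]
    if len(matched) < top_n:
        # Compensation (assumed rule): keep the best top_n anchors that
        # clear a much looser overlap threshold.
        matched = [i for score, i in overlaps if score >= fallback_threshold][:top_n]
    return matched


# A 10 x 10 face against 16 x 16 anchors: no anchor reaches IoU 0.5.
face = (10, 10, 20, 20)
anchors = [(0, 0, 16, 16), (8, 8, 24, 24), (16, 16, 32, 32), (40, 40, 56, 56)]
print(match_anchors(face, anchors))  # [1, 0]
```

Under the strict threshold alone this small face would receive no positive anchors at all, which is the kind of case that matching compensation is meant to recover.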

The anchor-based methods have dominated the state of the art in face detection, but they have several weaknesses. The hyperparameters (e.g., scale, stride, ratio, number) of preset anchors need to be carefully tuned for each particular dataset, which limits the generalization ability of detectors. Besides, the dense anchors increase the computational cost and bring the imbalance problem of positive and negative anchors.

Anchor-free methods [120, 226, 355] attract growing attention in general object detection. As for face detection, certain pioneering works have emerged in recent years. DenseBox [ ] and UnitBox [298] attempt to predict the pixel-wise bounding box and the confidence score. Besides, CenterFace [280] regards face detection as a generalized task of keypoint estimation, which predicts the facial center point and the size of the bounding box in the feature map. In brief, the anchor-free detectors get rid of the preset anchors and achieve better generalization capacity. Regarding the detection accuracy, further exploration is needed for better robustness to false positives and stability in the training process.
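
A center-point formulation of this kind can be decoded without any anchors, as the following sketch suggests. It is a simplified illustration rather than the CenterFace decoder: the function name `decode_centers`, the stride, the score threshold, the use of a 3 x 3 local-maximum test, and the assumption that sizes are given directly in input pixels are all ours.

```python
def decode_centers(heatmap, sizes, stride=4, score_threshold=0.5):
    """Turn a center heatmap and a per-cell size map into boxes.

    heatmap[y][x] is the center confidence; sizes[y][x] is an assumed
    (width, height) in input pixels. Returns (x1, y1, x2, y2, score) tuples.
    """
    boxes = []
    height, width = len(heatmap), len(heatmap[0])
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < score_threshold:
                continue
            # Keep only local maxima within the 3 x 3 neighborhood.
            neighbors = [
                heatmap[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            if score < max(neighbors):
                continue
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            w, h = sizes[y][x]
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes


heatmap = [[0.1, 0.2, 0.1],
           [0.2, 0.9, 0.6],
           [0.1, 0.3, 0.1]]
sizes = [[(8, 10)] * 3 for _ in range(3)]
print(decode_centers(heatmap, sizes))  # [(2.0, 1.0, 10.0, 11.0, 0.9)]
```

The cell scoring 0.6 clears the threshold but is suppressed because its neighbor scores higher, so a single box is returned.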

3.1.4 Multi-task learning methods. Multi-task learning has been widely studied in the computer vision community. Generally, the multi-task learning based approaches are designed for solving a problem together with other related tasks by sharing the visual representation. Here, we introduce the multi-task learning methods that train the face detector with the associated facial tasks or auxiliary supervision branches to enrich the feature representation and detection robustness.
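
One common way to train several branches on a shared representation is to sum weighted per-task losses. The sketch below shows that pattern for a single candidate; it is not the loss of any particular cited method. The choice of binary cross-entropy and smooth L1, the dictionary keys, and the weights `w_box` and `w_landmark` are assumptions.

```python
from math import log


def binary_cross_entropy(prob, label):
    """Cross-entropy for the face / non-face branch (prob is clipped for safety)."""
    eps = 1e-7
    prob = min(max(prob, eps), 1 - eps)
    return -(label * log(prob) + (1 - label) * log(1 - prob))


def smooth_l1(pred, target):
    """Smooth L1 distance summed over coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def multi_task_loss(sample, w_box=1.0, w_landmark=0.5):
    """Weighted sum of detection, box regression and landmark losses."""
    loss = binary_cross_entropy(sample["face_prob"], sample["is_face"])
    if sample["is_face"]:
        # Regression branches are only supervised on positive candidates.
        loss += w_box * smooth_l1(sample["box_pred"], sample["box_gt"])
        loss += w_landmark * smooth_l1(sample["landmark_pred"], sample["landmark_gt"])
    return loss


positive = {
    "is_face": 1, "face_prob": 1.0,
    "box_pred": [0.0, 0.0, 0.0, 0.0], "box_gt": [0.5, 0.0, 0.0, 2.0],
    "landmark_pred": [0.0, 0.0], "landmark_gt": [1.0, 0.0],
}
negative = {"is_face": 0, "face_prob": 0.0}
print(round(multi_task_loss(positive), 3), round(multi_task_loss(negative), 3))  # 1.875 0.0
```

Because all branches read the same features, gradients from the auxiliary tasks also shape the representation used for detection, which is the motivation given above.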

Many multi-task learning methods [128, 280, 310, 319, 367] have explored the joint learning of face detection and facial landmark localization. Among them, MTCNN [319] is the most representative one, which exploits the inherent correlation between facial bounding boxes and landmarks by a three-stage cascaded network. Subsequently, HyperFace [180] fused the low-level features as well as the high-level features to simultaneously conduct four tasks, including face detection, facial landmark localization, gender classification and pose estimation. Based on
