Stefanos Zafeiriou, Cha Zhang and Zhengyou Zhang / CVIU 00 (2015) 1–33 7
mapping function $\phi(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^1$, where $d$ is the size of the test patch. For linear features, $\phi(\mathbf{x}) = \boldsymbol{\phi}^T\mathbf{x}$, $\boldsymbol{\phi} \in \mathbb{R}^d$. The
classification function is of the following form:
$$F_T(\mathbf{x}) = \mathrm{sign}\Big[\sum_{t=1}^{T} \lambda_t(\boldsymbol{\phi}_t^T\mathbf{x})\Big], \qquad (11)$$
where $\lambda_t(\cdot)$ are $\mathbb{R} \rightarrow \mathbb{R}$ discriminating functions, such as the conventional stump classifiers in AdaBoost. $F_T(\mathbf{x})$ shall be 1 for positive examples and −1 for negative examples. Note that the Haar-like feature set is a subset of linear features. Another example is the anisotropic Gaussian filters in [75]. In [76], the linear features were constructed by pre-learning them using local non-negative matrix factorization (LNMF), which is still sub-optimal. Instead, Liu and Shum [74] proposed to search for the linear features by examining the Kullback-Leibler (KL) divergence between the positive and negative histograms projected onto each feature during boosting (hence the name Kullback-Leibler boosting). In [77], the authors proposed to apply Fisher discriminant analysis and, more generally, recursive nonparametric discriminant analysis (RNDA) to find the linear projections $\boldsymbol{\phi}_t$. Linear projection features are very powerful: the selected features shown in [74] and [77] resemble face templates, and they may significantly improve the convergence speed of the boosting classifier at early stages. However, caution must be taken to avoid overfitting if these features are to be used at the later stages of learning. In addition, the computational load of linear features is generally much higher than that of the traditional Haar-like features. In contrast, [78] proposed using simple pixel pairs as features, and [79] the relative values of a set of control points. Such pixel-based features can be computed even faster than Haar-like features; however, their discriminative power is generally insufficient to build high-performance detectors.
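The boosted classification function of Eq. (11) can be sketched as follows; this is a minimal illustration in which the projections $\boldsymbol{\phi}_t$, thresholds and confidence values are arbitrary placeholders, not learned parameters from [74] or [77]:

```python
import numpy as np

def make_stump(threshold, alpha):
    """A confidence-rated stump lambda_t : R -> R, as in AdaBoost.
    (threshold and alpha are illustrative values, not from the cited papers.)"""
    return lambda v: alpha if v >= threshold else -alpha

def classify(x, phis, lambdas):
    """Evaluate Eq. (11): F_T(x) = sign[ sum_t lambda_t(phi_t^T x) ].
    phis: list of linear projections phi_t (d-vectors);
    lambdas: list of R -> R discriminating functions."""
    score = sum(lam(float(phi @ x)) for phi, lam in zip(phis, lambdas))
    return 1 if score >= 0 else -1

# Toy usage: two weak learners on a 4-pixel "patch".
phis = [np.array([1., -1., 0., 0.]), np.array([0., 0., 1., 1.])]
lambdas = [make_stump(0.0, 0.7), make_stump(0.5, 0.3)]
x = np.array([0.9, 0.1, 0.4, 0.3])
print(classify(x, phis, lambdas))  # prints 1
```

Note that a Haar-like feature is exactly such a $\boldsymbol{\phi}_t$ whose entries are constant over a few rectangles, which is what makes it a special case of the linear features above.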
Another popular class of complex features for face/object detection is based on regional statistics such as histograms. In [80], local edge orientation histograms were proposed, which compute the histogram of edge orientations in subregions of the test window. These features are then selected by an AdaBoost algorithm to build the detector. The orientation
histogram is largely invariant to global illumination changes, and it is capable of capturing geometric properties of
faces that are difficult to capture with linear edge filters such as Haar-like features. However, similar to motion filters, edge-based histogram features are not scale invariant, hence one must first scale the test images to form a pyramid to make the local edge orientation histogram features reliable. Later, in [19] a similar scheme called histograms of oriented gradients (HoG) was proposed, which became a very popular feature for human/pedestrian detection [81, 82, 83, 84, 85] (we will discuss the use of HoG features in face detection in the next subsection). In [86], the authors proposed spectral histogram features, which adopt a broader set of filters before collecting the histogram features, including gradient, Laplacian of Gaussian and Gabor filters. Compared with [80], the histogram features in [86] were based on the whole test window rather than local regions, and Support Vector Machines (SVMs) were
used for classification. In [87] another histogram-based feature, called spatial histograms, was proposed. The spatial
histograms are based on local statistics of LBP. HoG and LBP were also combined in [88], which achieved excellent
performance in human detection with partial occlusion handling. Region covariance is another statistics-based feature, proposed in [89] for generic object detection and texture classification tasks. To extract these features, the covariance matrices among the color channels and gradient images are computed instead of histograms. Regional covariance
features can also be efficiently computed using integral images.
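As a rough illustration of such regional statistics, the following sketch computes a magnitude-weighted edge orientation histogram over a subregion, in the spirit of [80] and HoG [19]; the bin count and the simple finite-difference gradient are arbitrary choices here, not those of the cited papers:

```python
import numpy as np

def edge_orientation_histogram(patch, n_bins=8):
    """Histogram of gradient orientations over a subregion, weighted by
    gradient magnitude. Normalizing the histogram gives the (approximate)
    invariance to global illumination changes noted in the text."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)               # finite-difference gradients
    mag = np.hypot(gx, gy)                    # edge strength
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist

# Toy usage: a vertical step edge concentrates mass in a single orientation bin.
patch = np.tile(np.array([0., 0., 1., 1.]), (4, 1))
h = edge_orientation_histogram(patch)
```

Scaling the whole patch intensity by a constant leaves the normalized histogram unchanged, which is the illumination-invariance property exploited by these features.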
In [90] a sparse feature set was proposed in order to strengthen the features’ discrimination power without incurring
too much additional computational cost. Each sparse feature can be represented as:
$$f(\mathbf{x}) = \sum_i \alpha_i\, p_i(\mathbf{x}; u, v, s), \qquad \alpha_i \in \{-1, +1\} \qquad (12)$$
where $\mathbf{x}$ is an image patch and $p_i$ is a granule of the sparse feature. A granule is specified by three parameters: horizontal offset $u$, vertical offset $v$ and scale $s$. For instance, as shown in Fig. 8, $p_i(\mathbf{x}; 5, 3, 2)$ is a granule with top-left corner $(5,3)$ and scale $2^2 = 4$, and $p_i(\mathbf{x}; 9, 13, 3)$ is a granule with top-left corner $(9,13)$ and scale $2^3 = 8$. Granules can be computed efficiently using pre-constructed image pyramids, or through the integral image. In [90], the maximum
number of granules in a single sparse feature is 8. Since the total number of granules is large, the search space is
very large and exhaustive search is infeasible. The method employed a heuristic search scheme, where granules are
added to a sparse feature one-by-one, with an expansion operator that removes, refines and adds granules to a partially
selected sparse feature. To reduce the computation, the authors further conducted a multi-scale search, which uses