Stefanos Zafeiriou, Cha Zhang and Zhengyou Zhang / CVIU 00 (2015) 1–33 7
mapping function $\phi(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^1$, where $d$ is the size of the test patch. For linear features, $\phi(\mathbf{x}) = \boldsymbol{\phi}^T\mathbf{x}$, $\boldsymbol{\phi} \in \mathbb{R}^d$. The
classification function is of the following form:
$$F_T(\mathbf{x}) = \mathrm{sign}\Big[\sum_{t=1}^{T} \lambda_t(\boldsymbol{\phi}_t^T\mathbf{x})\Big], \qquad (11)$$
where $\lambda_t(\cdot)$ are $\mathbb{R} \rightarrow \mathbb{R}$ discriminating functions, such as the conventional stump classifiers in AdaBoost. $F_T(\mathbf{x})$ shall be 1 for positive examples and −1 for negative examples. Note that the Haar-like feature set is a subset of linear features. Another example is the anisotropic Gaussian filters in [75]. In [76], the linear features were constructed by pre-learning them using local non-negative matrix factorization (LNMF), which is still sub-optimal. Instead, Liu and Shum [74] proposed to search for the linear features by examining the Kullback-Leibler (KL) divergence between the positive and negative histograms projected onto each feature during boosting (hence the name Kullback-Leibler boosting). In [77], the authors proposed to apply Fisher discriminant analysis and, more generally, recursive nonparametric discriminant analysis (RNDA) to find the linear projections $\boldsymbol{\phi}_t$. Linear projection features are very powerful: the selected features shown in [74] and [77] resemble face templates, and they may significantly improve the convergence speed of the boosting classifier at early stages. However, caution must be taken to avoid overfitting if these features are to be used at the later stages of learning. In addition, the computational load of linear features is generally much higher than that of the traditional Haar-like features. In contrast, [78] proposed using simple pixel pairs as features, and [79] the relative values of a set of control points. Such pixel-based features can be computed even faster than Haar-like features; however, their discriminative power is generally insufficient to build high-performance detectors.
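The boosted classification function of Eq. (11) can be sketched as follows; this is a minimal illustration in which the projections $\boldsymbol{\phi}_t$, thresholds and confidence values are arbitrary placeholders, not learned parameters from [74] or [77]:

```python
import numpy as np

def make_stump(threshold, alpha):
    """A confidence-rated stump lambda_t : R -> R, as in AdaBoost.
    (threshold and alpha are illustrative values, not from the cited papers.)"""
    return lambda v: alpha if v >= threshold else -alpha

def classify(x, phis, lambdas):
    """Evaluate Eq. (11): F_T(x) = sign[ sum_t lambda_t(phi_t^T x) ].
    phis: list of linear projections phi_t (d-vectors);
    lambdas: list of R -> R discriminating functions."""
    score = sum(lam(float(phi @ x)) for phi, lam in zip(phis, lambdas))
    return 1 if score >= 0 else -1

# Toy usage: two weak learners on a 4-pixel "patch".
phis = [np.array([1., -1., 0., 0.]), np.array([0., 0., 1., 1.])]
lambdas = [make_stump(0.0, 0.7), make_stump(0.5, 0.3)]
x = np.array([0.9, 0.1, 0.4, 0.3])
print(classify(x, phis, lambdas))  # prints 1
```

Note that a Haar-like feature is exactly such a $\boldsymbol{\phi}_t$ whose entries are constant over a few rectangles, which is what makes it a special case of the linear features above.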
Another popular class of complex features for face/object detection is based on regional statistics such as histograms. In [80], local edge orientation histograms were proposed, which compute the histogram of edge orientations in subregions of the test window. These features are then selected by an AdaBoost algorithm to build the detector. The orientation
histogram is largely invariant to global illumination changes, and it is capable of capturing geometric properties of
faces that are difficult to capture with linear edge filters such as Haar-like features. However, similar to motion filters, edge-based histogram features are not scale invariant, hence one must first scale the test images to form a pyramid to make the local edge orientation histogram features reliable. Later, in [19] a similar scheme called histograms of oriented gradients (HoG) was proposed, which became a very popular feature for human/pedestrian detection [81, 82, 83, 84, 85] (we will discuss the use of HoG features in face detection in the next subsection). In [86], the authors proposed spectral histogram features, which adopt a broader set of filters before collecting the histogram features, including gradient, Laplacian of Gaussian and Gabor filters. Compared with [80], the histogram features in [86] were based on the whole test window rather than local regions, and Support Vector Machines (SVMs) were
used for classification. In [87] another histogram-based feature, called spatial histograms, was proposed. The spatial
histograms are based on local statistics of LBP. HoG and LBP were also combined in [88], which achieved excellent
performance in human detection with partial occlusion handling. Region covariance is another statistics-based feature, proposed in [89] for generic object detection and texture classification tasks. To extract these features, the covariance matrices among the color channels and gradient images are computed instead of histograms. Regional covariance
features can also be efficiently computed using integral images.
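As a rough illustration of such regional statistics, the following sketch computes a magnitude-weighted edge orientation histogram over a subregion, in the spirit of [80] and HoG [19]; the bin count and the simple finite-difference gradient are arbitrary choices here, not those of the cited papers:

```python
import numpy as np

def edge_orientation_histogram(patch, n_bins=8):
    """Histogram of gradient orientations over a subregion, weighted by
    gradient magnitude. Normalizing the histogram gives the (approximate)
    invariance to global illumination changes noted in the text."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)               # finite-difference gradients
    mag = np.hypot(gx, gy)                    # edge strength
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist

# Toy usage: a vertical step edge concentrates mass in a single orientation bin.
patch = np.tile(np.array([0., 0., 1., 1.]), (4, 1))
h = edge_orientation_histogram(patch)
```

Scaling the whole patch intensity by a constant leaves the normalized histogram unchanged, which is the illumination-invariance property exploited by these features.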
In [90] a sparse feature set was proposed in order to strengthen the features’ discrimination power without incurring
too much additional computational cost. Each sparse feature can be represented as:
$$f(\mathbf{x}) = \sum_i \alpha_i\, p_i(\mathbf{x}; u, v, s), \qquad \alpha_i \in \{-1, +1\} \qquad (12)$$
where $\mathbf{x}$ is an image patch and $p_i$ is a granule of the sparse feature. A granule is specified by three parameters: horizontal offset $u$, vertical offset $v$ and scale $s$. For instance, as shown in Fig. 8, $p_i(\mathbf{x}; 5, 3, 2)$ is a granule with top-left corner $(5,3)$ and scale $2^2 = 4$, and $p_i(\mathbf{x}; 9, 13, 3)$ is a granule with top-left corner $(9,13)$ and scale $2^3 = 8$. Granules can be computed efficiently using pre-constructed image pyramids, or through the integral image. In [90], the maximum
number of granules in a single sparse feature is 8. Since the total number of granules is large, the search space is
very large and exhaustive search is infeasible. The method employed a heuristic search scheme, where granules are
added to a sparse feature one-by-one, with an expansion operator that removes, refines and adds granules to a partially
selected sparse feature. To reduce the computation, the authors further conducted a multi-scale search, which uses