LIU et al.: PEDESTRIAN-DETECTION BASED ON HETEROGENEOUS FEATURES AND ENSEMBLE OF MVP PARTS 815
Fig. 1. General architecture of the proposed pedestrian detector.
A. Overview of the Proposed Pedestrian-Detection Method
The architecture of our proposed method is presented in
Fig. 1. Due to high variability in pedestrian appearance, the
pedestrian is divided into several body parts, and each body
part is treated from different viewing angles and poses, re-
spectively. The details of division of parts, poses, and views
are given in Section III-C and D. For each view or pose of
a certain body part, an expert classifier with heterogeneous
features (to be introduced in Section III-B) is trained. These
classifiers are assembled within a two-stage structure. The
first stage ensembles different views and poses of each body
part, with view–pose ensemble (VPE) functions, and forms
an MVP ensemble classifier for each body part. The second
stage combines all MVP body parts with a part ensemble
(PE) function. When an ROI is input to the detector, all
individual expert classifiers examine the ROI from their own
field of expertise, i.e., a certain viewing angle or pose of
a certain body part. After collecting the opinions from the
experts, the VPE functions combine the classification results
for each body part. Then, the PE function produces the final
decision.
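The two-stage flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this paper: the names `vpe_max`, `pe_mean`, and `detect` are hypothetical, and the choice of a maximum as the VPE function and a mean as the PE function is a placeholder, since this subsection does not specify the ensemble functions. The expert classifiers are replaced by toy functions that return fixed scores.

```python
from typing import Callable, Dict, List, Sequence

# An expert maps an ROI (any object here) to a real-valued score.
Expert = Callable[[object], float]


def vpe_max(scores: Sequence[float]) -> float:
    """Hypothetical VPE function: keep the most confident view/pose expert."""
    return max(scores)


def pe_mean(part_scores: Sequence[float]) -> float:
    """Hypothetical PE function: average the per-part MVP scores."""
    return sum(part_scores) / len(part_scores)


def detect(roi, experts_by_part: Dict[str, List[Expert]],
           vpe=vpe_max, pe=pe_mean, threshold: float = 0.0) -> bool:
    # Stage 1: every expert of a body part scores the ROI; VPE fuses them.
    mvp_scores = [vpe([expert(roi) for expert in experts])
                  for experts in experts_by_part.values()]
    # Stage 2: PE fuses the per-part scores into the final decision.
    return pe(mvp_scores) > threshold


# Toy experts with fixed scores (placeholders for trained classifiers).
experts = {
    "UB": [lambda roi: -0.4, lambda roi: 0.9],
    "LB": [lambda roi: 0.2, lambda roi: 0.1],
    "FB": [lambda roi: -0.3, lambda roi: -0.6],
}
is_pedestrian = detect(None, experts)
```

Passing the fusion rules as arguments keeps the two stages independent, so a different VPE or PE function can be substituted without touching the expert classifiers.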
B. Pedestrian Feature Description Based on Combination of
Heterogeneous Features
One important step in the process of pedestrian detection is
to perform a thorough and distinctive feature description of the
pedestrian. Commonly used features include HOG, color
features, LBP, Haar wavelets, and motion features. A single
feature describes only one aspect of the pedestrian,
such as contour, color, local region, or texture, and therefore
has limited description power. To describe the pedestrian
better, several works combine more than one feature to
enhance the description
power, such as HOG–LBP [18] and HOG–CSS features [15].
HOG–LBP features extract contour and texture information
simultaneously and are among the best performing (and most
popular) feature sets available [35], [36]. Nevertheless, the
simple concatenation of the two feature vectors, as in [18],
does not take the contributions of both individual features into
account, and the description ability of the features is not fully
exploited. Inspired by [37], in this paper, a new linear kernel
function is proposed to combine the two heterogeneous features
with complementary information, as
$$K(x_i, x_j) = \sum_{k=0}^{m} (1 - \beta)\, x^{H}_{ik} x^{H}_{jk} + \sum_{k=0}^{n} \beta\, x^{L}_{ik} x^{L}_{jk} \tag{1}$$
where $K(x_i, x_j)$ represents the proposed kernel function; $x_i$
is the feature vector of sample $i$; $x_i = [x^{H}_{i}, x^{L}_{i}]$;
$x^{H}_{ik}$, $x^{L}_{ik}$ represent the kth element of the feature
vectors of HOG and LBP, respectively; m and n are the dimensions
of the feature vectors of HOG and LBP. β is a combination
coefficient, which determines the contribution of each feature,
and β ∈ [0, 1].
With (1), the contour feature and the local region feature
are combined with explicit weighting of their respective
contributions. Note that the simple concatenation
approach proposed in [18] is a special case of (1), up to a
constant scale factor, in which
the contributions of the two features are considered equal,
i.e., β = 0.5. Compared with the method in [18], our approach
significantly improves the description power of the feature
combination without a noticeable increase in computation cost.
In addition, compared with the RBF kernel function in [37],
our approach requires less computation.
Details are shown in Section IV-B.
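The kernel in (1) and its relation to plain concatenation can be checked with a few lines of code. The sketch below is illustrative only: the function names `dot` and `combined_kernel` and the toy vectors are assumptions introduced here, and real HOG and LBP descriptors are far longer.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def combined_kernel(hog_i, lbp_i, hog_j, lbp_j, beta: float) -> float:
    """Weighted linear kernel of (1):
    (1 - beta) * <HOG_i, HOG_j> + beta * <LBP_i, LBP_j>."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * dot(hog_i, hog_j) + beta * dot(lbp_i, lbp_j)


# Toy vectors with made-up values.
hog_a, lbp_a = [1.0, 2.0], [0.5, 0.0, 1.0]
hog_b, lbp_b = [3.0, 1.0], [2.0, 4.0, 1.0]

k_half = combined_kernel(hog_a, lbp_a, hog_b, lbp_b, beta=0.5)
# Plain linear kernel on the concatenated vectors, for comparison.
k_concat = dot(hog_a + lbp_a, hog_b + lbp_b)
```

With β = 0.5, `k_half` equals half of `k_concat`, so equal weighting gives the linear kernel on the concatenated vector scaled by a constant. Setting β to 0 or 1 reduces the kernel to the HOG-only or LBP-only inner product.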
In this paper, HOG features are extracted in the same way
as in [7]. LBP uses the same block size (16 × 16) as HOG.
For each block, LBP generates a histogram with 58 uniform
patterns and 1 nonuniform pattern. The histograms of all blocks
are concatenated as the LBP feature vector of the input image.
As the size of the ROI is 48 × 96, the proposed heterogeneous
features describe the input image with a 5225-dimensional
feature vector.
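The 5225-dimensional figure can be reproduced by a short calculation. The sketch below assumes a common HOG configuration that is not stated in this section: 2 × 2 cells per 16 × 16 block, 9 orientation bins per cell, and an 8-pixel block stride. It also assumes that LBP blocks use the same stride. Only the ROI size, the 16 × 16 block size, and the 59 LBP bins come from the text above.

```python
def blocks_along(length: int, block: int, stride: int) -> int:
    """Number of block positions along one image dimension."""
    return (length - block) // stride + 1


ROI_W, ROI_H = 48, 96      # ROI size from the text
BLOCK = 16                 # block size from the text
STRIDE = 8                 # assumption: 8-pixel block stride
CELLS_PER_BLOCK = 4        # assumption: 2 x 2 cells of 8 x 8 pixels
HOG_BINS = 9               # assumption: 9 orientation bins per cell
LBP_BINS = 58 + 1          # 58 uniform patterns + 1 nonuniform bin

n_blocks = blocks_along(ROI_W, BLOCK, STRIDE) * blocks_along(ROI_H, BLOCK, STRIDE)
hog_dim = n_blocks * CELLS_PER_BLOCK * HOG_BINS
lbp_dim = n_blocks * LBP_BINS
total_dim = hog_dim + lbp_dim
```

Under these assumptions there are 5 × 11 = 55 blocks, giving 1980 HOG dimensions and 3245 LBP dimensions, which sum to the 5225 dimensions quoted above.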
C. Division of Body Parts
To handle possible partial occlusion while balancing
model complexity and detection accuracy, the pedestrian is
divided into three parts, i.e., UB, LB, and FB, following
the approach proposed in [9] and [25] (see Fig. 2). Every
part covers a fixed percentage of the pedestrian. UB and LB
each cover 50% of the body, whereas FB covers 100% of the body.
A horizontally occluded pedestrian can be detected with
this division of the body. For example, a pedestrian with
an umbrella on his shoulder [UB occluded; see Fig. 3(a)] can
be properly detected by examining the nonoccluded LB. As for
pedestrians with vertical occlusions [see Fig. 3(c) and (d)], as
long as the occlusion covers less than 30% of the body, this kind of
occlusion can be handled by including adequate vertically occluded
samples in the training data set.
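As a rough illustration of this division, the part regions of a 48 × 96 ROI could be computed as below. The function name `part_boxes` and the (x, y, width, height) box layout are hypothetical, and the sketch assumes that UB and LB are the top and bottom halves of the ROI with the y axis pointing downward; the exact part boundaries are those shown in Fig. 2.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), y grows downward


def part_boxes(roi_w: int = 48, roi_h: int = 96) -> Dict[str, Box]:
    """Split an ROI into UB, LB, and FB regions (assumed top/bottom halves)."""
    half = roi_h // 2
    return {
        "UB": (0, 0, roi_w, half),             # assumed: top 50% of the ROI
        "LB": (0, half, roi_w, roi_h - half),  # assumed: bottom 50% of the ROI
        "FB": (0, 0, roi_w, roi_h),            # 100% of the ROI
    }


boxes = part_boxes()
```

With this layout, an occlusion confined to the top half leaves the LB region untouched, which is the case illustrated by the umbrella example.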