• Promising experimental results, comparable to the state of the art, have been obtained on the Caltech101, Pascal VOC2007, and Scene15 data sets, and significant improvements have been achieved over several existing MKL methods across the four data sets. A new bound is established for the performance of the state-of-the-art MKL method on object recognition.
The remainder of this paper is organized as follows. Section II
reviews related work. In Section III, the GS-MKL framework
is introduced for object recognition. The learning algorithm of
GS-MKL is presented in Section IV. Section V presents two
sample grouping strategies for GS-MKL. The experimental re-
sults are given in Section VI. Finally, Section VII concludes this
paper.
A preliminary version of this work has been published in
[53]. The main extensions include two grouping strategies in which sample grouping interacts with GS-MKL training, a comparison of these grouping strategies, comparisons of GS-MKL with other MKL methods, and more extensive experiments.
II. RELATED WORK
In the past decade, considerable research effort has been devoted to characterizing visual statistics for a number of object categories [2], [7], [13], [14], [19], [27]. Among these efforts, the kernel method [3], [5], [15], [16], [18] has been one of the most attractive research directions. Generally speaking, the kernel method offers two advantages in learning object categories: (1) a kernel explicitly defines a visual similarity measure between image pairs and implicitly maps the input space to the feature space [13], thereby avoiding an explicit feature representation and the curse of dimensionality; (2) combined with SVM, the kernel method can efficiently find the optimal separating hyperplane between positive and negative samples. Hence, the SVM-based kernel method has been applied to many recognition problems beyond object recognition (e.g., object detection [40] and image and video annotation [41]–[43]). Generally, SVM-based kernel methods used in object recognition can be categorized into four types, i.e., individual kernel design, canonical MKL, SS-MKL, and SVM ensemble. We briefly review these lines of work as follows.
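As a concrete illustration of advantage (2), the following minimal sketch trains an SVM on a precomputed kernel matrix. It assumes scikit-learn and uses a generic RBF similarity merely as a stand-in for any image-pair kernel; it is illustrative and does not reproduce any cited method.

    # Minimal sketch: SVM with a precomputed kernel matrix (assumes scikit-learn).
    # The RBF similarity is only a placeholder for an image-pair kernel.
    import numpy as np
    from sklearn.svm import SVC

    def kernel_matrix(A, B, gamma=0.5):
        # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq)

    X_train = np.random.rand(20, 8)           # toy feature vectors
    y_train = np.random.randint(0, 2, 20)     # toy binary labels
    X_test = np.random.rand(5, 8)

    clf = SVC(kernel="precomputed")
    clf.fit(kernel_matrix(X_train, X_train), y_train)

    # At test time, each kernel row holds similarities to the training samples.
    print(clf.predict(kernel_matrix(X_test, X_train)))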
A. Individual Kernel Design
Recently, many efforts have been made to carefully design individual kernels that measure the similarity of an image pair. A kernel based on multiresolution histograms is introduced in [15] to measure image similarity at different granularities. A spatial pyramid matching kernel (PMK) is introduced in [3] to enforce loose spatial information by matching images in spatial coordinates. A kernel based on the local feature distribution is presented in [16] to model the local context of an image. A chi-squared kernel based on the pyramid histogram of oriented gradients (PHOG) is presented in [33] to capture shape similarity with spatial layout.
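As one example of this style of kernel, a chi-squared kernel over normalized histograms (of the general kind used with PHOG descriptors) can be sketched as follows; the bandwidth parameter and the toy histograms are placeholders, and the sketch is not the exact formulation of [33].

    # Minimal sketch of an exponential chi-squared kernel between two
    # L1-normalized histograms (e.g., PHOG-like shape histograms).
    import numpy as np

    def chi2_kernel(h1, h2, gamma=1.0, eps=1e-10):
        h1 = h1 / (h1.sum() + eps)            # L1-normalize each histogram
        h2 = h2 / (h2.sum() + eps)
        chi2 = np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
        return np.exp(-gamma * chi2)          # similarity in (0, 1]

    a = np.array([4.0, 1.0, 0.0, 3.0])
    b = np.array([2.0, 2.0, 1.0, 3.0])
    print(chi2_kernel(a, b))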
All these methods rely on features that represent particular visual characteristics. However, not all kernels play the same role in differentiating object categories. Hence, kernel selection/fusion over a set of available kernels is usually desired for generic object recognition. It is worth noting that individual kernels can be incorporated into the proposed GS-MKL framework to investigate their respective contributions to object recognition.
B. Canonical MKL
Recently, instead of using a single kernel, classifiers based on multikernel combination have been introduced into object recognition, yielding promising results [5], [18], [38], [45]. In [5] and [18], multiple features (e.g., appearance and shape) and kernels [e.g., PMK and spatial pyramid kernels (SPKs) with different hyper-parameters] are employed and combined in the MKL framework. Bosch et al. [45] strengthen MKL with a cross-validation strategy: the initial weights of multiple kernels are learnt by an extended MKL [5] and then refined by an exhaustive search that minimizes the classification error over a validation set. In [44], kernel alignment is utilized to optimize the multikernel combination over color, shape, and appearance features. Basically, these methods adopt a uniform multikernel combination over the whole input space. Hence, when training data exhibit high intraclass variation and interclass correlation on local training samples, these methods may suffer degraded performance because of the globally uniform multikernel combination.
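In other words, canonical MKL shares one set of kernel weights across all samples, roughly as in the sketch below (assuming scikit-learn; the base kernels and the hand-fixed weights are placeholders, whereas in MKL the weights would be learned jointly with the classifier).

    # Minimal sketch of a uniform multikernel combination: a single weight
    # vector beta is shared across the whole input space.
    import numpy as np
    from sklearn.svm import SVC

    def linear_kernel(A, B):
        return A @ B.T

    def rbf_kernel(A, B, gamma=0.5):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * sq)

    X = np.random.rand(30, 6)
    y = np.random.randint(0, 2, 30)

    beta = np.array([0.3, 0.7])               # global kernel weights (hand-fixed here)
    K = beta[0] * linear_kernel(X, X) + beta[1] * rbf_kernel(X, X)

    clf = SVC(kernel="precomputed").fit(K, y)
    print(clf.score(K, y))                     # training accuracy on toy data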
C. SS-MKL
More recently, SS-MKL methods have been proposed in [23], [27], and [29] by using sample-specific kernel weighting strategies. The basic idea is that kernel weights depend not only on the kernel functions but also on the samples themselves. Compared with canonical MKL, SS-MKL tends to reflect the relative importance of different kernels at the level of individual samples rather than at the level of object categories. Despite some performance improvements, learning so many parameters may lead to expensive computation and a risk of overfitting.
It has to be noted that, although the proposed GS-MKL and the methods [5], [18], [23], [27], [45] reviewed above all extend the MKL framework, GS-MKL provides a mechanism for evaluating multiple kernels over sample groups. From this view, GS-MKL is a more flexible framework that subsumes canonical MKL and SS-MKL as special cases obtained by changing the number of groups. GS-MKL thus provides a tractable way to adapt the multikernel combination to the local data distribution of each sample group.
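The distinction among the three weighting schemes can be summarized informally as follows; the notation is illustrative rather than taken verbatim from the cited formulations:
\[
\text{canonical MKL: } K(x_i, x_j) = \sum_{m} \beta_m \, K_m(x_i, x_j),
\]
\[
\text{SS-MKL: } K(x_i, x_j) = \sum_{m} \beta_m(x_i)\,\beta_m(x_j)\, K_m(x_i, x_j),
\]
\[
\text{GS-MKL: } K(x_i, x_j) = \sum_{m} \beta_m^{g(x_i)}\,\beta_m^{g(x_j)}\, K_m(x_i, x_j),
\]
where the $K_m$ are base kernels, $\beta$ denotes kernel weights, and $g(x)$ is the group to which sample $x$ belongs. With a single group, GS-MKL reduces to canonical MKL; with one group per sample, it reduces to SS-MKL.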
D. Learning With Classifier Ensemble
Instead of relying on a single classifier, classifier ensembles have been proposed as an alternative technique to improve classification accuracy. Classifier ensembles can take place at the data, feature, and classifier levels [46]. To cope with the diversity of data, a straightforward classifier ensemble method employs a data partitioning strategy in which each base classifier is trained over a distinct subset of the training data. Such divide-and-conquer methods train multiple base classifiers that are experts in their specific parts of the data space. However, the base classifiers are trained independently, leaving out the other partitions of the data. When this independence assumption does not hold, it cannot be assured that the decisions of the base classifiers will improve the final classification performance.
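A minimal sketch of such a divide-and-conquer ensemble is given below; it is a generic illustration (not a particular method from [46]) that partitions the training data by clustering, trains one base classifier per partition, and combines their predictions by majority vote.

    # Minimal sketch of a data-partitioning classifier ensemble (assumes
    # scikit-learn): each base classifier sees only its own partition, and
    # test predictions are combined by majority vote.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    X = np.random.rand(60, 5)
    y = np.random.randint(0, 2, 60)
    X_test = np.random.rand(10, 5)

    # Partition the training data into disjoint groups (here via k-means).
    groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    base_clfs = []
    for g in np.unique(groups):
        idx = groups == g
        if len(np.unique(y[idx])) < 2:        # skip partitions with a single class
            continue
        base_clfs.append(SVC(kernel="rbf").fit(X[idx], y[idx]))

    # Majority vote over the independently trained base classifiers.
    votes = np.stack([clf.predict(X_test) for clf in base_clfs])
    print((votes.mean(axis=0) >= 0.5).astype(int))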