
• To remove noisy and redundant attributes, we propose three attribute selection criteria and formulate the attribute selection problem as a submodular optimization problem. A greedy optimization algorithm is presented, and its solution is guaranteed to be at least a (1 − 1/e)-approximation to the optimum.
• We conduct experiments on four public datasets for both object and action recognition, and demonstrate that the proposed attribute-based representations yield performance comparable to that of most existing visual recognition algorithms.
This paper is organized as follows: Section 2 describes the extraction of both human-labeled and data-driven attributes. Section 3 introduces the concept of submodularity and presents the proposed submodular attribute selection approach. Section 4 gives implementation details, and Section 5 provides experimental results and analysis on four public datasets. Section 6 concludes the paper.
2 HUMAN-LABELED ATTRIBUTE AND DATA-DRIVEN ATTRIBUTE EXTRACTION
Visual classes can be characterized by a collection of human-labeled attributes. For example, the action "long-jump" in the Olympic Sports Dataset [29] is associated with both motion attributes (e.g., jump forward, motion in the air) and scene attributes (e.g., outdoor, track). Given an instance x ∈ R^d, an attribute classifier f_a : x → {0, 1} predicts the confidence score for the presence of an attribute a in the image or video. The classifier f_a is learned by treating the training samples of all classes that have this attribute as positives and the rest as negatives. Given a set of P attribute classifiers {f_{a_p}(x)}_{p=1}^P, an instance x is mapped to the semantic space O:

h(x) : R^d → O = [0, 1]^P    (1)

where h(x) = [h_1(x), ..., h_P(x)]^T is a P-dimensional attribute score vector.
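As an illustration, the mapping in Eq. (1) can be realized with off-the-shelf binary classifiers. The snippet below is a minimal sketch, assuming scikit-learn's LogisticRegression as a stand-in attribute classifier (the paper does not prescribe a specific model) and a binary class-attribute matrix of the kind shown in Table 1; the function names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_attribute_classifiers(X, y, class_attr):
    """X: (N, d) features; y: (N,) integer class labels;
    class_attr: (K, P) binary matrix, class_attr[k, p] = 1 iff class k
    has attribute a_p (a hypothetical stand-in for Table 1)."""
    classifiers = []
    for p in range(class_attr.shape[1]):
        # Positives: samples whose class has attribute a_p; rest negative.
        labels = class_attr[y, p]
        classifiers.append(LogisticRegression(max_iter=1000).fit(X, labels))
    return classifiers

def h(x, classifiers):
    # Eq. (1): map x in R^d to the semantic space O = [0, 1]^P.
    return np.array([clf.predict_proba(x[None, :])[0, 1]
                     for clf in classifiers])
```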
For the extraction of data-driven attributes, we propose to discover a large set of data-driven attributes using a dictionary learning method. The low-level features of N training samples are denoted as X = [x_1, x_2, ..., x_N] ∈ R^{d×N}. Assuming that we have K classes, the features of the training samples can also be expressed as X = [X_1, X_2, ..., X_K], where X_k ∈ R^{d×N_k} denotes the N_k samples from class k. For each class k, we first learn a class-specific dictionary D_k of size M_k using the K-SVD algorithm [2]. We then initialize a total dictionary D from these class-specific dictionaries as D = [D_1, D_2, ..., D_K], and learn it by minimizing the reconstruction error over all the training samples. The class-specific dictionary D_k is learned by solving the following problem:

arg min_{D_k, Z_k} ||X_k − D_k Z_k||_F^2   s.t. ∀i, ||z_i^k||_0 ≤ T    (2)
where D_k = [d_1^k, ..., d_{M_k}^k] ∈ R^{d×M_k} and Z_k = [z_1^k, ..., z_{N_k}^k] ∈ R^{M_k×N_k} are the sparse codes of X_k. The sparsity constraint ||z_i^k||_0 ≤ T specifies that the sample x_i^k uses at most T dictionary atoms from D_k in its decomposition. After we have obtained the class-specific dictionaries, we concatenate them to initialize a total dictionary D = [D_1, D_2, ..., D_K] of size M, where M = Σ_{k=1}^{K} M_k. The dictionary D can also be denoted as D = [d_1, d_2, ..., d_M] ∈ R^{d×M} and is learned by solving the following problem:

arg min_{D, Z} ||X − DZ||_F^2   s.t. ∀i, ||z_i||_0 ≤ T    (3)
where D is the learned attribute dictionary of size M, and Z = [z_1, ..., z_N] ∈ R^{M×N} are the sparse codes of X. The sparsity constraint ||z_i||_0 ≤ T specifies that the sample x_i uses at most T dictionary atoms from D in its decomposition, i.e., each vector z_i is sparse with at most T non-zero entries. The value of the j-th entry z_{ij} of the coefficient vector z_i indicates whether the dictionary atom d_j is used in the decomposition of sample x_i. Thus, each dictionary atom d_j is treated as a data-driven attribute, and z_i is treated as an M-dimensional attribute score vector.
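For concreteness, this pipeline can be sketched as follows. K-SVD itself is not part of scikit-learn, so the sketch substitutes sklearn's DictionaryLearning, which optimizes an ℓ1-penalized objective during fitting rather than the exact ℓ0-constrained problems of Eqs. (2) and (3); the OMP-based coding step does enforce the at-most-T non-zeros constraint. All function names and parameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def learn_attribute_dictionary(X_per_class, M_k):
    """X_per_class: list of (N_k, d) arrays (rows are samples, i.e. the
    transpose of the paper's d x N_k layout); returns D as (M, d)."""
    atoms = []
    for X_k in X_per_class:
        # Class-specific dictionary D_k (stand-in for K-SVD, Eq. (2)).
        dl = DictionaryLearning(n_components=M_k, random_state=0).fit(X_k)
        atoms.append(dl.components_)
    # Concatenate class-specific dictionaries into the total D, M = sum M_k.
    return np.vstack(atoms)

def attribute_scores(X, D, T):
    # Sparse codes Z via OMP with at most T non-zero entries per z_i;
    # each z_i serves as an M-dimensional data-driven attribute vector.
    return sparse_encode(X, D, algorithm='omp', n_nonzero_coefs=T)
```

The refinement of the concatenated dictionary in Eq. (3) is not shown; with this stand-in, it could be approximated by passing the stacked atoms as dict_init to a second DictionaryLearning fit over all training samples.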
3 SUBMODULAR ATTRIBUTE SELECTION
Since we formulate the attribute selection problem as a submodular optimization problem, we first briefly review the definition and optimization of submodular functions. Then we propose three attribute selection criteria for selecting a discriminative and compact subset of attributes. Finally, we formulate the attribute selection problem as the optimization of a submodular function, which is a linear combination of the entropy rate of a random walk and a weighted maximum coverage function.
3.1 Submodularity
Submodular functions are a class of set functions that have the property of diminishing returns [28]. Given a ground set E, a set function F : 2^E → R is submodular if F(A ∪ {v}) − F(A) ≥ F(B ∪ {v}) − F(B) holds for all A ⊆ B ⊆ E and v ∈ E \ B. The diminishing-returns property means that the marginal gain of adding an element v decreases when v is added at a later stage, i.e., to a larger set. Recently, submodular functions have been widely exploited in various applications, such as sensor placement [17], superpixel segmentation [26], document summarization [24], object detection and recognition [15], [51], and feature selection [8], [27]. The method of [27] performs submodular feature selection for acoustic score spaces based on existing facility location and saturated coverage functions. Different from these applications, we define a novel submodular objective function for attribute selection. Although we only evaluate our approach on object and action recognition, it can be applied to other recognition tasks that use attributes.
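To make the greedy optimization mentioned in Section 1 concrete, the following is an illustrative sketch, not the paper's exact algorithm: greedy maximization of a monotone submodular function under a cardinality constraint, which carries the classic (1 − 1/e) approximation guarantee.

```python
def greedy_maximize(F, ground_set, budget):
    """F: callable mapping a frozenset to a real value; budget: max |S|."""
    selected = frozenset()
    for _ in range(budget):
        # Pick the element with the largest marginal gain F(S + v) - F(S).
        gains = {v: F(selected | {v}) - F(selected)
                 for v in ground_set - selected}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:          # no element improves the objective
            break
        selected = selected | {best}
    return selected

# Toy usage: weighted set coverage, a canonical submodular function.
cover = {1: {'a', 'b'}, 2: {'b', 'c'}, 3: {'c'}}
F = lambda S: len(set().union(*(cover[v] for v in S))) if S else 0
print(greedy_maximize(F, set(cover), budget=2))   # e.g. frozenset({1, 2})
```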
3.2 Attribute Selection Criterion
Motivation: We first present an example to illustrate the motivation of the proposed attribute selection criteria. Assume that we have four classes, denoted c_k, k = 1, ..., 4, and three attributes, denoted a_m, m = 1, ..., 3. Each class c_k may or may not have each attribute. An example attribute assignment for each class is shown in Table 1. Here we consider the discrimination capability of attribute a_m for distinguishing class c_i from class c_j. If exactly one of the two classes has attribute a_m, then a_m is discriminative for distinguishing c_i from c_j. However, if both classes have, or both lack, this attribute, then a_m is not useful for distinguishing the two classes. In this way, we evaluate the discrimination capability of each attribute for distinguishing each pair of classes in Table 2. Let us further consider the task of selecting two attributes out of the three.
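Since Table 1 is not reproduced here, the snippet below uses a hypothetical class-attribute assignment to show how a pairwise discrimination table of the kind in Table 2 is derived: attribute a_m distinguishes the pair (c_i, c_j) exactly when the two classes differ on a_m.

```python
import itertools
import numpy as np

# Hypothetical stand-in for Table 1: rows are classes c_1..c_4,
# columns are attributes a_1..a_3.
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])

# a_m distinguishes c_i from c_j iff exactly one of them has it (XOR).
for i, j in itertools.combinations(range(A.shape[0]), 2):
    discriminative = np.flatnonzero(A[i] != A[j]) + 1
    print(f"(c_{i+1}, c_{j+1}): attributes {discriminative.tolist()}")
```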