C. Classification Based on Sparse Representation
Given a new test sample y from one of the classes in the training set, we first compute its sparse representation $\hat{x}_1$ via (6) or (10). Ideally, the nonzero entries in the estimate $\hat{x}_1$ will all be associated with the columns of A from a single object class i, and we can easily assign the test sample y to that class. However, noise and modeling error may lead to small nonzero entries associated with multiple object classes (see Figure 3).
Based on the global, sparse representation, one can design many possible classifiers to resolve this. For instance, we can simply assign y to the object class with the single largest entry in $\hat{x}_1$. However, such heuristics do not harness the subspace structure associated with images in face recognition. To better harness such linear structure, we instead classify y based on how well the coefficients associated with all training samples of each object reproduce y.
For each class i, let $\delta_i : \mathbb{R}^n \to \mathbb{R}^n$ be the characteristic function which selects the coefficients associated with the i-th class. For $x \in \mathbb{R}^n$, $\delta_i(x) \in \mathbb{R}^n$ is a new vector whose only nonzero entries are the entries in x that are associated with class i. Using only the coefficients associated with the i-th class, one can approximate the given test sample y as $\hat{y}_i = A\,\delta_i(\hat{x}_1)$. We then classify y based on these approximations by assigning it to the object class that minimizes the residual between y and $\hat{y}_i$:

$$\min_i \; r_i(y) \doteq \| y - A\,\delta_i(\hat{x}_1) \|_2. \qquad (12)$$
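For concreteness, the following NumPy sketch shows one way to implement the selector $\delta_i$ and the residuals $r_i(y)$ of (12); the array `class_of_column` (giving the class of each column of A) and the helper names are illustrative assumptions, not part of the original formulation.

```python
# Illustrative sketch of delta_i and the residuals r_i(y) in Eq. (12), assuming
# NumPy and a hypothetical array `class_of_column` holding the class label of
# each column of A.
import numpy as np

def delta(x, class_of_column, i):
    """Keep only the entries of x associated with class i (the map delta_i)."""
    out = np.zeros_like(x)
    mask = np.asarray(class_of_column) == i
    out[mask] = x[mask]
    return out

def class_residuals(A, y, x_hat, class_of_column, classes):
    """Compute r_i(y) = ||y - A delta_i(x_hat)||_2 for each class i."""
    return {i: float(np.linalg.norm(y - A @ delta(x_hat, class_of_column, i)))
            for i in classes}
```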
Algorithm 1 below summarizes the complete recognition procedure. Our implementation minimizes the $\ell^1$-norm via a primal-dual algorithm for linear programming based on [39], [40].
Algorithm 1: Sparse Representation-based Classification (SRC)
1: Input: a matrix of training samples $A = [A_1, A_2, \ldots, A_k] \in \mathbb{R}^{m \times n}$ for k classes, a test sample $y \in \mathbb{R}^m$ (and an optional error tolerance $\varepsilon > 0$).
2: Normalize the columns of A to have unit $\ell^2$-norm.
3: Solve the $\ell^1$-minimization problem:
$$\hat{x}_1 = \arg\min_x \|x\|_1 \quad \text{subject to} \quad Ax = y. \qquad (13)$$
(Or alternatively, solve $\hat{x}_1 = \arg\min_x \|x\|_1$ subject to $\|Ax - y\|_2 \le \varepsilon$.)
4: Compute the residuals $r_i(y) = \|y - A\,\delta_i(\hat{x}_1)\|_2$ for $i = 1, \ldots, k$.
5: Output: identity$(y) = \arg\min_i r_i(y)$.
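A minimal Python sketch of Algorithm 1 is given below. It solves the $\ell^1$ problem (13) as a basis-pursuit linear program with SciPy's generic `linprog`, rather than the primal-dual solver of [39], [40] used in our implementation; the array `labels` (the subject of each training column) and the function names are illustrative assumptions.

```python
# Sketch of SRC (Algorithm 1), assuming NumPy/SciPy. The l1 problem (13) is
# posed as a linear program by splitting x = u - v with u, v >= 0.
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, y):
    """Solve min ||x||_1 subject to Ax = y via the split x = u - v, u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                        # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([A, -A])                 # A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v

def src_classify(A, y, labels):
    """Return the identity of y; labels[j] gives the class of column j of A."""
    labels = np.asarray(labels)
    A = A / np.linalg.norm(A, axis=0)         # step 2: unit l2-norm columns
    x_hat = l1_minimize(A, y)                 # step 3: sparse representation (13)
    residuals = {}
    for i in np.unique(labels):               # step 4: per-class residuals r_i(y)
        x_i = np.where(labels == i, x_hat, 0.0)
        residuals[i] = np.linalg.norm(y - A @ x_i)
    return min(residuals, key=residuals.get)  # step 5: identity(y) = argmin_i r_i(y)
```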
Example 1 ($\ell^1$-Minimization versus $\ell^2$-Minimization): To illustrate how Algorithm 1 works, we randomly select half of the 2,414 images in the Extended Yale B database as the training set, and the rest for testing. In this example, we subsample the images from the original 192 × 168 to size 12 × 10. The pixel values of the downsampled image are used as 120-D features, stacked as columns of the matrix A in the algorithm. Hence matrix A has size 120 × 1207, and the system y = Ax is underdetermined. Figure 3 left illustrates the sparse coefficients recovered by Algorithm 1 for a test image from the first subject. The figure also shows the features and the original images that correspond to the two largest coefficients. The two largest coefficients are both associated with training samples from subject 1. Figure 3 right shows the residuals w.r.t. the 38 projected coefficients $\delta_i(\hat{x}_1)$, $i = 1, 2, \ldots, 38$. With 12 × 10 downsampled images as features, Algorithm 1 achieves an overall recognition rate of 92.1% across the Extended Yale B database. (See Section IV for details and performance with other features such as Eigenfaces and Fisherfaces, as well as comparison with other methods.) Whereas the more conventional minimum $\ell^2$-norm solution to the underdetermined system y = Ax is typically quite dense, minimizing the $\ell^1$-norm favors sparse solutions, and provably recovers the sparsest solution when this solution is sufficiently sparse. To illustrate this contrast, Figure 4 left shows the coefficients of the same test image given by the conventional $\ell^2$-minimization (4), and Figure 4 right shows the corresponding residuals w.r.t. the 38 subjects. The coefficients are much less sparse than those given by $\ell^1$-minimization (in Figure 3), and the dominant coefficients are not associated with subject 1. As a result, the smallest residual in Figure 4 does not correspond to the correct subject (subject 1).
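The contrast can also be reproduced numerically. The sketch below compares the dense minimum $\ell^2$-norm solution with the $\ell^1$ solution on a small synthetic underdetermined system (not the face data itself); it reuses the hypothetical `l1_minimize` helper from the sketch after Algorithm 1, and the dimensions are chosen only for a quick experiment.

```python
# Synthetic contrast: minimum l2-norm solution (pseudoinverse) vs. l1 solution
# on a random underdetermined system; l1_minimize is the basis-pursuit sketch
# given after Algorithm 1.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 200))                 # underdetermined: m < n
x_true = np.zeros(200)
x_true[rng.choice(200, size=5, replace=False)] = rng.standard_normal(5)
y = A @ x_true

x_l2 = np.linalg.pinv(A) @ y                       # minimum l2-norm solution: dense
x_l1 = l1_minimize(A, y)                           # l1 solution: sparse, near x_true

print("nonzeros: l2 =", int(np.sum(np.abs(x_l2) > 1e-6)),
      " l1 =", int(np.sum(np.abs(x_l1) > 1e-6)))
```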
D. Validation Based on Sparse Representation
Before classifying a given test sample, we must first decide if
it is a valid sample from one of the classes in the dataset. The
ability to detect and then reject invalid test samples, or “outliers,”
is crucial for recognition systems to work in real-world situations.
A face recognition system, for example, could be given a face
image of a subject that is not in the database, or an image that is
not a face at all.
Systems based on conventional classifiers, such as nearest neighbor (NN) or nearest subspace (NS), often use the residuals $r_i(y)$ for validation, in addition to identification. That is, the algorithm accepts or rejects a test sample based on how small the smallest residual is. However, each residual $r_i(y)$ is computed without any knowledge of images of other object classes in the training dataset and only measures similarity between the test sample and each individual class.
In the sparse representation paradigm, the coefficients $\hat{x}_1$ are computed globally, in terms of images of all classes. In a sense, this global representation can harness the joint distribution of all classes for validation. We contend that the coefficients $\hat{x}$ are better statistics for validation than the residuals. Let us first see this through an example.
Example 2 (Concentration of Sparse Coefficients): We randomly select an irrelevant image from Google, and downsample it to 12 × 10. We then compute the sparse representation of the image against the same Extended Yale B training data as in Example 1. Figure 5 left plots the obtained coefficients, and right plots the corresponding residuals. Compared to the coefficients of a valid test image in Figure 3, notice that the coefficients $\hat{x}$ here are not concentrated on any one subject and instead spread widely across the entire training set. Thus, the distribution of the estimated sparse coefficients $\hat{x}$ contains important information about the validity of the test image: a valid test image should have a sparse representation whose nonzero entries concentrate mostly on one subject, whereas an invalid image has sparse coefficients spread widely among multiple subjects.
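Before stating the formal index, one hypothetical way to inspect this concentration is to compute the fraction of the $\ell^1$ mass of $\hat{x}$ that falls on each subject; the helper below is an illustrative sketch under that assumption (with `labels` mapping columns of A to subjects), not the measure defined next.

```python
# Hypothetical concentration check: fraction of the l1 mass of x_hat carried by
# each class. A valid test image places most of this mass on one subject; the
# invalid image of Example 2 spreads it widely.
import numpy as np

def per_class_l1_mass(x_hat, labels):
    labels = np.asarray(labels)
    total = np.sum(np.abs(x_hat)) + 1e-12          # guard against division by zero
    return {i: float(np.sum(np.abs(x_hat[labels == i])) / total)
            for i in np.unique(labels)}
```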
To quantify this observation, we define the following measure
of how concentrated the coefficients are on a single class in the
dataset:
Definition 1 (Sparsity Concentration Index): The sparsity concentration index (SCI) of a coefficient vector $x \in \mathbb{R}^n$ is