quantization (VQ) coding method, called gain-shape VQ, where
this coefficient is allowed to vary [39]. In contrast, in sparse
representations as discussed in this paper, each example is represented as a linear combination of several dictionary vectors $\{\mathbf{d}_k\}$. Thus, sparse representations can be regarded as a generalization of the clustering problem.
Since the $K$-means algorithm (also known as the generalized Lloyd algorithm, GLA [39]) is the most commonly used procedure for training in the vector quantization setting, it is natural to consider generalizations of this algorithm when turning to the problem of dictionary training. The clustering problem and its $K$-means solution will be discussed in more detail in Section IV-A, since our work approaches the dictionary training problem by generalizing the $K$-means. Here we shall briefly mention that the $K$-means process applies two steps per iteration: i) given the codebook $\{\mathbf{d}_k\}_{k=1}^{K}$, assign each training example to its nearest codeword; and ii) given that assignment, update $\{\mathbf{d}_k\}_{k=1}^{K}$ to better fit the examples.
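For concreteness, one such iteration can be sketched in a few lines of NumPy; the column-wise data layout, the function name, and the handling of empty clusters below are illustrative choices and not part of the algorithm's definition.

```python
import numpy as np

def kmeans_iteration(Y, D):
    """One K-means (generalized Lloyd) iteration.

    Y : (n, N) array of training examples, one example per column.
    D : (n, K) codebook, one codeword per column.
    Returns the updated codebook.
    """
    # Step i): assign each example to its nearest codeword
    # (squared Euclidean distance).
    dists = ((Y[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)  # (K, N)
    labels = dists.argmin(axis=0)

    # Step ii): update each codeword as the mean of the examples
    # assigned to it (codewords with no members are left unchanged).
    D_new = D.copy()
    for k in range(D.shape[1]):
        members = Y[:, labels == k]
        if members.shape[1] > 0:
            D_new[:, k] = members.mean(axis=1)
    return D_new
```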
The approaches to dictionary design that have been tried so
far are very much in line with the two-step process described
above. The first step finds the coefficients given the dictio-
nary—a step we shall refer to as “sparse coding.” Then, the
dictionary is updated assuming known and fixed coefficients.
The differences between the various algorithms that have been
proposed are in the method used for the calculation of coeffi-
cients and in the procedure used for modifying the dictionary.
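Schematically, the algorithms surveyed below are all instances of the following alternating loop; the sketch is generic, with sparse_coding and update_dictionary standing in for whatever concrete choices a given method makes.

```python
def train_dictionary(Y, D0, n_iter, sparse_coding, update_dictionary):
    """Generic two-step dictionary-training loop (schematic sketch).

    Y  : (n, N) training examples, one per column.
    D0 : (n, K) initial dictionary.
    sparse_coding(Y, D)        -> X, a (K, N) coefficient matrix.
    update_dictionary(Y, X, D) -> updated dictionary, with X held fixed.
    """
    D = D0
    for _ in range(n_iter):
        X = sparse_coding(Y, D)         # step i): coefficients given D
        D = update_dictionary(Y, X, D)  # step ii): dictionary given X
    return D
```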
B. Maximum Likelihood Methods
The methods reported in [22]–[25] use probabilistic reasoning in the construction of $\mathbf{D}$. The proposed model suggests that for every example $\mathbf{y}$ the relation
$$\mathbf{y} = \mathbf{D}\mathbf{x} + \mathbf{v} \qquad (3)$$
holds true with a sparse representation $\mathbf{x}$ and a white Gaussian residual vector $\mathbf{v}$ with variance $\sigma^{2}$. Given the examples $Y = \{\mathbf{y}_i\}_{i=1}^{N}$, these works consider the likelihood function $P(Y \mid \mathbf{D})$ and seek the dictionary that maximizes it. Two assumptions are required in order to proceed: the first is that the measurements are drawn independently, readily providing
$$P(Y \mid \mathbf{D}) = \prod_{i=1}^{N} P(\mathbf{y}_i \mid \mathbf{D}). \qquad (4)$$
The second assumption is critical and refers to the “hidden variable” $\mathbf{x}$. The ingredients of the likelihood function are computed using the relation
$$P(\mathbf{y}_i \mid \mathbf{D}) = \int P(\mathbf{y}_i, \mathbf{x} \mid \mathbf{D})\, d\mathbf{x} = \int P(\mathbf{y}_i \mid \mathbf{x}, \mathbf{D}) \cdot P(\mathbf{x})\, d\mathbf{x}. \qquad (5)$$
Returning to the initial assumption in (3), we have
$$P(\mathbf{y}_i \mid \mathbf{x}, \mathbf{D}) = \mathrm{Const} \cdot \exp\left\{ -\frac{1}{2\sigma^{2}} \|\mathbf{D}\mathbf{x} - \mathbf{y}_i\|^{2} \right\}. \qquad (6)$$
The prior distribution of the representation vector $\mathbf{x}$ is assumed to be such that its entries are zero-mean i.i.d., with Cauchy [24] or Laplace distributions [22], [23]. Assuming, for example, a Laplace distribution, we get
$$P(\mathbf{y}_i \mid \mathbf{D}) = \int P(\mathbf{y}_i \mid \mathbf{x}, \mathbf{D}) \cdot P(\mathbf{x})\, d\mathbf{x} \propto \int \exp\left\{ -\frac{1}{2\sigma^{2}} \|\mathbf{D}\mathbf{x} - \mathbf{y}_i\|^{2} \right\} \cdot \exp\left\{ -\lambda \|\mathbf{x}\|_{1} \right\} d\mathbf{x}. \qquad (7)$$
This integration over $\mathbf{x}$ is difficult to evaluate, and indeed, Olshausen and Field [23] handled this by replacing it with the extremal value of $P(\mathbf{y}_i, \mathbf{x} \mid \mathbf{D})$. The overall problem then turns into
$$\hat{\mathbf{D}} = \arg\max_{\mathbf{D}} \sum_{i=1}^{N} \max_{\mathbf{x}_i} P(\mathbf{y}_i, \mathbf{x}_i \mid \mathbf{D}) = \arg\min_{\mathbf{D}} \sum_{i=1}^{N} \min_{\mathbf{x}_i} \left\{ \|\mathbf{D}\mathbf{x}_i - \mathbf{y}_i\|^{2} + \lambda \|\mathbf{x}_i\|_{1} \right\}. \qquad (8)$$
This problem does not penalize the entries of $\mathbf{D}$ as it does those of $\mathbf{x}_i$. Thus, the solution will tend to increase the dictionary entries' values in order to allow the coefficients to become closer to zero. This difficulty has been handled by constraining the $\ell^{2}$-norm of each basis element, so that the output variance of the coefficients is kept at an appropriate level [24].
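A brief numerical illustration of this scale ambiguity is given below; the objective function corresponds to the summand of (8), and all dimensions and constants are arbitrary choices made here for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, N, lam = 8, 12, 20, 0.5
D = rng.standard_normal((n, K))
X = rng.standard_normal((K, N))
Y = D @ X + 0.01 * rng.standard_normal((n, N))

def objective(D, X):
    """Sum over examples of ||D x_i - y_i||^2 + lam * ||x_i||_1, as in (8)."""
    return ((D @ X - Y) ** 2).sum() + lam * np.abs(X).sum()

# Scaling the dictionary up by c and the coefficients down by 1/c leaves the
# residual term untouched but shrinks the l1 penalty, so the unconstrained
# problem favors ever-larger dictionary entries; hence the l2-norm constraint
# on each column advocated in [24].
c = 10.0
print(objective(D, X), objective(c * D, X / c))  # the second value is smaller
```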
An iterative method was suggested for solving (8). It includes two main steps in each iteration: i) calculate the coefficients $\mathbf{x}_i$ using a simple gradient descent procedure, and then ii) update the dictionary using [24]
$$\mathbf{D}^{(n+1)} = \mathbf{D}^{(n)} - \eta \sum_{i=1}^{N} \left( \mathbf{D}^{(n)} \mathbf{x}_i - \mathbf{y}_i \right) \mathbf{x}_i^{T}. \qquad (9)$$
This idea of iterative refinement, mentioned before as a generalization of the $K$-means algorithm, was later used again by other researchers, with some variations [36], [37], [40]–[42].
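A rough sketch of one such outer iteration is given below. The coefficient step uses an ISTA-style soft-thresholding update as a stand-in for the simple gradient procedure mentioned above, and the dictionary step implements (9) followed by the column rescaling of [24]; the step sizes and iteration counts are illustrative placeholders.

```python
import numpy as np

def ml_training_iteration(D, Y, lam=0.1, eta=1e-3, step_x=1e-2, n_inner=50):
    """One outer iteration of the maximum-likelihood-style training loop.

    Y : (n, N) examples, D : (n, K) dictionary; lam, eta, step_x and
    n_inner are illustrative constants, not values taken from the papers.
    """
    # Step i): coefficients given D. A proximal-gradient (ISTA) loop is
    # used here in place of the simple gradient procedure of [23].
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_inner):
        grad = D.T @ (D @ X - Y)          # gradient of (1/2)||D X - Y||_F^2
        X = X - step_x * grad
        X = np.sign(X) * np.maximum(np.abs(X) - step_x * lam, 0.0)  # soft threshold

    # Step ii): dictionary gradient step of (9), written for all examples at once.
    D = D - eta * (D @ X - Y) @ X.T

    # Rescale each column to unit l2-norm, as in [24], to keep the
    # dictionary entries from growing without bound.
    D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)
    return D, X
```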
A different approach to handle the integration in (7) was sug-
gested by Lewicki and Sejnowski [25]. They approximated the
posterior as a Gaussian, enabling an analytic solution of the inte-
gration. This allows an objective comparison of different image
models (basis or priors). It also removes the need for the ad-
ditional rescaling that enforces the norm constraint. However, this model may be too limited to describe the true behavior expected. This technique and closely related ones have been referred to as approximated ML techniques [37].
There is an interesting relation between the above method and
the independent component analysis (ICA) algorithm [43]. The
latter handles the case of a complete dictionary (the number of
elements equals the dimensionality) without assuming additive
noise. The above method is then similar to ICA in that the algo-
rithm can be interpreted as trying to maximize the mutual infor-
mation between the inputs (samples) and the outputs (the coef-
ficients) [22], [24], [25].
C. The MOD Method
An appealing dictionary training algorithm, named method
of optimal directions (MOD), is presented by Engan et al. [36],