Once the basis is determined, the speaker-dependent coordinate vector can be estimated using a simple quadratic optimization method (Hazen and Glass, 1997; Kuhn et al., 2000). The main difference among the various speaker-space-based methods lies in how the basis of the speaker subspace is constructed. For example, in the well-known eigenvoice (EV) method (Kuhn et al., 2000; Kenny et al., 2004), the basis vectors, called eigenvoices, are obtained by performing principal component analysis (PCA) on the training speakers' SD model parameters. The K leading eigenvectors, which capture the greatest variability across the training speaker models, are retained as the K basis vectors. In reference speaker weighting (RSW) (Hazen and Glass, 1997; Mak et al., 2006; Teng et al., 2009), all training speaker SD models are retained as candidate basis vectors. During speaker adaptation, a subset of them is chosen according to some heuristic criteria to linearly represent the unknown SD model. In aspect model weighting (AMW) (Hahm et al., 2010), the basis is constructed from a set of aspect models, which are mixture models of the training speakers' SD models and are trained by likelihood maximization with respect to the training data. In all these methods, given the speaker subspace, the coordinate vector of an unknown speaker is estimated using the maximum likelihood criterion. However, these methods share a common difficulty: determining the dimension of the speaker subspace. When the adaptation data is limited, a low-dimensional speaker subspace is preferred; as the adaptation data increases, a larger speaker subspace yields better performance. Unfortunately, none of these methods can provide the best speaker subspace for varying amounts of adaptation data from a particular unknown speaker.
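As a concrete illustration of the eigenvoice construction, the following is a minimal sketch of building a PCA basis from the training speakers' SD models stacked as mean supervectors. The names, shapes, and the use of plain PCA via SVD are illustrative assumptions, not the implementation of the cited systems.

```python
import numpy as np

def eigenvoice_basis(sd_supervectors: np.ndarray, K: int):
    """Build a K-dimensional eigenvoice basis via PCA.

    sd_supervectors: (S, D) array, one mean supervector per training speaker.
    Returns the mean voice (D,) and the K leading eigenvoices (K, D).
    """
    mean_voice = sd_supervectors.mean(axis=0)
    centered = sd_supervectors - mean_voice
    # The right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_voice, vt[:K]

# An adapted SD model is then mean_voice + w @ eigenvoices, where the
# coordinate vector w is estimated from the adaptation data, e.g. by
# maximum likelihood.
```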
In this paper, we discuss a generalization of speaker-space-based speaker adaptation methods using compressive sensing theory (Donoho, 2006). The parameters of the SD model lie in a very high-dimensional space, and the core issue of the speaker adaptation problem is to estimate this high-dimensional parameter vector from a few speech signal observations. Breakthrough results in compressive sensing (CS) have shown that high-dimensional signals (vectors) can often be accurately recovered from a relatively small number of non-adaptive linear projection observations, provided that they possess a compact representation in some basis. In machine learning, sparse representation and compressive sensing are widely employed to address the problems of data sparsity and model complexity, and they have recently found many applications in speech processing and recognition. For instance, exemplar-based sparse representations were proposed for noise-robust automatic speech recognition (Gemmeke et al., 2011). ℓ1 regularization has been used to derive sparse representations of GMM supervectors for speaker identification (Naseem et al., 2010) and verification (Kua et al., 2011). In Boominathan and Murty (2012), the orthogonal matching pursuit algorithm is used to derive a sparse representation of each feature vector over a dictionary of feature vectors belonging to many speakers for speaker identification. More recently, combined ℓ1 and ℓ2 regularization was used to derive an i-vector based sparse representation classification method for speaker verification (Kua et al., 2013). For speaker adaptation of a speech recognition system, element-wise ℓ2 regularization has been applied to the maximum likelihood linear regression (MLLR) matrix, resulting in the ridge MLLR method (Li et al., 2010), which shrinks the adaptation parameters and gives a significant word error rate reduction over standard MLLR in an utterance-by-utterance unsupervised adaptation scenario. By imposing sparseness constraints, sparse maximum a posteriori (MAP) adaptation was proposed in Olsen et al. (2011) and Olsen et al. (2012); it saves significantly on storage and can even improve the quality of the resulting speaker-dependent model.
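To make the recovery principle concrete, the following is a minimal, self-contained sketch of ℓ1-regularized sparse recovery using iterative soft-thresholding (ISTA) on synthetic data. It illustrates the general CS idea only; it is not the algorithm of any of the cited systems, and all names and constants are assumptions.

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=500):
    """Minimize 0.5 * ||y - A x||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L          # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

# Toy demo: recover a 5-sparse vector in R^200 from 40 random projections.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 200)) / np.sqrt(40)
x_true = np.zeros(200)
x_true[rng.choice(200, size=5, replace=False)] = rng.standard_normal(5)
y = A @ x_true
x_hat = ista(A, y)                             # close to x_true when sparsity holds
```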
Actually, all the speaker-space-based methods implicitly assume a low-dimensional speaker subspace, which provides a sparse representation of the SD model parameters. The main contributions of this paper are twofold. First, we use a redundant basis dictionary to construct the speaker space, whose basis vectors are a combination of all the eigenvoices and all training speakers' SD models. As a benefit of PCA, the subspace spanned by the leading eigenvoices captures most of the inter-speaker variability of the training speaker models. However, the intra-speaker variability is not modeled, and an eigenvoice is no longer a valid training speaker model. In the reference speaker weighting method, by contrast, each basis vector (reference model) is constructed directly from a training speaker model; all basis vectors are equally important and the intra-speaker information is well preserved. Experimental results in Teng et al. (2007) show that the eigenvoice results always fall between the best and worst results obtained with random selections of reference speaker models. The motivation for using a dictionary that combines all the eigenvoices and training speaker models is that the advantages of both methods (i.e., RSW and EV) can be exploited during speaker adaptation.
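Reusing the illustrative names from the earlier sketch, such a redundant dictionary can be formed by simply stacking the two sets of candidate basis vectors. This is a hypothetical sketch of the construction, not the paper's code.

```python
import numpy as np

def build_dictionary(eigenvoices: np.ndarray, sd_supervectors: np.ndarray) -> np.ndarray:
    """Stack K eigenvoices (K, D) and S training SD supervectors (S, D) into a
    redundant dictionary whose columns are the K + S candidate basis vectors."""
    return np.concatenate([eigenvoices, sd_supervectors], axis=0).T  # (D, K + S)
```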
Second, given the adaptation data, algorithms from compressive sensing theory are introduced to automatically select a varying subset of the dictionary entries that best represents the unknown SD model. Unlike the RSW method, the selection process is based on direct likelihood maximization with respect to the adaptation data.
Two optimization schemes, namely the matching pursuit scheme (Mallat and Zhang, 1993; Tropp and Gilbert, 2007) and the ℓ1 regularized optimization scheme (Tibshirani, 1996; Figueiredo et al., 2007), are derived for speaker adaptation. Matching pursuit is a greedy algorithm, which iteratively selects one basis vector for the combination until some stopping condition is reached, while the ℓ1 regularized optimization algorithm uses an explicit ℓ1 norm regularization term to force some components of the speaker coordinate vector to zero, thereby selecting the optimal basis vectors through the nonzero components.
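The greedy selection loop has the familiar matching pursuit shape sketched below. Note that this sketch scores candidate atoms by correlation with the residual, as in classical matching pursuit, whereas the method derived in this paper scores them by likelihood on the adaptation data; the names and the unit-norm assumption are illustrative.

```python
import numpy as np

def matching_pursuit(D, r, n_atoms=5, tol=1e-6):
    """Greedily pick atoms (columns of D, assumed unit-norm) to approximate r."""
    residual = r.astype(float).copy()
    selected, coefs = [], []
    for _ in range(n_atoms):
        scores = D.T @ residual                 # correlation with each atom
        k = int(np.argmax(np.abs(scores)))      # best-matching atom
        selected.append(k)
        coefs.append(scores[k])
        residual = residual - scores[k] * D[:, k]
        if np.linalg.norm(residual) < tol:      # stopping condition
            break
    return selected, coefs
```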
Although the matching pursuit algorithm is sub-optimal, it is very fast.