Once the basis is determined, the speaker-dependent coordinate vector can be estimated using a simple quadratic optimization method (Hazen and Glass, 1997; Kuhn et al., 2000). The main difference among the various speaker-space-based methods lies in how the basis of the speaker subspace is constructed. For example, in the well-known eigenvoice (EV) method (Kuhn et al., 2000; Kenny et al., 2004), the basis vectors, called eigenvoices, are obtained by performing principal component analysis (PCA) on the training speakers' SD model parameters. The K leading eigenvectors, which capture the greatest variability across the training speaker models, are retained as the K basis vectors. In reference speaker weighting (RSW) (Hazen and Glass, 1997; Mak et al., 2006; Teng et al., 2009), all training speaker SD models are retained as candidate basis vectors. During speaker adaptation, a subset of them is chosen according to some heuristic criteria to linearly represent the unknown SD model. In aspect model weighting (AMW) (Hahm et al., 2010), the basis is constructed from a set of aspect models, which are mixture models of the training speakers' SD models and are trained by likelihood maximization with respect to the training data. In all these methods, given the speaker subspace, the coordinate vector of an unknown speaker is estimated using the maximum likelihood criterion. However, these methods share a common difficulty: determining the dimension of the speaker subspace. When the adaptation data is limited, a low-dimensional speaker subspace is preferred; as the adaptation data increases, a larger speaker subspace yields better performance. Unfortunately, none of these methods can provide the best speaker subspace for varying amounts of adaptation data from a particular unknown speaker.
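As a concrete illustration of the eigenvoice construction, the following is a minimal sketch of building a PCA basis from the training speakers' SD models stacked as mean supervectors. The names, shapes, and the use of plain PCA via SVD are illustrative assumptions, not the implementation of the cited systems.

```python
import numpy as np

def eigenvoice_basis(sd_supervectors: np.ndarray, K: int):
    """Build a K-dimensional eigenvoice basis via PCA.

    sd_supervectors: (S, D) array, one mean supervector per training speaker.
    Returns the mean voice (D,) and the K leading eigenvoices (K, D).
    """
    mean_voice = sd_supervectors.mean(axis=0)
    centered = sd_supervectors - mean_voice
    # The right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_voice, vt[:K]

# An adapted SD model is then mean_voice + w @ eigenvoices, where the
# coordinate vector w is estimated from the adaptation data, e.g. by
# maximum likelihood.
```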
In this paper, we discuss a generalization of speaker-space-based speaker adaptation methods using compressive sensing theory (Donoho, 2006). The parameters of the SD model lie in a very high-dimensional space, and the core issue of the speaker adaptation problem is to estimate this high-dimensional parameter vector from a few speech signal observations. Breakthrough results in compressive sensing (CS) have shown that high-dimensional signals (vectors) can often be accurately recovered from a relatively small number of non-adaptive linear projection observations, provided that they possess a compact representation in some basis. In machine learning, sparse representation and compressive sensing are widely employed to address the problems of data sparsity and model complexity, and they have recently found many applications in speech processing and recognition. For instance, exemplar-based sparse representations were proposed for noise-robust automatic speech recognition (Gemmeke et al., 2011). ℓ1 regularization has been used to derive sparse representations of GMM supervectors for speaker identification (Naseem et al., 2010) and verification (Kua et al., 2011). In Boominathan and Murty (2012), the orthogonal matching pursuit algorithm is used to derive a sparse representation of each feature vector over a dictionary of feature vectors belonging to many speakers for speaker identification. More recently, combined ℓ1 and ℓ2 regularization was used to derive an i-vector based sparse representation classification method for speaker verification (Kua et al., 2013). For speaker adaptation of a speech recognition system, element-wise ℓ2 regularization has been applied to the maximum likelihood linear regression (MLLR) matrix, resulting in the ridge MLLR method (Li et al., 2010), which shrinks the adaptation parameters and gives a significant word error rate reduction over standard MLLR in an utterance-by-utterance unsupervised adaptation scenario. By imposing sparseness constraints, sparse maximum a posteriori (MAP) adaptation was proposed in Olsen et al. (2011) and Olsen et al. (2012); it saves significantly on storage and can even improve the quality of the resulting speaker-dependent model.
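To make the recovery principle concrete, the following is a minimal, self-contained sketch of ℓ1-regularized sparse recovery using iterative soft-thresholding (ISTA) on synthetic data. It illustrates the general CS idea only; it is not the algorithm of any of the cited systems, and all names and constants are assumptions.

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=500):
    """Minimize 0.5 * ||y - A x||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L          # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

# Toy demo: recover a 5-sparse vector in R^200 from 40 random projections.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 200)) / np.sqrt(40)
x_true = np.zeros(200)
x_true[rng.choice(200, size=5, replace=False)] = rng.standard_normal(5)
y = A @ x_true
x_hat = ista(A, y)                             # close to x_true when sparsity holds
```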
Actually, all the speaker-space-based methods implicitly assume a low-dimensional speaker subspace, which provides a sparse representation of the SD model parameters. The main contributions of this paper are twofold. First, we use a redundant basis dictionary to construct the speaker space, whose basis vectors are a combination of all the eigenvoices and all training speakers' SD models. As a benefit of PCA, the subspace spanned by the leading eigenvoices captures most of the inter-speaker variability of the training speaker models. However, the intra-speaker variability is not modeled, and an eigenvoice is no longer a valid training speaker model. In the reference speaker weighting method, by contrast, each basis vector (reference model) is constructed directly from a training speaker model; all basis vectors are equally important and the intra-speaker information is well preserved. Experimental results in Teng et al. (2007) show that the eigenvoice results always fall between the best and worst results obtained with random selections of reference speaker models. The motivation for using a dictionary that combines all the eigenvoices and training speaker models is that the advantages of both methods (i.e., RSW and EV) can be exploited during speaker adaptation.
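Reusing the illustrative names from the earlier sketch, such a redundant dictionary can be formed by simply stacking the two sets of candidate basis vectors. This is a hypothetical sketch of the construction, not the paper's code.

```python
import numpy as np

def build_dictionary(eigenvoices: np.ndarray, sd_supervectors: np.ndarray) -> np.ndarray:
    """Stack K eigenvoices (K, D) and S training SD supervectors (S, D) into a
    redundant dictionary whose columns are the K + S candidate basis vectors."""
    return np.concatenate([eigenvoices, sd_supervectors], axis=0).T  # (D, K + S)
```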
Second, given the adaptation data, algorithms from compressive sensing theory are introduced to automatically select a varying subset of the dictionary entries that best represents the unknown SD model. Unlike the RSW method, the selection process is based on direct likelihood maximization with respect to the adaptation data.
Two optimization schemes, namely the matching pursuit scheme (Mallat and Zhang, 1993; Tropp and Gilbert, 2007) and the ℓ1 regularized optimization scheme (Tibshirani, 1996; Figueiredo et al., 2007), are derived for speaker adaptation. Matching pursuit is a greedy algorithm, which iteratively selects one basis vector for the combination until some stopping condition is reached, while the ℓ1 regularized optimization algorithm uses an explicit ℓ1 norm regularization term to force some components of the speaker coordinate vector to zero, thereby selecting the optimal basis vectors through the nonzero components.
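The greedy selection loop has the familiar matching pursuit shape sketched below. Note that this sketch scores candidate atoms by correlation with the residual, as in classical matching pursuit, whereas the method derived in this paper scores them by likelihood on the adaptation data; the names and the unit-norm assumption are illustrative.

```python
import numpy as np

def matching_pursuit(D, r, n_atoms=5, tol=1e-6):
    """Greedily pick atoms (columns of D, assumed unit-norm) to approximate r."""
    residual = r.astype(float).copy()
    selected, coefs = [], []
    for _ in range(n_atoms):
        scores = D.T @ residual                 # correlation with each atom
        k = int(np.argmax(np.abs(scores)))      # best-matching atom
        selected.append(k)
        coefs.append(scores[k])
        residual = residual - scores[k] * D[:, k]
        if np.linalg.norm(residual) < tol:      # stopping condition
            break
    return selected, coefs
```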
Although the matching pursuit algorithm is sub-optimal, it is very fast.