Locality Sensitive Discriminant Analysis for Speaker Verification

Danwei Cai†, Weicheng Cai∗, Zhidong Ni†, Ming Li∗†
∗SYSU-CMU Joint Institute of Engineering, School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China
†SYSU-CMU Shunde International Joint Research Institute, Foshan, China
E-mail: liming46@mail.sysu.edu.cn
Abstract—In this paper, we apply Locality Sensitive Discriminant Analysis (LSDA) to the speaker verification system for intersession variability compensation. As opposed to LDA, which fails to discover the local geometrical structure of the data manifold, LSDA finds a projection that maximizes the margin between i-vectors from different speakers in each local area. Since the number of samples varies widely across classes, we improve LSDA by using an adaptive number of k nearest neighbors in each class and modifying the corresponding within- and between-class weight matrices, so that each class has equal importance in LSDA's objective function. Experiments were carried out on the NIST 2010 speaker recognition evaluation (SRE) extended condition 5 female task; results show that our proposed adaptive k nearest neighbors based LSDA method significantly improves upon the conventional i-vector/PLDA baseline, with an 18% relative cost reduction and a 28% relative equal error rate reduction.
I. INTRODUCTION
Current speaker recognition systems widely use i-vector modeling due to its excellent performance as well as its small model size [1][2]. I-vector based speaker verification systems first calculate zero-order and first-order Baum-Welch statistics by projecting the MFCC features onto a Universal Background Model (UBM). Then a single factor analysis is used as a front-end to generate a low-dimensional total variability space (i.e., the i-vector space) which jointly models language, speaker and channel variabilities [2]. After i-vectors are extracted, Probabilistic Linear Discriminant Analysis (PLDA) is widely adopted as a back-end modeling approach [3][4][5].
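To make the front-end computation concrete, the following minimal sketch (in NumPy) accumulates the zero- and first-order Baum-Welch statistics for one utterance; all names, shapes and values are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def baum_welch_stats(posteriors, feats):
    """Sufficient statistics for i-vector extraction.

    posteriors: (T, C) per-frame occupation probabilities of C components.
    feats:      (T, D) acoustic frames (e.g., MFCCs).
    """
    N = posteriors.sum(axis=0)   # zero-order: soft frame count per component, (C,)
    F = posteriors.T @ feats     # first-order: posterior-weighted frame sums, (C, D)
    return N, F

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(1024), size=300)   # 300 frames, 1024 components
mfcc = rng.standard_normal((300, 60))           # 60-dimensional features
N, F = baum_welch_stats(post, mfcc)
```

The per-frame posteriors here may come from a GMM-UBM or, as in the tandem setup of Section II, from a supervised acoustic model.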
Conventionally, in the i-vector framework, the tokens for calculating the zero- and first-order statistics are the components of a GMM trained on MFCC features. Recently, the tokens for calculating the zero-order statistics have been extended to tied triphone states and to components of GMMs trained on tandem or bottleneck features [6][7][8][9]. The features for calculating the first-order statistics have also been extended from MFCCs to feature-level fusions of acoustic and phonetic features [8]. The phonetically-aware tokens trained by supervised learning can provide better token alignment, which leads to significant performance improvements on text-independent speaker verification tasks [6][7][8][9][10].
This research was funded in part by the National Natural Science Foundation of China (61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the Fundamental Research Funds for the Central Universities (15lgjc10) and the National Key Research and Development Program (2016YFC0103905).
Within the i-vector space, Linear Discriminant Analysis (LDA) [11] can be performed before PLDA scoring to generate dimensionality-reduced and channel-compensated features, reducing both the dimensionality and the variabilities of the i-vectors. Intrinsically, LDA estimates global statistics and only seeks a linear manifold based on the Euclidean structure; it may fail to discover structure that lies on nonlinear submanifolds hidden in the total variability space.
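As a hedged illustration of this step, the sketch below projects placeholder i-vectors with scikit-learn's LinearDiscriminantAnalysis; the dimensionalities and speaker counts are assumptions made only for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
ivectors = rng.standard_normal((1000, 600))   # placeholder 600-dim i-vectors
spk = np.repeat(np.arange(50), 20)            # placeholder labels: 50 speakers x 20 utterances

# LDA retains at most (number of speakers - 1) dimensions.
lda = LinearDiscriminantAnalysis(n_components=49)
ivectors_lda = lda.fit_transform(ivectors, spk)   # (1000, 49), fed to PLDA
```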
Recently, nonparametric discriminant analysis (i.e., Nearest Neighbor Discriminant Analysis, NDA) has been successfully applied to speaker verification systems for variability compensation [12][13][14]. This motivates us to explore other nonparametric discriminant analysis algorithms for i-vector dimension reduction, e.g., Locality Sensitive Discriminant Analysis (LSDA) [15]. LSDA finds the k nearest neighbors of each sample globally and constructs within- and between-class graphs to model the local geometrical structure. It then finds a linear transform matrix that maps the i-vectors into a subspace in which the margin between i-vectors from different speakers is maximized in each local area. Compared to LSDA, NDA finds k nearest neighbors within each class separately, so its computational complexity is much higher. To perform well, LSDA with k nearest neighbors requires the number of samples in each class to be larger than, or at least close to, k; we cannot guarantee this because the number of i-vectors per speaker is heterogeneously distributed. Considering this inherent characteristic of the training set, we improve LSDA by using an adaptive number of nearest neighbors for each speaker. Furthermore, since the number of i-vectors per speaker differs, sometimes over a wide range, speakers with fewer i-vectors have little influence on LSDA's objective function. We therefore modify LSDA's within-class and between-class weight matrices to handle this unbalanced data.
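For reference, the following compact sketch implements the basic (non-adaptive) LSDA projection in the formulation of [15]; the neighborhood size k, the trade-off weight alpha, the ridge term and all variable names are illustrative assumptions, and the adaptive-k modification proposed in this paper is intentionally omitted.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def lsda(X, y, k=10, alpha=0.5, n_components=200):
    """Basic LSDA: returns a projection matrix V of shape (D, n_components)."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    Ww = np.zeros((n, n))          # within-class neighborhood graph
    Wb = np.zeros((n, n))          # between-class neighborhood graph
    for i in range(n):
        for j in idx[i, 1:]:       # idx[i, 0] is the point itself
            if y[i] == y[j]:
                Ww[i, j] = Ww[j, i] = 1.0
            else:
                Wb[i, j] = Wb[j, i] = 1.0
    Dw = np.diag(Ww.sum(axis=1))
    Lb = np.diag(Wb.sum(axis=1)) - Wb          # between-class graph Laplacian
    # Maximize a'X(alpha*Lb + (1-alpha)*Ww)X'a  subject to  a'X Dw X'a = 1.
    A = X.T @ (alpha * Lb + (1 - alpha) * Ww) @ X
    B = X.T @ Dw @ X + 1e-6 * np.eye(X.shape[1])   # small ridge for invertibility
    _, vecs = eigh(A, B)                            # eigenvalues in ascending order
    return vecs[:, ::-1][:, :n_components]          # top generalized eigenvectors

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))   # placeholder i-vectors
y = rng.integers(0, 20, size=500)     # placeholder speaker labels
X_lsda = X @ lsda(X, y, k=5, alpha=0.5, n_components=20)
```

Note that the single global neighbor search above is what keeps LSDA cheaper than NDA's per-class searches; the adaptive-k variant proposed here changes how neighbors and weights are assigned per speaker, not this overall structure.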
II. SYSTEM OVERVIEW
An overview of our speaker verification system with LSDA for variability compensation is shown in Fig. 1.
A. DNN Tandem Feature Extraction
In this system, a Deep Neural Network (DNN) serves as an acoustic model used to extract phonetic-level tandem features. First, a DNN acoustic model is trained using acoustic features and phonetic label data. Then the MFCC features are given