Duration Dependent Covariance Regularization in PLDA Modeling for
Speaker Verification
Weicheng Cai (2,3), Ming Li (1,2), Lin Li (4), Qingyang Hong (4)
(1) SYSU-CMU Joint Institute of Engineering, Sun Yat-sen University, China
(2) SYSU-CMU Shunde International Joint Research Institute, China
(3) School of Information Science and Technology, Sun Yat-sen University, China
(4) School of Information Science and Technology, Xiamen University, China
liming46@mail.sysu.edu.cn
Abstract
In this paper, we present a covariance regularized probabilis-
tic linear discriminant analysis (CR-PLDA) model for text in-
dependent speaker verification. In the conventional simplified
PLDA modeling, the covariance matrix used to capture the
residual energies is globally shared for all i-vectors. However,
we believe that the point estimated i-vectors from longer speech
utterances may be more accurate and their corresponding co-
variances in the PLDA modeling should be smaller. Similar
to the inverse 0th order statistics weighted covariance in the
i-vector model training, we propose a duration dependent nor-
malized exponential term containing the duration normalizing
factor µ and duration extent factor ν to regularize the covariance
in the PLDA modeling. Experimental results are reported on the
NIST SRE 2010 common condition 5 female part task and the
NIST 2014 i-vector machine learning challenge, respectively.
For both tasks, the proposed covariance regularized PLDA sys-
tem outperforms the baseline PLDA system by more than 13%
relatively in terms of equal error rate (EER) and norm minDCF
values.
Index Terms: PLDA, covariance regularization, i-vector,
speaker verification, duration
1. Introduction
Total variability i-vector modeling has gained significant atten-
tion in both speaker verification (SV) and language identifica-
tion (LID) domains due to its excellent performance, compact
representation and small model size [1, 2, 3]. In this model-
ing, first, zero-order and first-order Baum-Welch statistics are
calculated by projecting the MFCC features onto the Gaussian
Mixture Model (GMM) components using the occupancy posterior
probabilities. Second, in order to reduce the dimensionality
of the concatenated statistics vectors, a single factor analysis
is adopted to generate a low dimensional total variability space
which jointly models language, speaker and channel variabili-
ties all together [1]. Third, within this i-vector space, variability
compensation methods, such as Within-Class Covariance Nor-
malization (WCCN) [4], Linear Discriminant Analysis (LDA)
and Nuisance Attribute Projection (NAP) [5], are performed
to reduce the variability for the subsequent modeling methods (e.g., Support Vector Machine [6], Sparse Representation [7], Probabilistic Linear Discriminant Analysis (PLDA) [8, 9, 10], etc.).

This research is supported in part by the National Natural Science Foundation of China (61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the SYSU-CMU Shunde International Joint Research Institute, and the CMU-SYSU Collaborative Innovation Research Center.
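To make the steps above concrete, the following NumPy sketch computes the zero- and first-order Baum-Welch statistics of one utterance against a diagonal-covariance UBM and then maps them to an i-vector point estimate with a fixed total variability matrix. It is a minimal illustration under the usual i-vector assumptions; the function and variable names are ours and do not come from this paper or any particular toolkit.

```python
import numpy as np

def baum_welch_stats(frames, ubm_means, ubm_covs, ubm_weights):
    """Zero- and first-order Baum-Welch statistics of one utterance against a
    diagonal-covariance UBM.

    frames:      (num_frames, feat_dim) MFCC features
    ubm_means:   (num_comp, feat_dim) component means
    ubm_covs:    (num_comp, feat_dim) diagonal covariances
    ubm_weights: (num_comp,) mixture weights
    """
    # Frame-level log-likelihoods under each Gaussian component.
    log_lik = (np.log(ubm_weights)
               - 0.5 * np.log(2.0 * np.pi * ubm_covs).sum(axis=1)
               - 0.5 * ((frames[:, None, :] - ubm_means) ** 2 / ubm_covs).sum(axis=2))
    # Occupancy posteriors gamma_t(c), normalized over the components.
    post = np.exp(log_lik - np.logaddexp.reduce(log_lik, axis=1)[:, None])
    N = post.sum(axis=0)                            # zero-order statistics
    F = post.T @ frames - N[:, None] * ubm_means    # centered first-order statistics
    return N, F

def extract_ivector(N, F, T, ubm_covs):
    """MAP point estimate of the i-vector given the statistics and a fixed
    total variability matrix T of shape (num_comp * feat_dim, ivec_dim)."""
    num_comp, feat_dim = ubm_covs.shape
    ivec_dim = T.shape[1]
    precision = np.eye(ivec_dim)
    proj = np.zeros(ivec_dim)
    for c in range(num_comp):
        Tc = T[c * feat_dim:(c + 1) * feat_dim]     # rows of T for component c
        Tc_prec = Tc.T / ubm_covs[c]                # T_c' * Sigma_c^{-1} (diagonal)
        precision += N[c] * Tc_prec @ Tc
        proj += Tc_prec @ F[c]
    return np.linalg.solve(precision, proj)         # posterior mean = i-vector
```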
Conventionally, in the i-vector framework, the tokens for
calculating the zero-order and first-order Baum-Welch statistics
are the MFCC features trained GMM components. Such a choice
of token units may not be the optimal solution. Recently, the
generalized i-vector framework [11, 12, 13, 14, 15] has been
proposed. In this framework, the tokens for calculating the
zero-order statistics have been extended to tied triphone states,
monophone states, tandem features trained GMM components,
bottleneck features trained GMM components, etc. The features
for calculating the first-order statistics have also been extended
from MFCC to feature level acoustic and phonetic fused fea-
tures [13]. The phonetically-aware tokens trained by supervised
learning can provide better token separation and discrimination.
This enables the system to compare different speakers’ voices
token by token with more accurate token alignment, which leads
to significant performance improvement on the text independent
speaker verification task [11, 12, 13, 14, 15].
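Relative to the conventional pipeline, the only change in the statistics computation is the source of the occupancy posteriors. A minimal sketch, assuming frame-level token posteriors (e.g., senone posteriors from a DNN) are supplied by an external phonetically-aware model and the first-order statistics are accumulated over acoustic or fused features:

```python
import numpy as np

def phonetic_baum_welch_stats(token_post, frames, token_means):
    """Baum-Welch statistics when the frame alignment comes from a
    phonetically-aware tokenizer rather than the UBM.

    token_post:  (num_frames, num_tokens) posteriors over tokens
                 (e.g., senones), each row summing to one
    frames:      (num_frames, feat_dim) acoustic or fused features
    token_means: (num_tokens, feat_dim) per-token means for centering
    """
    N = token_post.sum(axis=0)                             # zero-order statistics
    F = token_post.T @ frames - N[:, None] * token_means   # centered first-order statistics
    return N, F
```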
After i-vectors are extracted, among the aforementioned supervised learning techniques, PLDA is widely adopted and considered the state-of-the-art back-end modeling approach [8, 9, 10, 16, 17, 18, 19, 20]. PLDA is a generative model that incorporates both within-speaker and between-speaker variations. Generally, we model the i-vectors under a Gaussian distribution assumption (G-PLDA). After the model parameters are learned with the expectation-maximization (EM) algorithm, scoring is carried out within a hypothesis testing framework.
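For reference, the simplified G-PLDA model assumes φ = m + V y + ε, with y ∼ N(0, I) and ε ∼ N(0, Σ), and verification compares the same-speaker and different-speaker hypotheses. The sketch below shows one standard closed-form realization of that log-likelihood ratio; it is illustrative and not necessarily the exact formulation used later in this paper.

```python
import numpy as np

def gplda_llr(phi1, phi2, m, V, Sigma):
    """Verification score for two i-vectors under simplified G-PLDA:
    log p(phi1, phi2 | same speaker) - log p(phi1, phi2 | different speakers)."""
    Sigma_ac = V @ V.T            # between-speaker (shared) covariance V V'
    Sigma_tot = Sigma_ac + Sigma  # total covariance of a single i-vector
    x = np.concatenate([phi1 - m, phi2 - m])
    dim = len(m)

    def joint_logpdf(cross_cov):
        # Gaussian log-density of the stacked pair with the given cross-covariance.
        cov = np.block([[Sigma_tot, cross_cov],
                        [cross_cov, Sigma_tot]])
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (x @ np.linalg.solve(cov, x) + logdet
                       + 2 * dim * np.log(2.0 * np.pi))

    # The same-speaker hypothesis shares the latent factor y, so the pair is
    # correlated through V V'; the different-speaker hypothesis has zero
    # cross-covariance.
    return joint_logpdf(Sigma_ac) - joint_logpdf(np.zeros((dim, dim)))
```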
Recently, it has been shown in [21] that the performance of PLDA degrades on short utterances. Duration variability has also been investigated in the i-vector space using the PLDA model [17, 22, 23, 24]. This motivates us to incorporate the speech duration information directly into the PLDA model training and generate a more accurate model.
In the standard simplified PLDA modeling [10], the within-speaker variations can be considered as the residual that cannot be explained by the speaker space. The covariance matrix used to model these residuals is globally shared by all i-vectors, no matter whether the corresponding utterances are long or short. We believe that the point estimated i-vectors from longer speech utterances may be more accurate and their corresponding covariances in the PLDA modeling should be smaller. Motivated by the inverse 0th order statistics weighted covariance in the i-vector model training [25, 26], we propose a duration dependent normalized exponential term containing the duration