Duration Dependent Covariance Regularization in PLDA Modeling for
Speaker Verification
Weicheng Cai (2,3), Ming Li (1,2), Lin Li (4), Qingyang Hong (4)
(1) SYSU-CMU Joint Institute of Engineering, Sun Yat-sen University, China
(2) SYSU-CMU Shunde International Joint Research Institute, China
(3) School of Information Science and Technology, Sun Yat-sen University, China
(4) School of Information Science and Technology, Xiamen University, China
liming46@mail.sysu.edu.cn
Abstract
In this paper, we present a covariance regularized probabilis-
tic linear discriminant analysis (CR-PLDA) model for text in-
dependent speaker verification. In the conventional simplified
PLDA modeling, the covariance matrix used to capture the
residual energies is globally shared for all i-vectors. However,
we believe that the point estimated i-vectors from longer speech
utterances may be more accurate and their corresponding co-
variances in the PLDA modeling should be smaller. Similar
to the inverse 0th order statistics weighted covariance in the
i-vector model training, we propose a duration dependent nor-
malized exponential term containing the duration normalizing
factor µ and duration extent factor ν to regularize the covariance
in the PLDA modeling. Experimental results are reported on the
NIST SRE 2010 common condition 5 female part task and the
NIST 2014 i-vector machine learning challenge, respectively.
For both tasks, the proposed covariance regularized PLDA sys-
tem outperforms the baseline PLDA system by more than 13%
relatively in terms of equal error rate (EER) and norm minDCF
values.
Index Terms: PLDA, covariance regularization, i-vector,
speaker verification, duration
1. Introduction
Total variability i-vector modeling has gained significant atten-
tion in both speaker verification (SV) and language identifica-
tion (LID) domains due to its excellent performance, compact
representation and small model size [1, 2, 3]. In this model-
ing, first, zero-order and first-order Baum-Welch statistics are
calculated by projecting the MFCC features onto the Gaussian
Mixture Model (GMM) components using the occupancy posterior
probabilities. Second, in order to reduce the dimensionality
of the concatenated statistics vectors, a single factor analysis
is adopted to generate a low dimensional total variability space
which jointly models language, speaker and channel variabili-
ties all together [1]. Third, within this i-vector space, variability
compensation methods, such as Within-Class Covariance Nor-
malization (WCCN) [4], Linear Discriminant Analysis (LDA)
and Nuisance Attribute Projection (NAP) [5], are performed
to reduce the variability for the subsequent modeling methods (e.g., Support Vector Machine [6], Sparse Representation [7], Probabilistic Linear Discriminant Analysis (PLDA) [8, 9, 10], etc.).

This research is supported in part by the National Natural Science Foundation of China (61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the SYSU-CMU Shunde International Joint Research Institute, and the CMU-SYSU Collaborative Innovation Research Center.
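To make the steps above concrete, the following NumPy sketch computes the zero- and first-order Baum-Welch statistics of one utterance against a diagonal-covariance UBM and then maps them to an i-vector point estimate with a fixed total variability matrix. It is a minimal illustration under the usual i-vector assumptions; the function and variable names are ours and do not come from this paper or any particular toolkit.

```python
import numpy as np

def baum_welch_stats(frames, ubm_means, ubm_covs, ubm_weights):
    """Zero- and first-order Baum-Welch statistics of one utterance against a
    diagonal-covariance UBM.

    frames:      (num_frames, feat_dim) MFCC features
    ubm_means:   (num_comp, feat_dim) component means
    ubm_covs:    (num_comp, feat_dim) diagonal covariances
    ubm_weights: (num_comp,) mixture weights
    """
    # Frame-level log-likelihoods under each Gaussian component.
    log_lik = (np.log(ubm_weights)
               - 0.5 * np.log(2.0 * np.pi * ubm_covs).sum(axis=1)
               - 0.5 * ((frames[:, None, :] - ubm_means) ** 2 / ubm_covs).sum(axis=2))
    # Occupancy posteriors gamma_t(c), normalized over the components.
    post = np.exp(log_lik - np.logaddexp.reduce(log_lik, axis=1)[:, None])
    N = post.sum(axis=0)                            # zero-order statistics
    F = post.T @ frames - N[:, None] * ubm_means    # centered first-order statistics
    return N, F

def extract_ivector(N, F, T, ubm_covs):
    """MAP point estimate of the i-vector given the statistics and a fixed
    total variability matrix T of shape (num_comp * feat_dim, ivec_dim)."""
    num_comp, feat_dim = ubm_covs.shape
    ivec_dim = T.shape[1]
    precision = np.eye(ivec_dim)
    proj = np.zeros(ivec_dim)
    for c in range(num_comp):
        Tc = T[c * feat_dim:(c + 1) * feat_dim]     # rows of T for component c
        Tc_prec = Tc.T / ubm_covs[c]                # T_c' * Sigma_c^{-1} (diagonal)
        precision += N[c] * Tc_prec @ Tc
        proj += Tc_prec @ F[c]
    return np.linalg.solve(precision, proj)         # posterior mean = i-vector
```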
Conventionally, in the i-vector framework, the tokens for
calculating the zero-order and first-order Baum-Welch statistics
are the MFCC features trained GMM components. Such a choice
of token units may not be the optimal solution. Recently, the
generalized i-vector framework [11, 12, 13, 14, 15] has been
proposed. In this framework, the tokens for calculating the
zero-order statistics have been extended to tied triphone states,
monophone states, tandem features trained GMM components,
bottleneck features trained GMM components, etc. The features
for calculating the first-order statistics have also been extended
from MFCC to feature level acoustic and phonetic fused fea-
tures [13]. The phonetically-aware tokens trained by supervised
learning can provide better token separation and discrimination.
This enables the system to compare different speakers’ voices
token by token with more accurate token alignment, which leads
to significant performance improvement on the text independent
speaker verification task [11, 12, 13, 14, 15].
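Relative to the conventional pipeline, the only change in the statistics computation is the source of the occupancy posteriors. A minimal sketch, assuming frame-level token posteriors (e.g., senone posteriors from a DNN) are supplied by an external phonetically-aware model and the first-order statistics are accumulated over acoustic or fused features:

```python
import numpy as np

def phonetic_baum_welch_stats(token_post, frames, token_means):
    """Baum-Welch statistics when the frame alignment comes from a
    phonetically-aware tokenizer rather than the UBM.

    token_post:  (num_frames, num_tokens) posteriors over tokens
                 (e.g., senones), each row summing to one
    frames:      (num_frames, feat_dim) acoustic or fused features
    token_means: (num_tokens, feat_dim) per-token means for centering
    """
    N = token_post.sum(axis=0)                             # zero-order statistics
    F = token_post.T @ frames - N[:, None] * token_means   # centered first-order statistics
    return N, F
```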
After i-vectors are extracted, among the aforementioned supervised learning techniques, PLDA is widely adopted and considered the state-of-the-art back-end modeling approach [8, 9, 10, 16, 17, 18, 19, 20]. PLDA is a generative model that incorporates both within-speaker and between-speaker variations. Generally, we model the i-vectors under a Gaussian distribution assumption (G-PLDA). After the model parameters are learned with the expectation-maximization (EM) algorithm, scoring is carried out within a hypothesis testing framework.
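For reference, the simplified G-PLDA model assumes φ = m + V y + ε, with y ∼ N(0, I) and ε ∼ N(0, Σ), and verification compares the same-speaker and different-speaker hypotheses. The sketch below shows one standard closed-form realization of that log-likelihood ratio; it is illustrative and not necessarily the exact formulation used later in this paper.

```python
import numpy as np

def gplda_llr(phi1, phi2, m, V, Sigma):
    """Verification score for two i-vectors under simplified G-PLDA:
    log p(phi1, phi2 | same speaker) - log p(phi1, phi2 | different speakers)."""
    Sigma_ac = V @ V.T            # between-speaker (shared) covariance V V'
    Sigma_tot = Sigma_ac + Sigma  # total covariance of a single i-vector
    x = np.concatenate([phi1 - m, phi2 - m])
    dim = len(m)

    def joint_logpdf(cross_cov):
        # Gaussian log-density of the stacked pair with the given cross-covariance.
        cov = np.block([[Sigma_tot, cross_cov],
                        [cross_cov, Sigma_tot]])
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (x @ np.linalg.solve(cov, x) + logdet
                       + 2 * dim * np.log(2.0 * np.pi))

    # The same-speaker hypothesis shares the latent factor y, so the pair is
    # correlated through V V'; the different-speaker hypothesis has zero
    # cross-covariance.
    return joint_logpdf(Sigma_ac) - joint_logpdf(np.zeros((dim, dim)))
```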
Recently, it has been shown in [21] that the performance of PLDA degrades on short utterances. Duration variability has also been investigated in the i-vector space using the PLDA model [17, 22, 23, 24]. This motivates us to incorporate the speech duration information directly into the PLDA model training and generate a more accurate model.
In the standard simplified PLDA modeling [10], the within-speaker variations can be considered as the residual that cannot be explained by the speaker space. The covariance matrix used to model these residuals is globally shared by all i-vectors, no matter whether the corresponding utterances are long or short. We believe that the point estimated i-vectors from longer speech utterances may be more accurate and their corresponding covariances in the PLDA modeling should be smaller. Motivated by the inverse 0th order statistics weighted covariance in the i-vector model training [25, 26], we propose a duration dependent normalized exponential term containing the duration