Locality Sensitive Discriminant Analysis for Speaker Verification

Danwei Cai†, Weicheng Cai∗, Zhidong Ni†, Ming Li∗†
∗SYSU-CMU Joint Institute of Engineering, School of Electronics and Information Technology, Sun Yat-Sen University, Guangzhou, China
†SYSU-CMU Shunde International Joint Research Institute, Foshan, China
E-mail: liming46@mail.sysu.edu.cn
Abstract—In this paper, we apply Locality Sensitive Discriminant Analysis (LSDA) to the speaker verification system for intersession variability compensation. As opposed to LDA, which fails to discover the local geometrical structure of the data manifold, LSDA finds a projection that maximizes the margin between i-vectors from different speakers in each local area. Since the number of samples varies widely across classes, we improve LSDA by using an adaptive number of k nearest neighbors in each class and modifying the corresponding within- and between-class weight matrices, so that each class has equal importance in LSDA's objective function. Experiments were carried out on the NIST 2010 speaker recognition evaluation (SRE) extended condition 5 female task; results show that our proposed adaptive k nearest neighbors based LSDA method significantly improves upon the conventional i-vector/PLDA baseline, with an 18% relative cost reduction and a 28% relative equal error rate reduction.
I. INTRODUCTION
Current speaker recognition systems widely use i-vector modeling due to its excellent performance as well as its small model size [1][2]. I-vector based speaker verification systems first calculate zero-order and first-order Baum-Welch statistics by projecting the MFCC features onto a Universal Background Model (UBM). Then a single factor analysis is used as a front-end to generate a low-dimensional total variability space (i.e., the i-vector space) which jointly models language, speaker and channel variabilities [2]. After i-vectors are extracted, Probabilistic Linear Discriminant Analysis (PLDA) is widely adopted as a back-end modeling approach [3][4][5].
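To make the front-end computation concrete, the following minimal sketch (in NumPy) accumulates the zero- and first-order Baum-Welch statistics for one utterance; all names, shapes and values are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def baum_welch_stats(posteriors, feats):
    """Sufficient statistics for i-vector extraction.

    posteriors: (T, C) per-frame occupation probabilities of C components.
    feats:      (T, D) acoustic frames (e.g., MFCCs).
    """
    N = posteriors.sum(axis=0)   # zero-order: soft frame count per component, (C,)
    F = posteriors.T @ feats     # first-order: posterior-weighted frame sums, (C, D)
    return N, F

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(1024), size=300)   # 300 frames, 1024 components
mfcc = rng.standard_normal((300, 60))           # 60-dimensional features
N, F = baum_welch_stats(post, mfcc)
```

The per-frame posteriors here may come from a GMM-UBM or, as in the tandem setup of Section II, from a supervised acoustic model.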
Conventionally, in the i-vector framework, the tokens for calculating the zero- and first-order statistics are the components of a GMM trained on MFCC features. Recently, the tokens for calculating the zero-order statistics have been extended to tied triphone states and to components of GMMs trained on tandem or bottleneck features [6][7][8][9]. The features for calculating the first-order statistics have also been extended from MFCCs to feature-level fusions of acoustic and phonetic features [8]. The phonetically-aware tokens trained by supervised learning can provide better token alignment, which leads to significant performance improvements on text-independent speaker verification tasks [6][7][8][9][10].
This research was funded in part by the National Natural Science Foundation of China (61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the Fundamental Research Funds for the Central Universities (15lgjc10) and the National Key Research and Development Program (2016YFC0103905).
Within the i-vector space, Linear Discriminant Analysis (LDA) [11] can be performed before PLDA scoring to generate dimensionality-reduced and channel-compensated features, reducing both the dimensionality and the variabilities of the i-vectors. Intrinsically, LDA estimates global statistics and only seeks a linear manifold based on the Euclidean structure; it may fail to discover structure that lies on nonlinear submanifolds hidden in the total variability space.
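As a hedged illustration of this step, the sketch below projects placeholder i-vectors with scikit-learn's LinearDiscriminantAnalysis; the dimensionalities and speaker counts are assumptions made only for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
ivectors = rng.standard_normal((1000, 600))   # placeholder 600-dim i-vectors
spk = np.repeat(np.arange(50), 20)            # placeholder labels: 50 speakers x 20 utterances

# LDA retains at most (number of speakers - 1) dimensions.
lda = LinearDiscriminantAnalysis(n_components=49)
ivectors_lda = lda.fit_transform(ivectors, spk)   # (1000, 49), fed to PLDA
```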
Recently, nonparametric discriminant analysis (i.e., Nearest Neighbor Discriminant Analysis, NDA) has been successfully applied to speaker verification systems for variability compensation [12][13][14]. This motivates us to explore other nonparametric discriminant analysis algorithms for i-vector dimension reduction, e.g., Locality Sensitive Discriminant Analysis (LSDA) [15]. LSDA finds the k nearest neighbors of each sample globally and constructs within- and between-class graphs to model the local geometrical structure. It then finds a linear transform matrix that maps the i-vectors into a subspace in which the margin between i-vectors from different speakers is maximized in each local area. Compared to LSDA, NDA finds k nearest neighbors within each class separately, so its computational complexity is much higher. To perform well, LSDA with k nearest neighbors requires the number of samples in each class to be larger than, or at least close to, k; we cannot guarantee this because the number of i-vectors per speaker is heterogeneously distributed. Considering this inherent characteristic of the training set, we improve LSDA by using an adaptive number of nearest neighbors for each speaker. Furthermore, since the number of i-vectors per speaker differs, sometimes over a wide range, speakers with fewer i-vectors have little influence on LSDA's objective function. We therefore modify LSDA's within-class and between-class weight matrices to handle this unbalanced data.
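For reference, the following compact sketch implements the basic (non-adaptive) LSDA projection in the formulation of [15]; the neighborhood size k, the trade-off weight alpha, the ridge term and all variable names are illustrative assumptions, and the adaptive-k modification proposed in this paper is intentionally omitted.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import NearestNeighbors

def lsda(X, y, k=10, alpha=0.5, n_components=200):
    """Basic LSDA: returns a projection matrix V of shape (D, n_components)."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    Ww = np.zeros((n, n))          # within-class neighborhood graph
    Wb = np.zeros((n, n))          # between-class neighborhood graph
    for i in range(n):
        for j in idx[i, 1:]:       # idx[i, 0] is the point itself
            if y[i] == y[j]:
                Ww[i, j] = Ww[j, i] = 1.0
            else:
                Wb[i, j] = Wb[j, i] = 1.0
    Dw = np.diag(Ww.sum(axis=1))
    Lb = np.diag(Wb.sum(axis=1)) - Wb          # between-class graph Laplacian
    # Maximize a'X(alpha*Lb + (1-alpha)*Ww)X'a  subject to  a'X Dw X'a = 1.
    A = X.T @ (alpha * Lb + (1 - alpha) * Ww) @ X
    B = X.T @ Dw @ X + 1e-6 * np.eye(X.shape[1])   # small ridge for invertibility
    _, vecs = eigh(A, B)                            # eigenvalues in ascending order
    return vecs[:, ::-1][:, :n_components]          # top generalized eigenvectors

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))   # placeholder i-vectors
y = rng.integers(0, 20, size=500)     # placeholder speaker labels
X_lsda = X @ lsda(X, y, k=5, alpha=0.5, n_components=20)
```

Note that the single global neighbor search above is what keeps LSDA cheaper than NDA's per-class searches; the adaptive-k variant proposed here changes how neighbors and weights are assigned per speaker, not this overall structure.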
II. SYSTEM OVERVIEW
An overview of our speaker verification system with LSDA for variability compensation is shown in Fig. 1.
A. DNN Tandem Feature Extraction
In this system, a Deep Neural Network (DNN) serves as an acoustic model used to extract phonetic-level tandem features. First, a DNN acoustic model is trained using acoustic features and phonetic label data. Then the MFCC features are given