The SYSU System for the Interspeech 2015
Automatic Speaker Verification Spoofing and
Countermeasures Challenge
Shitao Weng∗, Shushan Chen∗, Lei Yu∗, Xuewei Wu∗, Weicheng Cai†, Zhi Liu∗, Ming Li†
∗SYSU-CMU Joint Institute of Engineering, Sun Yat-Sen University, Guangzhou, China
†SYSU-CMU Shunde International Joint Research Institute, Guangdong, China
E-mail: liming46@mail.sysu.edu.cn
Abstract—Many existing speaker verification systems are reported to be vulnerable to various spoofing attacks, for example speaker-adapted speech synthesis, voice conversion, playback, etc. In order to detect these spoofed speech signals as a countermeasure, we propose a score-level fusion approach with several different i-vector subsystems. We show that the acoustic level Mel-frequency cepstral coefficient (MFCC) features, the phase level modified group delay cepstral coefficient (MGDCC) features and the phonetic level phoneme posterior probability (PPP) tandem features are effective for the countermeasure. Furthermore, feature-level fusion of these features before i-vector modeling also enhances the performance. A polynomial kernel support vector machine is adopted as the supervised classifier. In order to enhance the generalizability of the countermeasure, we also adopted cosine similarity and PLDA scoring as one-class classification methods. Combining the proposed i-vector subsystems with the OpenSMILE baseline, which covers acoustic and prosodic information, further improves the final performance. The proposed fusion system achieves 0.29% and 3.26% EER on the development and test sets, respectively, of the database provided by the INTERSPEECH 2015 automatic speaker verification spoofing and countermeasures challenge.
Index Terms: speaker verification, spoofing and countermea-
sures, i-vector, modified group delay cepstral coefficients,
phoneme posterior probability
I. INTRODUCTION
The goal of speaker verification is to automatically verify
the claimed speaker identity given a segment of speech. In the
past decade, speaker verification has attracted significant research attention with promising results [1]. However, it has recently been reported that many existing speaker verification systems are vulnerable to different spoofing attacks, e.g., speaker-adapted speech synthesis, voice conversion, playback, etc. [2], [3], [4], [5], [6].
Compared to text-independent speaker verification, text-dependent speaker verification is more robust against playback spoofing since the speech content is constrained or pre-defined. Speaker-adapted speech synthesis and voice conversion are the most common spoofing methods, as they can convert arbitrary text or speech inputs towards the target speaker [2]. To enhance the robustness of speaker verification systems against spoofing attacks, different countermeasures
have been proposed. In [7], higher-level dynamic features and
voice quality assessment are used to detect those artificial
signals. Furthermore, the modified group delay cepstral coefficient (MGDCC) feature has been proposed to distinguish between the original and the spoofed speech signals in the phase domain [8]. This approach is based on the fact that the phase information of synthetic spoofed speech typically differs from that of real human speech, while the human auditory system is less sensitive to this difference. Long-term temporal modulation features derived from the magnitude or phase spectrum have also been proposed to detect synthetic speech [9].
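As a rough illustration of the phase-level feature, the modified group delay function of a single frame can be sketched as below. This is a minimal NumPy sketch, not the authors' implementation: the smoothing window size and the alpha/gamma exponents are illustrative assumptions, and an MGDCC feature would additionally apply a DCT to this per-frame output.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def modified_group_delay(frame, alpha=0.4, gamma=0.9, smooth=5):
    """Sketch of the modified group delay function for one windowed frame.

    alpha, gamma and the smoothing width are illustrative choices, not
    the values used in the paper.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)          # spectrum of x[n]
    Y = np.fft.rfft(n * frame)      # spectrum of n * x[n]
    # Locally smoothed magnitude spectrum S(w) in the denominator,
    # which suppresses spiky zeros of |X(w)|.
    S = uniform_filter1d(np.abs(X), size=smooth)
    S = np.maximum(S, 1e-8)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma))
    # Compress the dynamic range while keeping the sign.
    return np.sign(tau) * np.abs(tau) ** alpha
```

Stacking this output over frames and applying a DCT would yield cepstral-style MGDCC features comparable in shape to MFCCs.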
Total variability i-vector modeling has been widely used in
speaker verification due to its excellent performance, compact
representation and small model size [10], [11]. In this work,
we apply the recently proposed generalized i-vector framework
[12], [13], [14], [15] with both the acoustic and phonetic
features to the countermeasure task.
Figure 1 shows an overview of our anti-spoofing coun-
termeasure system. First, there are several i-vector subsys-
tems using different features, namely the acoustic level Mel-
frequency cepstral coefficients (MFCC) features, the phase
level MGDCC features, the phonetic level phoneme posterior
probability (PPP) tandem features [14], [16] and their feature-level combinations. Second, we also applied the OpenSMILE toolkit [17] to perform utterance-level acoustic
and prosodic feature extraction. We believe that the spoofed
speech signal may have different prosodic patterns. Third, after
the feature normalization, multiple classification methods, e.g.
cosine scoring, K-nearest neighbor (KNN), simplified PLDA
[18] and Support Vector Machine (SVM), are employed as the
back end. Finally, score level fusion is performed to further
enhance the overall system performance.
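The score-level fusion step can be illustrated as a weighted sum over z-normalized subsystem scores. This is a hedged sketch: the actual fusion weights and score normalization used by the system are not specified here, and `fuse_scores` is a hypothetical helper name.

```python
import numpy as np

def fuse_scores(subsystem_scores, weights):
    """Weighted score-level fusion of several subsystems.

    subsystem_scores: array-like of shape (n_systems, n_trials),
    one row of detection scores per subsystem.
    weights: illustrative per-subsystem fusion weights; in practice
    these would be tuned on a development set.
    """
    S = np.asarray(subsystem_scores, dtype=float)
    # z-normalize each subsystem's scores so they share a common scale
    S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-8)
    w = np.asarray(weights, dtype=float)
    return w @ S  # fused score per trial
```

A higher fused score would then indicate a trial more likely to be genuine (or spoofed, depending on score polarity), with the decision threshold set on the development data.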
The remainder of the paper is organized as follows. The
corpus and the proposed algorithms are explained in Sections
II and III, respectively. Experimental results and discussions
are presented in Section IV while conclusions are provided in
Section V.
II. CORPUS
The database used to evaluate the proposed methods is
based upon a standard dataset of both genuine and spoofed
speech. Genuine speech is free of significant channel or background noise effects and includes 106 speakers (45 male,