The SYSU System for the Interspeech 2015
Automatic Speaker Verification Spoofing and
Countermeasures Challenge
Shitao Weng∗, Shushan Chen∗, Lei Yu∗, Xuewei Wu∗, Weicheng Cai†, Zhi Liu∗, Ming Li†
∗SYSU-CMU Joint Institute of Engineering, Sun Yat-Sen University, Guangzhou, China
†SYSU-CMU Shunde International Joint Research Institute, Guangdong, China
E-mail: liming46@mail.sysu.edu.cn
Abstract—Many existing speaker verification systems are reported to be vulnerable to various spoofing attacks, for example speaker-adapted speech synthesis, voice conversion, playback, etc. In order to detect these spoofed speech signals as a countermeasure, we propose a score-level fusion approach with several different i-vector subsystems. We show that the acoustic level Mel-frequency cepstral coefficient (MFCC) features, the phase level modified group delay cepstral coefficient (MGDCC) features and the phonetic level phoneme posterior probability (PPP) tandem features are effective for the countermeasure. Furthermore, feature-level fusion of these features before i-vector modeling also enhances the performance. A polynomial kernel support vector machine is adopted as the supervised classifier. In order to enhance the generalizability of the countermeasure, we also adopted cosine similarity and PLDA scoring as one-class classification methods. Combining the proposed i-vector subsystems with the OpenSMILE baseline, which covers acoustic and prosodic information, further improves the final performance. The proposed fusion system achieves 0.29% and 3.26% EER on the development and test sets, respectively, of the database provided by the INTERSPEECH 2015 automatic speaker verification spoofing and countermeasures challenge.
Index Terms: speaker verification, spoofing and countermea-
sures, i-vector, modified group delay cepstral coefficients,
phoneme posterior probability
I. INTRODUCTION
The goal of speaker verification is to automatically verify
the claimed speaker identity given a segment of speech. In the
past decade, speaker verification has attracted significant research attention with promising results [1]. However, it has recently been reported that many existing speaker verification systems are vulnerable to different spoofing attacks, e.g., speaker-adapted speech synthesis, voice conversion, playback, etc. [2], [3], [4], [5], [6].
Compared to text-independent speaker verification, text-dependent speaker verification is more robust against playback spoofing since the speech content is constrained or pre-defined. Speaker-adapted speech synthesis and voice conversion are the most common spoofing methods, as they can convert arbitrary text or speech inputs towards the target speaker [2]. To enhance the robustness of speaker verification systems against spoofing attacks, different countermeasures
have been proposed. In [7], higher-level dynamic features and
voice quality assessment are used to detect those artificial
signals. Furthermore, the modified group delay cepstral coefficient (MGDCC) feature has been proposed to distinguish between the original and the spoofed speech signals in the phase domain [8]. This approach is based on the fact that the phase information of synthetic spoofed speech typically differs from that of real human speech, while the human auditory system is less sensitive to this difference. Long-term temporal modulation features derived from the magnitude or phase spectrum have also been proposed to detect synthetic speech [9].
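As a rough illustration of the phase-level feature, the modified group delay function of a single frame can be sketched as below. This is a minimal NumPy sketch, not the authors' implementation: the smoothing window size and the alpha/gamma exponents are illustrative assumptions, and an MGDCC feature would additionally apply a DCT to this per-frame output.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def modified_group_delay(frame, alpha=0.4, gamma=0.9, smooth=5):
    """Sketch of the modified group delay function for one windowed frame.

    alpha, gamma and the smoothing width are illustrative choices, not
    the values used in the paper.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)          # spectrum of x[n]
    Y = np.fft.rfft(n * frame)      # spectrum of n * x[n]
    # Locally smoothed magnitude spectrum S(w) in the denominator,
    # which suppresses spiky zeros of |X(w)|.
    S = uniform_filter1d(np.abs(X), size=smooth)
    S = np.maximum(S, 1e-8)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma))
    # Compress the dynamic range while keeping the sign.
    return np.sign(tau) * np.abs(tau) ** alpha
```

Stacking this output over frames and applying a DCT would yield cepstral-style MGDCC features comparable in shape to MFCCs.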
Total variability i-vector modeling has been widely used in
speaker verification due to its excellent performance, compact
representation and small model size [10], [11]. In this work,
we apply the recently proposed generalized i-vector framework
[12], [13], [14], [15] with both the acoustic and phonetic
features to the countermeasure task.
Figure 1 shows an overview of our anti-spoofing coun-
termeasure system. First, there are several i-vector subsys-
tems using different features, namely the acoustic level Mel-
frequency cepstral coefficients (MFCC) features, the phase
level MGDCC features, the phonetic level phoneme posterior
probability (PPP) tandem features [14], [16] and their feature-level combinations. Second, we also applied the OpenSMILE toolkit [17] to perform utterance-level acoustic
and prosodic feature extraction. We believe that the spoofed
speech signal may have different prosodic patterns. Third, after
the feature normalization, multiple classification methods, e.g.
cosine scoring, K-nearest neighbor (KNN), simplified PLDA
[18] and Support Vector Machine (SVM), are employed as the
back end. Finally, score level fusion is performed to further
enhance the overall system performance.
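The score-level fusion step can be illustrated as a weighted sum over z-normalized subsystem scores. This is a hedged sketch: the actual fusion weights and score normalization used by the system are not specified here, and `fuse_scores` is a hypothetical helper name.

```python
import numpy as np

def fuse_scores(subsystem_scores, weights):
    """Weighted score-level fusion of several subsystems.

    subsystem_scores: array-like of shape (n_systems, n_trials),
    one row of detection scores per subsystem.
    weights: illustrative per-subsystem fusion weights; in practice
    these would be tuned on a development set.
    """
    S = np.asarray(subsystem_scores, dtype=float)
    # z-normalize each subsystem's scores so they share a common scale
    S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-8)
    w = np.asarray(weights, dtype=float)
    return w @ S  # fused score per trial
```

A higher fused score would then indicate a trial more likely to be genuine (or spoofed, depending on score polarity), with the decision threshold set on the development data.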
The remainder of the paper is organized as follows. The
corpus and the proposed algorithms are explained in Sections
II and III, respectively. Experimental results and discussions
are presented in Section IV while conclusions are provided in
Section V.
II. CORPUS
The database used to evaluate the proposed methods is
based upon a standard dataset of both genuine and spoofed
speech. Genuine speech is free of significant channel or background noise effects and includes 106 speakers (45 male,