
SI-NET: MULTI-SCALE CONTEXT-AWARE CONVOLUTIONAL BLOCK FOR SPEAKER
VERIFICATION
Zhuo Li 1,2, Ce Fang 1,2, Runqiu Xiao 1,2, Wenchao Wang 1,2,†, Yonghong Yan 1,2,3
1 Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
3 Xinjiang Key Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
† Corresponding author.
ABSTRACT
Adequately utilizing multi-scale information is essential for building a high-performance speaker verification (SV) system. Biological research shows that the human auditory system employs a multi-timescale processing mode to extract information and has a mechanism for integrating multi-scale information to encode sound. Inspired by this, we propose a novel block, named Split-Integration (SI), to explore multi-scale context-aware feature learning at a granular level for speaker verification. Our model involves a pair of operations: (i) multi-scale split, which imitates the multi-timescale processing mode by extracting multi-scale features through grouping and stacking filters of different sizes, and (ii) dynamic integration, which mirrors the fusion mechanism by introducing KL divergence to measure the complementarity between multi-scale features, so that the model fully integrates them and produces a more speaker-discriminative representation. Experiments are conducted on the VoxCeleb and Speakers in the Wild (SITW) datasets. Results demonstrate that our approach achieves a relative 10%-20% improvement in equal error rate (EER) over a strong baseline on the SV task.
Index Terms— speaker verification, Split-Integration, multi-scale features, dynamic integration, granular-level feature learning
1. INTRODUCTION
End-to-end speaker verification systems have emerged in recent years and achieved state-of-the-art performance. The d-vector [1], which trains a DNN-based model to distinguish speakers with a classification loss at the frame level, opened the door for end-to-end SV systems. Nevertheless, using frame-level labels in the training stage and simply averaging the frame-level outputs in the test stage ignores the dependencies between frames. Thus, the x-vector [2], one of the most popular systems, introduces the Time Delay Neural Network (TDNN) and a statistics pooling layer to model the dependencies of contiguous frames. With the rising popularity of the x-vector system, various efforts have been devoted to the topology of the network to enhance the deep representation ability for SV. The popular ResNet [3] architecture has been introduced into SV tasks [4, 5, 6, 7] to generate more abstract and informative embeddings.
Most of the above systems use only a single timescale in each frame-level layer and focus on learning local patterns. Capturing multi-scale speaker information is beneficial for further improvement. Recently, a biological study [8] showed that the human auditory system employs (at least) a two-timescale processing mode to track acoustic dynamics and has a mechanism for fusing multi-timescale information to encode sound. In [9], Wang showed that a two-pathway neural network can encode complementary multi-timescale information into utterance-level embeddings, but that work does not explore the effect at the frame level. The study in [10] also shows that multi-scale filters with different receptive fields in each layer can extract multi-scale information and achieve better performance.
Recently, a novel network, Res2Net [11], was proposed. Different from [9, 10], Res2Net strengthens its ability to extract multi-scale features at a granular level by splitting filters into several groups and adding the output of the previous group of filters to the input of the current group. Our prior work in FFSVC2020 [12, 13] and the study in [14] show that Res2Net yields substantial improvement in the SV task.
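To make this split mechanism concrete, the following is a minimal PyTorch sketch of a Res2Net-style granular split; the class and parameter names are our own illustration rather than the authors' released code, and batch normalization and activations are omitted for brevity.

import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Granular multi-scale convolution in the style of Res2Net [11].

    The channel dimension is split into `scale` groups; each convolved
    group receives the previous group's output as an extra input, so
    later groups see progressively larger receptive fields.
    """

    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.width = channels // scale
        # One 1-D convolution per group except the first, which is kept as-is.
        self.convs = nn.ModuleList([
            nn.Conv1d(self.width, self.width, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(scale - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        splits = torch.split(x, self.width, dim=1)
        outputs = [splits[0]]  # the first group is passed through unchanged
        prev = None
        for conv, x_i in zip(self.convs, splits[1:]):
            # Add the previous group's output to the current group's input
            # before convolving (the first convolved group has no predecessor).
            prev = conv(x_i if prev is None else x_i + prev)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)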
Inspired by the biological research and the architecture of Res2Net, we explore multi-scale context-aware feature learning at a more granular level by introducing a novel block, named "split-integration", as shown in Figure 1. The "split-integration" convolution block consists of two operations: multi-scale split and dynamic integration. Biological research [15] suggests that the human auditory system derives the appropriate perceptual representations by extracting rapidly varying information on a scale of milliseconds and analyzing more slowly varying signal attributes on a scale of hundreds of milliseconds.
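As a rough illustration of the dynamic-integration idea, the sketch below weights each scale branch by its KL divergence from the other branches, treating larger divergence as a sign of more complementary information; this is our own simplified reading of the description above, not the exact SI-Net formulation, and the function name and pooling choices are hypothetical.

import torch
import torch.nn.functional as F

def kl_weighted_fusion(features):
    # features: a list of tensors, one per scale, each (batch, channels, frames).
    # Summarize each branch as a channel distribution via global average
    # pooling followed by a softmax.
    dists = [F.softmax(f.mean(dim=-1), dim=-1) for f in features]

    weights = []
    for i, p in enumerate(dists):
        # Sum of KL(p || q) against every other branch: a branch that diverges
        # more from the rest is assumed to carry more complementary information.
        # clamp_min guards the log against numerically zero probabilities.
        div = sum(
            F.kl_div(q.clamp_min(1e-8).log(), p, reduction="none").sum(dim=-1)
            for j, q in enumerate(dists) if j != i
        )
        weights.append(div)
    weights = F.softmax(torch.stack(weights, dim=-1), dim=-1)  # (batch, scales)

    # Fuse the branches with the divergence-derived weights.
    stacked = torch.stack(features, dim=-1)  # (batch, channels, frames, scales)
    return (stacked * weights[:, None, None, :]).sum(dim=-1)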