
FULLY SUPERVISED SPEAKER DIARIZATION
Aonan Zhang¹,²   Quan Wang¹   Zhenyao Zhu¹   John Paisley²   Chong Wang¹

¹ Google Inc., USA
² Columbia University, USA
¹ {aonan, quanw, zyzhu, chongw}@google.com
² {az2385, jpaisley}@columbia.edu
ABSTRACT
In this paper, we propose a fully supervised speaker diarization
approach, named unbounded interleaved-state recurrent neural
networks (UIS-RNN). Given extracted speaker-discriminative em-
beddings (a.k.a. d-vectors) from input utterances, each individual
speaker is modeled by a parameter-sharing RNN, while the RNN
states for different speakers interleave in the time domain. This RNN
is naturally integrated with a distance-dependent Chinese restaurant
process (ddCRP) to accommodate an unknown number of speakers.
Our system is fully supervised and is able to learn from examples
where time-stamped speaker labels are annotated. We achieved a
7.6% diarization error rate on NIST SRE 2000 CALLHOME, which
is better than the state-of-the-art method using spectral clustering.
Moreover, our method decodes in an online fashion while most
state-of-the-art systems rely on offline clustering.
Index Terms— Speaker diarization, d-vector, clustering, recur-
rent neural networks, Chinese restaurant process
1. INTRODUCTION
Aiming to solve the problem of “who spoke when”, most existing
speaker diarization systems consist of multiple relatively indepen-
dent components [1, 2, 3], including but not limited to: (1) A speech
segmentation module, which removes the non-speech parts, and di-
vides the input utterance into small segments; (2) An embedding ex-
traction module, where speaker-discriminative embeddings such as
speaker factors [4], i-vectors [5], or d-vectors [6] are extracted from
the small segments; (3) A clustering module, which determines the
number of speakers, and assigns speaker identities to each segment;
(4) A resegmentation module, which further refines the diarization
results by enforcing additional constraints [1]. A toy sketch of this
pipeline is given below.
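To make this modular design concrete, here is a minimal, self-contained Python sketch of stages (1)–(3). Every piece is a deliberately naive stand-in (fixed-length chunks instead of a real VAD, a random projection instead of an embedding network, spherical k-means instead of spectral clustering), not the implementation of any cited system; resegmentation is omitted.

```python
import numpy as np

def speech_segments(audio, sr):
    """(1) Segmentation: fixed 400 ms chunks stand in for a real VAD."""
    step = int(0.4 * sr)
    return [(s, min(s + step, len(audio))) for s in range(0, len(audio), step)]

def embed(audio, seg, dim=64):
    """(2) Embedding: a seeded random projection stands in for a d-vector net."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((dim, seg[1] - seg[0]))
    v = proj @ audio[seg[0]:seg[1]]
    return v / (np.linalg.norm(v) + 1e-8)

def cluster(X, k=2, iters=10):
    """(3) Clustering: naive spherical k-means (the baseline uses spectral [3])."""
    centers = X[:k].copy()
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)        # cosine similarity
        for j in range(k):
            if np.any(labels == j):
                c = X[labels == j].mean(axis=0)
                centers[j] = c / (np.linalg.norm(c) + 1e-8)
    return labels

sr = 8000
audio = np.random.default_rng(1).standard_normal(4 * sr)  # 4 s of fake audio
segs = speech_segments(audio, sr)
X = np.stack([embed(audio, s) for s in segs])
print(list(zip(segs, cluster(X))))                        # (4) resegmentation omitted
```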
For the embedding extraction module, recent work [2, 3, 7]
has shown that the diarization performance can be significantly im-
proved by replacing i-vectors [5] with neural network embeddings,
a.k.a. d-vectors [6, 8]. This is largely because neural networks can be
trained on large datasets, making the model sufficiently robust to
varying speaker accents and acoustic conditions across different use
scenarios.
However, there is still one component that is unsupervised in
most modern speaker diarization systems — the clustering module.
Examples of clustering algorithms that have been used in diarization
systems include Gaussian mixture models [7, 9], mean shift [10],
agglomerative hierarchical clustering [2, 11], k-means [3, 12], Links
[3, 13], and spectral clustering [3, 14].
The first author performed this work as an intern at Google.
The implementation of the algorithms in this paper is available at:
https://github.com/google/uis-rnn
Since both the number of speakers and the segment-wise speaker
labels are determined by the clustering module, the quality of the
clustering algorithm is critically important to the final diarization
performance. However, since most clustering algorithms are
unsupervised, we are unable to improve this module by learning from
examples, even when ground truth time-stamped speaker labels are
available. In fact, in many domain-specific applications, it is
relatively easy to obtain such high-quality annotated data.
In this paper, we replace the unsupervised clustering module with
an online generative process that naturally incorporates labelled data
for training. We call this method the unbounded interleaved-state
recurrent neural network (UIS-RNN), based on three facts: (1) each
speaker is modeled by an instance of an RNN, and these instances share
the same parameters; (2) an unbounded number of RNN instances
can be generated; (3) the states of different RNN instances,
corresponding to different speakers, are interleaved in the time domain.
In addition, within this fully supervised framework, our method handles
the key complexities of speaker diarization: it automatically learns the
number of speakers within each utterance via a Bayesian non-parametric
process, and it carries information through time via the RNN.
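As a toy illustration of the decoding idea only (not the actual UIS-RNN algorithm), the Python sketch below shares one parameter matrix W across all speaker instances, substitutes a plain CRP prior (cluster counts plus a concentration alpha) for the ddCRP, invents a simple Gaussian-style emission score, and makes greedy decisions in place of proper MAP decoding:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # toy d-vector dimension
W = rng.standard_normal((D, D)) * 0.1    # shared RNN parameters: one set for all speakers

def rnn_step(h, x):
    """One step of a toy shared-parameter 'RNN'; every speaker uses the same W."""
    return np.tanh(W @ h + x)

def log_lik(h, x):
    """Made-up emission score: how well speaker state h predicts observation x."""
    return -0.5 * np.sum((x - h) ** 2)

def greedy_decode(X, alpha=1.0):
    """Greedy online decoding: extend an existing speaker's RNN or spawn a new one.
    A plain CRP prior (counts + alpha) stands in for the ddCRP of the paper."""
    states, counts, labels = [], [], []
    for x in X:
        scores = [np.log(c) + log_lik(h, x) for h, c in zip(states, counts)]
        scores.append(np.log(alpha) + log_lik(np.zeros(D), x))  # start a new speaker
        k = int(np.argmax(scores))
        if k == len(states):                  # spawn a fresh RNN instance
            states.append(np.zeros(D))
            counts.append(0)
        states[k] = rnn_step(states[k], x)    # only speaker k's state advances, so the
        counts[k] += 1                        # states of different speakers interleave
        labels.append(k)
    return labels

X = rng.standard_normal((12, D))              # fake segment-level d-vectors
print(greedy_decode(X))
```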
The contributions of our work are summarized as follows:
1. Unbounded interleaved-state RNN, a trainable model for the
general problem of segmenting and clustering temporal data
by learning from examples.
2. Framework for a fully supervised speaker diarization system.
3. New state-of-the-art performance on NIST SRE 2000 CALL-
HOME benchmark.
4. Online diarization solution with offline quality.
2. BASELINE SYSTEM USING CLUSTERING
Our diarization system is built on top of the recent work by Wang et
al. [3]. Specifically, we use exactly the same segmentation and
embedding extraction modules as their system, while replacing their
clustering module with an unbounded interleaved-state RNN.
As a brief review, in the baseline system [3], a text-independent
speaker recognition network is used to extract embeddings from slid-
ing windows of size 240 ms with 50% overlap. A simple voice activity
detector (VAD) with only two full-covariance Gaussians is used to
remove non-speech parts and to partition the utterance into non-
overlapping segments with a maximum length of 400 ms. Window-level
embeddings are then averaged into segment-level d-vectors and fed
into the clustering algorithm to produce the final diarization results.
The workflow of this baseline system is shown in Fig. 1.
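The window-to-segment averaging step might look like the following sketch; the function, its signature, and the rule of assigning each window to the segment containing its center are illustrative assumptions, not the exact logic of [3]:

```python
import numpy as np

def segment_dvectors(win_embs, win_starts, win_len, segments):
    """Average window-level embeddings into segment-level d-vectors.

    win_embs:   (N, D) embeddings from 240 ms windows with 50% overlap.
    win_starts: (N,) window start times in seconds.
    win_len:    window length in seconds (0.24 here).
    segments:   list of (start, end) speech segments, each at most 400 ms.
    """
    centers = win_starts + win_len / 2
    dvecs = []
    for start, end in segments:
        mask = (centers >= start) & (centers < end)  # windows centered in segment
        if not mask.any():                           # no window landed here; skip
            continue
        v = win_embs[mask].mean(axis=0)
        dvecs.append(v / np.linalg.norm(v))          # L2-normalize
    return np.stack(dvecs)

# Tiny usage example with 5 fake windows (hop = 120 ms) and two segments.
embs = np.random.default_rng(0).standard_normal((5, 3))
starts = np.arange(5) * 0.12
print(segment_dvectors(embs, starts, 0.24, [(0.0, 0.4), (0.4, 0.8)]))
```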
The text-independent speaker recognition network for comput-
ing embeddings has three LSTM layers and one linear layer. The
network is trained with the state-of-the-art generalized end-to-end
loss [6]. We have been retraining this model for better performance,
which will be discussed later in Section 4.1.
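As a rough architectural sketch (the layer sizes are guesses, and the generalized end-to-end loss [6] is not reproduced here), such an encoder could be written in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DvectorNet(nn.Module):
    """Sketch of a text-independent speaker encoder: three LSTM layers plus a
    linear projection. Layer sizes here are guesses, not the real model's."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden, emb_dim)

    def forward(self, mel):                # mel: (batch, frames, n_mels)
        out, _ = self.lstm(mel)
        emb = self.linear(out[:, -1])      # last-frame output -> embedding
        return F.normalize(emb, dim=1)     # L2-normalized d-vector

# Example: a batch of 2 utterances, 100 frames of 40-dim log-mel features each.
net = DvectorNet()
dvec = net(torch.randn(2, 100, 40))        # -> shape (2, 256), unit-norm rows
```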