
FULLY SUPERVISED SPEAKER DIARIZATION
Aonan Zhang¹,²   Quan Wang¹   Zhenyao Zhu¹   John Paisley²   Chong Wang¹

¹ Google Inc., USA
² Columbia University, USA
¹ {aonan, quanw, zyzhu, chongw}@google.com
² {az2385, jpaisley}@columbia.edu
ABSTRACT
In this paper, we propose a fully supervised speaker diarization
approach, named unbounded interleaved-state recurrent neural
networks (UIS-RNN). Given extracted speaker-discriminative em-
beddings (a.k.a. d-vectors) from input utterances, each individual
speaker is modeled by a parameter-sharing RNN, while the RNN
states for different speakers interleave in the time domain. This RNN
is naturally integrated with a distance-dependent Chinese restaurant
process (ddCRP) to accommodate an unknown number of speakers.
Our system is fully supervised and is able to learn from examples
where time-stamped speaker labels are annotated. We achieved a
7.6% diarization error rate on NIST SRE 2000 CALLHOME, which
is better than the state-of-the-art method using spectral clustering.
Moreover, our method decodes in an online fashion while most
state-of-the-art systems rely on offline clustering.
Index Terms— Speaker diarization, d-vector, clustering, recur-
rent neural networks, Chinese restaurant process
1. INTRODUCTION
Aiming to solve the problem of “who spoke when”, most existing
speaker diarization systems consist of multiple relatively indepen-
dent components [1, 2, 3], including but not limited to: (1) A speech
segmentation module, which removes the non-speech parts, and di-
vides the input utterance into small segments; (2) An embedding ex-
traction module, where speaker-discriminative embeddings such as
speaker factors [4], i-vectors [5], or d-vectors [6] are extracted from
the small segments; (3) A clustering module, which determines the
number of speakers, and assigns speaker identities to each segment;
(4) A resegmentation module, which further refines the diarization
results by enforcing additional constraints [1]. A toy sketch of this
pipeline is given below.
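To make this modular design concrete, here is a minimal, self-contained Python sketch of stages (1)–(3). Every piece is a deliberately naive stand-in (fixed-length chunks instead of a real VAD, a random projection instead of an embedding network, spherical k-means instead of spectral clustering), not the implementation of any cited system; resegmentation is omitted.

```python
import numpy as np

def speech_segments(audio, sr):
    """(1) Segmentation: fixed 400 ms chunks stand in for a real VAD."""
    step = int(0.4 * sr)
    return [(s, min(s + step, len(audio))) for s in range(0, len(audio), step)]

def embed(audio, seg, dim=64):
    """(2) Embedding: a seeded random projection stands in for a d-vector net."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((dim, seg[1] - seg[0]))
    v = proj @ audio[seg[0]:seg[1]]
    return v / (np.linalg.norm(v) + 1e-8)

def cluster(X, k=2, iters=10):
    """(3) Clustering: naive spherical k-means (the baseline uses spectral [3])."""
    centers = X[:k].copy()
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)        # cosine similarity
        for j in range(k):
            if np.any(labels == j):
                c = X[labels == j].mean(axis=0)
                centers[j] = c / (np.linalg.norm(c) + 1e-8)
    return labels

sr = 8000
audio = np.random.default_rng(1).standard_normal(4 * sr)  # 4 s of fake audio
segs = speech_segments(audio, sr)
X = np.stack([embed(audio, s) for s in segs])
print(list(zip(segs, cluster(X))))                        # (4) resegmentation omitted
```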
For the embedding extraction module, recent work [2, 3, 7]
has shown that the diarization performance can be significantly im-
proved by replacing i-vectors [5] with neural network embeddings,
a.k.a. d-vectors [6, 8]. This is largely because neural networks can be
trained on large datasets, making the model sufficiently robust to
varying speaker accents and acoustic conditions across different use
scenarios.
However, there is still one component that is unsupervised in
most modern speaker diarization systems — the clustering module.
Examples of clustering algorithms that have been used in diarization
systems include Gaussian mixture models [7, 9], mean shift [10],
agglomerative hierarchical clustering [2, 11], k-means [3, 12], Links
[3, 13], and spectral clustering [3, 14].
The first author performed this work as an intern at Google.
The implementation of the algorithms in this paper is available at:
https://github.com/google/uis-rnn
Since both the number of speakers and the segment-wise speaker
labels are determined by the clustering module, the quality of the
clustering algorithm is critically important to the final diarization
performance. However, since most clustering algorithms are
unsupervised, we are unable to improve this module by learning from
examples, even when ground truth time-stamped speaker labels are
available. In fact, in many domain-specific applications, it is
relatively easy to obtain such high-quality annotated data.
In this paper, we replace the unsupervised clustering module with
an online generative process that naturally incorporates labelled data
for training. We call this method the unbounded interleaved-state
recurrent neural network (UIS-RNN), based on three facts: (1) each
speaker is modeled by an instance of an RNN, and these instances share
the same parameters; (2) an unbounded number of RNN instances
can be generated; (3) the states of different RNN instances,
corresponding to different speakers, are interleaved in the time domain.
In addition, within this fully supervised framework, our method handles
the key complexities of speaker diarization: it automatically learns the
number of speakers within each utterance via a Bayesian non-parametric
process, and it carries information through time via the RNN.
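As a toy illustration of the decoding idea only (not the actual UIS-RNN algorithm), the Python sketch below shares one parameter matrix W across all speaker instances, substitutes a plain CRP prior (cluster counts plus a concentration alpha) for the ddCRP, invents a simple Gaussian-style emission score, and makes greedy decisions in place of proper MAP decoding:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                    # toy d-vector dimension
W = rng.standard_normal((D, D)) * 0.1    # shared RNN parameters: one set for all speakers

def rnn_step(h, x):
    """One step of a toy shared-parameter 'RNN'; every speaker uses the same W."""
    return np.tanh(W @ h + x)

def log_lik(h, x):
    """Made-up emission score: how well speaker state h predicts observation x."""
    return -0.5 * np.sum((x - h) ** 2)

def greedy_decode(X, alpha=1.0):
    """Greedy online decoding: extend an existing speaker's RNN or spawn a new one.
    A plain CRP prior (counts + alpha) stands in for the ddCRP of the paper."""
    states, counts, labels = [], [], []
    for x in X:
        scores = [np.log(c) + log_lik(h, x) for h, c in zip(states, counts)]
        scores.append(np.log(alpha) + log_lik(np.zeros(D), x))  # start a new speaker
        k = int(np.argmax(scores))
        if k == len(states):                  # spawn a fresh RNN instance
            states.append(np.zeros(D))
            counts.append(0)
        states[k] = rnn_step(states[k], x)    # only speaker k's state advances, so the
        counts[k] += 1                        # states of different speakers interleave
        labels.append(k)
    return labels

X = rng.standard_normal((12, D))              # fake segment-level d-vectors
print(greedy_decode(X))
```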
The contributions of our work are summarized as follows:
1. Unbounded interleaved-state RNN, a trainable model for the
general problem of segmenting and clustering temporal data
by learning from examples.
2. Framework for a fully supervised speaker diarization system.
3. New state-of-the-art performance on NIST SRE 2000 CALL-
HOME benchmark.
4. Online diarization solution with offline quality.
2. BASELINE SYSTEM USING CLUSTERING
Our diarization system is built on top of the recent work by Wang et
al. [3]. Specifically, we use exactly the same segmentation and
embedding extraction modules as their system, while replacing their
clustering module with an unbounded interleaved-state RNN.
As a brief review, in the baseline system [3], a text-independent
speaker recognition network is used to extract embeddings from slid-
ing windows of size 240 ms with 50% overlap. A simple voice activity
detector (VAD) with only two full-covariance Gaussians is used to
remove non-speech parts and to partition the utterance into non-
overlapping segments with a maximum length of 400 ms. Window-level
embeddings are then averaged into segment-level d-vectors and fed
into the clustering algorithm to produce the final diarization results.
The workflow of this baseline system is shown in Fig. 1.
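The window-to-segment averaging step might look like the following sketch; the function, its signature, and the rule of assigning each window to the segment containing its center are illustrative assumptions, not the exact logic of [3]:

```python
import numpy as np

def segment_dvectors(win_embs, win_starts, win_len, segments):
    """Average window-level embeddings into segment-level d-vectors.

    win_embs:   (N, D) embeddings from 240 ms windows with 50% overlap.
    win_starts: (N,) window start times in seconds.
    win_len:    window length in seconds (0.24 here).
    segments:   list of (start, end) speech segments, each at most 400 ms.
    """
    centers = win_starts + win_len / 2
    dvecs = []
    for start, end in segments:
        mask = (centers >= start) & (centers < end)  # windows centered in segment
        if not mask.any():                           # no window landed here; skip
            continue
        v = win_embs[mask].mean(axis=0)
        dvecs.append(v / np.linalg.norm(v))          # L2-normalize
    return np.stack(dvecs)

# Tiny usage example with 5 fake windows (hop = 120 ms) and two segments.
embs = np.random.default_rng(0).standard_normal((5, 3))
starts = np.arange(5) * 0.12
print(segment_dvectors(embs, starts, 0.24, [(0.0, 0.4), (0.4, 0.8)]))
```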
The text-independent speaker recognition network for comput-
ing embeddings has three LSTM layers and one linear layer. The
network is trained with the state-of-the-art generalized end-to-end
loss [6]. We have been retraining this model for better performance,
which will be discussed later in Section 4.1.
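As a rough architectural sketch (the layer sizes are guesses, and the generalized end-to-end loss [6] is not reproduced here), such an encoder could be written in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DvectorNet(nn.Module):
    """Sketch of a text-independent speaker encoder: three LSTM layers plus a
    linear projection. Layer sizes here are guesses, not the real model's."""
    def __init__(self, n_mels=40, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden, emb_dim)

    def forward(self, mel):                # mel: (batch, frames, n_mels)
        out, _ = self.lstm(mel)
        emb = self.linear(out[:, -1])      # last-frame output -> embedding
        return F.normalize(emb, dim=1)     # L2-normalized d-vector

# Example: a batch of 2 utterances, 100 frames of 40-dim log-mel features each.
net = DvectorNet()
dvec = net(torch.randn(2, 100, 40))        # -> shape (2, 256), unit-norm rows
```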