IMPROVED SPEAKER SEGMENTATION AND SEGMENTS CLUSTERING USING THE
BAYESIAN INFORMATION CRITERION
Alain Tritschler and Ramesh Gopinath
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
email: alain@us.ibm.com
ABSTRACT
Detection of speaker, channel and environment changes in
a continuous audio stream is important in various applications
(e.g., broadcast news, meetings/teleconferences, etc.).
Standard schemes for segmentation use a classifier and hence
do not generalize to unseen speakers / channels / environments.
Recently S. Chen introduced new segmentation and
clustering algorithms using the so-called BIC. This paper
presents more accurate and more efficient variants of the
BIC scheme for segmentation and clustering. Specifically,
the new algorithms improve the speed and accuracy of segmentation
and clustering and allow for a real-time implementation
of simultaneous transcription, segmentation and
speaker tracking.
1. INTRODUCTION
The segmentation of continuous audio is useful as a preprocessor
for further classification of the segments for speaker
identification/verification, noise rejection, music removal, etc.
In automatic transcription applications such a segmentation
scheme allows the creation and use of speaker / channel /
environment-specific acoustic models for improved transcription
accuracy. In several of these applications, clustering
of segments from the same speaker / channel / environment
is also useful. Segmentation and clustering can be
used in conjunction in speaker tracking applications. Together
they can be used to increase the amount of adaptation
data for unsupervised adaptation of acoustic models in
transcription applications. In general they allow specialized
processing of the audio for specific speakers / channels /
environments. This paper presents improvements (in both speed
and accuracy) to algorithms for segmentation and clustering
based on the Bayesian Information Criterion (BIC) introduced
recently in [1]. These improvements have allowed us
to create an application that concurrently segments, transcribes,
identifies and tracks speakers in broadcast news audio
in real time.
The paper is organized as follows: Section 2 briefly reviews
the BIC, which is the key concept used in both the
segmentation and clustering algorithms. Section 3 describes
the new version of the segmentation algorithm and Section 4
describes improvements to the clustering algorithm.
Section 5 describes how these new algorithms are incorporated
in a real-time transcription, segmentation and speaker
identification and tracking system for broadcast news.
2. THE BAYESIAN INFORMATION CRITERION
BIC is an asymptotically optimal Bayesian model-selection
criterion used to decide which of $p$ parametric models best
represents $n$ data samples $x_1, \ldots, x_n$, $x_i \in \mathbb{R}^d$.
Each model $M_j$ has a number of parameters, say $k_j$. We assume
that the samples $x_i$ are independent.
According to the BIC theory [3], for sufficiently large $n$,
the best model of the data is the one which maximizes

    $\mathrm{BIC}_j = \log L_j(x_1, \ldots, x_n) - \frac{1}{2} \lambda k_j \log n$    (1)

with $\lambda = 1$, and where $L_j$ is the maximum likelihood of the
data under model $M_j$ (i.e., the likelihood of the data with
maximum likelihood values for the $k_j$ parameters of $M_j$).
In the particular case where there are only two models
we have a simple test for model selection: choose the model
$M_1$ over $M_2$ if $\Delta\mathrm{BIC} = \mathrm{BIC}_1 - \mathrm{BIC}_2$ is positive.
Note that BIC can also be viewed as a penalized maximum
likelihood technique [3, 1].
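As an illustration of equation (1) and the two-model test, the following minimal Python sketch scores two candidate models for a small set of one-dimensional samples. The data, the function names (`bic_score`, `gaussian_max_log_likelihood`) and the choice of candidate models (a Gaussian with free mean and variance versus a zero-mean Gaussian with free variance) are assumptions made purely for illustration and do not come from the paper.

```python
import math


def bic_score(max_log_likelihood, num_params, num_samples, lam=1.0):
    """BIC_j = log L_j - 0.5 * lam * k_j * log(n), following equation (1)."""
    return max_log_likelihood - 0.5 * lam * num_params * math.log(num_samples)


def gaussian_max_log_likelihood(samples, fixed_mean=None):
    """Maximized log-likelihood of 1-D samples under a Gaussian.

    If fixed_mean is given, only the variance is estimated; otherwise both
    the mean and the variance take their maximum-likelihood values.
    """
    n = len(samples)
    mean = sum(samples) / n if fixed_mean is None else fixed_mean
    var = sum((x - mean) ** 2 for x in samples) / n
    # At the ML variance the exponent term sums to n / 2.
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


# Hypothetical data, clustered around 3.0.
samples = [2.0, 3.0, 4.0, 3.0, 2.5, 3.5, 3.0, 2.8, 3.2, 3.0]
n = len(samples)

# M_1: free mean and variance (k_1 = 2); M_2: mean fixed at 0 (k_2 = 1).
bic_1 = bic_score(gaussian_max_log_likelihood(samples), 2, n)
bic_2 = bic_score(gaussian_max_log_likelihood(samples, fixed_mean=0.0), 1, n)

delta_bic = bic_1 - bic_2
chosen = "M_1" if delta_bic > 0 else "M_2"
print(chosen)
```

Here the extra parameter of $M_1$ costs only $\frac{1}{2} \log n$ in penalty, which is far outweighed by the gain in likelihood, so the difference is positive and $M_1$ is selected.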
3. SEGMENTATION USING BIC
3.1. BIC for segmentation
In this paper standard 24-dimensional mel-cepstral feature
vectors generated at 10 ms intervals from the continuous audio
stream form the data samples (or frames). The audio
stream is from a broadcast news source sampled at 16 kHz
with 16-bit PCM. The basic problem is to identify all possible
frames where there is a segment boundary. Without
loss of generality consider a window of consecutive data
samples $\{x_1, \ldots, x_n\}$ in which there is at most one segment
boundary. In this case the basic question of whether or not
there is a segment boundary at frame $i$ can be cast as a
model selection problem between the following two models:
model $M_1$, where $\{x_1, \ldots, x_n\}$ is drawn from a single
full-covariance Gaussian, and model $M_2$, where $\{x_1, \ldots, x_n\}$ is
drawn from two full-covariance Gaussians, with $\{x_1, \ldots, x_i\}$
drawn from the first Gaussian and $\{x_{i+1}, \ldots, x_n\}$ drawn
from the second Gaussian. Since $x_i \in \mathbb{R}^d$, model $M_1$ has
$k_1 = d + \frac{d(d+1)}{2}$ parameters, while model $M_2$ has twice as
many parameters ($k_2 = 2 k_1$).
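The two-model comparison above can be sketched directly from the definitions in Section 2: fit one full-covariance Gaussian to the whole window, fit two to the halves on either side of a candidate frame, and take the difference of the BIC scores. The pure-Python sketch below is an illustration under stated assumptions, not the authors' implementation: the helper names, the synthetic two-dimensional frames and the Cholesky-based log-determinant are all hypothetical choices. Both halves must contain more than $d$ frames so that their covariance estimates are non-singular.

```python
import math
import random


def num_gaussian_params(d):
    """k_1 = d + d(d+1)/2: mean entries plus symmetric covariance entries."""
    return d + d * (d + 1) // 2


def ml_covariance(frames):
    """Maximum-likelihood (1/n) covariance of a list of d-dimensional frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in frames:
        c = [f[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b] / n
    return cov


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    d = len(matrix)
    low = [[0.0] * d for _ in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s)  # ValueError if not positive definite
                total += 2.0 * math.log(low[i][j])
            else:
                low[i][j] = s / low[j][j]
    return total


def full_cov_max_log_likelihood(frames):
    """Maximized log-likelihood of frames under one full-covariance Gaussian."""
    n, d = len(frames), len(frames[0])
    return -0.5 * n * (d * math.log(2.0 * math.pi)
                       + log_det(ml_covariance(frames)) + d)


def segment_delta_bic(frames, i, lam=1.0):
    """BIC_1 - BIC_2 for a candidate boundary after frame i (1-based)."""
    n, d = len(frames), len(frames[0])
    k1 = num_gaussian_params(d)
    bic_1 = full_cov_max_log_likelihood(frames) - 0.5 * lam * k1 * math.log(n)
    bic_2 = (full_cov_max_log_likelihood(frames[:i])
             + full_cov_max_log_likelihood(frames[i:])
             - 0.5 * lam * 2 * k1 * math.log(n))
    return bic_1 - bic_2


def draw(mean, count):
    """Synthetic frames with unit-variance, independent components."""
    return [[random.gauss(m, 1.0) for m in mean] for _ in range(count)]


random.seed(0)
same = draw([0.0, 0.0], 100)                              # one source
changed = draw([0.0, 0.0], 50) + draw([5.0, 5.0], 50)     # change at frame 50

delta_same = segment_delta_bic(same, 50)
delta_changed = segment_delta_bic(changed, 50)
print(delta_same, delta_changed)
```

Under the sign convention of Section 2, a negative value favors the two-Gaussian model $M_2$, i.e., a boundary at frame $i$; the window with the mean shift should therefore score well below the homogeneous one. For the 24-dimensional features used here, `num_gaussian_params(24)` gives $k_1 = 24 + 300 = 324$.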
It is straightforward to show [1] that the $i$-th frame is a
good candidate for a segment boundary if the expression:

    $\Delta\mathrm{BIC}_i = -\frac{n}{2} \log|\Sigma_w| + \frac{i}{2} \log|\Sigma_f| + \frac{n-i}{2} \log|\Sigma_s| + \frac{1}{2} \lambda \left( d + \frac{d(d+1)}{2} \right) \log n$