
SI-NET: MULTI-SCALE CONTEXT-AWARE CONVOLUTIONAL BLOCK FOR SPEAKER
VERIFICATION
Zhuo Li 1,2, Ce Fang 1,2, Runqiu Xiao 1,2, Wenchao Wang 1,2,†, Yonghong Yan 1,2,3
1 Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
3 Xinjiang Key Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
† Corresponding author.
ABSTRACT
Adequately utilizing multi-scale information is essential for building a high-performance speaker verification (SV) system. Biological research shows that the human auditory system employs a multi-timescale processing mode to extract information and has a mechanism for integrating multi-scale information to encode sound. Inspired by this, we propose a novel block, named Split-Integration (SI), to explore multi-scale context-aware feature learning at a granular level for speaker verification. Our model involves a pair of operations: (i) multi-scale split, which imitates the multi-timescale processing mode by extracting multi-scale features through grouping and stacking filters of different sizes, and (ii) dynamic integration, which mirrors the fusion mechanism by introducing KL divergence to measure the complementarity between multi-scale features, so that the model fully integrates them and produces a more speaker-discriminative representation. Experiments are conducted on the VoxCeleb and Speakers in the Wild (SITW) datasets. Results demonstrate that our approach achieves a relative 10%-20% improvement in equal error rate (EER) over a strong baseline on the SV task.
Index Terms— speaker verification, Split-Integration, multi-scale features, dynamic integration, granular-level feature learning
1. INTRODUCTION
End-to-end speaker verification systems have emerged in recent years and achieved state-of-the-art performance. The d-vector [1], which trains a DNN-based model to distinguish speakers with a classification loss at the frame level, opened the door for end-to-end SV systems. Nevertheless, using frame-level labels in the training stage and simply averaging the frame-level outputs in the test stage ignores the dependencies between frames. Thus, the x-vector [2], one of the most popular systems, introduces the Time Delay Neural Network (TDNN) and a statistics pooling layer to model the dependencies of contiguous frames. With the rising popularity of the x-vector system, various efforts have been devoted to the topology of the network to enhance the deep representation ability for SV. The popular ResNet [3] architecture has been introduced into SV tasks [4, 5, 6, 7] to generate more abstract and informative embeddings.
Most of the above systems use only a single timescale in each frame-level layer and focus on learning local patterns. Capturing multi-scale speaker information is beneficial for further improvement. Recently, a biological study [8] showed that the human auditory system employs (at least) a two-timescale processing mode to track acoustic dynamics and has a mechanism for fusing multi-timescale information to encode sound. In [9], Wang showed that a two-pathway neural network can encode complementary multi-timescale information into utterance-level embeddings, but that work does not explore the effect at the frame level. The study in [10] also shows that multi-scale filters with different receptive fields in each layer can extract multi-scale information and achieve better performance.
Recently, a novel network, Res2Net [11], was proposed. Different from [9, 10], Res2Net strengthens its ability to extract multi-scale features at a granular level by splitting filters into several groups and adding the output of the previous group of filters to the input of the current group. Our prior work in FFSVC2020 [12, 13] and the study in [14] show that Res2Net yields substantial improvement in the SV task.
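To make this split mechanism concrete, the following is a minimal PyTorch sketch of a Res2Net-style granular split; the class and parameter names are our own illustration rather than the authors' released code, and batch normalization and activations are omitted for brevity.

import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    """Granular multi-scale convolution in the style of Res2Net [11].

    The channel dimension is split into `scale` groups; each convolved
    group receives the previous group's output as an extra input, so
    later groups see progressively larger receptive fields.
    """

    def __init__(self, channels: int, scale: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.width = channels // scale
        # One 1-D convolution per group except the first, which is kept as-is.
        self.convs = nn.ModuleList([
            nn.Conv1d(self.width, self.width, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(scale - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        splits = torch.split(x, self.width, dim=1)
        outputs = [splits[0]]  # the first group is passed through unchanged
        prev = None
        for conv, x_i in zip(self.convs, splits[1:]):
            # Add the previous group's output to the current group's input
            # before convolving (the first convolved group has no predecessor).
            prev = conv(x_i if prev is None else x_i + prev)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)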
Inspired by the biological research and the architecture of Res2Net, we explore multi-scale context-aware feature learning at a more granular level by introducing a novel block, named "split-integration", as shown in Figure 1. The "split-integration" convolution block consists of two operations: multi-scale split and dynamic integration. Biological research [15] suggests that the human auditory system derives the appropriate perceptual representations by extracting rapidly varying information on a scale of milliseconds and analyzing more slowly varying signal attributes on a scale of hundreds of milliseconds.
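As a rough illustration of the dynamic-integration idea, the sketch below weights each scale branch by its KL divergence from the other branches, treating larger divergence as a sign of more complementary information; this is our own simplified reading of the description above, not the exact SI-Net formulation, and the function name and pooling choices are hypothetical.

import torch
import torch.nn.functional as F

def kl_weighted_fusion(features):
    # features: a list of tensors, one per scale, each (batch, channels, frames).
    # Summarize each branch as a channel distribution via global average
    # pooling followed by a softmax.
    dists = [F.softmax(f.mean(dim=-1), dim=-1) for f in features]

    weights = []
    for i, p in enumerate(dists):
        # Sum of KL(p || q) against every other branch: a branch that diverges
        # more from the rest is assumed to carry more complementary information.
        # clamp_min guards the log against numerically zero probabilities.
        div = sum(
            F.kl_div(q.clamp_min(1e-8).log(), p, reduction="none").sum(dim=-1)
            for j, q in enumerate(dists) if j != i
        )
        weights.append(div)
    weights = F.softmax(torch.stack(weights, dim=-1), dim=-1)  # (batch, scales)

    # Fuse the branches with the divergence-derived weights.
    stacked = torch.stack(features, dim=-1)  # (batch, channels, frames, scales)
    return (stacked * weights[:, None, None, :]).sum(dim=-1)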