VIRTUAL ADVERSARIAL TRAINING FOR DS-CNN BASED SMALL-FOOTPRINT
KEYWORD SPOTTING
Xiong Wang¹∗, Sining Sun¹,²∗, Lei Xie¹†
¹ Audio, Speech and Language Processing Group, School of Computer Science, Northwestern Polytechnical University, Xi’an, China
² Tencent, China
ABSTRACT
Serving as the trigger of a voice-enabled user interface, an
on-device keyword spotting (KWS) model has to be extremely compact,
efficient and accurate. In this paper, we adopt a depth-wise
separable convolutional neural network (DS-CNN) as our
small-footprint KWS model, which is highly competitive to
these ends. However, a recent study has shown that a compact
KWS system is very vulnerable to small adversarial perturbations,
while augmenting the training data with specifically
generated adversarial examples can improve performance. In
this paper, we further improve KWS performance through a
virtual adversarial training (VAT) solution. Instead of using
adversarial examples for data augmentation, we propose to
train a DS-CNN KWS model using adversarial regularization,
which aims to smooth the model’s distribution and thus to
improve robustness, by explicitly introducing a distribution
smoothness measure into the loss function. Experiments on
a KWS corpus collected with a circular microphone array in a
far-field scenario show that the VAT approach brings a 31.9%
relative false rejection rate (FRR) reduction compared to
normal training with cross-entropy loss, and it also
surpasses the adversarial-example-based data augmentation
approach with a 10.3% relative FRR reduction.
Index Terms: depthwise separable convolutional neural net-
work, DS-CNN, KWS, virtual adversarial training
1. INTRODUCTION
With the exponential growth of mobile and intelligent de-
vices, such as smart speakers, voice-enabled user interfaces
play an increasingly crucial role in achieving natural user
experiences. Such voice interfaces are usually triggered by an
on-device keyword spotting (KWS) module that always stands
by and listens for the wake/trigger word(s). With limited
on-device memory and computational capabilities, the KWS
module has to be deployed as a small-footprint algorithm
with real-time response. Meanwhile, as the first step
before speech interaction, accurate on-device detection, with
low false rejection rate (FRR) and false alarm rate (FAR), is
crucially important for customer experience, especially for
always-on devices deployed in complicated real-world
acoustic environments with noise and reverberation.
∗ The first two authors contributed equally to this work. This research
work is supported by the National Natural Science Foundation of China
(No. 61571363).
† Lei Xie is the corresponding author.
There is a rich literature on keyword
spotting from audio. Earlier heavyweight KWS approaches,
which rely on a large vocabulary continuous speech recognizer
(LVCSR) [1, 2, 3] and do not take latency and computation cost
into account, are clearly not suitable for on-device
deployment. In the past decade, hidden Markov model (HMM)
based keyword/filler approaches [4, 5, 6, 7] have been very
popular for online low-latency and computation-constrained
KWS. Under such an HMM framework, with the recent
renaissance of neural networks, Gaussian mixture model (GMM)
based acoustic models have been replaced by deep neural
networks (DNNs). The HMM based approaches, though very
compact and competitive, still need Viterbi search on the
HMM graph. Alternatively, some recent systems were solely
based on a single DNN without the use of HMM topology and
Viterbi decoding. In such a small-footprint system, a compact
DNN is trained to predict the posteriors of (sub-)keyword and
filler units and a simple post-processing module produces
a confidence score for keyword/non-keyword decision [8].
Following this line, various types of neural networks,
including recurrent and convolutional structures with better
contextual modeling ability, have been intensively explored
recently [9, 10, 11, 12, 13]. Among them, a depth-wise separable
convolutional neural network (DS-CNN) [14] approach
has become highly competitive, outperforming other models
in all three aspects of accuracy, model size and operation time.
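The posterior-plus-post-processing scheme described in this paragraph can be illustrated with a short sketch. It assumes one common recipe: frame-level posteriors are smoothed over a short window, and the confidence at each frame is the geometric mean of the peak smoothed posteriors of the keyword units within a longer sliding window. The function names, the window sizes `w_smooth` and `w_max`, and the geometric-mean scoring rule are assumptions made for illustration and are not taken from this paper.

```python
import numpy as np


def smooth_posteriors(post, w_smooth):
    """Average each frame's posteriors over the last w_smooth frames (inclusive)."""
    num_frames = post.shape[0]
    out = np.empty_like(post, dtype=float)
    for t in range(num_frames):
        start = max(0, t - w_smooth + 1)
        out[t] = post[start:t + 1].mean(axis=0)
    return out


def confidence_scores(post, keyword_units, w_smooth=30, w_max=100):
    """Per-frame keyword confidence from frame-level posteriors.

    post:          array of shape (num_frames, num_units), rows are posteriors
    keyword_units: column indices of the (sub-)keyword units (fillers excluded)
    w_smooth:      smoothing window in frames (hypothetical default)
    w_max:         sliding window in frames for peak picking (hypothetical default)
    """
    smoothed = smooth_posteriors(post[:, keyword_units], w_smooth)
    num_frames, num_kw = smoothed.shape
    conf = np.empty(num_frames)
    for t in range(num_frames):
        start = max(0, t - w_max + 1)
        # Peak smoothed posterior of each keyword unit inside the window.
        peak = smoothed[start:t + 1].max(axis=0)
        # Geometric mean over keyword units: high only if every unit fired.
        conf[t] = np.prod(peak) ** (1.0 / num_kw)
    return conf
```

A keyword would then be declared whenever the confidence exceeds a threshold chosen on held-out data; moving that threshold trades false alarms against false rejections.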
Even with an efficient small-footprint solution, deploying a
highly accurate KWS system in real applications is still very
tricky. False alarms and false rejections are unavoidable,
although feeding more positive and negative examples into
model training is quite useful for suppressing these errors. More
frustratingly, in a real system, such as the voice trigger on
a home smart speaker, the false-alarmed and false-rejected
queries are extremely hard to reproduce. This is mainly due
to 1) the complicated time-varying acoustic environments
and 2) the subtle changes in speech timbre when the same
speaker utters the same trigger word at different times. A
recent study [15] has treated these unpredictable false alarm