Trainable Frontend For Robust and Far-Field Keyword Spotting
Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous
Google, Mountain View, USA
{yxwang,getreuer,thadh,dicklyon,rif}@google.com
Abstract
Robust and far-field speech recognition is critical to enable
true hands-free communication. In far-field conditions, sig-
nals are attenuated due to distance. To improve robustness to
loudness variation, we introduce a novel frontend called per-
channel energy normalization (PCEN). The key ingredient of
PCEN is the use of an automatic gain control based dynamic
compression to replace the widely used static (such as log or
root) compression. We evaluate PCEN on the keyword spotting task. On our large re-recorded noisy and far-field evaluation sets,
we show that PCEN significantly improves recognition perfor-
mance. Furthermore, we model PCEN as neural network layers
and optimize high-dimensional PCEN parameters jointly with
the keyword spotting acoustic model. The trained PCEN fron-
tend demonstrates significant further improvements without in-
creasing model complexity or inference-time cost.
Index Terms: Keyword spotting, robust and far-field speech
recognition, automatic gain control, deep neural networks
1. Introduction
Speech has become a prevailing interface to enable human-
computer interaction, especially on mobile devices. An im-
portant component of such an interface is the keyword spotting
(KWS) system [1]. For example, KWS is often used to wake up
mobile devices or to initiate conversational assistants. There-
fore, reliably recognizing keywords, regardless of the acous-
tic environment, is often a prerequisite for effective interaction
with speech-enabled products.
Thanks to the development of deep neural networks (DNN),
automatic speech recognition (ASR) has dramatically improved
in the past few years [2]. However, while current ASR sys-
tems perform well in relatively clean conditions, their robust-
ness remains a major challenge. This also applies to the KWS
system [3]. The system needs to be robust not only to various kinds of noise interference but also to varying loudness (or sound level). The capability to handle loudness variation is important
because it allows users to talk to devices from a wide range of
distances, enabling true hands-free interaction.
Similar to modern ASR systems, our KWS system is also
neural network based [1, 4]. However, being a resource-limited
embedded system, the keyword recognizer has its own con-
straints. Most importantly, it is expected to run on-device while always listening, which demands a small memory footprint and
low power consumption. Therefore, the size of the neural net-
work needs to be much smaller than those used in modern ASR
systems [4, 5], implying a network with a limited representa-
tion power. In addition, on-device keyword spotting typically
does not use sophisticated decoding schemes or language mod-
els [1]. These constraints motivate us to rethink the design of
the feature-extraction frontend.
In DNN-based acoustic modeling, perhaps the most widely
used frontend is the so-called log-mel frontend, consisting of
mel-filterbank energy extraction followed by log compression,
where the log compression is used to reduce the dynamic range
of filterbank energy. However, there are several issues with
the log function. First, the log has a singularity at 0. Common methods to deal with the singularity are to use either a clipped log, i.e. log(max(offset, x)), or a stabilized log, i.e. log(x + offset). However, the choice of the offset in both methods is ad hoc and may impact performance differently on different signals. Second, the log function uses much of its dynamic range on low-level signals, such as silence, which are likely the least informative parts of the signal. Third, the log function is
loudness dependent. With different loudness, the log function
can produce different feature values even when the underlying
signal content (e.g. keywords) is the same, which introduces an-
other factor of variation into training and inference. Although
techniques such as mean–variance normalization [6] and cep-
stral mean normalization [7] can be used to alleviate this issue
to some extent, it is nontrivial to deal with time-varying loud-
ness in an online fashion.
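As a concrete illustration (not part of the original paper), the following Python sketch contrasts the two offset-based workarounds and demonstrates the loudness dependence of log features; the offset value and test energies are arbitrary assumptions chosen for demonstration.

```python
import numpy as np

def clipped_log(x, offset=1e-6):
    # log(max(offset, x)): energies below `offset` are flattened to log(offset).
    return np.log(np.maximum(offset, x))

def stabilized_log(x, offset=1e-6):
    # log(x + offset): small energies are compressed toward log(offset).
    return np.log(x + offset)

# Loudness dependence: scaling a waveform by a gain g scales filterbank
# energies by roughly g**2, so log features shift by the additive constant
# 2*log(g) instead of staying invariant to level.
energy = np.array([1e-8, 1e-4, 1e-2, 1.0])  # hypothetical filterbank energies
print(stabilized_log(energy))        # original level
print(stabilized_log(1e4 * energy))  # same content, louder: features shift
```

Mean normalization can remove such a constant shift when the gain is fixed, but, as noted above, handling a time-varying gain online is much harder.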
To remedy the above issues of log compression, we intro-
duce a new frontend called per-channel energy normalization
(PCEN). Essentially, PCEN implements a simple feed-forward
automatic gain control (AGC) [8, 9], which dynamically sta-
bilizes signal levels. Since all the PCEN operations are dif-
ferentiable, we further propose to implement PCEN as neural
network operations/layers and jointly optimize various PCEN
parameters with the KWS acoustic model. Equipped with this
trainable AGC-based frontend, the resulting KWS system is
found to be more robust to distant speech.
The rest of the paper is organized as follows. In Section 2,
we introduce and describe the PCEN frontend. In Section 3,
we formulate PCEN as neural network layers and discuss the
advantages of the new formulation. In Section 4, we present
experimental results. The last section discusses the findings and concludes the paper.
2. Per-Channel Energy Normalization
In this section, we introduce the PCEN frontend as an alter-
native to the log-mel frontend. The key difference is that PCEN replaces the static log (or root) compression with the dynamic compression described below:
$$\mathrm{PCEN}(t, f) = \left( \frac{E(t, f)}{\left( \epsilon + M(t, f) \right)^{\alpha}} + \delta \right)^{r} - \delta^{r}, \qquad (1)$$
where t and f denote the time and frequency indices, E(t, f) denotes the filterbank energy in each time-frequency (T-F) bin, and ε is a small constant that prevents division by zero. Al-
though there is no restriction on what filterbank to use, in this
paper we use an FFT-based mel filterbank for fair comparison
with the log-mel frontend. M(t, f) is a smoothed version of the filterbank energy E(t, f), computed using a first-order infinite impulse response (IIR) filter: M(t, f) = (1 − s) M(t − 1, f) + s E(t, f), where s is the smoothing coefficient.
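For concreteness, here is a minimal NumPy sketch of Eq. (1) combined with the IIR smoother; the parameter values and the initialization of the smoother are assumptions made for illustration.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """PCEN per Eq. (1); E holds filterbank energies, shape (frames, channels).

    The default parameter values here are illustrative assumptions.
    """
    M = np.empty_like(E)
    M[0] = E[0]  # one reasonable initialization of the smoother (an assumption)
    for t in range(1, len(E)):
        # First-order IIR smoothing across time, independently per channel.
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # AGC stage: divide by the smoothed energy raised to alpha, then apply
    # a stabilized root compression with offset delta and exponent r.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Example: 100 frames of 40-channel mel filterbank energies (synthetic).
E = np.abs(np.random.randn(100, 40)) ** 2
features = pcen(E)
print(features.shape)  # (100, 40)
```

Note that the division by (ε + M(t, f))^α performs the gain normalization: because M tracks the recent signal level in each channel, a constant gain applied to the input largely cancels out of the ratio.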