Trainable Frontend For Robust and Far-Field Keyword Spotting
Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous
Google, Mountain View, USA
{yxwang,getreuer,thadh,dicklyon,rif}@google.com
Abstract
Robust and far-field speech recognition is critical to enable
true hands-free communication. In far-field conditions, sig-
nals are attenuated due to distance. To improve robustness to
loudness variation, we introduce a novel frontend called per-
channel energy normalization (PCEN). The key ingredient of
PCEN is the use of an automatic gain control based dynamic
compression to replace the widely used static (such as log or
root) compression. We evaluate PCEN on the keyword spotting task. On our large re-recorded noisy and far-field evaluation sets,
we show that PCEN significantly improves recognition perfor-
mance. Furthermore, we model PCEN as neural network layers
and optimize high-dimensional PCEN parameters jointly with
the keyword spotting acoustic model. The trained PCEN fron-
tend demonstrates significant further improvements without in-
creasing model complexity or inference-time cost.
Index Terms: Keyword spotting, robust and far-field speech
recognition, automatic gain control, deep neural networks
1. Introduction
Speech has become a prevailing interface to enable human-
computer interaction, especially on mobile devices. An im-
portant component of such an interface is the keyword spotting
(KWS) system [1]. For example, KWS is often used to wake up
mobile devices or to initiate conversational assistants. There-
fore, reliably recognizing keywords, regardless of the acous-
tic environment, is often a prerequisite for effective interaction
with speech-enabled products.
Thanks to the development of deep neural networks (DNN),
automatic speech recognition (ASR) has dramatically improved
in the past few years [2]. However, while current ASR sys-
tems perform well in relatively clean conditions, their robust-
ness remains a major challenge. This also applies to the KWS
system [3]. The system needs to be robust not only to various kinds of noise interference but also to varying loudness (or sound level). The capability to handle loudness variation is important
because it allows users to talk to devices from a wide range of
distances, enabling true hands-free interaction.
Similar to modern ASR systems, our KWS system is also
neural network based [1, 4]. However, being a resource-limited
embedded system, the keyword recognizer has its own con-
straints. Most importantly, it is expected to run on-device while always listening, which demands a small memory footprint and
low power consumption. Therefore, the size of the neural net-
work needs to be much smaller than those used in modern ASR
systems [4, 5], implying a network with a limited representa-
tion power. In addition, on-device keyword spotting typically
does not use sophisticated decoding schemes or language mod-
els [1]. These constraints motivate us to rethink the design of
the feature-extraction frontend.
In DNN-based acoustic modeling, perhaps the most widely
used frontend is the so-called log-mel frontend, consisting of
mel-filterbank energy extraction followed by log compression,
where the log compression is used to reduce the dynamic range
of filterbank energy. However, there are several issues with
the log function. First, the log has a singularity at 0. Common methods to deal with the singularity are to use either a clipped log, i.e. log(max(offset, x)), or a stabilized log, i.e. log(x + offset). However, the choice of the offset in both methods is ad hoc and may impact performance differently on different signals. Second, the log function uses much of its dynamic range on low-level signals, such as silence, which are likely the least informative parts of the signal. Third, the log function is
loudness dependent. With different loudness, the log function
can produce different feature values even when the underlying
signal content (e.g. keywords) is the same, which introduces an-
other factor of variation into training and inference. Although
techniques such as mean–variance normalization [6] and cep-
stral mean normalization [7] can be used to alleviate this issue
to some extent, it is nontrivial to deal with time-varying loud-
ness in an online fashion.
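As a concrete illustration (not part of the original paper), the following Python sketch contrasts the two offset-based workarounds and demonstrates the loudness dependence of log features; the offset value and test energies are arbitrary assumptions chosen for demonstration.

```python
import numpy as np

def clipped_log(x, offset=1e-6):
    # log(max(offset, x)): energies below `offset` are flattened to log(offset).
    return np.log(np.maximum(offset, x))

def stabilized_log(x, offset=1e-6):
    # log(x + offset): small energies are compressed toward log(offset).
    return np.log(x + offset)

# Loudness dependence: scaling a waveform by a gain g scales filterbank
# energies by roughly g**2, so log features shift by the additive constant
# 2*log(g) instead of staying invariant to level.
energy = np.array([1e-8, 1e-4, 1e-2, 1.0])  # hypothetical filterbank energies
print(stabilized_log(energy))        # original level
print(stabilized_log(1e4 * energy))  # same content, louder: features shift
```

Mean normalization can remove such a constant shift when the gain is fixed, but, as noted above, handling a time-varying gain online is much harder.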
To remedy the above issues of log compression, we intro-
duce a new frontend called per-channel energy normalization
(PCEN). Essentially, PCEN implements a simple feed-forward
automatic gain control (AGC) [8, 9], which dynamically sta-
bilizes signal levels. Since all the PCEN operations are dif-
ferentiable, we further propose to implement PCEN as neural
network operations/layers and jointly optimize various PCEN
parameters with the KWS acoustic model. Equipped with this
trainable AGC-based frontend, the resulting KWS system is
found to be more robust to distant speech.
The rest of the paper is organized as follows. In Section 2,
we introduce and describe the PCEN frontend. In Section 3,
we formulate PCEN as neural network layers and discuss the
advantages of the new formulation. In Section 4, we present
experimental results. The last section discusses the findings and concludes the paper.
2. Per-Channel Energy Normalization
In this section, we introduce the PCEN frontend as an alter-
native to the log-mel frontend. The key difference is that PCEN replaces the static log (or root) compression with the dynamic compression described below:
$$\mathrm{PCEN}(t, f) = \left( \frac{E(t, f)}{\left( \epsilon + M(t, f) \right)^{\alpha}} + \delta \right)^{r} - \delta^{r}, \qquad (1)$$
where t and f denote the time and frequency indices, E(t, f) denotes the filterbank energy in each time-frequency (T-F) bin, and ε is a small constant that prevents division by zero. Al-
though there is no restriction on what filterbank to use, in this
paper we use an FFT-based mel filterbank for fair comparison
with the log-mel frontend. M(t, f) is a smoothed version of the filterbank energy E(t, f), computed using a first-order infinite impulse response (IIR) filter: M(t, f) = (1 − s) M(t − 1, f) + s E(t, f), where s is the smoothing coefficient.
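For concreteness, here is a minimal NumPy sketch of Eq. (1) combined with the IIR smoother; the parameter values and the initialization of the smoother are assumptions made for illustration.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """PCEN per Eq. (1); E holds filterbank energies, shape (frames, channels).

    The default parameter values here are illustrative assumptions.
    """
    M = np.empty_like(E)
    M[0] = E[0]  # one reasonable initialization of the smoother (an assumption)
    for t in range(1, len(E)):
        # First-order IIR smoothing across time, independently per channel.
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    # AGC stage: divide by the smoothed energy raised to alpha, then apply
    # a stabilized root compression with offset delta and exponent r.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Example: 100 frames of 40-channel mel filterbank energies (synthetic).
E = np.abs(np.random.randn(100, 40)) ** 2
features = pcen(E)
print(features.shape)  # (100, 40)
```

Note that the division by (ε + M(t, f))^α performs the gain normalization: because M tracks the recent signal level in each channel, a constant gain applied to the input largely cancels out of the ratio.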