Small-Footprint DS-CNN Keyword Spotting Improved by Virtual Adversarial Training


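This paper examines how to improve small-footprint keyword spotting (KWS) models for voice-enabled user interfaces, focusing on the depth-wise separable convolutional neural network (DS-CNN). An on-device KWS model must be extremely compact, efficient and accurate, and the DS-CNN structure is well suited to these requirements. However, recent work has shown that compact KWS systems are vulnerable to small adversarial perturbations, which can undermine system stability. To improve robustness, the authors draw on virtual adversarial training (VAT). Instead of the usual data augmentation approach, which expands the training set with specifically generated adversarial examples, VAT introduces adversarial regularization: an extra term is added to the loss when training the DS-CNN KWS model. This term makes the model's decision boundary smoother, so that slightly perturbed inputs are still classified correctly. The main contribution is an effective method for making small KWS models more robust to perturbations, which matters for the stability and security of voice assistants and similar devices in real-world use. The resulting DS-CNN KWS model remains efficient and accurate while becoming more resistant to potential attacks, a useful direction for speech recognition on resource-constrained devices.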

VIRTUAL ADVERSARIAL TRAINING FOR DS-CNN BASED SMALL-FOOTPRINT KEYWORD SPOTTING

Xiong Wang¹∗, Sining Sun¹,²∗, Lei Xie¹†

¹Audio, Speech and Language Processing Group, School of Computer Science, Northwestern Polytechnical University, Xi’an, China
²Tencent, China

ABSTRACT

Serving as the trigger of a voice-enabled user interface, an on-device keyword spotting (KWS) model has to be extremely compact, efficient and accurate. In this paper, we adopt a depth-wise separable convolutional neural network (DS-CNN) as our small-footprint KWS model, which is highly competitive to these ends. However, a recent study has shown that a compact KWS system is very vulnerable to small adversarial perturbations, while augmenting the training data with specifically generated adversarial examples can improve performance. In this paper, we further improve KWS performance through a virtual adversarial training (VAT) solution. Instead of using adversarial examples for data augmentation, we propose to train a DS-CNN KWS model using adversarial regularization, which aims to smooth the model’s distribution, and thus to improve robustness, by explicitly introducing a distribution smoothness measure into the loss function. Experiments on a KWS corpus collected with a circular microphone array in a far-field scenario show that the VAT approach brings a 31.9% relative false rejection rate (FRR) reduction compared to the normal training approach with cross entropy loss, and it also surpasses the adversarial example based data augmentation approach with a 10.3% relative FRR reduction.
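The adversarial regularization described in the abstract can be illustrated with a small sketch. The example below follows the general VAT recipe (a KL-divergence smoothness term, evaluated at an approximately worst-case perturbation, added to the cross entropy loss); it is not the authors' implementation. A toy linear-softmax classifier stands in for the DS-CNN, and the names `predict`, `virtual_adversarial_perturbation` and `vat_loss`, as well as the values of `eps`, `xi` and `alpha`, are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, x):
    # Toy linear-softmax classifier standing in for the DS-CNN (assumption).
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])


def kl_divergence(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def unit_vector(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0.0 else list(v)


def virtual_adversarial_perturbation(W, x, eps=0.5, xi=1e-3, seed=0):
    """One power-iteration step towards the direction that changes the output most."""
    rng = random.Random(seed)
    p = predict(W, x)  # current prediction, treated as a constant target
    d = unit_vector([rng.gauss(0.0, 1.0) for _ in x])  # random start direction
    q = predict(W, [a + xi * b for a, b in zip(x, d)])
    # For a linear-softmax model, the gradient of KL(p || q) with respect to
    # the perturbation is W^T (q - p).
    grad = [sum(W[k][j] * (q[k] - p[k]) for k in range(len(W)))
            for j in range(len(x))]
    return [eps * g for g in unit_vector(grad)]


def vat_loss(W, x, label, eps=0.5, alpha=1.0):
    """Cross entropy plus alpha times the distribution smoothness term."""
    p = predict(W, x)
    cross_entropy = -math.log(p[label])
    r_adv = virtual_adversarial_perturbation(W, x, eps=eps)
    smoothness = kl_divergence(p, predict(W, [a + b for a, b in zip(x, r_adv)]))
    return cross_entropy, smoothness, cross_entropy + alpha * smoothness


# Hypothetical 3-class model with 4-dimensional input features.
W = [[0.5, -0.2, 0.1, 0.0],
     [-0.3, 0.8, 0.0, 0.2],
     [0.1, 0.1, -0.6, 0.4]]
x = [1.0, 0.5, -0.5, 2.0]
ce, lds, total = vat_loss(W, x, label=1)
print(f"cross entropy={ce:.4f}  smoothness term={lds:.4f}  total={total:.4f}")
```

The perturbation is computed from the model's own prediction rather than from the label, which is why the smoothness term can be evaluated without adversarial examples being added to the training set.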

Index Terms: depthwise separable convolutional neural network, DS-CNN, KWS, virtual adversarial training

∗ The first two authors contributed equally to this work. This research work is supported by the National Natural Science Foundation of China (No. 61571363).

† Lei Xie is the corresponding author.

1. INTRODUCTION

With the exponential growth of mobile and intelligent devices, such as smart speakers, voice-enabled user interfaces play an increasingly crucial role in achieving natural user experiences. Such voice interfaces are usually triggered by an on-device keyword spotting (KWS) module that always stands by and listens for the wake/trigger word(s). With limited on-device memory and computational capabilities, the KWS module has to be deployed as a small-footprint algorithm with real-time response. Meanwhile, as the first step before speech interaction, accurate on-device detection, with a low false rejection rate (FRR) and a low false alarm rate (FAR), is crucially important for the customer experience, especially for always-on devices deployed in complicated real-world acoustic environments with noise and reverberation.
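The two error measures named above can be made concrete with a short sketch. It assumes that FRR is the fraction of true keyword utterances the detector misses and that FAR is the fraction of non-keyword segments that trigger it; other conventions exist (false alarms per hour of audio is also common), and the excerpt does not say which one the authors use. All counts and rates below are hypothetical, as are the function names.

```python
def false_rejection_rate(detections_on_keywords):
    """Fraction of true keyword utterances that were NOT detected."""
    missed = sum(1 for detected in detections_on_keywords if not detected)
    return missed / len(detections_on_keywords)


def false_alarm_rate(detections_on_non_keywords):
    """Fraction of non-keyword segments that wrongly triggered the detector."""
    alarms = sum(1 for detected in detections_on_non_keywords if detected)
    return alarms / len(detections_on_non_keywords)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Hypothetical detector decisions: True means "keyword detected".
on_keywords = [True, True, False, True, True, True, True, False, True, True]
on_non_keywords = [False] * 48 + [True] * 2

print("FRR:", false_rejection_rate(on_keywords))
print("FAR:", false_alarm_rate(on_non_keywords))

# Hypothetical FRRs for two systems compared at the same false alarm level.
print("relative FRR reduction:", relative_reduction(0.10, 0.07))
```

A relative reduction is a ratio of error rates, so a figure such as "31.9% relative FRR reduction" says how much of the baseline's FRR was removed, not how many percentage points the FRR dropped.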

There is a rich literature on keyword spotting from audio. Earlier heavyweight KWS approaches rely on a large vocabulary continuous speech recognizer (LVCSR) [1, 2, 3] and do not treat latency and computation as concerns, so they are not suitable for on-device deployment. In the past decade, hidden Markov model (HMM) based keyword/filler approaches [4, 5, 6, 7] have been very popular for online, low-latency and computation-constrained KWS. Under such an HMM framework, with the recent renaissance of neural networks, Gaussian mixture model (GMM) based acoustic models have been replaced by deep neural networks (DNNs). The HMM based approaches, though very compact and competitive, still need a Viterbi search on the HMM graph. Alternatively, some recent systems are based solely on a single DNN, without the HMM topology or Viterbi decoding. In such a small-footprint system, a compact DNN is trained to predict the posteriors of (sub-)keyword and filler units, and a simple post-processing module produces a confidence score for the keyword/non-keyword decision [8]. Following this line, various types of neural networks, including recurrent and convolutional structures with better contextual modeling ability, have been intensively explored recently [9, 10, 11, 12, 13]. Among them, a depth-wise separable convolutional neural network (DS-CNN) [14] approach has become highly competitive, outperforming other models in accuracy, model size and operation time.
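The model-size advantage of depth-wise separable convolution can be seen from a simple parameter count. The sketch below compares a standard convolution layer with its depth-wise separable counterpart (one spatial filter per input channel, followed by a 1x1 point-wise convolution). The layer dimensions are hypothetical and are not taken from the paper, and bias terms are ignored.

```python
def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weights in a standard convolution: one k_h x k_w x c_in filter per output channel."""
    return k_h * k_w * c_in * c_out


def ds_conv_params(k_h, k_w, c_in, c_out):
    """Weights in a depth-wise separable convolution."""
    depthwise = k_h * k_w * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels.
standard = standard_conv_params(3, 3, 64, 64)
separable = ds_conv_params(3, 3, 64, 64)
print("standard conv weights: ", standard)
print("separable conv weights:", separable)
print("reduction factor: %.2f" % (standard / separable))
```

For a fixed kernel size, the saving grows with the number of output channels, since the spatial filtering cost no longer multiplies with `c_out`.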

Even with an efficient small-footprint solution, deploying a highly accurate KWS system in real applications is still very tricky. False alarms and false rejections are unavoidable, although feeding more positive and negative examples into model training is quite useful for suppressing these errors. More frustratingly, in a real system, such as the voice trigger on a home smart speaker, the false-alarmed and false-rejected queries are extremely hard to reproduce. This is mainly due to 1) the complicated time-varying acoustic environments and 2) the subtle change in speech timbre when the same speaker utters the same trigger word at different times. A recent study [15] has treated these unpredictable false alarm
