Deep Neural Networks for Automatic Speech Processing: A Survey from Large Corpora to Limited Data
Vincent ROGER, Jérôme FARINAS and Julien PINQUIER
IRIT, Université de Toulouse, CNRS, Toulouse, France
Most state-of-the-art speech systems use Deep Neural Networks (DNNs). These systems require a large amount of data to be trained. Hence, training state-of-the-art frameworks on under-resourced speech languages/problems is a difficult task. One such problem is the limited amount of data available for impaired speech. Furthermore, acquiring more data and/or expertise is time-consuming and expensive. In this paper, we focus on the following speech processing tasks: Automatic Speech Recognition, speaker identification and emotion recognition. To assess the problem of limited data, we first investigate state-of-the-art Automatic Speech Recognition systems, as this represents the hardest of these tasks (due to the large variability within each language). Next, we provide an overview of techniques and tasks requiring less data. In the last section, we investigate few-shot techniques, as we interpret under-resourced speech as a few-shot problem. In that sense, we propose an overview of few-shot techniques and perspectives on using such techniques for the speech problems addressed in this survey. It appears that the reviewed techniques are not well suited to large datasets. Nevertheless, some promising results from the literature encourage the use of such techniques for speech processing.
Index Terms—Audio Processing, Deep Learning Techniques, Deep Neural Networks, Few-Shot Learning, Speech Analysis, Under-Resourced Languages.
I. INTRODUCTION
AUTOMATIC speech processing systems have improved drastically over the past few years, especially Automatic Speech Recognition (ASR) systems. The same holds for other speech processing tasks such as speaker identification, emotion classification, etc. This success was made possible by the large amount of annotated data available, combined with the extensive use of deep learning techniques and the computational capacity of modern Graphics Processing Units. Some models are already deployed for everyday use, such as the personal assistants on smartphones, connected speakers and so on.
Nevertheless, challenges remain for automatic speech processing systems. They lack robustness for large vocabularies in real-world environments: this includes noise, distance from the speaker, reverberation and other alterations. Some challenges, such as CHiME [1], provide data that let the community work on some of these problems. Ongoing research seeks to improve the generalization of modern models without having to include additional annotated data for every possible environment.
State-Of-The-Art (SOTA) techniques for most speech tasks require large datasets. Indeed, with modern DNN speech processing systems, more data usually implies better performance. The TED-LIUM 3 corpus from [2] (452 hours) provides more than twice the data of the TED-LIUM 2 dataset; accordingly, its authors obtain better results by training their model on TED-LIUM 3 than on TED-LIUM 2. This improvement in performance for ASR systems is also observed with the LibriSpeech dataset (from [3]): V. Panayotov et al. obtain better results on the Wall Street Journal (WSJ) test set by training a model on the LibriSpeech dataset (1,000 hours) than on the WSJ training set (82 hours) [3].
This phenomenon, where more data implies better performance, is also observable with the VoxCeleb 2 dataset compared to the original VoxCeleb: the authors of [4] increase the number of utterances from 100,000 to one million and the number of identities from 1,251 to 6,112 compared to the previous version of VoxCeleb. Doing so, they obtain better performance than when training their model on the previous VoxCeleb dataset.
With under-resourced languages (such as [5]) and/or tasks (e.g., pathology detection from speech signals), we lack large datasets. By under-resourced, we mean limited digital resources (limited acoustic and text corpora) and/or a lack of linguistic expertise. For a more precise definition and further details of the problem, see [6]. Non-conventional speech tasks such as disease detection from audio (Parkinson's disease, severity of ENT cancers and others) are examples of under-resourced tasks. Training Deep Neural Network models in such a context is a challenge for these under-resourced speech datasets. This is especially the case for large-vocabulary tasks. M. Moore et al. showed that recent ASR systems are not well adapted to impaired speech [7], and M. B. Mustafa et al. showed the difficulty of adapting such models with limited amounts of data [8]. Few-shot learning consists of training a model using k shots per class (where a shot is one example), with k ≥ 1 and k a low number. Training an ASR system on a new language, adapting an ASR system to pathological speech or performing speaker identification from a few examples are still complicated tasks. We think that few-shot techniques may be useful to tackle these problems.
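To make the k-shot setting concrete, the minimal Python sketch below samples an N-way, k-shot support set, as used for instance in episodic training for speaker identification. The data layout (a dictionary mapping class labels to lists of examples) and the function name are hypothetical, chosen for illustration only.

```python
import random

def sample_episode(dataset, n_way, k_shot):
    """Sample an N-way, k-shot episode: k labelled examples for
    each of n_way randomly chosen classes.

    dataset: hypothetical dict mapping a class label (e.g. a
    speaker identity) to a list of examples (e.g. utterances).
    """
    # Pick n_way classes, then k_shot examples per class.
    classes = random.sample(sorted(dataset), n_way)
    support = {label: random.sample(dataset[label], k_shot)
               for label in classes}
    return support

# e.g. a 5-way, 1-shot episode for speaker identification:
# support = sample_episode(utterances_by_speaker, n_way=5, k_shot=1)
```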
This survey focuses on how to train Deep Neural Network (DNN) models under low-resource conditions for speech data with non-overlapping mono signals. We will therefore first review SOTA ASR techniques that use a large amount of data (section II). Then, we will review techniques and speech tasks (speaker identification, emotion recognition) requiring less data than SOTA techniques (section III). We will also look into pathological speech processing for ASR using adaptation