Deep Neural Networks for Automatic Speech Processing: A Survey from Large Corpora to Limited Data
Vincent ROGER, Jérôme FARINAS and Julien PINQUIER
IRIT, Université de Toulouse, CNRS, Toulouse, France
Most state-of-the-art speech systems use Deep Neural Networks (DNNs). These systems require a large amount of data to be trained. Hence, training state-of-the-art frameworks on under-resourced speech languages/problems is a difficult task. One such problem is the limited amount of data available for impaired speech. Furthermore, acquiring more data and/or expertise is time-consuming and expensive. In this paper, we focus on the following speech processing tasks: Automatic Speech Recognition, speaker identification and emotion recognition. To assess the problem of limited data, we first investigate state-of-the-art Automatic Speech Recognition systems, as this represents the hardest of these tasks (due to the large variability within each language). Next, we provide an overview of techniques and tasks requiring less data. In the last section, we investigate few-shot techniques, as we interpret under-resourced speech as a few-shot problem. In that sense, we propose an overview of few-shot techniques and perspectives on using such techniques for the speech problems addressed in this survey. It appears that the reviewed techniques are not well suited to large datasets. Nevertheless, some promising results from the literature encourage the use of such techniques for speech processing.
Index Terms—Audio Processing, Deep Learning Techniques, Deep Neural Networks, Few-Shot Learning, Speech Analysis, Under-Resourced Languages.
I. INTRODUCTION
AUTOMATIC speech processing systems have improved drastically over the past few years, especially Automatic Speech Recognition (ASR) systems. The same holds for other speech processing tasks such as speaker identification, emotion classification, etc. This success was made possible by the large amount of annotated data available, combined with the extensive use of deep learning techniques and the computational capacity of modern Graphics Processing Units. Some models are already deployed for everyday use, such as the personal assistants on smartphones, connected speakers and so on.
Nevertheless, challenges remain for automatic speech processing systems. They lack robustness for large vocabularies in real-world environments: this includes noise, distance from the speaker, reverberation and other alterations. Some challenges, such as CHiME [1], provide data that let the community work on some of these problems. Ongoing research seeks to improve the generalization of modern models without having to include additional annotated data for every possible environment.
State-Of-The-Art (SOTA) techniques for most speech tasks require large datasets. Indeed, with modern DNN speech processing systems, more data usually implies better performance. The TED-LIUM 3 corpus from [2] (452 hours) provides more than twice the data of the TED-LIUM 2 dataset; accordingly, its authors obtain better results by training their model on TED-LIUM 3 than on TED-LIUM 2. This improvement in performance for ASR systems is also observed with the LibriSpeech dataset (from [3]): V. Panayotov et al. obtain better results on the Wall Street Journal (WSJ) test set by training a model on the LibriSpeech dataset (1,000 hours) than on the WSJ training set (82 hours) [3].
This phenomenon, where more data implies better performance, is also observable with the VoxCeleb 2 dataset compared to the original VoxCeleb: the authors of [4] increase the number of utterances from 100,000 to one million and the number of identities from 1,251 to 6,112 compared to the previous version of VoxCeleb. Doing so, they obtain better performance than when training their model on the previous VoxCeleb dataset.
With under-resourced languages (such as [5]) and/or tasks (e.g., pathology detection from speech signals), we lack large datasets. By under-resourced, we mean limited digital resources (limited acoustic and text corpora) and/or a lack of linguistic expertise. For a more precise definition and further details of the problem, see [6]. Non-conventional speech tasks such as disease detection from audio (Parkinson's disease, severity of ENT cancers and others) are examples of under-resourced tasks. Training Deep Neural Network models in such a context is a challenge for these under-resourced speech datasets. This is especially the case for large-vocabulary tasks. M. Moore et al. showed that recent ASR systems are not well adapted to impaired speech [7], and M. B. Mustafa et al. showed the difficulty of adapting such models with limited amounts of data [8]. Few-shot learning consists of training a model using k shots per class (where a shot is one example), with k ≥ 1 and k a low number. Training an ASR system on a new language, adapting an ASR system to pathological speech or performing speaker identification from a few examples are still complicated tasks. We think that few-shot techniques may be useful to tackle these problems.
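To make the k-shot setting concrete, the minimal Python sketch below samples an N-way, k-shot support set, as used for instance in episodic training for speaker identification. The data layout (a dictionary mapping class labels to lists of examples) and the function name are hypothetical, chosen for illustration only.

```python
import random

def sample_episode(dataset, n_way, k_shot):
    """Sample an N-way, k-shot episode: k labelled examples for
    each of n_way randomly chosen classes.

    dataset: hypothetical dict mapping a class label (e.g. a
    speaker identity) to a list of examples (e.g. utterances).
    """
    # Pick n_way classes, then k_shot examples per class.
    classes = random.sample(sorted(dataset), n_way)
    support = {label: random.sample(dataset[label], k_shot)
               for label in classes}
    return support

# e.g. a 5-way, 1-shot episode for speaker identification:
# support = sample_episode(utterances_by_speaker, n_way=5, k_shot=1)
```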
This survey focuses on how to train Deep Neural Network (DNN) models under low-resource conditions for speech data with non-overlapping mono signals. We will therefore first review SOTA ASR techniques that use a large amount of data (section II). Then, we will review techniques and speech tasks (speaker identification, emotion recognition) requiring less data than SOTA techniques (section III). We will also look into pathological speech processing for ASR using adaptation