A KALDI-DNN-based ASR system for Italian
Experiments on Children Speech
Piero Cosi
Istituto di Scienze e Tecnologie della Cognizione
Consiglio Nazionale delle Ricerche
Unità Organizzativa di Supporto di Padova - Italy
piero.cosi@pd.istc.cnr.it
Abstract—In this paper, the KALDI ASR engine adapted to
Italian is described and the results obtained so far in some ASR
experiments on children's speech are reported. We give a brief overview of
KALDI, describe its DNN implementation in detail, introduce
the acoustic model (AM) training procedure, and conclude by describing
some experiments on Italian children's speech together with the final
test procedures.
Keywords— DNN, Children Speech, ASR
I. INTRODUCTION
During the last few years, many different Automatic
Speech Recognition (ASR) frameworks have been developed
for research purposes and, nowadays, various open-source
ASR toolkits are available to research laboratories. A short
and probably not exhaustive list includes HTK [1], SONIC [2],
[3], SPHINX [4], [5], RWTH [6], JULIUS [7], KALDI [8], the
more recent ASR framework SIMON [9], and the relatively new
system BAVIECA [10].
Deep Neural Networks (DNNs) are the latest hot topic in
speech recognition. Since around 2010 many papers have been
published in this area, and some of the largest companies (e.g.
Google, Microsoft) are starting to use DNNs in their
production systems.
Indeed, new systems such as KALDI [8] have demonstrated that
“Deep Neural Network” (DNN) techniques [11] can be easily
incorporated to improve recognition performance in almost
all recognition tasks.
In this paper, the KALDI ASR engine adapted to Italian is
described and the results obtained so far in some ASR
experiments on children's speech are reported. We give a brief
overview of KALDI, and in particular of its DNN
implementation, introduce the acoustic model (AM) training
procedure, and conclude by describing some experiments on Italian
children's speech together with the final test procedures.
II. KALDI
As stated on its official web site
(http://KALDI.sourceforge.net), the KALDI ASR environment
is worth considering mainly for the following
simple reasons:
it’s “easy to use” (once you learn the basics, and
assuming you understand the underlying science)
it’s “easy to extend and modify”
it’s “redistributable”: unrestrictive license, community
project
if your work is effective or interesting, the KALDI team is
open to including it, together with your example scripts, in its
central repository, which brings more citations as others build on it.
In particular, although KALDI is similar in aims and scope to
HTK, and its goal is likewise to provide modern and flexible code,
written in C++, that is easy to modify and extend, the main
reasons to prefer KALDI over other toolkits include the
following features:
code-level integration with Finite State Transducers
(FSTs)
o compiling against the OpenFst toolkit (using it as a
library);
extensive linear algebra support
o including a matrix library that wraps standard
BLAS and LAPACK routines;
extensible design
o providing algorithms in the most generic form
possible; for instance, decoders are templated on an
object that provides a score indexed by a
(frame, fst-input-symbol) tuple, which means
that the decoder can work from any suitable source
of scores, such as a neural net;
open license
o the code is licensed under Apache 2.0, which is one
of the least restrictive licenses available;
complete recipes
o providing complete recipes for building
speech recognition systems that work from widely
available databases, such as those provided by
ELRA or the Linguistic Data Consortium (LDC).
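The "extensible design" point above, a decoder templated on an object that returns a score for a (frame, fst-input-symbol) pair, can be illustrated with a minimal C++ sketch. This is not KALDI's or OpenFst's actual API: the names Arc, TableScorer and BestPathCost are hypothetical, and the toy graph assumes that every arc consumes exactly one frame (no epsilon arcs). It only shows how the same search code could run unchanged on scores coming from any source, for example a neural net.

```cpp
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical arc of a toy decoding graph (not an OpenFst type):
// traversing it consumes one frame, reads `symbol`, and adds `graph_cost`.
struct Arc {
  int from;
  int to;
  int symbol;
  double graph_cost;
};

// One possible score source: a table of log-likelihoods indexed by
// (frame, symbol), e.g. filled from neural-net outputs. Any type offering
// NumFrames() and LogLikelihood(frame, symbol) could be used instead.
class TableScorer {
 public:
  explicit TableScorer(std::vector<std::vector<double>> loglikes)
      : loglikes_(std::move(loglikes)) {}
  int NumFrames() const { return static_cast<int>(loglikes_.size()); }
  double LogLikelihood(int frame, int symbol) const {
    return loglikes_[frame][symbol];
  }

 private:
  std::vector<std::vector<double>> loglikes_;
};

// Frame-synchronous Viterbi search, templated on the score source.
// Returns the lowest total cost (graph cost minus log-likelihood) of a
// path from `start` to `final_state` that consumes every frame, or
// +infinity if no such path exists.
template <typename Scorer>
double BestPathCost(const std::vector<Arc>& arcs, int num_states, int start,
                    int final_state, const Scorer& scorer) {
  assert(start >= 0 && start < num_states);
  assert(final_state >= 0 && final_state < num_states);
  const double kInf = std::numeric_limits<double>::infinity();
  std::vector<double> cur(num_states, kInf);
  cur[start] = 0.0;
  for (int t = 0; t < scorer.NumFrames(); ++t) {
    std::vector<double> next(num_states, kInf);
    for (const Arc& a : arcs) {
      if (cur[a.from] == kInf) continue;  // source state not reachable
      const double cost =
          cur[a.from] + a.graph_cost - scorer.LogLikelihood(t, a.symbol);
      if (cost < next[a.to]) next[a.to] = cost;
    }
    cur.swap(next);
  }
  return cur[final_state];
}
```

Because BestPathCost only requires the two member functions it calls, replacing TableScorer with a class that computes scores on demand from a network would need no change to the search itself, which is the design benefit the list item describes.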
It should be noted that the goal of releasing complete recipes
is an important aspect of KALDI. Since the code is publicly
available under a license that permits modifications and re-
release, this encourages people to release their code, along with