THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT
Mirco Ravanelli¹, Titouan Parcollet², Yoshua Bengio¹∗
¹Mila, Université de Montréal, ∗CIFAR Fellow
²LIA, Université d’Avignon
ABSTRACT
The availability of open-source software is playing a remarkable role
in the popularization of speech recognition and deep learning. Kaldi,
for instance, is nowadays an established framework used to develop
state-of-the-art speech recognizers. PyTorch is used to build neural
networks with the Python language and has recently spawned tremendous
interest within the machine learning community thanks to its
simplicity and flexibility.
The PyTorch-Kaldi project aims to bridge the gap between these
popular toolkits, trying to inherit the efficiency of Kaldi and the
flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface
between these toolkits: it also embeds several useful features
for developing modern speech recognizers. For instance, the code is
specifically designed to naturally plug in user-defined acoustic
models. As an alternative, users can exploit several pre-implemented
neural networks that can be customized using intuitive configuration
files. PyTorch-Kaldi supports multiple feature and label streams as
well as combinations of neural networks, enabling the use of com-
plex neural architectures. The toolkit is publicly released along with
rich documentation and is designed to work properly both locally and on
HPC clusters.
Experiments conducted on several datasets and tasks show
that PyTorch-Kaldi can effectively be used to develop modern
state-of-the-art speech recognizers.
Index Terms: speech recognition, deep learning, Kaldi, PyTorch.
1. INTRODUCTION
Over the last few years, we have witnessed a progressive improvement and
maturation of Automatic Speech Recognition (ASR) technologies
[1, 2], which have reached unprecedented performance levels and are
nowadays used by millions of users worldwide.
A key role in this technological breakthrough is being played by
deep learning [3], which contributed to surpassing previous speech
recognizers based on Gaussian Mixture Models (GMMs). Beyond
deep learning, other factors have played a role in the progress of
the field. A number of speech-related projects such as AMI [4],
DICIT [5], DIRHA [6] and speech recognition challenges such
as CHiME [7], Babel, and Aspire, have remarkably fostered the
progress in ASR. The public distribution of large datasets such
as Librispeech [8] has also played an important role in establishing
common evaluation frameworks and tasks.
Among other factors, the development of open-source soft-
ware such as HTK [9], Julius [10], CMU-Sphinx, RWTH-ASR [11],
LIA-ASR [12] and, more recently, the Kaldi toolkit [13], has further
helped popularize ASR, making both research and development of
novel ASR applications significantly easier.
Kaldi currently represents the most popular ASR toolkit. It re-
lies on finite-state transducers (FSTs) [14] and provides a set of C++
libraries for efficiently implementing state-of-the-art speech recogni-
tion systems. Moreover, the toolkit includes a large set of recipes that
cover all the most popular speech corpora. In parallel to the devel-
opment of this ASR-specific software, several general-purpose deep
learning frameworks, such as Theano [15], TensorFlow [16], and
CNTK [17], have gained popularity in the machine learning com-
munity. These toolkits offer great flexibility in neural network
design and can be used for a variety of deep learning applications.
PyTorch [18] is an emerging Python package that implements ef-
ficient GPU-based tensor computations and facilitates the design of
neural architectures thanks to its routines for automatic gradient
computation. An interesting feature of PyTorch lies in its modern
and flexible design, which naturally supports dynamic neural networks:
the computational graph is constructed on the fly at run
time rather than being statically compiled.
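To make this contrast concrete, the following minimal sketch (not taken from PyTorch-Kaldi) shows what graph construction at run time allows: the number of operations traversed by backpropagation can depend on the data itself, via ordinary Python control flow.

```python
import torch

# The graph is built while the Python code executes, so ordinary
# control flow (here, a data-dependent loop count) is allowed.
x = torch.ones(3, requires_grad=True)
steps = int(x.sum().item())  # number of iterations depends on the data (3)
y = x
for _ in range(steps):
    y = y * 2  # each iteration adds nodes to the graph on the fly

loss = y.sum()
loss.backward()  # gradients flow through exactly the ops that ran
print(loss.item())      # 24.0  (3 elements, each 1 * 2**3)
print(x.grad.tolist())  # [8.0, 8.0, 8.0]
```

A statically compiled graph would require the loop structure to be fixed (or expressed through special control-flow operators) before any data is seen.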
The PyTorch-Kaldi project aims to bridge the gap between Kaldi
and PyTorch¹. Our toolkit implements acoustic models in PyTorch,
while feature extraction, label/alignment computation, and decod-
ing are performed with the Kaldi toolkit, making it a perfect fit to
develop state-of-the-art DNN-HMM speech recognizers. PyTorch-
Kaldi natively supports several DNN, CNN, and RNN models.
Combinations between deep learning models, acoustic features, and
labels are also supported, enabling the use of complex neural archi-
tectures. For instance, users can employ a cascade between CNNs,
LSTMs, and DNNs, or run in parallel several models that share some
hidden layers. Users can also explore different acoustic features,
context duration, neuron activations (e.g., ReLU, leaky ReLU), nor-
malizations (e.g., batch [19] and layer normalization [20]), cost func-
tions, regularization strategies (e.g., L2, dropout [21]), optimization
algorithms (e.g., SGD, Adam [22], RMSprop), and many other
hyper-parameters of an ASR system through simple edits of a con-
figuration file.
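As an illustration, such a configuration file might look like the sketch below. The section and option names here are hypothetical, chosen only to convey the style of an INI-like configuration; they are not the toolkit's exact schema.

```ini
[model]
; hypothetical option names -- illustrative only, not the toolkit's exact schema
arch = LSTM
hidden_layers = 4
hidden_size = 550
activation = relu
batch_norm = True
dropout = 0.2

[optimization]
optimizer = adam
learning_rate = 0.001
l2_reg = 0.0001

[features]
type = fmllr
context_window = 5
```

Switching the activation, normalization, or optimizer then amounts to editing a single line, with no change to the training code.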
The toolkit is designed to make the integration of user-defined
acoustic models as simple as possible. In practice, users can em-
bed their deep learning model and conduct ASR experiments even
without being fully familiar with the complex speech recognition
pipeline. The toolkit can perform computations on both local ma-
chines and HPC clusters, and supports multi-GPU training, recovery
strategies, and automatic data chunking.
The experiments, conducted on several datasets and tasks, have
shown that PyTorch-Kaldi makes it possible to easily develop com-
petitive state-of-the-art speech recognition systems.
2. THE PYTORCH-KALDI PROJECT
An overview of the architecture adopted in PyTorch-Kaldi is re-
ported in Fig. 1. The main script run_exp.py is written in Python
¹The code is available on GitHub (github.com/mravanelli/PyTorch-kaldi/).