Speech bandwidth expansion based on Deep Neural Networks
Yingxue Wang^{1,2}, Shenghui Zhao^1, Wenbo Liu^{3,4}, Ming Li^{3,5}, Jingming Kuang^1

^1 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
^2 School of Computer Science, Carnegie Mellon University
^3 SYSU-CMU Joint Inst. of Eng., Sun Yat-Sen University
^4 Department of ECE, Carnegie Mellon University
^5 SYSU-CMU Shunde International Joint Research Institute

yxwang.bit@gmail.com, shzhao@bit.edu.cn
Abstract
This paper proposes a new speech bandwidth expansion method, which uses Deep Neural Networks (DNNs) to build high-order eigenspaces between the low frequency components and the high frequency components of the speech signal. A four-layer DNN is trained layer-by-layer from a cascade of Neural Networks (NNs) and two Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBMs). The GBRBMs are adopted to model the distributions of the spectral envelopes of the low frequency and the high frequency components, respectively. The NNs are used to model the joint distribution of the hidden variables extracted from the two GBRBMs. The proposed method takes advantage of the strong ability of GBRBMs in modeling the distribution of spectral envelopes. Both the objective and subjective test results show that the proposed method outperforms the conventional GMM-based method.
Index Terms: bandwidth extension, deep neural networks, neural networks, Gaussian-Bernoulli Restricted Boltzmann Machine
1. Introduction
Speech bandwidth expansion (BWE) is a technique that attempts to improve speech quality by recovering the missing high frequency components, exploiting the correlation that exists between the low and high frequency parts of the wide-band speech signal. BWE techniques have been applied to various tasks, such as speech recognition [1] and multicast conferencing [2]. Many approaches to speech bandwidth extension have been proposed during the last decades. Generally, these methods can be classified into two categories: rule-based methods and statistical methods. The rule-based methods directly regenerate the high frequency spectrum based on acoustical knowledge of the speech signal, e.g. by simply copying a portion of the narrow-band spectrum onto the desired extension frequency components [3]. On the other hand, the statistical methods employ statistical models to estimate the mapping function between the low frequency and high frequency spectral features [4, 5, 6, 7]. In contrast to rule-based methods, statistical methods can construct more precise mapping functions using statistical models. Therefore, statistical methods, especially the GMM-based BWE methods, are widely used [5].
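For reference, the classic joint-GMM mapping underlying the GMM-based baseline estimates the high-band features as the conditional expectation E[y | x] under a GMM trained on stacked low/high-band vectors. The sketch below is not the paper's implementation; the function name, variable names, and the joint-Gaussian parameterization are our assumptions.

```python
import numpy as np

def gmm_map(x, weights, means, covs, dx):
    """Map a low-band feature vector x to a high-band estimate E[y | x].

    weights: (K,) mixture weights
    means:   (K, dx+dy) joint means over stacked [low; high] features
    covs:    (K, dx+dy, dx+dy) joint covariances
    dx:      dimensionality of the low-band part
    """
    K = len(weights)
    # responsibilities p(k | x) under the marginal low-band Gaussians
    logp = np.empty(K)
    for k in range(K):
        mu_x = means[k, :dx]
        S_xx = covs[k, :dx, :dx]
        diff = x - mu_x
        _, logdet = np.linalg.slogdet(S_xx)
        logp[k] = (np.log(weights[k])
                   - 0.5 * (logdet + diff @ np.linalg.solve(S_xx, diff)
                            + dx * np.log(2.0 * np.pi)))
    post = np.exp(logp - logp.max())
    post /= post.sum()
    # posterior-weighted per-component conditional means of the high band
    y = np.zeros(means.shape[1] - dx)
    for k in range(K):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        S_xx = covs[k, :dx, :dx]
        S_yx = covs[k, dx:, :dx]
        y += post[k] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    return y
```

With a single component this reduces to the usual Gaussian conditional mean, i.e. a linear regression from the low band to the high band.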
Motivated by the success of Deep Neural Networks (DNNs) in speech recognition [8], we propose to utilize a DNN to estimate a robust mapping function for speech bandwidth extension. Different from the conventional non-linear or linear transformation approaches, the DNN learns both a linear and a non-linear relationship between the low frequency and high frequency spectral envelopes. Thus, the DNN can learn a more detailed and precise relationship between the low frequency and the high frequency. In our approach, different from conventional feedforward neural networks for regression tasks, which are usually trained using the back-propagation algorithm under the minimum mean square error criterion, a four-layer DNN is trained layer-by-layer from a cascade of Neural Networks (NNs) and two Gaussian-Bernoulli Restricted Boltzmann Machines (GBRBMs). In the training phase, we first train two exclusive GBRBMs for the low frequency and the high frequency components to obtain deep networks that capture abstractions of each band. Then, low frequency feature vectors and high frequency feature vectors are fed into their corresponding GBRBMs, and the high-order features produced by the GBRBMs are used to train a concatenating neural network between the two GBRBMs. In the reconstruction phase, the low frequency signal is converted through the trained NN in the high-order space, and brought back to the cepstrum space using the inverse process of the high frequency GBRBM.
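The reconstruction phase can be sketched roughly as follows, assuming the common GBRBM parameterization in which p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i / sigma_i^2) and E[v_i | h] = a_i + sum_j w_ij h_j. All function and variable names here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruct_high(x_low, W_l, b_l, sigma_l, W_nn, b_nn, W_h, a_h):
    """Map low-band cepstral features to a high-band estimate.

    W_l, b_l, sigma_l: weights, hidden biases, and per-unit std of the
                       low-frequency GBRBM
    W_nn, b_nn:        weights/biases of the concatenating NN
    W_h, a_h:          weights and visible biases of the high-freq. GBRBM
    """
    # 1) low-band GBRBM encoder: mean-field hidden activation given v
    h_low = sigmoid(b_l + W_l.T @ (x_low / sigma_l**2))
    # 2) concatenating NN maps the low hidden code to the high hidden code
    h_high = sigmoid(b_nn + W_nn @ h_low)
    # 3) inverse GBRBM pass: conditional mean of the high-band visibles
    x_high = a_h + W_h @ h_high
    return x_high
```

In practice the hidden activations could also be sampled rather than taken as mean-field values; the deterministic pass above is the simplest variant.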
This paper is organized as follows. Section 2 gives an overview of the RBM and the GBRBM, while Section 3 explains our speech bandwidth extension method. We show our setup and experimental results in Section 4, and Section 5 is our conclusion.
2. Preliminaries
Our speech bandwidth extension method uses GBRBMs to capture high-order features. We briefly review the GBRBM and its fundamental model, the Restricted Boltzmann Machine (RBM), in this section.
2.1. RBM
An RBM is a bipartite undirected graphical model. It has a two-layer structure with one visible layer corresponding to a set of visible stochastic variables $v = [v_1, \ldots, v_V]^T$ and one hidden layer corresponding to a set of hidden stochastic variables $h = [h_1, \ldots, h_H]^T$, where $V$ and $H$ denote the number of units in the visible and hidden layers [9]. The joint probability $p(v, h)$ of binary-valued visible units $v$ and binary-valued hidden units $h$ is defined as follows:

$$p(v, h) = \frac{1}{Z} \exp(-E(v, h)) \quad (1)$$

$$E(v, h) = -a^T v - b^T h - v^T W h \quad (2)$$