Text-Independent Voice Conversion Using Deep Neural Network Based Phonetic Level Features

Huadi Zheng‡, Weicheng Cai∗, Tianyan Zhou∗, Shilei Zhang§, Ming Li∗†
∗SYSU-CMU Joint Institute of Engineering, Sun Yat-sen University
†SYSU-CMU Shunde International Joint Research Institute
‡Dept. of EIE, Hong Kong Polytechnic University
§Speech Technology and Solution Group, IBM China Research
liming46@mail.sysu.edu.cn
Abstract—This paper presents a phonetically-aware joint density Gaussian mixture model (JD-GMM) framework for voice conversion that no longer requires parallel data from the source speaker at the training stage. Since phonetic level features carry the text information that should be preserved in the conversion task, we propose a method that concatenates only the phonetic discriminant features and the spectral features extracted from the same target speaker's speech to train a JD-GMM. Once the mapping relationship between these two feature streams is trained, the phonetic discriminant features of the source speaker can be used to estimate the target speaker's spectral features at the conversion stage. The phonetic discriminant features are extracted by applying PCA to the output layer of a deep neural network (DNN) in an automatic speech recognition (ASR) system; they can be seen as a low-dimensional representation of the senone posteriors. We compare the proposed phonetically-aware method with the conventional JD-GMM method on the Voice Conversion Challenge 2016 training database. The experimental results show that the proposed method achieves performance similar to the conventional JD-GMM while using only the target speech as training data.
Index Terms—Gaussian mixture model; phoneme posterior
probability; voice conversion; deep neural network
I. INTRODUCTION
Speech signals contain not only linguistic content but also explicit personal identity information that helps associate the speech with a specific speaker. For human listeners, these non-linguistic cues are easily perceived. Voice conversion (VC) is an effective approach to capture this non-linguistic information and utilize it to synthesize an intended voice. The speech produced by one person (the source speaker) can be modified by various transformation and mapping techniques to generate speech that sounds like another person (the target speaker) while the linguistic message is preserved. VC systems can be applied to areas such as the electronic larynx [1] and text-to-speech systems [2]. It has been reported that spectral attributes are important for characterizing speaker individuality [3]. Therefore, most VC systems are based on spectral mapping techniques, and the related mapping approaches and models have been studied intensively over the past several years.
A typical parallel, or text-dependent, VC process usually involves a paired-data training stage and a runtime conversion stage. During data preparation, parallel data, i.e., a set of utterances in which the source speaker and the target speaker read the same content, has to be collected and aligned. The spectrum components separated from the paired data are passed to a feature extraction module that computes spectral features such as Mel-cepstral coefficients (MCCs) [4], line spectral frequencies (LSF) [2], line spectrum pairs (LSP) [5][6] and other acoustic features. These features provide a compact, low-dimensional representation of the spectrum that is convenient for computation, and the spectrum can be easily reconstructed from them when synthesizing the converted voice. Time alignment, typically with the dynamic time warping (DTW) technique, is applied to the parallel features to compensate for duration differences between the utterance pairs.
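As an illustration of this alignment step, the following is a minimal sketch (not taken from the paper) of plain DTW applied to two parallel feature sequences, e.g., per-frame MCC vectors stored as NumPy arrays; the function name, the Euclidean local cost, and the symmetric step pattern are assumptions made only for the example.

```python
# Minimal DTW sketch for aligning parallel spectral feature sequences.
import numpy as np


def dtw_align(src_feats, tgt_feats):
    """Align two (n_frames, n_dims) feature sequences with plain DTW.

    Returns a list of index pairs (i, j) so that src_feats[i] is matched
    with tgt_feats[j] along the minimum-cost warping path.
    """
    n, m = len(src_feats), len(tgt_feats)
    # Frame-wise Euclidean distance matrix.
    dist = np.linalg.norm(src_feats[:, None, :] - tgt_feats[None, :, :], axis=-1)
    # Accumulated cost with the standard (diagonal, up, left) recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The returned index pairs can then be used to select matched source and target frames before concatenating them for joint model training.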
At the offline training stage, the spectral features are used to estimate the parameters of the mapping function. A great number of statistical parametric approaches for VC transform these spectral features between speakers by learning a robust feature mapping function, such as vector quantization (VQ) mapping codebooks [3], the Gaussian mixture model (GMM) [2][4][7], artificial neural networks (ANN) [8], partial least squares regression (PLS) [9] and non-negative matrix factorization (NMF) [10]. Among the GMM based approaches, the joint density estimation technique has proved robust even with a small amount of training data and gives better perceptual test results [2]. The source and target features are concatenated after time alignment to train a joint density Gaussian mixture model (JD-GMM). At runtime conversion, the target spectral features are estimated from the model and converted back to spectrum components for synthesis.
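For reference, the conventional JD-GMM conversion function described in [2][4] takes the minimum mean-square-error form (the notation below is ours, not the paper's):

$$\hat{\mathbf{y}}_t=\sum_{m=1}^{M}P(m\mid\mathbf{x}_t)\left[\boldsymbol{\mu}_m^{(y)}+\boldsymbol{\Sigma}_m^{(yx)}\left(\boldsymbol{\Sigma}_m^{(xx)}\right)^{-1}\left(\mathbf{x}_t-\boldsymbol{\mu}_m^{(x)}\right)\right],$$

where $P(m\mid\mathbf{x}_t)$ is the posterior probability of the $m$-th mixture component given the source feature $\mathbf{x}_t$, and $\boldsymbol{\mu}_m$ and $\boldsymbol{\Sigma}_m$ denote the mean subvectors and covariance blocks of the joint Gaussian components over the concatenated vector $\mathbf{z}_t=[\mathbf{x}_t^{\top},\mathbf{y}_t^{\top}]^{\top}$.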
However, the statistical nature of the GMM requires a relatively large amount of parallel training data to achieve high mapping accuracy. Collecting large amounts of parallel spectral features is not always feasible in practical applications and is impossible in cross-lingual conversion. To utilize non-parallel data sets, text-independent methods such as vocal tract length normalization (VTLN) [11] and unit selection [12] have been proposed. Although some of these mapping techniques have proved useful for non-parallel training, they still need to align the source and target data at the frame or phoneme level, and the resulting one-to-one mapping models lack generalization. To reduce the dependence on source data in the training stage