Vol.26 No.3 JOURNAL OF ELECTRONICS (CHINA) May 2009
A CODEBOOK COMPENSATIVE VOICE MORPHING ALGORITHM
BASED ON MAXIMUM LIKELIHOOD ESTIMATION
Xu Ning    Yang Zhen    Zhang Linhua*
(Institute of Signal Processing and Transmission, Nanjing University of Posts & Telecommunications, Nanjing 213002, China)
*(College of Telecommunication & Information Engineering, Nanjing University of Posts & Telecommunications, Nanjing 210003, China)
Abstract This paper presents an improved voice morphing algorithm based on the Gaussian Mixture Model (GMM) which overcomes two weaknesses of the traditional approach: the overly smoothed converted spectra and the discontinuities between frames. Firstly, maximum likelihood estimation of the model is introduced to alleviate the inversion of high-dimensional matrices required by the traditional conversion function. Then, in order to resolve the two problems associated with the baseline, a codebook compensation technique and a time-domain median filter are applied. The results of listening evaluations show that the quality of the speech converted by the proposed method is significantly better than that of the traditional GMM method: the Mean Opinion Score (MOS) of the converted speech is improved from 2.5 to 3.1 and the ABX score from 38% to 75%.
Key words Maximum-Likelihood (ML) estimation; Codebook compensation; Median filter; Voice morphing
CLC index TN925
DOI 10.1007/s11767-008-0016-9
I. Introduction
Voice morphing is a technique for modifying a source speaker's speech to sound as if it were spoken by a designated target speaker. There are many applications of voice morphing, including customizing voices for Text To Speech (TTS) systems, transforming the voice in advertisements to sound like that of a well-known celebrity, and improving the intelligibility of abnormal speech uttered by a person with a speech disorder. In general, almost all voice morphing systems consist of two stages, training and transformation, of which the core process is the transformation of the spectral envelope of the source speaker to match that of the target speaker.
Manuscript received date: January 30, 2008; Revised date: April 13, 2008.
Supported by a grant from the National High Technology Research and Development Program of China (863 Program, No.2006AA010102), and the National Natural Science Foundation of China (No.60872105).
Communication author: Xu Ning, born in 1981, male, Ph.D. candidate. New Mofan Road 66#, Nanjing University of Posts and Telecommunications, 97 Mailbox, Nanjing 210003, China.
Email: D0705@njupt.edu.cn
Various approaches have been proposed for doing this, such as codebook mapping[1], artificial neural networks[2], and linear transformations[3,4]. Codebook mapping, however, typically leads to discontinuities in the transformed speech, and the approach suffers from a lack of robustness as well as degraded quality. On the other hand, artificial neural networks tend not to generalize well. Hence, linear transformation based approaches are now the most popular ones. However, the traditional GMM conversion tends to generate overly smoothed utterances[5] as well as a number of artifacts in consecutive converted frames[6,7], caused by direct transformation of the source vectors.
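For reference, the conventional GMM-based conversion function of the linear transformation approaches[3,4] is usually written in the following regression form (the notation here is the common one from the literature and may differ from the symbols used later in this paper):

F(x_t) = \sum_{i=1}^{M} p_i(x_t) \left[ \mu_i^{y} + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( x_t - \mu_i^{x} \right) \right]

where p_i(x_t) is the posterior probability of the i-th mixture component given the source vector x_t, and \mu_i and \Sigma_i are the mixture means and (cross-)covariances. The term (\Sigma_i^{xx})^{-1} is where the inversion of high-dimensional matrices mentioned in the abstract arises, and the posterior-weighted averaging is one source of the over-smoothing noted above.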
In order to achieve high-quality voice conversion, the main problems raised above need to be solved. This paper presents a novel method for spectral conversion using a codebook technique under the Maximum-Likelihood (ML) framework[8]. In addition, the converted feature vectors are smoothed along the time axis to preserve the correlation between consecutive frames and keep the conversion continuous.
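As a minimal illustration of such time-axis smoothing (an assumed sketch rather than the implementation used in this paper), a median filter can be run over each feature dimension of the converted frame sequence:

# Sketch: time-domain median smoothing of a converted frame sequence.
# Assumed interface: `frames` is a (T, D) array of converted spectral
# feature vectors (T frames, D coefficients); `window` must be odd.
import numpy as np

def median_smooth(frames: np.ndarray, window: int = 5) -> np.ndarray:
    half = window // 2
    # Repeat the edge frames so the output keeps the original length T.
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    smoothed = np.empty_like(frames)
    for t in range(frames.shape[0]):
        # Median over the local window, taken per feature dimension.
        smoothed[t] = np.median(padded[t:t + window], axis=0)
    return smoothed

Each converted frame is replaced by the per-dimension median of its neighbourhood, which suppresses isolated outlier frames without the heavy blurring a mean filter would introduce.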
The remainder of this paper is organized as
follows. First, an overview of our voice conversion
framework is given in Section II, followed by the