Vol.26 No.3 JOURNAL OF ELECTRONICS (CHINA) May 2009
A CODEBOOK COMPENSATIVE VOICE MORPHING ALGORITHM
BASED ON MAXIMUM LIKELIHOOD ESTIMATION
Xu Ning    Yang Zhen    Zhang Linhua*
(Institute of Signal Processing and Transmission, Nanjing University of Posts & Telecommunications, Nanjing 213002, China)
*(College of Telecommunication & Information Engineering, Nanjing University of Posts & Telecommunications, Nanjing 210003, China)
Abstract This paper presents an improved voice morphing algorithm based on the Gaussian Mixture Model (GMM) which overcomes two weaknesses of the traditional approach: the overly smoothed converted spectra and the discontinuities between frames. Firstly, maximum likelihood estimation of the model is introduced to alleviate the inversion of high-dimensional matrices required by the traditional conversion function. Then, in order to resolve the two problems associated with the baseline, a codebook compensation technique and a time-domain median filter are applied. The results of listening evaluations show that the quality of the speech converted by the proposed method is significantly better than that of the traditional GMM method: the Mean Opinion Score (MOS) of the converted speech is improved from 2.5 to 3.1 and the ABX score from 38% to 75%.
Key words Maximum-Likelihood (ML) estimation; Codebook compensation; Median filter; Voice morphing
CLC index TN925
DOI 10.1007/s11767-008-0016-9
I. Introduction
Voice morphing is a technique for modifying a source speaker's speech to sound as if it were spoken by a designated target speaker. There are many applications of voice morphing, including customizing voices for Text To Speech (TTS) systems, transforming the voice in advertisements to sound like that of a well-known celebrity, and improving the intelligibility of abnormal speech uttered by a person with a speech disorder. In general, almost all voice morphing systems consist of two stages, training and transformation, of which the core process is the transformation of the spectral envelope of the source speaker to match that of the target speaker.
Manuscript received date: January 30, 2008; Revised date: April 13, 2008.
Supported by a grant from the National High Technology Research and Development Program of China (863 Program, No.2006AA010102), and the National Natural Science Foundation of China (No.60872105).
Communication author: Xu Ning, born in 1981, male, Ph.D. candidate. New Mofan Road 66#, Nanjing University of Posts and Telecommunications, 97 Mailbox, Nanjing 210003, China.
Email: D0705@njupt.edu.cn
Various approaches have been proposed for doing this, such as codebook mapping[1], artificial neural networks[2], and linear transformations[3,4]. Codebook mapping, however, typically leads to discontinuities in the transformed speech, and the approach suffers from a lack of robustness as well as degraded quality. On the other hand, artificial neural networks tend not to generalize well. Hence, linear transformation based approaches are now the most popular ones. However, the traditional GMM conversion tends to generate overly smoothed utterances[5] as well as a number of artifacts in consecutive converted frames[6,7], caused by direct transformation of the source vectors.
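For reference, the conventional GMM-based conversion function of the linear transformation approaches[3,4] is usually written in the following regression form (the notation here is the common one from the literature and may differ from the symbols used later in this paper):

F(x_t) = \sum_{i=1}^{M} p_i(x_t) \left[ \mu_i^{y} + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( x_t - \mu_i^{x} \right) \right]

where p_i(x_t) is the posterior probability of the i-th mixture component given the source vector x_t, and \mu_i and \Sigma_i are the mixture means and (cross-)covariances. The term (\Sigma_i^{xx})^{-1} is where the inversion of high-dimensional matrices mentioned in the abstract arises, and the posterior-weighted averaging is one source of the over-smoothing noted above.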
In order to achieve high-quality voice conversion, the main problems raised above need to be solved. This paper presents a novel method for spectral conversion using a codebook technique under the Maximum-Likelihood (ML) framework[8]. In addition, the converted feature vectors are smoothed along the time axis to preserve the correlation between consecutive frames and keep the conversion continuous.
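As a minimal illustration of such time-axis smoothing (an assumed sketch rather than the implementation used in this paper), a median filter can be run over each feature dimension of the converted frame sequence:

# Sketch: time-domain median smoothing of a converted frame sequence.
# Assumed interface: `frames` is a (T, D) array of converted spectral
# feature vectors (T frames, D coefficients); `window` must be odd.
import numpy as np

def median_smooth(frames: np.ndarray, window: int = 5) -> np.ndarray:
    half = window // 2
    # Repeat the edge frames so the output keeps the original length T.
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    smoothed = np.empty_like(frames)
    for t in range(frames.shape[0]):
        # Median over the local window, taken per feature dimension.
        smoothed[t] = np.median(padded[t:t + window], axis=0)
    return smoothed

Each converted frame is replaced by the per-dimension median of its neighbourhood, which suppresses isolated outlier frames without the heavy blurring a mean filter would introduce.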
The remainder of this paper is organized as
follows. First, an overview of our voice conversion
framework is given in Section II, followed by the