Text-Independent Voice Conversion Using Deep Neural Network Based Phonetic Level Features

Huadi Zheng‡, Weicheng Cai∗, Tianyan Zhou∗, Shilei Zhang§, Ming Li∗†
∗SYSU-CMU Joint Institute of Engineering, Sun Yat-sen University
†SYSU-CMU Shunde International Joint Research Institute
‡Dept. of EIE, Hong Kong Polytechnic University
§Speech Technology and Solution Group, IBM China Research
liming46@mail.sysu.edu.cn
Abstract—This paper presents a phonetically-aware joint density Gaussian mixture model (JD-GMM) framework for voice conversion that no longer requires parallel data from the source speaker at the training stage. Since phonetic level features carry the text information that should be preserved in the conversion task, we propose a method that concatenates only the phonetic discriminant features and the spectral features extracted from the same target speaker's speech to train a JD-GMM. Once the mapping relationship between these two feature streams is trained, the phonetic discriminant features of the source speaker can be used to estimate the target speaker's spectral features at the conversion stage. The phonetic discriminant features are extracted by applying PCA to the output layer of a deep neural network (DNN) in an automatic speech recognition (ASR) system; they can be seen as a low-dimensional representation of the senone posteriors. We compare the proposed phonetically-aware method with the conventional JD-GMM method on the Voice Conversion Challenge 2016 training database. The experimental results show that the proposed method achieves performance similar to the conventional JD-GMM while using only the target speech as training data.
Index Terms—Gaussian mixture model; phoneme posterior
probability; voice conversion; deep neural network
I. INTRODUCTION
Speech signals contain not only linguistic content but also explicit personal identity information that helps associate the speech with a specific speaker. For human listeners, these non-linguistic cues are easily perceived. Voice conversion (VC) is an effective approach to capture this non-linguistic information and utilize it to synthesize an intended voice. The speech produced by one person (the source speaker) can be modified by various transformation and mapping techniques to generate speech that sounds like another person (the target speaker) while the linguistic message is preserved. VC systems can be applied to areas such as the electronic larynx [1] and text-to-speech systems [2]. It has been reported that spectral attributes are important for characterizing speaker individuality [3]. Therefore, most VC systems are based on spectral mapping techniques, and the related mapping approaches and models have been studied intensively over the past several years.
A typical parallel, or text-dependent, VC process usually involves a paired-data training stage and a runtime conversion stage. During data preparation, parallel data, i.e., a set of utterances in which the source speaker and the target speaker read the same content, has to be collected and aligned. The spectrum components separated from the paired data are passed to a feature extraction module that computes spectral features such as Mel-cepstral coefficients (MCCs) [4], line spectral frequencies (LSF) [2], line spectrum pairs (LSP) [5][6] and other acoustic features. These features provide a compact, low-dimensional representation of the spectrum that is convenient for computation, and the spectrum can be easily reconstructed from them when synthesizing the converted voice. Time alignment, typically with the dynamic time warping (DTW) technique, is applied to the parallel features to compensate for duration differences between the utterance pairs.
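As an illustration of this alignment step, the following is a minimal sketch (not taken from the paper) of plain DTW applied to two parallel feature sequences, e.g., per-frame MCC vectors stored as NumPy arrays; the function name, the Euclidean local cost, and the symmetric step pattern are assumptions made only for the example.

```python
# Minimal DTW sketch for aligning parallel spectral feature sequences.
import numpy as np


def dtw_align(src_feats, tgt_feats):
    """Align two (n_frames, n_dims) feature sequences with plain DTW.

    Returns a list of index pairs (i, j) so that src_feats[i] is matched
    with tgt_feats[j] along the minimum-cost warping path.
    """
    n, m = len(src_feats), len(tgt_feats)
    # Frame-wise Euclidean distance matrix.
    dist = np.linalg.norm(src_feats[:, None, :] - tgt_feats[None, :, :], axis=-1)
    # Accumulated cost with the standard (diagonal, up, left) recursion.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The returned index pairs can then be used to select matched source and target frames before concatenating them for joint model training.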
At the offline training stage, the spectral features are used to estimate the parameters of the mapping function. A great number of statistical parametric approaches for VC transform these spectral features between speakers by learning a robust feature mapping function, such as vector quantization (VQ) mapping codebooks [3], the Gaussian mixture model (GMM) [2][4][7], artificial neural networks (ANN) [8], partial least squares regression (PLS) [9] and non-negative matrix factorization (NMF) [10]. Among the GMM based approaches, the joint density estimation technique has proved robust even with a small amount of training data and gives better perceptual test results [2]. The source and target features are concatenated after time alignment to train a joint density Gaussian mixture model (JD-GMM). At runtime conversion, the target spectral features are estimated from the model and converted back to spectrum components for synthesis.
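For reference, the conventional JD-GMM conversion function described in [2][4] takes the minimum mean-square-error form (the notation below is ours, not the paper's):

$$\hat{\mathbf{y}}_t=\sum_{m=1}^{M}P(m\mid\mathbf{x}_t)\left[\boldsymbol{\mu}_m^{(y)}+\boldsymbol{\Sigma}_m^{(yx)}\left(\boldsymbol{\Sigma}_m^{(xx)}\right)^{-1}\left(\mathbf{x}_t-\boldsymbol{\mu}_m^{(x)}\right)\right],$$

where $P(m\mid\mathbf{x}_t)$ is the posterior probability of the $m$-th mixture component given the source feature $\mathbf{x}_t$, and $\boldsymbol{\mu}_m$ and $\boldsymbol{\Sigma}_m$ denote the mean subvectors and covariance blocks of the joint Gaussian components over the concatenated vector $\mathbf{z}_t=[\mathbf{x}_t^{\top},\mathbf{y}_t^{\top}]^{\top}$.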
However, the statistical nature of the GMM requires a relatively large amount of parallel training data to achieve high mapping accuracy. Collecting large amounts of parallel spectral features is not always feasible in practical applications and is impossible in cross-lingual conversion. To utilize non-parallel data sets, text-independent methods such as vocal tract length normalization (VTLN) [11] and unit selection [12] have been proposed. Although some of these mapping techniques have proved useful for non-parallel training, they still need to align the source and target data at the frame or phoneme level, and the resulting one-to-one mapping models lack generalization. To reduce the dependence on source data in the training stage