for rare or unknown words. Indeed, manually engineered heuristics cannot fully capture
all the intricacies of spoken language. For this reason, feature extraction
can also be done with trained models. The line between the feature extractor and
the acoustic model can then become blurry, especially for deep models. In fact, a
tendency that is common across all areas where deep models have overtaken traditional
machine learning techniques is for feature extraction to rely on fewer heuristics, as
highly nonlinear models become able to operate at higher levels of abstraction.
A common feature extraction technique is to build frames that will integrate
surrounding context in a hierarchical fashion. For example, a frame at the syllable level
could include the word that contains it, its position in the word, the neighbouring
syllables, the phonemes that make up the syllable, and so on. The lexical stress and accent
of individual syllables can be predicted by a statistical model such as a decision tree.
To encode prosody, a set of rules such as ToBI (Beckman and Elam, 1997) can be
used. Ultimately, some feature engineering remains to present a frame as a
numerical object to the model, e.g. categorical features are typically encoded using a
one-hot representation.
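As an illustration, the sketch below shows one way such a syllable-level frame could be flattened into a feature vector, with one-hot encodings for categorical features and a normalized scalar for the position in the word; the feature names and vocabularies are hypothetical and chosen only for the example.

```python
import numpy as np

# Hypothetical categorical vocabularies for a syllable-level frame.
PHONEMES = ["p", "a", "t", "k", "i", "s"]          # toy phoneme inventory
STRESS_LEVELS = ["none", "primary", "secondary"]   # lexical stress classes

def one_hot(value, vocabulary):
    """Return a one-hot vector marking `value` within `vocabulary`."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_frame(phoneme, stress, position_in_word, word_length):
    """Concatenate one-hot categorical features with a numerical feature."""
    return np.concatenate([
        one_hot(phoneme, PHONEMES),
        one_hot(stress, STRESS_LEVELS),
        [position_in_word / max(word_length, 1)],  # normalized position
    ])

# Example: the first syllable of a three-syllable word, carrying primary stress.
frame = encode_frame("a", "primary", position_in_word=0, word_length=3)
print(frame.shape, frame)
```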
The reason why the acoustic model does not directly predict an audio waveform
is that raw audio is difficult to model: it is a particularly dense domain,
and audio signals are typically highly nonlinear. A representation that brings out
features in a more tractable manner is the time-frequency domain. Spectrograms
are smoother and much less dense than their waveform counterparts. They also have
the benefit of being two-dimensional, thus allowing models to better leverage spatial
connectivity. Unfortunately, a spectrogram is a lossy representation of the waveform
that discards the phase. There is no unique inverse transformation function, and
deriving one that produces natural-sounding results is not trivial. In the context of
speech, this generative function is called a vocoder. The choice of vocoder is an
important factor in determining the quality of the generated audio.
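To make the forward and inverse transforms concrete, the sketch below uses librosa to compute a mel spectrogram and then reconstructs an approximate waveform from it with the Griffin-Lim algorithm, a classical way of estimating the discarded phase; the synthetic signal and the parameters are illustrative only and are not taken from any particular system.

```python
import numpy as np
import librosa

sr = 22050
# A short synthetic test signal; a real pipeline would load recorded speech.
t = np.linspace(0, 1.0, sr, endpoint=False)
y = (0.5 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)

# Forward transform: magnitude mel spectrogram (the phase is discarded here).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)

# Approximate inversion: Griffin-Lim iteratively estimates the missing phase.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256, n_iter=32)

print(mel.shape)          # (80 mel bands, number of frames)
print(y.shape, y_hat.shape)
```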
As is often the case with tasks that involve generating perceptual data such as
images or audio, a formal and objective evaluation of the performance of the model
is difficult. In our case, we are concerned with evaluating the speech naturalness and voice
similarity of the generated audio. Older TTS methods often relied on statistics com-
puted on the waveform or on the acoustic features to compare different models. Not
only are those metrics often weakly correlated with the human perception of
sound, but they are also typically the very quantities the model was trained to minimize.
“When a measure becomes a target, it ceases to be a good measure” (Strathern, 1997).
A more recent practice in TTS is to perform a subjective evaluation with human
subjects and to report their Mean Opinion Score (MOS). The subjects are presented
with a series of audio segments and are asked to rate their naturalness (or similarity
when comparing two segments) on a Likert scale from 1 to 5. Because subjects do not
necessarily rate actual human speech with a 5, it may be possible for TTS systems to
surpass humans on this metric in the future. Shirali-Shahreza and Penn (2018) argue
that MOS is not a metric well suited to evaluating TTS systems, and they advocate using
A/B testing instead, where subjects are asked to say which audio segment they prefer.
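For concreteness, the sketch below shows how such Likert ratings are commonly aggregated into a MOS with an approximate 95% confidence interval; the ratings used here are made up for illustration.

```python
import numpy as np

def mean_opinion_score(ratings, z=1.96):
    """Aggregate Likert ratings (1-5) into a MOS with a ~95% interval."""
    ratings = np.asarray(ratings, dtype=np.float64)
    mos = ratings.mean()
    # Standard error of the mean; assumes independent ratings.
    half_width = z * ratings.std(ddof=1) / np.sqrt(len(ratings))
    return mos, half_width

# Hypothetical naturalness ratings collected from listeners.
ratings = [4, 5, 4, 3, 5, 4, 4, 3, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```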