phonemes. They argue that RNN-based encoder-attention-decoder models like Tacotron 2 suffer from two issues: 1) Due to their recurrent nature, the RNN-based encoder and decoder cannot be trained in parallel, and the RNN-based encoder cannot be parallelized at inference time, which hurts efficiency in both training and inference. 2) Since text and speech sequences are usually very long, RNNs are not good at modeling the long-range dependencies in these sequences. TransformerTTS adopts the basic model structure of Transformer and absorbs some designs from Tacotron 2, such as the decoder pre-net/post-net and stop-token prediction. It achieves voice quality similar to Tacotron 2 but enjoys faster training. However, compared with RNN-based models such as Tacotron that leverage stable attention mechanisms such as location-sensitive attention, the encoder-decoder attention in Transformer is less robust due to its parallel computation. Thus, some works propose to
enhance the robustness of Transformer-based acoustic models. For example, MultiSpeech [39] improves the robustness of the attention mechanism through encoder normalization, decoder bottleneck, and diagonal attention constraint, and RobuTrans [194] leverages duration prediction to enhance the robustness in autoregressive generation.
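To make the idea of a diagonal attention constraint more concrete, the minimal sketch below penalizes encoder-decoder attention mass that falls far from the diagonal, encouraging near-monotonic text-to-speech alignment. It follows the style of guided-attention losses rather than MultiSpeech's exact formulation, and the band-width parameter g, tensor shapes, and loop over the batch are illustrative assumptions.

```python
import torch

def diagonal_attention_loss(attn, text_lens, mel_lens, g=0.2):
    # attn: [batch, mel_steps, text_steps] encoder-decoder attention weights.
    # text_lens / mel_lens: valid lengths of each text / mel sequence in the batch.
    # g: width of the allowed diagonal band (illustrative default).
    batch = attn.size(0)
    loss = attn.new_zeros(())
    for b in range(batch):
        T, M = int(text_lens[b]), int(mel_lens[b])
        n = torch.arange(T, device=attn.device, dtype=attn.dtype) / T  # text positions in [0, 1)
        m = torch.arange(M, device=attn.device, dtype=attn.dtype) / M  # mel positions in [0, 1)
        # Penalty weight grows as attention moves away from the diagonal n ~ m.
        w = 1.0 - torch.exp(-((n.unsqueeze(0) - m.unsqueeze(1)) ** 2) / (2 * g * g))  # [M, T]
        loss = loss + (attn[b, :M, :T] * w).mean()
    return loss / batch
```

Adding such a term to the training loss (weighted against the reconstruction loss) discourages the scattered, non-monotonic alignments that cause skipping and repeating in attention-based autoregressive generation.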
Previous neural-based acoustic models such as Tacotron 1/2 [382, 303], DeepVoice 3 [270], and TransformerTTS [192] all adopt autoregressive generation, which suffers from several issues: 1) Slow inference speed. Autoregressive mel-spectrogram generation is slow, especially for long speech sequences (e.g., with a 10 ms hop size there are about 100 mel-spectrogram frames per second of speech, so a typical utterance corresponds to a long sequence of frames). 2) Robustness issues. The generated speech often suffers from word skipping and repeating, which is mainly caused by inaccurate attention alignments between text and mel-spectrograms in encoder-attention-decoder based autoregressive
generation. Thus, FastSpeech [
290
] is proposed to solve these issues: 1) It adopts a feed-forward
Transformer network to generate mel-spectrograms in parallel, which can greatly speed up inference.
2) It removes the attention mechanism between text and speech to avoid word skipping and repeating
issues and improve robustness. Instead, it uses a length regulator to bridge the length mismatch
between the phoneme and mel-spectrogram sequences. The length regulator leverages a duration
predictor to predict the duration of each phoneme and expands the phoneme hidden sequence
according to the phoneme duration, where the expanded phoneme hidden sequence can match the
length of mel-spectrogram sequence and facilitate the parallel generation. FastSpeech enjoys several
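The following minimal sketch illustrates the length-regulator idea described above: each phoneme hidden state is repeated by its (integer) predicted duration so the expanded sequence has as many positions as the target mel-spectrogram. This is not the official FastSpeech implementation; the hidden size and durations are example values.

```python
import torch

def length_regulator(phoneme_hidden, durations):
    # phoneme_hidden: [num_phonemes, hidden_dim] encoder outputs.
    # durations: [num_phonemes] integer number of mel frames per phoneme.
    # repeat_interleave copies the i-th hidden vector durations[i] times,
    # so the expanded sequence has sum(durations) frames.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Example: 3 phonemes with durations 2, 4, and 3 frames -> 9 frames total.
hidden = torch.randn(3, 256)
durs = torch.tensor([2, 4, 3])
expanded = length_regulator(hidden, durs)
assert expanded.shape == (9, 256)
```

Because the expanded hidden sequence already has the target length, the decoder can generate all mel-spectrogram frames in parallel instead of one frame at a time.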
FastSpeech enjoys several advantages [290]: 1) extremely fast inference speed (e.g., 270x inference speedup on mel-spectrogram generation and 38x speedup on waveform generation); 2) robust speech synthesis without word skipping and repeating issues; and 3) voice quality on par with previous autoregressive models. FastSpeech has been deployed in Microsoft Azure Text to Speech Service^10 to support all the languages and locales in Azure TTS^11.
FastSpeech leverages an explicit duration predictor to expand the phoneme hidden sequence to match the length of the mel-spectrogram sequence. How to obtain the duration labels for training the duration predictor is critical for the prosody and quality of the generated voice. We briefly review TTS models with duration prediction in Section 3.4.2. Next, we introduce some other improvements based on
FastSpeech. FastSpeech 2 [292] is proposed to further enhance FastSpeech, mainly in two aspects: 1) It uses ground-truth mel-spectrograms as training targets, instead of mel-spectrograms distilled from an autoregressive teacher model, which simplifies the two-stage teacher-student distillation pipeline in FastSpeech and avoids the information loss in the target mel-spectrograms caused by distillation. 2) It provides more variance information such as pitch, duration, and energy as decoder input, which eases the one-to-many mapping problem [139, 84, 382, 456] in text to speech^12. FastSpeech 2 achieves better voice quality than FastSpeech and maintains its advantages of fast, robust, and controllable speech synthesis^13. FastPitch [181] improves FastSpeech by using pitch information as decoder input, which shares a similar idea with the variance predictors in FastSpeech 2.
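To illustrate how such variance information can be injected, the sketch below predicts per-position pitch and energy, quantizes them into bins, and adds the corresponding embeddings back to the hidden sequence, loosely following the variance-adaptor idea described for FastSpeech 2. The layer sizes, bin counts, and value ranges are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Small convolutional predictor for a per-position scalar
    (duration, pitch, or energy); layer sizes are illustrative."""
    def __init__(self, hidden_dim=256, kernel_size=3, dropout=0.5):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):                      # x: [batch, time, hidden_dim]
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.linear(y).squeeze(-1)      # [batch, time] one scalar per position


class VarianceAdaptor(nn.Module):
    """Predict pitch and energy, quantize into bins, and add the
    corresponding embeddings to the hidden sequence (a sketch; bin
    count and the [0, 1] value range are assumptions)."""
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden_dim)
        self.energy_predictor = VariancePredictor(hidden_dim)
        self.register_buffer("pitch_bins", torch.linspace(0.0, 1.0, n_bins - 1))
        self.register_buffer("energy_bins", torch.linspace(0.0, 1.0, n_bins - 1))
        self.pitch_embed = nn.Embedding(n_bins, hidden_dim)
        self.energy_embed = nn.Embedding(n_bins, hidden_dim)

    def forward(self, x):                      # x: [batch, time, hidden_dim]
        pitch = self.pitch_predictor(x)
        energy = self.energy_predictor(x)
        x = x + self.pitch_embed(torch.bucketize(pitch, self.pitch_bins))
        x = x + self.energy_embed(torch.bucketize(energy, self.energy_bins))
        return x, pitch, energy
```

During training the predictors are supervised with pitch and energy extracted from the ground-truth speech, while at inference the predicted (and optionally user-adjusted) values are used, which is what makes the synthesis controllable.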
Other Acoustic Models (e.g., Flow, GAN, VAE, Diffusion)
Besides the above acoustic models, there are many other acoustic models [367, 22, 126, 187, 55], as shown in Table 2. Flow-based models have long been used in neural TTS. After the early successful applications on vocoders (e.g.,
^10 https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/
^11 https://techcommunity.microsoft.com/t5/azure-ai/neural-text-to-speech-extends-support-to-15-more-languages-with/ba-p/1505911
^12 One-to-many mapping in TTS refers to the fact that multiple possible speech sequences can correspond to the same text sequence, due to variations in speech such as pitch, duration, sound volume, and prosody.
^13 FastSpeech 2s [292] is proposed together with FastSpeech 2. Since it is a fully end-to-end text-to-waveform model, we introduce it in Section 2.5.