phonemes. They argue that RNN-based encoder-attention-decoder models like Tacotron 2 suffer from two issues: 1) Due to their recurrent nature, the RNN-based encoder and decoder cannot be trained in parallel, and the RNN-based encoder cannot be parallelized at inference time, which hurts efficiency in both training and inference. 2) Since text and speech sequences are usually very long, RNNs are not good at modeling the long-range dependencies in these sequences. TransformerTTS adopts the basic model structure of Transformer and absorbs some designs from Tacotron 2, such as the decoder pre-net/post-net and stop-token prediction. It achieves voice quality similar to Tacotron 2 but enjoys faster training. However, compared with RNN-based models such as Tacotron that leverage stable attention mechanisms such as location-sensitive attention, the encoder-decoder attention in Transformer is less robust due to its parallel computation. Thus, some works propose to
enhance the robustness of Transformer-based acoustic models. For example, MultiSpeech [39] improves the robustness of the attention mechanism through encoder normalization, decoder bottleneck, and diagonal attention constraint, and RobuTrans [194] leverages duration prediction to enhance the robustness in autoregressive generation.
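To make the idea of a diagonal attention constraint more concrete, the minimal sketch below penalizes encoder-decoder attention mass that falls far from the diagonal, encouraging near-monotonic text-to-speech alignment. It follows the style of guided-attention losses rather than MultiSpeech's exact formulation, and the band-width parameter g, tensor shapes, and loop over the batch are illustrative assumptions.

```python
import torch

def diagonal_attention_loss(attn, text_lens, mel_lens, g=0.2):
    # attn: [batch, mel_steps, text_steps] encoder-decoder attention weights.
    # text_lens / mel_lens: valid lengths of each text / mel sequence in the batch.
    # g: width of the allowed diagonal band (illustrative default).
    batch = attn.size(0)
    loss = attn.new_zeros(())
    for b in range(batch):
        T, M = int(text_lens[b]), int(mel_lens[b])
        n = torch.arange(T, device=attn.device, dtype=attn.dtype) / T  # text positions in [0, 1)
        m = torch.arange(M, device=attn.device, dtype=attn.dtype) / M  # mel positions in [0, 1)
        # Penalty weight grows as attention moves away from the diagonal n ~ m.
        w = 1.0 - torch.exp(-((n.unsqueeze(0) - m.unsqueeze(1)) ** 2) / (2 * g * g))  # [M, T]
        loss = loss + (attn[b, :M, :T] * w).mean()
    return loss / batch
```

Adding such a term to the training loss (weighted against the reconstruction loss) discourages the scattered, non-monotonic alignments that cause skipping and repeating in attention-based autoregressive generation.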
Previous neural-based acoustic models such as Tacotron 1/2 [382, 303], DeepVoice 3 [270], and TransformerTTS [192] all adopt autoregressive generation, which suffers from several issues: 1) Slow inference speed. Autoregressive mel-spectrogram generation is slow, especially for long speech sequences (e.g., with a 10 ms hop size there are about 100 mel-spectrogram frames per second of speech, so a typical utterance corresponds to a long sequence of frames). 2) Robustness issues. The generated speech often suffers from word skipping and repeating, which is mainly caused by inaccurate attention alignments between text and mel-spectrograms in encoder-attention-decoder based autoregressive
generation. Thus, FastSpeech [
290
] is proposed to solve these issues: 1) It adopts a feed-forward
Transformer network to generate mel-spectrograms in parallel, which can greatly speed up inference.
2) It removes the attention mechanism between text and speech to avoid word skipping and repeating
issues and improve robustness. Instead, it uses a length regulator to bridge the length mismatch
between the phoneme and mel-spectrogram sequences. The length regulator leverages a duration
predictor to predict the duration of each phoneme and expands the phoneme hidden sequence
according to the phoneme duration, where the expanded phoneme hidden sequence can match the
length of mel-spectrogram sequence and facilitate the parallel generation. FastSpeech enjoys several
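The following minimal sketch illustrates the length-regulator idea described above: each phoneme hidden state is repeated by its (integer) predicted duration so the expanded sequence has as many positions as the target mel-spectrogram. This is not the official FastSpeech implementation; the hidden size and durations are example values.

```python
import torch

def length_regulator(phoneme_hidden, durations):
    # phoneme_hidden: [num_phonemes, hidden_dim] encoder outputs.
    # durations: [num_phonemes] integer number of mel frames per phoneme.
    # repeat_interleave copies the i-th hidden vector durations[i] times,
    # so the expanded sequence has sum(durations) frames.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Example: 3 phonemes with durations 2, 4, and 3 frames -> 9 frames total.
hidden = torch.randn(3, 256)
durs = torch.tensor([2, 4, 3])
expanded = length_regulator(hidden, durs)
assert expanded.shape == (9, 256)
```

Because the expanded hidden sequence already has the target length, the decoder can generate all mel-spectrogram frames in parallel instead of one frame at a time.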
FastSpeech enjoys several advantages [290]: 1) extremely fast inference speed (e.g., 270x inference speedup on mel-spectrogram generation and 38x speedup on waveform generation); 2) robust speech synthesis without word skipping and repeating issues; and 3) voice quality on par with previous autoregressive models. FastSpeech has been deployed in Microsoft Azure Text to Speech Service^10 to support all the languages and locales in Azure TTS^11.
FastSpeech leverages an explicit duration predictor to expand the phoneme hidden sequence to match the length of the mel-spectrogram sequence. How to obtain the duration labels for training the duration predictor is critical for the prosody and quality of the generated voice. We briefly review TTS models with duration prediction in Section 3.4.2. Next, we introduce some other improvements based on
FastSpeech. FastSpeech 2 [292] is proposed to further enhance FastSpeech, mainly in two aspects: 1) It uses ground-truth mel-spectrograms as training targets, instead of mel-spectrograms distilled from an autoregressive teacher model, which simplifies the two-stage teacher-student distillation pipeline in FastSpeech and avoids the information loss in the target mel-spectrograms caused by distillation. 2) It provides more variance information such as pitch, duration, and energy as decoder input, which eases the one-to-many mapping problem [139, 84, 382, 456] in text to speech^12. FastSpeech 2 achieves better voice quality than FastSpeech and maintains its advantages of fast, robust, and controllable speech synthesis^13. FastPitch [181] improves FastSpeech by using pitch information as decoder input, which shares a similar idea with the variance predictors in FastSpeech 2.
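To illustrate how such variance information can be injected, the sketch below predicts per-position pitch and energy, quantizes them into bins, and adds the corresponding embeddings back to the hidden sequence, loosely following the variance-adaptor idea described for FastSpeech 2. The layer sizes, bin counts, and value ranges are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Small convolutional predictor for a per-position scalar
    (duration, pitch, or energy); layer sizes are illustrative."""
    def __init__(self, hidden_dim=256, kernel_size=3, dropout=0.5):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):                      # x: [batch, time, hidden_dim]
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.linear(y).squeeze(-1)      # [batch, time] one scalar per position


class VarianceAdaptor(nn.Module):
    """Predict pitch and energy, quantize into bins, and add the
    corresponding embeddings to the hidden sequence (a sketch; bin
    count and the [0, 1] value range are assumptions)."""
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden_dim)
        self.energy_predictor = VariancePredictor(hidden_dim)
        self.register_buffer("pitch_bins", torch.linspace(0.0, 1.0, n_bins - 1))
        self.register_buffer("energy_bins", torch.linspace(0.0, 1.0, n_bins - 1))
        self.pitch_embed = nn.Embedding(n_bins, hidden_dim)
        self.energy_embed = nn.Embedding(n_bins, hidden_dim)

    def forward(self, x):                      # x: [batch, time, hidden_dim]
        pitch = self.pitch_predictor(x)
        energy = self.energy_predictor(x)
        x = x + self.pitch_embed(torch.bucketize(pitch, self.pitch_bins))
        x = x + self.energy_embed(torch.bucketize(energy, self.energy_bins))
        return x, pitch, energy
```

During training the predictors are supervised with pitch and energy extracted from the ground-truth speech, while at inference the predicted (and optionally user-adjusted) values are used, which is what makes the synthesis controllable.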
Other Acoustic Models (e.g., Flow, GAN, VAE, Diffusion)
Besides the above acoustic models, there are many other acoustic models [367, 22, 126, 187, 55], as shown in Table 2. Flow-based models have long been used in neural TTS. After the early successful applications on vocoders (e.g.,
^10 https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/
^11 https://techcommunity.microsoft.com/t5/azure-ai/neural-text-to-speech-extends-support-to-15-more-languages-with/ba-p/1505911
^12 One-to-many mapping in TTS refers to the fact that multiple possible speech sequences can correspond to the same text sequence, due to variations in speech such as pitch, duration, sound volume, and prosody.
^13 FastSpeech 2s [292] is proposed together with FastSpeech 2. Since it is a fully end-to-end text-to-waveform model, we introduce it in Section 2.5.