# Models introduction
A TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rule-based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable.

The main processes of TTS include:
1. Convert the original text into characters/phonemes through the `text frontend` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, LPC features, etc., through `Acoustic models`.
3. Convert acoustic features into waveforms through `Vocoders`.

A simple text frontend module can be implemented with rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders.

## Acoustic Models
### Modeling Objectives of Acoustic Models
Modeling the mapping relationship between text sequences and speech features:
```text
text   X = {x1, ..., xM}
speech Y = {y1, ..., yN}
```
Modeling Objectives:
```text
Ω* = argmax_Ω p(Y|X, Ω)
```
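
In practice, maximizing this likelihood usually amounts to minimizing a regression loss between predicted and ground-truth acoustic features, e.g. an `L1` loss on mel spectrograms (which corresponds to a Laplace output assumption). Below is a minimal, framework-agnostic sketch of this idea; the array names and shapes are illustrative assumptions, not PaddleSpeech code.

```python
import numpy as np

def l1_mel_loss(predicted_mel: np.ndarray, target_mel: np.ndarray) -> float:
    """L1 loss between predicted and ground-truth mel spectrograms.

    Both arrays are assumed to have shape (n_frames, n_mels). Minimizing
    this loss over the model parameters plays the role of maximizing
    p(Y|X, Ω) under a Laplace output assumption.
    """
    assert predicted_mel.shape == target_mel.shape
    return float(np.mean(np.abs(predicted_mel - target_mel)))

# Toy example with random "spectrograms" (100 frames x 80 mel bins).
rng = np.random.default_rng(0)
target = rng.standard_normal((100, 80))
predicted = target + 0.1 * rng.standard_normal((100, 80))
print(l1_mel_loss(predicted, target))
```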
### Modeling process of Acoustic Models
At present, there are two mainstream acoustic model structures.

- Frame-level acoustic model:
   - Duration model (M Tokens -> N Frames).
   - Acoustic decoder (N Frames -> N Frames).

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/frame_level_am.png" width=500 /> <br>
</div>

- Sequence-to-sequence acoustic model:
    - M Tokens -> N Frames.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/seq2seq_am.png" width=500 /> <br>
</div>

### Tacotron2
 [Tacotron](https://arxiv.org/abs/1703.10135)  is the first end-to-end acoustic model based on deep learning, and it is also the most widely used acoustic model.

[Tacotron2](https://arxiv.org/abs/1712.05884) is an improved version of Tacotron.
#### Tacotron
**Features of Tacotron:**
- Encoder.
   - CBHG.
   - Input: character sequence.
- Decoder.
    - Global soft attention.
    - Unidirectional RNN.
    - Autoregressive teacher-forcing training (real speech features as input).
    - Multi-frame prediction.
    - CBHG post-processing network.
    - Vocoder: Griffin-Lim.
<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/tacotron.png" width=700 /> <br>
</div>

**Advantages of Tacotron:**
- No need for complex text frontend analysis modules.
- No need for an additional duration model.
- Greatly simplifies the acoustic model construction process and reduces the dependence of speech synthesis tasks on domain knowledge.

**Disadvantages of Tacotron:**
- The CBHG module is complex and has a relatively large number of parameters.
- Global soft attention.
- Poor stability for speech synthesis tasks.
- In training, the fewer speech frames predicted at each decoding step, the more difficult the model is to train.
- The phase problem in Griffin-Lim causes speech distortion during waveform reconstruction (see the sketch after this list).
- The autoregressive decoder cannot be stopped during the generation process.
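
As a rough illustration of the Griffin-Lim issue, the sketch below resynthesizes audio from a mel spectrogram with librosa's Griffin-Lim based inversion; since the phase is only estimated iteratively, the result typically sounds audibly degraded. The file name and parameter values are placeholder assumptions.

```python
import librosa
import soundfile as sf

# Load a reference utterance (path and sample rate are placeholders).
wav, sr = librosa.load("sample.wav", sr=22050)

# Extract a mel spectrogram, as an acoustic model would predict it.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert mel -> linear magnitude -> waveform; the phase is estimated
# by Griffin-Lim, which is the source of the audible distortion.
wav_gl = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=32)

sf.write("sample_griffin_lim.wav", wav_gl, sr)
```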

#### Tacotron2
**Features of Tacotron2:**
- Reduction of parameters.
   - CBHG -> PostNet (3 Conv layers + BLSTM or 5 Conv layers).
   - Remove the Attention RNN.
- Fix the speech distortion caused by Griffin-Lim.
    - Use WaveNet as the vocoder.
- Improvements of PostNet.
   - CBHG -> 5 Conv layers.
   - Both the input and the output of the PostNet are compared with the real mel spectrogram using an `L2` loss.
   - Residual connection.
- Fix the stopping problem of the autoregressive decoder.
   - Predict whether decoding should stop at each step (stop token).
   - Set a threshold on the stop token to decide when to stop generating during decoding (see the sketch after the figure below).
- Stability of attention.
   - Location-aware attention.
   - The alignment matrix of the previous step is taken into account at step `t` of the decoder.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/tacotron2.png" width=500 /> <br>
</div>
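
Below is a minimal sketch of stop-token-based decoding. The `decoder_step` callable, the 80-bin frame size, and the `0.5` threshold are illustrative assumptions, not the PaddleSpeech implementation.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def autoregressive_decode(decoder_step, max_frames=1000, stop_threshold=0.5):
    """Decode mel frames one by one until the stop token fires.

    `decoder_step(prev_frame)` is assumed to return (mel_frame, stop_logit).
    """
    frames = []
    prev_frame = np.zeros(80)  # "go" frame, 80 mel bins assumed
    for _ in range(max_frames):
        mel_frame, stop_logit = decoder_step(prev_frame)
        frames.append(mel_frame)
        # Stop when the predicted stop probability exceeds the threshold.
        if sigmoid(stop_logit) > stop_threshold:
            break
        prev_frame = mel_frame
    return np.stack(frames)

# A dummy decoder_step for illustration: emits random frames and
# raises the stop logit after 20 steps.
step_counter = {"t": 0}
def dummy_decoder_step(prev_frame):
    step_counter["t"] += 1
    stop_logit = 5.0 if step_counter["t"] >= 20 else -5.0
    return np.random.randn(80), stop_logit

mel = autoregressive_decode(dummy_decoder_step)
print(mel.shape)  # (20, 80)
```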

You can find PaddleSpeech TTS's Tacotron2 with LJSpeech dataset example at [examples/ljspeech/tts0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0).

### TransformerTTS
**Disadvantages of the Tacotron models:**
- Encoder and decoder are relatively weak at modeling global information.
   - Vanishing gradient of RNN.
   - Fixed-length context modeling problem in CNN kernel.
- Training is relatively inefficient.
- The attention is not robust enough and the stability is poor.

Transformer TTS is a combination of Tacotron2 and Transformer.

#### Transformer
 [Transformer](https://arxiv.org/abs/1706.03762) is a seq2seq model based entirely on an attention mechanism.

**Features of Transformer:**
- Encoder.
    - `N` blocks based on self-attention mechanism.
    - Positional Encoding.
- Decoder.
    - `N` blocks based on self-attention mechanism.
    - Add a mask to the self-attention in blocks to hide the information after step `t` (see the masked-attention sketch below).
    - Cross-attention between the encoder and the decoder.
    - Positional Encoding.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/transformer.png" width=500 /> <br>
</div>
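
Below is a minimal numpy sketch of the masked (causal) self-attention used in the decoder blocks; the single head, identity projections, and shapes are simplifying assumptions for illustration.

```python
import numpy as np

def causal_self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with a causal mask.

    x has shape (T, d). Position t can only attend to positions <= t,
    which hides the information after step t during training.
    """
    T, d = x.shape
    q, k, v = x, x, x                      # identity projections for brevity
    scores = q @ k.T / np.sqrt(d)          # (T, T) attention scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)  # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

out = causal_self_attention(np.random.randn(6, 8))
print(out.shape)  # (6, 8)
```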

#### Transformer TTS
Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.

**Motivations:**
- The RNNs in Tacotron2 make training inefficient.
- The vanishing gradient of RNNs weakens the model's ability to model long-term context.
- Self-attention contains no recurrent structure, so it can be trained in parallel.
- Self-attention can model global context information well.

**Features of Transformer TTS:**
- Add a convolution-based PreNet in the encoder and decoder.
- A stop token in the decoder controls when to stop autoregressive generation.
- Add a PostNet after the decoder to improve the quality of the synthesized speech.
- Scaled positional encoding (see the sketch below).
    - A uniformly scaled positional encoding may have a negative impact on the input or output sequences.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/transformer_tts.png" width=500 /> <br>
</div>
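
Below is a minimal numpy sketch of the scaled positional encoding idea: a scalar weight (learnable in the real model) scales the sinusoidal encoding before it is added to the embeddings. Names and shapes are illustrative assumptions, not the PaddleSpeech implementation.

```python
import numpy as np

def sinusoidal_positional_encoding(T: int, d: int) -> np.ndarray:
    """Standard Transformer positional encoding of shape (T, d)."""
    pos = np.arange(T)[:, None]                        # (T, 1)
    i = np.arange(d)[None, :]                          # (1, d)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

def add_scaled_position(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the positional encoding scaled by a (learnable) scalar alpha."""
    T, d = x.shape
    return x + alpha * sinusoidal_positional_encoding(T, d)

x = np.random.randn(10, 16)        # 10 time steps, 16-dim embeddings
y = add_scaled_position(x, alpha=0.8)
print(y.shape)  # (10, 16)
```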

**Disadvantages of Transformer TTS:**
- The ability of positional encoding to represent timing information is still relatively weak.
- The ability to perceive local information is weak, while local information is closely related to pronunciation.
- Stability is worse than Tacotron2.

You can find PaddleSpeech TTS's Transformer TTS with LJSpeech dataset example at [examples/ljspeech/tts1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1).


### FastSpeech2
**Disadvantage of seq2seq models:**
- In attention-based seq2seq models, no matter how the attention mechanism is improved, it is difficult to avoid generation errors in the decoding stage.

Frame-level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame-level mapping does not have the uncertainty of sequence generation.

In seq2seq models, the concept of a duration model is used as the alignment module between the two sequences to replace attention, which avoids the uncertainty of attention and significantly improves the stability of seq2seq models.

#### FastSpeech
Instead of using the encoder-attention-decoder architecture adopted by most seq2seq-based autoregressive and non-autoregressive generation models, [FastSpeech](https://arxiv.org/abs/1905.09263) uses a novel feed-forward structure that can generate a target mel spectrogram sequence in parallel.

**Features of FastSpeech:**
- Encoder: based on Transformer.
- Replace the `FFN` with a `CNN` in the self-attention blocks.
    - Models local dependencies.
- Length regulator.
    - Use real phoneme durations to expand the output frames of the encoder during training.
- Non-autoregressive decoder.
    -  Improve generation efficiency.

**Length predictor:**
- Pretrain a TransformerTTS model.
- Get the alignment matrices of the training data.
- Calculate the phoneme durations according to the probabilities of the alignment matrix.
- Use the output of the encoder to predict the phoneme durations and calculate the MSE loss.
- Use real phoneme durations to expand the output frame of the encoder during training.
- Use phoneme durations predicted by the duration model to expand the frames during inference (see the length regulator sketch below).
    - Attention cannot control phoneme durations, while explicit duration modeling can control them through a duration coefficient (the duration coefficient is `1` during training).
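
Below is a minimal numpy sketch of the length regulator: each encoder output vector is repeated according to its phoneme duration, optionally scaled by a duration coefficient to control speaking speed. Names and shapes are illustrative assumptions, not the PaddleSpeech implementation.

```python
import numpy as np

def length_regulator(encoder_out: np.ndarray,
                     durations: np.ndarray,
                     alpha: float = 1.0) -> np.ndarray:
    """Expand M phoneme-level frames into N frame-level frames.

    encoder_out: (M, d) encoder outputs, one vector per phoneme.
    durations:   (M,) phoneme durations in frames.
    alpha:       duration coefficient (1.0 during training; <1 speeds up,
                 >1 slows down the synthesized speech).
    """
    scaled = np.maximum(1, np.round(durations * alpha).astype(int))
    # Repeat each phoneme vector `scaled[i]` times along the time axis.
    return np.repeat(encoder_out, scaled, axis=0)

enc = np.random.randn(4, 8)             # 4 phonemes, 8-dim features
dur = np.array([3, 1, 4, 2])            # durations in frames
mel_level = length_regulator(enc, dur)
print(mel_level.shape)                  # (10, 8)
```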

**Advantages of non-autoregressive decoder:**
- The built-in duration model of the seq2seq model has converted the input length `M` to the output length `N`.
- The length of the output is known, so the `stop token` is no longer needed, which avoids the problem of being unable to stop.
- Generation can be done in parallel (decoding time is less affected by sequence length).

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/fastspeech.png" width=800 /> <br>
</div>

#### FastPitch
[FastPitch](https://arxiv.org/abs/2006.06873) follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/fastpitch.png" width=500 /> <br>
</div>

#### FastSpeech2
**Disadvantages of FastSpeech:**
- The teacher-student distillation pipeline is complicated and time-consuming.
- The duration extracted from the teacher model is not accurate enough.
- The target mel spectrograms distilled from the teacher model suffer from information loss due to data simplification.

[FastSpeech2](https://arxiv.org/abs/2006.04558)  addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.

**Features of FastSpeech2:**
- Directly train the model with the ground-truth target instead of the simplified output from the teacher.
- Introduce more variation information of speech as conditional inputs: extract `duration`, `pitch`, and `energy` from the speech waveform, take them directly as conditional inputs during training, and use predicted values during inference.

FastSpeech2 is similar to FastPitch but introduces more variation information of the speech.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/fastspeech2.png" width=800 /> <br>
</div>

You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3). We use the token-averaged pitch and energy values introduced in FastPitch rather than the frame-level ones in FastSpeech2 (see the sketch below).
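
Below is a minimal numpy sketch of token averaging: frame-level pitch values are averaged over each phoneme's frames so that a single pitch value is used per token. Names and shapes are illustrative assumptions, not the PaddleSpeech implementation.

```python
import numpy as np

def average_by_duration(frame_pitch: np.ndarray,
                        durations: np.ndarray) -> np.ndarray:
    """Average frame-level pitch over each phoneme's frames.

    frame_pitch: (N,) pitch value per frame.
    durations:   (M,) frames per phoneme, with durations.sum() == N.
    Returns an (M,) array with one averaged pitch value per phoneme.
    """
    assert durations.sum() == len(frame_pitch)
    bounds = np.cumsum(durations)[:-1]
    return np.array([seg.mean() if len(seg) else 0.0
                     for seg in np.split(frame_pitch, bounds)])

pitch = np.array([100, 110, 120, 200, 210, 90, 95, 100, 105, 0], dtype=float)
dur = np.array([3, 2, 4, 1])
print(average_by_duration(pitch, dur))  # one value per phoneme
```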

### SpeedySpeech
[SpeedySpeech](https://arxiv.org/abs/2008.03802) simplifies the teacher-student architecture of FastSpeech and provides a fast and stable training procedure.

**Features of SpeedySpeech:**
- Use a simpler, smaller, and faster-to-train convolutional teacher model ([Deepvoice3](https://arxiv.org/abs/1710.07654) and [DCTTS](https://arxiv.org/abs/1710.08969)) with a single attention layer instead of the Transformer used in FastSpeech.
- Show that self-attention layers in the student network are not needed for high-quality speech synthesis.
- Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/speedyspeech.png" width=500 /> <br>
</div>

You can find PaddleSpeech TTS's SpeedySpeech with CSMSC dataset example at [examples/csmsc/tts2](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2).

## Vocoders
In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform.

Because the waveform changes rapidly over short time scales, the acoustic model usually avoids modeling the speech waveform directly. Instead, it models the spectral features extracted from the speech waveform, and the waveform is then reconstructed by the decoding part of the vocoder.

A vocoder usually consists of an encoder and a decoder for speech analysis and synthesis: the encoder estimates the parameters, and the decoder restores the speech.

Vocoders based on neural networks usually only perform the speech synthesis (decoding) part, learning the mapping relationship from spectral features to waveforms from training data.

### Categories of neural vocoders
- Autoregression
    - WaveNet
    - WaveRNN
    - LPCNet

- Flow
    - **WaveFlow**
    - WaveGlow
    - FloWaveNet
    - Parallel WaveNet
- GAN
    - WaveGAN
    - **Parallel WaveGAN**
    - **MelGAN**
    - **Style MelGAN**
    - **Multi Band MelGAN**
    - **HiFi GAN**
- VAE
    - Wave-VAE
- Diffusion
    - WaveGrad
    - DiffWave

**Motivations of GAN-based vocoders:**
- Modeling speech signals by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
- Although autoregressive neural vocoders can obtain high-quality synthetic speech, such models usually have a **slow generation speed**.
- The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of long-term context information.
- Vocoders based on Bipartite Transformation converge slowly and are complex.
- GAN-based vocoders don't need to make assumptions about the speech distribution and train through adversarial learning.

Here, we introduce a flow-based vocoder, WaveFlow, and a GAN-based vocoder, Parallel WaveGAN.

### WaveFlow
[WaveFlow](https://arxiv.org/abs/1912.01219) was proposed by Baidu Research.

**Features of WaveFlow:**
- It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
- It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development.

You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examples/ljspeech/voc0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0).

### Parallel WaveGAN
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN-based training method.

**Features of Parallel WaveGAN:**

- Use non-causal convolution instead of causal convolution.
- The input is random Gaussian white noise.
- The model is non-autoregressive both in training and prediction, which makes it fast.
- Multi-resolution STFT loss (see the sketch after the figure below).

<div align="left">
  <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/pwg.png" width=600 /> <br>
</div>
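
Below is a minimal sketch of the multi-resolution STFT loss idea: a spectral-convergence term and a log-magnitude term are computed at several STFT resolutions and averaged. The resolutions and function names are illustrative assumptions, not the PaddleSpeech implementation.

```python
import numpy as np
import librosa

def stft_loss(y_hat, y, n_fft, hop_length, win_length):
    """Spectral convergence + log STFT magnitude loss at one resolution."""
    s_hat = np.abs(librosa.stft(y_hat, n_fft=n_fft,
                                hop_length=hop_length, win_length=win_length))
    s = np.abs(librosa.stft(y, n_fft=n_fft,
                            hop_length=hop_length, win_length=win_length))
    sc = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + 1e-7)
    log_mag = np.mean(np.abs(np.log(s + 1e-7) - np.log(s_hat + 1e-7)))
    return sc + log_mag

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    """Average the single-resolution losses over several STFT settings."""
    losses = [stft_loss(y_hat, y, *res) for res in resolutions]
    return sum(losses) / len(losses)

# Toy example: compare a clean sine wave with a noisy version of itself.
t = np.linspace(0, 1, 22050, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t).astype(np.float32)
noisy = clean + 0.05 * np.random.randn(len(t)).astype(np.float32)
print(multi_resolution_stft_loss(noisy, clean))
```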

You can find PaddleSpeech TTS's Parallel WaveGAN with CSMSC example at [examples/csmsc/voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1).