([简体中文](./quick_start_cn.md)|English)
# Quick Start of Text-to-Speech
The examples in PaddleSpeech are mainly organized by dataset. The TTS datasets we mainly use are:
* CSMSC (Mandarin single speaker)
* AISHELL3 (Mandarin multiple speakers)
* LJSpeech (English single speaker)
* VCTK (English multiple speakers)

The models in PaddleSpeech TTS have the following mapping relationship:
* tts0 - Tacotron2
* tts1 - TransformerTTS
* tts2 - SpeedySpeech
* tts3 - FastSpeech2
* voc0 - WaveFlow
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - MultiBand MelGAN
* voc4 - Style MelGAN
* voc5 - HiFiGAN
* vc0 - Tacotron2 Voice Clone with GE2E
* vc1 - FastSpeech2 Voice Clone with GE2E

## Quick Start

Let's take FastSpeech2 + Parallel WaveGAN with the CSMSC dataset as an example: [examples/csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc)
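
If you have not done so already, a quick way to get the example code is to clone the repository and work from its root (a minimal sketch; your local paths may differ):
```bash
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
```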

### Train Parallel WaveGAN with CSMSC
- Go to the directory
    ```bash
    cd examples/csmsc/voc1
    ```
- Source env
    ```bash
    source path.sh
    ```
    **You must do this before doing anything else.**
    This sets `MAIN_ROOT` to the project directory and `MODEL` to `parallelwave_gan`.

- Main entry point
    ```bash
    bash run.sh
    ```
    This is just a demo; please make sure the source data is well prepared and that each `step` finishes successfully before running the next one.

### Train FastSpeech2 with CSMSC
- Go to the directory
    ```bash
    cd examples/csmsc/tts3
    ```
- Source env
    ```bash
    source path.sh
    ```
    **You must do this before doing anything else.**
    This sets `MAIN_ROOT` to the project directory and `MODEL` to `fastspeech2`.
- Main entry point
    ```bash
    bash run.sh
    ```
    This is just a demo; please make sure the source data is well prepared and that each `step` finishes successfully before running the next one.

The steps in `run.sh` mainly include:
- source the environment (`path.sh`).
- preprocess the dataset.
- train the model.
- synthesize waveforms from `metadata.jsonl`.
- synthesize waveforms from a text file (acoustic-model examples only).
- run inference with a static model (optional).
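
The example scripts usually let you run these steps one at a time through Kaldi-style stage options. A hedged sketch follows; the option names and stage numbering are assumptions, so check the `run.sh` of the example you are actually running:
```bash
# from an example directory (e.g. examples/csmsc/tts3), after `source path.sh`
# run only data preprocessing (assumed to be stage 0), then stop
bash run.sh --stage 0 --stop-stage 0
# continue with training once the preprocessed data looks correct
bash run.sh --stage 1 --stop-stage 1
```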

For more details, see the `README.md` in each example directory.

## Pipeline of TTS
This section shows how to use the pretrained models provided by TTS and run inference with them.

Pretrained models in TTS are provided in an archive. Extract it to get a folder like this:
**Acoustic Models:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
├── speech_stats.npy
├── phone_id_map.txt
├── spk_id_map.txt (optional)
└── tone_id_map.txt (optional)
```
**Vocoders:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
└── stats.npy
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the number of training steps.
- `*_stats.npy` is the statistics file of a feature that was normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme IDs.
- `tone_id_map.txt` is the map of tones to tone IDs, used when tones and phones are split before training acoustic models (for example, in our csmsc/speedyspeech example).
- `spk_id_map.txt` is the map of speakers to speaker IDs in multi-speaker acoustic models (for example, in our aishell3/fastspeech2 example).
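
As a quick sanity check, you can inspect these files directly. A minimal sketch, assuming the extracted folder is `fastspeech2_nosil_baker_ckpt_0.4` as in the example below:
```python
from pathlib import Path
import numpy as np

ckpt_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")

# phone_id_map.txt: one "<phone> <id>" pair per line
with open(ckpt_dir / "phone_id_map.txt") as f:
    phone_id_map = dict(line.strip().split() for line in f)
print(f"{len(phone_id_map)} phonemes in the vocabulary")

# speech_stats.npy: per-dimension mean and standard deviation of the mel feature
mu, std = np.load(ckpt_dir / "speech_stats.npy")
print(mu.shape, std.shape)
```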

The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.

```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.modules.normalizer import ZScore
# examples/fastspeech2/baker/frontend.py
from frontend import Frontend

# load the pretrained model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
    phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
    fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
    idim=vocab_size, odim=odim, **fastspeech2_config["model"])
# args.fastspeech2_checkpoint is the path to a snapshot_iter_*.pdz checkpoint file
model.set_state_dict(
    paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()

# load stats file
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)

# construct a prediction object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)

# load Chinese Frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")

# text to spectrogram
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# The output of Chinese text frontend is segmented
for part_phone_ids in phone_ids:
    with paddle.no_grad():
        temp_mel = fastspeech2_inference(part_phone_ids)
        if flags == 0:
            mel = temp_mel
            flags = 1
        else:
            mel = paddle.concat([mel, temp_mel])
```

### Vocoder (spectrogram to wave)
The code below shows how to use a `Parallel WaveGAN` model. As in the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (in wav format).

```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore

# load the pretrained model
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
    pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
# args.pwg_params is the path to the vocoder's parameter file
vocoder.set_state_dict(paddle.load(args.pwg_params))
vocoder.remove_weight_norm()
vocoder.eval()

# load stats file
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)

# construct a prediction object
pwg_inference = PWGInference(pwg_normalizer, vocoder)

# spectrogram to wave
# `mel` is the spectrogram generated by the acoustic model above,
# and `audio_path` is the path of the output wav file
wav = pwg_inference(mel)
sf.write(
    audio_path,
    wav.numpy(),
    samplerate=fastspeech2_config.fs)
```
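
Putting the two stages together, here is a minimal end-to-end sketch. It reuses the `frontend`, `fastspeech2_inference`, and `pwg_inference` objects constructed above; the output path is just an example:
```python
# text -> phone ids -> mel spectrogram -> waveform
sentence = "你好吗?"
phone_ids = frontend.get_input_ids(sentence, merge_sentences=True)["phone_ids"]

wav_parts = []
with paddle.no_grad():
    # the Chinese frontend may split long text into several segments
    for part_phone_ids in phone_ids:
        mel = fastspeech2_inference(part_phone_ids)
        wav_parts.append(pwg_inference(mel))
wav = paddle.concat(wav_parts)

sf.write("output.wav", wav.numpy(), samplerate=fastspeech2_config.fs)
```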