([简体中文](./README_cn.md)|English)

# Streaming Speech Synthesis Service

## Introduction
This demo shows how to start a streaming speech synthesis (TTS) service and how to access it. Both steps can be done with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of Python code.

For service interface definition, please check:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)

## Usage
### 1. Installation
See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

It is recommended to use **paddlepaddle 2.4rc** or above.

You can install paddlespeech using one of three methods: easy, medium, or hard.

**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**

### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both **http and websocket** are supported.
- `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
    - This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
    - The engine type takes one of two values: **online** and **online-onnx**. `online` uses Python dynamic graph inference; `online-onnx` uses onnxruntime for inference and is faster.
- Supported acoustic models (AM) for the streaming TTS engine: **fastspeech2 and fastspeech2_cnndecoder**; supported vocoders (voc): **hifigan and mb_melgan**.
- In streaming AM inference, one chunk of data is inferred at a time to achieve a streaming effect. `am_block` is the number of valid frames in a chunk, and `am_pad` is the number of frames added before and after `am_block` in a chunk. `am_pad` exists to eliminate errors caused by streaming inference, so that chunking does not affect the quality of the synthesized audio.
    - fastspeech2 does not support streaming AM inference, so `am_pad` and `am_block` have no effect on it.
    - fastspeech2_cnndecoder supports streaming inference. With `am_pad=12`, the audio synthesized by streaming inference is consistent with the non-streaming result.
- In streaming voc inference, one chunk of data is inferred at a time to achieve a streaming effect. `voc_block` is the number of valid frames in a chunk, and `voc_pad` is the number of frames added before and after `voc_block` in a chunk. `voc_pad` exists to eliminate errors caused by streaming inference, so that chunking does not affect the quality of the synthesized audio.
    - Both hifigan and mb_melgan support streaming voc inference.
    - For mb_melgan, `voc_pad=14` makes the streaming audio consistent with the non-streaming audio. `voc_pad` can be set as low as 7 without audible artifacts; below 7, the synthesized audio sounds abnormal.
    - For hifigan, `voc_pad=19` makes the streaming audio consistent with the non-streaming audio; with `voc_pad=14`, the synthesized audio has no audible artifacts.
    - Pad calculation method of streaming vocoder in PaddleSpeech: [AIStudio tutorial](https://aistudio.baidu.com/aistudio/projectdetail/4151335)
- Inference speed: mb_melgan > hifigan; Audio quality: mb_melgan < hifigan
- **Note:** If the service starts normally inside a container but the client cannot reach its IP address, try replacing the `host` address in the configuration file with the local IP address.
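
The block and pad parameters described above (`am_block`/`am_pad` and `voc_block`/`voc_pad`) can be pictured with a small, self-contained sketch. This is not PaddleSpeech's own implementation; the function name `chunk_bounds` and the frame counts used in the demo are assumptions chosen for illustration. It only shows the general idea: each chunk is inferred over its valid frames plus up to `pad` frames of context on each side, and only the valid frames are kept.

```python
from typing import List, Tuple


def chunk_bounds(num_frames: int, block: int, pad: int) -> List[Tuple[int, int, int, int]]:
    """Split `num_frames` frames into chunks of `block` valid frames.

    Returns one tuple per chunk:
    (padded_start, padded_end, valid_start, valid_end).
    The padded range extends the valid range by up to `pad` frames on each
    side, clipped to the boundaries of the sequence.
    """
    if block <= 0 or pad < 0:
        raise ValueError("block must be positive and pad must be non-negative")
    bounds = []
    for valid_start in range(0, num_frames, block):
        valid_end = min(valid_start + block, num_frames)
        padded_start = max(valid_start - pad, 0)
        padded_end = min(valid_end + pad, num_frames)
        bounds.append((padded_start, padded_end, valid_start, valid_end))
    return bounds


if __name__ == "__main__":
    # Hypothetical numbers: 100 frames, 36 valid frames per chunk, 14 pad frames.
    for p_start, p_end, v_start, v_end in chunk_bounds(100, 36, 14):
        print(f"infer frames [{p_start}, {p_end}), keep frames [{v_start}, {v_end})")
```

A larger pad gives each chunk more context at the cost of more repeated computation, which matches the trade-off between the "consistent" and the minimum pad values listed above.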

### 3. Streaming speech synthesis server and client using http protocol
#### 3.1 Server Usage
- Command Line (Recommended)

  Start the service (the configuration file uses http by default):
  ```bash
  paddlespeech_server start --config_file ./conf/tts_online_application.yaml
  ```

  Usage:
  
  ```bash
  paddlespeech_server start --help
  ```
  Arguments:
  - `config_file`: yaml file of the app. Default: ./conf/tts_online_application.yaml
  - `log_file`: log file. Default: ./log/paddlespeech.log

  Output:
  ```text
  [2022-04-24 20:05:27,887] [    INFO] - The first response time of the 0 warm up: 1.0123658180236816 s
  [2022-04-24 20:05:28,038] [    INFO] - The first response time of the 1 warm up: 0.15108466148376465 s
  [2022-04-24 20:05:28,191] [    INFO] - The first response time of the 2 warm up: 0.15317344665527344 s
  [2022-04-24 20:05:28,192] [    INFO] - **********************************************************************
  INFO:     Started server process [14638]
  [2022-04-24 20:05:28] [INFO] [server.py:75] Started server process [14638]
  INFO:     Waiting for application startup.
  [2022-04-24 20:05:28] [INFO] [on.py:45] Waiting for application startup.
  INFO:     Application startup complete.
  [2022-04-24 20:05:28] [INFO] [on.py:59] Application startup complete.
  INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  [2022-04-24 20:05:28] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)

  ```

- Python API
  ```python
  from paddlespeech.server.bin.paddlespeech_server import ServerExecutor

  server_executor = ServerExecutor()
  server_executor(
      config_file="./conf/tts_online_application.yaml", 
      log_file="./log/paddlespeech.log")
  ```

  Output:
  ```text
  [2022-04-24 21:00:16,934] [    INFO] - The first response time of the 0 warm up: 1.268730878829956 s
  [2022-04-24 21:00:17,046] [    INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
  [2022-04-24 21:00:17,151] [    INFO] - The first response time of the 2 warm up: 0.10413002967834473 s
  [2022-04-24 21:00:17,151] [    INFO] - **********************************************************************
  INFO:     Started server process [320]
  [2022-04-24 21:00:17] [INFO] [server.py:75] Started server process [320]
  INFO:     Waiting for application startup.
  [2022-04-24 21:00:17] [INFO] [on.py:45] Waiting for application startup.
  INFO:     Application startup complete.
  [2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete.
  INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  [2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  ```
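
Once the log shows `Uvicorn running on http://0.0.0.0:8092`, it can be useful to confirm that the port is reachable from the client machine before running the client. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is an assumption, and a successful TCP connection only shows that something is listening on the port, not that the TTS service itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1 and 8092 are the default client settings used in this README.
    print(is_port_open("127.0.0.1", 8092))
```

If this prints `False` for `127.0.0.1`, try the actual IP address of the machine running the service.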

#### 3.2 Streaming TTS client Usage
- Command Line (Recommended)

    Access http streaming TTS service:

    If `127.0.0.1` is not accessible, you need to use the actual service IP address.

    ```bash
    paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
    ```

    Usage:
  
    ```bash
    paddlespeech_client tts_online --help
    ```

    Arguments:
    - `server_ip`: Server IP. Default: 127.0.0.1
    - `port`: Server port. Default: 8092
    - `protocol`: Service protocol, choices: [http, websocket]. Default: http
    - `input` (required): Input text to synthesize.
    - `spk_id`: Speaker ID for multi-speaker text to speech. Default: 0
    - `output`: Path of the wave file written by the client. Default: None, which means the audio is not saved locally.
    - `play`: Whether to play the audio while it is being synthesized. Default: False, which means no playback. **Playing audio requires the pyaudio library**.
    - Currently, the code supports only the single-speaker model, so `spk_id` has no effect. Streaming TTS does not support changing the sample rate, speed, or volume.
    
    Output:
    ```text
    [2022-04-24 21:08:18,559] [    INFO] - tts http client start
    [2022-04-24 21:08:21,702] [    INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
    [2022-04-24 21:08:21,703] [    INFO] - 首包响应:0.18863153457641602 s
    [2022-04-24 21:08:21,704] [    INFO] - 尾包响应:3.1427218914031982 s
    [2022-04-24 21:08:21,704] [    INFO] - 音频时长:3.825 s
    [2022-04-24 21:08:21,704] [    INFO] - RTF: 0.8216266382753459
    [2022-04-24 21:08:21,739] [    INFO] - 音频保存至:output.wav

    ```

- Python API
  ```python
  from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor

  executor = TTSOnlineClientExecutor()
  executor(
      input="您好,欢迎使用百度飞桨语音合成服务。",
      server_ip="127.0.0.1",
      port=8092,
      protocol="http",
      spk_id=0,
      output="./output.wav",
      play=False)

  ```

  Output:
  ```text
  [2022-04-24 21:11:13,798] [    INFO] - tts http client start
  [2022-04-24 21:11:16,800] [    INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
  [2022-04-24 21:11:16,801] [    INFO] - 首包响应:0.18234872817993164 s
  [2022-04-24 21:11:16,801] [    INFO] - 尾包响应:3.0013909339904785 s
  [2022-04-24 21:11:16,802] [    INFO] - 音频时长:3.825 s
  [2022-04-24 21:11:16,802] [    INFO] - RTF: 0.7846773683635238
  [2022-04-24 21:11:16,837] [    INFO] - 音频保存至:./output.wav
  ```
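
The client logs in this section are printed in Chinese: 句子 is the input sentence, 首包响应 is the first-packet response time, 尾包响应 is the last-packet response time, 音频时长 is the audio duration, and 音频保存至 is the path the audio was saved to. The logged RTF (real-time factor) values are consistent with dividing the last-packet response time by the audio duration. The sketch below reproduces that arithmetic; the function name `real_time_factor` is an assumption, and the formula is inferred from the logged numbers rather than taken from the client source.

```python
def real_time_factor(last_packet_response_s: float, audio_duration_s: float) -> float:
    """Return the time needed to receive the full audio divided by the audio duration."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return last_packet_response_s / audio_duration_s


if __name__ == "__main__":
    # Values copied from the command-line client log in this section.
    rtf = real_time_factor(3.1427218914031982, 3.825)
    print(f"RTF: {rtf:.4f}")  # RTF: 0.8216
```

An RTF below 1 means the audio is synthesized faster than it plays back.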

### 4. Streaming speech synthesis server and client using websocket protocol
#### 4.1 Server Usage
- Command Line (Recommended)

  First modify the configuration file `conf/tts_online_application.yaml` and **set `protocol` to `websocket`**.

  Start the service:
  ```bash
  paddlespeech_server start --config_file ./conf/tts_online_application.yaml
  ```

  Usage:
  
  ```bash
  paddlespeech_server start --help
  ```
  Arguments:
  - `config_file`: yaml file of the app. Default: ./conf/tts_online_application.yaml
  - `log_file`: log file. Default: ./log/paddlespeech.log

  Output:
  ```text
  [2022-04-27 10:18:09,107] [    INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
  [2022-04-27 10:18:09,219] [    INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
  [2022-04-27 10:18:09,324] [    INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
  [2022-04-27 10:18:09,325] [    INFO] - **********************************************************************
  INFO:     Started server process [17600]
  [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600]
  INFO:     Waiting for application startup.
  [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup.
  INFO:     Application startup complete.
  [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
  INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  ```

- Python API
  ```python
  from paddlespeech.server.bin.paddlespeech_server import ServerExecutor

  server_executor = ServerExecutor()
  server_executor(
      config_file="./conf/tts_online_application.yaml", 
      log_file="./log/paddlespeech.log")
  ```

  Output:
  ```text
  [2022-04-27 10:20:16,660] [    INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
  [2022-04-27 10:20:16,773] [    INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
  [2022-04-27 10:20:16,878] [    INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
  [2022-04-27 10:20:16,878] [    INFO] - **********************************************************************
  INFO:     Started server process [23466]
  [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
  INFO:     Waiting for application startup.
  [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
  INFO:     Application startup complete.
  [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
  INFO:     Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  ```
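
Switching the `protocol` setting by hand is simple, but it can also be scripted. The sketch below uses only the standard library and rewrites the `protocol:` line in a yaml text. It assumes that `protocol` is a top-level key on its own line and that the value is written in single quotes; the helper name `set_protocol` and the sample excerpt are hypothetical, and any trailing comment on that line is dropped. Check both assumptions against your copy of `conf/tts_online_application.yaml` before relying on it.

```python
import re


def set_protocol(yaml_text: str, protocol: str) -> str:
    """Return `yaml_text` with the value of the top-level `protocol:` key replaced."""
    if protocol not in ("http", "websocket"):
        raise ValueError("protocol must be 'http' or 'websocket'")
    # Match an unindented `protocol:` line; indented (nested) keys are ignored.
    pattern = re.compile(r"^protocol:.*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(f"protocol: '{protocol}'", yaml_text)
    if count != 1:
        raise ValueError(f"expected exactly one top-level 'protocol:' line, found {count}")
    return new_text


if __name__ == "__main__":
    # Hypothetical excerpt; the real configuration file contains more keys.
    sample = "host: 0.0.0.0\nport: 8092\nprotocol: 'http'\n"
    print(set_protocol(sample, "websocket"))
```

To apply it to the real file, read the file contents, pass them through `set_protocol`, and write the result back, for example with `pathlib.Path.read_text` and `pathlib.Path.write_text`.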

#### 4.2 Streaming TTS client Usage
- Command Line (Recommended)

    Access websocket streaming TTS service:

    If `127.0.0.1` is not accessible, you need to use the actual service IP address.

    ```bash
    paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
    ```

    Usage:
  
    ```bash
    paddlespeech_client tts_online --help
    ```

    Arguments:
    - `server_ip`: Server IP. Default: 127.0.0.1
    - `port`: Server port. Default: 8092
    - `protocol`: Service protocol, choices: [http, websocket]. Default: http
    - `input` (required): Input text to synthesize.
    - `spk_id`: Speaker ID for multi-speaker text to speech. Default: 0
    - `output`: Path of the wave file written by the client. Default: None, which means the audio is not saved locally.
    - `play`: Whether to play the audio while it is being synthesized. Default: False, which means no playback. **Playing audio requires the pyaudio library**.
    - Currently, the code supports only the single-speaker model, so `spk_id` has no effect. Streaming TTS does not support changing the sample rate, speed, or volume.
    

    
    Output:
    ```text
    [2022-04-27 10:21:04,262] [    INFO] - tts websocket client start
    [2022-04-27 10:21:04,496] [    INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
    [2022-04-27 10:21:04,496] [    INFO] - 首包响应:0.2124948501586914 s
    [2022-04-27 10:21:07,483] [    INFO] - 尾包响应:3.199106454849243 s
    [2022-04-27 10:21:07,484] [    INFO] - 音频时长:3.825 s
    [2022-04-27 10:21:07,484] [    INFO] - RTF: 0.8363677006141812
    [2022-04-27 10:21:07,516] [    INFO] - 音频保存至:output.wav
    ```

- Python API
  ```python
  from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor

  executor = TTSOnlineClientExecutor()
  executor(
      input="您好,欢迎使用百度飞桨语音合成服务。",
      server_ip="127.0.0.1",
      port=8092,
      protocol="websocket",
      spk_id=0,
      output="./output.wav",
      play=False)
  ```

  Output:
  ```text
  [2022-04-27 10:22:48,852] [    INFO] - tts websocket client start
  [2022-04-27 10:22:49,080] [    INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
  [2022-04-27 10:22:49,080] [    INFO] - 首包响应:0.21017956733703613 s
  [2022-04-27 10:22:52,100] [    INFO] - 尾包响应:3.2304444313049316 s
  [2022-04-27 10:22:52,101] [    INFO] - 音频时长:3.825 s
  [2022-04-27 10:22:52,101] [    INFO] - RTF: 0.8445606356352762
  [2022-04-27 10:22:52,134] [    INFO] - 音频保存至:./output.wav
  ```
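
The http and websocket client commands differ only in the `--protocol` value, so a script that needs both can build the command from one helper. The following is a minimal sketch; the helper name `build_client_command` is an assumption, and it uses only the flags documented in the argument lists above. It builds the argument list without running anything, so it can be inspected first.

```python
import shlex
from typing import List, Optional


def build_client_command(text: str,
                         server_ip: str = "127.0.0.1",
                         port: int = 8092,
                         protocol: str = "http",
                         output: Optional[str] = None) -> List[str]:
    """Build the argument list for `paddlespeech_client tts_online`."""
    if protocol not in ("http", "websocket"):
        raise ValueError("protocol must be 'http' or 'websocket'")
    argv = [
        "paddlespeech_client", "tts_online",
        "--server_ip", server_ip,
        "--port", str(port),
        "--protocol", protocol,
        "--input", text,
    ]
    # Leave out --output when no path is given, so the client keeps its default.
    if output is not None:
        argv += ["--output", output]
    return argv


if __name__ == "__main__":
    argv = build_client_command(
        "您好,欢迎使用百度飞桨语音合成服务。",
        protocol="websocket",
        output="output.wav")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

To run the command from Python, pass the list to `subprocess.run(argv, check=True)`; passing a list avoids shell quoting problems with the input text.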