# Real-Time Voice Cloning
This repository is an implementation of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) (SV2TTS) with a vocoder that works in real time. Feel free to check [my thesis](https://puu.sh/DHgBg.pdf) if you're curious or if you're looking for information I haven't documented yet (don't hesitate to open an issue for that too). In particular, I recommend taking a quick look at the figures after the introduction.

SV2TTS is a three-stage deep learning framework that creates a numerical representation of a voice from a few seconds of audio and uses it to condition a text-to-speech model trained to generalize to new voices.

**Video demonstration:**

[![Toolbox demo](https://i.imgur.com/Ixy13b7.png)](https://www.youtube.com/watch?v=-O_hYhToKoA)



### Papers implemented  
| URL | Designation | Title | Implementation source |
| --- | ----------- | ----- | --------------------- |
|[**1806.04558**](https://arxiv.org/pdf/1806.04558.pdf) | **SV2TTS** | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo |
|[1802.08435](https://arxiv.org/pdf/1802.08435.pdf) | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | [fatchord/WaveRNN](https://github.com/fatchord/WaveRNN) |
|[1712.05884](https://arxiv.org/pdf/1712.05884.pdf) | Tacotron 2 (synthesizer) | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions | [Rayhane-mamah/Tacotron-2](https://github.com/Rayhane-mamah/Tacotron-2) |
|[1710.10467](https://arxiv.org/pdf/1710.10467.pdf) | GE2E (encoder)| Generalized End-To-End Loss for Speaker Verification | This repo |



## Quick start
### Requirements
You will need the following whether you plan to use the toolbox only or to retrain the models.

**Python 3.7**. Python 3.6 might work too, but I wouldn't go lower because I make extensive use of pathlib.

Run `pip install -r requirements.txt` to install the necessary packages. Additionally, you will need [PyTorch](https://pytorch.org/get-started/locally/).

A GPU is *highly* recommended (CPU-only execution is not currently implemented), but you don't necessarily need a high-end GPU if you only want to use the toolbox.
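
The requirements above can be checked with a few lines of Python before going further. This is only a minimal sketch: `check_environment` is a hypothetical helper that is not part of this repository, and the only PyTorch call it assumes is `torch.cuda.is_available()`.

```python
import sys


def check_environment(min_python=(3, 6)):
    """Hypothetical helper (not part of this repo): summarize the local setup."""
    report = {
        # Compare only (major, minor) against the minimum version.
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": False,
        "cuda_available": False,
    }
    try:
        import torch  # installed separately from requirements.txt
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```

If `cuda_available` comes back `False`, PyTorch does not see a usable GPU, which matters here because CPU-only execution is not implemented.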

### Pretrained models
Download the latest pretrained models [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models).

### Preliminary
Before you download any dataset, you can begin by testing your configuration with:

`python demo_cli.py`

If all tests pass, you're good to go.

### Datasets
If you only want to play with the toolbox, I recommend downloading just [`LibriSpeech/train-clean-100`](http://www.openslr.org/resources/12/train-clean-100.tar.gz). Extract the contents as `<datasets_root>/LibriSpeech/train-clean-100`, where `<datasets_root>` is a directory of your choosing. Other datasets are supported in the toolbox; see [here](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training#datasets). You're free not to download any dataset, but then you will need your own data as audio files, or you will have to record it with the toolbox.
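
To confirm that the archive was extracted to the expected location, a quick check along these lines may help. It is a sketch only: `count_librispeech_speakers` is a hypothetical helper, not something this repository provides, and it relies on the standard LibriSpeech layout of one directory per speaker ID directly under the subset folder.

```python
from pathlib import Path


def count_librispeech_speakers(datasets_root):
    """Hypothetical helper (not part of this repo).

    Returns the number of speaker directories found under
    <datasets_root>/LibriSpeech/train-clean-100, or None if that
    directory does not exist.
    """
    subset_dir = Path(datasets_root) / "LibriSpeech" / "train-clean-100"
    if not subset_dir.is_dir():
        return None
    # Each immediate subdirectory corresponds to one speaker ID.
    return sum(1 for entry in subset_dir.iterdir() if entry.is_dir())
```

A return value of `None` usually means the archive was extracted one level too high or too low relative to `<datasets_root>`.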

### Toolbox
You can then try the toolbox:

`python demo_toolbox.py -d <datasets_root>`  
or  
`python demo_toolbox.py`  

depending on whether you downloaded any datasets.
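
If you launch the toolbox from a script, the choice between the two commands above could be made automatically. The following is a rough sketch under that assumption: `toolbox_command` is a hypothetical helper, and the only flag it uses is the `-d` option shown above.

```python
import sys
from pathlib import Path


def toolbox_command(datasets_root=None):
    """Hypothetical helper (not part of this repo): build the launch command.

    Adds `-d <datasets_root>` only when the given directory exists.
    """
    command = [sys.executable, "demo_toolbox.py"]
    if datasets_root is not None and Path(datasets_root).is_dir():
        command += ["-d", str(datasets_root)]
    return command
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root.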

## Wiki
- **How it all works** (coming soon!)
- [**Training models yourself**](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training)
- **Training with other data/languages** (coming soon!)
- [**TODO and planned features**](https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/TODO-&-planned-features) 

## Contribution
Feel free to open issues or PRs for any problems you encounter, typos you spot, or aspects you find confusing.