Unverified commit 8f71d678 authored by Rami Mouro, committed by GitHub

Added no_mp3_support argument and added a check for ffmpeg installation (#517)

Parent a32962bb
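The change boils down to a small capability probe: try to decode one of the bundled mp3 samples with librosa and treat audioread's NoBackendError as "no mp3 backend (ffmpeg) installed". A minimal standalone sketch of that pattern, assuming the probe file exists on disk (`check_mp3_support` is a name chosen here for illustration, not one from the diff):

```
import sys

import librosa
from audioread.exceptions import NoBackendError

def check_mp3_support(probe_file="samples/1320_00000.mp3"):
    """Return True if librosa can decode mp3 files on this machine."""
    try:
        # librosa delegates mp3 decoding to audioread, which needs an
        # external backend such as ffmpeg; wav and flac do not.
        librosa.load(probe_file)
    except NoBackendError:
        return False
    return True

if __name__ == "__main__":
    if not check_mp3_support():
        print("Please install ffmpeg, or pass --no_mp3_support to skip mp3 files.")
        sys.exit(-1)
```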
......@@ -11,7 +11,7 @@ import librosa
import argparse
import torch
import sys
from audioread.exceptions import NoBackendError
if __name__ == '__main__':
## Info & args
......@@ -34,12 +34,23 @@ if __name__ == '__main__':
"If True, audio won't be played.")
parser.add_argument("--seed", type=int, default=None, help=\
"Optional random number seed value to make toolbox deterministic.")
parser.add_argument("--no_mp3_support", action="store_true", help=\
"If True, disallows loading mp3 files to prevent audioread errors when ffmpeg is not installed.")
args = parser.parse_args()
print_args(args, parser)
if not args.no_sound:
import sounddevice as sd
if not args.no_mp3_support:
try:
librosa.load("samples/1320_00000.mp3")
except NoBackendError:
print("Librosa will be unable to open mp3 files if additional software is not installed.\n"
"Please install ffmpeg or add the '--no_mp3_support' option to proceed without support for mp3 files.")
exit(-1)
print("Running a test of your configuration...\n")
if torch.cuda.is_available():
device_id = torch.cuda.current_device()
gpu_properties = torch.cuda.get_device_properties(device_id)
......@@ -123,8 +134,10 @@ if __name__ == '__main__':
message = "Reference voice: enter an audio filepath of a voice to be cloned (mp3, " \
"wav, m4a, flac, ...):\n"
in_fpath = Path(input(message).replace("\"", "").replace("\'", ""))
if in_fpath.suffix.lower() == ".mp3" and args.no_mp3_support:
print("Can't Use mp3 files please try again:")
continue
## Computing the embedding
# First, we load the wav using the function that the speaker encoder provides. This is
# important: there is preprocessing that must be applied.
......
......@@ -28,6 +28,8 @@ if __name__ == '__main__':
"overhead but allows to save some GPU memory for lower-end GPUs.")
parser.add_argument("--seed", type=int, default=None, help=\
"Optional random number seed value to make toolbox deterministic.")
parser.add_argument("--no_mp3_support", action="store_true", help=\
"If True, no mp3 files are allowed.")
args = parser.parse_args()
print_args(args, parser)
......
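With the flag in place, either demo can be started without ffmpeg installed; the script names below are assumed from context, since the file paths are elided in this diff:

```
# Refuse mp3 inputs instead of requiring an ffmpeg install
# (script names assumed; they are not shown in this diff):
python demo_cli.py --no_mp3_support
python demo_toolbox.py --no_mp3_support
```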
The audio files in this folder are provided for toolbox testing and
benchmarking purposes. These are the same reference utterances
used by the SV2TTS authors to generate the audio samples located at:
https://google.github.io/tacotron/publications/speaker_adaptation/index.html
The `p240_00000.mp3` and `p260_00000.mp3` files are compressed
versions of audios from the VCTK corpus available at:
https://datashare.is.ed.ac.uk/handle/10283/3443
VCTK.txt contains the copyright notices and licensing information.
The `1320_00000.mp3`, `3575_00000.mp3`, `6829_00000.mp3`
and `8230_00000.mp3` files are compressed versions of audios
from the LibriSpeech dataset available at: https://openslr.org/12
For these files, the following notice applies:
```
LibriSpeech (c) 2014 by Vassil Panayotov
LibriSpeech ASR corpus is licensed under a
Creative Commons Attribution 4.0 International License.
See <http://creativecommons.org/licenses/by/4.0/>.
```
---------------------------------------------------------------------
CSTR VCTK Corpus
English Multi-speaker Corpus for CSTR Voice Cloning Toolkit
(Version 0.92)
RELEASE September 2019
The Centre for Speech Technology Research
University of Edinburgh
Copyright (c) 2019
Junichi Yamagishi
jyamagis@inf.ed.ac.uk
---------------------------------------------------------------------
Overview
This CSTR VCTK Corpus includes speech data uttered by 110 English
speakers with various accents. Each speaker reads out about 400
sentences, which were selected from a newspaper, the rainbow passage
and an elicitation paragraph used for the speech accent archive.
The newspaper texts were taken from Herald Glasgow, with permission
from Herald & Times Group. Each speaker has a different set of the
newspaper texts selected based a greedy algorithm that increases the
contextual and phonetic coverage. The details of the text selection
algorithms are described in the following paper:
C. Veaux, J. Yamagishi and S. King,
"The voice bank corpus: Design, collection and data analysis of
a large regional accent speech database,"
https://doi.org/10.1109/ICSDA.2013.6709856
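The cited paper gives the full procedure; purely as an illustration (this is not code from VCTK or from this commit), a coverage-driven greedy selection loop has roughly this shape, where `units(s)` stands in for whatever contextual/phonetic features a sentence `s` is scored on:

```
def greedy_select(candidates, n_sentences, units):
    """Pick sentences that each add the most not-yet-covered units."""
    covered, selected = set(), []
    pool = list(candidates)
    while pool and len(selected) < n_sentences:
        # Score each remaining sentence by how many new units it adds.
        best = max(pool, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # no sentence adds anything new
        covered |= units(best)
        selected.append(best)
        pool.remove(best)
    return selected
```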
The rainbow passage and elicitation paragraph are the same for all
speakers. The rainbow passage can be found at International Dialects
of English Archive:
(http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation
paragraph is identical to the one used for the speech accent archive
(http://accent.gmu.edu). The details of the speech accent archive
can be found at
http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf
All speech data was recorded using an identical recording setup: an
omni-directional microphone (DPA 4035) and a small-diaphragm condenser
microphone with a very wide bandwidth (Sennheiser MKH 800), at a 96 kHz
sampling frequency and 24 bits, in a hemi-anechoic chamber at
the University of Edinburgh. (However, two speakers, p280 and p315,
had technical issues with the recordings made using the MKH 800.)
All recordings were converted to 16 bits, downsampled to
48 kHz, and manually end-pointed.
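That last conversion step (48 kHz, 16-bit PCM) is easy to reproduce with standard Python audio tooling; a minimal sketch, with placeholder file names and assuming librosa and soundfile are installed:

```
import librosa
import soundfile as sf

# Load at the native rate, resample to 48 kHz, write as 16-bit PCM.
audio, sr = librosa.load("recording_96k.wav", sr=None)  # placeholder path
audio_48k = librosa.resample(audio, orig_sr=sr, target_sr=48000)
sf.write("recording_48k.wav", audio_48k, 48000, subtype="PCM_16")
```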
This corpus was originally aimed at HMM-based text-to-speech synthesis
systems, especially for speaker-adaptive HMM-based speech synthesis
that uses average voice models trained on multiple speakers and speaker
adaptation technologies. This corpus is also suitable for DNN-based
multi-speaker text-to-speech synthesis systems and waveform modeling.
COPYING
This corpus is licensed under the Creative Commons License: Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/legalcode
VCTK VARIANTS
There are several variants of the VCTK corpus:
Speech enhancement
- Noisy speech database for training speech enhancement algorithms and TTS models, where we added various types of noise to VCTK artificially: http://dx.doi.org/10.7488/ds/2117
- Reverberant speech database for training speech dereverberation algorithms and TTS models, where we added various types of reverberation to VCTK artificially: http://dx.doi.org/10.7488/ds/1425
- Noisy reverberant speech database for training speech enhancement algorithms and TTS models http://dx.doi.org/10.7488/ds/2139
- Device Recorded VCTK where speech signals of the VCTK corpus were played back and re-recorded in office environments using relatively inexpensive consumer devices http://dx.doi.org/10.7488/ds/2316
- The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) https://github.com/microsoft/MS-SNSD
ASV and anti-spoofing
- Spoofing and Anti-Spoofing (SAS) corpus: a collection of synthetic speech signals produced by nine techniques, two of them speech synthesis and seven of them voice conversion, all built using the VCTK corpus. http://dx.doi.org/10.7488/ds/252
- Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database: synthetic speech signals produced by ten techniques, used in the first Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015). http://dx.doi.org/10.7488/ds/298
- ASVspoof 2019: the database used in the 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2019). https://doi.org/10.7488/ds/2555
ACKNOWLEDGEMENTS
The CSTR VCTK Corpus was constructed by:
Christophe Veaux (University of Edinburgh)
Junichi Yamagishi (University of Edinburgh)
Kirsten MacDonald
The research leading to these results was partly funded from EPSRC
grants EP/I031022/1 (NST) and EP/J002526/1 (CAF), from the RSE-NSFC
grant (61111130120), and from the JST CREST (uDialogue).
Please cite this corpus as follows:
Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR),
University of Edinburgh
......@@ -9,7 +9,8 @@ import numpy as np
import traceback
import sys
import torch
import librosa
from audioread.exceptions import NoBackendError
# Use this directory structure for your datasets, or modify it to fit your needs
recognized_datasets = [
......@@ -39,7 +40,15 @@ recognized_datasets = [
MAX_WAVES = 15
class Toolbox:
def __init__(self, datasets_root, enc_models_dir, syn_models_dir, voc_models_dir, low_mem, seed):
def __init__(self, datasets_root, enc_models_dir, syn_models_dir, voc_models_dir, low_mem, seed, no_mp3_support):
if not no_mp3_support:
try:
librosa.load("samples/6829_00000.mp3")
except NoBackendError:
print("Librosa will be unable to open mp3 files if additional software is not installed.\n"
"Please install ffmpeg or add the '--no_mp3_support' option to proceed without support for mp3 files.")
exit(-1)
self.no_mp3_support = no_mp3_support
sys.excepthook = self.excepthook
self.datasets_root = datasets_root
self.low_mem = low_mem
......@@ -64,7 +73,7 @@ class Toolbox:
self.reset_ui(enc_models_dir, syn_models_dir, voc_models_dir, seed)
self.setup_events()
self.ui.start()
def excepthook(self, exc_type, exc_value, exc_tb):
traceback.print_exception(exc_type, exc_value, exc_tb)
self.ui.log("Exception: %s" % exc_value)
......@@ -149,7 +158,11 @@ class Toolbox:
else:
name = fpath.name
speaker_name = fpath.parent.name
if fpath.suffix.lower() == ".mp3" and self.no_mp3_support:
self.ui.log("Error: No mp3 file argument was passed but an mp3 file was used")
return
# Get the wav from the disk. We take the wav with the vocoder/synthesizer format for
# playback, so as to have a fair comparison with the generated audio
wav = Synthesizer.load_preprocess_wav(fpath)
......
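For context, the new constructor parameter would be threaded through from the demo entry point roughly as below; the caller is not part of this diff, so the import path and directory values here are placeholders:

```
from pathlib import Path
from toolbox import Toolbox  # import path assumed

# Hypothetical caller; directory values are placeholders, and
# no_mp3_support mirrors the new --no_mp3_support CLI flag.
Toolbox(
    datasets_root=Path("datasets"),
    enc_models_dir=Path("encoder/saved_models"),
    syn_models_dir=Path("synthesizer/saved_models"),
    voc_models_dir=Path("vocoder/saved_models"),
    low_mem=False,
    seed=None,
    no_mp3_support=True,
)
```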