# Data Preparation

## Generate Manifest

*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its dataset interface. A manifest file summarizes a set of speech data: each line is a [JSON](http://www.json.org/) object holding the metadata (e.g. file path, transcription, duration) of one audio clip, such as:

```
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
```
To use your own data, you only need to generate manifest files in this format. Given these manifests, training, inference and all other modules know where to find the audio files, as well as their metadata, including the transcription labels.

For an example of how to generate such manifest files, see `examples/librispeech/local/librispeech.py`, which downloads the data and generates manifest files for the LibriSpeech dataset.
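
For a custom dataset, the manifest format described above could be produced with a short script along the following lines. This is only a minimal sketch, not the logic of `librispeech.py`: the helper names `wav_duration` and `write_manifest` are hypothetical, and it assumes uncompressed PCM WAV files so that the duration can be read with the standard-library `wave` module (other formats such as FLAC would need an audio library).

```python
import json
import wave


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path):
    """Write one JSON object per line for each (audio_path, transcript) pair.

    `samples` is an iterable of (audio_path, transcript) tuples; this input
    shape is an assumption made for the example.
    """
    with open(manifest_path, "w", encoding="utf-8") as fout:
        for audio_path, transcript in samples:
            entry = {
                "audio_filepath": audio_path,
                "duration": wav_duration(audio_path),
                "text": transcript,
            }
            # ensure_ascii=False keeps non-English transcripts readable.
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Each call to `write_manifest` yields a file with the same three keys as the example lines above, one clip per line.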

## Compute Mean & Stddev for Normalizer

To perform z-score normalization (zero mean, unit standard deviation) on audio features, we first have to estimate the mean and standard deviation of the features from a sample of the training data:

```bash
python3 utils/compute_mean_std.py \
--num_samples 2000 \
--spectrum_type linear \
--manifest_path examples/librispeech/data/manifest.train \
--output_path examples/librispeech/data/mean_std.npz
```

This computes the mean and standard deviation of the power spectrum features over 2000 randomly sampled audio clips listed in `examples/librispeech/data/manifest.train` and saves the results to `examples/librispeech/data/mean_std.npz` for later use.
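
To make the z-score step concrete, the sketch below computes per-dimension statistics over a set of feature frames and applies them. It is an illustration only and not the code in `utils/compute_mean_std.py`: the function names, the list-of-lists feature layout, and the small `eps` guard against division by zero are all assumptions, and a real pipeline would normally use NumPy arrays instead of plain lists.

```python
import math


def compute_mean_std(frames):
    """Per-dimension mean and standard deviation over a list of frames.

    `frames` is a non-empty list of equal-length lists of floats
    (one inner list per feature frame).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return mean, std


def normalize(frames, mean, std, eps=1e-20):
    """Apply z-score normalization: (x - mean) / (std + eps) per dimension."""
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
        for frame in frames
    ]
```

The statistics are estimated once from training samples and then reused unchanged for validation and test data, which is why they are saved to a file.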


## Build Vocabulary

A vocabulary of possible characters is required to convert a transcription into a list of token indices for training and, in decoding, to convert a list of indices back into text. Such a character-based vocabulary can be built with `utils/build_vocab.py`.

```bash
python3 utils/build_vocab.py \
--count_threshold 0 \
--vocab_path examples/librispeech/data/eng_vocab.txt \
--manifest_paths examples/librispeech/data/manifest.train
```

This writes a vocabulary file `examples/librispeech/data/eng_vocab.txt` containing every character that appears in the transcriptions of `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
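
The idea of a character vocabulary and the index conversion it enables can be sketched as follows. This is a simplified illustration, not the implementation of `utils/build_vocab.py`: the function names are hypothetical, and it assumes that characters occurring fewer than `count_threshold` times are dropped and that the remaining characters are ordered by descending frequency.

```python
import json
from collections import Counter


def build_vocab(manifest_paths, count_threshold=0):
    """Collect the characters used in the "text" field of the manifests."""
    counter = Counter()
    for path in manifest_paths:
        with open(path, encoding="utf-8") as fin:
            for line in fin:
                line = line.strip()
                if line:
                    counter.update(json.loads(line)["text"])
    # Most frequent first; ties are broken by character for determinism.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, count in ordered if count >= count_threshold]


def encode(text, vocab):
    """Convert a transcription into a list of token indices."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in text]


def decode(indices, vocab):
    """Convert a list of token indices back into text."""
    return "".join(vocab[i] for i in indices)
```

In this sketch `encode` raises a `KeyError` for characters outside the vocabulary, so a threshold above 0 would require deciding how to handle the dropped characters.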