Unverified commit e13e1c18 authored by lijianshe02, committed by GitHub

add wav2lip doc (#150)

* add wav2lip doc
Parent e46af1cf
......@@ -43,6 +43,7 @@ GAN-Generative Adversarial Network, was praised by "the Father of Convolutional
* [AnimeGANv2](./docs/en_US/tutorials/animegan.md)
* [U-GAT-IT](./docs/en_US/tutorials/ugatit.md)
* [Photo2Cartoon](./docs/en_US/tutorials/photo2cartoon.md)
* [Wav2Lip](./docs/en_US/tutorials/wav2lip.md)
## Composite Application
......@@ -101,6 +102,14 @@ GAN-Generative Adversarial Network, was praised by "the Father of Convolutional
<img src='./docs/imgs/animeganv2.png' width='700' height='250'/>
</div>
### Lip-syncing
<div align='center'>
<img src='./docs/imgs/mona.gif' width='700'>
</div>
## Changelog
- v0.1.0 (2020.11.02)
......
......@@ -44,6 +44,7 @@ GAN -- Generative Adversarial Network, praised by "the Father of Convolutional Networks" **Yann LeCun (杨立昆)**
* [AnimeGANv2](./docs/zh_CN/tutorials/animegan.md)
* [U-GAT-IT](./docs/zh_CN/tutorials/ugatit.md)
* [Photo2Cartoon](docs/zh_CN/tutorials/photo2cartoon.md)
* [Wav2Lip](docs/zh_CN/tutorials/wav2lip.md)
## Composite Application
......@@ -113,6 +114,14 @@ GAN -- Generative Adversarial Network, praised by "the Father of Convolutional Networks" **Yann LeCun (杨立昆)**
<img src='./docs/imgs/animeganv2.png' width='700' height='250'/>
</div>
### Lip-syncing
<div align='center'>
<img src='./docs/imgs/mona.gif' width='700'>
</div>
## Changelog
- v0.1.0 (2020.11.02)
......
# Lip-syncing
## 1. Lip-syncing introduction
This work addresses the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Prior works excel at producing accurate lip movements on a static image or on videos of specific people seen during the training phase. Wav2Lip tackles this problem by learning from a powerful lip-sync discriminator, and the results show that the lip-sync accuracy of videos generated with the Wav2Lip model is almost as good as that of real synced videos.
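To make "learning from a powerful lip-sync discriminator" concrete, the sketch below shows, in a simplified and framework-agnostic form, the cosine-similarity score and binary cross-entropy penalty that a SyncNet-style expert typically produces. It is only an illustration of the idea, not the code used in this repository.

```python
# Illustrative sketch (not this repo's implementation) of the expert lip-sync
# discriminator idea: score audio/video embedding pairs with cosine similarity
# and penalize off-sync pairs with a binary cross-entropy loss.
import numpy as np

def sync_probability(video_emb, audio_emb, eps=1e-8):
    """Cosine similarity of L2-normalized embeddings, clipped to [0, 1]."""
    v = video_emb / (np.linalg.norm(video_emb) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    return float(np.clip(np.dot(v, a), 0.0, 1.0))

def sync_bce_loss(video_emb, audio_emb, label, eps=1e-8):
    """label = 1.0 for in-sync pairs, 0.0 for off-sync pairs."""
    p = sync_probability(video_emb, audio_emb, eps)
    return -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps))
```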
## 2. How to use
### 2.1 Test
The pretrained model can be downloaded from [here](https://paddlegan.bj.bcebos.com/models/wav2lip_hq.pdparams).
Run the following command to complete the lip-syncing task. The output is the lip-synced video.
```
cd applications
python tools/wav2lip.py --face ../../imgs/mona7s.mp4 --audio ../../imgs/guangquan.m4a --outfile pp_guangquan_mona7s.mp4
```
**Parameters:**
- face: path of the input image or video file containing faces.
- audio: path of the input audio file. Supported formats include `.wav`, `.mp3`, and `.m4a`; any file containing audio data that FFmpeg can read works.
- outfile: path of the output lip-synced video file.
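If you would rather hand the tool a plain `.wav`, the small helper below (an illustrative sketch; the paths and the mono/16 kHz settings are arbitrary choices, not requirements of this repo) extracts one with the FFmpeg command line:

```python
# Illustrative helper: extract a mono 16 kHz wav track from any FFmpeg-readable
# audio or video file before running the lip-syncing tool. Paths are placeholders.
import subprocess

def extract_wav(src, dst='audio.wav', sample_rate=16000):
    cmd = ['ffmpeg', '-y', '-i', src, '-vn', '-ac', '1', '-ar', str(sample_rate), dst]
    subprocess.run(cmd, check=True)
    return dst

extract_wav('../../imgs/guangquan.m4a', 'guangquan.wav')
```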
### 2.2 Training
1. Our model is trained on LRS2. See [here](https://github.com/Rudrabha/Wav2Lip#training-on-datasets-other-than-lrs2) for a few suggestions regarding training on other datasets.
The preprocessed LRS2 dataset folder structure should look like this:
```
preprocessed_root (lrs2_preprocessed)
├── list of folders
| ├── Folders with five-digit numbered video IDs
| │ ├── *.jpg
| │ ├── audio.wav
```
Place the LRS2 filelist (train, val, test) `.txt` files in the `filelists/` folder. A quick sanity check of this layout is sketched after the training commands below.
2. You can either train the model without the additional visual quality discriminator, or train it with the discriminator. For the former, run:
- For single GPU:
```
export CUDA_VISIBLE_DEVICES=0
python tools/main.py --config-file configs/wav2lip.yaml
```
- For multiple GPUs:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch \
--log_dir ./mylog_dd.log \
tools/main.py \
    --config-file configs/wav2lip.yaml
```
For the latter, run:
- For single GPU:
```
export CUDA_VISIBLE_DEVICES=0
python tools/main.py --config-file configs/wav2lip_hq.yaml
```
- For multiple GPUs:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch \
--log_dir ./mylog_dd.log \
tools/main.py \
    --config-file configs/wav2lip_hq.yaml
```
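Before launching training, it can help to verify the preprocessed layout described in step 1. The following is a small illustrative sketch (not part of this repo) that flags video folders missing `audio.wav` or extracted `.jpg` frames; the `lrs2_preprocessed` path is a placeholder:

```python
# Illustrative sanity check of the preprocessed dataset layout shown above:
# every video folder should contain audio.wav and at least one .jpg frame.
import glob
import os

def check_preprocessed(preprocessed_root):
    incomplete = []
    for group in sorted(os.listdir(preprocessed_root)):
        group_dir = os.path.join(preprocessed_root, group)
        if not os.path.isdir(group_dir):
            continue
        for vid in sorted(os.listdir(group_dir)):
            vid_dir = os.path.join(group_dir, vid)
            if not os.path.isdir(vid_dir):
                continue
            has_audio = os.path.isfile(os.path.join(vid_dir, 'audio.wav'))
            has_frames = bool(glob.glob(os.path.join(vid_dir, '*.jpg')))
            if not (has_audio and has_frames):
                incomplete.append(vid_dir)
    return incomplete

if __name__ == '__main__':
    print(check_preprocessed('lrs2_preprocessed'))
```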
### 2.3 Model
Model|Dataset|BatchSize|Inference speed|Download
---|:--:|:--:|:--:|:--:
wav2lip_hq|LRS2| 1 | 0.2853s/image (GPU: P40) | [model](https://paddlegan.bj.bcebos.com/models/wav2lip_hq.pdparams)
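For a rough sense of runtime: assuming frames are processed one at a time at the listed speed, a 10-second 25 fps clip (250 frames) takes about 250 × 0.2853 s ≈ 71 s of model time on a P40, excluding face detection and video muxing overhead.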
## 3. Results
![](../../imgs/mona.gif)
## 4. Reference
```
@inproceedings{10.1145/3394171.3413532,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413532},
doi = {10.1145/3394171.3413532},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {484–492},
numpages = {9},
keywords = {lip sync, talking face generation, video generation},
location = {Seattle, WA, USA},
series = {MM '20}
}
```
New file added
......@@ -8,7 +8,7 @@ ppgan.apps includes super resolution, frame interpolation, colorization, makeup transfer, image animation generation, face
* Colorization:
* [DeOldify](#ppgan.apps.DeOldifyPredictor)
* [DeepRemaster](#ppgan.apps.DeepRemasterPredictor)
* Frame interpolation:
* [DAIN](#ppgan.apps.DAINPredictor)
* Motion driving:
* [FirstOrder](#ppgan.apps.FirstOrderPredictor)
......@@ -16,6 +16,9 @@ ppgan.apps includes super resolution, frame interpolation, colorization, makeup transfer, image animation generation, face
* [FaceParse](#ppgan.apps.FaceParsePredictor)
* Anime style transfer:
* [AnimeGAN](#ppgan.apps.AnimeGANPredictor)
* Lip-syncing:
* [Wav2Lip](#ppgan.apps.Wav2LipPredictor)
## Common Usage
......@@ -431,3 +434,34 @@ ppgan.apps.MiDaSPredictor(output=None, weight_path=None)
> > - prediction (numpy.ndarray): returns the prediction result.
> > - pfm_f (str): if an output path is set, returns the path of the saved pfm file.
> > - png_f (str): if an output path is set, returns the path of the saved png file.
## ppgan.apps.Wav2LipPredictor
```python
ppgan.apps.Wav2LipPredictor(args)
```
> Builds an instance of the Wav2Lip model, which performs lip-syncing: given a video of a person and an audio clip, it makes the person's mouth movements match the input speech. The paper is A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild, available at http://arxiv.org/abs/2008.10010.
>
> **Example**
>
> ```
> from ppgan.apps import Wav2LipPredictor
> # The args parameter should be specified by argparse
> predictor = Wav2LipPredictor(args)
> predictor.run()
> ```
> **Parameters:**
> - args (ArgumentParser): contains all input arguments; users specify them via argparse when running the program. The main ones are:
> > - checkpoint_path (str): path of the model weights. Defaults to None, in which case the built-in pretrained model is downloaded automatically.
> > - face (str): path of the input image or video file containing the person.
> > - audio (str): path of the input audio file. Formats such as `.wav`, `.mp3`, and `.m4a` are supported; any format that ffmpeg can read works.
> > - outfile (str): path of the output video file.
>
> **Returns**
>
> > None.
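>
> A fuller usage sketch (not from the repo): it assumes the four documented fields above are all the predictor needs; the actual `tools/wav2lip.py` entry point may define additional arguments, and the paths below are placeholders.
>
> ```python
> import argparse
> from ppgan.apps import Wav2LipPredictor
>
> # Build the argparse-style args object described above.
> parser = argparse.ArgumentParser()
> parser.add_argument('--checkpoint_path', type=str, default=None)  # None: auto-download pretrained weights
> parser.add_argument('--face', type=str, default='examples/face.mp4')
> parser.add_argument('--audio', type=str, default='examples/speech.wav')
> parser.add_argument('--outfile', type=str, default='result.mp4')
> args = parser.parse_args()
>
> Wav2LipPredictor(args).run()
> ```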
# Wav2Lip
## 1. Introduction to Wav2Lip
Wav2Lip generates lip movements for the person in a video that are synchronized with an input audio track, so that the generated mouth shapes match the input speech. It can not only produce a lip-synced video matching the target speech from a static image, but can also directly re-synthesize the lips in a dynamic video so that the output matches the target speech. The key to Wav2Lip's breakthrough in accurate lip-audio synchronization is its lip-sync discriminator, which forces the generator to consistently produce accurate and realistic lip motion. In addition, it improves visual quality by feeding multiple consecutive frames (rather than a single frame) to the discriminator and by using a visual quality loss, not only a contrastive loss, to account for temporal correlation. Wav2Lip works for any face and any language, reaches high accuracy on arbitrary videos, blends seamlessly into the original video, and can also be used to animate cartoon faces.
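As an illustration of how the terms described above are usually combined for the generator, the sketch below uses the weighting reported in the paper (roughly 0.03 for the sync term and 0.07 for the adversarial visual-quality term); it is a reference sketch, not necessarily the configuration used in this repo.

```python
# Illustrative sketch of the generator objective: reconstruction + expert sync
# loss + visual-quality GAN loss. Weights follow the paper's reported values.
def wav2lip_generator_loss(recon_l1, sync_expert, gan_visual, s_w=0.03, s_g=0.07):
    """(1 - s_w - s_g) * reconstruction + s_w * sync + s_g * visual-quality loss."""
    return (1.0 - s_w - s_g) * recon_l1 + s_w * sync_expert + s_g * gan_visual
```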
## 2. Usage
### 2.1 Testing
The pretrained model can be downloaded here: [wav2lip_weight](https://paddlegan.bj.bcebos.com/models/wav2lip_hq.pdparams)
Run the following command to perform lip-syncing. When the program finishes, the lip-synced video file is written to the current folder. The video and audio files used below are provided in this project for demonstration:
```
cd applications
python tools/wav2lip.py --face ../../imgs/mona7s.mp4 --audio ../../imgs/guangquan.m4a --outfile pp_guangquan_mona7s.mp4
```
**Parameters:**
- face: the source video; the lips of the person in it are re-synthesized to match the audio.
- audio: the audio that drives the lip synthesis; the person in the video is lip-synced to this audio.
- outfile: path of the output lip-synced video file.
### 2.2 Training
1. Our model is trained on the LRS2 dataset. See [here](https://github.com/Rudrabha/Wav2Lip#training-on-datasets-other-than-lrs2) for suggestions on training with other datasets.
The LRS2 data fed to the Wav2Lip model should be organized as follows:
```
preprocessed_root (lrs2_preprocessed)
├── list of folders
| ├── Folders with five-digit numbered video IDs
| │ ├── *.jpg
| │ ├── audio.wav
```
Place the LRS2 (train, val, test) `.txt` filelists in the `filelists/` folder. A sketch that generates such lists from the folder layout above appears after the training commands below.
2. You can either train the model without the additional visual quality discriminator, or train it with the discriminator. For the former, run:
- Single-GPU training:
```
export CUDA_VISIBLE_DEVICES=0
python tools/main.py --config-file configs/wav2lip.yaml
```
- Multi-GPU training:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch \
--log_dir ./mylog_dd.log \
tools/main.py \
    --config-file configs/wav2lip.yaml
```
For the latter, run:
- Single-GPU training:
```
export CUDA_VISIBLE_DEVICES=0
python tools/main.py --config-file configs/wav2lip_hq.yaml
```
- Multi-GPU training:
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch \
--log_dir ./mylog_dd.log \
tools/main.py \
    --config-file configs/wav2lip_hq.yaml
```
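As mentioned in step 1, the training filelists can be generated from the preprocessed folder layout. The helper below is an illustrative sketch (not part of this repo); it assumes each line is the relative `group/video_id` path without extension, so check the filelists shipped with LRS2/Wav2Lip for the exact format expected by the data loader:

```python
# Illustrative helper that writes train/val filelists from the preprocessed
# folder layout shown above. The line format and split ratio are assumptions.
import os
import random

def write_filelists(preprocessed_root, out_dir='filelists', val_ratio=0.05):
    samples = []
    for group in sorted(os.listdir(preprocessed_root)):
        group_dir = os.path.join(preprocessed_root, group)
        if not os.path.isdir(group_dir):
            continue
        for vid in sorted(os.listdir(group_dir)):
            if os.path.isdir(os.path.join(group_dir, vid)):
                samples.append(f'{group}/{vid}')
    random.shuffle(samples)
    n_val = max(1, int(len(samples) * val_ratio))
    os.makedirs(out_dir, exist_ok=True)
    for name, subset in (('val', samples[:n_val]), ('train', samples[n_val:])):
        with open(os.path.join(out_dir, f'{name}.txt'), 'w') as f:
            f.write('\n'.join(subset) + '\n')

write_filelists('lrs2_preprocessed')
```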
### 2.3 Model
Model|Dataset|BatchSize|Inference speed|Download
---|:--:|:--:|:--:|:--:
wav2lip_hq|LRS2| 1 | 0.2853s/image (GPU: P40) | [model](https://paddlegan.bj.bcebos.com/models/wav2lip_hq.pdparams)
## 3. Results
![](../../imgs/mona.gif)
## 4. Reference
```
@inproceedings{10.1145/3394171.3413532,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413532},
doi = {10.1145/3394171.3413532},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {484–492},
numpages = {9},
keywords = {lip sync, talking face generation, video generation},
location = {Seattle, WA, USA},
series = {MM '20}
}
```