提交 07a1478c 编写于 作者: G gaotingquan 提交者: Tingquan Gao

Add instructions of training with DALI

上级 e8435f61
# Train with DALI
## Preface
[The NVIDIA Data Loading Library](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html) (DALI) is a library for data loading and pre-processing to accelerate deep learning applications. It can build Dataloader of Paddle.
Since the Deep learning relies on a large amount of data in the training stage, these data need to be loaded and preprocessed. These operations are usually executed on the CPU, which limits the further improvement of the training speed, especially when the batch_size is large, which become the bottleneck of speed. DALI can use GPU to accelerate these operations, thereby further improve the training speed.
## Installing DALI
DALI only support Linux x64 and version of CUDA is 10.0 or later.
* For CUDA 10:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
* For CUDA 11.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
For more information about installing DALI, please refer to [DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html).
## Using DALI
Paddleclas supports training with DALI in static graph. Since DALI only supports GPU training, `CUDA_VISIBLE_DEVICES` needs to be set, and DALI needs to occupy GPU memory, so it needs to reserve GPU memory for Dali. To train with DALI, just set the fields in the training config `use_dali = True`, or start the training by the following command:
```shell
# set the GPUs that can be seen
export CUDA_VISIBLE_DEVICES="0"
# set the GPU memory used for neural network training, generally 0.8 or 0.7
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True
```
And you can train with muti-GPUs:
```shell
# set the GPUs that can be seen
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
# set the GPU memory used for neural network training, generally 0.8 or 0.7
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python -m paddle.distributed.launch \
--gpus="0,1,2,3,4,5,6,7" \
tools/static/train.py \
-c ./configs/ResNet/ResNet50.yaml \
-o use_dali=True
```
## Train with FP16
On the basis of the above, using FP16 half-precision can further improve the training speed, just add fields in the start training command `AMP.use_pure_fp16=True`:
```shell
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True -o AMP.use_pure_fp16=True
```
Using fp16 half-precision will lead to the problem of training precision decline or convergence slow.
......@@ -38,7 +38,7 @@ Of course, you can also directly modify the configuration file to update the con
epoch:0 train step:13 loss:7.9561 top1:0.0156 top5:0.1094 lr:0.100000 elapse:0.193s
```
During training, you can view loss changes in real time through `VisualDL`, see [VisualDL](../extension/VisualDL.md) for details.
During training, you can view loss changes in real time through `VisualDL`, see [VisualDL](../extension/VisualDL_en.md) for details.
### 1.2 Model finetuning
......
# 使用DALI加速训练
## 前言
[NVIDIA数据加载库](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html)(The NVIDIA Data Loading Library,DALI)是用于数据加载和预处理的开源库,用于加速深度学习训练、推理过程,它可以直接构建飞桨Paddle的DataLoader数据读取器。
由于深度学习程序在训练阶段依赖大量数据,这些数据需要经过加载、预处理等操作后,才能送入训练程序,而这些操作通常在CPU完成,因此限制了训练速度进一步提高,特别是在batch_size较大时,数据读取可能成为训练速度的瓶颈。DALI可以基于GPU的高并行特性实现数据加载及预处理操作,可以进一步提高训练速度。
## 安装DALI
目前DALI仅支持Linux x64平台,且CUDA版本大于等于10.0。
* 对于CUDA 10:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda100
* 对于CUDA 11.0:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110
关于更多DALI安装的信息,可以参考[DALI官方](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html)
## 使用DALI
PaddleClas支持在静态图训练方式中使用DALI加速,由于DALI仅支持GPU训练,因此需要设置GPU,且DALI需要占用GPU显存,需要为DALI预留显存。使用DALI训练只需在训练配置文件中设置字段`use_dali=True`,或通过以下命令启动训练即可:
```shell
# 设置用于训练的GPU卡号
export CUDA_VISIBLE_DEVICES="0"
# 设置用于神经网络训练的显存大小,可根据具体情况设置,一般可设置为0.8或0.7
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True
```
也可以使用多卡训练:
```shell
# 设置用于训练的GPU卡号
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
# 设置用于神经网络训练的显存,可根据具体情况设置,一般可设置为0.8或0.7
export FLAGS_fraction_of_gpu_memory_to_use=0.80
python -m paddle.distributed.launch \
--gpus="0,1,2,3,4,5,6,7" \
tools/static/train.py \
-c ./configs/ResNet/ResNet50.yaml \
-o use_dali=True
```
## 使用FP16训练
在上述基础上,使用FP16半精度训练,可以进一步提高速度,只需在启动训练命令中添加字段`AMP.use_pure_fp16=True`
```shell
python tools/static/train.py -c configs/ResNet/ResNet50.yaml -o use_dali=True -o AMP.use_pure_fp16=True
```
使用FP16半精度训练将导致训练精度下降或收敛变慢的问题。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册