DataDistributedParallel(DDP) is mainly used for single-machine multi-GPU and multi-machine multi-GPU training.
It uses multiple processes to control multiple GPUs and ring allreduce to synchronize gradients.
In DataDistributedParallel(DDP) mode, we simply set ``config.policy.learn.multi_gpu`` to ``True`` in the config file ``dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py``.
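
In the config itself this is a single field under ``policy.learn``. A schematic fragment (all other fields omitted; the variable name is only illustrative):

.. code-block:: python

    spaceinvaders_dqn_config = dict(
        policy=dict(
            learn=dict(
                multi_gpu=True,  # enable data-parallel (DDP) training
            ),
        ),
    )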
Principle
~~~~~~~~~~~~~
We re-implement the data-parallel training module with APIs in ``torch.distributed`` for high scalability. The detailed principle is shown as follows:
1. Parameters on the Rank-0 GPU are broadcast to all devices, so that models on different devices share the same initialization.

.. code-block:: python

    # pre-allocate zero gradients so that the later gradient allreduce
    # touches every parameter from the very first iteration
    for name, param in model.named_parameters():
        setattr(param, 'grad', torch.zeros_like(param))
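
The broadcast in step 1 can be sketched with plain ``torch.distributed`` calls (a minimal illustration with a hypothetical ``broadcast_params`` helper, assuming the default process group is already initialized; DI-engine's own wrappers live in ``ding/utils/pytorch_ddp_dist_helper.py``):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def broadcast_params(model: torch.nn.Module, src: int = 0) -> None:
        # copy every parameter and buffer tensor from the src rank to all other ranks
        for tensor in model.state_dict().values():
            dist.broadcast(tensor, src=src)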
2. Gradients on different devices should be synchronized after the backward procedure.

.. code-block:: python

    # ``allreduce`` is DI-engine's helper (see ding/utils/pytorch_ddp_dist_helper.py)
    for name, param in model.named_parameters():
        if param.requires_grad:
            allreduce(param.grad.data)
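
Equivalently, the synchronization can be written with raw ``torch.distributed`` calls (a sketch with a hypothetical ``sync_gradients`` helper that sums gradients across processes and averages them by the world size):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def sync_gradients(model: torch.nn.Module) -> None:
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.requires_grad and param.grad is not None:
                # sum the gradient over all processes, then average it
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data /= world_size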
3. Information including loss and reward should be aggregated among devices when applying data-parallel training.
DI-engine achieves this with the allreduce operator in the learner and evaluator, and only saves log files on the process with rank 0.

For more related functions, please refer to ``ding/utils/pytorch_ddp_dist_helper.py``.
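
As an illustration of step 3, a scalar metric can be averaged across processes and logged only on rank 0 like this (a hypothetical ``aggregate_scalar`` helper, not DI-engine's actual logging code):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def aggregate_scalar(value: float) -> float:
        # average a scalar metric (e.g. loss or episode reward) over all processes
        tensor = torch.tensor([value], dtype=torch.float32)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor.item() / dist.get_world_size()

    # only the rank-0 process writes log files
    if dist.get_rank() == 0:
        print('aggregated loss:', aggregate_scalar(0.5))  # 0.5 is a placeholder value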
Usage
~~~~~~~
To enable DDP training in existing DI-engine code, you just need to make the following modifications:

1. Set ``config.policy.learn.multi_gpu`` to ``True``.

2. Add the DDP training context like this:

.. code-block:: python

    from ding.utils import DistContext
    from ding.entry import serial_pipeline

    # define main_config and create_config
    main_config = (...)
    create_config = (...)

    if __name__ == "__main__":
        # call serial_pipeline under the DDP context
        with DistContext():
            serial_pipeline(main_config, create_config)

.. tip::
    The whole example is located in ``dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py``
3. Execute the launch shell script.

For DDP, the runnable launch script is demonstrated as follows.
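
The exact script may differ across versions; as a sketch, a single-machine launch on two GPUs with PyTorch's distributed launcher could look like this (the GPU ids and the ``--nproc_per_node`` value are assumptions):

.. code-block:: bash

    # launch one training process per visible GPU
    CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
        dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py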
<liclass="toctree-l2"><aclass="reference internal"href="random_collect_size.html">How to randomly collect some data sample at the beginning?</a></li>
...
...
@@ -245,8 +249,10 @@ and then the parameters are synchronized with other GPUs.</p>
<h2>DataDistributedParallel(DDP) Mode<aclass="headerlink"href="#datadistributedparallel-ddp-mode"title="Permalink to this headline">¶</a></h2>
<p>DataDistributedParallel(DDP) is mainly used for single-machine multi-GPUs and multi-machine multi-GPUs.
It adopts multi-process to control multi-GPUs and adopts ring allreduce to synchronize gradient.</p>
<p>In DataDistributedParallel(DDP) Mode, we should simply set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <cite>True</cite> in the config file under <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py</span></code>.</p>
<p>We re-implement the data-parallel training module with APIs in <codeclass="docutils literal notranslate"><spanclass="pre">torch.distributed</span></code> for high scalability.</p>
<p>In DataDistributedParallel(DDP) Mode, we should simply set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <codeclass="docutils literal notranslate"><spanclass="pre">True</span></code> in the config file under <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py</span></code>.</p>
<divclass="section"id="principle">
<h3>Principle<aclass="headerlink"href="#principle"title="Permalink to this headline">¶</a></h3>
<p>We re-implement the data-parallel training module with APIs in <codeclass="docutils literal notranslate"><spanclass="pre">torch.distributed</span></code> for high scalability. The detailed principle is shonw as follows:</p>
<olclass="arabic simple">
<li><p>Parameters on Rank-0 GPU are broadcasted to all devices, so that models on different devices share the same initialization.</p></li>
</ol>
...
...
@@ -259,12 +265,12 @@ It adopts multi-process to control multi-GPUs and adopts ring allreduce to synch
</pre></div>
</div>
<olclass="arabic simple"start="2">
<li><p>Gradients on different devices should be synchronized after the backward function.</p></li>
<li><p>Gradients on different devices should be synchronized after the backward procedure.</p></li>
<p>Information including loss and reward should be aggregated among devices when applying data-parallel training. DI-engine achieves this with AllReduce operator in a hook, and only saves log files on process with rank 0.
For more related functions, please refer to <codeclass="docutils literal notranslate"><spanclass="pre">ding/utils/pytorch_ddp_dist_helper.py</span></code></p>
<olclass="arabic simple"start="3">
<li><p>Training</p></li>
<p>3. Information including loss and reward should be aggregated among devices when applying data-parallel training.
DI-engine achieves this with allreduce operator in learner and evaluator, and only saves log files on process with rank 0.</p>
<p>For more related functions, please refer to <codeclass="docutils literal notranslate"><spanclass="pre">ding/utils/pytorch_ddp_dist_helper.py</span></code></p>
</div>
<divclass="section"id="usage">
<h3>Usage<aclass="headerlink"href="#usage"title="Permalink to this headline">¶</a></h3>
<p>To enable DDP training in DI-engine existing codes, you just need to add modifications by following steps:</p>
<olclass="arabic simple">
<li><p>Set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <codeclass="docutils literal notranslate"><spanclass="pre">True</span></code></p></li>
<li><p>Add DDP training context liks this:</p></li>
</ol>
<p>When using it, firstly we set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <cite>True</cite> in the config file. Secondly, we need to Initialize the current experimental environment.
Please refer to <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py</span></code></p>
<p>For DPP, the runnable script demo is demonstrated as follows.</p>
<divclass="admonition tip">
<pclass="admonition-title">Tip</p>
<p>The whole example is located in <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py</span></code></p>
</div>
<olclass="arabic simple"start="3">
<li><p>Execute launch shell script</p></li>
</ol>
<p>For DDP, the runnable script demo is demonstrated as follows.</p>