Commit c08bffdf authored by PaParaZz1

Deploying to gh-pages from @ 147d56f3 🚀

Parent 072980b8
@@ -64,9 +64,12 @@ DataDistributedParallel(DDP) Mode
DataDistributedParallel(DDP) is mainly used for single-machine multi-GPU and multi-machine multi-GPU training.
It uses multiple processes to control multiple GPUs and ring allreduce to synchronize gradients.
In DataDistributedParallel(DDP) Mode, we only need to set ``config.policy.learn.multi_gpu`` to ``True``, as is done in the config file ``dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py``.
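
For reference, a minimal sketch of where this flag sits in the config is shown below (the surrounding keys are abbreviated and illustrative; see the actual config file for the full contents):

.. code-block:: python

    main_config = dict(
        # ... env and other top-level fields omitted ...
        policy=dict(
            learn=dict(
                multi_gpu=True,  # enable multi-GPU (DDP) training
                # ... other learn fields omitted ...
            ),
            # ... other policy fields omitted ...
        ),
    )
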
Principle
~~~~~~~~~~~~~
We re-implement the data-parallel training module with APIs in ``torch.distributed`` for high scalability. The detailed principle is shown as follows:
1. Parameters on the Rank-0 GPU are broadcast to all devices, so that models on different devices share the same initialization.
@@ -79,14 +82,14 @@ We re-implement the data-parallel training module with APIs in ``torch.distribut
    for name, param in model.named_parameters():
        # pre-allocate a gradient buffer for every parameter
        setattr(param, 'grad', torch.zeros_like(param))
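
A minimal sketch of this broadcast step is shown below, assuming the default process group has already been initialized (the helper name is illustrative, not the exact DI-engine API):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def broadcast_params_from_rank0(model: torch.nn.Module) -> None:
        # send every parameter tensor from rank 0 to all other processes,
        # so that each process starts from the same initialization
        for param in model.parameters():
            dist.broadcast(param.data, src=0)
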
2. Gradients on different devices should be synchronized after the backward procedure.
.. code-block:: python
    self._optimizer.zero_grad()
    loss.backward()
    if self._cfg.learn.multi_gpu:
        self.sync_gradients(self._learn_model)  # sync gradients
    self._optimizer.step()
.. code-block:: python
@@ -96,35 +99,48 @@ We re-implement the data-parallel training module with APIs in ``torch.distribut
    if param.requires_grad:
        allreduce(param.grad.data)
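
The ``allreduce`` helper used above (see ``ding/utils/pytorch_ddp_dist_helper.py``) typically sums a tensor across processes and then averages it, roughly as in the following sketch (an illustration, not the exact DI-engine implementation):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def allreduce(tensor: torch.Tensor) -> None:
        # sum the tensor across all processes in place, then divide by
        # the world size so every GPU holds the averaged gradient
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor.div_(dist.get_world_size())
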
3. Information including loss and reward should be aggregated among devices when applying data-parallel training.
DI-engine achieves this with the allreduce operator in the learner and evaluator, and only saves log files on the process with rank 0.

For more related functions, please refer to ``ding/utils/pytorch_ddp_dist_helper.py``.
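
As an illustration of this aggregation, the following sketch averages a scalar metric over all processes and only writes it to the log on rank 0 (the helper name and logger argument are assumptions for the example, not DI-engine's actual learner/evaluator hooks):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def log_aggregated_scalar(name: str, value: float, logger) -> None:
        # average a scalar metric (e.g. loss or reward) over all processes
        tensor = torch.tensor([value], dtype=torch.float32).cuda()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        tensor /= dist.get_world_size()
        # only the rank-0 process saves log files
        if dist.get_rank() == 0:
            logger.info('{}: {:.4f}'.format(name, tensor.item()))
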
Usage
~~~~~~~
To enable DDP training in existing DI-engine code, you only need to make the following modifications:

1. Set ``config.policy.learn.multi_gpu`` to ``True``.
2. Add a DDP training context like this:
.. code-block:: python
    from ding.utils import DistContext
    from ding.entry import serial_pipeline

    # define main_config and create_config
    main_config = (...)
    create_config = (...)

    if __name__ == "__main__":
        # call serial_pipeline with DDP
        with DistContext():
            serial_pipeline(main_config, create_config)
.. tip::

    The whole example is located in ``dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py``.
3. Execute the launch shell script.

For DDP, a runnable launch script is shown as follows:
.. code-block:: bash
    CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=2 spaceinvaders_dqn_main_multi_gpu_ddp.py
Or, on a cluster managed by Slurm:
.. code-block:: bash
    srun -p PARTITION_NAME --mpi=pmi2 --gres=gpu:2 -n2 --ntasks-per-node=2 python -u spaceinvaders_dqn_main_multi_gpu_ddp.py