DataDistributedParallel(DDP) is mainly used for single-machine multi-GPU and multi-machine multi-GPU training.
It uses multiple processes to control multiple GPUs and ring allreduce to synchronize gradients.
In DataDistributedParallel(DDP) mode, we simply set ``config.policy.learn.multi_gpu`` to ``True`` in the config file ``dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py``.
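
In the config itself this is a single field under ``policy.learn``. A schematic fragment (all other fields omitted; the variable name is only illustrative):

.. code-block:: python

    spaceinvaders_dqn_config = dict(
        policy=dict(
            learn=dict(
                multi_gpu=True,  # enable data-parallel (DDP) training
            ),
        ),
    )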
Principle
~~~~~~~~~~~~~
We re-implement the data-parallel training module with APIs in ``torch.distributed`` for high scalability. The detailed principle is shown as follows:
1. Parameters on the Rank-0 GPU are broadcast to all devices, so that models on different devices share the same initialization.

.. code-block:: python

    # pre-allocate zero gradients so that the later gradient allreduce
    # touches every parameter from the very first iteration
    for name, param in model.named_parameters():
        setattr(param, 'grad', torch.zeros_like(param))
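
The broadcast in step 1 can be sketched with plain ``torch.distributed`` calls (a minimal illustration with a hypothetical ``broadcast_params`` helper, assuming the default process group is already initialized; DI-engine's own wrappers live in ``ding/utils/pytorch_ddp_dist_helper.py``):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def broadcast_params(model: torch.nn.Module, src: int = 0) -> None:
        # copy every parameter and buffer tensor from the src rank to all other ranks
        for tensor in model.state_dict().values():
            dist.broadcast(tensor, src=src)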
2. Gradients on different devices should be synchronized after the backward procedure.

.. code-block:: python

    # ``allreduce`` is DI-engine's helper (see ding/utils/pytorch_ddp_dist_helper.py)
    for name, param in model.named_parameters():
        if param.requires_grad:
            allreduce(param.grad.data)
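
Equivalently, the synchronization can be written with raw ``torch.distributed`` calls (a sketch with a hypothetical ``sync_gradients`` helper that sums gradients across processes and averages them by the world size):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def sync_gradients(model: torch.nn.Module) -> None:
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.requires_grad and param.grad is not None:
                # sum the gradient over all processes, then average it
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data /= world_size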
3. Information including loss and reward should be aggregated among devices when applying data-parallel training.
DI-engine achieves this with the allreduce operator in the learner and evaluator, and only saves log files on the process with rank 0.

For more related functions, please refer to ``ding/utils/pytorch_ddp_dist_helper.py``.
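
As an illustration of step 3, a scalar metric can be averaged across processes and logged only on rank 0 like this (a hypothetical ``aggregate_scalar`` helper, not DI-engine's actual logging code):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def aggregate_scalar(value: float) -> float:
        # average a scalar metric (e.g. loss or episode reward) over all processes
        tensor = torch.tensor([value], dtype=torch.float32)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor.item() / dist.get_world_size()

    # only the rank-0 process writes log files
    if dist.get_rank() == 0:
        print('aggregated loss:', aggregate_scalar(0.5))  # 0.5 is a placeholder value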
Usage
~~~~~~~
To enable DDP training in existing DI-engine code, you just need to make the following modifications:

1. Set ``config.policy.learn.multi_gpu`` to ``True``.

2. Add the DDP training context like this:

.. code-block:: python

    from ding.utils import DistContext
    from ding.entry import serial_pipeline

    # define main_config and create_config
    main_config = (...)
    create_config = (...)

    if __name__ == "__main__":
        # call serial_pipeline under the DDP context
        with DistContext():
            serial_pipeline(main_config, create_config)

.. tip::
    The whole example is located in ``dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py``
3. Execute the launch shell script.

For DDP, the runnable launch script is demonstrated as follows.
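
The exact script may differ across versions; as a sketch, a single-machine launch on two GPUs with PyTorch's distributed launcher could look like this (the GPU ids and the ``--nproc_per_node`` value are assumptions):

.. code-block:: bash

    # launch one training process per visible GPU
    CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
        dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py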
<liclass="toctree-l2"><aclass="reference internal"href="random_collect_size.html">How to randomly collect some data sample at the beginning?</a></li>
...
...
@@ -245,8 +249,10 @@ and then the parameters are synchronized with other GPUs.</p>
<h2>DataDistributedParallel(DDP) Mode<aclass="headerlink"href="#datadistributedparallel-ddp-mode"title="Permalink to this headline">¶</a></h2>
<p>DataDistributedParallel(DDP) is mainly used for single-machine multi-GPUs and multi-machine multi-GPUs.
It adopts multi-process to control multi-GPUs and adopts ring allreduce to synchronize gradient.</p>
<p>In DataDistributedParallel(DDP) Mode, we should simply set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <cite>True</cite> in the config file under <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py</span></code>.</p>
<p>We re-implement the data-parallel training module with APIs in <codeclass="docutils literal notranslate"><spanclass="pre">torch.distributed</span></code> for high scalability.</p>
<p>In DataDistributedParallel(DDP) Mode, we should simply set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <codeclass="docutils literal notranslate"><spanclass="pre">True</span></code> in the config file under <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/config/serial/spaceinvaders/spaceinvaders_dqn_config_multi_gpu_ddp.py</span></code>.</p>
<divclass="section"id="principle">
<h3>Principle<aclass="headerlink"href="#principle"title="Permalink to this headline">¶</a></h3>
<p>We re-implement the data-parallel training module with APIs in <codeclass="docutils literal notranslate"><spanclass="pre">torch.distributed</span></code> for high scalability. The detailed principle is shonw as follows:</p>
<olclass="arabic simple">
<li><p>Parameters on Rank-0 GPU are broadcasted to all devices, so that models on different devices share the same initialization.</p></li>
</ol>
...
...
@@ -259,12 +265,12 @@ It adopts multi-process to control multi-GPUs and adopts ring allreduce to synch
</pre></div>
</div>
<olclass="arabic simple"start="2">
<li><p>Gradients on different devices should be synchronized after the backward function.</p></li>
<li><p>Gradients on different devices should be synchronized after the backward procedure.</p></li>
<p>Information including loss and reward should be aggregated among devices when applying data-parallel training. DI-engine achieves this with AllReduce operator in a hook, and only saves log files on process with rank 0.
For more related functions, please refer to <codeclass="docutils literal notranslate"><spanclass="pre">ding/utils/pytorch_ddp_dist_helper.py</span></code></p>
<olclass="arabic simple"start="3">
<li><p>Training</p></li>
<p>3. Information including loss and reward should be aggregated among devices when applying data-parallel training.
DI-engine achieves this with allreduce operator in learner and evaluator, and only saves log files on process with rank 0.</p>
<p>For more related functions, please refer to <codeclass="docutils literal notranslate"><spanclass="pre">ding/utils/pytorch_ddp_dist_helper.py</span></code></p>
</div>
<divclass="section"id="usage">
<h3>Usage<aclass="headerlink"href="#usage"title="Permalink to this headline">¶</a></h3>
<p>To enable DDP training in DI-engine existing codes, you just need to add modifications by following steps:</p>
<olclass="arabic simple">
<li><p>Set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <codeclass="docutils literal notranslate"><spanclass="pre">True</span></code></p></li>
<li><p>Add DDP training context liks this:</p></li>
</ol>
<p>When using it, firstly we set <codeclass="docutils literal notranslate"><spanclass="pre">config.policy.learn.multi_gpu</span></code> as <cite>True</cite> in the config file. Secondly, we need to Initialize the current experimental environment.
Please refer to <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py</span></code></p>
<p>For DPP, the runnable script demo is demonstrated as follows.</p>
<divclass="admonition tip">
<pclass="admonition-title">Tip</p>
<p>The whole example is located in <codeclass="docutils literal notranslate"><spanclass="pre">dizoo/atari/entry/spaceinvaders_dqn_main_multi_gpu_ddp.py</span></code></p>
</div>
<olclass="arabic simple"start="3">
<li><p>Execute launch shell script</p></li>
</ol>
<p>For DDP, the runnable script demo is demonstrated as follows.</p>