Implement distributed training using horovod #1865

Open · wants to merge 2 commits into main

Changes from 1 commit
horovod documentation
NanoNabla committed May 5, 2021
commit 5f23b13121ea358fbc0795333d2fbb3b7c5d8fd0
2 changes: 2 additions & 0 deletions doc/TRAINING_ADVANCED.rst
@@ -17,3 +17,5 @@ This document contains more advanced topics with regard to training models with
9. :ref:`parallel-training-optimization`
10. :ref:`data-importers`
11. :ref:`byte-output-mode`
12. :ref:`horovod-parallel-training`

22 changes: 22 additions & 0 deletions doc/TRAINING_HOROVOD.rst
@@ -0,0 +1,22 @@
.. _horovod-parallel-training:

Distributed training using Horovod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a capable compute cluster, it is possible to distribute training across multiple machines using `Horovod <https://github.com/horovod/horovod>`_; a fast network between the machines is recommended.
Horovod can use MPI and NVIDIA's NCCL for highly optimized inter-process communication.
It also offers `Gloo <https://github.com/facebookincubator/gloo>`_ as an easy-to-set-up communication backend.
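
A quick way to verify which of these backends your local Horovod build actually supports is ``horovodrun --check-build``, which prints the frameworks, controllers (MPI, Gloo) and tensor operations (NCCL, MPI, Gloo, ...) the installation was compiled with:

.. code-block:: bash

   # Prints the capabilities of the installed Horovod build
   horovodrun --check-build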

For more information about setting up or tuning Horovod, please see `Horovod's documentation <https://horovod.readthedocs.io/en/stable/summary_include.html>`_.

Horovod itself is able to run on heterogeneous systems (e.g. a different number or model of GPUs per machine).
However, this can cause unpredictable problems and would require manual changes to the training code.
Therefore, we only support homogeneous systems, meaning the same hardware and the same software configuration (OS, drivers, MPI, NCCL, TensorFlow, ...) on every machine.
The only exception is a different number of GPUs per machine, since this can be controlled via ``horovodrun -H``, as shown below.
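
For example, a run on two machines where ``server1`` has 4 GPUs but ``server2`` has only 2 could be launched as follows (the hostnames are placeholders for your own machines):

.. code-block:: bash

   # 6 worker processes in total: 4 on server1, 2 on server2
   horovodrun -np 6 -H server1:4,server2:2 python3 DeepSpeech.py --train_files [...] --horovod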

Detailed documentation on how to run Horovod is provided `here <https://horovod.readthedocs.io/en/stable/running.html>`_.
In short, to train on 4 machines using 4 GPUs each:

.. code-block:: bash

   horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod
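
If no MPI installation is available, Horovod can coordinate the run with the Gloo backend mentioned above instead. A sketch of the same command using Gloo, assuming your Horovod build was compiled with Gloo support:

.. code-block:: bash

   # Same 16-process run, but coordinated via Gloo instead of MPI
   horovodrun --gloo -np 16 -H server1:4,server2:4,server3:4,server4:4 python3 DeepSpeech.py --train_files [...] --horovod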