**Is your feature request related to a problem? Please describe.**
Trying to use `mpirun` with an MPI version older than 3 with, for example, `lagomorph lddmm atlas` currently results in an error, because we can't properly determine the local rank.
**Describe the solution you'd like**
Local rank should be determined in a uniform way regardless of MPI version. We should try the method used now, which is what horovod uses, but fall back to a naive hostname-based method if the import fails.
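A rough sketch of that logic (the helper names `local_rank_mpi3` and `local_rank_hostname` are hypothetical placeholders for the current method and the fallback, not existing lagomorph functions):

```python
def local_rank():
    try:
        # Current approach (requires MPI >= 3); the import fails on older MPI
        return local_rank_mpi3()
    except ImportError:
        # Naive hostname-based fallback outlined under "Additional context"
        return local_rank_hostname()
```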
**Describe alternatives you've considered**
Previously we did not need to compute the local rank because we accepted it as a command-line argument. That is a bit cumbersome, however, and computing it instead means the calling convention for lagomorph (which uses `pytorch.distributed`) will match horovod's.
**Additional context**
The following Stack Overflow answer outlines the basic method we need to fall back to: https://stackoverflow.com/a/31792540
The steps required are:

- On each rank, compute the processor name or hostname
- Perform an allgather to grab all of the node names
- Sort the unique node names alphabetically
- Find the integer index of this rank's hostname in the sorted list
- Use `mpi_comm_split` with the integer index found in the last step as the "color"

This can all be done inside `lagomorph.utils.mpi_local_comm`.
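A minimal sketch of what that could look like with mpi4py; the function name follows the issue, but the body is illustrative rather than the actual lagomorph code:

```python
from mpi4py import MPI

def mpi_local_comm(comm=MPI.COMM_WORLD):
    """Split ``comm`` into node-local communicators using hostnames."""
    # On each rank, get the processor name (hostname)
    name = MPI.Get_processor_name()
    # Allgather so every rank sees every node's name
    names = comm.allgather(name)
    # Sort the unique node names alphabetically for a consistent ordering
    unique_names = sorted(set(names))
    # The index of this rank's hostname is the "color" for the split
    color = unique_names.index(name)
    # Ranks with the same color (same node) share the resulting communicator
    return comm.Split(color, comm.Get_rank())

# The local rank is then just the rank within the node-local communicator:
# local_rank = mpi_local_comm().Get_rank()
```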
This unifies our approach to parallelism. Any command line tool will
parse the command line and MPI environment in a uniform way. On Summit,
this corresponds to calling `jsrun -n<N> -a6 -g6`, just as horovod
expects. We need MPI>=3 in order to find local rank using
the method implemented here. In the future, we need a fallback to remove
this requirement (see Issue #17).
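For reference, a minimal sketch of the MPI>=3 local-rank method this refers to, assuming it is based on `MPI_Comm_split_type` via mpi4py (the actual implementation in lagomorph may differ):

```python
from mpi4py import MPI

def local_rank_mpi3(comm=MPI.COMM_WORLD):
    # MPI_Comm_split_type with COMM_TYPE_SHARED groups ranks that share
    # memory, i.e. ranks on the same node; this call requires MPI >= 3.
    local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
    return local_comm.Get_rank()
```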