We observe a sudden, abnormal increase (2-3x) in collective communication time with MPI_Allreduce at message sizes of 1 MB and beyond on GPUs. You can reproduce this issue by measuring the time taken to complete 100 iterations of MPI_Allreduce(), excluding the first two iterations to eliminate initialization cost, so that timing iterations 3-100 captures only the pure communication.
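For reference, a minimal C sketch of the timing loop described above (the 1 MB double buffer, the iteration count, and the skip of the first two iterations follow the description; device/GPU buffer setup is omitted and host buffers are used here as a simplifying assumption):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1 MB of doubles -- the message size where the slowdown starts to appear. */
    const size_t count = (1 << 20) / sizeof(double);
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (size_t i = 0; i < count; i++)
        sendbuf[i] = 1.0;

    const int iters = 100;
    const int skip = 2;          /* drop iterations 1-2 to exclude init cost */
    double t_start = 0.0;

    for (int i = 0; i < iters; i++) {
        if (i == skip) {
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();   /* start timing at iteration 3 */
        }
        MPI_Allreduce(sendbuf, recvbuf, (int) count, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }

    double elapsed = MPI_Wtime() - t_start;
    if (rank == 0)
        printf("avg time per allreduce: %g s\n", elapsed / (iters - skip));

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```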
This issue persists irrespective of scale, allreduce algorithm, CPU/NIC bindings, with and without MPI_Barrier, run-to-run variability, C vs. SYCL, and mpich version.
I do see that setting --env MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=recursive_doubling changes MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM from 0 to 3, but I do not see any difference in performance. Is there a possibility that the CVARs are not getting passed down, or that a CVAR is modified at runtime?
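One way to check whether the CVAR actually reaches the library at runtime is to read it back from inside the application through the MPI_T control-variable interface. A minimal sketch, assuming the variable is exposed under its full MPIR_CVAR_ name and stored as an integer/enum (both are assumptions about how this MPICH build exposes it; error handling is mostly omitted):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Look up the control variable by name (assumed to be exported
     * with the full MPIR_CVAR_ prefix). */
    int cvar_index;
    if (MPI_T_cvar_get_index("MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM",
                             &cvar_index) == MPI_SUCCESS) {
        MPI_T_cvar_handle handle;
        int count, value = -1;
        MPI_T_cvar_handle_alloc(cvar_index, NULL, &handle, &count);
        MPI_T_cvar_read(handle, &value);   /* assumed integer-valued storage */
        if (rank == 0)
            printf("MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM = %d\n", value);
        MPI_T_cvar_handle_free(&handle);
    } else if (rank == 0) {
        printf("CVAR not found via MPI_T\n");
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```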
Could you confirm whether this upstream mpich follows the code path that uses the Xelinks?
#7070 was intended to address this issue, but currently the CVAR default value does not change the existing behavior. The question remains whether any MPICH module on Aurora will set the reduction threshold to an appropriate value so as not to cause the poor performance observed in your tests.
It seems we still need to close on whether this should be the default on Aurora for new builds of MPICH. If there are no perceived downsides, we can recommend it for future builds, but I'd like to confirm that there are no concerns before making that recommendation.