We observe a sudden, abnormal increase (2-3x) in collective communication time with MPI_Allreduce at message sizes of 1 MB and beyond on GPUs. You can reproduce this issue by measuring the time taken to complete 100 iterations of MPI_Allreduce(), excluding the first two iterations to eliminate initialization cost, so that timing iterations 3-100 captures only the pure communication.
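For reference, a minimal C sketch of the timing loop described above (the 1 MB double buffer, the iteration count, and the skip of the first two iterations follow the description; device/GPU buffer setup is omitted and host buffers are used here as a simplifying assumption):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1 MB of doubles -- the message size where the slowdown starts to appear. */
    const size_t count = (1 << 20) / sizeof(double);
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (size_t i = 0; i < count; i++)
        sendbuf[i] = 1.0;

    const int iters = 100;
    const int skip = 2;          /* drop iterations 1-2 to exclude init cost */
    double t_start = 0.0;

    for (int i = 0; i < iters; i++) {
        if (i == skip) {
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();   /* start timing at iteration 3 */
        }
        MPI_Allreduce(sendbuf, recvbuf, (int) count, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }

    double elapsed = MPI_Wtime() - t_start;
    if (rank == 0)
        printf("avg time per allreduce: %g s\n", elapsed / (iters - skip));

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```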
This issue persists irrespective of scale, allreduce algorithm, CPU/NIC bindings, with and without MPI_Barrier, run-to-run variability, C vs. SYCL, and mpich version.
I do see that setting --env MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=recursive_doubling changes MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM from 0 to 3, but I do not see any difference in performance. Is there a possibility that the CVARs are not getting passed down, or that a CVAR is modified at runtime?
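One way to check whether the CVAR actually reaches the library at runtime is to read it back from inside the application through the MPI_T control-variable interface. A minimal sketch, assuming the variable is exposed under its full MPIR_CVAR_ name and stored as an integer/enum (both are assumptions about how this MPICH build exposes it; error handling is mostly omitted):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Look up the control variable by name (assumed to be exported
     * with the full MPIR_CVAR_ prefix). */
    int cvar_index;
    if (MPI_T_cvar_get_index("MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM",
                             &cvar_index) == MPI_SUCCESS) {
        MPI_T_cvar_handle handle;
        int count, value = -1;
        MPI_T_cvar_handle_alloc(cvar_index, NULL, &handle, &count);
        MPI_T_cvar_read(handle, &value);   /* assumed integer-valued storage */
        if (rank == 0)
            printf("MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM = %d\n", value);
        MPI_T_cvar_handle_free(&handle);
    } else if (rank == 0) {
        printf("CVAR not found via MPI_T\n");
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```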
Could you confirm whether this upstream mpich follows the code path that uses the Xelinks?
#7070 was intended to address this issue, but currently the CVAR default value does not change the existing behavior. The question remains whether any MPICH module on Aurora will set the reduction threshold to an appropriate value so as not to cause the poor performance observed in your tests.
It seems we still need to close on whether this should be the default on Aurora for new builds of MPICH. If there are no perceived downsides, we can recommend it for future builds, but I'd like to confirm that there are no concerns before making that recommendation.