
Allreduce algorithm, performance and codepath issue on ZE gpus #7024

Open
kaushikvelusamy opened this issue Jun 7, 2024 · 2 comments

@kaushikvelusamy

  1. We observe a sudden, abnormal increase (2-3x) in MPI_Allreduce time for message sizes of 1 MB and beyond when using GPUs. You can reproduce this by timing 100 iterations of MPI_Allreduce(), excluding the first two iterations to eliminate initialization cost and keeping iterations 3-100 so that only the pure communication time is measured (a sketch of such a timing loop is shown below).

This issue persists irrespective of scale, allreduce algorithm, CPU/NIC bindings, the presence or absence of MPI_Barrier, run-to-run variability, C vs. SYCL, and MPICH version.
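For reference, here is a minimal sketch of the timing loop described above. The message size, datatype, and use of host buffers are illustrative assumptions (the actual reproducer uses GPU buffers allocated via Level Zero/SYCL); it is not the exact benchmark used.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NITERS 100
#define SKIP   2   /* discard iterations 1-2 to exclude initialization cost */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int count = 1 << 20;   /* example size only: 1M doubles (8 MB) */
    double *sendbuf = malloc((size_t) count * sizeof(double));
    double *recvbuf = malloc((size_t) count * sizeof(double));
    for (int i = 0; i < count; i++)
        sendbuf[i] = 1.0;

    double t_total = 0.0;
    for (int iter = 0; iter < NITERS; iter++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (iter >= SKIP)              /* accumulate only iterations 3..100 */
            t_total += t1 - t0;
    }

    if (rank == 0)
        printf("avg MPI_Allreduce time: %g s\n",
               t_total / (NITERS - SKIP));

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Recording the per-iteration times individually (instead of only the sum) also shows at which message size the 2-3x jump begins.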

  2. Setting --env MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=recursive_doubling does change the reported value of MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM from 0 to 3, but I see no difference in performance. Is it possible that the CVARs are not being passed down, or that a CVAR is modified at runtime? (One way to check the effective value from inside the application is sketched after this list.)

  3. Could you confirm whether this upstream MPICH follows the codepath that uses the Xe Links?
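Regarding question 2: one way to check whether a CVAR is actually taking effect inside the application, rather than relying on the launch environment alone, is to read it back through the standard MPI_T tool-information interface. A minimal sketch, assuming MPICH exposes MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM through MPI_T as an integer- or string-valued control variable:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Look up a control variable by name and print its current value. */
static void print_cvar(const char *target)
{
    int ncvars;
    MPI_T_cvar_get_num(&ncvars);

    for (int i = 0; i < ncvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        if (strcmp(name, target) != 0)
            continue;

        MPI_T_cvar_handle handle;
        int count;
        MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
        if (dtype == MPI_INT) {
            int value;
            MPI_T_cvar_read(handle, &value);
            printf("%s = %d\n", name, value);
        } else {
            char value[256] = "";      /* assumes 256 bytes is enough */
            MPI_T_cvar_read(handle, value);
            printf("%s = %s\n", name, value);
        }
        MPI_T_cvar_handle_free(&handle);
        return;
    }
    printf("cvar %s not found\n", target);
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    print_cvar("MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM");

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```

Running this with and without --env MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=recursive_doubling should show whether the value actually seen by the library changes inside the job.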

@raffenet
Contributor

#7070 was intended to address this issue, but currently the CVAR default value does not change the existing behavior. The question remains whether any MPICH module on Aurora will set the reduction threshold to an appropriate value so as not to cause the poor performance observed in your tests.

@cbelusar

It seems we still need to decide whether this should be the default on Aurora for new builds of MPICH. If there are no perceived downsides, we can make this recommendation for future builds, but I'd like to confirm that there are no concerns before recommending it.
