"Assertion failed in file src/mpid/ch4/src/init_comm.c at line 72: in_use == 0" fails #7200
raffenet added a commit to raffenet/mpich that referenced this issue on Nov 7, 2024
If the special init comm used for the roots-only address exchange still has pending operations at destroy time, it can cause an assertion failure during MPI_INIT. Instead, we should wait for pending ops to complete to avoid a crash. Fixes pmodels#7200.
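As a rough illustration of the idea in that commit message (drain pending operations before destroying the object, rather than asserting on it immediately), here is a toy sketch in C. It is not MPICH code: `toy_comm`, `toy_progress`, and `toy_comm_destroy` are hypothetical names invented for this example, and the actual fix in MPICH may look quite different.

```c
#include <assert.h>

/* Hypothetical stand-in for a communicator-like object; NOT an MPICH type. */
typedef struct {
    int in_use; /* number of operations still pending on this object */
} toy_comm;

/* Hypothetical progress hook: completes at most one pending operation
 * per call. A real runtime would poll the network here instead. */
static void toy_progress(toy_comm *c)
{
    if (c->in_use > 0)
        c->in_use--;
}

/* Destroy that waits for pending work before checking the invariant.
 * Returns how many progress calls were needed to drain the object. */
static int toy_comm_destroy(toy_comm *c)
{
    int polls = 0;

    /* Asserting in_use == 0 at this point, without draining first, is the
     * pattern that can fail when operations are still in flight. */
    while (c->in_use > 0) {
        toy_progress(c);
        polls++;
    }

    assert(c->in_use == 0); /* holds by construction after the loop */
    return polls;
}
```

With three operations still pending, `toy_comm_destroy` makes three progress calls and then passes the check; with nothing pending, it returns immediately.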
Multiple applications running on Aurora have hit the following failure:
The smallest reproducer is
It does not seem to be reproducible at 1 or 4 nodes, but we have seen it at 4096 nodes. (Adding @KennethEJansen since he reported it.)