prov/verbs: collective communication basic example #7913
theWayofthecode asked this question in Q&A (Unanswered)
Hi,
We are experimenting with the collective communication API and are having a hard time getting a basic example running. We are stuck at the setup phase: after calling `fi_join_collective()` for all endpoints, we never receive the `FI_JOIN_COMPLETE` event in the event queue.
The `fi_multinode_coll` fabtest runs fine, so system-related issues are ruled out.
Our test follows the steps of this pseudocode:
We checked that the EQ is set up correctly: if we write an `FI_JOIN_COMPLETE` event into the EQ ourselves with `fi_eq_write()`, our polling receives it just fine.
The test is executed within a single OS process, meaning that all the endpoints of the collective are initialized in one process. We also continuously poll the EQs of all endpoints.
We enabled `FI_LOG_LEVEL=debug`, and there are no suspicious errors or warnings in the logs.
The bottom line is that `fi_join_collective()` never completes for any endpoint, and there are no clues to help us fix the problem.
What are we missing? Any tips for debugging this issue?
PS: We are using the latest version of libfabric (v1.15.1).
Thank you.