-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong data received when sharing a team with multiple threads #979
Comments
I am seeing several Message truncated errors in the UCC log.
|
@nirandaperera Just going out on a limb and say that UCX was also built with thread-multiple, correct? |
@nirandaperera can you paste a reproducer? |
@janjust I was using conda ucx 1.15.0 |
I'm assuming it's already built with --thread-multiple, can you check?
Should say |
@janjust yes it does |
Do you have a simple reproducer handy and we'll take a look. |
Just to add onto this: from what i'm hearing we've never tried this model of multitple threads/team, and to me it seems to mimic multiple threads/mpi_comm which would be against the spec. What I suspect is happening the threads are progressing each other's callbacks because the context is shared thus each thread picks up the task in no particular order (this is me guessing over what we've observed in our case). |
@nsarka Can you please look at this thread - does what I'm saying make sense, is this what you observed with sliding-window? |
@janjust FWIW I am using tagged collectives |
@janjust There's a separate thread that progresses the ctx. From what I understood from the source code that all teams will be ultimately enqueing tasks to a single progress queue in the context, isn't it? |
yes, I believe that's correct |
@janjust I was using v1.2 I just tried my code with the ucc master branch build and I don't see this error anymore. I think previously allgather knomial algorithm was active. But FWIU in the master branch, there are two new algorithms,
I think one of these is now running the operation. |
Sorry, I misspoke. I put some logs in the code and found out that it's the knomial algorithm in both cases. |
there was a fix for multithreaded case, it was merged into master but we didn't include it into v1.3.x branch |
@Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926 I think this change might be the fix. |
hmm, odd, that pr had nothing to do with multi-threading |
yes, could be |
@janjust I think the tags have not been propagated properly until this PR. 🙂 |
@Sergei-Lebedev is there a way to get a patch release for v1.3? |
@manjugv wdyt? |
@Sergei-Lebedev it might need a proper test case for this scenario anyways, I guess. |
Hi, I have the following setup.
UCC_THREAD_MULTIPLE
UCC_CONTEXT_SHARED
UCC_COLLECTIVE_INIT_AND_POST_UNORDERED
andUCC_NO_SYNC_COLLECTIVES
(IINM these configs are not used underneath)I have multiple threads issuing a set of tagged allgather operations (with callbacks) to the team. I am keeping the send and receive buffers alive in the
data
object of the callback.It seems the receive buffers are receiving wrong data from the parallel allgather operations.
It's working fine for a single thread.
Can someone please help me out here?
Is sharing a team with multiple threads, an anti-pattern? I checked the allgather test code, and it seemed to me that there was a separate team for every test proc.
The text was updated successfully, but these errors were encountered: