Wrong data received when sharing a team with multiple threads #979

nirandaperera · 2024-05-23T14:24:27Z

Hi, I have the following setup.

UCC lib initialized with UCC_THREAD_MULTIPLE
UCC ctx created with UCC_CONTEXT_SHARED
1 team created with UCC_COLLECTIVE_INIT_AND_POST_UNORDERED and UCC_NO_SYNC_COLLECTIVES (IINM these configs are not used underneath)
Separate thread for ucc ctx progress

I have multiple threads issuing a set of tagged allgather operations (with callbacks) to the team. I am keeping the send and receive buffers alive in the data object of the callback.
It seems the receive buffers are receiving wrong data from the parallel allgather operations.
It's working fine for a single thread.

Can someone please help me out here?

Is sharing a team across multiple threads an anti-pattern? I checked the allgather test code, and it seemed to me that it creates a separate team for every test process.


nirandaperera · 2024-05-23T16:01:37Z

I am seeing several Message truncated errors in the UCC log.

6479650.125378] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126129] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126132] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126893] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126895] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.127440] [XPS-15-9510:3236166:7]          ec_cpu.c:186  cpu ec DEBUG executor finalize, eee: 0x584f0ee37600
[1716479650.127449] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2

janjust · 2024-05-24T13:45:01Z

@nirandaperera Just going out on a limb here: UCX was also built with thread-multiple support, correct?

janjust · 2024-05-24T13:45:21Z

@nirandaperera can you paste a reproducer?

nirandaperera · 2024-05-24T13:46:56Z

@janjust I was using conda ucx 1.15.0

janjust · 2024-05-24T13:50:19Z

I'm assuming it's already built with thread-multiple support. Can you check?

$ ucx_info -v
# Library version: 1.17.0
# Library path: /hpc/local/benchmarks/daily/next/2024-05-22/hpcx-gcc-redhat7/ucx/mt/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17.x', revision b67bc34
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.4.0/redhat7 --with-gdrcopy --prefix=/build-result/hpcx-gcc-redhat7/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

The "Configured with" line should include
--enable-mt
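A minimal sketch of how that check could be scripted, assuming the output format shown above (lines prefixed with "#", and a "Configured with:" line listing the configure flags). The helper name ucx_built_with_mt and the shortened sample string are hypothetical and only for illustration.

```python
def ucx_built_with_mt(ucx_info_output: str) -> bool:
    """Return True if the 'Configured with' line lists --enable-mt."""
    for line in ucx_info_output.splitlines():
        # Lines in the output above start with '#'; drop that prefix.
        text = line.lstrip("# ").strip()
        if text.startswith("Configured with:"):
            flags = text.split(":", 1)[1].split()
            # Compare whole tokens so a longer flag name cannot match by accident.
            return "--enable-mt" in flags
    return False


# Shortened, hypothetical sample in the same shape as the output above.
sample = (
    "# Library version: 1.17.0\n"
    "# Configured with: --disable-logging --enable-mt --with-knem\n"
)
print(ucx_built_with_mt(sample))
```

To run it against a real installation, the text could be captured with subprocess.run(["ucx_info", "-v"], capture_output=True, text=True).stdout and passed to the helper.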

nirandaperera · 2024-05-24T13:51:00Z

@janjust yes it does

janjust · 2024-05-24T13:54:48Z

Do you have a simple reproducer handy? We'll take a look.
We did something similar to what you're doing with AR, but switched to 1 ucc_context/thread for performance reasons, not correctness.

janjust · 2024-05-24T14:01:15Z

Just to add to this: from what I'm hearing, we've never tried this model of multiple threads/team, and to me it seems to mimic multiple threads/mpi_comm, which would be against the spec.

What I suspect is happening is that the threads are progressing each other's callbacks: because the context is shared, each thread picks up tasks in no particular order (this is me guessing based on what we've observed in our case).

janjust · 2024-05-24T14:02:00Z

@nsarka Can you please look at this thread - does what I'm saying make sense, is this what you observed with sliding-window?

nirandaperera · 2024-05-24T14:09:36Z

@janjust FWIW I am using tagged collectives

nirandaperera · 2024-05-24T14:31:09Z

@janjust There's a separate thread that progresses the ctx. From what I understood from the source code, all teams ultimately enqueue tasks to a single progress queue in the context, don't they?

janjust · 2024-05-24T14:32:13Z

yes, I believe that's correct
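The model discussed above (several threads posting to one team, one shared progress queue in the context, a dedicated progress thread) can be sketched as a toy simulation. This is not UCC code: ToyContext, post, progress and run_demo are hypothetical names, and the queue only stands in for whatever UCC does internally.

```python
import queue
import threading


class ToyContext:
    """Hypothetical stand-in for a shared context with one task queue."""

    def __init__(self):
        self._tasks = queue.Queue()

    def post(self, team, tag, callback):
        # Any thread may enqueue work; all teams share the same queue.
        self._tasks.put((team, tag, callback))

    def progress(self, expected):
        # The progress thread drains the queue and runs each callback itself.
        done = 0
        while done < expected:
            team, tag, callback = self._tasks.get()
            callback(team, tag)
            done += 1


def run_demo(num_threads=3, ops_per_thread=4):
    ctx = ToyContext()
    completions = []  # (team, tag, name of the thread that ran the callback)

    def on_complete(team, tag):
        completions.append((team, tag, threading.current_thread().name))

    def poster(thread_id):
        for i in range(ops_per_thread):
            # A distinct tag per operation, so completions can be told apart.
            ctx.post("team0", thread_id * 100 + i, on_complete)

    total = num_threads * ops_per_thread
    progress_thread = threading.Thread(
        target=ctx.progress, args=(total,), name="progress")
    posters = [threading.Thread(target=poster, args=(t,))
               for t in range(num_threads)]
    progress_thread.start()
    for p in posters:
        p.start()
    for p in posters:
        p.join()
    progress_thread.join()
    return completions


completions = run_demo()
print(len(completions), {name for _, _, name in completions})
```

In this toy, every callback runs on the progress thread regardless of which thread posted the operation, and the completion order across posting threads is not fixed. The only thing tying a completion back to its operation is the tag and the data carried with the callback.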

nirandaperera · 2024-05-24T18:15:06Z

@janjust I was using v1.2. I just tried my code with a build of the ucc master branch and I don't see this error anymore. I think the allgather knomial algorithm was active previously. But FWIU, the master branch has two new algorithms:

    3 :            bruck : O(log(N)) Variation of Bruck algorithm
    4 :          sparbit : O(log(N)) SPARBIT algorithm

I think one of these is now running the operation.
(I tried v1.3 as well & it had the same issue)

nirandaperera · 2024-05-24T19:27:23Z

Sorry, I misspoke. I put some logs in the code and found out that it's the knomial algorithm in both cases.

Sergei-Lebedev · 2024-05-24T19:46:31Z

There was a fix for the multithreaded case; it was merged into master, but we didn't include it in the v1.3.x branch:
#932

nirandaperera · 2024-05-24T21:50:21Z

@Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926

I think this change might be the fix.
https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376

janjust · 2024-05-24T21:57:30Z

hmm, odd, that pr had nothing to do with multi-threading

Sergei-Lebedev · 2024-05-25T07:23:32Z

> @Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926
>
> I think this change might be the fix. https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376

yes, could be

nirandaperera · 2024-05-26T19:14:41Z

@janjust I think the tags were not being propagated properly until this PR. 🙂
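To illustrate why lost tags could produce the "Message truncated" errors reported earlier (recv_length 10, buffer_size 2), here is a toy model of tag matching. It is only a sketch of the hypothesis in this thread, not UCX's matching logic: ToyTagMatcher and run are hypothetical names, and the 2-byte and 10-byte sizes are borrowed from the log lines above.

```python
from collections import deque


class ToyTagMatcher:
    """Toy receiver: an incoming message goes to the oldest posted
    receive that carries the same tag."""

    def __init__(self):
        self._posted = deque()  # entries are (tag, buffer_size)

    def post_recv(self, tag, buffer_size):
        self._posted.append((tag, buffer_size))

    def deliver(self, tag, payload):
        for i, (recv_tag, size) in enumerate(self._posted):
            if recv_tag == tag:
                del self._posted[i]
                # A payload larger than the matched buffer is a truncation.
                return "truncated" if len(payload) > size else "ok"
        return "unexpected"


def run(tag_a, tag_b):
    matcher = ToyTagMatcher()
    matcher.post_recv(tag_a, 2)    # operation A expects 2 bytes
    matcher.post_recv(tag_b, 10)   # operation B expects 10 bytes
    # B's message happens to arrive before A's.
    return [matcher.deliver(tag_b, b"x" * 10),
            matcher.deliver(tag_a, b"y" * 2)]


print(run(7, 7))  # both operations end up with the same tag
print(run(7, 8))  # each operation keeps its own tag
```

With a shared tag, the 10-byte message lands in the 2-byte buffer and is truncated, while the 2-byte message silently fills the wrong buffer, which would also match the "wrong data" symptom. With distinct tags, both messages reach the intended receive.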

nirandaperera · 2024-05-28T13:07:06Z

@Sergei-Lebedev is there a way to get a patch release for v1.3?

Sergei-Lebedev · 2024-05-28T13:40:21Z

> @Sergei-Lebedev is there a way to get a patch release for v1.3?

@manjugv wdyt?

nirandaperera · 2024-05-28T16:27:10Z

@Sergei-Lebedev it might need a proper test case for this scenario anyways, I guess.
