Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/UCP: Allow self copy in allgather using network loopback #1021

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yaeliyac
Copy link

@yaeliyac yaeliyac commented Sep 19, 2024

What

Add option for self copy loopback in out of place start for allgather algorithms: knomial, ring, neighbor, sparbit

Why ?

When using UCC plugin for NCCL, using cuda_memcpy might cause deadlock

How ?

Add UCC_TL_UCP_ALLGATHER_USE_LOOPBACK flag to control this
Tested on ISRAEL-1 with multiple test cases - showed good results and same performance with and without loopback
image

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

@yaeliyac yaeliyac closed this Nov 6, 2024
@yaeliyac yaeliyac reopened this Nov 6, 2024
@yaeliyac yaeliyac marked this pull request as ready for review November 18, 2024 15:39
@yaeliyac yaeliyac changed the title UCC plugin for NCCL - adjustments Allow self copy in allgather using network loopback Dec 11, 2024
@janjust janjust requested review from nsarka and janjust December 11, 2024 16:08
@Sergei-Lebedev
Copy link
Contributor

ok to test

src/components/cl/basic/cl_basic_context.c Outdated Show resolved Hide resolved
src/components/ec/cuda/ec_cuda_executor.c Outdated Show resolved Hide resolved
src/components/ec/cuda/ec_cuda_executor.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_neighbor.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp.c Outdated Show resolved Hide resolved
@samnordmann samnordmann self-requested a review December 13, 2024 12:35
@yaeliyac yaeliyac force-pushed the origin/for_plugin branch 3 times, most recently from 482006f to eecdebf Compare December 13, 2024 15:28
@janjust
Copy link
Collaborator

janjust commented Dec 13, 2024

@Sergei-Lebedev why do we have to keep approving runs when force pushing?? Seems an outlier here

@yaeliyac yaeliyac force-pushed the origin/for_plugin branch 3 times, most recently from 32b3ccd to b1c875b Compare December 15, 2024 11:49
@yaeliyac yaeliyac changed the title Allow self copy in allgather using network loopback TL/UCP: Allow self copy in allgather using network loopback Dec 15, 2024
@yaeliyac yaeliyac force-pushed the origin/for_plugin branch 3 times, most recently from 2dc74f3 to a2c5fee Compare December 15, 2024 16:18
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! It Looks good, I left a bunch of comments mainly on codestyle.

However, please add tests for that feature.

Also please note from the CI:

Commit title is too long: 59
Bad commit title: 'TL/UCP: Allow self copy in allgather using network loopback'

src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp.h Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_sparbit.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_sparbit.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_ring.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_ring.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_knomial.c Outdated Show resolved Hide resolved
src/components/tl/ucp/tl_ucp.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather.c Outdated Show resolved Hide resolved
src/components/tl/ucp/allgather/allgather_sparbit.c Outdated Show resolved Hide resolved
@yaeliyac yaeliyac force-pushed the origin/for_plugin branch 2 times, most recently from 5619791 to caa90f8 Compare December 22, 2024 16:08
@yaeliyac yaeliyac force-pushed the origin/for_plugin branch 6 times, most recently from b6d4812 to 54e86f2 Compare January 9, 2025 17:50
@yaeliyac yaeliyac force-pushed the origin/for_plugin branch 7 times, most recently from 4969914 to 3f87beb Compare February 10, 2025 18:20
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -9,7 +9,8 @@

using Param_0 = std::tuple<int, ucc_datatype_t, ucc_memory_type_t, int, gtest_ucc_inplace_t>;
using Param_1 = std::tuple<ucc_datatype_t, ucc_memory_type_t, int, gtest_ucc_inplace_t>;
using Param_2 = std::tuple<ucc_datatype_t, ucc_memory_type_t, int, gtest_ucc_inplace_t, std::string>;
using Param_2 = std::tuple<ucc_datatype_t, ucc_memory_type_t, int,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the new param use_loopback rather be a bool ?

if (ucc_unlikely(UCC_OK != status)) {
return status;
}
}

while (UCC_INPROGRESS == ucc_tl_ucp_test(task)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just checking: is this call safe in the "inplace" case ?

Comment on lines +144 to +145
"If set to 1 performs network loopback for self copy, otherwise uses mc "
"cuda copy",
Copy link
Collaborator

@samnordmann samnordmann Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"If set to 1 performs network loopback for self copy, otherwise uses mc "
"cuda copy",
"If set to 1 performs network loopback for self copy, otherwise uses mc",

it is not necessarily cuda

ucc_tl_ucp_task_t *task)
{
ucc_status_t status;
status = ucc_tl_ucp_send_nb(sbuf, data_size, smem, rank, team, task);
Copy link
Collaborator

@nsarka nsarka Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of status = ... ; if (UCC_OK != status) ... you can use UCC_CHECK_GOTO(..., err)

task->super.status = status;
return task->super.status;
}
return UCC_OK;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err:
    return status;

@@ -54,8 +55,7 @@

void ucc_tl_ucp_allgather_knomial_progress(ucc_coll_task_t *coll_task)
{
ucc_tl_ucp_task_t *task = ucc_derived_of(coll_task,
ucc_tl_ucp_task_t);
ucc_tl_ucp_task_t * task = ucc_derived_of(coll_task, ucc_tl_ucp_task_t);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*task

args->root : 0;
ucc_rank_t rank = VRANK(task->subset.myrank, broot, size);
size_t local = GET_LOCAL_COUNT(args, size, rank);
ucc_rank_t broot = args->coll_type == UCC_COLL_TYPE_BCAST ? args->root : 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align =

@@ -67,6 +69,8 @@ void ucc_tl_ucp_allgather_ring_progress(ucc_coll_task_t *coll_task)
if (UCC_INPROGRESS == ucc_tl_ucp_test(task)) {
return;
}
step = use_loopback ? task->tagged.send_posted - 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line check for inplace as well? If use_loopback is true and inplace is false, will step be incorrect?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar for neighbor exchange

@janjust
Copy link
Collaborator

janjust commented Feb 12, 2025

@Sergei-Lebedev @samnordmann @yaeliyac @lappazos

The more I look at this implementation the less I think this is the right approach, and I'm surprised we would allow this in. We're introducing an if-else statement all throughout critical paths of the algorithms to avoid this issues specific to the nccl_plugin and only during inplace.
What if we need to add additional algorithms, then we need to be aware of this limitation and add the same if-else, or if we need to expand this for A2A, A2AV, or AGV? Why is AG the only collective that suffers this deadlock?
Thinking about this in more detail I don't think this is a good approach, we should really consider this carefully, regardless of the deadlock fix. As an immediate alternative, I think at the very least we should provide a loop-back copy or the mc_copy from a higher level.

But on a broader note, what exactly is the deadlock here? It seems to be an issue specifically with nccl_plugin, and why does loopback fix it?

Can we have you guys jump on another devcall and review this with us please.

if (ucc_unlikely(UCC_OK != status)) {
return status;
}
}

while ((!UCC_IS_INPLACE(TASK_ARGS(task))) && (UCC_INPROGRESS == ucc_tl_ucp_test(task))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of setting counter to a different value depending on inplace and use_loopback, should it just decrement send_posted/completed and recv_posted/completed in allgather_copy? That would remove a good bit of complexity

}
return UCC_OK;
}
ucc_status_t allgather_copy(void *rbuf, void *sbuf, size_t data_size,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allgather_copy should probably be renamed to copy_sbuf_inplace

counter = task->tagged.send_posted - 1;
} else {
counter = task->tagged.send_posted;
}
Copy link
Collaborator

@nsarka nsarka Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two if statements can be removed if send_posted is decremented right after the send is completed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants