-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TL/UCP: Allow self copy in allgather using network loopback #1021
base: master
Are you sure you want to change the base?
Conversation
Can one of the admins verify this patch? |
5e852a7
to
7e6c55d
Compare
d13b885
to
b3c2e80
Compare
ok to test |
482006f
to
eecdebf
Compare
1d88c9b
to
820794d
Compare
@Sergei-Lebedev why do we have to keep approving runs when force pushing?? Seems an outlier here |
32b3ccd
to
b1c875b
Compare
2dc74f3
to
a2c5fee
Compare
a2c5fee
to
5940100
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! It Looks good, I left a bunch of comments mainly on codestyle.
However, please add tests for that feature.
Also please note from the CI:
Commit title is too long: 59
Bad commit title: 'TL/UCP: Allow self copy in allgather using network loopback'
5619791
to
caa90f8
Compare
caa90f8
to
ee808ea
Compare
b6d4812
to
54e86f2
Compare
4969914
to
3f87beb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -9,7 +9,8 @@ | |||
|
|||
using Param_0 = std::tuple<int, ucc_datatype_t, ucc_memory_type_t, int, gtest_ucc_inplace_t>; | |||
using Param_1 = std::tuple<ucc_datatype_t, ucc_memory_type_t, int, gtest_ucc_inplace_t>; | |||
using Param_2 = std::tuple<ucc_datatype_t, ucc_memory_type_t, int, gtest_ucc_inplace_t, std::string>; | |||
using Param_2 = std::tuple<ucc_datatype_t, ucc_memory_type_t, int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the new param use_loopback
rather be a bool
?
if (ucc_unlikely(UCC_OK != status)) { | ||
return status; | ||
} | ||
} | ||
|
||
while (UCC_INPROGRESS == ucc_tl_ucp_test(task)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just checking: is this call safe in the "inplace" case ?
"If set to 1 performs network loopback for self copy, otherwise uses mc " | ||
"cuda copy", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If set to 1 performs network loopback for self copy, otherwise uses mc " | |
"cuda copy", | |
"If set to 1 performs network loopback for self copy, otherwise uses mc", |
it is not necessarily cuda
ucc_tl_ucp_task_t *task) | ||
{ | ||
ucc_status_t status; | ||
status = ucc_tl_ucp_send_nb(sbuf, data_size, smem, rank, team, task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of status = ... ; if (UCC_OK != status) ...
you can use UCC_CHECK_GOTO(..., err)
task->super.status = status; | ||
return task->super.status; | ||
} | ||
return UCC_OK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err:
return status;
@@ -54,8 +55,7 @@ | |||
|
|||
void ucc_tl_ucp_allgather_knomial_progress(ucc_coll_task_t *coll_task) | |||
{ | |||
ucc_tl_ucp_task_t *task = ucc_derived_of(coll_task, | |||
ucc_tl_ucp_task_t); | |||
ucc_tl_ucp_task_t * task = ucc_derived_of(coll_task, ucc_tl_ucp_task_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*task
args->root : 0; | ||
ucc_rank_t rank = VRANK(task->subset.myrank, broot, size); | ||
size_t local = GET_LOCAL_COUNT(args, size, rank); | ||
ucc_rank_t broot = args->coll_type == UCC_COLL_TYPE_BCAST ? args->root : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align =
@@ -67,6 +69,8 @@ void ucc_tl_ucp_allgather_ring_progress(ucc_coll_task_t *coll_task) | |||
if (UCC_INPROGRESS == ucc_tl_ucp_test(task)) { | |||
return; | |||
} | |||
step = use_loopback ? task->tagged.send_posted - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this line check for inplace as well? If use_loopback is true and inplace is false, will step be incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar for neighbor exchange
3f87beb
to
b9331ee
Compare
b9331ee
to
c4eb0df
Compare
@Sergei-Lebedev @samnordmann @yaeliyac @lappazos The more I look at this implementation the less I think this is the right approach, and I'm surprised we would allow this in. We're introducing an if-else statement all throughout critical paths of the algorithms to avoid this issues specific to the nccl_plugin and only during inplace. But on a broader note, what exactly is the deadlock here? It seems to be an issue specifically with nccl_plugin, and why does loopback fix it? Can we have you guys jump on another devcall and review this with us please. |
if (ucc_unlikely(UCC_OK != status)) { | ||
return status; | ||
} | ||
} | ||
|
||
while ((!UCC_IS_INPLACE(TASK_ARGS(task))) && (UCC_INPROGRESS == ucc_tl_ucp_test(task))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of setting counter to a different value depending on inplace and use_loopback, should it just decrement send_posted/completed and recv_posted/completed in allgather_copy? That would remove a good bit of complexity
} | ||
return UCC_OK; | ||
} | ||
ucc_status_t allgather_copy(void *rbuf, void *sbuf, size_t data_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allgather_copy
should probably be renamed to copy_sbuf_inplace
counter = task->tagged.send_posted - 1; | ||
} else { | ||
counter = task->tagged.send_posted; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two if statements can be removed if send_posted
is decremented right after the send is completed
What
Add option for self copy loopback in out of place start for allgather algorithms: knomial, ring, neighbor, sparbit
Why ?
When using UCC plugin for NCCL, using cuda_memcpy might cause deadlock
How ?
Add UCC_TL_UCP_ALLGATHER_USE_LOOPBACK flag to control this
![image](https://private-user-images.githubusercontent.com/80533283/395868247-f15a1994-c0be-4ac4-bfa4-35941d3b0589.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2MjAwNzcsIm5iZiI6MTczOTYxOTc3NywicGF0aCI6Ii84MDUzMzI4My8zOTU4NjgyNDctZjE1YTE5OTQtYzBiZS00YWM0LWJmYTQtMzU5NDFkM2IwNTg5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTUlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE1VDExNDI1N1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWJjODFmY2RjM2U3NDA5NzA5Njg5Yzk0YjFkZjVkYzVjMzA0YjdlNDNmYTJlZjVmNmUxODY2NjhkODJmMTRlNTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.mqeXsUolWMVKxKVDrqOBL1BZBahpeETxJcZehstya20)
Tested on ISRAEL-1 with multiple test cases - showed good results and same performance with and without loopback