TL/MLX5: add device mem mcast bcast #989
Conversation
Thanks! Can you please address the following comments?
- Can you add some context and an explanation of the "what", "why", and "how" of what the PR is achieving? Also, IMHO, using the term "cuda memory" would be more explicit.
- We have in ucc a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it gives better performance and cleaner code.
- Please make sure building ucc is possible even without cuda support.
- Please fix the CI issues.
- Can you add a test for this feature?
> Can you add some context and an explanation of the "what", "why", and "how" of what the PR is achieving? Also, IMHO, using the term "cuda memory" would be more explicit.

If GPU Direct RDMA is available, we can directly call ibv_post_send()/recv() using GPU buffers. Therefore, when this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers, making it possible to receive MCAST packets into the GPU directly, without additional copies or staging.
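For illustration only, here is a minimal sketch of the pre-posting described above, assuming GPUDirect RDMA is set up; the function and variable names are made up and this is not the PR's code. The key point is that a registered CUDA pointer is handed to ibv_post_recv() exactly like a host pointer.

```c
/* Illustrative sketch only -- not the PR's code. With GPUDirect RDMA,
 * a cudaMalloc'ed buffer can be registered and pre-posted to the mcast
 * receive queue just like a host buffer. */
#include <stdint.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

static int post_gpu_recv(struct ibv_pd *pd, struct ibv_qp *qp, size_t len)
{
    void               *gpu_buf;
    struct ibv_mr      *mr;
    struct ibv_sge      sge;
    struct ibv_recv_wr  wr     = {0};
    struct ibv_recv_wr *bad_wr = NULL;

    /* len should leave room for the 40-byte GRH that UD receives prepend */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        return -1;
    }
    /* registration of device memory goes through GPUDirect under the hood */
    mr = ibv_reg_mr(pd, gpu_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        cudaFree(gpu_buf);
        return -1;
    }

    sge.addr   = (uintptr_t)gpu_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    wr.wr_id   = (uintptr_t)gpu_buf;
    wr.sg_list = &sge;
    wr.num_sge = 1;

    /* mcast packets now land directly in GPU memory, with no staging copy */
    return ibv_post_recv(qp, &wr, &bad_wr);
}
```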
Hi @samnordmann Thanks for the constructive comments. I have added a new commit. Please let me know if you have more comments.
So this adds support for the cuda memory type for the user's buffer? I don't understand, since the cuda memory type is supposed to be supported already, as indicated here: ucc/src/components/tl/mlx5/tl_mlx5_team.c, line 299 in b2d5a78.
I am sorry, but I don't understand why you think it is better not to use the mc component. This component is here exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (which I can list if needed), while I fail to see the concrete benefit of not using it.
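For context, a rough sketch of what going through the mc component looks like. The ucc_mc_alloc/ucc_mc_memcpy signatures are as understood from src/components/mc/ucc_mc.h, and the buffer-header field name (addr) is an assumption; the helper itself is made up for illustration.

```c
/* Hedged sketch of using the mc component instead of raw cuda calls. */
#include "components/mc/ucc_mc.h"

static ucc_status_t stage_to_gpu(ucc_mc_buffer_header_t **hdr,
                                 const void *host_src, size_t len)
{
    ucc_status_t status;

    /* allocate from the CUDA memory pool managed by mc */
    status = ucc_mc_alloc(hdr, len, UCC_MEMORY_TYPE_CUDA);
    if (status != UCC_OK) {
        return status;
    }
    /* one call covers host->cuda, cuda->host and cuda->cuda copies;
     * the buffer is released later with ucc_mc_free(*hdr) */
    return ucc_mc_memcpy((*hdr)->addr, host_src, len,
                         UCC_MEMORY_TYPE_CUDA, UCC_MEMORY_TYPE_HOST);
}
```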
This comment is not addressed. Every cuda API call should be wrapped in the appropriate compiler guard (another advantage of using mc). This is related to the error seen in the CI. The program should compile with a configure command of the type …
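By way of illustration, a sketch of this kind of guard, assuming HAVE_CUDA is the macro configure defines when CUDA support is enabled; the helper name, its error handling, and the use of ucc_malloc here are illustrative assumptions, not the PR's code.

```c
/* Sketch: keep CUDA-specific calls behind a compile-time guard so that
 * ucc still builds when configured without CUDA support. */
#if HAVE_CUDA
#include <cuda_runtime.h>
#endif

static ucc_status_t alloc_recv_staging(void **buf, size_t len, int on_gpu)
{
    if (on_gpu) {
#if HAVE_CUDA
        return (cudaMalloc(buf, len) == cudaSuccess) ? UCC_OK
                                                     : UCC_ERR_NO_MEMORY;
#else
        /* built without cuda: GPU staging is simply not available */
        return UCC_ERR_NOT_SUPPORTED;
#endif
    }
    /* host path compiles in every configuration */
    *buf = ucc_malloc(len, "mcast recv staging");
    return (*buf != NULL) ? UCC_OK : UCC_ERR_NO_MEMORY;
}
```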
There are still the same issues.
Ok, I think it is important to test this feature before merging it. This test should be triggered by the CI.
Thanks. Some comments still need to be addressed.
Force-pushed from 4cb99e3 to 5fca2e8.
@samnordmann thanks for the constructive comments. Please see the new commit.
> Thanks! Can you please address the following comments?
> Can you add some context and an explanation of the "what", "why", and "how" of what the PR is achieving? Also, IMHO, using the term "cuda memory" would be more explicit.
> If GPU Direct RDMA is available, we can directly call ibv_post_send()/recv() using GPU buffers. Therefore, when this feature is enabled, we pre-post GPU buffers into our receive queue instead of CPU buffers, making it possible to receive MCAST packets into the GPU directly, without additional copies or staging.
> So this adds support for the cuda memory type for the user's buffer? I don't understand, since the cuda memory type is supposed to be supported already, as indicated here:
> ucc/src/components/tl/mlx5/tl_mlx5_team.c, line 299 in b2d5a78:
> ucc_status_t ucc_tl_mlx5_team_get_scores(ucc_base_team_t * tl_team,

Can you please help me understand this? I still don't understand the motivation for this PR, or what would happen today if a user uses mcast with a GPU buffer (which is already enabled, as pointed out above). Please provide a description.
> We have in ucc a component, "mc", that is an interface to the different memory types and provides alloc, free, memcpy, memset, etc. We should use it instead; it gives better performance and cleaner code.
> Please refer to #989 (comment)
> I am sorry, but I don't understand why you think it is better not to use the mc component. This component is here exactly for this purpose and is used everywhere in the codebase. Using the component has many benefits (which I can list if needed), while I fail to see the concrete benefit of not using it.

Thanks for replacing the cuda memory calls with mc calls. However, the goal is also to use the mc component for the CPU memory calls; it will remove a lot of duplication. Can you please make this change?
> Please fix the CI issues.
> Sure, will do.
> There are still the same issues.

The CI is still red. Can you please rebase this PR and re-run the CI so we can check?
> Can you add a test for this feature?
> Sure, I will open a new PR for it.
> Ok, I think it is important to test this feature before merging it. This test should be triggered by the CI.

Can you please add these tests?
@samnordmann can you please take a look at the new commit? With the new design changes we have a performance issue: if I use ucc_mc_memcpy() I get much worse performance than directly using cudaMemcpy(). (Benchmark output comparing cudaMemcpy and ucc_mc_memcpy() was attached here.)
Force-pushed from 565c092 to 9d12794.
So the performance gap comes from the cudaMemcpy being asynchronous inside ucc_mc_memcpy(); such a copy involves overhead related to managing cuda streams. We do not require an asynchronous copy in our design, so I added a new mc function that uses a synchronous cudaMemcpy and used it in our design.
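To make the gap concrete, a rough sketch of the two copy paths being compared; the mc internals and the names here are assumptions, not the PR's code. The asynchronous path enqueues cudaMemcpyAsync on a stream and then synchronizes it, while the new helper just issues a blocking cudaMemcpy.

```c
#include <cuda_runtime.h>

/* asynchronous path, similar in spirit to the stream-based copy inside
 * ucc_mc_memcpy(): enqueue on a stream, then wait for it */
static int copy_async_then_sync(void *dst, const void *src, size_t len,
                                cudaStream_t stream)
{
    if (cudaMemcpyAsync(dst, src, len, cudaMemcpyDefault, stream) !=
        cudaSuccess) {
        return -1;
    }
    return (cudaStreamSynchronize(stream) == cudaSuccess) ? 0 : -1;
}

/* blocking path used by the new synchronous mc helper described above */
static int copy_blocking(void *dst, const void *src, size_t len)
{
    return (cudaMemcpy(dst, src, len, cudaMemcpyDefault) == cudaSuccess) ? 0
                                                                         : -1;
}
```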
@samnordmann Thanks for the constructive comments. I addressed all your comments and fixed the performance issues. Please take a look at the updated commit.
@Sergei-Lebedev can you also please let me know if you have any comments on this PR? I addressed all Sam's comments already.
Force-pushed from 9d12794 to e336b2a.
Thanks!
@Sergei-Lebedev thank you for the comments. Please see my responses and the new commit I pushed.
@Sergei-Lebedev can you please let me know if you have further comments?
Force-pushed from 5d9aee5 to 79246ce.
Hi @Sergei-Lebedev I pushed the requested changes. Thanks.
Force-pushed from 7798792 to d630d6b.
@Sergei-Lebedev I removed the …
Force-pushed from d630d6b to 4c8a86a.
@Sergei-Lebedev Thank you for the new comments. I have resolved all of them and updated the commit.
Force-pushed from 4c8a86a to 162def8.