UCC/CTX: passing cuda-check from tl ucp to mlx5 #1013

MamziB · 2024-08-21T20:47:43Z

Inside TL MLX5, we need to know whether the TL service has CUDA support. Since we cannot call ucp_context_query() directly inside TL MLX5, we use a shared variable and pass it to TL MLX5. This can be extended to other capabilities in the future as well.
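
For illustration only, the shared-variable idea could look roughly like the sketch below. It is not the code in this patch: the `demo_*` names and the single-field capability record are hypothetical, and the only thing it is meant to show is one component publishing a capability that another component reads without querying UCP itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability record shared between components.
 * These names are illustrative and are not part of UCC. */
typedef struct demo_tl_caps {
    bool cuda_support; /* set by the component that can query the transport */
} demo_tl_caps_t;

/* Producer side: the service component records what it detected. */
static void demo_service_publish_caps(demo_tl_caps_t *caps, bool cuda_detected)
{
    caps->cuda_support = cuda_detected;
}

/* Consumer side: another component only reads the shared record.
 * A missing record is treated as "no CUDA support". */
static bool demo_consumer_enable_gpu_path(const demo_tl_caps_t *caps)
{
    return caps != NULL && caps->cuda_support;
}
```

Further capabilities would become additional fields of the same record, which is the extension point mentioned above.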

MamziB · 2024-09-06T22:30:54Z

@Sergei-Lebedev can you please take a look at this small patch?

janjust · 2024-09-10T13:52:16Z

@MamziB please fix commit title to pass code style check

janjust

Just a general question regarding backward compatibility.
Also, since we're introducing the capability for a specific TL, do we have to update other TLs as well (including private ones)?

src/components/tl/mlx5/mcast/tl_mlx5_mcast.h

Sergei-Lebedev

In my opinion, it's more logical to return the supported memory types as part of the attributes. Since you need to query the service team, you can add an additional field to the attributes and check it in TL MLX5.

MamziB · 2024-09-25T17:14:27Z

@Sergei-Lebedev @janjust

After conducting a thorough code review of UCC, I found that implementing the desired feature without modifying the existing API seems unfeasible. Specifically, we need to make changes in the attribute section.

Currently, the ucc_context_attr_field lacks an attribute related to supported memory types. To address this, I propose adding a new MASK flag to ucc_context_attr_field to represent memory type support. This will allow us to introduce a new valid attribute indicating the supported memory types.

When the function ucc_tl_ucp_get_context_attr is invoked, it can check this MASK flag. If the corresponding flag is set, the memory type information can be populated accordingly. During TL MLX5 team creation, we can then call ucc_tl_ucp_get_context_attr to determine if TL UCP supports GPU memory, which would allow us to enable MCAST for GPU memory when supported.

Can you please advise if you have a more concrete resolution? If not, let's merge the PR as is for now. Thank you.

*  @ref ucc_context_attr_t defines the attributes of the context. The bits in
 *  "mask" bit array is defined by @ref ucc_context_attr_field, which correspond to
 *  fields in structure @ref ucc_context_attr_t. The valid fields of the structure
 *  is specified by the setting the bit to "1" in the bit-array "mask". When
 *  bits corresponding to the fields is not set, the fields are not defined.
 *
 *  @endparblock
 *
 */
typedef struct ucc_context_attr {
    uint64_t                mask;
    ucc_context_type_t      type;
    ucc_coll_sync_type_t    sync_type;
    ucc_context_addr_h      ctx_addr;
    ucc_context_addr_len_t  ctx_addr_len;
    uint64_t                global_work_buffer_size;
} ucc_context_attr_t;
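
A minimal sketch of the mask-based proposal, under stated assumptions: every `demo_*` type, flag, and function below is a hypothetical stand-in, not the UCC API, and the `mem_types` field is the proposed addition rather than something that exists in `ucc_context_attr_t` today. It only illustrates the pattern of filling a field when its mask bit is requested.

```c
#include <stdint.h>

/* Hypothetical field flags; DEMO_CTX_ATTR_FIELD_MEM_TYPES models the
 * proposed new mask bit. None of these names exist in UCC. */
enum demo_ctx_attr_field {
    DEMO_CTX_ATTR_FIELD_TYPE      = 1 << 0,
    DEMO_CTX_ATTR_FIELD_MEM_TYPES = 1 << 1
};

enum demo_mem_type {
    DEMO_MEM_TYPE_HOST = 0,
    DEMO_MEM_TYPE_CUDA = 1
};

typedef struct demo_ctx_attr {
    uint64_t mask;      /* in: fields the caller wants populated */
    int      type;      /* out: valid only if the TYPE bit was set */
    uint64_t mem_types; /* out: bitmap of memory types, valid only if requested */
} demo_ctx_attr_t;

/* Stand-in for the service context that knows about CUDA support. */
typedef struct demo_service_ctx {
    int cuda_enabled;
} demo_service_ctx_t;

/* Populate only the fields whose mask bits are set. */
static int demo_get_context_attr(const demo_service_ctx_t *ctx,
                                 demo_ctx_attr_t *attr)
{
    if (attr->mask & DEMO_CTX_ATTR_FIELD_TYPE) {
        attr->type = 0;
    }
    if (attr->mask & DEMO_CTX_ATTR_FIELD_MEM_TYPES) {
        attr->mem_types = 1ull << DEMO_MEM_TYPE_HOST;
        if (ctx->cuda_enabled) {
            attr->mem_types |= 1ull << DEMO_MEM_TYPE_CUDA;
        }
    }
    return 0;
}

/* What a consumer could do at team creation: request the memory-type
 * field and test the CUDA bit. Returns 1 if CUDA memory is supported. */
static int demo_mcast_can_use_cuda(const demo_service_ctx_t *ctx)
{
    demo_ctx_attr_t attr = {0};

    attr.mask = DEMO_CTX_ATTR_FIELD_MEM_TYPES;
    if (demo_get_context_attr(ctx, &attr) != 0) {
        return 0;
    }
    return (attr.mem_types & (1ull << DEMO_MEM_TYPE_CUDA)) != 0;
}
```

Returning a bitmap rather than a single boolean leaves room for other memory types without another API change, which appears to be the motivation for reporting supported memory types through the attributes.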

MamziB · 2024-10-07T15:51:26Z

@manjugv hey manjo, did you have any thoughts on this PR? I think you mentioned you would take a look at it

janjust · 2024-11-14T16:14:39Z

@manjugv ping

manjugv · 2024-11-14T18:34:36Z

@MamziB @janjust It seems like there is no agreement between you guys on the solution. For me, this is very odd - TL/MLX5 asking about TL/UCP. Also, you are creating a structure with one field.

MamziB · 2024-11-14T18:44:49Z

@manjugv At this stage, I'm focused on exploring all potential solutions for this issue, which is why Tommy, Sergey, and I haven't reached a conclusion yet. We're aiming to be thorough and consider various angles before deciding on a path forward. If you have any additional insights or alternative approaches in mind for this PR, we’d greatly appreciate your guidance.

swx-jenkins3 · 2024-12-07T04:21:40Z

Can one of the admins verify this patch?

manjugv · 2025-01-22T16:47:12Z

@janjust is this ready?

MamziB requested review from Sergei-Lebedev and bureddy August 21, 2024 20:47

MamziB self-assigned this Aug 21, 2024

MamziB added the Ready-for-Review label Aug 21, 2024

UCC/CTX: passing cuda check from tl ucp to others

ab17b0e

MamziB force-pushed the mamzi/tl-caps branch from 47db5d1 to ab17b0e Compare August 21, 2024 20:50

janjust reviewed Sep 16, 2024

src/components/tl/mlx5/mcast/tl_mlx5_mcast.h

Sergei-Lebedev reviewed Sep 17, 2024

manjugv added the Arch-Review-Required label Nov 14, 2024

manjugv removed the Ready-for-Review label Dec 11, 2024

janjust added the WIP - Don't Merge label Jan 16, 2025
