Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: various optimizations #1012

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

samnordmann
Copy link
Collaborator

@samnordmann samnordmann commented Aug 21, 2024

What

This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:

  1. support rectangular blocks
  2. other configurations in how we post the WQEs:
    • iterate across nodes before blocks when posting the WQEs
    • reuse dm chunks
    • send blocks by batch
  3. knomial fan-in for the internode sync

We might want to merge this PR as is, or to divide it into several smaller ones. But this branch is at least a pointer for a working version, that can be used as is for performance experimentation.

TODO:

One important optimization that is yet to be implemented is to support using several NICs. So far, our algorithm only uses one NIC.

cc @lappazos @x41lakazam

TL/MLX5: add npolls cfg for FANIN

TL/MLX5: knomial fanin

TL/MLX5: add prints and profile events

TL/MLX5: remove debug prints
tiny bit more robust

print blocks dimensions

fully working configurable batch_size, serialization, and pollings
clean and working

TL/MLX5: add more config for block dimensions

force longer by default
ucc_tl_mlx5_alltoall_t *a2a = team->a2a;
int seq_index = task->alltoall.seq_index;
int npolls = UCC_TL_MLX5_TEAM_CTX(team)->cfg.npolls;
int radix = UCC_TL_MLX5_TEAM_LIB(team)->cfg.fanin_kn_radix;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment

}
*dist *= radix;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alignment

@@ -242,7 +282,13 @@ static ucc_status_t ucc_tl_mlx5_fanout_start(ucc_coll_task_t *coll_task)

tl_debug(UCC_TASK_LIB(task), "fanout start");
/* start task if completion event received */
UCC_TL_MLX5_PROFILE_REQUEST_EVENT(task, "mlx5_alltoall_fanout_start", 0);
if (team->a2a->node.sbgp->group_rank == team->a2a->node.asr_rank) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this if/else when PROFILING is not requested? this is to improve the performance when profiling is not needed

ucc_tl_mlx5_alltoall_t *a2a = team->a2a;
int node_size = a2a->node.sbgp->group_size;
int net_size = a2a->net.sbgp->group_size;
int op_msgsize = node_size * a2a->max_msg_size * UCC_TL_TEAM_SIZE(team) *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

int block_msgsize = block_h * block_w * task->alltoall.msg_size;
ucc_status_t status = UCC_OK;
int node_grid_w = node_size / block_w;
int node_nbr_blocks = (node_size * node_size) / (block_h * block_w);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

if (!send_to_self &&
task->alltoall.op->blocks_sent[cyc_rank] < node_nbr_blocks) {
dm = ucc_tl_mlx5_a2a_wait_for_dm_chunk(task);
if (status != UCC_OK) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which status you are checking?

@@ -28,6 +28,19 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = {
ucc_offsetof(ucc_tl_mlx5_lib_config_t, dm_buf_num),
UCC_CONFIG_TYPE_ULUNITS},

{"FORCE_REGULAR", "y",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you encapsulate all these three env vars into one with 3 options

c->offset = (ptrdiff_t)team->dm_offset;
team->dm_offset = PTR_OFFSET(team->dm_offset,
UCC_TL_MLX5_TEAM_LIB(team)->cfg.dm_buf_size);
c->addr = (uintptr_t)PTR_OFFSET(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants