-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TL/MLX5: various optimizations #1012
base: master
Are you sure you want to change the base?
TL/MLX5: various optimizations #1012
Conversation
TL/MLX5: add npolls cfg for FANIN TL/MLX5: knomial fanin TL/MLX5: add prints and profile events TL/MLX5: remove debug prints
tiny bit more robust print blocks dimensions fully working configurable batch_size, serialization, and pollings
clean and working TL/MLX5: add more config for block dimensions force longer by default
lintrunner cleaning
ucc_tl_mlx5_alltoall_t *a2a = team->a2a; | ||
int seq_index = task->alltoall.seq_index; | ||
int npolls = UCC_TL_MLX5_TEAM_CTX(team)->cfg.npolls; | ||
int radix = UCC_TL_MLX5_TEAM_LIB(team)->cfg.fanin_kn_radix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment
} | ||
*dist *= radix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment
@@ -242,7 +282,13 @@ static ucc_status_t ucc_tl_mlx5_fanout_start(ucc_coll_task_t *coll_task) | |||
|
|||
tl_debug(UCC_TASK_LIB(task), "fanout start"); | |||
/* start task if completion event received */ | |||
UCC_TL_MLX5_PROFILE_REQUEST_EVENT(task, "mlx5_alltoall_fanout_start", 0); | |||
if (team->a2a->node.sbgp->group_rank == team->a2a->node.asr_rank) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this if/else when PROFILING is not requested? this is to improve the performance when profiling is not needed
ucc_tl_mlx5_alltoall_t *a2a = team->a2a; | ||
int node_size = a2a->node.sbgp->group_size; | ||
int net_size = a2a->net.sbgp->group_size; | ||
int op_msgsize = node_size * a2a->max_msg_size * UCC_TL_TEAM_SIZE(team) * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align
int block_msgsize = block_h * block_w * task->alltoall.msg_size; | ||
ucc_status_t status = UCC_OK; | ||
int node_grid_w = node_size / block_w; | ||
int node_nbr_blocks = (node_size * node_size) / (block_h * block_w); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align
if (!send_to_self && | ||
task->alltoall.op->blocks_sent[cyc_rank] < node_nbr_blocks) { | ||
dm = ucc_tl_mlx5_a2a_wait_for_dm_chunk(task); | ||
if (status != UCC_OK) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which status you are checking?
@@ -28,6 +28,19 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = { | |||
ucc_offsetof(ucc_tl_mlx5_lib_config_t, dm_buf_num), | |||
UCC_CONFIG_TYPE_ULUNITS}, | |||
|
|||
{"FORCE_REGULAR", "y", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you encapsulate all these three env vars into one with 3 options
c->offset = (ptrdiff_t)team->dm_offset; | ||
team->dm_offset = PTR_OFFSET(team->dm_offset, | ||
UCC_TL_MLX5_TEAM_LIB(team)->cfg.dm_buf_size); | ||
c->addr = (uintptr_t)PTR_OFFSET( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align
What
This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:
We might want to merge this PR as is, or to divide it into several smaller ones. But this branch is at least a pointer for a working version, that can be used as is for performance experimentation.
TODO:
One important optimization that is yet to be implemented is to support using several NICs. So far, our algorithm only uses one NIC.
cc @lappazos @x41lakazam