Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CL/HIER allgatherv #1016

Draft
wants to merge 15 commits into
base: master
Choose a base branch
from
Draft

Conversation

x41lakazam
Copy link
Collaborator

@x41lakazam x41lakazam commented Sep 2, 2024

allgatherv algorithm split into three parts:

  • gatherv inside the node
  • allgatherv between the node leaders
  • broadcast inside the node

Still under development. Currently supports only int32, I will add support for int64 once I'm sure of the logic.

Situation where the ranks are not in a continuous order across nodes is not handled yet.

@x41lakazam
Copy link
Collaborator Author

Mpi test results:

$ mpirun         \
	-x UCC_CL_HIER_ALLGATHERV_TUNE=allgatherv:@node_split:inf \
	-x UCC_CLS=basic,hier \
	-x UCC_CL_HIER_NODE_LEADERS_SBGP_TLS=ucp \
	-x UCC_CL_HIER_NODE_SBGP_TLS=ucp \
	-x LD_PRELOAD=./lib/libucc.so \
	-x PATH         \
	-x LD_LIBRARY_PATH         \
	-np 6 --map-by ppr:3:node  \
	./bin/ucc_test_mpi -c allgatherv -t world -d uint32 -v -m 8 -v 

===== UCC MPI TEST INFO =======
seed:         6222
collectives:  Allgatherv
data types:   uint32
memory types: Host
teams:        world
tc=Allgatherv team=world mtype=host msgsize=8 persistent=0 inplace=0 dt=uint32

===== UCC MPI TEST REPORT =====
collective                 tests    passed    failed   skipped
Allgatherv                     1         1         0         0

===== UCC MPI TEST SUMMARY =====
total tests:  1
passed:       1
skipped:      0
failed:       0
elapsed:      9s

@@ -2,6 +2,10 @@
# Copyright (c) 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update copyright year

@@ -0,0 +1,381 @@
/**
* Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* Copyright (c) Meta Platforms, Inc. and affiliates. 2022.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rem


UCC_CL_HIER_PROFILE_REQUEST_EVENT(task, "cl_hier_allgatherv_finalize", 0);

ucc_assert(cl_schedule->super.super.n_tasks <= 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<= MAX_ALLGATHERV_TASKS

}

// Question: Do i need this function ?
ucc_status_t ucc_cl_hier_allgatherv_triggered_post_setup(ucc_coll_task_t *task)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

Comment on lines +79 to +80
ucc_coll_task_t *tasks[MAX_ALLGATHERV_TASKS] = {NULL};
int n_tasks = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialized vars go first

frag_p, n_frags);
}

static ucc_status_t ucc_cl_hier_allgatherv_node_split_frag_setup(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed, no pipelining


if (c64 ^ d64) {
cl_debug(team->context->lib,
"mixed 64 bit count/displ mode is not supported\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rem \n

Comment on lines +112 to +116
if (n_frags > 1) {
args.max_frag_count =
ucc_buffer_block_count(args.args.src.info.count, n_frags, 0);
args.mask |= UCC_BASE_CARGS_MAX_FRAG_COUNT;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed, no pipelining

args.mask |= UCC_BASE_CARGS_MAX_FRAG_COUNT;
}

full_size = cl_team->sbgps[UCC_HIER_SBGP_FULL].sbgp->group_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think UCC_CL_TEAM_SIZE is better, no need to create FULL sbgp


// Gatherv in the node
// src.info.buffer -> dst.info_v.buffer
if (node_size > 1){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space before {

((uint32_t *)gv_rc)[_i] = _scount;
((uint32_t *)gv_displ)[_i] = _displ;

_displ += _scount;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This first gatherv call is problematic if the displacements are not continuous (i.e if displ[i] != sum(counts[:i])), for example say we have those three src buffers:

rank1: [1,2]
rank2: [3]
rank3: [4,5]

And displacements=[2,5,6]
Say that for every rank dst buffer points to this array: [10, 10, 0, 0, 10, 0, 0, 0]
Thus, in the end it should be: [10, 10, 1, 2, 10, 3, 4, 5]

But here because we are writing directly to the dst buffer, and because we are "stacking" each of the three buffers to the beginning of the buf, we are going to erase data of dst buf.

We can either decide not to support it or allocate a separate buffer specifically for this gatherv call.

@janjust do you know if/how this is handled in the hcoll alg ?

Copy link
Contributor

@Sergei-Lebedev Sergei-Lebedev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add gtest

}
} while (0);

args.args.coll_type = UCC_COLL_TYPE_GATHERV;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of inplace allgatherv args.src.info is ignored, but for gatherv args.src.info is ignored only at root. Need to set args.src.info for non root ranks

status);

// The result of this collective is in dst.info_v.buffer, this should be the source buffer of the next collective
args.args.src.info.buffer = args.args.dst.info_v.buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to use inplace allgatherv on second step if node size > 1

}
} while (0);

args.args.coll_type = UCC_COLL_TYPE_GATHERV;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check node sbgp enabled

n_tasks++;
}

// Allgatherv in the full net
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check node leaders group is enabled


int _hid, _count, _displ;
/* For every rank in the communicator,
* - find his host id (which is also the index of the leader -- Sergey ?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general case host_id != node_leader_rank


// BCAST in the node
// src.info.buffer -> src.info.buffer
if (node_size > 1){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is sbgp_enabled

status);

// The result of this collective is in dst.info_v.buffer, this should be the source buffer of the next collective
args.args.src.info.buffer = args.args.dst.info_v.buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of allgatherv need to set src.info.datatype and src.info.mem_type

}

// This collective writes to src.info.buffer, this should be the output buffer (dst.info_v.buffer)
args.args.dst.info_v.buffer = args.args.src.info.buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed, src.info.buffer == dst.info_v.buffer


// This collective writes to src.info.buffer, this should be the output buffer (dst.info_v.buffer)
args.args.dst.info_v.buffer = args.args.src.info.buffer;
n_tasks++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not used

@janjust
Copy link
Collaborator

janjust commented Sep 19, 2024

@x41lakazam is this still draft?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants