
Feature request: improved zero-copy non-contiguous send #7002

Open
pascal-boeschoten-hapteon opened this issue May 9, 2024 · 3 comments

@pascal-boeschoten-hapteon

Hello,

With XPMEM, MPICH supports zero-copy intra-node sends for non-contiguous datatypes.
But AFAICT there are some restrictions:

  • It can't be used with hindexed or struct datatypes.
  • The size of the datatype is limited by a header field (capped at MPIDI_POSIX_MAX_AM_HDR_SIZE), which limits the complexity of the data that can be sent at once.
  • MPICH does not release XPMEM segments, so if the user doesn't reuse buffers, MPICH accumulates segments until it hits an XPMEM limit and crashes (see the sketch below).
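
To illustrate the last point, here is a minimal sketch of the kind of send loop I mean. The strided datatype, counts, and iteration count are only illustrative, and whether the zero-copy path is actually taken will also depend on message-size thresholds:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Non-contiguous (strided) datatype: 1000 blocks of 4 doubles, stride 8 doubles.
    MPI_Datatype strided;
    MPI_Type_vector(1000, 4, 8, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    for (int iter = 0; iter < 1000000; ++iter) {
        // A fresh buffer every iteration, i.e. no buffer reuse: in our
        // experience each zero-copy send then maps a new XPMEM segment
        // that is never released, until an XPMEM limit is hit.
        std::vector<double> buf(8000, 1.0);
        if (rank == 0)
            MPI_Send(buf.data(), 1, strided, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf.data(), 1, strided, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&strided);
    MPI_Finalize();
    return 0;
}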

Could anything be done about these limitations? And could it be made to work inter-node with e.g. UCX active message?

Kind regards,
Pascal Boeschoten

@hzhou
Contributor

hzhou commented Aug 21, 2024

I'm sorry for not getting back to you sooner. Yes, we can improve the non-contiguous IPC. To help us prioritize, please add any use cases you have.

@pascal-boeschoten-hapteon
Author

Hello, thank you for getting back to me!

I've made a similar feature request for Open MPI (open-mpi/ompi#12536); I hope it's okay if I just add the same information I gave there:


Not having zero-copy for non-contiguous sends means certain data structures need to be split into many requests.
For example, if you have this:

#include <vector>

struct S {
    std::vector<float>  vec_f; // size 10k
    std::vector<int>    vec_i; // size 10k
    std::vector<double> vec_d; // size 10k
    std::vector<char>   vec_c; // size 10k
};
std::vector<S> vec_s; // size 100

Being able to send it in 1 request instead of 400 (one for each contiguous buffer) seems like it could be quite advantageous, both for performance and for ease of use, even if there are restrictions, e.g. requiring the buffer datatypes to be MPI_BYTE (i.e. assuming the sender and receiver have the same architecture / a homogeneous cluster).

At the moment, sending such a datatype in 1 request with a struct datatype results in packing/unpacking, which is so slow for large buffers that it is significantly slower than sending the 400 contiguous buffers as separate zero-copy requests, as in the example above.
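
Concretely, the single-request version I have in mind would look something like this: flatten every contiguous buffer into one hindexed datatype over MPI_BYTE with absolute addresses. This is only a rough sketch (the helper name is mine), and per the above such a datatype currently goes through packing/unpacking rather than zero-copy:

#include <mpi.h>
#include <vector>
#include <cstddef>

struct S { // as defined above
    std::vector<float>  vec_f;
    std::vector<int>    vec_i;
    std::vector<double> vec_d;
    std::vector<char>   vec_c;
};

// Flatten every contiguous buffer inside vec_s into one hindexed datatype
// over MPI_BYTE, using absolute addresses, so the whole structure can be
// sent in a single request from MPI_BOTTOM.
MPI_Datatype make_datatype(const std::vector<S>& vec_s) {
    std::vector<int>      lengths; // block lengths in bytes
    std::vector<MPI_Aint> displs;  // absolute addresses
    for (const S& s : vec_s) {
        auto add = [&](const void* p, std::size_t bytes) {
            MPI_Aint addr;
            MPI_Get_address(p, &addr);
            displs.push_back(addr);
            lengths.push_back(static_cast<int>(bytes));
        };
        add(s.vec_f.data(), s.vec_f.size() * sizeof(float));
        add(s.vec_i.data(), s.vec_i.size() * sizeof(int));
        add(s.vec_d.data(), s.vec_d.size() * sizeof(double));
        add(s.vec_c.data(), s.vec_c.size() * sizeof(char));
    }
    MPI_Datatype dtype;
    MPI_Type_create_hindexed(static_cast<int>(lengths.size()), lengths.data(),
                             displs.data(), MPI_BYTE, &dtype);
    MPI_Type_commit(&dtype);
    return dtype;
}

// Usage: one request instead of 400.
//   MPI_Datatype dtype = make_datatype(vec_s);
//   MPI_Send(MPI_BOTTOM, 1, dtype, dest, tag, comm);
//   MPI_Type_free(&dtype);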

Other use cases could be sending many sub-views of a large 2D array, or sending map-like/tree-like types.
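
For the 2D-array case, a sub-view could be described with MPI_Type_create_subarray and sent as one request; a minimal sketch (all sizes are just illustrative):

#include <mpi.h>

// Describe an 80x50 sub-block of a 1000x1000 row-major array of doubles,
// starting at row 100, column 200, so it can be sent in a single request.
MPI_Datatype make_subview() {
    const int sizes[2]    = {1000, 1000}; // full array extent
    const int subsizes[2] = {80, 50};     // sub-view extent
    const int starts[2]   = {100, 200};   // sub-view origin
    MPI_Datatype subview;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &subview);
    MPI_Type_commit(&subview);
    return subview;
}

// Usage (assuming `grid` points to the full 1000x1000 array):
//   MPI_Send(grid, 1, subview, dest, tag, comm);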

To give a bit more context, we've observed that when sending many large and complex data structures (similar to the one in the example above) to many other ranks, it's significantly slower to issue many small zero-copy requests than one big request with packing/unpacking. The sheer volume of requests seems to be the bottleneck, and we've seen up to a factor of 5 difference in throughput. But when it's just 1 rank sending to 1 other rank, the many small zero-copy requests are faster, as the packing/unpacking becomes limited by memory bandwidth. This suggests that if we could issue one big zero-copy request, the performance gain in the congested case would be very significant.
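
For reference, the many-small-requests variant we compare against is essentially this (a rough sketch; it assumes the struct S from the example above, and the helper name is mine):

#include <mpi.h>
#include <vector>

// One MPI_Isend per contiguous buffer inside vec_s: 4 requests per element,
// i.e. 400 requests for the example above (struct S as defined earlier).
void send_many(const std::vector<S>& vec_s, int dest, MPI_Comm comm) {
    std::vector<MPI_Request> reqs;
    int tag = 0;
    for (const S& s : vec_s) {
        MPI_Request r;
        MPI_Isend(s.vec_f.data(), static_cast<int>(s.vec_f.size()), MPI_FLOAT,
                  dest, tag++, comm, &r);
        reqs.push_back(r);
        MPI_Isend(s.vec_i.data(), static_cast<int>(s.vec_i.size()), MPI_INT,
                  dest, tag++, comm, &r);
        reqs.push_back(r);
        MPI_Isend(s.vec_d.data(), static_cast<int>(s.vec_d.size()), MPI_DOUBLE,
                  dest, tag++, comm, &r);
        reqs.push_back(r);
        MPI_Isend(s.vec_c.data(), static_cast<int>(s.vec_c.size()), MPI_CHAR,
                  dest, tag++, comm, &r);
        reqs.push_back(r);
    }
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}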


Let me know if you need any more information.

Kind regards,
Pascal Boeschoten

@hzhou
Contributor

hzhou commented Sep 12, 2024

We can extend the features you asked for. We'll ping back here to ask for your help with testing when we have the pull request.

@hzhou hzhou self-assigned this Sep 12, 2024