prov/util: fix FI_MULTI_RECV not set on FI_ECANCELED #10734

Merged: 1 commit merged into ofiwg:main on Feb 5, 2025

Conversation

soumagne (Contributor)

When canceling a multi-recv entry, FI_MULTI_RECV was never set as part of the error entry flags. Fix so that the user can determine if the buffer is still in use.
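A minimal consumer-side sketch of what this fix enables, using the standard libfabric CQ error-entry API; the names handle_cancel_err, cq, and mr_buf are placeholders for illustration, not code from the patch:

#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* After requesting a cancel on a multi-recv posting, read the error
 * entry and use the FI_MULTI_RECV flag (the one this fix sets) to
 * tell whether the provider has released the buffer. */
static void handle_cancel_err(struct fid_cq *cq, void *mr_buf)
{
	struct fi_cq_err_entry err_entry = { 0 };

	if (fi_cq_readerr(cq, &err_entry, 0) < 1)
		return;	/* no error entry available yet */

	if (err_entry.err == FI_ECANCELED &&
	    (err_entry.flags & FI_MULTI_RECV)) {
		/* FI_MULTI_RECV on the error entry: the buffer is no
		 * longer in use and may be reused or freed. */
		free(mr_buf);
	}
}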

j-xiong (Contributor) commented Jan 27, 2025

This is fixing the utility provider; please change the title to start with prov/util:.

@soumagne soumagne changed the title prov/rxm: fix FI_MULTI_RECV not set on FI_ECANCELED prov/util: fix FI_MULTI_RECV not set on FI_ECANCELED Jan 27, 2025
soumagne (Contributor, Author)

Ah right, thanks. Fixed.

soumagne (Contributor, Author)

@aingerson I updated the fix yesterday. I've made cancel return -FI_ENOENT if the entry is not found (it used to return FI_SUCCESS); let me know if you prefer to keep the previous behavior.

shefty (Member) commented Jan 31, 2025

Cancel is an async operation. We intentionally did not want cancel to return ENOENT, or for apps to rely on that return value. That error is only possible for SW-based cancel.

I'm referring only to an error code being reported to the user.

aingerson (Contributor)

@soumagne Yeah, what Sean said. FI_SUCCESS means the operation was submitted: it either results in a cancel error entry if it was cancelled, or it is basically ignored because the operation can't be canceled. In your app, you'd have to read the CQ to see if an entry is available and retry the cancel until you either get the cancel entry for that recv or get the rx completion for it.
Also, could you squash these two commits together? They are needed together, since the first one asserts the ref cnt is 0 and the second one actually handles the non-zero ref cnt.
Thanks!
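A rough sketch of that retry loop under assumed names (ep, cq, context), built only from standard libfabric calls; this is illustration, not code from this PR:

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Keep submitting the cancel and reading the CQ until either the
 * cancel error entry or the normal rx completion for `context`
 * arrives. FI_SUCCESS from fi_cancel() only means the request was
 * submitted; the outcome is reported through the CQ. */
static void cancel_and_drain(struct fid_ep *ep, struct fid_cq *cq,
			     void *context)
{
	struct fi_cq_entry comp;
	struct fi_cq_err_entry err;
	ssize_t ret;

	for (;;) {
		(void) fi_cancel(&ep->fid, context);

		ret = fi_cq_read(cq, &comp, 1);
		if (ret == 1 && comp.op_context == context)
			return;	/* rx completed normally */
		/* a real app would also handle completions belonging
		 * to other contexts instead of dropping them */
		if (ret == -FI_EAVAIL &&
		    fi_cq_readerr(cq, &err, 0) == 1 &&
		    err.op_context == context && err.err == FI_ECANCELED)
			return;	/* cancel entry seen; buffer state is
				 * indicated by err.flags & FI_MULTI_RECV */
	}
}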

soumagne (Contributor, Author)

OK, that sounds good. I fixed it to always return FI_SUCCESS instead of -FI_ENOENT, and I squashed the commits.

@@ -1037,12 +1041,15 @@ static bool util_cancel_recv(struct util_srx_ctx *srx, struct slist *queue,
	slist_foreach(queue, item, prev) {
		rx_entry = container_of(item, struct util_rx_entry, peer_entry);
		if (rx_entry->peer_entry.context == context) {
			if (rx_entry->peer_entry.flags & FI_MULTI_RECV &&
			    rx_entry->multi_recv_ref)
				return -FI_EAGAIN;
Contributor

I'm still confused about this EAGAIN return. If we want to define the EAGAIN return as valid for a cancel, we should add it to the man page to clarify what it means for the application. As the definition stands right now, it feels like this should be an FI_SUCCESS, because the entry was found and we dealt with it accordingly, indicating to the application that there should eventually be an entry on the CQ (either the rx that is pending or the cancel entry); once that is processed, the application can try to cancel again if it didn't get the completion for the entry. I'm fine with either; I just think we should clarify the expectation here.

Member

My intent was that EAGAIN would be returned immediately to the user. EAGAIN is a valid return code for any data transfer or async operation, so the app is already expected to deal with that. EAGAIN does not indicate anything about whether the entry was found or its state. From the app's viewpoint, it just says we can't deal with the cancel right now, and it needs to be retried.

EAGAIN should not be an error code returned through a completion.
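A sketch of that caller-side retry (placeholder names ep, cq, and context; assumed app code, not from the PR):

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* -FI_EAGAIN from fi_cancel() just says "can't process the cancel
 * right now": drive CQ progress and retry the call itself. */
static void cancel_retry(struct fid_ep *ep, struct fid_cq *cq,
			 void *context)
{
	ssize_t ret;

	do {
		ret = fi_cancel(&ep->fid, context);
		if (ret == -FI_EAGAIN)
			(void) fi_cq_read(cq, NULL, 0);	/* progress only */
	} while (ret == -FI_EAGAIN);
}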

Contributor Author

I also agree. In that particular case, unless I'm missing something, I don't think we've dealt with it; we just found it and returned EAGAIN back to the caller of fi_cancel(), so it would never get canceled until it's retried.

Let me know if you'd like me to add any clarification anywhere. I can also add a comment in the code.

Contributor

Yeah, I understand the benefit of the EAGAIN and I'm fine with allowing it for this case. I just read FI_SUCCESS as essentially telling the app to read the CQ because something will be available. FI_SUCCESS just says: I looked at your request and did what I thought correct with it, which for the mrecv case is not to cancel it and to let the pending receives complete. I.e., the app would have to read the mrecv completion and decide to try to cancel again because the buffer is not done. I think returning EAGAIN is simpler, and that's totally fine. I would just like the man page (fi_endpoint) to reflect that return code and explain what it means for the application.
Right now it says

Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. For fi_cancel, a return value of 0 indicates that the cancel request was submitted for processing.

So just adding something there about EAGAIN is totally fine.

Contributor Author

OK, I updated the man page and added a comment.

aingerson (Contributor) left a comment

Thank you for the updates!

j-xiong (Contributor) commented Feb 3, 2025

bot:aws:retest

j-xiong (Contributor) commented Feb 4, 2025

The Intel CI failure is caused by an issue with the HW used for dmabuf tests, which is unrelated to this PR.

@shijin-aws Is the AWS CI failure related?

shijin-aws (Contributor)

2025-02-03 19:42:49,475 - INFO - test_runner_windows - Efawin compilation finished successfully.

Usage:
  C:/10734_1738611233\libfabric\x64\Release-v142\fi_pingpong.exe [OPTIONS]		start server
  C:/10734_1738611233\libfabric\x64\Release-v142\fi_pingpong.exe [OPTIONS] <srv_addr>	connect to server

Ping pong client and server

Options:
 -B <src_port>        source control port number (server: 47592, client: auto)
 -P <dst_port>        destination control port number (client: 47592)
 -d <domain>          domain name
 -s <source address>  source address associated with domain name
 -p <provider>        specific provider name eg sockets, verbs
 -e <ep_type>         endpoint type: msg|rdm|dgram (dgram)
 -I <number>          number of iterations (10)
 -S <size>            specific transfer size or 'all' (all)
 -c                   enables data_integrity checks
 -m <transmit mode>   transmit mode type: msg|tagged (msg)
 -h                   display this help output
 -v                   enable debugging output
 -6                   use IPv6 address

2025-02-03 19:42:59,612 - INFO - test_runner_windows - Server process failed with return code 1


...

@soumagne Can you rebase your PR onto the latest main branch that has 6793260?

When canceling a multi-recv entry, FI_MULTI_RECV was never
set as part of the error entry flags. Fix so that the user
can determine if the buffer is still in use.

Return -FI_EAGAIN on cancel if multi-recv buffer is in use.

Update fi_endpoint man page to describe fi_cancel() return value.

Signed-off-by: Jerome Soumagne <[email protected]>

j-xiong (Contributor) commented Feb 5, 2025

The Intel CI failure is caused by infrastructure issues. Ignoring for now.

j-xiong merged commit 2b51e42 into ofiwg:main on Feb 5, 2025
12 of 13 checks passed