prov/util: fix FI_MULTI_RECV not set on FI_ECANCELED #10734

Merged: 1 commit merged into ofiwg:main on Feb 5, 2025

Conversation

soumagne (Contributor)

When canceling a multi-recv entry, FI_MULTI_RECV was never set as part of the error entry flags. Fix so that the user can determine if the buffer is still in use.
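A minimal consumer-side sketch of what this fix enables, using the standard libfabric CQ error-entry API; the names handle_cancel_err, cq, and mr_buf are placeholders for illustration, not code from the patch:

#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* After requesting a cancel on a multi-recv posting, read the error
 * entry and use the FI_MULTI_RECV flag (the one this fix sets) to
 * tell whether the provider has released the buffer. */
static void handle_cancel_err(struct fid_cq *cq, void *mr_buf)
{
	struct fi_cq_err_entry err_entry = { 0 };

	if (fi_cq_readerr(cq, &err_entry, 0) < 1)
		return;	/* no error entry available yet */

	if (err_entry.err == FI_ECANCELED &&
	    (err_entry.flags & FI_MULTI_RECV)) {
		/* FI_MULTI_RECV on the error entry: the buffer is no
		 * longer in use and may be reused or freed. */
		free(mr_buf);
	}
}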

j-xiong (Contributor) commented Jan 27, 2025

This is fixing the utility provider; please change the title to start with prov/util:.

@soumagne soumagne changed the title prov/rxm: fix FI_MULTI_RECV not set on FI_ECANCELED prov/util: fix FI_MULTI_RECV not set on FI_ECANCELED Jan 27, 2025
soumagne (Contributor, Author)

Ah right, thanks. Fixed.

soumagne (Contributor, Author)

@aingerson I updated the fix yesterday. I've made cancel return -FI_ENOENT if the entry is not found (it used to return FI_SUCCESS); let me know if you prefer to keep the previous behavior.

shefty (Member) commented Jan 31, 2025

Cancel is an async operation. We intentionally did not want cancel to return ENOENT, or for apps to rely on that return value. That error is only possible for SW-based cancel.

I'm referring only to an error code being reported to the user.

aingerson (Contributor)

@soumagne Yeah, what Sean said. FI_SUCCESS means the operation was submitted: it either results in a cancel error entry if it was cancelled, or it is basically ignored because the operation can't be canceled. In your app, you'd have to read the CQ to see if an entry is available and retry the cancel until you either get the cancel entry for that recv or get the rx completion for it.
Also, could you squash these two commits together? They are needed together, since the first one asserts the ref cnt is 0 and the second one actually handles the non-zero ref cnt.
Thanks!
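A rough sketch of that retry loop under assumed names (ep, cq, context), built only from standard libfabric calls; this is illustration, not code from this PR:

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Keep submitting the cancel and reading the CQ until either the
 * cancel error entry or the normal rx completion for `context`
 * arrives. FI_SUCCESS from fi_cancel() only means the request was
 * submitted; the outcome is reported through the CQ. */
static void cancel_and_drain(struct fid_ep *ep, struct fid_cq *cq,
			     void *context)
{
	struct fi_cq_entry comp;
	struct fi_cq_err_entry err;
	ssize_t ret;

	for (;;) {
		(void) fi_cancel(&ep->fid, context);

		ret = fi_cq_read(cq, &comp, 1);
		if (ret == 1 && comp.op_context == context)
			return;	/* rx completed normally */
		/* a real app would also handle completions belonging
		 * to other contexts instead of dropping them */
		if (ret == -FI_EAVAIL &&
		    fi_cq_readerr(cq, &err, 0) == 1 &&
		    err.op_context == context && err.err == FI_ECANCELED)
			return;	/* cancel entry seen; buffer state is
				 * indicated by err.flags & FI_MULTI_RECV */
	}
}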

soumagne (Contributor, Author)

OK, that sounds good. I fixed it to always return FI_SUCCESS instead of -FI_ENOENT, and I squashed the commits.

@@ -1037,12 +1041,15 @@ static bool util_cancel_recv(struct util_srx_ctx *srx, struct slist *queue,
	slist_foreach(queue, item, prev) {
		rx_entry = container_of(item, struct util_rx_entry, peer_entry);
		if (rx_entry->peer_entry.context == context) {
			if (rx_entry->peer_entry.flags & FI_MULTI_RECV &&
			    rx_entry->multi_recv_ref)
				return -FI_EAGAIN;
Contributor

I'm still confused about this EAGAIN return. If we want to define the EAGAIN return as valid for a cancel, we should add it to the man page to clarify what it means for the application. As the definition stands right now, it feels like this should be an FI_SUCCESS, because the entry was found and we dealt with it accordingly, indicating to the application that there should eventually be an entry on the CQ (either the rx that is pending or the cancel entry); once that is processed, the application can try to cancel again if it didn't get the completion for the entry. I'm fine with either; I just think we should clarify the expectation here.

Member

My intent was that EAGAIN would be returned immediately to the user. EAGAIN is a valid return code for any data transfer or async operation, so the app is already expected to deal with that. EAGAIN does not indicate anything about whether the entry was found or its state. From the app's viewpoint, it just says we can't deal with the cancel right now, and it needs to be retried.

EAGAIN should not be an error code returned through a completion.
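A sketch of that caller-side retry (placeholder names ep, cq, and context; assumed app code, not from the PR):

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* -FI_EAGAIN from fi_cancel() just says "can't process the cancel
 * right now": drive CQ progress and retry the call itself. */
static void cancel_retry(struct fid_ep *ep, struct fid_cq *cq,
			 void *context)
{
	ssize_t ret;

	do {
		ret = fi_cancel(&ep->fid, context);
		if (ret == -FI_EAGAIN)
			(void) fi_cq_read(cq, NULL, 0);	/* progress only */
	} while (ret == -FI_EAGAIN);
}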

Contributor Author

I also agree. In that particular case, unless I'm missing something, I don't think we've dealt with it; we just found it and returned EAGAIN back to the caller of fi_cancel(), so it would never get canceled until it's retried.

Let me know if you'd like me to add any clarification anywhere. I can also add a comment in the code.

Contributor

Yeah, I understand the benefit of the EAGAIN and I'm fine with allowing it for this case. I just read FI_SUCCESS as essentially telling the app to read the CQ because something will be available. FI_SUCCESS just says: I looked at your request and did what I thought correct with it, which for the mrecv case is not to cancel it and to let the pending receives complete. I.e., the app would have to read the mrecv completion and decide to try to cancel again because the buffer is not done. I think returning EAGAIN is simpler, and that's totally fine. I would just like the man page (fi_endpoint) to reflect that return code and explain what it means for the application.
Right now it says

Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. For fi_cancel, a return value of 0 indicates that the cancel request was submitted for processing.

So just adding something there about EAGAIN is totally fine.

Contributor Author

OK, I updated the man page and added a comment.

aingerson (Contributor) left a comment

Thank you for the updates!

j-xiong (Contributor) commented Feb 3, 2025

bot:aws:retest

j-xiong (Contributor) commented Feb 4, 2025

The Intel CI failure is caused by an issue with the HW used for dmabuf tests, which is unrelated to this PR.

@shijin-aws Is the AWS CI failure related?

shijin-aws (Contributor)

2025-02-03 19:42:49,475 - INFO - test_runner_windows - Efawin compilation finished successfully.

Usage:
  C:/10734_1738611233\libfabric\x64\Release-v142\fi_pingpong.exe [OPTIONS]		start server
  C:/10734_1738611233\libfabric\x64\Release-v142\fi_pingpong.exe [OPTIONS] <srv_addr>	connect to server

Ping pong client and server

Options:
 -B <src_port>        source control port number (server: 47592, client: auto)
 -P <dst_port>        destination control port number (client: 47592)
 -d <domain>          domain name
 -s <source address>  source address associated with domain name
 -p <provider>        specific provider name eg sockets, verbs
 -e <ep_type>         endpoint type: msg|rdm|dgram (dgram)
 -I <number>          number of iterations (10)
 -S <size>            specific transfer size or 'all' (all)
 -c                   enables data_integrity checks
 -m <transmit mode>   transmit mode type: msg|tagged (msg)
 -h                   display this help output
 -v                   enable debugging output
 -6                   use IPv6 address

2025-02-03 19:42:59,612 - INFO - test_runner_windows - Server process failed with return code 1


...

@soumagne Can you rebase your PR onto the latest main branch that has 6793260?

When canceling a multi-recv entry, FI_MULTI_RECV was never
set as part of the error entry flags. Fix so that the user
can determine if the buffer is still in use.

Return -FI_EAGAIN on cancel if multi-recv buffer is in use.

Update fi_endpoint man page to describe fi_cancel() return value.

Signed-off-by: Jerome Soumagne <[email protected]>

j-xiong (Contributor) commented Feb 5, 2025

The Intel CI failure is caused by infrastructure issues. Ignoring for now.

j-xiong merged commit 2b51e42 into ofiwg:main on Feb 5, 2025
12 of 13 checks passed