-
Notifications
You must be signed in to change notification settings - Fork 398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/util: fix FI_MULTI_RECV not set on FI_ECANCELED #10734
Conversation
This is fixing the utility provider, please change the title to start with |
cf0f475
to
f10a534
Compare
ah right thanks fixed. |
f10a534
to
5e4ccd7
Compare
@aingerson I updated the fix yesterday, I've let cancel return |
Cancel is an async operation. We intentionally did not want cancel to return ENOENT and for apps to rely on that return value. That error is only possible for SW based cancel. I'm referring only to an error code being reported to the user. |
@soumagne Yeah, what Sean said. FI_SUCCESS means the operation was submitted and can either result in a cancel error entry if it was cancelled or it would basically be ignored because the operation can't be canceled. In your app, you'd have to read the cq to see if an entry is available and basically retry the cancel until either you get the cancel entry for that recv or get the rx completion for it. |
5e4ccd7
to
c6b0b8b
Compare
ok that sounds good, I fixed it to always return SUCCESS instead of ENOENT and I squashed the commits. |
@@ -1037,12 +1041,15 @@ static bool util_cancel_recv(struct util_srx_ctx *srx, struct slist *queue, | |||
slist_foreach(queue, item, prev) { | |||
rx_entry = container_of(item, struct util_rx_entry, peer_entry); | |||
if (rx_entry->peer_entry.context == context) { | |||
if (rx_entry->peer_entry.flags & FI_MULTI_RECV && | |||
rx_entry->multi_recv_ref) | |||
return -FI_EAGAIN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still confused about this EAGAIN return. If we want to define the EAGAIN return as valid for a cancel, we should add it to the man page to clarify what it means for the application. As it stands right now in the definition, it feels like this should be an FI_SUCCESS because it was found and we dealt with it accordingly indicating to the application that there should be an entry on the CQ eventually (either the rx that is pending or the cancel entry) and once that is processed, the application can try to cancel again if it didn't get the completion for the entry. I'm fine with either, I just think we should clarify the expectation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intent was that EAGAIN would be returned immediately to the user. EAGAIN is a valid return code for any data transfer or async operation, so the app is already expected to deal with that. EAGAIN does not indicate anything about whether the entry was found or its state. From the app's viewpoint, it just says we can't deal with the cancel right now, and it needs to be retried.
EAGAIN should not be an error code returned through a completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also agree, in that particular case, unless I'm missing something I don't think we've dealt with it, we just found it and returned EAGAIN back to the caller of fi_cancel(), so it would never get canceled until it's retried.
Let me know if you'd like me to add any clarification anywhere. I can also add a comment in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I understand the benefit of the EAGAIN and I'm fine with allowing it for this case. I just read FI_SUCCESS as essentially telling the app to read the CQ because something will be available. FI_SUCCESS just says I looked at your request and I did what I thought correct with it which for the mrecv case is not to cancel it and to let the pending receives complete. Ie the app would have to read the mrecv completion and decide to try to cancel it again because the buffer is not done. I think returning EAGAIN is simpler and that's totally fine. I would just like the man page (fi_endpoint) to reflect that return code and explain what that means for the application.
Right now it says
Returns 0 on success. On error, a negative value corresponding to fabric errno is returned. For fi_cancel, a return value of 0 indicates that the cancel request was submitted for processing.
So just adding something there about EAGAIN is totally fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I updated the man page and added a comment.
c6b0b8b
to
4e00169
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates!
bot:aws:retest |
The Intel CI failure is caused by issue with HW used for dmabuf tests, which is unrelated to this PR. @shijin-aws Is the AWS CI failure related? |
@soumagne can you rebase your PR to the latest main branch that has 6793260 |
When canceling a multi-recv entry, FI_MULTI_RECV was never set as part of the error entry flags. Fix so that the user can determine if the buffer is still in use. Return -FI_EAGAIN on cancel if multi-recv buffer is in use. Update fi_endpoint man page to describe fi_cancel() return value. Signed-off-by: Jerome Soumagne <[email protected]>
4e00169
to
acd18e9
Compare
Intel CI failure is caused by infrastructure issues. Ignoring for now. |
When canceling a multi-recv entry, FI_MULTI_RECV was never set as part of the error entry flags. Fix so that the user can determine if the buffer is still in use.