-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AI: Put sessions with retryable errors in temporary hold #3400
base: master
Are you sure you want to change the base?
AI: Put sessions with retryable errors in temporary hold #3400
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3400 +/- ##
===================================================
+ Coverage 32.11405% 32.46728% +0.35323%
===================================================
Files 147 147
Lines 40789 40875 +86
===================================================
+ Hits 13099 13271 +172
+ Misses 26916 26809 -107
- Partials 774 795 +21
... and 1 file with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @ad-astra-video
I'm thinking about this "on-hold" mechanism. I'm thinking in the context of live video (not sure if it's similar for AI Jobs).
We have 2 types of errors:
- really retryable, meaning that if you retry it should just work (
invalid ticket sendernounce
,ticketparams expired
) => For those I think we may not need to put O on hold, but just retry sending the request, wdyt? - temporarily unavailable (
insufficient capacity
) => For this, we basically don't want to use this O until the next session refresh; so maybe instead of putting this "on hold" we would just remove it from the pool (but not add any penalty)?
Let me know your thoughts.
Ya I can see that the requirements can be much different.
My goal was to not exclude Orchestrators for 10 minutes on batch jobs when they are temporarily at capacity and there are enough Orchestrators to cover demand. When there is 1-2 Orchestrators in a selector the session pool will refresh much more so this is less of an issue. What do you think about adding another suspender to Note we don't check AI capacity on GetOrchestrator requests right now but can add that in this PR or a separate PR to keep Orchestrator suspended in the temporary suspender until has capacity. Lines 353 to 355 in 232df3a
Lines 442 to 448 in 232df3a
|
Hmm, I'd try to not add anything more than we need, because this selection logic is already complex 🙃 So, coming back to the problem we want to solve (for both AI Jobs and Live AI Video):
// Retryable errors: invalid ticket sendernounce, ticketparams expired
if isRetryableError(err) {
params.sessManager.Complete(ctx, sess)
continue
} So, basically the same what you had in your initial PR #3033
So, IMO something like this #3407 (taken mostly from your initial PR) should be good enough. Let me know your thoughts. |
What does this pull request do? Explain your changes. (required)
Add temporary hold for Orchestrator sessions that return retryable errors to keep the Orchestrator from being selected in the current request and other concurrently running requests of the same capability/model id. The Orchestrator is not suspended and is added back to
unknownSessions
after the timeout.Background
knownSessions
if available and, if not, selects and removes a session fromunknownSessions
.AISessionManager
re-uses selectors for each capability/model id pair between requests. The Suspender is also re-used between requests.Specific updates (required)
PutSessionOnHold
that does not suspend the Orchestrator but does keep the Orchestrator out of selection for theAIProcesssingRetryTimeout
while the request is running.ReleaseSessionsOnHold
runs after the timeout used for a request to try to complete to keep the session from being selected again in the very short term.requestID
toaiRequestParams
to use as the key to remove session from temporary holdHow did you test each of these updates (required)
Does this pull request close any open issues?
Checklist:
make
runs successfully./test.sh
pass