
AI: Put sessions with retryable errors in temporary hold #3400

Open
wants to merge 3 commits into base: master

Conversation

ad-astra-video (Collaborator)

What does this pull request do? Explain your changes. (required)

Add a temporary hold for Orchestrator sessions that return retryable errors, to keep the Orchestrator from being selected again in the current request or in other concurrently running requests for the same capability/model id. The Orchestrator is not suspended and is added back to unknownSessions after the timeout.

Background

  • When the selector chooses a session, it removes the session from knownSessions if available; otherwise it selects and removes a session from unknownSessions.
  • AISessionManager re-uses selectors for each capability/model id pair between requests. The Suspender is also re-used between requests.
  • Orchestrators that return retryable errors (e.g. insufficient capacity, too many nonces, ticket params expired) could be suspended for up to 30 minutes if the Gateway is not saturating the Orchestrators serving the model.
  • The Orchestrator pools refresh every 10 minutes and it takes 3 refreshes to remove an Orchestrator from suspension.
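The selection order described in these bullets can be sketched in Go as follows (the session and selector types here are illustrative stand-ins, not the actual go-livepeer structures):

```go
package main

import "fmt"

// session is a stand-in for an Orchestrator AI session.
type session struct{ orch string }

// selector mirrors the selection order described above: prefer
// knownSessions, fall back to unknownSessions. Names are illustrative.
type selector struct {
	knownSessions   []session
	unknownSessions []session
}

// selectSession removes and returns a session from knownSessions first,
// then unknownSessions; the second return value is false when both are empty.
func (s *selector) selectSession() (session, bool) {
	if len(s.knownSessions) > 0 {
		sess := s.knownSessions[0]
		s.knownSessions = s.knownSessions[1:]
		return sess, true
	}
	if len(s.unknownSessions) > 0 {
		sess := s.unknownSessions[0]
		s.unknownSessions = s.unknownSessions[1:]
		return sess, true
	}
	return session{}, false
}

func main() {
	sel := &selector{
		knownSessions:   []session{{orch: "O1"}},
		unknownSessions: []session{{orch: "O2"}},
	}
	first, _ := sel.selectSession()
	second, _ := sel.selectSession()
	fmt.Println(first.orch, second.orch) // O1 O2
}
```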

Specific updates (required)

  • Add a new function, PutSessionOnHold, that does not suspend the Orchestrator but does keep it out of selection for the AIProcesssingRetryTimeout while the request is running. ReleaseSessionsOnHold runs after the timeout a request is given to complete, to keep the session from being selected again in the very short term.
  • Added requestID to aiRequestParams to use as the key for removing sessions from the temporary hold.

How did you test each of these updates (required)

  • Tested locally and wrote tests to cover the additional hold mechanism

Does this pull request close any open issues?


@github-actions bot added the labels go (Pull requests that update Go code) and AI (Issues and PR related to the AI-video branch) on Feb 19, 2025

codecov bot commented Feb 19, 2025

Codecov Report

Attention: Patch coverage is 59.57447% with 38 lines in your changes missing coverage. Please review.

Project coverage is 32.46728%. Comparing base (232df3a) to head (6d5f2ae).

Files with missing lines   Patch %     Lines
server/handlers.go         0.00000%    21 Missing ⚠️
server/ai_session.go       75.47170%   10 Missing and 3 partials ⚠️
server/ai_mediaserver.go   0.00000%    4 Missing ⚠️
Additional details and impacted files


@@                 Coverage Diff                 @@
##              master       #3400         +/-   ##
===================================================
+ Coverage   32.11405%   32.46728%   +0.35323%     
===================================================
  Files            147         147                 
  Lines          40789       40875         +86     
===================================================
+ Hits           13099       13271        +172     
+ Misses         26916       26809        -107     
- Partials         774         795         +21     
Files with missing lines   Coverage Δ
server/ai_process.go       5.59265% <100.00000%> (+5.00043%) ⬆️
server/ai_mediaserver.go   7.23906% <0.00000%> (-0.04908%) ⬇️
server/ai_session.go       34.58213% <75.47170%> (+32.24880%) ⬆️
server/handlers.go         51.42637% <0.00000%> (-0.76455%) ⬇️

... and 1 file with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@leszko (Contributor) left a comment


Thanks for the PR @ad-astra-video

I'm thinking about this "on-hold" mechanism. I'm thinking in the context of live video (not sure if it's similar for AI Jobs).

We have 2 types of errors:

  • truly retryable, meaning that if you retry it should just work (invalid ticket sender nonce, ticket params expired) => For those I think we may not need to put the O on hold, but just retry sending the request, wdyt?
  • temporarily unavailable (insufficient capacity) => For this, we basically don't want to use this O until the next session refresh; so maybe instead of putting this "on hold" we would just remove it from the pool (but not add any penalty)?

Let me know your thoughts.

@ad-astra-video (Collaborator, Author)

Ya I can see that the requirements can be much different.

  • Live AI jobs would be expected to be long running (minutes to hours).
  • Batch AI jobs can be 200ms to minutes for the most part.
  • With transcoding, the selector is built for each request, so an Orchestrator that responds with insufficient capacity simply isn't in the pool. It could then be in the next pool when the next stream starts, with no delay or penalty.
  • For AI, if an Orchestrator is at capacity, we will not know until we have selected the orchestrator, created the payment and sent the full job to the Orchestrator. The Orchestrator then checks capacity and sends back insufficient capacity before cashing the ticket.

My goal was to not exclude Orchestrators for 10 minutes on batch jobs when they are temporarily at capacity and there are enough Orchestrators to cover demand. When there are only 1-2 Orchestrators in a selector, the session pool refreshes much more often, so this is less of an issue.

What do you think about adding another suspender to AISessionSelector to act as a temporary suspender? A goroutine started when the selector is created would call signalRefresh() every 30 seconds and add the sessions on hold back to the session pools once the penalty passes and the session is refreshed successfully. This would likely reduce goroutine count and allow more control over how long to suspend per capability/model id if needed.

Note we don't check AI capacity on GetOrchestrator requests right now, but we can add that in this PR or a separate one to keep an Orchestrator in the temporary suspender until it has capacity.

go-livepeer/server/rpc.go

Lines 353 to 355 in 232df3a

func getOrchestrator(orch Orchestrator, req *net.OrchestratorRequest) (*net.OrchestratorInfo, error) {
addr := ethcommon.BytesToAddress(req.Address)
if err := verifyOrchestratorReq(orch, addr, req.Sig); err != nil {

go-livepeer/server/rpc.go

Lines 442 to 448 in 232df3a

func verifyOrchestratorReq(orch Orchestrator, addr ethcommon.Address, sig []byte) error {
if !orch.VerifySig(addr, addr.Hex(), sig) {
glog.Error("orchestrator req sig check failed")
return fmt.Errorf("orchestrator req sig check failed")
}
return orch.CheckCapacity("")
}

@leszko (Contributor) commented Feb 21, 2025

What do you think about adding another suspender to AISessionSelector to use as a temporary suspender that calls signalRefresh() every 30 seconds in a go routine started when the selector is created and adds the sessions on hold back to the session pools after the penalty passes and the session is refreshed successfully. This would likely reduce go routines and allow to have more control over how long should suspend based on capability/model id if needed.

Hmm, I'd try to not add anything more than we need, because this selection logic is already complex 🙃 So, coming back to the problem we want to solve (for both AI Jobs and Live AI Video):

  1. invalid ticket sender nonce, ticketparams expired => My understanding is that for both AI Jobs and Live AI Video, we can just retry the same O, right? So code like this should work:
// Retryable errors: invalid ticket sender nonce, ticketparams expired
if isRetryableError(err) {
	params.sessManager.Complete(ctx, sess)
	continue
}

So, basically the same as what you had in your initial PR #3033

  2. insufficient capacity => This is a little more problematic, because AI Jobs and Live AI Video have different requirements, but we can use an IF and do different things for AI Jobs and Live AI Video:
    • Live AI Video => We want to remove it from the pool (but do not suspend it), because the O is busy (possibly for a long time), so just remove it from the pool, it will get selected during the next refresh
    • AI Jobs => I think it should be OK for you to put it back into the pool, i.e. do the same as for the other isRetryableError() cases. This "on-hold" is also OK; the only thing I'm worried about is that it makes the selection even more complex.
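The branching described above could be sketched as follows; the error values, handleRetryableError, and the action constants are hypothetical stand-ins for the real session-manager calls:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative error values; the real code matches on actual O responses.
var (
	errInsufficientCapacity = errors.New("insufficient capacity")
	errTicketParamsExpired  = errors.New("ticketparams expired")
)

type action int

const (
	retrySameO     action = iota // put the session back and retry the same O
	removeFromPool               // drop the O from the pool, with no suspension penalty
)

// handleRetryableError decides what to do with a session based on the
// error and whether this is a Live AI Video job or a batch AI Job.
func handleRetryableError(err error, liveVideo bool) action {
	if errors.Is(err, errInsufficientCapacity) && liveVideo {
		// The O is busy, possibly for a long time: remove it from the
		// pool and let the next session refresh pick it up again.
		return removeFromPool
	}
	// Truly retryable errors (expired ticket params, bad sender nonce)
	// and batch-job capacity errors: complete the session and retry.
	return retrySameO
}

func main() {
	fmt.Println(handleRetryableError(errTicketParamsExpired, true))   // 0 (retrySameO)
	fmt.Println(handleRetryableError(errInsufficientCapacity, true))  // 1 (removeFromPool)
	fmt.Println(handleRetryableError(errInsufficientCapacity, false)) // 0 (retrySameO)
}
```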

So, IMO something like this #3407 (taken mostly from your initial PR) should be good enough.

Let me know your thoughts.
