
AI: Put sessions with retryable errors in temporary hold #3400

Open
wants to merge 3 commits into base: master

Conversation

ad-astra-video (Collaborator)

What does this pull request do? Explain your changes. (required)

Add a temporary hold for Orchestrator sessions that return retryable errors, to keep the Orchestrator from being selected again in the current request or in other concurrently running requests for the same capability/model id. The Orchestrator is not suspended and is added back to unknownSessions after the timeout.

Background

  • When the selector chooses a session, it removes the session from knownSessions if available; otherwise it selects and removes a session from unknownSessions.
  • AISessionManager re-uses selectors for each capability/model id pair between requests. The Suspender is also re-used between requests.
  • Orchestrators that return retryable errors (e.g. insufficient capacity, too many nonces, ticket params expired) could be suspended for up to 30 minutes if the Gateway is not saturating the Orchestrators serving the model.
  • The Orchestrator pools refresh every 10 minutes and it takes 3 refreshes to remove an Orchestrator from suspension.
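The selection order described in these bullets can be sketched in Go as follows (the session and selector types here are illustrative stand-ins, not the actual go-livepeer structures):

```go
package main

import "fmt"

// session is a stand-in for an Orchestrator AI session.
type session struct{ orch string }

// selector mirrors the selection order described above: prefer
// knownSessions, fall back to unknownSessions. Names are illustrative.
type selector struct {
	knownSessions   []session
	unknownSessions []session
}

// selectSession removes and returns a session from knownSessions first,
// then unknownSessions; the second return value is false when both are empty.
func (s *selector) selectSession() (session, bool) {
	if len(s.knownSessions) > 0 {
		sess := s.knownSessions[0]
		s.knownSessions = s.knownSessions[1:]
		return sess, true
	}
	if len(s.unknownSessions) > 0 {
		sess := s.unknownSessions[0]
		s.unknownSessions = s.unknownSessions[1:]
		return sess, true
	}
	return session{}, false
}

func main() {
	sel := &selector{
		knownSessions:   []session{{orch: "O1"}},
		unknownSessions: []session{{orch: "O2"}},
	}
	first, _ := sel.selectSession()
	second, _ := sel.selectSession()
	fmt.Println(first.orch, second.orch) // O1 O2
}
```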

Specific updates (required)

  • Add a new function, PutSessionOnHold, that does not suspend the Orchestrator but does keep it out of selection for the AIProcesssingRetryTimeout while the request is running. ReleaseSessionsOnHold runs after the timeout a request is given to complete, to keep the session from being selected again in the very short term.
  • Added requestID to aiRequestParams to use as the key for removing sessions from the temporary hold.

How did you test each of these updates (required)

  • Tested locally and wrote tests to cover the additional hold mechanism

Does this pull request close any open issues?


@github-actions bot added the labels go (Pull requests that update Go code) and AI (Issues and PR related to the AI-video branch) on Feb 19, 2025

codecov bot commented Feb 19, 2025

Codecov Report

Attention: Patch coverage is 59.57447% with 38 lines in your changes missing coverage. Please review.

Project coverage is 32.46728%. Comparing base (232df3a) to head (6d5f2ae).

Files with missing lines   Patch %     Lines
server/handlers.go         0.00000%    21 Missing ⚠️
server/ai_session.go       75.47170%   10 Missing and 3 partials ⚠️
server/ai_mediaserver.go   0.00000%    4 Missing ⚠️
Additional details and impacted files


@@                 Coverage Diff                 @@
##              master       #3400         +/-   ##
===================================================
+ Coverage   32.11405%   32.46728%   +0.35323%     
===================================================
  Files            147         147                 
  Lines          40789       40875         +86     
===================================================
+ Hits           13099       13271        +172     
+ Misses         26916       26809        -107     
- Partials         774         795         +21     
Files with missing lines   Coverage Δ
server/ai_process.go       5.59265% <100.00000%> (+5.00043%) ⬆️
server/ai_mediaserver.go   7.23906% <0.00000%> (-0.04908%) ⬇️
server/ai_session.go       34.58213% <75.47170%> (+32.24880%) ⬆️
server/handlers.go         51.42637% <0.00000%> (-0.76455%) ⬇️

... and 1 file with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@leszko (Contributor) left a comment


Thanks for the PR @ad-astra-video

I'm thinking about this "on-hold" mechanism. I'm thinking in the context of live video (not sure if it's similar for AI Jobs).

We have 2 types of errors:

  • truly retryable, meaning that if you retry it should just work (invalid ticket sender nonce, ticket params expired) => For those I think we may not need to put the O on hold, but just retry sending the request, wdyt?
  • temporarily unavailable (insufficient capacity) => For this, we basically don't want to use this O until the next session refresh; so maybe instead of putting this "on hold" we would just remove it from the pool (but not add any penalty)?

Let me know your thoughts.

@ad-astra-video (Collaborator, Author)

Ya I can see that the requirements can be much different.

  • Live AI jobs would be expected to be long running (minutes to hours).
  • Batch AI jobs can be 200ms to minutes for the most part.
  • With transcoding, the selector is built for each request, so an Orchestrator that responds with insufficient capacity simply isn't in the pool. It could then be in the next pool when the next stream starts, with no delay or penalty.
  • For AI, if an Orchestrator is at capacity, we will not know until we have selected the orchestrator, created the payment and sent the full job to the Orchestrator. The Orchestrator then checks capacity and sends back insufficient capacity before cashing the ticket.

My goal was to not exclude Orchestrators for 10 minutes on batch jobs when they are temporarily at capacity and there are enough Orchestrators to cover demand. When there are only 1-2 Orchestrators in a selector, the session pool refreshes much more often, so this is less of an issue.

What do you think about adding another suspender to AISessionSelector to act as a temporary suspender? A goroutine started when the selector is created would call signalRefresh() every 30 seconds and add the sessions on hold back to the session pools once the penalty passes and the session is refreshed successfully. This would likely reduce goroutine count and allow more control over how long to suspend per capability/model id if needed.

Note we don't check AI capacity on GetOrchestrator requests right now, but we can add that in this PR or a separate one to keep an Orchestrator in the temporary suspender until it has capacity.

go-livepeer/server/rpc.go

Lines 353 to 355 in 232df3a

func getOrchestrator(orch Orchestrator, req *net.OrchestratorRequest) (*net.OrchestratorInfo, error) {
addr := ethcommon.BytesToAddress(req.Address)
if err := verifyOrchestratorReq(orch, addr, req.Sig); err != nil {

go-livepeer/server/rpc.go

Lines 442 to 448 in 232df3a

func verifyOrchestratorReq(orch Orchestrator, addr ethcommon.Address, sig []byte) error {
if !orch.VerifySig(addr, addr.Hex(), sig) {
glog.Error("orchestrator req sig check failed")
return fmt.Errorf("orchestrator req sig check failed")
}
return orch.CheckCapacity("")
}

@leszko (Contributor) commented Feb 21, 2025

What do you think about adding another suspender to AISessionSelector to use as a temporary suspender that calls signalRefresh() every 30 seconds in a go routine started when the selector is created and adds the sessions on hold back to the session pools after the penalty passes and the session is refreshed successfully. This would likely reduce go routines and allow to have more control over how long should suspend based on capability/model id if needed.

Hmm, I'd try to not add anything more than we need, because this selection logic is already complex 🙃 So, coming back to the problem we want to solve (for both AI Jobs and Live AI Video):

  1. invalid ticket sender nonce, ticketparams expired => My understanding is that for both AI Jobs and Live AI Video, we can just retry the same O, right? So code like this should work:
// Retryable errors: invalid ticket sender nonce, ticketparams expired
if isRetryableError(err) {
	params.sessManager.Complete(ctx, sess)
	continue
}

So, basically the same as what you had in your initial PR #3033

  2. insufficient capacity => This is a little more problematic, because AI Jobs and Live AI Video have different requirements, but we can use an IF and do different things for AI Jobs and Live AI Video:
    • Live AI Video => We want to remove it from the pool (but do not suspend it), because the O is busy (possibly for a long time), so just remove it from the pool, it will get selected during the next refresh
    • AI Jobs => I think it should be OK for you to put it back into the pool, i.e. do the same as for the other isRetryableError() cases. This "on-hold" is also OK; the only thing I'm worried about is that it makes the selection even more complex.
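The branching described above could be sketched as follows; the error values, handleRetryableError, and the action constants are hypothetical stand-ins for the real session-manager calls:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative error values; the real code matches on actual O responses.
var (
	errInsufficientCapacity = errors.New("insufficient capacity")
	errTicketParamsExpired  = errors.New("ticketparams expired")
)

type action int

const (
	retrySameO     action = iota // put the session back and retry the same O
	removeFromPool               // drop the O from the pool, with no suspension penalty
)

// handleRetryableError decides what to do with a session based on the
// error and whether this is a Live AI Video job or a batch AI Job.
func handleRetryableError(err error, liveVideo bool) action {
	if errors.Is(err, errInsufficientCapacity) && liveVideo {
		// The O is busy, possibly for a long time: remove it from the
		// pool and let the next session refresh pick it up again.
		return removeFromPool
	}
	// Truly retryable errors (expired ticket params, bad sender nonce)
	// and batch-job capacity errors: complete the session and retry.
	return retrySameO
}

func main() {
	fmt.Println(handleRetryableError(errTicketParamsExpired, true))   // 0 (retrySameO)
	fmt.Println(handleRetryableError(errInsufficientCapacity, true))  // 1 (removeFromPool)
	fmt.Println(handleRetryableError(errInsufficientCapacity, false)) // 0 (retrySameO)
}
```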

So, IMO something like this #3407 (taken mostly from your initial PR) should be good enough.

Let me know your thoughts.
