How should the Hyperbahn network recover if a single worker is Busy #1305
Comments
Idea: for a subchannel, if there are any other peers that are "not recently [...]
This involves tracking a time-decaying busy score on each peer for both [...]
Busy is still a signal for exponential backoff. We should still pursue a [...]
On Fri, Sep 11, 2015 at 5:15 PM Jake Verbaten [email protected]
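A rough sketch of what a time-decaying busy score per peer might look like. The comment above is truncated, so this assumes the score is bumped by one on each Busy response and halves every fixed interval; `BusyScore`, `notRecentlyBusy`, the half-life, and the threshold are all hypothetical names and values, not part of tchannel.

```javascript
'use strict';

// Hypothetical helper, not part of tchannel: a per-peer busy score that
// halves every `halfLifeMs` milliseconds.
function BusyScore(halfLifeMs) {
    this.halfLifeMs = halfLifeMs;
    this.score = 0;
    this.lastUpdate = 0;
}

// Current score, decayed from the last update to `now` (ms timestamps).
BusyScore.prototype.value = function value(now) {
    var elapsed = Math.max(0, now - this.lastUpdate);
    return this.score * Math.pow(0.5, elapsed / this.halfLifeMs);
};

// Record one Busy response observed at time `now`.
BusyScore.prototype.recordBusy = function recordBusy(now) {
    this.score = this.value(now) + 1;
    this.lastUpdate = now;
};

// Keep only peers whose decayed score is below `threshold` (assumed cutoff).
function notRecentlyBusy(peers, now, threshold) {
    return peers.filter(function below(peer) {
        return peer.busyScore.value(now) < threshold;
    });
}
```

With something like this, a subchannel could prefer the peers returned by `notRecentlyBusy` and only fall back to the full peer list when that set is empty.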
@kriskowal that does not help if the ingress is rate limited. Clients still see Busy, and an ingress does not know what the health is. This needs to be applied to the Hyperbahn client itself. However, transforming Busy into Declined is going to be confusing from a metrics point of view. Maybe the total rate limiter should return Unhealthy.
https://github.com/uber/tchannel/blob/master/node/errors.js#L775 Busy is retriable, same as Declined. Unhealthy is not.
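To make the classification in this comment concrete, here is a minimal sketch. The three error names come from the discussion; the lookup table and `isRetriable` are hypothetical and are not a copy of what `errors.js` does.

```javascript
'use strict';

// Retry classification as described in the thread. The table and function
// are illustrative assumptions, not tchannel's actual API.
var RETRIABLE = {
    Busy: true,
    Declined: true,
    Unhealthy: false
};

// Returns true only for error names explicitly marked retriable.
function isRetriable(errorName) {
    return RETRIABLE[errorName] === true;
}
```

Anything not listed is treated as non-retriable here, which is a conservative default chosen for the sketch.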
We should probably retry on Unhealthy.
cc @kriskowal ^
Brace yourself for weirdness. The circuit breaker sends Declined errors when it is Unhealthy. Declined can be retried. I added the Unhealthy error type to the protocol in anticipation of retry-on-unhealthy having bad consequences. The circuit breaker doesn’t use it right now. We could remove it and pretend it never happened.
Currently, if a single worker is TotalBusy due to co-tenancy issues, it will return Busy frames back to the edge.
It's expected that the edge should retry somewhere else.
However, by retrying on Busy we open ourselves up to cascading failures, and we have not implemented work shedding yet. We still want to retry, as a single worker failing should be invisible to edge users.
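For illustration, "retry somewhere else" with a hard cap on attempts might look roughly like the sketch below. It is synchronous for brevity and assumes a caller-supplied `send(peer)` that returns `{ok: true}` or `{ok: false, error: 'Busy'}`; `sendWithRetry`, `maxAttempts`, and that result shape are hypothetical, not Hyperbahn's implementation.

```javascript
'use strict';

// Try each peer at most once, moving on only when the error is retriable
// (Busy or Declined), and stop after `maxAttempts` tries so that a busy
// cluster does not get its load multiplied by retries.
function sendWithRetry(peers, send, maxAttempts) {
    var attempts = 0;
    var lastError = null;

    for (var i = 0; i < peers.length && attempts < maxAttempts; i++) {
        var peer = peers[i];
        attempts++;

        var result = send(peer);
        if (result.ok) {
            return {ok: true, peer: peer, attempts: attempts};
        }

        lastError = result.error;
        if (lastError !== 'Busy' && lastError !== 'Declined') {
            break; // non-retriable: give up immediately
        }
    }

    return {ok: false, error: lastError, attempts: attempts};
}
```

The attempt cap is the part that speaks to the cascading-failure concern: without it, every Busy worker adds traffic to its neighbours.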
One solution to this problem involves a few pieces: