How should the Hyperbahn network recover if a single worker is Busy #1305

Raynos · 2015-09-12T00:15:46Z

Currently, if a single worker is TotalBusy due to co-tenancy issues, it will return Busy frames back to the edge.

It's expected that the edge should retry somewhere else.

However, by retrying on Busy we open ourselves up to cascading failures, and we have not implemented work shedding yet. We still want to retry, since a single worker failing should be invisible to edge users.

One solution to this problem involves a few pieces:

Take into account the number of busy frames when doing peer selection. This will make any given node favor peers that are not busy.
Continue retrying worker busy errors elsewhere.
On any given subchannel, if the majority of its peers are busy, start work shedding by returning the busy errors back to the client / edge. At this point we've run out of capacity and the edge will have to give for the network to recover.
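The three pieces above can be sketched roughly as follows. This is only an illustration under stated assumptions: the peer record shape (`name`, `busyCount`), the `busyThreshold` parameter, and the function names are all hypothetical and are not taken from the TChannel or Hyperbahn code.

```javascript
// Hypothetical peer record: { name: string, busyCount: number }, where
// busyCount is the number of Busy frames recently seen from that peer.

// Piece 1: favor the peer with the fewest recent Busy frames.
function selectPeer(peers) {
    if (peers.length === 0) {
        return null;
    }
    return peers.reduce(function pick(best, peer) {
        return peer.busyCount < best.busyCount ? peer : best;
    });
}

// Piece 3: a subchannel counts as saturated when more than half of its
// peers have reached the (assumed) busyThreshold.
function isSaturated(peers, busyThreshold) {
    var busy = peers.filter(function isBusy(peer) {
        return peer.busyCount >= busyThreshold;
    }).length;
    return peers.length > 0 && busy * 2 > peers.length;
}

// Pieces 2 and 3 together: retry a Busy error elsewhere unless the
// subchannel is saturated, in which case shed it back to the caller.
function onBusyFrame(peers, busyThreshold) {
    return isSaturated(peers, busyThreshold) ? 'shed' : 'retry';
}
```

A raw count never recovers on its own, so a real implementation would need some way to age out old Busy frames.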

kriskowal · 2015-09-12T15:31:19Z

Idea: for a subchannel, if there are any other peers that are "not recently
busy", transform busy frames into retriable error frames (declined) at both
ingress and egress. Only transport a busy frame if the entire downstream
cluster is busy.

This involves tracking a time decaying busy score on each peer for both
load balancing and saturation detection.

Busy is still a signal for exponential backoff. We should still pursue a
change in latency signal for flow control, which should help us avoid busy
frames in most cases.
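
A minimal sketch of the time-decaying busy score and the busy-to-declined transformation, assuming an exponential decay with a configurable half-life. `BusyScore`, `halfLifeMs`, `recentThreshold`, and `outgoingErrorType` are hypothetical names chosen for illustration; none of them come from the TChannel code.

```javascript
// Hypothetical helper: exponentially decaying count of Busy frames.
// halfLifeMs is an assumed tuning knob, not a value from TChannel.
function BusyScore(halfLifeMs) {
    this.halfLifeMs = halfLifeMs;
    this.score = 0;
    this.lastUpdate = 0;
}

// Return the score decayed to time `now` (milliseconds); does not mutate.
BusyScore.prototype.valueAt = function valueAt(now) {
    var elapsed = Math.max(0, now - this.lastUpdate);
    return this.score * Math.pow(0.5, elapsed / this.halfLifeMs);
};

// Record one Busy frame observed at time `now`.
BusyScore.prototype.recordBusy = function recordBusy(now) {
    this.score = this.valueAt(now) + 1;
    this.lastUpdate = Math.max(this.lastUpdate, now);
};

// Forward Busy only when every peer is "recently busy"; otherwise turn it
// into Declined so the caller retries on another peer. With no peers
// tracked, this sketch arbitrarily answers Declined.
function outgoingErrorType(scores, now, recentThreshold) {
    var allBusy = scores.length > 0 && scores.every(function recent(s) {
        return s.valueAt(now) >= recentThreshold;
    });
    return allBusy ? 'Busy' : 'Declined';
}
```

The same decayed value could feed both load balancing (prefer the lowest score) and saturation detection (all scores above the threshold).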


Raynos · 2015-09-12T21:36:42Z

@kriskowal this does not help if the ingress is rate limited. Clients still see busy, and an ingress does not know what the health is.

This needs to be applied to the hyperbahn client itself. However, transforming busy into declined is going to be confusing from a metrics point of view.

Maybe the total rate limiter should return unhealthy.

anson627 · 2015-09-14T18:09:11Z

https://github.com/uber/tchannel/blob/master/node/errors.js#L775

Busy is retriable, same as declined. Unhealthy is not.
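
To make the distinction concrete, here is a small sketch of how a caller might act on that classification. The string type names, the `RETRIABLE` table, and the synchronous `send(peer)` stand-in are assumptions for illustration only; they are not copied from `node/errors.js`.

```javascript
// Hypothetical classification mirroring the statement above:
// Busy and Declined are retried, Unhealthy is not.
var RETRIABLE = { Busy: true, Declined: true };

function shouldRetry(errorType) {
    return RETRIABLE[errorType] === true;
}

// Try peers in order until one succeeds, a non-retriable error comes back,
// or maxAttempts is reached. `send(peer)` is an assumed synchronous
// stand-in returning { ok: true, value } or { ok: false, type }.
function sendWithRetry(peers, send, maxAttempts) {
    var last = null;
    for (var i = 0; i < peers.length && i < maxAttempts; i++) {
        last = send(peers[i]);
        if (last.ok || !shouldRetry(last.type)) {
            return last;
        }
    }
    return last;
}
```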

Raynos · 2015-09-14T19:44:38Z

We should probably retry on Unhealthy

Raynos · 2015-09-14T19:44:42Z

cc @kriskowal ^

kriskowal · 2015-09-15T19:10:09Z

Brace yourself for weirdness.

The circuit breaker sends Declined errors when it is Unhealthy. Declined can be retried.

I added the Unhealthy error type to the protocol in anticipation of retry-on-unhealthy having bad consequences. The circuit breaker doesn’t use it right now. We could remove it and pretend it never happened.

Raynos added the hyperbahn label Sep 22, 2015
