TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154

blampe · 2015-08-13T14:46:42Z

If a process advertises and then immediately dies, Hyperbahn will attempt to send requests over the dead connection for the next ~5min until the peer is cleaned up.

We should clean up the peer as soon as we get ECONNREFUSED from it, or at least retry the request with a different peer (it's always safe to retry in this situation).

Raynos · 2015-08-13T16:36:25Z

Hyperbahn should favor peers with open connections as part of peer selection. This sounds like a peer selection bug.

I'm wary of aggressively removing peers as it can lead to zero availability instead of partial availability.

We can add retries back at the exit node as well.

anson627 · 2015-08-18T02:03:19Z

Is ECONNREFUSED equivalent to an advertised process is gone? If so, when the process comes back, it will re-advertise and get a new peer. It's better to remove the peer rather than delaying it, if removing peer is not a expensive call.

Raynos · 2015-08-18T02:06:03Z

Removing the peer immediately can lead up to 60 seconds of declined error frames. Especially for a service with 1 worker.

blampe · 2015-08-18T18:32:10Z

@anson627 it's equivalent to a socket error where we've failed to make a connection with the remote host.

anson627 · 2015-08-18T18:42:46Z

@blampe yes, I guess my question is - does this socket error mean process is completely dead, or it could be a sporadic thing?

anson627 · 2015-08-18T18:45:43Z

@Raynos I guess it would save us forwarding ECONNREFUSED errors for another 4 minutes then.

jcorbin · 2015-08-18T20:17:46Z

Yeah, we do not want to immediately remove peers on connection loss, in fact not doing so is a large point of the point of having Hyperbahn and a collection-of-peers abstraction:

Hyperbahn affinity node should attempt to reconnect to the peer immediately (it initiated the connection in the first place
However it should not prefer to send a request to a disconnected or still-connecting peer, that is a/the bug here
We have this clearly on our current plan for Hyperbahn, so I'm going to claim and retitle this issue.

jcorbin · 2015-08-19T23:47:36Z

#1175 got us at least part of the way there

jcorbin self-assigned this Aug 18, 2015

jcorbin changed the title ~~Hyperbahn: remove peers on connection loss~~ TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected Aug 18, 2015

Raynos added the node label Sep 22, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154

TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154

blampe commented Aug 13, 2015

Raynos commented Aug 13, 2015

anson627 commented Aug 18, 2015

Raynos commented Aug 18, 2015

blampe commented Aug 18, 2015

anson627 commented Aug 18, 2015

anson627 commented Aug 18, 2015

jcorbin commented Aug 18, 2015

jcorbin commented Aug 19, 2015

TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154

TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154

Comments

blampe commented Aug 13, 2015

Raynos commented Aug 13, 2015

anson627 commented Aug 18, 2015

Raynos commented Aug 18, 2015

blampe commented Aug 18, 2015

anson627 commented Aug 18, 2015

anson627 commented Aug 18, 2015

jcorbin commented Aug 18, 2015

jcorbin commented Aug 19, 2015