Skip to content
This repository has been archived by the owner on May 3, 2022. It is now read-only.

TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154

Open
blampe opened this issue Aug 13, 2015 · 8 comments
Assignees
Labels

Comments

@blampe
Copy link
Contributor

blampe commented Aug 13, 2015

If a process advertises and then immediately dies, Hyperbahn will attempt to send requests over the dead connection for the next ~5min until the peer is cleaned up.

We should clean up the peer as soon as we get ECONNREFUSED from it, or at least retry the request with a different peer (it's always safe to retry in this situation).

@Raynos
Copy link
Contributor

Raynos commented Aug 13, 2015

Hyperbahn should favor peers with open connections as part of peer selection. This sounds like a peer selection bug.

I'm wary of aggressively removing peers as it can lead to zero availability instead of partial availability.

We can add retries back at the exit node as well.

@anson627
Copy link
Contributor

Is ECONNREFUSED equivalent to an advertised process is gone? If so, when the process comes back, it will re-advertise and get a new peer. It's better to remove the peer rather than delaying it, if removing peer is not a expensive call.

@Raynos
Copy link
Contributor

Raynos commented Aug 18, 2015

Removing the peer immediately can lead up to 60 seconds of declined error frames. Especially for a service with 1 worker.

@blampe
Copy link
Contributor Author

blampe commented Aug 18, 2015

@anson627 it's equivalent to a socket error where we've failed to make a connection with the remote host.

@anson627
Copy link
Contributor

@blampe yes, I guess my question is - does this socket error mean process is completely dead, or it could be a sporadic thing?

@anson627
Copy link
Contributor

@Raynos I guess it would save us forwarding ECONNREFUSED errors for another 4 minutes then.

@jcorbin
Copy link
Contributor

jcorbin commented Aug 18, 2015

Yeah, we do not want to immediately remove peers on connection loss, in fact not doing so is a large point of the point of having Hyperbahn and a collection-of-peers abstraction:

  • Hyperbahn affinity node should attempt to reconnect to the peer immediately (it initiated the connection in the first place
  • However it should not prefer to send a request to a disconnected or still-connecting peer, that is a/the bug here
  • We have this clearly on our current plan for Hyperbahn, so I'm going to claim and retitle this issue.

@jcorbin jcorbin self-assigned this Aug 18, 2015
@jcorbin jcorbin changed the title Hyperbahn: remove peers on connection loss TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected Aug 18, 2015
@jcorbin
Copy link
Contributor

jcorbin commented Aug 19, 2015

#1175 got us at least part of the way there

@Raynos Raynos added the node label Sep 22, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants