TChannel shouldn't prefer to send calls to a disconnected or still-connecting peer over ones that are connected #1154
Comments
Hyperbahn should favor peers with open connections as part of peer selection. This sounds like a peer selection bug. I'm wary of aggressively removing peers as it can lead to zero availability instead of partial availability. We can add retries back at the exit node as well.
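To make "favor peers with open connections" concrete, here is a minimal sketch of state-aware selection. The state names, `Peer` type, and `choose_peer` function are hypothetical, for illustration only; they are not TChannel's or Hyperbahn's actual API. Disconnected peers are ranked last rather than removed, so a service whose peers are all disconnected still has something to try.

```python
import random
from dataclasses import dataclass

# Hypothetical connection states; not TChannel's real identifiers.
CONNECTED, CONNECTING, DISCONNECTED = "connected", "connecting", "disconnected"

# Higher score wins. Disconnected peers keep a nonzero score so they stay
# eligible when nothing better exists (partial availability).
STATE_SCORE = {CONNECTED: 3, CONNECTING: 2, DISCONNECTED: 1}


@dataclass(frozen=True)
class Peer:
    host_port: str
    state: str


def choose_peer(peers, rng=random):
    """Pick randomly among the peers in the best available state."""
    if not peers:
        return None
    best = max(STATE_SCORE[p.state] for p in peers)
    candidates = [p for p in peers if STATE_SCORE[p.state] == best]
    return rng.choice(candidates)
```

With one connected, one connecting, and one disconnected peer, this always returns the connected one; with only a disconnected peer, it still returns that peer instead of nothing.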
Does ECONNREFUSED mean the advertised process is gone? If so, when the process comes back, it will re-advertise and get a new peer. It's better to remove the peer than to delay the removal, as long as removing a peer is not an expensive call.
Removing the peer immediately can lead to up to 60 seconds of declined error frames, especially for a service with 1 worker.
@anson627 it's equivalent to a socket error where we've failed to make a connection with the remote host.
@blampe yes, I guess my question is: does this socket error mean the process is completely dead, or could it be a sporadic thing?
@Raynos I guess it would save us forwarding ECONNREFUSED errors for another 4 minutes then.
Yeah, we do not want to immediately remove peers on connection loss; in fact, not doing so is a large part of the point of having Hyperbahn and a collection-of-peers abstraction:
#1175 got us at least part of the way there
If a process advertises and then immediately dies, Hyperbahn will attempt to send requests over the dead connection for the next ~5min until the peer is cleaned up.
We should clean up the peer as soon as we get ECONNREFUSED from it, or at least retry the request with a different peer (it's always safe to retry in this situation).
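The retry option could look roughly like the sketch below. `send_with_retry` and the `send` callable are hypothetical names, not part of TChannel or Hyperbahn; Python's `ConnectionRefusedError` stands in for ECONNREFUSED. The retry is safe because a refused connection means the request never reached the remote process.

```python
def send_with_retry(peers, send, max_attempts=3):
    """Try peers in order, moving on when a connection is refused.

    `send(peer)` is a hypothetical callable that raises
    ConnectionRefusedError when the connect attempt fails.
    """
    last_error = None
    for peer in peers[:max_attempts]:
        try:
            return send(peer)
        except ConnectionRefusedError as err:
            # Nothing reached the remote host, so retrying elsewhere is safe.
            last_error = err
    if last_error is None:
        raise ValueError("no peers to try")
    # Every attempted peer refused the connection; surface the last error.
    raise last_error
```

Only connection-refused errors trigger a retry here; any other failure propagates unchanged, since the request may already have been delivered.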