enable quicker tcp error notification #670
Fixes: eclipse-bluechi#652 Relates to: eclipse-bluechi#648 When a connection between controller and agent is established and is dropped later, e.g. by removing the cable from the agent, the disconnect is properly detected by the keepalive mechanism. However, the keepalive takes a while to detect this (based on the KEEPCNT of the system). If any command is issued during that time frame, TCP will try to retransmit the data. Eventually, retransmission stops and ICMP "host not reachable" packets are emitted. The keepalive is not used then, which leaves the connection between agent and controller broken but not closed, so the disconnect is not detected. By setting the IP_RECVERR socket option, errors such as the ICMP "host not reachable" error are delivered to the upper layer (the systemd event loop in our case) so that it can handle them. Signed-off-by: Michael Engel <[email protected]>
6638c14 to 5502e9e
Impressive debugging work
I'm slightly worried about always enabling this. I can see bluechi being useful also on a slightly less robust network. Maybe we can make this an option (on by default)?
It would probably be worth doing some testing and measuring; if it affects the network, I would default it to false instead. The code looks nice, kudos to @engelmi and the reporter.
Relates to: eclipse-bluechi#652 For systems with a less robust network, always enabling IP_RECVERR might result in unnecessary disconnects and a less robust BlueChi setup. Therefore, a configuration setting was introduced to enable or disable this option depending on the current needs. Signed-off-by: Michael Engel <[email protected]>
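For readers following along, a minimal sketch of what such a toggle could look like at the socket level. This is not the code from this PR: the helper names `socket_set_ip_recverr` and `socket_get_ip_recverr` are hypothetical, and the boolean is assumed to come from whatever configuration setting the commit introduces. Only `setsockopt()`/`getsockopt()` with `IPPROTO_IP` and `IP_RECVERR` are real Linux APIs.

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical helper: enable or disable IP_RECVERR on an IPv4 socket.
 * The 'enable' flag is assumed to be read from a configuration setting.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
static int socket_set_ip_recverr(int fd, bool enable) {
        int flag = enable ? 1 : 0;
        return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &flag, sizeof(flag));
}

/* Hypothetical helper: read the current IP_RECVERR setting back.
 * Returns 1 if enabled, 0 if disabled, -1 on failure. */
static int socket_get_ip_recverr(int fd) {
        int flag = 0;
        socklen_t len = sizeof(flag);
        if (getsockopt(fd, IPPROTO_IP, IP_RECVERR, &flag, &len) < 0) {
                return -1;
        }
        return flag != 0;
}
```

Keeping the switch in one small helper means the default (on or off) is decided where the configuration is parsed, not where the socket is set up.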
@alexlarsson @mkemel @dougsland @dougsland Testing and/or measuring in an automated way will be quite hard - at least I don't know right now how to simulate "unplugging the cable". We can create an issue for that, of course :) Edit:
I don't think you want to test it by actually yanking the cord. There are ways to use the Linux traffic control to emulate latency and packet loss. See for example https://dzone.com/articles/simulate-network-latency-and-packet-drop-in-linux
lgtm
Fixes: #652
Relates to: #648
When a connection between controller and agent is established and is dropped later, e.g. by removing the cable from the agent, the disconnect is properly detected by the keepalive mechanism. However, the keepalive takes a while to detect this (based on the KEEPCNT of the system). If any command is issued during that time frame, TCP will try to retransmit the data. Eventually, retransmission stops and ICMP "host not reachable" packets are emitted. The keepalive is not used then, which leaves the connection between agent and controller broken but not closed, so the disconnect is not detected.
By setting the IP_RECVERR socket option, errors such as the ICMP "host not reachable" error are delivered to the upper layer (the systemd event loop in our case) so that it can handle them.
A detailed explanation can be found in this comment: #652 (comment)
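To make the keepalive delay described above more concrete, here is a rough sketch that estimates how long an idle connection takes to be declared dead. It assumes `SO_KEEPALIVE` is enabled on the socket and uses the common approximation of idle time plus probe interval times probe count. The helper names are hypothetical; `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are the Linux socket options that expose the effective values.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Approximate worst-case seconds until keepalive gives up on an idle peer:
 * idle time before the first probe, plus one interval per unanswered probe. */
static long keepalive_detection_seconds(int idle, int interval, int count) {
        return (long) idle + (long) interval * (long) count;
}

/* Hypothetical helper: read the effective keepalive parameters of a TCP
 * socket and estimate the detection time. Returns -1 if an option
 * cannot be read. */
static long socket_keepalive_detection_seconds(int fd) {
        int idle = 0, interval = 0, count = 0;
        socklen_t len = sizeof(int);

        if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, &len) < 0) {
                return -1;
        }
        len = sizeof(int);
        if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, &len) < 0) {
                return -1;
        }
        len = sizeof(int);
        if (getsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, &len) < 0) {
                return -1;
        }
        return keepalive_detection_seconds(idle, interval, count);
}
```

With the usual Linux defaults of 7200 s idle, 75 s interval and 9 probes, the estimate is 7875 s, which illustrates why relying on keepalive alone can leave a broken connection unnoticed for a long time.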