enable quicker tcp error notification #670

engelmi · 2023-12-18T08:29:44Z

Fixes: #652
Relates to: #648

When a connection between controller and agent is esablished and is dropped later, e.g. removing the cable from the agent, the disconnect is properly detected by the keepalive mechanism. However, the keepalive takes a while to detect this (based on the KEEPCNT of the system). If any command is issued during that time frame, tcp will try to retransmit the data. Eventually, retransmitting will be stopped and ICMP packets for host not reachable emitted. The keepalive is not used then, which results in the connection between agent and controller to be broken but not closed - a disconnect is not detected.
By setting the IP_RECVERR socket option, errors such as the ICMP host not reachable error will be delivered to the upper layer (systemd event loop in our case) so that it can handle it.

A detailed explanation can be found in this comment: #652 (comment)

Fixes: eclipse-bluechi#652 Relates to: eclipse-bluechi#648 When a connection between controller and agent is esablished and is dropped later, e.g. removing the cable from the agent, the disconnect is properly detected by the keepalive mechanism. However, the keepalive takes a while to detect this (based on the KEEPCNT of the system). If any command is issued during that time frame, tcp will try to retransmit the data. Eventually, retransmitting will be stopped and ICMP packets for host not reachable emitted. The keepalive is not used then, which results in the connection between agent and controller to be broken but not closed - a disconnect is not detected. By setting the IP_RECVERR socket option, errors such as the ICMP host not reachable error will be delivered to the upper layer (systemd event loop in our case) so that it can handle it. Signed-off-by: Michael Engel <[email protected]>

mkemel

Impressive debugging work

alexlarsson · 2023-12-18T13:43:25Z

I'm slightly worried about always enabling this. I can see bluechi being useful also on an slightly less robust network. Maybe we can make this an option (on by default)?

dougsland · 2023-12-18T18:20:57Z

I'm slightly worried about always enabling this. I can see bluechi being useful also on an slightly less robust
network. Maybe we can make this an option (on by default)?

Would worth probably do some testing and measure, if it's affecting network I would set it default to false instead. The code looks nice, kudos for @engelmi and reporter.

Relates to: eclipse-bluechi#652 For systems with less robust network, always enabling IP_RECV_ERR might result in unnecessary disconnects and a less robust BlueChi setup. Therefore, a configuration setting was introduced to enable or disable this option depending on the current needs. Signed-off-by: Michael Engel <[email protected]>

engelmi · 2023-12-19T16:56:12Z

@alexlarsson @mkemel @dougsland
Could you have another look? I introduced a new configuration option for enabling/disabling the IP_RECV_ERR option as requested. I set the default currently to true since our "main target system" would have rather robust networking.

@dougsland Testing and/or measuring in an automated way will be quite hard - at least I don't know right now how to simulate "unplugging the cable". We can create an issue for that, of course :)

Edit:
@dougsland Created the issue for testing this: #672

alexlarsson · 2023-12-20T09:15:16Z

I don't think you want to test it by actually yanking the cord. There are ways to use the linux traffic control to emulate latency and packet loss. See for example https://dzone.com/articles/simulate-network-latency-and-packet-drop-in-linux

src/agent/agent.c

alexlarsson · 2023-12-20T09:26:03Z

lgtm

engelmi requested review from dougsland and mwperina December 18, 2023 08:29

engelmi requested review from alexlarsson, mkemel and rhatdan as code owners December 18, 2023 08:29

engelmi mentioned this pull request Dec 18, 2023

Cannot detect disconnection of node during d-bus operation #652

Closed

engelmi force-pushed the fix-agent-disconnect-handling branch from 6638c14 to 5502e9e Compare December 18, 2023 08:38

mkemel approved these changes Dec 18, 2023

View reviewed changes

alexlarsson reviewed Dec 20, 2023

View reviewed changes

src/agent/agent.c Show resolved Hide resolved

alexlarsson approved these changes Dec 20, 2023

View reviewed changes

engelmi mentioned this pull request Dec 20, 2023

Introduce integration test(s) for network latency and packet loss #672

Open

engelmi merged commit 924e153 into eclipse-bluechi:main Dec 20, 2023
18 checks passed

engelmi deleted the fix-agent-disconnect-handling branch February 28, 2024 08:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

enable quicker tcp error notification #670

enable quicker tcp error notification #670

engelmi commented Dec 18, 2023

mkemel left a comment

alexlarsson commented Dec 18, 2023

dougsland commented Dec 18, 2023

engelmi commented Dec 19, 2023 •

edited

Loading

alexlarsson commented Dec 20, 2023

alexlarsson commented Dec 20, 2023

enable quicker tcp error notification #670

enable quicker tcp error notification #670

Conversation

engelmi commented Dec 18, 2023

mkemel left a comment

Choose a reason for hiding this comment

alexlarsson commented Dec 18, 2023

dougsland commented Dec 18, 2023

engelmi commented Dec 19, 2023 • edited Loading

alexlarsson commented Dec 20, 2023

alexlarsson commented Dec 20, 2023

engelmi commented Dec 19, 2023 •

edited

Loading