Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable quicker tcp error notification #670

Merged
merged 2 commits into from
Dec 20, 2023

Conversation

engelmi
Copy link
Member

@engelmi engelmi commented Dec 18, 2023

Fixes: #652
Relates to: #648

When a connection between controller and agent is esablished and is dropped later, e.g. removing the cable from the agent, the disconnect is properly detected by the keepalive mechanism. However, the keepalive takes a while to detect this (based on the KEEPCNT of the system). If any command is issued during that time frame, tcp will try to retransmit the data. Eventually, retransmitting will be stopped and ICMP packets for host not reachable emitted. The keepalive is not used then, which results in the connection between agent and controller to be broken but not closed - a disconnect is not detected.
By setting the IP_RECVERR socket option, errors such as the ICMP host not reachable error will be delivered to the upper layer (systemd event loop in our case) so that it can handle it.

A detailed explanation can be found in this comment: #652 (comment)

Fixes: eclipse-bluechi#652
Relates to: eclipse-bluechi#648

When a connection between controller and agent is esablished and
is dropped later, e.g. removing the cable from the agent, the
disconnect is properly detected by the keepalive mechanism.
However, the keepalive takes a while to detect this (based on the
KEEPCNT of the system). If any command is issued during that time
frame, tcp will try to retransmit the data. Eventually, retransmitting
will be stopped and ICMP packets for host not reachable emitted.
The keepalive is not used then, which results in the connection
between agent and controller to be broken but not closed - a
disconnect is not detected.
By setting the IP_RECVERR socket option, errors such as the ICMP
host not reachable error will be delivered to the upper layer
(systemd event loop in our case) so that it can handle it.

Signed-off-by: Michael Engel <[email protected]>
@engelmi engelmi force-pushed the fix-agent-disconnect-handling branch from 6638c14 to 5502e9e Compare December 18, 2023 08:38
Copy link
Member

@mkemel mkemel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive debugging work

@alexlarsson
Copy link
Contributor

I'm slightly worried about always enabling this. I can see bluechi being useful also on an slightly less robust network. Maybe we can make this an option (on by default)?

@dougsland
Copy link
Contributor

I'm slightly worried about always enabling this. I can see bluechi being useful also on an slightly less robust
network. Maybe we can make this an option (on by default)?

Would worth probably do some testing and measure, if it's affecting network I would set it default to false instead. The code looks nice, kudos for @engelmi and reporter.

Relates to: eclipse-bluechi#652

For systems with less robust network, always enabling IP_RECV_ERR
might result in unnecessary disconnects and a less robust BlueChi
setup. Therefore, a configuration setting was introduced to enable
or disable this option depending on the current needs.

Signed-off-by: Michael Engel <[email protected]>
@engelmi
Copy link
Member Author

engelmi commented Dec 19, 2023

@alexlarsson @mkemel @dougsland
Could you have another look? I introduced a new configuration option for enabling/disabling the IP_RECV_ERR option as requested. I set the default currently to true since our "main target system" would have rather robust networking.

@dougsland Testing and/or measuring in an automated way will be quite hard - at least I don't know right now how to simulate "unplugging the cable". We can create an issue for that, of course :)

Edit:
@dougsland Created the issue for testing this: #672

@alexlarsson
Copy link
Contributor

I don't think you want to test it by actually yanking the cord. There are ways to use the linux traffic control to emulate latency and packet loss. See for example https://dzone.com/articles/simulate-network-latency-and-packet-drop-in-linux

@alexlarsson
Copy link
Contributor

lgtm

@engelmi engelmi merged commit 924e153 into eclipse-bluechi:main Dec 20, 2023
18 checks passed
@engelmi engelmi deleted the fix-agent-disconnect-handling branch February 28, 2024 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot detect disconnection of node during d-bus operation
4 participants