UDP check sometimes stays in "passing" status while endpoint is actually down #22068

Open
Chemrat opened this issue Jan 13, 2025 · 0 comments

Overview of the Issue

On some machines, a UDP health check passes once and then stays in "passing" status even after the service goes down and is no longer reachable. Monitoring the UDP traffic shows that Consul sends a single check packet and then stops sending any more. We're using a simple server that responds to an empty UDP payload with an empty UDP payload.

After adding extra logging to Consul, it looks like the check hangs on this call:

_, err = bufio.NewReader(conn).Read(make([]byte, 1))

Looking through the code further, it seems there is no actual timeout set for UDP checks. As far as I understand, the Dialer timeout set in func (c *CheckUDP) Start() only applies to establishing the connection (which makes little sense for UDP, except perhaps as a timeout for resolving the endpoint address) and has no effect on reads from the connection. Those should presumably be bounded via SetReadDeadline, e.g. in func (c *CheckUDP) check():

deadline := time.Now().Add(c.Timeout)
err := conn.SetReadDeadline(deadline)

However, just adding SetReadDeadline doesn't fix anything because of this check:

if err != nil {
    if strings.Contains(err.Error(), "i/o timeout") {
        c.StatusHandler.updateCheck(c.CheckID, api.HealthPassing, fmt.Sprintf("UDP connect %s: Success", c.UDP))
        return

which I also don't understand at all. Why does the check pass on a read timeout? Doesn't a UDP health check expect a response?

Calling SetReadDeadline and deleting the "i/o timeout" error case (by the way, why does it match on the error string? I'm not a Go developer, but this seems odd to me) makes the check update correctly when the UDP endpoint goes up or down.

Since I'm not a Go developer, I might be grossly misinterpreting what the aforementioned functions do, but after making the changes I've described, Consul stopped misreporting UDP service status for me.


Reproduction Steps

  1. Start the Consul agent
  2. Register a service with a UDP health check with a timeout of 1s
  3. Let the check pass once, then stop the service
  4. The service is still passing in Consul (logs, UI, API)

Consul info for both Client and Server

Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = dev
        revision = c1a887e0
        version = 1.21.0
        version_metadata = 
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 1
        leader = true
        leader_addr = 10.253.253.100:8300
        server = true
raft:
        applied_index = 565
        commit_index = 565
        fsm_pending = 0
        last_contact = 0
        last_log_index = 565
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:4ee65dbe-b13c-418f-c3b3-5698213eaea4 Address:10.253.253.100:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 5
        goroutines = 190
        max_procs = 5
        os = linux
        version = go1.23.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 1
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 1
        left = 0
        member_time = 2
        members = 1
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1
{
    "ui_config": {
        "enabled": true
    },
    "client_addr": "0.0.0.0",
    "bind_addr": "10.253.253.100",
    "retry_join": ["10.253.253.100"],
    "server": true,
    "limits": {
        "http_max_conns_per_client": 8192,
        "rpc_max_conns_per_client": 2048
    },
    "bootstrap_expect": 1
}

Operating system and Environment details

Linux 5.4.0-198-generic #218-Ubuntu SMP Fri Sep 27 20:18:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
