[bug] machine reconnect after omni downtime #638

Open
githubcdr opened this issue Sep 17, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@githubcdr

githubcdr commented Sep 17, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When upgrading Omni I sometimes notice that the SideroLink connections seem to go down for a long time. Nodes are not available during this downtime.

Expected Behavior

Nodes should repair (restore) their connections to Omni after unexpected disconnects. I would expect at least a reconnect attempt from the nodes on failure.

Steps To Reproduce

Bring Omni 0.42.3 down for a few minutes and restart it; some nodes are then not available/connected.

{"level":"warn","ts":1726598097.5886366,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.365197,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.4972,"caller":"device/send.go:138","msg":"peer(OyBh…3QWc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.510401,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}

What browsers are you seeing the problem on?

No response

Anything else?

A reboot command to the node sometimes works.

githubcdr added the bug label Sep 17, 2024
@Unix4ever
Member

I think the nodes handle reconnects in that case. If a node goes down and gets rebooted, it should check in to Omni immediately.
When Omni is down, the node also tries to reconnect, but it retries with backoff.
Omni should keep the last known endpoints for the nodes, but maybe there's something wrong there.

Worth checking if there is a bug.
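
For context, the retry-with-backoff behaviour described here typically looks like the sketch below. This is not the actual Talos/SideroLink code; the dial function, the intervals, and the cap are illustrative placeholders.

// reconnectsketch illustrates retry with exponential backoff, the pattern
// described above. It is not the actual Talos/SideroLink code; the dial
// function, the intervals, and the cap are placeholders.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dial stands in for establishing the SideroLink connection to Omni.
func dial(ctx context.Context) error {
	return errors.New("omni unreachable") // placeholder: always fails in this sketch
}

func reconnectWithBackoff(ctx context.Context) error {
	backoff := time.Second
	const maxBackoff = time.Minute

	for {
		err := dial(ctx)
		if err == nil {
			return nil // connected
		}

		fmt.Printf("dial failed: %v; retrying in %s\n", err, backoff)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}

		// Exponential backoff with a cap, so a node keeps retrying during
		// long Omni downtime without the interval growing unbounded.
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := reconnectWithBackoff(ctx); err != nil {
		fmt.Println("gave up:", err)
	}
}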

@Unix4ever
Member

Unix4ever commented Sep 19, 2024

Do the machines reconnect after some time?

@githubcdr
Author

I had to reboot them manually, but only some disappeared. I gave it 4 hours before doing so.

@githubcdr
Author

It just happened again; it's reproducible in my case by stopping Omni for 5 minutes and starting it again. The WireGuard connection errors pop up in the logs and some hosts become isolated.

@smira
Member

smira commented Sep 20, 2024

Please make sure you're running a recent enough Talos version, and attach the logs of a Talos node here after it gets disconnected.

@githubcdr
Author

githubcdr commented Sep 21, 2024

I'm running the latest version; maybe stale SideroLink connections are causing this issue? I run Omni in a testing environment and create and destroy a lot of clusters. Even when all nodes are green I still see the logs below.

Sometimes a node is greyed out with no logs available, but I can still reboot the node via Omni.

{"level":"warn","ts":1726910417.2707446,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910418.1342225,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910419.3680203,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910420.5396538,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.4085839,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.5408869,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910422.4702663,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910422.5636215,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"k8s_proxy","request_url":"/apis/metrics.k8s.io/v1beta1/pods","method":"GET","remote_addr":"xxx","duration":0.030410022,"status":503,"response_length":20,"cluster":"silly-skynet","cluster_uuid":"","impersonate.user":"[email protected]","impersonate.groups":["system:masters"]}
{"level":"warn","ts":1726910423.1459103,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910424.6307366,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910425.7384636,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.5885136,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.6452115,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910427.6772408,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
...
{"level":"warn","ts":1726910660.4494154,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910660.7291093,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/admin/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.00004044,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.7074013,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910662.7517579,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/backup/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.0000428,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.9705136,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910663.171093,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910663.3321455,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/blog/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxxx","duration":0.00004516,"status":200,"response_length":17
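
For reference, the "no known endpoint for peer" warnings above mean the WireGuard device still has the peer configured but has no endpoint address to send a handshake to. The following diagnostic sketch uses golang.zx2c4.com/wgctrl to list such peers; it assumes the siderolink WireGuard device on the Omni host is visible to wgctrl (a kernel device, or a userspace device exposing its UAPI socket), which may not hold for every Omni deployment.

// peercheck lists WireGuard peers that have no known endpoint or a stale
// handshake. It is a standalone diagnostic sketch, not part of Omni, and it
// assumes the siderolink WireGuard device is visible to wgctrl on this host.
package main

import (
	"fmt"
	"log"
	"time"

	"golang.zx2c4.com/wgctrl"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatalf("opening wgctrl client: %v", err)
	}
	defer client.Close()

	devices, err := client.Devices()
	if err != nil {
		log.Fatalf("listing WireGuard devices: %v", err)
	}

	for _, dev := range devices {
		for _, peer := range dev.Peers {
			switch {
			case peer.Endpoint == nil:
				// Matches the "no known endpoint for peer" warnings above.
				fmt.Printf("%s: peer %s has no known endpoint\n", dev.Name, peer.PublicKey)
			case time.Since(peer.LastHandshakeTime) > 5*time.Minute:
				fmt.Printf("%s: peer %s last handshake %s ago\n",
					dev.Name, peer.PublicKey, time.Since(peer.LastHandshakeTime).Round(time.Second))
			}
		}
	}
}

Running something like this while the issue reproduces would show whether Omni lost the peers' endpoints entirely or only the handshakes went stale.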

@githubcdr
Author

Things seem to be a lot more stable for me after updating to 1.8.0 and removing the old disconnected hosts; so far so good!

@githubcdr
Author

Maybe these node logs help:

01/01/1970 11:17:24
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&resourceVersion=7429241\": EOF", "error_count": 4}
01/01/1970 11:18:08
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&limit=500&resourceVersion=0\": EOF", "error_count": 0}
01/01/1970 17:05:27
[talos] error watching discovery service state {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}

@smira
Member

smira commented Sep 26, 2024

This is totally unrelated. If Talos disconnects from Omni, the logs will mention SideroLink.

Probably unrelated, but the time is off by a lot somewhere in your infra ;)

@githubcdr
Author

The date on the node is correct; it's a display issue in the Omni UI. I can reproduce this issue:

  • stop Omni and wait for a few minutes
  • restart Omni and notice that some nodes do not reconnect

The nodes are connected by cable on a stable network; which nodes drop seems to be random, with this as the last node message:

[talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://[fdae:41e4303::1]:10000/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=645709\": dial tcp [fdae:41303::1]:10000: i/o timeout"}

I tried both the internal and the public discovery service; the results are the same. The k8s cluster itself is operational, the node is reachable, and port 50000 is open, indicating that the Talos API (apid) is up. Sometimes a reboot works (which is strange, since it depends on the WireGuard connection).
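
As an aside, the port 50000 check mentioned above can be reproduced with a trivial TCP dial; the sketch below uses a placeholder node address and only verifies that apid accepts connections, not that SideroLink is healthy.

// portcheck is a minimal sketch of the reachability test described above: it
// only confirms that a TCP connection to the Talos API port (50000) can be
// established. The node address below is a placeholder.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addr := net.JoinHostPort("192.168.1.10", "50000") // placeholder node IP

	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("port 50000 not reachable: %v\n", err)
		return
	}
	defer conn.Close()

	fmt.Println("port 50000 is reachable (apid is listening)")
}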

I'll keep debugging, thanks for all your work.
