[bug] machine reconnect after omni downtime #638

Open
githubcdr opened this issue Sep 17, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@githubcdr

githubcdr commented Sep 17, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When upgrading Omni I sometimes notice that the SideroLink connections seem to go down for a long time. Nodes are not available during this downtime.

Expected Behavior

Nodes should repair (restore) their connections to Omni after unexpected disconnects. I would expect at least a reconnect attempt from the nodes on failure.

Steps To Reproduce

Bring Omni 0.42.3 down for a few minutes and restart it; some nodes are then not available/connected.

{"level":"warn","ts":1726598097.5886366,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.365197,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.4972,"caller":"device/send.go:138","msg":"peer(OyBh…3QWc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726598098.510401,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}

What browsers are you seeing the problem on?

No response

Anything else?

A reboot command to the node sometimes works.

githubcdr added the bug label Sep 17, 2024
@Unix4ever
Member

I think the nodes handle reconnects in that case. If a node goes down and gets rebooted, it should check in to Omni immediately.
When Omni is down, the node also tries to reconnect, but it retries with backoff.
Omni should keep the last known endpoints for the nodes, but maybe there's something wrong there.

Worth checking if there is a bug.
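
For context, the retry-with-backoff behaviour described here typically looks like the sketch below. This is not the actual Talos/SideroLink code; the dial function, the intervals, and the cap are illustrative placeholders.

// reconnectsketch illustrates retry with exponential backoff, the pattern
// described above. It is not the actual Talos/SideroLink code; the dial
// function, the intervals, and the cap are placeholders.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dial stands in for establishing the SideroLink connection to Omni.
func dial(ctx context.Context) error {
	return errors.New("omni unreachable") // placeholder: always fails in this sketch
}

func reconnectWithBackoff(ctx context.Context) error {
	backoff := time.Second
	const maxBackoff = time.Minute

	for {
		err := dial(ctx)
		if err == nil {
			return nil // connected
		}

		fmt.Printf("dial failed: %v; retrying in %s\n", err, backoff)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}

		// Exponential backoff with a cap, so a node keeps retrying during
		// long Omni downtime without the interval growing unbounded.
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := reconnectWithBackoff(ctx); err != nil {
		fmt.Println("gave up:", err)
	}
}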

@Unix4ever
Member

Unix4ever commented Sep 19, 2024

Do the machines reconnect after some time?

@githubcdr
Author

I had to reboot them manually, but only some disappeared. I gave it 4 hours before doing so.

@githubcdr
Author

It just happened again; it's reproducible in my case by stopping Omni for 5 minutes and starting it again. The WireGuard connection errors pop up in the logs and some hosts become isolated.

@smira
Member

smira commented Sep 20, 2024

Please make sure you're running a recent enough Talos version, and attach the logs of a Talos node here after it gets disconnected.

@githubcdr
Author

githubcdr commented Sep 21, 2024

I'm running the latest version; maybe stale SideroLink connections are causing this issue? I run Omni in a testing environment and create and destroy a lot of clusters. Even when all nodes are green I still see the logs below.

Sometimes a node is greyed out with no logs available, but I can still reboot the node via Omni.

{"level":"warn","ts":1726910417.2707446,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910418.1342225,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910419.3680203,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910420.5396538,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.4085839,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910421.5408869,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910422.4702663,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910422.5636215,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"k8s_proxy","request_url":"/apis/metrics.k8s.io/v1beta1/pods","method":"GET","remote_addr":"xxx","duration":0.030410022,"status":503,"response_length":20,"cluster":"silly-skynet","cluster_uuid":"","impersonate.user":"[email protected]","impersonate.groups":["system:masters"]}
{"level":"warn","ts":1726910423.1459103,"caller":"device/send.go:138","msg":"peer(RVqv…Z3Fc) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910424.6307366,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910425.7384636,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.5885136,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910426.6452115,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910427.6772408,"caller":"device/send.go:138","msg":"peer(gYOs…LsGU) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
...
{"level":"warn","ts":1726910660.4494154,"caller":"device/send.go:138","msg":"peer(nFZ4…frRM) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910660.7291093,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/admin/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.00004044,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.7074013,"caller":"device/send.go:138","msg":"peer(aXTs…pziA) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910662.7517579,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/backup/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxx","duration":0.0000428,"status":200,"response_length":1730}
{"level":"warn","ts":1726910662.9705136,"caller":"device/send.go:138","msg":"peer(jJtQ…3034) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"warn","ts":1726910663.171093,"caller":"device/send.go:138","msg":"peer(N9xn…SwxI) - Failed to send handshake initiation: no known endpoint for peer","component":"server","component":"siderolink"}
{"level":"info","ts":1726910663.3321455,"caller":"logging/handler.go:59","msg":"HTTP request done","component":"server","handler":"static","request_url":"/blog/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php","method":"GET","remote_addr":"xxxx","duration":0.00004516,"status":200,"response_length":17
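
For reference, the "no known endpoint for peer" warnings above mean the WireGuard device still has the peer configured but has no endpoint address to send a handshake to. The following diagnostic sketch uses golang.zx2c4.com/wgctrl to list such peers; it assumes the siderolink WireGuard device on the Omni host is visible to wgctrl (a kernel device, or a userspace device exposing its UAPI socket), which may not hold for every Omni deployment.

// peercheck lists WireGuard peers that have no known endpoint or a stale
// handshake. It is a standalone diagnostic sketch, not part of Omni, and it
// assumes the siderolink WireGuard device is visible to wgctrl on this host.
package main

import (
	"fmt"
	"log"
	"time"

	"golang.zx2c4.com/wgctrl"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatalf("opening wgctrl client: %v", err)
	}
	defer client.Close()

	devices, err := client.Devices()
	if err != nil {
		log.Fatalf("listing WireGuard devices: %v", err)
	}

	for _, dev := range devices {
		for _, peer := range dev.Peers {
			switch {
			case peer.Endpoint == nil:
				// Matches the "no known endpoint for peer" warnings above.
				fmt.Printf("%s: peer %s has no known endpoint\n", dev.Name, peer.PublicKey)
			case time.Since(peer.LastHandshakeTime) > 5*time.Minute:
				fmt.Printf("%s: peer %s last handshake %s ago\n",
					dev.Name, peer.PublicKey, time.Since(peer.LastHandshakeTime).Round(time.Second))
			}
		}
	}
}

Running something like this while the issue reproduces would show whether Omni lost the peers' endpoints entirely or only the handshakes went stale.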

@githubcdr
Author

Things seem to be a lot more stable for me after updating to 1.8.0 and removing the old disconnected hosts; so far so good!

@githubcdr
Author

Maybe these node logs help:

01/01/1970 11:17:24
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&resourceVersion=7429241\": EOF", "error_count": 4}
01/01/1970 11:18:08
[talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dtalos-q7f-tv7&limit=500&resourceVersion=0\": EOF", "error_count": 0}
01/01/1970 17:05:27
[talos] error watching discovery service state {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"}

@smira
Member

smira commented Sep 26, 2024

This is totally unrelated. If Talos disconnects from Omni, the logs will mention SideroLink.

Probably unrelated, but the time is off by a lot somewhere in your infra ;)

@githubcdr
Author

The date on the node is correct; it's a display issue in the Omni UI. I can reproduce this issue:

  • stop Omni and wait for a few minutes
  • restart Omni and notice that some nodes do not reconnect

The nodes are connected by cable on a stable network; which nodes drop seems to be random, with this as the last node message:

[talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://[fdae:41e4303::1]:10000/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=645709\": dial tcp [fdae:41303::1]:10000: i/o timeout"}

I tried both the internal and the public discovery service; the results are the same. The k8s cluster itself is operational, the node is reachable, and port 50000 is open, indicating that the Talos API (apid) is up. Sometimes a reboot works (which is strange, since it depends on the WireGuard connection).
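
As an aside, the port 50000 check mentioned above can be reproduced with a trivial TCP dial; the sketch below uses a placeholder node address and only verifies that apid accepts connections, not that SideroLink is healthy.

// portcheck is a minimal sketch of the reachability test described above: it
// only confirms that a TCP connection to the Talos API port (50000) can be
// established. The node address below is a placeholder.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addr := net.JoinHostPort("192.168.1.10", "50000") // placeholder node IP

	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("port 50000 not reachable: %v\n", err)
		return
	}
	defer conn.Close()

	fmt.Println("port 50000 is reachable (apid is listening)")
}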

I'll keep debugging, thanks for all your work.
