[bug] machine reconnect after omni downtime #638
Comments
I think nodes handle reconnects in that case. So if a node goes down and gets rebooted, it should check into Omni immediately. Worth checking whether there is a bug.
Do the machines reconnect after some time?
I had to reboot them manually, but only some had disappeared. I gave it 4 hours before doing so.
It just happened again. In my case it is reproducible by stopping Omni for 5 minutes and starting it again: the WireGuard connection error pops up in the logs and some hosts become isolated.
Please make sure you're running a recent enough Talos, and attach the logs of the Talos node here after it gets disconnected.
I'm running the latest version. Maybe stale SideroLink connections are causing this issue? I run Omni in a testing environment and create and destroy a lot of clusters. When all nodes are green I still see the logs below. Sometimes a node is greyed out with no logs available, but I can still reboot the node via Omni.
Things seem to be a lot more stable for me after updating to 1.8.0 and removing disconnected old hosts. So far so good!
Maybe these node logs help:
This is totally unrelated. If Talos disconnects from Omni, the logs will say so. Probably unrelated, but the time is off by a lot somewhere in your infra ;)
The date on the node is correct; it's a display issue in the Omni UI. I can reproduce this issue:
The nodes are connected by cable on a stable network. The disconnect seems to happen at random, with this as the last node message:
I tried using both internal and public discovery; the results are the same. The k8s cluster itself is operational, the node is reachable, and port 50000 is open, indicating that the Talos API on the node is operational. Sometimes a reboot works (which is strange, since it depends on the WireGuard connection). I'll keep debugging. Thanks for all your work.
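For anyone who wants to repeat the port check above, here is a minimal sketch in Python. The function name `tcp_port_open` and the example address are hypothetical and not part of Talos or Omni. Note that this only shows that something accepts TCP connections on the port; it says nothing about the SideroLink tunnel, since WireGuard runs over UDP.

```python
import socket


def tcp_port_open(host: str, port: int = 50000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder (documentation range); use the node's real address.
    print(tcp_port_open("192.0.2.10", 50000, timeout=1.0))
```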
Is there an existing issue for this?
Current Behavior
When upgrading Omni, I sometimes notice that SideroLink connections go down for a long time. Nodes are not available during this downtime.
Expected Behavior
Nodes repair (restore) their connections to Omni after unexpected disconnects. I would expect at least a reconnect on failure from the nodes.
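To illustrate the kind of behavior expected here, below is a rough sketch of a reconnect loop with capped exponential backoff and jitter. This is a generic pattern, not how Talos or Omni actually implement reconnects; `reconnect_with_backoff`, its parameters, and the `connect` callback are all hypothetical names chosen for the example.

```python
import random
import time
from typing import Callable


def reconnect_with_backoff(
    connect: Callable[[], bool],
    base: float = 1.0,
    cap: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() until it returns True and return the number of attempts used.

    Raises ConnectionError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        if attempt < max_attempts:
            # Double the upper bound each time, but never exceed `cap`.
            upper = min(cap, base * 2 ** (attempt - 1))
            # Random jitter keeps many nodes from retrying in lockstep
            # when the server comes back after downtime.
            sleep(random.uniform(0, upper))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

The jitter matters in the scenario from this issue: when a central service returns after a few minutes of downtime, every node retrying on the same schedule would hit it at once.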
Steps To Reproduce
Bring Omni 0.42.3 down for a few minutes and restart it. Afterwards, some nodes are not available/connected.
What browsers are you seeing the problem on?
No response
Anything else?
A reboot command to the node sometimes works.