"could not get peer id" and timeouts since 0.0.29 #709
Comments
Are you still seeing these problems with v0.0.30? I fixed a lot of issues related to dialing self, which is why I ask. One thing I see is that you have mixed ip4 and ip6 addresses. Are you running a multistack cluster? The 60 second delay when Spegel is down should be fixed once my PR in Containerd gets merged.
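For reference, the "could not get peer id" wording points at multiaddr parsing: libp2p can only derive a peer ID when an address carries a trailing /p2p component, and dual-stack nodes advertise both /ip4 and /ip6 forms of the same peer. A minimal sketch of that failure mode (illustrative addresses and ports, not Spegel's actual internals):

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

func main() {
	// An address without a trailing /p2p/<peer-id> component carries no peer ID.
	// Dual-stack peers advertise the same structure twice, e.g.
	//   /ip4/10.0.0.5/tcp/5001/p2p/<peer-id>
	//   /ip6/fd00::5/tcp/5001/p2p/<peer-id>
	addr, err := ma.NewMultiaddr("/ip4/10.0.0.5/tcp/5001")
	if err != nil {
		panic(err)
	}
	if _, err := peer.AddrInfoFromP2pAddr(addr); err != nil {
		// Fails because the /p2p component is missing from the multiaddr.
		fmt.Println("could not get peer id:", err)
	}
}
```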
I can confirm the error on a
Yes, both 0.0.29 and 0.0.30 are effectively non-functional in my cluster. The log above is from 0.0.30.
Yes.
I think these are two different issues. As I have not had a multistack test environment, Spegel has never really been verified with it. @betweenclouds how are you determining that things are not working? Is it just the logs, or are you seeing that Spegel is not able to resolve peers? Spegel will report unhealthy when it does not have any peers in its routing table, so is Spegel crashing? @jfroy this issue will be solved by creating an e2e test with multistack, as it is pretty complex to deal with currently while using libp2p.
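To make "unhealthy when there are no peers in the routing table" concrete, here is a rough sketch of what such a readiness check can look like with go-libp2p-kad-dht (an assumed wiring for illustration, not Spegel's actual probe code):

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func main() {
	ctx := context.Background()

	// Start a libp2p host and a Kademlia DHT on top of it.
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	kadDHT, err := dht.New(ctx, h)
	if err != nil {
		panic(err)
	}

	// Readiness endpoint: report unhealthy while the routing table is empty,
	// i.e. while no peers have been discovered yet.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if kadDHT.RoutingTable().Size() == 0 {
			http.Error(w, "no peers in routing table", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	_ = http.ListenAndServe(":8080", nil)
}
```

If a probe like this stays red, the node never discovered any peers, which would match falling back to the upstream registry rather than the pod crashing outright.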
Let me know if I can help (more data or run experiments). Since it's my home lab cluster, there's no disruption budget 😬 |
@phillebaba Yes, the pods did crash with the higher versions.
@betweenclouds RKE2 and K3S will never work if you install Spegel directly, due to the way that Containerd is integrated, which is why Spegel has been embedded instead.
Spegel version
v0.0.30
Kubernetes distribution
Talos 1.9.1
Kubernetes version
v1.31.4
CNI
Cilium
Describe the bug
Since v0.0.29, with the new peer discovery, I am seeing errors in the Spegel logs and 60s delays whenever a container needs to fetch an image (essentially a timeout, after which the image is quickly fetched from upstream). I am assuming this means Spegel is in a bad state.
I haven't changed my CNI, Kubernetes, or Talos versions between those Spegel versions, and I haven't changed my Spegel settings.
My cluster is dual-stack (v4 and v6).
I install via the Helm chart with the following values:
Logs: