"could not get peer id" and timeouts since 0.0.29 #709

Open
jfroy opened this issue Jan 22, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@jfroy
Contributor

jfroy commented Jan 22, 2025

Spegel version

v0.0.30

Kubernetes distribution

Talos 1.9.1

Kubernetes version

v1.31.4

CNI

Cilium

Describe the bug

Since v0.0.29, which introduced the new peer discovery, I am seeing errors in the Spegel logs and 60s delays whenever a container needs to fetch an image (effectively a timeout, after which the image is quickly fetched from upstream). I assume this means Spegel is in a bad state.

I haven't changed my CNI, Kubernetes, or Talos versions between those Spegel versions, and I haven't changed my Spegel settings.

My cluster is dual-stack (v4 and v6).

I use the helm chart for installation using the following values:

serviceMonitor:
  enabled: true
grafanaDashboard:
  enabled: true
spegel:
  containerdRegistryConfigPath: /etc/cri/conf.d/hosts

Logs:

{"time":"2025-01-22T06:39:54.884189668Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/oci.(*Containerd).Verify","file":"/build/pkg/oci/containerd.go","line":118},"msg":"unable to verify status response","runtime_version":"2.0.1"}
{"time":"2025-01-22T06:39:54.901603359Z","level":"INFO","source":{"function":"main.registryCommand","file":"/build/main.go","line":212},"msg":"running Spegel","registry":":5000","router":":5001"}
{"time":"2025-01-22T06:39:54.90170104Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.(*P2PRouter).Run","file":"/build/pkg/routing/p2p.go","line":111},"msg":"starting p2p router","logger":"p2p","id":"/ip6/2001:5a8:42a6:e2fb::f95/tcp/5001/p2p/12D3KooWLPhmVFbJie7ua7juDgqrcH123CTRSeFV2iMNJhM27dYt"}
{"time":"2025-01-22T06:39:54.902058451Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T06:39:59.90690095Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.1.204/tcp/5001] dial tcp4 0.0.0.0:5001->10.11.1.204:5001: i/o timeout"}
{"time":"2025-01-22T06:39:59.912454683Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920541642Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920706382Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920869343Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920891363Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":293},"msg":"no bootstrap nodes found","logger":"p2p"}
{"time":"2025-01-22T06:48:54.903045168Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T06:57:54.902146115Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T07:06:54.902113162Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
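The errors in the log above repeat the same few addresses and reasons, so a tally makes the pattern easier to read. The following is a minimal sketch, not part of Spegel: it assumes the JSON log format shown above, and the function names `classify` and `summarize_dial_failures` are hypothetical. The sample lines are trimmed copies of two entries from the log.

```python
import json
import re
from collections import Counter

# Matches one per-address line inside a dial error, for example:
#   "  * [/ip4/10.11.1.204/tcp/5001] dial backoff"
DIAL_LINE = re.compile(r"\*\s+\[(?P<addr>[^\]]+)\]\s+(?P<reason>.+)")

# Reason substrings taken from the log output in this issue.
KNOWN_REASONS = (
    "i/o timeout",
    "connection refused",
    "dial to self attempted",
    "dial backoff",
)


def classify(reason):
    """Map a raw dial error message to one of the known reason labels."""
    for key in KNOWN_REASONS:
        if key in reason:
            return key
    return "other"


def summarize_dial_failures(log_lines):
    """Count (multiaddr, reason) pairs across 'could not get peer id' entries."""
    counts = Counter()
    for raw in log_lines:
        raw = raw.strip()
        if not raw.startswith("{"):
            continue  # skip non-JSON lines such as kubectl notices
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if entry.get("msg") != "could not get peer id":
            continue
        for line in entry.get("err", "").splitlines():
            match = DIAL_LINE.search(line)
            if match:
                counts[(match["addr"], classify(match["reason"]))] += 1
    return counts


# Trimmed sample entries (only the fields the sketch reads).
SAMPLE = [
    r'{"level":"ERROR","msg":"could not get peer id","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.1.204/tcp/5001] dial tcp4 0.0.0.0:5001->10.11.1.204:5001: i/o timeout"}',
    r'{"level":"ERROR","msg":"could not get peer id","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}',
    r'{"level":"INFO","msg":"no bootstrap nodes found"}',
]

summary = summarize_dial_failures(SAMPLE)
for (addr, reason), count in sorted(summary.items()):
    print(f"{count:3d}  {addr}  {reason}")
```

In this sample only one address ever reports a real network error (`i/o timeout`); the later `dial backoff` entries are follow-on failures for addresses that already failed.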
@jfroy jfroy added the bug Something isn't working label Jan 22, 2025
@phillebaba
Member

phillebaba commented Jan 22, 2025

Are you still seeing these problems with v0.0.30? I fixed a lot of issues related to dial self which is why I ask.

One thing I see is that you have mixed ip4 and ip6 addresses. Are you running a multi stack cluster?

The 60 second delay when Spegel is down should be fixed once my PR in Containerd gets merged.
containerd/containerd#11106
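The mixed address families mentioned above can be read straight off the log: the router ID is an `/ip6/` multiaddr, while every dialed peer is `/ip4/`. A minimal sketch of that check follows. It only does string parsing on the textual multiaddr form, and the helper names `multiaddr_family` and `has_mixed_families` are hypothetical, not part of Spegel or libp2p.

```python
def multiaddr_family(addr):
    """Return 'ip4', 'ip6', or 'unknown' from the first component of a multiaddr string."""
    # "/ip4/10.11.1.204/tcp/5001" splits into ['', 'ip4', '10.11.1.204', 'tcp', '5001'].
    parts = addr.strip().split("/")
    if len(parts) > 1 and parts[1] in ("ip4", "ip6"):
        return parts[1]
    return "unknown"


def has_mixed_families(own_addr, peer_addrs):
    """True if any peer address uses a different IP family than the local router address."""
    own = multiaddr_family(own_addr)
    return any(multiaddr_family(peer) != own for peer in peer_addrs)


# Addresses copied from the log in the original report.
own = "/ip6/2001:5a8:42a6:e2fb::f95/tcp/5001/p2p/12D3KooWLPhmVFbJie7ua7juDgqrcH123CTRSeFV2iMNJhM27dYt"
peers = [
    "/ip4/10.11.3.186/tcp/5001",
    "/ip4/10.11.1.204/tcp/5001",
    "/ip4/10.11.2.154/tcp/5001",
]

print(multiaddr_family(own), [multiaddr_family(p) for p in peers])
print("mixed families:", has_mixed_families(own, peers))
```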

@betweenclouds

I can confirm the error on an RKE2 v1.31.3+rke2r1 cluster with Spegel installed manually (IPv4 only, CNI: Calico). Spegel v0.0.28 works; v0.0.29 and v0.0.30 do not.

Defaulted container "registry" out of: registry, configuration (init)
{"time":"2025-01-22T12:06:41.259669337Z","level":"INFO","source":{"function":"main.registryCommand","file":"/build/main.go","line":212},"msg":"running Spegel","registry":":5000","router":":5001"}
{"time":"2025-01-22T12:06:41.259949179Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.(*P2PRouter).Run","file":"/build/pkg/routing/p2p.go","line":111},"msg":"starting p2p router","logger":"p2p","id":"/ip4/10.42.44.78/tcp/5001/p2p/dsfgkdhsgfjsdhgfjhsdlfknsdlfnsdfsdf"}
{"time":"2025-01-22T12:06:41.260201211Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T12:06:41.262976258Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.42.44.203/tcp/5001] dial tcp4 10.42.44.203:5001: connect: connection refused"}
{"time":"2025-01-22T12:06:41.263394684Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":254},"msg":"skipping bootstrap peer that is same as host","logger":"p2p"}
{"time":"2025-01-22T12:06:41.26341765Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":293},"msg":"no bootstrap nodes found","logger":"p2p"}
{"time":"2025-01-22T12:06:41.265681676Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.42.44.203/tcp/5001] dial backoff"}
{"time":"2025-01-22T12:06:41.265724462Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":254},"msg":"skipping bootstrap peer that is same as host","logger":"p2p"}
{"time":"2025-01-22T12:06:41.265732315Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":293},"msg":"no bootstrap nodes found","logger":"p2p"}
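The two reports fail differently at the TCP level: `i/o timeout` in the first log and `connection refused` here. A plain TCP probe of the router port can help tell these cases apart. This is a minimal sketch with a hypothetical helper name, `probe_tcp`. It only tests TCP reachability, not the libp2p handshake, and it would have to run from a pod on the same network as Spegel to be meaningful.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt a plain TCP connect and report 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered with a reset: host reachable, nothing listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: often a network policy, firewall, or routing problem.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def probe_peers(peers, timeout=5.0):
    """Probe a list of (host, port) pairs and return a dict of results."""
    return {(host, port): probe_tcp(host, port, timeout) for host, port in peers}
```

For example, `probe_peers([("10.42.44.203", 5001)])` would check the peer address from the log above on the router port.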

@jfroy
Contributor Author

jfroy commented Jan 22, 2025

Are you still seeing these problems with v0.0.30? I fixed a lot of issues related to dial self which is why I ask.

Yes, both 0.0.29 and 0.0.30 are effectively non-functional in my cluster. The log above is from 0.0.30.

One thing I see is that you have mixed ip4 and ip6 addresses. Are you running a multi stack cluster?

Yes.

@phillebaba
Member

I think these are two different issues. As I have not had a multistack test, Spegel has never really been verified with it.

@betweenclouds how are you determining that things are not working? Is it just the logs or are you seeing that Spegel is not able to resolve peers? Spegel will report unhealthy when it does not have any peers in its routing table, so is Spegel crashing?

@jfroy this issue will be solved by creating an e2e test with multistack, as it is pretty complex to deal with currently while using libp2p.

@jfroy
Contributor Author

jfroy commented Jan 24, 2025

I think these are two different issues. As I have not had a multistack test, Spegel has never really been verified with it.

@jfroy this issue will be solved by creating an e2e test with multistack, as it is pretty complex to deal with currently while using libp2p.

Let me know if I can help (more data or run experiments). Since it's my home lab cluster, there's no disruption budget 😬

@betweenclouds

@phillebaba Yes, the pods did crash with the newer versions.
But I have now been able to install it the RKE2 way (not via the Helm chart). I still have to run some tests, but it seems to work for me.

@phillebaba
Member

@betweenclouds RKE2 and K3S will never work if you install Spegel directly, due to the way that Containerd is integrated, which is why Spegel has been embedded instead.
