"could not get peer id" and timeouts since 0.0.29 #709

jfroy · 2025-01-22T07:19:59Z

Spegel version

v0.0.30

Kubernetes distribution

Talos 1.9.1

Kubernetes version

v1.31.4

CNI

Cilium

Describe the bug

Since v0.0.29 with the new peer discovery, I am seeing errors in spegel logs and 60s delays whenever a container needs to fetch an image (basically a timeout, after which the image is quickly fetched from upstream). I am assuming this means spegel is basically in a bad state.

I haven't changed my CNI, Kubernetes, or Talos versions in-between those spegel versions, and I haven't changed my spegel settings.

My cluster is dual-stack (v4 and v6).

I use the helm chart for installation using the following values:

serviceMonitor:
  enabled: true
grafanaDashboard:
  enabled: true
spegel:
  containerdRegistryConfigPath: /etc/cri/conf.d/hosts

Logs:

{"time":"2025-01-22T06:39:54.884189668Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/oci.(*Containerd).Verify","file":"/build/pkg/oci/containerd.go","line":118},"msg":"unable to verify status response","runtime_version":"2.0.1"}
{"time":"2025-01-22T06:39:54.901603359Z","level":"INFO","source":{"function":"main.registryCommand","file":"/build/main.go","line":212},"msg":"running Spegel","registry":":5000","router":":5001"}
{"time":"2025-01-22T06:39:54.90170104Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.(*P2PRouter).Run","file":"/build/pkg/routing/p2p.go","line":111},"msg":"starting p2p router","logger":"p2p","id":"/ip6/2001:5a8:42a6:e2fb::f95/tcp/5001/p2p/12D3KooWLPhmVFbJie7ua7juDgqrcH123CTRSeFV2iMNJhM27dYt"}
{"time":"2025-01-22T06:39:54.902058451Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T06:39:59.90690095Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.1.204/tcp/5001] dial tcp4 0.0.0.0:5001->10.11.1.204:5001: i/o timeout"}
{"time":"2025-01-22T06:39:59.912454683Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920541642Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920706382Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920869343Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff\n  * [/ip4/10.11.2.154/tcp/5001] dial backoff"}
{"time":"2025-01-22T06:39:59.920891363Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":293},"msg":"no bootstrap nodes found","logger":"p2p"}
{"time":"2025-01-22T06:48:54.903045168Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T06:57:54.902146115Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T07:06:54.902113162Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
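
Each "could not get peer id" entry above packs one item per attempted address into its `err` field. The following is a minimal sketch, assuming only the log format shown in this report, of splitting those items out to see which addresses failed and why. The type and function names are illustrative and are not part of Spegel.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the fields of interest from one JSON log line.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Err   string `json:"err"`
}

// dialFailure is one "* [addr] reason" item from a dial error.
type dialFailure struct {
	Addr   string
	Reason string
}

// parseDialFailures extracts the per-address failures from one JSON log line.
func parseDialFailures(line string) ([]dialFailure, error) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return nil, err
	}
	var out []dialFailure
	for _, part := range strings.Split(e.Err, "\n") {
		part = strings.TrimSpace(part)
		if !strings.HasPrefix(part, "* [") {
			continue // skip the summary line before the per-address items
		}
		addr, reason, ok := strings.Cut(strings.TrimPrefix(part, "* ["), "] ")
		if !ok {
			continue
		}
		out = append(out, dialFailure{Addr: addr, Reason: reason})
	}
	return out, nil
}

func main() {
	// Shortened copy of one of the log lines above.
	line := `{"level":"ERROR","msg":"could not get peer id","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.11.3.186/tcp/5001] dial to self attempted\n  * [/ip4/10.11.1.204/tcp/5001] dial backoff"}`
	failures, err := parseDialFailures(line)
	if err != nil {
		panic(err)
	}
	for _, f := range failures {
		fmt.Printf("%s: %s\n", f.Addr, f.Reason)
	}
}
```

Grouping the results by reason separates "dial to self attempted" entries from real connectivity failures such as "i/o timeout".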


phillebaba · 2025-01-22T08:57:39Z

Are you still seeing these problems with v0.0.30? I fixed a lot of issues related to dial self which is why I ask.

One thing I see is that you have mixed ip4 and ip6 addresses. Are you running a multi stack cluster?

The 60 second delay when Spegel is down should be fixed once my PR in Containerd gets merged.
containerd/containerd#11106

betweenclouds · 2025-01-22T12:14:08Z

I can confirm the error on an RKE2 v1.31.3+rke2r1 cluster with Spegel installed manually (IPv4 only, CNI: Calico). Spegel v0.0.28 works; v0.0.29 and v0.0.30 do not.

Defaulted container "registry" out of: registry, configuration (init)
{"time":"2025-01-22T12:06:41.259669337Z","level":"INFO","source":{"function":"main.registryCommand","file":"/build/main.go","line":212},"msg":"running Spegel","registry":":5000","router":":5001"}
{"time":"2025-01-22T12:06:41.259949179Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.(*P2PRouter).Run","file":"/build/pkg/routing/p2p.go","line":111},"msg":"starting p2p router","logger":"p2p","id":"/ip4/10.42.44.78/tcp/5001/p2p/dsfgkdhsgfjsdhgfjhsdlfknsdlfnsdfsdf"}
{"time":"2025-01-22T12:06:41.260201211Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/state.Track","file":"/build/pkg/state/state.go","line":34},"msg":"running scheduled image state update"}
{"time":"2025-01-22T12:06:41.262976258Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.42.44.203/tcp/5001] dial tcp4 10.42.44.203:5001: connect: connection refused"}
{"time":"2025-01-22T12:06:41.263394684Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":254},"msg":"skipping bootstrap peer that is same as host","logger":"p2p"}
{"time":"2025-01-22T12:06:41.26341765Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":293},"msg":"no bootstrap nodes found","logger":"p2p"}
{"time":"2025-01-22T12:06:41.265681676Z","level":"ERROR","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":286},"msg":"could not get peer id","logger":"p2p","err":"failed to dial: failed to dial 92B: all dials failed\n  * [/ip4/10.42.44.203/tcp/5001] dial backoff"}
{"time":"2025-01-22T12:06:41.265724462Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":254},"msg":"skipping bootstrap peer that is same as host","logger":"p2p"}
{"time":"2025-01-22T12:06:41.265732315Z","level":"INFO","source":{"function":"github.com/spegel-org/spegel/pkg/routing.bootstrapFunc.func1","file":"/build/pkg/routing/p2p.go","line":293},"msg":"no bootstrap nodes found","logger":"p2p"}

jfroy · 2025-01-22T16:11:38Z

> Are you still seeing these problems with v0.0.30? I fixed a lot of issues related to dial self which is why I ask.

Yes, both 0.0.29 and 0.0.30 are effectively non-functional in my cluster. The log above is from 0.0.30.

> One thing I see is that you have mixed ip4 and ip6 addresses. Are you running a multi stack cluster?

Yes.

phillebaba · 2025-01-24T13:36:07Z

I think these are two different issues. As I have not had a multistack test, Spegel has never really been verified with it.

@betweenclouds how are you determining that things are not working? Is it just the logs or are you seeing that Spegel is not able to resolve peers? Spegel will report unhealthy when it does not have any peers in its routing table, so is Spegel crashing?

@jfroy this issue will be solved by creating e2e tests with multistack, as it is pretty complex to deal with currently while using libp2p.

jfroy · 2025-01-24T15:54:36Z

> I think these are two different issues. As I have not had a multistack test, Spegel has never really been verified with it.

> @jfroy this issue will be solved by creating e2e tests with multistack, as it is pretty complex to deal with currently while using libp2p.

Let me know if I can help (more data or run experiments). Since it's my home lab cluster, there's no disruption budget 😬

betweenclouds · 2025-01-27T07:02:47Z

@phillebaba Yes, the pods did crash with the newer versions.
But I was able to install it the RKE2 way (not the Helm chart). I still have to run some tests, but it seems to work for me.

phillebaba · 2025-01-27T10:46:45Z

@betweenclouds RKE2 and K3S will never work if you install Spegel directly due to the way that Containerd is integrated, which is why Spegel has been embedded instead.

phillebaba · 2025-01-27T15:09:42Z

@jfroy I have confirmed that dual stack networking will not function properly with the current version of Spegel. It worked before mostly by mistake, rather than due to any effort by me. I do not have that much experience with dual stack clusters so I am trying to understand the purpose and best practices.

I think that in theory both IPv4 and IPv6 could be run at the same time with some modifications. The reality however is that the Containerd configuration will only support either IPv4 or IPv6 addresses, so I see little value in supporting both at the same time. The same goes for Kubernetes services, which need to be explicitly opted in to run dual stack and otherwise default to IPv4. Either we default to one or the other, or we make it configurable. A similar problem is described in #619.

jfroy · 2025-01-27T15:34:31Z

> @jfroy I have confirmed that dual stack networking will not function properly with the current version of Spegel. It worked before mostly by mistake, rather than due to any effort by me. I do not have that much experience with dual stack clusters so I am trying to understand the purpose and best practices.

> I think that in theory both IPv4 and IPv6 could be run at the same time with some modifications. The reality however is that the Containerd configuration will only support either IPv4 or IPv6 addresses, so I see little value in supporting both at the same time. The same goes for Kubernetes services, which need to be explicitly opted in to run dual stack and otherwise default to IPv4. Either we default to one or the other, or we make it configurable. A similar problem is described in #619.

Seems reasonable to me to default to V4 only and have a setting to prefer V6-only. Maybe also document that both can't be used at the same time.

I can imagine more complex clusters where some nodes are V4-only and some are V6-only, but that feels broken to me and I wouldn't spend time supporting that.

dimm0 · 2025-02-14T19:07:55Z

> Seems reasonable to me to default to V4 only and have a setting to prefer V6-only. Maybe also document that both can't be used at the same time.

This seems like a good solution to me

jfroy added the bug label ("Something isn't working") on Jan 22, 2025
