Summary
I've observed a couple of times that when a k8s unit reboots, kube-proxy fails to start. This breaks many functions of the node, with somewhat perplexing symptoms.
The core of the problem appears to be:
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: + exec /snap/k8s/313/bin/kube-proxy --cluster-cidr=10.1.0.0/16 --healthz-bind-address=127.0.0.1 --hostname-override=juju-bd78f7-stg-netbox-30 --kubeconfig=/etc/kubernetes/proxy.conf --profiling=false
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: I0824 00:33:53.621814 611 server_linux.go:69] "Using iptables proxy"
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: I0824 00:33:53.651253 611 server.go:1062] "Successfully retrieved node IP(s)" IPs=["10.142.102.91"]
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: I0824 00:33:53.652936 611 conntrack.go:119] "Set sysctl" entry="net/netfilter/nf_conntrack_max" value=131072
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: E0824 00:33:53.653060 611 server.go:558] "Error running ProxyServer" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 k8s.kube-proxy[611]: E0824 00:33:53.653169 611 run.go:74] "command failed" err="open /proc/sys/net/netfilter/nf_conntrack_max: no such file or directory"
Aug 24 00:33:53 juju-bd78f7-stg-netbox-30 systemd[1]: snap.k8s.kube-proxy.service: Main process exited, code=exited, status=1/FAILURE
That is, kube-proxy tries to configure conntrack before the kernel module has loaded.
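That sysctl only exists once the nf_conntrack module has been registered, which is easy to confirm by hand. A rough sketch (these are the standard kernel module and /proc paths, nothing specific to the snap):

    # Before anything has loaded conntrack, the sysctl file is absent:
    ls /proc/sys/net/netfilter/nf_conntrack_max    # "No such file or directory"
    # Loading the module creates it:
    sudo modprobe nf_conntrack
    cat /proc/sys/net/netfilter/nf_conntrack_max   # now prints the current limit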
Here's a gist with the full output from boot (line 319 is where I started it manually): https://gist.github.com/vmpjdc/06913c8125814eb98f8ebda3fd356ab2
What Should Happen Instead?
The kube-proxy service should start reliably on boot.
Reproduction Steps
1. Deploy k8s using Juju:
   juju deploy -n3 --channel 1.30/beta --constraints 'mem=8G root-disk=50G cores=2' k8s
2. Optionally, deploy some services into the cluster with Juju.
3. Reboot a k8s unit.
4. Observe that kube-proxy did not start (Current=inactive; see the note after this list for how I checked). (I'm not sure whether this happens every single time.)
5. Run kubectl get pods -A and observe that some pods (probably a cilium pod, maybe others) are in non-Running states, e.g. Unknown, Terminating, etc.
6. Start kube-proxy:
   juju exec -u k8s/0 -- snap start k8s.kube-proxy
7. Observe that the cluster recovers. (If it does not, delete the affected pods and they should respawn quickly.)
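(For step 4: the Current=inactive reading is, I believe, the Current column of snap's service listing; something along these lines shows it, reusing the same juju exec pattern as above:)

    juju exec -u k8s/0 -- snap services k8s.kube-proxy
    # the Current column reads "inactive" when the service never came up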
System information
Script does not exist. Here's what the charm installed:
installed: v1.30.0 (313) 109MB classic,held
Can you suggest a fix?
If it's possible to customize the systemd units that snapd creates, allowing more retries would probably work around the problem. In the meantime I can work around it locally by installing a suitable override myself; and, for that matter, so could the charm.
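To make that concrete, here is roughly the override I have in mind. The unit name comes from the systemd log above (snap.k8s.kube-proxy.service); the drop-in values are just guesses at something reasonable, and snapd's generated unit may already set Restart=on-failure. The part that matters is relaxing the start rate limit so the service keeps retrying until the module is present:

    sudo mkdir -p /etc/systemd/system/snap.k8s.kube-proxy.service.d
    sudo tee /etc/systemd/system/snap.k8s.kube-proxy.service.d/override.conf <<'EOF'
    [Unit]
    # don't give up after the default burst of failed start attempts
    StartLimitIntervalSec=0

    [Service]
    Restart=on-failure
    RestartSec=5
    EOF
    sudo systemctl daemon-reload

Another option that avoids touching the unit at all would be to load the module at boot, e.g. echo nf_conntrack | sudo tee /etc/modules-load.d/nf_conntrack.conf, since systemd-modules-load runs well before the snap services. I haven't tested either approach against the charm.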
Are you interested in contributing with a fix?
No response