ci: add test for rootful docker #366

Open
wants to merge 1 commit into
base: master
from

Conversation

vsoch
Contributor

@vsoch vsoch commented Feb 21, 2025

I am finding in testing that networking between hosts does not work when running rootful. I was testing this because NVIDIA devices do work with rootful, but once I got to the step where pods need to communicate, there was no communication.

I am not sure about the error, but this test should reproduce it in CI. Note that to enable this we use the docker-rootful template provided by lima (@AkihiroSuda you have thought of all things)! The main changes here are to add this test to the matrix, and ensure that in the different install scripts, we largely do nothing if the container runtime is docker-rootful.

Related to #365 but does not fix it, only demonstrates it.

@vsoch
Contributor Author

vsoch commented Feb 21, 2025

Note that I've seen two variants of this error - either an operation timeout (the result here):

image

Or that the address is not reachable / bad (what I've seen in production and my researchapps testing CI):

image

@@ -20,6 +20,8 @@ jobs:
include:
  - lima_template: template://ubuntu-24.04
    container_engine: docker
  - lima_template: template://docker-rootful
    container_engine: docker-rootful
Member

I'd prefer this form

- lima_template: template://ubuntu-24.04
  container_engine: docker
  rootful: 1

if [[ "$CONTAINER_ENGINE" == "docker-rootful" ]]
then
    CONTAINER_ENGINE="docker"
fi
Member

I prefer the variables to be immutable through the entire lifecycle of the test.
So, another variable like ROOTFUL=1 should be introduced.
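
A minimal sketch of that suggestion, assuming ROOTFUL is supplied by the CI matrix alongside CONTAINER_ENGINE. The helper name describe_engine is hypothetical and only illustrates reading both variables without reassigning either:

```shell
# Sketch only: ROOTFUL is the variable proposed in the comment above;
# describe_engine is a hypothetical helper, not part of the repository.

# Defaults for illustration; in CI these would come from the job matrix.
: "${CONTAINER_ENGINE:=docker}"
: "${ROOTFUL:=0}"
readonly CONTAINER_ENGINE ROOTFUL   # guard against later reassignment

# Print the engine and mode; takes the engine name and the rootful flag.
describe_engine() {
    if [ "$2" = "1" ]; then
        echo "$1 (rootful)"
    else
        echo "$1 (rootless)"
    fi
}

describe_engine "$CONTAINER_ENGINE" "$ROOTFUL"
```

Scripts that must behave differently under rootful Docker can then branch on ROOTFUL while CONTAINER_ENGINE keeps the same value for the whole test run.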

@AkihiroSuda
Member

Thanks, I confirmed that this issue happens on my local machines too, but I haven't identified the cause.

Tested with Docker v28 and v27.5.1, on Ubuntu 24.04.1 (ARM64).

I think it was working in the past?

@AkihiroSuda
Member

AkihiroSuda commented Feb 24, 2025

ICMP and DNS still seem to work, but TCP across the nodes seems broken?

VXLAN packets are apparently sent and received on each of the VMs, though. (Run tcpdump udp).

Apparently, the receiver VM refuses to route the VXLAN packets to the usernetes-node-1 container, where kubelet, flannel, etc. are running.

@AkihiroSuda
Member

Found a workaround: execute ethtool --offload eth0 tx-checksum-ip-generic off in the usernetes-node-1 container.
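
As a rough sketch, the workaround could be applied from the host through docker exec. This assumes the container is named usernetes-node-1 as in the comment, that ethtool is installed inside it, and that the container has enough privileges to change offload settings. The helper name disable_tx_checksum_cmd is hypothetical, and it only prints the command so it can be reviewed before running:

```shell
# Sketch only: disable_tx_checksum_cmd is a hypothetical helper.
# It prints the command instead of running it, so it can be reviewed first.
disable_tx_checksum_cmd() {
    container="${1:-usernetes-node-1}"   # container name from the comment above
    iface="${2:-eth0}"                   # interface named in the workaround
    echo "docker exec ${container} ethtool --offload ${iface} tx-checksum-ip-generic off"
}

disable_tx_checksum_cmd
# To actually apply it (assumes docker and the container are available):
#   eval "$(disable_tx_checksum_cmd)"
```

The command would need to be run on every node, since the setting is per container interface.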

@thaJeztah

Are any eyes needed here from the Moby networking folks? (I know they're pretty busy currently, but if it's useful I can try to ask them whether they have time to spare to take a look.)

@vsoch vsoch force-pushed the test-rootful branch 5 times, most recently from 0010ee9 to 0d56a3c Compare February 24, 2025 16:26
This is important to run on multi-node

Signed-off-by: vsoch <[email protected]>
@vsoch
Contributor Author

vsoch commented Feb 24, 2025

@AkihiroSuda do you remember when you last tested this and it worked? In recent memory we had updates to flannel and to the underlying kind node (Kubernetes version), and (for me) at some point last year the additional make sync-external-ip step was added. If we can reproduce a previously working version, it would give us a baseline to compare against while debugging.

@vsoch
Contributor Author

vsoch commented Feb 24, 2025

oh wow, this is really interesting!

Not sure if this is expected, but this looks to be a warning in the failed nerdctl setup:

Warning: [WARNING] buildkitd has access to images in "buildkit" namespace by default. If you want to give buildkitd access to the images in "default" namespace, run this command with CONTAINERD_NAMESPACE=default

@AkihiroSuda
Member

AkihiroSuda commented Feb 25, 2025

The ethtool --offload eth0 tx-checksum-ip-generic off rule can probably be appended here:

# Correct UDP checksums for VXLAN behind NAT
# https://github.com/flannel-io/flannel/issues/1279
# https://github.com/kubernetes/kops/pull/9074
# https://github.com/karmab/kcli/commit/b1a8eff658d17cf4e28162f0fa2c8b2b10e5ad00
SUBSYSTEM=="net", ACTION=="add|change|move", ENV{INTERFACE}=="flannel.1", RUN+="/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off"

It is still unclear why this is needed only for rootful, though.
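
One way such a rule might look is sketched below. The helper udev_offload_rule is hypothetical; it formats a rule in the same shape as the existing flannel.1 rule for a given interface name, using ethtool -K (the short form of --offload). Whether eth0 should be matched by the same udev events is an assumption that has not been verified here:

```shell
# Sketch only: udev_offload_rule is a hypothetical helper that formats a rule
# in the same shape as the existing flannel.1 rule for a given interface.
udev_offload_rule() {
    iface="$1"
    printf 'SUBSYSTEM=="net", ACTION=="add|change|move", ENV{INTERFACE}=="%s", RUN+="/usr/sbin/ethtool -K %s tx-checksum-ip-generic off"\n' "$iface" "$iface"
}

# Print a candidate rule for eth0; review it before adding it to the rules file.
udev_offload_rule eth0
```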

Are any eyes needed here from the Moby networking folks? (I know they're pretty busy currently, but if it's useful I can try to ask them whether they have time to spare to take a look.)

Thanks, that would be appreciated.

@AkihiroSuda
Member

Warning: [WARNING] buildkitd has access to images in "buildkit" namespace by default. If you want to give buildkitd access to the images in "default" namespace, run this command with CONTAINERD_NAMESPACE=default

Irrelevant to the topic.
Should be fixed though.

3 participants