FATAL ERROR: error setting up chroot: error remounting chroot in read-only: device or resource busy #10965
Comments
This appears to be a very old version of gVisor (the logs say …). (Also, from context, it appears you are benchmarking gVisor. Please make sure to read the performance section of the Production Guide as you do this.)
runsc.log.20240930-135720.868828.boot.txt

Thanks for looking Etienne, here are some logs with …
I don't understand the logs here. This chroot error happens when a container starts, at Line 122 in 3971ecb.
From the boot.txt, the container starts with no issue and the application runs.
Earlier runsc was using /tmp as the sandbox chroot. However, as indicated in #10965, some sandbox launches could fail with EBUSY while trying to remount the /tmp chroot as read-only. We suspect that this could be due to the Golang runtime or an imported library having an open FD within /tmp. So we first create a new tmpfs mount over /tmp (like we did before). Then we additionally create a subdirectory within /tmp and create a second tmpfs mount there. This subdirectory is used as the chroot. This should protect against the hypothesized failure scenarios.

Updates #10965
PiperOrigin-RevId: 680652316
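As a rough illustration of what that commit describes, here is a minimal sketch of the nested-tmpfs chroot setup (assumed names and simplified error handling; not the actual runsc code):

```go
package chrootsketch

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// setUpChroot sketches the two-level tmpfs layout: a fresh tmpfs over /tmp
// (as before), plus a second tmpfs on a subdirectory that becomes the chroot.
func setUpChroot() (string, error) {
	// First, mount a fresh tmpfs over /tmp.
	if err := unix.Mount("runsc-root", "/tmp", "tmpfs", 0, ""); err != nil {
		return "", fmt.Errorf("mounting tmpfs over /tmp: %w", err)
	}
	// Then create a subdirectory and mount a second tmpfs there. This inner
	// mount is used as the chroot, so a stray FD opened directly under /tmp
	// by the Go runtime or a library cannot pin it.
	chroot := filepath.Join("/tmp", "runsc-sandbox") // hypothetical name
	if err := os.Mkdir(chroot, 0755); err != nil {
		return "", fmt.Errorf("creating chroot dir: %w", err)
	}
	if err := unix.Mount("runsc-chroot", chroot, "tmpfs", 0, ""); err != nil {
		return "", fmt.Errorf("mounting tmpfs chroot: %w", err)
	}
	return chroot, nil
}
```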
@nt The attached logs do not show the …

Anyway, @nixprime had a hypothesis of what could be going on. runsc creates a new tmpfs mount at /tmp and then creates the sandbox chroot there. This mount is re-mounted as read-only once the sandbox chroot is prepared. In between the time that we create the tmpfs mount at /tmp and the time it is remounted, we hypothesize that either the Golang runtime or some library opens a file descriptor within /tmp, which is not closed at the time of the remount, causing the remount to fail with EBUSY. Could you patch #10975 and give that a try?
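To make that hypothesis concrete, here is a small standalone demo (not gVisor code; it requires root, and the path and mount flags are illustrative) showing that remounting a tmpfs read-only fails with EBUSY while a file on it is still open for writing:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	dir := "/tmp/ebusy-demo" // hypothetical path
	if err := os.MkdirAll(dir, 0755); err != nil {
		panic(err)
	}
	if err := unix.Mount("demo", dir, "tmpfs", 0, ""); err != nil {
		panic(err) // needs root
	}
	defer unix.Unmount(dir, 0)

	// Hold a writable FD inside the mount, standing in for the stray FD
	// that the Go runtime or a library might keep open under /tmp.
	f, err := os.Create(dir + "/held-open")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// The read-only remount now fails with EBUSY ("device or resource busy").
	err = unix.Mount("", dir, "", unix.MS_REMOUNT|unix.MS_RDONLY|unix.MS_BIND, "")
	fmt.Println("remount read-only:", err)
}
```

Closing the writable FD before the remount makes the call succeed, which is consistent with the race described above.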
Hi @ayushr2, thanks for looking. Unfortunately we can't upgrade gVisor as frequently as we'd like because we care about checkpoint stability. I will make sure to include that patch in our next update.
What kernel do you use?
@avagin If you mean the host, this is on 6.1.90.
Updates #10965
PiperOrigin-RevId: 682418510
Thanks @andrew-anthropic. Could you try patching #10994 and see if it helps? We may need to adjust the interval being used there. We suspect that there may be a race with programs like …
Hi Ayush -- I'm happy to confirm that #10994 does indeed fix the issue! I'm not sure offhand what might be iterating through … LMK if it would help to do a deeper dive (I'd probably do something like run …).
I'd prefer to not merge #10994 because it is very hacky. The change uses an arbitrary sleep and number of retries. From what I can tell, this is not a gVisor bug, and is specific to the environment being run in.
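For reference, the retry approach in #10994 is roughly of the following shape (a sketch only, with hypothetical names and arbitrary constants; the actual patch's behavior differs, as noted further down):

```go
package chrootsketch

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// remountReadOnlyWithRetry retries the read-only remount a few times,
// sleeping between attempts, on the assumption that whatever holds the
// stray FD releases it quickly. Attempt count and interval are arbitrary.
func remountReadOnlyWithRetry(chroot string) error {
	const (
		attempts = 3
		interval = 100 * time.Millisecond
	)
	var err error
	for i := 0; i < attempts; i++ {
		err = unix.Mount("", chroot, "", unix.MS_REMOUNT|unix.MS_RDONLY|unix.MS_BIND, "")
		if err != unix.EBUSY {
			return err // nil on success, or a non-retryable error
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("remounting %q read-only: %w", chroot, err)
}
```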
OK, closing!
Just ran across this as well; we've also been seeing this here at Cloudflare. It's affecting only the first few stages in our release targets, so we definitely know there's a subtle configuration difference between those environments and the rest of our production. Given that we applied basically the exact same patch for a month and saw the problem go away, then removed it and it came back within a week, my suspicion is that something BPF/kernel related is scanning mounts as they are created and holding open filesystems or files in those filesystems. The only processes in the system that know what's going on in those namespaces are gVisor and the kernel (if you want to call the kernel a process 😄), so my current theory is that something is sticking its nose where it doesn't belong. And with the explosion of BPF tracing, I've been thinking something has gotten a bit overzealous about monitoring or scanning, especially endpoint security things.
Given that multiple users are hitting this, I am open to submitting something like #10994.
Updates #10965
PiperOrigin-RevId: 703281270
So #10994 actually had a bug: after 3 EBUSY failures, it would just skip re-mounting as read-only. Patching it makes the error go away, but that's not the desired state. @jseba @andrew-anthropic @nt Could you try patching #11259 instead and see if the error goes away?
Description
Starting a sandbox can randomly fail with
FATAL ERROR: error setting up chroot: error remounting chroot in read-only: device or resource busy
Steps to reproduce
Happens on ~0.8% of container start attempts.
runsc version
docker version (if using docker)
No response
uname
No response
kubectl (if using Kubernetes)
No response
repo state (if built from source)
No response
runsc debug logs (if available)
-> logs in comments