
Docker-in-docker: Add retry mechanism into the docker init script (Failed to connect to Docker) #634

Closed
samruddhikhandale opened this issue Aug 2, 2023 · 27 comments · Fixed by #637

@samruddhikhandale
Member

Sometimes docker fails to start within a container with the following error 👇

Failed to connect to Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Since manually running /usr/local/share/docker-init.sh fixes the issue, add a retry mechanism to the docker-init script.
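A minimal sketch of the kind of retry loop being asked for might look like this (the function names, attempt count, and delay below are illustrative assumptions, not the Feature's actual implementation):

```shell
#!/bin/bash
# Hypothetical retry wrapper: re-run docker-init.sh until the daemon answers.
# DOCKER_INIT, the attempt count, and the delay are illustrative assumptions.
DOCKER_INIT=${DOCKER_INIT:-/usr/local/share/docker-init.sh}

docker_is_up() {
  # Probe the daemon; succeeds only when dockerd is reachable
  docker info >/dev/null 2>&1
}

start_docker_with_retry() {
  local max_attempts=${1:-5} delay=${2:-5} attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if docker_is_up; then
      echo "docker is up"
      return 0
    fi
    echo "(*) Failed to start docker, retrying in ${delay}s... ($attempt/$max_attempts)"
    "$DOCKER_INIT" >/dev/null 2>&1 || true
    sleep "$delay"
  done
  echo "docker did not start after $max_attempts attempts" >&2
  return 1
}
```

The real docker-init.sh may behave differently; the point is simply to re-run the init script and poll the daemon instead of failing on the first attempt.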

@samruddhikhandale samruddhikhandale self-assigned this Aug 2, 2023
@AndriiTsok

For the last two weeks we have been experiencing the same issue. Running the script recovers the docker instance.

@kklopfenstein
Contributor

I wanted to add that I've also been experiencing this issue much more frequently than normal.

@samruddhikhandale
Member Author

Thanks for reporting! Opened #637, which adds retry logic to address the issue ^

@AndriiTsok

@samruddhikhandale We have just tried version 2.3.0 of Docker-in-Docker, but we are still observing the issue with the Docker daemon.

Running the onCreateCommand from devcontainer.json...

zsh .devcontainer/on-create.sh
on-create start
Removing existing clusters...
FATA[0000] runtime failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
Removing existing registries...
FATA[0000] failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 

@samruddhikhandale
Member Author

(*) Failed to start docker, retrying in 5s...

@AndriiTsok Hmm, in the creation logs do you see this log message ^? I would like to validate if the retry is actually happening. I wonder if it's conflicting (/running in parallel) with the onCreateCommand.

@AndriiTsok

@samruddhikhandale Thank you for your fast reply!

We created more than 30 codespaces during the last few hours trying to reproduce the issue. So far we have not been able to produce (*) Failed to start docker, retrying in 5s... in the creation logs. Earlier this morning the docker step was failing in almost every codespace.

Today, we made sure that the prebuilt image was not used and that the codespaces were created for new branches. Also, we explicitly set 2.3.0 for the D-in-D feature.

We will keep an eye on the logs and the stability and will create another issue in case it is reproducible.

@AndriiTsok

@samruddhikhandale we have just managed to catch the error again. Here is the creation log: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f

Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284

Line with Docker daemon error: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L307

We are simply trying to check for existing clusters and remove them when rebuilding the codespaces:

# Get the list of all existing clusters
clusters=$(k3d cluster list -o json | jq -r '.[].name')

# Iterate over each cluster and delete it
for cluster in $clusters; do
  echo "Deleting cluster $cluster"
  k3d cluster delete "$cluster"
done

# Create a new cluster with the given configuration
k3d cluster create --config .devcontainer/k3d/config.yaml --kubeconfig-update-default

It fails while trying to list the clusters.

P.S.
I have observed that the Docker daemon is now failing less frequently than it was just three days ago.

@AndriiTsok

@samruddhikhandale I can confirm the issue is still randomly occurring. I would say it fails roughly 50% of the time; usually a rebuild helps restore the codespace.

One of the observed log values:

[161525 ms] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount type=bind,src=/var/lib/docker/codespacemount/workspace,dst=/workspaces --mount type=volume,src=dind-var-lib-docker-0hcbhh2c7vldoj773drm1bjldnb89u0rt96sl9nju22d9ou8d14n,dst=/var/lib/docker --mount type=volume,src=minikube-config,dst=/home/vscode/.minikube --mount source=/root/.codespaces/shared,target=/workspaces/.codespaces/shared,type=bind --mount source=/var/lib/docker/codespacemount/.persistedshare,target=/workspaces/.codespaces/.persistedshare,type=bind --mount source=/.codespaces/agent/mount,target=/.codespaces/bin,type=bind --mount source=/mnt/containerTmp,target=/tmp,type=bind --mount type=bind,src=/.codespaces/agent/mount/cache,dst=/vscode -l Type=codespaces -e CODESPACES=******** -e ContainerVersion=13 -e RepositoryName=Monorepo --label ContainerVersion=13 --hostname codespaces-86fa16 --add-host codespaces-86fa16:127.0.0.1 --cap-add sys_nice --network host --privileged --entrypoint /bin/sh vsc-monorepo-9081da56175f3b1ac597257c0566d7ce76b18fbc1a048e05bdbd04f7efb0dfca-features -c echo Container started
Container started
sed: couldn't flush stdout: Device or resource busy
Outcome: success User: node WorkspaceFolder: /workspaces/Monorepo
devcontainer process exited with exit code 0
Running blocking commands...

sed: couldn't flush stdout: Device or resource busy

@samruddhikhandale
Member Author

@AndriiTsok Thanks for the update.

Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284

Looks like the retry mechanism is triggered; we retry five times until the docker daemon starts. From the logs, we can see only one such log statement, so I can think of two things that might be happening:

  1. There's a race condition between the retry logic and the start of the onCreateCommand (hence the retries are not triggered and docker is not guaranteed to start)
  2. The docker daemon started fine (hence no more retry logs), but some other process killed the daemon

@AndriiTsok Would it be possible to provide a sample repro (i.e. a sample dev container config)? I'd like to experiment with a few things. If not, no worries; I could play around by adding docker commands within onCreateCommand.

@AndriiTsok In the meantime, would it be possible to add similar retry logic (which starts the docker daemon) to your onCreateCommand script? Let me know if this works!

@tom-growthbox

We have experienced this issue consistently in the last 4 days. It happens during codespaces prebuild. I see the line (*) Failed to start docker, retrying in 5s... once in each of the failed jobs. Successful jobs do not have this line in the log. It didn't start with version 2.3.0, but somehow downgrading to 2.2.1 fixes it. I have not seen the error with 2.2.1.

@samruddhikhandale
Member Author

I wonder if adding the retry logic is somehow breaking the codespace prebuilds. Thanks @tom-growthbox for reporting the issue and providing a temporary workaround.

@tom-growthbox would it be possible to provide a sample repro? (ie dev container config) It would help me investigate and fix the underlying issue, thanks!

@tom-growthbox

I created a sample repo with a configuration similar to the one we use. However, the prebuild doesn't fail on this one; I would need to spend some time on this to reproduce the issue.

https://github.com/tom-growthbox/prebuild-error

@AndriiTsok

AndriiTsok commented Aug 15, 2023

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue
It is difficult to catch the problem. It can succeed 10 times straight and then just start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@mandrasch

mandrasch commented Aug 16, 2023

Same error here; it took 15 new codespace instances to finally trigger and catch it. Version used: 2.3.0.

Published creation-log.txt and output of cat dockerd.log here:
https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727

time="2023-08-16T19:07:55.475998600Z" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
failed to start daemon: error while opening volume store metadata database: timeout
x\"..." type=io.containerd.runtime.v1

Discussion and context: https://github.com/orgs/community/discussions/63776
Thanks very much for the work on this!

@samruddhikhandale
Member Author

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then just start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔

@samruddhikhandale
Member Author

samruddhikhandale commented Aug 16, 2023

Hi @mandrasch 👋

In your dev container, docker is added by the universal image. See https://github.com/devcontainers/images/blob/main/src/universal/.devcontainer/devcontainer.json#L64-L66.

Hence, adding the docker-in-docker Feature has no impact as the docker-init script is old. See https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727#file-creation-log-txt-L168. (I think this behavior should be updated.)

The prod image was built with Feature version 2.2.1 (released on Aug 3rd) which does not contain the retry logic. I'll work on releasing a new universal image which would be built with v2.3.1.

Let me know if that makes sense.

@samruddhikhandale
Member Author

I'll work on releasing a new universal image which would be built with v2.3.1.

Opened devcontainers/images#705. In the meantime, @mandrasch, can you use the dev image (mcr.microsoft.com/devcontainers/universal:dev)? Let me know if that helps!

Also, can you remove the docker-in-docker Feature from your dev container as that one is unnecessary (given that the universal image already installs it)?

@samruddhikhandale
Member Author

I'll work on releasing a new universal image which would be built with v2.3.1.

The image is live now!

@mandrasch

mandrasch commented Aug 17, 2023

Hi @samruddhikhandale! Thanks so much for explaining this! 🙏 🙏

I have now removed the docker-in-docker feature from my devcontainer.json; it works for me.

  "image": "mcr.microsoft.com/devcontainers/universal:2",
  "features": {
    "ghcr.io/ddev/ddev/install-ddev:latest": {}
  },

Since the bug occurred only 1 time out of 15, I can't really say whether this fixes the problem. I'll post here again if it happens again, but hopefully that won't be the case. 👍

Question regarding this:

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked devcontainers/images#705 but could not find a commit related to a version number?) Thanks!

@samruddhikhandale
Member Author

Since the bug occurred only 1 time out of 15, I can't really say whether this fixes the problem. I'll post here again if it happens again, but hopefully that won't be the case. 👍

@mandrasch One more thing: the universal image is cached in a codespace, hence even now you will get an older image pulled (unless you pin it to 2.5.0). I am working on updating the cache for Codespaces, but that will take a day or two.

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked devcontainers/images#705 but could not find a commit related to a version number?) Thanks!

Unfortunately, I don't think there's a direct way to find out the Feature version.

@AndriiTsok

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then just start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔

Hi @samruddhikhandale, I have just re-pushed the container files to https://github.com/TRYON-Technology/Docker-in-Docker-Issue

I also added an error.log showing the issue:
https://github.com/TRYON-Technology/Docker-in-Docker-Issue/blob/3e6849493c9cfd073c2e45563647264385d9a07a/error.log#L2216

@mandrasch

@samruddhikhandale Thanks so much for the detailed technical background information, very helpful! 🙏

@samruddhikhandale
Member Author

Re-opening, as the retry logic has been reverted. See #659

Opened #660 for tracking docker failures due to "sed: couldn't flush stdout: Device or resource busy" errors

@bherbruck

bherbruck commented Aug 19, 2023

If anyone needs a quick fix - I've gotten into the habit of doing this recently, and it has been working well.

.devcontainer/start-docker.sh:

#!/bin/bash

echo "Waiting for Docker to start"

# Wait for the Docker socket to appear, re-running the init script until it does
while [ ! -S /var/run/docker.sock ]; do
  echo -n "."
  /usr/local/share/docker-init.sh
  sleep 1
done

echo "Docker is running!"

devcontainer.json:

{
  "postCreateCommand": "bash .devcontainer/start-docker.sh"
}

File layout:

.devcontainer/
├── devcontainer.json
└── start-docker.sh

@darmalovan

@bherbruck I ran the docker-init.sh script and the output I got was "Setting dockerd Azure DNS", which didn't resolve the docker issue. Have you stumbled into something like that?

@bherbruck

@darmalovan I get that same output if I run /usr/local/share/docker-init.sh too many times

@samruddhikhandale
Member Author

docker-in-docker v2.4.0 includes the following changes; see #669

  • Adds retries for docker daemon startup
  • We have seen errors like sed: couldn't flush stdout: Device or resource busy which cause docker to fail to start; this adds retries to handle such sed errors as well
  • Adds a workflow which runs 100 jobs to validate docker startup --> 50 jobs validate dockerd after the container is started, and 50 jobs validate it within the onCreateCommand. The stress test is 🟢

Important Note: /usr/local/share/docker-init.sh, which starts/retries dockerd, is added to the entrypoint command. This command runs in the background and is not a blocking script for container startup. Since it runs in the background, onCreateCommand/postCreateCommand/postStartCommand could all start executing before docker is fully running. If docker takes too long to start, that could introduce flakiness in those lifecycle scripts.
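One way to mitigate this in the meantime is to have the lifecycle script itself wait, with a bound, for the daemon before using docker. A hypothetical sketch (the function name and the wait budget are assumptions, not part of the Feature):

```shell
#!/bin/bash
# Hypothetical guard for an onCreateCommand script: poll the daemon for up to
# a fixed budget before giving up, since the entrypoint starts dockerd in the
# background and gives no completion signal.
wait_for_docker() {
  local budget=${1:-60}   # seconds to wait; arbitrary default
  local waited=0
  until docker info >/dev/null 2>&1; do
    if [ "$waited" -ge "$budget" ]; then
      echo "dockerd not ready after ${budget}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "dockerd ready after ${waited}s"
}
```

A script would call wait_for_docker before its first docker/k3d command and abort (or fall back) if it returns non-zero.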

Opened devcontainers/spec#299, which requests new semantics for "blocking" entrypoints that the CLI waits for. This way we can ensure that docker is already up and running for the mentioned ^ lifecycle scripts and is available in the container.

Closing in favor of #671. Feel free to reopen if needed, or comment on #671 if you still run into docker-not-running issues. Thank you!
