
Docker-in-docker: Add retry mechanism into the docker init script (Failed to connect to Docker) #634

Closed
samruddhikhandale opened this issue Aug 2, 2023 · 27 comments · Fixed by #637

@samruddhikhandale
Member

Sometimes docker fails to start within a container with the following error 👇

Failed to connect to Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Since manually running /usr/local/share/docker-init.sh fixes the issue, add a retry mechanism to the docker-init script.
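A minimal sketch of the kind of retry loop being asked for might look like this (the function names, attempt count, and delay below are illustrative assumptions, not the Feature's actual implementation):

```shell
#!/bin/bash
# Hypothetical retry wrapper: re-run docker-init.sh until the daemon answers.
# DOCKER_INIT, the attempt count, and the delay are illustrative assumptions.
DOCKER_INIT=${DOCKER_INIT:-/usr/local/share/docker-init.sh}

docker_is_up() {
  # Probe the daemon; succeeds only when dockerd is reachable
  docker info >/dev/null 2>&1
}

start_docker_with_retry() {
  local max_attempts=${1:-5} delay=${2:-5} attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if docker_is_up; then
      echo "docker is up"
      return 0
    fi
    echo "(*) Failed to start docker, retrying in ${delay}s... ($attempt/$max_attempts)"
    "$DOCKER_INIT" >/dev/null 2>&1 || true
    sleep "$delay"
  done
  echo "docker did not start after $max_attempts attempts" >&2
  return 1
}
```

The real docker-init.sh may behave differently; the point is simply to re-run the init script and poll the daemon instead of failing on the first attempt.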

@samruddhikhandale samruddhikhandale self-assigned this Aug 2, 2023
@AndriiTsok

For the last two weeks we have been experiencing the same issue. Running the script recovers the docker instance.

@kklopfenstein
Contributor

I wanted to add that I've also been experiencing this issue much more frequently than normal.

@samruddhikhandale
Member Author

Thanks for reporting! Opened #637, which adds retry logic to address the issue ^

@AndriiTsok

@samruddhikhandale We have just tried version 2.3.0 of Docker-in-Docker, but we are still observing the issue with the Docker daemon.

Running the onCreateCommand from devcontainer.json...

zsh .devcontainer/on-create.sh
on-create start
Removing existing clusters...
FATA[0000] runtime failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
Removing existing registries...
FATA[0000] failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 

@samruddhikhandale
Member Author

(*) Failed to start docker, retrying in 5s...

@AndriiTsok Hmm, in the creation logs do you see this log message ^? I would like to validate if the retry is actually happening. I wonder if it's conflicting (/running in parallel) with the onCreateCommand.

@AndriiTsok

@samruddhikhandale Thank you for your fast reply!

We created more than 30 codespaces during the last few hours trying to reproduce the issue. So far we have not been able to produce (*) Failed to start docker, retrying in 5s... in the creation logs. Earlier this morning the docker step was failing in almost every codespace.

Today, we made sure that the prebuilt image was not used and that the codespaces were created for new branches. Also, we explicitly set 2.3.0 for the D-in-D feature.

We will keep an eye on the logs and the stability and will create another issue in case it is reproducible.

@AndriiTsok

@samruddhikhandale we have just managed to catch the error again. Here is the creation log: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f

Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284

Line with Docker daemon error: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L307

We are simply trying to check for existing clusters and remove them when rebuilding the codespaces:

# Get the list of all existing clusters
clusters=$(k3d cluster list -o json | jq -r '.[].name')

# Iterate over each cluster and delete it
for cluster in $clusters; do
  echo "Deleting cluster $cluster"
  k3d cluster delete "$cluster"
done

# Create a new cluster with the given configuration
k3d cluster create --config .devcontainer/k3d/config.yaml --kubeconfig-update-default

It fails while trying to list the clusters.

P.S.
I have observed that the Docker daemon is now failing less frequently than it was just three days ago.

@AndriiTsok

@samruddhikhandale I can confirm the issue is still randomly occurring. I would say it fails roughly 50% of the time; usually a rebuild helps restore the codespace.

One of the observed log values:

[161525 ms] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount type=bind,src=/var/lib/docker/codespacemount/workspace,dst=/workspaces --mount type=volume,src=dind-var-lib-docker-0hcbhh2c7vldoj773drm1bjldnb89u0rt96sl9nju22d9ou8d14n,dst=/var/lib/docker --mount type=volume,src=minikube-config,dst=/home/vscode/.minikube --mount source=/root/.codespaces/shared,target=/workspaces/.codespaces/shared,type=bind --mount source=/var/lib/docker/codespacemount/.persistedshare,target=/workspaces/.codespaces/.persistedshare,type=bind --mount source=/.codespaces/agent/mount,target=/.codespaces/bin,type=bind --mount source=/mnt/containerTmp,target=/tmp,type=bind --mount type=bind,src=/.codespaces/agent/mount/cache,dst=/vscode -l Type=codespaces -e CODESPACES=******** -e ContainerVersion=13 -e RepositoryName=Monorepo --label ContainerVersion=13 --hostname codespaces-86fa16 --add-host codespaces-86fa16:127.0.0.1 --cap-add sys_nice --network host --privileged --entrypoint /bin/sh vsc-monorepo-9081da56175f3b1ac597257c0566d7ce76b18fbc1a048e05bdbd04f7efb0dfca-features -c echo Container started
Container started
sed: couldn't flush stdout: Device or resource busy
Outcome: success User: node WorkspaceFolder: /workspaces/Monorepo
devcontainer process exited with exit code 0
Running blocking commands...

sed: couldn't flush stdout: Device or resource busy

@samruddhikhandale
Member Author

@AndriiTsok Thanks for the update.

Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284

Looks like the retry mechanism is triggered; we retry five times until the docker daemon starts. From the logs, we can see only one such log statement, so I can think of two things that might be happening:

  1. There's a race condition between the retry logic and the start of the onCreateCommand (hence the retries are not triggered and docker is not guaranteed to start)
  2. The docker daemon started fine (hence no more retry logs), but some other process killed the daemon

@AndriiTsok Would it be possible to provide a sample repro (i.e. a sample dev container config)? I'd like to experiment with a few things. If not, no worries; I could play around by adding docker commands within onCreateCommand.

@AndriiTsok In the meantime, would it be possible to add similar retry logic (which starts the docker daemon) to your onCreateCommand script? Let me know if this works!

@tom-growthbox

We have experienced this issue consistently in the last 4 days. It happens during codespaces prebuild. I see the line (*) Failed to start docker, retrying in 5s... once in each of the failed jobs. Successful jobs do not have this line in the log. It didn't start with version 2.3.0, but somehow downgrading to 2.2.1 fixes it. I have not seen the error with 2.2.1.

@samruddhikhandale
Member Author

I wonder if adding the retry logic is somehow breaking the codespace prebuilds. Thanks @tom-growthbox for reporting the issue and providing a temporary workaround.

@tom-growthbox would it be possible to provide a sample repro? (ie dev container config) It would help me investigate and fix the underlying issue, thanks!

@tom-growthbox

I created a sample repo with a configuration similar to the one we use. However, the prebuild doesn't fail on this one; I would need to spend some time on this to reproduce the issue.

https://github.com/tom-growthbox/prebuild-error

@AndriiTsok

AndriiTsok commented Aug 15, 2023

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue
It is difficult to catch the problem. It can succeed 10 times straight and then just start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@mandrasch

mandrasch commented Aug 16, 2023

Same error here; it took 15 new codespace instances to finally trigger and catch it. Version used: 2.3.0.

Published creation-log.txt and output of cat dockerd.log here:
https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727

time="2023-08-16T19:07:55.475998600Z" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
failed to start daemon: error while opening volume store metadata database: timeout
x\"..." type=io.containerd.runtime.v1

Discussion and context: https://github.com/orgs/community/discussions/63776
Thanks very much for the work on this!

@samruddhikhandale
Member Author

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then just start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔

@samruddhikhandale
Member Author

samruddhikhandale commented Aug 16, 2023

Hi @mandrasch 👋

In your dev container, docker is added by the universal image. See https://github.com/devcontainers/images/blob/main/src/universal/.devcontainer/devcontainer.json#L64-L66.

Hence, adding the docker-in-docker Feature has no impact as the docker-init script is old. See https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727#file-creation-log-txt-L168. (I think this behavior should be updated.)

The prod image was built with Feature version 2.2.1 (released on Aug 3rd) which does not contain the retry logic. I'll work on releasing a new universal image which would be built with v2.3.1.

Let me know if that makes sense.

@samruddhikhandale
Member Author

I'll work on releasing a new universal image which would be built with v2.3.1.

Opened devcontainers/images#705. In the meantime, @mandrasch, can you use the dev image (mcr.microsoft.com/devcontainers/universal:dev)? Let me know if that helps!

Also, can you remove the docker-in-docker Feature from your dev container as that one is unnecessary (given that the universal image already installs it)?

@samruddhikhandale
Member Author

I'll work on releasing a new universal image which would be built with v2.3.1.

The image is live now!

@mandrasch

mandrasch commented Aug 17, 2023

Hi @samruddhikhandale! Thanks so much for explaining this! 🙏 🙏

I have now removed the docker-in-docker feature from my devcontainer.json; it works for me.

  "image": "mcr.microsoft.com/devcontainers/universal:2",
  "features": {
    "ghcr.io/ddev/ddev/install-ddev:latest": {}
  },

Since the bug occurred only 1 time out of 15, I can't really say whether this fixes the problem. I'll post here again if it happens again, but hopefully that won't be the case. 👍

Question regarding this:

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked devcontainers/images#705 but could not find a commit related to a version number?) Thanks!

@samruddhikhandale
Member Author

Since the bug occurred only 1 time out of 15, I can't really say whether this fixes the problem. I'll post here again if it happens again, but hopefully that won't be the case. 👍

@mandrasch One more thing: the universal image is cached in a codespace, hence even now you will get an older image pulled (unless you pin it to 2.5.0). I am working on updating the cache for Codespaces, but that will take a day or two.

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked devcontainers/images#705 but could not find a commit related to a version number?) Thanks!

Unfortunately, I don't think there's a direct way to find out the Feature version.

@AndriiTsok

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then just start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔

Hi @samruddhikhandale, I have just re-pushed the container files to https://github.com/TRYON-Technology/Docker-in-Docker-Issue

I also added an error.log showing the issue:
https://github.com/TRYON-Technology/Docker-in-Docker-Issue/blob/3e6849493c9cfd073c2e45563647264385d9a07a/error.log#L2216

@mandrasch

@samruddhikhandale Thanks so much for the detailed technical background information, very helpful! 🙏

@samruddhikhandale
Member Author

Re-opening, as the retry logic has been reverted. See #659

Opened #660 for tracking docker failures due to "sed: couldn't flush stdout: Device or resource busy" errors

@bherbruck

bherbruck commented Aug 19, 2023

If anyone needs a quick fix - I've gotten into the habit of doing this recently, and it has been working well.

.devcontainer/start-docker.sh:

#!/bin/bash

echo "Waiting for Docker to start"

# Wait for the Docker socket to appear, re-running the init script until it does
while [ ! -S /var/run/docker.sock ]; do
  echo -n "."
  /usr/local/share/docker-init.sh
  sleep 1
done

echo "Docker is running!"

devcontainer.json:

{
  "postCreateCommand": "bash .devcontainer/start-docker.sh"
}

File layout:

.devcontainer/
├── devcontainer.json
└── start-docker.sh

@darmalovan

@bherbruck I ran the docker-init.sh script and the output I got was "Setting dockerd Azure DNS", which didn't resolve the docker issue. Have you stumbled into something like that?

@bherbruck

@darmalovan I get that same output if I run /usr/local/share/docker-init.sh too many times

@samruddhikhandale
Member Author

docker-in-docker v2.4.0 includes the following changes; see #669

  • Adds retries for docker daemon startup
  • We have seen errors like sed: couldn't flush stdout: Device or resource busy which cause docker to fail to start; this adds retries to handle such sed errors as well
  • Adds a workflow which runs 100 jobs to validate docker startup --> 50 jobs validate dockerd after the container is started, and 50 jobs validate it within the onCreateCommand. The stress test is 🟢

Important Note: /usr/local/share/docker-init.sh, which starts/retries dockerd, is added to the entrypoint command. This command runs in the background and is not a blocking script for container startup. Since it runs in the background, onCreateCommand/postCreateCommand/postStartCommand could all start executing before docker is fully running. If docker takes too long to start, that could introduce flakiness in those lifecycle scripts.
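One way to mitigate this in the meantime is to have the lifecycle script itself wait, with a bound, for the daemon before using docker. A hypothetical sketch (the function name and the wait budget are assumptions, not part of the Feature):

```shell
#!/bin/bash
# Hypothetical guard for an onCreateCommand script: poll the daemon for up to
# a fixed budget before giving up, since the entrypoint starts dockerd in the
# background and gives no completion signal.
wait_for_docker() {
  local budget=${1:-60}   # seconds to wait; arbitrary default
  local waited=0
  until docker info >/dev/null 2>&1; do
    if [ "$waited" -ge "$budget" ]; then
      echo "dockerd not ready after ${budget}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "dockerd ready after ${waited}s"
}
```

A script would call wait_for_docker before its first docker/k3d command and abort (or fall back) if it returns non-zero.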

Opened devcontainers/spec#299, which requests new semantics for "blocking" entrypoints that the CLI waits for. This way we can ensure that docker is already up and running for the mentioned ^ lifecycle scripts and is available in the container.

Closing in favor of #671. Feel free to reopen if needed, or comment on #671 if you still run into docker-not-running issues. Thank you!
