Investigate limited number of layers on aarch64 docker #75
Comments
This is a sort of time bomb / hot potato for whichever buildfarmer next touches the Dockerfile and re-triggers this issue. moby/moby#1171 looks very promising. I might get a chance to circle back to this during my buildfarmercop reign.
moby/moby#27384 is perhaps the root cause of this. Docker 1.12.x on ARM64 is compiled with Go 1.7, and Go has a bug that's tickled by Docker. The good news is that we're looking at Go 1.9 being available as a binary release for ARM64 from the Go project, and in some reasonable timeframe after that there should be a fresh ARM64 binary of Docker that can be easily consumed.

In the meantime the best strategy is to reduce the number of layers you use. There are various Docker optimization strategies to pursue that would reduce the size of the Dockerfile and thus eliminate layers and build steps, with a performance improvement. Looking at https://github.com/ros2/ci/blob/master/linux_docker_resources/Dockerfile, successive RUN commands in particular can usually be merged: RUN foo followed by RUN bar is almost always equivalent for your purposes to RUN foo && bar, and that wipes out a layer.

You might look at this (vintage) writeup, https://blog.tutum.co/2014/10/22/how-to-optimize-your-dockerfile/, which is pretty useful. Newer Docker has some more tricks up its sleeve, so don't go completely overboard on optimizing, but there are small changes that should help stability.
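As a hedged illustration of that merge (the package names below are placeholders, not the actual contents of the ros2/ci Dockerfile):

```dockerfile
# Before: three RUN instructions, three layers (hypothetical packages for illustration)
RUN apt-get update
RUN apt-get install -y build-essential
RUN rm -rf /var/lib/apt/lists/*

# After: one RUN instruction, one layer; the apt cache cleanup also happens
# in the same layer, so it never gets baked into the image
RUN apt-get update && \
    apt-get install -y build-essential && \
    rm -rf /var/lib/apt/lists/*
```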
My understanding is that this has been resolved by upgrading the version of Docker on the arm64 machines to one that has better file system stability.
We haven't upgraded this Jenkins instance yet. We have what was previously referred to as the ROS 2 buildfarm, which lives at http://ci.ros2.org and provides continuous integration and build archiving for ROS 2 omnibus builds, and in the past couple of weeks we have created a ROS 2 buildfarm at http://build.ros2.org, based more closely on the ROS buildfarm (build.ros.org), to build the Xenial debs for the beta2 release of ROS 2. The Docker deb that I built last weekend has been deployed to the package buildfarm http://build.ros2.org. Depending on the CI load, today might be a good day to upgrade docker on the CI farm and see if that does indeed resolve this.
There's an updated version of Docker 1.12.x also available through a PPA that should fix this particular issue. @nuclearsandwich - if you have not yet resolved this issue, I'd like to help you work through testing it.
@vielmetti it's been a while since I've looked at this, since I've been pulled away to other matters. It looks like we're still running 1.12.6 on the aarch64 host. I've got the 17.06-ce deb I built for the buildfarm, which we could also use to test the fix. I think @wjwwood has the buildfarm shift currently, but I don't know if anyone has bandwidth to update docker and test without the workaround leading up to beta 3.
I don't at the moment. But if someone can show it fixes the issue I can work on upgrading the machines in the background (less work than testing it out I think).
FWIW, I pushed a branch to https://github.com/ros2/ci called
I think that the problem is a matter of the number of layers and not about a specific command. Given that we reduced the total number of layers in the Dockerfile, the referenced job will not prove whether it's fixed or not (very close though, given that it's now 49 layers deep and the job crashes as soon as we reach 50 layers).

Edit: actually it failed at step 46 and not 50 o_O, but it does exhibit the error message described in the original issue.
@mikaelarguedas Ah, I didn't realize that we had reduced the layers. Thanks for the update.
The instructions for installing this newly fixed version of Docker 1.12 are as follows:
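(The actual instructions appear to have been stripped from this comment. Purely as a stand-in, upgrading to a patched docker package on Xenial would look roughly like the sketch below; the PPA name is a placeholder, not the real source from the original comment.)

```sh
# Placeholder PPA name for illustration only; the real source was in the stripped instructions
sudo add-apt-repository ppa:example/docker-aufs-fix
sudo apt-get update
sudo apt-get install --only-upgrade docker.io   # or docker-engine, depending on which package is installed
sudo systemctl restart docker
docker version   # confirm the patched 1.12.x build is running
```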
The related bug report for reference is https://bugs.launchpad.net/bugs/1702979
@nuclearsandwich do you think that you will have spare cycles in the near future to try a more recent docker on the packet machines? (I don't remember if a newer kernel was needed as well or if newer docker could be enough)
Two things of note:
- The newest Docker installs very easily with the instructions/script from get.docker.com.
- It's also much easier to get Arm base images these days that do what you want, because there's support for multi-arch images in the main docker library (i.e. you can say "FROM ubuntu" and it will do the right thing).
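For reference, the convenience-script install mentioned here (and used in the comments below) amounts to something like this sketch:

```sh
# Download and run Docker's convenience install script (requires root or sudo)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```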
To answer this question, yes. I can give this a shot Monday or this afternoon. Since we're not wrangling the CI hosts with puppet, we can use the script from get.docker.com or the 17.06 deb I built for the buildfarm hosts.
My notes on multiarch are here, and the Works on Arm newsletter covered them today. A good read is this from Phil Estes: https://integratedcode.us/2017/09/13/dockerhub-official-images-go-multi-platform/
Updated the CI host to 17.07 with the get.docker.com script @vielmetti linked. Running 3 builds. All three have made it past the Dockerfile building stage, so this looks like it does the thing. I think maybe we let the nightlies run over the weekend and, if nothing bad happens, update the other linux hosts in order to keep the same version of docker everywhere. /cc @sloretz as build farmer.
@nuclearsandwich Can you please retrigger the turtlebot job to use the
Added http://ci.ros2.org/job/ci_turtlebot-demo_linux-aarch64/102/ to the list of builds above. Edit: er.. http://ci.ros2.org/job/ci_linux-aarch64/548/
and 💥
It seems like aufs is still part of the problem, as it's still the default driver. Since we're on Ubuntu Xenial, is there anything we need to do to try again with the overlay2 storage driver?
We've been waiting to upgrade the OS before trying overlay2, but it looks like it's not too hard to enable: https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/#configure-docker-with-the-overlay-or-overlay2-storage-driver Maybe the puppet formula had an option too.
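A minimal sketch of what enabling overlay2 per the linked docs would look like on the host (standard paths and service names assumed; note that images and containers built under aufs are not migrated, so cached layers are lost):

```sh
# Stop the daemon before switching storage drivers
sudo systemctl stop docker

# Select overlay2 in the daemon configuration file
echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json

sudo systemctl start docker
docker info | grep -i "storage driver"   # should report: Storage Driver: overlay2
```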
I touch on this in my previous docker for arm announcement:
Specifically, the related ticket for this new functionality is here:
Oh nice. I must have missed it with my head down. Sorry for the spurious ping.
If you're looking to test a Docker installation to see if it will crash when the file system gets too deep, I give you https://gist.github.com/anonymous/bdafb8e961f55b2533fee8fa5221d186
If you are running an unpatched
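(The gist itself isn't reproduced here. A hypothetical equivalent that builds an image with an arbitrary number of layers might look like the sketch below, where the default depth of 60 is an arbitrary choice above the mid-40s failure point seen earlier in this thread.)

```sh
#!/bin/sh
# Generate a Dockerfile with N trivial RUN steps, each adding one layer,
# then build it to see how deep the storage driver can stack layers.
N=${1:-60}
{
  echo "FROM ubuntu:xenial"
  i=1
  while [ "$i" -le "$N" ]; do
    echo "RUN echo layer $i > /layer-$i"
    i=$((i + 1))
  done
} > Dockerfile.layertest

docker build -f Dockerfile.layertest -t layer-depth-test .
```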
This is still marked as "open", but we should be OK now with anything resembling a modern Docker version.
Very good point @vielmetti. In fact we just updated to 18.09.5 last week. Thanks!
See #73 for a description of the problem, and a link to problematic builds. We've currently worked around the problem with #74, but reverting that commit should show the problem again for debugging.