Migrated MR - Fedora CoreOS Stable/Testing/Next Support #8
Comments
Summarizing the linked issue with what I believe to be the latest status. There is presently no official support for Fedora and Fedora CoreOS (FCOS), i.e. no Docker images are published automatically to the NVIDIA container registry. A community contribution under My fork of this project on GitLab has been using this strategy successfully for a couple of years now to build and push images to Docker Hub in this way. The resultant images are currently used successfully on Typhoon k8s FCOS clusters. However, they are running ^1 Note that it should be possible to build functioning images without pre-compiled kernel headers without using FCOS gitlab-runners.
I expect that if you were to examine the These errors occur with Fedora40 and won't be officially addressed by NVIDIA until they incorporate a patch they have advertised on their community forum. I have this patch applied on my fleet and it works fine but applying it is tricky. See: https://gitlab.com/container-toolkit-fcos/driver/-/issues/11 |
Hi, happy to join the discussion here. I'm also currently struggling to get the NVIDIA drivers and CUDA working on an EOL FCOS release, e.g. 37. I managed to get it working using the rpm-ostree method, but redoing the same steps yesterday didn't work. I figured out that there is a new version of the nvidia-container-toolkit, 1.16.1.1, that seems to break things. After reverting to 1.16.0.1, everything works fine. I also tried your containers, @fifofonix, but we ended up with the same issue as zwshan. Checking the logs and your Dockerfile, I see that you have a hardcoded HTTP_PROXY set in your run commands. I guess this is why the container exits with -1. At least that is what we can see on our side. Can you please check and confirm? Furthermore, I'm no expert on FCOS.
You are correct that I have embedded unreachable proxies in the driver images and that will most certainly be a problem if you are running a driver image that does not have pre-compiled kernel headers. I will fix this in the next couple of weeks but for new images only. When running a kernel-specific driver image with pre-compiled kernel headers I do not even launch it with a network interface since the image does not need network access at all. I'm skeptical that the Also, interested in why you would be running such an old FCOS image (with security issues etc)? Finally, I just want to share that right now I'm running 550.90.07 on each of the current Fedora40 next/testing/stable streams on k8s nodes without issue using toolkit 1.16.1-1. However, I am running the driver as a podman systemd unit on each of the gpu worker nodes and deploying |
@fifofonix thanks for your reply. I have to run this old FCOS version since it is used in OpenStack Magnum. I have to run K8s 1.21.x, and even if I ran a newer K8s release, OpenStack Magnum still wouldn't work with an up-to-date version of FCOS. For newer K8s releases I use Vexxhost's CAPI driver in OpenStack Magnum, which can use either Ubuntu or Flatcar as the base OS. In another post, NVIDIA/gpu-operator#696, I also mentioned that I managed to get the GPU Operator working without driver deployment. IMHO this is the best solution, as it manages everything, especially when using OpenStack Magnum. My idea is that the deployment should be simple and effortless, and this is exactly what you can achieve with the NVIDIA GPU Operator. Next I will try to get the driver deployment through the GPU Operator working too. The only reason it is not working yet is that once the GPU Operator deploys the driver container, it tries to install kernel headers, and they just don't exist, at least on the latest FCOS 37 and on an older FCOS 35 release that I'm using.
Thanks for the links @r0k5t4r, I had missed some of the updates on the gpu-operator thread. Of course this is the way to go and I look forward to experimenting with it soon! |
Hi, I made some progress. I spent plenty of time looking into the problem with the driver deployment through the GPU Operator and noticed and fixed a couple of things. Since I’m using a very old FCOS 35 release that is EOL, the repo URLs are different. I added some logic to check whether the container builds on an EOL release and, if so, to switch the repos to the archive. Furthermore, it seems that the driver fails to build due to a gcc mismatch. The environment variable to disable this check seems to be ignored, so I also added some logic to download the correct gcc version from Koji. Now I can not only build a container successfully but also run it without any problems directly in Podman on my FCOS 35 nodes. I can run nvidia-smi just fine, and the NVIDIA modules are listed when running lsmod on the host OS. But for some reason Kubernetes is still not happy with it. It compiles the driver and installs the driver, but then it unloads it, destroys the pod and starts all over again. Maybe an issue with the probe? I don’t know. I have not yet put my code online, but I’m planning to do this within the next week. I will try to precompile the driver in the container, commit and push it to my repo, and try using that. Cheers,
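The EOL repo switch described above could be sketched roughly as follows. This is not the code from the comment (which had not been published at that point), only a minimal illustration. The function names and the `oldest_supported` cutoff are hypothetical, and the two base URLs are assumed to be the usual Fedora download and archive mirrors; verify them for your release before relying on this.

```python
# Hypothetical helper: pick a Fedora repo base URL depending on whether the
# release is end-of-life. The cutoff is supplied by the caller (assumption);
# nothing here looks up the real support status of a release.

ACTIVE_BASE = "https://dl.fedoraproject.org/pub/fedora/linux"
ARCHIVE_BASE = "https://archives.fedoraproject.org/pub/archive/fedora/linux"


def repo_base_url(release: int, oldest_supported: int) -> str:
    """Return the archive mirror for EOL releases, the active mirror otherwise."""
    if release < oldest_supported:
        return ARCHIVE_BASE
    return ACTIVE_BASE


def releases_url(release: int, oldest_supported: int, arch: str = "x86_64") -> str:
    """Build the 'Everything' release repo URL for the given Fedora release."""
    base = repo_base_url(release, oldest_supported)
    return f"{base}/releases/{release}/Everything/{arch}/os/"


if __name__ == "__main__":
    # With an assumed cutoff of 39, release 35 resolves to the archive mirror.
    print(releases_url(35, oldest_supported=39))
    print(releases_url(40, oldest_supported=39))
```

In a container build, the resulting URL would be written into the repo configuration before installing kernel headers; how that is wired into the driver image's entrypoint is left open here.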
@r0k5t4r thanks for your efforts. I would like to add some details here, which can be useful.
@dfateyev you're welcome. Open source relies heavily on contribution. :) It is interesting to investigate this issue and hopefully fix it one day. Thanks for your input, it is very valuable. 👍
> Your POD is destroyed by K8s scheduler which consider it unhealthy after probing. To debug it in details, you can extract the driver-related Deployment from gpu-operator suite, and deploy your driver image with your Deployment. There you can customize or skip probing to see what failed there.
Good idea to debug the driver deployment like this. I will try it. But even with your precompiled container, it is not working yet with the GPU Operator in K8s? I also tried to use a precompiled container, but I noticed that this was first supported in GPU Operator version 23.3.0, while I have to run a much older release since we need K8s 1.21.x. :) So no precompiled containers for me.
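For the "skip probing" suggestion quoted above, one way to prepare an extracted manifest is to drop the probe fields from every container before redeploying it by hand. The sketch below assumes the manifest has already been loaded into a plain Python dict (for example from the output of `kubectl get ... -o json`); the function name, container name and image are hypothetical placeholders, while `startupProbe`, `livenessProbe` and `readinessProbe` are the standard Kubernetes container fields.

```python
import copy

# Standard Kubernetes container probe fields.
PROBE_KEYS = ("startupProbe", "livenessProbe", "readinessProbe")


def strip_probes(workload: dict) -> dict:
    """Return a copy of a workload manifest with all container probes removed.

    Works for objects that keep their pod template under spec.template
    (Deployment, DaemonSet and similar). The input is not modified.
    """
    result = copy.deepcopy(workload)
    pod_spec = result.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        for key in PROBE_KEYS:
            container.pop(key, None)
    return result


if __name__ == "__main__":
    # Placeholder manifest; names and image are made up for illustration.
    manifest = {
        "kind": "DaemonSet",
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "driver",
                            "image": "example.invalid/driver:tag",
                            "startupProbe": {"exec": {"command": ["true"]}},
                        }
                    ]
                }
            }
        },
    }
    print(strip_probes(manifest)["spec"]["template"]["spec"]["containers"][0])
```

With the probes gone, the pod is no longer restarted on a failed check, which makes it easier to inspect the driver container's logs and see which step the probe would have rejected.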
I found this very good article from CERN, and they seem to have successfully used one of @fifofonix's containers with Magnum and the GPU Operator. I tried using the same version, but it did not work. I don’t know which k8s version they used. I’m running 1.21.11. https://kubernetes.web.cern.ch/blog/2023/03/17/efficient-access-to-shared-gpu-resources-part-2/
Migrating this as we would like to see native support for Fedora CoreOS in this project.
See - https://gitlab.com/nvidia/container-images/driver/-/issues/34