Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to install NVIDIA driver if not present in the machine #328

Merged
merged 14 commits into from
Oct 9, 2024

Conversation

gabrielcocenza
Copy link
Member

@gabrielcocenza gabrielcocenza commented Sep 26, 2024

detect if NVIDIA driver is present by the /proc/driver/nvidia/version file and:

  • if present, doesn't try to install
  • if not present, try to install using the ubuntu-drivers pkg

See more info about the approach at this doc

detect if NVIDIA driver is present by the /proc/driver/nvidia/version
file and:

* if present, doesn't try to install
* if not present, try to install using the ubuntu-drivers pkg
* even if installed, the machine might require reboot. To detect
  this, we check if nvidia-smi command is working.
@gabrielcocenza gabrielcocenza marked this pull request as ready for review October 1, 2024 22:18
@gabrielcocenza gabrielcocenza requested a review from a team as a code owner October 1, 2024 22:18
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
@samuelallan72
Copy link
Contributor

if present, doesn't try to install

In what cases would the nvidia driver be present already?

even if installed, the machine might require reboot. To detect this, we check if nvidia-smi command is working.

If the machine requires a reboot, can the charm set a blocked status with a message asking the user to reboot the machine?

Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need refactor strategy to two.

src/service.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
@gabrielcocenza
Copy link
Member Author

gabrielcocenza commented Oct 2, 2024

In what cases would the nvidia driver be present already?

The driver is necessary to have NVIDIA GPUs being able to receive workloads, so it's very likely that once you receive the server you will install it manually even before installing juju or other necessary tools on it.

If the machine requires a reboot, can the charm set a blocked status with a message asking the user to reboot the machine?

I asked to the server team and if installed with ubuntu-drivers it's not necessary to reboot it.

@samuelallan72
Copy link
Contributor

The driver is necessary to have NVIDIA GPUs being able to receive workloads, so it's very likely that once you receive the server you will install it manually even before installing juju or other necessary tools on it.

Ah right of course, and it's a subordinate charm so it will definitely be sharing a server. 👍

src/hw_tools.py Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/service.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
Copy link
Contributor

@chanchiwai-ray chanchiwai-ray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking) overall lgtm, but if you follow the existing code base, the nvidia driver strategy should be in this list

Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall is good to me. Thanks for the changing of NVIDIADriverStrategy. I also notice that multiple strategies can't fit into SnapExporter class. So instead of overwrite the exporter functions, please consider refactor the SnapExporter class.

src/service.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
@gabrielcocenza
Copy link
Member Author

gabrielcocenza commented Oct 4, 2024

(non-blocking) overall lgtm, but if you follow the existing code base, the nvidia driver strategy should be in this list

@jneo8 wants to refactor the codebase to not use the hw_tool. That is why we are not registering it.

@chanchiwai-ray chanchiwai-ray self-requested a review October 7, 2024 07:40
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
src/hw_tools.py Outdated Show resolved Hide resolved
@jneo8
Copy link
Contributor

jneo8 commented Oct 8, 2024

(non-blocking) overall lgtm, but if you follow the existing code base, the nvidia driver strategy should be in this list

@jneo8 wants to refactor the codebase not use the hw_tool. That is why we are not registering it.

Yes, we don't have to. Register it on the list will duplicate the install step.

jneo8
jneo8 previously approved these changes Oct 8, 2024
@gabrielcocenza gabrielcocenza merged commit 98b502e into canonical:main Oct 9, 2024
10 checks passed
@gabrielcocenza gabrielcocenza deleted the install-driver branch October 9, 2024 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants