spike: NVIDIA GPU operator Zarf package #321

Closed
1 of 8 tasks
justinthelaw opened this issue Mar 26, 2024 · 4 comments
Assignees
Labels
tech-debt Not a feature, but still necessary

Comments

@justinthelaw
Contributor

justinthelaw commented Mar 26, 2024

LFAI delivery requires a production-ready NVIDIA GPU operator Zarf package. The package should bootstrap containerized versions of the necessary NVIDIA CUDA drivers, container toolkit, feature discovery, and device plugin components so that generative AI and ML applications can use NVIDIA GPUs from within a Kubernetes cluster.

  • How do I prepare an air-gappable Zarf package that contains the NVIDIA GPU operator?
  • How do I set up the NVIDIA GPU operator to be configurable at deploy time?
    • Multi-instance GPU (logical separation of GPU resources)?
    • Time slicing (shared GPU loading and usage)?
    • Distributed node resource load balancing configuration?
  • How and where do I consistently test this on K3D to make sure it works?
  • How and where do I consistently test this on RKE2 to make sure it works?
  • How do I integrate this back into the LFAI infrastructure UDS bundle in issue #317?
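
As a possible starting point for the time-slicing question above, here is a minimal sketch of generating the device plugin sharing config as a Kubernetes ConfigMap. The `version`/`sharing`/`timeSlicing`/`resources` layout follows the format in the NVIDIA GPU operator time-slicing documentation; the function names, the ConfigMap name, the `gpu-operator` namespace, and the `any` data key are assumptions for illustration, and wiring the replica count to a Zarf deploy-time variable is not shown.

```python
import json


def time_slicing_config(replicas: int, resource: str = "nvidia.com/gpu") -> dict:
    """Build a sharing config that advertises each GPU `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": resource, "replicas": replicas}],
            }
        },
    }


def render_config_map(name: str, namespace: str, replicas: int) -> dict:
    """Wrap the sharing config in a ConfigMap manifest (hypothetical names).

    The config is serialized as JSON, which is also valid YAML.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # "any" is an assumed key name; it must match whatever default
        # config the operator is told to use.
        "data": {"any": json.dumps(time_slicing_config(replicas), indent=2)},
    }


if __name__ == "__main__":
    manifest = render_config_map("time-slicing-config", "gpu-operator", replicas=4)
    print(json.dumps(manifest, indent=2))
```

The replica count is the value most likely to need exposing at deploy time, so it is kept as the single required parameter here.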

See additional NVIDIA GPU operator context here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html

@justinthelaw justinthelaw self-assigned this Mar 26, 2024
@justinthelaw
Contributor Author

justinthelaw commented Mar 26, 2024

@justinthelaw justinthelaw added the tech-debt Not a feature, but still necessary label Mar 26, 2024
@YrrepNoj
Member

YrrepNoj commented Apr 4, 2024

Commenting for personal tracking: part of this spike should involve evaluating whether to create our own version of this repo/container, published from our org, for us to use.

@justinthelaw
Contributor Author

This will be tracked via the following PR: justinthelaw/uds-rke2#39

@justinthelaw
Contributor Author

The PR in the previous comment is the tracking PR, and it is tied to a Delivery issue.


4 participants