The fake GPU Operator, or GPU Operator Simulator, simulates the NVIDIA GPU Operator without requiring a GPU. Run:ai created it to avoid the cost of real GPU machines in scenarios that do not need the GPU itself. This simulator:
- Allows a CPU-only node to be represented as if it has one or more GPUs.
- Simulates all features of the NVIDIA GPU Operator, including feature discovery and NVIDIA MIG.
- Emits metrics to Prometheus, simulating actual GPU behavior.
You can configure the simulator to expose any NVIDIA GPU topology, including the GPU type and the amount of GPU memory.
Ensure that the real NVIDIA GPU Operator is not present in the Kubernetes cluster.
Assign the nodes on which you want to simulate GPUs to a node pool by labeling them with the `run.ai/simulated-gpu-node-pool` label. For example:

```
kubectl label node <node-name> run.ai/simulated-gpu-node-pool=default
```
Node pools are used to group nodes that should have the same GPU topology. They are defined in the `topology.nodePools` section of the Helm `values.yaml` file.
By default, a node pool with 2 Tesla K80 GPUs is created for all nodes labeled with `run.ai/simulated-gpu-node-pool=default`. To create a different GPU topology, refer to the customization section below.
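As an illustration, the default node pool could be expressed in `values.yaml` roughly as follows. The field names here are an assumption, not taken from the chart; consult the chart's default `values.yaml` for the exact schema.

```
topology:
  nodePools:
    default:                  # matches nodes labeled run.ai/simulated-gpu-node-pool=default
      gpuProduct: Tesla-K80   # assumed field name for the simulated GPU model
      gpuCount: 2             # assumed field name for the number of GPUs per node
```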
To install the operator:
```
helm repo add fake-gpu-operator https://fake-gpu-operator.storage.googleapis.com
helm repo update
helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator --namespace gpu-operator --create-namespace
```
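Once the chart is installed, you can check that the simulator's pods are running and that labeled nodes now advertise simulated GPU capacity using standard kubectl commands (`<node-name>` is the node you labeled earlier):

```
# Check that the simulator components are up
kubectl get pods -n gpu-operator

# Confirm that the labeled node reports nvidia.com/gpu capacity
kubectl describe node <node-name> | grep nvidia.com/gpu
```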
Submit any workload with a request for an NVIDIA GPU:
```
resources:
  limits:
    nvidia.com/gpu: 1
```
Verify that it has been scheduled on one of the CPU nodes.
You can also test by running the example deployment YAML under the `example` folder.
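For reference, a minimal pod manifest that requests a simulated GPU could look like the sketch below. The pod name and image are illustrative and not taken from the `example` folder; any CPU-only image works, since no real GPU is used.

```
apiVersion: v1
kind: Pod
metadata:
  name: fake-gpu-test          # illustrative name
spec:
  containers:
    - name: main
      image: ubuntu            # illustrative image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1    # request one simulated GPU
```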
Pod Security Admission should be disabled on the `gpu-operator` namespace:

```
kubectl label ns gpu-operator pod-security.kubernetes.io/enforce=privileged
```
The GPU topology can be customized by editing the `topology` section of the `values.yaml` file before installing or upgrading the Helm chart.
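For example, a custom topology can be applied by passing your own values file when installing or upgrading the chart (the file name `custom-values.yaml` is a placeholder):

```
helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f custom-values.yaml
```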
By default, the DCGM exporter reports maximum GPU utilization for every pod requesting GPUs. To customize GPU utilization, add a `run.ai/simulated-gpu-utilization` annotation to the pod with a value representing the desired range of GPU utilization. For example, add `run.ai/simulated-gpu-utilization: 10-30` to simulate a pod that utilizes between 10% and 30% of the GPU.
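A minimal sketch of where the annotation goes in a pod manifest (only the relevant metadata fields are shown):

```
metadata:
  annotations:
    # Report simulated utilization between 10% and 30%
    run.ai/simulated-gpu-utilization: "10-30"
```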