The Fake GPU Operator (GPU Operator Simulator) simulates the NVIDIA GPU Operator on nodes that have no GPUs. Run:ai created it to avoid the cost of real GPU machines in scenarios that do not require the GPU itself. This simulator:
- Lets you take a CPU-only node and advertise it as if it has one or more GPUs.
- Simulates all aspects of the NVIDIA GPU Operator, including feature discovery, NVIDIA MIG, and more.
- Emits metrics to Prometheus that simulate actual GPUs.
You can configure the simulator to have any NVIDIA GPU topology, including type and amount of GPU memory.
The real NVIDIA GPU Operator must not be installed in the Kubernetes cluster.
Label the nodes you wish to have fake GPUs on:

```bash
kubectl label node <node-name> nvidia.com/gpu.deploy.device-plugin=true nvidia.com/gpu.deploy.dcgm-exporter=true --overwrite
```
By default, the operator creates a GPU topology of 2 Tesla K80 GPUs for each node in the cluster. To create a different GPU topology, see the customization section below.
Install the operator:

```bash
helm repo add fake-gpu-operator https://fake-gpu-operator.storage.googleapis.com
helm repo update
helm upgrade -i gpu-operator fake-gpu-operator/fake-gpu-operator --namespace gpu-operator --create-namespace
```
Submit any workload that requests an NVIDIA GPU:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
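For example, a minimal pod that requests one simulated GPU could look like the following (the pod name and image are illustrative, not part of the operator):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test        # illustrative name
spec:
  containers:
    - name: main
      image: ubuntu:22.04   # any image works; no CUDA runtime is needed
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
```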
Verify that it has been scheduled on one of the CPU nodes. You can also test by running the example deployment YAML in the `example` folder.
Pod Security Admission should be relaxed to `privileged` on the `gpu-operator` namespace:

```bash
kubectl label ns gpu-operator pod-security.kubernetes.io/enforce=privileged
```
The base GPU topology is defined in a Kubernetes ConfigMap named `topology`. To customize the GPU topology, edit the ConfigMap by running:

```bash
kubectl edit cm topology -n gpu-operator
```
The ConfigMap should look like this:

```yaml
apiVersion: v1
data:
  topology.yml: |
    config:
      node-autofill:
        gpu-count: 16
        gpu-memory: 11441
        gpu-product: Tesla-K80
        mig-strategy: mixed
```
The configmap defines the GPU topology for all nodes.
- gpu-count - the number of GPUs per node.
- gpu-memory - the amount of GPU memory per GPU, in MiB.
- gpu-product - the GPU type, e.g. `Tesla-K80` or `Tesla-V100`.
- mig-strategy - the MIG strategy. Can be `none`, `mixed`, or `single`.
Each node can have a different GPU topology. To customize a specific node, edit the ConfigMap named `<node-name>-topology` in the `gpu-operator` namespace.
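As a sketch, a per-node override might look like the following. This assumes the per-node ConfigMap uses the same `topology.yml` layout as the cluster-wide `topology` ConfigMap shown above; `node-name` is a placeholder for the actual node name:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-name-topology   # replace "node-name" with the real node name
  namespace: gpu-operator
data:
  topology.yml: |            # layout assumed to mirror the cluster-wide ConfigMap
    config:
      node-autofill:
        gpu-count: 8
        gpu-memory: 16384
        gpu-product: Tesla-V100
        mig-strategy: none
```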
By default, the DCGM exporter reports maximum GPU utilization for every pod that requests GPUs. To customize the reported utilization, add a `run.ai/simulated-gpu-utilization` annotation to the pod, with a value that represents the range of GPU utilization to simulate. For example, add the annotation `run.ai/simulated-gpu-utilization: 10-30` to simulate a pod that utilizes the GPU at between 10% and 30%.
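Concretely, the annotation goes in the pod metadata. A sketch (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-util-test        # illustrative name
  annotations:
    run.ai/simulated-gpu-utilization: "10-30"   # simulate 10%-30% GPU utilization
spec:
  containers:
    - name: main
      image: ubuntu:22.04    # illustrative image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1
```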