This repository contains the supporting code for our paper Measuring GPU Utilization one level deeper. We present a comprehensive suite of CUDA benchmarks designed to identify and measure interference across various GPU resources.
The codebase is organized into the following primary directories:
- `gpu_util_bench_lib/`: A shared library containing CUDA kernels and helper functions for kernel launching
- `inter_sm/`: Benchmarks for measuring interference and utilization across Streaming Multiprocessors (SMs) (paper Section 4.1)
- `intra_sm/`: Benchmarks for measuring interference and utilization within SMs (paper Section 4.2)
- `mm_pytorch/`: Example demonstrating interference patterns on production ML kernels (paper Section 4.3)
- `pitfalls/`: Examples illustrating common limitations in current interference prediction approaches (paper Section 3)
The benchmarks require the following dependencies:
- CMake (version >= 3.22)
- C++17 or later
- CUDA toolkit (validated with CUDA 12.5 and 12.6)
- NVIDIA GPU driver (can be installed alongside CUDA toolkit)
Note: Our benchmarks currently do not support AMD GPUs.
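A quick way to confirm the CMake requirement is met is a version comparison against 3.22. The sketch below is illustrative: the `version_ge` helper is not part of this repository, and it relies on GNU `sort -V` for version ordering; substitute the real output of `cmake --version` for the sample value.

```shell
# version_ge A B: succeeds if version A >= version B (uses GNU sort -V).
version_ge() { [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]; }

# On your machine, substitute the installed version, e.g.:
#   installed="$(cmake --version | head -n 1 | awk '{print $3}')"
installed="3.22.1"
version_ge "$installed" "3.22" && echo "CMake $installed OK"
```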
1. Determine your GPU's Compute Capability using `nvidia-smi`:
   ```shell
   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1
   ```
2. Update the Compute Capability in `CMakeLists.txt`:
   ```cmake
   set(CMAKE_CUDA_ARCHITECTURES 90) # Modify based on your GPU
   ```
3. Build the repository:
   ```shell
   mkdir build && cd build
   cmake ..
   cmake --build .
   ```
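The `CMAKE_CUDA_ARCHITECTURES` value is simply the Compute Capability with the dot removed (e.g. `9.0` becomes `90`). A small sketch of that conversion, using a hardcoded sample value in place of the `nvidia-smi` query from step 1:

```shell
# Derive the CMake architecture value from the compute capability ("9.0" -> "90").
cap="9.0"   # e.g. "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
arch="$(printf '%s' "$cap" | tr -d '.')"
echo "set(CMAKE_CUDA_ARCHITECTURES ${arch})"
```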
Each directory contains detailed instructions for executing the benchmarks and reproducing paper experiments. The provided scripts are optimized for the H100 GPU. Users with different GPU architectures may need to adjust script parameters accordingly.
Important: Before running experiments, set the `BUILD_DIR` environment variable to match your build directory.
```shell
export BUILD_DIR=$HOME/gpu-util-interference/build # update based on location of your build directory
```
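Before launching a long experiment, it can be worth sanity-checking that `BUILD_DIR` points at an actual CMake build. This check is a sketch of mine, not part of the repository's scripts; it falls back to the default path shown above if the variable is unset.

```shell
# Sanity-check BUILD_DIR before running experiments (illustrative only).
: "${BUILD_DIR:=$HOME/gpu-util-interference/build}"   # default from this README
if [ -f "$BUILD_DIR/CMakeCache.txt" ]; then
  echo "BUILD_DIR looks like a CMake build: $BUILD_DIR"
else
  echo "warning: no CMakeCache.txt under $BUILD_DIR (did you build yet?)" >&2
fi
```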
To gather detailed performance metrics for isolated kernel execution, use the Nsight Compute profiler. When profiling with NCU, specify `mode=0` in the scripts:
```shell
ncu -f -o ncu.ncu-rep --set full <executable>
```
For analyzing kernel co-location scenarios, we recommend collecting CUDA traces using the Nsight Systems Profiler to visualize kernel overlap patterns and verify concurrent execution.
```shell
nsys profile --force-overwrite true -o nsys.nsys-rep --trace cuda <executable>
```
Our paper's results were obtained using the following hardware configurations:
- H100 NVL:
  - CUDA version 12.5
  - GPU driver version 555.42.06
  - Nsight Compute version 2024.2.1.0
  - Nsight Systems version 2024.2.3.38
- GeForce RTX 3090:
  - CUDA version 12.6
  - GPU driver version 560.35.03
  - Nsight Compute version 2024.3.1.0
  - Nsight Systems version 2024.4.2.133
If you use our benchmarks, please cite our paper:
```bibtex
@article{elvinger2025measuring,
  title={Measuring GPU utilization one level deeper},
  author={Elvinger, Paul and Strati, Foteini and Jerger, Natalie Enright and Klimovic, Ana},
  journal={arXiv preprint arXiv:2501.16909},
  year={2025}
}
```