CUDA benchmarks for measuring GPU utilization and interference

eth-easl/gpu-util-interference

Measuring GPU utilization one level deeper

This repository contains the supporting code for our paper Measuring GPU utilization one level deeper. We present a comprehensive suite of CUDA benchmarks designed to identify and measure interference across various GPU resources.

Repository Structure

The codebase is organized into the following primary directories:

  • gpu_util_bench_lib/: A shared library containing CUDA kernels and helper functions for kernel launching
  • inter_sm/: Benchmarks for measuring interference and utilization across Streaming Multiprocessors (SM) (Paper section 4.1)
  • intra_sm/: Benchmarks for measuring interference and utilization within SMs (Paper section 4.2)
  • mm_pytroch/: Example demonstrating interference patterns on production ML kernels (Paper section 4.3)
  • pitfalls/: Examples illustrating common limitations in current interference prediction approaches (Paper section 3)

Requirements and Installation

Prerequisites

The benchmarks require the following dependencies:

  • CMake (version >= 3.22)
  • C++17 or later
  • CUDA toolkit (validated with CUDA 12.5 and 12.6)
  • NVIDIA GPU driver (can be installed alongside CUDA toolkit)

Note: Our benchmarks currently do not support AMD GPUs.
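
A quick way to verify these dependencies are on PATH before building (a sketch; check_tools is a hypothetical helper, and it only checks presence, not versions):

```shell
#!/bin/sh
# Report which of the required tools are installed. Presence on PATH
# does not guarantee a compatible version; check versions separately,
# e.g. with `cmake --version` and `nvcc --version`.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
        fi
    done
}

check_tools cmake nvcc nvidia-smi
```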

Compilation Instructions

  1. Determine your GPU's Compute Capability using nvidia-smi:

    nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1
  2. Update the Compute Capability in CMakeLists.txt:

    set(CMAKE_CUDA_ARCHITECTURES 90)  # Modify based on your GPU
  3. Build the repository:

    mkdir build && cd build
    cmake ..
    cmake --build .
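
Steps 1 and 2 can be combined: nvidia-smi reports the compute capability in dotted form (e.g. 9.0 for H100), while CMAKE_CUDA_ARCHITECTURES expects the digits with the dot removed (90). A sketch of the conversion (the COMPUTE_CAP fallback value is illustrative only):

```shell
#!/bin/sh
# Convert nvidia-smi's dotted compute capability (e.g. "9.0") into the
# integer form CMake expects (e.g. "90"). On a real machine COMPUTE_CAP
# would come from:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1
COMPUTE_CAP="${COMPUTE_CAP:-9.0}"        # illustrative fallback
CUDA_ARCH=$(printf '%s' "$COMPUTE_CAP" | tr -d '.')
echo "$CUDA_ARCH"
```

The resulting value replaces the 90 in the set(CMAKE_CUDA_ARCHITECTURES 90) line of CMakeLists.txt.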

Running experiments

Benchmark Execution

Each directory contains detailed instructions for executing the benchmarks and reproducing paper experiments. The provided scripts are optimized for the H100 GPU. Users with different GPU architectures may need to adjust script parameters accordingly.

Important: Before running experiments, set the BUILD_DIR environment variable to match your build directory.

export BUILD_DIR=$HOME/gpu-util-interference/build # update based on location of your build directory
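
A small guard can catch a mistyped BUILD_DIR before any benchmark script runs (a sketch; check_build_dir is a hypothetical helper):

```shell
#!/bin/sh
# A configured CMake build tree always contains CMakeCache.txt, so its
# presence is a cheap proxy for "cmake has been run here".
check_build_dir() {
    [ -f "$1/CMakeCache.txt" ]
}

BUILD_DIR="${BUILD_DIR:-$HOME/gpu-util-interference/build}"
if check_build_dir "$BUILD_DIR"; then
    echo "using build directory: $BUILD_DIR"
else
    echo "warning: no CMakeCache.txt in $BUILD_DIR; run cmake first" >&2
fi
```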

Performance Analysis

NCU Metrics Collection

To gather detailed performance metrics for isolated kernel execution, use the Nsight Compute Profiler. When profiling with NCU, specify mode=0 in the scripts:

ncu -f -o ncu.ncu-rep --set full <executable>

CUDA Trace Collection

For analyzing kernel co-location scenarios, we recommend collecting CUDA traces using the Nsight Systems Profiler to visualize kernel overlap patterns and verify concurrent execution.

nsys profile --force-overwrite true -o nsys.nsys-rep --trace cuda <executable>

Experimental Setup

Our paper's results were obtained using the following hardware configurations:

  • H100 NVL:
    • CUDA version 12.5
    • GPU driver version 555.42.06
    • Nsight Compute version 2024.2.1.0
    • Nsight Systems version 2024.2.3.38
  • GeForce RTX3090:
    • CUDA version 12.6
    • GPU driver version 560.35.03
    • Nsight Compute version 2024.3.1.0
    • Nsight Systems version 2024.4.2.133

Paper

If you use our benchmarks, please cite our paper:

@article{elvinger2025measuring,
  title={Measuring GPU utilization one level deeper},
  author={Elvinger, Paul and Strati, Foteini and Jerger, Natalie Enright and Klimovic, Ana},
  journal={arXiv preprint arXiv:2501.16909},
  year={2025}
}
