feat(profiling): add support for pytorch profiling (#9154)
## What this PR does

- Patches the `torch.profiler.profile` class by installing our own `on_trace_ready` handler
- Adds GPU time/FLOPs/memory samples via the libdatadog interface in the `on_trace_ready` event handler
- Ensures that the libdd exporter is enabled whenever the pytorch collector is enabled
- Hides the functionality behind a feature flag that defaults to False
- Adds a changelog entry

## Open questions

- Is there a minimum Python version? The main constraint is that the pytorch profiler API we instrument was introduced in torch 1.8.1
(https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/).
Should we just document this, or disable the instrumentation when `torch.__version__` reports an older release?
- We still need documentation covering the required user configuration, conflicting features, and gotchas.

~~We should probably exclude experimental/beta collectors from the ALL template (is this blocking, given that we haven't done so in the past?)~~

## Testing Done
- Tested by running on an EC2 GPU instance
- Tested by running the `prof-pytorch` service in staging
- I'm not entirely sure whether we need unit tests for this feature, or where they would live. Would we want the unit test suite to depend on torch? Tracing integrations may already have a solution for this.

## Checklist

- [x] Change(s) are motivated and described in the PR description
- [x] Testing strategy is described if automated tests are not included
in the PR
- [x] Risks are described (performance impact, potential for breakage,
maintainability)
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed or label `changelog/no-changelog` is set
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/))
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))
- [x] If this PR changes the public interface, I've notified
`@DataDog/apm-tees`.

## Reviewer Checklist

- [x] Title is accurate
- [x] All changes are related to the pull request's stated goal
- [x] Description motivates each change
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- [x] Testing strategy adequately addresses listed risks
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] Release note makes sense to a user of the library
- [x] Author has acknowledged and discussed the performance implications
of this PR as reported in the benchmarks PR comment
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: sanchda <[email protected]>
Co-authored-by: Peter Griggs <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
Co-authored-by: Emmett Butler <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
Co-authored-by: Taegyun Kim <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
8 people authored Dec 13, 2024
1 parent 1dd528c commit 00ec1f7
Showing 20 changed files with 770 additions and 68 deletions.
43 changes: 43 additions & 0 deletions .github/workflows/pytorch_gpu_tests.yml
@@ -0,0 +1,43 @@
name: Pytorch Unit Tests (with GPU)

on:
  push:
    branches:
      - 'main'
      - 'mq-working-branch**'
    paths:
      - 'ddtrace/profiling/collector/pytorch.py'
  pull_request:
    paths:
      - 'ddtrace/profiling/collector/pytorch.py'
  workflow_dispatch:

jobs:
  unit-tests:
    runs-on: APM-4-CORE-GPU-LINUX
    steps:
      - uses: actions/checkout@v4
        # Include all history and tags
        with:
          persist-credentials: false
          fetch-depth: 0

      - uses: actions/setup-python@v5
        name: Install Python
        with:
          python-version: '3.12'

      - uses: actions-rust-lang/setup-rust-toolchain@v1
      - name: Install latest stable toolchain and rustfmt
        run: rustup update stable && rustup default stable && rustup component add rustfmt clippy

      - name: Install hatch
        uses: pypa/hatch@install
        with:
          version: "1.12.0"

      - name: Install PyTorch
        run: pip install torch

      - name: Run tests
        run: hatch run profiling_pytorch:test
@@ -44,6 +44,9 @@ extern "C"
void ddup_push_release(Datadog::Sample* sample, int64_t release_time, int64_t count);
void ddup_push_alloc(Datadog::Sample* sample, int64_t size, int64_t count);
void ddup_push_heap(Datadog::Sample* sample, int64_t size);
void ddup_push_gpu_gputime(Datadog::Sample* sample, int64_t time, int64_t count);
void ddup_push_gpu_memory(Datadog::Sample* sample, int64_t mem, int64_t count);
void ddup_push_gpu_flops(Datadog::Sample* sample, int64_t flops, int64_t count);
void ddup_push_lock_name(Datadog::Sample* sample, std::string_view lock_name);
void ddup_push_threadinfo(Datadog::Sample* sample,
int64_t thread_id,
@@ -56,11 +59,13 @@ extern "C"
void ddup_push_trace_type(Datadog::Sample* sample, std::string_view trace_type);
void ddup_push_exceptioninfo(Datadog::Sample* sample, std::string_view exception_type, int64_t count);
void ddup_push_class_name(Datadog::Sample* sample, std::string_view class_name);
void ddup_push_gpu_device_name(Datadog::Sample*, std::string_view device_name);
void ddup_push_frame(Datadog::Sample* sample,
std::string_view _name,
std::string_view _filename,
uint64_t address,
int64_t line);
void ddup_push_absolute_ns(Datadog::Sample* sample, int64_t timestamp_ns);
void ddup_push_monotonic_ns(Datadog::Sample* sample, int64_t monotonic_ns);
void ddup_flush_sample(Datadog::Sample* sample);
// Stack v2 specific flush, which reverses the locations
@@ -45,7 +45,8 @@ namespace Datadog {
X(local_root_span_id, "local root span id") \
X(trace_type, "trace type") \
X(class_name, "class name") \
X(lock_name, "lock name")
X(lock_name, "lock name") \
X(gpu_device_name, "gpu device name")

#define X_ENUM(a, b) a,
#define X_STR(a, b) b,
@@ -100,6 +100,9 @@ class Sample
bool push_release(int64_t lock_time, int64_t count);
bool push_alloc(int64_t size, int64_t count);
bool push_heap(int64_t size);
bool push_gpu_gputime(int64_t time, int64_t count);
bool push_gpu_memory(int64_t size, int64_t count);
bool push_gpu_flops(int64_t flops, int64_t count);

// Adds metadata to sample
bool push_lock_name(std::string_view lock_name);
@@ -112,11 +115,15 @@ class Sample
bool push_exceptioninfo(std::string_view exception_type, int64_t count);
bool push_class_name(std::string_view class_name);
bool push_monotonic_ns(int64_t monotonic_ns);
bool push_absolute_ns(int64_t timestamp_ns);

// Interacts with static Sample state
bool is_timeline_enabled() const;
static void set_timeline(bool enabled);

// Pytorch GPU metadata
bool push_gpu_device_name(std::string_view device_name);

// Assumes frames are pushed in leaf-order
void push_frame(std::string_view name, // for ddog_prof_Function
std::string_view filename, // for ddog_prof_Function
11 changes: 10 additions & 1 deletion ddtrace/internal/datadog/profiling/dd_wrapper/include/types.hpp
@@ -11,7 +11,10 @@ enum SampleType : unsigned int
LockRelease = 1 << 4,
Allocation = 1 << 5,
Heap = 1 << 6,
All = CPU | Wall | Exception | LockAcquire | LockRelease | Allocation | Heap
GPUTime = 1 << 7,
GPUMemory = 1 << 8,
GPUFlops = 1 << 9,
All = CPU | Wall | Exception | LockAcquire | LockRelease | Allocation | Heap | GPUTime | GPUMemory | GPUFlops
};

// Every Sample object has a corresponding `values` vector, since libdatadog expects contiguous values per sample.
@@ -30,6 +33,12 @@ struct ValueIndex
unsigned short alloc_space;
unsigned short alloc_count;
unsigned short heap_space;
unsigned short gpu_time;
unsigned short gpu_count;
unsigned short gpu_alloc_space;
unsigned short gpu_alloc_count;
unsigned short gpu_flops;
unsigned short gpu_flops_samples; // Should be "count," but flops is already a count
};

} // namespace Datadog
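The new GPU types extend the `SampleType` bit mask and are OR'd into `All`, so the default template enables them (see the struck-through note in the description about whether experimental collectors belong there). The same flag arithmetic, sketched in Python; the values of the lower bits (`CPU` through `LockAcquire`) are inferred from the visible diff context:

```python
# Python mirror of the SampleType bit mask from types.hpp: each sample
# type is one bit, and ALL is the OR of every flag, GPU types included.
import enum

class SampleType(enum.IntFlag):
    CPU = 1 << 0          # lower four bits assumed from context
    Wall = 1 << 1
    Exception = 1 << 2
    LockAcquire = 1 << 3
    LockRelease = 1 << 4
    Allocation = 1 << 5
    Heap = 1 << 6
    GPUTime = 1 << 7
    GPUMemory = 1 << 8
    GPUFlops = 1 << 9

ALL = (SampleType.CPU | SampleType.Wall | SampleType.Exception
       | SampleType.LockAcquire | SampleType.LockRelease
       | SampleType.Allocation | SampleType.Heap
       | SampleType.GPUTime | SampleType.GPUMemory | SampleType.GPUFlops)
```

A collector tests membership with a bitwise AND, exactly as the `0U != (type_mask & SampleType::GPUTime)` checks in `profile.cpp` do.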
@@ -193,6 +193,24 @@ ddup_push_heap(Datadog::Sample* sample, int64_t size) // cppcheck-suppress unusedFunction
sample->push_heap(size);
}

void
ddup_push_gpu_gputime(Datadog::Sample* sample, int64_t time, int64_t count) // cppcheck-suppress unusedFunction
{
sample->push_gpu_gputime(time, count);
}

void
ddup_push_gpu_memory(Datadog::Sample* sample, int64_t size, int64_t count) // cppcheck-suppress unusedFunction
{
sample->push_gpu_memory(size, count);
}

void
ddup_push_gpu_flops(Datadog::Sample* sample, int64_t flops, int64_t count) // cppcheck-suppress unusedFunction
{
sample->push_gpu_flops(flops, count);
}

void
ddup_push_lock_name(Datadog::Sample* sample, std::string_view lock_name) // cppcheck-suppress unusedFunction
{
@@ -252,6 +270,12 @@ ddup_push_class_name(Datadog::Sample* sample, std::string_view class_name) // cppcheck-suppress unusedFunction
sample->push_class_name(class_name);
}

void
ddup_push_gpu_device_name(Datadog::Sample* sample, std::string_view gpu_device_name) // cppcheck-suppress unusedFunction
{
sample->push_gpu_device_name(gpu_device_name);
}

void
ddup_push_frame(Datadog::Sample* sample, // cppcheck-suppress unusedFunction
std::string_view _name,
@@ -262,6 +286,12 @@ ddup_push_frame(Datadog::Sample* sample, // cppcheck-suppress unusedFunction
sample->push_frame(_name, _filename, address, line);
}

void
ddup_push_absolute_ns(Datadog::Sample* sample, int64_t timestamp_ns) // cppcheck-suppress unusedFunction
{
sample->push_absolute_ns(timestamp_ns);
}

void
ddup_push_monotonic_ns(Datadog::Sample* sample, int64_t monotonic_ns) // cppcheck-suppress unusedFunction
{
17 changes: 17 additions & 0 deletions ddtrace/internal/datadog/profiling/dd_wrapper/src/profile.cpp
@@ -89,6 +89,23 @@ Datadog::Profile::setup_samplers()
if (0U != (type_mask & SampleType::Heap)) {
val_idx.heap_space = get_value_idx("heap-space", "bytes");
}
if (0U != (type_mask & SampleType::GPUTime)) {
val_idx.gpu_time = get_value_idx("gpu-time", "nanoseconds");
val_idx.gpu_count = get_value_idx("gpu-samples", "count");
}
if (0U != (type_mask & SampleType::GPUMemory)) {
// In the backend the unit is called 'gpu-space', but maybe for consistency
// it should be gpu-alloc-space
// gpu-alloc-samples may be unused, but it's passed along for scaling purposes
val_idx.gpu_alloc_space = get_value_idx("gpu-space", "bytes");
val_idx.gpu_alloc_count = get_value_idx("gpu-alloc-samples", "count");
}
if (0U != (type_mask & SampleType::GPUFlops)) {
// Technically "FLOPS" is a unit, but we call it a 'count' because no
// other profiler uses it as a unit.
val_idx.gpu_flops = get_value_idx("gpu-flops", "count");
val_idx.gpu_flops_samples = get_value_idx("gpu-flops-samples", "count");
}

// Whatever the first sampler happens to be is the default "period" for the profile
// The value of 1 is a pointless default.
58 changes: 58 additions & 0 deletions ddtrace/internal/datadog/profiling/dd_wrapper/src/sample.cpp
@@ -262,6 +262,42 @@ Datadog::Sample::push_heap(int64_t size)
return false;
}

bool
Datadog::Sample::push_gpu_gputime(int64_t time, int64_t count)
{
if (0U != (type_mask & SampleType::GPUTime)) {
values[profile_state.val().gpu_time] += time * count;
values[profile_state.val().gpu_count] += count;
return true;
}
std::cout << "bad push gpu" << std::endl;
return false;
}

bool
Datadog::Sample::push_gpu_memory(int64_t size, int64_t count)
{
if (0U != (type_mask & SampleType::GPUMemory)) {
values[profile_state.val().gpu_alloc_space] += size * count;
values[profile_state.val().gpu_alloc_count] += count;
return true;
}
std::cout << "bad push gpu memory" << std::endl;
return false;
}

bool
Datadog::Sample::push_gpu_flops(int64_t size, int64_t count)
{
if (0U != (type_mask & SampleType::GPUFlops)) {
values[profile_state.val().gpu_flops] += size * count;
values[profile_state.val().gpu_flops_samples] += count;
return true;
}
std::cout << "bad push gpu flops" << std::endl;
return false;
}

bool
Datadog::Sample::push_lock_name(std::string_view lock_name)
{
@@ -351,6 +387,28 @@ Datadog::Sample::push_class_name(std::string_view class_name)
return true;
}

bool
Datadog::Sample::push_gpu_device_name(std::string_view device_name)
{
if (!push_label(ExportLabelKey::gpu_device_name, device_name)) {
std::cout << "bad push" << std::endl;
return false;
}
return true;
}

bool
Datadog::Sample::push_absolute_ns(int64_t _timestamp_ns)
{
// If timeline is not enabled, then this is a no-op
if (is_timeline_enabled()) {
endtime_ns = _timestamp_ns;
}

return true;
}


bool
Datadog::Sample::push_monotonic_ns(int64_t _monotonic_ns)
{
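Each `push_gpu_*` function above follows the same aggregation rule: one slot accumulates `value * count` while a companion slot accumulates the raw `count`, so downstream consumers can recover a per-event average. A sketch of that rule in Python (the class name is invented for illustration):

```python
# Sketch of the aggregation rule from push_gpu_gputime in sample.cpp:
# store value pre-scaled by count, plus the raw count, per sample.

class GpuTimeSlot:
    def __init__(self):
        self.gpu_time = 0    # nanoseconds, pre-scaled by count
        self.gpu_count = 0   # number of aggregated events

    def push_gpu_gputime(self, time_ns, count):
        self.gpu_time += time_ns * count
        self.gpu_count += count

    def mean_event_ns(self):
        # The backend can divide the two slots to get an average
        return self.gpu_time / self.gpu_count if self.gpu_count else 0.0
```

This mirrors why `gpu-alloc-samples` is "passed along for scaling purposes" even if unused directly: without the count, the scaled value alone is ambiguous.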
25 changes: 15 additions & 10 deletions ddtrace/internal/datadog/profiling/ddup/_ddup.pyi
@@ -20,19 +20,24 @@ def start() -> None: ...
def upload(tracer: Optional[Tracer]) -> None: ...

class SampleHandle:
def push_cputime(self, value: int, count: int) -> None: ...
def push_walltime(self, value: int, count: int) -> None: ...
def flush_sample(self) -> None: ...
def push_absolute_ns(self, timestamp_ns: int) -> None: ...
def push_acquire(self, value: int, count: int) -> None: ...
def push_release(self, value: int, count: int) -> None: ...
def push_alloc(self, value: int, count: int) -> None: ...
def push_class_name(self, class_name: StringType) -> None: ...
def push_cputime(self, value: int, count: int) -> None: ...
def push_exceptioninfo(self, exc_type: Union[None, bytes, str, type], count: int) -> None: ...
def push_frame(self, name: StringType, filename: StringType, address: int, line: int) -> None: ...
def push_gpu_device_name(self, device_name: StringType) -> None: ...
def push_gpu_flops(self, value: int, count: int) -> None: ...
def push_gpu_gputime(self, value: int, count: int) -> None: ...
def push_gpu_memory(self, value: int, count: int) -> None: ...
def push_heap(self, value: int) -> None: ...
def push_lock_name(self, lock_name: StringType) -> None: ...
def push_frame(self, name: StringType, filename: StringType, address: int, line: int) -> None: ...
def push_threadinfo(self, thread_id: int, thread_native_id: int, thread_name: StringType) -> None: ...
def push_monotonic_ns(self, monotonic_ns: int) -> None: ...
def push_release(self, value: int, count: int) -> None: ...
def push_span(self, span: Optional[Span]) -> None: ...
def push_task_id(self, task_id: Optional[int]) -> None: ...
def push_task_name(self, task_name: StringType) -> None: ...
def push_exceptioninfo(self, exc_type: Union[None, bytes, str, type], count: int) -> None: ...
def push_class_name(self, class_name: StringType) -> None: ...
def push_span(self, span: Optional[Span]) -> None: ...
def push_monotonic_ns(self, monotonic_ns: int) -> None: ...
def flush_sample(self) -> None: ...
def push_threadinfo(self, thread_id: int, thread_native_id: int, thread_name: StringType) -> None: ...
def push_walltime(self, value: int, count: int) -> None: ...
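Pulled together, an `on_trace_ready`-style handler would drive the `SampleHandle` interface above roughly as follows. This is a hedged sketch: `export_gpu_events`, the event dicts, and their fields are stand-ins for the real torch profiler event schema, and the handle factory stands in for however ddup starts a sample:

```python
# Illustrative use of the SampleHandle stub: one sample per GPU event,
# tagged with the device, timed, given a frame, and flushed.
# `start_sample` is any callable returning a SampleHandle-like object.

def export_gpu_events(start_sample, events):
    for ev in events:
        handle = start_sample()
        handle.push_gpu_device_name(ev["device"])
        handle.push_gpu_gputime(ev["gpu_time_ns"], 1)
        handle.push_frame(ev["name"], ev["filename"], 0, ev["line"])
        handle.push_absolute_ns(ev["end_ns"])  # timeline placement
        handle.flush_sample()                  # consumes the handle
```

Note the ordering: per the comment in `_ddup.pyx`, flushing consumes the sample, so `flush_sample` must come last.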
37 changes: 37 additions & 0 deletions ddtrace/internal/datadog/profiling/ddup/_ddup.pyx
@@ -68,6 +68,9 @@ cdef extern from "ddup_interface.hpp":
void ddup_push_release(Sample *sample, int64_t release_time, int64_t count)
void ddup_push_alloc(Sample *sample, int64_t size, int64_t count)
void ddup_push_heap(Sample *sample, int64_t size)
void ddup_push_gpu_gputime(Sample *sample, int64_t gputime, int64_t count)
void ddup_push_gpu_memory(Sample *sample, int64_t size, int64_t count)
void ddup_push_gpu_flops(Sample *sample, int64_t flops, int64_t count)
void ddup_push_lock_name(Sample *sample, string_view lock_name)
void ddup_push_threadinfo(Sample *sample, int64_t thread_id, int64_t thread_native_id, string_view thread_name)
void ddup_push_task_id(Sample *sample, int64_t task_id)
@@ -77,8 +80,10 @@ cdef extern from "ddup_interface.hpp":
void ddup_push_trace_type(Sample *sample, string_view trace_type)
void ddup_push_exceptioninfo(Sample *sample, string_view exception_type, int64_t count)
void ddup_push_class_name(Sample *sample, string_view class_name)
void ddup_push_gpu_device_name(Sample *sample, string_view device_name)
void ddup_push_frame(Sample *sample, string_view _name, string_view _filename, uint64_t address, int64_t line)
void ddup_push_monotonic_ns(Sample *sample, int64_t monotonic_ns)
void ddup_push_absolute_ns(Sample *sample, int64_t monotonic_ns)
void ddup_flush_sample(Sample *sample)
void ddup_drop_sample(Sample *sample)

@@ -302,6 +307,18 @@ cdef call_ddup_push_class_name(Sample* sample, class_name: StringType):
if utf8_data != NULL:
ddup_push_class_name(sample, string_view(utf8_data, utf8_size))

cdef call_ddup_push_gpu_device_name(Sample* sample, device_name: StringType):
if not device_name:
return
if isinstance(device_name, bytes):
ddup_push_gpu_device_name(sample, string_view(<const char*>device_name, len(device_name)))
return
cdef const char* utf8_data
cdef Py_ssize_t utf8_size
utf8_data = PyUnicode_AsUTF8AndSize(device_name, &utf8_size)
if utf8_data != NULL:
ddup_push_gpu_device_name(sample, string_view(utf8_data, utf8_size))

cdef call_ddup_push_trace_type(Sample* sample, trace_type: StringType):
if not trace_type:
return
@@ -448,6 +465,18 @@ cdef class SampleHandle:
if self.ptr is not NULL:
ddup_push_heap(self.ptr, clamp_to_int64_unsigned(value))

def push_gpu_gputime(self, value: int, count: int) -> None:
if self.ptr is not NULL:
ddup_push_gpu_gputime(self.ptr, clamp_to_int64_unsigned(value), clamp_to_int64_unsigned(count))

def push_gpu_memory(self, value: int, count: int) -> None:
if self.ptr is not NULL:
ddup_push_gpu_memory(self.ptr, clamp_to_int64_unsigned(value), clamp_to_int64_unsigned(count))

def push_gpu_flops(self, value: int, count: int) -> None:
if self.ptr is not NULL:
ddup_push_gpu_flops(self.ptr, clamp_to_int64_unsigned(value), clamp_to_int64_unsigned(count))

def push_lock_name(self, lock_name: StringType) -> None:
if self.ptr is not NULL:
call_ddup_push_lock_name(self.ptr, lock_name)
@@ -494,6 +523,10 @@ cdef class SampleHandle:
if self.ptr is not NULL:
call_ddup_push_class_name(self.ptr, class_name)

def push_gpu_device_name(self, device_name: StringType) -> None:
if self.ptr is not NULL:
call_ddup_push_gpu_device_name(self.ptr, device_name)

def push_span(self, span: Optional[Span]) -> None:
if self.ptr is NULL:
return
@@ -512,6 +545,10 @@ cdef class SampleHandle:
if self.ptr is not NULL:
ddup_push_monotonic_ns(self.ptr, <int64_t>monotonic_ns)

def push_absolute_ns(self, timestamp_ns: int) -> None:
if self.ptr is not NULL:
ddup_push_absolute_ns(self.ptr, <int64_t>timestamp_ns)

def flush_sample(self) -> None:
# Flushing the sample consumes it. The user will no longer be able to use
# this handle after flushing it.