feat(profiling): add support for pytorch profiling (#9154)
## What this PR does

- Patches the `torch.profiler.profile` class by installing our own `on_trace_ready` handler
- Adds GPU time/FLOPs/memory samples via the libdatadog interface in the `on_trace_ready` event handler
- Ensures that the libdd exporter is enabled whenever the pytorch collector is enabled
- Hides the functionality behind a feature flag that defaults to False
- Adds a changelog entry

## Open questions

- Is there a minimum Python version? The main constraint is that the pytorch profiler API we instrument was introduced in torch 1.8.1
(https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/).
Should we just document this, or disable the instrumentation when `torch.__version__` reports an older release?
- We still need documentation covering the required user configuration, conflicting features, and gotchas.

~~We should probably exclude experimental/beta collectors from the ALL template (is this blocking, given that we haven't done so in the past?)~~

## Testing Done
- Tested by running on an EC2 GPU instance
- Tested by running the `prof-pytorch` service in staging
- I'm not entirely sure whether we need unit tests for this feature, or where they would live. Would we want the unit test suite to depend on torch? Tracing integrations may already have a solution for this.

## Checklist

- [x] Change(s) are motivated and described in the PR description
- [x] Testing strategy is described if automated tests are not included
in the PR
- [x] Risks are described (performance impact, potential for breakage,
maintainability)
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed or label `changelog/no-changelog` is set
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/))
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))
- [x] If this PR changes the public interface, I've notified
`@DataDog/apm-tees`.

## Reviewer Checklist

- [x] Title is accurate
- [x] All changes are related to the pull request's stated goal
- [x] Description motivates each change
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- [x] Testing strategy adequately addresses listed risks
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] Release note makes sense to a user of the library
- [x] Author has acknowledged and discussed the performance implications
of this PR as reported in the benchmarks PR comment
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: sanchda <[email protected]>
Co-authored-by: Peter Griggs <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
Co-authored-by: Emmett Butler <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
Co-authored-by: Taegyun Kim <[email protected]>
Co-authored-by: Daniel Schwartz-Narbonne <[email protected]>
8 people authored Dec 13, 2024
1 parent 1dd528c commit 00ec1f7
Showing 20 changed files with 770 additions and 68 deletions.
43 changes: 43 additions & 0 deletions .github/workflows/pytorch_gpu_tests.yml
@@ -0,0 +1,43 @@
name: Pytorch Unit Tests (with GPU)

on:
  push:
    branches:
      - 'main'
      - 'mq-working-branch**'
    paths:
      - 'ddtrace/profiling/collector/pytorch.py'
  pull_request:
    paths:
      - 'ddtrace/profiling/collector/pytorch.py'
  workflow_dispatch:

jobs:
  unit-tests:
    runs-on: APM-4-CORE-GPU-LINUX
    steps:
      - uses: actions/checkout@v4
        # Include all history and tags
        with:
          persist-credentials: false
          fetch-depth: 0

      - uses: actions/setup-python@v5
        name: Install Python
        with:
          python-version: '3.12'

      - uses: actions-rust-lang/setup-rust-toolchain@v1
      - name: Install latest stable toolchain and rustfmt
        run: rustup update stable && rustup default stable && rustup component add rustfmt clippy

      - name: Install hatch
        uses: pypa/hatch@install
        with:
          version: "1.12.0"

      - name: Install PyTorch
        run: pip install torch

      - name: Run tests
        run: hatch run profiling_pytorch:test
@@ -44,6 +44,9 @@ extern "C"
void ddup_push_release(Datadog::Sample* sample, int64_t release_time, int64_t count);
void ddup_push_alloc(Datadog::Sample* sample, int64_t size, int64_t count);
void ddup_push_heap(Datadog::Sample* sample, int64_t size);
void ddup_push_gpu_gputime(Datadog::Sample* sample, int64_t time, int64_t count);
void ddup_push_gpu_memory(Datadog::Sample* sample, int64_t mem, int64_t count);
void ddup_push_gpu_flops(Datadog::Sample* sample, int64_t flops, int64_t count);
void ddup_push_lock_name(Datadog::Sample* sample, std::string_view lock_name);
void ddup_push_threadinfo(Datadog::Sample* sample,
int64_t thread_id,
@@ -56,11 +59,13 @@ extern "C"
void ddup_push_trace_type(Datadog::Sample* sample, std::string_view trace_type);
void ddup_push_exceptioninfo(Datadog::Sample* sample, std::string_view exception_type, int64_t count);
void ddup_push_class_name(Datadog::Sample* sample, std::string_view class_name);
void ddup_push_gpu_device_name(Datadog::Sample*, std::string_view device_name);
void ddup_push_frame(Datadog::Sample* sample,
std::string_view _name,
std::string_view _filename,
uint64_t address,
int64_t line);
void ddup_push_absolute_ns(Datadog::Sample* sample, int64_t timestamp_ns);
void ddup_push_monotonic_ns(Datadog::Sample* sample, int64_t monotonic_ns);
void ddup_flush_sample(Datadog::Sample* sample);
// Stack v2 specific flush, which reverses the locations
@@ -45,7 +45,8 @@ namespace Datadog {
X(local_root_span_id, "local root span id") \
X(trace_type, "trace type") \
X(class_name, "class name") \
X(lock_name, "lock name")
X(lock_name, "lock name") \
X(gpu_device_name, "gpu device name")

#define X_ENUM(a, b) a,
#define X_STR(a, b) b,
@@ -100,6 +100,9 @@ class Sample
bool push_release(int64_t lock_time, int64_t count);
bool push_alloc(int64_t size, int64_t count);
bool push_heap(int64_t size);
bool push_gpu_gputime(int64_t time, int64_t count);
bool push_gpu_memory(int64_t size, int64_t count);
bool push_gpu_flops(int64_t flops, int64_t count);

// Adds metadata to sample
bool push_lock_name(std::string_view lock_name);
@@ -112,11 +115,15 @@ class Sample
bool push_exceptioninfo(std::string_view exception_type, int64_t count);
bool push_class_name(std::string_view class_name);
bool push_monotonic_ns(int64_t monotonic_ns);
bool push_absolute_ns(int64_t timestamp_ns);

// Interacts with static Sample state
bool is_timeline_enabled() const;
static void set_timeline(bool enabled);

// Pytorch GPU metadata
bool push_gpu_device_name(std::string_view device_name);

// Assumes frames are pushed in leaf-order
void push_frame(std::string_view name, // for ddog_prof_Function
std::string_view filename, // for ddog_prof_Function
11 changes: 10 additions & 1 deletion ddtrace/internal/datadog/profiling/dd_wrapper/include/types.hpp
@@ -11,7 +11,10 @@ enum SampleType : unsigned int
LockRelease = 1 << 4,
Allocation = 1 << 5,
Heap = 1 << 6,
All = CPU | Wall | Exception | LockAcquire | LockRelease | Allocation | Heap
GPUTime = 1 << 7,
GPUMemory = 1 << 8,
GPUFlops = 1 << 9,
All = CPU | Wall | Exception | LockAcquire | LockRelease | Allocation | Heap | GPUTime | GPUMemory | GPUFlops
};

// Every Sample object has a corresponding `values` vector, since libdatadog expects contiguous values per sample.
@@ -30,6 +33,12 @@ struct ValueIndex
unsigned short alloc_space;
unsigned short alloc_count;
unsigned short heap_space;
unsigned short gpu_time;
unsigned short gpu_count;
unsigned short gpu_alloc_space;
unsigned short gpu_alloc_count;
unsigned short gpu_flops;
unsigned short gpu_flops_samples; // Should be "count," but flops is already a count
};

} // namespace Datadog
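The new GPU types extend the `SampleType` bit mask and are OR'd into `All`, so the default template enables them (see the struck-through note in the description about whether experimental collectors belong there). The same flag arithmetic, sketched in Python; the values of the lower bits (`CPU` through `LockAcquire`) are inferred from the visible diff context:

```python
# Python mirror of the SampleType bit mask from types.hpp: each sample
# type is one bit, and ALL is the OR of every flag, GPU types included.
import enum

class SampleType(enum.IntFlag):
    CPU = 1 << 0          # lower four bits assumed from context
    Wall = 1 << 1
    Exception = 1 << 2
    LockAcquire = 1 << 3
    LockRelease = 1 << 4
    Allocation = 1 << 5
    Heap = 1 << 6
    GPUTime = 1 << 7
    GPUMemory = 1 << 8
    GPUFlops = 1 << 9

ALL = (SampleType.CPU | SampleType.Wall | SampleType.Exception
       | SampleType.LockAcquire | SampleType.LockRelease
       | SampleType.Allocation | SampleType.Heap
       | SampleType.GPUTime | SampleType.GPUMemory | SampleType.GPUFlops)
```

A collector tests membership with a bitwise AND, exactly as the `0U != (type_mask & SampleType::GPUTime)` checks in `profile.cpp` do.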
@@ -193,6 +193,24 @@ ddup_push_heap(Datadog::Sample* sample, int64_t size) // cppcheck-suppress unusedFunction
sample->push_heap(size);
}

void
ddup_push_gpu_gputime(Datadog::Sample* sample, int64_t time, int64_t count) // cppcheck-suppress unusedFunction
{
sample->push_gpu_gputime(time, count);
}

void
ddup_push_gpu_memory(Datadog::Sample* sample, int64_t size, int64_t count) // cppcheck-suppress unusedFunction
{
sample->push_gpu_memory(size, count);
}

void
ddup_push_gpu_flops(Datadog::Sample* sample, int64_t flops, int64_t count) // cppcheck-suppress unusedFunction
{
sample->push_gpu_flops(flops, count);
}

void
ddup_push_lock_name(Datadog::Sample* sample, std::string_view lock_name) // cppcheck-suppress unusedFunction
{
@@ -252,6 +270,12 @@ ddup_push_class_name(Datadog::Sample* sample, std::string_view class_name) // cppcheck-suppress unusedFunction
sample->push_class_name(class_name);
}

void
ddup_push_gpu_device_name(Datadog::Sample* sample, std::string_view gpu_device_name) // cppcheck-suppress unusedFunction
{
sample->push_gpu_device_name(gpu_device_name);
}

void
ddup_push_frame(Datadog::Sample* sample, // cppcheck-suppress unusedFunction
std::string_view _name,
@@ -262,6 +286,12 @@ ddup_push_frame(Datadog::Sample* sample, // cppcheck-suppress unusedFunction
sample->push_frame(_name, _filename, address, line);
}

void
ddup_push_absolute_ns(Datadog::Sample* sample, int64_t timestamp_ns) // cppcheck-suppress unusedFunction
{
sample->push_absolute_ns(timestamp_ns);
}

void
ddup_push_monotonic_ns(Datadog::Sample* sample, int64_t monotonic_ns) // cppcheck-suppress unusedFunction
{
17 changes: 17 additions & 0 deletions ddtrace/internal/datadog/profiling/dd_wrapper/src/profile.cpp
@@ -89,6 +89,23 @@ Datadog::Profile::setup_samplers()
if (0U != (type_mask & SampleType::Heap)) {
val_idx.heap_space = get_value_idx("heap-space", "bytes");
}
if (0U != (type_mask & SampleType::GPUTime)) {
val_idx.gpu_time = get_value_idx("gpu-time", "nanoseconds");
val_idx.gpu_count = get_value_idx("gpu-samples", "count");
}
if (0U != (type_mask & SampleType::GPUMemory)) {
// In the backend the unit is called 'gpu-space', but maybe for consistency
// it should be gpu-alloc-space
// gpu-alloc-samples may be unused, but it's passed along for scaling purposes
val_idx.gpu_alloc_space = get_value_idx("gpu-space", "bytes");
val_idx.gpu_alloc_count = get_value_idx("gpu-alloc-samples", "count");
}
if (0U != (type_mask & SampleType::GPUFlops)) {
// Technically "FLOPS" is a unit, but we call it a 'count' because no
// other profiler uses it as a unit.
val_idx.gpu_flops = get_value_idx("gpu-flops", "count");
val_idx.gpu_flops_samples = get_value_idx("gpu-flops-samples", "count");
}

// Whatever the first sampler happens to be is the default "period" for the profile
// The value of 1 is a pointless default.
58 changes: 58 additions & 0 deletions ddtrace/internal/datadog/profiling/dd_wrapper/src/sample.cpp
@@ -262,6 +262,42 @@ Datadog::Sample::push_heap(int64_t size)
return false;
}

bool
Datadog::Sample::push_gpu_gputime(int64_t time, int64_t count)
{
if (0U != (type_mask & SampleType::GPUTime)) {
values[profile_state.val().gpu_time] += time * count;
values[profile_state.val().gpu_count] += count;
return true;
}
std::cout << "bad push gpu" << std::endl;
return false;
}

bool
Datadog::Sample::push_gpu_memory(int64_t size, int64_t count)
{
if (0U != (type_mask & SampleType::GPUMemory)) {
values[profile_state.val().gpu_alloc_space] += size * count;
values[profile_state.val().gpu_alloc_count] += count;
return true;
}
std::cout << "bad push gpu memory" << std::endl;
return false;
}

bool
Datadog::Sample::push_gpu_flops(int64_t size, int64_t count)
{
if (0U != (type_mask & SampleType::GPUFlops)) {
values[profile_state.val().gpu_flops] += size * count;
values[profile_state.val().gpu_flops_samples] += count;
return true;
}
std::cout << "bad push gpu flops" << std::endl;
return false;
}

bool
Datadog::Sample::push_lock_name(std::string_view lock_name)
{
@@ -351,6 +387,28 @@ Datadog::Sample::push_class_name(std::string_view class_name)
return true;
}

bool
Datadog::Sample::push_gpu_device_name(std::string_view device_name)
{
if (!push_label(ExportLabelKey::gpu_device_name, device_name)) {
std::cout << "bad push" << std::endl;
return false;
}
return true;
}

bool
Datadog::Sample::push_absolute_ns(int64_t _timestamp_ns)
{
// If timeline is not enabled, then this is a no-op
if (is_timeline_enabled()) {
endtime_ns = _timestamp_ns;
}

return true;
}


bool
Datadog::Sample::push_monotonic_ns(int64_t _monotonic_ns)
{
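Each `push_gpu_*` function above follows the same aggregation rule: one slot accumulates `value * count` while a companion slot accumulates the raw `count`, so downstream consumers can recover a per-event average. A sketch of that rule in Python (the class name is invented for illustration):

```python
# Sketch of the aggregation rule from push_gpu_gputime in sample.cpp:
# store value pre-scaled by count, plus the raw count, per sample.

class GpuTimeSlot:
    def __init__(self):
        self.gpu_time = 0    # nanoseconds, pre-scaled by count
        self.gpu_count = 0   # number of aggregated events

    def push_gpu_gputime(self, time_ns, count):
        self.gpu_time += time_ns * count
        self.gpu_count += count

    def mean_event_ns(self):
        # The backend can divide the two slots to get an average
        return self.gpu_time / self.gpu_count if self.gpu_count else 0.0
```

This mirrors why `gpu-alloc-samples` is "passed along for scaling purposes" even if unused directly: without the count, the scaled value alone is ambiguous.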
25 changes: 15 additions & 10 deletions ddtrace/internal/datadog/profiling/ddup/_ddup.pyi
@@ -20,19 +20,24 @@ def start() -> None: ...
def upload(tracer: Optional[Tracer]) -> None: ...

class SampleHandle:
def push_cputime(self, value: int, count: int) -> None: ...
def push_walltime(self, value: int, count: int) -> None: ...
def flush_sample(self) -> None: ...
def push_absolute_ns(self, timestamp_ns: int) -> None: ...
def push_acquire(self, value: int, count: int) -> None: ...
def push_release(self, value: int, count: int) -> None: ...
def push_alloc(self, value: int, count: int) -> None: ...
def push_class_name(self, class_name: StringType) -> None: ...
def push_cputime(self, value: int, count: int) -> None: ...
def push_exceptioninfo(self, exc_type: Union[None, bytes, str, type], count: int) -> None: ...
def push_frame(self, name: StringType, filename: StringType, address: int, line: int) -> None: ...
def push_gpu_device_name(self, device_name: StringType) -> None: ...
def push_gpu_flops(self, value: int, count: int) -> None: ...
def push_gpu_gputime(self, value: int, count: int) -> None: ...
def push_gpu_memory(self, value: int, count: int) -> None: ...
def push_heap(self, value: int) -> None: ...
def push_lock_name(self, lock_name: StringType) -> None: ...
def push_frame(self, name: StringType, filename: StringType, address: int, line: int) -> None: ...
def push_threadinfo(self, thread_id: int, thread_native_id: int, thread_name: StringType) -> None: ...
def push_monotonic_ns(self, monotonic_ns: int) -> None: ...
def push_release(self, value: int, count: int) -> None: ...
def push_span(self, span: Optional[Span]) -> None: ...
def push_task_id(self, task_id: Optional[int]) -> None: ...
def push_task_name(self, task_name: StringType) -> None: ...
def push_exceptioninfo(self, exc_type: Union[None, bytes, str, type], count: int) -> None: ...
def push_class_name(self, class_name: StringType) -> None: ...
def push_span(self, span: Optional[Span]) -> None: ...
def push_monotonic_ns(self, monotonic_ns: int) -> None: ...
def flush_sample(self) -> None: ...
def push_threadinfo(self, thread_id: int, thread_native_id: int, thread_name: StringType) -> None: ...
def push_walltime(self, value: int, count: int) -> None: ...
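Pulled together, an `on_trace_ready`-style handler would drive the `SampleHandle` interface above roughly as follows. This is a hedged sketch: `export_gpu_events`, the event dicts, and their fields are stand-ins for the real torch profiler event schema, and the handle factory stands in for however ddup starts a sample:

```python
# Illustrative use of the SampleHandle stub: one sample per GPU event,
# tagged with the device, timed, given a frame, and flushed.
# `start_sample` is any callable returning a SampleHandle-like object.

def export_gpu_events(start_sample, events):
    for ev in events:
        handle = start_sample()
        handle.push_gpu_device_name(ev["device"])
        handle.push_gpu_gputime(ev["gpu_time_ns"], 1)
        handle.push_frame(ev["name"], ev["filename"], 0, ev["line"])
        handle.push_absolute_ns(ev["end_ns"])  # timeline placement
        handle.flush_sample()                  # consumes the handle
```

Note the ordering: per the comment in `_ddup.pyx`, flushing consumes the sample, so `flush_sample` must come last.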
37 changes: 37 additions & 0 deletions ddtrace/internal/datadog/profiling/ddup/_ddup.pyx
@@ -68,6 +68,9 @@ cdef extern from "ddup_interface.hpp":
void ddup_push_release(Sample *sample, int64_t release_time, int64_t count)
void ddup_push_alloc(Sample *sample, int64_t size, int64_t count)
void ddup_push_heap(Sample *sample, int64_t size)
void ddup_push_gpu_gputime(Sample *sample, int64_t gputime, int64_t count)
void ddup_push_gpu_memory(Sample *sample, int64_t size, int64_t count)
void ddup_push_gpu_flops(Sample *sample, int64_t flops, int64_t count)
void ddup_push_lock_name(Sample *sample, string_view lock_name)
void ddup_push_threadinfo(Sample *sample, int64_t thread_id, int64_t thread_native_id, string_view thread_name)
void ddup_push_task_id(Sample *sample, int64_t task_id)
@@ -77,8 +80,10 @@ cdef extern from "ddup_interface.hpp":
void ddup_push_trace_type(Sample *sample, string_view trace_type)
void ddup_push_exceptioninfo(Sample *sample, string_view exception_type, int64_t count)
void ddup_push_class_name(Sample *sample, string_view class_name)
void ddup_push_gpu_device_name(Sample *sample, string_view device_name)
void ddup_push_frame(Sample *sample, string_view _name, string_view _filename, uint64_t address, int64_t line)
void ddup_push_monotonic_ns(Sample *sample, int64_t monotonic_ns)
void ddup_push_absolute_ns(Sample *sample, int64_t monotonic_ns)
void ddup_flush_sample(Sample *sample)
void ddup_drop_sample(Sample *sample)

@@ -302,6 +307,18 @@ cdef call_ddup_push_class_name(Sample* sample, class_name: StringType):
if utf8_data != NULL:
ddup_push_class_name(sample, string_view(utf8_data, utf8_size))

cdef call_ddup_push_gpu_device_name(Sample* sample, device_name: StringType):
if not device_name:
return
if isinstance(device_name, bytes):
ddup_push_gpu_device_name(sample, string_view(<const char*>device_name, len(device_name)))
return
cdef const char* utf8_data
cdef Py_ssize_t utf8_size
utf8_data = PyUnicode_AsUTF8AndSize(device_name, &utf8_size)
if utf8_data != NULL:
ddup_push_gpu_device_name(sample, string_view(utf8_data, utf8_size))

cdef call_ddup_push_trace_type(Sample* sample, trace_type: StringType):
if not trace_type:
return
@@ -448,6 +465,18 @@ cdef class SampleHandle:
if self.ptr is not NULL:
ddup_push_heap(self.ptr, clamp_to_int64_unsigned(value))

def push_gpu_gputime(self, value: int, count: int) -> None:
if self.ptr is not NULL:
ddup_push_gpu_gputime(self.ptr, clamp_to_int64_unsigned(value), clamp_to_int64_unsigned(count))

def push_gpu_memory(self, value: int, count: int) -> None:
if self.ptr is not NULL:
ddup_push_gpu_memory(self.ptr, clamp_to_int64_unsigned(value), clamp_to_int64_unsigned(count))

def push_gpu_flops(self, value: int, count: int) -> None:
if self.ptr is not NULL:
ddup_push_gpu_flops(self.ptr, clamp_to_int64_unsigned(value), clamp_to_int64_unsigned(count))

def push_lock_name(self, lock_name: StringType) -> None:
if self.ptr is not NULL:
call_ddup_push_lock_name(self.ptr, lock_name)
@@ -494,6 +523,10 @@ cdef class SampleHandle:
if self.ptr is not NULL:
call_ddup_push_class_name(self.ptr, class_name)

def push_gpu_device_name(self, device_name: StringType) -> None:
if self.ptr is not NULL:
call_ddup_push_gpu_device_name(self.ptr, device_name)

def push_span(self, span: Optional[Span]) -> None:
if self.ptr is NULL:
return
@@ -512,6 +545,10 @@ cdef class SampleHandle:
if self.ptr is not NULL:
ddup_push_monotonic_ns(self.ptr, <int64_t>monotonic_ns)

def push_absolute_ns(self, timestamp_ns: int) -> None:
if self.ptr is not NULL:
ddup_push_absolute_ns(self.ptr, <int64_t>timestamp_ns)

def flush_sample(self) -> None:
# Flushing the sample consumes it. The user will no longer be able to use
# this handle after flushing it.