[UR] Add initial spec for async alloc entry points #2180

Draft · hdelan wants to merge 3 commits into main from async-alloc
Conversation

@hdelan (Contributor) commented Oct 8, 2024

First basic work-in-progress spec.

The github-actions bot added the loader, common, specification, experimental, and level-zero labels on Oct 8, 2024.
@hdelan force-pushed the async-alloc branch 3 times, most recently from de24f3d to 13392a0, on October 8, 2024 at 10:57.
@pbalcer (Contributor) left a comment

I'm having a hard time grasping how these functions are supposed to be used w.r.t. events.

With in-order queues, it's simple enough:

```c
void *ptr;
urEnqueueUSMDeviceAllocExp(..., &ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);
urEnqueueKernelLaunch(...);

urEnqueueUSMFreeExp(ptr, ...);

void *ptr2;
urEnqueueUSMDeviceAllocExp(..., &ptr2, 1024, ...); // allocate 1kb
assert(ptr == ptr2); // in-order queue and ptr was previously freed, so the
                     // implementation can reuse the same object

urKernelSetArgPointer(..., ptr2);
urEnqueueKernelLaunch(...);
```

But if the queue is out-of-order, we face a situation where urEnqueueUSMDeviceAllocExp needs to figure out the dependencies of previously enqueued frees:

```c
void *ptr;
urEnqueueUSMDeviceAllocExp(..., &ptr, 1024, ...); // allocate 1kb
urKernelSetArgPointer(..., ptr);

ur_event_handle_t kernel_one_out_event;
urEnqueueKernelLaunch(..., &kernel_one_out_event);

ur_event_handle_t free_out_event;
urEnqueueUSMFreeExp(ptr, waitlist = &[kernel_one_out_event], free_out_event);

void *ptr2;
ur_event_handle_t allocate_out_event;
urEnqueueUSMDeviceAllocExp(..., &ptr2, 1024, ..., waitlist = &[free_out_event], &allocate_out_event); // allocate 1kb; takes the same events as an enqueued kernel launch would

assert(ptr == ptr2); // the implementation can figure out that the ptr object will be free by the time the ptr2 waitlist signals

urKernelSetArgPointer(..., ptr2); // should this automatically add the ptr2 out event?
urEnqueueKernelLaunch(..., waitlist = &[allocate_out_event], ...); // should this take allocate_out_event or kernel_one_out_event?
```

To me this looks as if all the enqueue alloc/free functions have an implicit urEnqueueEventsWait built in, and SYCL/users will need to account for that.

Theoretically, it should be possible to figure out all those dependencies/events by storing a list of active and pending uses for each pointer, eliminating the need for events.
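As a minimal sketch of that bookkeeping idea (all names here are hypothetical, not part of the proposed spec), the adapter could keep a per-pointer list of events for enqueued operations that still use the allocation, and derive a free's dependencies from it:

```cpp
// Hypothetical adapter-side bookkeeping; not part of the proposed UR spec.
#include <ur_api.h>

#include <unordered_map>
#include <vector>

// Events of enqueued, possibly unfinished operations that use each pointer.
static std::unordered_map<void *, std::vector<ur_event_handle_t>> activeUses;

// Called by every enqueue entry point that consumes `ptr`.
void recordUse(void *ptr, ur_event_handle_t outEvent) {
    activeUses[ptr].push_back(outEvent);
}

// An enqueued free could infer its dependencies from the recorded uses,
// instead of requiring the user to pass an explicit waitlist.
std::vector<ur_event_handle_t> dependenciesForFree(void *ptr) {
    auto it = activeUses.find(ptr);
    if (it == activeUses.end())
        return {};
    std::vector<ur_event_handle_t> deps = std::move(it->second);
    activeUses.erase(it);
    return deps;
}
```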

@hdelan (Contributor, Author) commented Oct 8, 2024

> But if the queue is out-of-order, we face a situation where urEnqueueUSMDeviceAllocExp needs to figure out the dependencies of previously enqueued frees:
>
> ```c
> void *ptr;
> urEnqueueUSMDeviceAllocExp(..., &ptr, 1024, ...); // allocate 1kb
> urKernelSetArgPointer(..., ptr);
>
> ur_event_handle_t kernel_one_out_event;
> urEnqueueKernelLaunch(..., &kernel_one_out_event);
> ```

Also bear in mind you will need to return an event from urEnqueueUSMDeviceAllocExp and add that as a dependency to the kernel launch.

> To me this looks as if all the enqueue alloc/free functions have an implicit urEnqueueEventsWait built in, and SYCL/users will need to account for that.

That is correct. I think the goal of these funcs should be to have the same synchronization behaviour as other urEnqueue* entry points. I can't see why this might cause complications for users/SYCL.

> Theoretically, it should be possible to figure out all those dependencies/events by storing a list of active and pending uses for each pointer, eliminating the need for events.

However, one of the key things this approach is missing is the ability to pass dependency events to urEnqueue*USMAllocExp. If I want an async allocation to happen only once, say, kernel A has completed, there is no way of expressing that without events. With an async free we can infer the dependency events by seeing which enqueued operations use the allocation, so maybe we don't need dependency events there, but I think they would do no harm.

For instance, we really need to be able to express this:

```c
urEnqueueUSMFreeExp();
urEnqueueUSMDeviceAllocExp(); // I only want this allocation to happen once the
                              // previous free has completed, e.g. so that I
                              // don't overflow the mem pool.
```

But if we don't return an event from Free and accept wait events in Alloc, then we can't express this ordering using pointers alone for synchronization.

I'd be curious to see what others think about this. Ping @AerialMantis @kbenzie .

@hdelan (Contributor, Author) commented Oct 8, 2024

If we consider a subDAG for each async allocation starting at Alloc and ending at Free, then we could elide the events needed to make this subDAG work internally. But we would still need to be able to pass depEvents into Alloc (the start of the subDAG) and we also need to be able to pass out a recorded event from Free. Potentially an API would look something like this instead:

```c
ur_result_t urEnqueueUSMBlahAllocExp(
    ur_queue_handle_t hQueue, ur_usm_pool_handle_t pPool, const size_t size,
    const ur_exp_async_usm_alloc_properties_t *pProperties,
    uint32_t numEventsInWaitList, const ur_event_handle_t *phEventWaitList,
    void **ppMem); // no out event

ur_result_t urEnqueueUSMFreeExp(
    ur_queue_handle_t hQueue, ur_usm_pool_handle_t pPool, void *pMem,
    ur_event_handle_t *phEvent); // no in events
```

I'm not sure if this is a lot better as it makes the APIs a bit inconsistent.

@pbalcer (Contributor) commented Oct 8, 2024

Yea, the example you gave makes sense... Hm, my main concern is that we'd have to spend time here potentially creating new events, appending the relevant waits for the waitlist events, and then signaling events, when all of this may not be necessary.

But I guess the implementation can avoid all that overhead if no signal event and no waitlist are provided, so ideally the enqueue alloc/free is very quick without heavyweight event-based synchronization.

In general, I agree that, all things being equal, it's better to be consistent.
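A rough sketch of that fast path (hypothetical internals; `Pool` and `popOrAllocate` are made up, only the parameter names follow the spec draft):

```cpp
#include <ur_api.h>

// Hypothetical internal pool type; stands in for whatever the adapter uses.
struct Pool {
    void *popOrAllocate(size_t size);
};

ur_result_t enqueueAllocFastPath(Pool *pool, size_t size,
                                 uint32_t numEventsInWaitList,
                                 const ur_event_handle_t *phEventWaitList,
                                 ur_event_handle_t *phEvent, void **ppMem) {
    (void)phEventWaitList;
    if (numEventsInWaitList == 0 && phEvent == nullptr) {
        // Nothing to wait on and nothing to signal: a plain free-list pop,
        // with no event machinery at all.
        *ppMem = pool->popOrAllocate(size);
        return UR_RESULT_SUCCESS;
    }
    // Slow path (not shown): append a wait on the waitlist, allocate,
    // then signal *phEvent.
    return UR_RESULT_ERROR_UNSUPPORTED_FEATURE;
}
```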

@hdelan (Contributor, Author) commented Oct 8, 2024

> Yea, the example you gave makes sense... Hm, my main concern is that we'd have to spend time here potentially creating new events, appending the relevant waits for the waitlist events, and then signaling events, when all of this may not be necessary.

I mean, we can forgo this overhead by using in-order queues and not passing out events. I think that when dealing with higher-level out-of-order queues, performance is not as critical as with in-order queues, so an event here and there will not hurt. Regardless of whether we pass events explicitly to the entry points, if we need to do dependency analysis for out-of-order queues, we will need to record events: either explicitly, when events are passed in, or implicitly, by tracking pointers.

> But I guess the implementation can avoid all that overhead if no signal event and no waitlist are provided, so ideally the enqueue alloc/free is very quick without heavyweight event-based synchronization.

Yes exactly.

@pbalcer (Contributor) commented Oct 8, 2024

> If we consider a subDAG for each async allocation starting at Alloc and ending at Free, then we could elide the events needed to make this subDAG work internally.

But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer requires the subsequent kernel execution to wait either not at all or for the least possible amount of time. From what I can tell, CUDA even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization (cudaMemPoolReuseAllowInternalDependencies).
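For reference, that CUDA knob is a memory-pool attribute; disabling it looks roughly like this (standard CUDA runtime API, error handling omitted):

```cpp
#include <cuda_runtime.h>

int main() {
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    // Forbid cudaMallocAsync from reusing memory in a way that would insert
    // implicit dependencies between otherwise unordered streams.
    int allow = 0;
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowInternalDependencies,
                            &allow);
    return 0;
}
```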

@hdelan (Contributor, Author) commented Oct 8, 2024

> But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer requires the subsequent kernel execution to wait either not at all or for the least possible amount of time. From what I can tell, CUDA even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization (cudaMemPoolReuseAllowInternalDependencies).

Yes, the pointer is immediately valid, so no blocking should occur. However, there are some nuances here: e.g., when a mem pool is grown for the first time, there will be a blocking wait before the kernel starts. This is all CUDA-backend dependent. The main idea is that once a mem pool has been created, async allocations can reuse its memory without needing expensive blocking CUDA allocations. See https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/#stream-ordered_memory_allocator
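For context, the underlying CUDA pattern looks roughly like this (error handling omitted): the first cudaMallocAsync may grow the pool with an expensive driver allocation, while later stream-ordered allocations can reuse the pooled memory.

```cpp
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // First allocation may grow the pool: a potentially expensive,
    // synchronizing driver allocation under the hood.
    void *p1;
    cudaMallocAsync(&p1, 1024, stream);
    cudaFreeAsync(p1, stream);

    // Ordered after the free above on the same stream, so this can reuse
    // the pooled memory without another expensive driver allocation.
    void *p2;
    cudaMallocAsync(&p2, 1024, stream);
    cudaFreeAsync(p2, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```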

@hdelan (Contributor, Author) commented Oct 8, 2024

Also, just to clarify an earlier point:

> ```c
> void *ptr;
> urEnqueueUSMDeviceAllocExp(..., &ptr, 1024, ...); // allocate 1kb
> urKernelSetArgPointer(..., ptr);
>
> ur_event_handle_t kernel_one_out_event;
> urEnqueueKernelLaunch(..., &kernel_one_out_event);
>
> ur_event_handle_t free_out_event;
> urEnqueueUSMFreeExp(ptr, waitlist = &[kernel_one_out_event], free_out_event);
>
> void *ptr2;
> ur_event_handle_t allocate_out_event;
> urEnqueueUSMDeviceAllocExp(..., &ptr2, 1024, ..., waitlist = &[free_out_event], &allocate_out_event); // allocate 1kb; takes the same events as an enqueued kernel launch would
>
> assert(ptr == ptr2); // the implementation can figure out that the ptr object will be free by the time the ptr2 waitlist signals
> ```

There would be no way to provide such a guarantee (ptr == ptr2) in UR, as we are not in control of the native allocator, which may have its own unusual heuristics.

@pbalcer (Contributor) commented Oct 8, 2024

> But implementations should do this regardless (for every applicable object in the pool), because we need to return a pointer immediately. Ideally that pointer requires the subsequent kernel execution to wait either not at all or for the least possible amount of time. From what I can tell, CUDA even allows you to forbid the runtime from allocating objects that would cause an implicit synchronization (cudaMemPoolReuseAllowInternalDependencies).
>
> Yes, the pointer is immediately valid, so no blocking should occur. However, there are some nuances here: e.g., when a mem pool is grown for the first time, there will be a blocking wait before the kernel starts. This is all CUDA-backend dependent. The main idea is that once a mem pool has been created, async allocations can reuse its memory without needing expensive blocking CUDA allocations. See https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/#stream-ordered_memory_allocator

Yea, I've been looking at that as well. My point wasn't about allocating physically-backed pages (i.e., blocking on a page fault or something), but rather about blocking/waiting on events that originate from urEnqueueUSMBlahAllocExp. If possible, this function should return a pointer that has no (or minimal) dependencies for being reused, given the waitlist. In other words, an optimal implementation may need to internally create a DAG of dependencies for pointers and then, at allocation time, figure out which objects will be reusable (or have the fewest outstanding dependencies) once the events on the waitlist are signaled.
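As a sketch of what that allocation-time lookup could mean (entirely hypothetical bookkeeping, and deliberately simplified: it only checks direct waitlist membership, not transitive ordering):

```cpp
#include <ur_api.h>

#include <vector>

// Hypothetical free-list entry: a block plus the event that releases it.
struct FreeBlock {
    void *ptr;
    size_t size;
    ur_event_handle_t releasedBy; // signaled by the corresponding enqueued free
};

// Prefer a block whose releasing event is already on the caller's waitlist:
// reusing it adds no extra dependencies to the returned pointer.
void *pickReusable(const std::vector<FreeBlock> &freeList, size_t size,
                   uint32_t numEvents, const ur_event_handle_t *waitlist) {
    for (const auto &block : freeList) {
        if (block.size < size)
            continue;
        for (uint32_t i = 0; i < numEvents; ++i)
            if (waitlist[i] == block.releasedBy)
                return block.ptr; // no implicit synchronization needed
    }
    return nullptr; // fall back to a fresh (possibly blocking) allocation
}
```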

Anyway, I'm convinced. I think an in-order implementation should be straightforward; the out-of-order one will be difficult regardless of how we design the API. I think it will be fine if we settle for a good-enough approximate solution in the latter case.

@hdelan (Contributor, Author) commented Oct 8, 2024

Thanks for the discussion, @pbalcer!

@igchor (Member) commented Oct 8, 2024

This looks good. I like the approach with an explicit event wait list and signal events, and I agree this should be consistent with the other enqueue APIs.

For cases where there is only a single event in the wait list (which should be a fairly common case), the event elision should be fairly simple. For some pool implementations (e.g. what we discussed for L0), I guess we could just bump the reference count of the input event and return it as the signal event.
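A sketch of that single-event elision (hypothetical helper; urEventRetain is the existing UR entry point):

```cpp
#include <ur_api.h>

// When the waitlist has exactly one event and the caller wants a signal
// event, hand back the input event instead of creating a new one. Only
// valid when the allocation completes as soon as its single dependency does.
bool tryElideSignalEvent(uint32_t numEventsInWaitList,
                         const ur_event_handle_t *phEventWaitList,
                         ur_event_handle_t *phEvent) {
    if (numEventsInWaitList != 1 || phEvent == nullptr)
        return false; // needs a genuine new event
    // Bump the reference count so the caller's eventual release does not
    // destroy the original event prematurely.
    if (urEventRetain(phEventWaitList[0]) != UR_RESULT_SUCCESS)
        return false;
    *phEvent = phEventWaitList[0];
    return true;
}
```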

Commits:
- First basic work-in-progress spec.
- Add an entry so the user can specify whether the native USM pool should be used.
The github-actions bot added the cuda, hip, and native-cpu labels on Oct 22, 2024.
@kswiecicki (Contributor) commented

Hi @hdelan, I’m currently working on the L0 adapter implementation for the recent spec changes. Would it be possible to separate the spec changes from the implementation of the API that was added to the spec? This would help reduce clutter in PRs that only depend on the spec changes.
