[KHR] add sycl_khr_queue_empty_query #700

Open

TApplencourt wants to merge 19 commits into main from khr-queue-size-queries

Conversation

@TApplencourt (Contributor) commented Jan 28, 2025

Ported: https://gitlab.khronos.org/sycl/Specification/-/merge_requests/727/

Add queue.size() and queue.empty(). I think we agreed that the other proposed query, get_wait_list(), was hard to implement, so I removed it here.

I need help naming those APIs. I'm not sure how it works to add a new member function to an existing class... Should I prefix the name of the function with khr?

Thanks in advance!

@Pennycook, I would appreciate it if you could look at my example. I think/hope/pray that it's free of UB :)
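For context, here is a rough sketch of the member function this PR ends up adding (my reading of the thread: the query was later renamed to khr_empty and size() was dropped; the exact wording lives in the .adoc, so treat the signature below as an assumption):

```c++
namespace sycl {

class queue {
 public:
  // ...existing members...

  // Returns true if all commands enqueued on this queue have completed,
  // false otherwise. See the staleness discussion below.
  bool khr_empty() const;
};

}  // namespace sycl
```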

@TApplencourt (Contributor, Author) commented Jan 28, 2025

Updated with just khr_empty.

@Pennycook (Contributor) left a comment


> @Pennycook, I would appreciate it if you could look at my example. I think/hope/pray that it's free of UB :)

I think this is free of UB if (as Greg pointed out) you use an in-order queue.

My reasoning, in case anybody is interested:

  • The std::atomic_bool is only used by host code.
  • If the host_task executes eagerly it will run until the atomic value changes.
  • If the host_task executes lazily it will not run until e2.wait() is called.
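For readers without the .adoc at hand, a minimal reconstruction of the pattern those bullets describe (the actual example in the extension may differ; the khr_empty calls and the exact command structure are my assumptions):

```c++
#include <atomic>
#include <cassert>
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::property::queue::in_order{}};
  std::atomic_bool flag{false};  // only ever touched by host code

  // Command 1: a host task that spins until the host flips the flag.
  auto e1 = q.submit([&](sycl::handler& h) {
    h.host_task([&] {
      while (!flag.load()) {
      }
    });
  });

  // Command 2: an empty device task; the queue is in-order, so it cannot
  // run before the host task completes.
  auto e2 = q.single_task([=] {});

  // Whether execution is eager or lazy, neither command has completed,
  // so the queue cannot be empty here.
  assert(!q.khr_empty());

  flag.store(true);  // let the host task finish
  e2.wait();         // a lazy implementation starts the commands here at the latest

  assert(q.khr_empty());
}
```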

@TApplencourt TApplencourt changed the title [KHR] add sycl_khr_queue_size_queries [KHR] add sycl_khr_queue_empty_querie Jan 29, 2025
@TApplencourt TApplencourt changed the title [KHR] add sycl_khr_queue_empty_querie [KHR] add sycl_khr_queue_empty_query Jan 29, 2025
@gmlueck (Contributor) left a comment


I like your style better, but this is the style in the rest of the spec, so we should be consistent.

@CLAassistant commented Jan 29, 2025

CLA assistant check
All committers have signed the CLA.

@TApplencourt force-pushed the khr-queue-size-queries branch from 02d4d1d to 4884d91 on January 30, 2025 at 15:36
@TApplencourt (Contributor, Author):

Ready to review/accept :)

Comment on the extension text:

    completed, [code]#false# otherwise.

    {note} Since the implementation executes commands asynchronously, the returned value is a snapshot in time.


Because the queue class is thread-safe, I was wondering if we should remark that the value returned from this function should be considered immediately stale. The user might be able to guarantee that the value is still accurate - e.g. no threading, or the user has a separate lock over the queue or similar - but we can't prove that a queue returning true from empty() will still be true a moment later, in the general case. A note to this effect might help users understand that they can write code which would fall foul of this.
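Something like this hypothetical fragment shows the trap (khr_empty is the query from this PR; the helper is a placeholder):

```c++
if (q.khr_empty()) {
  // Unless the application provides its own guarantee (single-threaded use,
  // a lock around all submissions, ...), another thread may enqueue work at
  // this exact point: "empty" was only true at the instant of the query.
  reconfigure_assuming_idle(q);  // hypothetical helper
}
```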

@gmlueck (Contributor):


I think the concern is not that another thread might add something to the queue, it's that the SYCL runtime might asynchronously remove things from the queue. No amount of careful coding on the application's part can solve this. If queue::khr_empty returns false, the queue might become empty a moment later.

@TApplencourt (Contributor, Author):


I can add "immediately stale". Maybe better than "snapshot in time".

You know that if it's true and nobody in your app enqueues anything on the queue, it will still be true.
For false, it's outside your app's control. Maybe the tick after, it will become true, as @gmlueck said (regardless of what you did in your code).

I don't know how to phrase it better. Sadly, std::future doesn't have an is_ready() that could inspire us.


@gmlueck agreed - I just don't think that's as much of an issue as, say, believing that the queue is empty, then having another thread dump 2500 kernels onto it :D

@TApplencourt I think I stole the "immediately stale" wording from some other API. But I think it communicates the idea.

Comment on lines +43 to +44
_Returns:_ [code]#true# if all <<command,commands>> enqueued on this queue have
completed, [code]#false# otherwise.
@Pennycook (Contributor) commented Feb 3, 2025


The comment about thread safety here made me think about what the desired synchronization/observability semantics are here. I'm starting a new comment thread because I think it's a separate issue, and I don't think that comments about thread-safety/staleness can resolve it.

Before we get into the standardese: should the examples below work, or not?

Example 1: Host-Device Synchronization

bool* usm = sycl::malloc_shared<bool>(1, q);
*usm = false;

q.single_task([=]() {
  *usm = true;
});

while (not q.empty()) {} // NB: This thread never called wait

// If the queue is empty, we "know" the single_task completed.
// Are its results guaranteed to be visible?
assert(*usm == true);

Example 2: Inter-Thread Synchronization via Device

// Assume these allocations are visible to both threads.
bool* a = static_cast<bool*>(malloc(sizeof(bool)));
bool* b = sycl::malloc_shared<bool>(1, q);
*a = false;
*b = false;

// Thread 1
{
  *a = true;
  q.single_task([=]() {
    *b = true;
  });
}

// Thread 2
{
  if (q.empty()) {
    if (*b == true) {
      // If the queue is empty, the single_task might have executed.
      // If b is true, we "know" the single task executed (assuming Example 1 is valid).
      // Are things Thread 1 did before enqueueing the task guaranteed to be visible to Thread 2?
      assert(*a == true); 
    }
  }
}

Answering these questions might be something that we want to defer until a larger rework of the execution model, but I wanted to bring it up so we don't lose track of it.

Contributor:


I think example 1 is not guaranteed to work because a SYCL implementation is allowed to execute work in queues lazily, waiting for the application to call wait. Therefore the loop on q.empty could be an infinite loop.

I think your real question is about inter-thread synchronization, though. It seems like this question exists even without relying on queue::empty. Is the following example well defined?

// Assume these allocations are visible to both threads.
bool* a = static_cast<bool*>(malloc(sizeof(bool)));
bool* b = sycl::malloc_shared<bool>(1, q);
*a = false;
*b = false;

// Thread 1
{
  *a = true;
  q.single_task([=]() {
    *b = true;
  });
}

// Thread 2
{
  q.wait();
  // Are things Thread 1 did before enqueueing the task guaranteed to be visible to Thread 2?
  assert(*a == true); 
}

If you think the answer is "yes", what part of the spec guarantees this?

@Pennycook (Contributor):


> I think example 1 is not guaranteed to work because a SYCL implementation is allowed to execute work in queues lazily, waiting for the application to call wait. Therefore the loop on q.empty could be an infinite loop.

You're right, but I was trying not to overcomplicate the example by bringing eager/lazy into it!

If you ignore that some implementations may go into an infinite loop, the question is still interesting: if the host thread does make it past that loop, should there be any guarantee of the *usm value?

> I think your real question is about inter-thread synchronization, though. It seems like this question exists even without relying on queue::empty. Is the following example well defined?
> ...
> If you think the answer is "yes", what part of the spec guarantees this?

I agree that we need to clarify the behavior of other APIs, which is why it might make sense just to note that empty has this problem and revisit it as part of later execution/memory model clarifications. Querying whether an event is in complete status has exactly the same problem.

I think q.wait() is different. Although it's not clearly/formally stated anywhere that your example would work -- and q.wait() isn't described formally in terms of synchronization, etc -- I think it's aligned with the intent of wait.

I know examples are non-normative, but as a demonstrator of intent: the example in the USM section shows that calling wait makes the memory available to the thread that called wait:

  myQueue.parallel_for(1024, [=](id<1> idx) {
    // Initialize each buffer element with its own rank number starting at 0
    data[idx] = idx;
  }); // End of the kernel function

  // Explicitly wait for kernel execution since there is no accessor involved
  myQueue.wait();

  // Print result
  for (int i = 0; i < 1024; i++)
    std::cout << "data[" << i << "] = " << data[i] << std::endl;

...the implication being that wait() is some sort of synchronizing operation, or at least that we can guarantee the end of the kernel function happens-before the thread blocked on wait is unblocked. Everybody is relying on this behavior today whenever they use USM.

I think this makes your example well-defined: *a = true is sequenced-before q.single_task, which happens-before the start of the kernel (on the device), which is sequenced-before the end of the kernel function (on the device), which happens-before thread 2 is unblocked.

I don't know whether we intended for a query that returns info::event_command_status::complete to have the same behavior or not, and I don't know what the intent is with empty(). I suspect we want them both to have some sort of synchronization behavior, but that's not clear from the specification.

Contributor:


> ...the implication being that wait() is some sort of synchronizing operation, or at least that we can guarantee the end of the kernel function happens-before the thread blocked on wait is unblocked. Everybody is relying on this behavior today whenever they use USM.
>
> I think this makes your example well-defined: *a = true is sequenced-before q.single_task, which happens-before the start of the kernel (on the device), which is sequenced-before the end of the kernel function (on the device), which happens-before thread 2 is unblocked.

I agree that it's reasonable to assume that queue::wait is a synchronization point that ensures that memory written by kernels in the queue is visible to the calling thread. I wasn't sure if it also guaranteed that memory written by another thread was visible to this thread. I guess you are saying that normal C++ rules for "synchronizes with" would provide this guarantee, because the write to a is "sequenced before" the kernel is submitted?

Your point about info::event_command_status::complete seems very relevant. It seems like these two loops should provide similar guarantees (or lack of guarantees):

while (not q.empty()) {}

while (e.get_info<info::event::command_execution_status>() != info::event_command_status::complete) {}

@Pennycook (Contributor):


> I agree that it's reasonable to assume that queue::wait is a synchronization point that ensures that memory written by kernels in the queue is visible to the calling thread. I wasn't sure if it also guaranteed that memory written by another thread was visible to this thread. I guess you are saying that normal C++ rules for "synchronizes with" would provide this guarantee, because the write to a is "sequenced before" the kernel is submitted?

Yes, exactly. C++ doesn't have multiple devices, of course, so I'm trying to apply the rules as-if the device were just another thread. My understanding is that "sequenced before" is more or less a fancy way of saying that "within a thread, things happen in program order", and then there are a bunch of transitivity rules (starting here) that I interpret to mean something like "if A synchronizes with B, anything that happened before A must also have happened before B".
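For concreteness, here is the plain two-thread C++ analogue of that chain, with the device replaced by a second host thread (my illustration, not from the spec):

```c++
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int a = 0;

void producer() {
  a = 1;                                         // (A) sequenced before (B)
  ready.store(true, std::memory_order_release);  // (B)
}

void consumer() {
  while (!ready.load(std::memory_order_acquire)) {  // (C) synchronizes with (B)
  }
  assert(a == 1);  // (A) happens before this read, by transitivity
}

int main() {
  std::thread t1{producer};
  std::thread t2{consumer};
  t1.join();
  t2.join();
}
```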

> Your point about info::event_command_status::complete seems very relevant. It seems like these two loops should provide similar guarantees (or lack of guarantees):

I agree they should behave the same.

@TApplencourt (Contributor, Author) commented Feb 3, 2025


True, two independent questions indeed.

For 1/: at least, from experience, I know "real" apps that spin-loop on cuEventQuery, and they never deadlock. I'm not sure if it's because CUDA is always greedy, or because cuEventQuery forces submission. (I guess (2) is an implementation detail of (1).)

So I think we should say "yes." It will "please" people porting to CUDA, or people who are "latency bound," I suppose.

for (...) {
  Q.submit(...);
  Q.empty();
}
Q.wait();

Should be faster than:

for (...) {
  Q.submit(...);
  Q.wait();
}

For 2/, definitely yes.

But my idea is that if checking an event doesn't submit, it forces you to always call wait, so question 2/ is "useless", or a tautology.

You always call wait (because this is the only way to submit), so the command's side effects are always visible, regardless of whether you observed the completed event. And you cannot see an event completed without calling wait.

Hope it's kind of clear.

Contributor:


> So I think we should say "yes." It will "please" people porting to CUDA, or people who are "latency bound," I suppose.

I'm not sure we want to do this. I might be wrong, but my gut says that requiring implementations to start executing kernels when an event is queried would prevent single-threaded implementations (or at least make them a lot more complicated). The only way to avoid deadlocks would be to implement a mechanism to switch between host and device code during kernel execution, which I think would be difficult in the general case.

> But my idea is that if checking an event doesn't submit, it forces you to always call wait, so question 2/ is "useless", or a tautology.
>
> You always call wait (because this is the only way to submit), so the command's side effects are always visible, regardless of whether you observed the completed event. And you cannot see an event completed without calling wait.

wait is the only way to guarantee that a kernel is started, but the opposite is not true; there is no guarantee that a kernel will not start until wait is called.

So, a spin-loop querying an event might work, it's just implementation-defined. Such a loop would definitely fail for a single-threaded implementation like SimSYCL (because the kernel will never start), but would probably work when offloading to an accelerator via CUDA/OpenCL/Level Zero (because the kernel will probably start).

Contributor:


@TApplencourt - I don't think we should say anything about the execution model in this KHR, so I propose that we defer that discussion until later. I've proposed some wording in https://github.com/KhronosGroup/SYCL-Docs/pull/700/files#r1942646351 to address the memory visibility aspect, though, because I think we are in agreement that the answer to the second (Memory Model) question needs to be "Yes".

@TApplencourt (Contributor, Author):


Thanks a lot! Merged.

@TApplencourt (Contributor, Author):


> Memory Model: Does observing a completed event status imply the command's side-effects are visible?

One more argument in favor is that in a code like this:

    sycl::queue q{sycl::property::queue::in_order{}};
    auto a = sycl::malloc_shared<int>(1, q);
    auto e1 = q.single_task([=] { a[0] = 1; });
    auto e2 = q.single_task([=] {});
    q.wait();

I want to be able to read a[0] as soon as e1 has completed, not only when e2 / q.wait() has.
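In other words (my sketch, continuing the snippet above), the hope is that polling e1's status would be enough to make the write visible:

```c++
// Hypothetical continuation: poll e1 instead of waiting on the whole queue.
while (e1.get_info<sycl::info::event::command_execution_status>() !=
       sycl::info::event_command_status::complete) {
}
int value = a[0];  // the hope: guaranteed to read 1 once e1 reports complete
```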

@TApplencourt (Contributor, Author):

Always reflow, always...

@keryell (Member) left a comment


Thanks!

@tomdeakin (Contributor):

WG approved to merge.

@tomdeakin (Contributor):

Waiting for confirmation on implementations.

@gmlueck (Contributor) commented Feb 6, 2025

This is the Intel internal tracker to implement in DPC++: CMPLRLLVM-65342
