[KHR] add sycl_khr_queue_empty_query #700

Open

TApplencourt wants to merge 19 commits into main from khr-queue-size-queries

Conversation

@TApplencourt (Contributor) commented Jan 28, 2025

Ported: https://gitlab.khronos.org/sycl/Specification/-/merge_requests/727/

Add queue.size() and queue.empty(). I think we agreed that the other proposed query, get_wait_list(), was hard to implement, so I removed it here.

I need help naming those APIs. I'm not sure how it works to add a new member function to an existing class... Should I prefix the name of the function with khr?

Thanks in advance!

@Pennycook, I would appreciate it if you could look at my example. I think/hope/pray that it's free of UB :)
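For context, here is a rough sketch of the member function this PR ends up adding (my reading of the thread: the query was later renamed to khr_empty and size() was dropped; the exact wording lives in the .adoc, so treat the signature below as an assumption):

```c++
namespace sycl {

class queue {
 public:
  // ...existing members...

  // Returns true if all commands enqueued on this queue have completed,
  // false otherwise. See the staleness discussion below.
  bool khr_empty() const;
};

}  // namespace sycl
```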

@TApplencourt (Contributor, Author) commented Jan 28, 2025

Updated with just khr_empty.

@Pennycook (Contributor) left a comment


> @Pennycook, I would appreciate it if you could look at my example. I think/hope/pray that it's free of UB :)

I think this is free of UB if (as Greg pointed out) you use an in-order queue.

My reasoning, in case anybody is interested:

  • The std::atomic_bool is only used by host code.
  • If the host_task executes eagerly it will run until the atomic value changes.
  • If the host_task executes lazily it will not run until e2.wait() is called.
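For readers without the .adoc at hand, a minimal reconstruction of the pattern those bullets describe (the actual example in the extension may differ; the khr_empty calls and the exact command structure are my assumptions):

```c++
#include <atomic>
#include <cassert>
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::property::queue::in_order{}};
  std::atomic_bool flag{false};  // only ever touched by host code

  // Command 1: a host task that spins until the host flips the flag.
  auto e1 = q.submit([&](sycl::handler& h) {
    h.host_task([&] {
      while (!flag.load()) {
      }
    });
  });

  // Command 2: an empty device task; the queue is in-order, so it cannot
  // run before the host task completes.
  auto e2 = q.single_task([=] {});

  // Whether execution is eager or lazy, neither command has completed,
  // so the queue cannot be empty here.
  assert(!q.khr_empty());

  flag.store(true);  // let the host task finish
  e2.wait();         // a lazy implementation starts the commands here at the latest

  assert(q.khr_empty());
}
```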

@TApplencourt TApplencourt changed the title [KHR] add sycl_khr_queue_size_queries [KHR] add sycl_khr_queue_empty_querie Jan 29, 2025
@TApplencourt TApplencourt changed the title [KHR] add sycl_khr_queue_empty_querie [KHR] add sycl_khr_queue_empty_query Jan 29, 2025
@gmlueck (Contributor) left a comment


I like your style better, but this is the style in the rest of the spec, so we should be consistent.

@CLAassistant commented Jan 29, 2025

CLA assistant check
All committers have signed the CLA.

@TApplencourt force-pushed the khr-queue-size-queries branch from 02d4d1d to 4884d91 on January 30, 2025 at 15:36
@TApplencourt (Contributor, Author):

Ready to review/accept :)

Comment on the extension text:

    completed, [code]#false# otherwise.

    {note} Since the implementation executes commands asynchronously, the returned value is a snapshot in time.


Because the queue class is thread-safe, I was wondering if we should remark that the value returned from this function should be considered immediately stale. The user might be able to guarantee that the value is still accurate - e.g. no threading, or the user has a separate lock over the queue or similar - but we can't prove that a queue returning true from empty() will still be true a moment later, in the general case. A note to this effect might help users understand that they can write code which would fall foul of this.
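Something like this hypothetical fragment shows the trap (khr_empty is the query from this PR; the helper is a placeholder):

```c++
if (q.khr_empty()) {
  // Unless the application provides its own guarantee (single-threaded use,
  // a lock around all submissions, ...), another thread may enqueue work at
  // this exact point: "empty" was only true at the instant of the query.
  reconfigure_assuming_idle(q);  // hypothetical helper
}
```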

@gmlueck (Contributor):


I think the concern is not that another thread might add something to the queue, it's that the SYCL runtime might asynchronously remove things from the queue. No amount of careful coding on the application's part can solve this. If queue::khr_empty returns false, the queue might become empty a moment later.

@TApplencourt (Contributor, Author):


I can add "immediately stale". Maybe better than "snapshot in time".

You know that if it's true and nobody in your app enqueues anything on the queue, it will still be true.
For false, it's outside your app's control. Maybe the tick after, it will become true, as @gmlueck said (regardless of what you did in your code).

I don't know how to phrase it better. Sadly, std::future doesn't have an is_ready() that could inspire us.


@gmlueck agreed - I just don't think that's as much of an issue as, say, believing that the queue is empty, then having another thread dump 2500 kernels onto it :D

@TApplencourt I think I stole the "immediately stale" wording from some other API. But I think it communicates the idea.

Comment on lines +43 to +44
_Returns:_ [code]#true# if all <<command,commands>> enqueued on this queue have
completed, [code]#false# otherwise.
@Pennycook (Contributor) commented Feb 3, 2025


The comment about thread safety here made me think about what the desired synchronization/observability semantics are here. I'm starting a new comment thread because I think it's a separate issue, and I don't think that comments about thread-safety/staleness can resolve it.

Before we get into the standardese: should the examples below work, or not?

Example 1: Host-Device Synchronization

bool* usm = sycl::malloc_shared<bool>(1, q);
*usm = false;

q.single_task([=]() {
  *usm = true;
});

while (not q.empty()) {} // NB: This thread never called wait

// If the queue is empty, we "know" the single_task completed.
// Are its results guaranteed to be visible?
assert(*usm == true);

Example 2: Inter-Thread Synchronization via Device

// Assume these allocations are visible to both threads.
bool* a = static_cast<bool*>(malloc(sizeof(bool)));
bool* b = sycl::malloc_shared<bool>(1, q);
*a = false;
*b = false;

// Thread 1
{
  *a = true;
  q.single_task([=]() {
    *b = true;
  });
}

// Thread 2
{
  if (q.empty()) {
    if (*b == true) {
      // If the queue is empty, the single_task might have executed.
      // If b is true, we "know" the single task executed (assuming Example 1 is valid).
      // Are things Thread 1 did before enqueueing the task guaranteed to be visible to Thread 2?
      assert(*a == true); 
    }
  }
}

Answering these questions might be something that we want to defer until a larger rework of the execution model, but I wanted to bring it up so we don't lose track of it.

Contributor:


I think example 1 is not guaranteed to work because a SYCL implementation is allowed to execute work in queues lazily, waiting for the application to call wait. Therefore the loop on q.empty could be an infinite loop.

I think your real question is about inter-thread synchronization, though. It seems like this question exists even without relying on queue::empty. Is the following example well defined?

// Assume these allocations are visible to both threads.
bool* a = static_cast<bool*>(malloc(sizeof(bool)));
bool* b = sycl::malloc_shared<bool>(1, q);
*a = false;
*b = false;

// Thread 1
{
  *a = true;
  q.single_task([=]() {
    *b = true;
  });
}

// Thread 2
{
  q.wait();
  // Are things Thread 1 did before enqueueing the task guaranteed to be visible to Thread 2?
  assert(*a == true); 
}

If you think the answer is "yes", what part of the spec guarantees this?

@Pennycook (Contributor):


> I think example 1 is not guaranteed to work because a SYCL implementation is allowed to execute work in queues lazily, waiting for the application to call wait. Therefore the loop on q.empty could be an infinite loop.

You're right, but I was trying not to overcomplicate the example by bringing eager/lazy into it!

If you ignore that some implementations may go into an infinite loop, the question is still interesting: if the host thread does make it past that loop, should there be any guarantee of the *usm value?

> I think your real question is about inter-thread synchronization, though. It seems like this question exists even without relying on queue::empty. Is the following example well defined?
> ...
> If you think the answer is "yes", what part of the spec guarantees this?

I agree that we need to clarify the behavior of other APIs, which is why it might make sense just to note that empty has this problem and revisit it as part of later execution/memory model clarifications. Querying whether an event is in complete status has exactly the same problem.

I think q.wait() is different. Although it's not clearly/formally stated anywhere that your example would work -- and q.wait() isn't described formally in terms of synchronization, etc -- I think it's aligned with the intent of wait.

I know examples are non-normative, but as a demonstrator of intent: the example in the USM section shows that calling wait makes the memory available to the thread that called wait:

  myQueue.parallel_for(1024, [=](id<1> idx) {
    // Initialize each buffer element with its own rank number starting at 0
    data[idx] = idx;
  }); // End of the kernel function

  // Explicitly wait for kernel execution since there is no accessor involved
  myQueue.wait();

  // Print result
  for (int i = 0; i < 1024; i++)
    std::cout << "data[" << i << "] = " << data[i] << std::endl;

...the implication being that wait() is some sort of synchronizing operation, or at least that we can guarantee the end of the kernel function happens-before the thread blocked on wait is unblocked. Everybody is relying on this behavior today whenever they use USM.

I think this makes your example well-defined: *a = true is sequenced-before q.single_task, which happens-before the start of the kernel (on the device), which is sequenced-before the end of the kernel function (on the device), which happens-before thread 2 is unblocked.

I don't know whether we intended for a query that returns info::event_command_status::complete to have the same behavior or not, and I don't know what the intent is with empty(). I suspect we want them both to have some sort of synchronization behavior, but that's not clear from the specification.

Contributor:


> ...the implication being that wait() is some sort of synchronizing operation, or at least that we can guarantee the end of the kernel function happens-before the thread blocked on wait is unblocked. Everybody is relying on this behavior today whenever they use USM.
>
> I think this makes your example well-defined: *a = true is sequenced-before q.single_task, which happens-before the start of the kernel (on the device), which is sequenced-before the end of the kernel function (on the device), which happens-before thread 2 is unblocked.

I agree that it's reasonable to assume that queue::wait is a synchronization point that ensures that memory written by kernels in the queue is visible to the calling thread. I wasn't sure if it also guaranteed that memory written by another thread was visible to this thread. I guess you are saying that normal C++ rules for "synchronizes with" would provide this guarantee, because the write to a is "sequenced before" the kernel is submitted?

Your point about info::event_command_status::complete seems very relevant. It seems like these two loops should provide similar guarantees (or lack of guarantees):

while (not q.empty()) {}

while (e.get_info<info::event::command_execution_status>() != info::event_command_status::complete) {}

@Pennycook (Contributor):


> I agree that it's reasonable to assume that queue::wait is a synchronization point that ensures that memory written by kernels in the queue is visible to the calling thread. I wasn't sure if it also guaranteed that memory written by another thread was visible to this thread. I guess you are saying that normal C++ rules for "synchronizes with" would provide this guarantee, because the write to a is "sequenced before" the kernel is submitted?

Yes, exactly. C++ doesn't have multiple devices, of course, so I'm trying to apply the rules as-if the device were just another thread. My understanding is that "sequenced before" is more or less a fancy way of saying that "within a thread, things happen in program order", and then there are a bunch of transitivity rules (starting here) that I interpret to mean something like "if A synchronizes with B, anything that happened before A must also have happened before B".
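For concreteness, here is the plain two-thread C++ analogue of that chain, with the device replaced by a second host thread (my illustration, not from the spec):

```c++
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int a = 0;

void producer() {
  a = 1;                                         // (A) sequenced before (B)
  ready.store(true, std::memory_order_release);  // (B)
}

void consumer() {
  while (!ready.load(std::memory_order_acquire)) {  // (C) synchronizes with (B)
  }
  assert(a == 1);  // (A) happens before this read, by transitivity
}

int main() {
  std::thread t1{producer};
  std::thread t2{consumer};
  t1.join();
  t2.join();
}
```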

> Your point about info::event_command_status::complete seems very relevant. It seems like these two loops should provide similar guarantees (or lack of guarantees):

I agree they should behave the same.

@TApplencourt (Contributor, Author) commented Feb 3, 2025


True, two independent questions indeed.

For 1/: at least, from experience, I know "real" apps that spin-loop on cuEventQuery, and they never deadlock. I'm not sure if it's because CUDA is always greedy, or because cuEventQuery forces submission. (I guess (2) is an implementation detail of (1).)

So I think we should say "yes." It will "please" people porting to CUDA, or people who are "latency bound," I suppose.

for (...) {
  Q.submit(...);
  Q.empty();
}
Q.wait();

Should be faster than:

for (...) {
  Q.submit(...);
  Q.wait();
}

For 2/, definitely yes.

But my idea is that if checking an event doesn't submit, it forces you to always call wait, so question 2/ is "useless", or a tautology.

You always call wait (because this is the only way to submit), so the command's side effects are always visible, regardless of whether you observed the completed event. And you cannot see an event completed without calling wait.

Hope it's kind of clear.

Contributor:


> So I think we should say "yes." It will "please" people porting to CUDA, or people who are "latency bound," I suppose.

I'm not sure we want to do this. I might be wrong, but my gut says that requiring implementations to start executing kernels when an event is queried would prevent single-threaded implementations (or at least make them a lot more complicated). The only way to avoid deadlocks would be to implement a mechanism to switch between host and device code during kernel execution, which I think would be difficult in the general case.

> But my idea is that if checking an event doesn't submit, it forces you to always call wait, so question 2/ is "useless", or a tautology.
>
> You always call wait (because this is the only way to submit), so the command's side effects are always visible, regardless of whether you observed the completed event. And you cannot see an event completed without calling wait.

wait is the only way to guarantee that a kernel is started, but the opposite is not true; there is no guarantee that a kernel will not start until wait is called.

So, a spin-loop querying an event might work, it's just implementation-defined. Such a loop would definitely fail for a single-threaded implementation like SimSYCL (because the kernel will never start), but would probably work when offloading to an accelerator via CUDA/OpenCL/Level Zero (because the kernel will probably start).

Contributor:


@TApplencourt - I don't think we should say anything about the execution model in this KHR, so I propose that we defer that discussion until later. I've proposed some wording in https://github.com/KhronosGroup/SYCL-Docs/pull/700/files#r1942646351 to address the memory visibility aspect, though, because I think we are in agreement that the answer to the second (Memory Model) question needs to be "Yes".

@TApplencourt (Contributor, Author):


Thanks a lot! Merged.

@TApplencourt (Contributor, Author):


> Memory Model: Does observing a completed event status imply the command's side-effects are visible?

One more argument in favor is that in a code like this:

    sycl::queue q{sycl::property::queue::in_order{}};
    auto a = sycl::malloc_shared<int>(1, q);
    auto e1 = q.single_task([=] { a[0] = 1; });
    auto e2 = q.single_task([=] {});
    q.wait();

I want to be able to read a[0] as soon as e1 has completed, not only when e2 / q.wait() has.
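In other words (my sketch, continuing the snippet above), the hope is that polling e1's status would be enough to make the write visible:

```c++
// Hypothetical continuation: poll e1 instead of waiting on the whole queue.
while (e1.get_info<sycl::info::event::command_execution_status>() !=
       sycl::info::event_command_status::complete) {
}
int value = a[0];  // the hope: guaranteed to read 1 once e1 reports complete
```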

@TApplencourt (Contributor, Author):

Always reflow, always...

@keryell (Member) left a comment


Thanks!

@tomdeakin (Contributor):

WG approved to merge.

@tomdeakin (Contributor):

Waiting for confirmation on implementations.

@gmlueck (Contributor) commented Feb 6, 2025

This is the Intel internal tracker to implement in DPC++: CMPLRLLVM-65342
