[KHR] add sycl_khr_queue_empty_query #700
base: main
Conversation
Updated with just
@Pennycook, I would appreciate it if you could look at my example. I think/hope/pray that it's free of UB :)
I think this is free of UB if (as Greg pointed out) you use an in-order queue.
My reasoning, in case anybody is interested:
- The `std::atomic_bool` is only used by host code.
- If the `host_task` executes eagerly, it will run until the atomic value changes.
- If the `host_task` executes lazily, it will not run until `e2.wait()` is called.
Co-authored-by: Greg Lueck <[email protected]>
Co-authored-by: John Pennycook <[email protected]>
I like your style better, but this is the style in the rest of the spec, so we should be consistent.
Co-authored-by: Ronan Keryell <[email protected]>
Ready to
_Returns:_ [code]#true# if all <<command,commands>> enqueued on this queue have completed, [code]#false# otherwise.

{note} Since the implementation executes commands asynchronously, the returned value is a snapshot in time.
Because the queue class is thread-safe, I was wondering if we should remark that the value returned from this function should be considered immediately stale. The user might be able to guarantee that the value is still accurate - e.g. no threading, or the user has a separate lock over the queue or similar - but we can't prove that a queue returning `true` from `empty()` will still be `true` a moment later, in the general case. A note to this effect might help users understand that they can write code which would fall foul of this.
I think the concern is not that another thread might add something to the queue, it's that the SYCL runtime might asynchronously remove things from the queue. No amount of careful coding on the application's part can solve this. If `queue::khr_empty` returns `false`, the queue might become empty a moment later.
I can add "immediately stale". Maybe better than "snapshot in time".

You know that if it's `true` and nobody in your app enqueues stuff in the queue, it will still be `true`.

For `false`, it's outside your app's control. Maybe the tick after, it will become `true`, as @gmlueck said (regardless of what you did in your code).

I don't know how to phrase it better. Sadly, `std::future` doesn't have an `is_ready` that could inspire us.
@gmlueck agreed - I just don't think that's as much of an issue as, say, believing that the queue is empty, then having another thread dump 2500 kernels onto it :D
@TApplencourt I think I stole the "immediately stale" wording from some other API. But I think it communicates the idea.
_Returns:_ [code]#true# if all <<command,commands>> enqueued on this queue have completed, [code]#false# otherwise.
The comment about thread safety here made me think about what the desired synchronization/observability semantics are here. I'm starting a new comment thread because I think it's a separate issue, and I don't think that comments about thread-safety/staleness can resolve it.
Before we get into the standardese: should the examples below work, or not?
Example 1: Host-Device Synchronization

```cpp
bool* usm = sycl::malloc_shared<bool>(1, q);
*usm = false;
q.single_task([=]() {
  *usm = true;
});
while (not q.empty()) {} // NB: This thread never called wait
// If the queue is empty, we "know" the single_task completed.
// Are its results guaranteed to be visible?
assert(*usm == true);
```
Example 2: Inter-Thread Synchronization via Device

```cpp
// Assume these allocations are visible to both threads.
bool* a = static_cast<bool*>(malloc(sizeof(bool)));
bool* b = sycl::malloc_shared<bool>(1, q);
*a = false;
*b = false;

// Thread 1
{
  *a = true;
  q.single_task([=]() {
    *b = true;
  });
}

// Thread 2
{
  if (q.empty()) {
    if (*b == true) {
      // If the queue is empty, the single_task might have executed.
      // If b is true, we "know" the single_task executed (assuming Example 1 is valid).
      // Are things Thread 1 did before enqueueing the task guaranteed to be visible to Thread 2?
      assert(*a == true);
    }
  }
}
```
Answering these questions might be something that we want to defer until a larger rework of the execution model, but I wanted to bring it up so we don't lose track of it.
I think example 1 is not guaranteed to work because a SYCL implementation is allowed to execute work in queues lazily, waiting for the application to call `wait`. Therefore the loop on `q.empty` could be an infinite loop.

I think your real question is about inter-thread synchronization, though. It seems like this question exists even without relying on `queue::empty`. Is the following example well defined?
```cpp
// Assume these allocations are visible to both threads.
bool* a = static_cast<bool*>(malloc(sizeof(bool)));
bool* b = sycl::malloc_shared<bool>(1, q);
*a = false;
*b = false;

// Thread 1
{
  *a = true;
  q.single_task([=]() {
    *b = true;
  });
}

// Thread 2
{
  q.wait();
  // Are things Thread 1 did before enqueueing the task guaranteed to be visible to Thread 2?
  assert(*a == true);
}
```
If you think the answer is "yes", what part of the spec guarantees this?
> I think example 1 is not guaranteed to work because a SYCL implementation is allowed to execute work in queues lazily, waiting for the application to call `wait`. Therefore the loop on `q.empty` could be an infinite loop.
You're right, but I was trying not to overcomplicate the example by bringing eager/lazy into it!

If you ignore that some implementations may go into an infinite loop, the question is still interesting: if the host thread does make it past that loop, should there be any guarantee of the `*usm` value?
> I think your real question is about inter-thread synchronization, though. It seems like this question exists even without relying on `queue::empty`. Is the following example well defined?
>
> ...
>
> If you think the answer is "yes", what part of the spec guarantees this?
I agree that we need to clarify the behavior of other APIs, which is why it might make sense just to note that `empty` has this problem and revisit it as part of later execution/memory model clarifications. Querying whether an event is in `complete` status has exactly the same problem.

I think `q.wait()` is different. Although it's not clearly/formally stated anywhere that your example would work -- and `q.wait()` isn't described formally in terms of synchronization, etc -- I think it's aligned with the intent of `wait`.

I know examples are non-normative, but as a demonstrator of intent: the example in the USM section shows that calling `wait` makes the memory available to the thread that called `wait`:
```cpp
myQueue.parallel_for(1024, [=](id<1> idx) {
  // Initialize each buffer element with its own rank number starting at 0
  data[idx] = idx;
}); // End of the kernel function

// Explicitly wait for kernel execution since there is no accessor involved
myQueue.wait();

// Print result
for (int i = 0; i < 1024; i++)
  std::cout << "data[" << i << "] = " << data[i] << std::endl;
```
...the implication being that `wait()` is some sort of synchronizing operation, or at least that we can guarantee the end of the kernel function happens-before the thread blocked on `wait` is unblocked. Everybody is relying on this behavior today whenever they use USM.

I think this makes your example well-defined: `*a = true` is sequenced-before `q.single_task`, which happens-before the start of the kernel (on the device), which is sequenced-before the end of the kernel function (on the device), which happens-before thread 2 is unblocked.

I don't know whether we intended for a query that returns `info::event_command_status::complete` to have the same behavior or not, and I don't know what the intent is with `empty()`. I suspect we want them both to have some sort of synchronization behavior, but that's not clear from the specification.
> ...the implication being that `wait()` is some sort of synchronizing operation, or at least that we can guarantee the end of the kernel function happens-before the thread blocked on `wait` is unblocked. Everybody is relying on this behavior today whenever they use USM.
>
> I think this makes your example well-defined: `*a = true` is sequenced-before `q.single_task`, which happens-before the start of the kernel (on the device), which is sequenced-before the end of the kernel function (on the device), which happens-before thread 2 is unblocked.
I agree that it's reasonable to assume that `queue::wait` is a synchronization point that ensures the memory written by kernels in the queue is visible to the calling thread. I wasn't sure if it also guaranteed that memory written by another thread was visible to this thread. I guess you are saying that normal C++ rules for "synchronizes with" would provide this guarantee because the write to `a` is "sequenced before" the kernel is submitted?
Your point about `info::event_command_status::complete` seems very relevant. It seems like these two loops should provide similar guarantees (or lack of guarantees):

```cpp
while (not q.empty()) {}
while (e.get_info<info::event::command_execution_status>() != info::event_command_status::complete) {}
```
> I agree that it's reasonable to assume that `queue::wait` is a synchronization point that ensures the memory written by kernels in the queue is visible to the calling thread. I wasn't sure if it also guaranteed that memory written by another thread was visible to this thread. I guess you are saying that normal C++ rules for "synchronizes with" would provide this guarantee because the write to `a` is "sequenced before" the kernel is submitted?
Yes, exactly. C++ doesn't have multiple devices, of course, so I'm trying to apply the rules as-if the device were just another thread. My understanding is that "sequenced before" is more or less a fancy way of saying that "within a thread, things happen in program order", and then there are a bunch of transitivity rules (starting here) that I interpret to mean something like "if A synchronizes with B, anything that happened before A must also have happened before B".
> Your point about `info::event_command_status::complete` seems very relevant. It seems like these two loops should provide similar guarantees (or lack of guarantees):
I agree they should behave the same.
True. Two independent questions indeed.

For 1/: At least, from experience, I know "real" apps that do a spin loop on `cuEventQuery`, and they never deadlock. Not sure if it's because CUDA is always greedy, or because `cuEventQuery` forces submission. (I guess (2) is an implementation detail of (1).)

So I think we should say "yes". It will "please" people porting to CUDA, or people who are "latency bound", I suppose:

```cpp
for (...) {
  Q.submit(...);
  Q.empty();
}
Q.wait();
```

should be faster than:

```cpp
for (...) {
  Q.submit(...);
  Q.wait();
}
```

For 2/: definitely yes. But my idea is that if checking for an event doesn't force submission, it forces you to always call `wait`, so question 2/ is "useless", or a tautology. You always call `wait` (because this is the only way to submit), so the command's side effects are always visible, regardless of whether you observe a completed event. And you cannot see a completed event without calling `wait`.

Hope it's kind of clear.
> So I think we should say "yes". It will "please" people porting to CUDA, or people who are "latency bound", I suppose.
I'm not sure we want to do this. I might be wrong, but my gut says that requiring implementations to start executing kernels when an event is queried would prevent single-threaded implementations (or at least make them a lot more complicated). The only way to avoid deadlocks would be to implement a mechanism to switch between host and device code during kernel execution, which I think would be difficult in the general case.
> But my idea is that if checking for an event doesn't force submission, it forces you to always call `wait`, so question 2/ is "useless", or a tautology. You always call `wait` (because this is the only way to submit), so the command's side effects are always visible, regardless of whether you observe a completed event. And you cannot see a completed event without calling `wait`.
`wait` is the only way to guarantee that a kernel is started, but the opposite is not true; there is no guarantee that a kernel will not start until `wait` is called.
So, a spin-loop querying an event might work, it's just implementation-defined. Such a loop would definitely fail for a single-threaded implementation like SimSYCL (because the kernel will never start), but would probably work when offloading to an accelerator via CUDA/OpenCL/Level Zero (because the kernel will probably start).
@TApplencourt - I don't think we should say anything about the execution model in this KHR, so I propose that we defer that discussion until later. I've proposed some wording in https://github.com/KhronosGroup/SYCL-Docs/pull/700/files#r1942646351 to address the memory visibility aspect, though, because I think we are in agreement that the answer to the second (Memory Model) question needs to be "Yes".
Thanks a lot! Merged.
Memory Model: Does observing a completed event status imply the command's side effects are visible?

One more argument in favor is that in code like this:

```cpp
sycl::queue q{sycl::property::queue::in_order{}};
auto a = sycl::malloc_shared<int>(1, q);
auto e1 = q.single_task([=] { a[0] = 1; });
auto e2 = q.single_task([=] {});
q.wait();
```

I want to be able to read `a[0]` when `e1` has completed, not when `e2` / `q.wait()` did.
Always reflow, always...
Thanks!
WG approved to merge.
Waiting for confirmation on implementations.
This is the Intel internal tracker to implement in DPC++: CMPLRLLVM-65342
Ported: https://gitlab.khronos.org/sycl/Specification/-/merge_requests/727/
Add `queue.size()` and `queue.empty()`. I think we agreed that the other `get_wait_list()` was hard to implement, so I removed it from here.

I need help naming those APIs. I'm not sure how it works to add a new member function to an existing class... Should I prefix the name of the function with `khr`?

Thanks in advance!