Adding local work requesting scheduler that is based on message passing internally #5845
Conversation
Force-pushed from b9472b7 to a98609d
First performance measurements show an overall improvement of up to 5-10%. Very promising!
Force-pushed from a98609d to 26bbc03
Performance test report: HPX Performance Comparison
Force-pushed from 2a7d2fc to 753df97
Force-pushed from a7f7496 to 511b157
I started looking at the performance of this scheduler a bit (in the single-NUMA-domain case). Directly replacing our current scheduler, it performs worse in the general case (at least on the algorithm benchmarks, which I had handy). In the case of a bulk execution of relatively uniform work, stealing is very limited, because our scheduling is already quite balanced anyway. This would explain the performance deficit: we are possibly taking on a larger overhead (a more complex scheduler) for a small benefit (a few non-cache-disrupting steals). There is also the fact that we can only poll for steal requests between thread executions, which introduces some latency in responding to them. I will still try to produce a best-case scenario for this scheduler, even if only as a proof of concept.

Edit: it seems to perform much better on few cores, which could suggest a large number of failed stealing attempts when we have many cores. We could experiment with sending steal requests to where there is actual work.
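For illustration, here is a minimal, single-threaded sketch of the work-requesting protocol being discussed: idle workers send explicit steal-request messages, and a victim answers them only when its scheduling loop regains control between two task executions (the source of the latency mentioned above). All names are illustrative; this is not HPX's scheduler code.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

using task = std::function<void()>;

struct steal_request
{
    std::size_t requester;    // id of the idle worker asking for work
};

struct worker
{
    std::deque<task> queue;                // local work
    std::deque<steal_request> requests;    // incoming steal requests
    std::deque<task> handed_over;          // tasks transferred by a victim
};

// a victim answers pending requests with a "steal half" policy, but only
// when it is not currently executing a task
void answer_steal_requests(std::vector<worker>& w, std::size_t victim)
{
    while (!w[victim].requests.empty() && !w[victim].queue.empty())
    {
        steal_request r = w[victim].requests.front();
        w[victim].requests.pop_front();

        std::size_t n = (w[victim].queue.size() + 1) / 2;    // steal half
        for (std::size_t i = 0; i != n; ++i)
        {
            w[r.requester].handed_over.push_back(
                std::move(w[victim].queue.back()));
            w[victim].queue.pop_back();
        }
    }
}

int main()
{
    std::vector<worker> w(2);
    for (int i = 0; i != 4; ++i)
        w[0].queue.push_back([i] { std::printf("task %d\n", i); });

    // worker 1 is idle and sends a steal request to worker 0
    w[0].requests.push_back({1});

    // worker 0 only notices the request *between* task executions, which
    // is the response latency mentioned above
    task t = std::move(w[0].queue.front());
    w[0].queue.pop_front();
    t();
    answer_steal_requests(w, 0);

    std::printf("worker 1 received %zu tasks\n", w[1].handed_over.size());
    return 0;
}
```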
I did see improvements on tests that run a large number of separate tasks (like fibonacci). For uniform iterative parallelism the benefit would probably be small.
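For reference, the workload shape meant here is recursive task spawning along the lines of HPX's fibonacci examples. A minimal sketch (header names vary across HPX versions; this is not the benchmark itself):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>

#include <cstdint>
#include <iostream>

// naive fibonacci: every level of the recursion spawns an independent HPX
// task, producing many small, irregular tasks for the scheduler to place
std::uint64_t fibonacci(std::uint64_t n)
{
    if (n < 2)
        return n;

    hpx::future<std::uint64_t> lhs = hpx::async(fibonacci, n - 1);
    std::uint64_t rhs = fibonacci(n - 2);    // run the other child inline
    return lhs.get() + rhs;
}

int main()
{
    std::cout << "fibonacci(20) = " << fibonacci(20) << "\n";
    return 0;
}
```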
Force-pushed from f910674 to a282657
@hkaiser You'll still have to add the fix in 1d_stencil_8.
I thought that was fixed by #6294.
@hkaiser No, I think it cannot be fixed that way (that's why I had gotten a bit confused there). It's a plain old out-of-scope situation. I think that makes sense, unless I have some misconception about the role of sliding_semaphore.
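A hedged sketch of the lifetime issue being described (an assumed, simplified shape of the 1d_stencil_8 pattern; exact headers and semaphore semantics vary across HPX versions): a sliding_semaphore on the stack limits how far the time-stepping loop may run ahead, and continuations signal it. Any continuation that captured `&sem` and is still in flight when the enclosing scope unwinds would signal a destroyed semaphore.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <hpx/synchronization/sliding_semaphore.hpp>

#include <cstdint>

int main()
{
    std::int64_t const look_ahead = 4;

    {
        hpx::sliding_semaphore sem(look_ahead);

        hpx::future<void> step = hpx::make_ready_future();
        for (std::int64_t t = 0; t != 100; ++t)
        {
            // suspend if we are more than look_ahead steps ahead of the
            // last completed step
            sem.wait(t);

            step = step.then([](hpx::future<void>&&) { /* one time step */ })
                       .then([&sem, t](hpx::future<void>&&) {
                           // dangles if sem's scope has already unwound
                           sem.signal(t);
                       });
        }

        // without this wait, trailing continuations that captured &sem
        // could still be in flight when sem is destroyed at the end of
        // this scope; that is the out-of-scope situation described above
        step.get();
    }

    return 0;
}
```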
Force-pushed from 6b926b4 to 48b364b
@Pansysk75 I have applied the change to the use of the sliding_semaphore.
Seems like tasks get stuck in the stealing queue; re-applying this fix solves the issue. Did you do something else to try to solve that issue, or was this fix accidentally left behind?
Thanks a lot - not sure how that got lost. Much appreciated!
… internally

- Using uniform_int_distribution with proper bounds
- Removing queue index from thread_queues as it was unused
- flyby: remove commented-out options from .clang-format
- Renaming workstealing --> workrequesting
- Adding adaptive work stealing (steal half/steal one) - this makes this scheduler consistently (albeit only slightly) faster than the (default) local-priority scheduler
- Adding LIFO and FIFO variations of the local work-stealing scheduler
- flyby: fixing HPX_WITH_SWAP_CONTEXT_EMULATION
- flyby: minor changes to the fibonacci_local example
- Adding high- and low-priority queues
- flyby: cache_line_data now does not generate warnings/errors if padding is not needed
- Adding bound queues
- flyby: using cache_line_data for scheduler states
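On the cache_line_data item in the commit message above: padding per-worker scheduler state to cache-line boundaries avoids false sharing. A generic illustration in plain C++ (HPX's own utility is hpx::util::cache_line_data, whose exact interface may differ): without padding, adjacent per-worker counters share a cache line, and every write by one core invalidates that line in all other cores' caches.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>

// pad each element to a full cache line; 64 bytes is assumed here, the
// standard constant is std::hardware_destructive_interference_size
template <typename T>
struct alignas(64) cache_line
{
    T data_;
};

struct scheduler_state_counters
{
    static constexpr std::size_t num_workers = 8;

    // each worker owns one element; the padding keeps neighbouring
    // counters from bouncing the same cache line between cores
    cache_line<std::atomic<std::size_t>> stolen_tasks[num_workers];
};

int main()
{
    scheduler_state_counters s{};
    s.stolen_tasks[0].data_.fetch_add(1, std::memory_order_relaxed);
    std::printf("element stride: %zu bytes\n", sizeof(s.stolen_tasks[0]));
    return 0;
}
```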
Force-pushed from cb0e449 to 02e2b4b
Performance test report: HPX Performance Comparison
I think this is good to go now. Thanks again @Pansysk75!
bors merge
Build succeeded! The publicly hosted instance of bors-ng is deprecated and will go away soon. If you want to self-host your own instance, instructions are here. If you want to switch to GitHub's built-in merge queue, visit their help page.
This adds a new experimental work-requesting scheduler to the list of existing schedulers.
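For context on trying it out: HPX selects its scheduler at runtime via the --hpx:queuing command-line option (e.g. --hpx:queuing=local-priority-fifo for the existing default family). Assuming the new scheduler registers under a name along the lines of local-workrequesting-fifo (illustrative; the authoritative names are in this PR's changes), selecting it would look like:

```
./fibonacci_local --hpx:queuing=local-workrequesting-fifo
```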