parallelized collapse function with openmp in StateVectorLQubit.hpp #986
base: master
Conversation
Hi @xiaohanzai, thank you for that! Could you please resolve the conflicts so we can run our CIs?

Thank you, @xiaohanzai!
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   97.66%   93.75%   -3.91%
==========================================
  Files         221      169      -52
  Lines       33178    22309   -10869
==========================================
- Hits        32403    20916   -11487
- Misses        775     1393     +618
```

☔ View full report in Codecov by Sentry.
Thanks @xiaohanzai. Could you please share your benchmark scripts (both cpp and shell script code) for discussion?
Here's the script: https://codeshare.io/nAJZOY. And the shell commands I used: for …
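The actual benchmark lives at the codeshare link above; for readers who cannot open it, here is a minimal hedged sketch of the methodology described later in the PR (averaging the wall-clock time of `collapse` over 10+ repetitions). The class name `StateVectorLQubitManaged`, its constructor, and the `collapse(wire, branch)` signature are assumptions based on the PR context, not the contents of the real script.

```cpp
// Hedged sketch only -- not the actual benchmark script from the PR.
// Assumes StateVectorLQubitManaged starts in |0...0> and exposes
// collapse(wire, branch); num_qubits, wire, and n_repeats are placeholders.
#include <chrono>
#include <cstddef>
#include <iostream>

#include "StateVectorLQubitManaged.hpp"

int main() {
    using Pennylane::LightningQubit::StateVectorLQubitManaged;
    constexpr std::size_t num_qubits = 20; // the PR sweeps 10-30
    constexpr std::size_t wire = 0;        // placeholder wire to collapse on
    constexpr int n_repeats = 10;

    double total_s = 0.0;
    for (int r = 0; r < n_repeats; r++) {
        StateVectorLQubitManaged<float> sv(num_qubits); // fresh |0...0> state
        const auto t0 = std::chrono::steady_clock::now();
        sv.collapse(wire, /*branch=*/false); // project onto the |0> branch
        const auto t1 = std::chrono::steady_clock::now();
        total_s += std::chrono::duration<double>(t1 - t0).count();
    }
    std::cout << "mean collapse time: " << total_s / n_repeats << " s\n";
}
```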
Thanks @xiaohanzai for sharing the scripts. A few questions come to my mind for discussion. Please feel free to leave your thoughts, thanks!
Since …
Could you please explain why more threads lead to a performance regression when the number of qubits is small, let's say …?
Could you please explain why increasing the number of threads can improve the performance when …?
Could you please explain why the …?
Is there any room to optimize this method? If yes, how would you like to further optimize this code? If not, why?
I pushed another commit to parallelize …
There are too few array elements to deal with, so the parallelization overhead is a lot larger than the actual workload. It's not very beneficial to parallelize the code unless the array size is large enough.
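One standard way to make that trade-off explicit (not something this PR does, just a hedged illustration) is OpenMP's `if` clause, which keeps the loop serial below a size threshold. The threshold value below is a made-up placeholder that would need tuning per machine.

```cpp
#include <complex>
#include <cstddef>

// Illustration only: the `if` clause disables the parallel region when the
// trip count is small, so thread start-up and scheduling overhead is only
// paid when there is enough work per thread. PARALLEL_THRESHOLD is an
// assumed, machine-dependent value.
constexpr std::size_t PARALLEL_THRESHOLD = 1U << 18;

void zero_range(std::complex<float> *arr, std::size_t n) {
#if defined(_OPENMP)
#pragma omp parallel for if (n > PARALLEL_THRESHOLD)
#endif
    for (std::size_t i = 0; i < n; i++) {
        arr[i] = {0.0F, 0.0F};
    }
}
```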
My speculation is that the eight 32 KB L1 caches can hold about 2^15 complex float array elements in total (8 × 32 KB = 256 KB, and each complex float is 8 bytes), so when the array isn't too large the elements can stay in the nearest caches. But when the array gets too large, e.g. 2^30 elements, they have to be loaded from RAM, and memory latency gets in the way.
That I'm not sure about... I'd expect the performance of stride=1 to be worse in general because of cache misses, but I'm not sure why the mac and the cluster appear to behave so differently.
I considered dynamic scheduling, static scheduling with a small chunk size, and loop tiling, but none of these seemed to work better than the default scheduling. I think the scheduling overhead is too large when the array size is large and the chunk size is small. So on my side I don't have any further ways of optimizing it.
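For reference, here is roughly how such a scheduling variant would be spelled on the pragma. The clause syntax is standard OpenMP; the chunk size is a placeholder, not the value that was actually benchmarked.

```cpp
#include <complex>
#include <cstddef>

// Illustration only: the same zeroing loop with an explicit schedule clause.
// Swapping in schedule(static, CHUNK) or hand-rolled loop tiling are the
// other variants mentioned above; per the comment, none of them beat the
// default schedule for this memory-bound workload.
void zero_range_scheduled(std::complex<float> *arr, std::size_t n) {
    constexpr std::size_t CHUNK = 64; // placeholder chunk size
#if defined(_OPENMP)
#pragma omp parallel for schedule(dynamic, CHUNK)
#endif
    for (std::size_t i = 0; i < n; i++) {
        arr[i] = {0.0F, 0.0F};
    }
}
```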
```diff
@@ -695,9 +695,12 @@ class StateVectorLQubit : public StateVectorBase<PrecisionT, Derived> {
         // **__**__ for stride 2
         // ****____ for stride 4
         const std::size_t k = branch ? 0 : 1;
+#if defined(_OPENMP)
+#pragma omp parallel for collapse(2) default(none) shared(arr, half_section_size, stride, k)
+#endif
         for (std::size_t idx = 0; idx < half_section_size; idx++) {
```
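The rest of the loop body is cut off in the hunk above. For readers without the full file, here is a hedged reconstruction of the two nested loops that `collapse(2)` applies to, inferred only from the stride-pattern comments and the variables named in the pragma; it may differ from the actual code.

```cpp
#include <complex>
#include <cstddef>

// Hedged reconstruction, not necessarily the PR's exact code: zero the half
// of every 2*stride-long section selected by k (k = 0 or 1), matching the
// **__**__ (stride 2) and ****____ (stride 4) patterns in the diff comments.
template <class PrecisionT>
void zero_branch(std::complex<PrecisionT> *arr, std::size_t half_section_size,
                 std::size_t stride, std::size_t k) {
    for (std::size_t idx = 0; idx < half_section_size; idx++) {
        for (std::size_t offset = 0; offset < stride; offset++) {
            arr[(2 * idx + k) * stride + offset] = {}; // amplitude -> 0
        }
    }
}
```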
Thanks @xiaohanzai.
Could you try to fuse these two loops into one to see if the performance can be improved, especially for the stride=1 case?
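A hedged sketch of what the suggested fusion might look like (an assumption about the approach, not necessarily how it was implemented in the updated code): replace the `collapse(2)` nest with a single flat loop over `half_section_size * stride` iterations and recover the section and offset indices arithmetically, so one `parallel for` has the full range to split across threads even when `stride == 1`.

```cpp
#include <complex>
#include <cstddef>

// Hedged sketch of the suggested loop fusion (assumption, not the PR's code).
template <class PrecisionT>
void zero_branch_fused(std::complex<PrecisionT> *arr,
                       std::size_t half_section_size, std::size_t stride,
                       std::size_t k) {
#if defined(_OPENMP)
#pragma omp parallel for default(none) shared(arr, half_section_size, stride, k)
#endif
    for (std::size_t i = 0; i < half_section_size * stride; i++) {
        const std::size_t idx = i / stride;    // which 2*stride-long section
        const std::size_t offset = i % stride; // position within the section
        arr[(2 * idx + k) * stride + offset] = {}; // amplitude -> 0
    }
}
```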
@multiphaseCFD Thanks a lot for the suggestion! I just updated the code in codeshare. It seems like on my mac the results didn't improve much, especially for stride = 1. This plot was obtained by running …
Thanks @xiaohanzai
Before submitting

Please complete the following checklist when submitting a PR:

- All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the `tests` directory!
- All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- Ensure that the test suite passes, by running `make test`.
- Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR.
- Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.
Context:
Would like to improve performance of the `collapse` method since Mid-Circuit Measurement (MCM) support was added to the `lightning.qubit` backend.

Description of the Change:
Parallelized the `collapse` function with OpenMP.

Benefits:
Faster execution with several threads. Here are the strong scaling results on my mac and on a cluster at UofT.
To benchmark the implementation, I assumed `num_qubits = 10-30`. The upper bound of 30 is chosen so that on my mac a complex float array of size `1<<30` can be held in RAM. The execution times are obtained by averaging the run time of the `collapse` function 10 or more times. Attached please also see the code that I used for benchmarking.

Possible Drawbacks:

Related GitHub Issues:
#962