
parallelized collapse function with openmp in StateVectorLQubit.hpp #986

Open: wants to merge 12 commits into base: master
Conversation


@xiaohanzai commented Nov 7, 2024


Context:
Improve the performance of the collapse method, which is exercised now that Mid-Circuit Measurement (MCM) support has been added to the lightning.qubit backend.

Description of the Change:
Parallelized collapse function with OpenMP.

Benefits:
Faster execution with several threads.

Here are the strong-scaling results on my Mac and on a cluster at UofT.
To benchmark the implementation, I used num_qubits = 10-30. The upper bound of 30 is chosen so that a complex-float array of size 1 << 30 still fits in my Mac's RAM.
The execution times are averages over 10 or more runs of the collapse function. The code I used for benchmarking is attached below.

[Screenshots: strong-scaling benchmark plots on the Mac and the UofT cluster, 2024-11-06/07]

Possible Drawbacks:

Related GitHub Issues:
#962

@AmintorDusko
Contributor

Hi @xiaohanzai, thank you for that! Could you please resolve the conflicts so we can run our CIs?

@AmintorDusko
Contributor

Thank you, @xiaohanzai!


codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.75%. Comparing base (f594f29) to head (fdc8d55).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   97.66%   93.75%   -3.91%     
==========================================
  Files         221      169      -52     
  Lines       33178    22309   -10869     
==========================================
- Hits        32403    20916   -11487     
- Misses        775     1393     +618     


@multiphaseCFD
Member

Thanks @xiaohanzai. Could you please share your benchmark scripts (both cpp and shell script code) for discussion?

@xiaohanzai
Author

Here's the script: https://codeshare.io/nAJZOY

And the shell commands I used:
g++-14 -O3 -Wall -Wextra -fopenmp -o benchmark benchmark.cpp
./benchmark > rst.txt

For num_qubits < 20 I used 1000 iterations in benchmark.cpp.
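For readers without access to the codeshare link, here is a minimal sketch of this kind of timing harness. It is not the actual benchmark.cpp; collapse_half() below is a simplified stand-in for the parallelized zeroing loop, and the iteration counts are placeholders:

```cpp
#include <chrono>
#include <complex>
#include <cstddef>
#include <iostream>
#include <vector>

// Simplified stand-in for the OpenMP-parallelized collapse loop.
void collapse_half(std::vector<std::complex<float>> &arr, std::size_t stride, bool branch) {
    const std::size_t half_section_size = arr.size() / (2 * stride);
    const std::size_t k = branch ? 0 : 1;
#if defined(_OPENMP)
#pragma omp parallel for collapse(2) default(none) shared(arr, half_section_size, stride, k)
#endif
    for (std::size_t idx = 0; idx < half_section_size; idx++) {
        for (std::size_t ids = 0; ids < stride; ids++) {
            arr[stride * (k + 2 * idx) + ids] = {0.0f, 0.0f}; // zero the discarded branch
        }
    }
}

int main() {
    for (std::size_t num_qubits = 10; num_qubits <= 30; num_qubits++) {
        const std::size_t vec_size = std::size_t{1} << num_qubits;
        std::vector<std::complex<float>> arr(vec_size, {1.0f, 0.0f});

        const int iters = (num_qubits < 20) ? 1000 : 10; // more repetitions for small sizes
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++) {
            collapse_half(arr, 1, true); // stride = 1 case
        }
        const auto t1 = std::chrono::steady_clock::now();
        std::cout << num_qubits << " "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count() / iters
                  << " ms\n";
    }
    return 0;
}
```

This builds with the same command as above (g++ -O3 -fopenmp ...), and the thread count can be swept by setting OMP_NUM_THREADS before each run.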

@tomlqc requested a review from multiphaseCFD on November 7, 2024 16:38
@multiphaseCFD
Member

Thanks @xiaohanzai for sharing the scripts. A few questions come to my mind for discussion. Please feel free to leave your thoughts, thanks!

@multiphaseCFD
Member

Since normalize() is a part of the collapse(), would you like to parallelize it as well?
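For context, a minimal sketch of what an OpenMP-parallelized normalize() could look like, using a reduction for the squared norm followed by a rescaling pass. This is only an illustration of the idea, not necessarily what the follow-up commit implements:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

void normalize(std::vector<std::complex<float>> &arr) {
    const std::size_t n = arr.size();

    // Pass 1: accumulate the squared 2-norm with an OpenMP reduction.
    double sum_sq = 0.0;
#if defined(_OPENMP)
#pragma omp parallel for reduction(+ : sum_sq) default(none) shared(arr, n)
#endif
    for (std::size_t i = 0; i < n; i++) {
        sum_sq += static_cast<double>(std::norm(arr[i])); // |a_i|^2
    }

    // Pass 2: rescale every amplitude by 1 / sqrt(sum_sq).
    const float inv_norm = static_cast<float>(1.0 / std::sqrt(sum_sq));
#if defined(_OPENMP)
#pragma omp parallel for default(none) shared(arr, n, inv_norm)
#endif
    for (std::size_t i = 0; i < n; i++) {
        arr[i] *= inv_norm;
    }
}
```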

@multiphaseCFD
Member

Could you please explain why more threads lead to a performance regression when the number of qubits is small, say num_qubits = 10?

@multiphaseCFD
Member

multiphaseCFD commented Nov 7, 2024

Could you please explain why increasing the number of threads improves performance when num_threads <= 16, while further increasing the thread count leads to a performance regression when num_qubits > 20?

@multiphaseCFD
Member

Could you please explain why stride=vec_size/2 scales better than stride=1 on the Mac, while the opposite holds on the cluster node?

@multiphaseCFD
Member

Is there any room to optimize this method? If yes, how would you like to further optimize this code? If not, why?

@xiaohanzai
Author

Hi @multiphaseCFD

I pushed another commit to parallelize normalize. Here are my thoughts on your questions:

Could you please explain why more threads lead to a performance regression when the number of qubits is small, say num_qubits = 10?

There are too few array elements to process, so the parallelization overhead is much larger than the actual workload. Parallelizing the code only pays off once the array is large enough.
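One common mitigation (not part of this PR, just an option for discussion) is OpenMP's if clause, which keeps execution serial below a size threshold. A fragment reusing the variable names from the diff below; the 1UL << 18 cutoff is an arbitrary placeholder that would need tuning per machine:

```cpp
#if defined(_OPENMP)
// Hypothetical variant: only spawn the thread team when the section is large
// enough to amortize the fork/join overhead; otherwise run serially.
#pragma omp parallel for collapse(2) default(none)                            \
    shared(arr, half_section_size, stride, k)                                  \
    if (half_section_size * stride > (1UL << 18))
#endif
```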

Could you please explain why increasing the number of threads improves performance when num_threads <= 16, while further increasing the thread count leads to a performance regression when num_qubits > 20?

My speculation is that the eight 32 KB L1 caches can hold about 2^15 complex-float array elements, so when the array is not too large it can stay in the nearest caches. But when the array gets very large, e.g. 2^30 elements, the data has to be loaded from RAM, and memory latency gets in the way.
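(As a rough check: 8 × 32 KB = 256 KB, and a std::complex<float> takes 8 bytes, so about 256 KB / 8 B = 2^15 elements fit across the L1 caches.)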

Could you please explain why stride=vec_size/2 scales better than stride=1 on the Mac, while the opposite holds on the cluster node?

I'm not sure about that one. I'd expect stride=1 to perform worse in general because of cache misses, but I'm not sure why the Mac and the cluster behave so differently.

Is there any room to optimize this method? If yes, how would you like to further optimize this code? If not, why?

I considered dynamic scheduling, static scheduling with a small chunk size, and loop tiling, but none of them performed better than the default scheduling. I think the scheduling overhead becomes too large when the array is large and the chunk size is small. So on my side I don't see any further ways to optimize it.
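For reference, the scheduling variants mentioned above correspond to clauses like the following, attached to the same nested zeroing loop as in the diff below. The chunk sizes are placeholders, and none of these variants ended up in the PR:

```cpp
// Default: the schedule is implementation-defined (typically static, one
// contiguous block of iterations per thread):
//   #pragma omp parallel for collapse(2)
// Static with a small chunk size: iterations handed out round-robin in blocks of 64:
//   #pragma omp parallel for collapse(2) schedule(static, 64)
// Dynamic: threads grab 1024-iteration chunks as they finish:
#pragma omp parallel for collapse(2) schedule(dynamic, 1024) default(none)    \
    shared(arr, half_section_size, stride, k)
for (std::size_t idx = 0; idx < half_section_size; idx++) {
    for (std::size_t ids = 0; ids < stride; ids++) {
        arr[stride * (k + 2 * idx) + ids] = {0.0f, 0.0f};
    }
}
```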

@@ -695,9 +695,12 @@ class StateVectorLQubit : public StateVectorBase<PrecisionT, Derived> {
     // **__**__ for stride 2
     // ****____ for stride 4
     const std::size_t k = branch ? 0 : 1;
+#if defined(_OPENMP)
+#pragma omp parallel for collapse(2) default(none) shared(arr, half_section_size, stride, k)
+#endif
     for (std::size_t idx = 0; idx < half_section_size; idx++) {

Thanks @xiaohanzai.
Could you try to fuse these two loops into one to see if the performance can be improved, especially for the stride=1 case?
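For illustration, one way to fuse the two loops into a single index space. This is a sketch, assuming the inner loop runs over stride contiguous elements starting at stride * (k + 2 * idx) as in the existing implementation; it is not necessarily the code that was benchmarked:

```cpp
const std::size_t k = branch ? 0 : 1;
const std::size_t total = half_section_size * stride;
#if defined(_OPENMP)
#pragma omp parallel for default(none) shared(arr, total, stride, k)
#endif
for (std::size_t n = 0; n < total; n++) {
    const std::size_t idx = n / stride; // which 2*stride-sized section
    const std::size_t ids = n % stride; // offset within the section
    arr[stride * (k + 2 * idx) + ids] = {0.0f, 0.0f};
}
```

For stride = 1 the division and modulo reduce to n and 0, so the fused loop becomes a flat loop over every other element.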

@xiaohanzai
Author

@multiphaseCFD Thanks a lot for the suggestion! I just updated the code in codeshare. On my Mac the results did not improve much, especially for stride = 1. This plot was obtained by running the collapse function without normalize(), same as before:
[Screenshot: strong-scaling plot, collapse without normalize(), Mac, 2024-11-08]

This is after adding in a parallelized normalize:
[Screenshot: strong-scaling plot with parallelized normalize(), Mac, 2024-11-08]

And results on the cluster with normalize:
[Screenshot: strong-scaling plot with normalize(), cluster, 2024-11-08]

@tomlqc
Contributor

tomlqc commented Nov 14, 2024

Thanks @xiaohanzai
