
parallelized collapse function with openmp in StateVectorLQubit.hpp #986

Open: wants to merge 12 commits into base: master
Conversation


@xiaohanzai commented Nov 7, 2024


Context:
Improve the performance of the collapse method, which is exercised now that Mid-Circuit Measurement (MCM) support has been added to the lightning.qubit backend.

Description of the Change:
Parallelized collapse function with OpenMP.

Benefits:
Faster execution with several threads.

Here are the strong-scaling results on my Mac and on a cluster at UofT.
To benchmark the implementation, I used num_qubits = 10-30. The upper bound of 30 is chosen so that a complex-float array of size 1 << 30 still fits in my Mac's RAM.
The execution times are averages over 10 or more runs of the collapse function. The code I used for benchmarking is attached below.

[Screenshots: strong-scaling benchmark plots on the Mac and the UofT cluster, 2024-11-06/07]

Possible Drawbacks:

Related GitHub Issues:
#962

@AmintorDusko
Contributor

Hi @xiaohanzai, thank you for that! Could you please resolve the conflicts so we can run our CIs?

@AmintorDusko
Contributor

Thank you, @xiaohanzai!


codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.75%. Comparing base (f594f29) to head (fdc8d55).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #986      +/-   ##
==========================================
- Coverage   97.66%   93.75%   -3.91%     
==========================================
  Files         221      169      -52     
  Lines       33178    22309   -10869     
==========================================
- Hits        32403    20916   -11487     
- Misses        775     1393     +618     


@multiphaseCFD
Member

Thanks @xiaohanzai. Could you please share your benchmark scripts (both cpp and shell script code) for discussion?

@xiaohanzai
Author

Here's the script: https://codeshare.io/nAJZOY

And the shell commands I used:
g++-14 -O3 -Wall -Wextra -fopenmp -o benchmark benchmark.cpp
./benchmark > rst.txt

For num_qubits < 20 I used 1000 iterations in benchmark.cpp.
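For readers without access to the codeshare link, here is a minimal sketch of this kind of timing harness. It is not the actual benchmark.cpp; collapse_half() below is a simplified stand-in for the parallelized zeroing loop, and the iteration counts are placeholders:

```cpp
#include <chrono>
#include <complex>
#include <cstddef>
#include <iostream>
#include <vector>

// Simplified stand-in for the OpenMP-parallelized collapse loop.
void collapse_half(std::vector<std::complex<float>> &arr, std::size_t stride, bool branch) {
    const std::size_t half_section_size = arr.size() / (2 * stride);
    const std::size_t k = branch ? 0 : 1;
#if defined(_OPENMP)
#pragma omp parallel for collapse(2) default(none) shared(arr, half_section_size, stride, k)
#endif
    for (std::size_t idx = 0; idx < half_section_size; idx++) {
        for (std::size_t ids = 0; ids < stride; ids++) {
            arr[stride * (k + 2 * idx) + ids] = {0.0f, 0.0f}; // zero the discarded branch
        }
    }
}

int main() {
    for (std::size_t num_qubits = 10; num_qubits <= 30; num_qubits++) {
        const std::size_t vec_size = std::size_t{1} << num_qubits;
        std::vector<std::complex<float>> arr(vec_size, {1.0f, 0.0f});

        const int iters = (num_qubits < 20) ? 1000 : 10; // more repetitions for small sizes
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++) {
            collapse_half(arr, 1, true); // stride = 1 case
        }
        const auto t1 = std::chrono::steady_clock::now();
        std::cout << num_qubits << " "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count() / iters
                  << " ms\n";
    }
    return 0;
}
```

This builds with the same command as above (g++ -O3 -fopenmp ...), and the thread count can be swept by setting OMP_NUM_THREADS before each run.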

@tomlqc requested a review from multiphaseCFD on November 7, 2024 16:38
@multiphaseCFD
Member

Thanks @xiaohanzai for sharing the scripts. A few questions come to my mind for discussion. Please feel free to leave your thoughts, thanks!

@multiphaseCFD
Member

Since normalize() is a part of the collapse(), would you like to parallelize it as well?
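For context, a minimal sketch of what an OpenMP-parallelized normalize() could look like, using a reduction for the squared norm followed by a rescaling pass. This is only an illustration of the idea, not necessarily what the follow-up commit implements:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

void normalize(std::vector<std::complex<float>> &arr) {
    const std::size_t n = arr.size();

    // Pass 1: accumulate the squared 2-norm with an OpenMP reduction.
    double sum_sq = 0.0;
#if defined(_OPENMP)
#pragma omp parallel for reduction(+ : sum_sq) default(none) shared(arr, n)
#endif
    for (std::size_t i = 0; i < n; i++) {
        sum_sq += static_cast<double>(std::norm(arr[i])); // |a_i|^2
    }

    // Pass 2: rescale every amplitude by 1 / sqrt(sum_sq).
    const float inv_norm = static_cast<float>(1.0 / std::sqrt(sum_sq));
#if defined(_OPENMP)
#pragma omp parallel for default(none) shared(arr, n, inv_norm)
#endif
    for (std::size_t i = 0; i < n; i++) {
        arr[i] *= inv_norm;
    }
}
```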

@multiphaseCFD
Member

Could you please explain why more threads lead to a performance regression when the number of qubits is small, say num_qubits = 10?

@multiphaseCFD
Member

multiphaseCFD commented Nov 7, 2024

Could you please explain why increasing the number of threads improves performance when num_threads <= 16, while further increasing the thread count leads to a performance regression when num_qubits > 20?

@multiphaseCFD
Member

Could you please explain why stride=vec_size/2 scales better than stride=1 on the Mac, while the opposite holds on the cluster node?

@multiphaseCFD
Member

Is there any room to optimize this method? If yes, how would you like to further optimize this code? If not, why?

@xiaohanzai
Author

Hi @multiphaseCFD

I pushed another commit to parallelize normalize. Here are my thoughts on your questions:

Could you please explain why more threads lead to a performance regression when the number of qubits is small, say num_qubits = 10?

There are too few array elements to process, so the parallelization overhead is much larger than the actual workload. Parallelizing the code only pays off once the array is large enough.
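One common mitigation (not part of this PR, just an option for discussion) is OpenMP's if clause, which keeps execution serial below a size threshold. A fragment reusing the variable names from the diff below; the 1UL << 18 cutoff is an arbitrary placeholder that would need tuning per machine:

```cpp
#if defined(_OPENMP)
// Hypothetical variant: only spawn the thread team when the section is large
// enough to amortize the fork/join overhead; otherwise run serially.
#pragma omp parallel for collapse(2) default(none)                            \
    shared(arr, half_section_size, stride, k)                                  \
    if (half_section_size * stride > (1UL << 18))
#endif
```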

Could you please explain why increasing the number of threads improves performance when num_threads <= 16, while further increasing the thread count leads to a performance regression when num_qubits > 20?

My speculation is that the eight 32 KB L1 caches can hold about 2^15 complex-float array elements, so when the array is not too large it can stay in the nearest caches. But when the array gets very large, e.g. 2^30 elements, the data has to be loaded from RAM, and memory latency gets in the way.
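(As a rough check: 8 × 32 KB = 256 KB, and a std::complex<float> takes 8 bytes, so about 256 KB / 8 B = 2^15 elements fit across the L1 caches.)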

Could you please explain why stride=vec_size/2 scales better than stride=1 on the Mac, while the opposite holds on the cluster node?

I'm not sure about that one. I'd expect stride=1 to perform worse in general because of cache misses, but I'm not sure why the Mac and the cluster behave so differently.

Is there any room to optimize this method? If yes, how would you like to further optimize this code? If not, why?

I considered dynamic scheduling, static scheduling with a small chunk size, and loop tiling, but none of them performed better than the default scheduling. I think the scheduling overhead becomes too large when the array is large and the chunk size is small. So on my side I don't see any further ways to optimize it.
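For reference, the scheduling variants mentioned above correspond to clauses like the following, attached to the same nested zeroing loop as in the diff below. The chunk sizes are placeholders, and none of these variants ended up in the PR:

```cpp
// Default: the schedule is implementation-defined (typically static, one
// contiguous block of iterations per thread):
//   #pragma omp parallel for collapse(2)
// Static with a small chunk size: iterations handed out round-robin in blocks of 64:
//   #pragma omp parallel for collapse(2) schedule(static, 64)
// Dynamic: threads grab 1024-iteration chunks as they finish:
#pragma omp parallel for collapse(2) schedule(dynamic, 1024) default(none)    \
    shared(arr, half_section_size, stride, k)
for (std::size_t idx = 0; idx < half_section_size; idx++) {
    for (std::size_t ids = 0; ids < stride; ids++) {
        arr[stride * (k + 2 * idx) + ids] = {0.0f, 0.0f};
    }
}
```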

@@ -695,9 +695,12 @@ class StateVectorLQubit : public StateVectorBase<PrecisionT, Derived> {
     // **__**__ for stride 2
     // ****____ for stride 4
     const std::size_t k = branch ? 0 : 1;
+#if defined(_OPENMP)
+#pragma omp parallel for collapse(2) default(none) shared(arr, half_section_size, stride, k)
+#endif
     for (std::size_t idx = 0; idx < half_section_size; idx++) {

Thanks @xiaohanzai.
Could you try to fuse these two loops into one to see if the performance can be improved, especially for the stride=1 case?
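For illustration, one way to fuse the two loops into a single index space. This is a sketch, assuming the inner loop runs over stride contiguous elements starting at stride * (k + 2 * idx) as in the existing implementation; it is not necessarily the code that was benchmarked:

```cpp
const std::size_t k = branch ? 0 : 1;
const std::size_t total = half_section_size * stride;
#if defined(_OPENMP)
#pragma omp parallel for default(none) shared(arr, total, stride, k)
#endif
for (std::size_t n = 0; n < total; n++) {
    const std::size_t idx = n / stride; // which 2*stride-sized section
    const std::size_t ids = n % stride; // offset within the section
    arr[stride * (k + 2 * idx) + ids] = {0.0f, 0.0f};
}
```

For stride = 1 the division and modulo reduce to n and 0, so the fused loop becomes a flat loop over every other element.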

@xiaohanzai
Author

@multiphaseCFD Thanks a lot for the suggestion! I just updated the code in codeshare. On my Mac the results did not improve much, especially for stride = 1. This plot was obtained by running the collapse function without normalize(), same as before:
[Screenshot: strong-scaling plot, collapse without normalize(), Mac, 2024-11-08]

This is after adding in a parallelized normalize:
[Screenshot: strong-scaling plot with parallelized normalize(), Mac, 2024-11-08]

And results on the cluster with normalize:
[Screenshot: strong-scaling plot with normalize(), cluster, 2024-11-08]

@tomlqc
Contributor

tomlqc commented Nov 14, 2024

Thanks @xiaohanzai
