-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: add skip transpile to t1 #1454
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dcmckayibm. I left some comments here.
@@ -98,7 +98,7 @@ def __init__( | |||
# Set experiment options | |||
self.set_experiment_options(delays=delays) | |||
|
|||
def circuits(self) -> List[QuantumCircuit]: | |||
def circuits(self, qnum=0) -> List[QuantumCircuit]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is much faster but circuits()
is the base class method and signature shouldn't change. Indeed this breaks polymorphism of experiment class, e.g. user may think T2 also has this argument. Also experiment.run
doesn't consider the qnum
argument.
You can consider another approach of creating a pass manager that only performs virtual -> physical index mapping. Which must be lightweight compared with the default level0 pass manager (to reduce overhead you can directly run layout
, i.e. layout.run(circuits)
)
Unfortunately this code is not as performant as the proposed code change because of the initialization overhead of passes and pass manager instance. However this doesn't cause API break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's anything forbidden by this idea in inheritance because the circuit call still behaves the same. You can absolutely add parameters to an inherited function. But this is also backwards compatibl. I also don't know why run
needs to know anything about this. It's merely to allow the circuit to be created on any qubit if desired. I don't see why it's relevant whether a user will think T2 has this argument. Of course since this works really well I would argue it should be added to the T2 class as well.
I did some profiling of this PR, #1455, and import importlib.metadata
import os
import time
from subprocess import run
from qiskit_ibm_runtime import QiskitRuntimeService
from qiskit_experiments.library import T1
from qiskit_experiments.framework import ParallelExperiment
service = QiskitRuntimeService(channel="ibm_quantum")
backend_real = service.backend("ibm_sherbrooke")
start = time.perf_counter()
t1_exp_tmp_list = []
coh_wait_list = [0,10,20]
for i in list(range(10)):
t1_exp_tmp = T1([i], coh_wait_list, backend=backend_real)
t1_exp_tmp.set_transpile_options(optimization_level=0)
t1_exp_tmp_list.append(t1_exp_tmp)
t1_exp_tst = ParallelExperiment(t1_exp_tmp_list, backend=backend_real, flatten_results=True)
t1_exp_tst._transpiled_circuits()
total_time = time.perf_counter() - start
qe_proc = run(["git", "rev-parse", "--abbrev-ref", "HEAD"], check=True, text=True, capture_output=True)
print(f"Qiskit: {importlib.metadata.version('qiskit')}; QE: {qe_proc.stdout.strip()}; Parallel: {os.environ['QISKIT_PARALLEL']}")
print(f"Total time: {total_time:0.1f} seconds") I ran this on the current stable Qiskit (1.0.2) and the current
Note these are statistics of one, so there are probably decent fractional error bars on the numbers (10%?). I tried re-running the From these numbers we see that QE main with Qiskit 1.0.2 is significantly slower than the other QE branches, whether or not tranpiler parallelism is used. Additionally, QE main is still slow with Qiskit 1.2.0 when parallelism is used. These results are all consistent with the motivation for Qiskit/qiskit#12410 -- for a large device with pulse calibrations, serializing the target (done even when parallelism is disabled on Qiskit 1.0.2) costs significantly more than the benefit of parallelizing the transpilation. Skipping the serialization as is done in the QE main with Qiskit 1.2.0 case cuts down the processing time significantly, making it indistinguishable from the two branches that bypass transpilation. Though for these small circuits we don't make much use of the transpiler's cutting edge features, it is still nice to let it handle layout and basis translation. While it is nice to format the experiments to generate circuits using the typical IBM basis gates, it is also nice to have experiments that could in principle run on other providers that used rx or h instead of sx if we don't have to give that up. On Windows and mac OS, the default is not to parallelize transpile, so they should already be fast with Qiskit 1.2.0. On Linux, parellelism is enabled by default. So I think the minimal change to avoid this serialization overhead would be to set |
Thanks Will. I'm glad to see the transpile overhead itself is almost negligible (at least in the timescale of whole experiment run). It would be nice to keep full transpile stages if it's performance is comparable, but we can also prepare custom staged pass manager including only translation and layout (only expand passes), because it's reasonable to assume the most of QE instances define ISA circuits. But according to your test the speed gain from doing this would be less than 1 sec, which doesn't really make sensible performance difference in our case. In that sense just updating the default transpile option with |
Thanks @wshanks , another option would be to do a check in the class constructor (sort of like what I did in the |
@nkanazawa1989 I will make a PR for @dcmckayibm I am confused about the class constructor point. What is the flow of information that you are expecting there? One technical point is that the backend is not a required argument of the constructor, but there is a I tried editing the test code so that the subexperiments had 100 circuits instead of 3. In that case this PR and #1455 both still measured a time around 4.5 seconds (which I think is mostly the experiment construction time since it did not change compared to the 3 circuits case) while Ideally, transpiling small circuits would have low enough overhead that we could just use the transpiler but maybe we are not there yet (Qiskit is still pushing for performance improvements, how much overhead per circuit can we accept?). For
Besides layout, those should all be just checks that find nothing to do. Would a more limited pass manager be any faster? |
I guess negligible overhead of rewriting the circuit is thanks to the singleton gates (especially in T1 circuits without parameterized gate). @dcmckayibm In the original design This also makes me think virtual index -> physical index mapping is a common pattern for all experiments, and we can also implement this logic in the |
I did some further testing of variations on transpiling like those in this PR and in #1455. The code I used for the testing is in this gist. I will discuss the script below but first here are the outputs I got from running it with Qiskit 1.1 (and my qiskit-experiments repo was checked out to the branch from #1455 though the test script doesn't use anything relevant to a specific branch, just
The warm up lines are just to make sure one-time processing like deserializing the target does not penalize the first function more than the following ones. For all the performance tests, 100 T1 circuits were used.
There are a couple other tests commented out that let So to try to summarize:
Here are the options I see:
A couple caveats:
What do you think? |
Thanks for careful profiling. It seems to me 8sec vs 1sec in a practical setting is enough gain to consider the wider change. I'd emphasize that transpile overhead also comes from the data conversion into DAGCircuit IR, and the speed of the fast transpile method depends on the circuit depth because it needs to loop over all operation data on the wire (e.g. with something like RB this would become much slower). Regarding Rustification, currently Qiskit doesn't implement Rust version of QuantumCircuit (it has CircuirData struct though) and I wouldn't expect much performance gain from it. This is because eventually Rust just calls Python function with PyObject. On the other hand, the change in this PR breaks polymorphism of experiment. This allows experiment to directly create physical circuit, and what the base class run method receives will become union of virtual and physical circuit. In some edge case, it must accept a list of mixture of them. Unfortunately we don't have any attribute or type information to distinguish them. Current implementation is simple because we can naively assume experiment circuits are all virtual. Probably we need to insert another layer (base class) for experiments creating physical circuit and the other class for virtual circuits, and the base run method will be aware of only physical (or ISA) circuits. However this adds another hierarchy to the already complicated QE class tree. If we come up with the elegant implementation I'm fine with taking the approach of the last bullet. |
Near-term we want to implement some quick version of Not every experiment can benefit from parameterization. RB can be parameterized to realize different seeds with the same circuit by varying rz angles, but I don't think Clifford length can be parameterized. |
As I was making my post with the profiling numbers, I was re-running things and making changes. I Just realized that the numbers I had pasted in were from a run with a bug that skipped the pass manager run for the |
@dcmckayibm @nkanazawa1989 Based on feedback outside of this issue, I think we have consensus for the following approach:
Does that sound good?
@nkanazawa1989 I think your concern here could be addressed by modifying experiments to have a new |
Thanks for all the profiling @wshanks and I think the solution is a good one. |
Thanks @wshanks for proposal. I also think the |
Here is a first pass at what I suggested above: #1459 I am still thinking about edge cases and studying how these changes work. In particular, since I made it fall back to full |
Summary
Please describe what this PR changes as concisely as possible. Link to the issue(s) that this addresses, if any.
Details and comments
Some details that should be in this section include:
Note that this entire PR description field will be used as the commit message upon merge, so please keep it updated along with the PR. Secondary discussions, such as intermediate testing and bug statuses that do not affect the final PR, should be in the PR comments.
PR checklist (delete when all criteria are met)
CONTRIBUTING.md
.reno
if this change needs to be documented in the release notes.