ThunderFX: Save the reproducer script into files #1380
base: main
Conversation
This is really cool! A couple questions, @kiya00:
For this part, can the script instead know how to produce the same compilation as in the original reproduction? Like if the original was compiled like
Do these CUDA-specific calls only appear if one or more of the input tensors is generated on a CUDA device?
Where do the low and high values for this call to randint come from? Should we think about using make_tensor instead? Can we add a comment with the original FX graph, too?
Not a question, I just thought this was really cool and helpful.
Why this final call to the function?
Force-pushed from e3e0451 to df2c199
Hi @mruberry, thanks for the advice, I'll change it accordingly.
I mostly kept the repro script as the one written by Tom; I think it's more like a debug+benchmark script that runs on different backends depending on an environment variable. If we just want to reproduce the same compilation as in the original reproduction, I can change it to use the thunder options that were actually used.
When we can get the real tensor instead of a FakeTensor in the Dynamo graph, the min/max values can be obtained from it, and sometimes that's needed to ensure correctness (e.g. the nanogpt input must be in the range 0-255). Currently the original torch interface is used to create the inputs, like the nvFuser repro does; maybe it's more user friendly to use the native torch API, but it's easy to change.
I'll modify it so they only appear when CUDA is used.
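As a minimal sketch of the idea of deriving the value range from the real tensors and rebuilding comparable inputs (the helper names here are hypothetical, not part of this PR):

```python
import torch

def describe_input(t: torch.Tensor) -> dict:
    """Record metadata needed to rebuild a comparable input tensor (hypothetical helper)."""
    info = {"shape": tuple(t.shape), "dtype": t.dtype, "device": str(t.device)}
    if not t.dtype.is_floating_point:
        # Integer inputs (e.g. token ids) often have a meaningful valid range,
        # so capture it from the real tensor instead of guessing low/high.
        info["low"], info["high"] = int(t.min()), int(t.max()) + 1
    return info

def rebuild_input(info: dict) -> torch.Tensor:
    """Recreate an input tensor from the recorded metadata (hypothetical helper)."""
    if "low" in info:
        return torch.randint(info["low"], info["high"], info["shape"],
                             dtype=info["dtype"], device=info["device"])
    return torch.randn(info["shape"], dtype=info["dtype"], device=info["device"])
```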
That makes a lot of sense! I think we can let @tfogal comment when he's back, but I think the principal desire behind these reproduction scripts is to create a standalone file that someone interested in reviewing thunder's performance or correctness can quickly run to replicate an issue. That's why I'd suggest making it so that when someone clicks "run" the script executes what thunder did the same way thunder did it. Of course it's great to add notes for how to override/compare that behavior, too!
That's really cool. Querying for the min and max values from the real tensors sounds like a good solution. I would take a look at
Awesome!
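For the make_tensor idea raised above, torch.testing.make_tensor already accepts an explicit value range, so a reproducer could construct inputs roughly like this (standard PyTorch API, shown only as an illustration of the suggestion, not code from this PR):

```python
import torch
from torch.testing import make_tensor

# Integer input with a controlled value range, e.g. token ids that must stay
# within the model's vocabulary; low/high bound the sampled values.
ids = make_tensor((4, 1024), dtype=torch.int64, device="cpu", low=0, high=256)

# Floating-point input; low/high are optional and default to a dtype-appropriate range.
x = make_tensor((4, 1024, 768), dtype=torch.float32, device="cpu")
```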
…_repro in benchmark_litgpt.py
Thanks for the ping, sorry for the delay. The original code conflates two things, and I think there's a good argument to be made that during this cleanup process we rethink things. That is, we've got two use cases to service:
These two are very close but not quite the same. I think/hope that the ThunderCompilerBenchmark class that Yan added recently suffices for the latter goal. Without thinking about how hard it is / if it's appropriate for a single PR, what I'd love to see is:
Does that make sense to both of you?
Yep. I think providing a default that mimics the actual behavior, and the ability to override/tweak/investigate alternatives, would be great!
Before submitting
What does this PR do?
Fixes #1082.
Based on the code provided by @tfogal
(https://github.com/tfogal/NeMo/blob/a0c711deae6c6f7342662795425684a342f95b8d/examples/multimodal/multimodal_llm/neva/neva_pretrain.py#L164), I added the `ThunderCompiler.save_reproducer_to_folder` interface to save the reproducer script in an "offline" way. `SubgraphInfo.thunder_compiled_fns_example_inputs` is added to record the input tensor metadata; after execution we retrieve the information in `SubgraphInfo` and write the reproducer to file.

The `save_dynamo_repro` option is added to `benchmark_litgpt.py`. An example of its use:

`python thunder/benchmarks/benchmark_litgpt.py --model_name Llama-2-7b-hf --compile dynamo+thunder --n_layers=2 --save_dynamo_repro='tmp/bench'`
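Outside of `benchmark_litgpt.py`, the new interface can presumably be used directly along these lines (a sketch only; `MyModel` and `example_batch` are placeholders, and the exact call sequence should be checked against the PR):

```python
import torch
from thunder.dynamo import ThunderCompiler

backend = ThunderCompiler()
model = MyModel()  # placeholder for the user's model
compiled = torch.compile(model, backend=backend)

# Run at least once so the example inputs are recorded in SubgraphInfo.
out = compiled(example_batch)  # placeholder input

# Write a standalone reproducer script per Thunder-compiled subgraph.
backend.save_reproducer_to_folder("tmp/bench")
```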
TODO: support for saving the repro of a module with checkpointing needs #1437
An example of the saved reproducer script using CPU inputs
An example of the saved reproducer script using CUDA inputs
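For readers who cannot expand the attached examples, a saved reproducer has roughly the following shape (an illustrative sketch only; the actual generated graph module, input construction, and any CUDA-specific setup come from the PR and will differ):

```python
# Illustrative sketch of a generated reproducer (not the actual output of this PR).
import torch
import thunder

class DynamoModule(torch.nn.Module):
    # The FX graph extracted by Dynamo would be emitted here.
    def forward(self, x):
        return torch.nn.functional.gelu(x) * 2.0

def main():
    # Inputs are recreated from the recorded metadata (shape/dtype/device,
    # plus min/max for integer tensors); this example uses CPU inputs.
    inputs = [torch.randn(4, 1024, dtype=torch.float32, device="cpu")]

    model = DynamoModule()
    jitted = thunder.jit(model)
    print(jitted(*inputs))

if __name__ == "__main__":
    main()
```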