
t5_bf16 notebook fails with [ONNXRuntimeError] : 10 : INVALID_GRAPH #118

Open
michaelroyzen opened this issue Jul 30, 2022 · 4 comments

michaelroyzen commented Jul 30, 2022

I'm running the t5_bf16 notebook with the T0_3B model. Everything works great until

enc_fp16_onnx = create_model_for_provider(encoder_model_path, "CUDAExecutionProvider", log_severity=3)
enc_fp16_onnx_binding: IOBinding = enc_fp16_onnx.io_binding()
dec_onnx = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3)
dec_onnx_binding: IOBinding = dec_onnx.io_binding()

causes

InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./test-enc/model.onnx failed:This is an invalid model. Type Error: Type 'tensor(bfloat16)' of input parameter (onnx::Pow_398) of operator (Pow) in node (Pow_138) is invalid.
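
For context, create_model_for_provider presumably just builds an onnxruntime InferenceSession on the requested execution provider, and session creation is where onnxruntime type-checks the graph, so the INVALID_GRAPH above is raised before any inference runs. A minimal sketch of that assumption (only the onnxruntime calls are real API; the helper's actual implementation lives in the notebook's utilities):

import onnxruntime as ort

def create_session(model_path: str, provider: str, log_severity: int = 3) -> ort.InferenceSession:
    # Graph validation happens here, at session construction time.
    options = ort.SessionOptions()
    options.log_severity_level = log_severity  # 3 = log errors only
    return ort.InferenceSession(model_path, options, providers=[provider])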

EDIT 8/1:
This is odd, as ONNX claims to support Pow in bf16 as of https://github.com/onnx/onnx/pull/3412. The linked PR suggests that a bf16 exponent in Pow is only supported from opset 15 onward. I upgraded the opset version to 15 in convert_to_onnx(), and now I get a RuntimeError when calling create_model_for_provider:

RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::basic_string&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : UnpackTensor: the pre-allocate size does not match the size in proto
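
To double-check whether the opset bump actually landed in the exported file, something like this can be run on the model (the path comes from the error above; onnx.checker can take the path directly, which also handles models over 2 GB stored with external data):

import onnx

model = onnx.load("./test-enc/model.onnx")
# Print the opset(s) the exported graph declares, e.g. [('ai.onnx', 15)]
print([(imp.domain or "ai.onnx", imp.version) for imp in model.opset_import])
# Re-run ONNX's own validation outside onnxruntime
onnx.checker.check_model("./test-enc/model.onnx")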

I'm running PyTorch 1.11.0 and onnx 1.12.0 with onnxruntime 1.12.0. Your help would be greatly appreciated, @pommedeterresautee.

Hardware: NVIDIA A10 w/ 24GB and hardware bf16 support

michaelroyzen (Author) commented:

@pommedeterresautee The t5_bf16 notebook doesn't work with t5-3b either, for that matter. It errors on the same line as T0_3B, but for a different reason:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Deserialize tensor onnx::MatMul_2878 failed.UnpackTensor: the pre-allocate size does not match the size in proto

This is quite important for the project I'm working on, and it would be great if you could help ASAP. Thank you in advance


pommedeterresautee (Member) commented Aug 5, 2022

I will check in the coming days, but TBH I'm not sure you will like BF16 accuracy: it's quite low compared to FP16, which implies adding casts everywhere (our hope was to no longer have to do that). The catch is that models trained in BF16 accumulate in FP32, so in the end you need good precision to reproduce the results. Range kills FP16 and precision kills BF16 on deep nets; in the end, casting is the only way.
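
For reference, the range/precision trade-off is visible directly in the dtype metadata (plain PyTorch, nothing specific to the notebook):

import torch

print(torch.finfo(torch.float16))   # max ~65504, so activations overflow easily on deep nets
print(torch.finfo(torch.bfloat16))  # max ~3.4e38 (fp32-like range) but far fewer mantissa bits
print(torch.finfo(torch.float32))   # what BF16 training typically accumulates into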
One thing that broke many things is PyTorch 1.12.0 (it changed the way some values are stored in ONNX); we are pushing patches here and there but have not retried those notebooks.

One thing you may want to try is exporting the ONNX from PyTorch with AMP enabled (fp16 and bf16 are both supported). In this video at 6'30 they say it should work in the latest PyTorch; I haven't had the time to try it myself:
https://www.youtube.com/watch?v=R2mUT_s0PbE
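
A rough sketch of that idea, tracing the export inside torch.autocast so mixed-precision ops are recorded directly. The model name, wrapper, output path, and axes below are placeholders rather than the notebook's convert_to_onnx() arguments, and it may still hit the export issues tracked in the PyTorch issue linked below:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class EncoderWrapper(torch.nn.Module):
    """Return a plain tensor so the exporter does not see the HF ModelOutput dict."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: hello", return_tensors="pt").to("cuda")

# Trace inside autocast so matmuls etc. are recorded in bf16;
# swap dtype=torch.float16 for an fp16 export.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    torch.onnx.export(
        EncoderWrapper(model.encoder),
        (inputs["input_ids"], inputs["attention_mask"]),
        "t5_encoder_bf16.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"}},
        opset_version=15,
    )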

If you try it, I would be very interested to know whether it works for you.

Also found this issue about this possibility, with a related bug and fixes:
pytorch/pytorch#72494

Seems to work... hope it helps in your project.

michaelroyzen (Author) commented:

Thanks @pommedeterresautee. I couldn't find any more information on how I could use AMP in the export process.

I'm actually using PyTorch 1.11 for the export (and onnx 1.12.0 and onnxruntime-gpu 1.12.0). The odd thing is that t5-small works in the t5_bf16 notebook but not t5-3b. I'd appreciate your help here.

michaelroyzen (Author) commented:

@pommedeterresautee Part of the issue seems to be that the notebooks are generally broken with the latest version of the library and its dependencies. I've created a separate issue about that, #130.
