
t5_bf16 notebook fails with [ONNXRuntimeError] : 10 : INVALID_GRAPH #118

Open
michaelroyzen opened this issue Jul 30, 2022 · 4 comments

michaelroyzen commented Jul 30, 2022

I'm running the t5_bf16 notebook with the T0_3B model. Everything works great until

enc_fp16_onnx = create_model_for_provider(encoder_model_path, "CUDAExecutionProvider", log_severity=3)
enc_fp16_onnx_binding: IOBinding = enc_fp16_onnx.io_binding()
dec_onnx = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3)
dec_onnx_binding: IOBinding = dec_onnx.io_binding()

causes

InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./test-enc/model.onnx failed:This is an invalid model. Type Error: Type 'tensor(bfloat16)' of input parameter (onnx::Pow_398) of operator (Pow) in node (Pow_138) is invalid.
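
For context, create_model_for_provider presumably just builds an onnxruntime InferenceSession on the requested execution provider, and session creation is where onnxruntime type-checks the graph, so the INVALID_GRAPH above is raised before any inference runs. A minimal sketch of that assumption (only the onnxruntime calls are real API; the helper's actual implementation lives in the notebook's utilities):

import onnxruntime as ort

def create_session(model_path: str, provider: str, log_severity: int = 3) -> ort.InferenceSession:
    # Graph validation happens here, at session construction time.
    options = ort.SessionOptions()
    options.log_severity_level = log_severity  # 3 = log errors only
    return ort.InferenceSession(model_path, options, providers=[provider])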

EDIT 8/1:
This is odd, as ONNX claims to support Pow in bf16 as of https://github.com/onnx/onnx/pull/3412. The linked PR suggests that a bf16 exponent in Pow is only supported from opset 15 onward. I upgraded the opset version to 15 in convert_to_onnx(), and now I get a RuntimeError when calling create_model_for_provider:

RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/optimizer/optimizer_execution_frame.cc:75 onnxruntime::OptimizerExecutionFrame::Info::Info(const std::vector<const onnxruntime::Node*>&, const InitializedTensorSet&, const onnxruntime::Path&, const onnxruntime::IExecutionProvider&, const std::function<bool(const std::basic_string&)>&) [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : UnpackTensor: the pre-allocate size does not match the size in proto
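
To double-check whether the opset bump actually landed in the exported file, something like this can be run on the model (the path comes from the error above; onnx.checker can take the path directly, which also handles models over 2 GB stored with external data):

import onnx

model = onnx.load("./test-enc/model.onnx")
# Print the opset(s) the exported graph declares, e.g. [('ai.onnx', 15)]
print([(imp.domain or "ai.onnx", imp.version) for imp in model.opset_import])
# Re-run ONNX's own validation outside onnxruntime
onnx.checker.check_model("./test-enc/model.onnx")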

I'm running PyTorch 1.11.0 and onnx 1.12.0 with onnxruntime 1.12.0. Your help would be greatly appreciated, @pommedeterresautee.

Hardware: NVIDIA A10 w/ 24GB and hardware bf16 support

michaelroyzen (Author) commented:

@pommedeterresautee The t5_bf16 notebook doesn't work with t5-3b either, for that matter. It errors on the same line as T0_3B, but for a different reason:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Deserialize tensor onnx::MatMul_2878 failed.UnpackTensor: the pre-allocate size does not match the size in proto

This is quite important for the project I'm working on, and it would be great if you could help ASAP. Thank you in advance


pommedeterresautee (Member) commented Aug 5, 2022

I will check in the coming days, but TBH I'm not sure you will like BF16 accuracy: it's quite low compared to FP16, which implies adding casts everywhere (our hope was to no longer have to do that). The catch is that models trained in BF16 accumulate in FP32, so in the end you need good precision to reproduce the results. Range kills FP16 and precision kills BF16 on deep nets; in the end, casting is the only way.
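
For reference, the range/precision trade-off is visible directly in the dtype metadata (plain PyTorch, nothing specific to the notebook):

import torch

print(torch.finfo(torch.float16))   # max ~65504, so activations overflow easily on deep nets
print(torch.finfo(torch.bfloat16))  # max ~3.4e38 (fp32-like range) but far fewer mantissa bits
print(torch.finfo(torch.float32))   # what BF16 training typically accumulates into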
One thing that broke many things is PyTorch 1.12.0 (it changed the way some values are stored in ONNX); we are pushing patches here and there but have not retried those notebooks.

One thing you may want to try is exporting the ONNX from PyTorch with AMP enabled (fp16 and bf16 are both supported). In this video at 6'30 they say it should work in the latest PyTorch; I haven't had the time to try it myself:
https://www.youtube.com/watch?v=R2mUT_s0PbE
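
A rough sketch of that idea, tracing the export inside torch.autocast so mixed-precision ops are recorded directly. The model name, wrapper, output path, and axes below are placeholders rather than the notebook's convert_to_onnx() arguments, and it may still hit the export issues tracked in the PyTorch issue linked below:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

class EncoderWrapper(torch.nn.Module):
    """Return a plain tensor so the exporter does not see the HF ModelOutput dict."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = tokenizer("translate English to German: hello", return_tensors="pt").to("cuda")

# Trace inside autocast so matmuls etc. are recorded in bf16;
# swap dtype=torch.float16 for an fp16 export.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    torch.onnx.export(
        EncoderWrapper(model.encoder),
        (inputs["input_ids"], inputs["attention_mask"]),
        "t5_encoder_bf16.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state"],
        dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                      "attention_mask": {0: "batch", 1: "seq"}},
        opset_version=15,
    )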

If you try it, I would be very interested to know whether it works for you.

Also found this issue about this possibility, with a related bug and fixes:
pytorch/pytorch#72494

Seems to work... hope it helps in your project.

michaelroyzen (Author) commented:

Thanks @pommedeterresautee. I couldn't find any more information on how I could use AMP in the export process.

I'm actually using PyTorch 1.11 for the export (and onnx 1.12.0 and onnxruntime-gpu 1.12.0). The odd thing is that t5-small works in the t5_bf16 notebook but not t5-3b. I'd appreciate your help here.

michaelroyzen (Author) commented:

@pommedeterresautee Part of the issue seems to be that the notebooks are generally broken with the latest version of the library and its dependencies. I've created a separate issue about that, #130.
