t5_bf16 notebook fails with [ONNXRuntimeError] : 10 : INVALID_GRAPH #118
Comments
@pommedeterresautee The t5_bf16 notebook doesn't work with t5-3b either, for that matter. It errors on the same line as T0_3B, but for a different reason:
This is quite important for the project I'm working on, and it would be great if you could help ASAP. Thank you in advance.
I will check in the coming days, but TBH I'm not sure you will like BF16 accuracy; it's quite low compared to FP16, which implies adding casting everywhere (our hope was not to have to do that anymore). The catch is that models trained in BF16 accumulate in FP32, so in the end you need good precision to reproduce the results. Range kills FP16 and precision kills BF16 on deep nets, so casting is the only way in the end. One thing you may want to try is exporting ONNX from PyTorch with AMP enabled (FP16 and BF16 are both supported). In this video at 6'30 they say it should work in the latest PyTorch; I haven't had time to try it myself. If you do, I'd be very interested to know whether it worked for you. I also found this issue about this possibility, with a related bug and fixes: Seems to work... hope it helps in your project.
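For anyone wanting to try the AMP route, here is a minimal sketch of what a BF16 autocast export might look like. It is not the notebook's code: the model name, dummy inputs, output file name, and opset are illustrative assumptions.

```python
# Sketch only: export a T5-style model to ONNX while tracing under bf16 autocast.
# Assumes PyTorch >= 1.10 (torch.autocast), transformers, and a CUDA device with bf16 support.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small").eval().cuda()  # t5-small as a stand-in
tokenizer = T5Tokenizer.from_pretrained("t5-small")

enc = tokenizer("translate English to French: Hello", return_tensors="pt").to("cuda")
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]], device="cuda")

# Run the export under autocast so the traced ops are recorded in bf16 where supported.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    torch.onnx.export(
        model,
        (enc["input_ids"], enc["attention_mask"], decoder_input_ids),
        "t5_bf16.onnx",  # hypothetical output path
        input_names=["input_ids", "attention_mask", "decoder_input_ids"],
        output_names=["logits"],
        opset_version=15,  # Pow with a bf16 exponent needs opset >= 15
        dynamic_axes={
            "input_ids": {0: "batch", 1: "seq"},
            "attention_mask": {0: "batch", 1: "seq"},
            "decoder_input_ids": {0: "batch", 1: "dec_seq"},
        },
    )
```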
Thanks @pommedeterresautee. I couldn't find any more information on how I could use AMP in the export process. I'm actually using PyTorch 1.11 for the export (and onnx 1.12.0 and onnxruntime-gpu 1.12.0). The odd thing is that t5-small works in the t5_bf16 notebook but not t5-3b. I'd appreciate your help here.
@pommedeterresautee Part of the issue seems to be that the notebooks are generally broken with the latest version of the library and its dependencies. I've created a separate issue about that, #130.
I'm running the t5_bf16 notebook with the T0_3B model. Everything works great until
causes
EDIT 8/1:
This is odd, as onnx claims to support Pow in bf16 as of https://github.com/onnx/onnx/pull/3412. The linked PR suggests that only opset 15+ supports the exponent in Pow in bf16. I upgraded the opset version to 15 in convert_to_onnx(), and now I get a RuntimeError when calling create_model_for_provider
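In case it helps reproduce, this is roughly what loading the exported file with onnxruntime directly looks like; the notebook's create_model_for_provider() presumably builds a similar InferenceSession. The file name and execution provider below are assumptions.

```python
# Sketch: inspect the exported graph's declared opset and try to create an ORT session,
# which is where the INVALID_GRAPH / RuntimeError surfaces.
import onnx
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

onnx_path = "t5_bf16.onnx"  # hypothetical path from the export above

# Verify the graph actually declares opset >= 15 (required for a bf16 exponent in Pow).
model_proto = onnx.load(onnx_path)
print([(o.domain, o.version) for o in model_proto.opset_import])
onnx.checker.check_model(model_proto)  # may raise the same graph validation complaint

options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
session = InferenceSession(onnx_path, options, providers=["CUDAExecutionProvider"])
```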
I'm running PyTorch 1.11.0 and onnx 1.12.0 with onnxruntime 1.12.0. Your help would be greatly appreciated @pommedeterresautee
Hardware: NVIDIA A10 w/ 24GB and hardware bf16 support