Help finding why an ONNX model that runs fine with onnxruntime doesn't run with tract? #675
-
So, I exported a model from PyTorch to ONNX, and it runs fine with onnxruntime:

```python
x = ort.OrtValue.ortvalue_from_numpy(x.numpy())
x_len = ort.OrtValue.ortvalue_from_numpy(x_len.numpy())
print(x.shape(), x_len.shape())  # prints [1, 119760] [1]
ort_sess.run(None, {"x": x, "x_len": x_len})
```

But trying to do the same with tract:

```rust
use tract_onnx::prelude::*;

let audio_data_len = audio_data.len();
let one = 1; // assumed: a usize constant for the size-1 dimensions below (not shown in the original snippet)
tract_onnx::onnx()
    .model_for_path("../EfficientConformer/model.onnx")?
    .with_input_names(vec!["x", "x_len"])?
    .with_input_fact(
        0,
        InferenceFact::dt_shape(f32::datum_type(), shapefactoid![one, audio_data_len]),
    )?
    .with_input_fact(
        1,
        InferenceFact::dt_shape(f32::datum_type(), shapefactoid![one]),
    )?
    .into_runnable()?
    .run(tvec!(
        Tensor::from_shape(&[1, audio_data_len], audio_data)?,
        Tensor::from_shape(&[one], &[audio_data_len as f32])?,
    ))?;
```

fails with this error: …

My totally uneducated guess is that there is some difference in how broadcasting works between onnxruntime and tract. My plan now is to narrow down the node in the graph at which point the shape as calculated by tract starts diverging from the one calculated by onnxruntime. Is there some way to "step through" graph execution? Or to perhaps print the shapes of the tensors at specific nodes? Obviously, I'm open to other ways of debugging this if I'm not on the right track. Thank you so much for open sourcing this library btw!!
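To answer the "print the shapes at specific nodes" part for anyone landing here later: tract's shape inference results can be dumped per node from the API. A minimal sketch, assuming tract's `InferenceModel` methods `analyse` and `outlet_fact` as they existed around the time of this thread; treat it as a starting point, not a verified recipe:

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX model without turning it into a runnable plan yet.
    let mut model = tract_onnx::onnx().model_for_path("model.onnx")?;
    // Run tract's type/shape inference over the whole graph.
    model.analyse(false)?;
    // Walk the nodes and print the fact tract inferred for each output wire.
    for node in model.nodes() {
        for slot in 0..node.outputs.len() {
            let fact = model.outlet_fact(OutletId::new(node.id, slot))?;
            println!("{} output #{}: {:?}", node.name, slot, fact);
        }
    }
    Ok(())
}
```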
-
Hello! Thanks for your interest in tract! Broadcasting rules are the same, at least in theory. And they are well test-covered, so I doubt the error comes from broadcasting, even if the discrepancy appears when we try to wire an operator that does broadcasting. The operation that breaks here tries to add a 1x4x81x81 and a 1x1x243x243 tensor. This does not work under numpy (or any other framework's, really) broadcast rules, so we must assume the error happens before this step. I would say the first step in debugging this is to use the `tract` command line tool.

I guess this is a network featuring recurring operators. No worries, they are supposed to work; this is probably just a bug. The test coverage is less complete than for CNNs, so there may remain corner-case issues. Please ask as many questions as you need to figure this out. The command line documentation is nascent... We appreciate you trying to narrow down the issue, but if it comes to that, we can also have a look at the model if you can share it with us.
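For reference, the rule in question is numpy-style multidirectional broadcasting: align shapes on the right, and each aligned dimension pair must either be equal or contain a 1. A small self-contained sketch (not tract code) showing why 1x4x81x81 against 1x1x243x243 must fail:

```rust
/// Numpy-style multidirectional broadcast: align shapes on the right;
/// each aligned pair must be equal, or one of the two must be 1.
fn broadcast(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = Vec::with_capacity(rank);
    for i in 0..rank {
        // Shapes shorter than `rank` are padded with leading 1s.
        let da = if i < rank - a.len() { 1 } else { a[i - (rank - a.len())] };
        let db = if i < rank - b.len() { 1 } else { b[i - (rank - b.len())] };
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // incompatible dimension pair: broadcast fails
        }
    }
    Some(out)
}

fn main() {
    // The pair from the error: 81 vs 243 is neither equal nor 1.
    assert_eq!(broadcast(&[1, 4, 81, 81], &[1, 1, 243, 243]), None);
    // A compatible pair broadcasts as expected.
    assert_eq!(broadcast(&[1, 4, 81, 81], &[1, 1, 81, 1]), Some(vec![1, 4, 81, 81]));
}
```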
-
Hello, I'm a bit surprised by the issue with the command line; my understanding is that it should have worked. I don't have a "direct" answer to your question (obtaining shapes from an ONNX network with some kind of onnx tooling), but I have used onnxruntime in the past to compare all intermediate results to tract's (using the `compare` sub-command):

```python
import numpy
import onnx
import onnxruntime as onnxrt

path = "albert_chinese_tiny/model.onnx"
model = onnx.load(path)

# Collect the name of every wire produced by a node...
all_wires = {}
for node in model.graph.node:
    for wire in node.output:
        all_wires[wire] = True
# ...then drop the model's declared inputs and outputs.
for input in model.graph.input:
    all_wires.pop(input.name, None)
for output in model.graph.output:
    all_wires.pop(output.name, None)

# Promote every remaining intermediate wire to a model output.
for wire in all_wires:
    output = onnx.ValueInfoProto()
    output.name = wire
    model.graph.output.append(output)
onnx.save(model, "full.onnx")

# Run the augmented model and save the inputs plus every intermediate
# value into one npz bundle, keyed by wire name.
sess = onnxrt.InferenceSession("full.onnx")
io = numpy.load("./io.npz")
print(io)
inputs = {
    "input_ids": io["input_ids"],
    "attention_mask": io["attention_mask"],
    "token_type_ids": io["token_type_ids"],
}
res = sess.run([], inputs)
npz = inputs
for pair in zip(model.graph.output, res):
    npz[pair[0].name] = pair[1]
numpy.savez("full.npz", **npz)
```

As for 2/, I actually have very little experience with onnx tools. I had made tract able to do shape analysis (because of TensorFlow) before ONNX became tract's favourite format... So I don't know why the inference fails on a Pad. I agree it's weird. Might be a bug.
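On the tract side, the rough counterpart is to redirect the model's outputs onto the internal wire you want to inspect, run it, and compare against the matching entry in `full.npz`. A sketch under assumptions: the `set_output_names` method on tract's inference model, a hypothetical wire name, and a zero-filled stand-in input; adapt names and shapes to the real model:

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let mut model = tract_onnx::onnx().model_for_path("model.onnx")?;
    // Point the model's outputs at an internal wire instead of the real output.
    // "some/internal/wire" is a placeholder: use a name from the ONNX graph.
    model.set_output_names(["some/internal/wire"])?;
    let runnable = model.into_runnable()?;
    // Zero stand-in input; in practice, feed the same tensors as the npz dump.
    let input = Tensor::zero::<f32>(&[1, 119760])?;
    let outputs = runnable.run(tvec!(input))?;
    println!("{:?}", outputs[0]);
    Ok(())
}
```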
-
Ho, just realised what was wrong in your command: put …
-
Thanks for the tip about getting shape info from ONNX. I modified your script slightly and got an output of the shapes as inferred by ONNX (here's a markdown version). Here are also the shapes as inferred by tract. There are a lot of nodes whose output shape differs between tract and ONNX, but so far (I'm not done looking yet) I could only find 3 nodes where the inputs of the node were of identical shape and the output was of a different shape. In all of the 3 cases, what tract inferred …

I really have no idea how to go about making sense of the ONNX output shapes inferred in the above nodes. I know they are likely to be correct, since they (probably) match the ones inferred by the PyTorch code (the model, which is an ASR model, works fine when run as PyTorch and when run as ONNX). If you have any ideas as to how to make sense of this, that would be amazing! Otherwise, what I will do is gather all the nodes where this happens¹, and hopefully a bulb goes off in my head; otherwise, I'll ask on the ONNX GitHub.

¹ If you're curious, my plan to do that is to write some code to output a JSON representation of the graph/shapes as inferred by tract (looking here in the tract CLI code to copy a little), to also output a JSON representation of the graph/shapes as inferred by ONNX, and then to write a "graph diffing" script (sketched below) to find all nodes whose inputs are of identical shapes between ONNX and tract but whose output shape is different.

PS: The originally offending node is now …
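For what it's worth, the "graph diffing" part is only a few lines once both sides are JSON. A sketch in Rust with serde_json; the file names and the per-node layout (`inputs` as a list of shapes, `output` as one shape) are hypothetical, not the author's actual format:

```rust
use serde_json::Value;
use std::collections::HashMap;
use std::fs;

// Hypothetical layout: { "node_name": { "inputs": [[1,4,81,81], ...], "output": [1,4,81,81] } }
fn load(path: &str) -> HashMap<String, Value> {
    serde_json::from_str(&fs::read_to_string(path).expect("read")).expect("parse")
}

fn main() {
    let onnx = load("shapes_onnx.json");
    let tract = load("shapes_tract.json");
    for (name, o) in &onnx {
        if let Some(t) = tract.get(name) {
            // The interesting nodes: input shapes agree, output shape diverges.
            if o["inputs"] == t["inputs"] && o["output"] != t["output"] {
                println!("{name}: inputs agree, outputs differ: {} vs {}", o["output"], t["output"]);
            }
        }
    }
}
```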
-
For future reference, a couple of problems spotted while playing with the model: …
-
found it :) #680
-
I think it is worth mentioning that one of the keys to finding the root problem was to use a debug build (I did not think to mention it earlier). There are more checks for possible bugs in debug builds, so the debug build pointed me to the strided slice immediately...
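For readers wondering what "more checks in debug builds" means in Rust: `debug_assert!`-style checks (and integer overflow checks) are compiled in by `cargo build` and compiled out by `cargo build --release`. A generic illustration, not tract's actual code:

```rust
/// An internal consistency check of the kind that only fires in debug builds:
/// `debug_assert!` compiles to nothing in release mode.
fn strided_read(data: &[f32], start: usize, stride: usize, count: usize) -> f32 {
    debug_assert!(
        count > 0 && start + stride * (count - 1) < data.len(),
        "strided read past the end of the buffer"
    );
    data[start + stride * (count - 1)]
}

fn main() {
    let buf = vec![0.0f32; 10];
    let _ = strided_read(&buf, 0, 3, 4); // fine: touches offsets 0, 3, 6, 9
    // strided_read(&buf, 0, 3, 5) would panic in a debug build with the
    // assert's message, pointing straight at the bad slice arithmetic.
}
```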
-
Amazing! Without your incredibly quick responses and generosity with your time, I definitely would have switched to onnxruntime! I'm working on a web app with the model deployed in the browser through wasm, so I'll keep you updated on the end result. My sincere thanks!