Help finding why an ONNX model that runs fine with onnxruntime doesn't run with tract? #675
-
So, I exported a model from PyTorch to ONNX, and it runs fine with onnxruntime:

```python
x = ort.OrtValue.ortvalue_from_numpy(x.numpy())
x_len = ort.OrtValue.ortvalue_from_numpy(x_len.numpy())
print(x.shape(), x_len.shape())  # prints [1, 119760] [1]
ort_sess.run(None, {"x": x, "x_len": x_len})
```

But trying to do the same with tract:

```rust
use tract_onnx::prelude::*;

let audio_data_len = audio_data.len();
let one = 1; // assumed: a usize constant for the size-1 dimensions below (not shown in the original snippet)
tract_onnx::onnx()
    .model_for_path("../EfficientConformer/model.onnx")?
    .with_input_names(vec!["x", "x_len"])?
    .with_input_fact(
        0,
        InferenceFact::dt_shape(f32::datum_type(), shapefactoid![one, audio_data_len]),
    )?
    .with_input_fact(
        1,
        InferenceFact::dt_shape(f32::datum_type(), shapefactoid![one]),
    )?
    .into_runnable()?
    .run(tvec!(
        Tensor::from_shape(&[1, audio_data_len], audio_data)?,
        Tensor::from_shape(&[one], &[audio_data_len as f32])?,
    ))?;
```

fails with this error: …

My totally uneducated guess is that there is some difference in how broadcasting works between onnxruntime and tract. My plan now is to narrow down the node in the graph at which point the shape as calculated by tract starts diverging from the one calculated by onnxruntime. Is there some way to "step through" graph execution? Or to perhaps print the shapes of the tensors at specific nodes? Obviously, I'm open to other ways of debugging this if I'm not on the right track. Thank you so much for open sourcing this library btw!!
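To answer the "print the shapes at specific nodes" part for anyone landing here later: tract's shape inference results can be dumped per node from the API. A minimal sketch, assuming tract's `InferenceModel` methods `analyse` and `outlet_fact` as they existed around the time of this thread; treat it as a starting point, not a verified recipe:

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX model without turning it into a runnable plan yet.
    let mut model = tract_onnx::onnx().model_for_path("model.onnx")?;
    // Run tract's type/shape inference over the whole graph.
    model.analyse(false)?;
    // Walk the nodes and print the fact tract inferred for each output wire.
    for node in model.nodes() {
        for slot in 0..node.outputs.len() {
            let fact = model.outlet_fact(OutletId::new(node.id, slot))?;
            println!("{} output #{}: {:?}", node.name, slot, fact);
        }
    }
    Ok(())
}
```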
-
Hello! Thanks for your interest in tract! Broadcasting rules are the same, at least in theory. And they are well test-covered, so I doubt the error comes from broadcasting, even if the discrepancy appears when we try to wire an operator that does broadcasting. The operation that breaks here tries to add a 1x4x81x81 and a 1x1x243x243 tensor. This does not work under numpy (or any other framework's, really) broadcast rules, so we must assume the error happens before this step. I would say the first step in debugging this is to use the `tract` command line tool.

I guess this is a network featuring recurring operators. No worries, they are supposed to work; this is probably just a bug. The test coverage is less complete than for CNNs, so there may remain corner-case issues. Please ask as many questions as you need to figure this out. The command line documentation is nascent... We appreciate you trying to narrow down the issue, but if it comes to that, we can also have a look at the model if you can share it with us.
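For reference, the rule in question is numpy-style multidirectional broadcasting: align shapes on the right, and each aligned dimension pair must either be equal or contain a 1. A small self-contained sketch (not tract code) showing why 1x4x81x81 against 1x1x243x243 must fail:

```rust
/// Numpy-style multidirectional broadcast: align shapes on the right;
/// each aligned pair must be equal, or one of the two must be 1.
fn broadcast(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = Vec::with_capacity(rank);
    for i in 0..rank {
        // Shapes shorter than `rank` are padded with leading 1s.
        let da = if i < rank - a.len() { 1 } else { a[i - (rank - a.len())] };
        let db = if i < rank - b.len() { 1 } else { b[i - (rank - b.len())] };
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // incompatible dimension pair: broadcast fails
        }
    }
    Some(out)
}

fn main() {
    // The pair from the error: 81 vs 243 is neither equal nor 1.
    assert_eq!(broadcast(&[1, 4, 81, 81], &[1, 1, 243, 243]), None);
    // A compatible pair broadcasts as expected.
    assert_eq!(broadcast(&[1, 4, 81, 81], &[1, 1, 81, 1]), Some(vec![1, 4, 81, 81]));
}
```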
-
Hello, I'm a bit surprised by the issue with the command line; my understanding is that it should have worked. I don't have a "direct" answer to your question (obtaining shapes from an ONNX network with some kind of onnx tooling), but I have used onnxruntime in the past to compare all intermediate results to tract's (using the `compare` sub-command):

```python
import numpy
import onnx
import onnxruntime as onnxrt

path = "albert_chinese_tiny/model.onnx"
model = onnx.load(path)

# Collect the name of every wire produced by a node...
all_wires = {}
for node in model.graph.node:
    for wire in node.output:
        all_wires[wire] = True
# ...then drop the model's declared inputs and outputs.
for input in model.graph.input:
    all_wires.pop(input.name, None)
for output in model.graph.output:
    all_wires.pop(output.name, None)

# Promote every remaining intermediate wire to a model output.
for wire in all_wires:
    output = onnx.ValueInfoProto()
    output.name = wire
    model.graph.output.append(output)
onnx.save(model, "full.onnx")

# Run the augmented model and save the inputs plus every intermediate
# value into one npz bundle, keyed by wire name.
sess = onnxrt.InferenceSession("full.onnx")
io = numpy.load("./io.npz")
print(io)
inputs = {
    "input_ids": io["input_ids"],
    "attention_mask": io["attention_mask"],
    "token_type_ids": io["token_type_ids"],
}
res = sess.run([], inputs)
npz = inputs
for pair in zip(model.graph.output, res):
    npz[pair[0].name] = pair[1]
numpy.savez("full.npz", **npz)
```

As for 2/, I actually have very little experience with onnx tools. I had made tract able to do shape analysis (because of TensorFlow) before ONNX became tract's favourite format... So I don't know why the inference fails on a Pad. I agree it's weird. Might be a bug.
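On the tract side, the rough counterpart is to redirect the model's outputs onto the internal wire you want to inspect, run it, and compare against the matching entry in `full.npz`. A sketch under assumptions: the `set_output_names` method on tract's inference model, a hypothetical wire name, and a zero-filled stand-in input; adapt names and shapes to the real model:

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let mut model = tract_onnx::onnx().model_for_path("model.onnx")?;
    // Point the model's outputs at an internal wire instead of the real output.
    // "some/internal/wire" is a placeholder: use a name from the ONNX graph.
    model.set_output_names(["some/internal/wire"])?;
    let runnable = model.into_runnable()?;
    // Zero stand-in input; in practice, feed the same tensors as the npz dump.
    let input = Tensor::zero::<f32>(&[1, 119760])?;
    let outputs = runnable.run(tvec!(input))?;
    println!("{:?}", outputs[0]);
    Ok(())
}
```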
-
Ho, just realised what was wrong in your command: put …
-
Thanks for the tip about getting shape info from ONNX. I modified your script slightly and got an output of the shapes as inferred by ONNX (here's a markdown version). Here are also the shapes as inferred by tract. There are a lot of nodes whose output shape differs between tract and ONNX, but so far (I'm not done looking yet) I could only find 3 nodes where the inputs of the node were of identical shape and the output was of a different shape. In all of the 3 cases, what tract inferred …

I really have no idea how to go about making sense of the ONNX output shapes inferred in the above nodes. I know they are likely to be correct, since they (probably) match the ones inferred by the PyTorch code (the model, which is an ASR model, works fine when run as PyTorch and when run as ONNX). If you have any ideas as to how to make sense of this, that would be amazing! Otherwise, what I will do is gather all the nodes where this happens¹, and hopefully a bulb goes off in my head; otherwise, I'll ask on the ONNX GitHub.

¹ If you're curious, my plan to do that is to write some code to output a JSON representation of the graph/shapes as inferred by tract (looking here in the tract CLI code to copy a little), to also output a JSON representation of the graph/shapes as inferred by ONNX, and then to write a "graph diffing" script (sketched below) to find all nodes whose inputs are of identical shapes between ONNX and tract but whose output shape is different.

PS: The originally offending node is now …
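For what it's worth, the "graph diffing" part is only a few lines once both sides are JSON. A sketch in Rust with serde_json; the file names and the per-node layout (`inputs` as a list of shapes, `output` as one shape) are hypothetical, not the author's actual format:

```rust
use serde_json::Value;
use std::collections::HashMap;
use std::fs;

// Hypothetical layout: { "node_name": { "inputs": [[1,4,81,81], ...], "output": [1,4,81,81] } }
fn load(path: &str) -> HashMap<String, Value> {
    serde_json::from_str(&fs::read_to_string(path).expect("read")).expect("parse")
}

fn main() {
    let onnx = load("shapes_onnx.json");
    let tract = load("shapes_tract.json");
    for (name, o) in &onnx {
        if let Some(t) = tract.get(name) {
            // The interesting nodes: input shapes agree, output shape diverges.
            if o["inputs"] == t["inputs"] && o["output"] != t["output"] {
                println!("{name}: inputs agree, outputs differ: {} vs {}", o["output"], t["output"]);
            }
        }
    }
}
```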
-
For future reference, a couple of problems spotted while playing with the model: …
-
found it :) #680
-
I think it is worth mentioning that one of the keys to finding the root problem was to use a debug build (I did not think to mention it earlier). There are more checks for possible bugs in debug builds, so the debug build pointed me to the strided slice immediately...
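For readers wondering what "more checks in debug builds" means in Rust: `debug_assert!`-style checks (and integer overflow checks) are compiled in by `cargo build` and compiled out by `cargo build --release`. A generic illustration, not tract's actual code:

```rust
/// An internal consistency check of the kind that only fires in debug builds:
/// `debug_assert!` compiles to nothing in release mode.
fn strided_read(data: &[f32], start: usize, stride: usize, count: usize) -> f32 {
    debug_assert!(
        count > 0 && start + stride * (count - 1) < data.len(),
        "strided read past the end of the buffer"
    );
    data[start + stride * (count - 1)]
}

fn main() {
    let buf = vec![0.0f32; 10];
    let _ = strided_read(&buf, 0, 3, 4); // fine: touches offsets 0, 3, 6, 9
    // strided_read(&buf, 0, 3, 5) would panic in a debug build with the
    // assert's message, pointing straight at the bad slice arithmetic.
}
```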
-
Amazing! Without your incredibly quick responses and generosity with your time, I definitely would have switched to onnxruntime! I'm working on a web app with the model deployed in the browser through wasm, so I'll keep you updated on the end result. My sincere thanks!