
[iree.build] torch export cleanup hang with multiprocessing ‘spawn’ context #20102

Open
monorimet opened this issue Feb 26, 2025 · 1 comment
Labels: bug 🐞 Something isn't working

@monorimet (Collaborator)

What happened?

Related bandaid PR: #20101

When using iree.build to export torch modules (with turbine_generate build actions), we hit an issue where weakref cleanup (?) seemingly can't figure out what to do with the contents of the exported torch nn.Module, resulting in a hang during Python exit.

This seemingly only happens when two conditions are satisfied:

  1. We use multiprocessing ‘spawn’ context
  2. The exported torch module creates tensors from class attributes or bare lists (observed with an integer list) without registering them as buffers (see the sketch after this list).
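
A minimal sketch of the kind of module that seems to trigger condition 2. This is hypothetical, not the actual reproducer; the class and attribute names are made up:

```python
import torch

class OffsetAdd(torch.nn.Module):
    # Hypothetical module, not the real reproducer. The integer list lives as a
    # plain class attribute rather than being registered as a buffer.
    OFFSETS = [0, 1, 2, 3]

    def forward(self, x):
        # The tensor is materialized from the bare list at export/trace time
        # instead of going through self.register_buffer(...), which is the
        # pattern described in condition 2.
        return x + torch.tensor(self.OFFSETS, dtype=x.dtype)
```

Registering the constant as a buffer in `__init__` (e.g. `self.register_buffer("offsets", torch.tensor([...]))`) avoids the bare-list pattern, though whether that alone sidesteps the hang is not confirmed here.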

Steps to reproduce your issue

coming soon…

What component(s) does this issue relate to?

Compiler

Version information

028b90f (top of main at post time)

Additional context

No response

@chrsmcgrr (Contributor) commented Feb 27, 2025

Not 100% sure this is the same error, but:

We also encountered the same hang with Python 3.11, and by bisecting upstream LLVM we found this PR to be the culprit. We have reverted the change on our fork for now.

If you move to Python 3.12 you no longer get a hang, but instead a crash in the Python interpreter, I believe in the garbage collector.

We investigated for a bit to find the root cause, running memray on an iree-turbine aot.export workload, to see whether memory was leaking in the MLIR bindings before this upstream patch.
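
Roughly the shape of that profiling run, as a hypothetical sketch; the real workload was the iree-turbine aot.export pipeline, which is replaced here by a placeholder:

```python
# Hypothetical profiling sketch: the real workload was an iree-turbine
# aot.export run, stood in for here by a dummy allocation loop.
import memray

def workload():
    return [bytearray(1024) for _ in range(1000)]  # placeholder for aot.export

with memray.Tracker("export_profile.bin"):
    workload()

# Afterwards the report can be rendered with, e.g.:
#   memray flamegraph export_profile.bin
```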

We could not find much, and we're not 100% sure memray is reporting correctly, but it did find leaks in:

We are also seeing a memory leak when running multiple exports of large LLMs within a single Unix process (even with explicit gc.collect() calls), so the problem is real.
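
A sketch of the kind of in-process export loop where we see the leak (hypothetical: the import path and aot.export signature are assumptions, and the tiny module stands in for a large LLM):

```python
# Hypothetical leak-repro sketch; the import path and aot.export() signature
# are assumptions, and Tiny stands in for a large LLM.
import gc
import torch
from iree.turbine import aot

class Tiny(torch.nn.Module):
    def forward(self, x):
        return x * 2

for _ in range(10):
    exported = aot.export(Tiny(), torch.randn(4))  # one export per iteration
    del exported
    gc.collect()  # explicit collection does not bring memory back down
```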

We also tried reverting a RefTracker change in torch-mlir, but it did not fix the hang/crash we were seeing with the upstream MLIR change.

We haven't solved the issue or pinned down the exact root cause, but we are prototyping an nb::ndarray implementation that will hopefully take full advantage of the DLPack protocol (which torch.Tensor also supports) and ensure that ownership of the data is handled properly.
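
For reference, a Python-side sketch of the DLPack exchange that torch.Tensor already supports; the nb::ndarray side would live in the C++ bindings and is not shown:

```python
import torch

t = torch.arange(8, dtype=torch.float32)

# torch.Tensor implements __dlpack__/__dlpack_device__, so a consumer (e.g.
# bindings built on nb::ndarray) can take zero-copy ownership of the buffer
# through the DLPack protocol instead of holding onto a Python reference.
t2 = torch.from_dlpack(t)                   # zero-copy: shares storage with t
capsule = torch.utils.dlpack.to_dlpack(t)   # explicit capsule form

assert t2.data_ptr() == t.data_ptr()
```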

Hope this helps. Let me know if you find a better solution :)
