
[iree.build] torch export cleanup hang with multiprocessing ‘spawn’ context #20102

Open
monorimet opened this issue Feb 26, 2025 · 1 comment
Labels: bug 🐞 Something isn't working

@monorimet (Collaborator)

What happened?

Related bandaid PR: #20101

When using iree.build to export torch modules (with turbine_generate build actions), we hit an issue where weakref cleanup (?) seemingly can't figure out what to do with the contents of the exported torch nn.Module, resulting in a hang during Python exit.

This seemingly only happens when two conditions are satisfied:

  1. We use multiprocessing ‘spawn’ context
  2. The exported torch module creates tensors from class attributes or bare lists (observed with an integer list) without registering them as buffers (see the sketch after this list).
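
A minimal sketch of the kind of module that seems to trigger condition 2. This is hypothetical, not the actual reproducer; the class and attribute names are made up:

```python
import torch

class OffsetAdd(torch.nn.Module):
    # Hypothetical module, not the real reproducer. The integer list lives as a
    # plain class attribute rather than being registered as a buffer.
    OFFSETS = [0, 1, 2, 3]

    def forward(self, x):
        # The tensor is materialized from the bare list at export/trace time
        # instead of going through self.register_buffer(...), which is the
        # pattern described in condition 2.
        return x + torch.tensor(self.OFFSETS, dtype=x.dtype)
```

Registering the constant as a buffer in `__init__` (e.g. `self.register_buffer("offsets", torch.tensor([...]))`) avoids the bare-list pattern, though whether that alone sidesteps the hang is not confirmed here.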

Steps to reproduce your issue

coming soon…

What component(s) does this issue relate to?

Compiler

Version information

028b90f (top of main at post time)

Additional context

No response

@chrsmcgrr (Contributor) commented Feb 27, 2025

Not 100% sure this is the same error, but:

We also encountered the same hang with Python 3.11, and by bisecting upstream LLVM we found this PR to be the culprit. We have reverted the change on our fork for now.

If you move to Python 3.12 you no longer get a hang, but instead a crash in the Python interpreter, I believe in the garbage collector.

We investigated for a bit to find the root cause, running memray on an iree-turbine aot.export workload, to see whether memory was leaking in the MLIR bindings before this upstream patch.
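
Roughly the shape of that profiling run, as a hypothetical sketch; the real workload was the iree-turbine aot.export pipeline, which is replaced here by a placeholder:

```python
# Hypothetical profiling sketch: the real workload was an iree-turbine
# aot.export run, stood in for here by a dummy allocation loop.
import memray

def workload():
    return [bytearray(1024) for _ in range(1000)]  # placeholder for aot.export

with memray.Tracker("export_profile.bin"):
    workload()

# Afterwards the report can be rendered with, e.g.:
#   memray flamegraph export_profile.bin
```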

We could not find much, and we're not 100% sure memray is reporting correctly, but it did find leaks in:

We are also seeing a memory leak when running multiple exports of large LLMs within a single Unix process (even with explicit gc.collect() calls), so the problem is real.
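
A sketch of the kind of in-process export loop where we see the leak (hypothetical: the import path and aot.export signature are assumptions, and the tiny module stands in for a large LLM):

```python
# Hypothetical leak-repro sketch; the import path and aot.export() signature
# are assumptions, and Tiny stands in for a large LLM.
import gc
import torch
from iree.turbine import aot

class Tiny(torch.nn.Module):
    def forward(self, x):
        return x * 2

for _ in range(10):
    exported = aot.export(Tiny(), torch.randn(4))  # one export per iteration
    del exported
    gc.collect()  # explicit collection does not bring memory back down
```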

We also tried reverting a RefTracker change in torch-mlir, but it did not fix the hang/crash we were seeing with the upstream MLIR change.

We haven't solved the issue or pinned down the exact root cause, but we are prototyping an nb::ndarray implementation that will hopefully take full advantage of the DLPack protocol (which torch.Tensor also supports) and ensure that ownership of the data is handled properly.
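
For reference, a Python-side sketch of the DLPack exchange that torch.Tensor already supports; the nb::ndarray side would live in the C++ bindings and is not shown:

```python
import torch

t = torch.arange(8, dtype=torch.float32)

# torch.Tensor implements __dlpack__/__dlpack_device__, so a consumer (e.g.
# bindings built on nb::ndarray) can take zero-copy ownership of the buffer
# through the DLPack protocol instead of holding onto a Python reference.
t2 = torch.from_dlpack(t)                   # zero-copy: shares storage with t
capsule = torch.utils.dlpack.to_dlpack(t)   # explicit capsule form

assert t2.data_ptr() == t.data_ptr()
```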

Hope this helps. Let me know if you find a better solution :)
