Using iree.build to export torch modules (with turbine_generate build actions), we encounter an issue where weakref cleanup (?) seemingly can't figure out what to do with the contents of the exported torch nn.Module, resulting in a hang during Python exit.
This seemingly only happens when two conditions are satisfied:
1. We use the multiprocessing 'spawn' context.
2. The exported torch module creates tensors from class attributes or bare lists (observed with an integer list) without registering them as buffers (see the sketch below).
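For illustration only, here is a minimal, hypothetical sketch of that pattern using plain torch and multiprocessing. The module name Offsets and the export_worker placeholder are invented, and the actual iree.build / turbine_generate export step is elided:

```python
# Hypothetical sketch of the triggering pattern -- not the actual reproducer.
import multiprocessing as mp
import torch


class Offsets(torch.nn.Module):
    # Bare integer list kept as a class attribute (not a registered buffer).
    TABLE = [0, 2, 4, 8]

    def forward(self, x):
        # Tensor materialized on the fly from the class attribute. The issue
        # description above suggests that registering this data as a buffer
        # (self.register_buffer(...) in __init__) avoids the hang.
        return x + torch.tensor(self.TABLE, dtype=x.dtype)


def export_worker():
    # Placeholder for the iree.build / turbine_generate export of Offsets.
    Offsets()(torch.zeros(4))


if __name__ == "__main__":
    # Condition 1: the multiprocessing 'spawn' context.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=export_worker)
    p.start()
    p.join()
```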
We also encountered the same hang with Python 3.11, and bisecting upstream LLVM we found this PR to be the culprit; we have reverted the change on our fork for now.
If you move to Python 3.12 you no longer get a hang but instead a crash in the Python interpreter, I believe in the garbage collector.
We investigated the root cause for a bit by running memray on an iree-turbine aot.export, to see whether memory was leaking in the MLIR bindings before this upstream patch.
We could not find much, and we are not 100% sure memray is reporting correctly, but it did flag leaks in:
- ir_utils.py: memory is reported as leaking on the numpy array.
- PyTorch's linear.py: for some reason the empty weights tensor is reported as leaking.
We are also seeing a memory leak when running multiple exports of large LLMs within a single Unix process, even with explicit gc.collect() calls, so the problem is real.
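As a reference point, here is a minimal sketch of this kind of check. The export_once() function is a hypothetical stand-in for one aot.export / iree.build export; it uses memray's Tracker API and the stdlib resource module (Unix only) for measurement:

```python
# Sketch: watch peak RSS across repeated exports, with an explicit gc.collect()
# between iterations, while recording allocations with memray for later
# inspection (e.g. `memray flamegraph export_leak.bin`).
import gc
import resource

from memray import Tracker


def export_once():
    # Hypothetical placeholder for one aot.export / iree.build export of a model.
    pass


with Tracker("export_leak.bin"):
    for i in range(5):
        export_once()
        gc.collect()
        # ru_maxrss is the peak resident set size so far (KiB on Linux);
        # with a leak it keeps climbing from one iteration to the next.
        peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"iteration {i}: peak RSS ~{peak_kib} KiB")
```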
We also tried reverting a RefTracker change in torch-mlir, but it did not fix the hang/crash we were seeing with the upstream MLIR change.
We have not pinned down or solved the issue per se, but we are prototyping an nb::ndarray implementation that will hopefully take full advantage of the DLPack protocol (which torch.Tensor also supports) and ensure that proper ownership of the data is handled.
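This is not the nanobind prototype itself; it is just a small Python-level illustration, assuming reasonably recent numpy and torch versions, of what the DLPack protocol provides: a zero-copy exchange where the producer keeps ownership of the data and the consumer keeps it alive through the DLPack managed tensor:

```python
# Illustration of the DLPack protocol between torch and numpy (zero copy).
import numpy as np
import torch

t = torch.arange(6, dtype=torch.float32)

# np.from_dlpack() consumes torch.Tensor's __dlpack__ implementation.
# The array aliases the tensor's storage; lifetime is managed through the
# DLPack managed-tensor deleter rather than by copying the data.
a = np.from_dlpack(t)

t[0] = 42.0
assert a[0] == 42.0  # same underlying memory, no copy was made
```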
Hope this helps. Let me know if you find a better solution :)
What happened?
Related bandaid PR: #20101
Steps to reproduce your issue
coming soon…
What component(s) does this issue relate to?
Compiler
Version information
028b90f
(Top of main at post time)
Additional context
No response