
Resource exhaustion when compiling many functions #190

Open
mofeing opened this issue Oct 21, 2024 · 5 comments

mofeing commented Oct 21, 2024

Some of my algorithms change their structure on every iteration, so I need to recompile the function each time. This is something that works well in JAX. Aside from the huge performance overhead we currently have in Reactant (which may need some refactoring of Thunk and compile to reduce the amount of compilation done on the Julia side), I'm getting the following error after running many compilations.

Some notes:

  • It always fails at the same number of iterations, suggesting that each compilation consumes a fixed amount of some resource that is never released.
  • The size of this resource may be machine-dependent. I've only tried it on macOS aarch64.

(screenshot of the error; the same message appears in the transcript below)

MWE

julia> using Reactant

julia> x = rand(2, 2)
julia> y = Reactant.to_rarray(x)

julia> for _ in 1:10_000
           f = @compile sum(y)
       end
LLVM ERROR: pthread_create failed: Resource temporarily unavailable

[27943] signal (6): Abort trap: 6
in expression starting at REPL[18]:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 175047067 (Pool: 174829757; Big: 217310); GC: 115
fish: Job 1, 'julia +1.10' terminated by signal SIGABRT (Abort)

wsmoses commented Oct 23, 2024

Can you trace it in rr or something with a debug JLL to ascertain which resource is failing to be allocated?


wsmoses commented Oct 23, 2024

I never fully got things set up, IIRC, but in principle SetLogLevel might force a lot of debug output to spew, which may be helpful.


mofeing commented Oct 23, 2024

So SetLogLevel didn't work because it sets the log level of Abseil, which XLA uses but LLVM does not. And since the crash originates in LLVM, we don't get that kind of information.

LLDB shows some interesting info. It seems that every time we run the mlir::PassManager over the MLIR, it calls verify, which enqueues work and grows the number of threads in llvm::ThreadPoolTaskGroup. At some point, it reaches the maximum number of threads per process and fails.

The question is: why is the ThreadPoolTaskGroup growing the number of threads, and why aren't these threads being cleaned up? Maybe it's because we create too many functions in a very short amount of time?

LLVM ERROR: pthread_create failed: Resource temporarily unavailable
Process 3988 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x0000000193b5ea60 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`__pthread_kill:
->  0x193b5ea60 <+8>:  b.lo   0x193b5ea80    ; <+40>
    0x193b5ea64 <+12>: pacibsp 
    0x193b5ea68 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x193b5ea6c <+20>: mov    x29, sp
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x0000000193b5ea60 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000193b96c20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000193aa3a20 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000311b84bb8 libReactantExtra.dylib`llvm::report_fatal_error(llvm::Twine const&, bool) + 296
    frame #4: 0x0000000311bd6830 libReactantExtra.dylib`ReportErrnumFatal(char const*, int) + 512
    frame #5: 0x0000000311bd65e8 libReactantExtra.dylib`llvm::llvm_execute_on_thread_impl(void* (*)(void*), void*, std::__1::optional<unsigned int>) + 248
    frame #6: 0x0000000311bd4120 libReactantExtra.dylib`llvm::StdThreadPool::grow(int) + 384
    frame #7: 0x0000000311bd51e4 libReactantExtra.dylib`llvm::StdThreadPool::asyncEnqueue(std::__1::function<void ()>, llvm::ThreadPoolTaskGroup*) + 228
    frame #8: 0x0000000311492410 libReactantExtra.dylib`std::__1::shared_future<void> llvm::ThreadPoolInterface::asyncImpl<void>(std::__1::function<void ()>, llvm::ThreadPoolTaskGroup*) + 176
    frame #9: 0x0000000311b2e918 libReactantExtra.dylib`(anonymous namespace)::OperationVerifier::verifyOpAndDominance(mlir::Operation&) + 1752
    frame #10: 0x0000000311b2e230 libReactantExtra.dylib`mlir::verify(mlir::Operation*, bool) + 32
    frame #11: 0x00000003119d1d3c libReactantExtra.dylib`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 1628
    frame #12: 0x00000003119d4578 libReactantExtra.dylib`mlir::PassManager::runPasses(mlir::Operation*, mlir::AnalysisManager) + 104
    frame #13: 0x00000003119d43b0 libReactantExtra.dylib`mlir::PassManager::run(mlir::Operation*) + 912
    frame #14: 0x00000003117276ec libReactantExtra.dylib`mlirPassManagerRunOnOp + 12
    frame #15: 0x000000011c43c128
    frame #16: 0x000000010c93c1fc
    frame #17: 0x000000010ffe8020
    frame #18: 0x000000010c86c118
    frame #19: 0x0000000100a496f0 libjulia-internal.1.10.5.dylib`jl_toplevel_eval_flex(m=<unavailable>, e=<unavailable>, fast=<unavailable>, expanded=<unavailable>) at toplevel.c:877:19 [opt]
    frame #20: 0x0000000100a4a5d8 libjulia-internal.1.10.5.dylib`ijl_toplevel_eval_in [inlined] ijl_toplevel_eval(m=0x000000012d6dcad0, v=0x000000010b235b50) at toplevel.c:943:12 [opt]
    frame #21: 0x0000000100a4a5cc libjulia-internal.1.10.5.dylib`ijl_toplevel_eval_in(m=0x000000012d6dcad0, ex=0x000000010b235b50) at toplevel.c:985:13 [opt]
    frame #22: 0x00000001292e3bf4 sys.dylib`japi1_include_string_81176.1 at boot.jl:385
    frame #23: 0x0000000100a17eac libjulia-internal.1.10.5.dylib`ijl_apply_generic [inlined] _jl_invoke(F=0x000000012c99c330, args=0x000000016fdfc150, nargs=4, mfunc=0x000000012c99d0b0, world=<unavailable>) at gf.c:0 [opt]
    frame #24: 0x0000000100a17e40 libjulia-internal.1.10.5.dylib`ijl_apply_generic(F=0x000000012c99c330, args=0x000000016fdfc150, nargs=<unavailable>) at gf.c:3077:12 [opt]
    frame #25: 0x00000001292d2e2c sys.dylib`japi1__include_81184.1 at loading.jl:2136
    frame #26: 0x0000000128d2f928 sys.dylib`julia_include_46590.1 at Base.jl:495
    frame #27: 0x000000012922a138 sys.dylib`jfptr_include_46591.1 + 60
    frame #28: 0x0000000100a17eac libjulia-internal.1.10.5.dylib`ijl_apply_generic [inlined] _jl_invoke(F=0x0000000129acd570, args=0x000000016fdfdd70, nargs=2, mfunc=0x0000000129acddd0, world=<unavailable>) at gf.c:0 [opt]
    frame #29: 0x0000000100a17e40 libjulia-internal.1.10.5.dylib`ijl_apply_generic(F=0x0000000129acd570, args=0x000000016fdfdd70, nargs=<unavailable>) at gf.c:3077:12 [opt]
    frame #30: 0x00000001282a90ec sys.dylib`julia_exec_options_82788.1 at client.jl:318
    frame #31: 0x0000000129126aac sys.dylib`julia__start_82926.1 at client.jl:552
    frame #32: 0x0000000129126bd4 sys.dylib`jfptr__start_82927.1 + 36
    frame #33: 0x0000000100a17eac libjulia-internal.1.10.5.dylib`ijl_apply_generic [inlined] _jl_invoke(F=0x000000012c53f4b0, args=0x000000016fdfe298, nargs=0, mfunc=0x000000012c53f340, world=<unavailable>) at gf.c:0 [opt]
    frame #34: 0x0000000100a17e40 libjulia-internal.1.10.5.dylib`ijl_apply_generic(F=0x000000012c53f4b0, args=0x000000016fdfe298, nargs=<unavailable>) at gf.c:3077:12 [opt]
    frame #35: 0x0000000100a7590c libjulia-internal.1.10.5.dylib`true_main [inlined] jl_apply(args=0x000000016fdfe290, nargs=1) at julia.h:1982:12 [opt]
    frame #36: 0x0000000100a758f8 libjulia-internal.1.10.5.dylib`true_main(argc=<unavailable>, argv=<unavailable>) at jlapi.c:582:13 [opt]
    frame #37: 0x0000000100a75800 libjulia-internal.1.10.5.dylib`jl_repl_entrypoint(argc=<unavailable>, argv=<unavailable>) at jlapi.c:731:15 [opt]
    frame #38: 0x0000000100003f6c julia`main + 12
    frame #39: 0x000000019380e0e0 dyld`start + 2360
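
One way to confirm the diagnosis above is to watch the OS-level thread count of the Julia process grow across compilations. This is a hedged diagnostic sketch, not from the issue itself; the `ps`/procfs probing is an assumption about typical macOS/Linux tooling, and the loop mirrors the MWE.

```julia
# Hypothetical diagnostic: count OS threads of this process after each batch
# of compilations. `ps -M` (macOS) and /proc/self/status (Linux) are
# assumptions about the usual platform tools, not anything Reactant provides.
using Reactant

function nthreads_os()
    if Sys.isapple()
        # `ps -M` prints one header line plus one line per thread
        length(readlines(`ps -M -p $(getpid())`)) - 1
    else
        line = only(filter(l -> startswith(l, "Threads:"),
                           readlines("/proc/self/status")))
        parse(Int, last(split(line)))
    end
end

y = Reactant.to_rarray(rand(2, 2))
for i in 1:100
    f = @compile sum(y)
    i % 10 == 0 && println("iteration $i: $(nthreads_os()) threads")
end
```

If the thread count climbs monotonically instead of plateauing, that matches the ThreadPoolTaskGroup never releasing its workers.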


mofeing commented Oct 23, 2024

Related: jax-ml/jax#16272


wsmoses commented Oct 23, 2024

Yeah, we should also disable threading in our MLIR contexts -- I thought we fixed this earlier in the constructor for MLIRContext, but maybe we only discussed it.
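
A minimal sketch of what disabling context threading could look like from Julia, going through the MLIR C API. `mlirContextCreate`, `mlirContextEnableMultithreading`, and `mlirContextDestroy` are real C-API entry points; the library name and the `Ptr{Cvoid}` approximation of the `MlirContext` struct are illustrative assumptions, not how Reactant actually wires this up.

```julia
# Hedged sketch, not Reactant's actual implementation: create an MLIR context
# and turn off its internal thread pool before running any PassManager, so
# verification runs inline instead of enqueuing onto llvm::StdThreadPool.
const libmlir = "libReactantExtra"  # illustrative; whatever library exports the MLIR C API

ctx = ccall((:mlirContextCreate, libmlir), Ptr{Cvoid}, ())
ccall((:mlirContextEnableMultithreading, libmlir), Cvoid,
      (Ptr{Cvoid}, Bool), ctx, false)

# ... build the module and run the PassManager single-threaded ...

ccall((:mlirContextDestroy, libmlir), Cvoid, (Ptr{Cvoid},), ctx)
```

With multithreading disabled, `mlir::verify` no longer grows a per-context thread pool on each compilation, at the cost of serializing the verifier.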
