Radical-Pilot hangs on Amarel #3074
Comments
@andre-merzky same behavior with
Thanks @AymenFJA. Alas, I can't access the sandbox at
@andre-merzky 2 sessions are attached with 2 different modes:
As a side note: should we try a different Python version? The current env setup is in the corresponding config, radical.pilot/src/radical/pilot/configs/resource_rutgers.json, lines 29 to 33 in 7d6864e.
Update: @mtitov and I tested this issue interactively on Amarel and confirmed that it only happens on 3 nodes, with different numbers of tasks. It does not happen with 4 nodes. Our conclusion is that this is a SLURM issue with 3 nodes. @andre-merzky what do you think: should we investigate it further, and if so, how? If not, should we conclude that it is a non-RP issue (which it clearly is) and close this ticket?
Update: This behavior is back on Amarel with one node.
Proposal: PR to use a threaded manager
Closing this in connection with radical-cybertools/radical.entk#656
I saw this behavior via EnTK. Tested cases:
passes (25 pipelines).
passes (50 pipelines).
fails (100 pipelines).
RCT Stack:
Access mode: batch mode
From radical.entk.wfprocessor.0000.log:
:Then everything hangs until it times out. No task folders were created by RP in the agent sandbox.