Radical-Pilot hangs on Amarel #3074

Closed
AymenFJA opened this issue Oct 27, 2023 · 10 comments

Comments

@AymenFJA
Contributor

AymenFJA commented Oct 27, 2023

I saw this behavior via EnTK, tested cases:

  • 1 node and 100 tasks: passes (25 pipelines).
  • 2 nodes and 200 tasks: passes (50 pipelines).
  • 3 nodes and 400 tasks: fails (100 pipelines).

RCT Stack:


  python               : /cache/home/afa64/ve/facts/bin/python3
  pythonpath           :
  version              : 3.6.8
  virtualenv           : /cache/home/afa64/ve/facts

  radical.entk         : 1.41.0
  radical.gtod         : 1.41.0
  radical.pilot        : 1.41.0
  radical.saga         : 1.41.0
  radical.utils        : 1.41.0

Access mode (batch mode):

res_dict = {
    'resource': 'rutgers.amarel',
    'walltime': 90,
    'cpus': 72,
    'access_schema': 'interactive',
}

From radical.entk.wfprocessor.0000.log:

1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000388 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000388 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0098 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0098 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0099 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0099 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0396 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0396 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000396 to state SCHEDULING
1698359774.077 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000396 to state SCHEDULING
1698359774.107 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : DEBUG    : Workload submitted to Task Manager
1698359774.274 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000000 to state SCHEDULED
1698359774.380 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000000 to state SCHEDULED
1698359774.393 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000004 to state SCHEDULED
1698359774.826 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000004 to state SCHEDULED
1698359775.161 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000008 to state SCHEDULED
1698359775.658 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000008 to state SCHEDULED
1698359775.996 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000012 to state SCHEDULED
1698359776.334 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000012 to state SCHEDULED
1698359776.562 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000016 to state SCHEDULED
1698359776.775 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000016 to state SCHEDULED
1698359776.826 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000020 to state SCHEDULED
1698359777.251 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000020 to state SCHEDULED
1698359777.377 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000024 to state SCHEDULED

Then everything hangs until it times out. No task folders were created by RP in the agent sandbox.
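To see which tasks never advanced past a given state, the log excerpt above can be reduced to the last transition recorded per task. The following is a minimal sketch, assuming the line layout shown in the excerpt (`timestamp : logger : pid : tid : level : message`); the helper names `last_task_states` and `tasks_in_state` are hypothetical and not part of EnTK or RP.

```python
import re
from collections import OrderedDict

# Assumed layout, taken from the excerpt above:
#   <timestamp> : <logger> : <pid> : <tid> : <level> : Transition task.NNNNNN to state STATE
LINE_RE = re.compile(
    r'^(?P<ts>\d+\.\d+)\s*:\s*\S+\s*:\s*\d+\s*:\s*\d+\s*:\s*\w+\s*:\s*'
    r'Transition (?P<uid>task\.\d+) to state (?P<state>\w+)\s*$')


def last_task_states(lines):
    """Map each task uid to (timestamp, state) of its last logged transition."""
    states = OrderedDict()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # later lines overwrite earlier ones, so the last transition wins
            states[match.group('uid')] = (float(match.group('ts')),
                                          match.group('state'))
    return states


def tasks_in_state(lines, state):
    """Return the uids whose last logged transition is `state`."""
    return [uid for uid, (_, last) in last_task_states(lines).items()
            if last == state]
```

Calling `tasks_in_state(open('radical.entk.wfprocessor.0000.log'), 'SCHEDULED')` would then list the tasks whose last recorded state is SCHEDULED. Non-task lines (pipelines, stages, debug messages) are skipped by the pattern.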

@andre-merzky
Member

@aymen - can you please rerun with the current RU devel? This PR should have solved the problem. Thanks!

@AymenFJA
Contributor Author

> @aymen - can you please rerun with the current RU devel? This PR should have solved the problem. Thanks!

@andre-merzky same behavior with utils devel:

env at /cache/home/afa64/ve/facts exists

---------------------------------------------------------------------

PWD       : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000
ENV       : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000//env/rp_named_env.rp.env
SCRIPT    : /cache/home/afa64/ve/facts/bin/radical-pilot-create-static-ve
PREFIX    : /cache/home/afa64/ve/facts
VERSION   : 3.6
MODULES   :  apache-libcloud chardet colorama idna msgpack msgpack-python netifaces ntplib parse dill pyzmq regex requests setproctitle urllib3
DEFAULTS  : True
PYTHON    : /cache/home/afa64/ve/facts/bin/python3 (Python 3.6.8)
PYTHONPATH: /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000/rp_install/lib/python3.6/site-packages::
RCT_STACK :
  python               : /cache/home/afa64/ve/facts/bin/python3
  pythonpath           : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000/rp_install/lib/python3.6/site-packages::
  version              : 3.6.8
  virtualenv           : /cache/home/afa64/ve/facts

  radical.entk         : 1.41.0
  radical.gtod         : 1.41.0
  radical.pilot        : 1.41.0
  radical.saga         : 1.41.0
  radical.utils        : 1.42.0-v1.41.0-16-g357e032@devel

---------------------------------------------------------------------
1698419422.891 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000360 to state SCHEDULED
1698419422.894 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000360 to state SCHEDULED
1698419422.895 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000364 to state SCHEDULED
1698419422.900 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000364 to state SCHEDULED
1698419422.901 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000368 to state SCHEDULED
1698419422.904 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000368 to state SCHEDULED
1698419422.905 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000372 to state SCHEDULED
1698419422.913 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000372 to state SCHEDULED
1698419422.913 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000376 to state SCHEDULED
1698419422.920 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000376 to state SCHEDULED
1698419422.922 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000380 to state SCHEDULED
1698419422.924 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000380 to state SCHEDULED
1698419422.930 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000384 to state SCHEDULED
1698419422.932 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000384 to state SCHEDULED
1698419422.935 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000388 to state SCHEDULED
1698419422.939 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000388 to state SCHEDULED
1698419422.941 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000392 to state SCHEDULED
1698419422.944 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000392 to state SCHEDULED
1698419422.945 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000396 to state SCHEDULED
1698419422.947 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000396 to state SCHEDULED
1698419422.949 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0000 to state SCHEDULED
1698419422.951 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0000 to state SCHEDULED
1698419422.953 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0004 to state SCHEDULED
1698419422.955 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0004 to state SCHEDULED
1698419422.956 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0008 to state SCHEDULED
1698419422.960 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0008 to state SCHEDULED

@andre-merzky
Member

Thanks @AymenFJA. Alas, I can't access the sandbox at /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000 - could you please tar it up and attach it, or otherwise make it available? Thanks!
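One way to package a sandbox directory for attaching to an issue is sketched below, using only the Python standard library. The function name `archive_sandbox` and the default output location (next to the sandbox directory) are assumptions for illustration, not an RP utility.

```python
import os
import tarfile


def archive_sandbox(sandbox_dir, out_path=None):
    """Write a gzip-compressed tarball of `sandbox_dir` and return its path."""
    sandbox_dir = os.path.abspath(sandbox_dir)
    if out_path is None:
        # assumption: place the archive next to the sandbox directory
        out_path = sandbox_dir + '.tar.gz'
    with tarfile.open(out_path, 'w:gz') as tar:
        # store paths relative to the sandbox's own directory name
        tar.add(sandbox_dir, arcname=os.path.basename(sandbox_dir))
    return out_path
```

The resulting archive unpacks into a single top-level directory named after the sandbox, which keeps multiple sessions from mixing when several are shared.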

@AymenFJA
Contributor Author

@andre-merzky 2 sessions are attached with 2 different modes:
interactive_batch.zip
login_node.zip

@mtitov
Contributor

mtitov commented Oct 27, 2023

As a side note: should we try a different Python version? The corresponding resource config has the following env setup:

"pre_bootstrap_0" : ["module use /projects/community/modulefiles",
                     "module load gcc/5.4",
                     "module load python/3.9.6-gc563",
                     "module load intel/17.0.4"
                    ],
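The stack reported earlier runs Python 3.6.8, while this config loads a `python/3.9.6` module. A quick way to spot such a mismatch is sketched below; it assumes the module name follows the `python/<major>.<minor>...` pattern seen in the config, and `python_module_version` is a hypothetical helper, not an RP function.

```python
import re
import sys


def python_module_version(pre_bootstrap):
    """Return (major, minor) from a 'module load python/X.Y...' entry, or None."""
    for cmd in pre_bootstrap:
        match = re.match(r'\s*module\s+load\s+python/(\d+)\.(\d+)', cmd)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


# entries copied from the config fragment above
pre_bootstrap_0 = [
    "module use /projects/community/modulefiles",
    "module load gcc/5.4",
    "module load python/3.9.6-gc563",
    "module load intel/17.0.4",
]

expected = python_module_version(pre_bootstrap_0)
running = tuple(sys.version_info[:2])
if expected is not None and expected != running:
    print('config loads python %d.%d but interpreter is %d.%d'
          % (expected + running))
```

Run inside the virtualenv used for the client, this prints a warning only when the interpreter's major.minor version differs from the one the config loads.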

@AymenFJA
Contributor Author

AymenFJA commented Nov 9, 2023

Update: @mtitov and I tested this interactively on Amarel and confirmed that the hang only occurs with 3 nodes, with varying numbers of tasks. It does not occur with 4 nodes. Our conclusion is that this is a SLURM issue specific to 3 nodes. @andre-merzky what do you think: should we investigate further, and if so, how? If not, should we treat this as a non-RP issue (which it clearly is) and close this ticket?

@AymenFJA
Contributor Author

Update: This behavior is back with Amarel on one node.
I can also confirm that the same behavior is happening on UVA Rivanna.

@andre-merzky
Member

Proposal: PR to use threaded manager

@AymenFJA
Contributor Author

Closing this in correspondence to radical-cybertools/radical.entk#656
