
summit jsrun

Andre Merzky edited this page Mar 25, 2019 · 28 revisions

JSRUN Issues

The following issues affect the execution of RCT on Summit.


Open

[CCS #401235] jsrun fails on concurrent execution

  • DESCRIPTION
  • CONTACT: Jack Morrison
  • PRIORITY: critical. This disables any reasonable RP execution on Summit.
  • DONE: reproducer provided

[CCS #398323] jsrun stability issues

  • DESCRIPTION: We see about a 1% failure rate for jsrun when using it in quick succession over many tasks. That failure rate seems to increase with shorter tasks, so this might be a concurrency issue. The action on this one is on me to follow up with some experiments.
  • CONTACT: Jack Morrison
  • PRIORITY: high. This affects the reliability of workload executions. Failures are recoverable.
  • NOTES: possibly related to [CCS #399015]
  • TODO: plot failure rate against workload size.

[CCS #398579] clarification on jsrun resource files

  • DESCRIPTION: this issue with jsrun resource files has been acknowledged as a bug: -a is not evaluated in their context. Apparently that issue has been forwarded to IBM.
  • CONTACT: Thomas Papatheodore
  • PRIORITY: low. For the time being, we put a workaround in place.
  • RESOLUTION: RP switched to ERF and thus avoids -a. We left the ticket open.
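
The workaround can be sketched as follows (the file name, core ranges, and application name are hypothetical, not RP's actual configuration): the task layout is written into an explicit resource file (ERF) and passed via --erf_input, so -a never enters the picture:

```
$ cat task.erf
2 : {host: * ; cpu: {0-3},{4-7}}
$ jsrun --erf_input task.erf ./app
```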

[no ticket] jsrun core index is off

jsrun explicit resource file (ERF) allocates incorrect resources: when using an ERF that requests cores on a compute node’s second socket (hardware threads 88-171), the set of cores allocated on the second socket is shifted upwards by one physical core.

For example:
The following ERF requests the first physical core on each socket:

2 : {host: * ; cpu: {0-3},{88-91}}

jsrun currently shifts the second socket’s allocation by one physical core, allocating 92-95 instead of the specified 88-91:

$ jsrun --erf_input ERF_filename js_task_info | sort
Task 0 ( 0/2, 0/2 ) is bound to cpu[s] 0-3 on host h36n03 with OMP_NUM_THREADS=1 and with OMP_PLACES={0:4}
Task 1 ( 1/2, 1/2 ) is bound to cpu[s] 92-95 on host h36n03 with OMP_NUM_THREADS=1 and with OMP_PLACES={92:4}
  • DESCRIPTION: we hit a similar issue: specifying the first physical core in an ERF spec causes an error (sometimes fatal to the session). We currently work around this by marking the affected core as down, which limits the set of workloads we can run, but otherwise works as expected.
  • PRIORITY: low. For the time being, we put a workaround in place. Should we open a ticket? The workaround wastes two physical cores per node (~2.5%).
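
As an illustration of the workaround (the exact ranges RP blocks are assumptions): with the first physical core of each socket marked as down, resource sets simply start at the next physical core, e.g.:

```
2 : {host: * ; cpu: {4-7},{92-95}}
```

This sidesteps the index shift at the cost of the two blocked physical cores per node.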

Closed

[CCS #399015] jsrun segfault and failure

  • DESCRIPTION: jsrun dumped core on a unit; all subsequent jsruns failed, even in new pilot instances, as they failed to contact the PMIx layer.
  • reproducer provided
  • CONTACT: George Markomanolis
  • PRIORITY: very high. This affects the reliability of workload executions. Workloads cannot recover from the failure.
  • DONE: test reproducibility with multiple users - issue was confirmed, reproducer is viable.
  • RESOLUTION: this disappeared with the switch to ERF. The issue was closed.

[CCS #398324] jsrun limits (PID limits on batch nodes)

  • DESCRIPTION: We can only run a certain number of jsruns (~1k) until hitting a process limit (~4k - each jsrun instance needs multiple processes). Jack will look into the limit, and also investigate if jsrun can be used from compute nodes, which should not have that limit. We would prefer that second option, as that also makes it easier to load-balance our software framework.
  • CONTACT: Jack Morrison
  • PRIORITY: high. This affects the scale at which RP can run workflows.
  • DONE: testing deployment of jsrun on nodes.
  • RESOLUTION: jsrun works on the compute nodes, we can scale out! This is solved.
  • REOPEN: this still works in general, but breaks for consecutive jsrun invocations.
  • RESOLUTION: this turned out to be unrelated to execution of jsrun on compute nodes, also happens on batch nodes now. Issue closed in favor of [CCS #401235].
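
For diagnosing this class of problem, the relevant limit can be read directly on the node in question (a generic check, not specific to Summit):

```shell
# Print the per-user process limit on this node. With a cap around 4k and
# multiple processes per jsrun instance, only on the order of 1k concurrent
# jsruns fit under it.
nproc_limit=$(ulimit -u)
echo "max user processes: ${nproc_limit}"
```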

[CCS #398187] non-purged shared filesystem

  • DESCRIPTION: We miss a world-accessible shared filesystem on the compute nodes to use for software deployment (e.g., ZMQ) - all shared file systems are regularly purged.
  • CONTACT: Brian Smith
  • PRIORITY: medium. This impacts ease of use significantly, but system-specific workarounds can be used to mitigate some parts.
  • RESOLUTION: Brian installed a software dependency as system module (i.e., ZMQ), which helps for now - but this is likely to pop up again. For the time being, we consider this issue closed.

[CCS #399012] clarification on jsrun resource files (2)

  • DESCRIPTION: resource files require uniform resources (CPU and GPU) over all resource sets, which limits the set of workloads we can execute.
  • REPLY: yes, needs to be uniform
  • CONTACT: George Markomanolis
  • PRIORITY: low. It limits the generality of the jsrun LM, but does not impact current workloads.
  • RESOLUTION: look into ERF as alternative resource specification format. This issue is closed.
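
Assuming ERF indeed permits heterogeneous resource sets (the premise of this resolution), a mixed layout could be sketched like this (syntax modeled on the ERF example elsewhere on this page; not verified against jsrun):

```
1 : {host: * ; cpu: {0-3}}
1 : {host: * ; cpu: {4-11}}
```

Here the two resource sets request different core counts, which the uniform resource-file format would reject.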

[CCS #398578] request for jsrun clarification

  • DESCRIPTION: another issue with jsrun resource files, where requesting 0 GPUs leads to the allocation of all GPUs on the target node. This has also been accepted as a bug and is apparently being worked on by IBM.
  • CONTACT: Thomas Papatheodore
  • PRIORITY: medium. It might affect CPU-only CUs (tasks that request 0 GPUs).
  • RESOLUTION: this appears to be resolved.

JSRUN Stress testing

  • purpose: determine error rate dependencies
  • parameters: pilot size, unit size, unit runtime, unit concurrency, spawn rate
  • test matrix
  • test script
  • analysis
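
A minimal driver for such a sweep might look like the following sketch (not the actual test script; N, the CMD default, and the jsrun flags are assumptions):

```shell
# Launch N short tasks in quick succession and report the failure rate.
# CMD defaults to a trivial jsrun task; override N and CMD to sweep the
# parameters above (task runtime, concurrency, spawn rate).
N=${N:-100}
CMD=${CMD:-jsrun -n 1 -c 1 /bin/true}
fail=0
for i in $(seq 1 "$N"); do
    $CMD > /dev/null 2>&1 || fail=$((fail + 1))
done
echo "failures: ${fail} / ${N}"
```

Plotting the failure fraction against workload size over repeated runs gives the failure-rate dependencies the test matrix targets.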
