Launching a job without consuming scheduler resources #3737
Replies: 5 comments 12 replies
-
@grondo had the brilliant idea to support adding arbitrary R values into jobspec. A jobtap plugin could see this R, skip the scheduler altogether, and move the job directly into the exec system (RUN state). Obviously this would need to be restricted to the instance owner. Normal jobspec submission is preserved in this case, so that a unique FLUID (jobid) and KVS directory are created automatically, and so that the typical job lifecycle is preserved. This capability could also be leveraged by tools/debuggers to co-schedule/co-launch daemons alongside an existing job (just copy the job's R, stuff it in a jobspec, and change out the command to the debugger). Some complications were also discussed on the call.
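For the tool/debugger co-launch case, here is a minimal Python sketch of what that could look like, assuming the standard Flux Python bindings and a jobtap plugin that honors an R fragment placed under a jobspec attribute. The attribute name `system.alloc-bypass.R`, the jobid, and the `gdbserver` command are illustrative placeholders, not a confirmed interface:

```python
import json

import flux
import flux.job

h = flux.Flux()

# Look up R for an already-running job via the job-info module.
target = flux.job.JobID("fABC123")  # placeholder: jobid of the existing job
resp = h.rpc("job-info.lookup", {"id": int(target), "keys": ["R"], "flags": 0}).get()
R = json.loads(resp["R"])  # R comes back as a raw string; parse it into an object

# Build a jobspec for the debugger/daemon, reusing the target job's resources.
jobspec = flux.job.JobspecV1.from_command(["gdbserver", "..."], num_tasks=1)

# Hypothetical attribute consumed by the jobtap plugin; the plugin would skip
# the scheduler and run the job directly on the R provided here.
jobspec.setattr("system.alloc-bypass.R", R)

jobid = flux.job.submit(h, jobspec)
print(f"co-launched as {jobid}")
```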
-
Oooh, this seems like a nice idea! It seems like this could be done without altering the state diagram, as you suggest above (I think?): the job enters the SCHED state and calls the stack of jobtap plugins, one of which may satisfy the resource request based on an R fragment provided in the jobspec. Upon completion, if no resources were assigned, the job enters the priority queue as usual and an alloc request is sent... Otherwise, it proceeds directly to the RUN state. Heh, it's like a parasitic allocation.
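One way to see that the normal lifecycle is preserved either way is to watch the job's eventlog after submission. A small sketch, assuming the bindings' `flux.job.event_watch` helper; the jobid is a placeholder:

```python
import flux
import flux.job

h = flux.Flux()
jobid = flux.job.JobID("fABC123")  # placeholder jobid

# The main eventlog should show the usual sequence either way, e.g.
# submit -> depend -> priority -> alloc -> start -> finish -> release -> clean.
for event in flux.job.event_watch(h, jobid):
    print(event.timestamp, event.name, event.context)
```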
-
Proof of concept posted in PR #3740
-
Hi Fluxers, would you happen to have some Python example code to show how a job can be submitted this way? Many thanks, Andre.
-
We have a few workflows (e.g., Merlin and Swift) that would like to use Flux to launch their daemons/workers across the allocation, and then from those daemons/workers, submit parallel MPI jobs to run (intentionally oversubscribing the cores/nodes to run both the jobs and the daemons). If `flux mini` is used for both sets of launches, problems arise with overallocation in the scheduler (the same resources are being requested twice).

In the case of Merlin, the daemons communicate via RabbitMQ, so launching them with `flux exec` is workable, leaving just the MPI jobs to be run with `flux mini` and the scheduler.

In the case of Swift, both the workers and the parallel jobs use MPI. So Swift needs PMI at both levels, making `flux exec` not really an option (barring some kludges to manually launch the job shell).

What would be great in the Swift case is if the outer launch could be done using `flux mini`, but the workflow could request that the job take up no resources in the scheduler. Alternatively, maybe `flux exec` could be extended to handle the MPI case via some user-side extension (e.g., `flux exec flux job-shell ...`).
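To make the daemon/worker case concrete, here is a rough sketch under the same assumptions as the earlier snippet: a bypass-style jobtap plugin reading a hypothetical `system.alloc-bypass.R` attribute, with `merlin-worker`, `my-mpi-app`, and the task counts as placeholders. The workers are launched against the instance's full R without charging the scheduler, and the MPI jobs are then submitted normally:

```python
import json

import flux
import flux.job
import flux.kvs

h = flux.Flux()

# R for the entire instance, as published by the resource module in the KVS
# (the same value shown by `flux kvs get resource.R`).
R = flux.kvs.get(h, "resource.R")
if isinstance(R, str):
    R = json.loads(R)

# 1. Launch the workflow daemons/workers across the allocation without
#    charging the scheduler for them.  The attribute name is hypothetical;
#    it stands in for whatever the jobtap plugin from the proof of concept reads.
daemons = flux.job.JobspecV1.from_command(
    ["merlin-worker"], num_tasks=4, cores_per_task=1  # placeholder command/shape
)
daemons.setattr("system.alloc-bypass.R", R)
daemon_id = flux.job.submit(h, daemons)

# 2. Submit the MPI jobs normally; the scheduler still sees all resources as
#    free, so there is no double-counting, only intentional oversubscription.
mpi = flux.job.JobspecV1.from_command(["my-mpi-app"], num_tasks=16, cores_per_task=1)
mpi_id = flux.job.submit(h, mpi)

print(f"daemons: {daemon_id}, mpi job: {mpi_id}")
```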