Here is the modified variant I'm using, which is basically that: resources needed for a task.

```yaml
version: 1
resources:
- count: 2
  type: node
  with:
  - count: 1
    label: default
    type: slot
    with:
    - count: 2
      type: core
tasks:
- command:
  - ior
  slot: default
  count:
    per_slot: 1
  resources:
    io:
      match:
      - type: shm
```

The only difference from the above is that I'm scoping the match to a named subsystem, primarily so my implementation can match it to the subsystem "io" and then use the custom algorithm defined there (subsystems might have different matching algorithms, or more explicitly, different "things that you do" when you stumble on one of their edges in the graph and have a connection to that subsystem graph). The above says: "When you find this edge for io while traversing a slot, just match the storage type shared memory." It's dumb and simple, but it will be enough to reproduce our first descriptive experiments (which were done from an image selection standpoint), now using the scheduler.
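To illustrate the idea that each named subsystem can carry its own matching rule, here is a hypothetical extension of the same task; the `network` subsystem and the `hsn` type are made up for illustration and are not part of the original:

```yaml
tasks:
- command:
  - ior
  slot: default
  count:
    per_slot: 1
  resources:
    # each key under "resources" names a subsystem graph; the match
    # rules are interpreted by that subsystem's own algorithm
    io:
      match:
      - type: shm
    network:        # hypothetical second subsystem, for illustration only
      match:
      - type: hsn   # assumed resource type, not from the original
```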
---
For the jobspec command, most of the examples I've seen are like "app", or just a single executable with args. It looks like command is a list of strings, so I'm thinking we need an ability (however it's done) to handle a few cases beyond that.

For example, for the flux operator I can provide a single command, e.g.:

```yaml
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: flux-sample
spec:
  size: 4
  containers:
  - image: ghcr.io/converged-computing/distributed-fractal
    command: fractal leader
```

And that would map easily to what we have now (see the sketch below), but there are several cases this doesn't address well.
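As a rough sketch of that mapping, assuming the task syntax from the first comment, `command: fractal leader` would become a list of strings in the jobspec:

```yaml
tasks:
- command:    # each argument is its own list entry
  - fractal
  - leader
  slot: default
  count:
    per_slot: 1
```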
Also, for more complex start logic, especially if containers are involved, it's likely going to be the case that the user wants a multi-line, custom entrypoint. Right now they would need to build that into the container, or (for the flux operator) a workaround would be to write it in a pre or init command block and then set the command to use it. I think we need to make it stupid easy (for the user) to get either a multi-line script or batch, and although I'd like to write some kind of tool to better orchestrate that, we can assume there are simple multi-line cases that already warrant this. For example, for the flux operator I can do:

```yaml
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: flux-sample
spec:
  size: 4
  containers:
  - image: ghcr.io/converged-computing/distributed-fractal
    commands:
      script: |
        #!/bin/bash
        echo "This is task"
        if [[ "\${FLUX_TASK_RANK}" == "0" ]]; then
          fractal leader --force-exit --host 0.0.0.0:50051
        else
          fractal worker --force-exit --host flux-sample-0.flux-service.default.svc.cluster.local:50051
        fi
```

Right now I require the user to escape envars, otherwise they get evaluated, but you could imagine that being more elegant. I'm not sure it belongs here, but (for the associated tool to use in batch, or a script that accounts for rank tasks) I'd like to be able to easily ask for things like:
I know the last one works with waitable, but I'm thinking of something that looks more like a very simple/dumb specification of a sequence of tasks and dependencies, one that doesn't require the user to know the nuances of flux flags. Anyway, for this particular point, I'm suggesting we have something like:

```yaml
tasks:
- slot: default
  script: |
    #!/bin/bash
    echo "The answer my friend"
    echo "is blowing in the wind"
    echo "The answer is blowin' in the wind"
  count:
    per_slot: 1
```

And that way I can more easily write a job for rainbow that doesn't require my special script to exist there (but only the software needed for it).
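For the sequence-of-tasks idea, here is a minimal sketch of what such a dumb specification could look like; the `name` and `depends_on` fields are hypothetical, not existing jobspec syntax:

```yaml
tasks:
- name: setup
  slot: default
  script: |
    #!/bin/bash
    echo "prepare inputs"
  count:
    per_slot: 1
- name: run
  depends_on: [setup]   # hypothetical: the tool would expand this to the right flux flags
  slot: default
  script: |
    #!/bin/bash
    echo "run the workload"
  count:
    per_slot: 1
```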
---
This is use-case 1.2 from our valid jobspec examples, reworked to use the
nesting setup @vsoch was proposing (roughly, at least). This cuts a whole lot of
complexity, and yet it's easy to machine it into the canonical form.
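As a sketch of that nesting style only (this is illustrative, reusing the node/slot/core layout from the first comment rather than the actual use-case 1.2):

```yaml
version: 1
resources:
- count: 4
  type: node
  with:
  - count: 1
    label: default
    type: slot
    with:
    - count: 8
      type: core
tasks:
- command:
  - app     # placeholder executable, for illustration
  slot: default
  count:
    per_slot: 1
```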
Here are some options for subsystem syntax for a rabbit (sketches for both options follow below).

3 nodes, all of which connect to the same rabbit by a storage link; there's no limitation on the number, but it's implicitly one right now.

3 nodes, all of which connect to some rabbit, not necessarily the same one. Do we want to have different ones for near-node vs. ephemeral lustre? Honestly, I'm not sure.
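As sketches of what those two options might look like, following the subsystem match style from the first comment; the `storage` subsystem name, the `rabbit` type, the `same` qualifier, and the placement under the node resource are all assumptions for illustration:

```yaml
# Option 1 (sketch): 3 nodes that must all share the same rabbit
version: 1
resources:
- count: 3
  type: node
  resources:          # assumed placement: subsystem match scoped to the node
    storage:
      match:
      - type: rabbit
        same: true    # hypothetical qualifier: all nodes resolve to one rabbit
```

```yaml
# Option 2 (sketch): 3 nodes, each connected to some rabbit, not necessarily shared
version: 1
resources:
- count: 3
  type: node
  resources:
    storage:
      match:
      - type: rabbit
```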
Ping: @tpatki, @milroy, @jameshcorbett, @rountree