Support for elasticity and resource dynamism in jobspec #6493

milroy · 2024-12-09T08:56:42Z

milroy
Dec 9, 2024
Maintainer

I think the wall of text that follows is both ideas and design. Since it would be nice to get a design out of it, I tagged the discussion as such.

Motivation

To lay a foundation for work on resource and task dynamism in a multi-cluster environment, I’d like to discuss concepts and specifications needed to specify and support moldable, evolving, malleable, and elastic jobs in Flux. With respect to some of our past selves and our Flux forebears, here’s a link to a really great and informative discussion on related topics back in 2015(!): #354. I can’t say that I’ve read every comment, but I encourage everyone reading this to review that discussion first.

Definitions

So we’re on the same page, here are modified and updated definitions of those terms based on those found in the now classic Feitelson and Rudolph, 2005:

A rigid job requires a fixed number of resources and shape in order to execute, as specified by the user at job submission.

An evolving job is one that may change its resource requirements during execution. Note that it is the application itself that initiates the changes.

A moldable job is one that can be initiated with variable resources. The resources are determined by the resource manager before job execution.

A malleable job is one that can adapt to changes (initiated by the resource manager) in its resources during execution.

(My definition) An elastic job is one that features dynamism described by the three previous job types in combination and/or dynamism in time.

I updated those definitions to generalize what changes from the number of processors to resources and their shapes. In my opinion there’s a lack of precision in discussions about characteristics and behaviors related to elasticity. Hopefully the ensuing discussion can also serve to clarify those characteristics and behaviors.
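One way to make these definitions concrete is to record, for each job type, who initiates a resource change and when it can happen. The sketch below is only an illustration: the names `Initiator`, `Phase`, `JOB_TYPES`, and `is_elastic` are hypothetical, and reading "elastic" as "at least two dynamic types combined, or dynamism in time" is one interpretation of the definition above, not settled terminology.

```python
from dataclasses import dataclass
from enum import Enum


class Initiator(Enum):
    NONE = "none"
    APPLICATION = "application"
    RESOURCE_MANAGER = "resource manager"


class Phase(Enum):
    NEVER = "never"
    BEFORE_EXECUTION = "before execution"
    DURING_EXECUTION = "during execution"


@dataclass(frozen=True)
class JobType:
    name: str
    initiator: Initiator  # who triggers the change in resources
    phase: Phase          # when the change can happen


JOB_TYPES = {
    "rigid": JobType("rigid", Initiator.NONE, Phase.NEVER),
    "moldable": JobType("moldable", Initiator.RESOURCE_MANAGER, Phase.BEFORE_EXECUTION),
    "evolving": JobType("evolving", Initiator.APPLICATION, Phase.DURING_EXECUTION),
    "malleable": JobType("malleable", Initiator.RESOURCE_MANAGER, Phase.DURING_EXECUTION),
}

DYNAMIC = {"moldable", "evolving", "malleable"}


def is_elastic(traits: set, dynamic_in_time: bool = False) -> bool:
    """Assumed reading: elastic = two or more dynamic types, or dynamism in time."""
    return len(traits & DYNAMIC) >= 2 or dynamic_in_time
```

Under this reading, a job that is both moldable and malleable counts as elastic, while a job that is only evolving does not unless it also has a flexible time window.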

New terminology and needs

In the following bullets, I’ve taken some of the graph terminology from Diestel’s Graph Theory (Fifth Edition). I don’t think we need to get into detailed discussion on graphs (if so; cool!), but I’d like to anchor some of the items on well-defined concepts from graph theory. I say that with an eye toward research.

Time terms and needs
- Start at specific time “deferred allocation”
- Complete any time before deadline
  - Note: initially just consider the total walltime of all job components
  - Could later apply AI/ML to fine tune (known to be difficult)
- Use cases for start after a specific time?
- Combinations like start after a specific time and complete before deadline?
- More general or flexible temporal expression, e.g., could start anytime in the closed interval [T0, T1] but must start by T1.
  - Metadata to indicate time specific to a vertex or subgraph
Space terms and needs
- Shape: a synonym of subgraph (typically used to refer to a Flux jobspec slot)
  - Do we allow two shapes to be equal if they are isomorphic by type (e.g., in jobspec A, 3 leaf vertices underneath one socket are cores, in jobspec B they are 3 GPUs)? Other isomorphisms?
- Graph minor: each moldable, malleable, evolving application must specify a jobspec shape that is rigid, or more formally a jobspec graph minor or “stencil” (“job core?”) and a time interval for which it is active. Exactly one can exist at any time. Initially require the jobspec graph minor’s active time interval to be the length of time of the entire job. We may want to support disjoint graph minors (again, exactly one can exist at any time, so we could allow substitutions). Note that they would technically not be disjoint in the resource graph itself since any job represents at least a non-exclusive allocation path up to the graph root.
  - Need metadata to indicate the root of the job core
- Grow: a transformation of the job core, defining an inflation space or IX (see Diestel 1.7)
  - The operation could be defined as a subgraph or restriction of the rooted product: https://en.wikipedia.org/wiki/Rooted_product_of_graphs
    - A rooted product could help define the “vertex of attachment” or “model vertex” (see Diestel Fig. 1.7.2) in a jobspec
- Shrink: a contraction operation on the IX that maps vertices back to the graph core (i.e., quotient graph)
- Need metadata to indicate importance (weights, priorities) of subgraphs eligible for growth or to be shrunken
Time and space
- Specify multiple job shapes along with their “place” relationships (e.g., vertices of attachment) and temporal relationships in the same file
  - Will allow Fluxion to perform multiple matches across time and will require enforcing dependencies across matches in a composite jobspec
- Annotations on jobspec vertices to indicate their temporal validity (e.g., this IX subgraph can be reclaimed after 12:00, or I will need to grow job core at 9:00 by this storage subgraph to accommodate a checkpoint)
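
To ground the space terms, here is a minimal sketch of a job core with grow and shrink operations. It is an assumption-heavy illustration, not Fluxion's data model: `JobGraph` is a hypothetical class, the graph is restricted to a rooted tree, grow attaches new vertices beneath a vertex of attachment, and shrink simply releases the non-core vertices beneath that vertex (a simplification of the contraction described above).

```python
class JobGraph:
    """A rooted tree of resource vertices with a rigid core (hypothetical model)."""

    def __init__(self, root, edges):
        self.root = root
        self.children = {root: []}
        for parent, child in edges:
            self._add_edge(parent, child)
        # Everything present at creation is the rigid job core.
        self.core = set(self.children)

    def _add_edge(self, parent, child):
        if parent not in self.children:
            raise KeyError(f"unknown parent vertex: {parent}")
        if child in self.children:
            raise ValueError(f"vertex already present: {child}")
        self.children[parent].append(child)
        self.children[child] = []

    def grow(self, attach_at, edges):
        """Attach a subtree; every new edge must hang off the vertex of attachment."""
        if attach_at not in self.children:
            raise KeyError(f"unknown vertex of attachment: {attach_at}")
        allowed = {attach_at}
        for parent, child in edges:
            if parent not in allowed:
                raise ValueError(f"edge from {parent} is outside the grown subtree")
            allowed.add(child)
        for parent, child in edges:
            self._add_edge(parent, child)

    def shrink(self, attach_at):
        """Release all non-core vertices beneath attach_at and return their names."""
        stack = [c for c in self.children[attach_at] if c not in self.core]
        self.children[attach_at] = [c for c in self.children[attach_at] if c in self.core]
        released = []
        while stack:
            vertex = stack.pop()
            released.append(vertex)
            stack.extend(self.children.pop(vertex))
        return sorted(released)


job = JobGraph("node0", [("node0", "socket0"), ("socket0", "core0"), ("socket0", "core1")])
job.grow("socket0", [("socket0", "gpu0"), ("socket0", "gpu1")])
released = job.shrink("socket0")
```

After the grow and shrink calls the graph is back to exactly the job core, which is the property the "exactly one core exists at any time" requirement relies on.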

Those are some initial ideas to get us started. I don’t mean to imply any of these ideas are required or that I’m set on the terminology I just used. Feel free to propose other terms or redirect the discussion in a more concrete direction. I’ll get to work on preparing some YAML/JSON examples of the items above for motivation.

While not necessary for a conceptual discussion, use cases are very welcome and will help to ground our thoughts.

vsoch · 2024-12-10T21:35:10Z

vsoch
Dec 10, 2024
Maintainer

I'll post this in sections of response (and threads) to organize points. This first post is about the terms at the top: how to describe changing vs. not. I think we can simplify the number of terms (rigid vs. evolving vs. moldable vs. malleable vs. elastic) into a few, maybe just rigid and elastic. What do the others add? I think what we might do is define rigid and elastic, and then define cases for elasticity:

1. Application driven changes
2. User requested changes
3. Workload manager driven changes (and the job defines a scope of what is allowed)

In all cases, the workload manager has to orchestrate. The distinction with case 3 is the workload manager is "shuffling around" resources to make things fair, and the application has to be able to support that (hence needing its permission to checkpoint and restore).

4 replies

vsoch Dec 10, 2024
Maintainer

Cases I see for time and (some) events:

Give my job this amount of time to run (time or duration) - this is different than cases below.
Start my job when you can (no specification of when)
Start my job exactly at this time (deferred allocation) but what happens when you don't get it? Start as soon as possible? Cancel?
Complete before deadline (this feels dangerous to me - if a job is started and then cut right before finish, would the user really be happy?) The only case this might make sense is when you are running iterations of a simulation, and just want it to cut at some point. But arguably, you could also set a number of successes / failures (this is what JobSet in Kubernetes does)
Start before deadline (and then knowing an estimated duration seems like a more reasonable approach)

For all of the above, the user likely has insight into individual components, but it's a lot to ask to mentally add them up to a holistic picture. To summarize, I'd see the following attributes:

duration: max time my job is allowed to run before being killed. If unset, default of 0 means workload manager default.
deferred:
- start this many seconds into the future (use case - give some other component some time to run)
deadline: an event must be done by this time
- kill: kill job by this timestamp (what are use cases)?
- start: cancel job if it is not started by this deadline
policy:
- start: start when "this metric" reaches this state (number of jobs successful / failed, time in queue, application metric)
- failure: failure when "this metric" reaches this state
- grow: increase the size of a resource when "this metric" reaches this state
- shrink: reduce the size of a resource when "this metric" reaches this state
order: start my job after this one completes (basic sequential, like depends on, so already supported arguably). Beyond that we give to workflow manager.

A job can have several of these components. For example, you can defer a start, and then have a global deadline, or a max duration.
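
As a rough illustration of how the time attributes could combine, the sketch below turns `duration`, `deferred`, and the two `deadline` flavors into a single scheduling decision. The `TimePolicy` class, the `decide` function, the use of absolute timestamps in seconds, and the one-hour default are all assumptions made for the example, not existing Flux behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimePolicy:
    duration: float = 0.0                   # max runtime in seconds; 0 means manager default
    deferred: float = 0.0                   # earliest start, in seconds after submission
    start_deadline: Optional[float] = None  # cancel if not started by this time
    kill_deadline: Optional[float] = None   # job must be gone by this time


def decide(policy, submit_time, now, default_duration=3600.0):
    """Return (decision, run_limit_seconds); decision is wait, start, or cancel."""
    if policy.start_deadline is not None and now > policy.start_deadline:
        return ("cancel", 0.0)
    if now < submit_time + policy.deferred:
        return ("wait", 0.0)
    limit = policy.duration or default_duration
    if policy.kill_deadline is not None:
        # The kill deadline can shorten the run but never extend it.
        limit = min(limit, policy.kill_deadline - now)
    if limit <= 0:
        return ("cancel", 0.0)
    return ("start", limit)


policy = TimePolicy(duration=200, deferred=60, start_deadline=1000, kill_deadline=1100)
```

With this policy, a job submitted at time 0 waits until 60, runs for up to 200 seconds if started early, and gets a shorter limit (or is cancelled) as the deadlines approach.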

vsoch Dec 10, 2024
Maintainer

Graph ideas (and apologies in advance for not knowing these formally). I'm going to make stuff up to start, because it's fun. Isomorphism I'm guessing is like saying "Different form, but same thing." I think there are a few cases.

Isomorphism with multiple graphs

Where a graph is a cluster or unified set of logical resources

Split isomorphism: We re-arrange the same resources. For example, four nodes running in parallel, each with 12cpu on one cluster. I could split that into 2 nodes, also each with 12cpu, running on two clusters. So a split isomorphism is taking a graph, changing its form to be multiple graphs, and it's considered the same. This use case might be relevant if one cluster is really full, and some work needs to be sent to run elsewhere.
Join isomorphism: This is the same idea, but combining graphs. This might make sense to "undo" the above operation. I'd then go as far as to say it's some kind of rule that (1) a graph set that was subject to some split isomorphism operation is guaranteed to be join isomorphic, and (2) a graph that is join isomorphic is guaranteed to be split isomorphic.
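
The split/join round trip can be sketched with a very small model. Here a request is just a node count and a per-node core count; `Request`, `split`, and `join` are hypothetical names, and real resource graphs would carry far more structure than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    nodes: int
    cores_per_node: int


def split(request, parts):
    """Divide a request into `parts` pieces with the same per-node shape."""
    if not 1 <= parts <= request.nodes:
        raise ValueError("parts must be between 1 and the number of nodes")
    base, extra = divmod(request.nodes, parts)
    return [
        Request(base + (1 if i < extra else 0), request.cores_per_node)
        for i in range(parts)
    ]


def join(pieces):
    """Combine pieces back into one request; all pieces must share a node shape."""
    shapes = {piece.cores_per_node for piece in pieces}
    if len(shapes) != 1:
        raise ValueError("pieces must share one cores_per_node value")
    return Request(sum(piece.nodes for piece in pieces), shapes.pop())


original = Request(nodes=4, cores_per_node=12)
pieces = split(original, 2)      # two requests of 2 nodes each
restored = join(pieces)          # equal to the original request
```

In this toy model the rule holds by construction: anything produced by `split` can be passed to `join` to recover the original request.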

Isomorphism with one graph

Resource isomorphism comes down to shuffling (or changing) labels, and deciding that A is the same as B. I think this would come down to being able to say "Resource X is the same as Resource Y" - or a GPU from one vendor (of some set of same features) is the same as a GPU from another vendor. So this is less about comparing the actual graph shapes, but more the labels against some database or thresholds of sameness. This also could be extended to say "this many GPU of this type is equal to a different number of GPUs of this other type."
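
A lookup table is probably the simplest way to express "Resource X is the same as Resource Y", including the "this many of one type equals that many of another" case. The table below is entirely made up for illustration: the type names and ratios are hypothetical, and a real database would compare features rather than hard-code weights.

```python
from fractions import Fraction

# Hypothetical table: how many canonical "GPU units" one device of each type is worth.
EQUIVALENCE = {
    "gpu-vendor-a": Fraction(1),
    "gpu-vendor-b": Fraction(1),
    "gpu-small": Fraction(1, 2),
}


def canonical_units(request):
    """Total canonical units for a request given as {resource type: count}."""
    return sum((EQUIVALENCE[rtype] * count for rtype, count in request.items()), Fraction(0))


def equivalent(request_a, request_b):
    """Two requests are treated as the same if their canonical totals match."""
    return canonical_units(request_a) == canonical_units(request_b)


same_vendor_swap = equivalent({"gpu-vendor-a": 2}, {"gpu-vendor-b": 2})
two_small_for_one = equivalent({"gpu-vendor-a": 1}, {"gpu-small": 2})
```

Exact fractions are used instead of floats so that ratios such as one half compare without rounding surprises.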

Shape

I'm trying to think about what we need here. I think for the most part, resource graphs of different types are trees? And I don't see why there would be cycles? And there is always a root, the place where some main scheduler or workload manager sits and work is received? Maybe the simplest idea for shape is to start with descriptions for entire graphs (binary, k-ary with some value of k) and then set metadata at each node for, for example, stopping points. E.g., "The traversal should stop here because the graph isn't actually a perfectly shaped tree." If we need a shape as a reference, I think using something like Cypher (or a standard similar / derived from it) makes a lot of sense, otherwise we are maintaining our own thing. In terms of the shape of an entire graph, we probably want to use formats that are already well understood to start - turtle or triples, the standard json with nodes/edges or whatever formats Cypher eats up.

Also - just to throw a towel into everything (because it was mentioned at a meeting this week, and I had the idea last year) the complete opposite strategy to fractale, and one that works well in cloud, is to make exactly the cluster you need for the work instead of trying to match a cluster to the work. Arguably there are dimensions to that. For example, if we throw away hard coded software and environment modules and have them added on the fly? All of a sudden we don't have to schedule to that. If we can start with some base set of nodes and add software, filesystems, and (maybe even network) on the fly, we just assemble what we want when the application needs it. We would essentially be running an on-premises cloud, with a lot of abstractions and logic similar to what is done there vs. traditional "hard code and build everything and you are stuck with it" HPC. It calls for a different focus of work on designing that space of tooling to work with the base VMs and filesystems instead.

vsoch Dec 10, 2024
Maintainer

Extending on the above and asking "what would flux need, in the jobspec, to allow these cases?" Here is a quick example. This of course could be written a million ways, and this is just one. Since adding new stuff to existing sections is hard, I'm defining a new section "policies" that describes conditions for elasticity.

Single Task

The durations / policies are easy because they are rule based. You just need to put them somewhere. The current duration already has a spot, but let's pretend it does not, and it becomes part of a policy.

{
  "resources": [
    {
      "type": "slot",
      "count": 1,
      "with": [
        {
          "type": "core",
          "count": 1
        }
      ],
      "label": "task"
    }
  ],
  "tasks": [
    {
      "command": ["hostname"],
      "slot": "task",
      "count": {
        "per_slot": 1
      }
    }
  ],
  "policies": {
    # The job should not go over 200 seconds
    "duration": "200s",

    # Start in a day because I know the cluster is going offline this afternoon
    "deferred": {
      "type": "start",
      "period": "1d"
    },

    # My work needs to be done by a meeting in 3 days so cancel everything at that point, I don't care.
    "deadline": {
      "start": "3d"
    },

    # This I'm taking from a similar idea with the ensemble operator - you define actions that trigger from events.
    "actions": [
      # Fail the "task" task when the accuracy metric falls below 50%
      {
        "target": "task",
        "trigger": "metric",
        "name": "accuracy",
        "value": 0.50,
        "action": "fail"
      },
      # When a task mean duration goes over 100s, grow the jobspec by 2 nodes (subject to re-scheduling, etc)
      # The use case is that we must not have given it enough resources. It needs more.
      {
        "target": "task",
        "trigger": "metric",
        "name": "mean-duration",
        "value": "100s",
        "action": "grow:2"
      }
    ]
  },
  "version": 2
}
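
To show how rule-based actions like these might be evaluated, here is a small sketch that checks each action against observed metrics. It rests on assumptions beyond the jobspec above: the comparison direction is carried in a hypothetical `when` field ("below" or "above"), since the example only states it in comments, and `threshold` and `fired_actions` are illustrative helper names.

```python
import operator

COMPARATORS = {"below": operator.lt, "above": operator.gt}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def threshold(value):
    """Accept plain numbers or duration strings such as "100s", "5m", "1d"."""
    if isinstance(value, (int, float)):
        return float(value)
    return float(value[:-1]) * UNIT_SECONDS[value[-1]]


def fired_actions(actions, metrics):
    """Return (target, action) pairs whose metric trigger condition holds.

    `metrics` maps (target, metric name) to the latest observed value.
    """
    fired = []
    for rule in actions:
        if rule["trigger"] != "metric":
            continue
        observed = metrics.get((rule["target"], rule["name"]))
        if observed is None:
            continue  # no observation yet, so nothing to decide
        if COMPARATORS[rule["when"]](observed, threshold(rule["value"])):
            fired.append((rule["target"], rule["action"]))
    return fired


actions = [
    {"target": "task", "trigger": "metric", "name": "accuracy",
     "when": "below", "value": 0.50, "action": "fail"},
    {"target": "task", "trigger": "metric", "name": "mean-duration",
     "when": "above", "value": "100s", "action": "grow:2"},
]
```

Making the comparison explicit in the rule avoids having to guess whether a metric crossing its value from above or from below should trigger the action.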

Multiple Tasks

Policies is separate from tasks because policies might define two tasks. I'm tempted to go into the jobspec next gen definition for named resource blocks here to reduce the redundancy of resources but I'll just be redundant for now.

{
  "resources": [
    {
      "type": "slot",
      "count": 1,
      "with": [
        {
          "type": "core",
          "count": 1
        }
      ],
      "label": "task"
    }
  ],
  "tasks": [
    {
      # Adding an identifier because it cannot be defined by command or slot name
      "name": "hostname",
      "command": ["hostname"],
      "slot": "task",
      "count": {
        "per_slot": 1
      }
    },
    {
      "name": "existential-crisis",
      "command": ["whoami"],
      "slot": "task",
      "count": {
        "per_slot": 1
      }
    }
  ],
  "policies": {
    "actions": [
      # When the existential crisis is completed (status change to completed), submit a job for hostname
      {
        "trigger": "status",
        "name": "completed",
        "value": "existential-crisis",
        "action": "submit",
        "target": "hostname"
      },
      # This would make it circular - have an existential crisis when hostname fails!
      {
        "trigger": "status",
        "name": "fail",
        "value": "hostname",
        "action": "submit",
        "target": "existential-crisis"
      }
    ]
  },
  "version": 2
}

Another idea I'm just thinking of - if we have identifiers for the job (application) and the resource set, arguably we could have either of them be the entity that is elastic and changing. E.g., in one case the application changes something (and the resources do not) and in another the resources change (and the application does not) or maybe they both change! And one small detail - the order of the variables above might be important for the human reader / user. By moving the target to the end, it's much easier to read than if it's at the beginning. I'm also not convinced circular is a terrible thing - there could be workflows (state machines) that want that design, but they would need to have other mechanisms for hard stops.
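
If circular policies are allowed, it would at least help to detect them so a hard stop can be required. The sketch below builds a "status of X triggers submit of Y" graph from the actions list and looks for a cycle with a depth-first search. The field names follow the example above, and `find_cycle` is a hypothetical helper, not part of any Flux API.

```python
def find_cycle(actions):
    """Return a list of task names forming a cycle, or None if there is none."""
    graph = {}
    for rule in actions:
        if rule["trigger"] == "status" and rule["action"] == "submit":
            # Edge: the watched task's status change causes the target to be submitted.
            graph.setdefault(rule["value"], []).append(rule["target"])

    in_progress, done = set(), set()

    def visit(node, path):
        in_progress.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            if nxt in in_progress:
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for start in list(graph):
        if start not in done:
            cycle = visit(start, [])
            if cycle:
                return cycle
    return None


actions = [
    {"trigger": "status", "name": "completed", "value": "existential-crisis",
     "action": "submit", "target": "hostname"},
    {"trigger": "status", "name": "fail", "value": "hostname",
     "action": "submit", "target": "existential-crisis"},
]
cycle = find_cycle(actions)
```

A validator could accept a cyclic policy only when it also carries a stopping condition such as a deadline or a maximum number of submissions.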

vsoch Dec 11, 2024
Maintainer

Stencil: I'm reading this as a template shape for a graph, and is this a term from graph theory? I really like stencil, it gives me elegance vibes, but (to play devil's advocate) we might consider who is going to use the term. If it's a user (and they need to ask "What is a stencil?" and then either not know or look it up) we might consider template. If it's for developers mostly (and we like the term) we can do what we like. I would just call it a template, I think. Job core is too similar to the resource core type and would be confusing.

I'm reading (from a quick search) that a graph minor is what you get when you perform some set of operations on graph A and arrive at graph B. Could there be two types (again, I'm making this up):

Simplified graph minor: when you do operations on a graph, and it becomes simpler (pruning nodes, etc)
Extended graph minor (would that be a graph major)?: when you do operations to a graph and it becomes more complex.

I'm assuming the set of operations on any graph that lead to a graph minor have some kind of inverse, or more generally speaking, when you remove resources you could add them back. But I think in HPC applications running, that might not always fly for specific application types. E.g., once you remove resources, adding them back is moot because they can't be added to the running unit. So a lot of these graph operations (and if something is reversible or not) likely depends on the capabilities of the application.

For graph major - this brings up questions of fair share. If I start a job with 10 nodes, and I request to grow, does that mean I need to put that new request in the queue? Or does my application need logic to make the request some time N before it is needed in order for it to work? Would the user need to request 10 nodes, and then also define a max and min size, which would give them some priority for the resources, more than if they didn't specify grow?

In terms of "a time interval for which it is active" I think that falls under the jurisdiction of policy / actions, specifically when targeting resources for an application. The way I'm thinking about it, policies are basically events -> trigger -> actions that lead to some change in the graph. A change in the graph is a different graph minor.

For a disjoint graph minor - that means a piece of a graph breaks off, would that be a parent instance launching a child job?

"Initially require the jobspec graph minor’s active time interval to be the length of time of the entire job."

This makes sense. And other graph minor members (disjoint or not, I'm not sure yet what the difference is if we are using a subset of jobs that are doing their own thing) can be submitted up to the remaining time of the parent job.

"Need metadata to indicate the root of the job core"

We just needed this for usernetes! There is a way to derive it by walking up the flux-uri, and I think there is an open issue to add environment variables/attributes to more easily get there. #6474

For grow / shrink - I think we likely would want to start simple, just with resources at the top level of a job. And then the application or workflow tool running under it decides what to do with those new resources. That also means there needs to be a discovery mechanism. Another thing to keep in mind (that the various Kubernetes mix up all the time) is ensuring we have a clear definition between a workflow DAG (which might respond to more resources) and a workload manager graph.

High level thinking - for everything that we do, we have several means of communication for the workload manager / job to send information to a workload manager:

event subscription: (ideal)
annotations (akin to Kubernetes) on the job (metadata in KVS, for example). And then within that, what is allowed to be changed by the user (do we have a concept of reactive annotations that can indicate state or change something)? This gets at your reference to temporal validity.

That's my brain dump for now! 🧠 I need to read a lot of these mentioned resources so I have better understanding of the actual graph theory concepts (very excited for this).

vsoch · 2024-12-12T05:32:45Z

vsoch
Dec 12, 2024
Maintainer

A new thread for some notes from Toward convergence in job schedulers for parallel supercomputers. Most of these are probably redundant definitions, but I might add to it. I do want to find newer papers because this one is 2 decades old, and (at least I'd hope) there are newer approaches (not just theory, but paired with implementation).

Throughput

"number of jobs completed through unit time"

and

"The higher the throughput, the more users are satisfied."

There are the variables of human perception (it just needs to feel fast) and then the reality that the number of actual users of a system isn't really approaching infinity. I wonder if the user not being able to know the number of other users (and their place or priority in the queue) is important here, since a large part of that "is it fast enough so I am happy" depends on that?

For multiple users, in terms of Flux, we get high throughput with instances. The implication with that approach is that a single user is launching a huge number of jobs (and wants it done quickly). But why are they launching so many jobs? Doesn't that hint at an issue with the design of the jobspec, what scale it is ingesting per unit submit? The problem with this demonstration of throughput is that (I don't think) it actually maps to multi-tenancy because (I think) it would assume one top level instance owner that can then use children. It also assumes people are actually submitting jobs like that, and at those numbers. Is this the case? If someone wants to submit all those tiny jobs, what are they actually doing (and why couldn't it be encompassed in processes under one job)?

adaptive partitioning

"Given a range of choices, the scheduler can set the number of processors based on knowledge about the system load and competing jobs, knowledge that is typically not available to the user."

vs.

dynamic partitioning

"changing the number of processors at runtime"

And "One common heuristic for dynamic partitioning is to strive for equal sized partitions (usually called "equipartitioning")."
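
For intuition, equipartitioning can be sketched as a fair-share split with per-job caps. This is a generic max-min style allocation written for illustration, not the algorithm from the paper: `equipartition` is a hypothetical function, and the caps stand in for the largest partition each job can actually use.

```python
def equipartition(total_processors, max_sizes):
    """Split processors as evenly as possible, never exceeding a job's cap.

    `max_sizes` maps job name to the most processors that job can use.
    Jobs with small caps are served first so their unused share is
    redistributed among the remaining jobs.
    """
    allocation = {}
    remaining = total_processors
    order = sorted(max_sizes, key=lambda job: (max_sizes[job], job))
    for index, job in enumerate(order):
        jobs_left = len(order) - index
        fair_share = remaining // jobs_left
        allocation[job] = min(max_sizes[job], fair_share)
        remaining -= allocation[job]
    return allocation


shares = equipartition(10, {"a": 2, "b": 8, "c": 8})
```

In the example call, job `a` can only use 2 processors, so the other 8 are divided evenly between `b` and `c`.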

space slicing

When a system allows the processors to be partitioned on a job by job basis (and they note most do).

"An alternative is to use folding. With folding, the number of processors allocated to a job can only grow or shrink by factors of 2."

gang scheduling

This jumped out at me because (along with coscheduling) it's a term that cloud uses too

"An extension of preemption is the ability to preempt all the members of a parallel job at the same time, as well as restarting all the members of another job. This is called gang scheduling."

Migration

"Migration refers to the ability of a scheduler to move an executing job or some of its components to other processors."

Probably we need this same concept but for across clusters. Checkpoint and restart?

Change job execution order

"A scheduler may be able to process jobs in an order different from the job submittal order. Many batch systems have some such flexibility. Of course, this flexibility is only useful if there is some information as to the resource requirements of the waiting jobs as well as any deadlines or response time requirements."

Maybe users should be rewarded not just for providing resource requirements, but getting them right. You aren't punished if you don't (getting cookies taken away) but you don't get any new cookies.

"We mention this option since it easily leads to violation of the primary goal of a scheduler -- the execution of every job. Some aging mechanism is required to ensure that jobs are not passed over for arbitrarily long time periods."

This is an interesting idea - would people be willing to accept better resources for a longer wait time? Or the inverse? We might want to consider the idea of user policies that define these preferences - "You can move me to a less optimal CPU for my job to run faster." Or what if the time on a scheduler is like an auction - each person is given some amount of time, maybe depending on funding, etc., and they are allowed to trade, sell, etc. Could it be like a market? And reach a more optimal state? Or maybe we need new models - a race where the fastest percentile gets bonus resources, and likely they will have better cognitive health and be more effective too. Or whoever gives the sysadmins the most pizza. 🍕

Clusters of assumptions

These are 20 years old, should there be new ones? When we design this descriptive stuff, what is our cost function anyway?
