feature tracking: Advanced Reservations (DATs) #5201
Comments
Linked flux-framework/flux-sched#1013 above. According to @trws and @milroy, once that PR is merged we will have much of the support needed in Fluxion to schedule a DAT. |
Idea: add a new job state RESERVED between SCHED and RUN such that a job request with a special attribute could get its alloc response R from the scheduler early, in advance of the This would work for any job, including a sub-instance. An advance static R would be more susceptible to having resources go bad before the job starts. With a Flux instance, we could initially just set the quorum value to some fraction of the total and let the instance start with some non-critical nodes down. |
For clarity, is the benefit of having the RESERVED state and advance R available so that a subinstance could be configured with the eventual resources assigned to the job? Would this also have some benefit for normal jobs? Adding a new state just for that purpose feels like it could be short-sighted (though I'm probably missing the other benefits!), especially if we plan one day to support instances that can grow onto unknown resources instead of just known resources. Another thing I'll just throw out there is that there is already a way to hold a job between SCHED and RUN by issuing a |
I guess the main appeal to me is that we wouldn't need to have a separate set of tools and rules for reservations like Slurm. A job request would be sufficient to request a reservation, the existing scheduler interfaces would be sufficient to communicate the results, and existing tools could be used to view/update reservations (since they are just jobs). In a way, regular jobs are then just a degenerate case of a reservation anyway, where time in RESERVED is very short, so the plan doesn't introduce niche features that would have less testing than mainstream ones. But yeah, it builds upon the existing resource model, which is fundamentally static. However, as we add dynamic resource capability to Flux, this could grow too. For example, maybe a job could request to start as soon as an initial resource request can be fulfilled, and also hold a reservation that would be added to the job later? Maybe we could also add a way for the scheduler to modify an already allocated R, such as replacing nodes that are no longer available, and we could make that work the same for running and reserved jobs. Anyway, I'm not hard over on this idea - just throwing it out there to see if it sticks. Sounds like it's sliding down the wall a bit :-) |
No, this is sounding appealing to me, but I'm afraid I still don't follow some points:
I like this idea, but unfortunately don't have the mental capacity today to follow the reasoning. How would a reservation be requested? Would we just add a field to jobspec with an enforced start and end time, and only satisfy these requests from the instance owner? If a reservation is just a job that hasn't yet started, how would multiple jobs be submitted to a job in RESERVED state? It seems like these actions would require separate or missing tools we don't already have anyway.
Ah, this is a good point. I had missed that all jobs would go through RESERVED (I had envisioned it as a one-off state). I do like this idea.
I think this is the general case of grow/shrink we've discussed before, and it doesn't seem like a RESERVED state is necessary to make this happen (at least we've never discussed it in that way). It seems like we were headed towards using resource-update events to manage that (we can already update R using this approach). |
I didn't really say this clearly but yes, I was thinking some new jobspec attributes would be the way a job would request "reserved" resources. We already have a duration, so maybe attributes for start time and flags indicating whether start time is absolute or best effort, what to do if resources become unavailable before start time, etc.
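To make the attribute idea above concrete, here is a minimal sketch of what such a request might look like. Only `attributes.system.duration` is an existing jobspec field; the `reservation` key and everything under it (`start_time`, `start_policy`, `on_unavailable`, and the allowed values) are hypothetical names invented for illustration, not an agreed design.

```python
# Hypothetical sketch: none of the "reservation" keys exist in jobspec today.
ON_UNAVAILABLE_CHOICES = ("cancel", "replace", "start-degraded")  # assumed values


def reservation_attributes(start_time, duration, absolute=True,
                           on_unavailable="cancel"):
    """Build a jobspec-like 'attributes' dict for a reserved job.

    start_time: requested start, seconds since the epoch
    duration: run time in seconds (maps to the existing system.duration)
    absolute: True for a hard start time, False for best effort
    on_unavailable: what to do if resources go bad before start_time
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if on_unavailable not in ON_UNAVAILABLE_CHOICES:
        raise ValueError(f"unknown on_unavailable policy: {on_unavailable}")
    return {
        "system": {
            "duration": float(duration),
            # Everything below is an assumed, illustrative extension.
            "reservation": {
                "start_time": float(start_time),
                "start_policy": "absolute" if absolute else "best-effort",
                "on_unavailable": on_unavailable,
            },
        }
    }


attrs = reservation_attributes(1700000000, 3600, absolute=False)
```

Keeping the new fields under one key would let a scheduler without reservation support reject or ignore the request in a single place.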
I was thinking in that case the RESERVED job would be a subinstance, but would only accept jobs after it starts for now. Hmm, maybe that's a stronger requirement than I thought.
I just meant that jobs with reserved resource allocations could benefit in a general way from grow, not necessarily help us get there. |
Ah, I see. Forgive me, but do we need a separate state to handle this case then? For the purposes of all other tools the job would effectively be pending. I guess Flux could start a single rank instance (with the sole initial online rank excluded) to handle early job submission, but in principle that doesn't seem to require a new state. I worry that if the R is constantly evolving for a reserved allocation, then this would create a lot of traffic in the eventlog, whereas if we just keep the job in SCHED state until the allocation is granted we can just emit the actual R. I really apologize because I feel like I'm missing the piece of the design that requires a new state. I am sure it is my fault and not yours. |
If we had a "reserved" state, possibly with either soft or hard semantics, we might also be able to use that to show it has been given a prospective start time by the scheduler. This is a bit of an idle thought while I'm in an OpenMP meeting, so it might not match super well, but if we could get both a nicer interface for DATs and have a way to surface predicted starts for jobs other than the next that would make users happy. |
Is this necessary if flux-framework/flux-sched#1015 is fixed? We do already have ephemeral "annotations" that do not potentially fill the eventlog with events to communicate this kind of data which could change with each schedule update. OTOH, with the estimated starttime and resources for every job which is in the scheduler plan available, we could expose the scheduler's plan via some kind of visualization (kind of like OAR's Gantt drawing tool). Does even this, though, require a new job state? Could the planned resources for jobs be exposed in some other manner that doesn't require writing data to the KVS and an eventlog each time it changes? (Just throwing that question out there, I don't really know the answer) |
Also, would a RESERVED state also require a transition back to SCHED, e.g. if a new higher priority job is submitted, changing the schedule such that a RESERVED job no longer has any reserved resources in the current plan? |
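For reference, the transitions being debated can be sketched as a small table. This is only an illustration of the proposal: RESERVED is not an existing job state, and the event names (`reserve`, `start-time-reached`, `plan-changed`) are made up for the sketch. The last entry is the back-transition asked about above.

```python
from enum import Enum


class State(Enum):
    SCHED = "SCHED"
    RESERVED = "RESERVED"  # hypothetical state from this discussion
    RUN = "RUN"


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.SCHED, "reserve"): State.RESERVED,          # early alloc response R
    (State.RESERVED, "start-time-reached"): State.RUN,
    (State.RESERVED, "plan-changed"): State.SCHED,     # reservation dropped
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


path = [State.SCHED]
for ev in ("reserve", "plan-changed", "reserve", "start-time-reached"):
    path.append(next_state(path[-1], ev))
```

Each pass through the `plan-changed` edge would mean another state change to record, which is the eventlog traffic concern raised earlier in the thread.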
A note from the meeting: Being able to submit to a DAT/reservation before its starttime is an optional requirement for a minimum viable solution. I take that to mean we fulfill this requirement by being able to submit a job request that is guaranteed to be fulfilled at some time point in the future, with a way to launch a multi-user instance on those resources once allocated, including a way to restrict the set of users allowed to submit to that instance. Assuming this is correct, I'll update the bullet list above with some missing items. I don't think this solution requires a new job state and all the changes that would come with it? |
I'd say let's hit the reset button on this discussion and start from the requirements. IOW let's drop the idea of RESERVED state and also of "regular jobs" having reservations and see what else we can come up with. If we need those ideas we can come back to them. |
On user restrictions: only the system instance currently loads the A related question is whether we worry about proper accounting for users within that subinstance. In RFC 33 we did define an access policy, so if we didn't want to load |
Loading I'm also not sure how accounting for a subinstance would work. The subinstance jobs would not be going to the job archive or accounting archive, so we'd need some way to attribute usage, perhaps in an epilog or rc3 script when the DAT job is exiting? @ryanday36 - I assume we do currently account for jobs in DATs and reservations since Slurm only has one level of scheduling? |
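One way to picture the epilog-style attribution mentioned above is a pass over the subinstance's completed jobs that sums node-seconds per user and bank. This is a rough sketch only: the record fields (`user`, `bank`, `nnodes`, `t_start`, `t_end`) are assumed for illustration and do not correspond to a confirmed Flux or flux-accounting schema.

```python
from collections import defaultdict


def node_seconds_by_user(jobs):
    """Sum nnodes * elapsed seconds for each (user, bank) pair.

    `jobs` is an iterable of dicts with the assumed keys
    user, bank, nnodes, t_start, t_end (times in seconds).
    """
    usage = defaultdict(float)
    for job in jobs:
        # Clamp so a record with bad timestamps cannot yield negative usage.
        elapsed = max(0.0, job["t_end"] - job["t_start"])
        usage[(job["user"], job["bank"])] += job["nnodes"] * elapsed
    return dict(usage)


# Made-up example records for jobs that ran inside a DAT subinstance.
dat_jobs = [
    {"user": "alice", "bank": "projA", "nnodes": 4, "t_start": 0.0, "t_end": 100.0},
    {"user": "alice", "bank": "projA", "nnodes": 2, "t_start": 50.0, "t_end": 150.0},
    {"user": "bob", "bank": "projB", "nnodes": 1, "t_start": 10.0, "t_end": 20.0},
]
usage = node_seconds_by_user(dat_jobs)
```

Charging the DAT as a single entity would instead use the DAT job's own node count and duration, ignoring what ran inside it.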
That's correct. We do want to charge DAT usage to the users' bank(s). |
Is a DAT currently represented as a Queue, such that normal user jobs in that queue are accounted, or as a job where only that job is actually accounted that runs many job steps?
|
For reference, here is a snippet of how Slurm accounts for reservations:
|
That raises a question for me. How often do we run into a DAT that is composed of multiple banks rather than a single bank for the DAT? I admit I'd conceived of a DAT as being a charged entity in and of itself, charged at that level rather than having the usage cost fall directly on the users who submitted work to it. |
Good question @trws. And if we need to use a bank/account to control access to a DAT job, then we would need some way to create the access control list from the bank when the job is started, or extend support for the |
This is a tracking issue for an implementation of DATs. The requirements as I understand them include:
- an interface for specifying a set of resources (as a resource spec or perhaps specific resources) that are reserved for a specific user set in a specified time range. (This might be a Fluxion-only interface, since the simple scheduler has no actual schedule)
- allow the prescribed user set to submit jobs before the reservation
- Advanced reservation support flux-sched#963
- Support deferred job start time flux-sched#1013
- allow early interaction with instances that will be started as part of a future job (DAT) (optional)
- launch a multiuser instance as a job #5531
- restrict user set with access to DAT job
- new job submission utility which allows submission of DAT/reserved job with deferred start time, restricted user list, etc.
- accounting support for DAT jobs
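As a rough illustration of how the user-set and early-submission items above interact, the check below decides whether a submission to a DAT would be accepted. The function name and all parameters are hypothetical; the real mechanism might be an access policy in the subinstance rather than a standalone check.

```python
def may_submit(user, now, allowed_users, start, end, allow_early=False):
    """Return True if `user` may submit to the DAT at time `now`.

    allowed_users: set of user names permitted in the DAT
    start, end: reservation window in seconds since the epoch
    allow_early: accept submissions before `start` (the optional requirement)
    """
    if user not in allowed_users:
        return False
    if now >= end:
        return False  # the reservation is over
    return allow_early or now >= start


allowed = {"alice", "bob"}
early_ok = may_submit("alice", 50, allowed, start=100, end=200, allow_early=True)
early_denied = may_submit("alice", 50, allowed, start=100, end=200)
```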