plugin: enforce max resource limits per-association #562
base: master
Conversation
@ryanday36 when you get the chance, could you give this PR a high-level look and let me know if this looks good from your side? I think I'd mostly be curious whether the way that the plugin is enforcing the resource limits looks good to you.
Force-pushed from ece2f0f to 2864dd6.
This looks like a good approach to me @cmoussa1. I'm not sure I understand where …
Thanks @ryanday36. The …
Force-pushed from d3dccb4 to 265229c, then from 265229c to 594c2e2.
I've just made a quick first pass and have some comments and suggestions.
I'll have to study the code further and perhaps experiment before I can do a full review.
src/plugins/mf_priority.cpp (outdated)
```cpp
if (jj_get_counts_json (held_job_jobspec, &counts) < 0) {
    flux_jobtap_raise_exception (p,
                                 FLUX_JOBTAP_CURRENT_JOB,
                                 "mf_priority",
                                 0,
                                 "job.state.inactive: unable to " \
                                 "count resources from jobspec " \
                                 "of held job");
    goto error;
}
int held_job_nnodes = counts.nnodes;
int held_job_ncores = counts.nslots * counts.slot_size;
```
Hm, it might be good to consider saving the estimated nnodes/ncores for held jobs in the map somehow (sorry, I haven't digested the code enough to actually suggest how to do that) instead of reparsing the jobspec.

Also, a side note: does the code currently only check whether the held job at the front of the list can be released based on current limits? If there are more held jobs, should they also be checked to see if they can be released? E.g. if a user is over a max-nodes limit of 10 and a 2-node job becomes inactive, and the held job at the front of the list needs 5 nodes while the next job needs only 2, should the second one be released?

In anticipation of this, perhaps abstract the check of whether a held job is eligible for release into a separate function?

(Hope that makes sense.)
That's a good thought. I might have to think about how best to do that - perhaps I could extend the existing map structure to store the resource counts of the held job as well. Maybe by creating a class for held jobs that has `jobid`, `nnodes`, and `ncores` members. I'll think about that - thanks!
> Does the code currently only check to see if the held job at the front of the list can be released based on current limits?

Yes, this was a deliberate choice: only the held job at the front of the list (i.e. the first job that was held due to a flux-accounting dependency) is checked for eligibility to be released. I chose this to avoid turning the priority plugin into its own "mini scheduler" by considering multiple jobs. I could be open to changing this, though, to consider multiple jobs for release. What do you think? Maybe @ryanday36 has thoughts on this too.
> Maybe by creating a class for held jobs that has a jobid, nnodes, and ncores members. I'll think about that - thanks!

One benefit of extracting some of the code into a class is that you could then move it to its own source file and create a unit test based on it.
Cool!! Yeah, I'll play around with this and see if I can clean this up (particularly the resource counting part) by moving this into its own class.
I think I was able to get somewhere here. I can instead create a `Job` class that has properties like its `id` and its `nnodes` and `ncores` counts, which it retrieves in `job.state.depend`, as well as any flux-accounting-specific dependencies added to it. Then the plugin does not even need to unpack the held job's jobspec a second time in `job.state.inactive`; it can just retrieve the job's resources from its entry in a map and use them for comparison. Much cleaner!
OK, I've just force-pushed a rework to the way dependencies are added and removed in `job.state.depend` and `job.state.inactive` to make use of a newly added `Job` class - notably, not having to count the resources for a held job a second time. The internal map that the plugin uses to keep track of held jobs uses this `Job` class as the value of the map to avoid having to do a second lookup. The callback for `job.state.inactive` is hopefully a little cleaner now as a result.

In the near future, I think I could definitely rework this to convert the `held_jobs` vector in the `Association` class to just be this same map (that way, the plugin only needs to keep track of held jobs in one place), but that will probably be enough work to warrant its own PR.
```sh
# Scenario 2:
# The association submits enough jobs to take up the max-nodes limit, but they
# still have one core available to use, so they can submit a job that takes up
# the last core and it will run right away.
```
I may not be understanding this comment, but from the PR description (and, I think, the code), a job should be held if it would cause the association to go over the max-nodes OR max-cores limit? This is because a job requesting only nodes may account for just 1 core per node, but will actually use nnodes * cores_per_node.

Similarly, the `jj_get_counts()` code will return `nnodes=0` for any jobspec that requests only cores, so if a job requests, say, 100000 cores, it could be let through on the node-limit check if jobs aren't held when they exceed either limit.

Is this comment correct, and if so, can you explain this scenario?
Yes, a job should be held if it causes the association to go over either limit. In the plugin, I am just relying on what the `jj_get_counts ()` function returns, and while it is not a perfect approximation, it at least serves as an estimate.

I think before I started working on this, I talked to @ryanday36 about this exact scenario you mentioned (a job requesting a bunch of cores without specifying nodes and being able to pass through), and from what I remember of that conversation, this case is pretty rare in practice. He should definitely comment and clarify in case I am misremembering, though.

The second scenario in the sharness test is mostly just showing that the plugin keeps track of exactly what `jj_get_counts ()` returns and only enforces limits based on the resources it counts from that function; essentially, a job could be submitted that specifies an exact resource and can perhaps "get around" a set limit by only specifying nodes or only specifying cores. I'm sorry if I misunderstood your question or if I'm not making sense.
> Yes, a job should be held if it causes the association to go over either limit.

> a job could be submitted that specifies an exact resource and can perhaps "get around" a set limit by only specifying nodes or only specifying cores. I'm sorry if I misunderstood your question or if I'm not making sense
Ok, I think I understand now. If a job submits a jobspec that doesn't ask for nodes, then `jj_get_counts ()` reports `nnodes=0` (which is actually impossible, since a job cannot run on 0 nodes), so the nnodes limit is not exceeded, and neither is the ncores limit since the association is still below that limit.

I wonder if it would be more practical to set `nnodes = 1` when `jj_get_counts()` doesn't detect an explicit node request?
That might make sense. Please correct me if I am misunderstanding, but would setting `nnodes = 1` still be accurate on a system that is not node-exclusively scheduled? Maybe it applies in both cases (node-exclusive scheduling and non-node-exclusive scheduling)?
Good question! I'd ask @ryanday36 for clarification. Even on a core-scheduled system, a new job is guaranteed to use at least 1 node; however, it may not use a *new* node when it is actually scheduled, so this is kind of a tough one!

This is where, to really support these limits properly, it might have to be the scheduler that applies them. Only the scheduler knows, when it allocates resources for a non-exclusive job, whether the job will use a new node and add to the association's node count. Or maybe the node limit just doesn't make sense on core-scheduled clusters?
```sh
test_debug "jq -S . <query.json" &&
jq -e ".mf_priority_map[] | select(.userid == 5001) | .banks[0].held_jobs | length == 0" <query.json
'
```
Nice exhaustive testing BTW!
Problem: The priority plugin needs a way to keep track of held jobs due to either max-running-jobs or max-resource limits across an association's set of jobs. Create a new Job class to be utilized by the priority plugin in order to keep track of a held job's list of dependencies and its requested resource counts. Add this class to its own source file and compile it along with the rest of the priority plugin code.
Force-pushed from 9f05be8 to 78c577b.
Problem: The priority plugin tracks resource usage across an association's set of running jobs, but it does not enforce any limits on this resource usage in the event where an association goes over their limit.

Add a check in job.state.depend to see if the job's resources would put the association over *either* their max_nodes or max_cores limit. If so, add a dependency on the job and hold it. In job.state.inactive, rework the check for held jobs so that a held job is only fully released to be scheduled once all three conditions hold:

1. the association is under their max-running-jobs limit
2. the job would not put the association over their max-nodes limit
3. the job would not put the association over their max-cores limit

Keep track of which held jobs have which dependencies in a map within the plugin, where the key of the map is the ID of a held job and the value is a Job object that stores the node/core counts for the job as well as any dependencies associated with the job. The dependencies added to the held job are removed as the above conditions are satisfied. Once all dependencies are removed from the job, it can be released by the plugin to be scheduled to run, and the key-value pair is removed from the internal map that keeps track of held jobs.
Problem: t1005 and t1012 create JSON payloads to be sent to the plugin, but these payloads don't contain max_cores or max_nodes information per-association. Add this information to the payloads sent to the plugin in t1005 and t1012.
Problem: The flux-accounting testsuite doesn't have any tests for checking dependencies on jobs when an association has hit their max resource limits. Add some basic tests.
Force-pushed from 78c577b to a3133e3.
Problem

The priority plugin tracks resource usage across an association's set of running jobs, but it does not enforce any limits on this resource usage in the event where an association goes over their limit.

This PR adds limit enforcement on an association's set of running jobs according to their resource usage, specifically `ncores` and `nnodes`. In `job.state.depend`, the jobspec for a submitted job is unpacked and its job size is calculated. If the job would put the association over either their `max_cores` or `max_nodes` limit, a dependency is added to the job and it is held until a currently running job completes and frees the appropriate amount of resources.

Since the priority plugin is now enforcing two different kinds of limits (`max-running-jobs` and `max-resources`), it needs a way to keep track of which dependencies a job has, since it can now have one or both. A map is added to the plugin where the key is the ID of a held job and the value is a list of dependencies associated with the job. A number of helper functions are added to search for held jobs' dependencies and remove them from the internal map when the job no longer has any dependencies on it.

In `job.state.inactive`, the process of releasing a held job is reworked to check that the job satisfies all requirements before it is released. It is first checked that the association is under their `max-running-jobs` limit, and then that the held job would not put the association over either their `max_cores` or `max_nodes` limit. The dependencies for the job are removed as they are satisfied, i.e. if a job would satisfy the `max-running-jobs` dependency but not the `max-resources` dependency, only the `max-running-jobs` dependency is removed and the job continues to be held. Once all requirements are met, the remaining dependencies are removed from the job and it can be released to run.

To avoid becoming its own mini scheduler, the priority plugin only looks at the first held job for the association. So, if an association has two held jobs, only the first held job will be considered for release.
I've expanded the test file that originally just tracked resources across an association's set of jobs to also test adding and removing dependencies. Different scenarios are walked through to make sure an association's jobs have the correct dependencies added and removed, such as a job receiving only a `max-running-jobs` or `max-resources` dependency, having both dependencies added at the same time, having only one of the two dependencies removed from a held job, etc.