
plugin: enforce max resource limits per-association #562

Open · cmoussa1 wants to merge 4 commits into master from enforce.resource.limits
Conversation

@cmoussa1 (Member) commented on Jan 15, 2025:

Problem

The priority plugin tracks resource usage across an association's set of running jobs, but it does not enforce any limits on this resource usage when an association exceeds its limit.


This PR adds limit enforcement on an association's set of running jobs based on their resource usage, specifically ncores and nnodes. In job.state.depend, the jobspec for a submitted job is unpacked and its size is calculated. If the job would put the association over either its max_cores or max_nodes limit, a dependency is added to the job and it is held until a currently running job completes and frees enough resources.
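To make the flow concrete, here is a minimal sketch of that job.state.depend check (not the PR's actual code; the Association fields, the lookup_association() helper, and the "max-resources" dependency name are assumptions based on the description above):

```cpp
#include <flux/core.h>
#include <flux/jobtap.h>
#include <jansson.h>
#include "jj.h"  // jj_get_counts_json () and struct jj_counts; header location assumed

// Hypothetical association bookkeeping; the real plugin's field names may differ.
struct Association {
    int cur_run_jobs, max_run_jobs;   // running-job counts (used in a later sketch)
    int cur_nodes, max_nodes;
    int cur_cores, max_cores;
};

// Placeholder for however the plugin maps the submitting user/bank to its
// Association (purely illustrative).
Association *lookup_association (flux_plugin_t *p, flux_plugin_arg_t *args);

static int depend_cb (flux_plugin_t *p,
                      const char *topic,
                      flux_plugin_arg_t *args,
                      void *arg)
{
    flux_jobid_t id;
    json_t *jobspec = NULL;
    struct jj_counts counts;

    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:I, s:o}",
                                "id", &id,
                                "jobspec", &jobspec) < 0)
        return -1;
    // count the job's requested resources from its jobspec
    if (jj_get_counts_json (jobspec, &counts) < 0)
        return -1;
    int nnodes = counts.nnodes;
    int ncores = counts.nslots * counts.slot_size;

    Association *a = lookup_association (p, args);
    if (a->cur_nodes + nnodes > a->max_nodes
        || a->cur_cores + ncores > a->max_cores) {
        // hold the job until enough resources are freed by a completing job
        if (flux_jobtap_dependency_add (p, id, "max-resources") < 0)
            return -1;
        // the held job and its counts are also recorded internally
        // (see the held-jobs map sketched below)
    }
    return 0;
}
```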

Since the priority plugin is now enforcing two different kinds of limits (max-running-jobs and max-resources), it needs a way to keep track of which dependencies a job has, since a job can now have one or both. A map is added to the plugin where the key is the ID of a held job and the value is a list of dependencies associated with the job. A number of helper functions are added to look up a held job's dependencies and to remove the job from the internal map once it no longer has any dependencies on it.
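A rough sketch of that bookkeeping, with illustrative identifiers rather than the PR's exact ones:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>
#include <flux/core.h>  // flux_jobid_t

// held job ID -> flux-accounting dependencies currently on that job
static std::map<flux_jobid_t, std::vector<std::string>> held_jobs;

// record a dependency for a held job, creating its entry if needed
static void add_held_dependency (flux_jobid_t id, const std::string &dep)
{
    held_jobs[id].push_back (dep);
}

// drop one dependency; erase the job's entry once no dependencies remain
static void remove_held_dependency (flux_jobid_t id, const std::string &dep)
{
    auto it = held_jobs.find (id);
    if (it == held_jobs.end ())
        return;
    auto &deps = it->second;
    deps.erase (std::remove (deps.begin (), deps.end (), dep), deps.end ());
    if (deps.empty ())
        held_jobs.erase (it);
}
```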

In job.state.inactive, the process of releasing a held job is reworked to check whether the job satisfies all requirements for release. The plugin first checks that the association is under its max-running-jobs limit and then checks that the held job would not put the association over either its max_cores or max_nodes limit. The dependencies for the job are removed as they are satisfied, i.e. if a job satisfies the max-running-jobs condition but not the max-resources condition, only the max-running-jobs dependency is removed and the job continues to be held. Once all requirements are met, the remaining dependencies are removed from the job and it can be released to run.
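Building on the Association struct and held-jobs map sketched above, the release check in job.state.inactive might look roughly like this (again a sketch: the field names, dependency names, and helper split are assumptions, not the PR's code):

```cpp
// Release check for the association's first held job, run from
// job.state.inactive after the plugin's usage counters have been updated.
static int try_release_held_job (flux_plugin_t *p,
                                 Association *a,
                                 flux_jobid_t held_id,
                                 int held_nnodes,
                                 int held_ncores)
{
    auto it = held_jobs.find (held_id);
    if (it == held_jobs.end ())
        return 0;
    // work on a copy; remove_held_dependency () may erase the map entry
    std::vector<std::string> deps = it->second;
    auto has_dep = [&deps] (const char *d) {
        return std::find (deps.begin (), deps.end (), d) != deps.end ();
    };

    // condition 1: the association is back under its max-running-jobs limit
    if (has_dep ("max-running-jobs") && a->cur_run_jobs < a->max_run_jobs) {
        if (flux_jobtap_dependency_remove (p, held_id, "max-running-jobs") < 0)
            return -1;
        remove_held_dependency (held_id, "max-running-jobs");
    }

    // conditions 2 and 3: the held job now fits under both resource limits
    if (has_dep ("max-resources")
        && a->cur_nodes + held_nnodes <= a->max_nodes
        && a->cur_cores + held_ncores <= a->max_cores) {
        if (flux_jobtap_dependency_remove (p, held_id, "max-resources") < 0)
            return -1;
        remove_held_dependency (held_id, "max-resources");
    }

    // once the job has no flux-accounting dependencies left, the job manager
    // can release it to be scheduled
    return 0;
}
```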

To avoid becoming its own mini scheduler, the priority plugin will only look at the first held job for the association. So, if an association has two held jobs, only the first held job will be considered for release.

I've expanded the test file that originally just tracked resources across an association's set of jobs to also test adding and removing dependencies. Different scenarios are walked through to make sure dependencies are added to and removed from an association's jobs correctly, such as a job having only a max-running-jobs or max-resources dependency added to it, having both dependencies added at the same time, having only one of the two dependencies removed from a held job, etc.

@cmoussa1 (Member, Author):

@ryanday36 when you get the chance, could you give this PR a high-level look and let me know if this looks good from your side? I think I'd mostly be curious if the way that the plugin is enforcing the resource limits looks good to you.

@cmoussa1 force-pushed the enforce.resource.limits branch from ece2f0f to 2864dd6 on January 17, 2025 00:23
@ryanday36 (Contributor):

This looks like a good approach to me @cmoussa1. I'm not sure I understand where b->curr_nodes and b->curr_cores get updated. Is that already happening when jobs start/end, or is that still to come?

@cmoussa1 (Member, Author):

Thanks @ryanday36. The cur_nodes and cur_cores for a job will get updated when the job enters RUN state and INACTIVE state. Sorry that wasn't made immediately clear - I added that support in #561.
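For context, the bookkeeping added in #561 amounts to something like the following sketch (field and method names assumed; the actual plugin code may differ):

```cpp
// The association's current resource counts go up when one of its jobs
// enters RUN state and back down when the job reaches INACTIVE state.
struct AssociationUsage {
    int cur_nodes = 0;
    int cur_cores = 0;

    void job_started (int nnodes, int ncores)
    {
        cur_nodes += nnodes;
        cur_cores += ncores;
    }

    void job_finished (int nnodes, int ncores)
    {
        cur_nodes -= nnodes;
        cur_cores -= ncores;
    }
};
```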

@cmoussa1 force-pushed the enforce.resource.limits branch 2 times, most recently from d3dccb4 to 265229c on January 17, 2025 19:28
@cmoussa1 changed the title from "[WIP] plugin: enforce max resource limits per-association" to "plugin: enforce max resource limits per-association" on Jan 18, 2025
@cmoussa1 marked this pull request as ready for review on January 18, 2025 00:51
@cmoussa1 force-pushed the enforce.resource.limits branch from 265229c to 594c2e2 on January 21, 2025 17:25
@cmoussa1 requested a review from grondo on January 21, 2025 17:37
@grondo (Contributor) left a comment:

I've just made a quick first pass and have some comments and suggestions.

I'll have to study the code further and perhaps experiment before I can do a full review.

Comment on lines 1359 to 1370
if (jj_get_counts_json (held_job_jobspec, &counts) < 0) {
flux_jobtap_raise_exception (p,
FLUX_JOBTAP_CURRENT_JOB,
"mf_priority",
0,
"job.state.inactive: unable to " \
"count resources from jobspec " \
"of held job");
goto error;
}
int held_job_nnodes = counts.nnodes;
int held_job_ncores = counts.nslots * counts.slot_size;
Contributor:

Hm, it might be good to consider saving the estimated nnodes/ncores for held jobs in the map somehow (sorry I haven't digested the code enough to actually suggest how to do that) instead of reparsing the jobspec.

Also, side note: does the code currently only check to see if the held job at the front of the list can be released based on current limits? If there are more held jobs, should they also be checked to see if they can be released? E.g. if a user is at their max nodes limit of 10 and a 2-node job becomes inactive, if the held job at the front of the list is 5 nodes and the next job is only 2 nodes, should the second one be released?

In anticipation of this, perhaps abstract the check if a held job is eligible for release into a separate function?

(Hope that makes sense.)

Member Author:

That's a good thought. I might have to think about how best to do that - perhaps I could extend the existing map structure to store the resource counts of the held job as well. Maybe by creating a class for held jobs that has jobid, nnodes, and ncores members. I'll think about that - thanks!

> Does the code currently only check to see if the held job at the front of the list can be released based on current limits?

Yes, this was a deliberate choice to only check whether the held job at the front of the list (i.e. the first job that was held due to a flux-accounting dependency) is eligible to be released. I chose this to avoid turning the priority plugin into its own "mini scheduler" by considering multiple jobs. I could be open to changing this, though, to consider multiple jobs for release. What do you think? Maybe @ryanday36 has thoughts on this too.

Contributor:

> Maybe by creating a class for held jobs that has jobid, nnodes, and ncores members. I'll think about that - thanks!

One benefit of extracting some of the code into a class is that you could then move it to its own source file and create a unit test based on that.

Member Author:

Cool!! Yeah, I'll play around with this and see if I can clean this up (particularly the resource counting part) by moving this into its own class.

Member Author:

I think I was able to get somewhere here. I can instead create a Job class that has properties like its id and its nnodes and ncores counts that it retrieves in job.state.depend, as well as any flux-accounting-specific dependencies added to it. Then the plugin does not even need to unpack the held job's jobspec a second time in job.state.inactive; it can just retrieve the job's resources from its entry in a map and use it for comparison. Much cleaner!
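One possible shape for such a Job class, as a sketch only (the member names are assumptions, not necessarily what the PR ends up with): resources are counted once in job.state.depend and reused later instead of reparsing the jobspec.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>
#include <flux/core.h>  // flux_jobid_t

class Job {
public:
    flux_jobid_t id = 0;
    int nnodes = 0;                 // counted once from the jobspec in job.state.depend
    int ncores = 0;                 // e.g. nslots * slot_size
    std::vector<std::string> deps;  // flux-accounting dependencies on this job

    bool has_dep (const std::string &d) const
    {
        return std::find (deps.begin (), deps.end (), d) != deps.end ();
    }

    void remove_dep (const std::string &d)
    {
        deps.erase (std::remove (deps.begin (), deps.end (), d), deps.end ());
    }
};

// held job ID -> Job, replacing the earlier jobid -> dependency-list map so
// job.state.inactive can reuse the stored counts instead of reparsing jobspec
static std::map<flux_jobid_t, Job> held_jobs;
```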

Member Author:

OK, I've just force-pushed up a rework to the way dependencies are added and removed in job.state.depend and job.state.inactive to make use of a newly-added Job class - notably, not having to count the resources for a held job a second time. The internal map that the plugin uses to keep track of held jobs uses this Job class as the value of the map to avoid having to do a second lookup. The callback for job.state.inactive is hopefully a little cleaner now as a result.

In the near future, I think I could definitely rework this to convert the held_jobs vector in the Association class to just be this same map (and this way, the plugin only needs to keep track of held jobs in one place), but that will probably be enough work to warrant its own PR.

Comment on lines +148 to +151
# Scenario 2:
# The association submits enough jobs to take up the max-nodes limit, but they
# still have one core available to use, so they can submit a job that takes up
# the last core and it will run right away.
Contributor:

I may not be understanding this comment, but from the PR description and I think the code, a job should be held if it would cause the association to go over the max nodes OR max cores limit? This is because a job requesting nodes only may only account for 1 core per node, but will actually use nnodes*cores_per_node.

Similarly, the jj_get_counts() code will return nnodes=0 for any jobspec that requests only cores, so if a job requests, say, 100000 cores, it could be let through on the nodes check unless jobs are held when they exceed either limit.

Is this comment correct, and if so, can you explain this scenario?

Member Author:

Yes, a job should be held if it causes the association to go over either limit. In the plugin, I am just relying on what the jj_get_counts () function returns, and while it is not a perfect approximation, it is trying to serve at least as an estimate.

Before I started working on this, I had talked to @ryanday36 about this exact scenario that you mentioned (a job requesting a bunch of cores without specifying nodes and being able to pass through), and what I remember from that conversation is that this case is pretty rare in practice. He should definitely comment and clarify in case I am misremembering, though.

The second scenario in the sharness test is mostly just showing that the plugin keeps track of exactly what jj_get_counts () returns and only enforces limits based on the resources it counts from that function; essentially, a job could be submitted that specifies an exact resource and could perhaps "get around" a set limit by only specifying nodes or only specifying cores. I'm sorry if I misunderstood your question or if I'm not making sense.

Contributor:

> Yes, a job should be held if it causes the association to go over either limit.

> a job could be submitted that specifies an exact resource and could perhaps "get around" a set limit by only specifying nodes or only specifying cores.

Ok, I think I understand now. If a user submits a jobspec that doesn't ask for nodes, then jj_get_counts () reports nnodes=0 (which is actually impossible, since a job cannot run on 0 nodes), so the nnodes limit is not exceeded and neither is the ncores limit, since the association is still below that limit.

I wonder if it would be more practical to set nnodes = 1 when jj_get_counts() doesn't detect an explicit node request?
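For illustration, the suggested fallback could be as small as the following sketch (not committed code; the helper name is made up):

```cpp
#include "jj.h"  // struct jj_counts; header location assumed

// Count a cores-only jobspec as occupying at least one node for limit
// accounting, since jj_get_counts () reports nnodes = 0 in that case.
static int estimate_nnodes (const struct jj_counts *counts)
{
    return counts->nnodes > 0 ? counts->nnodes : 1;
}
```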

Member Author:

That might make sense. Please correct me if I am misunderstanding, but would setting nnodes = 1 still be accurate on a system that is not node-exclusive scheduled? Maybe it applies in both cases (node-exclusive scheduling and non-node-exclusive scheduling)?

Contributor:

Good question! I'd ask @ryanday36 for clarification. Even on a core-scheduled system, a new job is guaranteed to use at least 1 node; however, it may not use a new node when it is actually scheduled, so this is kind of a tough one!

This is where, to really support these limits properly, it might have to be the scheduler that applies them. Only the scheduler knows, when it allocates resources for a non-exclusive job, whether the job will use a new node and add to the association's node count. Or maybe the node limit just doesn't make sense on core-scheduled clusters?

test_debug "jq -S . <query.json" &&
jq -e ".mf_priority_map[] | select(.userid == 5001) | .banks[0].held_jobs | length == 0" <query.json
'

Contributor:

Nice exhaustive testing BTW!

Problem: The priority plugin needs a way to keep track of held jobs due
to either max-running-jobs or max-resource limits across an
association's set of jobs.

Create a new Job class to be utilized by the priority plugin in order to
keep track of a held job's list of dependencies and its requested
resource counts. Add this class to its own source file and compile it
along with the rest of the priority plugin code.
@cmoussa1 force-pushed the enforce.resource.limits branch 4 times, most recently from 9f05be8 to 78c577b on January 24, 2025 22:10
Problem: The priority plugin tracks resource usage across an
association's set of running jobs, but it does not enforce any limits
on this resource usage in the event where an association goes over
their limit.

Add a check in job.state.depend to see if the job's resources would put
the association over *either* their max_nodes or max_cores limit. If
so, add a dependency on the job and hold it.

In job.state.inactive, rework the check to see if there are any held
jobs for the association to also ensure the held job would satisfy
all three conditions:

1) the association is under their max-running-jobs limit
2) the job would not put the association over their max nodes limit
3) the job would not put the association over their max cores limit

before fully releasing the job to be scheduled to run. Keep track of
which held jobs have which dependencies in a map within the plugin,
where the key of the map is the ID of a held job and the value is a
Job object that stores the node/core counts for the job as well as any
dependencies associated with the job.

The dependencies added to the held job are removed as the above
conditions are satisfied. Once all dependencies are removed from the
job, it can be released by the plugin to be scheduled to run, and the
key-value pair is removed from the internal map that keeps track of
held jobs.
Problem: t1005 and t1012 create JSON payloads to be sent to the plugin,
but these payloads don't contain max_cores or max_nodes information
per-association.

Add this information to the payloads sent to the plugin in t1005 and
t1012.
Problem: The flux-accounting testsuite doesn't have any tests for
checking dependencies on jobs when an association has hit their max
resource limits.

Add some basic tests.
@cmoussa1 force-pushed the enforce.resource.limits branch from 78c577b to a3133e3 on January 24, 2025 22:19
Labels: new feature, plugin (related to the multi-factor priority plugin)
Project status: In progress