
job-list: support "hostlist" constraint to allow jobs to be filtered by nodes #5656

Merged
merged 6 commits into flux-framework:master on Jun 11, 2024

Conversation

@chu11 (Member) commented Jan 5, 2024

TSIA.

Maybe the only interesting side note is that until we solve #5367 (which should probably follow this one up) we don't yet have a friendly user interface for this. The tests send constraint objects straight to the job-list module.

@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch from 2ea42eb to cb9e7b5 on January 5, 2024 06:14
@grondo (Contributor) commented Jan 8, 2024

I wonder if we should limit the size of the hostlist allowed in a constraint query for job-list. This probably goes for job constraints too. I was basically able to hang job-list with a query like the one below, where no jobs ran on nodes >= pi4:

{
  "or": [
    {
      "and": [
        {
          "states": [
            62
          ]
        },
        {
          "hostlist": [
            "pi[4-1000000]"
          ]
        }
      ]
    },
    {
      "hostlist": [
        "pi[4-10000000]"
      ]
    }
  ]
}

Just add more "or" operands to get a longer hang...
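For context, a rough sketch (assumed names, not this PR's code) of a naive hostlist match: it walks every host in the constraint and probes the job's nodelist for each, so a range like pi[4-1000000] implies on the order of a million lookups per job per constraint operand.

/* Rough sketch with assumed names (not this PR's code): return 1 if any
 * host in the constraint hostlist appears in the job's nodelist. */
#include <flux/hostlist.h>

static int match_any_host (struct hostlist *constraint,
                           struct hostlist *job_nodelist)
{
    const char *host = hostlist_first (constraint);
    while (host) {
        /* hostlist_find() returns the host's index, or -1 if not found */
        if (hostlist_find (job_nodelist, host) >= 0)
            return 1;
        host = hostlist_next (constraint);
    }
    return 0;
}

With several such operands OR'd together, the work multiplies, which is the hang described above.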

@chu11 (Member, author) commented Jan 9, 2024

I wonder if we should limit the size of hostlist allowed in a constraint query

Good catch! Hmmm, not sure what the limit should be ... 1024 should be more than enough?

This probably goes for job constraints too

Yeah, b/c one could make a semi-infinitely sized one, although I'm unsure how to limit the size. Let me create an issue for that.

@grondo (Contributor) commented Jan 9, 2024

1024 might be a bit small on a cluster with 10K nodes. Something dynamic might work if job-list can get the maximum expected instance size. However, this doesn't prevent a DoS via a complex query (e.g. or host:foo[1-100] x1000). There should be some way to break out of a match operation that is taking too long... 🤔
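One hedged sketch of such an escape hatch (assumed names and cap; the diff later in this PR does something analogous via inc_check_comparison()) is to count comparisons during a match and fail once a cap is exceeded.

/* Sketch with assumed names: bound the work a single match may do by
 * counting comparisons and failing once an (illustrative) cap is hit. */
#include <errno.h>

#define MAX_COMPARISONS 1000000  /* illustrative cap, not the PR's value */

static int check_comparison_limit (unsigned int *comparisons)
{
    if (++(*comparisons) > MAX_COMPARISONS) {
        errno = EOVERFLOW;  /* caller reports "query too complex" */
        return -1;
    }
    return 0;
}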

@chu11 (Member, author) commented Jan 9, 2024

Something dynamic might work if job-list can get the maximum expected instance size.

Hmmm, I guess it wouldn't be too hard to keep a running track of the largest node count seen so far.

However, this doesn't prevent DoS from submitting a complex query

Yeah, I brought this up in #5669; perhaps we need some hard caps just to avoid DoS attempts.

@chu11 (Member, author) commented Jan 15, 2024

Something dynamic might work if job-list can get the maximum expected instance size.

Was looking into this and going back and forth on the pros and cons of various approaches:

  • calling sched.resource-status to get the nodelist count
  • getting the number of brokers via flux_get_size()
  • keeping a running max of the node count in the nodelist across all jobs

The problems with some of the above:

  • the sched module may not be loaded; then what do you do?
  • the number of brokers is a good upper bound but not accurate (i.e. multiple brokers per node), but this is probably OK
  • it assumes the cluster size stays the same, which I'm not sure is accurate to assume (e.g. it's common to shrink a cluster as hardware dies off towards EOL ... and there's all the stuff the kubernetes people are doing with cluster sizes)
  • keeping tabs on a running max number of nodes for all jobs is OK for everything in memory, but not when there's a job db
    • this maybe could be solved with a side table alongside the job db for "max counts" and stuff like that ... although I'd prefer not to do that

Sooo, I'm not sure ... I'm beginning to wonder if a heuristic might be best. Like the number of brokers OR the max job count, whichever is bigger? Then the max is some multiple of it? That's a lot of work for this. It also makes me wonder if we should just hard code some max instead.

Edit: hmmm, here's a compromise idea: the number of brokers or some defined N, whichever is larger? That way there's always some decent minimum allowed, like 1024 or something.

@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch 2 times, most recently from cc950de to 8596055 on January 16, 2024 07:13
@chu11 (Member, author) commented Jan 16, 2024

Re-pushed using the following idea to limit hostlist constraint sizes: the limit is the instance size (i.e. number of brokers) or 1024, whichever is bigger. The 1024 minimum gives the constraint some decent floor, in the event the size of the cluster has shrunk over time.
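A minimal sketch of that limit computation (assumed names; flux_get_size() is the call that returns the broker count):

/* Sketch with assumed names: the hostlist constraint limit is the larger
 * of the instance size (broker count) and a fixed 1024 floor. */
#include <flux/core.h>

#define MIN_HOSTLIST_LIMIT 1024

static int get_hostlist_limit (flux_t *h, uint32_t *limitp)
{
    uint32_t size;
    if (flux_get_size (h, &size) < 0)   /* number of brokers in the instance */
        return -1;
    *limitp = size > MIN_HOSTLIST_LIMIT ? size : MIN_HOSTLIST_LIMIT;
    return 0;
}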

good idea? bad idea?

@grondo (Contributor) commented Jan 16, 2024

the limit is instance size (ie number of brokers) or 1024, whatever is bigger. the 1024 minimum is to give the constraint some decent minimum

This seems reasonable to me.

Another idea I had would be for job-list to keep a single hostlist (really a set) of all hosts encountered for all jobs. When a hostlist constraint is encountered, the job-list module could take the intersection of this set and the "all hosts" set to limit the constraint hostlist to only those hosts which could possibly be matched. A similar process could be used to optimize matching of ranks as well.

The drawback is extra work when processing every job to maintain the set of possible hosts, which may not end up being worth it. 🤔
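A rough sketch of that intersection (assumed names; not a concrete proposal) would prune the constraint hostlist down to hosts at least one job has actually used:

/* Sketch with assumed names: keep only the constraint hosts that appear
 * in the set of all hosts seen across jobs, so matching never iterates
 * over hosts that no job's nodelist could contain. */
#include <flux/hostlist.h>

static struct hostlist *prune_constraint (struct hostlist *constraint,
                                          struct hostlist *all_seen_hosts)
{
    struct hostlist *pruned;
    const char *host;

    if (!(pruned = hostlist_create ()))
        return NULL;
    host = hostlist_first (constraint);
    while (host) {
        if (hostlist_find (all_seen_hosts, host) >= 0
            && hostlist_append (pruned, host) < 0) {
            hostlist_destroy (pruned);
            return NULL;
        }
        host = hostlist_next (constraint);
    }
    return pruned;
}

The extra work mentioned above would be appending each new job's hosts into the all-hosts set as jobs are processed.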

@chu11 (Member, author) commented Jan 16, 2024

Another idea I had would be for job-list to keep a single hostlist (really a set) of all hosts encountered for all jobs. When a hostlist constraint is encountered, the job-list module could take the intersection of this set and the "all hosts" set to limit the constraint hostlist to only those hosts which could possibly be matched. A similar process could be used to optimize matching of ranks as well.

In the current implementation this would work, but it'd immediately become a problem if we have older job data stored in a database, which we wouldn't scan upon an instance restart. So I was trying to avoid doing any optimization based on what we've seen so far.

As mentioned above, I guess this could be solved with a side table in the database for extra stuff (like "max hosts" or "hosts we've seen so far"). Hmmm, let me think about this more. Maybe doing that in the DB would be inevitable for optimization purposes.

Or as you say ... maybe it's not worth the energy to do this optimization.

@grondo (Contributor) commented Jan 16, 2024

Just curious, how are you going to match on hosts for older jobs in a database? Maybe the optimization won't be needed in that case if the hosts are indexed or something. (I haven't seen an implementation, so it's difficult to reason about.) I assume a good database will have already handled the query optimization and "DoS" problem...

@chu11 (Member, author) commented Feb 1, 2024

re-pushed, updating for conflicts, now based on #5681

@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch 2 times, most recently from 1010788 to c80c4c6 on February 2, 2024 00:33
@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch from c80c4c6 to 27f8b5a on March 20, 2024 17:11
@grondo (Contributor) left a review comment

On a quick first pass, I found some places where error.text is not set on error, and thus garbage would be returned in the error response.

Comment on lines 441 to 442
if (inc_check_comparison (c->mctx, comparisons, errp) < 0)
return -1;
@grondo (Contributor):
The count of comparisons is incremented here and also for each host. Doesn't that mean one host comparison is counted twice?

@chu11 (Member, author):
I think my logic was that there was a comparison simply to see whether we should bother to check hosts (i.e. if the job didn't run, we don't check anything, and that should count as at least one comparison). But now that you mention it, perhaps that is not the most sensible way to think about it. I'll remove the first check.

match_hostlist,
wrap_hostlist_destroy,
errp)))
return NULL;
@grondo (Contributor):
errp->text not set here.

@chu11 (Member, author), Apr 16, 2024:
I think this is OK; errp is set in the call to list_constraint_new().

@grondo (Contributor):
Oh yeah, but looks like it is missing in the EINVAL path.

@grondo (Contributor):
Oh, nevermind, I think I was looking at the wrong function in the diff.


/* Create a single hostlist if user specifies multiple nodes or
* RFC29 hostlist range */
if (!(hl = hostlist_create ()))
goto error;
@grondo (Contributor):
errp->text not set here.

}
if (!zlistx_add_end (c->values, hl)) {
hostlist_destroy (hl);
goto error;
@grondo (Contributor):
errp->text not set here.

@@ -743,6 +831,28 @@ struct match_ctx *match_ctx_create (flux_t *h)
goto error;
}

if (flux_get_size (mctx->h, &mctx->max_hostlist) < 0)
goto error;
@grondo (Contributor):
errp->text not set here.

@chu11 (Member, author):
I think you mean I forgot a flux_log in this case, but correct!!

@grondo (Contributor):
Oh, heh. Yeah I must have been blindly checking goto error without errprintf() 🤦

@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch 3 times, most recently from 19ff734 to eb1fcf5 on April 17, 2024 05:21
@chu11 (Member, author) commented Apr 17, 2024

re-pushed with tweaks per comments above

@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch from eb1fcf5 to e1fef43 on April 19, 2024 17:37
@chu11 (Member, author) commented Apr 19, 2024

rebased now that #5681 is merged

@grondo (Contributor) left a review comment

I've tested this one several times as part of development of #5711 and LGTM!

@grondo added this to the flux-core-0.64.0 milestone on Jun 5, 2024
@chu11 (Member, author) commented Jun 11, 2024

@Mergifyio rebase

Problem: In the near future it'd be convenient to do calculations
on the job nodelist, but it often needs to be in a hostlist
data structure for processing.  We'd like to avoid converting the
nodelist to hostlist struct over and over again.

Add a field into the job_data struct to hold a cached version of the
nodelist in a hostlist struct.

Problem: It would be convenient to filter jobs based on the nodes
they ran on.

Add a constraint operator "hostlist" to filter on nodes within the job
nodelist.  Multiple nodes can be specified.  Hostlists represented in
RFC29 format are acceptable for input to the constraint.

Fixes flux-framework#4186

Problem: There are no unit tests for the new 'hostlist' constraint
operator.

Add tests in match and state_match tests.

Problem: In t2260-job-list.t, some test jobs are submitted using
"flux job submit".  This makes it inconvenient to
set job requirements on those jobs, such as specific nodes
the jobs should run on.

Solution: Convert all uses of "flux job submit" to use "flux submit".

Problem: In the near future we would like to filter jobs based
on nodes in a job's nodelist.  This would be an issue in the current
test setup because test brokers all run under the same host and
it is unknown which nodes jobs actually ran on.

Solution: Setup fake hostnames for the test brokers in
t2260-job-list.t and when necessary, run test jobs on specific
hosts.

Problem: There is no testing / coverage for the new hostlist constraint
in the job-list module.

Add some tests to t2260-job-list.t.
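As an aside on the RFC 29 hostlist format mentioned in the commit messages above: flux-core's hostlist API can expand such ranges. A small illustrative example (not part of this PR), assuming the installed <flux/hostlist.h> header:

#include <stdio.h>
#include <flux/hostlist.h>

int main (void)
{
    struct hostlist *hl;
    const char *host;

    /* hostlist_decode() expands an RFC 29 string like "pi[4-7]" */
    if (!(hl = hostlist_decode ("pi[4-7]")))
        return 1;
    host = hostlist_first (hl);
    while (host) {
        printf ("%s\n", host);  /* prints pi4, pi5, pi6, pi7 */
        host = hostlist_next (hl);
    }
    hostlist_destroy (hl);
    return 0;
}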

mergify bot commented Jun 11, 2024

rebase

✅ Branch has been successfully rebased

@chu11 force-pushed the issue4186_flux_jobs_filter_nodes branch from e1fef43 to d381bf3 on June 11, 2024 15:23
@chu11 (Member, author) commented Jun 11, 2024

Oops, missed that this had been approved last week. Rebased and setting MWP. Thanks.

@mergify bot merged commit 6462184 into flux-framework:master on Jun 11, 2024
34 of 35 checks passed
@chu11 deleted the issue4186_flux_jobs_filter_nodes branch on June 11, 2024 18:22
trws pushed a commit to trws/flux-core that referenced this pull request on Jun 14, 2024:

job-list: support "hostlist" constraint to allow jobs to be filtered by nodes

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 78.18182% with 12 lines in your changes missing coverage. Please review.

Project coverage is 83.29%. Comparing base (67fc412) to head (d381bf3).
Report is 451 commits behind head on master.

Files with missing lines          Patch %   Lines
src/modules/job-list/match.c      76.92%    12 Missing ⚠️

Additional details and impacted files:
@@            Coverage Diff             @@
##           master    #5656      +/-   ##
==========================================
- Coverage   83.30%   83.29%   -0.01%     
==========================================
  Files         519      519              
  Lines       83680    83734      +54     
==========================================
+ Hits        69707    69747      +40     
- Misses      13973    13987      +14     
Files with missing lines              Coverage Δ
src/modules/job-list/job_data.c       93.64% <100.00%> (+0.02%) ⬆️
src/modules/job-list/state_match.c    92.81% <100.00%> (+0.03%) ⬆️
src/modules/job-list/match.c          89.08% <76.92%> (-1.81%) ⬇️

... and 9 files with indirect coverage changes
