
Don't prune traversal by leaf vertex subplans #1314

Merged: 3 commits merged into flux-framework:master from the issue-1260 branch, Nov 7, 2024

Conversation

@milroy (Member) commented Nov 5, 2024

Issue #1260 and logs from production clusters have reported match errors such as "by_subplan: planner_multi_avail_during returned -1" for JGFs that contain node vertices with an edge directly to the cluster vertex. The error occurs because PR #1248 prevents creation of subplan filters on leaf vertices, causing subsequent checks against the planner to fail.

This PR adds a check to by_subplan that skips subplan-based pruning when the subplan is null.

Currently, the test case succeeds without the fix because the error isn't logged in flux dmesg for some reason.

@jameshcorbett (Member) left a comment:

Looks good although it would be good to make sure the test fails without the PR...

@@ -176,6 +176,11 @@ int dfu_impl_t::by_subplan (const jobmeta_t &meta,
    int saved_errno = errno;
    planner_multi_t *p = (*m_graph)[u].idata.subplans[s];

    if (!p) {
        // Subplan is null if u is a leaf.
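For context, here is a self-contained sketch of the shape of this guard. It is illustrative only: by_subplan_sketch, its parameter list, and the goto-based exit are assumptions, not the exact merged code.

```cpp
// Illustrative sketch only -- not flux-sched's actual by_subplan.
// It shows the pattern the PR adds: when there is no subplan
// (e.g., for a leaf vertex after the PR #1248 optimization), skip
// subplan-based pruning instead of treating the null pointer as an error.
#include <cerrno>

struct planner_multi_t;  // opaque stand-in for the real planner type

// Returns 0 to let the traversal continue, -1 on a genuine error.
int by_subplan_sketch (planner_multi_t *p)
{
    int saved_errno = errno;
    int rc = 0;

    if (!p) {
        // A null subplan means the vertex is a leaf: nothing to prune against.
        goto done;
    }

    // ... pruning checks against the planner would go here ...

done:
    errno = saved_errno;
    return rc;
}
```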
Member (review comment on the diff):

Could the subplan be NULL for any other reasons? Are there any more checks needed here?

Member (review reply):

We discussed this in person this afternoon. If it's null for any other reason, it's unintentional, but checking if the vertex is a leaf or not is an unfortunately tricky/expensive operation to use in something that gets called this much.
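To make that cost concrete, here is a minimal sketch, using a generic Boost Graph setup rather than flux-sched's real vertex and edge types, of what a per-call leaf test would involve:

```cpp
// Illustrative only: edge_prop_t and is_leaf_in() are hypothetical stand-ins,
// not flux-sched types. The point is that deciding "is u a leaf?" means
// scanning u's out-edges on every call, extra work in a routine that runs
// for every vertex visited during a match.
#include <boost/graph/adjacency_list.hpp>
#include <string>

struct edge_prop_t {
    std::string subsystem;  // stand-in for the real edge metadata
};

using graph_t = boost::adjacency_list<boost::vecS, boost::vecS,
                                      boost::directedS,
                                      boost::no_property, edge_prop_t>;
using vtx_t = boost::graph_traits<graph_t>::vertex_descriptor;

// True if u has no out-edge in the named subsystem (i.e., u is a leaf there).
bool is_leaf_in (const graph_t &g, vtx_t u, const std::string &subsystem)
{
    auto [ei, ei_end] = boost::out_edges (u, g);
    for (; ei != ei_end; ++ei) {
        if (g[*ei].subsystem == subsystem)
            return false;  // found a child edge in this subsystem
    }
    return true;  // O(out-degree) scan on a hot path
}
```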

Member Author (@milroy):

If we can think of a fast way to determine if the vertex is not a leaf that can be a follow-up PR. The only place I can think of where such an error could creep in is here, and the function should exit and cause the graph initialization/re-initialization to fail:

if (!(p = subtree_plan (u, avail, types))) {

Member Author (@milroy):

Noting for posterity that I tested pruning filter configurations that skip intermediate vertices (e.g., cluster:core, node:core, to skip rack vertices) and observed that the rack vertex subplan is not null. Apparently the code creates the planners in the intermediate vertices anyway.
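For reference, the kind of configuration being described can be exercised through resource-query's prune-filters option; the file names below are hypothetical and the exact flags used in the actual experiment may differ.

```sh
# Hypothetical invocation: create pruning filters only at the cluster and
# node levels (so rack vertices get no filter of their own), then load a
# JGF and run match commands against it.
resource-query --load-file=cluster.json --load-format=jgf \
               --prune-filters=cluster:core,node:core \
               < match_cmds.in
```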

@milroy force-pushed the issue-1260 branch 2 times, most recently from 174bf92 to 2f9b37d, November 6, 2024 03:45
Commit messages:

Problem: issue flux-framework#1260
reported errors related to a failure in obtaining the available
resources for planner_multi. This condition occurs because leaf
vertices no longer have subplan filters due to the optimization in PR
flux-framework#1248.

Add a check for the null subplan and exit the function without error in
that case.

Problem: t3015-resource-power-jgf.t was copied to
t3016-resource-power-jgf.t, but the original file is no longer used in
the testsuite.

Remove the duplicate and unused script.

Problem: there is no test in the testsuite for the error reported by
issue flux-framework#1260.

Add a resource-query test to the advanced resource tests.
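A rough sharness-style sketch of what such a test could look like follows; all paths, file names, and the expected output are hypothetical, and the actual test added by this PR may be structured differently.

```sh
#!/bin/sh
test_description='issue 1260: match against a JGF whose node vertices attach to the cluster vertex'

. $(dirname $0)/sharness.sh

# Hypothetical inputs: the JGF that triggered
# "by_subplan: planner_multi_avail_during returned -1", plus a command
# script and its expected resource-query output.
jgf="${SHARNESS_TEST_SRCDIR}/data/resource/jgfs/issue1260.json"
cmds="${SHARNESS_TEST_SRCDIR}/data/resource/commands/issue1260/cmds01.in"
exp="${SHARNESS_TEST_SRCDIR}/data/resource/expected/issue1260/cmds01.out"
query="../../resource/utilities/resource-query"

test_expect_success 'match allocate succeeds on node-to-cluster JGF' '
    ${query} -L ${jgf} -f jgf -t out01.R.out < ${cmds} &&
    test_cmp out01.R.out ${exp}
'

test_done
```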
@milroy (Member Author) commented Nov 6, 2024

After many attempts I was unsuccessful at getting the error to propagate to flux dmesg. I updated the testsuite test to use resource-query instead. Let me know if I should keep poking at the resource module-based test.

I think the PR is ready for review.

@milroy requested review from trws and jameshcorbett, November 6, 2024 08:34
@milroy (Member Author) commented Nov 6, 2024

I updated the testsuite test to use resource-query instead.

(The updated test based on resource-query fails without the change in this PR and succeeds with it.)

@jameshcorbett (Member) left a comment:

LGTM thanks for this!

@milroy (Member Author) commented Nov 7, 2024

Thanks for the reviews. Setting MWP.

@milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes), Nov 7, 2024
The mergify bot merged commit 8183356 into flux-framework:master, Nov 7, 2024
21 checks passed
@milroy deleted the issue-1260 branch, November 7, 2024 00:03

codecov bot commented Nov 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.2%. Comparing base (1c1f8d6) to head (ee57490).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1314     +/-   ##
========================================
- Coverage    75.2%   75.2%   -0.1%     
========================================
  Files         111     111             
  Lines       15983   15986      +3     
========================================
- Hits        12032   12028      -4     
- Misses       3951    3958      +7     
Files with missing lines              Coverage Δ
resource/traversers/dfu_impl.cpp      83.6% <100.0%> (-0.4%) ⬇️

... and 1 file with indirect coverage changes

Labels: merge-when-passing (mergify.io - merge PR automatically once CI passes)

Successfully merging this pull request may close these issues.

Error when matching rabbits: by_subplan: planner_multi_avail_during returned -1
3 participants