-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't prune traversal by leaf vertex subplans #1314
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good although it would be good to make sure the test fails without the PR...
@@ -176,6 +176,11 @@ int dfu_impl_t::by_subplan (const jobmeta_t &meta, | |||
int saved_errno = errno; | |||
planner_multi_t *p = (*m_graph)[u].idata.subplans[s]; | |||
|
|||
if (!p) { | |||
// Subplan is null if u is a leaf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the subplan be NULL for any other reasons? Are there any more checks needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in person this afternoon. If it's null for any other reason, it's unintentional, but checking if the vertex is a leaf or not is an unfortunately tricky/expensive operation to use in something that gets called this much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can think of a fast way to determine if the vertex is not a leaf that can be a follow-up PR. The only place I can think of where such an error could creep in is here, and the function should exit and cause the graph initialization/re-initialization to fail:
flux-sched/resource/traversers/dfu_impl.cpp
Line 1115 in 1c1f8d6
if (!(p = subtree_plan (u, avail, types))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting for posterity that I tested pruning filter configurations that skip intermediate vertices (e.g., cluster:core, node:core
, to skip rack
vertices) and observed that the rack
vertex subplan is not null. Apparently the code creates the planners in the intermediate vertices anyway.
174bf92
to
2f9b37d
Compare
Problem: issue flux-framework#1260 reported errors related to a failure in obtaining the available resources for planner_multi. This condition occurs because leaf vertices no longer have subplan filters due to the optimization in PR flux-framework#1248. Add a check for the null subplan and exit the function without error in that case.
Problem: t3015-resource-power-jgf.t was copied to t3016-resource-power-jgf.t, but the original file is no longer used in the testsuite. Remove the duplicate and unused script.
Problem: there is no test in the testsuite for the error reported by issue flux-framework#1260. Add a resource-query test to the advanced resource tests.
After many attempts I was unsuccessful at getting the error to propagate to I think the PR is ready for review. |
(The updated test based on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for this!
Thanks for the reviews. Setting MWP. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1314 +/- ##
========================================
- Coverage 75.2% 75.2% -0.1%
========================================
Files 111 111
Lines 15983 15986 +3
========================================
- Hits 12032 12028 -4
- Misses 3951 3958 +7
|
Issue #1260 and logs from production clusters have reported match errors such as
by_subplan: planner_multi_avail_during returned -1.
for JGFs withnode
vertices with an edge to thecluster
vertex. The error occurs because PR #1248 prevents creation of subplan filters on leaf vertices, causing subsequent checks against the planner to fail.This PR adds a check to
by_subplan
to bypass pruning based on null subplans.Currently, the test case succeeds without the fix because the error isn't logged in
flux dmesg
for some reason.