Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RQ2 minimal functionality and bug fix to reapi match allocate #1101

Merged
merged 6 commits into from
Dec 14, 2023

Conversation

zekemorton
Copy link
Collaborator

This MR includes changes to the reapi to support baseline rq2 functionality and some basic tests to test out rq2.

rq2 in this MR can handle all of the cli arguments, though the write_to_graph function is not yet implemented so the -g and -o flags do not yet do anything. rq2 only implements 3 commands, help, quit, and match. For match only the sub commands allocate and allocate_orelse_reserve are currently supported. This MR introduces a lot of infrastructure around rq2 and should set the stage for future development in this effort.

The test expected output was generated with resource-query and should be a good measure of whether rq2 is behaving properly.

@zekemorton zekemorton force-pushed the reapi-match-allocate branch 5 times, most recently from 7de27dd to 09c7f15 Compare October 13, 2023 19:27
@zekemorton
Copy link
Collaborator Author

@vsoch do you mind taking a look at #1098 and this commit . It changes the behavior of the return value for reapi_cli_t::match_allocate and is now causing one of the go bindings test to fail with error. The second match allocate (line 53 in main.go for testing) is now resulting in error because the match allocate fails.

What are your thoughts on this? Should reapi_cli_t::match_allocate return non 0 if traverser_run returns non 0? I want to be careful about making changes to the reapi that would affect existing code thats using it

@vsoch
Copy link
Member

vsoch commented Oct 13, 2023

Sure! So that function is used in the go bindings here:

fluxerr := (int)(C.reapi_cli_match_allocate((*C.struct_reapi_cli_ctx)(cli.ctx),
(C.bool)(orelse_reserve),
spec,
(*C.ulong)(&jobid),
(*C.bool)(&reserved),
&r,
(*C.long)(&at),
(*C.double)(&overhead)))
. The general logic is that we expect match_allocate (and more generally, calls to flux) to return an integer return code. The function here
func retvalToError(code int, message string) error {
if code == 0 {
return nil
}
return fmt.Errorf(message+" %d", code)
then turns that into a proper go error - nil if the return code is 0 and an error otherwise.

The second match allocate (line 53 in main.go for testing) is now resulting in error because the match allocate fails.

Can you link me to the failing test? Why would it be correct that match allocate fails (returns non zero?)

What are your thoughts on this? Should reapi_cli_t::match_allocate return non 0 if traverser_run returns non 0? I want to be careful about making changes to the reapi that would affect existing code thats using it

I don't know the details of traverser_run - if it fails does that mean the match allocate has failed too? In that case why would we want to hide the error from the Go bindings? Looking closer - it looks like you are adding an rc2 to the function, and returning the sum of rc1 and rc2. Can we make a small table of the possible return values given this addition, and the action (if it's considered a failure or not) for each? I'm guessing if rc is -1 and rc2 is 0 we'd get a value of -1, and maybe that doesn't indicate an actual error has occurred (and we just need to account for that case in the Go bindings)?

@zekemorton
Copy link
Collaborator Author

zekemorton commented Oct 13, 2023

Can you link me to the failing test?

https://github.com/flux-framework/flux-sched/actions/runs/6512317565/job/17689731284?pr=1101#step:7:4178

Why would it be correct that match allocate fails (returns non zero?)

It should fail if you try a match allocate and no resources are available for that allocate. In this case, nothing is actually scheduled or reserved. In any case, whether we determine this from return codes or from some other means, it would be important to know if a match was successful or not. Note that a MATCH_ALLOCATE_ORELSE_RESERVE in this case would succeed since the job can be scheduled for the future.

don't know the details of traverser_run - if it fails does that mean the match allocate has failed too?

traverser_run is a pass through function for dfu_traverser_t::run and this function calls a schedule function. If that schedule function fails, it will return -1 I think. This function failing means that the match allocate has failed. https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu.cpp#L309C7-L309C7

Can we make a small table of the possible return values?

rc = 0 on success or -1 on failure and has to do more infrastructure types of things, memory error, not able to get time of day, etc. Failure indicates that the process is partially completed and sets m_err_msg with an appropriate message.
rc2 = 0 on success or -1 on failure and strictly has to do with weather or not the traverser run function was successful.
having rc =-1 is a weird failure that shouldn't ever really happen, whereas rc2 will fail if you give it an unsatisfiable request if that make sense?
Without these changes, what is now rc2 never actually gets checked and get over written for the other checks with rc.

why would we want to hide the error from the Go bindings?

I guess it depends on what you consider an error, is it an error for the function to try the match allocate and not do anything if the allocate is not possible? Or is that expected behavior and the results should be checked by some other means?

@vsoch
Copy link
Member

vsoch commented Oct 14, 2023

I don't think I know enough about flux-sched to address your questions, going to ping @milroy here. I can help with any changes to the Go that are warranted when it's decided how to handle it.

@zekemorton zekemorton force-pushed the reapi-match-allocate branch 2 times, most recently from 2b794f5 to 9b80bc3 Compare October 31, 2023 23:19
@zekemorton
Copy link
Collaborator Author

@milroy this should be ready for a review!

As discussed, I changed the go tests to set orelse_reserve to true and updated the expected output accordingly.

jameshcorbett
jameshcorbett previously approved these changes Nov 14, 2023
Copy link
Member

@jameshcorbett jameshcorbett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally, some small comments. And I'm sure you'll still want Dan to review.

resource/utilities/rq2.hpp Outdated Show resolved Hide resolved
std::cerr << optarg << std::endl;
usage (1);
}
graph_format_to_ext (format, token);
Copy link
Member

@jameshcorbett jameshcorbett Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be simplified a bit? It looks like you take the string, turn it into a format object, and then turn the format object back into the same string. I don't see graph_format_to_ext() and string_to_graph_format() being used elsewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was part of the functionality for writing out graphs of different formats from resource query. I had started to fill in the functionality entirely, but decided to add that in later. Must not have removed it all yet. I can take this out and save it for later. Thanks for catching it!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I remember now, this does convert it from string to format object and back to string. This was beneficial in original rq because it stored the format object along the way and this only required a single string comparison. Since we don't have the functionality to use the format object, it doesn't really make a lot of sense to do it this way right now, but this does in its current form accomplish something. boost::iequals is case insensitive, so this does convert the string to lowercase.

I am happy to change it to something that's a bit more clear or keep it as is, though we will want to store that format object at some point

@zekemorton zekemorton force-pushed the reapi-match-allocate branch from 9b80bc3 to 47f7979 Compare December 4, 2023 23:12
@mergify mergify bot dismissed jameshcorbett’s stale review December 4, 2023 23:12

Approving reviews have been dismissed because this pull request
was updated.

@zekemorton
Copy link
Collaborator Author

After discussing the changes with @milroy, we found that changing the interface for reapi_clit_t::match_allocate has bigger implications and the return behavior should be left as is. A non 0 return code represents an error when running the code rather than a failure to find a match.
rq2 now determines if a match was found by checking errno for ENODEV, EBUSY, EINVAL, all of which indicate a failure to match.

@zekemorton zekemorton changed the title Reapi match allocate RQ2 minimal functionality and bug fix to reapi match allocate Dec 4, 2023
@zekemorton zekemorton force-pushed the reapi-match-allocate branch 3 times, most recently from 2a6a3ca to 5147da0 Compare December 6, 2023 17:20
Copy link
Member

@milroy milroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is looking really good and just requires a few minor changes.


out:
rq->incr_job_counter ();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior in resource-query increments the counter if print_schedule_info is called:

print_schedule_info (ctx, out, jobid,
It's executed unless:

  1. gettimeofday fails here, or here
  2. The match subcommand isn't recognized
  3. The writer fails to output
  4. The satisfiability check is performed

I think placing rq->incr_job_counter (); before out: will generate the desired behavior or enable it to be generated when new rq2 features are added:

Suggested change
rq->incr_job_counter ();
rq->incr_job_counter ();
out:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! I think this change will mean that we will not increment when a match is not found. I can add the increment into the block for that check though

resource/reapi/bindings/c++/reapi_cli.hpp Show resolved Hide resolved
resource/utilities/rq2.cpp Outdated Show resolved Hide resolved
@zekemorton zekemorton force-pushed the reapi-match-allocate branch from 5147da0 to c374426 Compare December 7, 2023 17:48
Copy link
Member

@milroy milroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the added check for the return code from rq->traverser_run will break the future implementation of match_multi.

resource/utilities/rq2.cpp Outdated Show resolved Hide resolved
resource/utilities/rq2.cpp Outdated Show resolved Hide resolved
resource/reapi/bindings/c++/reapi_cli_impl.hpp Outdated Show resolved Hide resolved
@zekemorton zekemorton force-pushed the reapi-match-allocate branch 2 times, most recently from 2556828 to b87592b Compare December 14, 2023 00:11
Problem: reapi_cli_t::match_allocate does not return or track
whether a match was successful or not. It also sets the job info
with an unset value for overhead.

Check return code from traverser run to determine if a match is
successful. If a match is not successful, increase the job
counter and return 0. errno is set with the correct error value
to be handled by the calling function.

Move code blocks that set overhead value ov above blocks where
job info is created.
Problem: current info function only provides a subset of fields
from job_info_t.

Add an overloaded info function that mutates job parameter to
job_info_t value corresponding to jobid.
Problem: new resource query and potentially other tools using this
api may need access to the traverser pre and post order counts.
Accessors for these fields do not exist.

Create accessors for these fields in reapi_cli_t and
resource_query_t.
Problem: resource query does not utilize reapi. This means that
new functionality must be implemented in both places and that
reapi is not tested with automated ci tests and code coverage.

Create a new resource query that uses reapi.
Problem: rq2 needs tests to verify that it can be a drop in
replacement for rq1

Add some simple tests for current functionality. Generate the
expected files using rq1
Problem: a recent fix to reapi changed the output of the golang
bindings tests. This changed a value that was once static to a
value that changes with every run.

Use a sed command to remove this value from the output of the test.
Edited tests to use orelse_reserve flag.
@zekemorton
Copy link
Collaborator Author

@milroy I added in the most recent changes requested, should be ready now!

@milroy
Copy link
Member

milroy commented Dec 14, 2023

Thanks for the great work! The PR looks good and is ready to merge. Go ahead and set MWP.

@zekemorton zekemorton added the merge-when-passing mergify.io - merge PR automatically once CI passes label Dec 14, 2023
@mergify mergify bot merged commit fe872c8 into flux-framework:master Dec 14, 2023
20 of 22 checks passed
Copy link

codecov bot commented Dec 14, 2023

Codecov Report

Merging #1101 (e4f265d) into master (250eac7) will decrease coverage by 1.7%.
Report is 7 commits behind head on master.
The diff coverage is 51.2%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1101     +/-   ##
========================================
- Coverage    71.8%   70.1%   -1.7%     
========================================
  Files          88      90      +2     
  Lines       11562   12261    +699     
========================================
+ Hits         8309    8603    +294     
- Misses       3253    3658    +405     
Files Coverage Δ
resource/reapi/bindings/c++/reapi_cli_impl.hpp 37.2% <64.2%> (ø)
resource/utilities/rq2.cpp 49.2% <49.2%> (ø)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants