RQ2 minimal functionality and bug fix to reapi match allocate #1101

zekemorton · 2023-10-11T23:43:12Z

This MR includes changes to the reapi to support baseline rq2 functionality and some basic tests to test out rq2.

rq2 in this MR can handle all of the cli arguments, though the write_to_graph function is not yet implemented so the -g and -o flags do not yet do anything. rq2 only implements 3 commands, help, quit, and match. For match only the sub commands allocate and allocate_orelse_reserve are currently supported. This MR introduces a lot of infrastructure around rq2 and should set the stage for future development in this effort.

The test expected output was generated with resource-query and should be a good measure of whether rq2 is behaving properly.

zekemorton · 2023-10-13T21:21:38Z

@vsoch do you mind taking a look at #1098 and this commit . It changes the behavior of the return value for reapi_cli_t::match_allocate and is now causing one of the go bindings test to fail with error. The second match allocate (line 53 in main.go for testing) is now resulting in error because the match allocate fails.

What are your thoughts on this? Should reapi_cli_t::match_allocate return non 0 if traverser_run returns non 0? I want to be careful about making changes to the reapi that would affect existing code thats using it

vsoch · 2023-10-13T21:30:35Z

Sure! So that function is used in the go bindings here:

flux-sched/resource/reapi/bindings/go/src/fluxcli/reapi_cli.go

Lines 102 to 109 in aae6506

    
           fluxerr := (int)(C.reapi_cli_match_allocate((*C.struct_reapi_cli_ctx)(cli.ctx), 
        
           	(C.bool)(orelse_reserve), 
        
           	spec, 
        
           	(*C.ulong)(&jobid), 
        
           	(*C.bool)(&reserved), 
        
           	&r, 
        
           	(*C.long)(&at), 
        
           	(*C.double)(&overhead)))

. The general logic is that we expect match_allocate (and more generally, calls to flux) to return an integer return code. The function here

flux-sched/resource/reapi/bindings/go/src/fluxcli/reapi_cli.go

Lines 40 to 44 in aae6506

    
           func retvalToError(code int, message string) error { 
        
           	if code == 0 { 
        
           		return nil 
        
           	} 
        
           	return fmt.Errorf(message+" %d", code)

then turns that into a proper go error - nil if the return code is 0 and an error otherwise.

The second match allocate (line 53 in main.go for testing) is now resulting in error because the match allocate fails.

Can you link me to the failing test? Why would it be correct that match allocate fails (returns non zero?)

What are your thoughts on this? Should reapi_cli_t::match_allocate return non 0 if traverser_run returns non 0? I want to be careful about making changes to the reapi that would affect existing code thats using it

I don't know the details of traverser_run - if it fails does that mean the match allocate has failed too? In that case why would we want to hide the error from the Go bindings? Looking closer - it looks like you are adding an rc2 to the function, and returning the sum of rc1 and rc2. Can we make a small table of the possible return values given this addition, and the action (if it's considered a failure or not) for each? I'm guessing if rc is -1 and rc2 is 0 we'd get a value of -1, and maybe that doesn't indicate an actual error has occurred (and we just need to account for that case in the Go bindings)?

zekemorton · 2023-10-13T22:37:04Z

Can you link me to the failing test?

https://github.com/flux-framework/flux-sched/actions/runs/6512317565/job/17689731284?pr=1101#step:7:4178

Why would it be correct that match allocate fails (returns non zero?)

It should fail if you try a match allocate and no resources are available for that allocate. In this case, nothing is actually scheduled or reserved. In any case, whether we determine this from return codes or from some other means, it would be important to know if a match was successful or not. Note that a MATCH_ALLOCATE_ORELSE_RESERVE in this case would succeed since the job can be scheduled for the future.

don't know the details of traverser_run - if it fails does that mean the match allocate has failed too?

traverser_run is a pass through function for dfu_traverser_t::run and this function calls a schedule function. If that schedule function fails, it will return -1 I think. This function failing means that the match allocate has failed. https://github.com/flux-framework/flux-sched/blob/master/resource/traversers/dfu.cpp#L309C7-L309C7

Can we make a small table of the possible return values?

rc = 0 on success or -1 on failure and has to do more infrastructure types of things, memory error, not able to get time of day, etc. Failure indicates that the process is partially completed and sets m_err_msg with an appropriate message.
rc2 = 0 on success or -1 on failure and strictly has to do with weather or not the traverser run function was successful.
having rc =-1 is a weird failure that shouldn't ever really happen, whereas rc2 will fail if you give it an unsatisfiable request if that make sense?
Without these changes, what is now rc2 never actually gets checked and get over written for the other checks with rc.

why would we want to hide the error from the Go bindings?

I guess it depends on what you consider an error, is it an error for the function to try the match allocate and not do anything if the allocate is not possible? Or is that expected behavior and the results should be checked by some other means?

vsoch · 2023-10-14T00:07:46Z

I don't think I know enough about flux-sched to address your questions, going to ping @milroy here. I can help with any changes to the Go that are warranted when it's decided how to handle it.

zekemorton · 2023-10-31T23:53:29Z

@milroy this should be ready for a review!

As discussed, I changed the go tests to set orelse_reserve to true and updated the expected output accordingly.

jameshcorbett

LGTM generally, some small comments. And I'm sure you'll still want Dan to review.

resource/reapi/bindings/c++/reapi_cli_impl.hpp

resource/utilities/rq2.hpp

jameshcorbett · 2023-11-14T21:15:04Z

resource/utilities/rq2.cpp

+                    std::cerr << optarg << std::endl;
+                    usage (1);
+                }
+                graph_format_to_ext (format, token);


Could this be simplified a bit? It looks like you take the string, turn it into a format object, and then turn the format object back into the same string. I don't see graph_format_to_ext() and string_to_graph_format() being used elsewhere.

I believe this was part of the functionality for writing out graphs of different formats from resource query. I had started to fill in the functionality entirely, but decided to add that in later. Must not have removed it all yet. I can take this out and save it for later. Thanks for catching it!

Ahh I remember now, this does convert it from string to format object and back to string. This was beneficial in original rq because it stored the format object along the way and this only required a single string comparison. Since we don't have the functionality to use the format object, it doesn't really make a lot of sense to do it this way right now, but this does in its current form accomplish something. boost::iequals is case insensitive, so this does convert the string to lowercase.

I am happy to change it to something that's a bit more clear or keep it as is, though we will want to store that format object at some point

Approving reviews have been dismissed because this pull request
was updated.

zekemorton · 2023-12-04T23:16:13Z

After discussing the changes with @milroy, we found that changing the interface for reapi_clit_t::match_allocate has bigger implications and the return behavior should be left as is. A non 0 return code represents an error when running the code rather than a failure to find a match.
rq2 now determines if a match was found by checking errno for ENODEV, EBUSY, EINVAL, all of which indicate a failure to match.

milroy

The PR is looking really good and just requires a few minor changes.

resource/reapi/bindings/c++/reapi_cli_impl.hpp

milroy · 2023-12-07T01:56:44Z

resource/reapi/bindings/c++/reapi_cli_impl.hpp


 out:
+    rq->incr_job_counter ();


The current behavior in resource-query increments the counter if print_schedule_info is called:

flux-sched/resource/utilities/command.cpp

Line 236 in 0b70113

print_schedule_info (ctx, out, jobid,

It's executed unless:

gettimeofday fails here, or here

The match subcommand isn't recognized

The writer fails to output

The satisfiability check is performed

I think placing rq->incr_job_counter (); before out: will generate the desired behavior or enable it to be generated when new rq2 features are added:

Suggested change

rq->incr_job_counter ();

rq->incr_job_counter ();

out:

Thanks for pointing this out! I think this change will mean that we will not increment when a match is not found. I can add the increment into the block for that check though

resource/reapi/bindings/c++/reapi_cli.hpp

resource/utilities/rq2.cpp

milroy

I think the added check for the return code from rq->traverser_run will break the future implementation of match_multi.

resource/utilities/rq2.cpp

resource/reapi/bindings/c++/reapi_cli_impl.hpp

Problem: reapi_cli_t::match_allocate does not return or track whether a match was successful or not. It also sets the job info with an unset value for overhead. Check return code from traverser run to determine if a match is successful. If a match is not successful, increase the job counter and return 0. errno is set with the correct error value to be handled by the calling function. Move code blocks that set overhead value ov above blocks where job info is created.

Problem: current info function only provides a subset of fields from job_info_t. Add an overloaded info function that mutates job parameter to job_info_t value corresponding to jobid.

Problem: new resource query and potentially other tools using this api may need access to the traverser pre and post order counts. Accessors for these fields do not exist. Create accessors for these fields in reapi_cli_t and resource_query_t.

Problem: resource query does not utilize reapi. This means that new functionality must be implemented in both places and that reapi is not tested with automated ci tests and code coverage. Create a new resource query that uses reapi.

Problem: rq2 needs tests to verify that it can be a drop in replacement for rq1 Add some simple tests for current functionality. Generate the expected files using rq1

Problem: a recent fix to reapi changed the output of the golang bindings tests. This changed a value that was once static to a value that changes with every run. Use a sed command to remove this value from the output of the test. Edited tests to use orelse_reserve flag.

zekemorton · 2023-12-14T00:30:13Z

@milroy I added in the most recent changes requested, should be ready now!

milroy · 2023-12-14T00:34:26Z

Thanks for the great work! The PR looks good and is ready to merge. Go ahead and set MWP.

codecov · 2023-12-14T00:42:12Z

Codecov Report

Merging #1101 (e4f265d) into master (250eac7) will decrease coverage by 1.7%.
Report is 7 commits behind head on master.
The diff coverage is 51.2%.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #1101     +/-   ##
========================================
- Coverage    71.8%   70.1%   -1.7%     
========================================
  Files          88      90      +2     
  Lines       11562   12261    +699     
========================================
+ Hits         8309    8603    +294     
- Misses       3253    3658    +405

Files	Coverage Δ
resource/reapi/bindings/c++/reapi_cli_impl.hpp	`37.2% <64.2%> (ø)`
resource/utilities/rq2.cpp	`49.2% <49.2%> (ø)`

zekemorton requested review from milroy and jameshcorbett October 11, 2023 23:43

zekemorton force-pushed the reapi-match-allocate branch 5 times, most recently from 7de27dd to 09c7f15 Compare October 13, 2023 19:27

zekemorton force-pushed the reapi-match-allocate branch 2 times, most recently from 2b794f5 to 9b80bc3 Compare October 31, 2023 23:19

jameshcorbett previously approved these changes Nov 14, 2023

View reviewed changes

zekemorton force-pushed the reapi-match-allocate branch from 9b80bc3 to 47f7979 Compare December 4, 2023 23:12

zekemorton changed the title ~~Reapi match allocate~~ RQ2 minimal functionality and bug fix to reapi match allocate Dec 4, 2023

zekemorton force-pushed the reapi-match-allocate branch 3 times, most recently from 2a6a3ca to 5147da0 Compare December 6, 2023 17:20

milroy requested changes Dec 7, 2023

View reviewed changes

zekemorton force-pushed the reapi-match-allocate branch from 5147da0 to c374426 Compare December 7, 2023 17:48

milroy requested changes Dec 12, 2023

View reviewed changes

resource/utilities/rq2.cpp Outdated Show resolved Hide resolved

resource/utilities/rq2.cpp Outdated Show resolved Hide resolved

resource/reapi/bindings/c++/reapi_cli_impl.hpp Outdated Show resolved Hide resolved

zekemorton force-pushed the reapi-match-allocate branch 2 times, most recently from 2556828 to b87592b Compare December 14, 2023 00:11

zekemorton added 3 commits December 13, 2023 16:21

reapi: add overloaded info function to reapi_cli_t

c19ddaf

Problem: current info function only provides a subset of fields from job_info_t. Add an overloaded info function that mutates job parameter to job_info_t value corresponding to jobid.

zekemorton added 3 commits December 13, 2023 16:21

rq: rq2 structure with match allocate

6c7ca2e

Problem: resource query does not utilize reapi. This means that new functionality must be implemented in both places and that reapi is not tested with automated ci tests and code coverage. Create a new resource query that uses reapi.

tests: add tests for rq2

922b6dc

Problem: rq2 needs tests to verify that it can be a drop in replacement for rq1 Add some simple tests for current functionality. Generate the expected files using rq1

zekemorton force-pushed the reapi-match-allocate branch from b87592b to e4f265d Compare December 14, 2023 00:21

milroy approved these changes Dec 14, 2023

View reviewed changes

zekemorton added the merge-when-passing mergify.io - merge PR automatically once CI passes label Dec 14, 2023

mergify bot merged commit fe872c8 into flux-framework:master Dec 14, 2023
20 of 22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RQ2 minimal functionality and bug fix to reapi match allocate #1101

RQ2 minimal functionality and bug fix to reapi match allocate #1101

zekemorton commented Oct 11, 2023

zekemorton commented Oct 13, 2023

vsoch commented Oct 13, 2023

zekemorton commented Oct 13, 2023 •

edited

Loading

vsoch commented Oct 14, 2023

zekemorton commented Oct 31, 2023

jameshcorbett left a comment

jameshcorbett Nov 14, 2023 •

edited

Loading

zekemorton Nov 28, 2023

zekemorton Nov 29, 2023

zekemorton commented Dec 4, 2023

milroy left a comment

milroy Dec 7, 2023

zekemorton Dec 7, 2023

milroy left a comment

zekemorton commented Dec 14, 2023

milroy commented Dec 14, 2023

codecov bot commented Dec 14, 2023

RQ2 minimal functionality and bug fix to reapi match allocate #1101

RQ2 minimal functionality and bug fix to reapi match allocate #1101

Conversation

zekemorton commented Oct 11, 2023

zekemorton commented Oct 13, 2023

vsoch commented Oct 13, 2023

zekemorton commented Oct 13, 2023 • edited Loading

vsoch commented Oct 14, 2023

zekemorton commented Oct 31, 2023

jameshcorbett left a comment

Choose a reason for hiding this comment

jameshcorbett Nov 14, 2023 • edited Loading

Choose a reason for hiding this comment

zekemorton Nov 28, 2023

Choose a reason for hiding this comment

zekemorton Nov 29, 2023

Choose a reason for hiding this comment

zekemorton commented Dec 4, 2023

milroy left a comment

Choose a reason for hiding this comment

milroy Dec 7, 2023

Choose a reason for hiding this comment

zekemorton Dec 7, 2023

Choose a reason for hiding this comment

milroy left a comment

Choose a reason for hiding this comment

zekemorton commented Dec 14, 2023

milroy commented Dec 14, 2023

codecov bot commented Dec 14, 2023

Codecov Report

zekemorton commented Oct 13, 2023 •

edited

Loading

jameshcorbett Nov 14, 2023 •

edited

Loading