-
Wow, ok. This is moderately terrifying to be honest, and I'm pretty sure srun's defaults on some systems will do the same as us, at least with recent Slurm, which makes me wonder what they do. Faking out the resources is a good idea for a stopgap.

Maybe we could offer a way to set the default exclusivity to off, or have a "flux launch" or "flux run --oversubscribe-me-horribly" option that lets you just jam out jobs and all it does is MPI bootstrap? I think it partly depends on the full use case: is this a single node or multiple, and do we have to support the full run options? If we have to support everything, a null scheduler or alloc-bypass would probably be least costly, I guess. Best, in terms of giving results I think people would actually want, might be a scheduler that will oversubscribe but try to spread load.

Oddly, something semi-related came up not too long ago: the thought of making it so Flux could actually provide a gnu-make compatible jobserver port so you could, for example, let a build be elastic. Or, more related to this case, ctest actually lets you specify how much hardware each test consumes, so you can specify the width of tasks; if we supported that, and the current PR on CMake to support an outside jobserver lands, we could directly provide backpressure to ctest that would apply to the entire instance. I think that would be amazing, but probably more work than it's worth except maybe as a "hey this might be fun" hackathon kind of thing.
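For anyone following along, the alloc-bypass route is already possible by hand today, just clunky. A rough sketch from memory (the plugin name comes from this thread; the `system.alloc-bypass.R` attribute and pulling the instance's R from the `resource.R` KVS key are my recollection and should be double-checked against the jobtap plugin docs before relying on this):

```sh
# One-time setup in the instance where the tests run (instance owner only):
flux jobtap load alloc-bypass.so

# Hand a job the instance's full R so it skips scheduler allocation entirely.
# Attribute name and R source are from memory -- verify against the docs.
R=$(flux kvs get resource.R)
flux run -N1 -n4 --setattr=system.alloc-bypass.R="$R" ./my_mpi_test
```

Making that a single option is basically what's being asked for here.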
-
I'm the user here - this use case comes from running a 3k+ test regression suite on Trilinos, and also wanting to run test suites on our bigger apps. Quite a bit of our internal testing uses the resource manager to grab a node, then we run unbound, so the idea of just bootstrapping MPI is essentially what we already do. On systems where that doesn't work, it's usually a mess. Often, teams just run the tests one at a time - which can take hours. ATS-2 was and still is a major headache here.

The general premise is: you are running regression tests that are "light", so you really can get away with the maximum number of processes the GPU (or CPU) can handle. It's a spread of 1 and 4 MPI ranks usually - but most tests are MPI based (even if run with NP=1). I'm guessing other labs have similar issues; I see it from Trilinos, EMPIRE, SPARC and SIERRA. They can spend an enormous amount of time testing (which is good), but I'm not sure any resource manager does this well - probably because we are doing the opposite of what the manager was intended to do (isolate / share resources).

On a CPU system you can get away with a lot, because you can run unbound and the OS scheduler will bounce your processes around any core/thread it chooses. GPUs are not so lenient. Typically, if you try to play the CPU game with GPUs, you end up with all of your processes using GPU number zero - or you have to code some wonky stuff into Kokkos to randomize GPU selection.

I shared with Mark on Flux's Mattermost an example of things I've done on LLNL's EAS machines - I'm able to get decent concurrency, but to do it I need to wrap flux and wrap the process being launched (a rough sketch of the idea is below).

I'd be happy to organize an email chain or a short Teams stand-up to have various users chime in on this. Off the top of my head, @vbrunini - Victor is a SPARC developer.
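The actual example didn't paste cleanly here, but the general shape of the wrapper is roughly this - a minimal sketch, not my exact script, assuming the Flux shell exports `FLUX_TASK_LOCAL_ID` per task and that the GPU count is known:

```sh
#!/bin/sh
# gpu_wrap.sh -- give each local task its own GPU, round-robin.
# Usage: flux run -N1 -n4 ./gpu_wrap.sh ./my_test
# Assumes FLUX_TASK_LOCAL_ID is set per task and NGPUS matches the node.

NGPUS=${NGPUS:-4}
local_id=${FLUX_TASK_LOCAL_ID:-0}
export CUDA_VISIBLE_DEVICES=$(( local_id % NGPUS ))   # ROCR_VISIBLE_DEVICES on AMD
exec "$@"
```

Run a pile of those under `ctest -j` and the concurrency comes from ctest, while each process still lands on a distinct GPU.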
-
Trilinos is exploring/is capable of using the CTest resource allocation feature (combined with Kokkos and CUDA_VISIBLE_DEVICES) to do GPU placement. CTest allows scheduling/usage of external resources (such as GPUs) and will load them evenly if we set it up this way. SIERRA has code built into its test harness that does exactly the same thing as CTest's resource manager with respect to GPU placement. We haven't gotten to AMD GPUs yet, but it sounds like that would be the same approach @jjellio uses with ROCR_VISIBLE_DEVICES. Hopefully that helps at least a little bit? Just sharing the context I'm familiar with.
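For reference, the way the CTest resource feature surfaces its assignments is through `CTEST_RESOURCE_GROUP_*` environment variables, which a wrapper (or the test itself) translates into CUDA_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES. A minimal sketch of such a wrapper, assuming each test requests a single "gpus" resource group:

```sh
#!/bin/sh
# ctest_gpu_wrap.sh -- translate CTest's resource-group environment
# variables into CUDA_VISIBLE_DEVICES. Assumes one resource group of
# type "gpus", e.g. CTEST_RESOURCE_GROUP_0_GPUS="id:2,slots:1".

if [ -n "$CTEST_RESOURCE_GROUP_COUNT" ]; then
    gpu_id=$(printf '%s' "$CTEST_RESOURCE_GROUP_0_GPUS" | sed 's/.*id:\([^,;]*\).*/\1/')
    export CUDA_VISIBLE_DEVICES="$gpu_id"
fi
exec "$@"
```

On the CMake side each test also needs a RESOURCE_GROUPS property (e.g. "gpus:1"), and ctest needs a resource spec file describing the node's GPUs; the CTest resource allocation docs cover both.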
-
I had an interesting conversation with some tool folks at the MPI Forum last week that relates to this. They were looking for a way to reliably run a tool alongside a job, in ways that sometimes mean they need to run on all the same nodes, even if those nodes are logically full. This is for things like debuggers, or tracing tools that one might want to attach after the job they're attaching to has been allocated and launched.

I bring that up because if we add a mode or method where, for a job or for an instance, we ignore oversubscription with existing jobs (probably something only the instance owner can do), it might actually solve both use cases.
-
A user has a CTest-based testsuite for which they run a series of parallel jobs and MPI tests under `ctest -j <big_number>`. On Flux systems the tests use `flux run` to invoke tests, but when `-j` is much larger than the number of cores on a node, or when tests are running many tasks, a number of the `flux run` invocations hang (presumably waiting for resources, but there may be other contention here because the jobs reportedly do not proceed?).

In any event, the expected behavior in these situations is apparently to control the parallelism with `ctest -j` and not block any Flux jobs due to resource allocation. For example, this is how it worked previously when MPI is launched by `mpirun` and/or Slurm, which do not allocate resources for job steps. Additionally, the default in this situation should be to disable cpu and gpu affinity, since the assumption is that each job invoked by ctest will have access to all resources on a node.

So the real request here is for a tool, option, mode, or even just a guide for how to do something like this with Flux.

For now I've suggested to the user that they can run their testsuite under `flux start -s 16`, for example, to get 16x the number of apparent cores, or similarly if using `flux alloc` or `flux batch` use the `-o per-resource.node=16` option (sketched below).

However, I figured others might have better ideas (could we make the `alloc-bypass.so` plugin easier to use with a single option, or devise a scheduler or scheduler option that allocates resources repeatedly to jobs?).

I thought @trws might have some ideas here because this is similar to things we've discussed before.