-
Wow, ok. This is moderately terrifying to be honest, and I'm pretty sure srun's defaults on some systems will do the same as us, at least with recent Slurm, which makes me wonder what they do. Faking out the resources is a good idea for a stopgap.

Maybe we could offer a way to set the default exclusivity to off, or have a "flux launch" or "flux run --oversubscribe-me-horribly" option that lets you just jam out jobs and all it does is MPI bootstrap? I think it partly depends on the full use case: is this a single node or multiple, and do we have to support the full run options? If we have to support everything, a null scheduler or alloc-bypass would probably be least costly, I guess. Best, in terms of giving results I think people would actually want, might be a scheduler that will oversubscribe but try to spread load.

Oddly, something semi-related came up not too long ago: the thought of making it so Flux could actually provide a gnu-make compatible jobserver port so you could, for example, let a build be elastic. Or, more related to this case, ctest actually lets you specify how much hardware each test consumes, so you can specify the width of tasks; if we supported that, and the current PR on CMake to support an outside jobserver lands, we could directly provide backpressure to ctest that would apply to the entire instance. I think that would be amazing, but probably more work than it's worth except maybe as a "hey this might be fun" hackathon kind of thing.
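For anyone following along, the alloc-bypass route is already possible by hand today, just clunky. A rough sketch from memory (the plugin name comes from this thread; the `system.alloc-bypass.R` attribute and pulling the instance's R from the `resource.R` KVS key are my recollection and should be double-checked against the jobtap plugin docs before relying on this):

```sh
# One-time setup in the instance where the tests run (instance owner only):
flux jobtap load alloc-bypass.so

# Hand a job the instance's full R so it skips scheduler allocation entirely.
# Attribute name and R source are from memory -- verify against the docs.
R=$(flux kvs get resource.R)
flux run -N1 -n4 --setattr=system.alloc-bypass.R="$R" ./my_mpi_test
```

Making that a single option is basically what's being asked for here.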
-
I'm the user here - this use case comes from running a 3k+ test regression suite on Trilinos, and also wanting to run test suites on our bigger apps. Quite a bit of our internal testing uses the resource manager to grab a node, then we run unbound, so the idea of just bootstrapping MPI is essentially what we already do. On systems where that doesn't work, it's usually a mess. Often, teams just run the tests one at a time - which can take hours. ATS-2 was and still is a major headache here.

The general premise is: you are running regression tests that are "light", so you really can get away with the maximum number of processes the GPU (or CPU) can handle. It's a spread of 1 and 4 MPI ranks usually - but most tests are MPI based (even if run with NP=1). I'm guessing other labs have similar issues; I see it from Trilinos, EMPIRE, SPARC and SIERRA. They can spend an enormous amount of time testing (which is good), but I'm not sure any resource manager does this well - probably because we are doing the opposite of what the manager was intended to do (isolate / share resources).

On a CPU system you can get away with a lot, because you can run unbound and the OS scheduler will bounce your processes around any core/thread it chooses. GPUs are not so lenient. Typically, if you try to play the CPU game with GPUs, you end up with all of your processes using GPU number zero - or you have to code some wonky stuff into Kokkos to randomize GPU selection.

I shared with Mark on Flux's Mattermost an example of things I've done on LLNL's EAS machines - I'm able to get decent concurrency, but to do it I need to wrap flux and wrap the process being launched (a rough sketch of the idea is below).

I'd be happy to organize an email chain or a short Teams stand-up to have various users chime in on this. Off the top of my head, @vbrunini - Victor is a SPARC developer.
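The actual example didn't paste cleanly here, but the general shape of the wrapper is roughly this - a minimal sketch, not my exact script, assuming the Flux shell exports `FLUX_TASK_LOCAL_ID` per task and that the GPU count is known:

```sh
#!/bin/sh
# gpu_wrap.sh -- give each local task its own GPU, round-robin.
# Usage: flux run -N1 -n4 ./gpu_wrap.sh ./my_test
# Assumes FLUX_TASK_LOCAL_ID is set per task and NGPUS matches the node.

NGPUS=${NGPUS:-4}
local_id=${FLUX_TASK_LOCAL_ID:-0}
export CUDA_VISIBLE_DEVICES=$(( local_id % NGPUS ))   # ROCR_VISIBLE_DEVICES on AMD
exec "$@"
```

Run a pile of those under `ctest -j` and the concurrency comes from ctest, while each process still lands on a distinct GPU.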
-
Trilinos is exploring/is capable of using the CTest resource allocation feature (combined with Kokkos and CUDA_VISIBLE_DEVICES) to do GPU placement. CTest allows scheduling/usage of external resources (such as GPUs) and will load them evenly if we set it up this way. SIERRA has code built into its test harness that does exactly the same thing as CTest's resource manager with respect to GPU placement. We haven't gotten to AMD GPUs yet, but it sounds like that would be the same approach @jjellio uses with ROCR_VISIBLE_DEVICES. Hopefully that helps at least a little bit? Just sharing the context I'm familiar with.
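For reference, the way the CTest resource feature surfaces its assignments is through `CTEST_RESOURCE_GROUP_*` environment variables, which a wrapper (or the test itself) translates into CUDA_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES. A minimal sketch of such a wrapper, assuming each test requests a single "gpus" resource group:

```sh
#!/bin/sh
# ctest_gpu_wrap.sh -- translate CTest's resource-group environment
# variables into CUDA_VISIBLE_DEVICES. Assumes one resource group of
# type "gpus", e.g. CTEST_RESOURCE_GROUP_0_GPUS="id:2,slots:1".

if [ -n "$CTEST_RESOURCE_GROUP_COUNT" ]; then
    gpu_id=$(printf '%s' "$CTEST_RESOURCE_GROUP_0_GPUS" | sed 's/.*id:\([^,;]*\).*/\1/')
    export CUDA_VISIBLE_DEVICES="$gpu_id"
fi
exec "$@"
```

On the CMake side each test also needs a RESOURCE_GROUPS property (e.g. "gpus:1"), and ctest needs a resource spec file describing the node's GPUs; the CTest resource allocation docs cover both.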
-
I had an interesting conversation with some tool folks at the MPI Forum last week that relates to this. They were looking for a way to reliably run a tool alongside a job, in ways that sometimes mean they need to run on all the same nodes, even if those nodes are logically full. This is for things like debuggers, or tracing tools that one might want to attach after the job they're attaching to has been allocated and launched.

I bring that up because if we add a mode or method where, for a job or for an instance, we ignore oversubscription with existing jobs (probably something only the instance owner can do), it might actually solve both use cases.
-
A user has a CTest-based testsuite for which they run a series of parallel jobs and MPI tests under `ctest -j <big_number>`. On Flux systems the tests use `flux run` to invoke tests, but when `-j` is much larger than the number of cores on a node, or when tests are running many tasks, a number of the `flux run` invocations hang (presumably waiting for resources, but there may be other contention here because the jobs reportedly do not proceed?).

In any event, the expected behavior in these situations is apparently to control the parallelism with `ctest -j` and not block any Flux jobs due to resource allocation. For example, this is how it worked previously when MPI is launched by `mpirun` and/or Slurm, which do not allocate resources for job steps. Additionally, the default in this situation should be to disable cpu and gpu affinity, since the assumption is that each job invoked by ctest will have access to all resources on a node.

So the real request here is for a tool, option, mode, or even just a guide for how to do something like this with Flux.

For now I've suggested to the user that they can run their testsuite under `flux start -s 16`, for example, to get 16x the number of apparent cores, or similarly if using `flux alloc` or `flux batch` use the `-o per-resource.node=16` option (sketched below).

However, I figured others might have better ideas (could we make the `alloc-bypass.so` plugin easier to use with a single option, or devise a scheduler or scheduler option that allocates resources repeatedly to jobs?).

I thought @trws might have some ideas here because this is similar to things we've discussed before.