Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comments/suggestions for registering an nspace #58

Open
rhc54 opened this issue Jan 24, 2022 · 21 comments
Open

Comments/suggestions for registering an nspace #58

rhc54 opened this issue Jan 24, 2022 · 21 comments

Comments

@rhc54
Copy link

rhc54 commented Jan 24, 2022

Looking at the issues list, it appears you may have been basing your work on the Slurm PMIx plugin. Understandable, but actually that is a poor choice as it (a) is missing things that OMPI and CH4 needs/wants, and (b) actually has some errors in it. We've been trying for some time to get them to fix it, without success. Both OMPI and CH4 will run with that plugin - they both just have to make assumptions that may reduce performance or require a greater exchange of information at startup.

FWIW: OMPI will not fallback to a data exchange. If it cannot get info via PMIx, it simply makes a lowest denominator assumption. I don't know CH4's policy, but suspect it does the same based on my last view of it.

The better "model" would be to use what is in PRRTE - see the code here. The Standard itself is also a good source for what procs are expecting to see (relevant section is here).

I know you don't currently support MPI Sessions, Spawn, and fault tolerance, so we can safely ignore those related values. The PMIx library will "backfill" some things - e.g., we internally will use the regex maps you provide to compute the local peers, local size, etc if you don't give them to us. However, there are some things we cannot backfill for you that both programming libraries can use, if you have the ability to provide them.

For example, if you know the topology of each node, then you could pass down the cpuset of each process in the job and its device distances for NICs on its node (there is a PMIx API for computing the distances for you). This allows each process to determine which interface its peer will be using, which in turn allows for things like collective optimization (CH4 uses this - OMPI probably will soon as well).

It is also good to let PMIx help you setup the application via the PMIx_server_setup_application API. This is where the support for Slingshot and friends is done. It needs to be executed by the equivalent of mpirun and the returned data included in the "blob" sent to the daemons that will be hosting application processes. If you pass in the programming library name (e.g., "ompi" or "mpich"), PMIx will add whatever supporting info those libraries require/desire. In addition, we can assign network security keys and even assign endpoints (CH4 particularly uses that feature). There are a couple of APIs you would need to call on your backend daemons when fork/exec'ing the application processes - I can walk you thru them if you want to pursue it.

Again, I can point you to the relevant PRRTE areas if you want to use them as a guide - or I'm happy to provide advice. I suspect the "setup application" and distributing the returned info is the main thing that will impact you as otherwise you'd have to generate the info yourself that each library wants - and that can be a bit of a moving target, which is why we built it into PMIx so everyone got updated at the same time.

@garlick
Copy link
Member

garlick commented Jan 24, 2022

Thanks @rhc54, I actually did refer to openpmix, prrte, ompi v5.0.x, and slurm pmix plugins plus multiple pmix spec versions while writing this, but it was an uphill battle to spelunk and reconcile all that, so your guidance could be pretty useful at this point.

My goal was first to supply the bare minimum needed to bootstrap ompi v5, performance be damned, and then make incremental changes that have tangible benefits now, while tracking the latest openpmix releases. It sounds like you've identified

(Edit: that last issue ref was the wrong one, I guess i don't have one but I meant to refer to supporting spectrum MPI, which does impact us.)

Anything else?

One thing I struggled with was the nspace setup. I ended up with a trial and error process to find things that actually seemed to make a difference, and where I landed was a vastly smaller set than prrte for example. Are there specifics in that area that I should pay attention to? Since my testing was mainly on a single system, I'm quite sure there is more to do.

Other random comments

  • Where is the current CH4 work (with pmix support) occuring? I didn't see much pmix support in mpich main branch on github
  • Where is ompi+slingshot development occurring? I'm wondering if our people working on the CORAL2 procurement might like to be aware of that (if they aren't already).
  • Is there anything in the way of pmix compliance testing that we could add to the CI for this project?

@rhc54
Copy link
Author

rhc54 commented Jan 24, 2022

Where is the current CH4 work (with pmix support) occuring? I didn't see much pmix support in mpich main branch on github

Intel is working on it for Aurora - I'm not sure how much of the code is public yet. I suspect Argonne controls that, so perhaps Ken could better answer any questions about it. There is substantial PMIx integration in it, including full support for "instant on".

Where is ompi+slingshot development occurring? I'm wondering if our people working on the CORAL2 procurement might like to be aware of that (if they aren't already).

The integration is actually at the PMIx level, not OMPI, and is being done by Intel (again, for Aurora). Idea was to only have to implement fabric support once and then be able to reuse it across host environments (we have multiple fabric plugins). The plugin will assign endpoints, provide device distances for all procs and NICS in the job, and assign security keys - basically, everything you need to do an "instant on" startup. However, it requires that you use the PMIx_server_setup_application (and backend friends) to make it work.

Is there anything in the way of pmix compliance testing that we could add to the CI for this project?

IBM has an entire CI suite they execute on PMIx and PRRTE - you might talk to Josh about it. I'm pretty sure all the tests are in the pmix-tests repository. Josh can help explain how it gets used.

@rhc54
Copy link
Author

rhc54 commented Jan 24, 2022

One thing I struggled with was the nspace setup. I ended up with a trial and error process to find things that actually seemed to make a difference, and where I landed was a vastly smaller set than prrte for example. Are there specifics in that area that I should pay attention to? Since my testing was mainly on a single system, I'm quite sure there is more to do.

Hmmm...I see the problem. Yeah, what is there isn't quite correct, but I can see how you might get to that point. Let me dig around - I believe I have documentation about the local proc spawn procedure.

@rhc54
Copy link
Author

rhc54 commented Jan 24, 2022

PMIx_server_setup_application() (maybe that will help with configure does not grok IBM spectrum MPI #46?)

Hmmm...I'm not sure why you would need a configure test about Spectrum relative to this plugin? Is it because Spectrum needs some envars in order to operate, and you are trying to set them in flux? If so, then yes - we could add a "spectrum" plugin to PMIx and let PMIx handle all that so you don't need the complications in flux. It would be easy to do. You'd just need the user to tell you it is a "spectrum" app (e.g., --mpi=spectrum) and then pass that down to PMIx_server_setup_application - PMIx would take care of the rest.

@garlick
Copy link
Member

garlick commented Jan 25, 2022

Hmmm...I'm not sure why you would need a configure test about Spectrum relative to this plugin?

Sorry, my memory is confused on this - I don't think we've tried to launch spectrum applications with this plugin. We do it with the old flux MCA plugin in ompi but by not using pmix we lose support for some interconnect enhancements. Regardless of whether it helps with spectrum, it seems like we should do it and if I recall what I've seen in various examples it's not too hard.

I'll check out pmix-tests, thanks!

@garlick
Copy link
Member

garlick commented Jan 25, 2022

Oh and the purpose of the configure check for MPI mentioned in #46 is just for setting up the built-in testing, where we compile some built in MPI programs and make sure we can run them. It would be nice if we could check out his repo on the IBM system and run "make check" and show that it basically works with the native MPI there, albeit on one node only.

Bringing it up here was a non-sequitur, sorry!

@rhc54
Copy link
Author

rhc54 commented Jan 25, 2022

Hmmm...I see the problem. Yeah, what is there isn't quite correct, but I can see how you might get to that point. Let me dig around - I believe I have documentation about the local proc spawn procedure.

This will take me a couple of days - need to put out a few fires. I'm writing it up as a "guide" on our wiki and will post the link here when done.

@rhc54
Copy link
Author

rhc54 commented Feb 3, 2022

I'm sorry this took so long, but I finally put together a brief outline on the launch sequence:

https://github.com/openpmix/openpmix/wiki/PMIx-Launch-Sequence

Hopefully, that will help clarify things a bit. Please feel free to comment and/or make suggestions for improving it.

@garlick
Copy link
Member

garlick commented Feb 4, 2022

Thanks @rhc54.

I'll circle back and audit this code against your description as soon as I have time - thanks!

Meanwhile one thing that remains a mystery is what subset of the nspace I need for direct launching openmpi v5. The doc references the spec (and v4.0 when the current one is v4.1?) but I had a lot of difficulty trying to determine what the minimal subset is, as well as reconciling differences between spec versions, and between spec and actual code.

One quick question: the doc references the PMIX_PROGRAMMING_MODEL attribute for telling pmix to load a plugin specific to a version of MPI. That would seem to be an opportunity to get a hard failure if something isn't set up right for that MPI. Is that a true statement and does a model plugin exist for openmpi?

@rhc54
Copy link
Author

rhc54 commented Feb 4, 2022

Meanwhile one thing that remains a mystery is what subset of the nspace I need for direct launching openmpi v5. The doc references the spec (and v4.0 when the current one is v4.1?) but I had a lot of difficulty trying to determine what the minimal subset is, as well as reconciling differences between spec versions, and between spec and actual code.

I'll create a wiki page that lists the minimal required, desired, and "nice to have" values. We used to have that but it got put somewhere non-obvious during the transfer of the pmix.org web site from me to the standard's group. Undoubtedly needs updating anyway.

One quick question: the doc references the PMIX_PROGRAMMING_MODEL attribute for telling pmix to load a plugin specific to a version of MPI. That would seem to be an opportunity to get a hard failure if something isn't set up right for that MPI. Is that a true statement and does a model plugin exist for openmpi?

Yeah, it can be a hard failure. The idea here is to relieve the RM of having to know how to handle every implementation of every programming model out there. If we put their needs into PMIx, then all the launchers and RMs that speak PMIx will automatically be able to support that model/implementation - and we only have to update one place when something changes.

OMPI does indeed have a plugin, as does mpich (not in a release branch yet, but coming soon), the openshmem reference implementation (from stony brook), and SOS (which NVIDIA is now transitioning into their own oshmem library). I've poked the IBM team about getting a spectrum plugin as they have some differences from base OMPI.

@garlick
Copy link
Member

garlick commented Feb 4, 2022

Thanks - any pointers to non release branches containing those plugins? (esp openmpi?)

@rhc54
Copy link
Author

rhc54 commented Feb 4, 2022

OMPI is in the release branches - see https://github.com/openpmix/openpmix/blob/master/src/mca/pmdl/ompi/pmdl_ompi.c.

You can take a peek at the MPICH one as well if you like: https://github.com/openpmix/openpmix/blob/master/src/mca/pmdl/mpich/pmdl_mpich.c

@garlick
Copy link
Member

garlick commented Feb 4, 2022

Interesting. So maybe if I try building this plugin against openpmix master and set the model to ompi, I will achieve enlightenment? Or at least a hard error that I can track down in a methodical way?

@rhc54
Copy link
Author

rhc54 commented Feb 4, 2022

I don't think that will work - IIRC, you call "setup_fork" before you call "register_nspace" - correct? It has to be the other way around, and you have to call "setup_fork" for each process before you spawn it as the envars pmdl/ompi make are proc-specific. I'm not sure if OMPI will bark about missing the information, or simply quietly ignore its lack (and lose some capability). Believe it is the former, but I honestly can't be sure any more.

@garlick
Copy link
Member

garlick commented Feb 4, 2022

I don't think that will work - IIRC, you call "setup_fork" before you call "register_nspace" - correct? It has to be the other way around, and you have to call "setup_fork" for each process before you spawn it as the envars pmdl/ompi make are proc-specific.

No, it's the other way around, though it may not be obvious because they are called from flux-specific callbacks: first, the shell.init callback for the pmix shell plugin calls register_nspace, then the task.init callback calls setup_fork for each task before it is forked.

I'm not sure if OMPI will bark about missing the information, or simply quietly ignore its lack (and lose some capability). Believe it is the former, but I honestly can't be sure any more.

The code in ompi that I audited before treated practically every pmix call failure as non-fatal, so bummer if this turns out to be the same. Maybe this is not going to be a great approach and this:

I'll create a wiki page that lists the minimal required, desired, and "nice to have" values. We used to have that but it got put somewhere non-obvious during the transfer of the pmix.org web site from me to the standard's group. Undoubtedly needs updating anyway.

is the way go go.

Thanks for taking the time to explain things.

@rhc54
Copy link
Author

rhc54 commented Feb 4, 2022

I'm not sure what you mean about the approach. OMPI accesses PMIx info in a bunch of places. There is the "main" RTE init area in ompi_rte.c, but that just handles the global things OMPI wants. Various components inside OMPI obtain data via PMIx for stuff like determining if shared memory can be used, and if so, where the rendezvous file will be and who is creating it. Hence my statement that it might not bark, but you could lose performance.

The pmdl/ompi plugin doesn't provide all the data OMPI needs - it just provides some support that is more OMPI-specific (like the MPI-3 data in OMPI-specific names). The RM still has to provide the nspace registration data about the job, app, and procs - OMPI will definitely fail without those. But then again, so would MPICH and OSHMEM.

So you need both things in parallel if you want to fully support these libraries. PMIx provides the former as those are truly implementation specific, and there is no point in having every RM implement them. The latter is implementation-agnostic, and so it makes sense for the RM to do those.

@garlick
Copy link
Member

garlick commented Feb 4, 2022

I'm not sure what you mean about the approach.

I meant the approach for learning what nspace values that flux-pmix is missing are the most important to add for ompi v5 support. Some way of deducing those would be nice, but someone telling me would be fine.

If everything listed as "required" in the nspace section of the 4.1 spec is what I need to add then so be it, but it wasn't clear to me that this would be either necessary or sufficient to make ompi v5 bullet proof on a production HPC cluster. Given limited development time availability, I need to choose the next chunk of effort to make here with the most payoff.

So basically any guidance you can provide towards that end would be much appreciated.

@rhc54
Copy link
Author

rhc54 commented Feb 4, 2022

Gotcha - will do!

@rhc54
Copy link
Author

rhc54 commented Feb 7, 2022

I have updated the launch sequence page to include a list of required, optional, and bare minimum data. The bare minimum listing is here

@garlick
Copy link
Member

garlick commented Feb 7, 2022

Exactly what I was hoping for! Thanks so much.

@rhc54
Copy link
Author

rhc54 commented Feb 7, 2022

I'll continue to fine-tune it a bit over the next few days - I think some of these are now provided for you by the PMIx server library (e.g., I think we provide the PMIX_TOPOLOGY_STRING so you don't need to explicitly do so). Just need to do some checking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants