
set pmix.locstr and pmix.cpuset #31

Open
garlick opened this issue Sep 29, 2021 · 8 comments · Fixed by #95

@garlick (Member) commented Sep 29, 2021

Problem: ompi asks for pmix.locstr and pmix.cpuset (optionally).

Furthermore, pmix.locstr is listed in the business card exchange use case in spec v5 sec B.1.2.

I think if we provide these attributes, every ompi rank won't need to gather hwloc info from scratch, which an strace suggests each rank is doing now.
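
For orientation, here's a minimal sketch of what building the two per-task attributes from hwloc data might look like using the standard pmix.h macros (an illustration only: flux-pmix actually manipulates info vectors through its own infovec class, the helper name here is made up, and the exact string format openpmix expects for pmix.cpuset should be verified before copying this):

```c
/* Sketch only: build pmix.locstr / pmix.cpuset info entries for one task.
 * Assumes the standard pmix.h macros and hwloc >= 2.0.  flux-pmix's real
 * code would go through its "infovec" class instead, and the exact value
 * format openpmix expects for PMIX_CPUSET should be checked before use.
 */
#include <stdlib.h>
#include <hwloc.h>
#include <pmix_common.h>

static int load_locality_info (pmix_info_t info[2],
                               hwloc_const_bitmap_t cpuset,
                               const char *locstr)
{
    char *cpuset_str = NULL;

    /* Render the task's binding as an hwloc list string, e.g. "0-3,56-59" */
    if (hwloc_bitmap_list_asprintf (&cpuset_str, cpuset) < 0)
        return -1;

    /* (char*) string describing the proc's location */
    PMIX_INFO_LOAD (&info[0], PMIX_LOCALITY_STRING, locstr, PMIX_STRING);

    /* (char*) hwloc bitmap applied to the proc upon launch */
    PMIX_INFO_LOAD (&info[1], PMIX_CPUSET, cpuset_str, PMIX_STRING);

    free (cpuset_str);  /* PMIX_INFO_LOAD copies the string */
    return 0;
}
```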

@wihobbs (Member) commented Feb 2, 2024

Bumping this, as testing on Dane suggests that pmix calling into hwloc to load the topology in each task causes MPI_Init to be painfully slow, especially on systems with a lot of cores (Dane has 112 per node). We can work around this for now by setting HWLOC_XMLFILE. Results with and without that variable set:

 dane171 ~/scale-testing $ srun -N1 -n1 --pty ../flux-core/install-dane/bin/flux start
 
(s=1,d=0) dane171 ~/scale-testing $ unset HWLOC_XMLFILE
(s=1,d=0) dane171 ~/scale-testing $ flux run -n112 -o pmi=pmix ./hello-openmpi
fALWig6B: completed MPI_Init in 48.326s.  There are 112 tasks
fALWig6B: completed first barrier in 0.000s
fALWig6B: completed MPI_Finalize in 0.073s

(s=1,d=0) dane171 ~/scale-testing $ export HWLOC_XMLFILE=$(pwd)/dane171.xml
(s=1,d=0) dane171 ~/scale-testing $ flux run -n112 -o pmi=pmix ./hello-openmpi
f2cYgUdH9: completed MPI_Init in 0.905s.  There are 112 tasks
f2cYgUdH9: completed first barrier in 0.000s
f2cYgUdH9: completed MPI_Finalize in 0.083s

@garlick (Member, Author) commented Feb 2, 2024

Notes to whoever ends up doing this:

The two keys mentioned above are referred to in the orte/openpmix code by these names:

$ git grep pmix.locstr
include/pmix_common.h.in:#define PMIX_LOCALITY_STRING                "pmix.locstr"           // (char*) string describing a proc's location
$ git grep pmix.cpuset
include/pmix_common.h.in:#define PMIX_CPUSET                         "pmix.cpuset"           // (char*) hwloc bitmap applied to proc upon launch

Don't go looking at slurm for a reference since it apparently does this wrong.

See also the references on this from Ralph in #58.

For better or worse, in flux-pmix I didn't use the preprocessor macros in the "standard" pmix.h to manipulate the info vectors; the "infovec" class in flux-pmix handles that instead. That may be confusing when reading orte code, since it does use the macros.

Be prepared for some study of the pmix standard cited above, the orte code, and flux-pmix, plus some head scratching, since there seem to be many ways to organize the "info schema". Do you set the keys once per node, once for the whole system, once per task...?

It may be helpful to look at the code in openmpi to see how it is looking up the keys and do it that way.
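
For what it's worth, the client-side lookup presumably boils down to something like this sketch of a PMIx_Get call (not ompi's actual code):

```c
/* Sketch of a client-side lookup, roughly what ompi does when it
 * (optionally) asks for these keys; not ompi's actual code.
 */
#include <stdio.h>
#include <pmix.h>

static void print_locality (const pmix_proc_t *proc)
{
    pmix_value_t *val = NULL;

    /* If a key is missing, ompi apparently falls back to probing the
     * topology with hwloc itself, which is the slow path measured above. */
    if (PMIx_Get (proc, PMIX_LOCALITY_STRING, NULL, 0, &val) == PMIX_SUCCESS) {
        printf ("locality: %s\n", val->data.string);
        PMIX_VALUE_RELEASE (val);
    }
    if (PMIx_Get (proc, PMIX_CPUSET, NULL, 0, &val) == PMIX_SUCCESS) {
        printf ("cpuset: %s\n", val->data.string);
        PMIX_VALUE_RELEASE (val);
    }
}
```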

@grondo (Contributor) commented Feb 3, 2024

A kludgy workaround that I hesitate to even mention would be to dump the shell's XML to a file and set HWLOC_XMLFILE for whatever process is going to needlessly call hwloc_topology_load(). It would be better to do this the right way, but the right way in this case sounds... painful?
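
For reference, the dump itself would just be an hwloc XML export of the topology the shell already has loaded; a minimal sketch, assuming hwloc >= 2.0 and an arbitrary output path:

```c
/* Sketch: export an already-loaded topology to XML so other processes
 * can be pointed at it via HWLOC_XMLFILE (assumes hwloc >= 2.0).
 */
#include <hwloc.h>

static int dump_topology_xml (hwloc_topology_t topo, const char *path)
{
    /* 0 = default export flags (hwloc 2.x XML format) */
    return hwloc_topology_export_xml (topo, path, 0);
}
```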

@garlick (Member, Author) commented Feb 3, 2024

A kludgy workaround that I hesitate to even mention would be to dump the shell's XML to a file and set HWLOC_XMLFILE for whatever process is going to needlessly call hwloc_topology_load().

Yeah we should do it right, but since openmpi may not be the only application-level code that does this, a shell option like -o hwloc=xmlfile or similar might be a useful thing to have around...

@wihobbs (Member) commented Feb 5, 2024

I agree it'd be useful, but we'd have to be sure that users retain the ability to unset that variable afterwards, because it seems to cause mvapich jobs to fail:

(s=1,d=1) corona188 ~/scale-testing $ flux run -n48 ./hello-mvapich-corona
f2cFjYqSX: completed MPI_Init in 0.476s.  There are 48 tasks
f2cFjYqSX: completed first barrier in 0.000s
f2cFjYqSX: completed MPI_Finalize in 0.004s
(s=1,d=1) corona188 ~/scale-testing $ export HWLOC_XMLFILE=$(hostname).xml
(s=1,d=1) corona188 ~/scale-testing $ flux run -n48 ./hello-mvapich-corona
[corona188:mpi_rank_0][smpi_load_hwloc_topology] Please retry after setting MV2_BCAST_HWLOC_TOPOLOGY=0
Warning: mv2_get_arch_type: Failed to determine number of processors.
Warning: mv2_get_arch_type: Failed to determine number of processors.

@grondo (Contributor) commented Feb 5, 2024

Yes, I think the idea is that the hwloc XML would only be saved to HWLOC_XMLFILE if -o pmi=pmix is set (and thus this plugin is activated), or if a separate -o hwloc-xmlfile or similar shell option is used (so it is opt-in only).

One nice thing about doing it in the flux-pmix plugin is that once this issue is closed, the automated use of HWLOC_XMLFILE could be removed, so users don't have to worry about which plugin version they're running or whether they also need to specify some other shell option to get good performance.

@grondo (Contributor) commented Feb 5, 2024

That said, @garlick makes a good point about other use cases for -o hwloc=xmlfile, so maybe we should add that option to the job shell, along with some way to activate it from other shell plugins.

garlick added a commit to garlick/flux-pmix that referenced this issue on Feb 5, 2024:
Problem: flux-pmix does not yet share hwloc with MPI, which
forces MPI to go looking for it, sometimes at great cost
to performance.

For the time being, tell flux-core to share a hwloc xml
file by setting the -o hwloc.xmlfile shell option.

Later we will want to fix flux-framework#31 properly and drop this.
mergify bot closed this as completed in #95 on Feb 6, 2024
@garlick (Member, Author) commented Feb 6, 2024

Oops, this should remain open for the "correct" solution.

garlick reopened this on Feb 6, 2024