openmpi 4.1.2-2ubuntu1 fails with missing munge component #74

Open
garlick opened this issue Dec 25, 2022 · 0 comments

Problem: on Ubuntu 22.04.1 LTS, flux-pmix fails make check when it is built against an external openpmix-4.2.2 (default configure options) and openmpi-4.1.2-2ubuntu1 is the installed MPI:

expecting success: 
	run_timeout 30 flux mini run -overbose=2 -N1 -n2 \
		${MPI_HELLO} >hello_1n2p.out &&
	grep "There are 2 tasks" hello_1n2p.out

0.027s: flux-shell[0]: DEBUG: Loading /opt/flux-core-v0.46.1-54/etc/flux/shell/initrc.lua
0.027s: flux-shell[0]: TRACE: Sucessfully loaded flux.shell module
0.027s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/initrc.lua
0.027s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/lua.d/intel_mpi.lua
0.027s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/lua.d/mvapich.lua
0.028s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/lua.d/openmpi.lua
0.028s: flux-shell[0]: TRACE: trying to load /home/garlick/proj/flux-pmix/t/etc/rc.lua
0.029s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.030s: flux-shell[0]: DEBUG: pmix: jobid = 13690208256
0.030s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.030s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.030s: flux-shell[0]: DEBUG: pmix: total_nprocs = 2
0.030s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.2.2rc2
0.052s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.052s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc
0.052s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1
0.052s: flux-shell[0]: DEBUG: 0: task_count=2 slot_count=2 cores_per_slot=1 slots_per_node=2
0.052s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

[system76-pc:159601] PMIX ERROR: PACK-MISMATCH in file ../../../src/client/pmix_client.c at line 832
[system76-pc:159601] OPAL ERROR: Pack data mismatch in file ext3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[system76-pc:159601] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

[system76-pc:159602] PMIX ERROR: PACK-MISMATCH in file ../../../src/client/pmix_client.c at line 832
[system76-pc:159602] OPAL ERROR: Pack data mismatch in file ext3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[system76-pc:159602] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
0.061s: flux-shell[0]: TRACE: pmi: 0: C: pmi EOF
0.061s: flux-shell[0]: DEBUG: task 0 complete status=1
0.061s: flux-shell[0]: TRACE: pmi: 1: C: pmi EOF
0.061s: flux-shell[0]: DEBUG: task 1 complete status=1
0.071s: flux-shell[0]: DEBUG: exit 1

Neither openmpi's built-in libpmix nor the side-installed openpmix 4.2.2 used to build flux-pmix has a psec_munge plugin installed as a separate DSO. However, rebuilding openpmix-4.2.2 with --without-munge resolves the problem.
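
One way to double-check the DSO observation is to search each PMIx plugin directory for a psec munge component. This is only a minimal sketch: the helper name find_psec_munge and the '*psec_munge*.so' file-name pattern are assumptions, not something taken from either package, so adjust them to match the actual installs.

```shell
#!/bin/sh
# Hypothetical helper: print any psec munge plugin DSOs found under a
# PMIx plugin directory. Exits 0 only if at least one match is found.
# The '*psec_munge*.so' pattern is an assumption about plugin naming.
find_psec_munge() {
    plugin_dir=$1
    found=$(find "$plugin_dir" -type f -name '*psec_munge*.so' 2>/dev/null)
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
        return 0
    fi
    return 1
}
```

It could be run once per install, e.g. find_psec_munge "$PMIX_PREFIX/lib/pmix" || echo "no psec munge DSO", where PMIX_PREFIX is a placeholder for the prefix being inspected. An empty result for both libraries would match what is described above.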

Based on the pack error, it appears that the munge requirement is not negotiated between client and server: it changes the wire protocol, so mismatched configurations cannot interoperate. See also https://bugs.schedmd.com/show_bug.cgi?id=12396
