WIP: support JSM ecosystem on coral system #90

Closed
wants to merge 1 commit into from

@garlick garlick commented Oct 2, 2023

This is a WIP to collect the fixes needed to get flux-pmix working on the LLNL lassen system, as proposed in #85.

@garlick garlick force-pushed the coral_update branch 2 times, most recently from a035804 to dee6f50, October 3, 2023 22:12
codecov bot commented Oct 3, 2023

Codecov Report

Merging #90 (d0b630d) into main (d25a5b4) will not change coverage.
The diff coverage is n/a.

❗ Current head d0b630d differs from pull request most recent head 1b50093. Consider uploading reports for the commit 1b50093 to get more accurate results

@@           Coverage Diff           @@
##             main      #90   +/-   ##
=======================================
  Coverage   78.13%   78.13%           
=======================================
  Files          12       12           
  Lines        1413     1413           
=======================================
  Hits         1104     1104           
  Misses        309      309           

@garlick garlick changed the title from "WIP: get flux-pmix working properly on coral system" to "misc bug fixes" Oct 3, 2023

garlick commented Oct 3, 2023

I dropped the WIP on the title.

This addresses some problems that have nothing to do with coral.

As for coral status: jsrun booting flux has been demonstrated, but we have yet to find a pmix server package that we can properly link with, or, it turns out, a hwloc library. Flux may end up being packaged as a module in /usr/tce on this system. At that point we'll see if any issues remain.

@garlick garlick force-pushed the coral_update branch 2 times, most recently from ccae174 to 9ef5c28, October 4, 2023 02:25
grondo previously approved these changes Oct 4, 2023
@grondo grondo left a comment

LGTM!

@garlick garlick added the merge-when-passing label (Let mergify auto-rebase and merge when CI passes) Oct 4, 2023
@garlick garlick removed the merge-when-passing label (Let mergify auto-rebase and merge when CI passes) Oct 4, 2023

garlick commented Oct 4, 2023

I've got an alternate solution to the shmem debacle working, so I'm going to split this PR into parts and resubmit. Sorry for the flailing around!

@garlick garlick changed the title from "misc bug fixes" to "support openpmix 3.1.2" Oct 4, 2023

garlick commented Oct 4, 2023

OK, the flux-pmix tests in the CI build against pmix 3.1.2 are failing the same way as noted in #85, e.g.:

2023-10-04T18:44:39.8275512Z 0.059s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v3: /usr/lib/pmix/mca_bfrops_v3.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)
2023-10-04T18:44:39.8276259Z 0.060s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v12: /usr/lib/pmix/mca_bfrops_v12.so: undefined symbol: pmix_buffer_t_class (ignored)
2023-10-04T18:44:39.8276825Z 0.060s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v21: /usr/lib/pmix/mca_bfrops_v21.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)
2023-10-04T18:44:39.8277385Z 0.060s: flux-shell[0]: stderr: [fv-az551-696:07704] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v20: /usr/lib/pmix/mca_bfrops_v20.so: undefined symbol: pmix_bfrops_base_print_datatype (ignored)
2023-10-04T18:44:39.8277764Z 0.060s: flux-shell[0]: stderr: --------------------------------------------------------------------------
2023-10-04T18:44:39.8278110Z 0.060s: flux-shell[0]: stderr: We were unable to find any usable plugins for the BFROPS framework. This PMIx
2023-10-04T18:44:39.8278442Z 0.060s: flux-shell[0]: stderr: framework requires at least one plugin in order to operate. This can be caused
2023-10-04T18:44:39.8278658Z 0.060s: flux-shell[0]: stderr: by any of the following:
2023-10-04T18:44:39.8278805Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8279234Z 0.060s: flux-shell[0]: stderr: * we were unable to build any of the plugins due to some combination
2023-10-04T18:44:39.8279519Z 0.060s: flux-shell[0]: stderr:   of configure directives and available system support
2023-10-04T18:44:39.8279786Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8280100Z 0.060s: flux-shell[0]: stderr: * no plugin was selected due to some combination of MCA parameter
2023-10-04T18:44:39.8280528Z 0.060s: flux-shell[0]: stderr:   directives versus built plugins (i.e., you excluded all the plugins
2023-10-04T18:44:39.8280764Z 0.060s: flux-shell[0]: stderr:   that were built and/or could execute)
2023-10-04T18:44:39.8280914Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8281234Z 0.060s: flux-shell[0]: stderr: * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
2023-10-04T18:44:39.8281549Z 0.060s: flux-shell[0]: stderr:   "mca_base_component_path", is set and doesn't point to any location
2023-10-04T18:44:39.8281957Z 0.060s: flux-shell[0]: stderr:   that includes at least one usable plugin for this framework.
2023-10-04T18:44:39.8282109Z 0.060s: flux-shell[0]: stderr: 
2023-10-04T18:44:39.8282380Z 0.060s: flux-shell[0]: stderr: Please check your installation and environment.
2023-10-04T18:44:39.8282645Z 0.060s: flux-shell[0]: stderr: --------------------------------------------------------------------------
2023-10-04T18:44:39.8282869Z 0.059s: flux-shell[0]:  WARN: pmix: PMIx_server_init: SILENT_ERROR
2023-10-04T18:44:39.8283200Z 0.059s: flux-shell[0]: ERROR: plugin 'pmix': shell.init failed
2023-10-04T18:44:39.8283375Z 0.059s: flux-shell[0]: FATAL: shell_init

It's hard to piece together what is going on in pmix/ompi land, but it seems like not linking the MCA DSOs against libpmix.so may have been an oversight in that old version.

This describes the problem as one with static builds (not applicable here): openpmix/openpmix#1188
However, the proposed fix was confirmed to fix a non-static build of 3.1.2: openpmix/openpmix#1186

So maybe the installed 3.1.2 on lassen is just unusable?
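
To see at a glance which symbols the plugins failed to resolve, the dlopen failures in a log like the one above can be grouped by symbol. This is only a minimal sketch, assuming the "unable to open ... undefined symbol" message format shown in the CI output; the function name `undefined_symbols` and the `SAMPLE` lines (abbreviated copies of two log lines) are illustrative and not part of flux-pmix or pmix.

```python
import re
from typing import Dict, Iterable, List

# Matches the dlopen failure messages as they appear in the CI log above.
# The format is taken from that log only; other pmix versions may differ.
PATTERN = re.compile(
    r"unable to open (?P<component>\S+): (?P<path>\S+): "
    r"undefined symbol: (?P<symbol>\S+)"
)


def undefined_symbols(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group plugin paths by the symbol they could not resolve."""
    missing: Dict[str, List[str]] = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            missing.setdefault(match.group("symbol"), []).append(match.group("path"))
    return missing


# Two lines from the log above, with the timestamp prefix trimmed.
SAMPLE = [
    "pmix_mca_base_component_repository_open: unable to open mca_bfrops_v3: "
    "/usr/lib/pmix/mca_bfrops_v3.so: undefined symbol: pmix_bfrops_base_print_ptr (ignored)",
    "pmix_mca_base_component_repository_open: unable to open mca_bfrops_v12: "
    "/usr/lib/pmix/mca_bfrops_v12.so: undefined symbol: pmix_buffer_t_class (ignored)",
]

if __name__ == "__main__":
    for symbol, paths in undefined_symbols(SAMPLE).items():
        print(symbol, "<-", ", ".join(paths))
```

If every missing symbol is one that libpmix.so is expected to export, that would be consistent with the plugins not having been linked against it.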

garlick commented Oct 4, 2023

Well. 3.1.2 may not even be in use on coral - it just happens to be the newest packaged version that includes the server headers. The version that jsm is built with is 3.1.4. Pushing out a new package for 3.1.4 may be a logical thing to do there. I'll try changing the minimum required pmix version and the CI build to 3.1.4 and see how that goes.

Problem: some versions of pmix don't have a .pc file, and
some packaged versions have a broken one.

Add a configure option, --with-pmix[=PREFIX].  If this is specified
without a PREFIX, default system paths are assumed.  If PREFIX
is specified, then PMIX_CFLAGS and PMIX_LIBS are set based on that.

No checks are performed to ensure PREFIX refers to a working
pmix install of the minimum version.  This is intended to be an
override mechanism for exceptional situations.  The default
pkg-config method is the preferred one, when it works.
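
The selection logic described in this commit message can be sketched roughly as follows. This is an illustration only, not the actual configure code: the `PREFIX/include` and `PREFIX/lib` layout, the `-lpmix` library name, and the names `PmixFlags` and `pmix_flags` are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class PmixFlags(NamedTuple):
    cflags: str
    libs: str


def pmix_flags(with_pmix: Optional[str]) -> Optional[PmixFlags]:
    """Mimic the --with-pmix[=PREFIX] decision described above.

    Returns None when the option is absent (or "no"), meaning the caller
    should fall back to the preferred pkg-config lookup.
    """
    if with_pmix is None or with_pmix == "no":
        return None
    if with_pmix in ("", "yes"):
        # Option given without a PREFIX: rely on default system paths.
        return PmixFlags(cflags="", libs="-lpmix")
    # Option given with a PREFIX: derive flags from it.  As in the commit
    # message, nothing here verifies that PREFIX holds a working install.
    prefix = with_pmix.rstrip("/")
    return PmixFlags(
        cflags=f"-I{prefix}/include",  # assumed header location
        libs=f"-L{prefix}/lib -lpmix",  # assumed library location and name
    )


if __name__ == "__main__":
    print(pmix_flags("/opt/pmix"))  # hypothetical install prefix
```

Returning None for the unset case keeps pkg-config as the default path, so the override only takes effect when it is explicitly requested.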
@garlick garlick force-pushed the coral_update branch 3 times, most recently from d0b630d to 1b50093, October 4, 2023 19:57

garlick commented Oct 4, 2023

Lots of failures in the flux-pmix unit tests with 3.1.4. Sigh.
I dropped the CI commit for now. Will revisit this branch later. Adding back the WIP.

@garlick garlick changed the title from "support openpmix 3.1.2" to "WIP: support JSM ecosystem on coral system" Oct 4, 2023
@garlick garlick dismissed grondo’s stale review October 4, 2023 20:37

PR is back to a wip

garlick commented Dec 9, 2024

Rumor has it that this and modern flux are running on the sierra LSF/JSM system, so I'm closing this old work.

@garlick garlick closed this Dec 9, 2024