WeeklyTelcon_20210309

Geoff Paulsen edited this page Mar 10, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Christoph Niethammer (HLRS)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Thomas Naughton III (ORNL)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • David Bernholdt (ORNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard
  • Joshua Ladd (nVidia/Mellanox)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

From last week

  • UCX priority PR https://github.com/open-mpi/ompi/pull/8496 merged to master and v4.0.x and v4.1.x
  • Do we have a wiki page on how to rerun bots? How do we make this more prominent?
  • PR 8511 addresses Issue 8321
    • Merged to master and release branches
  • IBM CI needs to upgrade UCX from 1.8 to 1.9
    • Believe we did. Something is not building: one compiler is failing.
    • Austen is investigating

Abort

  • Some discussion of WHEN we should call abort in various components.
  • Sometimes better to call an MPI error handler
  • Sometimes it's unclear what to do
  • Ran out of time today for discussion

GitHub recommends changing the name of the master branch to main

  • We'd like to do that, but not while we are about to branch like this.
  • Needs to be coordinated.
  • Will discuss more in the future (out of time)

PR 8551 - New coding style enforced via clang-format

  • Would love to commit this, and then run it across the entire code base.
  • Overall, the changes seem reasonable.
  • Would have a GitHub CI check that fails if a PR is not formatted correctly.
  • git has hooks for clang-format.
    • Could output the diff so a user without clang could fix their PR.
  • Run by hand for a commit, or based on changed files.
  • Could also run on a GitHub commit hook.
  • When running this on submodules, need to figure out how to deal with 3rd-party code.
  • Nice to get this in before v5.0
  • Won't do on v4.0.x or v4.1.x
  • Will probably sort the ordering of includes, but this might expose some breaks.
    • Order is based on regex
  • The intent is to have a 2nd PR that applies the fixes, excluding:
    • ROMIO
    • treematch
    • 3rd-party code
  • Nathan can tell us how to do a pre-commit hook in our local git config.
  • Everyone seems in favor. Will discuss more on the ticket, and merge in a few days, or next Tuesday at the latest.
  • No reason to hold v5.0 branch for this, but want to do this for v5.0.

Autoconf 2.70

  • Building with Autoconf 2.70 produced a bunch of warnings or was broken.
  • MacPorts just updated its default to Autoconf 2.71, which is now out.
  • Some work to clean up < 2.6 macros has gone into master.
  • Christoph has updated master so that only a few warnings remain.
    • Ralph is following
    • Works with Autoconf 2.70; one MCA component might have issues.
    • 2.69 - 2012
    • 2.70 - 2020 (new dev, deprecated many things)
    • 2.71 - 2021 (8 months - probably a quick-turn bugfix)

New Topics

  • 32bit? Do we want to continue to support this?

  • Originally decorated PR with #if 32bit, but passed CI, so decided

  • Raspberry Pi 32bit - George and Jeff are running on this.

    • Just hobbyist toy systems.
  • Nathan thinks he might be able to write a compare-and-swap

  • 32bit is not really a production environment, just hobbyist.

  • Production seems to be 64bit.

  • v5.0 - good time to drop 32bit.

    • Jeff will send note to packaging, and see if they will care.
    • Debian might care.
    • Jeff volunteered to
    • OSC/RDMA assumed everything was 64bit, but once we changed
    • Jeff will ask Absoft
  • On 32bit, if we could use C11 atomics with locks, it might be allowed.

    • So perhaps this would be a path.
    • Is C11 available on older 32bit systems?
    • With gcc 6.0+ it should work fine.
  • PMIx v4.1 might be delayed.

    • So backup plan is get PRRTE working with PMIx v4.0
    • Not sure what we'll lose with PMIx v4.0 instead of v4.1
    • Folks should try running OMPI with PMIx v4.0. We will probably release Open-MPI v5.0 with PMIx v4.0.
  • Too many Open Issues (50)

    • Geoff and Howard will go over v4.0.x issues, and try to close or address some of them.
    • Need to label some as wont_fix, let sit for a while, and then close
  • PR 8435 - https://github.com/open-mpi/ompi/pull/8435

    • By mistake, this was targeting v4.1 instead of master.
    • Draft PR.
  • UCX Issue 8321

    • We do need to understand what's going on, as there were comments saying we should not support anything older than 1.9.0, but then there was a comment that it's reproducible in 1.9 also.
    • Is this a UCX problem, or a PML problem?
      • We don't know if it's PML or UCX
  • UCX 1.9.0 + OMPI 4.0.4 - Issue 8442

    • datatype engine issue
    • George has a fix, but it no longer applies cleanly.
    • He will try to push, so someone else can
    • PR 8473 - Sergy pushed a possible fix, but it still failed a CI test, and the PR was then closed.
    • May not be related to Issue 8321
    • We're ready to cut an RC for both 4.1.1 and 4.0.6, these two are blocking.
  • UCX meeting is on Wednesdays

    • Howard may go tomorrow.
    • UCX community didn't like us configuring out, they're looking into
    • It'd be nice to link this to an issue tomorrow.
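
As an illustration of the C11 atomics path raised in the 32bit discussion above, here is a minimal sketch of a 64-bit compare-and-swap written with `<stdatomic.h>`. The type and function names (`sketch_atomic64_t`, `sketch_cas64`, `sketch_cas64_is_lock_free`) are hypothetical and are not Open MPI's internal atomics API. On a 32bit target the compiler may implement the operation with a lock (and, with gcc, may need `-latomic` at link time); `atomic_is_lock_free` reports which case applies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type name for illustration; not an Open MPI type. */
typedef _Atomic int64_t sketch_atomic64_t;

/* 64-bit compare-and-swap using only standard C11 atomics.
 * If *addr equals *expected, store desired and return true.
 * Otherwise copy the current value into *expected and return false. */
static bool sketch_cas64(sketch_atomic64_t *addr, int64_t *expected,
                         int64_t desired)
{
    assert(addr != NULL && expected != NULL);
    return atomic_compare_exchange_strong(addr, expected, desired);
}

/* Returns true when the operation above needs no lock on this target.
 * A false result means the implementation falls back to locking. */
static bool sketch_cas64_is_lock_free(sketch_atomic64_t *addr)
{
    assert(addr != NULL);
    return atomic_is_lock_free(addr);
}
```

A caller would typically retry `sketch_cas64` in a loop, since `expected` is refreshed with the current value on failure. Whether a lock-based fallback is acceptable on 32bit is the open question from the discussion.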

4.0.x

  • Merged all fixes for v4.0.6 except for something to address blocker: https://github.com/open-mpi/ompi/issues/8442
  • Still on George's plate.
    • Two issues (the main issue, but there might also be an incorrect assert in the datatype engine)
    • George: no, the assert is correct; it shouldn't be called in that case.

v4.1

  • Blocking on UCX issues (see New Topics above)

Open-MPI v5.0

  • Before we branch

  • PR 8536 needs another set of eyes from Howard and George.

  • Ralph almost has singleton comm spawn working

  • Geoff went through most open PRs and many of the newer issues to see if anything would block the branching of v5.0. Discussed these briefly: WeeklyTelcon_20210302-ompiv5-branching

    • Looks on target to branch next week, after the AWS GPU Direct PR and the removal of CR get in
  • PRRTE making good progress:

    • Ralph resolved about 11 tickets in PRRTE last week. Maybe 20 more
    • Then prrte will branch v2.0
    • Open-MPI can branch anytime, we'll revisit end of Feb.
  • Raghu, How is GPU Direct RDMA for AWS? Still on track. PR this week.

  • One-sided tests are still busted. Do we keep running these if they're failing?

    • Nathan is actively working on this, so we're hopeful we'll get it.
  • Issue 7486

  • Josh summarized discussion from last week in issue.

  • Anything else Josh needs to implement?

    • No, Josh will get to it before end of month, before v5.0 branches.
  • master configure issue - for v5.0 both of these will need to be fixed.

  1. Lustre configure option: Edgar sees it, but has no idea how to fix it.
    • Not sure if he should open an issue. Ralph thinks Gilles fixed it. Edgar will give it a try.
  2. SharedFP component: Edgar opened an issue this morning.
    • Blocker for v5.0

Video Presentation

  • ECP Community days ( March 30-April 1st )

    • David Bernholdt and/or George Bosilca
    • 90-minute time slots each day.
    • Get proposal in by this Friday.
    • Tuesday March 30th from 1-2:30pm (US Eastern)
      • Invited some people to speak. They will be our main community speakers.
      • Anyone on OMPI community can send slides to Jeff and George
      • Due Friday March 26th
    • PMIx Wed 31st 11 - 12:30 (US Eastern)
  • Discuss for v5.0

    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Non-homogeneous clusters (GPUs on some nodes, and no GPUs on others)

Doc update

  • PR 8329 - convert README, HACKING, and possibly Manpages to reStructuredText.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent is that this is for v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work

Longer Term discussions

ROMIO Long Term (12/8)

  • What do we want to do about ROMIO in general.
    • OMPIO is the default everywhere.
    • Gilles is saying the changes we made are integration changes.
      • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
  • Putting new tests there
  • Very little there so far, but working on adding some more.
  • Should have some new Sessions tests

MTT

  • what is being reported looks pretty good.
    • ppc atomics - Austen has been looking at this
  • Intercomm Merge is getting inconsistent ordering of procs.
    • What is the priority of this?
    • Many of the ibm tests start off by doing some intercomm manipulation.
      • Won't get
  • Mellanox MTT had been failing. Boris set some debug, and they unplugged it.
    • They plan to re-enable it tomorrow.