WeeklyTelcon_20200114
Geoffrey Paulsen edited this page Jan 14, 2020 · 2 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Brian Barrett (AWS)
- Austen Lauria (IBM)
- Charles Shereda (LLNL)
- Edgar Gabriel (UH)
- David Bernholdt (ORNL)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Michael Heinz (Intel)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Thomas Naughton (ORNL)
- Noah Evans (Sandia)
- George Bosilca (UTK)
- Artem Polyakov (Mellanox)
- Matthew Dosanjh (Sandia)
- Brandon Yates (Intel)
- Erik Zeiske
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Xin Zhao (Mellanox)
- mohan (AWS)
- Akshay Venkatesh (NVIDIA)
- Josh Hursey (IBM)
- Joshua Ladd (Mellanox)
- Brendan Cunningham (Intel)
- Brian has a to-do: Coverity coverage for PRRTE.
- PR 6821: merged last week.
  - First PR with a submodule. Uses hwloc via submodules.
  - The hwloc component's name changed to hwloc2 (not hwloc20x).
- PR 7284: https://github.com/open-mpi/ompi/pull/7284
  - The common / PML abstraction break issue might point to other issues.
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- 7267 needs a review (the datatype change).
- Probably won't get the --hostfile fix, so these may be the last releases.
- Will probably make a 3.0 RC in the immediate future.
- 7276: possibly a configure test for the PMIx warning/error.
- Jeff needs to review for v3.1.x.
Review v4.0.x Milestones v4.0.3
- PR 7283: Nathan may have found the issue last night.
- Trivial fix.
- We've had an xpmem rcache issue, so this fix would be good to have.
- v4.0.3 in the works.
- Schedule: end of January.
- Try to get rc1 built this Friday.
- If PMIx 3.1.5 is released, we'd like to take that.
- Push off comm-spawn.
- Issue 6960 (close) had something cherry-picked to the release branch, but it's still not fixed.
- Configuring with --enable-ipv6 shouldn't preclude IPv4.
  - Do we need to cherry-pick 6964 back into v4.0.x?
- Schedule: April 2020?
- It's official! Portland, Oregon, Feb 17, 2020.
- Please register on Wiki page, since Jeff has to register you.
- Date looks good. Feb 17th right before MPI Forum
- 2pm Monday, and maybe most of Tuesday.
- Cisco has a Portland facility and is happy to host.
- About a 20-30 minute drive from the MPI Forum; attendees will probably need a car.
Review Master Master Pull Requests
- There may be a new PMIx v3.1.5 in January, which we could pick up for v4.0.3.
- We'll know next week
- The ORTE-removal/PRRTE PR is ready to be committed.
  - Mellanox CI is still failing on OSHMEM.
  - Hand testing is looking fine.
  - The CI run uses an ORTE parameter, and OSHMEM then fails because the directory doesn't exist or has the wrong permissions.
  - Mellanox said they'd look at it, but they were not on the WebEx today.
- Some of these things require some thought.
  - Assume we still want to convert as many ORTE parameters as possible.
  - PRRTE doesn't use MCA parameters the way ORTE did.
  - PRRTE is a persistent model, so we can't just set an MCA parameter for all jobs...
  - Ralph is doing the mpirun command-line parsing conversion.
- prun is the one-shot mpirun equivalent.
  - We could have an mpirun binary instead of a symlink.
  - Ralph is using the schizo framework to do this work instead.
  - There are many that there
- Ralph sent an email about submodules.
  - The issue was that a pull would update the PRRTE repo,
  - but PMIx was still embedded as a tarball.
  - Once Ralph converted PMIx to a submodule, a pull could update both.
- Still a bunch of things to do after this PR goes in.
  - Still 1+ month of effort before Open MPI v5.0 could be ready with this.
  - See: https://github.com/openpmix/prrte/issues/298
- Singleton comm-spawn... how do we make this work?
  - PMIx understands it.
  - Do we need to support singleton comm-spawn starting the PRRTEs?
  - Now that we will support a persistent infrastructure, maybe we just require users to start it first.
- Address comm-spawn issues that have been raised.