WeeklyTelcon_20181211

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (in Person)

  • Geoff Paulsen
  • Jeff Squyres
  • Brian Barrett
  • Ralph Castain
  • Dan Topa (LANL)
  • Edgar Gabriel
  • Howard Pritchard
  • Thomas Naughton
  • Todd Kordenbrock
  • Xin Zhao

Not there today (kept here for easy cut-n-paste for future notes)

  • Nathan Hjelm
  • Aravind Gopalakrishnan (Intel)
  • Josh Hursey
  • Matias Cabral
  • Akshay Venkatesh (nVidia)
  • David Bernholdt
  • Geoffroy Vallee
  • Joshua Ladd
  • Matthew Dosanjh
  • Arm (UTK)
  • George
  • Peter Gottesman (Cisco)
  • mohan

Agenda/New Business

  • Summary of PMIx re-architecturing for v5.0

  • Lots of TCP wire-up discussion

  • Session work is mostly done. Ready mid-January.

    • Works with MPI_Init.
    • Involved a lot of cleanup for setup and shutdown.
    • Can keep it as a prototype, or put it in without the headers.
    • For MPI_Init/MPI_Finalize only apps, fully backward compatible.
      • Initialize a "default" Session.
    • Asking about adding this to master in mid-January
    • Part of the cleanup is to make shutdown the reverse of setup.
    • Cleanup sounds good. Well-contained set of patches.
      • Calling it "instances" inside of MPI, but we'll be renaming it if/when MPI standardizes sessions.
    • Summary: let's do the cleanup patches and review them.
      • The sessions work itself needs a closer look.
      • We can discuss sessions bindings in the future.
    • Session init is all local, so timing should still be good (see the sketch after this list).
  • GitHub suggestion on email filtering
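
A minimal sketch of what session-style initialization could look like in application code, written against the MPI Sessions API names that were later standardized in MPI 4.0 (MPI_Session_init, MPI_Group_from_session_pset, MPI_Comm_create_from_group). This is an illustration only; the Open MPI prototype calls these "instances" internally and its exact interfaces may differ:

```c
/* Hedged sketch: assumes MPI-4.0-style Sessions interfaces; the Open MPI
 * prototype discussed above may not expose exactly these signatures. */
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Session session;
    MPI_Group   group;
    MPI_Comm    comm;
    int         rank;

    /* Session initialization is purely local -- no global synchronization. */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Derive a communicator from the built-in "world" process set. */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "org.example.sessions-demo",  /* arbitrary tag */
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);

    MPI_Comm_rank(comm, &rank);
    printf("rank %d initialized via a session\n", rank);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}
```

For applications that only ever call MPI_Init/MPI_Finalize, the prototype initializes a "default" session under the covers, which is why it can remain fully backward compatible.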

Minutes

Review v2.1.6 (not going to do this in the immediate future)

  • Schedule: posted a v2.1.6 rc1
  • Driver: Assembly and locking fix, vader and pmix, etc.
    • Nathan will check to ensure all the atomic fixes are in that branch.
      • Issue 5932: not all atomic fixes are correctly in the v2.1.x branch yet
        • Still happening in v2.1.6 rc1
        • Something is missing on the v2.1.x branch
    • May get ready to finish it, but not release it until January (since we're all going away).

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.4 scheduled for May 2019
    • PMIx 2.2 will be available next week

Review v3.1.x Milestones v3.1.0

  • Schedule:
  • v3.1.4 scheduled for April 2019
    • New PMIx available next week

Review v4.0.x Milestones v4.0.1

  • Schedule: Need a quick turnaround for a v4.0.1
  • v4.0.0 - a few major issues:
    1. mpi.h is correct, but the library is not building the removed and deprecated functions because they're missing from the Makefile.
    2. Two issues hit via Spack packaging:
      • Root cause may be that make -j creates too many parallel jobs on some OSes.
      • Max filename length restriction on Fortran header files.
        • PR 6121 on master - should resolve this on v4.0.x
  • Discuss pulling PR 6110 into v4.0.1
    • Bug: some OSHMEM APIs were missed in v4.0.0
    • Jeff pulled up slides showing that we can ADD APIs in minor versions.
      • Executables built against the older library must be able to run with the newer one.
      • We need to verify that the patch doesn't break anything for executables built against the older library.
    • Because this PR is just adding functions, it should be okay.
    • Mellanox volunteered to test: build an executable against the old library and run it with the newer Open MPI.
    • If that test passes, everyone is okay with pulling this in (see the ABI sketch after this list).
  • UCX priority PR - expecting a PR from master
  • Matias Cabral's local-procs fix for the OFI MTL - the PR on master is okay, and will be coming back to v4.0.x (PR 6106)
  • Two rankfile mapper issues reported on the mailing list. Howard will file an issue.
  • Need to create v4.0.x issues for https://www.mail-archive.com/[email protected]/msg32847.html
    • @siegmargross
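
To illustrate the compatibility argument above (a hedged sketch, not something shown in the meeting): an executable built against the v4.0.0 libraries only references symbols that already existed at build time, such as the long-standing OpenSHMEM calls below. A newer library that merely adds exported functions still satisfies every symbol reference in that old binary, which is why additive API changes are acceptable in a minor release.

```c
/* Hypothetical "old" application: uses only OSHMEM APIs that exist in
 * v4.0.0, so its dynamic symbol references are unaffected when the
 * library later gains new functions (e.g., the ones added by PR 6110). */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    printf("PE %d of %d\n", shmem_my_pe(), shmem_n_pes());
    shmem_finalize();
    return 0;
}
```

Removing or changing existing symbols is what would break the "old executable runs with newer library" guarantee; purely adding new ones does not.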

Master

  • IBM MTT nightly Fortran

  • IBM PGI compiler license expired.

  • Libtool issue came up before or during Supercomputing.

    • This goes back to v3.0 or v3.1 (can't remember which the user was actually running).
    • We made a backwards-incompatible change to OPAL (not part of our ABI).
    • When we bumped the Libtool version numbers, we bumped them so you couldn't use an old libopal with a new libmpi, on the basis that apps should only link against libmpi, so it shouldn't matter.
    • We had a user complaining that builds were failing due to link errors. After a bit of digging: his app was linking against the HDF5 library, which was built with Libtool; Libtool does a secondary inspection of dependencies and links against libopal as well.
      • Not really an HDF5 bug, it's a libtool issue.
      • Literally nothing we can do for v3.0.3 (or for v3.0.x at all)
      • Probably want to figure out what to do here.
    • Option 1: Stop installing libtool .la files.
      • That would actually be "gross"; we'd have to talk to the package managers, and they have strong feelings about it.
    • Option 2: Start treating those libraries as part of our ABI guarantee.
    • Option 3: Someone's flavor of Libtool carries a patch so that dependent libraries are not included in the .la files.
      • Jeff and Brian will look at the patch and inquire upstream with Libtool.
      • 2015 was the last time Libtool had a release.
        • Don't know if there's much active libtool development anyway.
      • Need to feel out the libtool community about this.
    • Update: There was a patch, but it caused other side-effects.
    • Conclusion is we'll probably have to version all libraries.
  • Maybe it's time to discuss bringing opal and ompi back into one library, so that we only version that instead of all libs.

    • May be a bit tricky, since for Python these libs now link against the top-level library.
    • How will this be affected by the replacing orte with prte?
      • PRTE is separate, so we don't really have ORTE anymore.
      • Compile all components separately and link them in the final step.
      • We would no longer have a separate runtime to break out, so there's no reason to do the additional work of trying to break the runtime out.
    • Please think about this.
  • Still lots of golden balls on PRs due to Amazon AWS / Jenkins

    • Looks like the problem is in Jenkins (deadlocking on itself); the web interface is still up, none of the instances spin down, etc. Need to go find the Jenkins bug report and see if they've made progress.
    • Jenkins still has not fixed this issue, so we can't use AWS.
    • Update: the Jenkins server is just dead.
  • What do we do about all of these Master PRs?

    • We don't have a release coming off of master soon.
    • New PRs won't get a yellow ball because they don't spawn EC2 tasks (theory).
    • Will still run libfabric and some other tests.

PMIx

  • Releasing a new version at the end of this week or next week.

MTT

  • IBM test configure should have caused that.
  • Cisco has a one-sided info check that failed a hundred times.
    • Cisco install failure looks like a legitimate compile failure (IPv6 on master)

New topics

  • We have a new ibm-ompi Slack channel for Open MPI developers.
    • Not for users, just developers...
    • Email Jeff if you're interested in being added.

Review Master Pull Requests

  • Didn't discuss today.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
