WeeklyTelcon_20220614
- Dialup Info: (Do not post to public mailing list or public wiki)
- Akshay Venkatesh (NVIDIA)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UoH)
- Geoffrey Paulsen (IBM)
- Hessam Mirsadeghi (UCX/nVidia)
- Joseph Schuchart
- Josh Fisher (Cornelis Networks)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Todd Kordenbrock (Sandia)
- Tommy Janjusic (nVidia)
- William Zhang (AWS)
- Artem Polyakov (nVidia)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernholdt (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joshua Ladd (nVidia)
- Marisa Roman (Cornelis)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Thomas Naughton (ORNL)
- Xin Zhao (nVidia)
- v4.1.5
  - Schedule: targeting ~6 months out (Nov 1).
    - No driver for the schedule yet.
- Updated PMIx and PRRTE submodule pointers.
  - Issue 10437 - we hope this is resolved by the updated pointers.
    - Austen couldn't reproduce; can anyone confirm that this is resolved? (A quick way to confirm you're building the updated pointers is sketched below.)
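A hedged sketch of how one might check which PMIx/PRRTE commits a checkout is actually building (paths per the ompi repo's 3rd-party submodules):

```sh
# Sync submodules to the pointers recorded in the current checkout.
git submodule update --init --recursive

# Show the exact commits the PMIx/PRRTE pointers reference; compare
# these hashes against the pointer-update PR before re-testing.
git submodule status 3rd-party/openpmix 3rd-party/prrte
```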
- Issue 10468 - Doc to-do list.
- Issue 10459 - a bunch of issues with ompi-master.
  - Compiler issues with Qthreads.
    - Not sure who the owner of Qthreads is.
- Discussions about the new accelerator framework:
  - Mellanox still has some use cases for the sm_cuda BTL.
  - Any idea how mature the accelerator framework is?
    - nVidia commits to testing the framework on `main`.
    - Still some discussion on the Pull Request.
- A couple of critical new issues.
  - Issue 10435 - a regression from v4.1.
    - No update.
- Progress being made on missing Sessions symbols.
  - Howard has a PR open that needs a bit more work.
- Call to PRRTE / PMIx:
  - Longest pole in the tent right now.
  - If you want OMPI v5.0 released in the near-ish future, please scare up some resources.
  - Use the PRRTE `critical` and `Target v2.1` labels for issues.
- Schedule:
  - Blockers are still the same.
  - PRRTE blocker:
    - Right now looking like late summer (us not having a PRRTE release for packagers to package).
    - Call for help - if anyone has resources to help, we can move this release date much sooner.
      - Requires investment from us.
    - Blockers are listed; some are in the PRRTE project.
  - Any alternatives?
    - The problem for Open MPI is not that PRRTE isn't ready to release. The parts we use work great, but other parts still have issues (namely the DVM).
    - Because we install PMIx and PRRTE as if they came from their own tarballs, packagers have no good way to distribute Open MPI.
    - How do we install PMIx and PRRTE into Open MPI's lib directory instead and get all of the `rpaths` correct?
      - This might be the best bet (aside from fixing PRRTE sources, of course); one possible shape is sketched below.
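A hedged sketch of that approach (it assumes Open MPI v5's configure with the bundled PMIx/PRRTE and a GNU-style linker; not a vetted recipe):

```sh
PREFIX=/opt/openmpi-5.0.0   # placeholder install prefix

# Build the bundled PMIx/PRRTE into Open MPI's own tree, and bake an
# rpath into the binaries so they find those libraries without help
# from LD_LIBRARY_PATH or the packager.
./configure --prefix=$PREFIX \
    --with-pmix=internal --with-prrte=internal \
    LDFLAGS="-Wl,-rpath,$PREFIX/lib"
make -j 8 install

# Verify the rpath actually landed in the installed binaries:
readelf -d $PREFIX/bin/mpirun | grep -Ei 'rpath|runpath'
```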
- Several backported PRs.
- coll_han tuning runs discussion [PR 10347]
  - Tommy (nVidia) + UCX on v5.0.x: seems that Adapt and HAN are underperforming relative to Tuned.
    - Graph of data posted to [PR 10347].
      - Percentage-difference latency graphs.
      - Anything ABOVE 0 is where HAN outperformed (was better than) Tuned.
    - He's been seeing some "sorry" messages.
      - Perhaps a combination of SLURM and mpirun?
    - Just tested Alltoall, Allreduce, and Allgather.
    - x86 cluster, 32 nodes x 40 ppn.
    - By node, HAN seems to perform better.
    - By core, Tuned seems to perform better.
    - Some dips might be due to UCX dynamic transport at this scale (rather than RC).
    - Tommy can do some more testing if others have suggestions.
    - Used `mpirun` with either `--map-by-node` or `--map-by-core`, forcing UCX and selecting the collective (see the sketch after this list).
    - Tommy will also run 1 ppn and full ppn.
  - Would be good to run the Open MPI `v4.1` branch to see, especially since George's paper was against v4.1.
  - Brian (AWS) was using EFA, and seeing similar things.
  - Would also be interesting to see how UCC stands up against these numbers.
  - Cornelis (Brendan) ran both `v4.1` and `main` - not highly tuned clusters, but similar components.
    - Trying to isolate the differences between `v4.1` and `main`.
      - Just increasing priority SHOULD work to select the correct collective components (see the sketch after this list).
      - OFI with the PSM2 provider.
      - Substantial difference between `main` and `v4.1`.
      - Have seen substantial differences with different mapping flags.
        - Maybe we should rerun this with explicit mapping controls.
      - Small messages seem better with HAN, and large messages with Tuned?
  - Austen (IBM) also did graphs with v5.0.x.
    - Lower percentages.
    - OB1 out of the box with Tuned/HAN.
      - Orange is `--map-by-core`, blue is `--map-by-node`.
      - Bcast getting close to 90%.
    - Will run with IMB to verify the OSU data.
    - Using UCX, didn't see much difference between HAN and Tuned.
    - HAN is hierarchical, so scaling ppn shouldn't make as noticeable a difference as scaling nodes.
    - Don't really see much difference between `--map-by-core` and `--map-by-node` (expected with HAN), but that is dissimilar to Brian's and Tommy's data.
  - Would be good for George to look at this and comment.
  - Joseph is also planning to do runs.
    - Will talk to George about the posted numbers and post any suggestions.
  - Thomas Naughton:
    - `main` and `v5.0.x` should be the same; use either.
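For reference, a rough sketch of the kind of invocations discussed above (a hedged example: it assumes the OSU microbenchmarks and UCX are available, and the node/ppn counts are illustrative; the MCA priority knobs are the usual way to force HAN vs. Tuned):

```sh
# Baseline: force the UCX PML, map by node, and let the normal
# selection logic pick the collective component (32 nodes x 40 ppn).
mpirun -np 1280 --map-by node --mca pml ucx ./osu_allreduce

# Same run, but explicitly prefer HAN by raising its priority...
mpirun -np 1280 --map-by node --mca pml ucx \
       --mca coll_han_priority 100 ./osu_allreduce

# ...or explicitly prefer Tuned, for an apples-to-apples comparison.
mpirun -np 1280 --map-by node --mca pml ucx \
       --mca coll_tuned_priority 100 ./osu_allreduce
```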
- Please HELP!
  - Performance test the default selection of Tuned vs. HAN.
  - Brian hasn't had (and might not for a while have) time to send out instructions on how to test.
    - Can anyone send out these instructions? (The sketch above may be a starting point.)
  - Call for folks to performance test at 16 nodes, and at whatever "makes sense" for them.
- Accelerator stuff that William is working on should be able to get out of draft.
  - Edgar has been working on the ROCm component of the framework.
  - Post v5.0.0? Originally the thinking was it shouldn't go in since the release was close, but if the release slips to the end of summer, we'll see...
- Edgar finished the ROCm component... appears to be working.
  - William or Brian can comment on how close it is to merging to `main`.
    - William is working on the btl sm_cuda and rcache code. Could maybe merge at the end of this week.
    - Tommy was going to get some nVidia people to review / test.
  - Discussion on `btl sm_cuda` - it used to be a cloned copy of `sm`, but that is the older `sm` component, not `vader` (which was renamed to `sm`).
    - Might be time to drop `btl sm_cuda`?
      - The vader component does not have hooks to the new framework.
      - Cases where `btl sm_cuda` might get used today:
        - The TCP path would use this for on-node communication.
        - A node without UCX.
        - Even one-sided would not end up using `btl sm_cuda`.
  - v5.0.0 would be a good time to remove this.
    - Being based on the old `sm` is a big detractor.
    - Can we ALSO remove `rcache`? Unclear.
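As a quick aid to the removal question, one might check whether `smcuda` is even active on a given system (a sketch; the application binary is a placeholder):

```sh
# Is the smcuda BTL component built into this installation at all?
ompi_info | grep -i smcuda

# Run with smcuda explicitly excluded ("^" negates the list) to see
# whether anything on-node actually depended on it.
mpirun -np 2 --mca btl ^smcuda ./my_app   # ./my_app is a placeholder
```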
- What's the status of the accelerator branch on the v5.0.x branch?
  - The PR is just to `main`.
  - We said we could do a backport, but that would be after it gets merged to `main`.
    - If v5.0.0 is still a month out, is that enough time?
    - v5.0.0 is looming closer.
    - This is a BIG chunk of code...
    - But if v5.0.0 delays longer... this would be good to get in.
    - The answer is largely dependent on PMIx and PRRTE.
    - Also has implications for OMPI-next?
- Can anyone who understands packaging review https://github.com/open-mpi/ompi/pull/10386 ?
- Automate the 3rd-party minimum version checks into a text file, so that both configure and the docs read from a common file (a sketch of the idea follows below).
  - conf.py runs at the beginning of a Sphinx build and could read in files, etc.
  - Still iterating on this.
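The shape of the idea might be something like this (a sketch only; the file name, keys, and version numbers here are hypothetical, not the actual scheme being iterated on):

```sh
# One plain-text file holding the minimum versions...
cat > MINIMUM_VERSIONS.txt <<'EOF'
pmix=4.2.0
prrte=3.0.0
hwloc=1.11.0
EOF

# ...that configure-side shell can read with standard tools...
min_pmix=$(sed -n 's/^pmix=//p' MINIMUM_VERSIONS.txt)
echo "minimum PMIx version: $min_pmix"

# ...and that Sphinx's conf.py could parse as key=value lines in
# Python, substituting the same numbers into the docs.
```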
- https://github.com/open-mpi/ompi/pull/8941
  - Would like to get this in, or close it.
  - Geoff will send an email to George to ask him to review.
- What are companies thinking about travel?
  - Wiki for the face-to-face meeting: https://github.com/open-mpi/ompi/wiki/Meeting-2022
  - Should think about schedule, location, and topics.
  - Some new topics were added this week. Please consider adding more topics.
- The MPI Forum was virtual.
  - The next one, at EuroMPI, will be hybrid.
  - Plan to continue being hybrid, with 1-2 meetings per year.