WeeklyTelcon_20160126

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Geoffroy Vallee
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph Castain
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • 1.10.2 went out the door.
  • Already have a bug (Gilles); Ralph fixed it.
  • Another bug, in Fortran: broken F08 bindings. Jeff saw it late last night.
  • Need to verify that library versions are still correct? Jeff took care of it.
  • MPI_Abort investigation (Ralph)? Periodically hit a problem when MPI_Abort is used under MTT. Perl is the suspect; Ralph will look into Ruby or another language.
  • 1.10 C Strided mutex lock issue. (Nathan)?
  • High CPU utilization on async progress thread (Ralph)? Ralph fixed it. One-off for 1.10, not in master. In 1.10.2.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    1. Issue 1252 - Nathan's progression decay function progress? Looking at files today.
      • udcm, openib_error_handler - opal_outputs would be sufficient.
    2. Issue 1215 - Group Comm Errors thing (Ralph) - Deal with race condition in ORTE collectives.
      • Launch goes down the tree. Modex goes across the tree.
      • So it is possible to receive a modex message before receiving the launch message.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
  • Group Comms weren't working for Comms of powers of 2. (Nathan)? Fixed.
  • ROMIO default for OMPI on Lustre (only) PR 896?
  • 894, 890, 900, 901 - Jeff and Howard are good with these. Jeff?
    • Taking all of those merged.
  • Issue 1292 - Asked Ralph if this is right way to fix this. (Ralph)
  • Issue 1177 - large message writev, fixed but not merged to master - Test working everywhere but OS X / BSD (George).
    • OS X / BSD limits large message total size to 32K?
    • Not going to fix for 2.0.0
    • Someone can write code to handle OS X / BSD.
  • Issue 1299 - hang (Nathan)? Need to go ahead and fix this today. Gilles has a patch; Nathan just needs to verify.
  • 2.0.0 does not compile on Solaris due to statfs(). Now that we have moved to OMPIO, we're hitting the problem.
    • Edgar is working on it. Solaris has a different number of args and a different return code.
  • Issue 1301 - check max CQ size before creating CQ. (Josh)
    • If it passes Jenkins, happy. UD OOB (Mellanox runs). Approved, Pending Jenkins.
  • HWThreads - Ralph? Talk to Mike about use case? A commit has been done, and moved to 1.10.
    • Pinged Gilles that it should go to 2.0 also.
  • Travis Status on 2.0?
    • Going well.
  • Nathan is good with 2.0 for 1sided
  • PR918 - Ralph reviewed on master. Gilles PRed it to 2.0.
  • PR919 - hwloc - Ralph will review
  • PR911 - use correct endpoint. Just got word from NVIDIA that this is good.
  • PR917 - Ryan will look at today. LANL hardware that hits this is going away. Doesn't affect Aries. Aries doesn't have get_alignment(). Want this in.
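
On Issue 1177 above ("Someone can write code to handle OS X / BSD"): a minimal sketch of one way such code could look, not the actual Open MPI fix. It splits a large gather write so that no single writev() call is handed more than a caller-chosen byte total. The names writev_capped, WRITEV_BATCH and max_bytes are hypothetical, the real per-call limit on OS X / BSD is not confirmed in these notes (so it is left as a parameter), and a blocking file descriptor is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define WRITEV_BATCH 16 /* hypothetical: max iovec entries per call */

/* Hypothetical helper, not Open MPI code. Writes the whole iovec array
 * while never passing more than max_bytes in total to one writev() call.
 * Assumes a blocking fd. Returns bytes written, or -1 with errno set. */
static ssize_t writev_capped(int fd, const struct iovec *iov, int iovcnt,
                             size_t max_bytes)
{
    ssize_t total = 0;
    int i = 0;      /* first source entry that still has unwritten bytes */
    size_t off = 0; /* bytes of iov[i] already written */

    if (max_bytes == 0) {
        errno = EINVAL;
        return -1;
    }

    while (i < iovcnt) {
        struct iovec batch[WRITEV_BATCH];
        size_t budget = max_bytes;
        int n = 0;

        /* Build one batch that stays within the byte budget. */
        for (int j = i; j < iovcnt && n < WRITEV_BATCH && budget > 0; j++) {
            size_t skip = (j == i) ? off : 0;
            size_t avail = iov[j].iov_len - skip;
            size_t take = avail < budget ? avail : budget;

            if (take == 0) {
                continue; /* zero-length entry, nothing to send */
            }
            batch[n].iov_base = (char *)iov[j].iov_base + skip;
            batch[n].iov_len = take;
            n++;
            budget -= take;
        }
        if (n == 0) {
            break; /* only empty entries were left */
        }
        assert(n <= WRITEV_BATCH);

        ssize_t w = writev(fd, batch, n);
        if (w < 0) {
            if (errno == EINTR) {
                continue; /* interrupted before writing; retry the batch */
            }
            return -1;
        }
        if (w == 0) {
            break; /* no progress; stop rather than spin */
        }
        total += w;

        /* Advance (i, off) past the bytes actually written, which may be
         * fewer than requested. */
        size_t adv = (size_t)w;
        while (i < iovcnt && adv >= iov[i].iov_len - off) {
            adv -= iov[i].iov_len - off;
            i++;
            off = 0;
        }
        off += adv;
    }
    return total;
}
```

A caller would pass the platform's limit as max_bytes on OS X / BSD and a very large value elsewhere, so the common path still issues a single writev().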

Review Master?

  • BTL flags = 305 perf got horrible? Edgar? Worked around by removing this on his cluster. Don't understand why. He always used to set it, but now doesn't.
  • OMPIO not finding PVFS2 - configure work Edgar is

MTT status:

  • Cisco was showing timeouts. Jeff found 2 things on the cluster. Couldn't replicate the specific problem.
    • Not handling OOB on master or 1.10. The Cisco cluster has 4 or 5 IP addresses on each node. eth0 was down on one node. Timeout on eth0 was taking quite a while. Jeff removed those two nodes. Unusual for real world. OOB verbosity exposes it.
    • Long-running problem; need a good solution.

Status Updates:

  • Cisco - Been working on the cluster, and on release issues with Howard. Have a couple of small scalability improvements for usNIC.
  • ORNL - Not much to report. Any progress with Ubuntu package ownership? Geoffroy will look on Saturday.
  • UTK - Not much to report.
  • NVIDIA - Sylvain: not much. A user issue with not finding CUDA: the user got an error message in the log, but the job ran fine.
    • Looking at some error

New items (dev list discussions, RFCs, etc)

  • Discussion about configure summary at end of configure?
    • 67 Frameworks and over 200 components.
    • 6 major frameworks: RAS, PLM, PML, MTL, BTL, OOB.
      • Could someone mock up what they'd like to see?
    • Leery of runtime environment. Moab on top of SLURM, then there are env vars that are not job related.
    • ompi_info to look up how it was built.
    • --with behavior to help also.
    • Decided to add to Feb Face2Face.

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

