
WeeklyTelcon_20160412

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Edgar Gabriel
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph Castain
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3

    • Ralph will look at the ByNode thing today.
    • Allena reported a SLURM issue.
    • Issue 1530: MTT was hanging because a test's signal handler was segfaulting, and the cores took a long time to dump.
    • Next 1.10 release: need to fix these issues first, but it is looking like early May.
  • GitHub now DOES allow per-branch permissions, so will look at

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

    • 1 remaining blocker: the memory symbol patcher - Nathan / IBM / Mellanox.
      • Got the original code to stack with UCX. munmap on Linux has an optional argument. This doesn't work well with any style of hooking. Loader ends up patching a random function address. Didn't understand. Assembly looked okay, but munmap was randomly
      • When UCX is involved, seeing both Open MPI and UCX memhooks, which is great.
      • SPARC still having issues, so will need a solution for 2.0.1.
      • Nathan will work to remove ptmalloc on master, and have build time
      • On 2.0.0, Nathan will add an explicit --enable-ptmalloc configure option, but ptmalloc won't be built by default.
        • If users configure with --enable-ptmalloc, then it would disable the internal memhook frameworks entirely.
        • When this happens, will have to add some early code to tickle ptmalloc.
        • Need to document that if --enable-ptmalloc is used, munmap() calls may give wrong answers.
      • Nathan might do the SPARC assembly himself, since weird dlsym issues are being seen on SPARC that might be related.
      • Now created the first time an rcache is created.
      • Might need to tweak openib BTL, because it created rcache too early.
      • When you create a thread, it has to expand the heap sometimes.
      • When they expand the heap, they protect the entire heap with PROT_NONE.
      • Calls that need to be intercepted: munmap, mremap (if the new length is smaller than the old length), shmdt (only on Linux, not OS X), brk.
        • Need sbrk also (for negative increment).
      • Nathan will look at README for memory hook stuff.
      • OpenFabrics should get its act together and put something in the kernel to alleviate all of this.
    • Question: do we want the new, prettier ompi_info output? It doesn't change the parsable output.
      • Low risk, got contributor agreement (works for SuSE). Can pull 1515, 1516, 1518 into 2.0.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
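
The memory hook discussion above lists the calls that can hand memory back to the OS (munmap, a shrinking mremap, shmdt, brk, and sbrk with a negative increment). As a rough illustration only, the Python sketch below models which address range each call would release, which is the range a registration cache would presumably need to invalidate. The function name `released_range` and all of its parameters are hypothetical and are not taken from Open MPI; in particular, the segment length for shmdt is assumed to be supplied by the caller, since shmdt() itself only takes an address.

```python
from typing import Optional, Tuple

# Half-open [start, end) address range.
Range = Tuple[int, int]


def released_range(call: str, *, addr: int = 0, length: int = 0,
                   new_length: int = 0, increment: int = 0,
                   current_break: int = 0) -> Optional[Range]:
    """Return the range a call gives back to the OS, or None if it releases nothing.

    Hypothetical model for illustration; not Open MPI code.
    """
    if call == "munmap":
        # munmap(addr, length) always releases the whole range.
        return (addr, addr + length)
    if call == "mremap":
        # Only a shrinking remap releases memory: the tail of the old mapping.
        if new_length < length:
            return (addr + new_length, addr + length)
        return None
    if call == "shmdt":
        # Assumption: the caller has looked up the segment length.
        return (addr, addr + length)
    if call == "brk":
        # Here addr is the requested new program break.
        if addr < current_break:
            return (addr, current_break)
        return None
    if call == "sbrk":
        # Only a negative increment moves the break down.
        if increment < 0:
            return (current_break + increment, current_break)
        return None
    raise ValueError(f"unhandled call: {call}")
```

For example, `released_range("sbrk", increment=-0x1000, current_break=0x8000)` describes the top 0x1000 bytes of the heap being returned, while a positive increment yields `None` because nothing is released.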

Review Master?

  • Absoft failure. Need to fix configure stuff with atomics. On Master.
  • Nathan - this failure should go into 2.0.1. Absoft MTT shows that

MTT status:

  • IBM has a client facing cluster
    • Working on getting Jenkins set up on the IBM side to ensure Pull Requests get tested on Power also.
    • Hoping to have online this week?
  • Better upload interface, used by both Ralph and Josh.
  • Plugins to support SLURM, Copy tree, Shell commands. Compiler version detection.
  • Looking at establishing MTT release .tarballs.
  • Timeframe: Sooner rather than later

Status Updates:

  1. Cisco - a bunch of release engineering work for both libfabric and OMPI.
    • assisting on a number of bugs.
  2. NVIDIA - Sylvain - An issue on MTT; looking into it.

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM

Back to 2016 WeeklyTelcon-2016
