-
Here are some early results from last week for instance bootstrap scaling across 64 nodes with up to 48 brokers per node (sorted by the time to "ready", i.e. …
For lack of a better place to put it, here's the test script I used to generate the above results:

```python
#!/usr/bin/env python3
import argparse
import itertools
import math
import os
import sys
import time
import flux
import flux.job
import flux.job.watcher  # JobProgressBar (used in main())
import flux.uri
import flux.util  # help_formatter (used in parse_args())
from flux.resource import resource_list
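# InstanceBench submits one Flux subinstance with a given node count,
# brokers-per-node, and TBON topology, watches its eventlogs, and records
# a timestamp for each bootstrap phase (submit, alloc, start, shell.init,
# uri memo, ready) plus the shutdown time.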
class InstanceBench:
def __init__(
self,
flux_handle,
nnodes,
brokers_per_node=1,
topo="kary:2",
conf=None,
progress=None,
):
self.flux_handle = flux_handle
self.nnodes = nnodes
self.brokers_per_node = brokers_per_node
self.topo = topo
self.id = None
        self.t0 = None
        self.ts = None  # set by event callbacks, consumed by log()
        self.t_submit = None
        self.t_alloc = None
        self.t_start = None
self.t_uri = None
self.t_shell_init = None
self.t_ready = None
self.t_finish = None
self.child_handle = None
self.then_cb = None
self.then_args = []
self.then_kw_args = {}
self.progress = progress
self.size = nnodes * brokers_per_node
self.name = f"[N:{nnodes:<4d} SZ:{self.size:<4d} {self.topo:<8}]"
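        # -Sbroker.rc2_none=1 disables the rc2 (initial program) phase,
        # so the instance simply boots to "ready" and idles until the
        # shutdown.start RPC below is sent.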
broker_opts = ["-Sbroker.rc2_none=1"]
if topo is not None:
broker_opts.append(f"-Stbon.topo={topo}")
if conf is not None:
broker_opts.append("-c{{tmpdir}}/conf.json")
jobspec = flux.job.JobspecV1.from_command(
command=["flux", "broker", *broker_opts],
exclusive=True,
num_nodes=nnodes,
num_tasks=nnodes * brokers_per_node,
)
jobspec.setattr_shell_option("mpi", "none")
if conf is not None:
jobspec.add_file("conf.json", conf)
self.jobspec = jobspec
def log(self, msg):
ts = self.ts or (time.time() - self.t0)
print(f"{self.name}: {ts:6.3f}s: {msg}", file=sys.stderr, flush=True)
self.ts = None
def then(self, cb, *args, **kw_args):
self.then_cb = cb
self.then_args = args
self.then_kw_args = kw_args
def submit(self):
self.t0 = time.time()
flux.job.submit_async(self.flux_handle, self.jobspec).then(self.submit_cb)
return self
def submit_cb(self, future):
try:
self.id = future.get_id()
except OSError as exc:
print(exc, file=sys.stderr)
return
if self.progress:
job = flux.job.JobInfo(
{
"id": self.id,
"state": flux.constants.FLUX_JOB_STATE_SCHED,
"t_submit": time.time(),
}
)
self.progress.add_job(job)
flux.job.event_watch_async(self.flux_handle, self.id).then(self.bg_wait_cb)
def child_ready_cb(self, future):
future.get()
self.t_ready = time.time()
self.log("ready")
self.size = self.child_handle.attr_get("size")
self.topo = self.child_handle.attr_get("tbon.topo")
# Shutdown and report timing:
self.log("requesting shutdown")
shutdown = self.child_handle.rpc("shutdown.start", {"loglevel": 1})
def bg_wait_cb(self, future):
event = future.get_event()
if self.progress:
self.progress.process_event(self.id, event)
if not event:
# The job has unexpectedly exited since we're at the end
# of the eventlog. Run `flux job attach` since this will dump
# any errors or output, then raise an exception.
os.system(f"flux job attach {self.id} >&2")
raise OSError(f"{self.id}: unexpectedly exited")
self.ts = event.timestamp - self.t0
# self.log(f"{event.name}")
if event.name == "submit":
self.t_submit = event.timestamp
elif event.name == "alloc":
self.t_alloc = event.timestamp
elif event.name == "start":
self.t_start = event.timestamp
flux.job.event_watch_async(
self.flux_handle, self.id, eventlog="guest.exec.eventlog"
).then(self.shell_init_wait_cb)
elif event.name == "memo" and "uri" in event.context:
self.t_uri = event.timestamp
uri = str(flux.uri.JobURI(event.context["uri"]))
self.log(f"opening handle to {self.id}")
self.child_handle = flux.Flux(uri)
            # Set the main handle's reactor as the reactor for this child
            # handle so its events can be processed in the same loop:
self.child_handle.flux_set_reactor(self.flux_handle.get_reactor())
self.log("connected to child job")
# Wait for child instance to be ready:
self.child_handle.rpc("state-machine.wait").then(self.child_ready_cb)
elif event.name == "finish":
self.t_finish = event.timestamp
future.cancel(stop=True)
if self.then_cb is not None:
                self.then_cb(self, *self.then_args, **self.then_kw_args)
if self.progress:
# Notify ProgressBar that this job is done via a None event
self.progress.process_event(self.id, None)
def shell_init_wait_cb(self, future):
event = future.get_event()
if not event:
return
self.ts = event.timestamp - self.t0
self.log(f"exec.{event.name}")
if event.name == "shell.init":
self.t_shell_init = event.timestamp
future.cancel(stop=True)
def timing_header(self, file=sys.stdout):
print(
"%5s %5s %8s %8s %8s %8s %8s %8s %8s"
% (
"NODES",
"SIZE",
"TOPO",
"T_START",
"T_URI",
"T_INIT",
"T_READY",
"(TOTAL)",
"T_SHUTDN",
),
file=file,
)
def report_timing(self, file=sys.stdout):
print(
"%5s %5s %8s %8.3f %8.3f %8.3f %8.3f %8.3f %8.3f"
% (
self.nnodes,
self.size,
self.topo,
self.t_start - self.t_alloc,
self.t_uri - self.t_start,
self.t_shell_init - self.t_alloc,
self.t_ready - self.t_shell_init,
self.t_ready - self.t_alloc,
self.t_finish - self.t_ready,
),
file=file,
)
def generate_values(end):
"""
    Generate a list of powers of 2 (starting at 1), up to and including
    `end`. If `end` is not a power of 2, it is appended so that it is
    always present.
    The list is returned in reverse order (largest values first).
"""
    stop = int(math.log2(end)) + 1
    values = [1 << i for i in range(stop)]
    if end not in values:
        values.append(end)
    # list.reverse() reverses in place and returns None, so reverse
    # first, then return the list:
    values.reverse()
    return values
def parse_args():
parser = argparse.ArgumentParser(
prog="instance-timing", formatter_class=flux.util.help_formatter()
)
parser.add_argument(
"-N",
"--max-nodes",
metavar="N",
type=int,
default=None,
help="Scale up to N nodes by powers of two",
)
parser.add_argument(
"-B",
"--max-brokers-per-node",
type=int,
metavar="N",
default=1,
help="Run powers of 2 brokers-per-node up to N",
)
parser.add_argument(
"--topo",
metavar="TOPO,...",
type=str,
default="kary:2",
help="add one or more tbon.topo values to test",
)
parser.add_argument(
"-L",
"--log-file",
metavar="FILE",
help="log results to FILE in addition to stdout",
)
return parser.parse_args()
def get_max_nnodes(flux_handle):
"""
    Get the maximum number of nodes available in the default queue, or
    in the anonymous queue if no queues are configured.
    """
    resources = resource_list(flux_handle).get()
try:
config = flux_handle.rpc("config.get").get()
defaultq = config["policy"]["jobspec"]["defaults"]["system"]["queue"]
constraint = config["queues"][defaultq]["requires"]
avail = resources["up"].copy_constraint({"properties": constraint})
except KeyError:
avail = resources["up"]
return avail.nnodes
def print_results(instances, ofile=sys.stdout):
instances[0].timing_header(ofile)
for ib in instances:
ib.report_timing(ofile)
def main():
args = parse_args()
args.topo = args.topo.split(",")
h = flux.Flux()
if not args.max_nodes:
args.max_nodes = get_max_nnodes(h)
nnodes = generate_values(args.max_nodes)
bpn = generate_values(args.max_brokers_per_node)
inputs = itertools.product(nnodes, bpn, args.topo)
progress = flux.job.watcher.JobProgressBar(h)
progress.start()
instances = []
for i in inputs:
instances.append(
InstanceBench(
h, i[0], brokers_per_node=i[1], topo=i[2], progress=progress
).submit()
)
h.reactor_run()
print_results(instances)
if args.log_file:
with open(args.log_file, "w") as ofile:
print_results(instances, ofile=ofile)
if __name__ == "__main__":
main()
# vi: ts=4 sw=4 expandtab
```
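A note on usage (my assumption, inferred from `parse_args()` and `get_max_nnodes()` above): the script is run from inside a parent Flux instance, e.g. `./instance-timing.py --max-nodes=64 --max-brokers-per-node=48 --topo=kary:2,kary:0,binomial --log-file=results.dat`, and sweeps powers of two up to the given node and brokers-per-node counts for each listed topology.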
-
On Friday, I was able to get 1040 nodes on dane. This system has 112 cores per node, for a total of 116,480 cores. All these tests were run with the default mvapich2 on our systems.
Here are the raw results with 3 different instance topologies from the MPI …
Here's a quick plot of the results. If I have time, I might try splitting out the data by processes per node as well. Edit: I realized that the …
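Since the `--log-file` output keeps the fixed-width columns printed by `report_timing()` above, splitting the data by topology or by brokers per node is mostly a groupby. A minimal sketch, assuming a hypothetical `results.dat` log file and pandas/matplotlib:

```python
# Sketch only (my assumption): parses a file produced with --log-file,
# whose whitespace-separated columns are NODES SIZE TOPO T_START T_URI
# T_INIT T_READY (TOTAL) T_SHUTDN, as printed by timing_header() above.
# The filename "results.dat" is hypothetical.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("results.dat", sep=r"\s+")
df["BPN"] = df["SIZE"] // df["NODES"]  # brokers (processes) per node

# One curve per (topology, brokers-per-node) pair: total bootstrap time
# (the "(TOTAL)" column, i.e. t_ready - t_alloc) versus node count.
for (topo, bpn), grp in df.groupby(["TOPO", "BPN"]):
    grp = grp.sort_values("NODES")
    plt.plot(grp["NODES"], grp["(TOTAL)"], marker="o", label=f"{topo} x{bpn}")

plt.xscale("log", base=2)
plt.xlabel("nodes")
plt.ylabel("time to ready (s)")
plt.legend()
plt.savefig("instance-timing.png")
```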
-
@nhanford was asking about numbers for running OpenMPI at scale. Unfortunately, it's not quite as large an experiment, since I wasn't able to get 1040 nodes to work.
Note that currently we have to use …
-
We have an opportunity to do some scaling tests on the dane cluster (~1300 available nodes).
This discussion is for recording what we'd like to test and the results. The current plan is to test:
- `tbon.topo` values `kary:2`, `kary:256`, ..., `kary:0`, and `binomial` (see the sketch after this list)
- `srun`-launched MPI
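For context, the topology under test is selected via the `-Stbon.topo=` broker attribute, exactly as the test script in the first comment does. A minimal standalone sketch of launching one such subinstance (the node count and `binomial` choice here are purely illustrative):

```python
# Minimal sketch, mirroring how InstanceBench builds its jobspec above.
# The 4-node size and "binomial" topology are illustrative assumptions.
import flux
import flux.job

jobspec = flux.job.JobspecV1.from_command(
    command=["flux", "broker", "-Sbroker.rc2_none=1", "-Stbon.topo=binomial"],
    exclusive=True,
    num_nodes=4,
    num_tasks=4,
)
jobspec.setattr_shell_option("mpi", "none")
print(flux.job.submit(flux.Flux(), jobspec))
```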